Re: [networking-discuss] Help needed on big transfers failure with e1000g

Arnaud Brand Thu, 04 Feb 2010 16:33:33 -0800

Hi Min,

The transfer failed after 9h43 / 880GB.
I'm going to restart it and snoop to a file in the hope I see something.

If you have a dtrace script to see what's going on, or a system setting to get logs of what might be going on, I would be more than happy to test it.


Thanks,
Arnaud

On 04/02/10 11:12, Arnaud Brand wrote:

Hi Min,
I had problems with build 130, and saw that the e1000g driver was updated in 131. That's why we updated. I haven't tested with previous builds and was not that extensive with my tests in 130, so I cannot tell whether it's the same problem or not.
The kstat command returns
        Reset Count                     0
        Reset Count                     0
on both nodes.
A colleague of mine connected the two machines back to back (I'm at home, sick, not at work). I restarted the transfers that failed last night; I'll keep you posted about it (already done 50GB).
I know this switch (HP 4208) doesn't support jumbo frames, so I haven't enabled jumbo on the nodes. If this transfer works, could this mean our switch is kind of broken or buggy?
Regarding the replacement with other builds, I can do it on the receiving end, but not that easily on the sending end (the one that "loses" the connection).
Thanks,
Arnaud

On 04/02/2010 07:07, Min Miles Xu wrote:
Hi Arnaud,
On which build did you start to notice the issue? Could you still ping through when you noticed the error? What's the output of "kstat -m e1000g | grep Reset"?
Recv_Length_Errors indicates that the packets received are undersized (< 64 bytes) or oversized (> 1522 bytes when jumbo frames aren't enabled).
I expect to narrow down the issue by simplifying the network configuration. You mentioned the two machines are connected via a switch. Could you try to have the two machines connected back-to-back? Furthermore, could you try to replace one machine with other builds/OSs?
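As a small illustration of the length rule described above, here is a minimal Python sketch. The 64-byte and 1522-byte thresholds come from this message; the function name and the max_len parameter are made up for the example and are not part of the e1000g driver.

```python
# Illustrative only: thresholds are the ones quoted in the message above.
MIN_FRAME_BYTES = 64     # frames shorter than this count as undersized
MAX_FRAME_BYTES = 1522   # upper limit when jumbo frames are not enabled


def is_length_error(frame_len: int, max_len: int = MAX_FRAME_BYTES) -> bool:
    """Return True if a frame of frame_len bytes falls outside the accepted range.

    max_len is a hypothetical knob for setups where a larger MTU is configured.
    """
    return frame_len < MIN_FRAME_BYTES or frame_len > max_len


if __name__ == "__main__":
    for size in (60, 64, 1500, 1522, 1523):
        print(size, is_length_error(size))
```

With the default limit, a 1523-byte frame is flagged and a 1522-byte frame is not.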
Thanks,

Miles Xu

Arnaud Brand wrote:
Hi folks,
My situation is the following: 2 computers (A and B) running OpenSolaris b131, with Intel 82574L NICs, connected through an HP 4208 switch.
Both computers are on the same network.
I have transfers running from computer A to computer B, either through ssh or netcat.
As long as computer B is not too busy, the transfer goes like a charm.
But when B is really busy (doing zfs recv from a local file in this case), the transfer fails in an odd way after some time (tests show somewhere between 10 minutes and 13 hours).
What's odd is that A reports that it could not read from B and closes the connection (no sign of it in netstat), but B still thinks the connection is open. Further, running "kstat -p | grep e1000g | grep -i err" on A shows all zeroes except for the following:
e1000g:1:statistics:Recv_Length_Errors  14
link:0:e1000g1:ierrors  14
e1000g:1:mac:ierrors    14
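
For anyone who wants to repeat this check over a long transfer, here is a rough Python sketch that filters saved "kstat -p" output down to the non-zero e1000g error counters. It assumes each line is a colon-separated statistic name followed by whitespace and an integer value, as in the three lines above; the function name and the SAMPLE variable are only for illustration.

```python
# Sample text copied from the three counters reported above.
SAMPLE = (
    "e1000g:1:statistics:Recv_Length_Errors\t14\n"
    "link:0:e1000g1:ierrors\t14\n"
    "e1000g:1:mac:ierrors\t14\n"
)


def nonzero_error_counters(kstat_text: str) -> dict[str, int]:
    """Return the e1000g error counters whose value is not zero.

    Mirrors the "grep e1000g | grep -i err" filter, then drops zero values.
    """
    counters: dict[str, int] = {}
    for line in kstat_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank lines and anything not "name value"
        name, value = parts
        if "e1000g" not in name or "err" not in name.lower():
            continue
        try:
            count = int(value)
        except ValueError:
            continue  # non-numeric statistic, not a counter
        if count != 0:
            counters[name] = count
    return counters


if __name__ == "__main__":
    for name, count in nonzero_error_counters(SAMPLE).items():
        print(f"{name}\t{count}")
```

Saving the kstat output at intervals and comparing the results would show whether the counters grow around the time the connection drops.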

More details on the test cases are available here:
http://opensolaris.org/jive/thread.jspa?threadID=122977&tstart=0
You can see that Brent Jones mentioned the following CR, but it is marked as a duplicate of something fixed in 131.
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6905510

I did not do any twiddling in e1000g.conf.
Both e1000g interfaces are grouped in an aggregation named trk0.
On the advice of Richard Elling, I disabled LACP and, just to be sure, I unplugged one network cable on each machine.
If any of you has any clue or workaround to try, please share.

Thanks,
Arnaud

_______________________________________________
networking-discuss mailing list
[email protected]