Hi Min,

The transfer failed after 9h43 / 880GB.
I'm going to restart it and snoop to a file in the hope I see something.

If you have a dtrace script to see what's going on, or a system setting that produces logs of what might be happening, I would be more than happy to test it.

Thanks,
Arnaud

On 04/02/10 11:12, Arnaud Brand wrote:
Hi Min,

I had problems with build 130 and saw that the e1000g driver was updated in 131. That's why we updated. I haven't tested with previous builds and was not that extensive with my tests in 130, so I cannot tell whether it's the same problem or not.

The kstat command returns
        Reset Count                     0
        Reset Count                     0
on both nodes.

A colleague of mine connected the two machines back to back (I'm at home, sick, not at work). I restarted the transfers that failed last night and will keep you posted (50GB done so far).

I know this switch (HP 4208) doesn't support jumbo frames, so I haven't enabled jumbo frames on the nodes. If this transfer works, could this mean our switch is broken or buggy?

Regarding the replacement with other builds, I can do it on the receiving end, but not that easily on the sending end (the one that "loses" the connection).

Thanks,
Arnaud

On 04/02/2010 07:07, Min Miles Xu wrote:
Hi Arnaud,

On which build did you start to notice the issue? Could you still ping through when you noticed the error? What's the output of "kstat -m e1000g | grep Reset"? Recv_Length_Errors indicates that the packets received are undersized (< 64 bytes) or oversized (> 1522 bytes when jumbo frames aren't enabled). I expect to narrow down the issue by simplifying the network configuration. You mentioned the two machines are connected via a switch. Could you try connecting the two machines back-to-back? Furthermore, could you try replacing one machine with other builds/OSs?
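The length rule described above can be written down as a small check. This is only a sketch of the thresholds quoted in this message (64 and 1522 bytes), not the driver's actual logic; the function name and the `max_frame` parameter are hypothetical, and the larger limit used in the last line is an illustrative value, not one taken from e1000g.

```python
# Thresholds quoted in the message above (non-jumbo case).
ETH_MIN_FRAME = 64     # frames shorter than this count as undersized
ETH_MAX_FRAME = 1522   # frames longer than this count as oversized


def is_length_error(frame_len, max_frame=ETH_MAX_FRAME):
    """Return True if a frame of frame_len bytes would count as a length error.

    max_frame is a hypothetical knob standing in for "jumbo frames enabled":
    the caller must supply whatever larger limit applies in that case.
    """
    return frame_len < ETH_MIN_FRAME or frame_len > max_frame


print(is_length_error(63))     # undersized
print(is_length_error(1522))   # largest accepted size without jumbo frames
print(is_length_error(1523))   # oversized without jumbo frames
print(is_length_error(1523, max_frame=9000))  # 9000 is an assumed larger limit
```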

Thanks,

Miles Xu

Arnaud Brand wrote:
Hi folks,

My situation is the following: two computers (A and B) running OpenSolaris b131 with Intel 82574L NICs, connected through an HP 4208 switch.
Both computers are on the same network.
I have transfers running from computer A to computer B, either through ssh or netcat.

As long as computer B is not too busy, the transfer goes like a charm.
But when B is really busy (doing zfs recv from a local file in this case), the transfer fails in an odd way after some time (tests show somewhere between 10 minutes and 13 hours).

What's odd is that A reports that it could not read from B and closes the connection (no sign of it in netstat), but B still thinks the connection is open. Further, running "kstat -p | grep e1000g | grep -i err" on A shows all zeroes except for the following:
e1000g:1:statistics:Recv_Length_Errors  14
link:0:e1000g1:ierrors  14
e1000g:1:mac:ierrors    14
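For anyone wanting to repeat this check without reading through all the zero counters, here is a rough sketch that filters saved "kstat -p" output down to the non-zero error counters. It assumes each line is a counter name followed by whitespace and a value, as in the three lines above; the function name is hypothetical, and the oerrors line in the sample is made up for illustration.

```python
def nonzero_error_counters(kstat_output, module="e1000g"):
    """Return {counter_name: value} for error counters that are not zero.

    kstat_output is text captured from "kstat -p"; each line is assumed to be
    "<name><whitespace><value>".
    """
    counters = {}
    for line in kstat_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank lines and anything with an unexpected shape
        name, value = parts
        # Same filtering as "grep e1000g | grep -i err".
        if module not in name or "err" not in name.lower():
            continue
        try:
            count = int(value)
        except ValueError:
            continue  # non-numeric statistic, not a counter
        if count != 0:
            counters[name] = count
    return counters


# First three lines are from the message above; the last one is illustrative.
sample = """\
e1000g:1:statistics:Recv_Length_Errors  14
link:0:e1000g1:ierrors  14
e1000g:1:mac:ierrors    14
e1000g:1:mac:oerrors    0
"""

result = nonzero_error_counters(sample)
for name, count in result.items():
    print(name, count)
```

Running it periodically during a transfer and comparing the results would show whether the counters grow when the connection drops.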

More details on the test cases are available here:
http://opensolaris.org/jive/thread.jspa?threadID=122977&tstart=0

You can see that Brent Jones mentioned the following CR, but it is marked as a duplicate of something fixed in 131.
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6905510

I did not do any twiddling in e1000g.conf.
Both e1000g interfaces are grouped in an aggregation named trk0.
On the advice of Richard Elling, I disabled LACP and, just to be sure, I unplugged one network cable on each machine.

If any of you has any clue or workaround to try, please share.

Thanks,
Arnaud

_______________________________________________
networking-discuss mailing list
[email protected]

