Snooping showed that the receiving end sent RST packets.
My TCP is quite rusty, but if I remember correctly, RST does not mean
"close the connection".
Does anyone have a clue why the sending side chooses to close the
connection?
Furthermore, the receiving end doesn't even get FIN packets from the
sending end.
Again, if I remember correctly, one should send a FIN packet and wait for
FIN/ACK (or a timeout, I guess) before closing the connection.
I'm in the process of upgrading to 132 to see if it fixes this problem
(although the changelog doesn't mention anything related to TCP problems).
Arnaud
On 05/02/10 01:32, Arnaud Brand wrote:
Hi Min,
The transfer failed after 9h43 / 880GB.
I'm going to restart it and snoop to a file in the hope of seeing something.
If you have a dtrace script to see what's going on, or some system
setting to get logs of what might be happening, I would be more than
happy to test it.
Thanks,
Arnaud
On 04/02/10 11:12, Arnaud Brand wrote:
Hi Min,
I had problems with build 130, and saw that the e1000g driver was
updated in 131. That's why we updated.
I haven't tested with previous builds, and my tests on 130 were not
that extensive, so I cannot tell whether it's the same problem or not.
The kstat command returns
Reset Count 0
Reset Count 0
on both nodes.
A colleague of mine connected the two machines back to back (I'm at
home, sick, not at work).
I restarted the transfers that failed last night, I'll keep you
posted about it (already done 50GB).
I know this switch (HP 4208) doesn't support jumbo frames, so I
haven't enabled jumbo on the nodes.
If this transfer works, could it mean our switch is somehow broken
or buggy?
Regarding the replacement with other builds, I can do it on the
receiving end, but not that easily on the sending end (the one that
"loses" the connection).
Thanks,
Arnaud
On 04/02/2010 07:07, Min Miles Xu wrote:
Hi Arnaud,
On which build did you start to notice the issue? Could you still ping
through when you noticed the error? What's the output of "kstat -m
e1000g | grep Reset"?
Recv_Length_Errors indicates that received packets are undersized (<
64 bytes) or oversized (> 1522 bytes when jumbo frames aren't
enabled). I expect to narrow down the issue by simplifying the network
configuration. You mentioned the two machines are connected via a
switch. Could you try connecting the two machines
back-to-back? Furthermore, could you try replacing one machine
with other builds/OSs?
Thanks,
Miles Xu
Arnaud Brand wrote:
Hi folks,
My situation is the following: two computers (A and B) running
OpenSolaris b131 with Intel 82574L NICs, connected through an
HP 4208 switch.
Both computers are on the same network.
I have transfers running from computer A to computer B, either
through ssh or netcat.
As long as computer B is not too busy, the transfer works like a charm.
But when B is really busy (doing zfs recv from a local file in this
case), the transfer fails in an odd way after some time (tests
show somewhere between 10 minutes and 13 hours).
What's odd is that A reports that it could not read from B and
closes the connection (no sign of it in netstat), but B still
thinks the connection is open.
Furthermore, running "kstat -p | grep e1000g | grep -i err" on A shows
all zeros except for the following:
e1000g:1:statistics:Recv_Length_Errors 14
link:0:e1000g1:ierrors 14
e1000g:1:mac:ierrors 14
More details on the test cases are available here:
http://opensolaris.org/jive/thread.jspa?threadID=122977&tstart=0
You can see that Brent Jones mentioned the following CR, but it
is marked as a duplicate of something fixed in 131.
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6905510
I did not do any twiddling in e1000g.conf.
Both e1000g interfaces are grouped in an aggregation named trk0.
Per advice of Richard Elling, I disabled LACP and, just to be sure,
I unplugged one network cable on each machine.
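One hedged workaround sketch, not something confirmed by this thread: since B keeps the connection in its table after A gives up without sending FIN, enabling SO_KEEPALIVE on the receiving socket would let B eventually notice a peer that vanished silently. Probe timing is OS-specific (on Solaris it is tuned with ndd on /dev/tcp, e.g. tcp_keepalive_interval; that tunable name is an assumption about the platform); only the portable socket option is shown here.

```python
import socket

def keepalive_socket():
    # Create a TCP socket with keepalive probes enabled, so a dead
    # peer that never sent FIN/RST is eventually detected and the
    # connection is torn down instead of lingering half-open.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return s

s = keepalive_socket()
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # non-zero when enabled
s.close()
```

Note this only helps B clean up its half-open connection; it would not explain why A aborts with RST in the first place.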
If any of you has any clue or workaround to try, please share.
Thanks,
Arnaud
_______________________________________________
networking-discuss mailing list
[email protected]