I did, in fact, try with the default interrupt settings (no InterruptThrottleRate parameter), and the hang persisted.
Getting access to the cluster shouldn't be a problem at all.

Roy

>-----Original Message-----
>From: Brandeburg, Jesse
>Sent: Thursday, November 12, 2009 12:16 AM
>To: Larsen, Roy K
>Cc: e1000-de...@lists.sf.net
>Subject: Re: [E1000-devel] LRO botch with 82598EB 2.0.44.14-NAPI
>
>Hi Roy, I am sure we can figure out what is going on, thanks for the
>report.
>
>Can you run one test for me? Please try without the
>InterruptThrottleRate driver parameter, but with LRO enabled.
>
>Since you are here at the same campus as we are I hope I can maybe just
>get direct access to your machines.
>
>On Wed, 2009-11-11 at 15:24 -0800, Larsen, Roy K wrote:
>> I believe there is a problem with the software LRO in the ixgbe driver.
>> With LRO enabled, my cluster application hangs: two processes each have
>> data to send to the other, as the send queues reported by netstat(8)
>> show, but neither makes progress even though the receive queues are
>> empty. If I build the driver without LRO
>> (make CFLAGS_EXTRA="-DIXGBE_NO_LRO" install), the issue goes away. These
>> are compute nodes that do no routing or IP forwarding. The hang is
>> easily reproduced. The particulars follow.
>>
>> Roy Larsen
>> Intel Corp.
>> roy.k.lar...@intel.com
>> JF5-3-J4
>>
>> ------------------
>>
>> Red Hat EL5.3 (2.6.18-128.el5 kernel)
>> Dual socket Nehalem 2.9GHz nodes (8 cores) with 12GB of memory, hyper-threading disabled
>> Ethernet controller: Intel Corporation 82598EB 10-Gigabit AT CX4 Network Connection (rev 01)
>> Fujitsu xg700 switch
>> 8 nodes (64 cores)
>> Intel MPI 4.0.0.014
>>
>> [r...@cstnh-1 library]# ethtool -i eth2
>> driver: ixgbe
>> version: 2.0.44.14-NAPI
>> firmware-version: 1.8-0
>> bus-info: 0000:02:00.0
>>
>> ixgbe driver loaded with following options:
>> modprobe ixgbe InterruptThrottleRate=0,0
>>
>> netstat -t on node "nh1-eth2"
>>
>> Proto Recv-Q Send-Q Local Address               Foreign Address             State
>> tcp        0   5224 nh1-eth2:55716              nh2-eth2:44115              ESTABLISHED
>>
>> netstat -t on node "nh2-eth2"
>>
>> Proto Recv-Q Send-Q Local Address               Foreign Address             State
>> tcp        0 331648 nh2-eth2:44115              nh1-eth2:55716              ESTABLISHED
>>
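A note for anyone trying to spot the same state on their own nodes: the condition in the netstat output above (an ESTABLISHED connection with an empty Recv-Q and a non-empty Send-Q) can be picked out with a few lines of Python. This is only a rough sketch. The function name and the assumption that `netstat -t` prints six whitespace-separated columns per connection are mine, not part of the report.

```python
def stalled_connections(netstat_text):
    """Return (local, foreign, send_q) for ESTABLISHED TCP connections
    that have queued send data but nothing waiting in the receive queue."""
    stalled = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a complete tcp row.
        if len(fields) < 6 or fields[0] != "tcp":
            continue
        recv_q, send_q = int(fields[1]), int(fields[2])
        if fields[5] == "ESTABLISHED" and recv_q == 0 and send_q > 0:
            stalled.append((fields[3], fields[4], send_q))
    return stalled


# The two rows reported above, one from each node.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address   Foreign Address  State
tcp        0   5224 nh1-eth2:55716  nh2-eth2:44115   ESTABLISHED
tcp        0 331648 nh2-eth2:44115  nh1-eth2:55716   ESTABLISHED
"""

if __name__ == "__main__":
    for local, foreign, send_q in stalled_connections(SAMPLE):
        print(f"{local} -> {foreign}: {send_q} bytes queued")
```

A single snapshot only shows that data is queued. To call a connection hung, the same row would need to show an unchanged Send-Q across several samples taken a few seconds apart.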
>> The tcpdump(8) trace shows the connection is not making progress
>>
>> [r...@cstnh-1 library]# tcpdump -i eth2 -v host nh2-eth2 and host nh1-eth2 and port 55716 and port 44115
>> tcpdump-tnic: listening on eth2, link-type EN10MB (Ethernet), capture size 96 bytes
>> 17:17:17.092120 IP (tos 0x0, ttl 64, id 95, offset 0, flags [DF], proto TCP (6), length 1500)
>>     nh2-eth2.44115 > nh1-eth2.55716: Flags [.], ack 162998588, win 382, length 1460
>> 17:17:17.092227 IP (tos 0x0, ttl 64, id 56737, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
>> 17:17:17.348305 IP (tos 0x0, ttl 64, id 56738, offset 0, flags [DF], proto TCP (6), length 1500)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], ack 39426, win 382, length 1460
>> 17:17:17.348326 IP (tos 0x0, ttl 64, id 96, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh2-eth2.44115 > nh1-eth2.55716: Flags [.], cksum 0x99b6 (correct), ack 1, win 382, length 0
>> 17:17:17.348331 IP (tos 0x0, ttl 64, id 56739, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
>> 17:18:08.548706 IP (tos 0x0, ttl 64, id 97, offset 0, flags [DF], proto TCP (6), length 1500)
>>     nh2-eth2.44115 > nh1-eth2.55716: Flags [.], ack 1, win 382, length 1460
>> 17:18:08.548711 IP (tos 0x0, ttl 64, id 56740, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
>> 17:18:09.060306 IP (tos 0x0, ttl 64, id 56741, offset 0, flags [DF], proto TCP (6), length 1500)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], ack 39426, win 382, length 1460
>> 17:18:09.060327 IP (tos 0x0, ttl 64, id 98, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh2-eth2.44115 > nh1-eth2.55716: Flags [.], cksum 0x99b6 (correct), ack 1, win 382, length 0
>> 17:18:09.060332 IP (tos 0x0, ttl 64, id 56742, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
>> 17:19:51.461901 IP (tos 0x0, ttl 64, id 99, offset 0, flags [DF], proto TCP (6), length 1500)
>>     nh2-eth2.44115 > nh1-eth2.55716: Flags [.], ack 1, win 382, length 1460
>> 17:19:51.461909 IP (tos 0x0, ttl 64, id 56743, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
>> 17:19:52.484306 IP (tos 0x0, ttl 64, id 56744, offset 0, flags [DF], proto TCP (6), length 1500)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], ack 39426, win 382, length 1460
>> 17:19:52.484328 IP (tos 0x0, ttl 64, id 100, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh2-eth2.44115 > nh1-eth2.55716: Flags [.], cksum 0x99b6 (correct), ack 1, win 382, length 0
>> 17:19:52.484333 IP (tos 0x0, ttl 64, id 56745, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
>> 17:21:51.463283 IP (tos 0x0, ttl 64, id 101, offset 0, flags [DF], proto TCP (6), length 1500)
>>     nh2-eth2.44115 > nh1-eth2.55716: Flags [.], ack 1, win 382, length 1460
>> 17:21:51.463288 IP (tos 0x0, ttl 64, id 56746, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
>> 17:21:52.484305 IP (tos 0x0, ttl 64, id 56747, offset 0, flags [DF], proto TCP (6), length 1500)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], ack 39426, win 382, length 1460
>> 17:21:52.484327 IP (tos 0x0, ttl 64, id 102, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh2-eth2.44115 > nh1-eth2.55716: Flags [.], cksum 0x99b6 (correct), ack 1, win 382, length 0
>> 17:21:52.484332 IP (tos 0x0, ttl 64, id 56748, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
>--
>Jesse Brandeburg
>This email sent via Evolution, powered by Linux
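
A note on reading the quoted trace: in each direction the same 1460-byte segment appears to be sent again and again, and the peer answers every time with an unchanged ack number (39426 one way, 1 the other). The small sketch below measures the spacing between those full-size segments, using only timestamps copied from the trace. The dictionary and function names are made up for illustration, and treating the segments as retransmissions is my reading of the trace, since no sequence numbers are printed.

```python
from datetime import datetime

# Timestamps of the 1460-byte segments, copied from the quoted tcpdump trace.
DATA_SEGMENTS = {
    "nh2-eth2 -> nh1-eth2": [
        "17:17:17.092120", "17:18:08.548706",
        "17:19:51.461901", "17:21:51.463283",
    ],
    "nh1-eth2 -> nh2-eth2": [
        "17:17:17.348305", "17:18:09.060306",
        "17:19:52.484306", "17:21:52.484305",
    ],
}


def gaps_seconds(stamps):
    """Return the time in seconds between consecutive HH:MM:SS.ffffff stamps."""
    times = [datetime.strptime(s, "%H:%M:%S.%f") for s in stamps]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    for direction, stamps in DATA_SEGMENTS.items():
        print(direction, [round(g, 3) for g in gaps_seconds(stamps)])
```

The gaps come out at roughly 51 s, 103 s and 120 s in both directions. That pattern looks like an exponentially backed-off retransmission timer that doubles and then hits the 120-second ceiling Linux applies, which would fit both senders repeatedly retransmitting data that the receiving side never accepts.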

