I did, in fact, try with default interrupt settings and the hang persisted. Getting access to the cluster shouldn't be a problem at all.

Roy

>-----Original Message-----
>From: Brandeburg, Jesse
>Sent: Thursday, November 12, 2009 12:16 AM
>To: Larsen, Roy K
>Cc: e1000-de...@lists.sf.net
>Subject: Re: [E1000-devel] LRO botch with 82598EB 2.0.44.14-NAPI
>
>Hi Roy, I am sure we can figure out what is going on, thanks for the
>report.
>
>Can you run one test for me? Please try without the
>InterruptThrottleRate driver parameter, but with LRO enabled.
>
>Since you are here at the same campus as we are I hope I can maybe just
>get direct access to your machines.
>
>On Wed, 2009-11-11 at 15:24 -0800, Larsen, Roy K wrote:
>> I believe there is a problem with the software LRO in the ixgbe driver.
>> With LRO enabled, my cluster application hangs where two processes have
>> data to send to each other as indicated by looking at the send queue
>> with netstat(8) but it is not making progress even though the receive
>> queues are empty. If I build the driver without LRO
>> (make CFLAGS_EXTRA="-DIXGBE_NO_LRO" install), this issue goes away.
>> These are compute nodes that do not do routing or IP forwarding. The
>> hang is easily reproduced. The particulars follow.
>>
>> Roy Larsen
>> Intel Corp.
>> roy.k.lar...@intel.com
>> JF5-3-J4
>>
>> ------------------
>>
>> Red Hat EL5.3 (2.6.18-128.el5 kernel)
>> Dual socket Nehalem 2.9GHz nodes (8 cores) with 12GB of memory, hyper-threading disabled
>> Ethernet controller: Intel Corporation 82598EB 10-Gigabit AT CX4 Network Connection (rev 01)
>> Fujitsu xg700 switch
>> 8 nodes (64 cores)
>> Intel MPI 4.0.0.014
>>
>> [r...@cstnh-1 library]# ethtool -i eth2
>> driver: ixgbe
>> version: 2.0.44.14-NAPI
>> firmware-version: 1.8-0
>> bus-info: 0000:02:00.0
>>
>> ixgbe driver loaded with following options:
>> modprobe ixgbe InterruptThrottleRate=0,0
>>
>> netstat -t on node "nh1-eth2"
>>
>> Proto Recv-Q Send-Q Local Address           Foreign Address         State
>> tcp        0   5224 nh1-eth2:55716          nh2-eth2:44115          ESTABLISHED
>>
>> netstat -t on node "nh2-eth2"
>>
>> Proto Recv-Q Send-Q Local Address           Foreign Address         State
>> tcp        0 331648 nh2-eth2:44115          nh1-eth2:55716          ESTABLISHED
>>
>> The tcpdump(8) trace shows the connection is not making progress
>>
>> [r...@cstnh-1 library]# tcpdump -i eth2 -v host nh2-eth2 and host nh1-eth2 and port 55716 and port 44115
>> tcpdump-tnic: listening on eth2, link-type EN10MB (Ethernet), capture size 96 bytes
>> 17:17:17.092120 IP (tos 0x0, ttl 64, id 95, offset 0, flags [DF], proto TCP (6), length 1500)
>>     nh2-eth2.44115 > nh1-eth2.55716: Flags [.], ack 162998588, win 382, length 1460
>> 17:17:17.092227 IP (tos 0x0, ttl 64, id 56737, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
>> 17:17:17.348305 IP (tos 0x0, ttl 64, id 56738, offset 0, flags [DF], proto TCP (6), length 1500)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], ack 39426, win 382, length 1460
>> 17:17:17.348326 IP (tos 0x0, ttl 64, id 96, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh2-eth2.44115 > nh1-eth2.55716: Flags [.], cksum 0x99b6 (correct), ack 1, win 382, length 0
>> 17:17:17.348331 IP (tos 0x0, ttl 64, id 56739, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
>> 17:18:08.548706 IP (tos 0x0, ttl 64, id 97, offset 0, flags [DF], proto TCP (6), length 1500)
>>     nh2-eth2.44115 > nh1-eth2.55716: Flags [.], ack 1, win 382, length 1460
>> 17:18:08.548711 IP (tos 0x0, ttl 64, id 56740, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
>> 17:18:09.060306 IP (tos 0x0, ttl 64, id 56741, offset 0, flags [DF], proto TCP (6), length 1500)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], ack 39426, win 382, length 1460
>> 17:18:09.060327 IP (tos 0x0, ttl 64, id 98, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh2-eth2.44115 > nh1-eth2.55716: Flags [.], cksum 0x99b6 (correct), ack 1, win 382, length 0
>> 17:18:09.060332 IP (tos 0x0, ttl 64, id 56742, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
>> 17:19:51.461901 IP (tos 0x0, ttl 64, id 99, offset 0, flags [DF], proto TCP (6), length 1500)
>>     nh2-eth2.44115 > nh1-eth2.55716: Flags [.], ack 1, win 382, length 1460
>> 17:19:51.461909 IP (tos 0x0, ttl 64, id 56743, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
>> 17:19:52.484306 IP (tos 0x0, ttl 64, id 56744, offset 0, flags [DF], proto TCP (6), length 1500)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], ack 39426, win 382, length 1460
>> 17:19:52.484328 IP (tos 0x0, ttl 64, id 100, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh2-eth2.44115 > nh1-eth2.55716: Flags [.], cksum 0x99b6 (correct), ack 1, win 382, length 0
>> 17:19:52.484333 IP (tos 0x0, ttl 64, id 56745, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
>> 17:21:51.463283 IP (tos 0x0, ttl 64, id 101, offset 0, flags [DF], proto TCP (6), length 1500)
>>     nh2-eth2.44115 > nh1-eth2.55716: Flags [.], ack 1, win 382, length 1460
>> 17:21:51.463288 IP (tos 0x0, ttl 64, id 56746, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
>> 17:21:52.484305 IP (tos 0x0, ttl 64, id 56747, offset 0, flags [DF], proto TCP (6), length 1500)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], ack 39426, win 382, length 1460
>> 17:21:52.484327 IP (tos 0x0, ttl 64, id 102, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh2-eth2.44115 > nh1-eth2.55716: Flags [.], cksum 0x99b6 (correct), ack 1, win 382, length 0
>> 17:21:52.484332 IP (tos 0x0, ttl 64, id 56748, offset 0, flags [DF], proto TCP (6), length 40)
>>     nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
>--
>Jesse Brandeburg
>This email sent via Evolution, powered by Linux

_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
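
The symptom in the original report (data sitting in Send-Q on an ESTABLISHED connection while Recv-Q is empty) can be checked mechanically. What follows is a minimal sketch, not something from the thread: the function names `parse_netstat` and `stuck_senders` are made up for illustration, and the "Send-Q unchanged between two samples" rule is an assumed heuristic, since a non-zero Send-Q on its own is normal during a busy transfer. It expects the text of two `netstat -t` runs taken some seconds apart on the same node.

```python
def parse_netstat(text):
    """Map (local, foreign) -> (recv_q, send_q, state) for each tcp row."""
    conns = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip banners, the column header, and anything that is not a tcp row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        recv_q, send_q = int(fields[1]), int(fields[2])
        conns[(fields[3], fields[4])] = (recv_q, send_q, fields[5])
    return conns


def stuck_senders(first, second):
    """Return [(connection, send_q)] for connections that look stalled.

    Assumed heuristic: ESTABLISHED, Recv-Q empty, Send-Q non-zero and
    identical in both samples.
    """
    stuck = []
    for key, (recv_q, send_q, state) in second.items():
        prev = first.get(key)
        if prev is None:
            continue
        if state == "ESTABLISHED" and recv_q == 0 and send_q > 0 and prev[1] == send_q:
            stuck.append((key, send_q))
    return stuck


# Row taken from the nh1-eth2 output in the report.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0   5224 nh1-eth2:55716          nh2-eth2:44115          ESTABLISHED
"""

if __name__ == "__main__":
    snapshot = parse_netstat(SAMPLE)
    # Using the same snapshot twice stands in for two identical samples.
    for (local, foreign), send_q in stuck_senders(snapshot, snapshot):
        print(f"{local} -> {foreign}: {send_q} bytes queued, no progress")
```

Running the same check on both nodes should flag both directions of the connection if the hang is the one described above.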