On 12/20/2013 02:43 PM, Matthew Kent wrote:
> On Friday, December 20, 2013 at 2:02 PM, Alexander Duyck wrote:
>> On 12/20/2013 12:38 PM, Matthew Kent wrote:
>>> Hello all,
>>>
>>> We've been doing some testing here with netem and packet corruption
>>> (http://www.linuxfoundation.org/collaborate/workgroups/networking/netem#Packet_corruption)
>>> and I believe we may have stumbled on an igb or bonding bug that
>>> leads to a nasty system lockup on RHEL/CentOS 6. We have a
>>> reproducible test case, but I'm looking for pointers on what I can
>>> collect to help deconstruct and hopefully solve the issue.
>>>
>>> Here's the setup:
>>>
>>> Server A
>>> --------
>>> * Dell r710 server
>>> * bnx2 2.1.11 (5.0.12 bc 5.0.11)
>>> * Ubuntu precise
>>> * kernel 3.2.0-49-generic
>>> * 2 bonded interfaces in 802.3ad
>>> * one connection to each top-of-rack Force10 switch
>>> * untagged interface in vlan id 500
>>>
>>> Server B
>>> --------
>>> * Dell c5220 server sled
>>> * igb 5.0.5-k (firmware 3.29)
>>> * CentOS 6.5
>>> * kernel 2.6.32-431.1.2.0.1.el6.x86_64
>>> * 2 bonded interfaces in 802.3ad
>>> * one connection to each top-of-rack Force10 switch
>>> * tagged interface in vlan id 500 (bond/vlan config sketched below)
>>>
>>> * Both servers plugged into the same Force10 switches
>>> * Both servers are completely isolated on their own vlan unique to
>>> these switches
>>> * tx flow control disabled on all switch ports
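>>>
>>> For reference, the bond and tagged vlan on Server B look roughly like
>>> this (illustrative sketch only, trimmed to the relevant bits; interface
>>> names, the miimon value and addresses are placeholders):
>>>
>>> # /etc/sysconfig/network-scripts/ifcfg-bond0
>>> DEVICE=bond0
>>> BONDING_OPTS="mode=802.3ad miimon=100"
>>> ONBOOT=yes
>>> BOOTPROTO=none
>>>
>>> # /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise eth1)
>>> DEVICE=eth0
>>> MASTER=bond0
>>> SLAVE=yes
>>> ONBOOT=yes
>>> BOOTPROTO=none
>>>
>>> # /etc/sysconfig/network-scripts/ifcfg-bond0.500
>>> DEVICE=bond0.500
>>> VLAN=yes
>>> ONBOOT=yes
>>> BOOTPROTO=none
>>> IPADDR=<server B address>
>>> NETMASK=255.255.255.0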
>>>
>>> And the tests:
>>>
>>> Test #1
>>> -------
>>>
>>> Server A runs:
>>>
>>> tc qdisc add dev eth0 root netem corrupt 5%
>>>
>>> introducing some corruption on one of the bonded interfaces. Server
>>> B does nothing.
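>>>
>>> For reference, the qdisc can be confirmed and later cleared with the
>>> usual tc commands (same interface name as above):
>>>
>>> tc qdisc show dev eth0
>>> tc qdisc del dev eth0 root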
>>>
>>> Result:
>>> After 2-5 minutes, even at idle, Server B locks up *hard*. By that I
>>> mean it's completely unresponsive, but the kernel doesn't actually
>>> panic. Eventually, after a couple of minutes, it starts dumping stack
>>> traces from the other stuck threads. Worse still, this somehow
>>> renders the onboard IPMI management controller unreachable as well,
>>> requiring a physical reset of the server sled. Server B will
>>> continue crashing until tc is disabled on Server A.
>>>
>>> Example call trace during the lockup
>>> https://gist.github.com/mdkent/a792cda348ca0048d3cc (though this was
>>> from an earlier test on CentOS 6.3, the end result is the same).
>>>
>>>
>>> Test #2
>>> -------
>>>
>>> This time we remove the bonded interface on Server B on both the
>>> host and switch, giving us a single untagged port in vlan 500.
>>>
>>> We introduce the same corruption by Server A. We also send some
>>> traffic from Server A -> Server B since things are quieter with LACP
>>> disabled.
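>>>
>>> Any steady stream is enough to exercise this; something along the
>>> lines of "iperf -s" on Server B and "iperf -c <server B ip> -t 600"
>>> on Server A works (the exact tool and duration are just an example).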
>>>
>>> Result:
>>> Everything is fine. Server B has no complaints.
>>>
>>>
>>> The obvious solution is to stay away from netem :) but this is a
>>> scary bug. It's already caused a major outage for us that was very
>>> difficult to debug.
>>>
>>> What can I gather and provide to help fix this?
>>
>> Hello Matthew,
>>
>> Based on the traces provided the issue would appear to be a soft lockup
>> related to a locking issue in the bonding driver. By any chance have you
>> tried testing this issue with an interface other than igb in the same
>> system? This would help to determine if igb actually plays a role in
>> this or if it is just specific to the bonding driver receiving frames
>> from netem.
>>
> Thanks for looking!
>
> Unfortunately these Dell c5220 server sleds can’t accommodate any
> add-in cards. They ship with just the 2 onboard igb interfaces.
>
> This might help though, here’s a complete review of what we saw during
> the outage netem triggered:
>
> * All servers with CentOS/RHEL 6 + igb locked up.
> * All servers with CentOS/RHEL 6 + ixgbe locked up.
> * Some servers with Ubuntu precise + igb started port flapping but saw
> no lockups; a reboot cured them.
> * All servers with Ubuntu lucid + igb were fine.
> * All servers with CentOS/RHEL 6 + bnx2 were fine.
> * All servers with Ubuntu precise + bnx2 were fine.
>  
> Our servers are mostly a mix of older Dell r720s and newer Dell c5220s
> and c6220s, all of them using bonding.
>
>> As far as the igb driver itself, could you send us the ethtool -i,
>> ethtool -S, and lspci -vvv output for the interface? This would help us
>> to determine the hardware configuration you have.
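>>
>> Something along these lines would do (assuming the bond slaves are
>> named eth0 and eth1; adjust to your interface names):
>>
>> ethtool -i eth0
>> ethtool -S eth0
>> lspci | grep -i ethernet     # find the NIC's PCI address
>> lspci -vvv -s <pci address from above>
>>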
>> Thanks,
> Sure! Here you go https://gist.github.com/mdkent/821026b70882c706142e 
>
> Let me know if you’d like me to open up a proper bug report.
>
> Thanks again.
>
> - Matt

Matt,

Thanks for the info.  We will need to work on reproducing this here.

In the meantime, would it be possible to try to reproduce this with the
latest kernel.org kernel?  The reason I ask is that the behavior you saw
on lucid versus precise points to something that may have been
introduced between 2.6.32 and 3.2, and the issue may already have been
resolved in a later kernel; I would like to verify whether that is the
case.
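
If it helps, the usual recipe for trying a mainline kernel is roughly
the following (only a sketch; the exact steps will vary with your setup):

  git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  cd linux
  cp /boot/config-$(uname -r) .config
  make olddefconfig
  make -j$(nproc)
  make modules_install install   # as root; then reboot and re-run test #1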

Thanks,

Alex
