Thanks - just testing it out now and so far all looks good.

________________________________
From: Parthasarathy Bhuvaragan <parthasarathy.bhuvara...@ericsson.com>
Sent: Monday, December 12, 2016 3:20:51 AM
To: tipc-discussion@lists.sourceforge.net
Subject: Re: [tipc-discussion] reproducible link failure scenario
On 12/09/2016 09:25 PM, Butler, Peter wrote:
> We can certainly do that for future upgrades of our customers. However we
> may need to just patch in the interim.
>
> Is the patch small enough (self-contained enough) that it would be easy
> enough for me to port it into our 4.4.0 kernel? Or does it make use of many
> kernel constructs that have changed between 4.4 and 4.8?

Yes, the commit below should apply cleanly on 4.4.0.
https://www.spinics.net/lists/netdev/msg392988.html

/Partha
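(A minimal sketch of such a backport, assuming a local clone of the
linux-stable git tree whose history already includes the 4.8-era fix;
the branch name here is illustrative:)

    # Start a branch from the 4.4 release tag and cherry-pick the fix.
    # Assumes a linux-stable clone that contains commit d2f394dc4816.
    git checkout -b tipc-link-fix v4.4
    git cherry-pick d2f394dc4816
    # Resolve any conflicts and run 'git cherry-pick --continue', then
    # rebuild and install the kernel in the usual way for your distro.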
> ________________________________
> From: Jon Maloy <jon.ma...@ericsson.com>
> Sent: Friday, December 9, 2016 1:57:46 PM
> To: Butler, Peter; tipc-discussion@lists.sourceforge.net
> Subject: RE: reproducible link failure scenario
>
> Hi Peter,
> This is a known bug, fixed in commit d2f394dc4816 ("tipc: fix random link
> resets while adding a second bearer") from Partha Bhuvaragan, and present in
> Linux 4.8.
> Do you have any possibility to upgrade your kernel?
>
> BR
> ///jon
>
>> -----Original Message-----
>> From: Butler, Peter [mailto:pbut...@sonusnet.com]
>> Sent: Friday, 09 December, 2016 09:31
>> To: tipc-discussion@lists.sourceforge.net
>> Subject: [tipc-discussion] reproducible link failure scenario
>>
>> I have a reproducible failure scenario that results in the following kernel
>> messages being printed in succession (along with the associated link
>> failing):
>>
>> Dec 8 12:10:33 [SEQ 617259] lab236slot6 kernel: [44856.752261] Retransmission failure on link <1.1.6:p19p1-1.1.8:p19p1>
>> Dec 8 12:10:33 [SEQ 617260] lab236slot6 kernel: [44856.758633] Resetting link Link <1.1.6:p19p1-1.1.8:p19p1> state e
>> Dec 8 12:10:33 [SEQ 617261] lab236slot6 kernel: [44856.758635] XMTQ: 3 [2-4], BKLGQ: 0, SNDNX: 5, RCVNX: 4
>> Dec 8 12:10:33 [SEQ 617262] lab236slot6 kernel: [44856.758637] Failed msg: usr 10, typ 0, len 1540, err 0
>> Dec 8 12:10:33 [SEQ 617263] lab236slot6 kernel: [44856.758638] sqno 2, prev: 1001006, src: 1001006
>>
>> The issue occurs within 30 seconds after any node in the cluster is
>> rebooted. There are two 10Gb Ethernet fabrics in the cluster, so every node
>> has two links to every other node. When the failure occurs, it is only ever
>> one of the two links that fails (although it appears to be random which of
>> the two it will be on a boot-to-boot basis).
>>
>> Important: links only ever fail to a common node in the mesh. While all
>> nodes in the mesh are running the same kernel (Linux 4.4.0), the common
>> node is the only one that is also running DRBD. Actually, there are two
>> nodes running DRBD, but at any given time only one of the two is the
>> 'active' DRBD manager, so to speak, as they use a shared IP for the DRBD
>> functionality, much akin to the HA-Linux heartbeat. Again, the failure only
>> ever occurs on TIPC links to the active DRBD node, as the other one is
>> invisible (insofar as DRBD is concerned) as a stand-by.
>>
>> So it would appear (on the surface at least) that there is some conflict
>> caused by running DRBD within the TIPC mesh.
>>
>> This failure scenario is 100% reproducible and only takes the time of a
>> reboot + 30 seconds to trigger. It should be noted that the issue is only
>> triggered if a node is rebooted after the DRBD node is already up and
>> running. In other words, if the DRBD node is rebooted *after* all other
>> nodes are up and running, the link failures to the other nodes do not occur
>> (unless, of course, one or more of those nodes is then subsequently
>> rebooted, in which case those nodes will experience a link failure once up
>> and running).
>>
>> One other useful piece of information: when the TIPC link fails (again, it
>> is only ever one of the two TIPC links to a node that fails), it can be
>> recovered by manually 'bouncing' the bearer on the DRBD card (i.e.
>> disabling the bearer and then re-enabling it). The interesting point here
>> is that if the link on fabric A is the one that failed, it is the B bearer
>> that must be 'bounced' to fix the link on fabric A. Sounds like something
>> to do with the DRBD shared address scheme...
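For reference, the manual 'bounce' described above maps onto the iproute2
tipc tool roughly as sketched here. This is an illustration, not a command
from the thread: the eth media type and the p19p1 device name are inferred
from the link name in the log and may differ per node, and older
installations may expose the same enable/disable operations through the
legacy tipc-config utility.

    # Bounce a bearer to recover a failed peer link. Per the observation
    # above, if the fabric-A link failed, bounce the fabric-B bearer.
    # Media type and device name are assumptions taken from the log.
    tipc bearer disable media eth device p19p1
    tipc bearer enable media eth device p19p1
    # Verify that the links over both fabrics have come back up.
    tipc link list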