Hi Peter,
This is a known bug. It was fixed by commit d2f394dc4816 ("tipc: fix random
link resets while adding a second bearer") from Partha Bhuvaragan, and the
fix is present in Linux 4.8.
Would it be possible for you to upgrade your kernel?
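If it helps, here is a rough, purely illustrative Python sketch for flagging
nodes that are still on a pre-4.8 kernel. It only looks at the version
string, so it cannot tell whether a distro kernel already carries the fix as
a backport:

# Rough sketch: flag kernels older than 4.8, where the fix referenced
# above is said to be present.  A vendor kernel may carry the fix as a
# backport, which a plain version check cannot detect.
import platform

def kernel_at_least(major, minor):
    release = platform.release()                  # e.g. "4.4.0-96-generic"
    base = release.split("-")[0].split(".")
    return (int(base[0]), int(base[1])) >= (major, minor)

if not kernel_at_least(4, 8):
    print("Running %s: pre-4.8 kernel, TIPC bearer fix may be missing"
          % platform.release())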

BR
///jon

> -----Original Message-----
> From: Butler, Peter [mailto:pbut...@sonusnet.com]
> Sent: Friday, 09 December, 2016 09:31
> To: tipc-discussion@lists.sourceforge.net
> Subject: [tipc-discussion] reproducible link failure scenario
> 
> I have a reproducible failure scenario that results in the following kernel
> messages being printed in succession (along with the associated link failing):
> 
> Dec  8 12:10:33 [SEQ 617259] lab236slot6 kernel:  [44856.752261] Retransmission failure on link <1.1.6:p19p1-1.1.8:p19p1>
> Dec  8 12:10:33 [SEQ 617260] lab236slot6 kernel:  [44856.758633] Resetting link Link <1.1.6:p19p1-1.1.8:p19p1> state e
> Dec  8 12:10:33 [SEQ 617261] lab236slot6 kernel:  [44856.758635] XMTQ: 3 [2-4], BKLGQ: 0, SNDNX: 5, RCVNX: 4
> Dec  8 12:10:33 [SEQ 617262] lab236slot6 kernel:  [44856.758637] Failed msg: usr 10, typ 0, len 1540, err 0
> Dec  8 12:10:33 [SEQ 617263] lab236slot6 kernel:  [44856.758638] sqno 2, prev: 1001006, src: 1001006
> 
> The issue occurs within 30 seconds after any node in the cluster is
> rebooted.  There are two 10Gb Ethernet fabrics in the cluster, so every
> node has two links to every other node.  When the failure occurs, it is
> only ever one of the two links that fails (although which of the two it
> will be appears to be random on a boot-to-boot basis).
> 
> Important: links only fail to a common node in the mesh.  While all nodes
> in the mesh are running the same kernel (Linux 4.4.0), the common node is
> the only one that is also running DRBD.  Actually, there are two nodes
> running DRBD, but at any given time only one of the two is the 'active'
> DRBD manager, so to speak, as they use a shared IP for the DRBD
> functionality, much akin to the HA-Linux heartbeat.  Again, the failure
> only ever occurs on TIPC links to the active DRBD node, as the other one
> is invisible (insofar as DRBD is concerned) as a stand-by.
> 
> So it would appear (on the surface at least) that there is some conflict
> when running DRBD within the TIPC mesh.
> 
> This failure scenario is 100% reproducible and only takes the time of a
> reboot + 30 seconds to trigger.  It should be noted that the issue is only
> triggered if a node is rebooted after the DRBD node is already up and
> running.  In other words, if the DRBD node is rebooted *after* all other
> nodes are up and running, the link failures to the other nodes do not
> occur (unless, of course, one or more of those nodes is then subsequently
> rebooted, in which case those nodes will experience a link failure once up
> and running).
> 
> One other useful piece of information: when the TIPC link fails (again, it
> is only ever one of the two TIPC links to a node that fails), it can be
> recovered by manually 'bouncing' the bearer on the DRBD card (i.e.
> disabling the bearer and then re-enabling it).  However, the interesting
> point here is that if the link on fabric A is the one that failed, it is
> the B bearer that must be 'bounced' to fix the link on fabric A.  Sounds
> like something to do with the DRBD shared address scheme...
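Until you can upgrade, the manual recovery you describe can also be scripted.
Below is a rough sketch of that bearer 'bounce', assuming the iproute2 'tipc'
tool and Ethernet bearers; the exact command syntax and the device names
(p19p1/p19p2 here, the second taken as a placeholder for the other fabric)
are assumptions you would need to adapt to your setup:

# Rough sketch of the manual workaround described above: disable and then
# re-enable one TIPC bearer.  Assumes the iproute2 'tipc' tool and Ethernet
# media; the device names below (p19p1/p19p2) are placeholders and should
# be adapted to the actual fabrics.
import subprocess
import time

def bounce_bearer(device, media="eth", settle=2.0):
    """Disable and re-enable a single TIPC bearer."""
    subprocess.check_call(
        ["tipc", "bearer", "disable", "media", media, "device", device])
    time.sleep(settle)   # give the bearer a moment before re-enabling it
    subprocess.check_call(
        ["tipc", "bearer", "enable", "media", media, "device", device])

# Observed quirk from the report: if the link on fabric A (p19p1) is the
# one that failed, it is the fabric B bearer that has to be bounced.
bounce_bearer("p19p2")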
> 
