I have a reproducible failure scenario that results in the following kernel messages being printed in succession (along with the associated link failing):
Dec 8 12:10:33 [SEQ 617259] lab236slot6 kernel: [44856.752261] Retransmission failure on link <1.1.6:p19p1-1.1.8:p19p1>
Dec 8 12:10:33 [SEQ 617260] lab236slot6 kernel: [44856.758633] Resetting link Link <1.1.6:p19p1-1.1.8:p19p1> state e
Dec 8 12:10:33 [SEQ 617261] lab236slot6 kernel: [44856.758635] XMTQ: 3 [2-4], BKLGQ: 0, SNDNX: 5, RCVNX: 4
Dec 8 12:10:33 [SEQ 617262] lab236slot6 kernel: [44856.758637] Failed msg: usr 10, typ 0, len 1540, err 0
Dec 8 12:10:33 [SEQ 617263] lab236slot6 kernel: [44856.758638] sqno 2, prev: 1001006, src: 1001006

The issue occurs within 30 seconds after any node in the cluster is rebooted. There are two 10Gb Ethernet fabrics in the cluster, so every node has two links to every other node. When the failure occurs, it is only ever one of the two links that fails, although which of the two it will be appears to be random on a boot-to-boot basis.

Important: links only ever fail to one common node in the mesh. While all nodes in the mesh are running the same kernel (Linux 4.4.0), the common node is the only one that is also running DRBD. Strictly speaking, there are two nodes running DRBD, but at any given time only one of the two is the 'active' DRBD manager, so to speak, as they use a shared IP for the DRBD functionality, much akin to the HA-Linux heartbeat. Again, the failure only ever occurs on TIPC links to the active DRBD node, as the other one is invisible (insofar as DRBD is concerned) as a stand-by.

So it would appear (on the surface, at least) that there is some conflict between running DRBD within the TIPC mesh. This failure scenario is 100% reproducible and takes only the time of a reboot + 30 seconds to trigger. It should be noted that the issue is only triggered if a node is rebooted after the DRBD node is already up and running.
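For anyone trying to reproduce this, the link and bearer state can be watched with the `tipc` tool from iproute2; a minimal sketch (the link name below is taken from the log above, and having two bearers listed reflects this cluster's dual-fabric setup):

```shell
# List known TIPC links and their up/down state
tipc link list

# Per-link counters; on a failing link the retransmission and
# reset counts climb before the reset message is logged
tipc link statistics show link 1.1.6:p19p1-1.1.8:p19p1

# Show the enabled bearers (one per fabric in this setup)
tipc bearer list
```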
In other words, if the DRBD node is rebooted *after* all other nodes are up and running, the link failures to the other nodes do not occur (unless, of course, one or more of those nodes is subsequently rebooted, in which case those nodes will experience a link failure once back up and running).

One other useful piece of information: when the TIPC link fails (again, it is only ever one of the two TIPC links to a node that fails), it can be recovered by manually 'bouncing' the bearer on the DRBD card (i.e. disabling the bearer followed by re-enabling it). The interesting point here is that if the link on fabric A is the one that failed, it is the B bearer that must be 'bounced' to fix the link on fabric A. Sounds like something to do with the DRBD shared address scheme...

_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
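The cross-fabric bearer 'bounce' described above can be scripted; a minimal sketch, assuming the iroute2 `tipc` tool with plain Ethernet bearers, and assuming the fabric-A and fabric-B interfaces are named p19p1 and p19p2 respectively (the second interface name is a guess, as only p19p1 appears in the log):

```shell
# Link on fabric A (p19p1) failed, so bounce the *fabric-B* bearer:
tipc bearer disable media eth device p19p2
tipc bearer enable  media eth device p19p2
```

That disabling and re-enabling the opposite fabric's bearer is what clears the failed link is exactly the oddity reported above, which is why it smells like an addressing problem rather than a per-fabric transport problem.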