I have a reproducible failure scenario that results in the following kernel messages being printed in succession (along with the associated link failing):
Dec 8 12:10:33 [SEQ 617259] lab236slot6 kernel: [44856.752261] Retransmission failure on link <1.1.6:p19p1-1.1.8:p19p1>
Dec 8 12:10:33 [SEQ 617260] lab236slot6 kernel: [44856.758633] Resetting link Link <1.1.6:p19p1-1.1.8:p19p1> state e
Dec 8 12:10:33 [SEQ 617261] lab236slot6 kernel: [44856.758635] XMTQ: 3 [2-4], BKLGQ: 0, SNDNX: 5, RCVNX: 4
Dec 8 12:10:33 [SEQ 617262] lab236slot6 kernel: [44856.758637] Failed msg: usr 10, typ 0, len 1540, err 0
Dec 8 12:10:33 [SEQ 617263] lab236slot6 kernel: [44856.758638] sqno 2, prev: 1001006, src: 1001006

The issue occurs within 30 seconds after any node in the cluster is rebooted. There are two 10Gb Ethernet fabrics in the cluster, so every node has two links to every other node. When the failure occurs, it is only ever one of the two links that fails, although which of the two it will be appears to be random on a boot-to-boot basis.

Important: links only ever fail to one common node in the mesh. While all nodes in the mesh are running the same kernel (Linux 4.4.0), the common node is the only one that is also running DRBD. Strictly speaking, there are two nodes running DRBD, but at any given time only one of the two is the 'active' DRBD manager, so to speak, as they use a shared IP for the DRBD functionality, much akin to the HA-Linux heartbeat. Again, the failure only ever occurs on TIPC links to the active DRBD node, as the other one is invisible (insofar as DRBD is concerned) as a stand-by.

So it would appear (on the surface, at least) that there is some conflict between running DRBD within the TIPC mesh. This failure scenario is 100% reproducible and takes only the time of a reboot + 30 seconds to trigger. It should be noted that the issue is only triggered if a node is rebooted after the DRBD node is already up and running.
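For anyone trying to reproduce this, the link and bearer state can be watched with the `tipc` tool from iproute2; a minimal sketch (the link name below is taken from the log above, and having two bearers listed reflects this cluster's dual-fabric setup):

```shell
# List known TIPC links and their up/down state
tipc link list

# Per-link counters; on a failing link the retransmission and
# reset counts climb before the reset message is logged
tipc link statistics show link 1.1.6:p19p1-1.1.8:p19p1

# Show the enabled bearers (one per fabric in this setup)
tipc bearer list
```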
In other words, if the DRBD node is rebooted *after* all other nodes are up and running, the link failures to the other nodes do not occur (unless, of course, one or more of those nodes is subsequently rebooted, in which case those nodes will experience a link failure once back up and running).

One other useful piece of information: when the TIPC link fails (again, it is only ever one of the two TIPC links to a node that fails), it can be recovered by manually 'bouncing' the bearer on the DRBD card (i.e. disabling the bearer followed by re-enabling it). The interesting point here is that if the link on fabric A is the one that failed, it is the B bearer that must be 'bounced' to fix the link on fabric A. Sounds like something to do with the DRBD shared address scheme...

_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
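The cross-fabric bearer 'bounce' described above can be scripted; a minimal sketch, assuming the iroute2 `tipc` tool with plain Ethernet bearers, and assuming the fabric-A and fabric-B interfaces are named p19p1 and p19p2 respectively (the second interface name is a guess, as only p19p1 appears in the log):

```shell
# Link on fabric A (p19p1) failed, so bounce the *fabric-B* bearer:
tipc bearer disable media eth device p19p2
tipc bearer enable  media eth device p19p2
```

That disabling and re-enabling the opposite fabric's bearer is what clears the failed link is exactly the oddity reported above, which is why it smells like an addressing problem rather than a per-fabric transport problem.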