Hi Peter,

This is a known bug, fixed by commit d2f394dc4816 ("tipc: fix random link resets while adding a second bearer") from Parthasarathy Bhuvaragan, which is present in Linux 4.8. Would it be possible for you to upgrade your kernel?
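In case a full kernel upgrade is not practical, a rough sketch of checking for and backporting just that one fix could look like the following. This assumes you build the kernel (or at least the tipc module) from a mainline/stable git checkout; the branch name below is made up, and the cherry-pick onto a 4.4 base may need minor conflict resolution:

  # which release tag first contains the fix
  git describe --contains d2f394dc4816
  # backport only that commit onto a 4.4-based tree, then rebuild the tipc module
  git checkout -b tipc-link-reset-fix v4.4
  git cherry-pick -x d2f394dc4816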
BR ///jon

> -----Original Message-----
> From: Butler, Peter [mailto:pbut...@sonusnet.com]
> Sent: Friday, 09 December, 2016 09:31
> To: tipc-discussion@lists.sourceforge.net
> Subject: [tipc-discussion] reproducible link failure scenario
>
> I have a reproducible failure scenario that results in the following kernel
> messages being printed in succession (along with the associated link failing):
>
> Dec 8 12:10:33 [SEQ 617259] lab236slot6 kernel: [44856.752261] Retransmission failure on link <1.1.6:p19p1-1.1.8:p19p1>
> Dec 8 12:10:33 [SEQ 617260] lab236slot6 kernel: [44856.758633] Resetting link Link <1.1.6:p19p1-1.1.8:p19p1> state e
> Dec 8 12:10:33 [SEQ 617261] lab236slot6 kernel: [44856.758635] XMTQ: 3 [2-4], BKLGQ: 0, SNDNX: 5, RCVNX: 4
> Dec 8 12:10:33 [SEQ 617262] lab236slot6 kernel: [44856.758637] Failed msg: usr 10, typ 0, len 1540, err 0
> Dec 8 12:10:33 [SEQ 617263] lab236slot6 kernel: [44856.758638] sqno 2, prev: 1001006, src: 1001006
>
> The issue occurs within 30 seconds after any node in the cluster is rebooted.
> There are two 10Gb Ethernet fabrics in the cluster, so every node has two
> links to every other node. When the failure occurs, it is only ever one of
> the two links that fails (although it appears to be random which of the two
> it will be on a boot-to-boot basis).
>
> Important: links only fail to a common node in the mesh. While all nodes in
> the mesh are running the same kernel (Linux 4.4.0), the common node is the
> only one that is also running DRBD. Actually, there are two nodes running
> DRBD, but at any given time only one of the two is the 'active' DRBD manager,
> so to speak, as they use a shared IP for the DRBD functionality, much akin to
> the HA-Linux heartbeat. Again, the failure only ever occurs on TIPC links to
> the active DRBD node, as the other one is invisible (insofar as DRBD is
> concerned) as a stand-by.
>
> So it would appear (on the surface at least) that there is some conflict
> between running DRBD and TIPC within the same mesh.
>
> This failure scenario is 100% reproducible and only takes the time of a
> reboot + 30 seconds to trigger. It should be noted that the issue is only
> triggered if a node is rebooted after the DRBD node is already up and
> running. In other words, if the DRBD node is rebooted *after* all other nodes
> are up and running, the link failures to the other nodes do not occur
> (unless, of course, one or more of those nodes is then subsequently rebooted,
> in which case those nodes will experience a link failure once up and
> running).
>
> One other useful piece of information. When the TIPC link fails (again, it is
> only ever one of the two TIPC links to a node that fails), it can be
> recovered by manually 'bouncing' the bearer on the DRBD card (i.e. disabling
> the bearer followed by enabling the bearer). However, the interesting point
> here is that if the link on fabric A is the one that failed, it is the B
> bearer that must be 'bounced' to fix the link on fabric A. Sounds like
> something to do with the DRBD shared address scheme...
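For reference, the manual bearer 'bounce' described above would look roughly like this with the iproute2 tipc tool (the media type and device name are taken from the log excerpt and are assumptions about the actual configuration; older setups would use the equivalent tipc-config -bd/-be options):

  tipc bearer disable media eth device p19p1
  tipc bearer enable media eth device p19p1

Note Peter's observation that it is the bearer on the *other* fabric that has to be bounced to recover the failed link.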