Thanks - just testing it out now and so far all looks good.

________________________________
From: Parthasarathy Bhuvaragan <parthasarathy.bhuvara...@ericsson.com>
Sent: Monday, December 12, 2016 3:20:51 AM
To: tipc-discussion@lists.sourceforge.net
Subject: Re: [tipc-discussion] reproducible link failure scenario
On 12/09/2016 09:25 PM, Butler, Peter wrote:
> We can certainly do that for future upgrades of our customers. However we
> may need to just patch in the interim.
>
> Is the patch small enough (self-contained enough) that it would be easy
> enough for me to port it into our 4.4.0 kernel? Or does it make use of many
> kernel constructs that have changed between 4.4 and 4.8?

Yes, the commit below should apply cleanly on 4.4.0.
https://www.spinics.net/lists/netdev/msg392988.html

/Partha
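(A minimal sketch of such a backport, assuming a local clone of the
linux-stable git tree whose history already includes the 4.8-era fix;
the branch name here is illustrative:)

    # Start a branch from the 4.4 release tag and cherry-pick the fix.
    # Assumes a linux-stable clone that contains commit d2f394dc4816.
    git checkout -b tipc-link-fix v4.4
    git cherry-pick d2f394dc4816
    # Resolve any conflicts and run 'git cherry-pick --continue', then
    # rebuild and install the kernel in the usual way for your distro.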
> ________________________________
> From: Jon Maloy <jon.ma...@ericsson.com>
> Sent: Friday, December 9, 2016 1:57:46 PM
> To: Butler, Peter; tipc-discussion@lists.sourceforge.net
> Subject: RE: reproducible link failure scenario
>
> Hi Peter,
> This is a known bug, fixed in commit d2f394dc4816 ("tipc: fix random link
> resets while adding a second bearer") from Partha Bhuvaragan, and present in
> Linux 4.8.
> Do you have any possibility to upgrade your kernel?
>
> BR
> ///jon
>
>> -----Original Message-----
>> From: Butler, Peter [mailto:pbut...@sonusnet.com]
>> Sent: Friday, 09 December, 2016 09:31
>> To: tipc-discussion@lists.sourceforge.net
>> Subject: [tipc-discussion] reproducible link failure scenario
>>
>> I have a reproducible failure scenario that results in the following kernel
>> messages being printed in succession (along with the associated link
>> failing):
>>
>> Dec 8 12:10:33 [SEQ 617259] lab236slot6 kernel: [44856.752261] Retransmission failure on link <1.1.6:p19p1-1.1.8:p19p1>
>> Dec 8 12:10:33 [SEQ 617260] lab236slot6 kernel: [44856.758633] Resetting link Link <1.1.6:p19p1-1.1.8:p19p1> state e
>> Dec 8 12:10:33 [SEQ 617261] lab236slot6 kernel: [44856.758635] XMTQ: 3 [2-4], BKLGQ: 0, SNDNX: 5, RCVNX: 4
>> Dec 8 12:10:33 [SEQ 617262] lab236slot6 kernel: [44856.758637] Failed msg: usr 10, typ 0, len 1540, err 0
>> Dec 8 12:10:33 [SEQ 617263] lab236slot6 kernel: [44856.758638] sqno 2, prev: 1001006, src: 1001006
>>
>> The issue occurs within 30 seconds after any node in the cluster is
>> rebooted. There are two 10Gb Ethernet fabrics in the cluster, so every node
>> has two links to every other node. When the failure occurs, it is only ever
>> one of the two links that fails (although it appears to be random which of
>> the two it will be on a boot-to-boot basis).
>>
>> Important: links only ever fail to a common node in the mesh. While all
>> nodes in the mesh are running the same kernel (Linux 4.4.0), the common
>> node is the only one that is also running DRBD. Actually, there are two
>> nodes running DRBD, but at any given time only one of the two is the
>> 'active' DRBD manager, so to speak, as they use a shared IP for the DRBD
>> functionality, much akin to the HA-Linux heartbeat. Again, the failure only
>> ever occurs on TIPC links to the active DRBD node, as the other one is
>> invisible (insofar as DRBD is concerned) as a stand-by.
>>
>> So it would appear (on the surface at least) that there is some conflict
>> caused by running DRBD within the TIPC mesh.
>>
>> This failure scenario is 100% reproducible and only takes the time of a
>> reboot + 30 seconds to trigger. It should be noted that the issue is only
>> triggered if a node is rebooted after the DRBD node is already up and
>> running. In other words, if the DRBD node is rebooted *after* all other
>> nodes are up and running, the link failures to the other nodes do not occur
>> (unless, of course, one or more of those nodes is then subsequently
>> rebooted, in which case those nodes will experience a link failure once up
>> and running).
>>
>> One other useful piece of information: when the TIPC link fails (again, it
>> is only ever one of the two TIPC links to a node that fails), it can be
>> recovered by manually 'bouncing' the bearer on the DRBD card (i.e.
>> disabling the bearer and then re-enabling it). The interesting point here
>> is that if the link on fabric A is the one that failed, it is the B bearer
>> that must be 'bounced' to fix the link on fabric A. Sounds like something
>> to do with the DRBD shared address scheme...
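For reference, the manual 'bounce' described above maps onto the iproute2
tipc tool roughly as sketched here. This is an illustration, not a command
from the thread: the eth media type and the p19p1 device name are inferred
from the link name in the log and may differ per node, and older
installations may expose the same enable/disable operations through the
legacy tipc-config utility.

    # Bounce a bearer to recover a failed peer link. Per the observation
    # above, if the fabric-A link failed, bounce the fabric-B bearer.
    # Media type and device name are assumptions taken from the log.
    tipc bearer disable media eth device p19p1
    tipc bearer enable media eth device p19p1
    # Verify that the links over both fabrics have come back up.
    tipc link list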