Hi Jon,

It is a similar problem in terms of what happens to the bc link. I do have that patch applied.

I have added debug through the remove bc peer and various other functions and the setting of the acked field to 0 is occurring when processing a packet from named (msg user 11) or BCAST protocol (msg user 5).

Thanks,
JT

On Wed, Aug 24, 2016 at 10:23 PM, Jon Maloy <jon.ma...@ericsson.com> wrote:

> Hi John,
> This sounds a lot like the problem I tried to fix in
> a71eb720355c2 ("tipc: ensure correct broadcast send buffer release when
> peer is lost")
> So, either that patch is not present in your kernel (if it is 4.7 it is
> supposed to be) or my solution somehow hasn't solved the problem.
> Can you confirm that the patch is there?
>
> BR
> ///jon
>
> > -----Original Message-----
> > From: John THompson [mailto:thompa....@gmail.com]
> > Sent: Tuesday, 23 August, 2016 20:21
> > To: tipc-discussion@lists.sourceforge.net
> > Subject: [tipc-discussion] BC rcv link acked stuck after receiving a
> > named with a BC ACK of 0
> >
> > Hi,
> >
> > I am running TIPC 2.0 on Linux 4.7 on a cluster of Freescale QorIQ P2040
> > and Marvell Armada-XP processors. There are 10 nodes in all.
> > When 2 of the nodes are removed, then rejoin the cluster we sometimes see
> > behaviour where the TIPC BC link gets stuck and eventually the backlog
> > gets full. The 2 nodes that are joining have already connected together.
> >
> > The problem occurs when the BC link sndnxt value is greater than 32k on
> > one of the nodes (call it NODE1) and 2 nodes begin to join.
> > When NODE1 detects the joining nodes, at some early point after they have
> > joined, NODE1 receives a NAMED publication with a BC ack of 0. NODE1
> > immediately sets its BC acked to 0 and tries to ack packets off the
> > transmq. No packets get removed as the new ack value doesn't match any
> > of the packets that need to be acked.
> >
> > The problem doesn't recover because in tipc_link_bc_ack_rcv it ensures
> > that the new acked value is more than the old acked value. When the
> > values are greater than 32k apart this means that 0 can indeed be
> > greater than 40,000. So when new packets are processed the new BC ack
> > value is considered less than the stored one (0).
> >
> > This results in the BC transmq getting full and the backlog getting full,
> > thereby preventing communication over the BC link between nodes.
> >
> > I am persisting in trying to work out why the NAMED publication has a BC
> > ack of 0, which I think is the root cause of the problem.
> >
> > I think that tipc_link_bc_ack_rcv needs an extra check to ensure that an
> > invalid BC ack value cannot be set. I am defining invalid as being an
> > acked value that is greater than the current BC acked value + the link
> > window.
> >
> > Thanks,
> > John
> > ------------------------------------------------------------------------------
> > _______________________________________________
> > tipc-discussion mailing list
> > tipc-discussion@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/tipc-discussion
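
For anyone following the arithmetic in the original report, here is a minimal standalone sketch of why 0 can compare as "more" than 40,000, and of the kind of window check proposed for tipc_link_bc_ack_rcv. It assumes 16-bit sequence numbers compared modulo 2^16. The helper names seq_more and ack_in_window and the window size of 50 used in the discussion below are hypothetical, chosen for illustration only. This is not the kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Wraparound-aware "a is newer than b" for 16-bit sequence numbers
 * (hypothetical helper, not the kernel's own).
 * a counts as newer when it is 1..32767 steps ahead of b modulo 2^16.
 */
static bool seq_more(uint16_t a, uint16_t b)
{
    uint16_t ahead = (uint16_t)(a - b);

    return ahead != 0 && ahead < 32768u;
}

/*
 * Hypothetical guard along the lines proposed in the report:
 * accept new_ack only if it is at most `window` packets ahead of the
 * currently stored acked value, so a stray ack far outside the window
 * (such as 0 while acked is 40000) is ignored.
 */
static bool ack_in_window(uint16_t acked, uint16_t new_ack, uint16_t window)
{
    uint16_t ahead = (uint16_t)(new_ack - acked);

    return ahead != 0 && ahead <= window;
}
```

With the values from the report, seq_more(0, 40000) is true, so an ack of 0 is accepted over a stored 40000. Afterwards seq_more(40001, 0) is false, so every genuine ack looks older than the stored 0 and nothing is released from the transmq. With an assumed window of 50, ack_in_window(40000, 0, 50) is false and the stray ack would be dropped, while ack_in_window(40000, 40010, 50) and the wrapped case ack_in_window(65530, 4, 50) are both true.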