Hi Jon,

To clarify my previous email regarding the behaviour observed, this is
what happens over time:

+ remove bc peer
  ... some time until peer rejoins ...
+ add bc peer
+ tipc_link_bc_ack_rcv link is up = false, node is up = false
  (this gets called a number of times until both the link and node are up)
+ tipc_link_bc_ack_rcv l->acked set to valid ack
  ...
+ tipc_rcv - usr 5 or 11, bc_ack = 0
+ tipc_bcast_ack_rcv
+ tipc_link_bc_ack_rcv sets l->acked to 0

Regards,
JT

On Thu, Aug 25, 2016 at 8:06 AM, John THompson <thompa....@gmail.com> wrote:

> Hi Jon,
>
> It is a similar problem in terms of what happens to the bc link. I do
> have that patch applied.
>
> I have added debug through the remove bc peer and various other functions
> and the setting of the acked field to 0 is occurring when processing a
> packet from named (msg user 11) or BCAST protocol (msg user 5).
>
> Thanks,
> JT
>
> On Wed, Aug 24, 2016 at 10:23 PM, Jon Maloy <jon.ma...@ericsson.com>
> wrote:
>
>> Hi John,
>> This sounds a lot like the problem I tried to fix in
>> a71eb720355c2 ("tipc: ensure correct broadcast send buffer release when
>> peer is lost")
>> So, either that patch is not present in your kernel (if it is 4.7 it is
>> supposed to be) or my solution somehow hasn't solved the problem.
>> Can you confirm that the patch is there?
>>
>> BR
>> ///jon
>>
>> > -----Original Message-----
>> > From: John THompson [mailto:thompa....@gmail.com]
>> > Sent: Tuesday, 23 August, 2016 20:21
>> > To: tipc-discussion@lists.sourceforge.net
>> > Subject: [tipc-discussion] BC rcv link acked stuck after receiving a
>> > named with a BC ACK of 0
>> >
>> > Hi,
>> >
>> > I am running TIPC 2.0 on Linux 4.7 on a cluster of Freescale QorIQ P2040
>> > and Marvell Armada-XP processors. There are 10 nodes in all.
>> > When 2 of the nodes are removed, then rejoin the cluster we sometimes see
>> > behaviour where the TIPC BC link gets stuck and eventually the backlog gets
>> > full. The 2 nodes that are joining have already connected together.
>> >
>> > The problem occurs when the BC link sndnxt value is greater than 32k on one
>> > of the nodes (call it NODE1) and 2 nodes begin to join.
>> > When NODE1 detects the joining nodes, at some early point after they have
>> > joined, NODE1 receives a NAMED publication with a BC ack of 0. NODE1
>> > immediately sets its BC acked to 0 and tries to ack packets off the
>> > transmq. No packets get removed as the new ack value doesn't match any of
>> > the packets that need to be acked.
>> >
>> > The problem doesn't recover because in tipc_link_bc_ack_rcv it ensures that
>> > the new acked value is more than the old acked value. When the values are
>> > greater than 32k apart this means that 0 can indeed be greater than
>> > 40,000. So when new packets are processed the new BC ack value is
>> > considered less than the stored one (0).
>> >
>> > This results in the BC transmq getting full and the backlog getting full,
>> > thereby preventing communication over the BC link between nodes.
>> >
>> > I am persisting in trying to work out why the NAMED publication has a BC
>> > ack of 0, which I think is the root cause of the problem.
>> >
>> > I think that tipc_link_bc_ack_rcv needs an extra check to ensure that an
>> > invalid BC ack value cannot be set. I am defining invalid as being an
>> > acked value that is greater than the current BC acked value + the link
>> > window.
>> >
>> > Thanks,
>> > John
>> > ------------------------------------------------------------------------------
>> > _______________________________________________
>> > tipc-discussion mailing list
>> > tipc-discussion@lists.sourceforge.net
>> > https://lists.sourceforge.net/lists/listinfo/tipc-discussion