On 08/29/2016 06:48 PM, Jon Maloy wrote:
> Hi John,
> Sorry for my late answer; I was on vacation for a few days.
> It seems I gave you the wrong commit reference in my previous mail. The one I
> really meant was
> 2d18ac4ba7454a426047 (“tipc: extend broadcast link initialization criteria”)
>
> This one explains why the first packets sometimes get an invalid ack number,
> but also remedies it, and I simply cannot see how an invalid ack #0 can ever
> be accepted when this patch is applied.
> I see no reason why this patch shouldn’t also be present in your code, but
> just to make sure, can you confirm this?
>
> I am right now wondering if a retransmission is the problem:
> 1: we receive pkt #2 which contains ack #1, so we set bc_peer_is_up to true.

Since only LINK_PROTO/STATE messages can cause bc_peer_is_up to go true, the
likely sequence is rather:

1: We receive a STATE message with unicast ack #1. This message should also
   contain a valid, with high probability non-zero, bc_ack. bc_peer_is_up is
   set to true.
2: We receive unicast pkt#1 (BCAST init or NAMED) which contains the invalid
   unicast ack #0. This one is now accepted.
I believe this may happen, because STATE messages, contrary to data packets,
are sent as TC_PRIO_CONTROL, and may sometimes bypass data messages, but I
cannot see it happening as often and consistently as you seem to be observing
it.

Another possibility is that bc_ack in the received STATE message also is an
invalid zero, although I cannot see how this can happen either.

Regards
///jon

> 2: we receive pkt #1 retransmitted with ack #0. This now gets accepted, and
> we are in trouble.
>
> I’ll try to figure out a solution to this, but it may be possible for you to
> verify this first.
>
> BR
> ///jon
>
>
>
> From: John THompson [mailto:thompa....@gmail.com]
> Sent: Wednesday, 24 August, 2016 16:22
> To: Jon Maloy <jon.ma...@ericsson.com>
> Cc: tipc-discussion@lists.sourceforge.net
> Subject: Re: [tipc-discussion] BC rcv link acked stuck after receiving a
> named with a BC ACK of 0
>
> Hi Jon,
>
> To clarify my previous email regarding the behaviour observed,
>
> What happens over time:
> + remove bc peer
> ...
> some time until peer rejoins
> ...
> + add bc peer
> + tipc_link_bc_ack_rcv
> link is up = false, node is up = false
> (this gets called a number of times until both the link and node are up)
>
> + tipc_link_bc_ack_rcv
> l->acked set to valid ack
> ...
> + tipc_rcv - usr 5 or 11, bc_ack = 0
> + tipc_bcast_ack_rcv
> + tipc_link_bc_ack_rcv
> sets l->acked to 0
>
> Regards,
> JT
>
>
> On Thu, Aug 25, 2016 at 8:06 AM, John THompson
> <thompa....@gmail.com> wrote:
> Hi Jon,
>
> It is a similar problem in terms of what happens to the bc link. I do have
> that patch applied.
>
> I have added debug through the remove bc peer and various other functions and
> the setting of the acked field to 0 is occurring when processing a packet
> from named (msg user 11) or BCAST protocol (msg user 5).
>
> Thanks,
> JT
>
> On Wed, Aug 24, 2016 at 10:23 PM, Jon Maloy
> <jon.ma...@ericsson.com> wrote:
> Hi John,
> This sounds a lot like the problem I tried to fix in
> a71eb720355c2 ("tipc: ensure correct broadcast send buffer release when peer
> is lost")
> So, either that patch is not present in your kernel (if it is 4.7 it is
> supposed to be) or my solution somehow hasn't solved the problem.
> Can you confirm that the patch is there?
>
> BR
> ///jon
>
>> -----Original Message-----
>> From: John THompson
>> [mailto:thompa....@gmail.com]
>> Sent: Tuesday, 23 August, 2016 20:21
>> To:
>> tipc-discussion@lists.sourceforge.net
>> Subject: [tipc-discussion] BC rcv link acked stuck after receiving a named
>> with a BC
>> ACK of 0
>>
>> Hi,
>>
>> I am running TIPC 2.0 on Linux 4.7 on a cluster of Freescale QorIQ P2040
>> and Marvell Armada-XP processors. There are 10 nodes in all.
>> When 2 of the nodes are removed, then rejoin the cluster we sometimes see
>> behaviour where the TIPC BC link gets stuck and eventually the backlog gets
>> full. The 2 nodes that are joining have already connected together.
>>
>> The problem occurs when the BC link sndnxt value is greater than 32k on one
>> of the nodes (call it NODE1) and 2 nodes begin to join.
>> When NODE1 detects the joining nodes, at some early point after they have
>> joined, NODE1 receives a NAMED publication with a BC ack of 0. NODE1
>> immediately sets its BC acked to 0 and tries to ack packets off the
>> transmq. No packets get removed as the new ack value doesn't match any of
>> the packets that need to be acked.
>>
>> The problem doesn't recover because in tipc_link_bc_ack_rcv it ensures that
>> the new acked value is more than the old acked value. When the values are
>> greater than 32k apart this means that 0 can indeed be greater than
>> 40,000.
>> So when new packets are processed the new BC ack value is
>> considered less than the stored one (0).
>>
>> This results in the BC transmq getting full and the backlog getting full,
>> thereby preventing communication over the BC link between nodes.
>>
>> I am persisting in trying to work out why the NAMED publication has a BC
>> ack of 0, which I think is the root cause of the problem.
>>
>> I think that tipc_link_bc_ack_rcv needs an extra check to ensure that an
>> invalid BC ack value cannot be set. I am defining invalid as being an
>> acked value that is greater than the current BC acked value + the link
>> window.
>>
>> Thanks,
>> John
>> ------------------------------------------------------------------------------
>> _______________________________________________
>> tipc-discussion mailing list
>> tipc-discussion@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
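
For readers following the thread: the observation that an ack of 0 can count
as "greater than" 40,000 once the two values are more than 32k apart follows
from comparing 16-bit sequence numbers with wrapping arithmetic. The sketch
below is a userspace illustration only, not the kernel code. The helper names
(seq_less_eq, seq_more, bc_ack_in_window), the self-check function, and the
window size of 50 are assumptions made for the example. The last helper is one
possible reading of the extra check proposed in the original report (reject an
ack that lies beyond the current acked value plus the link window).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if 'left' is at or before 'right' in 16-bit wrapping sequence space,
 * i.e. the forward distance from left to right is less than half the space. */
bool seq_less_eq(uint16_t left, uint16_t right)
{
	return (uint16_t)(right - left) < 32768u;
}

/* True if 'left' is considered after 'right'. */
bool seq_more(uint16_t left, uint16_t right)
{
	return !seq_less_eq(left, right);
}

/* Hypothetical guard: accept a new ack only if it is ahead of the current
 * acked value by at least 1 and at most 'window' packets. */
bool bc_ack_in_window(uint16_t cur_acked, uint16_t new_acked, uint16_t window)
{
	uint16_t ahead = (uint16_t)(new_acked - cur_acked);

	return ahead != 0 && ahead <= window;
}

/* Replays the scenario from the report with illustrative numbers. */
void bc_ack_selfcheck(void)
{
	const uint16_t window = 50;	/* assumed link window */

	/* Stored acked is 40000; a stray ack of 0 passes the "more" test. */
	assert(seq_more(0, 40000));

	/* Once acked is 0, a genuine ack of 40001 is treated as old. */
	assert(!seq_more(40001, 0));

	/* The window guard would have rejected the stray 0 ... */
	assert(!bc_ack_in_window(40000, 0, window));

	/* ... while still accepting a nearby ack, also across the wrap. */
	assert(bc_ack_in_window(40000, 40010, window));
	assert(bc_ack_in_window(65530, 4, window));
}
```

Calling bc_ack_selfcheck() from a small main() runs through the stuck-link
sequence: the first two assertions reproduce the reported behaviour, and the
remaining ones show how a window bound separates the invalid ack from valid
ones. Whether the link window is the right bound in the real code is a
question for the thread, not something this sketch settles.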