Hi Jon,

I have verified that the patch is included in my build:
2d18ac4ba7454a426047 (“tipc: extend broadcast link initialization criteria”)

I am trying to verify which packets are received when the problem occurs,
but I am having trouble getting the information out of my system at the
moment. I will keep trying.

Thanks,
JT

On Tue, Aug 30, 2016 at 6:20 PM, Jon Maloy <ma...@donjonn.com> wrote:

>
> On 08/29/2016 06:48 PM, Jon Maloy wrote:
>
>> Hi John,
>> Sorry for my late answer; I was on vacation for a few days.
>> It seems I gave you the wrong commit reference in my previous mail. The
>> one I really meant was
>> 2d18ac4ba7454a426047 (“tipc: extend broadcast link initialization
>> criteria”)
>>
>> This one explains why the first packets sometimes get an invalid ack
>> number, but also remedies it, and I simply cannot see how an invalid ack #0
>> can ever be accepted when this patch is applied.
>> I see no reason why this patch shouldn’t also be present in your code, but
>> just to make sure, can you confirm this?
>>
>> I am right now wondering if a retransmission is the problem:
>> 1: we receive pkt #2 which contains ack #1, so we set bc_peer_is_up to
>> true.
>>
> Since only LINK_PROTO/STATE messages can cause bc_peer_is_up to go true,
> the likely sequence is rather
> 1: We receive a STATE message with unicast ack #1. This message should
> also contain a valid, with high probability non-zero, bc_ack. bc_peer_is_up
> is set to true.
> 2: We receive unicast pkt #1 (BCAST init or NAMED) which contains the
> invalid unicast ack #0. This one is now accepted.
>
> I believe this may happen, because STATE messages, contrary to data
> packets, are sent as TC_PRIO_CONTROL, and may sometimes bypass data
> messages, but I cannot see it happening as often and consistently as you
> seem to be observing it. Another possibility is that bc_ack in the received
> STATE message also is an invalid zero, although I cannot see how this can
> happen either.
>
> Regards
> ///jon
>
>> 2: we receive pkt #1 retransmitted with ack #0. This now gets accepted,
>> and we are in trouble.
>>
>> I’ll try to figure out a solution to this, but it may be possible for you
>> to verify this first.
>>
>> BR
>> ///jon
>>
>> From: John THompson [mailto:thompa....@gmail.com]
>> Sent: Wednesday, 24 August, 2016 16:22
>> To: Jon Maloy <jon.ma...@ericsson.com>
>> Cc: tipc-discussion@lists.sourceforge.net
>> Subject: Re: [tipc-discussion] BC rcv link acked stuck after receiving a
>> named with a BC ACK of 0
>>
>> Hi Jon,
>>
>> To clarify my previous email regarding the behaviour observed,
>>
>> What happens over time:
>> + remove bc peer
>> ...
>> some time until peer rejoins
>> ...
>> + add bc peer
>> + tipc_link_bc_ack_rcv
>>   link is up = false, node is up = false
>>   (this gets called a number of times until both the link and node are up)
>>
>> + tipc_link_bc_ack_rcv
>>   l->acked set to valid ack
>> ...
>> + tipc_rcv - usr 5 or 11, bc_ack = 0
>> + tipc_bcast_ack_rcv
>> + tipc_link_bc_ack_rcv
>>   sets l->acked to 0
>>
>> Regards,
>> JT
>>
>> On Thu, Aug 25, 2016 at 8:06 AM, John THompson <thompa....@gmail.com> wrote:
>> Hi Jon,
>>
>> It is a similar problem in terms of what happens to the bc link. I do
>> have that patch applied.
>>
>> I have added debug through the remove bc peer and various other functions
>> and the setting of the acked field to 0 is occurring when processing a
>> packet from named (msg user 11) or BCAST protocol (msg user 5).
>>
>> Thanks,
>> JT
>>
>> On Wed, Aug 24, 2016 at 10:23 PM, Jon Maloy <jon.ma...@ericsson.com> wrote:
>> Hi John,
>> This sounds a lot like the problem I tried to fix in
>> a71eb720355c2 ("tipc: ensure correct broadcast send buffer release when
>> peer is lost")
>> So, either that patch is not present in your kernel (if it is 4.7 it is
>> supposed to be) or my solution somehow hasn't solved the problem.
>> Can you confirm that the patch is there?
>>
>> BR
>> ///jon
>>
>> -----Original Message-----
>>> From: John THompson [mailto:thompa....@gmail.com]
>>> Sent: Tuesday, 23 August, 2016 20:21
>>> To: tipc-discussion@lists.sourceforge.net
>>> Subject: [tipc-discussion] BC rcv link acked stuck after receiving a
>>> named with a BC ACK of 0
>>>
>>> Hi,
>>>
>>> I am running TIPC 2.0 on Linux 4.7 on a cluster of Freescale QorIQ P2040
>>> and Marvell Armada-XP processors. There are 10 nodes in all.
>>> When 2 of the nodes are removed, then rejoin the cluster we sometimes see
>>> behaviour where the TIPC BC link gets stuck and eventually the backlog
>>> gets full. The 2 nodes that are joining have already connected together.
>>>
>>> The problem occurs when the BC link sndnxt value is greater than 32k on
>>> one of the nodes (call it NODE1) and 2 nodes begin to join.
>>> When NODE1 detects the joining nodes, at some early point after they have
>>> joined, NODE1 receives a NAMED publication with a BC ack of 0. NODE1
>>> immediately sets its BC acked to 0 and tries to ack packets off the
>>> transmq. No packets get removed as the new ack value doesn't match any
>>> of the packets that need to be acked.
>>>
>>> The problem doesn't recover because in tipc_link_bc_ack_rcv it ensures
>>> that the new acked value is more than the old acked value. When the
>>> values are greater than 32k apart this means that 0 can indeed be greater
>>> than 40,000. So when new packets are processed the new BC ack value is
>>> considered less than the stored one (0).
>>>
>>> This results in the BC transmq getting full and the backlog getting full,
>>> thereby preventing communication over the BC link between nodes.
>>>
>>> I am persisting in trying to work out why the NAMED publication has a BC
>>> ack of 0, which I think is the root cause of the problem.
>>>
>>> I think that tipc_link_bc_ack_rcv needs an extra check to ensure that an
>>> invalid BC ack value cannot be set. I am defining invalid as being an
>>> acked value that is greater than the current BC acked value + the link
>>> window.
>>>
>>> Thanks,
>>> John
>>> ------------------------------------------------------------------------------
>>> _______________________________________________
>>> tipc-discussion mailing list
>>> tipc-discussion@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
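
The wraparound described in the original report can be reproduced outside
the kernel. The following is a minimal sketch in C, assuming 16-bit sequence
numbers compared in modulo-65536 arithmetic. The helper names (seq_less_eq,
seq_more, ack_in_window, bc_ack_update) and the window value used in the
usage note are hypothetical and are not taken from the TIPC sources. It shows
why an ack of 0 compares as newer than 40,000, and how a window check of the
kind proposed in the report might reject such a value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if 'left' is at or before 'right' when both
 * are 16-bit sequence numbers that wrap around at 65536. */
bool seq_less_eq(uint16_t left, uint16_t right)
{
	return (uint16_t)(right - left) < 32768u;
}

/* Hypothetical helper: true if 'left' is strictly after 'right'. */
bool seq_more(uint16_t left, uint16_t right)
{
	return !seq_less_eq(left, right);
}

/* Sketch of the proposed extra check: an incoming ack is plausible only
 * if it lies no more than 'window' packets ahead of the stored value. */
bool ack_in_window(uint16_t acked, uint16_t current, uint16_t window)
{
	return (uint16_t)(acked - current) <= window;
}

/* Sketch of an ack update that applies both tests. Returns true and
 * updates *stored only if the new ack is newer and within the window. */
bool bc_ack_update(uint16_t *stored, uint16_t acked, uint16_t window)
{
	if (!seq_more(acked, *stored))
		return false;	/* not newer than what we have: ignore */
	if (!ack_in_window(acked, *stored, window))
		return false;	/* implausibly far ahead: reject */
	*stored = acked;
	return true;
}
```

With these definitions, seq_more(0, 40000) is true, so an unguarded update
would store 0, and seq_more(40001, 0) is false, so later valid acks would be
ignored. With a stored value of 40000 and an assumed window of 50,
bc_ack_update rejects an incoming ack of 0 and accepts 40010.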