On 08/29/2016 06:48 PM, Jon Maloy wrote:
> Hi John,
> Sorry for my late answer; I was on vacation for a few days.
> It seems I gave you the wrong commit reference in my previous mail. The one I 
> really meant was
> 2d18ac4ba7454a426047 (“tipc: extend broadcast link initialization criteria”)
>
> This one explains why the first packets sometimes get an invalid ack number, 
> but also remedies it, and I simply cannot see how an invalid ack #0 can ever 
> be accepted when this patch is applied.
> I see no reason why this patch shouldn’t also be present in your code, but 
> just to make sure, can you confirm this?
>
> I am right now wondering if a retransmission is the problem:
> 1: we receive pkt #2 which contains ack #1, so we set bc_peer_is_up to true.
Since only LINK_PROTO/STATE messages can cause bc_peer_is_up to go true, 
the likely sequence is rather
1: We receive a STATE message with unicast ack #1. This message should 
also contain a valid, and most likely non-zero, bc_ack. 
bc_peer_is_up is set to true.
2: We receive unicast pkt#1 (BCAST init or NAMED) which contains the 
invalid unicast ack #0. This one is now accepted.

I believe this may happen because STATE messages, unlike data 
packets, are sent as TC_PRIO_CONTROL and may sometimes bypass data 
messages, but I cannot see it happening as often and as consistently as 
you seem to be observing it. Another possibility is that bc_ack in the 
received STATE message is also an invalid zero, although I cannot see 
how this can happen either.
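
The two arrival orders can be compared with a small toy model. This is only a sketch and not the kernel code: it assumes 16-bit sequence numbers (consistent with the 32k threshold in the original report quoted below), it assumes the first STATE message both opens the gate and supplies a valid bc_ack, and every name except bc_peer_is_up and bc_ack is hypothetical. The value 40000 is just the example figure from the report.

```python
# Toy model of the reordering scenario; not the actual TIPC implementation.

def more(left, right):
    """True if 16-bit sequence number `left` is ahead of `right` (mod 65536)."""
    return ((right - left) & 0xFFFF) >= 32768


class BcRcvLink:
    """Hypothetical stand-in for the broadcast receive link state."""

    def __init__(self):
        self.bc_peer_is_up = False
        self.acked = 0

    def rcv(self, kind, bc_ack):
        if kind == "STATE" and not self.bc_peer_is_up:
            # Assumption: the first STATE message carries a valid bc_ack
            # and is what sets bc_peer_is_up.
            self.bc_peer_is_up = True
            self.acked = bc_ack
            return
        if not self.bc_peer_is_up:
            return  # acks are ignored until the peer is considered up
        if more(bc_ack, self.acked):
            self.acked = bc_ack


def final_acked(messages):
    """Feed (kind, bc_ack) pairs to a fresh link and return its acked value."""
    link = BcRcvLink()
    for kind, bc_ack in messages:
        link.rcv(kind, bc_ack)
    return link.acked


if __name__ == "__main__":
    in_order = [("DATA", 0), ("STATE", 40000)]   # pkt #1 first, then STATE
    bypassed = [("STATE", 40000), ("DATA", 0)]   # STATE overtakes pkt #1
    print(final_acked(in_order))   # 40000: the zero ack was ignored
    print(final_acked(bypassed))   # 0: the zero ack was accepted
```

In this model the zero ack is harmless only while it arrives before the STATE message; once the STATE message gets ahead of it, the stored value is overwritten.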

Regards
///jon

> 2: we receive pkt #1 retransmitted with ack #0. This now gets accepted, and 
> we are in trouble.
>
> I’ll try to figure out a solution to this, but it may be possible for you to 
> verify this first.
>
> BR
> ///jon
>
>
>
> From: John THompson [mailto:thompa....@gmail.com]
> Sent: Wednesday, 24 August, 2016 16:22
> To: Jon Maloy <jon.ma...@ericsson.com>
> Cc: tipc-discussion@lists.sourceforge.net
> Subject: Re: [tipc-discussion] BC rcv link acked stuck after receiving a 
> named with a BC ACK of 0
>
> Hi Jon,
>
> To clarify my previous email regarding the behaviour observed,
>
> What happens over time:
> + remove bc peer
> ...
> some time until peer rejoins
> ...
> + add bc peer
> + tipc_link_bc_ack_rcv
>    link is up = false, node is up = false
>    (this gets called a number of times until both the link and node are up)
>
> + tipc_link_bc_ack_rcv
>    l->acked set to valid ack
> ...
> + tipc_rcv - usr 5 or 11, bc_ack = 0
>    + tipc_bcast_ack_rcv
>      + tipc_link_bc_ack_rcv
>        sets l->acked to 0
>
> Regards,
> JT
>
>
> On Thu, Aug 25, 2016 at 8:06 AM, John THompson <thompa....@gmail.com> wrote:
> Hi Jon,
>
> It is a similar problem in terms of what happens to the bc link.  I do have 
> that patch applied.
>
> I have added debug through the remove bc peer and various other functions and 
> the setting of the acked field to 0 is occurring when processing a packet 
> from named (msg user 11) or BCAST protocol (msg user 5).
>
> Thanks,
> JT
>
> On Wed, Aug 24, 2016 at 10:23 PM, Jon Maloy <jon.ma...@ericsson.com> wrote:
> Hi John,
> This sounds a lot like the problem I tried to fix in
> a71eb720355c2 ("tipc: ensure correct broadcast send buffer release when peer 
> is lost")
> So, either that patch is not present in your kernel (if it is 4.7 it is 
> supposed to be) or my solution somehow hasn't solved the problem.
> Can you confirm that the patch is there?
>
> BR
> ///jon
>
>> -----Original Message-----
>> From: John THompson [mailto:thompa....@gmail.com]
>> Sent: Tuesday, 23 August, 2016 20:21
>> To: tipc-discussion@lists.sourceforge.net
>> Subject: [tipc-discussion] BC rcv link acked stuck after receiving a named 
>> with a BC ACK of 0
>>
>> Hi,
>>
>> I am running TIPC 2.0 on Linux 4.7 on a cluster of Freescale QorIQ P2040
>> and Marvell Armada-XP processors.  There are 10 nodes in all.
>> When 2 of the nodes are removed and then rejoin the cluster, we sometimes
>> see behaviour where the TIPC BC link gets stuck and eventually the backlog
>> gets full.  The 2 nodes that are joining have already connected together.
>>
>> The problem occurs when the BC link sndnxt value is greater than 32k on one
>> of the nodes (call it NODE1) and 2 nodes begin to join.
>> When NODE1 detects the joining nodes, at some early point after they have
>> joined, NODE1 receives a NAMED publication with a BC ack of 0.  NODE1
>> immediately sets its BC acked to 0 and tries to ack packets off the
>> transmq.  No packets get removed as the new ack value doesn't match any of
>> the packets that need to be acked.
>>
>> The problem doesn't recover because tipc_link_bc_ack_rcv only accepts a
>> new acked value that is more than the old acked value.  When the values
>> are more than 32k apart, sequence number wraparound means that 0 can
>> indeed be considered greater than 40,000.  So when new packets are
>> processed, their BC ack value is considered less than the stored one (0)
>> and is not accepted.
>>
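
The wraparound effect described above can be reproduced with a short model. It is a sketch that assumes 16-bit sequence numbers compared modulo 65536, which matches the 32k threshold mentioned in the report; the function names are hypothetical and this is not the kernel source.

```python
# Model of "new ack must be ahead of the stored ack" with 16-bit wraparound.

def less_eq(left, right):
    """True if `left` is at or behind `right` within half the 16-bit space."""
    return ((right - left) & 0xFFFF) < 32768


def more(left, right):
    """True if `left` is considered ahead of `right`."""
    return not less_eq(left, right)


def bc_ack_rcv(stored, acked):
    """Return the new stored ack: advance only if `acked` is ahead of `stored`."""
    return acked if more(acked, stored) else stored


if __name__ == "__main__":
    stored = 40000
    stored = bc_ack_rcv(stored, 0)        # the bogus zero ack arrives
    print(stored)                         # 0: zero counts as "ahead" of 40000
    for ack in (40001, 40002, 40003):     # later, valid acks arrive
        stored = bc_ack_rcv(stored, ack)
    print(stored)                         # 0: the valid acks are all rejected
```

Once the stored value has jumped to 0, every valid ack near 40000 is more than 32k away in the other direction and is treated as old.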
>> This results in the BC transmq getting full and the backlog getting full,
>> thereby preventing communication over the BC link between nodes.
>>
>> I am persisting in trying to work out why the NAMED publication has a BC
>> ack of 0, which I think is the root cause of the problem.
>>
>> I think that tipc_link_bc_ack_rcv needs an extra check to ensure that an
>> invalid BC ack value cannot be set.  I am defining invalid as being an
>> acked value that is greater than the current BC acked value + the link
>> window.
>>
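
The extra check suggested above could look roughly like the following. This is only a sketch of the idea, not a tested patch: it assumes 16-bit sequence numbers, the function names are hypothetical, and the window size of 50 is an illustrative value rather than one taken from this thread.

```python
# Sketch of a window-bounded ack check; not the actual TIPC code.

def ack_is_plausible(stored, acked, window):
    """Accept `acked` only if it lies in (stored, stored + window] mod 65536."""
    distance = (acked - stored) & 0xFFFF
    return 0 < distance <= window


def bc_ack_rcv(stored, acked, window):
    """Return the new stored ack, ignoring values outside the window."""
    return acked if ack_is_plausible(stored, acked, window) else stored


if __name__ == "__main__":
    window = 50                               # illustrative value only
    print(bc_ack_rcv(40000, 0, window))       # 40000: zero ack is rejected
    print(bc_ack_rcv(40000, 40010, window))   # 40010: normal progress
    print(bc_ack_rcv(65530, 5, window))       # 5: progress across the wrap
```

Bounding the accepted range by the window keeps legitimate progress across the 65535 to 0 wrap working while rejecting a value that is tens of thousands of packets away.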
>> Thanks,
>> John
>> ------------------------------------------------------------------------------
>> _______________________________________________
>> tipc-discussion mailing list
>> tipc-discussion@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>