Hi John,
Sorry for my late answer; I was on vacation for a few days.
It seems I gave you the wrong commit reference in my previous mail. The one I 
really meant was
2d18ac4ba7454a426047 (“tipc: extend broadcast link initialization criteria”)

This one explains why the first packets sometimes get an invalid ack number, 
but also remedies it, and I simply cannot see how an invalid ack #0 can ever be 
accepted when this patch is applied.
I see no reason why this patch shouldn’t also be present in your code, but just 
to make sure, can you confirm this?

I am right now wondering if a retransmission is the problem:
1: we receive pkt #2 which contains ack #1, so we set bc_peer_is_up to true.
2: we receive pkt #1 retransmitted with ack #0. This now gets accepted, and we 
are in trouble.
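
For illustration, here is a minimal standalone sketch of why step 2 is fatal
once the sequence numbers have passed 32k, as described in the original report
quoted further down. It assumes 16-bit sequence numbers compared modulo 2^16
and an "accept only if newer" rule for the ack. The names seq_less_eq(),
seq_more(), struct bc_snd_state and bc_ack_accept() are made up for this
example; this is a model of the behaviour, not the kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if 'left' is at or before 'right' in 16-bit wrapping order. */
static bool seq_less_eq(uint16_t left, uint16_t right)
{
	return (uint16_t)(right - left) < 32768u;
}

/* True if 'left' is strictly after 'right' in 16-bit wrapping order. */
static bool seq_more(uint16_t left, uint16_t right)
{
	return !seq_less_eq(left, right);
}

/* Hypothetical stand-in for the sender-side broadcast ack state. */
struct bc_snd_state {
	uint16_t acked;
};

/* Assumed rule: store the new ack only if it compares as newer. */
static bool bc_ack_accept(struct bc_snd_state *s, uint16_t new_ack)
{
	if (!seq_more(new_ack, s->acked))
		return false;
	s->acked = new_ack;
	return true;
}
```

With acked at 40000, bc_ack_accept(&s, 0) succeeds and stores 0, because 0 is
more than 32k away and therefore compares as newer. A following ack of 40001
is then rejected as older than 0, which matches the stuck link in the report.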

I’ll try to figure out a solution to this, but perhaps you can verify this 
hypothesis first.

BR
///jon



From: John THompson [mailto:thompa....@gmail.com]
Sent: Wednesday, 24 August, 2016 16:22
To: Jon Maloy <jon.ma...@ericsson.com>
Cc: tipc-discussion@lists.sourceforge.net
Subject: Re: [tipc-discussion] BC rcv link acked stuck after receiving a named 
with a BC ACK of 0

Hi Jon,

To clarify my previous email regarding the behaviour observed, this is what
happens over time:
+ remove bc peer
...
some time until peer rejoins
...
+ add bc peer
+ tipc_link_bc_ack_rcv
  link is up = false, node is up = false
  (this gets called a number of times until both the link and node are up)

+ tipc_link_bc_ack_rcv
  l->acked set to valid ack
...
+ tipc_rcv - usr 5 or 11, bc_ack = 0
  + tipc_bcast_ack_rcv
    + tipc_link_bc_ack_rcv
      sets l->acked to 0
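
A guard at that last step could look roughly like the check proposed in the
original report (quoted below): reject any ack that lies beyond the current
acked value plus the link window. This is only a sketch, assuming 16-bit
sequence numbers that wrap; bc_ack_in_window() and its parameters are
hypothetical names, not existing TIPC code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical validity check: a new ack is plausible only if it lies in
 * the range (acked, acked + window], computed modulo 2^16.
 */
static bool bc_ack_in_window(uint16_t acked, uint16_t new_ack, uint16_t window)
{
	/* Forward distance from the stored ack to the new one, with wrap. */
	uint16_t delta = (uint16_t)(new_ack - acked);

	/* delta == 0 means nothing new is acked; delta > window is bogus. */
	return delta != 0 && delta <= window;
}
```

With acked at 40000 and a window of 50, an ack of 0 is 25536 steps ahead and is
rejected, while an ack of 40010 is accepted. The wrap case still works: from
65530, an ack of 4 is 10 steps ahead and passes.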

Regards,
JT


On Thu, Aug 25, 2016 at 8:06 AM, John THompson 
<thompa....@gmail.com> wrote:
Hi Jon,

It is a similar problem in terms of what happens to the bc link.  I do have 
that patch applied.

I have added debug output to the remove bc peer path and various other 
functions. The acked field is being set to 0 when processing a packet from 
named (msg user 11) or the BCAST protocol (msg user 5).

Thanks,
JT

On Wed, Aug 24, 2016 at 10:23 PM, Jon Maloy 
<jon.ma...@ericsson.com> wrote:
Hi John,
This sounds a lot like the problem I tried to fix in
a71eb720355c2 ("tipc: ensure correct broadcast send buffer release when peer is 
lost")
So, either that patch is not present in your kernel (if it is 4.7 it is 
supposed to be) or my solution somehow hasn't solved the problem.
Can you confirm that the patch is there?

BR
///jon

> -----Original Message-----
> From: John THompson [mailto:thompa....@gmail.com]
> Sent: Tuesday, 23 August, 2016 20:21
> To: tipc-discussion@lists.sourceforge.net
> Subject: [tipc-discussion] BC rcv link acked stuck after receiving a named
> with a BC ACK of 0
>
> Hi,
>
> I am running TIPC 2.0 on Linux 4.7 on a cluster of Freescale QorIQ P2040
> and Marvell Armada-XP processors.  There are 10 nodes in all.
> When 2 of the nodes are removed, then rejoin the cluster we sometimes see
> behaviour where the TIPC BC link gets stuck and eventually the backlog gets
> full.  The 2 nodes that are joining have already connected together.
>
> The problem occurs when the BC link sndnxt value is greater than 32k on one
> of the nodes (call it NODE1) and 2 nodes begin to join.
> When NODE1 detects the joining nodes, at some early point after they have
> joined, NODE1 receives a NAMED publication with a BC ack of 0.  NODE1
> immediately sets its BC acked to 0 and tries to ack packets off the
> transmq.  No packets get removed as the new ack value doesn't match any of
> the packets that need to be acked.
>
> The problem doesn't recover because tipc_link_bc_ack_rcv only accepts a new
> acked value that is more than the old acked value.  The comparison wraps, so
> when the values are more than 32k apart, 0 can indeed be greater than
> 40,000.  So when new packets are processed the new BC ack value is
> considered less than the stored one (0).
>
> This results in the BC transmq getting full and the backlog getting full,
> thereby preventing communication over the BC link between nodes.
>
> I am persisting in trying to work out why the NAMED publication has a BC
> ack of 0, which I think is the root cause of the problem.
>
> I think that tipc_link_bc_ack_rcv needs an extra check to ensure that an
> invalid BC ack value cannot be set.  I am defining invalid as being an
> acked value that is greater than the current BC acked value + the link
> window.
>
> Thanks,
> John
> ------------------------------------------------------------------------------
> _______________________________________________
> tipc-discussion mailing list
> tipc-discussion@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/tipc-discussion

