Hi Jon,

To clarify my previous email regarding the observed behaviour, here is
what happens over time:
+ remove bc peer
...
some time until peer rejoins
...
+ add bc peer
+ tipc_link_bc_ack_rcv
  link is up = false, node is up = false
  (this gets called a number of times until both the link and node are up)

+ tipc_link_bc_ack_rcv
  l->acked set to valid ack
...
+ tipc_rcv - usr 5 or 11, bc_ack = 0
  + tipc_bcast_ack_rcv
    + tipc_link_bc_ack_rcv
      sets l->acked to 0

Regards,
JT


On Thu, Aug 25, 2016 at 8:06 AM, John THompson <thompa....@gmail.com> wrote:

> Hi Jon,
>
> It is a similar problem in terms of what happens to the bc link.  I do
> have that patch applied.
>
> I have added debug through the remove bc peer path and various other
> functions, and the setting of the acked field to 0 occurs when
> processing a packet from NAMED (msg user 11) or BCAST protocol (msg
> user 5).
>
> Thanks,
> JT
>
> On Wed, Aug 24, 2016 at 10:23 PM, Jon Maloy <jon.ma...@ericsson.com>
> wrote:
>
>> Hi John,
>> This sounds a lot like the problem I tried to fix in
>> a71eb720355c2 ("tipc: ensure correct broadcast send buffer release when
>> peer is lost")
>> So, either that patch is not present in your kernel (if it is 4.7 it is
>> supposed to be) or my solution somehow hasn't solved the problem.
>> Can you confirm that the patch is there?
>>
>> BR
>> ///jon
>>
>> > -----Original Message-----
>> > From: John THompson [mailto:thompa....@gmail.com]
>> > Sent: Tuesday, 23 August, 2016 20:21
>> > To: tipc-discussion@lists.sourceforge.net
>> > Subject: [tipc-discussion] BC rcv link acked stuck after receiving a
>> > named with a BC ACK of 0
>> >
>> > Hi,
>> >
>> > I am running TIPC 2.0 on Linux 4.7 on a cluster of Freescale QorIQ P2040
>> > and Marvell Armada-XP processors.  There are 10 nodes in all.
>> > When 2 of the nodes are removed and then rejoin the cluster, we
>> > sometimes see behaviour where the TIPC BC link gets stuck and
>> > eventually the backlog gets full.  The 2 nodes that are joining have
>> > already connected together.
>> >
>> > The problem occurs when the BC link sndnxt value is greater than 32k
>> > on one of the nodes (call it NODE1) and 2 nodes begin to join.
>> > When NODE1 detects the joining nodes, at some early point after they
>> > have joined, NODE1 receives a NAMED publication with a BC ack of 0.
>> > NODE1 immediately sets its BC acked to 0 and tries to ack packets off
>> > the transmq.  No packets get removed, as the new ack value doesn't
>> > match any of the packets that need to be acked.
>> >
>> > The problem doesn't recover, because tipc_link_bc_ack_rcv only checks
>> > that the new acked value is more than the old acked value.  When the
>> > values are more than 32k apart, this modular comparison means that 0
>> > can indeed be "greater" than 40,000.  So when new packets are
>> > processed, the new BC ack value is considered less than the stored
>> > one (0).
>> >
>> > This results in the BC transmq getting full and the backlog getting
>> > full, thereby preventing communication over the BC link between nodes.
>> >
>> > I am persisting in trying to work out why the NAMED publication has a BC
>> > ack of 0, which I think is the root cause of the problem.
>> >
>> > I think that tipc_link_bc_ack_rcv needs an extra check to ensure that
>> > an invalid BC ack value cannot be set.  I am defining invalid as an
>> > acked value that is greater than the current BC acked value plus the
>> > link window.
>> >
>> > Thanks,
>> > John
>> > ------------------------------------------------------------------------------
>> > _______________________________________________
>> > tipc-discussion mailing list
>> > tipc-discussion@lists.sourceforge.net
>> > https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>>
>
>
