Hi Jon,

The effect on the BC link is similar, but I do have that patch applied.

I have added debug output to the remove bc peer function and various other
functions. The acked field is being set to 0 while processing a packet from
named (msg user 11) or the BCAST protocol (msg user 5).

Thanks,
JT

On Wed, Aug 24, 2016 at 10:23 PM, Jon Maloy <jon.ma...@ericsson.com> wrote:

> Hi John,
> This sounds a lot like the problem I tried to fix in
> a71eb720355c2 ("tipc: ensure correct broadcast send buffer release when
> peer is lost").
> So, either that patch is not present in your kernel (if it is 4.7, it is
> supposed to be) or my solution somehow hasn't solved the problem.
> Can you confirm that the patch is there?
>
> BR
> ///jon
>
> > -----Original Message-----
> > From: John Thompson [mailto:thompa....@gmail.com]
> > Sent: Tuesday, 23 August, 2016 20:21
> > To: tipc-discussion@lists.sourceforge.net
> > Subject: [tipc-discussion] BC rcv link acked stuck after receiving a named with a BC ACK of 0
> >
> > Hi,
> >
> > I am running TIPC 2.0 on Linux 4.7 on a cluster of Freescale QorIQ P2040
> > and Marvell Armada-XP processors.  There are 10 nodes in all.
> > When 2 of the nodes are removed and then rejoin the cluster, we sometimes
> > see behaviour where the TIPC BC link gets stuck and eventually the backlog
> > fills up.  The 2 joining nodes have already connected to each other.
> >
> > The problem occurs when the BC link sndnxt value is greater than 32k on
> > one of the nodes (call it NODE1) and 2 nodes begin to join.
> > At some early point after NODE1 detects the joining nodes, it receives a
> > NAMED publication with a BC ack of 0.  NODE1 immediately sets its BC acked
> > to 0 and tries to ack packets off the transmq.  No packets get removed,
> > because the new ack value doesn't match any of the packets that need to
> > be acked.
> >
> > The link doesn't recover because tipc_link_bc_ack_rcv only accepts a new
> > acked value that is more than the old acked value.  Sequence numbers are
> > compared with 16-bit wraparound, so when two values are more than 32k
> > apart, 0 can indeed be greater than 40,000.  From then on, when new
> > packets are processed, their BC ack value is considered less than the
> > stored one (0) and is not applied.
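
The wraparound comparison described above can be sketched as follows. This is a minimal Python model, not the kernel code: it assumes a 16-bit sequence space, and the helper names (less_eq, more, bc_ack_rcv) are illustrative stand-ins for whatever tipc_link_bc_ack_rcv actually uses.

```python
SEQ_MOD = 1 << 16       # assumption: sequence numbers are 16-bit
HALF = SEQ_MOD // 2     # the "32k" threshold mentioned in the report


def less_eq(left, right):
    """True if left <= right under modulo-65536 ordering."""
    return ((right - left) % SEQ_MOD) < HALF


def more(left, right):
    """True if left is strictly 'newer' than right under the same ordering."""
    return not less_eq(left, right)


def bc_ack_rcv(stored_acked, new_acked):
    """Hypothetical model of the guard: accept new_acked only if it is 'more'."""
    if more(new_acked, stored_acked):
        return new_acked
    return stored_acked


# NODE1 has acked up to 40000, then a message arrives carrying BC ack 0.
acked = bc_ack_rcv(40000, 0)          # 0 is 'more' than 40000, so it is accepted
# Later, genuine acks near 40000 arrive but are now seen as 'less' than 0.
acked_after = bc_ack_rcv(acked, 40001)
print(acked, acked_after)
```

In this model the stored value stays at 0 once the bogus ack has been accepted, which matches the stuck behaviour described in the report.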
> >
> > This results in the BC transmq getting full and the backlog getting full,
> > thereby preventing communication over the BC link between nodes.
> >
> > I am still trying to work out why the NAMED publication has a BC ack
> > of 0, which I think is the root cause of the problem.
> >
> > I think that tipc_link_bc_ack_rcv needs an extra check to ensure that an
> > invalid BC ack value cannot be set.  I am defining invalid as being an
> > acked value that is greater than the current BC acked value + the link
> > window.
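
One way to read the proposed extra check is sketched below. It is only an illustration under stated assumptions: the sequence space is taken to be 16-bit, guarded_bc_ack is a hypothetical name, and the window value of 50 is made up for the example rather than taken from any real link configuration.

```python
SEQ_MOD = 1 << 16   # assumption: sequence numbers are 16-bit


def guarded_bc_ack(stored_acked, new_acked, window):
    """Accept new_acked only if it advances stored_acked by 1..window packets.

    Hypothetical guard: an ack further ahead than stored_acked + window is
    treated as invalid and ignored, as is an ack that does not advance at all.
    """
    advance = (new_acked - stored_acked) % SEQ_MOD
    if advance == 0 or advance > window:
        return stored_acked
    return new_acked


WINDOW = 50  # illustrative value only

print(guarded_bc_ack(40000, 0, WINDOW))      # bogus ack of 0 is rejected
print(guarded_bc_ack(40000, 40010, WINDOW))  # normal ack is accepted
print(guarded_bc_ack(65530, 4, WINDOW))      # ack across the wrap is accepted
```

Because an older ack shows up as a very large forward distance in modulo arithmetic, the same bound also rejects stale acks, as long as the window is well below 32k.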
> >
> > Thanks,
> > John
> > ------------------------------------------------------------------------------
> > _______________________________________________
> > tipc-discussion mailing list
> > tipc-discussion@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>