Hi,

I am running TIPC 2.0 on Linux 4.7 on a cluster of Freescale QorIQ P2040
and Marvell Armada-XP processors.  There are 10 nodes in all.
When 2 of the nodes are removed and then rejoin the cluster, we sometimes
see behaviour where the TIPC BC link gets stuck and eventually the backlog
fills up.  The 2 nodes that are joining have already connected to each other.

The problem occurs when the BC link sndnxt value is greater than 32k on one
of the nodes (call it NODE1) and 2 nodes begin to join.
When NODE1 detects the joining nodes, at some early point after they have
joined, NODE1 receives a NAMED publication with a BC ack of 0.  NODE1
immediately sets its BC acked to 0 and tries to ack packets off the
transmq.  No packets get removed as the new ack value doesn't match any of
the packets that need to be acked.

The problem doesn't recover because tipc_link_bc_ack_rcv only accepts a new
acked value that is more than the old acked value.  The comparison is done
in 16-bit sequence number space, so when two values are more than 32k apart
the ordering wraps and 0 is indeed considered greater than 40,000.  So when
new packets are processed, their BC ack value (e.g. 40,001) is considered
less than the stored one (0) and is ignored.
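
To illustrate the wraparound, here is a minimal standalone sketch of a 16-bit
modular comparison.  The names seq_less_eq and seq_more are my own for this
example; they are modelled on the style of comparison I believe the link code
uses, not copied from the kernel source.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative helpers (hypothetical names, not the kernel's own code).
 * 'left' counts as at or before 'right' when the forward distance from
 * left to right, taken modulo 2^16, is less than half the number space.
 */
static bool seq_less_eq(uint16_t left, uint16_t right)
{
	return (uint16_t)(right - left) < 32768u;
}

/* 'left' is strictly after 'right' in modular order. */
static bool seq_more(uint16_t left, uint16_t right)
{
	return !seq_less_eq(left, right);
}
```

With these helpers, seq_more(0, 40000) is true, so a bogus ack of 0 is
accepted over a stored value of 40,000.  Afterwards seq_more(40001, 0) is
false, so the genuine acks that follow are rejected.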

This results in the BC transmq getting full and the backlog getting full,
thereby preventing communication over the BC link between nodes.

I am still trying to work out why the NAMED publication has a BC ack of 0,
which I think is the root cause of the problem.

I think that tipc_link_bc_ack_rcv needs an extra check to ensure that an
invalid BC ack value cannot be set.  I am defining invalid as being an
acked value that is greater than the current BC acked value + the link
window.
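
A rough sketch of the kind of check I have in mind is below.  The function
name bc_ack_is_valid and its parameters are hypothetical, and I am assuming
the window is available as a simple packet count; this is not a patch against
the real tipc_link_bc_ack_rcv.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical validity check, not the kernel's code.
 * An ack is treated as valid only if it lies no more than 'window'
 * packets ahead of the currently stored acked value, measured as the
 * forward distance in 16-bit sequence space.
 */
static bool bc_ack_is_valid(uint16_t cur_acked, uint16_t new_ack,
			    uint16_t window)
{
	uint16_t ahead = (uint16_t)(new_ack - cur_acked);

	return ahead <= window;
}
```

With cur_acked = 40000 and a window of 50, an ack of 0 is 25,536 ahead and
is rejected, while an ack of 40,010 is accepted.  Note that this form also
rejects acks that are behind the stored value, since they appear as a very
large forward distance.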

Thanks,
John