Re: [tipc-discussion] BC rcv link acked stuck after receiving a named with a BC ACK of 0

John THompson Tue, 06 Sep 2016 17:48:59 -0700

Hi Jon,

My comments are below.


JT

On Wed, Sep 7, 2016 at 12:48 AM, Jon Maloy <[email protected]> wrote:

> Hi John,
>
> See below.
>
>
>
> ///jon
>
>
>
>
>
> *From:* John THompson [mailto:[email protected]]
> *Sent:* Tuesday, 06 September, 2016 00:14
> *To:* Jon Maloy <[email protected]>
> *Cc:* Jon Maloy <[email protected]>; tipc-discussion@lists.
> sourceforge.net
> *Subject:* Re: [tipc-discussion] BC rcv link acked stuck after receiving
> a named with a BC ACK of 0
>
>
>
> Hi Jon,
>
>
>
> The packet I see the error happening on is when receiving a usr 11
>
> (NAME_DISTRIBUTOR) over the unicast link.
>
> The reception of this packet is happening interleaved with processing
>
> a packet (or packets) on the BC link that has brought the peer up.
>
> The BC link packet processing has the tipc_bcast_lock and the unicast
>
> pkt processing cannot get the bcast lock for a while.
>
> When it can get the lock it processes the BC ack == 0 from the
> NAME_DISTRIBUTOR
>
> packet and sets the acked field on the BC link to 0.
>
>
>
> The debug / call trace below is me trying to show from the debug I
> captured what happens.
>
> If I add debug for each pkt the problem doesn't reproduce.
>
>
>
>
>
> tipc_rcv 1.1.5:vcs_mgmt-1.1.18:vcs_mgmt bc ack rcv 0 uc seq 3 ack 0 user
> 11 type 0
>
>   + calls tipc_bcast_ack_rcv
>
> tipc_rcv
>
>   + tipc_bcast_ack_rcv
>
>     + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack 53574 - can't ack as
> link not up 1 or peer not up 1
>
>
>
> What kind of packet was this?
>
This is a LINK_PROTOCOL STATE msg received on the unicast link
Dbg of the packet:
broadcast-link-5-18 bc ack 0 orig_l 1.1.5:vcs_mgmt-1.1.18:vcs_mgmt (hdr usr
7 type 0 bc ack 0 seq 32771 ack 0)- can't ack as link not up 0 or peer not
up 1



>
> tipc_rcv
>
>   + tipc_bcast_ack_rcv
>
>     + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack 53574 - can't ack
> as link not up 0 or peer not up 1
>
>
>
> And this?
>
This is a LINK_PROTOCOL STATE msg received on the unicast link, with ack ==
1, therefore this packet will be the first pkt that sets bc_peer_is_up to
true.
Dbg of the packet:
broadcast-link-5-18 bc ack 43402 orig_l 1.1.5:vcs_mgmt-1.1.18:vcs_mgmt (hdr
usr 7 type 0 bc ack 43402 seq 32771 ack 1)- can't ack as link not up 0 or
peer not up 1


>
>
> ===
>
> Somewhere at this point bc_peer_is_up gets set
>
> ===
>
> tipc_rcv
>
>   + tipc_bcast_ack_rcv
>
>     + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack - acked (53574) less
> than it was previously (53574)
>
> tipc_rcv
>
>   + tipc_bcast_ack_rcv
>
>     + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack - acked (53574) less
> than it was previously (53574)
>
>
>
>   + from tipc_rcv on unicast link
>
>     + tipc_bcast_ack_rcv Going to set BC ACK outside window, new 0 old
> 53574 win 200
>
>       - dump_stack
>
> CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: P           O    4.4.6-at1 #3
>
> Call Trace:
>
> [a3093a80] [806943b0] dump_stack+0x84/0xb0 (unreliable)
>
> [a3093a90] [c1507314] tipc_link_bc_ack_rcv+0x244/0x250 [tipc]
>
> [a3093ab0] [c1501b04] tipc_bcast_ack_rcv+0x74/0xd0 [tipc]
>
> [a3093ae0] [c1511a08] tipc_rcv+0x468/0xa30 [tipc]
>
> [a3093b80] [c150218c] tipc_bcast_stop+0xfc/0x7b0 [tipc]
>
> [a3093b90] [8050d6a8] __netif_receive_skb_core+0x468/0xa10
>
> [a3093c30] [80510b6c] netif_receive_skb_internal+0x3c/0xe0
>
> [a3093c60] [8064b2b8] br_handle_frame_finish+0x1d8/0x4d0
>
> [a3093cd0] [8064b7a0] br_handle_frame+0x1f0/0x330
>
> [a3093d20] [8050d738] __netif_receive_skb_core+0x4f8/0xa10
>
> [a3093dc0] [805119f0] process_backlog+0x90/0x140
>
> [a3093df0] [8051103c] net_rx_action+0x15c/0x320
>
> [a3093e50] [8002594c] __do_softirq+0x13c/0x250
>
> [a3093eb0] [80025ab0] run_ksoftirqd+0x50/0x80
>
> [a3093ec0] [800434c4] smpboot_thread_fn+0x1e4/0x1f0
>
> [a3093ef0] [8003fb38] kthread+0xc8/0xe0
>
> [a3093f40] [8000eed8] ret_from_kernel_thread+0x5c/0x64
>
>
>
> I am going to send in a patch that adds checking for a valid BC ack (being
> within the window size) to
>
> tipc_link_bc_ack_rcv.
>
>
>
> Not sure that is a good idea. Even if #0 happens to be within a valid
> range it is still invalid, and may lead to an inadvertent release of
> packets which are not ready to be released yet.  I’ll try to take a
> closer look at this today.
>
I thought that #0 was a valid value when the seqno wraps around?  There
doesn't appear to be any special handling for the wrap around case.  I
agree with your statement that it might lead to inadvertent release of
packets that aren't ready to be released.  Therefore I won't send it in.

JT




>
> ///jon
>
>
>
> Cheers,
>
> JT
>
>
>
> On Wed, Aug 31, 2016 at 10:57 PM, John THompson <[email protected]>
> wrote:
>
> Hi Jon,
>
>
>
> I have verified that the patch is included in my build.
>
> 2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization
> criteria”)
>
>
>
> I am trying to verify which packets are received when the problem occurs
> but I am having trouble getting the information out of my system at the
> moment.
>
>
>
> I will keep trying.
>
> Thanks,
>
> JT
>
>
>
>
>
> On Tue, Aug 30, 2016 at 6:20 PM, Jon Maloy <[email protected]> wrote:
>
>
>
> On 08/29/2016 06:48 PM, Jon Maloy wrote:
>
> Hi John,
> Sorry for my late answer; I was on vacation for a few days.
> It seems I gave you the wrong commit reference in my previous mail. The
> one I really meant was
> 2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization
> criteria”)
>
> This one explains why the first packets sometimes get an invalid ack
> number, but also remedies it, and I simply cannot see how an invalid ack #0
> can ever be accepted when this patch is applied.
> I see no reason why this patch shouldn’t also be present in you code, but
> just to make sure, can you confirm this?
>
> I am right now wondering if a retransmission is the problem:
> 1: we receive pkt #2 which contains ack #1, so we set bc_peer_is_up to
> true.
>
> Since only LINK_PROTO/STATE messages can cause bc_peer_is_up to go true,
> the likely sequence is rather
> 1: We receive a STATE message with unicast ack #1. This message should
> also contain a valid, with high probability non-zero, bc_ack. bc_peer_is_up
> is set to true.
> 2: We receive unicast pkt#1 (BCAST init or NAMED) which contains the
> invalid unicast ack #0. This one is now accepted.
>
> I believe this may happen, because STATE messages, contrary to data
> packets, are sent as TC_PRIO_CONTROL, and may sometimes bypass data
> messages, but I cannot see it happening as often and consistently as you
> seem to be observing it. Another possibility is that bc_ack in the received
> STATE message also is an invalid zero, although I cannot see how this can
> happen either.
>
> Regards
> ///jon
>
> 2: we receive pkt #1 retransmitted with ack #0. This now gets accepted,
> and we are in trouble.
>
> I’ll try to figure out a solution to this, but it may be possible for you
> to verify this first.
>
> BR
> ///jon
>
>
>
> From: John THompson [mailto:[email protected]]
> Sent: Wednesday, 24 August, 2016 16:22
> To: Jon Maloy <[email protected]>
> Cc: [email protected]
> Subject: Re: [tipc-discussion] BC rcv link acked stuck after receiving a
> named with a BC ACK of 0
>
> Hi Jon,
>
> To clarify my previous email regarding the behaviour observed,
>
> What happens over time:
> + remove bc peer
> ...
> some time until peer rejoins
> ...
> + add bc peer
> + tipc_link_bc_ack_rcv
>    link is up = false, node is up = false
>    (this gets called a number of times until both the link and node are up)
>
> + tipc_link_bc_ack_rcv
>    l->acked set to valid ack
> ...
> + tipc_rcv - usr 5 or 11, bc_ack = 0
>    + tipc_bcast_ack_rcv
>      + tipc_link_bc_ack_rcv
>        sets l->acked to 0
>
> Regards,
> JT
>
> On Thu, Aug 25, 2016 at 8:06 AM, John THompson <[email protected]
> <mailto:[email protected]>> wrote:
> Hi Jon,
>
> It is a similar problem in terms of what happens to the bc link.  I do
> have that patch applied.
>
> I have added debug through the remove bc peer and various other functions
> and the setting of the acked field to 0 is occurring when processing a
> packet from named (msg user 11) or BCAST protocol (msg user 5).
>
> Thanks,
> JT
>
> On Wed, Aug 24, 2016 at 10:23 PM, Jon Maloy <[email protected]<
> mailto:[email protected]>> wrote:
> Hi John,
> This sounds a lot like the problem I tried to fix in
> a71eb720355c2 ("tipc: ensure correct broadcast send buffer release when
> peer is lost")
> So, either that patch is not present in your kernel (if it is 4.7 it is
> supposed to be) or my solution somehow hasn't solved the problem.
> Can you confirm that the patch is there?
>
> BR
> ///jon
>
> -----Original Message-----
> From: John THompson [mailto:[email protected]<mailto:
> [email protected]>]
> Sent: Tuesday, 23 August, 2016 20:21
>
> To: [email protected]<mailto:tipc-
> [email protected]>
> Subject: [tipc-discussion] BC rcv link acked stuck after receiving a named
> with a BC
> ACK of 0
>
> Hi,
>
> I am running TIPC 2.0 on Linux 4.7 on a cluster of Freescale QorIQ P2040
> and Marvell Armada-XP processors.  There are 10 nodes in all.
> When 2 of the nodes are removed, then rejoin the cluster we sometimes see
> behaviour where the TIPC BC link gets stuck and eventually the backlog gets
> full.  the 2 nodes that are joining have already connected together.
>
> The problem occurs when the BC link sndnxt value is greater than 32k on one
> of the nodes (call it NODE1) and 2 nodes begin to join.
> When NODE1 detects the joining nodes, at some early point after they have
> joined, NODE1 receives a NAMED publication with a BC ack of 0.  NODE1
> immediately sets its BC acked to 0 and tries to ack packets off the
> transmq.  No packets get removed as the new ack value doesn't match any of
> the packets that need to be acked.
>
> The problem doesn't recover because in tipc_link_bc_ack_rcv it ensures that
> the new acked value is more than the old acked value.  When the values are
> greater than 32k apart this means that 0 can indeed be greater than
> 40,000.  So when new packets are processed the new BC ack value is
> considered less than the stored one (0).
>
> This results in the BC transmq getting full and the backlog getting full,
> thereby preventing communication over the BC link between nodes.
>
> I am persisting in trying to work out why the NAMED publication has a BC
> ack of 0, which I think is the root cause of the problem.
>
> I think that tipc_link_bc_ack_rcv needs an extra check to ensure that an
> invalid BC ack value cannot be set.  I am defining invalid as being an
> acked value that is greater than the current BC acked value + the link
> window.
>
> Thanks,
> John
> ------------------------------------------------------------
> ------------------
> _______________________________________________
> tipc-discussion mailing list
>
> [email protected]<mailto:tipc-
> [email protected]>
> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>
>
> ------------------------------------------------------------
> ------------------
> _______________________________________________
> tipc-discussion mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>
>
>
>
>
>
>
------------------------------------------------------------------------------
_______________________________________________
tipc-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/tipc-discussion

Re: [tipc-discussion] BC rcv link acked stuck after receiving a named with a BC ACK of 0

Reply via email to