Re: [tipc-discussion] BC rcv link acked stuck after receiving a named with a BC ACK of 0

2016-09-07 Thread Jon Maloy


From: John THompson [mailto:thompa@gmail.com]
Sent: Tuesday, 06 September, 2016 20:48
To: Jon Maloy 
Cc: Jon Maloy ; tipc-discussion@lists.sourceforge.net
Subject: Re: [tipc-discussion] BC rcv link acked stuck after receiving a named 
with a BC ACK of 0

Hi Jon,

My comments are below.

JT

On Wed, Sep 7, 2016 at 12:48 AM, Jon Maloy 
> wrote:

[…]

Not sure that is a good idea. Even if #0 happens to be within a valid range it 
is still invalid, and may lead to an inadvertent release of packets which are 
not ready to be released yet.  I’ll try to take a closer look at this today.
I thought that #0 was a valid value when the seqno wraps around?  There doesn't 
appear to be any special handling for the wrap around case.  I agree with your 
statement that it might lead to inadvertent release of packets that aren't 
ready to be released.  Therefore I won't send it in.

There is nothing wrong with the value 0 as such, it may be as valid as any 
other value. But any values that is not directly or indirectly based on the one 
received in the received broadcast init message is per definition invalid, and 
may wreak havoc in the send queue.

///jon


JT




///jon

Cheers,
JT

On Wed, Aug 31, 2016 at 10:57 PM, John THompson 
> wrote:
Hi Jon,

I have verified that the patch is included in my build.
2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization criteria”)

I am trying to verify which packets are received when the problem occurs but I 
am having trouble getting the information out of my system at the moment.

I will keep trying.
Thanks,
JT


On Tue, Aug 30, 2016 at 6:20 PM, Jon Maloy 
> wrote:


On 08/29/2016 06:48 PM, Jon Maloy wrote:
Hi John,
Sorry for my late answer; I was on vacation for a few days.
It seems I gave you the wrong commit reference in my previous mail. The one I 
really meant was
2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization criteria”)

This one explains why the first packets sometimes get an invalid ack number, 
but also remedies it, and I simply cannot see how an invalid ack #0 can ever be 
accepted when this patch is applied.
I see no reason why this patch shouldn’t also be present in you code, but just 
to make sure, can you confirm this?

I am right now wondering if a retransmission is the problem:
1: we receive pkt #2 which contains ack #1, so we set bc_peer_is_up to true.
Since only LINK_PROTO/STATE messages can cause bc_peer_is_up to go true, the 
likely sequence is rather
1: We receive a STATE message with unicast ack #1. This message should also 
contain a valid, with high probability non-zero, bc_ack. bc_peer_is_up is set 
to true.
2: We receive unicast pkt#1 (BCAST init or NAMED) which contains the invalid 
unicast ack #0. This one is now accepted.

I believe this may happen, because STATE messages, contrary to data packets, 
are sent as TC_PRIO_CONTROL, and may sometimes bypass data messages, but I 
cannot see it happening as often and consistently as you seem to be observing 
it. Another possibility is that bc_ack in the received STATE message also is an 
invalid zero, although I cannot see how this can happen either.

Regards
///jon
2: we receive pkt #1 retransmitted with ack #0. This now gets accepted, and we 
are in trouble.

I’ll try to figure out a solution to this, but it may be possible for you to 
verify this first.

BR
///jon



From: John THompson [mailto:thompa@gmail.com]
Sent: Wednesday, 24 August, 2016 16:22
To: Jon Maloy >
Cc: 
tipc-discussion@lists.sourceforge.net
Subject: Re: [tipc-discussion] BC rcv link acked stuck after receiving a named 
with a BC ACK of 0

Hi Jon,

To clarify my previous email regarding the behaviour observed,

What happens over time:
+ remove bc peer
...
some time until peer rejoins
...
+ add bc peer
+ tipc_link_bc_ack_rcv
   link is up = false, node is up = false
   (this gets called a number of times until both the link and node are up)

+ tipc_link_bc_ack_rcv
   l->acked set to valid ack
...
+ tipc_rcv - usr 5 or 11, bc_ack = 0
   + tipc_bcast_ack_rcv
 + tipc_link_bc_ack_rcv
   sets l->acked to 0

Regards,
JT
On Thu, Aug 25, 2016 at 8:06 AM, John THompson 
>>
 wrote:
Hi Jon,

It is a similar problem in terms of what happens to the bc link.  I do have 
that patch applied.

I have added debug through the remove bc peer and various other functions and 
the setting of the acked field to 0 is occurring when processing a packet from 
named (msg user 11) or BCAST protocol (msg user 5).

Thanks,
JT

On Wed, Aug 24, 2016 at 10:23 PM, Jon Maloy 

Re: [tipc-discussion] BC rcv link acked stuck after receiving a named with a BC ACK of 0

2016-09-06 Thread John THompson
Hi Jon,

My comments are below.

JT

On Wed, Sep 7, 2016 at 12:48 AM, Jon Maloy  wrote:

> Hi John,
>
> See below.
>
>
>
> ///jon
>
>
>
>
>
> *From:* John THompson [mailto:thompa@gmail.com]
> *Sent:* Tuesday, 06 September, 2016 00:14
> *To:* Jon Maloy 
> *Cc:* Jon Maloy ; tipc-discussion@lists.
> sourceforge.net
> *Subject:* Re: [tipc-discussion] BC rcv link acked stuck after receiving
> a named with a BC ACK of 0
>
>
>
> Hi Jon,
>
>
>
> The packet I see the error happening on is when receiving a usr 11
>
> (NAME_DISTRIBUTOR) over the unicast link.
>
> The reception of this packet is happening interleaved with processing
>
> a packet (or packets) on the BC link that has brought the peer up.
>
> The BC link packet processing has the tipc_bcast_lock and the unicast
>
> pkt processing cannot get the bcast lock for a while.
>
> When it can get the lock it processes the BC ack == 0 from the
> NAME_DISTRIBUTOR
>
> packet and sets the acked field on the BC link to 0.
>
>
>
> The debug / call trace below is me trying to show from the debug I
> captured what happens.
>
> If I add debug for each pkt the problem doesn't reproduce.
>
>
>
>
>
> tipc_rcv 1.1.5:vcs_mgmt-1.1.18:vcs_mgmt bc ack rcv 0 uc seq 3 ack 0 user
> 11 type 0
>
>   + calls tipc_bcast_ack_rcv
>
> tipc_rcv
>
>   + tipc_bcast_ack_rcv
>
> + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack 53574 - can't ack as
> link not up 1 or peer not up 1
>
>
>
> What kind of packet was this?
>
This is a LINK_PROTOCOL STATE msg received on the unicast link
Dbg of the packet:
broadcast-link-5-18 bc ack 0 orig_l 1.1.5:vcs_mgmt-1.1.18:vcs_mgmt (hdr usr
7 type 0 bc ack 0 seq 32771 ack 0)- can't ack as link not up 0 or peer not
up 1



>
> tipc_rcv
>
>   + tipc_bcast_ack_rcv
>
> + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack 53574 - can't ack
> as link not up 0 or peer not up 1
>
>
>
> And this?
>
This is a LINK_PROTOCOL STATE msg received on the unicast link, with ack ==
1, therefore this packet will be the first pkt that sets bc_peer_is_up to
true.
Dbg of the packet:
broadcast-link-5-18 bc ack 43402 orig_l 1.1.5:vcs_mgmt-1.1.18:vcs_mgmt (hdr
usr 7 type 0 bc ack 43402 seq 32771 ack 1)- can't ack as link not up 0 or
peer not up 1


>
>
> ===
>
> Somewhere at this point bc_peer_is_up gets set
>
> ===
>
> tipc_rcv
>
>   + tipc_bcast_ack_rcv
>
> + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack - acked (53574) less
> than it was previously (53574)
>
> tipc_rcv
>
>   + tipc_bcast_ack_rcv
>
> + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack - acked (53574) less
> than it was previously (53574)
>
>
>
>   + from tipc_rcv on unicast link
>
> + tipc_bcast_ack_rcv Going to set BC ACK outside window, new 0 old
> 53574 win 200
>
>   - dump_stack
>
> CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: P   O4.4.6-at1 #3
>
> Call Trace:
>
> [a3093a80] [806943b0] dump_stack+0x84/0xb0 (unreliable)
>
> [a3093a90] [c1507314] tipc_link_bc_ack_rcv+0x244/0x250 [tipc]
>
> [a3093ab0] [c1501b04] tipc_bcast_ack_rcv+0x74/0xd0 [tipc]
>
> [a3093ae0] [c1511a08] tipc_rcv+0x468/0xa30 [tipc]
>
> [a3093b80] [c150218c] tipc_bcast_stop+0xfc/0x7b0 [tipc]
>
> [a3093b90] [8050d6a8] __netif_receive_skb_core+0x468/0xa10
>
> [a3093c30] [80510b6c] netif_receive_skb_internal+0x3c/0xe0
>
> [a3093c60] [8064b2b8] br_handle_frame_finish+0x1d8/0x4d0
>
> [a3093cd0] [8064b7a0] br_handle_frame+0x1f0/0x330
>
> [a3093d20] [8050d738] __netif_receive_skb_core+0x4f8/0xa10
>
> [a3093dc0] [805119f0] process_backlog+0x90/0x140
>
> [a3093df0] [8051103c] net_rx_action+0x15c/0x320
>
> [a3093e50] [8002594c] __do_softirq+0x13c/0x250
>
> [a3093eb0] [80025ab0] run_ksoftirqd+0x50/0x80
>
> [a3093ec0] [800434c4] smpboot_thread_fn+0x1e4/0x1f0
>
> [a3093ef0] [8003fb38] kthread+0xc8/0xe0
>
> [a3093f40] [8000eed8] ret_from_kernel_thread+0x5c/0x64
>
>
>
> I am going to send in a patch that adds checking for a valid BC ack (being
> within the window size) to
>
> tipc_link_bc_ack_rcv.
>
>
>
> Not sure that is a good idea. Even if #0 happens to be within a valid
> range it is still invalid, and may lead to an inadvertent release of
> packets which are not ready to be released yet.  I’ll try to take a
> closer look at this today.
>
I thought that #0 was a valid value when the seqno wraps around?  There
doesn't appear to be any special handling for the wrap around case.  I
agree with your statement that it might lead to inadvertent release of
packets that aren't ready to be released.  Therefore I won't send it in.

JT




>
> ///jon
>
>
>
> Cheers,
>
> JT
>
>
>
> On Wed, Aug 31, 2016 at 10:57 PM, John THompson 
> wrote:
>
> Hi Jon,
>
>
>
> I have verified that the patch is included in my build.
>
> 2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization
> criteria”)
>
>
>
> I am trying to verify which packets are received when the problem occurs
> but I am having trouble getting the information out of my system at the
> 

Re: [tipc-discussion] BC rcv link acked stuck after receiving a named with a BC ACK of 0

2016-09-06 Thread Jon Maloy
Hi John,
See below.

///jon


From: John THompson [mailto:thompa@gmail.com]
Sent: Tuesday, 06 September, 2016 00:14
To: Jon Maloy 
Cc: Jon Maloy ; tipc-discussion@lists.sourceforge.net
Subject: Re: [tipc-discussion] BC rcv link acked stuck after receiving a named 
with a BC ACK of 0

Hi Jon,

The packet I see the error happening on is when receiving a usr 11
(NAME_DISTRIBUTOR) over the unicast link.
The reception of this packet is happening interleaved with processing
a packet (or packets) on the BC link that has brought the peer up.
The BC link packet processing has the tipc_bcast_lock and the unicast
pkt processing cannot get the bcast lock for a while.
When it can get the lock it processes the BC ack == 0 from the NAME_DISTRIBUTOR
packet and sets the acked field on the BC link to 0.

The debug / call trace below is me trying to show from the debug I captured 
what happens.
If I add debug for each pkt the problem doesn't reproduce.


tipc_rcv 1.1.5:vcs_mgmt-1.1.18:vcs_mgmt bc ack rcv 0 uc seq 3 ack 0 user 11 
type 0
  + calls tipc_bcast_ack_rcv
tipc_rcv
  + tipc_bcast_ack_rcv
+ tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack 53574 - can't ack as link 
not up 1 or peer not up 1

What kind of packet was this?

tipc_rcv
  + tipc_bcast_ack_rcv
+ tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack 53574 - can't ack as link 
not up 0 or peer not up 1

And this?

===
Somewhere at this point bc_peer_is_up gets set
===
tipc_rcv
  + tipc_bcast_ack_rcv
+ tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack - acked (53574) less than 
it was previously (53574)
tipc_rcv
  + tipc_bcast_ack_rcv
+ tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack - acked (53574) less than 
it was previously (53574)

  + from tipc_rcv on unicast link
+ tipc_bcast_ack_rcv Going to set BC ACK outside window, new 0 old 53574 
win 200
  - dump_stack
CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: P   O4.4.6-at1 #3
Call Trace:
[a3093a80] [806943b0] dump_stack+0x84/0xb0 (unreliable)
[a3093a90] [c1507314] tipc_link_bc_ack_rcv+0x244/0x250 [tipc]
[a3093ab0] [c1501b04] tipc_bcast_ack_rcv+0x74/0xd0 [tipc]
[a3093ae0] [c1511a08] tipc_rcv+0x468/0xa30 [tipc]
[a3093b80] [c150218c] tipc_bcast_stop+0xfc/0x7b0 [tipc]
[a3093b90] [8050d6a8] __netif_receive_skb_core+0x468/0xa10
[a3093c30] [80510b6c] netif_receive_skb_internal+0x3c/0xe0
[a3093c60] [8064b2b8] br_handle_frame_finish+0x1d8/0x4d0
[a3093cd0] [8064b7a0] br_handle_frame+0x1f0/0x330
[a3093d20] [8050d738] __netif_receive_skb_core+0x4f8/0xa10
[a3093dc0] [805119f0] process_backlog+0x90/0x140
[a3093df0] [8051103c] net_rx_action+0x15c/0x320
[a3093e50] [8002594c] __do_softirq+0x13c/0x250
[a3093eb0] [80025ab0] run_ksoftirqd+0x50/0x80
[a3093ec0] [800434c4] smpboot_thread_fn+0x1e4/0x1f0
[a3093ef0] [8003fb38] kthread+0xc8/0xe0
[a3093f40] [8000eed8] ret_from_kernel_thread+0x5c/0x64

I am going to send in a patch that adds checking for a valid BC ack (being 
within the window size) to
tipc_link_bc_ack_rcv.

Not sure that is a good idea. Even if #0 happens to be within a valid range it 
is still invalid, and may lead to an inadvertent release of packets which are 
not ready to be released yet.  I’ll try to take a closer look at this today.

///jon

Cheers,
JT

On Wed, Aug 31, 2016 at 10:57 PM, John THompson 
> wrote:
Hi Jon,

I have verified that the patch is included in my build.
2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization criteria”)

I am trying to verify which packets are received when the problem occurs but I 
am having trouble getting the information out of my system at the moment.

I will keep trying.
Thanks,
JT


On Tue, Aug 30, 2016 at 6:20 PM, Jon Maloy 
> wrote:


On 08/29/2016 06:48 PM, Jon Maloy wrote:
Hi John,
Sorry for my late answer; I was on vacation for a few days.
It seems I gave you the wrong commit reference in my previous mail. The one I 
really meant was
2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization criteria”)

This one explains why the first packets sometimes get an invalid ack number, 
but also remedies it, and I simply cannot see how an invalid ack #0 can ever be 
accepted when this patch is applied.
I see no reason why this patch shouldn’t also be present in you code, but just 
to make sure, can you confirm this?

I am right now wondering if a retransmission is the problem:
1: we receive pkt #2 which contains ack #1, so we set bc_peer_is_up to true.
Since only LINK_PROTO/STATE messages can cause bc_peer_is_up to go true, the 
likely sequence is rather
1: We receive a STATE message with unicast ack #1. This message should also 
contain a valid, with high probability non-zero, bc_ack. bc_peer_is_up is set 
to true.
2: We receive unicast pkt#1 (BCAST init or NAMED) which contains the invalid 
unicast ack #0. This one is now accepted.

I believe this may happen, because STATE messages, 

Re: [tipc-discussion] BC rcv link acked stuck after receiving a named with a BC ACK of 0

2016-09-05 Thread John THompson
Hi Jon,

The packet I see the error happening on is when receiving a usr 11
(NAME_DISTRIBUTOR) over the unicast link.
The reception of this packet is happening interleaved with processing
a packet (or packets) on the BC link that has brought the peer up.
The BC link packet processing has the tipc_bcast_lock and the unicast
pkt processing cannot get the bcast lock for a while.
When it can get the lock it processes the BC ack == 0 from the
NAME_DISTRIBUTOR
packet and sets the acked field on the BC link to 0.

The debug / call trace below is me trying to show from the debug I captured
what happens.
If I add debug for each pkt the problem doesn't reproduce.


tipc_rcv 1.1.5:vcs_mgmt-1.1.18:vcs_mgmt bc ack rcv 0 uc seq 3 ack 0 user 11
type 0
  + calls tipc_bcast_ack_rcv
tipc_rcv
  + tipc_bcast_ack_rcv
+ tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack 53574 - can't ack as
link not up 1 or peer not up 1
tipc_rcv
  + tipc_bcast_ack_rcv
+ tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack 53574 - can't ack as
link not up 0 or peer not up 1
===
Somewhere at this point bc_peer_is_up gets set
===
tipc_rcv
  + tipc_bcast_ack_rcv
+ tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack - acked (53574) less
than it was previously (53574)
tipc_rcv
  + tipc_bcast_ack_rcv
+ tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack - acked (53574) less
than it was previously (53574)

  + from tipc_rcv on unicast link
+ tipc_bcast_ack_rcv Going to set BC ACK outside window, new 0 old
53574 win 200
  - dump_stack
CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: P   O4.4.6-at1 #3
Call Trace:
[a3093a80] [806943b0] dump_stack+0x84/0xb0 (unreliable)
[a3093a90] [c1507314] tipc_link_bc_ack_rcv+0x244/0x250 [tipc]
[a3093ab0] [c1501b04] tipc_bcast_ack_rcv+0x74/0xd0 [tipc]
[a3093ae0] [c1511a08] tipc_rcv+0x468/0xa30 [tipc]
[a3093b80] [c150218c] tipc_bcast_stop+0xfc/0x7b0 [tipc]
[a3093b90] [8050d6a8] __netif_receive_skb_core+0x468/0xa10
[a3093c30] [80510b6c] netif_receive_skb_internal+0x3c/0xe0
[a3093c60] [8064b2b8] br_handle_frame_finish+0x1d8/0x4d0
[a3093cd0] [8064b7a0] br_handle_frame+0x1f0/0x330
[a3093d20] [8050d738] __netif_receive_skb_core+0x4f8/0xa10
[a3093dc0] [805119f0] process_backlog+0x90/0x140
[a3093df0] [8051103c] net_rx_action+0x15c/0x320
[a3093e50] [8002594c] __do_softirq+0x13c/0x250
[a3093eb0] [80025ab0] run_ksoftirqd+0x50/0x80
[a3093ec0] [800434c4] smpboot_thread_fn+0x1e4/0x1f0
[a3093ef0] [8003fb38] kthread+0xc8/0xe0
[a3093f40] [8000eed8] ret_from_kernel_thread+0x5c/0x64

I am going to send in a patch that adds checking for a valid BC ack (being
within the window size) to
tipc_link_bc_ack_rcv.

Cheers,
JT

On Wed, Aug 31, 2016 at 10:57 PM, John THompson 
wrote:

> Hi Jon,
>
> I have verified that the patch is included in my build.
> 2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization
> criteria”)
>
> I am trying to verify which packets are received when the problem occurs
> but I am having trouble getting the information out of my system at the
> moment.
>
> I will keep trying.
> Thanks,
> JT
>
>
> On Tue, Aug 30, 2016 at 6:20 PM, Jon Maloy  wrote:
>
>>
>>
>> On 08/29/2016 06:48 PM, Jon Maloy wrote:
>>
>>> Hi John,
>>> Sorry for my late answer; I was on vacation for a few days.
>>> It seems I gave you the wrong commit reference in my previous mail. The
>>> one I really meant was
>>> 2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization
>>> criteria”)
>>>
>>> This one explains why the first packets sometimes get an invalid ack
>>> number, but also remedies it, and I simply cannot see how an invalid ack #0
>>> can ever be accepted when this patch is applied.
>>> I see no reason why this patch shouldn’t also be present in you code,
>>> but just to make sure, can you confirm this?
>>>
>>> I am right now wondering if a retransmission is the problem:
>>> 1: we receive pkt #2 which contains ack #1, so we set bc_peer_is_up to
>>> true.
>>>
>> Since only LINK_PROTO/STATE messages can cause bc_peer_is_up to go true,
>> the likely sequence is rather
>> 1: We receive a STATE message with unicast ack #1. This message should
>> also contain a valid, with high probability non-zero, bc_ack. bc_peer_is_up
>> is set to true.
>> 2: We receive unicast pkt#1 (BCAST init or NAMED) which contains the
>> invalid unicast ack #0. This one is now accepted.
>>
>> I believe this may happen, because STATE messages, contrary to data
>> packets, are sent as TC_PRIO_CONTROL, and may sometimes bypass data
>> messages, but I cannot see it happening as often and consistently as you
>> seem to be observing it. Another possibility is that bc_ack in the received
>> STATE message also is an invalid zero, although I cannot see how this can
>> happen either.
>>
>> Regards
>> ///jon
>>
>> 2: we receive pkt #1 retransmitted with ack #0. This now gets accepted,
>>> and we are in trouble.
>>>
>>> I’ll try to figure out a solution to this, but it may be possible for
>>> you to verify this 

Re: [tipc-discussion] BC rcv link acked stuck after receiving a named with a BC ACK of 0

2016-08-31 Thread John THompson
Hi Jon,

I have verified that the patch is included in my build.
2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization
criteria”)

I am trying to verify which packets are received when the problem occurs
but I am having trouble getting the information out of my system at the
moment.

I will keep trying.
Thanks,
JT


On Tue, Aug 30, 2016 at 6:20 PM, Jon Maloy  wrote:

>
>
> On 08/29/2016 06:48 PM, Jon Maloy wrote:
>
>> Hi John,
>> Sorry for my late answer; I was on vacation for a few days.
>> It seems I gave you the wrong commit reference in my previous mail. The
>> one I really meant was
>> 2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization
>> criteria”)
>>
>> This one explains why the first packets sometimes get an invalid ack
>> number, but also remedies it, and I simply cannot see how an invalid ack #0
>> can ever be accepted when this patch is applied.
>> I see no reason why this patch shouldn’t also be present in you code, but
>> just to make sure, can you confirm this?
>>
>> I am right now wondering if a retransmission is the problem:
>> 1: we receive pkt #2 which contains ack #1, so we set bc_peer_is_up to
>> true.
>>
> Since only LINK_PROTO/STATE messages can cause bc_peer_is_up to go true,
> the likely sequence is rather
> 1: We receive a STATE message with unicast ack #1. This message should
> also contain a valid, with high probability non-zero, bc_ack. bc_peer_is_up
> is set to true.
> 2: We receive unicast pkt#1 (BCAST init or NAMED) which contains the
> invalid unicast ack #0. This one is now accepted.
>
> I believe this may happen, because STATE messages, contrary to data
> packets, are sent as TC_PRIO_CONTROL, and may sometimes bypass data
> messages, but I cannot see it happening as often and consistently as you
> seem to be observing it. Another possibility is that bc_ack in the received
> STATE message also is an invalid zero, although I cannot see how this can
> happen either.
>
> Regards
> ///jon
>
> 2: we receive pkt #1 retransmitted with ack #0. This now gets accepted,
>> and we are in trouble.
>>
>> I’ll try to figure out a solution to this, but it may be possible for you
>> to verify this first.
>>
>> BR
>> ///jon
>>
>>
>>
>> From: John THompson [mailto:thompa@gmail.com]
>> Sent: Wednesday, 24 August, 2016 16:22
>> To: Jon Maloy 
>> Cc: tipc-discussion@lists.sourceforge.net
>> Subject: Re: [tipc-discussion] BC rcv link acked stuck after receiving a
>> named with a BC ACK of 0
>>
>> Hi Jon,
>>
>> To clarify my previous email regarding the behaviour observed,
>>
>> What happens over time:
>> + remove bc peer
>> ...
>> some time until peer rejoins
>> ...
>> + add bc peer
>> + tipc_link_bc_ack_rcv
>>link is up = false, node is up = false
>>(this gets called a number of times until both the link and node are
>> up)
>>
>> + tipc_link_bc_ack_rcv
>>l->acked set to valid ack
>> ...
>> + tipc_rcv - usr 5 or 11, bc_ack = 0
>>+ tipc_bcast_ack_rcv
>>  + tipc_link_bc_ack_rcv
>>sets l->acked to 0
>>
>> Regards,
>> JT
>>
>>
>> On Thu, Aug 25, 2016 at 8:06 AM, John THompson > > wrote:
>> Hi Jon,
>>
>> It is a similar problem in terms of what happens to the bc link.  I do
>> have that patch applied.
>>
>> I have added debug through the remove bc peer and various other functions
>> and the setting of the acked field to 0 is occurring when processing a
>> packet from named (msg user 11) or BCAST protocol (msg user 5).
>>
>> Thanks,
>> JT
>>
>> On Wed, Aug 24, 2016 at 10:23 PM, Jon Maloy > > wrote:
>> Hi John,
>> This sounds a lot like the problem I tried to fix in
>> a71eb720355c2 ("tipc: ensure correct broadcast send buffer release when
>> peer is lost")
>> So, either that patch is not present in your kernel (if it is 4.7 it is
>> supposed to be) or my solution somehow hasn't solved the problem.
>> Can you confirm that the patch is there?
>>
>> BR
>> ///jon
>>
>> -Original Message-
>>> From: John THompson [mailto:thompa@gmail.com>> thompa@gmail.com>]
>>> Sent: Tuesday, 23 August, 2016 20:21
>>> To: tipc-discussion@lists.sourceforge.net>> @lists.sourceforge.net>
>>> Subject: [tipc-discussion] BC rcv link acked stuck after receiving a
>>> named with a BC
>>> ACK of 0
>>>
>>> Hi,
>>>
>>> I am running TIPC 2.0 on Linux 4.7 on a cluster of Freescale QorIQ P2040
>>> and Marvell Armada-XP processors.  There are 10 nodes in all.
>>> When 2 of the nodes are removed, then rejoin the cluster we sometimes see
>>> behaviour where the TIPC BC link gets stuck and eventually the backlog
>>> gets
>>> full.  the 2 nodes that are joining have already connected together.
>>>
>>> The problem occurs when the BC link sndnxt value is greater than 32k on
>>> one
>>> of the nodes (call it NODE1) and 2 nodes begin to join.
>>> When NODE1 detects the joining nodes, at some early 

Re: [tipc-discussion] BC rcv link acked stuck after receiving a named with a BC ACK of 0

2016-08-29 Thread Jon Maloy
Hi John,
Sorry for my late answer; I was on vacation for a few days.
It seems I gave you the wrong commit reference in my previous mail. The one I 
really meant was
2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization criteria”)

This one explains why the first packets sometimes get an invalid ack number, 
but also remedies it, and I simply cannot see how an invalid ack #0 can ever be 
accepted when this patch is applied.
I see no reason why this patch shouldn’t also be present in you code, but just 
to make sure, can you confirm this?

I am right now wondering if a retransmission is the problem:
1: we receive pkt #2 which contains ack #1, so we set bc_peer_is_up to true.
2: we receive pkt #1 retransmitted with ack #0. This now gets accepted, and we 
are in trouble.

I’ll try to figure out a solution to this, but it may be possible for you to 
verify this first.

BR
///jon



From: John THompson [mailto:thompa@gmail.com]
Sent: Wednesday, 24 August, 2016 16:22
To: Jon Maloy 
Cc: tipc-discussion@lists.sourceforge.net
Subject: Re: [tipc-discussion] BC rcv link acked stuck after receiving a named 
with a BC ACK of 0

Hi Jon,

To clarify my previous email regarding the behaviour observed,

What happens over time:
+ remove bc peer
...
some time until peer rejoins
...
+ add bc peer
+ tipc_link_bc_ack_rcv
  link is up = false, node is up = false
  (this gets called a number of times until both the link and node are up)

+ tipc_link_bc_ack_rcv
  l->acked set to valid ack
...
+ tipc_rcv - usr 5 or 11, bc_ack = 0
  + tipc_bcast_ack_rcv
+ tipc_link_bc_ack_rcv
  sets l->acked to 0

Regards,
JT


On Thu, Aug 25, 2016 at 8:06 AM, John THompson 
> wrote:
Hi Jon,

It is a similar problem in terms of what happens to the bc link.  I do have 
that patch applied.

I have added debug through the remove bc peer and various other functions and 
the setting of the acked field to 0 is occurring when processing a packet from 
named (msg user 11) or BCAST protocol (msg user 5).

Thanks,
JT

On Wed, Aug 24, 2016 at 10:23 PM, Jon Maloy 
> wrote:
Hi John,
This sounds a lot like the problem I tried to fix in
a71eb720355c2 ("tipc: ensure correct broadcast send buffer release when peer is 
lost")
So, either that patch is not present in your kernel (if it is 4.7 it is 
supposed to be) or my solution somehow hasn't solved the problem.
Can you confirm that the patch is there?

BR
///jon

> -Original Message-
> From: John THompson [mailto:thompa@gmail.com]
> Sent: Tuesday, 23 August, 2016 20:21
> To: 
> tipc-discussion@lists.sourceforge.net
> Subject: [tipc-discussion] BC rcv link acked stuck after receiving a named 
> with a BC
> ACK of 0
>
> Hi,
>
> I am running TIPC 2.0 on Linux 4.7 on a cluster of Freescale QorIQ P2040
> and Marvell Armada-XP processors.  There are 10 nodes in all.
> When 2 of the nodes are removed, then rejoin the cluster we sometimes see
> behaviour where the TIPC BC link gets stuck and eventually the backlog gets
> full.  the 2 nodes that are joining have already connected together.
>
> The problem occurs when the BC link sndnxt value is greater than 32k on one
> of the nodes (call it NODE1) and 2 nodes begin to join.
> When NODE1 detects the joining nodes, at some early point after they have
> joined, NODE1 receives a NAMED publication with a BC ack of 0.  NODE1
> immediately sets its BC acked to 0 and tries to ack packets off the
> transmq.  No packets get removed as the new ack value doesn't match any of
> the packets that need to be acked.
>
> The problem doesn't recover because in tipc_link_bc_ack_rcv it ensures that
> the new acked value is more than the old acked value.  When the values are
> greater than 32k apart this means that 0 can indeed be greater than
> 40,000.  So when new packets are processed the new BC ack value is
> considered less than the stored one (0).
>
> This results in the BC transmq getting full and the backlog getting full,
> thereby preventing communication over the BC link between nodes.
>
> I am persisting in trying to work out why the NAMED publication has a BC
> ack of 0, which I think is the root cause of the problem.
>
> I think that tipc_link_bc_ack_rcv needs an extra check to ensure that an
> invalid BC ack value cannot be set.  I am defining invalid as being an
> acked value that is greater than the current BC acked value + the link
> window.
>
> Thanks,
> John
> --
> ___
> tipc-discussion mailing list
> tipc-discussion@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/tipc-discussion



Re: [tipc-discussion] BC rcv link acked stuck after receiving a named with a BC ACK of 0

2016-08-24 Thread John THompson
Hi Jon,

To clarify my previous email regarding the behaviour observed,

What happens over time:
+ remove bc peer
...
some time until peer rejoins
...
+ add bc peer
+ tipc_link_bc_ack_rcv
  link is up = false, node is up = false
  (this gets called a number of times until both the link and node are up)

+ tipc_link_bc_ack_rcv
  l->acked set to valid ack
...
+ tipc_rcv - usr 5 or 11, bc_ack = 0
  + tipc_bcast_ack_rcv
+ tipc_link_bc_ack_rcv
  sets l->acked to 0

Regards,
JT


On Thu, Aug 25, 2016 at 8:06 AM, John THompson  wrote:

> Hi Jon,
>
> It is a similar problem in terms of what happens to the bc link.  I do
> have that patch applied.
>
> I have added debug through the remove bc peer and various other functions
> and the setting of the acked field to 0 is occurring when processing a
> packet from named (msg user 11) or BCAST protocol (msg user 5).
>
> Thanks,
> JT
>
> On Wed, Aug 24, 2016 at 10:23 PM, Jon Maloy 
> wrote:
>
>> Hi John,
>> This sounds a lot like the problem I tried to fix in
>> a71eb720355c2 ("tipc: ensure correct broadcast send buffer release when
>> peer is lost")
>> So, either that patch is not present in your kernel (if it is 4.7 it is
>> supposed to be) or my solution somehow hasn't solved the problem.
>> Can you confirm that the patch is there?
>>
>> BR
>> ///jon
>>
>> > -Original Message-
>> > From: John THompson [mailto:thompa@gmail.com]
>> > Sent: Tuesday, 23 August, 2016 20:21
>> > To: tipc-discussion@lists.sourceforge.net
>> > Subject: [tipc-discussion] BC rcv link acked stuck after receiving a
>> named with a BC
>> > ACK of 0
>> >
>> > Hi,
>> >
>> > I am running TIPC 2.0 on Linux 4.7 on a cluster of Freescale QorIQ P2040
>> > and Marvell Armada-XP processors.  There are 10 nodes in all.
>> > When 2 of the nodes are removed, then rejoin the cluster we sometimes
>> see
>> > behaviour where the TIPC BC link gets stuck and eventually the backlog
>> gets
>> > full.  the 2 nodes that are joining have already connected together.
>> >
>> > The problem occurs when the BC link sndnxt value is greater than 32k on
>> one
>> > of the nodes (call it NODE1) and 2 nodes begin to join.
>> > When NODE1 detects the joining nodes, at some early point after they
>> have
>> > joined, NODE1 receives a NAMED publication with a BC ack of 0.  NODE1
>> > immediately sets its BC acked to 0 and tries to ack packets off the
>> > transmq.  No packets get removed as the new ack value doesn't match any
>> of
>> > the packets that need to be acked.
>> >
>> > The problem doesn't recover because in tipc_link_bc_ack_rcv it ensures
>> that
>> > the new acked value is more than the old acked value.  When the values
>> are
>> > greater than 32k apart this means that 0 can indeed be greater than
>> > 40,000.  So when new packets are processed the new BC ack value is
>> > considered less than the stored one (0).
>> >
>> > This results in the BC transmq getting full and the backlog getting
>> full,
>> > thereby preventing communication over the BC link between nodes.
>> >
>> > I am persisting in trying to work out why the NAMED publication has a BC
>> > ack of 0, which I think is the root cause of the problem.
>> >
>> > I think that tipc_link_bc_ack_rcv needs an extra check to ensure that an
>> > invalid BC ack value cannot be set.  I am defining invalid as being an
>> > acked value that is greater than the current BC acked value + the link
>> > window.
>> >
>> > Thanks,
>> > John
>> > 
>> --
>> > ___
>> > tipc-discussion mailing list
>> > tipc-discussion@lists.sourceforge.net
>> > https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>>
>
>
--
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


Re: [tipc-discussion] BC rcv link acked stuck after receiving a named with a BC ACK of 0

2016-08-24 Thread John THompson
Hi Jon,

It is a similar problem in terms of what happens to the bc link.  I do have
that patch applied.

I have added debug through the remove bc peer and various other functions
and the setting of the acked field to 0 is occurring when processing a
packet from named (msg user 11) or BCAST protocol (msg user 5).

Thanks,
JT

On Wed, Aug 24, 2016 at 10:23 PM, Jon Maloy  wrote:

> Hi John,
> This sounds a lot like the problem I tried to fix in
> a71eb720355c2 ("tipc: ensure correct broadcast send buffer release when
> peer is lost")
> So, either that patch is not present in your kernel (if it is 4.7 it is
> supposed to be) or my solution somehow hasn't solved the problem.
> Can you confirm that the patch is there?
>
> BR
> ///jon
>
> > -Original Message-
> > From: John THompson [mailto:thompa@gmail.com]
> > Sent: Tuesday, 23 August, 2016 20:21
> > To: tipc-discussion@lists.sourceforge.net
> > Subject: [tipc-discussion] BC rcv link acked stuck after receiving a
> named with a BC
> > ACK of 0
> >
> > Hi,
> >
> > I am running TIPC 2.0 on Linux 4.7 on a cluster of Freescale QorIQ P2040
> > and Marvell Armada-XP processors.  There are 10 nodes in all.
> > When 2 of the nodes are removed, then rejoin the cluster we sometimes see
> > behaviour where the TIPC BC link gets stuck and eventually the backlog
> gets
> > full.  the 2 nodes that are joining have already connected together.
> >
> > The problem occurs when the BC link sndnxt value is greater than 32k on
> one
> > of the nodes (call it NODE1) and 2 nodes begin to join.
> > When NODE1 detects the joining nodes, at some early point after they have
> > joined, NODE1 receives a NAMED publication with a BC ack of 0.  NODE1
> > immediately sets its BC acked to 0 and tries to ack packets off the
> > transmq.  No packets get removed as the new ack value doesn't match any
> of
> > the packets that need to be acked.
> >
> > The problem doesn't recover because in tipc_link_bc_ack_rcv it ensures
> that
> > the new acked value is more than the old acked value.  When the values
> are
> > greater than 32k apart this means that 0 can indeed be greater than
> > 40,000.  So when new packets are processed the new BC ack value is
> > considered less than the stored one (0).
> >
> > This results in the BC transmq getting full and the backlog getting full,
> > thereby preventing communication over the BC link between nodes.
> >
> > I am persisting in trying to work out why the NAMED publication has a BC
> > ack of 0, which I think is the root cause of the problem.
> >
> > I think that tipc_link_bc_ack_rcv needs an extra check to ensure that an
> > invalid BC ack value cannot be set.  I am defining invalid as being an
> > acked value that is greater than the current BC acked value + the link
> > window.
> >
> > Thanks,
> > John
> > 
> --
> > ___
> > tipc-discussion mailing list
> > tipc-discussion@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>
--
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion