Hi Jon,

Again, thanks for thinking about this.  Kernel version:
     Linux a33ems1 3.10.62-ltsi-WR6.0.0.36_standard #1 SMP PREEMPT Mon Aug 20 
17:25:51 CDT 2018 x86_64 x86_64 x86_64 GNU/Linux

The retransmission failure (i.e., the kind of catastrophic interruption) was 
only noted in experiments pushing the window size to 300-400; it was not seen 
at lower window sizes.

What we are trying to work out is how to see evidence of retransmissions going 
on before the link fails, and what visibility TIPC 2.0.5 offers into that.  We 
might compare against TIPC 1.7.7 under Wind River Linux 3 as a data point, but 
we mostly care about addressing it under 2.0.5 and Wind River Linux 6.

This is not running in a VM.  I'm not sure about your question on Remote 
Procedure Call being activated; if there is a command to run or a code 
construct to check, I could get that information.

Regards

PK
-----Original Message-----
From: Jon Maloy <[email protected]> 
Sent: Thursday, September 20, 2018 7:28 AM
To: Peter Koss <[email protected]>; [email protected]
Subject: RE: What affects congestion beyond window size, and what might have 
reduced congestion thresholds in TIPC 2.0.x?

Hi Peter,
See my comments below.

> -----Original Message-----
> From: Peter Koss <[email protected]>
> Sent: September 19, 2018 6:11 PM
> To: Jon Maloy <[email protected]>; tipc- 
> [email protected]
> Subject: RE: What affects congestion beyond window size, and what 
> might have reduced congestion thresholds in TIPC 2.0.x?
> 
> Thanks for responding.
> 
> There was code in TIPC 1.7.x that gave some node receive queue
> information, but that is now obsolete in 2.0.x.  We are using socket
> receive calls to get data instead, which seems to suggest one of two
> problems: either the receive-side queue is filling up and exceeding
> limits, or the ack back to the sender is having trouble.  We do see the
> sender getting errno=EAGAIN.
> Overall, the performance levels we see with TIPC 2.0.x under Wind River
> Linux 6 are much lower than with TIPC 1.7.x under Wind River Linux 3.

Which Linux kernel version are you running? 

> 
> case TIPC_NODE_RECVQ_DEPTH:
> +               value = (u32)atomic_read(&tipc_queue_size);   <== obsolete now; the call occurs, but we get 0
> +               break;
> +       case TIPC_SOCK_RECVQ_DEPTH:
> +               value = skb_queue_len(&sk->sk_receive_queue);
> +               break;
> 
> 
> Questions we have currently:
> - What is the socket receive queue limit (default)?

That depends on the Linux version you are using. Prior to 4.6 it was 64 MB; in 
later versions it is 2 MB, but with a much better flow control algorithm.
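
If it helps, both the configured limit and the current fill level can be read 
directly from user space. Below is a minimal, untested sketch; it assumes your 
<linux/tipc.h> exports TIPC_SOCK_RECVQ_DEPTH (mainline does) and falls back to 
the mainline SOL_TIPC value of 271, which differs from the 50 you used in your 
1.7 builds:

#include <stdio.h>
#include <sys/socket.h>
#include <linux/tipc.h>

#ifndef SOL_TIPC
#define SOL_TIPC 271    /* mainline value; your 1.7 builds used 50 */
#endif

int main(void)
{
        int sd = socket(AF_TIPC, SOCK_RDM, 0);
        int rcvbuf = 0;
        unsigned int depth = 0;
        socklen_t len;

        if (sd < 0) {
                perror("socket");
                return 1;
        }
        /* Per-socket receive buffer limit as seen by this kernel */
        len = sizeof(rcvbuf);
        if (!getsockopt(sd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len))
                printf("SO_RCVBUF limit: %d bytes\n", rcvbuf);

        /* Number of messages currently queued on this socket */
        len = sizeof(depth);
        if (!getsockopt(sd, SOL_TIPC, TIPC_SOCK_RECVQ_DEPTH, &depth, &len))
                printf("messages queued on this socket: %u\n", depth);
        return 0;
}

Sampling TIPC_SOCK_RECVQ_DEPTH on your receiving socket over time would tell 
you whether the queue really approaches the limit before the sender starts 
seeing EAGAIN.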

> - Is it wise to try a window size > 150?

I have never done it myself except for experimental purposes, but I see no 
problem with it.
Do you have any particular reason to do so? Does it give significantly better 
throughput than at 150?

> - Is there a good way to control or influence the flow control 
> sender/receiver coordination,
You can increase the window size to potentially improve link-level throughput, 
and you can raise the sending socket's importance priority to reduce the risk 
of receive socket buffer overflow.
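
The importance part is just a setsockopt on the sending socket. A minimal, 
untested sketch, with the constants taken from <linux/tipc.h> and the mainline 
SOL_TIPC value as a fallback:

#include <stdio.h>
#include <sys/socket.h>
#include <linux/tipc.h>

#ifndef SOL_TIPC
#define SOL_TIPC 271    /* mainline value; adjust if your build differs */
#endif

int main(void)
{
        int sd = socket(AF_TIPC, SOCK_RDM, 0);
        __u32 imp = TIPC_HIGH_IMPORTANCE;   /* or TIPC_CRITICAL_IMPORTANCE */

        if (sd < 0) {
                perror("socket");
                return 1;
        }
        /* Raise this socket's message importance before sending */
        if (setsockopt(sd, SOL_TIPC, TIPC_IMPORTANCE, &imp, sizeof(imp)))
                perror("setsockopt(TIPC_IMPORTANCE)");
        /* ...bind/connect/send as usual on this socket... */
        return 0;
}

Roughly speaking, higher importance raises the fill level at which the 
receiver starts dropping messages from that sender, so it directly lowers the 
overflow risk; TIPC_CRITICAL_IMPORTANCE should be used sparingly.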

> or a best way to adjust receive buffer limit?
That said, I see no sign that buffer overflow is your problem. If you do want 
to change the limit, follow the instructions under 5.2 at the following link:
http://tipc.sourceforge.net/programming.html#incr_rcvbuf
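
If memory serves, the change described there boils down to enlarging SO_RCVBUF 
on the receiving socket. An untested sketch for illustration only; the 16 MB 
figure is arbitrary, and net.core.rmem_max must be at least as large for the 
full value to take effect:

#include <stdio.h>
#include <sys/socket.h>
#include <linux/tipc.h>

int main(void)
{
        int sd = socket(AF_TIPC, SOCK_RDM, 0);
        int rcvbuf = 16 * 1024 * 1024;   /* example figure, not a recommendation */

        if (sd < 0) {
                perror("socket");
                return 1;
        }
        /* Enlarge the receive buffer limit on the receiving socket */
        if (setsockopt(sd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)))
                perror("setsockopt(SO_RCVBUF)");
        return 0;
}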

> 
> For context, the first sign of errors shows up as congestion, where 
> the max value will increase to slightly above whatever window size we 
> set (50,150,300,400).
> 
> pl0_1:~$ /usr/sbin/tipc-config -ls | grep "Send queue max"
>   Congestion link:0  Send queue max:2 avg:1
>   Congestion link:93121  Send queue max:162 avg:3
>   Congestion link:206724  Send queue max:164 avg:3
>   Congestion link:67839  Send queue max:167 avg:3
>   Congestion link:214788  Send queue max:166 avg:3
>   Congestion link:205240  Send queue max:165 avg:3
>   Congestion link:240955  Send queue max:166 avg:3
>   Congestion link:0  Send queue max:0 avg:0
>   Congestion link:0  Send queue max:1 avg:0
>   Congestion link:0  Send queue max:0 avg:0

This is all normal and unproblematic. We allow an oversubscription of one 
message (at most 46 1500-byte packets) on each link to keep the algorithm 
simple, so you will often find the max value higher than the nominal upper 
limit.

> 
> The following error occurs only when the window size is high (300-400);
> it is not seen at 50 or 150, so we think it may be a separate issue.  It
> also makes us wonder whether going above 150 is wise, hence the question
> above.
> 
> Sep 17 05:42:00 pl0_4 kernel: tipc: Retransmission failure on link <1.1.5:bond1-1.1.2:bond1>
> Sep 17 05:42:00 pl0_4 kernel: tipc: Resetting link

This is your real problem. For some reason a packet has been retransmitted 
more than 100 times on a link without getting through. The link is then reset, 
and all associated connections along with it.
We have seen this happen for various reasons over the years, and fixed them all.
Is RPC possibly activated on your receiving node?
Are you running a VM with a virtio interface? That interface sometimes tends 
to be overwhelmed and just stops sending for 30 seconds, which leads to broken 
links.

But again, it all depends on which kernel and environment you are running. 
Please update me on this.

BR
///jon

> Sep 17 05:42:00 pl0_4 kernel: Link 1001002<eth:bond1>::WW
> Sep 17 05:42:00 pl0_4 kernel: tipc: Lost link <1.1.5:bond1-1.1.2:bond1> on network plane A
> Sep 17 05:42:00 pl0_4 kernel: tipc: Lost contact with <1.1.2>
> Sep 17 05:42:00 pl0_10 kernel: tipc: Resetting link <1.1.2:bond1-1.1.5:bond1>, requested by peer
> Sep 17 05:42:00 pl0_10 kernel: tipc: Lost link <1.1.2:bond1-1.1.5:bond1> on network plane A
> 
> Thanks in advance, advice is appreciated.
> 
> PK
> 
> -----Original Message-----
> From: Jon Maloy <[email protected]>
> Sent: Tuesday, September 18, 2018 12:15 PM
> To: Peter Koss <[email protected]>; tipc- 
> [email protected]
> Subject: RE: What affects congestion beyond window size, and what 
> might have reduced congestion thresholds in TIPC 2.0.x?
> 
> Hi Peter,
> The only parameter of those mentioned below that would have any effect 
> on congestion is TIPC_MAX_LINK_WIN, which should reduce occurrences of 
> link level congestion.
> However, you don't describe which symptoms you see caused by this 
> congestion.
> - Is it only a higher 'congested' counter when you look at the link
> statistics? If so, you don't have a problem at all; this is a totally
> normal and frequent occurrence. (Maybe we should have given this field
> a different name to avert confusion.)
> - If this causes a severely reduced throughput you may have a problem, 
> but I don't find that very likely.
> - If you are losing messages at the socket level (dropped because of 
> receive buffer overflow) you *do* have a problem, but this can most 
> often be remedied by extending the socket receive buffer limit.
> 
> BR
> ///Jon Maloy
> 
> -----Original Message-----
> From: Peter Koss <[email protected]>
> Sent: September 18, 2018 12:33 PM
> To: [email protected]
> Subject: [tipc-discussion] What affects congestion beyond window size, 
> and what might have reduced congestion thresholds in TIPC 2.0.x?
> 
> 
> In TIPC 1.7.6, we battled with congestion quite a bit.  We ultimately
> settled on adjusting these parameters in TIPC, which we also used in
> TIPC 1.7.7.  This was running on Wind River Linux 3, where TIPC was a
> module separate from the kernel.
> 
> SOL_TIPC               changed from 271 to 50   (probably not affecting congestion)
> TIPC_MAX_LINK_WIN      changed from 50 to 150
> TIPC_NODE_RECVQ_DEPTH  set to 131
> 
> Using Wind River Linux 6, we get TIPC 2.0.5 as part of the kernel, and
> we see congestion occurring at much lower overall load levels (less
> traffic overall) compared to TIPC 1.7.7 and WR3.  We've made the same
> changes as above via a loadable module for TIPC 2.0.5, and also noted
> that TIPC_NODE_RECVQ_DEPTH is now obsolete.  Upon observing congestion,
> we have raised the default window size and the max window size, up to
> 300 and even 400.  This helps congestion a little, but not sufficiently.
> 
> 
> Does anyone know:
> - What has changed in TIPC 2.0.x that affects this?
> - Are there other parameters to change to help with this?
> - Is there a replacement set of parameters that affects what
>   TIPC_NODE_RECVQ_DEPTH influences?
> 
> 
> 
> 


_______________________________________________
tipc-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
