Re: [PATCH net-next 0/6] tcp: remove non GSO code

2018-02-20 Thread Oleksandr Natalenko
Hi.

On Wednesday 21 February 2018 at 0:21:37 CET, Eric Dumazet wrote:
> My latest patch (fixing BBR underestimation of cwnd)
> was meant for the net tree, on a NIC where SG/TSO/GSO are disabled.
> 
> ( ie when sk->sk_gso_max_segs is not set to 'infinite' )
> 
> It is packet scheduler independent really.
> 
> Tested here with pfifo_fast, TSO/GSO off.

Well, before the patch, with BBR and sg off, I get ~450 Mbps for fq and 
~115 Mbps for pfifo_fast. So, compared to what I see with the patch (850 and 
200 respectively), it is definitely an improvement.

Thanks.




Re: [PATCH net-next 0/6] tcp: remove non GSO code

2018-02-20 Thread Oleksandr Natalenko
On Tuesday 20 February 2018 at 21:09:37 CET, Eric Dumazet wrote:
> Also you can tune your NIC to accept a few MSS per GSO/TSO packet
> 
> ip link set dev eth0 gso_max_segs 2
> 
> So even if TSO/GSO is there, BBR should not use sk->sk_gso_max_segs to
> size its bursts, since burst sizes also impact GRO on the
> receiver.

net-next + 7 patches (6 from the patchset + this one).

Before playing with gso_max_segs:

BBR+fq
sg on:  4.39 Gbits/sec
sg off: 1.33 Gbits/sec

BBR+fq_codel
sg on:  4.02 Gbits/sec
sg off: 1.41 Gbits/sec

BBR+pfifo_fast
sg on:  3.66 Gbits/sec
sg off: 1.41 Gbits/sec

Reno+fq
sg on:  5.69 Gbits/sec
sg off: 1.53 Gbits/sec

Reno+fq_codel
sg on:  6.33 Gbits/sec
sg off: 1.50 Gbits/sec

Reno+pfifo_fast
sg on:  6.26 Gbits/sec
sg off: 1.48 Gbits/sec

After "ip link set dev eth1 gso_max_segs 2":

BBR+fq
sg on:  806 Mbits/sec
sg off: 886 Mbits/sec

BBR+fq_codel
sg on:  206 Mbits/sec
sg off: 207 Mbits/sec

BBR+pfifo_fast
sg on:  220 Mbits/sec
sg off: 200 Mbits/sec

Reno+fq
sg on:  2.16 Gbits/sec
sg off: 1.27 Gbits/sec

Reno+fq_codel
sg on:  2.45 Gbits/sec
sg off: 1.52 Gbits/sec

Reno+pfifo_fast
sg on:  2.31 Gbits/sec
sg off: 1.54 Gbits/sec
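For reference, a matrix like the one above can be scripted. This is only a sketch under my own assumptions (the device name enp2s0, the peer address, and the 20-second iperf3 duration are placeholders, not what was actually used); the function just prints the commands so they can be reviewed, or piped to sh to execute for real (root required):

```shell
#!/bin/sh
# Print the benchmark matrix: 2 congestion controls x 3 qdiscs x sg on/off.
# Each combination resets the congestion control, the root qdisc and the
# scatter-gather offload, then runs one iperf3 measurement.
gen_matrix() {
    dev=$1
    peer=$2
    for cc in bbr reno; do
        for qd in fq fq_codel pfifo_fast; do
            for sg in on off; do
                echo "sysctl -w net.ipv4.tcp_congestion_control=$cc"
                echo "tc qdisc replace dev $dev root $qd"
                echo "ethtool -K $dev sg $sg"
                echo "iperf3 -c $peer -t 20"
            done
        done
    done
}

gen_matrix enp2s0 192.0.2.10
```

Piping the output through `sh -x` would run the whole sweep in one go.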

Oleksandr




Re: [PATCH net-next 0/6] tcp: remove non GSO code

2018-02-20 Thread Oleksandr Natalenko
On Tuesday 20 February 2018 at 20:56:24 CET, Eric Dumazet wrote:
> That is with the other patches _not_ applied ?

Yes, the other patches are not applied. It is v4.15.4 + this patch only + BBR + 
fq_codel or pfifo_fast. Shall I re-test on net-next with the whole patchset 
(since it does not apply cleanly to 4.15)?

Oleksandr




Re: [PATCH net-next 0/6] tcp: remove non GSO code

2018-02-20 Thread Oleksandr Natalenko
On Tuesday 20 February 2018 at 20:39:49 CET, Eric Dumazet wrote:
> I am not trying to compare BBR and Reno on a lossless link.
> 
> Reno is running as fast as possible and will win when bufferbloat is
> not an issue.
> 
> If bufferbloat is not an issue, simply use Reno and be happy ;)
> 
> My patch helps BBR only, I thought it was obvious ;)

Umm, yes, and my point was rather that "the speed on a lossless link while 
using BBR is the same with and without this patch". Sorry for the confusion. 
I guess the key word here is "lossless".

Oleksandr




Re: [PATCH net-next 0/6] tcp: remove non GSO code

2018-02-20 Thread Oleksandr Natalenko
Hi.

On Tuesday 20 February 2018 at 19:57:42 CET, Eric Dumazet wrote:
> Actually timer drifts are not horrible (at least on my lab hosts)
> 
> But BBR has a pessimistic way to sense the burst size, as it is tied to
> TSO/GSO being there.
> 
> Following patch helps a lot.

Not really, at least when applied to v4.15.4. I'm still getting 2 Gbps less 
between VMs when using BBR instead of Reno.

Am I doing something wrong?

Oleksandr




Re: [PATCH net-next 0/6] tcp: remove non GSO code

2018-02-20 Thread Oleksandr Natalenko

Hi.

19.02.2018 20:56, Eric Dumazet wrote:

Switching TCP to GSO mode, relying on core networking layers
to perform eventual adaptation for dumb devices, was overdue.

1) Most TCP developments are done with TSO in mind.
2) Fewer high-resolution timers need to be armed for TCP pacing.
3) GSO can benefit from the xmit_more hint.
4) Receiver GRO is more effective (as if TSO was used for real on the
sender) -> fewer ACK packets and less overhead.
5) Write queues have less overhead (one skb holds about 64KB of payload).
6) SACK coalescing just works (no payload in skb->head).
7) The rtx rb-tree contains fewer packets, so SACK is cheaper.
8) Removal of legacy code. Fewer maintenance hassles.

Note that I have left the sendpage/zerocopy paths, but they probably can
benefit from the same strategy.

Thanks to Oleksandr Natalenko for reporting a performance issue for
BBR/fq_codel, which was the main reason I worked on this patch series.


Thanks for dealing with this so fast.

Does this mean that optimising internal TCP pacing is still an open 
question?


Oleksandr


Re: TCP and BBR: reproducibly low cwnd and bandwidth

2018-02-18 Thread Oleksandr Natalenko
Hi.

On Sunday 18 February 2018 at 22:04:27 CET, Eric Dumazet wrote:
> I was able to take a look today, and I believe this is the time to
> switch TCP to GSO being always on.
> 
> As a bonus, we get speed boost for cubic as well.
> 
> Today's high BDP and recent TCP improvements (rtx queue as rb-tree, sack
> coalescing, TCP pacing...) all were developed/tested/maintained with
> GSO/TSO being the norm.
> 
> Can you please test the following patch ?

Yes, results below:

BBR+fq:
sg on:  6.02 Gbits/sec
sg off: 1.33 Gbits/sec

BBR+pfifo_fast:
sg on:  4.13 Gbits/sec
sg off: 1.34 Gbits/sec

BBR+fq_codel:
sg on:  4.16 Gbits/sec
sg off: 1.35 Gbits/sec

Reno+fq:
sg on:  6.44 Gbits/sec
sg off: 1.39 Gbits/sec

Reno+pfifo_fast:
sg on:  6.36 Gbits/sec
sg off: 1.39 Gbits/sec

Reno+fq_codel:
sg on:  6.41 Gbits/sec
sg off: 1.38 Gbits/sec

While BBR still suffers when fq is not used, disabling sg doesn't bring a 
drastic throughput drop anymore. So, looks good to me, eh?

> Note that some cleanups can be done later in TCP stack, removing lots
> of legacy stuff.
> 
> Also TCP internal-pacing could benefit from something similar to this
> fq patch eventually, although there is no hurry.
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=fefa569a9d4bc4b7758c0fddd75bb0382c95da77

Feel free to ping me if you have something else to test then ;).

> Of course, you have to consider why SG was disabled on your device,
> this looks very pessimistic.

I don't know why it was disabled, but I've set it up to be enabled 
automatically on interface up.
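One way to do that (a sketch only; the interface name enp3s0, the udev approach, and the ethtool path are my assumptions, not necessarily what was actually done here) is a udev rule that runs ethtool whenever the NIC appears:

```shell
# Sketch: build a udev rule that enables scatter-gather on enp3s0 when the
# interface is added. Redirect the output into
# /etc/udev/rules.d/70-enable-sg.rules to install it (the ethtool binary may
# live in /usr/sbin on some distributions).
rule='ACTION=="add", SUBSYSTEM=="net", KERNEL=="enp3s0", RUN+="/usr/bin/ethtool -K enp3s0 sg on"'
echo "$rule"
```

A networkd or ifupdown hook would work equally well; the only requirement is that the `ethtool -K ... sg on` runs after the device exists.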

Thanks.

Oleksandr




Re: TCP and BBR: reproducibly low cwnd and bandwidth

2018-02-17 Thread Oleksandr Natalenko
Hi.

On Friday 16 February 2018 at 23:59:52 CET, Eric Dumazet wrote:
> Well, no effect  here on e1000e (1 Gbit) at least
> 
> # ethtool -K eth3 sg off
> Actual changes:
> scatter-gather: off
> tx-scatter-gather: off
> tcp-segmentation-offload: off
> tx-tcp-segmentation: off [requested on]
> tx-tcp6-segmentation: off [requested on]
> generic-segmentation-offload: off [requested on]
> 
> # tc qd replace dev eth3 root pfifo_fast
> # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
> 941
> # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
> 941
> # tc qd replace dev eth3 root fq
> # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
> 941
> # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
> 941
> # tc qd replace dev eth3 root fq_codel
> # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
> 941
> # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
> 941
> #

That really looks strange to me. I'm able to reproduce the effect caused by 
disabling scatter-gather even on the VM (using iperf3, as usual):

BBR+fq_codel:
sg on:  4.23 Gbits/sec
sg off: 121 Mbits/sec

BBR+fq:
sg on:  6.38 Gbits/sec
sg off: 437 Mbits/sec

Reno+fq_codel:
sg on:  6.74 Gbits/sec
sg off: 1.37 Gbits/sec

Reno+fq:
sg on:  6.53 Gbits/sec
sg off: 1.19 Gbits/sec

Regardless of which congestion algorithm and qdisc are in use, the throughput 
drops, but when BBR is in use, especially with a non-fq qdisc, it drops the 
most.

Oleksandr




Re: TCP and BBR: reproducibly low cwnd and bandwidth

2018-02-16 Thread Oleksandr Natalenko
On Friday 16 February 2018 at 23:50:35 CET, Eric Dumazet wrote:
> /* snip */
> If you use
> 
> tcptrace -R test_s2c.pcap
> xplot.org d2c_rtt.xpl
> 
> Then you'll see plenty of suspect 40ms rtt samples.

That's odd. Even the way they look so uniform is strange.

> It looks like receiver misses wakeups for some reason,
> and only the TCP delayed ACK timer is helping.
> 
> So it does not look like a sender side issue to me.

To make things even more complicated, I've disabled sg on the server, leaving 
it enabled on the client:

client to server flow: 935 Mbits/sec
server to client flow: 72.5 Mbits/sec

So still, to me it looks like a sender-side issue. No?




Re: TCP and BBR: reproducibly low cwnd and bandwidth

2018-02-16 Thread Oleksandr Natalenko
Hi.

On Friday 16 February 2018 at 21:54:05 CET, Eric Dumazet wrote:
> /* snip */
> Something fishy really :
> /* snip */
> Not only the receiver suddenly adds a 25 ms delay, but also note that
> it acknowledges all prior segments (ack 112949), but with a wrong ecr
> value ( 2327043753 )
> instead of 2327043759
> /* snip */

Eric has encouraged me to look closer at what's in ethtool, and I've just had 
some free time to play with it. I've found out that enabling scatter-gather 
(ethtool -K enp3s0 sg on; it is disabled by default on both hosts) brings the 
throughput back to normal, even with BBR+fq_codel.

Why? What's the deal BBR has with sg?

Oleksandr




Re: TCP and BBR: reproducibly low cwnd and bandwidth

2018-02-16 Thread Oleksandr Natalenko
Hi.

On Friday 16 February 2018 at 18:56:12 CET, Holger Hoffstätte wrote:
> There is simply no reason why you shouldn't get approx. line rate
> (~920+-ish) Mbit over wired 1GBit Ethernet; even my broken 10-year old
> Core2Duo laptop can do that. Can you boot with spectre_v2=off and try "the
> simplest case" with the defaults cubic/pfifo_fast? spectre_v2 has terrible
> performance impact esp. on small/older processors.

I've just tried it. No visible difference.

> When I last benchmarked full PREEMPT with 4.9.x it was similarly bad and
> also had a noticeable network throughput impact even on my i7.
> 
> Also congratulations for being the only other person I know who ever tried
> YeAH. :-)

Well, according to the git log on tcp_yeah.c and the Reported-by tag, I was 
not the only one there ;).

Regards,
  Oleksandr




Re: TCP and BBR: reproducibly low cwnd and bandwidth

2018-02-16 Thread Oleksandr Natalenko
Hi.

On Friday 16 February 2018 at 17:25:58 CET, Eric Dumazet wrote:
> The way TCP pacing works, it defaults to internal pacing using a hint
> stored in the socket.
> 
> If you change the qdisc while flow is alive, result could be unexpected.

I don't change the qdisc while a flow is alive. Either the VM is completely 
restarted, or iperf3 is restarted on both sides.

> (TCP socket remembers that one FQ was supposed to handle the pacing)
> 
> What results do you have if you use standard pfifo_fast ?

Almost the same as with fq_codel (see my previous email with numbers).

> I am asking because TCP pacing relies on High resolution timers, and
> that might be weak on your VM.

Also, I've switched to measuring things on a real HW only (also see previous 
email with numbers).

Thanks.

Regards,
  Oleksandr




Re: TCP and BBR: reproducibly low cwnd and bandwidth

2018-02-16 Thread Oleksandr Natalenko
Hi.

On Friday 16 February 2018 at 17:26:11 CET, Holger Hoffstätte wrote:
> These are very odd configurations. :)
> Non-preempt/100 might well be too slow, whereas PREEMPT/1000 might simply
> have too much overhead.

Since the pacing is based on hrtimers, should HZ matter at all? Even if it 
does, a 1 Gbps link surely shouldn't drop below 100 Mbps.

> BBR in general will run with lower cwnd than e.g. Cubic or others.
> That's a feature and necessary for WAN transfers.

Okay, got it.

> Something seems really wrong with your setup. I get completely
> expected throughput on wired 1Gb between two hosts:
> /* snip */

Yes, and that's strange :/. And that's why I'm wondering what I'm missing, 
since things cannot be *that* bad.

> /* snip */
> Please note that BBR was developed to address the case of WAN transfers
> (or more precisely high BDP paths) which often suffer from TCP throughput
> collapse due to single packet loss events. While it might "work" in other
> scenarios as well, strictly speaking delay-based anything is increasingly
> less likely to work when there is no meaningful notion of delay - such
> as on a LAN. (yes, this is very simplified..)
> 
> The BBR mailing list has several nice reports why the current BBR
> implementation (dubbed v1) has a few - sometimes severe - problems.
> These are being addressed as we speak.
> 
> (let me know if you want some of those tech reports by email. :)

Well, yes, please, why not :).

> /* snip */
> I'm not sure testing the old version without builtin pacing is going to help
> matters in finding the actual problem. :)
> Several people have reported severe performance regressions with 4.15.x,
> maybe that's related. Can you test latest 4.14.x?

I observed this on v4.14 too, but didn't pay much attention until I realised 
that things definitely look wrong.

> Out of curiosity, what is the expected use case for BBR here?

Nothing special, I just assumed it could be set as a default for both WAN and 
LAN usage.

Regards,
  Oleksandr




Re: TCP and BBR: reproducibly low cwnd and bandwidth

2018-02-16 Thread Oleksandr Natalenko
Hi.

On Friday 16 February 2018 at 17:33:48 CET, Neal Cardwell wrote:
> Thanks for the detailed report! Yes, this sounds like an issue in BBR. We
> have not run into this one in our team, but we will try to work with you to
> fix this.
> 
> Would you be able to take a sender-side tcpdump trace of the slow BBR
> transfer ("v4.13 + BBR + fq_codel == Not OK")? Packet headers only would be
> fine. Maybe something like:
> 
>   tcpdump -w /tmp/test.pcap -c100 -s 100 -i eth0 port $PORT

So, going on with two real HW hosts. They are both running the latest stock 
Arch Linux kernel (4.15.3-1-ARCH, CONFIG_PREEMPT=y, CONFIG_HZ=1000) and are 
interconnected with a 1 Gbps link (via a switch, if that matters). Using 
iperf3, running each test for 20 seconds.

Having BBR+fq_codel (or pfifo_fast, same result) on both hosts:

Client to server: 112 Mbits/sec
Server to client: 96.1 Mbits/sec

Having BBR+fq on both hosts:

Client to server: 347 Mbits/sec
Server to client: 397 Mbits/sec

Having YeAH+fq on both hosts:

Client to server: 928 Mbits/sec
Server to client: 711 Mbits/sec

(when the server generates traffic, the throughput is a little lower, as you 
can see, but I assume that's because the server has a low-power Silvermont 
CPU, while the client has an Ivy Bridge beast)

Now, to tcpdump. I've captured it twice, for the client-to-server flow (c2s) 
and for the server-to-client flow (s2c), while using BBR + pfifo_fast:

# tcpdump -w test_XXX.pcap -c100 -s 100 -i enp2s0 port 5201

I've uploaded both files here [1].

Thanks.

Oleksandr

[1] https://natalenko.name/myfiles/bbr/




Re: TCP and BBR: reproducibly low cwnd and bandwidth

2018-02-16 Thread Oleksandr Natalenko
Hi!

On Friday 16 February 2018 at 17:45:56 CET, Neal Cardwell wrote:
> Eric raises a good question: bare metal vs VMs.
> 
> Oleksandr, your first email mentioned KVM VMs and virtio NICs. Your
> second e-mail did not seem to mention if those results were for bare
> metal or a VM scenario: can you please clarify the details on your
> second set of tests?

Ugh, so many emails at once… I'll answer them one by one, if you don't 
mind :).

Both the first and the second set of tests were performed on 2 KVM VMs, but 
from now on I'll test everything using real HW only, to exclude any potential 
influence of virtualisation. Also, as I've already pointed out, on real HW 
the difference is even bigger (~10 times).

Now I'm going to answer your other emails, including the actual results from 
the real HW and the tcpdump output, as requested.

Thanks!

Regards,
  Oleksandr




Re: TCP and BBR: reproducibly low cwnd and bandwidth

2018-02-16 Thread Oleksandr Natalenko
Hi, David, Eric, Neal et al.

On Thursday 15 February 2018 at 21:42:26 CET, Oleksandr Natalenko wrote:
> I've faced an issue with a limited TCP bandwidth between my laptop and a
> server in my 1 Gbps LAN while using BBR as a congestion control mechanism.
> To verify my observations, I've set up 2 KVM VMs with the following
> parameters:
> 
> 1) Linux v4.15.3
> 2) virtio NICs
> 3) 128 MiB of RAM
> 4) 2 vCPUs
> 5) tested on both non-PREEMPT/100 Hz and PREEMPT/1000 Hz
> 
> The VMs are interconnected via host bridge (-netdev bridge). I was running
> iperf3 in the default and reverse mode. Here are the results:
> 
> 1) BBR on both VMs
> 
> upload: 3.42 Gbits/sec, cwnd ~ 320 KBytes
> download: 3.39 Gbits/sec, cwnd ~ 320 KBytes
> 
> 2) Reno on both VMs
> 
> upload: 5.50 Gbits/sec, cwnd = 976 KBytes (constant)
> download: 5.22 Gbits/sec, cwnd = 1.20 MBytes (constant)
> 
> 3) Reno on client, BBR on server
> 
> upload: 5.29 Gbits/sec, cwnd = 952 KBytes (constant)
> download: 3.45 Gbits/sec, cwnd ~ 320 KBytes
> 
> 4) BBR on client, Reno on server
> 
> upload: 3.36 Gbits/sec, cwnd ~ 370 KBytes
> download: 5.21 Gbits/sec, cwnd = 887 KBytes (constant)
> 
> So, as you may see, when BBR is in use, upload rate is bad and cwnd is low.
> If using real HW (1 Gbps LAN, laptop and server), BBR limits the throughput
> to ~100 Mbps (verifiable not only by iperf3, but also by scp while
> transferring some files between hosts).
> 
> Also, I've tried to use YeAH instead of Reno, and it gives me the same
> results as Reno (IOW, YeAH works fine too).
> 
> Questions:
> 
> 1) is this expected?
> 2) or am I missing some extra BBR tuneable?
> 3) if it is not a regression (I don't have any previous data to compare
> with), how can I fix this?
> 4) if it is a bug in BBR, what else should I provide or check for a proper
> investigation?

I've played with BBR a little bit more and managed to narrow the issue down to 
the changes between v4.12 and v4.13. Here are my observations:

v4.12 + BBR + fq_codel == OK
v4.12 + BBR + fq   == OK
v4.13 + BBR + fq_codel == Not OK
v4.13 + BBR + fq   == OK

I think this has something to do with the internal TCP pacing implementation 
that was introduced in v4.13 (commit 218af599fa63) specifically to allow 
using BBR together with non-fq qdiscs. When BBR relies on fq, the throughput 
is high and saturates the link, but if another qdisc is in use, for instance 
fq_codel, the throughput drops. Just to be sure, I've also tried pfifo_fast 
instead of fq_codel, with the same outcome: low throughput.
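For completeness, each of the four configurations above comes down to selecting one congestion control and one root qdisc before the run. A sketch of the switch commands (the device name enp2s0 is my assumption; the function only prints the commands, and each combination should be applied before iperf3 starts, not while a flow is alive):

```shell
#!/bin/sh
# Print the commands for one congestion-control/qdisc combination.
set_combo() {
    cc=$1
    qd=$2
    dev=$3
    echo "sysctl -w net.ipv4.tcp_congestion_control=$cc"
    echo "tc qdisc replace dev $dev root $qd"
}

set_combo bbr fq_codel enp2s0   # the combination that regressed in v4.13
set_combo bbr fq enp2s0         # the combination that kept working
```

Running the printed commands (as root) on both v4.12 and v4.13 reproduces the matrix in the observations above.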

Unfortunately, I do not know whether this is expected or should be considered 
a regression, so I'm asking for advice.

Ideas?

Thanks.

Regards,
  Oleksandr




TCP and BBR: reproducibly low cwnd and bandwidth

2018-02-15 Thread Oleksandr Natalenko
Hello.

I've faced an issue with limited TCP bandwidth between my laptop and a 
server in my 1 Gbps LAN while using BBR as the congestion control mechanism. To 
verify my observations, I've set up 2 KVM VMs with the following parameters:

1) Linux v4.15.3
2) virtio NICs
3) 128 MiB of RAM
4) 2 vCPUs
5) tested on both non-PREEMPT/100 Hz and PREEMPT/1000 Hz

The VMs are interconnected via host bridge (-netdev bridge). I was running 
iperf3 in the default and reverse mode. Here are the results:

1) BBR on both VMs

upload: 3.42 Gbits/sec, cwnd ~ 320 KBytes
download: 3.39 Gbits/sec, cwnd ~ 320 KBytes

2) Reno on both VMs

upload: 5.50 Gbits/sec, cwnd = 976 KBytes (constant)
download: 5.22 Gbits/sec, cwnd = 1.20 MBytes (constant)

3) Reno on client, BBR on server

upload: 5.29 Gbits/sec, cwnd = 952 KBytes (constant)
download: 3.45 Gbits/sec, cwnd ~ 320 KBytes

4) BBR on client, Reno on server

upload: 3.36 Gbits/sec, cwnd ~ 370 KBytes
download: 5.21 Gbits/sec, cwnd = 887 KBytes (constant)

So, as you can see, when BBR is in use, the upload rate is bad and cwnd is low. 
On real HW (1 Gbps LAN, laptop and server), BBR limits the throughput to 
~100 Mbps (verifiable not only with iperf3, but also with scp while transferring 
files between hosts).

Also, I've tried to use YeAH instead of Reno, and it gives me the same results 
as Reno (IOW, YeAH works fine too).
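The "default and reverse mode" iperf3 runs above map to the `-R` flag; a minimal sketch of the two invocations (the server address is a placeholder of mine):

```shell
# Placeholder address of the host running 'iperf3 -s'.
SERVER=192.0.2.10

upload="iperf3 -c $SERVER"        # default mode: client sends (upload)
download="iperf3 -c $SERVER -R"   # reverse mode: server sends (download)

echo "$upload"
echo "$download"
```

Adding `-t 20` extends each measurement to 20 seconds, as in the later real-HW tests.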

Questions:

1) is this expected?
2) or am I missing some extra BBR tuneable?
3) if it is not a regression (I don't have any previous data to compare with), 
how can I fix this?
4) if it is a bug in BBR, what else should I provide or check for a proper 
investigation?

Thanks.

Regards,
  Oleksandr




Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c

2017-11-10 Thread Oleksandr Natalenko

Uhh, sorry, just found the original submission [1].

[1] https://marc.info/?l=linux-netdev&m=151009763926816&w=2

10.11.2017 14:15, Oleksandr Natalenko wrote:

Hi.

I'm running the machine with this patch applied for 7 hours now, and
the warning hasn't appeared yet. Typically, it should be there within
the first hour.

I'll keep an eye on it for a longer time, but as of now it looks good.

Some explanation on this please?

Thanks!

06.11.2017 23:27, Yuchung Cheng wrote:
...snip...

hi guys can you try if the warning goes away w/ this quick fix?


diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0ada8bfc2ebd..072aab2a8226 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2626,7 +2626,7 @@ void tcp_simple_retransmit(struct sock *sk)

tcp_clear_retrans_hints_partial(tp);

-   if (prior_lost == tp->lost_out)
+   if (!tp->lost_out)
return;

if (tcp_is_reno(tp))

...snip...


Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c

2017-11-10 Thread Oleksandr Natalenko

Hi.

I've been running the machine with this patch applied for 7 hours now, and 
the warning hasn't appeared yet. Typically, it would show up within the 
first hour.


I'll keep an eye on it for a longer time, but as of now it looks good.

Could you give some explanation on this, please?

Thanks!

06.11.2017 23:27, Yuchung Cheng wrote:
...snip...

hi guys can you try if the warning goes away w/ this quick fix?


diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0ada8bfc2ebd..072aab2a8226 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2626,7 +2626,7 @@ void tcp_simple_retransmit(struct sock *sk)

tcp_clear_retrans_hints_partial(tp);

-   if (prior_lost == tp->lost_out)
+   if (!tp->lost_out)
return;

if (tcp_is_reno(tp))

...snip...


Re: [PATCH net] tcp: fix tcp_mtu_probe() vs highest_sack

2017-11-03 Thread Oleksandr Natalenko
Hi.

Thanks for the fix.

However, the tcp_fastretrans_alert() warning case still remains open even with 
this patch. Do I understand correctly that these are 2 different issues?

Currently, I use the latest 4.13 stable kernel + this patch and still get:

WARNING: CPU: 1 PID: 736 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990

Any idea on this?

On Tuesday 31 October 2017 at 7:08:20 CET, Eric Dumazet wrote:
> From: Eric Dumazet <eduma...@google.com>
> 
> Based on SNMP values provided by Roman, Yuchung made the observation
> that some crashes in tcp_sacktag_walk() might be caused by MTU probing.
> 
> Looking at tcp_mtu_probe(), I found that when a new skb was placed
> in front of the write queue, we were not updating tcp highest sack.
> 
> If one skb is freed because all its content was copied to the new skb
> (for MTU probing), then tp->highest_sack could point to a now freed skb.
> 
> Bad things would then happen, including infinite loops.
> 
> This patch renames tcp_highest_sack_combine() and uses it
> from tcp_mtu_probe() to fix the bug.
> 
> Note that I also removed one test against tp->sacked_out,
> since we want to replace tp->highest_sack regardless of whatever
> condition, since keeping a stale pointer to freed skb is a recipe
> for disaster.
> 
> Fixes: a47e5a988a57 ("[TCP]: Convert highest_sack to sk_buff to allow direct
> access") Signed-off-by: Eric Dumazet <eduma...@google.com>
> Reported-by: Alexei Starovoitov <alexei.starovoi...@gmail.com>
> Reported-by: Roman Gushchin <g...@fb.com>
> Reported-by: Oleksandr Natalenko <oleksa...@natalenko.name>
> ---
>  include/net/tcp.h |6 +++---
>  net/ipv4/tcp_output.c |3 ++-
>  2 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index
> 33599d17522d6a19b9d9a316cc1579cd5e71ee32..e6d0002a1b0bc5f28c331a760823c8dc9
> 2f8fe24 100644 --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1771,12 +1771,12 @@ static inline void tcp_highest_sack_reset(struct
> sock *sk) tcp_sk(sk)->highest_sack = tcp_write_queue_head(sk);
>  }
> 
> -/* Called when old skb is about to be deleted (to be combined with new skb)
> */ -static inline void tcp_highest_sack_combine(struct sock *sk,
> +/* Called when old skb is about to be deleted and replaced by new skb */
> +static inline void tcp_highest_sack_replace(struct sock *sk,
>   struct sk_buff *old,
>   struct sk_buff *new)
>  {
> - if (tcp_sk(sk)->sacked_out && (old == tcp_sk(sk)->highest_sack))
> + if (old == tcp_highest_sack(sk))
>   tcp_sk(sk)->highest_sack = new;
>  }
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index
> ae60dd3faed0adc71731bc686f878afd4c628d32..823003eef3a21a5cc5c27e0be9f46159a
> fa060df 100644 --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2062,6 +2062,7 @@ static int tcp_mtu_probe(struct sock *sk)
>   nskb->ip_summed = skb->ip_summed;
> 
>   tcp_insert_write_queue_before(nskb, skb, sk);
> + tcp_highest_sack_replace(sk, skb, nskb);
> 
>   len = 0;
>   tcp_for_write_queue_from_safe(skb, next, sk) {
> @@ -2665,7 +2666,7 @@ static bool tcp_collapse_retrans(struct sock *sk,
> struct sk_buff *skb) else if (!skb_shift(skb, next_skb, next_skb_size))
>   return false;
>   }
> - tcp_highest_sack_combine(sk, next_skb, skb);
> + tcp_highest_sack_replace(sk, next_skb, skb);
> 
>   tcp_unlink_write_queue(next_skb, sk);




Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c

2017-09-28 Thread Oleksandr Natalenko
Hi.

I can't say anything about the panic in tcp_sacktag_walk() since I cannot 
trigger it intentionally, but setting net.ipv4.tcp_retrans_collapse to 0 
*does not* fix the warning in tcp_fastretrans_alert() for me.

On Wednesday 27 September 2017 at 2:18:32 CEST, Yuchung Cheng wrote:
> On Tue, Sep 26, 2017 at 5:12 PM, Yuchung Cheng  wrote:
> > On Tue, Sep 26, 2017 at 6:10 AM, Roman Gushchin  wrote:
> >>> On Wed, Sep 20, 2017 at 6:46 PM, Roman Gushchin  wrote:
> >>> > > Hello.
> >>> > > 
> >>> > > Since, IIRC, v4.11, there is some regression in TCP stack resulting
> >>> > > in the
> >>> > > warning shown below. Most of the time it is harmless, but rarely it
> >>> > > just
> >>> > > causes either freeze or (I believe, this is related too) panic in
> >>> > > tcp_sacktag_walk() (because sk_buff passed to this function is
> >>> > > NULL).
> >>> > > Unfortunately, I still do not have proper stacktrace from panic, but
> >>> > > will try to capture it if possible.
> >>> > > 
> >>> > > Also, I have custom settings regarding TCP stack, shown below as
> >>> > > well. ifb is used to shape traffic with tc.
> >>> > > 
> >>> > > Please note this regression was already reported as BZ [1] and as a
> >>> > > letter to ML [2], but got neither attention nor resolution. It is
> >>> > > reproducible for (not only) me on my home router since v4.11 till
> >>> > > v4.13.1 incl.
> >>> > > 
> >>> > > Please advise on how to deal with it. I'll provide any additional
> >>> > > info if
> >>> > > necessary, also ready to test patches if any.
> >>> > > 
> >>> > > Thanks.
> >>> > > 
> >>> > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=195835
> >>> > > [2]
> >>> > > https://www.spinics.net/lists/netdev/msg436158.html
> >>> > We're experiencing the same problems on some machines in our fleet.
> >>> > Exactly the same symptoms: tcp_fastretrans_alert() warnings and
> >>> > sometimes panics in tcp_sacktag_walk().
> >> 
> >>> > Here is an example of a backtrace with the panic log:
> >> Hi Yuchung!
> >> 
> >>> do you still see the panics if you disable RACK?
> >>> sysctl net.ipv4.tcp_recovery=0?
> >> 
> >> No, we haven't seen any crash since that.
> > 
> > I am out of ideas how RACK can potentially cause tcp_sacktag_walk to
> > take an empty skb :-( Do you have stack trace or any hint on which call
> > to tcp-sacktag_walk triggered the panic? internally at Google we never
> > see that.
> 
> hmm something just struck me: could you try
> sysctl net.ipv4.tcp_recovery=1 net.ipv4.tcp_retrans_collapse=0
> and see if kernel still panics on sack processing?
> 
> >>> also have you experience any sack reneg? could you post the output of
> >>> ' nstat |grep -i TCP' thanks
> >> 
> >> hostname TcpActiveOpens 2289680 0.0
> >> hostname TcpPassiveOpens 3592758 0.0
> >> hostname TcpAttemptFails 746910 0.0
> >> hostname TcpEstabResets 154988 0.0
> >> hostname TcpInSegs 16258678255 0.0
> >> hostname TcpOutSegs 46967011611 0.0
> >> hostname TcpRetransSegs 13724310 0.0
> >> hostname TcpInErrs 2 0.0
> >> hostname TcpOutRsts 9418798 0.0
> >> hostname TcpExtEmbryonicRsts 2303 0.0
> >> hostname TcpExtPruneCalled 90192 0.0
> >> hostname TcpExtOfoPruned 57274 0.0
> >> hostname TcpExtOutOfWindowIcmps 3 0.0
> >> hostname TcpExtTW 1164705 0.0
> >> hostname TcpExtTWRecycled 2 0.0
> >> hostname TcpExtPAWSEstab 159 0.0
> >> hostname TcpExtDelayedACKs 209207209 0.0
> >> hostname TcpExtDelayedACKLocked 508571 0.0
> >> hostname TcpExtDelayedACKLost 1713248 0.0
> >> hostname TcpExtListenOverflows 625 0.0
> >> hostname TcpExtListenDrops 625 0.0
> >> hostname TcpExtTCPHPHits 9341188489 0.0
> >> hostname TcpExtTCPPureAcks 1434646465 0.0
> >> hostname TcpExtTCPHPAcks 5733614672 0.0
> >> hostname TcpExtTCPSackRecovery 3261698 0.0
> >> hostname TcpExtTCPSACKReneging 12203 0.0
> >> hostname TcpExtTCPSACKReorder 433189 0.0
> >> hostname

Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c

2017-09-19 Thread Oleksandr Natalenko
And 2 more events:

===
$ dmesg --time-format iso | grep RIP
…
2017-09-19T16:52:21,623328+0200 RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0
2017-09-19T16:52:40,455296+0200 RIP: 0010:tcp_fastretrans_alert+0x7c8/0x990
2017-09-19T16:52:41,047378+0200 RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0
…
2017-09-19T16:54:59,930726+0200 RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0
2017-09-19T16:55:07,985767+0200 RIP: 0010:tcp_fastretrans_alert+0x7c8/0x990
2017-09-19T16:55:41,911527+0200 RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0
…
===

On Monday 18 September 2017 at 23:40:08 CEST, Yuchung Cheng wrote:
> On Mon, Sep 18, 2017 at 1:46 PM, Oleksandr Natalenko <oleksa...@natalenko.name> wrote:
> > Actually, same warning was just triggered with RACK enabled. But main
> > warning was not triggered in this case.
> 
> Thanks.
> 
> I assume this kernel does not have the patch that Neal proposed in his
> first reply?
> 
> The main warning needs to be triggered by another peculiar SACK that
> kicks the sender into recovery again (after undo). Please let it run
> longer if possible to see if we can get both. But the new data does
> indicate the we can (validly) be in CA_Open with retrans_out > 0.
> 
> > ===
> > Sep 18 22:44:32 defiant kernel: [ cut here ]
> > Sep 18 22:44:32 defiant kernel: WARNING: CPU: 1 PID: 702 at net/ipv4/
> > tcp_input.c:2392 tcp_undo_cwnd_reduction+0xbd/0xd0
> > Sep 18 22:44:32 defiant kernel: Modules linked in: netconsole ctr ccm
> > cls_bpf sch_htb act_mirred cls_u32 sch_ingress sit tunnel4 ip_tunnel
> > 8021q mrp nf_conntrack_ipv6 nf_defrag_ipv6 nft_ct nft_set_bitmap
> > nft_set_hash nft_set_rbtree nf_tables_inet nf_tables_ipv6 nft_masq_ipv4
> > nf_nat_masquerade_ipv4 nft_masq nft_nat nft_counter nft_meta
> > nft_chain_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat
> > nf_conntrack libcrc32c crc32c_generic nf_tables_ipv4 nf_tables tun nct6775
> > nfnetlink hwmon_vid nls_iso8859_1 nls_cp437 vfat fat ext4
> > snd_hda_codec_hdmi mbcache jbd2 snd_hda_codec_realtek
> > snd_hda_codec_generic f2fs arc4 fscrypto intel_rapl iTCO_wdt ath9k
> > iTCO_vendor_support intel_powerclamp ath9k_common ath9k_hw coretemp
> > kvm_intel ath mac80211 kvm irqbypass intel_cstate cfg80211 pcspkr
> > snd_hda_intel snd_hda_codec r8169
> > Sep 18 22:44:32 defiant kernel:  joydev evdev mii snd_hda_core mousedev
> > mei_txe input_leds i2c_i801 mac_hid i915 lpc_ich mei shpchp snd_hwdep
> > snd_intel_sst_acpi snd_intel_sst_core snd_soc_rt5670
> > snd_soc_sst_atom_hifi2_platform battery snd_soc_sst_match snd_soc_rl6231
> > drm_kms_helper hci_uart ov5693(C) ov2722(C) lm3554(C) btbcm btqca
> > v4l2_common snd_soc_core btintel snd_compress videodev snd_pcm_dmaengine
> > snd_pcm video bluetooth snd_timer drm media tpm_tis snd i2c_hid soundcore
> > tpm_tis_core rfkill_gpio ac97_bus soc_button_array ecdh_generic rfkill
> > crc16 tpm 8250_dw intel_gtt syscopyarea sysfillrect acpi_pad sysimgblt
> > intel_int0002_vgpio fb_sys_fops pinctrl_cherryview i2c_algo_bit button
> > sch_fq_codel tcp_bbr ifb ip_tables x_tables btrfs xor raid6_pq
> > algif_skcipher af_alg hid_logitech_hidpp hid_logitech_dj usbhid hid uas
> > Sep 18 22:44:32 defiant kernel:  usb_storage dm_crypt dm_mod dax raid10
> > md_mod sd_mod crct10dif_pclmul crc32_pclmul crc32c_intel
> > ghash_clmulni_intel pcbc ahci aesni_intel xhci_pci libahci aes_x86_64
> > crypto_simd glue_helper xhci_hcd cryptd libata usbcore scsi_mod
> > usb_common serio sdhci_acpi sdhci led_class mmc_core
> > Sep 18 22:44:32 defiant kernel: CPU: 1 PID: 702 Comm: irq/123-enp3s0 Tainted: GWC  4.13.0-pf4 #1
> > Sep 18 22:44:32 defiant kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J3710-ITX, BIOS P1.30 03/30/2016
> > Sep 18 22:44:32 defiant kernel: task: 88923a738000 task.stack: 95800150
> > Sep 18 22:44:32 defiant kernel: RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0
> > Sep 18 22:44:32 defiant kernel: RSP: 0018:88927fc83a48 EFLAGS: 00010202
> > Sep 18 22:44:32 defiant kernel: RAX: 0001 RBX: 8892412d9800 RCX: 88927fc83b0c
> > Sep 18 22:44:32 defiant kernel: RDX: 7fff RSI: 0001 RDI: 8892412d9800
> > Sep 18 22:44:32 defiant kernel: RBP: 88927fc83a50 R08:  R09: 18dfb063
> > Sep 18 22:44:32 defiant kernel: R10: 18dfd223 R11: 18dfb063 R12: 5320
> > Sep 18 22:44:32 defiant kernel: R13: 88927fc83b10 R14: 0001 R15: 88927fc83b0c
> > Sep 

Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c

2017-09-19 Thread Oleksandr Natalenko

Hi.

18.09.2017 23:40, Yuchung Cheng wrote:

I assume this kernel does not have the patch that Neal proposed in his
first reply?


Correct.


The main warning needs to be triggered by another peculiar SACK that
kicks the sender into recovery again (after undo). Please let it run
longer if possible to see if we can get both. But the new data does
indicate that we can (validly) be in CA_Open with retrans_out > 0.


OK, here it is:

===
» LC_TIME=C jctl -kb | grep RIP
…
Sep 19 12:54:03 defiant kernel: RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0
Sep 19 12:54:22 defiant kernel: RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0
Sep 19 12:54:25 defiant kernel: RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0
Sep 19 12:56:00 defiant kernel: RIP: 0010:tcp_fastretrans_alert+0x7c8/0x990
Sep 19 12:57:07 defiant kernel: RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0
Sep 19 12:57:14 defiant kernel: RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0
Sep 19 12:58:04 defiant kernel: RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0

…
===

Note the timestamps — the two types of warnings are distant in time, so they
did not happen at once.


While I am still running this kernel, is there anything else I can check for you?


Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c

2017-09-18 Thread Oleksandr Natalenko
 22:44:32 defiant kernel:  tasklet_action+0x63/0x120
Sep 18 22:44:32 defiant kernel:  __do_softirq+0xdf/0x2e5
Sep 18 22:44:32 defiant kernel:  ? irq_finalize_oneshot.part.39+0xe0/0xe0
Sep 18 22:44:32 defiant kernel:  do_softirq_own_stack+0x1c/0x30
Sep 18 22:44:32 defiant kernel:  
Sep 18 22:44:32 defiant kernel:  do_softirq.part.17+0x4e/0x60
Sep 18 22:44:32 defiant kernel:  __local_bh_enable_ip+0x77/0x80
Sep 18 22:44:32 defiant kernel:  irq_forced_thread_fn+0x5c/0x70
Sep 18 22:44:32 defiant kernel:  irq_thread+0x131/0x1a0
Sep 18 22:44:32 defiant kernel:  ? wake_threads_waitq+0x30/0x30
Sep 18 22:44:32 defiant kernel:  kthread+0x126/0x140
Sep 18 22:44:32 defiant kernel:  ? irq_thread_check_affinity+0x90/0x90
Sep 18 22:44:32 defiant kernel:  ? kthread_create_on_node+0x70/0x70
Sep 18 22:44:32 defiant kernel:  ret_from_fork+0x25/0x30
Sep 18 22:44:32 defiant kernel: Code: 5d c3 80 60 35 fb 48 8b 00 48 39 c2 74 
85 48 3b 83 50 01 00 00 75 eb e9 77 ff ff ff 89 83 48 06 00 00 80 a3 1e 06 00 
00 fb eb b3 <0f> ff 5b 5d c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 
Sep 18 22:44:32 defiant kernel: ---[ end trace 1aea180efeedb474 ]---
===

On pondělí 18. září 2017 20:01:42 CEST Yuchung Cheng wrote:
> On Mon, Sep 18, 2017 at 10:59 AM, Oleksandr Natalenko
> 
> <oleksa...@natalenko.name> wrote:
> > OK. Should I keep FACK disabled?
> 
> Yes, since it is disabled upstream by default. Although you can
> additionally experiment with FACK enabled.
> 
> Do we know the crash you first experienced is tied to this issue?
> 
> > On pondělí 18. září 2017 19:51:21 CEST Yuchung Cheng wrote:
> >> Can you try this patch to verify my theory with tcp_recovery=0 and 1?
> >> thanks
> >> 
> >> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >> index 5af2f04f8859..9253d9ee7d0e 100644
> >> --- a/net/ipv4/tcp_input.c
> >> +++ b/net/ipv4/tcp_input.c
> >> @@ -2381,6 +2381,7 @@ static void tcp_undo_cwnd_reduction(struct sock
> >> *sk, bool unmark_loss)
> >> 
> >> }
> >> tp->snd_cwnd_stamp = tcp_time_stamp;
> >> tp->undo_marker = 0;
> >> 
> >> +   WARN_ON(tp->retrans_out);
> >> 
> >>  }




Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c

2017-09-18 Thread Oleksandr Natalenko
:18:34 defiant kernel:  tasklet_action+0x63/0x120
Sep 18 22:18:34 defiant kernel:  __do_softirq+0xdf/0x2e5
Sep 18 22:18:34 defiant kernel:  ? irq_finalize_oneshot.part.39+0xe0/0xe0
Sep 18 22:18:34 defiant kernel:  do_softirq_own_stack+0x1c/0x30
Sep 18 22:18:34 defiant kernel:  
Sep 18 22:18:34 defiant kernel:  do_softirq.part.17+0x4e/0x60
Sep 18 22:18:34 defiant kernel:  __local_bh_enable_ip+0x77/0x80
Sep 18 22:18:34 defiant kernel:  irq_forced_thread_fn+0x5c/0x70
Sep 18 22:18:34 defiant kernel:  irq_thread+0x131/0x1a0
Sep 18 22:18:34 defiant kernel:  ? wake_threads_waitq+0x30/0x30
Sep 18 22:18:34 defiant kernel:  kthread+0x126/0x140
Sep 18 22:18:34 defiant kernel:  ? irq_thread_check_affinity+0x90/0x90
Sep 18 22:18:34 defiant kernel:  ? kthread_create_on_node+0x70/0x70
Sep 18 22:18:34 defiant kernel:  ret_from_fork+0x25/0x30
Sep 18 22:18:34 defiant kernel: Code: 5d c3 80 60 35 fb 48 8b 00 48 39 c2 74 
85 48 3b 83 50 01 00 00 75 eb e9 77 ff ff ff 89 83 48 06 00 00 80 a3 1e 06 00 
00 fb eb b3 <0f> ff 5b 5d c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 
Sep 18 22:18:34 defiant kernel: ---[ end trace 1aea180efeedb473 ]---
===

Should I continue with net.ipv4.tcp_recovery = 1, or is this enough?

On pondělí 18. září 2017 20:01:42 CEST Yuchung Cheng wrote:
> On Mon, Sep 18, 2017 at 10:59 AM, Oleksandr Natalenko
> 
> <oleksa...@natalenko.name> wrote:
> > OK. Should I keep FACK disabled?
> 
> Yes, since it is disabled upstream by default. Although you can
> additionally experiment with FACK enabled.
> 
> Do we know the crash you first experienced is tied to this issue?
> 
> > On pondělí 18. září 2017 19:51:21 CEST Yuchung Cheng wrote:
> >> Can you try this patch to verify my theory with tcp_recovery=0 and 1?
> >> thanks
> >> 
> >> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >> index 5af2f04f8859..9253d9ee7d0e 100644
> >> --- a/net/ipv4/tcp_input.c
> >> +++ b/net/ipv4/tcp_input.c
> >> @@ -2381,6 +2381,7 @@ static void tcp_undo_cwnd_reduction(struct sock
> >> *sk, bool unmark_loss)
> >> 
> >> }
> >> tp->snd_cwnd_stamp = tcp_time_stamp;
> >> tp->undo_marker = 0;
> >> 
> >> +   WARN_ON(tp->retrans_out);
> >> 
> >>  }




Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c

2017-09-18 Thread Oleksandr Natalenko
On pondělí 18. září 2017 20:01:42 CEST Yuchung Cheng wrote:
> Yes, since it is disabled upstream by default. Although you can
> additionally experiment with FACK enabled.

OK.

> Do we know the crash you first experienced is tied to this issue?

No, unfortunately. I wasn't able to re-create it again, so let's focus on the
tcp_fastretrans_alert warning only.


Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c

2017-09-18 Thread Oleksandr Natalenko
OK. Should I keep FACK disabled?

On pondělí 18. září 2017 19:51:21 CEST Yuchung Cheng wrote:
> Can you try this patch to verify my theory with tcp_recovery=0 and 1? thanks
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 5af2f04f8859..9253d9ee7d0e 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2381,6 +2381,7 @@ static void tcp_undo_cwnd_reduction(struct sock
> *sk, bool unmark_loss)
> }
> tp->snd_cwnd_stamp = tcp_time_stamp;
> tp->undo_marker = 0;
> +   WARN_ON(tp->retrans_out);
>  }



Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c

2017-09-17 Thread Oleksandr Natalenko
Hi.

Just to note that it looks like disabling RACK and re-enabling FACK prevents
the warning from happening:

net.ipv4.tcp_fack = 1
net.ipv4.tcp_recovery = 0

Hope I get the semantics of these tunables right.
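Concretely, this is how the two tunables can be applied and persisted (the
/etc/sysctl.d path is an assumption about a systemd-style distribution, and
the commands need root):

```shell
# Apply at runtime.
sysctl -w net.ipv4.tcp_fack=1
sysctl -w net.ipv4.tcp_recovery=0

# Verify the current values.
sysctl net.ipv4.tcp_fack net.ipv4.tcp_recovery

# Persist across reboots (path is distribution-dependent).
printf 'net.ipv4.tcp_fack = 1\nnet.ipv4.tcp_recovery = 0\n' \
    > /etc/sysctl.d/90-tcp-fack-rack.conf
```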

On pátek 15. září 2017 21:04:36 CEST Oleksandr Natalenko wrote:
> Hello.
> 
> With net.ipv4.tcp_fack set to 0 the warning still appears:
> 
> ===
> » sysctl net.ipv4.tcp_fack
> net.ipv4.tcp_fack = 0
> 
> » LC_TIME=C dmesg -T | grep WARNING
> [Fri Sep 15 20:40:30 2017] WARNING: CPU: 1 PID: 711 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990
> [Fri Sep 15 20:40:30 2017] WARNING: CPU: 0 PID: 711 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990
> [Fri Sep 15 20:48:37 2017] WARNING: CPU: 1 PID: 711 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990
> [Fri Sep 15 20:48:55 2017] WARNING: CPU: 0 PID: 711 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990
> 
> » ps -up 711
> USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> root   711  4.3  0.0  0 0 ?S18:12   7:23 [irq/123-enp3s0]
> ===
> 
> Any suggestions?
> 
> On pátek 15. září 2017 16:03:00 CEST Neal Cardwell wrote:
> > Thanks for testing that. That is a very useful data point.
> > 
> > I was able to cook up a packetdrill test that could put the connection
> > in CA_Disorder with retransmitted packets out, but not in CA_Open. So
> > we do not yet have a test case to reproduce this.
> > 
> > We do not see this warning on our fleet at Google. One significant
> > difference I see between our environment and yours is that it seems
> > 
> > you run with FACK enabled:
> >   net.ipv4.tcp_fack = 1
> > 
> > Note that FACK was disabled by default (since it was replaced by RACK)
> > between kernel v4.10 and v4.11. And this is exactly the time when this
> > bug started manifesting itself for you and some others, but not our
> > fleet. So my new working hypothesis would be that this warning is due
> > to a behavior that only shows up in kernels >=4.11 when FACK is
> > enabled.
> > 
> > Would you be able to disable FACK ("sysctl net.ipv4.tcp_fack=0" at
> > boot, or net.ipv4.tcp_fack=0 in /etc/sysctl.conf, or equivalent),
> > reboot, and test the kernel for a few days to see if the warning still
> > pops up?
> > 
> > thanks,
> > neal
> > 
> > [ps: apologies for the previous, mis-formatted post...]




Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c

2017-09-15 Thread Oleksandr Natalenko
Hello.

With net.ipv4.tcp_fack set to 0 the warning still appears:

===
» sysctl net.ipv4.tcp_fack 
net.ipv4.tcp_fack = 0

» LC_TIME=C dmesg -T | grep WARNING
[Fri Sep 15 20:40:30 2017] WARNING: CPU: 1 PID: 711 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990
[Fri Sep 15 20:40:30 2017] WARNING: CPU: 0 PID: 711 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990
[Fri Sep 15 20:48:37 2017] WARNING: CPU: 1 PID: 711 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990
[Fri Sep 15 20:48:55 2017] WARNING: CPU: 0 PID: 711 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990

» ps -up 711
USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
root   711  4.3  0.0  0 0 ?S18:12   7:23 [irq/123-enp3s0]
===

Any suggestions?

On pátek 15. září 2017 16:03:00 CEST Neal Cardwell wrote:
> Thanks for testing that. That is a very useful data point.
> 
> I was able to cook up a packetdrill test that could put the connection
> in CA_Disorder with retransmitted packets out, but not in CA_Open. So
> we do not yet have a test case to reproduce this.
> 
> We do not see this warning on our fleet at Google. One significant
> difference I see between our environment and yours is that it seems
> you run with FACK enabled:
> 
>   net.ipv4.tcp_fack = 1
> 
> Note that FACK was disabled by default (since it was replaced by RACK)
> between kernel v4.10 and v4.11. And this is exactly the time when this
> bug started manifesting itself for you and some others, but not our
> fleet. So my new working hypothesis would be that this warning is due
> to a behavior that only shows up in kernels >=4.11 when FACK is
> enabled.
> 
> Would you be able to disable FACK ("sysctl net.ipv4.tcp_fack=0" at
> boot, or net.ipv4.tcp_fack=0 in /etc/sysctl.conf, or equivalent),
> reboot, and test the kernel for a few days to see if the warning still
> pops up?
> 
> thanks,
> neal
> 
> [ps: apologies for the previous, mis-formatted post...]




Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c

2017-09-14 Thread Oleksandr Natalenko
Hi.

I've applied your test patch, but it doesn't fix the issue for me, since the
warning is still there.

Were you able to reproduce it?

On pondělí 11. září 2017 1:59:02 CEST Neal Cardwell wrote:
> Thanks for the detailed report!
> 
> I suspect this is due to the following commit, which happened between
> 4.10 and 4.11:
> 
>   89fe18e44f7e tcp: extend F-RTO to catch more spurious timeouts
>  
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?
> id=89fe18e44f7e
> 
> This commit expanded the set of scenarios where we would undo a
> CA_Loss cwnd reduction and return to TCP_CA_Open, but did not include
> a check to see if there were any in-flight retransmissions. I think we
> need a fix like the following:
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 659d1baefb2b..730a2de9d2b0 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2439,7 +2439,7 @@ static bool tcp_try_undo_loss(struct sock *sk,
> bool frto_undo)
>  {
> struct tcp_sock *tp = tcp_sk(sk);
> 
> -   if (frto_undo || tcp_may_undo(tp)) {
> +   if ((frto_undo || tcp_may_undo(tp)) && !tp->retrans_out) {
> tcp_undo_cwnd_reduction(sk, true);
> 
> DBGUNDO(sk, "partial loss");
> 
> I will try a packetdrill test to see if I can reproduce this issue and
> verify the fix.
> 
> thanks,
> neal




[REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c

2017-09-10 Thread Oleksandr Natalenko
Hello.

Since v4.11 (IIRC), there has been a regression in the TCP stack resulting in
the warning shown below. Most of the time it is harmless, but rarely it causes
either a freeze or (I believe this is related too) a panic in
tcp_sacktag_walk() (because the sk_buff passed to this function is NULL).
Unfortunately, I still do not have a proper stacktrace from the panic, but I
will try to capture one if possible.

Also, I have custom settings for the TCP stack, shown below as well. ifb is
used to shape traffic with tc.

Please note this regression was already reported as a BZ [1] and in a letter to
the ML [2], but received neither attention nor resolution. It is reproducible
(not only for me) on my home router from v4.11 up to and including v4.13.1.

Please advise on how to deal with it. I'll provide any additional info if
necessary, and I am also ready to test patches.

Thanks.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=195835
[2] https://www.spinics.net/lists/netdev/msg436158.html

=== warning
[14407.060066] [ cut here ]
[14407.060353] WARNING: CPU: 0 PID: 719 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990
[14407.060747] Modules linked in: netconsole ctr ccm cls_bpf sch_htb act_mirred cls_u32 sch_ingress sit tunnel4 ip_tunnel 8021q mrp nf_conntrack_ipv6 nf_defrag_ipv6 nft_ct nft_set_bitmap nft_set_hash nft_set_rbtree nf_tables_inet nf_tables_ipv6 nft_masq_ipv4 nf_nat_masquerade_ipv4 nft_masq nft_nat nft_counter nft_meta nft_chain_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c crc32c_generic nf_tables_ipv4 tun nf_tables nfnetlink nct6775 hwmon_vid nls_iso8859_1 nls_cp437 vfat fat ext4 mbcache jbd2 arc4 f2fs snd_hda_codec_hdmi fscrypto snd_hda_codec_realtek snd_hda_codec_generic intel_rapl intel_powerclamp coretemp iTCO_wdt iTCO_vendor_support ath9k ath9k_common kvm_intel ath9k_hw kvm ath irqbypass intel_cstate mac80211 pcspkr snd_intel_sst_acpi i2c_i801 i915 snd_hda_intel
[14407.063800]  snd_intel_sst_core r8169 cfg80211 evdev mii snd_hda_codec joydev mousedev input_leds snd_soc_rt5670 mei_txe snd_soc_sst_atom_hifi2_platform snd_hda_core snd_soc_rl6231 snd_soc_sst_match mac_hid mei lpc_ich shpchp drm_kms_helper snd_hwdep snd_soc_core snd_compress battery snd_pcm_dmaengine drm hci_uart ov2722(C) snd_pcm lm3554(C) ov5693(C) snd_timer v4l2_common btbcm snd intel_gtt btqca btintel videodev syscopyarea bluetooth video soundcore sysfillrect media sysimgblt ac97_bus ecdh_generic rfkill_gpio i2c_hid rfkill tpm_tis crc16 fb_sys_fops i2c_algo_bit 8250_dw tpm_tis_core tpm soc_button_array pinctrl_cherryview intel_int0002_vgpio acpi_pad button sch_fq_codel tcp_bbr ifb ip_tables x_tables btrfs xor raid6_pq algif_skcipher af_alg hid_logitech_hidpp hid_logitech_dj usbhid hid uas usb_storage
[14407.066873]  dm_crypt dm_mod dax raid10 md_mod sd_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd ahci xhci_pci libahci xhci_hcd libata usbcore scsi_mod usb_common serio sdhci_acpi sdhci led_class mmc_core
[14407.068034] CPU: 0 PID: 719 Comm: irq/123-enp3s0 Tainted: G C  4.13.0-pf2 #1
[14407.068403] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J3710-ITX, BIOS P1.30 03/30/2016
[14407.068827] task: 98b1c0a05400 task.stack: bb59c15c
[14407.069111] RIP: 0010:tcp_fastretrans_alert+0x7c8/0x990
[14407.069358] RSP: 0018:98b1ffc03a78 EFLAGS: 00010202
[14407.069928] RAX:  RBX: 98b135ae RCX: 98b1ffc03b0c
[14407.070248] RDX: 0001 RSI: 0001 RDI: 98b135ae
[14407.070565] RBP: 98b1ffc03ab8 R08:  R09: 98b1ffc03b60
[14407.070884] R10:  R11:  R12: 5120
[14407.071205] R13: 98b1ffc03b10 R14: 0001 R15: 98b1ffc03b0c
[14407.071564] FS:  () GS:98b1ffc0() knlGS:
[14407.071827] CS:  0010 DS:  ES:  CR0: 80050033
[14407.072146] CR2: 7ffc580b2f0f CR3: 10a09000 CR4: 001006f0
[14407.072146] Call Trace:
[14407.072279]  
[14407.072412]  ? sk_reset_timer+0x18/0x30
[14407.072610]  tcp_ack+0x741/0x1110
[14407.072810]  tcp_rcv_established+0x325/0x770
[14407.073033]  ? sk_filter_trim_cap+0xd4/0x1a0
[14407.073249]  tcp_v4_do_rcv+0x90/0x1e0
[14407.073449]  tcp_v4_rcv+0x950/0xa10
[14407.073647]  ? nf_ct_deliver_cached_events+0xb8/0x110 [nf_conntrack]
[14407.073955]  ip_local_deliver_finish+0x68/0x210
[14407.074183]  ip_local_deliver+0xfa/0x110
[14407.074385]  ? ip_rcv_finish+0x410/0x410
[14407.074589]  ip_rcv_finish+0x120/0x410
[14407.074782]  ip_rcv+0x28e/0x3b0
[14407.074952]  ? inet_del_offload+0x40/0x40
[14407.075154]  __netif_receive_skb_core+0x39b/0xb00
[14407.075389]  ? netif_receive_skb_internal+0xa0/0x480
[14407.075635]  ? skb_release_all+0x24/0x30
[14407.075832]  ? consume_skb+0x38/0xa0
[14407.076025]  

kernel BUG at net/netfilter/nf_nat_core.c:395

2016-02-10 Thread Oleksandr Natalenko
Hi.

With 4.4.1 I've got BUG_ON() triggered in net/netfilter/nf_nat_core.c:395, 
nf_nat_setup_info(), today on my home router.

Here is full trace got via netconsole: [1]

I perform LAN NATting using nftables like this:

===
table ip nat {
chain prerouting {
type nat hook prerouting priority -150;
}
 
chain postrouting {
type nat hook postrouting priority -150;
 
oifname enp2s0 ip saddr 172.17.28.0/24 counter snat 1.2.3.4
oifname enp2s0 ip saddr 172.17.29.0/24 counter snat 1.2.3.4
oifname enp2s0 ip saddr 172.17.31.0/24 counter snat 1.2.3.4
oifname enp2s0 ip saddr 172.17.35.0/24 counter snat 1.2.3.4
oifname enp2s0 ip saddr 172.17.37.0/24 counter snat 1.2.3.4
oifname tun0 ip saddr 172.17.28.0/24 counter masquerade
oifname tun0 ip saddr 172.17.29.0/24 counter masquerade
oifname tinc0 ip saddr 172.17.28.0/24 counter masquerade
oifname tinc0 ip saddr 172.17.29.0/24 counter masquerade
}
}
===

Traffic filtering is done via nftables as well.

Ideas? What could I do to debug the issue better?

[1] https://gist.github.com/bbb3712f40a7753537fe


Re: [REGRESSION] tcp/ipv4: kernel panic because of (possible) division by zero

2016-01-06 Thread Oleksandr Natalenko
Sure, but after catching the stacktrace.

On середа, 6 січня 2016 р. 10:43:45 EET Yuchung Cheng wrote:
> Could you turn off ECN (sysctl net.ipv4.tcp_ecn=0) to see if this still
> happens?


> >> On December 22, 2015 4:10:32 AM EET, Yuchung Cheng <ych...@google.com> 
wrote:
> >> >On Mon, Dec 21, 2015 at 12:25 PM, Oleksandr Natalenko
> >> >
> >> ><oleksa...@natalenko.name> wrote:
> >> >> Commit 3759824da87b30ce7a35b4873b62b0ba38905ef5 (tcp: PRR uses CRB
> >> >
> >> >mode by
> >> >
> >> >> default and SS mode conditionally) introduced changes to
> >> >
> >> >net/ipv4/tcp_input.c
> >> >
> >> >> tcp_cwnd_reduction() that, possibly, cause division by zero, and
> >> >
> >> >therefore,
> >> >
> >> >> kernel panic in interrupt handler [1].
> >> >> 
> >> >> Reverting 3759824da87b30ce7a35b4873b62b0ba38905ef5 seems to fix the
> >> >
> >> >issue.
> >> >
> >> >> I'm able to reproduce the issue on 4.3.0–4.3.3 once every several days
> >> >> (occasionally).
> >> >> 
> >> >> What could be done to help in debugging this issue?
> >> >
> >> >Do you have ECN enabled (i.e. sysctl net.ipv4.tcp_ecn > 0)?
> >> >
> >> >If so I suspect an ACK carrying ECE during CA_Loss causes entering CWR
> >> >state w/o calling tcp_init_cwnd_reduct() to set tp->prior_cwnd. Can
> >> >you try this debug / quick-fix patch and send me the error message if
> >> >any?
> >> >
> >> >> Regards,
> >> >> 
> >> >>   Oleksandr.
> >> >> 
> >> >> [1] http://i.piccy.info/
> >> >
> >> >i9/6f5cb187c4ff282d189f78c63f95af43/1450729403/283985/951663/panic.jpg
> >> 
> >> --
> >> Sent from my Android device with K-9 Mail. Please excuse my brevity.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [REGRESSION] tcp/ipv4: kernel panic because of (possible) division by zero

2016-01-06 Thread Oleksandr Natalenko
Unfortunately, the patch didn't help -- I got the same stacktrace with a
slightly different offset (+3) within the function.

Now trying to get a full stacktrace via netconsole. Need more time.

Meanwhile, any other ideas on what went wrong?


On December 22, 2015 4:10:32 AM EET, Yuchung Cheng <ych...@google.com> wrote:
>On Mon, Dec 21, 2015 at 12:25 PM, Oleksandr Natalenko
><oleksa...@natalenko.name> wrote:
>> Commit 3759824da87b30ce7a35b4873b62b0ba38905ef5 (tcp: PRR uses CRB
>mode by
>> default and SS mode conditionally) introduced changes to
>net/ipv4/tcp_input.c
>> tcp_cwnd_reduction() that, possibly, cause division by zero, and
>therefore,
>> kernel panic in interrupt handler [1].
>>
>> Reverting 3759824da87b30ce7a35b4873b62b0ba38905ef5 seems to fix the
>issue.
>>
>> I'm able to reproduce the issue on 4.3.0–4.3.3 once every several days
>> (occasionally).
>>
>> What could be done to help in debugging this issue?
>Do you have ECN enabled (i.e. sysctl net.ipv4.tcp_ecn > 0)?
>
>If so I suspect an ACK carrying ECE during CA_Loss causes entering CWR
>state w/o calling tcp_init_cwnd_reduction() to set tp->prior_cwnd. Can
>you try this debug / quick-fix patch and send me the error message if
>any?
>
>
>>
>> Regards,
>>   Oleksandr.
>>
>> [1] http://i.piccy.info/
>>
>i9/6f5cb187c4ff282d189f78c63f95af43/1450729403/283985/951663/panic.jpg

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [REGRESSION] tcp/ipv4: kernel panic because of (possible) division by zero

2015-12-22 Thread Oleksandr Natalenko
That is correct, I have net.ipv4.tcp_ecn set to 1.

I've recompiled the kernel with the proposed patch and am now waiting for the
issue to be triggered.

Could I manually simulate the erroneous TCP ECN behavior to speed up the 
debugging?

On понеділок, 21 грудня 2015 р. 18:10:32 EET Yuchung Cheng wrote:
> On Mon, Dec 21, 2015 at 12:25 PM, Oleksandr Natalenko
> 
> <oleksa...@natalenko.name> wrote:
> > Commit 3759824da87b30ce7a35b4873b62b0ba38905ef5 (tcp: PRR uses CRB mode by
> > default and SS mode conditionally) introduced changes to
> > net/ipv4/tcp_input.c tcp_cwnd_reduction() that, possibly, cause division
> > by zero, and therefore, kernel panic in interrupt handler [1].
> > 
> > Reverting 3759824da87b30ce7a35b4873b62b0ba38905ef5 seems to fix the issue.
> > 
> > I'm able to reproduce the issue on 4.3.0–4.3.3 once per several day
> > (occasionally).
> > 
> > What could be done to help in debugging this issue?
> 
> Do you have ECN enabled (i.e. sysctl net.ipv4.tcp_ecn > 0)?
> 
> If so I suspect an ACK carrying ECE during CA_Loss causes entering CWR
> state w/o calling tcp_init_cwnd_reduction() to set tp->prior_cwnd. Can
> you try this debug / quick-fix patch and send me the error message if
> any?
> 
> > Regards,
> > 
> >   Oleksandr.
> > 
> > [1] http://i.piccy.info/
> > i9/6f5cb187c4ff282d189f78c63f95af43/1450729403/283985/951663/panic.jpg




[REGRESSION] tcp/ipv4: kernel panic because of (possible) division by zero

2015-12-21 Thread Oleksandr Natalenko
Commit 3759824da87b30ce7a35b4873b62b0ba38905ef5 (tcp: PRR uses CRB mode by 
default and SS mode conditionally) introduced changes to net/ipv4/tcp_input.c 
tcp_cwnd_reduction() that, possibly, cause division by zero, and therefore, 
kernel panic in interrupt handler [1].

Reverting 3759824da87b30ce7a35b4873b62b0ba38905ef5 seems to fix the issue.

I'm able to reproduce the issue on 4.3.0–4.3.3 once every several days
(occasionally).

What could be done to help in debugging this issue?

Regards,
  Oleksandr.

[1] http://i.piccy.info/
i9/6f5cb187c4ff282d189f78c63f95af43/1450729403/283985/951663/panic.jpg