Re: [PATCH net-next 0/6] tcp: remove non GSO code
Hi.

On Wednesday 21 February 2018 0:21:37 CET Eric Dumazet wrote:
> My latest patch (fixing BBR underestimation of cwnd) was meant for the
> net tree, on a NIC where SG/TSO/GSO are disabled.
>
> (ie when sk->sk_gso_max_segs is not set to 'infinite')
>
> It is packet scheduler independent really.
>
> Tested here with pfifo_fast, TSO/GSO off.

Well, before the patch, with BBR and sg off, I get ~450 Mbps for fq and ~115 Mbps for pfifo_fast. So, compared to what I see with the patch (850 and 200 Mbps respectively), it is definitely an improvement. Thanks.
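[For reference, the before/after numbers above come from toggling scatter-gather and swapping the root qdisc; a minimal sketch of that procedure follows. The interface name and server address are placeholders, and iperf3's -C flag (per-run congestion control) requires Linux and the tcp_bbr module being available.]

```shell
# Disabling sg also drags tso/gso down with it, which is what exercises
# the non-GSO TCP path being discussed in this thread.
ethtool -K eth0 sg off

tc qdisc replace dev eth0 root pfifo_fast
iperf3 -c 192.0.2.1 -t 20 -C bbr    # ~115 Mbps before the patch, ~200 after

tc qdisc replace dev eth0 root fq
iperf3 -c 192.0.2.1 -t 20 -C bbr    # ~450 Mbps before the patch, ~850 after
```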
Re: [PATCH net-next 0/6] tcp: remove non GSO code
On Tuesday 20 February 2018 21:09:37 CET Eric Dumazet wrote:
> Also you can tune your NIC to accept a few MSS per GSO/TSO packet:
>
>   ip link set dev eth0 gso_max_segs 2
>
> So even if TSO/GSO is there, BBR should not use sk->sk_gso_max_segs to
> size its bursts, since burst sizes are also impacting GRO on the
> receiver.

net-next + 7 patches (6 from the patchset + this one).

Before playing with gso_max_segs:

BBR+fq:          sg on: 4.39 Gbits/sec, sg off: 1.33 Gbits/sec
BBR+fq_codel:    sg on: 4.02 Gbits/sec, sg off: 1.41 Gbits/sec
BBR+pfifo_fast:  sg on: 3.66 Gbits/sec, sg off: 1.41 Gbits/sec
Reno+fq:         sg on: 5.69 Gbits/sec, sg off: 1.53 Gbits/sec
Reno+fq_codel:   sg on: 6.33 Gbits/sec, sg off: 1.50 Gbits/sec
Reno+pfifo_fast: sg on: 6.26 Gbits/sec, sg off: 1.48 Gbits/sec

After "ip link set dev eth1 gso_max_segs 2":

BBR+fq:          sg on: 806 Mbits/sec,  sg off: 886 Mbits/sec
BBR+fq_codel:    sg on: 206 Mbits/sec,  sg off: 207 Mbits/sec
BBR+pfifo_fast:  sg on: 220 Mbits/sec,  sg off: 200 Mbits/sec
Reno+fq:         sg on: 2.16 Gbits/sec, sg off: 1.27 Gbits/sec
Reno+fq_codel:   sg on: 2.45 Gbits/sec, sg off: 1.52 Gbits/sec
Reno+pfifo_fast: sg on: 2.31 Gbits/sec, sg off: 1.54 Gbits/sec

Oleksandr
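[A back-of-envelope illustration of why gso_max_segs=2 collapses throughput: every skb now carries at most 2 MSS, so per-packet costs are paid an order of magnitude more often. The 1448-byte MSS, the 45-segment (~64 KB) default skb size, and the 10 Gbit/s target are assumptions for the sake of the arithmetic, not measurements from this thread.]

```python
# How many skbs per second the stack must produce to sustain a given
# rate, as a function of how many MSS fit in one GSO/TSO packet.
MSS = 1448  # bytes of TCP payload per segment (assumed)

def skbs_per_sec(rate_bps, segs_per_skb):
    payload_bits = segs_per_skb * MSS * 8
    return rate_bps / payload_bits

rate = 10e9  # assumed multi-gigabit virtio-class path
print(round(skbs_per_sec(rate, 45)))  # ~64 KB skbs: roughly 19k skbs/s
print(round(skbs_per_sec(rate, 2)))   # gso_max_segs=2: roughly 430k skbs/s
```

The ~22x jump in packet rate is what the qdisc, the driver and GRO on the receiver all have to absorb, which matches the drop from multiple Gbit/s to a few hundred Mbit/s reported above.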
Re: [PATCH net-next 0/6] tcp: remove non GSO code
On Tuesday 20 February 2018 20:56:24 CET Eric Dumazet wrote:
> That is with the other patches _not_ applied?

Yes, the other patches are not applied. It is v4.15.4 + this patch only + BBR + fq_codel or pfifo_fast. Shall I re-test it on net-next with the whole patchset (because it does not apply cleanly to 4.15)?

Oleksandr
Re: [PATCH net-next 0/6] tcp: remove non GSO code
On Tuesday 20 February 2018 20:39:49 CET Eric Dumazet wrote:
> I am not trying to compare BBR and Reno on a lossless link.
>
> Reno is running as fast as possible and will win when bufferbloat is
> not an issue.
>
> If bufferbloat is not an issue, simply use Reno and be happy ;)
>
> My patch helps BBR only, I thought it was obvious ;)

Umm, yes, and my point was rather something like "the speed on a lossless link while using BBR with and without this patch is the same". Sorry for the confusion. I guess the key word here is "lossless".

Oleksandr
Re: [PATCH net-next 0/6] tcp: remove non GSO code
Hi.

On Tuesday 20 February 2018 19:57:42 CET Eric Dumazet wrote:
> Actually timer drifts are not horrible (at least on my lab hosts).
>
> But BBR has a pessimistic way to sense the burst size, as it is tied to
> TSO/GSO being there.
>
> The following patch helps a lot.

Not really, at least when applied to v4.15.4. I'm still getting 2 Gbps less between VMs when using BBR instead of Reno. Am I doing something wrong?

Oleksandr
Re: [PATCH net-next 0/6] tcp: remove non GSO code
Hi.

19.02.2018 20:56, Eric Dumazet wrote:
> Switching TCP to GSO mode, relying on core networking layers to perform
> eventual adaptation for dumb devices, was overdue.
>
> 1) Most TCP developments are done with TSO in mind.
> 2) Fewer high-resolution timers need to be armed for TCP pacing.
> 3) GSO can benefit from the xmit_more hint.
> 4) Receiver GRO is more effective (as if TSO was used for real on the
>    sender) -> fewer ACK packets and less overhead.
> 5) Write queues have less overhead (one skb holds about 64KB of payload).
> 6) SACK coalescing just works (no payload in skb->head).
> 7) The rtx rb-tree contains fewer packets, so SACK is cheaper.
> 8) Removal of legacy code. Less maintenance hassle.
>
> Note that I have left the sendpage/zerocopy paths, but they probably
> can benefit from the same strategy.
>
> Thanks to Oleksandr Natalenko for reporting a performance issue for
> BBR/fq_codel, which was the main reason I worked on this patch series.

Thanks for dealing with this that fast. Does this mean that the option to optimise internal TCP pacing is still an open question?

Oleksandr
Re: TCP and BBR: reproducibly low cwnd and bandwidth
Hi.

On Sunday 18 February 2018 22:04:27 CET Eric Dumazet wrote:
> I was able to take a look today, and I believe this is the time to
> switch TCP to GSO being always on.
>
> As a bonus, we get a speed boost for cubic as well.
>
> Today's high BDP and recent TCP improvements (rtx queue as an rb-tree,
> SACK coalescing, TCP pacing...) were all developed/tested/maintained
> with GSO/TSO being the norm.
>
> Can you please test the following patch?

Yes, results below:

BBR+fq:          sg on: 6.02 Gbits/sec, sg off: 1.33 Gbits/sec
BBR+pfifo_fast:  sg on: 4.13 Gbits/sec, sg off: 1.34 Gbits/sec
BBR+fq_codel:    sg on: 4.16 Gbits/sec, sg off: 1.35 Gbits/sec
Reno+fq:         sg on: 6.44 Gbits/sec, sg off: 1.39 Gbits/sec
Reno+pfifo_fast: sg on: 6.36 Gbits/sec, sg off: 1.39 Gbits/sec
Reno+fq_codel:   sg on: 6.41 Gbits/sec, sg off: 1.38 Gbits/sec

While BBR still suffers when fq is not used, disabling sg doesn't bring a drastic throughput drop anymore. So, looks good to me, eh?

> Note that some cleanups can be done later in the TCP stack, removing
> lots of legacy stuff.
>
> Also TCP internal pacing could benefit from something similar to this
> fq patch eventually, although there is no hurry.
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=fefa569a9d4bc4b7758c0fddd75bb0382c95da77

Feel free to ping me if you have something else to test then ;).

> Of course, you have to consider why SG was disabled on your device,
> this looks very pessimistic.

Dunno why that happens, but I've managed to just enable it automatically on interface up. Thanks.

Oleksandr
Re: TCP and BBR: reproducibly low cwnd and bandwidth
Hi.

On Friday 16 February 2018 23:59:52 CET Eric Dumazet wrote:
> Well, no effect here on e1000e (1 Gbit) at least:
>
> # ethtool -K eth3 sg off
> Actual changes:
> scatter-gather: off
>         tx-scatter-gather: off
> tcp-segmentation-offload: off
>         tx-tcp-segmentation: off [requested on]
>         tx-tcp6-segmentation: off [requested on]
> generic-segmentation-offload: off [requested on]
>
> # tc qd replace dev eth3 root pfifo_fast
> # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
> 941
> # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
> 941
> # tc qd replace dev eth3 root fq
> # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
> 941
> # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
> 941
> # tc qd replace dev eth3 root fq_codel
> # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
> 941
> # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
> 941
> #

That really looks strange to me. I'm able to reproduce the effect caused by disabling scatter-gather even on the VM (using iperf3, as usual):

BBR+fq_codel:  sg on: 4.23 Gbits/sec, sg off: 121 Mbits/sec
BBR+fq:        sg on: 6.38 Gbits/sec, sg off: 437 Mbits/sec
Reno+fq_codel: sg on: 6.74 Gbits/sec, sg off: 1.37 Gbits/sec
Reno+fq:       sg on: 6.53 Gbits/sec, sg off: 1.19 Gbits/sec

Regardless of which congestion algorithm and qdisc are in use, the throughput drops, but when BBR is in use, especially with something other than fq, it drops the most.

Oleksandr
Re: TCP and BBR: reproducibly low cwnd and bandwidth
On Friday 16 February 2018 23:50:35 CET Eric Dumazet wrote:
> /* snip */
> If you use
>
>   tcptrace -R test_s2c.pcap
>   xplot.org d2c_rtt.xpl
>
> then you'll see plenty of suspect 40ms rtt samples.

That's odd. Even how uniform they look is odd.

> It looks like the receiver misses wakeups for some reason,
> and only the TCP delayed ACK timer is helping.
>
> So it does not look like a sender side issue to me.

To make things even more complicated, I've disabled sg on the server, leaving it enabled on the client:

client to server flow: 935 Mbits/sec
server to client flow: 72.5 Mbits/sec

So still, to me it looks like a sender issue. No?
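[A sanity check on the 40 ms delayed-ACK theory above, as a rough back-of-envelope rather than anything derived from the actual trace: if progress were gated by a 40 ms delayed-ACK timer, the observed rate would imply roughly one window's worth of data delivered per timer tick.]

```python
# Bytes delivered per delayed-ACK interval at a given throughput.
# If each burst has to wait out the timer, throughput ~= window / 0.040.
def bytes_per_interval(throughput_bps, interval_s):
    return throughput_bps * interval_s / 8

# Observed server-to-client rate with sg off on the server:
b = bytes_per_interval(72.5e6, 0.040)
print(f"{b / 1024:.0f} KiB per 40 ms tick")  # ~354 KiB
```

That is a plausible in-flight window for this path, so the numbers are at least consistent with the receiver stalling until the delayed-ACK timer fires.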
Re: TCP and BBR: reproducibly low cwnd and bandwidth
Hi.

On Friday 16 February 2018 21:54:05 CET Eric Dumazet wrote:
> /* snip */
> Something fishy really:
> /* snip */
> Not only does the receiver suddenly add a 25 ms delay, but also note
> that it acknowledges all prior segments (ack 112949), but with a wrong
> ecr value (2327043753) instead of 2327043759.
> /* snip */

Eric has encouraged me to look closer at what's there in ethtool, and I've just had some free time to play with it. I've found out that enabling scatter-gather (ethtool -K enp3s0 sg on; it is disabled by default on both hosts) brings the throughput back to normal even with BBR+fq_codel. Why? What's the deal BBR has with sg?

Oleksandr
Re: TCP and BBR: reproducibly low cwnd and bandwidth
Hi.

On Friday 16 February 2018 18:56:12 CET Holger Hoffstätte wrote:
> There is simply no reason why you shouldn't get approx. line rate
> (~920+-ish Mbit) over wired 1GBit Ethernet; even my broken 10-year-old
> Core2Duo laptop can do that. Can you boot with spectre_v2=off and try
> "the simplest case" with the defaults cubic/pfifo_fast? spectre_v2 has
> a terrible performance impact esp. on small/older processors.

I've just tried. No visible difference.

> When I last benchmarked full PREEMPT with 4.9.x it was similarly bad
> and also had a noticeable network throughput impact even on my i7.
>
> Also congratulations for being the only other person I know who ever
> tried YeAH. :-)

Well, according to the git log on tcp_yeah.c and the Reported-by tag, I was not the only one there ;).

Regards,
Oleksandr
Re: TCP and BBR: reproducibly low cwnd and bandwidth
Hi.

On Friday 16 February 2018 17:25:58 CET Eric Dumazet wrote:
> The way TCP pacing works, it defaults to internal pacing using a hint
> stored in the socket.
>
> If you change the qdisc while the flow is alive, the result could be
> unexpected.

I don't change the qdisc while the flow is alive. Either the VM is completely restarted, or iperf3 is restarted on both sides.

> (The TCP socket remembers that FQ was supposed to handle the pacing.)
>
> What results do you have if you use standard pfifo_fast?

Almost the same as with fq_codel (see my previous email with numbers).

> I am asking because TCP pacing relies on high-resolution timers, and
> that might be weak on your VM.

Also, I've switched to measuring things on real HW only (also see the previous email with numbers). Thanks.

Regards,
Oleksandr
Re: TCP and BBR: reproducibly low cwnd and bandwidth
Hi.

On Friday 16 February 2018 17:26:11 CET Holger Hoffstätte wrote:
> These are very odd configurations. :)
> Non-preempt/100 might well be too slow, whereas PREEMPT/1000 might
> simply have too much overhead.

Since the pacing is based on hrtimers, should HZ matter at all? Even if so, a poor 1 Gbps link shouldn't drop to below 100 Mbps, for sure.

> BBR in general will run with a lower cwnd than e.g. Cubic or others.
> That's a feature and necessary for WAN transfers.

Okay, got it.

> Something seems really wrong with your setup. I get completely
> expected throughput on wired 1Gb between two hosts:
> /* snip */

Yes, and that's strange :/. And that's why I'm wondering what I am missing, since things cannot be *that* bad.

> /* snip */
> Please note that BBR was developed to address the case of WAN transfers
> (or more precisely high BDP paths) which often suffer from TCP
> throughput collapse due to single packet loss events. While it might
> "work" in other scenarios as well, strictly speaking delay-based
> anything is increasingly less likely to work when there is no
> meaningful notion of delay - such as on a LAN. (yes, this is very
> simplified..)
>
> The BBR mailing list has several nice reports why the current BBR
> implementation (dubbed v1) has a few - sometimes severe - problems.
> These are being addressed as we speak.
>
> (let me know if you want some of those tech reports by email. :)

Well, yes, please, why not :).

> /* snip */
> I'm not sure testing the old version without builtin pacing is going
> to help matters in finding the actual problem. :)
> Several people have reported severe performance regressions with
> 4.15.x, maybe that's related. Can you test the latest 4.14.x?

I observed this on v4.14 too, but didn't pay much attention until I realised that things look definitely wrong.

> Out of curiosity, what is the expected use case for BBR here?

Nothing special, just assumed it could be set as a default for both WAN and LAN usage.

Regards,
Oleksandr
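[On the "should HZ matter?" question above, a quick calculation shows why pacing cannot live on jiffy-resolution timers in the first place: at gigabit rates the inter-packet gap is tens of microseconds, far below even a HZ=1000 jiffy (1000 us). The 1500-byte packet size is an assumption.]

```python
# Gap between back-to-back full-size packets at a given line rate.
def inter_packet_gap_us(rate_bps, pkt_bytes=1500):
    return pkt_bytes * 8 / rate_bps * 1e6

print(round(inter_packet_gap_us(1e9), 3))    # ~12 us at 1 Gbit/s
print(round(inter_packet_gap_us(100e6), 3))  # ~120 us at 100 Mbit/s
```

So pacing individual packets needs hrtimers regardless of HZ; what HZ could still influence is everything else around the flow (softirq scheduling, qdisc dequeue latency), not the pacing granularity itself.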
Re: TCP and BBR: reproducibly low cwnd and bandwidth
Hi.

On Friday 16 February 2018 17:33:48 CET Neal Cardwell wrote:
> Thanks for the detailed report! Yes, this sounds like an issue in BBR.
> We have not run into this one in our team, but we will try to work
> with you to fix this.
>
> Would you be able to take a sender-side tcpdump trace of the slow BBR
> transfer ("v4.13 + BBR + fq_codel == Not OK")? Packet headers only
> would be fine. Maybe something like:
>
>   tcpdump -w /tmp/test.pcap -c100 -s 100 -i eth0 port $PORT

So, going on with two real HW hosts. They are both running the latest stock Arch Linux kernel (4.15.3-1-ARCH, CONFIG_PREEMPT=y, CONFIG_HZ=1000) and are interconnected with a 1 Gbps link (via a switch, if that matters). Using iperf3, running each test for 20 seconds.

Having BBR+fq_codel (or pfifo_fast, same result) on both hosts:

client to server: 112 Mbits/sec
server to client: 96.1 Mbits/sec

Having BBR+fq on both hosts:

client to server: 347 Mbits/sec
server to client: 397 Mbits/sec

Having YeAH+fq on both hosts:

client to server: 928 Mbits/sec
server to client: 711 Mbits/sec

(When the server generates traffic, the throughput is a little bit lower, as you can see, but I assume that's because I have a low-power Silvermont CPU there, while the client has an Ivy Bridge beast.)

Now, to tcpdump. I've captured it twice, for the client-to-server flow (c2s) and for the server-to-client flow (s2c), while using BBR + pfifo_fast:

# tcpdump -w test_XXX.pcap -c100 -s 100 -i enp2s0 port 5201

I've uploaded both files here [1]. Thanks.

Oleksandr

[1] https://natalenko.name/myfiles/bbr/
Re: TCP and BBR: reproducibly low cwnd and bandwidth
Hi!

On Friday 16 February 2018 17:45:56 CET Neal Cardwell wrote:
> Eric raises a good question: bare metal vs VMs.
>
> Oleksandr, your first email mentioned KVM VMs and virtio NICs. Your
> second e-mail did not seem to mention if those results were for bare
> metal or a VM scenario: can you please clarify the details on your
> second set of tests?

Ugh, so many emails at once… I'll answer them one by one if you don't mind :).

Both the first and the second set of tests were performed on 2 KVM VMs, but from now on I'll test everything using real HW only, to exclude the potential influence of virtualisation. Also, as I've already pointed out, on real HW the difference is even bigger (~10 times).

Now I'm going to answer your other emails, including the actual results from the real HW and the tcpdump output as requested. Thanks!

Regards,
Oleksandr
Re: TCP and BBR: reproducibly low cwnd and bandwidth
Hi, David, Eric, Neal et al.

On Thursday 15 February 2018 21:42:26 CET Oleksandr Natalenko wrote:
> I've faced an issue with limited TCP bandwidth between my laptop and a
> server in my 1 Gbps LAN while using BBR as a congestion control
> mechanism. To verify my observations, I've set up 2 KVM VMs with the
> following parameters:
>
> 1) Linux v4.15.3
> 2) virtio NICs
> 3) 128 MiB of RAM
> 4) 2 vCPUs
> 5) tested on both non-PREEMPT/100 Hz and PREEMPT/1000 Hz
>
> The VMs are interconnected via a host bridge (-netdev bridge). I was
> running iperf3 in the default and reverse mode. Here are the results:
>
> 1) BBR on both VMs
>
>    upload:   3.42 Gbits/sec, cwnd ~ 320 KBytes
>    download: 3.39 Gbits/sec, cwnd ~ 320 KBytes
>
> 2) Reno on both VMs
>
>    upload:   5.50 Gbits/sec, cwnd = 976 KBytes (constant)
>    download: 5.22 Gbits/sec, cwnd = 1.20 MBytes (constant)
>
> 3) Reno on client, BBR on server
>
>    upload:   5.29 Gbits/sec, cwnd = 952 KBytes (constant)
>    download: 3.45 Gbits/sec, cwnd ~ 320 KBytes
>
> 4) BBR on client, Reno on server
>
>    upload:   3.36 Gbits/sec, cwnd ~ 370 KBytes
>    download: 5.21 Gbits/sec, cwnd = 887 KBytes (constant)
>
> So, as you may see, when BBR is in use, the upload rate is bad and
> cwnd is low. If using real HW (1 Gbps LAN, laptop and server), BBR
> limits the throughput to ~100 Mbps (verifiable not only by iperf3, but
> also by scp while transferring some files between hosts).
>
> Also, I've tried to use YeAH instead of Reno, and it gives me the same
> results as Reno (IOW, YeAH works fine too).
>
> Questions:
>
> 1) is this expected?
> 2) or am I missing some extra BBR tuneable?
> 3) if it is not a regression (I don't have any previous data to
>    compare with), how can I fix this?
> 4) if it is a bug in BBR, what else should I provide or check for a
>    proper investigation?

I've played with BBR a little bit more and managed to narrow the issue down to the changes between v4.12 and v4.13.
Here are my observations:

v4.12 + BBR + fq_codel == OK
v4.12 + BBR + fq       == OK
v4.13 + BBR + fq_codel == Not OK
v4.13 + BBR + fq       == OK

I think this has something to do with the internal TCP implementation of pacing that was introduced in v4.13 (commit 218af599fa63) specifically to allow using BBR together with non-fq qdiscs. Once BBR relies on fq, the throughput is high and saturates the link, but if another qdisc is in use, for instance fq_codel, the throughput drops. Just to be sure, I've also tried pfifo_fast instead of fq_codel, with the same outcome resulting in low throughput.

Unfortunately, I do not know whether this is something expected or should be considered a regression. Thus, asking for advice. Ideas? Thanks.

Regards,
Oleksandr
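[For readers unfamiliar with the internal pacing mentioned above, here is a rough model of how the socket's pacing rate is derived from cwnd and the smoothed RTT. The 200%/120% ratios correspond to the net.ipv4.tcp_pacing_ss_ratio and tcp_pacing_ca_ratio sysctl defaults, but treat this as an illustrative sketch, not the exact kernel code.]

```python
# Approximation of sk_pacing_rate: rate ~= ratio * cwnd * MSS / srtt.
# The ratio is ~200% during slow start (to keep growing) and ~120%
# afterwards (a small cushion above the measured delivery rate).
def pacing_rate_bps(cwnd_pkts, mss_bytes, srtt_s, ratio=1.2):
    return ratio * cwnd_pkts * mss_bytes * 8 / srtt_s

# Example: cwnd of 40 packets, MSS 1448, srtt 10 ms, congestion avoidance
print(round(pacing_rate_bps(40, 1448, 0.010) / 1e6, 1))  # ~55.6 Mbit/s
```

With fq, this rate is enforced by the qdisc itself; with fq_codel or pfifo_fast, the v4.13 internal pacing enforces it with hrtimers inside TCP, which is the code path that appears to misbehave here.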
TCP and BBR: reproducibly low cwnd and bandwidth
Hello.

I've faced an issue with limited TCP bandwidth between my laptop and a server in my 1 Gbps LAN while using BBR as a congestion control mechanism. To verify my observations, I've set up 2 KVM VMs with the following parameters:

1) Linux v4.15.3
2) virtio NICs
3) 128 MiB of RAM
4) 2 vCPUs
5) tested on both non-PREEMPT/100 Hz and PREEMPT/1000 Hz

The VMs are interconnected via a host bridge (-netdev bridge). I was running iperf3 in the default and reverse mode. Here are the results:

1) BBR on both VMs

   upload:   3.42 Gbits/sec, cwnd ~ 320 KBytes
   download: 3.39 Gbits/sec, cwnd ~ 320 KBytes

2) Reno on both VMs

   upload:   5.50 Gbits/sec, cwnd = 976 KBytes (constant)
   download: 5.22 Gbits/sec, cwnd = 1.20 MBytes (constant)

3) Reno on client, BBR on server

   upload:   5.29 Gbits/sec, cwnd = 952 KBytes (constant)
   download: 3.45 Gbits/sec, cwnd ~ 320 KBytes

4) BBR on client, Reno on server

   upload:   3.36 Gbits/sec, cwnd ~ 370 KBytes
   download: 5.21 Gbits/sec, cwnd = 887 KBytes (constant)

So, as you may see, when BBR is in use, the upload rate is bad and cwnd is low. If using real HW (1 Gbps LAN, laptop and server), BBR limits the throughput to ~100 Mbps (verifiable not only by iperf3, but also by scp while transferring some files between hosts).

Also, I've tried to use YeAH instead of Reno, and it gives me the same results as Reno (IOW, YeAH works fine too).

Questions:

1) is this expected?
2) or am I missing some extra BBR tuneable?
3) if it is not a regression (I don't have any previous data to compare with), how can I fix this?
4) if it is a bug in BBR, what else should I provide or check for a proper investigation?

Thanks.

Regards,
Oleksandr
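[The congestion-control/qdisc matrix above can be driven with a loop like the following. This is a sketch under assumptions: the interface name and server address are placeholders, iperf3's -C flag requires the matching tcp_* module to be loadable, and -R reverses the transfer direction for the "download" case.]

```shell
#!/bin/sh
# Run every qdisc x congestion-control combination in both directions.
IF=eth0
SRV=192.0.2.1

for qdisc in fq fq_codel pfifo_fast; do
    tc qdisc replace dev "$IF" root "$qdisc"
    for cc in bbr reno; do
        echo "== $cc + $qdisc, upload =="
        iperf3 -c "$SRV" -t 20 -C "$cc"
        echo "== $cc + $qdisc, download =="
        iperf3 -c "$SRV" -t 20 -R -C "$cc"
    done
done
```

Using iperf3 -C rather than the tcp_congestion_control sysctl keeps the selection per-connection, so concurrent traffic on the host is unaffected.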
Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c
Uhh, sorry, just found the original submission [1].

[1] https://marc.info/?l=linux-netdev&m=151009763926816&w=2

10.11.2017 14:15, Oleksandr Natalenko wrote:
> Hi.
>
> I'm running the machine with this patch applied for 7 hours now, and
> the warning hasn't appeared yet. Typically, it shows up within the
> first hour. I'll keep an eye on it for a longer time, but as of now it
> looks good. Some explanation on this, please? Thanks!
>
> 06.11.2017 23:27, Yuchung Cheng wrote:
>> ...snip...
>> hi guys can you try if the warning goes away w/ this quick fix?
>>
>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index 0ada8bfc2ebd..072aab2a8226 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -2626,7 +2626,7 @@ void tcp_simple_retransmit(struct sock *sk)
>>  	tcp_clear_retrans_hints_partial(tp);
>>
>> -	if (prior_lost == tp->lost_out)
>> +	if (!tp->lost_out)
>>  		return;
>>
>>  	if (tcp_is_reno(tp))
>> ...snip...
Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c
Hi.

I'm running the machine with this patch applied for 7 hours now, and the warning hasn't appeared yet. Typically, it shows up within the first hour. I'll keep an eye on it for a longer time, but as of now it looks good. Some explanation on this, please? Thanks!

06.11.2017 23:27, Yuchung Cheng wrote:
> ...snip...
> hi guys can you try if the warning goes away w/ this quick fix?
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 0ada8bfc2ebd..072aab2a8226 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2626,7 +2626,7 @@ void tcp_simple_retransmit(struct sock *sk)
>  	tcp_clear_retrans_hints_partial(tp);
>
> -	if (prior_lost == tp->lost_out)
> +	if (!tp->lost_out)
>  		return;
>
>  	if (tcp_is_reno(tp))
> ...snip...
Re: [PATCH net] tcp: fix tcp_mtu_probe() vs highest_sack
Hi.

Thanks for the fix. However, the tcp_fastretrans_alert() warning case still remains open even with this patch. Do I understand correctly that these are 2 different issues?

Currently, I use the latest 4.13 stable kernel + this patch and still get:

WARNING: CPU: 1 PID: 736 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990

Any idea on this?

On Tuesday 31 October 2017 7:08:20 CET Eric Dumazet wrote:
> From: Eric Dumazet <eduma...@google.com>
>
> Based on SNMP values provided by Roman, Yuchung made the observation
> that some crashes in tcp_sacktag_walk() might be caused by MTU probing.
>
> Looking at tcp_mtu_probe(), I found that when a new skb was placed
> in front of the write queue, we were not updating tcp highest sack.
>
> If one skb is freed because all its content was copied to the new skb
> (for MTU probing), then tp->highest_sack could point to a now freed skb.
>
> Bad things would then happen, including infinite loops.
>
> This patch renames tcp_highest_sack_combine() and uses it
> from tcp_mtu_probe() to fix the bug.
>
> Note that I also removed one test against tp->sacked_out,
> since we want to replace tp->highest_sack regardless of whatever
> condition, since keeping a stale pointer to freed skb is a recipe
> for disaster.
> Fixes: a47e5a988a57 ("[TCP]: Convert highest_sack to sk_buff to allow direct access")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Reported-by: Alexei Starovoitov <alexei.starovoi...@gmail.com>
> Reported-by: Roman Gushchin <g...@fb.com>
> Reported-by: Oleksandr Natalenko <oleksa...@natalenko.name>
> ---
>  include/net/tcp.h     | 6 +++---
>  net/ipv4/tcp_output.c | 3 ++-
>  2 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 33599d17522d6a19b9d9a316cc1579cd5e71ee32..e6d0002a1b0bc5f28c331a760823c8dc92f8fe24 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1771,12 +1771,12 @@ static inline void tcp_highest_sack_reset(struct sock *sk)
>  	tcp_sk(sk)->highest_sack = tcp_write_queue_head(sk);
>  }
>
> -/* Called when old skb is about to be deleted (to be combined with new skb) */
> -static inline void tcp_highest_sack_combine(struct sock *sk,
> +/* Called when old skb is about to be deleted and replaced by new skb */
> +static inline void tcp_highest_sack_replace(struct sock *sk,
>  					    struct sk_buff *old,
>  					    struct sk_buff *new)
>  {
> -	if (tcp_sk(sk)->sacked_out && (old == tcp_sk(sk)->highest_sack))
> +	if (old == tcp_highest_sack(sk))
>  		tcp_sk(sk)->highest_sack = new;
>  }
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index ae60dd3faed0adc71731bc686f878afd4c628d32..823003eef3a21a5cc5c27e0be9f46159afa060df 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2062,6 +2062,7 @@ static int tcp_mtu_probe(struct sock *sk)
>  	nskb->ip_summed = skb->ip_summed;
>
>  	tcp_insert_write_queue_before(nskb, skb, sk);
> +	tcp_highest_sack_replace(sk, skb, nskb);
>
>  	len = 0;
>  	tcp_for_write_queue_from_safe(skb, next, sk) {
> @@ -2665,7 +2666,7 @@ static bool tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
>  	else if (!skb_shift(skb, next_skb, next_skb_size))
>  		return false;
>
> -	tcp_highest_sack_combine(sk, next_skb, skb);
> +	tcp_highest_sack_replace(sk, next_skb, skb);
>
>  	tcp_unlink_write_queue(next_skb, sk);
Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c
Hi.

I can't say anything about the panic in tcp_sacktag_walk() since I cannot trigger it intentionally, but setting net.ipv4.tcp_retrans_collapse to 0 *does not* fix the warning in tcp_fastretrans_alert() for me.

On Wednesday 27 September 2017 2:18:32 CEST Yuchung Cheng wrote:
> On Tue, Sep 26, 2017 at 5:12 PM, Yuchung Cheng wrote:
>> On Tue, Sep 26, 2017 at 6:10 AM, Roman Gushchin wrote:
>>>> On Wed, Sep 20, 2017 at 6:46 PM, Roman Gushchin wrote:
>>>>>> Hello.
>>>>>>
>>>>>> Since, IIRC, v4.11, there is some regression in the TCP stack
>>>>>> resulting in the warning shown below. Most of the time it is
>>>>>> harmless, but rarely it causes either a freeze or (I believe this
>>>>>> is related too) a panic in tcp_sacktag_walk() (because the sk_buff
>>>>>> passed to this function is NULL). Unfortunately, I still do not
>>>>>> have a proper stacktrace from the panic, but will try to capture
>>>>>> it if possible.
>>>>>>
>>>>>> Also, I have custom settings regarding the TCP stack, shown below
>>>>>> as well. ifb is used to shape traffic with tc.
>>>>>>
>>>>>> Please note this regression was already reported as BZ [1] and as
>>>>>> a letter to the ML [2], but got neither attention nor resolution.
>>>>>> It is reproducible for (not only) me on my home router since v4.11
>>>>>> till v4.13.1 incl.
>>>>>>
>>>>>> Please advise on how to deal with it. I'll provide any additional
>>>>>> info if necessary; I am also ready to test patches if any.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> [1] https://bugzilla.kernel.org/show_bug.cgi?id=195835
>>>>>> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__www.spinics.net_lists_netdev_msg436158.html=DwIBaQ=5VD0RTtNlTh3ycd41b3MUw=jJYgtDM7QT-W-Fz_d29HYQ=MDDRfLG5DvdOeniMpaZDJI8ulKQ6PQ6OX_1YtRsiTMA=-n3dGZw-pQ95kMBUfq5G9nYZFcuWtbTDlYFkcvQPoKc=
>>>>
>>>> We're experiencing the same problems on some machines in our fleet.
>>>> Exactly the same symptoms: tcp_fastretrans_alert() warnings and
>>>> sometimes panics in tcp_sacktag_walk().
>>>>
>>>> Here is an example of a backtrace with the panic log:
>>>
>>> Hi Yuchung!
>>>
>>>> do you still see the panics if you disable RACK?
>>>> sysctl net.ipv4.tcp_recovery=0?
>>>
>>> No, we haven't seen any crash since that.
>>
>> I am out of ideas how RACK can potentially cause tcp_sacktag_walk to
>> take an empty skb :-( Do you have a stack trace or any hint on which
>> call to tcp_sacktag_walk triggered the panic? Internally at Google we
>> never see that.
>
> hmm something just struck me: could you try
>
>   sysctl net.ipv4.tcp_recovery=1 net.ipv4.tcp_retrans_collapse=0
>
> and see if the kernel still panics on SACK processing?
>
>>>> also have you experienced any SACK reneging? could you post the
>>>> output of 'nstat | grep -i TCP'? thanks
>>>
>>> hostname TcpActiveOpens            2289680      0.0
>>> hostname TcpPassiveOpens           3592758      0.0
>>> hostname TcpAttemptFails           746910       0.0
>>> hostname TcpEstabResets            154988       0.0
>>> hostname TcpInSegs                 16258678255  0.0
>>> hostname TcpOutSegs                46967011611  0.0
>>> hostname TcpRetransSegs            13724310     0.0
>>> hostname TcpInErrs                 2            0.0
>>> hostname TcpOutRsts                9418798      0.0
>>> hostname TcpExtEmbryonicRsts       2303         0.0
>>> hostname TcpExtPruneCalled         90192        0.0
>>> hostname TcpExtOfoPruned           57274        0.0
>>> hostname TcpExtOutOfWindowIcmps    3            0.0
>>> hostname TcpExtTW                  1164705      0.0
>>> hostname TcpExtTWRecycled          2            0.0
>>> hostname TcpExtPAWSEstab           159          0.0
>>> hostname TcpExtDelayedACKs         209207209    0.0
>>> hostname TcpExtDelayedACKLocked    508571       0.0
>>> hostname TcpExtDelayedACKLost      1713248      0.0
>>> hostname TcpExtListenOverflows     625          0.0
>>> hostname TcpExtListenDrops         625          0.0
>>> hostname TcpExtTCPHPHits           9341188489   0.0
>>> hostname TcpExtTCPPureAcks         1434646465   0.0
>>> hostname TcpExtTCPHPAcks           5733614672   0.0
>>> hostname TcpExtTCPSackRecovery     3261698      0.0
>>> hostname TcpExtTCPSACKReneging     12203        0.0
>>> hostname TcpExtTCPSACKReorder      433189       0.0
>>> hostname
Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c
And 2 more events:

===
$ dmesg --time-format iso | grep RIP
…
2017-09-19T16:52:21,623328+0200 RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0
2017-09-19T16:52:40,455296+0200 RIP: 0010:tcp_fastretrans_alert+0x7c8/0x990
2017-09-19T16:52:41,047378+0200 RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0
…
2017-09-19T16:54:59,930726+0200 RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0
2017-09-19T16:55:07,985767+0200 RIP: 0010:tcp_fastretrans_alert+0x7c8/0x990
2017-09-19T16:55:41,911527+0200 RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0
…
===

On Monday 18 September 2017 23:40:08 CEST Yuchung Cheng wrote:
> On Mon, Sep 18, 2017 at 1:46 PM, Oleksandr Natalenko
> <oleksa...@natalenko.name> wrote:
>> Actually, the same warning was just triggered with RACK enabled. But
>> the main warning was not triggered in this case.
>
> Thanks.
>
> I assume this kernel does not have the patch that Neal proposed in his
> first reply?
>
> The main warning needs to be triggered by another peculiar SACK that
> kicks the sender into recovery again (after undo). Please let it run
> longer if possible to see if we can get both. But the new data does
> indicate that we can (validly) be in CA_Open with retrans_out > 0.
> > ===
> > Sep 18 22:44:32 defiant kernel: [ cut here ]
> > Sep 18 22:44:32 defiant kernel: WARNING: CPU: 1 PID: 702 at net/ipv4/tcp_input.c:2392 tcp_undo_cwnd_reduction+0xbd/0xd0
> > Sep 18 22:44:32 defiant kernel: Modules linked in: netconsole ctr ccm cls_bpf sch_htb act_mirred cls_u32 sch_ingress sit tunnel4 ip_tunnel 8021q mrp nf_conntrack_ipv6 nf_defrag_ipv6 nft_ct nft_set_bitmap nft_set_hash nft_set_rbtree nf_tables_inet nf_tables_ipv6 nft_masq_ipv4 nf_nat_masquerade_ipv4 nft_masq nft_nat nft_counter nft_meta nft_chain_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c crc32c_generic nf_tables_ipv4 nf_tables tun nct6775 nfnetlink hwmon_vid nls_iso8859_1 nls_cp437 vfat fat ext4 snd_hda_codec_hdmi mbcache jbd2 snd_hda_codec_realtek snd_hda_codec_generic f2fs arc4 fscrypto intel_rapl iTCO_wdt ath9k iTCO_vendor_support intel_powerclamp ath9k_common ath9k_hw coretemp kvm_intel ath mac80211 kvm irqbypass intel_cstate cfg80211 pcspkr snd_hda_intel snd_hda_codec r8169
> > Sep 18 22:44:32 defiant kernel: joydev evdev mii snd_hda_core mousedev mei_txe input_leds i2c_i801 mac_hid i915 lpc_ich mei shpchp snd_hwdep snd_intel_sst_acpi snd_intel_sst_core snd_soc_rt5670 snd_soc_sst_atom_hifi2_platform battery snd_soc_sst_match snd_soc_rl6231 drm_kms_helper hci_uart ov5693(C) ov2722(C) lm3554(C) btbcm btqca v4l2_common snd_soc_core btintel snd_compress videodev snd_pcm_dmaengine snd_pcm video bluetooth snd_timer drm media tpm_tis snd i2c_hid soundcore tpm_tis_core rfkill_gpio ac97_bus soc_button_array ecdh_generic rfkill crc16 tpm 8250_dw intel_gtt syscopyarea sysfillrect acpi_pad sysimgblt intel_int0002_vgpio fb_sys_fops pinctrl_cherryview i2c_algo_bit button sch_fq_codel tcp_bbr ifb ip_tables x_tables btrfs xor raid6_pq algif_skcipher af_alg hid_logitech_hidpp hid_logitech_dj usbhid hid uas
> > Sep 18 22:44:32 defiant kernel: usb_storage dm_crypt dm_mod dax raid10 md_mod sd_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc ahci aesni_intel xhci_pci libahci aes_x86_64 crypto_simd glue_helper xhci_hcd cryptd libata usbcore scsi_mod usb_common serio sdhci_acpi sdhci led_class mmc_core
> > Sep 18 22:44:32 defiant kernel: CPU: 1 PID: 702 Comm: irq/123-enp3s0 Tainted: GWC 4.13.0-pf4 #1
> > Sep 18 22:44:32 defiant kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J3710-ITX, BIOS P1.30 03/30/2016
> > Sep 18 22:44:32 defiant kernel: task: 88923a738000 task.stack: 95800150
> > Sep 18 22:44:32 defiant kernel: RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0
> > Sep 18 22:44:32 defiant kernel: RSP: 0018:88927fc83a48 EFLAGS: 00010202
> > Sep 18 22:44:32 defiant kernel: RAX: 0001 RBX: 8892412d9800 RCX: 88927fc83b0c
> > Sep 18 22:44:32 defiant kernel: RDX: 7fff RSI: 0001 RDI: 8892412d9800
> > Sep 18 22:44:32 defiant kernel: RBP: 88927fc83a50 R08:  R09: 18dfb063
> > Sep 18 22:44:32 defiant kernel: R10: 18dfd223 R11: 18dfb063 R12: 5320
> > Sep 18 22:44:32 defiant kernel: R13: 88927fc83b10 R14: 0001 R15: 88927fc83b0c
> > Sep
Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c
Hi.

18.09.2017 23:40, Yuchung Cheng wrote:
> I assume this kernel does not have the patch that Neal proposed in his first reply?

Correct.

> The main warning needs to be triggered by another peculiar SACK that kicks the sender into recovery again (after undo). Please let it run longer if possible to see if we can get both. But the new data does indicate that we can (validly) be in CA_Open with retrans_out > 0.

OK, here it is:

===
» LC_TIME=C jctl -kb | grep RIP
…
Sep 19 12:54:03 defiant kernel: RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0
Sep 19 12:54:22 defiant kernel: RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0
Sep 19 12:54:25 defiant kernel: RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0
Sep 19 12:56:00 defiant kernel: RIP: 0010:tcp_fastretrans_alert+0x7c8/0x990
Sep 19 12:57:07 defiant kernel: RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0
Sep 19 12:57:14 defiant kernel: RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0
Sep 19 12:58:04 defiant kernel: RIP: 0010:tcp_undo_cwnd_reduction+0xbd/0xd0
…
===

Note the timestamps: the two types of warnings are far apart in time, so they did not happen at once.

While I'm still running this kernel, is there anything else I can check for you?
Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c
Sep 18 22:44:32 defiant kernel: tasklet_action+0x63/0x120
Sep 18 22:44:32 defiant kernel: __do_softirq+0xdf/0x2e5
Sep 18 22:44:32 defiant kernel: ? irq_finalize_oneshot.part.39+0xe0/0xe0
Sep 18 22:44:32 defiant kernel: do_softirq_own_stack+0x1c/0x30
Sep 18 22:44:32 defiant kernel:
Sep 18 22:44:32 defiant kernel: do_softirq.part.17+0x4e/0x60
Sep 18 22:44:32 defiant kernel: __local_bh_enable_ip+0x77/0x80
Sep 18 22:44:32 defiant kernel: irq_forced_thread_fn+0x5c/0x70
Sep 18 22:44:32 defiant kernel: irq_thread+0x131/0x1a0
Sep 18 22:44:32 defiant kernel: ? wake_threads_waitq+0x30/0x30
Sep 18 22:44:32 defiant kernel: kthread+0x126/0x140
Sep 18 22:44:32 defiant kernel: ? irq_thread_check_affinity+0x90/0x90
Sep 18 22:44:32 defiant kernel: ? kthread_create_on_node+0x70/0x70
Sep 18 22:44:32 defiant kernel: ret_from_fork+0x25/0x30
Sep 18 22:44:32 defiant kernel: Code: 5d c3 80 60 35 fb 48 8b 00 48 39 c2 74 85 48 3b 83 50 01 00 00 75 eb e9 77 ff ff ff 89 83 48 06 00 00 80 a3 1e 06 00 00 fb eb b3 <0f> ff 5b 5d c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
Sep 18 22:44:32 defiant kernel: ---[ end trace 1aea180efeedb474 ]---
===

On pondělí 18. září 2017 20:01:42 CEST Yuchung Cheng wrote:
> On Mon, Sep 18, 2017 at 10:59 AM, Oleksandr Natalenko
> <oleksa...@natalenko.name> wrote:
> > OK. Should I keep FACK disabled?
>
> Yes since it is disabled in the upstream by default. Although you can
> experiment FACK enabled additionally.
>
> Do we know the crash you first experienced is tied to this issue?
>
> > On pondělí 18. září 2017 19:51:21 CEST Yuchung Cheng wrote:
> >> Can you try this patch to verify my theory with tcp_recovery=0 and 1?
> >> thanks
> >>
> >> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >> index 5af2f04f8859..9253d9ee7d0e 100644
> >> --- a/net/ipv4/tcp_input.c
> >> +++ b/net/ipv4/tcp_input.c
> >> @@ -2381,6 +2381,7 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, bool unmark_loss)
> >>  	}
> >>  	tp->snd_cwnd_stamp = tcp_time_stamp;
> >>  	tp->undo_marker = 0;
> >> +	WARN_ON(tp->retrans_out);
> >>  }
Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c
Sep 18 22:18:34 defiant kernel: tasklet_action+0x63/0x120
Sep 18 22:18:34 defiant kernel: __do_softirq+0xdf/0x2e5
Sep 18 22:18:34 defiant kernel: ? irq_finalize_oneshot.part.39+0xe0/0xe0
Sep 18 22:18:34 defiant kernel: do_softirq_own_stack+0x1c/0x30
Sep 18 22:18:34 defiant kernel:
Sep 18 22:18:34 defiant kernel: do_softirq.part.17+0x4e/0x60
Sep 18 22:18:34 defiant kernel: __local_bh_enable_ip+0x77/0x80
Sep 18 22:18:34 defiant kernel: irq_forced_thread_fn+0x5c/0x70
Sep 18 22:18:34 defiant kernel: irq_thread+0x131/0x1a0
Sep 18 22:18:34 defiant kernel: ? wake_threads_waitq+0x30/0x30
Sep 18 22:18:34 defiant kernel: kthread+0x126/0x140
Sep 18 22:18:34 defiant kernel: ? irq_thread_check_affinity+0x90/0x90
Sep 18 22:18:34 defiant kernel: ? kthread_create_on_node+0x70/0x70
Sep 18 22:18:34 defiant kernel: ret_from_fork+0x25/0x30
Sep 18 22:18:34 defiant kernel: Code: 5d c3 80 60 35 fb 48 8b 00 48 39 c2 74 85 48 3b 83 50 01 00 00 75 eb e9 77 ff ff ff 89 83 48 06 00 00 80 a3 1e 06 00 00 fb eb b3 <0f> ff 5b 5d c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
Sep 18 22:18:34 defiant kernel: ---[ end trace 1aea180efeedb473 ]---
===

Should I continue with net.ipv4.tcp_recovery = 1, or is this enough?

On pondělí 18. září 2017 20:01:42 CEST Yuchung Cheng wrote:
> On Mon, Sep 18, 2017 at 10:59 AM, Oleksandr Natalenko
> <oleksa...@natalenko.name> wrote:
> > OK. Should I keep FACK disabled?
>
> Yes since it is disabled in the upstream by default. Although you can
> experiment FACK enabled additionally.
>
> Do we know the crash you first experienced is tied to this issue?
>
> > On pondělí 18. září 2017 19:51:21 CEST Yuchung Cheng wrote:
> >> Can you try this patch to verify my theory with tcp_recovery=0 and 1?
> >> thanks
> >>
> >> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >> index 5af2f04f8859..9253d9ee7d0e 100644
> >> --- a/net/ipv4/tcp_input.c
> >> +++ b/net/ipv4/tcp_input.c
> >> @@ -2381,6 +2381,7 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, bool unmark_loss)
> >>  	}
> >>  	tp->snd_cwnd_stamp = tcp_time_stamp;
> >>  	tp->undo_marker = 0;
> >> +	WARN_ON(tp->retrans_out);
> >>  }
Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c
On pondělí 18. září 2017 20:01:42 CEST Yuchung Cheng wrote:
> Yes since it is disabled in the upstream by default. Although you can
> experiment FACK enabled additionally.

OK.

> Do we know the crash you first experienced is tied to this issue?

No, unfortunately. I wasn't able to re-create it again, so let's focus on the tcp_fastretrans_alert warning only.
Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c
OK. Should I keep FACK disabled?

On pondělí 18. září 2017 19:51:21 CEST Yuchung Cheng wrote:
> Can you try this patch to verify my theory with tcp_recovery=0 and 1? thanks
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 5af2f04f8859..9253d9ee7d0e 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2381,6 +2381,7 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, bool unmark_loss)
>  	}
>  	tp->snd_cwnd_stamp = tcp_time_stamp;
>  	tp->undo_marker = 0;
> +	WARN_ON(tp->retrans_out);
>  }
Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c
Hi.

Just to note that it looks like disabling RACK and re-enabling FACK prevents the warning from happening:

net.ipv4.tcp_fack = 1
net.ipv4.tcp_recovery = 0

I hope I got the semantics of these tunables right.

On pátek 15. září 2017 21:04:36 CEST Oleksandr Natalenko wrote:
> Hello.
>
> With net.ipv4.tcp_fack set to 0 the warning still appears:
>
> ===
> » sysctl net.ipv4.tcp_fack
> net.ipv4.tcp_fack = 0
>
> » LC_TIME=C dmesg -T | grep WARNING
> [Fri Sep 15 20:40:30 2017] WARNING: CPU: 1 PID: 711 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990
> [Fri Sep 15 20:40:30 2017] WARNING: CPU: 0 PID: 711 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990
> [Fri Sep 15 20:48:37 2017] WARNING: CPU: 1 PID: 711 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990
> [Fri Sep 15 20:48:55 2017] WARNING: CPU: 0 PID: 711 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990
>
> » ps -up 711
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> root 711 4.3 0.0 0 0 ? S 18:12 7:23 [irq/123-enp3s0]
> ===
>
> Any suggestions?
>
> On pátek 15. září 2017 16:03:00 CEST Neal Cardwell wrote:
> > Thanks for testing that. That is a very useful data point.
> >
> > I was able to cook up a packetdrill test that could put the connection
> > in CA_Disorder with retransmitted packets out, but not in CA_Open. So
> > we do not yet have a test case to reproduce this.
> >
> > We do not see this warning on our fleet at Google. One significant
> > difference I see between our environment and yours is that it seems
> > you run with FACK enabled:
> >
> > net.ipv4.tcp_fack = 1
> >
> > Note that FACK was disabled by default (since it was replaced by RACK)
> > between kernel v4.10 and v4.11. And this is exactly the time when this
> > bug started manifesting itself for you and some others, but not our
> > fleet. So my new working hypothesis would be that this warning is due
> > to a behavior that only shows up in kernels >=4.11 when FACK is
> > enabled.
> >
> > Would you be able to disable FACK ("sysctl net.ipv4.tcp_fack=0" at
> > boot, or net.ipv4.tcp_fack=0 in /etc/sysctl.conf, or equivalent),
> > reboot, and test the kernel for a few days to see if the warning still
> > pops up?
> >
> > thanks,
> > neal
> >
> > [ps: apologies for the previous, mis-formatted post...]
Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c
Hello.

With net.ipv4.tcp_fack set to 0 the warning still appears:

===
» sysctl net.ipv4.tcp_fack
net.ipv4.tcp_fack = 0

» LC_TIME=C dmesg -T | grep WARNING
[Fri Sep 15 20:40:30 2017] WARNING: CPU: 1 PID: 711 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990
[Fri Sep 15 20:40:30 2017] WARNING: CPU: 0 PID: 711 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990
[Fri Sep 15 20:48:37 2017] WARNING: CPU: 1 PID: 711 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990
[Fri Sep 15 20:48:55 2017] WARNING: CPU: 0 PID: 711 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990

» ps -up 711
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 711 4.3 0.0 0 0 ? S 18:12 7:23 [irq/123-enp3s0]
===

Any suggestions?

On pátek 15. září 2017 16:03:00 CEST Neal Cardwell wrote:
> Thanks for testing that. That is a very useful data point.
>
> I was able to cook up a packetdrill test that could put the connection
> in CA_Disorder with retransmitted packets out, but not in CA_Open. So
> we do not yet have a test case to reproduce this.
>
> We do not see this warning on our fleet at Google. One significant
> difference I see between our environment and yours is that it seems
> you run with FACK enabled:
>
> net.ipv4.tcp_fack = 1
>
> Note that FACK was disabled by default (since it was replaced by RACK)
> between kernel v4.10 and v4.11. And this is exactly the time when this
> bug started manifesting itself for you and some others, but not our
> fleet. So my new working hypothesis would be that this warning is due
> to a behavior that only shows up in kernels >=4.11 when FACK is
> enabled.
>
> Would you be able to disable FACK ("sysctl net.ipv4.tcp_fack=0" at
> boot, or net.ipv4.tcp_fack=0 in /etc/sysctl.conf, or equivalent),
> reboot, and test the kernel for a few days to see if the warning still
> pops up?
>
> thanks,
> neal
>
> [ps: apologies for the previous, mis-formatted post...]
Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c
Hi.

I've applied your test patch, but it doesn't fix the issue for me since the warning is still there. Were you able to reproduce it?

On pondělí 11. září 2017 1:59:02 CEST Neal Cardwell wrote:
> Thanks for the detailed report!
>
> I suspect this is due to the following commit, which happened between
> 4.10 and 4.11:
>
> 89fe18e44f7e tcp: extend F-RTO to catch more spurious timeouts
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=89fe18e44f7e
>
> This commit expanded the set of scenarios where we would undo a
> CA_Loss cwnd reduction and return to TCP_CA_Open, but did not include
> a check to see if there were any in-flight retransmissions. I think we
> need a fix like the following:
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 659d1baefb2b..730a2de9d2b0 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2439,7 +2439,7 @@ static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo)
>  {
>  	struct tcp_sock *tp = tcp_sk(sk);
>
> -	if (frto_undo || tcp_may_undo(tp)) {
> +	if ((frto_undo || tcp_may_undo(tp)) && !tp->retrans_out) {
>  		tcp_undo_cwnd_reduction(sk, true);
>
>  		DBGUNDO(sk, "partial loss");
>
> I will try a packetdrill test to see if I can reproduce this issue and
> verify the fix.
>
> thanks,
> neal
[REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c
Hello.

Since, IIRC, v4.11, there is some regression in the TCP stack resulting in the warning shown below. Most of the time it is harmless, but occasionally it causes either a freeze or (I believe this is related too) a panic in tcp_sacktag_walk() (because the sk_buff passed to this function is NULL). Unfortunately, I still do not have a proper stacktrace from the panic, but I will try to capture it if possible.

Also, I have custom settings regarding the TCP stack, shown below as well. ifb is used to shape traffic with tc.

Please note this regression was already reported as BZ [1] and as a letter to the ML [2], but got neither attention nor resolution. It is reproducible for (not only) me on my home router from v4.11 through v4.13.1 inclusive.

Please advise on how to deal with it. I'll provide any additional info if necessary and am ready to test patches if any.

Thanks.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=195835
[2] https://www.spinics.net/lists/netdev/msg436158.html

=== warning
[14407.060066] [ cut here ]
[14407.060353] WARNING: CPU: 0 PID: 719 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990
[14407.060747] Modules linked in: netconsole ctr ccm cls_bpf sch_htb act_mirred cls_u32 sch_ingress sit tunnel4 ip_tunnel 8021q mrp nf_conntrack_ipv6 nf_defrag_ipv6 nft_ct nft_set_bitmap nft_set_hash nft_set_rbtree nf_tables_inet nf_tables_ipv6 nft_masq_ipv4 nf_nat_masquerade_ipv4 nft_masq nft_nat nft_counter nft_meta nft_chain_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c crc32c_generic nf_tables_ipv4 tun nf_tables nfnetlink nct6775 hwmon_vid nls_iso8859_1 nls_cp437 vfat fat ext4 mbcache jbd2 arc4 f2fs snd_hda_codec_hdmi fscrypto snd_hda_codec_realtek snd_hda_codec_generic intel_rapl intel_powerclamp coretemp iTCO_wdt iTCO_vendor_support ath9k ath9k_common kvm_intel ath9k_hw kvm ath irqbypass intel_cstate mac80211 pcspkr snd_intel_sst_acpi i2c_i801 i915 snd_hda_intel
[14407.063800] snd_intel_sst_core r8169 cfg80211 evdev mii snd_hda_codec joydev mousedev input_leds snd_soc_rt5670 mei_txe snd_soc_sst_atom_hifi2_platform snd_hda_core snd_soc_rl6231 snd_soc_sst_match mac_hid mei lpc_ich shpchp drm_kms_helper snd_hwdep snd_soc_core snd_compress battery snd_pcm_dmaengine drm hci_uart ov2722(C) snd_pcm lm3554(C) ov5693(C) snd_timer v4l2_common btbcm snd intel_gtt btqca btintel videodev syscopyarea bluetooth video soundcore sysfillrect media sysimgblt ac97_bus ecdh_generic rfkill_gpio i2c_hid rfkill tpm_tis crc16 fb_sys_fops i2c_algo_bit 8250_dw tpm_tis_core tpm soc_button_array pinctrl_cherryview intel_int0002_vgpio acpi_pad button sch_fq_codel tcp_bbr ifb ip_tables x_tables btrfs xor raid6_pq algif_skcipher af_alg hid_logitech_hidpp hid_logitech_dj usbhid hid uas usb_storage
[14407.066873] dm_crypt dm_mod dax raid10 md_mod sd_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd ahci xhci_pci libahci xhci_hcd libata usbcore scsi_mod usb_common serio sdhci_acpi sdhci led_class mmc_core
[14407.068034] CPU: 0 PID: 719 Comm: irq/123-enp3s0 Tainted: G C 4.13.0-pf2 #1
[14407.068403] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J3710-ITX, BIOS P1.30 03/30/2016
[14407.068827] task: 98b1c0a05400 task.stack: bb59c15c
[14407.069111] RIP: 0010:tcp_fastretrans_alert+0x7c8/0x990
[14407.069358] RSP: 0018:98b1ffc03a78 EFLAGS: 00010202
[14407.069607] RAX: RBX: 98b135ae RCX: 98b1ffc03b0c
[14407.069928] RDX: 0001 RSI: 0001 RDI: 98b135ae
[14407.070248] RBP: 98b1ffc03ab8 R08: R09: 98b1ffc03b60
[14407.070565] R10: R11: R12: 5120
[14407.070884] R13: 98b1ffc03b10 R14: 0001 R15: 98b1ffc03b0c
[14407.071205] FS: () GS:98b1ffc0() knlGS:
[14407.071564] CS: 0010 DS: ES: CR0: 80050033
[14407.071827] CR2: 7ffc580b2f0f CR3: 10a09000 CR4: 001006f0
[14407.072146] Call Trace:
[14407.072279]
[14407.072412] ? sk_reset_timer+0x18/0x30
[14407.072610] tcp_ack+0x741/0x1110
[14407.072810] tcp_rcv_established+0x325/0x770
[14407.073033] ? sk_filter_trim_cap+0xd4/0x1a0
[14407.073249] tcp_v4_do_rcv+0x90/0x1e0
[14407.073449] tcp_v4_rcv+0x950/0xa10
[14407.073647] ? nf_ct_deliver_cached_events+0xb8/0x110 [nf_conntrack]
[14407.073955] ip_local_deliver_finish+0x68/0x210
[14407.074183] ip_local_deliver+0xfa/0x110
[14407.074385] ? ip_rcv_finish+0x410/0x410
[14407.074589] ip_rcv_finish+0x120/0x410
[14407.074782] ip_rcv+0x28e/0x3b0
[14407.074952] ? inet_del_offload+0x40/0x40
[14407.075154] __netif_receive_skb_core+0x39b/0xb00
[14407.075389] ? netif_receive_skb_internal+0xa0/0x480
[14407.075635] ? skb_release_all+0x24/0x30
[14407.075832] ? consume_skb+0x38/0xa0
[14407.076025]
kernel BUG at net/netfilter/nf_nat_core.c:395
Hi.

With 4.4.1 I've got a BUG_ON() triggered in net/netfilter/nf_nat_core.c:395, nf_nat_setup_info(), today on my home router. Here is the full trace obtained via netconsole: [1]

I perform LAN NATting using nftables like this:

===
table ip nat {
	chain prerouting {
		type nat hook prerouting priority -150;
	}

	chain postrouting {
		type nat hook postrouting priority -150;

		oifname enp2s0 ip saddr 172.17.28.0/24 counter snat 1.2.3.4
		oifname enp2s0 ip saddr 172.17.29.0/24 counter snat 1.2.3.4
		oifname enp2s0 ip saddr 172.17.31.0/24 counter snat 1.2.3.4
		oifname enp2s0 ip saddr 172.17.35.0/24 counter snat 1.2.3.4
		oifname enp2s0 ip saddr 172.17.37.0/24 counter snat 1.2.3.4
		oifname tun0 ip saddr 172.17.28.0/24 counter masquerade
		oifname tun0 ip saddr 172.17.29.0/24 counter masquerade
		oifname tinc0 ip saddr 172.17.28.0/24 counter masquerade
		oifname tinc0 ip saddr 172.17.29.0/24 counter masquerade
	}
}
===

Traffic filtering is done via nftables as well.

Any ideas? What could I do to debug the issue better?

[1] https://gist.github.com/bbb3712f40a7753537fe
Re: [REGRESSION] tcp/ipv4: kernel panic because of (possible) division by zero
Sure, but after catching the stacktrace.

On середа, 6 січня 2016 р. 10:43:45 EET Yuchung Cheng wrote:
> Could you turn off ecn (sysctl net.ipv4.tcp_ecn=0) to see if this still
> happen?
>
> >> On December 22, 2015 4:10:32 AM EET, Yuchung Cheng <ych...@google.com> wrote:
> >> >On Mon, Dec 21, 2015 at 12:25 PM, Oleksandr Natalenko
> >> ><oleksa...@natalenko.name> wrote:
> >> >> Commit 3759824da87b30ce7a35b4873b62b0ba38905ef5 (tcp: PRR uses CRB
> >> >mode by
> >> >> default and SS mode conditionally) introduced changes to
> >> >net/ipv4/tcp_input.c
> >> >> tcp_cwnd_reduction() that, possibly, cause division by zero, and
> >> >therefore,
> >> >> kernel panic in interrupt handler [1].
> >> >>
> >> >> Reverting 3759824da87b30ce7a35b4873b62b0ba38905ef5 seems to fix the
> >> >issue.
> >> >
> >> >> I'm able to reproduce the issue on 4.3.0–4.3.3 once per several day
> >> >> (occasionally).
> >> >>
> >> >> What could be done to help in debugging this issue?
> >> >
> >> >Do you have ECN enabled (i.e. sysctl net.ipv4.tcp_ecn > 0)?
> >> >
> >> >If so I suspect an ACK carrying ECE during CA_Loss causes entering CWR
> >> >state w/o calling tcp_init_cwnd_reduct() to set tp->prior_cwnd. Can
> >> >you try this debug / quick-fix patch and send me the error message if
> >> >any?
> >> >
> >> >> Regards,
> >> >>
> >> >> Oleksandr.
> >> >>
> >> >> [1] http://i.piccy.info/i9/6f5cb187c4ff282d189f78c63f95af43/1450729403/283985/951663/panic.jpg
> >>
> >> --
> >> Sent from my Android device with K-9 Mail. Please excuse my brevity.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [REGRESSION] tcp/ipv4: kernel panic because of (possible) division by zero
Unfortunately, the patch didn't help -- I've got the same stacktrace with a slightly different offset (+3) within the function.

Now trying to get the full stacktrace via netconsole. Need more time.

Meanwhile, any other ideas on what went wrong?

On December 22, 2015 4:10:32 AM EET, Yuchung Cheng <ych...@google.com> wrote:
>On Mon, Dec 21, 2015 at 12:25 PM, Oleksandr Natalenko
><oleksa...@natalenko.name> wrote:
>> Commit 3759824da87b30ce7a35b4873b62b0ba38905ef5 (tcp: PRR uses CRB
>mode by
>> default and SS mode conditionally) introduced changes to
>net/ipv4/tcp_input.c
>> tcp_cwnd_reduction() that, possibly, cause division by zero, and
>therefore,
>> kernel panic in interrupt handler [1].
>>
>> Reverting 3759824da87b30ce7a35b4873b62b0ba38905ef5 seems to fix the
>issue.
>>
>> I'm able to reproduce the issue on 4.3.0–4.3.3 once per several day
>> (occasionally).
>>
>> What could be done to help in debugging this issue?
>
>Do you have ECN enabled (i.e. sysctl net.ipv4.tcp_ecn > 0)?
>
>If so I suspect an ACK carrying ECE during CA_Loss causes entering CWR
>state w/o calling tcp_init_cwnd_reduct() to set tp->prior_cwnd. Can
>you try this debug / quick-fix patch and send me the error message if
>any?
>
>> Regards,
>> Oleksandr.
>>
>> [1] http://i.piccy.info/i9/6f5cb187c4ff282d189f78c63f95af43/1450729403/283985/951663/panic.jpg

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [REGRESSION] tcp/ipv4: kernel panic because of (possible) division by zero
That is correct, I have net.ipv4.tcp_ecn set to 1.

I've recompiled the kernel with the proposed patch and am still waiting for the issue to be triggered. Could I manually simulate the erroneous TCP ECN behavior to speed up the debugging?

On понеділок, 21 грудня 2015 р. 18:10:32 EET Yuchung Cheng wrote:
> On Mon, Dec 21, 2015 at 12:25 PM, Oleksandr Natalenko
> <oleksa...@natalenko.name> wrote:
> > Commit 3759824da87b30ce7a35b4873b62b0ba38905ef5 (tcp: PRR uses CRB mode by
> > default and SS mode conditionally) introduced changes to
> > net/ipv4/tcp_input.c tcp_cwnd_reduction() that, possibly, cause division
> > by zero, and therefore, kernel panic in interrupt handler [1].
> >
> > Reverting 3759824da87b30ce7a35b4873b62b0ba38905ef5 seems to fix the issue.
> >
> > I'm able to reproduce the issue on 4.3.0–4.3.3 once per several day
> > (occasionally).
> >
> > What could be done to help in debugging this issue?
>
> Do you have ECN enabled (i.e. sysctl net.ipv4.tcp_ecn > 0)?
>
> If so I suspect an ACK carrying ECE during CA_Loss causes entering CWR
> state w/o calling tcp_init_cwnd_reduct() to set tp->prior_cwnd. Can
> you try this debug / quick-fix patch and send me the error message if
> any?
>
> > Regards,
> >
> > Oleksandr.
> >
> > [1] http://i.piccy.info/i9/6f5cb187c4ff282d189f78c63f95af43/1450729403/283985/951663/panic.jpg
[REGRESSION] tcp/ipv4: kernel panic because of (possible) division by zero
Commit 3759824da87b30ce7a35b4873b62b0ba38905ef5 (tcp: PRR uses CRB mode by default and SS mode conditionally) introduced changes to net/ipv4/tcp_input.c tcp_cwnd_reduction() that, possibly, cause division by zero and, therefore, a kernel panic in the interrupt handler [1].

Reverting 3759824da87b30ce7a35b4873b62b0ba38905ef5 seems to fix the issue.

I'm able to reproduce the issue on 4.3.0–4.3.3 once every several days (occasionally).

What could be done to help in debugging this issue?

Regards,

Oleksandr.

[1] http://i.piccy.info/i9/6f5cb187c4ff282d189f78c63f95af43/1450729403/283985/951663/panic.jpg