It is certainly odd, but it's definitely a "thing." https://archive.nanog.org/meetings/nanog37/presentations/matt.levine.pdf
On Wed, Nov 13, 2019 at 10:24 AM Matt Corallo <na...@as397444.net> wrote:
>
> This sounds like a bug on Cloudflare’s end (cause trying to do anycast TCP
> is... out of spec to say the least), not a bug in ECN/ECMP.
>
> > On Nov 13, 2019, at 11:07, Toke Høiland-Jørgensen via NANOG
> > <nanog@nanog.org> wrote:
> >
> >> Hello
> >>
> >> I have a customer that believes my network has an ECN problem. We do
> >> not, we just move packets. But how do I prove it?
> >>
> >> Is there a tool that checks for ECN trouble? Ideally something I could
> >> run on the NLNOG Ring network.
> >>
> >> I believe it likely that it is the destination that has the problem.
> >
> > Hi Baldur
> >
> > I believe I may be that customer :)
> >
> > First of all, thank you for looking into the issue! We've been having
> > great fun over on the ecn-sane mailing list trying to figure out what's
> > going on. I'll summarise below, but see this thread for the discussion
> > and debugging details:
> > https://lists.bufferbloat.net/pipermail/ecn-sane/2019-November/000527.html
> >
> > The short version is that the problem appears to come from a combination
> > of the ECMP routing in your network, and Cloudflare's heavy use of
> > anycast. Specifically, a router in your network appears to be doing ECMP
> > by hashing on the packet header, *including the ECN bits*. This breaks
> > TCP connections with ECN because the TCP SYN (with no ECN bits set) ends
> > up taking a different path than the rest of the flow (which is marked as
> > ECT(0)). When the destination is anycasted, this means that the data
> > packets go to a different server than the SYN did. This second server
> > doesn't recognise the connection, and so replies with a TCP RST. To fix
> > this, simply exclude the ECN bits (or the whole TOS byte) from your
> > router's ECMP hash.
> >
> > For a longer exposition, see below.
> > You should be able to verify this
> > from somewhere else in the network, but if there's anything else you
> > want me to test, do let me know. Also, would you mind sharing the router
> > make and model that does this? We're trying to collect real-world
> > examples of network problems caused by ECN and this is definitely an
> > interesting example.
> >
> > -Toke
> >
> >
> > The long version:
> >
> > From my end I can see that I have two paths to Cloudflare; which is
> > taken appears to be based on a hash of the packet header, as can be seen
> > by varying the source port:
> >
> > $ traceroute -q 1 --sport=10000 104.24.125.13
> > traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
> >  1  _gateway (10.42.3.1)  0.357 ms
> >  2  albertslund-edge1-lo.net.gigabit.dk (185.24.171.254)  4.707 ms
> >  3  customer-185-24-168-46.ip4.gigabit.dk (185.24.168.46)  1.283 ms
> >  4  te0-1-1-5.rcr21.cph01.atlas.cogentco.com (149.6.137.49)  1.667 ms
> >  5  netnod-ix-cph-blue-9000.cloudflare.com (212.237.192.246)  1.406 ms
> >  6  104.24.125.13 (104.24.125.13)  1.322 ms
> >
> > $ traceroute -q 1 --sport=10001 104.24.125.13
> > traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
> >  1  _gateway (10.42.3.1)  0.293 ms
> >  2  albertslund-edge1-lo.net.gigabit.dk (185.24.171.254)  3.430 ms
> >  3  customer-185-24-168-38.ip4.gigabit.dk (185.24.168.38)  1.194 ms
> >  4  10ge1-2.core1.cph1.he.net (216.66.83.101)  1.297 ms
> >  5  be2306.ccr42.ham01.atlas.cogentco.com (130.117.3.237)  6.805 ms
> >  6  149.6.142.130 (149.6.142.130)  6.925 ms
> >  7  104.24.125.13 (104.24.125.13)  1.501 ms
> >
> >
> > This is fine in itself.
> > However, the problem stems from the fact that
> > the ECN bits in the IP header are also included in the ECMP hash (-t
> > sets the TOS byte; -t 1 ends up as ECT(0) on the wire and -t 2 is
> > ECT(1)):
> >
> > $ traceroute -q 1 --sport=10000 104.24.125.13 -t 1
> > traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
> >  1  _gateway (10.42.3.1)  0.336 ms
> >  2  albertslund-edge1-lo.net.gigabit.dk (185.24.171.254)  6.964 ms
> >  3  customer-185-24-168-46.ip4.gigabit.dk (185.24.168.46)  1.056 ms
> >  4  te0-1-1-5.rcr21.cph01.atlas.cogentco.com (149.6.137.49)  1.512 ms
> >  5  netnod-ix-cph-blue-9000.cloudflare.com (212.237.192.246)  1.313 ms
> >  6  104.24.125.13 (104.24.125.13)  1.210 ms
> >
> > $ traceroute -q 1 --sport=10000 104.24.125.13 -t 2
> > traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
> >  1  _gateway (10.42.3.1)  0.339 ms
> >  2  albertslund-edge1-lo.net.gigabit.dk (185.24.171.254)  2.565 ms
> >  3  customer-185-24-168-38.ip4.gigabit.dk (185.24.168.38)  1.301 ms
> >  4  10ge1-2.core1.cph1.he.net (216.66.83.101)  1.339 ms
> >  5  be2306.ccr42.ham01.atlas.cogentco.com (130.117.3.237)  6.570 ms
> >  6  149.6.142.130 (149.6.142.130)  6.888 ms
> >  7  104.24.125.13 (104.24.125.13)  1.785 ms
> >
> >
> > So why is this a problem? The TCP SYN packet first needs to negotiate
> > ECN, so it is sent without any ECN bits set in the header; after
> > negotiation succeeds, the data packets will be marked as ECT(0). But
> > because that becomes part of the ECMP hash, those packets will take
> > another path. And since the destination is anycasted, that means they
> > will also end up at a different endpoint. This second endpoint won't
> > recognise the connection, and will reply with a TCP RST.
> > This is clearly
> > visible in tcpdump; notice the different TOS values, and that the RST
> > packet has a different TTL than the SYN-ACK:
> >
> > 12:21:47.816359 IP (tos 0x0, ttl 64, id 25687, offset 0, flags [DF], proto TCP (6), length 60)
> >     10.42.3.130.34420 > 104.24.125.13.80: Flags [SEW], cksum 0xf2ff (incorrect -> 0x0853), seq 3345293502, win 64240, options [mss 1460,sackOK,TS val 4248691972 ecr 0,nop,wscale 7], length 0
> > 12:21:47.823395 IP (tos 0x0, ttl 58, id 0, offset 0, flags [DF], proto TCP (6), length 52)
> >     104.24.125.13.80 > 10.42.3.130.34420: Flags [S.E], cksum 0x9f4a (correct), seq 1936951409, ack 3345293503, win 29200, options [mss 1400,nop,nop,sackOK,nop,wscale 10], length 0
> > 12:21:47.823479 IP (tos 0x0, ttl 64, id 25688, offset 0, flags [DF], proto TCP (6), length 40)
> >     10.42.3.130.34420 > 104.24.125.13.80: Flags [.], cksum 0xf2eb (incorrect -> 0x503e), seq 1, ack 1, win 502, length 0
> > 12:21:47.823665 IP (tos 0x2,ECT(0), ttl 64, id 25689, offset 0, flags [DF], proto TCP (6), length 117)
> >     10.42.3.130.34420 > 104.24.125.13.80: Flags [P.], cksum 0xf338 (incorrect -> 0xc1d4), seq 1:78, ack 1, win 502, length 77: HTTP, length: 77
> >     GET / HTTP/1.1
> >     Host: 104.24.125.13
> >     User-Agent: curl/7.66.0
> >     Accept: */*
> >
> > 12:21:47.825485 IP (tos 0x2,ECT(0), ttl 60, id 0, offset 0, flags [DF], proto TCP (6), length 40)
> >     104.24.125.13.80 > 10.42.3.130.34420: Flags [R], cksum 0x3a65 (correct), seq 1936951410, win 0, length 0
> >
> >
> > The fix is to stop hashing on the ECN bits when doing ECMP. You could
> > keep hashing on the diffserv part of the TOS field if you want, but I
> > think it would also be fine to just exclude the TOS field entirely from
> > the hash.
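
For anyone who wants to see the effect without touching a router, here is a rough Python sketch of the failure mode described above. It is only an illustration: the function names (ecmp_path, flow_splits) are made up for this example, SHA-256 is a stand-in for whatever hash a real router uses (the vendor's actual hash is not known from this thread), and the addresses and ports are borrowed from the tcpdump trace. It compares the path chosen for a SYN (TOS 0x00) with the path chosen for an ECT(0) data packet (TOS 0x02), with and without the ECN bits in the hash input.

```python
import hashlib
import ipaddress
import struct

ECN_MASK = 0x03  # low two bits of the TOS byte carry ECN (RFC 3168)


def ecmp_path(src, dst, sport, dport, tos, n_paths=2, include_ecn=True):
    """Pick a next-hop index from a hash of header fields (illustrative only)."""
    if not include_ecn:
        tos &= ~ECN_MASK & 0xFF  # keep the diffserv bits, drop the ECN bits
    key = struct.pack(
        "!4s4sHHB",
        ipaddress.IPv4Address(src).packed,
        ipaddress.IPv4Address(dst).packed,
        sport,
        dport,
        tos,
    )
    # SHA-256 is an assumption for the sketch, not what any router runs.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


def flow_splits(sport, include_ecn):
    """True if the SYN and the ECT(0) data packets hash to different paths."""
    flow = ("10.42.3.130", "104.24.125.13", sport, 80)
    syn_path = ecmp_path(*flow, tos=0x00, include_ecn=include_ecn)
    data_path = ecmp_path(*flow, tos=0x02, include_ecn=include_ecn)
    return syn_path != data_path


if __name__ == "__main__":
    ports = range(34400, 34500)
    print("flows split, ECN bits hashed:  ", sum(flow_splits(p, True) for p in ports))
    print("flows split, ECN bits excluded:", sum(flow_splits(p, False) for p in ports))
```

With two equal-cost paths, roughly half of the flows should split when the ECN bits are hashed, and none should split once they are masked out. Masking only the low two bits keeps the diffserv part of the TOS field in the hash, which matches the first of the two fixes suggested above.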