Can you share the Arista model and EOS version of the devices you installed on which TTL hashing was enabled by default?

On Sat, Sep 1, 2018 at 2:51 PM, <frnk...@iname.com> wrote:

> I want to share a little bit of our journey in tracking down the TCP RSTs that impacted some of our customers for almost ten weeks.
>
> Almost immediately after we turned up two new Arista border routers in late July we started receiving a trickle of complaints from customers regarding their inability to access certain websites (mostly B2B). All the packet captures showed the standard TCP SYN/SYN-ACK pair, then a TCP RST from the website after the client sent a TLS/SSL Client Hello. As the reports continued to come in, we built a Google Doc to keep track and it became clear that most of the sites were hosted by Incapsula/Imperva, but there were also a few by Sucuri and Fastly. Knowing that Incapsula provides DoS protection, we attempted to work with them (providing websites, source/destination IPs, traceroutes, and packet captures) to find out why their hosts were issuing our customers a TCP RST, but we made little progress. We moved some of the affected customers to different IP addresses but that didn’t resolve the issue. We also asked our customers to work with the website to see if they would be willing to open a ticket with Incapsula. In the meantime, customers were getting frustrated! They couldn’t visit Incapsula-hosted healthcare websites, financial firms, product dealers, etc. Over the weeks, a few of those customers purchased/borrowed different routers and some of those didn’t have website issues anymore. And more than a few of them discovered that the websites worked fine from home or their mobile phone/hotspot, but not from their Internet connection with us. You can guess where they were applying pressure! That said, we didn’t know why a small handful of companies, known for DoS protection, were issuing TCP RSTs to just some of our customers.
>
> Earlier this week we received four or five more websites from yet another affected customer, but most of those were with Fastly. By this time, we had been able to replicate the issue in our lab. Feeling desperate to make some tangible progress on this issue, I reached out to the Fastly NOC. In less than 12 hours they provided some helpful feedback, pointing out that a single traceroute to a Fastly site was hitting two of their POPs (they use anycast) and because they don’t sync state between POPs the second POP would naturally issue a TCP RST (sidebar: fascinating blog article on Fastly’s infrastructure here: https://www.fastly.com/blog/building-and-scaling-fastly-network-part-2-balancing-requests). In subsequent email exchanges, the Fastly NOC suggested that it appeared that we were “spraying flows” (that is, packets belonging to a single client session were egressing our network via different paths). Because Fastly is also present with us at an IX (though they weren’t advertising their anycast IPs at the time), they suggested that we look at how our traffic egresses our network (IX versus transit) and our routers’ outbound load-balancing/hashing schemes.
>
> The IX turned out to be a red herring, so I turned my attention to our transit. Each of our border routers has two BGP sessions over two circuits to transit provider POP A and two BGP sessions over two circuits to transit provider POP B, for a total of four BGP sessions per border router, a total of eight BGP sessions altogether. Starting with our core router, I confirmed that its ECMP hashing was consistent such that Fastly-bound traffic always went to border router 1 or border router 2.
> Then I looked at the ECMP hashing scheme on our border routers and noticed something unique – by default Arista also uses TTL:
>
> IPv4 hash fields:
>    Source IPv4 Address is ON
>    Protocol is ON
>    Time-To-Live is ON
>    Destination IPv4 Address is ON
>
> Since the source and destination IPs and protocol weren’t changing, perhaps the TTL was not consistent? I opened the first packet trace in Wireshark and jackpot – the TTL value was 128 on the SYN but 127 on the TLS/SSL Client Hello. I adjusted the Arista’s load-balancing profile not to use TTL and immediately my MTR in the background changed and all the sites on the lab machine that couldn’t load before … were now loading.
>
> Fastly also pointed me to another article written by Joel Jaeggli (https://blog.apnic.net/2018/01/11/ipv6-flow-label-misuse-hashing/) that discusses IPv6 flow labels – we removed that from the border routers’ IPv6 hash fields, too.
>
> I reviewed the packet traces today and noticed that TTL values remained consistent at 128 **behind** the router CPE. In packet captures on the WAN interface of the router CPE I see that the SYN remains at 128, but the TLS/SSL Client Hello is properly decremented to 127. So, it appears that some router CPE (and there were a variety of makes and models) are doing something special to certain packets and not decrementing the TTL.
>
> This explains why:
>
> - our customers had issues with all their devices behind their router CPE
> - the issue remained regardless of what public IP address their router CPE obtained via DHCP or was assigned
> - some customers who changed their router CPE didn’t have the issue anymore – they got lucky with a router that doesn’t adjust/reset the TTL
> - customers who used our managed Wi-Fi router did not see the issue, because that model apparently doesn’t manipulate the TTL, at least not in an inconsistent way.
>
> Lesson learned: review a device’s hashing mechanism before going into production.
>
> For those interested, I have links to the packet traces below my signature, showing the inconsistent TTL values.
>
> Thanks again to the fantastic group of folk at the Fastly NOC who so ably pointed us in the right direction!
>
> Frank
>
> https://www.premieronline.net/~fbulk/example1.pcapng
> https://www.premieronline.net/~fbulk/example2.pcapng
> https://www.premieronline.net/~fbulk/example3.pcapng
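
For anyone following along, here is a rough sketch of why a TTL that changes mid-flow matters once TTL is one of the ECMP hash inputs. This is a toy model only: `pick_path`, the SHA-256 hash, the modulo mapping, and the example addresses are all assumptions made for illustration, and none of it reflects Arista's actual hash function. The TTL values 128 and 127 and the count of four uplinks per border router come from the message above.

```python
import hashlib


def pick_path(src_ip, dst_ip, proto, ttl, n_paths, use_ttl=True):
    """Toy ECMP selector (hypothetical; NOT Arista's real hash).

    Hashes the selected header fields and maps the result onto one of
    n_paths equal-cost next hops.
    """
    fields = [src_ip, dst_ip, str(proto)]
    if use_ttl:
        # With TTL in the hash key, a TTL change alters the hash input.
        fields.append(str(ttl))
    digest = hashlib.sha256("|".join(fields).encode("ascii")).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


if __name__ == "__main__":
    # Documentation addresses (RFC 5737), protocol 6 = TCP, four uplinks.
    flow = ("192.0.2.10", "198.51.100.20", 6)
    for use_ttl in (True, False):
        syn = pick_path(*flow, ttl=128, n_paths=4, use_ttl=use_ttl)
        hello = pick_path(*flow, ttl=127, n_paths=4, use_ttl=use_ttl)
        print(f"use_ttl={use_ttl}: SYN -> path {syn}, Client Hello -> path {hello}")
```

With TTL excluded, the SYN and the Client Hello always hash to the same path because the remaining fields are identical. With TTL included, the two packets may or may not land on the same path depending on how the hash falls, which would be consistent with only some sites and some customers breaking.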