[Bloat] The curious case of "cursed-ECN" steam downloads

Sebastian Moeller via Bloat Sun, 03 Sep 2023 11:54:37 -0700

Dear ECN experts,


I want to report some oddities I encountered while downloading data from Steam's 
network. (My kids started playing Steam games recently, so my network started 
to see Steam loads.) Some games apparently need routine updates every few weeks 
in the multi-GB range, which on the one hand seems quite excessive to me, but on 
the other hand presents a nice way to look closer at how modern CDN-backed 
downloads operate. All of this flows through my OpenWrt-based router, which uses 
cake traffic-shaper/scheduler/AQM combos in both directions. (Cake by default 
uses RFC 3168 ECN signaling for packets presenting either ECT(0) or ECT(1), but 
concurrently employs BLUE to increase the pure drop probability, to eventually 
also rein in unresponsive flows.)

So far I have seen two things worth noting:

A) Excessive CE marking (this happened with a multi-GB download served from 
Cloudflare CDN nodes), in the range of 70% of packets marked; Jonathan gave a 
reasonable explanation that this might be BBRv1 in action. Side note: am I the 
only one slightly miffed that BBR apparently neglected either to implement an 
appropriate CE response or to make sure that TCPs using BBR refrain from 
negotiating ECN? Unfortunately I did not take packet captures of this event 
(only tc -s qdisc snapshots before and after). I already posted this on the 
list earlier.


B) Excessive ECT(1) marking (this happened with a multi-GB download)

Here is the cctrace output for the respective flow (cctrace was developed as 
part of SCE, and the reported SCE events are equivalent to ECT(1)):
55827-443:
   Up:   SCE=0, CE=0, ECE=79538, CWR=0, NS=0, total=104222
   Down: SCE=200260, CE=0, ECE=0, CWR=2691, NS=0, total=200429

Here is what Wireshark reported for the same TCP flow:

{
"Address A": "2a01:c22:8c6c:8700:b84b:c89e:6424:8c74",
"Address B": "2a01:bc80:7:100::9b85:f813",
"Bits/s A → B": "289152",
"Bits/s B → A": "9205750",
"Bytes": "305873523",
"Bytes A → B": "9314897",
"Bytes B → A": "296558626",
"Duration": "257.715992",
"Packets": "304651",
"Packets A → B": "104222",
"Packets B → A": "200429",
"Percent Filtered": "0",
"Port A": "55827",
"Port B": "443",
"Rel Start": "314.512067",
"Stream ID": "259",
"Total Packets": "0"
},

So all in all this flow carried roughly 5% of the total download volume, but 
essentially all of its packets were ECT(1).
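
To put a number on "essentially all", here is a minimal Python sketch that parses the two cctrace counter lines quoted above and computes the marking shares. The helper name `parse_counters` is made up for illustration; only the counter values come from the trace.

```python
import re

# Counter lines copied from the cctrace output above.
CCTRACE = """\
Up:   SCE=0, CE=0, ECE=79538, CWR=0, NS=0, total=104222
Down: SCE=200260, CE=0, ECE=0, CWR=2691, NS=0, total=200429
"""


def parse_counters(text):
    """Return {direction: {counter: value}} for each 'Up:'/'Down:' line."""
    result = {}
    for line in text.splitlines():
        direction, _, rest = line.partition(":")
        result[direction.strip()] = {
            key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", rest)
        }
    return result


counters = parse_counters(CCTRACE)
down, up = counters["Down"], counters["Up"]

# Share of downstream packets that arrived as ECT(1) (reported as SCE).
ect1_share = down["SCE"] / down["total"]
# Share of upstream packets that carried the ECE flag.
ece_share = up["ECE"] / up["total"]

print(f"ECT(1) share of downstream packets: {ect1_share:.2%}")
print(f"ECE share of upstream packets:      {ece_share:.2%}")
```

With the counters above this comes out at about 99.9% ECT(1) downstream and about 76% ECE upstream.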


Here I took a packet capture (albeit upstream of my ingress shaper, so I see no 
CE markings or dropped packets), but I failed to take the tc -s qdisc snapshots 
to get easy access to the number of CE marks and drops (these would not be per 
flow anyway).

123-1234567:CAKE-autorate user$ sudo mtr -ezb6w -c 100 2a01:bc80:7:100::9b85:f813
Password:
Start: 2023-09-03T19:53:45+0200
HOST: 123-1234567.local                                                                                       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS6805   dynamic-2a01-0c23-9012-7900-0000-0000-0000-0001.c23.pool.telefonica.de (2a01:c23:9012:7900::1)   1.0%   100    1.1   1.2   0.8   2.7   0.3
  2. AS6805   2a02:3001::11e                                                                                   0.0%   100   15.9  46.2  12.9 149.6  26.4
  3. AS6805   2a02:3001::1b7                                                                                  56.0%   100   13.7  12.7  11.4  13.9   0.6
  4. AS6805   2a02:3040:0:10::1c                                                                              46.0%   100   12.2  12.8  11.7  15.1   0.6
  5. AS???    ???                                                                                             100.0   100    0.0   0.0   0.0   0.0   0.0
  6. AS???    ???                                                                                             100.0   100    0.0   0.0   0.0   0.0   0.0
  7. AS???    amsix-v6.valve.net (2001:7f8:1::a503:2590:1)                                                     0.0%   100   28.6  31.7  26.0 116.4  18.1
  8. AS32590  2a01:bc80:7:ffff::9b85:f8fb                                                                      0.0%   100   28.7  29.0  27.5  32.6   0.8
  9. AS32590  2a01:bc80:7:100::9b85:f813                                                                       0.0%   100   26.9  26.5  24.7  31.8   1.0


At an average of 9.2 Mbps, I saw one CWR for every 79538/2691 = 29.5 ECEs. At 
~9 Mbps this works out to around 1000*(29.5*1534*8)/(9.2 * 1000^2) = 39.4 ms 
between sending the first ECE and seeing a CWR. This seems within range of 
the 24.7 ms unloaded RTT to that IPv6 address. (Note: when I first reported 
this I ran mtr against the wrong server IPv6 address and reported ~10 ms RTT, 
but looking at the actual TCP stream and mtr-ing that today I got the above.)
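
The back-of-the-envelope estimate above can be redone from the raw counters. This is only a sketch of the same arithmetic: the 1534-byte packet size is the figure used in the estimate, and the rate is recomputed from the Wireshark byte count and duration rather than rounded to 9.2 Mbps. The variable names are illustrative.

```python
# Counters from the cctrace output and the Wireshark conversation summary.
ece_segments = 79538      # upstream segments carrying ECE
cwr_segments = 2691       # downstream segments carrying CWR
bytes_down = 296558626    # "Bytes B → A"
duration_s = 257.715992   # "Duration"
packet_bytes = 1534       # packet size assumed in the estimate above

# Average downstream rate in bits per second.
rate_bps = bytes_down * 8 / duration_s

# Average number of ECE-flagged segments sent per CWR received.
eces_per_cwr = ece_segments / cwr_segments

# Time it takes to receive that many full-size packets at the average rate.
interval_ms = 1000 * eces_per_cwr * packet_bytes * 8 / rate_bps

print(f"average rate:   {rate_bps / 1e6:.2f} Mbps")
print(f"ECEs per CWR:   {eces_per_cwr:.2f}")
print(f"ECE-to-CWR gap: {interval_ms:.1f} ms")
```

The unrounded ratio is slightly above 29.5, but the resulting interval still lands at roughly 39.4 ms.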


The CWR response to the ECEs indicates to me that the flow was responsive to 
ECN signaling (how well I cannot say, as I captured neither CE marks nor packet 
drops), and the network stayed responsive during the download, so to me this 
looks like ECT(1) was used for a genuine ECN-enabled and CE-responsive flow.

It would be interesting if others could check whether they see ECT(1) in actual 
use on their home links. Initially I thought this might be my ISP doing some 
mismarking, but it would be quite lucky to re-mark a genuine ECT(0) flow to TOS 
0x01 and hence ECT(1) so that ECN signaling would still work, while leaving the 
other ~8 flows from the same remote address at ECT(0) (also with working ECN). 
Just in case it matters, my ISP is AS6805 Telefónica Germany GmbH & Co. OHG. 

For quick monitoring on my OpenWrt router, I use the following tcpdump 
invocation just to see what is happening (pppoe-wan is my wan interface's name):

tcpdump -i pppoe-wan -v -n '(ip6 and (ip6[0:2] & 0x30) >> 4  == 1)' or '(ip and 
(ip[1] & 0x3) == 1)' # ECT(1)

(To exercise/test this I use, under macOS:
ping -z 0x01 -c 10 one.one.one.one
or
ping6 -z 0x01 -c 10 one.one.one.one)

and to see ECN activity in TCP I use:

tcpdump -i pppoe-wan -v -n '(tcp[tcpflags] & (tcp-ece|tcp-cwr) != 0)' or 
'((ip6[6] = 6) and (ip6[53] & 0xC0 != 0))' # TCP ECN flags, ECN in action
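
For anyone wondering what the byte offsets and masks in the two filters select, here is a small Python sketch of the same bit arithmetic on raw header bytes. The function names and the hand-built example bytes are hypothetical and not taken from the capture. As in the filter, the IPv6 TCP offset assumes there are no extension headers between the IPv6 and TCP headers.

```python
# ECN codepoints as defined in RFC 3168.
ECN_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}


def ipv4_ecn(ip_header: bytes) -> int:
    """ECN field: low two bits of the second IPv4 header byte (ip[1] & 0x3)."""
    return ip_header[1] & 0x3


def ipv6_ecn(ip6_header: bytes) -> int:
    """ECN field: low two bits of the Traffic Class, which straddles the
    first two IPv6 header bytes ((ip6[0:2] & 0x30) >> 4)."""
    return (int.from_bytes(ip6_header[0:2], "big") & 0x30) >> 4


def tcp_ecn_flags(tcp_header: bytes) -> set:
    """ECE (0x40) and CWR (0x80) live in the TCP flags byte at offset 13.
    Behind a plain 40-byte IPv6 header that byte is ip6[53] (40 + 13)."""
    flags = tcp_header[13]
    names = set()
    if flags & 0x40:
        names.add("ECE")
    if flags & 0x80:
        names.add("CWR")
    return names


# Hand-built example bytes (hypothetical, not from the capture).
v4 = bytes([0x45, 0x01])         # version 4, IHL 5, TOS 0x01
v6 = bytes([0x60, 0x10])         # version 6, Traffic Class 0x01
tcp = bytes(13) + bytes([0x50])  # flags byte: ACK (0x10) | ECE (0x40)

print(ECN_NAMES[ipv4_ecn(v4)], ECN_NAMES[ipv6_ecn(v6)], tcp_ecn_flags(tcp))
```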





Regards
        Sebastian










