On Tue, Oct 10, 2023 at 03:57:09PM +0200, Willy Tarreau wrote:
> On Tue, Oct 10, 2023 at 03:49:21PM +0200, Willy Tarreau wrote:
> > > Seems like a clever update to the "good old" h2 multiplexing abuse
> > > vectors:
> > > 1. client opens a lot of H2 streams on a connection
> > > 2. spams some requests
> > > 3. immediately sends h2 RST frames for all of them
> > > 4. go back to 1, repeat.
> >
> > Yes, precisely one of those I tested last week among the usual approaches
> > consisting in creating tons of streams while staying within the protocol's
> > validity limits. The only thing it did was to detect the pool issue I
> > mentioned in the dev7 announce.
> >
> > > The idea being to cause resource exhaustion on the server/proxy at least
> > > when it allocates stream-related buffers etc, and the underlying server
> > > too, since it likely sees the requests before they get cancelled.
> > >
> > > Looking at HAProxy I'd like to know if someone's aware of a decent
> > > mitigation option?
> >
> > We first need to check if we're affected at all, since we keep a count
> > of the attached streams for precisely this case and we refrain from
> > processing HEADERS frames when we have too many streams, so normally
> > the mux will pause, waiting for the upper layers to close.
>
> So at first glance we indeed addressed this case in 2018 (1.9-dev)
> with this commit:
>
>   f210191dc ("BUG/MEDIUM: h2: don't accept new streams if conn_streams are
>              still in excess")
>
> It was incomplete back then and later refined, but the idea is there.
> But I'll try to stress that area again to see.
Since I was bored of wasting my time trying to harm the process from scripts, I finally wrote a small program that does the same but much faster. For now it's boring as well. In short:

  - the concurrency issue was addressed 5 years ago with the commit above, so
    all maintained versions are immune to this. The principle is that each H2
    connection knows both the number of protocol-level streams attached to it
    and the number of application-level streams, and it's the latter that
    enforces the limit, preventing the mux from processing more requests until
    the number of active streams is within the limit again. In the worst case
    (i.e. if the attacker is downloading and its window updates cannot enter
    anymore), the streams will simply time out, then the connection, like on a
    single non-multiplexed connection, so nothing new here. It also means that
    no more than the configured limit of streams per connection will reach the
    hosted application at once.

  - the risk of CPU usage that was also mentioned is not much relevant either.
    These aborted requests actually cost less CPU than completed ones, and on
    my laptop I found that I would reach up to 100-150k req/s per core
    (depending on CPU thermal throttling), which is perfectly within what we
    normally observe with a standard h2load injection. With fewer streams I
    could even reach 1 million requests per second total, because they were
    aborted before being turned into a regular stream, so the load was
    essentially between haproxy and the client.

So at this point I'm still failing to find any case where this attack hurts haproxy more than any of the benchmarks we're routinely inflicting on it, given that it acts exactly like a client configured with a short timeout (e.g. if you configure haproxy with "timeout server 1" and have an h2 server, you will basically get the same traffic pattern).
If you want to block some annoying testers who would fill your logs in the next few days, I checked that the following works fine here (set the limit to the max number of requests per 10 seconds you want to accept, e.g. 1000 below, or 100/s to keep a large margin):

    tcp-request connection track-sc0 src
    http-request reject if { sc0_http_req_rate gt 1000 }
    stick-table type ip size 1m store http_req_rate(10s)

It's even possible to play with "set-log-level silent" to avoid logging them, and you may even block new connections at the TCP layer. But for now, if your site requires any of this, I can't see how it has not experienced weekly outages from standard attacks. Note that when I say that it's not because your server can process 2 million req/s that you have to use it at that speed on a single machine, it's exactly to keep that type of comfortable margin.

I tested the client on an EPYC 74F3 server (24 cores at 3 GHz, the one I demoed at haproxyconf) and it handles 800k req/s at saturation, spending most of its time in the weighted round-robin lock (it reaches 850k with random, so pretty much standard for this machine), and perf top looks good:

  Samples: 9M of event 'cycles', 4000 Hz, Event count (approx.): 864711451063 lost: 0/0 drop: 0/0
  Overhead  Shared Object  Symbol
     4.11%  haproxy        [.] process_stream
     3.29%  haproxy        [.] srv_add_to_idle_list
     3.02%  haproxy        [.] conn_backend_get.isra.0
     2.61%  haproxy        [.] back_try_conn_req
     2.46%  haproxy        [.] stream_set_backend
     1.99%  haproxy        [.] chash_get_server_hash
     1.75%  haproxy        [.] assign_server
     1.67%  [kernel]       [k] fetch_pte.isra.0
     1.67%  [kernel]       [k] ice_napi_poll
     1.44%  haproxy        [.] sess_change_server
     1.39%  haproxy        [.] stream_update_time_stats
     1.35%  [kernel]       [k] iommu_map_page
     1.31%  haproxy        [.] http_wait_for_request
     1.24%  [kernel]       [k] acpi_processor_ffh_cstate_enter
     1.23%  haproxy        [.] __pool_alloc
     1.22%  [kernel]       [k] skb_release_data
     1.17%  [kernel]       [k] tcp_ack
     1.13%  [kernel]       [k] tcp_recvmsg
     1.09%  [kernel]       [k] copy_user_generic_string
     0.99%  [kernel]       [k] __fget_light

I also compared the number of calls to the different functions inside the process under attack and under h2load. They're pretty much identical:

Attack (24 clients, stopped at ~2.2M req):

  $ socat - /tmp/sock1 <<< "show profiling"
  Per-task CPU profiling              : on      # set profiling tasks {on|auto|off}
  Memory usage profiling              : off     # set profiling memory {on|off}
  Tasks activity:
    function               calls     cpu_tot  cpu_avg  lat_tot  lat_avg
    process_stream         2258751   14.59s   6.457us  46.20m   1.227ms <- sc_notify@src/stconn.c:1141 task_wakeup
    sc_conn_io_cb          2257588   866.4ms  383.0ns  7.043m   187.2us <- sc_app_chk_rcv_conn@src/stconn.c:780 tasklet_wakeup
    process_stream         2253237   31.33s   13.90us  47.91m   1.276ms <- stream_new@src/stream.c:578 task_wakeup
    h1_io_cb               2250867   6.203s   2.755us  43.14s   19.16us <- sock_conn_iocb@src/sock.c:875 tasklet_wakeup
    sc_conn_io_cb          1436248   1.087s   756.0ns  49.17s   34.23us <- h1_wake_stream_for_recv@src/mux_h1.c:2578 tasklet_wakeup
    h1_io_cb                247901   69.24ms  279.0ns  31.15s   125.7us <- h1_takeover@src/mux_h1.c:4147 tasklet_wakeup
    h2_io_cb                200187   4.025s   20.11us  27.02s   135.0us <- h2c_restart_reading@src/mux_h2.c:737 tasklet_wakeup
    sc_conn_io_cb           161965   124.1ms  766.0ns  20.77s   128.2us <- h2s_notify_recv@src/mux_h2.c:1240 tasklet_wakeup
    h2_io_cb                 70176   1.483s   21.14us  9.021s   128.5us <- h2_snd_buf@src/mux_h2.c:6675 tasklet_wakeup
    h1_io_cb                  6126   8.775ms  1.432us  16.69s   2.725ms <- sock_conn_iocb@src/sock.c:854 tasklet_wakeup
    sc_conn_io_cb             6115   46.96ms  7.679us  22.63s   3.700ms <- h1_wake_stream_for_send@src/mux_h1.c:2588 tasklet_wakeup
    h1_timeout_task           4316   832.9us  192.0ns  4.955s   1.148ms <- h1_release@src/mux_h1.c:1045 task_wakeup
    accept_queue_process        12   133.6us  11.13us  1.449ms  120.8us <- listener_accept@src/listener.c:1469 tasklet_wakeup
    h1_io_cb                    11   3.006ms  273.3us  4.930us  448.0ns <- conn_subscribe@src/connection.c:736 tasklet_wakeup
    h2_io_cb                     6   405.4us  67.56us  4.555ms  759.2us <- sock_conn_iocb@src/sock.c:875 tasklet_wakeup
    task_run_applet              1   -        -        233.9us  233.9us <- sc_applet_create@src/stconn.c:502 appctx_wakeup
    sc_conn_io_cb                1   11.86us  11.86us  1.373us  1.373us <- sock_conn_iocb@src/sock.c:875 tasklet_wakeup
    task_run_applet              1   411.0ns  411.0ns  5.311us  5.311us <- sc_app_shut_applet@src/stconn.c:911 appctx_wakeup

h2load with 24 clients and approx. the same number of requests:

  $ socat - /tmp/sock1 <<< "show profiling"
  Per-task CPU profiling              : on      # set profiling tasks {on|auto|off}
  Memory usage profiling              : off     # set profiling memory {on|off}
  Tasks activity:
    function               calls     cpu_tot  cpu_avg  lat_tot  lat_avg
    process_stream         2261040   13.42s   5.933us  34.49m   915.2us <- sc_notify@src/stconn.c:1141 task_wakeup
    sc_conn_io_cb          2259306   657.0ms  290.0ns  5.251m   139.4us <- sc_app_chk_rcv_conn@src/stconn.c:780 tasklet_wakeup
    h1_io_cb               2258770   4.721s   2.090us  30.58s   13.54us <- sock_conn_iocb@src/sock.c:875 tasklet_wakeup
    process_stream         2258252   29.02s   12.85us  34.76m   923.5us <- stream_new@src/stream.c:578 task_wakeup
    sc_conn_io_cb          1690412   1.102s   652.0ns  59.93s   35.45us <- h1_wake_stream_for_recv@src/mux_h1.c:2578 tasklet_wakeup
    h2_io_cb                183444   2.820s   15.37us  28.68s   156.3us <- h2_snd_buf@src/mux_h2.c:6675 tasklet_wakeup
    h2_io_cb                149856   203.4ms  1.357us  496.5ms  3.313us <- h2c_restart_reading@src/mux_h2.c:737 tasklet_wakeup
    h2_io_cb                 72164   2.100s   29.10us  7.894s   109.4us <- sock_conn_iocb@src/sock.c:875 tasklet_wakeup
    h1_io_cb                  4805   1.579ms  328.0ns  573.0ms  119.3us <- h1_takeover@src/mux_h1.c:4147 tasklet_wakeup
    sc_conn_io_cb             2288   15.37ms  6.717us  24.89s   10.88ms <- h1_wake_stream_for_send@src/mux_h1.c:2588 tasklet_wakeup
    h1_io_cb                  2288   2.436ms  1.064us  20.22s   8.838ms <- sock_conn_iocb@src/sock.c:854 tasklet_wakeup
    h1_timeout_task            222   95.21us  428.0ns  750.5ms  3.381ms <- h1_release@src/mux_h1.c:1045 task_wakeup
    h2_timeout_task             24   7.914us  329.0ns  314.9us  13.12us <- h2_release@src/mux_h2.c:1146 task_wakeup
    accept_queue_process         9   75.00us  8.333us  416.8us  46.31us <- listener_accept@src/listener.c:1469 tasklet_wakeup
    h1_io_cb                     2   124.6us  62.28us  1.785us  892.0ns <- conn_subscribe@src/connection.c:736 tasklet_wakeup
    task_run_applet              1   431.0ns  431.0ns  5.631us  5.631us <- sc_app_shut_applet@src/stconn.c:911 appctx_wakeup
    task_run_applet              1   -        -        1.563us  1.563us <- sc_applet_create@src/stconn.c:502 appctx_wakeup
    sc_conn_io_cb                1   13.80us  13.80us  1.864us  1.864us <- sock_conn_iocb@src/sock.c:875 tasklet_wakeup

I can't differentiate them; most of the activity is at the application layer (process_stream).

As Tristan mentioned, lowering tune.h2.be.max-concurrent-streams may also slow them down, but it will also slow down some sites with many objects (think shops with many images). For a long time the high parallelism of H2 was sold as a huge differentiator; I don't feel like starting to advertise lowering it now. Also, please note that when it comes to anything past the reverse proxy, there's no difference between the attack over a single connection and sending 10 times less traffic over 10 connections (i.e. h2load): the total number of streams remains the same, so any remediation based on lowering the number of streams per connection just calls for the client to increase the number of connections.

Now of course if some find corner cases that affect them, I'm all ears (and we can even discuss them privately if really needed). But I think this issue essentially depends on the component's architecture: some will eat more CPU, others more RAM, etc. There are lots of other interesting attacks on the H2 protocol that can be triggered with just a regular client: with low timeouts, with low stream windows (use h2load -w 1 to have fun), with zero windows during transfers, and even by playing with one-byte CONTINUATION frames that may force some components to perform reallocations and copies.
But most of them depend on the implementation and on the attacker, and were discussed at great length during the protocol's design 10 years ago, so that the cost remains balanced between the attacker and the target. In the end, H2 is not very robust, but each implementation has some latitude to cover some of the limitations, and these choices differ due to many architectural constraints. The good point in all this is that it will probably make more people want to reconsider H3/QUIC if they don't trust their products anymore :-)

Willy