On Tue, Oct 10, 2023 at 03:57:09PM +0200, Willy Tarreau wrote:
> On Tue, Oct 10, 2023 at 03:49:21PM +0200, Willy Tarreau wrote:
> > > Seems like a clever update to the "good old" h2 multiplexing abuse 
> > > vectors:
> > > 1. the client opens a lot of H2 streams on a connection
> > > 2. spams some requests
> > > 3. immediately sends h2 RST frames for all of them
> > > 4. goes back to 1 and repeats.
> > 
> > Yes, precisely one of those I tested last week among the usual approaches
> > consisting of creating tons of streams while staying within the protocol's
> > validity limits. The only thing it did was to detect the pool issue I
> > mentioned in the dev7 announce.
> > 
> > > The idea being to cause resource exhaustion on the server/proxy at least
> > > when it allocates stream related buffers etc, and the underlying server
> > > too since it likely sees the requests before they get cancelled.
> > > 
> > > Looking at HAProxy I'd like to know if someone's aware of a decent
> > > mitigation option?
> > 
> > We first need to check whether we're affected at all, since we keep a
> > count of the attached streams for precisely this case and we refrain from
> > processing HEADERS frames when we have too many streams, so normally
> > the mux will pause, waiting for the upper layers to close.
> 
> So at first glance we indeed addressed this case in 2018 (1.9-dev)
> with this commit:
> 
>   f210191dc ("BUG/MEDIUM: h2: don't accept new streams if conn_streams are still in excess")
> 
> It was incomplete back then and later refined, but the idea is there.
> But I'll try to stress that area again to see.

So, since I was getting bored wasting my time trying to harm the process
from scripts, I finally wrote a small program that does the same but much
faster. For now the results are just as boring. In short:

  - the concurrency issue was addressed 5 years ago with the commit
    above, so all maintained versions are immune to this. The principle
    is that each H2 connection knows both the number of protocol-level
    streams attached to it and the number of application-level streams,
    and it's the latter that enforces the limit, preventing the mux from
    processing more requests until the number of active streams falls
    within the limit again (a rough illustration follows after this
    list). In the worst case (i.e. if the attacker is downloading and
    its window updates cannot get through anymore), the streams will
    simply time out, then the connection, just like on a single
    non-multiplexed connection, so nothing new here. It also means that
    no more than the configured limit of streams per connection will
    reach the hosted application at once.

  - the risk of CPU usage that was also mentioned is not really relevant
    either. These aborted requests actually cost less CPU than completed
    ones, and on my laptop I found that I could reach up to 100-150k req/s
    per core (depending on CPU thermal throttling), which is perfectly
    within what we normally observe with a standard h2load injection.
    With fewer streams I could even reach 1 million requests per second
    in total, because they were aborted before being turned into a regular
    stream, so the load was essentially between haproxy and the client.
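
To illustrate the accounting principle described in the first point, here
is a minimal sketch in C. It is purely illustrative and not haproxy's
actual code: the structure, the function names and the limit value are
all made up for the example.

      /* Illustrative sketch only, not haproxy's code: per-connection
       * counters and the check performed before turning an incoming
       * HEADERS frame into a new stream.
       */
      #include <stdbool.h>

      #define MAX_CONCURRENT_STREAMS 100   /* assumed per-connection limit */

      struct h2_conn {
          int proto_streams; /* streams known at the protocol level (incl. reset ones) */
          int app_streams;   /* streams still attached to the application layer */
      };

      /* Returns true if a new HEADERS frame may be processed now; false
       * means "pause demuxing" until the upper layer detaches some
       * streams, which is what throttles the create/reset loop.
       */
      static bool h2_may_accept_new_stream(const struct h2_conn *conn)
      {
          return conn->app_streams < MAX_CONCURRENT_STREAMS;
      }

      /* Called when the application layer releases a completed or aborted
       * stream: this frees one slot and lets demuxing resume.
       */
      static void h2_stream_detach(struct h2_conn *conn)
      {
          if (conn->app_streams > 0)
              conn->app_streams--;
      }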

So at this point I'm still failing to find any case where this attack
hurts haproxy more than any of the benchmarks we routinely inflict on
it, given that it acts exactly like a client configured with a short
timeout (e.g. if you configure haproxy with "timeout server 1" and
have an h2 server, you will basically get the same traffic pattern).

If you want to block some annoying testers who would fill your logs
in the next few days, I checked that the following works fine here
(set the limit to the max number of requests per 10 seconds you want
to accept, e.g. 1000 below, or 100/s to keep a large margin):

      tcp-request connection track-sc0 src
      http-request reject if { sc0_http_req_rate gt 1000 }
      stick-table type ip size 1m store http_req_rate(10s)

It's even possible to play with "set-log-level silent" to avoid logging
them, and you may even block new connections at the TCP layer. But for
now, if your site requires any of this, I can't see how it has not
experienced weekly outages from standard attacks. Note that when I say
that just because your server can process 2 million req/s doesn't mean
you should run it at that speed on a single machine, it's exactly to
preserve that kind of comfortable margin.
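
For completeness, here is what the whole thing could look like once
assembled in a frontend section. This is only a sketch combining the
directives mentioned above; the frontend name, bind line, certificate
path and thresholds are placeholders to adapt to your setup:

      frontend ft_web
          mode http
          # placeholder bind line: adjust address, certificate and ALPN
          bind :443 ssl crt /path/to/site.pem alpn h2,http/1.1
          stick-table type ip size 1m store http_req_rate(10s)
          tcp-request connection track-sc0 src
          # refuse new connections from sources already above the limit
          tcp-request connection reject if { sc0_http_req_rate gt 1000 }
          # stop logging the offenders and reject their remaining requests
          http-request set-log-level silent if { sc0_http_req_rate gt 1000 }
          http-request reject if { sc0_http_req_rate gt 1000 }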

I tested the client on an EPYC 74F3 server (24 cores at 3 GHz, the one
that I demoed at haproxyconf) and it handles 800k req/s at saturation,
spending most of its time in the weighted roundrobin lock (it reaches
850k with random so pretty much standard for this machine), and perf
top looks good:

  Samples: 9M of event 'cycles', 4000 Hz, Event count (approx.): 864711451063 lost: 0/0 drop: 0/0
  Overhead  Shared Object             Symbol
     4.11%  haproxy                   [.] process_stream
     3.29%  haproxy                   [.] srv_add_to_idle_list
     3.02%  haproxy                   [.] conn_backend_get.isra.0
     2.61%  haproxy                   [.] back_try_conn_req
     2.46%  haproxy                   [.] stream_set_backend
     1.99%  haproxy                   [.] chash_get_server_hash
     1.75%  haproxy                   [.] assign_server
     1.67%  [kernel]                  [k] fetch_pte.isra.0
     1.67%  [kernel]                  [k] ice_napi_poll
     1.44%  haproxy                   [.] sess_change_server
     1.39%  haproxy                   [.] stream_update_time_stats
     1.35%  [kernel]                  [k] iommu_map_page
     1.31%  haproxy                   [.] http_wait_for_request
     1.24%  [kernel]                  [k] acpi_processor_ffh_cstate_enter
     1.23%  haproxy                   [.] __pool_alloc
     1.22%  [kernel]                  [k] skb_release_data
     1.17%  [kernel]                  [k] tcp_ack
     1.13%  [kernel]                  [k] tcp_recvmsg
     1.09%  [kernel]                  [k] copy_user_generic_string
     0.99%  [kernel]                  [k] __fget_light

I also compared the number of calls to the different functions inside
the process under attack and under h2load. They're pretty much identical:

Attack (24 clients, stopped at ~2.2M req):

  $ socat - /tmp/sock1 <<< "show profiling"
  Per-task CPU profiling              : on            # set profiling tasks {on|auto|off}
  Memory usage profiling              : off           # set profiling memory {on|off}
  Tasks activity:
    function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
    process_stream              2258751   14.59s    6.457us   46.20m    1.227ms <- sc_notify@src/stconn.c:1141 task_wakeup
    sc_conn_io_cb               2257588   866.4ms   383.0ns   7.043m    187.2us <- sc_app_chk_rcv_conn@src/stconn.c:780 tasklet_wakeup
    process_stream              2253237   31.33s    13.90us   47.91m    1.276ms <- stream_new@src/stream.c:578 task_wakeup
    h1_io_cb                    2250867   6.203s    2.755us   43.14s    19.16us <- sock_conn_iocb@src/sock.c:875 tasklet_wakeup
    sc_conn_io_cb               1436248   1.087s    756.0ns   49.17s    34.23us <- h1_wake_stream_for_recv@src/mux_h1.c:2578 tasklet_wakeup
    h1_io_cb                     247901   69.24ms   279.0ns   31.15s    125.7us <- h1_takeover@src/mux_h1.c:4147 tasklet_wakeup
    h2_io_cb                     200187   4.025s    20.11us   27.02s    135.0us <- h2c_restart_reading@src/mux_h2.c:737 tasklet_wakeup
    sc_conn_io_cb                161965   124.1ms   766.0ns   20.77s    128.2us <- h2s_notify_recv@src/mux_h2.c:1240 tasklet_wakeup
    h2_io_cb                      70176   1.483s    21.14us   9.021s    128.5us <- h2_snd_buf@src/mux_h2.c:6675 tasklet_wakeup
    h1_io_cb                       6126   8.775ms   1.432us   16.69s    2.725ms <- sock_conn_iocb@src/sock.c:854 tasklet_wakeup
    sc_conn_io_cb                  6115   46.96ms   7.679us   22.63s    3.700ms <- h1_wake_stream_for_send@src/mux_h1.c:2588 tasklet_wakeup
    h1_timeout_task                4316   832.9us   192.0ns   4.955s    1.148ms <- h1_release@src/mux_h1.c:1045 task_wakeup
    accept_queue_process             12   133.6us   11.13us   1.449ms   120.8us <- listener_accept@src/listener.c:1469 tasklet_wakeup
    h1_io_cb                         11   3.006ms   273.3us   4.930us   448.0ns <- conn_subscribe@src/connection.c:736 tasklet_wakeup
    h2_io_cb                          6   405.4us   67.56us   4.555ms   759.2us <- sock_conn_iocb@src/sock.c:875 tasklet_wakeup
    task_run_applet                   1      -         -      233.9us   233.9us <- sc_applet_create@src/stconn.c:502 appctx_wakeup
    sc_conn_io_cb                     1   11.86us   11.86us   1.373us   1.373us <- sock_conn_iocb@src/sock.c:875 tasklet_wakeup
    task_run_applet                   1   411.0ns   411.0ns   5.311us   5.311us <- sc_app_shut_applet@src/stconn.c:911 appctx_wakeup

h2load with 24 clients and approximately the same number of requests:

  $ socat - /tmp/sock1 <<< "show profiling"
  Per-task CPU profiling              : on            # set profiling tasks {on|auto|off}
  Memory usage profiling              : off           # set profiling memory {on|off}
  Tasks activity:
    function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
    process_stream              2261040   13.42s    5.933us   34.49m    915.2us <- sc_notify@src/stconn.c:1141 task_wakeup
    sc_conn_io_cb               2259306   657.0ms   290.0ns   5.251m    139.4us <- sc_app_chk_rcv_conn@src/stconn.c:780 tasklet_wakeup
    h1_io_cb                    2258770   4.721s    2.090us   30.58s    13.54us <- sock_conn_iocb@src/sock.c:875 tasklet_wakeup
    process_stream              2258252   29.02s    12.85us   34.76m    923.5us <- stream_new@src/stream.c:578 task_wakeup
    sc_conn_io_cb               1690412   1.102s    652.0ns   59.93s    35.45us <- h1_wake_stream_for_recv@src/mux_h1.c:2578 tasklet_wakeup
    h2_io_cb                     183444   2.820s    15.37us   28.68s    156.3us <- h2_snd_buf@src/mux_h2.c:6675 tasklet_wakeup
    h2_io_cb                     149856   203.4ms   1.357us   496.5ms   3.313us <- h2c_restart_reading@src/mux_h2.c:737 tasklet_wakeup
    h2_io_cb                      72164   2.100s    29.10us   7.894s    109.4us <- sock_conn_iocb@src/sock.c:875 tasklet_wakeup
    h1_io_cb                       4805   1.579ms   328.0ns   573.0ms   119.3us <- h1_takeover@src/mux_h1.c:4147 tasklet_wakeup
    sc_conn_io_cb                  2288   15.37ms   6.717us   24.89s    10.88ms <- h1_wake_stream_for_send@src/mux_h1.c:2588 tasklet_wakeup
    h1_io_cb                       2288   2.436ms   1.064us   20.22s    8.838ms <- sock_conn_iocb@src/sock.c:854 tasklet_wakeup
    h1_timeout_task                 222   95.21us   428.0ns   750.5ms   3.381ms <- h1_release@src/mux_h1.c:1045 task_wakeup
    h2_timeout_task                  24   7.914us   329.0ns   314.9us   13.12us <- h2_release@src/mux_h2.c:1146 task_wakeup
    accept_queue_process              9   75.00us   8.333us   416.8us   46.31us <- listener_accept@src/listener.c:1469 tasklet_wakeup
    h1_io_cb                          2   124.6us   62.28us   1.785us   892.0ns <- conn_subscribe@src/connection.c:736 tasklet_wakeup
    task_run_applet                   1   431.0ns   431.0ns   5.631us   5.631us <- sc_app_shut_applet@src/stconn.c:911 appctx_wakeup
    task_run_applet                   1      -         -      1.563us   1.563us <- sc_applet_create@src/stconn.c:502 appctx_wakeup
    sc_conn_io_cb                     1   13.80us   13.80us   1.864us   1.864us <- sock_conn_iocb@src/sock.c:875 tasklet_wakeup

I can't tell them apart; most of the activity is at the application
layer (process_stream).

As Tristan mentioned, lowering tune.h2.be.max-concurrent-streams may
also slow them down, but it will also slow down some sites with many
objects (think of shops with many images). For a long time the high
parallelism of H2 was sold as a huge differentiator; I don't feel like
starting to advertise lowering it now.
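
For those who still want to experiment with that knob, it is a single
line in the global section; the value below is only an arbitrary example,
not a recommendation:

      global
          # arbitrary example value; lowering it also reduces the
          # parallelism available to legitimate multi-object pages
          tune.h2.be.max-concurrent-streams 20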

Also, please note that when it comes to anything past the reverse
proxy, there's no difference between the attack over a single connection
and sending 10 times less traffic over each of 10 connections (i.e.
h2load): the total number of streams remains the same. So in any case,
any remediation based on lowering the number of streams per connection
just calls for the client to increase the number of connections.

Now of course, if some find corner cases that affect them, I'm all ears
(and we can even discuss them privately if really needed). But I think
this issue essentially depends on each component's architecture: some
will eat more CPU, others more RAM, etc.

There are lots of other interesting attacks on the H2 protocol that
can be triggered just with a regular client: low timeouts, low stream
windows (use h2load -w 1 to have fun), zero windows during transfers,
and even one-byte CONTINUATION frames that may force some components
to perform reallocations and copies. But most of them depend on the
implementation and on the attacker, and were discussed at great length
during the protocol design 10 years ago so that the cost remains
balanced between the attacker and the target. In the end, H2 is not
very robust, but each implementation has certain possibilities to cover
some of its limitations, and these differ due to many architectural
constraints.
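
As an illustration of the low-window case mentioned above, a plain h2load
run is enough to reproduce it (the URL and counts below are arbitrary):

      $ h2load -n 100000 -c 24 -m 100 -w 1 https://127.0.0.1:8443/

With -w 1 each stream starts with a 1-byte window, so the sender can emit
at most one byte of response data before waiting for a WINDOW_UPDATE,
multiplying the number of tiny frames the target has to produce and
process.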

The good point in all of this is that it will probably make more people
want to reconsider H3/QUIC if they don't trust their products anymore :-)

Willy
