https://bz.apache.org/bugzilla/show_bug.cgi?id=66646

            Bug ID: 66646
           Summary: apache http2 h2eos bucket corruption in
                    httpd-2.4.55/56/57 (with workaround)
           Product: Apache httpd-2
           Version: 2.4.57
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: mod_http2
          Assignee: [email protected]
          Reporter: [email protected]
  Target Milestone: ---

Created attachment 38585
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=38585&action=edit
check for h2eos bucket corruption

Hello,

Starting with the big mod_http2 update in httpd-2.4.55 we see random
crashes in core_filters.c (send_brigade_nonblocking/writev_nonblocking)
because of corrupted buckets.

- the crashes do not occur when the current httpd code (httpd-2.4.57)
  is used with mod_http2 from httpd-2.4.54.

- our servers are configured to run with 64 worker threads per process,
  and the machines run Red Hat Linux. Each server handles about
  20 million requests per day, and we see between 1 and 5 crashes daily
  per server.

- the crashes always occur in the context of an HTTP/2 request (mostly
  simple GET requests). See the example below.

- traces show that the crash always happens in cases where an h2eos
  bucket is destroyed and the following bucket in the brigade belongs
  to the same stream.

  see patch: httpd-trunk-mod_http2-check-for-h2eos-bucket-corruption

- the next/prev pointers of this following bucket sometimes point to
  illegal memory and indicate a use-after-free bug where the memory
  gets reused after the bucket destroy handler of the h2eos bucket
  was called.

- the patch checks for such cases in core_filters.c:send_brigade_nonblocking
  and shows that the problem occurs far more often than the number of
  crashes we see would suggest (see the sketch after this list).

- after a fresh Apache restart it always takes some time (20-30 minutes)
  on our busy servers until the bug is triggered, so it might have to do
  with a special behavior of the Apache handlers in situations where,
  for example, 'MaxConnectionsPerChild' is reached.
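
To illustrate the idea, here is a minimal sketch of the kind of
ring-link check the patch performs before deleting a bucket (the helper
name, its logging and the surrounding code are illustrative here, not
the actual patch code):

  #include "httpd.h"
  #include "http_log.h"
  #include "apr_buckets.h"

  /* In an intact APR bucket ring each neighbour of a bucket points
   * back at it; a violation means the links were clobbered, e.g. by
   * a use-after-free. */
  static void checked_bucket_delete(apr_bucket_brigade *bb,
                                    apr_bucket *b, conn_rec *c)
  {
      apr_bucket *next = APR_BUCKET_NEXT(b);
      apr_bucket *prev = APR_BUCKET_PREV(b);

      if (APR_BUCKET_PREV(next) != b || APR_BUCKET_NEXT(prev) != b) {
          ap_log_cerror(APLOG_MARK, APLOG_ERR, 0, c,
                        "BAD BUCKET type %s bucket %pp "
                        "(next %pp, prev %pp, bb-sentinel %pp)",
                        b->type->name, (void *)b, (void *)next,
                        (void *)prev, (void *)APR_BRIGADE_SENTINEL(bb));
      }
      apr_bucket_delete(b);
  }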

We are currently using a workaround that avoids these crashes; it was
originally built to find out more about the problem:

- in h2_c1_io.c:pass_output the buckets are reordered so that a flush
  bucket and a scratch heap bucket are inserted before the h2eos bucket
  when mod_http2 passes the output brigade to the filters of other
  Apache modules (see the sketch after this list).

- since the h2eos<->flush swap in h2_c1_io.c does not seem to be
  sufficient to catch all cases where a flush bucket needs to be
  inserted, the remaining cases are swapped in core_filters.c.

  see patch: httpd-trunk-mod_http2-workaround-for-h2eos-bucket-corruption
  (should be applied on top of the other patch).

- with this patch applied the crashes are gone and the check no longer
  triggers.
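
As a sketch of the reordering idea (assuming mod_http2's
H2_BUCKET_IS_H2EOS() test from h2_bucket_eos.h; the helper itself is
made up, the real change lives in h2_c1_io.c:pass_output):

  #include "apr_buckets.h"
  #include "h2_bucket_eos.h"

  /* Insert a FLUSH bucket in front of every h2eos bucket, so that the
   * core output filter writes out all preceding stream data before the
   * h2eos bucket (and with it the stream memory) can be destroyed. */
  static void flush_before_h2eos(apr_bucket_brigade *bb)
  {
      apr_bucket *b;

      for (b = APR_BRIGADE_FIRST(bb);
           b != APR_BRIGADE_SENTINEL(bb);
           b = APR_BUCKET_NEXT(b)) {
          if (H2_BUCKET_IS_H2EOS(b)) {
              apr_bucket *flush =
                  apr_bucket_flush_create(bb->bucket_alloc);
              APR_BUCKET_INSERT_BEFORE(b, flush);
          }
      }
  }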

I think a real fix must enforce the bucket order in mod_http2 more
generally, so that h2eos is always the last bucket of an HTTP/2 stream
(sketched below). Another possibility might be to allocate the
flush/scratch-heap buckets from a pool with a longer lifetime than the
pool used for h2eos.
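
A much-simplified sketch of that first idea for a single stream's
buckets (helper name made up; a real fix would have to respect
interleaved streams on the connection):

  #include "apr_buckets.h"
  #include "h2_bucket_eos.h"

  /* Enforce "h2eos comes last": if anything still follows an h2eos
   * bucket in the brigade, unlink the h2eos bucket and re-append it
   * at the tail. */
  static void move_h2eos_last(apr_bucket_brigade *bb)
  {
      apr_bucket *b = APR_BRIGADE_FIRST(bb);

      while (b != APR_BRIGADE_SENTINEL(bb)) {
          apr_bucket *next = APR_BUCKET_NEXT(b);
          if (H2_BUCKET_IS_H2EOS(b)
              && next != APR_BRIGADE_SENTINEL(bb)) {
              APR_BUCKET_REMOVE(b);
              APR_BRIGADE_INSERT_TAIL(bb, b);
          }
          b = next;
      }
  }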

Example output when _checked_bucket_delete triggers:
---------------------------------------------------
[Tue Jun 06 12:49:55 2023] [error] [client ...] send_brigade_nonblocking_len0:
BAD BUCKET type H2EOS bucket 0x7f681c0009c8 before:(next == 0x7f681c000d88,
prev == 0x7f690c09df98) (next.next == 0x7f690c09df98, next.prev ==
0x7f681c0009c8) (first == 0x7f681c0009c8, last == 0x7f681c000d88) bb-sentinel
0x7f690c09df98 after:next type FLUSH 0x7f681c000d88 (next == 0x7f681c000ba8,
prev == 0x7f690c09df98) bb (first == 0x7f690c09df98, last == 0x7f690c09df98)

- before:
  head:  0x7f690c09df98 [first: h2eos, last: flush]
  h2eos: 0x7f681c0009c8 [next: flush, prev: head]
  flush: 0x7f681c000d88 [next: head, prev: h2eos]

- after bucket_delete(h2eos):
  head:  0x7f690c09df98 [first: head, last: head]
  flush: 0x7f681c000d88 [next: next, prev: head]
  next:  0x7f681c000ba8 ???

Example crash core:
------------------
Note: Line numbers here do not match the repository because of included
      debug code. Such crashes occur after a check like the one above
      triggered in a preceding send_brigade_nonblocking loop iteration.

(gdb) bt
  #0  0x00007f1494013c68 in ?? ()
  #1  0x00000000004c9db7 in writev_nonblocking (s=s@entry=0x7f15d80b8340,
      bb=bb@entry=0x7f15d80b8af0, bytes_to_write=bytes_to_write@entry=361,
      nvec=nvec@entry=4, c=c@entry=0x7f15d80b8558, ctx=<optimized out>,
      ctx=<optimized out>) at core_filters.c:850
      --> if (n) { apr_bucket_split() ... }
  #2  0x00000000004caf75 in send_brigade_nonblocking (c=0x7f15d80b8558,
      ctx=0x7f1494071678, bb=0x7f15d80b8af0, s=0x7f15d80b8340)
      at core_filters.c:819
      --> if (nvec > 0) { rv = writev_nonblocking .. }
  #3  ap_core_output_filter (f=0x7f1494071348, new_bb=0x7f15d80b8af0)
      at core_filters.c:538
  #4  0x00000000005b58bc in ssl_io_filter_output (f=0x7f15d80b8aa8,
      bb=0x7f14940717e0) at ssl_engine_io.c:2094
  #5  0x00000000005b234e in ssl_io_filter_coalesce (f=0x7f15d80b8a80,
      bb=0x7f14940717e0) at ssl_engine_io.c:2037
  #6  0x00000000005ef937 in pass_output (flush=1, io=0x7f1494049af0)
      at h2_c1_io.c:281
  #7  h2_c1_io_assure_flushed (io=io@entry=0x7f1494049af0) at h2_c1_io.c:340
  #8  0x00000000005dc9f0 in h2_session_shutdown_notice (session=0x7f1494049ab0)
      at h2_session.c:735
  #9  h2_session_ev_mpm_stopping (arg=0, msg=0x0, session=0x7f1494049ab0)
      at h2_session.c:1524
  #10 h2_session_dispatch_event (session=0x7f1494049ab0, ev=<optimized out>,
      arg=0, msg=0x0) at h2_session.c:1726
  #11 0x00000000005de673 in h2_session_process (session=0x7f1494049ab0,
      async=0) at h2_session.c:1798
  #12 0x00000000005cb0a9 in h2_c1_run (c=c@entry=0x7f15d80b8558) at h2_c1.c:130
  #13 0x00000000005cb4bd in h2_c1_hook_process_connection (c=0x7f15d80b8558)
      at h2_c1.c:278
  #14 0x00000000004dbd10 in ap_run_process_connection (c=c@entry=0x7f15d80b8558)
      at connection.c:42
  #15 0x00000000004dc23e in ap_process_connection (c=c@entry=0x7f15d80b8558,
      csd=csd@entry=0x7f15d80b8340) at connection.c:217
  #16 0x00000000008212db in process_socket (bucket_alloc=0x7f1494013108,
      my_thread_num=8, my_child_num=102, sock=0x7f15d80b8340, p=0x7f15d80b82b8,
      thd=0x2e74528) at worker.c:485
  #17 worker_thread (thd=0x2e74528, dummy=<optimized out>) at worker.c:814
  #18 0x00007f15ff802ea5 in start_thread () from /lib64/libpthread.so.0
  #19 0x00007f15ff327b0d in clone () from /lib64/libc.so.6
(gdb) frame 1
(gdb) info locals
  apr__d = 0x7f15d80b8af8
  apr__b = 0x7f15d80b8af8
  bucket = 0x7f15d80b8af8
  n = 46
  rv = 0
  vec = <optimized out>
  bytes_written = 361
  i = 3
  offset = 3
(gdb) p *bb
  $1 = {p = 0x7f15d80b82b8,
        list = {next = 0x7f15d80b8af8, prev = 0x7f15d80b8af8},
        bucket_alloc = 0x7f1494013108}
(gdb) p *bucket
  $2 = {link = {next = 0x7f15d80b8af8, prev = 0x7f15d80b8af8},
        type = 0x7f1494013108,
        length = 139731795674416, start = 139731795675888,
        data = 0x92e640 <brigade_cleanup>,
        free = 0x987aa0 <apr_pool_cleanup_null>,
        list = 0x7f15d80b8b28}
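
A note on reading the dump: bucket (0x7f15d80b8af8) equals bb + 8, i.e.
it is the ring sentinel of the (by then empty) brigade, whose next/prev
point to itself, and the dumped 'type' field (0x7f1494013108) is really
bb->bucket_alloc (compare bucket_alloc in frame #16). So the write loop
apparently walked past the sentinel after the ring links had been
clobbered; calling apr_bucket_split() through that garbage type pointer
would explain the jump to a data address in frame #0.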

Greetings,
Barnim
