This is an automated email from the ASF dual-hosted git repository.

etseidl pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git


The following commit(s) were added to refs/heads/main by this push:
     new 1d89737b4d fix(parquet): Prevent negative list sizes in Thrift compact 
protocol parser (#9868)
1d89737b4d is described below

commit 1d89737b4dc13a603f7e88a7ca20d571120b9b5f
Author: masumi-ryugo <[email protected]>
AuthorDate: Wed May 6 03:33:18 2026 +0900

    fix(parquet): Prevent negative list sizes in Thrift compact protocol parser 
(#9868)
    
    # Which issue does this PR close?
    
    Closes (in part) #9874. That issue tracks both this immediate fix and
    the broader thrift-parser hardening @etseidl mentioned in review.
    
    Originally found via `cargo-fuzz` libFuzzer harness over
    `ParquetMetaDataReader::parse_and_finish` and
    `ParquetRecordBatchReaderBuilder::try_new` — ~95 unique crashing
    inputs all converged on this single root cause.
    
    # Rationale for this change
    
    `parquet::parquet_thrift::read_thrift_vec` reads a Thrift
    compact-protocol
    list header and then calls `Vec::with_capacity(list_ident.size as
    usize)`
    where `list_ident.size` is a 32-bit varint pulled directly from
    attacker-controlled bytes. On a malformed input the value can be close
    to
    `i32::MAX`, which after the per-element-size multiplication is a
    multi-GB-to-multi-hundred-GB allocation request. That panics in
    `alloc::raw_vec::handle_error` with `capacity overflow` (or OOM-kills
    the
    process on smaller-but-still-huge values) before any element is decoded.
    
    Every public sync API that parses parquet bytes funnels through this
    function:
    
    - `ParquetMetaDataReader::parse_and_finish`
    - `ParquetMetaDataReader::decode_metadata`
    - `SerializedFileReader::new`
    - `ParquetRecordBatchReaderBuilder::try_new`
    - the `async_reader` family (re-exports `decode_metadata` after
      prefetching footer bytes)
    
    So any downstream code that hands attacker-controlled bytes to a parquet
    reader gets a panic-on-decode DoS. Per `SECURITY.md` this is a bug, not
    a
    vulnerability — there is no information disclosure or RCE path, only
    availability — but it is reachable from every metadata entry point and
    worth closing.
    
    # What changes are included in this PR?
    
    - Add a default `remaining_bytes() -> Option<usize>` method to
      `ThriftCompactInputProtocol`, returning the number of bytes still
      available when the protocol is backed by an in-memory buffer.
    - Override it on `ThriftSliceInputProtocol` to return
    `Some(self.buf.len())`.
    - In `read_thrift_vec`, clamp the capacity passed to
    `Vec::with_capacity`
      by `remaining_bytes()` (every Thrift element costs at least one wire
      byte, so a list whose declared `size` exceeds the remaining input is
      malformed by definition — the same clamp the C++ Thrift runtime
      applies). Streaming protocols that cannot report a remaining count
      fall back to a 1 MiB cap and grow on push.
    - Reject negative sizes explicitly. Although the wire format is
    unsigned,
      a hostile encoder can construct a varint that decodes to a negative
      `i32`.
    - Three regression tests in the existing `parquet_thrift::tests` module:
      - `test_read_thrift_vec_huge_size_does_not_panic` exercises the direct
        `read_thrift_vec` path with a varint that decodes to ~`i32::MAX`.
      - `test_decode_metadata_huge_thrift_list_does_not_panic` runs the
    10-byte fuzzer repro through `ParquetMetaDataReader::decode_metadata`
        and asserts a clean `Err`.
      - `test_read_thrift_vec_negative_size_returns_err` asserts the
        explicit negative-size guard.
    
    # Are these changes tested?
    
    Yes — three new unit tests as above. `cargo test -p parquet --release`,
    `cargo clippy -p parquet --all-targets -- -D warnings`, and
    `cargo fmt --check` are all clean.
    
    (Note: a single pre-existing `should_panic` test
    `test_read_non_utf8_binary_as_utf8` aborts during the rust test
    harness's panic-formatting machinery on `origin/main` HEAD itself, so
    it is excluded with `--skip` for the local run. Unrelated to this PR.)
    
    # Are there any user-facing changes?
    
    Yes — malformed Parquet inputs that previously triggered a `capacity
    overflow` panic during metadata decoding now return a `ParquetError`
    that callers can handle. No behavior change for well-formed files.
    
    # Reproducer
    
    10 bytes:
    
    ```
    0x28 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0x51
    ```
    
    ```rust
    let bytes: &[u8] = &[0x28, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 
0x51];
    let _ = 
parquet::file::metadata::ParquetMetaDataReader::decode_metadata(bytes);
    ```
    
    Before this PR (RUST_BACKTRACE=1):
    
    ```
    panicked at library/alloc/src/raw_vec/mod.rs:28:5: capacity overflow
       3: alloc::raw_vec::handle_error
       4: alloc::raw_vec::with_capacity_in
       7: alloc::vec::Vec::with_capacity::<SchemaElement>
       8: parquet::parquet_thrift::read_thrift_vec       (parquet_thrift.rs:690)
       9: parquet::file::metadata::thrift::parquet_metadata_from_bytes
                                                         (thrift/mod.rs:790)
      10: parquet::file::metadata::parser::decode_metadata (parser.rs:233)
      13: ParquetMetaDataReader::decode_footer_metadata   (reader.rs:783)
      14: ParquetMetaDataReader::parse_metadata           (reader.rs:585)
      17: ParquetMetaDataReader::parse_and_finish         (reader.rs:230)
    ```
    
    After this PR: `Err(ParquetError::EOF("Unexpected EOF"))` (or similar
    shape depending on which struct field fails first), no panic.
    
    A 45-byte reproducer that drives the full reader stack (with valid
    `PAR1` magic, so it works for
    `ParquetRecordBatchReaderBuilder::try_new`)
    behaves the same way.
    
    # Found via
    
    `cargo-fuzz` libFuzzer harness wrapping
    `ParquetMetaDataReader::parse_and_finish` and
    `ParquetRecordBatchReaderBuilder::try_new` over `bytes::Bytes`. ~95
    unique
    crashing inputs (different VLQ sizes, different element types —
    `SchemaElement`, `RowGroup`, `ColumnChunk` — different offsets in the
    metadata graph) all converged on this single root cause within ~3
    minutes
    of single-thread fuzzing. One bug, many surface symptoms.
    
    ---------
    
    Co-authored-by: masumi ryugo 
<[email protected]>
    Co-authored-by: Ed Seidl <[email protected]>
---
 parquet/src/parquet_thrift.rs | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/parquet/src/parquet_thrift.rs b/parquet/src/parquet_thrift.rs
index 6c82a0bf2c..a13baa0924 100644
--- a/parquet/src/parquet_thrift.rs
+++ b/parquet/src/parquet_thrift.rs
@@ -36,6 +36,7 @@ use crate::{
     write_thrift_field,
 };
 use std::io::Error;
+use std::num::TryFromIntError;
 use std::str::Utf8Error;
 
 #[derive(Debug)]
@@ -46,6 +47,7 @@ pub(crate) enum ThriftProtocolError {
     InvalidElementType(u8),
     FieldDeltaOverflow { field_delta: u8, last_field_id: i16 },
     InvalidBoolean(u8),
+    IntegerOverflow,
     Utf8Error,
     SkipDepth(FieldType),
     SkipUnsupportedType(FieldType),
@@ -70,6 +72,9 @@ impl From<ThriftProtocolError> for ParquetError {
             ThriftProtocolError::InvalidBoolean(value) => {
                 general_err!("cannot convert {} into bool", value)
             }
+            ThriftProtocolError::IntegerOverflow => {
+                general_err!("integer overflow decoding thrift value")
+            }
             ThriftProtocolError::Utf8Error => general_err!("invalid utf8"),
             ThriftProtocolError::SkipDepth(field_type) => {
                 general_err!("cannot parse past {:?}", field_type)
@@ -94,6 +99,13 @@ impl From<Error> for ThriftProtocolError {
     }
 }
 
+impl From<TryFromIntError> for ThriftProtocolError {
+    fn from(_: TryFromIntError) -> Self {
+        // ignore error payload to reduce the size of ThriftProtocolError
+        Self::IntegerOverflow
+    }
+}
+
 pub type ThriftProtocolResult<T> = Result<T, ThriftProtocolError>;
 
 /// Wrapper for thrift `double` fields. This is used to provide
@@ -317,7 +329,13 @@ pub(crate) trait ThriftCompactInputProtocol<'a> {
             // high bits set high if count and type encoded separately
             possible_element_count as i32
         } else {
-            self.read_vlq()? as _
+            // The list size on the wire is an unsigned varint, but we 
represent
+            // it as `i32` (matching Java's `int` and the Thrift schema).
+            // A varint that decodes above `i32::MAX` is malformed input — 
reject
+            // it here at the protocol layer rather than letting the cast wrap
+            // into a negative size that downstream allocation code has to
+            // re-validate.
+            i32::try_from(self.read_vlq()?)?
         };
 
         Ok(ListIdentifier {
@@ -1106,4 +1124,18 @@ pub(crate) mod tests {
         assert_eq!(header.size, 0);
         assert_eq!(header.element_type, ElementType::Byte);
     }
+
+    /// A Thrift list header whose `size` varint decodes above `i32::MAX`
+    /// must be rejected at the protocol layer rather than wrapping into a
+    /// negative `i32` and being smuggled into downstream allocation code.
+    #[test]
+    fn test_read_list_begin_size_above_i32_max_returns_err() {
+        // List header: element_type=8 (Binary), 0xF=follow-up varint.
+        // Varint 80 80 80 80 08 decodes to 0x8000_0000 = i32::MAX + 1.
+        let mut data: Vec<u8> = vec![0xF8];
+        data.extend_from_slice(&[0x80, 0x80, 0x80, 0x80, 0x08]);
+        let mut prot = ThriftSliceInputProtocol::new(&data);
+        let result = prot.read_list_begin();
+        assert!(result.is_err(), "expected error, got {result:?}");
+    }
 }

Reply via email to