This is an automated email from the ASF dual-hosted git repository.
etseidl pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git
The following commit(s) were added to refs/heads/main by this push:
new 1d89737b4d fix(parquet): Prevent negative list sizes in Thrift compact
protocol parser (#9868)
1d89737b4d is described below
commit 1d89737b4dc13a603f7e88a7ca20d571120b9b5f
Author: masumi-ryugo <[email protected]>
AuthorDate: Wed May 6 03:33:18 2026 +0900
fix(parquet): Prevent negative list sizes in Thrift compact protocol parser
(#9868)
# Which issue does this PR close?
Closes (in part) #9874. That issue tracks both this immediate fix and
the broader thrift-parser hardening @etseidl mentioned in review.
Originally found via `cargo-fuzz` libFuzzer harness over
`ParquetMetaDataReader::parse_and_finish` and
`ParquetRecordBatchReaderBuilder::try_new` — ~95 unique crashing
inputs all converged on this single root cause.
# Rationale for this change
`parquet::parquet_thrift::read_thrift_vec` reads a Thrift
compact-protocol
list header and then calls `Vec::with_capacity(list_ident.size as
usize)`
where `list_ident.size` is a 32-bit varint pulled directly from
attacker-controlled bytes. On a malformed input the value can be close
to
`i32::MAX`, which after the per-element-size multiplication is a
multi-GB-to-multi-hundred-GB allocation request. That panics in
`alloc::raw_vec::handle_error` with `capacity overflow` (or OOM-kills
the
process on smaller-but-still-huge values) before any element is decoded.
Every public sync API that parses parquet bytes funnels through this
function:
- `ParquetMetaDataReader::parse_and_finish`
- `ParquetMetaDataReader::decode_metadata`
- `SerializedFileReader::new`
- `ParquetRecordBatchReaderBuilder::try_new`
- the `async_reader` family (re-exports `decode_metadata` after
prefetching footer bytes)
So any downstream code that hands attacker-controlled bytes to a parquet
reader gets a panic-on-decode DoS. Per `SECURITY.md` this is a bug, not
a
vulnerability — there is no information disclosure or RCE path, only
availability — but it is reachable from every metadata entry point and
worth closing.
# What changes are included in this PR?
- Add a default `remaining_bytes() -> Option<usize>` method to
`ThriftCompactInputProtocol`, returning the number of bytes still
available when the protocol is backed by an in-memory buffer.
- Override it on `ThriftSliceInputProtocol` to return
`Some(self.buf.len())`.
- In `read_thrift_vec`, clamp the capacity passed to
`Vec::with_capacity`
by `remaining_bytes()` (every Thrift element costs at least one wire
byte, so a list whose declared `size` exceeds the remaining input is
malformed by definition — the same clamp the C++ Thrift runtime
applies). Streaming protocols that cannot report a remaining count
fall back to a 1 MiB cap and grow on push.
- Reject negative sizes explicitly. Although the wire format is
unsigned,
a hostile encoder can construct a varint that decodes to a negative
`i32`.
- Three regression tests in the existing `parquet_thrift::tests` module:
- `test_read_thrift_vec_huge_size_does_not_panic` exercises the direct
`read_thrift_vec` path with a varint that decodes to ~`i32::MAX`.
- `test_decode_metadata_huge_thrift_list_does_not_panic` runs the
10-byte fuzzer repro through `ParquetMetaDataReader::decode_metadata`
and asserts a clean `Err`.
- `test_read_thrift_vec_negative_size_returns_err` asserts the
explicit negative-size guard.
# Are these changes tested?
Yes — three new unit tests as above. `cargo test -p parquet --release`,
`cargo clippy -p parquet --all-targets -- -D warnings`, and
`cargo fmt --check` are all clean.
(Note: a single pre-existing `should_panic` test
`test_read_non_utf8_binary_as_utf8` aborts during the rust test
harness's panic-formatting machinery on `origin/main` HEAD itself, so
it is excluded with `--skip` for the local run. Unrelated to this PR.)
# Are there any user-facing changes?
Yes — malformed Parquet inputs that previously triggered a `capacity
overflow` panic during metadata decoding now return a `ParquetError`
that callers can handle. No behavior change for well-formed files.
# Reproducer
10 bytes:
```
0x28 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0x51
```
```rust
let bytes: &[u8] = &[0x28, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc,
0x51];
let _ =
parquet::file::metadata::ParquetMetaDataReader::decode_metadata(bytes);
```
Before this PR (RUST_BACKTRACE=1):
```
panicked at library/alloc/src/raw_vec/mod.rs:28:5: capacity overflow
3: alloc::raw_vec::handle_error
4: alloc::raw_vec::with_capacity_in
7: alloc::vec::Vec::with_capacity::<SchemaElement>
8: parquet::parquet_thrift::read_thrift_vec (parquet_thrift.rs:690)
9: parquet::file::metadata::thrift::parquet_metadata_from_bytes
(thrift/mod.rs:790)
10: parquet::file::metadata::parser::decode_metadata (parser.rs:233)
13: ParquetMetaDataReader::decode_footer_metadata (reader.rs:783)
14: ParquetMetaDataReader::parse_metadata (reader.rs:585)
17: ParquetMetaDataReader::parse_and_finish (reader.rs:230)
```
After this PR: `Err(ParquetError::EOF("Unexpected EOF"))` (or similar
shape depending on which struct field fails first), no panic.
A 45-byte reproducer that drives the full reader stack (with valid
`PAR1` magic, so it works for
`ParquetRecordBatchReaderBuilder::try_new`)
behaves the same way.
# Found via
`cargo-fuzz` libFuzzer harness wrapping
`ParquetMetaDataReader::parse_and_finish` and
`ParquetRecordBatchReaderBuilder::try_new` over `bytes::Bytes`. ~95
unique
crashing inputs (different VLQ sizes, different element types —
`SchemaElement`, `RowGroup`, `ColumnChunk` — different offsets in the
metadata graph) all converged on this single root cause within ~3
minutes
of single-thread fuzzing. One bug, many surface symptoms.
---------
Co-authored-by: masumi ryugo
<[email protected]>
Co-authored-by: Ed Seidl <[email protected]>
---
parquet/src/parquet_thrift.rs | 34 +++++++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)
diff --git a/parquet/src/parquet_thrift.rs b/parquet/src/parquet_thrift.rs
index 6c82a0bf2c..a13baa0924 100644
--- a/parquet/src/parquet_thrift.rs
+++ b/parquet/src/parquet_thrift.rs
@@ -36,6 +36,7 @@ use crate::{
write_thrift_field,
};
use std::io::Error;
+use std::num::TryFromIntError;
use std::str::Utf8Error;
#[derive(Debug)]
@@ -46,6 +47,7 @@ pub(crate) enum ThriftProtocolError {
InvalidElementType(u8),
FieldDeltaOverflow { field_delta: u8, last_field_id: i16 },
InvalidBoolean(u8),
+ IntegerOverflow,
Utf8Error,
SkipDepth(FieldType),
SkipUnsupportedType(FieldType),
@@ -70,6 +72,9 @@ impl From<ThriftProtocolError> for ParquetError {
ThriftProtocolError::InvalidBoolean(value) => {
general_err!("cannot convert {} into bool", value)
}
+ ThriftProtocolError::IntegerOverflow => {
+ general_err!("integer overflow decoding thrift value")
+ }
ThriftProtocolError::Utf8Error => general_err!("invalid utf8"),
ThriftProtocolError::SkipDepth(field_type) => {
general_err!("cannot parse past {:?}", field_type)
@@ -94,6 +99,13 @@ impl From<Error> for ThriftProtocolError {
}
}
+impl From<TryFromIntError> for ThriftProtocolError {
+ fn from(_: TryFromIntError) -> Self {
+ // ignore error payload to reduce the size of ThriftProtocolError
+ Self::IntegerOverflow
+ }
+}
+
pub type ThriftProtocolResult<T> = Result<T, ThriftProtocolError>;
/// Wrapper for thrift `double` fields. This is used to provide
@@ -317,7 +329,13 @@ pub(crate) trait ThriftCompactInputProtocol<'a> {
// high bits set high if count and type encoded separately
possible_element_count as i32
} else {
- self.read_vlq()? as _
+ // The list size on the wire is an unsigned varint, but we
represent
+ // it as `i32` (matching Java's `int` and the Thrift schema).
+ // A varint that decodes above `i32::MAX` is malformed input —
reject
+ // it here at the protocol layer rather than letting the cast wrap
+ // into a negative size that downstream allocation code has to
+ // re-validate.
+ i32::try_from(self.read_vlq()?)?
};
Ok(ListIdentifier {
@@ -1106,4 +1124,18 @@ pub(crate) mod tests {
assert_eq!(header.size, 0);
assert_eq!(header.element_type, ElementType::Byte);
}
+
+ /// A Thrift list header whose `size` varint decodes above `i32::MAX`
+ /// must be rejected at the protocol layer rather than wrapping into a
+ /// negative `i32` and being smuggled into downstream allocation code.
+ #[test]
+ fn test_read_list_begin_size_above_i32_max_returns_err() {
+ // List header: element_type=8 (Binary), 0xF=follow-up varint.
+ // Varint 80 80 80 80 08 decodes to 0x8000_0000 = i32::MAX + 1.
+ let mut data: Vec<u8> = vec![0xF8];
+ data.extend_from_slice(&[0x80, 0x80, 0x80, 0x80, 0x08]);
+ let mut prot = ThriftSliceInputProtocol::new(&data);
+ let result = prot.read_list_begin();
+ assert!(result.is_err(), "expected error, got {result:?}");
+ }
}