masumi-ryugo opened a new issue, #9874:
URL: https://github.com/apache/arrow-rs/issues/9874

   ## Describe the bug
   
   `parquet::parquet_thrift::read_thrift_vec` reads a Thrift compact-protocol 
list header and then calls `Vec::with_capacity(list_ident.size as usize)`, 
where `list_ident.size` is a 32-bit varint pulled directly from 
attacker-controlled bytes. With a malformed input, the value can be close to 
`i32::MAX`; after the per-element-size multiplication this becomes a multi-GB 
to multi-hundred-GB allocation request, which panics in 
`alloc::raw_vec::handle_error` with `capacity overflow` (or OOM-kills the 
process on smaller-but-still-huge values) before any element is decoded.
   
   Every public sync metadata entry point funnels through this function:
   
   - `ParquetMetaDataReader::parse_and_finish`
   - `ParquetMetaDataReader::decode_metadata`
   - `SerializedFileReader::new`
   - `ParquetRecordBatchReaderBuilder::try_new`
   - the `async_reader` family (re-exports `decode_metadata` after prefetching 
footer bytes)
   
   So any downstream code that hands attacker-controlled bytes to a parquet 
reader gets a panic-on-decode DoS.
   
   Per `SECURITY.md` this is a **bug, not a vulnerability** — there is no 
information disclosure or RCE path, only availability — but it is reachable 
from every metadata entry point, so it is worth tracking and closing properly.
   
   ## To Reproduce
   
   10 bytes:
   
   ```
   0x28 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0x51
   ```
   
   ```rust
   let bytes: &[u8] = &[0x28, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 
0x51];
   let _ = 
parquet::file::metadata::ParquetMetaDataReader::decode_metadata(bytes);
   ```
   
   Output (with `RUST_BACKTRACE=1`):
   
   ```
   panicked at library/alloc/src/raw_vec/mod.rs:28:5: capacity overflow
      3: alloc::raw_vec::handle_error
      4: alloc::raw_vec::with_capacity_in
      7: alloc::vec::Vec::with_capacity::<SchemaElement>
      8: parquet::parquet_thrift::read_thrift_vec       (parquet_thrift.rs:690)
      9: parquet::file::metadata::thrift::parquet_metadata_from_bytes
                                                        (thrift/mod.rs:790)
     10: parquet::file::metadata::parser::decode_metadata (parser.rs:233)
     13: ParquetMetaDataReader::decode_footer_metadata   (reader.rs:783)
     14: ParquetMetaDataReader::parse_metadata           (reader.rs:585)
     17: ParquetMetaDataReader::parse_and_finish         (reader.rs:230)
   ```
   
   A 45-byte reproducer with valid `PAR1` magic drives the same panic through 
`ParquetRecordBatchReaderBuilder::try_new`.
   
   ## Expected behavior
   
   Malformed Parquet input should return a `ParquetError` that callers can 
handle, never panic.
   
   ## Found via
   
   `cargo-fuzz` libFuzzer harness wrapping 
`ParquetMetaDataReader::parse_and_finish` and 
`ParquetRecordBatchReaderBuilder::try_new` over `bytes::Bytes`. ~95 unique 
crashing inputs (different VLQ sizes, different element types — 
`SchemaElement`, `RowGroup`, `ColumnChunk` — different offsets in the metadata 
graph) all converged on this single root cause within ~3 minutes of 
single-thread fuzzing. **One bug, many surface symptoms.**
   
   ## Scope
   
   This issue tracks two related strands:
   
   1. **Immediate fix** — bound the `Vec::with_capacity` in `read_thrift_vec` 
by remaining input bytes, since every Thrift element costs at least one wire 
byte. Also reject negative declared sizes explicitly. Proposed in #9868.
   2. **Broader thrift parser hardening** — per @etseidl on #9868: \"I think we 
could make a few more improvements to the thrift parser here.\" This issue is 
the umbrella for those follow-ups (e.g. reviewing every 
`with_capacity`/`reserve` in `parquet_thrift` against attacker-controlled 
sizes, audit of nested-list / map-key-or-value paths, any other decoders that 
could allocate before reading).
   
   Happy to do further fuzzing passes against specific suspected hotspots if 
maintainers point at where to look.
   
   ## Environment
   
   - crate: `parquet`
   - master at the time of report: HEAD around `fd86c75` (see #9868 merge-base)
   - rustc: stable
   
   ## Related
   
   - PR #9868 — proposed fix for strand 1.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to