This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git
The following commit(s) were added to refs/heads/main by this push:
new 7f6524def2 fix(parquet): bound schema num_children before
Vec::with_capacity (#9884)
7f6524def2 is described below
commit 7f6524def267f5c5be73b7d5320185ea9f3bb91f
Author: masumi-ryugo <[email protected]>
AuthorDate: Thu May 7 00:02:47 2026 +0900
fix(parquet): bound schema num_children before Vec::with_capacity (#9884)
## What
Validate `SchemaElement::num_children` in
`parquet::schema::types::schema_from_array_helper` before passing it to
`Vec::with_capacity`. Reject:
1. **Negative counts** — they wrap to a huge `usize` on the cast, and
they're nonsensical anyway (`num_children` is a count).
2. **Counts that exceed the remaining schema-array entries** — each
declared child must consume at least one entry, so a value larger than
`num_elements - index - 1` is invalid input. Without this bound the cast
feeds `Vec::with_capacity(~i32::MAX)` and `alloc::raw_vec::handle_error`
panics with `capacity overflow` on a 64-bit target.
```rust
if n < 0 {
return Err(general_err!(...));
}
let n_children = n as usize;
let remaining = num_elements.saturating_sub(index + 1);
if n_children > remaining {
return Err(general_err!(...));
}
let mut fields = Vec::with_capacity(n_children);
```
## How it was found
The `parquet/fuzz/thrift_decode` cargo-fuzz harness from #5332 /
[`fuzz/initial-harnesses`](https://github.com/masumi-ryugo/arrow-rs/tree/fuzz/initial-harnesses),
running for ~30 s on `parquet/main` HEAD with no seed corpus, produces
this 22-byte input:
```
00000000 02 00 2b 11 01 08 00 11 11 00 00 00 00 00 00 00
|..+.............|
00000010 f7 f9 11 01 a5 ff |......|
```
Backtrace before this patch:
```
thread '<unnamed>' panicked at raw_vec/mod.rs:28:5: capacity overflow
2: alloc::raw_vec::capacity_overflow
3: alloc::raw_vec::handle_error raw_vec/mod.rs:889:29
4: schema_from_array_helper schema/types.rs:1404
5: parquet_schema_from_array schema/types.rs:1304:17
6: parquet_metadata_from_bytes metadata/thrift/mod.rs:791:31
7: decode_metadata metadata/parser.rs:233:5
8: decode_metadata metadata/reader.rs:823:9
```
After this patch the same bytes return a clean `ParquetError::General`,
no panic, no allocation spike.
## Tests
Three new tests in `schema::types::tests`:
- `test_parquet_schema_from_array_rejects_negative_num_children` —
direct `parquet_schema_from_array` call with `num_children = Some(-1)`.
- `test_parquet_schema_from_array_rejects_overflowing_num_children` —
same with `num_children = Some(i32::MAX)`.
- `test_decode_metadata_oom_repro_schema_overflow_does_not_panic` —
end-to-end via `ParquetMetaDataReader::decode_metadata` using the
22-byte repro.
The existing 107 tests under `schema::` continue to pass.
## Relationship to #9883
Sibling fix, not dependent. #9883 caps the up-front allocation
**inside** the thrift parser (`read_thrift_vec`); this PR caps the
allocation **downstream** of the thrift parser, where
`SchemaElement::num_children` becomes a `Vec::with_capacity` argument.
They live on different code paths — fuzz finds inputs that hit either
one independently — and either can land first.
xref #5332 #9883 #9868 #9874
---------
Co-authored-by: masumi ryugo
<[email protected]>
---
parquet/src/schema/types.rs | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/parquet/src/schema/types.rs b/parquet/src/schema/types.rs
index 2c63da74df..0d504e16fc 100644
--- a/parquet/src/schema/types.rs
+++ b/parquet/src/schema/types.rs
@@ -1401,7 +1401,7 @@ fn schema_from_array_helper<'a>(
Some(n) => {
let repetition = element.repetition_type;
- let mut fields = Vec::with_capacity(n as usize);
+ let mut fields = Vec::with_capacity(usize::try_from(n)?);
let mut next_index = index + 1;
for _ in 0..n {
let child_result = schema_from_array_helper(elements,
num_elements, next_index)?;
@@ -2517,4 +2517,22 @@ mod tests {
let result_schema = parquet_schema_from_array(thrift_schema).unwrap();
assert_eq!(result_schema, expected_schema);
}
+
+ #[test]
+ fn test_parquet_schema_from_array_rejects_negative_num_children() {
+ let elements = vec![SchemaElement {
+ r#type: None,
+ type_length: None,
+ repetition_type: Some(Repetition::REQUIRED),
+ name: "schema",
+ num_children: Some(-1),
+ converted_type: None,
+ scale: None,
+ precision: None,
+ field_id: None,
+ logical_type: None,
+ }];
+ let result = parquet_schema_from_array(elements);
+ assert!(result.unwrap_err().to_string().contains("Integer overflow"));
+ }
}