masumi-ryugo commented on PR #9868:
URL: https://github.com/apache/arrow-rs/pull/9868#issuecomment-4362295929
Thanks @etseidl — pushed `ccc78aa` reshaping the fix per all four review
comments.
* **`read_list_begin`** (was `parquet_thrift.rs:727` review): the long-form
size is now `i32::try_from(self.read_vlq()?)?` instead of `as _`. Varints above
`i32::MAX` are rejected at the protocol layer rather than wrapping into a
negative `i32`. Added `IntegerOverflow` variant to `ThriftProtocolError` and
`From<TryFromIntError>`.
* **`read_thrift_vec`** (was `:738` review): switched to
`Vec::try_reserve_exact(size as usize)` per your suggestion. The allocator's
`size_of::<T>() * size > isize::MAX` panic is caught and surfaced as
`ParquetError`. The `remaining_bytes` trait method and the
`ThriftSliceInputProtocol` override are gone — net diff vs base is now `+63/-2`
(down from `+100/-1`), and the new abstraction surface is zero.
* **Tests** (`:1168` and `:1186` reviews):
* Dropped `test_read_thrift_vec_huge_size_does_not_panic` — confirmed it
passes on main without the fix, so it wasn't a real regression test.
* Kept `test_decode_metadata_huge_thrift_list_does_not_panic` (the 10-byte
fuzzer repro through `ParquetMetaDataReader::decode_metadata`).
* Replaced the negative-size test with
`test_read_list_begin_size_above_i32_max_returns_err`, which now drives the
protocol-layer rejection directly. I verified locally that both tests fail
without the fix (capacity overflow + `Ok(ListIdentifier { size: -2147483648 })`
respectively) and pass with it.
`cargo test -p parquet --release --lib` (`--skip
test_read_non_utf8_binary_as_utf8` per the original PR description note):
**1083 passed / 0 failed**. `cargo fmt --check`: clean. `cargo clippy -p
parquet --all-targets -- -D warnings`: my changes don't introduce any lints,
but I see 17 pre-existing clippy errors on the base `fd86c75`
(`arrow-data/src/data.rs`,
`parquet/src/arrow/array_reader/byte_array_dictionary.rs`,
`parquet/src/arrow/arrow_writer/mod.rs`,
`parquet/src/file/page_index/column_index.rs`, `parquet/src/schema/types.rs`,
`parquet/src/util/push_buffers.rs`) that I left alone since they're orthogonal
to this PR — happy to file/fix separately if useful.
For the broader thrift-parser hardening you mentioned, I'll start sweeping
the remaining `with_capacity` / `reserve` call sites on attacker-controlled
paths under #9874 unless you'd rather I wait.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]