BoazC-MSFT opened a new issue, #9839:
URL: https://github.com/apache/arrow-rs/issues/9839
**Describe the bug**
Starting in `parquet` 57.0.0, malformed Parquet files can panic while
reading page metadata instead of returning a `ParquetError`.
The panic occurs when a compact-Thrift field is expected to be a `bool`, but
the field header contains a non-bool wire type. Compact-Thrift encodes struct
bool values directly in the field header (`BooleanTrue` / `BooleanFalse`). The
custom thrift reader currently assumes that schema-declared bool fields always
have one of those bool field-header types and unwraps `field_ident.bool_val`.
For malformed input, `field_ident.bool_val` is `None`, causing:
```text
called `Option::unwrap()` on a `None` value
```
One observed crash path is page-header parsing for
`DictionaryPageHeader.is_sorted`:
```rust
SerializedPageReader::read_page_header_len
PageHeader::read_thrift_without_stats
DictionaryPageHeader::read_thrift
field_ident.bool_val.unwrap()
```
There is a similar `unwrap` in the hand-expanded
`DataPageHeaderV2::read_thrift_without_stats` reader for `is_compressed`.
**To Reproduce**
Create or mutate a Parquet file so that a compact-Thrift bool field in the
page header is encoded with a non-bool field type.
For example, mutate `DictionaryPageHeader.is_sorted` from compact-Thrift
`BOOL_FALSE` to `I32`:
`0x12 -> 0x15`
where:
```test
0x12 = field-id delta 1 + BOOL_FALSE
0x15 = field-id delta 1 + I32
```
Then read the file with parquet 57.0.0 or later through a page-reader path
that parses the page header, such as reading records from a column reader.
This panics with:
`called `Option::unwrap()` on a `None` value`
**Expected behavior**
Malformed Parquet input should return a `ParquetError`, not panic.
For a bool field whose compact-Thrift field header is not `BooleanTrue` or
`BooleanFalse`, the reader should return an error indicating that the bool
field has an invalid thrift field type.
**Additional context**
This appears to be a regression from the custom thrift parser introduced in
parquet 57.0.0. Older versions using the generated thrift reader returned an
error for equivalent malformed input.
The same issue applies to at least two bool page-header fields:
- `DictionaryPageHeader.is_sorted`
- `DataPageHeaderV2.is_compressed`
A minimal fix is to replace the unsafe `field_ident.bool_val.unwrap()` calls
with error returns when `field_ident.bool_val` is `None` and add regression
tests for both fields.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]