BoazC-MSFT opened a new issue, #9839:
URL: https://github.com/apache/arrow-rs/issues/9839

   **Describe the bug**
   Starting in `parquet` 57.0.0, malformed Parquet files can panic while 
reading page metadata instead of returning a `ParquetError`.
   
   The panic occurs when a compact-Thrift field is expected to be a `bool`, but 
the field header contains a non-bool wire type. Compact-Thrift encodes struct 
bool values directly in the field header (`BooleanTrue` / `BooleanFalse`). The 
custom thrift reader currently assumes that schema-declared bool fields always 
have one of those bool field-header types and unwraps `field_ident.bool_val`.
   
   For malformed input, `field_ident.bool_val` is `None`, causing:
   
   ```text
   called `Option::unwrap()` on a `None` value
   ```
   
   One observed crash path is page-header parsing for 
`DictionaryPageHeader.is_sorted`:
   
   ```rust
    SerializedPageReader::read_page_header_len
    PageHeader::read_thrift_without_stats
    DictionaryPageHeader::read_thrift
    field_ident.bool_val.unwrap()
   ```
   
   There is a similar `unwrap` in the hand-expanded 
`DataPageHeaderV2::read_thrift_without_stats` reader for `is_compressed`.
   
   **To Reproduce**
   Create or mutate a Parquet file so that a compact-Thrift bool field in the 
page header is encoded with a non-bool field type.
   
   For example, mutate `DictionaryPageHeader.is_sorted` from compact-Thrift 
`BOOL_FALSE` to `I32`:
   
   `0x12  ->  0x15`
   
   where:
   
   ```test
    0x12 = field-id delta 1 + BOOL_FALSE
    0x15 = field-id delta 1 + I32
   ```
   
   Then read the file with parquet 57.0.0 or later through a page-reader path 
that parses the page header, such as reading records from a column reader.
   
   This panics with:
   
   `called `Option::unwrap()` on a `None` value`
   
   **Expected behavior**
   Malformed Parquet input should return a `ParquetError`, not panic.
   
   For a bool field whose compact-Thrift field header is not `BooleanTrue` or 
`BooleanFalse`, the reader should return an error indicating that the bool 
field has an invalid thrift field type.
   
   **Additional context**
   This appears to be a regression from the custom thrift parser introduced in 
parquet 57.0.0. Older versions using the generated thrift reader returned an 
error for equivalent malformed input.
   
   The same issue applies to at least two bool page-header fields:
   
    - `DictionaryPageHeader.is_sorted`
    - `DataPageHeaderV2.is_compressed`
   
   A minimal fix is to replace the unsafe `field_ident.bool_val.unwrap()` calls 
with error returns when `field_ident.bool_val` is `None` and add regression 
tests for both fields.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to