pchintar opened a new issue, #9832:
URL: https://github.com/apache/arrow-rs/issues/9832
### Description
Decoding `ColumnIndex` assumes that the page-aligned arrays (`null_pages`,
`min_values`, `max_values`) all have the same length. This assumption is never
validated, so decoding panics when the lengths differ.
---
### Root Cause
In `parquet/src/file/page_index/column_index.rs`, decoding performs
unchecked indexing:
```rust
let len = null_pages.len();
for (i, is_null) in null_pages.iter().enumerate().take(len) {
    if !is_null {
        let min = min_bytes[i];
        let max = max_bytes[i];
        ...
    }
}
```
Similarly for byte array indexes:
```rust
let min = min_values[i];
let max = max_values[i];
```
But there is no validation that:
```text
min_values.len() == null_pages.len()
max_values.len() == null_pages.len()
```
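To make the failure mode concrete, here is a minimal std-only sketch that scans for the first non-null page with no matching `min_values` or `max_values` entry. The function name `first_unbacked_page` and its slice-based signature are assumptions made for illustration; they are not part of the parquet crate.

```rust
/// Hypothetical helper (not from the parquet crate): returns the index of the
/// first non-null page that has no corresponding min or max entry.
fn first_unbacked_page(
    null_pages: &[bool],
    min_values: &[&[u8]],
    max_values: &[&[u8]],
) -> Option<usize> {
    null_pages.iter().enumerate().find_map(|(i, is_null)| {
        // `get` returns `None` instead of panicking when `i` is out of range.
        let missing = min_values.get(i).is_none() || max_values.get(i).is_none();
        (!is_null && missing).then_some(i)
    })
}
```

With two pages but only one min and one max entry, this reports page 1, which is the index where the unchecked `min_bytes[i]` access would panic.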
---
### Impact
* Panic (`index out of bounds`) on malformed or corrupted metadata
* Inconsistent with expected behavior (should return `ParquetError`)
* Affects robustness when handling external/untrusted Parquet files
---
### Reproduction
```rust
let column_index = ThriftColumnIndex {
    null_pages: vec![false, false],
    min_values: vec![&[1, 0, 0, 0]],
    max_values: vec![&[10, 0, 0, 0]],
    null_counts: None,
    repetition_level_histograms: None,
    definition_level_histograms: None,
    boundary_order: BoundaryOrder::UNORDERED,
};
let _ = PrimitiveColumnIndex::<i32>::try_from_thrift(column_index);
```
Results in:
```text
index out of bounds: the len is 1 but the index is 1
```
---
### Expected Behavior
Return a `ParquetError` when array lengths do not match the number of pages.
---
### Proposed Fix
Validate lengths in:
* `PrimitiveColumnIndex::try_new`
* `ByteArrayColumnIndex::try_new`
before indexing into `min_values` / `max_values`.
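
A rough sketch of what such a check could look like, assuming the page count is taken from `null_pages.len()`. The `IndexError` type and the `validate_page_aligned` function are hypothetical stand-ins for illustration; an actual fix would return the crate's own `ParquetError` and may be structured differently.

```rust
/// Hypothetical stand-in for `ParquetError`, used only in this sketch.
#[derive(Debug, PartialEq)]
struct IndexError(String);

/// Hypothetical check: `min_values` and `max_values` must each hold exactly
/// one entry per page, otherwise decoding should fail with an error.
fn validate_page_aligned(
    num_pages: usize,
    min_len: usize,
    max_len: usize,
) -> Result<(), IndexError> {
    if min_len != num_pages || max_len != num_pages {
        return Err(IndexError(format!(
            "column index length mismatch: {num_pages} pages, \
             {min_len} min_values, {max_len} max_values"
        )));
    }
    Ok(())
}
```

Running this once before the decoding loop would turn the reproduction above (two pages, one min, one max) into an `Err` rather than an out-of-bounds panic.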