pchintar opened a new issue, #9832:
URL: https://github.com/apache/arrow-rs/issues/9832

   ### Description
   
   Decoding `ColumnIndex` assumes that the page-aligned arrays (`null_pages`, 
`min_values`, `max_values`) all have the same length. Nothing validates this 
assumption, so decoding panics when the lengths differ.
   
   ---
   
   ### Root Cause
   
   In `parquet/src/file/page_index/column_index.rs`, decoding performs 
unchecked indexing:
   
   ```rust
   let len = null_pages.len();
   
   for (i, is_null) in null_pages.iter().enumerate().take(len) {
       if !is_null {
           let min = min_bytes[i];
           let max = max_bytes[i];
           ...
       }
   }
   ```
   
   Similarly for byte array indexes:
   
   ```rust
   let min = min_values[i];
   let max = max_values[i];
   ```
   
   But there is no validation that:
   
   ```text
   min_values.len() == null_pages.len()
   max_values.len() == null_pages.len()
   ```
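   
   For illustration, the loop above could avoid the panic by using checked access. This is only a sketch: `page_value` and `non_null_bounds` are hypothetical helpers, not functions in the parquet crate, and `String` stands in for `ParquetError`.
   
   ```rust
   /// Hypothetical helper (not part of the parquet crate): fetch the statistics
   /// bytes for page `i`, returning an error instead of panicking.
   fn page_value<'a>(values: &[&'a [u8]], i: usize, name: &str) -> Result<&'a [u8], String> {
       values.get(i).copied().ok_or_else(|| {
           format!(
               "ColumnIndex {name} has {} entries, but page {i} was requested",
               values.len()
           )
       })
   }
   
   /// Collect the (min, max) byte slices of every non-null page.
   /// `String` is a stand-in for `ParquetError` in this sketch.
   fn non_null_bounds<'a>(
       null_pages: &[bool],
       min_values: &[&'a [u8]],
       max_values: &[&'a [u8]],
   ) -> Result<Vec<(&'a [u8], &'a [u8])>, String> {
       let mut out = Vec::new();
       for (i, is_null) in null_pages.iter().enumerate() {
           if !*is_null {
               // `slice::get` yields `None` for an out-of-range index,
               // which is mapped to an error rather than a panic.
               let min = page_value(min_values, i, "min_values")?;
               let max = page_value(max_values, i, "max_values")?;
               out.push((min, max));
           }
       }
       Ok(out)
   }
   ```
   
   With two entries in `null_pages` but only one in `min_values`, this returns `Err` at page 1 instead of panicking.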
   
   ---
   
   ### Impact
   
   * Panic (`index out of bounds`) on malformed or corrupted metadata
   * Inconsistent with expected behavior (should return `ParquetError`)
   * Affects robustness when handling external/untrusted parquet files
   
   ---
   
   ### Reproduction
   
   ```rust
   let column_index = ThriftColumnIndex {
       null_pages: vec![false, false],
       min_values: vec![&[1, 0, 0, 0]],
       max_values: vec![&[10, 0, 0, 0]],
       null_counts: None,
       repetition_level_histograms: None,
       definition_level_histograms: None,
       boundary_order: BoundaryOrder::UNORDERED,
   };
   
   let _ = PrimitiveColumnIndex::<i32>::try_from_thrift(column_index);
   ```
   
   Results in:
   
   ```text
   index out of bounds: the len is 1 but the index is 1
   ```
   
   ---
   
   ### Expected Behavior
   
   Return a `ParquetError` when array lengths do not match the number of pages.
   
   ---
   
   ### Proposed Fix
   
   Validate lengths in:
   
   * `PrimitiveColumnIndex::try_new`
   * `ByteArrayColumnIndex::try_new`
   
   before indexing into `min_values` / `max_values`.
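   
   A minimal sketch of such an up-front check, under stated assumptions: `validate_page_aligned_lens` is a hypothetical name, and `String` stands in for `ParquetError`. The actual fix may structure the check and its error message differently.
   
   ```rust
   /// Hypothetical check (not the parquet crate's API): `min_values` and
   /// `max_values` must each have one entry per page, i.e. the same length
   /// as `null_pages`. `String` is a stand-in for `ParquetError`.
   fn validate_page_aligned_lens(
       null_pages: usize,
       min_values: usize,
       max_values: usize,
   ) -> Result<(), String> {
       for (name, len) in [("min_values", min_values), ("max_values", max_values)] {
           if len != null_pages {
               return Err(format!(
                   "ColumnIndex {name} has {len} entries, expected {null_pages} (one per page)"
               ));
           }
       }
       Ok(())
   }
   ```
   
   Calling this with the lengths from the reproduction (`2`, `1`, `1`) yields an `Err` before any indexing happens.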

