jroddev opened a new pull request, #6582:
URL: https://github.com/apache/arrow-rs/pull/6582

   # Which issue does this PR close?
   Closes #.
   
   # Rationale for this change
   
   If `offset_index` is loaded as `Some([])` then AsyncReader may panic in a 
couple of places. These panics do not occur when `offset_index` is `None`. 
Empty `offset_index` can end up as either depending on how its loaded.
   ```rust
   thread 'arrow::async_reader::tests::my_test' panicked at 
parquet/src/arrow/async_reader/mod.rs:832:34:
   index out of bounds: the len is 0 but the index is 0
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   ```
   
   An empty `offset_index` will switch from `None` to `Some([])` after it is 
written to disk with `ParquetMetaDataWriter` and then read back into memory 
with `ParquetMetaDataReader` (as is done in the 
[external_metadata](https://github.com/apache/arrow-rs/blob/9485897ccb6da955a3efeba84e552e85d4efaa20/parquet/examples/external_metadata.rs)
 example).
   
   I believe I also saw this issue on one parquet file without the roundtrip 
above, but I don't currently have the file to verify.
   
   The panic can be reproduced in the external_metadata example by changing it 
to load 
[alltypes_plain.parquet](https://github.com/apache/parquet-testing/blob/50af3d8ce206990d81014b1862e5ce7380dc3e08/data/alltypes_plain.parquet)
 instead of using the one in generated in code. You will also need to remove 
the call to `prepare_metadata` as the filesize assertion does not hold with the 
new data.
   ```rust
       - let parquet_path = create_parquet_file(&tempdir);
       + let testdata = arrow::util::test_util::parquet_test_data();
       + let parquet_path = format!("{testdata}/alltypes_plain.parquet");
       ...
       - let metadata = prepare_metadata(metadata);
   ```
   
   Possible related to https://github.com/apache/arrow-rs/issues/6447
   
https://github.com/apache/arrow-rs/blob/9485897ccb6da955a3efeba84e552e85d4efaa20/parquet/src/file/metadata/reader.rs#L261
   
   Sounds similar to 
   https://github.com/apache/arrow-rs/pull/4761
   
   Came across the issue at the same time as this one 
https://github.com/apache/arrow-rs/issues/6581
   
   With this change `page_locations` will now equal `None` in both empty 
`offset_index` scenarios. 
   When `offset_index` is provided then `page_locations` is filled as expected:
   ```rust
   page_locations: Some(
       [
           PageLocation {
               offset: 4,
               compressed_page_size: 109,
               first_row_index: 0,
           },
           PageLocation {
               offset: 113,
               compressed_page_size: 109,
               first_row_index: 21,
           },
          ...
      ])
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to