jroddev opened a new pull request, #6582: URL: https://github.com/apache/arrow-rs/pull/6582
# Which issue does this PR close? Closes #. # Rationale for this change If `offset_index` is loaded as `Some([])` then AsyncReader may panic in a couple of places. These panics do not occur when `offset_index` is `None`. Empty `offset_index` can end up as either depending on how its loaded. ```rust thread 'arrow::async_reader::tests::my_test' panicked at parquet/src/arrow/async_reader/mod.rs:832:34: index out of bounds: the len is 0 but the index is 0 note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace ``` An empty `offset_index` will switch from `None` to `Some([])` after it is written to disk with `ParquetMetaDataWriter` and then read back into memory with `ParquetMetaDataReader` (as is done in the [external_metadata](https://github.com/apache/arrow-rs/blob/9485897ccb6da955a3efeba84e552e85d4efaa20/parquet/examples/external_metadata.rs) example). I believe I also saw this issue on one parquet file without the roundtrip above, but I don't currently have the file to verify. The panic can be reproduced in the external_metadata example by changing it to load [alltypes_plain.parquet](https://github.com/apache/parquet-testing/blob/50af3d8ce206990d81014b1862e5ce7380dc3e08/data/alltypes_plain.parquet) instead of using the one in generated in code. You will also need to remove the call to `prepare_metadata` as the filesize assertion does not hold with the new data. ```rust - let parquet_path = create_parquet_file(&tempdir); + let testdata = arrow::util::test_util::parquet_test_data(); + let parquet_path = format!("{testdata}/alltypes_plain.parquet"); ... - let metadata = prepare_metadata(metadata); ``` Possible related to https://github.com/apache/arrow-rs/issues/6447 https://github.com/apache/arrow-rs/blob/9485897ccb6da955a3efeba84e552e85d4efaa20/parquet/src/file/metadata/reader.rs#L261 Sounds similar to https://github.com/apache/arrow-rs/pull/4761 Came across the issue at the same time as this one https://github.com/apache/arrow-rs/issues/6581 With this change `page_locations` will now equal `None` in both empty `offset_index` scenarios. When `offset_index` is provided then `page_locations` is filled as expected: ```rust page_locations: Some( [ PageLocation { offset: 4, compressed_page_size: 109, first_row_index: 0, }, PageLocation { offset: 113, compressed_page_size: 109, first_row_index: 21, }, ... ]) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
