alamb commented on issue #7973: URL: https://github.com/apache/arrow-rs/issues/7973#issuecomment-3119283361
The failure is here: https://github.com/apache/arrow-rs/blob/9c0cb9a56f0099e7d39087826d7e409ce0f1bf5f/parquet/src/arrow/buffer/offset_buffer.rs#L78-L77

This is related to creating a `StringArray` that holds more than 2GB of data.

With the default batch_size of 8k rows, the average string length in each batch is restricted to 2GB / 8k = 256k.

Your data file has about 5MB per row, which means you can only fit ~409 rows in each batch before exceeding the 2GB limit.

The real fix for this bug is probably for the parquet decoder to internally create smaller batches when the data can't fit into the target batch size and data type.

## Workaround 1: use a smaller batch size

1. Set the batch size to be smaller (it works for me with a batch size of 100)

I am able to reproduce this using datafusion like

```shell
datafusion-cli -c "select length(html) from 'evil.parquet';"
DataFusion CLI v49.0.0

thread 'main' panicked at /Users/andrewlamb/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/arrow-array-55.2.0/src/builder/generic_bytes_builder.rs:86:57:
byte array offset overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

Setting the batch size to 100 lets the query complete:

```
andrewlamb@Andrews-MacBook-Pro-3 Downloads % datafusion-cli -c "set datafusion.execution.batch_size=100; select avg(length(html)) from 'evil.parquet';"
DataFusion CLI v49.0.0
0 row(s) fetched.
Elapsed 0.000 seconds.

+------------------------------------------+
| avg(character_length(evil.parquet.html)) |
+------------------------------------------+
| 5068563.0                                |
+------------------------------------------+
1 row(s) fetched.
Elapsed 0.528 seconds.
```

## Workaround 2: use a different arrow type

You can potentially override the schema to use `DataType::LargeUtf8` rather than `DataType::Utf8`: https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderOptions.html#method.with_schema
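
As a rough sanity check on the numbers in this comment, here is a minimal sketch of the arithmetic behind both workarounds. It relies only on the fact that `Utf8` arrays use `i32` offsets and `LargeUtf8` arrays use `i64` offsets. The constant and function names are made up for illustration and are not part of arrow-rs, and treating the measured average of 5068563 characters as a byte count assumes the strings are mostly ASCII.

```rust
/// Upper bound on the total string bytes in one `Utf8` array: offsets are
/// `i32`, so the final offset cannot exceed `i32::MAX` (just under 2 GiB).
/// (Hypothetical name, for illustration only.)
const UTF8_MAX_BYTES: u64 = i32::MAX as u64;

/// `LargeUtf8` stores `i64` offsets, so its bound is `i64::MAX`.
/// (Hypothetical name, for illustration only.)
const LARGE_UTF8_MAX_BYTES: u64 = i64::MAX as u64;

/// Largest average row size (in bytes) that still fits under `limit`
/// when a batch holds `batch_size` rows.
fn max_avg_row_bytes(limit: u64, batch_size: u64) -> u64 {
    limit / batch_size
}

/// Number of rows of `row_bytes` each that fit under `limit`.
fn max_rows_per_batch(limit: u64, row_bytes: u64) -> u64 {
    limit / row_bytes
}

fn main() {
    // Default batch size of 8k rows: roughly 256 KiB per row on average.
    println!("{}", max_avg_row_bytes(UTF8_MAX_BYTES, 8192)); // prints 262143

    // 5 MiB per row: about 409 rows fit in one Utf8 batch.
    println!("{}", max_rows_per_batch(UTF8_MAX_BYTES, 5 * 1024 * 1024)); // prints 409

    // Using the measured average length (assumed to be bytes).
    println!("{}", max_rows_per_batch(UTF8_MAX_BYTES, 5_068_563)); // prints 423

    // With i64 offsets, a full 8192-row batch of such rows fits easily.
    println!("{}", max_rows_per_batch(LARGE_UTF8_MAX_BYTES, 5_068_563) > 8192); // prints true
}
```

A batch size of 100 sits well under the roughly 409 rows that fit, which is consistent with Workaround 1 succeeding; the last line shows why switching to `LargeUtf8` would lift the limit for the default batch size.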
