alamb commented on issue #7973:
URL: https://github.com/apache/arrow-rs/issues/7973#issuecomment-3119283361

   The failure is here: 
https://github.com/apache/arrow-rs/blob/9c0cb9a56f0099e7d39087826d7e409ce0f1bf5f/parquet/src/arrow/buffer/offset_buffer.rs#L78-L77
   
   Which is related to creating a `StringArray` with more than 2GB of data (`StringArray` uses `i32` offsets, so a single array can hold at most ~2GB of string data). With the default batch_size of 8k rows, that means the average string length in each batch is restricted to about 2GB / 8k ≈ 256KB
   
   Your data file has rows of roughly 5MB each, which means you can only fit ~409 rows in each batch before exceeding the 2GB limit.
   
   The real fix for this bug is probably for the parquet decoder to internally produce smaller batches when a full batch of the target batch size can't fit into the target arrow data type.
   
   ## Workaround 1: use a smaller batch size
   
   1. Set the batch size to be smaller (it works for me with a batch size of 
100)
   
   
   I am able to reproduce this using datafusion-cli like this:
   ```shell
   datafusion-cli -c "select length(html) from 'evil.parquet';"
   DataFusion CLI v49.0.0
   
   thread 'main' panicked at 
/Users/andrewlamb/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/arrow-array-55.2.0/src/builder/generic_bytes_builder.rs:86:57:
   byte array offset overflow
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   ```
   
   Setting the batch size to 100 lets it go:
   ```
   andrewlamb@Andrews-MacBook-Pro-3 Downloads %  datafusion-cli -c "set 
datafusion.execution.batch_size=100; select avg(length(html)) from 
'evil.parquet';"
   DataFusion CLI v49.0.0
   0 row(s) fetched.
   Elapsed 0.000 seconds.
   
   +------------------------------------------+
   | avg(character_length(evil.parquet.html)) |
   +------------------------------------------+
   | 5068563.0                                |
   +------------------------------------------+
   1 row(s) fetched.
   Elapsed 0.528 seconds.
   ```
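
   If you are reading with the `parquet` crate's reader API directly rather than datafusion-cli, here is a minimal sketch of the same workaround (the file name `evil.parquet` and the batch size of 100 are just illustrative):
   ```rust
   use std::fs::File;
   use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

   fn main() -> Result<(), Box<dyn std::error::Error>> {
       let file = File::open("evil.parquet")?;

       // Limit each batch to 100 rows so the concatenated string data in each
       // StringArray stays well under the ~2GB i32 offset limit
       let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
           .with_batch_size(100)
           .build()?;

       for batch in reader {
           let batch = batch?;
           println!("read batch with {} rows", batch.num_rows());
       }
       Ok(())
   }
   ```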
   
   ## Workaround 2: use a different arrow type
   
   You can potentially override the schema to use `DataType::LargeUtf8` rather than `DataType::Utf8`: https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderOptions.html#method.with_schema
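
   A minimal sketch of that approach, assuming the file contains a single nullable `html` column (adjust the schema to match the actual file schema):
   ```rust
   use std::fs::File;
   use std::sync::Arc;
   use arrow::datatypes::{DataType, Field, Schema};
   use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

   fn main() -> Result<(), Box<dyn std::error::Error>> {
       let file = File::open("evil.parquet")?;

       // Ask the reader to produce LargeUtf8 (i64 offsets) instead of Utf8
       // (i32 offsets), so a single batch can hold more than 2GB of string data
       let schema = Arc::new(Schema::new(vec![Field::new(
           "html",
           DataType::LargeUtf8,
           true,
       )]));
       let options = ArrowReaderOptions::new().with_schema(schema);

       let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?
           .build()?;
       for batch in reader {
           println!("read batch with {} rows", batch?.num_rows());
       }
       Ok(())
   }
   ```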
   
   

