alamb opened a new issue, #7490:
URL: https://github.com/apache/arrow-rs/issues/7490

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   By default, the arrow-rs parquet writer saves the full, untruncated min and 
max values for any column that has statistics enabled into the page metadata.
   
   For large binary/string columns (think JSON blobs), this means that two (a 
min and a max) potentially large values will be stored in both the file-level 
metadata and in each page header.
   
   This can lead to pathological cases such as the one described in:
   - https://github.com/apache/arrow-rs/issues/7489
   
   It is possible to control the maximum size of the values using 
[`WriterPropertiesBuilder::set_statistics_truncate_length`](https://arrow.apache.org/rust/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_statistics_truncate_length);
 however, this value currently defaults to `None` (unlimited).
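
   To illustrate why a length limit can still leave usable bounds, here is a 
minimal, standard-library-only sketch of truncating a min and a max value. It 
is not the parquet crate's implementation: `truncate_min` and `truncate_max` 
are hypothetical helper names, and the sketch works on raw bytes, so it ignores 
UTF-8 validity, which a real writer would have to handle for string columns.

```rust
/// Hypothetical helper: shorten a lower bound to at most `max_len` bytes.
/// Any prefix of `value` sorts <= `value`, so the prefix is still a valid min.
fn truncate_min(value: &[u8], max_len: usize) -> Vec<u8> {
    value[..value.len().min(max_len)].to_vec()
}

/// Hypothetical helper: shorten an upper bound to at most `max_len` bytes.
/// A bare prefix would sort *below* `value`, so the last byte that is not
/// 0xFF is incremented (trailing 0xFF bytes are dropped first).
/// Returns `None` when no shorter upper bound exists; a caller would then
/// have to keep the full value.
fn truncate_max(value: &[u8], max_len: usize) -> Option<Vec<u8>> {
    if value.len() <= max_len {
        return Some(value.to_vec());
    }
    let mut prefix = value[..max_len].to_vec();
    while let Some(last) = prefix.pop() {
        if last != 0xFF {
            prefix.push(last + 1);
            return Some(prefix);
        }
    }
    None
}

fn main() {
    // Stand-in for a large JSON blob stored in a string column.
    let blob = b"{\"user\": \"alice\", \"note\": \"a long JSON blob\"}";

    let min = truncate_min(blob, 8);
    let max = truncate_max(blob, 8).expect("prefix is not all 0xFF bytes");

    // The truncated values still bracket the original value.
    assert!(min.as_slice() <= blob.as_slice());
    assert!(max.as_slice() >= blob.as_slice());

    println!("min = {:?}", String::from_utf8_lossy(&min));
    println!("max = {:?}", String::from_utf8_lossy(&max));
}
```

   The truncated bounds are looser than the originals, so a reader may scan a 
few more pages than strictly necessary, but it will never skip a page that 
contains matching data.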
   
   I also think it is unlikely that the full min/max values for large string 
columns will provide significantly better pruning than truncated values.
   
   **Describe the solution you'd like**
   I propose we set the default statistics truncate length to a non-`None` 
value to avoid pathological cases.
   
   
   **Describe alternatives you've considered**
   I would propose picking a value like `128` that is long enough to capture 
all primitive data types and "short" strings.
   
   We can (and should) also document the default better
   
   **Additional context**
   - related to https://github.com/apache/arrow-rs/issues/7489
   - https://github.com/apache/arrow/issues/46404
   - https://github.com/kylebarron/arro3/issues/324

