alamb opened a new issue, #7407:
URL: https://github.com/apache/arrow-rs/issues/7407

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   When using block compression, there is a tradeoff:
   1. Smaller file sizes (and thus potentially more efficient file IO)
   2. Longer decoding time (requires more CPU to decode the pages)
   
   Most systems I know of in practice (e.g. DuckDB, DataFusion, InfluxDB 3.0) default to using page-level compression, but the parquet crate defaults to no compression ([source here](https://docs.rs/parquet/latest/src/parquet/file/properties.rs.html#34)).
   
   @XiangpengHao suggests in https://github.com/apache/arrow-rs/issues/7363#issuecomment-2797292029:
   
   > As a side note, I think we should by default enable compression in parquet 
writer settings. As parquet doesn't have good string encodings, without block 
compressions, string columns are practically almost uncompressed.
   
   
   **Describe the solution you'd like**
   Enable compression by default
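
   For context, here is roughly what opting in looks like today with the `WriterProperties` builder; the proposal is essentially to make a setting like this the default (a minimal sketch, not a benchmark):

   ```rust
   use parquet::basic::Compression;
   use parquet::file::properties::WriterProperties;

   fn main() {
       // The writer currently defaults to Compression::UNCOMPRESSED, so users
       // must opt in explicitly to get compressed pages:
       let _props = WriterProperties::builder()
           .set_compression(Compression::SNAPPY)
           .build();
   }
   ```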
   
   
   **Describe alternatives you've considered**
   
   One question is whether we should use different default compressions for string and non-string columns.
   
   1. I suggest we follow DuckDB's lead and default to `SNAPPY` compression to balance speed and compression ratio.
   2. We could also use `ZSTD`, which is what DataFusion uses -- that gives higher compression ratios but slower performance (see the sketch after this list).
   3. Don't change the default, but better document the current behavior and the tradeoff.
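
   A rough sketch of what options 1 and 2 would look like with the crate's `Compression` enum (the ZSTD level of 3 is only an illustrative choice, not a measured recommendation):

   ```rust
   use parquet::basic::{Compression, ZstdLevel};
   use parquet::file::properties::WriterProperties;

   fn main() -> parquet::errors::Result<()> {
       // Option 1: SNAPPY -- fast compression/decompression, moderate ratio.
       let _snappy = WriterProperties::builder()
           .set_compression(Compression::SNAPPY)
           .build();

       // Option 2: ZSTD -- better ratio, more decode CPU. The level here (3)
       // is an assumption for illustration; the crate exposes it via ZstdLevel.
       let _zstd = WriterProperties::builder()
           .set_compression(Compression::ZSTD(ZstdLevel::try_new(3)?))
           .build();
       Ok(())
   }
   ```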
   
   **Additional context**
   

