[GitHub] [arrow-datafusion] devinjdangelo commented on pull request #7692: Update Default Parquet Write Compression

via GitHub Fri, 29 Sep 2023 04:59:25 -0700


devinjdangelo commented on PR #7692:
URL: https://github.com/apache/arrow-datafusion/pull/7692#issuecomment-1740770336


   @tustvold Going from uncompressed to even zstd(1) is likely to bring a 50% 
reduction in file size (up to 80% is not uncommon, depending on the data). I 
have some DataFusion-specific benchmarks of my own, but Uber Engineering has a 
nice public analysis on this topic 
[here](https://www.uber.com/en-US/blog/cost-efficiency-big-data/). 
   
   I added some benchmarks with and without compression 
[here](https://github.com/apache/arrow-datafusion/pull/7655). With zstd(1) to 
zstd(3), writes through the single-threaded `AsyncArrowWriter` are 15-30% 
slower in exchange for a 50-60% reduction in file size. The parallel writer is 
only about 5% slower for the same reduction in file size. I have not compared 
read speeds on the uncompressed vs compressed files, though if you are I/O 
bound reading from a remote ObjectStore, compression could improve read 
performance.
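   
   A back-of-the-envelope sketch of that last point, not a measurement: the 
function `read_seconds` and every number below (file size, bandwidths, 
decompression throughput) are hypothetical, and the model pessimistically 
assumes transfer and decompression do not overlap.

```python
def read_seconds(file_mb, bandwidth_mb_s, size_reduction=0.0, decompress_mb_s=None):
    """Estimate seconds to fetch a file and, if compressed, decompress it.

    file_mb:          uncompressed size in MB
    bandwidth_mb_s:   storage/network throughput in MB/s
    size_reduction:   fraction saved by compression (0.5 means half the bytes)
    decompress_mb_s:  decompression throughput in MB/s of output, None if uncompressed
    """
    stored_mb = file_mb * (1.0 - size_reduction)
    transfer = stored_mb / bandwidth_mb_s
    decompress = 0.0 if decompress_mb_s is None else file_mb / decompress_mb_s
    return transfer + decompress


# Hypothetical remote object store: 100 MB/s effective bandwidth.
remote_plain = read_seconds(1000, bandwidth_mb_s=100)
remote_zstd = read_seconds(1000, bandwidth_mb_s=100, size_reduction=0.5,
                           decompress_mb_s=1000)

# Hypothetical fast local disk: 3000 MB/s, same decompression throughput.
local_plain = read_seconds(1000, bandwidth_mb_s=3000)
local_zstd = read_seconds(1000, bandwidth_mb_s=3000, size_reduction=0.5,
                          decompress_mb_s=1000)

print(f"remote: plain={remote_plain:.2f}s compressed={remote_zstd:.2f}s")
print(f"local:  plain={local_plain:.2f}s compressed={local_zstd:.2f}s")
```

   Under these assumed numbers the compressed file reads faster from the slow 
remote store and slower from the fast local disk, which is why the effect on 
reads depends on where the bottleneck is.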
   
   Another argument in favor of this change is that most other popular 
frameworks with Parquet write support default to compression with either snappy 
(compatibility) or zstd (best performance), so imo DataFusion users will not 
expect the default to be uncompressed.
   - Apache Spark: 
[snappy](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html)
   - Polars: 
[zstd](https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.LazyFrame.sink_parquet.html#polars.LazyFrame.sink_parquet)
   - Pandas: 
[snappy](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html)
   - DuckDB: [snappy](https://duckdb.org/docs/archive/0.6.1/data/parquet.html)
   
   The DataFusion default is not all that important for systems/database 
developers, since those users will almost certainly tune the settings to their 
use case anyway. It matters more for gaining adoption among people who use 
DataFusion directly for analysis/ETL workloads, such as via datafusion-cli or 
the Python bindings.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

