devinjdangelo commented on PR #7692: URL: https://github.com/apache/arrow-datafusion/pull/7692#issuecomment-1740770336
@tustvold Going from uncompressed to even zstd(1) is likely to bring a 50% reduction in file size (up to 80% is not uncommon, depending on the data). I have some of my own DataFusion-specific benchmarks, but Uber Engineering has a nice public analysis on this topic [here](https://www.uber.com/en-US/blog/cost-efficiency-big-data/).

I added some benchmarks comparing writes with and without compression [here](https://github.com/apache/arrow-datafusion/pull/7655). With zstd(1) through zstd(3), writes in the single-threaded `AsyncArrowWriter` are 15-30% slower for a 50-60% reduction in file size. The parallel writer is only about 5% slower for the same reduction in file size. I have not compared read speeds on the uncompressed vs. compressed files, though if you are I/O-limited reading from a remote ObjectStore, compression can improve read performance because fewer bytes have to be transferred.

Another argument in favor of this change is that most other popular frameworks with parquet write support default to compression with either snappy (compatibility) or zstd (best performance), so users of DataFusion imo will not expect the default to be uncompressed.

- Apache Spark: [snappy](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html)
- Polars: [zstd](https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.LazyFrame.sink_parquet.html#polars.LazyFrame.sink_parquet)
- Pandas: [snappy](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html)
- DuckDB: [snappy](https://duckdb.org/docs/archive/0.6.1/data/parquet.html)

The DataFusion default is not all that important for systems/database developers, since those users will almost certainly tune the settings to their use case anyway. It matters more for gaining adoption among people who use DataFusion directly for analysis/ETL workloads, such as via datafusion-cli or the Python bindings.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
