thomas-pfeiffer opened a new issue, #15754: URL: https://github.com/apache/iceberg/issues/15754
### Query engine

Spark, pyiceberg, or any other.

### Question

Apologies if this is a rather basic question, but it's currently not clear to me whether compression affects the targeted Parquet data file size, or whether it doesn't.

Scenario: Assume I have an Iceberg table without any partitioning, and the total dataset is 5 GB uncompressed. If I set `write.target-file-size-bytes` to 256 MB, I would expect the resulting Parquet data files to be roughly ~256 MB in size on disk.

- Is my understanding so far correct?

If I additionally set `write.parquet.compression-codec` to `zstd` and `write.parquet.compression-level` to 3, I expect the total disk size to be less than the 5 GB uncompressed, because the data is now compressed.

- Will I still get ~256 MB files, just with each one containing more data, and hence fewer Parquet files in total?
- Or will I get smaller Parquet data files on disk, each containing roughly the same amount of data as the uncompressed ~256 MB files from above?

For the sake of this question, I would ignore the size of manifest and metadata files.

Remark: The only thing I have found in the documentation so far is this [Spark section](https://iceberg.apache.org/docs/latest/spark-writes/#controlling-file-sizes), but I think that one is more about Spark's own size limits and their effect on the maximum Parquet file size. For the sake of this question, I would also ignore such limitations of the given query engine(s).
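
To make the two interpretations concrete, here is a small arithmetic sketch. The 3x zstd compression ratio is a made-up assumption for illustration only; real ratios depend entirely on the data.

```python
# Illustrating the two possible readings of write.target-file-size-bytes
# from the question above. All numbers are hypothetical.

UNCOMPRESSED_TOTAL = 5 * 1024**3   # 5 GiB of raw data
TARGET_FILE_SIZE = 256 * 1024**2   # write.target-file-size-bytes = 256 MiB
COMPRESSION_RATIO = 3              # assumed zstd level-3 ratio (illustrative)

# Hypothesis A: the target applies to the on-disk (compressed) size.
# Each file is still ~256 MiB on disk but holds ~3x the rows,
# so fewer files are written overall.
compressed_total = UNCOMPRESSED_TOTAL // COMPRESSION_RATIO
files_a = -(-compressed_total // TARGET_FILE_SIZE)  # ceiling division

# Hypothesis B: the target applies to the uncompressed row data.
# The file count matches the uncompressed case, but each file
# shrinks on disk after compression.
files_b = -(-UNCOMPRESSED_TOTAL // TARGET_FILE_SIZE)
on_disk_per_file_b = TARGET_FILE_SIZE // COMPRESSION_RATIO

print(files_a)             # -> 7   (~7 files of ~256 MiB each)
print(files_b)             # -> 20  (~20 files of ~85 MiB each)
```

Under hypothesis A the 5 GiB dataset compresses to ~1.7 GiB and yields about 7 files; under hypothesis B it still yields 20 files, each around 85 MiB on disk.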
