thomas-pfeiffer opened a new issue, #15754:
URL: https://github.com/apache/iceberg/issues/15754

   ### Query engine
   
   Spark, pyiceberg, or any other.
   
   ### Question
   
   Apologies if this is a rather stupid question, but it's currently not 
really clear how compression affects the targeted Parquet data file size, or 
whether it affects it at all.
   
   Scenario:
   Assuming I have an Iceberg table without any partitioning, and the total 
dataset is 5 GB uncompressed. If I now set `write.target-file-size-bytes` to 
256 MB, I would expect the resulting Parquet data files to be roughly ~256 MB 
in size on disk. 
   - Is my understanding so far correct?
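
   For reference, this is how I set the property. The table name `db.events` 
is just a placeholder, and `ALTER TABLE ... SET TBLPROPERTIES` is the Spark SQL 
route for Iceberg table properties:

   ```sql
   -- 256 MB = 268435456 bytes; table name is a placeholder
   ALTER TABLE db.events SET TBLPROPERTIES (
     'write.target-file-size-bytes' = '268435456'
   );
   ```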
   
   If I additionally set `write.parquet.compression-codec` to `zstd` and 
`write.parquet.compression-level` to 3, I expect the total size on disk to be 
less than the 5 GB uncompressed, because the data is now compressed. 
   - Will I still get ~256 MB files, each one containing more data, and hence 
fewer Parquet files in total? 
   - Or will I get smaller Parquet data files on disk, each containing 
roughly the same amount of data as the uncompressed ~256 MB Parquet files from 
above?
   
   For the sake of this question, I would ignore the size of manifests and 
metadata files.
   
   Remark: The only thing I have found in the documentation so far is this 
[Spark 
section](https://iceberg.apache.org/docs/latest/spark-writes/#controlling-file-sizes),
 but I think it talks more about Spark's size limits and their effect on the 
maximum Parquet file size. For the sake of this question, I would ignore such 
limitations of the given query engine(s).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

