yabola commented on PR #6831: URL: https://github.com/apache/kyuubi/pull/6831#issuecomment-2522505452
@pan3793 Hmm, but for merging small files we only need to consider the shuffle data size. This rule only applies when shuffle data is written to files, so the data source does not matter. Because shuffle data is stored row-wise and its size is only an estimate, the shuffle size still differs significantly from the size of the file actually written. This is especially true for Parquet, where the written file is usually less than 1/3 of the shuffle data size. Iceberg implementation: https://github.com/apache/iceberg/blob/38c8daa4eae8a75ab46571f1efce1609100f53dd/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkCompressionUtil.java#L60-L69
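To illustrate the point, here is a minimal sketch of how a shuffle-to-file ratio could change the partition count chosen when merging small files. It is not Kyuubi's or Iceberg's actual code: the function name, parameters, and the default ratio of 1/3 are assumptions taken only from the "usually less than 1/3" figure above.

```python
import math

# Assumed ratio of written file size to shuffle data size. The 1/3 value is
# only the rough Parquet figure quoted in the comment, not a measured constant.
ASSUMED_PARQUET_RATIO = 1 / 3


def estimate_partitions(shuffle_bytes: int,
                        target_file_bytes: int,
                        ratio: float = ASSUMED_PARQUET_RATIO) -> int:
    """Estimate how many output partitions yield files near the target size.

    Hypothetical helper: scales the shuffle size by `ratio` to approximate
    the written size, then divides by the desired file size.
    """
    if shuffle_bytes < 0:
        raise ValueError("shuffle_bytes must be non-negative")
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    if ratio <= 0:
        raise ValueError("ratio must be positive")

    estimated_written_bytes = shuffle_bytes * ratio
    # Always keep at least one partition, even for tiny inputs.
    return max(1, math.ceil(estimated_written_bytes / target_file_bytes))


if __name__ == "__main__":
    gib = 1024 ** 3
    mib = 1024 ** 2
    # Sizing by raw shuffle bytes (ratio 1.0) versus a compressed estimate.
    print(estimate_partitions(4 * gib, 256 * mib, ratio=1.0))
    print(estimate_partitions(4 * gib, 256 * mib, ratio=0.25))
```

Sizing partitions by the raw shuffle size (ratio 1.0) produces more, smaller files than intended once the data is compressed on write, which is why a per-format ratio matters here.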
