pvary commented on PR #14435: URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3840786020
> 1. Does `ParquetFileMerger` consider this small row group scenario? Or is it designed primarily for files that already have properly-sized row groups?

In this case, the user should choose rewrite-based compaction. Rewrite compaction reads and rewrites the data, allowing row groups to be recreated with appropriate sizes.

> 2. Is there a recommended minimum row group size threshold below which `ParquetFileMerger` should not be used?

This is highly use-case specific. I would recommend running experiments with your own workloads to determine what works best in your scenarios.

> 3. Should the caller be responsible for filtering out small-row-group files before calling this API?

With proper configuration, compaction should happen only once per row. If merger-based compaction works well for your workload, you should use that; otherwise, rewrite-based compaction is the better choice. I don’t see a strong use case for making this decision on a per-file basis.
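
As a rough illustration of why small row groups survive one approach but not the other, here is a toy Python model. It is not Iceberg or Parquet code: the function names are hypothetical, row groups are reduced to plain row counts, and it assumes that merge-based compaction carries existing row groups over unchanged while rewrite-based compaction re-reads the rows and cuts new row groups at a target size.

```python
# Toy model only: each file is a list of row-group row counts.
# All names here are hypothetical and do not come from Iceberg or Parquet.

def merge_row_groups(files):
    """Merge-style compaction (assumed behavior): keep row groups as they are."""
    return [row_group for file in files for row_group in file]


def rewrite_row_groups(files, target_rows):
    """Rewrite-style compaction: re-read every row and cut new row groups."""
    total_rows = sum(row_group for file in files for row_group in file)
    full_groups, remainder = divmod(total_rows, target_rows)
    return [target_rows] * full_groups + ([remainder] if remainder else [])


# Three small input files, each holding a few tiny row groups.
small_files = [[100, 120], [80], [150, 50]]

merged = merge_row_groups(small_files)
rewritten = rewrite_row_groups(small_files, target_rows=400)

print(merged)     # [100, 120, 80, 150, 50]: the small row groups remain
print(rewritten)  # [400, 100]: row groups rebuilt at the target size
```

Both outputs hold the same 500 rows; only the row-group layout differs, which is the property worth measuring when experimenting with a real workload.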
