Re: [PR] Add ParquetFileMerger for efficient row-group level file merging [iceberg]

via GitHub Tue, 03 Feb 2026 03:35:14 -0800


pvary commented on PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3840786020


   > 1. Does `ParquetFileMerger` consider this small row group scenario? Or is 
it designed primarily for files that already have properly-sized row groups?
   
   In this case, the user should choose rewrite-based compaction. Rewrite 
compaction reads and rewrites the data, allowing row groups to be recreated 
with appropriate sizes.
   
   > 2. Is there a recommended minimum row group size threshold below which 
`ParquetFileMerger` should not be used?
   
   This is highly use-case specific. I would recommend running experiments with 
your own workloads to determine what works best in your scenarios.
   
   > 3. Should the caller be responsible for filtering out small-row-group 
files before calling this API?
   
   With proper configuration, compaction should happen only once per row. If 
merger-based compaction works well for your workload, you should use that; 
otherwise, rewrite-based compaction is the better choice. I don’t see a strong 
use case for making this decision on a per-file basis.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


