lintingbin commented on PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3834417714

   Hi @shangxinli, great work on this PR! I'm from the Apache Amoro project and 
we're very interested in leveraging this optimization.
   
   I have a question about the row-group merging behavior:
   
   When merging many small files where each file contains small row groups 
(e.g., < 1MB per row group), the merged output file will still contain many 
small row groups since `reader.appendTo(writer)` copies row groups as-is 
without combining them.
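   To make the concern concrete, here is a toy model (hypothetical sketch, not the PR's code): because `appendTo` copies row groups unchanged, merging N files that each hold k small row groups yields one file with N × k small row groups.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   public class BinaryMergeSketch {
       /**
        * Toy model of an appendTo-style binary merge: the output's row-group
        * size list is just the concatenation of the inputs' lists — row
        * groups are copied as-is, never coalesced into larger ones.
        */
       static List<Long> binaryMerge(List<List<Long>> sourceFilesRowGroupSizes) {
           List<Long> merged = new ArrayList<>();
           for (List<Long> file : sourceFilesRowGroupSizes) {
               merged.addAll(file); // each source row group survives unchanged
           }
           return merged;
       }
   }
   ```
   
   So merging 100 files that each contain three ~1MB row groups produces a large file that still has 300 ~1MB row groups.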
   
   As noted in 
[PARQUET-1115](https://issues.apache.org/jira/browse/PARQUET-1115):
   > "When used to merge many small files, the resulting file will still 
contain small row groups and one loses most of the advantages of larger files."
   
   This could hurt read performance: tiny row groups make per-row-group statistics less selective for predicate pushdown, and small column chunks reduce the benefit of vectorized reads.
   
   **Questions:**
   1. Is there any plan to add a minimum row-group size threshold to determine 
eligibility for binary merge?
   2. Or perhaps a hybrid mode that falls back to row-level rewrite when source 
row groups are below a certain size?
   3. Should the caller be responsible for checking row group sizes before 
calling `ParquetFileMerger.mergeFiles()`?
   4. **For a two-phase approach**: Could we first use traditional row-level 
rewrite to merge small files into larger files (with proper-sized row groups), 
and then use `ParquetFileMerger` to merge those larger files? Would this be a 
recommended pattern, or is there a more efficient way to handle this scenario?
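   For questions 1 and 2, the decision could look something like this (a hypothetical sketch; the class, method, and the 32MB threshold are illustrative and not part of this PR):
   
   ```java
   import java.util.List;
   
   public class MergeStrategy {
       // Hypothetical minimum row-group size; 32 MB is chosen only for illustration.
       static final long MIN_ROW_GROUP_BYTES = 32L * 1024 * 1024;
   
       enum Mode { BINARY_MERGE, ROW_LEVEL_REWRITE }
   
       /**
        * Hybrid fallback: use the fast binary (appendTo-style) merge only when
        * every source row group meets the size threshold; otherwise fall back
        * to a row-level rewrite that produces properly sized row groups.
        */
       static Mode choose(List<Long> sourceRowGroupSizeBytes) {
           boolean allLargeEnough =
               sourceRowGroupSizeBytes.stream().allMatch(s -> s >= MIN_ROW_GROUP_BYTES);
           return allLargeEnough ? Mode.BINARY_MERGE : Mode.ROW_LEVEL_REWRITE;
       }
   }
   ```
   
   The same check could equally live on the caller's side (question 3) by reading row-group sizes from the Parquet footers before invoking `ParquetFileMerger.mergeFiles()`.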
   
   Thanks for your great contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

