lintingbin commented on PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3838725691

   > > > "When used to merge many small files, the resulting file will still contain small row groups and one loses most of the advantages of larger files."
   > > 
   > > 
   > > This could negatively impact read performance (predicate pushdown efficiency, vectorized read benefits, etc.).
   > 
   > This can have mixed effects on read performance. In Iceberg’s Java readers, filtering is applied at the row-group level, so smaller row groups can improve predicate pruning. However, this comes at the cost of full scan read efficiency and compression.
   
   Thanks for the explanation about the trade-off!
   
   I'd like to ask about a very common scenario: **Flink streaming writes with frequent checkpoints.**
   
   In typical Flink-Iceberg streaming pipelines, Flink commits data files on every checkpoint (usually every 1-2 minutes). This produces a large number of small files, each containing small row groups (often just a few MB).
   
   For example:
   - Checkpoint interval: 1 minute
   - Row group size per file: 1-10 MB
   - After 1 hour: hundreds of small files with tiny row groups
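
A back-of-the-envelope sketch of the example above. The 1-minute checkpoint interval comes from the example; the writer parallelism of 4 and the rule of one file (holding one row group) per writer per checkpoint are illustrative assumptions, not measurements from a real job.

```python
# Rough estimate of how many small files a streaming job commits.
# Assumptions (illustrative only): each writer subtask emits one data file
# per checkpoint, and each such file holds a single row group.

def files_committed(window_minutes: int,
                    checkpoint_interval_minutes: int,
                    writer_parallelism: int) -> int:
    """Number of data files committed over the window."""
    checkpoints = window_minutes // checkpoint_interval_minutes
    return checkpoints * writer_parallelism


files = files_committed(window_minutes=60,
                        checkpoint_interval_minutes=1,
                        writer_parallelism=4)

# A binary merge copies row groups as-is, so the merged output keeps one
# row group per input row group.
row_groups_after_merge = files

print(files, row_groups_after_merge)  # 240 240
```

Under these assumptions the merged file would hold as many row groups as there were input files, which is the concern raised here.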
   
   If we use `ParquetFileMerger` to compact these files, the merged file would still contain hundreds of small row groups, and compression efficiency cannot improve because the merge is a binary copy of the existing row groups.
   
   **Questions:**
   1. Does `ParquetFileMerger` consider this small row group scenario? Or is it designed primarily for files that already have properly-sized row groups?
   2. Is there a recommended minimum row group size threshold below which `ParquetFileMerger` should not be used?
   3. Should the caller be responsible for filtering out small-row-group files before calling this API?
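
To make question 3 concrete, here is a minimal sketch of what caller-side filtering could look like. `FileStats`, `split_candidates`, and the 32 MB threshold are all hypothetical; none of them is part of Iceberg or `ParquetFileMerger`, and the row-group sizes are assumed to have been read from the Parquet footers beforehand.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical threshold; no recommended value has been established here.
MIN_AVG_ROW_GROUP_BYTES = 32 * 1024 * 1024


@dataclass
class FileStats:
    """Hypothetical per-file summary built from Parquet footer metadata."""
    path: str
    row_group_bytes: List[int]

    def avg_row_group_bytes(self) -> float:
        if not self.row_group_bytes:
            return 0.0
        return sum(self.row_group_bytes) / len(self.row_group_bytes)


def split_candidates(
    files: List[FileStats],
    threshold: int = MIN_AVG_ROW_GROUP_BYTES,
) -> Tuple[List[FileStats], List[FileStats]]:
    """Split files into (binary-merge candidates, files needing a full rewrite)."""
    binary_merge: List[FileStats] = []
    rewrite: List[FileStats] = []
    for f in files:
        # Files whose row groups are already large lose little from a byte copy;
        # files with tiny row groups would carry them into the merged output.
        if f.avg_row_group_bytes() >= threshold:
            binary_merge.append(f)
        else:
            rewrite.append(f)
    return binary_merge, rewrite


mb = 1024 * 1024
stats = [
    FileStats("a.parquet", [128 * mb]),
    FileStats("b.parquet", [4 * mb, 6 * mb]),
]
merge, rewrite = split_candidates(stats)
print([f.path for f in merge], [f.path for f in rewrite])
```

Whether a check like this belongs in the caller or inside the merger itself is exactly what the question is asking.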
   
   Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

