lintingbin commented on PR #14435: URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3838725691
> > > "When used to merge many small files, the resulting file will still contain small row groups and one loses most of the advantages of larger files." > > > > > > This could negatively impact read performance (predicate pushdown efficiency, vectorized read benefits, etc.). > > This can have mixed effects on read performance. In Iceberg’s Java readers, filtering is applied at the row‑group level, so smaller row groups can improve predicate pruning. However, this comes at the cost of full scan read efficiency and compression. Thanks for the explanation about the trade-off! I'd like to ask about a very common scenario - **Flink streaming writes with frequent checkpoints:** In typical Flink-Iceberg streaming pipelines, Flink commits data files on every checkpoint (usually every 1-2 minutes). This creates a large number of small files, each containing small row groups (often just a few MB). For example: - Checkpoint interval: 1 minute - Row group size per file: 1-10 MB - After 1 hour: hundreds of small files with tiny row groups If we use `ParquetFileMerger` to compact these files, the merged file would still contain hundreds of small row groups, and compression efficiency cannot be improved since it's a binary copy. **Questions:** 1. Does `ParquetFileMerger` consider this small row group scenario? Or is it designed primarily for files that already have properly-sized row groups? 2. Is there a recommended minimum row group size threshold below which `ParquetFileMerger` should not be used? 3. Should the caller be responsible for filtering out small-row-group files before calling this API? Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
