Re: [PR] Add ParquetFileMerger for efficient row-group level file merging [iceberg]

via GitHub Mon, 09 Mar 2026 14:49:30 -0700


RussellSpitzer commented on PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#issuecomment-4027129644


   I think large row group size may be the only place where this makes sense, but 
only if you are on HDFS and have very, very large tables. I think it's almost 
always objectively better to have multiple manifest entries than a single one 
for scanning. 
   
   I am also not convinced by arguments that we should make Iceberg perform 
better for tools which list directory contents. We shouldn't optimize for 
things that directly contradict Iceberg's goals (eliminating the burden of list 
operations).
   
   I think I would need to see some real benchmarking to be convinced that this 
is the right path to take, especially for small files. For large files we 
would need a better argument about why we would want to compact files which are 
already large to make them extra large.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


