shangxinli commented on PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3939814676

   @RussellSpitzer, the benefits of having fewer files are:
     - Fewer file open/close operations
     - Reduced file listing and planning overhead
     - Less manifest/metadata bloat at the catalog level
     - Less strain on the storage system, which suffers when there are too many small files
   
   The number of row groups after merging can be tuned to a reasonable
number. In some scenarios, such as streaming ingestion, we see many files with
only 1 row group. So the concern is not that we end up with too many row groups
after merging; it is that we have too few row groups (just 1) per file before
merging, which defeats the purpose of Parquet's row group design.
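
   To illustrate how the row group count per merged file could be capped, here
is a minimal sketch. It is not the PR's implementation or any Iceberg API: the
function `plan_merges` and the parameter `max_row_groups` are hypothetical names
made up for this example, and the greedy grouping is only one possible policy.

```python
from typing import List


def plan_merges(row_groups_per_file: List[int], max_row_groups: int) -> List[List[int]]:
    """Greedily group input files (by index) so that each merged output
    file holds at most `max_row_groups` row groups.

    Hypothetical helper for illustration only. An input file that already
    exceeds the cap is left in a group of its own.
    """
    if max_row_groups < 1:
        raise ValueError("max_row_groups must be at least 1")
    plans: List[List[int]] = []
    current: List[int] = []
    current_count = 0
    for index, count in enumerate(row_groups_per_file):
        # Start a new output file when adding this input would exceed the cap.
        if current and current_count + count > max_row_groups:
            plans.append(current)
            current, current_count = [], 0
        current.append(index)
        current_count += count
    if current:
        plans.append(current)
    return plans


# Ten streaming-written files with one row group each, capped at 4 per output.
print(plan_merges([1] * 10, max_row_groups=4))
# prints [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

   In this sketch, ten single-row-group files become three output files, and
no output file holds more than four row groups.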
   
   This solution offers fast merging. In practice, the high cost of merging at
the record level is a real problem in streaming ingestion, since all the data
needs to be rewritten (double the cost). This feature is not a replacement for
record-level rewrite, which remains the default. Both have pros and cons
and suit different purposes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

