nsivabalan commented on issue #3892:
URL: https://github.com/apache/hudi/issues/3892#issuecomment-956756527


   Let me try to explain. @bhasudha : can you document this somewhere? It might be 
useful for everyone in the community in general. 
   
   Bulk_insert: 
   This does not do any small file handling, so file sizing relies solely on 
HoodieCompactionConfig.COPY_ON_WRITE_RECORD_SIZE_ESTIMATE.key() and the 
parallelism set for bulk_insert. 
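   
   For illustration, a rough sketch of a bulk_insert write via the Spark 
datasource. The DataFrame df, table name, field names, base path, and the 
config values are all placeholders, not recommendations; 
hoodie.copyonwrite.record.size.estimate is the key behind 
COPY_ON_WRITE_RECORD_SIZE_ESTIMATE: 
   
   import org.apache.spark.sql.SaveMode
   
   // bulk_insert: no small file handling; file sizing is driven by the record
   // size estimate and the bulk_insert parallelism.
   df.write.format("hudi").
     option("hoodie.datasource.write.operation", "bulk_insert").
     option("hoodie.table.name", "my_table").
     option("hoodie.datasource.write.recordkey.field", "uuid").
     option("hoodie.datasource.write.precombine.field", "ts").
     option("hoodie.copyonwrite.record.size.estimate", "1024"). // avg bytes per record
     option("hoodie.bulkinsert.shuffle.parallelism", "200").
     mode(SaveMode.Append).
     save(basePath)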
   
   Insert: 
   Will do small file handling and could bin-pack incoming records into existing 
files. 
   For the first commit to a hudi table, hudi has no idea of the record size, so 
it relies on HoodieCompactionConfig.COPY_ON_WRITE_RECORD_SIZE_ESTIMATE.key() to 
estimate how many records might fit into one data file. In subsequent commits, 
hudi will infer the record size from previous commits and will use that to do 
small file handling. 
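   
   Similarly for insert, a rough sketch with the same placeholder caveats as 
above. I've also shown hoodie.parquet.small.file.limit, the threshold below 
which hudi considers a file small: 
   
   // insert: small file handling kicks in; the record size estimate only
   // matters for the first commit, after which hudi infers the size from
   // previous commits.
   df.write.format("hudi").
     option("hoodie.datasource.write.operation", "insert").
     option("hoodie.table.name", "my_table").
     option("hoodie.datasource.write.recordkey.field", "uuid").
     option("hoodie.datasource.write.precombine.field", "ts").
     option("hoodie.copyonwrite.record.size.estimate", "1024").  // first commit only
     option("hoodie.parquet.small.file.limit", "104857600").     // ~100 MB
     option("hoodie.insert.shuffle.parallelism", "200").
     mode(SaveMode.Append).
     save(basePath)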
   
   btw, each operation has a different config for parallelism, just in case you 
weren't aware of it: 
   
   hoodie.upsert.shuffle.parallelism
   hoodie.insert.shuffle.parallelism
   hoodie.delete.shuffle.parallelism
   hoodie.bulkinsert.shuffle.parallelism
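   
   These can be passed as writer options too; only the key matching the current 
operation takes effect (200 is just a placeholder): 
   
   // each write operation reads its own parallelism key
   val parallelismOpts = Map(
     "hoodie.upsert.shuffle.parallelism"     -> "200",
     "hoodie.insert.shuffle.parallelism"     -> "200",
     "hoodie.delete.shuffle.parallelism"     -> "200",
     "hoodie.bulkinsert.shuffle.parallelism" -> "200"
   )
   // merge into either writer chain above with .options(parallelismOpts)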
   

