nsivabalan commented on issue #3892: URL: https://github.com/apache/hudi/issues/3892#issuecomment-956756527
Let me try to explain. @bhasudha: can you document this somewhere? It might be useful for everyone in the community.

**Bulk_insert**: does not do any small-file handling. It relies solely on `HoodieCompactionConfig.COPY_ON_WRITE_RECORD_SIZE_ESTIMATE.key()` and the parallelism set for bulk_insert.

**Insert**: does small-file handling and may bin-pack incoming records into existing files. For the first commit to a Hudi table, Hudi has no idea of the record size, so it relies on `HoodieCompactionConfig.COPY_ON_WRITE_RECORD_SIZE_ESTIMATE.key()` to determine how many records might go into one data file. In subsequent commits, Hudi infers the record size from previous commits and uses that for small-file handling.

BTW, each operation has a different config for parallelism, just in case you weren't aware of it:
- `hoodie.upsert.shuffle.parallelism`
- `hoodie.insert.shuffle.parallelism`
- `hoodie.delete.shuffle.parallelism`
- `hoodie.bulkinsert.shuffle.parallelism`
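To make the configs above concrete, here is a minimal sketch of a Hudi write-options map. It is shown as a plain Python dict (an assumption for illustration, so the example runs without a Spark cluster); the table name, path, and numeric values are hypothetical, while the config keys are the string forms of the settings discussed above.

```python
# Hypothetical options for a Hudi datasource write (values chosen for illustration).
hudi_options = {
    "hoodie.table.name": "my_table",                     # hypothetical table name
    "hoodie.datasource.write.operation": "bulk_insert",  # no small-file handling
    # Estimated average record size in bytes; used for file sizing on the
    # first commit (insert) and always for bulk_insert, since there are no
    # prior commits to infer the record size from.
    "hoodie.copy.on.write.record.size.estimate": "64",
    # Each operation has its own shuffle-parallelism config:
    "hoodie.bulkinsert.shuffle.parallelism": "100",
    "hoodie.insert.shuffle.parallelism": "100",
    "hoodie.upsert.shuffle.parallelism": "100",
    "hoodie.delete.shuffle.parallelism": "100",
}

# With Spark this would typically be applied as (assumption, usual datasource usage):
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
for key, value in sorted(hudi_options.items()):
    print(f"{key} = {value}")
```

The point of the sketch: for bulk_insert (and the very first insert commit), the record-size estimate plus the parallelism setting is all Hudi has to work with when deciding file sizes.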