Re: [I] [SUPPORT] Control file sizing during FULL_RECORD bootstrap mode [hudi]
codope closed issue #9915: [SUPPORT] Control file sizing during FULL_RECORD bootstrap mode URL: https://github.com/apache/hudi/issues/9915 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Control file sizing during FULL_RECORD bootstrap mode [hudi]
ad1happy2go commented on issue #9915: URL: https://github.com/apache/hudi/issues/9915#issuecomment-1786751122 @fenil25 Closing this. Please reopen if you have any more questions. Thanks.
Re: [I] [SUPPORT] Control file sizing during FULL_RECORD bootstrap mode [hudi]
ad1happy2go commented on issue #9915: URL: https://github.com/apache/hudi/issues/9915#issuecomment-1782349424 FULL_RECORD bootstrap internally uses the bulk_insert operation type.
Re: [I] [SUPPORT] Control file sizing during FULL_RECORD bootstrap mode [hudi]
fenil25 commented on issue #9915: URL: https://github.com/apache/hudi/issues/9915#issuecomment-1779758399 Got it. Thanks @ad1happy2go 🙇 Are bulk_insert and full_record bootstrap modes the same then?
Re: [I] [SUPPORT] Control file sizing during FULL_RECORD bootstrap mode [hudi]
ad1happy2go commented on issue #9915: URL: https://github.com/apache/hudi/issues/9915#issuecomment-1778505679

@fenil25 The bulk-insert operation does not do small-file handling, which is why you see file sizes equal to the split size. So the total number of partitions is calculated as `number_of_files * number_of_blocks_in_file`.

- One way to handle this case is to run clustering with a suitable configuration to reach the desired file sizes.
- The other way is to set the Spark configuration `spark.sql.files.maxPartitionBytes` when doing the bulk-insert; it defaults to 128 MB in Spark.
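To make the `number_of_files * number_of_blocks_in_file` arithmetic concrete, here is a rough sketch in plain Python using the figures from the issue (about 12 TB of source data, 2.5 GB average Parquet file). The helper `estimate_splits` and the 1280 MB value are illustrative assumptions, not part of Hudi or Spark. The two option dictionaries only collect configuration keys for the two suggestions above; the values are examples and should be checked against the documentation for your Hudi and Spark versions.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4


def estimate_splits(total_bytes, avg_file_bytes, max_partition_bytes=128 * MB):
    """Rough estimate: number_of_files * number_of_blocks_in_file.

    Hypothetical helper for illustration only. It ignores Parquet row-group
    boundaries and any size change when the data is rewritten.
    """
    num_files = math.ceil(total_bytes / avg_file_bytes)
    blocks_per_file = math.ceil(avg_file_bytes / max_partition_bytes)
    return num_files * blocks_per_file


# Spark default of 128 MB per split: close to the 100K+ files reported.
default_splits = estimate_splits(12 * TB, 2.5 * GB)  # 98320

# Assumed larger split size of 1280 MB: about ten times fewer output files.
larger_splits = estimate_splits(12 * TB, 2.5 * GB, max_partition_bytes=1280 * MB)  # 9832

# Option 2 from the comment: raise the Spark split size for the bootstrap job.
spark_conf = {"spark.sql.files.maxPartitionBytes": str(1280 * MB)}

# Option 1 from the comment: clustering after the bootstrap.
# Example values only; 1258291200 is the target size mentioned in the issue.
clustering_opts = {
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.plan.strategy.target.file.max.bytes": "1258291200",
    "hoodie.clustering.plan.strategy.small.file.limit": str(600 * MB),
}

print(default_splits, larger_splits)
```

The estimate is deliberately coarse: it only shows why the output file count tracks the split size rather than `hoodie.parquet.max.file.size` when no small-file handling runs.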
[I] [SUPPORT] Control file sizing during FULL_RECORD bootstrap mode [hudi]
fenil25 opened a new issue, #9915: URL: https://github.com/apache/hudi/issues/9915

**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? Yes
- Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org.
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

I want to bootstrap a table into Hudi. The table is around 12 TB, and the base path of the source table is in S3. It's a partitioned Hive table with an average Parquet file size of 2.5 GB.

I used the FULL_RECORD bootstrap mode with Spark, and the bootstrap was successful. However, the average file size of the Hudi table was around 120 MB, which aligns with the default and ended up creating 100K+ files. I am using S3 as the DFS, and this made read performance quite slow. I am not using any table partitioning yet.

I did set `"hoodie.parquet.max.file.size": 1258291200` (~1.2 GB), but this configuration was completely ignored. The FAQs and File Sizing docs mainly talk about ways to adjust the file size while streaming data into Hudi. How can I control the file size during the bootstrapping process itself?

I also read in the docs that -

```
A full record bootstrap is functionally equivalent to a bulk-insert.
```

Does that mean both are essentially the same? Is there any advantage to using one over the other?

(Note: _METADATA_ONLY does not work for our use case_)

**Environment Description**

Running it via EMR

* Hudi version : 13.0
* Spark version : 3.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no