Re: [I] [SUPPORT] Control file sizing during FULL_RECORD bootstrap mode [hudi]

2023-10-31 Thread via GitHub


codope closed issue #9915: [SUPPORT] Control file sizing during FULL_RECORD 
bootstrap mode
URL: https://github.com/apache/hudi/issues/9915


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Control file sizing during FULL_RECORD bootstrap mode [hudi]

2023-10-31 Thread via GitHub


ad1happy2go commented on issue #9915:
URL: https://github.com/apache/hudi/issues/9915#issuecomment-1786751122

   @fenil25 Closing this. Please reopen if you have any more questions. Thanks.





Re: [I] [SUPPORT] Control file sizing during FULL_RECORD bootstrap mode [hudi]

2023-10-26 Thread via GitHub


ad1happy2go commented on issue #9915:
URL: https://github.com/apache/hudi/issues/9915#issuecomment-1782349424

   FULL_RECORD bootstrap internally uses the bulk_insert operation type.





Re: [I] [SUPPORT] Control file sizing during FULL_RECORD bootstrap mode [hudi]

2023-10-25 Thread via GitHub


fenil25 commented on issue #9915:
URL: https://github.com/apache/hudi/issues/9915#issuecomment-1779758399

   Got it. Thanks @ad1happy2go 🙇 
   Are bulk_insert and full_record bootstrap modes the same then? 





Re: [I] [SUPPORT] Control file sizing during FULL_RECORD bootstrap mode [hudi]

2023-10-24 Thread via GitHub


ad1happy2go commented on issue #9915:
URL: https://github.com/apache/hudi/issues/9915#issuecomment-1778505679

   @fenil25 The bulk-insert operation does not do small-file handling, which 
is why the file sizes you see equal the split size. So the total number of 
partitions is calculated as `number_of_files * number_of_blocks_in_file`. 
   - One way to handle this is to run clustering with a suitable 
configuration to reach the target file size. 
   - The other way is to set the Spark configuration 
`spark.sql.files.maxPartitionBytes`, which defaults to 128 MB in Spark, when 
doing the bulk-insert. 
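
A rough sketch of the arithmetic behind the two options above, using the sizes reported in the issue (12 TB table, 2.5 GB average source file). It is an illustration only, not Hudi or Spark code: `estimate_splits` is a hypothetical helper, the split model ignores `spark.sql.files.openCostInBytes`, core count, and parquet row-group boundaries, and the clustering option names and values are assumptions that should be checked against the docs for the Hudi version in use.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4


def estimate_splits(total_bytes, avg_file_bytes, max_partition_bytes):
    """Approximate number_of_files * number_of_blocks_in_file.

    Hypothetical helper: assumes every source file has the average size
    and is cut into splits of at most max_partition_bytes.
    """
    num_files = math.ceil(total_bytes / avg_file_bytes)
    splits_per_file = math.ceil(avg_file_bytes / max_partition_bytes)
    return num_files * splits_per_file


table_bytes = 12 * TIB            # "around 12 TB" from the issue
avg_file_bytes = int(2.5 * GIB)   # "average parquet file size is 2.5Gb"

# Option 2: Spark default split size vs. a larger split size.
default_splits = estimate_splits(table_bytes, avg_file_bytes, 128 * MIB)
tuned_splits = estimate_splits(table_bytes, avg_file_bytes, 1200 * MIB)

spark_conf = {
    # 1200 MiB, the same byte value the issue used for the Hudi file size.
    "spark.sql.files.maxPartitionBytes": str(1200 * MIB),
}

# Option 1: clustering after the bootstrap. Values are illustrative.
clustering_opts = {
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1200 * MIB),
    "hoodie.clustering.plan.strategy.small.file.limit": str(600 * MIB),
}

print(default_splits, tuned_splits)
```

With the default 128 MB split size the estimate lands close to the 100K+ files reported in the issue; a larger split size reduces the count roughly in proportion, although the written file size also depends on compression and on any shuffle before the write.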





[I] [SUPPORT] Control file sizing during FULL_RECORD bootstrap mode [hudi]

2023-10-24 Thread via GitHub


fenil25 opened a new issue, #9915:
URL: https://github.com/apache/hudi/issues/9915

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? Yes
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   I want to bootstrap a table into Hudi. The table is around 12 TB and its 
base path is in S3. It's a partitioned Hive table, and 
the average parquet file size is 2.5 GB. I used the FULL_RECORD bootstrap mode 
with Spark, and the bootstrap succeeded. 
   However, the average file size of the Hudi table was around 120 MB, which aligns 
with the default, so the table ended up with 100K+ files. I am using S3 as 
the DFS. This made read performance quite slow. 
   I am not using any table partitioning yet. I did set 
`"hoodie.parquet.max.file.size": 1258291200` (~1.2 GB), but this configuration 
was completely ignored. 
   FAQs and File Sizing docs mainly talk about ways to adjust the file size 
while streaming data into Hudi. 
   How can I control the file size during the bootstrapping process itself? 
   
   I also read in the docs that - 
   ```
   A full record bootstrap is functionally equivalent to a bulk-insert.
   ```
   Does that mean both are essentially the same? Is there any advantage to 
using one over the other? (Note: _METADATA_ONLY does not work for our 
use-case_)
   
   
   **Environment Description**
   Running it via EMR 
   
   * Hudi version : 13.0 
   
   * Spark version : 3.3 
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   

