Re: [I] [SUPPORT] Spark Write into MoR type hudi table small parquets issue + Athena Internal Error [hudi]

2024-03-06 Thread via GitHub


huliwuli commented on issue #10716:
URL: https://github.com/apache/hudi/issues/10716#issuecomment-1981519987

   > @huliwuli So it looks like your per-record size is really small. Hudi 
uses the previous commit's statistics to estimate future record sizes. For the 
very first commit, it relies on the config 
"hoodie.copyonwrite.record.size.estimate" (default 1024). So setting it to a 
lower value might work for you. Is that correct?
   > 
   > bulk_insert doesn't merge small files out of the box, so you need to run 
a clustering job to merge them. If most of the time you just get inserts, then 
you may just use a COW table. I assume that by "delete previous data" you mean 
deleting old partitions only.
   
   Thanks for the reply. "hoodie.copyonwrite.record.size.estimate" works on my 
MoR table when I set it to 30-40.
   
   In most cases we delete some rows from one old partition, but the number of 
rows is not predictable. We currently use MoR; if you suggest we use a COW 
table, can I switch to COW directly from the Hudi options?
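   For reference, a minimal sketch of how that estimate might be passed on the 
writer. The value 35 is just an example in the 30-40 byte range mentioned 
above, and the option would be merged into the job's existing Hudi write 
options:

```python
# Hedged sketch: layer this on top of the write options the job already uses.
size_estimate_opts = {
    # Average record size in bytes. Hudi's default assumption is 1024, which
    # oversizes file groups when records are only a few tens of bytes.
    "hoodie.copyonwrite.record.size.estimate": "35",
}
```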
   
   





Re: [I] [SUPPORT] Spark Write into MoR type hudi table small parquets issue + Athena Internal Error [hudi]

2024-03-06 Thread via GitHub


ad1happy2go commented on issue #10716:
URL: https://github.com/apache/hudi/issues/10716#issuecomment-1980854966

   @huliwuli So it looks like your per-record size is really small. Hudi uses 
the previous commit's statistics to estimate future record sizes. For the very 
first commit, it relies on the config 
"hoodie.copyonwrite.record.size.estimate" (default 1024). So setting it to a 
lower value might work for you. Is that correct?
   
   bulk_insert doesn't merge small files out of the box, so you need to run a 
clustering job to merge them. If most of the time you just get inserts, then 
you may just use a COW table. I assume that by "delete previous data" you mean 
deleting old partitions only.
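   If clustering is the route for compacting bulk_insert's output, the 
plan-strategy size thresholds below are the relevant knobs. A hedged sketch 
with illustrative byte values, not values tuned for this table:

```python
# Illustrative values only; these go alongside the clustering trigger options
# (inline or async) on the writer or the clustering job.
clustering_size_opts = {
    # Data files below this size (bytes) are candidates for clustering.
    "hoodie.clustering.plan.strategy.small.file.limit": "104857600",
    # Target size (bytes) for the files that clustering writes out (~128 MB).
    "hoodie.clustering.plan.strategy.target.file.max.bytes": "134217728",
}
```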





Re: [I] [SUPPORT] Spark Write into MoR type hudi table small parquets issue + Athena Internal Error [hudi]

2024-02-27 Thread via GitHub


huliwuli commented on issue #10716:
URL: https://github.com/apache/hudi/issues/10716#issuecomment-1967593378

   > @huliwuli The "insert" operation type should handle merging small files. 
I see you set the small file size limit to 10 MB. Can you remove that config 
(default 104857600) or increase it and see if that helps?
   
   That did not work on EMR 6.15; it still generates 40-50 files for one day 
of data.
   
   "hoodie.copyonwrite.record.size.estimate" does work, but I am using a 
MoR-type table. I am not sure whether using this setting on a MoR table 
introduces any risks.
   
   I know bulk_insert works and controls the file size, but is it okay to use 
bulk_insert with append mode for daily delta data?
   
   My use case is only to insert data into a (date) partition, and sometimes I 
need to delete previous data. That's why I use MoR.
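   For the occasional deletes mentioned above, a minimal sketch of a Hudi 
delete write from PySpark; the table name, column names, and path are 
placeholders rather than the actual job's values:

```python
# Hypothetical sketch -- names and path are placeholders. The DataFrame
# "rows_to_delete" only needs the record key and partition path columns
# of the rows that should be removed.
delete_opts = {
    "hoodie.table.name": "events_mor",                        # placeholder
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "delete",
    "hoodie.datasource.write.recordkey.field": "record_id",   # placeholder
    "hoodie.datasource.write.partitionpath.field": "date",    # placeholder
    "hoodie.datasource.write.precombine.field": "ts",         # placeholder
}

(rows_to_delete.write.format("hudi")
    .options(**delete_opts)
    .mode("append")
    .save("s3://bucket/warehouse/events_mor"))                # placeholder path
```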
   





Re: [I] [SUPPORT] Spark Write into MoR type hudi table small parquets issue + Athena Internal Error [hudi]

2024-02-27 Thread via GitHub


huliwuli commented on issue #10716:
URL: https://github.com/apache/hudi/issues/10716#issuecomment-1967344211

   > @huliwuli Do you see a successful .replacecommit if the clustering was 
successful? Can you post a screenshot of the timeline?
   
   Thanks for the help. I did not see a successful .replacecommit. Here is 
the timeline:
   
   
![bfcad332be78dded470ae28b60d5db2](https://github.com/apache/hudi/assets/46934296/806255c4-afb5-414b-913e-9eaba9c9d4a2)
   





Re: [I] [SUPPORT] Spark Write into MoR type hudi table small parquets issue + Athena Internal Error [hudi]

2024-02-27 Thread via GitHub


ad1happy2go commented on issue #10716:
URL: https://github.com/apache/hudi/issues/10716#issuecomment-1966830730

   @huliwuli Do you see a successful .replacecommit if the clustering was 
successful? Can you post a screenshot of the timeline?
   





Re: [I] [SUPPORT] Spark Write into MoR type hudi table small parquets issue + Athena Internal Error [hudi]

2024-02-27 Thread via GitHub


huliwuli commented on issue #10716:
URL: https://github.com/apache/hudi/issues/10716#issuecomment-1966761904

   > @huliwuli The "insert" operation type should handle merging small files. 
I see you set the small file size limit to 10 MB. Can you remove that config 
(default 104857600) or increase it and see if that helps?
   
   I will try that. Continuing with the clustering issue: Athena still raises 
an internal error when I use inline clustering. Additionally, async clustering 
worked great; I can see the large parquet files after the clustering and 
replace commit. However, when I query with PySpark and Athena, I am not able 
to see the latest commit (timeline).
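   One way to check which commit instants a Spark snapshot query actually 
reads (the base path below is a placeholder); comparing these values against 
the timeline shows whether the reader reflects the post-clustering state:

```python
# Hedged diagnostic sketch -- the path is a placeholder for the table base path.
snapshot = (spark.read.format("hudi")
            .option("hoodie.datasource.query.type", "snapshot")
            .load("s3://bucket/warehouse/events_mor"))

# _hoodie_commit_time is Hudi's per-record metadata column; the distinct values
# show which commits the records returned by this query came from.
(snapshot.select("_hoodie_commit_time")
    .distinct()
    .orderBy("_hoodie_commit_time")
    .show(truncate=False))
```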





Re: [I] [SUPPORT] Spark Write into MoR type hudi table small parquets issue + Athena Internal Error [hudi]

2024-02-27 Thread via GitHub


ad1happy2go commented on issue #10716:
URL: https://github.com/apache/hudi/issues/10716#issuecomment-1966740579

   @huliwuli The "insert" operation type should handle merging small files. I 
see you set the small file size limit to 10 MB. Can you remove that config 
(default 104857600) or increase it and see if that helps?
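   For reference, a hedged sketch of the two file-sizing knobs under 
discussion, shown here at their documented defaults rather than values tuned 
for this workload:

```python
# Defaults, for illustration; merge these into the existing write options.
file_sizing_opts = {
    # Base files smaller than this (bytes) count as "small" and are topped up
    # by later inserts; the default is ~100 MB (the issue had it at 10 MB).
    "hoodie.parquet.small.file.limit": "104857600",
    # Upper bound (bytes) on a base parquet file; the default is ~120 MB.
    "hoodie.parquet.max.file.size": "125829120",
}
```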
   





Re: [I] [SUPPORT] Spark Write into MoR type hudi table small parquets issue + Athena Internal Error [hudi]

2024-02-23 Thread via GitHub


huliwuli commented on issue #10716:
URL: https://github.com/apache/hudi/issues/10716#issuecomment-1962059629

   **Regarding the Athena issue:**
   Because of the small parquet files, I implemented inline clustering with 
max commits = 1 as a test.
   
   Athena raises this error:
   Generic_INTERNAL_ERROR: Can not read value at 0 in block -1 in 
S3/XX/XXX//X/date=20XX-XX-XX.parquet
   
   I checked the commits; hudi-cli shows two commits, one from before the 
clustering and another created after the clustering, but when querying data 
from the _rt table it still shows the old commit time.
   I think the Hudi table did not sync with Hive in this case.
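   For context, a hedged sketch of the inline-clustering trigger described 
above together with Hive sync on the writer; the sync mode, database, and 
table names are assumptions, not the actual setup:

```python
# Hypothetical sketch -- sync mode, database, and table names are placeholders.
clustering_sync_opts = {
    # Schedule and execute a clustering plan after every commit (max commits = 1).
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "1",
    # Hive sync keeps the metastore definitions of the _rt/_ro tables current.
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",            # assumed sync mode
    "hoodie.datasource.hive_sync.database": "analytics",  # placeholder
    "hoodie.datasource.hive_sync.table": "daily_events",  # placeholder
}
```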





Re: [I] [SUPPORT] Spark Write into MoR type hudi table small parquets issue + Athena Internal Error [hudi]

2024-02-23 Thread via GitHub


huliwuli commented on issue #10716:
URL: https://github.com/apache/hudi/issues/10716#issuecomment-1961532330

   @ad1happy2go Thanks for the reply. I used the insert operation.





Re: [I] [SUPPORT] Spark Write into MoR type hudi table small parquets issue + Athena Internal Error [hudi]

2024-02-22 Thread via GitHub


ad1happy2go commented on issue #10716:
URL: https://github.com/apache/hudi/issues/10716#issuecomment-1958902900

   @huliwuli What operation type are you using? If you are using bulk_insert, 
it will not handle small-file merging out of the box.
   
   





[I] [SUPPORT] Spark Write into MoR type hudi table small parquets issue + Athena Internal Error [hudi]

2024-02-20 Thread via GitHub


yizhenglu opened a new issue, #10716:
URL: https://github.com/apache/hudi/issues/10716

   
   **Describe the problem you faced**
   
   Background:
   Currently I have around 100 MB of data each day (batch process). I use the 
delete operation with a broadcast join in Spark to delete the unused data (the 
data will be updated the next day). To avoid global index scanning, I insert 
the delta data into a new partition.
   
   **Parquet size issue:**
   Setting hoodie.shuffle.parallelism = 2 does not change the **parquet size** 
when using **insert**.
   It generates 40 parquet files for one (daily) partition, total size: 180 MB.
   
   **Athena issue:**
   Because of the small parquet files, I implemented inline clustering with 
max commits = 1 as a test.
   
   Athena raises this error:
   Generic_INTERNAL_ERROR: Can not read value at 0 in block -1 in 
S3/XX/XXX//X/date=20XX-XX-XX.parquet.
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Hudi options as in the screenshots below (an illustrative reconstruction 
is sketched after the screenshots). The table DDL (_rt, _ro) was created 
automatically when the Hive sync tool was enabled.
   
![00a427c6f6f4d8e2a73b07ccf75bdf5](https://github.com/apache/hudi/assets/46934296/ca8fb5d0-2905-4292-8ae1-3309b23250a8)
   Record keys are strings; partition keys are dates in yyyy-mm-dd format.
   2.
   
![e3471b52ec345815027c1dd2a759cce](https://github.com/apache/hudi/assets/46934296/8d96a9ce-fc89-41a7-9d3c-4be5c826fe4d)
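   Since the option screenshots do not carry over into the text archive, here 
is a purely illustrative reconstruction of a MoR insert write with Hive sync 
of the kind described; the table name, key/partition columns, database, and 
path are placeholders, not the settings from the screenshots:

```python
# Illustrative only -- placeholders, not the reporter's actual configuration.
hudi_options = {
    "hoodie.table.name": "daily_events",                      # placeholder
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "insert",
    "hoodie.datasource.write.recordkey.field": "record_id",   # string record key
    "hoodie.datasource.write.partitionpath.field": "date",    # yyyy-mm-dd dates
    "hoodie.datasource.write.precombine.field": "ts",         # placeholder
    # Hive sync, so the _rt / _ro tables are created and kept up to date.
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "analytics",      # placeholder
    "hoodie.datasource.hive_sync.table": "daily_events",      # placeholder
}

# "df" stands in for the daily ~100 MB batch DataFrame.
(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://bucket/warehouse/daily_events"))               # placeholder path
```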
   
   
   **Expected behavior**
   
   I would like each daily partition to end up as one parquet file of around 
120 MB, or as two parquet files of around 64 MB each.
   
   
   
   **Environment Description**
   
   * Hudi version : 0.13.0
   
   * Spark version : 3.4.1
   
   * Hive version : 0.13.1
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : NO
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org