Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-02-20 Thread via GitHub


maheshguptags commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1955928240

   @danny0405 Still waiting for your response. Can you please take a look at this?
   





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-31 Thread via GitHub


maheshguptags commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1919229376

   Hi @ad1happy2go,
   There is a small correction on the commit file size.
   
   > which ultimately causes OOM due to 400 MB commit files.
   
   It's a 41 MB commit file size, @ad1happy2go.





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-31 Thread via GitHub


ad1happy2go commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1918948729

   @xicm @danny0405 Had a discussion with @maheshguptags. Let me try to summarise his issue.
   
   He has around 5000 partitions in total and is using the bucket index. When he uses a parallelism (write.tasks) of 20, the job takes 1 hour 45 minutes; when it is 100, it takes 35 minutes.
   
   But with the increase in parallelism, the number of file groups explodes, as expected. This results in a lot of small file groups with very few records each (~20), which ultimately causes OOM due to 400 MB commit files.





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-31 Thread via GitHub


ad1happy2go commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1918825052

   @maheshguptags Let's get on a call to discuss this further.
   





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-19 Thread via GitHub


maheshguptags commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-195386

   @ad1happy2go I tried the below configuration for Kafka, but it didn't help.
   ```
   source.kafka.max.poll.records=300
   source.kafka.max.poll.interval.ms=30
   ```
   
   I tried several different values for the above configs.
   cc : @danny0405 





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-18 Thread via GitHub


ad1happy2go commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1898160927

   Discussed with @maheshguptags. Advised him to explore the Flink Kafka connector configs to control the number of records/bytes in one micro-batch.
   
https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/kafka/
   
   cc @danny0405 
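   
   For reference, a minimal sketch of what such throttling could look like if the job reads from a Flink SQL Kafka source table (the topic, table, and columns here are hypothetical; options prefixed with `properties.` are forwarded to the underlying Kafka consumer client):
   
   ```
   CREATE TABLE kafka_events (
     a STRING,
     b STRING,
     ts TIMESTAMP(3)
   ) WITH (
     'connector' = 'kafka',
     'topic' = 'events',                            -- placeholder topic
     'properties.bootstrap.servers' = 'broker:9092',
     'properties.group.id' = 'hudi-ingest',
     'properties.max.poll.records' = '500',         -- cap records fetched per poll
     'scan.startup.mode' = 'earliest-offset',
     'format' = 'json'
   );
   ```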





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-16 Thread via GitHub


maheshguptags commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1894959705

   @danny0405 Can you please share the config to reduce the file groups per commit?
   






Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-16 Thread via GitHub


danny0405 commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1894838453

   Yeah, try to reduce the number of file groups per commit, because for each file group we keep an in-memory buffer before flushing to disk.
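   
   As a side note (not from this thread), the Hudi Flink writer also exposes memory knobs that bound those per-file-group buffers; a minimal sketch with illustrative values, where the table name, columns, and path are placeholders:
   
   ```
   CREATE TABLE hudi_sink (
     a STRING,
     b STRING,
     ts TIMESTAMP(3),
     PRIMARY KEY (`a`) NOT ENFORCED
   ) WITH (
     'connector' = 'hudi',
     'path' = 's3://bucket/table_path',  -- placeholder path
     'write.task.max.size' = '512',      -- total memory budget (MB) per write task
     'write.batch.size' = '64'           -- in-memory buffer (MB) flushed per batch
   );
   ```
   
   These options only bound how much buffer memory a write task may hold; reducing the file-group count per commit remains the primary fix.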





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-15 Thread via GitHub


maheshguptags commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1893141086

   @xicm The dataset is huge, around 100M records. However, for performance evaluation, I have only ingested 1.78M.





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-15 Thread via GitHub


xicm commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1893120259

   Can you redesign the partitioning? There are only 100M records, but there are so many partitions.





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-15 Thread via GitHub


maheshguptags commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1893110358

   Hi @xicm and @danny0405,
   I tried to increase the parallelism as @xicm suggested, but it is trying to consume the data in a single commit, i.e. it accumulates the data into a single commit, which causes a heap OOM issue.
   
   [screenshot: https://github.com/apache/hudi/assets/115445723/0b02127c-e14c-47b4-9033-db96d5e45a51]
   
   **Commit size from .hoodie folder**
   
   The second commit is trying to consume the entire dataset in one commit, i.e. it is creating a 41 MB .commit file.
   
   [screenshot: https://github.com/apache/hudi/assets/115445723/ea1fad68-7a9d-4eaf-970c-63ec9adbe479]
   
   Can we reduce/control the commit file size?






Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-12 Thread via GitHub


maheshguptags commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-103564

   @danny0405 I am asking about the number of file groups added for a particular commit. I am already using the bucket index.
   The number of file groups is more than 2000 for a commit.





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-12 Thread via GitHub


danny0405 commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1888785053

   `hoodie.bucket.index.num.buckets` controls the number of buckets under one partition, and by default it is 4 in Flink.
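   
   For illustration, a minimal sketch of pinning a smaller bucket count in the Flink SQL DDL, reusing the bucket options from the original issue (the path and non-key columns are placeholders). Since the total file-group count scales roughly with partitions * buckets, lowering this value directly shrinks the file groups touched per commit:
   
   ```
   CREATE TABLE hudi_sink (
     a STRING,
     b STRING,
     client_id STRING,
     hashed_server_id STRING,
     ts TIMESTAMP(3),
     PRIMARY KEY (`a`, `b`) NOT ENFORCED
   )
   PARTITIONED BY (`client_id`, `hashed_server_id`)
   WITH (
     'connector' = 'hudi',
     'path' = 's3://bucket/table_path',        -- placeholder path
     'hoodie.index.type' = 'BUCKET',
     'hoodie.bucket.index.num.buckets' = '4',  -- buckets per partition (Flink default)
     'hoodie.bucket.index.hash.field' = 'a'
   );
   ```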





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-10 Thread via GitHub


maheshguptags commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1886251792

   @xicm Let me try to increase the number of write tasks, load the data, and test the performance.
   
   Thanks





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-10 Thread via GitHub


xicm commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1886083636

   Sorry for my wrong understanding of `SubTasks`. Hudi splits the input data by partition + file group and then writes the partitioned data with a parallelism of `write.tasks`. The job writes 2000+ files in a commit, so a parallelism of 20 is too small.





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-10 Thread via GitHub


maheshguptags commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1884581679

   Yes, it is 20; it starts from 0 and ends at 19.
   [screenshot: https://github.com/apache/hudi/assets/115445723/e3e744c3-bdd1-4bc7-bb11-90375ed1c554]
   





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-10 Thread via GitHub


maheshguptags commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1884496142

   I already have 20 tasks writing the data; please check the screenshot below. Do you want me to increase it further?
   [screenshot: https://github.com/apache/hudi/assets/115445723/6c837122-b197-4f72-b9ae-507735b87b7a]






Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-10 Thread via GitHub


xicm commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1884473175

   A small bucket number will not fit the growing data. Generally we estimate the data size to determine the number of buckets. With the SIMPLE bucket engine the bucket number can't be changed, but we can increase `write.tasks`.
   
   I think your problem is that the data is too scattered.





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-10 Thread via GitHub


maheshguptags commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1884430702

   Hi @xicm,
   I tried the below combinations with the same number of records.
   [screenshot: https://github.com/apache/hudi/assets/115445723/7398fe1a-914a-44b2-94f5-e7b9fdc9a7c9]
   
   Please find below the details related to file groups.
   [screenshot: https://github.com/apache/hudi/assets/115445723/722a93e3-f588-4694-af7f-bf7b2752bd2e]
   
   After testing it several times, I noticed that 8 or 4 buckets look good for a data size below 100M.
   
   As we know, once the number of buckets is set we cannot change it, so I have a question related to that.
   
   Suppose I pick 8 buckets and the streaming data keeps growing (100 million per ID), will that affect the performance (considering that the job is streaming)?
   
   Thanks 
   Mahesh Gupta

   





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-09 Thread via GitHub


maheshguptags commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1884209599

   Yes, I am testing different combinations of bucket numbers.
   





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-09 Thread via GitHub


xicm commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1884052740

   > Can you tell me how to check the number of file groups?
   
   Use the Hudi CLI or Spark SQL `show_commits`; pay attention to `total_files_added` and `total_files_updated`.
   
   > it is still taking 45-50 minutes to execute, which is 5 times as long as the one-level partition.
   
   Reduce the bucket number and increase `write.tasks`, and test a few times to get better performance.
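   
   For example, a minimal sketch using the Spark SQL procedure (the table name is a placeholder):
   
   ```
   -- total_files_added and total_files_updated together indicate how many
   -- file groups each commit touched.
   CALL show_commits(table => 'my_hudi_table', limit => 10);
   ```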





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-09 Thread via GitHub


maheshguptags commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1882910466

   @xicm I reduced the number of buckets (it makes sense to reduce the bucket count since we have a second-level partition), but it is still taking 45-50 minutes to execute, which is 5 times as long as the one-level partition.
   





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-09 Thread via GitHub


maheshguptags commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1882789373

   @xicm Let me reduce the number of buckets and test with the same number of records to check the processing time.
   Can you tell me how to check the number of file groups?
 





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-09 Thread via GitHub


xicm commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1882747205

   Not sure if this is the cause; can you check the number of file groups after the partition field changed, and reduce the bucket number to see the time cost?





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-09 Thread via GitHub


maheshguptags commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1882727956

   
   @xicm Yes, I agree, but it should not slow it down 10 times.
   
   Say I have 100 partitions and each partition has 10 sub-partitions with 16 buckets; then the total task count would be 100*10*16 = 16,000 at most, whereas with a single-level partition it is 100*16 = 1,600, right?
   
   I understand it may take an extra 5-7 minutes compared to a single-level partition, but not 10 times as long.
   
   Let me know your thoughts.
   
   let me know your thoughts  
   





Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-09 Thread via GitHub


xicm commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1882703763

   Adding a partition field means more tasks. And since the index is BUCKET, the task number could be bucket_num * partitions.





[I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-08 Thread via GitHub


maheshguptags opened a new issue, #10456:
URL: https://github.com/apache/hudi/issues/10456

   I am trying to add a second level of partitioning to my table instead of one level, but it is taking 10x the time compared to a single-level partition in a Hudi Flink job.
   
   I tried to ingest 1.8M records into the one-level partitioning and it took around 12-15 minutes to ingest all the data; then, with the same configuration, I just added another level of partition key with the same data payload and it took around 1 hour 45 minutes to complete the process.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   Below is the configuration that I am using for the table. You can add the table creation statement with the below properties.
   
   ```
   PARTITIONED BY (`client_id`,`hashed_server_id`)
   WITH ('connector' = 'hudi','path' = '${table_location}',
   'table.type' = 'COPY_ON_WRITE',
   'hoodie.datasource.write.recordkey.field' = 'a,b',
   'payload.class'='x.y.PartialUpdate',
   'precombine.field'='ts',
   'hoodie.clean.async'='true',
   'hoodie.cleaner.policy' = 'KEEP_LATEST_COMMITS',
   'hoodie.clean.automatic' = 'true',
   'hoodie.clean.max.commits'='5',
   'hoodie.clean.trigger.strategy'='NUM_COMMITS',
   'hoodie.cleaner.parallelism'='100',
   'hoodie.cleaner.commits.retained'='4',
   'hoodie.index.type'= 'BUCKET',
   'hoodie.index.bucket.engine' = 'SIMPLE',
   'hoodie.bucket.index.num.buckets'='16',
   'hoodie.bucket.index.hash.field'='a',
   'hoodie.parquet.small.file.limit'='104857600',
   'hoodie.parquet.compression.codec'='snappy')
   ``` 
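   
   For context, a minimal sketch of how these properties could sit in a full Flink SQL DDL (the non-key columns are hypothetical; `a`, `b`, and the partition columns follow the config above):
   
   ```
   CREATE TABLE hudi_two_level (
     a STRING,
     b STRING,
     client_id STRING,
     hashed_server_id STRING,
     ts TIMESTAMP(3)
   )
   PARTITIONED BY (`client_id`, `hashed_server_id`)
   WITH (
     'connector' = 'hudi',
     'path' = '${table_location}',
     'table.type' = 'COPY_ON_WRITE',
     'hoodie.datasource.write.recordkey.field' = 'a,b',
     'precombine.field' = 'ts',
     'hoodie.index.type' = 'BUCKET',
     'hoodie.bucket.index.num.buckets' = '16',
     'hoodie.bucket.index.hash.field' = 'a'
     -- plus the payload and cleaning options listed above
   );
   ```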
   
   **Expected behavior**
   As it is just an additional partition level in storage, it should not impact the performance much (I can understand it taking 5-7 minutes extra, as complex key generation is a bit slower than simple key generation).
   
   **Environment Description**
   * Flink version : 1.17.1
   
   * Hudi version : 0.14
   
   * Spark version : NA
   
   * Hive version : NA
   
   * Hadoop version : 3.4.0
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) :Yes
   
   
   **Additional context**
   
   My table type is upsert; I have tested the functionality, it is working fine, and I cannot change the table type.
   
   I also discussed this with @ad1happy2go, and he also suggested that it won't impact much, as it is just another level of partitioning.
   
   CC : @ad1happy2go @codope @danny0405 @yo
   

