Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]
maheshguptags commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1955928240 @danny0405 Still waiting for your response. Can you please take a look at this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
maheshguptags commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1919229376 Hi @ad1happy2go, a small correction on the commit file size: > which ultimately causing OOM due to 400MB commit files. It is a 41 MB commit file, @ad1happy2go.
ad1happy2go commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1918948729 @xicm @danny0405 Had a discussion with @maheshguptags. Let me try to summarise his issue. He has around 5000 partitions in total and uses the bucket index. With a write.tasks parallelism of 20 the job takes 1 hour 45 minutes, and with 100 it takes 35 minutes. But as parallelism increases, the number of file groups explodes, as expected. This results in a lot of small file groups with very few records each (~20), which ultimately causes OOM due to 400MB commit files.
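The small-file-group math in the summary above can be sketched numerically. This is an illustrative back-of-the-envelope calculation, not Hudi code; the micro-batch size of 40k records is an assumed number chosen to match the observed ~20 records per group:

```python
# Sketch of the small-file-group effect described above: when a commit
# touches thousands of file groups, a modest micro-batch leaves only a
# handful of records in each group.

def avg_records_per_file_group(records: int, file_groups: int) -> float:
    """Average records landing in each file group for one commit."""
    return records / file_groups

# ~2000 file groups per commit were observed in this issue; with an
# illustrative micro-batch of ~40k records, each group gets only ~20.
print(avg_records_per_file_group(40_000, 2_000))  # 20.0
```

The takeaway is that the per-commit overhead (metadata, buffers, file handles) scales with the number of file groups, not with the number of records, which is why the two-level layout hurts so much at the same data volume.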
ad1happy2go commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1918825052 @maheshguptags Let's get on a call to discuss this further.
maheshguptags commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-195386 @ad1happy2go I tried the below configuration for Kafka, but it didn't help:
```
source.kafka.max.poll.records=300
source.kafka.max.poll.interval.ms=30
```
I tried several different values for these settings. cc: @danny0405
ad1happy2go commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1898160927 Discussed with @maheshguptags. Advised him to explore the Flink Kafka connector configs to control the number of records/bytes in one micro-batch: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/kafka/ cc @danny0405
maheshguptags commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1894959705 @danny0405 Can you please share the config to reduce the number of file groups per commit?
danny0405 commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1894838453 Yeah, try to reduce the number of file groups per commit, because for each file group we keep an in-memory buffer before flushing to disk.
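Danny's point above, that every file group written in a commit carries its own in-memory buffer, can be sketched as simple arithmetic. This is an illustration, not Hudi internals; the 8 MB per-group buffer is an assumed figure for the example (actual buffer sizing in the Flink writer is governed by configs such as `write.task.max.size` and `write.batch.size`):

```python
# Illustrative sketch: why thousands of file groups per commit pressure
# write-task memory. Each file group written in a commit holds an
# in-memory buffer until flush; buffer size below is an assumption.

def writer_memory_mb(file_groups_per_task: int, buffer_per_group_mb: float) -> float:
    """Rough lower bound on one write task's memory from per-group buffers."""
    return file_groups_per_task * buffer_per_group_mb

# 2000 file groups spread over 20 write tasks -> 100 groups per task.
groups_per_task = 2000 // 20
# Even a modest 8 MB buffer per group adds up fast:
print(writer_memory_mb(groups_per_task, 8))  # 800.0 MB per task
```

Halving the file-group count (fewer buckets, or fewer partitions touched per commit) halves this lower bound, which is why reducing file groups per commit relieves the OOM.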
maheshguptags commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1893141086 @xicm The dataset is huge, around 100M records. However, for performance evaluation, I have only ingested 1.78M.
xicm commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1893120259 Can you redesign the partitions? There are only 100M records, but there are so many partitions.
maheshguptags commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1893110358 Hi @xicm and @danny0405, I tried increasing the parallelism as @xicm suggested, but it accumulates the data into a single commit, which causes a heap OOM issue. (screenshot: https://github.com/apache/hudi/assets/115445723/0b02127c-e14c-47b4-9033-db96d5e45a51) **Commit size from .hoodie folder**: the second commit is trying to consume the entire data in one commit, i.e. it is creating a 41 MB .commit file. (screenshot: https://github.com/apache/hudi/assets/115445723/ea1fad68-7a9d-4eaf-970c-63ec9adbe479) Can we reduce/control the commit file size?
maheshguptags commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-103564 @danny0405 I am asking about the number of file groups added for a particular commit. I am already using the bucket index. The number of file groups is more than 2000 for a commit.
danny0405 commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1888785053 `hoodie.bucket.index.num.buckets` controls the number of buckets under one partition, and by default it is 4 in Flink.
maheshguptags commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1886251792 @xicm Let me try increasing the number of write tasks and test the performance under load. Thanks.
xicm commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1886083636 Sorry for my wrong understanding of `SubTasks`. Hudi splits the input data by partition+fileGroup and then writes the partitioned data with the parallelism of `write.tasks`. The job writes 2000+ files in a commit; a parallelism of 20 is too small.
maheshguptags commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1884581679 Yes, it is 20. It starts from 0 and ends with 19. (screenshot: https://github.com/apache/hudi/assets/115445723/e3e744c3-bdd1-4bc7-bb11-90375ed1c554)
maheshguptags commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1884496142 I already had 20 tasks writing the data, please check the screenshot below. Do you want me to increase it further? (screenshot: https://github.com/apache/hudi/assets/115445723/6c837122-b197-4f72-b9ae-507735b87b7a)
xicm commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1884473175 A small bucket number will not fit the growing data. Generally we estimate the data size to determine the number of buckets. The simple bucket number can't be changed, but we can increase `write.tasks`. I think your problem is that the data is too scattered.
maheshguptags commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1884430702 Hi @xicm, I tried the below combinations with the same number of records. (screenshot: https://github.com/apache/hudi/assets/115445723/7398fe1a-914a-44b2-94f5-e7b9fdc9a7c9) Please find the details on file groups below. (screenshot: https://github.com/apache/hudi/assets/115445723/722a93e3-f588-4694-af7f-bf7b2752bd2e) After testing several times I noticed that 8 or 4 buckets look good for a data size under 100M records. As we know, once the number of buckets is set it cannot be changed, so I have a related question: suppose I pick 8 buckets and the streaming data is constantly growing (100 million per ID), will it affect performance (considering that the job is streaming)? Thanks, Mahesh Gupta
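Since the simple bucket number is fixed at table creation, it is usually derived from the expected per-partition data size rather than tuned experimentally. A minimal sketch of that rule of thumb, assuming a target base-file size of around 256 MB (the target value and the helper name are illustrative, not a Hudi API):

```python
import math

# Rule-of-thumb sketch for sizing a fixed bucket count: pick enough
# buckets that each bucket's base file lands near a target size.
# The 256 MB target is an assumption for illustration.

def suggest_buckets(partition_size_mb: float, target_file_mb: float = 256) -> int:
    """Buckets per partition so each bucket file is near the target size."""
    return max(1, math.ceil(partition_size_mb / target_file_mb))

# A partition expected to grow to ~2 GB:
print(suggest_buckets(2048))  # 8
# A small partition of ~100 MB still needs at least one bucket:
print(suggest_buckets(100))   # 1
```

The sizing should use the *projected* partition size, not the current one, precisely because the bucket count cannot be changed later: undersized buckets produce oversized files as data grows, while oversized bucket counts produce the many tiny file groups seen in this issue.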
maheshguptags commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1884209599 Yes, I am testing different combinations of bucket numbers.
xicm commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1884052740 > can you tell me how to check number of filegroup? Use the Hudi CLI or Spark SQL: run show_commits and pay attention to `total_files_added` and `total_files_updated`. > it is still taking 45-50 min to execute which 5 times as compare to 1 level partition. Reduce the bucket number and increase `write.tasks`; test a few times to find a better-performing combination.
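If the CLI or Spark SQL is not at hand, the same count can be pulled from the commit metadata files under the table's `.hoodie` folder. A hedged sketch follows: the JSON field names (`partitionToWriteStats`, `fileId`) follow Hudi's commit-metadata layout, but verify them against your Hudi version before relying on this:

```python
import json

# Hedged sketch: count distinct file groups touched by one commit by
# parsing its .commit metadata JSON from the .hoodie folder. Field names
# are assumed from Hudi's commit metadata layout; verify per version.

def file_groups_in_commit(commit_json: str) -> int:
    meta = json.loads(commit_json)
    file_ids = {
        stat["fileId"]
        for stats in meta.get("partitionToWriteStats", {}).values()
        for stat in stats
    }
    return len(file_ids)

# Tiny synthetic example: two partitions, three file groups.
sample = json.dumps({
    "partitionToWriteStats": {
        "client_a/server_1": [{"fileId": "fg-1"}, {"fileId": "fg-2"}],
        "client_a/server_2": [{"fileId": "fg-3"}],
    }
})
print(file_groups_in_commit(sample))  # 3
```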
maheshguptags commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1882910466 @xicm I reduced the number of buckets (it makes sense to reduce the bucket count since we have a second-level partition), but it is still taking 45-50 minutes to execute, which is 5 times as long as the one-level partition.
maheshguptags commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1882789373 @xicm Let me reduce the number of buckets and test with the same number of records to check the processing time. Can you tell me how to check the number of file groups?
xicm commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1882747205 Not sure if this is the cause. Can you check the number of file groups after the partition field changed, and reduce the bucket number to see the time cost?
maheshguptags commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1882727956 @xicm Yes, I agree, but that should not make it 10 times slower. Say I have 100 partitions and each partition has 10 sub-partitions with 16 buckets; then the total task count would be 100*10*16 at most, whereas with a single partition level it is 100*16, right? I understand it might take 5-7 minutes extra compared to a single partition level, but not 10 times. Let me know your thoughts.
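The multiplication in the comment above can be made explicit. A minimal sketch, using the numbers from the discussion (100 partitions, 10 sub-partitions, 16 buckets; the helper function is illustrative, not a Hudi API):

```python
# Sketch of the file-group ceiling from the discussion above: with a
# bucket index, file groups scale with partitions x buckets, so a second
# partition level multiplies the ceiling by the sub-partition fanout.

def max_file_groups(partitions: int, buckets: int, sub_partitions: int = 1) -> int:
    """Upper bound on file groups for a bucket-indexed table."""
    return partitions * sub_partitions * buckets

single_level = max_file_groups(100, 16)        # 1600
two_level = max_file_groups(100, 16, 10)       # 16000
print(single_level, two_level, two_level // single_level)  # 1600 16000 10
```

Note the upper bound itself grows exactly 10x with the extra level, so if most sub-partitions actually receive data, a 10x slowdown is less surprising than the comment assumes.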
xicm commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1882703763 Adding a partition field means more tasks. And since the index is BUCKET, the task number could be bucket_num*partitions.
[I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]
maheshguptags opened a new issue, #10456: URL: https://github.com/apache/hudi/issues/10456 I am trying to add a second level of partitioning to my table instead of one level, but it is taking 10x the time compared to a single-level partition in a Hudi Flink job. I ingested 1.8M records into a one-level partitioned table and it took around 12-15 minutes to ingest all the data; then, with the same configuration and the same data payload, I just added another partition key level and it took around 1 hour 45 minutes to complete. **To Reproduce** Steps to reproduce the behavior: below is the configuration that I am using for the table. You can add the table creation statement with the properties below.
```
PARTITIONED BY (`client_id`,`hashed_server_id`) WITH (
  'connector' = 'hudi',
  'path' = '${table_location}',
  'table.type' = 'COPY_ON_WRITE',
  'hoodie.datasource.write.recordkey.field' = 'a,b',
  'payload.class' = 'x.y.PartialUpdate',
  'precombine.field' = 'ts',
  'hoodie.clean.async' = 'true',
  'hoodie.cleaner.policy' = 'KEEP_LATEST_COMMITS',
  'hoodie.clean.automatic' = 'true',
  'hoodie.clean.max.commits' = '5',
  'hoodie.clean.trigger.strategy' = 'NUM_COMMITS',
  'hoodie.cleaner.parallelism' = '100',
  'hoodie.cleaner.commits.retained' = '4',
  'hoodie.index.type' = 'BUCKET',
  'hoodie.index.bucket.engine' = 'SIMPLE',
  'hoodie.bucket.index.num.buckets' = '16',
  'hoodie.bucket.index.hash.field' = 'a',
  'hoodie.parquet.small.file.limit' = '104857600',
  'hoodie.parquet.compression.codec' = 'snappy'
)
```
**Expected behavior** Since it is just an additional partition level in storage, it should not impact performance much (I can understand it taking 5-7 minutes extra, as complex key generation is a bit slower than simple key generation). **Environment Description** * Flink 1.17.1 * Hudi version : 14 * Spark version : NA * Hive version : NA * Hadoop version : 3.4.0 * Storage (HDFS/S3/GCS..) : S3 * Running on Docker? (yes/no) : Yes **Additional context** My table type is upsert. I have tested the functionality and it is working fine, and I cannot change the table type. I also discussed with @ad1happy2go and he suggested that it wouldn't impact much, as it is just another level of partitioning. CC : @ad1happy2go @codope @danny0405 @yo