[GitHub] [hudi] rnatarajan commented on issue #2083: Kafka readStream performance slow [SUPPORT]
rnatarajan commented on issue #2083: URL: https://github.com/apache/hudi/issues/2083#issuecomment-699508859

Sorry I did not update the ticket in the past week.

1. Yes, that is correct.
2. I tried both Spark Structured Streaming and DStream. But since our source is Debezium/Kafka, we had to use foreachRDD to convert a few fields (days since epoch to date, Unix time to timestamp), and then we write with `df.write.format("hudi").mode(SaveMode.Append).save("")`.
3. I am attaching screenshots showing that Count takes most of the time in the bulk_insert case and countByKey in the insert case.

Structured Streaming had higher throughput, but with triggers in Spark I cannot post granular details, so I am using DStream to illustrate the throughput issue.

With operation mode insert, each DStream batch was about 434,000 records. The first batch took 1.3 min to process, but then the processing time drops to about 37 s. The attached details narrow the bottleneck down to countByKey: the first batch spent 53 s in countByKey, whereas subsequent batches drop to about 28 s, and the time spent in countByKey stays around 28 s for each batch of 434,000 records. In this case DStream achieves a peak throughput of about 12K rows per second.
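As a quick sanity check, the insert-mode figures quoted above imply the cited peak. A minimal sketch; the record count and per-batch times are taken from the comment, nothing here is measured anew:

```python
# Figures quoted in the comment for operation mode insert.
records_per_batch = 434_000
steady_batch_seconds = 37        # steady-state processing time per batch
steady_countbykey_seconds = 28   # time spent in countByKey per batch

# Implied steady-state throughput.
throughput = records_per_batch / steady_batch_seconds
print(f"~{throughput / 1000:.1f}K rows/s")  # close to the 12K rows/s peak cited

# countByKey alone accounts for most of each batch's processing time.
countbykey_share = steady_countbykey_seconds / steady_batch_seconds
print(f"countByKey share of batch time: {countbykey_share:.0%}")
```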
![insert_time_taken_by_each_batch](https://user-images.githubusercontent.com/2908985/94343943-77272500-ffe1-11ea-8c46-65c7b39959d3.png)
![insert_for_each_rdd](https://user-images.githubusercontent.com/2908985/94343947-7bebd900-ffe1-11ea-8b6b-a802873d7d03.png)
![insert_dag_details](https://user-images.githubusercontent.com/2908985/94343951-7f7f6000-ffe1-11ea-8f76-7caedb49ff5c.png)
![insert_first_batch_bottleneck](https://user-images.githubusercontent.com/2908985/94343955-81e1ba00-ffe1-11ea-9fd7-1fe0c5c0c85a.png)
![insert_expand_bottleneck_in_first_batch](https://user-images.githubusercontent.com/2908985/94343959-860dd780-ffe1-11ea-8d63-80836cb92ff7.png)
![insert_subsequent_batch_bottleneck](https://user-images.githubusercontent.com/2908985/94343961-8ad28b80-ffe1-11ea-8890-0b11720fb460.png)

With operation mode bulk_insert, each DStream batch was about 434,000 records. The first batch took 1.2 min to process, but then the processing time drops to about 34 s. The attached details narrow the bottleneck down to Count: the first batch spent 56 s in Count, whereas subsequent batches drop to about 32 s, and the time spent in Count stays around 32 s for each batch of 434,000 records. In this case DStream achieves a peak throughput of about rows per second.
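The bulk_insert peak-throughput figure is missing from the sentence above; a rough derivation from the stated steady-state batch time (an approximation from the quoted numbers, not the author's measurement):

```python
# Figures quoted in the comment for operation mode bulk_insert.
records_per_batch = 434_000
steady_batch_seconds = 34   # steady-state processing time per batch
steady_count_seconds = 32   # time spent in the Count stage per batch

# Implied steady-state throughput, and how much of each batch Count consumes.
implied_throughput = records_per_batch / steady_batch_seconds
print(f"~{implied_throughput / 1000:.1f}K rows/s")
print(f"Count share of batch time: {steady_count_seconds / steady_batch_seconds:.0%}")
```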
![bulk_insert_time_taken_by_each_batch](https://user-images.githubusercontent.com/2908985/94343967-90c86c80-ffe1-11ea-9b75-a19b170f3a74.png)
![bulk_insert_for_each_rdd](https://user-images.githubusercontent.com/2908985/94343968-93c35d00-ffe1-11ea-8d72-47a868f87195.png)
![bulk_insert_dag_details](https://user-images.githubusercontent.com/2908985/94343971-96be4d80-ffe1-11ea-9f49-370d38619452.png)
![bulk_insert_first_batch_bottleneck](https://user-images.githubusercontent.com/2908985/94343972-9920a780-ffe1-11ea-9ae1-506f9c41b4ae.png)
![bulk_insert_expand_bottleneck_in_first_batch](https://user-images.githubusercontent.com/2908985/94343974-9c1b9800-ffe1-11ea-8b2a-5d317bb4495d.png)
![bulk_insert_subsequent_batch_bottleneck](https://user-images.githubusercontent.com/2908985/94343976-9f168880-ffe1-11ea-9661-275e2425aced.png)

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
rnatarajan commented on issue #2083: URL: https://github.com/apache/hudi/issues/2083#issuecomment-693531236

@n3nash Environment - AWS.
Master nodes - 1 m5.xlarge - 4 vCore, 16 GiB memory
Core nodes - 6 c5.xlarge - 4 vCore, 8 GiB memory
Spark submit config - --driver-memory 4G --executor-memory 5G --executor-cores 4 --num-executors 6
Hudi config -
hoodie.combine.before.upsert=false
hoodie.bulkinsert.shuffle.parallelism=10
hoodie.insert.shuffle.parallelism=10
hoodie.upsert.shuffle.parallelism=10
hoodie.delete.shuffle.parallelism=1
hoodie.datasource.write.operation=bulk_insert
hoodie.bulkinsert.sort.mode=NONE
hoodie.datasource.write.table.type=MERGE_ON_READ
hoodie.datasource.write.partitionpath.field=""

The events ingested are not time-based. Each event has a unique id of type long, and we use that id as hoodie.datasource.write.recordkey.field. Events have a date field, which we use as hoodie.datasource.write.precombine.field. Events have 40 columns of types long, int, date, timestamp, and string. For ingesting, we attempted both bulk_insert and insert.
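One detail worth noting from the spark-submit settings above: the Hudi shuffle parallelism of 10 is below the cluster's implied task-slot count. A quick arithmetic check (figures taken from the config above):

```python
# Capacity implied by the spark-submit config: --executor-cores 4 --num-executors 6.
executor_cores = 4
num_executors = 6
task_slots = executor_cores * num_executors
print(task_slots)  # 24 concurrent task slots

# The Hudi shuffle parallelism settings above are all 10, i.e. fewer than
# the available slots, so shuffle stages leave part of the cluster idle.
shuffle_parallelism = 10
print(shuffle_parallelism < task_slots)  # True
```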
rnatarajan commented on issue #2083: URL: https://github.com/apache/hudi/issues/2083#issuecomment-692486055

Update on this: found the bottleneck to be [countByKey](https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java#L73). We were reading data from Kafka (spread across 20 partitions). We tested with hoodie.datasource.write.partitionpath.field as "" or "". In both cases, the records read from Kafka across all partitions (for a batch) were shuffled to perform countByKey. This caused a major throughput drop.
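For illustration, a plain-Python sketch of the tally countByKey performs. In Spark, records sharing a key must be brought to the same task, i.e. a shuffle of the whole batch; keying per partition path here is an assumption based on the linked WorkloadProfile code:

```python
from collections import Counter

# Sketch: tally records per key, as countByKey does. With an empty
# partitionpath field, every record maps to the same key, so the entire
# batch funnels into a single group.
records = [("", f"record-{i}") for i in range(5)]  # (partition_path, payload)
counts = Counter(key for key, _ in records)
print(dict(counts))  # {'': 5}
```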
rnatarajan commented on issue #2083: URL: https://github.com/apache/hudi/issues/2083#issuecomment-692170318

Update on what @rafaelhbarros has mentioned. With Hudi 0.6.0, we identified a bottleneck in the sort step and turned the feature off (hoodie.bulkinsert.sort.mode=NONE). Matching parallelism with the number of cores × executors available gives the optimal speed: if cores × executors = 10 and parallelism is 20, then 10 cores cannot provide real parallelism of 20, and the time taken to process the records grows. With Hudi MoR and bulk_insert without sort, using the parameters @rafaelhbarros posted, we were able to achieve about 20K rows per second. With Hudi CoW and insert mode without sort, we achieved about 15K rows per second. We are aiming for about 20K rows per second with similar hardware (--driver-memory 4G --executor-memory 5G --executor-cores 4 --num-executors 6).
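The parallelism-vs-cores argument above can be sketched as simple wave arithmetic (a scheduling model, not a Spark measurement):

```python
import math

# With S task slots, a stage of P tasks runs in ceil(P / S) "waves";
# P greater than S roughly multiplies the stage time by the wave count.
def waves(parallelism: int, slots: int) -> int:
    return math.ceil(parallelism / slots)

print(waves(20, 10))  # 2 waves: 10 slots need two rounds to run 20 tasks
print(waves(10, 10))  # 1 wave: parallelism matched to the slot count
```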