[GitHub] [hudi] rnatarajan commented on issue #2083: Kafka readStream performance slow [SUPPORT]

2020-09-26 Thread GitBox


rnatarajan commented on issue #2083:
URL: https://github.com/apache/hudi/issues/2083#issuecomment-699508859


   Sorry I did not update the ticket in the past week.
   
   1. Yes that is correct.
   2. I tried with both Spark Structured Streaming and DStream. But since our 
source is Debezium/Kafka, we had to use foreachRDD to convert a few fields (days 
since epoch to date, unix time to timestamp) before writing with 
df.write.format("hudi").mode(SaveMode.Append).save("") (a sketch follows after 
this list).
   3. I am attaching screenshots that show that count takes most of the time in 
the case of bulk_insert, and countByKey in the case of insert.
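   For reference, a minimal sketch of that foreachRDD flow (the names `stream`, 
`hudiOptions`, `basePath`, and the column names are illustrative placeholders, 
not our actual identifiers; the target path is elided above):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._

// `stream` is the Kafka direct stream of Debezium JSON payloads (placeholder).
stream.foreachRDD { rdd =>
  val spark = SparkSession.builder.getOrCreate()
  import spark.implicits._

  val df = spark.read.json(spark.createDataset(rdd.map(_.value())))
    // days since epoch (int) -> DateType
    .withColumn("event_date", to_date(from_unixtime(col("days_since_epoch") * 86400L)))
    // unix time in milliseconds (long) -> TimestampType
    .withColumn("event_ts", (col("unix_time") / 1000L).cast("timestamp"))

  df.write.format("hudi")
    .options(hudiOptions)   // writer config as in the comments below (placeholder)
    .mode(SaveMode.Append)
    .save(basePath)         // placeholder; the real path is elided above
}
```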
   
   Structured Streaming had a higher throughput, but with triggers in Spark I 
cannot post granular details, so I am using DStream to illustrate the throughput issue.
   
   Regarding operation mode insert, each DStream batch was about 434000 
records. 
   For the first batch, processing took 1.3 minutes, but then the processing 
time drops to about 37s. 
   Attached are the details that narrow the bottleneck down to countByKey.
   The pictures show that the first batch took 53s in countByKey, whereas for 
subsequent batches it drops to about 28s. 
   Time taken in countByKey remains around 28s for each batch of 434000 records.
   In this case DStream achieves a peak throughput of about 12K rows per second 
(434000 records / ~37s).
   
   
![insert_time_taken_by_each_batch](https://user-images.githubusercontent.com/2908985/94343943-77272500-ffe1-11ea-8c46-65c7b39959d3.png)
   
![insert_for_each_rdd](https://user-images.githubusercontent.com/2908985/94343947-7bebd900-ffe1-11ea-8b6b-a802873d7d03.png)
   
![insert_dag_details](https://user-images.githubusercontent.com/2908985/94343951-7f7f6000-ffe1-11ea-8f76-7caedb49ff5c.png)
   
![insert_first_batch_bottleneck](https://user-images.githubusercontent.com/2908985/94343955-81e1ba00-ffe1-11ea-9fd7-1fe0c5c0c85a.png)
   
![insert_expand_bottleneck_in_first_batch](https://user-images.githubusercontent.com/2908985/94343959-860dd780-ffe1-11ea-8d63-80836cb92ff7.png)
   
![insert_subsequent_batch_bottleneck](https://user-images.githubusercontent.com/2908985/94343961-8ad28b80-ffe1-11ea-8890-0b11720fb460.png)
   
   
   
   Regarding operation mode bulk_insert, each DStream batch was about 434000 
records.
   For the first batch, processing took 1.2 minutes, but then the processing 
time drops to about 34s. 
   Attached are the details that narrow the bottleneck down to count.
   The pictures show that the first batch took 56s in count, whereas for 
subsequent batches it drops to about 32s. 
   Time taken in count remains around 32s for each batch of 434000 records.
   In this case DStream achieves a peak throughput of about 13K rows per second 
(434000 records / ~34s).
   
   
   
![bulk_insert_time_taken_by_each_batch](https://user-images.githubusercontent.com/2908985/94343967-90c86c80-ffe1-11ea-9b75-a19b170f3a74.png)
   
![bulk_insert_for_each_rdd](https://user-images.githubusercontent.com/2908985/94343968-93c35d00-ffe1-11ea-8d72-47a868f87195.png)
   
![bulk_insert_dag_details](https://user-images.githubusercontent.com/2908985/94343971-96be4d80-ffe1-11ea-9f49-370d38619452.png)
   
![bulk_insert_first_batch_bottleneck](https://user-images.githubusercontent.com/2908985/94343972-9920a780-ffe1-11ea-9ae1-506f9c41b4ae.png)
   
![bulk_insert_expand_bottleneck_in_first_batch](https://user-images.githubusercontent.com/2908985/94343974-9c1b9800-ffe1-11ea-8b2a-5d317bb4495d.png)
   
![bulk_insert_subsequent_batch_bottleneck](https://user-images.githubusercontent.com/2908985/94343976-9f168880-ffe1-11ea-9661-275e2425aced.png)
   
   
   







[GitHub] [hudi] rnatarajan commented on issue #2083: Kafka readStream performance slow [SUPPORT]

2020-09-16 Thread GitBox


rnatarajan commented on issue #2083:
URL: https://github.com/apache/hudi/issues/2083#issuecomment-693531236


   @n3nash 
   
   Environment - AWS.
   Master Nodes - 1 m5.xlarge - 4 vCore, 16 GiB memory
   Core Nodes - 6 c5.xlarge - 4 vCore, 8 GiB memory
   Spark Submit config - --driver-memory 4G --executor-memory 5G 
--executor-cores 4 --num-executors 6 
   Hudi Config - 
   hoodie.combine.before.upsert=false
   hoodie.bulkinsert.shuffle.parallelism=10
   hoodie.insert.shuffle.parallelism=10
   hoodie.upsert.shuffle.parallelism=10
   hoodie.delete.shuffle.parallelism=1
   hoodie.datasource.write.operation=bulk_insert
   hoodie.bulkinsert.sort.mode=NONE
   hoodie.datasource.write.table.type=MERGE_ON_READ
   hoodie.datasource.write.partitionpath.field=""
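   For completeness, a minimal sketch of how this config is passed on the write 
path (the table name, the DataFrame `df`, and the base path are placeholders):

```scala
import org.apache.spark.sql.SaveMode

df.write.format("hudi")
  .option("hoodie.table.name", "events")                         // placeholder
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.bulkinsert.sort.mode", "NONE")
  .option("hoodie.bulkinsert.shuffle.parallelism", "10")
  .option("hoodie.insert.shuffle.parallelism", "10")
  .option("hoodie.upsert.shuffle.parallelism", "10")
  .option("hoodie.delete.shuffle.parallelism", "1")
  .option("hoodie.combine.before.upsert", "false")
  .option("hoodie.datasource.write.partitionpath.field", "")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/events")                                      // placeholder
```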
   
   
   The events ingested are not time-based events. 
   
   Events have a unique id of type long, and we use that id as 
hoodie.datasource.write.recordkey.field.
   Events have a date field, and that date field is used as 
hoodie.datasource.write.precombine.field.
   
   Events have 40 columns of types long, int, date, timestamp, and string.
   
   For ingesting, we attempted both bulk_insert and insert; a minimal sketch of 
the record key configuration follows.
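   As a sketch, the two fields above map to writer options like these (the 
column names `id` and `event_date` are placeholders for our actual fields):

```scala
import org.apache.spark.sql.SaveMode

df.write.format("hudi")
  .option("hoodie.datasource.write.recordkey.field", "id")          // unique long id
  .option("hoodie.datasource.write.precombine.field", "event_date") // date field
  .mode(SaveMode.Append)
  .save("/tmp/hudi/events")                                         // placeholder
```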
   
   







[GitHub] [hudi] rnatarajan commented on issue #2083: Kafka readStream performance slow [SUPPORT]

2020-09-15 Thread GitBox


rnatarajan commented on issue #2083:
URL: https://github.com/apache/hudi/issues/2083#issuecomment-692486055


   Update on this: 
   
   We found the bottleneck to be 
[countByKey](https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java#L73).
   
   We were reading data from Kafka (spread across 20 partitions).
   We tested with hoodie.datasource.write.partitionpath.field as "" or 
"".
   
   In both cases, the records read from Kafka across all partitions (for a 
batch) were shuffled while performing countByKey.
   This caused a major throughput drop; a standalone illustration follows.
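   For illustration, a minimal standalone reproduction of that shuffle 
(synthetic data, not our actual job): countByKey is a reduceByKey-style shuffle 
whose per-key counts are then collected to the driver.

```scala
import org.apache.spark.sql.SparkSession

object CountByKeyShuffle {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("countByKey-shuffle").getOrCreate()
    val sc = spark.sparkContext

    // Synthetic stand-in for one micro-batch: (partitionPath, record) pairs
    // spread across 20 Kafka-like partitions. With an empty
    // partitionpath.field every record shares the same key, so countByKey
    // funnels the whole batch through a single shuffle.
    val batch = sc.parallelize(1 to 434000, numSlices = 20).map(i => ("", i))

    // WorkloadProfile uses countByKey to size the workload per partition path;
    // this is the stage that dominated our batch times.
    println(batch.countByKey())

    spark.stop()
  }
}
```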
   
   







[GitHub] [hudi] rnatarajan commented on issue #2083: Kafka readStream performance slow [SUPPORT]

2020-09-14 Thread GitBox


rnatarajan commented on issue #2083:
URL: https://github.com/apache/hudi/issues/2083#issuecomment-692170318


   Update on what @rafaelhbarros has mentioned.
   
   With Hudi 0.6.0, we identified a bottleneck in the sort step and turned the 
feature off (hoodie.bulkinsert.sort.mode=NONE).
   Matching the shuffle parallelism with the number of cores * executors 
available gives the optimal speed; see the sketch below.
   If cores * executors = 10 and the parallelism is 20, then 10 task slots 
cannot provide real parallelism of 20, and the time taken to process each batch 
increases.
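   As a rough sketch of that sizing heuristic (our rule of thumb from tuning, 
not a Hudi-documented rule):

```scala
// Size Hudi's shuffle parallelism to the task slots actually available,
// i.e. executor-cores * num-executors. With parallelism above the slot
// count (e.g. 20 tasks on 10 slots), tasks run in extra waves and the
// per-batch processing time goes up.
val taskSlots = 4 * 6 // --executor-cores 4 * --num-executors 6 (our submit config)
val parallelismConf = Map(
  "hoodie.bulkinsert.shuffle.parallelism" -> taskSlots.toString,
  "hoodie.insert.shuffle.parallelism"     -> taskSlots.toString,
  "hoodie.upsert.shuffle.parallelism"     -> taskSlots.toString
)
```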
   
   With Hudi MoR and bulk_insert without sort, using the parameters that 
@rafaelhbarros posted, we were able to achieve about 20K rows per second.
   
   With Hudi CoW and insert mode without sort, we were able to achieve about 
15K rows per second.
   
   We are aiming to achieve about 20K rows per second with similar hardware 
(--driver-memory 4G --executor-memory 5G --executor-cores 4 --num-executors 6).
   
   


