rnatarajan commented on issue #2083:
URL: https://github.com/apache/hudi/issues/2083#issuecomment-693531236


   @n3nash 
   
   Environment - AWS.
   Master Nodes - 1 m5.xlarge - 4 vCore, 16 GiB memory
   Core Nodes - 6 c5.xlarge - 4 vCore, 8 GiB memory
   Spark Submit config - --driver-memory 4G --executor-memory 5G --executor-cores 4 --num-executors 6
   Hudi Config - 
   hoodie.combine.before.upsert=false
   hoodie.bulkinsert.shuffle.parallelism=10
   hoodie.insert.shuffle.parallelism=10
   hoodie.upsert.shuffle.parallelism=10
   hoodie.delete.shuffle.parallelism=1
   hoodie.datasource.write.operation=bulk_insert
   hoodie.bulkinsert.sort.mode=NONE
   hoodie.datasource.write.table.type=MERGE_ON_READ
   hoodie.datasource.write.partitionpath.field=""
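
   For context, this is roughly how the options above are passed through the Spark datasource writer. A minimal sketch only: the SparkSession setup, input path, table name and target path are placeholders, not the actual values from our job.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Sketch only: app name, input path, table name and target path are placeholders.
val spark = SparkSession.builder().appName("hudi-ingest-sketch").getOrCreate()
val eventsDf = spark.read.parquet("s3://<bucket>/raw/events/")

eventsDf.write
  .format("hudi")
  .option("hoodie.table.name", "events_table")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.bulkinsert.sort.mode", "NONE")
  .option("hoodie.datasource.write.partitionpath.field", "")
  .option("hoodie.combine.before.upsert", "false")
  .option("hoodie.bulkinsert.shuffle.parallelism", "10")
  .option("hoodie.insert.shuffle.parallelism", "10")
  .option("hoodie.upsert.shuffle.parallelism", "10")
  .option("hoodie.delete.shuffle.parallelism", "1")
  // (the recordkey and precombine options described below are added the same way)
  .mode(SaveMode.Append)
  .save("s3://<bucket>/events_table")
```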
   
   
   The events ingested are not time-based events.
   
   Events have a unique id of type long, and we use this unique id as hoodie.datasource.write.recordkey.field.
   Events have a date field, and this date field is used as hoodie.datasource.write.precombine.field.
   
   Events have 40 columns of types - long, int, date, timestamp, string.
   
   For ingestion, we attempted both bulk_insert and insert.
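
   Concretely, the key-related options looked roughly like the sketch below; `event_id` (the unique long id) and `event_date` are placeholders for our actual column names, and only the write operation changed between the two attempts.

```scala
// Placeholder column names: `event_id` is the unique long id,
// `event_date` is the date field used for precombine.
val keyOpts = Map(
  "hoodie.datasource.write.recordkey.field"  -> "event_id",
  "hoodie.datasource.write.precombine.field" -> "event_date"
)

// The two ingestion attempts differed only in the write operation:
val bulkInsertOpts = keyOpts + ("hoodie.datasource.write.operation" -> "bulk_insert")
val insertOpts     = keyOpts + ("hoodie.datasource.write.operation" -> "insert")
// Either map is passed to the writer via .options(...)
```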
   
   

