Rap70r edited a comment on issue #3697:
URL: https://github.com/apache/hudi/issues/3697#issuecomment-933893616


   Hi @xushiyan,
   
   Here is an update from our latest tests. I switched to the d3.xlarge 
instance type and used the following configs:
   `spark-submit --deploy-mode cluster --conf spark.scheduler.mode=FAIR --conf 
spark.shuffle.service.enabled=true --conf 
spark.sql.hive.convertMetastoreParquet=false --conf 
spark.driver.maxResultSize=6g --conf spark.driver.memory=17g --conf 
spark.executor.cores=2 --conf 
spark.hadoop.parquet.enable.summary-metadata=false --conf 
spark.driver.memoryOverhead=6g --conf spark.network.timeout=600s --conf 
spark.executor.instances=50 --conf spark.executor.memoryOverhead=4g --conf 
spark.driver.cores=2 --conf spark.executor.memory=8g --conf 
spark.memory.storageFraction=0.1 --conf spark.executor.heartbeatInterval=120s 
--conf spark.memory.fraction=0.4 --conf spark.rdd.compress=true --conf 
spark.kryoserializer.buffer.max=200m --conf 
spark.serializer=org.apache.spark.serializer.KryoSerializer --conf 
spark.sql.shuffle.partitions=200 --conf spark.default.parallelism=200 --conf 
spark.task.cpus=2`
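   
   As a rough sanity check on what these settings imply, here is a small sketch of the arithmetic. It assumes static allocation (all 50 executors are actually granted) and uses the standard rule that each task reserves `spark.task.cpus` cores; the variable names are only for illustration.

```python
# Values copied from the spark-submit command above.
executor_instances = 50      # spark.executor.instances
executor_cores = 2           # spark.executor.cores
task_cpus = 2                # spark.task.cpus
executor_memory_gb = 8       # spark.executor.memory
executor_overhead_gb = 4     # spark.executor.memoryOverhead

# Each task reserves task_cpus cores, so an executor can run
# executor_cores // task_cpus tasks at the same time.
slots_per_executor = executor_cores // task_cpus
total_task_slots = executor_instances * slots_per_executor

# Memory requested from the cluster manager per executor is heap plus overhead.
total_executor_memory_gb = executor_instances * (
    executor_memory_gb + executor_overhead_gb
)

print(f"concurrent task slots: {total_task_slots}")
print(f"total executor memory requested: {total_executor_memory_gb} GB")
```

   Under these assumptions the job can run 50 tasks at once, which lines up with one task per Kafka partition for a 50-partition topic.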
   
   I also removed "spark.sql.parquet.mergeSchema".
   
   I have noticed a significant speedup in every step except the one that 
extracts events from Kafka, which I can't seem to improve. We are using a 
throughput-optimized st1 EBS volume attached to the EMR master node. The 
topic is compacted and contains ~50 million records across 50 partitions. 
Even with the more powerful instance type above, it takes 40 minutes to 
extract all records.
   The slow part is the seeking. It takes a couple of minutes to seek from 
offset 50000 to 100000.
   Do you have any suggestions on how to improve data ingestion from Kafka 
using Spark Structured Streaming?
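   
   To put the numbers above in perspective, here is a back-of-the-envelope calculation of the observed read rate. It assumes the records are spread evenly across partitions and reads "a couple of minutes" as exactly 2 minutes; both are assumptions, not measurements. Because the topic is compacted, offsets can have gaps, so the seek figure counts offsets rather than records.

```python
# Figures quoted in this comment.
total_records = 50_000_000   # "~50 million records"
partitions = 50
elapsed_minutes = 40

# Overall and per-partition read rate, assuming an even spread of records.
records_per_second = total_records / (elapsed_minutes * 60)
per_partition_rate = records_per_second / partitions

# Seek observation: offsets 50000 -> 100000, assumed to take 2 minutes.
seek_offsets = 100_000 - 50_000
assumed_seek_seconds = 2 * 60
offsets_per_second = seek_offsets / assumed_seek_seconds

print(f"overall: {records_per_second:.0f} records/s")
print(f"per partition: {per_partition_rate:.0f} records/s")
print(f"seek rate: {offsets_per_second:.0f} offsets/s")
```

   With these assumptions, both the full extraction and the seek observation come out to a few hundred records per second per partition.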
   
   Thank you


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.