Rap70r commented on issue #3697: URL: https://github.com/apache/hudi/issues/3697#issuecomment-933893616
Hi @xushiyan, here is an update on our latest tests. I have switched to the d3.xlarge instance type and used the following configs:

```
spark-submit --deploy-mode cluster \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --conf spark.driver.maxResultSize=6g \
  --conf spark.driver.memory=17g \
  --conf spark.executor.cores=2 \
  --conf spark.hadoop.parquet.enable.summary-metadata=false \
  --conf spark.driver.memoryOverhead=6g \
  --conf spark.network.timeout=600s \
  --conf spark.executor.instances=50 \
  --conf spark.executor.memoryOverhead=4g \
  --conf spark.driver.cores=2 \
  --conf spark.executor.memory=8g \
  --conf spark.memory.storageFraction=0.1 \
  --conf spark.executor.heartbeatInterval=120s \
  --conf spark.memory.fraction=0.4 \
  --conf spark.rdd.compress=true \
  --conf spark.kryoserializer.buffer.max=200m \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.default.parallelism=200 \
  --conf spark.task.cpus=2
```

I also removed `spark.sql.parquet.mergeSchema`. I have noticed a significant speed increase in all the steps except the one that extracts events from Kafka; that step I can't seem to improve. We are using an st1 high-throughput EBS volume attached to the EMR master node. The topic is compacted and contains ~50 million records across 50 partitions. Even with the above powerful instance type, it takes 40 minutes to extract all the records. Do you have any suggestions on how to improve data ingestion from Kafka using Spark Structured Streaming? Thank you
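A minimal sketch of two Kafka source options that are commonly tuned for read throughput in Structured Streaming, assuming a PySpark job; the topic name and broker address are placeholders, not values from this issue. Note also that with `spark.executor.cores=2` and `spark.task.cpus=2`, each executor runs only one task at a time, so 50 executors give at most 50 concurrent Kafka read tasks unless those values are changed:

```python
# Hedged sketch, not the reporter's actual job: configuring the Kafka source
# so a 50-partition topic is not limited to 50 read tasks.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-ingest-sketch").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "events-topic")               # placeholder
    # Split each Kafka partition into multiple Spark input partitions so
    # more than 50 tasks can pull data in parallel (Spark 2.4+).
    .option("minPartitions", "200")
    # Cap the records consumed per micro-batch instead of draining the
    # whole ~50M-record backlog in a single batch.
    .option("maxOffsetsPerTrigger", "5000000")
    .load()
)
```

Raising `minPartitions` above the topic's partition count spreads the initial read across more cores, which is often the bottleneck when a large compacted topic is replayed from the beginning.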