Hi all, I'm running a Spark Streaming application with 1-hour batches to join two data feeds and write the output to disk. One data feed is about 40 GB per hour (split across multiple files), while the second is about 600-800 MB per hour (also split across multiple files). Due to application constraints, I may not be able to run smaller batches. Currently it takes about 20 minutes to produce the output on a cluster with 140 cores and 700 GB of RAM. I'm running 7 workers and 28 executors, each executor with 5 cores and 22 GB of RAM.
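For concreteness, the executor layout above would correspond to a launch command roughly like the following (this is a sketch, not my actual command: the master URL, class name, and jar are placeholders, and flags like --num-executors assume YARN; on a standalone master you would use spark.cores.max and spark.executor.cores instead):

```shell
# Hypothetical spark-submit matching the stated layout:
# 28 executors x 5 cores = 140 cores, 28 x 22 GB = ~616 GB executor RAM.
spark-submit \
  --master yarn \
  --num-executors 28 \
  --executor-cores 5 \
  --executor-memory 22g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.rdd.compress=true \
  --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC" \
  --class com.example.JoinJob \
  join-job.jar
```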
I execute mapToPair(), filter(), and reduceByKeyAndWindow() (with a 1-hour window) on the 40 GB data feed, and most of the computation time is spent on these operations. What worries me is the garbage collection (GC) time per executor, which ranges from 25 seconds to 9.2 minutes. I attach two screenshots below: one lists the GC time per executor, and the other shows the GC log output for a single executor. I expect that the executor spending 9.2 minutes on garbage collection will eventually be killed by the Spark driver. These numbers seem too high to me. Do you have any suggestions for keeping GC time low? I'm already using the Kryo serializer, -XX:+UseConcMarkSweepGC, and spark.rdd.compress=true. Is there anything else that would help? Thanks <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27087/gc_time.png> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27087/executor_16.png>
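In case it helps to see the shape of the computation, here is a plain-Java sketch of the per-window aggregation semantics (mapToPair -> filter -> reduceByKeyAndWindow) without the Spark runtime. The "key,value" record format, the filter predicate, and the sum reduce function are hypothetical stand-ins, not my actual job logic:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.stream.Collectors;

public class WindowSketch {
    // Hypothetical record format: "key,value" lines from the 40 GB feed.
    static Map<String, Long> aggregateWindow(List<String> records) {
        return records.stream()
                // mapToPair: split each record into a (key, value) pair
                .map(r -> {
                    String[] parts = r.split(",");
                    return new SimpleEntry<>(parts[0], Long.parseLong(parts[1]));
                })
                // filter: drop records failing a predicate (placeholder: value > 0)
                .filter(e -> e.getValue() > 0)
                // reduceByKeyAndWindow: sum values per key across the window
                .collect(Collectors.groupingBy(Entry::getKey,
                        Collectors.summingLong(Entry::getValue)));
    }

    public static void main(String[] args) {
        Map<String, Long> out =
                aggregateWindow(List.of("a,1", "a,2", "b,3", "b,-1"));
        System.out.println(out); // per-key sums after filtering
    }
}
```

The real job does this on a JavaPairDStream, where the 1-hour window means each batch's intermediate pair RDDs are retained for the full window before aggregation, which is where I suspect most of the GC pressure comes from.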