Hi all,

I'm running a Spark Streaming application with 1-hour batches to join two
data feeds and write the output to disk. One data feed is about 40 GB per
hour (split across multiple files), while the second is about 600-800 MB
per hour (also split across multiple files). Due to application
constraints, I may not be able to run smaller batches. Currently, it takes
about 20 minutes to produce the output on a cluster with 140 cores and
700 GB of RAM. I'm running 7 workers and 28 executors, each with 5 cores
and 22 GB of RAM.
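
For reference, the submit command looks roughly like the one below. This is
only a sketch assuming standalone mode (the master URL, class name, and jar
are placeholders, not my real ones):

    spark-submit \
      --master spark://<master-host>:7077 \
      --executor-cores 5 \
      --executor-memory 22g \
      --total-executor-cores 140 \
      --class com.example.FeedJoin \
      feed-join.jar
    # 140 total cores / 5 cores per executor = 28 executors over 7 workers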

On the 40 GB data feed I execute mapToPair(), filter(), and
reduceByKeyAndWindow() with a 1-hour window, and most of the computation
time is spent in these operations. What worries me is the garbage
collection (GC) time per executor, which ranges from 25 seconds to 9.2
minutes. I attach two screenshots below: one lists the GC time per executor
and the other shows the GC log output for a single executor. I expect that
the executor spending 9.2 minutes on garbage collection will eventually be
killed by the Spark driver.
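
For context, this is roughly what that part of the pipeline looks like. It
is a simplified sketch: the input path, record format, and parsing logic
are placeholders, and the join with the second feed is omitted:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;

    public class FeedJoinSketch {
      public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("feed-join-sketch");
        // 1-hour batch interval, matching the batch size described above
        JavaStreamingContext ssc =
            new JavaStreamingContext(conf, Durations.minutes(60));

        // The large (~40 GB/hour) feed, arriving as text files
        JavaDStream<String> largeFeed =
            ssc.textFileStream("hdfs:///feeds/large");

        JavaPairDStream<String, Long> reduced = largeFeed
            // mapToPair(): extract a key and a numeric value per record
            .mapToPair(line -> {
              String[] parts = line.split(",");
              return new Tuple2<>(parts[0], Long.parseLong(parts[1]));
            })
            // filter(): drop unneeded records before the shuffle
            .filter(t -> t._2() > 0)
            // reduceByKeyAndWindow(): aggregate over a 1-hour window
            .reduceByKeyAndWindow(
                (a, b) -> a + b,
                Durations.minutes(60),   // window length
                Durations.minutes(60));  // slide interval = batch interval

        // Write each window's result to disk
        reduced.foreachRDD((rdd, time) ->
            rdd.saveAsTextFile("hdfs:///output/joined-" + time.milliseconds()));

        ssc.start();
        ssc.awaitTermination();
      }
    }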

I think these numbers are too high. Do you have any suggestions for keeping
GC time low? I'm already using the Kryo serializer, -XX:+UseConcMarkSweepGC,
and spark.rdd.compress=true.
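
In case it matters, this is how those settings are applied, written here as
spark-defaults.conf entries (a sketch; the GC logging flags are assumed
only to show where the executor GC output in the screenshot comes from):

    spark.serializer                 org.apache.spark.serializer.KryoSerializer
    spark.rdd.compress               true
    # -verbose:gc / -XX:+PrintGCDetails assumed here for the GC log output
    spark.executor.extraJavaOptions  -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails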

Is there anything else that would help?

Thanks
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n27087/gc_time.png> 
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n27087/executor_16.png>
 


