I have a Spark program that exhibits steadily increasing resource usage. Spark
Streaming (https://spark.apache.org/streaming/) provides the data source: the
driver receives "events" by querying MongoDB from a custom
JavaReceiverInputDStream. These events are then transformed via mapToPair(),
which creates a tuple mapping an id to each event. The stream is partitioned
and we run a groupByKey(). Finally, the events are processed in foreachRDD().
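
For context, the pipeline looks roughly like the sketch below. The Event
class, the MongoEventReceiver, the batch interval and the partition count are
placeholders for the real code; the MongoDB query in the receiver and the
processing body inside foreachRDD() are omitted.

import org.apache.spark.SparkConf;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.receiver.Receiver;
import scala.Tuple2;

public class EventPipeline {

    // Placeholder for the real event type pulled from MongoDB.
    public static class Event implements java.io.Serializable {
        private final String id;
        public Event(String id) { this.id = id; }
        public String getId() { return id; }
    }

    // Placeholder for the custom receiver that polls MongoDB and pushes
    // results into the stream via store(); the actual query is omitted.
    public static class MongoEventReceiver extends Receiver<Event> {
        public MongoEventReceiver() { super(StorageLevel.MEMORY_ONLY()); }
        @Override public void onStart() {
            new Thread(() -> {
                while (!isStopped()) {
                    // Query MongoDB here and push each result with store(event).
                    try { Thread.sleep(1000); } catch (InterruptedException ignored) { }
                }
            }).start();
        }
        @Override public void onStop() { }
    }

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("EventPipeline");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Custom receiver-backed input stream.
        JavaReceiverInputDStream<Event> events = jssc.receiverStream(new MongoEventReceiver());

        // Map each event to an (id, event) tuple.
        JavaPairDStream<String, Event> byId = events.mapToPair(e -> new Tuple2<>(e.getId(), e));

        // Partition the stream, then group the events by id.
        JavaPairDStream<String, Iterable<Event>> grouped = byId.repartition(8).groupByKey();

        // Process each batch; the real processing body lives here.
        grouped.foreachRDD(rdd -> rdd.foreach(pair -> {
            // process(pair._1(), pair._2())
        }));

        jssc.start();
        jssc.awaitTermination();
    }
}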

Running it for several hours on a standalone cluster shows a clear upward
trend in both CPU and heap memory usage. This occurs even when the data
source offers no events, so there is no actual processing to perform.
Similarly, omitting the bulk of the processing code inside foreachRDD() does
not eliminate the problem.

I've tried removing individual steps from the pipeline to identify the
culprit, and it looks like the partitioning step is what causes the CPU usage
to climb over time.
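
The stripped-down variant I have been testing looks roughly like the snippet
below (reusing the names from the sketch above; only the repartition and
groupByKey remain, and the per-batch action does no real work):

// Same stream as above, minus the real processing.
JavaPairDStream<String, Iterable<Event>> grouped =
        byId.repartition(8).groupByKey();

// A no-op action so each batch is still evaluated.
grouped.foreachRDD(rdd -> {
    rdd.count();
});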

Has anyone else experienced this sort of behaviour?


