I somehow missed that parameter when I was reviewing the documentation; that should do the trick! Thank you!
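For the archive, a minimal sketch of how the settings discussed below might be combined on the driver side (Scala, assuming the Spark 1.x SparkConf / StreamingContext API; the app name, ttl and rate values are purely illustrative, not recommendations):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Illustrative values only; tune them for your own workload.
    val conf = new SparkConf()
      .setMaster("local[2]")               // local testing only
      .setAppName("kafka-backlog-example") // hypothetical app name
      // Time-based cleanup of old RDDs and metadata (seconds).
      .set("spark.cleaner.ttl", "3600")
      // Unpersist input data once it falls out of the slide duration.
      .set("spark.streaming.unpersist", "true")
      // Cap the per-receiver ingestion rate (records per second) so a
      // backlogged Kafka topic does not flood the first batches.
      .set("spark.streaming.receiver.maxRate", "10000")

    val ssc = new StreamingContext(conf, Seconds(10))
    // ... create the Kafka input stream and the rest of the job here ...

Capping the receiver rate means that after a restart the Kafka backlog is drained over several batches instead of being pulled into memory all at once.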
2014-09-10 2:10 GMT+01:00 Shao, Saisai <saisai.s...@intel.com>:

> Hi Luis,
>
> The parameters “spark.cleaner.ttl” and “spark.streaming.unpersist” can be
> used to remove useless timed-out streaming data. The difference is that
> “spark.cleaner.ttl” is a time-based cleaner: it cleans not only streaming
> input data but also Spark’s stale metadata. “spark.streaming.unpersist” is
> a reference-based cleaning mechanism: streaming data is removed once it
> falls out of the slide duration.
>
> Both of these parameters can alleviate the memory occupation of Spark
> Streaming. But if data floods into Spark Streaming at startup, as in your
> situation using Kafka, these two parameters cannot mitigate the problem
> well. You actually need to control the input data rate so it is not
> injected so fast; you can try “spark.streaming.receiver.maxRate” to limit
> the ingestion rate.
>
> Thanks
>
> Jerry
>
> *From:* Luis Ángel Vicente Sánchez [mailto:langel.gro...@gmail.com]
> *Sent:* Wednesday, September 10, 2014 5:21 AM
> *To:* user@spark.apache.org
> *Subject:* spark.cleaner.ttl and spark.streaming.unpersist
>
> The executors of my Spark Streaming application are being killed due to
> memory issues. The memory consumption is quite high on startup because it
> is the first run and there are quite a few events on the Kafka queues,
> which are consumed at a rate of 100K events per second.
>
> I wonder if it's recommended to use spark.cleaner.ttl and
> spark.streaming.unpersist together to mitigate that problem. I also wonder
> whether new RDDs are being batched while an RDD is being processed.
>
> Regards,
>
> Luis