I am using Spark 1.0.0 (on CDH 5.1) and have a similar issue. In my case, the receivers die within an hour because YARN kills their containers for high memory usage. I set spark.cleaner.ttl to 30 seconds, but that didn't help, so I don't think stale RDDs are the issue here. I ran "jmap -histo" on a couple of running receiver processes: in a 30 GB heap, roughly 16 GB is taken by "[B", i.e. byte arrays.
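For reference, here is a minimal sketch of how a receiver like mine could be set up (the ZooKeeper quorum, consumer group, and topic map below are placeholders, not my real settings); I am experimenting with a serialized, spill-to-disk storage level to see whether it keeps those received byte-array blocks from piling up on the heap:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("receiver-memory-test")
val ssc  = new StreamingContext(conf, Seconds(10))

// Store received blocks serialized and allow spill to disk, so that a
// backlogged receiver does not hold all raw blocks ("[B" in jmap) in memory.
val stream = KafkaUtils.createStream(
  ssc,
  "zk1:2181,zk2:2181",   // ZooKeeper quorum (placeholder)
  "my-consumer-group",   // consumer group id (placeholder)
  Map("events" -> 4),    // topic -> number of receiver threads (placeholder)
  StorageLevel.MEMORY_AND_DISK_SER_2)

stream.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))
ssc.start()
ssc.awaitTermination()
```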
Still investigating and would appreciate pointers for troubleshooting. I have dumped the heap of a receiver and will try to go over it.

On Wed, Sep 10, 2014 at 1:43 AM, Luis Ángel Vicente Sánchez <langel.gro...@gmail.com> wrote:

> I somehow missed that parameter when I was reviewing the documentation;
> that should do the trick! Thank you!
>
> 2014-09-10 2:10 GMT+01:00 Shao, Saisai <saisai.s...@intel.com>:
>
>> Hi Luis,
>>
>> The parameters “spark.cleaner.ttl” and “spark.streaming.unpersist” can be
>> used to remove stale streaming data. The difference is that
>> “spark.cleaner.ttl” is a time-based cleaner: it cleans not only streaming
>> input data but also Spark’s stale metadata. “spark.streaming.unpersist” is
>> a reference-based cleaning mechanism: streaming data is removed once it
>> falls out of the slide duration.
>>
>> Both of these parameters can reduce the memory footprint of Spark
>> Streaming. But if data floods into Spark Streaming at startup, as in your
>> situation with Kafka, these two parameters cannot fully mitigate the
>> problem. You actually need to control the input data rate so it is not
>> ingested so fast; you can try “spark.streaming.receiver.maxRate” to limit
>> the ingestion rate.
>>
>> Thanks
>>
>> Jerry
>>
>> *From:* Luis Ángel Vicente Sánchez [mailto:langel.gro...@gmail.com]
>> *Sent:* Wednesday, September 10, 2014 5:21 AM
>> *To:* user@spark.apache.org
>> *Subject:* spark.cleaner.ttl and spark.streaming.unpersist
>>
>> The executors of my Spark Streaming application are being killed due to
>> memory issues. Memory consumption is quite high on startup because it is
>> the first run and there are quite a few events on the Kafka queues, which
>> are consumed at a rate of 100K events per second.
>>
>> I wonder whether it is recommended to use spark.cleaner.ttl and
>> spark.streaming.unpersist together to mitigate that problem. I also wonder
>> whether new RDDs are being batched while an RDD is being processed.
>>
>> Regards,
>>
>> Luis
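For anyone who finds this thread later, a minimal sketch of how the three settings Jerry mentions could be combined (the values are purely illustrative, and spark.streaming.receiver.maxRate is assumed to be supported by the Spark version in use):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative values only; tune them for your own workload.
val conf = new SparkConf()
  .setAppName("kafka-streaming")
  // Time-based cleaner: drops old RDDs and Spark metadata after 1 hour.
  .set("spark.cleaner.ttl", "3600")
  // Reference-based cleanup: unpersist input blocks once they fall out of
  // the slide duration.
  .set("spark.streaming.unpersist", "true")
  // Cap each receiver at 10,000 records/second so a backlogged Kafka topic
  // cannot flood the executors at startup.
  .set("spark.streaming.receiver.maxRate", "10000")

val ssc = new StreamingContext(conf, Seconds(10))
```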