I am using Spark 1.0.0 (on CDH 5.1) and have a similar issue. In my case,
the receivers die within an hour because YARN kills the containers for high
memory usage. I set spark.cleaner.ttl to 30 seconds, but that didn't help,
so I don't think stale RDDs are the issue here. I ran "jmap -histo" on a
couple of running receiver processes, and in a 30 GB heap roughly 16 GB is
taken by "[B", i.e., byte arrays.

Still investigating more and would appreciate pointers for troubleshooting.
I have dumped the heap of a receiver and will try to go over it.

On Wed, Sep 10, 2014 at 1:43 AM, Luis Ángel Vicente Sánchez <
langel.gro...@gmail.com> wrote:

> I somehow missed that parameter when I was reviewing the documentation,
> that should do the trick! Thank you!
>
> 2014-09-10 2:10 GMT+01:00 Shao, Saisai <saisai.s...@intel.com>:
>
>> Hi Luis,
>>
>>
>>
>> The parameters “spark.cleaner.ttl” and “spark.streaming.unpersist” can be
>> used to remove useless, timed-out streaming data. The difference is that
>> “spark.cleaner.ttl” is a time-based cleaner: it cleans not only streaming
>> input data but also Spark’s stale metadata. “spark.streaming.unpersist” is
>> a reference-based cleaning mechanism: streaming data is removed once it
>> falls out of the slide duration.
>>
>>
>>
>> Both parameters can alleviate the memory pressure of Spark Streaming. But
>> if data floods into Spark Streaming at startup, as in your situation with
>> Kafka, these two parameters cannot fully mitigate the problem. You actually
>> need to control the input data rate so it is not injected so fast; you can
>> try “spark.streaming.receiver.maxRate” to control the ingestion rate.
>>
>>
>>
>> Thanks
>>
>> Jerry
>>
>>
>>
>> *From:* Luis Ángel Vicente Sánchez [mailto:langel.gro...@gmail.com]
>> *Sent:* Wednesday, September 10, 2014 5:21 AM
>> *To:* user@spark.apache.org
>> *Subject:* spark.cleaner.ttl and spark.streaming.unpersist
>>
>>
>>
>> The executors of my Spark Streaming application are being killed due to
>> memory issues. The memory consumption is quite high on startup because it
>> is the first run and there are quite a few events on the Kafka queues,
>> which are consumed at a rate of 100K events per second.
>>
>> I wonder if it's recommended to use spark.cleaner.ttl and
>> spark.streaming.unpersist together to mitigate that problem. I also wonder
>> whether new RDDs are being batched up while an RDD is still being processed.
>>
>> Regards,
>>
>> Luis
>>
>
>
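
Following up on the “spark.streaming.receiver.maxRate” suggestion quoted
above, here is a minimal sketch of how I understand it would be wired in
with a receiver-based Kafka stream; the rate, ZooKeeper quorum, consumer
group and topic below are placeholders, not an actual configuration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Cap each receiver at N records/sec so a backlogged topic cannot flood
// the executors on the first run (the rate value is illustrative).
val conf = new SparkConf()
  .setAppName("throttled-kafka-stream")              // placeholder name
  .set("spark.streaming.receiver.maxRate", "10000")  // records/sec per receiver
  .set("spark.streaming.unpersist", "true")

val ssc = new StreamingContext(conf, Seconds(10))

// Receiver-based Kafka input; connection details are placeholders.
val lines = KafkaUtils.createStream(
  ssc,
  "zkhost:2181",        // ZooKeeper quorum
  "my-consumer-group",  // consumer group id
  Map("events" -> 1))   // topic -> number of receiver threads

lines.count().print()
ssc.start()
ssc.awaitTermination()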
