What about cleaning up the temp data that shuffles generate? We get a lot
of shuffle temp data in the /tmp folder, which is why we are using a ttl.
Also, if I keep an RDD in cache, is it available across all the executors,
or only on the executor that cached it?
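
For reference, the cache-rotation pattern TD suggests below (cache a new
snapshot each minute, then unpersist the old one) might be sketched like
this. The names `CacheRotation`, `buildSnapshot`, and the RDD's element type
are hypothetical, and the sketch assumes a live SparkContext; it is an
illustration of the pattern, not code from the thread:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical snapshot type and builder -- not from the thread.
object CacheRotation {
  @volatile private var current: RDD[(String, Long)] = null

  // Call once a minute (e.g. from a foreachRDD hook or a timer thread).
  def refresh(sc: SparkContext,
              buildSnapshot: SparkContext => RDD[(String, Long)]): Unit = {
    val next = buildSnapshot(sc).cache()
    next.count()                       // materialize the new cache first
    val previous = current
    current = next                     // readers now see the new snapshot
    if (previous != null) previous.unpersist(blocking = false)
  }

  def snapshot: RDD[(String, Long)] = current
}
```

Note that a cached RDD's partitions live on the executors but are visible to
any job run through the same SparkContext, so the snapshot is usable
cluster-wide, not just on one executor.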

On Fri, Oct 16, 2015 at 5:49 PM, Tathagata Das <t...@databricks.com> wrote:

> Setting a ttl is not recommended anymore, as Spark works with the Java GC
> to clean up stuff (RDDs, shuffle files, broadcasts, etc.) that is no
> longer referenced.
>
> So you can keep an RDD cached in Spark, and every minute unpersist the
> previous one and cache a new one.
>
> TD
>
> On Fri, Oct 16, 2015 at 12:02 PM, swetha <swethakasire...@gmail.com>
> wrote:
>
>> Hi,
>>
>> How can I keep a changing object cached forever in Streaming? I know we
>> can call rdd.cache, but I think the cached data would be cleaned up if we
>> set a ttl in Streaming. Our requirement is to keep an object in memory
>> that is updated every minute based on the records our Streaming job
>> receives.
>>
>> Currently I am keeping that object in updateStateByKey. But my
>> updateStateByKey is tracking the real-time session information as well,
>> so it has 4 arguments: ones that track session information, plus this
>> object that tracks the performance info separately. I am concerned it may
>> be too much data to keep in updateStateByKey.
>>
>> Is it recommended to hold a lot of data using updateStateByKey?
>>
>>
>> Thanks,
>> Swetha
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-put-an-object-in-cache-for-ever-in-Streaming-tp25098.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>
>
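
On the updateStateByKey question above, one option is to split the session
state and the performance object into two separate state streams, so that
neither per-key state object grows large. The sketch below is a hedged
illustration, not code from the thread: the `Event` type, the case classes,
and the key type are assumptions, and it needs a running StreamingContext:

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical event and state shapes -- not from the thread.
case class Event(ts: Long, latencyMs: Long)
case class SessionState(firstSeen: Long, lastSeen: Long)
case class PerfState(count: Long, totalLatencyMs: Long)

def track(events: DStream[(String, Event)])
    : (DStream[(String, SessionState)], DStream[(String, PerfState)]) = {

  // Session tracking: small per-key state, updated from event timestamps.
  val sessions = events.updateStateByKey[SessionState] {
    (batch: Seq[Event], state: Option[SessionState]) =>
      if (batch.isEmpty) state
      else {
        val ts = batch.map(_.ts)
        val prev = state.getOrElse(SessionState(ts.min, ts.min))
        Some(prev.copy(lastSeen = math.max(prev.lastSeen, ts.max)))
      }
  }

  // Performance counters kept separately from the session state.
  val perf = events.updateStateByKey[PerfState] {
    (batch: Seq[Event], state: Option[PerfState]) =>
      val prev = state.getOrElse(PerfState(0L, 0L))
      Some(PerfState(prev.count + batch.size,
                     prev.totalLatencyMs + batch.map(_.latencyMs).sum))
  }

  (sessions, perf)
}
```

Keeping the two concerns in separate DStreams also lets you checkpoint and
inspect them independently.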
