What about cleaning up the temporary data that shuffles generate? We accumulate a lot of shuffle temp data in the /tmp folder, which is why we are using a ttl. Also, if I keep an RDD in cache, is it available across all the executors or only on the same executor?
On Fri, Oct 16, 2015 at 5:49 PM, Tathagata Das <t...@databricks.com> wrote:

> Setting a ttl is not recommended any more, as Spark works with the Java GC to
> clean up stuff (RDDs, shuffles, broadcasts, etc.) that is no longer referenced.
>
> So you can keep an RDD cached in Spark, and every minute uncache the
> previous one and cache a new one.
>
> TD
>
> On Fri, Oct 16, 2015 at 12:02 PM, swetha <swethakasire...@gmail.com> wrote:
>
>> Hi,
>>
>> How do we put a changing object in cache forever in Streaming? I know that
>> we can do rdd.cache, but I think .cache would be cleaned up if we set a ttl
>> in Streaming. Our requirement is to keep an object in memory. The object
>> would be updated every minute based on the records that we get in our
>> Streaming job.
>>
>> Currently I am keeping that in updateStateByKey. But my updateStateByKey
>> is tracking the real-time session information as well. So my
>> updateStateByKey has 4 arguments that track session information plus this
>> object that tracks the performance info separately. I was thinking it may
>> be too much to keep so much data in updateStateByKey.
>>
>> Is it recommended to hold a lot of data using updateStateByKey?
>>
>> Thanks,
>> Swetha
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-put-an-object-in-cache-for-ever-in-Streaming-tp25098.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
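For what it's worth, the rotation TD suggests (cache the new RDD each interval, unpersist the previous one) can be sketched as below. This is a plain-Scala simulation, not Spark code: `FakeRdd`, `CacheRotation`, and `Demo` are made-up names for illustration, with `cache()`/`unpersist()` mimicking the real `rdd.cache()`/`rdd.unpersist()` calls.

```scala
// Simulation of the pattern: each batch interval, cache the newly
// computed RDD and unpersist the previous one, so only the latest
// snapshot stays referenced in memory. FakeRdd is a stand-in for
// Spark's RDD; in a real job you would call rdd.cache() and
// rdd.unpersist() instead.
import scala.collection.mutable

final class FakeRdd(val id: Int) {
  def cache(): FakeRdd = { FakeRdd.cachedIds += id; this }
  def unpersist(): FakeRdd = { FakeRdd.cachedIds -= id; this }
}
object FakeRdd { val cachedIds = mutable.Set.empty[Int] }

object CacheRotation {
  private var previous: Option[FakeRdd] = None

  // Call once per batch interval with the freshly computed RDD.
  def rotate(next: FakeRdd): Unit = {
    next.cache()                     // keep the new snapshot in memory
    previous.foreach(_.unpersist())  // release the old one
    previous = Some(next)
  }
}

object Demo {
  def run(): Set[Int] = {
    (1 to 3).foreach(i => CacheRotation.rotate(new FakeRdd(i)))
    FakeRdd.cachedIds.toSet          // only the most recent RDD remains cached
  }
}
```

In a real Spark Streaming job this rotation would typically sit inside a `foreachRDD` block. The point of the pattern is that it avoids a ttl entirely: once the old RDD is unpersisted and unreferenced, Spark's reference-based cleanup can reclaim it.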