It seems the default value for spark.cleaner.delay is 3600 seconds, but I need to be able to count things on a daily, weekly, or even monthly basis.
I suppose the aim of DStream batches and spark.cleaner.delay is to avoid space issues (running out of memory, etc.). I usually use HyperLogLog for counting unique things to save space, and AFAIK the other metrics are simply long values, which don't require much space.

When I started learning Spark Streaming, it really confused me, because in my first "Hello World" example all I wanted was to count all events processed by Spark Streaming. DStream batches are nice, but when I need simple counting operations they become complex. Since Spark Streaming creates a new RDD for each interval, I needed to merge them into a single DStream, so I used updateStateByKey() to generate a StateDStream. It seems to work now, but I'm not sure whether it's efficient, because all I need is a single global counter, yet now Spark keeps counters for every 2-second interval plus a global counter for the StateDStream.

I don't have any specific goal like "show me this type of unique thing for the last 10 minutes"; instead, I need to be able to count things at large scale, whether over 10 minutes or 1 month. I create pre-aggregation rules on the fly, and when I only need a simple monthly counter, Spark seems like overkill to me for now. Do you have any advice on using Spark Streaming efficiently?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Any-advice-for-using-big-spark-cleaner-delay-value-in-Spark-Streaming-tp4895.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
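For context, the state-update function passed to updateStateByKey() for a single global counter can be very small. Below is a minimal PySpark-style sketch; the update function itself is plain Python, so it is shown (and can be run) without a Spark cluster. The wiring comments, the `ssc` handle, and the checkpoint path are illustrative placeholders, not taken from the original post:

```python
# Sketch of the update function one would pass to updateStateByKey()
# in PySpark Streaming to maintain one running counter across batches.

def update_count(new_values, running_count):
    """Merge this batch's per-key counts into the running total.

    new_values:    list of counts seen for the key in the current batch
    running_count: previous state for the key, or None the first time
    """
    return sum(new_values) + (running_count or 0)

# In an actual streaming job it would be wired up roughly like this
# (`ssc`, `events`, and the checkpoint path are hypothetical):
#
#   ssc.checkpoint("hdfs:///tmp/checkpoints")  # required by updateStateByKey
#   counts = events.map(lambda e: ("total", 1)) \
#                  .updateStateByKey(update_count)

# Simulating three consecutive batch intervals for the single "total" key:
state = None
for batch in [[1, 1, 1], [1, 1], [1, 1, 1, 1]]:
    state = update_count(batch, state)
print(state)  # -> 9
```

Mapping every event to the same key ("total") means the StateDStream holds exactly one entry, so the per-interval DStreams can be discarded by the cleaner while the single global count survives.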