Re: Any advice for using big spark.cleaner.delay value in Spark Streaming?
Thanks for your reply. Sorry for the late response; I wanted to run some tests before writing back. The counting part works much as you suggested: I specify a minimum interval, say 1 minute, and each hour, day, etc. sums the counters of its child intervals. However, counting unique visitors for the month is much more complex. I need to merge 30 sets of visitor IDs, each with more than a few hundred thousand elements. Merging the sets may still be better than keeping a separate set for the last month, but I'm not sure, because it may be inefficient when there are many intersections.

BTW, I have one more question. The HLL example in the repository seems confusing to me: how does Spark handle global variables used inside mapPartitions? (https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/TwitterAlgebirdHLL.scala#L68) I'm also a newbie, but I thought the map and mapPartitions methods were similar to Hadoop's map methods, so when we run the example on a cluster, how does a remote node reach a global variable that lives on a single node? Does Spark replicate HyperLogLogMonoid instances across the cluster?

Thanks,
Burak Emre

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Any-advice-for-using-big-spark-cleaner-delay-value-in-Spark-Streaming-tp4895p5131.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
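[A note on the two questions above. On mapPartitions: Spark serializes the closure, including any outer objects it references, and ships a copy to each task, so every executor works with its own deserialized HyperLogLogMonoid; since the monoid is just stateless configuration (the bit size), the copies are interchangeable. On monthly uniques: rather than merging 30 raw ID sets, each day can be reduced to an HLL sketch once, and the monthly unique count is then a cheap merge of 30 fixed-size sketches. A minimal sketch of that idea, assuming a recent Algebird on the classpath (method names like `create` and `estimatedSize` are from Algebird's HLL API; check them against your version):]

```scala
import com.twitter.algebird.{HLL, HyperLogLogMonoid}

object MonthlyUniques {
  // 12 bits of addressing -> roughly 1-2% standard error;
  // raise the bit size for more accuracy at the cost of sketch size.
  val hll = new HyperLogLogMonoid(bits = 12)

  // Build one fixed-size sketch per day from that day's raw visitor IDs.
  def dailySketch(visitorIds: Seq[String]): HLL =
    hll.sum(visitorIds.map(id => hll.create(id.getBytes("UTF-8"))))

  // Monthly uniques = merge of the ~30 daily sketches.
  // Merging HLLs is register-wise max, so overlap between days
  // (returning visitors) is handled correctly and cheaply.
  def monthlyEstimate(days: Seq[HLL]): Double =
    hll.sum(days).estimatedSize
}
```

[Each daily sketch is a few KB regardless of how many hundred thousand IDs it saw, so keeping a rolling window of them is far cheaper than keeping the raw sets.]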
Any advice for using big spark.cleaner.delay value in Spark Streaming?
It seems the default value for spark.cleaner.delay is 3600 seconds, but I need to be able to count things on a daily, weekly, or even monthly basis. I suppose the aim of DStream batches and spark.cleaner.delay is to avoid space issues (running out of memory, etc.). I usually use HyperLogLog for counting unique things to save space, and AFAIK the other metrics are simply long values, which don't require much space.

When I started learning Spark Streaming it really confused me, because in my first Hello World example all I wanted was to count all events processed by Spark Streaming. DStream batches are nice, but when I need simple counting operations they become complex. Since Spark Streaming creates a new batch for each interval, I needed to merge them into a single DStream, so I used updateStateByKey() to generate a StateDStream. It seems to work now, but I'm not sure whether it's efficient, because all I need is a single global counter, yet Spark now keeps counters for every 2-second interval plus a global counter for the StateDStream.

I don't have a specific purpose like "show me this type of unique thing for the last 10 minutes"; instead I need to be able to count things at a larger scale, whether over 10 minutes or 1 month. I create pre-aggregation rules on the fly, and when all I need is a simple monthly counter, Spark seems like overkill for now. Do you have any advice on using Spark Streaming efficiently?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Any-advice-for-using-big-spark-cleaner-delay-value-in-Spark-Streaming-tp4895.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
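[The updateStateByKey() approach described above can be sketched as follows. This is a minimal, hypothetical setup: the socket source, host/port, checkpoint path, and app name are placeholders, not anything from the original post; updateStateByKey and checkpointing are the standard Spark Streaming APIs.]

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A single running event count kept as one-key state across batches.
val conf = new SparkConf().setAppName("GlobalCounter")
val ssc = new StreamingContext(conf, Seconds(2))
ssc.checkpoint("checkpoint") // updateStateByKey requires a checkpoint dir

// Placeholder input stream; substitute your real source here.
val events = ssc.socketTextStream("localhost", 9999)

// Map every event to the same key so the state holds one global counter.
val counts = events.map(_ => ("total", 1L))
  .reduceByKey(_ + _) // pre-sum within the batch before touching state
  .updateStateByKey[Long] { (batchSums: Seq[Long], state: Option[Long]) =>
    Some(state.getOrElse(0L) + batchSums.sum)
  }

counts.print()
ssc.start()
ssc.awaitTermination()
```

[The per-interval counters are still materialized internally, but the per-batch reduceByKey keeps the state update to a single value per batch, so the overhead beyond the one global counter is small.]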