good catch on the cleaner.ttl

@jatin-  when you say "memory-persisted RDD", what do you mean exactly?  and 
how much data are you talking about?  remember that spark can evict these 
memory-persisted RDDs at any time.  they can be recovered from Kafka, but this 
is not a good situation to be in.

also, is this spark 1.6 with the new mapWithState() or the old updateStateByKey()?  
you definitely want the newer 1.6 mapWithState().
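
for reference, here's a rough sketch of the 1.6 mapWithState() API - the `events` 
stream and the running-count state are just placeholders, not your actual code:

import org.apache.spark.streaming.{Seconds, State, StateSpec}

// illustrative only: keep a running count per key with mapWithState().
// `events` is assumed to be a DStream[(String, Long)] built from the Kafka stream.
def updateCount(key: String, value: Option[Long], state: State[Long]): (String, Long) = {
  val newTotal = state.getOption.getOrElse(0L) + value.getOrElse(0L)
  state.update(newTotal)
  (key, newTotal)
}

val spec = StateSpec.function(updateCount _)
  .timeout(Seconds(24 * 60 * 60))   // drop keys not seen for a day

val runningCounts = events.mapWithState(spec)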

and is there any other way to store and aggregate this data outside of spark?  
I get a bit nervous when I see people treat spark/streaming like an in-memory 
database.

perhaps redis or another type of in-memory store is more appropriate.  or just 
write to long-term storage using parquet.
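
for the parquet route, a per-batch write is enough - something like this (rough 
sketch; `windowed`, the schema, and the path are all placeholders):

import org.apache.spark.sql.SQLContext

// illustrative only: dump each 30-minute aggregate window to parquet.
// `windowed` is assumed to be a DStream[(String, Long)].
windowed.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    import sqlContext.implicits._
    rdd.toDF("key", "count")
      .write
      .mode("append")
      .parquet(s"hdfs:///data/aggregates/batch_ts=${time.milliseconds}")
  }
}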

if this is a lot of data, you may want to use approximate probabilistic data 
structures like Count-Min Sketch or HyperLogLog.  here are some relevant links 
with more info - including how to use these with redis: 

https://www.slideshare.net/cfregly/advanced-apache-spark-meetup-approximations-and-probabilistic-data-structures-jan-28-2016-galvanize

https://github.com/fluxcapacitor/pipeline/tree/master/myapps/streaming/src/main/scala/com/advancedspark/streaming/rating/approx
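
as a quick illustration, here's roughly what the HyperLogLog approach looks like 
with algebird (the (key, userId) stream and the precision are assumptions - see 
the links above for the full redis-backed versions):

import com.twitter.algebird.HyperLogLogMonoid

// illustrative only: approximate distinct-user counts per key with algebird HLL.
// `events` is assumed to be a DStream[(String, String)] of (key, userId).
val hll = new HyperLogLogMonoid(12)   // 12 bits => roughly 1-2% standard error

val approxDistinctUsers = events
  .mapValues(userId => hll.create(userId.getBytes("UTF-8")))
  .reduceByKey(hll.plus(_, _))
  .mapValues(_.approximateSize.estimate)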

you can then set up a cron (or airflow) spark job to do the compute and 
aggregate against either redis or long-term storage.
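
the periodic job itself can just be a plain spark batch job that cron/airflow 
submits - something like this against the parquet output (paths and column 
names are made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// illustrative only: standalone batch job that re-aggregates the streamed-out parquet.
object RollupJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rollup"))
    val sqlContext = new SQLContext(sc)

    // filter to the last day's partitions as needed
    val df = sqlContext.read.parquet("hdfs:///data/aggregates")
    df.groupBy("key")
      .sum("count")
      .write
      .mode("overwrite")
      .parquet("hdfs:///data/aggregates-rollup")

    sc.stop()
  }
}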

this reference pipeline contains the latest airflow workflow scheduler:  
https://github.com/fluxcapacitor/pipeline/wiki

my advice with spark streaming is to get the data out of spark streaming as 
quickly as possible - and into a more durable store that is better suited for 
aggregation and compute.

this greatly simplifies your operational concerns, in my
opinion.

good question.  very common use case.

> On Feb 21, 2016, at 12:22 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> 
> w.r.t. cleaner TTL, please see:
> 
> [SPARK-7689] Remove TTL-based metadata cleaning in Spark 2.0
> 
> FYI
> 
>> On Sun, Feb 21, 2016 at 4:16 AM, Gerard Maas <gerard.m...@gmail.com> wrote:
>> 
>> It sounds like another window operation on top of the 30-min window will 
>> achieve the desired objective. 
>> Just keep in mind that you'll need to set the cleaner TTL (spark.cleaner.ttl) 
>> to a long enough value, and you will require enough resources (mem & disk) to 
>> keep the required data.  
>> 
>> -kr, Gerard.
>> 
>>> On Sun, Feb 21, 2016 at 12:54 PM, Jatin Kumar 
>>> <jku...@rocketfuelinc.com.invalid> wrote:
>>> Hello Spark users,
>>> 
>>> I have to aggregate messages from kafka and at some fixed interval (say 
>>> every half hour) update a memory-persisted RDD and run some computation. 
>>> This computation uses the last one day of data. Steps are:
>>> 
>>> - Read from realtime Kafka topic X in spark streaming batches of 5 seconds
>>> - Filter the above DStream messages and keep some of them
>>> - Create windows of 30 minutes on above DStream and aggregate by Key
>>> - Merge this 30 minute RDD with a memory persisted RDD say combinedRdd
>>> - Maintain the last N such RDDs in a deque, persisting them on disk. While 
>>> adding a new RDD, subtract the oldest RDD from the combinedRdd.
>>> - As the final step, consider the last N such windows (of 30 minutes each) and 
>>> do the final aggregation
>>> 
>>> Does the above way of using spark streaming look reasonable? Is there a 
>>> better way of doing the above?
>>> 
>>> --
>>> Thanks
>>> Jatin
> 
