Re: controlling the time in spark-streaming

Laeeq Ahmed Wed, 09 Jul 2014 06:57:25 -0700

Hi,

For QueueRDD, have a look here. 
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala

Regards,
Laeeq

On Friday, May 23, 2014 10:33 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

Well its hard to use text data as time of input. 
But if you are adament here's what you would do. 
Have a Dstream object which works in on a folder using filestream/textstream
Then have another process (spark streaming or cron) read through the files you 
receive & push them into the folder in order of time. Mostly your data would be 
produced at t, you would get it at t + say 5 sec, & you can push it in & get 
processed at t + say 10 sec. Then you can timeshift your calculations. If you 
are okay with broad enough time frame you should be fine.

Another way is to use queue processing.
Queue<JavaRDD<Integer>> rddQueue = new LinkedList<JavaRDD<Integer>>();
Create a Dstream to consume from Queue of RDD, keep looking into the folder of 
files & create rdd's from them at a min level & push them into thee queue. This 
would cause you to go through your data atleast twice & just provide order 
guarantees , processing time is still going to vary with how quickly you can 
process your RDD.

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi

On Thu, May 22, 2014 at 9:08 PM, Ian Holsman <i...@holsman.com.au> wrote:

Hi.
>
>
>I'm writing a pilot project, and plan on using spark's streaming app for it.
>
>
>To start with I have a dump of some access logs with their own timestamps, and 
>am using the textFileStream and some old files to test it with.
>
>
>One of the issues I've come across is simulating the windows. I would like use 
>the timestamp from the access logs as the 'system time' instead of the real 
>clock time.
>
>
>I googled a bit and found the 'manual' clock which appears to be used for 
>testing the job scheduler.. but I'm not sure what my next steps should be.
>
>
>I'm guessing I'll need to do something like
>
>
>1. use the textFileStream to create a 'DStream'
>2. have some kind of DStream that runs on top of that that creates the RDDs 
>based on the timestamps Instead of the system time
>3. the rest of my mappers.
>
>
>Is this correct? or do I need to create my own 'textFileStream' to initially 
>create the RDDs and modify the system clock inside of it.
>
>
>I'm not too concerned about out-of-order messages, going backwards in time, or 
>being 100% in sync across workers.. as this is more for 
>'development'/prototyping.
>
>
>Are there better ways of achieving this? I would assume that controlling the 
>windows RDD buckets would be a common use case.
>
>
>TIA
>Ian
>
>-- 
>
>Ian Holsman
>i...@holsman.com.au 
>PH: + 61-3-9028 8133 / +1-(425) 998-7083

Re: controlling the time in spark-streaming

Reply via email to