Well its hard to use text data as time of input.
But if you are adament here's what you would do.
Have a Dstream object which works in on a folder using filestream/textstream
Then have another process (spark streaming or cron) read through the files
you receive & push them into the folder in order of time. Mostly your data
would be produced at t, you would get it at t + say 5 sec, & you can push
it in & get processed at t + say 10 sec. Then you can timeshift your
calculations. If you are okay with broad enough time frame you should be
fine.

Another way is to use queue processing.
Queue<JavaRDD<Integer>> rddQueue = new LinkedList<JavaRDD<Integer>>();
Create a Dstream to consume from Queue of RDD, keep looking into the folder
of files & create rdd's from them at a min level & push them into thee
queue. This would cause you to go through your data atleast twice & just
provide order guarantees , processing time is still going to vary with how
quickly you can process your RDD.

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Thu, May 22, 2014 at 9:08 PM, Ian Holsman <i...@holsman.com.au> wrote:

> Hi.
>
> I'm writing a pilot project, and plan on using spark's streaming app for
> it.
>
> To start with I have a dump of some access logs with their own timestamps,
> and am using the textFileStream and some old files to test it with.
>
> One of the issues I've come across is simulating the windows. I would like
> use the timestamp from the access logs as the 'system time' instead of the
> real clock time.
>
> I googled a bit and found the 'manual' clock which appears to be used for
> testing the job scheduler.. but I'm not sure what my next steps should be.
>
> I'm guessing I'll need to do something like
>
> 1. use the textFileStream to create a 'DStream'
> 2. have some kind of DStream that runs on top of that that creates the
> RDDs based on the timestamps Instead of the system time
> 3. the rest of my mappers.
>
> Is this correct? or do I need to create my own 'textFileStream' to
> initially create the RDDs and modify the system clock inside of it.
>
> I'm not too concerned about out-of-order messages, going backwards in
> time, or being 100% in sync across workers.. as this is more for
> 'development'/prototyping.
>
> Are there better ways of achieving this? I would assume that controlling
> the windows RDD buckets would be a common use case.
>
> TIA
> Ian
>
> --
> Ian Holsman
> i...@holsman.com.au
> PH: + 61-3-9028 8133 / +1-(425) 998-7083
>

Reply via email to