Well its hard to use text data as time of input. But if you are adament here's what you would do. Have a Dstream object which works in on a folder using filestream/textstream Then have another process (spark streaming or cron) read through the files you receive & push them into the folder in order of time. Mostly your data would be produced at t, you would get it at t + say 5 sec, & you can push it in & get processed at t + say 10 sec. Then you can timeshift your calculations. If you are okay with broad enough time frame you should be fine.
Another way is to use queue processing. Queue<JavaRDD<Integer>> rddQueue = new LinkedList<JavaRDD<Integer>>(); Create a Dstream to consume from Queue of RDD, keep looking into the folder of files & create rdd's from them at a min level & push them into thee queue. This would cause you to go through your data atleast twice & just provide order guarantees , processing time is still going to vary with how quickly you can process your RDD. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi <https://twitter.com/mayur_rustagi> On Thu, May 22, 2014 at 9:08 PM, Ian Holsman <i...@holsman.com.au> wrote: > Hi. > > I'm writing a pilot project, and plan on using spark's streaming app for > it. > > To start with I have a dump of some access logs with their own timestamps, > and am using the textFileStream and some old files to test it with. > > One of the issues I've come across is simulating the windows. I would like > use the timestamp from the access logs as the 'system time' instead of the > real clock time. > > I googled a bit and found the 'manual' clock which appears to be used for > testing the job scheduler.. but I'm not sure what my next steps should be. > > I'm guessing I'll need to do something like > > 1. use the textFileStream to create a 'DStream' > 2. have some kind of DStream that runs on top of that that creates the > RDDs based on the timestamps Instead of the system time > 3. the rest of my mappers. > > Is this correct? or do I need to create my own 'textFileStream' to > initially create the RDDs and modify the system clock inside of it. > > I'm not too concerned about out-of-order messages, going backwards in > time, or being 100% in sync across workers.. as this is more for > 'development'/prototyping. > > Are there better ways of achieving this? I would assume that controlling > the windows RDD buckets would be a common use case. > > TIA > Ian > > -- > Ian Holsman > i...@holsman.com.au > PH: + 61-3-9028 8133 / +1-(425) 998-7083 >