Hi, For QueueRDD, have a look here. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala
Regards, Laeeq On Friday, May 23, 2014 10:33 AM, Mayur Rustagi <[email protected]> wrote: Well its hard to use text data as time of input. But if you are adament here's what you would do. Have a Dstream object which works in on a folder using filestream/textstream Then have another process (spark streaming or cron) read through the files you receive & push them into the folder in order of time. Mostly your data would be produced at t, you would get it at t + say 5 sec, & you can push it in & get processed at t + say 10 sec. Then you can timeshift your calculations. If you are okay with broad enough time frame you should be fine. Another way is to use queue processing. Queue<JavaRDD<Integer>> rddQueue = new LinkedList<JavaRDD<Integer>>(); Create a Dstream to consume from Queue of RDD, keep looking into the folder of files & create rdd's from them at a min level & push them into thee queue. This would cause you to go through your data atleast twice & just provide order guarantees , processing time is still going to vary with how quickly you can process your RDD. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, May 22, 2014 at 9:08 PM, Ian Holsman <[email protected]> wrote: Hi. > > >I'm writing a pilot project, and plan on using spark's streaming app for it. > > >To start with I have a dump of some access logs with their own timestamps, and >am using the textFileStream and some old files to test it with. > > >One of the issues I've come across is simulating the windows. I would like use >the timestamp from the access logs as the 'system time' instead of the real >clock time. > > >I googled a bit and found the 'manual' clock which appears to be used for >testing the job scheduler.. but I'm not sure what my next steps should be. > > >I'm guessing I'll need to do something like > > >1. use the textFileStream to create a 'DStream' >2. have some kind of DStream that runs on top of that that creates the RDDs >based on the timestamps Instead of the system time >3. the rest of my mappers. > > >Is this correct? or do I need to create my own 'textFileStream' to initially >create the RDDs and modify the system clock inside of it. > > >I'm not too concerned about out-of-order messages, going backwards in time, or >being 100% in sync across workers.. as this is more for >'development'/prototyping. > > >Are there better ways of achieving this? I would assume that controlling the >windows RDD buckets would be a common use case. > > >TIA >Ian > >-- > >Ian Holsman >[email protected] >PH: + 61-3-9028 8133 / +1-(425) 998-7083
