Thanks TD, that is very useful.

On Tue, Jul 14, 2015 at 10:19 PM, Tathagata Das <t...@databricks.com> wrote:
> You can do this.
>
> // global variables to keep track of the latest stuff
> var latestTime = _
> var latestRDD = _
>
> dstream.foreachRDD((rdd: RDD[..], time: Time) => {
>   latestTime = time
>   latestRDD = rdd
> })
>
> Now you can asynchronously access the latest RDD. However, if you are going
> to run jobs on the latest RDD, you must tell the streaming subsystem to
> keep the necessary data around for longer, otherwise it will get deleted
> even before the asynchronous query has completed. Use this:
>
> streamingContext.remember(<expected max duration of your async query on
> latest RDD>)
>
> On Tue, Jul 14, 2015 at 6:57 PM, Chen Song <chen.song...@gmail.com> wrote:
>
>> I have been working on a POC of adding a REST service to a Spark Streaming
>> job. Say I create a stateful DStream X by using updateStateByKey, and each
>> time there is an HTTP request, I want to apply some transformations/actions
>> on the latest RDD of X and collect the results immediately, not scheduled
>> by the streaming batch interval.
>>
>> * Is that even possible?
>> * The reason I think this might work is that a user can get a list of RDDs
>> via DStream.window.slice, but I cannot find a way to get the most recent
>> RDD in the DStream.
>>
>> --
>> Chen Song
>>
>

-- 
Chen Song
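For anyone finding this thread later, here is a minimal, self-contained sketch of the pattern TD describes: capture the latest RDD and its batch time in foreachRDD, call remember() so the data is not cleaned up, and query the captured RDD from another thread. The socket source, word-count state, object/variable names, batch interval, and remember duration are illustrative assumptions, not from the thread.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext, Time}

object LatestRddExample {
  // Globals holding the most recent batch; @volatile so the asynchronous
  // (e.g. HTTP handler) thread sees updates made by the streaming thread.
  @volatile var latestTime: Time = _
  @volatile var latestRDD: RDD[(String, Long)] = _

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("LatestRddExample")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/latest-rdd-example")  // required by updateStateByKey

    // Keep RDD data around long enough for the asynchronous query to finish.
    ssc.remember(Minutes(5))

    // Hypothetical stateful stream: running word counts over a socket source.
    val lines = ssc.socketTextStream("localhost", 9999)
    val state = lines.flatMap(_.split(" ")).map((_, 1L))
      .updateStateByKey[Long]((values, running) => Some(running.getOrElse(0L) + values.sum))

    // Record the latest RDD and its batch time on every batch.
    state.foreachRDD((rdd: RDD[(String, Long)], time: Time) => {
      latestTime = time
      latestRDD = rdd
    })

    ssc.start()

    // Stand-in for a REST handler: query the latest RDD outside the batch schedule.
    new Thread(new Runnable {
      def run(): Unit = {
        Thread.sleep(30000)
        val snapshot = latestRDD
        if (snapshot != null) {
          val top = snapshot.top(10)(Ordering.by[(String, Long), Long](_._2))
          println(s"Top keys at $latestTime: " + top.mkString(", "))
        }
      }
    }).start()

    ssc.awaitTermination()
  }
}
```

Copying the reference into a local val before querying it keeps the asynchronous job working against one consistent batch even if a new batch replaces latestRDD mid-query.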