Thanks TD, that is very useful.

On Tue, Jul 14, 2015 at 10:19 PM, Tathagata Das <t...@databricks.com> wrote:

> You can do this:
>
> // Global variables to track the latest batch. The explicit type
> // annotations are required for the `_` default initializer to compile;
> // substitute your DStream's element type for the wildcard.
> var latestTime: Time = _
> var latestRDD: RDD[_] = _
>
> dstream.foreachRDD((rdd: RDD[_], time: Time) => {
>   latestTime = time
>   latestRDD = rdd
> })
>
> Now you can asynchronously access the latest RDD. However, if you are going
> to run jobs on the latest RDD, you must tell the streaming subsystem to
> keep the necessary data around for longer; otherwise it may get deleted
> before the asynchronous query has completed. Use this:
>
> streamingContext.remember(<expected max duration of your async query on
> latest RDD>)
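>
> For instance, a minimal sketch (the 5-minute bound and the handleLatest
> name are just assumptions for illustration; Minutes comes from
> org.apache.spark.streaming):
>
> import org.apache.spark.streaming.Minutes
>
> // keep batch data around long enough for the async query to finish
> streamingContext.remember(Minutes(5))
>
> // e.g. a request-handler thread can then run a job on the latest RDD,
> // outside the batch schedule:
> def handleLatest(): Long = {
>   val rdd = latestRDD  // snapshot the reference before using it
>   rdd.count()
> }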
>
>
> On Tue, Jul 14, 2015 at 6:57 PM, Chen Song <chen.song...@gmail.com> wrote:
>
>> I have been doing a POC of adding a REST service to a Spark Streaming
>> job. Say I create a stateful DStream X using updateStateByKey; each time
>> there is an HTTP request, I want to apply some transformations/actions on
>> the latest RDD of X and collect the results immediately, rather than on
>> the streaming batch interval schedule.
>>
>> * Is that even possible?
>> * The reason I ask is that a user can get a list of RDDs via
>> DStream.window.slice, but I cannot find a way to get the most recent RDD
>> in the DStream.
>>
>>
>> --
>> Chen Song
>>
>>
>


-- 
Chen Song