Yep, you are looking at operations on DStream, which is not what I'm
talking about. You should look at DStream.foreachRDD (or the Java
equivalent), which hands you an RDD. Does that make more sense?
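
Roughly like this, for example - a minimal sketch only, Spark 1.x Java API,
with made-up paths and a socket source; adapt to your job:

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;
  import org.apache.spark.streaming.Duration;
  import org.apache.spark.streaming.api.java.JavaDStream;
  import org.apache.spark.streaming.api.java.JavaStreamingContext;

  SparkConf conf = new SparkConf().setAppName("union-sketch");
  JavaSparkContext sc = new JavaSparkContext(conf);
  JavaStreamingContext jssc = new JavaStreamingContext(sc, new Duration(10000));

  JavaRDD<String> batchRdd = sc.textFile("hdfs:///data/reference");       // batch side
  JavaDStream<String> stream = jssc.socketTextStream("localhost", 9999);  // stream side

  stream.foreachRDD(rdd -> {
      // 'rdd' is a plain JavaRDD for this micro-batch, so it can be
      // unioned/joined with the batch RDD directly - no transform() involved
      JavaRDD<String> combined = batchRdd.union(rdd);
      combined.saveAsTextFile("hdfs:///tmp/combined-" + System.currentTimeMillis());
      return null; // foreachRDD takes a Function<JavaRDD<T>, Void> in the 1.x API
  });

  jssc.start();
  jssc.awaitTermination();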

The rest may make more sense when you try it. There is actually a lot
less complexity than you think.

On Wed, Apr 15, 2015 at 8:37 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:
> The OO API in question was mentioned several times - the transform method
> of DStream, which is the ONLY way to join/cogroup/union a DStream RDD with
> a batch RDD aka JavaRDD.
>
> Here is a paste from the Spark javadoc:
>
> <K2,V2> JavaPairDStream<K2,V2> transformToPair(Function<R, JavaPairRDD<K2,V2>> transformFunc)
> Return a new DStream in which each RDD is generated by applying a function on 
> each RDD of 'this' DStream.
>
> As you can see, it ALWAYS returns a DStream, NOT a JavaRDD aka batch RDD.
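>
> For illustration (a rough sketch with made-up variables - a
> JavaPairRDD<String, String> 'batchRdd' and a JavaPairDStream<String, String>
> 'streamPairs'):
>
>   JavaPairDStream<String, Tuple2<String, String>> joined =        // scala.Tuple2
>       streamPairs.transformToPair(rdd -> rdd.join(batchRdd));
>   // 'joined' is again a DStream - there is no overload that hands back a JavaRDD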
>
> Re the rest of the discussion (re-loading a batch RDD from file within a
> Spark Streaming context) - let's leave that since we are not getting anywhere.
>
> -----Original Message-----
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Wednesday, April 15, 2015 8:30 PM
> To: Evo Eftimov
> Cc: user@spark.apache.org
> Subject: Re: adding new elements to batch RDD from DStream RDD
>
> What API differences are you talking about? A DStream gives a sequence of
> RDDs. I'm not referring to DStream or its API.
>
> Spark in general can execute many pipelines at once, even ones that refer to
> the same RDD. What I mean is: you seem to be looking for a way to change one
> shared RDD, but in fact you simply create an RDD on top of the current state
> of the data whenever and wherever you wish. Unless you're caching the RDD's
> blocks, you don't have much need to share a reference to one RDD anyway,
> which is what I thought you were getting at.
>
> On Wed, Apr 15, 2015 at 8:25 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:
>> I keep seeing only general statements.
>>
>> Re DStream RDDs and batch RDDs - there is certainly something to "keep me
>> from using them together", and it is the OO API differences I have described
>> previously, several times ...
>>
>> Re the batch RDD reloading from file and that "there is no need for
>> threads" - the driver of a Spark Streaming app instantiates and submits a
>> DAG pipeline to the Spark Streaming cluster and keeps it alive while it is
>> running. This is not exactly a linear execution where the main thread of
>> the driver can invoke the spark context method for loading batch RDDs from
>> file a second time, e.g. after a specific period of time.
>>
>> -----Original Message-----
>> From: Sean Owen [mailto:so...@cloudera.com]
>> Sent: Wednesday, April 15, 2015 8:14 PM
>> To: Evo Eftimov
>> Cc: user@spark.apache.org
>> Subject: Re: adding new elements to batch RDD from DStream RDD
>>
>> Yes, I mean there's nothing to keep you from using them together other than 
>> their very different lifetime. That's probably the key here: if you need the 
>> streaming data to live a long time it has to live in persistent storage 
>> first.
>>
>> I do exactly this - what you describe - and for the same purpose.
>> I don't believe there's any need for threads; an RDD is just bookkeeping 
>> about partitions, and that has to be re-assessed when the underlying data 
>> grows. But making a new RDD on the fly is easy. It's a "reference" to the 
>> data only.
>>
>> (Well, that changes if you cache the results, in which case you very
>> much care about unpersisting the RDD before getting a different
>> reference to all of the same data and more.)
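>>
>> For example, something like this in the driver program (a rough sketch,
>> hypothetical path, assuming the reference data is cached):
>>
>>   // given a JavaSparkContext 'sc' shared with the streaming job
>>   JavaRDD<String> reference = sc.textFile("hdfs:///data/reference").cache();
>>
>>   // later, when the underlying file has grown, drop the old cached blocks
>>   // and simply make a new RDD over the same path
>>   reference.unpersist();
>>   reference = sc.textFile("hdfs:///data/reference").cache();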
>>
>>
>>
>>
>> On Wed, Apr 15, 2015 at 8:06 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:
>>> Hi Sean, well, there is certainly a difference between a "batch" RDD and a
>>> "streaming" RDD, and in the previous reply you have already outlined some of
>>> it. Other differences are in the Object Oriented Model / API of Spark, which
>>> also matters, besides the RDD / Spark cluster platform architecture.
>>>
>>> Secondly, in the previous email I have clearly described what I mean by
>>> "update" and that it is the result of an RDD transformation and hence a
>>> new RDD derived from the previously joined/unioned/cogrouped one - i.e.
>>> not "mutating" an existing RDD.
>>>
>>> Let's also leave aside the architectural goal of why I want to keep updating
>>> a batch RDD with new data coming from DStream RDDs - FYI it is NOT to "make
>>> streaming RDDs long-living".
>>>
>>> Let me now go back to the overall objective - the app context is a Spark
>>> Streaming job. I want to "update" / "add" the content of incoming streaming
>>> RDDs (e.g. JavaDStream RDDs) to an already loaded (e.g. from an HDFS file)
>>> batch RDD, e.g. a JavaRDD - the only way to union / join / cogroup from a
>>> DStream RDD to a batch RDD is via the "transform" method, which always
>>> returns a DStream RDD, NOT a batch RDD - check the API.
>>>
>>> On a separate note - your suggestion to keep reloading a batch RDD from a
>>> file may have some applications in other scenarios, so let's drill down into
>>> it. In the context of a Spark Streaming app, where the driver launches a DAG
>>> pipeline and then just essentially hangs, I guess the only way to keep
>>> reloading a batch RDD from file is from a separate thread, still using the
>>> same spark context. The thread would reload the batch RDD into the same
>>> reference, i.e. reassign the reference to the newly instantiated/loaded
>>> batch RDD - is that what you mean by reloading a batch RDD from file?
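>>>
>>> In code, the kind of thing I have in mind is roughly this (just a sketch,
>>> made-up path and interval, java.util.concurrent for the scheduling, 'sc'
>>> being the job's JavaSparkContext):
>>>
>>>   AtomicReference<JavaRDD<String>> batchRef =
>>>       new AtomicReference<>(sc.textFile("hdfs:///data/reference").cache());
>>>
>>>   ScheduledExecutorService reloader = Executors.newSingleThreadScheduledExecutor();
>>>   reloader.scheduleAtFixedRate(() -> {
>>>       JavaRDD<String> old = batchRef.get();
>>>       batchRef.set(sc.textFile("hdfs:///data/reference").cache());  // reload
>>>       old.unpersist();                                              // drop stale blocks
>>>   }, 10, 10, TimeUnit.MINUTES);
>>>
>>>   // the streaming side would call batchRef.get() whenever it needs the latest data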
>>>
>>> -----Original Message-----
>>> From: Sean Owen [mailto:so...@cloudera.com]
>>> Sent: Wednesday, April 15, 2015 7:43 PM
>>> To: Evo Eftimov
>>> Cc: user@spark.apache.org
>>> Subject: Re: adding new elements to batch RDD from DStream RDD
>>>
>>> What do you mean by "batch RDD"? They're just RDDs, though they may store
>>> their data in different ways and come from different sources. You can union
>>> an RDD from an HDFS file with one from a DStream.
>>>
>>> It sounds like you want streaming data to live longer than its batch 
>>> interval, but that's not something you can expect the streaming framework 
>>> to provide. It's perfectly possible to save the RDD's data to persistent 
>>> store and use it later.
>>>
>>> You can't update RDDs; they're immutable. You can re-read data from 
>>> persistent store by making a new RDD at any time.
>>>
>>> On Wed, Apr 15, 2015 at 7:37 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:
>>>> The only way to join / union / cogroup a DStream RDD with a Batch RDD
>>>> is via the "transform" method, which returns another DStream RDD and
>>>> hence gets discarded at the end of the micro-batch.
>>>>
>>>> Is there any way to e.g. union a DStream RDD with a Batch RDD which
>>>> produces a new Batch RDD containing the elements of both the DStream
>>>> RDD and the Batch RDD?
>>>>
>>>> And once such a Batch RDD is created in the above way, can it be used
>>>> by other DStream RDDs to e.g. join with, as this time the result can
>>>> be another DStream RDD?
>>>>
>>>> Effectively the functionality described above will result in
>>>> periodic updates (additions) of elements to a Batch RDD - the
>>>> additional elements will keep coming from DStream RDDs which keep
>>>> streaming in with every micro-batch.
>>>> Also, newly arriving DStream RDDs will be able to join with the thus
>>>> previously updated Batch RDD and produce a result DStream RDD.
>>>>
>>>> Something almost like that can be achieved with updateStateByKey, but is
>>>> there a way to do it as described here?
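>>>>
>>>> (For reference, the updateStateByKey shape I mean - a rough sketch of a
>>>> running count per key over a hypothetical JavaPairDStream<String, Integer>
>>>> 'pairs', with 'jssc' the streaming context, using the Guava Optional of the
>>>> Spark 1.x Java API; it needs a checkpoint dir and keeps per-key state rather
>>>> than giving back a Batch RDD:)
>>>>
>>>>   jssc.checkpoint("hdfs:///tmp/checkpoints");   // required for stateful ops
>>>>   JavaPairDStream<String, Integer> counts = pairs.updateStateByKey(
>>>>       (List<Integer> values, Optional<Integer> state) -> {
>>>>           int sum = state.or(0);                // Guava Optional.or in Spark 1.x
>>>>           for (int v : values) sum += v;
>>>>           return Optional.of(sum);
>>>>       });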
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/adding-new-elements-to-batch-RDD-from-DStream-RDD-tp22504.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
