What API differences are you talking about? A DStream gives a sequence
of RDDs. I'm not referring to the DStream or its API.

Spark in general can execute many pipelines at once, even ones that
refer to the same RDD. What I mean is: you seem to be looking for a way
to change one shared RDD, but in fact you simply create an RDD on top of
the current state of the data whenever and wherever you wish. Unless
you're caching the RDD's blocks, you don't have much need to share a
reference to one RDD anyway, which is what I thought you were getting
at.
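
For example, something like this per micro-batch (a minimal sketch
against the Spark 1.x Java API; "stream", "sc" and the HDFS path are
hypothetical, not from this thread):

  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.function.Function;

  // stream is a JavaDStream<String>, sc is the JavaSparkContext; the
  // foreachRDD function runs on the driver, so building a new RDD
  // here each batch is fine
  stream.foreachRDD(new Function<JavaRDD<String>, Void>() {
    public Void call(JavaRDD<String> streamRdd) {
      // a fresh RDD over the data's current state, created on demand
      JavaRDD<String> current = sc.textFile("hdfs:///data/reference");
      System.out.println(current.union(streamRdd).count());
      return null;
    }
  });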

On Wed, Apr 15, 2015 at 8:25 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:
> I keep seeing only general statements
>
> Re DStream RDDs and batch RDDs - there is certainly something to "keep me 
> from using them together", namely the OO API differences I have described 
> previously, several times ...
>
> Re the batch RDD reloading from file and that "there is no need for threads" 
> - the driver of a Spark Streaming app instantiates and submits a DAG pipeline 
> to the Spark Streaming cluster and keeps it alive while it is running - this 
> is not exactly linear execution where the main thread of the driver can 
> invoke the SparkContext method for loading batch RDDs from file e.g. a 
> second time, let alone after a specific period of time.
>
> -----Original Message-----
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Wednesday, April 15, 2015 8:14 PM
> To: Evo Eftimov
> Cc: user@spark.apache.org
> Subject: Re: adding new elements to batch RDD from DStream RDD
>
> Yes, I mean there's nothing to keep you from using them together other than 
> their very different lifetimes. That's probably the key here: if you need the 
> streaming data to live a long time, it has to live in persistent storage first.
>
> I do exactly what you describe, and for the same purpose.
> I don't believe there's any need for threads; an RDD is just bookkeeping 
> about partitions, and that has to be re-assessed when the underlying data 
> grows. But making a new RDD on the fly is easy. It's a "reference" to the 
> data only.
>
> (Well, that changes if you cache the results, in which case you very much 
> care about unpersisting the RDD before getting a different reference to all 
> of the same data and more.)
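>
> A minimal sketch of that unpersist-then-rebuild cycle (Java API;
> "refData" and the path are hypothetical names, not from this thread):
>
>   refData.unpersist();                              // drop the old cached blocks
>   refData = sc.textFile("hdfs:///data/reference");  // same data, and more
>   refData.cache();                                  // cache the new reference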
>
> On Wed, Apr 15, 2015 at 8:06 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:
>> Hi Sean, well there is certainly a difference between a "batch" RDD and a 
>> "streaming" RDD, and in the previous reply you have already outlined some of 
>> the differences. Others are in the object-oriented model / API of Spark, which 
>> also matters, besides the RDD / Spark cluster platform architecture.
>>
>> Secondly, in the previous email I have clearly described what I mean by
>> "update" and that it is the result of an RDD transformation and hence a new
>> RDD derived from the previously joined/unioned/cogrouped one - i.e. not
>> "mutating" an existing RDD.
>>
>> Let's also leave aside the architectural goal of why I want to keep updating a 
>> batch RDD with new data coming from DStream RDDs - FYI, it is NOT to "make 
>> streaming RDDs long-lived".
>>
>> Let me now go back to the overall objective - the app context is a Spark
>> Streaming job. I want to "update" / "add" the content of incoming
>> streaming RDDs (e.g. JavaDStream RDDs) to an already loaded (e.g. from an
>> HDFS file) batch RDD, e.g. a JavaRDD - the only way to union / join /
>> cogroup from a DStream RDD to a batch RDD is via the "transform" method,
>> which always returns a DStream RDD, NOT a batch RDD - check the API.
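>>
>> For reference, the pattern in question looks like this (a sketch
>> against the Spark 1.x Java API; "stream", "sc" and the path are
>> assumptions) - the union happens per batch, and the result is indeed
>> wrapped back into a DStream:
>>
>>   final JavaRDD<String> batchRdd = sc.textFile("hdfs:///data/reference");
>>   JavaDStream<String> combined = stream.transform(
>>     new Function<JavaRDD<String>, JavaRDD<String>>() {
>>       public JavaRDD<String> call(JavaRDD<String> streamRdd) {
>>         return streamRdd.union(batchRdd); // per-batch union only
>>       }
>>     });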
>>
>> On a separate note - your suggestion to keep reloading a batch RDD
>> from a file may have some applications in other scenarios, so let's
>> drill down into it - in the context of a Spark Streaming app, where the
>> driver launches a DAG pipeline and then essentially just hangs, I
>> guess the only way to keep reloading a batch RDD from a file is from a
>> separate thread, still using the same Spark context. The thread will
>> reload the batch RDD "with the same reference", i.e. reassign the
>> reference to the newly instantiated/loaded batch RDD - is that what you
>> mean by reloading a batch RDD from file?
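>>
>> Something like this, presumably (a sketch only; the SparkContext is
>> thread-safe for submitting jobs, but the field names, path and period
>> here are hypothetical):
>>
>>   import java.util.concurrent.Executors;
>>   import java.util.concurrent.TimeUnit;
>>
>>   static JavaSparkContext sc;              // the app's single context
>>   static volatile JavaRDD<String> refData; // the shared reference
>>
>>   static void startReloader() {
>>     Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
>>       new Runnable() {
>>         public void run() {
>>           // reassign the reference to a freshly loaded batch RDD
>>           refData = sc.textFile("hdfs:///data/reference");
>>         }
>>       }, 10, 10, TimeUnit.MINUTES);
>>   }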
>>
>> -----Original Message-----
>> From: Sean Owen [mailto:so...@cloudera.com]
>> Sent: Wednesday, April 15, 2015 7:43 PM
>> To: Evo Eftimov
>> Cc: user@spark.apache.org
>> Subject: Re: adding new elements to batch RDD from DStream RDD
>>
>> What do you mean by "batch RDD"? They're just RDDs, though they store their 
>> data in different ways and come from different sources. You can union an RDD 
>> from an HDFS file with one from a DStream.
>>
>> It sounds like you want streaming data to live longer than its batch 
>> interval, but that's not something you can expect the streaming framework to 
>> provide. It's perfectly possible to save the RDD's data to persistent store 
>> and use it later.
>>
>> You can't update RDDs; they're immutable. You can re-read data from 
>> persistent store by making a new RDD at any time.
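>>
>> E.g. (a sketch against the Spark 1.x Java API; the paths are
>> hypothetical): persist each micro-batch to HDFS, then re-read
>> everything written so far as a brand-new RDD whenever you need it:
>>
>>   stream.foreachRDD(new Function2<JavaRDD<String>, Time, Void>() {
>>     public Void call(JavaRDD<String> rdd, Time time) {
>>       if (rdd.count() > 0)
>>         rdd.saveAsTextFile("hdfs:///data/stream/" + time.milliseconds());
>>       return null;
>>     }
>>   });
>>
>>   // later, anywhere on the driver: a new RDD over all data so far
>>   JavaRDD<String> allSoFar = sc.textFile("hdfs:///data/stream/*");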
>>
>> On Wed, Apr 15, 2015 at 7:37 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:
>>> The only way to join / union / cogroup a DStream RDD with a batch RDD is
>>> via the "transform" method, which returns another DStream RDD, and hence
>>> the result gets discarded at the end of the micro-batch.
>>>
>>> Is there any way to e.g. union a DStream RDD with a batch RDD which
>>> produces a new batch RDD containing the elements of both the DStream
>>> RDD and the batch RDD?
>>>
>>> And once such a batch RDD is created in the above way, can it be used
>>> by other DStream RDDs to e.g. join with, as this time the result can
>>> be another DStream RDD?
>>>
>>> Effectively the functionality described above would result in
>>> periodic updates (additions) of elements to a batch RDD - the
>>> additional elements would keep coming from DStream RDDs which keep
>>> streaming in with every micro-batch.
>>> Also, newly arriving DStream RDDs would be able to join with the thus
>>> previously updated batch RDD and produce a result DStream RDD.
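>>>
>>> Roughly, what I am after is something like this (a sketch only, not a
>>> vetted pattern; the names are hypothetical, and the unbounded
>>> growth/lineage would need checkpointing or persistent storage to be
>>> safe): hold the batch RDD in a driver-side reference and grow it by
>>> union inside foreachRDD:
>>>
>>>   static volatile JavaRDD<String> batchRdd; // seeded from HDFS
>>>
>>>   stream.foreachRDD(new Function<JavaRDD<String>, Void>() {
>>>     public Void call(JavaRDD<String> streamRdd) {
>>>       JavaRDD<String> updated = batchRdd.union(streamRdd);
>>>       updated.cache();
>>>       updated.checkpoint();  // truncate the ever-growing lineage
>>>       batchRdd.unpersist();  // release the superseded reference
>>>       batchRdd = updated;    // later batches join against this
>>>       return null;
>>>     }
>>>   });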
>>>
>>> Something almost like that can be achieved with updateStateByKey, but
>>> is there a way to do it as described here?
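>>>
>>> For comparison, the updateStateByKey route (a sketch against the
>>> Spark 1.x Java API, which uses Guava's Optional; "pairs" and the
>>> running count are hypothetical):
>>>
>>>   JavaPairDStream<String, Long> totals = pairs.updateStateByKey(
>>>     new Function2<List<Long>, Optional<Long>, Optional<Long>>() {
>>>       public Optional<Long> call(List<Long> added, Optional<Long> old) {
>>>         long sum = old.or(0L);
>>>         for (Long v : added) sum += v;
>>>         return Optional.of(sum); // state carried to the next batch
>>>       }
>>>     });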
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/adding-new-elements-to-batch-RDD-from-DStream-RDD-tp22504.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
