That's great! Thanks! So, to sum up: to do some kind of "always up-to-date" lookup, we can use broadcast variables and re-broadcast when the data has changed, using either the "transform" RDD-to-RDD transformation, "foreachRDD", or "transformWith".
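For example, a minimal sketch of that pattern in the transform style (the LookupHolder object, its loadLookupTable/lookupVersion helpers and the Map-typed lookup data below are hypothetical placeholders for however the master data is stored and versioned, not actual Spark API):

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.dstream.DStream

object LookupHolder {
  @volatile private var current: Broadcast[Map[String, String]] = _
  @volatile private var lastVersion: Long = -1L

  // Hypothetical helpers: stand-ins for however you read the master data
  // and detect that it has changed (e.g. a timestamp or version column).
  private def lookupVersion(): Long = ???
  private def loadLookupTable(): Map[String, String] = ???

  // Return the current broadcast, re-broadcasting only when the data changed.
  def latest(sc: SparkContext): Broadcast[Map[String, String]] = synchronized {
    val v = lookupVersion()
    if (current == null || v != lastVersion) {
      if (current != null) current.unpersist()  // old copies become eligible for cleanup
      current = sc.broadcast(loadLookupTable())
      lastVersion = v
    }
    current
  }
}

// transform runs its closure on the driver every batch, so it always picks up
// the freshest broadcast before the RDD operations run on the cluster.
def enrich(stream: DStream[(String, String)], sc: SparkContext): DStream[(String, String, String)] =
  stream.transform { rdd =>
    val lookup = LookupHolder.latest(sc)
    rdd.map { case (k, v) => (k, v, lookup.value.getOrElse(k, "unknown")) }
  }

The same holder would work from foreachRDD or transformWith; the key point is that the broadcast is fetched or re-created in driver-side code, not inside the RDD closures.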
Thank you for your time.

Regards,

2015-10-05 23:49 GMT+02:00 Tathagata Das <t...@databricks.com>:

Yes, when old broadcast objects are no longer referenced in the driver, the associated data in the driver AND on the executors will get cleared.

On Mon, Oct 5, 2015 at 1:40 PM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:

@td does that mean that the "old" broadcast data will in some way be "garbage collected" at some point, if no RDD or transformation is using it anymore?

Regards,

Olivier.

2015-04-09 21:49 GMT+02:00 Amit Assudani <aassud...@impetus.com>:

Thanks a lot TD for the detailed answers. The answers lead to a few more questions:

1. "the transform RDD-to-RDD function runs on the driver" - I didn't understand this. Does it mean that when I use the transform function on a DStream, it is not parallelized? Surely I'm missing something here.

2. updateStateByKey I think won't work in this use case: I have three separate attribute streams (with different frequencies) that make up the combined state (i.e. the entity) at a point in time, on which I want to do some processing. Do you think otherwise?

3. transform+join seems the only option so far, but any guesstimate on how this would perform / react on the cluster? Assume master data in 100s of GBs, and a join based on some row key; we are talking about a slice of stream data joined with 100s of GBs of master data, continuously. Is it something that can be done but should not be done?

Regards,
Amit

From: Tathagata Das <t...@databricks.com>
Date: Thursday, April 9, 2015 at 3:13 PM
To: amit assudani <aassud...@impetus.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Lookup / Access of master data in spark streaming

Responses inline. Hope they help.

On Thu, Apr 9, 2015 at 8:20 AM, Amit Assudani <aassud...@impetus.com> wrote:

Hi Friends,

I am trying to solve a use case in Spark Streaming, and I need help on getting to the right approach for looking up / updating the master data.

Use case (simplified): I have a dataset of entities, each with three attributes and an identifier/row key, in a persistent store.

Each attribute, along with the row key, comes from a different stream - effectively 3 source streams.

Now, whenever any attribute comes in, I want to update/sync the persistent store and do some processing, but the processing requires the latest state of the entity, with the latest values of all three attributes.

I wish I had all the entities cached in some sort of centralized cache (like we have data in HDFS) within Spark Streaming, which could be used for data-local processing, but I assume there is no such thing.

Potential approaches I am thinking of (I suspect the first two are not feasible, but I want to confirm):

1. Are broadcast variables mutable? If yes, can I use one as a cache for all the entities, sizing around 100s of GBs, provided I have a cluster with enough RAM?

Broadcast variables are not mutable. But you can always create a new broadcast variable when you want, and use the "latest" broadcast variable in your computation.

dstream.transform { rdd =>
  val latestBroadcast = getLatestBroadcastVariable() // fetch existing, or update + create new and return
  val transformedRDD = rdd. ...... // use latestBroadcast in RDD transformations
  transformedRDD
}

Since the transform RDD-to-RDD function runs on the driver every batch interval, it will always use the latest broadcast variable that you want. Though note that whenever you create a new broadcast, the next batch may take a little longer, as the data needs to actually be broadcast out. That can also be made asynchronous by running a simple task (to force the broadcasting out) on any new broadcast variable in a thread separate from the Spark Streaming batch schedule, but using the same underlying SparkContext.
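A rough sketch of that asynchronous warm-up, assuming nothing beyond a plain background thread (an illustration, not a prescribed API):

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Kick off a trivial job that touches the new broadcast value, from a thread
// separate from the streaming batches but on the same SparkContext, so the
// broadcast blocks are (mostly) already distributed when the next batch needs them.
def warmUpBroadcast[T](sc: SparkContext, b: Broadcast[T]): Unit = {
  new Thread(new Runnable {
    override def run(): Unit = {
      sc.parallelize(1 to sc.defaultParallelism, sc.defaultParallelism)
        .foreach { _ => b.value; () }  // accessing b.value pulls the blocks to the executor running the task
    }
  }).start()
}

This only warms up the executors that happen to run those tasks, but it moves most of the distribution cost off the next batch's critical path.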
2. Is there any kind of sticky partitioning possible, so that I can route my stream data through the same node where I have the corresponding entities (a subset of the entire store) cached in memory within the JVM / off-heap on that node? This would avoid lookups from the store.

You could use updateStateByKey. That is quite sticky, but it does not eliminate the possibility that it can run on a different node. In fact, this is necessary for fault tolerance - what if the node it was supposed to run on goes down? The task will be run on a different node, and you have to design your application such that it can handle that.

3. If I stream the entities from the persistent store into the engine, this becomes a 4th stream - the entity stream. How do I use join / merge to enable streams 1, 2 and 3 to look up and update the data from stream 4? Would DStream.join work for a few seconds' worth of data in the attribute streams against all the data in the entity stream? Or do I use transform and, within that, an RDD join? I have doubts whether I am leaning towards a core-Spark approach inside Spark Streaming.

Depends on what kind of join! If you want to join every batch of a stream with a static data set (or a rarely updated dataset), then transform+join is the way to go. If you want to join one stream with a window of data from another stream, then DStream.join is the way to go.
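For the 100s-of-GBs master-data case, a minimal sketch of what transform+join could look like (the loadMasterData helper, partition count and storage level below are assumptions for illustration, not recommendations from this thread):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream

case class Entity(attr1: String, attr2: String, attr3: String)

// Hypothetical loader for the master dataset, keyed by row key.
def loadMasterData(): RDD[(String, Entity)] = ???

def joinWithMaster(updates: DStream[(String, String)]): DStream[(String, (String, Entity))] = {
  val partitioner = new HashPartitioner(200)  // placeholder; size to the cluster

  // Partition and cache the master data once, so each batch only shuffles the
  // (much smaller) slice of stream data over to the master data's partitions.
  val master = loadMasterData()
    .partitionBy(partitioner)
    .persist(StorageLevel.MEMORY_AND_DISK)

  updates.transform { rdd =>
    rdd.partitionBy(partitioner).join(master)  // join on the row key
  }
}

If the master data is refreshed occasionally, the cached RDD can be rebuilt in driver-side code the same way the broadcast is refreshed above; the per-batch join stays the same.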
4. The last approach, which I think will surely work but which I want to avoid, is to keep the entities in an IMDB and do lookup/update calls on it from streams 1, 2 and 3.

Any help is deeply appreciated, as this would help me design my system efficiently, and the solution approach may become a beacon for "look up master data" sorts of problems.

Regards,
Amit

--
Olivier Girardot | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94