That's great! Thanks! So, to sum up: to do some kind of "always up-to-date" lookup, we can use broadcast variables and re-broadcast when the data has changed, using either the "transform" RDD-to-RDD transformation, "foreachRDD", or "transformWith".
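For example, a minimal sketch of that pattern in the transform style (the LookupHolder object, its loadLookupTable/lookupVersion helpers and the Map-typed lookup data below are hypothetical placeholders for however the master data is stored and versioned, not actual Spark API):

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.dstream.DStream

object LookupHolder {
  @volatile private var current: Broadcast[Map[String, String]] = _
  @volatile private var lastVersion: Long = -1L

  // Hypothetical helpers: stand-ins for however you read the master data
  // and detect that it has changed (e.g. a timestamp or version column).
  private def lookupVersion(): Long = ???
  private def loadLookupTable(): Map[String, String] = ???

  // Return the current broadcast, re-broadcasting only when the data changed.
  def latest(sc: SparkContext): Broadcast[Map[String, String]] = synchronized {
    val v = lookupVersion()
    if (current == null || v != lastVersion) {
      if (current != null) current.unpersist()  // old copies become eligible for cleanup
      current = sc.broadcast(loadLookupTable())
      lastVersion = v
    }
    current
  }
}

// transform runs its closure on the driver every batch, so it always picks up
// the freshest broadcast before the RDD operations run on the cluster.
def enrich(stream: DStream[(String, String)], sc: SparkContext): DStream[(String, String, String)] =
  stream.transform { rdd =>
    val lookup = LookupHolder.latest(sc)
    rdd.map { case (k, v) => (k, v, lookup.value.getOrElse(k, "unknown")) }
  }

The same holder would work from foreachRDD or transformWith; the key point is that the broadcast is fetched or re-created in driver-side code, not inside the RDD closures.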
Thank you for your time.

Regards,

2015-10-05 23:49 GMT+02:00 Tathagata Das <t...@databricks.com>:

Yes, when old broadcast objects are no longer referenced in the driver, the associated data in the driver AND on the executors will get cleared.

On Mon, Oct 5, 2015 at 1:40 PM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:

@td does that mean that the "old" broadcast data will in some way be "garbage collected" at some point, if no RDD or transformation is using it anymore?

Regards,

Olivier.

2015-04-09 21:49 GMT+02:00 Amit Assudani <aassud...@impetus.com>:

Thanks a lot TD for the detailed answers. The answers lead to a few more questions:

1. "the transform RDD-to-RDD function runs on the driver" - I didn't understand this. Does it mean that when I use the transform function on a DStream, it is not parallelized? Surely I'm missing something here.

2. updateStateByKey I think won't work in this use case: I have three separate attribute streams (with different frequencies) that make up the combined state (i.e. the entity) at a point in time, on which I want to do some processing. Do you think otherwise?

3. transform+join seems the only option so far, but any guesstimate on how this would perform / react on the cluster? Assume master data in 100s of GBs, and a join based on some row key; we are talking about a slice of stream data joined with 100s of GBs of master data, continuously. Is it something that can be done but should not be done?

Regards,
Amit

From: Tathagata Das <t...@databricks.com>
Date: Thursday, April 9, 2015 at 3:13 PM
To: amit assudani <aassud...@impetus.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Lookup / Access of master data in spark streaming

Responses inline. Hope they help.

On Thu, Apr 9, 2015 at 8:20 AM, Amit Assudani <aassud...@impetus.com> wrote:

Hi Friends,

I am trying to solve a use case in Spark Streaming, and I need help on getting to the right approach for looking up / updating the master data.

Use case (simplified): I have a dataset of entities, each with three attributes and an identifier/row key, in a persistent store.

Each attribute, along with the row key, comes from a different stream - effectively 3 source streams.

Now, whenever any attribute comes in, I want to update/sync the persistent store and do some processing, but the processing requires the latest state of the entity, with the latest values of all three attributes.

I wish I had all the entities cached in some sort of centralized cache (like we have data in HDFS) within Spark Streaming, which could be used for data-local processing, but I assume there is no such thing.

Potential approaches I am thinking of (I suspect the first two are not feasible, but I want to confirm):

1. Are broadcast variables mutable? If yes, can I use one as a cache for all the entities, sizing around 100s of GBs, provided I have a cluster with enough RAM?

Broadcast variables are not mutable. But you can always create a new broadcast variable when you want, and use the "latest" broadcast variable in your computation.

dstream.transform { rdd =>
  val latestBroadcast = getLatestBroadcastVariable() // fetch existing, or update + create new and return
  val transformedRDD = rdd. ...... // use latestBroadcast in RDD transformations
  transformedRDD
}

Since the transform RDD-to-RDD function runs on the driver every batch interval, it will always use the latest broadcast variable that you want. Though note that whenever you create a new broadcast, the next batch may take a little longer, as the data needs to actually be broadcast out. That can also be made asynchronous by running a simple task (to force the broadcasting out) on any new broadcast variable in a thread separate from the Spark Streaming batch schedule, but using the same underlying SparkContext.
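A rough sketch of that asynchronous warm-up, assuming nothing beyond a plain background thread (an illustration, not a prescribed API):

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Kick off a trivial job that touches the new broadcast value, from a thread
// separate from the streaming batches but on the same SparkContext, so the
// broadcast blocks are (mostly) already distributed when the next batch needs them.
def warmUpBroadcast[T](sc: SparkContext, b: Broadcast[T]): Unit = {
  new Thread(new Runnable {
    override def run(): Unit = {
      sc.parallelize(1 to sc.defaultParallelism, sc.defaultParallelism)
        .foreach { _ => b.value; () }  // accessing b.value pulls the blocks to the executor running the task
    }
  }).start()
}

This only warms up the executors that happen to run those tasks, but it moves most of the distribution cost off the next batch's critical path.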
2. Is there any kind of sticky partitioning possible, so that I can route my stream data through the same node where I have the corresponding entities (a subset of the entire store) cached in memory within the JVM / off-heap on that node? This would avoid lookups from the store.

You could use updateStateByKey. That is quite sticky, but it does not eliminate the possibility that it can run on a different node. In fact, this is necessary for fault tolerance - what if the node it was supposed to run on goes down? The task will be run on a different node, and you have to design your application such that it can handle that.

3. If I stream the entities from the persistent store into the engine, this becomes a 4th stream - the entity stream. How do I use join / merge to enable streams 1, 2 and 3 to look up and update the data from stream 4? Would DStream.join work for a few seconds' worth of data in the attribute streams against all the data in the entity stream? Or do I use transform and, within that, an RDD join? I have doubts whether I am leaning towards a core-Spark approach inside Spark Streaming.

Depends on what kind of join! If you want to join every batch of a stream with a static data set (or a rarely updated dataset), then transform+join is the way to go. If you want to join one stream with a window of data from another stream, then DStream.join is the way to go.
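For the 100s-of-GBs master-data case, a minimal sketch of what transform+join could look like (the loadMasterData helper, partition count and storage level below are assumptions for illustration, not recommendations from this thread):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream

case class Entity(attr1: String, attr2: String, attr3: String)

// Hypothetical loader for the master dataset, keyed by row key.
def loadMasterData(): RDD[(String, Entity)] = ???

def joinWithMaster(updates: DStream[(String, String)]): DStream[(String, (String, Entity))] = {
  val partitioner = new HashPartitioner(200)  // placeholder; size to the cluster

  // Partition and cache the master data once, so each batch only shuffles the
  // (much smaller) slice of stream data over to the master data's partitions.
  val master = loadMasterData()
    .partitionBy(partitioner)
    .persist(StorageLevel.MEMORY_AND_DISK)

  updates.transform { rdd =>
    rdd.partitionBy(partitioner).join(master)  // join on the row key
  }
}

If the master data is refreshed occasionally, the cached RDD can be rebuilt in driver-side code the same way the broadcast is refreshed above; the per-batch join stays the same.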
4. The last approach, which I think will surely work but which I want to avoid, is to keep the entities in an IMDB and do lookup/update calls on it from streams 1, 2 and 3.

Any help is deeply appreciated, as this would help me design my system efficiently, and the solution approach may become a beacon for "look up master data" sorts of problems.

Regards,
Amit

--
Olivier Girardot | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94