Responses inline. Hope they help.

On Thu, Apr 9, 2015 at 8:20 AM, Amit Assudani <aassud...@impetus.com> wrote:
> Hi Friends,
>
> I am trying to solve a use case in Spark Streaming. I need help on
> getting to the right approach for looking up / updating the master data.
>
> Use case (simplified):
> I have a dataset of entities, each with three attributes and an
> identifier/row key, in a persistent store.
>
> Each attribute, along with the row key, comes from a different stream --
> effectively 3 source streams.
>
> Whenever any attribute comes up, I want to update/sync the persistent
> store and do some processing, but the processing requires the latest
> state of the entity, with the latest values of all three attributes.
>
> I wish I had all the entities cached in some sort of centralized cache
> (like we have data in HDFS) within Spark Streaming, which could be used
> for data-local processing. But I assume there is no such thing.
>
> Potential approaches I am thinking of -- I suspect the first two are not
> feasible, but I want to confirm:
>
> 1. Are broadcast variables mutable? If yes, can I use one as a cache for
> all the entities, sizing around 100s of GBs, provided I have a cluster
> with enough RAM?

Broadcast variables are not mutable. But you can always create a new
broadcast variable when you want, and use the "latest" broadcast variable
in your computation:

    dstream.transform { rdd =>
      // fetch the existing broadcast variable, or update + create a new one and return it
      val latestBroadcast = getLatestBroadcastVariable()
      val transformedRDD = rdd. ...... // use latestBroadcast in the RDD transformations
      transformedRDD
    }

Since the transform RDD-to-RDD function runs on the driver every batch
interval, it will always use the latest broadcast variable that you want.
Note, though, that whenever you create a new broadcast variable, the next
batch may take a little longer, as the data actually needs to be broadcast
out. That can also be made asynchronous by running a simple task (to force
the broadcasting out) on any new broadcast variable in a thread separate
from the Spark Streaming batch scheduler, but using the same underlying
SparkContext. (A fuller sketch of this pattern is below.)

> 2. Is any kind of sticky partitioning possible, so that I can route my
> stream data through the same node where the corresponding entities (a
> subset of the entire store) are cached in memory, within the JVM or
> off-heap on that node? This would avoid lookups from the store.

You could use updateStateByKey. That is quite sticky, but it does not
eliminate the possibility of running on a different node. In fact, this is
necessary for fault tolerance: what if the node a task was supposed to run
on goes down? The task will then run on a different node, and you have to
design your application so that it can handle that. (See the
updateStateByKey sketch below.)

> 3. If I stream the entities from the persistent store into the engine,
> this becomes a 4th stream -- the entity stream. How do I use join/merge
> to let streams 1, 2 and 3 look up and update the data from stream 4?
> Would DStream.join work for a few seconds' worth of data in the attribute
> streams against all the data in the entity stream? Or do I use transform
> and, within that, an RDD join? I have doubts, as I seem to be leaning
> towards a core-Spark approach within Spark Streaming.

Depends on what kind of join! If you want to join every batch in a stream
with a static dataset (or a rarely updated one), then transform + join is
the way to go. If you want to join one stream with a window of data from
another stream, then DStream.join is the way to go. (Both variants are
sketched below.)

> 4. The last approach, which I think will surely work but which I want to
> avoid, is to keep the entities in an IMDB and do lookup/update calls on
> it from streams 1, 2 and 3.
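To make the broadcast-rotation pattern from the first response concrete,
here is a minimal sketch. EntityCache, Entity, loadEntitiesFromStore and
withEntityLookup are hypothetical names introduced for illustration; only
broadcast(), sparkContext and transform are Spark APIs:

    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.DStream

    // Simplified entity shape -- an assumed stand-in for the real record.
    case class Entity(a1: String, a2: String, a3: String)

    object EntityCache {
      // Driver-side holder for the current broadcast. @volatile because
      // refresh() may run on a separate driver thread, as described above.
      @volatile private var current: Broadcast[Map[String, Entity]] = _

      // Stub -- replace with a real read of the persistent store.
      private def loadEntitiesFromStore(): Map[String, Entity] = Map.empty

      // Call whenever the master data has changed; builds a new broadcast.
      def refresh(ssc: StreamingContext): Unit = {
        current = ssc.sparkContext.broadcast(loadEntitiesFromStore())
      }

      def getLatestBroadcastVariable(): Broadcast[Map[String, Entity]] = current

      // transform runs on the driver every batch interval, so each batch
      // picks up whichever broadcast is current at that moment.
      def withEntityLookup(attrs: DStream[(String, String)])
          : DStream[(String, (String, Option[Entity]))] =
        attrs.transform { rdd =>
          val latestBroadcast = getLatestBroadcastVariable()
          rdd.map { case (key, attr) => (key, (attr, latestBroadcast.value.get(key))) }
        }
    }

The refresh() call is what you would run from the separate driver-side
thread mentioned above, to keep the re-broadcast off the batch scheduler's
critical path.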
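For the updateStateByKey suggestion in the second response, a minimal
sketch, assuming the three attribute streams are tagged and unioned
upstream into a single DStream of (rowKey, (attributeName, value)) pairs --
the tagging scheme and the EntityState shape are assumptions for
illustration:

    import org.apache.spark.streaming.dstream.DStream

    // Entity with optional fields, since the three attributes arrive
    // independently -- an assumed shape for illustration.
    case class EntityState(a1: Option[String] = None,
                           a2: Option[String] = None,
                           a3: Option[String] = None)

    object StatefulEntities {
      // Input: the three attribute streams tagged and unioned into
      // (rowKey, (attributeName, value)) pairs.
      def trackEntities(updates: DStream[(String, (String, String))])
          : DStream[(String, EntityState)] = {
        val updateFunc = (incoming: Seq[(String, String)], state: Option[EntityState]) => {
          val merged = incoming.foldLeft(state.getOrElse(EntityState())) {
            case (e, ("a1", v)) => e.copy(a1 = Some(v))
            case (e, ("a2", v)) => e.copy(a2 = Some(v))
            case (e, ("a3", v)) => e.copy(a3 = Some(v))
            case (e, _)         => e
          }
          Some(merged) // keep the entity in state; returning None would evict it
        }
        // Stateful operations require ssc.checkpoint(...) to be set.
        updates.updateStateByKey(updateFunc)
      }
    }

Because the state is partitioned by key, the same key tends to be updated
on the same node batch after batch -- the stickiness described above -- but
after a failure the partition can be rebuilt on a different node.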
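And a sketch contrasting the two join styles from the third response,
reusing the hypothetical Entity case class from the broadcast sketch above;
the 30-second window is an arbitrary choice for illustration:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    object JoinStyles {
      // (a) Join every batch against a static (or rarely refreshed) entity
      // RDD: transform + RDD join.
      def joinWithStatic(attrs: DStream[(String, String)],
                         entities: RDD[(String, Entity)])
          : DStream[(String, (String, Entity))] =
        attrs.transform(rdd => rdd.join(entities))

      // (b) Join a stream against a window of recent data from another
      // stream: DStream.join. The window length must be a multiple of the
      // batch interval; 30 seconds is an arbitrary choice here.
      def joinWithWindow(attrs: DStream[(String, String)],
                         entityStream: DStream[(String, Entity)])
          : DStream[(String, (String, Entity))] =
        attrs.join(entityStream.window(Seconds(30)))
    }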
> Any help is deeply appreciated, as it would help me design my system
> efficiently, and the solution approach may become a beacon for
> "look up the master data" sorts of problems.
>
> Regards,
> Amit