Thanks Jörn, sounds like there's nothing obvious I'm missing, which is encouraging.
I've not used Redis, but it does seem that for most of my current and
likely future use-cases it would be the best fit (a nice compromise
between scale and ease of setup/access).

Thanks,

Tom

On Wed, Sep 14, 2016 at 10:09 PM Jörn Franke <jornfra...@gmail.com> wrote:

> Hmm, is it just a lookup, and are the values small? In that case I do
> not think Redis needs to be installed on each worker node. Redis has a
> rather efficient protocol, so one or a few dedicated Redis nodes will
> probably fit your purpose. Just try to reuse connections and do not
> establish a new one for each lookup from the same node.
>
> Additionally, Redis has a lot of interesting data structures, such as
> HyperLogLogs.
>
> HBase - you can design where to store which part of the reference data
> set and partition in Spark accordingly. This depends on the data and is
> tricky.
>
> About the other options I am a bit skeptical - especially since you
> need to include updated data, which might have side effects.
>
> Nevertheless, you mention all the options that are possible. I guess
> for a true evaluation you have to check your use case, the envisioned
> future architecture for other use cases, required performance,
> maintainability, etc.
>
> On 14 Sep 2016, at 20:44, Tom Davis <mailinglists...@gmail.com> wrote:
>
> Hi all,
>
> I'm interested in the patterns people use in the wild for lookups
> against reference data sets from a Spark Streaming job. The reference
> dataset will be updated during the life of the job (although being 30
> minutes out of date wouldn't be an issue, for example).
>
> So far I have come up with a few options, all of which have advantages
> and disadvantages:
>
> 1. For small reference datasets, distribute the data as an in-memory
> Map() from the driver, refreshing it inside the foreachRDD() loop.
>
> Obviously the limitation here is size.
>
> 2. Run a Redis (or similar) cache on each worker node and perform
> lookups against it.
>
> There's some complexity to managing this, probably outside of the
> Spark job.
>
> 3. Load the reference data into an RDD, again inside the foreachRDD()
> loop on the driver, and perform a join of the reference and stream
> batch RDDs. Perhaps keep the reference RDD in memory.
>
> I suspect that this will scale, but I also suspect there's going to be
> the potential for a lot of data shuffling across the network, which
> will slow things down.
>
> 4. Similar to the Redis option, but using HBase. It scales well and
> makes the data available to other services, but it is a call out over
> the network, albeit within the cluster.
>
> I guess there's no solution that fits all, but I'm interested in other
> people's experience and whether I've missed anything obvious.
>
> Thanks,
>
> Tom
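For anyone finding this thread later: option 1 in the original question (a small in-memory map, refreshed on a schedule from the driver) can be sketched in plain Python, without Spark, roughly as follows. This is a minimal illustration, not code from the thread; `RefreshingLookup` and the loader callable are hypothetical names, and in a real job the driver would call `get()` inside the foreachRDD() loop and re-broadcast the result to the workers.

```python
import time


class RefreshingLookup:
    """Caches a small reference dict, reloading it at most every `ttl_seconds`.

    A 30-minute TTL matches the staleness tolerance mentioned in the thread.
    """

    def __init__(self, loader, ttl_seconds=1800):
        self._loader = loader        # callable returning the reference dict
        self._ttl = ttl_seconds
        self._data = None
        self._loaded_at = 0.0

    def get(self):
        # Reload only when the cached copy is missing or older than the TTL.
        now = time.time()
        if self._data is None or now - self._loaded_at >= self._ttl:
            self._data = self._loader()
            self._loaded_at = now
        return self._data
```

In a streaming job the loader would typically read from a file or database; here any callable returning a dict works, e.g. `RefreshingLookup(lambda: {"gb": "United Kingdom"})`.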
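Jörn's advice to reuse connections (options 2 and 4) usually maps to the mapPartitions/foreachPartition pattern: create one client per worker process and share it across records and batches, rather than connecting per lookup. A Spark-free sketch, with `FakeClient`, `get_client`, and `enrich_partition` as hypothetical stand-ins for a real Redis or HBase client:

```python
_client = None  # one client per worker process, reused across batches


class FakeClient:
    """Stand-in for a real Redis/HBase client, for illustration only."""

    def get(self, key):
        return {"gb": "United Kingdom"}.get(key)


def get_client():
    # Lazily create the connection once per process and reuse it,
    # instead of establishing a new one for every lookup.
    global _client
    if _client is None:
        _client = FakeClient()
    return _client


def enrich_partition(records):
    client = get_client()  # fetched once per partition, not per record
    for key in records:
        yield (key, client.get(key))
```

In an actual Spark Streaming job this generator would be applied with something like `rdd.mapPartitions(enrich_partition)` inside foreachRDD(), so the connection cost is paid once per partition rather than once per element.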