Thanks Jörn, sounds like there's nothing obvious I'm missing, which is
encouraging.

I've not used Redis, but it does seem that for most of my current and
likely future use cases it would be the best fit (a nice compromise of
scale and ease of setup/access).

Thanks,

Tom

On Wed, Sep 14, 2016 at 10:09 PM Jörn Franke <jornfra...@gmail.com> wrote:

> Hmm, is it just a lookup, and are the values small? In that case I do not
> think Redis needs to be installed on each worker node. Redis has a rather
> efficient protocol, so one or a few dedicated Redis nodes should be more
> than enough for your purpose. Just try to reuse connections rather than
> establishing a new one for each lookup from the same node.
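>
> That pattern in a minimal Scala sketch - assuming the Jedis client, where
> stream is your input DStream of key/value pairs and the host name is
> illustrative:
>
>     import redis.clients.jedis.JedisPool
>
>     // One pool per executor JVM, created lazily and shared across tasks.
>     object RedisPool {
>       lazy val pool = new JedisPool("redis-host", 6379)
>     }
>
>     val enriched = stream.mapPartitions { records =>
>       val jedis = RedisPool.pool.getResource   // borrow once per partition, not per record
>       try {
>         // Materialise before returning the connection - iterators are lazy.
>         records.map { case (k, v) => (k, v, jedis.get(k)) }.toList.iterator
>       } finally {
>         jedis.close()                          // hands the connection back to the pool
>       }
>     }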
>
> Additionally Redis has a lot of interesting data structures such as
> hyperloglogs.
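>
> For example, a hyperloglog gives an approximate distinct count in at most
> about 12 KB per key, regardless of cardinality - again assuming Jedis,
> with a made-up key name:
>
>     import redis.clients.jedis.Jedis
>
>     val jedis = new Jedis("redis-host", 6379)
>     jedis.pfadd("seen-keys", "alice", "bob", "alice")  // add observed elements
>     val approx = jedis.pfcount("seen-keys")            // ~2, within ~1% error
>     jedis.close()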
>
> HBase - here you can design where each part of the reference data set is
> stored and partition in Spark accordingly. This depends on the data and
> can be tricky.
>
> About the other options I am a bit skeptical - especially since you need
> to include updated data, they might have side effects.
>
> Nevertheless, you mention all the feasible options. I guess for a true
> evaluation you have to check your use case, the envisioned future
> architecture for other use cases, required performance, maintainability, etc.
>
> On 14 Sep 2016, at 20:44, Tom Davis <mailinglists...@gmail.com> wrote:
>
> Hi all,
>
> I'm interested in the patterns people use in the wild for lookups against
> reference data sets from a Spark Streaming job. The reference dataset will
> be updated during the life of the job (although being 30 minutes out of
> date wouldn't be an issue, for example).
>
> So far I have come up with a few options, all of which have advantages and
> disadvantages:
>
> 1. For small reference datasets, distribute the data as an in-memory Map()
> from the driver, refreshing it inside the foreachRDD() loop.
>
> Obviously the limitation here is size.
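>
> Roughly the shape I have in mind - a sketch where stream is the input
> DStream and loadReference() stands in for however the Map is fetched;
> re-broadcasting every batch could be throttled to, say, every 30 minutes:
>
>     import org.apache.spark.broadcast.Broadcast
>
>     // Placeholder: fetch the latest reference snapshot from wherever it lives.
>     def loadReference(): Map[String, String] = ???
>
>     var current: Option[Broadcast[Map[String, String]]] = None
>
>     stream.foreachRDD { rdd =>
>       current.foreach(_.unpersist())                        // drop the stale copy
>       val ref = rdd.sparkContext.broadcast(loadReference()) // ship the fresh one
>       current = Some(ref)
>       rdd.map { case (k, v) => (k, v, ref.value.get(k)) }   // map-side lookup, no shuffle
>          .count()                                           // stand-in for the real sink
>     }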
>
> 2. Run a Redis (or similar) cache on each worker node and perform lookups
> against it.
>
> There's some complexity to managing this, probably outside of the Spark
> job.
>
> 3. Load the reference data into an RDD, again inside the foreachRDD() loop
> on the driver. Perform a join of the reference and stream batch RDDs.
> Perhaps keep the reference RDD in memory.
>
> I suspect that this will scale, but I also suspect there's potential for a
> lot of data shuffling across the network, which will slow things down.
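>
> The shuffle concern can be softened by hash-partitioning and caching the
> reference RDD, so only the batch side moves on each join. A sketch, where
> ssc is the StreamingContext and the HDFS path and format are made up:
>
>     import org.apache.spark.HashPartitioner
>     import org.apache.spark.rdd.RDD
>
>     // Illustrative loader: keyed reference data, tab-separated, from HDFS.
>     def loadRefRDD(): RDD[(String, String)] =
>       ssc.sparkContext.textFile("hdfs:///ref/current")
>         .map { line => val Array(k, v) = line.split('\t'); (k, v) }
>         .partitionBy(new HashPartitioner(64))  // fixed partitioner, reused by the join
>         .cache()
>
>     var refRDD = loadRefRDD()
>
>     val joined = stream.transform { batch =>
>       // transform runs on the driver once per batch, so refRDD can be
>       // swapped here (say, every 30 minutes) without restarting the job.
>       batch.join(refRDD)  // (key, (streamValue, refValue))
>     }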
>
> 4. Similar to the Redis option, but using HBase. It scales well and makes
> the data available to other services, but each lookup is a call out over
> the network, albeit within the cluster.
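>
> For completeness, the shape I'd expect this one to take - a sketch using
> the standard HBase client, with the table, column family and qualifier
> names made up:
>
>     import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
>     import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
>     import org.apache.hadoop.hbase.util.Bytes
>
>     // One HBase connection per executor JVM, reused across tasks.
>     object HBaseConn {
>       lazy val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
>     }
>
>     stream.mapPartitions { records =>
>       val table = HBaseConn.conn.getTable(TableName.valueOf("reference"))
>       val out = records.map { case (k, v) =>
>         val res = table.get(new Get(Bytes.toBytes(k)))
>         val ref = Option(res.getValue(Bytes.toBytes("d"), Bytes.toBytes("value")))
>           .map(b => Bytes.toString(b))           // null-safe: the key may be absent
>         (k, v, ref)
>       }.toList                                   // materialise before closing the table
>       table.close()
>       out.iterator
>     }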
>
> I guess there's no solution that fits all, but I'm interested in other
> people's experience and whether I've missed anything obvious.
>
> Thanks,
>
> Tom
>
