Hi all,

I'm interested in the patterns people use in the wild for lookups against
reference data sets from a Spark Streaming job. The reference dataset will be
updated during the life of the job (although being, say, 30 minutes out of
date wouldn't be an issue).

So far I have come up with a few options, all of which have advantages and
disadvantages:

1. For small reference datasets, broadcast the data as an in-memory Map from
the driver, refreshing it inside the foreachRDD() loop.

Obviously the limitation here is size.
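
Something like this rough sketch is what I had in mind for (1), assuming ssc
is the StreamingContext and stream is a DStream of (key, event) pairs;
loadReferenceMap() is just a placeholder for however the data actually gets
fetched (JDBC, a file, a REST call, etc.):

  // placeholder for however the reference data is actually fetched
  def loadReferenceMap(): Map[String, String] = ???

  var refBroadcast = ssc.sparkContext.broadcast(loadReferenceMap())
  var lastRefresh = System.currentTimeMillis()

  stream.foreachRDD { rdd =>
    // re-broadcast if the current copy is more than 30 minutes old
    if (System.currentTimeMillis() - lastRefresh > 30 * 60 * 1000) {
      refBroadcast.unpersist()
      refBroadcast = rdd.sparkContext.broadcast(loadReferenceMap())
      lastRefresh = System.currentTimeMillis()
    }
    val ref = refBroadcast  // local val so the closure captures the current broadcast
    rdd.map { case (key, event) => (key, event, ref.value.get(key)) }
       .foreach(println)    // stand-in for the real output action
  }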

2. Run a Redis (or similar) cache on each worker node and perform lookups
against it.

There's some complexity in managing this, probably outside of the Spark job.
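
For (2), the lookup side would be roughly the following (Jedis is just an
example client, and the host/port assume a Redis instance local to each
worker):

  stream.foreachRDD { rdd =>
    rdd.mapPartitions { events =>
      // one connection per partition, against the worker-local Redis
      val jedis = new redis.clients.jedis.Jedis("localhost", 6379)
      val enriched = events.map { case (key, event) =>
        (key, event, Option(jedis.get(key)))  // get() returns null for missing keys
      }.toList                                // force evaluation before closing the connection
      jedis.close()
      enriched.iterator
    }.foreach(println)                        // stand-in for the real output action
  }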

3. Load the reference data into an RDD, again inside the foreachRDD() loop
on the driver. Perform a join of the reference and stream batch RDDs.
Perhaps keep the reference RDD in memory.

I suspect this will scale, but I also suspect there's potential for a lot of
data shuffling across the network, which will slow things down.
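
For (3), roughly the following; the HDFS path and key extraction are made up,
and would really be whatever process maintains the reference file:

  stream.foreachRDD { rdd =>
    val refRdd = rdd.sparkContext.textFile("hdfs:///ref/current")
                    .map(line => (line.split(",")(0), line))
                    .cache()
    rdd.leftOuterJoin(refRdd)   // shuffles both sides by key
       .foreach(println)        // stand-in for the real output action
  }

Caching there only helps within a single batch, since the reference RDD is
rebuilt every time; keeping it across batches and rebuilding it every ~30
minutes would avoid re-reading it, at the cost of a bit more bookkeeping.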

4. Similar to the Redis option, but using HBase. It scales well and makes the
data available to other services, but each lookup is a call out over the
network, albeit within the cluster.
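
And for (4), along these lines; the table name, column family and qualifier
are invented for the example:

  import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
  import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
  import org.apache.hadoop.hbase.util.Bytes

  stream.foreachRDD { rdd =>
    rdd.mapPartitions { events =>
      // one connection per partition; reuse would need a pool or singleton
      val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val table = conn.getTable(TableName.valueOf("reference"))
      val enriched = events.map { case (key, event) =>
        val result = table.get(new Get(Bytes.toBytes(key)))
        val ref = Option(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("value")))
                    .map(b => Bytes.toString(b))
        (key, event, ref)
      }.toList       // force evaluation before closing the connection
      table.close()
      conn.close()
      enriched.iterator
    }.foreach(println)   // stand-in for the real output action
  }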

I guess there's no solution that fits every case, but I'm interested in other
people's experience and whether I've missed anything obvious.

Thanks,

Tom
