Hi all,

I'm interested in patterns people use in the wild for lookups against reference datasets from a Spark Streaming job. The reference dataset will be updated during the life of the job (although being, say, 30 minutes out of date wouldn't be an issue).
So far I have come up with a few options, all of which have advantages and disadvantages:

1. For small reference datasets, distribute the data as an in-memory Map from the driver, refreshing it inside the foreachRDD() loop. The obvious limitation here is size.

2. Run a Redis (or similar) cache on each worker node and perform lookups against it. There's some complexity to managing this, probably outside of the Spark job.

3. Load the reference data into an RDD, again inside the foreachRDD() loop on the driver, and join it with each stream batch RDD, perhaps keeping the reference RDD cached in memory. I suspect this will scale, but I also suspect the join will shuffle a lot of data across the network, which will slow things down.

4. Similar to the Redis option, but using HBase. It scales well and makes the data available to other services, but each lookup is a call out over the network, albeit within the cluster.

I guess there's no solution that fits all cases, but I'm interested in other people's experience and whether I've missed anything obvious.

Thanks,
Tom