Hello.

We're running applications using Spark Streaming and are about to begin
moving them to Structured Streaming.  One of our key scenarios is to look
up values from an external data source for each record in an incoming
stream.  In Spark Streaming we currently read the external data, broadcast
it, and then look up each value in the broadcast.  The broadcast is
refreshed periodically, with the need to refresh evaluated on each batch
(in a foreachRDD).  The broadcasts are somewhat large (~1M records), and
each stream we do lookups for carries ~6M records / second.
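
To make the current pattern concrete, here is a minimal sketch of the
refresh-per-batch logic in plain Python.  It is illustrative only: it uses
no Spark APIs, and the names RefreshingLookup, refresh_if_stale and
process_batch are hypothetical rather than taken from our code.  In the
real job the table would be a broadcast variable and process_batch would
correspond to the body of the foreachRDD.

```python
import time
from typing import Any, Callable, Dict, Hashable, Iterable, List, Optional, Tuple


class RefreshingLookup:
    """Holds a lookup table and reloads it once it is older than max_age_seconds."""

    def __init__(
        self,
        load: Callable[[], Dict[Hashable, Any]],
        max_age_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load              # hypothetical loader for the external source
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so the staleness check is testable
        self._table: Dict[Hashable, Any] = {}
        self._loaded_at: Optional[float] = None

    def refresh_if_stale(self) -> bool:
        """Reload the table if it was never loaded or has expired; report whether it reloaded."""
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._max_age:
            self._table = self._load()
            self._loaded_at = now
            return True
        return False

    def get(self, key: Hashable) -> Any:
        """Return the value for key, or None when the key is absent."""
        return self._table.get(key)


def process_batch(
    lookup: RefreshingLookup, keys: Iterable[Hashable]
) -> List[Tuple[Hashable, Any]]:
    """Evaluate the refresh once per batch, then look up every key in the batch."""
    lookup.refresh_if_stale()
    return [(key, lookup.get(key)) for key in keys]
```

The point of the sketch is that the staleness check runs once per batch,
not once per record, so the per-record cost is a single in-memory lookup.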

While we could conceivably continue this pattern in Structured Streaming
with Spark 2.4.x and foreachBatch, my reading of the documentation is that
this is a bit of an anti-pattern in Structured Streaming.

So I am looking for advice: what mechanism would you suggest for
periodically reading an external data source and doing a fast lookup
against a streaming input?  One option appears to be a broadcast left
outer join.  In the past this mechanism has been harder to
performance-tune than an explicit broadcast and lookup.

Regards,

Bryan Jeffrey
