Hello.

We're running applications using Spark Streaming, and we're about to begin moving them to Structured Streaming. One of our key scenarios is to look up values from an external data source for each record in an incoming stream. In Spark Streaming we currently read the external data, broadcast it, and then look up each value from the broadcast. The broadcast is refreshed periodically, with the need to refresh evaluated on each batch (in a foreachRDD). The broadcasts are somewhat large (~1M records). Each stream we're doing the lookup(s) for is ~6M records / second.
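For concreteness, here is a rough sketch in plain Python (no Spark) of the refresh-on-each-batch pattern described above. Every name in it (`RefreshingLookup`, `load_fn`, `ttl_seconds`, `process_batch`) is hypothetical, and a plain dict stands in for the broadcast variable; it only illustrates the idea and is not the actual job code.

```python
import time
from typing import Callable, Dict, Iterable, List, Optional, Tuple


class RefreshingLookup:
    """Holds a lookup table and reloads it once it is older than ttl_seconds.

    Hypothetical stand-in for the broadcast variable: in a real Spark job the
    table would be re-broadcast from the driver rather than kept in a dict.
    """

    def __init__(
        self,
        load_fn: Callable[[], Dict[str, int]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load_fn = load_fn          # reads the external data source
        self._ttl = ttl_seconds          # maximum age before a reload
        self._clock = clock              # injectable so the age check is testable
        self._table: Dict[str, int] = {}
        self._loaded_at: Optional[float] = None  # None means "never loaded"

    def current(self) -> Dict[str, int]:
        """Return the table, reloading it first if it is missing or stale."""
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._table = self._load_fn()
            self._loaded_at = now
        return self._table


def process_batch(
    records: Iterable[str], lookup: RefreshingLookup
) -> List[Tuple[str, Optional[int]]]:
    """Evaluate the refresh once per batch, then look up every record."""
    table = lookup.current()
    # Keys missing from the table map to None, like a left outer join.
    return [(key, table.get(key)) for key in records]
```

The refresh check runs once per batch rather than once per record, which is the same shape as evaluating it inside foreachRDD.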
While we could conceivably continue this pattern in Structured Streaming with Spark 2.4.x and 'foreachBatch', based on my reading of the documentation this seems like a bit of an anti-pattern in Structured Streaming.

So I am looking for advice: what mechanism would you suggest for periodically reading an external data source and doing a fast lookup against a streaming input? One option appears to be a broadcast left outer join. In the past, that mechanism has been harder to performance-tune than doing an explicit broadcast and lookup.

Regards,
Bryan Jeffrey