I would like to develop a Spark Structured Streaming job that reads
messages from one stream and "joins" them with another stream of
"reference" data.

For example, let's say I'm reading messages from Kafka coming in from (lots
of) IoT devices. Each message has a 'device_id'. We have a DEVICE table in
a relational database. What I need to do is "join" the 'device_id' in the
message with the 'device_id' in the table to enrich the incoming message.
Somewhere I read that this can be done by joining two streams. I guess we
could create a "stream" that reads the DEVICE table once every hour or so.
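
For concreteness, here is a rough sketch of what I have in mind: a
stream-static join done per micro-batch inside foreachBatch, against a
cached snapshot of the DEVICE table that is refreshed at most once an
hour. The topic name, JDBC details, message schema, and sink path below
are all placeholders, not our actual setup.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object DeviceEnrichment {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("DeviceEnrichment").getOrCreate()
    import spark.implicits._

    // Hypothetical message payload; adjust to the real schema.
    val msgSchema = new StructType()
      .add("device_id", StringType)
      .add("reading", DoubleType)

    val messages = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "iot-events")                // placeholder topic
      .load()
      .select(from_json($"value".cast("string"), msgSchema).as("msg"))
      .select("msg.*")

    // Cached snapshot of the DEVICE table, refreshed at most once an hour.
    // foreachBatch runs on the driver, so this mutable state is safe here.
    var devices: DataFrame = null
    var lastRefreshMs = 0L
    def deviceSnapshot(): DataFrame = {
      val now = System.currentTimeMillis()
      if (devices == null || now - lastRefreshMs > 60 * 60 * 1000L) {
        if (devices != null) devices.unpersist()
        devices = spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/iot") // placeholder
          .option("dbtable", "DEVICE")
          .load()
          .cache()
        devices.count() // materialize the cache so the DB is hit only now
        lastRefreshMs = now
      }
      devices
    }

    messages.writeStream
      .option("checkpointLocation", "/tmp/checkpoints")  // placeholder
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // Plain stream-static join per micro-batch against the snapshot.
        batch.join(deviceSnapshot(), Seq("device_id"), "left")
          .write.mode("append").parquet("/tmp/enriched") // placeholder sink
      }
      .start()
      .awaitTermination()
  }
}

The cached snapshot keeps the database from being hit on every
micro-batch, but I don't know if this is the idiomatic way to do it,
hence the questions below.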

Questions:
1) Is this the right way to solve this use case?
2) Should we use a stateful stream for reading the DEVICE table, with the
state timeout set to an hour?
3) What would happen while the DEVICE state is being updated from the
table in the relational database?

Guidance would be greatly appreciated. Thanks.
Eric Beabes
