You can do the enrichment with a stream(events)-static(device table) join when 
the device table is a slowly changing dimension (say, it changes once a day) 
and it is in Delta format. Then, for every micro-batch of the stream-static 
join, the device table will be rescanned and up-to-date device data will be 
loaded.
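
For example, a minimal sketch in Scala, assuming a Kafka topic "iot-events", 
an event schema with a device_id column, and a Delta device table at a 
made-up path (all names here are illustrative, not from your setup; the 
spark-sql-kafka and Delta packages must be on the classpath):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("device-enrichment").getOrCreate()
import spark.implicits._

// Illustrative event schema; real IoT payloads will differ.
val eventSchema = StructType(Seq(
  StructField("device_id", StringType),
  StructField("reading", DoubleType),
  StructField("event_ts", TimestampType)
))

// Streaming side: IoT events from Kafka (placeholder broker and topic).
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "iot-events")
  .load()
  .select(from_json($"value".cast("string"), eventSchema).as("e"))
  .select("e.*")

// Static side: the Delta device table. Because the source is Delta, the
// table is re-resolved on every micro-batch, so a daily refresh is picked
// up without restarting the query.
val devices = spark.read.format("delta").load("/tables/devices")

// Stream-static join on device_id to enrich each incoming event.
val enriched = events.join(devices, Seq("device_id"), "left")

enriched.writeStream
  .format("delta")
  .option("checkpointLocation", "/chk/device-enrichment")
  .start("/tables/enriched_events")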

If the device table is not a slowly changing dimension (say, it changes once 
an hour), then you'd probably need a stream-stream join, but I'm not sure 
Spark supports an RDBMS (i.e. JDBC) source in streaming mode.
So I'd rather sync the JDBC table to Parquet/Delta periodically in order to 
emulate a streaming source.
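
A rough sketch of that sync job, again with made-up JDBC connection details 
and paths; in practice you'd trigger the refresh from a scheduler (cron, 
Airflow, etc.) rather than a sleep loop:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("device-sync").getOrCreate()

// Snapshot the RDBMS device table into the Delta path that the
// stream-static join reads from.
def syncDevices(): Unit = {
  val jdbcDevices = spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/iot") // placeholder URL
    .option("dbtable", "devices")
    .option("user", "spark")
    .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
    .load()

  // Delta's transactional log makes the overwrite atomic: a concurrent
  // streaming micro-batch sees either the old snapshot or the new one,
  // never a half-written table.
  jdbcDevices.write
    .format("delta")
    .mode("overwrite")
    .save("/tables/devices")
}

// Illustrative hourly loop; a scheduled job is the more realistic driver.
while (true) {
  syncDevices()
  Thread.sleep(60 * 60 * 1000L)
}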


> On 3 May 2021, at 20:02, Eric Beabes <mailinglist...@gmail.com> wrote:
> 
> 
> 1) The device_id might be different for each message in a batch.
> 2) It's a streaming application. The IoT messages are read as a stream in a 
> Structured Streaming job. The DataFrame would need to be refreshed every 
> hour. Have you done something similar in the past? Do you have an example 
> to share?
> 
>> On Mon, May 3, 2021 at 9:52 AM Mich Talebzadeh <mich.talebza...@gmail.com> 
>> wrote:
>> Can you please clarify:
>> 
>> Do the IoT messages in one batch all have the same device_id, or does 
>> every row have a different device_id?
>> The RDBMS table can be read through JDBC in Spark and a dataframe can be 
>> created on it. Does that work for you? You do not really need to stream 
>> the reference table. 
>> 
>> HTH
>> 
>>> On Mon, 3 May 2021 at 17:37, Eric Beabes <mailinglist...@gmail.com> wrote:
>>> I would like to develop a Spark Structured Streaming job that reads 
>>> messages in a Stream which needs to be “joined” with another Stream of 
>>> “Reference” data.
>>> 
>>> For example, let’s say I’m reading messages from Kafka coming in from 
>>> (lots of) IoT devices. Each message has a ‘device_id’. We have a DEVICE 
>>> table in a relational database. What I need to do is “join” the 
>>> ‘device_id’ in the message with the ‘device_id’ in the table to enrich 
>>> the incoming message. Somewhere I read that this can be done by joining 
>>> two streams. I guess we can create a “Stream” that reads the DEVICE 
>>> table once every hour or so. 
>>> 
>>> Questions:
>>> 1) Is this the right way to solve this use case? 
>>> 2) Should we use a Stateful Stream for reading DEVICE table with State 
>>> timeout set to an hour?
>>> 3) What happens while the DEVICE state is being updated from the table 
>>> in the relational database?
>>> 
>>> Guidance would be greatly appreciated. Thanks.
