Join with slow changing dimensions/ streams

Hanan Yehudai Mon, 02 Sep 2019 06:38:51 -0700

I have a very common use case: enriching a stream with some dimension
tables.


E.g. the event stream has a SERVER_ID, and another file holds the LOCATION
associated with each SERVER_ID (a dimension table stored as a CSV file).

In SQL I would simply join.
But when using the Flink stream API, as far as I can see, there are several
options, and I wonder which would be optimal.


1. Use the JOIN operator. According to the documentation
(https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/operators/joining.html)
the join always has some time aspect, unless I use an interval join with a
very large upper bound and assign the dimension stream records an old
timestamp.

2. Just write a mapper function that gets the NAME from the dimension records,
which are preloaded in the MapFunction's loading method.

3. Use broadcast state. This way I can also listen to changes on the
dimension tables and do the actual join in the processElement function.
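
To make options 2 and 3 concrete, here is a minimal, library-free sketch of
the lookup logic they share. It is not Flink code: the class and method names
(DimensionEnricher, onDimensionUpdate, enrich), the "UNKNOWN" placeholder and
the sample values are hypothetical, chosen only for illustration. In a real
job the preload would presumably sit in something like RichMapFunction#open,
and the update handling in BroadcastProcessFunction#processBroadcastElement.

```java
import java.util.HashMap;
import java.util.Map;

// Library-free model of options 2 and 3; all names here are hypothetical.
class DimensionEnricher {
    // SERVER_ID -> LOCATION, held in memory by each parallel task.
    private final Map<String, String> locationByServerId = new HashMap<>();

    // Option 2: load the dimension rows once, before any event is processed.
    DimensionEnricher(Map<String, String> preloaded) {
        locationByServerId.putAll(preloaded);
    }

    // Option 3: apply a change arriving from the dimension stream.
    void onDimensionUpdate(String serverId, String location) {
        locationByServerId.put(serverId, location);
    }

    // The join itself: one hash lookup per event.
    // Unknown servers get a placeholder instead of being dropped (an assumption).
    String enrich(String serverId) {
        return serverId + "," + locationByServerId.getOrDefault(serverId, "UNKNOWN");
    }

    public static void main(String[] args) {
        Map<String, String> rows = new HashMap<>();
        rows.put("srv-1", "Berlin");
        DimensionEnricher enricher = new DimensionEnricher(rows);
        System.out.println(enricher.enrich("srv-1"));
        enricher.onDimensionUpdate("srv-2", "Paris");
        System.out.println(enricher.enrich("srv-2"));
    }
}
```

In this model both options cost one map lookup per event and keep a full copy
of the dimension table per parallel task. The difference is only that option 2
never sees changes after the preload, while option 3 keeps applying them.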

What would be the most efficient way to do this from a memory and CPU
consumption perspective?

Or is there another, better way?
