Hello dear Flinkers,
If this kind of question has already been asked on the list, I'm sorry for
the duplicate. Feel free to just point me to the thread.
I need to solve what is probably a pretty common case: joining a datastream
to a dataset.
Let's say I have the following setup:
* I have a high-throughput stream of events coming in from Kafka.
* I have some dimension tables stored in Hive. These tables change daily,
and I can keep a snapshot for each day.

Conceptually, I would like to join the stream of incoming events to the
dimension tables (a simple hash join). We can consider two cases:
1) the simpler one, where I join the stream with the most recent version of
the dictionaries (so the result is accepted to be nondeterministic if the
job is retried);
2) the more advanced one, where I would like to do a temporal join of the
stream with the dictionary snapshots that were valid at the time of each
event (this result should be deterministic).
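To make this concrete, here is a sketch of what I imagine the two joins
could look like in Flink SQL (all table and column names below are invented
for illustration; `events` would be a Kafka-backed table and `dim_table` a
versioned dimension table):

```sql
-- Case 1: processing-time temporal join against the latest
-- dictionary version (nondeterministic on retry).
SELECT e.event_id, e.dim_key, d.dim_value
FROM events AS e
JOIN dim_table FOR SYSTEM_TIME AS OF e.proc_time AS d
  ON e.dim_key = d.dim_key;

-- Case 2: event-time temporal join against the snapshot that
-- was valid at the time of the event (deterministic).
SELECT e.event_id, e.dim_key, d.dim_value
FROM events AS e
JOIN dim_table FOR SYSTEM_TIME AS OF e.event_time AS d
  ON e.dim_key = d.dim_key;
```

I am not sure whether a Hive-backed table can serve as the versioned side
of such a join, which is a large part of my question.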

The end goal is to aggregate that joined stream and store the results in
Hive or in a more real-time analytical store (Druid).
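In a perfect world, I would express that aggregation as a simple INSERT
INTO on top of the join; again, all names here are made up, and
`agg_results` would be a sink table backed by Hive or Druid:

```sql
-- Hourly counts per dimension value over the joined stream.
INSERT INTO agg_results
SELECT
  d.dim_value,
  TUMBLE_START(e.event_time, INTERVAL '1' HOUR) AS window_start,
  COUNT(*) AS event_count
FROM events AS e
JOIN dim_table FOR SYSTEM_TIME AS OF e.event_time AS d
  ON e.dim_key = d.dim_key
GROUP BY d.dim_value, TUMBLE(e.event_time, INTERVAL '1' HOUR);
```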

Now, could you please help me understand whether either of these cases is
implementable with the declarative Table/SQL API, using temporal joins,
catalogs, the Hive integration, JDBC connectors, or whatever beta features
exist now? (I've read quite a lot of the Flink docs on each of those, but I
have trouble compiling that information into a final design.) Could you
please help me understand how these components should cooperate?
If that is impossible with the Table API, can we come up with the easiest
implementation using the DataStream API?

Thanks a lot for any help!
Krzysztof
