Hi all. How would one approach the following scenario with Kafka Streams?

There is one input topic. It has data from different sources in a normalized 
format. There is a need to join records that come from different sources but 
are linked to the same entity (record linkage). There is a deterministic 
rule-set that calculates (composed) "match-keys" for every incoming record; 
these keys allow correlating records that are linked to the same entity.

Example: There are events A (userId, firstName, lastName, ...), B (username, 
location, ...) and C (location, weather-data, ...). There is a set of rules to 
correlate A with B (A.firstName + A.lastName = B.username) and B with C 
(B.location = C.location). In the end, we want to get the whole graph of 
correlated records.
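
To make the rule-set concrete, here is a minimal sketch of how I compute the 
match-keys (the record type and field names are just illustrative, not my real 
schema):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical normalized record: a source type plus a flat field map.
record NormalizedRecord(String type, Map<String, String> fields) {}

class MatchKeyRules {
    // Each record yields the match-keys it participates in:
    // A and B share a "name" key, B and C share a "loc" key.
    static List<String> matchKeys(NormalizedRecord r) {
        List<String> keys = new ArrayList<>();
        switch (r.type()) {
            case "A" -> keys.add("name:" + r.fields().get("firstName")
                                         + r.fields().get("lastName"));
            case "B" -> {
                keys.add("name:" + r.fields().get("username"));
                keys.add("loc:" + r.fields().get("location"));
            }
            case "C" -> keys.add("loc:" + r.fields().get("location"));
        }
        return keys;
    }
}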

Constraints: The latency of the record linkage should be as low as possible. 
The state stores should contain the messages of the last 180 days for linkage 
(we are talking about tens to hundreds of GB of data).

I already implemented a solution with Spark + an external database. I calculate 
the match-keys and then store mappings for event-id => list-of-match-keys, 
match-key => list-of-event-ids and event-id => event-payload in the database. 
By querying the database one can get a graph of "event -> match-keys -> more 
events" and so on. I do the querying in a loop until no new events are added. 
As a last step, I read the payloads using the accumulated event-ids. However, 
this solution has high latency because of the external database calls. That's 
why the idea of having KTables as local state stores sounds so interesting to 
me.
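
For illustration, the lookup loop has roughly this shape (Db is just a 
stand-in for the external database):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// The two mappings I keep in the external database.
interface Db {
    List<String> matchKeysOf(String eventId); // event-id => list-of-match-keys
    List<String> eventsOf(String matchKey);   // match-key => list-of-event-ids
}

class LinkageLookup {
    // Expand from one event until a pass discovers no new event-ids.
    static Set<String> linkedEvents(String startEventId, Db db) {
        Set<String> events = new HashSet<>(Set.of(startEventId));
        Set<String> frontier = new HashSet<>(events);
        while (!frontier.isEmpty()) {
            Set<String> next = new HashSet<>();
            for (String eventId : frontier) {
                for (String key : db.matchKeysOf(eventId)) {
                    for (String other : db.eventsOf(key)) {
                        if (events.add(other)) next.add(other); // newly discovered only
                    }
                }
            }
            frontier = next;
        }
        return events; // as a last step, I read the payloads for these ids
    }
}

Every pass of this loop is a round-trip to the database, which is where the 
latency comes from.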

Now with Kafka Streams I would like to use the embedded state with KTables, but 
I find it quite hard to come up with a solution. I think what I want to do is a 
self-join on the incoming topic, which is not yet supported by the DSL. I 
thought of using the Processor API to implement a solution very similar to the 
one I described with Spark: using several state stores for the mappings of 
event => match-keys and match-key => events. Besides the fact that I don't know 
how to address the partitioning (or whether I need a global store), I am not 
sure whether this is the way one would go with Kafka Streams.
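
A rough sketch of what I have in mind with the Processor API (reusing the 
types from the sketch above; store registration and serde configuration 
omitted):

import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

// Two local stores mirror the mappings I keep in the database today.
class LinkageProcessor implements Processor<String, NormalizedRecord, String, String> {
    private KeyValueStore<String, List<String>> eventToKeys; // event-id => match-keys
    private KeyValueStore<String, List<String>> keyToEvents; // match-key => event-ids
    private ProcessorContext<String, String> context;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
        eventToKeys = context.getStateStore("event-to-keys");
        keyToEvents = context.getStateStore("key-to-events");
    }

    @Override
    public void process(Record<String, NormalizedRecord> record) {
        List<String> keys = MatchKeyRules.matchKeys(record.value());
        eventToKeys.put(record.key(), keys);
        for (String key : keys) {
            List<String> events = keyToEvents.get(key);
            if (events == null) events = new ArrayList<>();
            events.add(record.key());
            keyToEvents.put(key, events);
            // Forward keyed by match-key. This is exactly where my partitioning
            // question comes in: the match-keys of one event may hash to
            // different partitions than the event itself.
            context.forward(new Record<>(key, record.key(), record.timestamp()));
        }
    }
}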

Another solution I could think of is a loop in the topology, so that an event 
would flow several times through the loop (which again has KTables for the 
mapping of event-id and match-key) until there are no new matches. Are loops 
possible at all, and if so, is this a good approach or should one avoid loops? 
At the end of the record linkage process I'd like to have *one* message that 
contains the payloads of all correlated events and is then processed by the 
downstream processors. However, I can only think of solutions where I need to 
do a flatMap() (a join for every match-key), so that there is more than one 
message.
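
For the loop idea, the closest thing I can express is routing records back 
through an intermediate topic that the same application consumes again, 
roughly like this (the topic names and the hasNewMatches check are purely 
illustrative):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

class LoopSketch {
    // Hypothetical check: did the last pass discover any new correlated events?
    static boolean hasNewMatches(String enrichedValue) { return false; }

    static void build(StreamsBuilder builder) {
        // As far as I can tell, a cycle inside one topology is not expressible,
        // but writing back to a topic the application also reads gives the same
        // effect.
        KStream<String, String> candidates = builder.stream("linkage-candidates");
        // ... look up match-keys in the state stores and merge event-ids here ...
        candidates.filter((k, v) -> hasNewMatches(v))
                  .to("linkage-candidates"); // unresolved: send around again
        candidates.filter((k, v) -> !hasNewMatches(v))
                  .to("linkage-complete");   // stable: emit *one* final message
    }
}

The filter split would at least make sure that a fully linked record leaves the 
loop exactly once.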

Do you have any feedback or suggestions? Any examples that could help?

Kind regards,

Wladislaw
