Maybe it’s not what you are looking for, but I think that functionality can
be implemented in two steps.

First, you perform a LEFT join, and then you filter on a null joined field
to identify the records whose join didn't succeed.
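
In the Streams DSL it could look roughly like this, assuming the primaries
can be materialized as a KTable keyed by the same reference the secondaries
carry (the topic names, the Pair record, and the String value types are
placeholders, and serdes are omitted):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

// Simple holder for the joined value.
record Pair(String secondary, String primary) {}

StreamsBuilder builder = new StreamsBuilder();

// Primary events materialized as a table, keyed by the reference that
// the secondary events carry.
KTable<String, String> primaries = builder.table("primary-events");
KStream<String, String> secondaries = builder.stream("secondary-events");

// LEFT join: the primary side of the pair is null when there is no match.
KStream<String, Pair> joined =
    secondaries.leftJoin(primaries, (sec, pri) -> new Pair(sec, pri));

// Filter on the null joined field to split the two outcomes.
joined.filter((key, pair) -> pair.primary() != null).to("JOINED");
joined.filter((key, pair) -> pair.primary() == null).to("UNJOINED");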

HTH,

Rubén

On Tue, 16 Mar 2021 at 5:34, Ross Black <ross.w.bl...@gmail.com>
wrote:

> Hi,
>
> I am trying to find the best pattern to solve a specific problem using
> Kafka Streams.  All of our current processing uses the Kafka Streams API
> (multiple joins, windows, repartitions, etc.), so I already think I have
> a decent grasp of the fundamentals.
>
> We have 2 streams of events:
> - primary events (P), which indicate some key event in the system and carry
> a large amount of data
> - secondary events (S), which should *always* occur as a follow-on to the
> primary event and only contain a reference to the single associated primary
> event.
>
> I want to join secondary events to primary events (the simple part) BUT I
> also want to find out when secondary events have been *unable* to be
> joined.
> A secondary is unable to be joined:
> - when primary event delivery has been delayed (so that secondary events
> are received before the associated primary event)
> - when primary events go missing (the event collection system is noisy, so
> we do lose a small but significant number of primary events)
> - due to coding errors in the collectors, where an incorrect reference has
> been inserted into the secondary event
>
> Currently this functionality is implemented using a database:
> - primary events are inserted into the database, and then secondary events
> look up the primary by reference.  If the primary is found, the secondary is
> sent to a "JOINED" topic.
> - if the primary is not found, the secondary event is buffered in the
> database until the primary is received and then joined+emitted (and the
> secondary event is removed from the DB)
> - after some arbitrary time period, the database is queried for outstanding
> not-joined secondary events and they are emitted to an "UNJOINED" topic.
> This allows alerting on unmatched secondary events to drive quality
> measures, and allows offline analysis (to understand why they failed to join).
>
> Some options:
> 1. Implement the same strategy as the existing one, except using Kafka state
> stores instead of the DB.  With this approach I am concerned about atomic
> correctness - i.e. that state in the Kafka store can be managed so that an
> event is never sent to both JOINED and UNJOINED.
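>
> A rough sketch of what I mean with the Processor API (the Event type and
> its methods, the store names, and the sink names are all invented here;
> store registration and serdes are omitted):
>
> import java.time.Duration;
> import org.apache.kafka.streams.KeyValue;
> import org.apache.kafka.streams.processor.PunctuationType;
> import org.apache.kafka.streams.processor.api.Processor;
> import org.apache.kafka.streams.processor.api.ProcessorContext;
> import org.apache.kafka.streams.processor.api.Record;
> import org.apache.kafka.streams.state.KeyValueIterator;
> import org.apache.kafka.streams.state.KeyValueStore;
>
> public class Matcher implements Processor<String, Event, String, Event> {
>     private static final long MAX_WAIT_MS = Duration.ofHours(1).toMillis();
>     private KeyValueStore<String, Event> primaries; // primaries seen so far
>     private KeyValueStore<String, Event> pending;   // secondaries still waiting
>     private ProcessorContext<String, Event> ctx;
>
>     @Override
>     public void init(ProcessorContext<String, Event> ctx) {
>         this.ctx = ctx;
>         primaries = ctx.getStateStore("primaries-store");
>         pending = ctx.getStateStore("pending-store");
>         // Periodically sweep for secondaries that have waited too long.
>         ctx.schedule(Duration.ofMinutes(5), PunctuationType.WALL_CLOCK_TIME, now -> {
>             try (KeyValueIterator<String, Event> it = pending.all()) {
>                 while (it.hasNext()) {
>                     KeyValue<String, Event> kv = it.next();
>                     if (now - kv.value.receivedAt() > MAX_WAIT_MS) {
>                         pending.delete(kv.key);
>                         ctx.forward(new Record<>(kv.key, kv.value, now), "unjoined-sink");
>                     }
>                 }
>             }
>         });
>     }
>
>     @Override
>     public void process(Record<String, Event> rec) {
>         // Records are keyed by the primary reference; for simplicity this
>         // sketch assumes at most one pending secondary per primary.
>         if (rec.value().isPrimary()) {
>             primaries.put(rec.key(), rec.value());
>             Event waiting = pending.delete(rec.key()); // delete returns the old value
>             if (waiting != null) {
>                 ctx.forward(new Record<>(rec.key(), waiting.joinWith(rec.value()),
>                     rec.timestamp()), "joined-sink");
>             }
>         } else {
>             Event primary = primaries.get(rec.key());
>             if (primary != null) {
>                 ctx.forward(new Record<>(rec.key(), rec.value().joinWith(primary),
>                     rec.timestamp()), "joined-sink");
>             } else {
>                 pending.put(rec.key(), rec.value()); // buffer until the primary arrives
>             }
>         }
>     }
> }
>
> My hope is that with processing.guarantee=exactly_once_v2 the store updates
> and the forwarded records are committed atomically, so that a secondary
> could never reach both JOINED and UNJOINED.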
>
>
> 2. Continually emit key-values to a "PENDING" topic for the secondary join.
> An example sequence could be something like the following (where primary
> events = P, secondary events = S):
>     a) receive S with no matching P => emit {S, false}
>     b) receive matching P for S => emit {S, null} (to effectively delete it
> from the topic)
>     c) receive S with matching P => do not emit anything
>
> Now the problem becomes more like building a time-window of events from
> PENDING, to eventually emit the events to UNJOINED.  I am also uncertain
> how to ensure events can never end up in both JOINED and UNJOINED.
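>
> The PENDING emission for the sequence above might be a fragment like this
> (reusing the invented names from the option 1 sketch, with a "pending-sink"
> child writing to a compacted PENDING topic):
>
> // (a) a secondary S whose primary is not yet in the store: mark it pending
> ctx.forward(new Record<>(rec.key(), rec.value(), rec.timestamp()), "pending-sink");
>
> // (b) a primary P arrives for a pending S: emit a tombstone, since a null
> //     value deletes the entry from the compacted PENDING topic
> ctx.forward(new Record<>(rec.key(), null, rec.timestamp()), "pending-sink");
>
> // (c) a secondary that matches immediately emits nothing to PENDING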
>
> My apologies for the wall of text ... I find it a difficult problem to
> explain. 😏
>
>
> Is there some pattern I am missing that will help solve this problem?
> Any help / other suggestions would be appreciated.
>
> Thanks,
> Ross
>
-- 
Rubén