Hello,

We are planning a system that will consist of three different jobs (roughly 
sketched below):

  1.  Consuming a stream of events, adding some metadata to them, and 
writing them to an intermediate message queue.
  2.  Performing some calculations on the events we got from job 1, as required 
for product A.
  3.  Performing a different set of calculations on the events from job 1, for 
product B.
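
To make the data flow concrete, here is a toy, in-process Python sketch of 
that topology (all names are hypothetical; in the real system each job is a 
separately deployed application and the queues are topics on a message 
broker):

import queue

topic_a = queue.Queue()  # job 1 -> job 2 (product A)
topic_b = queue.Queue()  # job 1 -> job 3 (product B)

def job1(raw_events):
    """Add metadata to each event and fan it out to both products."""
    for event in raw_events:
        enriched = {**event, "source": "job1"}  # stand-in for real metadata
        topic_a.put(enriched)
        topic_b.put(enriched)

def job2():
    """Product A calculations on the enriched stream."""
    while not topic_a.empty():
        print("A:", topic_a.get())

def job3():
    """Product B calculations on the same enriched stream."""
    while not topic_b.empty():
        print("B:", topic_b.get())

job1([{"id": 1}, {"id": 2}])
job2()
job3()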

All three jobs will be developed by different teams, so we don't want to build 
one massive job that does everything.
The problem is that every message-queue sink only provides an at-least-once 
guarantee. If job 1 crashes and recovers, the same events will be written to 
the queue again, and jobs 2 and 3 will process them twice. This is obviously a 
problem, and I guess we are not the first to stumble upon it.
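
For illustration, the obvious workaround we see is consumer-side 
deduplication in jobs 2 and 3, along these lines (a minimal Python sketch, 
assuming job 1 stamps each event with a unique, stable ID; all names here 
are hypothetical):

import sqlite3

def make_store(path="processed_ids.db"):
    """Durable record of event IDs this consumer has already handled."""
    conn = sqlite3.connect(path)
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS processed (id TEXT PRIMARY KEY)")
    return conn

def handle(conn, event, process):
    """Run `process` at most once per event ID, surviving restarts."""
    try:
        # One transaction: the primary key rejects an ID we already saw,
        # so a duplicate re-emitted after a job-1 recovery is skipped.
        # If `process` raises, the insert rolls back and the event can
        # be retried later.
        with conn:
            conn.execute("INSERT INTO processed (id) VALUES (?)",
                         (event["id"],))
            process(event)
    except sqlite3.IntegrityError:
        pass  # replayed duplicate; already processed

This only becomes effectively exactly-once if the side effects of `process` 
commit atomically with the ID insert (or are idempotent); otherwise a crash 
between processing and the commit still leaves a small duplicate window. So 
we are hoping there is a more standard solution than hand-rolling this state 
in every consumer.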

Has anyone else had this issue? It seems to me like a fundamental problem of 
passing data between jobs, so hopefully there are known solutions and best 
practices. It would be great if you could share any solutions.

Thanks,
Avihai
