Hello there!

I read about Kafka Streams recently; it's interesting how it solves
the stream processing problem in a cleaner way, with less overhead and
complexity.

I work as a software engineer at a startup, and we are in the design stage
of building a stream processing pipeline (if you will) for the millions of
events we get every day. We have been running Kafka (as a cluster) as our
log aggregation layer in production for 5-6 months and are very happy with it.

I went through a few Confluent blog posts (by Jay and Neha) on how Kafka
Streams handles stateful event processing, but maybe I missed the whole
point, because I still have some doubts.

We have use-cases like the following:

There is an event E1, which is sort of a base event, after which we get a
lot of sub-events E2, E3, ..., En enriching E1 with extra properties
(with considerable delay, say 30-40 minutes).

Eg. 1: An order event comes in when a user orders an item on our website
(this is the base event). After say 30-40 minutes, we get events like
packaging_time, shipping_time, and delivered_time or cancelled_time
related to that order (these are the sub-events).

So before we send the complete record to the warehouse, we need to collect
all of these (ordered, packaged, shipped, cancelled/delivered). Whenever I
get a cancelled or delivered event for an order, I know that completes the
lifecycle for that order, and I can put it in the warehouse.
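
To make the question concrete, here is a rough sketch of how I imagine
Eg. 1 could look with the Kafka Streams DSL. The topic names, the
OrderLifecycle class, and its serde are all placeholders I made up, so
please correct me if the approach itself is wrong:

import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class OrderLifecycleTopology {

    // Placeholder aggregate: a real one would carry the actual order fields.
    public static class OrderLifecycle {
        final StringBuilder events = new StringBuilder();
        boolean terminal = false;

        OrderLifecycle merge(String event) {
            events.append(event).append(';');
            // delivered/cancelled are the terminal events for an order
            if (event.startsWith("delivered") || event.startsWith("cancelled")) {
                terminal = true;
            }
            return this;
        }

        boolean isComplete() {
            return terminal;
        }
    }

    public static Topology build(Serde<OrderLifecycle> lifecycleSerde) {
        StreamsBuilder builder = new StreamsBuilder();

        // Assumes default String serdes; all events for one order are keyed
        // by order_id, so they land on the same partition and local state.
        KStream<String, String> events = builder.stream("order-events");

        KTable<String, OrderLifecycle> lifecycles = events
            .groupByKey()
            .aggregate(
                OrderLifecycle::new,                        // empty record per order
                (orderId, event, agg) -> agg.merge(event),  // fold each sub-event in
                Materialized.with(Serdes.String(), lifecycleSerde));

        // The filter is the "trigger": an order is forwarded only once a
        // terminal event has completed its lifecycle.
        lifecycles.toStream()
            .filter((orderId, agg) -> agg != null && agg.isComplete())
            .to("orders-for-warehouse",
                Produced.with(Serdes.String(), lifecycleSerde));

        return builder.build();
    }
}

The idea is that the aggregate acts as the per-order state, and the filter
is the "trigger" that fires only when a terminal event completes the
lifecycle.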

Eg. 2: User login events. If we capture events like User-Logged-In and
User-Logged-Out, I need them in the warehouse as a single row, e.g. a row
with the columns *user_id, login_time, logout_time*. So as and when I
receive a logout event (and if I have the matching login event stored in
some store), a trigger would combine the two and send the result across to
the warehouse.
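
Again, to show what I have in mind, here is a sketch of Eg. 2 using the
lower-level Processor API with a key-value state store (the topic names,
the "LOGIN"/"LOGOUT" values, and the CSV output row are again placeholders
of mine):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class LoginLogoutJoiner implements Processor<String, String, String, String> {

    private ProcessorContext<String, String> context;
    private KeyValueStore<String, Long> loginStore;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
        this.loginStore = context.getStateStore("login-store");
    }

    @Override
    public void process(Record<String, String> record) {
        String userId = record.key();
        if ("LOGIN".equals(record.value())) {
            // Remember the login time until the matching logout arrives.
            loginStore.put(userId, record.timestamp());
        } else if ("LOGOUT".equals(record.value())) {
            Long loginTime = loginStore.get(userId);
            if (loginTime != null) {
                // Trigger: combine login + logout into one warehouse row,
                // here as a simple CSV of user_id, login_time, logout_time.
                String row = userId + "," + loginTime + "," + record.timestamp();
                context.forward(record.withValue(row));
                loginStore.delete(userId);
            }
        }
    }

    public static Topology build() {
        Topology topology = new Topology();
        // Assumes default String serdes and user_id as the record key.
        topology.addSource("Source", "user-events");
        topology.addProcessor("Joiner", LoginLogoutJoiner::new, "Source");
        topology.addStateStore(
            Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("login-store"),
                Serdes.String(), Serdes.Long()),
            "Joiner");
        topology.addSink("Sink", "user-sessions", "Joiner");
        return topology;
    }
}

Here the state store plays the role of the "some store" above, and
forwarding to the sink is the trigger that sends the combined row off to
the warehouse.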

All of these involve storing the state of the events and acting as and
when another event (one that completes the lifecycle) occurs, at which
point there is a trigger for further steps (the warehouse or anything else).

Does Kafka Streams help me do this? If not, how should I go about solving
this problem?

Also, I wanted some advice on whether it is standard practice to aggregate
like this and *then* store to the warehouse, or whether I should append
each event to the warehouse and do a sort of ELT on it using the warehouse
itself (run a query to restructure the data in the database and store it
off as a separate table).

Eagerly awaiting your reply,
Arvind
