Greetings, I'm trying to do the following:
I'm streaming JSON-encoded messages from Pub/Sub. Each message carries DB fields, including a primary key and an operation type whose value is one of insert, update, or delete, reflecting the database operation. The goal is to produce one file per day, updated every 30 minutes, that contains the final state of the rows/messages. Each 30-minute flush should reflect the state of the data at that point, taking the operations into account: say an insert for some row arrives in the first 30 minutes, so the row appears in the first flush of the output file; if a delete for that row arrives in the next 30 minutes, the row should no longer be in the output file at the second flush, and so on. For the next day, a new file should be created.

I'm wondering whether something like this can be done in Beam (Java). I had in mind a fixed 24-hour window (which I guess is not best practice) that would trigger every 30 minutes, followed by a group-by-key on the primary key and then logic to apply the operations. But it looks like triggering can't be done that way, or maybe my whole approach/understanding is wrong... Anyway, I would be grateful for any advice on how this can be done (conceptually), or a code sample of a similar pipeline.
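Here is a rough, untested sketch of the windowing I had in mind. RowEvent is a made-up name for my decoded JSON message (primary key, operation, plus a sequence number from the source so the latest change per key wins), so the names below are assumptions, not working code:

import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// "keyed" comes from PubsubIO plus a ParDo that parses the JSON and
// keys each RowEvent by its primary key.
static PCollection<KV<String, RowEvent>> latestLiveRows(
    PCollection<KV<String, RowEvent>> keyed) {
  return keyed
      // one window per calendar day, with an early firing every 30 minutes
      .apply(Window.<KV<String, RowEvent>>into(FixedWindows.of(Duration.standardDays(1)))
          .triggering(AfterWatermark.pastEndOfWindow()
              .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                  .plusDelayOf(Duration.standardMinutes(30))))
          // accumulate, so every 30-minute pane sees all events received so far today
          .accumulatingFiredPanes()
          .withAllowedLateness(Duration.ZERO))
      // keep only the most recent event per primary key
      .apply(Combine.<String, RowEvent>perKey(events -> {
        RowEvent latest = null;
        for (RowEvent e : events) {
          if (latest == null || e.getSequence() > latest.getSequence()) {
            latest = e;
          }
        }
        return latest;
      }))
      // a row whose latest operation is a delete should not appear in the output
      .apply(Filter.by((KV<String, RowEvent> kv) ->
          !"delete".equals(kv.getValue().getOperation())));
}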
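For writing the files, I thought of something like the fragment below (continuing from the liveRows output of the sketch above). TextIO's default windowed naming puts the window and pane into the file name, so each 30-minute firing would produce a new file; if I understand correctly, getting a single daily file that is replaced on every flush would instead need a custom FileBasedSink.FilenamePolicy that names files by window only. The bucket path and the toJson serializer here are made up:

import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

liveRows
    // toJson is my own (hypothetical) serializer back to a JSON line
    .apply(MapElements.into(TypeDescriptors.strings())
        .via((KV<String, RowEvent> kv) -> toJson(kv.getValue())))
    .apply(TextIO.write()
        .to("gs://my-bucket/daily-state/state")  // made-up output location
        .withWindowedWrites()
        .withNumShards(1));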
Best regards,
Zdenko

--
_______________________
http://www.the-swamp.info