Hi!  Interesting problem to solve ahead :)
I need to implement a streaming sessionization algorithm (split stream of
events into groups of correlated events). It's pretty non-standard as we
DON'T have a key like user id which separates the stream into substreams
which we just need to chunk based on time.
Instead and simplifying a lot, our events bear tuples, that I compare to
graph edges, e.g.:
event 1: A -> B
event 2: B -> C
event 3: D -> E
event 4: D -> F
event 5: G -> F
I need to group them into subgroups reachable by following these edges from
some specific nodes. E.g. here:
{ A->B, B->C}
{ D->E, D->F}
{ G->F }
(note: I need to group the events, which are represented by edges here, not
the nodes).
As far as I understand, to solve this problem I need to leverage feedback
loops/iterations feature in Flink (Generally I believe I need to apply a
Bulk Synchronous Processing approach).

Does anyone have seen this kind of sessionization implemented in the wild?
Would you suggest implementing such an algorithm using *stateful functions*?
(AFAIK, they use feedback loops underneath). Can you suggest how would
these be used here?
I know there are some problems with checkpointing when using iterations,
does it mean the implementation may experience data loss on stops?

Side comment: I'm not sure which graph algorithm derivative needs to be
applied here, but the candidate is transitive closure.

Thanks for joining the discussion!
Krzysztof

Reply via email to