Github user HeartSaVioR commented on the issue:
https://github.com/apache/spark/pull/22482
@arunmahadevan
We may want to be aware is that the requirement is pretty different from
other streaming frameworks like Flink, which normally set a long period of
checkpoint interval and do a full snapshot (though it supports incremental
checkpoint, which deals with how it minimizes amount of storing data).
Here in Spark, we are expecting smaller batch interval, and Spark deals
with the requirement as storing "delta" of state change. The behavior brings
concern about the strategy of how we store and how we remove the state.
Let's say we have 3 rows in group in batch result and there're also 3 rows
in same group in state, and we want to replace state with new batch result. For
full snapshot removing 3 rows first and putting 3 rows may not matter much, but
with delta approach, we should compare them side-by-side and bring less changes
on state.
The difference is not trivial one for session window, because arbitrary
changes are required: for example, two different sessions in state can be
merged later when late events come in, then we should have to overwrite one and
remove others. Some new sessions can be created as well as existing session,
and we want to overwrite session if the new output session is originated from
old state, and append session if not. For other window, it is just a "put"
because there's no group and we are just safe to put (overwrite if any, and
without evict there's no need to remove). The different requirements between
time window and session window are not easy to combine into one.
That's what I realized the difficulty of state part for session window, and
that's why I feel I need to make change on streaming part. For batch part
current patch is doing OK.
Btw, we can assume `AssignWindows` as `TimeWindowing` and
`SessionWindowing` as we are logically assign rows to individual window. So
unless we would like to support custom window like dynamic gap session window,
I think we can address it later whenever needed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]