Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/22482 @arunmahadevan We may want to be aware is that the requirement is pretty different from other streaming frameworks like Flink, which normally set a long period of checkpoint interval and do a full snapshot (though it supports incremental checkpoint, which deals with how it minimizes amount of storing data). Here in Spark, we are expecting smaller batch interval, and Spark deals with the requirement as storing "delta" of state change. The behavior brings concern about the strategy of how we store and how we remove the state. Let's say we have 3 rows in group in batch result and there're also 3 rows in same group in state, and we want to replace state with new batch result. For full snapshot removing 3 rows first and putting 3 rows may not matter much, but with delta approach, we should compare them side-by-side and bring less changes on state. The difference is not trivial one for session window, because arbitrary changes are required: for example, two different sessions in state can be merged later when late events come in, then we should have to overwrite one and remove others. Some new sessions can be created as well as existing session, and we want to overwrite session if the new output session is originated from old state, and append session if not. For other window, it is just a "put" because there's no group and we are just safe to put (overwrite if any, and without evict there's no need to remove). The different requirements between time window and session window are not easy to combine into one. That's what I realized the difficulty of state part for session window, and that's why I feel I need to make change on streaming part. For batch part current patch is doing OK. Btw, we can assume `AssignWindows` as `TimeWindowing` and `SessionWindowing` as we are logically assign rows to individual window. So unless we would like to support custom window like dynamic gap session window, I think we can address it later whenever needed.
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org