Github user HeartSaVioR commented on the issue:

    https://github.com/apache/spark/pull/22482
  
    @arunmahadevan 
    One thing we should be aware of is that the requirement here is quite 
different from other streaming frameworks like Flink, which normally use a long 
checkpoint interval and take a full snapshot (though Flink also supports 
incremental checkpoints, which reduce the amount of data stored). 
    
    Here in Spark we expect a smaller batch interval, and Spark handles that by 
storing the "delta" of state changes. This raises concerns about how we store 
and how we remove state. 
    
    Let's say we have 3 rows for a group in the batch result and there are also 
3 rows for the same group in state, and we want to replace the state with the 
new batch result. With a full snapshot, removing 3 rows first and then putting 
3 rows may not matter much, but with the delta approach we should compare them 
side by side and make as few changes to state as possible.
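
    To make the difference concrete, here is a minimal sketch in Python (not 
Spark code). The state layout (an index-to-row dict per group), the row shape, 
and both function names are assumptions for illustration only, not what the 
patch does.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical row shape: (session_start, session_end, count).
Row = Tuple[int, int, int]
# Hypothetical operation: (kind, index within the group, row or None).
Op = Tuple[str, int, Optional[Row]]


def naive_replace(old: Dict[int, Row], new: Dict[int, Row]) -> List[Op]:
    """Remove every stored row, then put every new row."""
    ops: List[Op] = [("remove", k, None) for k in old]
    ops += [("put", k, v) for k, v in new.items()]
    return ops


def delta_replace(old: Dict[int, Row], new: Dict[int, Row]) -> List[Op]:
    """Compare side by side and emit only the changes that are needed."""
    ops: List[Op] = []
    for k, v in new.items():
        if k not in old or old[k] != v:
            ops.append(("put", k, v))       # new or changed row
    for k in old:
        if k not in new:
            ops.append(("remove", k, None)) # row no longer present
    return ops


# 3 rows in state, 3 rows in the batch result, only the middle one differs.
old_state = {0: (0, 10, 2), 1: (20, 30, 1), 2: (40, 50, 3)}
new_batch = {0: (0, 10, 2), 1: (20, 35, 2), 2: (40, 50, 3)}

naive_ops = naive_replace(old_state, new_batch)
delta_ops = delta_replace(old_state, new_batch)
```

    With these inputs the naive path produces 6 operations and the side-by-side 
path produces 1, which is what ends up in the delta.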
    
    The difference is not trivial for session windows, because arbitrary 
changes are required. For example, two different sessions in state can be 
merged later when late events come in, and then we have to overwrite one and 
remove the others. New sessions can also be created alongside existing 
sessions, and we want to overwrite a session if the new output session 
originates from old state, and append it if not. For other windows, it is just 
a "put", because there's no group and it is safe to simply put (overwrite if 
present; without eviction there's no need to remove). These different 
requirements for time windows and session windows are not easy to combine into 
one.
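
    As a rough illustration of the overwrite / remove / append cases for one 
group key, here is a sketch in Python (not Spark code). The names 
`merge_sessions` and `plan_state_ops` are hypothetical, a session is assumed to 
be a half-open `[start, end)` interval with `end = last event time + gap`, and 
stored sessions are assumed to be sorted and non-overlapping. Sessions that 
only touch at a boundary are not merged here, which is also an assumption.

```python
from typing import List, Tuple

Session = Tuple[int, int]  # [start, end), end = last event time + gap


def merge_sessions(sessions: List[Session]) -> List[Session]:
    """Merge overlapping sessions into sorted, non-overlapping ones."""
    merged: List[Session] = []
    for start, end in sorted(sessions):
        if merged and start < merged[-1][1]:
            last_start, last_end = merged[-1]
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))
    return merged


def plan_state_ops(old: List[Session], event_times: List[int], gap: int) -> list:
    """Decide which state changes are needed after new events arrive."""
    candidates = list(old) + [(t, t + gap) for t in event_times]
    ops = []
    for s in merge_sessions(candidates):
        # Stored sessions that ended up inside this output session.
        absorbed = [o for o in old if s[0] <= o[0] and o[1] <= s[1]]
        if not absorbed:
            ops.append(("append", s))                   # brand-new session
        elif absorbed != [s]:
            ops.append(("overwrite", absorbed[0], s))   # reuse one old entry
            ops += [("remove", o) for o in absorbed[1:]]
        # else: session is unchanged, so nothing goes into the delta
    return ops


stored = [(0, 10), (15, 25), (100, 110)]
# A late event at t=8 bridges the first two sessions; t=50 starts a new one.
ops = plan_state_ops(stored, event_times=[8, 50], gap=10)
```

    Here the late event merges the first two stored sessions, so one is 
overwritten and the other removed, the event at 50 is appended as a new 
session, and the untouched session produces no change. A time window needs none 
of this bookkeeping, since each window can simply be put.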
    
    That's how I came to realize the difficulty of the state part for session 
windows, and that's why I feel I need to change the streaming part. For the 
batch part, the current patch is doing OK.
    
    Btw, we can treat `AssignWindows` as `TimeWindowing` and 
`SessionWindowing`, since we are logically assigning rows to individual 
windows. So unless we want to support custom windows like a dynamic gap session 
window, I think we can address it later whenever needed.


---
