Jungtaek Lim created SPARK-28190:
------------------------------------

             Summary: Data Source - State
                 Key: SPARK-28190
                 URL: https://issues.apache.org/jira/browse/SPARK-28190
             Project: Spark
          Issue Type: Umbrella
          Components: Structured Streaming
    Affects Versions: 3.0.0
            Reporter: Jungtaek Lim


"State" is becoming one of most important data on most of streaming frameworks, 
which makes us getting continuous result of the query. In other words, query 
could be no longer valid once state is corrupted or lost.

Ideally we could run the query from the first of data to construct a brand-new 
state for current query, but in reality it may not be possible for many 
reasons, like input data source having retention, lots of resource waste to 
rerun from start, etc.

 

There're other cases which end users want to deal with state, like creating 
initial state from existing data via batch query (given batch query could be 
far more efficient and faster).

I'd like to propose a new data source which handles "state" in batch query, 
enabling read and write on state.

Allowing state read brings couple of benefits:
 * You can analyze the state from "outside" of your streaming query
 * It could be useful when there's something which can be derived from existing 
state of existing query - note that state is not designed to be shared among 
multiple queries

Allowing state (re)write brings couple of major benefits:
 * State can be repartitioned physically
 * Schema in state can be changed, which means you don't need to run the query 
from the start when the query should be changed
 * You can remove state rows if you want, like reducing size, removing corrupt, 
etc.
 * You can bootstrap state in your new query with existing data efficiently, 
don't need to run streaming query from the start point

Btw, basically I'm planning to contribute my own works 
([https://github.com/HeartSaVioR/spark-state-tools]), so for many of sub-issues 
it would require not-too-much amount of efforts to submit patches. I'll try to 
apply new DSv2, so it could be a major effort while preparing to donate code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to