[jira] [Commented] (SPARK-28190) Data Source - State

Jose Torres (Jira) Tue, 20 Aug 2019 12:19:07 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-28190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911678#comment-16911678
 ]


Jose Torres commented on SPARK-28190:
-------------------------------------

Yeah, I think an SPIP is needed here. It sounds like we're planning to support 
state read and write as external interfaces, so we need a broad consensus on 
what those interfaces should be and how they'll constrain future evolvability.

> Data Source - State
> -------------------
>
>                 Key: SPARK-28190
>                 URL: https://issues.apache.org/jira/browse/SPARK-28190
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Structured Streaming
>    Affects Versions: 3.0.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> "State" is becoming one of the most important kinds of data in most streaming 
> frameworks, since it is what lets a query produce continuous results. In other 
> words, a query may no longer be valid once its state is corrupted or lost.
> Ideally we could rerun the query from the beginning of the data to construct a 
> brand-new state for the current query, but in reality that may not be possible 
> for many reasons, such as the input data source having a retention limit or 
> the resources wasted by rerunning from the start.
>  
> There are other cases in which end users want to deal with state, such as 
> creating initial state from existing data via a batch query (given that a 
> batch query could be far more efficient and faster).
> I'd like to propose a new data source which handles "state" in batch queries, 
> enabling reads and writes on state.
> Allowing state reads brings a couple of benefits:
>  * You can analyze the state from "outside" of your streaming query
>  * It could be useful when something can be derived from the existing state 
> of an existing query - note that state is not designed to be shared among 
> multiple queries
> Allowing state (re)writes brings several major benefits:
>  * State can be repartitioned physically
>  * The schema of the state can be changed, which means you don't need to 
> rerun the query from the start when the query has to change
>  * You can remove state rows if you want, e.g. to reduce state size or to 
> remove corrupt rows
>  * You can bootstrap state in your new query from existing data efficiently, 
> without running the streaming query from the start
> Btw, I'm planning to contribute my own work 
> ([https://github.com/HeartSaVioR/spark-state-tools]), so many of the 
> sub-issues should not require much effort to submit patches. 
> I'll try to apply the new DSv2, though, so preparing the code for donation 
> could still be a major effort.



