[ https://issues.apache.org/jira/browse/SPARK-28190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911678#comment-16911678 ]
Jose Torres commented on SPARK-28190: ------------------------------------- Yeah, I think an SPIP is needed here. It sounds like we're planning to support state read and write as external interfaces, so we need a broad consensus on what those interfaces should be and how they'll constrain future evolvability. > Data Source - State > ------------------- > > Key: SPARK-28190 > URL: https://issues.apache.org/jira/browse/SPARK-28190 > Project: Spark > Issue Type: Umbrella > Components: Structured Streaming > Affects Versions: 3.0.0 > Reporter: Jungtaek Lim > Priority: Major > > "State" is becoming one of most important data on most of streaming > frameworks, which makes us getting continuous result of the query. In other > words, query could be no longer valid once state is corrupted or lost. > Ideally we could run the query from the first of data to construct a > brand-new state for current query, but in reality it may not be possible for > many reasons, like input data source having retention, lots of resource waste > to rerun from start, etc. > > There're other cases which end users want to deal with state, like creating > initial state from existing data via batch query (given batch query could be > far more efficient and faster). > I'd like to propose a new data source which handles "state" in batch query, > enabling read and write on state. > Allowing state read brings couple of benefits: > * You can analyze the state from "outside" of your streaming query > * It could be useful when there's something which can be derived from > existing state of existing query - note that state is not designed to be > shared among multiple queries > Allowing state (re)write brings couple of major benefits: > * State can be repartitioned physically > * Schema in state can be changed, which means you don't need to run the > query from the start when the query should be changed > * You can remove state rows if you want, like reducing size, removing > corrupt, etc. > * You can bootstrap state in your new query with existing data efficiently, > don't need to run streaming query from the start point > Btw, basically I'm planning to contribute my own works > ([https://github.com/HeartSaVioR/spark-state-tools]), so for many of > sub-issues it would require not-too-much amount of efforts to submit patches. > I'll try to apply new DSv2, so it could be a major effort while preparing to > donate code. -- This message was sent by Atlassian Jira (v8.3.2#803003) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org