[ https://issues.apache.org/jira/browse/SPARK-27237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim updated SPARK-27237:
---------------------------------
    Description: 
Even though the Spark Structured Streaming guide clearly documents that "Any 
change in number or type of grouping keys or aggregates is not allowed.", 
Spark does nothing to enforce this when end users attempt such a change, 
which can end up producing nondeterministic outputs or unexpected exceptions.

Even worse, if the query happens not to crash, it can write the new, 
corrupted values into the state, which breaks the state completely unless end 
users roll back to a specific batch by manually editing the checkpoint.

The restriction is clear: the number of columns and the data type of each 
column must not be modified across query runs. We can store the schema of the 
state along with the state itself, and on restart verify whether the 
(possibly) new schema is compatible with the stored one. With this validation 
we can stop the query from running and showing nondeterministic behavior when 
the schemas are incompatible, and we can also give more informative error 
messages to end users.
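
As a rough illustration, a minimal version of such a check could look like 
the sketch below. It is only a sketch: the object and method names are 
hypothetical rather than an existing Spark API (only StructType, DataType and 
their json / fromJson members are real Spark classes). The idea is to compare 
the stored and current key/value schemas positionally on field count and data 
type, and fail fast with both schemas in the error message.

{code:scala}
// Hypothetical sketch, not Spark's actual implementation.
import org.apache.spark.sql.types.StructType

object StateSchemaValidator {

  // Schemas are treated as compatible when the field count matches and
  // every field's data type is unchanged. Field names are ignored here,
  // since state rows are read back by position, not by name.
  def isCompatible(stored: StructType, current: StructType): Boolean =
    stored.fields.length == current.fields.length &&
      stored.fields.zip(current.fields).forall {
        case (s, c) => s.dataType == c.dataType
      }

  // Fails fast with both schemas in the message instead of letting the
  // query run and write corrupted values into existing state.
  def validate(
      storedKey: StructType, storedValue: StructType,
      currentKey: StructType, currentValue: StructType): Unit = {
    if (!isCompatible(storedKey, currentKey) ||
        !isCompatible(storedValue, currentValue)) {
      throw new IllegalStateException(
        s"""Provided schema is incompatible with the schema stored in state.
           |  stored key:    ${storedKey.json}
           |  current key:   ${currentKey.json}
           |  stored value:  ${storedValue.json}
           |  current value: ${currentValue.json}""".stripMargin)
    }
  }
}
{code}

The stored schemas could be persisted via StructType.json (for example in a 
small schema file under the state checkpoint directory) and parsed back with 
DataType.fromJson on restart, so the validation would cost a single extra 
file read per query start.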


> Introduce state schema validation across query restarts
> -------------------------------------------------------
>
>                 Key: SPARK-27237
>                 URL: https://issues.apache.org/jira/browse/SPARK-27237
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.0.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>


