Jungtaek Lim created SPARK-57003:
------------------------------------

             Summary: Enforce streaming stateful operator output and 
state-schema nullability
                 Key: SPARK-57003
                 URL: https://issues.apache.org/jira/browse/SPARK-57003
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 4.3.0
            Reporter: Jungtaek Lim


This is a long-standing conflict between the streaming engine and the query
optimizer.

A streaming query is long-running by nature, and in many cases it spans 
multiple Spark versions. Also, the logical plan is not always the same across 
batches (e.g. there are multiple stream sources and one of them has no new 
data at batch N). Both factors leave a streaming query exposed to differences 
in what the analyzer and optimizer produce.

The state schema of a stateful operator is mostly determined by the operator's 
input schema, and nullability is no exception. If the input schema has a 
nullable column, the state schema has a nullable column, and likewise for a 
non-nullable column.

One of the optimizations the query optimizer performs is to flip nullability, 
say, from nullable to non-nullable where appropriate. (This can happen directly 
or indirectly; one indirect example is that the nullability of a column is 
re-determined when a union is eliminated down to one side.) If this 
optimization applies only conditionally, the nullability of the column can 
change over time, which breaks the stateful operator.
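
To make the failure mode concrete, here is a minimal sketch in plain Python. It 
is a toy model, not Spark code: Field, state_schema_from_input and 
strict_mismatches are hypothetical names, and the assumption that the state 
schema copies the input schema verbatim is a simplification of "mostly 
determined by" above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Toy stand-in for a schema column; not Spark's StructField."""
    name: str
    dtype: str
    nullable: bool


def state_schema_from_input(input_schema):
    # Assumption for this sketch: the state schema copies the input schema
    # verbatim, including each column's nullability.
    return tuple(input_schema)


def strict_mismatches(stored, current):
    # Compare column by column and report every column whose definition
    # differs, nullability included.
    return [
        (old.name, old.nullable, new.nullable)
        for old, new in zip(stored, current)
        if old != new
    ]


# Batch N: both sides of a union have data, so `value` is planned as nullable.
batch_n_input = (Field("key", "string", False), Field("value", "long", True))
# Batch N+1: one side is eliminated and `value` is planned as non-nullable.
batch_n1_input = (Field("key", "string", False), Field("value", "long", False))

stored_state_schema = state_schema_from_input(batch_n_input)
current_state_schema = state_schema_from_input(batch_n1_input)

mismatches = strict_mismatches(stored_state_schema, current_state_schema)
print(mismatches)
```

The data types and column names are identical in both batches; only the 
nullability of `value` differs, and that alone is enough for a strict 
comparison to reject the stored state.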

The holistic fix is to relax the nullability of the state schema so that it is 
always nullable, and to propagate this to the output schema of the stateful 
operator, and from there to the downstream operators.
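
A rough sketch of the idea, again as a plain-Python toy model rather than the 
actual change: Field, as_nullable and plan_stateful_operator are hypothetical 
names, schemas are flat (no nested types), and the operator is assumed to emit 
its state columns unchanged as output.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Field:
    """Toy stand-in for a schema column; not Spark's StructField."""
    name: str
    dtype: str
    nullable: bool


def as_nullable(schema):
    # Mark every column nullable, whatever the planner decided for this batch.
    return tuple(replace(field, nullable=True) for field in schema)


def plan_stateful_operator(input_schema):
    # The state schema is always nullable, so it no longer depends on
    # per-batch nullability decisions.
    state_schema = as_nullable(input_schema)
    # Assumption for this sketch: the operator outputs its state columns
    # as-is, so the output schema carries the same relaxed nullability.
    output_schema = state_schema
    return state_schema, output_schema


# The same column planned with different nullability in two batches.
batch_n_input = (Field("key", "string", False), Field("value", "long", True))
batch_n1_input = (Field("key", "string", False), Field("value", "long", False))

state_n, output_n = plan_stateful_operator(batch_n_input)
state_n1, output_n1 = plan_stateful_operator(batch_n1_input)

print(state_n == state_n1)
print(all(field.nullable for field in output_n1))
```

Because the relaxed schema is what downstream operators see, they plan against 
nullable columns in every batch instead of inheriting a nullability that may 
flip later.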



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
