[ https://issues.apache.org/jira/browse/SPARK-39650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561276#comment-17561276 ]
Apache Spark commented on SPARK-39650:
--------------------------------------

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/37041

> Streaming Deduplication should not check the schema of "value"
> --------------------------------------------------------------
>
>                 Key: SPARK-39650
>                 URL: https://issues.apache.org/jira/browse/SPARK-39650
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 3.3.0, 3.4.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> When we use dropDuplicates() in a streaming query, specifying the columns
> explicitly performs deduplication against those columns rather than all
> columns.
> For the structure of state in streaming deduplication, we construct the key
> from the "specified" columns and use an empty row as the value (since the
> value is not used at all).
> That said, once the query specifies the columns in dropDuplicates(), all
> other columns should not affect the operation.
> Unfortunately, even though we use the empty row as the value in the state
> store, we register "all columns" as the schema for the value in the state
> store, which leads to incorrect behavior when the state store schema is
> checked. (This turns out to be a long-standing issue, dating from the
> initial implementation of StreamingDeduplicateExec.)
> Specifically, the check requires all columns of the DataFrame passed to
> streaming deduplication to stay the same across the lifetime of the query,
> whereas only the specified columns actually need to stay the same.
> It would be ideal to change the value schema to be empty, but that change
> alone may not be sufficient, since the schema file has already been written
> for older streaming queries. We may need to allow the state schema
> compatibility checker to ignore the value schema if required (either via a
> config or via a method parameter, if feasible).
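
The failure mode described in the issue can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's state schema checker: `StateSchema`, `dedup_state_schema`, `is_compatible`, and the `ignore_value_schema` flag are hypothetical names made up for this illustration, and the column names are invented. It models a stored schema whose value side was registered from "all columns", then shows that adding an unrelated column breaks a strict check even though the deduplication key is unchanged.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple

Field = Tuple[str, str]  # (column name, type name); hypothetical representation


@dataclass(frozen=True)
class StateSchema:
    """Toy stand-in for the key/value schema recorded for a state store."""
    key: Tuple[Field, ...]
    value: Tuple[Field, ...]


def dedup_state_schema(all_columns: Iterable[Field],
                       dedup_columns: Set[str],
                       value_from_all_columns: bool = True) -> StateSchema:
    """Build the schema a deduplication operator would register.

    value_from_all_columns=True models the behavior reported in the issue
    (value schema = all columns); False models the "ideal" empty value schema.
    """
    cols = tuple(all_columns)
    key = tuple(f for f in cols if f[0] in dedup_columns)
    value = cols if value_from_all_columns else ()
    return StateSchema(key=key, value=value)


def is_compatible(stored: StateSchema,
                  current: StateSchema,
                  ignore_value_schema: bool = False) -> bool:
    """Toy compatibility check: the key must always match; the value schema
    is compared only when ignore_value_schema is False."""
    if stored.key != current.key:
        return False
    return ignore_value_schema or stored.value == current.value


# First run of the query: deduplicate on "id" only.
v1_columns = [("id", "long"), ("ts", "timestamp"), ("payload", "string")]
# Restarted query: an unrelated column was added; the dedup key is unchanged.
v2_columns = v1_columns + [("extra", "int")]

stored = dedup_state_schema(v1_columns, {"id"})
current = dedup_state_schema(v2_columns, {"id"})

strict_ok = is_compatible(stored, current)
relaxed_ok = is_compatible(stored, current, ignore_value_schema=True)
print("strict check passes:", strict_ok)
print("check ignoring value schema passes:", relaxed_ok)
```

In this model, an empty value schema for new queries makes the strict check pass on its own, while the `ignore_value_schema` path covers queries whose schema file was already written with all columns, which is the compatibility concern raised at the end of the description.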