[ 
https://issues.apache.org/jira/browse/SPARK-47776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-47776.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45951
[https://github.com/apache/spark/pull/45951]

> State store operation cannot work properly with binary inequality collation
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-47776
>                 URL: https://issues.apache.org/jira/browse/SPARK-47776
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 4.0.0
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>
> Arguably this is a correctness issue, though we haven't released the
> collation feature yet.
> Collation introduces the concept of binary (in)equality, which means that
> under some collations we can no longer simply compare the binary format of
> two UnsafeRows to determine equality.
> For example, 'aaa' and 'AAA' can be "semantically" the same under a
> case-insensitive collation.
> State store is basically key-value storage, and most provider
> implementations rely on the fact that all the columns in the key schema
> support binary equality. We need to disallow binary inequality columns in
> the key schema until we can support them in the majority of state store
> providers (or at a higher level of the state store).
> Why is this a correctness issue? For example, streaming aggregation will
> produce aggregation output that does not respect semantic equality.
> e.g. df.groupBy(strCol).count()
> Although strCol is case-insensitive, 'a' and 'A' won't be counted together
> in streaming aggregation, while they should be.
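
The miscount described in the issue can be sketched without Spark. The following is a minimal illustration in plain Python, not the state store implementation: a dict keyed by raw bytes stands in for a key-value state store, and `case_insensitive_key` is a hypothetical normalization standing in for a case-insensitive collation. All function and variable names here are assumptions made for the example.

```python
from collections import defaultdict


def count_by_key(values, key_fn):
    """Count values in a dict-backed store whose keys are key_fn(value) bytes."""
    store = defaultdict(int)  # stand-in for a key-value state store
    for value in values:
        store[key_fn(value)] += 1
    return dict(store)


def binary_key(s):
    # Key is the raw UTF-8 encoding, so equality is purely binary.
    return s.encode("utf-8")


def case_insensitive_key(s):
    # Hypothetical normalization approximating a case-insensitive collation.
    return s.casefold().encode("utf-8")


rows = ["a", "A", "aaa", "AAA", "b"]

# Binary keys: 'a' and 'A' land in separate groups.
binary_counts = count_by_key(rows, binary_key)

# Semantically equal strings share one key and are counted together.
collated_counts = count_by_key(rows, case_insensitive_key)

print(binary_counts)
print(collated_counts)
```

With binary keys each of the five strings gets its own group, which is the behavior the issue calls incorrect for a case-insensitive column. Note that the resolution described in the issue is to disallow such columns in the key schema, not to normalize keys as this sketch does for comparison.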


