[ https://issues.apache.org/jira/browse/SPARK-47776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim resolved SPARK-47776.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45951
[https://github.com/apache/spark/pull/45951]

> State store operation cannot work properly with binary inequality collation
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-47776
>                 URL: https://issues.apache.org/jira/browse/SPARK-47776
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 4.0.0
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>
> Arguably this is a correctness issue, though we haven't released the
> collation feature yet.
> Collation introduces the concept of binary (in)equality, which means that
> under some collations we can no longer just compare the binary format of two
> UnsafeRows to determine equality.
> For example, 'aaa' and 'AAA' can be "semantically" the same under a
> case-insensitive collation.
> The state store is basically key-value storage, and most provider
> implementations rely on the fact that all the columns in the key schema
> support binary equality. We need to disallow columns with binary inequality
> in the key schema until we can support them in the majority of state store
> providers (or at a higher level of the state store).
> Why is this a correctness issue? For example, streaming aggregation will
> produce aggregation output that ignores semantic equality.
> e.g. df.groupBy(strCol).count()
> Although strCol is case insensitive, 'a' and 'A' won't be counted together in
> streaming aggregation, while they should be.
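
The mismatch described in the issue can be illustrated with a small sketch in plain Python. This is not Spark code: the function names are hypothetical, a dictionary keyed by UTF-8 bytes stands in for a state store that compares keys by their binary format, and `str.casefold()` is only an assumed approximation of a case-insensitive collation.

```python
from collections import Counter


def count_by_binary_key(values):
    """Group by raw UTF-8 bytes, as a byte-comparing key-value store would."""
    counts = Counter()
    for value in values:
        counts[value.encode("utf-8")] += 1
    return dict(counts)


def count_by_collation_key(values):
    """Group by a case-folded key (assumed stand-in for a case-insensitive collation)."""
    counts = Counter()
    for value in values:
        counts[value.casefold()] += 1
    return dict(counts)


rows = ["a", "A", "b", "a"]

binary_counts = count_by_binary_key(rows)
semantic_counts = count_by_collation_key(rows)

print(binary_counts)    # {b'a': 2, b'A': 1, b'b': 1}
print(semantic_counts)  # {'a': 3, 'b': 1}
```

With binary key comparison, 'a' and 'A' land in separate groups, which is the behaviour the issue reports for streaming aggregation; grouping on a collation-aware key counts them together.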