Kent Yao created SPARK-57029:
--------------------------------

             Summary: Add byte-level golden file for ICU sort keys to detect 
collation regressions on ICU upgrades
                 Key: SPARK-57029
                 URL: https://issues.apache.org/jira/browse/SPARK-57029
             Project: Spark
          Issue Type: Improvement
          Components: SQL, Tests
    Affects Versions: 5.0.0
            Reporter: Kent Yao
            Assignee: Kent Yao


h3. Motivation

ICU upgrade PRs (SPARK-50189: 76.1, SPARK-52038: 77.1, SPARK-54447: 78.1,
SPARK-55308: 78.2, SPARK-56397: 78.3) currently touch only the
dependency file and {{ICUCollationsMapSuite}}. Neither contains a
snapshot of sort-key bytes, so byte changes between ICU versions never
surface in review and go undetected.

Spark relies on ICU4J's {{Collator.getCollationKey(...).toByteArray()}}
to compute binary-comparable sort keys for ICU collations. These byte
sequences depend on the ICU version, and nothing announces a change: an
ICU library upgrade can alter the bytes for the same locale and input
without any test catching it.
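
As a rough illustration of what "binary-comparable" means here, the sketch below derives sort-key bytes and checks that comparing them as unsigned bytes orders two strings the same way the collator does. It uses the JDK's {{java.text.Collator}} purely as a stand-in so it runs without extra dependencies. That the same pattern applies to ICU4J's {{com.ibm.icu.text.Collator}} is an assumption based on the call shape quoted above, and the bytes the JDK produces are not the bytes ICU produces. The class and method names ({{SortKeyDemo}}, {{sortKey}}, {{hex}}) are hypothetical.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of Spark. Uses the JDK collator as a stand-in for ICU4J.
final class SortKeyDemo {
    /** Returns the collation sort key of s as raw bytes. */
    static byte[] sortKey(Collator collator, String s) {
        return collator.getCollationKey(s).toByteArray();
    }

    /** Formats bytes as lowercase hex, the kind of text a golden file could store. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        byte[] a = sortKey(collator, "apple");
        byte[] b = sortKey(collator, "Banana");
        // Unsigned lexicographic comparison of the keys should agree in sign
        // with comparing the original strings through the collator.
        int byKey = Integer.signum(Arrays.compareUnsigned(a, b));
        int byCollator = Integer.signum(collator.compare("apple", "Banana"));
        System.out.println(hex(a) + " vs " + hex(b));
        System.out.println("same ordering: " + (byKey == byCollator));
    }
}
```

Because only the bytes are stored and compared downstream, a library upgrade that reassigns weights changes these arrays without changing any API signature, which is the failure mode described above.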

h3. Proposal

Add a small markdown golden file that snapshots the byte-level ICU sort
keys for a representative matrix of (locale x case/accent sensitivity x
Unicode input). A new test suite reads the file and asserts byte equality
against ICU's runtime output. When ICU is upgraded, regenerating the
golden file makes any sort-key change show up as a diff in code review.
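
The comparison step could look roughly like the sketch below. It is not the proposed suite: the table layout (one {{| collation | input | hex |}} row per cell), the class {{GoldenSortKeyCheck}}, and the {{keyFn}} parameter are all assumptions made for illustration. The key-producing function is injected so the sketch stays free of dependencies; in the real suite it would presumably wrap the ICU4J sort-key call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical sketch of a golden-file check; the file layout is assumed, not specified by the issue.
final class GoldenSortKeyCheck {
    /** One golden cell: collation name, input string, expected key as lowercase hex. */
    static final class Cell {
        final String collation;
        final String input;
        final String expectedHex;

        Cell(String collation, String input, String expectedHex) {
            this.collation = collation;
            this.input = input;
            this.expectedHex = expectedHex;
        }
    }

    /** Parses table rows, skipping prose, the header row, and the separator row. */
    static List<Cell> parse(List<String> lines) {
        List<Cell> cells = new ArrayList<>();
        for (String line : lines) {
            String t = line.trim();
            if (!t.startsWith("|")) continue;              // disclaimer or other prose
            String[] cols = t.substring(1).split("\\|");
            if (cols.length < 3) continue;
            String first = cols[0].trim();
            if (first.equals("collation") || first.matches("[:-]+")) continue;
            // Note: trimming means inputs with leading/trailing spaces or '|' need escaping.
            cells.add(new Cell(first, cols[1].trim(), cols[2].trim()));
        }
        return cells;
    }

    /** Returns one message per cell whose runtime key differs from the golden hex. */
    static List<String> mismatches(List<Cell> cells, BiFunction<String, String, byte[]> keyFn) {
        List<String> out = new ArrayList<>();
        for (Cell c : cells) {
            String actual = hex(keyFn.apply(c.collation, c.input));
            if (!actual.equals(c.expectedHex)) {
                out.add(c.collation + " / " + c.input
                    + ": expected " + c.expectedHex + " but got " + actual);
            }
        }
        return out;
    }

    private static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}
```

A test would fail when {{mismatches}} returns a non-empty list, and a regenerator would rewrite the hex column from the same {{keyFn}}, so the changed rows appear in the PR diff.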

h3. Scope

Single PR. The skeleton commit (P1a) wires up the suite with a
disclaimer-only golden file. Follow-up commits in the same PR add
(P1b) the actual cell matrix, (P1c) the regenerator, (P1d) the CI hook,
and (P1e) a migration-guide note.

h3. Non-goals

* Not a stability contract — the golden file *surfaces* divergence;
  reviewers still decide whether to accept the new bytes.
* No new SQLConf or user-visible runtime behavior.
* No change to CollationKey semantics or ICU version.

h3. Verification

* New test suite {{o.a.s.sql.ICUCollationSortKeyGoldenSuite}} under
  {{sql/core/src/test/scala/}}.
* Green locally (44s with a warm sbt, 27ms for the test itself).
* Tests-only change; zero production code touched.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
