mihailom-db commented on code in PR #46180:
URL: https://github.com/apache/spark/pull/46180#discussion_r1579005770


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -117,76 +119,422 @@ public Collation(
     }
 
     /**
-     * Constructor with comparators that are inherited from the given collator.
+     * collation id (32-bit integer) layout:
+     * bit 31:    0 = predefined collation, 1 = user-defined collation
+     * bit 30:    0 = utf8-binary, 1 = ICU
+     * bit 29:    0 for utf8-binary / 0 = case-sensitive, 1 = case-insensitive for ICU
+     * bit 28:    0 for utf8-binary / 0 = accent-sensitive, 1 = accent-insensitive for ICU
+     * bit 27-26: zeroes, reserved for punctuation sensitivity
+     * bit 25-24: zeroes, reserved for first letter preference
+     * bit 23-22: 00 = unspecified, 01 = to-lower, 10 = to-upper
+     * bit 21-20: zeroes, reserved for space trimming
+     * bit 19-18: zeroes, reserved for version
+     * bit 17-16: zeroes
+     * bit 15-0:  zeroes for utf8-binary / locale id for ICU
      */
-    public Collation(
-        String collationName,
-        Collator collator,
-        String version,
-        boolean supportsBinaryEquality,
-        boolean supportsBinaryOrdering,
-        boolean supportsLowercaseEquality) {
-      this(
-        collationName,
-        collator,
-        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
-        version,
-        s -> (long)collator.getCollationKey(s.toString()).hashCode(),
-        supportsBinaryEquality,
-        supportsBinaryOrdering,
-        supportsLowercaseEquality);
+    private abstract static class CollationSpec {
+      protected enum ImplementationProvider {
+        UTF8_BINARY, ICU
+      }
+
+      protected enum CaseSensitivity {
+        CS, CI
+      }
+
+      protected enum AccentSensitivity {
+        AS, AI
+      }
+
+      protected enum CaseConversion {
+        UNSPECIFIED, LCASE, UCASE
+      }
+
+      protected static final int implementationProviderOffset = 30;
+      protected static final int implementationProviderLen = 1;
+      protected static final int caseSensitivityOffset = 29;
+      protected static final int caseSensitivityLen = 1;
+      protected static final int accentSensitivityOffset = 28;
+      protected static final int accentSensitivityLen = 1;
+      protected static final int caseConversionOffset = 22;
+      protected static final int caseConversionLen = 2;
+      protected static final int localeOffset = 0;
+      protected static final int localeLen = 16;

Review Comment:
   Why don't we set some of these values in hexadecimal (0x0...)? I would 
expect to see some bit masking, given that collationId has a bit-level 
representation. What do others think? 
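
To make the masking suggestion concrete, here is a minimal sketch of what it could look like, assuming the bit layout documented in the Javadoc above. The class name `CollationIdMasks`, the `*_MASK` constants, and the `mask`/`extract` helpers are hypothetical and not part of the PR; only the offsets and lengths are taken from the diff.

```java
// Hypothetical illustration only; not code from the PR.
final class CollationIdMasks {
  // Offsets copied from the diff above.
  static final int CASE_CONVERSION_OFFSET = 22;
  static final int LOCALE_OFFSET = 0;

  // Hexadecimal masks spelled out per field (assumed naming).
  static final int IMPLEMENTATION_PROVIDER_MASK = 0x40000000; // bit 30
  static final int CASE_SENSITIVITY_MASK        = 0x20000000; // bit 29
  static final int ACCENT_SENSITIVITY_MASK      = 0x10000000; // bit 28
  static final int CASE_CONVERSION_MASK         = 0x00C00000; // bits 23-22
  static final int LOCALE_MASK                  = 0x0000FFFF; // bits 15-0

  // Derives a mask from an (offset, length) pair, so the two styles can coexist.
  static int mask(int offset, int len) {
    return ((1 << len) - 1) << offset;
  }

  // Reads one field out of a collation id: mask first, then shift down.
  static int extract(int collationId, int mask, int offset) {
    return (collationId & mask) >>> offset;
  }

  public static void main(String[] args) {
    // ICU, case-insensitive, to-lower (01), with a made-up locale id of 0x002A.
    int id = IMPLEMENTATION_PROVIDER_MASK | CASE_SENSITIVITY_MASK
        | (1 << CASE_CONVERSION_OFFSET) | 0x002A;
    System.out.println(Integer.toHexString(id));
    System.out.println(extract(id, CASE_CONVERSION_MASK, CASE_CONVERSION_OFFSET));
    System.out.println(extract(id, LOCALE_MASK, LOCALE_OFFSET));
  }
}
```

Either style encodes the same information: `mask(22, 2)` yields the same value as the literal `0x00C00000`. The literals make the layout visible at a glance, while the offset/length pairs are easier to keep in sync if a field moves.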



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

