nikolamand-db commented on code in PR #46180: URL: https://github.com/apache/spark/pull/46180#discussion_r1595078750
########## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ########## @@ -117,76 +119,445 @@ public Collation( } /** - * Constructor with comparators that are inherited from the given collator. + * collation id (32-bit integer) layout: + * bit 31: 0 = predefined collation, 1 = user-defined collation + * bit 30-29: 00 = utf8-binary, 01 = ICU, 10 = indeterminate (without spec implementation) + * bit 28: 0 for utf8-binary / 0 = case-sensitive, 1 = case-insensitive for ICU + * bit 27: 0 for utf8-binary / 0 = accent-sensitive, 1 = accent-insensitive for ICU + * bit 26-25: zeroes, reserved for punctuation sensitivity + * bit 24-23: zeroes, reserved for first letter preference + * bit 22-21: 00 = unspecified, 01 = to-lower, 10 = to-upper + * bit 20-19: zeroes, reserved for space trimming + * bit 18-17: zeroes, reserved for version + * bit 16-12: zeroes + * bit 11-0: zeroes for utf8-binary / locale id for ICU */ - public Collation( - String collationName, - Collator collator, - String version, - boolean supportsBinaryEquality, - boolean supportsBinaryOrdering, - boolean supportsLowercaseEquality) { - this( - collationName, - collator, - (s1, s2) -> collator.compare(s1.toString(), s2.toString()), - version, - s -> (long)collator.getCollationKey(s.toString()).hashCode(), - supportsBinaryEquality, - supportsBinaryOrdering, - supportsLowercaseEquality); + private abstract static class CollationSpec { + protected enum ImplementationProvider { + UTF8_BINARY, ICU, INDETERMINATE + } + + protected enum CaseSensitivity { + CS, CI + } + + protected enum AccentSensitivity { + AS, AI + } + + protected enum CaseConversion { + UNSPECIFIED, LCASE, UCASE + } + + protected static final int IMPLEMENTATION_PROVIDER_OFFSET = 29; + protected static final int IMPLEMENTATION_PROVIDER_MASK = 0b11; + protected static final int CASE_SENSITIVITY_OFFSET = 28; + protected static final int CASE_SENSITIVITY_MASK = 0b1; + protected static final int 
ACCENT_SENSITIVITY_OFFSET = 27; + protected static final int ACCENT_SENSITIVITY_MASK = 0b1; + protected static final int CASE_CONVERSION_OFFSET = 21; + protected static final int CASE_CONVERSION_MASK = 0b11; + protected static final int LOCALE_OFFSET = 0; + protected static final int LOCALE_MASK = 0x0FFF; + + protected static final int INDETERMINATE_COLLATION_ID = + ImplementationProvider.INDETERMINATE.ordinal() << IMPLEMENTATION_PROVIDER_OFFSET; + + protected final CaseSensitivity caseSensitivity; + protected final AccentSensitivity accentSensitivity; + protected final CaseConversion caseConversion; + protected final String locale; + protected final int collationId; + + protected CollationSpec( + String locale, + CaseSensitivity caseSensitivity, + AccentSensitivity accentSensitivity, + CaseConversion caseConversion) { + this.locale = locale; + this.caseSensitivity = caseSensitivity; + this.accentSensitivity = accentSensitivity; + this.caseConversion = caseConversion; + this.collationId = getCollationId(); + } + + private static final Map<Integer, Collation> collationMap = new ConcurrentHashMap<>(); + + public static Collation fetchCollation(int collationId) throws SparkException { + if (collationId == UTF8_BINARY_COLLATION_ID) { + return CollationSpecUTF8Binary.UTF8_BINARY_COLLATION; + } else if (collationMap.containsKey(collationId)) { + return collationMap.get(collationId); + } else { + CollationSpec spec; + int implementationProviderOrdinal = + (collationId >> IMPLEMENTATION_PROVIDER_OFFSET) & IMPLEMENTATION_PROVIDER_MASK; + if (implementationProviderOrdinal >= ImplementationProvider.values().length) { + throw SparkException.internalError("Invalid collation implementation provider"); + } else { + ImplementationProvider implementationProvider = ImplementationProvider.values()[ + implementationProviderOrdinal]; + if (implementationProvider == ImplementationProvider.UTF8_BINARY) { + spec = CollationSpecUTF8Binary.fromCollationId(collationId); + } else if 
(implementationProvider == ImplementationProvider.ICU) { + spec = CollationSpecICU.fromCollationId(collationId); + } else { + throw SparkException.internalError("Cannot instantiate indeterminate collation"); + } + Collation collation = spec.buildCollation(); + collationMap.put(collationId, collation); Review Comment: For now (looking at the changes of this PR) it's not that big a deal, since we're supporting ~200 locales x 2 case sensitivity x 2 accent sensitivity x 3 case conversions. So if a user wants to use all available collations (highly unlikely — a user will probably use several different collations on average), it will be < 2000 entries in the map. The idea with the map (cache) approach was that we don't want to hardcode all the collations into the table, which was the previous approach. This will make a considerable difference when we add support for punctuation sensitivity, first letter preference, and space trimming; this will grow the number of available collations to ~40k. I think it would make sense to worry about caching strategies after these additions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org