Re: [PR] [SPARK-47567][SQL] Support LOCATE function to work with collated strings [spark]

via GitHub Mon, 08 Apr 2024 01:26:12 -0700


dbatomic commented on code in PR #45791:
URL: https://github.com/apache/spark/pull/45791#discussion_r1555419431



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -176,15 +176,31 @@ public Collation(
    */
 
   public static StringSearch getStringSearch(
-      final UTF8String left,
-      final UTF8String right,
+      final UTF8String targetUTF8String,
+      final UTF8String patternUTF8String,
       final int collationId) {
-    String pattern = right.toString();
-    CharacterIterator target = new StringCharacterIterator(left.toString());
+
+    if (collationId == UTF8_BINARY_COLLATION_ID) {
+      return getStringSearch(targetUTF8String, patternUTF8String);
+    } else if (collationId == UTF8_BINARY_LCASE_COLLATION_ID) {
+      return getStringSearch(targetUTF8String.toLowerCase(), patternUTF8String.toLowerCase());
+    }
+
+    String pattern = patternUTF8String.toString();

Review Comment:
   The general principle is that we should minimize heap allocations in the
string comparison path.
   In this snippet we are doing:
   1) 2 allocations for UTF8_BINARY (should be 0)
   2) For UTF8_BINARY_LCASE it gets kind of crazy. If the string has
characters in the non-ASCII range, we first convert it to String, then back to
UTF8String, then back to String :)
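
   To make the allocation concern concrete, here is a minimal sketch of a
search that works directly on UTF-8 bytes and allocates nothing per call.
`ByteSearch`, `indexOf`, and `indexOfAsciiIgnoreCase` are hypothetical names
for illustration and are not part of Spark or of this PR. The case-insensitive
variant lowers only ASCII letters, so a real implementation would still need a
fallback for non-ASCII input.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Spark's API: searches raw UTF-8 bytes
// without creating any String or iterator objects.
final class ByteSearch {
  private ByteSearch() {}

  /** Byte offset of the first exact match at or after start, or -1. */
  static int indexOf(byte[] target, byte[] pattern, int start) {
    if (start < 0) start = 0;
    if (pattern.length == 0) return start <= target.length ? start : -1;
    int last = target.length - pattern.length;
    for (int i = start; i <= last; i++) {
      int j = 0;
      while (j < pattern.length && target[i + j] == pattern[j]) j++;
      if (j == pattern.length) return i;
    }
    return -1;
  }

  /** Same search, but treats ASCII letters A-Z and a-z as equal. */
  static int indexOfAsciiIgnoreCase(byte[] target, byte[] pattern, int start) {
    if (start < 0) start = 0;
    if (pattern.length == 0) return start <= target.length ? start : -1;
    int last = target.length - pattern.length;
    for (int i = start; i <= last; i++) {
      int j = 0;
      while (j < pattern.length
          && lowerAscii(target[i + j]) == lowerAscii(pattern[j])) {
        j++;
      }
      if (j == pattern.length) return i;
    }
    return -1;
  }

  // Lowers A-Z only; bytes outside that range (including all bytes of
  // multi-byte UTF-8 sequences) are returned unchanged.
  private static byte lowerAscii(byte b) {
    return (b >= 'A' && b <= 'Z') ? (byte) (b + 32) : b;
  }

  public static void main(String[] args) {
    byte[] target = "Hello, Spark".getBytes(StandardCharsets.UTF_8);
    byte[] pattern = "spark".getBytes(StandardCharsets.UTF_8);
    System.out.println(indexOf(target, pattern, 0));
    System.out.println(indexOfAsciiIgnoreCase(target, pattern, 0));
  }
}
```

   Note that both methods return a byte offset, not a character position, so a
LOCATE-style function would still have to translate the result into a
character index.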



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

