Re: [PR] [SPARK-48441][SQL] Fix StringTrim behaviour for non-UTF8_BINARY collations [spark]

via GitHub Sun, 07 Jul 2024 20:11:24 -0700


uros-db commented on code in PR #46762:
URL: https://github.com/apache/spark/pull/46762#discussion_r1667891877



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##########
@@ -657,57 +659,64 @@ public static Map<String, String> getCollationAwareDict(UTF8String string,
   public static UTF8String lowercaseTrim(
       final UTF8String srcString,
       final UTF8String trimString) {
+    return lowercaseTrimRight(lowercaseTrimLeft(srcString, trimString), trimString);
+  }
+
+  public static UTF8String trim(
+      final UTF8String srcString,
+      final UTF8String trimString,
+      final int collationId) {
+    return trimRight(trimLeft(srcString, trimString, collationId), trimString, collationId);
+  }
+
+  public static UTF8String lowercaseTrimLeft(
+      final UTF8String srcString,
+      final UTF8String trimString) {
     // Matching UTF8String behavior for null `trimString`.
     if (trimString == null) {
       return null;
     }
 
-    UTF8String leftTrimmed = lowercaseTrimLeft(srcString, trimString);
-    return lowercaseTrimRight(leftTrimmed, trimString);
+    HashSet<Integer> trimChars = new HashSet<>();
+    Iterator<Integer> trimIter = trimString.codePointIterator();
+    while (trimIter.hasNext()) trimChars.add(UCharacter.toLowerCase(trimIter.next()));
+
+    int searchIndex = 0;
+    Iterator<Integer> srcIter = srcString.codePointIterator();
+    while (srcIter.hasNext()) {
+      if (!trimChars.contains(UCharacter.toLowerCase(srcIter.next()))) break;
+      ++searchIndex;
+    }
+
+    return srcString.substring(searchIndex, srcString.numChars());
   }
 
-  public static UTF8String lowercaseTrimLeft(
+  public static UTF8String trimLeft(
       final UTF8String srcString,
-      final UTF8String trimString) {
+      final UTF8String trimString,
+      final int collationId) {
     // Matching UTF8String behavior for null `trimString`.
     if (trimString == null) {
       return null;
     }
 
-    // The searching byte position in the srcString.
-    int searchIdx = 0;
-    // The byte position of a first non-matching character in the srcString.
-    int trimByteIdx = 0;
-    // Number of bytes in srcString.
-    int numBytes = srcString.numBytes();
-    // Convert trimString to lowercase, so it can be searched properly.
-    UTF8String lowercaseTrimString = trimString.toLowerCase();
-
-    while (searchIdx < numBytes) {
-      UTF8String searchChar = srcString.copyUTF8String(
-        searchIdx,
-        searchIdx + UTF8String.numBytesForFirstByte(srcString.getByte(searchIdx)) - 1);
-      int searchCharBytes = searchChar.numBytes();
-
-      // Try to find the matching for the searchChar in the trimString.
-      if (lowercaseTrimString.find(searchChar.toLowerCase(), 0) >= 0) {
-        trimByteIdx += searchCharBytes;
-        searchIdx += searchCharBytes;
-      } else {
-        // No matching, exit the search.
-        break;
-      }
+    // Create a set of collation keys for all characters of the trim string, for fast lookup.
+    String trim = trimString.toString();
+    HashSet<String> trimChars = new HashSet<>();
+    for (int i = 0; i < trim.length(); i++) {
+      trimChars.add(CollationFactory.getCollationKey(String.valueOf(trim.charAt(i)), collationId));
     }
 
-    if (searchIdx == 0) {
-      // Nothing trimmed - return original string (not converted to lowercase).
-      return srcString;
-    }
-    if (trimByteIdx >= numBytes) {
-      // Everything trimmed.
-      return UTF8String.EMPTY_UTF8;
+    // Iterate over srcString from the left and find the first character that is not in trimChars.
+    String input = srcString.toString();
+    int i = 0;
+    while (i < input.length()) {
+      String key = CollationFactory.getCollationKey(String.valueOf(input.charAt(i)), collationId);
+      if (!trimChars.contains(key)) break;
+      ++i;

Review Comment:
   Updated: this now uses `StringSearch`.
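
   For readers following the thread, here is a rough sketch of the
   per-character collation-key lookup that the quoted `trimLeft` hunk
   performs. It is not the code in this PR: it swaps in the JDK's
   `java.text.Collator` for Spark's `CollationFactory` and for ICU's
   `StringSearch`, works on plain `String` rather than `UTF8String`, and
   the class and method names (`CollationTrimSketch`, `trimLeft`,
   `caseInsensitive`) are hypothetical. It also steps by code point rather
   than by `charAt`, so a surrogate pair is never split into two lookups.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a stand-in for the collation-aware left trim in the diff,
// built on java.text.Collator (an assumption, not what Spark uses).
public class CollationTrimSketch {

  // Drops leading code points of `src` whose collation key equals the key of
  // some code point in `trim`. Mirrors the null handling of the quoted hunk.
  public static String trimLeft(String src, String trim, Collator collator) {
    if (trim == null) {
      return null;
    }

    // Collect one collation key per code point of the trim string.
    Set<CollationKey> trimKeys = new HashSet<>();
    trim.codePoints().forEach(cp ->
        trimKeys.add(collator.getCollationKey(new String(Character.toChars(cp)))));

    // Walk src from the left, one code point at a time, until a key misses.
    int i = 0;
    while (i < src.length()) {
      int cp = src.codePointAt(i);
      CollationKey key = collator.getCollationKey(new String(Character.toChars(cp)));
      if (!trimKeys.contains(key)) {
        break;
      }
      i += Character.charCount(cp);
    }
    return src.substring(i);
  }

  // A root-locale collator at SECONDARY strength, which ignores case differences.
  public static Collator caseInsensitive() {
    Collator collator = Collator.getInstance(Locale.ROOT);
    collator.setStrength(Collator.SECONDARY);
    return collator;
  }
}
```

   With the case-insensitive collator above,
   `trimLeft("XxAbc", "x", caseInsensitive())` strips both the `X` and the
   `x`. Comparing one code point at a time still cannot handle collation
   elements that span several characters, which is one reason a
   `StringSearch`-based match may be the better fit.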


