miland-db commented on code in PR #45791: URL: https://github.com/apache/spark/pull/45791#discussion_r1555445740
##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -176,15 +176,31 @@ public Collation(
    */
   public static StringSearch getStringSearch(
-      final UTF8String left,
-      final UTF8String right,
+      final UTF8String targetUTF8String,
+      final UTF8String patternUTF8String,
       final int collationId) {
-    String pattern = right.toString();
-    CharacterIterator target = new StringCharacterIterator(left.toString());
+
+    if (collationId == UTF8_BINARY_COLLATION_ID) {
+      return getStringSearch(targetUTF8String, patternUTF8String);
+    } else if (collationId == UTF8_BINARY_LCASE_COLLATION_ID) {
+      return getStringSearch(targetUTF8String.toLowerCase(), patternUTF8String.toLowerCase());
+    }
+
+    String pattern = patternUTF8String.toString();

Review Comment:
   1. _2 allocations for UTF8_BINARY (should be 0)_ - this should never be called for UTF8_BINARY.
   2. For this one, I understand the consequences, but it's very similar to what we had to do in `UTF8String` to work correctly with the UTF8_BINARY_LCASE collation. It makes the code a lot cleaner than before, when we had separate methods with _almost identical_ code for UTF8_BINARY_LCASE and the other collations. If this is not good/performant enough, we should think of another way to solve it, because more and more PRs are coming in with this change.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
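To make the trade-off in point 2 of the review comment concrete, here is a minimal sketch of the "lowercase both sides, then reuse the binary path" dispatch. It is not Spark's code: the class name, the `BINARY_ID` and `LCASE_ID` constants, and the `indexOf`/`indexOfBytes` helpers are all hypothetical, and plain `String` stands in for `UTF8String`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical sketch of the dispatch pattern; names and IDs are illustrative only.
final class CollationDispatchSketch {
    static final int BINARY_ID = 0; // assumed stand-in for UTF8_BINARY_COLLATION_ID
    static final int LCASE_ID = 1;  // assumed stand-in for UTF8_BINARY_LCASE_COLLATION_ID

    // Byte-wise search: returns the byte offset of the first match, or -1 if none.
    static int indexOfBytes(byte[] target, byte[] pattern) {
        outer:
        for (int i = 0; i + pattern.length <= target.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (target[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    // Single entry point: the lowercase collation reuses the binary search path.
    static int indexOf(String target, String pattern, int collationId) {
        if (collationId == LCASE_ID) {
            // The cost discussed in the review: one lowercased copy per argument.
            target = target.toLowerCase(Locale.ROOT);
            pattern = pattern.toLowerCase(Locale.ROOT);
        } else if (collationId != BINARY_ID) {
            throw new IllegalArgumentException(
                "collation not covered by this sketch: " + collationId);
        }
        // getBytes also allocates in this sketch; it only illustrates control flow.
        return indexOfBytes(
            target.getBytes(StandardCharsets.UTF_8),
            pattern.getBytes(StandardCharsets.UTF_8));
    }
}
```

With these assumed names, `CollationDispatchSketch.indexOf("Hello World", "WORLD", CollationDispatchSketch.LCASE_ID)` finds a match while the same call with `BINARY_ID` does not. The single entry point avoids duplicating the search logic, at the price of the two lowercased copies on the lowercase path.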