GitHub user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12646#discussion_r117907961

    --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
    @@ -510,6 +510,69 @@ public UTF8String trim() {
         }
       }

    +  /**
    +   * Removes the given trim string from both ends of a string
    +   * @param trimString the trim character string
    +   */
    +  public UTF8String trim(UTF8String trimString) {
    +    // This method searches for each character in the source string, removes the character if it is found
    +    // in the trim string, stops at the first not found. It starts from left end, then right end.
    +    // It returns a new string in which both ends trim characters have been removed.
    +    int s = 0; // the searching byte position of the input string
    +    int i = 0; // the first beginning byte position of a non-matching character
    +    int e = 0; // the last byte position
    +    int numChars = 0; // number of characters from the input string
    +    int[] stringCharLen = new int[numBytes]; // array of character length for the input string
    +    int[] stringCharPos = new int[numBytes]; // array of the first byte position for each character in the input string
    +    int searchCharBytes;
    +
    +    while (s < this.numBytes) {
    +      UTF8String searchChar = copyUTF8String(s, s + numBytesForFirstByte(this.getByte(s)) - 1);
    +      searchCharBytes = searchChar.numBytes;
    +      // try to find the matching for the searchChar in the trimString set
    +      if (trimString.find(searchChar, 0) >= 0) {
    +        i += searchCharBytes;
    +      } else {
    +        // no matching, exit the search
    +        break;
    +      }
    +      s += searchCharBytes;
    +    }
    +
    +    if (i >= this.numBytes) {
    +      // empty string
    +      return UTF8String.EMPTY_UTF8;
    +    } else {
    +      //build the position and length array
    +      s = 0;
    +      while (s < numBytes) {
    +        stringCharPos[numChars] = s;
    +        stringCharLen[numChars]= numBytesForFirstByte(getByte(s));
    --- End diff --

    > I was thinking that these two arrays are only used by trimRight; in the case where trimLeft trims the whole source string, we don't need to do the trimRight, so it will save some work.

    Yeah, I agree with you. My only concern was that `numBytesForFirstByte` is called twice for the leading matched characters. But it seems easier to extract methods based on the current implementation. Let's keep `stringCharPos` and `stringCharLen` only in the "trimRight" part.
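
    To make the suggestion concrete, here is a rough sketch of the split being discussed. It is not the PR's code: it works on plain UTF-8 byte arrays rather than `UTF8String`, the class and helper names (`TrimSketch`, `utf8CharLen`, `inTrimSet`, `trimLeftOffset`, `trimRightEnd`) are made up for illustration, and the membership test compares whole characters instead of calling `trimString.find`. The point is only that the position and length arrays live inside the right-trim step and are never allocated when the left trim consumes the whole string.

```java
import java.nio.charset.StandardCharsets;

class TrimSketch {
    // Byte length of a UTF-8 character given its first byte.
    // Hypothetical stand-in for UTF8String.numBytesForFirstByte.
    static int utf8CharLen(byte b) {
        int v = b & 0xFF;
        if (v < 0x80) return 1;
        if (v >= 0xF0) return 4;
        if (v >= 0xE0) return 3;
        if (v >= 0xC0) return 2;
        return 1; // stray continuation byte: treat as a single byte
    }

    // True if the character at src[pos, pos + len) equals some character in trim.
    static boolean inTrimSet(byte[] src, int pos, int len, byte[] trim) {
        int t = 0;
        while (t < trim.length) {
            int tl = Math.min(utf8CharLen(trim[t]), trim.length - t);
            if (tl == len) {
                boolean same = true;
                for (int k = 0; k < len; k++) {
                    if (src[pos + k] != trim[t + k]) { same = false; break; }
                }
                if (same) return true;
            }
            t += tl;
        }
        return false;
    }

    // "trimLeft" part: byte offset of the first character not in the trim set.
    static int trimLeftOffset(byte[] src, byte[] trim) {
        int s = 0;
        while (s < src.length) {
            int len = Math.min(utf8CharLen(src[s]), src.length - s);
            if (!inTrimSet(src, s, len, trim)) break;
            s += len;
        }
        return s;
    }

    // "trimRight" part: the position/length arrays are built only here,
    // and only for the bytes that survived the left trim.
    static int trimRightEnd(byte[] src, int start, byte[] trim) {
        int[] charPos = new int[src.length - start];
        int[] charBytes = new int[src.length - start];
        int numChars = 0;
        int s = start;
        while (s < src.length) {
            int len = Math.min(utf8CharLen(src[s]), src.length - s);
            charPos[numChars] = s;
            charBytes[numChars] = len;
            numChars++;
            s += len;
        }
        int end = src.length;
        for (int c = numChars - 1; c >= 0; c--) {
            if (!inTrimSet(src, charPos[c], charBytes[c], trim)) break;
            end = charPos[c];
        }
        return end;
    }

    static String trim(String source, String trimChars) {
        byte[] src = source.getBytes(StandardCharsets.UTF_8);
        byte[] trim = trimChars.getBytes(StandardCharsets.UTF_8);
        int start = trimLeftOffset(src, trim);
        if (start >= src.length) {
            return ""; // everything was trimmed from the left; skip trimRight entirely
        }
        int end = trimRightEnd(src, start, trim);
        return new String(src, start, end - start, StandardCharsets.UTF_8);
    }
}
```

    With this shape, a call such as `TrimSketch.trim("xyxy", "xy")` returns from the left-trim branch without allocating either array, while `TrimSketch.trim("xxhixx", "x")` builds them only for the bytes from `h` onward.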