[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...

wzhfy Mon, 22 May 2017 23:22:57 -0700

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12646#discussion_r117907961
  
    --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
    @@ -510,6 +510,69 @@ public UTF8String trim() {
         }
       }
     
    +  /**
    +   * Removes the given trim string from both ends of a string
    +   * @param trimString the trim character string
    +   */
    +  public UTF8String trim(UTF8String trimString) {
    +    // This method searches for each character in the source string, removes the character if it is found
    +    // in the trim string, stops at the first not found. It starts from left end, then right end.
    +    // It returns a new string in which both ends trim characters have been removed.
    +    int s = 0; // the searching byte position of the input string
    +    int i = 0; // the first beginning byte position of a non-matching character
    +    int e = 0; // the last byte position
    +    int numChars = 0; // number of characters from the input string
    +    int[] stringCharLen = new int[numBytes]; // array of character length for the input string
    +    int[] stringCharPos = new int[numBytes]; // array of the first byte position for each character in the input string
    +    int searchCharBytes;
    +
    +    while (s < this.numBytes) {
    +      UTF8String searchChar = copyUTF8String(s, s + numBytesForFirstByte(this.getByte(s)) - 1);
    +      searchCharBytes = searchChar.numBytes;
    +      // try to find the matching for the searchChar in the trimString set
    +      if (trimString.find(searchChar, 0) >= 0) {
    +        i += searchCharBytes;
    +      } else {
    +        // no matching, exit the search
    +        break;
    +      }
    +      s += searchCharBytes;
    +    }
    +
    +    if (i >= this.numBytes) {
    +      // empty string
    +      return UTF8String.EMPTY_UTF8;
    +    } else {
    +      //build the position and length array
    +      s = 0;
    +      while (s < numBytes) {
    +        stringCharPos[numChars] = s;
    +        stringCharLen[numChars]= numBytesForFirstByte(getByte(s));
    --- End diff --
    
    > I was thinking that these two arrays are only used by trimRight. In the
    > case where trimLeft trims the whole source string, we don't need to do
    > the trimRight at all, so it will save some performance.
    
    Yeah, I agree with you. My only concern was that `numBytesForFirstByte` is
    called twice for the leading matched chars. But it seems easier to extract
    methods based on the current implementation. Let's keep `stringCharPos` and
    `stringCharLen` only in the "trimRight" part.
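
A rough sketch of that split, in case it helps the discussion. It is not the PR's code: it works on plain `byte[]` arrays rather than `UTF8String`, and `TrimSketch`, `charLen`, `inTrimSet`, `trimLeftStart` and `trimRightEnd` are hypothetical names. It assumes well-formed UTF-8 input. The point is only that the position and length arrays are allocated inside the right-trim step, which is skipped when the left trim consumes the whole string.

```java
import java.nio.charset.StandardCharsets;

class TrimSketch {
  // Byte length of a UTF-8 character, derived from its first byte.
  // Assumes well-formed UTF-8; a stray continuation byte is treated as length 1.
  static int charLen(byte first) {
    int b = first & 0xFF;
    if (b >= 0xF0) return 4;
    if (b >= 0xE0) return 3;
    if (b >= 0xC0) return 2;
    return 1;
  }

  // True if the character src[pos, pos + len) equals some whole character of trim.
  static boolean inTrimSet(byte[] src, int pos, int len, byte[] trim) {
    for (int t = 0; t < trim.length; ) {
      int tl = charLen(trim[t]);
      if (tl == len) {
        boolean same = true;
        for (int k = 0; k < len && same; k++) {
          same = src[pos + k] == trim[t + k];
        }
        if (same) return true;
      }
      t += tl;
    }
    return false;
  }

  // Left part: byte offset of the first character that is not in the trim set.
  // No per-character arrays are needed here.
  static int trimLeftStart(byte[] src, byte[] trim) {
    int s = 0;
    while (s < src.length) {
      int len = charLen(src[s]);
      if (!inTrimSet(src, s, len, trim)) break;
      s += len;
    }
    return s;
  }

  // Right part: the position/length arrays live only here, and only cover
  // the bytes that survived the left trim.
  static int trimRightEnd(byte[] src, int start, byte[] trim) {
    int[] charPos = new int[src.length - start];
    int[] charLens = new int[src.length - start];
    int numChars = 0;
    for (int s = start; s < src.length; numChars++) {
      charPos[numChars] = s;
      charLens[numChars] = charLen(src[s]);
      s += charLens[numChars];
    }
    int end = src.length; // exclusive end offset of the kept region
    for (int c = numChars - 1; c >= 0; c--) {
      if (!inTrimSet(src, charPos[c], charLens[c], trim)) break;
      end = charPos[c];
    }
    return end;
  }

  static String trim(String source, String trimChars) {
    byte[] src = source.getBytes(StandardCharsets.UTF_8);
    byte[] trim = trimChars.getBytes(StandardCharsets.UTF_8);
    int start = trimLeftStart(src, trim);
    if (start >= src.length) {
      return ""; // everything was trimmed from the left; skip the right part
    }
    int end = trimRightEnd(src, start, trim);
    return new String(src, start, end - start, StandardCharsets.UTF_8);
  }
}
```

With this shape, a call such as `TrimSketch.trim("xyhelloyx", "xy")` runs both parts, while `TrimSketch.trim("xxx", "x")` returns early without allocating either array.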


