[ 
https://issues.apache.org/jira/browse/DRILL-5450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992350#comment-15992350
 ] 

ASF GitHub Bot commented on DRILL-5450:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/821#discussion_r114249116
  
    --- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctionHelpers.java
 ---
    @@ -144,41 +144,28 @@ public static int varTypesToInt(final int start, 
final int end, DrillBuf buffer)
         return result;
       }
     
    -  // Assumes Alpha as [A-Za-z0-9]
    -  // white space is treated as everything else.
    +  /**
    +   * Capitalizes first letter in each word.
    +   * Any symbol except digits and letters is considered as word delimiter.
    +   *
    +   * @param start start position in input buffer
    +   * @param end end position in input buffer
    +   * @param inBuf buffer with input characters
    +   * @param outBuf buffer with output characters
    +   */
       public static void initCap(int start, int end, DrillBuf inBuf, DrillBuf 
outBuf) {
    -    boolean capNext = true;
    +    boolean capitalizeNext = true;
         int out = 0;
         for (int id = start; id < end; id++, out++) {
    -      byte currentByte = inBuf.getByte(id);
    -
    -      // 'A - Z' : 0x41 - 0x5A
    -      // 'a - z' : 0x61 - 0x7A
    -      // '0-9' : 0x30 - 0x39
    -      if (capNext) { // curCh is whitespace or first character of word.
    -        if (currentByte >= 0x30 && currentByte <= 0x39) { // 0-9
    -          capNext = false;
    -        } else if (currentByte >= 0x41 && currentByte <= 0x5A) { // A-Z
    -          capNext = false;
    -        } else if (currentByte >= 0x61 && currentByte <= 0x7A) { // a-z
    -          capNext = false;
    -          currentByte -= 0x20; // Uppercase this character
    -        }
    -        // else {} whitespace
    -      } else { // Inside of a word or white space after end of word.
    -        if (currentByte >= 0x30 && currentByte <= 0x39) { // 0-9
    -          // noop
    -        } else if (currentByte >= 0x41 && currentByte <= 0x5A) { // A-Z
    -          currentByte -= 0x20; // Lowercase this character
    -        } else if (currentByte >= 0x61 && currentByte <= 0x7A) { // a-z
    -          // noop
    -        } else { // whitespace
    -          capNext = true;
    -        }
    +      int currentByte = inBuf.getByte(id);
    --- End diff --
    
    This code works only for ASCII, but not for UTF-8. UTF-8 is a multi-byte 
code that requires special encoding/decoding to convert to Unicode characters. 
Without that encoding, this method won't work for Cyrillic, Greek or any other 
character set with upper/lower distinctions.
    
    Since this method never worked, it is probably OK to make it a bit less 
broken than before: at least now it works for ASCII. Please add unit tests 
below, then file a JIRA, for the fact that this function does not work with 
UTF-8 despite the fact that Drill claims it supports UTF-8.


> Fix initcap function to convert upper case characters correctly
> ---------------------------------------------------------------
>
>                 Key: DRILL-5450
>                 URL: https://issues.apache.org/jira/browse/DRILL-5450
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.10.0
>            Reporter: Arina Ielchiieva
>            Assignee: Arina Ielchiieva
>
> Initcap function converts incorrectly subsequent upper case characters after 
> first character.
> {noformat}
> 0: jdbc:drill:zk=local> select initcap('aaa') from (values(1));
> +---------+
> | EXPR$0  |
> +---------+
> | Aaa     |
> +---------+
> 1 row selected (0.275 seconds)
> 0: jdbc:drill:zk=local> select initcap('AAA') from (values(1));
> +---------+
> | EXPR$0  |
> +---------+
> | A!!     |
> +---------+
> 1 row selected (0.27 seconds)
> 0: jdbc:drill:zk=local> select initcap('aAa') from (values(1));
> +---------+
> | EXPR$0  |
> +---------+
> | A!a     |
> +---------+
> 1 row selected (0.229 seconds)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to