Quanlong Huang created IMPALA-12718:
---------------------------------------

             Summary: trim() functions lack UTF-8 support
                 Key: IMPALA-12718
                 URL: https://issues.apache.org/jira/browse/IMPALA-12718
             Project: IMPALA
          Issue Type: Bug
            Reporter: Quanlong Huang


The following string functions lack UTF-8 support:
{noformat}
BTRIM(STRING a, STRING chars_to_trim)
LTRIM(STRING a, STRING chars_to_trim)
RTRIM(STRING a, STRING chars_to_trim)
{noformat}
Here is an issue reported by our user:
{noformat}
[localhost:21050] default> select rtrim('价格,', ',');
+-----------------------+
| rtrim('价格,', ',') |
+-----------------------+
| 价�                   |
+-----------------------+{noformat}
The result is the same with utf8_mode=true. Note that the comma in the strings 
above is the Chinese (fullwidth) punctuation mark ',' (U+FF0C), not the English 
(ASCII) mark ','.

The cause is that the bytes of the Chinese character ',' are used as a set of 
single-byte characters to trim. The UTF-8 encodings of these characters are:
 * '价': 0xe4 0xbb 0xb7
 * '格': 0xe6 0xa0 0xbc
 * ',': 0xef 0xbc 0x8c
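
The byte values listed above can be double-checked with a few lines of Python. This is only an illustrative sketch; the helper name utf8_hex is hypothetical and not part of Impala.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of `ch` as space-separated hex values."""
    return " ".join(f"0x{b:02x}" for b in ch.encode("utf-8"))


# "\uff0c" is the fullwidth (Chinese) comma used in the report.
for ch in ("价", "格", "\uff0c"):
    print(ch, utf8_hex(ch))
```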

Each character is encoded in 3 bytes. The last byte of '格' is 0xbc, which also 
appears in the bytes of ',', so it is removed as well. The result is '价' 
followed by the first two bytes of '格' (0xe6 0xa0). That truncated sequence is 
malformed UTF-8, so it is displayed as '�'.
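
The behavior described above can be reproduced outside Impala. The following Python sketch contrasts a byte-wise rtrim with a code-point-aware one. Both functions (rtrim_bytes, rtrim_codepoints) are hypothetical helpers written for illustration, not Impala's actual implementation or its eventual fix.

```python
def rtrim_bytes(s: bytes, chars: bytes) -> bytes:
    """Byte-wise rtrim: every single byte of `chars` is a trim candidate."""
    trim_set = set(chars)
    end = len(s)
    while end > 0 and s[end - 1] in trim_set:
        end -= 1
    return s[:end]


def rtrim_codepoints(s: str, chars: str) -> str:
    """Code-point-aware rtrim: only whole characters of `chars` are trimmed."""
    trim_set = set(chars)
    end = len(s)
    while end > 0 and s[end - 1] in trim_set:
        end -= 1
    return s[:end]


# "\uff0c" is the fullwidth (Chinese) comma from the report.
text, chars = "价格\uff0c", "\uff0c"

# Byte-wise trimming also strips 0xbc, the last byte of '格'.
broken = rtrim_bytes(text.encode("utf-8"), chars.encode("utf-8"))
print(broken.hex(" "))
print(broken.decode("utf-8", errors="replace"))

# Trimming whole code points keeps '格' intact.
print(rtrim_codepoints(text, chars))
```

The byte-wise variant leaves a truncated sequence at the end of the string, which matches the reported output. The code-point-aware variant returns '价格'.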



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
