Quanlong Huang created IMPALA-12718:
---------------------------------------

             Summary: trim() functions are lack of utf-8 support
                 Key: IMPALA-12718
                 URL: https://issues.apache.org/jira/browse/IMPALA-12718
             Project: IMPALA
          Issue Type: Bug
            Reporter: Quanlong Huang

The following string functions lack UTF-8 support:
{noformat}
BTRIM(STRING a, STRING chars_to_trim)
LTRIM(STRING a, STRING chars_to_trim)
RTRIM(STRING a, STRING chars_to_trim)
{noformat}
Here is an issue reported by one of our users:
{noformat}
[localhost:21050] default> select rtrim('价格,', ',');
+-----------------------+
| rtrim('价格,', ',') |
+-----------------------+
| 价�                   |
+-----------------------+
{noformat}
The result is the same with utf8_mode=true.

Note that the comma used in the strings above is the Chinese (fullwidth) punctuation mark ',', not the English (ASCII) comma ','.

The cause is that the bytes of the Chinese character ',' are treated as a set of bytes to trim, rather than as a single character. The UTF-8 encodings of the characters involved:
* '价': 0xe4 0xbb 0xb7
* '格': 0xe6 0xa0 0xbc
* ',': 0xef 0xbc 0x8c

Each character is encoded as 3 bytes. The last byte of '格' is 0xbc, which also appears among the bytes of ',', so it is removed as well. The result is '价' followed by the first two bytes of '格'. Those two trailing bytes form a malformed UTF-8 sequence, so they are displayed as '�'.
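
The byte-level behavior described above can be reproduced outside Impala. The following is a minimal Python sketch, not Impala's implementation: the helper names rtrim_bytes and rtrim_chars are hypothetical, and it relies on Python's bytes.rstrip() treating its argument as a set of bytes, which mirrors the reported behavior. The second helper shows the character-wise result a UTF-8 aware trim would be expected to give.

```python
FULLWIDTH_COMMA = "\uff0c"  # Chinese comma, UTF-8 bytes: ef bc 8c
TEXT = "\u4ef7\u683c" + FULLWIDTH_COMMA  # '价格' followed by the fullwidth comma


def rtrim_bytes(s: str, chars: str) -> bytes:
    """Hypothetical byte-wise trim: the UTF-8 bytes of `chars` form a byte set."""
    return s.encode("utf-8").rstrip(chars.encode("utf-8"))


def rtrim_chars(s: str, chars: str) -> str:
    """Hypothetical character-wise trim: `chars` is a set of code points."""
    return s.rstrip(chars)


raw = rtrim_bytes(TEXT, FULLWIDTH_COMMA)
# The trailing 0xbc of '格' is stripped too, leaving a truncated sequence.
print(raw.hex(" "))  # e4 bb b7 e6 a0
# Decoding with errors="replace" substitutes U+FFFD for the malformed tail.
print(raw.decode("utf-8", errors="replace"))
# A character-aware trim removes only the fullwidth comma.
print(rtrim_chars(TEXT, FULLWIDTH_COMMA))
```

The byte-wise variant removes four bytes (0x8c, 0xbc, 0xef, and then the 0xbc that belongs to '格'), while the character-wise variant removes exactly one code point.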