Quanlong Huang created IMPALA-10319:
---------------------------------------

Summary: Support arbitrary encodings on text files
Key: IMPALA-10319
URL: https://issues.apache.org/jira/browse/IMPALA-10319
Project: IMPALA
Issue Type: New Feature
Reporter: Quanlong Huang
Attachments: gbk_names.txt

ORC/Parquet/Avro files store strings as UTF-8 encoded bytes. Text files, however, can be in arbitrary encodings. Hive supports specifying an arbitrary encoding for text tables through the "serialization.encoding" table property (HIVE-7142). Impala is currently not aware of this table property and treats all strings as byte arrays. It would be good to support at least reading from these text files.

*Example*

Create a text table in Hive using GBK encoding and load a GBK-encoded text file into it:
{code:sql}
hive> create table gbk_names (name string) stored as textfile tblproperties("serialization.encoding"="GBK");
hive> load data local inpath '/home/quanlong/workspace/Impala/gbk_names.txt' into table gbk_names;
hive> select * from gbk_names;
+-----------------+
| gbk_names.name  |
+-----------------+
| 张三              |
| 李四              |
| 王五              |
+-----------------+
{code}

Impala reads strings as byte arrays, so it can't decode them correctly:
{code:sql}
impala-shell> invalidate metadata gbk_names;
impala-shell> select * from gbk_names;
+------+
| name |
+------+
| ���� |
| ���� |
| ���� |
+------+
{code}
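
The mismatch described in this issue can be reproduced outside of Hive and Impala. The following is a minimal Python sketch, not Impala code: it encodes the three names from the example as GBK, then decodes the bytes once as UTF-8 and once as GBK. The helper name `decode_field` is hypothetical and only stands in for whatever decoding step a reader would apply based on "serialization.encoding".

```python
# Names taken from the Hive query output in the example.
names = ["张三", "李四", "王五"]


def decode_field(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw field bytes (hypothetical helper, not an Impala API).

    Byte sequences that are invalid in the given encoding are replaced
    with U+FFFD instead of raising an error.
    """
    return raw.decode(encoding, errors="replace")


for name in names:
    raw = name.encode("gbk")            # bytes as stored in a GBK text file
    garbled = decode_field(raw)         # bytes interpreted as UTF-8
    correct = decode_field(raw, "gbk")  # bytes interpreted with the declared encoding
    print(f"{raw.hex()}  as utf-8: {garbled!r}  as gbk: {correct!r}")
```

Interpreting the bytes as UTF-8 yields replacement characters, while decoding with the encoding declared in the table property recovers the original names.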