[jira] [Created] (IMPALA-10319) Support arbitrary encodings on text files

Quanlong Huang (Jira) Wed, 11 Nov 2020 18:32:09 -0800

Quanlong Huang created IMPALA-10319:
---------------------------------------


             Summary: Support arbitrary encodings on text files
                 Key: IMPALA-10319
                 URL: https://issues.apache.org/jira/browse/IMPALA-10319
             Project: IMPALA
          Issue Type: New Feature
            Reporter: Quanlong Huang
         Attachments: gbk_names.txt

ORC/Parquet/Avro files store strings as UTF-8 encoded bytes. However, text
files can be in arbitrary encodings. Hive supports specifying an arbitrary
encoding on text tables via the "serialization.encoding" table property
(HIVE-7142). Impala is currently not aware of this table property and treats
all strings as byte arrays. It would be good to support at least reading from
these text files.
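
To illustrate the requested read path, here is a minimal sketch in Python. It is not Impala or Hive code: the helper name `decode_text_rows` and the plain dict standing in for table properties are assumptions made for illustration, as is the fallback to UTF-8 when the property is absent.

```python
def decode_text_rows(raw, tbl_properties):
    """Decode newline-delimited rows using the table's declared encoding.

    Hypothetical helper: `tbl_properties` is a plain dict standing in for
    the table properties. Falling back to UTF-8 is an assumption.
    """
    encoding = tbl_properties.get("serialization.encoding", "UTF-8")
    # Decode the whole buffer first, then split into rows, so multi-byte
    # characters are never cut at a row boundary.
    return raw.decode(encoding).splitlines()


# Bytes as they would sit in a GBK-encoded text file.
raw = "张三\n李四\n王五\n".encode("gbk")
rows = decode_text_rows(raw, {"serialization.encoding": "GBK"})
print(rows)
```

The point of the sketch is only that the declared encoding has to be consulted before the bytes are interpreted as strings.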

*Example*

Create a text table in Hive using GBK encoding and load a GBK encoded text file 
into it: 
{code:sql}
hive> create table gbk_names (name string) stored as textfile 
tblproperties("serialization.encoding"="GBK");
hive> load data local inpath '/home/quanlong/workspace/Impala/gbk_names.txt' 
into table gbk_names;
hive> select * from gbk_names;
+-----------------+
| gbk_names.name  |
+-----------------+
| 张三              |
| 李四              |
| 王五              |
+-----------------+
{code}
Impala reads strings as byte arrays, so it can't decode them correctly:
{code:sql}
impala-shell> invalidate metadata gbk_names;
impala-shell> select * from gbk_names;
+------+
| name |
+------+
| ���� |
| ���� |
| ���� |
+------+
{code}
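
The garbled output can be reproduced outside of Impala. The following Python sketch is an illustration only, assuming the client renders the raw bytes as UTF-8 and substitutes U+FFFD for invalid sequences. It encodes the first name as GBK and then decodes the same bytes both ways.

```python
name = "张三"

# GBK stores each of these characters as two bytes.
raw = name.encode("gbk")

# Interpreting the GBK bytes as UTF-8 fails; with errors="replace" every
# invalid byte becomes U+FFFD, the replacement character.
garbled = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the file was written in recovers the text.
recovered = raw.decode("gbk")

print(len(raw), repr(garbled), recovered)
```

Here none of the four GBK bytes starts a valid UTF-8 sequence, so each one becomes its own replacement character, which is consistent with the four-character rows in the impala-shell output above.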


