[ 
https://issues.apache.org/jira/browse/IMPALA-12927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829614#comment-17829614
 ] 

Csaba Ringhofer commented on IMPALA-12927:
------------------------------------------

[~Eyizoha]  About AuxColumnType: FYI, there is an ongoing refactor to remove 
that class and make it easier to decide whether a column is STRING or BINARY: 
[https://gerrit.cloudera.org/#/c/21157/]

About the encoding of BINARY columns: I looked at the Hive code, but it does 
not match the encoding I see in the files.

[https://github.com/apache/hive/blob/9a0ce4e15890aa91f05322e845438e1e8830b1c3/serde/src/java/org/apache/hadoop/hive/serde2/JsonSerDe.java#L135]

Current Apache Hive seems to default to base64 encoding, which can be changed 
with the table property "json.binary.format". In the JSON tables in Impala's 
dataload the files are certainly not base64 encoded and "json.binary.format" is 
also not set, so they were not written the way the current Hive codebase would 
write them. This may be caused by differences between Apache Impala's Hive 
dependency and current Apache Hive.
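
To make the mismatch concrete, here is a minimal Python sketch of how row 1 of binary_tbl would look with and without base64. The helper encode_row is hypothetical and only mimics the two formats; it assumes that a base64-writing Hive would emit standard padded base64, which I have not verified against Hive's output.

```python
import base64
import json


def encode_row(row_id, string_col, binary_value, use_base64):
    """Serialize one row as a JSON line (hypothetical helper, not Hive code)."""
    if use_base64:
        # Assumption: standard base64 alphabet with '=' padding.
        binary_text = base64.b64encode(binary_value).decode("ascii")
    else:
        # Raw mode: the bytes are written as text, which only works for valid UTF-8.
        binary_text = binary_value.decode("utf-8")
    return json.dumps(
        {"id": row_id, "string_col": string_col, "binary_col": binary_text},
        separators=(",", ":"),
        ensure_ascii=False,
    )


raw_line = encode_row(1, "ascii", b"binary1", use_base64=False)
b64_line = encode_row(1, "ascii", b"binary1", use_base64=True)
print(raw_line)  # same shape as the first row of the file in the issue description
print(b64_line)  # what a base64-encoded file would contain instead
```

The raw variant reproduces the row found in the dataload file, while the base64 variant would store "YmluYXJ5MQ==" in binary_col.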

Currently Impala base64-decodes BINARY columns:

{code}

Hive:

create table tjsonbinary (s string, b binary) stored as JSONFILE;

insert into tjsonbinary values ("abcd", base64(cast("abcd" as binary)));

Impala:

select * from tjsonbinary;

+------+------+
| s    | b    |
+------+------+
| abcd | abcd |
+------+------+

{code}

What do you think about disabling BINARY column reading in JSON until Hive 
compatibility is clarified? My concern is that, besides error messages and 
NULL values, this may lead to correctness issues: many strings are both valid 
UTF-8 strings and valid base64 strings, so Impala may silently return 
unintended results.
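
The ambiguity can be sketched in a few lines of Python. This uses Python's own base64 module with strict validation, not Impala's decoder, so details such as padding or whitespace handling may differ; try_base64 is a hypothetical helper for illustration only.

```python
import base64
import binascii


def try_base64(text):
    """Return the decoded bytes, or None if text is not strict base64."""
    try:
        return base64.b64decode(text.encode("utf-8"), validate=True)
    except (binascii.Error, ValueError):
        return None


# Values taken from the examples in this issue.
for value in ["abcd", "YWJjZA==", "binary1", "", "árvíztűrőtükörfúró"]:
    print(repr(value), "->", try_base64(value))
```

"abcd" is plain text that also happens to be valid base64, so a base64-decoding reader returns three unrelated bytes instead of the original value and reports no error. "binary1" and the non-ASCII string fail to decode, and the empty string decodes to an empty value, which is consistent with the NULL and empty results in the issue description.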

> Support reading BINARY columns in JSON tables
> ---------------------------------------------
>
>                 Key: IMPALA-12927
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12927
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Assignee: Zihao Ye
>            Priority: Major
>
> Currently Impala cannot read BINARY columns in JSON files written by Hive 
> correctly and returns runtime errors:
> {code}
> select * from functional_json.binary_tbl;
> +----+--------------+------------+
> | id | string_col   | binary_col |
> +----+--------------+------------+
> | 1  | ascii        | NULL       |
> | 2  | ascii        | NULL       |
> | 3  | null         | NULL       |
> | 4  | empty        |            |
> | 5  | valid utf8   | NULL       |
> | 6  | valid utf8   | NULL       |
> | 7  | invalid utf8 | NULL       |
> | 8  | invalid utf8 | NULL       |
> +----+--------------+------------+
> WARNINGS: Error converting column: functional_json.binary_tbl.binary_col, 
> type: STRING, data: 'binary1'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: 'binary2'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: 'árvíztűrőtükörfúró'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: '你好hello'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: '��'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: '�D3"'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before 
> offset: 481
> {code}
> The single file in the table looks like this:
> {code}
>  hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0
> {"id":1,"string_col":"ascii","binary_col":"binary1"}
> {"id":2,"string_col":"ascii","binary_col":"binary2"}
> {"id":3,"string_col":"null","binary_col":null}
> {"id":4,"string_col":"empty","binary_col":""}
> {"id":5,"string_col":"valid utf8","binary_col":"árvíztűrőtükörfúró"}
> {"id":6,"string_col":"valid utf8","binary_col":"你好hello"}
> {"id":7,"string_col":"invalid utf8","binary_col":"\u0000�\u0000�"}
> {"id":8,"string_col":"invalid utf8","binary_col":"�D3\"\u0011\u0000"}
> {code}
>  



