[ https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967036#comment-15967036 ]
Wes McKinney edited comment on ARROW-785 at 4/13/17 2:52 AM:
-------------------------------------------------------------

I tried to reproduce this issue with Impala on Arrow / Parquet master branches. I put the file in a temporary directory, then ran

{code}
CREATE EXTERNAL TABLE __ibis_tmp.`__ibis_tmp_57ccb655a5b1425fbc99ea30054c6c60`
LIKE PARQUET '/tmp/test-parquet-binary/0.parq'
STORED AS PARQUET
LOCATION '/tmp/test-parquet-binary'
{code}

The resulting table, with schema inferred from the Parquet file, is:

{code}
describe __ibis_tmp.`__ibis_tmp_57ccb655a5b1425fbc99ea30054c6c60`
Out[30]:
[('year', 'bigint', 'Inferred from Parquet file.'),
 ('word', 'string', 'Inferred from Parquet file.')]
{code}

and

{code}
   year    word
0  2017  Word 1
1  2018  Word 2
{code}

A string in Impala is a plain BYTE_ARRAY, aka Binary. The Arrow table was

{code}
pyarrow.Table
YEAR: int64
WORD: binary
{code}

However, parquet-tools cat from parquet-mr 1.9.0 gives:

{code}
$ java -jar target/parquet-tools-1.9.0.jar cat test.parq
YEAR = 2017
WORD = V29yZCAx

YEAR = 2018
WORD = V29yZCAy
{code}

This suggests there's something wrong with the file metadata, since Impala is able to read the file OK. I'm looking more closely into it.

> possible issue on writing parquet via pyarrow, subsequently read in Hive
> ------------------------------------------------------------------------
>
>                 Key: ARROW-785
>                 URL: https://issues.apache.org/jira/browse/ARROW-785
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Jeff Reback
>            Priority: Minor
>             Fix For: 0.3.0
>
>
> details here:
> http://stackoverflow.com/questions/43268872/parquet-creation-conversion-from-pandas-dataframe-to-pyarrow-table-not-working-f
> This round trips in pandas->parquet->pandas just fine on released pandas
> (0.19.2) and pyarrow (0.2).
> OP states that it is not readable in Hive, however.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
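
The WORD values printed by parquet-tools in the comment look like Base64 text. As a quick check, the following minimal Python sketch (standard library only) decodes them so they can be compared with the strings Impala returned. The variable names are illustrative and not part of the original report.

```python
import base64

# WORD values copied verbatim from the parquet-tools output in the comment.
encoded_words = ["V29yZCAx", "V29yZCAy"]

# Decode each value from Base64 into raw bytes.
decoded_words = [base64.b64decode(value) for value in encoded_words]

for encoded, decoded in zip(encoded_words, decoded_words):
    print(encoded, "->", decoded)
```

If the decoded bytes match the words Impala shows, the stored data itself is intact and only its presentation differs, which would be consistent with the column being treated as raw binary rather than as a string.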