[jira] [Comment Edited] (ARROW-785) possible issue on writing parquet via pyarrow, subsequently read in Hive

2019-11-18 Thread albertoramon (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976721#comment-16976721
 ] 

albertoramon edited comment on ARROW-785 at 11/18/19 5:32 PM:
--

I saw this (SparkSQL 2.4.4, PyArrow 0.15).

The problem is creating the table with INT columns (BIGINT works properly).

Solution: changing INT to BIGINT in the CREATE TABLE works fine (I tried DOUBLE, but that didn't work).

 

In my case, these Parquet files are from the SSB benchmark:
{code:java}
SELECT MAX(LO_CUSTKEY), MAX(LO_PARTKEY), MAX(LO_SUPPKEY)
FROM SSB.LINEORDER;
Returns: 2 20 2000
{code}
In my Column_Types I had (thus I need to review my Python code :)):
{code:java}
'lo_custkey': 'int64',
'lo_partkey': 'int64',
'lo_suppkey': 'int64',
{code}
 

 

 

 



> possible issue on writing parquet via pyarrow, subsequently read in Hive
> 
>
> Key: ARROW-785
> URL: https://issues.apache.org/jira/browse/ARROW-785
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Jeff Reback
>Assignee: Wes McKinney
>Priority: Minor
> Fix For: 0.5.0
>
>
> details here: 
> http://stackoverflow.com/questions/43268872/parquet-creation-conversion-from-pandas-dataframe-to-pyarrow-table-not-working-f
> This round trips in pandas->parquet->pandas just fine on released pandas 
> (0.19.2) and pyarrow (0.2).
> OP states that it is not readable in Hive, however.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-785) possible issue on writing parquet via pyarrow, subsequently read in Hive

2017-04-12 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967036#comment-15967036
 ] 

Wes McKinney edited comment on ARROW-785 at 4/13/17 2:52 AM:
-

I tried to reproduce this issue with Impala on the Arrow / Parquet master branches. I put the file in a temporary directory, then ran:

{code}
CREATE EXTERNAL TABLE __ibis_tmp.`__ibis_tmp_57ccb655a5b1425fbc99ea30054c6c60`
LIKE PARQUET '/tmp/test-parquet-binary/0.parq'
STORED AS PARQUET
LOCATION '/tmp/test-parquet-binary'
{code}

The resulting table, with schema inferred from the Parquet file, is:

{code}
describe __ibis_tmp.`__ibis_tmp_57ccb655a5b1425fbc99ea30054c6c60`
Out[30]:
[('year', 'bigint', 'Inferred from Parquet file.'),
 ('word', 'string', 'Inferred from Parquet file.')]
{code}

and selecting from it returns:

{code}
   year    word
0  2017  Word 1
1  2018  Word 2
{code}

STRING in Impala is a plain BYTE_ARRAY (a.k.a. Binary). The Arrow table was:

{code}
pyarrow.Table
YEAR: int64
WORD: binary
{code}

However, parquet-tools cat from parquet-mr 1.9.0 gives:

{code}
$ java -jar target/parquet-tools-1.9.0.jar cat test.parq 
YEAR = 2017
WORD = V29yZCAx

YEAR = 2018
WORD = V29yZCAy
{code}

This suggests there's something wrong with the file metadata, though Impala is able 
to read the file OK. I'm looking more closely into it.


> possible issue on writing parquet via pyarrow, subsequently read in Hive
> 
>
> Key: ARROW-785
> URL: https://issues.apache.org/jira/browse/ARROW-785
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Jeff Reback
>Priority: Minor
> Fix For: 0.3.0
>
>
> details here: 
> http://stackoverflow.com/questions/43268872/parquet-creation-conversion-from-pandas-dataframe-to-pyarrow-table-not-working-f
> This round trips in pandas->parquet->pandas just fine on released pandas 
> (0.19.2) and pyarrow (0.2).
> OP states that it is not readable in Hive, however.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)