[jira] [Commented] (ARROW-785) possible issue on writing parquet via pyarrow, subsequently read in Hive

2019-11-18 Thread albertoramon (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976721#comment-16976721
 ] 

albertoramon commented on ARROW-785:


I saw this (SparkSQL 2.4.4, PyArrow 0.15).

The problem is creating the table with INT columns (BIGINT works properly).

Solution: changing INT to BIGINT in the CREATE TABLE works fine (I tried DOUBLE, but that didn't work).

In my case, these Parquet files come from the SSB benchmark:
{code:sql}
SELECT MAX(LO_CUSTKEY), MAX(LO_PARTKEY), MAX(LO_SUPPKEY)
FROM SSB.LINEORDER;
Returns: 2 20 2000
{code}

In my column_types I had the following, so I need to review my Python code :) :
{code:java}
'lo_custkey': 'int64',
'lo_partkey': 'int64',
'lo_suppkey': 'int64',
{code}
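
For reference, a quick way to confirm what the files actually contain is to read the Parquet schema back with pyarrow; a minimal sketch (the file path is a placeholder, and this assumes a reasonably recent pyarrow rather than exactly the 0.15 used above):
{code}
import pyarrow.parquet as pq

# Placeholder path to one of the SSB LINEORDER Parquet files
schema = pq.read_schema('lineorder/part-0.parquet')
print(schema)

# Columns written from pandas 'int64' dtypes come back as Arrow int64,
# i.e. Parquet INT64 -- which is why the Hive/Spark table needs BIGINT
# rather than INT for these columns.
{code}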

> possible issue on writing parquet via pyarrow, subsequently read in Hive
> 
>
> Key: ARROW-785
> URL: https://issues.apache.org/jira/browse/ARROW-785
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Jeff Reback
>Assignee: Wes McKinney
>Priority: Minor
> Fix For: 0.5.0
>
>
> details here: 
> http://stackoverflow.com/questions/43268872/parquet-creation-conversion-from-pandas-dataframe-to-pyarrow-table-not-working-f
> This round trips in pandas->parquet->pandas just fine on released pandas 
> (0.19.2) and pyarrow (0.2).
> OP states that it is not readable in Hive, however.





[jira] [Commented] (ARROW-785) possible issue on writing parquet via pyarrow, subsequently read in Hive

2017-04-24 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981803#comment-15981803
 ] 

Wes McKinney commented on ARROW-785:


I can't reproduce this; I removed the 0.3 fix version until we get a reliable reproduction.



[jira] [Commented] (ARROW-785) possible issue on writing parquet via pyarrow, subsequently read in Hive

2017-04-13 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967700#comment-15967700
 ] 

Phillip Cloud commented on ARROW-785:
-

I'm not sure whether there's still an issue here. When using Drill, if I cast the {{WORD}} column to {{varchar}}, the data looks fine. When left as {{binary}}, the values are unintelligible:

{code}
0: jdbc:drill:zk=local> select `YEAR`, cast(`WORD` as varchar) as `WORD` from 
dfs.`/home/phillip/code/cpp/arrow/python/arrow_parquet.parquet`;
+---+-+
| YEAR  |  WORD   |
+---+-+
| 2017  | Word 1  |
| 2018  | Word 2  |
+---+-+
{code}



[jira] [Commented] (ARROW-785) possible issue on writing parquet via pyarrow, subsequently read in Hive

2017-04-13 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967675#comment-15967675
 ] 

Phillip Cloud commented on ARROW-785:
-

I'm able to run this using {{beeline}}, declaring the {{word}} column as either the {{binary}} or {{string}} type in Hive:

{code}
ubuntu@impala:~$ beeline --silent=true --showHeader=false -u 
jdbc:hive2://localhost:1/default -n ubuntu   
0: jdbc:hive2://localhost:1/default> create external table t (year bigint, 
word string) stored as parquet location '/user/hive/warehouse/arrow';
0: jdbc:hive2://localhost:1/default> select * from t;
+-+-+--+
| 2017| Word 1  |
| 2018| Word 2  |
+-+-+--+
0: jdbc:hive2://localhost:1/default> create external table t2 (year bigint, 
word binary) stored as parquet location '/user/hive/warehouse/arrow';
0: jdbc:hive2://localhost:1/default> select * from t2;
+--+--+--+
| 2017 | Word 1   |
| 2018 | Word 2   |
+--+--+--+
{code}



[jira] [Commented] (ARROW-785) possible issue on writing parquet via pyarrow, subsequently read in Hive

2017-04-12 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967059#comment-15967059
 ] 

Wes McKinney commented on ARROW-785:


If I convert the strings to UTF8, then the problem goes away:

{code}
df['WORD'] = df['WORD'].str.decode('utf8')
{code}

Then in parquet-mr (and in Spark) the values come out correctly:

{code}
java -jar target/parquet-tools-1.9.0.jar test2.parq 
YEAR = 2017
WORD = Word 1

YEAR = 2018
WORD = Word 2
{code}
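
The same fix can also be expressed on the pyarrow side by giving the column an explicit string (UTF8-annotated) type before writing. A minimal sketch, assuming a modern pyarrow/pandas API rather than the pyarrow 0.2 discussed in this thread:
{code}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'YEAR': [2017, 2018], 'WORD': [b'Word 1', b'Word 2']})

# Decode the bytes to text (equivalent to .str.decode('utf8') above), then
# declare WORD as a UTF8 string so Parquet writes an annotated BYTE_ARRAY.
df['WORD'] = df['WORD'].str.decode('utf8')
schema = pa.schema([('YEAR', pa.int64()), ('WORD', pa.string())])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pq.write_table(table, 'test_utf8.parq')
{code}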



[jira] [Commented] (ARROW-785) possible issue on writing parquet via pyarrow, subsequently read in Hive

2017-04-12 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967053#comment-15967053
 ] 

Wes McKinney commented on ARROW-785:


Making this not a 0.3 blocker. [~cpcloud], if you have any ideas how to fix this, let me know.



[jira] [Commented] (ARROW-785) possible issue on writing parquet via pyarrow, subsequently read in Hive

2017-04-12 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967051#comment-15967051
 ] 

Wes McKinney commented on ARROW-785:


Spoke too soon:

{code}
$ java -jar target/parquet-tools-1.9.0.jar dump ../../arrow/python/test.parq 
row group 0 

YEAR:  INT64 SNAPPY DO:4 FPO:36 SZ:84/80/0.95 VC:2 ENC:RLE,PLAIN_DICTI [more]...
WORD:  BINARY SNAPPY DO:148 FPO:184 SZ:84/80/0.95 VC:2 ENC:RLE,PLAIN_D [more]...

YEAR TV=2 RL=0 DL=1 DS: 2 DE:PLAIN_DICTIONARY

page 0:  DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY ST:[ 
[more]... VC:2

WORD TV=2 RL=0 DL=1 DS: 2 DE:PLAIN_DICTIONARY

page 0:  DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY ST:[ 
[more]... VC:2

INT64 YEAR 

*** row group 1 of 1, values 1 to 2 *** 
value 1: R:0 D:1 V:2017
value 2: R:0 D:1 V:2018

BINARY WORD 

*** row group 1 of 1, values 1 to 2 *** 
value 1: R:0 D:1 V:Word 1
value 2: R:0 D:1 V:Word 2
{code}

In Spark 2.2.x I have:

{code}
sqlContext.read.parquet('/home/wesm/code/arrow/python/test.parq').toPandas()

   YEAR                         WORD
0  2017  [87, 111, 114, 100, 32, 49]
1  2018  [87, 111, 114, 100, 32, 50]
{code}

There's some Spark setting to treat binary as strings. If you look up the ASCII codes for the integers in the Spark output, the values look right. I'm not sure what incantation Hive needs to work properly, though.
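
The setting in question is most likely {{spark.sql.parquet.binaryAsString}}, which tells Spark SQL to interpret un-annotated Parquet BYTE_ARRAY columns as strings. A hedged PySpark sketch (SparkSession-style API, which postdates the sqlContext call above; the path is the one from this comment):
{code}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName('arrow-785-binary-as-string')  # placeholder app name
    # Interpret un-annotated Parquet BYTE_ARRAY columns as strings.
    .config('spark.sql.parquet.binaryAsString', 'true')
    .getOrCreate()
)

df = spark.read.parquet('/home/wesm/code/arrow/python/test.parq')
df.show()  # WORD should now display as 'Word 1' / 'Word 2', not byte arrays
{code}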



[jira] [Commented] (ARROW-785) possible issue on writing parquet via pyarrow, subsequently read in Hive

2017-04-12 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967036#comment-15967036
 ] 

Wes McKinney commented on ARROW-785:


I tried to reproduce this issue with Impala on the Arrow and Parquet master branches. I put the file in a temporary directory, then ran:

{code}
CREATE EXTERNAL TABLE __ibis_tmp.`__ibis_tmp_57ccb655a5b1425fbc99ea30054c6c60`
LIKE PARQUET '/tmp/test-parquet-binary/0.parq'
STORED AS PARQUET
LOCATION '/tmp/test-parquet-binary'
{code}

The resulting table, with schema inferred from the Parquet file, is:

{code}
describe __ibis_tmp.`__ibis_tmp_57ccb655a5b1425fbc99ea30054c6c60`
Out[30]:
[('year', 'bigint', 'Inferred from Parquet file.'),
 ('word', 'string', 'Inferred from Parquet file.')]
{code}

{{string}} in Impala is a plain BYTE_ARRAY (a.k.a. Binary). The Arrow table was:

{code}
pyarrow.Table
YEAR: int64
WORD: binary
{code}

However, {{parquet-tools cat}} from parquet-mr 1.9.0 gives:

{code}
$ java -jar target/parquet-tools-1.9.0.jar cat test.parq 
YEAR = 2017
WORD = V29yZCAx

YEAR = 2018
WORD = V29yZCAy
{code}

This suggests there's something wrong with the file metadata, although Impala is able to read the file OK. I'm looking more closely into it.
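
For what it's worth, those WORD values appear to be the base64 encoding of the raw binary payload (presumably parquet-tools falls back to base64 for BYTE_ARRAY values that carry no UTF8 annotation), which is easy to check in Python:
{code}
import base64

# The values shown by parquet-tools cat decode back to the original strings.
print(base64.b64decode('V29yZCAx'))  # b'Word 1'
print(base64.b64decode('V29yZCAy'))  # b'Word 2'
{code}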



[jira] [Commented] (ARROW-785) possible issue on writing parquet via pyarrow, subsequently read in Hive

2017-04-07 Thread Ashima Sood (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961226#comment-15961226
 ] 

Ashima Sood commented on ARROW-785:
---

I see that after converting the pandas DataFrame to a Table, the data types change as below:
DataFrame datatype:
WORD : object

Table datatype:
WORD: binary

Is there a way to convert to string instead?



[jira] [Commented] (ARROW-785) possible issue on writing parquet via pyarrow, subsequently read in Hive

2017-04-07 Thread Ashima Sood (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961132#comment-15961132
 ] 

Ashima Sood commented on ARROW-785:
---

I'm using Zeppelin to create a table on top of the Parquet file using the command below:
%sql
CREATE EXTERNAL TABLE IF NOT EXISTS schema_abc.parquet_table_name(
  YEAR INT
, WORD STRING
)
STORED AS PARQUET
LOCATION 's3://bucket_name/folder/parquet_files/'

***Please note: the parquet_files folder contains the testFile.parquet file.

Describing the table:

%spark.sql
describe table schema_abc.parquet_table_name

Gives:
col_name    data_type    comment
YEAR        int          null
WORD        string       null


But when I run a select query to read the table, it gives me the error below.

Running the query:

%spark.sql
select * from schema_abc.parquet_table_name

{code}
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
    at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
    at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{code}


Since I was getting the error above, I wanted to see whether the Parquet file really contains the data, so I used Apache Drill to view it, which outputs the following:


{code}
user@server:parth_to/parquet-drill/apache-drill-1.10.0$ bin/drill-embedded
Apr 07, 2017 1:04:51 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.10.0
"drill baby drill"
0: jdbc:drill:zk=local> select * from 
dfs.`/path_to/parquet-drill/apache-drill-1.10.0/sample-data/testFile.parquet`;
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
+-------+--------------+
| YEAR  | WORD         |
+-------+--------------+
| 2017  | null         |
| 2018  | [B@5bd466f2  |
+-------+--------------+
2 rows selected (1.433 seconds)
{code}


Input text file:
{code}
YEAR|WORD
2017|
2018|Word 2
{code}
(I've put a null to test whether that works.)
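
The {{decodeToInt}} failure above is consistent with the table declaring YEAR as INT while the Parquet file written by pyarrow stores INT64, which is the same mismatch albertoramon describes in the 2019 comment at the top of this thread (BIGINT works, INT does not). A hedged sketch of that workaround via PySpark, reusing the placeholder schema/table/bucket names from this comment:
{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Declare the 64-bit column as BIGINT so it matches the INT64 physical type
# written by pyarrow; a 32-bit INT column is what leads Spark into
# Dictionary.decodeToInt on a long dictionary.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS schema_abc.parquet_table_name (
        YEAR BIGINT,
        WORD STRING
    )
    STORED AS PARQUET
    LOCATION 's3://bucket_name/folder/parquet_files/'
""")

spark.sql("SELECT * FROM schema_abc.parquet_table_name").show()
{code}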




[jira] [Commented] (ARROW-785) possible issue on writing parquet via pyarrow, subsequently read in Hive

2017-04-07 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960984#comment-15960984
 ] 

Wes McKinney commented on ARROW-785:


I will take a look at the file in Impala (or Hive, if I can figure out how to do that) to see if I can reproduce it.
