[jira] [Commented] (ARROW-785) possible issue on writing parquet via pyarrow, subsequently read in Hive
[ https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976721#comment-16976721 ] albertoramon commented on ARROW-785: I saw this (SparkSQL 2.4.4, PyArrow 0.15). The problem is creating the table with INT columns (BIGINT works properly). Solution: changing INT to BIGINT in the CREATE TABLE works fine (I tried DOUBLE, but it didn't work). In my case these Parquet files are from the SSB benchmark:
{code:java}
SELECT MAX(LO_CUSTKEY), MAX(LO_PARTKEY), MAX(LO_SUPPKEY) FROM SSB.LINEORDER;
Returns: 2  20  2000
{code}
In my column types I had the following, so I need to review my Python code :) :
{code:java}
'lo_custkey': 'int64',
'lo_partkey': 'int64',
'lo_suppkey': 'int64',
{code}

> possible issue on writing parquet via pyarrow, subsequently read in Hive
>
> Key: ARROW-785
> URL: https://issues.apache.org/jira/browse/ARROW-785
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Jeff Reback
> Assignee: Wes McKinney
> Priority: Minor
> Fix For: 0.5.0
>
> details here:
> http://stackoverflow.com/questions/43268872/parquet-creation-conversion-from-pandas-dataframe-to-pyarrow-table-not-working-f
> This round trips in pandas->parquet->pandas just fine on released pandas (0.19.2) and pyarrow (0.2).
> OP states that it is not readable in Hive, however.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
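[Editorial note] The INT vs BIGINT behavior above comes down to physical integer width rather than the stored values. A minimal plain-Python sketch (not pyarrow API) of why the SSB maxima are misleading here:

```python
# Hive's INT is a 32-bit signed integer; BIGINT is 64-bit. pandas' default
# 'int64' dtype is written to Parquet as a 64-bit physical type, which maps
# to Hive BIGINT.
INT_MAX = 2**31 - 1  # largest value Hive INT can hold

# The SSB key maxima all fit comfortably in INT as *values*...
ssb_maxima = [2, 20, 2000]  # MAX(LO_CUSTKEY), MAX(LO_PARTKEY), MAX(LO_SUPPKEY)
print(all(v <= INT_MAX for v in ssb_maxima))  # True

# ...but the columns were declared 'int64' on the Python side, so the DDL
# must say BIGINT regardless of how small the stored values happen to be.
```

This is why changing the DDL to BIGINT fixes the read even though no value overflows INT.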
[ https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981803#comment-15981803 ] Wes McKinney commented on ARROW-785: I can't repro this; took off the 0.3 fix version until we get a reliable reproduction.
[ https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967700#comment-15967700 ] Phillip Cloud commented on ARROW-785: I'm not sure if there's still a possible issue here. When using Drill, if I cast the {{WORD}} column to {{varchar}}, then the data look fine. When left as {{binary}}, the values are unintelligible:
{code}
0: jdbc:drill:zk=local> select `YEAR`, cast(`WORD` as varchar) as `WORD` from dfs.`/home/phillip/code/cpp/arrow/python/arrow_parquet.parquet`;
+-------+---------+
| YEAR  |  WORD   |
+-------+---------+
| 2017  | Word 1  |
| 2018  | Word 2  |
+-------+---------+
{code}
[ https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967675#comment-15967675 ] Phillip Cloud commented on ARROW-785: I'm able to run this using {{beeline}}, declaring the {{word}} column as either {{binary}} or {{string}} type in Hive:
{code}
ubuntu@impala:~$ beeline --silent=true --showHeader=false -u jdbc:hive2://localhost:1/default -n ubuntu
0: jdbc:hive2://localhost:1/default> create external table t (year bigint, word string) stored as parquet location '/user/hive/warehouse/arrow';
0: jdbc:hive2://localhost:1/default> select * from t;
+-------+---------+--+
| 2017  | Word 1  |
| 2018  | Word 2  |
+-------+---------+--+
0: jdbc:hive2://localhost:1/default> create external table t2 (year bigint, word binary) stored as parquet location '/user/hive/warehouse/arrow';
0: jdbc:hive2://localhost:1/default> select * from t2;
+-------+---------+--+
| 2017  | Word 1  |
| 2018  | Word 2  |
+-------+---------+--+
{code}
[ https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967059#comment-15967059 ] Wes McKinney commented on ARROW-785: If I convert the strings to UTF8, then the problem goes away:
{code}
df['WORD'] = df['WORD'].str.decode('utf8')
{code}
then in parquet-mr and Spark:
{code}
java -jar target/parquet-tools-1.9.0.jar test2.parq
YEAR = 2017
WORD = Word 1

YEAR = 2018
WORD = Word 2
{code}
[ https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967053#comment-15967053 ] Wes McKinney commented on ARROW-785: Making this not a 0.3 blocker. [~cpcloud], if you have any ideas how to fix this, let me know.
[ https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967051#comment-15967051 ] Wes McKinney commented on ARROW-785: Spoke too soon:
{code}
$ java -jar target/parquet-tools-1.9.0.jar dump ../../arrow/python/test.parq
row group 0
YEAR: INT64 SNAPPY DO:4 FPO:36 SZ:84/80/0.95 VC:2 ENC:RLE,PLAIN_DICTI [more]...
WORD: BINARY SNAPPY DO:148 FPO:184 SZ:84/80/0.95 VC:2 ENC:RLE,PLAIN_D [more]...

YEAR TV=2 RL=0 DL=1 DS: 2 DE:PLAIN_DICTIONARY
page 0: DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY ST:[ [more]... VC:2

WORD TV=2 RL=0 DL=1 DS: 2 DE:PLAIN_DICTIONARY
page 0: DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY ST:[ [more]... VC:2

INT64 YEAR
*** row group 1 of 1, values 1 to 2 ***
value 1: R:0 D:1 V:2017
value 2: R:0 D:1 V:2018

BINARY WORD
*** row group 1 of 1, values 1 to 2 ***
value 1: R:0 D:1 V:Word 1
value 2: R:0 D:1 V:Word 2
{code}
In Spark 2.2.x I have:
{code}
sqlContext.read.parquet('/home/wesm/code/arrow/python/test.parq').toPandas()
   YEAR                         WORD
0  2017  [87, 111, 114, 100, 32, 49]
1  2018  [87, 111, 114, 100, 32, 50]
{code}
There's some Spark setting to treat binary as strings. If you look up the ASCII codes for the integers in the Spark output, that looks right. I'm not sure what incantation Hive needs to work properly, though.
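[Editorial note] The byte-list-to-string correspondence Wes mentions can be checked directly with plain Python, independent of Spark (the setting he alludes to is likely Spark's {{spark.sql.parquet.binaryAsString}} option):

```python
# Spark shows the un-annotated BINARY column as lists of raw byte values;
# decoding them as UTF-8 (here plain ASCII) recovers the original strings.
rows = [[87, 111, 114, 100, 32, 49], [87, 111, 114, 100, 32, 50]]
words = [bytes(r).decode('utf-8') for r in rows]
print(words)  # ['Word 1', 'Word 2']
```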
[ https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967036#comment-15967036 ] Wes McKinney commented on ARROW-785: I tried to reproduce this issue with Impala on the Arrow / Parquet master branches. I put the file in a temporary directory, then ran:
{code}
CREATE EXTERNAL TABLE __ibis_tmp.`__ibis_tmp_57ccb655a5b1425fbc99ea30054c6c60`
LIKE PARQUET '/tmp/test-parquet-binary/0.parq'
STORED AS PARQUET LOCATION '/tmp/test-parquet-binary'
{code}
The resulting table, with schema inferred from the Parquet file, is:
{code}
describe __ibis_tmp.`__ibis_tmp_57ccb655a5b1425fbc99ea30054c6c60`
Out[30]:
[('year', 'bigint', 'Inferred from Parquet file.'),
 ('word', 'string', 'Inferred from Parquet file.')]
{code}
string in Impala is a plain BYTE_ARRAY, aka Binary. The Arrow table was:
{code}
pyarrow.Table
YEAR: int64
WORD: binary
{code}
However, parquet-tools cat from parquet-mr 1.9.0 gives:
{code}
$ java -jar target/parquet-tools-1.9.0.jar cat test.parq
YEAR = 2017
WORD = V29yZCAx

YEAR = 2018
WORD = V29yZCAy
{code}
This suggests there's something wrong with the file metadata, even though Impala is able to read the file OK. I'm looking more closely into it.
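[Editorial note] The {{V29yZCAx}} / {{V29yZCAy}} values are base64: parquet-tools prints BINARY columns base64-encoded when they carry no UTF8 annotation. A quick stdlib check that the underlying bytes are intact:

```python
import base64

# Decode the values parquet-tools printed; the raw bytes are the expected
# strings, so the data itself is fine -- only the logical type annotation
# (UTF8 string vs plain BINARY) differs.
print(base64.b64decode('V29yZCAx'))  # b'Word 1'
print(base64.b64decode('V29yZCAy'))  # b'Word 2'
```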
[ https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961226#comment-15961226 ] Ashima Sood commented on ARROW-785: I see that after conversion of the pandas DataFrame to a table, the data types are changed like below:
{code}
DataFrame datatype: WORD: object
Table datatype:     WORD: binary
{code}
Is there a way to convert to string instead?
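[Editorial note] On the object -> binary mapping: under Python 2, a column of {{str}} (bytes) values maps to Arrow {{binary}}, while unicode text maps to Arrow {{string}}; the {{.str.decode('utf8')}} workaround elsewhere in this thread does exactly that conversion. A minimal pure-Python sketch of the distinction (no pyarrow required):

```python
# A Python 2 `str` column holds raw bytes -> Arrow `binary`; decoding to
# unicode text is what makes pyarrow emit a UTF-8 `string` column instead.
raw = b'Word 1'              # bytes value -> would become Arrow binary
text = raw.decode('utf-8')   # text value  -> would become Arrow string
print(type(raw).__name__, type(text).__name__)  # bytes str (on Python 3)
```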
[ https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961132#comment-15961132 ] Ashima Sood commented on ARROW-785: I'm using Zeppelin to create a table on top of the parquet file using the below command:
{code}
%sql
CREATE EXTERNAL TABLE IF NOT EXISTS schema_abc.parquet_table_name(
  YEAR INT,
  WORD STRING
)
STORED AS PARQUET
LOCATION 's3://bucket_name/folder/parquet_files/'
{code}
***Please note: the parquet_files folder has the testFile.parquet file in it.
Describing the table:
{code}
%spark.sql
describe table schema_abc.parquet_table_name
{code}
gives:
{code}
col_name  data_type  comment
YEAR      int        null
WORD      string     null
{code}
but when I run a select query to read the table, it gives me the below error:
{code}
%spark.sql
select * from schema_abc.parquet_table_name

java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
	at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{code}
Since I was getting the above error, I wanted to see if the parquet file really is showing the data, hence I used Apache Drill to view the data, which outputs like below:
{code}
user@server:parth_to/parquet-drill/apache-drill-1.10.0$ bin/drill-embedded
Apr 07, 2017 1:04:51 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.10.0
"drill baby drill"
0: jdbc:drill:zk=local> select * from dfs.`/path_to/parquet-drill/apache-drill-1.10.0/sample-data/testFile.parquet`;
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
+-------+--------------+
| YEAR  |     WORD     |
+-------+--------------+
| 2017  | null         |
| 2018  | [B@5bd466f2  |
+-------+--------------+
2 rows selected (1.433 seconds)
{code}
Input txt file (I've put a null to test if that works):
{code}
YEAR|WORD
2017|
2018|Word 2
{code}
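[Editorial note] The {{decodeToInt}} failure above is consistent with a width mismatch: the DDL declares {{YEAR INT}} (32-bit), while pyarrow wrote the column as 64-bit (hence {{PlainLongDictionary}}), matching the INT vs BIGINT finding later in this thread. A hedged stdlib sketch of the width difference:

```python
import struct

# Parquet INT64 values occupy 8 bytes; a reader told the column is a 32-bit
# INT expects 4-byte values, so Spark's vectorized reader asks the long
# dictionary for an int and throws UnsupportedOperationException.
print(struct.calcsize('<q'))  # 8 bytes per INT64 value
print(struct.calcsize('<i'))  # 4 bytes per INT32 value
```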
[ https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960984#comment-15960984 ] Wes McKinney commented on ARROW-785: I will take a look at the file in Impala (or Hive, if I can figure out how to do that) to see if I can repro.