[jira] [Comment Edited] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15965273#comment-15965273 ]

Hyukjin Kwon edited comment on SPARK-20297 at 4/12/17 2:24 AM:
---

Let me leave some pointers about related PRs - https://github.com/apache/spark/pull/8566 and https://github.com/apache/spark/pull/6617.

cc [~lian cheng]

was (Author: hyukjin.kwon):
Let me leave some pointers about related PRs - https://github.com/apache/spark/pull/8566 and https://github.com/apache/spark/pull/6617.

cc [~liancheng]

> Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
> ---
>
> Key: SPARK-20297
> URL: https://issues.apache.org/jira/browse/SPARK-20297
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Mostafa Mokhtar
> Labels: integration
>
> While trying to load some data using Spark 2.1 I realized that decimal(12,2)
> columns stored in Parquet written by Spark are not readable by Hive or Impala.
> Repro
> {code}
> CREATE TABLE customer_acctbal(
>   c_acctbal decimal(12,2))
> STORED AS Parquet;
> insert into customer_acctbal values (7539.95);
> {code}
> Error from Hive
> {code}
> Failed with exception
> java.io.IOException:parquet.io.ParquetDecodingException: Can not read value
> at 1 in block 0 in file
> hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet
> Time taken: 0.122 seconds
> {code}
> Error from Impala
> {code}
> File
> 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet'
> has an incompatible Parquet schema for column
> 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. Column type:
> DECIMAL(12,2), Parquet schema:
> optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar)
> {code}
> Table info
> {code}
> hive> describe formatted customer_acctbal;
> OK
> # col_name              data_type               comment
> c_acctbal               decimal(12,2)
> # Detailed Table Information
> Database:               tpch_nested_3000_parquet
> Owner:                  mmokhtar
> CreateTime:             Mon Apr 10 17:47:24 PDT 2017
> LastAccessTime:         UNKNOWN
> Protect Mode:           None
> Retention:              0
> Location:
> hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal
> Table Type:             MANAGED_TABLE
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles                1
>   numRows                 0
>   rawDataSize             0
>   totalSize               120
>   transient_lastDdlTime   1491871644
> # Storage Information
> SerDe Library:
> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
> InputFormat:
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
> OutputFormat:
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
> Compressed:             No
> Num Buckets:            -1
> Bucket Columns:         []
> Sort Columns:           []
> Storage Desc Params:
>   serialization.format    1
> Time taken: 0.032 seconds, Fetched: 31 row(s)
> {code}

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15965273#comment-15965273 ]

Hyukjin Kwon edited comment on SPARK-20297 at 4/12/17 2:23 AM:
---

Let me leave some pointers about related PRs - https://github.com/apache/spark/pull/8566 and https://github.com/apache/spark/pull/6617.

cc [~liancheng]

was (Author: hyukjin.kwon):
Let me leave some pointers about related PRs - https://github.com/apache/spark/pull/8566 and https://github.com/apache/spark/pull/8566.

cc [~liancheng]
[jira] [Comment Edited] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15965265#comment-15965265 ]

Hyukjin Kwon edited comment on SPARK-20297 at 4/12/17 2:21 AM:
---

Thank you so much for trying it out, [~mmokhtar]. Do you think this JIRA is resolvable?

was (Author: hyukjin.kwon):
Thank you so much for trying it out, [~mmokhtar]. Do you think this JIRA is resolvable?

As far as I know, this option means following Parquet's specification rather than the approach Spark currently uses. So, if other implementations follow Parquet's specification, I guess this is the correct option for compatibility.
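
For readers following the thread: the Impala error quoted in the issue reports the DECIMAL(12,2) column as `optional int64`. The Parquet format specification lets DECIMAL annotate int32 (precision up to 9), int64 (precision up to 18), fixed_len_byte_array, or binary, so a reader that expects only the fixed-length form would presumably reject an int64-backed column. The sketch below only illustrates that arithmetic. The function names and the `use_int_backing` flag are hypothetical stand-ins, not Spark, Hive, or Impala APIs or settings.

```python
from decimal import Decimal


def max_decimal_digits(num_bytes: int) -> int:
    """Largest decimal precision whose unscaled value fits in a signed
    two's-complement integer of num_bytes bytes: floor(log10(2^(8n-1) - 1))."""
    return len(str(2 ** (8 * num_bytes - 1) - 1)) - 1


def min_bytes_for_precision(precision: int) -> int:
    """Smallest byte width able to hold every unscaled value of the given precision."""
    num_bytes = 1
    while 2 ** (8 * num_bytes - 1) < 10 ** precision:
        num_bytes += 1
    return num_bytes


def physical_type(precision: int, use_int_backing: bool) -> str:
    """Pick a physical type for DECIMAL(precision, _).

    use_int_backing is a hypothetical flag for this sketch: when True the
    writer prefers int32/int64 where the precision allows it, otherwise it
    always uses a fixed-length byte array.
    """
    if use_int_backing and precision <= max_decimal_digits(4):
        return "int32"
    if use_int_backing and precision <= max_decimal_digits(8):
        return "int64"
    return f"fixed_len_byte_array({min_bytes_for_precision(precision)})"


def unscaled(value: str, scale: int) -> int:
    """Unscaled integer for a decimal literal, e.g. 7539.95 at scale 2."""
    return int(Decimal(value).scaleb(scale))


if __name__ == "__main__":
    # The column from the repro: decimal(12,2) holding 7539.95.
    print(physical_type(12, use_int_backing=True))
    print(physical_type(12, use_int_backing=False))
    # Fixed-length decimals are stored as big-endian two's complement.
    raw = unscaled("7539.95", 2).to_bytes(min_bytes_for_precision(12), "big", signed=True)
    print(raw.hex())
```

Under these assumptions, precision 12 lands in the int64 range when integer backing is preferred, which matches the schema in the Impala error, and needs a 6-byte fixed-length array otherwise.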