Mostafa Mokhtar created SPARK-20297:
---------------------------------------
             Summary: Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
                 Key: SPARK-20297
                 URL: https://issues.apache.org/jira/browse/SPARK-20297
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.1.0
            Reporter: Mostafa Mokhtar
            Priority: Critical


While trying to load some data using Spark 2.1, I realized that decimal(12,2) columns stored in Parquet files written by Spark are not readable by Hive or Impala.

Repro
{code}
CREATE TABLE customer_acctbal(
  c_acctbal decimal(12,2))
STORED AS Parquet;

insert into customer_acctbal values (7539.95);
{code}

Error from Hive
{code}
Failed with exception java.io.IOException:parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-00000-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet
Time taken: 0.122 seconds
{code}

Error from Impala
{code}
File 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-00000-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet' has an incompatible Parquet schema for column 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. Column type: DECIMAL(12,2), Parquet schema: optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar)
{code}

Table info
{code}
hive> describe formatted customer_acctbal;
OK
# col_name              data_type             comment

c_acctbal               decimal(12,2)

# Detailed Table Information
Database:               tpch_nested_3000_parquet
Owner:                  mmokhtar
CreateTime:             Mon Apr 10 17:47:24 PDT 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   true
        numFiles                1
        numRows                 0
        rawDataSize             0
        totalSize               120
        transient_lastDdlTime   1491871644

# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.032 seconds, Fetched: 31 row(s)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
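The Impala error points at the mismatch: the file carries the column as `optional int64 c_acctbal`, i.e. Spark wrote the decimal using Parquet's INT64 physical type (legal for precision <= 18 under the Parquet DECIMAL annotation, which stores the unscaled integer), while the Hive/Impala readers here expect the legacy fixed_len_byte_array layout. A minimal Python sketch of the unscaled-integer mapping (illustrative helper names, not Spark code):

```python
from decimal import Decimal

def decimal_to_unscaled_int64(value: Decimal, scale: int) -> int:
    """Unscaled integer that a DECIMAL-annotated INT64 column stores
    for `value` at the given scale (value * 10**scale)."""
    return int(value.scaleb(scale))

def unscaled_int64_to_decimal(unscaled: int, scale: int) -> Decimal:
    """Reverse mapping: rebuild the decimal from its unscaled form."""
    return Decimal(unscaled).scaleb(-scale)

# The value inserted in the repro, at decimal(12,2):
print(decimal_to_unscaled_int64(Decimal("7539.95"), 2))   # 753995
print(unscaled_int64_to_decimal(753995, 2))               # 7539.95
```

A reader that only understands the fixed_len_byte_array encoding cannot interpret this int64 column, hence the ParquetDecodingException on the Hive side.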
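As a possible workaround (an assumption based on Spark's documented `spark.sql.parquet.writeLegacyFormat` option, not something verified in this report), asking Spark to emit the legacy Parquet layout should make the decimals readable by these Hive/Impala versions. A hypothetical PySpark session, assuming an existing `SparkSession` named `spark`:

```python
# Config-only sketch: write decimals as fixed_len_byte_array, the layout
# older Hive/Impala Parquet readers expect, instead of int64.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

# Re-run the failing insert from the repro; files written after the
# setting takes effect should use the legacy decimal encoding.
spark.sql("insert into customer_acctbal values (7539.95)")
```

Existing files written with the int64 encoding are unaffected and would need to be rewritten.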