[jira] [Created] (SPARK-5049) ParquetTableScan always prepends the values of partition columns in output rows irrespective of the order of the partition columns in the original SELECT query

Rahul Aggarwal (JIRA) Thu, 01 Jan 2015 04:11:30 -0800

Rahul Aggarwal created SPARK-5049:
-------------------------------------

             Summary: ParquetTableScan always prepends the values of partition 
columns in output rows irrespective of the order of the partition columns in 
the original SELECT query
                 Key: SPARK-5049
                 URL: https://issues.apache.org/jira/browse/SPARK-5049
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.2.0, 1.1.0
            Reporter: Rahul Aggarwal



This happens when ParquetTableScan is used, i.e. when 
spark.sql.hive.convertMetastoreParquet is turned on.

For example:

spark-sql> set spark.sql.hive.convertMetastoreParquet=true;

spark-sql> create table table1(a int , b int) partitioned by (p1 string, p2 
int) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS  
INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' OUTPUTFORMAT 
'parquet.hive.DeprecatedParquetOutputFormat';

spark-sql> insert into table table1 partition(p1='January',p2=1) select key, 10 from src;

spark-sql> select a, b, p1, p2 from table1 limit 10;

January 1       484     10
January 1       484     10
January 1       484     10
January 1       484     10
January 1       484     10
January 1       484     10
January 1       484     10
January 1       484     10
January 1       484     10
January 1       484     10

The expected output, with columns in the order given in the SELECT, is:

484     10      January 1
484     10      January 1
484     10      January 1
484     10      January 1
484     10      January 1
484     10      January 1
484     10      January 1
484     10      January 1
484     10      January 1
484     10      January 1


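This also leads to a mismatch between the rows and the schema if the query is 
run using HiveContext and the result is a SchemaRDD: the row values come back 
in the order (p1, p2, a, b), while the schema reports (a, b, p1, p2).
For example: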

scala> import org.apache.spark.sql.hive._
scala> val hc = new HiveContext(sc)
scala> hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")
scala> val res = hc.sql("select a, b, p1, p2 from table1 limit 10")
scala> res.collect
res2: Array[org.apache.spark.sql.Row] = Array([January,1,238,10], 
[January,1,86,10], [January,1,311,10], [January,1,27,10], [January,1,165,10], 
[January,1,409,10], [January,1,255,10], [January,1,278,10], [January,1,98,10], 
[January,1,484,10])

scala> res.schema
res5: org.apache.spark.sql.StructType = 
StructType(ArrayBuffer(StructField(a,IntegerType,true), 
StructField(b,IntegerType,true), StructField(p1,StringType,true), 
StructField(p2,IntegerType,true)))
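
To make the mismatch concrete, here is a minimal, Spark-free sketch that models the symptom and shows how an emitted row could be mapped back onto the requested column order. It is not Spark's implementation or the eventual fix. The object and method names (PartitionColumnOrder, scanOrder, reorder) are hypothetical, and it assumes the partition columns keep their relative order from the query when they are moved to the front.

```scala
// Hypothetical model of the reported behaviour; no Spark dependency.
object PartitionColumnOrder {

  // Column order in which values are emitted according to the report:
  // partition columns first, then the remaining data columns.
  // Assumption: each group keeps its relative order from the query.
  def scanOrder(requested: Seq[String], partitionCols: Set[String]): Seq[String] = {
    val (parts, data) = requested.partition(partitionCols.contains)
    parts ++ data
  }

  // Rearrange one emitted row so that its values line up with the
  // column order requested in the SELECT (and reported by the schema).
  def reorder(row: Seq[Any], emitted: Seq[String], requested: Seq[String]): Seq[Any] = {
    val indexOf = emitted.zipWithIndex.toMap
    requested.map(name => row(indexOf(name)))
  }
}
```

With requested = Seq("a", "b", "p1", "p2") and partition columns Set("p1", "p2"), scanOrder yields the order p1, p2, a, b seen in the collected rows. Applying reorder to the row (January, 1, 484, 10) with those two orderings gives (484, 10, January, 1), which matches the schema.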






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

