Yin Huai created SPARK-10428:
--------------------------------

             Summary: Struct fields read from parquet are mis-aligned
                 Key: SPARK-10428
                 URL: https://issues.apache.org/jira/browse/SPARK-10428
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.5.0
            Reporter: Yin Huai
            Priority: Critical


{code}
val df1 = sqlContext
  .range(1)
  .selectExpr("NAMED_STRUCT('a', id, 'd', id + 3) AS s")
  .coalesce(1)

val df2 = sqlContext
  .range(1, 2)
  .selectExpr("NAMED_STRUCT('a', id, 'b', id + 1, 'c', id + 2, 'd', id + 3) AS s")
  .coalesce(1)

df1.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=1")
df2.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=2")
{code}

{code}
sqlContext.read
  .option("mergeSchema", "true")
  .parquet("/home/yin/sc_11_minimal/")
  .selectExpr("s.a", "s.b", "s.c", "s.d", "p")
  .show

+---+---+----+----+---+
|  a|  b|   c|   d|  p|
+---+---+----+----+---+
|  0|  3|null|null|  1|
|  1|  2|   3|   4|  2|
+---+---+----+----+---+
{code}

The problem appears to be in CatalystRowConverter:
https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L185-L204

When the global (merged) schema has more struct fields than the local parquet
file's schema, we pad the row to the global width. However, when we read the
fields from parquet, we still use the file's local schema, so the value of
{{d}} is put into the wrong slot. In the output above, the row from p=1 shows
3 (the value of {{d}}) under {{b}} and null under {{d}}; the expected row is
0, null, null, 3, 1.
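
The misalignment described above can be sketched in a few lines of plain
Scala. This is only a simplified model of the behavior, not the actual
CatalystRowConverter code: the object name, the method names, and the use of a
plain array in place of a row are all assumptions made for illustration.

```scala
// Hypothetical, simplified model of the bug; not Spark's actual converter code.
object StructAlignmentSketch {
  // Field names of struct `s` in the merged (global) schema and in the p=1 file.
  val globalFields: Seq[String] = Seq("a", "b", "c", "d")
  val localFields: Seq[String] = Seq("a", "d")

  // Values read from the p=1 file, in the file's own field order: a = 0, d = 3.
  val localValues: Seq[Any] = Seq(0L, 3L)

  // Observed behavior: the row is padded to the global width, but each value
  // is written at its ordinal in the file's local schema.
  def fillByLocalOrdinal(): Seq[Any] = {
    val row = Array.fill[Any](globalFields.length)(null)
    localValues.zipWithIndex.foreach { case (value, i) => row(i) = value }
    row.toSeq
  }

  // Expected behavior: each local field is written at the ordinal that the
  // field of the same name has in the global schema.
  def fillByGlobalOrdinal(): Seq[Any] = {
    val row = Array.fill[Any](globalFields.length)(null)
    localFields.zip(localValues).foreach { case (name, value) =>
      row(globalFields.indexOf(name)) = value
    }
    row.toSeq
  }

  def main(args: Array[String]): Unit = {
    println(fillByLocalOrdinal().mkString(", "))  // 0, 3, null, null
    println(fillByGlobalOrdinal().mkString(", ")) // 0, null, null, 3
  }
}
```

The first line printed matches the s.a through s.d columns of the p=1 row in
the output above; the second is what that row should contain.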

I tried master, and the issue appears to be resolved there by 
https://github.com/apache/spark/pull/8509. We need to decide whether to 
backport that to branch-1.5.



