Yin Huai created SPARK-10428:
--------------------------------
Summary: Struct fields read from parquet are mis-aligned
Key: SPARK-10428
URL: https://issues.apache.org/jira/browse/SPARK-10428
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.5.0
Reporter: Yin Huai
Priority: Critical
{code}
val df1 = sqlContext
.range(1)
.selectExpr("NAMED_STRUCT('a', id, 'd', id + 3) AS s")
.coalesce(1)
val df2 = sqlContext
.range(1, 2)
.selectExpr("NAMED_STRUCT('a', id, 'b', id + 1, 'c', id + 2, 'd', id + 3) AS
s")
.coalesce(1)
df1.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=1")
df2.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=2")
{code}
{code}
sqlContext.read.option("mergeSchema",
"true").parquet("/home/yin/sc_11_minimal/").selectExpr("s.a", "s.b", "s.c",
"s.d", “p").show
+---+---+----+----+---+
| a| b| c| d| p|
+---+---+----+----+---+
| 0| 3|null|null| 1|
| 1| 2| 3| 4| 2|
+---+---+----+----+---+
{code}
Looks like the problem is at
https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L185-L204,
we do padding when global schema has more struct fields than local parquet
file's schema. However, when we read field from parquet, we still use parquet's
local schema and then we put the value of {{d}} to the wrong slot.
I tried master. Looks like this issue is resolved by
https://github.com/apache/spark/pull/8509. We need to decide if we want to back
port that to branch 1.5.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]