Ziya Mukhtarov created SPARK-53535:
--------------------------------------
Summary: Missing columns inside a struct in Parquet files are not
handled correctly
Key: SPARK-53535
URL: https://issues.apache.org/jira/browse/SPARK-53535
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.0.1
Reporter: Ziya Mukhtarov
The mechanism used to fill missing columns with NULLs does not work
correctly when it encounters a missing column inside a STRUCT.
h3. Repro
{*}Step 1{*}: We’re going to consider three different schemas:
# One with STRUCT<a: INT>, which we’re going to write to disk.
# One with STRUCT<b: INT>, which we’ll use to demonstrate the missing column
being handled incorrectly.
# One with STRUCT<a: INT, b: INT>, which we’ll use to demonstrate the missing
column being handled correctly.
{code:java}
df_a = spark.sql('SELECT 1 AS id, named_struct("a", 1) AS s')
df_b = spark.sql('SELECT 2 AS id, named_struct("b", 3) AS s')
df_ab = spark.sql('SELECT 2 AS id, named_struct("a", 2, "b", 3) AS s')
{code}
{*}Step 2{*}: We write the sample data to disk.
{code:java}
path = "/tmp/missing_col_test"
df_a.write.format("parquet").save(path)
{code}
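As a side check (not part of the original repro; this assumes pyarrow is installed and writes an equivalent single-row file directly, rather than inspecting the Spark output), the Parquet footer can be read to confirm that the struct on disk contains only field {{a}}:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# An equivalent of df_a: one row with id = 1 and s = {a: 1}.
table = pa.table({
    "id": pa.array([1], type=pa.int32()),
    "s": pa.array([{"a": 1}], type=pa.struct([("a", pa.int32())])),
})
pq.write_table(table, "/tmp/missing_col_test_check.parquet")

# The footer's schema shows s as a group with a single field a, so a read
# schema asking only for s.b has no physical column to pull levels from.
schema = pq.read_schema("/tmp/missing_col_test_check.parquet")
print(schema)
{code}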
{*}Step 3{*}: We read this data with the three different schemas explained
above.
{code:java}
# This is the schema that matches the data. We read correct data.
spark.read.format("parquet").schema(df_a.schema).load(path).show()
+---+---+
| id| s|
+---+---+
| 1|{1}| # <- GOOD
+---+---+
# This is the schema that has struct s, but doesn't have columns
# inside of s in common with what was written to disk.
spark.read.format("parquet").schema(df_b.schema).load(path).show()
+---+----+
| id| s|
+---+----+
| 1|NULL| # <- WRONG! Should be {NULL} instead.
+---+----+
# This is the schema that has more columns in struct s than what
# was written to disk.
spark.read.format("parquet").schema(df_ab.schema).load(path).show()
+---+---------+
| id| s|
+---+---------+
| 1|{1, NULL}| # <- GOOD
+---+---------+
{code}
{*}Step 4{*}: To demonstrate that this is not a display glitch but genuinely
leads to incorrect query results, we can show that a function evaluates
differently depending on the number of columns inside the struct in the read
schema.
{code:java}
from pyspark.sql.functions import col
spark.read.format("parquet").schema(df_a.schema).load(path) \
    .withColumn("isnull", col("s").isNull()).show()
+---+---+------+
| id| s|isnull|
+---+---+------+
| 1|{1}| false| <- GOOD
+---+---+------+
spark.read.format("parquet").schema(df_b.schema).load(path) \
    .withColumn("isnull", col("s").isNull()).show()
+---+----+------+
| id| s|isnull|
+---+----+------+
| 1|NULL| true| <- WRONG!!! Should be false.
+---+----+------+
spark.read.format("parquet").schema(df_ab.schema).load(path) \
    .withColumn("isnull", col("s").isNull()).show()
+---+---------+------+
| id| s|isnull|
+---+---------+------+
| 1|{1, NULL}| false| <- GOOD
+---+---------+------+
{code}
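Until a fix lands, one possible workaround (a sketch only, assuming Spark 3.1+ where {{Column.withField}} and {{Column.dropFields}} are available; the path below is illustrative) is to read with the schema that actually matches the file and then reshape the struct explicitly, instead of relying on the reader to reconcile the missing field:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.master("local[1]").getOrCreate()

path = "/tmp/missing_col_test_workaround"
spark.sql('SELECT 1 AS id, named_struct("a", 1) AS s') \
    .write.mode("overwrite").format("parquet").save(path)

# Read with the file's own schema, then add the missing field b and
# drop a, so the struct itself is never filled in by the reader.
df = (
    spark.read.format("parquet").load(path)
    .withColumn("s", col("s").withField("b", lit(None).cast("int")).dropFields("a"))
)
df.show()  # s is {NULL} (a present struct with a NULL field), not NULL
{code}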
h3. Solution
This happens when all the fields of the struct we are trying to read are
missing from the Parquet file: we assume a null struct because we are not
reading any field with definition & repetition levels. We should look at the
file schema to see whether the struct has another, non-requested field whose
definition levels can be used to determine the struct's nullability.
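The mechanics can be illustrated with a toy, self-contained model of definition levels (illustrative names only, not Spark or parquet-mr internals): for an OPTIONAL struct with OPTIONAL leaves, every leaf's definition level already encodes whether the struct itself is null, so the levels of a non-requested leaf are enough to recover the struct's nullability:
{code:python}
# Toy model of Parquet definition levels for an OPTIONAL struct s with
# OPTIONAL leaf fields. For a leaf s.x:
#   level 0 -> s is null, level 1 -> s present but s.x null, level 2 -> s.x present.
# The file stores struct<a: INT>; one row with s = {a: 1}.
file_columns = {"s.a": [2]}   # definition levels for the only physical leaf

def read_struct_nullability(requested_leaves, file_columns, num_rows):
    """Per-row 'is s null?' flags -- models the current (buggy) behaviour."""
    present = [leaf for leaf in requested_leaves if leaf in file_columns]
    if not present:
        # No requested leaf exists in the file -> no levels were read,
        # so the struct is (wrongly) assumed null for every row.
        return [True] * num_rows
    levels = file_columns[present[0]]
    return [lvl == 0 for lvl in levels]

def read_struct_nullability_fixed(requested_leaves, file_columns, num_rows):
    """Proposed fix: fall back to a non-requested leaf from the file schema."""
    present = [leaf for leaf in requested_leaves if leaf in file_columns]
    source = present or list(file_columns)   # borrow levels if nothing matches
    levels = file_columns[source[0]]
    return [lvl == 0 for lvl in levels]

# Read schema asks only for s.b, which the file does not contain.
print(read_struct_nullability(["s.b"], file_columns, 1))        # [True]  <- bug
print(read_struct_nullability_fixed(["s.b"], file_columns, 1))  # [False] <- expected
{code}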
--
This message was sent by Atlassian Jira
(v8.20.10#820010)