Ziya Mukhtarov created SPARK-53535:
--------------------------------------

             Summary: Missing columns inside a struct in Parquet files are not handled correctly
                 Key: SPARK-53535
                 URL: https://issues.apache.org/jira/browse/SPARK-53535
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.0.1
            Reporter: Ziya Mukhtarov


The mechanism used to fill missing columns with NULLs does not work correctly when it encounters a missing column inside a STRUCT.

h3. Repro

{*}Step 1{*}: We’re going to consider three different schemas:
 # One with STRUCT<INT a>, which we’re going to write to disk.
 # One with STRUCT<INT b>, which we’ll use to demonstrate the missing columns being handled incorrectly.
 # One with STRUCT<INT a, INT b>, which we’ll use to demonstrate the missing columns being handled correctly.

{code:python}
df_a = spark.sql('SELECT 1 AS id, named_struct("a", 1) AS s')
df_b = spark.sql('SELECT 2 AS id, named_struct("b", 3) AS s')
df_ab = spark.sql('SELECT 2 AS id, named_struct("a", 2, "b", 3) AS s')
{code}
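For reference, calling schema.simpleString() on each DataFrame shows the three read schemas used below:
{code:python}
df_a.schema.simpleString()   # 'struct<id:int,s:struct<a:int>>'
df_b.schema.simpleString()   # 'struct<id:int,s:struct<b:int>>'
df_ab.schema.simpleString()  # 'struct<id:int,s:struct<a:int,b:int>>'
{code}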
{*}Step 2{*}: We write the sample data to disk.
{code:python}
path = "/tmp/missing_col_test"
df_a.write.format("parquet").save(path)
{code}
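To confirm what actually landed on disk, reading the file back without an explicit schema should show only field "a" inside s (nullability may differ from the original DataFrame, since Parquet reads mark fields nullable):
{code:python}
spark.read.format("parquet").load(path).printSchema()
# root
#  |-- id: integer (nullable = true)
#  |-- s: struct (nullable = true)
#  |    |-- a: integer (nullable = true)
{code}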
{*}Step 3{*}: We read this data with the three different schemas explained 
above.
{code:python}
# This is the schema that matches the data. We read correct data.
spark.read.format("parquet").schema(df_a.schema).load(path).show()
+---+---+
| id|  s|
+---+---+
|  1|{1}| # <- GOOD
+---+---+

# This is the schema that has struct s, but doesn't have columns
# inside of s in common with what was written to disk.
spark.read.format("parquet").schema(df_b.schema).load(path).show()
+---+----+
| id|   s|
+---+----+
|  1|NULL| # <- WRONG! Should be {NULL} instead.
+---+----+

# This is the schema that has more columns in struct s than what
# was written to disk.
spark.read.format("parquet").schema(df_ab.schema).load(path).show()
+---+---------+
| id|        s|
+---+---------+
|  1|{1, NULL}| # <- GOOD
+---+---------+
{code}
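One way to see what the second read should have produced is to build the expected row directly with the same schema; the struct itself stays non-null while its only field is NULL:
{code:python}
spark.createDataFrame([(1, (None,))], df_b.schema).show()
# +---+------+
# | id|     s|
# +---+------+
# |  1|{NULL}|
# +---+------+
{code}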
{*}Step 4{*}: To demonstrate that this is not just a display glitch but genuinely leads to incorrect query results, we can show that the same expression evaluates differently depending on which columns the read schema requests inside the struct.
{code:python}
from pyspark.sql.functions import col

spark.read.format("parquet").schema(df_a.schema).load(path).withColumn("isnull", col("s").isNull()).show()
+---+---+------+
| id|  s|isnull|
+---+---+------+
|  1|{1}| false| <- GOOD
+---+---+------+

spark.read.format("parquet").schema(df_b.schema).load(path).withColumn("isnull",
 col("s").isNull()).show()
+---+----+------+
| id|   s|isnull|
+---+----+------+
|  1|NULL|  true| <- WRONG!!! Should be false. 
+---+----+------+

spark.read.format("parquet").schema(df_ab.schema).load(path).withColumn("isnull",
 col("s").isNull()).show()
+---+---------+------+
| id|        s|isnull|
+---+---------+------+
|  1|{1, NULL}| false| <- GOOD
+---+---------+------+
{code}
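Given the isnull results above, the incorrect NULL also propagates into filters; the following is a sketch that follows directly from the output shown, not an independent test:
{code:python}
# Should keep the row (the struct is present in the file), but drops it,
# because s is incorrectly read as NULL:
spark.read.format("parquet").schema(df_b.schema).load(path) \
    .filter(col("s").isNotNull()).count()
# returns 0; the correct answer is 1
{code}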
h3. Solution

This happens when all the fields of the struct we are trying to read are missing from the Parquet file: since no field under the struct is actually read, there are no definition & repetition levels to consult, and the reader assumes a null struct. Instead, we should look at the file schema for a non-requested field inside the struct, whose definition levels can be used to determine the struct's nullability.
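As a rough illustration of the idea (plain Python, not Spark's reader code; the names below are hypothetical): an optional struct is null for a row exactly when the definition level recorded for a field beneath it stays below the struct's own definition level, and that signal is available from any file field under the struct, requested or not.
{code:python}
# Hypothetical sketch, not Spark internals. In Parquet, a value's
# definition level counts how many of its optional ancestors are defined.
STRUCT_DEF_LEVEL = 1  # optional struct s directly under the root

def struct_is_null(def_level: int) -> bool:
    # The struct is null when the levels never reach its own level.
    return def_level < STRUCT_DEF_LEVEL

# For the file written above (s = {a: 1}), column s.a carries
# def_level = 2 (s defined, a defined), so the struct is non-null --
# and this can be read off s.a even when the read schema only asks
# for s.b.
print(struct_is_null(2))  # False: s is present, so we should emit {NULL}
{code}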


