[ https://issues.apache.org/jira/browse/SPARK-53535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18019133#comment-18019133 ]

Ziya Mukhtarov commented on SPARK-53535:
----------------------------------------

I'm already working on fixing this issue.

> Missing columns inside a struct in Parquet files are not handled correctly
> --------------------------------------------------------------------------
>
>                 Key: SPARK-53535
>                 URL: https://issues.apache.org/jira/browse/SPARK-53535
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.0.1
>            Reporter: Ziya Mukhtarov
>            Priority: Major
>              Labels: parquet, parquetReader
>
> The mechanism used to fill missing columns with NULLs does not work 
> correctly when confronted with a missing column inside a STRUCT.
> h3. Repro
> {*}Step 1{*}: We’re going to consider three different schemas:
>  # One with STRUCT<INT a>, which we’re going to write to disk.
>  # One with STRUCT<INT b>, which we’ll use to demonstrate the missing columns 
> being handled incorrectly.
>  # One with STRUCT<INT a, INT b>, which we’ll use to demonstrate the missing 
> columns being handled well.
> {code:java}
> df_a = sql('SELECT 1 as id, named_struct("a", 1) AS s')
> df_b = sql('SELECT 2 as id, named_struct("b", 3) AS s')
> df_ab = sql('SELECT 2 as id, named_struct("a", 2, "b", 3) AS s')
> {code}
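> For reference (a quick check, not part of the original report), the three 
> read schemas differ only in the fields of s:
> {code:java}
> df_a.schema.simpleString()   # 'struct<id:int,s:struct<a:int>>'
> df_b.schema.simpleString()   # 'struct<id:int,s:struct<b:int>>'
> df_ab.schema.simpleString()  # 'struct<id:int,s:struct<a:int,b:int>>'
> {code}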
> {*}Step 2{*}: We write the sample data to disk.
> {code:java}
> path = "/tmp/missing_col_test"
> df_a.write.format("parquet").save(path)
> {code}
> {*}Step 3{*}: We read this data with the three different schemas explained 
> above.
> {code:java}
> # This is the schema that matches the data. We read correct data.
> spark.read.format("parquet").schema(df_a.schema).load(path).show()
> +---+---+
> | id|  s|
> +---+---+
> |  1|{1}| # <- GOOD
> +---+---+
> # This is the schema that has struct s, but shares no fields inside s
> # with what was written to disk.
> spark.read.format("parquet").schema(df_b.schema).load(path).show()
> +---+----+
> | id|   s|
> +---+----+
> |  1|NULL| # <- WRONG! Should be {NULL} instead.
> +---+----+
> # This is the schema that has more columns in struct s than what
> # was written to disk.
> spark.read.format("parquet").schema(df_ab.schema).load(path).show()
> +---+---------+
> | id|        s|
> +---+---------+
> |  1|{1, NULL}| # <- GOOD
> +---+---------+
> {code}
> {*}Step 4{*}: To demonstrate that this is not a display glitch, but genuinely 
> leads to incorrect query results, we can show that a function is evaluated 
> differently depending on the number of columns inside the struct in the read 
> schema.
> {code:java}
> from pyspark.sql.functions import col
> spark.read.format("parquet").schema(df_a.schema).load(path).withColumn("isnull",
>  col("s").isNull()).show()
> +---+---+------+
> | id|  s|isnull|
> +---+---+------+
> |  1|{1}| false| <- GOOD
> +---+---+------+
> spark.read.format("parquet").schema(df_b.schema).load(path).withColumn("isnull",
>  col("s").isNull()).show()
> +---+----+------+
> | id|   s|isnull|
> +---+----+------+
> |  1|NULL|  true| <- WRONG!!! Should be false. 
> +---+----+------+
> spark.read.format("parquet").schema(df_ab.schema).load(path).withColumn("isnull",
>  col("s").isNull()).show()
> +---+---------+------+
> | id|        s|isnull|
> +---+---------+------+
> |  1|{1, NULL}| false| <- GOOD
> +---+---------+------+
> {code}
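> As an additional sanity check (a sketch, not part of the original repro), the 
> same mismatched schema applied to identical data written as JSON should 
> yield the expected non-null struct, assuming the usual JSON reader semantics:
> {code:java}
> json_path = "/tmp/missing_col_test_json"  # illustrative path
> df_a.write.format("json").save(json_path)
> spark.read.format("json").schema(df_b.schema).load(json_path).show()
> # Expected:
> # +---+------+
> # | id|     s|
> # +---+------+
> # |  1|{NULL}|
> # +---+------+
> {code}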
> h3. Solution
> This happens when all the fields of the struct we are trying to read are 
> missing from the Parquet file: we assume a null struct because we are not 
> reading any field with definition & repetition levels. Instead, we should 
> look at the file schema to see whether the struct has another, non-requested 
> field whose definition levels can be used to determine the struct's 
> nullability.
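> A possible interim workaround (a sketch, not from the original report): read 
> with a schema that keeps at least one field of s that exists in the file, 
> then rebuild s with only the wanted fields. Note this assumes s itself is 
> never genuinely NULL in the data, since the rebuilt struct is always non-null.
> {code:java}
> from pyspark.sql.functions import struct, col
> # Read with the superset schema so the reader sees a's definition levels.
> df = spark.read.format("parquet").schema(df_ab.schema).load(path)
> # Rebuild s with only the requested field b.
> df.withColumn("s", struct(col("s.b").alias("b"))).show()
> # +---+------+
> # | id|     s|
> # +---+------+
> # |  1|{NULL}|
> # +---+------+
> {code}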


