[
https://issues.apache.org/jira/browse/SPARK-53535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18019133#comment-18019133
]
Ziya Mukhtarov commented on SPARK-53535:
----------------------------------------
I'm already working on fixing this issue.
> Missing columns inside a struct in Parquet files are not handled correctly
> --------------------------------------------------------------------------
>
> Key: SPARK-53535
> URL: https://issues.apache.org/jira/browse/SPARK-53535
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.1
> Reporter: Ziya Mukhtarov
> Priority: Major
> Labels: parquet, parquetReader
>
> The mechanism used to fill missing columns with NULLs does not work
> correctly when the missing column is inside a STRUCT.
> h3. Repro
> {*}Step 1{*}: We’re going to consider three different schemas:
> # One with STRUCT<a: INT>, which we’re going to write to disk.
> # One with STRUCT<b: INT>, which we’ll use to demonstrate the missing
> column being handled incorrectly.
> # One with STRUCT<a: INT, b: INT>, which we’ll use to demonstrate the
> missing column being handled correctly.
> {code:java}
> df_a = spark.sql('SELECT 1 AS id, named_struct("a", 1) AS s')
> df_b = spark.sql('SELECT 2 AS id, named_struct("b", 3) AS s')
> df_ab = spark.sql('SELECT 2 AS id, named_struct("a", 2, "b", 3) AS s')
> {code}
> {*}Step 2{*}: We write the sample data to disk.
> {code:java}
> path = "/tmp/missing_col_test"
> df_a.write.format("parquet").save(path)
> {code}
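> To double-check what actually landed on disk, the file can be read back
> without an explicit schema and its schema printed (the commented output is
> what one would expect here, not a verbatim transcript):
> {code:java}
> spark.read.format("parquet").load(path).printSchema()
> # root
> #  |-- id: integer (nullable = true)
> #  |-- s: struct (nullable = true)
> #  |    |-- a: integer (nullable = true)
> {code}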
> {*}Step 3{*}: We read this data with the three different schemas explained
> above.
> {code:java}
> # This is the schema that matches the data. We read correct data.
> spark.read.format("parquet").schema(df_a.schema).load(path).show()
> +---+---+
> | id| s|
> +---+---+
> | 1|{1}| # <- GOOD
> +---+---+
> # This is the schema that has struct s but shares no columns
> # inside s with what was written to disk.
> spark.read.format("parquet").schema(df_b.schema).load(path).show()
> +---+----+
> | id| s|
> +---+----+
> | 1|NULL| # <- WRONG! Should be {NULL} instead.
> +---+----+
> # This is the schema that has more columns in struct s than what
> # was written to disk.
> spark.read.format("parquet").schema(df_ab.schema).load(path).show()
> +---+---------+
> | id| s|
> +---+---------+
> | 1|{1, NULL}| # <- GOOD
> +---+---------+
> {code}
> {*}Step 4{*}: To demonstrate that this is not a display glitch but genuinely
> leads to incorrect query results, we can show that the same function evaluates
> differently depending on which columns struct s contains in the read schema.
> {code:java}
> from pyspark.sql.functions import col
> spark.read.format("parquet").schema(df_a.schema).load(path).withColumn("isnull",
> col("s").isNull()).show()
> +---+---+------+
> | id| s|isnull|
> +---+---+------+
> | 1|{1}| false| # <- GOOD
> +---+---+------+
> spark.read.format("parquet").schema(df_b.schema).load(path).withColumn("isnull",
> col("s").isNull()).show()
> +---+----+------+
> | id| s|isnull|
> +---+----+------+
> | 1|NULL| true| # <- WRONG! Should be false.
> +---+----+------+
> spark.read.format("parquet").schema(df_ab.schema).load(path).withColumn("isnull",
> col("s").isNull()).show()
> +---+---------+------+
> | id| s|isnull|
> +---+---------+------+
> | 1|{1, NULL}| false| # <- GOOD
> +---+---------+------+
> {code}
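> The incorrect NULL also changes filter results, not just projections
> (hypothetical follow-on, consistent with the isnull column above):
> {code:java}
> # The single row written is dropped, because s wrongly evaluates to NULL.
> spark.read.format("parquet").schema(df_b.schema).load(path) \
>     .filter(col("s").isNotNull()).count()
> # Expected 1, but returns 0 given the behavior shown above.
> {code}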
> h3. Solution
> This happens when all the fields of the struct we are trying to read are
> missing from the Parquet file: the reader assumes a NULL struct because it
> reads no field carrying definition & repetition levels. We should look at the
> file schema to see whether the struct has another, non-requested field whose
> definition levels can be used to determine the struct's nullability.
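> To make the idea concrete, here is a minimal sketch (plain Python, not
> Spark's actual reader code) of recovering a struct's per-row nullability from
> the definition levels of a leaf column that exists in the file but was not
> requested. Assume the file schema is s: STRUCT<a: INT> with both s and a
> optional, so the leaf s.a has a max definition level of 2:
> {code:java}
> # def level 0 -> s is NULL
> # def level 1 -> s is non-NULL, a is NULL
> # def level 2 -> s is non-NULL, a is non-NULL
> STRUCT_DEF_LEVEL = 1  # level at which the struct itself is defined
>
> def struct_is_null(definition_levels):
>     # The struct is NULL exactly on rows where the leaf's definition
>     # level never reaches the struct's own level.
>     return [d < STRUCT_DEF_LEVEL for d in definition_levels]
>
> # Rows written as {a: 1}, {a: NULL}, NULL -> def levels 2, 1, 0
> print(struct_is_null([2, 1, 0]))  # [False, False, True]
> {code}
> Even though the requested field b has no data in the file, the file's field a
> still carries enough information to distinguish a NULL struct from a non-NULL
> struct whose requested fields are all NULL.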