senthh opened a new pull request, #12190:
URL: https://github.com/apache/gluten/pull/12190
What changes are proposed in this pull request?
Fixes #11914.
Spark 4.1 (SPARK-53535) introduced the
spark.sql.legacy.parquet.returnNullStructIfAllFieldsMissing config to change
how a struct is handled when all of its requested fields are missing from the
Parquet file. When this config is false (the new default), Spark returns a
non-null struct with null inner fields ({NULL}) instead of a null struct.
Gluten's Velox native Parquet scan has not adapted to this behavior and returns
an incorrect null struct, causing GlutenParquetIOSuite to fail on Spark 4.1.
This PR makes the Velox backend fall back to the vanilla Spark Parquet
reader for the affected cases so results match Spark, rather than producing
incorrect results via the native scan. Native support for this behavior can be
added later as a follow-up.
Specifically:
In VeloxBackendSettings.validateFileFormat (Parquet path), add a guard
shouldFallbackBySpark41ParquetStructBehavior that rejects the native scan
(triggering fallback) when all of the following hold:
running on Spark 4.1+ (SparkVersionUtil.gteSpark41),
the read schema contains a struct type (recursively, including structs
nested inside array/map), and
spark.sql.legacy.parquet.returnNullStructIfAllFieldsMissing is false.
Re-enable the two previously excluded Spark 4.1 tests in gluten-ut/spark41
VeloxTestSettings (SPARK-53535 and vectorized reader: missing all struct
fields), which now pass via fallback.
Files changed:
backends-velox/.../velox/VeloxBackend.scala — fallback guard.
backends-velox/.../execution/FallbackSuite.scala — new fallback test.
gluten-ut/spark41/.../velox/VeloxTestSettings.scala — remove the two
SPARK-53535-related excludes.
How was this patch tested?
Added a unit test in FallbackSuite ("fallback Spark 4.1 parquet missing all
struct fields compatibility") that writes a Parquet file containing struct
field a, reads it back with a schema requesting only the missing field b under
returnNullStructIfAllFieldsMissing=false, and asserts the plan contains no
GlutenPlan (i.e. fully fell back to Spark). The test is cancel-ed on Spark <
4.1.
Re-enabled and verified the previously excluded GlutenParquetIOSuite tests
on Spark 4.1: SPARK-53535 and vectorized reader: missing all struct fields.
Manually verified on spark-shell (Spark 4.1.2 + Velox bundle) that reading
structs where requested fields are absent now returns a non-null struct with
null inner fields ({NULL}, s.isNull = false), matching vanilla Spark, with the
scan correctly falling back.
Was this patch authored or co-authored using generative AI tooling?
No
A couple of notes before you submit:
I attributed AI assistance per the ASF guidance since this fix was developed
with Cursor. If you wrote it without AI tooling, change that line to No. Adjust
the tool/version string to match what you actually used.
The guard is intentionally broad (any struct-containing schema, not strictly
"all requested fields missing"), so it may over-fall-back. If reviewers push
back, the description already frames native support as a follow-up — but you
may want to mention that trade-off explicitly.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]