senthh opened a new pull request, #12190:
URL: https://github.com/apache/gluten/pull/12190

   What changes are proposed in this pull request?
   Fixes #11914.
   
   Spark 4.1 (SPARK-53535) introduced the 
spark.sql.legacy.parquet.returnNullStructIfAllFieldsMissing config to change 
how a struct is handled when all of its requested fields are missing from the 
Parquet file. When this config is false (the new default), Spark returns a 
non-null struct with null inner fields ({NULL}) instead of a null struct. 
Gluten's Velox native Parquet scan has not adapted to this behavior and returns 
an incorrect null struct, causing GlutenParquetIOSuite to fail on Spark 4.1.
   
   This PR makes the Velox backend fall back to the vanilla Spark Parquet 
reader for the affected cases so results match Spark, rather than producing 
incorrect results via the native scan. Native support for this behavior can be 
added later as a follow-up.
   
   Specifically:
   
   In VeloxBackendSettings.validateFileFormat (Parquet path), add a guard 
shouldFallbackBySpark41ParquetStructBehavior that rejects the native scan 
(triggering fallback) when all of the following hold:
   running on Spark 4.1+ (SparkVersionUtil.gteSpark41),
   the read schema contains a struct type (recursively, including structs 
nested inside array/map), and
   spark.sql.legacy.parquet.returnNullStructIfAllFieldsMissing is false.
   Re-enable the two previously excluded Spark 4.1 tests in gluten-ut/spark41 
VeloxTestSettings (SPARK-53535 and vectorized reader: missing all struct 
fields), which now pass via fallback.
   Files changed:
   
   backends-velox/.../velox/VeloxBackend.scala — fallback guard.
   backends-velox/.../execution/FallbackSuite.scala — new fallback test.
   gluten-ut/spark41/.../velox/VeloxTestSettings.scala — remove the two 
SPARK-53535-related excludes.
   How was this patch tested?
   Added a unit test in FallbackSuite ("fallback Spark 4.1 parquet missing all 
struct fields compatibility") that writes a Parquet file containing struct 
field a, reads it back with a schema requesting only the missing field b under 
returnNullStructIfAllFieldsMissing=false, and asserts the plan contains no 
GlutenPlan (i.e. fully fell back to Spark). The test is cancel-ed on Spark < 
4.1.
   Re-enabled and verified the previously excluded GlutenParquetIOSuite tests 
on Spark 4.1: SPARK-53535 and vectorized reader: missing all struct fields.
   Manually verified on spark-shell (Spark 4.1.2 + Velox bundle) that reading 
structs where requested fields are absent now returns a non-null struct with 
null inner fields ({NULL}, s.isNull = false), matching vanilla Spark, with the 
scan correctly falling back.
   Was this patch authored or co-authored using generative AI tooling?
   No
   
   A couple of notes before you submit:
   
   I attributed AI assistance per the ASF guidance since this fix was developed 
with Cursor. If you wrote it without AI tooling, change that line to No. Adjust 
the tool/version string to match what you actually used.
   The guard is intentionally broad (any struct-containing schema, not strictly 
"all requested fields missing"), so it may over-fall-back. If reviewers push 
back, the description already frames native support as a follow-up — but you 
may want to mention that trade-off explicitly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to