mzhang opened a new pull request, #55962:
URL: https://github.com/apache/spark/pull/55962

   ### What changes were proposed in this pull request?
   
   Follow-up to SPARK-56844, which allowed `ArrayType` / `MapType` / `StructType`
   in `FileSourceMetadataAttribute` and added the matching branches to
   `ColumnVectorUtils.populate` for the columnar metadata path.
   
   That covered file scans returning `ColumnarBatch`. For scans that produce
   row-form output (text, JSON, CSV, or any reader with `Batched=false`), the
   metadata row is filled via
   `FileFormat.updateMetadataInternalRow` ->
   `FileFormat.getFileConstantMetadataColumnValue` ->
   `Literal(extractor.apply(file))`.
   
   `Literal.apply(Any)` dispatches on the value class and has no case for
   `ArrayData`, `MapData`, or `InternalRow`, so a complex constant metadata
   column trips `UNSUPPORTED_FEATURE.LITERAL_TYPE` before the row is even
   populated. Separately, `SchemaPruning.sortLeftFieldsByRight` recurses
   through the metadata schema and prunes nested struct fields inside an
   array/map/struct subfield. That is correct for data files (the reader
   projects the requested columns) but wrong for constant metadata, where
   each subfield's value is produced whole by a single extractor: pruning
   the schema shifts the Catalyst row ordinals, so they no longer line up
   with the unpruned value the extractor returns.
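   
   A minimal sketch of the `Literal` failure described above, assuming
   `spark-catalyst` is on the classpath. The column type
   `array<struct<name:string,size:bigint>>` and the object name are
   hypothetical, chosen only for illustration; `Literal.apply` and
   `Literal.create` are the existing Catalyst factories.
   
   ```scala
   import org.apache.spark.sql.catalyst.InternalRow
   import org.apache.spark.sql.catalyst.expressions.Literal
   import org.apache.spark.sql.catalyst.util.GenericArrayData
   import org.apache.spark.sql.types._
   import org.apache.spark.unsafe.types.UTF8String
   
   object ComplexLiteralSketch {
     // Hypothetical complex metadata column type (an assumption for this sketch).
     val columnType: ArrayType = ArrayType(StructType(Seq(
       StructField("name", StringType),
       StructField("size", LongType))))
   
     // Catalyst-form value, shaped like what an extractor might return.
     val value: GenericArrayData = new GenericArrayData(Array[Any](
       InternalRow(UTF8String.fromString("a"), 1L),
       InternalRow(UTF8String.fromString("b"), 2L)))
   
     def main(args: Array[String]): Unit = {
       // Literal.apply(Any) dispatches on the runtime class and, per the
       // description above, has no branch for ArrayData, so this should fail.
       val untyped = scala.util.Try(Literal(value))
       println(s"Literal(value) failed: ${untyped.isFailure}")
   
       // Literal.create takes the DataType explicitly, so no class-based
       // dispatch is needed for the Catalyst-form value.
       val typed = Literal.create(value, columnType)
       println(s"typed literal has the column type: ${typed.dataType == columnType}")
     }
   }
   ```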
   
   This PR:
   - Threads the column's `DataType` through
     `FileFormat.getFileConstantMetadataColumnValue` and
     `updateMetadataInternalRow`. When a type is provided, the value goes
     through `Literal.create(value, dataType)`, which accepts Catalyst-form
     values directly. The parameter is optional, so existing call sites that
     pass primitives keep working unchanged.
   - Teaches `SchemaPruning.sortLeftFieldsByRight` to preserve subfield
     data types when recursing inside a `FileSourceMetadataAttribute`. The
     metadata attribute's top struct can still have unused sibling
     sub-attributes pruned (each is a separate extractor), but anything
     below that level is preserved verbatim. Non-metadata data file
     pruning behavior is unchanged.
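   
   The preservation rule in the second bullet can be sketched roughly as
   follows. This is not the code in `SchemaPruning.sortLeftFieldsByRight`;
   `pruneSiblingsOnly` and the `parts` sub-attribute are hypothetical names
   used only to illustrate "prune siblings at the top level, keep nested
   types verbatim".
   
   ```scala
   import org.apache.spark.sql.types._
   
   object MetadataPruningSketch {
     // Hypothetical helper: keep only the top-level sub-attributes named in
     // `requested`, but take each kept field from `full`, so its nested data
     // type is left untouched.
     def pruneSiblingsOnly(full: StructType, requested: StructType): StructType = {
       val wanted = requested.fieldNames.toSet
       StructType(full.fields.filter(f => wanted.contains(f.name)))
     }
   
     val element: StructType = StructType(Seq(
       StructField("name", StringType),
       StructField("size", LongType)))
   
     // Stand-in for a metadata struct with one complex sub-attribute.
     val full: StructType = StructType(Seq(
       StructField("file_path", StringType),
       StructField("parts", ArrayType(element))))
   
     // What a query touching only `parts.size` would request.
     val requested: StructType = StructType(Seq(
       StructField("parts",
         ArrayType(StructType(Seq(StructField("size", LongType)))))))
   
     def main(args: Array[String]): Unit = {
       val pruned = pruneSiblingsOnly(full, requested)
       // The unused sibling `file_path` is dropped ...
       println(pruned.fieldNames.mkString(","))
       // ... while the element struct of `parts` keeps both of its fields.
       println(pruned("parts").dataType == ArrayType(element))
     }
   }
   ```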
   
   ### Why are the changes needed?
   
   Without this, a file format that registers a constant metadata column
   with a complex type (e.g. `array<struct<...>>`) can be read on the
   columnar path but fails at runtime on the row path, and even on the
   columnar path the schema-pruning rewriter can shift the ordinals of the
   element struct's fields.
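   
   A small illustration of the ordinal shift, assuming `spark-catalyst` is
   on the classpath. The element struct `struct<name:string,size:bigint>`
   is hypothetical. The row stands in for a value an extractor produced
   whole; the pruned schema is what a reader would see if nested pruning
   were applied to it.
   
   ```scala
   import org.apache.spark.sql.catalyst.InternalRow
   import org.apache.spark.sql.types._
   import org.apache.spark.unsafe.types.UTF8String
   
   object OrdinalShiftSketch {
     // Full element struct and a row built against it (name at 0, size at 1).
     val fullElement: StructType = StructType(Seq(
       StructField("name", StringType),
       StructField("size", LongType)))
     val fullRow: InternalRow = InternalRow(UTF8String.fromString("a"), 1L)
   
     // Element struct after nested pruning to just `size`.
     val prunedElement: StructType = StructType(Seq(StructField("size", LongType)))
   
     def main(args: Array[String]): Unit = {
       // Against the full schema, `size` resolves to the right slot.
       println(fullRow.getLong(fullElement.fieldIndex("size")))
   
       // Against the pruned schema, `size` resolves to ordinal 0, which in
       // the unpruned row still holds `name`.
       val shifted = fullRow.get(prunedElement.fieldIndex("size"), LongType)
       println(shifted.isInstanceOf[UTF8String])
     }
   }
   ```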
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. No current OSS code path exposes a complex constant metadata column.
   
   ### How was this patch tested?
   
   A new `SchemaPruningSuite` case covers the metadata-attribute preservation
   rule. Existing `SchemaPruningSuite` and `FileMetadataStructSuite` tests
   verify that the non-metadata and sibling-pruning behavior is unchanged.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude (Anthropic)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

