malinjawi opened a new pull request, #49886:
URL: https://github.com/apache/arrow/pull/49886

   ### Rationale for this change
   
   This is follow-up work to GH-33985 / PR #34834 now that Substrait can 
represent unresolved / partially bound expressions (see 
substrait-io/substrait#515).
   
   Arrow can currently deserialize bound Substrait `ExtendedExpression` 
messages, but it cannot yet consume unresolved expressions that contain:
   - `Expression.NamedExpression`
   - `Type.Unknown`
   - unresolved function signatures such as `add:unknown_unknown`
   
   To support front-end filter / projection workflows, Arrow should be able to 
deserialize these messages using a supplied Arrow schema, bind unresolved names 
and types against that schema, and then return normal Arrow compute expressions.
   
   This PR depends on the Substrait protocol change in 
https://github.com/substrait-io/substrait/pull/1063, so it should remain a 
draft until Arrow can pin to a Substrait release that includes that change.
   
   ### What changes are included in this PR?
   
   This PR adds schema-aware deserialization for unresolved Substrait 
expressions.
   
   On the C++ side:
   - add a `DeserializeExpressions(buf, input_schema, ...)` overload
   - bind `Expression.NamedExpression` to Arrow `FieldRef`
   - treat `Type.Unknown` as a bind-time placeholder instead of an executable 
Arrow type
   - validate the supplied schema's field names against the `base_schema` of 
the unresolved `ExtendedExpression`
   - allow unresolved function ids under `extension:io.substrait:unknown` to 
resolve through Arrow's existing function registry
   
   On the Python side:
   - add optional `schema=` support to:
     - `pyarrow.substrait.deserialize_expressions(...)`
     - `pyarrow.substrait.BoundExpressions.from_substrait(...)`
     - `pyarrow.compute.Expression.from_substrait(...)`
   - make `SubstraitSchema.to_pysubstrait()` work with either `substrait.proto` 
or generated protobuf module layouts
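
Supporting more than one module layout is commonly done with an import fallback. The sketch below shows that general pattern only; `import_first_available` is a hypothetical helper, not a function in pyarrow, and it is demonstrated with standard-library module names so it runs without the Substrait Python package installed. For the case described above, the candidate list would presumably start with `substrait.proto` and fall back to whichever generated-module path the installed package provides.

```python
import importlib


def import_first_available(candidates):
    """Return the first module in `candidates` that can be imported."""
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            # Remember why this candidate failed and try the next one.
            errors.append(f"{name}: {exc}")
    raise ImportError(
        "none of the candidate modules could be imported: " + "; ".join(errors)
    )


# The first name does not exist, so the helper falls back to the second.
mod = import_first_available(["no_such_module_for_demo", "json"])
```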
   
   Testing added:
   - unresolved projection binding with a supplied schema
   - unresolved filter binding with a supplied schema
   - failure when no schema is supplied
   - failure when the supplied schema does not match the `base_schema` of the 
unresolved expression
   - combined unresolved filter + projection scanner flow
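
The two negative cases in the list above can be pictured with a small stand-alone sketch. `check_schema_names` and `failure_message` are hypothetical helpers written for this illustration; the order-sensitive, exact-match comparison and the error wording are assumptions of the sketch, not a statement of how Arrow performs the check.

```python
def check_schema_names(base_schema_names, supplied_names):
    """Raise ValueError if no schema is supplied or its names do not match."""
    if supplied_names is None:
        raise ValueError("unresolved expressions require a schema")
    expected = list(base_schema_names)
    actual = list(supplied_names)
    if expected != actual:
        raise ValueError(f"schema mismatch: expected {expected}, got {actual}")


def failure_message(base_schema_names, supplied_names):
    """Return the error text for a failing check, or None when it passes."""
    try:
        check_schema_names(base_schema_names, supplied_names)
    except ValueError as exc:
        return str(exc)
    return None


ok = failure_message(["x", "y"], ["x", "y"])
missing = failure_message(["x", "y"], None)
mismatch = failure_message(["x", "y"], ["x", "z"])
```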
   
   ### Are these changes tested?
   
   Yes.
   
   Validated locally with:
   - targeted C++ Substrait serde coverage
   - targeted Python Substrait tests
   - end-to-end `pyarrow.dataset` flows using unresolved projection and filter 
expressions
   - negative cases for missing schema and schema mismatch
   
   The local end-to-end validation was run against an Arrow build using a 
Substrait archive containing the protocol changes from 
https://github.com/substrait-io/substrait/pull/1063.
   
   ### Are there any user-facing changes?
   
   Yes.
   
   This PR is purely additive. It introduces new API surface for schema-aware 
deserialization of unresolved Substrait expressions:
   
   - C++:
     - `DeserializeExpressions(const Buffer&, const Schema&, ...)`
   - Python:
     - `pyarrow.substrait.deserialize_expressions(..., schema=...)`
     - `pyarrow.substrait.BoundExpressions.from_substrait(..., schema=...)`
     - `pyarrow.compute.Expression.from_substrait(..., schema=...)`
   
   These changes are intended for unresolved / partially bound Substrait 
expression workflows and do not change the existing bound-expression API 
behavior.
   
   ### Additional context
   
   Portions of this change were developed with AI assistance and then manually 
reviewed, built, debugged, and validated.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.