malinjawi opened a new pull request, #49886: URL: https://github.com/apache/arrow/pull/49886

### Rationale for this change

This is follow-up work to GH-33985 / PR #34834 now that Substrait can represent unresolved / partially bound expressions (see substrait-io/substrait#515).

Arrow can currently deserialize bound Substrait `ExtendedExpression` messages, but it cannot yet consume unresolved expressions that contain:

- `Expression.NamedExpression`
- `Type.Unknown`
- unresolved function signatures such as `add:unknown_unknown`

To support front-end filter / projection workflows, Arrow should be able to deserialize these messages using a supplied Arrow schema, bind unresolved names and types against that schema, and then return normal Arrow compute expressions.

This PR depends on the Substrait protocol change in https://github.com/substrait-io/substrait/pull/1063, so it should remain draft until Arrow can pin to a Substrait release that includes those protocol changes.

### What changes are included in this PR?

This PR adds schema-aware deserialization for unresolved Substrait expressions.
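
To make "schema-aware deserialization" concrete, here is a minimal, self-contained Python sketch of the binding idea: named references are resolved against a supplied schema, unknown types are replaced, and a signature such as `add:unknown_unknown` is rewritten with concrete argument types. This is a toy model, not the code in this PR. Every name in it (`NamedRef`, `FieldRef`, `Call`, `bind`, `check_base_schema`) is hypothetical, the schema is a plain dict of column name to a Substrait-style short type name, and result-type inference for nested calls is left out.

```python
from dataclasses import dataclass

UNKNOWN = "unknown"  # stand-in for Substrait's Type.Unknown placeholder


@dataclass(frozen=True)
class NamedRef:
    """Unresolved column reference (plays the role of Expression.NamedExpression)."""
    name: str


@dataclass(frozen=True)
class FieldRef:
    """Bound reference: column position plus the type taken from the schema."""
    index: int
    type: str


@dataclass(frozen=True)
class Call:
    """Function call; the signature may still contain unknown argument types."""
    signature: str  # e.g. "add:unknown_unknown"
    args: tuple


def check_base_schema(base_names, schema):
    """Reject a supplied schema whose field names differ from the message's base_schema."""
    if list(base_names) != list(schema):
        raise ValueError(
            f"schema mismatch: expected {list(base_names)}, got {list(schema)}"
        )


def type_of(expr):
    """Type of a bound node; call result types are not inferred in this sketch."""
    return expr.type if isinstance(expr, FieldRef) else UNKNOWN


def bind(expr, schema):
    """Resolve names and unknown types in `expr` against `schema` (name -> type)."""
    if schema is None:
        raise ValueError("a schema is required to bind unresolved expressions")
    if isinstance(expr, NamedRef):
        if expr.name not in schema:
            raise KeyError(f"no field named {expr.name!r} in schema")
        return FieldRef(list(schema).index(expr.name), schema[expr.name])
    if isinstance(expr, Call):
        args = tuple(bind(arg, schema) for arg in expr.args)
        function_name = expr.signature.split(":", 1)[0]
        arg_types = "_".join(type_of(arg) for arg in args)
        return Call(f"{function_name}:{arg_types}", args)
    return expr  # already bound


schema = {"x": "i64", "y": "i64"}
unresolved = Call("add:unknown_unknown", (NamedRef("x"), NamedRef("y")))

check_base_schema(["x", "y"], schema)
bound = bind(unresolved, schema)
print(bound.signature)  # add:i64_i64
```

The two failure paths in the sketch (no schema supplied, and supplied names that disagree with `base_schema`) correspond to the negative cases this PR tests, although the real error types and messages may differ.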

On the C++ side:

- add a `DeserializeExpressions(buf, input_schema, ...)` overload
- bind `Expression.NamedExpression` to Arrow `FieldRef`
- treat `Type.Unknown` as a bind-time placeholder instead of an executable Arrow type
- validate supplied schema names against unresolved `ExtendedExpression.base_schema`
- allow unresolved function ids under `extension:io.substrait:unknown` to resolve through Arrow's existing function registry

On the Python side:

- add optional `schema=` support to:
  - `pyarrow.substrait.deserialize_expressions(...)`
  - `pyarrow.substrait.BoundExpressions.from_substrait(...)`
  - `pyarrow.compute.Expression.from_substrait(...)`
- make `SubstraitSchema.to_pysubstrait()` work with either `substrait.proto` or generated protobuf module layouts

Testing added:

- unresolved projection binding with a supplied schema
- unresolved filter binding with a supplied schema
- failure when no schema is supplied
- failure when the supplied schema does not match the unresolved `base_schema`
- combined unresolved filter + projection scanner flow

### Are these changes tested?

Yes. Validated locally with:

- targeted C++ Substrait serde coverage
- targeted Python Substrait tests
- end-to-end `pyarrow.dataset` flows using unresolved projection and filter expressions
- negative cases for missing schema and schema mismatch

The local end-to-end validation was run against an Arrow build using a Substrait archive containing the protocol changes from https://github.com/substrait-io/substrait/pull/1063.

### Are there any user-facing changes?

Yes.

This PR adds additive API surface for schema-aware deserialization of unresolved Substrait expressions:

- C++:
  - `DeserializeExpressions(const Buffer&, const Schema&, ...)`
- Python:
  - `pyarrow.substrait.deserialize_expressions(..., schema=...)`
  - `pyarrow.substrait.BoundExpressions.from_substrait(..., schema=...)`
  - `pyarrow.compute.Expression.from_substrait(..., schema=...)`

These changes are intended for unresolved / partially bound Substrait expression workflows and do not change the existing bound-expression API behavior.

### Additional context

Portions of this change were developed with AI assistance and then manually reviewed, built, debugged, and validated.
