kosiew opened a new pull request, #19179:
URL: https://github.com/apache/datafusion/pull/19179
**PR Title**
Add Substrait roundtrip support for `RecursiveQuery` and recursive CTE scans
---
## Which issue does this PR close?
* Closes #16274.
## Rationale for this change
Substrait roundtrip mode currently fails for plans that include
`RecursiveQuery`, resulting in `not_impl_err!("Unsupported plan type:
RecursiveQuery")` during SQL logic tests. This prevents recursive CTE queries
from being roundtripped through Substrait, causing multiple `cte.slt` cases to
fail and reducing confidence in Substrait interoperability.
This PR adds end-to-end support for serializing and deserializing
`RecursiveQuery` logical plans and their associated recursive work-table scans.
This unblocks the failing SQL logic tests and improves parity between the
DataFusion logical plan representation and its Substrait encoding.
## What changes are included in this PR?
* **New `logical_plan::recursive` helper module**
* Introduces `RECURSIVE_QUERY_TYPE_URL` and `RECURSIVE_SCAN_TYPE_URL` to
tag recursive structures in Substrait extensions.
* Defines small prost messages for:
* `RecursiveQueryDetail { name, is_distinct }` used to carry
`RecursiveQuery` metadata.
* `RecursiveScanDetail { name }` used to identify recursive work-table
scans.
* Provides helpers to encode/decode these messages:
* `encode_recursive_query_detail` / `decode_recursive_query_detail`.
* `encode_recursive_scan_detail` / `decode_recursive_scan_detail`.
* Adds validation so that empty names and malformed payloads are reported
as Substrait errors.
* **Producer-side support for `RecursiveQuery`**
* Adds `logical_plan/producer/rel/recursive_query_rel.rs` implementing:
* `from_recursive_query`, which serializes `LogicalPlan::RecursiveQuery`
into `ExtensionMultiRel` with two inputs:
* `inputs[0]`: `static_term`.
* `inputs[1]`: `recursive_term`.
* Encodes `{ name, is_distinct }` into the `detail` field using
`RECURSIVE_QUERY_TYPE_URL`.
* Wires this into the generic Substrait producer:
* Extends `SubstraitProducer` with `handle_recursive_query`.
* Updates `to_substrait_rel` to delegate `LogicalPlan::RecursiveQuery`
to `handle_recursive_query` instead of returning `not_impl_err!`.
* **Consumer-side support for `RecursiveQuery`**
* Adds `logical_plan/consumer/rel/recursive_query_rel.rs` implementing:
* `from_recursive_query_rel`, which reconstructs
`LogicalPlan::RecursiveQuery` from `ExtensionMultiRel`.
* Validates that:
* The extension has exactly two inputs.
* The `detail` field is present and decodes successfully.
* Rebuilds `RecursiveQuery { name, is_distinct, static_term,
recursive_term }`.
* Integrates with `DefaultSubstraitConsumer`:
* Detects `RECURSIVE_QUERY_TYPE_URL` in `ExtensionMultiRel` and routes
to `from_recursive_query_rel`.
* Falls back to the existing extension handling for other type URLs.
* **Support for recursive CTE work-table scans**
* **Producer (TableScan → ReadRel)**
* Detects `CteWorkTable`-backed scans via `DefaultTableSource`.
* Adds a helper `recursive_scan_name(&TableScan) -> Option<String>` that:
* Downcasts to `DefaultTableSource` and then to `CteWorkTable`.
* Confirms that the scan’s table name matches the work-table name.
* When a `CteWorkTable` is detected, sets `ReadRel.advanced_extension`
with:
* `type_url = RECURSIVE_SCAN_TYPE_URL`.
* `value = encode_recursive_scan_detail(name)`.
* **Consumer (ReadRel → LogicalPlan::TableScan)**
* Parses `ReadRel.advanced_extension` and, when `type_url ==
RECURSIVE_SCAN_TYPE_URL`, decodes the recursive scan detail.
* Verifies the recursive scan name matches the `TableReference` table
name and returns a Substrait error if they differ.
* Uses a `CteWorkTable` wrapped in a `TableSource` instead of resolving
a regular catalog table for recursive scans.
* Falls back to the existing `resolve_table_ref` logic for non-recursive
scans.
* **Refactoring in `ReadRel` consumer**
* Replaces the earlier `read_with_schema` helper with a more flexible
`read_with_source` helper that accepts a `TableSource` (either a catalog table
or a `CteWorkTable`).
* Ensures that schema compatibility checks are still performed after
building the scan plan.
* **Tests**
* Adds unit tests for producer-side recursive scan encoding:
* `from_table_scan_sets_advanced_extension_for_cte_work_table` ensures
that:
* A `TableScan` over a `CteWorkTable` uses `RECURSIVE_SCAN_TYPE_URL`
in `advanced_extension`.
* The encoded payload decodes back to the correct table name.
* Adds roundtrip and error-coverage tests for recursive queries and scan
details in `roundtrip_logical_plan.rs`:
* `roundtrip_recursive_query` verifies that a `RecursiveQuery`
roundtrips through Substrait and back, preserving the name and `is_distinct`
flag.
* `serialize_recursive_query_with_empty_name_errors` checks that
encoding fails with a clear error when the `RecursiveQuery` name is empty.
* `decode_recursive_query_detail_malformed_bytes_errors` ensures
malformed bytes produce a descriptive Substrait error.
* `decode_recursive_scan_detail_malformed_bytes_errors` similarly
validates error handling for malformed recursive scan detail.
* `roundtrip_recursive_query_distinct` confirms that `is_distinct =
true` is preserved across the roundtrip.
* `roundtrip_recursive_query_preserves_child_plans` does a structural
sanity check that the main characteristics of the child plans (projections,
filters, table scans) survive the roundtrip.
* `roundtrip_recursive_query_with_work_table_scan_executes` runs a real
recursive CTE (balances example) through Substrait roundtrip and executes it,
asserting non-empty results.
## Are these changes tested?
### Before
```
> cargo test --test sqllogictests -- --substrait-round-trip cte.slt:175
Finished `test` profile [unoptimized + debuginfo] target(s) in 0.65s
Running bin/sqllogictests.rs
(target/debug/deps/sqllogictests-79fc77e66b850bad)
Completed 1 test files in 0 seconds
External error: 1 errors in file ...
1. query failed: DataFusion error: This feature is not implemented:
Unsupported plan type: RecursiveQuery....
```
### After
```
❯ cargo test --test sqllogictests -- --substrait-round-trip cte.slt:175
Finished `test` profile [unoptimized + debuginfo] target(s) in 0.78s
Running bin/sqllogictests.rs
(target/debug/deps/sqllogictests-79fc77e66b850bad)
Completed 1 test files in 0 seconds
```
Yes.
* New and updated tests have been added to cover:
* `RecursiveQuery` Substrait serialization and deserialization.
* Encoding/decoding of `RecursiveQueryDetail` and `RecursiveScanDetail`,
including malformed-byte and empty-name error paths.
* Correct tagging of `CteWorkTable` scans via `advanced_extension`.
* End-to-end execution of a recursive CTE query after Substrait roundtrip.
* Existing Substrait roundtrip tests continue to run and provide regression
coverage for non-recursive plans.
## Are there any user-facing changes?
* **Behavioral improvement (no breaking API changes):**
* Substrait roundtrip mode now supports logical plans that contain
`RecursiveQuery` and recursive CTE work-table scans. Previously, such plans
failed with `not_impl_err!("Unsupported plan type: RecursiveQuery")`.
* Users who rely on Substrait serialization/deserialization for queries
with recursive CTEs should now see these queries roundtrip and execute
successfully.
* No public Rust API signatures are changed, and no configuration flags or
SQL syntax are modified.
## LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content
has been manually reviewed and tested.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]