kosiew opened a new pull request, #19179:
URL: https://github.com/apache/datafusion/pull/19179

   **PR Title**
   
   Add Substrait roundtrip support for `RecursiveQuery` and recursive CTE scans
   
   ---
   
   ## Which issue does this PR close?
   
   * Closes #16274.
   
   ## Rationale for this change
   
   Substrait roundtrip mode currently fails for plans that include 
`RecursiveQuery`, resulting in `not_impl_err!("Unsupported plan type: 
RecursiveQuery")` during SQL logic tests. This prevents recursive CTE queries 
from being roundtripped through Substrait, causing multiple `cte.slt` cases to 
fail and reducing confidence in Substrait interoperability.
   
   This PR adds end-to-end support for serializing and deserializing 
`RecursiveQuery` logical plans and their associated recursive work-table scans. 
This unblocks the failing SQL logic tests and improves parity between the 
DataFusion logical plan representation and its Substrait encoding.
   
   ## What changes are included in this PR?
   
   * **New `logical_plan::recursive` helper module**
   
     * Introduces `RECURSIVE_QUERY_TYPE_URL` and `RECURSIVE_SCAN_TYPE_URL` to 
tag recursive structures in Substrait extensions.
     * Defines small prost messages for:
   
       * `RecursiveQueryDetail { name, is_distinct }` used to carry 
`RecursiveQuery` metadata.
       * `RecursiveScanDetail { name }` used to identify recursive work-table 
scans.
     * Provides helpers to encode/decode these messages:
   
       * `encode_recursive_query_detail` / `decode_recursive_query_detail`.
       * `encode_recursive_scan_detail` / `decode_recursive_scan_detail`.
     * Adds validation so that empty names and malformed payloads are reported 
as Substrait errors.
   
   * **Producer-side support for `RecursiveQuery`**
   
     * Adds `logical_plan/producer/rel/recursive_query_rel.rs` implementing:
   
       * `from_recursive_query`, which serializes `LogicalPlan::RecursiveQuery` 
into `ExtensionMultiRel` with two inputs:
   
         * `inputs[0]`: `static_term`.
         * `inputs[1]`: `recursive_term`.
       * Encodes `{ name, is_distinct }` into the `detail` field using 
`RECURSIVE_QUERY_TYPE_URL`.
     * Wires this into the generic Substrait producer:
   
       * Extends `SubstraitProducer` with `handle_recursive_query`.
       * Updates `to_substrait_rel` to delegate `LogicalPlan::RecursiveQuery` 
to `handle_recursive_query` instead of returning `not_impl_err!`.
   
   * **Consumer-side support for `RecursiveQuery`**
   
     * Adds `logical_plan/consumer/rel/recursive_query_rel.rs` implementing:
   
       * `from_recursive_query_rel`, which reconstructs 
`LogicalPlan::RecursiveQuery` from `ExtensionMultiRel`.
       * Validates that:
   
         * The extension has exactly two inputs.
         * The `detail` field is present and decodes successfully.
       * Rebuilds `RecursiveQuery { name, is_distinct, static_term, 
recursive_term }`.
     * Integrates with `DefaultSubstraitConsumer`:
   
       * Detects `RECURSIVE_QUERY_TYPE_URL` in `ExtensionMultiRel` and routes 
to `from_recursive_query_rel`.
       * Falls back to the existing extension handling for other type URLs.
   
   * **Support for recursive CTE work-table scans**
   
     * **Producer (TableScan → ReadRel)**
   
       * Detects `CteWorkTable`-backed scans via `DefaultTableSource`.
       * Adds a helper `recursive_scan_name(&TableScan) -> Option<String>` that:
   
         * Downcasts to `DefaultTableSource` and then to `CteWorkTable`.
         * Confirms that the scan’s table name matches the work-table name.
       * When a `CteWorkTable` is detected, sets `ReadRel.advanced_extension` 
with:
   
         * `type_url = RECURSIVE_SCAN_TYPE_URL`.
         * `value = encode_recursive_scan_detail(name)`.
     * **Consumer (ReadRel → LogicalPlan::TableScan)**
   
       * Parses `ReadRel.advanced_extension` and, when `type_url == 
RECURSIVE_SCAN_TYPE_URL`, decodes the recursive scan detail.
       * Verifies the recursive scan name matches the `TableReference` table 
name and returns a Substrait error if they differ.
       * Uses a `CteWorkTable` wrapped in a `TableSource` instead of resolving 
a regular catalog table for recursive scans.
       * Falls back to the existing `resolve_table_ref` logic for non-recursive 
scans.
   
   * **Refactoring in `ReadRel` consumer**
   
     * Replaces the earlier `read_with_schema` helper with a more flexible 
`read_with_source` helper that accepts a `TableSource` (either a catalog table 
or a `CteWorkTable`).
     * Ensures that schema compatibility checks are still performed after 
building the scan plan.
   
   * **Tests**
   
     * Adds unit tests for producer-side recursive scan encoding:
   
       * `from_table_scan_sets_advanced_extension_for_cte_work_table` ensures 
that:
   
         * A `TableScan` over a `CteWorkTable` uses `RECURSIVE_SCAN_TYPE_URL` 
in `advanced_extension`.
         * The encoded payload decodes back to the correct table name.
     * Adds roundtrip and error-coverage tests for recursive queries and scan 
details in `roundtrip_logical_plan.rs`:
   
       * `roundtrip_recursive_query` verifies that a `RecursiveQuery` 
roundtrips through Substrait and back, preserving the name and `is_distinct` 
flag.
       * `serialize_recursive_query_with_empty_name_errors` checks that 
encoding fails with a clear error when the `RecursiveQuery` name is empty.
       * `decode_recursive_query_detail_malformed_bytes_errors` ensures 
malformed bytes produce a descriptive Substrait error.
       * `decode_recursive_scan_detail_malformed_bytes_errors` similarly 
validates error handling for malformed recursive scan detail.
       * `roundtrip_recursive_query_distinct` confirms that `is_distinct = 
true` is preserved across the roundtrip.
       * `roundtrip_recursive_query_preserves_child_plans` does a structural 
sanity check that the main characteristics of the child plans (projections, 
filters, table scans) survive the roundtrip.
       * `roundtrip_recursive_query_with_work_table_scan_executes` runs a real 
recursive CTE (balances example) through Substrait roundtrip and executes it, 
asserting non-empty results.
   
   ## Are these changes tested?
   
   ### Before
   ```
   > cargo test --test sqllogictests -- --substrait-round-trip cte.slt:175
       Finished `test` profile [unoptimized + debuginfo] target(s) in 0.65s
        Running bin/sqllogictests.rs 
(target/debug/deps/sqllogictests-79fc77e66b850bad)
   Completed 1 test files in 0 seconds                                          
External error: 1 errors in file ...
   
   1. query failed: DataFusion error: This feature is not implemented: 
Unsupported plan type: RecursiveQuery....
   ```
   
   ### After
   ```
   ❯ cargo test --test sqllogictests -- --substrait-round-trip cte.slt:175
       Finished `test` profile [unoptimized + debuginfo] target(s) in 0.78s
        Running bin/sqllogictests.rs 
(target/debug/deps/sqllogictests-79fc77e66b850bad)
   Completed 1 test files in 0 seconds
   ```
   
   
   
   Yes.
   
   * New and updated tests have been added to cover:
   
     * `RecursiveQuery` Substrait serialization and deserialization.
     * Encoding/decoding of `RecursiveQueryDetail` and `RecursiveScanDetail`, 
including malformed-byte and empty-name error paths.
     * Correct tagging of `CteWorkTable` scans via `advanced_extension`.
     * End-to-end execution of a recursive CTE query after Substrait roundtrip.
   * Existing Substrait roundtrip tests continue to run and provide regression 
coverage for non-recursive plans.
   
   ## Are there any user-facing changes?
   
   * **Behavioral improvement (no breaking API changes):**
   
     * Substrait roundtrip mode now supports logical plans that contain 
`RecursiveQuery` and recursive CTE work-table scans. Previously, such plans 
failed with `not_impl_err!("Unsupported plan type: RecursiveQuery")`.
     * Users who rely on Substrait serialization/deserialization for queries 
with recursive CTEs should now see these queries roundtrip and execute 
successfully.
   * No public Rust API signatures are changed, and no configuration flags or 
SQL syntax are modified.
   
   ## LLM-generated code disclosure
   
   This PR includes LLM-generated code and comments. All LLM-generated content 
has been manually reviewed and tested.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to