bubulalabu opened a new pull request, #18224:
URL: https://github.com/apache/datafusion/pull/18224

   # LATERAL Table Function Support
   
   ## Which issue does this PR close?
   
   Closes #18121.
   
   ## Rationale for this change
   
   DataFusion's `TableFunction` API currently cannot access data from columns 
in outer queries, preventing LATERAL joins with table functions. Users 
attempting queries like:
   
   ```sql
   SELECT t1.id, t2.x, t2.y
   FROM my_table AS t1,
        LATERAL my_transform(t1.a, t1.b, t1.c) AS t2(x, y)
   ```
   
   encounter errors: `This feature is not implemented: Physical plan does not 
support logical expression OuterReferenceColumn`.
   
   ## What changes are included in this PR?
   
   This PR attempts to add LATERAL table function support, enabling table 
functions to reference columns from outer queries. I'm sure there are 
improvements that can be made; feedback is very welcome
   
   ### Core Components
   
   1. **Logical Plan Node** (`LateralTableFunction`)
      - Represents table function calls with outer column references
      - Stores both combined output schema and function-only schema for clarity
   
   2. **Physical Execution Plan** (`LateralTableFunctionExec`)
      - Evaluates table function once per input row
      - Handles outer column references by extracting values from each row
      - Combines input columns with function output
      - Implements streaming execution via `RecordBatchStream`
   
   3. **SQL Planning** (`relation/join.rs`)
      - Detects LATERAL table function calls during join planning
      - Performs schema inference using placeholder values when outer 
references exist
      - Creates `LateralTableFunction` nodes instead of regular scans
   
   4. **Schema Inference**
      - When table function arguments contain outer references, replaces them 
with placeholder values to infer output schema
      - Uses incrementing values (1, 2, 3...) to ensure valid ranges for 
functions like `generate_series(start, end)`
      - Stores inferred schema in logical plan for use during physical planning
   
   ### Example Usage
   
   ```sql
   -- Generate series based on table values
   SELECT t.id, s.value
   FROM my_table t,
        LATERAL generate_series(t.start_val, t.end_val) s;
   
   -- Transform row data through UDTF
   SELECT t.id, result.x, result.y
   FROM data t,
        LATERAL my_transform(t.a, t.b, t.c) AS result(x, y);
   ```
   
   ## Implementation Approach
   
   I'd appreciate feedback on these design choices - there may be better 
approaches I haven't considered:
   
   1. **Schema Inference via Placeholders**
   
      The implementation replaces outer references with placeholder literal 
values (1, 2, 3...) to infer the table function's output schema. This works 
because table functions need concrete values to execute, but outer references 
aren't resolved until runtime.
   
      **Known limitation**: This approach has an inherent weakness - schema 
inference can fail if the placeholder values are semantically invalid for the 
function. For example, `generate_series(start, end)` will error during planning 
if the placeholder for `start` is greater than `end`, even though the actual 
runtime values might be valid. For the sake of getting it to work I used 
incrementing placeholder values, but that's by no means a robust solution.
   
      I considered the TableFunction provide explicit schema declarations. If 
there's a better approach for schema inference, I'd love to hear suggestions!
   
   2. **Row-by-Row Sequential Execution**
   
      The current implementation executes the table function once per input 
row, sequentially (not parallelized or batched). I chose this conservative 
approach because we don't have metadata about whether table functions have side 
effects, are thread-safe, maintain internal state, or can be safely executed in 
parallel.
   
      **Trade-off**: This has performance implications. Batched or parallel 
execution would likely be faster, but would require additional API changes to 
let table functions declare their safety characteristics.
   
   3 **Dual Schema Storage**
   
      The logical plan stores both the combined schema (input + function 
output) and the function-only schema. This is slightly redundant (we could 
derive one from the other), but I found it made the code clearer and easier to 
understand. 
   
   ### Error Handling
   
   Updated error message for non-LATERAL table functions with column references:
   ```
   Before: "Table functions with outer references are not yet supported. This 
requires LATERAL table function support."
   After:  "Table function arguments cannot reference columns without LATERAL 
keyword. Use: FROM other_table, LATERAL table_func(other_table.column)"
   ```
   
   ## Are these changes tested?
   
   Yes, I added a dedicated sql logic test file.
   
   ## Are there any user-facing changes?
   
   ### New Functionality
   
   Users can now use LATERAL with table functions to reference outer columns:
   
   ```sql
   -- This now works! (previously failed)
   SELECT t.id, s.value
   FROM my_table t,
        LATERAL generate_series(t.min, t.max) s;
   ```
   
   ### API Changes
   
   1 **`ScalarValue::try_new_placeholder()`** (new)
      - Public utility method for creating placeholder values
      - Useful for schema inference scenarios
   
   2 **No Breaking Changes**
      - All existing APIs remain unchanged
      - Backward compatible with existing table functions
      - Existing SQL queries continue to work
   
   ### Performance Characteristics
   
   - LATERAL table functions execute once per input row (not batched)
   
   ## Additional Context
   
   ### Future Enhancements
   
   Potential follow-up work (out of scope for this PR):
   
   1. **Batch Execution**: Execute table function on multiple rows at once 
(requires UDTF API changes)
   2. **Parallel Execution**: Partition input and execute table functions in 
parallel
   3. **Substrait Support**: Add protobuf schema and serialization
   
   ---
   
   **Disclosure**: This PR was developed with the assistance of an LLM and has 
been thoroughly reviewed and tested by me. All design decisions, code 
implementation, and testing were validated through manual review and execution.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to