bubulalabu opened a new pull request, #18224:
URL: https://github.com/apache/datafusion/pull/18224
# LATERAL Table Function Support
## Which issue does this PR close?
Closes #18121.
## Rationale for this change
DataFusion's `TableFunction` API currently cannot access data from columns
in outer queries, preventing LATERAL joins with table functions. Users
attempting queries like:
```sql
SELECT t1.id, t2.x, t2.y
FROM my_table AS t1,
LATERAL my_transform(t1.a, t1.b, t1.c) AS t2(x, y)
```
encounter errors: `This feature is not implemented: Physical plan does not
support logical expression OuterReferenceColumn`.
## What changes are included in this PR?
This PR attempts to add LATERAL table function support, enabling table
functions to reference columns from outer queries. I'm sure there are
improvements that can be made; feedback is very welcome
### Core Components
1. **Logical Plan Node** (`LateralTableFunction`)
- Represents table function calls with outer column references
- Stores both combined output schema and function-only schema for clarity
2. **Physical Execution Plan** (`LateralTableFunctionExec`)
- Evaluates table function once per input row
- Handles outer column references by extracting values from each row
- Combines input columns with function output
- Implements streaming execution via `RecordBatchStream`
3. **SQL Planning** (`relation/join.rs`)
- Detects LATERAL table function calls during join planning
- Performs schema inference using placeholder values when outer
references exist
- Creates `LateralTableFunction` nodes instead of regular scans
4. **Schema Inference**
- When table function arguments contain outer references, replaces them
with placeholder values to infer output schema
- Uses incrementing values (1, 2, 3...) to ensure valid ranges for
functions like `generate_series(start, end)`
- Stores inferred schema in logical plan for use during physical planning
### Example Usage
```sql
-- Generate series based on table values
SELECT t.id, s.value
FROM my_table t,
LATERAL generate_series(t.start_val, t.end_val) s;
-- Transform row data through UDTF
SELECT t.id, result.x, result.y
FROM data t,
LATERAL my_transform(t.a, t.b, t.c) AS result(x, y);
```
## Implementation Approach
I'd appreciate feedback on these design choices - there may be better
approaches I haven't considered:
1. **Schema Inference via Placeholders**
The implementation replaces outer references with placeholder literal
values (1, 2, 3...) to infer the table function's output schema. This works
because table functions need concrete values to execute, but outer references
aren't resolved until runtime.
**Known limitation**: This approach has an inherent weakness - schema
inference can fail if the placeholder values are semantically invalid for the
function. For example, `generate_series(start, end)` will error during planning
if the placeholder for `start` is greater than `end`, even though the actual
runtime values might be valid. For the sake of getting it to work I used
incrementing placeholder values, but that's by no means a robust solution.
I considered the TableFunction provide explicit schema declarations. If
there's a better approach for schema inference, I'd love to hear suggestions!
2. **Row-by-Row Sequential Execution**
The current implementation executes the table function once per input
row, sequentially (not parallelized or batched). I chose this conservative
approach because we don't have metadata about whether table functions have side
effects, are thread-safe, maintain internal state, or can be safely executed in
parallel.
**Trade-off**: This has performance implications. Batched or parallel
execution would likely be faster, but would require additional API changes to
let table functions declare their safety characteristics.
3 **Dual Schema Storage**
The logical plan stores both the combined schema (input + function
output) and the function-only schema. This is slightly redundant (we could
derive one from the other), but I found it made the code clearer and easier to
understand.
### Error Handling
Updated error message for non-LATERAL table functions with column references:
```
Before: "Table functions with outer references are not yet supported. This
requires LATERAL table function support."
After: "Table function arguments cannot reference columns without LATERAL
keyword. Use: FROM other_table, LATERAL table_func(other_table.column)"
```
## Are these changes tested?
Yes, I added a dedicated sql logic test file.
## Are there any user-facing changes?
### New Functionality
Users can now use LATERAL with table functions to reference outer columns:
```sql
-- This now works! (previously failed)
SELECT t.id, s.value
FROM my_table t,
LATERAL generate_series(t.min, t.max) s;
```
### API Changes
1 **`ScalarValue::try_new_placeholder()`** (new)
- Public utility method for creating placeholder values
- Useful for schema inference scenarios
2 **No Breaking Changes**
- All existing APIs remain unchanged
- Backward compatible with existing table functions
- Existing SQL queries continue to work
### Performance Characteristics
- LATERAL table functions execute once per input row (not batched)
## Additional Context
### Future Enhancements
Potential follow-up work (out of scope for this PR):
1. **Batch Execution**: Execute table function on multiple rows at once
(requires UDTF API changes)
2. **Parallel Execution**: Partition input and execute table functions in
parallel
3. **Substrait Support**: Add protobuf schema and serialization
---
**Disclosure**: This PR was developed with the assistance of an LLM and has
been thoroughly reviewed and tested by me. All design decisions, code
implementation, and testing were validated through manual review and execution.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]