andygrove opened a new issue, #44:
URL: https://github.com/apache/datafusion-java/issues/44
### Is your feature request related to a problem or challenge?
Joins are the largest missing piece of the DataFrame API, and they
force a design decision that the smaller methods have so far avoided:
how do we represent DataFusion `Expr` values from Java?
DataFusion's join surface:
- `join(right, JoinType, left_cols, right_cols, filter: Option<Expr>)`
— equi-join on named columns, optional residual predicate.
- `join_on(right, JoinType, on: impl IntoIterator<Item = Expr>)` —
arbitrary join predicates as a list of expressions.
`join` (the column-name form) is buildable today via the SQL-string
pattern (`parse_sql_expr` for the optional filter). `join_on` is not —
it fundamentally takes `Expr` values.
### Describe the solution you'd like
Split into two phases.
**Phase 1 — `join` (column-name form).** Ship this first, no `Expr`
model required:
- `DataFrame.join(DataFrame right, JoinType type, String[] leftCols,
String[] rightCols)`
- Overload accepting a `String filter` parsed via `parse_sql_expr` for
the residual predicate.
- `JoinType` Java enum mirroring DataFusion's (`INNER`, `LEFT`,
`RIGHT`, `FULL`, `LEFT_SEMI`, `RIGHT_SEMI`, `LEFT_ANTI`,
`RIGHT_ANTI`, `LEFT_MARK`).
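
As a rough illustration of the Phase 1 pieces, the sketch below shows one possible shape for the enum, plus an argument check a column-name `join` could run before crossing into native code. The `rustVariant()` helper and the `JoinArgs` class are hypothetical names invented for this example, not existing datafusion-java APIs. The check assumes only that an equi-join pairs columns positionally, so the two arrays must have the same length.

```java
import java.util.Locale;
import java.util.Objects;

/** Hypothetical Java mirror of DataFusion's JoinType, using the constants listed above. */
enum JoinType {
    INNER, LEFT, RIGHT, FULL, LEFT_SEMI, RIGHT_SEMI, LEFT_ANTI, RIGHT_ANTI, LEFT_MARK;

    /** Converts the Java constant name to Rust-style spelling, e.g. LEFT_SEMI becomes "LeftSemi". */
    String rustVariant() {
        StringBuilder sb = new StringBuilder();
        for (String part : name().split("_")) {
            sb.append(part.charAt(0))
              .append(part.substring(1).toLowerCase(Locale.ROOT));
        }
        return sb.toString();
    }
}

/** Hypothetical pre-flight validation for join(right, type, leftCols, rightCols). */
final class JoinArgs {
    private JoinArgs() {}

    static void validate(JoinType type, String[] leftCols, String[] rightCols) {
        Objects.requireNonNull(type, "type");
        Objects.requireNonNull(leftCols, "leftCols");
        Objects.requireNonNull(rightCols, "rightCols");
        // Columns are paired by position, so mismatched lengths cannot form join keys.
        if (leftCols.length != rightCols.length) {
            throw new IllegalArgumentException(
                "leftCols and rightCols must have the same length: "
                    + leftCols.length + " vs " + rightCols.length);
        }
    }
}
```

Failing on the Java side keeps the error message in Java terms rather than surfacing a native error string.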
**Phase 2 — `joinOn` (Expr form).** Requires picking one of:
1. **SQL strings.** `df.joinOn(right, INNER, "l.a = r.b", "l.c > r.d")`.
Each predicate parsed via `parse_sql_expr` after both sides are
stitched into a synthetic schema. Cheapest; no Java-side model.
2. **Typed `Expr` builder.** Java class hierarchy mirroring DataFusion's
`Expr` (Column, BinaryExpr, Literal, ScalarUDF…). Discoverable but a
large, ongoing maintenance surface that has to track DataFusion's
`Expr` enum.
3. **Defer.** Keep only `join` (Phase 1) until a concrete user needs
non-equi joins.
Recommendation: do (1) for symmetry with `filter(String)` and the
proposed `withColumn(String)` / `sort(String)`. (2) is its own
multi-PR effort that the project may or may not want to commit to.
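
To make the trade-off between (1) and (2) more concrete, here is a deliberately tiny sketch of what a typed builder could look like if it simply rendered to the SQL strings that option (1) would accept. Everything in it (`Expr`, `col`, `lit`, `eq`, `gt`, `toSql`) is hypothetical and exists only for illustration. It covers only columns, integer literals and two operators, and it does no quoting or escaping. Tracking DataFusion's full `Expr` enum would take far more code.

```java
/** Hypothetical miniature expression model; not an existing datafusion-java API. */
interface Expr {
    /** Renders this expression as a SQL fragment. */
    String toSql();

    /** Column reference, passed through verbatim (no quoting or escaping in this sketch). */
    static Expr col(String qualifiedName) {
        return () -> qualifiedName;
    }

    /** Integer literal. */
    static Expr lit(long value) {
        return () -> Long.toString(value);
    }

    /** Binary operator node, parenthesized so nesting stays unambiguous. */
    static Expr binary(Expr left, String op, Expr right) {
        return () -> "(" + left.toSql() + " " + op + " " + right.toSql() + ")";
    }

    default Expr eq(Expr other) {
        return binary(this, "=", other);
    }

    default Expr gt(Expr other) {
        return binary(this, ">", other);
    }
}
```

With this sketch, `Expr.col("l.a").eq(Expr.col("r.b")).toSql()` yields `(l.a = r.b)`, which is the kind of string option (1) would hand to `parse_sql_expr`. A builder layered this way would not rule out starting with SQL strings.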
### Describe alternatives you've considered
SQL joins via `ctx.sql(...)`. This works for every join shape, but both
sides must first be given table names so the SQL text can refer to
them, which loses the DataFrame composition story.
### Additional context
Filing this as a **discussion** issue, not an implementation issue.
Phase 1 (column-name `join`) can be implemented immediately once the
`JoinType` enum and lifecycle (does the right-hand DataFrame get
consumed?) are settled. The Phase 2 decision is the real ask here.
Related: the right-hand `DataFrame` consume-vs-clone question is
identical to the set-operations issue and should be answered the same
way in both.
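
For discussion purposes, the two lifecycle answers can be sketched with a stand-in for a natively backed object. `NativeHandle`, `take()` and `borrow()` are hypothetical names, and the `long` pointer is a placeholder. A real binding would release native memory in `close()`.

```java
/** Hypothetical stand-in for a DataFrame that wraps a native pointer. */
final class NativeHandle implements AutoCloseable {
    private long ptr; // 0 means closed or already consumed

    NativeHandle(long ptr) {
        this.ptr = ptr;
    }

    /** Consume semantics: ownership moves to the callee and this handle becomes unusable. */
    long take() {
        long p = ensureOpen();
        ptr = 0;
        return p;
    }

    /** Clone semantics: the callee only reads the pointer and this handle stays usable. */
    long borrow() {
        return ensureOpen();
    }

    boolean isOpen() {
        return ptr != 0;
    }

    private long ensureOpen() {
        if (ptr == 0) {
            throw new IllegalStateException("DataFrame already consumed or closed");
        }
        return ptr;
    }

    @Override
    public void close() {
        ptr = 0; // a real binding would free the native object here
    }
}
```

If `join` used `take()` on the right-hand side, reusing that DataFrame afterwards would fail fast. With `borrow()`, the caller keeps responsibility for closing it. Whichever is chosen, set operations would presumably follow the same rule.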