[I] design: DataFrame joins (join, joinOn) and the Java Expr question [datafusion-java]

via GitHub Wed, 13 May 2026 15:09:06 -0700


andygrove opened a new issue, #44:
URL: https://github.com/apache/datafusion-java/issues/44


   ### Is your feature request related to a problem or challenge?
   
   Joins are the largest missing piece of the DataFrame API, and they
   force a design decision that the smaller methods have so far avoided:
   how do we represent DataFusion `Expr` values from Java?
   
   DataFusion's join surface:
   
   - `join(right, JoinType, left_cols, right_cols, filter: Option<Expr>)`
     — equi-join on named columns, optional residual predicate.
   - `join_on(right, JoinType, on: impl IntoIterator<Item = Expr>)` —
     arbitrary join predicates as a list of expressions.
   
   `join` (the column-name form) is buildable today via the SQL-string
   pattern (`parse_sql_expr` for the optional filter). `join_on` is not —
   it fundamentally takes `Expr` values.
   
   ### Describe the solution you'd like
   
   Split into two phases.
   
   **Phase 1 — `join` (column-name form).** Ship this first, no `Expr`
   model required:
   
   - `DataFrame.join(DataFrame right, JoinType type, String[] leftCols, 
String[] rightCols)`
   - Overload accepting a `String filter` parsed via `parse_sql_expr` for
     the residual predicate.
   - `JoinType` Java enum mirroring DataFusion's (`INNER`, `LEFT`,
     `RIGHT`, `FULL`, `LEFT_SEMI`, `RIGHT_SEMI`, `LEFT_ANTI`,
     `RIGHT_ANTI`, `LEFT_MARK`).
   
   **Phase 2 — `joinOn` (Expr form).** Requires picking one of:
   
   1. **SQL strings.** `df.joinOn(right, INNER, "l.a = r.b", "l.c > r.d")`.
      Each predicate parsed via `parse_sql_expr` after both sides are
      stitched into a synthetic schema. Cheapest; no Java-side model.
   2. **Typed `Expr` builder.** Java class hierarchy mirroring DataFusion's
      `Expr` (Column, BinaryExpr, Literal, ScalarUDF…). Discoverable but a
      large, ongoing maintenance surface that has to track DataFusion's
      `Expr` enum.
   3. **Defer.** Keep only `join` (Phase 1) until a concrete user needs
      non-equi joins.
   
   Recommendation: do (1) for symmetry with `filter(String)` and the
   proposed `withColumn(String)` / `sort(String)`. (2) is its own
   multi-PR effort that the project may or may not want to commit to.
   
   ### Describe alternatives you've considered
   
   SQL joins via `ctx.sql(...)`. Works for everything, but requires
   naming both sides — losing the DataFrame composition story.
   
   ### Additional context
   
   Filing this as a **discussion** issue, not an implementation issue.
   Phase 1 (column-name `join`) can be implemented immediately once the
   `JoinType` enum and lifecycle (does the right-hand DataFrame get
   consumed?) are settled. The Phase 2 decision is the real ask here.
   
   Related: the right-hand `DataFrame` consume-vs-clone question is
   identical to the set-operations issue and should be answered the same
   way in both.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] design: DataFrame joins (join, joinOn) and the Java Expr question [datafusion-java]

Reply via email to