andygrove opened a new pull request, #19:
URL: https://github.com/apache/datafusion-java/pull/19
## Summary
Adds four transformation/action methods (`select`, `filter`, `count`, and
`show`, the last with an optional row limit) to
`org.apache.datafusion.DataFrame`, expanding the Java surface beyond `sql` →
`collect`:
```java
DataFrame select(String... columnNames);
DataFrame filter(String sqlPredicate);
long count();
void show();
void show(int limit);
```
Callers can now chain small queries in Java without round-tripping through
SQL strings for every step:
```java
try (SessionContext ctx = new SessionContext()) {
    ctx.registerParquet("lineitem", "tpch-data/sf1/lineitem.parquet");
    try (DataFrame df = ctx.sql("SELECT * FROM lineitem")) {
        long n = df.filter("l_orderkey < 100").count();
        df.select("l_orderkey", "l_quantity").show(20);
    }
}
```
## Design notes
- **Non-destructive on the Rust side.** DataFusion's
`DataFrame::select_columns`/`filter`/`count`/`show` all take `self` by value,
so each new JNI fn clones the borrowed `DataFrame` (cheap: `Arc<SessionState>`
+ `LogicalPlan`) and operates on the clone. The caller's original Java
`DataFrame` stays usable for further operations. This is intentionally
different from `collect()`, which still consumes its receiver because it ships
the actual execution stream out.
- **`filter` parsing.** Uses `DataFrame::parse_sql_expr` so the predicate is
parsed against the DataFrame's own schema.
- **`show()` output.** Goes to native stdout via DataFusion's printer (same
behavior as the Rust / Python APIs). This collides with Surefire's forked-JVM
IPC stream and produces a `Corrupted channel` warning during tests. The warning
is non-fatal (the build still reports BUILD SUCCESS), but it is worth addressing
in a follow-up, either with a Surefire forkNode extension or with a
`formatString()` companion method.
- **Sync API.** All four methods are synchronous to match the existing
`sql`/`collect` shape. A future async refactor would touch all of `DataFrame`
and `SessionContext` together.
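The ownership rules in the first note can be illustrated with a toy model. This is only a sketch: `SketchFrame` is a hypothetical stand-in that keeps an in-memory list instead of a native plan handle, and its `filter` takes an `IntPredicate` rather than a SQL string. It is not the real `org.apache.datafusion.DataFrame`; it only mirrors the described behavior, where transformations and `count` leave the receiver usable while `collect()` and `close()` invalidate it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;
import java.util.stream.Collectors;

/** Toy model of the ownership rules only; hypothetical, not the real DataFrame. */
final class SketchFrame implements AutoCloseable {
    // Stands in for the native plan handle; null once collected or closed.
    private List<Integer> plan;

    SketchFrame(List<Integer> values) {
        this.plan = new ArrayList<>(values);
    }

    // Every operation checks liveness first, like the IllegalStateException tests.
    private List<Integer> live() {
        if (plan == null) {
            throw new IllegalStateException("DataFrame is closed or already collected");
        }
        return plan;
    }

    /** Transformation: builds a new frame from a copy, so the receiver stays usable. */
    SketchFrame filter(IntPredicate predicate) {
        return new SketchFrame(
                live().stream().filter(predicate::test).collect(Collectors.toList()));
    }

    /** Action: reads the plan without consuming the receiver. */
    long count() {
        return live().size();
    }

    /** Terminal action: hands the data out and invalidates the receiver. */
    List<Integer> collect() {
        List<Integer> out = live();
        plan = null;
        return out;
    }

    @Override
    public void close() {
        plan = null;
    }
}
```

In this model, `df.filter(...).count()` can be followed by further calls on `df`, whereas any call after `df.collect()` throws `IllegalStateException`, which matches the semantics the tests below assert for the real class.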
## Out of scope
`sort`, `join`, `aggregate`, `limit`, `distinct`, `withColumn`, typed
`Expr`/`Column` API, async/`CompletableFuture` overloads. All can land in
follow-ups using the same pattern.
## Testing
12 new tests in `DataFrameTransformationsTest` cover:
- Each new method on small inline `VALUES` tables.
- Non-destructive semantics — original `DataFrame` still usable after
`select`/`filter`/`count`/`show`.
- Chained operations: `filter().select().count()`.
- `IllegalStateException` after `close()` and after `collect()`.
- `RuntimeException` on invalid column / malformed predicate.
- TPC-H lineitem smoke: `filter(...).count()` matches `SELECT COUNT(*) ...
WHERE ...` (guarded by `Assumptions.assumeTrue` on file presence — skipped when
TPC-H data is absent, as it is on CI).
`make test` runs both the existing 13 tests and the 12 new ones — all 25
pass. `cargo clippy --all-targets -- -D warnings` is clean. `./mvnw
spotless:check` and `./mvnw apache-rat:check` are clean.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]