andygrove opened a new pull request, #19:
URL: https://github.com/apache/datafusion-java/pull/19

   ## Summary
   
   Adds four transformation/action methods (five signatures, since `show` is 
overloaded) to `org.apache.datafusion.DataFrame`, expanding the Java surface 
beyond `sql` → `collect`:
   
   ```java
   DataFrame select(String... columnNames);
   DataFrame filter(String sqlPredicate);
   long      count();
   void      show();
   void      show(int limit);
   ```
   
   Callers can now chain small queries in Java without round-tripping through 
SQL strings for every step:
   
   ```java
   try (SessionContext ctx = new SessionContext()) {
     ctx.registerParquet("lineitem", "tpch-data/sf1/lineitem.parquet");
     try (DataFrame df = ctx.sql("SELECT * FROM lineitem")) {
       long n = df.filter("l_orderkey < 100").count();
       df.select("l_orderkey", "l_quantity").show(20);
     }
   }
   ```
   
   ## Design notes
   
   - **Non-destructive on the Rust side.** DataFusion's 
`DataFrame::select_columns`/`filter`/`count`/`show` all take `self` by value, 
so each new JNI fn clones the borrowed `DataFrame` (cheap: `Arc<SessionState>` 
+ `LogicalPlan`) and operates on the clone. The caller's original Java 
`DataFrame` stays usable for further operations. This is intentionally 
different from `collect()`, which still consumes its receiver because it ships 
the actual execution stream out.
   - **`filter` parsing.** Uses `DataFrame::parse_sql_expr` so the predicate is 
parsed against the DataFrame's own schema.
   - **`show()` output.** Goes to native stdout via DataFusion's printer (same 
behavior as the Rust / Python APIs). This collides with Surefire's forked-JVM 
IPC stream and produces a `Corrupted channel` warning during tests. The warning 
is non-fatal and the build still reports BUILD SUCCESS, but it is worth 
addressing in a follow-up (a Surefire forkNode extension or a `formatString()` 
companion method).
   - **Sync API.** All four methods are synchronous to match the existing 
`sql`/`collect` shape. A future async refactor would touch all of `DataFrame` 
and `SessionContext` together.
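
For the `show()` note above, here is a rough sketch of what a string-returning companion could look like. It is an assumption for illustration only: `TableText` and `format` are hypothetical names, not part of datafusion-java, and the sketch formats already-materialized strings in Java, whereas a real follow-up might well format on the native side instead.

```java
import java.util.List;

/** Hypothetical helper for illustration; not part of datafusion-java. */
final class TableText {
  private TableText() {}

  /** Renders a header and rows as an ASCII table and returns it as a String. */
  static String format(List<String> header, List<List<String>> rows) {
    // Each column is as wide as its longest cell, header included.
    int[] widths = new int[header.size()];
    for (int i = 0; i < widths.length; i++) {
      widths[i] = header.get(i).length();
    }
    for (List<String> row : rows) {
      for (int i = 0; i < widths.length; i++) {
        widths[i] = Math.max(widths[i], row.get(i).length());
      }
    }

    // Border line such as "+----+----+" (one space of padding on each side).
    StringBuilder borderBuilder = new StringBuilder("+");
    for (int w : widths) {
      borderBuilder.append("-".repeat(w + 2)).append('+');
    }
    String border = borderBuilder.toString();

    StringBuilder out = new StringBuilder();
    out.append(border).append('\n');
    appendRow(out, header, widths);
    out.append(border).append('\n');
    for (List<String> row : rows) {
      appendRow(out, row, widths);
    }
    out.append(border).append('\n');
    return out.toString();
  }

  /** Appends one left-aligned, space-padded row terminated by a newline. */
  private static void appendRow(StringBuilder out, List<String> cells, int[] widths) {
    out.append('|');
    for (int i = 0; i < widths.length; i++) {
      String cell = cells.get(i);
      out.append(' ').append(cell).append(" ".repeat(widths[i] - cell.length())).append(" |");
    }
    out.append('\n');
  }
}
```

A caller could then pass the result to `System.out.print(...)`, so the text goes through the JVM's `System.out` rather than being written directly to native stdout. Whether that alone silences the Surefire warning would need to be confirmed in the follow-up.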
   
   ## Out of scope
   
   `sort`, `join`, `aggregate`, `limit`, `distinct`, `withColumn`, typed 
`Expr`/`Column` API, async/`CompletableFuture` overloads. All can land in 
follow-ups using the same pattern.
   
   ## Testing
   
   12 new tests in `DataFrameTransformationsTest` cover:
   
   - Each new method on small inline `VALUES` tables.
   - Non-destructive semantics: the original `DataFrame` is still usable after 
`select`/`filter`/`count`/`show`.
   - Chained operations: `filter().select().count()`.
   - `IllegalStateException` after `close()` and after `collect()`.
   - `RuntimeException` on invalid column / malformed predicate.
   - TPC-H lineitem smoke test: `filter(...).count()` matches `SELECT COUNT(*) ... 
WHERE ...` (guarded by `Assumptions.assumeTrue` on file presence, so it is 
skipped when TPC-H data is absent, as it is on CI).
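
The lifecycle rules these tests pin down (non-destructive `filter`/`count`, a consuming `collect()`, and `IllegalStateException` afterwards) can be modeled with a small toy class. This is a sketch under stated assumptions, not the PR's implementation: `ToyFrame` is a hypothetical stand-in that holds an in-memory list instead of a native handle, and its `filter` takes a Java `Predicate` rather than a SQL string.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Toy stand-in for a native-backed DataFrame; NOT org.apache.datafusion.DataFrame. */
final class ToyFrame implements AutoCloseable {
  private final List<Integer> rows; // stands in for the native plan handle
  private boolean released;         // true once closed or consumed by collect()

  ToyFrame(List<Integer> rows) {
    this.rows = List.copyOf(rows);
  }

  private void ensureUsable() {
    if (released) {
      throw new IllegalStateException("DataFrame is closed or already collected");
    }
  }

  /** Non-destructive: returns a new frame and leaves the receiver usable. */
  ToyFrame filter(Predicate<Integer> predicate) {
    ensureUsable();
    return new ToyFrame(rows.stream().filter(predicate).collect(Collectors.toList()));
  }

  /** Non-destructive action: the receiver stays usable afterwards. */
  long count() {
    ensureUsable();
    return rows.size();
  }

  /** Consuming: hands the data out and invalidates the receiver. */
  List<Integer> collect() {
    ensureUsable();
    released = true;
    return rows;
  }

  /** Idempotent; any later call on this frame throws IllegalStateException. */
  @Override
  public void close() {
    released = true;
  }
}
```

In this model, `df.filter(x -> x < 3).count()` leaves `df` usable, while any call on `df` after `df.collect()` or `df.close()` throws `IllegalStateException`, mirroring the behavior the tests above describe.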
   
   `make test` runs both the existing 13 tests and the 12 new ones; all 25 
pass. `cargo clippy --all-targets -- -D warnings` is clean. `./mvnw 
spotless:check` and `./mvnw apache-rat:check` are clean.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

