timsaucer opened a new issue, #112: URL: https://github.com/apache/datafusion-java/issues/112
**Is your feature request related to a problem or challenge?**

Spark users want to read data from a DataFusion `TableProvider` as a native Spark `DataSourceV2`. Today there is no first-class path; options are either a bespoke per-operation JNI surface (more native surface to maintain) or copying data out of process.

**Describe the solution you'd like**

A Spark `DataSourceV2` connector that places the native boundary at a **standard ADBC driver**. Spark talks to the upstream arrow-adbc Java driver manager (`adbc-core` + `adbc-driver-jni`), which loads a native DataFusion ADBC cdylib and returns arrow-java `ArrowReader`s consumed zero-copy as `ArrowColumnVector`s on the cluster-provided Arrow. This reuses the upstream ADBC bindings rather than reproducing them.

Scope:

- `adbc-datafusion` format registered as a `DataSourceV2`; schema probed on the driver.
- Projection / filter / limit pushdown via Substrait, with a SQL fallback.
- Multi-partition reads (`executePartitioned` / `readPartition`) and a `target_partitions` option.
- Per-executor connection pool to amortize driver/database setup across task slots.
- An example DataFusion ADBC driver cdylib plus end-to-end (PySpark) coverage.

**Describe alternatives you've considered**

A plain-C scan ABI + hand-written JNI shim (discussed on #103 / #104). The ADBC approach reuses standard, separately-reviewed bindings and a stable driver contract instead.

**Additional context**

Implemented in #111.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
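
As a rough illustration of the "per-executor connection pool" item in the scope list, the idea of amortizing setup across task slots could be sketched as follows. This is not the connector's code: `KeyedPool`, `open_database`, and the driver path and URI used as the key are all hypothetical names, and the factory only stands in for loading a driver and opening a database.

```python
import threading
from typing import Any, Callable, Dict, Hashable, List


class KeyedPool:
    """Process-wide cache holding one expensive resource per key.

    Hypothetical sketch: all task threads in one executor process share
    the resource instead of each paying the setup cost.
    """

    def __init__(self, factory: Callable[[Hashable], Any]) -> None:
        self._factory = factory
        self._lock = threading.Lock()
        self._resources: Dict[Hashable, Any] = {}

    def get(self, key: Hashable) -> Any:
        # Hold the lock across creation so concurrent tasks never
        # run the expensive setup twice for the same key.
        with self._lock:
            if key not in self._resources:
                self._resources[key] = self._factory(key)
            return self._resources[key]


# Records every time the (simulated) expensive setup runs.
setup_calls: List[Hashable] = []


def open_database(key: Hashable) -> dict:
    """Stand-in for loading a driver and opening a database (assumption)."""
    setup_calls.append(key)
    return {"key": key}


pool = KeyedPool(open_database)

# Hypothetical key: (driver library path, database URI).
KEY = ("libexample_driver.so", "memory://")


def run_task(results: list, slot: int) -> None:
    # Each simulated task slot asks the pool rather than opening its own.
    results[slot] = pool.get(KEY)


results: list = [None] * 8
threads = [threading.Thread(target=run_task, args=(results, i)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With eight simulated task slots, the setup function runs once and every slot receives the same object. A real pool would also need eviction and shutdown handling, which this sketch leaves out.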
