GitHub user spencerwilson closed a discussion: How DataFusion could support 
other compute engines (libcudf, velox)

👋 I'm wondering whether anyone has contemplated how compute functions other than 
[arrow::compute](https://docs.rs/arrow/49.0.0/arrow/compute/index.html) could 
be used with DataFusion. I've not studied their APIs in depth yet, but two 
libraries that could arguably serve as compute kernels are:
- [libcudf](https://docs.rapids.ai/api/libcudf/stable/index.html), a CUDA 
library to compute analytics on Nvidia GPUs
- [Velox](https://velox-lib.io/), a "database acceleration library which 
provides reusable, extensible, and high-performance data processing components"

I can think of two broad approaches to using these libraries from DataFusion.

### Granular: New set of ExecutionPlans for each engine

Create a [PhysicalPlanner](https://docs.rs/datafusion/latest/datafusion/physical_planner/trait.PhysicalPlanner.html) impl that creates a tree of ExecutionPlans that use these other engines. For example, there'd be 3 different `ExecutionPlan` impls of the "hash-join" operation:
- [HashJoinExec](https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.HashJoinExec.html), which uses the functions in [arrow::compute](https://docs.rs/arrow/49.0.0/arrow/compute/index.html)
- another that uses some functions in libcudf
- another that uses the corresponding operator impl in Velox

If you use "the libcudf PhysicalPlanner" (for example), you'll get a tree 
of the libcudf-based ExecutionPlans.
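
To make the granular idea concrete, here is a minimal, dependency-free sketch. Everything in it is a hypothetical stand-in: `LogicalNode`, `PhysicalNode`, `EngineHashJoin`, and `EnginePlanner` are not DataFusion types, and no libcudf or Velox calls are made. It only illustrates the shape of "one planner per engine, each emitting its own operator impls".

```rust
// Hypothetical stand-ins only; none of these are DataFusion, libcudf, or Velox types.

/// A logical operation the planner has to lower to a physical operator.
enum LogicalNode {
    Scan { table: String },
    HashJoin { left: Box<LogicalNode>, right: Box<LogicalNode> },
}

/// Stand-in for `ExecutionPlan`: one node in the physical tree.
trait PhysicalNode {
    fn name(&self) -> String;
    fn children(&self) -> Vec<&dyn PhysicalNode>;
}

struct ScanNode {
    table: String,
}

impl PhysicalNode for ScanNode {
    fn name(&self) -> String {
        format!("Scan({})", self.table)
    }
    fn children(&self) -> Vec<&dyn PhysicalNode> {
        Vec::new()
    }
}

/// A join node tagged with the engine whose kernels it would call.
struct EngineHashJoin {
    engine: &'static str,
    left: Box<dyn PhysicalNode>,
    right: Box<dyn PhysicalNode>,
}

impl PhysicalNode for EngineHashJoin {
    fn name(&self) -> String {
        format!("{}HashJoin", self.engine)
    }
    fn children(&self) -> Vec<&dyn PhysicalNode> {
        vec![&*self.left, &*self.right]
    }
}

/// Stand-in for `PhysicalPlanner`: one planner value per engine.
struct EnginePlanner {
    engine: &'static str,
}

impl EnginePlanner {
    /// Lower the logical tree, choosing this engine's operator for every join.
    fn plan(&self, node: &LogicalNode) -> Box<dyn PhysicalNode> {
        match node {
            LogicalNode::Scan { table } => Box::new(ScanNode { table: table.clone() }),
            LogicalNode::HashJoin { left, right } => Box::new(EngineHashJoin {
                engine: self.engine,
                left: self.plan(left),
                right: self.plan(right),
            }),
        }
    }
}

/// Collect node names in pre-order so the chosen operators are visible.
fn describe(node: &dyn PhysicalNode) -> Vec<String> {
    let mut out = vec![node.name()];
    for child in node.children() {
        out.extend(describe(child));
    }
    out
}
```

Planning the same logical join with `EnginePlanner { engine: "LibCudf" }` versus `EnginePlanner { engine: "Velox" }` yields trees that differ only in which join operator they contain, which is the point of this approach: the choice of engine is made per operator at planning time.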

### Coarse: Create a single-node `ExecutionPlan`

In this approach, have a single impl of `ExecutionPlan` called `VeloxExec` or 
`LibCudfExec` that completely encapsulates the other query engine.
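
A rough sketch of the coarse shape, again with hypothetical names only (`ForeignEngine`, `ForeignExec`, and `EchoEngine` are illustrative and not part of DataFusion, Velox, or libcudf). It assumes the other engine can accept some serialized form of the whole query and hand back batches; whether the real libraries allow that is exactly the open question.

```rust
// Hypothetical stand-ins only; not DataFusion, Velox, or libcudf APIs.

/// A batch of values; a real integration would presumably exchange Arrow record batches.
type Batch = Vec<i64>;

/// Whatever interface the foreign engine exposes for running a whole plan (assumed).
trait ForeignEngine {
    fn run(&self, serialized_plan: &str) -> Vec<Batch>;
}

/// Single leaf node that hands the entire query to one foreign engine.
struct ForeignExec {
    engine: Box<dyn ForeignEngine>,
    serialized_plan: String,
}

impl ForeignExec {
    /// Always zero: everything below this node lives inside the other engine.
    fn child_count(&self) -> usize {
        0
    }

    /// Delegate execution of the whole plan to the wrapped engine.
    fn execute(&self) -> Vec<Batch> {
        self.engine.run(&self.serialized_plan)
    }
}

/// Toy engine used only to exercise the wrapper.
struct EchoEngine;

impl ForeignEngine for EchoEngine {
    fn run(&self, serialized_plan: &str) -> Vec<Batch> {
        // Pretend the result is one batch holding the plan's length in bytes.
        vec![vec![serialized_plan.len() as i64]]
    }
}
```

The trade-off relative to the granular approach is that the host planner sees one opaque node, so it cannot mix operators from different engines within a single query.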

---

If there are obvious reasons why this makes no sense, would love to hear those, 
too! I haven't studied the libcudf and velox APIs in detail so I could believe 
that maybe their APIs just aren't conducive to integration with DataFusion.

The primary potential value here is flexibility, especially if the system has 
resources that could compute more efficiently than the CPU. For example, if 
these are to be believed—
- https://voltrondata.com/resources/speeds-and-feeds-hardware-and-software-matter
- https://voltrondata.com/resources/gpus-analytics-experiment-with-tuning-chunking-compression-decompression

—then there may be workloads for which a GPU is a great choice for computation. 
And it would be in keeping with DataFusion's value of modularity.

GitHub link: https://github.com/apache/datafusion/discussions/8498
