[GitHub] [arrow-datafusion] alamb opened a new issue, #2703: [EPIC] Full JIT support for `DataFusion`

GitBox Mon, 06 Jun 2022 05:38:22 -0700


alamb opened a new issue, #2703:
URL: https://github.com/apache/arrow-datafusion/issues/2703


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   TLDR: The benefit of this are:
   1. Speed up fundamentally row oriented operations like hash table lookup or 
comparisons (e.g. 
[#2427](https://github.com/apache/arrow-datafusion/issues/2427))
   2. (secondarily) Avoid unnecessary intermediate allocation that we can speed 
up complex / nested expressions 
   
   DataFusion, like many Arrow systems, is a classic "vectorized computation 
engine" which works quite well for many common operations. The following paper, 
gives a good treatment on the various tradeoffs between vectorized and JIT's 
compilation of query plans: 
https://db.in.tum.de/~kersten/vectorization_vs_compilation.pdf?lang=de
   
   As mentioned in the paper, there are some fundamentally "row oriented" 
operations in a database that are not typically amenable to vectorization. The 
"classics" are: Hash table updates in Joins and Hash Aggregates, as well as 
comparing tuples in sort.
   
   In addition, the datafusion `PhysicalExpr` and `arrow-rs` library currently 
evaluate expressions by "materializing intermediate results" -- for example `(a 
+ b) + c` results in first evaluating `(a+b)` to a temporary location and then 
adding `c` to form the final result. Note however, there is a tradeoff here 
between the speed gained using the LLVM optimized vectorized kernels in 
`arrow-rs` and cranelift generated JIT expressions where JIT may not actually 
be faster. 
   
   Another example can be found in [these 
slides](https://github.com/apache/arrow-datafusion/files/8843957/Expression_evaluation.pdf)
 from [this 
presentation](https://docs.google.com/presentation/d/1owNlmpNpC2-eBd-jEYRCt0L8_sW4PWYx70gqnFC0twM/edit#slide=id.gba4e8a221c_0_125)
   
   @yjshen added initial support for JIT'ing in 
https://github.com/apache/arrow-datafusion/pull/1849 and it currently lives in 
https://github.com/apache/arrow-datafusion/tree/master/datafusion/jit. 
   
   This ticket aims to be a central location for tracking the status of JIT 
compiling expressions for anyone who wants to contribute to this effort
   
   
   **Describe the solution you'd like**
   
   We should be able to take in a collection of a `RecordBatch` / named 
`Array`s and compile an expression like `(a + b)/ 2` to a loop that results in 
a new `Array`.
   ```
   fn compile(schema: SchemaRef, expr: Expr) ->  CompiledFunction {
   
   }
   ```
   
   The loop itself also must be included in the to-be compiled expression, to 
remove call overhead and allow for possible use of SIMD instructions, either 
explicitly by instrumenting cranelift enough or through auto-vectorization.
   
   **Describe alternatives you've considered**
   n/a
   
   **Additional context**
   
   **Work Items**
   - [ ] https://github.com/apache/arrow-datafusion/pull/2587
   - [ ] https://github.com/apache/arrow-datafusion/issues/2122


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb opened a new issue, #2703: [EPIC] Full JIT support for `DataFusion`

Reply via email to