[GitHub] [arrow] Dandandan commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

GitBox Fri, 22 Jan 2021 05:22:33 -0800


Dandandan commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-765392170



   @nevi-me 
   
   Indeed, I don't think it is very expensive on large arrays relative to the 
size of / operations on the array, but it turns out to be expensive on very 
small arrays. For this PR I am using `slice` to make the hash aggregate code in 
DataFusion more efficient for small output groups with few rows (only 1 row, 
i.e. `Array.slice(i, 1)`, in extreme cases). In those cases the slicing 
function becomes a bottleneck, because of the cloning here plus the 
`make_array` call, and because it is called many times: for example, (I 
believe) 20M times in total for a table of 10M rows (one of the db-benchmark 
examples). 
   It is still faster than calling `take` for each group individually though, 
as the benchmark results show.
   
   I am wondering whether, instead of building a new array on every 
`.slice()` call, we could introduce a data structure for slices that 
implements the Array interface and is supported in the kernels, so that 
creating a slice would be cheap?
   
   Something like this:
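   ```
   use arrow::array::ArrayRef;

   // Lightweight view: shares the underlying array instead of rebuilding it.
   struct ArraySlice {
       offset: usize,   // index of the first element of the view in `array`
       length: usize,   // number of elements in the view
       array: ArrayRef, // the shared, un-sliced array
   }
   ```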
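
To make the idea a bit more concrete, here is a rough, self-contained sketch.
It does not use the arrow crate: `ToyArrayRef` (an `Arc<Vec<i64>>`) is a
hypothetical stand-in for `ArrayRef`, and `sum` is a hypothetical stand-in for
a kernel that understands slices. The methods are my assumptions about how
such a view could behave, not an existing API.

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's `ArrayRef`, to keep the sketch std-only.
type ToyArrayRef = Arc<Vec<i64>>;

/// A cheap view over a shared array: creating one only bumps a reference count.
#[derive(Clone)]
struct ArraySlice {
    offset: usize,
    length: usize,
    array: ToyArrayRef,
}

impl ArraySlice {
    /// Creates a view over `array[offset..offset + length]`; panics if out of bounds.
    fn new(array: ToyArrayRef, offset: usize, length: usize) -> Self {
        let end = offset.checked_add(length).expect("offset + length overflows");
        assert!(end <= array.len(), "slice out of bounds");
        ArraySlice { offset, length, array }
    }

    /// Number of elements visible through the view.
    fn len(&self) -> usize {
        self.length
    }

    /// Borrows the viewed elements without copying them.
    fn values(&self) -> &[i64] {
        &self.array[self.offset..self.offset + self.length]
    }

    /// Slicing a slice only adjusts the offsets; the data is still shared.
    fn slice(&self, offset: usize, length: usize) -> Self {
        let end = offset.checked_add(length).expect("offset + length overflows");
        assert!(end <= self.length, "slice out of bounds");
        ArraySlice {
            offset: self.offset + offset,
            length,
            array: Arc::clone(&self.array),
        }
    }
}

/// Hypothetical "kernel" that works directly on the view.
fn sum(slice: &ArraySlice) -> i64 {
    slice.values().iter().sum()
}

fn main() {
    let data: ToyArrayRef = Arc::new(vec![1, 2, 3, 4, 5]);
    let group = ArraySlice::new(Arc::clone(&data), 1, 3);
    let single_row = group.slice(2, 1);
    println!("group: len = {}, sum = {}", group.len(), sum(&group));
    println!("single row: {:?}", single_row.values());
}
```

Creating a view here costs one `Arc` clone and two integers, regardless of the
array length, which is the property that would matter when slicing once per
group. The real cost would be in making every kernel accept such a view.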

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
