Dandandan commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-765392170
@nevi-me
Indeed, I don't think it is very expensive on large arrays relative to the
size of the array and the operations on it, but it turns out to be expensive on
very small arrays. In this PR I am using `slice` to make the hash aggregate code
in DataFusion more efficient for small output groups with few rows (only 1 row,
i.e. `Array.slice(i, 1)`, in extreme cases). In that case the slicing function
becomes a bottleneck, because of the cloning here plus the `make_array` call,
and because it is called many times: for example, (I believe) 20M times in
total for a table of 10M rows (one of the db-benchmark examples).
It is still faster than calling `take` for each group individually, though, as
the benchmark results show.
I am wondering: instead of building a new array on every `.slice()` call,
could we create a data structure for slices that implements the Array
interface and is supported in kernels, so that creating a slice would be
cheap?
Something like this:
```
struct ArraySlice {
    offset: usize,
    length: usize,
    array: ArrayRef,
}
```
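To make the idea a bit more concrete, here is a minimal, self-contained sketch of how such a view could behave. It is only an illustration under stated assumptions: `ArrayRef` is a stand-in alias for `Arc<Vec<i64>>` rather than Arrow's real `Arc<dyn Array>`, and the `new`, `slice`, `values` and `sum` functions are hypothetical names, not existing Arrow APIs. The point is that creating a slice only bumps a reference count and copies two integers.

```rust
use std::sync::Arc;

// Assumption: stand-in for Arrow's `ArrayRef`, used only to keep the sketch
// self-contained. The real type is `Arc<dyn Array>`.
type ArrayRef = Arc<Vec<i64>>;

/// Hypothetical lightweight view over a shared array.
#[derive(Clone)]
struct ArraySlice {
    offset: usize,
    length: usize,
    array: ArrayRef,
}

impl ArraySlice {
    /// View covering the whole array.
    fn new(array: ArrayRef) -> Self {
        let length = array.len();
        ArraySlice { offset: 0, length, array }
    }

    /// Sub-view relative to this view. No array data is copied: only the
    /// `Arc` is cloned. Panics if the requested range is out of bounds.
    fn slice(&self, offset: usize, length: usize) -> Self {
        assert!(offset + length <= self.length, "slice out of bounds");
        ArraySlice {
            offset: self.offset + offset,
            length,
            array: Arc::clone(&self.array),
        }
    }

    /// Values visible through this view.
    fn values(&self) -> &[i64] {
        &self.array[self.offset..self.offset + self.length]
    }
}

/// Toy "kernel" (hypothetical) that operates directly on the view.
fn sum(slice: &ArraySlice) -> i64 {
    slice.values().iter().sum()
}

fn main() {
    let whole = ArraySlice::new(Arc::new(vec![10, 20, 30, 40, 50]));
    // One-row groups, as in the extreme hash aggregate case.
    for i in 0..whole.length {
        let row = whole.slice(i, 1);
        println!("row {} -> sum {}", i, sum(&row));
    }
}
```

In a real implementation the kernels would have to accept the view (or the Array interface it implements) and honour `offset` and `length`, which is where most of the work would be.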