[GitHub] [arrow] jorgecarleitao edited a comment on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

GitBox Fri, 22 Jan 2021 20:37:18 -0800


jorgecarleitao edited a comment on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-765865323



   Thanks a lot for your points. I am learning a lot! :)
   
   Note that for small arrays, we basically run into the metadata problem, in 
which the "payload size" of transmitting 1 element is driven by its metadata, 
not the data itself. This will always be a problem, as the arrow format was 
designed to be performant for large arrays.
   
   For example, all our buffers are shared via an `Arc`. There is a tradeoff 
between this indirection and mem-copying the memory region. The tradeoff works 
in `Arc`'s favor for large memory regions and against it for small ones.
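
   A minimal sketch of that tradeoff, using hypothetical stand-in types 
(`SharedBytes` and `OwnedBytes` are not types from the arrow crate): cloning 
the `Arc`-backed type only bumps a reference count, while cloning the owned 
type copies every byte.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a memory region shared through an `Arc`.
/// Cloning bumps a reference count; reads go through one extra pointer.
#[derive(Clone)]
pub struct SharedBytes {
    data: Arc<Vec<u8>>,
}

impl SharedBytes {
    pub fn new(bytes: Vec<u8>) -> Self {
        Self { data: Arc::new(bytes) }
    }

    pub fn as_slice(&self) -> &[u8] {
        &self.data
    }

    /// Number of handles currently pointing at the same allocation.
    pub fn ref_count(&self) -> usize {
        Arc::strong_count(&self.data)
    }
}

/// Hypothetical alternative that owns its bytes.
/// Cloning allocates and copies the whole region.
#[derive(Clone)]
pub struct OwnedBytes {
    data: Vec<u8>,
}

impl OwnedBytes {
    pub fn new(bytes: Vec<u8>) -> Self {
        Self { data: bytes }
    }

    pub fn as_slice(&self) -> &[u8] {
        &self.data
    }
}
```

   For a region of a few bytes the copy can be cheaper than the atomic 
increment plus the indirection; for a large region the copy dominates.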
   
   With that said, we could consider replacing `Arc<ArrayData>` by `ArrayData` 
on all our arrays, to avoid the extra `Arc`: cloning an `ArrayData` is actually 
cheap. I am not sure if that would work for FFI, but we could certainly try.
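
   To make the two layouts concrete, here is a rough sketch with simplified, 
hypothetical types (the real `ArrayData` has more fields, and `ArrayViaArc` / 
`ArrayInline` are names made up for this illustration). In both cases a clone 
leaves the underlying bytes shared; the inline variant just drops one level of 
`Arc`.

```rust
use std::sync::Arc;

/// Simplified, hypothetical buffer: the bytes live behind an `Arc`.
#[derive(Clone)]
pub struct Buffer {
    data: Arc<Vec<u8>>,
}

impl Buffer {
    pub fn new(bytes: Vec<u8>) -> Self {
        Self { data: Arc::new(bytes) }
    }

    /// Address of the first byte, used to check whether two handles share memory.
    pub fn as_ptr(&self) -> *const u8 {
        self.data.as_ptr()
    }
}

/// Simplified, hypothetical `ArrayData`. Cloning it clones the `Vec` of
/// handles and bumps one reference count per buffer; no bytes are copied.
#[derive(Clone)]
pub struct ArrayData {
    pub len: usize,
    pub buffers: Vec<Buffer>,
}

/// Layout with the extra indirection: the array holds `Arc<ArrayData>`.
#[derive(Clone)]
pub struct ArrayViaArc {
    pub data: Arc<ArrayData>,
}

impl ArrayViaArc {
    pub fn new(data: ArrayData) -> Self {
        Self { data: Arc::new(data) }
    }

    /// True when both arrays point at the very same `ArrayData` allocation.
    pub fn shares_array_data(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.data, &other.data)
    }
}

/// Layout without the outer `Arc`: the array holds `ArrayData` directly.
#[derive(Clone)]
pub struct ArrayInline {
    pub data: ArrayData,
}
```

   Note that cloning `ArrayInline` in this sketch still clones the `Vec` of 
buffer handles, which is a separate cost from the outer `Arc`.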
   
   Another idea is to use `buffer1: Buffer`, `buffer2: Buffer` instead of 
`buffers: Vec<Buffer>` to avoid the `Vec`. This is possible because arrow 
arrays support at most 2 buffers (3 with the null bitmap). For types with a 
single buffer, we are already incurring the cost of the `Vec`, and thus adding 
an extra `Buffer` field instead should not be a big issue (memory-wise). The 
advantage of this is that we avoid cloning the `Vec` on every operation as well 
as the extra bounds check. The disadvantage is that we have to be more verbose 
when we want to apply an operation to every buffer.
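
   A sketch of what the two layouts could look like, again with hypothetical, 
simplified types (`VecLayout`, `FixedLayout` and `total_bytes` are invented for 
illustration; an unused second buffer is assumed to be represented as an empty 
`Buffer`):

```rust
use std::sync::Arc;

/// Simplified, hypothetical buffer: the bytes live behind an `Arc`.
#[derive(Clone)]
pub struct Buffer {
    data: Arc<Vec<u8>>,
}

impl Buffer {
    pub fn new(bytes: Vec<u8>) -> Self {
        Self { data: Arc::new(bytes) }
    }

    pub fn len(&self) -> usize {
        self.data.len()
    }

    /// Address of the first byte, used to check whether two handles share memory.
    pub fn as_ptr(&self) -> *const u8 {
        self.data.as_ptr()
    }
}

/// Current layout: cloning allocates a new `Vec`, and `buffers[i]` is
/// bounds-checked on every access.
#[derive(Clone)]
pub struct VecLayout {
    pub buffers: Vec<Buffer>,
}

impl VecLayout {
    /// Applying an operation to every buffer is a one-liner.
    pub fn total_bytes(&self) -> usize {
        self.buffers.iter().map(Buffer::len).sum()
    }
}

/// Proposed layout: two named fields, so cloning needs no `Vec` allocation
/// and field access needs no bounds check.
#[derive(Clone)]
pub struct FixedLayout {
    pub buffer1: Buffer,
    pub buffer2: Buffer,
}

impl FixedLayout {
    /// The verbosity cost: every per-buffer operation names each field.
    pub fn total_bytes(&self) -> usize {
        self.buffer1.len() + self.buffer2.len()
    }
}
```

   The `total_bytes` pair shows the disadvantage mentioned above: the `Vec` 
version iterates, while the fixed version has to spell out each buffer.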
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
