[GitHub] [arrow] jorgecarleitao commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

GitBox Fri, 22 Jan 2021 20:36:12 -0800


jorgecarleitao commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-765865323



   Thanks a lot for your points. I am learning a lot! :)
   
   Note that for small arrays we are basically facing the metadata problem, in 
which the "payload size" of transmitting 1 element is driven by its metadata, 
not the data itself. This will always be a problem, as the Arrow format was 
designed to be performant for large arrays.
   
   For example, all our buffers are shared via an `Arc`. There is a tradeoff 
between this indirection and mem-copying the memory region. The tradeoff works 
in `Arc`'s favor for large memory regions and in the copy's favor for small ones.
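   
   A minimal sketch of the two sides of that tradeoff, using only the standard 
library. `SharedBytes` and `copy_bytes` are hypothetical names for illustration, 
not types from the arrow crate: cloning the `Arc` handle only bumps a reference 
count, while the copying path duplicates every byte but leaves no indirection.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a shared buffer: bytes behind an `Arc`.
#[derive(Clone)]
pub struct SharedBytes(Arc<Vec<u8>>);

impl SharedBytes {
    pub fn new(bytes: Vec<u8>) -> Self {
        SharedBytes(Arc::new(bytes))
    }

    /// True when both handles point at the same heap allocation.
    pub fn same_allocation(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.0, &other.0)
    }

    /// Number of live handles sharing the bytes.
    pub fn handles(&self) -> usize {
        Arc::strong_count(&self.0)
    }

    pub fn as_slice(&self) -> &[u8] {
        &self.0
    }
}

/// The copying alternative: cost grows with the length of the region.
pub fn copy_bytes(bytes: &[u8]) -> Vec<u8> {
    bytes.to_vec()
}
```

   Cloning a `SharedBytes` costs the same for 1 byte or 1 GB, which is why 
sharing pays off for large regions; for a handful of bytes, the reference 
count and pointer hop can outweigh a plain copy.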
   
   With that said, we could consider replacing `Arc<ArrayData>` with `ArrayData` 
on all our arrays, to avoid the extra `Arc`: cloning an `ArrayData` is actually 
cheap. I am not sure if that would work for FFI, but we could certainly try.
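   
   To make the two layouts concrete, here is a simplified sketch. 
`ArrayDataSketch`, `ArrayWithArc`, `ArrayInline` and `sample_data` are 
hypothetical names, and the struct is only an assumption-laden stand-in: the 
real `ArrayData` carries more fields than shown here.

```rust
use std::sync::Arc;

/// Simplified, hypothetical stand-in for `ArrayData`.
/// Each buffer is itself reference-counted, so cloning never copies bytes.
#[derive(Clone)]
pub struct ArrayDataSketch {
    pub len: usize,
    pub buffers: Vec<Arc<[u8]>>,
}

/// Layout A: the array holds its data behind an extra `Arc`.
/// Reaching a byte means following the outer `Arc`, then the buffer's `Arc`.
#[derive(Clone)]
pub struct ArrayWithArc {
    pub data: Arc<ArrayDataSketch>,
}

/// Layout B: the array holds its data inline, removing the outer `Arc`.
#[derive(Clone)]
pub struct ArrayInline {
    pub data: ArrayDataSketch,
}

/// Builds a small example with a single 4-byte buffer.
pub fn sample_data() -> ArrayDataSketch {
    let values: Arc<[u8]> = Arc::from(vec![1u8, 2, 3, 4]);
    ArrayDataSketch { len: 4, buffers: vec![values] }
}
```

   In this sketch, cloning `ArrayInline` still shares the underlying bytes (only 
the per-buffer reference counts change), although it does clone the `Vec` 
that holds the buffer handles.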
   
   Another idea is to use `buffer1: Buffer`, `buffer2: Buffer` instead of 
`buffers: Vec<Buffer>` to avoid the `Vec`. This is possible because Arrow arrays 
support at most 2 buffers. For types with a single buffer, we are already 
incurring the cost of the `Vec`, and thus adding a second `Buffer` instead should 
not be a big issue (memory-wise). The advantage is that we avoid cloning 
the `Vec` on every operation as well as the extra bounds check. The disadvantage 
is that we have to be more verbose when we want to apply an operation to every 
buffer.
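   
   A rough sketch of the two field layouts, to show where the verbosity comes 
from. The `Buffer` alias, `VecLayout` and `FieldLayout` are hypothetical names, 
and the alias is only a stand-in for the real `Buffer` type. It assumes a 
single-buffer type would leave `buffer2` empty.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a reference-counted buffer.
pub type Buffer = Arc<[u8]>;

/// Current shape: buffers live in a heap-allocated `Vec`.
#[derive(Clone)]
pub struct VecLayout {
    pub buffers: Vec<Buffer>,
}

impl VecLayout {
    /// Applying an operation to every buffer is a one-line loop.
    pub fn total_bytes(&self) -> usize {
        self.buffers.iter().map(|b| b.len()).sum()
    }
}

/// Proposed shape: two named fields, no `Vec` allocation and no indexing.
/// Assumption: a single-buffer type stores an empty `buffer2`.
#[derive(Clone)]
pub struct FieldLayout {
    pub buffer1: Buffer,
    pub buffer2: Buffer,
}

impl FieldLayout {
    /// The same operation must name each buffer explicitly.
    pub fn total_bytes(&self) -> usize {
        self.buffer1.len() + self.buffer2.len()
    }
}
```

   Cloning `FieldLayout` only bumps two reference counts, whereas cloning 
`VecLayout` also allocates a new `Vec` for the handles.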
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
