alamb commented on issue #9083:
URL: https://github.com/apache/arrow-rs/issues/9083#issuecomment-3759332347

   > @alamb I'm talking about the shuffling/partitioning operation - Consider 
you have a 10 batches with 50 columns, and you want to split to 3 partitions 
based on a round robin or hash, so each row will be needed to copy to a 
different place.
   
   If I were implementing this operation I would:
   1. Compute the input indices to send to each partition
   2. Call the [take](https://docs.rs/arrow/latest/arrow/compute/kernels/take/fn.take_record_batch.html) kernel to create the relevant output arrays
   
   This would result in exactly one data copy (to the output array)
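   
   The two steps above can be sketched roughly as follows. This is a minimal, std-only illustration that uses plain `Vec` columns instead of Arrow arrays; `round_robin_indices`, `take_column`, and `partition_columns` are hypothetical names invented for this sketch, and `take_column` is only a stand-in for what the real `take` kernel does.

```rust
/// Step 1 (hypothetical helper): compute, for each partition, the input row
/// indices that belong to it. Round robin sends row `i` to partition `i % n`.
fn round_robin_indices(num_rows: usize, num_partitions: usize) -> Vec<Vec<usize>> {
    let mut indices: Vec<Vec<usize>> = vec![Vec::new(); num_partitions];
    for row in 0..num_rows {
        indices[row % num_partitions].push(row);
    }
    indices
}

/// Step 2 (stand-in for a take kernel): gather the values at `indices` into a
/// new output column. Each selected value is copied exactly once.
fn take_column<T: Clone>(column: &[T], indices: &[usize]) -> Vec<T> {
    indices.iter().map(|&i| column[i].clone()).collect()
}

/// Apply both steps to every column. The result is indexed as
/// `result[partition][column]`. Assumes all columns have the same length.
fn partition_columns<T: Clone>(columns: &[Vec<T>], num_partitions: usize) -> Vec<Vec<Vec<T>>> {
    let num_rows = columns.first().map_or(0, |c| c.len());
    round_robin_indices(num_rows, num_partitions)
        .iter()
        .map(|idx| columns.iter().map(|c| take_column(c, idx)).collect())
        .collect()
}
```

   The indices are computed once and reused for every column, so the only copy of the data is the gather into the output columns. A hash partitioner would change only how step 1 assigns rows to partitions.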
   
   > you might say that but encoding to rows is the same thing and even worse 
as you copy twice. 
   
   Indeed. In my mind, any use of a row format would result in at least two 
copies (into the row format, and back to the output arrays).
   
   > you are right but in my tests I saw that it is faster to (1) encode to 
this format (2) partition and (3) decode back rather than to encode it using 
specialized implementation per type
   
   What tests are you referring to? I might understand better if you 
could describe what your test was doing.
   
   
   > @alamb in the case when we already need to convert to rows (in the 
fallback implementation) the ordering doesn't matter and we can use this new 
row format which will make it faster.
   
   Right -- I am saying this proposal would make the fallback faster. I was 
suggesting avoiding the fallback entirely (with a specialized implementation).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
