alamb commented on issue #9083: URL: https://github.com/apache/arrow-rs/issues/9083#issuecomment-3759332347
> @alamb I'm talking about the shuffling/partitioning operation - Consider you have a 10 batches with 50 columns, and you want to split to 3 partitions based on a round robin or hash, so each row will be needed to copy to a different place. If I were implementing this operation I would: 1. Compute the input indices to send to each partition 2. Call the [take](https://docs.rs/arrow/latest/arrow/compute/kernels/take/fn.take_record_batch.html) kernel to create the relevant output arrays This would result in exactly one data copy (to the output array) > you might say that but encoding to rows is the same thing and even worse as you copy twice. Indeed, In my mind, any use of a Row Format would result in at least 2 copies (into the Row Format, and back to the output Arrays) > you are right but in my tests I saw that it is faster to (1) encode to this format (2) partition and (3) decode back rather than to encode it using specialized implementation per type What tests are you referring to? Maybe I would understand better if you could describe what your test was doing > @alamb in the case when we already need to convert to rows (in the fallback implementation) the ordering doesn't matter and we can use this new row format which will make it faster. Right -- I am saying this proposal would make the fallback faster. I was suggesting avoiding the fallback entirely (with specialized implementation) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
