[GitHub] [arrow-datafusion] Dandandan opened a new issue #405: Hash RepartitionExec should buffer/emit batches based on target batch size

GitBox Mon, 24 May 2021 01:55:28 -0700


Dandandan opened a new issue #405:
URL: https://github.com/apache/arrow-datafusion/issues/405



   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Currently, a hash-based `RepartitionExec` splits the rows of each input 
batch across the output partitions. This produces much smaller batches 
(roughly 20x smaller with 20 partitions), which are concatenated later by a 
`CoalesceBatchesExec`.
   It is probably beneficial to buffer rows per output partition (based on the 
target batch size) and only emit a batch once the buffer reaches or exceeds the 
configured target batch size. This avoids doing a number of `take` calls on 
the smaller batches, as well as the overhead of concatenating the smaller 
batches later. If this is implemented, the optimization rule that introduces a 
`CoalesceBatchesExec` after a `RepartitionExec` should probably be removed, as 
it would no longer provide any benefit.
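
   To make the proposal concrete, here is a minimal sketch of the buffering 
idea using only the Rust standard library. It is not DataFusion code: 
`BufferedHashPartitioner`, `Batch` (a plain `Vec<u64>` standing in for a record 
batch), and the use of `DefaultHasher` are all assumptions made for 
illustration. The real operator would work on Arrow arrays and use its own 
hashing.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a record batch: just a vector of row keys.
type Batch = Vec<u64>;

/// Buffers rows per output partition and emits a batch only once a
/// partition's buffer holds `target_batch_size` rows.
struct BufferedHashPartitioner {
    target_batch_size: usize,
    buffers: Vec<Batch>,
}

impl BufferedHashPartitioner {
    fn new(num_partitions: usize, target_batch_size: usize) -> Self {
        assert!(num_partitions > 0 && target_batch_size > 0);
        Self {
            target_batch_size,
            buffers: vec![Vec::new(); num_partitions],
        }
    }

    /// Assumption: std's DefaultHasher, not the hashing DataFusion uses.
    fn partition_of(&self, key: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() % self.buffers.len() as u64) as usize
    }

    /// Routes every row of `input` to its partition buffer and returns
    /// `(partition, batch)` pairs for buffers that filled up.
    fn push(&mut self, input: &[u64]) -> Vec<(usize, Batch)> {
        let mut ready = Vec::new();
        for &key in input {
            let p = self.partition_of(key);
            self.buffers[p].push(key);
            if self.buffers[p].len() >= self.target_batch_size {
                let fresh = Vec::with_capacity(self.target_batch_size);
                let full = std::mem::replace(&mut self.buffers[p], fresh);
                ready.push((p, full));
            }
        }
        ready
    }

    /// Emits whatever is left in the buffers at end of input.
    fn flush(&mut self) -> Vec<(usize, Batch)> {
        self.buffers
            .iter_mut()
            .enumerate()
            .filter(|(_, b)| !b.is_empty())
            .map(|(p, b)| (p, std::mem::take(b)))
            .collect()
    }
}
```

   In this sketch every batch returned by `push` has exactly 
`target_batch_size` rows, and only the final `flush` can produce smaller ones, 
so a downstream coalescing step would have nothing left to do.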
   
   **Describe the solution you'd like**
   Change the logic and do some benchmarking to show the benefit.
   
   **Describe alternatives you've considered**
   n/a
   **Additional context**
   n/a


