[ 
https://issues.apache.org/jira/browse/ARROW-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256456#comment-17256456
 ] 

Daniël Heres edited comment on ARROW-11030 at 12/30/20, 11:09 AM:
------------------------------------------------------------------

So in summary, what looks like is the problem here: * for the left side of the 
join we .iter() on the left side batches each iteration (a couple of times, in 
the join implementation itself and in MutableDataArray) when we process a 
probe-side batch.
if we halve the batch size, the number of items in the build-side array grows 
by two, but also the number of probe-side batches we process in the join, thus 
the n*n slowdown.
I am not sure what way best to solve it. I think we have to have something for 
the left-side batches so we can reuse a "prepared" structure which can be used 
in O(1) time in build_batch_from_indices.


was (Author: dandandan):
|So in summary, what looks like is the problem here: * for the left side of the 
join we {{.iter()}} on the left side batches each iteration (a couple of times, 
in the join implementation itself and in {{MutableDataArray}}) when we process 
a probe-side batch.
 * if we halve the batch size, the number of items in the build-side array 
grows by two, but also the number of probe-side batches we process in the join, 
thus the {{n*n}} slowdown.


I am not sure what way best to solve it. I think we have to have something for 
the left-side batches so we can reuse a "prepared" structure which can be used 
in O(1) time in build_batch_from_indices.
 |
 
Delete branch
 

> [Rust] [DataFusion] HashJoinExec slow with many batches
> -------------------------------------------------------
>
>                 Key: ARROW-11030
>                 URL: https://issues.apache.org/jira/browse/ARROW-11030
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Rust - DataFusion
>            Reporter: Andy Grove
>            Priority: Major
>             Fix For: 3.0.0
>
>
> Performance of joins slows down dramatically with smaller batches.
> The issue is related to slow performance of MutableDataArray::new() when 
> passed a high number of batches. This happens when passing in all of the 
> batches from the build side of the join and this happens once per build-side 
> join key for each probe-side batch.
> It seems to get exponentially slower as the number of arrays increases even 
> though the number of rows is the same.
> I modified hash_join.rs to have this debug code:
> {code:java}
> let start = Instant::now();
> let row_count: usize = arrays.iter().map(|arr| arr.len()).sum();
> let num_arrays = arrays.len();
> let mut mutable = MutableArrayData::new(arrays, true, capacity);
> if num_arrays > 0 {
>     debug!("MutableArrayData::new() with {} arrays containing {} rows took {} 
> ms", num_arrays, row_count, start.elapsed().as_millis());
> } {code}
> Batch size 131072:
> {code:java}
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms 
> {code}
> Batch size 16384:
> {code:java}
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 19 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 16 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 17 ms 
> {code}
> Batch size 4096:
> {code:java}
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 89 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms 
> {code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to