[jira] [Updated] (ARROW-11094) [Rust] [DataFusion] Implement Sort-Merge Join

2021-04-11 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11094:
---
Fix Version/s: (was: 4.0.0)

> [Rust] [DataFusion] Implement Sort-Merge Join
> -
>
> Key: ARROW-11094
> URL: https://issues.apache.org/jira/browse/ARROW-11094
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>
> The current hash join works well when one side of the join can be loaded into 
> memory but cannot scale beyond the available RAM.
> The advantage of implementing SMJ (Sort-Merge Join) is that we can sort the 
> left and right partitions, and write the intermediate results to disk, and 
> then stream both sides of the join by merging these sorted partitions and we 
> do not need to load one side into memory. At most, we need to load all 
> batches from both sides that contain the current join key values.
> In order to reduce memory pressure we will want to limit the concurrency of 
> these sort operations.
> We would still want to default to hash join when we know that the build-side 
> can fit into memory since it is more efficient than using a sort-merge join.
> [https://en.wikipedia.org/wiki/Sort-merge_join]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11094) [Rust] [DataFusion] Implement Sort-Merge Join

2020-12-31 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11094:
---
Description: 
The current hash join works well when one side of the join can be loaded into 
memory but cannot scale beyond the available RAM.

The advantage of implementing SMJ (Sort-Merge Join) is that we can sort the 
left and right partitions, and write the intermediate results to disk, and then 
stream both sides of the join by merging these sorted partitions and we do not 
need to load one side into memory. At most, we need to load all batches from 
both sides that contain the current join key values.

In order to reduce memory pressure we will want to limit the concurrency of 
these sort operations.

We would still want to default to hash join when we know that the build-side 
can fit into memory since it is more efficient than using a sort-merge join.

[https://en.wikipedia.org/wiki/Sort-merge_join]

  was:
The current hash join works well when one side of the join can be loaded into 
memory but cannot scale beyond the available RAM.

The advantage of implementing SMJ (Sort-Merge Join) is that we can sort the 
left and right partitions, and write the intermediate results to disk, and then 
stream both sides of the join by merging these sorted partitions and we do not 
need to load one side into memory. At most, we need to load all batches from 
both sides that contain the current join key values.

We would still want to default to hash join when we know that the build-side 
can fit into memory since it is more efficient than using a sort-merge join.

[https://en.wikipedia.org/wiki/Sort-merge_join]


> [Rust] [DataFusion] Implement Sort-Merge Join
> -
>
> Key: ARROW-11094
> URL: https://issues.apache.org/jira/browse/ARROW-11094
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
> Fix For: 4.0.0
>
>
> The current hash join works well when one side of the join can be loaded into 
> memory but cannot scale beyond the available RAM.
> The advantage of implementing SMJ (Sort-Merge Join) is that we can sort the 
> left and right partitions, and write the intermediate results to disk, and 
> then stream both sides of the join by merging these sorted partitions and we 
> do not need to load one side into memory. At most, we need to load all 
> batches from both sides that contain the current join key values.
> In order to reduce memory pressure we will want to limit the concurrency of 
> these sort operations.
> We would still want to default to hash join when we know that the build-side 
> can fit into memory since it is more efficient than using a sort-merge join.
> [https://en.wikipedia.org/wiki/Sort-merge_join]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11094) [Rust] [DataFusion] Implement Sort-Merge Join

2020-12-31 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11094:
---
Description: 
The current hash join works well when one side of the join can be loaded into 
memory but cannot scale beyond the available RAM.

The advantage of implementing SMJ (Sort-Merge Join) is that we can sort the 
left and right partitions, and write the intermediate results to disk, and then 
stream both sides of the join by merging these sorted partitions and we do not 
need to load one side into memory. At most, we need to load all batches from 
both sides that contain the current join key values.

We would still want to default to hash join when we know that the build-side 
can fit into memory since it is more efficient than using a sort-merge join.

[https://en.wikipedia.org/wiki/Sort-merge_join]

  was:
The current hash join works well when one side of the join can be loaded into 
memory but cannot scale beyond the available RAM.

The advantage of implementing SMJ (Sort-Merge Join) is that we can sort the 
left and right partitions, and write the intermediate results to disk, and then 
stream both sides of the join by merging these sorted partitions and we do not 
need to load one side into memory. At most, we need to load all batches from 
both sides that contain the current join key values.

[https://en.wikipedia.org/wiki/Sort-merge_join]


> [Rust] [DataFusion] Implement Sort-Merge Join
> -
>
> Key: ARROW-11094
> URL: https://issues.apache.org/jira/browse/ARROW-11094
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
> Fix For: 4.0.0
>
>
> The current hash join works well when one side of the join can be loaded into 
> memory but cannot scale beyond the available RAM.
> The advantage of implementing SMJ (Sort-Merge Join) is that we can sort the 
> left and right partitions, and write the intermediate results to disk, and 
> then stream both sides of the join by merging these sorted partitions and we 
> do not need to load one side into memory. At most, we need to load all 
> batches from both sides that contain the current join key values.
> We would still want to default to hash join when we know that the build-side 
> can fit into memory since it is more efficient than using a sort-merge join.
> [https://en.wikipedia.org/wiki/Sort-merge_join]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11094) [Rust] [DataFusion] Implement Sort-Merge Join

2020-12-31 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11094:
---
Description: 
The current hash join works well when one side of the join can be loaded into 
memory but cannot scale beyond the available RAM.

The advantage of implementing SMJ (Sort-Merge Join) is that we can sort the 
left and right partitions, and write the intermediate results to disk, and then 
stream both sides of the join by merging these sorted partitions and we do not 
need to load one side into memory. At most, we need to load all batches from 
both sides that contain the current join key values.

[https://en.wikipedia.org/wiki/Sort-merge_join]

  was:
The current hash join works well when one side of the join can be loaded into 
memory but cannot scale beyond the available RAM.

The advantage of implementing SMJ (Sort-Merge Join) is that we can sort the 
left and right partitions in parallel and then stream both sides of the join by 
merging these sorted partitions and we do not need to load one side into 
memory. At most, we need to load all batches from both sides that contain the 
current join key values.

https://en.wikipedia.org/wiki/Sort-merge_join


> [Rust] [DataFusion] Implement Sort-Merge Join
> -
>
> Key: ARROW-11094
> URL: https://issues.apache.org/jira/browse/ARROW-11094
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
> Fix For: 4.0.0
>
>
> The current hash join works well when one side of the join can be loaded into 
> memory but cannot scale beyond the available RAM.
> The advantage of implementing SMJ (Sort-Merge Join) is that we can sort the 
> left and right partitions, and write the intermediate results to disk, and 
> then stream both sides of the join by merging these sorted partitions and we 
> do not need to load one side into memory. At most, we need to load all 
> batches from both sides that contain the current join key values.
> [https://en.wikipedia.org/wiki/Sort-merge_join]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)