[ https://issues.apache.org/jira/browse/ARROW-11094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andy Grove updated ARROW-11094: ------------------------------- Fix Version/s: (was: 4.0.0) > [Rust] [DataFusion] Implement Sort-Merge Join > --------------------------------------------- > > Key: ARROW-11094 > URL: https://issues.apache.org/jira/browse/ARROW-11094 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust - DataFusion > Reporter: Andy Grove > Priority: Major > > The current hash join works well when one side of the join can be loaded into > memory but cannot scale beyond the available RAM. > The advantage of implementing SMJ (Sort-Merge Join) is that we can sort the > left and right partitions, and write the intermediate results to disk, and > then stream both sides of the join by merging these sorted partitions and we > do not need to load one side into memory. At most, we need to load all > batches from both sides that contain the current join key values. > In order to reduce memory pressure we will want to limit the concurrency of > these sort operations. > We would still want to default to hash join when we know that the build-side > can fit into memory since it is more efficient than using a sort-merge join. > [https://en.wikipedia.org/wiki/Sort-merge_join] -- This message was sent by Atlassian Jira (v8.3.4#803005)