[ https://issues.apache.org/jira/browse/ARROW-14163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weston Pace closed ARROW-14163. ------------------------------- Resolution: Won't Fix It appears this will be addressed by ARROW-16389 > [C++] Naive spillover implementation for join > --------------------------------------------- > > Key: ARROW-14163 > URL: https://issues.apache.org/jira/browse/ARROW-14163 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ > Reporter: Weston Pace > Assignee: Sasha Krassovsky > Priority: Major > Labels: query-engine > Fix For: 9.0.0 > > > A join is a pipeline breaker. I believe the proposed join operators assume > that the data can fit into memory and queue all incoming batches. For > example, if I understand correctly, > https://github.com/apache/arrow/pull/11150 queues the right side until the > left side had finished. > There are many clever and interesting ways that this can be optimized > (divide & conquer, recursive query, prioritize reading the left side and > pause the right side read). This issue is intentionally not clever or > interesting. > Instead, I think it would be good to take advantage of this opportunity to > start fleshing out our spillover capabilities. A very simplistic > implementation could be a standalone node which has 2 inputs and 2 outputs. > The node queues up all incoming data on the "right" input and lets the "left" > input pass through. Then, when the left input has finished the node will > release the right input. > This node could then implement a basic spillover mechanism (e.g. IPC to disk) > and start to flesh out the abstractions that we will eventually want to > handle different spillover strategies (abort on spill, spill to disk, and > spill to s3 are all I can think of at the moment). -- This message was sent by Atlassian Jira (v8.20.7#820007)