Weston Pace created ARROW-14163:
-----------------------------------

             Summary: [C++] Naive spillover implementation for join
                 Key: ARROW-14163
                 URL: https://issues.apache.org/jira/browse/ARROW-14163
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace
            Assignee: Weston Pace


A join is a pipeline breaker.  I believe the proposed join operators assume 
that the data can fit into memory and queue all incoming batches.  For example, 
if I understand correctly, https://github.com/apache/arrow/pull/11150 queues 
the right side until the left side had finished.

There are many clever and interesting ways that this can be optimized  (divide 
& conquer, recursive query, prioritize reading the left side and pause the 
right side read).  This issue is intentionally not clever or interesting.

Instead, I think it would be good to take advantage of this opportunity to 
start fleshing out our spillover capabilities.  A very simplistic 
implementation could be a standalone node which has 2 inputs and 2 outputs.  
The node queues up all incoming data on the "right" input and lets the "left" 
input pass through.  Then, when the left input has finished the node will 
release the right input.

This node could then implement a basic spillover mechanism (e.g. IPC to disk) 
and start to flesh out the abstractions that we will eventually want to handle 
different spillover strategies  (abort on spill, spill to disk, and spill to s3 
are all I can think of at the moment).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to