[ 
https://issues.apache.org/jira/browse/ARROW-16522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sasha Krassovsky updated ARROW-16522:
-------------------------------------
    Description: 
This is a high level JIRA that will be broken into a number of subtasks. There 
are a number of things that each node has to reinvent (or, more likely, 
copy/paste) and it we probably know enough at this point that we can enhance 
the ExecNode/ExecPlan API a bit to move some of the burden out of the nodes and 
into the plan. At a high level:
 * All nodes that schedule tasks use an async task group so they can make sure 
that all scheduled tasks are finished before the node is marked finish. Task 
scheduling could happen at the plan level and there would be one spot keeping 
track of running tasks.
 * If we have centralized scheduling then that opens up the door for a 
centralized handling of errors and backpressure. A node's default error 
behavior is then "stop scheduling tasks" and a node's default backpressure 
behavior is "stop running tasks".
 * There are no nodes that emit to multiple outputs today. Sending data to 
multiple outputs is not simple. This is a pipeline breaking activity and new 
tasks will need to be scheduled and the data will need to be held onto until 
each downstream task has run, and backpressure may need to be applied if one 
output is slower than the other. Node's should not be given the choice of 
multiple outputs unless they are solving specifically that problem (e.g. 
temporary table).

A small document describing the proposed changes is here: 
https://docs.google.com/document/d/1yFFUXRsN9Un89Xua-mwJnLRM60gjOBgeWdqtxL9tDSI/edit?usp=sharing

  was:
This is a high level JIRA that will be broken into a number of subtasks.  There 
are a number of things that each node has to reinvent (or, more likely, 
copy/paste) and it we probably know enough at this point that we can enhance 
the ExecNode/ExecPlan API a bit to move some of the burden out of the nodes and 
into the plan.  At a high level:

 * All nodes that schedule tasks use an async task group so they can make sure 
that all scheduled tasks are finished before the node is marked finish.  Task 
scheduling could happen at the plan level and there would be one spot keeping 
track of running tasks.
 * If we have centralized scheduling then that opens up the door for a 
centralized handling of errors and backpressure.  A node's default error 
behavior is then "stop scheduling tasks" and a node's default backpressure 
behavior is "stop running tasks".
 * There are no nodes that emit to multiple outputs today.  Sending data to 
multiple outputs is not simple.  This is a pipeline breaking activity and new 
tasks will need to be scheduled and the data will need to be held onto until 
each downstream task has run, and backpressure may need to be applied if one 
output is slower than the other.  Node's should not be given the choice of 
multiple outputs unless they are solving specifically that problem (e.g. 
temporary table).



> [C++] Evolution of exec plan
> ----------------------------
>
>                 Key: ARROW-16522
>                 URL: https://issues.apache.org/jira/browse/ARROW-16522
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Sasha Krassovsky
>            Priority: Major
>
> This is a high level JIRA that will be broken into a number of subtasks. 
> There are a number of things that each node has to reinvent (or, more likely, 
> copy/paste) and it we probably know enough at this point that we can enhance 
> the ExecNode/ExecPlan API a bit to move some of the burden out of the nodes 
> and into the plan. At a high level:
>  * All nodes that schedule tasks use an async task group so they can make 
> sure that all scheduled tasks are finished before the node is marked finish. 
> Task scheduling could happen at the plan level and there would be one spot 
> keeping track of running tasks.
>  * If we have centralized scheduling then that opens up the door for a 
> centralized handling of errors and backpressure. A node's default error 
> behavior is then "stop scheduling tasks" and a node's default backpressure 
> behavior is "stop running tasks".
>  * There are no nodes that emit to multiple outputs today. Sending data to 
> multiple outputs is not simple. This is a pipeline breaking activity and new 
> tasks will need to be scheduled and the data will need to be held onto until 
> each downstream task has run, and backpressure may need to be applied if one 
> output is slower than the other. Node's should not be given the choice of 
> multiple outputs unless they are solving specifically that problem (e.g. 
> temporary table).
> A small document describing the proposed changes is here: 
> https://docs.google.com/document/d/1yFFUXRsN9Un89Xua-mwJnLRM60gjOBgeWdqtxL9tDSI/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

Reply via email to