[ 
https://issues.apache.org/jira/browse/ARROW-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-15079:
--------------------------------
    Description: 
This is a high-level JIRA that will probably consist of a number of subtasks.  
Currently the exec plan supports backpressure.  This helps to limit the amount 
of data read in by a single query.  Spillover can also help reduce the amount 
of memory used by a single query.

However, if multiple queries are run concurrently, then the system will 
eventually run out of memory.  I'd like to propose a simple scheduler to start 
to solve this problem.

 * Initially, the scheduler will be given a configurable max memory target.  We 
will assume the user is responsible for ensuring this memory is not used 
elsewhere on the system.  For example, a user may configure 12GB of RAM for the 
scheduler and the user is responsible for ensuring that 12GB of RAM is not 
consumed by other processes.  In future JIRA issues we can add more 
sophistication such as detecting and adapting to the system memory levels.

 * The scheduler will track memory usage by querying for the RSS usage of the 
process.  An alternative approach would be to use a dedicated memory pool.  
This is more flexible (allows for multiple schedulers / query engines in a 
single process) but wouldn't capture the non-pool allocations (although these 
should be small).

 * The ExecPlan will be modified to submit its tasks to the scheduler instead 
of directly to the executor (the scheduler will submit tasks to the executor).  
Tasks will be prioritized.  If a task has higher priority, then lower-priority 
tasks will be pushed to disk.  Prioritization will be based initially on a 
task's position in the ExecPlan.  Tasks closer to the sink will be higher 
priority.

 * The scheduler is separate from spillover, although it may interact with 
spillover mechanisms to ask for more spillover as memory pressure increases.

  was:
This is a high-level JIRA that will probably consist of a number of subtasks.  
Currently the exec plan supports backpressure.  This helps to limit the amount 
of data read in by a single query.  Spillover can also help reduce the amount 
of memory used by a single query.

However, if multiple queries are run concurrently, then the system will 
eventually run out of memory.  I'd like to propose a simple scheduler to start 
to solve this problem.

 * Initially, the scheduler will be given a configurable max memory target.  We 
will assume the user is responsible for ensuring this memory is not used 
elsewhere on the system.  For example, a user may configure 12GB of RAM for the 
scheduler and the user is responsible for ensuring that 12GB of RAM is not 
consumed by other processes.

In future JIRA issues we can add more sophistication such as detecting and 
adapting to the system memory levels.

 * The scheduler will track memory usage by querying for the RSS usage of the 
process.  An alternative approach would be to use a dedicated memory pool.  
This is more flexible (allows for multiple schedulers / query engines in a 
single process) but wouldn't capture the non-pool allocations (although these 
should be small).

 * The ExecPlan will be modified to submit its tasks to the scheduler instead 
of directly to the executor (the scheduler will submit tasks to the executor).  
Tasks will be prioritized.  If a task has higher priority, then lower-priority 
tasks will be pushed to disk.  Prioritization will be based initially on a 
task's position in the ExecPlan.  Tasks closer to the sink will be higher 
priority.

 * The scheduler is separate from spillover, although it may interact with 
spillover mechanisms to ask for more spillover as memory pressure increases.


> [C++] Add scheduler to constrain memory of exec plans
> -----------------------------------------------------
>
>                 Key: ARROW-15079
>                 URL: https://issues.apache.org/jira/browse/ARROW-15079
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>
> This is a high-level JIRA that will probably consist of a number of subtasks. 
>  Currently the exec plan supports backpressure.  This helps to limit the 
> amount of data read in by a single query.  Spillover can also help reduce the 
> amount of memory used by a single query.
> However, if multiple queries are run concurrently, then the system will 
> eventually run out of memory.  I'd like to propose a simple scheduler to 
> start to solve this problem.
>  * Initially, the scheduler will be given a configurable max memory target.  
> We will assume the user is responsible for ensuring this memory is not used 
> elsewhere on the system.  For example, a user may configure 12GB of RAM for 
> the scheduler and the user is responsible for ensuring that 12GB of RAM is 
> not consumed by other processes.  In future JIRA issues we can add more 
> sophistication such as detecting and adapting to the system memory levels.
>  * The scheduler will track memory usage by querying for the RSS usage of the 
> process.  An alternative approach would be to use a dedicated memory pool.  
> This is more flexible (allows for multiple schedulers / query engines in a 
> single process) but wouldn't capture the non-pool allocations (although these 
> should be small).
>  * The ExecPlan will be modified to submit its tasks to the scheduler instead 
> of directly to the executor (the scheduler will submit tasks to the 
> executor).  Tasks will be prioritized.  If a task has higher priority, then 
> lower-priority tasks will be pushed to disk.  Prioritization will be based 
> initially on a task's position in the ExecPlan.  Tasks closer to the sink will 
> be higher priority.
>  * The scheduler is separate from spillover, although it may interact with 
> spillover mechanisms to ask for more spillover as memory pressure increases.
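
The proposal quoted above could be sketched roughly as follows.  This is only 
an illustration under stated assumptions, not Arrow code: MemoryBoundScheduler, 
PendingTask, estimated_bytes and the injected memory probe are hypothetical 
names.  The probe stands in for the RSS query, tasks run inline rather than 
being handed to an executor, and pushing lower-priority tasks to disk is not 
modelled at all; a task that does not fit simply stays queued.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical task record; none of these names are Arrow APIs.
struct PendingTask {
  int priority;                   // larger value = closer to the sink
  std::uint64_t estimated_bytes;  // caller-supplied estimate (an assumption)
  std::function<void()> run;
};

// Orders the heap so the highest-priority task is on top.
struct ByPriority {
  bool operator()(const PendingTask& a, const PendingTask& b) const {
    return a.priority < b.priority;
  }
};

class MemoryBoundScheduler {
 public:
  // `probe` reports current memory use in bytes.  In the proposal this
  // would be the process RSS; here it is injected so the sketch stays
  // portable and testable.
  MemoryBoundScheduler(std::uint64_t max_bytes,
                       std::function<std::uint64_t()> probe)
      : max_bytes_(max_bytes), probe_(std::move(probe)) {
    assert(probe_ && "a memory probe is required");
  }

  void Submit(PendingTask task) { queue_.push(std::move(task)); }

  // Runs queued tasks in priority order for as long as the probed usage
  // plus the next task's estimate stays within the configured target.
  // Returns the number of tasks that were run.
  std::size_t RunReady() {
    std::size_t ran = 0;
    while (!queue_.empty()) {
      if (probe_() + queue_.top().estimated_bytes > max_bytes_) break;
      PendingTask task = queue_.top();  // top() is const, so copy out
      queue_.pop();
      task.run();  // a real scheduler would submit this to the executor
      ++ran;
    }
    return ran;
  }

  std::size_t pending() const { return queue_.size(); }

 private:
  std::uint64_t max_bytes_;
  std::function<std::uint64_t()> probe_;
  std::priority_queue<PendingTask, std::vector<PendingTask>, ByPriority> queue_;
};
```

A caller would construct the scheduler with the configured target and a probe, 
Submit() tasks tagged with their position in the plan, and call RunReady() 
whenever memory may have been released.  Stopping at the first task that does 
not fit keeps strict priority order; whether to let smaller low-priority tasks 
skip ahead is a design choice the issue leaves open.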



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
