Tao Yang created YARN-11809:
-------------------------------
Summary: Support application backoff mechanism for
CapacityScheduler
Key: YARN-11809
URL: https://issues.apache.org/jira/browse/YARN-11809
Project: Hadoop YARN
Issue Type: Improvement
Reporter: Tao Yang
Assignee: Tao Yang
Currently, when an application repeatedly fails to schedule tasks due to
resource constraints or other issues, it is still considered in every
scheduling cycle, which causes unnecessary scheduling overhead and resource
contention. This can lead to inefficient resource utilization and increased
scheduling latency. The impact is greatest in global scheduling, where the
scheduler needs to consider resources across the entire cluster: when the
scheduler is overwhelmed with repeated scheduling attempts for applications
that cannot be satisfied, the number of containers allocated per second may
drop from 1000+ to 200+.
It is therefore necessary to introduce an application backoff mechanism in the
CapacityScheduler that temporarily skips applications that still fail to
schedule tasks after a certain number of scheduling opportunities, which
improves scheduling efficiency.
h2. Solution
Implement an application backoff mechanism that:
# Tracks missed scheduling opportunities for each application
# Temporarily skips applications that exceed a configurable threshold of
missed opportunities
# Automatically resumes scheduling after a configurable backoff period
# Provides configurable parameters at both global and queue levels
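The four points above can be illustrated with a minimal, self-contained sketch. This is not the actual patch: the class AppBackoffSketch and its methods are hypothetical names, timestamps are passed in explicitly to keep the example deterministic, and it assumes that backoff starts as soon as the missed-opportunity count reaches the threshold and that the count is reset on a successful allocation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of per-application backoff; not CapacityScheduler code. */
final class AppBackoffSketch {
  private final int missedThreshold;   // missed opportunities before backoff
  private final long intervalMs;       // backoff duration in milliseconds
  private final Map<String, Integer> missed = new HashMap<>();
  private final Map<String, Long> backoffUntil = new HashMap<>();

  AppBackoffSketch(int missedThreshold, long intervalMs) {
    this.missedThreshold = missedThreshold;
    this.intervalMs = intervalMs;
  }

  /** Returns true if the application should be skipped at time nowMs. */
  boolean shouldSkip(String appId, long nowMs) {
    Long until = backoffUntil.get(appId);
    if (until == null) {
      return false;
    }
    if (nowMs < until) {
      return true;
    }
    // Backoff period has elapsed: resume scheduling automatically.
    backoffUntil.remove(appId);
    return false;
  }

  /** Records one missed scheduling opportunity for the application. */
  void recordMissed(String appId, long nowMs) {
    int count = missed.merge(appId, 1, Integer::sum);
    if (count >= missedThreshold) {
      // Assumption: backoff begins once the threshold is reached.
      backoffUntil.put(appId, nowMs + intervalMs);
      missed.remove(appId);
    }
  }

  /** Assumption: a successful allocation resets the missed counter. */
  void recordAllocated(String appId) {
    missed.remove(appId);
  }

  public static void main(String[] args) {
    AppBackoffSketch tracker = new AppBackoffSketch(3, 3000L);
    for (int i = 0; i < 3; i++) {
      tracker.recordMissed("app_1", 0L);
    }
    System.out.println("skip at 1000ms: " + tracker.shouldSkip("app_1", 1000L));
    System.out.println("skip at 3000ms: " + tracker.shouldSkip("app_1", 3000L));
  }
}
```

In a real scheduler the check would sit in the loop that iterates over the applications of a queue, so a skipped application costs one map lookup instead of a full allocation attempt.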
h2. Configuration Parameters
h3. Global Configuration
* yarn.scheduler.capacity.app-backoff.enabled: Enable/disable backoff
mechanism globally (default: false)
* yarn.scheduler.capacity.app-backoff.interval-ms: Global backoff duration in
milliseconds (default: 3000ms)
* yarn.scheduler.capacity.app-backoff.missed-threshold: Global number of
missed opportunities before backoff (default: 3)
h3. Queue-Specific Configuration
* yarn.scheduler.capacity.<queue-path>.app-backoff.enabled: Enable/disable
backoff mechanism for a specific queue. When enabled, applications in this
queue will be temporarily skipped if they fail to schedule tasks after reaching
the missed opportunities threshold. This setting can be configured
independently for each queue, allowing for fine-grained control over which
queues use the backoff mechanism. If not specified, it inherits the global
setting from yarn.scheduler.capacity.app-backoff.enabled.
* yarn.scheduler.capacity.<queue-path>.app-backoff.interval-ms: Backoff
duration in milliseconds for a specific queue. If not specified, it inherits
the global setting from yarn.scheduler.capacity.app-backoff.interval-ms.
* yarn.scheduler.capacity.<queue-path>.app-backoff.missed-threshold: Number of
missed opportunities before backoff for a specific queue. If not specified, it
inherits the global setting from
yarn.scheduler.capacity.app-backoff.missed-threshold.
Queue-specific configurations take precedence over global configurations. If a
queue-specific configuration is not set, the queue will inherit the global
configuration values.
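
The precedence rule can be sketched as a small lookup: try the queue-specific key first, then the global key, then the built-in default. This is only an illustration under stated assumptions: the class BackoffConfSketch and its resolve method are hypothetical, a plain Map stands in for the scheduler configuration, and the queue paths root.batch and root.default are example values.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of queue-over-global lookup; not CapacityScheduler code. */
final class BackoffConfSketch {
  static final String PREFIX = "yarn.scheduler.capacity.";

  /**
   * Resolves a setting such as "app-backoff.interval-ms" for a queue:
   * queue-specific value first, then the global value, then the default.
   */
  static String resolve(Map<String, String> conf, String queuePath,
      String suffix, String defaultValue) {
    String queueValue = conf.get(PREFIX + queuePath + "." + suffix);
    if (queueValue != null) {
      return queueValue;
    }
    return conf.getOrDefault(PREFIX + suffix, defaultValue);
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    // Global: enable backoff everywhere.
    conf.put(PREFIX + "app-backoff.enabled", "true");
    // Queue-specific override for the example queue root.batch.
    conf.put(PREFIX + "root.batch.app-backoff.interval-ms", "10000");

    System.out.println(resolve(conf, "root.batch", "app-backoff.interval-ms", "3000"));
    System.out.println(resolve(conf, "root.default", "app-backoff.interval-ms", "3000"));
  }
}
```

With this lookup, root.batch uses its own interval while root.default falls back to the documented default of 3000 ms, and both queues pick up the global enabled flag.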