[
https://issues.apache.org/jira/browse/FLINK-28131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zhu Zhu closed FLINK-28131.
---
Release Note:
Speculative execution (FLIP-168) is introduced in Flink 1.16 to mitigate batch
job slowness caused by problematic nodes. A problematic node may have
hardware problems, unexpectedly busy I/O, or high CPU load. These problems may
make the hosted tasks run much slower than tasks on other nodes and increase
the overall execution time of a batch job.
When speculative execution is enabled, Flink continuously checks for slow
tasks. Once slow tasks are detected, the nodes on which they are running are
identified as problematic nodes and blocked via the blocklist
mechanism (FLIP-224). The scheduler creates new attempts for the slow tasks
and deploys them to nodes that are not blocked, while the existing attempts
keep running. The new attempts process the same input data and produce the same
data as the original attempt. The first attempt to finish is admitted as the
only finished attempt of the task, and the remaining attempts of the task are
canceled.
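The "first attempt to finish wins, the rest are canceled" rule described above
can be illustrated with a minimal, self-contained Java sketch. It does not use
any Flink API and is not how the Flink scheduler is implemented. The class
name SpeculativeAttemptsSketch and the methods firstFinished and attempt are
hypothetical names chosen for illustration, and the sleep times merely stand in
for a slow node and a healthy node.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustration only: plain JDK concurrency, no Flink classes involved.
public class SpeculativeAttemptsSketch {

    /**
     * Runs all attempts concurrently and returns the result of the first
     * attempt that completes successfully. ExecutorService.invokeAny cancels
     * the attempts that have not completed when it returns.
     */
    static <T> T firstFinished(List<Callable<T>> attempts) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(attempts.size());
        try {
            return pool.invokeAny(attempts);
        } finally {
            // Interrupt anything still running and release the threads.
            pool.shutdownNow();
        }
    }

    /** Hypothetical attempt: sleeps for the given time, then returns its name. */
    static Callable<String> attempt(String name, long millis) {
        return () -> {
            Thread.sleep(millis);
            return name;
        };
    }

    public static void main(String[] args) throws Exception {
        // Both attempts stand for the same task processing the same input.
        String admitted = firstFinished(List.of(
                attempt("original attempt (slow node)", 2000),
                attempt("speculative attempt (healthy node)", 50)));
        System.out.println("Admitted attempt: " + admitted);
    }
}
```

In this sketch the slower attempt is interrupted as soon as the faster one
returns, which mirrors the cancellation of the remaining attempts.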
Most existing sources can work with speculative execution (FLIP-245). Only a
source that uses SourceEvent must additionally implement
SupportsHandleExecutionAttemptSourceEvent to support speculative execution.
Sinks do not support speculative execution yet, so speculative execution
will not happen on sinks at the moment.
The Web UI and REST API are also improved (FLIP-249) to display multiple
concurrent attempts of tasks and blocked task managers.
Resolution: Done
> FLIP-168: Speculative Execution for Batch Job
> -
>
> Key: FLINK-28131
> URL: https://issues.apache.org/jira/browse/FLINK-28131
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / Coordination
> Reporter: Zhu Zhu
> Assignee: Zhu Zhu
> Priority: Major
> Fix For: 1.16.0
>
>
> Speculative execution helps mitigate slow tasks caused by
> problematic nodes. The basic idea is to start mirror tasks on other nodes
> when a slow task is detected. The mirror task processes the same input data
> and produces the same data as the original task.
> More details can be found in
> [FLIP-168|https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+Execution+for+Batch+Job].
>
> This is the umbrella ticket to track all the changes of this feature.