Stephan Ewen created FLINK-1668:
-----------------------------------
Summary: Add a config option to specify delays between restarts
Key: FLINK-1668
URL: https://issues.apache.org/jira/browse/FLINK-1668
Project: Flink
Issue Type: Improvement
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
Fix For: 0.9
The system currently introduces a short delay between a failed task execution
and the restarted execution.
The reason is that this delay seemed to help in letting problems surface that
let to the failed task. As an example, if a TaskManager fails, tasks fail due
to data transfer errors. The TaskManager is not immediately recognized as
failed, though (takes a bit until heartbeats time out). Immediately
re-deploying tasks has a very high chance of assigning work to the TaskManager
that is actually not responding, causing the execution retry to fail again. The
delay gives the system time to figure out that the TaskManager was lost and
does not take it into account upon the retry.
Currently, the system uses the heartbeat timeout as the default delay value.
This may make sense as a default value for critical task failures, but is
actually quite high for other types of failures.
In any case, I would like to add an option for users to specify the delay (even
set it to 0, if desired).
The delay is not the best solution, in my opinion, we should eventually move to
something better. Ideas are to put TaskManagers responsible for failed tasks in
a "probationary" mode until they have reported back that everything is good
(still alive, disk space available, etc)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)