[ 
https://issues.apache.org/jira/browse/FLINK-10288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN updated FLINK-10288:
----------------------------
    Fix Version/s: 1.7.0

> Failover Strategy improvement
> -----------------------------
>
>                 Key: FLINK-10288
>                 URL: https://issues.apache.org/jira/browse/FLINK-10288
>             Project: Flink
>          Issue Type: Improvement
>          Components: JobManager
>            Reporter: JIN SUN
>            Assignee: JIN SUN
>            Priority: Major
>             Fix For: 1.7.0
>
>
> Flink puts significant effort into making streaming jobs fault tolerant. The 
> checkpoint mechanism and exactly-once semantics set Flink apart from other 
> systems. However, there are still some cases that are not handled well. These 
> cases apply to both streaming and batch scenarios, and they are orthogonal to 
> the current fault tolerance mechanism. Here is a summary of those cases:
>  # Some failures are non-recoverable, such as a user error: 
> DivideByZeroException. We shouldn't try to restart the task, as it will never 
> succeed. DivideByZeroException is just a simple case; such errors are 
> sometimes hard to reproduce or predict, since they may only be triggered by 
> specific input data. We shouldn't retry on user-code exceptions at all.
>  # There is no limit on task retries today: unless a SuppressRestartsException 
> is encountered, a task will keep retrying until it succeeds. As mentioned 
> above, some cases shouldn't be retried at all, and for the exceptions we can 
> retry, such as a network exception, we should have a retry limit. We need to 
> retry on transient issues, but we also need a limit to avoid infinite retries 
> and wasted resources. Batch and streaming workloads might need different 
> strategies.
>  # Some exceptions are caused by hardware issues, such as disk or network 
> malfunction. When a task or TaskManager fails for such a reason, we should 
> detect this and avoid scheduling onto that machine next time (a rough sketch 
> of such failure classification, retry limits, and blacklisting follows this 
> list).
>  # If a task reads from a blocking result partition and its input is not 
> available, we can 'revoke' the producing task: mark it as failed and rerun the 
> upstream task to regenerate the data. The revoke can propagate up through the 
> chain. In Spark, revoking is naturally supported by lineage.
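> A rough sketch of how such a classification could drive the restart decision, 
> the retry limit, and the machine blacklist (purely illustrative; the class and 
> method names below are made up, not existing Flink APIs):
> {code:java}
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
> 
> /** Hypothetical classification of a task failure. */
> enum FailureType {
>     NON_RECOVERABLE,   // user error, e.g. ArithmeticException from a divide by zero; never retry
>     ENVIRONMENT,       // disk/network malfunction; retry, but avoid the same machine
>     RECOVERABLE        // transient issue, e.g. a network timeout; retry up to a limit
> }
> 
> class FailureClassifier {
> 
>     static FailureType classify(Throwable t) {
>         if (t instanceof ArithmeticException || t instanceof NullPointerException) {
>             return FailureType.NON_RECOVERABLE;           // user-code bug, restarting cannot help
>         }
>         if (t instanceof java.io.IOException) {
>             return FailureType.ENVIRONMENT;               // likely a disk/network problem on the host
>         }
>         return FailureType.RECOVERABLE;                   // assume transient by default
>     }
> }
> 
> /** Hypothetical per-task retry budget plus a simple host blacklist. */
> class RetryPolicy {
> 
>     private final int maxAttempts;
>     private final Map<String, Integer> failuresPerHost = new ConcurrentHashMap<>();
> 
>     RetryPolicy(int maxAttempts) {
>         this.maxAttempts = maxAttempts;
>     }
> 
>     /** Decide whether to restart a task after it failed on the given host. */
>     boolean shouldRestart(Throwable cause, int attempt, String host) {
>         FailureType type = FailureClassifier.classify(cause);
>         if (type == FailureType.NON_RECOVERABLE) {
>             return false;                                 // fail the job, don't waste resources
>         }
>         if (type == FailureType.ENVIRONMENT) {
>             failuresPerHost.merge(host, 1, Integer::sum); // remember the suspicious machine
>         }
>         return attempt < maxAttempts;                     // cap retries to avoid infinite restarts
>     }
> 
>     /** Hosts with repeated environment failures should be avoided when rescheduling. */
>     boolean isBlacklisted(String host) {
>         return failuresPerHost.getOrDefault(host, 0) >= 3;
>     }
> }
> {code}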
> To make fault tolerance easier, we need to keep behavior as deterministic as 
> possible. For user code this is not easy to control, but for system-related 
> code we can fix it. For example, we should at least make sure that different 
> attempts of the same task receive the same inputs (the current codebase has a 
> bug in DataSourceTask that cannot guarantee this). Note that this is tracked 
> by [FLINK-10205].
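> To illustrate the deterministic-inputs point (purely illustrative, not the 
> actual DataSourceTask code): if input splits are handed out in request order, 
> a restarted attempt may receive different splits than the original attempt, 
> whereas deriving the splits from the task index keeps every attempt's input 
> stable.
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> 
> class SplitAssignment {
> 
>     private static int nextSplit = 0;
> 
>     /** Non-deterministic: whichever task asks first gets the next split, so a
>      *  restarted attempt can read different data than its first attempt did. */
>     static synchronized int assignOnRequest() {
>         return nextSplit++;
>     }
> 
>     /** Deterministic: the splits depend only on the task index and parallelism,
>      *  so every attempt of the same task reads exactly the same input. */
>     static List<Integer> assignByTaskIndex(int taskIndex, int parallelism, int numSplits) {
>         List<Integer> splits = new ArrayList<>();
>         for (int s = taskIndex; s < numSplits; s += parallelism) {
>             splits.add(s);
>         }
>         return splits;
>     }
> }
> {code}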
> See this proposal for details:
> [https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit?usp=sharing]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
