[ https://issues.apache.org/jira/browse/FLINK-10288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ryantaocer reassigned FLINK-10288:
----------------------------------

    Assignee: ryantaocer  (was: JIN SUN)

> Failover Strategy improvement
> -----------------------------
>
>                 Key: FLINK-10288
>                 URL: https://issues.apache.org/jira/browse/FLINK-10288
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: JIN SUN
>            Assignee: ryantaocer
>            Priority: Major
>
> Flink puts significant effort into making streaming jobs fault tolerant. The checkpoint mechanism and exactly-once semantics set Flink apart from other systems. However, some cases are still not handled well. These cases apply to both streaming and batch scenarios, and they are orthogonal to the current fault-tolerance mechanism. Here is a summary of those cases:
> # Some failures are non-recoverable, such as a user error like a DivideByZeroException. We shouldn't restart the task, as it will never succeed. DivideByZeroException is just a simple case; such errors are sometimes hard to reproduce or predict, since they may only be triggered by specific input data, so we shouldn't retry for any user-code exception (see the classification sketch appended below).
> # There is no limit on task retries today: unless a SuppressRestartException is encountered, a task keeps retrying until it succeeds. As mentioned above, some cases shouldn't be retried at all, and for the exceptions that are worth retrying, such as network exceptions, should we have a retry limit? We need to retry on transient issues, but we also need a limit to avoid infinite retries and wasted resources. Batch and streaming workloads may need different strategies (see the restart-strategy example appended below).
> # Some exceptions are due to hardware issues, such as disk or network malfunction. When a task or TaskManager fails because of these, we had better detect the bad machine and avoid scheduling onto it next time (see the blacklist sketch appended below).
> # If a task reads from a blocking result partition and its input is not available, we can "revoke" the producer task: mark it as failed and rerun the upstream task to regenerate the data. The revocation can propagate up through the chain (see the propagation sketch appended below). In Spark, revocation is naturally supported by lineage.
> To make fault tolerance easier, we need to keep behavior as deterministic as possible. For user code this is not easy to control, but for system-level code we can fix it. For example, we should at least make sure that different attempts of the same task get the same inputs (the current codebase has a bug in DataSourceTask that cannot guarantee this; see the split re-assignment sketch appended below). Note that this is tracked by FLINK-10205.
> Details are in this proposal:
> [https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit?usp=sharing]
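>
> A minimal sketch of the classification idea from point 1, assuming a hypothetical {{FailureClassifier}} helper (not an existing Flink API). A real implementation would likely classify by where the exception originated (user code vs. framework code) rather than by exception type alone:
> {code:java}
> import java.io.IOException;
> import java.util.concurrent.TimeoutException;
>
> /** Hypothetical classifier; illustrative only. */
> public final class FailureClassifier {
>
>     /** Coarse failure categories used to pick a failover action. */
>     public enum FailureType {
>         NON_RECOVERABLE,  // user-code errors: retrying the same input cannot help
>         RECOVERABLE,      // transient system issues: retry, bounded by a limit
>         ENVIRONMENT       // suspected hardware problems: retry on another machine
>     }
>
>     public static FailureType classify(Throwable t) {
>         // ArithmeticException covers the DivideByZeroException example above.
>         if (t instanceof ArithmeticException || t instanceof NullPointerException) {
>             return FailureType.NON_RECOVERABLE;
>         }
>         // Transient network/disk trouble is worth a bounded number of retries.
>         if (t instanceof TimeoutException || t instanceof IOException) {
>             return FailureType.RECOVERABLE;
>         }
>         // Default: be conservative and retry, bounded by the restart strategy.
>         return FailureType.RECOVERABLE;
>     }
> }
> {code}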
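>
> For the retry limit in point 2, Flink already exposes a job-level knob through its restart strategies; the proposal goes further (per-failure-type limits), but the existing API already bounds the number of attempts:
> {code:java}
> import java.util.concurrent.TimeUnit;
>
> import org.apache.flink.api.common.restartstrategy.RestartStrategies;
> import org.apache.flink.api.common.time.Time;
> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>
> public class BoundedRetryExample {
>     public static void main(String[] args) throws Exception {
>         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>
>         // At most 3 restart attempts, 10 seconds apart; afterwards the job fails for good.
>         env.setRestartStrategy(
>                 RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));
>
>         // Alternative: at most 3 failures within a 5-minute window, 10 seconds between restarts.
>         // env.setRestartStrategy(RestartStrategies.failureRateRestart(
>         //         3, Time.of(5, TimeUnit.MINUTES), Time.of(10, TimeUnit.SECONDS)));
>     }
> }
> {code}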
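>
> For point 3, a sketch of per-host failure bookkeeping the scheduler could consult; {{HostBlacklist}} is a made-up name, not an existing Flink component:
> {code:java}
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
>
> /** Hypothetical blacklist tracker; illustrative only. */
> public final class HostBlacklist {
>
>     private final int maxFailures;
>     private final Map<String, Integer> failuresPerHost = new ConcurrentHashMap<>();
>
>     public HostBlacklist(int maxFailures) {
>         this.maxFailures = maxFailures;
>     }
>
>     /** Record a failure that looks environmental (disk/network) on the given host. */
>     public void recordEnvironmentFailure(String host) {
>         failuresPerHost.merge(host, 1, Integer::sum);
>     }
>
>     /** The scheduler should avoid hosts for which this returns true. */
>     public boolean isBlacklisted(String host) {
>         return failuresPerHost.getOrDefault(host, 0) >= maxFailures;
>     }
> }
> {code}
> A production version would also expire entries over time so that a repaired machine can rejoin scheduling.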
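>
> For point 4, a sketch of how revocation could propagate upstream when a blocking result partition has been lost; all types and methods here are hypothetical:
> {code:java}
> import java.util.List;
>
> /** Hypothetical task vertex; illustrative only. */
> interface TaskVertex {
>     List<TaskVertex> producers();        // upstream tasks this task consumes from
>     boolean resultPartitionAvailable();  // is the blocking result still readable?
>     void failAndReschedule();            // mark as failed and rerun to regenerate data
> }
>
> final class Revocation {
>
>     /**
>      * Walk upstream from a consumer whose input is gone: any producer whose
>      * blocking result partition is no longer available is failed and rerun,
>      * recursing further up the chain first in case its own inputs are also
>      * gone (comparable to recomputing lost partitions via lineage in Spark).
>      */
>     static void revokeMissingInputs(TaskVertex consumer) {
>         for (TaskVertex producer : consumer.producers()) {
>             if (!producer.resultPartitionAvailable()) {
>                 revokeMissingInputs(producer);
>                 producer.failAndReschedule();
>             }
>         }
>     }
> }
> {code}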
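>
> For the determinism point, a sketch of pinning input splits to the first attempt so that a retried task re-reads exactly the same data; {{DeterministicSplitAssigner}} and {{SplitSource}} are made-up names, not the actual DataSourceTask fix:
> {code:java}
> import java.util.List;
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
>
> /** Hypothetical split assigner; illustrative only. */
> public final class DeterministicSplitAssigner<S> {
>
>     /** Hypothetical source of input splits. */
>     public interface SplitSource<S> {
>         List<S> nextSplitsFor(int taskIndex);
>     }
>
>     // Splits handed to each task index on its first attempt.
>     private final Map<Integer, List<S>> firstAttemptSplits = new ConcurrentHashMap<>();
>     private final SplitSource<S> source;
>
>     public DeterministicSplitAssigner(SplitSource<S> source) {
>         this.source = source;
>     }
>
>     /** Later attempts of the same task replay exactly the splits of attempt 0. */
>     public List<S> splitsFor(int taskIndex, int attemptNumber) {
>         if (attemptNumber == 0) {
>             List<S> splits = source.nextSplitsFor(taskIndex);
>             firstAttemptSplits.put(taskIndex, splits);
>             return splits;
>         }
>         return firstAttemptSplits.get(taskIndex);
>     }
> }
> {code}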