[ https://issues.apache.org/jira/browse/FLINK-24386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Huang Xingbo updated FLINK-24386:
---------------------------------
    Fix Version/s: 1.17.0
                       (was: 1.16.0)

> JobMaster should guard against exceptions from OperatorCoordinator
> ------------------------------------------------------------------
>
>                 Key: FLINK-24386
>                 URL: https://issues.apache.org/jira/browse/FLINK-24386
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.0, 1.13.2
>            Reporter: David Morávek
>            Priority: Major
>             Fix For: 1.17.0
>
>
> Original report from [~sewen]:
> When the scheduler processes the call to trigger a _globalFailover_
>  and something goes wrong in there, the _JobManager_ gets stuck. Concretely, 
> I have an _OperatorCoordinator_ that throws an exception in 
> _subtaskFailed()_, which gets called as part of processing the failover.
> While this is a bug in that coordinator, the whole thing seems a bit 
> dangerous to me. If there is some bug in any part of the failover logic, we 
> have no safety net: no "hard crash" that lets the process be restarted. We 
> only see a log line (below) and everything goes unresponsive.
> {code:java}
> ERROR org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor [] - Caught exception while executing runnable in main thread.
> {code}
> Shouldn't we have some safety nets in place here?
>  * I am wondering if the place where that line is logged should actually 
> invoke the fatal error handler. If an exception propagates out of a main 
> thread action, we need to call off all bets and assume things have gotten 
> inconsistent (see the sketch after this list).
>  * At the very least, the failover procedure itself should be guarded. If an 
> error happens while processing the global failover, then we need to treat 
> this as beyond redemption and declare a fatal error.
> The fatal error would give us a log line and the user a container restart, 
> hopefully fixing things (unless it was a deterministic error).
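> A minimal sketch of that safety net, assuming a simplified stand-in for the 
> main-thread executor and the fatal error handler (these are not Flink's 
> actual FencedAkkaRpcActor internals):
> {code:java}
> import java.util.concurrent.Executor;
> 
> // Hypothetical wrapper: runs main-thread actions and escalates any escaping
> // Throwable to a fatal error handler instead of only logging it.
> public final class GuardedMainThreadExecutor implements Executor {
> 
>     /** Stand-in for a fatal error handler, e.g. one that terminates the process. */
>     public interface FatalErrorHandler {
>         void onFatalError(Throwable error);
>     }
> 
>     private final Executor mainThread;
>     private final FatalErrorHandler fatalErrorHandler;
> 
>     public GuardedMainThreadExecutor(Executor mainThread, FatalErrorHandler handler) {
>         this.mainThread = mainThread;
>         this.fatalErrorHandler = handler;
>     }
> 
>     @Override
>     public void execute(Runnable action) {
>         mainThread.execute(() -> {
>             try {
>                 action.run();
>             } catch (Throwable t) {
>                 // Today this point only produces the ERROR log line above; the
>                 // proposal is to assume the master's state may be inconsistent
>                 // and declare a fatal error instead of carrying on.
>                 fatalErrorHandler.onFatalError(t);
>             }
>         });
>     }
> }
> {code}
> With such a guard in place, an exception thrown while processing the global 
> failover (e.g. from a coordinator callback) would restart the process rather 
> than leave the JobManager unresponsive.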
> [~dmvk] notes:
>  * OperatorCoordinator is part of the public API (it is part of the JobGraph).
>  ** It can be provided by implementing CoordinatedOperatorFactory.
>  ** This actually gives the issue a higher priority than I initially thought.
>  * We should guard against flaws in user code:
>  ** There are two types of interfaces:
>  *** (CRITICAL) Public API for JobGraph construction / submission
>  *** Semi-public interfaces such as custom HA services; these are for power 
> users, so I wouldn't be as concerned there.
>  ** We already do a good job of guarding against failures on the TM side.
>  ** Considering the critical parts on the JM side, there are two places where 
> user code can "hook" in:
>  *** OperatorCoordinator
>  *** InitializeOnMaster, FinalizeOnMaster (batch sinks only, legacy from the 
> Hadoop world)
> --- 
> We should audit all the calls to OperatorCoordinator and handle failures 
> accordingly. We want to avoid unnecessary JVM terminations as much as 
> possible (though sometimes that is the only option).
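> As a rough sketch of what such guarding could look like, assuming hypothetical 
> placeholders for the JobMaster-side plumbing (the callback type and the 
> failJob/terminateJvm hooks below are not existing Flink APIs):
> {code:java}
> import java.util.function.Consumer;
> 
> // Hypothetical guard around a single coordinator callback: user-code failures
> // fail the job, only Errors escalate to terminating the JVM.
> public final class CoordinatorCallGuard {
> 
>     @FunctionalInterface
>     public interface CoordinatorCallback {
>         void run() throws Exception;
>     }
> 
>     private final Consumer<Throwable> failJob;      // trigger a job failover
>     private final Consumer<Throwable> terminateJvm; // last resort
> 
>     public CoordinatorCallGuard(Consumer<Throwable> failJob, Consumer<Throwable> terminateJvm) {
>         this.failJob = failJob;
>         this.terminateJvm = terminateJvm;
>     }
> 
>     public void call(CoordinatorCallback callback) {
>         try {
>             callback.run();
>         } catch (Exception e) {
>             // A flaw in the user's OperatorCoordinator should not take down
>             // the whole JobManager process.
>             failJob.accept(e);
>         } catch (Throwable t) {
>             // Errors (OutOfMemoryError, etc.) still warrant killing the process.
>             terminateJvm.accept(t);
>         }
>     }
> }
> {code}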



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
