[jira] [Updated] (FLINK-24386) JobMaster should guard against exceptions from OperatorCoordinator

2024-07-10 Thread Weijie Guo (Jira)


 [ https://issues.apache.org/jira/browse/FLINK-24386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weijie Guo updated FLINK-24386:
---
Fix Version/s: 2.0.0
   (was: 1.20.0)

> JobMaster should guard against exceptions from OperatorCoordinator
> --
>
> Key: FLINK-24386
> URL: https://issues.apache.org/jira/browse/FLINK-24386
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.14.0, 1.13.2
>Reporter: David Morávek
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>
> Original report from [~sewen]:
> When the scheduler processes the call to trigger a _globalFailover_
>  and something goes wrong in there, the _JobManager_ gets stuck. Concretely, 
> I have an _OperatorCoordinator_ that throws an exception in 
> _subtaskFailed()_, which gets called as part of processing the failover.
> While this is a bug in that coordinator, the whole thing seems a bit 
> dangerous to me. If there is a bug in any part of the failover logic, we 
> have no safety net: there is no "hard crash" that lets the process be 
> restarted. We only see a log line (below) and everything goes unresponsive.
> {code:java}
> ERROR org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor [] - Caught 
> exception while executing runnable in main thread.
> {code}
> Shouldn't we have some safety nets in place here?
>  * I am wondering if the place where that line is logged should actually 
> invoke the fatal error handler. If an exception propagates out of a main 
> thread action, we need to call off all bets and assume things have gotten 
> inconsistent.
>  * At the very least, the failover procedure itself should be guarded. If an 
> error happens while processing the global failover, then we need to treat 
> this as beyond redemption and declare a fatal error.
> The fatal error would give us a log line and the user a container restart, 
> hopefully fixing things (unless it was a deterministic error).
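> As a minimal sketch of the first idea (an assumption for illustration, not 
> the actual JobMaster code): wrap main-thread actions so that any Throwable 
> escaping them reaches the fatal error handler instead of only being logged. 
> FatalErrorHandler is Flink's existing interface; the wrapper class itself is 
> hypothetical.
> {code:java}
> import java.util.concurrent.Executor;
>
> import org.apache.flink.runtime.rpc.FatalErrorHandler;
>
> /** Hypothetical guard around main-thread actions (illustration only). */
> final class GuardedMainThreadExecutor implements Executor {
>
>     private final Executor mainThreadExecutor;
>     private final FatalErrorHandler fatalErrorHandler;
>
>     GuardedMainThreadExecutor(Executor mainThreadExecutor, FatalErrorHandler fatalErrorHandler) {
>         this.mainThreadExecutor = mainThreadExecutor;
>         this.fatalErrorHandler = fatalErrorHandler;
>     }
>
>     @Override
>     public void execute(Runnable action) {
>         mainThreadExecutor.execute(() -> {
>             try {
>                 action.run();
>             } catch (Throwable t) {
>                 // An exception escaping a main-thread action may have left the
>                 // scheduler state inconsistent; escalate instead of only logging.
>                 fatalErrorHandler.onFatalError(t);
>             }
>         });
>     }
> }
> {code}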
> [~dmvk] notes:
>  * OperatorCoordinator is part of the public API (it is part of the JobGraph).
>  ** It can be provided by implementing CoordinatedOperatorFactory.
>  ** This actually gives the issue a higher priority than I initially thought.
>  * We should guard against flaws in user code:
>  ** There are two types of interfaces:
>  *** (CRITICAL) Public API for JobGraph construction / submission
>  *** Semi-public interfaces such as custom HA services; these are for power 
> users, so I wouldn't be as concerned there.
>  ** We already do a good job of guarding against failures on the TM side.
>  ** Considering the critical parts on the JM side, there are two places where 
> user code can "hook" in:
>  *** OperatorCoordinator
>  *** InitializeOnMaster, FinalizeOnMaster (batch sinks only, a legacy from the 
> Hadoop world)
> --- 
> We should audit all the calls to OperatorCoordinator and handle failures 
> accordingly. We want to avoid unnecessary JVM terminations as much as 
> possible (sometimes it's the only option though).
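> As one hedged sketch of what such a guard could look like (the helper class 
> and its wiring are assumptions for illustration, not the actual fix), the 
> code below wraps a single coordinator callback and escalates any Throwable 
> as a global job failure instead of letting it escape into the RPC main 
> thread. It assumes the scheduler's GlobalFailureHandler is available at the 
> call site and uses the subtaskFailed() signature from the affected versions.
> {code:java}
> import javax.annotation.Nullable;
>
> import org.apache.flink.runtime.operators.coordination.OperatorCoordinator;
> import org.apache.flink.runtime.scheduler.GlobalFailureHandler;
> import org.apache.flink.util.FlinkException;
>
> /** Hypothetical guard for a single coordinator callback (illustration only). */
> final class CoordinatorCallGuard {
>
>     private final GlobalFailureHandler globalFailureHandler;
>
>     CoordinatorCallGuard(GlobalFailureHandler globalFailureHandler) {
>         this.globalFailureHandler = globalFailureHandler;
>     }
>
>     void notifySubtaskFailed(OperatorCoordinator coordinator, int subtask, @Nullable Throwable reason) {
>         try {
>             coordinator.subtaskFailed(subtask, reason);
>         } catch (Throwable t) {
>             // A buggy coordinator must not stall the failover or leave the
>             // JobMaster unresponsive; surface the problem as a job-level failure.
>             globalFailureHandler.handleGlobalFailure(
>                     new FlinkException("OperatorCoordinator threw in subtaskFailed()", t));
>         }
>     }
> }
> {code}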



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-24386) JobMaster should guard against exceptions from OperatorCoordinator

2024-03-11 Thread lincoln lee (Jira)


 [ https://issues.apache.org/jira/browse/FLINK-24386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lincoln lee updated FLINK-24386:

Fix Version/s: (was: 1.19.0)



[jira] [Updated] (FLINK-24386) JobMaster should guard against exceptions from OperatorCoordinator

2024-03-11 Thread lincoln lee (Jira)


 [ https://issues.apache.org/jira/browse/FLINK-24386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lincoln lee updated FLINK-24386:

Fix Version/s: 1.20.0



[jira] [Updated] (FLINK-24386) JobMaster should guard against exceptions from OperatorCoordinator

2023-10-13 Thread Jing Ge (Jira)


 [ https://issues.apache.org/jira/browse/FLINK-24386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Ge updated FLINK-24386:

Fix Version/s: 1.19.0
   (was: 1.18.0)



[jira] [Updated] (FLINK-24386) JobMaster should guard against exceptions from OperatorCoordinator

2023-03-23 Thread Xintong Song (Jira)


 [ https://issues.apache.org/jira/browse/FLINK-24386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xintong Song updated FLINK-24386:
-
Fix Version/s: 1.18.0
   (was: 1.17.0)



[jira] [Updated] (FLINK-24386) JobMaster should guard against exceptions from OperatorCoordinator

2022-12-20 Thread ASF GitHub Bot (Jira)


 [ https://issues.apache.org/jira/browse/FLINK-24386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated FLINK-24386:
---
Labels: pull-request-available  (was: )



[jira] [Updated] (FLINK-24386) JobMaster should guard against exceptions from OperatorCoordinator

2022-09-30 Thread Huang Xingbo (Jira)


 [ https://issues.apache.org/jira/browse/FLINK-24386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Huang Xingbo updated FLINK-24386:
-
Fix Version/s: 1.17.0
   (was: 1.16.0)



[jira] [Updated] (FLINK-24386) JobMaster should guard against exceptions from OperatorCoordinator

2022-04-13 Thread Yun Gao (Jira)


 [ https://issues.apache.org/jira/browse/FLINK-24386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yun Gao updated FLINK-24386:

Fix Version/s: 1.16.0



[jira] [Updated] (FLINK-24386) JobMaster should guard against exceptions from OperatorCoordinator

2021-09-27 Thread Till Rohrmann (Jira)


 [ https://issues.apache.org/jira/browse/FLINK-24386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann updated FLINK-24386:
--
Fix Version/s: 1.15.0
