[jira] [Commented] (FLINK-15087) JobManager is forced to shutdown JVM due to temporary loss of zookeeper connection
[ https://issues.apache.org/jira/browse/FLINK-15087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989515#comment-16989515 ]

lamber-ken commented on FLINK-15087:

(y)

> JobManager is forced to shutdown JVM due to temporary loss of zookeeper connection
> --
>
>                 Key: FLINK-15087
>                 URL: https://issues.apache.org/jira/browse/FLINK-15087
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.8.2
>            Reporter: Abdul Qadeer
>            Priority: Major
>
> While testing I found that the loss of connection with zookeeper triggers JVM
> shutdown for Job Manager, when started through
> "StandaloneSessionClusterEntrypoint". This happens due to a NPE on
> "taskManagerHeartbeatManager."
> When JobManagerRunner suspends jobMasterService (as Job manager is no longer
> leader), "taskManagerHeartbeatManager" is set to null in
> "stopHeartbeatServices".
> Next, "AkkaRpcActor" stops JobMaster and throws NPE in the following method:
> {code:java}
> @Override
> public CompletableFuture disconnectTaskManager(final ResourceID resourceID, final Exception cause) {
>    log.debug("Disconnect TaskExecutor {} because: {}", resourceID, cause.getMessage());
>    taskManagerHeartbeatManager.unmonitorTarget(resourceID);
>    slotPool.releaseTaskManager(resourceID, cause);
> {code}
>
> This leads to a fatal error finally in "ClusterEntryPoint.onFatalError()" and
> forces JVM shutdown.
> The stack trace is below:
>
> {noformat}
> {"timeMillis":1575581120723,"thread":"flink-akka.actor.default-dispatcher-93","level":"ERROR","loggerName":"com.Sample","message":"Failed
> to take leadership with session id
> b4662db5-f065-41d9-aaaf-78625355b251.","thrown":{"commonElementCount":0,"localizedMessage":"Failed
> to take leadership with session id
> b4662db5-f065-41d9-aaaf-78625355b251.","message":"Failed to take leadership
> with session id
> b4662db5-f065-41d9-aaaf-78625355b251.","name":"org.apache.flink.runtime.dispatcher.DispatcherException","cause":{"commonElementCount":18,"localizedMessage":"Termination
> of previous JobManager for job bbb8c430787d92293e9d45c349231d9c failed.
> Cannot submit job under the same job id.","message":"Termination of previous
> JobManager for job bbb8c430787d92293e9d45c349231d9c failed. Cannot submit job
> under the same job
> id.","name":"org.apache.flink.runtime.dispatcher.DispatcherException","cause":{"commonElementCount":6,"localizedMessage":"org.apache.flink.util.FlinkException:
> Could not properly shut down the
> JobManagerRunner","message":"org.apache.flink.util.FlinkException: Could not
> properly shut down the
> JobManagerRunner","name":"java.util.concurrent.CompletionException","cause":{"commonElementCount":6,"localizedMessage":"Could
> not properly shut down the JobManagerRunner","message":"Could not properly
> shut down the
> JobManagerRunner","name":"org.apache.flink.util.FlinkException","cause":{"commonElementCount":13,"localizedMessage":"Failure
> while stopping RpcEndpoint jobmanager_0.","message":"Failure while stopping
> RpcEndpoint
>
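
The sequence described in the report (field set to null on suspend, then dereferenced while the endpoint is being stopped) can be modelled in a few lines. This is only a minimal sketch of the failure mode, not Flink code: SketchHeartbeatManager and UnguardedJobMasterSketch are hypothetical stand-ins, and resource IDs are plain strings for brevity.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a heartbeat manager; this is not Flink's real API.
class SketchHeartbeatManager {
    private final Set<String> monitored = new HashSet<>();

    void monitorTarget(String resourceId) {
        monitored.add(resourceId);
    }

    void unmonitorTarget(String resourceId) {
        monitored.remove(resourceId);
    }
}

// Hypothetical, heavily simplified model of the lifecycle described in the report.
class UnguardedJobMasterSketch {
    private SketchHeartbeatManager taskManagerHeartbeatManager = new SketchHeartbeatManager();

    // Models the reported behavior of stopHeartbeatServices: the field becomes null.
    void suspend() {
        taskManagerHeartbeatManager = null;
    }

    // Models the reported call site: the field is dereferenced without a null check.
    void disconnectTaskManager(String resourceId) {
        taskManagerHeartbeatManager.unmonitorTarget(resourceId);
    }

    public static void main(String[] args) {
        UnguardedJobMasterSketch jobMaster = new UnguardedJobMasterSketch();
        jobMaster.suspend(); // leadership was lost, heartbeat services stopped
        try {
            // The stop path still tries to disconnect task managers.
            jobMaster.disconnectTaskManager("tm-1");
        } catch (NullPointerException e) {
            System.out.println("NullPointerException while disconnecting after suspend");
        }
    }
}
```

In the sketch the exception is caught and printed; in the report it propagates as a failure while stopping the RpcEndpoint and ends in the fatal error handler.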
[jira] [Commented] (FLINK-15087) JobManager is forced to shutdown JVM due to temporary loss of zookeeper connection
[ https://issues.apache.org/jira/browse/FLINK-15087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989503#comment-16989503 ]

Abdul Qadeer commented on FLINK-15087:
--

It looks like this is not exactly that issue but FLINK-14315. I will close this.

[jira] [Commented] (FLINK-15087) JobManager is forced to shutdown JVM due to temporary loss of zookeeper connection
[ https://issues.apache.org/jira/browse/FLINK-15087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989482#comment-16989482 ]

lamber-ken commented on FLINK-15087:

Hi, this may be a duplicate of FLINK-10052.

[jira] [Commented] (FLINK-15087) JobManager is forced to shutdown JVM due to temporary loss of zookeeper connection
[ https://issues.apache.org/jira/browse/FLINK-15087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989350#comment-16989350 ]

Abdul Qadeer commented on FLINK-15087:
--

[~trohrm...@apache.org] I would like to contribute a fix for this. Adding a null check in [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L428] is a simple fix that I have tested and that works. However, I would like to know whether there is any other way to fix it.

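
For illustration, the null check mentioned above and one possible alternative (a no-op object stored instead of null) might look like the sketch below. Everything here is an assumption made for the example: HeartbeatTargets, NoOpHeartbeatTargets and the two JobMaster sketch classes are hypothetical names, resource IDs are plain strings, and nothing in this thread says which approach, if any, was adopted in Flink.

```java
// Hypothetical stand-in for a heartbeat manager; not Flink's real interface.
interface HeartbeatTargets {
    void unmonitorTarget(String resourceId);
}

// Null-object variant: safe to call after heartbeat services have stopped.
final class NoOpHeartbeatTargets implements HeartbeatTargets {
    @Override
    public void unmonitorTarget(String resourceId) {
        // Nothing is monitored any more, so there is nothing to do.
    }
}

// Option 1: keep the nullable field and guard the call site.
class NullCheckJobMasterSketch {
    private HeartbeatTargets taskManagerHeartbeatManager = id -> { };

    void suspend() {
        taskManagerHeartbeatManager = null; // same as the reported behavior
    }

    void disconnectTaskManager(String resourceId) {
        if (taskManagerHeartbeatManager != null) {
            taskManagerHeartbeatManager.unmonitorTarget(resourceId);
        }
        // Slot release and other cleanup would continue here.
    }
}

// Option 2: never store null; swap in the no-op implementation on suspend.
class NullObjectJobMasterSketch {
    private HeartbeatTargets taskManagerHeartbeatManager = id -> { };

    void suspend() {
        taskManagerHeartbeatManager = new NoOpHeartbeatTargets();
    }

    void disconnectTaskManager(String resourceId) {
        // No guard needed: the field is never null.
        taskManagerHeartbeatManager.unmonitorTarget(resourceId);
    }
}

class GuardSketchDemo {
    public static void main(String[] args) {
        NullCheckJobMasterSketch guarded = new NullCheckJobMasterSketch();
        guarded.suspend();
        guarded.disconnectTaskManager("tm-1");

        NullObjectJobMasterSketch nullObject = new NullObjectJobMasterSketch();
        nullObject.suspend();
        nullObject.disconnectTaskManager("tm-1");

        System.out.println("Both variants complete disconnect after suspend");
    }
}
```

The null check is the smaller change, but every other use of the field needs the same guard. The no-op object removes the null state altogether, at the cost of one extra class.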