[jira] [Commented] (FLINK-15087) JobManager is forced to shutdown JVM due to temporary loss of zookeeper connection

2019-12-06 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989515#comment-16989515
 ] 

lamber-ken commented on FLINK-15087:


(y)

> JobManager is forced to shutdown JVM due to temporary loss of zookeeper 
> connection
> --
>
> Key: FLINK-15087
> URL: https://issues.apache.org/jira/browse/FLINK-15087
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.8.2
> Reporter: Abdul Qadeer
> Priority: Major
>
> While testing, I found that losing the connection to ZooKeeper triggers a JVM 
> shutdown of the JobManager when it is started through 
> "StandaloneSessionClusterEntrypoint". This happens due to an NPE on 
> "taskManagerHeartbeatManager".
> When JobManagerRunner suspends jobMasterService (as the JobManager is no longer 
> the leader), "taskManagerHeartbeatManager" is set to null in 
> "stopHeartbeatServices".
> Next, "AkkaRpcActor" stops JobMaster, and an NPE is thrown in the following method:
> {code:java}
> @Override
> public CompletableFuture disconnectTaskManager(final ResourceID resourceID, final Exception cause) {
>    log.debug("Disconnect TaskExecutor {} because: {}", resourceID, cause.getMessage());
>    taskManagerHeartbeatManager.unmonitorTarget(resourceID);
>    slotPool.releaseTaskManager(resourceID, cause);
> {code}
>  
> This ultimately leads to a fatal error in "ClusterEntrypoint.onFatalError()", 
> which forces a JVM shutdown.
> The stack trace is below:
>  
> {noformat}
> {"timeMillis":1575581120723,"thread":"flink-akka.actor.default-dispatcher-93","level":"ERROR","loggerName":"com.Sample","message":"Failed
>  to take leadership with session id 
> b4662db5-f065-41d9-aaaf-78625355b251.","thrown":{"commonElementCount":0,"localizedMessage":"Failed
>  to take leadership with session id 
> b4662db5-f065-41d9-aaaf-78625355b251.","message":"Failed to take leadership 
> with session id 
> b4662db5-f065-41d9-aaaf-78625355b251.","name":"org.apache.flink.runtime.dispatcher.DispatcherException","cause":{"commonElementCount":18,"localizedMessage":"Termination
>  of previous JobManager for job bbb8c430787d92293e9d45c349231d9c failed. 
> Cannot submit job under the same job id.","message":"Termination of previous 
> JobManager for job bbb8c430787d92293e9d45c349231d9c failed. Cannot submit job 
> under the same job 
> id.","name":"org.apache.flink.runtime.dispatcher.DispatcherException","cause":{"commonElementCount":6,"localizedMessage":"org.apache.flink.util.FlinkException:
>  Could not properly shut down the 
> JobManagerRunner","message":"org.apache.flink.util.FlinkException: Could not 
> properly shut down the 
> JobManagerRunner","name":"java.util.concurrent.CompletionException","cause":{"commonElementCount":6,"localizedMessage":"Could
>  not properly shut down the JobManagerRunner","message":"Could not properly 
> shut down the 
> JobManagerRunner","name":"org.apache.flink.util.FlinkException","cause":{"commonElementCount":13,"localizedMessage":"Failure
>  while stopping RpcEndpoint jobmanager_0.","message":"Failure while stopping 
> RpcEndpoint 
> 

[jira] [Commented] (FLINK-15087) JobManager is forced to shutdown JVM due to temporary loss of zookeeper connection

2019-12-06 Thread Abdul Qadeer (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989503#comment-16989503
 ] 

Abdul Qadeer commented on FLINK-15087:
--

It looks like this is not exactly that issue, but rather a duplicate of FLINK-14315.

I will close this.


[jira] [Commented] (FLINK-15087) JobManager is forced to shutdown JVM due to temporary loss of zookeeper connection

2019-12-05 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989482#comment-16989482
 ] 

lamber-ken commented on FLINK-15087:


Hi, this may be a duplicate of FLINK-10052.


[jira] [Commented] (FLINK-15087) JobManager is forced to shutdown JVM due to temporary loss of zookeeper connection

2019-12-05 Thread Abdul Qadeer (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989350#comment-16989350
 ] 

Abdul Qadeer commented on FLINK-15087:
--

[~trohrm...@apache.org] I would like to contribute a fix for this.

Adding a null check in 
[https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L428]
 is a simple fix, which I have tested and which works. However, I would like to 
know whether there is any other way to fix it.
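
To illustrate the kind of null check meant here, the following is a minimal, self-contained sketch. It is not Flink's actual code: the classes HeartbeatManagerSketch and JobMasterSketch, the use of String resource IDs, and the boolean return value are all assumptions made for illustration. Only the names taskManagerHeartbeatManager, stopHeartbeatServices, disconnectTaskManager and unmonitorTarget are taken from the report.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a heartbeat manager; not Flink's actual interface. */
final class HeartbeatManagerSketch {
    private final Set<String> monitored = new HashSet<>();

    void monitorTarget(String resourceId) {
        monitored.add(resourceId);
    }

    void unmonitorTarget(String resourceId) {
        monitored.remove(resourceId);
    }
}

/** Hypothetical, simplified endpoint that shows the null guard in isolation. */
final class JobMasterSketch {
    // Becomes null once heartbeat services are stopped, mirroring the report.
    private HeartbeatManagerSketch taskManagerHeartbeatManager = new HeartbeatManagerSketch();

    void registerTaskManager(String resourceId) {
        if (taskManagerHeartbeatManager != null) {
            taskManagerHeartbeatManager.monitorTarget(resourceId);
        }
    }

    void stopHeartbeatServices() {
        taskManagerHeartbeatManager = null;
    }

    /**
     * Returns true if the target was unmonitored, or false if the heartbeat
     * services had already been stopped and there was nothing left to do.
     */
    boolean disconnectTaskManager(String resourceId, Exception cause) {
        // Read the field once so the check and the call see the same reference.
        HeartbeatManagerSketch manager = taskManagerHeartbeatManager;
        if (manager == null) {
            return false;
        }
        manager.unmonitorTarget(resourceId);
        return true;
    }

    public static void main(String[] args) {
        JobMasterSketch jobMaster = new JobMasterSketch();
        jobMaster.registerTaskManager("tm-1");
        jobMaster.stopHeartbeatServices();
        // Without the guard, this call would throw a NullPointerException.
        boolean handled = jobMaster.disconnectTaskManager("tm-1", new Exception("leadership lost"));
        System.out.println("disconnect reached heartbeat manager: " + handled);
    }
}
```

A guard like this only hides the symptom at one call site. Whether the endpoint should still accept such calls after it has been suspended is a separate design question.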
