[jira] [Commented] (SPARK-5613) YarnClientSchedulerBackend fails to get application report when yarn restarts
[ https://issues.apache.org/jira/browse/SPARK-5613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315545#comment-14315545 ]

Kashish Jain commented on SPARK-5613:
-------------------------------------

Thanks Patrick and Andrew

YarnClientSchedulerBackend fails to get application report when yarn restarts
-----------------------------------------------------------------------------

Key: SPARK-5613
URL: https://issues.apache.org/jira/browse/SPARK-5613
Project: Spark
Issue Type: Bug
Components: YARN
Affects Versions: 1.2.0
Reporter: Kashish Jain
Assignee: Kashish Jain
Priority: Minor
Fix For: 1.3.0, 1.2.2
Original Estimate: 24h
Remaining Estimate: 24h

Steps to Reproduce
1) Run any Spark job.
2) Stop YARN while the Spark job is running (an application id has been generated by now).
3) Restart YARN.
4) The AsyncMonitorApplication thread fails with an ApplicationNotFoundException, which terminates the thread.

Here is the stack trace:

15/02/05 05:22:37 INFO Client: Retrying connect to server: nn1/192.168.173.176:8032. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
15/02/05 05:22:38 INFO Client: Retrying connect to server: nn1/192.168.173.176:8032. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
15/02/05 05:22:39 INFO Client: Retrying connect to server: nn1/192.168.173.176:8032. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
15/02/05 05:22:40 INFO Client: Retrying connect to server: nn1/192.168.173.176:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
5/02/05 05:22:40 INFO Client: Retrying connect to server: nn1/192.168.173.176:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
Exception in thread Yarn application state monitor org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1423113179043_0003' doesn't exist in RM.
    at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
    at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Unknown Source)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
    at java.lang.reflect.Constructor.newInstance(Unknown Source)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:166)
    at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
    at com.sun.proxy.$Proxy12.getApplicationReport(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:291)
    at org.apache.spark.deploy.yarn.Client.getApplicationReport(Client.scala:116)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:120)
Caused by:
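
The failure in step 4 appears to come from a polling loop that lets the exception escape, which kills the monitor thread. One way such a loop could tolerate a restarted ResourceManager is sketched below in Java. This is only an illustration under stated assumptions and is not necessarily what the pull request for this issue does. MonitorSketch, AppNotFoundException, ReportSource and AppState are hypothetical stand-ins defined here so the example runs without Hadoop on the classpath; they are not Spark or Hadoop APIs.

```java
// Hypothetical sketch: a state monitor that treats "application not found"
// as a terminal state instead of letting the exception end the thread.
final class MonitorSketch {

    // Stand-in for org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException (assumption).
    static final class AppNotFoundException extends Exception {
        AppNotFoundException(String message) { super(message); }
    }

    // Simplified stand-in for the YARN application states (assumption).
    enum AppState { RUNNING, FINISHED, FAILED, KILLED }

    // Stand-in for a client call that fetches the application report (assumption).
    interface ReportSource {
        AppState getState(String appId) throws AppNotFoundException;
    }

    // Polls until the application leaves RUNNING and returns the final state.
    static AppState monitor(ReportSource source, String appId, long pollMillis) {
        while (true) {
            AppState state;
            try {
                state = source.getState(appId);
            } catch (AppNotFoundException e) {
                // The ResourceManager no longer knows this id, for example after
                // a restart without state recovery. Treat it as terminal.
                state = AppState.KILLED;
            }
            if (state != AppState.RUNNING) {
                return state;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop polling with the last seen state.
                Thread.currentThread().interrupt();
                return state;
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated source: two RUNNING reports, then the id disappears.
        ReportSource flaky = id -> {
            if (calls[0]++ < 2) {
                return AppState.RUNNING;
            }
            throw new AppNotFoundException("Application with id '" + id + "' doesn't exist in RM.");
        };
        System.out.println("final state: " + monitor(flaky, "application_1423113179043_0003", 10L));
        // prints: final state: KILLED
    }
}
```

Mapping the missing application to KILLED is a design choice made for this sketch; a caller could equally surface a dedicated "unknown" state or retry for a bounded time before giving up.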
[jira] [Commented] (SPARK-5613) YarnClientSchedulerBackend fails to get application report when yarn restarts
[ https://issues.apache.org/jira/browse/SPARK-5613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14314922#comment-14314922 ]

Andrew Or commented on SPARK-5613:
----------------------------------

Thanks Patrick. I just verified that it was merged into all of 1.2, 1.3 and Master. Closing this again.
[jira] [Commented] (SPARK-5613) YarnClientSchedulerBackend fails to get application report when yarn restarts
[ https://issues.apache.org/jira/browse/SPARK-5613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14314821#comment-14314821 ]

Patrick Wendell commented on SPARK-5613:
----------------------------------------

I have cherry-picked it into the 1.3 branch.
[jira] [Commented] (SPARK-5613) YarnClientSchedulerBackend fails to get application report when yarn restarts
[ https://issues.apache.org/jira/browse/SPARK-5613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306903#comment-14306903 ]

Apache Spark commented on SPARK-5613:
-------------------------------------

User 'kasjain' has created a pull request for this issue:
https://github.com/apache/spark/pull/4392

YarnClientSchedulerBackend fails to get application report when yarn restarts
-----------------------------------------------------------------------------

Key: SPARK-5613
URL: https://issues.apache.org/jira/browse/SPARK-5613
Project: Spark
Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Kashish Jain
Priority: Minor
Fix For: 1.2.0, 1.2.1
Original Estimate: 24h
Remaining Estimate: 24h