[ https://issues.apache.org/jira/browse/SLIDER-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15396727#comment-15396727 ]
Manoj Samel commented on SLIDER-1158: ------------------------------------- Hi Jon, 1. Just to clarify, our hadoop version is 2.6. 2. I have attached the slider AM log as SUCCESS_slider.log and RM log from successful switch. The successful switch happens when a. RM2 was active b. Slider app created c. RM2 to restarted so RM1 becomes active. The RM1 log is attached as SUCCESS_rm1.log.gz 3. The only difference I noticed in RM logs from failed and success case is that * in failed case, the newly active RM (RM2) log shows trying to recover container 1 i.e. the slider AM container (in addition to container 2 and 3) . See rm2.log for this entry {code} 2016-07-26 01:04:30,278 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: SchedulerAttempt appattempt_1469494822187_0002_000001 is recovering container container_e50_1469494822187_0002_01_000001 {code} * In success case, the RM just recovers container 2 and 3. There is no entry about trying to recover container 1 in SUCCESS_rm1.log.gz {code} 2016-07-27 21:20:57,319 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: SchedulerAttempt appattempt_1469495064609_0932_000001 is recovering container container_e51_1469495064609_0932_01_000002 {code} Its not clear why in failed case the RM is trying to recover the slider AM container, but could this lead to the issue ? The success case slider AM log seems to show normal behavior ... after failing over to rm1, it succeeds in authenticating itself and proceeds with recovery. RM1 log for successful slider AM auth {code} 2016-07-27 21:20:58,013 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1469495064609_0932_000001 (auth:SIMPLE) 2016-07-27 21:20:58,014 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1469495064609_0932_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB 2016-07-27 21:20:58,016 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AM is not registered for known application attempt: appattempt_1469495064609_0932_000001 or RM had restarted after AM registered . AM should re-register. ... {code} slider AM log entry for successful switch and subsequent auth and getting NM tokens for its containers {code} 2016-07-27 21:20:58,010 [AMRM Heartbeater thread] INFO client.ConfiguredRMFailoverProxyProvider - Failing over to rm1 2016-07-27 21:20:58,018 [AMRM Heartbeater thread] WARN impl.AMRMClientImpl - ApplicationMaster is out of sync with ResourceManager, hence resyncing. 2016-07-27 21:20:58,020 [AMRM Heartbeater thread] INFO impl.AMRMClientImpl - Replacing token for : node1079:45454 2016-07-27 21:20:58,020 [AMRM Heartbeater thread] INFO impl.AMRMClientImpl - Replacing token for : node1076:45454 2016-07-27 21:20:58,026 [AMRM Callback Handler Thread] INFO appmaster.SliderAppMaster - onNodesUpdated(4) ... {code} > Slider AM hits error org.apache.hadoop.security.AccessControlException: > Client cannot authenticate via:[TOKEN] when RM failover happens in RM HA > ------------------------------------------------------------------------------------------------------------------------------------------------- > > Key: SLIDER-1158 > URL: https://issues.apache.org/jira/browse/SLIDER-1158 > Project: Slider > Issue Type: Bug > Components: appmaster > Affects Versions: Slider 0.80 > Environment: centos, Hadoop 2.6 (CDH 5.5.1), Kerberos > Reporter: Manoj Samel > Attachments: README_INFO_ANALYSIS, SUCCESS_rm1.log.gz, > SUCCESS_slider.log, rm2.log, slider.log > > > In certain cases, when a RM fails over from RM1 to RM2, the Slider AM starts > getting error "Client cannot authenticate via:[TOKEN]" when trying to talk to > RM2. Then slider AM goes from (initial) RUNNING State to ACCEPTED state. > Attached are RM log, slider AM log and detailed info and analysis in file > README_INFO_ANALYSIS. Please make sure to read README_INFO_ANALYSIS for > further information on the issue and log info > {CODE} > 2016-07-26 01:04:28,337 [AMRM Heartbeater thread] INFO > client.ConfiguredRMFailoverProxyProvider - Failing over to rm2 > 2016-07-26 01:04:28,339 [AMRM Heartbeater thread] WARN > security.UserGroupInformation - PriviledgedActionException as:abc@xyz > (auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException: > Client cannot authenticate via:[TOKEN] > 2016-07-26 01:04:28,340 [AMRM Heartbeater thread] WARN ipc.Client - > Exception encountered while connecting to the server : > org.apache.hadoop.security.AccessControlException: Client cannot authenticate > via:[TOKEN] > 2016-07-26 01:04:28,341 [AMRM Heartbeater thread] WARN > security.UserGroupInformation - PriviledgedActionException as:abc@xyz > (auth:KERBEROS) cause:java.io.IOException: > org.apache.hadoop.security.AccessControlException: Client cannot authenticate > via:[TOKEN] > 2016-07-26 01:04:28,341 [AMRM Heartbeater thread] INFO > retry.RetryInvocationHandler - Exception while invoking allocate of class > ApplicationMasterProtocolPBClientImpl over rm2 after 3 fail over attempts. > Trying to fail over immediately. > java.io.IOException: Failed on local exception: java.io.IOException: > org.apache.hadoop.security.AccessControlException: Client cannot authenticate > via:[TOKEN]; Host Details : local host is: "foo/xx.xx.xx.xx"; destination > host is: "bar":23130; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1476) > at org.apache.hadoop.ipc.Client.call(Client.java:1403) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > at com.sun.proxy.$Proxy23.allocate(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77) > at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > at com.sun.proxy.$Proxy24.allocate(Unknown Source) > at > org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278) > at > org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224) > Caused by: java.io.IOException: > org.apache.hadoop.security.AccessControlException: Client cannot authenticate > via:[TOKEN] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671) > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:645) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:733) > at > org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525) > at org.apache.hadoop.ipc.Client.call(Client.java:1442) > ... 12 more > Caused by: org.apache.hadoop.security.AccessControlException: Client cannot > authenticate via:[TOKEN] > at > org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:172) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:555) > at > org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720) > ... 15 more > {CODE} -- This message was sent by Atlassian JIRA (v6.3.4#6332)