[ 
https://issues.apache.org/jira/browse/SLIDER-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15396727#comment-15396727
 ] 

Manoj Samel commented on SLIDER-1158:
-------------------------------------

Hi Jon,

1. Just to clarify, our hadoop version is 2.6.
2. I have attached the slider AM log as SUCCESS_slider.log and RM log from 
successful switch. The successful switch happens when a. RM2 was active b. 
Slider app created c. RM2 to restarted so RM1 becomes active. The RM1 log is 
attached as SUCCESS_rm1.log.gz
3. The only difference I noticed in RM logs from failed and success case is that
* in failed case, the newly active RM (RM2) log shows trying to recover 
container 1 i.e. the slider AM container (in addition to container 2 and 3) . 
See rm2.log for this entry
{code}
2016-07-26 01:04:30,278 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 SchedulerAttempt appattempt_1469494822187_0002_000001 is recovering container 
container_e50_1469494822187_0002_01_000001
{code}
* In success case, the RM just recovers container 2 and 3. There is no entry 
about trying to recover container 1 in SUCCESS_rm1.log.gz
{code}
2016-07-27 21:20:57,319 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 SchedulerAttempt appattempt_1469495064609_0932_000001 is recovering container 
container_e51_1469495064609_0932_01_000002
{code}

Its not clear why in failed case the RM is trying to recover the slider AM 
container, but could this lead to the issue ?

The success case slider AM log seems to show normal behavior ... after failing 
over to rm1, it succeeds in authenticating itself and proceeds with recovery.

RM1 log for successful slider AM auth
{code}
2016-07-27 21:20:58,013 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for appattempt_1469495064609_0932_000001 (auth:SIMPLE)
2016-07-27 21:20:58,014 INFO 
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
 Authorization successful for appattempt_1469495064609_0932_000001 (auth:TOKEN) 
for protocol=interface org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB
2016-07-27 21:20:58,016 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AM is 
not registered for known application attempt: 
appattempt_1469495064609_0932_000001 or RM had restarted after AM registered . 
AM should re-register.
...
{code}

slider AM log entry for successful switch and subsequent auth and getting NM 
tokens for its containers
{code}
2016-07-27 21:20:58,010 [AMRM Heartbeater thread] INFO  
client.ConfiguredRMFailoverProxyProvider - Failing over to rm1
2016-07-27 21:20:58,018 [AMRM Heartbeater thread] WARN  impl.AMRMClientImpl - 
ApplicationMaster is out of sync with ResourceManager, hence resyncing.
2016-07-27 21:20:58,020 [AMRM Heartbeater thread] INFO  impl.AMRMClientImpl - 
Replacing token for : node1079:45454
2016-07-27 21:20:58,020 [AMRM Heartbeater thread] INFO  impl.AMRMClientImpl - 
Replacing token for : node1076:45454
2016-07-27 21:20:58,026 [AMRM Callback Handler Thread] INFO  
appmaster.SliderAppMaster - onNodesUpdated(4)
...
{code}

> Slider AM hits error  org.apache.hadoop.security.AccessControlException: 
> Client cannot authenticate via:[TOKEN] when RM failover happens in RM HA
> -------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SLIDER-1158
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1158
>             Project: Slider
>          Issue Type: Bug
>          Components: appmaster
>    Affects Versions: Slider 0.80
>         Environment: centos, Hadoop 2.6 (CDH 5.5.1), Kerberos
>            Reporter: Manoj Samel
>         Attachments: README_INFO_ANALYSIS, SUCCESS_rm1.log.gz, 
> SUCCESS_slider.log, rm2.log, slider.log
>
>
> In certain cases, when a RM fails over from RM1 to RM2, the Slider AM starts 
> getting error "Client cannot authenticate via:[TOKEN]" when trying to talk to 
> RM2. Then slider AM goes from (initial) RUNNING State to ACCEPTED state.
> Attached are RM log, slider AM log and detailed info and analysis in file 
> README_INFO_ANALYSIS. Please make sure to read  README_INFO_ANALYSIS for 
> further information on the issue and log info
> {CODE}
> 2016-07-26 01:04:28,337 [AMRM Heartbeater thread] INFO  
> client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
> 2016-07-26 01:04:28,339 [AMRM Heartbeater thread] WARN  
> security.UserGroupInformation - PriviledgedActionException as:abc@xyz 
> (auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException: 
> Client cannot authenticate via:[TOKEN]
> 2016-07-26 01:04:28,340 [AMRM Heartbeater thread] WARN  ipc.Client - 
> Exception encountered while connecting to the server : 
> org.apache.hadoop.security.AccessControlException: Client cannot authenticate 
> via:[TOKEN]
> 2016-07-26 01:04:28,341 [AMRM Heartbeater thread] WARN  
> security.UserGroupInformation - PriviledgedActionException as:abc@xyz 
> (auth:KERBEROS) cause:java.io.IOException: 
> org.apache.hadoop.security.AccessControlException: Client cannot authenticate 
> via:[TOKEN]
> 2016-07-26 01:04:28,341 [AMRM Heartbeater thread] INFO  
> retry.RetryInvocationHandler - Exception while invoking allocate of class 
> ApplicationMasterProtocolPBClientImpl over rm2 after 3 fail over attempts. 
> Trying to fail over immediately.
> java.io.IOException: Failed on local exception: java.io.IOException: 
> org.apache.hadoop.security.AccessControlException: Client cannot authenticate 
> via:[TOKEN]; Host Details : local host is: "foo/xx.xx.xx.xx"; destination 
> host is: "bar":23130;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1476)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1403)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>         at com.sun.proxy.$Proxy23.allocate(Unknown Source)
>         at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
>         at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:497)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>         at com.sun.proxy.$Proxy24.allocate(Unknown Source)
>         at 
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
>         at 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
> Caused by: java.io.IOException: 
> org.apache.hadoop.security.AccessControlException: Client cannot authenticate 
> via:[TOKEN]
>         at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>         at 
> org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:645)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:733)
>         at 
> org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1442)
>         ... 12 more
> Caused by: org.apache.hadoop.security.AccessControlException: Client cannot 
> authenticate via:[TOKEN]
>         at 
> org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:172)
>         at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:555)
>         at 
> org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370)
>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720)
>         ... 15 more
> {CODE}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to