Hello,

I am trying to understand why my long-running mapreduce jobs stop after 24 
hours (approx) on a secure cluster.

This is on Cloudera CDH 4.3.0, hence hadoop 2.0.0, using mrv1 (not yarn), 
authentication specified as "kerberos". Trying with a short-lived Kerberos 
ticket (1h) I see that it gets renewed regularly. Still, the job crashes after 
24 hours because the delegation token expires.

On a test cluster with increased logging and shortened 
dfs.namenode.delegation.token.renew-interval (for quicker debugging) I see that 
an immediate renew of the delegation token fails, and then after the expiry 
period the Namenode log starts getting clobbered.

Detail:


2013-12-02 15:57:08,461 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for tobbi...@cern.ch (auth:TOKEN)
2013-12-02 15:57:08,462 INFO 
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
 Authorization successful for tobbi...@cern.ch (auth:TOKEN) for 
protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2013-12-02 15:57:08,500 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for mapred/xxx.cern...@cern.ch (auth:SIMPLE)
2013-12-02 15:57:08,540 INFO 
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
 Authorization successful for mapred/xxx.cern...@cern.ch (auth:KERBEROS) for 
protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2013-12-02 15:57:08,541 INFO 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
 Token renewal requested for identifier: HDFS_DELEGATION_TOKEN token 12 for 
tobbicke

2013-12-02 15:57:08,541 ERROR org.apache.hadoop.security.UserGroupInformation: 
PriviledgedActionException as:mapred/xxx.cern...@cern.ch (auth:KERBEROS) 
cause:org.apache.hadoop.security.AccessControlException: Client mapred tries to 
renew a token with renewer specified as nobody
2013-12-02 15:57:08,541 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 
on 9000, call 
org.apache.hadoop.hdfs.protocol.ClientProtocol.renewDelegationToken from 
188.184.xxx.xxx:42031: error: 
org.apache.hadoop.security.AccessControlException: Client mapred tries to renew 
a token with renewer specified as nobody
org.apache.hadoop.security.AccessControlException: Client mapred tries to renew 
a token with renewer specified as nobody
        at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:274)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:5319)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:377)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:814)
        at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:45024)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1701)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1697)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1695)



Is this as unhealthy as it looks? If the first (immediate) renewal fails I 
assume others will share the same fate. Would that explain the 24-hour-lifetime 
on the "real" cluster and what could be the reason? How does "nobody" come into 
the game here?

In any case, linked to this or not, after 
dfs.namenode.delegation.token.renew-interval ms the following is logged a 
zillion times:

2013-12-02 16:58:09,718 WARN SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
failed for 188.184.xxx.xxx:44979:null (DIGEST-MD5: IO error acquiring password)
2013-12-02 16:58:09,719 INFO org.apache.hadoop.ipc.Server: IPC Server listener 
on 9000: readAndProcess threw exception javax.security.sasl.SaslException: 
DIGEST-MD5: IO error acquiring password [Caused by 
org.apache.hadoop.security.token.SecretManager$InvalidToken: token 
(HDFS_DELEGATION_TOKEN token 12 for tobbicke) is expired] from client 
188.184.xxx.xxx. Count of bytes read: 0
javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
[Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: token 
(HDFS_DELEGATION_TOKEN token 12 for tobbicke) is expired]
        at 
com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:577)
        at 
com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:226)
        at 
org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1210)
        at 
org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1405)
        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:719)
        at 
org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:518)
        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:493)
Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: token 
(HDFS_DELEGATION_TOKEN token 12 for tobbicke) is expired
        at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.retrievePassword(AbstractDelegationTokenSecretManager.java:227)
        at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.retrievePassword(AbstractDelegationTokenSecretManager.java:46)
        at 
org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.getPassword(SaslRpcServer.java:194)
        at 
org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.handle(SaslRpcServer.java:220)
        at 
com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:568)
        ... 6 more


Any ideas?

Rainer

Reply via email to