Hello, I am trying to understand why my long-running MapReduce jobs stop after approximately 24 hours on a secure cluster.
This is on Cloudera CDH 4.3.0, hence Hadoop 2.0.0, using MRv1 (not YARN), with authentication set to "kerberos". Testing with a short-lived Kerberos ticket (1 hour) I can see that the ticket gets renewed regularly. Still, the job crashes after 24 hours because the HDFS delegation token expires.

On a test cluster with increased logging and a shortened dfs.namenode.delegation.token.renew-interval (for quicker debugging), I see that even an immediate renewal of the delegation token fails, and then after the expiry period the NameNode log gets flooded. Details:

2013-12-02 15:57:08,461 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for tobbi...@cern.ch (auth:TOKEN)
2013-12-02 15:57:08,462 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for tobbi...@cern.ch (auth:TOKEN) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2013-12-02 15:57:08,500 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for mapred/xxx.cern...@cern.ch (auth:SIMPLE)
2013-12-02 15:57:08,540 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for mapred/xxx.cern...@cern.ch (auth:KERBEROS) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2013-12-02 15:57:08,541 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Token renewal requested for identifier: HDFS_DELEGATION_TOKEN token 12 for tobbicke
2013-12-02 15:57:08,541 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:mapred/xxx.cern...@cern.ch (auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException: Client mapred tries to renew a token with renewer specified as nobody
2013-12-02 15:57:08,541 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.renewDelegationToken from 188.184.xxx.xxx:42031: error: org.apache.hadoop.security.AccessControlException: Client mapred tries to renew a token with renewer specified as nobody
org.apache.hadoop.security.AccessControlException: Client mapred tries to renew a token with renewer specified as nobody
    at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:274)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:5319)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:377)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:814)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:45024)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1701)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1697)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1695)

Is this as unhealthy as it looks? If the first (immediate) renewal fails, I assume all subsequent renewal attempts will share the same fate.
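To make sure I have the mechanics right: as far as I can tell, the renewer is a plain string baked into the token when the client obtains it, and the NameNode later compares that string against the Kerberos principal calling renewDelegationToken. Here is a minimal sketch of my understanding of the client side (the principal names are placeholders, not my actual configuration, and it assumes HDFS is the default filesystem in core-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.token.Token;

public class TokenRenewerSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The renewer is fixed at token creation time. For MRv1 I would
        // expect it to be the JobTracker principal (e.g. "mapred/host@REALM")
        // so the JobTracker can keep renewing the token while the job runs.
        Token<?> ok = fs.getDelegationToken("mapred/xxx.cern.ch@CERN.CH");

        // If the renewer is recorded as anything else -- e.g. "nobody" --
        // the NameNode rejects the JobTracker's renewal attempt with exactly
        // the AccessControlException shown in the log above.
        Token<?> broken = fs.getDelegationToken("nobody");

        // Renewal goes back to the NameNode; only the named renewer may call it.
        ok.renew(conf);      // should succeed when run as mapred/xxx.cern.ch@CERN.CH
        broken.renew(conf);  // throws AccessControlException for anyone but "nobody"
    }
}

If that picture is right, the error would mean the token my job submits really has "nobody" recorded as its renewer rather than the mapred principal.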
Would that explain the 24-hour lifetime on the "real" cluster, and what could be the reason? How does "nobody" come into the game here?

In any case, linked to this or not, after dfs.namenode.delegation.token.renew-interval milliseconds the following is logged a zillion times:

2013-12-02 16:58:09,718 WARN SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 188.184.xxx.xxx:44979:null (DIGEST-MD5: IO error acquiring password)
2013-12-02 16:58:09,719 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 9000: readAndProcess threw exception javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: token (HDFS_DELEGATION_TOKEN token 12 for tobbicke) is expired] from client 188.184.xxx.xxx. Count of bytes read: 0
javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: token (HDFS_DELEGATION_TOKEN token 12 for tobbicke) is expired]
    at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:577)
    at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:226)
    at org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1210)
    at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1405)
    at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:719)
    at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:518)
    at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:493)
Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: token (HDFS_DELEGATION_TOKEN token 12 for tobbicke) is expired
    at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.retrievePassword(AbstractDelegationTokenSecretManager.java:227)
    at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.retrievePassword(AbstractDelegationTokenSecretManager.java:46)
    at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.getPassword(SaslRpcServer.java:194)
    at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.handle(SaslRpcServer.java:220)
    at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:568)
    ... 6 more

Any ideas?

Rainer