After a bit more investigation, I found that it could be related to impersonation on a kerberized cluster.
Our job is started with the following command:

/usr/lib/spark/bin/spark-submit --master yarn-client --principal [principal] --keytab [keytab] --proxy-user [proxied_user] ...

In the application master's log, at start up:

2015-11-03 16:03:41,602 INFO [main] yarn.AMDelegationTokenRenewer (Logging.scala:logInfo(59)) - Scheduling login from keytab in 64789744 millis.

Later on, when the delegation token renewer thread kicks in, it re-logs in with the specified principal to obtain new credentials and tries to write those credentials to the directory where the current user's credentials are stored. However, with impersonation, the current user is a different user from the principal, so the write fails with a permission error.

2015-11-04 10:03:31,366 INFO [Delegation Token Refresh Thread-0] yarn.AMDelegationTokenRenewer (Logging.scala:logInfo(59)) - Attempting to login to KDC using principal: principal/host@domain
2015-11-04 10:03:31,665 INFO [Delegation Token Refresh Thread-0] yarn.AMDelegationTokenRenewer (Logging.scala:logInfo(59)) - Successfully logged into KDC.
2015-11-04 10:03:31,702 INFO [Delegation Token Refresh Thread-0] yarn.YarnSparkHadoopUtil (Logging.scala:logInfo(59)) - getting token for namenode: hdfs://hadoop_abc/user/proxied_user/.sparkStaging/application_1443481003186_00000
2015-11-04 10:03:31,904 INFO [Delegation Token Refresh Thread-0] hdfs.DFSClient (DFSClient.java:getDelegationToken(1025)) - Created HDFS_DELEGATION_TOKEN token 389283 for principal on ha-hdfs:hadoop_abc
2015-11-04 10:03:31,905 ERROR [Delegation Token Refresh Thread-0] hdfs.KeyProviderCache (KeyProviderCache.java:createKeyProviderURI(87)) - Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
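To illustrate why that write fails, here is a simplified model of the permission check (illustrative Python only, not the actual HDFS FSPermissionChecker, which also consults groups and ACLs): the staging directory is owned by the proxied user with mode drwx------, and after the keytab re-login the current user is the principal, so a READ_EXECUTE request falls through to the "other" bits and is denied.

```python
# Simplified model of an HDFS-style permission check. Illustrative only:
# real HDFS also checks group membership and ACLs; user names here
# ("proxy-user", "principal") are stand-ins for the redacted names above.

READ, WRITE, EXECUTE = 4, 2, 1

def is_permitted(owner: str, mode: int, user: str, requested: int) -> bool:
    """Check 'requested' permission bits for 'user' against 'mode'.

    mode is an octal int, e.g. 0o700 for drwx------. Non-owners fall
    through to the 'other' bits in this sketch (groups ignored).
    """
    bits = (mode >> 6) & 7 if user == owner else mode & 7
    return (bits & requested) == requested

# The staging dir from the log: owned by the proxied user, drwx------.
staging_owner, staging_mode = "proxy-user", 0o700

# As the proxied user (the normal case), access is granted.
assert is_permitted(staging_owner, staging_mode, "proxy-user", READ | EXECUTE)

# After re-login the current user is the principal, so the READ_EXECUTE
# check is denied -- matching the AccessControlException in the log.
assert not is_permitted(staging_owner, staging_mode, "principal", READ | EXECUTE)
```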
2015-11-04 10:03:31,944 WARN [Delegation Token Refresh Thread-0] security.UserGroupInformation (UserGroupInformation.java:doAs(1674)) - PriviledgedActionException as:proxy-user (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
2015-11-04 10:03:31,945 WARN [Delegation Token Refresh Thread-0] ipc.Client (Client.java:run(675)) - Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
2015-11-04 10:03:31,945 WARN [Delegation Token Refresh Thread-0] security.UserGroupInformation (UserGroupInformation.java:doAs(1674)) - PriviledgedActionException as:proxy-user (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
2015-11-04 10:03:31,963 WARN [Delegation Token Refresh Thread-0] yarn.YarnSparkHadoopUtil (Logging.scala:logWarning(92)) - Error while attempting to list files from application staging dir
org.apache.hadoop.security.AccessControlException: Permission denied: user=principal, access=READ_EXECUTE, inode="/user/proxy-user/.sparkStaging/application_1443481003186_00000":proxy-user:proxy-user:drwx------

Can someone confirm my understanding is right? The relevant class is:

https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala

Chen

On Tue, Nov 3, 2015 at 11:57 AM, Chen Song <chen.song...@gmail.com> wrote:

> We saw the following error happening in a Spark Streaming job. Our job is
> running on YARN with Kerberos enabled.
>
> First, the warnings below were printed out. I only pasted a few, but the
> following was repeated hundreds/thousands of times.
>
> 15/11/03 14:43:07 WARN UserGroupInformation: PriviledgedActionException as:[kerberos principal] (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
> 15/11/03 14:43:07 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
> 15/11/03 14:43:07 WARN UserGroupInformation: PriviledgedActionException as:[kerberos principal] (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
> 15/11/03 14:43:07 WARN UserGroupInformation: PriviledgedActionException as:[kerberos principal] (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
> 15/11/03 14:43:07 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>
> It seems to have something to do with renewal of the token, and it tried
> to connect to a standby namenode.
>
> Then the following error was thrown:
>
> 15/11/03 14:43:20 ERROR Utils: Uncaught exception in thread Delegation Token Refresh Thread-0
> java.lang.StackOverflowError
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater.updateCredentialsIfRequired(ExecutorDelegationTokenUpdater.scala:89)
> at org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorDelegationTokenUpdater.scala:49)
> at org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1$$anonfun$run$1.apply(ExecutorDelegationTokenUpdater.scala:49)
> at org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1$$anonfun$run$1.apply(ExecutorDelegationTokenUpdater.scala:49)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1.run(ExecutorDelegationTokenUpdater.scala:49)
> at org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater.updateCredentialsIfRequired(ExecutorDelegationTokenUpdater.scala:79)
> at org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorDelegationTokenUpdater.scala:49)
> at org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1$$anonfun$run$1.apply(ExecutorDelegationTokenUpdater.scala:49)
> at org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1$$anonfun$run$1.apply(ExecutorDelegationTokenUpdater.scala:49)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1.run(ExecutorDelegationTokenUpdater.scala:49)
> at org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater.updateCredentialsIfRequired(ExecutorDelegationTokenUpdater.scala:79)
> at org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorDelegationTokenUpdater.scala:49)
> at org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1$$anonfun$run$1.apply(ExecutorDelegationTokenUpdater.scala:49)
> at org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1$$anonfun$run$1.apply(ExecutorDelegationTokenUpdater.scala:49)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>
> Again, the above stack trace was repeated hundreds/thousands of times.
> That explains why a StackOverflowError was produced.
>
> My question is:
>
> * If the HDFS active namenode failed over during the job, does the client
> always need to connect to the same namenode the token was created against
> the next time token renewal is needed? Is that true and expected? If so,
> how should namenode failover be handled for a streaming job in Spark?
>
> Thanks for your feedback in advance.
>
> --
> Chen Song

--
Chen Song
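P.S. On the StackOverflowError itself: the frames in the quoted trace alternate between updateCredentialsIfRequired and the anonymous runnable at ExecutorDelegationTokenUpdater.scala:49, which suggests the retry after a failed update re-enters the update method on the same thread rather than rescheduling it. A sketch of why that pattern overflows (illustrative Python, not the Spark code; the function names are stand-ins):

```python
# Illustration (not Spark code): when every attempt fails -- e.g. the
# client keeps hitting a standby namenode -- a retry implemented as
# direct recursion adds stack frames per attempt, while a loop does not.
import sys

class StandbyError(Exception):
    """Stand-in for Hadoop's StandbyException in this sketch."""

def fetch_token():
    # Models a namenode that always answers from standby: every call fails.
    raise StandbyError("Operation category READ is not supported in state standby")

def update_recursive():
    try:
        return fetch_token()
    except StandbyError:
        # Re-enter on the same thread: the stack grows on every failure,
        # analogous to the repeated frames in the trace above.
        return update_recursive()

def update_loop(max_attempts):
    for _ in range(max_attempts):
        try:
            return fetch_token()
        except StandbyError:
            continue  # constant stack depth per attempt
    return None  # give up after max_attempts

sys.setrecursionlimit(1000)
try:
    update_recursive()
except RecursionError:
    # Python's analogue of java.lang.StackOverflowError.
    print("recursive retry blew the stack")

# The loop survives the same number of failures without growing the stack.
print(update_loop(1000))  # prints None
```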