After a bit more investigation, I found that it could be related to
impersonation on a kerberized cluster.

Our job is started with the following command.

/usr/lib/spark/bin/spark-submit --master yarn-client --principal
[principal] --keytab [keytab] --proxy-user [proxied_user] ...


In the application master's log, at startup:

2015-11-03 16:03:41,602 INFO  [main] yarn.AMDelegationTokenRenewer
(Logging.scala:logInfo(59)) - Scheduling login from keytab in 64789744
millis.

Later on, when the delegation token renewer thread kicks in, it tries to
re-login as the specified principal to obtain new credentials, and then to
write those credentials to the directory where the current user's
credentials are stored. However, with impersonation, the current user (the
proxied user) is different from the principal user, so this fails with a
permission error.
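If that reading is right, the mismatch comes down to ordinary permission bits: the staging dir is drwx------ and owned by the proxied user, while after re-login the renewer runs as the principal. A toy model in plain Java (hypothetical names; HDFS's real check lives in its own permission checker, not this code):

```java
// Toy model (not Spark or HDFS code) of the permission check behind the
// AccessControlException below: a drwx------ directory is readable only
// by its owner.
public class StagingDirPermissions {
    static final String OWNER = "proxy-user"; // creator of .sparkStaging
    static final int MODE = 0700;             // drwx------

    static boolean canRead(String user) {
        if (user.equals(OWNER)) {
            return (MODE & 0400) != 0; // owner read bit
        }
        return (MODE & 0004) != 0;     // "other" read bit
    }

    public static void main(String[] args) {
        // Before re-login the AM acts as proxy-user; after re-login from
        // the keytab it acts as the principal, and the same access fails.
        System.out.println(canRead("proxy-user")); // true
        System.out.println(canRead("principal"));  // false
    }
}
```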

2015-11-04 10:03:31,366 INFO  [Delegation Token Refresh Thread-0]
yarn.AMDelegationTokenRenewer (Logging.scala:logInfo(59)) - Attempting
to login to KDC using principal: principal/host@domain
2015-11-04 10:03:31,665 INFO  [Delegation Token Refresh Thread-0]
yarn.AMDelegationTokenRenewer (Logging.scala:logInfo(59)) -
Successfully logged into KDC.
2015-11-04 10:03:31,702 INFO  [Delegation Token Refresh Thread-0]
yarn.YarnSparkHadoopUtil (Logging.scala:logInfo(59)) - getting token
for namenode: 
hdfs://hadoop_abc/user/proxied_user/.sparkStaging/application_1443481003186_00000
2015-11-04 10:03:31,904 INFO  [Delegation Token Refresh Thread-0]
hdfs.DFSClient (DFSClient.java:getDelegationToken(1025)) - Created
HDFS_DELEGATION_TOKEN token 389283 for principal on ha-hdfs:hadoop_abc
2015-11-04 10:03:31,905 ERROR [Delegation Token Refresh Thread-0]
hdfs.KeyProviderCache (KeyProviderCache.java:createKeyProviderURI(87))
- Could not find uri with key [dfs.encryption.key.provider.uri] to
create a keyProvider !!
2015-11-04 10:03:31,944 WARN  [Delegation Token Refresh Thread-0]
security.UserGroupInformation (UserGroupInformation.java:doAs(1674)) -
PriviledgedActionException as:proxy-user (auth:SIMPLE)
cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
Operation category READ is not supported in state standby
2015-11-04 10:03:31,945 WARN  [Delegation Token Refresh Thread-0]
ipc.Client (Client.java:run(675)) - Exception encountered while
connecting to the server :
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
Operation category READ is not supported in state standby
2015-11-04 10:03:31,945 WARN  [Delegation Token Refresh Thread-0]
security.UserGroupInformation (UserGroupInformation.java:doAs(1674)) -
PriviledgedActionException as:proxy-user (auth:SIMPLE)
cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
Operation category READ is not supported in state standby
2015-11-04 10:03:31,963 WARN  [Delegation Token Refresh Thread-0]
yarn.YarnSparkHadoopUtil (Logging.scala:logWarning(92)) - Error while
attempting to list files from application staging dir
org.apache.hadoop.security.AccessControlException: Permission denied:
user=principal, access=READ_EXECUTE,
inode="/user/proxy-user/.sparkStaging/application_1443481003186_00000":proxy-user:proxy-user:drwx------


Can someone confirm that my understanding is right? The relevant class is here:
https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala
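
As for the StackOverflowError in my earlier message below: the repeating
frames suggest the refresh runnable and updateCredentialsIfRequired call back
into each other on the same thread, so when every renewal attempt fails (e.g.
against a standby namenode) the recursion never unwinds. A stripped-down Java
sketch of that shape (hypothetical names, reconstructed from the trace, not
Spark's actual code):

```java
// Sketch of the failure shape: the retry path re-invokes the updater on the
// same thread instead of handing the retry to a scheduled executor, so with
// a permanently failing renewal the stack only grows.
public class RenewLoop {
    static void updateCredentialsIfRequired(int failedAttempts) {
        if (failedAttempts > 0) {
            retry(failedAttempts - 1); // token fetch failed -> "retry"
        }
    }

    static void retry(int failedAttempts) {
        // Mirrors the Runnable calling straight back into the updater.
        updateCredentialsIfRequired(failedAttempts);
    }

    static boolean overflows(int attempts) {
        try {
            updateCredentialsIfRequired(attempts);
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(overflows(10_000_000)); // true on default stacks
    }
}
```

Scheduling each retry on a ScheduledExecutorService instead of re-entering
the method would keep the stack flat.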

Chen

On Tue, Nov 3, 2015 at 11:57 AM, Chen Song <chen.song...@gmail.com> wrote:

> We saw the following error happening in Spark Streaming job. Our job is
> running on YARN with kerberos enabled.
>
> First, warnings below were printed out, I only pasted a few but the
> following was repeated hundred/thousand of times.
>
> 15/11/03 14:43:07 WARN UserGroupInformation: PriviledgedActionException
> as:[kerberos principle] (auth:KERBEROS)
> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> Operation category READ is not supported in state standby
> 15/11/03 14:43:07 WARN Client: Exception encountered while connecting to
> the server :
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> Operation category READ is not supported in state standby
> 15/11/03 14:43:07 WARN UserGroupInformation: PriviledgedActionException
> as:[kerberos principle] (auth:KERBEROS)
> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> Operation category READ is not supported in state standby
> 15/11/03 14:43:07 WARN UserGroupInformation: PriviledgedActionException
> as:[kerberos principle] (auth:KERBEROS)
> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> Operation category READ is not supported in state standby
> 15/11/03 14:43:07 WARN Client: Exception encountered while connecting to
> the server :
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> Operation category READ is not supported in state standby
>
>
> It seems to have something to do with renewal of token and it tried to
> connect a standby namenode.
>
> Then the following error was thrown out.
>
> 15/11/03 14:43:20 ERROR Utils: Uncaught exception in thread Delegation
> Token Refresh Thread-0
> java.lang.StackOverflowError
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at
> org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater.updateCredentialsIfRequired(ExecutorDelegationTokenUpdater.scala:89)
> at
> org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorDelegationTokenUpdater.scala:49)
> at
> org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1$$anonfun$run$1.apply(ExecutorDelegationTokenUpdater.scala:49)
> at
> org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1$$anonfun$run$1.apply(ExecutorDelegationTokenUpdater.scala:49)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at
> org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1.run(ExecutorDelegationTokenUpdater.scala:49)
> at
> org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater.updateCredentialsIfRequired(ExecutorDelegationTokenUpdater.scala:79)
> at
> org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorDelegationTokenUpdater.scala:49)
> at
> org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1$$anonfun$run$1.apply(ExecutorDelegationTokenUpdater.scala:49)
> at
> org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1$$anonfun$run$1.apply(ExecutorDelegationTokenUpdater.scala:49)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at
> org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1.run(ExecutorDelegationTokenUpdater.scala:49)
> at
> org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater.updateCredentialsIfRequired(ExecutorDelegationTokenUpdater.scala:79)
> at
> org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorDelegationTokenUpdater.scala:49)
> at
> org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1$$anonfun$run$1.apply(ExecutorDelegationTokenUpdater.scala:49)
> at
> org.apache.spark.deploy.yarn.ExecutorDelegationTokenUpdater$$anon$1$$anonfun$run$1.apply(ExecutorDelegationTokenUpdater.scala:49)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>
>
> Again, the above stacktrace was repeated hundreds/throusands of times.
> That explains why a stackoverflow exception was produced.
>
> My question is:
>
> * If the HDFS active name node failed over during the job, the next time
> token renewal is needed, the client would always need to connect with the
> same namenode when the token was created. Is that true and expected? If so,
> how to handle failover of namenodes for a streaming job in Spark.
>
> Thanks for your feedback in advance.
>
> --
> Chen Song
>
>


-- 
Chen Song

Reply via email to