[ https://issues.apache.org/jira/browse/SPARK-31347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077533#comment-17077533 ]
Babble Shack commented on SPARK-31347:
--------------------------------------

I was able to resolve this by adding the YARN HA configs used to address the subclusters to the spark-defaults.conf file, e.g.:

{code:java}
##YARN Configs
spark.hadoop.yarn.resourcemanager.address yarn-master-0.yarn-service.yarn-subcluster-a:8050
spark.hadoop.yarn.resourcemanager.scheduler.address 0.0.0.0:8049
spark.hadoop.yarn.federation.enabled true
#spark.hadoop.yarn.resourcemanager.ha.enabled false
spark.hadoop.yarn.resourcemanager.ha.rm-ids yarn-subcluster-a,yarn-subcluster-b,yarn-subcluster-c

spark.hadoop.yarn.resourcemanager.hostname.yarn-subcluster-a yarn-master-0.yarn-service.yarn-subcluster-a
spark.hadoop.yarn.resourcemanager.webapp.address.yarn-subcluster-a yarn-master-0.yarn-service.yarn-subcluster-a:8088
spark.hadoop.yarn.resourcemanager.address.yarn-subcluster-a yarn-master-0.yarn-service.yarn-subcluster-a:8032
spark.hadoop.yarn.resourcemanager.scheduler.address.yarn-subcluster-a yarn-master-0.yarn-service.yarn-subcluster-a:8030

spark.hadoop.yarn.resourcemanager.hostname.yarn-subcluster-b yarn-master-0.yarn-service.yarn-subcluster-b
spark.hadoop.yarn.resourcemanager.webapp.address.yarn-subcluster-b yarn-master-0.yarn-service.yarn-subcluster-b:8088
spark.hadoop.yarn.resourcemanager.address.yarn-subcluster-b yarn-master-0.yarn-service.yarn-subcluster-b:8032
spark.hadoop.yarn.resourcemanager.scheduler.address.yarn-subcluster-b yarn-master-0.yarn-service.yarn-subcluster-b:8030

spark.hadoop.yarn.resourcemanager.hostname.yarn-subcluster-c yarn-master-0.yarn-service.yarn-subcluster-c
spark.hadoop.yarn.resourcemanager.webapp.address.yarn-subcluster-c yarn-master-0.yarn-service.yarn-subcluster-c:8088
spark.hadoop.yarn.resourcemanager.address.yarn-subcluster-c yarn-master-0.yarn-service.yarn-subcluster-c:8032
spark.hadoop.yarn.resourcemanager.scheduler.address.yarn-subcluster-c yarn-master-0.yarn-service.yarn-subcluster-c:8030
{code}

Perhaps these should be read from yarn-site.xml located at $HADOOP_CONF_DIR when either `yarn.resourcemanager.ha.enabled` or `spark.hadoop.yarn.federation.enabled` is set to true. This could happen in the YarnClusterSuite.scala class.
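To make that suggestion concrete, here is a minimal sketch of the idea in Scala. It is not existing Spark code: `FederatedRmConfLoader` and `propagate` are hypothetical names, and it assumes $HADOOP_CONF_DIR is on the classpath so that `new YarnConfiguration()` actually picks up the router's yarn-site.xml.

{code:scala}
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.spark.SparkConf

// Hypothetical helper (not existing Spark code): copy the per-subcluster
// ResourceManager addresses from yarn-site.xml into spark.hadoop.* entries,
// which is what the spark-defaults.conf workaround above does by hand.
object FederatedRmConfLoader {
  // Config prefixes that need a per-RM-id entry in an HA/federated setup.
  private val rmKeys = Seq(
    "yarn.resourcemanager.hostname",
    "yarn.resourcemanager.address",
    "yarn.resourcemanager.scheduler.address",
    "yarn.resourcemanager.webapp.address")

  def propagate(sparkConf: SparkConf): SparkConf = {
    // new YarnConfiguration() loads yarn-site.xml from the classpath, so
    // $HADOOP_CONF_DIR must be on that classpath for this to see the file.
    val yarnConf = new YarnConfiguration()
    val haEnabled = yarnConf.getBoolean("yarn.resourcemanager.ha.enabled", false)
    val federated = yarnConf.getBoolean("yarn.federation.enabled", false)
    if (haEnabled || federated) {
      for {
        rmId  <- yarnConf.getTrimmedStrings("yarn.resourcemanager.ha.rm-ids")
        key   <- rmKeys
        value <- Option(yarnConf.get(s"$key.$rmId")) // skip unset entries
      } sparkConf.set(s"spark.hadoop.$key.$rmId", value)
    }
    sparkConf
  }
}
{code}

With something like this in the YARN client path, the hand-maintained list above would reduce to pointing HADOOP_CONF_DIR at the router's configs.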
> Unable to run Spark Job on Federated Yarn Cluster, AMRMToken invalid
> --------------------------------------------------------------------
>
>                 Key: SPARK-31347
>                 URL: https://issues.apache.org/jira/browse/SPARK-31347
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy, YARN
>    Affects Versions: 3.0.0
>            Reporter: Babble Shack
>            Priority: Major
>         Attachments: mapred.log, mapred.out, router-yarn-site.xml, spark.debug.log, spark.log, spark.out
>
> Running Spark on YARN 3.2.1 in a federated cluster, the ApplicationMaster fails to register with the ResourceManager and throws an InvalidToken exception.
> {code:java}
> root@yarn-master-0:/hadoop/spark# HADOOP_CONF_DIR=/hadoop/federation/router ./bin/spark-submit \
>     --class org.apache.spark.examples.SparkPi \
>     --master yarn \
>     --deploy-mode cluster \
>     --driver-memory 4g \
>     --executor-memory 2g \
>     --executor-cores 1 \
>     --queue default \
>     examples/jars/spark-examples*.jar 10
> 2020-04-04 16:44:18,144 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2020-04-04 16:44:18,345 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200
> 2020-04-04 16:44:18,402 INFO yarn.Client: Requesting a new application from cluster with 10 NodeManagers
> 2020-04-04 16:44:18,753 INFO conf.Configuration: resource-types.xml not found
> 2020-04-04 16:44:18,754 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
> 2020-04-04 16:44:18,766 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (7168 MB per container)
> 2020-04-04 16:44:18,767 INFO yarn.Client: Will allocate AM container, with 4505 MB memory including 409 MB overhead
> 2020-04-04 16:44:18,767 INFO yarn.Client: Setting up container launch context for our AM
> 2020-04-04 16:44:18,768 INFO yarn.Client: Setting up the launch environment for our AM container
> 2020-04-04 16:44:18,776 INFO yarn.Client: Preparing resources for our AM container
> 2020-04-04 16:44:18,805 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
> 2020-04-04 16:44:19,890 INFO yarn.Client: Uploading resource file:/tmp/spark-cfcf1976-612e-4b64-8bf3-5b0c8f1dc6ec/__spark_libs__5444968329971306297.zip -> hdfs://hdfs-master-0.hdfs-service.hdfs:9000/user/root/.sparkStaging/application_1586018216728_0005/__spark_libs__5444968329971306297.zip
> 2020-04-04 16:44:22,689 INFO yarn.Client: Uploading resource file:/hadoop/spark/examples/jars/spark-examples_2.12-3.0.0-preview2.jar -> hdfs://hdfs-master-0.hdfs-service.hdfs:9000/user/root/.sparkStaging/application_1586018216728_0005/spark-examples_2.12-3.0.0-preview2.jar
> 2020-04-04 16:44:22,832 INFO yarn.Client: Uploading resource file:/tmp/spark-cfcf1976-612e-4b64-8bf3-5b0c8f1dc6ec/__spark_conf__2558260056925734476.zip -> hdfs://hdfs-master-0.hdfs-service.hdfs:9000/user/root/.sparkStaging/application_1586018216728_0005/__spark_conf__.zip
> 2020-04-04 16:44:22,886 INFO spark.SecurityManager: Changing view acls to: root
> 2020-04-04 16:44:22,886 INFO spark.SecurityManager: Changing modify acls to: root
> 2020-04-04 16:44:22,886 INFO spark.SecurityManager: Changing view acls groups to:
> 2020-04-04 16:44:22,887 INFO spark.SecurityManager: Changing modify acls groups to:
> 2020-04-04 16:44:22,887 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
> 2020-04-04 16:44:22,927 INFO yarn.Client: Submitting application application_1586018216728_0005 to ResourceManager
> 2020-04-04 16:44:22,963 INFO impl.YarnClientImpl: Submitted application application_1586018216728_0005
> 2020-04-04 16:44:23,967 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
> 2020-04-04 16:44:23,969 INFO yarn.Client:
>          client token: N/A
>          diagnostics: AM container is launched, waiting for AM container to Register with RM
>          ApplicationMaster host: N/A
>          ApplicationMaster RPC port: -1
>          queue: default
>          start time: 1586018662937
>          final status: UNDEFINED
>          tracking URL: http://yarn-master-0.yarn-service.yarn-subcluster-a.svc.cluster.local:8088/proxy/application_1586018216728_0005/
>          user: root
> 2020-04-04 16:44:24,972 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
> 2020-04-04 16:44:25,974 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
> 2020-04-04 16:44:26,977 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
> 2020-04-04 16:44:27,980 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
> 2020-04-04 16:44:28,983 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
> 2020-04-04 16:44:29,985 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
> 2020-04-04 16:44:30,988 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
> 2020-04-04 16:44:31,991 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
> 2020-04-04 16:44:32,994 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
> 2020-04-04 16:44:33,996 INFO yarn.Client: Application report for application_1586018216728_0005 (state: FAILED)
> 2020-04-04 16:44:33,997 INFO yarn.Client:
>          client token: N/A
>          diagnostics: Application application_1586018216728_0005 failed 2 times due to AM Container for appattempt_1586018216728_0005_000002 exited with exitCode: 13
> Failing this attempt.Diagnostics: [2020-04-04 16:44:33.276]Exception from container-launch.
> Container id: container_e27933_1586018216728_0005_02_000001
> Exit code: 13
> [2020-04-04 16:44:33.297]Container exited with a non-zero exit code 13. Error file: prelaunch.err.
> Last 4096 bytes of prelaunch.err :
> Last 4096 bytes of stderr :
> ect.Constructor.newInstance(Constructor.java:423)
>   at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
>   at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
>   at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>   at com.sun.proxy.$Proxy16.registerApplicationMaster(Unknown Source)
>   at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.registerApplicationMaster(AMRMClientImpl.java:246)
>   at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.registerApplicationMaster(AMRMClientImpl.java:233)
>   at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.registerApplicationMaster(AMRMClientImpl.java:213)
>   at org.apache.spark.deploy.yarn.YarnRMClient.register(YarnRMClient.scala:71)
>   at org.apache.spark.deploy.yarn.ApplicationMaster.registerAM(ApplicationMaster.scala:426)
>   at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:504)
>   at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:262)
>   at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:875)
>   at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:874)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:874)
>   at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid AMRMToken from appattempt_1586018216728_0005_000002
>   at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1457)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1367)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>   at com.sun.proxy.$Proxy15.registerApplicationMaster(Unknown Source)
>   at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:107)
>   ... 24 more
> )
> 2020-04-04 16:44:32,555 INFO yarn.ApplicationMaster: Deleting staging directory hdfs://hdfs-master-0.hdfs-service.hdfs:9000/user/root/.sparkStaging/application_1586018216728_0005
> 2020-04-04 16:44:32,926 INFO storage.DiskBlockManager: Shutdown hook called
> 2020-04-04 16:44:32,930 INFO util.ShutdownHookManager: Shutdown hook called
> 2020-04-04 16:44:32,930 INFO util.ShutdownHookManager: Deleting directory /opt/hadoop/hadooptmpdata/nm-local-dir/usercache/root/appcache/application_1586018216728_0005/spark-5d3f083f-eb43-49e9-a779-2354e07e9bd7/userFiles-1721c4df-1674-4695-b3aa-02e8c72908c0
> 2020-04-04 16:44:32,932 INFO util.ShutdownHookManager: Deleting directory /opt/hadoop/hadooptmpdata/nm-local-dir/usercache/root/appcache/application_1586018216728_0005/spark-5d3f083f-eb43-49e9-a779-2354e07e9bd7
> {code}
>
> Submitting this here and not in the YARN Jira because Hadoop MapReduce jobs run normally in the same cluster.