Yes, I’m pretty sure my YARN and HDFS HA configuration is correct. I can use the UIs and the HDFS command-line tools with HA support as expected (failing over namenodes and resourcemanagers, etc.), so I believe this to be a Spark issue.
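For reference, my relevant ResourceManager HA settings look roughly like the sketch below (the cluster id, host names and ZooKeeper quorum are placeholders here, not my actual values):

```xml
<!-- yarn-site.xml: ResourceManager HA (placeholder host names / cluster id) -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>yarn-cluster</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>rm1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>rm2.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
<property>
  <name>yarn.client.failover-proxy-provider</name>
  <value>org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider</value>
</property>
```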
Like I mentioned earlier, if I set “yarn.resourcemanager.address” to reflect the active resource manager, things work as expected, although that would not be an HA setup. An “unaccepted” reply to this thread from Dean Chen suggested building Spark with a newer version of Hadoop (2.4.1), and this has worked to some extent. I’m now able to submit jobs (omitting an explicit “yarn.resourcemanager.address” property), and the ConfiguredRMFailoverProxyProvider seems to submit them to the arbitrary, active resource manager. Thanks Dean!

However, the Spark job running in the ApplicationMaster on a given node now fails to find the active resourcemanager. Below is a log excerpt from one of the assigned nodes. As all the attempts fail, eventually YARN will move the job to execute on the node that co-locates the active resourcemanager and a nodemanager, where it will proceed a bit further. Then, the Spark job itself will fail attempting to access HDFS files via the virtualized HA HDFS URI.

I’m running Apache Spark 1.0.2 built against Hadoop 2.4.1. Is it verified that Spark is ready for HA YARN/HDFS?

===================================================
14/08/20 11:34:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/08/20 11:34:24 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1408548063882_0002_000001
14/08/20 11:34:24 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8030
14/08/20 11:34:24 INFO SecurityManager: Changing view acls to: hadoop
14/08/20 11:34:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop)
14/08/20 11:34:24 INFO ApplicationMaster: Starting the user JAR in a separate Thread
14/08/20 11:34:24 INFO ApplicationMaster: Waiting for Spark context initialization
14/08/20 11:34:24 INFO ApplicationMaster: Waiting for Spark context initialization ... 0
14/08/20 11:34:24 WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
14/08/20 11:34:25 INFO Slf4jLogger: Slf4jLogger started
14/08/20 11:34:25 INFO Remoting: Starting remoting
14/08/20 11:34:25 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sp...@ip-10-0-5-106.us-west-2.compute.internal:41419]
14/08/20 11:34:25 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sp...@ip-10-0-5-106.us-west-2.compute.internal:41419]
14/08/20 11:34:27 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
14/08/20 11:34:28 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
14/08/20 11:34:29 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
14/08/20 11:34:30 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
14/08/20 11:34:31 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
14/08/20 11:34:32 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
14/08/20 11:34:33 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
14/08/20 11:34:34 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

On Aug 19, 2014, at 5:34 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> Hi Matt,
>
> I checked in the YARN code and I don't see any references to
> yarn.resourcemanager.address. Have you made sure that your YARN client
> configuration on the node you're launching from contains the right configs?
>
> -Sandy
>
>
> On Mon, Aug 18, 2014 at 4:07 PM, Matt Narrell <matt.narr...@gmail.com> wrote:
> Hello,
>
> I have an HA-enabled YARN cluster with two resource managers. When submitting
> jobs via “spark-submit --master yarn-cluster”, it appears that the driver is
> looking explicitly for the “yarn.resourcemanager.address” property rather
> than round-robining through the resource managers via the
> “yarn.client.failover-proxy-provider” property set to
> “org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider”.
>
> If I explicitly set “yarn.resourcemanager.address” to the active resource
> manager, jobs will submit fine.
>
> Is there a way to have “spark-submit --master yarn-cluster” respect the
> failover proxy?
>
> Thanks in advance,
> Matt
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
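P.S. For completeness, since the job also fails on the “virtualized” HA HDFS URI: the HDFS HA client settings in play look roughly like this sketch (the nameservice “mycluster” and host names are placeholders, not my actual values). The HA URI is then just hdfs://mycluster/path, with no namenode host baked in:

```xml
<!-- hdfs-site.xml: HDFS HA client settings (placeholder nameservice/hosts) -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2.example.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

<!-- core-site.xml: make the HA nameservice the default filesystem -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value>
</property>
```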