Yes, I’m pretty sure my YARN and HDFS HA configuration is correct.  I can use 
the UIs and the HDFS command-line tools with HA support as expected (failing 
over namenodes and resourcemanagers, etc.), so I believe this to be a Spark 
issue.

As I mentioned earlier, if I set “yarn.resourcemanager.address” to point at 
the active resource manager, things work as expected, although that would not 
be an HA setup…
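
For reference, here is a minimal sketch of the client-side yarn-site.xml HA 
settings I’d expect to make an explicit address unnecessary (the cluster-id, 
rm-ids, and hostnames below are placeholders, not my actual values):

<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>my-cluster</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>rm1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>rm2.example.com</value>
</property>
<property>
  <!-- this is the Hadoop 2.4 default; shown here only for clarity -->
  <name>yarn.client.failover-proxy-provider</name>
  <value>org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider</value>
</property>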

An “unaccepted” reply to this thread from Dean Chen suggested building Spark 
against a newer version of Hadoop (2.4.1), and this has worked to some extent. 
I’m now able to submit jobs (omitting an explicit 
“yarn.resourcemanager.address” property), and the 
ConfiguredRMFailoverProxyProvider appears to route the submission to whichever 
resource manager is active.  Thanks Dean!
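
For anyone following along, the build command I used was along these lines (a 
sketch; check the build docs for your exact Spark version, as the profile 
names may differ):

  mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1 -DskipTests clean package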

However, the Spark jobs running in the ApplicationMaster on a given node now 
fail to find the active resourcemanager.  Below is a log excerpt from one of 
the assigned nodes; note that the ApplicationMaster tries to connect to the 
ResourceManager at 0.0.0.0:8030, i.e., it is not picking up an RM address at 
all.  As all the attempts fail, YARN eventually moves the application to the 
node that co-locates the active resourcemanager and a nodemanager, where the 
job proceeds a bit further.  Then the Spark job itself fails attempting to 
access HDFS files via the logical HA HDFS URI.
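
To be clear, by the HA HDFS URI I mean the logical nameservice URI from 
core-site.xml/hdfs-site.xml, roughly like the following sketch (the 
“mycluster” nameservice and namenode hostnames are placeholders):

<!-- core-site.xml -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value>
</property>

<!-- hdfs-site.xml -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2.example.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>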

I’m running Apache Spark 1.0.2 built against Hadoop 2.4.1.  Has it been 
verified that Spark works with HA YARN/HDFS?
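
For completeness, I’m submitting roughly like this (the class and jar names 
are placeholders):

  spark-submit --master yarn-cluster --class com.example.SparkJob \
    --num-executors 4 my-spark-job.jar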

===================================================
14/08/20 11:34:23 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/08/20 11:34:24 INFO ApplicationMaster: ApplicationAttemptId: 
appattempt_1408548063882_0002_000001
14/08/20 11:34:24 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8030
14/08/20 11:34:24 INFO SecurityManager: Changing view acls to: hadoop
14/08/20 11:34:24 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hadoop)
14/08/20 11:34:24 INFO ApplicationMaster: Starting the user JAR in a separate 
Thread
14/08/20 11:34:24 INFO ApplicationMaster: Waiting for Spark context 
initialization
14/08/20 11:34:24 INFO ApplicationMaster: Waiting for Spark context 
initialization ... 0
14/08/20 11:34:24 WARN SparkConf: In Spark 1.0 and later spark.local.dir will 
be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in 
mesos/standalone and LOCAL_DIRS in YARN).
14/08/20 11:34:25 INFO Slf4jLogger: Slf4jLogger started
14/08/20 11:34:25 INFO Remoting: Starting remoting
14/08/20 11:34:25 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sp...@ip-10-0-5-106.us-west-2.compute.internal:41419]
14/08/20 11:34:25 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://sp...@ip-10-0-5-106.us-west-2.compute.internal:41419]
14/08/20 11:34:27 INFO Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8030. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
14/08/20 11:34:28 INFO Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8030. Already tried 1 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
14/08/20 11:34:29 INFO Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8030. Already tried 2 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
14/08/20 11:34:30 INFO Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8030. Already tried 3 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
14/08/20 11:34:31 INFO Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8030. Already tried 4 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
14/08/20 11:34:32 INFO Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8030. Already tried 5 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
14/08/20 11:34:33 INFO Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8030. Already tried 6 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
14/08/20 11:34:34 INFO Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8030. Already tried 7 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)


On Aug 19, 2014, at 5:34 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> Hi Matt,
> 
> I checked in the YARN code and I don't see any references to 
> yarn.resourcemanager.address.  Have you made sure that your YARN client 
> configuration on the node you're launching from contains the right configs?
> 
> -Sandy  
> 
> 
> On Mon, Aug 18, 2014 at 4:07 PM, Matt Narrell <matt.narr...@gmail.com> wrote:
> Hello,
> 
> I have an HA-enabled YARN cluster with two resource managers.  When 
> submitting jobs via “spark-submit --master yarn-cluster”, it appears that 
> the driver looks explicitly for the "yarn.resourcemanager.address” property 
> rather than round-robining through the resource managers via the 
> “yarn.client.failover-proxy-provider” property set to 
> “org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider”.
> 
> If I explicitly set the “yarn.resourcemanager.address” to the active resource 
> manager, jobs will submit fine.
> 
> Is there a way to make “spark-submit --master yarn-cluster” respect the 
> failover proxy provider?
> 
> Thanks in advance,
> Matt