Shay Rojansky created SPARK-2971:
------------------------------------

             Summary: Orphaned YARN ApplicationMaster lingers forever
                 Key: SPARK-2971
                 URL: https://issues.apache.org/jira/browse/SPARK-2971
             Project: Spark
          Issue Type: Bug
    Affects Versions: 1.0.2
         Environment: Python yarn client mode, Cloudera 5.1.0 on Ubuntu precise
            Reporter: Shay Rojansky


In some cases, if CTRL-C is hit while a Spark job is starting up, a YARN 
ApplicationMaster is created but cannot connect to the driver (presumably 
because the driver has already terminated). Once an AM enters this state it 
never exits, and has to be killed manually in YARN.
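
For reference, the manual cleanup can be scripted. The following is only a sketch: the helper names are hypothetical, and it assumes the standard YARN attempt ID layout (appattempt_<cluster timestamp>_<application sequence>_<attempt sequence>) and the stock "yarn application -kill" command.

```python
import re
from typing import List


def application_id_from_attempt(attempt_id: str) -> str:
    """Derive the YARN application ID from an application attempt ID.

    Hypothetical helper: assumes the layout
    appattempt_<clusterTimestamp>_<appSeq>_<attemptSeq>.
    """
    match = re.fullmatch(r"appattempt_(\d+)_(\d+)_(\d+)", attempt_id)
    if match is None:
        raise ValueError("not a YARN application attempt id: %r" % attempt_id)
    cluster_ts, app_seq, _attempt_seq = match.groups()
    return "application_%s_%s" % (cluster_ts, app_seq)


def kill_command(attempt_id: str) -> List[str]:
    """Build (but do not run) the CLI command that kills the owning application."""
    return ["yarn", "application", "-kill", application_id_from_attempt(attempt_id)]


if __name__ == "__main__":
    # Attempt ID taken from the AM log in this report.
    print(" ".join(kill_command("appattempt_1407759736957_0014_000001")))
    # prints: yarn application -kill application_1407759736957_0014
```

The command is only built, not executed, so it can be reviewed before being passed to subprocess.run or pasted into a shell.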

Here's an excerpt from the AM logs:

{noformat}
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/yarn/nm/usercache/roji/filecache/40/spark-assembly-1.0.2-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/08/11 16:29:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/08/11 16:29:39 INFO SecurityManager: Changing view acls to: roji
14/08/11 16:29:39 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(roji)
14/08/11 16:29:40 INFO Slf4jLogger: Slf4jLogger started
14/08/11 16:29:40 INFO Remoting: Starting remoting
14/08/11 16:29:40 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkyar...@g024.grid.eaglerd.local:34075]
14/08/11 16:29:40 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkyar...@g024.grid.eaglerd.local:34075]
14/08/11 16:29:40 INFO RMProxy: Connecting to ResourceManager at master.grid.eaglerd.local/192.168.41.100:8030
14/08/11 16:29:40 INFO ExecutorLauncher: ApplicationAttemptId: appattempt_1407759736957_0014_000001
14/08/11 16:29:40 INFO ExecutorLauncher: Registering the ApplicationMaster
14/08/11 16:29:40 INFO ExecutorLauncher: Waiting for Spark driver to be reachable.
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ...
{noformat}
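
The excerpt shows five failed connection attempts within the same second and no sign of the AM giving up. To illustrate the behaviour I would expect instead, here is a rough sketch of a wait loop with an overall deadline and a pause between attempts. This is not the ExecutorLauncher code; the function name, the parameters and their defaults are all made up for illustration.

```python
import socket
import time


def wait_for_driver(host, port, max_wait_s=60.0, retry_interval_s=1.0,
                    connect=socket.create_connection):
    """Return True once host:port accepts a TCP connection.

    Returns False if no attempt has succeeded by the time max_wait_s has
    elapsed, so the caller can unregister and exit instead of lingering.
    `connect` is injectable only so the loop can be exercised without a network.
    """
    deadline = time.monotonic() + max_wait_s
    while True:
        try:
            conn = connect((host, port), timeout=retry_interval_s)
            conn.close()
            return True
        except OSError:
            # Covers connection refused, timeouts and unreachable hosts.
            if time.monotonic() >= deadline:
                return False
            # Back off instead of retrying in a tight loop.
            time.sleep(retry_interval_s)


if __name__ == "__main__":
    # Driver address taken from the log above; the limits are illustrative.
    if not wait_for_driver("master.grid.eaglerd.local", 44911, max_wait_s=5.0):
        print("driver unreachable, giving up")
```

With a bound like this the AM would exit on its own once the driver is gone, and the attempt would show up as failed in YARN rather than staying registered indefinitely.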


