Shay Rojansky created SPARK-2971:
------------------------------------

             Summary: Orphaned YARN ApplicationMaster lingers forever
                 Key: SPARK-2971
                 URL: https://issues.apache.org/jira/browse/SPARK-2971
             Project: Spark
          Issue Type: Bug
    Affects Versions: 1.0.2
         Environment: Python yarn client mode, Cloudera 5.1.0 on Ubuntu precise
            Reporter: Shay Rojansky

We have cases where, if CTRL-C is hit during Spark job startup, a YARN ApplicationMaster is created but cannot connect to the driver (presumably because the driver has already terminated). Once an AM enters this state it never leaves it and has to be killed manually in YARN.

Here's an excerpt from the AM logs:

{noformat}
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/yarn/nm/usercache/roji/filecache/40/spark-assembly-1.0.2-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/08/11 16:29:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/08/11 16:29:39 INFO SecurityManager: Changing view acls to: roji
14/08/11 16:29:39 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(roji)
14/08/11 16:29:40 INFO Slf4jLogger: Slf4jLogger started
14/08/11 16:29:40 INFO Remoting: Starting remoting
14/08/11 16:29:40 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkyar...@g024.grid.eaglerd.local:34075]
14/08/11 16:29:40 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkyar...@g024.grid.eaglerd.local:34075]
14/08/11 16:29:40 INFO RMProxy: Connecting to ResourceManager at master.grid.eaglerd.local/192.168.41.100:8030
14/08/11 16:29:40 INFO ExecutorLauncher: ApplicationAttemptId: appattempt_1407759736957_0014_000001
14/08/11 16:29:40 INFO ExecutorLauncher: Registering the ApplicationMaster
14/08/11 16:29:40 INFO ExecutorLauncher: Waiting for Spark driver to be reachable.
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ...
{noformat}
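
For illustration only, here is a minimal Python sketch of what a bounded wait for the driver could look like, as opposed to the unbounded retrying seen in the log. It is not Spark's ExecutorLauncher code: the function name wait_for_driver and the parameters max_attempts, delay_s and timeout_s are hypothetical, and a plain TCP connect is assumed as the reachability check.

```python
import socket
import time


def wait_for_driver(host, port, max_attempts=5, delay_s=0.1, timeout_s=1.0):
    """Try to open a TCP connection to (host, port).

    Returns True as soon as one attempt succeeds, or False once
    max_attempts attempts have failed, so the caller can shut down
    instead of retrying forever. All names here are hypothetical.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # create_connection raises OSError (e.g. connection refused,
            # timeout) when the peer is not reachable.
            with socket.create_connection((host, port), timeout=timeout_s):
                return True
        except OSError:
            print(
                f"Failed to connect to driver at {host}:{port}, "
                f"retrying ... ({attempt}/{max_attempts})"
            )
            # Pause between attempts, but not after the final one.
            if attempt < max_attempts:
                time.sleep(delay_s)
    return False
```

A caller could exit with a failure status when this returns False, which would let YARN clean up the attempt rather than leaving it to be killed by hand.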