[ https://issues.apache.org/jira/browse/SPARK-2971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694608#comment-14694608 ]
Jeff Zhang commented on SPARK-2971:
-----------------------------------

Looks like it has been resolved.
{code}
override def onDisconnected(remoteAddress: RpcAddress): Unit = {
  logInfo(s"Driver terminated or disconnected! Shutting down. $remoteAddress")
  // In cluster mode, do not rely on the disassociated event to exit
  // This avoids potentially reporting incorrect exit codes if the driver fails
  if (!isClusterMode) {
    finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
  }
}
{code}

> Orphaned YARN ApplicationMaster lingers forever
> -----------------------------------------------
>
> Key: SPARK-2971
> URL: https://issues.apache.org/jira/browse/SPARK-2971
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.0.2
> Environment: Python yarn client mode, Cloudera 5.1.0 on Ubuntu precise
> Reporter: Shay Rojansky
>
> We have cases where if CTRL-C is hit during a Spark job startup, a YARN
> ApplicationMaster is created but cannot connect to the driver (presumably
> because the driver has terminated). Once an AM enters this state it never
> exits it, and has to be manually killed in YARN.
> Here's an excerpt from the AM logs:
> {noformat}
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
> [jar:file:/yarn/nm/usercache/roji/filecache/40/spark-assembly-1.0.2-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 14/08/11 16:29:39 WARN NativeCodeLoader: Unable to load native-hadoop library
> for your platform... using builtin-java classes where applicable
> 14/08/11 16:29:39 INFO SecurityManager: Changing view acls to: roji
> 14/08/11 16:29:39 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(roji)
> 14/08/11 16:29:40 INFO Slf4jLogger: Slf4jLogger started
> 14/08/11 16:29:40 INFO Remoting: Starting remoting
> 14/08/11 16:29:40 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sparkyar...@g024.grid.eaglerd.local:34075]
> 14/08/11 16:29:40 INFO Remoting: Remoting now listens on addresses:
> [akka.tcp://sparkyar...@g024.grid.eaglerd.local:34075]
> 14/08/11 16:29:40 INFO RMProxy: Connecting to ResourceManager at
> master.grid.eaglerd.local/192.168.41.100:8030
> 14/08/11 16:29:40 INFO ExecutorLauncher: ApplicationAttemptId:
> appattempt_1407759736957_0014_000001
> 14/08/11 16:29:40 INFO ExecutorLauncher: Registering the ApplicationMaster
> 14/08/11 16:29:40 INFO ExecutorLauncher: Waiting for Spark driver to be
> reachable.
> 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at
> master.grid.eaglerd.local:44911, retrying ...
> 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at
> master.grid.eaglerd.local:44911, retrying ...
> 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at
> master.grid.eaglerd.local:44911, retrying ...
> 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at
> master.grid.eaglerd.local:44911, retrying ...
> 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at
> master.grid.eaglerd.local:44911, retrying ...
> {noformat}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
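
For illustration only: the log excerpt in the issue shows the AM retrying the driver connection with no visible upper bound. A minimal sketch of a bounded wait might look like the following. This is not the Spark code; DriverWaitSketch, canConnect, waitUntilReachable and all of their parameters are hypothetical names chosen for this example.

```scala
import java.io.IOException
import java.net.{InetSocketAddress, Socket}

// Hypothetical helper, not part of Spark: shows a retry loop with a deadline.
object DriverWaitSketch {

  /** Try to open one TCP connection; true if the endpoint accepted it. */
  def canConnect(host: String, port: Int, connectTimeoutMs: Int): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), connectTimeoutMs)
      true
    } catch {
      // Refused, timed out, or unresolvable host: treat as "not reachable yet".
      case _: IOException => false
    } finally {
      socket.close()
    }
  }

  /**
   * Poll `probe` until it succeeds or `totalWaitMs` has elapsed.
   * Returns true on success, false once the deadline has passed,
   * so the caller can exit instead of retrying forever.
   */
  def waitUntilReachable(
      probe: () => Boolean,
      totalWaitMs: Long,
      retryIntervalMs: Long): Boolean = {
    val deadline = System.currentTimeMillis() + totalWaitMs
    var reachable = probe()
    while (!reachable && System.currentTimeMillis() < deadline) {
      Thread.sleep(retryIntervalMs)
      reachable = probe()
    }
    reachable
  }
}
```

A caller could pass () => DriverWaitSketch.canConnect(host, port, 1000) as the probe and, when the result is false, shut the AM down with a failure status rather than continuing to retry.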