[ https://issues.apache.org/jira/browse/SPARK-25563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16633575#comment-16633575 ]
Hyukjin Kwon commented on SPARK-25563:
--------------------------------------

Please avoid setting the target version, which is usually reserved for committers.

> Spark application hangs If container allocate on lost Nodemanager
> -----------------------------------------------------------------
>
>                 Key: SPARK-25563
>                 URL: https://issues.apache.org/jira/browse/SPARK-25563
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.1
>            Reporter: devinduan
>            Priority: Minor
>
> I hit an issue where a Spark application started in YARN client mode
> sometimes hangs.
> I checked the application logs: a container was allocated on a lost
> NodeManager, but the AM did not retry to start another executor.
> My Spark version is 2.3.1.
> Here is my ApplicationMaster log.
>
> 2018-09-26 05:21:15 INFO YarnRMClient:54 - Registering the ApplicationMaster
> 2018-09-26 05:21:15 INFO ConfiguredRMFailoverProxyProvider:100 - Failing over to rm2
> 2018-09-26 05:21:15 WARN Utils:66 - spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.
> 2018-09-26 05:21:15 INFO Utils:54 - Using initial executors = 1, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
> 2018-09-26 05:21:15 INFO YarnAllocator:54 - Will request 1 executor container(s), each with 24 core(s) and 20275 MB memory (including 1843 MB of overhead)
> 2018-09-26 05:21:15 INFO YarnAllocator:54 - Submitted 1 unlocalized container requests.
> 2018-09-26 05:21:15 INFO ApplicationMaster:54 - Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
> 2018-09-26 05:21:27 WARN YarnAllocator:66 - Cannot find executorId for container: container_1532951609168_4721728_01_000002
> 2018-09-26 05:21:27 INFO YarnAllocator:54 - Completed container container_1532951609168_4721728_01_000002 (state: COMPLETE, exit status: -100)
> 2018-09-26 05:21:27 WARN YarnAllocator:66 - Container marked as failed: container_1532951609168_4721728_01_000002. Exit status: -100. Diagnostics: Container released on a *lost* node

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)