[ https://issues.apache.org/jira/browse/TWILL-190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15437601#comment-15437601 ]
ASF GitHub Bot commented on TWILL-190:
--------------------------------------

Github user chtyim commented on a diff in the pull request:

    https://github.com/apache/twill/pull/4#discussion_r76318343

    --- Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/ApplicationMasterService.java ---
    @@ -268,20 +270,33 @@ public void acquired(List<? extends ProcessLauncher<YarnContainerInfo>> launcher
           @Override
           public void completed(List<YarnContainerStatus> completed) {
             for (YarnContainerStatus status : completed) {
    +          handleCompleted(completed);
               ids.remove(status.getContainerId());
             }
           }
         };

    -    runningContainers.stopAll();
    -
    -    // Poll for 5 seconds to wait for containers to stop.
    -    int count = 0;
    -    while (!ids.isEmpty() && count++ < 5) {
    -      amClient.allocate(0.0f, handler);
    -      TimeUnit.SECONDS.sleep(1);
    -    }
    +    // Handle heartbeats during shutdown because runningContainers.stopAll() waits until
    +    // handleCompleted() is called for every stopped runnable
    +    ExecutorService stopPoller = Executors.newSingleThreadExecutor(Threads.createDaemonThreadFactory("stopPoller"));
    +    stopPoller.execute(new Runnable() {
    +      @Override
    +      public void run() {
    +        while (!ids.isEmpty()) {
    +          try {
    +            amClient.allocate(0.0f, handler);
    +            TimeUnit.SECONDS.sleep(1);
    --- End diff --

    Should check if `ids` is already empty before sleeping, since the call to `allocate` may already have emptied `ids` through the handler, and we don't have to sleep for an extra second in that case.

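
One way to read the review suggestion is the minimal sketch below. It is not the code from the pull request: `StopPollSketch`, `pollUntilEmpty`, and the `Runnable` heartbeat are hypothetical stand-ins, with the heartbeat playing the role of `amClient.allocate(0.0f, handler)` and a plain `Set<String>` playing the role of `ids`. The loop re-checks the set right after the heartbeat and skips the sleep when the handler has already emptied it.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

final class StopPollSketch {

  /**
   * Runs the heartbeat until ids is empty, sleeping between heartbeats only
   * while ids are still outstanding. Returns the number of sleeps performed.
   * The heartbeat is a hypothetical stand-in for amClient.allocate(0.0f, handler).
   */
  static int pollUntilEmpty(Set<String> ids, Runnable heartbeat, long sleepMillis) {
    int sleeps = 0;
    while (!ids.isEmpty()) {
      heartbeat.run();
      // The heartbeat's handler may have removed the last id; skip the sleep if so.
      if (ids.isEmpty()) {
        break;
      }
      try {
        TimeUnit.MILLISECONDS.sleep(sleepMillis);
      } catch (InterruptedException e) {
        // Restore the interrupt flag and stop polling.
        Thread.currentThread().interrupt();
        return sleeps;
      }
      sleeps++;
    }
    return sleeps;
  }

  public static void main(String[] args) {
    Set<String> ids = ConcurrentHashMap.newKeySet();
    ids.add("container_1");
    ids.add("container_2");
    // Each heartbeat "completes" one container, mimicking the handler removing an id.
    int sleeps = pollUntilEmpty(ids, () -> ids.remove(ids.iterator().next()), 10);
    System.out.println("sleeps = " + sleeps);
  }
}
```

With two outstanding ids and a heartbeat that completes one container per call, this loop sleeps once; the unconditional sleep in the diff would sleep a second time after the set is already empty.
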
> Restart of a TwillRunnable does not wait for the runnable to stop
> -----------------------------------------------------------------
>
>                 Key: TWILL-190
>                 URL: https://issues.apache.org/jira/browse/TWILL-190
>             Project: Apache Twill
>          Issue Type: Bug
>          Components: core, yarn
>   Affects Versions: 0.6.0-incubating, 0.7.0-incubating
>            Reporter: Poorna Chandra
>            Assignee: Poorna Chandra
>             Fix For: 0.8.0
>
>
> Today when a TwillRunnable is restarted, the call sends a stop message to the
> TwillRunnable, and then starts new TwillRunnable without waiting for the
> stopping runnable to finish stopping.
> This can leave a non-responding TwillRunnable container running, and can lead
> to issues like two TwillRunnables with same instance id running at the same
> time.
> We should kill the containers that don't respond to stop message.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)