[ https://issues.apache.org/jira/browse/MYRIAD-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14987572#comment-14987572 ]
Santosh Marella commented on MYRIAD-153:
----------------------------------------

[~sdaingade], [~sarjeet] and I investigated this problem last week. We've identified the following:

a. An M/R app master typically requests slightly more containers than are required. I guess the reason behind the design is to keep some containers ready to go in case a few tasks fail.
b. When there are no task failures, these containers are never used.
c. When the job finishes, these containers are released by the app master.
d. With FGS, the myriad executor aux service doesn't seem to get a stopContainer callback, since these containers are never physically launched.

For example, for container container_1442507909665_0003_01_000043, no m/r task was actually launched (perhaps speculative execution). This container's placeholder task never finished.

smarella:~/scratch/bug20530$ grep container_1442507909665_0003_01_000043 testrm.646ddf2c-5d5a-11e5-9651-0cc47a587d16.stderr | grep Transitioned
15/09/17 10:26:26 INFO rmcontainer.RMContainerImpl: container_1442507909665_0003_01_000043 Container Transitioned from NEW to RESERVED
15/09/17 10:26:31 INFO rmcontainer.RMContainerImpl: container_1442507909665_0003_01_000043 Container Transitioned from NEW to ALLOCATED
15/09/17 10:26:32 INFO rmcontainer.RMContainerImpl: container_1442507909665_0003_01_000043 Container Transitioned from ALLOCATED to ACQUIRED
15/09/17 10:26:33 INFO rmcontainer.RMContainerImpl: container_1442507909665_0003_01_000043 Container Transitioned from ACQUIRED to RELEASED

By contrast, for container container_1442507909665_0003_01_000042, an m/r task was launched. This container's placeholder task finished correctly.

smarella:~/scratch/bug20530$ grep container_1442507909665_0003_01_000042 testrm.646ddf2c-5d5a-11e5-9651-0cc47a587d16.stderr | grep Transitioned
15/09/17 10:26:21 INFO rmcontainer.RMContainerImpl: container_1442507909665_0003_01_000042 Container Transitioned from NEW to RESERVED
15/09/17 10:26:25 INFO rmcontainer.RMContainerImpl: container_1442507909665_0003_01_000042 Container Transitioned from NEW to ALLOCATED
15/09/17 10:26:25 INFO rmcontainer.RMContainerImpl: container_1442507909665_0003_01_000042 Container Transitioned from ALLOCATED to ACQUIRED
15/09/17 10:26:26 INFO rmcontainer.RMContainerImpl: container_1442507909665_0003_01_000042 Container Transitioned from ACQUIRED to RUNNING
15/09/17 10:26:41 INFO rmcontainer.RMContainerImpl: container_1442507909665_0003_01_000042 Container Transitioned from RUNNING to COMPLETED

[~sdaingade] proposed a solution for this: The executor aux service should send COMPLETE status for all the remaining placeholder tasks after an application completes. There is a callback in the aux services interface that's called once an application finishes.

I'll implement the fix and submit a PR.

> Placeholder tasks yarn_container_* is not cleaned after yarn job is complete.
> -----------------------------------------------------------------------------
>
>                 Key: MYRIAD-153
>                 URL: https://issues.apache.org/jira/browse/MYRIAD-153
>             Project: Myriad
>          Issue Type: Bug
>            Reporter: Sarjeet Singh
>         Attachments: Mesos_UI_screeshot_placeholder_tasks_running.png
>
>
> Observed the placeholder tasks for containers launched on FGS are still in
> RUNNING state on mesos. These container tasks are not cleaned up properly
> after job is finished completely.
> see screenshot attached for mesos UI with placeholder tasks still running.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
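
The fix proposed in the comment above (finish every leftover placeholder task when the application-finished callback fires) comes down to a small piece of bookkeeping. The following is only a rough sketch of that idea in Python, not Myriad's code (Myriad itself is Java). The class name `PlaceholderTracker`, its method names, and the `send_status` callback are all hypothetical; the `yarn_` task-name prefix is taken from the issue title, and `TASK_FINISHED` is assumed to be the Mesos state that corresponds to the "COMPLETE status" mentioned in the comment. Container IDs are assumed to have the same shape as those in the logs above.

```python
from typing import Callable, Dict, Set


def app_id_of(container_id: str) -> str:
    """Derive the YARN application ID from a container ID.

    Assumes the shape seen in the logs above:
    container_<clusterTimestamp>_<appSeq>_<attempt>_<containerSeq>
    """
    parts = container_id.split("_")
    return f"application_{parts[1]}_{parts[2]}"


class PlaceholderTracker:
    """Hypothetical bookkeeping for placeholder tasks, keyed by application."""

    def __init__(self, send_status: Callable[[str, str], None]) -> None:
        # send_status(task_id, state) stands in for whatever the executor
        # uses to report a task status update to Mesos (an assumption).
        self._send_status = send_status
        self._pending: Dict[str, Set[str]] = {}

    def on_container_allocated(self, container_id: str) -> None:
        # Remember every container that has a placeholder task.
        self._pending.setdefault(app_id_of(container_id), set()).add(container_id)

    def on_container_stopped(self, container_id: str) -> None:
        # Normal path: the container really ran and a stop callback arrived.
        pending = self._pending.get(app_id_of(container_id), set())
        if container_id in pending:
            pending.remove(container_id)
            self._finish(container_id)

    def on_application_stopped(self, app_id: str) -> None:
        # Proposed fix: finish placeholders that never got a stop callback,
        # e.g. containers that went ACQUIRED -> RELEASED without running.
        for container_id in sorted(self._pending.pop(app_id, set())):
            self._finish(container_id)

    def _finish(self, container_id: str) -> None:
        self._send_status(f"yarn_{container_id}", "TASK_FINISHED")


if __name__ == "__main__":
    tracker = PlaceholderTracker(lambda task, state: print(task, state))
    tracker.on_container_allocated("container_1442507909665_0003_01_000042")
    tracker.on_container_allocated("container_1442507909665_0003_01_000043")
    # Container 000042 ran and was stopped normally.
    tracker.on_container_stopped("container_1442507909665_0003_01_000042")
    # Container 000043 never ran; it is cleaned up when the application ends.
    tracker.on_application_stopped("application_1442507909665_0003")
```

Removing the application's entry when it stops makes a repeated application-stop notification a no-op, so a placeholder is never reported as finished twice.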