[ https://issues.apache.org/jira/browse/MESOS-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320409#comment-16320409 ]
Andrei Budnik edited comment on MESOS-8391 at 1/10/18 6:47 PM:
---------------------------------------------------------------
https://reviews.apache.org/r/65071/
https://reviews.apache.org/r/65077/

was (Author: abudnik):
https://reviews.apache.org/r/65071/

> Mesos agent doesn't notice that a pod task exits or crashes after the agent restart
> -----------------------------------------------------------------------------------
>
>                 Key: MESOS-8391
>                 URL: https://issues.apache.org/jira/browse/MESOS-8391
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, containerization, executor
>    Affects Versions: 1.5.0
>            Reporter: Ivan Chernetsky
>            Assignee: Andrei Budnik
>            Priority: Blocker
>         Attachments: testing-log-2.tar.gz
>
>
> h4. (1) Agent doesn't detect that a pod task exits/crashes
> # Create a Marathon pod with two containers that just run {{sleep 10000}}.
> # Restart the Mesos agent on the node where the pod was launched.
> # Kill one of the pod tasks.
> *Expected result*: The Mesos agent detects that one of the tasks was killed
> and forwards a {{TASK_FAILED}} status update to Marathon.
> *Actual result*: The Mesos agent does nothing, and the Mesos master thinks
> that both tasks are running just fine. Marathon doesn't take any action
> because it doesn't receive any update from Mesos.
> h4. (2) After another restart, the agent detects that the task crashed and
> forwards the correct status update, but the other task stays in
> {{TASK_KILLING}} state forever
> # Perform the steps in (1).
> # Restart the Mesos agent again.
> *Expected result*: The Mesos agent detects that one of the tasks crashed,
> forwards the corresponding status update, and kills the other task too.
> *Actual result*: The Mesos agent detects that one of the tasks crashed and
> forwards the corresponding status update, but the other task stays in
> {{TASK_KILLING}} state forever.
>
> Please note that after yet another agent restart, the other task finally gets
> killed and the correct status updates are propagated all the way to Marathon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)