[ https://issues.apache.org/jira/browse/MESOS-6322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553587#comment-15553587 ]
Anand Mazumdar commented on MESOS-6322:
---------------------------------------

Hmm, looks like we need logic similar to what we introduced for MESOS-5380 to guard against these cases. A bit surprised that we did not add the logic to the {{subscribe}} handler on the agent for HTTP-based executors but only added it for driver-based executors (https://reviews.apache.org/r/47381).
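For illustration only, here is a minimal, self-contained sketch of the kind of guard described above: if the agent finds, at the moment an executor subscribes, that it has no queued or launched tasks left, it shuts the executor down instead of leaving its container running. The types and function names below are stand-ins invented for this sketch, not the actual agent code; the real change would presumably live in the agent's {{subscribe}} handler in {{slave.cpp}} and use the agent's own data structures, roughly mirroring what https://reviews.apache.org/r/47381 did for driver-based executors.

{code}
// Standalone illustration with stand-in types -- NOT the actual Mesos agent code.
#include <iostream>
#include <string>
#include <vector>

struct TaskInfo { std::string taskId; };

struct Executor {
  std::string id;
  std::vector<TaskInfo> queuedTasks;    // tasks the agent has not yet delivered
  std::vector<TaskInfo> launchedTasks;  // tasks already delivered to the executor
};

// Rough shape of the guard: if, by the time the executor subscribes, every task
// destined for it has already been killed and removed, there is nothing for it
// to do, so the agent should shut it down (and destroy its container) rather
// than leave it running indefinitely.
bool shouldShutdownOnSubscribe(const Executor& executor) {
  return executor.queuedTasks.empty() && executor.launchedTasks.empty();
}

int main() {
  // All tasks of the group were killed before delivery, so both lists are
  // empty when the executor finally subscribes.
  Executor executor{"instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601", {}, {}};

  if (shouldShutdownOnSubscribe(executor)) {
    std::cout << "Shutting down executor '" << executor.id
              << "' because it has no tasks" << std::endl;
  }
  return 0;
}
{code}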
> Agent fails to kill empty parent container
> -------------------------------------------
>
>                 Key: MESOS-6322
>                 URL: https://issues.apache.org/jira/browse/MESOS-6322
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Greg Mann
>            Assignee: Anand Mazumdar
>            Priority: Blocker
>              Labels: mesosphere
>
> I launched a pod using Marathon, which launched a task group on a Mesos agent. The pod spec was
> flawed, which led to Marathon repeatedly re-launching multiple instances of the task group. After
> this went on for a few minutes, I told Marathon to scale the app to 0 instances, which sends
> {{TASK_KILLED}} for one task in each task group. After this, the Mesos agent reports
> {{TASK_KILLED}} status updates for all 3 tasks in the pod, but hitting the {{/containers}}
> endpoint on the agent reveals that the executor container for this task group is still running.
>
> Here is the task group launching on the agent:
> {code}
> slave.cpp:1696] Launching task group containing tasks [ test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1, test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask2, test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.clientTask ] for framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000
> {code}
> and here is the executor container starting:
> {code}
> mesos-agent[2994]: I1006 20:23:27.407563  3094 containerizer.cpp:965] Starting container bf38ff09-3da1-487a-8926-1f4cc88bce32 for executor 'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000
> {code}
> and here is the output showing the {{TASK_KILLED}} updates for one task group:
> {code}
> mesos-agent[2994]: I1006 20:23:28.728224  3097 slave.cpp:2283] Asked to kill task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000
> mesos-agent[2994]: W1006 20:23:28.728304  3097 slave.cpp:2364] Transitioning the state of task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000 to TASK_KILLED because the executor is not registered
> mesos-agent[2994]: I1006 20:23:28.728659  3097 slave.cpp:3609] Handling status update TASK_KILLED (UUID: 1cb8197a-7829-4a05-9cb1-14ba97519c42) for task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000 from @0.0.0.0:0
> mesos-agent[2994]: I1006 20:23:28.728817  3097 slave.cpp:3609] Handling status update TASK_KILLED (UUID: e377e9fb-6466-4ce5-b32a-37d840b9f87c) for task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask2 of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000 from @0.0.0.0:0
> mesos-agent[2994]: I1006 20:23:28.728912  3097 slave.cpp:3609] Handling status update TASK_KILLED (UUID: 24d44b25-ea52-43a1-afdb-6c04389879d2) for task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.clientTask of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000 from @0.0.0.0:0
> {code}
> However, if we grep the log for the executor's ID, the last line mentioning it is:
> {code}
> slave.cpp:3080] Creating a marker file for HTTP based executor 'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000 (via HTTP) at path '/var/lib/mesos/slave/meta/slaves/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-S0/frameworks/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000/executors/instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601/runs/bf38ff09-3da1-487a-8926-1f4cc88bce32/http.marker'
> {code}
> So it seems the executor never exited. If we hit the agent's {{/containers}} endpoint, we get
> output that includes this executor container:
> {code}
> {
>   "container_id": "bf38ff09-3da1-487a-8926-1f4cc88bce32",
>   "executor_id": "instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601",
>   "executor_name": "",
>   "framework_id": "42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000",
>   "source": "",
>   "statistics": {
>     "cpus_limit": 0.1,
>     "cpus_nr_periods": 17,
>     "cpus_nr_throttled": 11,
>     "cpus_system_time_secs": 0.02,
>     "cpus_throttled_time_secs": 0.784142448,
>     "cpus_user_time_secs": 0.09,
>     "disk_limit_bytes": 10485760,
>     "disk_used_bytes": 20480,
>     "mem_anon_bytes": 11337728,
>     "mem_cache_bytes": 0,
>     "mem_critical_pressure_counter": 0,
>     "mem_file_bytes": 0,
>     "mem_limit_bytes": 33554432,
>     "mem_low_pressure_counter": 0,
>     "mem_mapped_file_bytes": 0,
>     "mem_medium_pressure_counter": 0,
>     "mem_rss_bytes": 11337728,
>     "mem_swap_bytes": 0,
>     "mem_total_bytes": 12013568,
>     "mem_unevictable_bytes": 0,
>     "timestamp": 1475792290.12373
>   },
>   "status": {
>     "executor_pid": 9068,
>     "network_infos": [
>       {
>         "ip_addresses": [
>           {
>             "ip_address": "9.0.1.34",
>             "protocol": "IPv4"
>           }
>         ],
>         "labels": {},
>         "name": "dcos",
>         "port_mappings": [
>           {
>             "container_port": 8080,
>             "host_port": 24758,
>             "protocol": "tcp"
>           },
>           {
>             "container_port": 8081,
>             "host_port": 24759,
>             "protocol": "tcp"
>           }
>         ]
>       }
>     ]
>   }
> },
> {code}
> and looking through the output of {{ps}} on the agent, we can indeed locate the executor process:
> {code}
> $ ps aux | grep 9068
> root      9068  0.0  0.1  96076 25380 ?        Ss   20:23   0:00 mesos-containerizer launch --command={"arguments":["mesos-default-executor"],"shell":false,"user":"root","value":"\/opt\/mesosphere\/packages\/mesos--3f320308144e6b5acb70fc89bc7ea13c70302de0\/libexec\/mesos\/mesos-default-executor"} --help=false --pipe_read=49 --pipe_write=50 --pre_exec_commands=[{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/opt\/mesosphere\/packages\/mesos--3f320308144e6b5acb70fc89bc7ea13c70302de0\/libexec\/mesos\/mesos-containerizer"},{"shell":true,"value":"ifconfig lo up"}] --runtime_directory=/var/run/mesos/containers/bf38ff09-3da1-487a-8926-1f4cc88bce32 --unshare_namespace_mnt=false --user=root --working_directory=/var/lib/mesos/slave/slaves/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-S0/frameworks/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000/executors/instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601/runs/bf38ff09-3da1-487a-8926-1f4cc88bce32
> core     20406  0.0  0.0   6764  1020 pts/1    S+   23:05   0:00 grep --colour=auto 9068
> $ sudo cat /proc/9068/task/9068/children
> 9330
> $ ps aux | grep 9330
> root      9330  0.0  0.2 498040 32944 ?        Sl   20:23   0:00 mesos-default-executor
> root     19330  0.0  0.0      0     0 ?        S    22:49   0:00 [kworker/0:2]
> core     20573  0.0  0.0   6764   888 pts/1    S+   23:07   0:00 grep --colour=auto 9330
> {code}
> Looking at the executor's logs, we find that stdout is:
> {code}
> Executing pre-exec command '{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/opt\/mesosphere\/packages\/mesos--3f320308144e6b5acb70fc89bc7ea13c70302de0\/libexec\/mesos\/mesos-containerizer"}'
> Executing pre-exec command '{"shell":true,"value":"ifconfig lo up"}'
> {code}
> and stderr is:
> {code}
> I1006 20:23:28.817401  9330 executor.cpp:189] Version: 1.1.0
> I1006 20:23:28.906349  9352 default_executor.cpp:123] Received SUBSCRIBED event
> I1006 20:23:28.908797  9352 default_executor.cpp:127] Subscribed executor on 10.0.0.133
> {code}
> Given this short executor log, it seems possible that the agent received the {{TASK_KILLED}}
> before any tasks were sent to the executor, and that the agent removed those tasks from its data
> structures without terminating the parent container.
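To make the hypothesized sequence concrete, the sketch below walks through it with stand-in types (again, this is illustrative only, not the actual {{slave.cpp}} logic): the kills arrive while the executor is still unregistered, the agent drops the tasks from its bookkeeping, and nothing is left to trigger destruction of the parent container once the executor finally subscribes with no tasks to run.

{code}
// Standalone illustration of the suspected sequence -- stand-in types only,
// NOT the actual slave.cpp code.
#include <iostream>
#include <set>
#include <string>

struct Executor {
  bool registered = false;            // the HTTP executor has not subscribed yet
  std::set<std::string> queuedTasks;  // tasks the agent is holding for it
};

// Rough shape of the kill path for a not-yet-registered executor: the task is
// transitioned to TASK_KILLED and dropped from the agent's bookkeeping, but
// nothing here destroys the executor's container.
void killTask(Executor& executor, const std::string& taskId) {
  if (!executor.registered) {
    std::cout << "Transitioning task " << taskId
              << " to TASK_KILLED because the executor is not registered\n";
    executor.queuedTasks.erase(taskId);
  }
  // Gap: even once queuedTasks is empty, the container keeps running; only a
  // guard at subscription time (as suggested in the comment above) would
  // notice that the executor has nothing left to do and shut it down.
}

int main() {
  Executor executor;
  executor.queuedTasks = {"healthTask1", "healthTask2", "clientTask"};

  // The kills arrive before the executor ever subscribes.
  for (const auto& task : {"healthTask1", "healthTask2", "clientTask"}) {
    killTask(executor, task);
  }

  std::cout << "Tasks left: " << executor.queuedTasks.size()
            << "; the executor container is still running" << std::endl;
  return 0;
}
{code}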