[ https://issues.apache.org/jira/browse/MESOS-6322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553587#comment-15553587 ]
Anand Mazumdar commented on MESOS-6322:
---------------------------------------

Hmm, looks like we need logic similar to what we introduced for MESOS-5380 to guard against these cases. A bit surprised that we did not add the logic to the {{subscribe}} handler on the agent for HTTP-based executors but only added it for driver-based executors (https://reviews.apache.org/r/47381).
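For illustration only, here is a minimal, self-contained sketch of the kind of guard described above: if the agent finds, at the moment an executor subscribes, that it has no queued or launched tasks left, it shuts the executor down instead of leaving its container running. The types and function names below are stand-ins invented for this sketch, not the actual agent code; the real change would presumably live in the agent's {{subscribe}} handler in {{slave.cpp}} and use the agent's own data structures, roughly mirroring what https://reviews.apache.org/r/47381 did for driver-based executors.

{code}
// Standalone illustration with stand-in types -- NOT the actual Mesos agent code.
#include <iostream>
#include <string>
#include <vector>

struct TaskInfo { std::string taskId; };

struct Executor {
  std::string id;
  std::vector<TaskInfo> queuedTasks;    // tasks the agent has not yet delivered
  std::vector<TaskInfo> launchedTasks;  // tasks already delivered to the executor
};

// Rough shape of the guard: if, by the time the executor subscribes, every task
// destined for it has already been killed and removed, there is nothing for it
// to do, so the agent should shut it down (and destroy its container) rather
// than leave it running indefinitely.
bool shouldShutdownOnSubscribe(const Executor& executor) {
  return executor.queuedTasks.empty() && executor.launchedTasks.empty();
}

int main() {
  // All tasks of the group were killed before delivery, so both lists are
  // empty when the executor finally subscribes.
  Executor executor{"instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601", {}, {}};

  if (shouldShutdownOnSubscribe(executor)) {
    std::cout << "Shutting down executor '" << executor.id
              << "' because it has no tasks" << std::endl;
  }
  return 0;
}
{code}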
> Agent fails to kill empty parent container
> -------------------------------------------
>
>                 Key: MESOS-6322
>                 URL: https://issues.apache.org/jira/browse/MESOS-6322
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Greg Mann
>            Assignee: Anand Mazumdar
>            Priority: Blocker
>              Labels: mesosphere
>
> I launched a pod using Marathon, which launched a task group on a Mesos agent. The pod spec was
> flawed, which led to Marathon repeatedly re-launching multiple instances of the task group. After
> this went on for a few minutes, I told Marathon to scale the app to 0 instances, which sends
> {{TASK_KILLED}} for one task in each task group. After this, the Mesos agent reports
> {{TASK_KILLED}} status updates for all 3 tasks in the pod, but hitting the {{/containers}}
> endpoint on the agent reveals that the executor container for this task group is still running.
>
> Here is the task group launching on the agent:
> {code}
> slave.cpp:1696] Launching task group containing tasks [ test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1, test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask2, test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.clientTask ] for framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000
> {code}
> and here is the executor container starting:
> {code}
> mesos-agent[2994]: I1006 20:23:27.407563  3094 containerizer.cpp:965] Starting container bf38ff09-3da1-487a-8926-1f4cc88bce32 for executor 'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000
> {code}
> and here is the output showing the {{TASK_KILLED}} updates for one task group:
> {code}
> mesos-agent[2994]: I1006 20:23:28.728224  3097 slave.cpp:2283] Asked to kill task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000
> mesos-agent[2994]: W1006 20:23:28.728304  3097 slave.cpp:2364] Transitioning the state of task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000 to TASK_KILLED because the executor is not registered
> mesos-agent[2994]: I1006 20:23:28.728659  3097 slave.cpp:3609] Handling status update TASK_KILLED (UUID: 1cb8197a-7829-4a05-9cb1-14ba97519c42) for task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000 from @0.0.0.0:0
> mesos-agent[2994]: I1006 20:23:28.728817  3097 slave.cpp:3609] Handling status update TASK_KILLED (UUID: e377e9fb-6466-4ce5-b32a-37d840b9f87c) for task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask2 of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000 from @0.0.0.0:0
> mesos-agent[2994]: I1006 20:23:28.728912  3097 slave.cpp:3609] Handling status update TASK_KILLED (UUID: 24d44b25-ea52-43a1-afdb-6c04389879d2) for task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.clientTask of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000 from @0.0.0.0:0
> {code}
> However, if we grep the log for the executor's ID, the last line mentioning it is:
> {code}
> slave.cpp:3080] Creating a marker file for HTTP based executor 'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000 (via HTTP) at path '/var/lib/mesos/slave/meta/slaves/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-S0/frameworks/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000/executors/instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601/runs/bf38ff09-3da1-487a-8926-1f4cc88bce32/http.marker'
> {code}
> So it seems the executor never exited. If we hit the agent's {{/containers}} endpoint, we get
> output that includes this executor container:
> {code}
> {
>   "container_id": "bf38ff09-3da1-487a-8926-1f4cc88bce32",
>   "executor_id": "instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601",
>   "executor_name": "",
>   "framework_id": "42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000",
>   "source": "",
>   "statistics": {
>     "cpus_limit": 0.1,
>     "cpus_nr_periods": 17,
>     "cpus_nr_throttled": 11,
>     "cpus_system_time_secs": 0.02,
>     "cpus_throttled_time_secs": 0.784142448,
>     "cpus_user_time_secs": 0.09,
>     "disk_limit_bytes": 10485760,
>     "disk_used_bytes": 20480,
>     "mem_anon_bytes": 11337728,
>     "mem_cache_bytes": 0,
>     "mem_critical_pressure_counter": 0,
>     "mem_file_bytes": 0,
>     "mem_limit_bytes": 33554432,
>     "mem_low_pressure_counter": 0,
>     "mem_mapped_file_bytes": 0,
>     "mem_medium_pressure_counter": 0,
>     "mem_rss_bytes": 11337728,
>     "mem_swap_bytes": 0,
>     "mem_total_bytes": 12013568,
>     "mem_unevictable_bytes": 0,
>     "timestamp": 1475792290.12373
>   },
>   "status": {
>     "executor_pid": 9068,
>     "network_infos": [
>       {
>         "ip_addresses": [
>           {
>             "ip_address": "9.0.1.34",
>             "protocol": "IPv4"
>           }
>         ],
>         "labels": {},
>         "name": "dcos",
>         "port_mappings": [
>           {
>             "container_port": 8080,
>             "host_port": 24758,
>             "protocol": "tcp"
>           },
>           {
>             "container_port": 8081,
>             "host_port": 24759,
>             "protocol": "tcp"
>           }
>         ]
>       }
>     ]
>   }
> },
> {code}
> and looking through the output of {{ps}} on the agent, we can indeed locate the executor process:
> {code}
> $ ps aux | grep 9068
> root      9068  0.0  0.1  96076 25380 ?        Ss   20:23   0:00 mesos-containerizer launch --command={"arguments":["mesos-default-executor"],"shell":false,"user":"root","value":"\/opt\/mesosphere\/packages\/mesos--3f320308144e6b5acb70fc89bc7ea13c70302de0\/libexec\/mesos\/mesos-default-executor"} --help=false --pipe_read=49 --pipe_write=50 --pre_exec_commands=[{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/opt\/mesosphere\/packages\/mesos--3f320308144e6b5acb70fc89bc7ea13c70302de0\/libexec\/mesos\/mesos-containerizer"},{"shell":true,"value":"ifconfig lo up"}] --runtime_directory=/var/run/mesos/containers/bf38ff09-3da1-487a-8926-1f4cc88bce32 --unshare_namespace_mnt=false --user=root --working_directory=/var/lib/mesos/slave/slaves/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-S0/frameworks/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-0000/executors/instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601/runs/bf38ff09-3da1-487a-8926-1f4cc88bce32
> core     20406  0.0  0.0   6764  1020 pts/1    S+   23:05   0:00 grep --colour=auto 9068
> $ sudo cat /proc/9068/task/9068/children
> 9330
> $ ps aux | grep 9330
> root      9330  0.0  0.2 498040 32944 ?        Sl   20:23   0:00 mesos-default-executor
> root     19330  0.0  0.0      0     0 ?        S    22:49   0:00 [kworker/0:2]
> core     20573  0.0  0.0   6764   888 pts/1    S+   23:07   0:00 grep --colour=auto 9330
> {code}
> Looking at the executor's logs, we find that stdout is:
> {code}
> Executing pre-exec command '{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/opt\/mesosphere\/packages\/mesos--3f320308144e6b5acb70fc89bc7ea13c70302de0\/libexec\/mesos\/mesos-containerizer"}'
> Executing pre-exec command '{"shell":true,"value":"ifconfig lo up"}'
> {code}
> and stderr is:
> {code}
> I1006 20:23:28.817401  9330 executor.cpp:189] Version: 1.1.0
> I1006 20:23:28.906349  9352 default_executor.cpp:123] Received SUBSCRIBED event
> I1006 20:23:28.908797  9352 default_executor.cpp:127] Subscribed executor on 10.0.0.133
> {code}
> Given this short executor log, it seems possible that the agent received the {{TASK_KILLED}}
> before any tasks were sent to the executor, and that the agent removed those tasks from its data
> structures without terminating the parent container.
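To make the hypothesized sequence concrete, the sketch below walks through it with stand-in types (again, this is illustrative only, not the actual {{slave.cpp}} logic): the kills arrive while the executor is still unregistered, the agent drops the tasks from its bookkeeping, and nothing is left to trigger destruction of the parent container once the executor finally subscribes with no tasks to run.

{code}
// Standalone illustration of the suspected sequence -- stand-in types only,
// NOT the actual slave.cpp code.
#include <iostream>
#include <set>
#include <string>

struct Executor {
  bool registered = false;            // the HTTP executor has not subscribed yet
  std::set<std::string> queuedTasks;  // tasks the agent is holding for it
};

// Rough shape of the kill path for a not-yet-registered executor: the task is
// transitioned to TASK_KILLED and dropped from the agent's bookkeeping, but
// nothing here destroys the executor's container.
void killTask(Executor& executor, const std::string& taskId) {
  if (!executor.registered) {
    std::cout << "Transitioning task " << taskId
              << " to TASK_KILLED because the executor is not registered\n";
    executor.queuedTasks.erase(taskId);
  }
  // Gap: even once queuedTasks is empty, the container keeps running; only a
  // guard at subscription time (as suggested in the comment above) would
  // notice that the executor has nothing left to do and shut it down.
}

int main() {
  Executor executor;
  executor.queuedTasks = {"healthTask1", "healthTask2", "clientTask"};

  // The kills arrive before the executor ever subscribes.
  for (const auto& task : {"healthTask1", "healthTask2", "clientTask"}) {
    killTask(executor, task);
  }

  std::cout << "Tasks left: " << executor.queuedTasks.size()
            << "; the executor container is still running" << std::endl;
  return 0;
}
{code}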