[jira] [Comment Edited] (MESOS-5380) Killing a queued task can cause the corresponding command executor to never terminate.

2016-05-25 Thread Vinod Kone (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-5380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299539#comment-15299539 ]

Vinod Kone edited comment on MESOS-5380 at 5/25/16 6:30 AM:


Backported the fixes to 0.28.x.

commit b52a41df090fdf15c65e01805adbb6795bc34e78
Author: Vinod Kone 
Date:   Sun May 15 12:31:31 2016 -0700

Fixed agent to properly handle killTask during agent restart.

***Modified for 0.28.2***

If the agent restarts after handling a killTask but before sending the
shutdown message to the executor, we ensure the executor terminates.

Review: https://reviews.apache.org/r/47402
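
To make the restart scenario concrete, here is a minimal, self-contained
sketch of the recovery-time check (hypothetical names; not the actual
r/47402 patch): killTask already drained the executor's queued tasks and
the agent restarted before the shutdown message went out, so recovery must
shut the orphaned executor down explicitly.

{code}
#include <iostream>
#include <string>
#include <vector>

struct RecoveredExecutor {
  std::string id;
  std::vector<std::string> queuedTasks;    // tasks never delivered
  std::vector<std::string> launchedTasks;  // tasks already running
};

// Hypothetical stand-in for containerizer->destroy(executor->containerId).
void shutdownExecutor(const RecoveredExecutor& executor) {
  std::cout << "Shutting down executor '" << executor.id << "'\n";
}

// After recovery, an executor left with neither queued nor launched tasks
// can only be one whose tasks were killed before the shutdown message was
// sent; left alone it would never terminate, so shut it down explicitly.
void recover(const std::vector<RecoveredExecutor>& executors) {
  for (const RecoveredExecutor& executor : executors) {
    if (executor.queuedTasks.empty() && executor.launchedTasks.empty()) {
      shutdownExecutor(executor);
    }
  }
}

int main() {
  // Models the bug scenario: killTask removed the only queued task, then
  // the agent restarted before the shutdown reached the executor.
  recover({{"mesosvol.6ccd993c", {}, {}}});
  return 0;
}
{code}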

commit 8f73932f851096a2d1fdcd72b239be1e2e53cc58
Author: Vinod Kone 
Date:   Fri May 13 16:06:49 2016 -0700

Fixed agent to properly handle killTask of unregistered executor.

***Modified for 0.28.2***

The agent now shuts down the executor during registration if it does not
have any queued tasks (e.g., the framework sent a killTask before the
executor registered).

Note that if the executor doesn't register at all, it will be cleaned up
anyway once the registration timeout expires.

Also, note that this doesn't handle the case where the agent restarts
after processing the killTask() but before cleaning up the executor.

Review: https://reviews.apache.org/r/47381
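
As a rough illustration of the registration-time check (hypothetical names;
not the actual r/47381 patch), the sketch below shuts the executor down at
registration when killTask has already drained every queued task, instead
of leaving it idle forever:

{code}
#include <cassert>
#include <iostream>
#include <string>
#include <vector>

struct Executor {
  enum State { REGISTERING, RUNNING, TERMINATING };

  std::string id;
  std::vector<std::string> queuedTasks;
  std::vector<std::string> launchedTasks;
  State state;
};

void registered(Executor& executor) {
  if (executor.queuedTasks.empty()) {
    // An executor that has not registered yet cannot have launched tasks.
    assert(executor.launchedTasks.empty());

    std::cout << "Shutting down executor '" << executor.id
              << "': its tasks were all killed before registration\n";
    executor.state = Executor::TERMINATING;
    return;
  }

  executor.state = Executor::RUNNING;
  // ...deliver queuedTasks to the executor...
}

int main() {
  // killTask already removed the only queued task.
  Executor executor{"mesosvol.6ccd993c", {}, {}, Executor::REGISTERING};
  registered(executor);
  return 0;
}
{code}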



was (Author: vinodkone):
Backported the fix to 0.28.x.


[jira] [Comment Edited] (MESOS-5380) Killing a queued task can cause the corresponding command executor to never terminate.

2016-05-15 Thread Vinod Kone (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-5380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15283944#comment-15283944 ]

Vinod Kone edited comment on MESOS-5380 at 5/15/16 8:08 PM:


Phase 2: https://reviews.apache.org/r/47402/


was (Author: vinodkone):
https://reviews.apache.org/r/47402/

> Killing a queued task can cause the corresponding command executor to never 
> terminate.
> ---------------------------------------------------------------------------------------
>
>              Key: MESOS-5380
>              URL: https://issues.apache.org/jira/browse/MESOS-5380
>          Project: Mesos
>       Issue Type: Bug
>       Components: slave
> Affects Versions: 0.28.0, 0.28.1
>         Reporter: Jie Yu
>         Assignee: Vinod Kone
>         Priority: Blocker
>           Labels: mesosphere
>          Fix For: 0.29.0, 0.28.2
>
>
> We observed this in our testing environment. Sequence of events:
> 1) A command task is queued since the executor has not registered yet.
> 2) The framework issues a killTask.
> 3) Since the executor is in REGISTERING state, the agent calls
> `statusUpdate(TASK_KILLED, UPID())`.
> 4) `statusUpdate` now calls `containerizer->status()` before calling
> `executor->terminateTask(status.task_id(), status);`, which removes the
> queued task (introduced in this patch: https://reviews.apache.org/r/43258).
> 5) Since the above is asynchronous, it's possible that the task is still in
> `queuedTasks` when `killTask` checks whether it needs to kill the
> unregistered executor (a sketch modeling this race follows step 6):
> {code}
>   // TODO(jieyu): Here, we kill the executor if it no longer has
>   // any task to run and has not yet registered. This is a
>   // workaround for those single task executors that do not have a
>   // proper self terminating logic when they haven't received the
>   // task within a timeout.
>   if (executor->queuedTasks.empty()) {
>     CHECK(executor->launchedTasks.empty())
>       << " Unregistered executor '" << executor->id
>       << "' has launched tasks";
>
>     LOG(WARNING) << "Killing the unregistered executor " << *executor
>                  << " because it has no tasks";
>
>     executor->state = Executor::TERMINATING;
>
>     containerizer->destroy(executor->containerId);
>   }
> {code}
> 6) Consequently, the executor will never be terminated by Mesos.
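
To make the race concrete, here is an illustrative, self-contained model of
steps 4) and 5) (hypothetical names; not Mesos code): the queue of pending
callbacks stands in for libprocess futures, so `terminateTask` runs only
after `killTask` has already performed its emptiness check.

{code}
#include <deque>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

std::deque<std::function<void()>> pending;  // models deferred continuations

struct Executor {
  std::vector<std::string> queuedTasks{"mesosvol.6ccd993c"};
  bool destroyed = false;
};

void statusUpdate(Executor& executor) {
  // containerizer->status() completes later; executor->terminateTask(...)
  // is chained onto its future instead of running inline.
  pending.push_back([&executor] { executor.queuedTasks.clear(); });
}

void killTask(Executor& executor) {
  statusUpdate(executor);  // steps 3) and 4): TASK_KILLED for the queued task

  // Step 5): this check runs before the continuation above has executed,
  // so the queued task is still visible and the destroy branch is skipped.
  if (executor.queuedTasks.empty()) {
    executor.destroyed = true;  // containerizer->destroy(...)
  }
}

int main() {
  Executor executor;
  killTask(executor);

  while (!pending.empty()) {  // drain the deferred continuations
    pending.front()();
    pending.pop_front();
  }

  // Step 6): the task is gone, but the executor was never destroyed.
  std::cout << "queuedTasks: " << executor.queuedTasks.size()
            << ", destroyed: " << std::boolalpha << executor.destroyed
            << std::endl;
  return 0;
}
{code}
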
> Attaching the relevant agent log:
> {noformat}
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: 
> I0513 15:36:13.640527  1342 slave.cpp:1361] Got assigned task 
> mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 for framework 
> a3ad8418-cb77-4705-b353-4b514ceca52c-
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: 
> I0513 15:36:13.641034  1342 slave.cpp:1480] Launching task 
> mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 for framework 
> a3ad8418-cb77-4705-b353-4b514ceca52c-
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: 
> I0513 15:36:13.641440  1342 paths.cpp:528] Trying to chown 
> '/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a'
>  to user 'root'
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: 
> I0513 15:36:13.644664  1342 slave.cpp:5389] Launching executor 
> mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework 
> a3ad8418-cb77-4705-b353-4b514ceca52c- with resources cpus(*):0.1; 
> mem(*):32 in work directory 
> '/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a'
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: 
> I0513 15:36:13.645195  1342 slave.cpp:1698] Queuing task 
> 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' for executor 
> 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework 
> a3ad8418-cb77-4705-b353-4b514ceca52c-
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: 
> I0513 15:36:13.645491  1338 containerizer.cpp:671] Starting container 
> '24762d43-2134-475e-b724-caa72110497a' for executor 
> 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework 
> 'a3ad8418-cb77-4705-b353-4b514ceca52c-'
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: 
> I0513 15:36:13.647897  1345 cpushare.cpp:389] Updated 'cpu.shares' to 1126 
> (cpus 1.1) for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: 
> I0513 15:36:13.648619  1345 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 
> 100ms and 'cpu.cfs_quota_us' to 110ms