[
https://issues.apache.org/jira/browse/MESOS-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922871#comment-16922871
]
Meng Zhu commented on MESOS-9750:
---------------------------------
Note that while this ticket makes a completed task with a nonterminal status
appear in the right place (i.e. under completed tasks), it also results in odd
behavior: a completed task can still report a nonterminal status, e.g.
TASK_RUNNING.
> Agent V1 GET_STATE response may report a complete executor's tasks as
> non-terminal after a graceful agent shutdown
> ------------------------------------------------------------------------------------------------------------------
>
> Key: MESOS-9750
> URL: https://issues.apache.org/jira/browse/MESOS-9750
> Project: Mesos
> Issue Type: Bug
> Components: agent, executor
> Affects Versions: 1.6.0, 1.7.0, 1.8.0
> Reporter: Joseph Wu
> Assignee: Joseph Wu
> Priority: Major
> Labels: foundations
> Fix For: 1.7.3, 1.8.1, 1.9.0
>
>
> When the following steps occur:
> 1) A graceful shutdown is initiated on the agent (i.e. SIGUSR1 or
> /master/machine/down).
> 2) The executor is sent a kill, and the agent counts down on
> {{executor_shutdown_grace_period}}.
> 3) The executor exits before all terminal status updates reach the agent.
> This is more likely if {{executor_shutdown_grace_period}} elapses first.
> This results in a completed executor, with non-terminal tasks (according to
> status updates).
> When the agent starts back up, the completed executor is recovered and
> correctly shows up as a completed executor in {{/state}}. However, if you
> fetch the V1 {{GET_STATE}} result, there will be an entry in
> {{launched_tasks}} even though nothing is running.
> {code}
> get_tasks {
>   launched_tasks {
>     name: "test-task"
>     task_id {
>       value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
>     }
>     framework_id {
>       value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-0000"
>     }
>     executor_id {
>       value: "default"
>     }
>     agent_id {
>       value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
>     }
>     state: TASK_RUNNING
>     resources { ... }
>     resources { ... }
>     resources { ... }
>     resources { ... }
>     statuses {
>       task_id {
>         value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
>       }
>       state: TASK_RUNNING
>       agent_id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
>       }
>       timestamp: 1556674758.2175469
>       executor_id {
>         value: "default"
>       }
>       source: SOURCE_EXECUTOR
>       uuid: "xPmn\234\236F&\235\\d\364\326\323\222\224"
>       container_status { ... }
>     }
>   }
> }
> get_executors {
>   completed_executors {
>     executor_info {
>       executor_id {
>         value: "default"
>       }
>       command {
>         value: ""
>       }
>       framework_id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-0000"
>       }
>     }
>   }
> }
> get_frameworks {
>   completed_frameworks {
>     framework_info {
>       user: "user"
>       name: "default"
>       id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-0000"
>       }
>       checkpoint: true
>       hostname: "localhost"
>       principal: "test-principal"
>       capabilities {
>         type: MULTI_ROLE
>       }
>       capabilities {
>         type: RESERVATION_REFINEMENT
>       }
>       roles: "*"
>     }
>   }
> }
> {code}
> This happens because we combine executors and completed executors when
> constructing the response. The terminal task(s) with non-terminal updates
> appear under completed executors.
> https://github.com/apache/mesos/blob/89c3dd95a421e14044bc91ceb1998ff4ae3883b4/src/slave/http.cpp#L1734-L1756
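The cause described in the issue can be illustrated with a small sketch. This is not the code in src/slave/http.cpp; the names `Task`, `Executor`, `bucket_by_state` and `bucket_by_executor` are hypothetical, and the terminal-state set is a subset chosen for illustration. It assumes the response builder merges running and completed executors and then sorts each task purely by its last known state, and contrasts that with sorting by which executor list the task came from.

```python
from dataclasses import dataclass, field

# Illustrative subset of terminal task states (assumption for this sketch).
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_ERROR"}


@dataclass
class Task:
    task_id: str
    state: str  # last state seen in a status update


@dataclass
class Executor:
    executor_id: str
    tasks: list = field(default_factory=list)


def bucket_by_state(running, completed):
    """Merge both executor lists, then classify each task by state only."""
    launched, done = [], []
    for executor in running + completed:
        for task in executor.tasks:
            if task.state in TERMINAL_STATES:
                done.append(task.task_id)
            else:
                launched.append(task.task_id)
    return launched, done


def bucket_by_executor(running, completed):
    """Treat every task of a completed executor as completed."""
    launched, done = [], []
    for executor in running:
        for task in executor.tasks:
            if task.state in TERMINAL_STATES:
                done.append(task.task_id)
            else:
                launched.append(task.task_id)
    for executor in completed:
        done.extend(task.task_id for task in executor.tasks)
    return launched, done


# A completed executor whose terminal status update never reached the agent.
completed = [Executor("default", [Task("test-task", "TASK_RUNNING")])]
print(bucket_by_state([], completed))
print(bucket_by_executor([], completed))
```

In this sketch the second function lists the task as completed, but the task's recorded state is still TASK_RUNNING, since no terminal update was ever received.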