[jira] [Commented] (MESOS-6285) Agents may OOM during recovery if there are too many tasks or executors

Joseph Wu (JIRA) Mon, 08 Apr 2019 09:12:26 -0700


    [ https://issues.apache.org/jira/browse/MESOS-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812549#comment-16812549 ]


Joseph Wu commented on MESOS-6285:
----------------------------------

MESOS-7947 is only a partial solution.  That ticket added completed task 
metadata directories to the agent's existing GC mechanism.  This means it is 
still possible to hit an OOM during recovery if:
1) We launch lots of tasks very quickly.  The GC settings won't clean up quick 
bursts of tasks until days or weeks later.
2) Or, we launch many tasks while disk utilization stays low.  Since disk is 
usually much larger than memory, the metadata can be too large to fit into 
memory without consuming much space on disk.  Again, GC won't kick in 
for days or weeks.

> Agents may OOM during recovery if there are too many tasks or executors
> -----------------------------------------------------------------------
>
>                 Key: MESOS-6285
>                 URL: https://issues.apache.org/jira/browse/MESOS-6285
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.0.1
>            Reporter: Joseph Wu
>            Priority: Critical
>              Labels: mesosphere
>
> On a test cluster, we encountered a degenerate case where running the 
> example {{long-lived-framework}} for over a week would render the agent 
> unrecoverable.  
> The {{long-lived-framework}} creates one custom {{long-lived-executor}} and 
> launches a single task on that executor every time it receives an offer from 
> that agent.  Over a week's worth of time, the framework manages to launch 
> some 400k tasks (short sleeps) on one executor.  During runtime, this is not 
> problematic, as each completed task is quickly rotated out of the agent's 
> memory (and checkpointed to disk).
> During recovery, however, the agent reads every single task into memory, 
> which slows recovery and often results in the agent being OOM-killed 
> before it finishes recovering.
> To repro this condition quickly:
> 1) Apply this patch to the {{long-lived-framework}}:
> {code}
> diff --git a/src/examples/long_lived_framework.cpp b/src/examples/long_lived_framework.cpp
> index 7c57eb5..1263d82 100644
> --- a/src/examples/long_lived_framework.cpp
> +++ b/src/examples/long_lived_framework.cpp
> @@ -358,16 +358,6 @@ private:
>    // Helper to launch a task using an offer.
>    void launch(const Offer& offer)
>    {
> -    int taskId = tasksLaunched++;
> -    ++metrics.tasks_launched;
> -
> -    TaskInfo task;
> -    task.set_name("Task " + stringify(taskId));
> -    task.mutable_task_id()->set_value(stringify(taskId));
> -    task.mutable_agent_id()->MergeFrom(offer.agent_id());
> -    task.mutable_resources()->CopyFrom(taskResources);
> -    task.mutable_executor()->CopyFrom(executor);
> -
>      Call call;
>      call.set_type(Call::ACCEPT);
>  
> @@ -380,7 +370,23 @@ private:
>      Offer::Operation* operation = accept->add_operations();
>      operation->set_type(Offer::Operation::LAUNCH);
>  
> -    operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +    // Launch as many tasks as possible in the given offer.
> +    Resources remaining = Resources(offer.resources()).flatten();
> +    while (remaining.contains(taskResources)) {
> +      int taskId = tasksLaunched++;
> +      ++metrics.tasks_launched;
> +
> +      TaskInfo task;
> +      task.set_name("Task " + stringify(taskId));
> +      task.mutable_task_id()->set_value(stringify(taskId));
> +      task.mutable_agent_id()->MergeFrom(offer.agent_id());
> +      task.mutable_resources()->CopyFrom(taskResources);
> +      task.mutable_executor()->CopyFrom(executor);
> +
> +      operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +
> +      remaining -= taskResources;
> +    }
>  
>      mesos->send(call);
>    }
> {code}
> 2) Run a master, agent, and {{long-lived-framework}}.  With this patch on a 
> 1 CPU, 1 GB agent, it should take about 10 minutes to build up sufficient 
> task launches.
> 3) Restart the agent and watch it flail during recovery.
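
The packing loop in the patch above can be illustrated in isolation. The following is a minimal sketch, not the Mesos implementation: `SimpleResources` and `packOffer` are hypothetical stand-ins for `Resources` and the body of `launch()`, and they model only CPUs and memory. The offer and per-task sizes in the usage note are illustrative and do not come from the ticket.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for mesos::Resources; tracks only CPUs and memory.
struct SimpleResources {
  double cpus;
  double memMb;

  // True if `other` fits entirely within these resources.
  bool contains(const SimpleResources& other) const {
    return cpus >= other.cpus && memMb >= other.memMb;
  }

  SimpleResources& operator-=(const SimpleResources& other) {
    cpus -= other.cpus;
    memMb -= other.memMb;
    return *this;
  }
};

// Hypothetical helper mirroring the patched loop: keep creating task IDs
// while the remaining offer still contains one task's worth of resources.
// `tasksLaunched` is the running counter, as in the example framework.
std::vector<std::string> packOffer(
    SimpleResources remaining,
    const SimpleResources& perTask,
    int& tasksLaunched)
{
  // A task that consumes nothing would make the loop below run forever.
  assert(perTask.cpus > 0.0 || perTask.memMb > 0.0);

  std::vector<std::string> taskIds;
  while (remaining.contains(perTask)) {
    taskIds.push_back(std::to_string(tasksLaunched++));
    remaining -= perTask;
  }
  return taskIds;
}
```

With an assumed offer of 1 CPU and 1024 MB and an assumed task size of 0.25 CPU and 32 MB, `packOffer` yields four task IDs per offer instead of one, which is why the patched framework accumulates completed-task metadata so much faster.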



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
