[ https://issues.apache.org/jira/browse/MESOS-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458823#comment-16458823 ]
Andrei Budnik commented on MESOS-6285:
--------------------------------------

Introducing a limit for the number of stored tasks per executor and/or framework in the garbage collector can solve the issue.

> Agents may OOM during recovery if there are too many tasks or executors
> -----------------------------------------------------------------------
>
>                 Key: MESOS-6285
>                 URL: https://issues.apache.org/jira/browse/MESOS-6285
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.0.1
>            Reporter: Joseph Wu
>            Priority: Major
>              Labels: mesosphere
>
> On a test cluster, we encountered a degenerate case where running the example {{long-lived-framework}} for over a week would render the agent unrecoverable.
> The {{long-lived-framework}} creates one custom {{long-lived-executor}} and launches a single task on that executor every time it receives an offer from that agent. Over a week's worth of time, the framework manages to launch some 400k tasks (short sleeps) on one executor. During runtime, this is not problematic, as each completed task is quickly rotated out of the agent's memory (and checkpointed to disk).
> During recovery, however, the agent reads every single task into memory, which leads to slow recovery and often results in the agent being OOM-killed before it finishes recovering.
> To repro this condition quickly:
> 1) Apply this patch to the {{long-lived-framework}}:
> {code}
> diff --git a/src/examples/long_lived_framework.cpp b/src/examples/long_lived_framework.cpp
> index 7c57eb5..1263d82 100644
> --- a/src/examples/long_lived_framework.cpp
> +++ b/src/examples/long_lived_framework.cpp
> @@ -358,16 +358,6 @@ private:
>    // Helper to launch a task using an offer.
>    void launch(const Offer& offer)
>    {
> -    int taskId = tasksLaunched++;
> -    ++metrics.tasks_launched;
> -
> -    TaskInfo task;
> -    task.set_name("Task " + stringify(taskId));
> -    task.mutable_task_id()->set_value(stringify(taskId));
> -    task.mutable_agent_id()->MergeFrom(offer.agent_id());
> -    task.mutable_resources()->CopyFrom(taskResources);
> -    task.mutable_executor()->CopyFrom(executor);
> -
>      Call call;
>      call.set_type(Call::ACCEPT);
>
> @@ -380,7 +370,23 @@ private:
>      Offer::Operation* operation = accept->add_operations();
>      operation->set_type(Offer::Operation::LAUNCH);
>
> -    operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +    // Launch as many tasks as possible in the given offer.
> +    Resources remaining = Resources(offer.resources()).flatten();
> +    while (remaining.contains(taskResources)) {
> +      int taskId = tasksLaunched++;
> +      ++metrics.tasks_launched;
> +
> +      TaskInfo task;
> +      task.set_name("Task " + stringify(taskId));
> +      task.mutable_task_id()->set_value(stringify(taskId));
> +      task.mutable_agent_id()->MergeFrom(offer.agent_id());
> +      task.mutable_resources()->CopyFrom(taskResources);
> +      task.mutable_executor()->CopyFrom(executor);
> +
> +      operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +
> +      remaining -= taskResources;
> +    }
>
>      mesos->send(call);
>    }
> {code}
> 2) Run a master, agent, and {{long-lived-framework}}. On a 1 CPU, 1 GB agent with this patch applied, it should take about 10 minutes to build up sufficient task launches.
> 3) Restart the agent and watch it flail during recovery.
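
For illustration only, the per-executor limit suggested in the comment could look roughly like the sketch below. It is not Mesos code: `CompletedTaskWindow` and its methods are hypothetical names, and it assumes the garbage collector is allowed to delete the checkpointed state of whichever task ID falls out of the window, so that recovery never has to read more than `limit` completed tasks per executor.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical helper (not part of Mesos): keeps at most `limit`
// completed task IDs for one executor, oldest first.
class CompletedTaskWindow
{
public:
  explicit CompletedTaskWindow(std::size_t limit) : limit_(limit)
  {
    assert(limit > 0);
  }

  // Records a completed task. Returns the ID that was evicted to make
  // room, or an empty string if the window was not yet full. A caller
  // would hand the evicted ID to whatever removes its checkpointed
  // state from disk (assumed to exist, not shown here).
  std::string add(const std::string& taskId)
  {
    std::string evicted;
    if (tasks_.size() == limit_) {
      evicted = tasks_.front();
      tasks_.pop_front();
    }
    tasks_.push_back(taskId);
    return evicted;
  }

  // Number of completed tasks currently retained.
  std::size_t size() const { return tasks_.size(); }

private:
  const std::size_t limit_;
  std::deque<std::string> tasks_;
};
```

With a limit of 2, adding tasks "0", "1" and "2" in order would evict "0" on the third call, so the amount of state read back during recovery stays bounded regardless of how long the framework has been running.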