To clarify, Mesos does not initiate the spawning of tasks; that is the
responsibility of a framework's scheduler.
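
To make the division of labor concrete, here's a rough sketch of a
framework scheduler using the Python bindings from that era; the framework
name, task command, and master address below are made-up placeholders, not
anything from your setup:

    from mesos.interface import Scheduler
    from mesos.interface import mesos_pb2
    import mesos.native

    class ExampleScheduler(Scheduler):
        def resourceOffers(self, driver, offers):
            # The scheduler, not Mesos itself, decides to turn an offer
            # into a running task.
            for offer in offers:
                task = mesos_pb2.TaskInfo()
                task.task_id.value = "example-task-1"
                task.slave_id.value = offer.slave_id.value
                task.name = "example-task"
                task.command.value = "sleep 60"
                cpus = task.resources.add()
                cpus.name = "cpus"
                cpus.type = mesos_pb2.Value.SCALAR
                cpus.scalar.value = 0.1
                mem = task.resources.add()
                mem.name = "mem"
                mem.type = mesos_pb2.Value.SCALAR
                mem.scalar.value = 32
                driver.launchTasks(offer.id, [task])

    framework = mesos_pb2.FrameworkInfo()
    framework.user = ""  # let Mesos fill in the current user
    framework.name = "example-framework"
    # "127.0.0.1:5050" is only a placeholder master address.
    mesos.native.MesosSchedulerDriver(
        ExampleScheduler(), framework, "127.0.0.1:5050").run()
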
Also, FYI, a 0.28.1 release is going out right now, although I'm not sure
whether what you're seeing is related to any of the bugs that were fixed.

In order to help you, I need to know *precisely* what you're doing and what
you're observing. Is the following correct?

- Run a 0.28.0 slave with some tasks running.
- Stop the slave.
- Remove the --work_dir contents.
- Start the slave.
- See tasks still running? How? Does the agent show them, or were they
  leaked and only revealed with "ps"? (The sketch below shows one way to
  check what the agent itself reports.)
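
For the last point, here's a rough sketch of one way to compare what the
agent reports against what "ps" shows. It assumes the agent is on
localhost at the default port 5051 and that the /state.json layout matches
what I remember; adjust as needed:

    import json
    import urllib2

    # Ask the agent which executors/tasks it believes it is running.
    state = json.load(urllib2.urlopen("http://localhost:5051/state.json"))
    for framework in state.get("frameworks", []):
        for executor in framework.get("executors", []):
            tasks = [t["name"] for t in executor.get("tasks", [])]
            print "%s: %s" % (executor["id"], tasks)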

Please also provide the slave flags and the complete slave logs around this
event.
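
If it's easier than digging through init scripts, the same /state.json
response sketched above should also carry the flags the agent was actually
started with (again an assumption on my part, same localhost:5051 caveat):

    import json
    import urllib2

    # Dump the flags the running agent reports it was started with.
    state = json.load(urllib2.urlopen("http://localhost:5051/state.json"))
    for name, value in sorted(state.get("flags", {}).items()):
        print "--%s=%s" % (name, value)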

On Fri, Apr 1, 2016 at 3:12 AM, Pradeep Chhetri <pradeep.chhetr...@gmail.com
> wrote:

> Hello Benjamin,
>
> Thank you for the reply. I too think that changing the default working
> directory to something else, maybe /var/lib/mesos, would be better. In
> many Linux distributions, /tmp is mounted as tmpfs, which is a volatile
> filesystem.
>
> Sorry that I didn't frame my original question properly.
>
> Some months back, when I was running an older Mesos version (probably
> 0.23.0), whenever I added or edited any attribute, I had to clean up the
> working directory. I didn't have cgroup isolation enabled. Sorry, I don't
> have logs from that time.
>
> Currently, I am running 0.28.0, and when I did the same activity, I can
> see the older tasks still running while Mesos spawns a fresh new set of
> tasks. So there are two copies of the tasks running now. I haven't
> enabled cgroup isolation. Is there any way I can recover the older tasks
> by modifying some defaults?
>
> By looking at the slave configuration options, I can see that there are
> two options, --strict and --recover, but their defaults look good.
>
> On Fri, Apr 1, 2016 at 2:40 AM, Benjamin Mahler <bmah...@apache.org>
> wrote:
>
>> I'd recommend not using /tmp to store the meta-information, because if
>> tmpwatch is running it will remove things that we need for agent
>> recovery. We should probably change the default --work_dir, or require
>> that the user specify one.
>>
>> It's expected that wiping the work directory will cause the newly started
>> agent to destroy any orphaned tasks, if cgroup isolation is enabled. Are
>> you using cgroup isolation? Can you include logs?
>>
>> On Fri, Mar 25, 2016 at 6:17 AM, Pradeep Chhetri <
>> pradeep.chhetr...@gmail.com> wrote:
>>
>>>
>>> Hello,
>>>
>>> I remember that when I was running an older Mesos version (maybe
>>> 0.23.0), whenever a slave restart failed, either due to adding a new
>>> attribute or announcing different resources than the default, I used to
>>> clean up /tmp/mesos (the Mesos working directory), and this used to
>>> bring down the existing executors/tasks.
>>>
>>> Yesterday, I noticed that even after cleaning up /tmp/mesos, starting
>>> the slave (which registered with a different slave ID) didn't bring
>>> down the existing executors/tasks. I am running 0.28.0.
>>>
>>> I would like to know what has improved in the slave recovery process,
>>> because I was assuming that I had deleted all the information related
>>> to checkpointing by cleaning up /tmp/mesos.
>>>
>>> --
>>> Regards,
>>> Pradeep Chhetri
>>>
>>
>>
>
>
> --
> Regards,
> Pradeep Chhetri
>
