Re: 1.4.1 release

2017-11-02 Thread Qian Zhang
We want to backport https://reviews.apache.org/r/62518/ to 1.2.x, 1.3.x and
1.4.x. James will work on it.


Regards,
Qian Zhang

On Fri, Nov 3, 2017 at 12:11 AM, Kapil Arya  wrote:

> Please reply to this email if you have pending patches to be backported to
> 1.4.x as we are aiming to cut a release candidate for 1.4.1 early next week.
>
> Thanks,
> Anand and Kapil
>


Re: 1.3.2 Release

2017-11-02 Thread Benjamin Mahler
Great!

I cherry picked Gaston's fix for
https://issues.apache.org/jira/browse/MESOS-8135.

On Wed, Nov 1, 2017 at 6:57 PM, Michael Park  wrote:

> Please reply to this email if you have pending patches to be backported to
> 1.3.x. I'm aiming to cut 1.3.2 on Friday.
>
> Thanks,
>
> MPark
>


Re: orphan executor

2017-11-02 Thread Benjamin Mahler
I filed one: https://issues.apache.org/jira/browse/MESOS-8167

It's a pretty significant effort, and hasn't been requested a lot, so it's
unlikely to be worked on for some time.

On Tue, Oct 31, 2017 at 8:18 PM, Mohit Jaggi  wrote:

> :-)
> Is there a Jira ticket to track this? Any idea when this will be worked on?
>
> On Tue, Oct 31, 2017 at 5:22 PM, Benjamin Mahler 
> wrote:
>
>> The question was posed merely to point out that there is no notion of the
>> executor "running away" currently, due to the answer I provided: there
>> isn't a complete lifecycle API for the executor. (This includes
>> healthiness, state updates, reconciliation, ability for scheduler to shut
>> it down, etc).
>>
>> On Tue, Oct 31, 2017 at 4:27 PM, Mohit Jaggi 
>> wrote:
>>
>>> Good question.
>>> - I don't know what the interaction between mesos agent and executor is.
>>> Is there a health check?
>>> - There is a reconciliation between Mesos and Frameworks: will Mesos
>>> include the "orphan" executor in the list there, so framework can find
>>> runaways and kill them(using Mesos provided API)?
>>>
>>> On Tue, Oct 31, 2017 at 3:49 PM, Benjamin Mahler 
>>> wrote:
>>>
>>>> What defines a runaway executor?
>>>>
>>>> Mesos does not know that this particular executor should self-terminate
>>>> within some reasonable time after its task terminates. In this case the
>>>> framework (Aurora) knows this expected behavior of Thermos and can clean up
>>>> ones that get stuck after the task terminates. However, we currently don't
>>>> provide a great executor lifecycle API to enable schedulers to do this
>>>> (it's long overdue).
>>>>
>>>> On Tue, Oct 31, 2017 at 2:47 PM, Mohit Jaggi
>>>> wrote:
>>>>
>>>>> I was asking if this can happen automatically.
>>>>>
>>>>> On Tue, Oct 31, 2017 at 2:41 PM, Benjamin Mahler
>>>>> wrote:
>>>>>
>>>>>> You can kill it manually by SIGKILLing the executor process.
>>>>>> Using the agent API, you can launch a nested container session and
>>>>>> kill the executor. +jie,gilbert, is there a CLI command for 'exec'ing
>>>>>> into the container?
>>>>>>
>>>>>> On Tue, Oct 31, 2017 at 12:47 PM, Mohit Jaggi
>>>>>> wrote:
>>>>>>
>>>>>>> Yes. There is a fix available now in Aurora/Thermos to try and exit
>>>>>>> in such scenarios. But I am curious to know if Mesos agent has the
>>>>>>> functionality to reap runaway executors.
>>>>>>>
>>>>>>> On Tue, Oct 31, 2017 at 12:08 PM, Benjamin Mahler <
>>>>>>> bmah...@apache.org> wrote:
>>>>>>>
>>>>>>>> Is my understanding correct that the Thermos transitions the task
>>>>>>>> to TASK_FAILED, but Thermos gets stuck and can't terminate itself? The
>>>>>>>> typical workflow for thermos, as a 1:1 task:executor approach, is that
>>>>>>>> the executor terminates itself after the task is terminal.
>>>>>>>>
>>>>>>>> The full logs of the agent during this window would help, it looks
>>>>>>>> like an agent termination is involved here as well?
>>>>>>>>
>>>>>>>> On Fri, Oct 27, 2017 at 3:09 PM, Mohit Jaggi
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Here are some relevant logs. Aurora scheduler logs show the task
>>>>>>>>> going from:
>>>>>>>>> INIT
>>>>>>>>> ->PENDING
>>>>>>>>> ->ASSIGNED
>>>>>>>>> ->STARTING
>>>>>>>>> ->RUNNING for a long time
>>>>>>>>> ->FAILED due to health check error, OSError: Resource temporarily
>>>>>>>>> unavailable (I think this is referring to running out of PID space,
>>>>>>>>> see thermos logs below)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --- mesos agent ---
>>>>>>>>>
>>>>>>>>> I1005 22:56:47.902153 127818 fetcher.cpp:285] Fetching directly into the sandbox directory
>>>>>>>>> I1005 22:56:47.902170 127818 fetcher.cpp:222] Fetching URI '/usr/bin/X'
>>>>>>>>> I1005 22:56:47.913270 127818 fetcher.cpp:207] Copied resource '/usr/bin/x' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
>>>>>>>>> I1005 22:56:47.913331 127818 fetcher.cpp:582] Fetched '/usr/bin/xxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
>>>>>>>>> WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
>>>>>>>>> twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
>>>>>>>>> Writing log files to disk in /mnt/mesos/sandbox
>>>>>>>>> I1005 22:58:15.677225 7 exec.cpp:162] Version: 1.1.0
>>>>>>>>> I1005 22:58:15.680867 14 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
>>>>>>>>> Writing log files to disk in /mnt/mesos/sandbox
>>>>>>>>> I1006 01:13:52.950552 39 exec.cpp:487] Agent exited, but framework has

Re: clearing the executor authentication token from the task environment

2017-11-02 Thread James Peach

> On Nov 1, 2017, at 2:28 PM, James Peach  wrote:
> 
> Hi all,
> 
> In https://issues.apache.org/jira/browse/MESOS-8140, I'm proposing that we
> clear the MESOS_EXECUTOR_AUTHENTICATION_TOKEN environment variable
> immediately after consuming it in the built-in executors. This protects it
> from observation by other tasks in the same PID namespace. However, I wanted
> to verify that no one currently has a use case that depends on this.
> Currently, the token is inherited into the environment of tasks running under
> the command executor (i.e. not task group tasks).
> 
> Eventually we would add a formal API for tasks to access the executor token 
> in MESOS-8018.

OK, we will be landing this change for Mesos 1.5.

thanks,
James
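
For anyone who wants to see the "consume, then clear" idea concretely, here
is a minimal sketch in Python (not the C++ of the built-in executors, and not
the actual MESOS-8140 change). Only the variable name comes from the proposal
above; consume_token and child_sees_token are hypothetical helpers made up
for illustration. It shows only that a child process started after the token
has been consumed no longer inherits it.

```python
import os
import subprocess
import sys

# Variable name taken from the proposal; everything else is illustrative.
TOKEN_VAR = "MESOS_EXECUTOR_AUTHENTICATION_TOKEN"


def consume_token(var=TOKEN_VAR):
    """Return the token (or None) and remove it from this process's environment."""
    # Removing the key from os.environ also unsets it for the process,
    # so children spawned afterwards do not inherit it.
    return os.environ.pop(var, None)


def child_sees_token(var=TOKEN_VAR):
    """Spawn a child Python process and report whether it inherited the variable."""
    probe = "import os, sys; sys.exit(0 if {!r} in os.environ else 1)".format(var)
    return subprocess.run([sys.executable, "-c", probe]).returncode == 0


if __name__ == "__main__":
    os.environ[TOKEN_VAR] = "example-token"  # stand-in for the value the agent sets
    print("before consuming:", child_sees_token())
    token = consume_token()
    print("after consuming:", child_sees_token())
```

The token is read once into a local value and dropped from the environment in
the same step, so there is no window in which the executor has used the token
but a later fork/exec would still pass it along.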

1.4.1 release

2017-11-02 Thread Kapil Arya
Please reply to this email if you have pending patches to be backported to
1.4.x as we are aiming to cut a release candidate for 1.4.1 early next week.

Thanks,
Anand and Kapil