[jira] [Commented] (MESOS-1570) Make check Error when Building Mesos in a Docker container

2014-10-20 Thread Isabel Jimenez (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177978#comment-14177978
 ] 

Isabel Jimenez commented on MESOS-1570:
---

I found that this error does not in fact occur only when building Mesos in a 
Docker container; the underlying problem is building Mesos as root. The test 
'OsSetnsTest' is skipped when you are not root, so I suppose that's why the 
error wasn't detected before. When running as root on a machine with a recent 
kernel version (>= 3.8, which has user namespace support), the error is the 
same.
In my opinion, the possible solutions are:
- When building inside Docker, do the same as on the host: build as non-root
- Change the test to not exercise the user namespace, since it is different 
from the others (similar to the pid namespace test)

For now Mesos does not use the user namespace, but the tests should be updated 
if that changes.

> Make check Error when Building Mesos in a Docker container 
> ---
>
> Key: MESOS-1570
> URL: https://issues.apache.org/jira/browse/MESOS-1570
> Project: Mesos
>  Issue Type: Bug
>Reporter: Isabel Jimenez
>Assignee: Isabel Jimenez
>Priority: Minor
>  Labels: Docker
>
> When building Mesos inside a Docker container, it is currently impossible to 
> run the tests, even when running Docker in --privileged mode. There is a test 
> in stout that sets all the namespaces, and libcontainer does not support 
> setting the 'user' namespace (more information 
> [here|https://github.com/docker/libcontainer/blob/master/namespaces/nsenter.go#L136]).
>  This is the error:
> {code:title=Make check failed test|borderStyle=solid}
> [--] 1 test from OsSetnsTest
> [ RUN  ] OsSetnsTest.setns
> ../../../../3rdparty/libprocess/3rdparty/stout/tests/os/setns_tests.cpp:43: 
> Failure
> os::setns(::getpid(), ns): Invalid argument
> [  FAILED  ] OsSetnsTest.setns (7 ms)
> [--] 1 test from OsSetnsTest (7 ms total)
> [  FAILED  ] 1 test, listed below:
> [  FAILED  ] OsSetnsTest.setns
>  1 FAILED TEST
> {code}
> This can be disabled, as Mesos does not need to set the 'user' namespace. I 
> don't know if Docker will ever support setting the user namespace, since it's 
> a new kernel feature. What would be the best approach to this issue? 
> (Disabling setting of the 'user' namespace in stout, disabling just this 
> test, ...)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1951) Add --isolation flag to mesos-tests

2014-10-20 Thread Kapil Arya (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kapil Arya updated MESOS-1951:
--
Shepherd: Niklas Quarfot Nielsen

> Add --isolation flag to mesos-tests
> ---
>
> Key: MESOS-1951
> URL: https://issues.apache.org/jira/browse/MESOS-1951
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Niklas Quarfot Nielsen
>Assignee: Kapil Arya
>
> When hooking up specific modules for tests, we realized that it would be 
> generally useful to be able to set the default flags for masters and slaves 
> in the tests. For example, let mesos-tests.sh take --isolation, --drf_sorter, 
> --authentication and so on, and have CreateSlaveFlags/CreateMasterFlags 
> (alongside the necessary ::create() calls) use those flags so we can run the 
> entire unit test suite exercising different implementations.





[jira] [Assigned] (MESOS-1951) Add --isolation flag to mesos-tests

2014-10-20 Thread Kapil Arya (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kapil Arya reassigned MESOS-1951:
-

Assignee: Kapil Arya

> Add --isolation flag to mesos-tests
> ---
>
> Key: MESOS-1951
> URL: https://issues.apache.org/jira/browse/MESOS-1951
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Niklas Quarfot Nielsen
>Assignee: Kapil Arya
>
> When hooking up specific modules for tests, we realized that it would be 
> generally useful to be able to set the default flags for masters and slaves 
> in the tests. For example, let mesos-tests.sh take --isolation, --drf_sorter, 
> --authentication and so on, and have CreateSlaveFlags/CreateMasterFlags 
> (alongside the necessary ::create() calls) use those flags so we can run the 
> entire unit test suite exercising different implementations.





[jira] [Closed] (MESOS-1942) Bad test: ModuleTest.ExampleModuleParseStringTest

2014-10-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen closed MESOS-1942.
-
Resolution: Fixed

commit 0fcfa50187499e8af958bfdb6f83423c6e7bd75a
Author: Kapil Arya 
Date:   Mon Oct 20 18:09:31 2014 -0700

Added setup/teardown for module API tests.

During the one-time setup of the test cases, we do the following:
1. set LD_LIBRARY_PATH to also point to the src/.libs directory.
   The original LD_LIBRARY_PATH is restored at the end of all tests.
2. dlopen() examplemodule library and retrieve the pointer to
   ModuleBase for the test module.  This pointer is later used to
   reset the Mesos and module API versions during per-test teardown.

During the tear-down after each test, we unload the module to allow
later loads to succeed.

Review: https://reviews.apache.org/r/26855
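The LD_LIBRARY_PATH save/restore described in step 1 could be sketched with a small RAII guard; the class name and layout here are illustrative, not the actual test code:

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Prepend a directory (e.g. src/.libs) to LD_LIBRARY_PATH for the
// duration of a scope, restoring the original value on destruction,
// mirroring the one-time setup/teardown the commit describes.
class ScopedLdLibraryPath {
public:
  explicit ScopedLdLibraryPath(const std::string& directory) {
    const char* current = ::getenv("LD_LIBRARY_PATH");
    wasSet = (current != nullptr);
    if (wasSet) {
      saved = current;
    }
    std::string updated = directory;
    if (wasSet) {
      updated += ":" + saved;
    }
    ::setenv("LD_LIBRARY_PATH", updated.c_str(), 1);
  }

  ~ScopedLdLibraryPath() {
    // Restore the original value at the end of all tests.
    if (wasSet) {
      ::setenv("LD_LIBRARY_PATH", saved.c_str(), 1);
    } else {
      ::unsetenv("LD_LIBRARY_PATH");
    }
  }

private:
  std::string saved;
  bool wasSet;
};
```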

> Bad test: ModuleTest.ExampleModuleParseStringTest
> -
>
> Key: MESOS-1942
> URL: https://issues.apache.org/jira/browse/MESOS-1942
> Project: Mesos
>  Issue Type: Bug
>  Components: modules, test
>Affects Versions: 0.21.0
>Reporter: Ian Downes
>Assignee: Kapil Arya
>
> [ RUN  ] ModuleTest.UnknownModuleInstantiationTest
> Using temporary directory 
> '/tmp/ModuleTest_UnknownModuleInstantiationTest_Cv4jVf'
> tests/module_tests.cpp:230: Failure
> ModuleManager::load(modules): Error verifying module 
> 'org_apache_mesos_TestModule': Module API version mismatch. Mesos has: 1, 
> library requires: ThisIsNotAnAPIVersion!
> tests/module_tests.cpp:234: Failure
> ModuleManager::unload(moduleName): Error unloading module 
> 'org_apache_mesos_TestModule': module not loaded
> [  FAILED  ] ModuleTest.UnknownModuleInstantiationTest (2 ms)
> [ RUN  ] ModuleTest.AuthorInfoTest
> Using temporary directory '/tmp/ModuleTest_AuthorInfoTest_E98lf3'
> [   OK ] ModuleTest.AuthorInfoTest (2 ms)
> [ RUN  ] ModuleTest.UnknownLibraryTest
> Using temporary directory '/tmp/ModuleTest_UnknownLibraryTest_A0nMzQ'
> [   OK ] ModuleTest.UnknownLibraryTest (4 ms)
> [ RUN  ] ModuleTest.ExampleModuleParseStringTest
> Using temporary directory 
> '/tmp/ModuleTest_ExampleModuleParseStringTest_MVESUD'
> tests/module_tests.cpp:75: Failure
> ModuleManager::load(modules): Error verifying module 
> 'org_apache_mesos_TestModule': Module API version mismatch. Mesos has: 1, 
> library requires: ThisIsNotAnAPIVersion!
> tests/module_tests.cpp:78: Failure
> module: Module 'org_apache_mesos_TestModule' unknown
> ABORT: (../3rdparty/libprocess/3rdparty/stout/include/stout/try.hpp:92): 
> Try::get() but state == ERROR: Module 'org_apache_mesos_TestModule' 
> unknown*** Aborted at 1413500769 (unix time) try "date -d @1413500769" if you 
> are using GNU date ***
> PC: @   0x32844359e9 (unknown)
> *** SIGABRT (@0x3e81634) received by PID 5684 (TID 0x7f69accd9840) from 
> PID 5684; stack trace: ***
> @   0x3284c0ef90 (unknown)
> @   0x32844359e9 (unknown)
> @   0x32844370f8 (unknown)
> @   0x4fda1d _Abort()
> @   0x4fd9a9 _Abort()
> @   0x9bc1a5 Try<>::get()
> @   0x9b036a 
> ModuleTest_ExampleModuleParseStringTest_Test::TestBody()
> @   0xcb0a15 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @   0xc996ee testing::Test::Run()
> @   0xc9a494 testing::TestInfo::Run()
> @   0xc9a9d7 testing::TestCase::Run()
> @   0xc9fd56 testing::internal::UnitTestImpl::RunAllTests()
> @   0xcb18c5 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @   0xc9fa59 testing::UnitTest::Run()
> @   0x89ccac main
> @   0x3284421b45 (unknown)
> @   0x4a4c4d (unknown)
> make[3]: *** [check-local] Aborted (core dumped)





[jira] [Created] (MESOS-1952) Abstract network logic into socket class: connect()

2014-10-20 Thread Niklas Quarfot Nielsen (JIRA)
Niklas Quarfot Nielsen created MESOS-1952:
-

 Summary: Abstract network logic into socket class: connect()
 Key: MESOS-1952
 URL: https://issues.apache.org/jira/browse/MESOS-1952
 Project: Mesos
  Issue Type: Task
Reporter: Niklas Quarfot Nielsen








[jira] [Created] (MESOS-1954) Abstract network logic into socket class: read()/write()

2014-10-20 Thread Niklas Quarfot Nielsen (JIRA)
Niklas Quarfot Nielsen created MESOS-1954:
-

 Summary: Abstract network logic into socket class: read()/write()
 Key: MESOS-1954
 URL: https://issues.apache.org/jira/browse/MESOS-1954
 Project: Mesos
  Issue Type: Task
Reporter: Niklas Quarfot Nielsen








[jira] [Created] (MESOS-1953) Abstract network logic into socket class: connection events (connected(), closed(), writable(), readable())

2014-10-20 Thread Niklas Quarfot Nielsen (JIRA)
Niklas Quarfot Nielsen created MESOS-1953:
-

 Summary: Abstract network logic into socket class: connection 
events (connected(), closed(), writable(), readable())
 Key: MESOS-1953
 URL: https://issues.apache.org/jira/browse/MESOS-1953
 Project: Mesos
  Issue Type: Task
Reporter: Niklas Quarfot Nielsen








[jira] [Commented] (MESOS-1330) Introduce stream abstraction to libprocess

2014-10-20 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177608#comment-14177608
 ] 

Niklas Quarfot Nielsen commented on MESOS-1330:
---

The general idea is to propose a stream API (and a revised clock API) instead 
of a full event manager abstraction.
We did a proof-of-concept event manager API which abstracted both the libev 
and libevent implementations in libprocess, and noticed that the resulting API 
captured three things:
- Asynchronous clock ticks
- Network/connection establishment and async I/O
- Async file I/O
(Marked in red in the figure below.)

!http://cl.ly/image/1y3o0G1T3S3p/libprocess.png!

Instead of having a 20+ method abstraction, introducing them as individual 
concepts seemed more elegant and robust.

The networking abstraction fits well as an extension of the capabilities of 
the already existing Socket abstraction.
Here is a suggestion of how that could look:

{code}
class Socket {
  // // The connect/send sequence is implemented by:
  // Socket s = SocketManager::connect(Node("example.com", 5050));
  // 
  // // The write operation will be enqueued on the connect future and writable.
  // s.write(msg);
  //
  // // Or generalized:
  // s.connected().then([=]{
  //   // s.read()
  //   // s.write()
  //   // ...  
  // });
  Future<Nothing> connected();
  
  Future<Socket> accepted();
  
  // Backed by io::poll(), but doesn't rely on being implemented that way.
  // Buffered I/O (as with SSL) won't behave as you expect with poll().
  Future<Nothing> readable();
  
  // Same here.
  Future<Nothing> writable();
  
  // Along with persist(), this supports remote 'exited' notifications.
  Future<Nothing> closed();

  // The stream will keep itself alive (by increasing the ref-count).
  Future<Nothing> persist();

  // Reads will automatically hang off the readable() future.
  Future<std::string> read();
  // ... all the read variants.
  
  Future<size_t> write(std::string);
  // ... all the write variants.
};
{code}

We wanted to introduce the notion of streams, as the networking and file I/O 
code almost ended up being copies of one another. Besides the connection 
life-cycle, we should be able to share those implementations.
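As a rough illustration of that sharing (hypothetical names; this is not the libprocess API), the common surface of a network stream and a file stream could be as small as:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Minimal shared read/write interface that both a network-backed and a
// file-backed stream could implement; the connection life-cycle
// (connected(), closed(), ...) would stay specific to the socket side.
class Stream {
public:
  virtual ~Stream() {}
  virtual std::string read() = 0;
  virtual size_t write(const std::string& data) = 0;
};

// In-memory stand-in backend, just to make the sketch concrete.
class BufferStream : public Stream {
public:
  std::string read() override {
    std::string out;
    out.swap(buffer);  // drain the buffered bytes
    return out;
  }

  size_t write(const std::string& data) override {
    buffer += data;
    return data.size();
  }

private:
  std::string buffer;
};
```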


> Introduce stream abstraction to libprocess
> --
>
> Key: MESOS-1330
> URL: https://issues.apache.org/jira/browse/MESOS-1330
> Project: Mesos
>  Issue Type: Task
>  Components: general, libprocess
>Reporter: Niklas Quarfot Nielsen
>Assignee: Joris Van Remoortere
>  Labels: libprocess, network
>
> I think it makes sense to think in terms of different low- or middle-layer 
> transports (which can accommodate channels like SSL). We could capture 
> connection life-cycles and network send/receive primitives in a much more 
> explicit manner than libprocess currently does.
> I have a proof-of-concept transport/connection abstraction ready which we 
> can use to iterate on a design.
> Notably, there are opportunities to change the current SocketManager/Socket 
> abstractions to explicit ConnectionManager/Connection, which allows several 
> composable communication layers.
> I am proposing to own this ticket and am looking for a shepherd to 
> (thoroughly) go over design considerations before jumping into an actual 
> implementation.





[jira] [Commented] (MESOS-1905) Enable module metadata to be accessed by the user

2014-10-20 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177573#comment-14177573
 ] 

Niklas Quarfot Nielsen commented on MESOS-1905:
---

If no one is working on it (or needs it), let's move it back to open or 
consider closing it.

> Enable module metadata to be accessed by the user
> -
>
> Key: MESOS-1905
> URL: https://issues.apache.org/jira/browse/MESOS-1905
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Till Toenshoff
>Priority: Minor
>
> h4. Motivation
> I would love to be able to get custom metadata from a module without needing 
> to create a kind instance.
> h4. Use Case
> When slave authentication is activated on the master, the user has to supply 
> credentials (as our current implementation demands them). Given that 
> alternative authentication schemes will not rely on such credentials, we need 
> a way to make sure that only Authenticators that need this information demand 
> it from the user. I would like to avoid instantiating a kind for that module, 
> as I will not make further use of that instance at that (early) point 
> - let me call this the "capabilities instance".
> Options are: (a) delete it right away, or (b) hold on to it.
> a: definitely possible, but does not seem elegant to me right now.
> b: holding on to that instance and reusing it later is not really a good 
> fit, as the master will instantiate new Authenticators per connected slave. 
> So for the first slave I would have to use that capabilities instance, and 
> for all further slave connections I would have to create new Authenticators 
> (possible, but ugly as hell).
> So by extending the Module structure specialization with a {{bool 
> needsCredentials()}}, I could solve this rather neatly:
> {noformat}
> template <>
> struct Module<Authenticator> : ModuleBase
> {
>   Module(
>   const char* _moduleApiVersion,
>   const char* _mesosVersion,
>   const char* _authorName,
>   const char* _authorEmail,
>   const char* _description,
>   bool (*_compatible)(),
>   bool (*_needsCredentials)(),
>   Authenticator* (*_create)())
> : ModuleBase(
> _moduleApiVersion,
> _mesosVersion,
> "Authenticator",
> _authorName,
> _authorEmail,
> _description,
> _compatible),
>   needsCredentials(_needsCredentials),
>   create(_create)
>   { }
>   bool (*needsCredentials)();
>   Authenticator* (*create)();
> };
> {noformat}
> Within the implementation I would simply use that function just like we are 
> using {{compatible()}} already.
> h4. Status Quo
> ModuleManager does not support returning the {{ModuleBase*}} to the user. The 
> only way for such information to be returned by a module is to instantiate it 
> and ask its implementation for it - that is, the module interface needs to 
> include a method returning such info. 
> h4. Idea
> To get information on a module without instantiating its kind, a method 
> within the ModuleManager that looks something like this would help:
> {noformat}
> static Try<ModuleBase*> peek(const std::string& moduleName) 
> {noformat}
> h4. Discussion
> Am I possibly attempting something too hacky here - are there better 
> alternatives I missed?
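The proposed peek() could be sketched (without stout's Try, and with a hypothetical registry layout) as a lookup that returns the already-registered ModuleBase pointer, or nullptr on error:

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for mesos::modules::ModuleBase; only the metadata matters here.
struct ModuleBase {
  std::string kind;
};

// Hypothetical module registry, mirroring what ModuleManager tracks
// after loading libraries.
std::map<std::string, ModuleBase*>& registry() {
  static std::map<std::string, ModuleBase*> modules;
  return modules;
}

// Sketch of the proposed peek(): hand back the module's ModuleBase
// without instantiating the module kind (nullptr stands in for Try's
// error state).
ModuleBase* peek(const std::string& moduleName) {
  auto it = registry().find(moduleName);
  return (it == registry().end()) ? nullptr : it->second;
}
```

A caller could then check, say, a needsCredentials flag on the returned ModuleBase before ever creating an Authenticator instance.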





[jira] [Updated] (MESOS-1849) Cannot execute container in privileged mode

2014-10-20 Thread Timothy Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen updated MESOS-1849:

Target Version/s: 0.21.0

> Cannot execute container in privileged mode 
> 
>
> Key: MESOS-1849
> URL: https://issues.apache.org/jira/browse/MESOS-1849
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.20.1
> Environment: Mesos 0.20.1 Marathon 0.7.1
>Reporter: Adam Spektor
>Assignee: Timothy Chen
>Priority: Blocker
>  Labels: docker
>
> I cannot find a way to run a container in privileged mode; this blocks me 
> from continuing with the Mesos/Marathon POC.





[jira] [Updated] (MESOS-1931) Add support for isolator modules

2014-10-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-1931:
--
Description: Add support for pluggable isolators for the mesos 
containerizer. This allows us to experiment with and wire up specialized 
isolation and monitoring capabilities.

> Add support for isolator modules
> 
>
> Key: MESOS-1931
> URL: https://issues.apache.org/jira/browse/MESOS-1931
> Project: Mesos
>  Issue Type: Task
>Reporter: Niklas Quarfot Nielsen
>Assignee: Kapil Arya
>
> Add support for pluggable isolators for the mesos containerizer. This allows 
> us to experiment with and wire up specialized isolation and monitoring 
> capabilities.





[jira] [Created] (MESOS-1951) Add --isolation flag to mesos-tests

2014-10-20 Thread Niklas Quarfot Nielsen (JIRA)
Niklas Quarfot Nielsen created MESOS-1951:
-

 Summary: Add --isolation flag to mesos-tests
 Key: MESOS-1951
 URL: https://issues.apache.org/jira/browse/MESOS-1951
 Project: Mesos
  Issue Type: Documentation
Reporter: Niklas Quarfot Nielsen


When hooking up specific modules for tests, we realized that it would be 
generally useful to be able to set the default flags for masters and slaves in 
the tests. For example, let mesos-tests.sh take --isolation, --drf_sorter, 
--authentication and so on, and have CreateSlaveFlags/CreateMasterFlags 
(alongside the necessary ::create() calls) use those flags so we can run the 
entire unit test suite exercising different implementations.
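The proposal could be sketched as a set of suite-wide flag overrides that CreateSlaveFlags() folds into its defaults; all names and default values below are illustrative assumptions, not the mesos-tests implementation:

```cpp
#include <cassert>
#include <map>
#include <string>

// Suite-wide overrides populated from mesos-tests.sh arguments such as
// --isolation=... (flag parsing elided in this sketch).
std::map<std::string, std::string>& flagOverrides() {
  static std::map<std::string, std::string> overrides;
  return overrides;
}

// Every test asks for its slave flags here, so an override set once on
// the command line is exercised by the entire suite.
std::map<std::string, std::string> createSlaveFlags() {
  std::map<std::string, std::string> flags = {
      {"isolation", "posix/cpu,posix/mem"},  // illustrative default
  };
  for (const auto& kv : flagOverrides()) {
    flags[kv.first] = kv.second;
  }
  return flags;
}
```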





[jira] [Assigned] (MESOS-1570) Make check Error when Building Mesos in a Docker container

2014-10-20 Thread Isabel Jimenez (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Jimenez reassigned MESOS-1570:
-

Assignee: Isabel Jimenez

> Make check Error when Building Mesos in a Docker container 
> ---
>
> Key: MESOS-1570
> URL: https://issues.apache.org/jira/browse/MESOS-1570
> Project: Mesos
>  Issue Type: Bug
>Reporter: Isabel Jimenez
>Assignee: Isabel Jimenez
>Priority: Minor
>  Labels: Docker
>
> When building Mesos inside a Docker container, it is currently impossible to 
> run the tests, even when running Docker in --privileged mode. There is a test 
> in stout that sets all the namespaces, and libcontainer does not support 
> setting the 'user' namespace (more information 
> [here|https://github.com/docker/libcontainer/blob/master/namespaces/nsenter.go#L136]).
>  This is the error:
> {code:title=Make check failed test|borderStyle=solid}
> [--] 1 test from OsSetnsTest
> [ RUN  ] OsSetnsTest.setns
> ../../../../3rdparty/libprocess/3rdparty/stout/tests/os/setns_tests.cpp:43: 
> Failure
> os::setns(::getpid(), ns): Invalid argument
> [  FAILED  ] OsSetnsTest.setns (7 ms)
> [--] 1 test from OsSetnsTest (7 ms total)
> [  FAILED  ] 1 test, listed below:
> [  FAILED  ] OsSetnsTest.setns
>  1 FAILED TEST
> {code}
> This can be disabled, as Mesos does not need to set the 'user' namespace. I 
> don't know if Docker will ever support setting the user namespace, since it's 
> a new kernel feature. What would be the best approach to this issue? 
> (Disabling setting of the 'user' namespace in stout, disabling just this 
> test, ...)





[jira] [Commented] (MESOS-1570) Make check Error when Building Mesos in a Docker container

2014-10-20 Thread Isabel Jimenez (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177523#comment-14177523
 ] 

Isabel Jimenez commented on MESOS-1570:
---

[~tnachen]  Sure! I'm assigning myself.

> Make check Error when Building Mesos in a Docker container 
> ---
>
> Key: MESOS-1570
> URL: https://issues.apache.org/jira/browse/MESOS-1570
> Project: Mesos
>  Issue Type: Bug
>Reporter: Isabel Jimenez
>Priority: Minor
>  Labels: Docker
>
> When building Mesos inside a Docker container, it is currently impossible to 
> run the tests, even when running Docker in --privileged mode. There is a test 
> in stout that sets all the namespaces, and libcontainer does not support 
> setting the 'user' namespace (more information 
> [here|https://github.com/docker/libcontainer/blob/master/namespaces/nsenter.go#L136]).
>  This is the error:
> {code:title=Make check failed test|borderStyle=solid}
> [--] 1 test from OsSetnsTest
> [ RUN  ] OsSetnsTest.setns
> ../../../../3rdparty/libprocess/3rdparty/stout/tests/os/setns_tests.cpp:43: 
> Failure
> os::setns(::getpid(), ns): Invalid argument
> [  FAILED  ] OsSetnsTest.setns (7 ms)
> [--] 1 test from OsSetnsTest (7 ms total)
> [  FAILED  ] 1 test, listed below:
> [  FAILED  ] OsSetnsTest.setns
>  1 FAILED TEST
> {code}
> This can be disabled, as Mesos does not need to set the 'user' namespace. I 
> don't know if Docker will ever support setting the user namespace, since it's 
> a new kernel feature. What would be the best approach to this issue? 
> (Disabling setting of the 'user' namespace in stout, disabling just this 
> test, ...)





[jira] [Created] (MESOS-1950) Add module writers guide

2014-10-20 Thread Niklas Quarfot Nielsen (JIRA)
Niklas Quarfot Nielsen created MESOS-1950:
-

 Summary: Add module writers guide
 Key: MESOS-1950
 URL: https://issues.apache.org/jira/browse/MESOS-1950
 Project: Mesos
  Issue Type: Documentation
  Components: modules
Reporter: Niklas Quarfot Nielsen
Priority: Critical


Similar to Apache Webserver's "Developing Modules" guide 
(http://httpd.apache.org/docs/2.4/developer/modguide.html), we should write up 
a comprehensive guide to writing robust modules.

I started a draft here: 
https://cwiki.apache.org/confluence/display/MESOS/Mesos+Modules+Developer+Guide

It should be completed and/or copied (or moved) to docs/modules.md. There may 
be value in having both.





[jira] [Reopened] (MESOS-1836) Denote module API as experimental

2014-10-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen reopened MESOS-1836:
---

> Denote module API as experimental
> -
>
> Key: MESOS-1836
> URL: https://issues.apache.org/jira/browse/MESOS-1836
> Project: Mesos
>  Issue Type: Task
>Reporter: Kapil Arya
>Priority: Minor
>
> There should be some way to inform the module writer that the module API 
> could change.  It's not limited to the macros/functions exposed via 
> module.hpp, but can also include the module base classes such as Isolator, 
> Allocator, and Authenticator.  For example, the foo and bar methods in 
> TestModule should be tagged "experimental" to show that they can change at a 
> later point in time.





[jira] [Resolved] (MESOS-1836) Denote module API as experimental

2014-10-20 Thread John (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John resolved MESOS-1836.
-
Resolution: Invalid

The API has evolved and no longer requires a prototype or preliminary status.

> Denote module API as experimental
> -
>
> Key: MESOS-1836
> URL: https://issues.apache.org/jira/browse/MESOS-1836
> Project: Mesos
>  Issue Type: Task
>Reporter: Kapil Arya
>Priority: Minor
>
> There should be some way to inform the module writer that the module API 
> could change.  It's not limited to the macros/functions exposed via 
> module.hpp, but can also include the module base classes such as Isolator, 
> Allocator, and Authenticator.  For example, the foo and bar methods in 
> TestModule should be tagged "experimental" to show that they can change at a 
> later point in time.





[jira] [Commented] (MESOS-1836) Denote module API as experimental

2014-10-20 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177472#comment-14177472
 ] 

Niklas Quarfot Nielsen commented on MESOS-1836:
---

The module system has versioning (which we also need to keep track of, and for 
which we should provide an upgrade guide similar to how we do between 
releases). The first couple of versions will be experimental, but I think this 
is captured well enough if we keep a module version table, for example:

||Module API Version||Changes||Notes||
|1|Initial implementation of Module sub system|**NOTE:** Experimental feature |
|...|...|...|

> Denote module API as experimental
> -
>
> Key: MESOS-1836
> URL: https://issues.apache.org/jira/browse/MESOS-1836
> Project: Mesos
>  Issue Type: Task
>Reporter: Kapil Arya
>Priority: Minor
>
> There should be some way to inform the module writer that the module API 
> could change.  It's not limited to the macros/functions exposed via 
> module.hpp, but can also include the module base classes such as Isolator, 
> Allocator, and Authenticator.  For example, the foo and bar methods in 
> TestModule should be tagged "experimental" to show that they can change at a 
> later point in time.





[jira] [Comment Edited] (MESOS-1836) Denote module API as experimental

2014-10-20 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177472#comment-14177472
 ] 

Niklas Quarfot Nielsen edited comment on MESOS-1836 at 10/20/14 9:07 PM:
-

The module system has versioning (which we also need to keep track of, and for 
which we should provide an upgrade guide similar to how we do between 
releases). The first couple of versions will be experimental, but I think this 
is captured well enough if we keep a module version table, for example:

||Module API Version||Changes||Notes||
|1|Initial implementation of Module sub system|**NOTE:** Experimental feature |
|...|...|...|

Thoughts?


was (Author: nnielsen):
The module system has versioning (which we also need to keep track of and 
advise an upgrading guide similar to how we do between releases). The first 
couple of versions will be experimental, but think this is captured well enough 
if we keep a module version table, for example:

||Module API Version||Changes||Notes||
|1|Initial implementation of Module sub system|**NOTE:** Experimental feature |
|...|...|...|

> Denote module API as experimental
> -
>
> Key: MESOS-1836
> URL: https://issues.apache.org/jira/browse/MESOS-1836
> Project: Mesos
>  Issue Type: Task
>Reporter: Kapil Arya
>Priority: Minor
>
> There should be some way to inform the module writer that the module API 
> could change.  It's not limited to the macros/functions exposed via 
> module.hpp, but can also include the module base classes such as Isolator, 
> Allocator, and Authenticator.  For example, the foo and bar methods in 
> TestModule should be tagged "experimental" to show that they can change at a 
> later point in time.





[jira] [Updated] (MESOS-1942) Bad test: ModuleTest.ExampleModuleParseStringTest

2014-10-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-1942:
--
Assignee: Kapil Arya  (was: Niklas Quarfot Nielsen)

> Bad test: ModuleTest.ExampleModuleParseStringTest
> -
>
> Key: MESOS-1942
> URL: https://issues.apache.org/jira/browse/MESOS-1942
> Project: Mesos
>  Issue Type: Bug
>  Components: modules, test
>Affects Versions: 0.21.0
>Reporter: Ian Downes
>Assignee: Kapil Arya
>
> [ RUN  ] ModuleTest.UnknownModuleInstantiationTest
> Using temporary directory 
> '/tmp/ModuleTest_UnknownModuleInstantiationTest_Cv4jVf'
> tests/module_tests.cpp:230: Failure
> ModuleManager::load(modules): Error verifying module 
> 'org_apache_mesos_TestModule': Module API version mismatch. Mesos has: 1, 
> library requires: ThisIsNotAnAPIVersion!
> tests/module_tests.cpp:234: Failure
> ModuleManager::unload(moduleName): Error unloading module 
> 'org_apache_mesos_TestModule': module not loaded
> [  FAILED  ] ModuleTest.UnknownModuleInstantiationTest (2 ms)
> [ RUN  ] ModuleTest.AuthorInfoTest
> Using temporary directory '/tmp/ModuleTest_AuthorInfoTest_E98lf3'
> [   OK ] ModuleTest.AuthorInfoTest (2 ms)
> [ RUN  ] ModuleTest.UnknownLibraryTest
> Using temporary directory '/tmp/ModuleTest_UnknownLibraryTest_A0nMzQ'
> [   OK ] ModuleTest.UnknownLibraryTest (4 ms)
> [ RUN  ] ModuleTest.ExampleModuleParseStringTest
> Using temporary directory 
> '/tmp/ModuleTest_ExampleModuleParseStringTest_MVESUD'
> tests/module_tests.cpp:75: Failure
> ModuleManager::load(modules): Error verifying module 
> 'org_apache_mesos_TestModule': Module API version mismatch. Mesos has: 1, 
> library requires: ThisIsNotAnAPIVersion!
> tests/module_tests.cpp:78: Failure
> module: Module 'org_apache_mesos_TestModule' unknown
> ABORT: (../3rdparty/libprocess/3rdparty/stout/include/stout/try.hpp:92): 
> Try::get() but state == ERROR: Module 'org_apache_mesos_TestModule' 
> unknown*** Aborted at 1413500769 (unix time) try "date -d @1413500769" if you 
> are using GNU date ***
> PC: @   0x32844359e9 (unknown)
> *** SIGABRT (@0x3e81634) received by PID 5684 (TID 0x7f69accd9840) from 
> PID 5684; stack trace: ***
> @   0x3284c0ef90 (unknown)
> @   0x32844359e9 (unknown)
> @   0x32844370f8 (unknown)
> @   0x4fda1d _Abort()
> @   0x4fd9a9 _Abort()
> @   0x9bc1a5 Try<>::get()
> @   0x9b036a 
> ModuleTest_ExampleModuleParseStringTest_Test::TestBody()
> @   0xcb0a15 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @   0xc996ee testing::Test::Run()
> @   0xc9a494 testing::TestInfo::Run()
> @   0xc9a9d7 testing::TestCase::Run()
> @   0xc9fd56 testing::internal::UnitTestImpl::RunAllTests()
> @   0xcb18c5 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @   0xc9fa59 testing::UnitTest::Run()
> @   0x89ccac main
> @   0x3284421b45 (unknown)
> @   0x4a4c4d (unknown)
> make[3]: *** [check-local] Aborted (core dumped)





[jira] [Updated] (MESOS-1942) Bad test: ModuleTest.ExampleModuleParseStringTest

2014-10-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-1942:
--
Shepherd: Niklas Quarfot Nielsen

> Bad test: ModuleTest.ExampleModuleParseStringTest
> -
>
> Key: MESOS-1942
> URL: https://issues.apache.org/jira/browse/MESOS-1942
> Project: Mesos
>  Issue Type: Bug
>  Components: modules, test
>Affects Versions: 0.21.0
>Reporter: Ian Downes
>Assignee: Kapil Arya
>
> [ RUN  ] ModuleTest.UnknownModuleInstantiationTest
> Using temporary directory 
> '/tmp/ModuleTest_UnknownModuleInstantiationTest_Cv4jVf'
> tests/module_tests.cpp:230: Failure
> ModuleManager::load(modules): Error verifying module 
> 'org_apache_mesos_TestModule': Module API version mismatch. Mesos has: 1, 
> library requires: ThisIsNotAnAPIVersion!
> tests/module_tests.cpp:234: Failure
> ModuleManager::unload(moduleName): Error unloading module 
> 'org_apache_mesos_TestModule': module not loaded
> [  FAILED  ] ModuleTest.UnknownModuleInstantiationTest (2 ms)
> [ RUN  ] ModuleTest.AuthorInfoTest
> Using temporary directory '/tmp/ModuleTest_AuthorInfoTest_E98lf3'
> [   OK ] ModuleTest.AuthorInfoTest (2 ms)
> [ RUN  ] ModuleTest.UnknownLibraryTest
> Using temporary directory '/tmp/ModuleTest_UnknownLibraryTest_A0nMzQ'
> [   OK ] ModuleTest.UnknownLibraryTest (4 ms)
> [ RUN  ] ModuleTest.ExampleModuleParseStringTest
> Using temporary directory 
> '/tmp/ModuleTest_ExampleModuleParseStringTest_MVESUD'
> tests/module_tests.cpp:75: Failure
> ModuleManager::load(modules): Error verifying module 
> 'org_apache_mesos_TestModule': Module API version mismatch. Mesos has: 1, 
> library requires: ThisIsNotAnAPIVersion!
> tests/module_tests.cpp:78: Failure
> module: Module 'org_apache_mesos_TestModule' unknown
> ABORT: (../3rdparty/libprocess/3rdparty/stout/include/stout/try.hpp:92): 
> Try::get() but state == ERROR: Module 'org_apache_mesos_TestModule' unknown
> *** Aborted at 1413500769 (unix time) try "date -d @1413500769" if you 
> are using GNU date ***
> PC: @   0x32844359e9 (unknown)
> *** SIGABRT (@0x3e81634) received by PID 5684 (TID 0x7f69accd9840) from 
> PID 5684; stack trace: ***
> @   0x3284c0ef90 (unknown)
> @   0x32844359e9 (unknown)
> @   0x32844370f8 (unknown)
> @   0x4fda1d _Abort()
> @   0x4fd9a9 _Abort()
> @   0x9bc1a5 Try<>::get()
> @   0x9b036a 
> ModuleTest_ExampleModuleParseStringTest_Test::TestBody()
> @   0xcb0a15 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @   0xc996ee testing::Test::Run()
> @   0xc9a494 testing::TestInfo::Run()
> @   0xc9a9d7 testing::TestCase::Run()
> @   0xc9fd56 testing::internal::UnitTestImpl::RunAllTests()
> @   0xcb18c5 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @   0xc9fa59 testing::UnitTest::Run()
> @   0x89ccac main
> @   0x3284421b45 (unknown)
> @   0x4a4c4d (unknown)
> make[3]: *** [check-local] Aborted (core dumped)





[jira] [Commented] (MESOS-1949) All log messages from master, slave, executor, etc. should be collected on a per-task basis

2014-10-20 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177452#comment-14177452
 ] 

Benjamin Mahler commented on MESOS-1949:


In general we would like the TASK_FAILED or TASK_LOST updates to contain a 
meaningful {{message}} (inside {{TaskStatus}}). For example, if the TASK_LOST 
is generated because a slave was removed, the {{message}} will indicate this.

Are you finding that there are cases where {{message}} is being set poorly? If 
so, examples would be a great way to improve things.

> All log messages from master, slave, executor, etc. should be collected on a 
> per-task basis
> ---
>
> Key: MESOS-1949
> URL: https://issues.apache.org/jira/browse/MESOS-1949
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, slave
>Affects Versions: 0.20.1
>Reporter: Steven Schlansker
>
> Currently through a task's lifecycle, various debugging information is 
> created at different layers of the Mesos ecosystem.  The framework will log 
> task information, the master deals with resource allocation, the slave 
> actually allocates those resources, and the executor does the work of 
> launching the task.
> If anything through that pipeline fails, the end user is left with little but 
> a "TASK_FAILED" or "TASK_LOST" -- the actually interesting / useful 
> information (for example a "Docker pull failed because repository didn't 
> exist") is hidden in one of four or five different places, potentially spread 
> across as many different machines.  This leads to unpleasant and repetitive 
> searching through logs looking for a clue to what went wrong.
> Collating logs on a per-task basis would give the end user a much friendlier 
> way of figuring out exactly where in this process something went wrong, and 
> likely much faster resolution.





[jira] [Created] (MESOS-1949) All log messages from master, slave, executor, etc. should be collected on a per-task basis

2014-10-20 Thread Steven Schlansker (JIRA)
Steven Schlansker created MESOS-1949:


 Summary: All log messages from master, slave, executor, etc. 
should be collected on a per-task basis
 Key: MESOS-1949
 URL: https://issues.apache.org/jira/browse/MESOS-1949
 Project: Mesos
  Issue Type: Improvement
  Components: master, slave
Affects Versions: 0.20.1
Reporter: Steven Schlansker


Currently through a task's lifecycle, various debugging information is created 
at different layers of the Mesos ecosystem.  The framework will log task 
information, the master deals with resource allocation, the slave actually 
allocates those resources, and the executor does the work of launching the task.

If anything through that pipeline fails, the end user is left with little but a 
"TASK_FAILED" or "TASK_LOST" -- the actually interesting / useful information 
(for example a "Docker pull failed because repository didn't exist") is hidden 
in one of four or five different places, potentially spread across as many 
different machines.  This leads to unpleasant and repetitive searching through 
logs looking for a clue to what went wrong.

Collating logs on a per-task basis would give the end user a much friendlier 
way of figuring out exactly where in this process something went wrong, and 
likely much faster resolution.





[jira] [Commented] (MESOS-1943) Add event queue size metrics to scheduler driver

2014-10-20 Thread Dominic Hamon (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177323#comment-14177323
 ] 

Dominic Hamon commented on MESOS-1943:
--

https://reviews.apache.org/r/26951/

> Add event queue size metrics to scheduler driver
> 
>
> Key: MESOS-1943
> URL: https://issues.apache.org/jira/browse/MESOS-1943
> Project: Mesos
>  Issue Type: Task
>Reporter: Dominic Hamon
>Assignee: Dominic Hamon
>Priority: Minor
>
> In the master process, we expose metrics for event queue sizes for various 
> event types. We should do the same for the scheduler driver process.
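The idea above, queue-size gauges keyed by event type, can be sketched outside of libprocess as follows (all names here are hypothetical illustrations, not Mesos's actual metrics API):

```cpp
#include <cstddef>
#include <deque>
#include <map>
#include <string>

// Hypothetical sketch: one queue per event type, plus a snapshot of queue
// sizes that could be exported at a metrics endpoint.
class EventQueues {
public:
  // Items are placeholders; only the per-type queue length matters here.
  void enqueue(const std::string& type) { queues_[type].push_back(1); }

  void dequeue(const std::string& type) {
    auto it = queues_.find(type);
    if (it != queues_.end() && !it->second.empty()) {
      it->second.pop_front();
    }
  }

  // Gauge values, one per event type, e.g. {"messages": 3}.
  std::map<std::string, std::size_t> snapshot() const {
    std::map<std::string, std::size_t> sizes;
    for (const auto& kv : queues_) {
      sizes[kv.first] = kv.second.size();
    }
    return sizes;
  }

private:
  std::map<std::string, std::deque<int>> queues_;
};
```

In the real scheduler driver the queues already exist; only the snapshot side (registering a gauge per event type) would need to be added.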





[jira] [Closed] (MESOS-1895) Enable cgroups isolation by default

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon closed MESOS-1895.

Resolution: Won't Fix

As per comments, we won't make this change in favour of documentation.

> Enable cgroups isolation by default
> ---
>
> Key: MESOS-1895
> URL: https://issues.apache.org/jira/browse/MESOS-1895
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Affects Versions: 0.20.1
> Environment: Linux!
>Reporter: Sunil Shah
>
> cgroups isolation is not enabled by default on mesos-slave. For people 
> deploying Mesos in a production environment, it makes sense that this would 
> be the default, given the assumption that Mesos uses cgroups to isolate 
> running tasks.





[jira] [Updated] (MESOS-1903) Add backoff to framework re-registration retries

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1903:
-
Story Points: 3

> Add backoff to framework re-registration retries
> 
>
> Key: MESOS-1903
> URL: https://issues.apache.org/jira/browse/MESOS-1903
> Project: Mesos
>  Issue Type: Task
>Reporter: Dominic Hamon
>Assignee: Vinod Kone
>
> To avoid so many duplicate framework re-registration attempts (and thus offer 
> rescinds) we should add backoff to re-registration retries.
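A common shape for such retries is capped exponential backoff; a minimal sketch, with interval and cap values that are purely illustrative (not Mesos's defaults):

```cpp
#include <algorithm>
#include <chrono>

// Sketch: double the retry interval on each attempt, capped at maxBackoff.
std::chrono::milliseconds backoff(
    int attempt,                           // 0-based retry attempt
    std::chrono::milliseconds initial,     // delay before the first retry
    std::chrono::milliseconds maxBackoff)  // upper bound on any delay
{
  std::chrono::milliseconds delay = initial;
  for (int i = 0; i < attempt && delay < maxBackoff; ++i) {
    delay *= 2;
  }
  return std::min(delay, maxBackoff);
}
```

A production version would typically also add random jitter so that many frameworks do not retry in lockstep.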





[jira] [Updated] (MESOS-1943) Add event queue size metrics to scheduler driver

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1943:
-
Story Points: 2

> Add event queue size metrics to scheduler driver
> 
>
> Key: MESOS-1943
> URL: https://issues.apache.org/jira/browse/MESOS-1943
> Project: Mesos
>  Issue Type: Task
>Reporter: Dominic Hamon
>Assignee: Dominic Hamon
>Priority: Minor
>
> In the master process, we expose metrics for event queue sizes for various 
> event types. We should do the same for the scheduler driver process.





[jira] [Assigned] (MESOS-1941) Make executor's user owner of executor's cgroup directory

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon reassigned MESOS-1941:


Assignee: Ian Downes

> Make executor's user owner of executor's cgroup directory
> -
>
> Key: MESOS-1941
> URL: https://issues.apache.org/jira/browse/MESOS-1941
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation, slave
>Reporter: Mohit Soni
>Assignee: Ian Downes
>Priority: Minor
>
> Currently, when cgroups are enabled and an executor is spawned, it is mounted 
> under, for example, /sys/fs/cgroup/cpu/mesos/. In the current implementation 
> this directory is only writable by the root user, which prevents a process 
> launched by the executor from placing its child processes under this cgroup.
> To enable an executor-spawned process to place its child processes under its 
> cgroup directory, the cgroup directory should be made writable by the user 
> that spawns the executor.
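The proposed fix amounts to transferring ownership of the executor's cgroup directory to the executor's user. A minimal sketch, assuming root privileges and POSIX APIs (the helper name is hypothetical):

```cpp
#include <pwd.h>
#include <sys/types.h>
#include <unistd.h>

#include <string>

// Sketch: make `path` (e.g. a cgroup directory under
// /sys/fs/cgroup/cpu/mesos/) writable by `user` by transferring ownership.
// The caller must itself run as root for chown() to succeed.
bool chownToUser(const std::string& path, const std::string& user)
{
  struct passwd* pw = ::getpwnam(user.c_str());
  if (pw == nullptr) {
    return false;  // unknown user
  }
  return ::chown(path.c_str(), pw->pw_uid, pw->pw_gid) == 0;
}
```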





[jira] [Updated] (MESOS-1941) Make executor's user owner of executor's cgroup directory

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1941:
-
Story Points: 3

> Make executor's user owner of executor's cgroup directory
> -
>
> Key: MESOS-1941
> URL: https://issues.apache.org/jira/browse/MESOS-1941
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation, slave
>Reporter: Mohit Soni
>Priority: Minor
>
> Currently, when cgroups are enabled and an executor is spawned, it is mounted 
> under, for example, /sys/fs/cgroup/cpu/mesos/. In the current implementation 
> this directory is only writable by the root user, which prevents a process 
> launched by the executor from placing its child processes under this cgroup.
> To enable an executor-spawned process to place its child processes under its 
> cgroup directory, the cgroup directory should be made writable by the user 
> that spawns the executor.





[jira] [Assigned] (MESOS-1943) Add event queue size metrics to scheduler driver

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon reassigned MESOS-1943:


Assignee: Dominic Hamon

> Add event queue size metrics to scheduler driver
> 
>
> Key: MESOS-1943
> URL: https://issues.apache.org/jira/browse/MESOS-1943
> Project: Mesos
>  Issue Type: Task
>Reporter: Dominic Hamon
>Assignee: Dominic Hamon
>Priority: Minor
>
> In the master process, we expose metrics for event queue sizes for various 
> event types. We should do the same for the scheduler driver process.





[jira] [Updated] (MESOS-1875) os::killtree() incorrectly returns early if pid has terminated

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1875:
-
Story Points: 2

> os::killtree() incorrectly returns early if pid has terminated
> --
>
> Key: MESOS-1875
> URL: https://issues.apache.org/jira/browse/MESOS-1875
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.18.0, 0.18.1, 0.18.2, 0.19.0, 0.20.0, 0.19.1, 0.20.1
>Reporter: Ian Downes
>Assignee: Ian Downes
>
> If groups == true and/or sessions == true then os::killtree() should continue 
> to signal all processes in the process group and/or session, even if the 
> leading pid has terminated.
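The described behavior, keep signaling the process group even when the leading pid is gone, can be sketched as follows (a deliberate simplification of what os::killtree does, which also walks sessions and child trees):

```cpp
#include <cerrno>
#include <csignal>
#include <sys/types.h>
#include <unistd.h>

// Sketch: signal every process in a process group. A dead group *leader*
// alone must NOT stop the traversal; only ESRCH (the whole group is gone)
// means there is nothing left to signal.
bool signalProcessGroup(pid_t pgid, int sig)
{
  if (::killpg(pgid, sig) == 0) {
    return true;
  }
  return errno == ESRCH;  // group already fully terminated
}
```

Passing signal 0 performs the existence/permission check without delivering anything, which is a convenient way to probe a group.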





[jira] [Assigned] (MESOS-1689) Race with kernel to kill process / destroy cgroup after OOM

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon reassigned MESOS-1689:


Assignee: Ian Downes

> Race with kernel to kill process / destroy cgroup after OOM
> ---
>
> Key: MESOS-1689
> URL: https://issues.apache.org/jira/browse/MESOS-1689
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.20.0
>Reporter: Ian Downes
>Assignee: Ian Downes
>
> The recently refactored cgroup::destroy code can fail to freeze a freezer 
> cgroup under a particular ordering of events, as detailed below. The 
> LinuxLauncher will fail to destroy the cgroup and other isolators will not be 
> able to destroy their cgroups.
> This failure will be logged but otherwise ignored by a running slave. If the 
> slave is subsequently restarted it will block on the cgroup::destroy during 
> launcher recovery, timing out after 60 seconds and causing recovery to fail, 
> which will then cause the slave to terminate. If the slave is monitored and 
> automatically restarted it will repeatedly flap.
> The problem appears as a container freezer cgroup that will not transition to 
> FROZEN and remains in FREEZING. This is because one or more processes cannot 
> be frozen.
> {noformat}
> [idownes@hostname ~]$ cat 
> /sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/freezer.state
>  FREEZING
> [idownes@hostname ~]$ cat 
> /sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/cgroup.procs
>   | xargs ps -L
>   PID   LWP TTY  STAT   TIME COMMAND
> 29369 29369 ?  Dsl  0:02 python2.6 ./thermos_executor
> 29369 29482 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29483 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29484 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29485 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29486 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29487 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29488 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29489 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29490 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29491 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29492 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29493 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29494 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29495 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29496 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29497 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29498 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29499 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29500 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29582 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29583 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29584 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29585 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29604 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29605 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29606 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29607 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29608 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29610 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29612 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29613 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29526 29526 ?  D    0:02 python2.6 
> /var/lib/mesos/slaves/20140729-023029-1890854154-5050-33440-3/frameworks/20110
> 29578 29578 ?  Ds  29:49 python2.6 
> /var/lib/mesos/slaves/20140729-023029-1890854154-5050-33440-3/frameworks/20110
> 29603 29603 ?  R  254:08 python2.6 /usr/local/bin/package_cache 
> 719808749561cd7e77d8a22df9f36643 hftp://hadoop-r
> {noformat}
> Debugging with [~jieyu] indicates the following sequence of events:
> 1. Cgroup reaches memory limit
> 2. Kernel notifies Mesos that an OOM condition has occurred.
> 3. Mesos initiates freezing the cgroup.
> 4. There is now a race in the kernel between freezing processes and deciding 
> which process to kill. The kernel does check that a process is suitable for 
> killing and will thaw the process if necessary, but there is a window between 
> this check and actually signaling the process. It can occur that the selected 
> process is frozen before the signal is delivered and thus the process does 
> not die, and therefore does not release its memory.
> {noformat}
> [idownes@hostname ~]$ grep 'Kill process 29578' /var/log/kern.log
> Aug  8 01:52:05 s_all@hostname kernel: [804284.982630] Memory cgroup out of 
> memory: Kill process 29578 (python2.6)

[jira] [Updated] (MESOS-1943) Add event queue size metrics to scheduler driver

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1943:
-
Sprint: Twitter Q4 Sprint 2

> Add event queue size metrics to scheduler driver
> 
>
> Key: MESOS-1943
> URL: https://issues.apache.org/jira/browse/MESOS-1943
> Project: Mesos
>  Issue Type: Task
>Reporter: Dominic Hamon
>Priority: Minor
>
> In the master process, we expose metrics for event queue sizes for various 
> event types. We should do the same for the scheduler driver process.





[jira] [Updated] (MESOS-1941) Make executor's user owner of executor's cgroup directory

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1941:
-
Sprint: Twitter Q4 Sprint 2

> Make executor's user owner of executor's cgroup directory
> -
>
> Key: MESOS-1941
> URL: https://issues.apache.org/jira/browse/MESOS-1941
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation, slave
>Reporter: Mohit Soni
>Priority: Minor
>
> Currently, when cgroups are enabled and an executor is spawned, it is mounted 
> under, for example, /sys/fs/cgroup/cpu/mesos/. In the current implementation 
> this directory is only writable by the root user, which prevents a process 
> launched by the executor from placing its child processes under this cgroup.
> To enable an executor-spawned process to place its child processes under its 
> cgroup directory, the cgroup directory should be made writable by the user 
> that spawns the executor.





[jira] [Updated] (MESOS-1930) Expose TASK_KILLED reason.

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1930:
-
Sprint: Twitter Q4 Sprint 2

> Expose TASK_KILLED reason.
> --
>
> Key: MESOS-1930
> URL: https://issues.apache.org/jira/browse/MESOS-1930
> Project: Mesos
>  Issue Type: Story
>Reporter: Alexander Rukletsov
>Assignee: Dominic Hamon
>Priority: Minor
>
> A task process may be killed by a SIGTERM or SIGKILL. The only way to 
> check how the task process exited is to examine the message: 
> {{status.message().find("Terminated")}}. However, a task may not run in its 
> own process, hence the executor may not be able to provide an exit status. 
> What we actually want is an artificial task exit status that is rendered by 
> the executor.
> This may be resolved by adding second-tier states or state explanations. Here 
> is a link to a discussion: https://reviews.apache.org/r/26382/





[jira] [Updated] (MESOS-910) Add SSL support to Mesos

2014-10-20 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-910:
--
Epic Name: SSL  (was: Add SSL support to Mesos)

> Add SSL support to Mesos
> 
>
> Key: MESOS-910
> URL: https://issues.apache.org/jira/browse/MESOS-910
> Project: Mesos
>  Issue Type: Epic
>  Components: general, libprocess
>Reporter: Adam B
>  Labels: encryption, security
>
> Currently all the messages that flow through the Mesos cluster are 
> unencrypted, making it possible for intruders to intercept and potentially 
> control your tasks. We plan to add encryption support by adding SSL/TLS 
> support to libprocess, the low-level communication library that Mesos uses 
> for all network communication between Mesos components.
> As a first step, we should replace the hand-coded HTTP code in libprocess 
> with a standard library, ensuring that any Mesos-specific code, such as 
> routing, remains. Then the transition to HTTPS should be easier.





[jira] [Assigned] (MESOS-1903) Add backoff to framework re-registration retries

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon reassigned MESOS-1903:


Assignee: Vinod Kone

> Add backoff to framework re-registration retries
> 
>
> Key: MESOS-1903
> URL: https://issues.apache.org/jira/browse/MESOS-1903
> Project: Mesos
>  Issue Type: Task
>Reporter: Dominic Hamon
>Assignee: Vinod Kone
>
> To avoid so many duplicate framework re-registration attempts (and thus offer 
> rescinds) we should add backoff to re-registration retries.





[jira] [Updated] (MESOS-1903) Add backoff to framework re-registration retries

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1903:
-
Sprint: Twitter Q4 Sprint 2

> Add backoff to framework re-registration retries
> 
>
> Key: MESOS-1903
> URL: https://issues.apache.org/jira/browse/MESOS-1903
> Project: Mesos
>  Issue Type: Task
>Reporter: Dominic Hamon
>Assignee: Vinod Kone
>
> To avoid so many duplicate framework re-registration attempts (and thus offer 
> rescinds) we should add backoff to re-registration retries.





[jira] [Updated] (MESOS-1689) Race with kernel to kill process / destroy cgroup after OOM

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1689:
-
Sprint: Twitter Q4 Sprint 2

> Race with kernel to kill process / destroy cgroup after OOM
> ---
>
> Key: MESOS-1689
> URL: https://issues.apache.org/jira/browse/MESOS-1689
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.20.0
>Reporter: Ian Downes
>
> The recently refactored cgroup::destroy code can fail to freeze a freezer 
> cgroup under a particular ordering of events, as detailed below. The 
> LinuxLauncher will fail to destroy the cgroup and other isolators will not be 
> able to destroy their cgroups.
> This failure will be logged but otherwise ignored by a running slave. If the 
> slave is subsequently restarted it will block on the cgroup::destroy during 
> launcher recovery, timing out after 60 seconds and causing recovery to fail, 
> which will then cause the slave to terminate. If the slave is monitored and 
> automatically restarted it will repeatedly flap.
> The problem appears as a container freezer cgroup that will not transition to 
> FROZEN and remains in FREEZING. This is because one or more processes cannot 
> be frozen.
> {noformat}
> [idownes@hostname ~]$ cat 
> /sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/freezer.state
>  FREEZING
> [idownes@hostname ~]$ cat 
> /sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/cgroup.procs
>   | xargs ps -L
>   PID   LWP TTY  STAT   TIME COMMAND
> 29369 29369 ?  Dsl  0:02 python2.6 ./thermos_executor
> 29369 29482 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29483 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29484 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29485 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29486 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29487 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29488 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29489 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29490 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29491 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29492 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29493 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29494 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29495 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29496 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29497 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29498 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29499 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29500 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29582 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29583 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29584 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29585 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29604 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29605 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29606 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29607 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29608 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29610 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29612 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29369 29613 ?  Dsl  0:00 python2.6 ./thermos_executor
> 29526 29526 ?  D    0:02 python2.6 
> /var/lib/mesos/slaves/20140729-023029-1890854154-5050-33440-3/frameworks/20110
> 29578 29578 ?  Ds  29:49 python2.6 
> /var/lib/mesos/slaves/20140729-023029-1890854154-5050-33440-3/frameworks/20110
> 29603 29603 ?  R  254:08 python2.6 /usr/local/bin/package_cache 
> 719808749561cd7e77d8a22df9f36643 hftp://hadoop-r
> {noformat}
> Debugging with [~jieyu] indicates the following sequence of events:
> 1. Cgroup reaches memory limit
> 2. Kernel notifies Mesos that an OOM condition has occurred.
> 3. Mesos initiates freezing the cgroup.
> 4. There is now a race in the kernel between freezing processes and deciding 
> which process to kill. The kernel does check that a process is suitable for 
> killing and will thaw the process if necessary, but there is a window between 
> this check and actually signaling the process. It can occur that the selected 
> process is frozen before the signal is delivered and thus the process does 
> not die, and therefore does not release its memory.
> {noformat}
> [idownes@hostname ~]$ grep 'Kill process 29578' /var/log/kern.log
> Aug  8 01:52:05 s_all@hostname kernel: [804284.982630] Memory cgroup out of 
> memory: Kill process 29578 (python2.6)
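Detecting the stuck-FREEZING state described above can be sketched as a small helper that a destroy loop would poll (paths and names here are illustrative; the real cgroups code in Mesos is considerably more involved):

```cpp
#include <fstream>
#include <string>

// Sketch: read a freezer state file, e.g.
// /sys/fs/cgroup/freezer/mesos/<container>/freezer.state.
// Returns "FROZEN", "FREEZING", "THAWED", or "" on error.
std::string freezerState(const std::string& path)
{
  std::ifstream in(path);
  std::string state;
  in >> state;
  return state;
}

// A destroy loop would poll this: if the state stays FREEZING past a
// deadline, thaw and retry (or give up) rather than blocking forever.
bool isStuckFreezing(const std::string& state)
{
  return state == "FREEZING";
}
```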

[jira] [Updated] (MESOS-1807) Disallow executors with cpu only or memory only resources

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1807:
-
Sprint: Twitter Q4 Sprint 1, Mesosphere Q4 Sprint 1  (was: Twitter Q4 
Sprint 1)

> Disallow executors with cpu only or memory only resources
> -
>
> Key: MESOS-1807
> URL: https://issues.apache.org/jira/browse/MESOS-1807
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>  Labels: newbie
>
> Currently the master allows executors to be launched with either only cpus or 
> only memory, but we shouldn't allow that.
> This is because an executor is an actual unix process that is launched by the 
> slave. If an executor doesn't specify cpus, what should the cpu limits be 
> for that executor when there are no tasks running on it? If no cpu limits are 
> set then it might starve other executors/tasks on the slave, violating 
> isolation guarantees. The same goes for memory. Moreover, the current 
> containerizer/isolator code will throw failures when using such an executor, 
> e.g., when the last task on the executor finishes and Containerizer::update() 
> is called with 0 cpus or 0 mem.
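The proposed validation can be sketched as a check that an executor declares both cpus and mem (a simplified model; real Mesos Resource protobufs carry more structure than a name-to-scalar map):

```cpp
#include <map>
#include <string>

// Simplified model: resources as name -> scalar value,
// e.g. {"cpus": 0.1, "mem": 32.0}.
using Resources = std::map<std::string, double>;

// Reject executors that declare only cpus or only memory (or neither).
bool validExecutorResources(const Resources& resources)
{
  auto cpus = resources.find("cpus");
  auto mem = resources.find("mem");
  return cpus != resources.end() && cpus->second > 0 &&
         mem != resources.end() && mem->second > 0;
}
```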





[jira] [Updated] (MESOS-1586) Isolate system directories, e.g., per-container /tmp

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1586:
-
Sprint: Q3 Sprint 1, Q3 Sprint 2, Q3 Sprint 3, Q3 Sprint 4, Mesos Q3 Sprint 
5, Mesos Q3 Sprint 6, Twitter Q4 Sprint 1, Twitter Q4 Sprint 2  (was: Q3 Sprint 
1, Q3 Sprint 2, Q3 Sprint 3, Q3 Sprint 4, Mesos Q3 Sprint 5, Mesos Q3 Sprint 6, 
Twitter Q4 Sprint 1, Mesosphere Q4 Sprint 1)

> Isolate system directories, e.g., per-container /tmp
> 
>
> Key: MESOS-1586
> URL: https://issues.apache.org/jira/browse/MESOS-1586
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation
>Affects Versions: 0.20.0
>Reporter: Ian Downes
>Assignee: Ian Downes
>
> Ideally, tasks should not write outside their sandbox (executor work 
> directory) but pragmatically they may need to write to /tmp, /var/tmp, or 
> some other directory.
> 1) We should include any such files in disk usage and quota.
> 2) We should make these "shared" directories private, i.e., each container 
> has its own.
> 3) We should make the lifetime of any such files the same as the executor 
> work directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1830) Expose master stats differentiating between master-generated and slave-generated LOST tasks

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1830:
-
Sprint: Twitter Q4 Sprint 1, Twitter Q4 Sprint 2  (was: Twitter Q4 Sprint 
1, Mesosphere Q4 Sprint 1)

> Expose master stats differentiating between master-generated and 
> slave-generated LOST tasks
> ---
>
> Key: MESOS-1830
> URL: https://issues.apache.org/jira/browse/MESOS-1830
> Project: Mesos
>  Issue Type: Story
>  Components: master
>Reporter: Bill Farner
>Assignee: Dominic Hamon
>Priority: Minor
>
> The master exports a monotonically-increasing counter of tasks transitioned 
> to TASK_LOST.  This loses fidelity of the source of the lost task.  A first 
> step in exposing the source of lost tasks might be to just differentiate 
> between TASK_LOST transitions initiated by the master vs the slave (and maybe 
> bad input from the scheduler).
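A minimal sketch of the proposed differentiation, assuming hypothetical metric key names (the real keys would follow the master's metrics-endpoint conventions):

```python
from collections import Counter

class LostTaskMetrics:
    """Count TASK_LOST transitions per source instead of one aggregate."""

    SOURCES = ("master", "slave", "scheduler")

    def __init__(self):
        self.lost = Counter()

    def record_lost(self, source: str) -> None:
        if source not in self.SOURCES:
            raise ValueError("unknown source: %s" % source)
        self.lost[source] += 1

    def snapshot(self) -> dict:
        # One monotonically-increasing counter per source.
        return {"master/tasks_lost/source_%s" % s: self.lost[s]
                for s in self.SOURCES}
```
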





[jira] [Updated] (MESOS-1799) Reconciliation can send out-of-order updates.

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1799:
-
Sprint: Mesos Q3 Sprint 6, Twitter Q4 Sprint 1, Twitter Q4 Sprint 2  (was: 
Mesos Q3 Sprint 6, Twitter Q4 Sprint 1, Mesosphere Q4 Sprint 1)

> Reconciliation can send out-of-order updates.
> -
>
> Key: MESOS-1799
> URL: https://issues.apache.org/jira/browse/MESOS-1799
> Project: Mesos
>  Issue Type: Bug
>  Components: master, slave
>Reporter: Benjamin Mahler
>Assignee: Vinod Kone
>
> When a slave re-registers with the master, it currently sends the latest task 
> state for all tasks that are not both terminal and acknowledged.
> However, reconciliation assumes that we always have the latest unacknowledged 
> state of the task represented in the master.
> As a result, out-of-order updates are possible, e.g.
> (1) Slave has task T in TASK_FINISHED, with unacknowledged updates: 
> [TASK_RUNNING, TASK_FINISHED].
> (2) Master fails over.
> (3) New master re-registers the slave with T in TASK_FINISHED.
> (4) Reconciliation request arrives, master sends TASK_FINISHED.
> (5) Slave sends TASK_RUNNING to master, master sends TASK_RUNNING.
> I think the fix here is to preserve the task state invariants in the master, 
> namely, that the master has the latest unacknowledged state of the task. This 
> means when the slave re-registers, it should instead send the latest 
> acknowledged state of each task.
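The proposed fix can be sketched as follows; the function and its arguments are illustrative, not Mesos APIs. Without the fix, the master learns a state that is ahead of the unacknowledged stream and can answer reconciliation out of order:

```python
def reregistration_state(stream, num_acknowledged, fixed):
    """State the slave reports for a task when re-registering.

    `stream` is the ordered list of status updates; the first
    `num_acknowledged` of them have been acknowledged.
    Current (buggy) behaviour reports the latest state; the proposed
    fix reports the latest *acknowledged* state, preserving the
    master's invariant of holding the latest unacknowledged state.
    """
    if not fixed:
        return stream[-1]
    acked = stream[:num_acknowledged]
    return acked[-1] if acked else None
```
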





[jira] [Updated] (MESOS-1751) Request for "stats.json" cannot be fulfilled after stopping the framework

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1751:
-
Sprint: Mesos Q3 Sprint 6, Twitter Q4 Sprint 1, Twitter Q4 Sprint 2  (was: 
Mesos Q3 Sprint 6, Twitter Q4 Sprint 1, Mesosphere Q4 Sprint 1)

> Request for "stats.json" cannot be fulfilled after stopping the framework 
> --
>
> Key: MESOS-1751
> URL: https://issues.apache.org/jira/browse/MESOS-1751
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.21.0
> Environment: Test case launched on Mac OS X Mavericks.
>Reporter: Alexander Rukletsov
>Assignee: Dominic Hamon
>Priority: Minor
>
> A request for "stats.json" to the master from a test case doesn't work after 
> calling the framework's {{driver.stop()}}. However, it works for "state.json". I 
> think the problem is related to {{stats()}} continuation {{_stats()}}. The 
> following test illustrates the issue:
> {code:title=TestCase.cpp|borderStyle=solid}
> TEST_F(MasterTest, RequestAfterDriverStop)
> {
>   Try<PID<Master> > master = StartMaster();
>   ASSERT_SOME(master);
>   Try<PID<Slave> > slave = StartSlave();
>   ASSERT_SOME(slave);
>   MockScheduler sched;
>   MesosSchedulerDriver driver(
>   &sched, DEFAULT_FRAMEWORK_INFO, master.get(), DEFAULT_CREDENTIAL);
>   driver.start();
>   
>   Future<process::http::Response> response_before =
>   process::http::get(master.get(), "stats.json");
>   AWAIT_READY(response_before);
>   driver.stop();
>   Future<process::http::Response> response_after =
>   process::http::get(master.get(), "stats.json");
>   AWAIT_READY(response_after);
>   driver.join();
>   Shutdown();  // Must shutdown before 'containerizer' gets deallocated.
> }
> {code}





[jira] [Updated] (MESOS-681) Document the reconciliation API.

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-681:

Sprint: Twitter Q4 Sprint 1, Twitter Q4 Sprint 2  (was: Twitter Q4 Sprint 
1, Mesosphere Q4 Sprint 1)

> Document the reconciliation API.
> 
>
> Key: MESOS-681
> URL: https://issues.apache.org/jira/browse/MESOS-681
> Project: Mesos
>  Issue Type: Task
>  Components: documentation
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
> Attachments: 0.19.0.key, 0.19.0.pdf
>
>
> Now that we have a reconciliation mechanism, we should document why it exists 
> and how to use it going forward.
> As we add the lower level API, reconciliation may be done slightly 
> differently. Having documentation that reflects the changes would be great.
> It might also be helpful to upload my slides from the May 19th meetup.





[jira] [Updated] (MESOS-1817) Completed tasks remains in TASK_RUNNING when framework is disconnected

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1817:
-
Sprint: Twitter Q4 Sprint 1, Twitter Q4 Sprint 2  (was: Twitter Q4 Sprint 
1, Mesosphere Q4 Sprint 1)

> Completed tasks remains in TASK_RUNNING when framework is disconnected
> --
>
> Key: MESOS-1817
> URL: https://issues.apache.org/jira/browse/MESOS-1817
> Project: Mesos
>  Issue Type: Bug
>Reporter: Niklas Quarfot Nielsen
>Assignee: Vinod Kone
>
> We have run into a problem that causes tasks which complete while a framework 
> is disconnected (with a failover timeout set) to remain in a running state 
> even though the tasks actually finish. This hogs the cluster and gives users 
> an inconsistent view of the cluster state: on the slave, the task is 
> finished; on the master, it is still in a non-terminal state. 
> When the scheduler reattaches or the failover timeout expires, the tasks 
> finish correctly. The current workflow of this scheduler has a long 
> failover timeout, but may on the other hand never reattach.
> Here is a test framework we have been able to reproduce the issue with: 
> https://gist.github.com/nqn/9b9b1de9123a6e836f54
> It launches many short-lived tasks (1 second sleep) and when killing the 
> framework instance, the master reports the tasks as running even after 
> several minutes: 
> http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png
> When clicking on one of the slaves where, for example, task 49 runs; the 
> slave knows that it completed: 
> http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png
> Here is the log of a mesos-local instance where I reproduced it: 
> https://gist.github.com/nqn/f7ee20601199d70787c0 (Here task 10 to 19 are 
> stuck in running state).
> There is a lot of output, so here is a filtered log for task 10: 
> https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d
> The problem turns out to be an issue with the ack cycle of status updates:
> If the framework disconnects (with a failover timeout set), the status update 
> manager on the slaves will keep trying to send the front of the status update 
> stream to the master (which in turn forwards it to the framework). If the 
> first status update after the disconnect is terminal, things work out fine; 
> the master picks the terminal state up, removes the task, and releases the 
> resources.
> If, on the other hand, a non-terminal status sits at the front of the stream, 
> the master will never know that the task finished (or failed) before the 
> framework reconnects.
> During a discussion on the dev mailing list 
> (http://mail-archives.apache.org/mod_mbox/mesos-dev/201409.mbox/%3cCADKthhAVR5mrq1s9HXw1BB_XFALXWWxjutp7MV4y3wP-Bh=a...@mail.gmail.com%3e)
>  we enumerated a couple of options to solve this problem.
> First off, having two ack-cycles: one between masters and slaves and one 
> between masters and frameworks, would be ideal. We would be able to replay 
> the statuses in order while keeping the master state current. However, this 
> requires us to persist the master state in a replicated storage.
> As a first pass, we can make sure that tasks caught in a running state don't 
> hog the cluster once they have completed while the framework is disconnected.
> Here is a proof-of-concept to work out of: 
> https://github.com/nqn/mesos/tree/niklas/status-update-disconnect/
> A new (optional) field has been added to the internal status update message:
> https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/messages/messages.proto#L68
> Which makes it possible for the status update manager to set the field, if 
> the latest status was terminal: 
> https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/slave/status_update_manager.cpp#L501
> I added a test which should high-light the issue as well:
> https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/tests/fault_tolerance_tests.cpp#L2478
> I would love some input on the approach before moving on.
> There are rough edges in the PoC which (of course) should be addressed before 
> bringing it up for review.
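The idea in the linked proof-of-concept can be sketched in a few lines; the `latest_state` field name and the message shape are hypothetical (the actual field lives in Mesos' internal messages.proto):

```python
TERMINAL = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}

def forward_update(unacknowledged):
    """Forward the front of the unacknowledged stream (preserving
    order for the framework), plus an optional field carrying the
    latest state if it is terminal, so the master can release the
    resources of tasks that finished while the framework was away."""
    message = {"update": unacknowledged[0]}
    latest = unacknowledged[-1]
    if latest in TERMINAL:
        message["latest_state"] = latest  # hypothetical field name
    return message
```
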





[jira] [Updated] (MESOS-1765) Use PID namespace to avoid freezing cgroup

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1765:
-
Sprint: Mesos Q3 Sprint 5, Mesos Q3 Sprint 6, Twitter Q4 Sprint 1, Twitter 
Q4 Sprint 2  (was: Mesos Q3 Sprint 5, Mesos Q3 Sprint 6, Twitter Q4 Sprint 1, 
Mesosphere Q4 Sprint 1)

> Use PID namespace to avoid freezing cgroup
> --
>
> Key: MESOS-1765
> URL: https://issues.apache.org/jira/browse/MESOS-1765
> Project: Mesos
>  Issue Type: Story
>  Components: containerization
>Reporter: Cong Wang
>Assignee: Ian Downes
>
> There is a known kernel issue when we freeze the whole cgroup upon OOM. 
> Mesos can probably just use a PID namespace so that we only need to kill 
> the "init" of the pid namespace, instead of freezing all the processes and 
> killing them one by one. But I am not quite sure whether this would break 
> the existing code.





[jira] [Updated] (MESOS-1875) os::killtree() incorrectly returns early if pid has terminated

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1875:
-
Sprint: Twitter Q4 Sprint 1, Twitter Q4 Sprint 2  (was: Twitter Q4 Sprint 
1, Mesosphere Q4 Sprint 1)

> os::killtree() incorrectly returns early if pid has terminated
> --
>
> Key: MESOS-1875
> URL: https://issues.apache.org/jira/browse/MESOS-1875
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.18.0, 0.18.1, 0.18.2, 0.19.0, 0.20.0, 0.19.1, 0.20.1
>Reporter: Ian Downes
>Assignee: Ian Downes
>
> If groups == true and/or sessions == true then os::killtree() should continue 
> to signal all processes in the process group and/or session, even if the 
> leading pid has terminated.
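The intended behaviour can be sketched as follows. This is a Python illustration of the logic, not the C++ `os::killtree()` implementation; the injectable `kill` parameter exists only to make the sketch testable:

```python
import os
import signal

def signal_tree(pids, sig=signal.SIGTERM, kill=os.kill):
    """Signal every process in the group/session, treating an
    already-exited pid (ESRCH) as non-fatal instead of returning
    early when the leading pid has terminated."""
    signalled = []
    for pid in pids:
        try:
            kill(pid, sig)
            signalled.append(pid)
        except ProcessLookupError:
            continue  # process already gone; keep going
    return signalled
```
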





[jira] [Updated] (MESOS-1853) Remove /proc and /sys remounts from port_mapping isolator

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1853:
-
Sprint: Twitter Q4 Sprint 1, Twitter Q4 Sprint 2  (was: Twitter Q4 Sprint 
1, Mesosphere Q4 Sprint 1)

> Remove /proc and /sys remounts from port_mapping isolator
> -
>
> Key: MESOS-1853
> URL: https://issues.apache.org/jira/browse/MESOS-1853
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Affects Versions: 0.20.0, 0.20.1
>Reporter: Ian Downes
>Assignee: Ian Downes
>
> /proc/net reflects a new network namespace regardless and remount doesn't 
> actually do what we expected anyway, i.e., it's not sufficient for a new pid 
> namespace and a new mount is required.





[jira] [Updated] (MESOS-1807) Disallow executors with cpu only or memory only resources

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1807:
-
Sprint: Twitter Q4 Sprint 1, Twitter Q4 Sprint 2  (was: Twitter Q4 Sprint 
1, Mesosphere Q4 Sprint 1)

> Disallow executors with cpu only or memory only resources
> -
>
> Key: MESOS-1807
> URL: https://issues.apache.org/jira/browse/MESOS-1807
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>  Labels: newbie
>
> Currently the master allows executors to be launched with only cpus or only 
> memory, but we shouldn't allow that.
> This is because an executor is an actual unix process that is launched by the 
> slave. If an executor doesn't specify cpus, what should the cpu limits be 
> for that executor when there are no tasks running on it? If no cpu limits are 
> set then it might starve other executors/tasks on the slave, violating 
> isolation guarantees. The same goes for memory. Moreover, the current 
> containerizer/isolator code will throw failures when using such an executor, 
> e.g., when the last task on the executor finishes and Containerizer::update() 
> is called with 0 cpus or 0 mem.
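A sketch of the proposed master-side check; the function is illustrative (the real validation would live in the master's launch-task validation path), though the `cpus`/`mem` resource names follow Mesos conventions:

```python
def validate_executor_resources(resources):
    """Reject executors that declare only cpus or only memory.

    `resources` maps resource name to a scalar amount; returns a
    list of human-readable errors, empty when the executor is valid.
    """
    errors = []
    for name in ("cpus", "mem"):
        if resources.get(name, 0) <= 0:
            errors.append(
                "executor must declare a positive '%s' resource" % name)
    return errors
```
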





[jira] [Updated] (MESOS-1456) Metric lifetime should be tied to process runstate, not lifetime.

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1456:
-
Sprint: Mesos Q3 Sprint 6, Twitter Q4 Sprint 1, Twitter Q4 Sprint 2  (was: 
Mesos Q3 Sprint 6, Twitter Q4 Sprint 1, Mesosphere Q4 Sprint 1)

> Metric lifetime should be tied to process runstate, not lifetime.
> -
>
> Key: MESOS-1456
> URL: https://issues.apache.org/jira/browse/MESOS-1456
> Project: Mesos
>  Issue Type: Bug
>  Components: statistics
>Affects Versions: 0.19.0
>Reporter: Dominic Hamon
>Assignee: Dominic Hamon
>
> The usual pattern for termination of processes is {{terminate(..); wait(..); 
> delete ..;}} but the {{SchedulerProcess}} is terminated and then deleted some 
> time later.
> If the metrics endpoint is accessed within that period, it never returns as 
> it tries to access a {{Gauge}} that has a reference to a valid PID that is 
> not getting any timeslices (the {{SchedulerProcess}}). A one-off fix can be 
> made to the {{SchedulerProcess}} to move the metrics add/remove calls to 
> {{initialize}} and {{finalize}}, but this should be the general pattern for 
> every process with metrics. 
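The proposed pattern can be sketched as follows. Class and method names are illustrative, not the real libprocess API; the point is that metric registration lives in initialize()/finalize() (runstate) rather than the constructor/destructor (lifetime):

```python
class Metrics:
    """Toy metrics registry standing in for process::metrics."""

    def __init__(self):
        self.gauges = {}

    def add(self, name, fn):
        self.gauges[name] = fn

    def remove(self, name):
        self.gauges.pop(name, None)

    def snapshot(self):
        return {name: fn() for name, fn in self.gauges.items()}


class SchedulerProcess:
    """Registers gauges on initialize and removes them on finalize,
    so the metrics endpoint never dereferences a process that has
    been terminated but not yet deleted."""

    def __init__(self, metrics):
        self.metrics = metrics
        self.events = 0

    def initialize(self):
        self.metrics.add("scheduler/events", lambda: self.events)

    def finalize(self):
        self.metrics.remove("scheduler/events")
```
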





[jira] [Updated] (MESOS-1739) Allow slave reconfiguration on restart

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1739:
-
Sprint: Mesos Q3 Sprint 5, Mesos Q3 Sprint 6, Twitter Q4 Sprint 1, Twitter 
Q4 Sprint 2  (was: Mesos Q3 Sprint 5, Mesos Q3 Sprint 6, Twitter Q4 Sprint 1, 
Mesosphere Q4 Sprint 1)

> Allow slave reconfiguration on restart
> --
>
> Key: MESOS-1739
> URL: https://issues.apache.org/jira/browse/MESOS-1739
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Patrick Reilly
>Assignee: Cody Maloney
>
> Make it so that either via a slave restart or an out-of-process "reconfigure" 
> ping, the attributes and resources of a slave can be updated to be a superset 
> of what they used to be.





[jira] [Updated] (MESOS-1902) Support persistent disk resource.

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1902:
-
Sprint: Twitter Q4 Sprint 1, Twitter Q4 Sprint 2  (was: Twitter Q4 Sprint 
1, Mesosphere Q4 Sprint 1)

> Support persistent disk resource.
> -
>
> Key: MESOS-1902
> URL: https://issues.apache.org/jira/browse/MESOS-1902
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jie Yu
>
> Mesos needs to provide a way to allow tasks to write persistent data which 
> won’t be garbage collected. For example, a task can write its persistent data 
> to some predefined directory. When this task finishes, the framework can 
> launch a new task which is able to access the persistent data written by the 
> previous task which Mesos would have usually garbage-collected.
> One way to achieve that is to provide a new type of disk resource which is 
> persistent. We call it a persistent disk resource. When a framework launches a 
> task using persistent disk resources, the data the task writes will be 
> persisted. When the framework launches a new task using the same persistent 
> disk resource (after the previous task finishes), the new task will be able 
> to access the data written by the previous task.
> The persistent disk resource should be able to survive slave reboot or slave 
> info/id change.





[jira] [Updated] (MESOS-1347) GarbageCollectorIntegrationTest.DiskUsage is flaky.

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1347:
-
Sprint: Q2'14 Sprint 2, Twitter Q4 Sprint 1, Twitter Q4 Sprint 2  (was: 
Q2'14 Sprint 2, Twitter Q4 Sprint 1, Mesosphere Q4 Sprint 1)

> GarbageCollectorIntegrationTest.DiskUsage is flaky.
> ---
>
> Key: MESOS-1347
> URL: https://issues.apache.org/jira/browse/MESOS-1347
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.19.0
>Reporter: Benjamin Mahler
>Assignee: Dominic Hamon
> Fix For: 0.19.0
>
>
> From Jenkins:
> https://builds.apache.org/job/Mesos-Ubuntu-distcheck/79/consoleFull
> {noformat}
> [ RUN  ] GarbageCollectorIntegrationTest.DiskUsage
> Using temporary directory 
> '/tmp/GarbageCollectorIntegrationTest_DiskUsage_pU3Ym7'
> I0507 03:27:38.775058  5758 leveldb.cpp:174] Opened db in 44.343989ms
> I0507 03:27:38.787498  5758 leveldb.cpp:181] Compacted db in 12.411065ms
> I0507 03:27:38.787533  5758 leveldb.cpp:196] Created db iterator in 4008ns
> I0507 03:27:38.787545  5758 leveldb.cpp:202] Seeked to beginning of db in 
> 598ns
> I0507 03:27:38.787552  5758 leveldb.cpp:271] Iterated through 0 keys in the 
> db in 173ns
> I0507 03:27:38.787564  5758 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0507 03:27:38.787858  5777 recover.cpp:425] Starting replica recovery
> I0507 03:27:38.788352  5793 master.cpp:267] Master 
> 20140507-032738-453759884-58462-5758 (hemera.apache.org) started on 
> 140.211.11.27:58462
> I0507 03:27:38.788377  5793 master.cpp:304] Master only allowing 
> authenticated frameworks to register
> I0507 03:27:38.788383  5793 master.cpp:309] Master only allowing 
> authenticated slaves to register
> I0507 03:27:38.788389  5793 credentials.hpp:35] Loading credentials for 
> authentication
> I0507 03:27:38.789064  5779 recover.cpp:451] Replica is in EMPTY status
> W0507 03:27:38.789115  5793 credentials.hpp:48] Failed to stat credentials 
> file 
> 'file:///tmp/GarbageCollectorIntegrationTest_DiskUsage_pU3Ym7/credentials': 
> No such file or directory
> I0507 03:27:38.789489  5779 master.cpp:104] No whitelist given. Advertising 
> offers for all slaves
> I0507 03:27:38.789531  5778 hierarchical_allocator_process.hpp:301] 
> Initializing hierarchical allocator process with master : 
> master@140.211.11.27:58462
> I0507 03:27:38.791007  5788 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0507 03:27:38.791177  5780 master.cpp:921] The newly elected leader is 
> master@140.211.11.27:58462 with id 20140507-032738-453759884-58462-5758
> I0507 03:27:38.791198  5780 master.cpp:931] Elected as the leading master!
> I0507 03:27:38.791205  5780 master.cpp:752] Recovering from registrar
> I0507 03:27:38.791251  5796 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0507 03:27:38.791323  5797 registrar.cpp:313] Recovering registrar
> I0507 03:27:38.792137  5795 recover.cpp:542] Updating replica status to 
> STARTING
> I0507 03:27:38.807531  5781 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 15.124092ms
> I0507 03:27:38.807559  5781 replica.cpp:320] Persisted replica status to 
> STARTING
> I0507 03:27:38.807621  5781 recover.cpp:451] Replica is in STARTING status
> I0507 03:27:38.809319  5799 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0507 03:27:38.809983  5795 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0507 03:27:38.811204  5778 recover.cpp:542] Updating replica status to VOTING
> I0507 03:27:38.827595  5795 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 16.011355ms
> I0507 03:27:38.827627  5795 replica.cpp:320] Persisted replica status to 
> VOTING
> I0507 03:27:38.827683  5795 recover.cpp:556] Successfully joined the Paxos 
> group
> I0507 03:27:38.827775  5795 recover.cpp:440] Recover process terminated
> I0507 03:27:38.828966  5780 log.cpp:656] Attempting to start the writer
> I0507 03:27:38.831114  5782 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0507 03:27:38.847708  5782 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 16.573137ms
> I0507 03:27:38.847739  5782 replica.cpp:342] Persisted promised to 1
> I0507 03:27:38.848141  5797 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0507 03:27:38.849684  5790 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0507 03:27:38.863777  5790 leveldb.cpp:341] Persisting action (8 bytes) to 
> leveldb took 14.076775ms
> I0507 03:27:38.863801  5790 replica.cpp:676] Persisted action at 0
> I0507 03:27:38.864915  5798 rep

[jira] [Updated] (MESOS-1799) Reconciliation can send out-of-order updates.

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1799:
-
Sprint: Mesos Q3 Sprint 6, Twitter Q4 Sprint 1, Mesosphere Q4 Sprint 1  
(was: Mesos Q3 Sprint 6, Twitter Q4 Sprint 1)

> Reconciliation can send out-of-order updates.
> -
>
> Key: MESOS-1799
> URL: https://issues.apache.org/jira/browse/MESOS-1799
> Project: Mesos
>  Issue Type: Bug
>  Components: master, slave
>Reporter: Benjamin Mahler
>Assignee: Vinod Kone
>
> When a slave re-registers with the master, it currently sends the latest task 
> state for all tasks that are not both terminal and acknowledged.
> However, reconciliation assumes that we always have the latest unacknowledged 
> state of the task represented in the master.
> As a result, out-of-order updates are possible, e.g.
> (1) Slave has task T in TASK_FINISHED, with unacknowledged updates: 
> [TASK_RUNNING, TASK_FINISHED].
> (2) Master fails over.
> (3) New master re-registers the slave with T in TASK_FINISHED.
> (4) Reconciliation request arrives, master sends TASK_FINISHED.
> (5) Slave sends TASK_RUNNING to master, master sends TASK_RUNNING.
> I think the fix here is to preserve the task state invariants in the master, 
> namely, that the master has the latest unacknowledged state of the task. This 
> means when the slave re-registers, it should instead send the latest 
> acknowledged state of each task.





[jira] [Updated] (MESOS-1347) GarbageCollectorIntegrationTest.DiskUsage is flaky.

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1347:
-
Sprint: Q2'14 Sprint 2, Twitter Q4 Sprint 1, Mesosphere Q4 Sprint 1  (was: 
Q2'14 Sprint 2, Twitter Q4 Sprint 1)

> GarbageCollectorIntegrationTest.DiskUsage is flaky.
> ---
>
> Key: MESOS-1347
> URL: https://issues.apache.org/jira/browse/MESOS-1347
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.19.0
>Reporter: Benjamin Mahler
>Assignee: Dominic Hamon
> Fix For: 0.19.0
>
>
> From Jenkins:
> https://builds.apache.org/job/Mesos-Ubuntu-distcheck/79/consoleFull
> {noformat}
> [ RUN  ] GarbageCollectorIntegrationTest.DiskUsage
> Using temporary directory 
> '/tmp/GarbageCollectorIntegrationTest_DiskUsage_pU3Ym7'
> I0507 03:27:38.775058  5758 leveldb.cpp:174] Opened db in 44.343989ms
> I0507 03:27:38.787498  5758 leveldb.cpp:181] Compacted db in 12.411065ms
> I0507 03:27:38.787533  5758 leveldb.cpp:196] Created db iterator in 4008ns
> I0507 03:27:38.787545  5758 leveldb.cpp:202] Seeked to beginning of db in 
> 598ns
> I0507 03:27:38.787552  5758 leveldb.cpp:271] Iterated through 0 keys in the 
> db in 173ns
> I0507 03:27:38.787564  5758 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0507 03:27:38.787858  5777 recover.cpp:425] Starting replica recovery
> I0507 03:27:38.788352  5793 master.cpp:267] Master 
> 20140507-032738-453759884-58462-5758 (hemera.apache.org) started on 
> 140.211.11.27:58462
> I0507 03:27:38.788377  5793 master.cpp:304] Master only allowing 
> authenticated frameworks to register
> I0507 03:27:38.788383  5793 master.cpp:309] Master only allowing 
> authenticated slaves to register
> I0507 03:27:38.788389  5793 credentials.hpp:35] Loading credentials for 
> authentication
> I0507 03:27:38.789064  5779 recover.cpp:451] Replica is in EMPTY status
> W0507 03:27:38.789115  5793 credentials.hpp:48] Failed to stat credentials 
> file 
> 'file:///tmp/GarbageCollectorIntegrationTest_DiskUsage_pU3Ym7/credentials': 
> No such file or directory
> I0507 03:27:38.789489  5779 master.cpp:104] No whitelist given. Advertising 
> offers for all slaves
> I0507 03:27:38.789531  5778 hierarchical_allocator_process.hpp:301] 
> Initializing hierarchical allocator process with master : 
> master@140.211.11.27:58462
> I0507 03:27:38.791007  5788 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0507 03:27:38.791177  5780 master.cpp:921] The newly elected leader is 
> master@140.211.11.27:58462 with id 20140507-032738-453759884-58462-5758
> I0507 03:27:38.791198  5780 master.cpp:931] Elected as the leading master!
> I0507 03:27:38.791205  5780 master.cpp:752] Recovering from registrar
> I0507 03:27:38.791251  5796 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0507 03:27:38.791323  5797 registrar.cpp:313] Recovering registrar
> I0507 03:27:38.792137  5795 recover.cpp:542] Updating replica status to 
> STARTING
> I0507 03:27:38.807531  5781 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 15.124092ms
> I0507 03:27:38.807559  5781 replica.cpp:320] Persisted replica status to 
> STARTING
> I0507 03:27:38.807621  5781 recover.cpp:451] Replica is in STARTING status
> I0507 03:27:38.809319  5799 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0507 03:27:38.809983  5795 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0507 03:27:38.811204  5778 recover.cpp:542] Updating replica status to VOTING
> I0507 03:27:38.827595  5795 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 16.011355ms
> I0507 03:27:38.827627  5795 replica.cpp:320] Persisted replica status to 
> VOTING
> I0507 03:27:38.827683  5795 recover.cpp:556] Successfully joined the Paxos 
> group
> I0507 03:27:38.827775  5795 recover.cpp:440] Recover process terminated
> I0507 03:27:38.828966  5780 log.cpp:656] Attempting to start the writer
> I0507 03:27:38.831114  5782 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0507 03:27:38.847708  5782 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 16.573137ms
> I0507 03:27:38.847739  5782 replica.cpp:342] Persisted promised to 1
> I0507 03:27:38.848141  5797 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0507 03:27:38.849684  5790 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0507 03:27:38.863777  5790 leveldb.cpp:341] Persisting action (8 bytes) to 
> leveldb took 14.076775ms
> I0507 03:27:38.863801  5790 replica.cpp:676] Persisted action at 0
> I0507 03:27:38.864915  5798 replica.cpp:508] Replica

[jira] [Updated] (MESOS-681) Document the reconciliation API.

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-681:

Sprint: Twitter Q4 Sprint 1, Mesosphere Q4 Sprint 1  (was: Twitter Q4 
Sprint 1)

> Document the reconciliation API.
> 
>
> Key: MESOS-681
> URL: https://issues.apache.org/jira/browse/MESOS-681
> Project: Mesos
>  Issue Type: Task
>  Components: documentation
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
> Attachments: 0.19.0.key, 0.19.0.pdf
>
>
> Now that we have a reconciliation mechanism, we should document why it exists 
> and how to use it going forward.
> As we add the lower level API, reconciliation may be done slightly 
> differently. Having documentation that reflects the changes would be great.
> It might also be helpful to upload my slides from the May 19th meetup.





[jira] [Updated] (MESOS-1739) Allow slave reconfiguration on restart

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1739:
-
Sprint: Mesos Q3 Sprint 5, Mesos Q3 Sprint 6, Twitter Q4 Sprint 1, 
Mesosphere Q4 Sprint 1  (was: Mesos Q3 Sprint 5, Mesos Q3 Sprint 6, Twitter Q4 
Sprint 1)

> Allow slave reconfiguration on restart
> --
>
> Key: MESOS-1739
> URL: https://issues.apache.org/jira/browse/MESOS-1739
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Patrick Reilly
>Assignee: Cody Maloney
>
> Make it so that either via a slave restart or an out-of-process "reconfigure" 
> ping, the attributes and resources of a slave can be updated to be a superset 
> of what they used to be.





[jira] [Updated] (MESOS-1830) Expose master stats differentiating between master-generated and slave-generated LOST tasks

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1830:
-
Sprint: Twitter Q4 Sprint 1, Mesosphere Q4 Sprint 1  (was: Twitter Q4 
Sprint 1)

> Expose master stats differentiating between master-generated and 
> slave-generated LOST tasks
> ---
>
> Key: MESOS-1830
> URL: https://issues.apache.org/jira/browse/MESOS-1830
> Project: Mesos
>  Issue Type: Story
>  Components: master
>Reporter: Bill Farner
>Assignee: Dominic Hamon
>Priority: Minor
>
> The master exports a monotonically-increasing counter of tasks transitioned 
> to TASK_LOST.  This loses fidelity of the source of the lost task.  A first 
> step in exposing the source of lost tasks might be to just differentiate 
> between TASK_LOST transitions initiated by the master vs the slave (and maybe 
> bad input from the scheduler).





[jira] [Updated] (MESOS-1902) Support persistent disk resource.

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1902:
-
Sprint: Twitter Q4 Sprint 1, Mesosphere Q4 Sprint 1  (was: Twitter Q4 
Sprint 1)

> Support persistent disk resource.
> -
>
> Key: MESOS-1902
> URL: https://issues.apache.org/jira/browse/MESOS-1902
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jie Yu
>
> Mesos needs to provide a way to allow tasks to write persistent data which 
> won’t be garbage collected. For example, a task can write its persistent data 
> to some predefined directory. When this task finishes, the framework can 
> launch a new task which is able to access the persistent data written by the 
> previous task which Mesos would have usually garbage-collected.
> One way to achieve that is to provide a new type of disk resource which is 
> persistent. We call it a persistent disk resource. When a framework launches a 
> task using persistent disk resources, the data the task writes will be 
> persisted. When the framework launches a new task using the same persistent 
> disk resource (after the previous task finishes), the new task will be able 
> to access the data written by the previous task.
> The persistent disk resource should be able to survive slave reboot or slave 
> info/id change.





[jira] [Updated] (MESOS-1456) Metric lifetime should be tied to process runstate, not lifetime.

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1456:
-
Sprint: Mesos Q3 Sprint 6, Twitter Q4 Sprint 1, Mesosphere Q4 Sprint 1  
(was: Mesos Q3 Sprint 6, Twitter Q4 Sprint 1)

> Metric lifetime should be tied to process runstate, not lifetime.
> -
>
> Key: MESOS-1456
> URL: https://issues.apache.org/jira/browse/MESOS-1456
> Project: Mesos
>  Issue Type: Bug
>  Components: statistics
>Affects Versions: 0.19.0
>Reporter: Dominic Hamon
>Assignee: Dominic Hamon
>
> The usual pattern for termination of processes is {{terminate(..); wait(..); 
> delete ..;}} but the {{SchedulerProcess}} is terminated and then deleted some 
> time later.
> If the metrics endpoint is accessed within that period, it never returns as 
> it tries to access a {{Gauge}} that has a reference to a valid PID that is 
> not getting any timeslices (the {{SchedulerProcess}}). A one-off fix can be 
> made to the {{SchedulerProcess}} to move the metrics add/remove calls to 
> {{initialize}} and {{finalize}}, but this should be the general pattern for 
> every process with metrics. 





[jira] [Updated] (MESOS-1765) Use PID namespace to avoid freezing cgroup

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1765:
-
Sprint: Mesos Q3 Sprint 5, Mesos Q3 Sprint 6, Twitter Q4 Sprint 1, 
Mesosphere Q4 Sprint 1  (was: Mesos Q3 Sprint 5, Mesos Q3 Sprint 6, Twitter Q4 
Sprint 1)

> Use PID namespace to avoid freezing cgroup
> --
>
> Key: MESOS-1765
> URL: https://issues.apache.org/jira/browse/MESOS-1765
> Project: Mesos
>  Issue Type: Story
>  Components: containerization
>Reporter: Cong Wang
>Assignee: Ian Downes
>
> There is some known kernel issue when we freeze the whole cgroup upon OOM. 
> Mesos probably can just use PID namespace so that we will only need to kill 
> the "init" of the pid namespace, instead of freezing all the processes and 
> killing them one by one. But I am not quite sure if this would break the 
> existing code.





[jira] [Updated] (MESOS-1751) Request for "stats.json" cannot be fulfilled after stopping the framework

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1751:
-
Sprint: Mesos Q3 Sprint 6, Twitter Q4 Sprint 1, Mesosphere Q4 Sprint 1  
(was: Mesos Q3 Sprint 6, Twitter Q4 Sprint 1)

> Request for "stats.json" cannot be fulfilled after stopping the framework 
> --
>
> Key: MESOS-1751
> URL: https://issues.apache.org/jira/browse/MESOS-1751
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.21.0
> Environment: Test case launched on Mac OS X Mavericks.
>Reporter: Alexander Rukletsov
>Assignee: Dominic Hamon
>Priority: Minor
>
> A request for "stats.json" to the master from a test case doesn't work after 
> calling the framework's {{driver.stop()}}. However, it works for "state.json". I 
> think the problem is related to the {{stats()}} continuation {{_stats()}}. The 
> following test illustrates the issue:
> {code:title=TestCase.cpp|borderStyle=solid}
> TEST_F(MasterTest, RequestAfterDriverStop)
> {
>   Try<PID<Master>> master = StartMaster();
>   ASSERT_SOME(master);
>   Try<PID<Slave>> slave = StartSlave();
>   ASSERT_SOME(slave);
>   MockScheduler sched;
>   MesosSchedulerDriver driver(
>   &sched, DEFAULT_FRAMEWORK_INFO, master.get(), DEFAULT_CREDENTIAL);
>   driver.start();
>   
>   Future<process::http::Response> response_before =
>   process::http::get(master.get(), "stats.json");
>   AWAIT_READY(response_before);
>   driver.stop();
>   Future<process::http::Response> response_after =
>   process::http::get(master.get(), "stats.json");
>   AWAIT_READY(response_after);
>   driver.join();
>   Shutdown();  // Must shutdown before 'containerizer' gets deallocated.
> }
> {code}





[jira] [Updated] (MESOS-1817) Completed tasks remains in TASK_RUNNING when framework is disconnected

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1817:
-
Sprint: Twitter Q4 Sprint 1, Mesosphere Q4 Sprint 1  (was: Twitter Q4 
Sprint 1)

> Completed tasks remains in TASK_RUNNING when framework is disconnected
> --
>
> Key: MESOS-1817
> URL: https://issues.apache.org/jira/browse/MESOS-1817
> Project: Mesos
>  Issue Type: Bug
>Reporter: Niklas Quarfot Nielsen
>Assignee: Vinod Kone
>
> We have run into a problem that causes tasks which complete while a framework 
> is disconnected (and has a failover timeout set) to remain in a running state 
> even though the tasks actually finish. This hogs the cluster and gives users an 
> inconsistent view of the cluster state. Going to the slave, the task is 
> finished. Going to the master, the task is still in a non-terminal state. 
> When the scheduler reattaches or the failover timeout expires, the tasks 
> finish correctly. The current workflow of this scheduler has a long 
> failover timeout, but may on the other hand never reattach.
> Here is a test framework we have been able to reproduce the issue with: 
> https://gist.github.com/nqn/9b9b1de9123a6e836f54
> It launches many short-lived tasks (1 second sleep) and when killing the 
> framework instance, the master reports the tasks as running even after 
> several minutes: 
> http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png
> When clicking on one of the slaves where, for example, task 49 runs, the 
> slave knows that it completed: 
> http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png
> Here is the log of a mesos-local instance where I reproduced it: 
> https://gist.github.com/nqn/f7ee20601199d70787c0 (Here task 10 to 19 are 
> stuck in running state).
> There is a lot of output, so here is a filtered log for task 10: 
> https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d
> The problem turns out to be an issue with the ack-cycle of status updates:
> If the framework disconnects (with a failover timeout set), the status update 
> manager on the slave will keep trying to send the front of the status update 
> stream to the master (which in turn forwards it to the framework). If the 
> first status update after the disconnect is terminal, things work out fine; 
> the master picks the terminal state up, removes the task, and releases the 
> resources.
> If, on the other hand, a non-terminal status is at the front of the stream, the 
> master will never know that the task finished (or failed) before the framework 
> reconnects.
> During a discussion on the dev mailing list 
> (http://mail-archives.apache.org/mod_mbox/mesos-dev/201409.mbox/%3cCADKthhAVR5mrq1s9HXw1BB_XFALXWWxjutp7MV4y3wP-Bh=a...@mail.gmail.com%3e)
>  we enumerated a couple of options to solve this problem.
> First off, having two ack-cycles: one between masters and slaves and one 
> between masters and frameworks, would be ideal. We would be able to replay 
> the statuses in order while keeping the master state current. However, this 
> requires us to persist the master state in a replicated storage.
> As a first pass, we can make sure that tasks caught in a running state don't 
> hog the cluster once completed while the framework is disconnected.
> Here is a proof-of-concept to work out of: 
> https://github.com/nqn/mesos/tree/niklas/status-update-disconnect/
> A new (optional) field has been added to the internal status update message:
> https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/messages/messages.proto#L68
> Which makes it possible for the status update manager to set the field, if 
> the latest status was terminal: 
> https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/slave/status_update_manager.cpp#L501
> I added a test which should high-light the issue as well:
> https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/tests/fault_tolerance_tests.cpp#L2478
> I would love some input on the approach before moving on.
> There are rough edges in the PoC which (of course) should be addressed before 
> bringing it up for review.
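The front-of-stream retry behavior described above can be modeled with a small toy sketch (illustrative Python only, not Mesos code; `master_sees` and its arguments are hypothetical): while the framework is disconnected no acknowledgements arrive, so a non-terminal update at the front of the stream blocks the terminal one from ever reaching the master.

```python
from collections import deque

def master_sees(stream, framework_connected):
    """Toy model: the slave retries only the FRONT of the status update
    stream; an update is popped only when the framework acks it."""
    stream = deque(stream)
    last_seen = None
    while stream:
        last_seen = stream[0]      # forwarded to the master
        if not framework_connected:
            break                  # no ack ever arrives; retried forever
        stream.popleft()           # ack received; advance to the next update
    return last_seen

# Terminal update at the front: the master learns the task finished
# even while the framework is disconnected.
assert master_sees(["TASK_FINISHED"], False) == "TASK_FINISHED"
# Non-terminal update at the front: the terminal update is stuck behind it.
assert master_sees(["TASK_RUNNING", "TASK_FINISHED"], False) == "TASK_RUNNING"
# Once the framework acks updates, the stream drains normally.
assert master_sees(["TASK_RUNNING", "TASK_FINISHED"], True) == "TASK_FINISHED"
```

This also shows why the proposed "latest status was terminal" hint helps: it lets the master learn the outcome without waiting for the blocked stream to drain.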





[jira] [Updated] (MESOS-1875) os::killtree() incorrectly returns early if pid has terminated

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1875:
-
Sprint: Twitter Q4 Sprint 1, Mesosphere Q4 Sprint 1  (was: Twitter Q4 
Sprint 1)

> os::killtree() incorrectly returns early if pid has terminated
> --
>
> Key: MESOS-1875
> URL: https://issues.apache.org/jira/browse/MESOS-1875
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.18.0, 0.18.1, 0.18.2, 0.19.0, 0.20.0, 0.19.1, 0.20.1
>Reporter: Ian Downes
>Assignee: Ian Downes
>
> If groups == true and/or sessions == true then os::killtree() should continue 
> to signal all processes in the process group and/or session, even if the 
> leading pid has terminated.
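A minimal sketch of the intended behavior (illustrative Python, not the actual `os::killtree()` implementation; `kill_group` is a hypothetical name): even when the leading pid has already exited, the process group should still be swept rather than returning early.

```python
import os
import signal
import subprocess

def kill_group(leader_pid, pgid, sig=signal.SIGTERM):
    """Keep signalling the whole group even if the leader already exited."""
    try:
        os.kill(leader_pid, sig)   # the leading pid may already be gone
    except ProcessLookupError:
        pass                       # do NOT return early here...
    try:
        os.killpg(pgid, sig)       # ...still sweep the rest of the group
    except ProcessLookupError:
        pass                       # entire group already exited

# Demo: a leader that exits immediately; kill_group must not bail out early
# or raise just because the leading pid is gone.
proc = subprocess.Popen(["true"], preexec_fn=os.setsid)  # own process group
proc.wait()
kill_group(proc.pid, proc.pid)
```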





[jira] [Updated] (MESOS-1586) Isolate system directories, e.g., per-container /tmp

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1586:
-
Sprint: Q3 Sprint 1, Q3 Sprint 2, Q3 Sprint 3, Q3 Sprint 4, Mesos Q3 Sprint 
5, Mesos Q3 Sprint 6, Twitter Q4 Sprint 1, Mesosphere Q4 Sprint 1  (was: Q3 
Sprint 1, Q3 Sprint 2, Q3 Sprint 3, Q3 Sprint 4, Mesos Q3 Sprint 5, Mesos Q3 
Sprint 6, Twitter Q4 Sprint 1)

> Isolate system directories, e.g., per-container /tmp
> 
>
> Key: MESOS-1586
> URL: https://issues.apache.org/jira/browse/MESOS-1586
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation
>Affects Versions: 0.20.0
>Reporter: Ian Downes
>Assignee: Ian Downes
>
> Ideally, tasks should not write outside their sandbox (executor work 
> directory) but pragmatically they may need to write to /tmp, /var/tmp, or 
> some other directory.
> 1) We should include any such files in disk usage and quota.
> 2) We should make these "shared" directories private, i.e., each container 
> has its own.
> 3) We should make the lifetime of any such files the same as the executor 
> work directory.





[jira] [Updated] (MESOS-1853) Remove /proc and /sys remounts from port_mapping isolator

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1853:
-
Sprint: Twitter Q4 Sprint 1, Mesosphere Q4 Sprint 1  (was: Twitter Q4 
Sprint 1)

> Remove /proc and /sys remounts from port_mapping isolator
> -
>
> Key: MESOS-1853
> URL: https://issues.apache.org/jira/browse/MESOS-1853
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Affects Versions: 0.20.0, 0.20.1
>Reporter: Ian Downes
>Assignee: Ian Downes
>
> /proc/net reflects a new network namespace regardless, and remounting doesn't 
> actually do what we expected anyway, i.e., it's not sufficient for a new pid 
> namespace and a new mount is required.





[jira] [Assigned] (MESOS-1143) Add a TASK_ERROR task status.

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon reassigned MESOS-1143:


Assignee: Dominic Hamon

> Add a TASK_ERROR task status.
> -
>
> Key: MESOS-1143
> URL: https://issues.apache.org/jira/browse/MESOS-1143
> Project: Mesos
>  Issue Type: Improvement
>  Components: framework, master
>Reporter: Benjamin Hindman
>Assignee: Dominic Hamon
>
> During task validation we drop tasks that have errors and send TASK_LOST 
> status updates. In most circumstances a framework will want to relaunch a 
> task that has gone lost, and in the event the task is actually malformed 
> (thus invalid) this will result in an infinite loop of sending a task and 
> having it go lost.





[jira] [Commented] (MESOS-343) Expose TASK_FAILED reason to Frameworks.

2014-10-20 Thread Dominic Hamon (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177141#comment-14177141
 ] 

Dominic Hamon commented on MESOS-343:
-

https://reviews.apache.org/r/26382/

> Expose TASK_FAILED reason to Frameworks.
> 
>
> Key: MESOS-343
> URL: https://issues.apache.org/jira/browse/MESOS-343
> Project: Mesos
>  Issue Type: Story
>Reporter: Benjamin Mahler
>Assignee: Dominic Hamon
>Priority: Minor
>
> We now have a message string inside TaskStatus that provides human readable 
> information about TASK_FAILED.
> It would be good to add some structure to the failure reasons, for framework 
> schedulers to act on programmatically.
> E.g.
> enum TaskFailure {
>   EXECUTOR_OOM;
>   EXECUTOR_OUT_OF_DISK;
>   EXECUTOR_TERMINATED;
>   SLAVE_LOST;
>   etc..
> }





[jira] [Commented] (MESOS-1143) Add a TASK_ERROR task status.

2014-10-20 Thread Dominic Hamon (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177142#comment-14177142
 ] 

Dominic Hamon commented on MESOS-1143:
--

https://reviews.apache.org/r/26382/

> Add a TASK_ERROR task status.
> -
>
> Key: MESOS-1143
> URL: https://issues.apache.org/jira/browse/MESOS-1143
> Project: Mesos
>  Issue Type: Improvement
>  Components: framework, master
>Reporter: Benjamin Hindman
>Assignee: Dominic Hamon
>
> During task validation we drop tasks that have errors and send TASK_LOST 
> status updates. In most circumstances a framework will want to relaunch a 
> task that has gone lost, and in the event the task is actually malformed 
> (thus invalid) this will result in an infinite loop of sending a task and 
> having it go lost.





[jira] [Assigned] (MESOS-1930) Expose TASK_KILLED reason.

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon reassigned MESOS-1930:


Assignee: Dominic Hamon

> Expose TASK_KILLED reason.
> --
>
> Key: MESOS-1930
> URL: https://issues.apache.org/jira/browse/MESOS-1930
> Project: Mesos
>  Issue Type: Story
>Reporter: Alexander Rukletsov
>Assignee: Dominic Hamon
>Priority: Minor
>
> A task process may be killed by a SIGTERM or a SIGKILL. The only way to 
> check how the task process exited is to examine the message: 
> {{status.message().find("Terminated")}}. However, a task may not run in its 
> own process, hence the executor may not be able to provide an exit status. 
> What we actually want is an artificial task exit status that is rendered by 
> the executor.
> This may be resolved by adding second-tier states or state explanations. Here 
> is a link to a discussion: https://reviews.apache.org/r/26382/





[jira] [Commented] (MESOS-1930) Expose TASK_KILLED reason.

2014-10-20 Thread Dominic Hamon (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177143#comment-14177143
 ] 

Dominic Hamon commented on MESOS-1930:
--

https://reviews.apache.org/r/26382/

> Expose TASK_KILLED reason.
> --
>
> Key: MESOS-1930
> URL: https://issues.apache.org/jira/browse/MESOS-1930
> Project: Mesos
>  Issue Type: Story
>Reporter: Alexander Rukletsov
>Assignee: Dominic Hamon
>Priority: Minor
>
> A task process may be killed by a SIGTERM or a SIGKILL. The only way to 
> check how the task process exited is to examine the message: 
> {{status.message().find("Terminated")}}. However, a task may not run in its 
> own process, hence the executor may not be able to provide an exit status. 
> What we actually want is an artificial task exit status that is rendered by 
> the executor.
> This may be resolved by adding second-tier states or state explanations. Here 
> is a link to a discussion: https://reviews.apache.org/r/26382/





[jira] [Assigned] (MESOS-343) Expose TASK_FAILED reason to Frameworks.

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon reassigned MESOS-343:
---

Assignee: Dominic Hamon

> Expose TASK_FAILED reason to Frameworks.
> 
>
> Key: MESOS-343
> URL: https://issues.apache.org/jira/browse/MESOS-343
> Project: Mesos
>  Issue Type: Story
>Reporter: Benjamin Mahler
>Assignee: Dominic Hamon
>Priority: Minor
>
> We now have a message string inside TaskStatus that provides human readable 
> information about TASK_FAILED.
> It would be good to add some structure to the failure reasons, for framework 
> schedulers to act on programmatically.
> E.g.
> enum TaskFailure {
>   EXECUTOR_OOM;
>   EXECUTOR_OUT_OF_DISK;
>   EXECUTOR_TERMINATED;
>   SLAVE_LOST;
>   etc..
> }





[jira] [Closed] (MESOS-1562) Mesos MasterInfo can not deal with IPv6

2014-10-20 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon closed MESOS-1562.

Resolution: Duplicate

> Mesos MasterInfo can not deal with IPv6
> ---
>
> Key: MESOS-1562
> URL: https://issues.apache.org/jira/browse/MESOS-1562
> Project: Mesos
>  Issue Type: Bug
>  Components: general
>Affects Versions: 0.19.0
>Reporter: Henning Schmiedehausen
>
> The mesos.proto contains 
> message MasterInfo {
>   required string id = 1;
>   required uint32 ip = 2;
> ...
> the uint32 cannot hold an IPv6 address (which has 128 bits). 
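For illustration, a quick sanity check with Python's standard library (the addresses are documentation-range examples, not from this issue): an IPv4 address fits in the existing 32-bit field, while an IPv6 address needs up to 128 bits.

```python
import ipaddress

v4 = int(ipaddress.ip_address("192.0.2.1"))     # IPv4 documentation address
v6 = int(ipaddress.ip_address("2001:db8::1"))   # IPv6 documentation address

assert v4 < 2**32              # representable in the current uint32 field
assert v6 >= 2**32             # overflows a uint32
assert v6.bit_length() <= 128  # fits only in a 128-bit representation
```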





[jira] [Updated] (MESOS-1937) Create a document explaining the --modules flag

2014-10-20 Thread Kapil Arya (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kapil Arya updated MESOS-1937:
--
Shepherd: Niklas Quarfot Nielsen

> Create a document explaining the --modules flag
> ---
>
> Key: MESOS-1937
> URL: https://issues.apache.org/jira/browse/MESOS-1937
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>Priority: Blocker
>
> As the protobuf/JSON for --modules is evolving, it is harder to explain 
> everything in the command-line help.  We should create a man-page-style 
> document that explains all the intricacies of the --modules flag and refer to 
> the document in the command-line help.





[jira] [Commented] (MESOS-1859) src/examples/docker_no_executor_framework.cpp uses wrong ContainerInfo

2014-10-20 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177077#comment-14177077
 ] 

Timothy Chen commented on MESOS-1859:
-

Hi, thanks for reporting this! You're correct, the image is wrong. Would you 
like to submit a patch to fix it?

> src/examples/docker_no_executor_framework.cpp uses wrong ContainerInfo
> --
>
> Key: MESOS-1859
> URL: https://issues.apache.org/jira/browse/MESOS-1859
> Project: Mesos
>  Issue Type: Bug
>Reporter: Kevin Matzen
>Priority: Minor
>
> src/examples/docker_no_executor_framework.cpp sets up the docker image using:
> CommandInfo::ContainerInfo* container =
>   task.mutable_command()->mutable_container();
> container->set_image("docker:///busybox");
> As far as I can tell, the slave expects it to be configured as follows:
>   ContainerInfo* container = task.mutable_container();
>   container->set_type(ContainerInfo::DOCKER);
>   container->mutable_docker()->set_image("busybox");
> Did I understand correctly?





[jira] [Assigned] (MESOS-1937) Create a document explaining the --modules flag

2014-10-20 Thread Kapil Arya (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kapil Arya reassigned MESOS-1937:
-

Assignee: Kapil Arya

> Create a document explaining the --modules flag
> ---
>
> Key: MESOS-1937
> URL: https://issues.apache.org/jira/browse/MESOS-1937
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>Priority: Blocker
>
> As the protobuf/JSON for --modules is evolving, it is harder to explain 
> everything in the command-line help.  We should create a man-page-style 
> document that explains all the intricacies of the --modules flag and refer to 
> the document in the command-line help.





[jira] [Updated] (MESOS-1948) Docker tests are flaky

2014-10-20 Thread Timothy Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen updated MESOS-1948:

Labels: docker  (was: )

> Docker tests are flaky
> --
>
> Key: MESOS-1948
> URL: https://issues.apache.org/jira/browse/MESOS-1948
> Project: Mesos
>  Issue Type: Bug
>Reporter: Timothy Chen
>Assignee: Timothy Chen
>  Labels: docker
>
> The docker unit tests may fail occasionally because of docker issues and 
> test ordering.
> More details can be found on ReviewBoard.





[jira] [Created] (MESOS-1948) Docker tests are flaky

2014-10-20 Thread Timothy Chen (JIRA)
Timothy Chen created MESOS-1948:
---

 Summary: Docker tests are flaky
 Key: MESOS-1948
 URL: https://issues.apache.org/jira/browse/MESOS-1948
 Project: Mesos
  Issue Type: Bug
Reporter: Timothy Chen
Assignee: Timothy Chen


The docker unit tests may fail occasionally because of docker issues and 
test ordering.
More details can be found on ReviewBoard.





[jira] [Updated] (MESOS-1931) Add support for isolator modules

2014-10-20 Thread Kapil Arya (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kapil Arya updated MESOS-1931:
--
Sprint: Mesosphere Q4 Sprint 1

> Add support for isolator modules
> 
>
> Key: MESOS-1931
> URL: https://issues.apache.org/jira/browse/MESOS-1931
> Project: Mesos
>  Issue Type: Task
>Reporter: Niklas Quarfot Nielsen
>






[jira] [Updated] (MESOS-1248) Use JSON instead of our own format for passing URI information to mesos-fetcher

2014-10-20 Thread Bernd Mathiske (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bernd Mathiske updated MESOS-1248:
--
Sprint: Mesosphere Q4 Sprint 1

> Use JSON instead of our own format for passing URI information to 
> mesos-fetcher
> ---
>
> Key: MESOS-1248
> URL: https://issues.apache.org/jira/browse/MESOS-1248
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Reporter: Benjamin Hindman
>Assignee: Bernd Mathiske
>  Labels: newbie
>
> We should just send JSON in the environment variable rather than our own 
> format for MESOS_EXECUTOR_URIS. To simplify we might as well send the entire 
> CommandInfo rather than pulling out the URIs. This would boil down to just 
> the following in the containerizer:
> environment["MESOS_COMMAND_INFO"] = stringify(JSON::Protobuf(commandInfo));
> And something along the lines of the following in the fetcher:
> Try<JSON::Object> parse = 
> JSON::parse<JSON::Object>(os::getenv("MESOS_COMMAND_INFO"));
> if (parse.isError()) {
>   ...
> }
> Try<CommandInfo> commandInfo = protobuf::parse<CommandInfo>(parse.get());
> if (commandInfo.isError()) {
>   ...
> }
> foreach (const CommandInfo::URI& uri, commandInfo.get().uris()) {
>   ...
> }





[jira] [Updated] (MESOS-1931) Add support for isolator modules

2014-10-20 Thread Kapil Arya (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kapil Arya updated MESOS-1931:
--
Shepherd: Niklas Quarfot Nielsen

> Add support for isolator modules
> 
>
> Key: MESOS-1931
> URL: https://issues.apache.org/jira/browse/MESOS-1931
> Project: Mesos
>  Issue Type: Task
>Reporter: Niklas Quarfot Nielsen
>Assignee: Kapil Arya
>






[jira] [Assigned] (MESOS-1931) Add support for isolator modules

2014-10-20 Thread Kapil Arya (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kapil Arya reassigned MESOS-1931:
-

Assignee: Kapil Arya

> Add support for isolator modules
> 
>
> Key: MESOS-1931
> URL: https://issues.apache.org/jira/browse/MESOS-1931
> Project: Mesos
>  Issue Type: Task
>Reporter: Niklas Quarfot Nielsen
>Assignee: Kapil Arya
>






[jira] [Assigned] (MESOS-1925) Docker kill does not allow containers to exit gracefully

2014-10-20 Thread Timothy Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen reassigned MESOS-1925:
---

Assignee: Timothy Chen

> Docker kill does not allow containers to exit gracefully
> 
>
> Key: MESOS-1925
> URL: https://issues.apache.org/jira/browse/MESOS-1925
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Affects Versions: 0.20.1
>Reporter: Ryan Thomas
>Assignee: Timothy Chen
>
> The docker implementation uses the docker kill command, which immediately 
> terminates the container, not allowing it to exit gracefully.
> We should be using the docker stop command, which will send a kill after a 
> predetermined amount of time.





[jira] [Updated] (MESOS-1248) Use JSON instead of our own format for passing URI information to mesos-fetcher

2014-10-20 Thread Bernd Mathiske (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bernd Mathiske updated MESOS-1248:
--
Sprint:   (was: Mesosphere Q4 Sprint 1)

> Use JSON instead of our own format for passing URI information to 
> mesos-fetcher
> ---
>
> Key: MESOS-1248
> URL: https://issues.apache.org/jira/browse/MESOS-1248
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Reporter: Benjamin Hindman
>Assignee: Bernd Mathiske
>  Labels: newbie
>
> We should just send JSON in the environment variable rather than our own 
> format for MESOS_EXECUTOR_URIS. To simplify we might as well send the entire 
> CommandInfo rather than pulling out the URIs. This would boil down to just 
> the following in the containerizer:
> environment["MESOS_COMMAND_INFO"] = stringify(JSON::Protobuf(commandInfo));
> And something along the lines of the following in the fetcher:
> Try<JSON::Object> parse = 
> JSON::parse<JSON::Object>(os::getenv("MESOS_COMMAND_INFO"));
> if (parse.isError()) {
>   ...
> }
> Try<CommandInfo> commandInfo = protobuf::parse<CommandInfo>(parse.get());
> if (commandInfo.isError()) {
>   ...
> }
> foreach (const CommandInfo::URI& uri, commandInfo.get().uris()) {
>   ...
> }





[jira] [Updated] (MESOS-1824) when "docker ps -a" returns 400+ lines enabling docker containerizer results in all executors dying

2014-10-20 Thread Timothy Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen updated MESOS-1824:

Target Version/s: 0.21.0

> when "docker ps -a" returns 400+ lines enabling docker containerizer results 
> in all executors dying
> ---
>
> Key: MESOS-1824
> URL: https://issues.apache.org/jira/browse/MESOS-1824
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Jay Buffington
>Assignee: Timothy Chen
>
> To reproduce:
> # run this one-liner on your slave to create 400 exited docker containers:
> {noformat}
> for i in `seq 1 400`; do docker run busybox:latest echo "hello" ; done;
> {noformat}
> # Start mesos-slave with only mesos containerizer enabled
> # Launch tasks that use an executor (which uses libmesos)
> # Restart mesos-slave process with --containerizer=docker,mesos
> # See mesos-slave fork "docker ps -a" and never return
> # Note that this mesos-slave never reregisters with master
> # Wait at least 10 minutes and see executors commit suicide, which kills all 
> of the tasks on your system.  From executor log:
> {noformat}
> I0919 21:24:14.018127 21778 exec.cpp:379] Executor asked to shutdown
> I0919 21:24:14.018812 21771 exec.cpp:78] Scheduling shutdown of the executor
> I0919 21:24:14.020514 21778 exec.cpp:394] Executor::shutdown took 1.866382ms
> I0919 21:24:16.000500 21771 exec.cpp:525] Executor sending status update 
> TASK_KILLED (UUID: bfd3969c-ad0a-455a-93fe-06c37bdee513) for task 
> 1411160025479-another-task-0-b5e24381-3353-43d4-9587-ffef9ccf2f38 of 
> framework 20140814-221057-1208029356-5050-10525-
> I0919 21:24:16.030253 21772 exec.cpp:332] Ignoring status update 
> acknowledgement bfd3969c-ad0a-455a-93fe-06c37bdee513 for task 
> 1411160025479-another-task-0-b5e24381-3353-43d4-9587-ffef9ccf2f38 of 
> framework 20140814-221057-1208029356-5050-10525- because the driver is 
> aborted!
> I0919 21:24:19.021966 21778 exec.cpp:86] Committing suicide by killing the 
> process group
> {noformat}
> # mesos-slave fails to tell the master about the task being killed, with this 
> message in the log:
> {noformat}
> W0918 01:02:57.252231 11725 status_update_manager.cpp:381] Not
> forwarding status update TASK_KILLED (UUID:
> 6fbacbcf-ad0f-4e89-89ee-e9f88a618573) for task
> 1410298578043-some-task-30-29279377-fdf2-4bb7-b862-852adddea09c
> of framework 20140522-213145-1749004561-5050-29512- because no
> master is elected yet
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1833) Running docker container with colon in executor id generates error

2014-10-20 Thread Timothy Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen updated MESOS-1833:

  Sprint: Mesosphere Q4 Sprint 1
Shepherd: Benjamin Hindman  (was: Elizabeth Lingg)

> Running docker container with colon in executor id generates error
> --
>
> Key: MESOS-1833
> URL: https://issues.apache.org/jira/browse/MESOS-1833
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.20.0
> Environment: ubuntu (mesosphere vagrant vm)
>Reporter: Elizabeth Lingg
>Assignee: Timothy Chen
>  Labels: docker
>
> I created and launched a container successfully in chronos, but when mesos 
> ran the docker container, docker did not accept the volumes setting due to 
> the colon in the executor id (-v option). Here is the executor id, which is 
> valid: ct:141167016:0:lldocker.  
> In Mesos, there will be a fix to avoid using the host directory directly, by 
> use of a symlink and mapping.  However, ideally Docker will fix this issue.  
> They should accept executor ids with colons, as the format is valid.
> Here is the error log:
> Container '8fdb0cd7-86f8-4bc9-bd1b-d36f86663bb3' for executor 
> 'ct:141167016:0:lldocker' of framework 
> '20140925-174859-16842879-5050-1573-' failed to start: Failed to 'docker 
> run -d -c 512 -m 536870912 -e mesos_task_id=ct:141167016:0:lldocker -e 
> CHRONOS_JOB_OWNER= -e MESOS_SANDBOX=/mnt/mesos/sandbox -v 
> /tmp/mesos/slaves/20140925-181954-16842879-5050-1560-0/frameworks/20140925-174859-16842879-5050-1573-/executors/ct:141167016:0:lldocker/runs/8fdb0cd7-86f8-4bc9-bd1b-d36f86663bb3:/mnt/mesos/sandbox
>  --net host --entrypoint /bin/sh --name 
> mesos-8fdb0cd7-86f8-4bc9-bd1b-d36f86663bb3 libmesos/ubuntu -c while sleep 10; 
> do date =u %T; done': exit status = exited with status 2 stderr = invalid 
> value 
> "/tmp/mesos/slaves/20140925-181954-16842879-5050-1560-0/frameworks/20140925-174859-16842879-5050-1573-/executors/ct:141167016:0:lldocker/runs/8fdb0cd7-86f8-4bc9-bd1b-d36f86663bb3:/mnt/mesos/sandbox"
>  for flag -v: bad format for volumes: 
> /tmp/mesos/slaves/20140925-181954-16842879-5050-1560-0/frameworks/20140925-174859-16842879-5050-1573-/executors/ct:141167016:0:lldocker/runs/8fdb0cd7-86f8-4bc9-bd1b-d36f86663bb3:/mnt/mesos/sandbox





[jira] [Updated] (MESOS-1925) Docker kill does not allow containers to exit gracefully

2014-10-20 Thread Timothy Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen updated MESOS-1925:

Sprint: Mesosphere Q4 Sprint 1

> Docker kill does not allow containers to exit gracefully
> 
>
> Key: MESOS-1925
> URL: https://issues.apache.org/jira/browse/MESOS-1925
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Affects Versions: 0.20.1
>Reporter: Ryan Thomas
>Assignee: Timothy Chen
>
> The docker implementation uses the docker kill command, which immediately 
> terminates the container, not allowing it to exit gracefully.
> We should be using the docker stop command, which sends a kill only after a 
> predetermined amount of time.





[jira] [Updated] (MESOS-1248) Use JSON instead of our own format for passing URI information to mesos-fetcher

2014-10-20 Thread Bernd Mathiske (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bernd Mathiske updated MESOS-1248:
--
  Sprint: Mesosphere Q4 Sprint 1
Shepherd: Benjamin Hindman
Assignee: Bernd Mathiske  (was: Benjamin Hindman)

> Use JSON instead of our own format for passing URI information to 
> mesos-fetcher
> ---
>
> Key: MESOS-1248
> URL: https://issues.apache.org/jira/browse/MESOS-1248
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Reporter: Benjamin Hindman
>Assignee: Bernd Mathiske
>  Labels: newbie
>
> We should just send JSON in the environment variable rather than our own 
> format for MESOS_EXECUTOR_URIS. To simplify we might as well send the entire 
> CommandInfo rather than pulling out the URIs. This would boil down to just 
> the following in the containerizer:
> environment["MESOS_COMMAND_INFO"] = stringify(JSON::Protobuf(commandInfo));
> And something along the lines of the following in the fetcher:
> Try<JSON::Object> parse = 
> JSON::parse(os::getenv("MESOS_COMMAND_INFO"));
> if (parse.isError()) {
>   ...
> }
> Try<CommandInfo> commandInfo = protobuf::parse<CommandInfo>(parse.get());
> if (commandInfo.isError()) {
>   ...
> }
> foreach (const CommandInfo::URI& uri, commandInfo.get().uris()) {
>   ...
> }





[jira] [Updated] (MESOS-1570) Make check Error when Building Mesos in a Docker container

2014-10-20 Thread Timothy Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen updated MESOS-1570:

Labels: Docker  (was: )

> Make check Error when Building Mesos in a Docker container 
> ---
>
> Key: MESOS-1570
> URL: https://issues.apache.org/jira/browse/MESOS-1570
> Project: Mesos
>  Issue Type: Bug
>Reporter: Isabel Jimenez
>Priority: Minor
>  Labels: Docker
>
> When building Mesos inside a Docker container, it is currently impossible 
> to run tests, even when you run Docker in --privileged mode. There is a test 
> in stout that sets all the namespaces and libcontainer does not support 
> setting 'user' namespace (more information 
> [here|https://github.com/docker/libcontainer/blob/master/namespaces/nsenter.go#L136]).
>  This is the error:
> {code:title=Make check failed test|borderStyle=solid}
> [--] 1 test from OsSetnsTest
> [ RUN  ] OsSetnsTest.setns
> ../../../../3rdparty/libprocess/3rdparty/stout/tests/os/setns_tests.cpp:43: 
> Failure
> os::setns(::getpid(), ns): Invalid argument
> [  FAILED  ] OsSetnsTest.setns (7 ms)
> [--] 1 test from OsSetnsTest (7 ms total)
> [  FAILED  ] 1 test, listed below:
> [  FAILED  ] OsSetnsTest.setns
>  1 FAILED TEST
> {code}
> This can be disabled, as Mesos does not need to set the 'user' namespace. I 
> don't know if Docker will support setting the user namespace one day, since 
> it's a new kernel feature. What would be the best approach to this issue? 
> (Disabling setting of the 'user' namespace in stout, or disabling just this 
> test?)





[jira] [Resolved] (MESOS-1824) when "docker ps -a" returns 400+ lines enabling docker containerizer results in all executors dying

2014-10-20 Thread Timothy Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen resolved MESOS-1824.
-
Resolution: Fixed

> when "docker ps -a" returns 400+ lines enabling docker containerizer results 
> in all executors dying
> ---
>
> Key: MESOS-1824
> URL: https://issues.apache.org/jira/browse/MESOS-1824
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Jay Buffington
>Assignee: Timothy Chen
>
> To reproduce:
> # run this one-liner on your slave to create 400 exited docker containers:
> {noformat}
> for i in `seq 1 400`; do docker run busybox:latest echo "hello" ; done;
> {noformat}
> # Start mesos-slave with only mesos containerizer enabled
> # Launch tasks that use an executor (which uses libmesos)
> # Restart mesos-slave process with --containerizer=docker,mesos
> # See mesos-slave fork "docker ps -a" and never return
> # Note that this mesos-slave never reregisters with master
> # Wait at least 10 minutes and see executors commit suicide, which kills all 
> of the tasks on your system.  From executor log:
> {noformat}
> I0919 21:24:14.018127 21778 exec.cpp:379] Executor asked to shutdown
> I0919 21:24:14.018812 21771 exec.cpp:78] Scheduling shutdown of the executor
> I0919 21:24:14.020514 21778 exec.cpp:394] Executor::shutdown took 1.866382ms
> I0919 21:24:16.000500 21771 exec.cpp:525] Executor sending status update 
> TASK_KILLED (UUID: bfd3969c-ad0a-455a-93fe-06c37bdee513) for task 
> 1411160025479-another-task-0-b5e24381-3353-43d4-9587-ffef9ccf2f38 of 
> framework 20140814-221057-1208029356-5050-10525-
> I0919 21:24:16.030253 21772 exec.cpp:332] Ignoring status update 
> acknowledgement bfd3969c-ad0a-455a-93fe-06c37bdee513 for task 
> 1411160025479-another-task-0-b5e24381-3353-43d4-9587-ffef9ccf2f38 of 
> framework 20140814-221057-1208029356-5050-10525- because the driver is 
> aborted!
> I0919 21:24:19.021966 21778 exec.cpp:86] Committing suicide by killing the 
> process group
> {noformat}
> # mesos-slave fails to tell the master about the task being killed, with this 
> message in the log:
> {noformat}
> W0918 01:02:57.252231 11725 status_update_manager.cpp:381] Not
> forwarding status update TASK_KILLED (UUID:
> 6fbacbcf-ad0f-4e89-89ee-e9f88a618573) for task
> 1410298578043-some-task-30-29279377-fdf2-4bb7-b862-852adddea09c
> of framework 20140522-213145-1749004561-5050-29512- because no
> master is elected yet
> {noformat}





[jira] [Updated] (MESOS-1816) lxc execution driver support for docker containerizer

2014-10-20 Thread Timothy Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen updated MESOS-1816:

Sprint: Mesosphere Q4 Sprint 1

> lxc execution driver support for docker containerizer
> -
>
> Key: MESOS-1816
> URL: https://issues.apache.org/jira/browse/MESOS-1816
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Affects Versions: 0.20.1
>Reporter: Eugen Feller
>Assignee: Timothy Chen
>  Labels: docker
> Attachments: docker_patch.cpp, test_framework_patch.cpp
>
>
> Hi all,
> One way to get networking up and running in Docker is to use the bridge mode. 
> The bridge mode results in Docker automatically assigning IPs to the 
> containers from the IP range specified on the docker0 bridge.
> In our setup we need to manage IPs using our own DHCP server. Unfortunately 
> this is not supported by Docker's libcontainer execution driver. Instead, the 
> lxc execution driver 
> (http://blog.docker.com/2014/03/docker-0-9-introducing-execution-drivers-and-libcontainer/)
>  can be used. In order to use the lxc execution driver, Docker daemon needs 
> to be started with the "-e lxc" flag. Once started, Docker's own networking can 
> be disabled and lxc options can be passed to the docker run command. For 
> example:
> $ docker run -n=false --lxc-conf="lxc.network.type = veth" 
> --lxc-conf="lxc.network.link = br0" --lxc-conf="lxc.network.name = eth0" 
> --lxc-conf="lxc.network.flags = up" ...
> This will force Docker to use my own bridge br0. Moreover, IP can be assigned 
> to the eth0 interface by executing the "dhclient eth0" command inside the 
> started container.
> In the previous integration of Docker in Mesos (using Deimos), I have passed 
> the aforementioned options using the "options" flag in Marathon. However, 
> with the new changes this is no longer possible. It would be great to support 
> the lxc execution driver in the current Docker integration.
> Thanks.
> Best regards,
> Eugen





[jira] [Commented] (MESOS-1570) Make check Error when Building Mesos in a Docker container

2014-10-20 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177074#comment-14177074
 ] 

Timothy Chen commented on MESOS-1570:
-

[~ijimenez] you want to fix this?

> Make check Error when Building Mesos in a Docker container 
> ---
>
> Key: MESOS-1570
> URL: https://issues.apache.org/jira/browse/MESOS-1570
> Project: Mesos
>  Issue Type: Bug
>Reporter: Isabel Jimenez
>Priority: Minor
>  Labels: Docker
>
> When building Mesos inside a Docker container, it is currently impossible 
> to run tests, even when you run Docker in --privileged mode. There is a test 
> in stout that sets all the namespaces and libcontainer does not support 
> setting 'user' namespace (more information 
> [here|https://github.com/docker/libcontainer/blob/master/namespaces/nsenter.go#L136]).
>  This is the error:
> {code:title=Make check failed test|borderStyle=solid}
> [--] 1 test from OsSetnsTest
> [ RUN  ] OsSetnsTest.setns
> ../../../../3rdparty/libprocess/3rdparty/stout/tests/os/setns_tests.cpp:43: 
> Failure
> os::setns(::getpid(), ns): Invalid argument
> [  FAILED  ] OsSetnsTest.setns (7 ms)
> [--] 1 test from OsSetnsTest (7 ms total)
> [  FAILED  ] 1 test, listed below:
> [  FAILED  ] OsSetnsTest.setns
>  1 FAILED TEST
> {code}
> This can be disabled, as Mesos does not need to set the 'user' namespace. I 
> don't know if Docker will support setting the user namespace one day, since 
> it's a new kernel feature. What would be the best approach to this issue? 
> (Disabling setting of the 'user' namespace in stout, or disabling just this 
> test?)





[jira] [Updated] (MESOS-1851) Cannot provide -hostname parameter to docker container

2014-10-20 Thread Timothy Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen updated MESOS-1851:

Sprint: Mesosphere Q4 Sprint 1

> Cannot provide -hostname parameter to docker container
> --
>
> Key: MESOS-1851
> URL: https://issues.apache.org/jira/browse/MESOS-1851
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.20.1
>Reporter: Adam Spektor
>Assignee: Timothy Chen
>
> When using bridged networking, it appears folks want to be able to resolve 
> using the hostname.  Currently still in flight upstream as well: 
> https://github.com/docker/docker/issues/7851





[jira] [Closed] (MESOS-1809) Modify docker pull to use docker inspect after a successful pull

2014-10-20 Thread Timothy Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen closed MESOS-1809.
---

> Modify docker pull to use docker inspect after a successful pull
> 
>
> Key: MESOS-1809
> URL: https://issues.apache.org/jira/browse/MESOS-1809
> Project: Mesos
>  Issue Type: Bug
>Reporter: Timothy Chen
>Assignee: Timothy Chen
> Fix For: 0.20.1
>
>
> Currently in docker pull we read the stdout of the pull to construct the 
> Docker image object; however, it contains extra output.
> We should run docker inspect after the pull instead.





[jira] [Updated] (MESOS-1945) SlaveTest.KillTaskBetweenRunTaskParts is flaky

2014-10-20 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-1945:
---
Shepherd: Vinod Kone

> SlaveTest.KillTaskBetweenRunTaskParts is flaky
> --
>
> Key: MESOS-1945
> URL: https://issues.apache.org/jira/browse/MESOS-1945
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.21.0
>Reporter: Vinod Kone
>Assignee: Bernd Mathiske
>
> Observed this on the internal CI.
> [~bernd-mesos] Can you take a look?
> {code}
> [ RUN  ] SlaveTest.KillTaskBetweenRunTaskParts
> Using temporary directory '/tmp/SlaveTest_KillTaskBetweenRunTaskParts_RmlPwG'
> I1017 13:42:13.066948 13328 leveldb.cpp:176] Opened db in 102.342262ms
> I1017 13:42:13.096580 13328 leveldb.cpp:183] Compacted db in 29.603997ms
> I1017 13:42:13.096628 13328 leveldb.cpp:198] Created db iterator in 10276ns
> I1017 13:42:13.096638 13328 leveldb.cpp:204] Seeked to beginning of db in 
> 4732ns
> I1017 13:42:13.096644 13328 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 3353ns
> I1017 13:42:13.096659 13328 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1017 13:42:13.096951 13349 recover.cpp:437] Starting replica recovery
> I1017 13:42:13.097007 13349 recover.cpp:463] Replica is in EMPTY status
> I1017 13:42:13.097256 13349 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I1017 13:42:13.097306 13349 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I1017 13:42:13.097378 13349 recover.cpp:554] Updating replica status to 
> STARTING
> I1017 13:42:13.101631 13345 master.cpp:312] Master 
> 20141017-134213-16842879-48221-13328 (trusty) started on 127.0.1.1:48221
> I1017 13:42:13.102226 13345 master.cpp:358] Master only allowing 
> authenticated frameworks to register
> I1017 13:42:13.102473 13345 master.cpp:363] Master only allowing 
> authenticated slaves to register
> I1017 13:42:13.102738 13345 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/SlaveTest_KillTaskBetweenRunTaskParts_RmlPwG/credentials'
> I1017 13:42:13.103142 13345 master.cpp:392] Authorization enabled
> I1017 13:42:13.103667 13346 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:48221
> I1017 13:42:13.103966 13342 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I1017 13:42:13.104833 13345 master.cpp:1242] The newly elected leader is 
> master@127.0.1.1:48221 with id 20141017-134213-16842879-48221-13328
> I1017 13:42:13.105020 13345 master.cpp:1255] Elected as the leading master!
> I1017 13:42:13.105200 13345 master.cpp:1073] Recovering from registrar
> I1017 13:42:13.105465 13347 registrar.cpp:313] Recovering registrar
> I1017 13:42:13.112493 13349 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 15.090983ms
> I1017 13:42:13.112735 13349 replica.cpp:320] Persisted replica status to 
> STARTING
> I1017 13:42:13.113172 13349 recover.cpp:463] Replica is in STARTING status
> I1017 13:42:13.113713 13349 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I1017 13:42:13.113998 13349 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I1017 13:42:13.114323 13349 recover.cpp:554] Updating replica status to VOTING
> I1017 13:42:13.131239 13349 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 16.594369ms
> I1017 13:42:13.131573 13349 replica.cpp:320] Persisted replica status to 
> VOTING
> I1017 13:42:13.131916 13344 recover.cpp:568] Successfully joined the Paxos 
> group
> I1017 13:42:13.132225 13342 recover.cpp:452] Recover process terminated
> I1017 13:42:13.132542 13343 log.cpp:656] Attempting to start the writer
> I1017 13:42:13.134614 13343 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I1017 13:42:13.155139 13343 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 20.162122ms
> I1017 13:42:13.155519 13343 replica.cpp:342] Persisted promised to 1
> I1017 13:42:13.155941 13343 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I1017 13:42:13.156524 13343 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I1017 13:42:13.170680 13343 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 13.967251ms
> I1017 13:42:13.171041 13343 replica.cpp:676] Persisted action at 0
> I1017 13:42:13.171551 13343 replica.cpp:508] Replica received write request 
> for position 0
> I1017 13:42:13.171787 13343 leveldb.cpp:438] Reading position from leveldb 
> took 30854ns
> I1017 13:42:13.182826 13343 leveldb.cpp:343] Persisting action (14 bytes) to 
> leveldb 

[jira] [Updated] (MESOS-1915) Docker containers that fail to launch are not killed

2014-10-20 Thread Timothy Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen updated MESOS-1915:

Sprint: Mesosphere Q4 Sprint 1

> Docker containers that fail to launch are not killed
> 
>
> Key: MESOS-1915
> URL: https://issues.apache.org/jira/browse/MESOS-1915
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.20.1
> Environment: Mesos 0.20.1 using the docker executor with a private 
> docker repository. Images often take up to 5 minutes to launch.
> /etc/mesos-slave/executor_registration_timeout is set to '10mins'
>Reporter: Daniel Hall
>Assignee: Timothy Chen
>
> When we launch docker containers on our Mesos cluster using Marathon we have 
> noticed that we end up with several docker containers running, with only one 
> of them actually being tracked by Mesos. When inspected, the containers all 
> have the same start time.
> This seems to be because Mesos gives up on trying to start the container 
> after 1min, but fails to clean up the docker container because it is not 
> yet running. Eventually the container starts alongside all the other attempts 
> Mesos has made, and we end up with several containers running with only one 
> being tracked by Mesos.
> I've pasted some logs from the slave below, filtered for that particular 
> task, but it is pretty easy to replicate in our environment, so I'm happy to 
> provide further logs, details and analysis as required. This is becoming a 
> big problem for us, so we are happy to help as much as possible.
> {noformat}
> Oct 13 04:47:42 mesosslave-1 mesos-slave[16647]: I1013 04:47:42.776945 16661 
> docker.cpp:743] Starting container 'dd113461-4d18-4170-8e3f-9527e6d7f598' for 
> task 'docker-test.11588a48-5294-11e4-adea-42010af0f51e' (and executor 
> 'docker-test.11588a48-5294-11e4-adea-42010af0f51e') of framework 
> '20140918-022627-519434250-5050-6171-'
> Oct 13 04:48:42 mesosslave-1 mesos-slave[16647]: E1013 04:48:42.819563 16664 
> slave.cpp:2205] Failed to update resources for container 
> dd113461-4d18-4170-8e3f-9527e6d7f598 of executor 
> docker-test.11588a48-5294-11e4-adea-42010af0f51e running task 
> docker-test.11588a48-5294-11e4-adea-42010af0f51e on status update for 
> terminal task, destroying container: No container found
> Oct 13 04:49:29 mesosslave-1 mesos-slave[16647]: I1013 04:49:29.916460 16665 
> slave.cpp:2538] Monitoring executor 
> 'docker-test.11588a48-5294-11e4-adea-42010af0f51e' of framework 
> '20140918-022627-519434250-5050-6171-' in container 
> 'dd113461-4d18-4170-8e3f-9527e6d7f598'
> Oct 13 04:49:31 mesosslave-1 mesos-slave[16647]: I1013 04:49:31.103175 16663 
> docker.cpp:1286] Updated 'cpu.shares' to 102 at 
> /cgroup/cpu/docker/6a581f5c2174dc76bcfb2e5b89fd9a4310732c384d93901a8b37da8aeb700468
>  for container dd113461-4d18-4170-8e3f-9527e6d7f598
> Oct 13 04:49:31 mesosslave-1 mesos-slave[16647]: I1013 04:49:31.105036 16663 
> docker.cpp:1321] Updated 'memory.soft_limit_in_bytes' to 32MB for container 
> dd113461-4d18-4170-8e3f-9527e6d7f598
> {noformat}





[jira] [Updated] (MESOS-1945) SlaveTest.KillTaskBetweenRunTaskParts is flaky

2014-10-20 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-1945:
---
Sprint: Mesosphere Q4 Sprint 1

> SlaveTest.KillTaskBetweenRunTaskParts is flaky
> --
>
> Key: MESOS-1945
> URL: https://issues.apache.org/jira/browse/MESOS-1945
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.21.0
>Reporter: Vinod Kone
>Assignee: Bernd Mathiske
>
> Observed this on the internal CI.
> [~bernd-mesos] Can you take a look?
> {code}
> [ RUN  ] SlaveTest.KillTaskBetweenRunTaskParts
> Using temporary directory '/tmp/SlaveTest_KillTaskBetweenRunTaskParts_RmlPwG'
> I1017 13:42:13.066948 13328 leveldb.cpp:176] Opened db in 102.342262ms
> I1017 13:42:13.096580 13328 leveldb.cpp:183] Compacted db in 29.603997ms
> I1017 13:42:13.096628 13328 leveldb.cpp:198] Created db iterator in 10276ns
> I1017 13:42:13.096638 13328 leveldb.cpp:204] Seeked to beginning of db in 
> 4732ns
> I1017 13:42:13.096644 13328 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 3353ns
> I1017 13:42:13.096659 13328 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1017 13:42:13.096951 13349 recover.cpp:437] Starting replica recovery
> I1017 13:42:13.097007 13349 recover.cpp:463] Replica is in EMPTY status
> I1017 13:42:13.097256 13349 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I1017 13:42:13.097306 13349 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I1017 13:42:13.097378 13349 recover.cpp:554] Updating replica status to 
> STARTING
> I1017 13:42:13.101631 13345 master.cpp:312] Master 
> 20141017-134213-16842879-48221-13328 (trusty) started on 127.0.1.1:48221
> I1017 13:42:13.102226 13345 master.cpp:358] Master only allowing 
> authenticated frameworks to register
> I1017 13:42:13.102473 13345 master.cpp:363] Master only allowing 
> authenticated slaves to register
> I1017 13:42:13.102738 13345 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/SlaveTest_KillTaskBetweenRunTaskParts_RmlPwG/credentials'
> I1017 13:42:13.103142 13345 master.cpp:392] Authorization enabled
> I1017 13:42:13.103667 13346 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:48221
> I1017 13:42:13.103966 13342 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I1017 13:42:13.104833 13345 master.cpp:1242] The newly elected leader is 
> master@127.0.1.1:48221 with id 20141017-134213-16842879-48221-13328
> I1017 13:42:13.105020 13345 master.cpp:1255] Elected as the leading master!
> I1017 13:42:13.105200 13345 master.cpp:1073] Recovering from registrar
> I1017 13:42:13.105465 13347 registrar.cpp:313] Recovering registrar
> I1017 13:42:13.112493 13349 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 15.090983ms
> I1017 13:42:13.112735 13349 replica.cpp:320] Persisted replica status to 
> STARTING
> I1017 13:42:13.113172 13349 recover.cpp:463] Replica is in STARTING status
> I1017 13:42:13.113713 13349 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I1017 13:42:13.113998 13349 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I1017 13:42:13.114323 13349 recover.cpp:554] Updating replica status to VOTING
> I1017 13:42:13.131239 13349 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 16.594369ms
> I1017 13:42:13.131573 13349 replica.cpp:320] Persisted replica status to 
> VOTING
> I1017 13:42:13.131916 13344 recover.cpp:568] Successfully joined the Paxos 
> group
> I1017 13:42:13.132225 13342 recover.cpp:452] Recover process terminated
> I1017 13:42:13.132542 13343 log.cpp:656] Attempting to start the writer
> I1017 13:42:13.134614 13343 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I1017 13:42:13.155139 13343 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 20.162122ms
> I1017 13:42:13.155519 13343 replica.cpp:342] Persisted promised to 1
> I1017 13:42:13.155941 13343 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I1017 13:42:13.156524 13343 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I1017 13:42:13.170680 13343 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 13.967251ms
> I1017 13:42:13.171041 13343 replica.cpp:676] Persisted action at 0
> I1017 13:42:13.171551 13343 replica.cpp:508] Replica received write request 
> for position 0
> I1017 13:42:13.171787 13343 leveldb.cpp:438] Reading position from leveldb 
> took 30854ns
> I1017 13:42:13.182826 13343 leveldb.cpp:343] Persisting action (14 bytes) to 

[jira] [Closed] (MESOS-1652) Stream Docker logs into sandbox logs

2014-10-20 Thread Timothy Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen closed MESOS-1652.
---

> Stream Docker logs into sandbox logs
> 
>
> Key: MESOS-1652
> URL: https://issues.apache.org/jira/browse/MESOS-1652
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Timothy Chen
>Assignee: Timothy Chen
> Fix For: 0.20.0
>
>
> We should stream the logs from the docker container into the sandbox, either 
> during or after the task launch, so they can be viewed without actually 
> going to the host and calling "docker logs".





[jira] [Updated] (MESOS-1864) Add test integration for module developers

2014-10-20 Thread Kapil Arya (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kapil Arya updated MESOS-1864:
--
Sprint: Mesosphere Q4 Sprint 1

> Add test integration for module developers
> --
>
> Key: MESOS-1864
> URL: https://issues.apache.org/jira/browse/MESOS-1864
> Project: Mesos
>  Issue Type: Task
>  Components: modules
>Reporter: Niklas Quarfot Nielsen
>Assignee: Kapil Arya
>
> To help module developers write and test mesos-modules, we should wire up 
> integration suites that let the usual unit tests be run with custom built 
> modules.
> A couple of examples could be: 
> $ ./bin/mesos-tests.sh --modules="path:name"
> or a dedicated test scripts:
> $ ./bin/modules-test.sh  --modules="path:name" --isolation="name" 
> --authentication="name"
> We should also think about how to encourage internal module testing (tests 
> that will be specific to that particular module).
> In the case of ./bin/modules-test.sh, we could run 1) our own (general) 
> tests and 2) tests provided by the module itself (maybe as an extra field in 
> the module registration struct).




