[jira] [Updated] (MESOS-2228) SlaveTest.MesosExecutorGracefulShutdown is flaky

2015-01-28 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-2228:
---
Sprint: Twitter Mesos Q1 Sprint 1

> SlaveTest.MesosExecutorGracefulShutdown is flaky
> 
>
> Key: MESOS-2228
> URL: https://issues.apache.org/jira/browse/MESOS-2228
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.22.0
>Reporter: Vinod Kone
>Assignee: Benjamin Mahler
>  Labels: twitter
>
> Observed this on internal CI
> {noformat}
> [ RUN  ] SlaveTest.MesosExecutorGracefulShutdown
> Using temporary directory 
> '/tmp/SlaveTest_MesosExecutorGracefulShutdown_AWdtVJ'
> I0124 08:14:04.399211  7926 leveldb.cpp:176] Opened db in 27.364056ms
> I0124 08:14:04.402632  7926 leveldb.cpp:183] Compacted db in 3.357646ms
> I0124 08:14:04.402691  7926 leveldb.cpp:198] Created db iterator in 23822ns
> I0124 08:14:04.402708  7926 leveldb.cpp:204] Seeked to beginning of db in 
> 1913ns
> I0124 08:14:04.402716  7926 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 458ns
> I0124 08:14:04.402767  7926 replica.cpp:744] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0124 08:14:04.403728  7951 recover.cpp:449] Starting replica recovery
> I0124 08:14:04.404011  7951 recover.cpp:475] Replica is in EMPTY status
> I0124 08:14:04.407765  7950 replica.cpp:641] Replica in EMPTY status received 
> a broadcasted recover request
> I0124 08:14:04.408710  7951 recover.cpp:195] Received a recover response from 
> a replica in EMPTY status
> I0124 08:14:04.419666  7951 recover.cpp:566] Updating replica status to 
> STARTING
> I0124 08:14:04.429719  7953 master.cpp:262] Master 
> 20150124-081404-16842879-47787-7926 (utopic) started on 127.0.1.1:47787
> I0124 08:14:04.429790  7953 master.cpp:308] Master only allowing 
> authenticated frameworks to register
> I0124 08:14:04.429802  7953 master.cpp:313] Master only allowing 
> authenticated slaves to register
> I0124 08:14:04.429826  7953 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/SlaveTest_MesosExecutorGracefulShutdown_AWdtVJ/credentials'
> I0124 08:14:04.430277  7953 master.cpp:357] Authorization enabled
> I0124 08:14:04.432682  7953 master.cpp:1219] The newly elected leader is 
> master@127.0.1.1:47787 with id 20150124-081404-16842879-47787-7926
> I0124 08:14:04.432816  7953 master.cpp:1232] Elected as the leading master!
> I0124 08:14:04.432894  7953 master.cpp:1050] Recovering from registrar
> I0124 08:14:04.433212  7950 registrar.cpp:313] Recovering registrar
> I0124 08:14:04.434226  7951 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 14.323302ms
> I0124 08:14:04.434270  7951 replica.cpp:323] Persisted replica status to 
> STARTING
> I0124 08:14:04.434489  7951 recover.cpp:475] Replica is in STARTING status
> I0124 08:14:04.436164  7951 replica.cpp:641] Replica in STARTING status 
> received a broadcasted recover request
> I0124 08:14:04.439368  7947 recover.cpp:195] Received a recover response from 
> a replica in STARTING status
> I0124 08:14:04.440626  7947 recover.cpp:566] Updating replica status to VOTING
> I0124 08:14:04.443667  7947 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 2.698664ms
> I0124 08:14:04.443759  7947 replica.cpp:323] Persisted replica status to 
> VOTING
> I0124 08:14:04.443925  7947 recover.cpp:580] Successfully joined the Paxos 
> group
> I0124 08:14:04.444160  7947 recover.cpp:464] Recover process terminated
> I0124 08:14:04.444543  7949 log.cpp:660] Attempting to start the writer
> I0124 08:14:04.446331  7949 replica.cpp:477] Replica received implicit 
> promise request with proposal 1
> I0124 08:14:04.449329  7949 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 2.690453ms
> I0124 08:14:04.449388  7949 replica.cpp:345] Persisted promised to 1
> I0124 08:14:04.450637  7947 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0124 08:14:04.452271  7949 replica.cpp:378] Replica received explicit 
> promise request for position 0 with proposal 2
> I0124 08:14:04.455124  7949 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 2.593522ms
> I0124 08:14:04.455157  7949 replica.cpp:679] Persisted action at 0
> I0124 08:14:04.456594  7951 replica.cpp:511] Replica received write request 
> for position 0
> I0124 08:14:04.456657  7951 leveldb.cpp:438] Reading position from leveldb 
> took 30358ns
> I0124 08:14:04.464860  7951 leveldb.cpp:343] Persisting action (14 bytes) to 
> leveldb took 8.164646ms
> I0124 08:14:04.464903  7951 replica.cpp:679] Persisted action at 0
> I0124 08:14:04.465947  7949 replica.cpp:658] Replica received learned notice 
> for position 0
> I0124 08:14:04.471567  7949 leveldb.cpp:343] Persisting action (16 bytes)

[jira] [Assigned] (MESOS-2228) SlaveTest.MesosExecutorGracefulShutdown is flaky

2015-01-28 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-2228:
--

Assignee: Benjamin Mahler  (was: Alexander Rukletsov)

{quote}
(or is not being reaped)
{quote}

From the output, we're not seeing 'Terminated', which means it's the SIGKILL 
reaching the pid rather than the SIGTERM, no? Because of this, it doesn't seem 
like a reaping issue; is there anything I'm missing?

{quote}
From the logs it looks like a simple sleep task doesn't terminate
{quote}

Looks like this to me as well; these are VMs and we sometimes see strange 
blocking behavior. I've bumped the timeout for now and included a nicer error 
message. Please take a look:

https://reviews.apache.org/r/30402/
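
For context, the escalation this test exercises is: send SIGTERM to the task, 
wait up to a grace period for it to exit, then fall back to SIGKILL. Below is a 
minimal POSIX sketch of that pattern (illustrative only, not the actual 
slave/executor code path; the helper name and polling interval are made up):

{code}
#include <sys/types.h>
#include <sys/wait.h>

#include <signal.h>
#include <unistd.h>

// Send SIGTERM to a child process, poll for its exit for up to
// 'gracePeriodSecs' seconds, and escalate to SIGKILL if it is still running.
// Returns true if the child exited within the grace period.
bool gracefulShutdown(pid_t child, int gracePeriodSecs)
{
  kill(child, SIGTERM);

  for (int i = 0; i < gracePeriodSecs * 10; i++) {
    int status;
    if (waitpid(child, &status, WNOHANG) == child) {
      return true;  // Reaped within the grace period.
    }
    usleep(100 * 1000);  // Poll every 100ms.
  }

  kill(child, SIGKILL);        // Escalate.
  waitpid(child, nullptr, 0);  // Reap the killed child.
  return false;
}
{code}

If a stalled VM keeps even a plain 'sleep' task from reacting to SIGTERM before 
the grace period expires, the SIGKILL path is taken, which would be consistent 
with the flakiness discussed above.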

> SlaveTest.MesosExecutorGracefulShutdown is flaky
> 
>
> Key: MESOS-2228
> URL: https://issues.apache.org/jira/browse/MESOS-2228
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.22.0
>Reporter: Vinod Kone
>Assignee: Benjamin Mahler
>  Labels: twitter
>
> Observed this on internal CI
> {noformat}
> [ RUN  ] SlaveTest.MesosExecutorGracefulShutdown
> Using temporary directory 
> '/tmp/SlaveTest_MesosExecutorGracefulShutdown_AWdtVJ'
> I0124 08:14:04.399211  7926 leveldb.cpp:176] Opened db in 27.364056ms
> I0124 08:14:04.402632  7926 leveldb.cpp:183] Compacted db in 3.357646ms
> I0124 08:14:04.402691  7926 leveldb.cpp:198] Created db iterator in 23822ns
> I0124 08:14:04.402708  7926 leveldb.cpp:204] Seeked to beginning of db in 
> 1913ns
> I0124 08:14:04.402716  7926 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 458ns
> I0124 08:14:04.402767  7926 replica.cpp:744] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0124 08:14:04.403728  7951 recover.cpp:449] Starting replica recovery
> I0124 08:14:04.404011  7951 recover.cpp:475] Replica is in EMPTY status
> I0124 08:14:04.407765  7950 replica.cpp:641] Replica in EMPTY status received 
> a broadcasted recover request
> I0124 08:14:04.408710  7951 recover.cpp:195] Received a recover response from 
> a replica in EMPTY status
> I0124 08:14:04.419666  7951 recover.cpp:566] Updating replica status to 
> STARTING
> I0124 08:14:04.429719  7953 master.cpp:262] Master 
> 20150124-081404-16842879-47787-7926 (utopic) started on 127.0.1.1:47787
> I0124 08:14:04.429790  7953 master.cpp:308] Master only allowing 
> authenticated frameworks to register
> I0124 08:14:04.429802  7953 master.cpp:313] Master only allowing 
> authenticated slaves to register
> I0124 08:14:04.429826  7953 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/SlaveTest_MesosExecutorGracefulShutdown_AWdtVJ/credentials'
> I0124 08:14:04.430277  7953 master.cpp:357] Authorization enabled
> I0124 08:14:04.432682  7953 master.cpp:1219] The newly elected leader is 
> master@127.0.1.1:47787 with id 20150124-081404-16842879-47787-7926
> I0124 08:14:04.432816  7953 master.cpp:1232] Elected as the leading master!
> I0124 08:14:04.432894  7953 master.cpp:1050] Recovering from registrar
> I0124 08:14:04.433212  7950 registrar.cpp:313] Recovering registrar
> I0124 08:14:04.434226  7951 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 14.323302ms
> I0124 08:14:04.434270  7951 replica.cpp:323] Persisted replica status to 
> STARTING
> I0124 08:14:04.434489  7951 recover.cpp:475] Replica is in STARTING status
> I0124 08:14:04.436164  7951 replica.cpp:641] Replica in STARTING status 
> received a broadcasted recover request
> I0124 08:14:04.439368  7947 recover.cpp:195] Received a recover response from 
> a replica in STARTING status
> I0124 08:14:04.440626  7947 recover.cpp:566] Updating replica status to VOTING
> I0124 08:14:04.443667  7947 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 2.698664ms
> I0124 08:14:04.443759  7947 replica.cpp:323] Persisted replica status to 
> VOTING
> I0124 08:14:04.443925  7947 recover.cpp:580] Successfully joined the Paxos 
> group
> I0124 08:14:04.444160  7947 recover.cpp:464] Recover process terminated
> I0124 08:14:04.444543  7949 log.cpp:660] Attempting to start the writer
> I0124 08:14:04.446331  7949 replica.cpp:477] Replica received implicit 
> promise request with proposal 1
> I0124 08:14:04.449329  7949 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 2.690453ms
> I0124 08:14:04.449388  7949 replica.cpp:345] Persisted promised to 1
> I0124 08:14:04.450637  7947 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0124 08:14:04.452271  7949 replica.cpp:378] Replica received explicit 
> promise request for position 0 with proposal 2
> I0124 08:14:04.455124  7949 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 2.593522ms
> I0124 

[jira] [Updated] (MESOS-2228) SlaveTest.MesosExecutorGracefulShutdown is flaky

2015-01-28 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-2228:
---
Labels: twitter  (was: )

> SlaveTest.MesosExecutorGracefulShutdown is flaky
> 
>
> Key: MESOS-2228
> URL: https://issues.apache.org/jira/browse/MESOS-2228
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.22.0
>Reporter: Vinod Kone
>Assignee: Benjamin Mahler
>  Labels: twitter
>
> Observed this on internal CI
> {noformat}
> [ RUN  ] SlaveTest.MesosExecutorGracefulShutdown
> Using temporary directory 
> '/tmp/SlaveTest_MesosExecutorGracefulShutdown_AWdtVJ'
> I0124 08:14:04.399211  7926 leveldb.cpp:176] Opened db in 27.364056ms
> I0124 08:14:04.402632  7926 leveldb.cpp:183] Compacted db in 3.357646ms
> I0124 08:14:04.402691  7926 leveldb.cpp:198] Created db iterator in 23822ns
> I0124 08:14:04.402708  7926 leveldb.cpp:204] Seeked to beginning of db in 
> 1913ns
> I0124 08:14:04.402716  7926 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 458ns
> I0124 08:14:04.402767  7926 replica.cpp:744] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0124 08:14:04.403728  7951 recover.cpp:449] Starting replica recovery
> I0124 08:14:04.404011  7951 recover.cpp:475] Replica is in EMPTY status
> I0124 08:14:04.407765  7950 replica.cpp:641] Replica in EMPTY status received 
> a broadcasted recover request
> I0124 08:14:04.408710  7951 recover.cpp:195] Received a recover response from 
> a replica in EMPTY status
> I0124 08:14:04.419666  7951 recover.cpp:566] Updating replica status to 
> STARTING
> I0124 08:14:04.429719  7953 master.cpp:262] Master 
> 20150124-081404-16842879-47787-7926 (utopic) started on 127.0.1.1:47787
> I0124 08:14:04.429790  7953 master.cpp:308] Master only allowing 
> authenticated frameworks to register
> I0124 08:14:04.429802  7953 master.cpp:313] Master only allowing 
> authenticated slaves to register
> I0124 08:14:04.429826  7953 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/SlaveTest_MesosExecutorGracefulShutdown_AWdtVJ/credentials'
> I0124 08:14:04.430277  7953 master.cpp:357] Authorization enabled
> I0124 08:14:04.432682  7953 master.cpp:1219] The newly elected leader is 
> master@127.0.1.1:47787 with id 20150124-081404-16842879-47787-7926
> I0124 08:14:04.432816  7953 master.cpp:1232] Elected as the leading master!
> I0124 08:14:04.432894  7953 master.cpp:1050] Recovering from registrar
> I0124 08:14:04.433212  7950 registrar.cpp:313] Recovering registrar
> I0124 08:14:04.434226  7951 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 14.323302ms
> I0124 08:14:04.434270  7951 replica.cpp:323] Persisted replica status to 
> STARTING
> I0124 08:14:04.434489  7951 recover.cpp:475] Replica is in STARTING status
> I0124 08:14:04.436164  7951 replica.cpp:641] Replica in STARTING status 
> received a broadcasted recover request
> I0124 08:14:04.439368  7947 recover.cpp:195] Received a recover response from 
> a replica in STARTING status
> I0124 08:14:04.440626  7947 recover.cpp:566] Updating replica status to VOTING
> I0124 08:14:04.443667  7947 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 2.698664ms
> I0124 08:14:04.443759  7947 replica.cpp:323] Persisted replica status to 
> VOTING
> I0124 08:14:04.443925  7947 recover.cpp:580] Successfully joined the Paxos 
> group
> I0124 08:14:04.444160  7947 recover.cpp:464] Recover process terminated
> I0124 08:14:04.444543  7949 log.cpp:660] Attempting to start the writer
> I0124 08:14:04.446331  7949 replica.cpp:477] Replica received implicit 
> promise request with proposal 1
> I0124 08:14:04.449329  7949 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 2.690453ms
> I0124 08:14:04.449388  7949 replica.cpp:345] Persisted promised to 1
> I0124 08:14:04.450637  7947 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0124 08:14:04.452271  7949 replica.cpp:378] Replica received explicit 
> promise request for position 0 with proposal 2
> I0124 08:14:04.455124  7949 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 2.593522ms
> I0124 08:14:04.455157  7949 replica.cpp:679] Persisted action at 0
> I0124 08:14:04.456594  7951 replica.cpp:511] Replica received write request 
> for position 0
> I0124 08:14:04.456657  7951 leveldb.cpp:438] Reading position from leveldb 
> took 30358ns
> I0124 08:14:04.464860  7951 leveldb.cpp:343] Persisting action (14 bytes) to 
> leveldb took 8.164646ms
> I0124 08:14:04.464903  7951 replica.cpp:679] Persisted action at 0
> I0124 08:14:04.465947  7949 replica.cpp:658] Replica received learned notice 
> for position 0
> I0124 08:14:04.471567  7949 leveldb.cpp:343] Persisting action (16 bytes) to 
> le

[jira] [Commented] (MESOS-1806) Substituting etcd or ReplicatedLog for Zookeeper

2015-01-28 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296345#comment-14296345
 ] 

Cody Maloney commented on MESOS-1806:
-

https://reviews.apache.org/r/30194/
https://reviews.apache.org/r/30195/
https://reviews.apache.org/r/30393/
https://reviews.apache.org/r/30394/
https://reviews.apache.org/r/30395/
https://reviews.apache.org/r/30396/
https://reviews.apache.org/r/30397/
https://reviews.apache.org/r/30398/


> Substituting etcd or ReplicatedLog for Zookeeper
> 
>
> Key: MESOS-1806
> URL: https://issues.apache.org/jira/browse/MESOS-1806
> Project: Mesos
>  Issue Type: Task
>Reporter: Ed Ropple
>Assignee: Cody Maloney
>Priority: Minor
>
>eropple: Could you also file a new JIRA for Mesos to drop ZK 
> in favor of etcd or ReplicatedLog? Would love to get some momentum going on 
> that one.
> --
> Consider it filed. =)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2183) docker containerizer doesn't work when mesos-slave is running in a container

2015-01-28 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296300#comment-14296300
 ] 

Timothy Chen commented on MESOS-2183:
-

So I'm planning to leverage the --pid=host flag in Docker 1.5, which won't 
clone a new pid namespace. With this you won't see the problems you are 
describing.

What I described in my doc is how to handle recovery.
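
For reference, the visibility problem reported below boils down to checking a 
host-namespace pid (the one 'docker inspect' reports) from inside the slave's 
own pid namespace. A minimal sketch of such a check follows; it is illustrative 
only and not the containerizer's actual code (the function name is made up):

{code}
#include <sys/stat.h>
#include <sys/types.h>

#include <string>

// Returns true if 'pid' is visible from the current pid namespace, i.e.
// /proc/<pid> exists. When the slave runs inside its own pid namespace, a
// pid reported by 'docker inspect' is a host-namespace pid, so this check
// fails and the slave wrongly concludes the executor is gone. Running the
// slave's container with --pid=host keeps it in the host pid namespace,
// so the same check succeeds.
bool pidVisible(pid_t pid)
{
  struct stat s;
  return ::stat(("/proc/" + std::to_string(pid)).c_str(), &s) == 0;
}
{code}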


> docker containerizer doesn't work when mesos-slave is running in a container
> 
>
> Key: MESOS-2183
> URL: https://issues.apache.org/jira/browse/MESOS-2183
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Reporter: Jay Buffington
>Assignee: Timothy Chen
>
> I've started running the mesos-slave process itself inside a docker 
> container.  I bind mount in the dockerd socket, so there is only one docker 
> daemon running on the system.
> The mesos-slave process uses "docker run" to start an executor in another, 
> sibling, container.  It asks "docker inspect" what the pid of the executor 
> running in the container is.  Since the mesos-slave process is in its own pid 
> namespace, it cannot see the pid for the executor in /proc.  Therefore, it 
> thinks the executor died and it does a docker kill.
> It looks like the executor pid is also used to determine what port the 
> executor is listening on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1825) Support the webui over HTTPS.

2015-01-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296276#comment-14296276
 ] 

ASF GitHub Bot commented on MESOS-1825:
---

Github user bmahler commented on the pull request:

https://github.com/apache/mesos/pull/34#issuecomment-71957729
  
Thanks Arnaud, nice! There will be built-in HTTPS support in Mesos at some 
point; you may want to chime in here:

https://issues.apache.org/jira/browse/MESOS-1825


> Support the webui over HTTPS.
> -
>
> Key: MESOS-1825
> URL: https://issues.apache.org/jira/browse/MESOS-1825
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Reporter: Kien Pham
>Priority: Minor
>  Labels: newbie
>
> Right now in the Mesos UI, links are hardcoded to http://. They should not be 
> hardcoded, so that https links can be supported.
> Ex:
> https://github.com/apache/mesos/blob/master/src/webui/master/static/js/controllers.js#L17



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1825) Support the webui over HTTPS.

2015-01-28 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1825:
---
Summary: Support the webui over HTTPS.  (was: support https link)

> Support the webui over HTTPS.
> -
>
> Key: MESOS-1825
> URL: https://issues.apache.org/jira/browse/MESOS-1825
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Reporter: Kien Pham
>Priority: Minor
>  Labels: newbie
>
> Right now in the Mesos UI, links are hardcoded to http://. They should not be 
> hardcoded, so that https links can be supported.
> Ex:
> https://github.com/apache/mesos/blob/master/src/webui/master/static/js/controllers.js#L17



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread

2015-01-28 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296128#comment-14296128
 ] 

Cody Maloney commented on MESOS-2144:
-

Based on the addresses being at the low end of the address range, I'm guessing 
it is happening while running __cxa_exit (global static destruction) or some 
other system cleanup symbol, i.e. while glibc is doing something on Mesos' 
behalf. Whatever that library is, it likely doesn't have symbols / is stripped 
if it comes from the Linux distribution.

Side note: 
Backtraces from our code don't use the debugging info. But yes, debugging 
definitely looks like it's enabled. Functions shouldn't be optimized and the 
binary isn't stripped of symbols, so stack traces should have all the function 
symbols.
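
As an illustration of the kind of crash global static destruction can produce 
(purely hypothetical, not derived from the actual example framework binaries): 
a destructor that touches another global which was constructed later, and 
therefore destroyed earlier, blows up inside __cxa_finalize/atexit processing, 
where the faulting frames are often in unsymbolized system code:

{code}
#include <cstddef>
#include <string>
#include <vector>

extern std::vector<std::string> names;  // Defined below, after 'logger'.

struct Logger
{
  ~Logger()
  {
    // 'names' is defined (and constructed) after 'logger', so it is
    // destroyed *before* this destructor runs; touching it here is
    // use-after-destroy and can segfault during global static
    // destruction at exit (__cxa_finalize / atexit handlers).
    total = names.size();
  }

  std::size_t total = 0;
};

Logger logger;                                // Constructed first, destroyed last.
std::vector<std::string> names = {"a", "b"};  // Constructed second, destroyed first.

int main() { return 0; }
{code}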

> Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread
> ---
>
> Key: MESOS-2144
> URL: https://issues.apache.org/jira/browse/MESOS-2144
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.21.0
>Reporter: Cody Maloney
>Priority: Minor
>  Labels: flaky
>
> Occured on review bot review of: 
> https://reviews.apache.org/r/28262/#review62333
> The review doesn't touch code related to the test (and doesn't break 
> libprocess in general).
> [ RUN  ] ExamplesTest.LowLevelSchedulerPthread
> ../../src/tests/script.cpp:83: Failure
> Failed
> low_level_scheduler_pthread_test.sh terminated with signal Segmentation fault
> [  FAILED  ] ExamplesTest.LowLevelSchedulerPthread (7561 ms)
> The test 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-2183) docker containerizer doesn't work when mesos-slave is running in a container

2015-01-28 Thread Jay Buffington (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296106#comment-14296106
 ] 

Jay Buffington edited comment on MESOS-2183 at 1/29/15 12:10 AM:
-

Hey [~tnachen], I read your doc at 
https://docs.google.com/document/d/1_1oLHXg_aHj_fYCzsjYwox9xvIYNAKIeVjO5BFxsUGI/edit#
 and it's not clear you address the issue I encountered.  In my mesos-slave 
running in coreos I have it:

* running inside a pid namespace
* using the mounted /var/run/docker.sock to start a sibling container
* running docker inspect to get the pid it just launched
* it sees that the pid docker inspect reports 
* it tries to determine the libprocess port based on that pid
* it doesn't see that pid since the pid docker inspect returns is only 
visible in the root namespace
* it does docker stop/kill because it incorrectly thinks the executor 
failed to start since it couldn't see the pid

I don't understand how your patch addresses that issue.  Can you give me a 
summary of how it fixes this problem I've described?


was (Author: jaybuff):
Hey [~tnachen], I read your doc at 
https://docs.google.com/document/d/1_1oLHXg_aHj_fYCzsjYwox9xvIYNAKIeVjO5BFxsUGI/edit#
 and it's not clear you address the issue I encountered.  In my mesos-slave 
running in coreos I have it:

* running inside a pid namespace
* useing the mounted /var/run/docker.sock to start a sibling container
* running docker inspect to get the pid it just launched
* it sees that the pid docker inspect reports 
* it tries to determine the libprocess port based on that pid
* it does see that pid since the pid docker inspect returns is only visible 
in the root namespace
* it does docker stop/kill because it incorrectly thinks the executor 
failed to start since it couldn't see the pid

I don't understand how your patch addresses that issue.  Can you give me a 
summary of how it fixes this problem I've described?

> docker containerizer doesn't work when mesos-slave is running in a container
> 
>
> Key: MESOS-2183
> URL: https://issues.apache.org/jira/browse/MESOS-2183
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Reporter: Jay Buffington
>Assignee: Timothy Chen
>
> I've started running the mesos-slave process itself inside a docker 
> container.  I bind mount in the dockerd socket, so there is only one docker 
> daemon running on the system.
> The mesos-slave process uses "docker run" to start an executor in another, 
> sibling, container.  It asks "docker inspect" what the pid of the executor 
> running in the container is.  Since the mesos-slave process is in its own pid 
> namespace, it cannot see the pid for the executor in /proc.  Therefore, it 
> thinks the executor died and it does a docker kill.
> It looks like the executor pid is also used to determine what port the 
> executor is listening on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2183) docker containerizer doesn't work when mesos-slave is running in a container

2015-01-28 Thread Jay Buffington (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296106#comment-14296106
 ] 

Jay Buffington commented on MESOS-2183:
---

Hey [~tnachen], I read your doc at 
https://docs.google.com/document/d/1_1oLHXg_aHj_fYCzsjYwox9xvIYNAKIeVjO5BFxsUGI/edit#
 and it's not clear you address the issue I encountered.  In my mesos-slave 
running in coreos I have it:

* running inside a pid namespace
* using the mounted /var/run/docker.sock to start a sibling container
* running docker inspect to get the pid it just launched
* it sees that the pid docker inspect reports 
* it tries to determine the libprocess port based on that pid
* it doesn't see that pid, since the pid docker inspect returns is only visible 
in the root namespace
* it does docker stop/kill because it incorrectly thinks the executor 
failed to start since it couldn't see the pid

I don't understand how your patch addresses that issue.  Can you give me a 
summary of how it fixes this problem I've described?

> docker containerizer doesn't work when mesos-slave is running in a container
> 
>
> Key: MESOS-2183
> URL: https://issues.apache.org/jira/browse/MESOS-2183
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Reporter: Jay Buffington
>Assignee: Timothy Chen
>
> I've started running the mesos-slave process itself inside a docker 
> container.  I bind mount in the dockerd socket, so there is only one docker 
> daemon running on the system.
> The mesos-slave process uses "docker run" to start an executor in another, 
> sibling, container.  It asks "docker inspect" what the pid of the executor 
> running in the container is.  Since the mesos-slave process is in its own pid 
> namespace, it cannot see the pid for the executor in /proc.  Therefore, it 
> thinks the executor died and it does a docker kill.
> It looks like the executor pid is also used to determine what port the 
> executor is listening on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2232) Suppress MockAllocator::transformAllocation() warnings.

2015-01-28 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296089#comment-14296089
 ] 

Benjamin Mahler commented on MESOS-2232:


First two are committed:
{noformat}
commit ccd697df0b7e05b07dee75d53e0ff55d6884ba2f
Author: Benjamin Mahler 
Date:   Fri Jan 16 12:13:01 2015 -0800

Renamed MockAllocatorProcess to TestAllocatorProcess.

Review: https://reviews.apache.org/r/29989
{noformat}
{noformat}
commit b7bb6696b5a78dbc896b4756b7d4123e86c01635
Author: Benjamin Mahler 
Date:   Fri Jan 16 14:10:05 2015 -0800

Updated TestAllocatorProcess to avoid the test warnings.

Review: https://reviews.apache.org/r/29990
{noformat}

> Suppress MockAllocator::transformAllocation() warnings.
> ---
>
> Key: MESOS-2232
> URL: https://issues.apache.org/jira/browse/MESOS-2232
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Alexander Rukletsov
>Assignee: Benjamin Mahler
>Priority: Minor
>
> After the 'transform allocated resources' feature was added to the allocator, a 
> number of warnings started popping up in the allocator tests. Commits leading to 
> this behaviour:
> {{dacc88292cc13d4b08fe8cda4df71110a96cb12a}}
> {{5a02d5bdc75d3b1149dcda519016374be06ec6bd}}
> corresponding reviews:
> https://reviews.apache.org/r/29083
> https://reviews.apache.org/r/29084
> Here is an example:
> {code}
> [ RUN ] MasterAllocatorTest/0.FrameworkReregistersFirst
> 
> GMOCK WARNING:
> Uninteresting mock function call - taking default action specified at:
> ../../../src/tests/mesos.hpp:719
> Function call: transformAllocation(@0x7fd3bb5274d8 20150115-185632-1677764800-59671-44186-, @0x7fd3bb5274f8 20150115-185632-1677764800-59671-44186-S0, @0x1119140e0 16-byte object  52-BB D3-7F 00-00 C0-5F 52-BB D3-7F 00-00>)
> Stack trace:
> 
> [ OK ] MasterAllocatorTest/0.FrameworkReregistersFirst (204 ms)
> {code}
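
For anyone hitting similar noise: the usual googlemock remedies are to wrap the 
mock in testing::NiceMock<> (which silences uninteresting-call warnings 
entirely) or to give the method an explicit catch-all expectation. Here is a 
generic sketch with a hypothetical mock class; this is not necessarily the 
approach taken in r/29989 and r/29990:

{code}
#include <gmock/gmock.h>
#include <gtest/gtest.h>

using ::testing::_;
using ::testing::AnyNumber;
using ::testing::NiceMock;

// Hypothetical stand-in for the allocator mock used in the tests.
class MockAllocator
{
public:
  MOCK_METHOD2(transformAllocation, void(int, int));
};

TEST(ExampleTest, SuppressUninterestingCallWarnings)
{
  // Option 1: NiceMock ignores calls that have no matching expectation.
  NiceMock<MockAllocator> allocator;

  // Option 2: explicitly allow the call any number of times.
  EXPECT_CALL(allocator, transformAllocation(_, _))
    .Times(AnyNumber());

  allocator.transformAllocation(1, 2);  // No GMOCK WARNING either way.
}
{code}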



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2298) Provide master detection library/libraries for pure schedulers

2015-01-28 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-2298:
-

 Summary: Provide master detection library/libraries for pure 
schedulers
 Key: MESOS-2298
 URL: https://issues.apache.org/jira/browse/MESOS-2298
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone


When schedulers start interacting with the Mesos master via HTTP endpoints, they 
need a way to detect masters. Ideally, Mesos would provide master detection 
libraries in supported languages (Java and Python to start with) to make this 
easy for frameworks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2297) Add authentication support for HTTP API

2015-01-28 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-2297:
-

 Summary: Add authentication support for HTTP API
 Key: MESOS-2297
 URL: https://issues.apache.org/jira/browse/MESOS-2297
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone


To start with, we will only support basic http auth.
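
For reference, "basic" here is the standard HTTP Basic scheme: the client sends 
an Authorization header whose value is "Basic " followed by base64(user ":" 
password). A small self-contained sketch of producing such a header follows; it 
is illustrative only and says nothing about how Mesos will expose or validate 
credentials:

{code}
#include <iostream>
#include <string>

// Minimal base64 encoder, just enough for the example.
static std::string base64(const std::string& in)
{
  static const char* alphabet =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

  std::string out;
  unsigned int val = 0;
  int bits = -6;

  for (unsigned char c : in) {
    val = (val << 8) + c;
    bits += 8;
    while (bits >= 0) {
      out.push_back(alphabet[(val >> bits) & 0x3F]);
      bits -= 6;
    }
  }

  if (bits > -6) {
    out.push_back(alphabet[((val << 8) >> (bits + 8)) & 0x3F]);
  }

  while (out.size() % 4 != 0) {
    out.push_back('=');  // Pad to a multiple of four characters.
  }

  return out;
}

int main()
{
  // E.g. a credential "user:pass"; prints "Authorization: Basic dXNlcjpwYXNz".
  std::cout << "Authorization: Basic " << base64("user:pass") << std::endl;
  return 0;
}
{code}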



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2296) Implement the Events endpoint on slave

2015-01-28 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-2296:
-

 Summary: Implement the Events endpoint on slave
 Key: MESOS-2296
 URL: https://issues.apache.org/jira/browse/MESOS-2296
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2295) Implement the Call endpoint on Slave

2015-01-28 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-2295:
-

 Summary: Implement the Call endpoint on Slave
 Key: MESOS-2295
 URL: https://issues.apache.org/jira/browse/MESOS-2295
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2294) Implement the Events endpoint on master

2015-01-28 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-2294:
-

 Summary: Implement the Events endpoint on master
 Key: MESOS-2294
 URL: https://issues.apache.org/jira/browse/MESOS-2294
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2293) Implement the Call endpoint on master

2015-01-28 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-2293:
-

 Summary: Implement the Call endpoint on master
 Key: MESOS-2293
 URL: https://issues.apache.org/jira/browse/MESOS-2293
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2292) Implement Call/Event protobufs for Executor

2015-01-28 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-2292:
-

 Summary: Implement Call/Event protobufs for Executor
 Key: MESOS-2292
 URL: https://issues.apache.org/jira/browse/MESOS-2292
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2291) Move executor driver validations to slave

2015-01-28 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-2291:
-

 Summary: Move executor driver validations to slave
 Key: MESOS-2291
 URL: https://issues.apache.org/jira/browse/MESOS-2291
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone


With the HTTP API, the executor driver will no longer exist, and hence all the 
validations should move to the slave.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2290) Move all scheduler driver validations to master

2015-01-28 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-2290:
-

 Summary: Move all scheduler driver validations to master
 Key: MESOS-2290
 URL: https://issues.apache.org/jira/browse/MESOS-2290
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone


With the HTTP API, the scheduler driver will no longer exist, and hence all the 
validations should move to the master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2289) Design doc for the HTTP API

2015-01-28 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-2289:
-

 Summary: Design doc for the HTTP API
 Key: MESOS-2289
 URL: https://issues.apache.org/jira/browse/MESOS-2289
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone


This tracks the design of the HTTP API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2288) HTTP API for interacting with Mesos

2015-01-28 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-2288:
--
Epic Name: HTTP API  (was: http api)

> HTTP API for interacting with Mesos
> ---
>
> Key: MESOS-2288
> URL: https://issues.apache.org/jira/browse/MESOS-2288
> Project: Mesos
>  Issue Type: Epic
>Reporter: Vinod Kone
>
> Currently Mesos frameworks (schedulers and executors) interact with Mesos 
> (masters and slaves) via drivers provided by Mesos. While the driver helped 
> in providing some common functionality for all frameworks (master detection, 
> authentication, validation etc), it has several drawbacks.
> --> Frameworks need to depend on a native library, which makes their 
> build/deploy process cumbersome.
> --> Pure-language frameworks cannot use off-the-shelf libraries to interact 
> with the undocumented API used by the driver.
> --> Makes it hard for developers to implement new APIs (a lot of boilerplate 
> code to write).
> This proposal is for Mesos to provide a well documented public HTTP API that 
> frameworks (and maybe operators) can use to interact with Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1127) Expose lower-level scheduler/executor API

2015-01-28 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-1127:
--
 Epic Name:   (was: HTTP API)
Issue Type: Task  (was: Epic)

> Expose lower-level scheduler/executor API
> -
>
> Key: MESOS-1127
> URL: https://issues.apache.org/jira/browse/MESOS-1127
> Project: Mesos
>  Issue Type: Task
>  Components: framework
>Reporter: Benjamin Hindman
>Assignee: Benjamin Hindman
>  Labels: twitter
>
> The default scheduler/executor interface and implementation in Mesos have a 
> few drawbacks:
> (1) The interface is fairly high-level which makes it hard to do certain 
> things, for example, handle events (callbacks) in batch. This can have a big 
> impact on the performance of schedulers (for example, writing task updates 
> that need to be persisted).
> (2) The implementation requires writing a lot of boilerplate JNI and native 
> Python wrappers when adding additional API components.
> The plan is to provide a lower-level API that can easily be used to implement 
> the higher-level API that is currently provided. This will also open the door 
> to more easily building native-language Mesos libraries (i.e., not needing 
> the C++ shim layer) and building new higher-level abstractions on top of the 
> lower-level API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2288) HTTP API for interacting with Mesos

2015-01-28 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-2288:
-

 Summary: HTTP API for interacting with Mesos
 Key: MESOS-2288
 URL: https://issues.apache.org/jira/browse/MESOS-2288
 Project: Mesos
  Issue Type: Epic
Reporter: Vinod Kone


Currently Mesos frameworks (schedulers and executors) interact with Mesos 
(masters and slaves) via drivers provided by Mesos. While the driver helped in 
providing some common functionality for all frameworks (master detection, 
authentication, validation etc), it has several drawbacks.

--> Frameworks need to depend on a native library, which makes their 
build/deploy process cumbersome.

--> Pure-language frameworks cannot use off-the-shelf libraries to interact 
with the undocumented API used by the driver.

--> Makes it hard for developers to implement new APIs (a lot of boilerplate 
code to write).

This proposal is for Mesos to provide a well documented public HTTP API that 
frameworks (and maybe operators) can use to interact with Mesos.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2015-01-28 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295996#comment-14295996
 ] 

Ian Downes commented on MESOS-2162:
---

I'll be working on this too, development and/or shepherding. 

> Consider a C++ implementation of CoreOS AppContainer spec
> -
>
> Key: MESOS-2162
> URL: https://issues.apache.org/jira/browse/MESOS-2162
> Project: Mesos
>  Issue Type: Story
>  Components: containerization
>Reporter: Dominic Hamon
>  Labels: mesosphere, twitter
>
> CoreOS have released a 
> [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
>  for a container abstraction as an alternative to Docker. They have also 
> released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
> We should consider a C++ implementation of the specification to have parity 
> with the community and then use this implementation for our containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2015-01-28 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295983#comment-14295983
 ] 

Steven Schlansker commented on MESOS-2162:
--

I would love to help out in any way I can, but I am not much of a C++ guy. But 
at the very least I would happily test it, or if you have other suggestions for 
how I can help...

> Consider a C++ implementation of CoreOS AppContainer spec
> -
>
> Key: MESOS-2162
> URL: https://issues.apache.org/jira/browse/MESOS-2162
> Project: Mesos
>  Issue Type: Story
>  Components: containerization
>Reporter: Dominic Hamon
>  Labels: mesosphere, twitter
>
> CoreOS have released a 
> [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
>  for a container abstraction as an alternative to Docker. They have also 
> released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
> We should consider a C++ implementation of the specification to have parity 
> with the community and then use this implementation for our containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2015-01-28 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295981#comment-14295981
 ] 

Steven Schlansker commented on MESOS-2162:
--

I would love to help out in any way I can, but I am not much of a C++ guy.  But 
at the very least I would happily test it, or if you have other suggestions for 
how I can help...

> Consider a C++ implementation of CoreOS AppContainer spec
> -
>
> Key: MESOS-2162
> URL: https://issues.apache.org/jira/browse/MESOS-2162
> Project: Mesos
>  Issue Type: Story
>  Components: containerization
>Reporter: Dominic Hamon
>  Labels: mesosphere, twitter
>
> CoreOS have released a 
> [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
>  for a container abstraction as an alternative to Docker. They have also 
> released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
> We should consider a C++ implementation of the specification to have parity 
> with the community and then use this implementation for our containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2015-01-28 Thread Steven Schlansker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Schlansker updated MESOS-2162:
-
Comment: was deleted

(was: I would love to help out in any way I can, but I am not much of a C++ 
guy.  But at the very least I would happily test it, or if you have other 
suggestions for how I can help...)

> Consider a C++ implementation of CoreOS AppContainer spec
> -
>
> Key: MESOS-2162
> URL: https://issues.apache.org/jira/browse/MESOS-2162
> Project: Mesos
>  Issue Type: Story
>  Components: containerization
>Reporter: Dominic Hamon
>  Labels: mesosphere, twitter
>
> CoreOS have released a 
> [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
>  for a container abstraction as an alternative to Docker. They have also 
> released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
> We should consider a C++ implementation of the specification to have parity 
> with the community and then use this implementation for our containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2015-01-28 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295978#comment-14295978
 ] 

Timothy Chen commented on MESOS-2162:
-

Hi Steven, that's what I think too.

It's my plan to work on this but this quarter I won't have much time to do so.

Are you interested in this? We could work together.

> Consider a C++ implementation of CoreOS AppContainer spec
> -
>
> Key: MESOS-2162
> URL: https://issues.apache.org/jira/browse/MESOS-2162
> Project: Mesos
>  Issue Type: Story
>  Components: containerization
>Reporter: Dominic Hamon
>  Labels: mesosphere, twitter
>
> CoreOS have released a 
> [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
>  for a container abstraction as an alternative to Docker. They have also 
> released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
> We should consider a C++ implementation of the specification to have parity 
> with the community and then use this implementation for our containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2215) The Docker containerizer attempts to recover any task when checkpointing is enabled, not just docker tasks.

2015-01-28 Thread Steve Niemitz (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Niemitz updated MESOS-2215:
-
Description: 
Once the slave restarts and recovers the task, I see this error in the log for 
all tasks that were recovered every second or so.  Note, these were NOT docker 
tasks:

W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage for  
container 7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor 
thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd
 of framework 20150109-161713-715350282-5050-290797-: Failed to 'docker 
inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21': exit status = exited with 
status 1 stderr = Error: No such image or container: 
mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21
However the tasks themselves are still healthy and running.

The slave was launched with --containerizers=mesos,docker

-
More info: it looks like the docker containerizer is a little too ambitious 
about recovering containers, again this was not a docker task:
I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container 
'7b729b89-dc7e-4d08-af97-8cd1af560a21' for executor 
'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd'
 of framework 20150109-161713-715350282-5050-290797-

Looking into the source, it looks like the problem is that the 
ComposingContainerizer runs recover in parallel, but neither the docker 
containerizer nor the mesos containerizer checks whether it should recover the 
task (i.e. whether it was the one that launched it). Perhaps this needs to be 
written into the checkpoint somewhere?

  was:
Once the slave restarts and recovers the task, I see this error in the log for 
all tasks that were recovered every second or so.  Note, these were NOT docker 
tasks:

W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage for  
container 7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor 
thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd
 of framework 20150109-161713-715350282-5050-290797-: Failed to 'docker 
inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21': exit status = exited with 
status 1 stderr = Error: No such image or container: 
mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21
However the tasks themselves are still healthy and running.

The slave was launched with --containerizers=mesos,docker

-
More info: it looks like the docker containerizer is a little too ambitious 
about recovering containers, again this was not a docker task:
I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container 
'7b729b89-dc7e-4d08-af97-8cd1af560a21' for executor 
'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd'
 of framework 20150109-161713-715350282-5050-290797-

Looking into the source, it looks like the problem is that the 
ComposingContainerize runs recover in parallel, but neither the docker 
containerizer not mesos containerizer check if they should recover the task or 
not (because they were the ones that launched it).  Perhaps this needs to be 
written into the checkpoint somewhere?


> The Docker containerizer attempts to recover any task when checkpointing is 
> enabled, not just docker tasks.
> ---
>
> Key: MESOS-2215
> URL: https://issues.apache.org/jira/browse/MESOS-2215
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.21.0
>Reporter: Steve Niemitz
>Assignee: Timothy Chen
>
> Once the slave restarts and recovers the task, I see this error in the log 
> for all tasks that were recovered every second or so.  Note, these were NOT 
> docker tasks:
> W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage 
> for  container 7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor 
> thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd
>  of framework 20150109-161713-715350282-5050-290797-: Failed to 'docker 
> inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21': exit status = exited 
> with status 1 stderr = Error: No such image or container: 
> mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21
> However the tasks themselves are still healthy and running.
> The slave was launched with --containerizers=mesos,docker
> -
> More info: it looks like the docker containerizer is a little too ambitious 
> about recovering containers, again this was not a docker task:
> I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container 
> '7b729b89-dc7e-4d08-af97-8cd1af560a21' for executor 
> 'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd'
>  of framework 20150109-161713-715350282-5050-290797-
> Looking into the source

[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2015-01-28 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295944#comment-14295944
 ] 

Steven Schlansker commented on MESOS-2162:
--

This library may be a good starting point: https://github.com/cdaylward/libappc/

> Consider a C++ implementation of CoreOS AppContainer spec
> -
>
> Key: MESOS-2162
> URL: https://issues.apache.org/jira/browse/MESOS-2162
> Project: Mesos
>  Issue Type: Story
>  Components: containerization
>Reporter: Dominic Hamon
>  Labels: mesosphere, twitter
>
> CoreOS have released a 
> [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
>  for a container abstraction as an alternative to Docker. They have also 
> released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
> We should consider a C++ implementation of the specification to have parity 
> with the community and then use this implementation for our containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2015-01-28 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295939#comment-14295939
 ] 

Steven Schlansker commented on MESOS-2162:
--

Any possibility of getting this scheduled for an upcoming release?

> Consider a C++ implementation of CoreOS AppContainer spec
> -
>
> Key: MESOS-2162
> URL: https://issues.apache.org/jira/browse/MESOS-2162
> Project: Mesos
>  Issue Type: Story
>  Components: containerization
>Reporter: Dominic Hamon
>  Labels: mesosphere, twitter
>
> CoreOS have released a 
> [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
>  for a container abstraction as an alternative to Docker. They have also 
> released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
> We should consider a C++ implementation of the specification to have parity 
> with the community and then use this implementation for our containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-354) oversubscribe resources

2015-01-28 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295839#comment-14295839
 ] 

Niklas Quarfot Nielsen commented on MESOS-354:
--

Oversubscription means many things and can be considered a subset of the 
currently ongoing effort on optimistic offers, where optimistic offers let the 
allocator offer resources:

 - To multiple frameworks, to increase 'parallelism' (as opposed to the 
conservative/pessimistic scheme) and **increase task throughput**.
 - As preemptable resources drawn from unallocated but reserved resources, to 
**limit reservation slack** (the difference between reserved and allocated 
resources).

A third (and equally important) case, which expands these scenarios, is 
oversubscription of _allocated_ resources, which limits the **usage slack** 
(the difference between allocated and used resources).
There has been a lot of recent research showing the ability to reduce usage 
slack by 60% while maintaining the Service Level Objective (SLO) of 
latency-critical workloads (1). However, this kind of oversubscription needs 
policies and fine-tuning to make sure that best-effort tasks don't interfere 
with latency-critical ones. Therefore, we'd like to start a discussion on how 
such a system would look in Mesos. I will create a JIRA ticket (linking to this 
one) to start the conversation.

(1) 
http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/43017.pdf

> oversubscribe resources
> ---
>
> Key: MESOS-354
> URL: https://issues.apache.org/jira/browse/MESOS-354
> Project: Mesos
>  Issue Type: Story
>  Components: isolation, master, slave
>Reporter: brian wickman
>Priority: Minor
> Attachments: mesos_virtual_offers.pdf
>
>
> This proposal is predicated upon offer revocation.
> The idea would be to add a new "revoked" status either by (1) piggybacking 
> off an existing status update (TASK_LOST or TASK_KILLED) or (2) introducing a 
> new status update TASK_REVOKED.
> In order to augment an offer with metadata about revocability, there are 
> options:
>   1) Add a revocable boolean to the Offer and
> a) offer only one type of Offer per slave at a particular time
> b) offer both revocable and non-revocable resources at the same time but 
> require frameworks to understand that Offers can contain overlapping resources
>   2) Add a revocable_resources field on the Offer which is a superset of the 
> regular resources field.  By consuming > resources <= revocable_resources in 
> a launchTask, the Task becomes a revocable task.  If launching a task with < 
> resources, the Task is non-revocable.
> The use cases for revocable tasks are batch tasks (e.g. hadoop/pig/mapreduce) 
> and non-revocable tasks are online higher-SLA tasks (e.g. services.)
> Consider a non-revocable task that asks for 4 cores, 8 GB RAM and 20 GB of disk.  
> One of these resources is a rate (4 cpu seconds per second) and two of them 
> are fixed values (8GB and 20GB respectively, though disk resources can be 
> further broken down into spindles - fixed - and iops - a rate.)  In practice, 
> these are the maximum resources in the respective dimensions that this task 
> will use.  In reality, we provision tasks at some factor below peak, and only 
> hit peak resource consumption in rare circumstances or perhaps at a diurnal 
> peak.  
> In the meantime, we stand to gain from offering some constant factor of 
> the difference between (reserved - actual) of non-revocable tasks as 
> revocable resources, depending upon our tolerance for revocable task churn.  
> The main challenge is coming up with an accurate short / medium / long-term 
> prediction of resource consumption based upon current behavior.
> In many cases it would be OK to be sloppy:
>   * CPU / iops / network IO are rates (compressible) and can often be OK 
> below guarantees for brief periods of time while task revocation takes place
>   * Memory slack can be provided by enabling swap and dynamically setting 
> swap paging boundaries.  Should swap ever be activated, that would be a 
> signal to revoke.
> The master / allocator would piggyback on the slave heartbeat mechanism to 
> learn of the amount of revocable resources available at any point in time.
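
To make option (2) from the proposal above concrete, here is one reading of it 
as code: an offer would carry both the regular 'resources' and a superset 
'revocable_resources', and a launch becomes a revocable task exactly when it 
needs headroom that only the revocable superset provides. This is a 
hypothetical sketch with a simplified resource type, not the proposed protobuf 
change:

{code}
// Simplified stand-in for mesos::Resources: just cpus and mem.
struct Resources
{
  double cpus;
  double mem;

  bool contains(const Resources& that) const
  {
    return cpus >= that.cpus && mem >= that.mem;
  }
};

// Option (2): 'revocable' is a superset of 'regular'. A task whose request
// fits within 'regular' is non-revocable; one that fits only within the
// revocable superset becomes a revocable task (and could later receive
// TASK_LOST or a new TASK_REVOKED when those resources are reclaimed).
bool launchesRevocableTask(const Resources& requested,
                           const Resources& regular,
                           const Resources& revocable)
{
  if (!revocable.contains(requested)) {
    return false;  // Doesn't fit in the offer at all; invalid launch.
  }

  return !regular.contains(requested);
}
{code}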



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2287) Document undocumented tests

2015-01-28 Thread Niklas Quarfot Nielsen (JIRA)
Niklas Quarfot Nielsen created MESOS-2287:
-

 Summary: Document undocumented tests
 Key: MESOS-2287
 URL: https://issues.apache.org/jira/browse/MESOS-2287
 Project: Mesos
  Issue Type: Improvement
Reporter: Niklas Quarfot Nielsen
Priority: Trivial


We have an inconsistency in the way we document tests. It has become a rule of 
thumb to include a small blurb about the test. For example:

{code}
// This tests the 'active' field in slave entries from state.json. We
// first verify an active slave, deactivate it and verify that the
// 'active' field is false.
TEST_F(MasterTest, SlaveActiveEndpoint)
{
  // Start a master.
  Try<PID<Master>> master = StartMaster();
  ASSERT_SOME(master);
  ...
{code}

However, we still have many tests that haven't been documented. For example: 

{code}
}


TEST_F(MasterTest, MetricsInStatsEndpoint)
{
  Try<PID<Master>> master = StartMaster();
  ASSERT_SOME(master);

  Future<Response> response =
    process::http::get(master.get(), "stats.json");
  ...
{code}

It would be great to do a scan and make sure all the tests are documented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2286) Simplify the allocator architecture

2015-01-28 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-2286:
---
Component/s: allocation
Description: Allocator refactor 
[https://issues.apache.org/jira/browse/MESOS-2213] will distinguish between 
general allocators and Process-based ones. This introduces a chain of 
inheritance with a single real allocator at the bottom. Consider simplifying 
this architecture without making it harder to add new allocators.
   Priority: Minor  (was: Major)

> Simplify the allocator architecture
> ---
>
> Key: MESOS-2286
> URL: https://issues.apache.org/jira/browse/MESOS-2286
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Alexander Rukletsov
>Priority: Minor
>
> Allocator refactor [https://issues.apache.org/jira/browse/MESOS-2213] will 
> distinguish between general allocators and Process-based ones. This 
> introduces a chain of inheritance with a single real allocator at the bottom. 
> Consider simplifying this architecture without impacting the ability to add 
> new allocators.
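
For illustration only, a minimal C++ sketch of the kind of inheritance chain 
described here; the class names are hypothetical and do not necessarily match 
the types introduced by MESOS-2213.

{code}
// General allocator interface.
class Allocator
{
public:
  virtual ~Allocator() {}
  virtual void initialize() = 0;
};

// Intermediate layer for allocators backed by a libprocess Process.
class ProcessBasedAllocator : public Allocator
{
public:
  virtual void initialize() { /* spawn the process, dispatch calls to it */ }
};

// The single "real" allocator at the bottom of the chain.
class HierarchicalAllocator : public ProcessBasedAllocator {};
{code}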



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2286) Simplify the allocator architecture

2015-01-28 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-2286:
--

 Summary: Simplify the allocator architecture
 Key: MESOS-2286
 URL: https://issues.apache.org/jira/browse/MESOS-2286
 Project: Mesos
  Issue Type: Improvement
Reporter: Alexander Rukletsov






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2285) Eliminate dependency on master::Flags in Allocator

2015-01-28 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-2285:
--

 Summary: Eliminate dependency on master::Flags in Allocator
 Key: MESOS-2285
 URL: https://issues.apache.org/jira/browse/MESOS-2285
 Project: Mesos
  Issue Type: Improvement
  Components: allocation
Reporter: Alexander Rukletsov
Priority: Minor


{{Allocator}} extracts parameters from {{master::Flags}} during initialization. 
Currently, only the {{allocation_interval}} key from {{master::Flags}} is used. 
It makes sense to introduce a separate structure, {{allocator::Options}}, with 
the values relevant for allocation and eliminate the dependency on 
{{master::Flags}}.
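
A rough sketch of what such a structure might look like, assuming the stout 
{{Duration}} type; only the {{allocation_interval}} value mentioned in the 
ticket is included, and the wiring shown in the comment is illustrative rather 
than a committed API.

{code}
#include <stout/duration.hpp>

namespace allocator {

// Holds only the values relevant for allocation, decoupled from
// master::Flags. Currently that is just the allocation interval.
struct Options
{
  Duration allocationInterval;
};

} // namespace allocator

// Illustrative wiring in the master: copy the flag into the options
// struct once, and hand the allocator the options instead of the flags.
//
//   allocator::Options options;
//   options.allocationInterval = flags.allocation_interval;
//   allocator->initialize(options, ...);
{code}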



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MESOS-2284) Slave cannot be registered while masters keep switching to another one.

2015-01-28 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov resolved MESOS-2284.

Resolution: Not a Problem

> Slave cannot be registered while masters keep switching to another one.
> ---
>
> Key: MESOS-2284
> URL: https://issues.apache.org/jira/browse/MESOS-2284
> Project: Mesos
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 0.20.1
> Environment: Ubuntu14.04
>Reporter: Hou Xiaokun
>Priority: Blocker
> Fix For: 0.21.0
>
>
> I followed the instructions on the page 
> http://mesosphere.com/docs/getting-started/datacenter/install/.
> I set up two masters and one slave, with a quorum value of "2", and 
> configured the IP addresses in the hostname files separately.
> Here is the log from the slave node:
> I0127 22:37:26.762953  1966 slave.cpp:627] No credentials provided. 
> Attempting to register without authentication
> I0127 22:37:26.762985  1966 slave.cpp:638] Detecting new master
> I0127 22:37:26.763022  1966 status_update_manager.cpp:171] Pausing sending 
> status updates
> I0127 22:38:06.683840  1962 slave.cpp:3321] Current usage 16.98%. Max allowed 
> age: 5.111732713224155days
> I0127 22:38:26.986556  1966 slave.cpp:2623] master@10.27.17.135:5050 exited
> W0127 22:38:26.986675  1966 slave.cpp:2626] Master disconnected! Waiting for 
> a new master to be elected
> I0127 22:38:34.909605  1963 detector.cpp:138] Detected a new leader: 
> (id='2028')
> I0127 22:38:34.909811  1963 group.cpp:659] Trying to get 
> '/mesos/info_002028' in ZooKeeper
> I0127 22:38:34.910909  1963 detector.cpp:433] A new leading master 
> (UPID=master@10.27.16.214:5050) is detected
> I0127 22:38:34.910989  1963 slave.cpp:602] New master detected at 
> master@10.27.16.214:5050
> I0127 22:38:34.93  1963 slave.cpp:627] No credentials provided. 
> Attempting to register without authentication
> I0127 22:38:34.911144  1963 slave.cpp:638] Detecting new master
> I0127 22:38:34.911183  1963 status_update_manager.cpp:171] Pausing sending 
> status updates
> I0127 22:39:06.684526  1964 slave.cpp:3321] Current usage 16.98%. Max allowed 
> age: 5.111731773610567days
> I0127 22:39:35.231653  1963 slave.cpp:2623] master@10.27.16.214:5050 exited
> W0127 22:39:35.231869  1963 slave.cpp:2626] Master disconnected! Waiting for 
> a new master to be elected
> I0127 22:39:42.761540  1964 detector.cpp:138] Detected a new leader: 
> (id='2029')
> I0127 22:39:42.761732  1964 group.cpp:659] Trying to get 
> '/mesos/info_002029' in ZooKeeper
> I0127 22:39:42.762914  1964 detector.cpp:433] A new leading master 
> (UPID=master@10.27.17.135:5050) is detected
> I0127 22:39:42.762984  1964 slave.cpp:602] New master detected at 
> master@10.27.17.135:5050
> I0127 22:39:42.763089  1964 slave.cpp:627] No credentials provided. 
> Attempting to register without authentication
> I0127 22:39:42.763118  1964 slave.cpp:638] Detecting new master
> I0127 22:39:42.763155  1964 status_update_manager.cpp:171] Pausing sending 
> status updates



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (MESOS-2284) Slave cannot be registered while masters keep switching to another one.

2015-01-28 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reopened MESOS-2284:


> Slave cannot be registered while masters keep switching to another one.
> ---
>
> Key: MESOS-2284
> URL: https://issues.apache.org/jira/browse/MESOS-2284
> Project: Mesos
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 0.20.1
> Environment: Ubuntu14.04
>Reporter: Hou Xiaokun
>Priority: Blocker
> Fix For: 0.21.0
>
>
> I followed the instructions on the page 
> http://mesosphere.com/docs/getting-started/datacenter/install/.
> I set up two masters and one slave, with a quorum value of "2", and 
> configured the IP addresses in the hostname files separately.
> Here is the log from the slave node:
> I0127 22:37:26.762953  1966 slave.cpp:627] No credentials provided. 
> Attempting to register without authentication
> I0127 22:37:26.762985  1966 slave.cpp:638] Detecting new master
> I0127 22:37:26.763022  1966 status_update_manager.cpp:171] Pausing sending 
> status updates
> I0127 22:38:06.683840  1962 slave.cpp:3321] Current usage 16.98%. Max allowed 
> age: 5.111732713224155days
> I0127 22:38:26.986556  1966 slave.cpp:2623] master@10.27.17.135:5050 exited
> W0127 22:38:26.986675  1966 slave.cpp:2626] Master disconnected! Waiting for 
> a new master to be elected
> I0127 22:38:34.909605  1963 detector.cpp:138] Detected a new leader: 
> (id='2028')
> I0127 22:38:34.909811  1963 group.cpp:659] Trying to get 
> '/mesos/info_002028' in ZooKeeper
> I0127 22:38:34.910909  1963 detector.cpp:433] A new leading master 
> (UPID=master@10.27.16.214:5050) is detected
> I0127 22:38:34.910989  1963 slave.cpp:602] New master detected at 
> master@10.27.16.214:5050
> I0127 22:38:34.93  1963 slave.cpp:627] No credentials provided. 
> Attempting to register without authentication
> I0127 22:38:34.911144  1963 slave.cpp:638] Detecting new master
> I0127 22:38:34.911183  1963 status_update_manager.cpp:171] Pausing sending 
> status updates
> I0127 22:39:06.684526  1964 slave.cpp:3321] Current usage 16.98%. Max allowed 
> age: 5.111731773610567days
> I0127 22:39:35.231653  1963 slave.cpp:2623] master@10.27.16.214:5050 exited
> W0127 22:39:35.231869  1963 slave.cpp:2626] Master disconnected! Waiting for 
> a new master to be elected
> I0127 22:39:42.761540  1964 detector.cpp:138] Detected a new leader: 
> (id='2029')
> I0127 22:39:42.761732  1964 group.cpp:659] Trying to get 
> '/mesos/info_002029' in ZooKeeper
> I0127 22:39:42.762914  1964 detector.cpp:433] A new leading master 
> (UPID=master@10.27.17.135:5050) is detected
> I0127 22:39:42.762984  1964 slave.cpp:602] New master detected at 
> master@10.27.17.135:5050
> I0127 22:39:42.763089  1964 slave.cpp:627] No credentials provided. 
> Attempting to register without authentication
> I0127 22:39:42.763118  1964 slave.cpp:638] Detecting new master
> I0127 22:39:42.763155  1964 status_update_manager.cpp:171] Pausing sending 
> status updates



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2276) Mesos-slave refuses to startup with many stopped docker containers

2015-01-28 Thread Dr. Stefan Schimanski (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294907#comment-14294907
 ] 

Dr. Stefan Schimanski commented on MESOS-2276:
--

I have changed the summary of this issue. Since the original issue is 
resolved, what remains is that mesos-slave should behave much more forgivingly 
when there are many stopped containers. Moreover, a proper error message would 
help to identify the problem.
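
As an illustration of the kind of actionable error message requested here, a 
small C++ sketch (hypothetical, not the Mesos containerizer code) that detects 
the "too many open files" case on pipe creation:

{code}
#include <errno.h>
#include <string.h>
#include <unistd.h>

#include <string>

// Hypothetical helper: create a pipe and, on failure, return an error
// message that points at the likely cause instead of a bare failure.
std::string createPipeOrExplain(int pipefd[2])
{
  if (::pipe(pipefd) == 0) {
    return "";  // Success.
  }

  if (errno == EMFILE || errno == ENFILE) {
    return "Failed to create pipe: too many open files. Consider raising "
           "the open file limit (ulimit -n) or cleaning up stopped Docker "
           "containers before restarting the slave.";
  }

  return std::string("Failed to create pipe: ") + strerror(errno);
}
{code}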

> Mesos-slave refuses to startup with many stopped docker containers
> --
>
> Key: MESOS-2276
> URL: https://issues.apache.org/jira/browse/MESOS-2276
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, slave
>Affects Versions: 0.21.0, 0.21.1
> Environment: Ubuntu 14.04LTS, Mesosphere packages
>Reporter: Dr. Stefan Schimanski
>
> The mesos-slave is launched as
> # /usr/local/sbin/mesos-slave 
> --master=zk://10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181/mesos --ip=10.0.0.2 
> --log_dir=/var/log/mesos --attributes=node_id:srv002 --checkpoint 
> --containerizers=docker --executor_registration_timeout=5mins 
> --logging_level=INFO
> giving this output:
> I0127 19:26:32.674113 19880 logging.cpp:172] INFO level logging started!
> I0127 19:26:32.674741 19880 main.cpp:142] Build: 2014-11-22 05:29:57 by root
> I0127 19:26:32.674774 19880 main.cpp:144] Version: 0.21.0
> I0127 19:26:32.674799 19880 main.cpp:147] Git tag: 0.21.0
> I0127 19:26:32.674824 19880 main.cpp:151] Git SHA: 
> ab8fa655d34e8e15a4290422df38a18db1c09b5b
> I0127 19:26:32.786731 19880 main.cpp:165] Starting Mesos slave
> 2015-01-27 19:26:32,786:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@712: Client 
> environment:zookeeper.version=zookeeper C client 3.4.5
> 2015-01-27 19:26:32,786:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@716: Client 
> environment:host.name=srv002
> 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@723: Client 
> environment:os.name=Linux
> 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@724: Client 
> environment:os.arch=3.13.0-44-generic
> 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@725: Client 
> environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
> 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@733: Client 
> environment:user.name=root
> 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@741: Client 
> environment:user.home=/root
> 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@753: Client 
> environment:user.dir=/root
> 2015-01-27 19:26:32,789:19880(0x7fcf0cf9f700):ZOO_INFO@zookeeper_init@786: 
> Initiating client connection, host=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 
> sessionTimeout=1 watcher=0x7fcf13592a0a sessionId=0 sessionPasswd= 
> context=0x7fceec0009e0 flags=0
> I0127 19:26:32.796588 19880 slave.cpp:169] Slave started on 1)@10.0.0.2:5051
> I0127 19:26:32.797345 19880 slave.cpp:289] Slave resources: cpus(*):8; 
> mem(*):6960; disk(*):246731; ports(*):[31000-32000]
> I0127 19:26:32.798017 19880 slave.cpp:318] Slave hostname: srv002
> I0127 19:26:32.798076 19880 slave.cpp:319] Slave checkpoint: true
> 2015-01-27 19:26:32,800:19880(0x7fcf08f5c700):ZOO_INFO@check_events@1703: 
> initiated connection to server [10.0.0.1:2181]
> I0127 19:26:32.808229 19886 state.cpp:33] Recovering state from 
> '/tmp/mesos/meta'
> I0127 19:26:32.809090 19882 status_update_manager.cpp:197] Recovering status 
> update manager
> I0127 19:26:32.809677 19887 docker.cpp:767] Recovering Docker containers
> 2015-01-27 19:26:32,821:19880(0x7fcf08f5c700):ZOO_INFO@check_events@1750: 
> session establishment complete on server [10.0.0.1:2181], 
> sessionId=0x14b2adf7a560106, negotiated timeout=1
> I0127 19:26:32.823292 19885 group.cpp:313] Group process 
> (group(1)@10.0.0.2:5051) connected to ZooKeeper
> I0127 19:26:32.823443 19885 group.cpp:790] Syncing group operations: queue 
> size (joins, cancels, datas) = (0, 0, 0)
> I0127 19:26:32.823484 19885 group.cpp:385] Trying to create path '/mesos' in 
> ZooKeeper
> I0127 19:26:32.829711 19882 detector.cpp:138] Detected a new leader: 
> (id='143')
> I0127 19:26:32.830559 19882 group.cpp:659] Trying to get 
> '/mesos/info_000143' in ZooKeeper
> I0127 19:26:32.837913 19886 detector.cpp:433] A new leading master 
> (UPID=master@10.0.0.1:5050) is detected
> Failed to perform recovery: Collect failed: Failed to create pipe: Too many 
> open files
> To remedy this do as follows:
> Step 1: rm -f /tmp/mesos/meta/slaves/latest
> This ensures slave doesn't recover old live executors.
> Step 2: Restart the slave.
> At /tmp/mesos/meta/slaves/latest there is nothing.
> The slave was part of a 3 node cluster before.
> When started as an upstart service, the process is relaunched all the time 
> and a large number of defunct processes 

[jira] [Updated] (MESOS-2276) Mesos-slave refuses to startup with many stopped docker containers

2015-01-28 Thread Dr. Stefan Schimanski (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dr. Stefan Schimanski updated MESOS-2276:
-
Summary: Mesos-slave refuses to startup with many stopped docker containers 
 (was: Mesos-slave with containerizer Docker doesn't startup anymore)

> Mesos-slave refuses to startup with many stopped docker containers
> --
>
> Key: MESOS-2276
> URL: https://issues.apache.org/jira/browse/MESOS-2276
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, slave
>Affects Versions: 0.21.0, 0.21.1
> Environment: Ubuntu 14.04LTS, Mesosphere packages
>Reporter: Dr. Stefan Schimanski
>
> The mesos-slave is launched as
> # /usr/local/sbin/mesos-slave 
> --master=zk://10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181/mesos --ip=10.0.0.2 
> --log_dir=/var/log/mesos --attributes=node_id:srv002 --checkpoint 
> --containerizers=docker --executor_registration_timeout=5mins 
> --logging_level=INFO
> giving this output:
> I0127 19:26:32.674113 19880 logging.cpp:172] INFO level logging started!
> I0127 19:26:32.674741 19880 main.cpp:142] Build: 2014-11-22 05:29:57 by root
> I0127 19:26:32.674774 19880 main.cpp:144] Version: 0.21.0
> I0127 19:26:32.674799 19880 main.cpp:147] Git tag: 0.21.0
> I0127 19:26:32.674824 19880 main.cpp:151] Git SHA: 
> ab8fa655d34e8e15a4290422df38a18db1c09b5b
> I0127 19:26:32.786731 19880 main.cpp:165] Starting Mesos slave
> 2015-01-27 19:26:32,786:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@712: Client 
> environment:zookeeper.version=zookeeper C client 3.4.5
> 2015-01-27 19:26:32,786:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@716: Client 
> environment:host.name=srv002
> 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@723: Client 
> environment:os.name=Linux
> 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@724: Client 
> environment:os.arch=3.13.0-44-generic
> 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@725: Client 
> environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
> 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@733: Client 
> environment:user.name=root
> 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@741: Client 
> environment:user.home=/root
> 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@753: Client 
> environment:user.dir=/root
> 2015-01-27 19:26:32,789:19880(0x7fcf0cf9f700):ZOO_INFO@zookeeper_init@786: 
> Initiating client connection, host=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 
> sessionTimeout=1 watcher=0x7fcf13592a0a sessionId=0 sessionPasswd= 
> context=0x7fceec0009e0 flags=0
> I0127 19:26:32.796588 19880 slave.cpp:169] Slave started on 1)@10.0.0.2:5051
> I0127 19:26:32.797345 19880 slave.cpp:289] Slave resources: cpus(*):8; 
> mem(*):6960; disk(*):246731; ports(*):[31000-32000]
> I0127 19:26:32.798017 19880 slave.cpp:318] Slave hostname: srv002
> I0127 19:26:32.798076 19880 slave.cpp:319] Slave checkpoint: true
> 2015-01-27 19:26:32,800:19880(0x7fcf08f5c700):ZOO_INFO@check_events@1703: 
> initiated connection to server [10.0.0.1:2181]
> I0127 19:26:32.808229 19886 state.cpp:33] Recovering state from 
> '/tmp/mesos/meta'
> I0127 19:26:32.809090 19882 status_update_manager.cpp:197] Recovering status 
> update manager
> I0127 19:26:32.809677 19887 docker.cpp:767] Recovering Docker containers
> 2015-01-27 19:26:32,821:19880(0x7fcf08f5c700):ZOO_INFO@check_events@1750: 
> session establishment complete on server [10.0.0.1:2181], 
> sessionId=0x14b2adf7a560106, negotiated timeout=1
> I0127 19:26:32.823292 19885 group.cpp:313] Group process 
> (group(1)@10.0.0.2:5051) connected to ZooKeeper
> I0127 19:26:32.823443 19885 group.cpp:790] Syncing group operations: queue 
> size (joins, cancels, datas) = (0, 0, 0)
> I0127 19:26:32.823484 19885 group.cpp:385] Trying to create path '/mesos' in 
> ZooKeeper
> I0127 19:26:32.829711 19882 detector.cpp:138] Detected a new leader: 
> (id='143')
> I0127 19:26:32.830559 19882 group.cpp:659] Trying to get 
> '/mesos/info_000143' in ZooKeeper
> I0127 19:26:32.837913 19886 detector.cpp:433] A new leading master 
> (UPID=master@10.0.0.1:5050) is detected
> Failed to perform recovery: Collect failed: Failed to create pipe: Too many 
> open files
> To remedy this do as follows:
> Step 1: rm -f /tmp/mesos/meta/slaves/latest
> This ensures slave doesn't recover old live executors.
> Step 2: Restart the slave.
> At /tmp/mesos/meta/slaves/latest there is nothing.
> The slave was part of a 3 node cluster before.
> When started as an upstart service, the process is relaunched all the time 
> and a large number of defunct processes appear, like these ones:
> root 30321  0.0  0.0  13000   440 ?S19:28   0:00 iptables 
> --wait -L -n
> root 30322  0.0  0.0      396 ?   

[jira] [Resolved] (MESOS-2284) Slave cannot be registered while masters keep switching to another one.

2015-01-28 Thread Hou Xiaokun (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hou Xiaokun resolved MESOS-2284.

   Resolution: Fixed
Fix Version/s: 0.21.0

Hi, I changed the quorum to 1. The slave shows up now!

Thanks!

> Slave cannot be registered while masters keep switching to another one.
> ---
>
> Key: MESOS-2284
> URL: https://issues.apache.org/jira/browse/MESOS-2284
> Project: Mesos
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 0.20.1
> Environment: Ubuntu14.04
>Reporter: Hou Xiaokun
>Priority: Blocker
> Fix For: 0.21.0
>
>
> I followed the instructions on the page 
> http://mesosphere.com/docs/getting-started/datacenter/install/.
> I set up two masters and one slave, with a quorum value of "2", and 
> configured the IP addresses in the hostname files separately.
> Here is the log from the slave node:
> I0127 22:37:26.762953  1966 slave.cpp:627] No credentials provided. 
> Attempting to register without authentication
> I0127 22:37:26.762985  1966 slave.cpp:638] Detecting new master
> I0127 22:37:26.763022  1966 status_update_manager.cpp:171] Pausing sending 
> status updates
> I0127 22:38:06.683840  1962 slave.cpp:3321] Current usage 16.98%. Max allowed 
> age: 5.111732713224155days
> I0127 22:38:26.986556  1966 slave.cpp:2623] master@10.27.17.135:5050 exited
> W0127 22:38:26.986675  1966 slave.cpp:2626] Master disconnected! Waiting for 
> a new master to be elected
> I0127 22:38:34.909605  1963 detector.cpp:138] Detected a new leader: 
> (id='2028')
> I0127 22:38:34.909811  1963 group.cpp:659] Trying to get 
> '/mesos/info_002028' in ZooKeeper
> I0127 22:38:34.910909  1963 detector.cpp:433] A new leading master 
> (UPID=master@10.27.16.214:5050) is detected
> I0127 22:38:34.910989  1963 slave.cpp:602] New master detected at 
> master@10.27.16.214:5050
> I0127 22:38:34.93  1963 slave.cpp:627] No credentials provided. 
> Attempting to register without authentication
> I0127 22:38:34.911144  1963 slave.cpp:638] Detecting new master
> I0127 22:38:34.911183  1963 status_update_manager.cpp:171] Pausing sending 
> status updates
> I0127 22:39:06.684526  1964 slave.cpp:3321] Current usage 16.98%. Max allowed 
> age: 5.111731773610567days
> I0127 22:39:35.231653  1963 slave.cpp:2623] master@10.27.16.214:5050 exited
> W0127 22:39:35.231869  1963 slave.cpp:2626] Master disconnected! Waiting for 
> a new master to be elected
> I0127 22:39:42.761540  1964 detector.cpp:138] Detected a new leader: 
> (id='2029')
> I0127 22:39:42.761732  1964 group.cpp:659] Trying to get 
> '/mesos/info_002029' in ZooKeeper
> I0127 22:39:42.762914  1964 detector.cpp:433] A new leading master 
> (UPID=master@10.27.17.135:5050) is detected
> I0127 22:39:42.762984  1964 slave.cpp:602] New master detected at 
> master@10.27.17.135:5050
> I0127 22:39:42.763089  1964 slave.cpp:627] No credentials provided. 
> Attempting to register without authentication
> I0127 22:39:42.763118  1964 slave.cpp:638] Detecting new master
> I0127 22:39:42.763155  1964 status_update_manager.cpp:171] Pausing sending 
> status updates



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)