[jira] [Updated] (MESOS-2228) SlaveTest.MesosExecutorGracefulShutdown is flaky
[ https://issues.apache.org/jira/browse/MESOS-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-2228:
-----------------------------------
    Sprint: Twitter Mesos Q1 Sprint 1

> SlaveTest.MesosExecutorGracefulShutdown is flaky
> ------------------------------------------------
>
>              Key: MESOS-2228
>              URL: https://issues.apache.org/jira/browse/MESOS-2228
>          Project: Mesos
>       Issue Type: Bug
>       Components: test
> Affects Versions: 0.22.0
>         Reporter: Vinod Kone
>         Assignee: Benjamin Mahler
>           Labels: twitter
>
> Observed this on internal CI
> {noformat}
> [ RUN      ] SlaveTest.MesosExecutorGracefulShutdown
> Using temporary directory '/tmp/SlaveTest_MesosExecutorGracefulShutdown_AWdtVJ'
> I0124 08:14:04.399211 7926 leveldb.cpp:176] Opened db in 27.364056ms
> I0124 08:14:04.402632 7926 leveldb.cpp:183] Compacted db in 3.357646ms
> I0124 08:14:04.402691 7926 leveldb.cpp:198] Created db iterator in 23822ns
> I0124 08:14:04.402708 7926 leveldb.cpp:204] Seeked to beginning of db in 1913ns
> I0124 08:14:04.402716 7926 leveldb.cpp:273] Iterated through 0 keys in the db in 458ns
> I0124 08:14:04.402767 7926 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> I0124 08:14:04.403728 7951 recover.cpp:449] Starting replica recovery
> I0124 08:14:04.404011 7951 recover.cpp:475] Replica is in EMPTY status
> I0124 08:14:04.407765 7950 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
> I0124 08:14:04.408710 7951 recover.cpp:195] Received a recover response from a replica in EMPTY status
> I0124 08:14:04.419666 7951 recover.cpp:566] Updating replica status to STARTING
> I0124 08:14:04.429719 7953 master.cpp:262] Master 20150124-081404-16842879-47787-7926 (utopic) started on 127.0.1.1:47787
> I0124 08:14:04.429790 7953 master.cpp:308] Master only allowing authenticated frameworks to register
> I0124 08:14:04.429802 7953 master.cpp:313] Master only allowing authenticated slaves to register
> I0124 08:14:04.429826 7953 credentials.hpp:36] Loading credentials for authentication from '/tmp/SlaveTest_MesosExecutorGracefulShutdown_AWdtVJ/credentials'
> I0124 08:14:04.430277 7953 master.cpp:357] Authorization enabled
> I0124 08:14:04.432682 7953 master.cpp:1219] The newly elected leader is master@127.0.1.1:47787 with id 20150124-081404-16842879-47787-7926
> I0124 08:14:04.432816 7953 master.cpp:1232] Elected as the leading master!
> I0124 08:14:04.432894 7953 master.cpp:1050] Recovering from registrar
> I0124 08:14:04.433212 7950 registrar.cpp:313] Recovering registrar
> I0124 08:14:04.434226 7951 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 14.323302ms
> I0124 08:14:04.434270 7951 replica.cpp:323] Persisted replica status to STARTING
> I0124 08:14:04.434489 7951 recover.cpp:475] Replica is in STARTING status
> I0124 08:14:04.436164 7951 replica.cpp:641] Replica in STARTING status received a broadcasted recover request
> I0124 08:14:04.439368 7947 recover.cpp:195] Received a recover response from a replica in STARTING status
> I0124 08:14:04.440626 7947 recover.cpp:566] Updating replica status to VOTING
> I0124 08:14:04.443667 7947 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 2.698664ms
> I0124 08:14:04.443759 7947 replica.cpp:323] Persisted replica status to VOTING
> I0124 08:14:04.443925 7947 recover.cpp:580] Successfully joined the Paxos group
> I0124 08:14:04.444160 7947 recover.cpp:464] Recover process terminated
> I0124 08:14:04.444543 7949 log.cpp:660] Attempting to start the writer
> I0124 08:14:04.446331 7949 replica.cpp:477] Replica received implicit promise request with proposal 1
> I0124 08:14:04.449329 7949 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 2.690453ms
> I0124 08:14:04.449388 7949 replica.cpp:345] Persisted promised to 1
> I0124 08:14:04.450637 7947 coordinator.cpp:230] Coordinator attemping to fill missing position
> I0124 08:14:04.452271 7949 replica.cpp:378] Replica received explicit promise request for position 0 with proposal 2
> I0124 08:14:04.455124 7949 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 2.593522ms
> I0124 08:14:04.455157 7949 replica.cpp:679] Persisted action at 0
> I0124 08:14:04.456594 7951 replica.cpp:511] Replica received write request for position 0
> I0124 08:14:04.456657 7951 leveldb.cpp:438] Reading position from leveldb took 30358ns
> I0124 08:14:04.464860 7951 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 8.164646ms
> I0124 08:14:04.464903 7951 replica.cpp:679] Persisted action at 0
> I0124 08:14:04.465947 7949 replica.cpp:658] Replica received learned notice for position 0
> I0124 08:14:04.471567 7949 leveldb.cpp:343] Persisting action (16 bytes)
[jira] [Assigned] (MESOS-2228) SlaveTest.MesosExecutorGracefulShutdown is flaky
[ https://issues.apache.org/jira/browse/MESOS-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler reassigned MESOS-2228:
--------------------------------------
    Assignee: Benjamin Mahler  (was: Alexander Rukletsov)

{quote}
(or is not being reaped)
{quote}
From the output, we're not seeing 'Terminated' in the output, which means that it's the SIGKILL reaching the pid, not the SIGTERM, no? Because of this, it doesn't seem like it's a reaping issue. Anything I'm missing?
{quote}
From the logs it looks like a simple sleep task doesn't terminate
{quote}
Looks like this to me as well; these are VMs and we sometimes see strange blocking behavior. I've bumped the timeout for now and included a nicer error message. Please take a look: https://reviews.apache.org/r/30402/
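The SIGTERM-then-SIGKILL escalation being debated above can be sketched in Python. This is an illustrative sketch only, not Mesos's executor shutdown code: the `graceful_shutdown` helper and the grace period are made up for the example.

```python
import signal
import subprocess
import time

def graceful_shutdown(proc: subprocess.Popen, grace_secs: float) -> str:
    """Send SIGTERM; if the child is still alive after the grace
    period, escalate to SIGKILL (which cannot be caught or ignored)."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_secs)
        return "terminated"  # the child honored SIGTERM in time
    except subprocess.TimeoutExpired:
        proc.kill()          # SIGKILL: forcible, uncatchable
        proc.wait()
        return "killed"

if __name__ == "__main__":
    # A child that ignores SIGTERM forces the SIGKILL path, which is
    # consistent with what the test output above suggests happened.
    child = subprocess.Popen(
        ["python3", "-c",
         "import signal, time;"
         "signal.signal(signal.SIGTERM, signal.SIG_IGN);"
         "time.sleep(60)"])
    time.sleep(0.5)  # give the child a moment to install its handler
    print(graceful_shutdown(child, grace_secs=1.0))
```

If 'Terminated' never appears in the output, that matches the `killed` path here: the SIGTERM was ignored (or the task blocked), and only the SIGKILL took effect.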
[jira] [Updated] (MESOS-2228) SlaveTest.MesosExecutorGracefulShutdown is flaky
[ https://issues.apache.org/jira/browse/MESOS-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-2228:
-----------------------------------
    Labels: twitter  (was: )
[jira] [Commented] (MESOS-1806) Substituting etcd or ReplicatedLog for Zookeeper
[ https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296345#comment-14296345 ]

Cody Maloney commented on MESOS-1806:
-------------------------------------
https://reviews.apache.org/r/30194/
https://reviews.apache.org/r/30195/
https://reviews.apache.org/r/30393/
https://reviews.apache.org/r/30394/
https://reviews.apache.org/r/30395/
https://reviews.apache.org/r/30396/
https://reviews.apache.org/r/30397/
https://reviews.apache.org/r/30398/

> Substituting etcd or ReplicatedLog for Zookeeper
> ------------------------------------------------
>
>         Key: MESOS-1806
>         URL: https://issues.apache.org/jira/browse/MESOS-1806
>     Project: Mesos
>  Issue Type: Task
>    Reporter: Ed Ropple
>    Assignee: Cody Maloney
>    Priority: Minor
>
> >eropple: Could you also file a new JIRA for Mesos to drop ZK in favor of etcd or ReplicatedLog? Would love to get some momentum going on that one.
> --
> Consider it filed. =)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-2183) docker containerizer doesn't work when mesos-slave is running in a container
[ https://issues.apache.org/jira/browse/MESOS-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296300#comment-14296300 ]

Timothy Chen commented on MESOS-2183:
-------------------------------------
So I'm planning to leverage the --pid=host flag in docker 1.5, which won't clone a new pid namespace. With this you won't see the problems you are seeing. What I described in my doc is how to handle recovery.

> docker containerizer doesn't work when mesos-slave is running in a container
> ----------------------------------------------------------------------------
>
>         Key: MESOS-2183
>         URL: https://issues.apache.org/jira/browse/MESOS-2183
>     Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>    Reporter: Jay Buffington
>    Assignee: Timothy Chen
>
> I've started running the mesos-slave process itself inside a docker container. I bind mount in the dockerd socket, so there is only one docker daemon running on the system.
> The mesos-slave process uses "docker run" to start an executor in another, sibling, container. It asks "docker inspect" what the pid of the executor running in the container is. Since the mesos-slave process is in its own pid namespace, it cannot see the pid for the executor in /proc. Therefore, it thinks the executor died and it does a docker kill.
> It looks like the executor pid is also used to determine what port the executor is listening on.
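The failure mode under discussion boils down to pid-namespace visibility: the pid that `docker inspect` reports is a root-namespace pid, so a `/proc` check from inside the slave's own pid namespace fails even though the executor is alive. A minimal, Linux-only sketch (the `pid_visible` helper is illustrative, not Mesos code):

```python
import os

def pid_visible(pid: int) -> bool:
    """Return True if `pid` exists in the calling process's pid namespace.

    A pid obtained from `docker inspect` is valid in the root pid
    namespace; from inside a container with its own pid namespace the
    same number usually maps to nothing, so this check fails and the
    slave wrongly concludes the executor died.
    """
    return os.path.exists("/proc/%d" % pid)

if __name__ == "__main__":
    print(pid_visible(os.getpid()))  # our own pid is always visible
```

Running the slave with docker's `--pid=host` sidesteps the problem, since the slave then shares the root pid namespace and the inspected pid is directly visible under its `/proc`.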
[jira] [Commented] (MESOS-1825) Support the webui over HTTPS.
[ https://issues.apache.org/jira/browse/MESOS-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296276#comment-14296276 ]

ASF GitHub Bot commented on MESOS-1825:
---------------------------------------
Github user bmahler commented on the pull request:

    https://github.com/apache/mesos/pull/34#issuecomment-71957729

    Thanks Arnaud! Nice, there will be built-in HTTPS support in Mesos at some point, you may want to chime in here: https://issues.apache.org/jira/browse/MESOS-1825

> Support the webui over HTTPS.
> -----------------------------
>
>       Key: MESOS-1825
>       URL: https://issues.apache.org/jira/browse/MESOS-1825
>   Project: Mesos
>  Issue Type: Bug
>  Components: webui
>    Reporter: Kien Pham
>    Priority: Minor
>      Labels: newbie
>
> Right now in the Mesos UI, links are hardcoded to http://. They should not be hardcoded, so that https links can be supported.
> Ex:
> https://github.com/apache/mesos/blob/master/src/webui/master/static/js/controllers.js#L17
[jira] [Updated] (MESOS-1825) Support the webui over HTTPS.
[ https://issues.apache.org/jira/browse/MESOS-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-1825:
-----------------------------------
    Summary: Support the webui over HTTPS.  (was: support https link)
[jira] [Commented] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread
[ https://issues.apache.org/jira/browse/MESOS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296128#comment-14296128 ]

Cody Maloney commented on MESOS-2144:
-------------------------------------
Based on the addresses being at the low end of the address range, I'm guessing it is happening while running __cxa_exit (global static destruction) or some other system cleanup symbol, and that this is during glibc doing something on Mesos' behalf. Likely whatever that library is doesn't have symbols / is stripped, if it is coming from the Linux distribution.

Side note: backtraces from our code don't use the debugging info. But yes, debugging definitely looks enabled, functions shouldn't be optimized, and the binary isn't stripped of symbols, so stack traces should have all the function symbols.

> Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread
> -----------------------------------------------------------
>
>              Key: MESOS-2144
>              URL: https://issues.apache.org/jira/browse/MESOS-2144
>          Project: Mesos
>       Issue Type: Bug
>       Components: test
> Affects Versions: 0.21.0
>         Reporter: Cody Maloney
>         Priority: Minor
>           Labels: flaky
>
> Occurred on review bot review of: https://reviews.apache.org/r/28262/#review62333
> The review doesn't touch code related to the test (and doesn't break libprocess in general).
> [ RUN      ] ExamplesTest.LowLevelSchedulerPthread
> ../../src/tests/script.cpp:83: Failure
> Failed
> low_level_scheduler_pthread_test.sh terminated with signal Segmentation fault
> [  FAILED  ] ExamplesTest.LowLevelSchedulerPthread (7561 ms)
> The test
[jira] [Comment Edited] (MESOS-2183) docker containerizer doesn't work when mesos-slave is running in a container
[ https://issues.apache.org/jira/browse/MESOS-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296106#comment-14296106 ]

Jay Buffington edited comment on MESOS-2183 at 1/29/15 12:10 AM:
-----------------------------------------------------------------
Hey [~tnachen],

I read your doc at https://docs.google.com/document/d/1_1oLHXg_aHj_fYCzsjYwox9xvIYNAKIeVjO5BFxsUGI/edit# and it's not clear you address the issue I encountered.

In my mesos-slave running in coreos I have it:
* running inside a pid namespace
* using the mounted /var/run/docker.sock to start a sibling container
* running docker inspect to get the pid it just launched
* it sees that the pid docker inspect reports
* it tries to determine the libprocess port based on that pid
* it doesn't see that pid since the pid docker inspect returns is only visible in the root namespace
* it does docker stop/kill because it incorrectly thinks the executor failed to start since it couldn't see the pid

I don't understand how your patch addresses that issue. Can you give me a summary of how it fixes this problem I've described?

was (Author: jaybuff):
Hey [~tnachen],

I read your doc at https://docs.google.com/document/d/1_1oLHXg_aHj_fYCzsjYwox9xvIYNAKIeVjO5BFxsUGI/edit# and it's not clear you address the issue I encountered.

In my mesos-slave running in coreos I have it:
* running inside a pid namespace
* useing the mounted /var/run/docker.sock to start a sibling container
* running docker inspect to get the pid it just launched
* it sees that the pid docker inspect reports
* it tries to determine the libprocess port based on that pid
* it does see that pid since the pid docker inspect returns is only visible in the root namespace
* it does docker stop/kill because it incorrectly thinks the executor failed to start since it couldn't see the pid

I don't understand how your patch addresses that issue. Can you give me a summary of how it fixes this problem I've described?
[jira] [Commented] (MESOS-2183) docker containerizer doesn't work when mesos-slave is running in a container
[ https://issues.apache.org/jira/browse/MESOS-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296106#comment-14296106 ]

Jay Buffington commented on MESOS-2183:
---------------------------------------
Hey [~tnachen],

I read your doc at https://docs.google.com/document/d/1_1oLHXg_aHj_fYCzsjYwox9xvIYNAKIeVjO5BFxsUGI/edit# and it's not clear you address the issue I encountered.

In my mesos-slave running in coreos I have it:
* running inside a pid namespace
* using the mounted /var/run/docker.sock to start a sibling container
* running docker inspect to get the pid it just launched
* it sees that the pid docker inspect reports
* it tries to determine the libprocess port based on that pid
* it doesn't see that pid since the pid docker inspect returns is only visible in the root namespace
* it does docker stop/kill because it incorrectly thinks the executor failed to start since it couldn't see the pid

I don't understand how your patch addresses that issue. Can you give me a summary of how it fixes this problem I've described?
[jira] [Commented] (MESOS-2232) Suppress MockAllocator::transformAllocation() warnings.
[ https://issues.apache.org/jira/browse/MESOS-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296089#comment-14296089 ]

Benjamin Mahler commented on MESOS-2232:
----------------------------------------
First two are committed:

{noformat}
commit ccd697df0b7e05b07dee75d53e0ff55d6884ba2f
Author: Benjamin Mahler
Date:   Fri Jan 16 12:13:01 2015 -0800

    Renamed MockAllocatorProcess to TestAllocatorProcess.

    Review: https://reviews.apache.org/r/29989
{noformat}

{noformat}
commit b7bb6696b5a78dbc896b4756b7d4123e86c01635
Author: Benjamin Mahler
Date:   Fri Jan 16 14:10:05 2015 -0800

    Updated TestAllocatorProcess to avoid the test warnings.

    Review: https://reviews.apache.org/r/29990
{noformat}

> Suppress MockAllocator::transformAllocation() warnings.
> -------------------------------------------------------
>
>         Key: MESOS-2232
>         URL: https://issues.apache.org/jira/browse/MESOS-2232
>     Project: Mesos
>  Issue Type: Bug
>  Components: test
>    Reporter: Alexander Rukletsov
>    Assignee: Benjamin Mahler
>    Priority: Minor
>
> After the transforming-allocated-resources feature was added to the allocator, a number of warnings are popping out for allocator tests. Commits leading to this behaviour:
> {{dacc88292cc13d4b08fe8cda4df71110a96cb12a}}
> {{5a02d5bdc75d3b1149dcda519016374be06ec6bd}}
> corresponding reviews:
> https://reviews.apache.org/r/29083
> https://reviews.apache.org/r/29084
> Here is an example:
> {code}
> [ RUN      ] MasterAllocatorTest/0.FrameworkReregistersFirst
> GMOCK WARNING:
> Uninteresting mock function call - taking default action specified at:
> ../../../src/tests/mesos.hpp:719:
> Function call: transformAllocation(@0x7fd3bb5274d8 20150115-185632-1677764800-59671-44186-, @0x7fd3bb5274f8 20150115-185632-1677764800-59671-44186-S0, @0x1119140e0 16-byte object 52-BB D3-7F 00-00 C0-5F 52-BB D3-7F 00-00>)
> Stack trace:
> [       OK ] MasterAllocatorTest/0.FrameworkReregistersFirst (204 ms)
> {code}
[jira] [Created] (MESOS-2298) Provide master detection library/libraries for pure schedulers
Vinod Kone created MESOS-2298:
---------------------------------
    Summary: Provide master detection library/libraries for pure schedulers
        Key: MESOS-2298
        URL: https://issues.apache.org/jira/browse/MESOS-2298
    Project: Mesos
 Issue Type: Task
   Reporter: Vinod Kone

When schedulers start interacting with the Mesos master via HTTP endpoints, they need a way to detect masters. Ideally, Mesos provides master detection library/libraries in supported languages (java and python to start with) to make this easy for frameworks.
[jira] [Created] (MESOS-2297) Add authentication support for HTTP API
Vinod Kone created MESOS-2297:
---------------------------------
    Summary: Add authentication support for HTTP API
        Key: MESOS-2297
        URL: https://issues.apache.org/jira/browse/MESOS-2297
    Project: Mesos
 Issue Type: Task
   Reporter: Vinod Kone

To start with, we will only support basic http auth.
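"Basic http auth" here means the standard `Authorization: Basic base64(user:password)` header from RFC 2617/7617. As a hedged sketch of what a client would attach to each request (the helper name is made up; this ticket does not specify the eventual credential wiring):

```python
import base64

def basic_auth_header(user: str, password: str) -> dict:
    """Build an HTTP Basic auth header: base64 of "user:password"."""
    token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}

if __name__ == "__main__":
    # The classic RFC 2617 example credentials:
    print(basic_auth_header("Aladdin", "open sesame"))
    # {'Authorization': 'Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=='}
```

Basic auth sends credentials effectively in cleartext, which is why it usually pairs with TLS (see MESOS-1825's HTTPS work).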
[jira] [Created] (MESOS-2296) Implement the Events endpoint on slave
Vinod Kone created MESOS-2296:
---------------------------------
    Summary: Implement the Events endpoint on slave
        Key: MESOS-2296
        URL: https://issues.apache.org/jira/browse/MESOS-2296
    Project: Mesos
 Issue Type: Task
   Reporter: Vinod Kone
[jira] [Created] (MESOS-2295) Implement the Call endpoint on Slave
Vinod Kone created MESOS-2295:
---------------------------------
    Summary: Implement the Call endpoint on Slave
        Key: MESOS-2295
        URL: https://issues.apache.org/jira/browse/MESOS-2295
    Project: Mesos
 Issue Type: Task
   Reporter: Vinod Kone
[jira] [Created] (MESOS-2294) Implement the Events endpoint on master
Vinod Kone created MESOS-2294:
---------------------------------
    Summary: Implement the Events endpoint on master
        Key: MESOS-2294
        URL: https://issues.apache.org/jira/browse/MESOS-2294
    Project: Mesos
 Issue Type: Task
   Reporter: Vinod Kone
[jira] [Created] (MESOS-2293) Implement the Call endpoint on master
Vinod Kone created MESOS-2293:
---------------------------------
    Summary: Implement the Call endpoint on master
        Key: MESOS-2293
        URL: https://issues.apache.org/jira/browse/MESOS-2293
    Project: Mesos
 Issue Type: Task
   Reporter: Vinod Kone
[jira] [Created] (MESOS-2292) Implement Call/Event protobufs for Executor
Vinod Kone created MESOS-2292:
---------------------------------
    Summary: Implement Call/Event protobufs for Executor
        Key: MESOS-2292
        URL: https://issues.apache.org/jira/browse/MESOS-2292
    Project: Mesos
 Issue Type: Task
   Reporter: Vinod Kone
[jira] [Created] (MESOS-2291) Move executor driver validations to slave
Vinod Kone created MESOS-2291:
---------------------------------
    Summary: Move executor driver validations to slave
        Key: MESOS-2291
        URL: https://issues.apache.org/jira/browse/MESOS-2291
    Project: Mesos
 Issue Type: Task
   Reporter: Vinod Kone

With HTTP API, the executor driver will no longer exist and hence all the validations should move to the slave.
[jira] [Created] (MESOS-2290) Move all scheduler driver validations to master
Vinod Kone created MESOS-2290:
---------------------------------
    Summary: Move all scheduler driver validations to master
        Key: MESOS-2290
        URL: https://issues.apache.org/jira/browse/MESOS-2290
    Project: Mesos
 Issue Type: Task
   Reporter: Vinod Kone

With HTTP API, the scheduler driver will no longer exist and hence all the validations should move to the master.
[jira] [Created] (MESOS-2289) Design doc for the HTTP API
Vinod Kone created MESOS-2289:
---------------------------------
    Summary: Design doc for the HTTP API
        Key: MESOS-2289
        URL: https://issues.apache.org/jira/browse/MESOS-2289
    Project: Mesos
 Issue Type: Task
   Reporter: Vinod Kone

This tracks the design of the HTTP API.
[jira] [Updated] (MESOS-2288) HTTP API for interacting with Mesos
[ https://issues.apache.org/jira/browse/MESOS-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone updated MESOS-2288:
------------------------------
    Epic Name: HTTP API  (was: http api)

> HTTP API for interacting with Mesos
> -----------------------------------
>
>       Key: MESOS-2288
>       URL: https://issues.apache.org/jira/browse/MESOS-2288
>   Project: Mesos
>  Issue Type: Epic
>    Reporter: Vinod Kone
>
> Currently Mesos frameworks (schedulers and executors) interact with Mesos (masters and slaves) via drivers provided by Mesos. While the driver helped in providing some common functionality for all frameworks (master detection, authentication, validation etc), it has several drawbacks:
> --> Frameworks need to depend on a native library, which makes their build/deploy process cumbersome.
> --> Pure language frameworks cannot use off-the-shelf libraries to interact with the undocumented API used by the driver.
> --> It makes it hard for developers to implement new APIs (lots of boilerplate code to write).
> This proposal is for Mesos to provide a well documented public HTTP API that frameworks (and maybe operators) can use to interact with Mesos.
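As a rough illustration of the direction, a pure-language framework could talk to the master with nothing but an HTTP client and JSON. Everything concrete below is hypothetical: the field names, the `type` values, and the notion of a "call" endpoint are placeholders for whatever the design doc (MESOS-2289) eventually specifies.

```python
import json

def make_call(framework_id: str, call_type: str, extra: dict) -> bytes:
    """Serialize a hypothetical scheduler Call message as a JSON request body."""
    call = {"framework_id": {"value": framework_id}, "type": call_type}
    call.update(extra)
    return json.dumps(call).encode("utf-8")

if __name__ == "__main__":
    body = make_call("fw-0001", "ACKNOWLEDGE",
                     {"acknowledge": {"uuid": "abcd-ef01"}})
    # A framework would POST this body (Content-Type: application/json)
    # to a master call endpoint, e.g. via urllib.request; no native
    # driver library is involved.
    print(body.decode("utf-8"))
```

The point of the sketch is the dependency surface: an HTTP client and a JSON library, both available off the shelf in any language, instead of the C++ driver.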
[jira] [Updated] (MESOS-1127) Expose lower-level scheduler/executor API
[ https://issues.apache.org/jira/browse/MESOS-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-1127: -- Epic Name: (was: HTTP API) Issue Type: Task (was: Epic) > Expose lower-level scheduler/executor API > - > > Key: MESOS-1127 > URL: https://issues.apache.org/jira/browse/MESOS-1127 > Project: Mesos > Issue Type: Task > Components: framework >Reporter: Benjamin Hindman >Assignee: Benjamin Hindman > Labels: twitter > > The default scheduler/executor interface and implementation in Mesos have a > few drawbacks: > (1) The interface is fairly high-level which makes it hard to do certain > things, for example, handle events (callbacks) in batch. This can have a big > impact on the performance of schedulers (for example, writing task updates > that need to be persisted). > (2) The implementation requires writing a lot of boilerplate JNI and native > Python wrappers when adding additional API components. > The plan is to provide a lower-level API that can easily be used to implement > the higher-level API that is currently provided. This will also open the door > to more easily building native-language Mesos libraries (i.e., not needing > the C++ shim layer) and building new higher-level abstractions on top of the > lower-level API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
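Drawback (1) in the issue above, the inability to handle events (callbacks) in batch, can be sketched with a hypothetical lower-level event queue. The names here are illustrative only, not the actual Mesos API:

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Hypothetical sketch: instead of one virtual callback per event
// (statusUpdate, resourceOffers, ...), a lower-level API could hand the
// scheduler queued events that it drains in batch.
struct Event { int type; /* payload elided */ };

class EventQueue
{
public:
  void push(const Event& e) { events.push(e); }

  // Drain all pending events at once (batch handling).
  std::vector<Event> drain()
  {
    std::vector<Event> batch;
    while (!events.empty()) {
      batch.push_back(events.front());
      events.pop();
    }
    return batch;
  }

private:
  std::queue<Event> events;
};
```

A scheduler persisting task updates could then write the whole drained batch with a single storage call instead of one write per callback, which is the performance benefit the issue alludes to.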
[jira] [Created] (MESOS-2288) HTTP API for interacting with Mesos
Vinod Kone created MESOS-2288: - Summary: HTTP API for interacting with Mesos Key: MESOS-2288 URL: https://issues.apache.org/jira/browse/MESOS-2288 Project: Mesos Issue Type: Epic Reporter: Vinod Kone Currently Mesos frameworks (schedulers and executors) interact with Mesos (masters and slaves) via drivers provided by Mesos. While the driver helped in providing some common functionality for all frameworks (master detection, authentication, validation, etc.), it has several drawbacks. --> Frameworks need to depend on a native library, which makes their build/deploy process cumbersome. --> Pure-language frameworks cannot use off-the-shelf libraries to interact with the undocumented API used by the driver. --> Makes it hard for developers to implement new APIs (a lot of boilerplate code to write). This proposal is for Mesos to provide a well-documented public HTTP API that frameworks (and maybe operators) can use to interact with Mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec
[ https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295996#comment-14295996 ] Ian Downes commented on MESOS-2162: --- I'll be working on this too, development and/or shepherding. > Consider a C++ implementation of CoreOS AppContainer spec > - > > Key: MESOS-2162 > URL: https://issues.apache.org/jira/browse/MESOS-2162 > Project: Mesos > Issue Type: Story > Components: containerization >Reporter: Dominic Hamon > Labels: mesosphere, twitter > > CoreOS have released a > [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md] > for a container abstraction as an alternative to Docker. They have also > released a reference implementation, [rocket|https://coreos.com/blog/rocket/]. > We should consider a C++ implementation of the specification to have parity > with the community and then use this implementation for our containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec
[ https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295983#comment-14295983 ] Steven Schlansker commented on MESOS-2162: -- I would love to help out in any way I can, but I am not much of a C++ guy. But at the very least I would happily test it, or if you have other suggestions for how I can help... > Consider a C++ implementation of CoreOS AppContainer spec > - > > Key: MESOS-2162 > URL: https://issues.apache.org/jira/browse/MESOS-2162 > Project: Mesos > Issue Type: Story > Components: containerization >Reporter: Dominic Hamon > Labels: mesosphere, twitter > > CoreOS have released a > [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md] > for a container abstraction as an alternative to Docker. They have also > released a reference implementation, [rocket|https://coreos.com/blog/rocket/]. > We should consider a C++ implementation of the specification to have parity > with the community and then use this implementation for our containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec
[ https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295981#comment-14295981 ] Steven Schlansker commented on MESOS-2162: -- I would love to help out in any way I can, but I am not much of a C++ guy. But at the very least I would happily test it, or if you have other suggestions for how I can help... > Consider a C++ implementation of CoreOS AppContainer spec > - > > Key: MESOS-2162 > URL: https://issues.apache.org/jira/browse/MESOS-2162 > Project: Mesos > Issue Type: Story > Components: containerization >Reporter: Dominic Hamon > Labels: mesosphere, twitter > > CoreOS have released a > [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md] > for a container abstraction as an alternative to Docker. They have also > released a reference implementation, [rocket|https://coreos.com/blog/rocket/]. > We should consider a C++ implementation of the specification to have parity > with the community and then use this implementation for our containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec
[ https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Schlansker updated MESOS-2162: - Comment: was deleted (was: I would love to help out in any way I can, but I am not much of a C++ guy. But at the very least I would happily test it, or if you have other suggestions for how I can help...) > Consider a C++ implementation of CoreOS AppContainer spec > - > > Key: MESOS-2162 > URL: https://issues.apache.org/jira/browse/MESOS-2162 > Project: Mesos > Issue Type: Story > Components: containerization >Reporter: Dominic Hamon > Labels: mesosphere, twitter > > CoreOS have released a > [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md] > for a container abstraction as an alternative to Docker. They have also > released a reference implementation, [rocket|https://coreos.com/blog/rocket/]. > We should consider a C++ implementation of the specification to have parity > with the community and then use this implementation for our containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec
[ https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295978#comment-14295978 ] Timothy Chen commented on MESOS-2162: - Hi Steven, that's what I think too. It's my plan to work on this but this quarter I won't have much time to do so. Are you interested in this? We could work together. > Consider a C++ implementation of CoreOS AppContainer spec > - > > Key: MESOS-2162 > URL: https://issues.apache.org/jira/browse/MESOS-2162 > Project: Mesos > Issue Type: Story > Components: containerization >Reporter: Dominic Hamon > Labels: mesosphere, twitter > > CoreOS have released a > [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md] > for a container abstraction as an alternative to Docker. They have also > released a reference implementation, [rocket|https://coreos.com/blog/rocket/]. > We should consider a C++ implementation of the specification to have parity > with the community and then use this implementation for our containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2215) The Docker containerizer attempts to recover any task when checkpointing is enabled, not just docker tasks.
[ https://issues.apache.org/jira/browse/MESOS-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Niemitz updated MESOS-2215: - Description: Once the slave restarts and recovers the task, I see this error in the log for all tasks that were recovered every second or so. Note, these were NOT docker tasks: W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage for container 7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd of framework 20150109-161713-715350282-5050-290797-: Failed to 'docker inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21': exit status = exited with status 1 stderr = Error: No such image or container: mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21 However, the tasks themselves are still healthy and running. The slave was launched with --containerizers=mesos,docker - More info: it looks like the docker containerizer is a little too ambitious about recovering containers, again this was not a docker task: I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container '7b729b89-dc7e-4d08-af97-8cd1af560a21' for executor 'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd' of framework 20150109-161713-715350282-5050-290797- Looking into the source, it looks like the problem is that the ComposingContainerizer runs recover in parallel, but neither the docker containerizer nor the mesos containerizer checks whether it should recover the task (i.e., whether it was the one that launched it). Perhaps this needs to be written into the checkpoint somewhere? was: Once the slave restarts and recovers the task, I see this error in the log for all tasks that were recovered every second or so. 
Note, these were NOT docker tasks: W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage for container 7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd of framework 20150109-161713-715350282-5050-290797-: Failed to 'docker inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21': exit status = exited with status 1 stderr = Error: No such image or container: mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21 However, the tasks themselves are still healthy and running. The slave was launched with --containerizers=mesos,docker - More info: it looks like the docker containerizer is a little too ambitious about recovering containers, again this was not a docker task: I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container '7b729b89-dc7e-4d08-af97-8cd1af560a21' for executor 'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd' of framework 20150109-161713-715350282-5050-290797- Looking into the source, it looks like the problem is that the ComposingContainerizer runs recover in parallel, but neither the docker containerizer nor the mesos containerizer checks whether it should recover the task (i.e., whether it was the one that launched it). Perhaps this needs to be written into the checkpoint somewhere? > The Docker containerizer attempts to recover any task when checkpointing is > enabled, not just docker tasks. > --- > > Key: MESOS-2215 > URL: https://issues.apache.org/jira/browse/MESOS-2215 > Project: Mesos > Issue Type: Bug > Components: docker >Affects Versions: 0.21.0 >Reporter: Steve Niemitz >Assignee: Timothy Chen > > Once the slave restarts and recovers the task, I see this error in the log > for all tasks that were recovered every second or so. 
Note, these were NOT > docker tasks: > W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage > for container 7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor > thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd > of framework 20150109-161713-715350282-5050-290797-: Failed to 'docker > inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21': exit status = exited > with status 1 stderr = Error: No such image or container: > mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21 > However the tasks themselves are still healthy and running. > The slave was launched with --containerizers=mesos,docker > - > More info: it looks like the docker containerizer is a little too ambitious > about recovering containers, again this was not a docker task: > I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container > '7b729b89-dc7e-4d08-af97-8cd1af560a21' for executor > 'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd' > of framework 20150109-161713-715350282-5050-290797- > Looking into the source
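The description above suggests that each containerizer should only recover containers it actually launched, e.g. by checkpointing which containerizer started each container. A minimal sketch of that filtering idea, with hypothetical names (this is not the actual Mesos fix):

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical sketch: if the slave checkpointed which containerizer
// launched each container, a containerizer's recover() could skip
// containers it does not own instead of attempting (and failing) to
// 'docker inspect' them.
std::vector<std::string> containersToRecover(
    const std::vector<std::string>& checkpointedContainers, // all on disk
    const std::set<std::string>& ownedByThisContainerizer)  // checkpointed owner info
{
  std::vector<std::string> result;
  for (const std::string& id : checkpointedContainers) {
    if (ownedByThisContainerizer.count(id) > 0) {
      result.push_back(id); // only recover containers we launched
    }
  }
  return result;
}
```

This only illustrates why recovery needs to know which containerizer launched each container; the eventual fix in Mesos may take a different shape.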
[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec
[ https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295944#comment-14295944 ] Steven Schlansker commented on MESOS-2162: -- This library may be a good starting point: https://github.com/cdaylward/libappc/ > Consider a C++ implementation of CoreOS AppContainer spec > - > > Key: MESOS-2162 > URL: https://issues.apache.org/jira/browse/MESOS-2162 > Project: Mesos > Issue Type: Story > Components: containerization >Reporter: Dominic Hamon > Labels: mesosphere, twitter > > CoreOS have released a > [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md] > for a container abstraction as an alternative to Docker. They have also > released a reference implementation, [rocket|https://coreos.com/blog/rocket/]. > We should consider a C++ implementation of the specification to have parity > with the community and then use this implementation for our containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec
[ https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295939#comment-14295939 ] Steven Schlansker commented on MESOS-2162: -- Any possibility of getting this scheduled for an upcoming release? > Consider a C++ implementation of CoreOS AppContainer spec > - > > Key: MESOS-2162 > URL: https://issues.apache.org/jira/browse/MESOS-2162 > Project: Mesos > Issue Type: Story > Components: containerization >Reporter: Dominic Hamon > Labels: mesosphere, twitter > > CoreOS have released a > [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md] > for a container abstraction as an alternative to Docker. They have also > released a reference implementation, [rocket|https://coreos.com/blog/rocket/]. > We should consider a C++ implementation of the specification to have parity > with the community and then use this implementation for our containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-354) oversubscribe resources
[ https://issues.apache.org/jira/browse/MESOS-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295839#comment-14295839 ] Niklas Quarfot Nielsen commented on MESOS-354: -- Oversubscription means many things and can be considered a subset of the currently ongoing effort on optimistic offers, where optimistic offers lets the allocator offer resources: - To multiple frameworks to increase 'parallelism' (as opposed to the conservative/pessimistic scheme) and **increase task throughput**. - As preemptable resources from unallocated but reserved resources, to **limit reservation slack** (the difference between reserved and allocated resources). A third (and equally important) case, which expands these scenarios, is oversubscription of _allocated_ resources, which limits the **usage slack** (the difference between allocated and used resources). There has been a lot of recent research showing the ability to reduce usage slack by 60% while maintaining the Service Level Objective (SLO) of latency-critical workloads (1). However, this kind of oversubscription needs policies and fine-tuning to make sure that best-effort tasks don't interfere with latency-critical ones. Therefore, we'd like to start a discussion on how such a system would look in Mesos. I will create a JIRA ticket (linking to this one) to start the conversation. (1) http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/43017.pdf > oversubscribe resources > --- > > Key: MESOS-354 > URL: https://issues.apache.org/jira/browse/MESOS-354 > Project: Mesos > Issue Type: Story > Components: isolation, master, slave >Reporter: brian wickman >Priority: Minor > Attachments: mesos_virtual_offers.pdf > > > This proposal is predicated upon offer revocation. > The idea would be to add a new "revoked" status either by (1) piggybacking > off an existing status update (TASK_LOST or TASK_KILLED) or (2) introducing a > new status update TASK_REVOKED. 
> In order to augment an offer with metadata about revocability, there are > options: > 1) Add a revocable boolean to the Offer and > a) offer only one type of Offer per slave at a particular time > b) offer both revocable and non-revocable resources at the same time but > require frameworks to understand that Offers can contain overlapping resources > 2) Add a revocable_resources field on the Offer which is a superset of the > regular resources field. By consuming > resources <= revocable_resources in > a launchTask, the Task becomes a revocable task. If launching a task with < > resources, the Task is non-revocable. > The use cases for revocable tasks are batch tasks (e.g. hadoop/pig/mapreduce) > and non-revocable tasks are online higher-SLA tasks (e.g. services.) > Consider a non-revocable task that asks for 4 cores, 8 GB RAM and 20 GB of disk. > One of these resources is a rate (4 cpu seconds per second) and two of them > are fixed values (8GB and 20GB respectively, though disk resources can be > further broken down into spindles - fixed - and iops - a rate.) In practice, > these are the maximum resources in the respective dimensions that this task > will use. In reality, we provision tasks at some factor below peak, and only > hit peak resource consumption in rare circumstances or perhaps at a diurnal > peak. > In the meantime, we stand to gain from offering some constant factor of > the difference between (reserved - actual) of non-revocable tasks as > revocable resources, depending upon our tolerance for revocable task churn. > The main challenge is coming up with an accurate short / medium / long-term > prediction of resource consumption based upon current behavior. > In many cases it would be OK to be sloppy: > * CPU / iops / network IO are rates (compressible) and can often be OK > below guarantees for brief periods of time while task revocation takes place > * Memory slack can be provided by enabling swap and dynamically setting > swap paging boundaries. 
Should swap ever be activated, that would be a > signal to revoke. > The master / allocator would piggyback on the slave heartbeat mechanism to > learn of the amount of revocable resources available at any point in time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
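Option (2) above can be illustrated with a small sketch: a task is revocable exactly when its resources do not fit within the regular resources but do fit within the revocable_resources superset. The Resources type and the subset check below are simplified, hypothetical stand-ins for Mesos' real Resources:

```cpp
#include <cassert>

// Simplified stand-in for Mesos' Resources; real resources have more
// dimensions (disk, ports, ...) and richer semantics.
struct Resources
{
  double cpus;
  double memMB;

  // True if every dimension fits within 'other'.
  bool fitsIn(const Resources& other) const
  {
    return cpus <= other.cpus && memMB <= other.memMB;
  }
};

// Under option (2), the offer carries both a regular 'resources' field
// and a 'revocable_resources' superset. A task that fits in the regular
// resources is non-revocable; one that only fits in the superset is
// launched as a revocable task.
bool isRevocableLaunch(
    const Resources& task,
    const Resources& regular,
    const Resources& revocable)
{
  assert(regular.fitsIn(revocable)); // superset invariant from the proposal
  return !task.fitsIn(regular) && task.fitsIn(revocable);
}
```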
[jira] [Created] (MESOS-2287) Document undocumented tests
Niklas Quarfot Nielsen created MESOS-2287: - Summary: Document undocumented tests Key: MESOS-2287 URL: https://issues.apache.org/jira/browse/MESOS-2287 Project: Mesos Issue Type: Improvement Reporter: Niklas Quarfot Nielsen Priority: Trivial We have an inconsistency in the way we document tests. It has become a rule of thumb to include a small blurb about the test. For example: {code} // This tests the 'active' field in slave entries from state.json. We // first verify an active slave, deactivate it and verify that the // 'active' field is false. TEST_F(MasterTest, SlaveActiveEndpoint) { // Start a master. Try<PID<Master>> master = StartMaster(); ASSERT_SOME(master); ... {code} However, we still have many tests that haven't been documented. For example: {code} } TEST_F(MasterTest, MetricsInStatsEndpoint) { Try<PID<Master>> master = StartMaster(); ASSERT_SOME(master); Future<Response> response = process::http::get(master.get(), "stats.json"); ... {code} It would be great to do a scan and make sure all the tests are documented. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2286) Simplify the allocator architecture
[ https://issues.apache.org/jira/browse/MESOS-2286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-2286: --- Component/s: allocation Description: The allocator refactor [https://issues.apache.org/jira/browse/MESOS-2213] will distinguish between general allocators and Process-based ones. This introduces a chain of inheritance with a single real allocator at the bottom. Consider simplifying this architecture without making it harder to add new allocators. Priority: Minor (was: Major) > Simplify the allocator architecture > --- > > Key: MESOS-2286 > URL: https://issues.apache.org/jira/browse/MESOS-2286 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Alexander Rukletsov >Priority: Minor > > The allocator refactor [https://issues.apache.org/jira/browse/MESOS-2213] will > distinguish between general allocators and Process-based ones. This > introduces a chain of inheritance with a single real allocator at the bottom. > Consider simplifying this architecture without making it harder to add new > allocators. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2286) Simplify the allocator architecture
Alexander Rukletsov created MESOS-2286: -- Summary: Simplify the allocator architecture Key: MESOS-2286 URL: https://issues.apache.org/jira/browse/MESOS-2286 Project: Mesos Issue Type: Improvement Reporter: Alexander Rukletsov -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2285) Eliminate dependency on master::Flags in Allocator
Alexander Rukletsov created MESOS-2285: -- Summary: Eliminate dependency on master::Flags in Allocator Key: MESOS-2285 URL: https://issues.apache.org/jira/browse/MESOS-2285 Project: Mesos Issue Type: Improvement Components: allocation Reporter: Alexander Rukletsov Priority: Minor {{Allocator}} extracts parameters from {{master::Flags}} during initialization. Currently, only the {{allocation_interval}} key from {{master::Flags}} is used. It makes sense to introduce a separate structure {{allocator::Options}} with values relevant for allocation and eliminate the dependency on {{master::Flags}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
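A minimal sketch of what the proposed {{allocator::Options}} structure might look like; the field name and default value are assumptions, since the issue only mentions {{allocation_interval}}:

```cpp
#include <cassert>
#include <chrono>

namespace allocator {

// Hypothetical Options structure (MESOS-2285 sketch): carries only the
// allocation-relevant settings, so the Allocator no longer needs to
// depend on master::Flags.
struct Options
{
  // Mirrors master::Flags::allocation_interval; the default here is
  // illustrative, not the actual Mesos default.
  std::chrono::milliseconds allocationInterval{1000};
};

} // namespace allocator
```

The master would populate this structure from its flags at startup and hand it to the allocator, keeping the allocator's interface free of master internals.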
[jira] [Resolved] (MESOS-2284) Slave cannot be registered while masters keep switching to another one.
[ https://issues.apache.org/jira/browse/MESOS-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov resolved MESOS-2284. Resolution: Not a Problem > Slave cannot be registered while masters keep switching to another one. > --- > > Key: MESOS-2284 > URL: https://issues.apache.org/jira/browse/MESOS-2284 > Project: Mesos > Issue Type: Bug > Components: documentation >Affects Versions: 0.20.1 > Environment: Ubuntu14.04 >Reporter: Hou Xiaokun >Priority: Blocker > Fix For: 0.21.0 > > > I followed the instructions on the page > http://mesosphere.com/docs/getting-started/datacenter/install/. > Set up two masters and one slave, with a quorum value of "2". Configured IP > addresses in the hostname files separately. > Here is the log from the slave node: > I0127 22:37:26.762953 1966 slave.cpp:627] No credentials provided. > Attempting to register without authentication > I0127 22:37:26.762985 1966 slave.cpp:638] Detecting new master > I0127 22:37:26.763022 1966 status_update_manager.cpp:171] Pausing sending > status updates > I0127 22:38:06.683840 1962 slave.cpp:3321] Current usage 16.98%. Max allowed > age: 5.111732713224155days > I0127 22:38:26.986556 1966 slave.cpp:2623] master@10.27.17.135:5050 exited > W0127 22:38:26.986675 1966 slave.cpp:2626] Master disconnected! Waiting for > a new master to be elected > I0127 22:38:34.909605 1963 detector.cpp:138] Detected a new leader: > (id='2028') > I0127 22:38:34.909811 1963 group.cpp:659] Trying to get > '/mesos/info_002028' in ZooKeeper > I0127 22:38:34.910909 1963 detector.cpp:433] A new leading master > (UPID=master@10.27.16.214:5050) is detected > I0127 22:38:34.910989 1963 slave.cpp:602] New master detected at > master@10.27.16.214:5050 > I0127 22:38:34.93 1963 slave.cpp:627] No credentials provided. 
> Attempting to register without authentication > I0127 22:38:34.911144 1963 slave.cpp:638] Detecting new master > I0127 22:38:34.911183 1963 status_update_manager.cpp:171] Pausing sending > status updates > I0127 22:39:06.684526 1964 slave.cpp:3321] Current usage 16.98%. Max allowed > age: 5.111731773610567days > I0127 22:39:35.231653 1963 slave.cpp:2623] master@10.27.16.214:5050 exited > W0127 22:39:35.231869 1963 slave.cpp:2626] Master disconnected! Waiting for > a new master to be elected > I0127 22:39:42.761540 1964 detector.cpp:138] Detected a new leader: > (id='2029') > I0127 22:39:42.761732 1964 group.cpp:659] Trying to get > '/mesos/info_002029' in ZooKeeper > I0127 22:39:42.762914 1964 detector.cpp:433] A new leading master > (UPID=master@10.27.17.135:5050) is detected > I0127 22:39:42.762984 1964 slave.cpp:602] New master detected at > master@10.27.17.135:5050 > I0127 22:39:42.763089 1964 slave.cpp:627] No credentials provided. > Attempting to register without authentication > I0127 22:39:42.763118 1964 slave.cpp:638] Detecting new master > I0127 22:39:42.763155 1964 status_update_manager.cpp:171] Pausing sending > status updates -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (MESOS-2284) Slave cannot be registered while masters keep switching to another one.
[ https://issues.apache.org/jira/browse/MESOS-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reopened MESOS-2284: > Slave cannot be registered while masters keep switching to another one. > --- > > Key: MESOS-2284 > URL: https://issues.apache.org/jira/browse/MESOS-2284 > Project: Mesos > Issue Type: Bug > Components: documentation >Affects Versions: 0.20.1 > Environment: Ubuntu14.04 >Reporter: Hou Xiaokun >Priority: Blocker > Fix For: 0.21.0 > > > I followed the instructions on the page > http://mesosphere.com/docs/getting-started/datacenter/install/. > Set up two masters and one slave, with a quorum value of "2". Configured IP > addresses in the hostname files separately. > Here is the log from the slave node: > I0127 22:37:26.762953 1966 slave.cpp:627] No credentials provided. > Attempting to register without authentication > I0127 22:37:26.762985 1966 slave.cpp:638] Detecting new master > I0127 22:37:26.763022 1966 status_update_manager.cpp:171] Pausing sending > status updates > I0127 22:38:06.683840 1962 slave.cpp:3321] Current usage 16.98%. Max allowed > age: 5.111732713224155days > I0127 22:38:26.986556 1966 slave.cpp:2623] master@10.27.17.135:5050 exited > W0127 22:38:26.986675 1966 slave.cpp:2626] Master disconnected! Waiting for > a new master to be elected > I0127 22:38:34.909605 1963 detector.cpp:138] Detected a new leader: > (id='2028') > I0127 22:38:34.909811 1963 group.cpp:659] Trying to get > '/mesos/info_002028' in ZooKeeper > I0127 22:38:34.910909 1963 detector.cpp:433] A new leading master > (UPID=master@10.27.16.214:5050) is detected > I0127 22:38:34.910989 1963 slave.cpp:602] New master detected at > master@10.27.16.214:5050 > I0127 22:38:34.93 1963 slave.cpp:627] No credentials provided. 
> Attempting to register without authentication > I0127 22:38:34.911144 1963 slave.cpp:638] Detecting new master > I0127 22:38:34.911183 1963 status_update_manager.cpp:171] Pausing sending > status updates > I0127 22:39:06.684526 1964 slave.cpp:3321] Current usage 16.98%. Max allowed > age: 5.111731773610567days > I0127 22:39:35.231653 1963 slave.cpp:2623] master@10.27.16.214:5050 exited > W0127 22:39:35.231869 1963 slave.cpp:2626] Master disconnected! Waiting for > a new master to be elected > I0127 22:39:42.761540 1964 detector.cpp:138] Detected a new leader: > (id='2029') > I0127 22:39:42.761732 1964 group.cpp:659] Trying to get > '/mesos/info_002029' in ZooKeeper > I0127 22:39:42.762914 1964 detector.cpp:433] A new leading master > (UPID=master@10.27.17.135:5050) is detected > I0127 22:39:42.762984 1964 slave.cpp:602] New master detected at > master@10.27.17.135:5050 > I0127 22:39:42.763089 1964 slave.cpp:627] No credentials provided. > Attempting to register without authentication > I0127 22:39:42.763118 1964 slave.cpp:638] Detecting new master > I0127 22:39:42.763155 1964 status_update_manager.cpp:171] Pausing sending > status updates -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2276) Mesos-slave refuses to startup with many stopped docker containers
[ https://issues.apache.org/jira/browse/MESOS-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294907#comment-14294907 ] Dr. Stefan Schimanski commented on MESOS-2276: -- I have changed the title of this issue. As the original issue is resolved, what remains is that mesos-slave should behave much more forgivingly when there are many stopped containers. Moreover, a proper error message would help identify the problem. > Mesos-slave refuses to startup with many stopped docker containers > -- > > Key: MESOS-2276 > URL: https://issues.apache.org/jira/browse/MESOS-2276 > Project: Mesos > Issue Type: Bug > Components: docker, slave >Affects Versions: 0.21.0, 0.21.1 > Environment: Ubuntu 14.04LTS, Mesosphere packages >Reporter: Dr. Stefan Schimanski > > The mesos-slave is launched as > # /usr/local/sbin/mesos-slave > --master=zk://10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181/mesos --ip=10.0.0.2 > --log_dir=/var/log/mesos --attributes=node_id:srv002 --checkpoint > --containerizers=docker --executor_registration_timeout=5mins > --logging_level=INFO > giving this output: > I0127 19:26:32.674113 19880 logging.cpp:172] INFO level logging started! 
> I0127 19:26:32.674741 19880 main.cpp:142] Build: 2014-11-22 05:29:57 by root > I0127 19:26:32.674774 19880 main.cpp:144] Version: 0.21.0 > I0127 19:26:32.674799 19880 main.cpp:147] Git tag: 0.21.0 > I0127 19:26:32.674824 19880 main.cpp:151] Git SHA: > ab8fa655d34e8e15a4290422df38a18db1c09b5b > I0127 19:26:32.786731 19880 main.cpp:165] Starting Mesos slave > 2015-01-27 19:26:32,786:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@712: Client > environment:zookeeper.version=zookeeper C client 3.4.5 > 2015-01-27 19:26:32,786:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@716: Client > environment:host.name=srv002 > 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@723: Client > environment:os.name=Linux > 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@724: Client > environment:os.arch=3.13.0-44-generic > 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@725: Client > environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014 > 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@733: Client > environment:user.name=root > 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@741: Client > environment:user.home=/root > 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@753: Client > environment:user.dir=/root > 2015-01-27 19:26:32,789:19880(0x7fcf0cf9f700):ZOO_INFO@zookeeper_init@786: > Initiating client connection, host=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 > sessionTimeout=1 watcher=0x7fcf13592a0a sessionId=0 sessionPasswd= > context=0x7fceec0009e0 flags=0 > I0127 19:26:32.796588 19880 slave.cpp:169] Slave started on 1)@10.0.0.2:5051 > I0127 19:26:32.797345 19880 slave.cpp:289] Slave resources: cpus(*):8; > mem(*):6960; disk(*):246731; ports(*):[31000-32000] > I0127 19:26:32.798017 19880 slave.cpp:318] Slave hostname: srv002 > I0127 19:26:32.798076 19880 slave.cpp:319] Slave checkpoint: true > 2015-01-27 19:26:32,800:19880(0x7fcf08f5c700):ZOO_INFO@check_events@1703: > initiated connection to server 
[10.0.0.1:2181] > I0127 19:26:32.808229 19886 state.cpp:33] Recovering state from > '/tmp/mesos/meta' > I0127 19:26:32.809090 19882 status_update_manager.cpp:197] Recovering status > update manager > I0127 19:26:32.809677 19887 docker.cpp:767] Recovering Docker containers > 2015-01-27 19:26:32,821:19880(0x7fcf08f5c700):ZOO_INFO@check_events@1750: > session establishment complete on server [10.0.0.1:2181], > sessionId=0x14b2adf7a560106, negotiated timeout=1 > I0127 19:26:32.823292 19885 group.cpp:313] Group process > (group(1)@10.0.0.2:5051) connected to ZooKeeper > I0127 19:26:32.823443 19885 group.cpp:790] Syncing group operations: queue > size (joins, cancels, datas) = (0, 0, 0) > I0127 19:26:32.823484 19885 group.cpp:385] Trying to create path '/mesos' in > ZooKeeper > I0127 19:26:32.829711 19882 detector.cpp:138] Detected a new leader: > (id='143') > I0127 19:26:32.830559 19882 group.cpp:659] Trying to get > '/mesos/info_000143' in ZooKeeper > I0127 19:26:32.837913 19886 detector.cpp:433] A new leading master > (UPID=master@10.0.0.1:5050) is detected > Failed to perform recovery: Collect failed: Failed to create pipe: Too many > open files > To remedy this do as follows: > Step 1: rm -f /tmp/mesos/meta/slaves/latest > This ensures slave doesn't recover old live executors. > Step 2: Restart the slave. > At /tmp/mesos/meta/slaves/latest there is nothing. > The slave was part of a 3 node cluster before. > When started as an upstart service, the process is relaunched all the time > and a large number of defunct processes appear, like these ones: > root 30321 0.0 0.0 13000 440 ?S19:28 0:00 iptables > --wait -L -n > root 30322 0.0 0.0 396 ?
[jira] [Updated] (MESOS-2276) Mesos-slave refuses to startup with many stopped docker containers
[ https://issues.apache.org/jira/browse/MESOS-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dr. Stefan Schimanski updated MESOS-2276: - Summary: Mesos-slave refuses to startup with many stopped docker containers (was: Mesos-slave with containerizer Docker doesn't startup anymore)
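The recovery failure in the log above ("Failed to create pipe: Too many open files") indicates the slave process exhausted its file-descriptor limit while recovering a large number of stopped containers. A minimal sketch of checking and raising the soft nofile limit before launching mesos-slave follows; the target value of 8192 is an illustrative assumption, not a recommendation from this ticket:

```shell
# Hedged sketch (assumed workaround, not confirmed in this ticket): raise the
# soft open-files limit before exec'ing mesos-slave. The hard limit caps how
# far the soft limit can be raised without privileges.
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
echo "soft=$soft hard=$hard"

# 8192 is an example value; raise the soft limit only if the hard limit allows it.
target=8192
if [ "$hard" = "unlimited" ] || [ "$target" -le "$hard" ]; then
    ulimit -Sn "$target"
fi
echo "new soft limit: $(ulimit -Sn)"
```

Under upstart, the equivalent would be a `limit nofile` stanza in the job configuration so the relaunched slave inherits the higher limit.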
[jira] [Resolved] (MESOS-2284) Slave cannot be registered while masters keep switching to another one.
[ https://issues.apache.org/jira/browse/MESOS-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hou Xiaokun resolved MESOS-2284. Resolution: Fixed Fix Version/s: 0.21.0 Hi, I changed the quorum to 1. The slave can be registered now. Thanks! > Slave cannot be registered while masters keep switching to another one. > --- > > Key: MESOS-2284 > URL: https://issues.apache.org/jira/browse/MESOS-2284 > Project: Mesos > Issue Type: Bug > Components: documentation >Affects Versions: 0.20.1 > Environment: Ubuntu 14.04 >Reporter: Hou Xiaokun >Priority: Blocker > Fix For: 0.21.0 > > > I followed the instructions on the page > http://mesosphere.com/docs/getting-started/datacenter/install/. > Set up two masters and one slave, with a quorum value of "2". Configured ip > addresses in hostname files separately. > Here is the log from the slave node, > I0127 22:37:26.762953 1966 slave.cpp:627] No credentials provided. > Attempting to register without authentication > I0127 22:37:26.762985 1966 slave.cpp:638] Detecting new master > I0127 22:37:26.763022 1966 status_update_manager.cpp:171] Pausing sending > status updates > I0127 22:38:06.683840 1962 slave.cpp:3321] Current usage 16.98%. Max allowed > age: 5.111732713224155days > I0127 22:38:26.986556 1966 slave.cpp:2623] master@10.27.17.135:5050 exited > W0127 22:38:26.986675 1966 slave.cpp:2626] Master disconnected! Waiting for > a new master to be elected > I0127 22:38:34.909605 1963 detector.cpp:138] Detected a new leader: > (id='2028') > I0127 22:38:34.909811 1963 group.cpp:659] Trying to get > '/mesos/info_002028' in ZooKeeper > I0127 22:38:34.910909 1963 detector.cpp:433] A new leading master > (UPID=master@10.27.16.214:5050) is detected > I0127 22:38:34.910989 1963 slave.cpp:602] New master detected at > master@10.27.16.214:5050 > I0127 22:38:34.93 1963 slave.cpp:627] No credentials provided. 
> Attempting to register without authentication > I0127 22:38:34.911144 1963 slave.cpp:638] Detecting new master > I0127 22:38:34.911183 1963 status_update_manager.cpp:171] Pausing sending > status updates > I0127 22:39:06.684526 1964 slave.cpp:3321] Current usage 16.98%. Max allowed > age: 5.111731773610567days > I0127 22:39:35.231653 1963 slave.cpp:2623] master@10.27.16.214:5050 exited > W0127 22:39:35.231869 1963 slave.cpp:2626] Master disconnected! Waiting for > a new master to be elected > I0127 22:39:42.761540 1964 detector.cpp:138] Detected a new leader: > (id='2029') > I0127 22:39:42.761732 1964 group.cpp:659] Trying to get > '/mesos/info_002029' in ZooKeeper > I0127 22:39:42.762914 1964 detector.cpp:433] A new leading master > (UPID=master@10.27.17.135:5050) is detected > I0127 22:39:42.762984 1964 slave.cpp:602] New master detected at > master@10.27.17.135:5050 > I0127 22:39:42.763089 1964 slave.cpp:627] No credentials provided. > Attempting to register without authentication > I0127 22:39:42.763118 1964 slave.cpp:638] Detecting new master > I0127 22:39:42.763155 1964 status_update_manager.cpp:171] Pausing sending > status updates -- This message was sent by Atlassian JIRA (v6.3.4#6332)
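For context on the resolution above: the master's --quorum flag is meant to be a strict majority of the registrar replicas, i.e. quorum = N/2 + 1 with integer division. A two-master setup is therefore awkward: its majority is 2, which tolerates no master failure, while quorum=1 (as in the reporter's fix) lets the two registrars diverge, so odd-sized master ensembles are generally preferred. A small sketch of that arithmetic, using an illustrative helper that is not part of Mesos:

```shell
# Illustrative helper (not part of Mesos): strict-majority quorum for N masters,
# i.e. floor(N/2) + 1 via shell integer division.
majority() {
    echo $(( $1 / 2 + 1 ))
}

majority 1   # prints 1
majority 2   # prints 2: both masters required, no failure tolerance
majority 3   # prints 2: tolerates one master failure
```

With three masters and --quorum=2, one master can be lost without blocking registrar writes, avoiding both the flapping seen in the logs and the split-brain risk of quorum=1.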