[jira] [Commented] (MESOS-1582) Improve build time.
[ https://issues.apache.org/jira/browse/MESOS-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060776#comment-14060776 ] Benjamin Hindman commented on MESOS-1582: - Some initial suggestions that we can turn into their own JIRA issues (some of which already exist, so we can just include them within this epic):
* An include-what-you-use campaign that strips headers that are not used, possibly even using the include-what-you-use clang tool.
* Move implementations from .hpp to .cpp.
* Separate large .cpp files as necessary (to increase parallelism).
* Introduce more forward declarations in place of headers.
* Document the use and speedup of ccache.
Improve build time. --- Key: MESOS-1582 URL: https://issues.apache.org/jira/browse/MESOS-1582 Project: Mesos Issue Type: Epic Components: build Reporter: Benjamin Hindman The build takes a ridiculously long time unless you have a large, parallel machine. This is a combination of many factors, all of which we'd like to discuss and track here. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1582) Improve build time.
[ https://issues.apache.org/jira/browse/MESOS-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060819#comment-14060819 ] Timothy St. Clair commented on MESOS-1582: -- IMHO subsuming stout into libprocess and converting .hpp to .cpp would go a long way. Right now stout doesn't really live alone, but requires dependencies that exist under libprocess/3rdparty. Improve build time. --- Key: MESOS-1582 URL: https://issues.apache.org/jira/browse/MESOS-1582 Project: Mesos Issue Type: Epic Components: build Reporter: Benjamin Hindman The build takes a ridiculously long time unless you have a large, parallel machine. This is a combination of many factors, all of which we'd like to discuss and track here. I'd also love to actually track build times so we can get an appreciation of the improvements. Please leave a comment below with your build times! -- This message was sent by Atlassian JIRA (v6.2#6252)
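[Editor's note] The "forward declarations in place of headers" and ".hpp to .cpp" suggestions above can be sketched minimally as follows. All names here (Widget, Engine) are hypothetical and not from the Mesos tree; the point is that translation units including the "header" part no longer need the full Engine definition, cutting rebuild fan-out.

```cpp
#include <memory>

// "Header" part: forward-declare Engine instead of #include "engine.hpp".
// Consumers of Widget now recompile only when Widget itself changes.
class Engine;

class Widget {
public:
  Widget();
  ~Widget();  // defined out-of-line, where Engine is complete
  int power() const;

private:
  std::unique_ptr<Engine> engine_;  // a pointer member needs only the declaration
};

// ".cpp" part: the full Engine definition is needed only here.
class Engine {
public:
  int horsepower() const { return 42; }
};

Widget::Widget() : engine_(new Engine()) {}
Widget::~Widget() = default;
int Widget::power() const { return engine_->horsepower(); }
```

This is the classic Pimpl-style decoupling; combined with moving implementations out of .hpp files, it is what makes the parallel build cheaper.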
[jira] [Commented] (MESOS-1170) Update system check (glog)
[ https://issues.apache.org/jira/browse/MESOS-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14060840#comment-14060840 ] Timothy St. Clair commented on MESOS-1170: -- https://reviews.apache.org/r/23453/ Update system check (glog) -- Key: MESOS-1170 URL: https://issues.apache.org/jira/browse/MESOS-1170 Project: Mesos Issue Type: Bug Components: build Affects Versions: 0.19.0 Reporter: Timothy St. Clair Assignee: Timothy St. Clair Clean up glog detection to follow https://issues.apache.org/jira/browse/MESOS-1071 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1584) allow slave to proxy scheduler registration in service to master detection
[ https://issues.apache.org/jira/browse/MESOS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060915#comment-14060915 ] brian wickman commented on MESOS-1584: -- I don't think this is quite what I had in mind. It's my understanding that mesos-resolve does a single resolve and then exits. Compare to the mesos-slave, which does realtime master detection including failover detection, in addition to proxying framework messages on behalf of executors to whoever is leading at the time. It seems like we should have the analogue for schedulers as well. allow slave to proxy scheduler registration in service to master detection -- Key: MESOS-1584 URL: https://issues.apache.org/jira/browse/MESOS-1584 Project: Mesos Issue Type: Improvement Components: framework Reporter: brian wickman This is just an idea -- right now each pure language binding will need to implement a master detector (which in most cases means your language will need zookeeper bindings.) Instead of each binding needing to implement a master detector, perhaps the slave should support a 'proxy' mode whereby it acts as master detector on behalf of a dumber pure language binding. This could also help facilitate the launching of replica schedulers. Alternately this could be a separate binary but could just as easily live in the slave process. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1589) Support file system isolation and monitoring
Dominic Hamon created MESOS-1589: Summary: Support file system isolation and monitoring Key: MESOS-1589 URL: https://issues.apache.org/jira/browse/MESOS-1589 Project: Mesos Issue Type: Epic Reporter: Dominic Hamon Assignee: Ian Downes -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (MESOS-1587) Report disk usage from MesosContainerizer
[ https://issues.apache.org/jira/browse/MESOS-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Downes reassigned MESOS-1587: - Assignee: Ian Downes Report disk usage from MesosContainerizer - Key: MESOS-1587 URL: https://issues.apache.org/jira/browse/MESOS-1587 Project: Mesos Issue Type: Improvement Components: isolation Affects Versions: 0.20.0 Reporter: Ian Downes Assignee: Ian Downes We should report disk usage for the executor work directory from MesosContainerizer and include in the ResourceStatistics protobuf. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1027) IPv6 support
[ https://issues.apache.org/jira/browse/MESOS-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1027: - Issue Type: Epic (was: Improvement) IPv6 support Key: MESOS-1027 URL: https://issues.apache.org/jira/browse/MESOS-1027 Project: Mesos Issue Type: Epic Components: framework, libprocess, master, slave Reporter: Dominic Hamon Fix For: 1.0.0 From the CLI down through the various layers of tech we should support IPv6. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1584) allow slave to proxy scheduler registration in service to master detection
[ https://issues.apache.org/jira/browse/MESOS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14060979#comment-14060979 ] Benjamin Hindman commented on MESOS-1584: - Why provide the proxy in the mesos-slave instead of the mesos-master? Are you suggesting that the scheduler will be running on a slave and therefore it has easy access to the slave? What about providing a service like this on the master, and then giving each scheduler the list of all the masters instead of the list of all the ZooKeeper hosts? allow slave to proxy scheduler registration in service to master detection -- Key: MESOS-1584 URL: https://issues.apache.org/jira/browse/MESOS-1584 Project: Mesos Issue Type: Improvement Components: framework Reporter: brian wickman This is just an idea -- right now each pure language binding will need to implement a master detector (which in most cases means your language will need zookeeper bindings.) Instead of each binding needing to implement a master detector, perhaps the slave should support a 'proxy' mode whereby it acts as master detector on behalf of a dumber pure language binding. This could also help facilitate the launching of replica schedulers. Alternately this could be a separate binary but could just as easily live in the slave process. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MESOS-1001) registrar doesn't build on Linux/Clang
[ https://issues.apache.org/jira/browse/MESOS-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon resolved MESOS-1001. -- Resolution: Fixed Fix Version/s: 0.20.0 registrar doesn't build on Linux/Clang -- Key: MESOS-1001 URL: https://issues.apache.org/jira/browse/MESOS-1001 Project: Mesos Issue Type: Bug Components: build Affects Versions: 0.18.0 Environment: Ubuntu 13.10 clang Reporter: Vinod Kone Fix For: 0.20.0
libtool: compile: clang++ -DPACKAGE_NAME=\"mesos\" -DPACKAGE_TARNAME=\"mesos\" -DPACKAGE_VERSION=\"0.18.0\" -DPACKAGE_STRING=\"mesos 0.18.0\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DPACKAGE=\"mesos\" -DVERSION=\"0.18.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_PTHREAD=1 -DMESOS_HAS_JAVA=1 -DHAVE_PYTHON=\"2.7\" -DMESOS_HAS_PYTHON=1 -DHAVE_LIBZ=1 -DHAVE_LIBCURL=1 -DHAVE_LIBSASL2=1 -I. -Wall -Werror -DLIBDIR=\"/usr/local/lib\" -DPKGLIBEXECDIR=\"/usr/local/libexec/mesos\" -DPKGDATADIR=\"/usr/local/share/mesos\" -I../include -I../3rdparty/libprocess/include -I../3rdparty/libprocess/3rdparty/stout/include -I../include -I../3rdparty/libprocess/3rdparty/boost-1.53.0 -I../3rdparty/libprocess/3rdparty/protobuf-2.5.0/src -I../3rdparty/libprocess/3rdparty/glog-0.3.3/src -I../3rdparty/zookeeper-3.4.5/src/c/include -I../3rdparty/zookeeper-3.4.5/src/c/generated -pthread -DGTEST_USE_OWN_TR1_TUPLE=1 -g -g2 -O2 -std=c++11 -MT master/libmesos_no_3rdparty_la-registrar.lo -MD -MP -MF master/.deps/libmesos_no_3rdparty_la-registrar.Tpo -c master/registrar.cpp -fPIC -DPIC -o master/.libs/libmesos_no_3rdparty_la-registrar.o
In file included from master/registrar.cpp:34:
In file included from ./master/registrar.hpp:26:
./state/protobuf.hpp:124:10: error: calling a private constructor of class 'mesos::internal::state::protobuf::Variable<mesos::internal::registry::Slaves>'
    return Variable<T>(variable, t.get()); ^
./state/protobuf.hpp:111:41: note: in instantiation of function template specialization 'mesos::internal::state::protobuf::State::_fetch<mesos::internal::registry::Slaves>' requested here
    .then(lambda::bind(&State::template _fetch<T>, lambda::_1)); ^
master/registrar.cpp:191:12: note: in instantiation of function template specialization 'mesos::internal::state::protobuf::State::fetch<mesos::internal::registry::Slaves>' requested here
    state->fetch<registry::Slaves>("slaves") ^
./state/protobuf.hpp:62:3: note: declared private here
    Variable(const state::Variable& _variable, const T& _t) ^
./state/protobuf.hpp:132:57: error: 't' is a private member of 'mesos::internal::state::protobuf::Variable<mesos::internal::registry::Slaves>'
    Try<std::string> value = messages::serialize(variable.t); ^
master/registrar.cpp:333:14: note: in instantiation of function template specialization 'mesos::internal::state::protobuf::State::store<mesos::internal::registry::Slaves>' requested here
    state->store(variable).then(defer(self(), &Self::_update, lambda::_1)); ^
./state/protobuf.hpp:67:5: note: declared private here
    T t; ^
./state/protobuf.hpp:138:39: error: 'variable' is a private member of 'mesos::internal::state::protobuf::Variable<mesos::internal::registry::Slaves>'
    return state::State::store(variable.variable.mutate(value.get())) ^
./state/protobuf.hpp:66:19: note: declared private here
    state::Variable variable; // Not const to keep Variable assignable. ^
./state/protobuf.hpp:139:61: error: 't' is a private member of 'mesos::internal::state::protobuf::Variable<mesos::internal::registry::Slaves>'
    .then(lambda::bind(&State::template _store<T>, variable.t, lambda::_1)); ^
./state/protobuf.hpp:67:5: note: declared private here
    T t; ^
./state/protobuf.hpp:149:17: error: calling a private constructor of class 'mesos::internal::state::protobuf::Variable<mesos::internal::registry::Slaves>'
    return Some(Variable<T>(variable.get(), t)); ^
./state/protobuf.hpp:139:41: note: in instantiation of function template specialization 'mesos::internal::state::protobuf::State::_store<mesos::internal::registry::Slaves>' requested here
    .then(lambda::bind(&State::template _store<T>, variable.t, lambda::_1)); ^
master/registrar.cpp:333:14: note: in instantiation of function template specialization
[jira] [Updated] (MESOS-1027) IPv6 support
[ https://issues.apache.org/jira/browse/MESOS-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1027: - Epic Name: IPv6 Support IPv6 support Key: MESOS-1027 URL: https://issues.apache.org/jira/browse/MESOS-1027 Project: Mesos Issue Type: Epic Components: framework, libprocess, master, slave Reporter: Dominic Hamon Fix For: 1.0.0 From the CLI down through the various layers of tech we should support IPv6. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1580) Accept --isolation=external through a deprecation cycle.
[ https://issues.apache.org/jira/browse/MESOS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Hindman updated MESOS-1580: Description: The feature branch at github.com/mesos/mesos/tree/docker removes the --isolation=external option and forces people instead to do --containerizers=external to get the same thing. We should actually put --isolation=external through a deprecation cycle. Accept --isolation=external through a deprecation cycle. Key: MESOS-1580 URL: https://issues.apache.org/jira/browse/MESOS-1580 Project: Mesos Issue Type: Technical task Components: containerization, slave Reporter: Benjamin Hindman The feature branch at github.com/mesos/mesos/tree/docker removes the --isolation=external option and forces people instead to do --containerizers=external to get the same thing. We should actually put --isolation=external through a deprecation cycle. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1580) Accept --isolation=external through a deprecation cycle.
[ https://issues.apache.org/jira/browse/MESOS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14060996#comment-14060996 ] Benjamin Hindman commented on MESOS-1580: - I added a 'Description' [~tstclair]. ;-) Let me know if there are other details that would help. Accept --isolation=external through a deprecation cycle. Key: MESOS-1580 URL: https://issues.apache.org/jira/browse/MESOS-1580 Project: Mesos Issue Type: Technical task Components: containerization, slave Reporter: Benjamin Hindman The feature branch at github.com/mesos/mesos/tree/docker removes the --isolation=external option and forces people instead to do --containerizers=external to get the same thing. We should actually put --isolation=external through a deprecation cycle. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (MESOS-1529) Handle a network partition between Master and Slave
[ https://issues.apache.org/jira/browse/MESOS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert reassigned MESOS-1529: Assignee: Benjamin Mahler Handle a network partition between Master and Slave --- Key: MESOS-1529 URL: https://issues.apache.org/jira/browse/MESOS-1529 Project: Mesos Issue Type: Bug Reporter: Dominic Hamon Assignee: Benjamin Mahler If a network partition occurs between a Master and Slave, the Master will remove the Slave (as it fails health checks) and mark the tasks being run there as LOST. However, the Slave is not aware that it has been removed, so the tasks will continue to run. (To clarify a little bit: neither the master nor the slave receives an 'exited' event, indicating that the connection between the master and slave is not closed). There are at least two possible approaches to solving this issue: 1. Introduce a health check from Slave to Master so they have a consistent view of a network partition. We may still see this issue should a one-way connection error occur. 2. Be less aggressive about marking tasks and Slaves as lost. Wait until the Slave reappears and reconcile then. We'd still need to mark Slaves and tasks as potentially lost (zombie state) but maybe the Scheduler can make a more intelligent decision. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1525) Don't require slave id for reconciliation requests.
[ https://issues.apache.org/jira/browse/MESOS-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert updated MESOS-1525: - Sprint: Q3 Sprint 1 Don't require slave id for reconciliation requests. --- Key: MESOS-1525 URL: https://issues.apache.org/jira/browse/MESOS-1525 Project: Mesos Issue Type: Improvement Affects Versions: 0.19.0 Reporter: Benjamin Mahler Reconciliation requests currently specify a list of TaskStatuses. SlaveID is optional inside TaskStatus but reconciliation requests are dropped when the SlaveID is not specified. We can answer reconciliation requests for a task so long as there are no transient slaves, this is what we should do when the slave id is not specified. There's an open question around whether we want the Reconcile Event to specify TaskID/SlaveID instead of TaskStatus, but I'll save that for later. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1119) Allocator should make an allocation decision per slave instead of per framework/role.
[ https://issues.apache.org/jira/browse/MESOS-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert updated MESOS-1119: - Sprint: Q3 Sprint 1 Allocator should make an allocation decision per slave instead of per framework/role. - Key: MESOS-1119 URL: https://issues.apache.org/jira/browse/MESOS-1119 Project: Mesos Issue Type: Bug Components: allocation Reporter: Vinod Kone Assignee: Vinod Kone Currently the Allocator::allocate() code loops through roles and frameworks (based on DRF sort) and allocates *all* slaves' resources to the first framework. This logic should be inverted: instead, the allocator should go through each slave, allocate it to a role/framework, and update the DRF shares. -- This message was sent by Atlassian JIRA (v6.2#6252)
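[Editor's note] The inverted loop described in the ticket can be caricatured as follows. This is a toy sketch with a single scalar resource and hypothetical names, not the real allocator: iterate over slaves, hand each slave to the framework with the lowest current share, and update shares after every decision.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// A slave offering a single scalar resource (cpus) for simplicity.
struct Slave {
  std::string id;
  double cpus;
};

// Per-slave allocation: for each slave, pick the framework with the lowest
// current share, allocate the slave to it, then update that share so the
// next slave's decision reflects it. Returns slave -> framework.
std::map<std::string, std::string> allocatePerSlave(
    const std::vector<Slave>& slaves,
    std::map<std::string, double>& shares,  // framework -> current share
    double totalCpus) {
  std::map<std::string, std::string> allocation;
  for (const Slave& slave : slaves) {
    auto least = std::min_element(
        shares.begin(), shares.end(),
        [](const auto& a, const auto& b) { return a.second < b.second; });
    allocation[slave.id] = least->first;
    least->second += slave.cpus / totalCpus;  // update after each decision
  }
  return allocation;
}
```

The contrast with the behavior the ticket criticizes: the old loop sorts frameworks once and gives the first one everything, whereas here shares are recomputed between per-slave decisions.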
[jira] [Assigned] (MESOS-1579) Add Sailthru to the Powered By Mesos page
[ https://issues.apache.org/jira/browse/MESOS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert reassigned MESOS-1579: Assignee: Dave Lester Add Sailthru to the Powered By Mesos page - Key: MESOS-1579 URL: https://issues.apache.org/jira/browse/MESOS-1579 Project: Mesos Issue Type: Wish Components: documentation Reporter: Alex Gaudio Assignee: Dave Lester Priority: Trivial Original Estimate: 0h Remaining Estimate: 0h Hello! We recently started using Mesos at Sailthru and love it! We'd love to add our organization to the Powered By Mesos page, and I created a GitHub PR to that effect. We'd love if you could merge it :) https://github.com/apache/mesos/pull/21 Alex (+ Sailthru's Data Science team) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1196) create annotated tag for v0.19.0
[ https://issues.apache.org/jira/browse/MESOS-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061005#comment-14061005 ] Dominic Hamon commented on MESOS-1196: -- It's not clear to me what you're looking for here. Does this mean we'd have two tags for a release - one at the start and one at the end? create annotated tag for v0.19.0 Key: MESOS-1196 URL: https://issues.apache.org/jira/browse/MESOS-1196 Project: Mesos Issue Type: Task Components: release Reporter: Bhuvan Arumugam To facilitate setting up CI for mesos repository, we should create annotated tag at the beginning of each release. This is follow up to http://www.mail-archive.com/dev@mesos.apache.org/msg10915.html Can you, a) create one based on this hash 99985d27857fb5a10b26ded8da1a36100780d18b, wherein master was pointed to 0.19.0 release? b) document the step to create annotated tag at beginning of every release c) document the step to create lightweight tag for every RC release -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1567) Add logging of the user uid when receiving SIGTERM.
[ https://issues.apache.org/jira/browse/MESOS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert updated MESOS-1567: - Sprint: Q3 Sprint 1 Add logging of the user uid when receiving SIGTERM. --- Key: MESOS-1567 URL: https://issues.apache.org/jira/browse/MESOS-1567 Project: Mesos Issue Type: Improvement Components: master, slave Reporter: Benjamin Mahler Assignee: Alexandra Sava We currently do not log the user pid when receiving a SIGTERM, which makes debugging a bit difficult. It's easy to get this information through sigaction. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (MESOS-1567) Add logging of the user pid when receiving SIGTERM.
[ https://issues.apache.org/jira/browse/MESOS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert reassigned MESOS-1567: Assignee: Alexandra Sava Add logging of the user pid when receiving SIGTERM. --- Key: MESOS-1567 URL: https://issues.apache.org/jira/browse/MESOS-1567 Project: Mesos Issue Type: Improvement Components: master, slave Reporter: Benjamin Mahler Assignee: Alexandra Sava We currently do not log the user pid when receiving a SIGTERM, which makes debugging a bit difficult. It's easy to get this information through sigaction. -- This message was sent by Atlassian JIRA (v6.2#6252)
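[Editor's note] The sigaction approach the ticket mentions, sketched minimally (not the Mesos patch itself): installing the handler with SA_SIGINFO makes the kernel pass a siginfo_t whose si_uid (and si_pid) identify the sender of the SIGTERM.

```cpp
#include <csignal>
#include <unistd.h>

// Record the sender's uid when SIGTERM arrives. Only async-signal-safe
// operations are allowed in a handler, so we just stash the value; a real
// daemon would log "Received SIGTERM from uid N" outside the handler.
static volatile sig_atomic_t senderUid = -1;

static void handleSigterm(int /*sig*/, siginfo_t* info, void* /*ctx*/) {
  senderUid = info->si_uid;  // si_pid is available the same way
}

void installSigtermHandler() {
  struct sigaction sa = {};
  sa.sa_sigaction = handleSigterm;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGTERM, &sa, nullptr);
}
```

For signals sent with kill(2), si_code is SI_USER and si_uid is the real uid of the sending process, which is exactly the debugging breadcrumb the ticket asks for.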
[jira] [Commented] (MESOS-1576) Add Go bindings to Mesos.
[ https://issues.apache.org/jira/browse/MESOS-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061009#comment-14061009 ] Niklas Quarfot Nielsen commented on MESOS-1576: --- Sure - https://github.com/mesos/mesos-go Add Go bindings to Mesos. - Key: MESOS-1576 URL: https://issues.apache.org/jira/browse/MESOS-1576 Project: Mesos Issue Type: Improvement Components: build Affects Versions: 0.19.0 Reporter: Timothy St. Clair From [~benjaminhindman]: I know that Niklas has some go bindings (backed by libmesos) here and Vladimir Vivien has some _native_ go bindings (no need for libmesos) here that could be used to help accomplish this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1316) Implement decent unit test coverage for the mesos-fetcher tool
[ https://issues.apache.org/jira/browse/MESOS-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert updated MESOS-1316: - Sprint: Q3 Sprint 1 Implement decent unit test coverage for the mesos-fetcher tool -- Key: MESOS-1316 URL: https://issues.apache.org/jira/browse/MESOS-1316 Project: Mesos Issue Type: Improvement Reporter: Tom Arnfeld Assignee: Tom Arnfeld There are currently no tests that cover the {{mesos-fetcher}} tool itself, and hence bugs like MESOS-1313 have accidentally slipped through. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (MESOS-1316) Implement decent unit test coverage for the mesos-fetcher tool
[ https://issues.apache.org/jira/browse/MESOS-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert reassigned MESOS-1316: Assignee: Benjamin Hindman (was: Tom Arnfeld) Ben has offered to reinstate the tests. Implement decent unit test coverage for the mesos-fetcher tool -- Key: MESOS-1316 URL: https://issues.apache.org/jira/browse/MESOS-1316 Project: Mesos Issue Type: Improvement Reporter: Tom Arnfeld Assignee: Benjamin Hindman There are currently no tests that cover the {{mesos-fetcher}} tool itself, and hence bugs like MESOS-1313 have accidentally slipped through. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-752) SlaveRecoveryTest/0.ReconcileTasksMissingFromSlave test is flaky
[ https://issues.apache.org/jira/browse/MESOS-752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert updated MESOS-752: Sprint: Q3 Sprint 1 SlaveRecoveryTest/0.ReconcileTasksMissingFromSlave test is flaky Key: MESOS-752 URL: https://issues.apache.org/jira/browse/MESOS-752 Project: Mesos Issue Type: Bug Components: test Environment: centos6 Reporter: Vinod Kone Assignee: Vinod Kone [ RUN ] SlaveRecoveryTest/0.ReconcileTasksMissingFromSlave Checkpointing executor's forked pid 32281 to '/tmp/SlaveRecoveryTest_0_ReconcileTasksMissingFromSlave_NT1btb/meta/slaves/201310151913-16777343-35153-31491-0/frameworks/201310151913-16777343-35153-31491-/executors/0514b52f-3c17-4ee5-ba16-635198701ca2/runs/97c9e2cc-ceea-40a8-a915-aed5fed1dcb3/pids/forked.pid' Fetching resources into '/tmp/SlaveRecoveryTest_0_ReconcileTasksMissingFromSlave_NT1btb/slaves/201310151913-16777343-35153-31491-0/frameworks/201310151913-16777343-35153-31491-/executors/0514b52f-3c17-4ee5-ba16-635198701ca2/runs/97c9e2cc-ceea-40a8-a915-aed5fed1dcb3' Registered executor on localhost.localdomain Starting task 0514b52f-3c17-4ee5-ba16-635198701ca2 Forked command at 32317 sh -c 'sleep 10' tests/slave_recovery_tests.cpp:1927: Failure Mock function called more times than expected - returning directly. Function call: statusUpdate(0x7fffae636eb0, @0x7f1590027a00 64-byte object F0-2F D0-A1 15-7F 00-00 00-00 00-00 00-00 00-00 40-E9 01-90 15-7F 00-00 20-6B 03-90 15-7F 00-00 48-91 C3-00 00-00 00-00 B0-3B 01-90 15-7F 00-00 05-00 00-00 00-00 00-00 17-00 00-00 00-00 00-00) Expected: to be called once Actual: called twice - over-saturated and active Command exited with status 0 (pid: 32317) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-976) SlaveRecoveryTest/1.SchedulerFailover is flaky
[ https://issues.apache.org/jira/browse/MESOS-976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert updated MESOS-976: Sprint: Q3 Sprint 1 SlaveRecoveryTest/1.SchedulerFailover is flaky -- Key: MESOS-976 URL: https://issues.apache.org/jira/browse/MESOS-976 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.18.0 Reporter: Vinod Kone Assignee: Ian Downes [==] Running 1 test from 1 test case. [--] Global test environment set-up. [--] 1 test from SlaveRecoveryTest/1, where TypeParam = mesos::internal::slave::CgroupsIsolator [ RUN ] SlaveRecoveryTest/1.SchedulerFailover I0206 20:18:31.525116 56447 master.cpp:239] Master ID: 2014-02-06-20:18:31-1740121354-55566-56447 Hostname: smfd-bkq-03-sr4.devel.twitter.com I0206 20:18:31.525295 56481 master.cpp:321] Master started on 10.37.184.103:55566 I0206 20:18:31.525315 56481 master.cpp:324] Master only allowing authenticated frameworks to register! I0206 20:18:31.527093 56481 master.cpp:756] The newly elected leader is master@10.37.184.103:55566 I0206 20:18:31.527122 56481 master.cpp:764] Elected as the leading master! 
I0206 20:18:31.530642 56473 slave.cpp:112] Slave started on 9)@10.37.184.103:55566 I0206 20:18:31.530802 56473 slave.cpp:212] Slave resources: cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] I0206 20:18:31.531203 56473 slave.cpp:240] Slave hostname: smfd-bkq-03-sr4.devel.twitter.com I0206 20:18:31.531221 56473 slave.cpp:241] Slave checkpoint: true I0206 20:18:31.531991 56482 cgroups_isolator.cpp:225] Using /tmp/mesos_test_cgroup as cgroups hierarchy root I0206 20:18:31.532470 56478 state.cpp:33] Recovering state from '/tmp/SlaveRecoveryTest_1_SchedulerFailover_7dC2N1/meta' I0206 20:18:31.532698 56469 status_update_manager.cpp:188] Recovering status update manager I0206 20:18:31.533962 56472 sched.cpp:265] Authenticating with master master@10.37.184.103:55566 I0206 20:18:31.534102 56472 sched.cpp:234] Detecting new master I0206 20:18:31.534124 56484 authenticatee.hpp:124] Creating new client SASL connection I0206 20:18:31.534299 56473 master.cpp:2317] Authenticating framework at scheduler(9)@10.37.184.103:55566 I0206 20:18:31.534459 56461 authenticator.hpp:140] Creating new server SASL connection I0206 20:18:31.534572 56466 authenticatee.hpp:212] Received SASL authentication mechanisms: CRAM-MD5 I0206 20:18:31.534595 56466 authenticatee.hpp:238] Attempting to authenticate with mechanism 'CRAM-MD5' I0206 20:18:31.534667 56474 authenticator.hpp:243] Received SASL authentication start I0206 20:18:31.534732 56474 authenticator.hpp:325] Authentication requires more steps I0206 20:18:31.534814 56468 authenticatee.hpp:258] Received SASL authentication step I0206 20:18:31.534946 56466 authenticator.hpp:271] Received SASL authentication step I0206 20:18:31.535007 56466 authenticator.hpp:317] Authentication success I0206 20:18:31.535084 56471 authenticatee.hpp:298] Authentication success I0206 20:18:31.535107 56461 master.cpp:2357] Successfully authenticated framework at scheduler(9)@10.37.184.103:55566 I0206 20:18:31.535392 56476 sched.cpp:339] Successfully 
authenticated with master master@10.37.184.103:55566 I0206 20:18:31.535512 56465 master.cpp:812] Received registration request from scheduler(9)@10.37.184.103:55566 I0206 20:18:31.535570 56465 master.cpp:830] Registering framework 2014-02-06-20:18:31-1740121354-55566-56447- at scheduler(9)@10.37.184.103:55566 I0206 20:18:31.535856 56465 hierarchical_allocator_process.hpp:332] Added framework 2014-02-06-20:18:31-1740121354-55566-56447- I0206 20:18:31.537802 56482 cgroups_isolator.cpp:840] Recovering isolator I0206 20:18:31.538462 56472 slave.cpp:2760] Finished recovery I0206 20:18:31.538910 56472 slave.cpp:508] New master detected at master@10.37.184.103:55566 I0206 20:18:31.539036 56478 status_update_manager.cpp:162] New master detected at master@10.37.184.103:55566 I0206 20:18:31.539223 56464 master.cpp:1834] Attempting to register slave on smfd-bkq-03-sr4.devel.twitter.com at slave(9)@10.37.184.103:55566 I0206 20:18:31.539271 56472 slave.cpp:533] Detecting new master I0206 20:18:31.539330 56464 master.cpp:2804] Adding slave 2014-02-06-20:18:31-1740121354-55566-56447-0 at smfd-bkq-03-sr4.devel.twitter.com with cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] I0206 20:18:31.539454 56472 slave.cpp:551] Registered with master master@10.37.184.103:55566; given slave ID 2014-02-06-20:18:31-1740121354-55566-56447-0 I0206 20:18:31.539620 56472 slave.cpp:564] Checkpointing SlaveInfo to '/tmp/SlaveRecoveryTest_1_SchedulerFailover_7dC2N1/meta/slaves/2014-02-06-20:18:31-1740121354-55566-56447-0/slave.info'
[jira] [Updated] (MESOS-1527) Choose containerizer at runtime
[ https://issues.apache.org/jira/browse/MESOS-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert updated MESOS-1527: - Sprint: Q3 Sprint 1 Choose containerizer at runtime --- Key: MESOS-1527 URL: https://issues.apache.org/jira/browse/MESOS-1527 Project: Mesos Issue Type: Improvement Components: containerization Reporter: Jay Buffington Currently you have to choose the containerizer at mesos-slave start time via the --isolation option. I'd like to be able to specify the containerizer in the request to launch the job. This could be specified by a new Provider field in the ContainerInfo proto buf. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1580) Accept --isolation=external through a deprecation cycle.
[ https://issues.apache.org/jira/browse/MESOS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert updated MESOS-1580: - Issue Type: Task (was: Technical task) Parent: (was: MESOS-1527) Accept --isolation=external through a deprecation cycle. Key: MESOS-1580 URL: https://issues.apache.org/jira/browse/MESOS-1580 Project: Mesos Issue Type: Task Components: containerization, slave Reporter: Benjamin Hindman The feature branch at github.com/mesos/mesos/tree/docker removes the --isolation=external option and forces people instead to do --containerizers=external to get the same thing. We should actually put --isolation=external through a deprecation cycle. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1219) Master should generate new id for frameworks that reconnect after failover timeout
[ https://issues.apache.org/jira/browse/MESOS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-1219: -- Sprint: Q3 Sprint 1 Assignee: Vinod Kone Master should generate new id for frameworks that reconnect after failover timeout -- Key: MESOS-1219 URL: https://issues.apache.org/jira/browse/MESOS-1219 Project: Mesos Issue Type: Bug Components: master, webui Reporter: Robert Lacroix Assignee: Vinod Kone When a scheduler reconnects after the failover timeout has been exceeded, the framework id is usually reused because the scheduler doesn't know that the timeout was exceeded and it is actually handled as a new framework. The /framework/:framework_id route of the Web UI doesn't handle those cases very well because its key is reused. It only shows the terminated one. Would it make sense to ignore the provided framework id when a scheduler reconnects to a terminated framework and generate a new id to make sure it's unique? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1566) Support private docker registry.
[ https://issues.apache.org/jira/browse/MESOS-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061064#comment-14061064 ] Timothy St. Clair commented on MESOS-1566: -- We should really link all the Docker JIRAs together. Support private docker registry. Key: MESOS-1566 URL: https://issues.apache.org/jira/browse/MESOS-1566 Project: Mesos Issue Type: Task Reporter: Timothy Chen Need to support Docker launching images hosted in private registry service, which requires docker login. Can consider utilizing .dockercfg file for providing credentials. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1588) Enforce disk quota in MesosContainerizer
[ https://issues.apache.org/jira/browse/MESOS-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061067#comment-14061067 ] Ian Downes commented on MESOS-1588: --- Can you please elaborate on what you mean by machine policy? Disk is a first-class resource that should be enforced, both to protect the host and to protect other tasks running on the host, i.e., a task should *not* be able to spew out logs and affect others; if a task requests XX GB, that's all it should get. Enforcement could be via ENOSPC if a separate filesystem could be used, but this solution is not always available. Many applications also don't handle ENOSPC well and it's generally safer to just terminate the container. I'm proposing a cycle: a release with enforcement defaulting to false, keeping existing behavior. A subsequent release would default to true. Enforce disk quota in MesosContainerizer Key: MESOS-1588 URL: https://issues.apache.org/jira/browse/MESOS-1588 Project: Mesos Issue Type: Improvement Affects Versions: 0.20.0 Reporter: Ian Downes Assignee: Ian Downes Once we have disk usage we should enforce this. Containers that exceed their quota should be terminated, i.e., the filesystem isolator should set a Limitation so the MesosContainerizer kills the container. Disk quota enforcement should be optional to permit a transition period where disk usage is monitored before enabling enforcement. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1588) Enforce disk quota in MesosContainerizer
[ https://issues.apache.org/jira/browse/MESOS-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061130#comment-14061130 ] Timothy St. Clair commented on MESOS-1588: -- Machine policy means defining a set of rules for machine(s). This idea doesn't exist as a formalism inside of Mesos, but it certainly does in other grid systems. In other grid systems you can define KILL policy expressions, such that there is no harm, no foul to a point. For example, if a machine has a 1TB drive, and a task goes over by 1GB, should that task get booted? Strict enforcement says yes, but presumes that users accurately outline how much disk their task will consume, which I assert is a really bad idea. This problem was the root reason why we designed hunting policies in Condor that used job history. We allowed users to go over to a point defined by the policy expression, and updated the jobAd to more accurately reflect how much resource was being used, so subsequent jobs would land appropriately. IMHO strict enforcement should be an *optional* parameter only. Enforce disk quota in MesosContainerizer Key: MESOS-1588 URL: https://issues.apache.org/jira/browse/MESOS-1588 Project: Mesos Issue Type: Improvement Affects Versions: 0.20.0 Reporter: Ian Downes Assignee: Ian Downes Once we have disk usage we should enforce this. Containers that exceed their quota should be terminated, i.e., the filesystem isolator should set a Limitation so the MesosContainerizer kills the container. Disk quota enforcement should be optional to permit a transition period where disk usage is monitored before enabling enforcement. -- This message was sent by Atlassian JIRA (v6.2#6252)
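The Condor-style soft policy described above might look like this in outline. The function names and the 10% overage tolerance are illustrative; real Condor policy expressions are ClassAds, not Python.

```python
def should_kill(requested, observed, overage_fraction=0.10):
    """Soft policy: kill only when observed usage exceeds the request
    by more than the allowed overage ("no harm, no foul to a point")."""
    return observed > requested * (1.0 + overage_fraction)

def updated_request(requested, observed):
    """Feed observed usage back into the next request, as Condor did by
    updating the jobAd, so subsequent jobs land appropriately."""
    return max(requested, observed)
```

With this shape, strict enforcement is just `overage_fraction=0.0`, which is consistent with keeping it an optional parameter.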
[jira] [Updated] (MESOS-987) Wire up a code coverage tool
[ https://issues.apache.org/jira/browse/MESOS-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-987: Component/s: technical debt Wire up a code coverage tool Key: MESOS-987 URL: https://issues.apache.org/jira/browse/MESOS-987 Project: Mesos Issue Type: Improvement Components: technical debt Reporter: Vinod Kone Assignee: Dominic Hamon Some options are gcov (works only with gcc afaict) and optionally lcov. It would be nice to hook this up with Jenkins too. http://meekrosoft.wordpress.com/2010/06/02/continuous-code-coverage-with-gcc-googletest-and-hudson/ -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1316) Implement decent unit test coverage for the mesos-fetcher tool
[ https://issues.apache.org/jira/browse/MESOS-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1316: - Component/s: test technical debt Implement decent unit test coverage for the mesos-fetcher tool -- Key: MESOS-1316 URL: https://issues.apache.org/jira/browse/MESOS-1316 Project: Mesos Issue Type: Improvement Components: technical debt, test Reporter: Tom Arnfeld Assignee: Benjamin Hindman There are currently no tests that cover the {{mesos-fetcher}} tool itself, and hence bugs like MESOS-1313 have accidentally slipped through. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1583) Clang tool build improvement include what you use
[ https://issues.apache.org/jira/browse/MESOS-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1583: - Component/s: technical debt Clang tool build improvement include what you use --- Key: MESOS-1583 URL: https://issues.apache.org/jira/browse/MESOS-1583 Project: Mesos Issue Type: Improvement Components: technical debt Reporter: Isabel Jimenez Assignee: Isabel Jimenez -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1316) Implement decent unit test coverage for the mesos-fetcher tool
[ https://issues.apache.org/jira/browse/MESOS-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061176#comment-14061176 ] Tom Arnfeld commented on MESOS-1316: Awesome, thanks Ben! Apologies that I've not had the time to look at this after I said I would. Implement decent unit test coverage for the mesos-fetcher tool -- Key: MESOS-1316 URL: https://issues.apache.org/jira/browse/MESOS-1316 Project: Mesos Issue Type: Improvement Components: technical debt, test Reporter: Tom Arnfeld Assignee: Benjamin Hindman There are currently no tests that cover the {{mesos-fetcher}} tool itself, and hence bugs like MESOS-1313 have accidentally slipped through. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1574) what to do when a rogue process binds to a port mesos didn't allocate to it?
[ https://issues.apache.org/jira/browse/MESOS-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061219#comment-14061219 ] Jie Yu commented on MESOS-1574: --- The system admin could also set ip_local_port_range to prevent a rogue process from binding to a mesos reserved port: echo xxx > /proc/sys/net/ipv4/ip_local_port_range what to do when a rogue process binds to a port mesos didn't allocate to it? Key: MESOS-1574 URL: https://issues.apache.org/jira/browse/MESOS-1574 Project: Mesos Issue Type: Improvement Components: allocation, isolation Reporter: Jay Buffington Priority: Minor I recently had an issue where a slave had a process whose parent was init that was bound to a port in the range that mesos thought was a free resource. I'm not sure if this is due to a bug in mesos (it lost track of this process during an upgrade?) or if there was a bad user who started a process on the host manually outside of mesos. The process is over a month old and I have no history in mesos to ask it if/when it launched the task :( If a rogue process binds to a port that mesos-slave has offered to the master as an available resource there should be some sort of reckoning. Mesos could: * kill the rogue process * rescind the offer for that port * have an api that can be plugged into a monitoring system to alert humans of this inconsistency -- This message was sent by Atlassian JIRA (v6.2#6252)
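For the monitoring-alert option listed in the report, one way to detect the inconsistency is to cross-check ports bound on the host against the port ranges the slave offered. A sketch, with `port_conflicts` as an illustrative helper rather than an existing Mesos API:

```python
def port_conflicts(offered_ranges, bound_ports):
    """Report bound ports that fall inside ranges the slave offered.

    offered_ranges: iterable of (lo, hi) pairs, inclusive, e.g. the
                    "ports" resource ranges from the slave's offers.
    bound_ports:    ports observed in use on the host (e.g. parsed
                    from ss/netstat output).
    """
    conflicts = []
    for port in bound_ports:
        if any(lo <= port <= hi for lo, hi in offered_ranges):
            conflicts.append(port)
    return sorted(conflicts)
```

A monitoring system could run this periodically and alert a human whenever the result is non-empty.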
[jira] [Updated] (MESOS-1574) what to do when a rogue process binds to a port mesos didn't allocate to it?
[ https://issues.apache.org/jira/browse/MESOS-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1574: --- Component/s: isolation what to do when a rogue process binds to a port mesos didn't allocate to it? Key: MESOS-1574 URL: https://issues.apache.org/jira/browse/MESOS-1574 Project: Mesos Issue Type: Improvement Components: allocation, isolation Reporter: Jay Buffington Priority: Minor I recently had an issue where a slave had a process whose parent was init that was bound to a port in the range that mesos thought was a free resource. I'm not sure if this is due to a bug in mesos (it lost track of this process during an upgrade?) or if there was a bad user who started a process on the host manually outside of mesos. The process is over a month old and I have no history in mesos to ask it if/when it launched the task :( If a rogue process binds to a port that mesos-slave has offered to the master as an available resource there should be some sort of reckoning. Mesos could: * kill the rogue process * rescind the offer for that port * have an api that can be plugged into a monitoring system to alert humans of this inconsistency -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1538) A container destruction in the middle of a launch leads to CHECK failure.
[ https://issues.apache.org/jira/browse/MESOS-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1538: --- Summary: A container destruction in the middle of a launch leads to CHECK failure. (was: A container destruction in the middle of a launch leads to CHECK failure) A container destruction in the middle of a launch leads to CHECK failure. - Key: MESOS-1538 URL: https://issues.apache.org/jira/browse/MESOS-1538 Project: Mesos Issue Type: Bug Reporter: Vinod Kone Assignee: Ian Downes Fix For: 0.19.1 There is a race between the destroy() and exec() in the containerizer process, when the destroy is called in the middle of the launch. In particular if the destroy is completed and the container removed from 'promises' map before 'exec()' was called, CHECK failure happens. The fix is to return a Failure instead of doing a CHECK in 'exec()'. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (MESOS-1574) what to do when a rogue process binds to a port mesos didn't allocate to it?
[ https://issues.apache.org/jira/browse/MESOS-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061243#comment-14061243 ] Ian Downes edited comment on MESOS-1574 at 7/14/14 9:05 PM: ip_local_port_range sets the range for local ports when opening outgoing connections; it does not restrict processes from binding to ports inside that range. [~jaybuff] are you using a cgroups isolator? If so, you can check if the process' cgroup is managed by mesos, implying it's a descendant of a terminated mesos-slave: {noformat} $ cat /proc/$pid/cgroup 4:memory:/sys/fs/cgroup/memory/mesos/XXX 3:freezer:/sys/fs/cgroup/freezer/mesos/XXX 2:cpuacct:/sys/fs/cgroup/cpuacct/mesos/XXX 1:cpu:/sys/fs/cgroup/cpu/mesos/XXX {noformat} was (Author: idownes): ip_local_port_range sets the range for local ports when opening outgoing connections; it does not restrict processes from binding to ports inside that range. [~jaybuff] are you using a cgroups isolator? If so, you can check if the process' cgroup is managed by mesos, implying it's a descendant of a terminated mesos-slave: $ cat /proc/$pid/cgroup 4:memory:/sys/fs/cgroup/memory/mesos/XXX 3:freezer:/sys/fs/cgroup/freezer/mesos/XXX 2:cpuacct:/sys/fs/cgroup/cpuacct/mesos/XXX 1:cpu:/sys/fs/cgroup/cpu/mesos/XXX what to do when a rogue process binds to a port mesos didn't allocate to it? Key: MESOS-1574 URL: https://issues.apache.org/jira/browse/MESOS-1574 Project: Mesos Issue Type: Improvement Components: allocation, isolation Reporter: Jay Buffington Priority: Minor I recently had an issue where a slave had a process whose parent was init that was bound to a port in the range that mesos thought was a free resource. I'm not sure if this is due to a bug in mesos (it lost track of this process during an upgrade?) or if there was a bad user who started a process on the host manually outside of mesos. 
The process is over a month old and I have no history in mesos to ask it if/when it launched the task :( If a rogue process binds to a port that mesos-slave has offered to the master as an available resource there should be some sort of reckoning. Mesos could: * kill the rogue process * rescind the offer for that port * have an api that can be plugged into a monitoring system to alert humans of this inconsistency -- This message was sent by Atlassian JIRA (v6.2#6252)
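The cgroup check from Ian's comment can be automated along these lines. `managed_by_mesos` is an illustrative helper, not a Mesos API; it matches the `/proc/<pid>/cgroup` layout quoted in the comment above.

```python
def managed_by_mesos(cgroup_text):
    """Given the text of /proc/<pid>/cgroup, report whether any hierarchy
    places the process under a mesos-managed cgroup (i.e. a path with a
    /mesos/ component), implying it descends from a mesos-slave."""
    for line in cgroup_text.splitlines():
        parts = line.split(":", 2)   # hierarchy-id : subsystems : path
        # Append "/" so a path ending in ".../mesos" also matches.
        if len(parts) == 3 and "/mesos/" in parts[2] + "/":
            return True
    return False
```

A process started by hand on the host (the "bad user" case) would normally sit in the root cgroup and fail this check.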
[jira] [Updated] (MESOS-1567) Add logging of the user uid when receiving SIGTERM.
[ https://issues.apache.org/jira/browse/MESOS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1567: --- Description: We currently do not log the user id when receiving a SIGTERM, this makes debugging a bit difficult. It's easy to get this information through sigaction. (was: We currently do not log the user pid when receiving a SIGTERM, this makes debugging a bit difficult. It's easy to get this information through sigaction.) Add logging of the user uid when receiving SIGTERM. --- Key: MESOS-1567 URL: https://issues.apache.org/jira/browse/MESOS-1567 Project: Mesos Issue Type: Improvement Components: master, slave Reporter: Benjamin Mahler Assignee: Alexandra Sava We currently do not log the user id when receiving a SIGTERM, this makes debugging a bit difficult. It's easy to get this information through sigaction. -- This message was sent by Atlassian JIRA (v6.2#6252)
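For illustration, the sender's uid is carried in the signal's siginfo: in C this means installing the handler via sigaction() with SA_SIGINFO and reading si_uid. A POSIX-only Python sketch of the same idea (not Mesos's C++ handler) uses sigwaitinfo to receive the signal synchronously:

```python
import os
import signal

# Block SIGTERM, then receive it synchronously with sigwaitinfo(), whose
# siginfo carries the sender's credentials -- the same si_uid/si_pid a C
# handler installed via sigaction() with SA_SIGINFO would see.
signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
os.kill(os.getpid(), signal.SIGTERM)         # send ourselves a SIGTERM
info = signal.sigwaitinfo({signal.SIGTERM})
print("received SIGTERM from uid %d (pid %d)" % (info.si_uid, info.si_pid))
```

Logging that uid/pid pair at shutdown is exactly the debugging breadcrumb this ticket asks for.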
[jira] [Updated] (MESOS-1219) Master should disallow frameworks that reconnect after failover timeout
[ https://issues.apache.org/jira/browse/MESOS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-1219: -- Summary: Master should disallow frameworks that reconnect after failover timeout (was: Master should generate new id for frameworks that reconnect after failover timeout) Master should disallow frameworks that reconnect after failover timeout --- Key: MESOS-1219 URL: https://issues.apache.org/jira/browse/MESOS-1219 Project: Mesos Issue Type: Bug Components: master, webui Reporter: Robert Lacroix Assignee: Vinod Kone When a scheduler reconnects after the failover timeout has been exceeded, the framework id is usually reused because the scheduler doesn't know that the timeout was exceeded, and it is actually handled as a new framework. The /framework/:framework_id route of the Web UI doesn't handle those cases very well because its key is reused. It only shows the terminated one. Would it make sense to ignore the provided framework id when a scheduler reconnects to a terminated framework and generate a new id to make sure it's unique? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1576) Add Go bindings to Mesos.
[ https://issues.apache.org/jira/browse/MESOS-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061409#comment-14061409 ] Dominic Hamon commented on MESOS-1576: -- Regarding ZK: https://godoc.org/github.com/samuel/go-zookeeper/ or https://godoc.org/launchpad.net/gozk/zookeeper might be worth a look. Disclaimer: I haven't looked at them in any depth. Add Go bindings to Mesos. - Key: MESOS-1576 URL: https://issues.apache.org/jira/browse/MESOS-1576 Project: Mesos Issue Type: Improvement Components: build Affects Versions: 0.19.0 Reporter: Timothy St. Clair From [~benjaminhindman]: I know that Niklas has some go bindings (backed by libmesos) here and Vladimir Vivien has some _native_ go bindings (no need for libmesos) here that could be used to help accomplish this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1566) Support private docker registry.
[ https://issues.apache.org/jira/browse/MESOS-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061462#comment-14061462 ] Timothy Chen commented on MESOS-1566: - [~hufman] yes, it works if you don't need authentication, but in case you do we need to allow user authentication configuration, which is what this ticket is about. Overall I didn't put that in because I want to make sure we address any other needs required for a private registry. Support private docker registry. Key: MESOS-1566 URL: https://issues.apache.org/jira/browse/MESOS-1566 Project: Mesos Issue Type: Task Reporter: Timothy Chen Need to support Docker launching images hosted in private registry service, which requires docker login. Can consider utilizing .dockercfg file for providing credentials. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1593) Add DockerInfo Configuration
Timothy Chen created MESOS-1593: --- Summary: Add DockerInfo Configuration Key: MESOS-1593 URL: https://issues.apache.org/jira/browse/MESOS-1593 Project: Mesos Issue Type: Task Reporter: Timothy Chen We want to add a new proto message to encapsulate all Docker related configurations into DockerInfo. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1574) what to do when a rogue process binds to a port mesos didn't allocate to it?
[ https://issues.apache.org/jira/browse/MESOS-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061466#comment-14061466 ] Jay Buffington commented on MESOS-1574: --- [~idownes] Good idea! Unfortunately, we turned on cgroups isolation two weeks ago, but this process was started a month ago :( what to do when a rogue process binds to a port mesos didn't allocate to it? Key: MESOS-1574 URL: https://issues.apache.org/jira/browse/MESOS-1574 Project: Mesos Issue Type: Improvement Components: allocation, isolation Reporter: Jay Buffington Priority: Minor I recently had an issue where a slave had a process whose parent was init that was bound to a port in the range that mesos thought was a free resource. I'm not sure if this is due to a bug in mesos (it lost track of this process during an upgrade?) or if there was a bad user who started a process on the host manually outside of mesos. The process is over a month old and I have no history in mesos to ask it if/when it launched the task :( If a rogue process binds to a port that mesos-slave has offered to the master as an available resource there should be some sort of reckoning. Mesos could: * kill the rogue process * rescind the offer for that port * have an api that can be plugged into a monitoring system to alert humans of this inconsistency -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1594) SlaveRecoveryTest/0.ReconcileKillTask is flaky
Vinod Kone created MESOS-1594: - Summary: SlaveRecoveryTest/0.ReconcileKillTask is flaky Key: MESOS-1594 URL: https://issues.apache.org/jira/browse/MESOS-1594 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.20.0 Environment: Ubuntu 12.10 with GCC Reporter: Vinod Kone Observed this on Jenkins. {code} [ RUN ] SlaveRecoveryTest/0.ReconcileKillTask Using temporary directory '/tmp/SlaveRecoveryTest_0_ReconcileKillTask_3zJ6DG' I0714 15:08:43.915114 27216 leveldb.cpp:176] Opened db in 474.695188ms I0714 15:08:43.933645 27216 leveldb.cpp:183] Compacted db in 18.068942ms I0714 15:08:43.934129 27216 leveldb.cpp:198] Created db iterator in 7860ns I0714 15:08:43.934439 27216 leveldb.cpp:204] Seeked to beginning of db in 2560ns I0714 15:08:43.934779 27216 leveldb.cpp:273] Iterated through 0 keys in the db in 1400ns I0714 15:08:43.935098 27216 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0714 15:08:43.936027 27238 recover.cpp:425] Starting replica recovery I0714 15:08:43.936225 27238 recover.cpp:451] Replica is in EMPTY status I0714 15:08:43.936867 27238 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0714 15:08:43.937049 27238 recover.cpp:188] Received a recover response from a replica in EMPTY status I0714 15:08:43.937232 27238 recover.cpp:542] Updating replica status to STARTING I0714 15:08:43.945600 27235 master.cpp:288] Master 20140714-150843-16842879-55850-27216 (quantal) started on 127.0.1.1:55850 I0714 15:08:43.945643 27235 master.cpp:325] Master only allowing authenticated frameworks to register I0714 15:08:43.945651 27235 master.cpp:330] Master only allowing authenticated slaves to register I0714 15:08:43.945658 27235 credentials.hpp:36] Loading credentials for authentication from '/tmp/SlaveRecoveryTest_0_ReconcileKillTask_3zJ6DG/credentials' I0714 15:08:43.945808 27235 master.cpp:359] Authorization enabled I0714 15:08:43.946369 27235 
hierarchical_allocator_process.hpp:301] Initializing hierarchical allocator process with master : master@127.0.1.1:55850 I0714 15:08:43.946419 27235 master.cpp:122] No whitelist given. Advertising offers for all slaves I0714 15:08:43.946614 27235 master.cpp:1128] The newly elected leader is master@127.0.1.1:55850 with id 20140714-150843-16842879-55850-27216 I0714 15:08:43.946630 27235 master.cpp:1141] Elected as the leading master! I0714 15:08:43.946637 27235 master.cpp:959] Recovering from registrar I0714 15:08:43.946707 27235 registrar.cpp:313] Recovering registrar I0714 15:08:43.957895 27238 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 20.529301ms I0714 15:08:43.957978 27238 replica.cpp:320] Persisted replica status to STARTING I0714 15:08:43.958142 27238 recover.cpp:451] Replica is in STARTING status I0714 15:08:43.958664 27238 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0714 15:08:43.958762 27238 recover.cpp:188] Received a recover response from a replica in STARTING status I0714 15:08:43.958945 27238 recover.cpp:542] Updating replica status to VOTING I0714 15:08:43.975685 27238 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 16.646136ms I0714 15:08:43.976367 27238 replica.cpp:320] Persisted replica status to VOTING I0714 15:08:43.976824 27241 recover.cpp:556] Successfully joined the Paxos group I0714 15:08:43.977072 27242 recover.cpp:440] Recover process terminated I0714 15:08:43.980590 27236 log.cpp:656] Attempting to start the writer I0714 15:08:43.981385 27236 replica.cpp:474] Replica received implicit promise request with proposal 1 I0714 15:08:43.999141 27236 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 17.705787ms I0714 15:08:43.999222 27236 replica.cpp:342] Persisted promised to 1 I0714 15:08:44.004451 27240 coordinator.cpp:230] Coordinator attemping to fill missing position I0714 15:08:44.004914 27240 replica.cpp:375] Replica received explicit promise request 
for position 0 with proposal 2 I0714 15:08:44.021456 27240 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 16.499775ms I0714 15:08:44.021533 27240 replica.cpp:676] Persisted action at 0 I0714 15:08:44.022006 27240 replica.cpp:508] Replica received write request for position 0 I0714 15:08:44.022043 27240 leveldb.cpp:438] Reading position from leveldb took 21376ns I0714 15:08:44.035969 27240 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 13.885907ms I0714 15:08:44.036365 27240 replica.cpp:676] Persisted action at 0 I0714 15:08:44.040156 27238 replica.cpp:655] Replica received learned notice for position 0 I0714 15:08:44.058082 27238 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 17.860707ms I0714 15:08:44.058161 27238 replica.cpp:676] Persisted action at 0 I0714 15:08:44.058176 27238 replica.cpp:661] Replica
[jira] [Created] (MESOS-1595) Provide a way to install libprocess
Vinod Kone created MESOS-1595: - Summary: Provide a way to install libprocess Key: MESOS-1595 URL: https://issues.apache.org/jira/browse/MESOS-1595 Project: Mesos Issue Type: Story Reporter: Vinod Kone Assignee: Vinod Kone For C++ framework developers that want to use libprocess in their code base, it would be great if Mesos provides a way to easily get access to the headers. A first step in that direction would be to provide a install target in the libprocess Makefile for the same. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1596) Improve allocation of resources
Dominic Hamon created MESOS-1596: Summary: Improve allocation of resources Key: MESOS-1596 URL: https://issues.apache.org/jira/browse/MESOS-1596 Project: Mesos Issue Type: Epic Reporter: Dominic Hamon Assignee: Vinod Kone -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1597) Add tc police action to routing library
Dominic Hamon created MESOS-1597: Summary: Add tc police action to routing library Key: MESOS-1597 URL: https://issues.apache.org/jira/browse/MESOS-1597 Project: Mesos Issue Type: Task Reporter: Dominic Hamon Assignee: Jie Yu -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1598) Add advanced shaping controls to routing library
Dominic Hamon created MESOS-1598: Summary: Add advanced shaping controls to routing library Key: MESOS-1598 URL: https://issues.apache.org/jira/browse/MESOS-1598 Project: Mesos Issue Type: Task Reporter: Dominic Hamon Assignee: Jie Yu Only necessary if bandwidth cap using tc police action is deemed not effective for network isolation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1599) Slave configuration for network isolation
Dominic Hamon created MESOS-1599: Summary: Slave configuration for network isolation Key: MESOS-1599 URL: https://issues.apache.org/jira/browse/MESOS-1599 Project: Mesos Issue Type: Task Reporter: Dominic Hamon Assignee: Jie Yu -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1471) Add documentation for the replicated log.
[ https://issues.apache.org/jira/browse/MESOS-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1471: - Sprint: Q3 Sprint 1 Add documentation for the replicated log. - Key: MESOS-1471 URL: https://issues.apache.org/jira/browse/MESOS-1471 Project: Mesos Issue Type: Documentation Components: documentation, replicated log Reporter: Benjamin Mahler Assignee: Jie Yu The replicated log could benefit from some documentation. In particular, how does it work? What do operators need to know? Possibly there is some overlap with our future maintenance documentation in MESOS-1470. I believe [~jieyu] has some unpublished work that could be leveraged here! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1582) Improve build time.
[ https://issues.apache.org/jira/browse/MESOS-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert updated MESOS-1582: - Epic Colour: ghx-label-2 (was: Red) Improve build time. --- Key: MESOS-1582 URL: https://issues.apache.org/jira/browse/MESOS-1582 Project: Mesos Issue Type: Epic Components: build Reporter: Benjamin Hindman The build takes a ridiculously long time unless you have a large, parallel machine. This is a combination of many factors, all of which we'd like to discuss and track here. I'd also love to actually track build times so we can get an appreciation of the improvements. Please leave a comment below with your build times! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1583) Clang tool build improvement include what you use
[ https://issues.apache.org/jira/browse/MESOS-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061634#comment-14061634 ] Dominic Hamon commented on MESOS-1583: -- Please test this on a small selection of files and share the resulting patch. There has been some controversy in other projects that have used it regarding how aggressive it can be. Clang tool build improvement include what you use --- Key: MESOS-1583 URL: https://issues.apache.org/jira/browse/MESOS-1583 Project: Mesos Issue Type: Improvement Components: technical debt Reporter: Isabel Jimenez Assignee: Isabel Jimenez -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1596) Various allocator improvements for multi-framework support
[ https://issues.apache.org/jira/browse/MESOS-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1596: - Summary: Various allocator improvements for multi-framework support (was: Improve allocation of resources) Various allocator improvements for multi-framework support -- Key: MESOS-1596 URL: https://issues.apache.org/jira/browse/MESOS-1596 Project: Mesos Issue Type: Epic Reporter: Dominic Hamon Assignee: Vinod Kone -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1597) Add tc police action to routing library
[ https://issues.apache.org/jira/browse/MESOS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1597: - Description: [Policing filters|http://www.lartc.org/lartc.html#LARTC.ADV-FILTER.POLICING] are a simple way to add bandwidth limiting to a connection. Adding this action to the routing library will allow us to start isolating network bandwidth per container. Add tc police action to routing library --- Key: MESOS-1597 URL: https://issues.apache.org/jira/browse/MESOS-1597 Project: Mesos Issue Type: Task Reporter: Dominic Hamon Assignee: Jie Yu [Policing filters|http://www.lartc.org/lartc.html#LARTC.ADV-FILTER.POLICING] are a simple way to add bandwidth limiting to a connection. Adding this action to the routing library will allow us to start isolating network bandwidth per container. -- This message was sent by Atlassian JIRA (v6.2#6252)
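Conceptually, tc's police action is a token-bucket policer: traffic conforming to a configured (rate, burst) passes, and excess is dropped. A toy model of that mechanism (illustrative only, not the kernel implementation and not tied to the routing library's API):

```python
class TokenBucket:
    """Minimal token-bucket policer, the mechanism behind tc's police
    action: packets within (rate, burst) pass, excess packets are dropped."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps         # tokens (bytes) added per second
        self.capacity = burst_bytes  # maximum accumulated burst
        self.tokens = burst_bytes    # start with a full bucket
        self.last = 0.0              # timestamp of the last decision

    def allow(self, packet_bytes, now):
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True   # conforming: forward the packet
        return False      # non-conforming: drop (police) the packet
```

A per-container bucket like this is the flat bandwidth cap the ticket proposes; MESOS-1598 covers richer shaping if this proves insufficient.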
[jira] [Updated] (MESOS-1599) Slave configuration for network isolation
[ https://issues.apache.org/jira/browse/MESOS-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1599: - Description: Once the policing or shaping controls are available in the routing library, configuration options are required on the slave to enable them. (was: Once the ) Slave configuration for network isolation - Key: MESOS-1599 URL: https://issues.apache.org/jira/browse/MESOS-1599 Project: Mesos Issue Type: Task Reporter: Dominic Hamon Assignee: Jie Yu Once the policing or shaping controls are available in the routing library, configuration options are required on the slave to enable them. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1598) Add advanced shaping controls to routing library
[ https://issues.apache.org/jira/browse/MESOS-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1598: - Description: Only necessary if bandwidth cap using tc police action (MESOS-1597) is deemed not effective for network isolation. If this is necessary, we can use more complex shaping controls than just flat bandwidth caps to manage bandwidth isolation between containers. was:Only necessary if bandwidth cap using tc police action is deemed not effective for network isolation. Add advanced shaping controls to routing library Key: MESOS-1598 URL: https://issues.apache.org/jira/browse/MESOS-1598 Project: Mesos Issue Type: Task Reporter: Dominic Hamon Assignee: Jie Yu Only necessary if bandwidth cap using tc police action (MESOS-1597) is deemed not effective for network isolation. If this is necessary, we can use more complex shaping controls than just flat bandwidth caps to manage bandwidth isolation between containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1599) Slave configuration for network isolation
[ https://issues.apache.org/jira/browse/MESOS-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1599: - Description: Once the Slave configuration for network isolation - Key: MESOS-1599 URL: https://issues.apache.org/jira/browse/MESOS-1599 Project: Mesos Issue Type: Task Reporter: Dominic Hamon Assignee: Jie Yu Once the -- This message was sent by Atlassian JIRA (v6.2#6252)