[jira] [Commented] (MESOS-1582) Improve build time.
[ https://issues.apache.org/jira/browse/MESOS-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060776#comment-14060776 ] Benjamin Hindman commented on MESOS-1582: - Some initial suggestions that we can turn into their own JIRA issues (some of which already exist, so we can just include them within this epic):
* An include-what-you-use campaign that strips headers that are not used, possibly even using the include-what-you-use clang tool.
* Move implementations from .hpp to .cpp.
* Separate large .cpp files as necessary (to increase parallelism).
* Introduce more forward declarations in place of headers.
* Document the use and speedup of ccache.
Improve build time. --- Key: MESOS-1582 URL: https://issues.apache.org/jira/browse/MESOS-1582 Project: Mesos Issue Type: Epic Components: build Reporter: Benjamin Hindman The build takes a ridiculously long time unless you have a large, parallel machine. This is a combination of many factors, all of which we'd like to discuss and track here. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1582) Improve build time.
[ https://issues.apache.org/jira/browse/MESOS-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060819#comment-14060819 ] Timothy St. Clair commented on MESOS-1582: -- IMHO subsuming stout into libprocess and converting .hpp to .cpp would go a long way. Right now stout doesn't really live alone, but requires dependencies that exist under libprocess/3rdparty. Improve build time. --- Key: MESOS-1582 URL: https://issues.apache.org/jira/browse/MESOS-1582 Project: Mesos Issue Type: Epic Components: build Reporter: Benjamin Hindman The build takes a ridiculously long time unless you have a large, parallel machine. This is a combination of many factors, all of which we'd like to discuss and track here. I'd also love to actually track build times so we can get an appreciation of the improvements. Please leave a comment below with your build times! -- This message was sent by Atlassian JIRA (v6.2#6252)
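[Editor's note] The "forward declarations in place of headers" and ".hpp to .cpp" suggestions above can be sketched minimally as follows. All names here (Widget, Engine) are hypothetical and not from the Mesos tree; the point is that translation units including the "header" part no longer need the full Engine definition, cutting rebuild fan-out.

```cpp
#include <memory>

// "Header" part: forward-declare Engine instead of #include "engine.hpp".
// Consumers of Widget now recompile only when Widget itself changes.
class Engine;

class Widget {
public:
  Widget();
  ~Widget();  // defined out-of-line, where Engine is complete
  int power() const;

private:
  std::unique_ptr<Engine> engine_;  // a pointer member needs only the declaration
};

// ".cpp" part: the full Engine definition is needed only here.
class Engine {
public:
  int horsepower() const { return 42; }
};

Widget::Widget() : engine_(new Engine()) {}
Widget::~Widget() = default;
int Widget::power() const { return engine_->horsepower(); }
```

This is the classic Pimpl-style decoupling; combined with moving implementations out of .hpp files, it is what makes the parallel build cheaper.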
[jira] [Commented] (MESOS-1170) Update system check (glog)
[ https://issues.apache.org/jira/browse/MESOS-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14060840#comment-14060840 ] Timothy St. Clair commented on MESOS-1170: -- https://reviews.apache.org/r/23453/ Update system check (glog) -- Key: MESOS-1170 URL: https://issues.apache.org/jira/browse/MESOS-1170 Project: Mesos Issue Type: Bug Components: build Affects Versions: 0.19.0 Reporter: Timothy St. Clair Assignee: Timothy St. Clair Clean up glog detection to follow https://issues.apache.org/jira/browse/MESOS-1071 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1584) allow slave to proxy scheduler registration in service to master detection
[ https://issues.apache.org/jira/browse/MESOS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060915#comment-14060915 ] brian wickman commented on MESOS-1584: -- I don't think this is quite what I had in mind. It's my understanding that mesos-resolve does a single resolve and then exits. Compare to the mesos-slave, which does realtime master detection including failover detection, in addition to proxying framework messages on behalf of executors to whoever is leading at the time. It seems like we should have the analogue for schedulers as well. allow slave to proxy scheduler registration in service to master detection -- Key: MESOS-1584 URL: https://issues.apache.org/jira/browse/MESOS-1584 Project: Mesos Issue Type: Improvement Components: framework Reporter: brian wickman This is just an idea -- right now each pure language binding will need to implement a master detector (which in most cases means your language will need zookeeper bindings.) Instead of each binding needing to implement a master detector, perhaps the slave should support a 'proxy' mode whereby it acts as master detector on behalf of a dumber pure language binding. This could also help facilitate the launching of replica schedulers. Alternately this could be a separate binary but could just as easily live in the slave process. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1589) Support file system isolation and monitoring
Dominic Hamon created MESOS-1589: Summary: Support file system isolation and monitoring Key: MESOS-1589 URL: https://issues.apache.org/jira/browse/MESOS-1589 Project: Mesos Issue Type: Epic Reporter: Dominic Hamon Assignee: Ian Downes -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (MESOS-1587) Report disk usage from MesosContainerizer
[ https://issues.apache.org/jira/browse/MESOS-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Downes reassigned MESOS-1587: - Assignee: Ian Downes Report disk usage from MesosContainerizer - Key: MESOS-1587 URL: https://issues.apache.org/jira/browse/MESOS-1587 Project: Mesos Issue Type: Improvement Components: isolation Affects Versions: 0.20.0 Reporter: Ian Downes Assignee: Ian Downes We should report disk usage for the executor work directory from MesosContainerizer and include in the ResourceStatistics protobuf. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1027) IPv6 support
[ https://issues.apache.org/jira/browse/MESOS-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1027: - Issue Type: Epic (was: Improvement) IPv6 support Key: MESOS-1027 URL: https://issues.apache.org/jira/browse/MESOS-1027 Project: Mesos Issue Type: Epic Components: framework, libprocess, master, slave Reporter: Dominic Hamon Fix For: 1.0.0 From the CLI down through the various layers of tech we should support IPv6. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1584) allow slave to proxy scheduler registration in service to master detection
[ https://issues.apache.org/jira/browse/MESOS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14060979#comment-14060979 ] Benjamin Hindman commented on MESOS-1584: - Why provide the proxy in the mesos-slave instead of the mesos-master? Are you suggesting that the scheduler will be running on a slave and therefore it has easy access to the slave? What about providing a service like this on the master, and then giving each scheduler the list of all the masters instead of the list of all the ZooKeeper hosts? allow slave to proxy scheduler registration in service to master detection -- Key: MESOS-1584 URL: https://issues.apache.org/jira/browse/MESOS-1584 Project: Mesos Issue Type: Improvement Components: framework Reporter: brian wickman This is just an idea -- right now each pure language binding will need to implement a master detector (which in most cases means your language will need zookeeper bindings.) Instead of each binding needing to implement a master detector, perhaps the slave should support a 'proxy' mode whereby it acts as master detector on behalf of a dumber pure language binding. This could also help facilitate the launching of replica schedulers. Alternately this could be a separate binary but could just as easily live in the slave process. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MESOS-1001) registrar doesn't build on Linux/Clang
[ https://issues.apache.org/jira/browse/MESOS-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon resolved MESOS-1001. -- Resolution: Fixed Fix Version/s: 0.20.0 registrar doesn't build on Linux/Clang -- Key: MESOS-1001 URL: https://issues.apache.org/jira/browse/MESOS-1001 Project: Mesos Issue Type: Bug Components: build Affects Versions: 0.18.0 Environment: Ubuntu 13.10 clang Reporter: Vinod Kone Fix For: 0.20.0
libtool: compile: clang++ -DPACKAGE_NAME=\"mesos\" -DPACKAGE_TARNAME=\"mesos\" -DPACKAGE_VERSION=\"0.18.0\" -DPACKAGE_STRING=\"mesos 0.18.0\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DPACKAGE=\"mesos\" -DVERSION=\"0.18.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_PTHREAD=1 -DMESOS_HAS_JAVA=1 -DHAVE_PYTHON=\"2.7\" -DMESOS_HAS_PYTHON=1 -DHAVE_LIBZ=1 -DHAVE_LIBCURL=1 -DHAVE_LIBSASL2=1 -I. -Wall -Werror -DLIBDIR=\"/usr/local/lib\" -DPKGLIBEXECDIR=\"/usr/local/libexec/mesos\" -DPKGDATADIR=\"/usr/local/share/mesos\" -I../include -I../3rdparty/libprocess/include -I../3rdparty/libprocess/3rdparty/stout/include -I../include -I../3rdparty/libprocess/3rdparty/boost-1.53.0 -I../3rdparty/libprocess/3rdparty/protobuf-2.5.0/src -I../3rdparty/libprocess/3rdparty/glog-0.3.3/src -I../3rdparty/zookeeper-3.4.5/src/c/include -I../3rdparty/zookeeper-3.4.5/src/c/generated -pthread -DGTEST_USE_OWN_TR1_TUPLE=1 -g -g2 -O2 -std=c++11 -MT master/libmesos_no_3rdparty_la-registrar.lo -MD -MP -MF master/.deps/libmesos_no_3rdparty_la-registrar.Tpo -c master/registrar.cpp -fPIC -DPIC -o master/.libs/libmesos_no_3rdparty_la-registrar.o
In file included from master/registrar.cpp:34:
In file included from ./master/registrar.hpp:26:
./state/protobuf.hpp:124:10: error: calling a private constructor of class 'mesos::internal::state::protobuf::Variable<mesos::internal::registry::Slaves>'
    return Variable<T>(variable, t.get()); ^
./state/protobuf.hpp:111:41: note: in instantiation of function template specialization 'mesos::internal::state::protobuf::State::_fetch<mesos::internal::registry::Slaves>' requested here
    .then(lambda::bind(&State::template _fetch<T>, lambda::_1)); ^
master/registrar.cpp:191:12: note: in instantiation of function template specialization 'mesos::internal::state::protobuf::State::fetch<mesos::internal::registry::Slaves>' requested here
    state->fetch<registry::Slaves>("slaves") ^
./state/protobuf.hpp:62:3: note: declared private here
    Variable(const state::Variable& _variable, const T& _t) ^
./state/protobuf.hpp:132:57: error: 't' is a private member of 'mesos::internal::state::protobuf::Variable<mesos::internal::registry::Slaves>'
    Try<std::string> value = messages::serialize(variable.t); ^
master/registrar.cpp:333:14: note: in instantiation of function template specialization 'mesos::internal::state::protobuf::State::store<mesos::internal::registry::Slaves>' requested here
    state->store(variable).then(defer(self(), &Self::_update, lambda::_1)); ^
./state/protobuf.hpp:67:5: note: declared private here
    T t; ^
./state/protobuf.hpp:138:39: error: 'variable' is a private member of 'mesos::internal::state::protobuf::Variable<mesos::internal::registry::Slaves>'
    return state::State::store(variable.variable.mutate(value.get())) ^
./state/protobuf.hpp:66:19: note: declared private here
    state::Variable variable; // Not const to keep Variable assignable. ^
./state/protobuf.hpp:139:61: error: 't' is a private member of 'mesos::internal::state::protobuf::Variable<mesos::internal::registry::Slaves>'
    .then(lambda::bind(&State::template _store<T>, variable.t, lambda::_1)); ^
./state/protobuf.hpp:67:5: note: declared private here
    T t; ^
./state/protobuf.hpp:149:17: error: calling a private constructor of class 'mesos::internal::state::protobuf::Variable<mesos::internal::registry::Slaves>'
    return Some(Variable<T>(variable.get(), t)); ^
./state/protobuf.hpp:139:41: note: in instantiation of function template specialization 'mesos::internal::state::protobuf::State::_store<mesos::internal::registry::Slaves>' requested here
    .then(lambda::bind(&State::template _store<T>, variable.t, lambda::_1)); ^
master/registrar.cpp:333:14: note: in instantiation of function template specialization
[jira] [Updated] (MESOS-1027) IPv6 support
[ https://issues.apache.org/jira/browse/MESOS-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1027: - Epic Name: IPv6 Support IPv6 support Key: MESOS-1027 URL: https://issues.apache.org/jira/browse/MESOS-1027 Project: Mesos Issue Type: Epic Components: framework, libprocess, master, slave Reporter: Dominic Hamon Fix For: 1.0.0 From the CLI down through the various layers of tech we should support IPv6. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1580) Accept --isolation=external through a deprecation cycle.
[ https://issues.apache.org/jira/browse/MESOS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Hindman updated MESOS-1580: Description: The feature branch at github.com/mesos/mesos/tree/docker removes the --isolation=external option and forces people instead to do --containerizers=external to get the same thing. We should actually put --isolation=external through a deprecation cycle. Accept --isolation=external through a deprecation cycle. Key: MESOS-1580 URL: https://issues.apache.org/jira/browse/MESOS-1580 Project: Mesos Issue Type: Technical task Components: containerization, slave Reporter: Benjamin Hindman The feature branch at github.com/mesos/mesos/tree/docker removes the --isolation=external option and forces people instead to do --containerizers=external to get the same thing. We should actually put --isolation=external through a deprecation cycle. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1580) Accept --isolation=external through a deprecation cycle.
[ https://issues.apache.org/jira/browse/MESOS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14060996#comment-14060996 ] Benjamin Hindman commented on MESOS-1580: - I added a 'Description' [~tstclair]. ;-) Let me know if there are other details that would help. Accept --isolation=external through a deprecation cycle. Key: MESOS-1580 URL: https://issues.apache.org/jira/browse/MESOS-1580 Project: Mesos Issue Type: Technical task Components: containerization, slave Reporter: Benjamin Hindman The feature branch at github.com/mesos/mesos/tree/docker removes the --isolation=external option and forces people instead to do --containerizers=external to get the same thing. We should actually put --isolation=external through a deprecation cycle. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (MESOS-1529) Handle a network partition between Master and Slave
[ https://issues.apache.org/jira/browse/MESOS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert reassigned MESOS-1529: Assignee: Benjamin Mahler Handle a network partition between Master and Slave --- Key: MESOS-1529 URL: https://issues.apache.org/jira/browse/MESOS-1529 Project: Mesos Issue Type: Bug Reporter: Dominic Hamon Assignee: Benjamin Mahler If a network partition occurs between a Master and Slave, the Master will remove the Slave (as it fails health checks) and mark the tasks being run there as LOST. However, the Slave is not aware that it has been removed, so the tasks will continue to run. (To clarify a little bit: neither the master nor the slave receives an 'exited' event, indicating that the connection between the master and slave is not closed). There are at least two possible approaches to solving this issue: 1. Introduce a health check from Slave to Master so they have a consistent view of a network partition. We may still see this issue should a one-way connection error occur. 2. Be less aggressive about marking tasks and Slaves as lost. Wait until the Slave reappears and reconcile then. We'd still need to mark Slaves and tasks as potentially lost (zombie state) but maybe the Scheduler can make a more intelligent decision. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1525) Don't require slave id for reconciliation requests.
[ https://issues.apache.org/jira/browse/MESOS-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert updated MESOS-1525: - Sprint: Q3 Sprint 1 Don't require slave id for reconciliation requests. --- Key: MESOS-1525 URL: https://issues.apache.org/jira/browse/MESOS-1525 Project: Mesos Issue Type: Improvement Affects Versions: 0.19.0 Reporter: Benjamin Mahler Reconciliation requests currently specify a list of TaskStatuses. SlaveID is optional inside TaskStatus but reconciliation requests are dropped when the SlaveID is not specified. We can answer reconciliation requests for a task so long as there are no transient slaves, this is what we should do when the slave id is not specified. There's an open question around whether we want the Reconcile Event to specify TaskID/SlaveID instead of TaskStatus, but I'll save that for later. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1119) Allocator should make an allocation decision per slave instead of per framework/role.
[ https://issues.apache.org/jira/browse/MESOS-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert updated MESOS-1119: - Sprint: Q3 Sprint 1 Allocator should make an allocation decision per slave instead of per framework/role. - Key: MESOS-1119 URL: https://issues.apache.org/jira/browse/MESOS-1119 Project: Mesos Issue Type: Bug Components: allocation Reporter: Vinod Kone Assignee: Vinod Kone Currently the Allocator::allocate() code loops through roles and frameworks (based on DRF sort) and allocates *all* slaves' resources to the first framework. This logic should be inverted: instead, the allocator should go through each slave, allocate it to a role/framework, and update the DRF shares. -- This message was sent by Atlassian JIRA (v6.2#6252)
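[Editor's note] The inverted loop described in the ticket can be caricatured as follows. This is a toy sketch with a single scalar resource and hypothetical names, not the real allocator: iterate over slaves, hand each slave to the framework with the lowest current share, and update shares after every decision.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// A slave offering a single scalar resource (cpus) for simplicity.
struct Slave {
  std::string id;
  double cpus;
};

// Per-slave allocation: for each slave, pick the framework with the lowest
// current share, allocate the slave to it, then update that share so the
// next slave's decision reflects it. Returns slave -> framework.
std::map<std::string, std::string> allocatePerSlave(
    const std::vector<Slave>& slaves,
    std::map<std::string, double>& shares,  // framework -> current share
    double totalCpus) {
  std::map<std::string, std::string> allocation;
  for (const Slave& slave : slaves) {
    auto least = std::min_element(
        shares.begin(), shares.end(),
        [](const auto& a, const auto& b) { return a.second < b.second; });
    allocation[slave.id] = least->first;
    least->second += slave.cpus / totalCpus;  // update after each decision
  }
  return allocation;
}
```

The contrast with the behavior the ticket criticizes: the old loop sorts frameworks once and gives the first one everything, whereas here shares are recomputed between per-slave decisions.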
[jira] [Assigned] (MESOS-1579) Add Sailthru to the Powered By Mesos page
[ https://issues.apache.org/jira/browse/MESOS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert reassigned MESOS-1579: Assignee: Dave Lester Add Sailthru to the Powered By Mesos page - Key: MESOS-1579 URL: https://issues.apache.org/jira/browse/MESOS-1579 Project: Mesos Issue Type: Wish Components: documentation Reporter: Alex Gaudio Assignee: Dave Lester Priority: Trivial Original Estimate: 0h Remaining Estimate: 0h Hello! We recently started using Mesos at Sailthru and love it! We'd love to add our organization to the Powered By Mesos page, and I created a GitHub PR to that effect. We'd love if you could merge it :) https://github.com/apache/mesos/pull/21 Alex (+ Sailthru's Data Science team) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1196) create annotated tag for v0.19.0
[ https://issues.apache.org/jira/browse/MESOS-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061005#comment-14061005 ] Dominic Hamon commented on MESOS-1196: -- It's not clear to me what you're looking for here. Does this mean we'd have two tags for a release - one at the start and one at the end? create annotated tag for v0.19.0 Key: MESOS-1196 URL: https://issues.apache.org/jira/browse/MESOS-1196 Project: Mesos Issue Type: Task Components: release Reporter: Bhuvan Arumugam To facilitate setting up CI for mesos repository, we should create annotated tag at the beginning of each release. This is follow up to http://www.mail-archive.com/dev@mesos.apache.org/msg10915.html Can you, a) create one based on this hash 99985d27857fb5a10b26ded8da1a36100780d18b, wherein master was pointed to 0.19.0 release? b) document the step to create annotated tag at beginning of every release c) document the step to create lightweight tag for every RC release -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1567) Add logging of the user uid when receiving SIGTERM.
[ https://issues.apache.org/jira/browse/MESOS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert updated MESOS-1567: - Sprint: Q3 Sprint 1 Add logging of the user uid when receiving SIGTERM. --- Key: MESOS-1567 URL: https://issues.apache.org/jira/browse/MESOS-1567 Project: Mesos Issue Type: Improvement Components: master, slave Reporter: Benjamin Mahler Assignee: Alexandra Sava We currently do not log the user pid when receiving a SIGTERM, which makes debugging a bit difficult. It's easy to get this information through sigaction. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (MESOS-1567) Add logging of the user pid when receiving SIGTERM.
[ https://issues.apache.org/jira/browse/MESOS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert reassigned MESOS-1567: Assignee: Alexandra Sava Add logging of the user pid when receiving SIGTERM. --- Key: MESOS-1567 URL: https://issues.apache.org/jira/browse/MESOS-1567 Project: Mesos Issue Type: Improvement Components: master, slave Reporter: Benjamin Mahler Assignee: Alexandra Sava We currently do not log the user pid when receiving a SIGTERM, which makes debugging a bit difficult. It's easy to get this information through sigaction. -- This message was sent by Atlassian JIRA (v6.2#6252)
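[Editor's note] The sigaction approach the ticket mentions, sketched minimally (not the Mesos patch itself): installing the handler with SA_SIGINFO makes the kernel pass a siginfo_t whose si_uid (and si_pid) identify the sender of the SIGTERM.

```cpp
#include <csignal>
#include <unistd.h>

// Record the sender's uid when SIGTERM arrives. Only async-signal-safe
// operations are allowed in a handler, so we just stash the value; a real
// daemon would log "Received SIGTERM from uid N" outside the handler.
static volatile sig_atomic_t senderUid = -1;

static void handleSigterm(int /*sig*/, siginfo_t* info, void* /*ctx*/) {
  senderUid = info->si_uid;  // si_pid is available the same way
}

void installSigtermHandler() {
  struct sigaction sa = {};
  sa.sa_sigaction = handleSigterm;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGTERM, &sa, nullptr);
}
```

For signals sent with kill(2), si_code is SI_USER and si_uid is the real uid of the sending process, which is exactly the debugging breadcrumb the ticket asks for.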
[jira] [Commented] (MESOS-1576) Add Go bindings to Mesos.
[ https://issues.apache.org/jira/browse/MESOS-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061009#comment-14061009 ] Niklas Quarfot Nielsen commented on MESOS-1576: --- Sure - https://github.com/mesos/mesos-go Add Go bindings to Mesos. - Key: MESOS-1576 URL: https://issues.apache.org/jira/browse/MESOS-1576 Project: Mesos Issue Type: Improvement Components: build Affects Versions: 0.19.0 Reporter: Timothy St. Clair From [~benjaminhindman]: I know that Niklas has some go bindings (backed by libmesos) here and Vladimir Vivien has some _native_ go bindings (no need for libmesos) here that could be used to help accomplish this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1316) Implement decent unit test coverage for the mesos-fetcher tool
[ https://issues.apache.org/jira/browse/MESOS-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert updated MESOS-1316: - Sprint: Q3 Sprint 1 Implement decent unit test coverage for the mesos-fetcher tool -- Key: MESOS-1316 URL: https://issues.apache.org/jira/browse/MESOS-1316 Project: Mesos Issue Type: Improvement Reporter: Tom Arnfeld Assignee: Tom Arnfeld There are currently no tests that cover the {{mesos-fetcher}} tool itself, and hence bugs like MESOS-1313 have accidentally slipped through. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (MESOS-1316) Implement decent unit test coverage for the mesos-fetcher tool
[ https://issues.apache.org/jira/browse/MESOS-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert reassigned MESOS-1316: Assignee: Benjamin Hindman (was: Tom Arnfeld) Ben has offered to reinstate the tests. Implement decent unit test coverage for the mesos-fetcher tool -- Key: MESOS-1316 URL: https://issues.apache.org/jira/browse/MESOS-1316 Project: Mesos Issue Type: Improvement Reporter: Tom Arnfeld Assignee: Benjamin Hindman There are currently no tests that cover the {{mesos-fetcher}} tool itself, and hence bugs like MESOS-1313 have accidentally slipped through. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-752) SlaveRecoveryTest/0.ReconcileTasksMissingFromSlave test is flaky
[ https://issues.apache.org/jira/browse/MESOS-752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert updated MESOS-752: Sprint: Q3 Sprint 1 SlaveRecoveryTest/0.ReconcileTasksMissingFromSlave test is flaky Key: MESOS-752 URL: https://issues.apache.org/jira/browse/MESOS-752 Project: Mesos Issue Type: Bug Components: test Environment: centos6 Reporter: Vinod Kone Assignee: Vinod Kone [ RUN ] SlaveRecoveryTest/0.ReconcileTasksMissingFromSlave Checkpointing executor's forked pid 32281 to '/tmp/SlaveRecoveryTest_0_ReconcileTasksMissingFromSlave_NT1btb/meta/slaves/201310151913-16777343-35153-31491-0/frameworks/201310151913-16777343-35153-31491-/executors/0514b52f-3c17-4ee5-ba16-635198701ca2/runs/97c9e2cc-ceea-40a8-a915-aed5fed1dcb3/pids/forked.pid' Fetching resources into '/tmp/SlaveRecoveryTest_0_ReconcileTasksMissingFromSlave_NT1btb/slaves/201310151913-16777343-35153-31491-0/frameworks/201310151913-16777343-35153-31491-/executors/0514b52f-3c17-4ee5-ba16-635198701ca2/runs/97c9e2cc-ceea-40a8-a915-aed5fed1dcb3' Registered executor on localhost.localdomain Starting task 0514b52f-3c17-4ee5-ba16-635198701ca2 Forked command at 32317 sh -c 'sleep 10' tests/slave_recovery_tests.cpp:1927: Failure Mock function called more times than expected - returning directly. Function call: statusUpdate(0x7fffae636eb0, @0x7f1590027a00 64-byte object F0-2F D0-A1 15-7F 00-00 00-00 00-00 00-00 00-00 40-E9 01-90 15-7F 00-00 20-6B 03-90 15-7F 00-00 48-91 C3-00 00-00 00-00 B0-3B 01-90 15-7F 00-00 05-00 00-00 00-00 00-00 17-00 00-00 00-00 00-00) Expected: to be called once Actual: called twice - over-saturated and active Command exited with status 0 (pid: 32317) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-976) SlaveRecoveryTest/1.SchedulerFailover is flaky
[ https://issues.apache.org/jira/browse/MESOS-976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert updated MESOS-976: Sprint: Q3 Sprint 1 SlaveRecoveryTest/1.SchedulerFailover is flaky -- Key: MESOS-976 URL: https://issues.apache.org/jira/browse/MESOS-976 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.18.0 Reporter: Vinod Kone Assignee: Ian Downes [==] Running 1 test from 1 test case. [--] Global test environment set-up. [--] 1 test from SlaveRecoveryTest/1, where TypeParam = mesos::internal::slave::CgroupsIsolator [ RUN ] SlaveRecoveryTest/1.SchedulerFailover I0206 20:18:31.525116 56447 master.cpp:239] Master ID: 2014-02-06-20:18:31-1740121354-55566-56447 Hostname: smfd-bkq-03-sr4.devel.twitter.com I0206 20:18:31.525295 56481 master.cpp:321] Master started on 10.37.184.103:55566 I0206 20:18:31.525315 56481 master.cpp:324] Master only allowing authenticated frameworks to register! I0206 20:18:31.527093 56481 master.cpp:756] The newly elected leader is master@10.37.184.103:55566 I0206 20:18:31.527122 56481 master.cpp:764] Elected as the leading master! 
I0206 20:18:31.530642 56473 slave.cpp:112] Slave started on 9)@10.37.184.103:55566 I0206 20:18:31.530802 56473 slave.cpp:212] Slave resources: cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] I0206 20:18:31.531203 56473 slave.cpp:240] Slave hostname: smfd-bkq-03-sr4.devel.twitter.com I0206 20:18:31.531221 56473 slave.cpp:241] Slave checkpoint: true I0206 20:18:31.531991 56482 cgroups_isolator.cpp:225] Using /tmp/mesos_test_cgroup as cgroups hierarchy root I0206 20:18:31.532470 56478 state.cpp:33] Recovering state from '/tmp/SlaveRecoveryTest_1_SchedulerFailover_7dC2N1/meta' I0206 20:18:31.532698 56469 status_update_manager.cpp:188] Recovering status update manager I0206 20:18:31.533962 56472 sched.cpp:265] Authenticating with master master@10.37.184.103:55566 I0206 20:18:31.534102 56472 sched.cpp:234] Detecting new master I0206 20:18:31.534124 56484 authenticatee.hpp:124] Creating new client SASL connection I0206 20:18:31.534299 56473 master.cpp:2317] Authenticating framework at scheduler(9)@10.37.184.103:55566 I0206 20:18:31.534459 56461 authenticator.hpp:140] Creating new server SASL connection I0206 20:18:31.534572 56466 authenticatee.hpp:212] Received SASL authentication mechanisms: CRAM-MD5 I0206 20:18:31.534595 56466 authenticatee.hpp:238] Attempting to authenticate with mechanism 'CRAM-MD5' I0206 20:18:31.534667 56474 authenticator.hpp:243] Received SASL authentication start I0206 20:18:31.534732 56474 authenticator.hpp:325] Authentication requires more steps I0206 20:18:31.534814 56468 authenticatee.hpp:258] Received SASL authentication step I0206 20:18:31.534946 56466 authenticator.hpp:271] Received SASL authentication step I0206 20:18:31.535007 56466 authenticator.hpp:317] Authentication success I0206 20:18:31.535084 56471 authenticatee.hpp:298] Authentication success I0206 20:18:31.535107 56461 master.cpp:2357] Successfully authenticated framework at scheduler(9)@10.37.184.103:55566 I0206 20:18:31.535392 56476 sched.cpp:339] Successfully 
authenticated with master master@10.37.184.103:55566 I0206 20:18:31.535512 56465 master.cpp:812] Received registration request from scheduler(9)@10.37.184.103:55566 I0206 20:18:31.535570 56465 master.cpp:830] Registering framework 2014-02-06-20:18:31-1740121354-55566-56447- at scheduler(9)@10.37.184.103:55566 I0206 20:18:31.535856 56465 hierarchical_allocator_process.hpp:332] Added framework 2014-02-06-20:18:31-1740121354-55566-56447- I0206 20:18:31.537802 56482 cgroups_isolator.cpp:840] Recovering isolator I0206 20:18:31.538462 56472 slave.cpp:2760] Finished recovery I0206 20:18:31.538910 56472 slave.cpp:508] New master detected at master@10.37.184.103:55566 I0206 20:18:31.539036 56478 status_update_manager.cpp:162] New master detected at master@10.37.184.103:55566 I0206 20:18:31.539223 56464 master.cpp:1834] Attempting to register slave on smfd-bkq-03-sr4.devel.twitter.com at slave(9)@10.37.184.103:55566 I0206 20:18:31.539271 56472 slave.cpp:533] Detecting new master I0206 20:18:31.539330 56464 master.cpp:2804] Adding slave 2014-02-06-20:18:31-1740121354-55566-56447-0 at smfd-bkq-03-sr4.devel.twitter.com with cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] I0206 20:18:31.539454 56472 slave.cpp:551] Registered with master master@10.37.184.103:55566; given slave ID 2014-02-06-20:18:31-1740121354-55566-56447-0 I0206 20:18:31.539620 56472 slave.cpp:564] Checkpointing SlaveInfo to '/tmp/SlaveRecoveryTest_1_SchedulerFailover_7dC2N1/meta/slaves/2014-02-06-20:18:31-1740121354-55566-56447-0/slave.info'
[jira] [Updated] (MESOS-1527) Choose containerizer at runtime
[ https://issues.apache.org/jira/browse/MESOS-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert updated MESOS-1527: - Sprint: Q3 Sprint 1 Choose containerizer at runtime --- Key: MESOS-1527 URL: https://issues.apache.org/jira/browse/MESOS-1527 Project: Mesos Issue Type: Improvement Components: containerization Reporter: Jay Buffington Currently you have to choose the containerizer at mesos-slave start time via the --isolation option. I'd like to be able to specify the containerizer in the request to launch the job. This could be specified by a new Provider field in the ContainerInfo proto buf. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1580) Accept --isolation=external through a deprecation cycle.
[ https://issues.apache.org/jira/browse/MESOS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert updated MESOS-1580: - Issue Type: Task (was: Technical task) Parent: (was: MESOS-1527) Accept --isolation=external through a deprecation cycle. Key: MESOS-1580 URL: https://issues.apache.org/jira/browse/MESOS-1580 Project: Mesos Issue Type: Task Components: containerization, slave Reporter: Benjamin Hindman The feature branch at github.com/mesos/mesos/tree/docker removes the --isolation=external option and forces people instead to do --containerizers=external to get the same thing. We should actually put --isolation=external through a deprecation cycle. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1219) Master should generate new id for frameworks that reconnect after failover timeout
[ https://issues.apache.org/jira/browse/MESOS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-1219: -- Sprint: Q3 Sprint 1 Assignee: Vinod Kone Master should generate new id for frameworks that reconnect after failover timeout -- Key: MESOS-1219 URL: https://issues.apache.org/jira/browse/MESOS-1219 Project: Mesos Issue Type: Bug Components: master, webui Reporter: Robert Lacroix Assignee: Vinod Kone When a scheduler reconnects after the failover timeout has been exceeded, the framework id is usually reused because the scheduler doesn't know that the timeout was exceeded and it is actually handled as a new framework. The /framework/:framework_id route of the Web UI doesn't handle those cases very well because its key is reused. It only shows the terminated one. Would it make sense to ignore the provided framework id when a scheduler reconnects to a terminated framework and generate a new id to make sure it's unique? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1566) Support private docker registry.
[ https://issues.apache.org/jira/browse/MESOS-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061064#comment-14061064 ] Timothy St. Clair commented on MESOS-1566: -- We should really link all the Docker JIRAs together. Support private docker registry. Key: MESOS-1566 URL: https://issues.apache.org/jira/browse/MESOS-1566 Project: Mesos Issue Type: Task Reporter: Timothy Chen Need to support Docker launching images hosted in private registry service, which requires docker login. Can consider utilizing .dockercfg file for providing credentials. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1588) Enforce disk quota in MesosContainerizer
[ https://issues.apache.org/jira/browse/MESOS-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061067#comment-14061067 ] Ian Downes commented on MESOS-1588: --- Can you please elaborate on what you mean by machine policy? Disk is a first-class resource that should be enforced, both to protect the host and to protect other tasks running on the host, i.e., a task should *not* be able to spew out logs and affect others; if a task requests XX GB, that's all it should get. Enforcement could be via ENOSPC if a separate filesystem could be used, but this solution is not always available. Many applications also don't handle ENOSPC well and it's generally safer to just terminate the container. I'm proposing a cycle: a release with enforcement defaulting to false, keeping existing behavior. A subsequent release would default to true. Enforce disk quota in MesosContainerizer Key: MESOS-1588 URL: https://issues.apache.org/jira/browse/MESOS-1588 Project: Mesos Issue Type: Improvement Affects Versions: 0.20.0 Reporter: Ian Downes Assignee: Ian Downes Once we have disk usage we should enforce this. Containers that exceed their quota should be terminated, i.e., the filesystem isolator should set a Limitation so the MesosContainerizer kills the container. Disk quota enforcement should be optional to permit a transition period where disk usage is monitored before enabling enforcement. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1588) Enforce disk quota in MesosContainerizer
[ https://issues.apache.org/jira/browse/MESOS-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061130#comment-14061130 ] Timothy St. Clair commented on MESOS-1588: -- Machine policy means defining a set of rules for machine(s). This idea doesn't exist as a formalism inside of Mesos, but it certainly does in other grid systems. In other grid systems you can define KILL policy expressions, such that there is no harm, no foul to a point. For example, if a machine has a 1TB drive, and a task goes over by 1GB, should that task get booted? Strict enforcement says yes, but presumes that users accurately outline how much disk their task will consume, which I assert is a really bad idea. This problem was the root reason why we designed hunting policies in Condor that used job history. We allowed users to go over to a point defined by the policy expression, and updated the jobAd to more accurately reflect how much resource was being used, so subsequent jobs would land appropriately. IMHO strict enforcement should be an *optional* parameter only. Enforce disk quota in MesosContainerizer Key: MESOS-1588 URL: https://issues.apache.org/jira/browse/MESOS-1588 Project: Mesos Issue Type: Improvement Affects Versions: 0.20.0 Reporter: Ian Downes Assignee: Ian Downes Once we have disk usage we should enforce this. Containers that exceed their quota should be terminated, i.e., the filesystem isolator should set a Limitation so the MesosContainerizer kills the container. Disk quota enforcement should be optional to permit a transition period where disk usage is monitored before enabling enforcement. -- This message was sent by Atlassian JIRA (v6.2#6252)
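The Condor-style soft policy described above might look like this in outline. The function names and the 10% overage tolerance are illustrative; real Condor policy expressions are ClassAds, not Python.

```python
def should_kill(requested, observed, overage_fraction=0.10):
    """Soft policy: kill only when observed usage exceeds the request
    by more than the allowed overage ("no harm, no foul to a point")."""
    return observed > requested * (1.0 + overage_fraction)

def updated_request(requested, observed):
    """Feed observed usage back into the next request, as Condor did by
    updating the jobAd, so subsequent jobs land appropriately."""
    return max(requested, observed)
```

With this shape, strict enforcement is just `overage_fraction=0.0`, which is consistent with keeping it an optional parameter.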
[jira] [Updated] (MESOS-987) Wire up a code coverage tool
[ https://issues.apache.org/jira/browse/MESOS-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-987: Component/s: technical debt Wire up a code coverage tool Key: MESOS-987 URL: https://issues.apache.org/jira/browse/MESOS-987 Project: Mesos Issue Type: Improvement Components: technical debt Reporter: Vinod Kone Assignee: Dominic Hamon Some options are gcov (works only with gcc afaict) and optionally lcov. It would be nice to hook this up with Jenkins too. http://meekrosoft.wordpress.com/2010/06/02/continuous-code-coverage-with-gcc-googletest-and-hudson/ -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1316) Implement decent unit test coverage for the mesos-fetcher tool
[ https://issues.apache.org/jira/browse/MESOS-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1316: - Component/s: test technical debt Implement decent unit test coverage for the mesos-fetcher tool -- Key: MESOS-1316 URL: https://issues.apache.org/jira/browse/MESOS-1316 Project: Mesos Issue Type: Improvement Components: technical debt, test Reporter: Tom Arnfeld Assignee: Benjamin Hindman There are currently no tests that cover the {{mesos-fetcher}} tool itself, and hence bugs like MESOS-1313 have accidentally slipped through. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1583) Clang tool build improvement include what you use
[ https://issues.apache.org/jira/browse/MESOS-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1583: - Component/s: technical debt Clang tool build improvement include what you use --- Key: MESOS-1583 URL: https://issues.apache.org/jira/browse/MESOS-1583 Project: Mesos Issue Type: Improvement Components: technical debt Reporter: Isabel Jimenez Assignee: Isabel Jimenez -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1316) Implement decent unit test coverage for the mesos-fetcher tool
[ https://issues.apache.org/jira/browse/MESOS-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061176#comment-14061176 ] Tom Arnfeld commented on MESOS-1316: Awesome, thanks Ben! Apologies that I've not had the time to look at this after I said I would. Implement decent unit test coverage for the mesos-fetcher tool -- Key: MESOS-1316 URL: https://issues.apache.org/jira/browse/MESOS-1316 Project: Mesos Issue Type: Improvement Components: technical debt, test Reporter: Tom Arnfeld Assignee: Benjamin Hindman There are currently no tests that cover the {{mesos-fetcher}} tool itself, and hence bugs like MESOS-1313 have accidentally slipped through. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1574) what to do when a rogue process binds to a port mesos didn't allocate to it?
[ https://issues.apache.org/jira/browse/MESOS-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061219#comment-14061219 ] Jie Yu commented on MESOS-1574: --- The system admin could also set ip_local_port_range to prevent a rogue process from binding to a mesos reserved port: echo xxx > /proc/sys/net/ipv4/ip_local_port_range what to do when a rogue process binds to a port mesos didn't allocate to it? Key: MESOS-1574 URL: https://issues.apache.org/jira/browse/MESOS-1574 Project: Mesos Issue Type: Improvement Components: allocation, isolation Reporter: Jay Buffington Priority: Minor I recently had an issue where a slave had a process whose parent was init that was bound to a port in the range that mesos thought was a free resource. I'm not sure if this is due to a bug in mesos (it lost track of this process during an upgrade?) or if there was a bad user who started a process on the host manually outside of mesos. The process is over a month old and I have no history in mesos to ask it if/when it launched the task :( If a rogue process binds to a port that mesos-slave has offered to the master as an available resource there should be some sort of reckoning. Mesos could: * kill the rogue process * rescind the offer for that port * have an api that can be plugged into a monitoring system to alert humans of this inconsistency -- This message was sent by Atlassian JIRA (v6.2#6252)
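For the monitoring-alert option listed in the report, one way to detect the inconsistency is to cross-check ports bound on the host against the port ranges the slave offered. A sketch, with `port_conflicts` as an illustrative helper rather than an existing Mesos API:

```python
def port_conflicts(offered_ranges, bound_ports):
    """Report bound ports that fall inside ranges the slave offered.

    offered_ranges: iterable of (lo, hi) pairs, inclusive, e.g. the
                    "ports" resource ranges from the slave's offers.
    bound_ports:    ports observed in use on the host (e.g. parsed
                    from ss/netstat output).
    """
    conflicts = []
    for port in bound_ports:
        if any(lo <= port <= hi for lo, hi in offered_ranges):
            conflicts.append(port)
    return sorted(conflicts)
```

A monitoring system could run this periodically and alert a human whenever the result is non-empty.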
[jira] [Updated] (MESOS-1574) what to do when a rogue process binds to a port mesos didn't allocate to it?
[ https://issues.apache.org/jira/browse/MESOS-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1574: --- Component/s: isolation what to do when a rogue process binds to a port mesos didn't allocate to it? Key: MESOS-1574 URL: https://issues.apache.org/jira/browse/MESOS-1574 Project: Mesos Issue Type: Improvement Components: allocation, isolation Reporter: Jay Buffington Priority: Minor I recently had an issue where a slave had a process whose parent was init that was bound to a port in the range that mesos thought was a free resource. I'm not sure if this is due to a bug in mesos (it lost track of this process during an upgrade?) or if there was a bad user who started a process on the host manually outside of mesos. The process is over a month old and I have no history in mesos to ask it if/when it launched the task :( If a rogue process binds to a port that mesos-slave has offered to the master as an available resource there should be some sort of reckoning. Mesos could: * kill the rogue process * rescind the offer for that port * have an api that can be plugged into a monitoring system to alert humans of this inconsistency -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1538) A container destruction in the middle of a launch leads to CHECK failure.
[ https://issues.apache.org/jira/browse/MESOS-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1538: --- Summary: A container destruction in the middle of a launch leads to CHECK failure. (was: A container destruction in the middle of a launch leads to CHECK failure) A container destruction in the middle of a launch leads to CHECK failure. - Key: MESOS-1538 URL: https://issues.apache.org/jira/browse/MESOS-1538 Project: Mesos Issue Type: Bug Reporter: Vinod Kone Assignee: Ian Downes Fix For: 0.19.1 There is a race between the destroy() and exec() in the containerizer process, when the destroy is called in the middle of the launch. In particular if the destroy is completed and the container removed from 'promises' map before 'exec()' was called, CHECK failure happens. The fix is to return a Failure instead of doing a CHECK in 'exec()'. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (MESOS-1574) what to do when a rogue process binds to a port mesos didn't allocate to it?
[ https://issues.apache.org/jira/browse/MESOS-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061243#comment-14061243 ] Ian Downes edited comment on MESOS-1574 at 7/14/14 9:05 PM: ip_local_port_range sets the range for local ports when opening outgoing connections; it does not restrict processes from binding to ports inside that range. [~jaybuff] are you using a cgroups isolator? If so, you can check if the process' cgroup is managed by mesos, implying it's a descendant of a terminated mesos-slave: {noformat} $ cat /proc/$pid/cgroup 4:memory:/sys/fs/cgroup/memory/mesos/XXX 3:freezer:/sys/fs/cgroup/freezer/mesos/XXX 2:cpuacct:/sys/fs/cgroup/cpuacct/mesos/XXX 1:cpu:/sys/fs/cgroup/cpu/mesos/XXX {noformat} was (Author: idownes): ip_local_port_range sets the range for local ports when opening outgoing connections; it does not restrict processes from binding to ports inside that range. [~jaybuff] are you using a cgroups isolator? If so, you can check if the process' cgroup is managed by mesos, implying it's a descendant of a terminated mesos-slave: $ cat /proc/$pid/cgroup 4:memory:/sys/fs/cgroup/memory/mesos/XXX 3:freezer:/sys/fs/cgroup/freezer/mesos/XXX 2:cpuacct:/sys/fs/cgroup/cpuacct/mesos/XXX 1:cpu:/sys/fs/cgroup/cpu/mesos/XXX what to do when a rogue process binds to a port mesos didn't allocate to it? Key: MESOS-1574 URL: https://issues.apache.org/jira/browse/MESOS-1574 Project: Mesos Issue Type: Improvement Components: allocation, isolation Reporter: Jay Buffington Priority: Minor I recently had an issue where a slave had a process whose parent was init that was bound to a port in the range that mesos thought was a free resource. I'm not sure if this is due to a bug in mesos (it lost track of this process during an upgrade?) or if there was a bad user who started a process on the host manually outside of mesos. 
The process is over a month old and I have no history in mesos to ask it if/when it launched the task :( If a rogue process binds to a port that mesos-slave has offered to the master as an available resource there should be some sort of reckoning. Mesos could: * kill the rogue process * rescind the offer for that port * have an api that can be plugged into a monitoring system to alert humans of this inconsistency -- This message was sent by Atlassian JIRA (v6.2#6252)
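The cgroup check from Ian's comment can be automated along these lines. `managed_by_mesos` is an illustrative helper, not a Mesos API; it matches the `/proc/<pid>/cgroup` layout quoted in the comment above.

```python
def managed_by_mesos(cgroup_text):
    """Given the text of /proc/<pid>/cgroup, report whether any hierarchy
    places the process under a mesos-managed cgroup (i.e. a path with a
    /mesos/ component), implying it descends from a mesos-slave."""
    for line in cgroup_text.splitlines():
        parts = line.split(":", 2)   # hierarchy-id : subsystems : path
        # Append "/" so a path ending in ".../mesos" also matches.
        if len(parts) == 3 and "/mesos/" in parts[2] + "/":
            return True
    return False
```

A process started by hand on the host (the "bad user" case) would normally sit in the root cgroup and fail this check.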
[jira] [Updated] (MESOS-1567) Add logging of the user uid when receiving SIGTERM.
[ https://issues.apache.org/jira/browse/MESOS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1567: --- Description: We currently do not log the user id when receiving a SIGTERM, this makes debugging a bit difficult. It's easy to get this information through sigaction. (was: We currently do not log the user pid when receiving a SIGTERM, this makes debugging a bit difficult. It's easy to get this information through sigaction.) Add logging of the user uid when receiving SIGTERM. --- Key: MESOS-1567 URL: https://issues.apache.org/jira/browse/MESOS-1567 Project: Mesos Issue Type: Improvement Components: master, slave Reporter: Benjamin Mahler Assignee: Alexandra Sava We currently do not log the user id when receiving a SIGTERM, this makes debugging a bit difficult. It's easy to get this information through sigaction. -- This message was sent by Atlassian JIRA (v6.2#6252)
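For illustration, the sender's uid is carried in the signal's siginfo: in C this means installing the handler via sigaction() with SA_SIGINFO and reading si_uid. A POSIX-only Python sketch of the same idea (not Mesos's C++ handler) uses sigwaitinfo to receive the signal synchronously:

```python
import os
import signal

# Block SIGTERM, then receive it synchronously with sigwaitinfo(), whose
# siginfo carries the sender's credentials -- the same si_uid/si_pid a C
# handler installed via sigaction() with SA_SIGINFO would see.
signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
os.kill(os.getpid(), signal.SIGTERM)         # send ourselves a SIGTERM
info = signal.sigwaitinfo({signal.SIGTERM})
print("received SIGTERM from uid %d (pid %d)" % (info.si_uid, info.si_pid))
```

Logging that uid/pid pair at shutdown is exactly the debugging breadcrumb this ticket asks for.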
[jira] [Updated] (MESOS-1219) Master should disallow frameworks that reconnect after failover timeout
[ https://issues.apache.org/jira/browse/MESOS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-1219: -- Summary: Master should disallow frameworks that reconnect after failover timeout (was: Master should generate new id for frameworks that reconnect after failover timeout) Master should disallow frameworks that reconnect after failover timeout --- Key: MESOS-1219 URL: https://issues.apache.org/jira/browse/MESOS-1219 Project: Mesos Issue Type: Bug Components: master, webui Reporter: Robert Lacroix Assignee: Vinod Kone When a scheduler reconnects after the failover timeout has been exceeded, the framework id is usually reused because the scheduler doesn't know that the timeout was exceeded, and it is actually handled as a new framework. The /framework/:framework_id route of the Web UI doesn't handle those cases very well because its key is reused. It only shows the terminated one. Would it make sense to ignore the provided framework id when a scheduler reconnects to a terminated framework and generate a new id to make sure it's unique? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1576) Add Go bindings to Mesos.
[ https://issues.apache.org/jira/browse/MESOS-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061409#comment-14061409 ] Dominic Hamon commented on MESOS-1576: -- Regarding ZK: https://godoc.org/github.com/samuel/go-zookeeper/ or https://godoc.org/launchpad.net/gozk/zookeeper might be worth a look. Disclaimer: I haven't looked at them in any depth. Add Go bindings to Mesos. - Key: MESOS-1576 URL: https://issues.apache.org/jira/browse/MESOS-1576 Project: Mesos Issue Type: Improvement Components: build Affects Versions: 0.19.0 Reporter: Timothy St. Clair From [~benjaminhindman]: I know that Niklas has some go bindings (backed by libmesos) here and Vladimir Vivien has some _native_ go bindings (no need for libmesos) here that could be used to help accomplish this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1566) Support private docker registry.
[ https://issues.apache.org/jira/browse/MESOS-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061462#comment-14061462 ] Timothy Chen commented on MESOS-1566: - [~hufman] yes, it works if you don't need authentication, but in case you do we need to allow user authentication configuration, which is what this ticket is about. Overall I didn't put that in because I want to make sure we address any other needs required for a private registry. Support private docker registry. Key: MESOS-1566 URL: https://issues.apache.org/jira/browse/MESOS-1566 Project: Mesos Issue Type: Task Reporter: Timothy Chen Need to support Docker launching images hosted in private registry service, which requires docker login. Can consider utilizing .dockercfg file for providing credentials. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1593) Add DockerInfo Configuration
Timothy Chen created MESOS-1593: --- Summary: Add DockerInfo Configuration Key: MESOS-1593 URL: https://issues.apache.org/jira/browse/MESOS-1593 Project: Mesos Issue Type: Task Reporter: Timothy Chen We want to add a new proto message to encapsulate all Docker related configurations into DockerInfo. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1574) what to do when a rogue process binds to a port mesos didn't allocate to it?
[ https://issues.apache.org/jira/browse/MESOS-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061466#comment-14061466 ] Jay Buffington commented on MESOS-1574: --- [~idownes] Good idea! Unfortunately, we turned on cgroups isolation two weeks ago, but this process was started a month ago :( what to do when a rogue process binds to a port mesos didn't allocate to it? Key: MESOS-1574 URL: https://issues.apache.org/jira/browse/MESOS-1574 Project: Mesos Issue Type: Improvement Components: allocation, isolation Reporter: Jay Buffington Priority: Minor I recently had an issue where a slave had a process whose parent was init that was bound to a port in the range that mesos thought was a free resource. I'm not sure if this is due to a bug in mesos (it lost track of this process during an upgrade?) or if there was a bad user who started a process on the host manually outside of mesos. The process is over a month old and I have no history in mesos to ask it if/when it launched the task :( If a rogue process binds to a port that mesos-slave has offered to the master as an available resource there should be some sort of reckoning. Mesos could: * kill the rogue process * rescind the offer for that port * have an api that can be plugged into a monitoring system to alert humans of this inconsistency -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1594) SlaveRecoveryTest/0.ReconcileKillTask is flaky
Vinod Kone created MESOS-1594: - Summary: SlaveRecoveryTest/0.ReconcileKillTask is flaky Key: MESOS-1594 URL: https://issues.apache.org/jira/browse/MESOS-1594 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.20.0 Environment: Ubuntu 12.10 with GCC Reporter: Vinod Kone Observed this on Jenkins. {code} [ RUN ] SlaveRecoveryTest/0.ReconcileKillTask Using temporary directory '/tmp/SlaveRecoveryTest_0_ReconcileKillTask_3zJ6DG' I0714 15:08:43.915114 27216 leveldb.cpp:176] Opened db in 474.695188ms I0714 15:08:43.933645 27216 leveldb.cpp:183] Compacted db in 18.068942ms I0714 15:08:43.934129 27216 leveldb.cpp:198] Created db iterator in 7860ns I0714 15:08:43.934439 27216 leveldb.cpp:204] Seeked to beginning of db in 2560ns I0714 15:08:43.934779 27216 leveldb.cpp:273] Iterated through 0 keys in the db in 1400ns I0714 15:08:43.935098 27216 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0714 15:08:43.936027 27238 recover.cpp:425] Starting replica recovery I0714 15:08:43.936225 27238 recover.cpp:451] Replica is in EMPTY status I0714 15:08:43.936867 27238 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0714 15:08:43.937049 27238 recover.cpp:188] Received a recover response from a replica in EMPTY status I0714 15:08:43.937232 27238 recover.cpp:542] Updating replica status to STARTING I0714 15:08:43.945600 27235 master.cpp:288] Master 20140714-150843-16842879-55850-27216 (quantal) started on 127.0.1.1:55850 I0714 15:08:43.945643 27235 master.cpp:325] Master only allowing authenticated frameworks to register I0714 15:08:43.945651 27235 master.cpp:330] Master only allowing authenticated slaves to register I0714 15:08:43.945658 27235 credentials.hpp:36] Loading credentials for authentication from '/tmp/SlaveRecoveryTest_0_ReconcileKillTask_3zJ6DG/credentials' I0714 15:08:43.945808 27235 master.cpp:359] Authorization enabled I0714 15:08:43.946369 27235 
hierarchical_allocator_process.hpp:301] Initializing hierarchical allocator process with master : master@127.0.1.1:55850 I0714 15:08:43.946419 27235 master.cpp:122] No whitelist given. Advertising offers for all slaves I0714 15:08:43.946614 27235 master.cpp:1128] The newly elected leader is master@127.0.1.1:55850 with id 20140714-150843-16842879-55850-27216 I0714 15:08:43.946630 27235 master.cpp:1141] Elected as the leading master! I0714 15:08:43.946637 27235 master.cpp:959] Recovering from registrar I0714 15:08:43.946707 27235 registrar.cpp:313] Recovering registrar I0714 15:08:43.957895 27238 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 20.529301ms I0714 15:08:43.957978 27238 replica.cpp:320] Persisted replica status to STARTING I0714 15:08:43.958142 27238 recover.cpp:451] Replica is in STARTING status I0714 15:08:43.958664 27238 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0714 15:08:43.958762 27238 recover.cpp:188] Received a recover response from a replica in STARTING status I0714 15:08:43.958945 27238 recover.cpp:542] Updating replica status to VOTING I0714 15:08:43.975685 27238 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 16.646136ms I0714 15:08:43.976367 27238 replica.cpp:320] Persisted replica status to VOTING I0714 15:08:43.976824 27241 recover.cpp:556] Successfully joined the Paxos group I0714 15:08:43.977072 27242 recover.cpp:440] Recover process terminated I0714 15:08:43.980590 27236 log.cpp:656] Attempting to start the writer I0714 15:08:43.981385 27236 replica.cpp:474] Replica received implicit promise request with proposal 1 I0714 15:08:43.999141 27236 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 17.705787ms I0714 15:08:43.999222 27236 replica.cpp:342] Persisted promised to 1 I0714 15:08:44.004451 27240 coordinator.cpp:230] Coordinator attemping to fill missing position I0714 15:08:44.004914 27240 replica.cpp:375] Replica received explicit promise request 
for position 0 with proposal 2 I0714 15:08:44.021456 27240 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 16.499775ms I0714 15:08:44.021533 27240 replica.cpp:676] Persisted action at 0 I0714 15:08:44.022006 27240 replica.cpp:508] Replica received write request for position 0 I0714 15:08:44.022043 27240 leveldb.cpp:438] Reading position from leveldb took 21376ns I0714 15:08:44.035969 27240 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 13.885907ms I0714 15:08:44.036365 27240 replica.cpp:676] Persisted action at 0 I0714 15:08:44.040156 27238 replica.cpp:655] Replica received learned notice for position 0 I0714 15:08:44.058082 27238 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 17.860707ms I0714 15:08:44.058161 27238 replica.cpp:676] Persisted action at 0 I0714 15:08:44.058176 27238 replica.cpp:661] Replica
[jira] [Created] (MESOS-1595) Provide a way to install libprocess
Vinod Kone created MESOS-1595: - Summary: Provide a way to install libprocess Key: MESOS-1595 URL: https://issues.apache.org/jira/browse/MESOS-1595 Project: Mesos Issue Type: Story Reporter: Vinod Kone Assignee: Vinod Kone For C++ framework developers that want to use libprocess in their code base, it would be great if Mesos provides a way to easily get access to the headers. A first step in that direction would be to provide a install target in the libprocess Makefile for the same. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1596) Improve allocation of resources
Dominic Hamon created MESOS-1596: Summary: Improve allocation of resources Key: MESOS-1596 URL: https://issues.apache.org/jira/browse/MESOS-1596 Project: Mesos Issue Type: Epic Reporter: Dominic Hamon Assignee: Vinod Kone -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1597) Add tc police action to routing library
Dominic Hamon created MESOS-1597: Summary: Add tc police action to routing library Key: MESOS-1597 URL: https://issues.apache.org/jira/browse/MESOS-1597 Project: Mesos Issue Type: Task Reporter: Dominic Hamon Assignee: Jie Yu -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1598) Add advanced shaping controls to routing library
Dominic Hamon created MESOS-1598: Summary: Add advanced shaping controls to routing library Key: MESOS-1598 URL: https://issues.apache.org/jira/browse/MESOS-1598 Project: Mesos Issue Type: Task Reporter: Dominic Hamon Assignee: Jie Yu Only necessary if bandwidth cap using tc police action is deemed not effective for network isolation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1599) Slave configuration for network isolation
Dominic Hamon created MESOS-1599: Summary: Slave configuration for network isolation Key: MESOS-1599 URL: https://issues.apache.org/jira/browse/MESOS-1599 Project: Mesos Issue Type: Task Reporter: Dominic Hamon Assignee: Jie Yu -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1471) Add documentation for the replicated log.
[ https://issues.apache.org/jira/browse/MESOS-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1471: - Sprint: Q3 Sprint 1 Add documentation for the replicated log. - Key: MESOS-1471 URL: https://issues.apache.org/jira/browse/MESOS-1471 Project: Mesos Issue Type: Documentation Components: documentation, replicated log Reporter: Benjamin Mahler Assignee: Jie Yu The replicated log could benefit from some documentation. In particular, how does it work? What do operators need to know? Possibly there is some overlap with our future maintenance documentation in MESOS-1470. I believe [~jieyu] has some unpublished work that could be leveraged here! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1582) Improve build time.
[ https://issues.apache.org/jira/browse/MESOS-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lambert updated MESOS-1582: - Epic Colour: ghx-label-2 (was: Red) Improve build time. --- Key: MESOS-1582 URL: https://issues.apache.org/jira/browse/MESOS-1582 Project: Mesos Issue Type: Epic Components: build Reporter: Benjamin Hindman The build takes a ridiculously long time unless you have a large, parallel machine. This is a combination of many factors, all of which we'd like to discuss and track here. I'd also love to actually track build times so we can get an appreciation of the improvements. Please leave a comment below with your build times! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1583) Clang tool build improvement include what you use
[ https://issues.apache.org/jira/browse/MESOS-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061634#comment-14061634 ] Dominic Hamon commented on MESOS-1583: -- Please test this on a small selection of files and share the resulting patch. There has been some controversy in other projects that have used it regarding how aggressive it can be. Clang tool build improvement include what you use --- Key: MESOS-1583 URL: https://issues.apache.org/jira/browse/MESOS-1583 Project: Mesos Issue Type: Improvement Components: technical debt Reporter: Isabel Jimenez Assignee: Isabel Jimenez -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1596) Various allocator improvements for multi-framework support
[ https://issues.apache.org/jira/browse/MESOS-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1596: - Summary: Various allocator improvements for multi-framework support (was: Improve allocation of resources) Various allocator improvements for multi-framework support -- Key: MESOS-1596 URL: https://issues.apache.org/jira/browse/MESOS-1596 Project: Mesos Issue Type: Epic Reporter: Dominic Hamon Assignee: Vinod Kone -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1597) Add tc police action to routing library
[ https://issues.apache.org/jira/browse/MESOS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1597: - Description: [Policing filters|http://www.lartc.org/lartc.html#LARTC.ADV-FILTER.POLICING] are a simple way to add bandwidth limiting to a connection. Adding this action to the routing library will allow us to start isolating network bandwidth per container. Add tc police action to routing library --- Key: MESOS-1597 URL: https://issues.apache.org/jira/browse/MESOS-1597 Project: Mesos Issue Type: Task Reporter: Dominic Hamon Assignee: Jie Yu [Policing filters|http://www.lartc.org/lartc.html#LARTC.ADV-FILTER.POLICING] are a simple way to add bandwidth limiting to a connection. Adding this action to the routing library will allow us to start isolating network bandwidth per container. -- This message was sent by Atlassian JIRA (v6.2#6252)
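Conceptually, tc's police action is a token-bucket policer: traffic conforming to a configured (rate, burst) passes, and excess is dropped. A toy model of that mechanism (illustrative only, not the kernel implementation and not tied to the routing library's API):

```python
class TokenBucket:
    """Minimal token-bucket policer, the mechanism behind tc's police
    action: packets within (rate, burst) pass, excess packets are dropped."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps         # tokens (bytes) added per second
        self.capacity = burst_bytes  # maximum accumulated burst
        self.tokens = burst_bytes    # start with a full bucket
        self.last = 0.0              # timestamp of the last decision

    def allow(self, packet_bytes, now):
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True   # conforming: forward the packet
        return False      # non-conforming: drop (police) the packet
```

A per-container bucket like this is the flat bandwidth cap the ticket proposes; MESOS-1598 covers richer shaping if this proves insufficient.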
[jira] [Updated] (MESOS-1599) Slave configuration for network isolation
[ https://issues.apache.org/jira/browse/MESOS-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1599: - Description: Once the policing or shaping controls are available in the routing library, configuration options are required on the slave to enable them. (was: Once the ) Slave configuration for network isolation - Key: MESOS-1599 URL: https://issues.apache.org/jira/browse/MESOS-1599 Project: Mesos Issue Type: Task Reporter: Dominic Hamon Assignee: Jie Yu Once the policing or shaping controls are available in the routing library, configuration options are required on the slave to enable them. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1598) Add advanced shaping controls to routing library
[ https://issues.apache.org/jira/browse/MESOS-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1598: - Description: Only necessary if bandwidth cap using tc police action (MESOS-1597) is deemed not effective for network isolation. If this is necessary, we can use more complex shaping controls than just flat bandwidth caps to manage bandwidth isolation between containers. was:Only necessary if bandwidth cap using tc police action is deemed not effective for network isolation. Add advanced shaping controls to routing library Key: MESOS-1598 URL: https://issues.apache.org/jira/browse/MESOS-1598 Project: Mesos Issue Type: Task Reporter: Dominic Hamon Assignee: Jie Yu Only necessary if bandwidth cap using tc police action (MESOS-1597) is deemed not effective for network isolation. If this is necessary, we can use more complex shaping controls than just flat bandwidth caps to manage bandwidth isolation between containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1599) Slave configuration for network isolation
[ https://issues.apache.org/jira/browse/MESOS-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1599: - Description: Once the Slave configuration for network isolation - Key: MESOS-1599 URL: https://issues.apache.org/jira/browse/MESOS-1599 Project: Mesos Issue Type: Task Reporter: Dominic Hamon Assignee: Jie Yu Once the -- This message was sent by Atlassian JIRA (v6.2#6252)