Re: [VOTE] Release Apache Mesos 1.1.0 (rc1)

2016-10-24 Thread David Robinson
 as private bridge networks, and
> the
> services running in the container needs to be exposed outside these
> isolated networks.
>
>
> The CHANGELOG for the release is available at:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.0-rc1
> 
> 
>
> The candidate for Mesos 1.1.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.1.0-rc1/mesos-1.1.0.tar.gz
>
> The tag to be voted on is 1.1.0-rc1:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.1.0-rc1
>
> The MD5 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.1.0-rc1/mesos-1.1.0.tar.gz.md5
>
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.1.0-rc1/mesos-1.1.0.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is up in Maven in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1158
>
> Please vote on releasing this package as Apache Mesos 1.1.0!
>
> The vote is open until Fri Oct 21 21:57:02 CEST 2016 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.1.0
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Alex & Till
>
>
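
For anyone checking the candidate, the MD5 comparison mentioned above might
look roughly like the sketch below. The function names are made up for
illustration, and it assumes the .md5 file contains the digest as one
contiguous 32-character hex string (the layout of that file is not shown in
this thread). It is not a substitute for verifying the .asc signature against
KEYS.

```python
import hashlib
import re


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_md5(text):
    """Pull the first 32-hex-digit token out of the .md5 file contents.

    Assumption: the digest appears as one contiguous hex string.
    """
    match = re.search(r"\b[0-9a-fA-F]{32}\b", text)
    return match.group(0).lower() if match else None


def tarball_matches(tarball_path, md5_path):
    """True if the tarball's digest equals the one recorded in the .md5 file."""
    with open(md5_path, "r") as handle:
        expected = extract_md5(handle.read())
    return expected is not None and md5_of_file(tarball_path) == expected
```

Usage would be something like tarball_matches("mesos-1.1.0.tar.gz",
"mesos-1.1.0.tar.gz.md5") after downloading both files.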


-- 
David Robinson
SRE - Mesos
@daverobinson


Re: Review Request 30514: Rate limited the removal of slaves failing health checks.

2015-02-03 Thread David Robinson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30514/#review70768
---



src/master/flags.hpp
https://reviews.apache.org/r/30514/#comment116199

Should this be slaves per minute? I imagine most clusters in the wild would 
be relatively small, and the smaller the cluster the slower you'd want removals?



src/master/master.cpp
https://reviews.apache.org/r/30514/#comment116173

s/is/was/ ?


- David Robinson


On Feb. 3, 2015, 2:39 a.m., Vinod Kone wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30514/
 ---
 
 (Updated Feb. 3, 2015, 2:39 a.m.)
 
 
 Review request for mesos, Ben Mahler, David Robinson, and Jie Yu.
 
 
 Bugs: MESOS-1148
 https://issues.apache.org/jira/browse/MESOS-1148
 
 
 Repository: mesos
 
 
 Description
 ---
 
 The algorithm is simple. All the slave observers share a rate limiter whose 
 rate is configured via the command line. When a slave times out on a health 
 check, a permit is acquired to shut down the slave *but* pings continue to 
 be sent. If a pong arrives before the permit is satisfied, the shutdown is 
 cancelled.
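
As a rough illustration of the algorithm described above (not the code under
review), the shared limiter could be modelled like this. The class and method
names are hypothetical, and time is passed in explicitly so the behaviour is
deterministic.

```python
class RemovalLimiter:
    """Shared limiter: grants at most one removal permit per `interval` seconds.

    Hypothetical sketch; a cancelled permit does not give its slot back.
    """

    def __init__(self, interval):
        self.interval = interval
        self.next_free = 0.0   # earliest time the next permit can be granted
        self.pending = {}      # slave id -> time at which its permit is granted

    def health_check_timed_out(self, slave, now):
        """Queue a removal permit for `slave`; pings keep going meanwhile."""
        if slave in self.pending:
            return self.pending[slave]
        grant_at = max(now, self.next_free)
        self.next_free = grant_at + self.interval
        self.pending[slave] = grant_at
        return grant_at

    def pong_received(self, slave):
        """A pong arrived: cancel the pending removal, if there is one."""
        return self.pending.pop(slave, None) is not None

    def due_removals(self, now):
        """Return (and forget) the slaves whose permits are satisfied by `now`."""
        due = sorted(s for s, t in self.pending.items() if t <= now)
        for slave in due:
            del self.pending[slave]
        return due
```

With interval=60, three slaves timing out at the same instant would be
scheduled for removal a minute apart, and any of them that answers a ping in
the meantime drops out of the queue.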
 
 
 Diffs
 -
 
   src/master/flags.hpp 6c18a1af625311ef149b5877b08f63c2b12c040d 
   src/master/master.hpp 337e00aa46ea127f3667e3383d631c3fb8e22f30 
   src/master/master.cpp 10056861b95ed9453c971787982db7d09f09f323 
   src/tests/partition_tests.cpp fea78016268b007590516798eb30ff423fd0ae58 
   src/tests/slave_tests.cpp e7e2af63da785644f3f7e6e23607c02be962a2c6 
 
 Diff: https://reviews.apache.org/r/30514/diff/
 
 
 Testing
 ---
 
 make check
 
 Ran the new tests 100 times
 
 
 Thanks,
 
 Vinod Kone
 




Review Request 30328: Added external_log_file option.

2015-01-27 Thread David Robinson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30328/
---

Review request for mesos and Ben Mahler.


Bugs: MESOS-2193
https://issues.apache.org/jira/browse/MESOS-2193


Repository: mesos-git


Description
---

Added external_log_file option.

This is a continuation of the review started here: 
https://reviews.apache.org/r/29071


Diffs
-

  src/logging/flags.hpp 11efb84cc2c509f852f8ba20f16a366b4cb5810f 
  src/master/http.cpp 46890bed05d7c4b63e1f7be5bb35217173e0ade8 
  src/master/master.cpp ab6d1d17367f199191b7c77bccec73ec3b112d4f 
  src/slave/http.cpp d1cf8a68fab9a2df44f6c753683ad37fd4b1a1f9 
  src/slave/slave.cpp fca83b3977b95ddda30f9830da10e124b5c605e6 
  src/webui/master/static/js/controllers.js 41a70a80442501a2bf7b217939dbe504662941d2 

Diff: https://reviews.apache.org/r/30328/diff/


Testing
---

Ran locally.


Thanks,

David Robinson



Review Request 30347: Added additional details to developers guide.

2015-01-27 Thread David Robinson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30347/
---

Review request for mesos, Ben Mahler and Dominic Hamon.


Bugs: MESOS-2282
https://issues.apache.org/jira/browse/MESOS-2282


Repository: mesos-git


Description
---

Added additional details to developers guide.

These two items tripped me up when trying to submit a patch.
This also fixes whitespace inconsistencies (mixed tabs and spaces).


Diffs
-

  docs/mesos-developers-guide.md 036a6fd336c1173be73393e5ee62dba208378518 

Diff: https://reviews.apache.org/r/30347/diff/


Testing
---


Thanks,

David Robinson



Re: Review Request 29071: added webui_log option

2015-01-23 Thread David Robinson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29071/#review69433
---


Ping?

- David Robinson


On Dec. 16, 2014, 10:37 p.m., David Robinson wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29071/
 ---
 
 (Updated Dec. 16, 2014, 10:37 p.m.)
 
 
 Review request for mesos.
 
 
 Bugs: MESOS-2193
 https://issues.apache.org/jira/browse/MESOS-2193
 
 
 Repository: mesos-git
 
 
 Description
 ---
 
 added webui_log option
 
 
 Diffs
 -
 
   src/master/flags.hpp 1cea50c02f3ad7de1e1ae91d65d1accdb9af7b03 
   src/master/http.cpp 46890bed05d7c4b63e1f7be5bb35217173e0ade8 
   src/master/master.cpp 0f55a5cc2d6845cbaace718a48f771d80aad0e6e 
   src/slave/flags.hpp 670997dc3a702cd5edf33f2e5824c5e4dfe4ecef 
   src/slave/http.cpp d1cf8a68fab9a2df44f6c753683ad37fd4b1a1f9 
   src/slave/slave.cpp 50b57819b55bdcdb9f49f20648199badc4d3f37b 
   src/webui/master/static/js/controllers.js 41a70a80442501a2bf7b217939dbe504662941d2 
 
 Diff: https://reviews.apache.org/r/29071/diff/
 
 
 Testing
 ---
 
 Ran locally.
 
 
 Thanks,
 
 David Robinson
 




Re: Review Request 29071: added webui_log option

2014-12-16 Thread David Robinson


 On Dec. 16, 2014, 12:58 a.m., Dominic Hamon wrote:
  src/slave/slave.cpp, line 459
  https://reviews.apache.org/r/29071/diff/1/?file=792487#file792487line459
 
  might want to check that the file exists, as per the log_dir stanza 
  above.

The stanza above just returns a filename (created from log_dir, log level etc.); 
it doesn't check that the file exists. FilesProcess::attach is where the file 
existence check occurs:


Future<Nothing> FilesProcess::attach(const string& path, const string& name)
{
  Result<string> result = os::realpath(path);

  if (!result.isSome()) {
    return Failure(
        "Failed to get realpath of '" + path + "': " +
        (result.isError()
         ? result.error()
         : "No such file or directory"));
  }

  // Make sure we have permissions to read the file/dir.
  ...
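
For readers who don't have the source handy, the same order of checks can be
sketched in Python. This is only an analogue, not the Mesos code: the function
signature, the `attached` dict and the second error message are assumptions,
and because Python's os.path.realpath does not fail on a missing path, the
existence check is done separately.

```python
import os


def attach(path, name, attached):
    """Register `path` under `name`, or return an error string.

    Resolves the real path first, then checks readability. `attached` is a
    plain dict standing in for the process state (an assumption here).
    """
    real = os.path.realpath(path)
    if not os.path.exists(real):
        return ("Failed to get realpath of '" + path +
                "': No such file or directory")
    if not os.access(real, os.R_OK):
        return "Failed to access '" + path + "': permission denied"
    attached[name] = real
    return None
```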


 On Dec. 16, 2014, 12:58 a.m., Dominic Hamon wrote:
  src/slave/slave.cpp, line 446
  https://reviews.apache.org/r/29071/diff/1/?file=792487#file792487line446
 
  so webui_log overrides log_dir? should check if both are set and error.

Done.
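
The check asked for above amounts to something like the following sketch. The
dict of flags and the error wording are illustrative only; the actual patch is
in the diff linked from the review.

```python
def validate_log_flags(flags):
    """Return an error message if both log options are set, else None."""
    if flags.get("webui_log") and flags.get("log_dir"):
        return "Only one of 'webui_log' and 'log_dir' may be set"
    return None
```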


 On Dec. 16, 2014, 12:58 a.m., Dominic Hamon wrote:
  src/webui/master/static/js/controllers.js, line 405
  https://reviews.apache.org/r/29071/diff/1/?file=792488#file792488line405
 
  or 'webui_log'

Done.


- David


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29071/#review65162
---


On Dec. 16, 2014, 12:51 a.m., David Robinson wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29071/
 ---
 
 (Updated Dec. 16, 2014, 12:51 a.m.)
 
 
 Review request for mesos.
 
 
 Bugs: MESOS-2193
 https://issues.apache.org/jira/browse/MESOS-2193
 
 
 Repository: mesos-git
 
 
 Description
 ---
 
 added webui_log option
 
 
 Diffs
 -
 
   src/master/flags.hpp 1cea50c02f3ad7de1e1ae91d65d1accdb9af7b03 
   src/master/http.cpp 46890bed05d7c4b63e1f7be5bb35217173e0ade8 
   src/master/master.cpp 0f55a5cc2d6845cbaace718a48f771d80aad0e6e 
   src/slave/flags.hpp 670997dc3a702cd5edf33f2e5824c5e4dfe4ecef 
   src/slave/http.cpp d1cf8a68fab9a2df44f6c753683ad37fd4b1a1f9 
   src/slave/slave.cpp 50b57819b55bdcdb9f49f20648199badc4d3f37b 
   src/webui/master/static/js/controllers.js 41a70a80442501a2bf7b217939dbe504662941d2 
 
 Diff: https://reviews.apache.org/r/29071/diff/
 
 
 Testing
 ---
 
 Ran locally.
 
 
 Thanks,
 
 David Robinson
 




Re: Review Request 29071: added webui_log option

2014-12-16 Thread David Robinson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29071/
---

(Updated Dec. 16, 2014, 10:37 p.m.)


Review request for mesos.


Changes
---

Dominic's feedback.


Bugs: MESOS-2193
https://issues.apache.org/jira/browse/MESOS-2193


Repository: mesos-git


Description
---

added webui_log option


Diffs (updated)
-

  src/master/flags.hpp 1cea50c02f3ad7de1e1ae91d65d1accdb9af7b03 
  src/master/http.cpp 46890bed05d7c4b63e1f7be5bb35217173e0ade8 
  src/master/master.cpp 0f55a5cc2d6845cbaace718a48f771d80aad0e6e 
  src/slave/flags.hpp 670997dc3a702cd5edf33f2e5824c5e4dfe4ecef 
  src/slave/http.cpp d1cf8a68fab9a2df44f6c753683ad37fd4b1a1f9 
  src/slave/slave.cpp 50b57819b55bdcdb9f49f20648199badc4d3f37b 
  src/webui/master/static/js/controllers.js 41a70a80442501a2bf7b217939dbe504662941d2 

Diff: https://reviews.apache.org/r/29071/diff/


Testing
---

Ran locally.


Thanks,

David Robinson



Re: Review Request 29071: added webui_log option

2014-12-16 Thread David Robinson


 On Dec. 16, 2014, 1:44 a.m., Mesos ReviewBot wrote:
  Bad patch!
  
  Reviews applied: [29071]
  
  Failed command: ./support/apply-review.sh -n -r 29071
  
  Error:
   2014-12-16 01:44:48 URL:https://reviews.apache.org/r/29071/diff/raw/ 
  [5226/5226] - 29071.patch [1]
  error: master/flags.hpp: does not exist in index
  error: master/http.cpp: does not exist in index
  error: master/master.cpp: does not exist in index
  error: slave/flags.hpp: does not exist in index
  error: slave/http.cpp: does not exist in index
  error: slave/slave.cpp: does not exist in index
  error: webui/master/static/js/controllers.js: does not exist in index
  Failed to apply patch

Guess you don't like 'noprefix = true'?!
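
For context: with git's diff.noprefix set, the paths in a patch lose their
a/ and b/ prefixes, so a tool that strips one leading path component (the
usual -p1 behaviour) removes "src/" instead. That would explain the
"master/flags.hpp: does not exist in index" errors, assuming the script
applies patches that way. A small sketch of the effect, with a hypothetical
helper name:

```python
def strip_components(path, count):
    """Drop the first `count` leading directory components, like -p<count>."""
    parts = path.split("/")
    if len(parts) <= count:
        raise ValueError("cannot strip %d components from %r" % (count, path))
    return "/".join(parts[count:])


# With the default a/ and b/ prefixes, stripping one component removes
# only the prefix.
print(strip_components("a/src/master/flags.hpp", 1))   # src/master/flags.hpp

# With noprefix = true the first real directory is removed instead.
print(strip_components("src/master/flags.hpp", 1))     # master/flags.hpp
```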


- David


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29071/#review65165
---


On Dec. 16, 2014, 10:37 p.m., David Robinson wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29071/
 ---
 
 (Updated Dec. 16, 2014, 10:37 p.m.)
 
 
 Review request for mesos.
 
 
 Bugs: MESOS-2193
 https://issues.apache.org/jira/browse/MESOS-2193
 
 
 Repository: mesos-git
 
 
 Description
 ---
 
 added webui_log option
 
 
 Diffs
 -
 
   src/master/flags.hpp 1cea50c02f3ad7de1e1ae91d65d1accdb9af7b03 
   src/master/http.cpp 46890bed05d7c4b63e1f7be5bb35217173e0ade8 
   src/master/master.cpp 0f55a5cc2d6845cbaace718a48f771d80aad0e6e 
   src/slave/flags.hpp 670997dc3a702cd5edf33f2e5824c5e4dfe4ecef 
   src/slave/http.cpp d1cf8a68fab9a2df44f6c753683ad37fd4b1a1f9 
   src/slave/slave.cpp 50b57819b55bdcdb9f49f20648199badc4d3f37b 
   src/webui/master/static/js/controllers.js 41a70a80442501a2bf7b217939dbe504662941d2 
 
 Diff: https://reviews.apache.org/r/29071/diff/
 
 
 Testing
 ---
 
 Ran locally.
 
 
 Thanks,
 
 David Robinson
 




Review Request 29071: added webui_log option

2014-12-15 Thread David Robinson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29071/
---

Review request for mesos.


Bugs: MESOS-2193
https://issues.apache.org/jira/browse/MESOS-2193


Repository: mesos-git


Description
---

added webui_log option


Diffs
-

  src/master/flags.hpp 1cea50c02f3ad7de1e1ae91d65d1accdb9af7b03 
  src/master/http.cpp 46890bed05d7c4b63e1f7be5bb35217173e0ade8 
  src/master/master.cpp 0f55a5cc2d6845cbaace718a48f771d80aad0e6e 
  src/slave/flags.hpp 670997dc3a702cd5edf33f2e5824c5e4dfe4ecef 
  src/slave/http.cpp d1cf8a68fab9a2df44f6c753683ad37fd4b1a1f9 
  src/slave/slave.cpp 50b57819b55bdcdb9f49f20648199badc4d3f37b 
  src/webui/master/static/js/controllers.js 41a70a80442501a2bf7b217939dbe504662941d2 

Diff: https://reviews.apache.org/r/29071/diff/


Testing
---

Ran locally.


Thanks,

David Robinson



Re: Review Request 24902: Fixed the build error in routing tests.

2014-08-22 Thread David Robinson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24902/#review51305
---



src/tests/routing_tests.cpp
https://reviews.apache.org/r/24902/#comment89457

Why is '#if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 24)' being removed?


- David Robinson


On Aug. 20, 2014, 6:21 p.m., Jie Yu wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/24902/
 ---
 
 (Updated Aug. 20, 2014, 6:21 p.m.)
 
 
 Review request for mesos, Ian Downes and Vinod Kone.
 
 
 Repository: mesos-git
 
 
 Description
 ---
 
 Realized that checking linux version is not very useful. We may have a new 
 kernel header but an old glibc in some cases.
 
 
 Diffs
 -
 
   src/tests/routing_tests.cpp 020676cac092aae63fcb45f37b206323db100f95 
 
 Diff: https://reviews.apache.org/r/24902/diff/
 
 
 Testing
 ---
 
 sudo make check
 
 
 Thanks,
 
 Jie Yu
 




[jira] [Commented] (MESOS-890) Figure out a way to migrate a live Mesos cluster to a different ZooKeeper cluster

2014-02-26 Thread David Robinson (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913694#comment-13913694 ]

David Robinson commented on MESOS-890:
--

I think option #1 is the better approach. Writing a heap of code to correctly 
handle option #2 seems like a lot of effort for something which should very 
rarely happen. IIRC the 75 second constraint can be adjusted in Aurora too. 
[~rgs], how difficult is #1 from the ZK perspective? I'm happy to test this in 
a dev/test cluster.

 Figure out a way to migrate a live Mesos cluster to a different ZooKeeper 
 cluster
 -

 Key: MESOS-890
 URL: https://issues.apache.org/jira/browse/MESOS-890
 Project: Mesos
  Issue Type: Improvement
  Components: master
Reporter: Raul Gutierrez Segales
Assignee: Raul Gutierrez Segales

 I've been chatting with [~vinodkone] about approaching a live ZK cluster 
 migration. Here are the options we came up with.
 For the descriptions we treat `zk1` as the current working cluster, `obs` as 
 a  bunch of ZooKeeper Observers [1] and `zk2` as the new cluster to which we 
 need to migrate. 
 Approach #1: Using Observers
 With this option we need to:
 * add obs to zk1
 * restart slaves to have them use obs to find their master
 * restart the framework having it use obs to find the mesos master
 * restart the mesos masters having them use obs to perform their election
 * we then stop all ZK obs and remove their data (since they will need to sync 
 up with an entirely new cluster, we need to lose the old data)
 * we restart ZK obs having them be part of zk2
 * at this point the slaves, the framework and the masters can reach the ZK 
 obs again and an election happens
 * optionally you can restart slaves, the framework and masters again using 
 zk2 instead of the ZK obs if you wanted to decommission them. 
 This assumes that we can do the last three steps in < 75 secs (75 secs being 
 the slave health check timeout). This is a reasonable assumption if the data 
 size in zk2 is small enough to ensure that the ZK obs can sync up quickly 
 with zk2. If zk2 is a new cluster with no data then this should be very fast.
 The good things of this approach are:
 * no mesos code change
 * it is very easy to rollback half way through, if need be
 The hard issues are:
 * Manipulating the ZK obs (i.e.: stopping, removing the data from zk1 and 
 starting again) needs to be done with care. Messing up configs or not 
 removing the data from zk1 on any of the ZK obs will cause problems
 * we need to restart all slaves to have them use the ZK obs instead of 
 connecting to zk1 directly. But with slave recovery this isn't an issue, just 
 an extra step.
 * same thing for the framework and the masters
 Approach #2: Dual publishing from mesos masters
 With this option we would augment the election handling code in mesos masters 
 to have it deal with the notion of a primary and secondary ZK clusters. 
 Master registration and election would then work as follows:
 * create an ephemeral|sequential znode in zk1 (i.e.:  
 /path/to/znode/mesos_23)
 * create an ephemeral, but not sequential, znode in zk2 with the exact same 
 path as what was created in zk1 (i.e.: /path/to/znode/mesos_23)
 * make sure both sessions, in zk1 and zk2, are always in the same state 
 (i.e.: if one expires, the other one should be closed, etc.)
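
To make the dual-publishing idea concrete, here is a toy model of the three
steps above. FakeZk is an in-memory stand-in, not a real ZooKeeper client, and
every class and method name is hypothetical; the implementation details the
description sets aside (ordering, partial failures, reconnects) are not
handled.

```python
class FakeZk:
    """In-memory stand-in for a ZooKeeper cluster (not a real client)."""

    def __init__(self):
        self.znodes = {}      # path -> owning session id
        self.sequence = 0

    def create_sequential(self, prefix, session):
        path = "%s%d" % (prefix, self.sequence)
        self.sequence += 1
        self.znodes[path] = session
        return path

    def create(self, path, session):
        if path in self.znodes:
            raise ValueError("znode already exists: " + path)
        self.znodes[path] = session

    def expire(self, session):
        """Ephemeral znodes vanish when their session ends."""
        self.znodes = {p: s for p, s in self.znodes.items() if s != session}


class DualPublisher:
    def __init__(self, primary, secondary, session):
        self.primary = primary
        self.secondary = secondary
        self.session = session
        self.path = None

    def register(self, prefix):
        # The sequential znode in the primary decides the election order...
        self.path = self.primary.create_sequential(prefix, self.session)
        # ...and the exact same path is mirrored, non-sequentially, in the
        # secondary.
        self.secondary.create(self.path, self.session)
        return self.path

    def on_session_lost(self):
        # Keep both sessions in the same state: if one goes, close the other.
        self.primary.expire(self.session)
        self.secondary.expire(self.session)
        self.path = None
```

In this model a client reading either cluster sees the same set of znodes, and
so would pick the same master.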
 For now, let's omit a few implementation details which might need extra care 
 and assume we can make this work consistently in such a way that zk2 reflects 
 accurately elections that happen in zk1. This means that regardless of being 
 connected to zk1 or zk2, you always get the same master. Once we have this 
 the migration steps would be:
 * restart slaves to have them use zk2 where masters can be found by virtue of 
 what we implemented above
 * restart the framework so that it finds the mesos master in zk2
 * stop all mesos masters (they all need to be stopped before moving to the 
 next step)
 * start all mesos masters using zk2 as its primary and only cluster
 Again, this assumes we can do the last two steps in < 75 secs (or if we 
 needed to, we could bump the slave health check timeout). Which, again, 
 sounds achievable given that masters have no state and their start-up time is 
 very short.
 The good things of this approach are:
 - no tinkering with extra ZK servers nor with ZK configs 
 The hard issues are:
 - extra code needs to be added to the election handling bits of mesos master 
 to address a very rare, but probable, use-case of cluster migration. It might 
 take a bit of time to get that code right. 
 - it's easier to end up with a bad state if any of the mesos masters ends up 
 with a bad config or is restarted earlier and ends up publishing differently 
 than

[jira] [Created] (MESOS-1028) expose internal metrics

2014-02-21 Thread David Robinson (JIRA)
David Robinson created MESOS-1028:
-

 Summary: expose internal metrics
 Key: MESOS-1028
 URL: https://issues.apache.org/jira/browse/MESOS-1028
 Project: Mesos
  Issue Type: Improvement
  Components: general
Reporter: David Robinson


Mesos should export statistics that provide visibility into its internals. This 
would allow users to detect numerous problems without resorting to trolling log 
files.

E.g. export counters of (some of these already exist, most don't):
cgroup create
cgroup destroy
cgroup destroy attempts
resource offers made
resource offers accepted
tasks launched
tasks destroyed
tasks lost
writes to replicated log
queue length

export 50th, 90th, 95th, 99th percentile of time taken to:
start mesos (reach a certain state)
move tasks between two given states (starting -> started)
create a cgroup
destroy a cgroup
send a message from slave to master
start a task
stop a task
register in zookeeper
write to the replicated log

Ideally all these metrics would be exposed via a HTTP+JSON endpoint. See 
[metrics|http://metrics.codahale.com/getting-started/] for an example (albeit 
Java) library (or [medida|http://dln.github.io/medida/] for an unmaintained(?) 
c++ port)

We've previously seen problems where tasks were stuck in cgroup destroy with 
30,000 attempts. Exposing metrics would allow us to easily detect problems 
like this.
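
As a sketch of what such an endpoint might return, the snippet below turns a
set of counters and timing samples into a JSON document with the percentiles
listed above. The metric names and the nearest-rank percentile method are
assumptions for illustration, not an existing Mesos interface.

```python
import json


def percentile(samples, pct):
    """Nearest-rank percentile (integer pct) of a non-empty list of numbers."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, to avoid floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def snapshot(counters, timings):
    """Render counters and timing percentiles as a JSON document."""
    body = dict(counters)
    for name, samples in timings.items():
        for pct in (50, 90, 95, 99):
            body["%s_p%d" % (name, pct)] = percentile(samples, pct)
    return json.dumps(body, sort_keys=True)
```

A monitoring system polling this would be able to alert on, say, a
cgroup-destroy-attempts counter climbing into the tens of thousands.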



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MESOS-1028) expose internal metrics

2014-02-21 Thread David Robinson (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908729#comment-13908729 ]

David Robinson commented on MESOS-1028:
---

sgtm

 expose internal metrics
 ---

 Key: MESOS-1028
 URL: https://issues.apache.org/jira/browse/MESOS-1028
 Project: Mesos
  Issue Type: Improvement
  Components: general
Reporter: David Robinson

 Mesos should export statistics that provide visibility into its internals. 
 This would allow users to detect numerous problems without resorting to 
 trolling log files.
 E.g. export counters of (some of these already exist, most don't):
 cgroup create
 cgroup destroy
 cgroup destroy attempts
 resource offers made
 resource offers accepted
 tasks launched
 tasks destroyed
 tasks lost
 writes to replicated log
 queue length
 export 50th, 90th, 95th, 99th percentile of time taken to:
 start mesos (reach a certain state)
 move tasks between two given states (starting -> started)
 create a cgroup
 destroy a cgroup
 send a message from slave to master
 start a task
 stop a task
 register in zookeeper
 write to the replicated log
 Ideally all these metrics would be exposed via a HTTP+JSON endpoint. See 
 [metrics|http://metrics.codahale.com/getting-started/] for an example (albeit 
 Java) library (or [medida|http://dln.github.io/medida/] for an 
 unmaintained(?) c++ port)
 We've previously seen problems where tasks were stuck in cgroup destroy with 
 30,000 attempts. Exposing metrics would allow us to easily detect problems 
 like this.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: Review Request 17442: Added 'active_tasks' stat to master stats endpoint.

2014-01-27 Thread David Robinson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17442/#review32962
---

Ship it!



src/master/http.cpp
https://reviews.apache.org/r/17442/#comment62039

s/launched_tasks/launched_tasks_gauge/ ? (or active_tasks_gauge, please 
yourself)

All the other *_tasks stats are counters, so calling this something_tasks 
implies it is one too. It isn't a counter, so it's better to explicitly call it 
something else, such as something_tasks_gauge.
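
To spell out the distinction: a counter only ever increases, while a gauge
reports the current level and can go down. A minimal sketch, with made-up
classes rather than anything from the Mesos tree:

```python
class Counter:
    """Monotonic: only ever goes up (e.g. tasks launched since start)."""

    def __init__(self):
        self.value = 0

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("counters never decrease")
        self.value += amount


class Gauge:
    """Instantaneous: reports whatever the current level is."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


launched_tasks = Counter()
active_tasks_gauge = Gauge()

for _ in range(3):                 # three tasks start
    launched_tasks.increment()
active_tasks_gauge.set(3)
active_tasks_gauge.set(1)          # two of them finish

print(launched_tasks.value, active_tasks_gauge.value)   # 3 1
```

Someone graphing the rate of change of every *_tasks stat would get nonsense
for the gauge, which is why the suffix helps.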


- David Robinson


On Jan. 28, 2014, 2:59 a.m., Vinod Kone wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/17442/
 ---
 
 (Updated Jan. 28, 2014, 2:59 a.m.)
 
 
 Review request for mesos, Benjamin Hindman, Ben Mahler, David Robinson, and 
 Niklas Nielsen.
 
 
 Bugs: MESOS-772
 https://issues.apache.org/jira/browse/MESOS-772
 
 
 Repository: mesos-git
 
 
 Description
 ---
 
 See summary.
 
 I opted for active tasks instead of running tasks because I didn't want 
 the stats endpoint to loop through all tasks to figure out if a task is in 
 RUNNING state. I think active is useful for most debugging purposes.
 
 
 Diffs
 -
 
   src/master/http.cpp 546e91dbb9c8ee1014bb4f0b3be2714ad6a2d520 
 
 Diff: https://reviews.apache.org/r/17442/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Vinod Kone
 




Re: Review Request 17442: Added 'active_tasks' stat to master stats endpoint.

2014-01-27 Thread David Robinson


 On Jan. 28, 2014, 4:55 a.m., Ben Mahler wrote:
  I hope the past vs present tense will be enough to not make the monotonic 
  vs instantaneous stats confusing for those consuming this data.

I'd find it confusing. active_tasks_gauge is preferable.


- David


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17442/#review32952
---


On Jan. 28, 2014, 2:59 a.m., Vinod Kone wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/17442/
 ---
 
 (Updated Jan. 28, 2014, 2:59 a.m.)
 
 
 Review request for mesos, Benjamin Hindman, Ben Mahler, David Robinson, and 
 Niklas Nielsen.
 
 
 Bugs: MESOS-772
 https://issues.apache.org/jira/browse/MESOS-772
 
 
 Repository: mesos-git
 
 
 Description
 ---
 
 See summary.
 
 I opted for active tasks instead of running tasks because I didn't want 
 the stats endpoint to loop through all tasks to figure out if a task is in 
 RUNNING state. I think active is useful for most debugging purposes.
 
 
 Diffs
 -
 
   src/master/http.cpp 546e91dbb9c8ee1014bb4f0b3be2714ad6a2d520 
 
 Diff: https://reviews.apache.org/r/17442/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Vinod Kone
 




Re: Review Request 17443: Added queued and launched tasks to slave stats.

2014-01-27 Thread David Robinson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17443/#review32964
---

Ship it!



src/slave/http.cpp
https://reviews.apache.org/r/17443/#comment62041

Same comment as RB 17442. queued_tasks_gauge and launched_tasks_gauge (or 
active_tasks_gauge, same as whatever gets used in the master).


- David Robinson


On Jan. 28, 2014, 3:02 a.m., Vinod Kone wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/17443/
 ---
 
 (Updated Jan. 28, 2014, 3:02 a.m.)
 
 
 Review request for mesos, Adam B, Benjamin Hindman, Ben Mahler, David 
 Robinson, and Niklas Nielsen.
 
 
 Bugs: MESOS-772
 https://issues.apache.org/jira/browse/MESOS-772
 
 
 Repository: mesos-git
 
 
 Description
 ---
 
 See summary.
 
 
 Diffs
 -
 
   src/slave/http.cpp c8357e214d2adf2cd712072f58d07b07badb79dc 
 
 Diff: https://reviews.apache.org/r/17443/diff/
 
 
 Testing
 ---
 
 make
 
 
 Thanks,
 
 Vinod Kone
 




[jira] [Commented] (MESOS-824) export running config via http+json

2013-11-20 Thread David Robinson (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828067#comment-13828067 ]

David Robinson commented on MESOS-824:
--

[~nnielsen], exposing all the flags would be useful.

[~vinodkone], that's great, we'll be able to use this info to make our deploys 
more reliable.

Having both the flags and FrameworkInfo exposed helps.

 export running config via http+json 
 

 Key: MESOS-824
 URL: https://issues.apache.org/jira/browse/MESOS-824
 Project: Mesos
  Issue Type: Improvement
Reporter: David Robinson
Priority: Minor

 Currently there's no way of knowing whether a slave is actually checkpointing 
 (except for grepping through logs, which isn't ideal). The --checkpoint flag 
 on the command line can't be used to detect this since checkpointing could be 
 enabled on the slave but not in the framework. Because of this we cannot 
 detect whether slave recovery is actually enabled and therefore can't tell 
 whether it's safe to restart a slave.
 Please export the running config, preferably via a json endpoint.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (MESOS-824) export running config via http+json

2013-11-19 Thread David Robinson (JIRA)
David Robinson created MESOS-824:


 Summary: export running config via http+json 
 Key: MESOS-824
 URL: https://issues.apache.org/jira/browse/MESOS-824
 Project: Mesos
  Issue Type: Improvement
Reporter: David Robinson
Priority: Minor


Currently there's no way of knowing whether a slave is actually checkpointing 
(except for grepping through logs, which isn't ideal). The --checkpoint flag on 
the command line can't be used to detect this since checkpointing could be 
enabled on the slave but not in the framework. Because of this we cannot detect 
whether slave recovery is actually enabled and therefore can't tell whether 
it's safe to restart a slave.

Please export the running config, preferably via a json endpoint.
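
Something along these lines is what I have in mind, sketched with Python's
standard library rather than Mesos internals. The /flags.json path and the
flag values are hypothetical; a real process would fill the dict from its
parsed command line.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical flag values, for illustration only.
RUNNING_FLAGS = {"checkpoint": True, "log_dir": "/var/log/mesos"}


def render_flags(flags):
    """Serialise the effective configuration as a JSON object."""
    return json.dumps({"flags": flags}, sort_keys=True)


class ConfigHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/flags.json":
            self.send_error(404)
            return
        body = render_flags(RUNNING_FLAGS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Blocking helper; not called on import."""
    HTTPServer(("127.0.0.1", port), ConfigHandler).serve_forever()
```

A deploy script could then fetch the document and refuse to restart a slave
whose "checkpoint" flag is false. As noted above, the slave's flag alone isn't
sufficient, so per-framework information would need to be exposed as well.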



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MESOS-780) Adding support for 3rd party performance and health monitoring.

2013-11-05 Thread David Robinson (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814307#comment-13814307 ]

David Robinson commented on MESOS-780:
--

You can already solve the push problem quite easily, with no changes to 
Mesos. eg:
https://collectd.org/wiki/index.php/Plugin:cURL-JSON
https://collectd.org/wiki/index.php/Plugin:Write_Graphite
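
The collectd plugins above need no code at all, but the pull-then-push pattern
they implement is simple enough to sketch. The function names and metric
prefix below are made up; the output format is Graphite's plaintext protocol
("path value timestamp", one metric per line).

```python
import json
from urllib.request import urlopen


def fetch_stats(url):
    """Pull a JSON stats document from an HTTP endpoint."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def to_graphite_lines(stats, prefix, timestamp):
    """Format numeric stats as Graphite plaintext lines."""
    lines = []
    for key in sorted(stats):
        value = stats[key]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue  # skip non-numeric entries
        lines.append("%s.%s %s %d" % (prefix, key, value, timestamp))
    return lines
```

The resulting lines would then be written to the Graphite listener over a
plain TCP socket.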

 Adding support for 3rd party performance and health monitoring.
 ---

 Key: MESOS-780
 URL: https://issues.apache.org/jira/browse/MESOS-780
 Project: Mesos
  Issue Type: Improvement
  Components: framework
Reporter: Bernardo Gomez Palacio

 User Story:
 As a SysAdmin I should be able to monitor Mesos (Masters and Slaves) with
 3rd party tools such as:
 * [Ganglia|http://ganglia.sourceforge.net/]
 * [Graphite|http://graphite.wikidot.com/]
 * [Nagios|http://www.nagios.org/]
 * [Zabbix|http://www.zabbix.com/]



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (MESOS-772) expose count of running tasks

2013-10-24 Thread David Robinson (JIRA)
David Robinson created MESOS-772:


 Summary: expose count of running tasks
 Key: MESOS-772
 URL: https://issues.apache.org/jira/browse/MESOS-772
 Project: Mesos
  Issue Type: Improvement
Reporter: David Robinson
Priority: Minor


The stats endpoint doesn't show the current number of running tasks:

$ curl -s http://localhost:5051/slave\(1\)/stats.json | python2.7 -m json.tool
{
    "failed_tasks": 0,
    "finished_tasks": 0,
    "invalid_status_updates": 0,
    "killed_tasks": 0,
    "lost_tasks": 0,
    "recovery_errors": 0,
    "registered": 1,
    "staged_tasks": 2,
    "started_tasks": 0,
    "total_frameworks": 1,
    "uptime": 1168.518182912,
    "valid_status_updates": 0
}

Can this be added please?



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (MESOS-712) invalid zhandle state

2013-09-30 Thread David Robinson (JIRA)
David Robinson created MESOS-712:


 Summary: invalid zhandle state
 Key: MESOS-712
 URL: https://issues.apache.org/jira/browse/MESOS-712
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.14.0
Reporter: David Robinson


{noformat:title=log snippet}
2013-09-29 08:58:30,445:45279(0x7f9024e3f940):ZOO_WARN@zookeeper_interest@1461: 
Exceeded deadline by 16533ms
2013-09-29 
08:58:30,445:45279(0x7f9024e3f940):ZOO_ERROR@handle_socket_error_msg@1528: 
Socket [192.168.0.1:2181] zk retcode=-7, errno=110(Connection timed out): 
connection timed out (exceeded timeout by 13199ms)
I0929 08:58:17.544836 45283 cgroups.cpp:1193] Trying to freeze cgroup 
/cgroup/mesos/framework_201205082337-03-_executor_thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f_tag_8edc5ce9-20bc-4b09-bc92-d9bab7769738
2013-09-29 08:58:30,474:45279(0x7f9024e3f940):ZOO_DEBUG@handle_error@1141: 
Calling a watcher for a ZOO_SESSION_EVENT and the state=CONNECTING_STATE
2013-09-29 08:58:30,475:45279(0x7f9024e3f940):ZOO_WARN@zookeeper_interest@1461: 
Exceeded deadline by 16564ms
2013-09-29 
08:58:30,475:45279(0x7f901940):ZOO_DEBUG@process_completions@1765: Calling 
a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
I0929 08:58:30.445508 45282 detector.cpp:251] Trying to create path 
'/home/mesos/prod/master' in ZooKeeper
2013-09-29 08:58:30,483:45279(0x7f9024e3f940):ZOO_INFO@check_events@1585: 
initiated connection to server [192.168.0.2:2181]
2013-09-29 08:58:30,488:45279(0x7f9031267940):ZOO_DEBUG@zoo_awexists@2587: 
Sending request xid=0x5244d598 for path [/home/mesos/prod/master] to 
192.168.0.2:2181
2013-09-29 
08:58:30,488:45279(0x7f9024e3f940):ZOO_ERROR@handle_socket_error_msg@1621: 
Socket [192.168.0.2:2181] zk retcode=-112, errno=116(Stale NFS file handle): 
sessionId=0x340523200364932 has expired.
2013-09-29 08:58:30,489:45279(0x7f9024e3f940):ZOO_DEBUG@handle_error@1138: 
Calling a watcher for a ZOO_SESSION_EVENT and the 
state=ZOO_EXPIRED_SESSION_STATE
2013-09-29 08:58:30,489:45279(0x7f9024e3f940):ZOO_DEBUG@do_io@317: IO thread 
terminated
2013-09-29 
08:58:30,489:45279(0x7f901940):ZOO_DEBUG@process_completions@1765: Calling 
a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
2013-09-29 
08:58:30,489:45279(0x7f901940):ZOO_DEBUG@process_completions@1784: Calling 
COMPLETION_STAT for xid=0x5244d598 rc=-112
I0929 08:58:30.475751 45283 cgroups.cpp:1232] Successfully froze cgroup 
/cgroup/mesos/framework_201205082337-03-_executor_thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f_tag_8edc5ce9-20bc-4b09-bc92-d9bab7769738
 after 1 attempts
F0929 08:58:30.492090 45282 detector.cpp:266] Failed to create 
'/home/mesos/prod/master' in ZooKeeper: invalid zhandle state
*** Check failure stack trace: ***
I0929 08:58:30.492761 45292 cgroups.cpp:1208] Trying to thaw cgroup 
/cgroup/mesos/framework_201205082337-03-_executor_thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f_tag_8edc5ce9-20bc-4b09-bc92-d9bab7769738
I0929 08:58:31.144810 45291 cgroups_isolator.cpp:937] Executor 
thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f of 
framework 201205082337-03- terminated with status 9
I0929 08:58:32.791193 45292 cgroups.cpp:1318] Successfully thawed 
/cgroup/mesos/framework_201205082337-03-_executor_thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f_tag_8edc5ce9-20bc-4b09-bc92-d9bab7769738
I0929 08:58:33.675348 45298 cgroups_isolator.cpp:1275] Successfully destroyed 
cgroup 
mesos/framework_201205082337-03-_executor_thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f_tag_8edc5ce9-20bc-4b09-bc92-d9bab7769738
I0929 08:58:33.676269 45300 slave.cpp:2158] Executor 
'thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f' of 
framework 201205082337-03- has terminated with signal Killed
I0929 08:58:33.678154 45300 slave.cpp:1778] Handling status update TASK_FAILED 
(UUID: 4d90de5a-cdad-4bb8-ab93-7c4f185a0d24) for task 
1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f of framework 
201205082337-03- from @0.0.0.0:0
I0929 08:58:33.679175 45288 cgroups_isolator.cpp:700] Asked to update resources 
for an unknown/killed executor
I0929 08:58:33.679201 45300 status_update_manager.cpp:300] Received status 
update TASK_FAILED (UUID: 4d90de5a-cdad-4bb8-ab93-7c4f185a0d24) for task 
1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f of framework 
201205082337-03- 
I0929 08:58:33.680452 45300 status_update_manager.hpp:337] Checkpointing UPDATE 
for status update TASK_FAILED (UUID: 4d90de5a-cdad-4bb8-ab93-7c4f185a0d24) for 
task 1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f of 
framework 201205082337-03- 
@ 0x7f9035fb562d

[jira] [Commented] (MESOS-695) Introduce automated self-healing and coordinated repair to Mesos

2013-09-16 Thread David Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13768508#comment-13768508
 ] 

David Robinson commented on MESOS-695:
--

TBH I'm not sure this really belongs in Mesos. The questions I'd ask are:

1) How do you define that something is amiss?
2) How do you detect that something is amiss?
3) How do you know what the correct action to take is (restart a process vs 
reimage a host)?
4) How do you know how many hosts is too many to repair at once?
5) How do you repair hosts?
6) How do you reimage hosts?

Twitter have tools for all of these tasks already (1 and 2 are covered by our 
observability team, and 3, 4 and 5 would be covered by an internal tool called 
servermaint).

I suspect that if you try and solve these problems from within Mesos you'll 
reinvent a lot of wheels and alienate a lot of people. Most people using Mesos 
would already have an observability stack (so could answer questions 1 and 2). 
Questions 3, 4 and 5 are business logic, and most people would already have a 
provisioning system (question 6).

What you need to solve the problem can be implemented without any changes to 
Mesos core. Rather than adding this to Mesos core, you'd be better off building 
something on top, e.g. a separate tool that detects that something is amiss 
(via an observability stack) and takes corrective action. Essentially what they 
want is something like servermaint.
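
To make the "build it on top" idea concrete, here is a minimal sketch of how an external tool could answer questions 3 and 4 without touching Mesos. Everything in it is hypothetical: the `RepairCoordinator` class, the action names and the `max_concurrent` limit are illustrative assumptions, not servermaint's behaviour and not a Mesos API. Detection (questions 1 and 2) is assumed to come from an existing observability stack that calls `request_repair`.

```python
from dataclasses import dataclass, field
from typing import Optional

# Escalation ladder, mildest action first (hypothetical action names).
ACTIONS = ["restart_process", "reboot", "reimage"]


@dataclass
class RepairCoordinator:
    """Single authority that decides which hosts may be repaired, and how."""

    max_concurrent: int = 2  # assumed cap on hosts under repair at once
    in_repair: dict = field(default_factory=dict)  # host -> action in progress
    attempts: dict = field(default_factory=dict)   # host -> failed repairs so far

    def request_repair(self, host: str) -> Optional[str]:
        """Return the action to take, or None if the host has to wait."""
        if host in self.in_repair:
            return self.in_repair[host]
        if len(self.in_repair) >= self.max_concurrent:
            return None  # repairing more hosts right now would be "too many"
        # Escalate one step for every earlier repair that did not help.
        step = min(self.attempts.get(host, 0), len(ACTIONS) - 1)
        self.in_repair[host] = ACTIONS[step]
        return ACTIONS[step]

    def finish_repair(self, host: str, healthy: bool) -> None:
        """Record the outcome so the next request can escalate or reset."""
        self.in_repair.pop(host, None)
        if healthy:
            self.attempts.pop(host, None)
        else:
            self.attempts[host] = self.attempts.get(host, 0) + 1
```

The cap and the escalation order are exactly the business logic that differs per site, which is why it seems a better fit for a tool outside Mesos core.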

 Introduce automated self-healing and coordinated repair to Mesos
 

 Key: MESOS-695
 URL: https://issues.apache.org/jira/browse/MESOS-695
 Project: Mesos
  Issue Type: Task
  Components: master
Reporter: Jeff Currier

 One capability that is presently missing within the Mesos framework is the 
 ability for the system to self-heal.  Specifically, the ability for a master 
 to detect something is amiss with a particular host and then to attempt to 
 heal that host through a set of automated corrective actions such as:
 1) restarting process on the suspect node
 2) rebooting the node
 3) reimaging the node
 4) blacklisting node from future scheduled work
 By adding this capability and informing schedulers of the behavior of the 
 hosts within the system, it's believed that we can get Mesos to function in 
 more of a 'lights out' mode, thereby reducing the OpEx costs of running the 
 system today.
 It should be noted that a certain amount of coordination will be required in 
 order to ensure that we don't 'repair too many nodes at the same time'.  
 This logic will need to be centralized, such that there is a central 
 authority who is elected to make these decisions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira