Re: [VOTE] Release Apache Mesos 1.1.0 (rc1)
> as private bridge networks, and the services running in the container
> need to be exposed outside these isolated networks.
>
> The CHANGELOG for the release is available at:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.0-rc1
>
> The candidate for Mesos 1.1.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.1.0-rc1/mesos-1.1.0.tar.gz
>
> The tag to be voted on is 1.1.0-rc1:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.1.0-rc1
>
> The MD5 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.1.0-rc1/mesos-1.1.0.tar.gz.md5
>
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.1.0-rc1/mesos-1.1.0.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is up in Maven in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1158
>
> Please vote on releasing this package as Apache Mesos 1.1.0!
>
> The vote is open until Fri Oct 21 21:57:02 CEST 2016 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.1.0
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Alex & Till

--
David Robinson
SRE - Mesos
@daverobinson
Re: Review Request 30514: Rate limited the removal of slaves failing health checks.
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30514/#review70768
---

src/master/flags.hpp https://reviews.apache.org/r/30514/#comment116199
Should this be slaves per minute? I imagine most clusters in the wild would be relatively small, and the smaller the cluster the slower you'd want removals?

src/master/master.cpp https://reviews.apache.org/r/30514/#comment116173
s/is/was/ ?

- David Robinson

On Feb. 3, 2015, 2:39 a.m., Vinod Kone wrote:
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30514/
---
(Updated Feb. 3, 2015, 2:39 a.m.)

Review request for mesos, Ben Mahler, David Robinson, and Jie Yu.

Bugs: MESOS-1148 https://issues.apache.org/jira/browse/MESOS-1148

Repository: mesos

Description
---
The algorithm is simple. All the slave observers share a rate limiter whose rate is configured via command line. When a slave times out on health check, a permit is acquired to shut down the slave *but* pings continue to be sent. If a pong arrives before the permit is satisfied, the shutdown is cancelled.

Diffs
-
src/master/flags.hpp 6c18a1af625311ef149b5877b08f63c2b12c040d
src/master/master.hpp 337e00aa46ea127f3667e3383d631c3fb8e22f30
src/master/master.cpp 10056861b95ed9453c971787982db7d09f09f323
src/tests/partition_tests.cpp fea78016268b007590516798eb30ff423fd0ae58
src/tests/slave_tests.cpp e7e2af63da785644f3f7e6e23607c02be962a2c6

Diff: https://reviews.apache.org/r/30514/diff/

Testing
---
make check
Ran the new tests 100 times

Thanks,
Vinod Kone
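The algorithm in the description above can be sketched roughly as follows. This is a minimal Python model, not the code under review: the class names `RemovalRateLimiter` and `SlaveObserver`, the explicit `now` arguments and the one-permit-per-interval policy are all assumptions made for illustration.

```python
class RemovalRateLimiter:
    """Hypothetical shared limiter: grants one removal permit per `interval` seconds."""

    def __init__(self, interval):
        self.interval = interval
        self.next_free = 0.0

    def acquire(self, now):
        # Return the time at which a permit requested at `now` is satisfied.
        granted_at = max(now, self.next_free)
        self.next_free = granted_at + self.interval
        return granted_at


class SlaveObserver:
    """Hypothetical per-slave observer; all observers share one limiter."""

    def __init__(self, name, limiter):
        self.name = name
        self.limiter = limiter
        self.permit_at = None  # when the pending shutdown permit is satisfied

    def on_health_check_timeout(self, now):
        # Request a permit, but keep pinging the slave in the meantime.
        if self.permit_at is None:
            self.permit_at = self.limiter.acquire(now)

    def on_pong(self, now):
        # A pong that arrives before the permit is satisfied cancels the shutdown.
        if self.permit_at is not None and now < self.permit_at:
            self.permit_at = None

    def should_shutdown(self, now):
        return self.permit_at is not None and now >= self.permit_at
```

In this sketch a cancelled permit still uses up its slot in the limiter, so later removals are delayed slightly more than strictly necessary; whether the real patch behaves that way is not stated in the thread.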
Review Request 30328: Added external_log_file option.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30328/ --- Review request for mesos and Ben Mahler. Bugs: MESOS-2193 https://issues.apache.org/jira/browse/MESOS-2193 Repository: mesos-git Description --- Added external_log_file option. This is a continuation of the review started here: https://reviews.apache.org/r/29071 Diffs - src/logging/flags.hpp 11efb84cc2c509f852f8ba20f16a366b4cb5810f src/master/http.cpp 46890bed05d7c4b63e1f7be5bb35217173e0ade8 src/master/master.cpp ab6d1d17367f199191b7c77bccec73ec3b112d4f src/slave/http.cpp d1cf8a68fab9a2df44f6c753683ad37fd4b1a1f9 src/slave/slave.cpp fca83b3977b95ddda30f9830da10e124b5c605e6 src/webui/master/static/js/controllers.js 41a70a80442501a2bf7b217939dbe504662941d2 Diff: https://reviews.apache.org/r/30328/diff/ Testing --- Ran locally. Thanks, David Robinson
Review Request 30347: Added additional details to developers guide.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30347/ --- Review request for mesos, Ben Mahler and Dominic Hamon. Bugs: MESOS-2282 https://issues.apache.org/jira/browse/MESOS-2282 Repository: mesos-git Description --- Added additional details to developers guide. These two items tripped me up when trying to submit a patch. This also fixes whitespace inconsistencies (mixed tabs and spaces). Diffs - docs/mesos-developers-guide.md 036a6fd336c1173be73393e5ee62dba208378518 Diff: https://reviews.apache.org/r/30347/diff/ Testing --- Thanks, David Robinson
Re: Review Request 29071: added webui_log option
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29071/#review69433 --- Ping? - David Robinson On Dec. 16, 2014, 10:37 p.m., David Robinson wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29071/ --- (Updated Dec. 16, 2014, 10:37 p.m.) Review request for mesos. Bugs: MESOS-2193 https://issues.apache.org/jira/browse/MESOS-2193 Repository: mesos-git Description --- added webui_log option Diffs - src/master/flags.hpp 1cea50c02f3ad7de1e1ae91d65d1accdb9af7b03 src/master/http.cpp 46890bed05d7c4b63e1f7be5bb35217173e0ade8 src/master/master.cpp 0f55a5cc2d6845cbaace718a48f771d80aad0e6e src/slave/flags.hpp 670997dc3a702cd5edf33f2e5824c5e4dfe4ecef src/slave/http.cpp d1cf8a68fab9a2df44f6c753683ad37fd4b1a1f9 src/slave/slave.cpp 50b57819b55bdcdb9f49f20648199badc4d3f37b src/webui/master/static/js/controllers.js 41a70a80442501a2bf7b217939dbe504662941d2 Diff: https://reviews.apache.org/r/29071/diff/ Testing --- Ran locally. Thanks, David Robinson
Re: Review Request 29071: added webui_log option
On Dec. 16, 2014, 12:58 a.m., Dominic Hamon wrote:
src/slave/slave.cpp, line 459 https://reviews.apache.org/r/29071/diff/1/?file=792487#file792487line459
might want to check that the file exists, as per the log_dir stanza above.

The stanza above just returns a filename (created from log_dir, log level etc), it doesn't check that the file exists. FilesProcess::attach is where the file existence check occurs:

    Future<Nothing> FilesProcess::attach(const string& path, const string& name)
    {
      Result<string> result = os::realpath(path);

      if (!result.isSome()) {
        return Failure(
            "Failed to get realpath of '" + path + "': " +
            (result.isError() ? result.error() : "No such file or directory"));
      }

      // Make sure we have permissions to read the file/dir.
      ...

On Dec. 16, 2014, 12:58 a.m., Dominic Hamon wrote:
src/slave/slave.cpp, line 446 https://reviews.apache.org/r/29071/diff/1/?file=792487#file792487line446
so webui_log overrides log_dir? should check if both are set and error.

Done.

On Dec. 16, 2014, 12:58 a.m., Dominic Hamon wrote:
src/webui/master/static/js/controllers.js, line 405 https://reviews.apache.org/r/29071/diff/1/?file=792488#file792488line405
or 'webui_log'

Done.

- David

---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29071/#review65162
---

On Dec. 16, 2014, 12:51 a.m., David Robinson wrote:
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29071/
---
(Updated Dec. 16, 2014, 12:51 a.m.)

Review request for mesos.
Bugs: MESOS-2193 https://issues.apache.org/jira/browse/MESOS-2193 Repository: mesos-git Description --- added webui_log option Diffs - src/master/flags.hpp 1cea50c02f3ad7de1e1ae91d65d1accdb9af7b03 src/master/http.cpp 46890bed05d7c4b63e1f7be5bb35217173e0ade8 src/master/master.cpp 0f55a5cc2d6845cbaace718a48f771d80aad0e6e src/slave/flags.hpp 670997dc3a702cd5edf33f2e5824c5e4dfe4ecef src/slave/http.cpp d1cf8a68fab9a2df44f6c753683ad37fd4b1a1f9 src/slave/slave.cpp 50b57819b55bdcdb9f49f20648199badc4d3f37b src/webui/master/static/js/controllers.js 41a70a80442501a2bf7b217939dbe504662941d2 Diff: https://reviews.apache.org/r/29071/diff/ Testing --- Ran locally. Thanks, David Robinson
Re: Review Request 29071: added webui_log option
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29071/ --- (Updated Dec. 16, 2014, 10:37 p.m.) Review request for mesos. Changes --- Dominic's feedback. Bugs: MESOS-2193 https://issues.apache.org/jira/browse/MESOS-2193 Repository: mesos-git Description --- added webui_log option Diffs (updated) - src/master/flags.hpp 1cea50c02f3ad7de1e1ae91d65d1accdb9af7b03 src/master/http.cpp 46890bed05d7c4b63e1f7be5bb35217173e0ade8 src/master/master.cpp 0f55a5cc2d6845cbaace718a48f771d80aad0e6e src/slave/flags.hpp 670997dc3a702cd5edf33f2e5824c5e4dfe4ecef src/slave/http.cpp d1cf8a68fab9a2df44f6c753683ad37fd4b1a1f9 src/slave/slave.cpp 50b57819b55bdcdb9f49f20648199badc4d3f37b src/webui/master/static/js/controllers.js 41a70a80442501a2bf7b217939dbe504662941d2 Diff: https://reviews.apache.org/r/29071/diff/ Testing --- Ran locally. Thanks, David Robinson
Re: Review Request 29071: added webui_log option
On Dec. 16, 2014, 1:44 a.m., Mesos ReviewBot wrote: Bad patch! Reviews applied: [29071] Failed command: ./support/apply-review.sh -n -r 29071 Error: 2014-12-16 01:44:48 URL:https://reviews.apache.org/r/29071/diff/raw/ [5226/5226] - 29071.patch [1] error: master/flags.hpp: does not exist in index error: master/http.cpp: does not exist in index error: master/master.cpp: does not exist in index error: slave/flags.hpp: does not exist in index error: slave/http.cpp: does not exist in index error: slave/slave.cpp: does not exist in index error: webui/master/static/js/controllers.js: does not exist in index Failed to apply patch Guess you don't like 'noprefix = true'?! - David --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29071/#review65165 --- On Dec. 16, 2014, 10:37 p.m., David Robinson wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29071/ --- (Updated Dec. 16, 2014, 10:37 p.m.) Review request for mesos. Bugs: MESOS-2193 https://issues.apache.org/jira/browse/MESOS-2193 Repository: mesos-git Description --- added webui_log option Diffs - src/master/flags.hpp 1cea50c02f3ad7de1e1ae91d65d1accdb9af7b03 src/master/http.cpp 46890bed05d7c4b63e1f7be5bb35217173e0ade8 src/master/master.cpp 0f55a5cc2d6845cbaace718a48f771d80aad0e6e src/slave/flags.hpp 670997dc3a702cd5edf33f2e5824c5e4dfe4ecef src/slave/http.cpp d1cf8a68fab9a2df44f6c753683ad37fd4b1a1f9 src/slave/slave.cpp 50b57819b55bdcdb9f49f20648199badc4d3f37b src/webui/master/static/js/controllers.js 41a70a80442501a2bf7b217939dbe504662941d2 Diff: https://reviews.apache.org/r/29071/diff/ Testing --- Ran locally. Thanks, David Robinson
Review Request 29071: added webui_log option
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29071/ --- Review request for mesos. Bugs: MESOS-2193 https://issues.apache.org/jira/browse/MESOS-2193 Repository: mesos-git Description --- added webui_log option Diffs - src/master/flags.hpp 1cea50c02f3ad7de1e1ae91d65d1accdb9af7b03 src/master/http.cpp 46890bed05d7c4b63e1f7be5bb35217173e0ade8 src/master/master.cpp 0f55a5cc2d6845cbaace718a48f771d80aad0e6e src/slave/flags.hpp 670997dc3a702cd5edf33f2e5824c5e4dfe4ecef src/slave/http.cpp d1cf8a68fab9a2df44f6c753683ad37fd4b1a1f9 src/slave/slave.cpp 50b57819b55bdcdb9f49f20648199badc4d3f37b src/webui/master/static/js/controllers.js 41a70a80442501a2bf7b217939dbe504662941d2 Diff: https://reviews.apache.org/r/29071/diff/ Testing --- Ran locally. Thanks, David Robinson
Re: Review Request 24902: Fixed the build error in routing tests.
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24902/#review51305
---

src/tests/routing_tests.cpp https://reviews.apache.org/r/24902/#comment89457
Why is '#if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 24)' being removed?

- David Robinson

On Aug. 20, 2014, 6:21 p.m., Jie Yu wrote:
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24902/
---
(Updated Aug. 20, 2014, 6:21 p.m.)

Review request for mesos, Ian Downes and Vinod Kone.

Repository: mesos-git

Description
---
Realized that checking linux version is not very useful. We may have a new kernel header but an old glibc in some cases.

Diffs
-
src/tests/routing_tests.cpp 020676cac092aae63fcb45f37b206323db100f95

Diff: https://reviews.apache.org/r/24902/diff/

Testing
---
sudo make check

Thanks,
Jie Yu
[jira] [Commented] (MESOS-890) Figure out a way to migrate a live Mesos cluster to a different ZooKeeper cluster
[ https://issues.apache.org/jira/browse/MESOS-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913694#comment-13913694 ]

David Robinson commented on MESOS-890:

I think option #1 is the better approach. Writing a heap of code to correctly handle option #2 seems like a lot of effort for something which should very rarely happen. IIRC the 75 second constraint can be adjusted in Aurora too.

[~rgs], how difficult is #1 from the ZK perspective? I'm happy to test this in a dev/test cluster.

Figure out a way to migrate a live Mesos cluster to a different ZooKeeper cluster

Key: MESOS-890
URL: https://issues.apache.org/jira/browse/MESOS-890
Project: Mesos
Issue Type: Improvement
Components: master
Reporter: Raul Gutierrez Segales
Assignee: Raul Gutierrez Segales

I've been chatting with [~vinodkone] about approaching a live ZK cluster migration. Here are the options we came up with. For the descriptions we treat `zk1` as the current working cluster, `obs` as a bunch of ZooKeeper Observers [1] and `zk2` as the new cluster to which we need to migrate.

Approach #1: Using Observers

With this option we need to:
* add obs to zk1
* restart slaves to have them use obs to find their master
* restart the framework having it use obs to find the mesos master
* restart the mesos masters having them use obs to perform their election
* we then stop all ZK obs and remove their data (since they will need to sync up with an entirely new cluster, we need to lose the old data)
* we restart ZK obs having them be part of zk2
* at this point the slaves, the framework and the masters can reach the ZK obs again and an election happens
* optionally you can restart slaves, the framework and masters again using zk2 instead of the ZK obs if you wanted to decommission them.

This assumes that we can do the last three steps in 75 secs (75 secs being the slave health check timeout).
This is a reasonable assumption if the data size in zk2 is small enough to ensure that the ZK obs can sync up quickly with zk2. If zk2 is a new cluster with no data then this should be very fast.

The good things about this approach are:
* no mesos code change
* it is very easy to roll back halfway through, if need be

The hard issues are:
* Manipulating the ZK obs (i.e.: stopping, removing the data from zk1 and starting again) needs to be done with care. Messing up configs or not removing the data from zk1 on any of the ZK obs will cause problems
* we need to restart all slaves to have them use the ZK obs instead of connecting to zk1 directly. But with slave recovery this isn't an issue, just an extra step.
* same thing for the framework and the masters

Approach #2: Dual publishing from mesos masters

With this option we would augment the election handling code in mesos masters to have it deal with the notion of primary and secondary ZK clusters. Master registration and election would then work as follows:
* create an ephemeral|sequential znode in zk1 (i.e.: /path/to/znode/mesos_23)
* create an ephemeral, but not sequential, znode in zk2 with the exact same path as what was created in zk1 (i.e.: /path/to/znode/mesos_23)
* make sure both sessions, in zk1 and zk2, are always in the same state (i.e.: if one expires, the other one should be closed, etc.)

For now, let's omit a few implementation details which might need extra care and assume we can make this work consistently in such a way that zk2 accurately reflects elections that happen in zk1. This means that regardless of being connected to zk1 or zk2, you always get the same master.
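The dual-publishing invariant described above can be illustrated with a small in-memory model. This is only a sketch: `FakeZk` and `DualPublisher` are hypothetical names, no real ZooKeeper client is involved, and the rule that the lowest sequence number wins the election is an assumption of the sketch.

```python
class FakeZk:
    """In-memory stand-in for one ZooKeeper cluster (not a real client)."""

    def __init__(self):
        self.znodes = set()
        self.next_seq = 0

    def create_sequential(self, prefix):
        # Mimics an ephemeral|sequential znode: the cluster picks the suffix.
        path = f"{prefix}{self.next_seq}"
        self.next_seq += 1
        self.znodes.add(path)
        return path

    def create(self, path):
        # Mimics an ephemeral, non-sequential znode at an exact path.
        self.znodes.add(path)

    def delete(self, path):
        self.znodes.discard(path)

    def leader(self):
        # Assumption: the znode with the lowest sequence number is the leader.
        if not self.znodes:
            return None
        return min(self.znodes, key=lambda p: int(p.rsplit("_", 1)[1]))


class DualPublisher:
    """Hypothetical helper that mirrors every registration from zk1 into zk2."""

    def __init__(self, zk1, zk2):
        self.zk1 = zk1
        self.zk2 = zk2

    def register(self, prefix="/path/to/znode/mesos_"):
        path = self.zk1.create_sequential(prefix)  # zk1 assigns the sequence
        self.zk2.create(path)                      # zk2 gets the exact same path
        return path

    def session_lost(self, path):
        # If either session goes away, drop the znode from both clusters.
        self.zk1.delete(path)
        self.zk2.delete(path)
```

As long as every registration and every session loss goes through `DualPublisher`, a client asking either cluster for the leader gets the same answer, which is the property the migration steps rely on.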
Once we have this, the migration steps would be:
* restart slaves to have them use zk2 where masters can be found by virtue of what we implemented above
* restart the framework so that it finds the mesos master in zk2
* stop all mesos masters (they all need to be stopped before moving to the next step)
* start all mesos masters using zk2 as their primary and only cluster

Again, this assumes we can do the last two steps in 75 secs (or if we needed to, we could bump the slave health check timeout). Which, again, sounds achievable given that masters have no state and their start-up time is very short.

The good things about this approach are:
- no tinkering with extra ZK servers nor with ZK configs

The hard issues are:
- extra code needs to be added to the election handling bits of mesos master to address a very rare, but probable, use-case of cluster migration. It might take a bit of time to get that code right.
- it's easier to end up with a bad state if any of the mesos masters ends up with a bad config or is restarted earlier and ends up publishing differently than
[jira] [Created] (MESOS-1028) expose internal metrics
David Robinson created MESOS-1028:

Summary: expose internal metrics
Key: MESOS-1028
URL: https://issues.apache.org/jira/browse/MESOS-1028
Project: Mesos
Issue Type: Improvement
Components: general
Reporter: David Robinson

Mesos should export statistics that provide visibility into its internals. This would allow users to detect numerous problems without resorting to trolling log files.

E.g. export counters of (some of these already exist, most don't):
* cgroup create
* cgroup destroy
* cgroup destroy attempts
* resource offers made
* resource offers accepted
* tasks launched
* tasks destroyed
* tasks lost
* writes to replicated log
* queue length

export 50th, 90th, 95th, 99th percentile of time taken to:
* start mesos (reach a certain state)
* move tasks between two given states (starting -> started)
* create a cgroup
* destroy a cgroup
* send a message from slave to master
* start a task
* stop a task
* register in zookeeper
* write to the replicated log

Ideally all these metrics would be exposed via a HTTP+JSON endpoint. See [metrics|http://metrics.codahale.com/getting-started/] for an example (albeit Java) library (or [medida|http://dln.github.io/medida/] for an unmaintained(?) c++ port).

We've previously seen problems where tasks were stuck in cgroup destroy with 30,000 attempts. Exposing metrics would allow us to easily detect problems like this.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
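To make the request above concrete, here is a minimal sketch of a registry that keeps counters and timing samples and renders them as JSON. Everything in it is illustrative: the `Metrics` class, the metric names and the nearest-rank percentile rule are assumptions, not part of Mesos or of the libraries linked above.

```python
import json


class Metrics:
    """Hypothetical in-process registry of counters and timing samples."""

    PERCENTILES = (50, 90, 95, 99)

    def __init__(self):
        self.counters = {}
        self.timings = {}

    def increment(self, name, by=1):
        self.counters[name] = self.counters.get(name, 0) + by

    def record(self, name, seconds):
        self.timings.setdefault(name, []).append(seconds)

    @staticmethod
    def percentile(samples, p):
        # Nearest-rank percentile, using integer arithmetic for the rank.
        ordered = sorted(samples)
        rank = max(1, (p * len(ordered) + 99) // 100)
        return ordered[rank - 1]

    def to_json(self):
        # Body that an HTTP+JSON endpoint could serve.
        out = dict(self.counters)
        for name, samples in self.timings.items():
            for p in self.PERCENTILES:
                out[f"{name}_p{p}"] = self.percentile(samples, p)
        return json.dumps(out, sort_keys=True)
```

With counters such as `cgroup_destroy_attempts` exported this way, a runaway value like the 30,000 attempts mentioned above would show up in a dashboard or alert rather than only in the logs.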
[jira] [Commented] (MESOS-1028) expose internal metrics
[ https://issues.apache.org/jira/browse/MESOS-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908729#comment-13908729 ]

David Robinson commented on MESOS-1028:

sgtm

expose internal metrics

Key: MESOS-1028
URL: https://issues.apache.org/jira/browse/MESOS-1028
Project: Mesos
Issue Type: Improvement
Components: general
Reporter: David Robinson

Mesos should export statistics that provide visibility into its internals. This would allow users to detect numerous problems without resorting to trolling log files.

E.g. export counters of (some of these already exist, most don't):
* cgroup create
* cgroup destroy
* cgroup destroy attempts
* resource offers made
* resource offers accepted
* tasks launched
* tasks destroyed
* tasks lost
* writes to replicated log
* queue length

export 50th, 90th, 95th, 99th percentile of time taken to:
* start mesos (reach a certain state)
* move tasks between two given states (starting -> started)
* create a cgroup
* destroy a cgroup
* send a message from slave to master
* start a task
* stop a task
* register in zookeeper
* write to the replicated log

Ideally all these metrics would be exposed via a HTTP+JSON endpoint. See [metrics|http://metrics.codahale.com/getting-started/] for an example (albeit Java) library (or [medida|http://dln.github.io/medida/] for an unmaintained(?) c++ port).

We've previously seen problems where tasks were stuck in cgroup destroy with 30,000 attempts. Exposing metrics would allow us to easily detect problems like this.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
Re: Review Request 17442: Added 'active_tasks' stat to master stats endpoint.
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/17442/#review32962
---

Ship it!

src/master/http.cpp https://reviews.apache.org/r/17442/#comment62039
s/launched_tasks/launched_tasks_gauge/ ? (or active_tasks_gauge, please yourself)

All the other *_tasks stats are counters, so calling this something_tasks implies that it is one too. This isn't a counter, so it's better to explicitly call it something else, such as something_tasks_gauge.

- David Robinson

On Jan. 28, 2014, 2:59 a.m., Vinod Kone wrote:
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/17442/
---
(Updated Jan. 28, 2014, 2:59 a.m.)

Review request for mesos, Benjamin Hindman, Ben Mahler, David Robinson, and Niklas Nielsen.

Bugs: MESOS-772 https://issues.apache.org/jira/browse/MESOS-772

Repository: mesos-git

Description
---
See summary. I opted for active tasks instead of running tasks because I didn't want the stats endpoint to loop through all tasks to figure out if a task is in RUNNING state. I think active is useful for most debugging purposes.

Diffs
-
src/master/http.cpp 546e91dbb9c8ee1014bb4f0b3be2714ad6a2d520

Diff: https://reviews.apache.org/r/17442/diff/

Testing
---
make check

Thanks,
Vinod Kone
Re: Review Request 17442: Added 'active_tasks' stat to master stats endpoint.
On Jan. 28, 2014, 4:55 a.m., Ben Mahler wrote:
I hope the past vs present tense will be enough to not make the monotonic vs instantaneous stats confusing for those consuming this data.

I'd find it confusing. active_tasks_gauge is preferable.

- David

---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/17442/#review32952
---

On Jan. 28, 2014, 2:59 a.m., Vinod Kone wrote:
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/17442/
---
(Updated Jan. 28, 2014, 2:59 a.m.)

Review request for mesos, Benjamin Hindman, Ben Mahler, David Robinson, and Niklas Nielsen.

Bugs: MESOS-772 https://issues.apache.org/jira/browse/MESOS-772

Repository: mesos-git

Description
---
See summary. I opted for active tasks instead of running tasks because I didn't want the stats endpoint to loop through all tasks to figure out if a task is in RUNNING state. I think active is useful for most debugging purposes.

Diffs
-
src/master/http.cpp 546e91dbb9c8ee1014bb4f0b3be2714ad6a2d520

Diff: https://reviews.apache.org/r/17442/diff/

Testing
---
make check

Thanks,
Vinod Kone
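The monotonic vs instantaneous distinction behind the naming suggestion can be shown with a tiny sketch. The `TaskStats` class is hypothetical; only the stat names `started_tasks`, `finished_tasks` and `active_tasks_gauge` come from the thread and the existing stats output.

```python
class TaskStats:
    """Hypothetical holder for two counters and one gauge."""

    def __init__(self):
        self.started_tasks = 0       # counter: only ever increases
        self.finished_tasks = 0      # counter: only ever increases
        self.active_tasks_gauge = 0  # gauge: current value, moves up and down

    def task_started(self):
        self.started_tasks += 1
        self.active_tasks_gauge += 1

    def task_finished(self):
        self.finished_tasks += 1
        self.active_tasks_gauge -= 1
```

A consumer can safely take the rate of change of a counter, but doing the same with a gauge gives nonsense, which is why a `_gauge` suffix helps whoever graphs these values.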
Re: Review Request 17443: Added queued and launched tasks to slave stats.
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/17443/#review32964
---

Ship it!

src/slave/http.cpp https://reviews.apache.org/r/17443/#comment62041
Same comment as RB 17442. queued_tasks_gauge and launched_tasks_gauge (or active_tasks_gauge, same as whatever gets used in the master).

- David Robinson

On Jan. 28, 2014, 3:02 a.m., Vinod Kone wrote:
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/17443/
---
(Updated Jan. 28, 2014, 3:02 a.m.)

Review request for mesos, Adam B, Benjamin Hindman, Ben Mahler, David Robinson, and Niklas Nielsen.

Bugs: MESOS-772 https://issues.apache.org/jira/browse/MESOS-772

Repository: mesos-git

Description
---
See summary.

Diffs
-
src/slave/http.cpp c8357e214d2adf2cd712072f58d07b07badb79dc

Diff: https://reviews.apache.org/r/17443/diff/

Testing
---
make

Thanks,
Vinod Kone
[jira] [Commented] (MESOS-824) export running config via http+json
[ https://issues.apache.org/jira/browse/MESOS-824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828067#comment-13828067 ]

David Robinson commented on MESOS-824:

[~nnielsen], exposing all the flags would be useful.

[~vinodkone], that's great, we'll be able to use this info to make our deploys more reliable. Having both the flags and FrameworkInfo exposed helps.

export running config via http+json

Key: MESOS-824
URL: https://issues.apache.org/jira/browse/MESOS-824
Project: Mesos
Issue Type: Improvement
Reporter: David Robinson
Priority: Minor

Currently there's no way of knowing whether a slave is actually checkpointing (except for grepping through logs, which isn't ideal). The --checkpoint flag on the command line can't be used to detect this since checkpointing could be enabled on the slave but not in the framework. Because of this we cannot detect whether slave recovery is actually enabled and therefore can't tell whether it's safe to restart a slave.

Please export the running config, preferably via a json endpoint.

--
This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (MESOS-824) export running config via http+json
David Robinson created MESOS-824: Summary: export running config via http+json Key: MESOS-824 URL: https://issues.apache.org/jira/browse/MESOS-824 Project: Mesos Issue Type: Improvement Reporter: David Robinson Priority: Minor Currently there's no way of knowing whether a slave is actually checkpointing (except for grepping through logs, which isn't ideal). The --checkpoint flag on the command line can't be used to detect this since checkpointing could be enabled on the slave but not in the framework. Because of this we cannot detect whether slave recovery is actually enabled and therefore can't tell whether it's safe to restart a slave. Please export the running config, preferably via a json endpoint. -- This message was sent by Atlassian JIRA (v6.1#6144)
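If the running config were exported as requested, a deploy tool could make the restart decision along these lines. This is a sketch under assumptions: the function name `safe_to_restart` and the JSON shapes (a flags object with a `checkpoint` entry, and a list of frameworks each carrying a `checkpoint` field) are hypothetical, since the endpoint does not exist yet at the time of this ticket.

```python
def _is_true(value):
    # Flags may be exported as strings ("true") or as real booleans.
    return str(value).strip().lower() == "true"


def safe_to_restart(slave_flags, frameworks):
    """Return True only if the slave and every framework on it checkpoint.

    slave_flags: dict parsed from a hypothetical flags endpoint.
    frameworks:  list of dicts parsed from hypothetical FrameworkInfo output.
    """
    if not _is_true(slave_flags.get("checkpoint", False)):
        return False
    return all(_is_true(fw.get("checkpoint", False)) for fw in frameworks)
```

The point of checking both sides is the one made in the ticket: the slave flag alone says nothing about whether the frameworks running on it have checkpointing enabled.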
[jira] [Commented] (MESOS-780) Adding support for 3rd party performance and health monitoring.
[ https://issues.apache.org/jira/browse/MESOS-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814307#comment-13814307 ]

David Robinson commented on MESOS-780:

You can already solve the push problem quite easily, with no changes to Mesos. eg:
https://collectd.org/wiki/index.php/Plugin:cURL-JSON
https://collectd.org/wiki/index.php/Plugin:Write_Graphite

Adding support for 3rd party performance and health monitoring.

Key: MESOS-780
URL: https://issues.apache.org/jira/browse/MESOS-780
Project: Mesos
Issue Type: Improvement
Components: framework
Reporter: Bernardo Gomez Palacio

User Story: As a SysAdmin I should be able to monitor Mesos (Masters and Slaves) with 3rd party tools such as:
* [Ganglia|http://ganglia.sourceforge.net/]
* [Graphite|http://graphite.wikidot.com/]
* [Nagios|http://www.nagios.org/]
* [Zabbix|http://www.zabbix.com/]

--
This message was sent by Atlassian JIRA (v6.1#6144)
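The collectd plugins linked above do the pull-then-push for you. For illustration only, the same idea can be sketched by hand: take the parsed output of a stats endpoint and turn it into Graphite plaintext lines of the form `path value timestamp`. The function names, the `mesos.slave1` prefix and the sample values are assumptions of this sketch.

```python
import socket


def to_graphite_lines(stats, prefix, timestamp):
    """Turn a flat dict of numeric stats into Graphite plaintext lines."""
    lines = []
    for name in sorted(stats):
        value = stats[name]
        # Skip anything that is not a plain number (booleans included).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        lines.append(f"{prefix}.{name} {value} {int(timestamp)}")
    return lines


def push(lines, host, port=2003):
    # Sends the lines to a Graphite plaintext listener (2003 is the usual
    # default port); not called here, since it needs a running server.
    payload = "".join(line + "\n" for line in lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(payload)
```

For example, `to_graphite_lines({"staged_tasks": 2}, "mesos.slave1", 1383600000)` yields `["mesos.slave1.staged_tasks 2 1383600000"]`.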
[jira] [Created] (MESOS-772) expose count of running tasks
David Robinson created MESOS-772:

Summary: expose count of running tasks
Key: MESOS-772
URL: https://issues.apache.org/jira/browse/MESOS-772
Project: Mesos
Issue Type: Improvement
Reporter: David Robinson
Priority: Minor

The stats endpoint doesn't show the current number of running tasks:

$ curl -s http://localhost:5051/slave\(1\)/stats.json | python2.7 -m json.tool
{
    "failed_tasks": 0,
    "finished_tasks": 0,
    "invalid_status_updates": 0,
    "killed_tasks": 0,
    "lost_tasks": 0,
    "recovery_errors": 0,
    "registered": 1,
    "staged_tasks": 2,
    "started_tasks": 0,
    "total_frameworks": 1,
    "uptime": 1168.518182912,
    "valid_status_updates": 0
}

Can this be added please?

--
This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (MESOS-712) invalid zhandle state
David Robinson created MESOS-712:

Summary: invalid zhandle state
Key: MESOS-712
URL: https://issues.apache.org/jira/browse/MESOS-712
Project: Mesos
Issue Type: Bug
Affects Versions: 0.14.0
Reporter: David Robinson

{noformat:title=log snippet}
2013-09-29 08:58:30,445:45279(0x7f9024e3f940):ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 16533ms
2013-09-29 08:58:30,445:45279(0x7f9024e3f940):ZOO_ERROR@handle_socket_error_msg@1528: Socket [192.168.0.1:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 13199ms)
I0929 08:58:17.544836 45283 cgroups.cpp:1193] Trying to freeze cgroup /cgroup/mesos/framework_201205082337-03-_executor_thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f_tag_8edc5ce9-20bc-4b09-bc92-d9bab7769738
2013-09-29 08:58:30,474:45279(0x7f9024e3f940):ZOO_DEBUG@handle_error@1141: Calling a watcher for a ZOO_SESSION_EVENT and the state=CONNECTING_STATE
2013-09-29 08:58:30,475:45279(0x7f9024e3f940):ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 16564ms
2013-09-29 08:58:30,475:45279(0x7f901940):ZOO_DEBUG@process_completions@1765: Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
I0929 08:58:30.445508 45282 detector.cpp:251] Trying to create path '/home/mesos/prod/master' in ZooKeeper
2013-09-29 08:58:30,483:45279(0x7f9024e3f940):ZOO_INFO@check_events@1585: initiated connection to server [192.168.0.2:2181]
2013-09-29 08:58:30,488:45279(0x7f9031267940):ZOO_DEBUG@zoo_awexists@2587: Sending request xid=0x5244d598 for path [/home/mesos/prod/master] to 192.168.0.2:2181
2013-09-29 08:58:30,488:45279(0x7f9024e3f940):ZOO_ERROR@handle_socket_error_msg@1621: Socket [192.168.0.2:2181] zk retcode=-112, errno=116(Stale NFS file handle): sessionId=0x340523200364932 has expired.
2013-09-29 08:58:30,489:45279(0x7f9024e3f940):ZOO_DEBUG@handle_error@1138: Calling a watcher for a ZOO_SESSION_EVENT and the state=ZOO_EXPIRED_SESSION_STATE
2013-09-29 08:58:30,489:45279(0x7f9024e3f940):ZOO_DEBUG@do_io@317: IO thread terminated
2013-09-29 08:58:30,489:45279(0x7f901940):ZOO_DEBUG@process_completions@1765: Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
2013-09-29 08:58:30,489:45279(0x7f901940):ZOO_DEBUG@process_completions@1784: Calling COMPLETION_STAT for xid=0x5244d598 rc=-112
I0929 08:58:30.475751 45283 cgroups.cpp:1232] Successfully froze cgroup /cgroup/mesos/framework_201205082337-03-_executor_thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f_tag_8edc5ce9-20bc-4b09-bc92-d9bab7769738 after 1 attempts
F0929 08:58:30.492090 45282 detector.cpp:266] Failed to create '/home/mesos/prod/master' in ZooKeeper: invalid zhandle state
*** Check failure stack trace: ***
I0929 08:58:30.492761 45292 cgroups.cpp:1208] Trying to thaw cgroup /cgroup/mesos/framework_201205082337-03-_executor_thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f_tag_8edc5ce9-20bc-4b09-bc92-d9bab7769738
I0929 08:58:31.144810 45291 cgroups_isolator.cpp:937] Executor thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f of framework 201205082337-03- terminated with status 9
I0929 08:58:32.791193 45292 cgroups.cpp:1318] Successfully thawed /cgroup/mesos/framework_201205082337-03-_executor_thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f_tag_8edc5ce9-20bc-4b09-bc92-d9bab7769738
I0929 08:58:33.675348 45298 cgroups_isolator.cpp:1275] Successfully destroyed cgroup mesos/framework_201205082337-03-_executor_thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f_tag_8edc5ce9-20bc-4b09-bc92-d9bab7769738
I0929 08:58:33.676269 45300 slave.cpp:2158] Executor 'thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f' of framework 201205082337-03- has terminated with signal Killed
I0929 08:58:33.678154 45300 slave.cpp:1778] Handling status update TASK_FAILED (UUID: 4d90de5a-cdad-4bb8-ab93-7c4f185a0d24) for task 1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f of framework 201205082337-03- from @0.0.0.0:0
I0929 08:58:33.679175 45288 cgroups_isolator.cpp:700] Asked to update resources for an unknown/killed executor
I0929 08:58:33.679201 45300 status_update_manager.cpp:300] Received status update TASK_FAILED (UUID: 4d90de5a-cdad-4bb8-ab93-7c4f185a0d24) for task 1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f of framework 201205082337-03-
I0929 08:58:33.680452 45300 status_update_manager.hpp:337] Checkpointing UPDATE for status update TASK_FAILED (UUID: 4d90de5a-cdad-4bb8-ab93-7c4f185a0d24) for task 1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f of framework 201205082337-03-
@ 0x7f9035fb562d
[jira] [Commented] (MESOS-695) Introduce automated self-healing and coordinated repair to Mesos
[ https://issues.apache.org/jira/browse/MESOS-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13768508#comment-13768508 ]

David Robinson commented on MESOS-695:

TBH I'm not sure this really belongs in Mesos. The questions I'd ask are:
1) How do you define that something is amiss?
2) How do you detect that something is amiss?
3) How do you know what the correct action to take is (restart a process vs reimage a host)?
4) How do you know what number of hosts to repair is too many?
5) How do you repair hosts?
6) How do you reimage hosts?

Twitter has tools for all of these tasks already (1 and 2 are covered by our observability team, and 3, 4 and 5 would be covered by an internal tool called servermaint). I suspect that if you try to solve these problems from within Mesos you'll reinvent a lot of wheels and alienate a lot of people. Most people using Mesos would already have an observability stack (so could answer questions 1 and 2). Questions 3, 4 and 5 are business logic, and most people would already have a provisioning system (question 6).

What you need to solve the problem can be implemented without any changes to Mesos core. Rather than add this to Mesos core you'd be better off building something on top, e.g. have a separate tool that detects something is amiss (an observability stack) and takes corrective action. Essentially what they want is something like servermaint.

Introduce automated self-healing and coordinated repair to Mesos

Key: MESOS-695
URL: https://issues.apache.org/jira/browse/MESOS-695
Project: Mesos
Issue Type: Task
Components: master
Reporter: Jeff Currier

One capability that is presently missing within the Mesos framework is the ability for the system to self-heal.
Specifically, the ability for a master to detect something is amiss with a particular host and then to attempt to heal that host through a set of automated corrective actions such as:
1) restarting process on the suspect node
2) rebooting the node
3) reimaging the node
4) blacklisting node from future scheduled work

By adding in this capability and informing schedulers of the behavior of the hosts within the system it's believed that we can get Mesos to function in more of a 'lights out' mode, thereby reducing the OpEx costs for running the system today.

It should be noted that a certain amount of coordination will be required in order to ensure that we don't repair too many nodes at the same time. This logic will need to be centralized, such that there is a central authority that is elected to make these decisions.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
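The coordination requirement above ("don't repair too many nodes at the same time") can be sketched as a single authority that hands out a bounded number of repair slots. The `RepairCoordinator` class and its `max_concurrent` parameter are hypothetical; how the limit is chosen, and how the authority is elected, are left open exactly as in the ticket.

```python
class RepairCoordinator:
    """Hypothetical central authority that caps concurrent host repairs."""

    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.in_repair = set()

    def request_repair(self, host):
        # Grant the repair only while the number of hosts under repair
        # stays within the configured limit.
        if host in self.in_repair:
            return True
        if len(self.in_repair) >= self.max_concurrent:
            return False
        self.in_repair.add(host)
        return True

    def repair_finished(self, host):
        self.in_repair.discard(host)
```

Nothing in this sketch depends on Mesos internals, which is consistent with the comment earlier in the thread that such a tool could be built on top of Mesos rather than inside it.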