Can't start master properly (stale state issue?); help!
Hi All,

I hope someone can shed some light on this because I'm getting desperate!

I try to start components zk, mesos-master, and marathon in that order. They are started via a program that SSHs to the sole host and does "service xxx start". Everyone starts happily enough. But the Mesos UI shows me:

*This master is not the leader, redirecting in 0 seconds ... go now*

The pattern seen in all of the mesos-master.INFO logs (one of which is shown below) is that the mesos-master with the correct IP@ starts. But then a new leader is detected and becomes leading master. This new leader shows *UPID=master@127.0.1.1:5050*.

I've tried clearing what ZK and mesos-master state I can find, but this problem will not go away.

Would someone be so kind as to a) explain what is happening here and b) suggest remedies?

Thanks very much.

-Paul

Log file created at: 2015/08/13 10:19:43
Running on machine: 71.100.14.9
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0813 10:19:43.225636 2542 logging.cpp:172] INFO level logging started!
I0813 10:19:43.235213 2542 main.cpp:181] Build: 2015-05-05 06:15:50 by root
I0813 10:19:43.235244 2542 main.cpp:183] Version: 0.22.1
I0813 10:19:43.235257 2542 main.cpp:186] Git tag: 0.22.1
I0813 10:19:43.235268 2542 main.cpp:190] Git SHA: d6309f92a7f9af3ab61a878403e3d9c284ea87e0
I0813 10:19:43.245098 2542 leveldb.cpp:176] Opened db in 9.386828ms
I0813 10:19:43.247138 2542 leveldb.cpp:183] Compacted db in 1.956669ms
I0813 10:19:43.247194 2542 leveldb.cpp:198] Created db iterator in 13961ns
I0813 10:19:43.247206 2542 leveldb.cpp:204] Seeked to beginning of db in 677ns
I0813 10:19:43.247215 2542 leveldb.cpp:273] Iterated through 0 keys in the db in 243ns
I0813 10:19:43.247252 2542 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
I0813 10:19:43.248755 2611 log.cpp:238] Attempting to join replica to ZooKeeper group
I0813 10:19:43.248924 2542 main.cpp:306] Starting Mesos master
I0813 10:19:43.249244 2612 recover.cpp:449] Starting replica recovery
I0813 10:19:43.250239 2612 recover.cpp:475] Replica is in EMPTY status
I0813 10:19:43.250819 2612 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0813 10:19:43.251014 2607 recover.cpp:195] Received a recover response from a replica in EMPTY status
*I0813 10:19:43.249503 2542 master.cpp:349] Master 20150813-101943-151938119-5050-2542 (71.100.14.9) started on 71.100.14.9:5050*
I0813 10:19:43.252053 2610 recover.cpp:566] Updating replica status to STARTING
I0813 10:19:43.252571 2542 master.cpp:397] Master allowing unauthenticated frameworks to register
I0813 10:19:43.253159 2542 master.cpp:402] Master allowing unauthenticated slaves to register
I0813 10:19:43.254276 2612 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 1.816161ms
I0813 10:19:43.254323 2612 replica.cpp:323] Persisted replica status to STARTING
I0813 10:19:43.254905 2612 recover.cpp:475] Replica is in STARTING status
I0813 10:19:43.255203 2612 replica.cpp:641] Replica in STARTING status received a broadcasted recover request
I0813 10:19:43.255265 2612 recover.cpp:195] Received a recover response from a replica in STARTING status
I0813 10:19:43.255343 2612 recover.cpp:566] Updating replica status to VOTING
I0813 10:19:43.258730 2611 master.cpp:1295] Successfully attached file '/var/log/mesos/mesos-master.INFO'
I0813 10:19:43.258760 2609 contender.cpp:131] Joining the ZK group
I0813 10:19:43.258862 2612 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 3.477458ms
I0813 10:19:43.258894 2612 replica.cpp:323] Persisted replica status to VOTING
I0813 10:19:43.258934 2612 recover.cpp:580] Successfully joined the Paxos group
I0813 10:19:43.258987 2612 recover.cpp:464] Recover process terminated
I0813 10:19:46.590340 2606 group.cpp:313] Group process (group(1)@71.100.14.9:5050) connected to ZooKeeper
I0813 10:19:46.590373 2606 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0813 10:19:46.590386 2606 group.cpp:385] Trying to create path '/mesos/log_replicas' in ZooKeeper
I0813 10:19:46.591442 2606 network.hpp:424] ZooKeeper group memberships changed
I0813 10:19:46.591514 2606 group.cpp:659] Trying to get '/mesos/log_replicas/00' in ZooKeeper
I0813 10:19:46.592146 2606 group.cpp:659] Trying to get '/mesos/log_replicas/01' in ZooKeeper
I0813 10:19:46.593128 2608 network.hpp:466] ZooKeeper group PIDs: { log-replica(1)@127.0.1.1:5050 }
I0813 10:19:46.593955 2608 group.cpp:313] Group process (group(2)@71.100.14.9:5050) connected to ZooKeeper
I0813 10:19:46.593977 2608 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0)
I0813 10:19:46.593986 2608 group.cpp:385] Trying to create path '/mesos/log_replicas' in ZooKeeper
I0813 10:19:46.594894 2605 group.cpp:313] Group process (group(3)@
Re: Can't start master properly (stale state issue?); help!
Marco & haosdent,

This is just a quick note to say thank you for your replies. I will answer you much more fully tomorrow, but for now can only manage a few quick observations & questions:

1. Having some months ago encountered a known problem with the IP@ 127.0.1.1 (I'll provide references tomorrow), I early on configured /etc/hosts, replacing "127.0.1.1 myHostName" with "Real_IP myHostName". That said, I can't rule out a race condition whereby ZK / mesos-master saw the original unchanged /etc/hosts before I zapped it.

2. What is a znode and how would I drop it?

I start the services (zk, master, marathon; all on the same host) by SSHing into the host & doing "service start" commands.

Again, thanks very much; and more tomorrow.

Cordially,

Paul

On Thu, Aug 13, 2015 at 1:08 PM, haosdent haosd...@gmail.com wrote:

Hello, how do you start the master? And could you try "netstat -antp | grep 5050" to check whether multiple master processes are running on the same machine?

On Thu, Aug 13, 2015 at 10:37 PM, Paul Bell arach...@gmail.com wrote: [original message and log snipped]
Re: Custom flags to docker run
Hi Stephen,

Via Marathon I am deploying Docker containers across a Mesos cluster. The containers have unique Weave IP@s allowing inter-container communication. All things considered, getting to this point has been relatively straightforward, and Weave has been one of the IJW components.

I'd be curious to learn why you're finding Weave messy. If you'd like to take it out-of-band (as it were), please feel free to e-mail me directly.

Cordially,

Paul

On Wed, Aug 12, 2015 at 3:16 AM, Stephen Knight skni...@pivotal.io wrote:

Hi,

Is there a way to pass a custom flag to "docker run" through the Marathon API? I've not seen anything in the documentation; this could just be a basic reading fail on my part.

What I want to do is use Calico (or similar) with Docker and provision containers via Marathon. Weave is messy for what I am trying to achieve and the integration isn't going as planned. Is there a better option, and how can you then integrate it? Does that flexibility exist in the Marathon API?

Thx

Stephen
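A note for readers of this thread: the Marathon app JSON of this era does expose a container.docker.parameters array whose entries are passed through to "docker run" as --key=value flags (verify against your Marathon version's REST API docs). A minimal sketch, with an illustrative app id and flag:

  curl -X POST http://marathon-host:8080/v2/apps \
    -H 'Content-Type: application/json' \
    -d '{
          "id": "custom-flag-test",
          "cmd": "sleep 3600",
          "cpus": 0.5, "mem": 256, "instances": 1,
          "container": {
            "type": "DOCKER",
            "docker": {
              "image": "busybox",
              "parameters": [ { "key": "dns", "value": "8.8.8.8" } ]
            }
          }
        }'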
Re: Can't start master properly (stale state issue?); help!
All,

By way of some background: I'm not running a data center (or centers). Rather, I work on a distributed application whose trajectory is taking it into a realm of many Docker containers distributed across many hosts (mostly virtual hosts at the outset). An environment that supports isolation, multi-tenancy, scalability, and some fault tolerance is desirable for this application. Also, the mere ability to simplify - at least somewhat - the management of multiple hosts is of great importance. So, that's more or less how I got to Mesos and to here...

I ended up writing a Java program that configures a collection of host VMs as a Mesos cluster and then, via Marathon, distributes the application containers across the cluster. Configuring & building the cluster is largely a lot of SSH work. Doing the same for the application is part Marathon, part Docker remote API. The containers that need to talk to each other via TCP are connected with Weave's (http://weave.works) overlay network. So the main infrastructure consists of Mesos, Docker, and Weave. The whole thing is pretty amazing - for which I take very little credit. Rather, these are some wonderful technologies, and the folks who write & support them are very helpful. That said, I sometimes feel like I'm juggling chain saws!

*In re* the issues raised on this thread:

All Mesos components were installed via the Mesosphere packages. The 4 VMs in the cluster are all running Ubuntu 14.04 LTS.

My suspicions about the IP@ 127.0.1.1 were raised a few months ago when, after seeing this IP in a mesos-master log when things weren't working, I discovered these articles:

https://groups.google.com/forum/#!topic/marathon-framework/1qboeZTOLU4
http://frankhinek.com/build-mesos-multi-node-ha-cluster/ (see note 2)

So, to the point raised just now by Klaus (and earlier in the thread), the aforementioned configuration program does change /etc/hosts (and /etc/hostname) in the way Klaus suggested. But, as I mentioned to Marco & haosdent, I might have encountered a race condition wherein ZK / mesos-master saw the unchanged /etc/hosts before I altered it. I believe that I yesterday fixed that issue.

Also, as part of the cluster create step, I get a bit aggressive (perhaps unwisely) with what I believe are some state repositories. Specifically, I

rm /var/lib/zookeeper/version-2/*
rm -Rf /var/lib/mesos/replicated_log

Should I NOT be doing this? I know from experience that zapping the version-2 directory (ZK's dataDir, IIRC) can solve occasional weirdness. Marco, is /var/lib/mesos/replicated_log what you are referring to when you say "some issue with the log-replica"?

Just a day or two ago I first heard the term znode & learned a little about zkCli.sh. I will experiment with it more in the coming days.

As matters now stand, I have the cluster up and running. But before I again deploy the application, I am trying to put the cluster through its paces by periodically cycling it through the states my program can bring about, e.g.,

--cluster create (takes a clean VM and configures it to act as one or more Mesos components: ZK, master, slave)
--cluster stop (stops the Mesos services on each node)
--cluster destroy (configures the VM back to its original clean state)
--cluster create
--cluster stop
--cluster start

et cetera.
*The only way I got rid of the "no leading master" issue that started this thread was by wiping out the master VM and starting over with a clean VM. That is, stopping/destroying/creating (even rebooting) the cluster had no effect.*

I suspect that, sooner or later, I will again hit this problem (probably sooner!). And I want to understand how best to handle it. Such an occurrence could be pretty awkward at a customer site.

Thanks for all your help.

Cordially,

Paul

On Thu, Aug 13, 2015 at 9:41 PM, Klaus Ma kl...@cguru.net wrote:

I used to meet a similar issue with Zookeeper + Mesos; I resolved it by removing 127.0.1.1 from /etc/hosts. Here is an example:

klaus@klaus-OptiPlex-780:~/Workspace/mesos$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 klaus-OptiPlex-780  *<= remove this line, and add a new line mapping the real IP (e.g. 192.168.1.100) to the hostname*
...

BTW, please also clean up the log directory and re-start ZK & Mesos. If any more comments, please let me know.

Regards,
Klaus Ma (马达), PMP® | http://www.cguru.net

--
Date: Thu, 13 Aug 2015 12:20:34 -0700
Subject: Re: Can't start master properly (stale state issue?); help!
From: ma...@mesosphere.io
To: user@mesos.apache.org

On Thu, Aug 13, 2015 at 11:53 AM, Paul Bell arach...@gmail.com
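A note on the znode question raised above: ZooKeeper stores everything in a tree of "znodes", and Mesos keeps its leader-election and log-replica state under the znode named in the masters' --zk flag (here, /mesos). A sketch of inspecting and dropping that state with zkCli.sh, assuming the Ubuntu package layout and the default /mesos path (stop the masters first; this erases election state):

  /usr/share/zookeeper/bin/zkCli.sh -server 127.0.0.1:2181
  # then, at the zk prompt:
  ls /mesos                             # list Mesos's znodes (info_..., log_replicas, ...)
  get /mesos/log_replicas/0000000000    # inspect one child (name illustrative)
  rmr /mesos                            # recursively remove the stale state (ZK 3.4's CLI)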
Fate of slave node after timeout
Hi All,

IIRC, after (max_slave_ping_timeouts * slave_ping_timeout) is exceeded without a response from a mesos-slave, the master will remove the slave. In the Mesos UI I can see the slave count transition from 1 (deactivated) to 0.

Can that slave never again be added into the cluster, i.e., what happens if it comes up 1 second after exceeding the timeout product?

(I'm dusting off some old notes and trying to refresh my memory about problems I haven't seen in quite some time.)

Thank you.

-Paul
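For reference, the arithmetic with the stock defaults, which are an assumption here (verify with "mesos-master --help" on your build, since defaults can change between releases):

  # --slave_ping_timeout=15secs and --max_slave_ping_timeouts=5 by default,
  # so an unresponsive agent is removed after roughly 15s * 5 = 75s:
  mesos-master --zk=zk://host:2181/mesos --quorum=1 \
    --slave_ping_timeout=15secs --max_slave_ping_timeouts=5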
Re: manage jobs log files in sandboxes
I've done a little reconnoitering, and the terrain looks to me as follows:

1. Docker maintains container log files at /var/lib/docker/containers/<CTR_ID>/<CTR_ID>-json.log
2. Mesos maintains container STDOUT files at a slave/framework/application specific location, e.g., /tmp/mesos/slaves/20151102-082316-370041927-5050-32381-S1/frameworks/20151102-082316-370041927-5050-32381-/executors/ecxprimary1.80750071-81a0-11e5-8596-82d195a34239/runs/5c767378-9599-40af-8010-a31f4c55f9dc
3. The latter is mapped to the container's /mnt/mesos/sandbox
4. These two files (the -json.log and the STDOUT file) are different; *each* consumes disk space.

I think that the answer to (1) is Docker's log rotation. As to (2), I am considering a cron job at host (not container) level that drives the truncate command (GNU coreutils) to prune these files at a certain size. Obviously this requires knowing the fully-qualified path under /tmp/mesos/slaves, but this is readily available via "docker inspect".

-Paul

On Fri, Nov 6, 2015 at 7:17 AM, Paul Bell <arach...@gmail.com> wrote:

> Hi Mauricio,
>
> Yeah, I see your point; thank you.
>
> My approach would be akin to closing the barn door after the horse got
> out. Both Mesos & Docker are doing their own writing of STDOUT. Docker's
> rotation won't address Mesos's behavior.
>
> I need to find a solution here.
>
> -Paul
>
> On Thu, Nov 5, 2015 at 10:46 PM, Mauricio Garavaglia <
> mauriciogaravag...@gmail.com> wrote:
>
>> Hi Paul,
>>
>> I don't think that's going to help :(
>> Even if you configure a different docker log driver, Docker still sends
>> things to stdout, which is caught by Mesos and dumped in the logs
>> in the job sandbox. For example, by default docker logs into a
>> json file in /var/lib/docker but mesos still writes to the sandbox.
>>
>> Hi Mauricio,
>>
>> I'm grappling with the same issue.
>>
>> I'm not yet sure if it represents a viable solution, but I plan to look
>> at Docker's log rotation facility. It was introduced in Docker 1.8.
>>
>> If you beat me to it & it looks like a solution, please let us know!
>>
>> Thanks.
>>
>> Cordially,
>>
>> Paul
>>
>> > On Nov 5, 2015, at 9:40 PM, Mauricio Garavaglia <
>> mauriciogaravag...@gmail.com> wrote:
>> >
>> > Hi guys,
>> >
>> > How can I manage the stdout/err log files generated by jobs in mesos?
>> For long-running docker apps launched using Marathon, the log files can
>> deplete the disk of an agent, and using quotas makes the jobs be killed,
>> which is also not ideal. I'd like to have a way to rotate them.
>> >
>> > Is it correct to just go to the mesos agent workdir and go through each
>> sandbox stdout/err and rotate them? I know that could break the log UI, but
>> it doesn't scale very well having logs of several GB.
>> >
>> > Thanks!
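A minimal sketch of the cron-plus-truncate idea described above (the 100M threshold and the /tmp/mesos work dir are illustrative; note that truncating discards those log lines for good):

  #!/bin/sh
  # /etc/cron.hourly/prune-sandbox-logs
  # Empty any sandbox stdout/stderr file that has grown past 100MB.
  find /tmp/mesos/slaves -type f \( -name stdout -o -name stderr \) \
    -size +100M -exec truncate -s 0 {} \;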
Re: manage jobs log files in sandboxes
Hi Mauricio,

Yeah, I see your point; thank you.

My approach would be akin to closing the barn door after the horse got out. Both Mesos & Docker are doing their own writing of STDOUT. Docker's rotation won't address Mesos's behavior.

I need to find a solution here.

-Paul

On Thu, Nov 5, 2015 at 10:46 PM, Mauricio Garavaglia < mauriciogaravag...@gmail.com> wrote:

> Hi Paul,
>
> I don't think that's going to help :(
> Even if you configure a different docker log driver, Docker still sends
> things to stdout, which is caught by Mesos and dumped in the logs
> in the job sandbox. For example, by default docker logs into a
> json file in /var/lib/docker but mesos still writes to the sandbox.
>
> Hi Mauricio,
>
> I'm grappling with the same issue.
>
> I'm not yet sure if it represents a viable solution, but I plan to look at
> Docker's log rotation facility. It was introduced in Docker 1.8.
>
> If you beat me to it & it looks like a solution, please let us know!
>
> Thanks.
>
> Cordially,
>
> Paul
>
> > On Nov 5, 2015, at 9:40 PM, Mauricio Garavaglia <
> mauriciogaravag...@gmail.com> wrote:
> >
> > Hi guys,
> >
> > How can I manage the stdout/err log files generated by jobs in mesos?
> For long-running docker apps launched using Marathon, the log files can
> deplete the disk of an agent, and using quotas makes the jobs be killed,
> which is also not ideal. I'd like to have a way to rotate them.
> >
> > Is it correct to just go to the mesos agent workdir and go through each
> sandbox stdout/err and rotate them? I know that could break the log UI, but
> it doesn't scale very well having logs of several GB.
> >
> > Thanks!
Use "docker start" rather than "docker run"?
Hi All,

I first posted this to the Marathon list, but someone suggested I try it here.

I'm still not sure what component (mesos-master, mesos-slave, marathon) generates the "docker run" command that launches containers on a slave node. I suppose that it's the framework executor (Marathon) on the slave that actually executes the "docker run", but I'm not sure.

What I'm really after is whether or not we can cause the use of "docker start" rather than "docker run".

At issue here is some persistent data inside /var/lib/docker/aufs/mnt/<CTR_ID>. "docker run" will by design (re)launch my application with a different CTR_ID, effectively rendering that data inaccessible. But "docker start" will restart the container and its old data will still be there.

Thanks.

-Paul
Re: Use "docker start" rather than "docker run"?
Alex and Marco,

Thanks very much for your really helpful explanations. For better or worse, neither C++ nor Python is my thing; Java's the go-to language for me.

Cordially,

Paul

On Sat, Aug 29, 2015 at 5:23 AM, Marco Massenzio <ma...@mesosphere.io> wrote:

> Hi Paul,
>
> +1 to what Alex/Tim say.
>
> Maybe a (simple) example will help: a very basic "framework" I created
> recently does away with the "Executor" and only uses the "Scheduler",
> sending a CommandInfo structure to Mesos' Agent node to execute.
>
> See:
> https://github.com/massenz/mongo_fw/blob/develop/src/mongo_scheduler.cpp#L124
>
> If Python is more your thing, there are examples in the Mesos repository,
> or you can take a look at something I started recently to use the new
> (0.24) HTTP API (NOTE - this is still very much WIP):
> https://github.com/massenz/zk-mesos/blob/develop/notebooks/HTTP%20API%20Tests.ipynb
>
> *Marco Massenzio*
> *Distributed Systems Engineer*
> http://codetrips.com
>
> On Fri, Aug 28, 2015 at 8:44 AM, Paul Bell <arach...@gmail.com> wrote:
>
>> Alex & Tim,
>>
>> Thank you both; most helpful.
>>
>> Alex, can you dispel my confusion on this point: I keep reading that a
>> "framework" in Mesos (e.g., Marathon) consists of a scheduler and an
>> executor. This reference to "executor" made me think that Marathon must
>> have *some* kind of presence on the slave node. But the more familiar I
>> become with Mesos the less likely this seems to me. So, what does it mean
>> to talk about the Marathon framework "executor"?
>>
>> Tim, I did come up with a simple work-around that involves re-copying the
>> needed file into the container each time the application is started. For
>> reasons unknown, this file is not kept in a location that would readily
>> lend itself to my use of persistent storage (Docker -v). That said, I am
>> keenly interested in learning how to write both custom executors &
>> schedulers. Any sense for what release of Mesos will see "persistent
>> volumes"?
>>
>> Thanks again, gents.
>>
>> -Paul
>>
>> On Fri, Aug 28, 2015 at 2:26 PM, Tim Chen <t...@mesosphere.io> wrote:
>>
>>> Hi Paul,
>>>
>>> We don't [re]start a container since we assume once the task terminated
>>> the container is no longer reused. In Mesos, to allow tasks to reuse the
>>> same executor and handle task logic accordingly, people will opt to choose
>>> the custom executor route.
>>>
>>> We're working on a way to keep your sandbox data beyond a container
>>> lifecycle, which is called persistent volumes. We haven't integrated that
>>> with the Docker containerizer yet, so you'll have to wait to use that feature.
>>>
>>> You could also choose to implement a custom executor for now if you like.
>>>
>>> Tim
>>>
>>> On Fri, Aug 28, 2015 at 10:43 AM, Alex Rukletsov <a...@mesosphere.com>
>>> wrote:
>>>
>>>> Paul,
>>>>
>>>> that component is called DockerContainerizer and it's part of the Mesos
>>>> Agent (check
>>>> "/Users/alex/Projects/mesos/src/slave/containerizer/docker.hpp"). @Tim,
>>>> could you answer the "docker start" vs. "docker run" question?
>>>>
>>>> On Fri, Aug 28, 2015 at 1:26 PM, Paul Bell <arach...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I first posted this to the Marathon list, but someone suggested I try
>>>>> it here.
>>>>>
>>>>> I'm still not sure what component (mesos-master, mesos-slave,
>>>>> marathon) generates the "docker run" command that launches containers on a
>>>>> slave node.
>>>>> I suppose that it's the framework executor (Marathon) on the
>>>>> slave that actually executes the "docker run", but I'm not sure.
>>>>>
>>>>> What I'm really after is whether or not we can cause the use of
>>>>> "docker start" rather than "docker run".
>>>>>
>>>>> At issue here is some persistent data inside
>>>>> /var/lib/docker/aufs/mnt/<CTR_ID>. "docker run" will by design (re)launch
>>>>> my application with a different CTR_ID, effectively rendering that data
>>>>> inaccessible. But "docker start" will restart the container and its "old"
>>>>> data will still be there.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> -Paul
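Until the persistent-volumes feature Tim mentions lands, the usual stopgap is a Docker host volume, so the data lives outside the container's aufs layer and survives a fresh "docker run". A sketch in the Marathon app JSON of this era (field names per its REST API; paths and app id are illustrative):

  curl -X PUT http://marathon-host:8080/v2/apps/my-app \
    -H 'Content-Type: application/json' \
    -d '{
          "container": {
            "type": "DOCKER",
            "docker": { "image": "mongo:2.6.8" },
            "volumes": [
              { "containerPath": "/data/db",
                "hostPath": "/var/data/my-app",
                "mode": "RW" }
            ]
          }
        }'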
Re: Detecting slave crashes event
Thank you, Benjamin.

So, I could periodically request the metrics endpoint, or stream the logs (maybe via mesos.cli, or SSH)? What, roughly, does the "agent removed" message look like in the logs?

Are there plans to offer a mechanism for event subscription?

Cordially,

Paul

On Wed, Sep 16, 2015 at 1:30 PM, Benjamin Mahler <benjamin.mah...@gmail.com> wrote:

> You can detect when we remove an agent due to health check failures via
> the metrics endpoint, but these are counters that are better used for
> alerting / dashboards for visibility. If you need to know which agents, you
> can also consume the logs as a stop-gap solution, until we offer a
> mechanism for subscribing to cluster events.
>
> On Wed, Sep 16, 2015 at 10:11 AM, Paul Bell <arach...@gmail.com> wrote:
>
>> Hi All,
>>
>> I am led to believe that, unlike Marathon, Mesos doesn't (yet?) offer a
>> subscribable event bus.
>>
>> So I am wondering if there's a best practices way of determining if a
>> slave node has crashed. By "crashed" I mean something like the power plug
>> got yanked, or anything that would cause Mesos to stop talking to the slave
>> node.
>>
>> I suppose such information would be recorded in /var/log/mesos.
>>
>> Interested to learn how best to detect this.
>>
>> Thank you.
>>
>> -Paul
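A polling sketch against the master's metrics endpoint (the slave_removals counter family is from the 0.2x series; verify the exact names in your build's /metrics/snapshot output):

  curl -s http://master-host:5050/metrics/snapshot \
    | python -m json.tool | grep slave_removals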
Re: Changing mesos slave configuration
Hi Pradeep,

Perhaps I am speaking to a slightly different point, but when I change /etc/default/mesos-slave to add a new attribute, I have to remove the file /tmp/mesos/meta/slaves/latest. IIRC, mesos-slave itself, in failing to start after such a change, tells me to do this:

rm -f /tmp/mesos/meta/slaves/latest

But I know of no way to make such configuration changes without downtime. And I'd very much like it if Mesos supported such dynamic changes. I suppose this would require that the agent consult its default file on demand, rather than once at start-up.

Cordially,

Paul

On Wed, Sep 23, 2015 at 4:41 AM, Pradeep Chhetri < pradeep.chhetr...@gmail.com> wrote:

> Hello all,
>
> I have often faced this problem: whenever I try to add some
> configuration parameter to mesos-slave or change any configuration (e.g., add
> a new attribute), the mesos-slave doesn't come up on restart.
> I have to delete the slave.info file and then restart the slave, but it
> ends up killing all the Docker containers started using Mesos.
>
> I was trying to figure out the best way to make such changes without
> any downtime.
>
> Thank you.
>
> --
> Pradeep Chhetri
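For the record, a sketch of the full sequence (it assumes the default --work_dir=/tmp/mesos and the Mesosphere packaging; running tasks on the agent are lost):

  service mesos-slave stop
  # add the new attribute in /etc/default/mesos-slave (or /etc/mesos-slave/attributes)
  rm -f /tmp/mesos/meta/slaves/latest   # discard the checkpointed SlaveInfo
  service mesos-slave start             # agent re-registers under a new ID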
Re: Use docker start rather than docker run?
Alex & Tim,

Thank you both; most helpful.

Alex, can you dispel my confusion on this point: I keep reading that a "framework" in Mesos (e.g., Marathon) consists of a scheduler and an executor. This reference to "executor" made me think that Marathon must have *some* kind of presence on the slave node. But the more familiar I become with Mesos the less likely this seems to me. So, what does it mean to talk about the Marathon framework "executor"?

Tim, I did come up with a simple work-around that involves re-copying the needed file into the container each time the application is started. For reasons unknown, this file is not kept in a location that would readily lend itself to my use of persistent storage (Docker -v). That said, I am keenly interested in learning how to write both custom executors & schedulers. Any sense for what release of Mesos will see persistent volumes?

Thanks again, gents.

-Paul

On Fri, Aug 28, 2015 at 2:26 PM, Tim Chen t...@mesosphere.io wrote:

Hi Paul,

We don't [re]start a container since we assume once the task terminated the container is no longer reused. In Mesos, to allow tasks to reuse the same executor and handle task logic accordingly, people will opt to choose the custom executor route.

We're working on a way to keep your sandbox data beyond a container lifecycle, which is called persistent volumes. We haven't integrated that with the Docker containerizer yet, so you'll have to wait to use that feature.

You could also choose to implement a custom executor for now if you like.

Tim

On Fri, Aug 28, 2015 at 10:43 AM, Alex Rukletsov a...@mesosphere.com wrote:

Paul,

that component is called DockerContainerizer and it's part of the Mesos Agent (check "/Users/alex/Projects/mesos/src/slave/containerizer/docker.hpp"). @Tim, could you answer the "docker start" vs. "docker run" question?

On Fri, Aug 28, 2015 at 1:26 PM, Paul Bell arach...@gmail.com wrote:

Hi All,

I first posted this to the Marathon list, but someone suggested I try it here.

I'm still not sure what component (mesos-master, mesos-slave, marathon) generates the "docker run" command that launches containers on a slave node. I suppose that it's the framework executor (Marathon) on the slave that actually executes the "docker run", but I'm not sure.

What I'm really after is whether or not we can cause the use of "docker start" rather than "docker run".

At issue here is some persistent data inside /var/lib/docker/aufs/mnt/<CTR_ID>. "docker run" will by design (re)launch my application with a different CTR_ID, effectively rendering that data inaccessible. But "docker start" will restart the container and its old data will still be there.

Thanks.

-Paul
Re: Securing executors
Thanks, Alexander; I will check out the vid.

I kind of assumed that this port was used for exactly the purpose you mention. Is TLS a possibility here?

-Paul

On Tue, Oct 6, 2015 at 8:15 AM, Alexander Rojas <alexan...@mesosphere.io> wrote:

> Hi Paul,
>
> I can refer you to the talk given by Adam Bordelon at MesosCon:
> https://www.youtube.com/watch?v=G3sn1OLYDOE
>
> If you want the short answer, the solution is to put a firewall around
> your cluster.
>
> On a closer look at the port, it is the one used for message passing
> between the mesos-docker-executor and other Mesos components.
>
> On 05 Oct 2015, at 19:04, Paul Bell <arach...@gmail.com> wrote:
>
> Hi All,
>
> I am running an nmap port scan on a Mesos agent node and noticed nmap
> reporting an open TCP port at 50577.
>
> Poking around some, I discovered exactly 5 mesos-docker-executor
> processes, one for each of my 5 Docker containers, and each with an open
> listen port:
>
> root 14131 3617 0 10:39 ? 00:00:17 mesos-docker-executor
> --container=mesos-20151002-172703-2450482247-5050-3014-S0.5563c65a-e33e-4287-8ce4-b2aa8116aa95
> --docker=/usr/local/ecxmcc/weaveShim --help=false
> --mapped_directory=/mnt/mesos/sandbox
> --sandbox_directory=/tmp/mesos/slaves/20151002-172703-2450482247-5050-3014-S0/frameworks/20151002-172703-2450482247-5050-3014-/executors/postgres.ea2954fd-6b6e-11e5-8bef-56847afe9799/runs/5563c65a-e33e-4287-8ce4-b2aa8116aa95
> --stop_timeout=15secs
>
> I suppose that all of this is unsurprising. But I know of at least one big
> customer who will without delay run Nmap or Nessus against my clustered
> deployment.
>
> So I am wondering what the best practices approach is to securing these
> open ports.
>
> Thanks for your help.
>
> -Paul
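A sketch of the firewall approach for those executor ports (the subnet and port range are illustrative, and over-broad rules will break the cluster itself, so test carefully before deploying):

  # allow cluster peers and established flows; drop outside probes of the
  # ephemeral range where the executors listen
  iptables -A INPUT -p tcp -s 10.0.0.0/24 -j ACCEPT
  iptables -A INPUT -p tcp -m state --state ESTABLISHED,RELATED -j ACCEPT
  iptables -A INPUT -p tcp --dport 32768:61000 -j DROP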
Securing executors
Hi All,

I am running an nmap port scan on a Mesos agent node and noticed nmap reporting an open TCP port at 50577.

Poking around some, I discovered exactly 5 mesos-docker-executor processes, one for each of my 5 Docker containers, and each with an open listen port:

root 14131 3617 0 10:39 ? 00:00:17 mesos-docker-executor --container=mesos-20151002-172703-2450482247-5050-3014-S0.5563c65a-e33e-4287-8ce4-b2aa8116aa95 --docker=/usr/local/ecxmcc/weaveShim --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/tmp/mesos/slaves/20151002-172703-2450482247-5050-3014-S0/frameworks/20151002-172703-2450482247-5050-3014-/executors/postgres.ea2954fd-6b6e-11e5-8bef-56847afe9799/runs/5563c65a-e33e-4287-8ce4-b2aa8116aa95 --stop_timeout=15secs

I suppose that all of this is unsurprising. But I know of at least one big customer who will without delay run Nmap or Nessus against my clustered deployment.

So I am wondering what the best practices approach is to securing these open ports.

Thanks for your help.

-Paul
Re: Anyone try Weave in Mesos env ?
Hi Weitao,

I came up with this architecture as a way of distributing our application across multiple nodes. Pre-Mesos, our application, delivered as a single VMware VM, was not easily scalable. By breaking out the several application components as Docker containers, we are now able (within limits imposed chiefly by the application itself) to distribute & run those containers across the several nodes in the Mesos cluster. Application containers that need to talk to each other are connected via Weave's "overlay" (veth) network.

Not surprisingly, this architecture has some of the benefits that you'd expect from Mesos, chief among them being high-availability (more on this below), scalability, and hybrid Cloud deployment.

The core unit of deployment is an Ubuntu image (14.04 LTS) that I've configured with the appropriate components:

Zookeeper
Mesos-master
Mesos-slave
Marathon
Docker
Weave
SSH (including RSA keys)
Our application

This image is presently downloaded by a customer as a VMware .ova file. We typically ask the customer to convert the resulting VM to a so-called VMware template from which she can easily deploy multiple VMs as needed. Please note that although we've started with VMware as our virtualization platform, I've successfully run cluster nodes on both EC2 and Azure.

I tend to describe the Ubuntu image as "polymorphic", i.e., it can be told to assume one of two roles, either a "master" role or a "slave" role. A master runs ZK, mesos-master, and Marathon. A slave runs mesos-slave, Docker, Weave, and the application.

We presently offer 3 canned deployment options:

1. single-host, no HA
2. multi-host, no HA (1 master, 3 slaves)
3. multi-host, HA (3 masters, 3 slaves)

The single-host, no HA option exists chiefly to mimic the original pre-Mesos deployment. But it has the added virtue, thanks to Mesos, of allowing us to dynamically "grow" from a single host to multiple hosts. The multi-host, no HA option is presently geared toward a sharded MongoDB backend where each slave runs a mongod container that is a single partition (shard) of the larger database. This deployment option also lends itself very nicely to adding a new slave node at the cluster level, and a new mongod container at the application level - all without any downtime whatsoever.

The multi-host, HA option offers the probably familiar *cluster-level* high availability. I stress "cluster-level" because I think we have to distinguish between HA at that level & HA at the application level. The former is realized by the 3 master hosts, i.e., you can lose a master and a new one will self-elect, thereby keeping the cluster up & running. But, to my mind at least, application-level HA requires some co-operation on the part of the application itself (e.g., checkpoint/restart). That said, it *is* almost magical to watch Mesos re-launch an application container that has crashed. But whether or not that re-launch results in coherent application behavior is another matter.

An important home-grown component here is a Java program that automates these functions:

create cluster - configures a host for a given role and starts Mesos services. This is done via SSH
start application - distributes application containers across slave hosts. This is done by talking to the Marathon REST API
stop application - again, via the Marathon REST API
stop cluster - stops Mesos services. Again, via SSH
destroy cluster - deconfigures the host (after which it has no defined role); again, SSH

As I write, I see Ajay's e-mail arrive about Calico.
I am aware of this project and it seems quite solid. But I've never understood the need to "worry about networking containers in a multihost setup". Weave runs as a Docker container and It Just Works. I've "woven" together slave nodes in a cluster that spanned 3 different datacenters, one of them in EC2, without any difficulty.

Yes, I do have to assign Weave IP addresses to the several containers, but this is hardly onerous. In fact, I've found it "liberating" to select such addresses from a CIDR/8 address space, assigning them to containers based on the container's purpose (e.g., MongoDB shard containers might live at 10.4.0.X, etc.). Ultimately, this assignment boils down to setting an environment variable that Marathon (or the mesos-slave executor) will use when creating the container via "docker run".

There is a whole lot more that I could say about the internals of this architecture. But, if you're still interested, I'll await further questions from you.

HTH.

Cordially,

Paul

On Thu, Nov 26, 2015 at 7:16 AM, Paul wrote:

> Gladly, Weitao. It'd be my pleasure.
>
> But give me a few hours to find some free time.
>
> I am today tasked with cooking a Thanksgiving turkey.
>
> But I will try to find the time before noon today (I'm on the right coast
> in the USA).
>
> -Paul
>
> > On Nov 25, 2015, at 11:26 PM, Weitao wrote:
> >
> > Hi, Paul. Can you share
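A sketch of that environment-variable assignment, assuming the WEAVE_CIDR convention used by Weave's Docker integration (app id, image, and address are illustrative):

  curl -X POST http://master-host:8080/v2/apps \
    -H 'Content-Type: application/json' \
    -d '{
          "id": "mongod-shard1",
          "cpus": 1, "mem": 1024, "instances": 1,
          "env": { "WEAVE_CIDR": "10.4.0.1/8" },
          "container": {
            "type": "DOCKER",
            "docker": { "image": "mongo:2.6.8" }
          }
        }'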
Re: Anyone try Weave in Mesos env ?
Hmm, I'm not sure there's really a "fix" for that (BTW: I assume you mean to fix high (or long) latency, i.e., to make it lower, faster). A network link is a network link, right? Like all hardware, it has its own physical characteristics which determine its latency's lower bound, below which it is physically impossible to go.

Sounds to me as if you've got the whole Mesos + Docker + Weave thing figured out, at least as far as the basic connectivity and addressing is concerned. So there's not much more that I can tell you in that regard.

Are you running Weave 1.2 (or above)? It incorporates their "fast data path" technology based on the Linux kernel's Open vSwitch (*vide*: http://blog.weave.works/2015/11/13/weave-docker-networking-performance-fast-data-path/). But, remember, there's still the link in between endpoints. One can optimize the packet handling within an endpoint, but this could boil down to a case of "hurry up and wait".

I would urge you to take this question up with the friendly, knowledgeable, and very helpful folks at Weave: https://groups.google.com/a/weave.works/forum/#!forum/weave-users

Cordially,

Paul

On Wed, Nov 25, 2015 at 9:31 PM, Sam wrote:

> Paul,
> Yup, Weave and Docker. May I know how you fixed the low latency issue over
> the Internet? By tunnel, or?
>
> Regards,
> Sam
>
> Sent from my iPhone
>
> > On Nov 26, 2015, at 10:23 AM, Paul wrote:
> >
> > Happy Thanksgiving to you, too.
> >
> > I tend to deploy the several Mesos nodes as VMware VMs.
> >
> > However, I've also run a cluster with master on ESXi, slaves on ESXi,
> > slave on bare metal, and an EC2 slave.
> >
> > But in my case all applications are Docker containers connected via
> > Weave.
> >
> > Does your present deployment involve Docker and Weave?
> >
> > -paul
> >
> >> On Nov 25, 2015, at 8:55 PM, Sam wrote:
> >>
> >> Paul,
> >> Happy thanksgiving first. We are using AWS and Rackspace as a hybrid cloud
> >> env, and we deployed the Mesos master in AWS, part of the slaves in AWS, and
> >> part of the slaves in Rackspace. I am thinking whether it works? And since it
> >> got low latency in networking, can we deploy two masters in both AWS and
> >> Rackspace? And federation? Appreciated for your reply.
> >>
> >> Regards,
> >> Sam
> >>
> >> Sent from my iPhone
> >>
> >>> On Nov 26, 2015, at 9:47 AM, Paul wrote:
> >>>
> >>> Hi Sam,
> >>>
> >>> Yeah, I have significant experience in this regard.
> >>>
> >>> We run Docker containers spread across several Mesos slave nodes.
> >>> The containers are all connected via Weave. It works very well.
> >>>
> >>> Can you describe what you have in mind?
> >>>
> >>> Cordially,
> >>>
> >>> Paul
> >>>
> >>>> On Nov 25, 2015, at 8:03 PM, Sam wrote:
> >>>>
> >>>> Guys,
> >>>> We are trying to use Weave in a hybrid cloud Mesos env; anyone got
> >>>> experience with it? Appreciated.
> >>>> Regards,
> >>>> Sam
> >>>>
> >>>> Sent from my iPhone
Re: Help needed (alas, urgently)
In chasing down this problem, I stumbled upon something of moment: the problem does NOT seem to happen with kernel 3.13.

Some weeks back, in the hope of getting past another problem wherein the root filesystem "becomes" R/O, I upgraded from 3.13 to 3.19 (Ubuntu 14.04 LTS). The kernel upgrade was done as shown here (there's some extra stuff to get rid of Ubuntu desktop and liberate some disk space):

apt-get update
apt-get -y remove ubuntu-desktop
apt-get -y purge lightdm
rm -Rf /var/lib/lightdm-data
apt-get -y remove --purge libreoffice-core
apt-get -y remove --purge libreoffice-common
echo "Installing new kernel"
apt-get -y install linux-generic-lts-vivid
apt-get -y autoremove linux-image-3.13.0-32-generic
apt-get -y autoremove linux-image-3.13.0-71-generic
update-grub
reboot

After the reboot, a "uname -r" shows kernel 3.19.0-42-generic. Under this kernel I can now reliably reproduce the failure to stop a MongoDB container. Specifically, any & all attempts to kill the container, e.g., via

- Marathon HTTP Delete (which leads to mesos-docker-executor issuing the "docker stop" command)
- Getting inside the running container shell and issuing "kill" or db.shutdownServer()

cause the mongod container

- to show in its log that it's shutting down normally
- to enter a 100% CPU loop
- to become unkillable (only reboot "fixes" things)

Note finally that my conclusion about kernel 3.13 "working" is at present a weak induction. But I do know that when I reverted to that kernel I could, at least once, stop the containers w/o any problems; whereas at 3.19 I can reliably reproduce the problem. I will try to make this induction stronger as the day wears on.

Did I do something "wrong" in my kernel upgrade steps? Is anyone aware of such an issue in 3.19, or of work done post-3.13 in the area of task termination & signal handling?

Thanks for your help.

-Paul

On Thu, Jan 14, 2016 at 5:14 PM, Paul Bell <arach...@gmail.com> wrote:

> I spoke too soon, I'm afraid.
>
> Next time I did the stop (with zero timeout), I see the same phenomenon: a
> mongo container showing repeated:
>
> killing docker task
> shutting down
>
> What else can I try?
>
> Thank you.
>
> On Thu, Jan 14, 2016 at 5:07 PM, Paul Bell <arach...@gmail.com> wrote:
>
>> Hi Tim,
>>
>> I set docker_stop_timeout to zero as you asked. I am pleased to report
>> (though a bit fearful about being pleased) that this change seems to have
>> shut everyone down pretty much instantly.
>>
>> Can you explain what's happening, e.g., does docker_stop_timeout=0 cause
>> the immediate use of "kill -9" as opposed to "kill -2"?
>>
>> I will keep testing the behavior.
>>
>> Thank you.
>>
>> -Paul
>>
>> On Thu, Jan 14, 2016 at 3:59 PM, Paul Bell <arach...@gmail.com> wrote:
>>
>>> Hi Tim,
>>>
>>> Things have gotten slightly odder (if that's possible). When I now start
>>> the application's 5 or so containers, only one, "ecxconfigdb", gets started -
>>> and even he took a few tries. That is, I see him failing, moving to
>>> deploying, then starting again. But I've no evidence (no STDOUT, and no
>>> docker ctr logs) that shows why.
>>>
>>> In any event, ecxconfigdb does start. Happily, when I try to stop the
>>> application I am seeing the phenomena I posted before: killing docker task,
>>> shutting down repeated many times. The UN-stopped container is now running
>>> at 100% CPU.
>>>
>>> I will try modifying docker_stop_timeout. Back shortly.
>>>
>>> Thanks again.
>>>
>>> -Paul
>>>
>>> PS: what do you make of the "broken pipe" error in the docker.log?
>>>
>>> *from /var/log/upstart/docker.log*
>>>
>>> INFO[3054] GET /v1.15/images/mongo:2.6.8/json
>>> INFO[3054] GET /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>>> ERRO[3054] Handler for GET /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json returned error: No such image: mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>>> ERRO[3054] HTTP Error  err=No such image: mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b  statusCode=404
>>> INFO[3054] GET /v1.15/containers/weave/json
>>> INFO[3054] POST /v1.21/containers/create?name=m
Help needed (alas, urgently)
Hi All,

It's been quite some time since I've posted here and that's chiefly because up until a day or two ago, things were working really well.

I actually may have posted about this some time back. But then the problem seemed more intermittent.

In summa, several "docker stops" don't work, i.e., the containers are not stopped.

Deployment:

one Ubuntu VM (VMware) LTS 14.04 with kernel 3.19
Zookeeper
Mesos-master (0.23.0)
Mesos-slave (0.23.0)
Marathon (0.10.0)
Docker 1.9.1
Weave 1.1.0
Our application containers, which include MongoDB (4), PostGres, and ECX (our product)

The only thing that's changed at all in the config above is the version of Docker. Used to be 1.6.2, but I today upgraded it hoping to solve the problem.

My automater program stops the application by sending Marathon an "http delete" for each running app. Every now & then (reliably reproducible today) not all containers get stopped. Most recently, 3 containers failed to stop. Here are the attendant phenomena:

Marathon shows the 3 applications in deployment mode (presumably "deployment" in the sense of "stopping").

*ps output:*

root@71:~# ps -ef | grep docker
root 3823 1 0 13:55 ? 00:00:02 /usr/bin/docker daemon -H unix:///var/run/docker.sock -H tcp://0.0.0.0:4243
root 4967 1 0 13:57 ? 00:00:01 /usr/sbin/mesos-slave --master=zk://71.100.202.99:2181/mesos --log_dir=/var/log/mesos --containerizers=docker,mesos --docker=/usr/local/ecxmcc/weaveShim --docker_stop_timeout=15secs --executor_registration_timeout=5mins --hostname=71.100.202.99 --ip=71.100.202.99 --attributes=hostType:ecx,shard1 --resources=ports:[31000-31999,8443-8443]
root 5263 3823 0 13:57 ? 00:00:00 docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port 6783
root 5271 3823 0 13:57 ? 00:00:00 docker-proxy -proto udp -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port 6783
root 5279 3823 0 13:57 ? 00:00:00 docker-proxy -proto tcp -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port 53
root 5287 3823 0 13:57 ? 00:00:00 docker-proxy -proto udp -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port 53
root 7119 4967 0 14:00 ? 00:00:01 mesos-docker-executor --container=mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2 --docker=/usr/local/ecxmcc/weaveShim --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9/runs/bfc5a419-30f8-43f7-af2f-5582394532f2 --stop_timeout=15secs
root 7378 4967 0 14:00 ? 00:00:01 mesos-docker-executor --container=mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89 --docker=/usr/local/ecxmcc/weaveShim --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9/runs/9b700cdc-3d29-49b7-a7fc-e543a91f7b89 --stop_timeout=15secs
root 7640 4967 0 14:01 ? 00:00:01 mesos-docker-executor --container=mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298 --docker=/usr/local/ecxmcc/weaveShim --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9/runs/d7d861d3-cfc9-424d-b341-0631edea4298 --stop_timeout=15secs
*root 9696 9695 0 14:06 ? 00:00:00 /usr/bin/docker stop -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298*
*root 9709 9708 0 14:06 ? 00:00:00 /usr/bin/docker stop -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89*
*root 9720 9719 0 14:06 ? 00:00:00 /usr/bin/docker stop -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2*

*docker ps output:*

root@71:~# docker ps
CONTAINER ID   IMAGE         COMMAND                  CREATED         STATUS          PORTS       NAMES
5abafbfe7de2   mongo:2.6.8   "/w/w /entrypoint.sh "   11 minutes ago  Up 11 minutes   27017/tcp   mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
a8449682ca2e   mongo:2.6.8   "/w/w /entrypoint.sh "   11 minutes ago  Up 11 minutes   27017/tcp   mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
3b956457374b   mongo:2.6.8   "/w/w /entrypoint.sh "   11 minutes ago  Up 11
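For context on the 15-second figure in those "docker stop -t 15" processes: "docker stop" sends the container's main process SIGTERM, waits the given number of seconds, then sends SIGKILL. The agent's --docker_stop_timeout flag supplies the -t value, so setting it to 0 (as suggested later in this thread) amounts to a near-immediate SIGKILL:

  # /etc/default/mesos-slave (one way to set the flag; restart mesos-slave afterwards)
  MESOS_DOCKER_STOP_TIMEOUT=0secs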
Re: Help needed (alas, urgently)
ctory="/tmp/mesos/slaves/20160114-153418-1674208327-5050-3798-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285/runs/c5c35d59-1318-4a96-b850-b0b788815f1b" --stop_timeout="15secs" Registered docker executor on 71.100.202.99 Starting task ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285 2016-01-14T20:45:38.613+ [initandlisten] MongoDB starting : pid=1 port=27017 dbpath=/data/db 64-bit host=ecxconfigdb 2016-01-14T20:45:38.614+ [initandlisten] db version v2.6.8 2016-01-14T20:45:38.614+ [initandlisten] git version: 3abc04d6d4f71de00b57378e3277def8fd7a6700 2016-01-14T20:45:38.614+ [initandlisten] build info: Linux build5.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49 2016-01-14T20:45:38.614+ [initandlisten] allocator: tcmalloc 2016-01-14T20:45:38.614+ [initandlisten] options: { storage: { journal: { enabled: true } } } 2016-01-14T20:45:38.616+ [initandlisten] journal dir=/data/db/journal 2016-01-14T20:45:38.616+ [initandlisten] recover : no journal files present, no recovery needed 2016-01-14T20:45:39.006+ [initandlisten] waiting for connections on port 27017 2016-01-14T20:46:38.975+ [clientcursormon] mem (MB) res:77 virt:12942 2016-01-14T20:46:38.975+ [clientcursormon] mapped (incl journal view):12762 2016-01-14T20:46:38.975+ [clientcursormon] connections:0 Killing docker task Shutting down Killing docker task Shutting down Killing docker task Shutting down On Thu, Jan 14, 2016 at 3:38 PM, Paul Bell <arach...@gmail.com> wrote: > Hey Tim, > > Thank you very much for your reply. > > Yes, I am in the midst of trying to reproduce the problem. If successful > (so to speak), I will do as you ask. > > Cordially, > > Paul > > On Thu, Jan 14, 2016 at 3:19 PM, Tim Chen <t...@mesosphere.io> wrote: > >> Hi Paul, >> >> Looks like we've already issued the docker stop as you seen in the ps >> output, but the containers are still running. Can you look at the Docker >> daemon logs and see what's going on there? >> >> And also can you also try to modify docker_stop_timeout to 0 so that we >> SIGKILL the containers right away, and see if this still happens? >> >> Tim >> >> >> >> On Thu, Jan 14, 2016 at 11:52 AM, Paul Bell <arach...@gmail.com> wrote: >> >>> Hi All, >>> >>> It's been quite some time since I've posted here and that's chiefly >>> because up until a day or two ago, things were working really well. >>> >>> I actually may have posted about this some time back. But then the >>> problem seemed more intermittent. >>> >>> In summa, several "docker stops" don't work, i.e., the containers are >>> not stopped. >>> >>> Deployment: >>> >>> one Ubuntu VM (vmWare) LTS 14.04 with kernel 3.19 >>> Zookeeper >>> Mesos-master (0.23.0) >>> Mesos-slave (0.23.0) >>> Marathon (0.10.0) >>> Docker 1.9.1 >>> Weave 1.1.0 >>> Our application contains which include >>> MongoDB (4) >>> PostGres >>> ECX (our product) >>> >>> The only thing that's changed at all in the config above is the version >>> of Docker. Used to be 1.6.2 but I today upgraded it hoping to solve the >>> problem. >>> >>> >>> My automater program stops the application by sending Marathon an "http >>> delete" for each running up. Every now & then (reliably reproducible today) >>> not all containers get stopped. Most recently, 3 containers failed to stop. 
Re: Help needed (alas, urgently)
Hey Tim,

Thank you very much for your reply.

Yes, I am in the midst of trying to reproduce the problem. If successful (so to speak), I will do as you ask.

Cordially,

Paul

On Thu, Jan 14, 2016 at 3:19 PM, Tim Chen <t...@mesosphere.io> wrote:

> Hi Paul,
>
> Looks like we've already issued the docker stop, as seen in the ps
> output, but the containers are still running. Can you look at the Docker
> daemon logs and see what's going on there?
>
> And also can you try to modify docker_stop_timeout to 0 so that we
> SIGKILL the containers right away, and see if this still happens?
>
> Tim
>
> On Thu, Jan 14, 2016 at 11:52 AM, Paul Bell <arach...@gmail.com> wrote:
>
>> [original message snipped]
mesos-docker-executor >> --container=mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89 >> --docker=/usr/local/ecxmcc/weaveShim --help=false >> --mapped_directory=/mnt/mesos/sandbox >> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9/runs/9b700cdc-3d29-49b7-a7fc-e543a91f7b89 >> --stop_timeout=15secs >> root 7640 4967 0 14:01 ?00:00:01 mesos-docker-executor >> --container=mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298 >> --docker=/usr/local/ecxmcc/weaveShim --help=false >> --mapped_directory=/mnt/m
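A sketch of the change Tim suggests, for anyone following along. It assumes the agent flags come from /etc/default/mesos-slave as environment variables (Mesos reads any MESOS_-prefixed variable as the matching command-line flag; the file path and variable spelling are assumptions, so adjust for your own init setup):

    # /etc/default/mesos-slave
    # was the equivalent of --docker_stop_timeout=15secs (see the ps output above)
    MESOS_DOCKER_STOP_TIMEOUT=0secs

    sudo service mesos-slave restart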
Re: Help needed (alas, urgently)
I spoke too soon, I'm afraid. The next time I did the stop (with zero timeout), I saw the same phenomenon: a mongo container showing repeated "killing docker task" / "shutting down" messages. What else can I try? Thank you. On Thu, Jan 14, 2016 at 5:07 PM, Paul Bell <arach...@gmail.com> wrote: > Hi Tim, > > I set docker_stop_timeout to zero as you asked. I am pleased to report > (though a bit fearful about being pleased) that this change seems to have > shut everyone down pretty much instantly. > > Can you explain what's happening, e.g., does docker_stop_timeout=0 cause > the immediate use of "kill -9" as opposed to "kill -2"? > > I will keep testing the behavior. > > Thank you. > > -Paul > > On Thu, Jan 14, 2016 at 3:59 PM, Paul Bell <arach...@gmail.com> wrote: > >> Hi Tim, >> >> Things have gotten slightly odder (if that's possible). When I now start >> the application (5 or so containers), only one, "ecxconfigdb", gets started - >> and even he took a few tries. That is, I see him failing, moving to >> deploying, then starting again. But I've no evidence (no STDOUT, and no >> docker container logs) that shows why. >> >> In any event, ecxconfigdb does start. Happily, when I try to stop the >> application I am seeing the phenomena I posted before: "killing docker task" / >> "shutting down" repeated many times. The UN-stopped container is now running >> at 100% CPU. >> >> I will try modifying docker_stop_timeout. Back shortly. >> >> Thanks again. >> >> -Paul >> >> PS: what do you make of the "broken pipe" error in the docker.log? >> >> *from /var/log/upstart/docker.log* >> >> INFO[3054] GET /v1.15/images/mongo:2.6.8/json >> INFO[3054] GET >> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json >> ERRO[3054] Handler for GET >> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json >> returned error: No such image: >> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b >> ERRO[3054] HTTP Error >> err=No such image: >> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b >> statusCode=404 >> INFO[3054] GET /v1.15/containers/weave/json >> INFO[3054] POST >> /v1.21/containers/create?name=mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b >> INFO[3054] POST >> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/attach?stderr=1=1=1 >> INFO[3054] POST >> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/start >> INFO[3054] GET >> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >> INFO[3054] GET >> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >> INFO[3054] GET /v1.15/containers/weave/json >> INFO[3054] GET >> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >> INFO[3054] GET >> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >> INFO[3054] GET /v1.15/containers/weave/json >> INFO[3054] GET >> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >> INFO[3054] GET >> /v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json >> INFO[3111] GET /v1.21/containers/json >> INFO[3120] GET
/v1.21/containers/cf7/json >> INFO[3120] GET >> /v1.21/containers/cf7/logs?stderr=1=1=all >> INFO[3153] GET /containers/json >> INFO[3153] GET >> /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >> INFO[3153] GET >> /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json >> INFO[3153] GET >> /containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json >> INFO[3175] GET /containers/json >> INFO[3175] GET >> /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >> INFO[3175] GET >> /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json >> INFO[3175] GET >> /containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
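For what it's worth on the "kill -9 vs. kill -2" question above: docker stop sends SIGTERM (signal 15, not SIGINT/2), waits up to the grace period, and only then sends SIGKILL (signal 9). So docker_stop_timeout=0 amounts to an almost immediate SIGKILL. The standalone docker CLI shows the same semantics:

    docker stop --time=15 <container>   # SIGTERM, wait up to 15s, then SIGKILL
    docker stop --time=0 <container>    # zero grace period: SIGKILL right away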
Recent UI enhancements & Managed Service Providers
Hi All, I am running older versions of Mesos & Marathon (0.23.0 and 0.10.0). Over the course of the last several months I think I've seen several items on this list about UI enhancements. Perhaps they were enhancements to the data consumed by the Mesos & Marathon UIs. I've had very little time to dig deeply into it. So...I am wondering if someone can either point me to any discussions of such enhancements or summarize them here. There is a specific use case behind this request. The Mesos architecture seems to be a real sweet spot for an MSP. But an important MSP requirement is a unified view of their many tenants. So I am really trying to get a sense for how well the recent Mesos/Marathon releases address this requirement. Thank you. -Paul
Re: Recent UI enhancements & Managed Service Providers
Hi Vinod, Thank you for your reply. I'm not sure that I can be more specific. MSPs are interested in a "view by tenant", e.g., "show me all applications that are allotted to Tenant X". I suppose that the standard Mesos UI could, with properly named task IDs and the UI's "Find" filter, accomplish part of "view by tenant". But in order to see the resources consumed by Tenant X's tasks, you have to visit each task individually, look at its "Resources" table, and add them all up. It'd be cool if, when a filter is in effect, the Resources table were updated to reflect only the resources consumed by the filter-selected tasks. There's also the question of the units/meaning of Resources. Through Marathon I give each of my Dockerized tasks .1 CPU. As I understand it, Docker multiplies this value by 1024, which is Docker's representation of all the cores on a host. So when I do "docker inspect " I will see CpuShares of 102. But in the Mesos UI each of my 6 tasks shows .2 CPUs allocated. I'm simply not sure what this means or how it's arrived at. I suspect that an MSP will ask the same questions. I will think about it some more, but I'd be interested to hear feedback on these few points that I've raised. Thanks again. -Paul On Thu, Feb 25, 2016 at 11:55 AM, Vinod Kone wrote: > > > But an important MSP requirement is a unified view of their many > tenants. So I am really trying to get a sense for how well the recent > Mesos/Marathon releases address this requirement. > > Can you be more specific about what you mean by unified view and tenants? > What's lacking currently?
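A quick worked version of the arithmetic in question. The 0.1-to-102 step is just truncation; the 0.2 figure is plausibly the task's 0.1 CPU plus the roughly 0.1 CPU that the Docker containerizer sets aside for the mesos-docker-executor itself, though that second step is an inference, not something confirmed in this thread:

    # CpuShares as reported by docker inspect: cpus * 1024, truncated
    echo '0.1 * 1024' | bc   # 102.4 -> Docker stores 102
    # Mesos UI allocation per task: task cpus + executor cpus (assumed 0.1)
    echo '0.1 + 0.1' | bc    # .2, i.e., the 0.2 CPUs shown in the UI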
Feature request: move in-flight containers w/o stopping them
Hello All, Has there ever been any consideration of the ability to move in-flight containers from one Mesos host node to another? I see this as analogous to VMware's "vMotion" facility, wherein VMs can be moved from one ESXi host to another. I suppose something like this could be useful from a load-balancing perspective. Just curious whether it's ever been considered and, if it was considered and rejected, why it was rejected. Thanks. -Paul
Re: Agent won't start
Greg, thanks again - I am planning on moving my work_dir. Pradeep, thanks again. In a slightly different scenario - namely: (1) service mesos-slave stop, (2) edit /etc/default/mesos-slave (add a port resource), (3) service mesos-slave start - I noticed that the slave did not start and - again - the log shows the same phenomena as in my original post. Per your suggestion, I did a rm -Rf /tmp/mesos and the slave service started correctly. Questions: 1. Did editing /etc/default/mesos-slave cause the failure of the service to start? 2. Given that starting/stopping the entire cluster (stopping all services on all nodes) is a standard feature in our product, should I routinely do the above "rm" command when the mesos services are stopped? Thanks for your help. Cordially, Paul On Tue, Mar 29, 2016 at 6:16 PM, Greg Mann wrote: > Check out this link for info on /tmp cleanup in Ubuntu: > http://askubuntu.com/questions/20783/how-is-the-tmp-directory-cleaned-up > > And check out this link for information on some of the work_dir's contents > on a Mesos agent: http://mesos.apache.org/documentation/latest/sandbox/ > > The work_dir contains important application state for the Mesos agent, so > it should not be placed in a location that will be automatically > garbage-collected by the OS. The choice of /tmp/mesos as a default location > is a bit unfortunate, and hopefully we can resolve that JIRA issue soon to > change it. Ideally you should be able to leave the work_dir alone and let > the Mesos agent manage it for you. > > In any case, I would recommend that you set the work_dir to something > outside of /tmp; /var/lib/mesos is a commonly-used location. > > Cheers, > Greg >
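For the record, the work_dir move Greg recommends looks roughly like this, again assuming the agent reads its flags from /etc/default/mesos-slave (MESOS_WORK_DIR is the usual environment-variable spelling of --work_dir):

    sudo service mesos-slave stop
    sudo mkdir -p /var/lib/mesos
    # in /etc/default/mesos-slave:
    # MESOS_WORK_DIR=/var/lib/mesos
    sudo service mesos-slave start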
Backup a Mesos Cluster
Hi All, As we get closer to shipping a Mesos-based version of our product, we've turned our attention to "protecting" (supporting backup & recovery) of not only our application databases, but the cluster as well. I'm not quite sure how to begin thinking about this, but I suppose the usual dimensions of B/R would come into play, e.g., hot/cold, application consistent/crash consistent, etc. Has anyone grappled with this issue and, if so, would you be so kind as to share your experience and solutions? Thank you. -Paul
Re: Backup a Mesos Cluster
Piotr, Thank you for this link. I am looking at it now, and I right away notice that Exhibitor is designed to monitor (and back up) Zookeeper (but not anything related to Mesos itself). Don't the Mesos master & agent nodes keep at least some state outside of the ZK znodes, e.g., under the default workdir? Shuai, Thank you for this observation. Happily (I think), we do not have a custom framework. Presently, Marathon is the only framework that we use. -Paul On Mon, Apr 11, 2016 at 8:12 AM, Shuai Lin <linshuai2...@gmail.com> wrote: > If your product contains a custom framework, at least you should > implement some kind of high availability for your scheduler (like > marathon/chronos does), or let it be launched by marathon so it can be > restarted when it fails. > > On Mon, Apr 11, 2016 at 7:27 PM, Paul Bell <arach...@gmail.com> wrote: > >> Hi All, >> >> As we get closer to shipping a Mesos-based version of our product, we've >> turned our attention to "protecting" (supporting backup & recovery) of not >> only our application databases, but the cluster as well. >> >> I'm not quite sure how to begin thinking about this, but I suppose the >> usual dimensions of B/R would come into play, e.g., hot/cold, application >> consistent/crash consistent, etc. >> >> Has anyone grappled with this issue and, if so, would you be so kind as >> to share your experience and solutions? >> >> Thank you. >> >> -Paul >> >> >
Re: Mesos 0.28 SSL in official packages
FWIW, I quite agree with Zameer's point. That said, I want to make abundantly clear that in my experience the folks at Mesosphere are wonderfully helpful. But what happens if down the road Mesosphere is acquired, or there occurs some other event that could represent, if not a conflict of interest, then simply a different strategic direction? My 2 cents. -Paul On Mon, Apr 11, 2016 at 5:19 PM, Zameer Manji wrote: > I have suggested this before and I will suggest it again here. > > I think the Apache Mesos project should build and distribute packages > instead of relying on the generosity of a commercial vendor. The Apache > Aurora project does this already with good success. As a user of Apache > Mesos I don't care about Mesosphere Inc and I feel uncomfortable that the > project is so dependent on its employees. > > Doing this would allow users to contribute packaging fixes directly to the > project, such as enabling SSL. > > On Mon, Apr 11, 2016 at 3:02 AM, Adam Bordelon wrote: > >> Hi Kamil, >> >> Technically, there are no "official" Apache-built packages for Apache >> Mesos. >> >> At least one company (Mesosphere) chooses to build and distribute >> Mesos packages, but does not currently offer SSL builds. It wouldn't >> be hard to add an SSL build to our regular builds, but it hasn't been >> requested enough to prioritize it. >> >> cc: Joris, Kapil >> >> On Thu, Apr 7, 2016 at 7:42 AM, haosdent wrote: >> > Hi, SSL isn't enabled by default. You need to compile it by following this doc >> > http://mesos.apache.org/documentation/latest/ssl/ >> > >> > On Thu, Apr 7, 2016 at 10:04 PM, Kamil Wokitajtis >> > wrote: >> >> >> >> This is my first post, so Hi everyone! >> >> >> >> Is SSL enabled in official packages (CentOS in my case)? >> >> I can see libssl in ldd output, but I cannot see libevent. >> >> I had to compile mesos from sources to run it over ssl. >> >> I would prefer to install it from packages. >> >> >> >> Regards, >> >> Kamil >> > >> > >> > >> > >> > -- >> > Best Regards, >> > Haosdent Huang >> >> -- >> Zameer Manji >> >>
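Following haosdent's pointer, the SSL-enabled source build described in that doc is roughly as follows (configure flags per http://mesos.apache.org/documentation/latest/ssl/; the usual build prerequisites, e.g. libssl-dev and libevent-dev, are assumed to be installed):

    ./bootstrap                                  # git checkouts only
    mkdir build && cd build
    ../configure --enable-libevent --enable-ssl
    make
    sudo make install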
Re: Agent won't start
Hi Greg, Thanks very much for your quick reply. I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not systemd. I will look at the link you provide. Is there any chance that it might apply to non-systemd platforms? Cordially, Paul On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann <g...@mesosphere.io> wrote: > Hi Paul, > Noticing the logging output, "Failed to find resources file > '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble may > be related to the location of your agent's work_dir. See this ticket: > https://issues.apache.org/jira/browse/MESOS-4541 > > Some users have reported issues resulting from the systemd-tmpfiles > service garbage collecting files in /tmp, perhaps this is related? What > platform is your agent running on? > > You could try specifying a different agent work directory outside of /tmp/ > via the `--work_dir` command-line flag. > > Cheers, > Greg > > > On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell <arach...@gmail.com> wrote: > >> Hi, >> >> I am hoping someone can shed some light on this. >> >> An agent node failed to start, that is, when I did "service mesos-slave >> start" the service came up briefly & then stopped. Before stopping it >> produced the log shown below. The last thing it wrote is "Trying to create >> path '/mesos' in Zookeeper". >> >> This mention of the mesos znode prompted me to go for a clean slate by >> removing the mesos znode from Zookeeper. >> >> After doing this, the mesos-slave service started perfectly. >> >> What might be happening here, and also what's the right way to >> trouble-shoot such a problem? Mesos is version 0.23.0. >> >> Thanks for your help. >> >> -Paul >> >> >> Log file created at: 2016/03/29 14:19:39 >> Running on machine: 71.100.202.193 >> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg >> I0329 14:19:39.512249 5870 logging.cpp:172] INFO level logging started! 
>> I0329 14:19:39.512564 5870 main.cpp:162] Build: 2015-07-24 10:05:39 by >> root >> I0329 14:19:39.512588 5870 main.cpp:164] Version: 0.23.0 >> I0329 14:19:39.512600 5870 main.cpp:167] Git tag: 0.23.0 >> I0329 14:19:39.512612 5870 main.cpp:171] Git SHA: >> 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66 >> I0329 14:19:39.615172 5870 containerizer.cpp:111] Using isolation: >> posix/cpu,posix/mem >> I0329 14:19:39.615697 5870 main.cpp:249] Starting Mesos slave >> I0329 14:19:39.616267 5870 slave.cpp:190] Slave started on 1)@ >> 71.100.202.193:5051 >> I0329 14:19:39.616286 5870 slave.cpp:191] Flags at startup: >> --attributes="hostType:shard1" --authenticatee="crammd5" >> --cgroups_cpu_enable_pids_and_tids_count="false" >> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" >> --cgroups_limit_swap="false" --cgroups_root="mesos" >> --container_disk_watch_interval="15secs" --containerizers="docker,mesos" >> --default_role="*" --disk_watch_interval="1mins" >> --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true" >> --docker_remove_delay="6hrs" >> --docker_sandbox_directory="/mnt/mesos/sandbox" >> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs" >> --enforce_container_disk_quota="false" >> --executor_registration_timeout="5mins" >> --executor_shutdown_grace_period="5secs" >> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" >> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" >> --hadoop_home="" --help="false" --hostname="71.100.202.193" >> --initialize_driver_logging="true" --ip="71.100.202.193" >> --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" >> --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" >> --master="zk://71.100.202.191:2181/mesos" >> --oversubscribed_resources_interval="15secs" --perf_duration="10secs" >> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" >> --quiet="false" --recover="reconnect" --recovery_timeout="15mins" >> --registration_backoff_factor="1secs" >> --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true" >> --strict="true" --switch_user="true" --version="false" >> --work_dir="/tmp/mesos" >> I0329 14:19:39.616835 5870 slave.cpp:354] Slave resources: cpus(*):4; >> mem(*):23089; disk(*):122517; ports(*):[31000-32000] >> I0329 14:19:39.617032 5870 slave.cpp:384] Slave hostname: 71.100.202.193 >> I0329 14:19:39.617046 5870 slave.cpp:389] Slave checkpoint: true >> I0329 14:19:39.618841 5894 state.cpp:36] Recovering state from >> '/tmp/mesos/meta' >> I0329 14:19:39.618872 5894 state.cpp:672] Failed to find resources file >> '/tmp/mesos/meta/resources/resources.info' >> I0329 14:19:39.619730 5898 group.cpp:313] Group process (group(1)@ >> 71.100.202.193:5051) connected to ZooKeeper >> I0329 14:19:39.619760 5898 group.cpp:787] Syncing group operations: >> queue size (joins, cancels, datas) = (0, 0, 0) >> I0329 14:19:39.619773 5898 group.cpp:385] Trying to create path '/mesos' >> in ZooKeeper >> >> >
Agent won't start
Hi, I am hoping someone can shed some light on this. An agent node failed to start, that is, when I did "service mesos-slave start" the service came up briefly & then stopped. Before stopping it produced the log shown below. The last thing it wrote is "Trying to create path '/mesos' in Zookeeper". This mention of the mesos znode prompted me to go for a clean slate by removing the mesos znode from Zookeeper. After doing this, the mesos-slave service started perfectly. What might be happening here, and also what's the right way to trouble-shoot such a problem? Mesos is version 0.23.0. Thanks for your help. -Paul Log file created at: 2016/03/29 14:19:39 Running on machine: 71.100.202.193 Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg I0329 14:19:39.512249 5870 logging.cpp:172] INFO level logging started! I0329 14:19:39.512564 5870 main.cpp:162] Build: 2015-07-24 10:05:39 by root I0329 14:19:39.512588 5870 main.cpp:164] Version: 0.23.0 I0329 14:19:39.512600 5870 main.cpp:167] Git tag: 0.23.0 I0329 14:19:39.512612 5870 main.cpp:171] Git SHA: 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66 I0329 14:19:39.615172 5870 containerizer.cpp:111] Using isolation: posix/cpu,posix/mem I0329 14:19:39.615697 5870 main.cpp:249] Starting Mesos slave I0329 14:19:39.616267 5870 slave.cpp:190] Slave started on 1)@ 71.100.202.193:5051 I0329 14:19:39.616286 5870 slave.cpp:191] Flags at startup: --attributes="hostType:shard1" --authenticatee="crammd5" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="docker,mesos" --default_role="*" --disk_watch_interval="1mins" --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true" --docker_remove_delay="6hrs" --docker_sandbox_directory="/mnt/mesos/sandbox" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs" --enforce_container_disk_quota="false" --executor_registration_timeout="5mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname="71.100.202.193" --initialize_driver_logging="true" --ip="71.100.202.193" --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://71.100.202.191:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true" --strict="true" --switch_user="true" --version="false" --work_dir="/tmp/mesos" I0329 14:19:39.616835 5870 slave.cpp:354] Slave resources: cpus(*):4; mem(*):23089; disk(*):122517; ports(*):[31000-32000] I0329 14:19:39.617032 5870 slave.cpp:384] Slave hostname: 71.100.202.193 I0329 14:19:39.617046 5870 slave.cpp:389] Slave checkpoint: true I0329 14:19:39.618841 5894 state.cpp:36] Recovering state from '/tmp/mesos/meta' I0329 14:19:39.618872 5894 state.cpp:672] Failed to find resources file '/tmp/mesos/meta/resources/resources.info' I0329 14:19:39.619730 5898 group.cpp:313] Group process (group(1)@ 71.100.202.193:5051) connected to ZooKeeper I0329 14:19:39.619760 5898 group.cpp:787] Syncing group operations: 
queue size (joins, cancels, datas) = (0, 0, 0) I0329 14:19:39.619773 5898 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
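For anyone reproducing the "clean slate" step above: removing the mesos znode is typically done with the stock ZooKeeper CLI (the script path is the Ubuntu package default, and the server address here is illustrative):

    /usr/share/zookeeper/bin/zkCli.sh -server 71.100.202.191:2181
    # at the zk prompt:
    rmr /mesos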
Re: Agent won't start
Hi Pradeep, And thank you for your reply! That, too, is very interesting. I think I need to synthesize what you and Greg are telling me and come up with a clean solution. Agent nodes can crash. Moreover, I can stop the mesos-slave service, and start it later with a reboot in between. So I am interested in fully understanding the causal chain here before I try to fix anything. -Paul On Tue, Mar 29, 2016 at 5:51 PM, Paul Bell <arach...@gmail.com> wrote: > Whoa...interessant! > > The node *may* have been rebooted. Uptime says 2 days. I'll need to check > my notes. > > Can you point me to reference re Ubuntu behavior? > > Based on what you've told me so far, it sounds as if the sequence: > > stop service > reboot agent node > start service > > > could lead to trouble - or do I misunderstand? > > > Thank you again for your help. > > -Paul > > On Tue, Mar 29, 2016 at 5:36 PM, Greg Mann <g...@mesosphere.io> wrote: > >> Paul, >> This would be relevant for any system which is automatically deleting >> files in /tmp. It looks like in Ubuntu, the default behavior is for /tmp to >> be completely nuked at boot time. Was the agent node rebooted prior to this >> problem? >> >> On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell <arach...@gmail.com> wrote: >> >>> Hi Greg, >>> >>> Thanks very much for your quick reply. >>> >>> I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not >>> systemd. I will look at the link you provide. >>> >>> Is there any chance that it might apply to non-systemd platforms? >>> >>> Cordially, >>> >>> Paul >>> >>> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann <g...@mesosphere.io> wrote: >>> >>>> Hi Paul, >>>> Noticing the logging output, "Failed to find resources file >>>> '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble >>>> may be related to the location of your agent's work_dir. See this ticket: >>>> https://issues.apache.org/jira/browse/MESOS-4541 >>>> >>>> Some users have reported issues resulting from the systemd-tmpfiles >>>> service garbage collecting files in /tmp, perhaps this is related? What >>>> platform is your agent running on? >>>> >>>> You could try specifying a different agent work directory outside of >>>> /tmp/ via the `--work_dir` command-line flag. >>>> >>>> Cheers, >>>> Greg >>>> >>>> >>>> On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell <arach...@gmail.com> wrote: >>>> >>>>> Hi, >>>>> >>>>> I am hoping someone can shed some light on this. >>>>> >>>>> An agent node failed to start, that is, when I did "service >>>>> mesos-slave start" the service came up briefly & then stopped. Before >>>>> stopping it produced the log shown below. The last thing it wrote is >>>>> "Trying to create path '/mesos' in Zookeeper". >>>>> >>>>> This mention of the mesos znode prompted me to go for a clean slate by >>>>> removing the mesos znode from Zookeeper. >>>>> >>>>> After doing this, the mesos-slave service started perfectly. >>>>> >>>>> What might be happening here, and also what's the right way to >>>>> trouble-shoot such a problem? Mesos is version 0.23.0. >>>>> >>>>> Thanks for your help. >>>>> >>>>> -Paul >>>>> >>>>> >>>>> Log file created at: 2016/03/29 14:19:39 >>>>> Running on machine: 71.100.202.193 >>>>> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg >>>>> I0329 14:19:39.512249 5870 logging.cpp:172] INFO level logging >>>>> started! 
>>>>> I0329 14:19:39.512564 5870 main.cpp:162] Build: 2015-07-24 10:05:39 >>>>> by root >>>>> I0329 14:19:39.512588 5870 main.cpp:164] Version: 0.23.0 >>>>> I0329 14:19:39.512600 5870 main.cpp:167] Git tag: 0.23.0 >>>>> I0329 14:19:39.512612 5870 main.cpp:171] Git SHA: >>>>> 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66 >>>>> I0329 14:19:39.615172 5870 containerizer.cpp:111] Using isolation: >>>>> posix/cpu,posix/mem >>>>> I0329 14:19:39.615697 5870 main.cpp:249] Starting Mesos slave >>>>> I0329 14:19:39.616267 58
Re: Agent won't start
Whoa...interessant! The node *may* have been rebooted. Uptime says 2 days. I'll need to check my notes. Can you point me to reference re Ubuntu behavior? Based on what you've told me so far, it sounds as if the sequence: stop service reboot agent node start service could lead to trouble - or do I misunderstand? Thank you again for your help. -Paul On Tue, Mar 29, 2016 at 5:36 PM, Greg Mann <g...@mesosphere.io> wrote: > Paul, > This would be relevant for any system which is automatically deleting > files in /tmp. It looks like in Ubuntu, the default behavior is for /tmp to > be completely nuked at boot time. Was the agent node rebooted prior to this > problem? > > On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell <arach...@gmail.com> wrote: > >> Hi Greg, >> >> Thanks very much for your quick reply. >> >> I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not >> systemd. I will look at the link you provide. >> >> Is there any chance that it might apply to non-systemd platforms? >> >> Cordially, >> >> Paul >> >> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann <g...@mesosphere.io> wrote: >> >>> Hi Paul, >>> Noticing the logging output, "Failed to find resources file >>> '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble >>> may be related to the location of your agent's work_dir. See this ticket: >>> https://issues.apache.org/jira/browse/MESOS-4541 >>> >>> Some users have reported issues resulting from the systemd-tmpfiles >>> service garbage collecting files in /tmp, perhaps this is related? What >>> platform is your agent running on? >>> >>> You could try specifying a different agent work directory outside of >>> /tmp/ via the `--work_dir` command-line flag. >>> >>> Cheers, >>> Greg >>> >>> >>> On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell <arach...@gmail.com> wrote: >>> >>>> Hi, >>>> >>>> I am hoping someone can shed some light on this. >>>> >>>> An agent node failed to start, that is, when I did "service mesos-slave >>>> start" the service came up briefly & then stopped. Before stopping it >>>> produced the log shown below. The last thing it wrote is "Trying to create >>>> path '/mesos' in Zookeeper". >>>> >>>> This mention of the mesos znode prompted me to go for a clean slate by >>>> removing the mesos znode from Zookeeper. >>>> >>>> After doing this, the mesos-slave service started perfectly. >>>> >>>> What might be happening here, and also what's the right way to >>>> trouble-shoot such a problem? Mesos is version 0.23.0. >>>> >>>> Thanks for your help. >>>> >>>> -Paul >>>> >>>> >>>> Log file created at: 2016/03/29 14:19:39 >>>> Running on machine: 71.100.202.193 >>>> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg >>>> I0329 14:19:39.512249 5870 logging.cpp:172] INFO level logging started! 
>>>> I0329 14:19:39.512564 5870 main.cpp:162] Build: 2015-07-24 10:05:39 by >>>> root >>>> I0329 14:19:39.512588 5870 main.cpp:164] Version: 0.23.0 >>>> I0329 14:19:39.512600 5870 main.cpp:167] Git tag: 0.23.0 >>>> I0329 14:19:39.512612 5870 main.cpp:171] Git SHA: >>>> 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66 >>>> I0329 14:19:39.615172 5870 containerizer.cpp:111] Using isolation: >>>> posix/cpu,posix/mem >>>> I0329 14:19:39.615697 5870 main.cpp:249] Starting Mesos slave >>>> I0329 14:19:39.616267 5870 slave.cpp:190] Slave started on 1)@ >>>> 71.100.202.193:5051 >>>> I0329 14:19:39.616286 5870 slave.cpp:191] Flags at startup: >>>> --attributes="hostType:shard1" --authenticatee="crammd5" >>>> --cgroups_cpu_enable_pids_and_tids_count="false" >>>> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" >>>> --cgroups_limit_swap="false" --cgroups_root="mesos" >>>> --container_disk_watch_interval="15secs" --containerizers="docker,mesos" >>>> --default_role="*" --disk_watch_interval="1mins" >>>> --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true" >>>> --docker_remove_delay="6hrs" >>>> --docker_sandbox_dir
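On the "reference re Ubuntu behavior" question: on Ubuntu 14.04 (upstart, not systemd), the boot-time /tmp wipe is governed by TMPTIME in /etc/default/rcS, and the default of 0 is believed to mean that everything in /tmp is deleted on every boot, regardless of age, which fits the reboot theory above:

    grep TMPTIME /etc/default/rcS
    # TMPTIME=0  -> /tmp is emptied at each boot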
Re: Mesos Master and Slave on same server?
Hi June, In addition to doing what Pradeep suggests, I also now & then run a single-node "cluster" that houses mesos-master, mesos-slave, and Marathon. Works fine. Cordially, Paul On Wed, Apr 13, 2016 at 12:36 PM, Pradeep Chhetri <pradeep.chhetr...@gmail.com> wrote: > I would suggest running mesos-master and zookeeper and marathon on the same > set of hosts (maybe call them coordinator nodes) and using a completely > different set of nodes for mesos slaves. This way you can do the > maintenance of such hosts in a very planned fashion. > > On Wed, Apr 13, 2016 at 4:22 PM, Stefano Bianchi wrote: > >> For sure it is possible. >> Mesos-master will simply see the resources offered by the machine on >> which mesos-slave is also running, transparently. >> >> 2016-04-13 16:34 GMT+02:00 June Taylor: >> >>> All of our node servers are identical hardware. Is it reasonable for me >>> to install the Mesos-Master and Mesos-Slave on the same physical hardware? >>> >>> Thanks, >>> June Taylor >>> System Administrator, Minnesota Population Center >>> University of Minnesota >>> >> >> > > > -- > Regards, > Pradeep Chhetri >
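For completeness, the single-node "cluster" described above is just all of the services on one host, started in dependency order (service names per the stock Ubuntu packages):

    sudo service zookeeper start
    sudo service mesos-master start
    sudo service mesos-slave start
    sudo service marathon start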
Consequences of health-check timeouts?
Hi All, I probably have the following account partly wrong, but let me present it just the same and those who know better can correct me as needed. I've an application that runs several MongoDB shards, each a Dockerized container, each on a distinct node (VM); in fact, some of the VMs are on separate ESXi hosts. I've lately seen situations where, because of very slow disks for the database, the following sequence occurs (I think): 1. Linux (Ubuntu 14.04 LTS) virtual memory manager hits thresholds defined by vm.dirty_background_ratio and/or vm.dirty_ratio (probably both) 2. Synchronous flushing of many, many pages occurs, writing to a slow disk 3. (Around this time one might see in /var/log/syslog "task X blocked for more than 120 seconds" for all kinds of tasks, including mesos-master) 4. mesos-slaves get shut down (this is the part I'm unclear about; but I am quite certain that on 2 nodes the executors and their in-flight MongoDB tasks got zapped, because I can see that Marathon restarted them). The consequence of this is a corrupt MongoDB database. In the case at hand, the job had run for over 50 hours, processing close to 120 million files. Steps I've taken so far to remedy include: - tune vm.dirty_background_ratio and vm.dirty_ratio down, respectively, to 5 and 10 (from 10 and 20). The intent here is to tolerate more frequent, smaller flushes and thus avoid the less frequent but massive flushes that suspend threads for very long periods. - increase the agent ping timeout to 10 minutes (every 30 seconds, 20 times) So the questions are: - Is there some way to be given control (a callback, or an "exit" routine) so that the container about to be nuked can be given a chance to exit gracefully? - Are there other steps I can take to avoid this mildly calamitous occurrence? - (Also, I'd be grateful for more clarity on anything in steps 1-4 above that is a bit hand-wavy!) As always, thanks. -Paul
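The two remedies above, in config form. The sysctl keys are standard; the master-side ping flags are spelled here as their MESOS_ environment-variable forms, and the flag names (--slave_ping_timeout, --max_slave_ping_timeouts) should be double-checked against your Mesos version:

    # /etc/sysctl.conf
    vm.dirty_background_ratio = 5
    vm.dirty_ratio = 10
    # apply without a reboot:
    sudo sysctl -p

    # /etc/default/mesos-master: 30s x 20 pings = a 10-minute window
    MESOS_SLAVE_PING_TIMEOUT=30secs
    MESOS_MAX_SLAVE_PING_TIMEOUTS=20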
Re: Consequences of health-check timeouts?
Hi haosdent, Thanks for your reply. In re executor_shutdown_grace_period: how would this enable the task (MongoDB) to terminate gracefully? (BTW: I am fairly certain that the mongo STDOUT as captured by Mesos shows that it received signal 15 just before it said good-bye). My naive understanding of this grace period is that it simply delays the termination of the executor. The following snippet is from /var/log/syslog. I believe it shows the stack trace (largely in the kernel) that led to mesos-master being blocked for more than 120 seconds. Please note that immediately above (before) the blocked mesos-master is a blocked jbd2/dm. Immediately below (after) the blocked mesos-master is a blocked java task. I'm not sure what the java task is. This took place on the mesos-master node, and none of our applications runs there. It runs master, Marathon, and ZK. Maybe the java task is Marathon or ZK? Thanks again. -Paul May 16 20:06:53 71 kernel: [193339.890848] INFO: task mesos-master:4013 blocked for more than 120 seconds. May 16 20:06:53 71 kernel: [193339.890873] Not tainted 3.13.0-32-generic #57-Ubuntu May 16 20:06:53 71 kernel: [193339.890889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. May 16 20:06:53 71 kernel: [193339.890912] mesos-masterD 88013fd94440 0 4013 1 0x May 16 20:06:53 71 kernel: [193339.890914] 880137429a28 0002 880135778000 880137429fd8 May 16 20:06:53 71 kernel: [193339.890916] 00014440 00014440 880135778000 88013fd94cd8 May 16 20:06:53 71 kernel: [193339.890918] 88013ffd34b0 0002 81284630 880137429aa0 May 16 20:06:53 71 kernel: [193339.890919] Call Trace: May 16 20:06:53 71 kernel: [193339.890922] [] ? start_this_handle+0x590/0x590 May 16 20:06:53 71 kernel: [193339.890924] [] io_schedule+0x9d/0x140 May 16 20:06:53 71 kernel: [193339.890925] [] sleep_on_shadow_bh+0xe/0x20 May 16 20:06:53 71 kernel: [193339.890927] [] __wait_on_bit+0x62/0x90 May 16 20:06:53 71 kernel: [193339.890929] [] ? start_this_handle+0x590/0x590 May 16 20:06:53 71 kernel: [193339.890930] [] out_of_line_wait_on_bit+0x77/0x90 May 16 20:06:53 71 kernel: [193339.890932] [] ? autoremove_wake_function+0x40/0x40 May 16 20:06:53 71 kernel: [193339.890934] [] ? wake_up_bit+0x25/0x30 May 16 20:06:53 71 kernel: [193339.890936] [] do_get_write_access+0x2ad/0x4f0 May 16 20:06:53 71 kernel: [193339.890938] [] ? __getblk+0x2d/0x2e0 May 16 20:06:53 71 kernel: [193339.890939] [] jbd2_journal_get_write_access+0x27/0x40 May 16 20:06:53 71 kernel: [193339.890942] [] __ext4_journal_get_write_access+0x3b/0x80 May 16 20:06:53 71 kernel: [193339.890946] [] ext4_reserve_inode_write+0x70/0xa0 May 16 20:06:53 71 kernel: [193339.890948] [] ? ext4_dirty_inode+0x40/0x60 May 16 20:06:53 71 kernel: [193339.890949] [] ext4_mark_inode_dirty+0x44/0x1f0 May 16 20:06:53 71 kernel: [193339.890951] [] ext4_dirty_inode+0x40/0x60 May 16 20:06:53 71 kernel: [193339.890953] [] __mark_inode_dirty+0x10a/0x2d0 May 16 20:06:53 71 kernel: [193339.890956] [] update_time+0x81/0xd0 May 16 20:06:53 71 kernel: [193339.890957] [] file_update_time+0x80/0xd0 May 16 20:06:53 71 kernel: [193339.890961] [] __generic_file_aio_write+0x180/0x3d0 May 16 20:06:53 71 kernel: [193339.890963] [] generic_file_aio_write+0x58/0xa0 May 16 20:06:53 71 kernel: [193339.890965] [] ext4_file_write+0x99/0x400 May 16 20:06:53 71 kernel: [193339.890967] [] ? wake_up_state+0x10/0x20 May 16 20:06:53 71 kernel: [193339.890970] [] ? wake_futex+0x66/0x90 May 16 20:06:53 71 kernel: [193339.890972] [] ?
futex_wake+0x1b1/0x1d0 May 16 20:06:53 71 kernel: [193339.890974] [] do_sync_write+0x5a/0x90 May 16 20:06:53 71 kernel: [193339.890976] [] vfs_write+0xb4/0x1f0 May 16 20:06:53 71 kernel: [193339.890978] [] SyS_write+0x49/0xa0 May 16 20:06:53 71 kernel: [193339.890980] [] tracesys+0xe1/0xe6 On Wed, May 18, 2016 at 2:33 AM, haosdent <haosd...@gmail.com> wrote: > >Is there some way to be given control (a callback, or an "exit" routine) > so that the container about to be nuked can be given a chance to exit > gracefully? > The default value of executor_shutdown_grace_period is 5 seconds; you > could change it by specifying the `--executor_shutdown_grace_period` flag when > launching the mesos agent. > > >Are there other steps I can take to avoid this mildly calamitous > occurrence? > >mesos-slaves get shut down > Do you know where your mesos-master got stuck when this happens? Any error log > or related log about this? In addition, is there any log from when mesos-slave > shut down? > > On Wed, May 18, 2016 at 6:12 AM, Paul Bell <arach...@gmail.com> wrote: > >> Hi All, >> >> I probably have the following account partly wrong, but let me present it >> just the same and those who know better can co
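Per haosdent's suggestion, the grace period is an agent-side flag; something like the following (environment-variable spelling assumed, value illustrative). Note that, as discussed above, this only widens the window between the executor's SIGTERM and the escalation to SIGKILL; it does not by itself make MongoDB shut down cleanly:

    # /etc/default/mesos-slave
    MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD=60secs

    sudo service mesos-slave restart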
Status of Mesos-3821
Hi, I think I encountered the problem described by https://issues.apache.org/jira/browse/MESOS-3821 and wanted to ask if this fix is in Mesos 0.28. But perhaps I misunderstand what's being said; so, by way of background, our case is Mesos on CentOS 7.2. When we try to set --docker_socket to tcp://<IP>:<PORT>, the mesos-slave service refuses to start. It seems to require a Unix socket. There is one comment in the ticket that expresses the hope of being able to use URLs of the tcp:// form. Am I misunderstanding this fix, and if not, what release of Mesos incorporates it? Thanks for your help. -Paul
Mesos loses track of Docker containers
Hello, One of our customers has twice encountered a problem wherein Mesos & Marathon appear to lose track of the application containers that they started. Platform & version info: Ubuntu 14.04 (running under VMware) Mesos (master & agent): 0.23.0 ZK: 3.4.5--1 Marathon: 0.10.0 The phenomena: When I log into either the Mesos or Marathon UIs I see no evidence of *any* tasks, active or completed. Yet, in the Linux shell, a "docker ps" command shows the containers up & running. I've seen some confusing appearances before, but never this. For example, I've seen what might be described as the *reverse* of the above phenomena. I mean the case where a customer power-cycles the VM. In such a case you typically see in Marathon's UI the (mere) appearance of the containers up & running, but a "docker ps" command shows no containers running. As folks on this list have explained to me, this is the result of "stale state", and after 10 minutes (by default) Mesos figures out that the supposedly active tasks aren't there and restarts them. But that's not the case here. I am hard-pressed to understand what conditions/causes might lead to Mesos & Marathon becoming unaware of containers that they started. I would be very grateful if someone could help me understand what's going on here (so would our customer!). Thanks. -Paul
Re: Mesos loses track of Docker containers
Hi Jeff, Thanks for your reply. Yeah, that thought occurred to me late last night. But the customer is sensitive to too much churn, so it wouldn't be my first choice. If I knew with certainty that such a problem existed in the versions they are running AND that more recent versions fixed it, then I'd do my best to compel the upgrade. Docker version is also old: 1.6.2. -Paul On Wed, Aug 10, 2016 at 9:18 AM, Jeff Schroeder <jeffschroe...@computer.org> wrote: > Have you considered upgrading Mesos and Marathon? Those are quite old > versions of both, with some fairly glaring problems with the docker > containerizer if memory serves. Also, what version of docker? > > > On Wednesday, August 10, 2016, Paul Bell <arach...@gmail.com> wrote: > >> Hello, >> >> One of our customers has twice encountered a problem wherein Mesos & >> Marathon appear to lose track of the application containers that they >> started. >> >> Platform & version info: >> >> Ubuntu 14.04 (running under VMware) >> Mesos (master & agent): 0.23.0 >> ZK: 3.4.5--1 >> Marathon: 0.10.0 >> >> The phenomena: >> >> When I log into either the Mesos or Marathon UIs I see no evidence of >> *any* tasks, active or completed. Yet, in the Linux shell, a "docker ps" >> command shows the containers up & running. >> >> I've seen some confusing appearances before, but never this. For example, >> I've seen what might be described as the *reverse* of the above >> phenomena. I mean the case where a customer power-cycles the VM. In such a >> case you typically see in Marathon's UI the (mere) appearance of the >> containers up & running, but a "docker ps" command shows no containers >> running. As folks on this list have explained to me, this is the result of >> "stale state", and after 10 minutes (by default) Mesos figures out that the >> supposedly active tasks aren't there and restarts them. >> >> But that's not the case here. I am hard-pressed to understand what >> conditions/causes might lead to Mesos & Marathon becoming unaware of >> containers that they started. >> >> I would be very grateful if someone could help me understand what's going >> on here (so would our customer!). >> >> Thanks. >> >> -Paul >> >> >> > > -- > Text by Jeff, typos by iPhone >