Can't start master properly (stale state issue?); help!
Hi All,

I hope someone can shed some light on this because I'm getting desperate!

I try to start components zk, mesos-master, and marathon in that order. They are started via a program that SSHs to the sole host and does "service xxx start". Everyone starts happily enough. But the Mesos UI shows me:

*This master is not the leader, redirecting in 0 seconds ... go now*

The pattern seen in all of the mesos-master.INFO logs (one of which is shown below) is that the mesos-master with the correct IP@ starts. But then a new leader is detected and becomes leading master. This new leader shows *UPID=master@127.0.1.1:5050*.

I've tried clearing what ZK and mesos-master state I can find, but this problem will not go away.

Would someone be so kind as to a) explain what is happening here and b) suggest remedies?

Thanks very much.

-Paul

Log file created at: 2015/08/13 10:19:43
Running on machine: 71.100.14.9
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0813 10:19:43.225636 2542 logging.cpp:172] INFO level logging started!
I0813 10:19:43.235213 2542 main.cpp:181] Build: 2015-05-05 06:15:50 by root
I0813 10:19:43.235244 2542 main.cpp:183] Version: 0.22.1
I0813 10:19:43.235257 2542 main.cpp:186] Git tag: 0.22.1
I0813 10:19:43.235268 2542 main.cpp:190] Git SHA: d6309f92a7f9af3ab61a878403e3d9c284ea87e0
I0813 10:19:43.245098 2542 leveldb.cpp:176] Opened db in 9.386828ms
I0813 10:19:43.247138 2542 leveldb.cpp:183] Compacted db in 1.956669ms
I0813 10:19:43.247194 2542 leveldb.cpp:198] Created db iterator in 13961ns
I0813 10:19:43.247206 2542 leveldb.cpp:204] Seeked to beginning of db in 677ns
I0813 10:19:43.247215 2542 leveldb.cpp:273] Iterated through 0 keys in the db in 243ns
I0813 10:19:43.247252 2542 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
I0813 10:19:43.248755 2611 log.cpp:238] Attempting to join replica to ZooKeeper group
I0813 10:19:43.248924 2542 main.cpp:306] Starting Mesos master
I0813 10:19:43.249244 2612 recover.cpp:449] Starting replica recovery
I0813 10:19:43.250239 2612 recover.cpp:475] Replica is in EMPTY status
I0813 10:19:43.250819 2612 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0813 10:19:43.251014 2607 recover.cpp:195] Received a recover response from a replica in EMPTY status
*I0813 10:19:43.249503 2542 master.cpp:349] Master 20150813-101943-151938119-5050-2542 (71.100.14.9) started on 71.100.14.9:5050*
I0813 10:19:43.252053 2610 recover.cpp:566] Updating replica status to STARTING
I0813 10:19:43.252571 2542 master.cpp:397] Master allowing unauthenticated frameworks to register
I0813 10:19:43.253159 2542 master.cpp:402] Master allowing unauthenticated slaves to register
I0813 10:19:43.254276 2612 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 1.816161ms
I0813 10:19:43.254323 2612 replica.cpp:323] Persisted replica status to STARTING
I0813 10:19:43.254905 2612 recover.cpp:475] Replica is in STARTING status
I0813 10:19:43.255203 2612 replica.cpp:641] Replica in STARTING status received a broadcasted recover request
I0813 10:19:43.255265 2612 recover.cpp:195] Received a recover response from a replica in STARTING status
I0813 10:19:43.255343 2612 recover.cpp:566] Updating replica status to VOTING
I0813 10:19:43.258730 2611 master.cpp:1295] Successfully attached file '/var/log/mesos/mesos-master.INFO'
I0813 10:19:43.258760 2609 contender.cpp:131] Joining the ZK group
I0813 10:19:43.258862 2612 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 3.477458ms
I0813 10:19:43.258894 2612 replica.cpp:323] Persisted replica status to VOTING
I0813 10:19:43.258934 2612 recover.cpp:580] Successfully joined the Paxos group
I0813 10:19:43.258987 2612 recover.cpp:464] Recover process terminated
I0813 10:19:46.590340 2606 group.cpp:313] Group process (group(1)@71.100.14.9:5050) connected to ZooKeeper
I0813 10:19:46.590373 2606 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0813 10:19:46.590386 2606 group.cpp:385] Trying to create path '/mesos/log_replicas' in ZooKeeper
I0813 10:19:46.591442 2606 network.hpp:424] ZooKeeper group memberships changed
I0813 10:19:46.591514 2606 group.cpp:659] Trying to get '/mesos/log_replicas/00' in ZooKeeper
I0813 10:19:46.592146 2606 group.cpp:659] Trying to get '/mesos/log_replicas/01' in ZooKeeper
I0813 10:19:46.593128 2608 network.hpp:466] ZooKeeper group PIDs: { log-replica(1)@127.0.1.1:5050 }
I0813 10:19:46.593955 2608 group.cpp:313] Group process (group(2)@71.100.14.9:5050) connected to ZooKeeper
I0813 10:19:46.593977 2608 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0)
I0813 10:19:46.593986 2608 group.cpp:385] Trying to create path '/mesos/log_replicas' in ZooKeeper
I0813 10:19:46.594894 2605 group.cpp:313] Group process (group(3)@
Re: Can't start master properly (stale state issue?); help!
Marco & haosdent,

This is just a quick note to say thank you for your replies. I will answer you much more fully tomorrow, but for now can only manage a few quick observations & questions:

1. Having some months ago encountered a known problem with the IP@ 127.0.1.1 (I'll provide references tomorrow), I early on configured /etc/hosts, replacing "127.0.1.1 myHostName" with "Real_IP myHostName". That said, I can't rule out a race condition whereby ZK / mesos-master saw the original unchanged /etc/hosts before I zapped it.

2. What is a znode and how would I drop it?

I start the services (zk, master, marathon; all on the same host) by SSHing into the host & doing "service start" commands.

Again, thanks very much; and more tomorrow.

Cordially,

Paul

On Thu, Aug 13, 2015 at 1:08 PM, haosdent haosd...@gmail.com wrote:

Hello, how do you start the master? And could you try "netstat -antp | grep 5050" to check whether multiple master processes are running on the same machine?

On Thu, Aug 13, 2015 at 10:37 PM, Paul Bell arach...@gmail.com wrote: [original message and log snipped]
Re: Custom flags to docker run
Hi Stephen,

Via Marathon I am deploying Docker containers across a Mesos cluster. The containers have unique Weave IP@s allowing inter-container communication. All things considered, getting to this point has been relatively straightforward, and Weave has been one of the IJW components.

I'd be curious to learn why you're finding Weave messy. If you'd like to take it out-of-band (as it were), please feel free to e-mail me directly.

Cordially,

Paul

On Wed, Aug 12, 2015 at 3:16 AM, Stephen Knight skni...@pivotal.io wrote:

Hi,

Is there a way to pass a custom flag to "docker run" through the Marathon API? I've not seen anything in the documentation; this could just be a basic reading fail on my part.

What I want to do is use Calico (or similar) with Docker and provision containers via Marathon. Weave is messy for what I am trying to achieve and the integration isn't going as planned. Is there a better option, and how can you then integrate it? Does that flexibility exist in the Marathon API?

Thx

Stephen
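A note for readers of this thread: the Marathon app JSON of this era does expose a container.docker.parameters array whose entries are passed through to "docker run" as --key=value flags (verify against your Marathon version's REST API docs). A minimal sketch, with an illustrative app id and flag:

  curl -X POST http://marathon-host:8080/v2/apps \
    -H 'Content-Type: application/json' \
    -d '{
          "id": "custom-flag-test",
          "cmd": "sleep 3600",
          "cpus": 0.5, "mem": 256, "instances": 1,
          "container": {
            "type": "DOCKER",
            "docker": {
              "image": "busybox",
              "parameters": [ { "key": "dns", "value": "8.8.8.8" } ]
            }
          }
        }'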
Re: Can't start master properly (stale state issue?); help!
All,

By way of some background: I'm not running a data center (or centers). Rather, I work on a distributed application whose trajectory is taking it into a realm of many Docker containers distributed across many hosts (mostly virtual hosts at the outset). An environment that supports isolation, multi-tenancy, scalability, and some fault tolerance is desirable for this application. Also, the mere ability to simplify - at least somewhat - the management of multiple hosts is of great importance. So, that's more or less how I got to Mesos and to here...

I ended up writing a Java program that configures a collection of host VMs as a Mesos cluster and then, via Marathon, distributes the application containers across the cluster. Configuring & building the cluster is largely a lot of SSH work. Doing the same for the application is part Marathon, part Docker remote API. The containers that need to talk to each other via TCP are connected with Weave's (http://weave.works) overlay network. So the main infrastructure consists of Mesos, Docker, and Weave. The whole thing is pretty amazing - for which I take very little credit. Rather, these are some wonderful technologies, and the folks who write & support them are very helpful. That said, I sometimes feel like I'm juggling chain saws!

*In re* the issues raised on this thread:

All Mesos components were installed via the Mesosphere packages. The 4 VMs in the cluster are all running Ubuntu 14.04 LTS.

My suspicions about the IP@ 127.0.1.1 were raised a few months ago when, after seeing this IP in a mesos-master log when things weren't working, I discovered these articles:

https://groups.google.com/forum/#!topic/marathon-framework/1qboeZTOLU4
http://frankhinek.com/build-mesos-multi-node-ha-cluster/ (see note 2)

So, to the point raised just now by Klaus (and earlier in the thread), the aforementioned configuration program does change /etc/hosts (and /etc/hostname) in the way Klaus suggested. But, as I mentioned to Marco & haosdent, I might have encountered a race condition wherein ZK / mesos-master saw the unchanged /etc/hosts before I altered it. I believe that I yesterday fixed that issue.

Also, as part of the cluster create step, I get a bit aggressive (perhaps unwisely) with what I believe are some state repositories. Specifically, I

rm /var/lib/zookeeper/version-2/*
rm -Rf /var/lib/mesos/replicated_log

Should I NOT be doing this? I know from experience that zapping the version-2 directory (ZK's dataDir, IIRC) can solve occasional weirdness. Marco, is /var/lib/mesos/replicated_log what you are referring to when you say "some issue with the log-replica"?

Just a day or two ago I first heard the term znode & learned a little about zkCli.sh. I will experiment with it more in the coming days.

As matters now stand, I have the cluster up and running. But before I again deploy the application, I am trying to put the cluster through its paces by periodically cycling it through the states my program can bring about, e.g.,

--cluster create (takes a clean VM and configures it to act as one or more Mesos components: ZK, master, slave)
--cluster stop (stops the Mesos services on each node)
--cluster destroy (configures the VM back to its original clean state)
--cluster create
--cluster stop
--cluster start

et cetera.
*The only way I got rid of the "no leading master" issue that started this thread was by wiping out the master VM and starting over with a clean VM. That is, stopping/destroying/creating (even rebooting) the cluster had no effect.*

I suspect that, sooner or later, I will again hit this problem (probably sooner!). And I want to understand how best to handle it. Such an occurrence could be pretty awkward at a customer site.

Thanks for all your help.

Cordially,

Paul

On Thu, Aug 13, 2015 at 9:41 PM, Klaus Ma kl...@cguru.net wrote:

I used to meet a similar issue with Zookeeper + Mesos; I resolved it by removing 127.0.1.1 from /etc/hosts. Here is an example:

klaus@klaus-OptiPlex-780:~/Workspace/mesos$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 klaus-OptiPlex-780  *<= remove this line, and add a new line mapping the real IP (e.g. 192.168.1.100) to the hostname*
...

BTW, please also clean up the log directory and re-start ZK & Mesos. If any more comments, please let me know.

Regards,
Klaus Ma (马达), PMP® | http://www.cguru.net

--
Date: Thu, 13 Aug 2015 12:20:34 -0700
Subject: Re: Can't start master properly (stale state issue?); help!
From: ma...@mesosphere.io
To: user@mesos.apache.org

On Thu, Aug 13, 2015 at 11:53 AM, Paul Bell arach...@gmail.com
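A note on the znode question raised above: ZooKeeper stores everything in a tree of "znodes", and Mesos keeps its leader-election and log-replica state under the znode named in the masters' --zk flag (here, /mesos). A sketch of inspecting and dropping that state with zkCli.sh, assuming the Ubuntu package layout and the default /mesos path (stop the masters first; this erases election state):

  /usr/share/zookeeper/bin/zkCli.sh -server 127.0.0.1:2181
  # then, at the zk prompt:
  ls /mesos                             # list Mesos's znodes (info_..., log_replicas, ...)
  get /mesos/log_replicas/0000000000    # inspect one child (name illustrative)
  rmr /mesos                            # recursively remove the stale state (ZK 3.4's CLI)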
Fate of slave node after timeout
Hi All,

IIRC, after (max_slave_ping_timeouts * slave_ping_timeout) is exceeded without a response from a mesos-slave, the master will remove the slave. In the Mesos UI I can see the slave count transition from 1 (deactivated) to 0.

Can that slave never again be added into the cluster, i.e., what happens if it comes up 1 second after exceeding the timeout product?

(I'm dusting off some old notes and trying to refresh my memory about problems I haven't seen in quite some time.)

Thank you.

-Paul
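For reference, the arithmetic with the stock defaults, which are an assumption here (verify with "mesos-master --help" on your build, since defaults can change between releases):

  # --slave_ping_timeout=15secs and --max_slave_ping_timeouts=5 by default,
  # so an unresponsive agent is removed after roughly 15s * 5 = 75s:
  mesos-master --zk=zk://host:2181/mesos --quorum=1 \
    --slave_ping_timeout=15secs --max_slave_ping_timeouts=5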
Re: manage jobs log files in sandboxes
I've done a little reconnoitering, and the terrain looks to me as follows:

1. Docker maintains container log files at /var/lib/docker/containers/<CTR_ID>/<CTR_ID>-json.log
2. Mesos maintains container STDOUT files at a slave/framework/application specific location, e.g., /tmp/mesos/slaves/20151102-082316-370041927-5050-32381-S1/frameworks/20151102-082316-370041927-5050-32381-/executors/ecxprimary1.80750071-81a0-11e5-8596-82d195a34239/runs/5c767378-9599-40af-8010-a31f4c55f9dc
3. The latter is mapped to the container's /mnt/mesos/sandbox
4. These two files (the -json.log and the STDOUT file) are different; *each* consumes disk space.

I think that the answer to (1) is Docker's log rotation. As to (2), I am considering a cron job at host (not container) level that drives the truncate command (GNU coreutils) to prune these files at a certain size. Obviously this requires knowing the fully-qualified path under /tmp/mesos/slaves, but this is readily available via "docker inspect".

-Paul

On Fri, Nov 6, 2015 at 7:17 AM, Paul Bell <arach...@gmail.com> wrote:

> Hi Mauricio,
>
> Yeah, I see your point; thank you.
>
> My approach would be akin to closing the barn door after the horse got
> out. Both Mesos & Docker are doing their own writing of STDOUT. Docker's
> rotation won't address Mesos's behavior.
>
> I need to find a solution here.
>
> -Paul
>
> On Thu, Nov 5, 2015 at 10:46 PM, Mauricio Garavaglia <
> mauriciogaravag...@gmail.com> wrote:
>
>> Hi Paul,
>>
>> I don't think that's going to help :(
>> Even if you configure a different docker log driver, Docker still sends
>> things to stdout, which is caught by Mesos and dumped in the logs
>> in the job sandbox. For example, by default docker logs into a
>> json file in /var/lib/docker but mesos still writes to the sandbox.
>>
>> Hi Mauricio,
>>
>> I'm grappling with the same issue.
>>
>> I'm not yet sure if it represents a viable solution, but I plan to look
>> at Docker's log rotation facility. It was introduced in Docker 1.8.
>>
>> If you beat me to it & it looks like a solution, please let us know!
>>
>> Thanks.
>>
>> Cordially,
>>
>> Paul
>>
>> > On Nov 5, 2015, at 9:40 PM, Mauricio Garavaglia <
>> mauriciogaravag...@gmail.com> wrote:
>> >
>> > Hi guys,
>> >
>> > How can I manage the stdout/err log files generated by jobs in mesos?
>> For long-running docker apps launched using Marathon, the log files can
>> deplete the disk of an agent, and using quotas makes the jobs be killed,
>> which is also not ideal. I'd like to have a way to rotate them.
>> >
>> > Is it correct to just go to the mesos agent workdir and go through each
>> sandbox stdout/err and rotate them? I know that could break the log UI, but
>> it doesn't scale very well having logs of several GB.
>> >
>> > Thanks!
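A minimal sketch of the cron-plus-truncate idea described above (the 100M threshold and the /tmp/mesos work dir are illustrative; note that truncating discards those log lines for good):

  #!/bin/sh
  # /etc/cron.hourly/prune-sandbox-logs
  # Empty any sandbox stdout/stderr file that has grown past 100MB.
  find /tmp/mesos/slaves -type f \( -name stdout -o -name stderr \) \
    -size +100M -exec truncate -s 0 {} \;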
Re: manage jobs log files in sandboxes
Hi Mauricio,

Yeah, I see your point; thank you.

My approach would be akin to closing the barn door after the horse got out. Both Mesos & Docker are doing their own writing of STDOUT. Docker's rotation won't address Mesos's behavior.

I need to find a solution here.

-Paul

On Thu, Nov 5, 2015 at 10:46 PM, Mauricio Garavaglia < mauriciogaravag...@gmail.com> wrote:

> Hi Paul,
>
> I don't think that's going to help :(
> Even if you configure a different docker log driver, Docker still sends
> things to stdout, which is caught by Mesos and dumped in the logs
> in the job sandbox. For example, by default docker logs into a
> json file in /var/lib/docker but mesos still writes to the sandbox.
>
> Hi Mauricio,
>
> I'm grappling with the same issue.
>
> I'm not yet sure if it represents a viable solution, but I plan to look at
> Docker's log rotation facility. It was introduced in Docker 1.8.
>
> If you beat me to it & it looks like a solution, please let us know!
>
> Thanks.
>
> Cordially,
>
> Paul
>
> > On Nov 5, 2015, at 9:40 PM, Mauricio Garavaglia <
> mauriciogaravag...@gmail.com> wrote:
> >
> > Hi guys,
> >
> > How can I manage the stdout/err log files generated by jobs in mesos?
> For long-running docker apps launched using Marathon, the log files can
> deplete the disk of an agent, and using quotas makes the jobs be killed,
> which is also not ideal. I'd like to have a way to rotate them.
> >
> > Is it correct to just go to the mesos agent workdir and go through each
> sandbox stdout/err and rotate them? I know that could break the log UI, but
> it doesn't scale very well having logs of several GB.
> >
> > Thanks!
Use "docker start" rather than "docker run"?
Hi All,

I first posted this to the Marathon list, but someone suggested I try it here.

I'm still not sure what component (mesos-master, mesos-slave, marathon) generates the "docker run" command that launches containers on a slave node. I suppose that it's the framework executor (Marathon) on the slave that actually executes the "docker run", but I'm not sure.

What I'm really after is whether or not we can cause the use of "docker start" rather than "docker run".

At issue here is some persistent data inside /var/lib/docker/aufs/mnt/<CTR_ID>. "docker run" will by design (re)launch my application with a different CTR_ID, effectively rendering that data inaccessible. But "docker start" will restart the container and its old data will still be there.

Thanks.

-Paul
Re: Use "docker start" rather than "docker run"?
Alex and Marco,

Thanks very much for your really helpful explanations. For better or worse, neither C++ nor Python is my thing; Java's the go-to language for me.

Cordially,

Paul

On Sat, Aug 29, 2015 at 5:23 AM, Marco Massenzio <ma...@mesosphere.io> wrote:

> Hi Paul,
>
> +1 to what Alex/Tim say.
>
> Maybe a (simple) example will help: a very basic "framework" I created
> recently does away with the "Executor" and only uses the "Scheduler",
> sending a CommandInfo structure to Mesos' Agent node to execute.
>
> See:
> https://github.com/massenz/mongo_fw/blob/develop/src/mongo_scheduler.cpp#L124
>
> If Python is more your thing, there are examples in the Mesos repository,
> or you can take a look at something I started recently to use the new
> (0.24) HTTP API (NOTE - this is still very much WIP):
> https://github.com/massenz/zk-mesos/blob/develop/notebooks/HTTP%20API%20Tests.ipynb
>
> *Marco Massenzio*
> *Distributed Systems Engineer*
> http://codetrips.com
>
> On Fri, Aug 28, 2015 at 8:44 AM, Paul Bell <arach...@gmail.com> wrote:
>
>> Alex & Tim,
>>
>> Thank you both; most helpful.
>>
>> Alex, can you dispel my confusion on this point: I keep reading that a
>> "framework" in Mesos (e.g., Marathon) consists of a scheduler and an
>> executor. This reference to "executor" made me think that Marathon must
>> have *some* kind of presence on the slave node. But the more familiar I
>> become with Mesos the less likely this seems to me. So, what does it mean
>> to talk about the Marathon framework "executor"?
>>
>> Tim, I did come up with a simple work-around that involves re-copying the
>> needed file into the container each time the application is started. For
>> reasons unknown, this file is not kept in a location that would readily
>> lend itself to my use of persistent storage (Docker -v). That said, I am
>> keenly interested in learning how to write both custom executors &
>> schedulers. Any sense for what release of Mesos will see "persistent
>> volumes"?
>>
>> Thanks again, gents.
>>
>> -Paul
>>
>> On Fri, Aug 28, 2015 at 2:26 PM, Tim Chen <t...@mesosphere.io> wrote:
>>
>>> Hi Paul,
>>>
>>> We don't [re]start a container since we assume once the task terminated
>>> the container is no longer reused. In Mesos, to allow tasks to reuse the
>>> same executor and handle task logic accordingly, people will opt to choose
>>> the custom executor route.
>>>
>>> We're working on a way to keep your sandbox data beyond a container
>>> lifecycle, which is called persistent volumes. We haven't integrated that
>>> with the Docker containerizer yet, so you'll have to wait to use that feature.
>>>
>>> You could also choose to implement a custom executor for now if you like.
>>>
>>> Tim
>>>
>>> On Fri, Aug 28, 2015 at 10:43 AM, Alex Rukletsov <a...@mesosphere.com>
>>> wrote:
>>>
>>>> Paul,
>>>>
>>>> that component is called DockerContainerizer and it's part of the Mesos
>>>> Agent (check
>>>> "/Users/alex/Projects/mesos/src/slave/containerizer/docker.hpp"). @Tim,
>>>> could you answer the "docker start" vs. "docker run" question?
>>>>
>>>> On Fri, Aug 28, 2015 at 1:26 PM, Paul Bell <arach...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I first posted this to the Marathon list, but someone suggested I try
>>>>> it here.
>>>>>
>>>>> I'm still not sure what component (mesos-master, mesos-slave,
>>>>> marathon) generates the "docker run" command that launches containers on a
>>>>> slave node.
>>>>> I suppose that it's the framework executor (Marathon) on the
>>>>> slave that actually executes the "docker run", but I'm not sure.
>>>>>
>>>>> What I'm really after is whether or not we can cause the use of
>>>>> "docker start" rather than "docker run".
>>>>>
>>>>> At issue here is some persistent data inside
>>>>> /var/lib/docker/aufs/mnt/<CTR_ID>. "docker run" will by design (re)launch
>>>>> my application with a different CTR_ID, effectively rendering that data
>>>>> inaccessible. But "docker start" will restart the container and its "old"
>>>>> data will still be there.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> -Paul
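Until the persistent-volumes feature Tim mentions lands, the usual stopgap is a Docker host volume, so the data lives outside the container's aufs layer and survives a fresh "docker run". A sketch in the Marathon app JSON of this era (field names per its REST API; paths and app id are illustrative):

  curl -X PUT http://marathon-host:8080/v2/apps/my-app \
    -H 'Content-Type: application/json' \
    -d '{
          "container": {
            "type": "DOCKER",
            "docker": { "image": "mongo:2.6.8" },
            "volumes": [
              { "containerPath": "/data/db",
                "hostPath": "/var/data/my-app",
                "mode": "RW" }
            ]
          }
        }'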
Re: Detecting slave crashes event
Thank you, Benjamin.

So, I could periodically request the metrics endpoint, or stream the logs (maybe via mesos.cli, or SSH)? What, roughly, does the "agent removed" message look like in the logs?

Are there plans to offer a mechanism for event subscription?

Cordially,

Paul

On Wed, Sep 16, 2015 at 1:30 PM, Benjamin Mahler <benjamin.mah...@gmail.com> wrote:

> You can detect when we remove an agent due to health check failures via
> the metrics endpoint, but these are counters that are better used for
> alerting / dashboards for visibility. If you need to know which agents, you
> can also consume the logs as a stop-gap solution, until we offer a
> mechanism for subscribing to cluster events.
>
> On Wed, Sep 16, 2015 at 10:11 AM, Paul Bell <arach...@gmail.com> wrote:
>
>> Hi All,
>>
>> I am led to believe that, unlike Marathon, Mesos doesn't (yet?) offer a
>> subscribable event bus.
>>
>> So I am wondering if there's a best practices way of determining if a
>> slave node has crashed. By "crashed" I mean something like the power plug
>> got yanked, or anything that would cause Mesos to stop talking to the slave
>> node.
>>
>> I suppose such information would be recorded in /var/log/mesos.
>>
>> Interested to learn how best to detect this.
>>
>> Thank you.
>>
>> -Paul
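A polling sketch against the master's metrics endpoint (the slave_removals counter family is from the 0.2x series; verify the exact names in your build's /metrics/snapshot output):

  curl -s http://master-host:5050/metrics/snapshot \
    | python -m json.tool | grep slave_removals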
Re: Changing mesos slave configuration
Hi Pradeep,

Perhaps I am speaking to a slightly different point, but when I change /etc/default/mesos-slave to add a new attribute, I have to remove the file /tmp/mesos/meta/slaves/latest. IIRC, mesos-slave itself, in failing to start after such a change, tells me to do this:

rm -f /tmp/mesos/meta/slaves/latest

But I know of no way to make such configuration changes without downtime. And I'd very much like it if Mesos supported such dynamic changes. I suppose this would require that the agent consult its default file on demand, rather than once at start-up.

Cordially,

Paul

On Wed, Sep 23, 2015 at 4:41 AM, Pradeep Chhetri < pradeep.chhetr...@gmail.com> wrote:

> Hello all,
>
> I have often faced this problem: whenever I try to add some
> configuration parameter to mesos-slave or change any configuration (e.g., add
> a new attribute), the mesos-slave doesn't come up on restart.
> I have to delete the slave.info file and then restart the slave, but it
> ends up killing all the Docker containers started using Mesos.
>
> I was trying to figure out the best way to make such changes without
> any downtime.
>
> Thank you.
>
> --
> Pradeep Chhetri
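For the record, a sketch of the full sequence (it assumes the default --work_dir=/tmp/mesos and the Mesosphere packaging; running tasks on the agent are lost):

  service mesos-slave stop
  # add the new attribute in /etc/default/mesos-slave (or /etc/mesos-slave/attributes)
  rm -f /tmp/mesos/meta/slaves/latest   # discard the checkpointed SlaveInfo
  service mesos-slave start             # agent re-registers under a new ID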
Re: Use docker start rather than docker run?
Alex & Tim,

Thank you both; most helpful.

Alex, can you dispel my confusion on this point: I keep reading that a "framework" in Mesos (e.g., Marathon) consists of a scheduler and an executor. This reference to "executor" made me think that Marathon must have *some* kind of presence on the slave node. But the more familiar I become with Mesos the less likely this seems to me. So, what does it mean to talk about the Marathon framework "executor"?

Tim, I did come up with a simple work-around that involves re-copying the needed file into the container each time the application is started. For reasons unknown, this file is not kept in a location that would readily lend itself to my use of persistent storage (Docker -v). That said, I am keenly interested in learning how to write both custom executors & schedulers. Any sense for what release of Mesos will see persistent volumes?

Thanks again, gents.

-Paul

On Fri, Aug 28, 2015 at 2:26 PM, Tim Chen t...@mesosphere.io wrote:

Hi Paul,

We don't [re]start a container since we assume once the task terminated the container is no longer reused. In Mesos, to allow tasks to reuse the same executor and handle task logic accordingly, people will opt to choose the custom executor route.

We're working on a way to keep your sandbox data beyond a container lifecycle, which is called persistent volumes. We haven't integrated that with the Docker containerizer yet, so you'll have to wait to use that feature.

You could also choose to implement a custom executor for now if you like.

Tim

On Fri, Aug 28, 2015 at 10:43 AM, Alex Rukletsov a...@mesosphere.com wrote:

Paul,

that component is called DockerContainerizer and it's part of the Mesos Agent (check "/Users/alex/Projects/mesos/src/slave/containerizer/docker.hpp"). @Tim, could you answer the "docker start" vs. "docker run" question?

On Fri, Aug 28, 2015 at 1:26 PM, Paul Bell arach...@gmail.com wrote:

Hi All,

I first posted this to the Marathon list, but someone suggested I try it here.

I'm still not sure what component (mesos-master, mesos-slave, marathon) generates the "docker run" command that launches containers on a slave node. I suppose that it's the framework executor (Marathon) on the slave that actually executes the "docker run", but I'm not sure.

What I'm really after is whether or not we can cause the use of "docker start" rather than "docker run".

At issue here is some persistent data inside /var/lib/docker/aufs/mnt/<CTR_ID>. "docker run" will by design (re)launch my application with a different CTR_ID, effectively rendering that data inaccessible. But "docker start" will restart the container and its old data will still be there.

Thanks.

-Paul
Re: Securing executors
Thanks, Alexander; I will check out the vid.

I kind of assumed that this port was used for exactly the purpose you mention. Is TLS a possibility here?

-Paul

On Tue, Oct 6, 2015 at 8:15 AM, Alexander Rojas <alexan...@mesosphere.io> wrote:

> Hi Paul,
>
> I can refer you to the talk given by Adam Bordelon at MesosCon:
> https://www.youtube.com/watch?v=G3sn1OLYDOE
>
> If you want the short answer, the solution is to put a firewall around
> your cluster.
>
> On a closer look at the port, it is the one used for message passing
> between the mesos-docker-executor and other Mesos components.
>
> On 05 Oct 2015, at 19:04, Paul Bell <arach...@gmail.com> wrote:
>
> Hi All,
>
> I am running an nmap port scan on a Mesos agent node and noticed nmap
> reporting an open TCP port at 50577.
>
> Poking around some, I discovered exactly 5 mesos-docker-executor
> processes, one for each of my 5 Docker containers, and each with an open
> listen port:
>
> root 14131 3617 0 10:39 ? 00:00:17 mesos-docker-executor
> --container=mesos-20151002-172703-2450482247-5050-3014-S0.5563c65a-e33e-4287-8ce4-b2aa8116aa95
> --docker=/usr/local/ecxmcc/weaveShim --help=false
> --mapped_directory=/mnt/mesos/sandbox
> --sandbox_directory=/tmp/mesos/slaves/20151002-172703-2450482247-5050-3014-S0/frameworks/20151002-172703-2450482247-5050-3014-/executors/postgres.ea2954fd-6b6e-11e5-8bef-56847afe9799/runs/5563c65a-e33e-4287-8ce4-b2aa8116aa95
> --stop_timeout=15secs
>
> I suppose that all of this is unsurprising. But I know of at least one big
> customer who will without delay run Nmap or Nessus against my clustered
> deployment.
>
> So I am wondering what the best practices approach is to securing these
> open ports.
>
> Thanks for your help.
>
> -Paul
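A sketch of the firewall approach for those executor ports (the subnet and port range are illustrative, and over-broad rules will break the cluster itself, so test carefully before deploying):

  # allow cluster peers and established flows; drop outside probes of the
  # ephemeral range where the executors listen
  iptables -A INPUT -p tcp -s 10.0.0.0/24 -j ACCEPT
  iptables -A INPUT -p tcp -m state --state ESTABLISHED,RELATED -j ACCEPT
  iptables -A INPUT -p tcp --dport 32768:61000 -j DROP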
Securing executors
Hi All,

I am running an nmap port scan on a Mesos agent node and noticed nmap reporting an open TCP port at 50577.

Poking around some, I discovered exactly 5 mesos-docker-executor processes, one for each of my 5 Docker containers, and each with an open listen port:

root 14131 3617 0 10:39 ? 00:00:17 mesos-docker-executor --container=mesos-20151002-172703-2450482247-5050-3014-S0.5563c65a-e33e-4287-8ce4-b2aa8116aa95 --docker=/usr/local/ecxmcc/weaveShim --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/tmp/mesos/slaves/20151002-172703-2450482247-5050-3014-S0/frameworks/20151002-172703-2450482247-5050-3014-/executors/postgres.ea2954fd-6b6e-11e5-8bef-56847afe9799/runs/5563c65a-e33e-4287-8ce4-b2aa8116aa95 --stop_timeout=15secs

I suppose that all of this is unsurprising. But I know of at least one big customer who will without delay run Nmap or Nessus against my clustered deployment.

So I am wondering what the best practices approach is to securing these open ports.

Thanks for your help.

-Paul
Re: Anyone try Weave in Mesos env ?
Hi Weitao,

I came up with this architecture as a way of distributing our application across multiple nodes. Pre-Mesos, our application, delivered as a single VMware VM, was not easily scalable. By breaking out the several application components as Docker containers, we are now able (within limits imposed chiefly by the application itself) to distribute & run those containers across the several nodes in the Mesos cluster. Application containers that need to talk to each other are connected via Weave's "overlay" (veth) network.

Not surprisingly, this architecture has some of the benefits that you'd expect from Mesos, chief among them being high-availability (more on this below), scalability, and hybrid Cloud deployment.

The core unit of deployment is an Ubuntu image (14.04 LTS) that I've configured with the appropriate components:

Zookeeper
Mesos-master
Mesos-slave
Marathon
Docker
Weave
SSH (including RSA keys)
Our application

This image is presently downloaded by a customer as a VMware .ova file. We typically ask the customer to convert the resulting VM to a so-called VMware template from which she can easily deploy multiple VMs as needed. Please note that although we've started with VMware as our virtualization platform, I've successfully run cluster nodes on both EC2 and Azure.

I tend to describe the Ubuntu image as "polymorphic", i.e., it can be told to assume one of two roles, either a "master" role or a "slave" role. A master runs ZK, mesos-master, and Marathon. A slave runs mesos-slave, Docker, Weave, and the application.

We presently offer 3 canned deployment options:

1. single-host, no HA
2. multi-host, no HA (1 master, 3 slaves)
3. multi-host, HA (3 masters, 3 slaves)

The single-host, no HA option exists chiefly to mimic the original pre-Mesos deployment. But it has the added virtue, thanks to Mesos, of allowing us to dynamically "grow" from a single host to multiple hosts. The multi-host, no HA option is presently geared toward a sharded MongoDB backend where each slave runs a mongod container that is a single partition (shard) of the larger database. This deployment option also lends itself very nicely to adding a new slave node at the cluster level, and a new mongod container at the application level - all without any downtime whatsoever.

The multi-host, HA option offers the probably familiar *cluster-level* high availability. I stress "cluster-level" because I think we have to distinguish between HA at that level & HA at the application level. The former is realized by the 3 master hosts, i.e., you can lose a master and a new one will self-elect, thereby keeping the cluster up & running. But, to my mind at least, application-level HA requires some co-operation on the part of the application itself (e.g., checkpoint/restart). That said, it *is* almost magical to watch Mesos re-launch an application container that has crashed. But whether or not that re-launch results in coherent application behavior is another matter.

An important home-grown component here is a Java program that automates these functions:

create cluster - configures a host for a given role and starts Mesos services. This is done via SSH
start application - distributes application containers across slave hosts. This is done by talking to the Marathon REST API
stop application - again, via the Marathon REST API
stop cluster - stops Mesos services. Again, via SSH
destroy cluster - deconfigures the host (after which it has no defined role); again, SSH

As I write, I see Ajay's e-mail arrive about Calico.
I am aware of this project and it seems quite solid. But I've never understood the need to "worry about networking containers in a multihost setup". Weave runs as a Docker container and It Just Works. I've "woven" together slave nodes in a cluster that spanned 3 different datacenters, one of them in EC2, without any difficulty.

Yes, I do have to assign Weave IP addresses to the several containers, but this is hardly onerous. In fact, I've found it "liberating" to select such addresses from a CIDR/8 address space, assigning them to containers based on the container's purpose (e.g., MongoDB shard containers might live at 10.4.0.X, etc.). Ultimately, this assignment boils down to setting an environment variable that Marathon (or the mesos-slave executor) will use when creating the container via "docker run".

There is a whole lot more that I could say about the internals of this architecture. But, if you're still interested, I'll await further questions from you.

HTH.

Cordially,

Paul

On Thu, Nov 26, 2015 at 7:16 AM, Paul wrote:

> Gladly, Weitao. It'd be my pleasure.
>
> But give me a few hours to find some free time.
>
> I am today tasked with cooking a Thanksgiving turkey.
>
> But I will try to find the time before noon today (I'm on the right coast
> in the USA).
>
> -Paul
>
> > On Nov 25, 2015, at 11:26 PM, Weitao wrote:
> >
> > Hi, Paul. Can you share
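A sketch of that environment-variable assignment, assuming the WEAVE_CIDR convention used by Weave's Docker integration (app id, image, and address are illustrative):

  curl -X POST http://master-host:8080/v2/apps \
    -H 'Content-Type: application/json' \
    -d '{
          "id": "mongod-shard1",
          "cpus": 1, "mem": 1024, "instances": 1,
          "env": { "WEAVE_CIDR": "10.4.0.1/8" },
          "container": {
            "type": "DOCKER",
            "docker": { "image": "mongo:2.6.8" }
          }
        }'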
Re: Anyone try Weave in Mesos env ?
Hmm, I'm not sure there's really a "fix" for that (BTW: I assume you mean to fix high (or long) latency, i.e., to make it lower, faster). A network link is a network link, right? Like all hardware, it has its own physical characteristics which determine its latency's lower bound, below which it is physically impossible to go.

Sounds to me as if you've got the whole Mesos + Docker + Weave thing figured out, at least as far as the basic connectivity and addressing is concerned. So there's not much more that I can tell you in that regard.

Are you running Weave 1.2 (or above)? It incorporates their "fast data path" technology based on the Linux kernel's Open vSwitch (*vide*: http://blog.weave.works/2015/11/13/weave-docker-networking-performance-fast-data-path/). But, remember, there's still the link in between endpoints. One can optimize the packet handling within an endpoint, but this could boil down to a case of "hurry up and wait".

I would urge you to take this question up with the friendly, knowledgeable, and very helpful folks at Weave: https://groups.google.com/a/weave.works/forum/#!forum/weave-users

Cordially,

Paul

On Wed, Nov 25, 2015 at 9:31 PM, Sam wrote:

> Paul,
> Yup, Weave and Docker. May I know how you fixed the low latency issue over
> the Internet? By tunnel, or?
>
> Regards,
> Sam
>
> Sent from my iPhone
>
> > On Nov 26, 2015, at 10:23 AM, Paul wrote:
> >
> > Happy Thanksgiving to you, too.
> >
> > I tend to deploy the several Mesos nodes as VMware VMs.
> >
> > However, I've also run a cluster with master on ESXi, slaves on ESXi,
> > slave on bare metal, and an EC2 slave.
> >
> > But in my case all applications are Docker containers connected via
> > Weave.
> >
> > Does your present deployment involve Docker and Weave?
> >
> > -paul
> >
> >> On Nov 25, 2015, at 8:55 PM, Sam wrote:
> >>
> >> Paul,
> >> Happy thanksgiving first. We are using AWS and Rackspace as a hybrid cloud
> >> env, and we deployed the Mesos master in AWS, part of the slaves in AWS, and
> >> part of the slaves in Rackspace. I am thinking whether it works? And since it
> >> got low latency in networking, can we deploy two masters in both AWS and
> >> Rackspace? And federation? Appreciated for your reply.
> >>
> >> Regards,
> >> Sam
> >>
> >> Sent from my iPhone
> >>
> >>> On Nov 26, 2015, at 9:47 AM, Paul wrote:
> >>>
> >>> Hi Sam,
> >>>
> >>> Yeah, I have significant experience in this regard.
> >>>
> >>> We run Docker containers spread across several Mesos slave nodes.
> >>> The containers are all connected via Weave. It works very well.
> >>>
> >>> Can you describe what you have in mind?
> >>>
> >>> Cordially,
> >>>
> >>> Paul
> >>>
> >>>> On Nov 25, 2015, at 8:03 PM, Sam wrote:
> >>>>
> >>>> Guys,
> >>>> We are trying to use Weave in a hybrid cloud Mesos env; anyone got
> >>>> experience with it? Appreciated.
> >>>> Regards,
> >>>> Sam
> >>>>
> >>>> Sent from my iPhone
Re: Help needed (alas, urgently)
In chasing down this problem, I stumbled upon something of moment: the problem does NOT seem to happen with kernel 3.13.

Some weeks back, in the hope of getting past another problem wherein the root filesystem "becomes" R/O, I upgraded from 3.13 to 3.19 (Ubuntu 14.04 LTS). The kernel upgrade was done as shown here (there's some extra stuff to get rid of Ubuntu desktop and liberate some disk space):

apt-get update
apt-get -y remove ubuntu-desktop
apt-get -y purge lightdm
rm -Rf /var/lib/lightdm-data
apt-get -y remove --purge libreoffice-core
apt-get -y remove --purge libreoffice-common
echo "Installing new kernel"
apt-get -y install linux-generic-lts-vivid
apt-get -y autoremove linux-image-3.13.0-32-generic
apt-get -y autoremove linux-image-3.13.0-71-generic
update-grub
reboot

After the reboot, a "uname -r" shows kernel 3.19.0-42-generic. Under this kernel I can now reliably reproduce the failure to stop a MongoDB container. Specifically, any & all attempts to kill the container, e.g., via

- Marathon HTTP Delete (which leads to mesos-docker-executor issuing the "docker stop" command)
- Getting inside the running container shell and issuing "kill" or db.shutdownServer()

cause the mongod container

- to show in its log that it's shutting down normally
- to enter a 100% CPU loop
- to become unkillable (only reboot "fixes" things)

Note finally that my conclusion about kernel 3.13 "working" is at present a weak induction. But I do know that when I reverted to that kernel I could, at least once, stop the containers w/o any problems; whereas at 3.19 I can reliably reproduce the problem. I will try to make this induction stronger as the day wears on.

Did I do something "wrong" in my kernel upgrade steps? Is anyone aware of such an issue in 3.19, or of work done post-3.13 in the area of task termination & signal handling?

Thanks for your help.

-Paul

On Thu, Jan 14, 2016 at 5:14 PM, Paul Bell <arach...@gmail.com> wrote:

> I spoke too soon, I'm afraid.
>
> Next time I did the stop (with zero timeout), I see the same phenomenon: a
> mongo container showing repeated:
>
> killing docker task
> shutting down
>
> What else can I try?
>
> Thank you.
>
> On Thu, Jan 14, 2016 at 5:07 PM, Paul Bell <arach...@gmail.com> wrote:
>
>> Hi Tim,
>>
>> I set docker_stop_timeout to zero as you asked. I am pleased to report
>> (though a bit fearful about being pleased) that this change seems to have
>> shut everyone down pretty much instantly.
>>
>> Can you explain what's happening, e.g., does docker_stop_timeout=0 cause
>> the immediate use of "kill -9" as opposed to "kill -2"?
>>
>> I will keep testing the behavior.
>>
>> Thank you.
>>
>> -Paul
>>
>> On Thu, Jan 14, 2016 at 3:59 PM, Paul Bell <arach...@gmail.com> wrote:
>>
>>> Hi Tim,
>>>
>>> Things have gotten slightly odder (if that's possible). When I now start
>>> the application's 5 or so containers, only one, "ecxconfigdb", gets started -
>>> and even he took a few tries. That is, I see him failing, moving to
>>> deploying, then starting again. But I've no evidence (no STDOUT, and no
>>> docker ctr logs) that shows why.
>>>
>>> In any event, ecxconfigdb does start. Happily, when I try to stop the
>>> application I am seeing the phenomena I posted before: killing docker task,
>>> shutting down repeated many times. The UN-stopped container is now running
>>> at 100% CPU.
>>>
>>> I will try modifying docker_stop_timeout. Back shortly.
>>>
>>> Thanks again.
>>>
>>> -Paul
>>>
>>> PS: what do you make of the "broken pipe" error in the docker.log?
>>>
>>> *from /var/log/upstart/docker.log*
>>>
>>> INFO[3054] GET /v1.15/images/mongo:2.6.8/json
>>> INFO[3054] GET /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>>> ERRO[3054] Handler for GET /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json returned error: No such image: mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>>> ERRO[3054] HTTP Error  err=No such image: mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b  statusCode=404
>>> INFO[3054] GET /v1.15/containers/weave/json
>>> INFO[3054] POST /v1.21/containers/create?name=m
Help needed (alas, urgently)
Hi All,

It's been quite some time since I've posted here and that's chiefly because up until a day or two ago, things were working really well.

I actually may have posted about this some time back. But then the problem seemed more intermittent.

In summa, several "docker stops" don't work, i.e., the containers are not stopped.

Deployment:

one Ubuntu VM (VMware) LTS 14.04 with kernel 3.19
Zookeeper
Mesos-master (0.23.0)
Mesos-slave (0.23.0)
Marathon (0.10.0)
Docker 1.9.1
Weave 1.1.0
Our application containers, which include MongoDB (4), PostGres, and ECX (our product)

The only thing that's changed at all in the config above is the version of Docker. Used to be 1.6.2, but I today upgraded it hoping to solve the problem.

My automater program stops the application by sending Marathon an "http delete" for each running app. Every now & then (reliably reproducible today) not all containers get stopped. Most recently, 3 containers failed to stop. Here are the attendant phenomena:

Marathon shows the 3 applications in deployment mode (presumably "deployment" in the sense of "stopping").

*ps output:*

root@71:~# ps -ef | grep docker
root 3823 1 0 13:55 ? 00:00:02 /usr/bin/docker daemon -H unix:///var/run/docker.sock -H tcp://0.0.0.0:4243
root 4967 1 0 13:57 ? 00:00:01 /usr/sbin/mesos-slave --master=zk://71.100.202.99:2181/mesos --log_dir=/var/log/mesos --containerizers=docker,mesos --docker=/usr/local/ecxmcc/weaveShim --docker_stop_timeout=15secs --executor_registration_timeout=5mins --hostname=71.100.202.99 --ip=71.100.202.99 --attributes=hostType:ecx,shard1 --resources=ports:[31000-31999,8443-8443]
root 5263 3823 0 13:57 ? 00:00:00 docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port 6783
root 5271 3823 0 13:57 ? 00:00:00 docker-proxy -proto udp -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port 6783
root 5279 3823 0 13:57 ? 00:00:00 docker-proxy -proto tcp -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port 53
root 5287 3823 0 13:57 ? 00:00:00 docker-proxy -proto udp -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port 53
root 7119 4967 0 14:00 ? 00:00:01 mesos-docker-executor --container=mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2 --docker=/usr/local/ecxmcc/weaveShim --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9/runs/bfc5a419-30f8-43f7-af2f-5582394532f2 --stop_timeout=15secs
root 7378 4967 0 14:00 ? 00:00:01 mesos-docker-executor --container=mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89 --docker=/usr/local/ecxmcc/weaveShim --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9/runs/9b700cdc-3d29-49b7-a7fc-e543a91f7b89 --stop_timeout=15secs
root 7640 4967 0 14:01 ? 00:00:01 mesos-docker-executor --container=mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298 --docker=/usr/local/ecxmcc/weaveShim --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9/runs/d7d861d3-cfc9-424d-b341-0631edea4298 --stop_timeout=15secs
*root 9696 9695 0 14:06 ? 00:00:00 /usr/bin/docker stop -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298*
*root 9709 9708 0 14:06 ? 00:00:00 /usr/bin/docker stop -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89*
*root 9720 9719 0 14:06 ? 00:00:00 /usr/bin/docker stop -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2*

*docker ps output:*

root@71:~# docker ps
CONTAINER ID   IMAGE         COMMAND                  CREATED         STATUS          PORTS       NAMES
5abafbfe7de2   mongo:2.6.8   "/w/w /entrypoint.sh "   11 minutes ago  Up 11 minutes   27017/tcp   mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
a8449682ca2e   mongo:2.6.8   "/w/w /entrypoint.sh "   11 minutes ago  Up 11 minutes   27017/tcp   mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
3b956457374b   mongo:2.6.8   "/w/w /entrypoint.sh "   11 minutes ago  Up 11
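For context on the 15-second figure in those "docker stop -t 15" processes: "docker stop" sends the container's main process SIGTERM, waits the given number of seconds, then sends SIGKILL. The agent's --docker_stop_timeout flag supplies the -t value, so setting it to 0 (as suggested later in this thread) amounts to a near-immediate SIGKILL:

  # /etc/default/mesos-slave (one way to set the flag; restart mesos-slave afterwards)
  MESOS_DOCKER_STOP_TIMEOUT=0secs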
Re: Help needed (alas, urgently)
ctory="/tmp/mesos/slaves/20160114-153418-1674208327-5050-3798-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285/runs/c5c35d59-1318-4a96-b850-b0b788815f1b" --stop_timeout="15secs" Registered docker executor on 71.100.202.99 Starting task ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285 2016-01-14T20:45:38.613+ [initandlisten] MongoDB starting : pid=1 port=27017 dbpath=/data/db 64-bit host=ecxconfigdb 2016-01-14T20:45:38.614+ [initandlisten] db version v2.6.8 2016-01-14T20:45:38.614+ [initandlisten] git version: 3abc04d6d4f71de00b57378e3277def8fd7a6700 2016-01-14T20:45:38.614+ [initandlisten] build info: Linux build5.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49 2016-01-14T20:45:38.614+ [initandlisten] allocator: tcmalloc 2016-01-14T20:45:38.614+ [initandlisten] options: { storage: { journal: { enabled: true } } } 2016-01-14T20:45:38.616+ [initandlisten] journal dir=/data/db/journal 2016-01-14T20:45:38.616+ [initandlisten] recover : no journal files present, no recovery needed 2016-01-14T20:45:39.006+ [initandlisten] waiting for connections on port 27017 2016-01-14T20:46:38.975+ [clientcursormon] mem (MB) res:77 virt:12942 2016-01-14T20:46:38.975+ [clientcursormon] mapped (incl journal view):12762 2016-01-14T20:46:38.975+ [clientcursormon] connections:0 Killing docker task Shutting down Killing docker task Shutting down Killing docker task Shutting down On Thu, Jan 14, 2016 at 3:38 PM, Paul Bell <arach...@gmail.com> wrote: > Hey Tim, > > Thank you very much for your reply. > > Yes, I am in the midst of trying to reproduce the problem. If successful > (so to speak), I will do as you ask. > > Cordially, > > Paul > > On Thu, Jan 14, 2016 at 3:19 PM, Tim Chen <t...@mesosphere.io> wrote: > >> Hi Paul, >> >> Looks like we've already issued the docker stop as you seen in the ps >> output, but the containers are still running. Can you look at the Docker >> daemon logs and see what's going on there? >> >> And also can you also try to modify docker_stop_timeout to 0 so that we >> SIGKILL the containers right away, and see if this still happens? >> >> Tim >> >> >> >> On Thu, Jan 14, 2016 at 11:52 AM, Paul Bell <arach...@gmail.com> wrote: >> >>> Hi All, >>> >>> It's been quite some time since I've posted here and that's chiefly >>> because up until a day or two ago, things were working really well. >>> >>> I actually may have posted about this some time back. But then the >>> problem seemed more intermittent. >>> >>> In summa, several "docker stops" don't work, i.e., the containers are >>> not stopped. >>> >>> Deployment: >>> >>> one Ubuntu VM (vmWare) LTS 14.04 with kernel 3.19 >>> Zookeeper >>> Mesos-master (0.23.0) >>> Mesos-slave (0.23.0) >>> Marathon (0.10.0) >>> Docker 1.9.1 >>> Weave 1.1.0 >>> Our application contains which include >>> MongoDB (4) >>> PostGres >>> ECX (our product) >>> >>> The only thing that's changed at all in the config above is the version >>> of Docker. Used to be 1.6.2 but I today upgraded it hoping to solve the >>> problem. >>> >>> >>> My automater program stops the application by sending Marathon an "http >>> delete" for each running up. Every now & then (reliably reproducible today) >>> not all containers get stopped. Most recently, 3 containers failed to stop. 
Re: Help needed (alas, urgently)
Hey Tim,

Thank you very much for your reply.

Yes, I am in the midst of trying to reproduce the problem. If successful (so to speak), I will do as you ask.

Cordially,

Paul

On Thu, Jan 14, 2016 at 3:19 PM, Tim Chen <t...@mesosphere.io> wrote:

> Hi Paul,
>
> Looks like we've already issued the docker stop, as seen in the ps
> output, but the containers are still running. Can you look at the Docker
> daemon logs and see what's going on there?
>
> And also can you try to modify docker_stop_timeout to 0 so that we
> SIGKILL the containers right away, and see if this still happens?
>
> Tim
>
> On Thu, Jan 14, 2016 at 11:52 AM, Paul Bell <arach...@gmail.com> wrote:
>
>> [original message snipped]
mesos-docker-executor >> --container=mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89 >> --docker=/usr/local/ecxmcc/weaveShim --help=false >> --mapped_directory=/mnt/mesos/sandbox >> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9/runs/9b700cdc-3d29-49b7-a7fc-e543a91f7b89 >> --stop_timeout=15secs >> root 7640 4967 0 14:01 ?00:00:01 mesos-docker-executor >> --container=mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298 >> --docker=/usr/local/ecxmcc/weaveShim --help=false >> --mapped_directory=/mnt/m
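A sketch of the change Tim suggests, for anyone following along. It assumes the agent flags come from /etc/default/mesos-slave as environment variables (Mesos reads any MESOS_-prefixed variable as the matching command-line flag; the file path and variable spelling are assumptions, so adjust for your own init setup):

    # /etc/default/mesos-slave
    # was the equivalent of --docker_stop_timeout=15secs (see the ps output above)
    MESOS_DOCKER_STOP_TIMEOUT=0secs

    sudo service mesos-slave restart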
Re: Help needed (alas, urgently)
I spoke too soon, I'm afraid. The next time I did the stop (with zero timeout), I saw the same phenomenon: a mongo container showing repeated "killing docker task" / "shutting down" messages. What else can I try? Thank you. On Thu, Jan 14, 2016 at 5:07 PM, Paul Bell <arach...@gmail.com> wrote: > Hi Tim, > > I set docker_stop_timeout to zero as you asked. I am pleased to report > (though a bit fearful about being pleased) that this change seems to have > shut everyone down pretty much instantly. > > Can you explain what's happening, e.g., does docker_stop_timeout=0 cause > the immediate use of "kill -9" as opposed to "kill -2"? > > I will keep testing the behavior. > > Thank you. > > -Paul > > On Thu, Jan 14, 2016 at 3:59 PM, Paul Bell <arach...@gmail.com> wrote: > >> Hi Tim, >> >> Things have gotten slightly odder (if that's possible). When I now start >> the application (5 or so containers), only one, "ecxconfigdb", gets started - >> and even he took a few tries. That is, I see him failing, moving to >> deploying, then starting again. But I've no evidence (no STDOUT, and no >> docker container logs) that shows why. >> >> In any event, ecxconfigdb does start. Happily, when I try to stop the >> application I am seeing the phenomena I posted before: "killing docker task" / >> "shutting down" repeated many times. The UN-stopped container is now running >> at 100% CPU. >> >> I will try modifying docker_stop_timeout. Back shortly. >> >> Thanks again. >> >> -Paul >> >> PS: what do you make of the "broken pipe" error in the docker.log? >> >> *from /var/log/upstart/docker.log* >> >> INFO[3054] GET /v1.15/images/mongo:2.6.8/json >> INFO[3054] GET >> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json >> ERRO[3054] Handler for GET >> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json >> returned error: No such image: >> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b >> ERRO[3054] HTTP Error >> err=No such image: >> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b >> statusCode=404 >> INFO[3054] GET /v1.15/containers/weave/json >> INFO[3054] POST >> /v1.21/containers/create?name=mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b >> INFO[3054] POST >> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/attach?stderr=1=1=1 >> INFO[3054] POST >> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/start >> INFO[3054] GET >> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >> INFO[3054] GET >> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >> INFO[3054] GET /v1.15/containers/weave/json >> INFO[3054] GET >> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >> INFO[3054] GET >> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >> INFO[3054] GET /v1.15/containers/weave/json >> INFO[3054] GET >> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >> INFO[3054] GET >> /v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json >> INFO[3111] GET /v1.21/containers/json >> INFO[3120] GET
/v1.21/containers/cf7/json >> INFO[3120] GET >> /v1.21/containers/cf7/logs?stderr=1=1=all >> INFO[3153] GET /containers/json >> INFO[3153] GET >> /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >> INFO[3153] GET >> /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json >> INFO[3153] GET >> /containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json >> INFO[3175] GET /containers/json >> INFO[3175] GET >> /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >> INFO[3175] GET >> /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json >> INFO[3175] GET >> /containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
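For what it's worth on the "kill -9 vs. kill -2" question above: docker stop sends SIGTERM (signal 15, not SIGINT/2), waits up to the grace period, and only then sends SIGKILL (signal 9). So docker_stop_timeout=0 amounts to an almost immediate SIGKILL. The standalone docker CLI shows the same semantics:

    docker stop --time=15 <container>   # SIGTERM, wait up to 15s, then SIGKILL
    docker stop --time=0 <container>    # zero grace period: SIGKILL right away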
Recent UI enhancements & Managed Service Providers
Hi All, I am running older versions of Mesos & Marathon (0.23.0 and 0.10.0). Over the course of the last several months I think I've seen several items on this list about UI enhancements. Perhaps they were enhancements to the data consumed by the Mesos & Marathon UIs. I've had very little time to dig deeply into it. So...I am wondering if someone can either point me to any discussions of such enhancements or summarize them here. There is a specific use case behind this request. The Mesos architecture seems to be a real sweet spot for an MSP. But an important MSP requirement is a unified view of their many tenants. So I am really trying to get a sense for how well the recent Mesos/Marathon releases address this requirement. Thank you. -Paul
Re: Recent UI enhancements & Managed Service Providers
Hi Vinod, Thank you for your reply. I'm not sure that I can be more specific. MSPs are interested in a "view by tenant", e.g., "show me all applications that are allotted to Tenant X". I suppose that the standard Mesos UI could, with properly named task IDs and the UI's "Find" filter, accomplish part of "view by tenant". But in order to see the resources consumed by Tenant X's tasks, you have to visit each task individually, look at its "Resources" table, and add them all up. It'd be cool if, when a filter is in effect, the Resources table were updated to reflect only the resources consumed by the filter-selected tasks. There's also the question of the units/meaning of Resources. Through Marathon I give each of my Dockerized tasks .1 CPU. As I understand it, Docker multiplies this value by 1024, which is Docker's representation of all the cores on a host. So when I do "docker inspect " I will see CpuShares of 102. But in the Mesos UI each of my 6 tasks shows .2 CPUs allocated. I'm simply not sure what this means or how it's arrived at. I suspect that an MSP will ask the same questions. I will think about it some more, but I'd be interested to hear feedback on these few points that I've raised. Thanks again. -Paul On Thu, Feb 25, 2016 at 11:55 AM, Vinod Kone wrote: > > > But an important MSP requirement is a unified view of their many > tenants. So I am really trying to get a sense for how well the recent > Mesos/Marathon releases address this requirement. > > Can you be more specific about what you mean by unified view and tenants? > What's lacking currently?
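A quick worked version of the arithmetic in question. The 0.1-to-102 step is just truncation; the 0.2 figure is plausibly the task's 0.1 CPU plus the roughly 0.1 CPU that the Docker containerizer sets aside for the mesos-docker-executor itself, though that second step is an inference, not something confirmed in this thread:

    # CpuShares as reported by docker inspect: cpus * 1024, truncated
    echo '0.1 * 1024' | bc   # 102.4 -> Docker stores 102
    # Mesos UI allocation per task: task cpus + executor cpus (assumed 0.1)
    echo '0.1 + 0.1' | bc    # .2, i.e., the 0.2 CPUs shown in the UI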
Feature request: move in-flight containers w/o stopping them
Hello All, Has there ever been any consideration of the ability to move in-flight containers from one Mesos host node to another? I see this as analogous to VMware's "vMotion" facility, wherein VMs can be moved from one ESXi host to another. I suppose something like this could be useful from a load-balancing perspective. Just curious whether it's ever been considered and, if it was considered and rejected, why it was rejected. Thanks. -Paul
Re: Agent won't start
Greg, thanks again - I am planning on moving my work_dir. Pradeep, thanks again. In a slightly different scenario - namely: (1) service mesos-slave stop, (2) edit /etc/default/mesos-slave (add a port resource), (3) service mesos-slave start - I noticed that the slave did not start and - again - the log shows the same phenomena as in my original post. Per your suggestion, I did a rm -Rf /tmp/mesos and the slave service started correctly. Questions: 1. Did editing /etc/default/mesos-slave cause the failure of the service to start? 2. Given that starting/stopping the entire cluster (stopping all services on all nodes) is a standard feature in our product, should I routinely do the above "rm" command when the mesos services are stopped? Thanks for your help. Cordially, Paul On Tue, Mar 29, 2016 at 6:16 PM, Greg Mann wrote: > Check out this link for info on /tmp cleanup in Ubuntu: > http://askubuntu.com/questions/20783/how-is-the-tmp-directory-cleaned-up > > And check out this link for information on some of the work_dir's contents > on a Mesos agent: http://mesos.apache.org/documentation/latest/sandbox/ > > The work_dir contains important application state for the Mesos agent, so > it should not be placed in a location that will be automatically > garbage-collected by the OS. The choice of /tmp/mesos as a default location > is a bit unfortunate, and hopefully we can resolve that JIRA issue soon to > change it. Ideally you should be able to leave the work_dir alone and let > the Mesos agent manage it for you. > > In any case, I would recommend that you set the work_dir to something > outside of /tmp; /var/lib/mesos is a commonly-used location. > > Cheers, > Greg >
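For the record, the work_dir move Greg recommends looks roughly like this, again assuming the agent reads its flags from /etc/default/mesos-slave (MESOS_WORK_DIR is the usual environment-variable spelling of --work_dir):

    sudo service mesos-slave stop
    sudo mkdir -p /var/lib/mesos
    # in /etc/default/mesos-slave:
    # MESOS_WORK_DIR=/var/lib/mesos
    sudo service mesos-slave start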
Backup a Mesos Cluster
Hi All, As we get closer to shipping a Mesos-based version of our product, we've turned our attention to "protecting" (supporting backup & recovery) of not only our application databases, but the cluster as well. I'm not quite sure how to begin thinking about this, but I suppose the usual dimensions of B/R would come into play, e.g., hot/cold, application consistent/crash consistent, etc. Has anyone grappled with this issue and, if so, would you be so kind as to share your experience and solutions? Thank you. -Paul
Re: Backup a Mesos Cluster
Piotr, Thank you for this link. I am looking at it now, and I right away notice that Exhibitor is designed to monitor (and back up) Zookeeper (but not anything related to Mesos itself). Don't the Mesos master & agent nodes keep at least some state outside of the ZK znodes, e.g., under the default workdir? Shuai, Thank you for this observation. Happily (I think), we do not have a custom framework. Presently, Marathon is the only framework that we use. -Paul On Mon, Apr 11, 2016 at 8:12 AM, Shuai Lin <linshuai2...@gmail.com> wrote: > If your product contains a custom framework, at least you should > implement some kind of high availability for your scheduler (like > marathon/chronos does), or let it be launched by marathon so it can be > restarted when it fails. > > On Mon, Apr 11, 2016 at 7:27 PM, Paul Bell <arach...@gmail.com> wrote: > >> Hi All, >> >> As we get closer to shipping a Mesos-based version of our product, we've >> turned our attention to "protecting" (supporting backup & recovery) of not >> only our application databases, but the cluster as well. >> >> I'm not quite sure how to begin thinking about this, but I suppose the >> usual dimensions of B/R would come into play, e.g., hot/cold, application >> consistent/crash consistent, etc. >> >> Has anyone grappled with this issue and, if so, would you be so kind as >> to share your experience and solutions? >> >> Thank you. >> >> -Paul >> >> >
Re: Mesos 0.28 SSL in official packages
FWIW, I quite agree with Zameer's point. That said, I want to make abundantly clear that in my experience the folks at Mesosphere are wonderfully helpful. But what happens if down the road Mesosphere is acquired, or there occurs some other event that could represent, if not a conflict of interest, then simply a different strategic direction? My 2 cents. -Paul On Mon, Apr 11, 2016 at 5:19 PM, Zameer Manji wrote: > I have suggested this before and I will suggest it again here. > > I think the Apache Mesos project should build and distribute packages > instead of relying on the generosity of a commercial vendor. The Apache > Aurora project does this already with good success. As a user of Apache > Mesos I don't care about Mesosphere Inc and I feel uncomfortable that the > project is so dependent on its employees. > > Doing this would allow users to contribute packaging fixes directly to the > project, such as enabling SSL. > > On Mon, Apr 11, 2016 at 3:02 AM, Adam Bordelon wrote: > >> Hi Kamil, >> >> Technically, there are no "official" Apache-built packages for Apache >> Mesos. >> >> At least one company (Mesosphere) chooses to build and distribute >> Mesos packages, but does not currently offer SSL builds. It wouldn't >> be hard to add an SSL build to our regular builds, but it hasn't been >> requested enough to prioritize it. >> >> cc: Joris, Kapil >> >> On Thu, Apr 7, 2016 at 7:42 AM, haosdent wrote: >> > Hi, SSL isn't enabled by default. You need to compile it by following this doc >> > http://mesos.apache.org/documentation/latest/ssl/ >> > >> > On Thu, Apr 7, 2016 at 10:04 PM, Kamil Wokitajtis >> > wrote: >> >> >> >> This is my first post, so Hi everyone! >> >> >> >> Is SSL enabled in official packages (CentOS in my case)? >> >> I can see libssl in ldd output, but I cannot see libevent. >> >> I had to compile mesos from sources to run it over ssl. >> >> I would prefer to install it from packages. >> >> >> >> Regards, >> >> Kamil >> > >> > >> > >> > >> > -- >> > Best Regards, >> > Haosdent Huang >> >> -- >> Zameer Manji >> >>
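Following haosdent's pointer, the SSL-enabled source build described in that doc is roughly as follows (configure flags per http://mesos.apache.org/documentation/latest/ssl/; the usual build prerequisites, e.g. libssl-dev and libevent-dev, are assumed to be installed):

    ./bootstrap                                  # git checkouts only
    mkdir build && cd build
    ../configure --enable-libevent --enable-ssl
    make
    sudo make install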
Re: Agent won't start
Hi Greg, Thanks very much for your quick reply. I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not systemd. I will look at the link you provide. Is there any chance that it might apply to non-systemd platforms? Cordially, Paul On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann <g...@mesosphere.io> wrote: > Hi Paul, > Noticing the logging output, "Failed to find resources file > '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble may > be related to the location of your agent's work_dir. See this ticket: > https://issues.apache.org/jira/browse/MESOS-4541 > > Some users have reported issues resulting from the systemd-tmpfiles > service garbage collecting files in /tmp, perhaps this is related? What > platform is your agent running on? > > You could try specifying a different agent work directory outside of /tmp/ > via the `--work_dir` command-line flag. > > Cheers, > Greg > > > On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell <arach...@gmail.com> wrote: > >> Hi, >> >> I am hoping someone can shed some light on this. >> >> An agent node failed to start, that is, when I did "service mesos-slave >> start" the service came up briefly & then stopped. Before stopping it >> produced the log shown below. The last thing it wrote is "Trying to create >> path '/mesos' in Zookeeper". >> >> This mention of the mesos znode prompted me to go for a clean slate by >> removing the mesos znode from Zookeeper. >> >> After doing this, the mesos-slave service started perfectly. >> >> What might be happening here, and also what's the right way to >> trouble-shoot such a problem? Mesos is version 0.23.0. >> >> Thanks for your help. >> >> -Paul >> >> >> Log file created at: 2016/03/29 14:19:39 >> Running on machine: 71.100.202.193 >> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg >> I0329 14:19:39.512249 5870 logging.cpp:172] INFO level logging started! 
>> I0329 14:19:39.512564 5870 main.cpp:162] Build: 2015-07-24 10:05:39 by >> root >> I0329 14:19:39.512588 5870 main.cpp:164] Version: 0.23.0 >> I0329 14:19:39.512600 5870 main.cpp:167] Git tag: 0.23.0 >> I0329 14:19:39.512612 5870 main.cpp:171] Git SHA: >> 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66 >> I0329 14:19:39.615172 5870 containerizer.cpp:111] Using isolation: >> posix/cpu,posix/mem >> I0329 14:19:39.615697 5870 main.cpp:249] Starting Mesos slave >> I0329 14:19:39.616267 5870 slave.cpp:190] Slave started on 1)@ >> 71.100.202.193:5051 >> I0329 14:19:39.616286 5870 slave.cpp:191] Flags at startup: >> --attributes="hostType:shard1" --authenticatee="crammd5" >> --cgroups_cpu_enable_pids_and_tids_count="false" >> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" >> --cgroups_limit_swap="false" --cgroups_root="mesos" >> --container_disk_watch_interval="15secs" --containerizers="docker,mesos" >> --default_role="*" --disk_watch_interval="1mins" >> --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true" >> --docker_remove_delay="6hrs" >> --docker_sandbox_directory="/mnt/mesos/sandbox" >> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs" >> --enforce_container_disk_quota="false" >> --executor_registration_timeout="5mins" >> --executor_shutdown_grace_period="5secs" >> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" >> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" >> --hadoop_home="" --help="false" --hostname="71.100.202.193" >> --initialize_driver_logging="true" --ip="71.100.202.193" >> --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" >> --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" >> --master="zk://71.100.202.191:2181/mesos" >> --oversubscribed_resources_interval="15secs" --perf_duration="10secs" >> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" >> --quiet="false" --recover="reconnect" --recovery_timeout="15mins" >> --registration_backoff_factor="1secs" >> --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true" >> --strict="true" --switch_user="true" --version="false" >> --work_dir="/tmp/mesos" >> I0329 14:19:39.616835 5870 slave.cpp:354] Slave resources: cpus(*):4; >> mem(*):23089; disk(*):122517; ports(*):[31000-32000] >> I0329 14:19:39.617032 5870 slave.cpp:384] Slave hostname: 71.100.202.193 >> I0329 14:19:39.617046 5870 slave.cpp:389] Slave checkpoint: true >> I0329 14:19:39.618841 5894 state.cpp:36] Recovering state from >> '/tmp/mesos/meta' >> I0329 14:19:39.618872 5894 state.cpp:672] Failed to find resources file >> '/tmp/mesos/meta/resources/resources.info' >> I0329 14:19:39.619730 5898 group.cpp:313] Group process (group(1)@ >> 71.100.202.193:5051) connected to ZooKeeper >> I0329 14:19:39.619760 5898 group.cpp:787] Syncing group operations: >> queue size (joins, cancels, datas) = (0, 0, 0) >> I0329 14:19:39.619773 5898 group.cpp:385] Trying to create path '/mesos' >> in ZooKeeper >> >> >
Agent won't start
Hi, I am hoping someone can shed some light on this. An agent node failed to start, that is, when I did "service mesos-slave start" the service came up briefly & then stopped. Before stopping it produced the log shown below. The last thing it wrote is "Trying to create path '/mesos' in Zookeeper". This mention of the mesos znode prompted me to go for a clean slate by removing the mesos znode from Zookeeper. After doing this, the mesos-slave service started perfectly. What might be happening here, and also what's the right way to trouble-shoot such a problem? Mesos is version 0.23.0. Thanks for your help. -Paul Log file created at: 2016/03/29 14:19:39 Running on machine: 71.100.202.193 Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg I0329 14:19:39.512249 5870 logging.cpp:172] INFO level logging started! I0329 14:19:39.512564 5870 main.cpp:162] Build: 2015-07-24 10:05:39 by root I0329 14:19:39.512588 5870 main.cpp:164] Version: 0.23.0 I0329 14:19:39.512600 5870 main.cpp:167] Git tag: 0.23.0 I0329 14:19:39.512612 5870 main.cpp:171] Git SHA: 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66 I0329 14:19:39.615172 5870 containerizer.cpp:111] Using isolation: posix/cpu,posix/mem I0329 14:19:39.615697 5870 main.cpp:249] Starting Mesos slave I0329 14:19:39.616267 5870 slave.cpp:190] Slave started on 1)@ 71.100.202.193:5051 I0329 14:19:39.616286 5870 slave.cpp:191] Flags at startup: --attributes="hostType:shard1" --authenticatee="crammd5" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="docker,mesos" --default_role="*" --disk_watch_interval="1mins" --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true" --docker_remove_delay="6hrs" --docker_sandbox_directory="/mnt/mesos/sandbox" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs" --enforce_container_disk_quota="false" --executor_registration_timeout="5mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname="71.100.202.193" --initialize_driver_logging="true" --ip="71.100.202.193" --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://71.100.202.191:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true" --strict="true" --switch_user="true" --version="false" --work_dir="/tmp/mesos" I0329 14:19:39.616835 5870 slave.cpp:354] Slave resources: cpus(*):4; mem(*):23089; disk(*):122517; ports(*):[31000-32000] I0329 14:19:39.617032 5870 slave.cpp:384] Slave hostname: 71.100.202.193 I0329 14:19:39.617046 5870 slave.cpp:389] Slave checkpoint: true I0329 14:19:39.618841 5894 state.cpp:36] Recovering state from '/tmp/mesos/meta' I0329 14:19:39.618872 5894 state.cpp:672] Failed to find resources file '/tmp/mesos/meta/resources/resources.info' I0329 14:19:39.619730 5898 group.cpp:313] Group process (group(1)@ 71.100.202.193:5051) connected to ZooKeeper I0329 14:19:39.619760 5898 group.cpp:787] Syncing group operations: 
queue size (joins, cancels, datas) = (0, 0, 0) I0329 14:19:39.619773 5898 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
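For anyone reproducing the "clean slate" step above: removing the mesos znode is typically done with the stock ZooKeeper CLI (the script path is the Ubuntu package default, and the server address here is illustrative):

    /usr/share/zookeeper/bin/zkCli.sh -server 71.100.202.191:2181
    # at the zk prompt:
    rmr /mesos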
Re: Agent won't start
Hi Pradeep, And thank you for your reply! That, too, is very interesting. I think I need to synthesize what you and Greg are telling me and come up with a clean solution. Agent nodes can crash. Moreover, I can stop the mesos-slave service, and start it later with a reboot in between. So I am interested in fully understanding the causal chain here before I try to fix anything. -Paul On Tue, Mar 29, 2016 at 5:51 PM, Paul Bell <arach...@gmail.com> wrote: > Whoa...interessant! > > The node *may* have been rebooted. Uptime says 2 days. I'll need to check > my notes. > > Can you point me to reference re Ubuntu behavior? > > Based on what you've told me so far, it sounds as if the sequence: > > stop service > reboot agent node > start service > > > could lead to trouble - or do I misunderstand? > > > Thank you again for your help. > > -Paul > > On Tue, Mar 29, 2016 at 5:36 PM, Greg Mann <g...@mesosphere.io> wrote: > >> Paul, >> This would be relevant for any system which is automatically deleting >> files in /tmp. It looks like in Ubuntu, the default behavior is for /tmp to >> be completely nuked at boot time. Was the agent node rebooted prior to this >> problem? >> >> On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell <arach...@gmail.com> wrote: >> >>> Hi Greg, >>> >>> Thanks very much for your quick reply. >>> >>> I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not >>> systemd. I will look at the link you provide. >>> >>> Is there any chance that it might apply to non-systemd platforms? >>> >>> Cordially, >>> >>> Paul >>> >>> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann <g...@mesosphere.io> wrote: >>> >>>> Hi Paul, >>>> Noticing the logging output, "Failed to find resources file >>>> '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble >>>> may be related to the location of your agent's work_dir. See this ticket: >>>> https://issues.apache.org/jira/browse/MESOS-4541 >>>> >>>> Some users have reported issues resulting from the systemd-tmpfiles >>>> service garbage collecting files in /tmp, perhaps this is related? What >>>> platform is your agent running on? >>>> >>>> You could try specifying a different agent work directory outside of >>>> /tmp/ via the `--work_dir` command-line flag. >>>> >>>> Cheers, >>>> Greg >>>> >>>> >>>> On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell <arach...@gmail.com> wrote: >>>> >>>>> Hi, >>>>> >>>>> I am hoping someone can shed some light on this. >>>>> >>>>> An agent node failed to start, that is, when I did "service >>>>> mesos-slave start" the service came up briefly & then stopped. Before >>>>> stopping it produced the log shown below. The last thing it wrote is >>>>> "Trying to create path '/mesos' in Zookeeper". >>>>> >>>>> This mention of the mesos znode prompted me to go for a clean slate by >>>>> removing the mesos znode from Zookeeper. >>>>> >>>>> After doing this, the mesos-slave service started perfectly. >>>>> >>>>> What might be happening here, and also what's the right way to >>>>> trouble-shoot such a problem? Mesos is version 0.23.0. >>>>> >>>>> Thanks for your help. >>>>> >>>>> -Paul >>>>> >>>>> >>>>> Log file created at: 2016/03/29 14:19:39 >>>>> Running on machine: 71.100.202.193 >>>>> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg >>>>> I0329 14:19:39.512249 5870 logging.cpp:172] INFO level logging >>>>> started! 
>>>>> I0329 14:19:39.512564 5870 main.cpp:162] Build: 2015-07-24 10:05:39 >>>>> by root >>>>> I0329 14:19:39.512588 5870 main.cpp:164] Version: 0.23.0 >>>>> I0329 14:19:39.512600 5870 main.cpp:167] Git tag: 0.23.0 >>>>> I0329 14:19:39.512612 5870 main.cpp:171] Git SHA: >>>>> 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66 >>>>> I0329 14:19:39.615172 5870 containerizer.cpp:111] Using isolation: >>>>> posix/cpu,posix/mem >>>>> I0329 14:19:39.615697 5870 main.cpp:249] Starting Mesos slave >>>>> I0329 14:19:39.616267 58
Re: Agent won't start
Whoa...interessant! The node *may* have been rebooted. Uptime says 2 days. I'll need to check my notes. Can you point me to reference re Ubuntu behavior? Based on what you've told me so far, it sounds as if the sequence: stop service reboot agent node start service could lead to trouble - or do I misunderstand? Thank you again for your help. -Paul On Tue, Mar 29, 2016 at 5:36 PM, Greg Mann <g...@mesosphere.io> wrote: > Paul, > This would be relevant for any system which is automatically deleting > files in /tmp. It looks like in Ubuntu, the default behavior is for /tmp to > be completely nuked at boot time. Was the agent node rebooted prior to this > problem? > > On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell <arach...@gmail.com> wrote: > >> Hi Greg, >> >> Thanks very much for your quick reply. >> >> I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not >> systemd. I will look at the link you provide. >> >> Is there any chance that it might apply to non-systemd platforms? >> >> Cordially, >> >> Paul >> >> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann <g...@mesosphere.io> wrote: >> >>> Hi Paul, >>> Noticing the logging output, "Failed to find resources file >>> '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble >>> may be related to the location of your agent's work_dir. See this ticket: >>> https://issues.apache.org/jira/browse/MESOS-4541 >>> >>> Some users have reported issues resulting from the systemd-tmpfiles >>> service garbage collecting files in /tmp, perhaps this is related? What >>> platform is your agent running on? >>> >>> You could try specifying a different agent work directory outside of >>> /tmp/ via the `--work_dir` command-line flag. >>> >>> Cheers, >>> Greg >>> >>> >>> On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell <arach...@gmail.com> wrote: >>> >>>> Hi, >>>> >>>> I am hoping someone can shed some light on this. >>>> >>>> An agent node failed to start, that is, when I did "service mesos-slave >>>> start" the service came up briefly & then stopped. Before stopping it >>>> produced the log shown below. The last thing it wrote is "Trying to create >>>> path '/mesos' in Zookeeper". >>>> >>>> This mention of the mesos znode prompted me to go for a clean slate by >>>> removing the mesos znode from Zookeeper. >>>> >>>> After doing this, the mesos-slave service started perfectly. >>>> >>>> What might be happening here, and also what's the right way to >>>> trouble-shoot such a problem? Mesos is version 0.23.0. >>>> >>>> Thanks for your help. >>>> >>>> -Paul >>>> >>>> >>>> Log file created at: 2016/03/29 14:19:39 >>>> Running on machine: 71.100.202.193 >>>> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg >>>> I0329 14:19:39.512249 5870 logging.cpp:172] INFO level logging started! 
>>>> I0329 14:19:39.512564 5870 main.cpp:162] Build: 2015-07-24 10:05:39 by >>>> root >>>> I0329 14:19:39.512588 5870 main.cpp:164] Version: 0.23.0 >>>> I0329 14:19:39.512600 5870 main.cpp:167] Git tag: 0.23.0 >>>> I0329 14:19:39.512612 5870 main.cpp:171] Git SHA: >>>> 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66 >>>> I0329 14:19:39.615172 5870 containerizer.cpp:111] Using isolation: >>>> posix/cpu,posix/mem >>>> I0329 14:19:39.615697 5870 main.cpp:249] Starting Mesos slave >>>> I0329 14:19:39.616267 5870 slave.cpp:190] Slave started on 1)@ >>>> 71.100.202.193:5051 >>>> I0329 14:19:39.616286 5870 slave.cpp:191] Flags at startup: >>>> --attributes="hostType:shard1" --authenticatee="crammd5" >>>> --cgroups_cpu_enable_pids_and_tids_count="false" >>>> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" >>>> --cgroups_limit_swap="false" --cgroups_root="mesos" >>>> --container_disk_watch_interval="15secs" --containerizers="docker,mesos" >>>> --default_role="*" --disk_watch_interval="1mins" >>>> --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true" >>>> --docker_remove_delay="6hrs" >>>> --docker_sandbox_dir
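On the "reference re Ubuntu behavior" question: on Ubuntu 14.04 (upstart, not systemd), the boot-time /tmp wipe is governed by TMPTIME in /etc/default/rcS, and the default of 0 is believed to mean that everything in /tmp is deleted on every boot, regardless of age, which fits the reboot theory above:

    grep TMPTIME /etc/default/rcS
    # TMPTIME=0  -> /tmp is emptied at each boot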
Re: Mesos Master and Slave on same server?
Hi June, In addition to doing what Pradeep suggests, I also now & then run a single-node "cluster" that houses mesos-master, mesos-slave, and Marathon. Works fine. Cordially, Paul On Wed, Apr 13, 2016 at 12:36 PM, Pradeep Chhetri <pradeep.chhetr...@gmail.com> wrote: > I would suggest running mesos-master and zookeeper and marathon on the same > set of hosts (maybe call them coordinator nodes) and using a completely > different set of nodes for mesos slaves. This way you can do the > maintenance of such hosts in a very planned fashion. > > On Wed, Apr 13, 2016 at 4:22 PM, Stefano Bianchi wrote: > >> For sure it is possible. >> Mesos-master will simply see the resources offered by the machine on >> which mesos-slave is also running, transparently. >> >> 2016-04-13 16:34 GMT+02:00 June Taylor: >> >>> All of our node servers are identical hardware. Is it reasonable for me >>> to install the Mesos-Master and Mesos-Slave on the same physical hardware? >>> >>> Thanks, >>> June Taylor >>> System Administrator, Minnesota Population Center >>> University of Minnesota >>> >> >> > > > -- > Regards, > Pradeep Chhetri >
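For completeness, the single-node "cluster" described above is just all of the services on one host, started in dependency order (service names per the stock Ubuntu packages):

    sudo service zookeeper start
    sudo service mesos-master start
    sudo service mesos-slave start
    sudo service marathon start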
Consequences of health-check timeouts?
Hi All, I probably have the following account partly wrong, but let me present it just the same and those who know better can correct me as needed. I've an application that runs several MongoDB shards, each a Dockerized container, each on a distinct node (VM); in fact, some of the VMs are on separate ESXi hosts. I've lately seen situations where, because of very slow disks for the database, the following sequence occurs (I think): 1. Linux (Ubuntu 14.04 LTS) virtual memory manager hits thresholds defined by vm.dirty_background_ratio and/or vm.dirty_ratio (probably both) 2. Synchronous flushing of many, many pages occurs, writing to a slow disk 3. (Around this time one might see in /var/log/syslog "task X blocked for more than 120 seconds" for all kinds of tasks, including mesos-master) 4. mesos-slaves get shut down (this is the part I'm unclear about; but I am quite certain that on 2 nodes the executors and their in-flight MongoDB tasks got zapped, because I can see that Marathon restarted them). The consequence of this is a corrupt MongoDB database. In the case at hand, the job had run for over 50 hours, processing close to 120 million files. Steps I've taken so far to remedy include: - tune vm.dirty_background_ratio and vm.dirty_ratio down, respectively, to 5 and 10 (from 10 and 20). The intent here is to tolerate more frequent, smaller flushes and thus avoid the less frequent but massive flushes that suspend threads for very long periods. - increase the agent ping timeout to 10 minutes (every 30 seconds, 20 times) So the questions are: - Is there some way to be given control (a callback, or an "exit" routine) so that the container about to be nuked can be given a chance to exit gracefully? - Are there other steps I can take to avoid this mildly calamitous occurrence? - (Also, I'd be grateful for more clarity on anything in steps 1-4 above that is a bit hand-wavy!) As always, thanks. -Paul
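The two remedies above, in config form. The sysctl keys are standard; the master-side ping flags are spelled here as their MESOS_ environment-variable forms, and the flag names (--slave_ping_timeout, --max_slave_ping_timeouts) should be double-checked against your Mesos version:

    # /etc/sysctl.conf
    vm.dirty_background_ratio = 5
    vm.dirty_ratio = 10
    # apply without a reboot:
    sudo sysctl -p

    # /etc/default/mesos-master: 30s x 20 pings = a 10-minute window
    MESOS_SLAVE_PING_TIMEOUT=30secs
    MESOS_MAX_SLAVE_PING_TIMEOUTS=20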
Re: Consequences of health-check timeouts?
Hi haosdent, Thanks for your reply. In re executor_shutdown_grace_period: how would this enable the task (MongoDB) to terminate gracefully? (BTW: I am fairly certain that the mongo STDOUT as captured by Mesos shows that it received signal 15 just before it said good-bye). My naive understanding of this grace period is that it simply delays the termination of the executor. The following snippet is from /var/log/syslog. I believe it shows the stack trace (largely in the kernel) that led to mesos-master being blocked for more than 120 seconds. Please note that immediately above (before) the blocked mesos-master is a blocked jbd2/dm. Immediately below (after) the blocked mesos-master is a blocked java task. I'm not sure what the java task is. This took place on the mesos-master node, and none of our applications runs there. It runs master, Marathon, and ZK. Maybe the java task is Marathon or ZK? Thanks again. -Paul May 16 20:06:53 71 kernel: [193339.890848] INFO: task mesos-master:4013 blocked for more than 120 seconds. May 16 20:06:53 71 kernel: [193339.890873] Not tainted 3.13.0-32-generic #57-Ubuntu May 16 20:06:53 71 kernel: [193339.890889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. May 16 20:06:53 71 kernel: [193339.890912] mesos-masterD 88013fd94440 0 4013 1 0x May 16 20:06:53 71 kernel: [193339.890914] 880137429a28 0002 880135778000 880137429fd8 May 16 20:06:53 71 kernel: [193339.890916] 00014440 00014440 880135778000 88013fd94cd8 May 16 20:06:53 71 kernel: [193339.890918] 88013ffd34b0 0002 81284630 880137429aa0 May 16 20:06:53 71 kernel: [193339.890919] Call Trace: May 16 20:06:53 71 kernel: [193339.890922] [] ? start_this_handle+0x590/0x590 May 16 20:06:53 71 kernel: [193339.890924] [] io_schedule+0x9d/0x140 May 16 20:06:53 71 kernel: [193339.890925] [] sleep_on_shadow_bh+0xe/0x20 May 16 20:06:53 71 kernel: [193339.890927] [] __wait_on_bit+0x62/0x90 May 16 20:06:53 71 kernel: [193339.890929] [] ? start_this_handle+0x590/0x590 May 16 20:06:53 71 kernel: [193339.890930] [] out_of_line_wait_on_bit+0x77/0x90 May 16 20:06:53 71 kernel: [193339.890932] [] ? autoremove_wake_function+0x40/0x40 May 16 20:06:53 71 kernel: [193339.890934] [] ? wake_up_bit+0x25/0x30 May 16 20:06:53 71 kernel: [193339.890936] [] do_get_write_access+0x2ad/0x4f0 May 16 20:06:53 71 kernel: [193339.890938] [] ? __getblk+0x2d/0x2e0 May 16 20:06:53 71 kernel: [193339.890939] [] jbd2_journal_get_write_access+0x27/0x40 May 16 20:06:53 71 kernel: [193339.890942] [] __ext4_journal_get_write_access+0x3b/0x80 May 16 20:06:53 71 kernel: [193339.890946] [] ext4_reserve_inode_write+0x70/0xa0 May 16 20:06:53 71 kernel: [193339.890948] [] ? ext4_dirty_inode+0x40/0x60 May 16 20:06:53 71 kernel: [193339.890949] [] ext4_mark_inode_dirty+0x44/0x1f0 May 16 20:06:53 71 kernel: [193339.890951] [] ext4_dirty_inode+0x40/0x60 May 16 20:06:53 71 kernel: [193339.890953] [] __mark_inode_dirty+0x10a/0x2d0 May 16 20:06:53 71 kernel: [193339.890956] [] update_time+0x81/0xd0 May 16 20:06:53 71 kernel: [193339.890957] [] file_update_time+0x80/0xd0 May 16 20:06:53 71 kernel: [193339.890961] [] __generic_file_aio_write+0x180/0x3d0 May 16 20:06:53 71 kernel: [193339.890963] [] generic_file_aio_write+0x58/0xa0 May 16 20:06:53 71 kernel: [193339.890965] [] ext4_file_write+0x99/0x400 May 16 20:06:53 71 kernel: [193339.890967] [] ? wake_up_state+0x10/0x20 May 16 20:06:53 71 kernel: [193339.890970] [] ? wake_futex+0x66/0x90 May 16 20:06:53 71 kernel: [193339.890972] [] ?
futex_wake+0x1b1/0x1d0 May 16 20:06:53 71 kernel: [193339.890974] [] do_sync_write+0x5a/0x90 May 16 20:06:53 71 kernel: [193339.890976] [] vfs_write+0xb4/0x1f0 May 16 20:06:53 71 kernel: [193339.890978] [] SyS_write+0x49/0xa0 May 16 20:06:53 71 kernel: [193339.890980] [] tracesys+0xe1/0xe6 On Wed, May 18, 2016 at 2:33 AM, haosdent <haosd...@gmail.com> wrote: > >Is there some way to be given control (a callback, or an "exit" routine) > so that the container about to be nuked can be given a chance to exit > gracefully? > The default value of executor_shutdown_grace_period is 5 seconds; you > could change it by specifying the `--executor_shutdown_grace_period` flag when > launching the mesos agent. > > >Are there other steps I can take to avoid this mildly calamitous > occurrence? > >mesos-slaves get shut down > Do you know where your mesos-master got stuck when this happens? Any error log > or related log about this? In addition, is there any log from when mesos-slave > shut down? > > On Wed, May 18, 2016 at 6:12 AM, Paul Bell <arach...@gmail.com> wrote: > >> Hi All, >> >> I probably have the following account partly wrong, but let me present it >> just the same and those who know better can co
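Per haosdent's suggestion, the grace period is an agent-side flag; something like the following (environment-variable spelling assumed, value illustrative). Note that, as discussed above, this only widens the window between the executor's SIGTERM and the escalation to SIGKILL; it does not by itself make MongoDB shut down cleanly:

    # /etc/default/mesos-slave
    MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD=60secs

    sudo service mesos-slave restart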
Status of Mesos-3821
Hi, I think I encountered the problem described by https://issues.apache.org/jira/browse/MESOS-3821 and wanted to ask if this fix is in Mesos 0.28. But perhaps I misunderstand what's being said; so, by way of background, our case is Mesos on CentOS 7.2. When we try to set --docker_socket to tcp://<IP>:<PORT>, the mesos-slave service refuses to start. It seems to require a Unix socket. There is one comment in the ticket that expresses the hope of being able to use URLs of the tcp:// form. Am I misunderstanding this fix, and if not, what release of Mesos incorporates it? Thanks for your help. -Paul
Mesos loses track of Docker containers
Hello, One of our customers has twice encountered a problem wherein Mesos & Marathon appear to lose track of the application containers that they started. Platform & version info: Ubuntu 14.04 (running under VMware) Mesos (master & agent): 0.23.0 ZK: 3.4.5--1 Marathon: 0.10.0 The phenomena: When I log into either the Mesos or Marathon UIs I see no evidence of *any* tasks, active or completed. Yet, in the Linux shell, a "docker ps" command shows the containers up & running. I've seen some confusing appearances before, but never this. For example, I've seen what might be described as the *reverse* of the above phenomena. I mean the case where a customer power-cycles the VM. In such a case you typically see in Marathon's UI the (mere) appearance of the containers up & running, but a "docker ps" command shows no containers running. As folks on this list have explained to me, this is the result of "stale state", and after 10 minutes (by default) Mesos figures out that the supposedly active tasks aren't there and restarts them. But that's not the case here. I am hard-pressed to understand what conditions/causes might lead to Mesos & Marathon becoming unaware of containers that they started. I would be very grateful if someone could help me understand what's going on here (so would our customer!). Thanks. -Paul
Re: Mesos loses track of Docker containers
Hi Jeff, Thanks for your reply. Yeah, that thought occurred to me late last night. But the customer is sensitive to too much churn, so it wouldn't be my first choice. If I knew with certainty that such a problem existed in the versions they are running AND that more recent versions fixed it, then I'd do my best to compel the upgrade. Docker version is also old: 1.6.2. -Paul On Wed, Aug 10, 2016 at 9:18 AM, Jeff Schroeder <jeffschroe...@computer.org> wrote: > Have you considered upgrading Mesos and Marathon? Those are quite old > versions of both, with some fairly glaring problems with the docker > containerizer if memory serves. Also, what version of docker? > > > On Wednesday, August 10, 2016, Paul Bell <arach...@gmail.com> wrote: > >> Hello, >> >> One of our customers has twice encountered a problem wherein Mesos & >> Marathon appear to lose track of the application containers that they >> started. >> >> Platform & version info: >> >> Ubuntu 14.04 (running under VMware) >> Mesos (master & agent): 0.23.0 >> ZK: 3.4.5--1 >> Marathon: 0.10.0 >> >> The phenomena: >> >> When I log into either the Mesos or Marathon UIs I see no evidence of >> *any* tasks, active or completed. Yet, in the Linux shell, a "docker ps" >> command shows the containers up & running. >> >> I've seen some confusing appearances before, but never this. For example, >> I've seen what might be described as the *reverse* of the above >> phenomena. I mean the case where a customer power-cycles the VM. In such a >> case you typically see in Marathon's UI the (mere) appearance of the >> containers up & running, but a "docker ps" command shows no containers >> running. As folks on this list have explained to me, this is the result of >> "stale state", and after 10 minutes (by default) Mesos figures out that the >> supposedly active tasks aren't there and restarts them. >> >> But that's not the case here. I am hard-pressed to understand what >> conditions/causes might lead to Mesos & Marathon becoming unaware of >> containers that they started. >> >> I would be very grateful if someone could help me understand what's going >> on here (so would our customer!). >> >> Thanks. >> >> -Paul >> >> >> > > -- > Text by Jeff, typos by iPhone >