Re: manage jobs log files in sandboxes

2015-11-05 Thread Paul
Hi Mauricio,

I'm grappling with the same issue.

I'm not yet sure if it represents a viable solution, but I plan to look at 
Docker's log rotation facility. It was introduced in Docker 1.8.
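
From what I've read so far it comes down to the json-file driver options, roughly 
like this (untested on my end; the image name is only a placeholder):

# rotate the per-container json log at ~50MB, keep 3 files (Docker >= 1.8)
docker run -d \
  --log-driver=json-file \
  --log-opt max-size=50m \
  --log-opt max-file=3 \
  some-registry/some-app

Whether that also helps with what Mesos itself writes into the sandbox is exactly 
what I want to find out.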

If you beat me to it & it looks like a solution, please let us know!

Thanks.

Cordially,

Paul

> On Nov 5, 2015, at 9:40 PM, Mauricio Garavaglia 
>  wrote:
> 
> Hi guys,
> 
> How can I manage the stdout/err log files generated by jobs in Mesos? For 
> long-running docker apps launched using Marathon, the log files can deplete 
> the disk of an agent, and using quotas causes the jobs to be killed, which is 
> also not ideal. I'd like to have a way to rotate them. 
> 
> Is it correct to just go to the mesos agent workdir and go through each 
> sandbox stdout/err and rotate them? I know that could break the log UI, but it 
> doesn't scale very well having logs of several GB.
> 
> Thanks!


Re: Fate of slave node after timeout

2015-11-13 Thread Paul
Jie,

Thank you.

That's odd behavior, no? That would seem to mean that the slave can never again 
join the cluster, at least not from its original IP@.

What if the master bounces? Will it then tolerate the slave?

-Paul

On Nov 13, 2015, at 4:46 PM, Jie Yu  wrote:

>> Can that slave never again be added into the cluster, i.e., what happens if 
>> it comes up 1 second after exceeding the timeout product?
> 
> It'll not be added to the cluster. The master will send a Shutdown message to 
> the slave if it comes up after the timeout.
> 
> - Jie 
> 
>> On Fri, Nov 13, 2015 at 1:44 PM, Paul Bell  wrote:
>> Hi All,
>> 
>> IIRC, after (max_slave_ping_timeouts * slave_ping_timeout) is exceeded 
>> without a response from a mesos-slave, the master will remove the slave. In 
>> the Mesos UI I can see slave state transition from 1 deactivated to 0.
>> 
>> Can that slave never again be added into the cluster, i.e., what happens if 
>> it comes up 1 second after exceeding the timeout product?
>> 
>> (I'm dusting off some old notes and trying to refresh my memory about 
>> problems I haven't seen in quite some time).
>> 
>> Thank you.
>> 
>> -Paul
> 


Re: Fate of slave node after timeout

2015-11-13 Thread Paul
Ah, now I get it.

And this comports with the behavior I am observing right now.

Thanks again, Jie.

-Paul

> On Nov 13, 2015, at 5:55 PM, Jie Yu  wrote:
> 
> Paul, the slave will terminate after receiving a Shutdown message. The slave 
> will be restarted (e.g., by monit or systemd) and register with the master as 
> a new slave (a different slaveId).
> 
> - Jie
> 
>> On Fri, Nov 13, 2015 at 2:53 PM, Paul  wrote:
>> Jie,
>> 
>> Thank you.
>> 
>> That's odd behavior, no? That would seem to mean that the slave can never 
>> again join the cluster, at least not from its original IP@.
>> 
>> What if the master bounces? Will it then tolerate the slave?
>> 
>> -Paul
>> 
>> On Nov 13, 2015, at 4:46 PM, Jie Yu  wrote:
>> 
>>>> Can that slave never again be added into the cluster, i.e., what happens 
>>>> if it comes up 1 second after exceeding the timeout product?
>>> 
>>> It'll not be added to the cluster. The master will send a Shutdown message 
>>> to the slave if it comes up after the timeout.
>>> 
>>> - Jie 
>>> 
>>>> On Fri, Nov 13, 2015 at 1:44 PM, Paul Bell  wrote:
>>>> Hi All,
>>>> 
>>>> IIRC, after (max_slave_ping_timeouts * slave_ping_timeout) is exceeded 
>>>> without a response from a mesos-slave, the master will remove the slave. 
>>>> In the Mesos UI I can see slave state transition from 1 deactivated to 0.
>>>> 
>>>> Can that slave never again be added into the cluster, i.e., what happens 
>>>> if it comes up 1 second after exceeding the timeout product?
>>>> 
>>>> (I'm dusting off some old notes and trying to refresh my memory about 
>>>> problems I haven't seen in quite some time).
>>>> 
>>>> Thank you.
>>>> 
>>>> -Paul
> 


Re: Anyone try Weave in Mesos env ?

2015-11-25 Thread Paul
Hi Sam, 

Yeah, I have significant experience in this regard.

We run Docker containers spread across several Mesos slave nodes. The 
containers are all connected via Weave. It works very well.

Can you describe what you have in mind?

Cordially,

Paul

> On Nov 25, 2015, at 8:03 PM, Sam  wrote:
> 
> Guys,
> We are trying to use Weave in hybrid cloud Mesos env , anyone got experience 
> on it ? Appreciated 
> Regards,
> Sam
> 
> Sent from my iPhone


Re: Anyone try Weave in Mesos env ?

2015-11-25 Thread Paul
Happy Thanksgiving to you, too.

I tend to deploy the several Mesos nodes as VMware VMs.

However, I've also run a cluster with master on ESXi, slaves on ESXi, slave on 
bare metal, and an EC2 slave.

But in my case all applications are Docker containers connected via Weave.

Does your present deployment involve Docker and Weave? 

-paul

> On Nov 25, 2015, at 8:55 PM, Sam  wrote:
> 
> Paul,
> Happy thanksgiving first. We are using Aws, Rackspace as hybrid cloud env , 
> and we deployed Mesos master in AWS , part of Slaves in AWS , part of Slaves 
> in Rackspace .  I am thinking whether it works ? And since it got low latency 
> in networking , can we deploy two masters in both AWS and Rackspace ? And 
> federation ?Appreciated for your reply .
> 
> Regards ,
> Sam
> 
> Sent from my iPhone
> 
>> On Nov 26, 2015, at 9:47 AM, Paul  wrote:
>> 
>> Hi Sam, 
>> 
>> Yeah, I have significant experience in this regard.
>> 
>> We run Docker containers spread across several Mesos slave nodes. The 
>> containers are all connected via Weave. It works very well.
>> 
>> Can you describe what you have in mind?
>> 
>> Cordially,
>> 
>> Paul
>> 
>>> On Nov 25, 2015, at 8:03 PM, Sam  wrote:
>>> 
>>> Guys,
>>> We are trying to use Weave in hybrid cloud Mesos env , anyone got 
>>> experience on it ? Appreciated 
>>> Regards,
>>> Sam
>>> 
>>> Sent from my iPhone


Re: Anyone try Weave in Mesos env ?

2015-11-26 Thread Paul
Gladly, Weitao. It'd be my pleasure.

But give me a few hours to find some free time. 

I am today tasked with cooking a Thanksgiving turkey.

But I will try to find the time before noon today (I'm on the right coast in 
the USA).

-Paul

> On Nov 25, 2015, at 11:26 PM, Weitao  wrote:
> 
> Hi, Paul. Can you share your overall experience with the architecture with us? I am 
> trying to do a similar thing
> 
> 
>> On Nov 26, 2015, at 09:47, Paul  wrote:
>> 
>> experience


Re: [Proposal] Remove the default value for agent work_dir

2016-04-13 Thread Paul
+1

> On Apr 13, 2016, at 11:01 AM, Ken Sipe  wrote:
> 
> +1
>> On Apr 12, 2016, at 5:58 PM, Greg Mann  wrote:
>> 
>> Hey folks!
>> A number of situations have arisen in which the default value of the Mesos 
>> agent `--work_dir` flag (/tmp/mesos) has caused problems on systems in which 
>> the automatic cleanup of '/tmp' deletes agent metadata. To resolve this, we 
>> would like to eliminate the default value of the agent `--work_dir` flag. 
>> You can find the relevant JIRA here.
>> 
>> We considered simply changing the default value to a more appropriate 
>> location, but decided against this because the expected filesystem structure 
>> varies from platform to platform, and because it isn't guaranteed that the 
>> Mesos agent would have access to the default path on a particular platform.
>> 
>> Eliminating the default `--work_dir` value means that the agent would exit 
>> immediately if the flag is not provided, whereas currently it launches 
>> successfully in this case. This will break existing infrastructure which 
>> relies on launching the Mesos agent without specifying the work directory. I 
>> believe this is an acceptable change because '/tmp/mesos' is not a suitable 
>> location for the agent work directory except for short-term local testing, 
>> and any production scenario that is currently using this location should be 
>> altered immediately.
>> 
>> If you have any thoughts/opinions/concerns regarding this change, please let 
>> us know!
>> 
>> Cheers,
>> Greg
> 


Re: What's the official pronounce of mesos?

2016-07-13 Thread Paul
Sadly, I don't understand a whole lot about Mesos, but I did learn Ancient 
Greek in college, taught it for a couple of years, and have even translated 
parts of Homer's Iliad.

 μέσος

The 'e' (epsilon) in 'Mesos' would be pronounced like the 'e' in the English 
word 'pet'. The 'o' (omicron) as in 'hot'.

But, at least to English ears, that pronunciation feels a bit stilted. So I 
think Rodrick's right to sound the 'o' as long, as in 'tone'.

-Paul

> On Jul 13, 2016, at 9:12 PM, Rodrick Brown  wrote:
> 
> Mess-O's 
> 
> Get Outlook for iOS
> 
> 
> 
> 
> On Wed, Jul 13, 2016 at 7:56 PM -0400, "zhiwei"  wrote:
> 
>> Hi,
>> 
>> I saw in some videos, different people pronounce 'mesos' differently.
>> 
>> Can someone add the official pronounce of mesos to wikipedia?
> 


Re: Mesos loses track of Docker containers

2016-08-14 Thread Paul
Thank you, Sivaram.

That would seem to be 2 "votes" for upgrading.

-Paul


> On Aug 13, 2016, at 11:47 PM, Sivaram Kannan  wrote:
> 
> 
> I don't remember the condition exactly, but I have faced a similar issue in my 
> deployments, and it was fixed when I moved to 0.26.0. Upgrade Marathon to a 
> compatible version as well.
> 
>> On Wed, Aug 10, 2016 at 9:30 AM, Paul Bell  wrote:
>> Hi Jeff,
>> 
>> Thanks for your reply.
>> 
>> Yeah...that thought occurred to me late last night. But the customer is 
>> sensitive to too much churn, so it wouldn't be my first choice. If I knew 
>> with certainty that such a problem existed in the versions they are running 
>> AND that more recent versions fixed it, then I'd do my best to compel the 
>> upgrade. 
>> 
>> Docker version is also old, 1.6.2.
>> 
>> -Paul
>> 
>>> On Wed, Aug 10, 2016 at 9:18 AM, Jeff Schroeder 
>>>  wrote:
>>> Have you considered upgrading Mesos and Marathon? Those are quite old 
>>> versions of both with some fairly glaring problems with the docker 
>>> containerizer if memory serves. Also what version of docker?
>>> 
>>> 
>>>> On Wednesday, August 10, 2016, Paul Bell  wrote:
>>>> Hello,
>>>> 
>>>> One of our customers has twice encountered a problem wherein Mesos & 
>>>> Marathon appear to lose track of the application containers that they 
>>>> started. 
>>>> 
>>>> Platform & version info:
>>>> 
>>>> Ubuntu 14.04 (running under VMware)
>>>> Mesos (master & agent): 0.23.0
>>>> ZK: 3.4.5--1
>>>> Marathon: 0.10.0
>>>> 
>>>> The phenomena:
>>>> 
>>>> When I log into either the Mesos or Marathon UIs I see no evidence of 
>>>> *any* tasks, active or completed. Yet, in the Linux shell, a "docker ps" 
>>>> command shows the containers up & running. 
>>>> 
>>>> I've seen some confusing appearances before, but never this. For example, 
>>>> I've seen what might be described as the reverse of the above phenomena. I 
>>>> mean the case where a customer power cycles the VM. In such a case you 
>>>> typically see in Marathon's UI the (mere) appearance of the containers up 
>>>> & running, but a "docker ps" command shows no containers running. As folks 
>>>> on this list have explained to me, this is the result of "stale state" and 
>>>> after 10 minutes (by default), Mesos figures out that the supposedly 
>>>> active tasks aren't there and restarts them.
>>>> 
>>>> But that's not the case here. I am hard-pressed to understand what 
>>>> conditions/causes might lead to Mesos & Marathon becoming unaware of 
>>>> containers that they started.
>>>> 
>>>> I would be very grateful if someone could help me understand what's going 
>>>> on here (so would our customer!).
>>>> 
>>>> Thanks.
>>>> 
>>>> -Paul
>>> 
>>> 
>>> -- 
>>> Text by Jeff, typos by iPhone
> 
> 
> 
> -- 
> ever tried. ever failed. no matter.
> try again. fail again. fail better.
> -- Samuel Beckett


Re: Changing mesos slave configuration

2015-09-23 Thread Paul Bell
Hi Pradeep,

Perhaps I am speaking to a slightly different point, but when I change
/etc/default/mesos-slave to add a new attribute, I have to remove file
/tmp/mesos/meta/slaves/latest.

IIRC, mesos-slave itself, in failing to start after such a change, tells me
to do this:

rm -f /tmp/mesos/meta/slaves/latest
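
So the full sequence I follow, concretely, looks something like this (the attribute
and the env-var form are just an example; adjust for your packaging):

# add or change an attribute in the defaults file
echo 'MESOS_ATTRIBUTES="rack:r1"' >> /etc/default/mesos-slave

# wipe the stale slave info so the agent will re-register cleanly
rm -f /tmp/mesos/meta/slaves/latest

# restart the agent (upstart on my Ubuntu 14.04 hosts)
service mesos-slave restart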


But I know of no way to make such configuration changes without downtime.
And I'd very much like it if Mesos supported such dynamic changes. I
suppose this would require that the agent consult its default file on
demand, rather than once at start-up.

Cordially,

Paul

On Wed, Sep 23, 2015 at 4:41 AM, Pradeep Chhetri <
pradeep.chhetr...@gmail.com> wrote:

> Hello all,
>
> I have often faced this problem that whenever I try to add some
> configuration parameter to mesos-slave or change any configuration (e.g., add
> a new attribute in mesos-slave), the mesos slave doesn't come up on restart.
> I have to delete the slave.info file and then restart the slave but it
> ends up killing all the docker containers started using mesos.
>
> I was trying to figure out the best way to make such changes without
> making any downtime.
>
> Thank you.
>
> --
> Pradeep Chhetri
>


Re: Detecting slave crashes event

2015-09-24 Thread Paul Bell
Thank you all for your responses.

I look forward to event subscription. :)
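
In the meantime I'll probably just poll the master's metrics endpoint for removals,
something along these lines (metric names from memory, so treat it as a sketch):

# count of agents the master has removed (health-check failures included)
curl -s http://<master-host>:5050/metrics/snapshot | python -m json.tool | grep slave_removals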

-Paul

On Wed, Sep 23, 2015 at 2:23 PM, Joris Van Remoortere 
wrote:

> There is a plan for event subscription, but it is still in the early
> design phase.
>
> In 0.25 we are adding slave exit hooks: MESOS-3015
>
> This will allow you to generate whatever events you like based on removal
> of a slave. This is your best bet in terms of an immediate solution :-)
> @Kapil and @Niklas have worked on this hook.
>
> On Wed, Sep 23, 2015 at 1:29 PM, Benjamin Mahler <
> benjamin.mah...@gmail.com> wrote:
>
>> I believe some of the contributors from Mesosphere have been thinking
>> about it, but not sure on the plans. I'll let them reply here.
>>
>> On Wed, Sep 16, 2015 at 11:11 AM, Paul Bell  wrote:
>>
>>> Thank you, Benjamin.
>>>
>>> So, I could periodically request the metrics endpoint, or stream the
>>> logs (maybe via mesos.cli; or SSH)? What, roughly, does the "agent removed"
>>> message look like in the logs?
>>>
>>> Are there plans to offer a mechanism for event subscription?
>>>
>>> Cordially,
>>>
>>> Paul
>>>
>>>
>>>
>>> On Wed, Sep 16, 2015 at 1:30 PM, Benjamin Mahler <
>>> benjamin.mah...@gmail.com> wrote:
>>>
>>>> You can detect when we remove an agent due to health check failures via
>>>> the metrics endpoint, but these are counters that are better used for
>>>> alerting / dashboards for visibility. If you need to know which agents, you
>>>> can also consume the logs as a stop-gap solution, until we offer a
>>>> mechanism for subscribing to cluster events.
>>>>
>>>> On Wed, Sep 16, 2015 at 10:11 AM, Paul Bell  wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I am led to believe that, unlike Marathon, Mesos doesn't (yet?) offer
>>>>> a subscribable event bus.
>>>>>
>>>>> So I am wondering if there's a best practices way of determining if a
>>>>> slave node has crashed. By "crashed" I mean something like the power plug
>>>>> got yanked, or anything that would cause Mesos to stop talking to the 
>>>>> slave
>>>>> node.
>>>>>
>>>>> I suppose such information would be recorded in /var/log/mesos.
>>>>>
>>>>> Interested to learn how best to detect this.
>>>>>
>>>>> Thank you.
>>>>>
>>>>> -Paul
>>>>>
>>>>
>>>>
>>>
>>
>


Securing executors

2015-10-05 Thread Paul Bell
Hi All,

I am running an nmap port scan on a Mesos agent node and noticed nmap
reporting an open TCP port at 50577.

Poking around some, I discovered exactly 5 mesos-docker-executor processes,
one for each of my 5 Docker containers, and each with an open listen port:

root 14131  3617  0 10:39 ?00:00:17 mesos-docker-executor
--container=mesos-20151002-172703-2450482247-5050-3014-S0.5563c65a-e33e-4287-8ce4-b2aa8116aa95
--docker=/usr/local/ecxmcc/weaveShim --help=false
--mapped_directory=/mnt/mesos/sandbox
--sandbox_directory=/tmp/mesos/slaves/20151002-172703-2450482247-5050-3014-S0/frameworks/20151002-172703-2450482247-5050-3014-/executors/postgres.ea2954fd-6b6e-11e5-8bef-56847afe9799/runs/5563c65a-e33e-4287-8ce4-b2aa8116aa95
--stop_timeout=15secs

I suppose that all of this is unsurprising. But I know of at least one big
customer who will without delay run Nmap or Nessus against my clustered
deployment.

So I am wondering what the best practices approach is to securing these
open ports.

Thanks for your help.

-Paul


Old docker version deployed

2015-10-06 Thread Paul Wolfe
Hello all,



I'm new to this list, so please let me know if there is a better/more 
appropriate forum for this question.



We are currently experimenting with marathon and mesos for deploying a simple 
webapp.  We ship the app as a docker container.



Sporadically (ie 1 out of 100) we find an old version of the app is deployed.  
It is obvious from the logs and the appearance of the GUI that the version is 
old.  If I download and run the docker container locally, I see it is indeed 
the latest version of the code.  That leads me to believe that somewhere in the 
marathon deploy or the mesos running of the image, versions are getting 
confused.



I guess my first question is what additional information can I get from 
marathon or mesos logs to help diagnose? I've checked the mesos-SLAVE.* but 
haven't been able to garner anything interesting there.



Thanks for any help!

Paul Wolfe






RE: Old docker version deployed

2015-10-06 Thread Paul Wolfe
I do see the stdout in the webgui, which is how I can confirm the old version 
is deployed.

What I need is some information about what version/tag of the image mesos is 
using.

From: haosdent [mailto:haosd...@gmail.com]
Sent: Tuesday, October 06, 2015 11:37 AM
To: user@mesos.apache.org
Subject: Re: Old docker version deployed

You could see the stdout/stderr of your container from mesos webui.

On Tue, Oct 6, 2015 at 5:30 PM, Paul Wolfe <paul.wo...@imc.nl> wrote:

Hello all,



I'm new to this list, so please let me know if there is a better/more 
appropriate forum for this question.



We are currently experimenting with marathon and mesos for deploying a simple 
webapp.  We ship the app as a docker container.



Sporadically (ie 1 out of 100) we find an old version of the app is deployed.  
It is obvious from the logs and the appearance of the GUI that the version is 
old.  If I download and run the docker container locally, I see it is indeed 
the latest version of the code.  That leads me to believe that somewhere in the 
marathon deploy or the mesos running of the image, versions are getting 
confused.



I guess my first question is what additional information can I get from 
marathon or mesos logs to help diagnose? I've checked the mesos-SLAVE.* but 
haven't been able to garner anything interesting there.



Thanks for any help!

Paul Wolfe







--
Best Regards,
Haosdent Huang





RE: Old docker version deployed

2015-10-06 Thread Paul Wolfe
No different tags.

From: Rad Gruchalski [mailto:ra...@gruchalski.com]
Sent: Tuesday, October 06, 2015 11:39 AM
To: user@mesos.apache.org
Subject: Re: Old docker version deployed

Paul,

Are you using the same tag every time?

Kind regards,

Radek Gruchalski

ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

On Tuesday, 6 October 2015 at 11:37, haosdent wrote:
You could see the stdout/stderr of your container from mesos webui.

On Tue, Oct 6, 2015 at 5:30 PM, Paul Wolfe <paul.wo...@imc.nl> wrote:


Hello all,



I'm new to this list, so please let me know if there is a better/more 
appropriate forum for this question.



We are currently experimenting with marathon and mesos for deploying a simple 
webapp.  We ship the app as a docker container.



Sporadically (ie 1 out of 100) we find an old version of the app is deployed.  
It is obvious from the logs and the appearance of the GUI that the version is 
old.  If I download and run the docker container locally, I see it is indeed 
the latest version of the code.  That leads me to believe that somewhere in the 
marathon deploy or the mesos running of the image, versions are getting 
confused.



I guess my first question is what additional information can I get from 
marathon or mesos logs to help diagnose? I've checked the mesos-SLAVE.* but 
haven't been able to garner anything interesting there.



Thanks for any help!

Paul Wolfe








--
Best Regards,
Haosdent Huang






RE: Old docker version deployed

2015-10-06 Thread Paul Wolfe
My marathon deploy json:

{
 "type": "DOCKER",
  "volumes": [
{
  "containerPath": "/home/myapp /log",
  "hostPath": "/home",
  "mode": "RW"
}
  ],
  "docker": {
"image": "docker-registry:8080/myapp:86",
"network": "BRIDGE",
"portMappings": [
  {
"containerPort": 80,
"hostPort": 0,
"servicePort": 80,
"protocol": "tcp"
  }
],
"privileged": false,
"parameters": [],
"forcePullImage": false
  }
}


From: Paul Wolfe [mailto:paul.wo...@imc.nl]
Sent: Tuesday, October 06, 2015 11:39 AM
To: user@mesos.apache.org
Subject: RE: Old docker version deployed

No different tags.

From: Rad Gruchalski [mailto:ra...@gruchalski.com]
Sent: Tuesday, October 06, 2015 11:39 AM
To: user@mesos.apache.org
Subject: Re: Old docker version deployed

Paul,

Are you using the same tag every time?

Kind regards,

Radek Gruchalski

ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

On Tuesday, 6 October 2015 at 11:37, haosdent wrote:
You could see the stdout/stderr of your container from mesos webui.

On Tue, Oct 6, 2015 at 5:30 PM, Paul Wolfe <paul.wo...@imc.nl> wrote:

Hello all,



I'm new to this list, so please let me know if there is a better/more 
appropriate forum for this question.



We are currently experimenting with marathon and mesos for deploying a simple 
webapp.  We ship the app as a docker container.



Sporadically (ie 1 out of 100) we find an old version of the app is deployed.  
It is obvious from the logs and the appearance of the GUI that the version is 
old.  If I download and run the docker container locally, I see it is indeed 
the latest version of the code.  That leads me to believe that somewhere in the 
marathon deploy or the mesos running of the image, versions are getting 
confused.



I guess my first question is what additional information can I get from 
marathon or mesos logs to help diagnose? I've checked the mesos-SLAVE.* but 
haven't been able to garner anything interesting there.



Thanks for any help!

Paul Wolfe








--
Best Regards,
Haosdent Huang





RE: Old docker version deployed

2015-10-06 Thread Paul Wolfe
Fair enough, although if that were the case I would expect it to fail hard, not 
randomly run an old image.

One thing I did notice was that on the master box, "docker images" is missing the 
version that should have been deployed (i.e., it has versions 77 and 79, but no 78).
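
To see what each box actually has cached, I've been poking around with commands
like these on the agents (the tag is just from the example above):

# list locally cached tags for the app image
docker images docker-registry:8080/myapp

# show the image ID actually backing a given tag, for comparison across hosts
docker inspect --format '{{.Id}}' docker-registry:8080/myapp:78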

From: haosdent [mailto:haosd...@gmail.com]
Sent: Tuesday, October 06, 2015 11:52 AM
To: user@mesos.apache.org
Subject: Re: Old docker version deployed

I don't think Mesos logs the version/tag of the image. When Mesos starts a docker 
container, it always uses your image name "docker-registry:8080/myapp:86" as the pull 
and run parameter. I think maybe some machines have problems connecting to your 
image registry.

On Tue, Oct 6, 2015 at 5:40 PM, Paul Wolfe <paul.wo...@imc.nl> wrote:
My marathon deploy json:

{
 "type": "DOCKER",
  "volumes": [
{
  "containerPath": "/home/myapp /log",
  "hostPath": "/home",
  "mode": "RW"
}
  ],
  "docker": {
"image": "docker-registry:8080/myapp:86",
"network": "BRIDGE",
"portMappings": [
  {
"containerPort": 80,
"hostPort": 0,
"servicePort": 80,
"protocol": "tcp"
  }
],
"privileged": false,
"parameters": [],
"forcePullImage": false
  }
}


From: Paul Wolfe [mailto:paul.wo...@imc.nl]
Sent: Tuesday, October 06, 2015 11:39 AM
To: user@mesos.apache.org
Subject: RE: Old docker version deployed

No different tags.

From: Rad Gruchalski [mailto:ra...@gruchalski.com]
Sent: Tuesday, October 06, 2015 11:39 AM
To: user@mesos.apache.org
Subject: Re: Old docker version deployed

Paul,

Are you using the same tag every time?

Kind regards,

Radek Gruchalski

ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

On Tuesday, 6 October 2015 at 11:37, haosdent wrote:
You could see the stdout/stderr of your container from mesos webui.

On Tue, Oct 6, 2015 at 5:30 PM, Paul Wolfe <paul.wo...@imc.nl> wrote:

Hello all,



I'm new to this list, so please let me know if there is a better/more 
appropriate forum for this question.



We are currently experimenting with marathon and mesos for deploying a simple 
webapp.  We ship the app as a docker container.



Sporadically (ie 1 out of 100) we find an old version of the app is deployed.  
It is obvious from the logs and the appearance of the GUI that the version is 
old.  If I download and run the docker container locally, I see it is indeed 
the latest version of the code.  That leads me to believe that somewhere in the 
marathon deploy or the mesos running of the image, versions are getting 
confused.



I guess my first question is what additional information can I get from 
marathon or mesos logs to help diagnose? I've checked the mesos-SLAVE.* but 
haven't been able to garner anything interesting there.



Thanks for any help!

Paul Wolfe








--
Best Regards,
Haosdent Huang





RE: Old docker version deployed

2015-10-06 Thread Paul Wolfe
Turns out it was a “bug” in docker. We found that running the same tag (78) by hand 
would randomly run version 18. It wouldn’t pull, even though the image 
wasn’t in the cache.

Upgrading from docker 1.7.1 to 1.8.2 seems to solve it, dangerous problem 
though…

From: Rad Gruchalski [mailto:ra...@gruchalski.com]
Sent: Tuesday, October 06, 2015 11:54 AM
To: user@mesos.apache.org
Subject: Re: Old docker version deployed

But if the image version is changed, this would fail, because the image is 
neither available locally nor available from the registry.

Kind regards,

Radek Gruchalski

ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

On Tuesday, 6 October 2015 at 11:51, haosdent wrote:
I don't think Mesos logs the version/tag of the image. When Mesos starts a docker 
container, it always uses your image name "docker-registry:8080/myapp:86" as the pull 
and run parameter. I think maybe some machines have problems connecting to your 
image registry.

On Tue, Oct 6, 2015 at 5:40 PM, Paul Wolfe <paul.wo...@imc.nl> wrote:


My marathon deploy json:



{
 "type": "DOCKER",
  "volumes": [
    {
      "containerPath": "/home/myapp /log",
      "hostPath": "/home",
      "mode": "RW"
    }
  ],
  "docker": {
    "image": "docker-registry:8080/myapp:86",
    "network": "BRIDGE",
    "portMappings": [
      {
        "containerPort": 80,
        "hostPort": 0,
        "servicePort": 80,
        "protocol": "tcp"
      }
    ],
    "privileged": false,
    "parameters": [],
    "forcePullImage": false
  }
}





From: Paul Wolfe [mailto:paul.wo...@imc.nl]
Sent: Tuesday, October 06, 2015 11:39 AM
To: user@mesos.apache.org
Subject: RE: Old docker version deployed



No different tags.



From: Rad Gruchalski [mailto:ra...@gruchalski.com]
Sent: Tuesday, October 06, 2015 11:39 AM
To: user@mesos.apache.org
Subject: Re: Old docker version deployed



Paul,



Are you using the same tag every time?

Kind regards,

Radek Gruchalski

ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

On Tuesday, 6 October 2015 at 11:37, haosdent wrote:

You could see the stdout/stderr of your container from mesos webui.



On Tue, Oct 6, 2015 at 5:30 PM, Paul Wolfe <paul.wo...@imc.nl> wrote:

Hello all,



I'm new to this list, so please let me know if there is a better/more 
appropriate forum for this question.



We are currently experimenting with marathon and mesos for deploying a simple 
webapp.  We ship the app as a docker container.



Sporadically (ie 1 out of 100) we find an old version of the app is deployed.  
It is obvious from the logs and the appearance of the GUI that the version is 
old.  If I download and run the docker container locally, I see it is indeed 
the latest version of the code.  That leads me to believe that somewhere in the 
marathon deploy or the mesos running of the image, versions are getting 
confused.



I guess my first question is what additional information can I get from 
marathon or mesos logs to help diagnose? I've checked the mesos-SLAVE.* but 
haven't been able to garner anything interesting there.



Thanks for any help!

Paul Wolfe








Re: Securing executors

2015-10-06 Thread Paul Bell
Thanks, Alexander; I will check out the vid.

I kind of assumed that this port was used for exactly the purpose you
mention.

Is TLS a possibility here?
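
For now I'll probably just restrict those executor ports at the host firewall,
roughly like this (the subnet is made up; a real rule set would be tighter):

# allow libprocess/executor traffic only from cluster hosts, drop the rest
iptables -A INPUT -p tcp --dport 1024:65535 -s 10.10.0.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 1024:65535 -j DROP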

-Paul

On Tue, Oct 6, 2015 at 8:15 AM, Alexander Rojas 
wrote:

> Hi Paul,
>
> I can refer you to the talk given by Adam Bordelon at MesosCon
> https://www.youtube.com/watch?v=G3sn1OLYDOE
>
> If you want to the short answer, the solution is to put a firewall around
> your cluster.
>
> On a closer look at the port, it is the one used for message passing
> between the mesos-docker-executor and other mesos components.
>
>
> On 05 Oct 2015, at 19:04, Paul Bell  wrote:
>
> Hi All,
>
> I am running an nmap port scan on a Mesos agent node and noticed nmap
> reporting an open TCP port at 50577.
>
> Poking around some, I discovered exactly 5 mesos-docker-executor
> processes, one for each of my 5 Docker containers, and each with an open
> listen port:
>
> root 14131  3617  0 10:39 ?00:00:17 mesos-docker-executor
> --container=mesos-20151002-172703-2450482247-5050-3014-S0.5563c65a-e33e-4287-8ce4-b2aa8116aa95
> --docker=/usr/local/ecxmcc/weaveShim --help=false
> --mapped_directory=/mnt/mesos/sandbox
> --sandbox_directory=/tmp/mesos/slaves/20151002-172703-2450482247-5050-3014-S0/frameworks/20151002-172703-2450482247-5050-3014-/executors/postgres.ea2954fd-6b6e-11e5-8bef-56847afe9799/runs/5563c65a-e33e-4287-8ce4-b2aa8116aa95
> --stop_timeout=15secs
>
> I suppose that all of this is unsurprising. But I know of at least one big
> customer who will without delay run Nmap or Nessus against my clustered
> deployment.
>
> So I am wondering what the best practices approach is to securing these
> open ports.
>
> Thanks for your help.
>
> -Paul
>
>
>
>
>


Re: manage jobs log files in sandboxes

2015-11-06 Thread Paul Bell
Hi Mauricio,

Yeah...I see your point; thank you.

My approach would be akin to closing the barn door after the horse got out.
Both Mesos & Docker are doing their own writing of STDOUT. Docker's
rotation won't address Mesos's behavior.

I need to find a solution here.

-Paul


On Thu, Nov 5, 2015 at 10:46 PM, Mauricio Garavaglia <
mauriciogaravag...@gmail.com> wrote:

> Hi Paul,
>
> I don't think that's going to help :(
> Even if you configure a different docker log driver, Docker still sends
> things to stdout, which is caught by mesos and dumped in the .logs
> directory in the job sandbox. For example, by default docker logs into a
> json file in /var/lib/docker, but mesos still writes to the sandbox.
> Hi Mauricio,
>
> I'm grappling with the same issue.
>
> I'm not yet sure if it represents a viable solution, but I plan to look at
> Docker's log rotation facility. It was introduced in Docker 1.8.
>
> If you beat me to it & it looks like a solution, please let us know!
>
> Thanks.
>
> Cordially,
>
> Paul
>
> > On Nov 5, 2015, at 9:40 PM, Mauricio Garavaglia <
> mauriciogaravag...@gmail.com> wrote:
> >
> > Hi guys,
> >
> > How can I manage the stdout/err log files generated by jobs in mesos?
> for long running docker apps launched using marathon the log files can
> deplete the disk of an agent, and using quotas makes the jobs to be killed
> which is also not ideal. I'd like to have a way to rotate them.
> >
> > Is it correct to just go to the mesos agent workdir and go through each
> sandbox stdout/err and rotate them? I know that could break the log UI but
> it doesn't scale very well having logs of several of GB.
> >
> > Thanks!
>


Re: manage jobs log files in sandboxes

2015-11-06 Thread Paul Bell
I've done a little reconnoitering, and the terrain looks to me as follows:

   1. Docker maintains container log files at
   /var/lib/docker/containers/<container-id>/<container-id>-json.log
   2. Mesos maintains container STDOUT files at a
   slave/framework/application specific location, e.g.,
   
/tmp/mesos/slaves/20151102-082316-370041927-5050-32381-S1/frameworks/20151102-082316-370041927-5050-32381-/executors/ecxprimary1.80750071-81a0-11e5-8596-82d195a34239/runs/5c767378-9599-40af-8010-a31f4c55f9dc
   3. The latter is mapped to the container's /mnt/mesos/sandbox
   4. These two files (-json.log and the STDOUT file) are different, *each*
   consumes disk space.

I think that the answer to (1) is Docker's logrotate.
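
For (1), a copytruncate logrotate stanza over the json logs looks like it would do
(sketch; the thresholds are arbitrary):

cat > /etc/logrotate.d/docker-containers <<'EOF'
/var/lib/docker/containers/*/*-json.log {
  daily
  rotate 5
  size 100M
  missingok
  copytruncate
}
EOF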

As to (2), I am considering a cron job at host (not container) level that
drives truncate cmd (GNU coreutils) to prune these files at a certain size.
Obviously requires knowing the fully-qualified path under
/tmp/mesos/slaves, but this is readily available via "docker inspect".
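
Roughly what I have in mind for (2), sketched (size thresholds are placeholders):

#!/bin/sh
# e.g. dropped into /etc/cron.hourly -- shrink any sandbox stdout/stderr over ~100MB
find /tmp/mesos/slaves -type f \( -name stdout -o -name stderr \) -size +100M \
  -exec truncate -s 50M {} \;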

-Paul


On Fri, Nov 6, 2015 at 7:17 AM, Paul Bell  wrote:

> Hi Mauricio,
>
> Yeah...I see your point; thank you.
>
> My approach would be akin to closing the barn door after the horse got
> out. Both Mesos & Docker are doing their own writing of STDOUT. Docker's
> rotation won't address Mesos's behavior.
>
> I need to find a solution here.
>
> -Paul
>
>
> On Thu, Nov 5, 2015 at 10:46 PM, Mauricio Garavaglia <
> mauriciogaravag...@gmail.com> wrote:
>
>> Hi Paul,
>>
>> I don't think that's going to help :(
>> Even if you configure a different docker log driver, Docker still sends
>> things to stdout, which is caught by mesos and dumped in the .logs
>> directory in the job sandbox. For example, by default docker logs into a
>> json file in /var/lib/docker, but mesos still writes to the sandbox.
>> Hi Mauricio,
>>
>> I'm grappling with the same issue.
>>
>> I'm not yet sure if it represents a viable solution, but I plan to look
>> at Docker's log rotation facility. It was introduced in Docker 1.8.
>>
>> If you beat me to it & it looks like a solution, please let us know!
>>
>> Thanks.
>>
>> Cordially,
>>
>> Paul
>>
>> > On Nov 5, 2015, at 9:40 PM, Mauricio Garavaglia <
>> mauriciogaravag...@gmail.com> wrote:
>> >
>> > Hi guys,
>> >
>> > How can I manage the stdout/err log files generated by jobs in mesos?
>> for long running docker apps launched using marathon the log files can
>> deplete the disk of an agent, and using quotas makes the jobs to be killed
>> which is also not ideal. I'd like to have a way to rotate them.
>> >
>> > Is it correct to just go to the mesos agent workdir and go through each
>> sandbox stdout/err and rotate them? I know that could break the log UI but
>> it doesn't scale very well having logs of several of GB.
>> >
>> > Thanks!
>>
>
>


Fate of slave node after timeout

2015-11-13 Thread Paul Bell
Hi All,

IIRC, after (max_slave_ping_timeouts * slave_ping_timeout) is exceeded
without a response from a mesos-slave, the master will remove the slave. In
the Mesos UI I can see slave state transition from 1 deactivated to 0.
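
For concreteness, the flags live on the master and, with what I believe are the
defaults, the window works out to roughly 75 seconds (sketch; correct me if I'm
mis-remembering):

# 15secs * 5 => the slave is removed ~75s after its last successful ping
mesos-master --zk=zk://zk1:2181/mesos --quorum=1 --work_dir=/var/lib/mesos \
  --slave_ping_timeout=15secs --max_slave_ping_timeouts=5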

Can that slave never again be added into the cluster, i.e., what happens if
it comes up 1 second after exceeding the timeout product?

(I'm dusting off some old notes and trying to refresh my memory about
problems I haven't seen in quite some time).

Thank you.

-Paul


Re: Anyone try Weave in Mesos env ?

2015-11-25 Thread Paul Bell
Hmm...I'm not sure there's really a "fix" for that (BTW: I assume you mean
to fix high (or long) latency, i.e., to make it lower, faster). A network
link is a network link, right? Like all hardware, it has its own physical
characteristics which determine its latency's lower bound, below which it
is physically impossible to go.

Sounds to me as if you've got the whole Mesos + Docker + Weave thing
figured out, at least as far as the basic connectivity and addressing is
concerned. So there's not much more that I can tell you in that regard.

Are you running Weave 1.2 (or above)? It incorporates their "fast path"
technology based on the Linux kernel's Open vSwitch (*vide*:
http://blog.weave.works/2015/11/13/weave-docker-networking-performance-fast-data-path/).
But, remember, there's still the link in between endpoints. One can
optimize the packet handling within an endpoint, but this could boil down
to a case of "hurry up and wait".

I would urge you to take this question up with the friendly, knowledgeable,
and very helpful folks at Weave:
https://groups.google.com/a/weave.works/forum/#!forum/weave-users .

Cordially,

Paul

On Wed, Nov 25, 2015 at 9:31 PM, Sam  wrote:

> Paul,
> Yup, Weave and Docker.  May I know how did you fix low latency issue over
> Internet ? By tunnel or ?
>
> Regards,
> Sam
>
> Sent from my iPhone
>
> > On Nov 26, 2015, at 10:23 AM, Paul  wrote:
> >
> > Happy Thanksgiving to you, too.
> >
> > I tend to deploy the several Mesos nodes as VMware VMs.
> >
> > However, I've also run a cluster with master on ESXi, slaves on ESXi,
> slave on bare metal, and an EC2 slave.
> >
> > But in my case all applications are Docker containers connected via
> Weave.
> >
> > Does your present deployment involve Docker and Weave?
> >
> > -paul
> >
> >> On Nov 25, 2015, at 8:55 PM, Sam  wrote:
> >>
> >> Paul,
> >> Happy thanksgiving first. We are using Aws, Rackspace as hybrid cloud
> env , and we deployed Mesos master in AWS , part of Slaves in AWS , part of
> Slaves in Rackspace .  I am thinking whether it works ? And since it got
> low latency in networking , can we deploy two masters in both AWS and
> Rackspace ? And federation ?Appreciated for your reply .
> >>
> >> Regards ,
> >> Sam
> >>
> >> Sent from my iPhone
> >>
> >>> On Nov 26, 2015, at 9:47 AM, Paul  wrote:
> >>>
> >>> Hi Sam,
> >>>
> >>> Yeah, I have significant experience in this regard.
> >>>
> >>> We run a Docker containers spread across several Mesos slave nodes.
> The containers are all connected via Weave. It works very well.
> >>>
> >>> Can you describe what you have in mind?
> >>>
> >>> Cordially,
> >>>
> >>> Paul
> >>>
> >>>> On Nov 25, 2015, at 8:03 PM, Sam  wrote:
> >>>>
> >>>> Guys,
> >>>> We are trying to use Weave in hybrid cloud Mesos env , anyone got
> experience on it ? Appreciated
> >>>> Regards,
> >>>> Sam
> >>>>
> >>>> Sent from my iPhone
>


Re: Anyone try Weave in Mesos env ?

2015-11-26 Thread Paul Bell
Hi Weitao,

I came up with this architecture as a way of distributing our application
across multiple nodes. Pre-Mesos, our application, delivered as a single
VMware VM, was not easily scalable. By breaking out the several application
components as Docker containers, we are now able (within limits imposed
chiefly by the application itself) to distribute & run those containers
across the several nodes in the Mesos cluster. Application containers that
need to talk to each other are connected via Weave's "overlay" (veth)
network.

Not surprisingly, this architecture has some of the benefits that you'd
expect from Mesos, chief among them being high-availability (more on this
below), scalability, and hybrid Cloud deployment.

The core unit of deployment is an Ubuntu image (14.04 LTS) that I've
configured with the appropriate components:

Zookeeper
Mesos-master
Mesos-slave
Marathon
Docker
Weave

SSH (including RSA keys)

Our application


This images is presently downloaded by a customer as a VMware .ova file. We
typically ask the customer to convert the resulting VM to a so-called
VMware template from which she can easily deploy multiple VMs as needed.
Please note that although we've started with VMware as our virtualization
platform, I've successfully run cluster nodes on both EC2 and Azure.

I tend to describe the Ubuntu image as "polymorphic", i.e., it can be told
to assume one of two roles, either a "master" role or a "slave" role. A
master runs ZK, mesos-master, and Marathon. A slave runs mesos-slave,
Docker, Weave, and the application.

We presently offer 3 canned deployment options:

   1. single-host, no HA
   2. multi-host, no HA (1 master, 3 slaves)
   3. multi-host, HA (3 masters, 3 slaves)

The single-host, no HA option exists chiefly to mimic the original
pre-Mesos deployment. But it has the added virtue, thanks to Mesos, of
allowing us to dynamically "grow" from a single-host to multiple hosts.

The multi-host, no HA option is presently geared toward a sharded MongoDB
backend where each slave runs a mongod container that is a single partition
(shard) of the larger database. This deployment option also lends itself
very nicely to adding a new slave node at the cluster level, and a new
mongod container at the application level - all without any downtime
whatsoever.

The multi-host, HA option offers the probably familiar *cluster-level* high
availability. I stress "cluster-level" because I think we have to
distinguish between HA at that level & HA at the application level. The
former is realized by the 3 master hosts, i.e., you can lose a master and
new one will self-elect thereby keeping the cluster up & running. But, to
my mind, at least, application level HA requires some co-operation on the
part of the application itself (e.g., checkpoint/restart). That said, it
*is* almost magical to watch Mesos re-launch an application container that
has crashed. But whether or not that re-launch results in coherent
application behavior is another matter.

An important home-grown component here is a Java program that automates
these functions:

create cluster - configures a host for a given role and starts Mesos
services. This is done via SSH
start application - distributes application containers across slave hosts.
This is done by talking to the Marathon REST API (see the sketch just after this list)
stop application - again, via the Marathon REST API
stop cluster - stops Mesos services. Again, via SSH
destroy cluster - deconfigures the host (after which it has no defined
role); again, SSH
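
The Marathon-facing pieces ("start application" / "stop application") boil down to
calls like these (simplified; the app definition file and app id are illustrative):

# start application: POST the generated app definition to Marathon
curl -X POST -H 'Content-Type: application/json' \
  -d @app.json http://marathon-host:8080/v2/apps

# stop application: remove it again
curl -X DELETE http://marathon-host:8080/v2/apps/mongod-shard-1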


As I write, I see Ajay's e-mail arrive about Calico. I am aware of this
project and it seems quite solid. But I've never understood the need to
"worry about networking containers in multihost setup". Weave runs as a
Docker container and It Just Works. I've "woven" together slaves nodes in a
cluster that spanned 3 different datacenters, one of them in EC2, without
any difficulty. Yes, I do have to assign Weave IP addresses to the several
containers, but this is hardly onerous. In fact, I've found it "liberating"
to select such addresses from a CIDR/8 address space, assigning them to
containers based on the container's purpose (e.g., MongoDB shard containers
might live at 10.4.0.X, etc.). Ultimately, this assignment boils down to
setting an environment variable that Marathon (or the mesos-slave executor)
will use when creating the container via "docker run".
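
Concretely, each container just gets its chosen address handed over via the
environment, conceptually something like this (names and addresses purely
illustrative; our shim does the actual Weave attach):

# Marathon sets the env var on the task; the shim maps it to a Weave address
docker run -d -e WEAVE_CIDR=10.4.0.11/8 my-registry/mongod-shard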

There is a whole lot more that I could say about the internals of this
architecture. But, if you're still interested, I'll await further questions
from you.

HTH.

Cordially,

Paul


On Thu, Nov 26, 2015 at 7:16 AM, Paul  wrote:

> Gladly, Weitao. It'd be my pleasure.
>
> But give me a few hours to find some free time.
>
> I am today tasked with cooking a Thanksgiving turkey.
>
> But I will try to find the t

Re: Anyone try Weave in Mesos env ?

2015-11-26 Thread Paul Bell
Hi Ajay,

I've no intention of getting into a contest about which product is best for
container networking.

Back in the 1960s there was a "war" between fans of the Beatles & Rolling
Stones: who's music was "better"? Mick Jagger famously then observed "You
can *prefer* them to us, or us to them". Had I started out with Calico, I
might today be its champion - but I didn't. I started with Weave and have
never regretted it.

That said, I'll make these points:

   1. Weave also supports dynamic assignment of IP@s, but I chose not to
   use it.
   2. Our architecture requires no manual intervention. I assign IP@s
   programmatically.

That's my last word on this matter. 😀

Cordially,

Paul


On Thu, Nov 26, 2015 at 9:31 AM, Ajay Bhatnagar 
wrote:

> With Calico you only create the virtual subnets and ip assignments are
> managed dynamically by Calico w/o any manual intervention needed.
>
> Cheers
>
> Ajay
>
>
>
> *From:* Paul Bell [mailto:arach...@gmail.com]
> *Sent:* Thursday, November 26, 2015 9:05 AM
> *To:* user@mesos.apache.org
> *Subject:* Re: Anyone try Weave in Mesos env ?
>
>
>
> Hi Weitao,
>
>
>
> I came up with this architecture as a way of distributing our application
> across multiple nodes. Pre-Mesos, our application, delivered as a single
> VMware VM, was not easily scalable. By breaking out the several application
> components as Docker containers, we are now able (within limits imposed
> chiefly by the application itself) to distribute & run those containers
> across the several nodes in the Mesos cluster. Application containers that
> need to talk to each other are connected via Weave's "overlay" (veth)
> network.
>
>
>
> Not surprisingly, this architecture has some of the benefits that you'd
> expect from Mesos, chief among them being high-availability (more on this
> below), scalability, and hybrid Cloud deployment.
>
>
>
> The core unit of deployment is an Ubuntu image (14.04 LTS) that I've
> configured with the appropriate components:
>
>
>
> Zookeeper
>
> Mesos-master
>
> Mesos-slave
>
> Marathon
>
> Docker
>
> Weave
>
> SSH (including RSA keys)
>
> Our application
>
>
>
> This images is presently downloaded by a customer as a VMware .ova file.
> We typically ask the customer to convert the resulting VM to a so-called
> VMware template from which she can easily deploy multiple VMs as needed.
> Please note that although we've started with VMware as our virtualization
> platform, I've successfully run cluster nodes on both EC2 and Azure.
>
>
>
> I tend to describe the Ubuntu image as "polymorphic", i.e., it can be told
> to assume one of two roles, either a "master" role or a "slave" role. A
> master runs ZK, mesos-master, and Marathon. A slave runs mesos-slave,
> Docker, Weave, and the application.
>
>
>
> We presently offer 3 canned deployment options:
>
>1. single-host, no HA
>2. multi-host, no HA (1 master, 3 slaves)
>3. multi-host, HA (3 masters, 3 slaves)
>
> The single-host, no HA option exists chiefly to mimic the original
> pre-Mesos deployment. But it has the added virtue, thanks to Mesos, of
> allowing us to dynamically "grow" from a single-host to multiple hosts.
>
>
>
> The multi-host, no HA option is presently geared toward a sharded MongoDB
> backend where each slave runs a mongod container that is a single partition
> (shard) of the larger database. This deployment option also lends itself
> very nicely to adding a new slave node at the cluster level, and a new
> mongod container at the application level - all without any downtime
> whatsoever.
>
>
>
> The multi-host, HA option offers the probably familiar *cluster-level*
> high availability. I stress "cluster-level" because I think we have to
> distinguish between HA at that level & HA at the application level. The
> former is realized by the 3 master hosts, i.e., you can lose a master and
> new one will self-elect thereby keeping the cluster up & running. But, to
> my mind, at least, application level HA requires some co-operation on the
> part of the application itself (e.g., checkpoint/restart). That said, it
> *is* almost magical to watch Mesos re-launch an application container
> that has crashed. But whether or not that re-launch results in coherent
> application behavior is another matter.
>
>
>
> An important home-grown component here is a Java program that automates
> these functions:
>
>
>
> create cluster - configures a host for a given role and starts Mesos
> services. This is done via SSH
&

Help needed (alas, urgently)

2016-01-14 Thread Paul Bell
9:01:10.999+ [initandlisten] journal
dir=/data/db/config/journal
2016-01-14T19:01:11.000+ [initandlisten] recover : no journal files
present, no recovery needed
2016-01-14T19:01:11.429+ [initandlisten] warning:
ClientCursor::staticYield can't unlock b/c of recursive lock ns:  top: {
opid: 11, active: true, secs_running: 0, microsecs_running: 36, op:
"query", ns: "local.oplog.$main", query: { query: {}, orderby: { $natural:
-1 } }, client: "0.0.0.0:0", desc: "initandlisten", threadId:
"0x7f8f73075b40", locks: { ^: "W" }, waitingForLock: false, numYields: 0,
lockStats: { timeLockedMicros: {}, timeAcquiringMicros: {} } }
2016-01-14T19:01:11.429+ [initandlisten] waiting for connections on
port 27019
2016-01-14T19:01:17.405+ [initandlisten] connection accepted from
10.2.0.3:51189 #1 (1 connection now open)
2016-01-14T19:01:17.413+ [initandlisten] connection accepted from
10.2.0.3:51190 #2 (2 connections now open)
2016-01-14T19:01:17.413+ [initandlisten] connection accepted from
10.2.0.3:51191 #3 (3 connections now open)
2016-01-14T19:01:17.414+ [conn3] first cluster operation detected,
adding sharding hook to enable versioning and authentication to remote
servers
2016-01-14T19:01:17.414+ [conn3] CMD fsync: sync:1 lock:0
2016-01-14T19:01:17.415+ [conn3] CMD fsync: sync:1 lock:0
2016-01-14T19:01:17.415+ [conn3] CMD fsync: sync:1 lock:0
2016-01-14T19:01:17.415+ [conn3] CMD fsync: sync:1 lock:0
2016-01-14T19:01:17.416+ [conn3] CMD fsync: sync:1 lock:0
2016-01-14T19:01:17.416+ [conn3] CMD fsync: sync:1 lock:0
2016-01-14T19:01:17.416+ [conn3] CMD fsync: sync:1 lock:0
2016-01-14T19:01:17.419+ [initandlisten] connection accepted from
10.2.0.3:51193 #4 (4 connections now open)
2016-01-14T19:01:17.420+ [initandlisten] connection accepted from
10.2.0.3:51194 #5 (5 connections now open)
2016-01-14T19:01:17.442+ [conn1] end connection 10.2.0.3:51189 (4
connections now open)
2016-01-14T19:02:11.285+ [clientcursormon] mem (MB) res:59 virt:385
2016-01-14T19:02:11.285+ [clientcursormon]  mapped (incl journal
view):192
2016-01-14T19:02:11.285+ [clientcursormon]  connections:4
2016-01-14T19:03:11.293+ [clientcursormon] mem (MB) res:72 virt:385
2016-01-14T19:03:11.294+ [clientcursormon]  mapped (incl journal
view):192
2016-01-14T19:03:11.294+ [clientcursormon]  connections:4
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task

Most disturbing in all of this is that while I can stop the deployments in
Marathon (which properly sends the "docker stop" commands visible in the ps
output), I cannot bounce Docker itself (not via Upstart, nor via a kill
command). Ultimately, I have to reboot the VM.

FWIW, the 3 mongod containers (apparently stuck in their Killing docker
task/shutting down loop) are running at 100%CPU as evinced by both "docker
stats" and "top".

I would truly be grateful for some guidance on this - even a mere
work-around would be appreciated.

Thank you.

-Paul


Re: Help needed (alas, urgently)

2016-01-14 Thread Paul Bell
Hey Tim,

Thank you very much for your reply.

Yes, I am in the midst of trying to reproduce the problem. If successful
(so to speak), I will do as you ask.

Cordially,

Paul

On Thu, Jan 14, 2016 at 3:19 PM, Tim Chen  wrote:

> Hi Paul,
>
> Looks like we've already issued the docker stop as you seen in the ps
> output, but the containers are still running. Can you look at the Docker
> daemon logs and see what's going on there?
>
> And also can you also try to modify docker_stop_timeout to 0 so that we
> SIGKILL the containers right away, and see if this still happens?
>
> Tim
>
>
>
> On Thu, Jan 14, 2016 at 11:52 AM, Paul Bell  wrote:
>
>> Hi All,
>>
>> It's been quite some time since I've posted here and that's chiefly
>> because up until a day or two ago, things were working really well.
>>
>> I actually may have posted about this some time back. But then the
>> problem seemed more intermittent.
>>
>> In summa, several "docker stops" don't work, i.e., the containers are not
>> stopped.
>>
>> Deployment:
>>
>> one Ubuntu VM (vmWare) LTS 14.04 with kernel 3.19
>> Zookeeper
>> Mesos-master (0.23.0)
>> Mesos-slave (0.23.0)
>> Marathon (0.10.0)
>> Docker 1.9.1
>> Weave 1.1.0
>> Our application containers, which include
>> MongoDB (4)
>> PostGres
>> ECX (our product)
>>
>> The only thing that's changed at all in the config above is the version
>> of Docker. Used to be 1.6.2 but I today upgraded it hoping to solve the
>> problem.
>>
>>
>> My automater program stops the application by sending Marathon an "http
>> delete" for each running up. Every now & then (reliably reproducible today)
>> not all containers get stopped. Most recently, 3 containers failed to stop.
>>
>> Here are the attendant phenomena:
>>
>> Marathon shows the 3 applications in deployment mode (presumably
>> "deployment" in the sense of "stopping")
>>
>> *ps output:*
>>
>> root@71:~# ps -ef | grep docker
>> root  3823 1  0 13:55 ?00:00:02 /usr/bin/docker daemon -H
>> unix:///var/run/docker.sock -H tcp://0.0.0.0:4243
>> root  4967 1  0 13:57 ?00:00:01 /usr/sbin/mesos-slave
>> --master=zk://71.100.202.99:2181/mesos --log_dir=/var/log/mesos
>> --containerizers=docker,mesos --docker=/usr/local/ecxmcc/weaveShim
>> --docker_stop_timeout=15secs --executor_registration_timeout=5mins
>> --hostname=71.100.202.99 --ip=71.100.202.99
>> --attributes=hostType:ecx,shard1 --resources=ports:[31000-31999,8443-8443]
>> root  5263  3823  0 13:57 ?00:00:00 docker-proxy -proto tcp
>> -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port
>> 6783
>> root  5271  3823  0 13:57 ?00:00:00 docker-proxy -proto udp
>> -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port
>> 6783
>> root  5279  3823  0 13:57 ?00:00:00 docker-proxy -proto tcp
>> -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port
>> 53
>> root  5287  3823  0 13:57 ?00:00:00 docker-proxy -proto udp
>> -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port
>> 53
>> root  7119  4967  0 14:00 ?00:00:01 mesos-docker-executor
>> --container=mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2
>> --docker=/usr/local/ecxmcc/weaveShim --help=false
>> --mapped_directory=/mnt/mesos/sandbox
>> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9/runs/bfc5a419-30f8-43f7-af2f-5582394532f2
>> --stop_timeout=15secs
>> root  7378  4967  0 14:00 ?00:00:01 mesos-docker-executor
>> --container=mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
>> --docker=/usr/local/ecxmcc/weaveShim --help=false
>> --mapped_directory=/mnt/mesos/sandbox
>> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9/runs/9b700cdc-3d29-49b7-a7fc-e543a91f7b89
>> --stop_timeout=15secs
>> root  7640  4967  0 14:01 ?00:00:01 mesos-docker-executor
>> --container=mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
>> --docker=/usr/local/ecxmcc/weaveShim --help=false
>> --mapped_directory=/mnt/mesos/sandbox
>

Re: Help needed (alas, urgently)

2016-01-14 Thread Paul Bell
Hi Tim,

Things have gotten slightly odder (if that's possible). When I now start
the application (5 or so containers), only one, "ecxconfigdb", gets started -
and even he took a few tries. That is, I see him failing, moving to
deploying, then starting again. But I've no evidence (no STDOUT, and no
docker container logs) that shows why.

In any event, ecxconfigdb does start. Happily, when I try to stop the
application I am seeing the phenomena I posted before: killing docker task,
shutting down repeated many times. The UN-stopped container is now running
at 100% CPU.

I will try modifying docker_stop_timeout. Back shortly

Thanks again.

-Paul

PS: what do you make of the "broken pipe" error in the docker.log?

*from /var/log/upstart/docker.log*

[34mINFO[3054] GET /v1.15/images/mongo:2.6.8/json
INFO[3054] GET
/v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
ERRO[3054] Handler for GET
/v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
returned error: No such image:
mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
ERRO[3054] HTTP Error
 err=No such image:
mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
statusCode=404
INFO[3054] GET /v1.15/containers/weave/json
INFO[3054] POST
/v1.21/containers/create?name=mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
INFO[3054] POST
/v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/attach?stderr=1&stdout=1&stream=1
INFO[3054] POST
/v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/start
INFO[3054] GET
/v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3054] GET
/v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3054] GET /v1.15/containers/weave/json
INFO[3054] GET
/v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3054] GET
/v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3054] GET /v1.15/containers/weave/json
INFO[3054] GET
/v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3054] GET
/v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
INFO[3111] GET /v1.21/containers/json
INFO[3120] GET /v1.21/containers/cf7/json
INFO[3120] GET
/v1.21/containers/cf7/logs?stderr=1&stdout=1&tail=all
INFO[3153] GET /containers/json
INFO[3153] GET
/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3153] GET
/containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
INFO[3153] GET
/containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
INFO[3175] GET /containers/json
INFO[3175] GET
/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3175] GET
/containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
INFO[3175] GET
/containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
*INFO[3175] POST
/v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/stop*
?t=15
*ERRO[3175] attach: stdout: write unix @: broken pipe*
*INFO[3190] Container
cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47 failed to
exit within 15 seconds of SIGTERM - using the force *
*INFO[3200] Container cf7fc7c48324 failed to exit within 10
seconds of kill - trying direct SIGKILL *

*STDOUT from Mesos:*

*--container="mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b"
*--docker="/usr/local/ecxmcc/weaveShim" --help="false"
--initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
--mapped_directory="/mnt/mesos/sandbox" --quiet="false"
--sandbox_directory="/tmp/mesos/slaves/20160114-153418-1674208327-5050-3798-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285/runs/c5c35d59-1318-4a96-b850-b0b788815f1b"
--stop_timeout="15secs"
--container="mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b"
--docker="/usr/local/ecxmcc/weaveShim" --help="false"
--initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
--mapped_directory="/mnt/mesos/sand

Re: Help needed (alas, urgently)

2016-01-14 Thread Paul Bell
Hi Tim,

I set docker_stop_timeout to zero as you asked. I am pleased to report
(though a bit fearful about being pleased) that this change seems to have
shut everyone down pretty much instantly.

Can you explain what's happening, e.g., does docker_stop_timeout=0 cause
the immediate use of "kill -9" as opposed to "kill -2"?
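
My rough understanding is that the flag is simply passed through as the -t
value of "docker stop" (SIGTERM, wait -t seconds, then SIGKILL), so with 0
it is effectively:

  docker stop -t 0 <container-id>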

I will keep testing the behavior.

Thank you.

-Paul

On Thu, Jan 14, 2016 at 3:59 PM, Paul Bell  wrote:

> Hi Tim,
>
> Things have gotten slightly odder (if that's possible). When I now start
> the application 5 or so containers, only one "ecxconfigdb" gets started -
> and even he took a few tries. That is, I see him failing, moving to
> deploying, then starting again. But I've no evidence (no STDOUT, and no
> docker ctr logs) that show why.
>
> In any event, ecxconfigdb does start. Happily, when I try to stop the
> application I am seeing the phenomena I posted before: killing docker task,
> shutting down repeated many times. The UN-stopped container is now running
> at 100% CPU.
>
> I will try modifying docker_stop_timeout. Back shortly
>
> Thanks again.
>
> -Paul
>
> PS: what do you make of the "broken pipe" error in the docker.log?
>
> *from /var/log/upstart/docker.log*
>
> [34mINFO [0m[3054] GET /v1.15/images/mongo:2.6.8/json
> [34mINFO [0m[3054] GET
> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
> [31mERRO [0m[3054] Handler for GET
> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
> returned error: No such image:
> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
> [31mERRO [0m[3054] HTTP Error [31merr
> [0m=No such image:
> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
> [31mstatusCode [0m=404
> [34mINFO [0m[3054] GET /v1.15/containers/weave/json
> [34mINFO [0m[3054] POST
> /v1.21/containers/create?name=mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
> [34mINFO [0m[3054] POST
> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/attach?stderr=1&stdout=1&stream=1
> [34mINFO [0m[3054] POST
> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/start
> [34mINFO [0m[3054] GET
> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
> [34mINFO [0m[3054] GET
> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
> [34mINFO [0m[3054] GET /v1.15/containers/weave/json
> [34mINFO [0m[3054] GET
> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
> [34mINFO [0m[3054] GET
> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
> [34mINFO [0m[3054] GET /v1.15/containers/weave/json
> [34mINFO [0m[3054] GET
> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
> [34mINFO [0m[3054] GET
> /v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
> [34mINFO [0m[3111] GET /v1.21/containers/json
> [34mINFO [0m[3120] GET /v1.21/containers/cf7/json
> [34mINFO [0m[3120] GET
> /v1.21/containers/cf7/logs?stderr=1&stdout=1&tail=all
> [34mINFO [0m[3153] GET /containers/json
> [34mINFO [0m[3153] GET
> /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
> [34mINFO [0m[3153] GET
> /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
> [34mINFO [0m[3153] GET
> /containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
> [34mINFO [0m[3175] GET /containers/json
> [34mINFO [0m[3175] GET
> /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
> [34mINFO [0m[3175] GET
> /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
> [34mINFO [0m[3175] GET
> /containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
> * [34mINFO [0m[3175] POST
> /v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/stop*
> ?t=15
> * [31mERRO [0m[3175] attach: stdout: write unix @: broken pipe*
> * [34mINFO [0m[3190] Container
> cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47 failed to
> exit within 15 seconds of SIGTERM - using the force *
> * [34mINFO [0m[3200] Container cf7fc7c48324 failed to exit within 10
> seconds of kill - trying direct SIGKILL *
>
> *STDOUT from Mesos:*
>
> *--container="mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-

Re: Help needed (alas, urgently)

2016-01-14 Thread Paul Bell
I spoke too soon, I'm afraid.

Next time I did the stop (with zero timeout), I see the same phenomenon: a
mongo container showing repeated:

killing docker task
shutting down


What else can I try?

Thank you.

On Thu, Jan 14, 2016 at 5:07 PM, Paul Bell  wrote:

> Hi Tim,
>
> I set docker_stop_timeout to zero as you asked. I am pleased to report
> (though a bit fearful about being pleased) that this change seems to have
> shut everyone down pretty much instantly.
>
> Can you explain what's happening, e.g., does docker_stop_timeout=0 cause
> the immediate use of "kill -9" as opposed to "kill -2"?
>
> I will keep testing the behavior.
>
> Thank you.
>
> -Paul
>
> On Thu, Jan 14, 2016 at 3:59 PM, Paul Bell  wrote:
>
>> Hi Tim,
>>
>> Things have gotten slightly odder (if that's possible). When I now start
>> the application 5 or so containers, only one "ecxconfigdb" gets started -
>> and even he took a few tries. That is, I see him failing, moving to
>> deploying, then starting again. But I've no evidence (no STDOUT, and no
>> docker ctr logs) that show why.
>>
>> In any event, ecxconfigdb does start. Happily, when I try to stop the
>> application I am seeing the phenomena I posted before: killing docker task,
>> shutting down repeated many times. The UN-stopped container is now running
>> at 100% CPU.
>>
>> I will try modifying docker_stop_timeout. Back shortly
>>
>> Thanks again.
>>
>> -Paul
>>
>> PS: what do you make of the "broken pipe" error in the docker.log?
>>
>> *from /var/log/upstart/docker.log*
>>
>> [34mINFO [0m[3054] GET /v1.15/images/mongo:2.6.8/json
>> [34mINFO [0m[3054] GET
>> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>> [31mERRO [0m[3054] Handler for GET
>> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>> returned error: No such image:
>> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>> [31mERRO [0m[3054] HTTP Error [31merr
>> [0m=No such image:
>> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>> [31mstatusCode [0m=404
>> [34mINFO [0m[3054] GET /v1.15/containers/weave/json
>> [34mINFO [0m[3054] POST
>> /v1.21/containers/create?name=mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>> [34mINFO [0m[3054] POST
>> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/attach?stderr=1&stdout=1&stream=1
>> [34mINFO [0m[3054] POST
>> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/start
>> [34mINFO [0m[3054] GET
>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> [34mINFO [0m[3054] GET
>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> [34mINFO [0m[3054] GET /v1.15/containers/weave/json
>> [34mINFO [0m[3054] GET
>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> [34mINFO [0m[3054] GET
>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> [34mINFO [0m[3054] GET /v1.15/containers/weave/json
>> [34mINFO [0m[3054] GET
>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> [34mINFO [0m[3054] GET
>> /v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>> [34mINFO [0m[3111] GET /v1.21/containers/json
>> [34mINFO [0m[3120] GET /v1.21/containers/cf7/json
>> [34mINFO [0m[3120] GET
>> /v1.21/containers/cf7/logs?stderr=1&stdout=1&tail=all
>> [34mINFO [0m[3153] GET /containers/json
>> [34mINFO [0m[3153] GET
>> /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> [34mINFO [0m[3153] GET
>> /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
>> [34mINFO [0m[3153] GET
>> /containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
>> [34mINFO [0m[3175] GET /containers/json
>> [34mINFO [0m[3175] GET
>> /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> [34mINFO [0m[3175] GET
>> /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
>> [34mINFO [0m[3175] GET
>> /containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
>>

Re: Help needed (alas, urgently)

2016-01-15 Thread Paul Bell
In chasing down this problem, I stumbled upon something of moment: the
problem does NOT seem to happen with kernel 3.13.

Some weeks back, in the hope of getting past another problem wherein the
root filesystem "becomes" R/O, I upgraded from 3.13 to 3.19 (Ubuntu 14.04
LTS). The kernel upgrade was done as shown here (there's some extra stuff
to get rid of Ubuntu desktop and liberate some disk space):

  apt-get update
  apt-get -y remove ubuntu-desktop
  apt-get -y purge lightdm
  rm -Rf /var/lib/lightdm-data
  apt-get -y remove --purge libreoffice-core
  apt-get -y remove --purge libreoffice-common

  echo "  Installing new kernel"

  apt-get -y install linux-generic-lts-vivid
  apt-get -y autoremove linux-image-3.13.0-32-generic
  apt-get -y autoremove linux-image-3.13.0-71-generic
  update-grub
  reboot

After the reboot, a "uname -r" shows kernel 3.19.0-42-generic.

Under this kernel I can now reliably reproduce the failure to stop a
MongoDB container. Specifically, any & all attempts to kill the container,
e.g., via

Marathon HTTP Delete (which leads to mesos-docker-executor issuing the
"docker stop" command)
Getting inside the running container shell and issuing "kill" or
db.shutDown() (sketched after the list below)

causes the mongod container

   - to show in its log that it's shutting down normally
   - to enter a 100% CPU loop
   - to become unkillable (only reboot "fixes" things)
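
For reference, the in-container route amounts to roughly this (container
name and port are illustrative; the proper shell call is shutdownServer on
the admin database):

  docker exec -it <mongod-container> \
      mongo --port 27019 --eval "db.getSiblingDB('admin').shutdownServer()"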

Note finally that my conclusion about kernel 3.13 "working" is at present a
weak induction. But I do know that when I reverted to that kernel I could,
at least once, stop the containers w/o any problems; whereas at 3.19 I can
reliably reproduce the problem. I will try to make this induction stronger
as the day wears on.

Did I do something "wrong" in my kernel upgrade steps?

Is anyone aware of such an issue in 3.19 or of work done post-3.13 in the
area of task termination & signal handling?

Thanks for your help.

-Paul


On Thu, Jan 14, 2016 at 5:14 PM, Paul Bell  wrote:

> I spoke to soon, I'm afraid.
>
> Next time I did the stop (with zero timeout), I see the same phenomenon: a
> mongo container showing repeated:
>
> killing docker task
> shutting down
>
>
> What else can I try?
>
> Thank you.
>
> On Thu, Jan 14, 2016 at 5:07 PM, Paul Bell  wrote:
>
>> Hi Tim,
>>
>> I set docker_stop_timeout to zero as you asked. I am pleased to report
>> (though a bit fearful about being pleased) that this change seems to have
>> shut everyone down pretty much instantly.
>>
>> Can you explain what's happening, e.g., does docker_stop_timeout=0 cause
>> the immediate use of "kill -9" as opposed to "kill -2"?
>>
>> I will keep testing the behavior.
>>
>> Thank you.
>>
>> -Paul
>>
>> On Thu, Jan 14, 2016 at 3:59 PM, Paul Bell  wrote:
>>
>>> Hi Tim,
>>>
>>> Things have gotten slightly odder (if that's possible). When I now start
>>> the application 5 or so containers, only one "ecxconfigdb" gets started -
>>> and even he took a few tries. That is, I see him failing, moving to
>>> deploying, then starting again. But I've no evidence (no STDOUT, and no
>>> docker ctr logs) that show why.
>>>
>>> In any event, ecxconfigdb does start. Happily, when I try to stop the
>>> application I am seeing the phenomena I posted before: killing docker task,
>>> shutting down repeated many times. The UN-stopped container is now running
>>> at 100% CPU.
>>>
>>> I will try modifying docker_stop_timeout. Back shortly
>>>
>>> Thanks again.
>>>
>>> -Paul
>>>
>>> PS: what do you make of the "broken pipe" error in the docker.log?
>>>
>>> *from /var/log/upstart/docker.log*
>>>
>>> [34mINFO [0m[3054] GET /v1.15/images/mongo:2.6.8/json
>>> [34mINFO [0m[3054] GET
>>> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>>> [31mERRO [0m[3054] Handler for GET
>>> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>>> returned error: No such image:
>>> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>>> [31mERRO [0m[3054] HTTP Error
>>> [31merr [0m=No such image:
>>> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>>> [31mstatusCode [0m=404
>>> [34mINFO [0m[3054] GET /v1.15/containers/weave/json
>>> [34mINFO [0m[3054] POST
>>> /v1.21/containers/create?name=mesos-20160114-153418-1674208327-5050-3798-S0.

Re: Help needed (alas, urgently)

2016-01-15 Thread Paul Bell
Tim,

I've tracked down the cause of this problem: it's the result of some kind
of incompatibility between kernel 3.19 and "VMware Tools". I know little
more than that.

I installed VMware Tools via *apt-get install open-vm-tools-lts-trusty*.
Everything worked fine on 3.13. But when I upgrade to 3.19, the error
occurs quite reliably. Revert back to 3.13 and the error goes away.

I looked high & low for some statement of kernel requirements for VMware
Tools, but can find none.

Sorry to have wasted your time.

-Paul

On Fri, Jan 15, 2016 at 9:19 AM, Paul Bell  wrote:

> In chasing down this problem, I stumbled upon something of moment: the
> problem does NOT seem to happen with kernel 3.13.
>
> Some weeks back, in the hope of getting past another problem wherein the
> root filesystem "becomes" R/O, I upgraded from 3.13 to 3.19 (Ubuntu 14.04
> LTS). The kernel upgrade was done as shown here (there's some extra stuff
> to get rid of Ubuntu desktop and liberate some disk space):
>
>   apt-get update
>   apt-get -y remove unbuntu-desktop
>   apt-get -y purge lightdm
>   rm -Rf /var/lib/lightdm-data
>   apt-get -y remove --purge libreoffice-core
>   apt-get -y remove --purge libreoffice-common
>
>   echo "  Installing new kernel"
>
>   apt-get -y install linux-generic-lts-vivid
>   apt-get -y autoremove linux-image-3.13.0-32-generic
>   apt-get -y autoremove linux-image-3.13.0-71-generic
>   update-grub
>   reboot
>
> After the reboot, a "uname -r" shows kernel 3.19.0-42-generic.
>
> Under this kernel I can now reliably reproduce the failure to stop a
> MongoDB container. Specifically, any & all attempts to kill the container,
> e.g.,via
>
> Marathon HTTP Delete (which leads to docker-mesos-executor presenting
> "docker stop" command)
> Getting inside the running container shell and issuing "kill" or
> db.shutDown()
>
> causes the mongod container
>
>- to show in its log that it's shutting down normally
>- to enter a 100% CPU loop
>- to become unkillable (only reboot "fixes" things)
>
> Note finally that my conclusion about kernel 3.13 "working" is at present
> a weak induction. But I do know that when I reverted to that kernel I
> could, at least once, stop the containers w/o any problems; whereas at 3.19
> I can reliably reproduce the problem. I will try to make this induction
> stronger as the day wears on.
>
> Did I do something "wrong" in my kernel upgrade steps?
>
> Is anyone aware of such an issue in 3.19 or of work done post-3.13 in the
> area of task termination & signal handling?
>
> Thanks for your help.
>
> -Paul
>
>
> On Thu, Jan 14, 2016 at 5:14 PM, Paul Bell  wrote:
>
>> I spoke to soon, I'm afraid.
>>
>> Next time I did the stop (with zero timeout), I see the same phenomenon:
>> a mongo container showing repeated:
>>
>> killing docker task
>> shutting down
>>
>>
>> What else can I try?
>>
>> Thank you.
>>
>> On Thu, Jan 14, 2016 at 5:07 PM, Paul Bell  wrote:
>>
>>> Hi Tim,
>>>
>>> I set docker_stop_timeout to zero as you asked. I am pleased to report
>>> (though a bit fearful about being pleased) that this change seems to have
>>> shut everyone down pretty much instantly.
>>>
>>> Can you explain what's happening, e.g., does docker_stop_timeout=0
>>> cause the immediate use of "kill -9" as opposed to "kill -2"?
>>>
>>> I will keep testing the behavior.
>>>
>>> Thank you.
>>>
>>> -Paul
>>>
>>> On Thu, Jan 14, 2016 at 3:59 PM, Paul Bell  wrote:
>>>
>>>> Hi Tim,
>>>>
>>>> Things have gotten slightly odder (if that's possible). When I now
>>>> start the application 5 or so containers, only one "ecxconfigdb" gets
>>>> started - and even he took a few tries. That is, I see him failing, moving
>>>> to deploying, then starting again. But I've no evidence (no STDOUT, and no
>>>> docker ctr logs) that show why.
>>>>
>>>> In any event, ecxconfigdb does start. Happily, when I try to stop the
>>>> application I am seeing the phenomena I posted before: killing docker task,
>>>> shutting down repeated many times. The UN-stopped container is now running
>>>> at 100% CPU.
>>>>
>>>> I will try modifying docker_stop_timeout. Back shortly
>>>>
>>>> Thanks again.
>>>>
>>>> -Paul
>>>>
>>&

Feature request: move in-flight containers w/o stopping them

2016-02-18 Thread Paul Bell
Hello All,

Has there ever been any consideration of the ability to move in-flight
containers from one Mesos host node to another?

I see this as analogous to VMware's "vMotion" facility wherein VMs can be
moved from one ESXi host to another.

I suppose something like this could be useful from a load-balancing
perspective.

Just curious if it's ever been considered and if so - and rejected - why
rejected?

Thanks.

-Paul


Recent UI enhancements & Managed Service Providers

2016-02-25 Thread Paul Bell
Hi All,

I am running older versions of Mesos & Marathon (0.23.0 and 0.10.0).

Over the course of the last several months I think I've seen several items
on this list about UI enhancements. Perhaps they were enhancements to the
data consumed by the Mesos & Marathon UIs. I've had very little time to dig
deeply into it.

So...I am wondering if someone can either point me to any discussions of
such enhancements or summarize them here.

There is a specific use case behind this request. The Mesos architecture
seems to be a real sweet spot for an MSP. But an important MSP requirement
is a unified view of their many tenants. So I am really trying to get a
sense for how well the recent Mesos/Marathon releases address this
requirement.

Thank you.

-Paul


Re: Recent UI enhancements & Managed Service Providers

2016-02-25 Thread Paul Bell
Hi Vinod,

Thank you for your reply.

I'm not sure that I can be more specific. MSPs are interested in a "view by
tenant", e.g., "show me all applications that are allotted to Tenant X".  I
suppose that the standard Mesos UI could, with properly named task IDs and
the UI's "Find" filter, accomplish part of "view by tenant". But in order
to see the resources consumed by Tenant X's tasks, you have to visit each
task individually and look at their "Resources" table (add them all up).

It'd be cool if when a filter is in effect, the Resources table was updated
to reflect only the resources consumed by the filter-selected tasks.

There's also the question of the units/meaning of Resources. Through
Marathon I give each of my Dockerized tasks .1 CPU. As I understand it,
Docker multiplies this value times 1024 which is Docker's representation of
all the cores on a host. So when I do "docker inspect " I will see
CpuShares of 102. But in the Mesos UI each of my 6 tasks shows .2 CPUs
allocated. I'm simply not sure what this means or how it's arrived at. I
suspect that an MSP will ask the same questions.
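
To make the arithmetic concrete (container name illustrative; the inspect
field path may vary a bit across Docker versions):

  docker inspect --format '{{ .HostConfig.CpuShares }}' ecxconfigdb
  # 102, i.e. 0.1 * 1024 = 102.4, truncated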

I will think about it some more, but I'd be interested to hear feedback on
these few points that I've raised.

Thanks again.

-Paul

On Thu, Feb 25, 2016 at 11:55 AM, Vinod Kone  wrote:

>
> > But an important MSP requirement is a unified view of their many
> tenants. So I am really trying to get a sense for how well the recent
> Mesos/Marathon releases address this requirement.
>
> Can you be more specific about what you mean by unified view and tenants?
> What's lacking currently?


Re: Recent UI enhancements & Managed Service Providers

2016-02-26 Thread Paul Bell
Sure thing. I just signed up for an ASF Jira account.

I'm no expert at Jira. Under what Mesos version, etc., would you like me to
create it?

Also, thanks for the explanation re 0.2. But, again, this is sort of an
abstract number, no?

-Paul

On Fri, Feb 26, 2016 at 4:26 PM, Vinod Kone  wrote:

>
> On Thu, Feb 25, 2016 at 10:31 AM, Paul Bell  wrote:
>
>> I'm not sure that I can be more specific. MSPs are interested in a "view
>> by tenant", e.g., "show me all applications that are allotted to Tenant
>> X".  I suppose that the standard Mesos UI could, with properly named task
>> IDs and the UI's "Find" filter, accomplish part of "view by tenant". But in
>> order to see the resources consumed by Tenant X's tasks, you have to visit
>> each task individually and look at their "Resources" table (add them all
>> up).
>>
>> It'd be cool if when a filter is in effect, the Resources table was
>> updated to reflect only the resources consumed by the filter-selected tasks.
>>
>>
> There has been no work on this (i.e., some way to filter the UI view w.r.t
> a group of tasks), but this sounds like a good use case. Can you file a
> ticket?
>
>
>
>> There's also the question of the units/meaning of Resources. Through
>> Marathon I give each of my Dockerized tasks .1 CPU. As I understand it,
>> Docker multiplies this value times 1024 which is Docker's representation of
>> all the cores on a host. So when I do "docker inspect " I will see
>> CpuShares of 102. But in the Mesos UI each of my 6 tasks shows .2 CPUs
>> allocated. I'm simply not sure what this means or how it's arrived at. I
>> suspect that an MSP will ask the same questions.
>>
>
> You see 0.2 because Mesos adds 0.1 overhead for the default executor that
> runs the docker task.
>
>
>
>


Agent won't start

2016-03-29 Thread Paul Bell
Hi,

I am hoping someone can shed some light on this.

An agent node failed to start, that is, when I did "service mesos-slave
start" the service came up briefly & then stopped. Before stopping it
produced the log shown below. The last thing it wrote is "Trying to create
path '/mesos' in Zookeeper".

This mention of the mesos znode prompted me to go for a clean slate by
removing the mesos znode from Zookeeper.
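
The removal itself was just the obvious thing, something like this (the
exact zkCli.sh path varies by install):

  /usr/share/zookeeper/bin/zkCli.sh -server 71.100.202.191:2181
  rmr /mesos        # at the zk prompt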

After doing this, the mesos-slave service started perfectly.

What might be happening here, and also what's the right way to
trouble-shoot such a problem? Mesos is version 0.23.0.

Thanks for your help.

-Paul


Log file created at: 2016/03/29 14:19:39
Running on machine: 71.100.202.193
Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
I0329 14:19:39.512249  5870 logging.cpp:172] INFO level logging started!
I0329 14:19:39.512564  5870 main.cpp:162] Build: 2015-07-24 10:05:39 by root
I0329 14:19:39.512588  5870 main.cpp:164] Version: 0.23.0
I0329 14:19:39.512600  5870 main.cpp:167] Git tag: 0.23.0
I0329 14:19:39.512612  5870 main.cpp:171] Git SHA:
4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
I0329 14:19:39.615172  5870 containerizer.cpp:111] Using isolation:
posix/cpu,posix/mem
I0329 14:19:39.615697  5870 main.cpp:249] Starting Mesos slave
I0329 14:19:39.616267  5870 slave.cpp:190] Slave started on 1)@
71.100.202.193:5051
I0329 14:19:39.616286  5870 slave.cpp:191] Flags at startup:
--attributes="hostType:shard1" --authenticatee="crammd5"
--cgroups_cpu_enable_pids_and_tids_count="false"
--cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
--cgroups_limit_swap="false" --cgroups_root="mesos"
--container_disk_watch_interval="15secs" --containerizers="docker,mesos"
--default_role="*" --disk_watch_interval="1mins"
--docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true"
--docker_remove_delay="6hrs"
--docker_sandbox_directory="/mnt/mesos/sandbox"
--docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs"
--enforce_container_disk_quota="false"
--executor_registration_timeout="5mins"
--executor_shutdown_grace_period="5secs"
--fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB"
--frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
--hadoop_home="" --help="false" --hostname="71.100.202.193"
--initialize_driver_logging="true" --ip="71.100.202.193"
--isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos"
--log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO"
--master="zk://71.100.202.191:2181/mesos"
--oversubscribed_resources_interval="15secs" --perf_duration="10secs"
--perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns"
--quiet="false" --recover="reconnect" --recovery_timeout="15mins"
--registration_backoff_factor="1secs"
--resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true"
--strict="true" --switch_user="true" --version="false"
--work_dir="/tmp/mesos"
I0329 14:19:39.616835  5870 slave.cpp:354] Slave resources: cpus(*):4;
mem(*):23089; disk(*):122517; ports(*):[31000-32000]
I0329 14:19:39.617032  5870 slave.cpp:384] Slave hostname: 71.100.202.193
I0329 14:19:39.617046  5870 slave.cpp:389] Slave checkpoint: true
I0329 14:19:39.618841  5894 state.cpp:36] Recovering state from
'/tmp/mesos/meta'
I0329 14:19:39.618872  5894 state.cpp:672] Failed to find resources file
'/tmp/mesos/meta/resources/resources.info'
I0329 14:19:39.619730  5898 group.cpp:313] Group process (group(1)@
71.100.202.193:5051) connected to ZooKeeper
I0329 14:19:39.619760  5898 group.cpp:787] Syncing group operations: queue
size (joins, cancels, datas) = (0, 0, 0)
I0329 14:19:39.619773  5898 group.cpp:385] Trying to create path '/mesos'
in ZooKeeper


Re: Agent won't start

2016-03-29 Thread Paul Bell
Hi Greg,

Thanks very much for your quick reply.

I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not
systemd. I will look at the link you provide.

Is there any chance that it might apply to non-systemd platforms?

Cordially,

Paul

On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann  wrote:

> Hi Paul,
> Noticing the logging output, "Failed to find resources file
> '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble may
> be related to the location of your agent's work_dir. See this ticket:
> https://issues.apache.org/jira/browse/MESOS-4541
>
> Some users have reported issues resulting from the systemd-tmpfiles
> service garbage collecting files in /tmp, perhaps this is related? What
> platform is your agent running on?
>
> You could try specifying a different agent work directory outside of /tmp/
> via the `--work_dir` command-line flag.
>
> Cheers,
> Greg
>
>
> On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell  wrote:
>
>> Hi,
>>
>> I am hoping someone can shed some light on this.
>>
>> An agent node failed to start, that is, when I did "service mesos-slave
>> start" the service came up briefly & then stopped. Before stopping it
>> produced the log shown below. The last thing it wrote is "Trying to create
>> path '/mesos' in Zookeeper".
>>
>> This mention of the mesos znode prompted me to go for a clean slate by
>> removing the mesos znode from Zookeeper.
>>
>> After doing this, the mesos-slave service started perfectly.
>>
>> What might be happening here, and also what's the right way to
>> trouble-shoot such a problem? Mesos is version 0.23.0.
>>
>> Thanks for your help.
>>
>> -Paul
>>
>>
>> Log file created at: 2016/03/29 14:19:39
>> Running on machine: 71.100.202.193
>> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
>> I0329 14:19:39.512249  5870 logging.cpp:172] INFO level logging started!
>> I0329 14:19:39.512564  5870 main.cpp:162] Build: 2015-07-24 10:05:39 by
>> root
>> I0329 14:19:39.512588  5870 main.cpp:164] Version: 0.23.0
>> I0329 14:19:39.512600  5870 main.cpp:167] Git tag: 0.23.0
>> I0329 14:19:39.512612  5870 main.cpp:171] Git SHA:
>> 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
>> I0329 14:19:39.615172  5870 containerizer.cpp:111] Using isolation:
>> posix/cpu,posix/mem
>> I0329 14:19:39.615697  5870 main.cpp:249] Starting Mesos slave
>> I0329 14:19:39.616267  5870 slave.cpp:190] Slave started on 1)@
>> 71.100.202.193:5051
>> I0329 14:19:39.616286  5870 slave.cpp:191] Flags at startup:
>> --attributes="hostType:shard1" --authenticatee="crammd5"
>> --cgroups_cpu_enable_pids_and_tids_count="false"
>> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
>> --cgroups_limit_swap="false" --cgroups_root="mesos"
>> --container_disk_watch_interval="15secs" --containerizers="docker,mesos"
>> --default_role="*" --disk_watch_interval="1mins"
>> --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true"
>> --docker_remove_delay="6hrs"
>> --docker_sandbox_directory="/mnt/mesos/sandbox"
>> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs"
>> --enforce_container_disk_quota="false"
>> --executor_registration_timeout="5mins"
>> --executor_shutdown_grace_period="5secs"
>> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB"
>> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
>> --hadoop_home="" --help="false" --hostname="71.100.202.193"
>> --initialize_driver_logging="true" --ip="71.100.202.193"
>> --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos"
>> --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO"
>> --master="zk://71.100.202.191:2181/mesos"
>> --oversubscribed_resources_interval="15secs" --perf_duration="10secs"
>> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns"
>> --quiet="false" --recover="reconnect" --recovery_timeout="15mins"
>> --registration_backoff_factor="1secs"
>> --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true"
>> --strict="true" --switch_user="true" --version="false"
>> --work_dir="/tmp/mesos"
>> I0329 14:19:39.616835  5870 slave.cpp:354] Slave resources: cpus(*):4;
>> mem(*):23089; disk(*):122517; ports(*):[31000-32000]
>> I0329 14:19:39.617032  5870 slave.cpp:384] Slave hostname: 71.100.202.193
>> I0329 14:19:39.617046  5870 slave.cpp:389] Slave checkpoint: true
>> I0329 14:19:39.618841  5894 state.cpp:36] Recovering state from
>> '/tmp/mesos/meta'
>> I0329 14:19:39.618872  5894 state.cpp:672] Failed to find resources file
>> '/tmp/mesos/meta/resources/resources.info'
>> I0329 14:19:39.619730  5898 group.cpp:313] Group process (group(1)@
>> 71.100.202.193:5051) connected to ZooKeeper
>> I0329 14:19:39.619760  5898 group.cpp:787] Syncing group operations:
>> queue size (joins, cancels, datas) = (0, 0, 0)
>> I0329 14:19:39.619773  5898 group.cpp:385] Trying to create path '/mesos'
>> in ZooKeeper
>>
>>
>


Re: Agent won't start

2016-03-29 Thread Paul Bell
Whoa...interesting!

The node *may* have been rebooted. Uptime says 2 days. I'll need to check
my notes.

Can you point me to reference re Ubuntu behavior?

Based on what you've told me so far, it sounds as if the sequence:

stop service
reboot agent node
start service


could lead to trouble - or do I misunderstand?


Thank you again for your help.

-Paul

On Tue, Mar 29, 2016 at 5:36 PM, Greg Mann  wrote:

> Paul,
> This would be relevant for any system which is automatically deleting
> files in /tmp. It looks like in Ubuntu, the default behavior is for /tmp to
> be completely nuked at boot time. Was the agent node rebooted prior to this
> problem?
>
> On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell  wrote:
>
>> Hi Greg,
>>
>> Thanks very much for your quick reply.
>>
>> I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not
>> systemd. I will look at the link you provide.
>>
>> Is there any chance that it might apply to non-systemd platforms?
>>
>> Cordially,
>>
>> Paul
>>
>> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann  wrote:
>>
>>> Hi Paul,
>>> Noticing the logging output, "Failed to find resources file
>>> '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble
>>> may be related to the location of your agent's work_dir. See this ticket:
>>> https://issues.apache.org/jira/browse/MESOS-4541
>>>
>>> Some users have reported issues resulting from the systemd-tmpfiles
>>> service garbage collecting files in /tmp, perhaps this is related? What
>>> platform is your agent running on?
>>>
>>> You could try specifying a different agent work directory outside of
>>> /tmp/ via the `--work_dir` command-line flag.
>>>
>>> Cheers,
>>> Greg
>>>
>>>
>>> On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell  wrote:
>>>
>>>> Hi,
>>>>
>>>> I am hoping someone can shed some light on this.
>>>>
>>>> An agent node failed to start, that is, when I did "service mesos-slave
>>>> start" the service came up briefly & then stopped. Before stopping it
>>>> produced the log shown below. The last thing it wrote is "Trying to create
>>>> path '/mesos' in Zookeeper".
>>>>
>>>> This mention of the mesos znode prompted me to go for a clean slate by
>>>> removing the mesos znode from Zookeeper.
>>>>
>>>> After doing this, the mesos-slave service started perfectly.
>>>>
>>>> What might be happening here, and also what's the right way to
>>>> trouble-shoot such a problem? Mesos is version 0.23.0.
>>>>
>>>> Thanks for your help.
>>>>
>>>> -Paul
>>>>
>>>>
>>>> Log file created at: 2016/03/29 14:19:39
>>>> Running on machine: 71.100.202.193
>>>> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
>>>> I0329 14:19:39.512249  5870 logging.cpp:172] INFO level logging started!
>>>> I0329 14:19:39.512564  5870 main.cpp:162] Build: 2015-07-24 10:05:39 by
>>>> root
>>>> I0329 14:19:39.512588  5870 main.cpp:164] Version: 0.23.0
>>>> I0329 14:19:39.512600  5870 main.cpp:167] Git tag: 0.23.0
>>>> I0329 14:19:39.512612  5870 main.cpp:171] Git SHA:
>>>> 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
>>>> I0329 14:19:39.615172  5870 containerizer.cpp:111] Using isolation:
>>>> posix/cpu,posix/mem
>>>> I0329 14:19:39.615697  5870 main.cpp:249] Starting Mesos slave
>>>> I0329 14:19:39.616267  5870 slave.cpp:190] Slave started on 1)@
>>>> 71.100.202.193:5051
>>>> I0329 14:19:39.616286  5870 slave.cpp:191] Flags at startup:
>>>> --attributes="hostType:shard1" --authenticatee="crammd5"
>>>> --cgroups_cpu_enable_pids_and_tids_count="false"
>>>> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
>>>> --cgroups_limit_swap="false" --cgroups_root="mesos"
>>>> --container_disk_watch_interval="15secs" --containerizers="docker,mesos"
>>>> --default_role="*" --disk_watch_interval="1mins"
>>>> --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true"
>>>> --docker_remove_delay="6hrs"
>>>> --docker_sandbox_directory="/mnt/mesos/sandbox"
>>>> --

Re: Agent won't start

2016-03-29 Thread Paul Bell
Hi Pradeep,

And thank you for your reply!

That, too, is very interesting. I think I need to synthesize what you and
Greg are telling me and come up with a clean solution. Agent nodes can
crash. Moreover, I can stop the mesos-slave service, and start it later
with a reboot in between.

So I am interested in fully understanding the causal chain here before I
try to fix anything.

-Paul



On Tue, Mar 29, 2016 at 5:51 PM, Paul Bell  wrote:

> Whoa...interessant!
>
> The node *may* have been rebooted. Uptime says 2 days. I'll need to check
> my notes.
>
> Can you point me to reference re Ubuntu behavior?
>
> Based on what you've told me so far, it sounds as if the sequence:
>
> stop service
> reboot agent node
> start service
>
>
> could lead to trouble - or do I misunderstand?
>
>
> Thank you again for your help.
>
> -Paul
>
> On Tue, Mar 29, 2016 at 5:36 PM, Greg Mann  wrote:
>
>> Paul,
>> This would be relevant for any system which is automatically deleting
>> files in /tmp. It looks like in Ubuntu, the default behavior is for /tmp to
>> be completely nuked at boot time. Was the agent node rebooted prior to this
>> problem?
>>
>> On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell  wrote:
>>
>>> Hi Greg,
>>>
>>> Thanks very much for your quick reply.
>>>
>>> I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not
>>> systemd. I will look at the link you provide.
>>>
>>> Is there any chance that it might apply to non-systemd platforms?
>>>
>>> Cordially,
>>>
>>> Paul
>>>
>>> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann  wrote:
>>>
>>>> Hi Paul,
>>>> Noticing the logging output, "Failed to find resources file
>>>> '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble
>>>> may be related to the location of your agent's work_dir. See this ticket:
>>>> https://issues.apache.org/jira/browse/MESOS-4541
>>>>
>>>> Some users have reported issues resulting from the systemd-tmpfiles
>>>> service garbage collecting files in /tmp, perhaps this is related? What
>>>> platform is your agent running on?
>>>>
>>>> You could try specifying a different agent work directory outside of
>>>> /tmp/ via the `--work_dir` command-line flag.
>>>>
>>>> Cheers,
>>>> Greg
>>>>
>>>>
>>>> On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell  wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am hoping someone can shed some light on this.
>>>>>
>>>>> An agent node failed to start, that is, when I did "service
>>>>> mesos-slave start" the service came up briefly & then stopped. Before
>>>>> stopping it produced the log shown below. The last thing it wrote is
>>>>> "Trying to create path '/mesos' in Zookeeper".
>>>>>
>>>>> This mention of the mesos znode prompted me to go for a clean slate by
>>>>> removing the mesos znode from Zookeeper.
>>>>>
>>>>> After doing this, the mesos-slave service started perfectly.
>>>>>
>>>>> What might be happening here, and also what's the right way to
>>>>> trouble-shoot such a problem? Mesos is version 0.23.0.
>>>>>
>>>>> Thanks for your help.
>>>>>
>>>>> -Paul
>>>>>
>>>>>
>>>>> Log file created at: 2016/03/29 14:19:39
>>>>> Running on machine: 71.100.202.193
>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
>>>>> I0329 14:19:39.512249  5870 logging.cpp:172] INFO level logging
>>>>> started!
>>>>> I0329 14:19:39.512564  5870 main.cpp:162] Build: 2015-07-24 10:05:39
>>>>> by root
>>>>> I0329 14:19:39.512588  5870 main.cpp:164] Version: 0.23.0
>>>>> I0329 14:19:39.512600  5870 main.cpp:167] Git tag: 0.23.0
>>>>> I0329 14:19:39.512612  5870 main.cpp:171] Git SHA:
>>>>> 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
>>>>> I0329 14:19:39.615172  5870 containerizer.cpp:111] Using isolation:
>>>>> posix/cpu,posix/mem
>>>>> I0329 14:19:39.615697  5870 main.cpp:249] Starting Mesos slave
>>>>> I0329 14:19:39.616267  5870 slave.cpp:190] Slave started on 1)@
>>>>> 71.100.202.193:5051
>

Re: Agent won't start

2016-03-30 Thread Paul Bell
Greg, thanks again - I am planning on moving my work_dir.
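
Probably something like this, assuming the stock init script keeps reading
/etc/default/mesos-slave (target path per your suggestion):

  service mesos-slave stop
  echo 'MESOS_WORK_DIR=/var/lib/mesos' >> /etc/default/mesos-slave
  mkdir -p /var/lib/mesos
  service mesos-slave start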



Pradeep, thanks again. In a slightly different scenario, namely,

service mesos-slave stop
edit /etc/default/mesos-slave   (add a port resource)
service mesos-slave start


I noticed that the slave did not start and, again, the log shows the same
phenomena as in my original post. Per your suggestion, I did a

rm -Rf /tmp/mesos

and the slave service started correctly.

Questions:


   1. Did editing /etc/default/mesos-slave cause the failure of the service
   to start?
   2. Given that starting/stopping the entire cluster (stopping all
   services on all nodes) is a standard feature in our product, should I
   routinely run the above "rm" command when the Mesos services are stopped?


Thanks for your help.

Cordially,

Paul

On Tue, Mar 29, 2016 at 6:16 PM, Greg Mann  wrote:

> Check out this link for info on /tmp cleanup in Ubuntu:
> http://askubuntu.com/questions/20783/how-is-the-tmp-directory-cleaned-up
>
> And check out this link for information on some of the work_dir's contents
> on a Mesos agent: http://mesos.apache.org/documentation/latest/sandbox/
>
> The work_dir contains important application state for the Mesos agent, so
> it should not be placed in a location that will be automatically
> garbage-collected by the OS. The choice of /tmp/mesos as a default location
> is a bit unfortunate, and hopefully we can resolve that JIRA issue soon to
> change it. Ideally you should be able to leave the work_dir alone and let
> the Mesos agent manage it for you.
>
> In any case, I would recommend that you set the work_dir to something
> outside of /tmp; /var/lib/mesos is a commonly-used location.
>
> Cheers,
> Greg
>


Backup a Mesos Cluster

2016-04-11 Thread Paul Bell
Hi All,

As we get closer to shipping a Mesos-based version of our product, we've
turned our attention to "protecting" (supporting backup & recovery for) not
only our application databases, but the cluster as well.

I'm not quite sure how to begin thinking about this, but I suppose the
usual dimensions of B/R would come into play, e.g., hot/cold, application
consistent/crash consistent, etc.

Has anyone grappled with this issue and, if so, would you be so kind as to
share your experience and solutions?

Thank you.

-Paul


Re: Backup a Mesos Cluster

2016-04-11 Thread Paul Bell
Piotr,

Thank you for this link. I am looking at it now, where I right away notice
that Exhibitor is designed to monitor (and back up) Zookeeper (but not
anything related to Mesos itself). Don't the Mesos master & agent nodes
keep at least some state outside of the ZK znodes, e.g., under the default
workdir?
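
(To make the question concrete: the crude, cold-backup approach I have in mind
is to stop the services and tar up the directories I believe hold that extra
state - a sketch only, with default locations assumed, so please correct me if
I'm missing anything important:)

    # on each master/agent node, with the zookeeper and mesos services stopped
    BACKUP=/backup/mesos-$(hostname)-$(date +%Y%m%d)
    mkdir -p "$BACKUP"
    tar czf "$BACKUP/zk-data.tar.gz"    /var/lib/zookeeper/version-2   # ZK snapshots and txn logs
    tar czf "$BACKUP/mesos-work.tar.gz" /var/lib/mesos                 # replicated log, agent metadata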

Shua,

Thank you for this observation. Happily (I think), we do not have a custom
framework. Presently, Marathon is the only framework that we use.

-Paul

On Mon, Apr 11, 2016 at 8:12 AM, Shuai Lin  wrote:

> If your product contains a custom framework, at least you should
> implement some kind of high availability for your scheduler (like
> marathon/chronos does), or let it be launched by marathon so it can be
> restarted when it fails.
>
> On Mon, Apr 11, 2016 at 7:27 PM, Paul Bell  wrote:
>
>> Hi All,
>>
>> As we get closer to shipping a Mesos-based version of our product, we've
>> turned our attention to "protecting" (supporting backup & recovery) of not
>> only our application databases, but the cluster as well.
>>
>> I'm not quite sure how to begin thinking about this, but I suppose the
>> usual dimensions of B/R would come into play, e.g., hot/cold, application
>> consistent/crash consistent, etc.
>>
>> Has anyone grappled with this issue and, if so, would you be so kind as
>> to share your experience and solutions?
>>
>> Thank you.
>>
>> -Paul
>>
>>
>


Re: Mesos 0.28 SSL in official packages

2016-04-12 Thread Paul Bell
FWIW, I quite agree with Zameer's point.

That said, I want to make abundantly clear that in my experience the folks
at Mesosphere are wonderfully helpful.

But what happens if down the road Mesosphere is acquired or there occurs
some other event that could represent, if not a conflict of interest, then
simply a different strategic direction?

My 2 cents.

-Paul

On Mon, Apr 11, 2016 at 5:19 PM, Zameer Manji  wrote:

> I have suggested this before and I will suggest it again here.
>
> I think the Apache Mesos project should build and distribute packages
> instead of relying on the generosity of a commercial vendor. The Apache
> Aurora project does this already with good success. As a user of Apache
> Mesos I don't care about Mesosphere Inc and I feel uncomfortable that the
> project is so dependent on its employees.
>
> Doing this would allow users to contribute packaging fixes directly to the
> project, such as enabling SSL.
>
> On Mon, Apr 11, 2016 at 3:02 AM, Adam Bordelon  wrote:
>
>> Hi Kamil,
>>
>> Technically, there are no "official" Apache-built packages for Apache
>> Mesos.
>>
>> At least one company (Mesosphere) chooses to build and distribute
>> Mesos packages, but does not currently offer SSL builds. It wouldn't
>> be hard to add an SSL build to our regular builds, but it hasn't been
>> requested enough to prioritize it.
>>
>> cc: Joris, Kapil
>>
>> On Thu, Apr 7, 2016 at 7:42 AM, haosdent  wrote:
>> > Hi, SSL isn't enabled by default. You need to compile it by following this doc
>> > http://mesos.apache.org/documentation/latest/ssl/
>> >
>> > On Thu, Apr 7, 2016 at 10:04 PM, Kamil Wokitajtis > >
>> > wrote:
>> >>
>> >> This is my first post, so Hi everyone!
>> >>
>> >> Is SSL enabled in official packages (CentOS in my case)?
>> >> I can see libssl in ldd output, but I cannot see libevent.
>> >> I had to compile mesos from sources to run it over ssl.
>> >> I would prefer to install it from packages.
>> >>
>> >> Regards,
>> >> Kamil
>> >
>> >
>> >
>> >
>> > --
>> > Best Regards,
>> > Haosdent Huang
>>
>> --
>> Zameer Manji
>>
>>


Re: Mesos Master and Slave on same server?

2016-04-13 Thread Paul Bell
Hi June,

In addition to doing what Pradeep suggests, I also now & then run a single
node "cluster" that houses mesos-master, mesos-slave, and Marathon.

Works fine.
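
Roughly, that single box is just the Mesosphere packages with everything
pointed at the local ZooKeeper - a sketch (the /etc/mesos/zk convention and
the service names are package defaults, so adjust for your install):

    echo zk://127.0.0.1:2181/mesos | sudo tee /etc/mesos/zk
    sudo service zookeeper start
    sudo service mesos-master start
    sudo service mesos-slave start
    sudo service marathon start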

Cordially,

Paul

On Wed, Apr 13, 2016 at 12:36 PM, Pradeep Chhetri <
pradeep.chhetr...@gmail.com> wrote:

> I would suggest you to run mesos-master and zookeeper and marathon on same
> set of hosts (maybe call them as coordinator nodes) and use completely
> different set of nodes for mesos slaves. This way you can do the
> maintenance of such hosts in a very planned fashion.
>
> On Wed, Apr 13, 2016 at 4:22 PM, Stefano Bianchi 
> wrote:
>
>> For sure it is possible.
>> Simply put, Mesos-master will see the resources offered by the machine on
>> which mesos-slave is also running, transparently.
>>
>> 2016-04-13 16:34 GMT+02:00 June Taylor :
>>
>>> All of our node servers are identical hardware. Is it reasonable for me
>>> to install the Mesos-Master and Mesos-Slave on the same physical hardware?
>>>
>>> Thanks,
>>> June Taylor
>>> System Administrator, Minnesota Population Center
>>> University of Minnesota
>>>
>>
>>
>
>
> --
> Regards,
> Pradeep Chhetri
>


Status of Mesos-3821

2016-04-19 Thread Paul Bell
Hi,

I think I encountered the problem described by
https://issues.apache.org/jira/browse/MESOS-3821 and wanted to ask if this
fix is in Mesos 0.28.

But perhaps I misunderstand what's being said; so by way of background our
case is Mesos on CentOS 7.2. When we try to set --docker_socket to
tcp://: the mesos-slave service refuses to start. It seems to
require a Unix socket.

There is one comment in the ticket that expresses the hope of being able to
use URLs of the tcp:// form.

Am I misunderstanding this fix and if not, what release of Mesos
incorporates it?

Thanks for your help.

-Paul


Consequences of health-check timeouts?

2016-05-17 Thread Paul Bell
Hi All,

I probably have the following account partly wrong, but let me present it
just the same and those who know better can correct me as needed.

I've an application that runs several MongoDB shards, each a Dockerized
container, each on a distinct node (VM); in fact, some of the VMs are on
separate ESXi hosts.

I've lately seen situations where, because of very slow disks for the
database, the following sequence occurs (I think):

   1. Linux (Ubuntu 14.04 LTS) virtual memory manager hits thresholds
   defined by vm.dirty_background_ratio and/or vm.dirty_ratio (probably both)
   2. Synchronous flushing of many, many pages occurs, writing to a slow
   disk
   3. (Around this time one might see in /var/log/syslog "task X blocked
   for more than 120 seconds" for all kinds of tasks, including mesos-master)
   4. mesos-slaves get shut down (this is the part I'm unclear about, but I
   am quite certain that on 2 nodes the executors and their in-flight MongoDB
   tasks got zapped because I can see that Marathon restarted them).

The consequences of this are a corrupt MongoDB database. In the case at
hand, the job had run for over 50 hours, processing close to 120 million
files.

Steps I've taken so far to remedy include:

   - tune vm.dirty_background_ratio and vm.dirty_ratio down, respectively,
   to 5 and 10 (from 10 and 20). The intent here is to tolerate more frequent,
   smaller flushes and thus avoid less frequent massive flushes that suspend
   threads for very long periods.
   - increase agent ping timeout to 10 minutes (every 30 seconds, 20 times);
   a sketch of both changes follows below
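
A sketch of both, assuming the one-file-per-flag layout the Mesosphere
packages use under /etc/mesos-master (the values are simply the ones I
settled on, not recommendations):

    # smaller, more frequent dirty-page flushes
    printf 'vm.dirty_background_ratio = 5\nvm.dirty_ratio = 10\n' | sudo tee /etc/sysctl.d/60-dirty-pages.conf
    sudo sysctl --system

    # tolerate slow agents longer: 30 seconds x 20 pings = 10 minutes
    echo 30secs | sudo tee /etc/mesos-master/slave_ping_timeout
    echo 20     | sudo tee /etc/mesos-master/max_slave_ping_timeouts
    sudo service mesos-master restart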

So the questions are:

   - Is there some way to be given control (a callback, or an "exit"
   routine) so that the container about to be nuked can be given a chance to
   exit gracefully?
   - Are there other steps I can take to avoid this mildly calamitous
   occurrence?
   - (Also, I'd be grateful for more clarity on anything in steps 1-4 above
   that is a bit hand-wavy!)

As always, thanks.

-Paul


Re: Consequences of health-check timeouts?

2016-05-18 Thread Paul Bell
Hi haosdent,

Thanks for your reply.

In re executor_shutdown_grace_period: how would this enable the task
(MongoDB) to terminate gracefully? (BTW: I am fairly certain that the mongo
STDOUT as captured by Mesos shows that it received signal 15 just before it
said good-bye). My naive understanding of this grace period is that it
simply delays the termination of the executor.
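
(For reference, if it does turn out to matter, bumping the grace period looks
like the sketch below; the flag-file path assumes the Mesosphere packaging,
and the idea would simply be to give mongod more time between the SIGTERM it
already receives and the final kill:)

    echo 120secs | sudo tee /etc/mesos-slave/executor_shutdown_grace_period
    sudo service mesos-slave restart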

The following snippet is from /var/log/syslog. I believe it shows the stack
trace (largely in the kernel) that led to mesos-master being blocked for
more than 120 seconds. Please note that immediately above (before) the
blocked mesos-master is a blocked jbd2/dm. Immediately below (after) the
blocked mesos-master is a blocked java task. I'm not sure what the java
task is. This took place on the mesos-master node and none of our
applications runs there. It runs master, Marathon, and ZK. Maybe the java
task is Marathon or ZK?

Thanks again.

-Paul

May 16 20:06:53 71 kernel: [193339.890848] INFO: task mesos-master:4013
blocked for more than 120 seconds.

May 16 20:06:53 71 kernel: [193339.890873]   Not tainted
3.13.0-32-generic #57-Ubuntu

May 16 20:06:53 71 kernel: [193339.890889] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.

May 16 20:06:53 71 kernel: [193339.890912] mesos-masterD
88013fd94440 0  4013  1 0x

May 16 20:06:53 71 kernel: [193339.890914]  880137429a28
0002 880135778000 880137429fd8

May 16 20:06:53 71 kernel: [193339.890916]  00014440
00014440 880135778000 88013fd94cd8

May 16 20:06:53 71 kernel: [193339.890918]  88013ffd34b0
0002 81284630 880137429aa0

May 16 20:06:53 71 kernel: [193339.890919] Call Trace:

May 16 20:06:53 71 kernel: [193339.890922]  [] ?
start_this_handle+0x590/0x590

May 16 20:06:53 71 kernel: [193339.890924]  []
io_schedule+0x9d/0x140

May 16 20:06:53 71 kernel: [193339.890925]  []
sleep_on_shadow_bh+0xe/0x20

May 16 20:06:53 71 kernel: [193339.890927]  []
__wait_on_bit+0x62/0x90

May 16 20:06:53 71 kernel: [193339.890929]  [] ?
start_this_handle+0x590/0x590

May 16 20:06:53 71 kernel: [193339.890930]  []
out_of_line_wait_on_bit+0x77/0x90

May 16 20:06:53 71 kernel: [193339.890932]  [] ?
autoremove_wake_function+0x40/0x40

May 16 20:06:53 71 kernel: [193339.890934]  [] ?
wake_up_bit+0x25/0x30

May 16 20:06:53 71 kernel: [193339.890936]  []
do_get_write_access+0x2ad/0x4f0

May 16 20:06:53 71 kernel: [193339.890938]  [] ?
__getblk+0x2d/0x2e0

May 16 20:06:53 71 kernel: [193339.890939]  []
jbd2_journal_get_write_access+0x27/0x40

May 16 20:06:53 71 kernel: [193339.890942]  []
__ext4_journal_get_write_access+0x3b/0x80

May 16 20:06:53 71 kernel: [193339.890946]  []
ext4_reserve_inode_write+0x70/0xa0

May 16 20:06:53 71 kernel: [193339.890948]  [] ?
ext4_dirty_inode+0x40/0x60

May 16 20:06:53 71 kernel: [193339.890949]  []
ext4_mark_inode_dirty+0x44/0x1f0

May 16 20:06:53 71 kernel: [193339.890951]  []
ext4_dirty_inode+0x40/0x60

May 16 20:06:53 71 kernel: [193339.890953]  []
__mark_inode_dirty+0x10a/0x2d0

May 16 20:06:53 71 kernel: [193339.890956]  []
update_time+0x81/0xd0

May 16 20:06:53 71 kernel: [193339.890957]  []
file_update_time+0x80/0xd0

May 16 20:06:53 71 kernel: [193339.890961]  []
__generic_file_aio_write+0x180/0x3d0

May 16 20:06:53 71 kernel: [193339.890963]  []
generic_file_aio_write+0x58/0xa0

May 16 20:06:53 71 kernel: [193339.890965]  []
ext4_file_write+0x99/0x400

May 16 20:06:53 71 kernel: [193339.890967]  [] ?
wake_up_state+0x10/0x20

May 16 20:06:53 71 kernel: [193339.890970]  [] ?
wake_futex+0x66/0x90

May 16 20:06:53 71 kernel: [193339.890972]  [] ?
futex_wake+0x1b1/0x1d0

May 16 20:06:53 71 kernel: [193339.890974]  []
do_sync_write+0x5a/0x90

May 16 20:06:53 71 kernel: [193339.890976]  []
vfs_write+0xb4/0x1f0

May 16 20:06:53 71 kernel: [193339.890978]  []
SyS_write+0x49/0xa0

May 16 20:06:53 71 kernel: [193339.890980]  []
tracesys+0xe1/0xe6



On Wed, May 18, 2016 at 2:33 AM, haosdent  wrote:

> >Is there some way to be given control (a callback, or an "exit" routine)
> so that the container about to be nuked can be given a chance to exit
> gracefully?
> The default value of executor_shutdown_grace_period is 5 seconds, you
> could change it by specify the `--executor_shutdown_grace_period` flag when
> launch mesos agent.
>
> >Are there other steps I can take to avoid this mildly calamitous
> occurrence?
> >mesos-slaves get shutdown
> Do you know where your mesos-master stuck when it happens? Any error log
> or related log about this? In addition, is there any log when mesos-slave
> shut down?
>
> On Wed, May 18, 2016 at 6:12 AM, Paul Bell  wrote:
>
>> Hi All,
>>
>> I probably have the following account partly wrong, but let me present it
>> just the same and those who know better can correct me as needed.
>>
>> I've an applic

Mesos loses track of Docker containers

2016-08-10 Thread Paul Bell
Hello,

One of our customers has twice encountered a problem wherein Mesos &
Marathon appear to lose track of the application containers that they
started.

Platform & version info:

Ubuntu 14.04 (running under VMware)
Mesos (master & agent): 0.23.0
ZK: 3.4.5--1
Marathon: 0.10.0

The phenomena:

When I log into either the Mesos or Marathon UIs I see no evidence of *any*
tasks, active or completed. Yet, in the Linux shell, a "docker ps" command
shows the containers up & running.

I've seen some confusing appearances before, but never this. For example,
I've seen what might be described as the *reverse* of the above phenomena.
I mean the case where a customer powers cycles the VM. In such a case you
typically see in Marathon's UI the (mere) appearance of the containers up &
running, but a "docker ps" command shows no containers running. As folks on
this list have explained to me, this is the result of "stale state" and
after 10 minutes (by default), Mesos figures out that the supposedly active
tasks aren't there and restarts them.

But that's not the case here. I am hard-pressed to understand what
conditions/causes might lead to Mesos & Marathon becoming unaware of
containers that they started.
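
For what it's worth, a quick way to capture both views the next time it
happens (a sketch; the /state.json endpoint and the "mesos-" container-name
prefix are what I believe this 0.23 setup uses, so treat both as assumptions):

    # on the agent host
    curl -s http://localhost:5051/state.json > agent-state.json    # what the agent thinks is running
    docker ps --no-trunc | grep mesos-                              # what Docker says is running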

I would be very grateful if someone could help me understand what's going
on here (so would our customer!).

Thanks.

-Paul


Re: Mesos loses track of Docker containers

2016-08-10 Thread Paul Bell
Hi Jeff,

Thanks for your reply.

Yeah, that thought occurred to me late last night. But the customer is
sensitive to too much churn, so it wouldn't be my first choice. If I knew
with certainty that such a problem existed in the versions they are running
AND that more recent versions fixed it, then I'd do my best to compel the
upgrade.

Docker version is also old, 1.6.2.

-Paul

On Wed, Aug 10, 2016 at 9:18 AM, Jeff Schroeder 
wrote:

> Have you considered upgrading Mesos and Marathon? Those are quite old
> versions of both with some fairly glaring problems with the docker
> containerizer if memory serves. Also what version of docker?
>
>
> On Wednesday, August 10, 2016, Paul Bell  wrote:
>
>> Hello,
>>
>> One of our customers has twice encountered a problem wherein Mesos &
>> Marathon appear to lose track of the application containers that they
>> started.
>>
>> Platform & version info:
>>
>> Ubuntu 14.04 (running under VMware)
>> Mesos (master & agent): 0.23.0
>> ZK: 3.4.5--1
>> Marathon: 0.10.0
>>
>> The phenomena:
>>
>> When I log into either the Mesos or Marathon UIs I see no evidence of
>> *any* tasks, active or completed. Yet, in the Linux shell, a "docker ps"
>> command shows the containers up & running.
>>
>> I've seen some confusing appearances before, but never this. For example,
>> I've seen what might be described as the *reverse* of the above
>> phenomena. I mean the case where a customer powers cycles the VM. In such a
>> case you typically see in Marathon's UI the (mere) appearance of the
>> containers up & running, but a "docker ps" command shows no containers
>> running. As folks on this list have explained to me, this is the result of
>> "stale state" and after 10 minutes (by default), Mesos figures out that the
>> supposedly active tasks aren't there and restarts them.
>>
>> But that's not the case here. I am hard-pressed to understand what
>> conditions/causes might lead to Mesos & Marathon becoming unaware of
>> containers that they started.
>>
>> I would be very grateful if someone could help me understand what's going
>> on here (so would our customer!).
>>
>> Thanks.
>>
>> -Paul
>>
>>
>>
>
> --
> Text by Jeff, typos by iPhone
>


unsubscribe

2017-01-16 Thread Paul Bell



hadoop task-trackers sticking around

2013-09-26 Thread Paul Mackles
Hi - I am using mesos 0.13 with cdh4.2.0 in pseudo-distributed mode. While
I am able to launch and run hadoop jobs through mesos successfully, I
noticed in the Mesos UI (and through 'ps') that the task-trackers launched
by mesos are sticking around long after my job is complete. Is that
expected behavior? I am thinking the answer is no since they are tying up
resources that could be used by other frameworks. On the other hand, mesos
seems to know enough to reuse them when running subsequent hadoop jobs.
Maybe they are using reservations or something by default?

-- 
Thanks,
Paul


Re: hadoop task-trackers sticking around

2013-09-26 Thread Paul Mackles
I will dig a little further as the behavior is inconsistent. On subsequent
attempts I have seen the task-trackers go away with the job. They always go
away when I shutdown the corresponding job-tracker.

The hadoop code I am using was included in the 0.13 tarball that I
downloaded from here:

http://mirror.nexcess.net/apache/mesos/0.13.0/

I built the jar by running hadoop/TUTORIAL.sh. I wound up integrating with
hadoop manually since the tutorial script didn't work correctly for me. I
mostly followed the instructions here: https://github.com/mesos/hadoop

At one point I tried building it from https://github.com/mesos/hadoop but I
had trouble getting it to build with 0.13.

Should I be working off of a different version?

Thanks,
Paul



On Thu, Sep 26, 2013 at 10:34 PM, Dan Colish wrote:

>
>
>
> On Thu, Sep 26, 2013 at 6:27 PM, Paul Mackles  wrote:
>
>> Hi - I am using mesos 0.13 with cdh4.2.0 in pseudo-distributed mode.
>> While I am able to launch and run hadoop jobs through mesos successfully, I
>> noticed in the Mesos UI (and through 'ps') that the task-trackers launched
>> by mesos are sticking around long after my job is complete. Is that
>> expected behavior? I am thinking the answer is no since they are tying up
>> resources that could be used by other frameworks. On the other hand, mesos
>> seems to know enough to reuse them when running subsequent hadoop jobs.
>> Maybe there are using reservations or something by default?
>>
>>
> Are you using the mesos-hadoop project found here,
> https://github.com/mesos/hadoop? If so, you are correct that idle
> tasktrackers should be torn down at the end. I wonder what the cluster
> state is when the JobInProgressListener is called upon your job's
> completion. Specifically, I would look into tracing this section [1] of code,
> where the task tracker's job queue is checked for emptiness and the tracker
> is checked for being active. If the tracker was never activated, I think it
> would also be running but not killed.
>
>
> [1]
> https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/MesosScheduler.java#L105
>
>


-- 
Thanks,
Paul


Re: hadoop task-trackers sticking around

2013-09-27 Thread Paul Mackles
I see the following messages in the job-tracker logs which probably explain
why the task-trackers are sticking around:

2013-09-27 03:39:58,808 WARN org.apache.hadoop.mapred.MesosScheduler:
Ignoring TaskTracker: http://vm282.dev.xxx:31001 because it might not have
sent a hearbeat
2013-09-27 03:39:58,808 WARN org.apache.hadoop.mapred.MesosScheduler:
Ignoring TaskTracker: http://vm282.dev.xxx:31000 because it might not have
sent a hearbeat
2013-09-27 03:39:58,809 WARN org.apache.hadoop.mapred.MesosScheduler:
Ignoring TaskTracker: http://vm282.dev.xxx:31001 because it might not have
sent a hearbeat
2013-09-27 03:39:58,809 WARN org.apache.hadoop.mapred.MesosScheduler:
Ignoring TaskTracker: http://vm282.dev.xxx:31000 because it might not have
sent a hearbeat

The source for MesosScheduler.java that is bundled with 0.13 looks quite a
bit different than the version that is currently on git.



On Thu, Sep 26, 2013 at 11:08 PM, Paul Mackles  wrote:

> I will dig a little further as the behavior is inconsistent. On subsequent
> attempts I have seen the task-trackers go away with the job. They always go
> away when I shutdown the corresponding job-tracker.
>
> The hadoop code I am using was included in the 0.13 tarball that I
> downloaded from here:
>
> http://mirror.nexcess.net/apache/mesos/0.13.0/
>
> I built the jar by running hadoop/TUTORIAL.sh. I wound up integrating with
> hadoop manually since the tutorial script didn't work correctly for me. I
> mostly followed the instructions here: https://github.com/mesos/hadoop
>
> At one point I tried building it from https://github.com/mesos/hadoop but
> I had trouble getting it to build with 0.13.
>
> Should I be working off of a different version?
>
> Thanks,
> Paul
>
>
>
> On Thu, Sep 26, 2013 at 10:34 PM, Dan Colish wrote:
>
>>
>>
>>
>> On Thu, Sep 26, 2013 at 6:27 PM, Paul Mackles  wrote:
>>
>>> Hi - I am using mesos 0.13 with cdh4.2.0 in pseudo-distributed mode.
>>> While I am able to launch and run hadoop jobs through mesos successfully, I
>>> noticed in the Mesos UI (and through 'ps') that the task-trackers launched
>>> by mesos are sticking around long after my job is complete. Is that
>>> expected behavior? I am thinking the answer is no since they are tying up
>>> resources that could be used by other frameworks. On the other hand, mesos
>>> seems to know enough to reuse them when running subsequent hadoop jobs.
>>> Maybe there are using reservations or something by default?
>>>
>>>
>> Are you using the mesos-hadoop project found here,
>> https://github.com/mesos/hadoop? If so, you are correct that idle
>> tasktrackers should be torn down at the end. I wonder what the cluster
>> state when the JobInProgressListener is called with upon your job's
>> completion. Specifically, I would look into tracing this section [1] of code
>> * *where the task trackers job queue is checked for emptiness the
>> tracker is checked for being active. If the tracker was never activated I
>> think it would also be running but not killed.
>>
>>
>> [1]
>> https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/MesosScheduler.java#L105
>>
>>
>
>
> --
> Thanks,
> Paul
>



-- 
Thanks,
Paul


mesos/hadoop

2013-09-27 Thread Paul Mackles
Hi - Going by what is currently in the apache git repository, what is the
"recommended" combination of mesos and mesos-hadoop to use?

I originally tried the 0.13.0 tarball which appears to include some version
of mesos-hadoop. While I was able to build it and run some jobs, the task
tracker processes launched by mesos keep on running even after my job
completes. Based on the job-tracker logs, the task-trackers are not
sending the job-tracker any heartbeats so the job-tracker doesn't try to
stop them (even though it reuses them on subsequent jobs).

Based on my interpretation of
https://issues.apache.org/jira/browse/MESOS-618, it sounds like I should be
using the mesos/hadoop project on github. I managed to get that built from
trunk but I am not having nearly as much luck getting mesos built from the
trunk.

So what combination of mesos and mesos-hadoop is working for folks?

-- 
Thanks,
Paul


Re: mesos/hadoop

2013-09-27 Thread Paul Mackles
Hi Benjamin - Thanks for the quick response. I am keeping notes and will
share with the list after I have had a chance to clean them up.

Will try 0.14.0-rc4.

Thanks again,
Paul


On Fri, Sep 27, 2013 at 7:57 PM, Benjamin Hindman <
benjamin.hind...@gmail.com> wrote:

> Hey Paul,
>
> Feel free to share your build issues, we'd love to help.
>
> In the mean time, I suggest trying the git tag 0.14.0-rc4 with the
> mesos/hadoop project on Github.
>
> Ben.
>
>
>
>
> On Fri, Sep 27, 2013 at 4:55 PM, Paul Mackles  wrote:
>
>> Hi - Going by what is currently in the apache git repository, what is the
>> "recommended" combination of mesos and mesos-hadoop to use?
>>
>> I originally tried the 0.13.0 tarball which appears to include some
>> version of mesos-hadoop. While I was able to build it and run some jobs,
>> the task tracker processes launched by mesos keep on running even after my
>> job completes. Based on the job-tracker logs, there task-trackers are not
>> sending the job-tracker any heartbeats so the job-tracker doesn't try to
>> stop them (even though it reuses them on subsequent jobs).
>>
>> Based on my interpretation of
>> https://issues.apache.org/jira/browse/MESOS-618, it sounds like I should
>> be using the mesos/hadoop project on github. I managed to get that built
>> from trunk but I am not having nearly as much luck getting mesos built from
>> the trunk.
>>
>> So what combination of mesos and mesos-hadoop is working for folks?
>>
>> --
>> Thanks,
>> Paul
>>
>
>


-- 
Thanks,
Paul


Re: hadoop task-trackers sticking around

2013-10-03 Thread Paul Mackles
Hi - I went back and started from scratch with mesos-0.14.0-rc4 and
mesos-hadoop from the trunk of https://github.com/mesos/hadoop. While the
whole setup was definitely a lot smoother, the tasktrackers are still
sticking around. I traced through the code and it's definitely due to this
check in JobInProgressListener.jobUpdated() from MesosScheduler.java:

if (mesosTracker.jobs.isEmpty() && mesosTracker.active)

Specifically, the tracker processes never seem to enter the "active" state.
If I remove the check for the active flag, the TaskTrackers shut down as
expected when the job completes.

How does the TaskTracker get activated?

Thanks,
Paul


On Fri, Sep 27, 2013 at 6:58 AM, Paul Mackles  wrote:

> I see the following messages in the job-tracker logs which probably
> explain why the task-trackers are sticking around:
>
> 2013-09-27 03:39:58,808 WARN org.apache.hadoop.mapred.MesosScheduler:
> Ignoring TaskTracker: http://vm282.dev.xxx:31001 because it might not
> have sent a hearbeat
> 2013-09-27 03:39:58,808 WARN org.apache.hadoop.mapred.MesosScheduler:
> Ignoring TaskTracker: http://vm282.dev.xxx:31000 because it might not
> have sent a hearbeat
> 2013-09-27 03:39:58,809 WARN org.apache.hadoop.mapred.MesosScheduler:
> Ignoring TaskTracker: http://vm282.dev.xxx:31001 because it might not
> have sent a hearbeat
> 2013-09-27 03:39:58,809 WARN org.apache.hadoop.mapred.MesosScheduler:
> Ignoring TaskTracker: http://vm282.dev.xxx:31000 because it might not
> have sent a hearbeat
>
> The source for MesosScheduler.java that is bundled with 0.13 looks quite a
> bit different than the version that is currently on git.
>
>
>
> On Thu, Sep 26, 2013 at 11:08 PM, Paul Mackles  wrote:
>
>> I will dig a little further as the behavior is inconsistent. On
>> subsequent attempts I have seen the task-trackers go away with the job.
>> They always go away when I shutdown the corresponding job-tracker.
>>
>> The hadoop code I am using was included in the 0.13 tarball that I
>> downloaded from here:
>>
>> http://mirror.nexcess.net/apache/mesos/0.13.0/
>>
>> I built the jar by running hadoop/TUTORIAL.sh. I wound up integrating
>> with hadoop manually since the tutorial script didn't work correctly for
>> me. I mostly followed the instructions here:
>> https://github.com/mesos/hadoop
>>
>> At one point I tried building it from https://github.com/mesos/hadoop but
>> I had trouble getting it to build with 0.13.
>>
>> Should I be working off of a different version?
>>
>> Thanks,
>> Paul
>>
>>
>>
>> On Thu, Sep 26, 2013 at 10:34 PM, Dan Colish wrote:
>>
>>>
>>>
>>>
>>> On Thu, Sep 26, 2013 at 6:27 PM, Paul Mackles  wrote:
>>>
>>>> Hi - I am using mesos 0.13 with cdh4.2.0 in pseudo-distributed mode.
>>>> While I am able to launch and run hadoop jobs through mesos successfully, I
>>>> noticed in the Mesos UI (and through 'ps') that the task-trackers launched
>>>> by mesos are sticking around long after my job is complete. Is that
>>>> expected behavior? I am thinking the answer is no since they are tying up
>>>> resources that could be used by other frameworks. On the other hand, mesos
>>>> seems to know enough to reuse them when running subsequent hadoop jobs.
>>>> Maybe there are using reservations or something by default?
>>>>
>>>>
>>> Are you using the mesos-hadoop project found here,
>>> https://github.com/mesos/hadoop? If so, you are correct that idle
>>> tasktrackers should be torn down at the end. I wonder what the cluster
>>> state when the JobInProgressListener is called with upon your job's
>>> completion. Specifically, I would look into tracing this section [1] of code
>>> * *where the task trackers job queue is checked for emptiness the
>>> tracker is checked for being active. If the tracker was never activated I
>>> think it would also be running but not killed.
>>>
>>>
>>> [1]
>>> https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/MesosScheduler.java#L105
>>>
>>>
>>
>>
>> --
>> Thanks,
>> Paul
>>
>
>
>
> --
> Thanks,
> Paul
>



-- 
Thanks,
Paul


resource revocation and long-running task

2013-10-10 Thread Paul Mackles
Hi - I was re-reading the mesos technical paper. Particularly sections
3.3.1 and 4.3. I am currently running mesos-0.14.0.rc4 and I was wondering
how much of what is discussed in those sections is actually implemented?
Specifically, I don't see any way to allocate slots for long-running vs.
short-running tasks. I also haven't seen any configuration related to
resource revocation. Am I missing something?

-- 
Thanks,
Paul


process isolation

2013-10-21 Thread Paul Mackles
Hi - I just wanted to confirm my understanding of something... with process
isolation, Mesos will not do anything if a given executor exceeds its
resource allocation. In other words, if I accept a resource with 1GB of
memory and then my executor uses 3GB, Mesos won't detect that the process
exceeded its allocation and kill the process. For that, you need to enable
cgroups at which point allocation limits are enforced by the OS. Did I get
that right?

-- 
Thanks,
Paul


Re: Open call to be incl on Mesos support and services list

2014-07-25 Thread Paul Otto
Hi Dave,

I would be interested in having Otto Ops LLC be added to that list. We have
been building a Mesos + Marathon + Docker infrastructure for Time Warner
Cable, and would be very interested in doing more with the community.

Regards,
Paul Otto

-- 
Paul Otto
Principal DevOps Engineer, Owner
Otto Ops LLC | *OttoOps.com <http://ottoops.com/>*
970.343.4561 office
720.381.2383 cell

On Thu, Jul 24, 2014 at 4:07 PM, Dave Lester  wrote:

> Hi All,
>
> I wanted to revisit a previous thread
> <http://markmail.org/message/o3nlnihmwqtgsm7d> where I suggested that we
> add a section to the Mesos website to list companies that provide Mesos
> services and development. At that time, we heard interest from:
>
> * Grand Logic
> * Mesosphere
> * Big Data Open Source Security LLC
>
> I've created a JIRA ticket (MESOS-1638
> <https://issues.apache.org/jira/browse/MESOS-1638>) to track this; feel
> free to comment there or use this thread for discussion should there be any
> questions or comments.
>
> Best,
> Dave
>



-- 
Paul Otto
Principal DevOps Engineer
Otto Ops LLC | *OttoOps.com <http://OttoOps.com>*
970.343.4561 office
720.381.2383 cell


Re: [RESULT][VOTE] Release Apache Mesos 0.22.0 (rc4)

2015-03-24 Thread Paul Otto
This is awesome! Thanks for all the hard work you all have put into this! I
am really excited to update to the latest stable version of Apache Mesos!

Regards,
Paul


Paul Otto
Principal DevOps Architect, Co-founder
Otto Ops LLC | *OttoOps.com <http://OttoOps.com>*
970.343.4561 office
720.381.2383 cell

On Tue, Mar 24, 2015 at 6:04 PM, Niklas Nielsen 
wrote:

> Hi all,
>
> The vote for Mesos 0.22.0 (rc4) has passed with the
> following votes.
>
> +1 (Binding)
> --
> Ben Mahler
> Tim St Clair
> Adam Bordelon
> Brenden Matthews
>
> +1 (Non-binding)
> --
> Alex Rukletsov
> Craig W
> Ben Whitehead
> Elizabeth Lingg
> Dario Rexin
> Jeff Schroeder
> Michael Park
> Alexander Rojas
> Andrew Langhorn
>
> There were no 0 or -1 votes.
>
> Please find the release at:
> https://dist.apache.org/repos/dist/release/mesos/0.22.0
>
> It is recommended to use a mirror to download the release:
> http://www.apache.org/dyn/closer.cgi
>
> The CHANGELOG for the release is available at:
>
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.22.0
>
> The mesos-0.22.0.jar has been released to:
> https://repository.apache.org
>
> The website (http://mesos.apache.org) will be updated shortly to reflect
> this release.
>
> Thanks,
> Niklas
>


Denver Mesos User Group

2015-03-26 Thread Paul Otto
Hi all,

I am excited to announce that the Denver Mesos User Group has been created!
We will be organizing our first meeting shortly!
http://www.meetup.com/Denver-Mesos-User-Group

Regards,
Paul

Paul Otto
Principal DevOps Architect, Co-founder
Otto Ops LLC | *OttoOps.com <http://OttoOps.com>*
970.343.4561 office
720.381.2383 cell


Re: Denver Mesos User Group

2015-03-26 Thread Paul Otto
Glad to do it! Thanks for getting it added to the Apache page so quickly!

- Paul


Paul Otto
Principal DevOps Architect, Co-founder
Otto Ops LLC | *OttoOps.com <http://OttoOps.com>*
970.343.4561 office
720.381.2383 cell

On Thu, Mar 26, 2015 at 11:10 AM, Dave Lester  wrote:

>  Excellent, thanks for taking the lead here! I've added Denver to our
> list of User Groups -- we're now up to 12 world-wide!
> http://mesos.apache.org/community/user-groups/
>
> Dave
>
> On Thu, Mar 26, 2015, at 06:16 AM, Paul Otto wrote:
>
> Hi all,
>
> I am excited to announce that the Denver Mesos User Group has been
> created! We will be organizing our first meeting shortly!
> http://www.meetup.com/Denver-Mesos-User-Group
>
> Regards,
> Paul
>
> Paul Otto
> Principal DevOps Architect, Co-founder
> Otto Ops LLC | *OttoOps.com <http://OttoOps.com>*
> 970.343.4561 office
> 720.381.2383 cell
>
>
>


Re: New cologne based user group - mesos-user-group-cologne

2015-03-30 Thread Paul Otto
Finally, a reason to travel to Deutschland! ;) Good luck with the new MUG!

Paul
On Mar 29, 2015 1:00 PM, "Marc Zimmermann" 
wrote:

> We’ve started a new users group in Cologne called Mesos-User-Group-Cologne
> - please add us to your list!
>
> http://www.meetup.com/Mesos-User-Group-Cologne/
>
> Thanks, Marc
>
> --
> Dipl.-Inf.
> Marc Zimmermann
>
> 
>  mmbash UG (haftungsbeschränkt)
> Friedenstraße 22, 50676 Köln, Germany
>
> http://mmbash.de
> marc.zimmerm...@mmbash.de
>
> Geschäftsführer: Mike Michel, Marc Zimmermann
> Amtsgericht Köln, HRB 73562
>
>


Re: Writing outside the sandbox

2015-05-10 Thread Paul Brett
Can you check on the NFS server to see if the filesystem has been exported
with the root_squash option? That's a commonly used option that converts the
root uid on NFS clients to nobody on the server.
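
A quick way to check on the server (a sketch; the export path is whatever
you are actually sharing):

    exportfs -v                       # shows the effective options per export
    grep -n squash /etc/exports       # root_squash / no_root_squash / all_squash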

-- Paul Brett
On May 10, 2015 5:15 PM, "Adam Bordelon"  wrote:

> Go ahead and run `env` in your script too, and see if there are any
> interesting differences when run via Marathon vs. directly.
> Maybe you're running in a different shell?
>
> On Sun, May 10, 2015 at 2:21 PM, John Omernik  wrote:
>
>> I believe the slave IS running as root. FWIW when I ran the script from
>> above as root, it did work as intended (created the files on the NFS
>> share).
>>
>> On Sun, May 10, 2015 at 9:08 AM, Dick Davies 
>> wrote:
>>
>>> Any idea what user mesos is running as? This could just be a
>>> filesystem permission
>>> thing (ISTR last time I used NFS mounts, they had a 'root squash'
>>> option that prevented
>>> local root from writing to the NFS mount).
>>>
>>> On 9 May 2015 at 22:13, John Omernik  wrote:
>>> > I am not specifying isolators. The Default? :)  Is that a per slave
>>> setting?
>>> >
>>> > On Sat, May 9, 2015 at 3:33 PM, James DeFelice <
>>> james.defel...@gmail.com>
>>> > wrote:
>>> >>
>>> >> What isolators are you using?
>>> >>
>>> >> On Sat, May 9, 2015 at 3:48 PM, John Omernik 
>>> wrote:
>>> >>>
>>> >>> Marco... great idea... thank you.  I just tried it and it worked
>>> when I
>>> >>> had a /mnt/permtesting with the same permissions.  So it appears
>>> something
>>> >>> to do with NFS and Mesos (Remember I tested just NFS that worked
>>> fine, it's
>>> >>> the combination that is causing this).
>>> >>>
>>> >>> On Sat, May 9, 2015 at 1:09 PM, Marco Massenzio >> >
>>> >>> wrote:
>>> >>>>
>>> >>>> Out of my own curiousity (sorry, I have no fresh insights into the
>>> issue
>>> >>>> here) did you try to run the script and write to a non-NFS mounted
>>> >>>> directory? (same ownership/permissions)
>>> >>>>
>>> >>>> This way we could at least find out whether it's something related
>>> to
>>> >>>> NFS, or a more general permission-related issue.
>>> >>>>
>>> >>>> Marco Massenzio
>>> >>>> Distributed Systems Engineer
>>> >>>>
>>> >>>> On Sat, May 9, 2015 at 5:10 AM, John Omernik 
>>> wrote:
>>> >>>>>
>>> >>>>> Here is the testing I am doing. I used a simple script (run.sh)  It
>>> >>>>> writes the user it is running as to stderr (so it's the same log
>>> as the
>>> >>>>> errors from file writing) and then tries to make a directory in
>>> nfs, and
>>> >>>>> then touch a file in nfs.  Note: This script directly run  works
>>> on every
>>> >>>>> node.  You can see the JSON I used in marathon, and in the sandbox
>>> results,
>>> >>>>> you can see the user is indeed darkness and the directory cannot
>>> be created.
>>> >>>>> However when directly run, it the script, with the same user,
>>> creates the
>>> >>>>> directory with no issue.  Now,  I realize this COULD still be a
>>> NFS quirk
>>> >>>>> here, however, this testing points at some restriction in how
>>> marathon kicks
>>> >>>>> off the cmd.   Any thoughts on where to look would be very helpful!
>>> >>>>>
>>> >>>>> John
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> Script:
>>> >>>>>
>>> >>>>> #!/bin/bash
>>> >>>>> echo "Writing whoami to stderr for one stop logging" 1>&2
>>> >>>>> whoami 1>&2
>>> >>>>> mkdir /mapr/brewpot/mesos/storm/test/test1
>>> >>>>> touch /mapr/brewpot/mesos/storm/test/test1/testing.go
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> Run Via Marathon
>>&

Re: [DISCUSS] Renaming Mesos Slave

2015-06-02 Thread Paul Brett
-1 for the name change.

The master/slave terms in Mesos accurately describe the relationship
between the components using common engineering terms that predate modern
computing.

Human slavery is an abomination, but then so is murder.  Would you have us
eliminate all references to kill in the code?

-- Paul

On Tue, Jun 2, 2015 at 12:53 PM, haosdent  wrote:

> Hi Adam,
>
> 1. Mesos Worker
> 2. Mesos Worker
> 3. No
> 4. Carefully. Should take care the compatible when upgrade.
>
> On Wed, Jun 3, 2015 at 2:50 AM, Dave Lester  wrote:
>
>>  Hi Adam,
>>
>> I've been using Master/Worker in presentations for the past 9 months and
>> it hasn't led to any confusion.
>>
>> 1. Mesos worker
>> 2. Mesos worker
>> 3. No
>> 4. Documentation, then API with a full deprecation cycle
>>
>> Dave
>>
>> On Mon, Jun 1, 2015, at 02:18 PM, Adam Bordelon wrote:
>>
>> There has been much discussion about finding a less offensive name than
>> "Slave", and many of these thoughts have been captured in
>> https://issues.apache.org/jira/browse/MESOS-1478
>>
>> I would like to open up the discussion on this topic for one week, and if
>> we cannot arrive at a lazy consensus, I will draft a proposal from the
>> discussion and call for a VOTE.
>> Here are the questions I would like us to answer:
>> 1. What should we call the "Mesos Slave" node/host/machine?
>> 2. What should we call the "mesos-slave" process (could be the same)?
>> 3. Do we need to rename Mesos Master too?
>>
>> Another topic worth discussing is the deprecation process, but we don't
>> necessarily need to decide on that at the same time as deciding the new
>> name(s).
>> 4. How will we phase in the new name and phase out the old name?
>>
>> Please voice your thoughts and opinions below.
>>
>> Thanks!
>> -Adam-
>>
>> P.S. My personal thoughts:
>> 1. Mesos Worker [Node]
>> 2. Mesos Worker or Agent
>> 3. No
>> 4. Carefully
>>
>>
>>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>



-- 
-- Paul Brett


Re: [VOTE] Release Apache Mesos 0.23.0 (rc1)

2015-07-07 Thread Paul Brett
>>>>>>>>>  - Revocable Resources
>>>>>>>>>  - SSL encryption
>>>>>>>>>  - Persistent Volumes
>>>>>>>>>  - Dynamic Reservations
>>>>>>>>>
>>>>>>>>> The CHANGELOG for the release is available at:
>>>>>>>>>
>>>>>>>>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.23.0-rc1
>>>>>>>>>
>>>>>>>>> 
>>>>>>>>>
>>>>>>>>> The candidate for Mesos 0.23.0 release is available at:
>>>>>>>>>
>>>>>>>>> https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc1/mesos-0.23.0.tar.gz
>>>>>>>>>
>>>>>>>>> The tag to be voted on is 0.23.0-rc1:
>>>>>>>>>
>>>>>>>>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.23.0-rc1
>>>>>>>>>
>>>>>>>>> The MD5 checksum of the tarball can be found at:
>>>>>>>>>
>>>>>>>>> https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc1/mesos-0.23.0.tar.gz.md5
>>>>>>>>>
>>>>>>>>> The signature of the tarball can be found at:
>>>>>>>>>
>>>>>>>>> https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc1/mesos-0.23.0.tar.gz.asc
>>>>>>>>>
>>>>>>>>> The PGP key used to sign the release is here:
>>>>>>>>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>>>>>>>>
>>>>>>>>> The JAR is up in Maven in a staging repository here:
>>>>>>>>>
>>>>>>>>> https://repository.apache.org/content/repositories/orgapachemesos-1056
>>>>>>>>>
>>>>>>>>> Please vote on releasing this package as Apache Mesos 0.23.0!
>>>>>>>>>
>>>>>>>>> The vote is open until Fri July 10th, 12:00 PDT 2015 and passes if
>>>>>>>>> a majority of at least 3 +1 PMC votes are cast.
>>>>>>>>>
>>>>>>>>> [ ] +1 Release this package as Apache Mesos 0.23.0
>>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>  -Adam-
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>  --
>>>>>>> Deshi Xiao
>>>>>>> Twitter: xds2000
>>>>>>> E-mail: xiaods(AT)gmail.com
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
>


-- 
-- Paul Brett


Re: Custom flags to docker run

2015-08-12 Thread Paul Bell
Hi Stephen,

Via Marathon I am deploying Docker containers across a Mesos cluster. The
containers have unique Weave IP@s allowing inter-container communication.
All things considered, getting to this point has been relatively
straight-forward, and Weave has been one of the "IJW" components.

I'd be curious to learn why you're finding Weave "messy".

If you'd like to take it out-of-band (as it were), please feel free to
e-mail me directly.

Cordially,

Paul

On Wed, Aug 12, 2015 at 3:16 AM, Stephen Knight  wrote:

> Hi,
>
> Is there a way to pass a custom flag to docker run through the Marathon
> API? I've not seen anything in the documentation, this could just be a
> basic reading fail on my part. What I want to do is use Calico (or similar)
> with Docker and provision containers via Marathon.
>
> Weave is messy for what I am trying to achieve and the integration isn't
> going as planned, is there a better option and how can you then integrate
> it? Does that flexibility exist in the Marathon API?
>
> Thx
> Stephen
>


Can't start master properly (stale state issue?); help!

2015-08-13 Thread Paul Bell
Hi All,

I hope someone can shed some light on this because I'm getting desperate!

I try to start components zk, mesos-master, and marathon in that order.
They are started via a program that SSHs to the sole host and does "service
xxx start". Everyone starts happily enough. But the Mesos UI shows me:

*This master is not the leader, redirecting in 0 seconds ... go now*

The pattern seen in all of the mesos-master.INFO logs (one of which shown
below) is that the mesos-master with the correct IP@ starts. But then a
"new leader" is detected and becomes leading master. This new leader shows
UPID *master@127.0.1.1:5050*.

I've tried clearing what ZK and mesos-master state I can find, but this
problem will not "go away".

Would someone be so kind as to a) explain what is happening here and b)
suggest remedies?

Thanks very much.

-Paul


Log file created at: 2015/08/13 10:19:43
Running on machine: 71.100.14.9
Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
I0813 10:19:43.225636  2542 logging.cpp:172] INFO level logging started!
I0813 10:19:43.235213  2542 main.cpp:181] Build: 2015-05-05 06:15:50 by root
I0813 10:19:43.235244  2542 main.cpp:183] Version: 0.22.1
I0813 10:19:43.235257  2542 main.cpp:186] Git tag: 0.22.1
I0813 10:19:43.235268  2542 main.cpp:190] Git SHA:
d6309f92a7f9af3ab61a878403e3d9c284ea87e0
I0813 10:19:43.245098  2542 leveldb.cpp:176] Opened db in 9.386828ms
I0813 10:19:43.247138  2542 leveldb.cpp:183] Compacted db in 1.956669ms
I0813 10:19:43.247194  2542 leveldb.cpp:198] Created db iterator in 13961ns
I0813 10:19:43.247206  2542 leveldb.cpp:204] Seeked to beginning of db in
677ns
I0813 10:19:43.247215  2542 leveldb.cpp:273] Iterated through 0 keys in the
db in 243ns
I0813 10:19:43.247252  2542 replica.cpp:744] Replica recovered with log
positions 0 -> 0 with 1 holes and 0 unlearned
I0813 10:19:43.248755  2611 log.cpp:238] Attempting to join replica to
ZooKeeper group
I0813 10:19:43.248924  2542 main.cpp:306] Starting Mesos master
I0813 10:19:43.249244  2612 recover.cpp:449] Starting replica recovery
I0813 10:19:43.250239  2612 recover.cpp:475] Replica is in EMPTY status
I0813 10:19:43.250819  2612 replica.cpp:641] Replica in EMPTY status
received a broadcasted recover request
I0813 10:19:43.251014  2607 recover.cpp:195] Received a recover response
from a replica in EMPTY status
*I0813 10:19:43.249503  2542 master.cpp:349] Master
20150813-101943-151938119-5050-2542 (71.100.14.9) started on
71.100.14.9:5050*
I0813 10:19:43.252053  2610 recover.cpp:566] Updating replica status to
STARTING
I0813 10:19:43.252571  2542 master.cpp:397] Master allowing unauthenticated
frameworks to register
I0813 10:19:43.253159  2542 master.cpp:402] Master allowing unauthenticated
slaves to register
I0813 10:19:43.254276  2612 leveldb.cpp:306] Persisting metadata (8 bytes)
to leveldb took 1.816161ms
I0813 10:19:43.254323  2612 replica.cpp:323] Persisted replica status to
STARTING
I0813 10:19:43.254905  2612 recover.cpp:475] Replica is in STARTING status
I0813 10:19:43.255203  2612 replica.cpp:641] Replica in STARTING status
received a broadcasted recover request
I0813 10:19:43.255265  2612 recover.cpp:195] Received a recover response
from a replica in STARTING status
I0813 10:19:43.255343  2612 recover.cpp:566] Updating replica status to
VOTING
I0813 10:19:43.258730  2611 master.cpp:1295] Successfully attached file
'/var/log/mesos/mesos-master.INFO'
I0813 10:19:43.258760  2609 contender.cpp:131] Joining the ZK group
I0813 10:19:43.258862  2612 leveldb.cpp:306] Persisting metadata (8 bytes)
to leveldb took 3.477458ms
I0813 10:19:43.258894  2612 replica.cpp:323] Persisted replica status to
VOTING
I0813 10:19:43.258934  2612 recover.cpp:580] Successfully joined the Paxos
group
I0813 10:19:43.258987  2612 recover.cpp:464] Recover process terminated
I0813 10:19:46.590340  2606 group.cpp:313] Group process (group(1)@
71.100.14.9:5050) connected to ZooKeeper
I0813 10:19:46.590373  2606 group.cpp:790] Syncing group operations: queue
size (joins, cancels, datas) = (0, 0, 0)
I0813 10:19:46.590386  2606 group.cpp:385] Trying to create path
'/mesos/log_replicas' in ZooKeeper
I0813 10:19:46.591442  2606 network.hpp:424] ZooKeeper group memberships
changed
I0813 10:19:46.591514  2606 group.cpp:659] Trying to get
'/mesos/log_replicas/00' in ZooKeeper
I0813 10:19:46.592146  2606 group.cpp:659] Trying to get
'/mesos/log_replicas/01' in ZooKeeper
I0813 10:19:46.593128  2608 network.hpp:466] ZooKeeper group PIDs: {
log-replica(1)@127.0.1.1:5050 }
I0813 10:19:46.593955  2608 group.cpp:313] Group process (group(2)@
71.100.14.9:5050) connected to ZooKeeper
I0813 10:19:46.593977  2608 group.cpp:790] Syncing group operations: queue
size (joins, cancels, datas) = (1, 0, 0)
I0813 10:19:46.593986  2608 group.cpp:385] Trying to create path
'

Re: Can't start master properly (stale state issue?); help!

2015-08-13 Thread Paul Bell
Marco & hasodent,

This is just a quick note to say thank you for your replies.

I will answer you much more fully tomorrow, but for now can only manage a
few quick observations & questions:

1. Having some months ago encountered a known problem with the IP@
127.0.1.1 (I'll provide references tomorrow), I early on configured
/etc/hosts, replacing "myHostName 127.0.1.1" with "myHostName ".
That said, I can't rule out a race condition whereby ZK | mesos-master saw
the original unchanged /etc/hosts before I zapped it.

2. What is a znode and how would I drop it?

I start the services (zk, master, marathon; all on same host) by SSHing
into the host & doing "service  start" commands.

Again, thanks very much; and more tomorrow.

Cordially,

Paul

On Thu, Aug 13, 2015 at 1:08 PM, haosdent  wrote:

> Hello, how you start the master? And could you try use "netstat -antp|grep
> 5050" to find whether there are multi master processes run at a same
> machine or not?
>
> On Thu, Aug 13, 2015 at 10:37 PM, Paul Bell  wrote:
>
>> Hi All,
>>
>> I hope someone can shed some light on this because I'm getting desperate!
>>
>> I try to start components zk, mesos-master, and marathon in that order.
>> They are started via a program that SSHs to the sole host and does "service
>> xxx start". Everyone starts happily enough. But the Mesos UI shows me:
>>
>> *This master is not the leader, redirecting in 0 seconds ... go now*
>>
>> The pattern seen in all of the mesos-master.INFO logs (one of which shown
>> below) is that the mesos-master with the correct IP@ starts. But then a
>> "new leader" is detected and becomes leading master. This new leader shows
>> UPID *master@127.0.1.1:5050*.
>>
>> I've tried clearing what ZK and mesos-master state I can find, but this
>> problem will not "go away".
>>
>> Would someone be so kind as to a) explain what is happening here and b)
>> suggest remedies?
>>
>> Thanks very much.
>>
>> -Paul
>>
>>
>> Log file created at: 2015/08/13 10:19:43
>> Running on machine: 71.100.14.9
>> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
>> I0813 10:19:43.225636  2542 logging.cpp:172] INFO level logging started!
>> I0813 10:19:43.235213  2542 main.cpp:181] Build: 2015-05-05 06:15:50 by
>> root
>> I0813 10:19:43.235244  2542 main.cpp:183] Version: 0.22.1
>> I0813 10:19:43.235257  2542 main.cpp:186] Git tag: 0.22.1
>> I0813 10:19:43.235268  2542 main.cpp:190] Git SHA:
>> d6309f92a7f9af3ab61a878403e3d9c284ea87e0
>> I0813 10:19:43.245098  2542 leveldb.cpp:176] Opened db in 9.386828ms
>> I0813 10:19:43.247138  2542 leveldb.cpp:183] Compacted db in 1.956669ms
>> I0813 10:19:43.247194  2542 leveldb.cpp:198] Created db iterator in
>> 13961ns
>> I0813 10:19:43.247206  2542 leveldb.cpp:204] Seeked to beginning of db in
>> 677ns
>> I0813 10:19:43.247215  2542 leveldb.cpp:273] Iterated through 0 keys in
>> the db in 243ns
>> I0813 10:19:43.247252  2542 replica.cpp:744] Replica recovered with log
>> positions 0 -> 0 with 1 holes and 0 unlearned
>> I0813 10:19:43.248755  2611 log.cpp:238] Attempting to join replica to
>> ZooKeeper group
>> I0813 10:19:43.248924  2542 main.cpp:306] Starting Mesos master
>> I0813 10:19:43.249244  2612 recover.cpp:449] Starting replica recovery
>> I0813 10:19:43.250239  2612 recover.cpp:475] Replica is in EMPTY status
>> I0813 10:19:43.250819  2612 replica.cpp:641] Replica in EMPTY status
>> received a broadcasted recover request
>> I0813 10:19:43.251014  2607 recover.cpp:195] Received a recover response
>> from a replica in EMPTY status
>> *I0813 10:19:43.249503  2542 master.cpp:349] Master
>> 20150813-101943-151938119-5050-2542 (71.100.14.9) started on
>> 71.100.14.9:5050*
>> I0813 10:19:43.252053  2610 recover.cpp:566] Updating replica status to
>> STARTING
>> I0813 10:19:43.252571  2542 master.cpp:397] Master allowing
>> unauthenticated frameworks to register
>> I0813 10:19:43.253159  2542 master.cpp:402] Master allowing
>> unauthenticated slaves to register
>> I0813 10:19:43.254276  2612 leveldb.cpp:306] Persisting metadata (8
>> bytes) to leveldb took 1.816161ms
>> I0813 10:19:43.254323  2612 replica.cpp:323] Persisted replica status to
>> STARTING
>> I0813 10:19:43.254905  2612 recover.cpp:475] Replica is in STARTING status
>> I0813 10:19:43.255203  2612 replica.cpp:641] Replica in STARTING status
>> received a broadcasted recover request
>

Re: Can't start master properly (stale state issue?); help!

2015-08-14 Thread Paul Bell
All,

By way of some background: I'm not running a data center (or centers).
Rather, I work on a distributed application whose trajectory is taking it
into a realm of many Docker containers distributed across many hosts
(mostly virtual hosts at the outset). An environment that supports
isolation, multi-tenancy, scalability, and some fault tolerance is
desirable for this application. Also, the mere ability to simplify - at
least somewhat - the management of multiple hosts is of great importance.
So, that's more or less how I got to Mesos and to here...

I ended up writing a Java program that configures a collection of host VMs
as a Mesos cluster and then, via Marathon, distributes the application
containers across the cluster. Configuring & building the cluster is
largely a lot of SSH work. Doing the same for the application is part
Marathon, part Docker remote API. The containers that need to talk to each
other via TCP are connected with Weave's (http://weave.works) overlay
network. So the main infrastructure consists of Mesos, Docker, and Weave.
The whole thing is pretty amazing - for which I take very little credit.
Rather, these are some wonderful technologies, and the folks who write &
support them are very helpful. That said, I sometimes feel like I'm
juggling chain saws!

*In re* the issues raised on this thread:

All Mesos components were installed via the Mesosphere packages. The 4 VMs
in the cluster are all running Ubuntu 14.04 LTS.

My suspicions about the IP@ 127.0.1.1 were raised a few months ago when,
after seeing this IP in a mesos-master log when things "weren't working", I
discovered these articles:


https://groups.google.com/forum/#!topic/marathon-framework/1qboeZTOLU4

http://frankhinek.com/build-mesos-multi-node-ha-cluster/ (see "note 2")


So, to the point raised just now by Klaus (and earlier in the thread), the
aforementioned configuration program does change /etc/hosts (and
/etc/hostname) in the way Klaus suggested. But, as I mentioned to Marco &
hasodent, I might have encountered a race condition wherein ZK &
mesos-master saw the unchanged /etc/hosts before I altered it. I believe
that I yesterday fixed that issue.
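
(For anyone following along, a quick way to verify the fix on each node - the
exact hostname doesn't matter, what matters is that it no longer resolves to
127.0.1.1:)

    getent hosts $(hostname)      # should print the routable address, e.g. 71.100.14.9
    grep -n 127.0.1.1 /etc/hosts  # should print nothing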

Also, as part of the "cluster create" step, I get a bit aggressive (perhaps
unwisely) with what I believe are some state repositories. Specifically, I

rm /var/lib/zookeeper/version-2/*
rm -Rf /var/lib/mesos/replicated_log

Should I NOT be doing this? I know from experience that zapping the
"version-2" directory (ZK's data_Dir if IIRC)  can solve occasional
weirdness. Marco is "/var/lib/mesos/replicated_log" what you are referring
to when you say some "issue with the log-replica"?

Just a day or two ago I first heard the term "znode" & learned a little
about zkCli.sh. I will experiment with it more in the coming days.
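
From what I've gathered so far, the basic poking-around looks like this - a
sketch (the zkCli.sh path is where the Ubuntu zookeeper package puts it, and
/mesos is the znode path from my zk:// URL; both are assumptions for other
setups):

    /usr/share/zookeeper/bin/zkCli.sh -server 127.0.0.1:2181
    ls /mesos                   # election contenders
    ls /mesos/log_replicas      # replicated-log membership
    rmr /mesos                  # the drastic option: drops all Mesos election/log state in ZK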

As matters now stand, I have the cluster up and running. But before I again
deploy the application, I am trying to put the cluster through its paces by
periodically cycling it through the states my program can bring about,
e.g.,

--cluster create (takes a clean VM and configures it to act as one
or more Mesos components: ZK, master, slave)
--cluster stop      (stops the Mesos services on each node)
--cluster destroy   (configures the VM back to its original clean state)
--cluster create
--cluster stop
--cluster start


et cetera.

*The only way I got rid of the "no leading master" issue that started this
thread was by wiping out the master VM and starting over with a clean VM.
That is, stopping/destroying/creating (even rebooting) the cluster had no
effect.*

I suspect that, sooner or later, I will again hit this problem (probably
sooner!). And I want to understand how best to handle it. Such an
occurrence could be pretty awkward at a customer site.

Thanks for all your help.

Cordially,

Paul


On Thu, Aug 13, 2015 at 9:41 PM, Klaus Ma  wrote:

> I ran into a similar issue with Zookeeper + Mesos; I resolved it by
> removing 127.0.1.1 from /etc/hosts; here is an example:
> klaus@klaus-OptiPlex-780:~/Workspace/mesos$ cat /etc/hosts
> 127.0.0.1   localhost
> 127.0.1.1   klaus-OptiPlex-780   *<<= remove this line, and add a new
> line mapping the real IP (e.g. 192.168.1.100) to the hostname*
> ...
>
> BTW, please also clean up the log directory and restart ZK & Mesos.
>
> If you have any more comments, please let me know.
>
> Regards,
> 
> Klaus Ma (马达), PMP® | http://www.cguru.net
>

Use "docker start" rather than "docker run"?

2015-08-28 Thread Paul Bell
Hi All,

I first posted this to the Marathon list, but someone suggested I try it
here.

I'm still not sure what component (mesos-master, mesos-slave, marathon)
generates the "docker run" command that launches containers on a slave
node. I suppose that it's the framework executor (Marathon) on the slave
that actually executes the "docker run", but I'm not sure.

What I'm really after is whether or not we can cause the use of "docker
start" rather than "docker run".

At issue here is some persistent data inside
/var/lib/docker/aufs/mnt/. "docker run" will by design (re)launch
my application with a different CTR_ID effectively rendering that data
inaccessible. But "docker start" will restart the container and its "old"
data will still be there.
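
To make the distinction concrete (container and image names below are made up,
not our real ones):

  docker run --name myapp -d myimage   # creates a NEW container with a fresh writable layer
  docker stop myapp
  docker start myapp                   # restarts the SAME container; its data survives
  # whereas another "docker run" would create a second container with a new CTR_ID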

Thanks.

-Paul


Re: Use "docker start" rather than "docker run"?

2015-08-28 Thread Paul Bell
Alex & Tim,

Thank you both; most helpful.

Alex, can you dispel my confusion on this point: I keep reading that a
"framework" in Mesos (e.g., Marathon) consists of a scheduler and an
executor. This reference to "executor" made me think that Marathon must
have *some* kind of presence on the slave node. But the more familiar I
become with Mesos the less likely this seems to me. So, what does it mean
to talk about the Marathon framework "executor"?

Tim, I did come up with a simple work-around that involves re-copying the
needed file into the container each time the application is started. For
reasons unknown, this file is not kept in a location that would readily
lend itself to my use of persistent storage (Docker -v). That said, I am
keenly interested in learning how to write both custom executors &
schedulers. Any sense for what release of Mesos will see "persistent
volumes"?
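
Concretely, the work-around amounts to something like this after each
(re)launch (the names and paths are illustrative; note that copying INTO a
container requires Docker 1.8+):

  CID=$(docker ps | grep myimage | awk '{print $1}')   # find the freshly run container
  docker cp ./needed-config-file "$CID":/opt/app/conf/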

Thanks again, gents.

-Paul



On Fri, Aug 28, 2015 at 2:26 PM, Tim Chen  wrote:

> Hi Paul,
>
> We don't [re]start a container since we assume once the task terminated
> the container is no longer reused. In Mesos to allow tasks to reuse the
> same executor and handle task logic accordingly people will opt to choose
> the custom executor route.
>
> We're working on a way to keep your sandbox data beyond a container
> lifecycle, which is called persistent volumes. We haven't integrated that
> with Docker containerizer yet, so you'll have to wait to use that feature.
>
> You could also choose to implement a custom executor for now if you like.
>
> Tim
>
> On Fri, Aug 28, 2015 at 10:43 AM, Alex Rukletsov 
> wrote:
>
>> Paul,
>>
>> that component is called DockerContainerizer and it's part of Mesos Agent
>> (check "/Users/alex/Projects/mesos/src/slave/containerizer/docker.hpp").
>> @Tim, could you answer the "docker start" vs. "docker run" question?
>>
>> On Fri, Aug 28, 2015 at 1:26 PM, Paul Bell  wrote:
>>
>>> Hi All,
>>>
>>> I first posted this to the Marathon list, but someone suggested I try it
>>> here.
>>>
>>> I'm still not sure what component (mesos-master, mesos-slave, marathon)
>>> generates the "docker run" command that launches containers on a slave
>>> node. I suppose that it's the framework executor (Marathon) on the slave
>>> that actually executes the "docker run", but I'm not sure.
>>>
>>> What I'm really after is whether or not we can cause the use of "docker
>>> start" rather than "docker run".
>>>
>>> At issue here is some persistent data inside
>>> /var/lib/docker/aufs/mnt/. "docker run" will by design (re)launch
>>> my application with a different CTR_ID effectively rendering that data
>>> inaccessible. But "docker start" will restart the container and its "old"
>>> data will still be there.
>>>
>>> Thanks.
>>>
>>> -Paul
>>>
>>
>>
>


Re: Use "docker start" rather than "docker run"?

2015-09-01 Thread Paul Bell
Alex and Marco,

Thanks very much for your really helpful explanations.

For better or worse, neither cpp nor Python are my things; Java's the go-to
language for me.

Cordially,

Paul

On Sat, Aug 29, 2015 at 5:23 AM, Marco Massenzio 
wrote:

> Hi Paul,
>
> +1 to what Alex/Tim say.
>
> Maybe a (simple) example will help: a very basic "framework" I created
> recently, does away with the "Executor" and only uses the "Scheduler",
> sending a CommandInfo structure to Mesos' Agent node to execute.
>
> See:
> https://github.com/massenz/mongo_fw/blob/develop/src/mongo_scheduler.cpp#L124
>
> If Python is more your thing, there are examples in the Mesos repository,
> or you can take a look at something I started recently to use the new
> (0.24) HTTP API (NOTE - this is very much still WIP):
> https://github.com/massenz/zk-mesos/blob/develop/notebooks/HTTP%20API%20Tests.ipynb
>
> *Marco Massenzio*
>
> *Distributed Systems Engineer* | http://codetrips.com
>
> On Fri, Aug 28, 2015 at 8:44 AM, Paul Bell  wrote:
>
>> Alex & Tim,
>>
>> Thank you both; most helpful.
>>
>> Alex, can you dispel my confusion on this point: I keep reading that a
>> "framework" in Mesos (e.g., Marathon) consists of a scheduler and an
>> executor. This reference to "executor" made me think that Marathon must
>> have *some* kind of presence on the slave node. But the more familiar I
>> become with Mesos the less likely this seems to me. So, what does it mean
>> to talk about the Marathon framework "executor"?
>>
>> Tim, I did come up with a simple work-around that involves re-copying the
>> needed file into the container each time the application is started. For
>> reasons unknown, this file is not kept in a location that would readily
>> lend itself to my use of persistent storage (Docker -v). That said, I am
>> keenly interested in learning how to write both custom executors &
>> schedulers. Any sense for what release of Mesos will see "persistent
>> volumes"?
>>
>> Thanks again, gents.
>>
>> -Paul
>>
>>
>>
>> On Fri, Aug 28, 2015 at 2:26 PM, Tim Chen  wrote:
>>
>>> Hi Paul,
>>>
>>> We don't [re]start a container since we assume once the task terminated
>>> the container is no longer reused. In Mesos to allow tasks to reuse the
>>> same executor and handle task logic accordingly people will opt to choose
>>> the custom executor route.
>>>
>>> We're working on a way to keep your sandbox data beyond a container
>>> lifecycle, which is called persistent volumes. We haven't integrated that
>>> with Docker containerizer yet, so you'll have to wait to use that feature.
>>>
>>> You could also choose to implement a custom executor for now if you like.
>>>
>>> Tim
>>>
>>> On Fri, Aug 28, 2015 at 10:43 AM, Alex Rukletsov 
>>> wrote:
>>>
>>>> Paul,
>>>>
>>>> that component is called DockerContainerizer and it's part of Mesos
>>>> Agent (check
>>>> "/Users/alex/Projects/mesos/src/slave/containerizer/docker.hpp"). @Tim,
>>>> could you answer the "docker start" vs. "docker run" question?
>>>>
>>>> On Fri, Aug 28, 2015 at 1:26 PM, Paul Bell  wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I first posted this to the Marathon list, but someone suggested I try
>>>>> it here.
>>>>>
>>>>> I'm still not sure what component (mesos-master, mesos-slave,
>>>>> marathon) generates the "docker run" command that launches containers on a
>>>>> slave node. I suppose that it's the framework executor (Marathon) on the
>>>>> slave that actually executes the "docker run", but I'm not sure.
>>>>>
>>>>> What I'm really after is whether or not we can cause the use of
>>>>> "docker start" rather than "docker run".
>>>>>
>>>>> At issue here is some persistent data inside
>>>>> /var/lib/docker/aufs/mnt/. "docker run" will by design (re)launch
>>>>> my application with a different CTR_ID effectively rendering that data
>>>>> inaccessible. But "docker start" will restart the container and its "old"
>>>>> data will still be there.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> -Paul
>>>>>
>>>>
>>>>
>>>
>>
>


Detecting slave crashes event

2015-09-16 Thread Paul Bell
Hi All,

I am led to believe that, unlike Marathon, Mesos doesn't (yet?) offer a
subscribable event bus.

So I am wondering if there's a best practices way of determining if a slave
node has crashed. By "crashed" I mean something like the power plug got
yanked, or anything that would cause Mesos to stop talking to the slave
node.

I suppose such information would be recorded in /var/log/mesos.

Interested to learn how best to detect this.

Thank you.

-Paul


Re: Detecting slave crashes event

2015-09-16 Thread Paul Bell
Thank you, Benjamin.

So, I could periodically request the metrics endpoint, or stream the logs
(maybe via mesos.cli; or SSH)? What, roughly, does the "agent removed"
message look like in the logs?
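
For my own notes, the metrics route would presumably look something like this
against the master (the counter name is my guess and not verified):

  curl -s http://192.168.1.101:5050/metrics/snapshot \
    | python -m json.tool | grep -i slave_removal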

Are there plans to offer a mechanism for event subscription?

Cordially,

Paul



On Wed, Sep 16, 2015 at 1:30 PM, Benjamin Mahler 
wrote:

> You can detect when we remove an agent due to health check failures via
> the metrics endpoint, but these are counters that are better used for
> alerting / dashboards for visibility. If you need to know which agents, you
> can also consume the logs as a stop-gap solution, until we offer a
> mechanism for subscribing to cluster events.
>
> On Wed, Sep 16, 2015 at 10:11 AM, Paul Bell  wrote:
>
>> Hi All,
>>
>> I am led to believe that, unlike Marathon, Mesos doesn't (yet?) offer a
>> subscribable event bus.
>>
>> So I am wondering if there's a best practices way of determining if a
>> slave node has crashed. By "crashed" I mean something like the power plug
>> got yanked, or anything that would cause Mesos to stop talking to the slave
>> node.
>>
>> I suppose such information would be recorded in /var/log/mesos.
>>
>> Interested to learn how best to detect this.
>>
>> Thank you.
>>
>> -Paul
>>
>
>


reserving resources for host/mesos

2018-06-12 Thread Paul Mackles
Hi - Basic question that I couldn’t find an answer to in existing docs…
when configuring the available resources on a slave, is it appropriate to
set some resources aside for the mesos-agent itself (and the host OS)? Any
pointers on existing configs folks are using would be appreciated.
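
To make the question concrete, I'm imagining something along these lines on,
say, a 16-core / 64 GB box (the numbers are invented purely for illustration):

  # advertise less than the physical total so the agent and host OS keep headroom
  mesos-agent --resources='cpus:14;mem:59392;disk:800000' ...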

-- 
Thanks,
Paul