Mesos agents across a WAN?

2016-03-30 Thread Jeff Schroeder
Given regional bare metal Mesos clusters on multiple continents, are there
any known issues running some of the agents over the WAN? Is anyone else
doing it, or is this a terrible idea that I should tell management no on?

A few specifics:

1. Are there any known limitations or configuration gotchas I might
encounter?
2. Does setting up ZK observers in each non-primary dc and pointing the
agents at them exclusively make sense?
3. Are there plans for a Mesos equivalent of something like Ubernetes[1],
or would that be up to each framework?
4. Any suggestions on how best to do agent attributes / constraints for
something like this? I was planning on having the config management add a
"data_center" agent attribute to match on (rough sketch below).

Thanks!

[1]
https://github.com/kubernetes/kubernetes/blob/8813c955182e3c9daae68a8257365e02cd871c65/release-0.19.0/docs/proposals/federation.md#kubernetes-cluster-federation

-- 
Jeff Schroeder

Don't drink and derive, alcohol and analysis don't mix.
http://www.digitalprognosis.com


Re: Cleaning up failed tasks in the UI

2016-03-30 Thread Adam Bordelon
I suspect that after your maintenance operation, Marathon may have
registered with a new frameworkId and launched its own copies of your tasks
(which is why you see them doubled). However, the old Marathon frameworkId
probably has a failover_timeout of a week, so it will continue to be
considered "registered" but "disconnected".
Check the /frameworks endpoint to see if Mesos thinks you have two
Marathons registered.
If so, you can use the /teardown endpoint to unregister the old one, which
will cause all of its tasks to be killed.
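
For example, something along these lines (hypothetical master host and
frameworkId, substitute your own):

    curl http://<master>:5050/master/frameworks
    curl -X POST -d 'frameworkId=<old-marathon-framework-id>' \
        http://<master>:5050/master/teardown

Double-check which frameworkId is the stale one before tearing it down,
since teardown also kills any tasks still attributed to that framework.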

On Wed, Mar 30, 2016 at 4:56 AM, Alberto del Barrio <
alberto.delbarrio.albe...@gmail.com> wrote:

> Hi haosdent,
>
> thanks for your reply. It is actually very weird; it's the first time I've
> seen this situation in around one year of using Mesos.
> I am pasting here the truncated output you asked for. It shows one of
> the tasks with "Failed" state under "Active tasks":
>
> {
> "executor_id": "",
> "framework_id":
> "c857c625-25dc-4650-89b8-de4b597026ed-",
> "id": "pixie.33f85e8f-f03b-11e5-af6c-fa6389efeef1",
> "labels": [
>..
> ],
> "name": "myTask",
> "resources": {
> "cpus": 4.0,
> "disk": 0,
> "mem": 2560.0,
> "ports": "[31679-31679]"
> },
> "slave_id":
> "c857c625-25dc-4650-89b8-de4b597026ed-S878",
> "state": "TASK_FAILED",
> "statuses": [
> {
> "container_status": {
> "network_infos": [
> {
> "ip_address": "10.XX.XX.XX"
> }
> ]
> },
> "state": "TASK_RUNNING",
> "timestamp": 1458657321.16671
> },
> {
> "container_status": {
> "network_infos": [
> {
> "ip_address": "10.XX.XX.XX"
> }
> ]
> },
> "state": "TASK_FAILED",
> "timestamp": 1459329310.13663
> }
> ]
> },
>
>
> t
>
>
> On 03/30/16 13:30, haosdent wrote:
>
> >"Active tasks" with status "Failed"
> A bit weird here. According to my test, it should exist in "Completed
> Tasks". If possible, could you show your /master/state endpoint result? I
> think the frameworks node in the state response would be helpful to analyze
> the problem.
>
> On Wed, Mar 30, 2016 at 6:26 PM, Alberto del Barrio <
> alberto.delbarrio.albe...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> after maintenance carried out on a Mesos cluster (0.25) using Marathon
>> (0.10) as the only scheduler, I've ended up with double the tasks for
>> each application, but Marathon was recognizing only half of them.
>> To get rid of these orphaned tasks, I did a "kill PID" on them, so
>> they free up their resources.
>>
>> My problem now is that the tasks I've killed are still appearing in
>> the Mesos UI under "Active tasks" with status "Failed". This is not
>> affecting my system, but I would like to clean them up.
>> Googling, I can't find anything.
>> Can someone point me to a solution for cleaning up those tasks?
>>
>> Cheers,
>> Alberto.
>>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>
>
>


Re: Agent won't start

2016-03-30 Thread Pradeep Chhetri
Hello Paul,

A few things to note here:

1. Whenever you change the value of any *resource* or any *attribute*
(description:
http://mesos.apache.org/documentation/latest/attributes-resources/), you
need to clean up the work_dir (rm -rf /tmp/mesos) and restart the slave.

2. You must already know that all Mesos tasks/executors started by
mesos-slave keep running even if the mesos-slave process dies. Once you
clean up the work_dir, you will no longer be able to recover those
executors/tasks, and hence all Mesos tasks/executors running on that slave
will get killed. So ideally you *shouldn't* do it routinely. But if, as in
your case, it doesn't matter, you can add this work_dir cleanup to your
sysvinit/systemd/upstart script (see the sketch after this list). (I can't
think of a reason why stopping all services on all Mesos nodes would be a
routine task unless your slaves are very temporary in nature, e.g. AWS
spot instances.)

3. If your use case is that you want to change resources dynamically on
each Mesos slave, I would suggest you check the Mesos dynamic reservation
APIs (http://mesos.apache.org/documentation/latest/reservation/).
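
For (2), a minimal sketch of what such a wrapper step could look like
(paths assume the default work_dir of /tmp/mesos; adjust if you set
--work_dir):

    service mesos-slave stop
    rm -rf /tmp/mesos
    service mesos-slave start

Again, only do this when you don't care about recovering the
tasks/executors that were running on that slave.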

Hope this answers your questions. Let me know if I can help further.


On Wed, Mar 30, 2016 at 8:20 PM, Paul Bell  wrote:

> Greg, thanks again - I am planning on moving my work_dir.
>
>
>
> Pradeep, thanks again. In a slightly different scenario, namely,
>
> service mesos-slave stop
> edit /etc/default/mesos-slave   (add a port resource)
> service mesos-slave start
>
>
> I noticed that slave did not start and - again - the log shows the same
> phenomena as in my original post. Per your suggestion, I did a
>
> rm -Rf /tmp/mesos
>
> and the slave service started correctly.
>
> Questions:
>
>
>1. Did editing /etc/default/mesos-slave cause the failure of the
>service to start?
>    2. Given that starting/stopping the entire cluster (stopping all
>    services on all nodes) is a standard feature in our product, should I
>    routinely run the above "rm" command when the mesos services are stopped?
>
>
> Thanks for your help.
>
> Cordially,
>
> Paul
>
> On Tue, Mar 29, 2016 at 6:16 PM, Greg Mann  wrote:
>
>> Check out this link for info on /tmp cleanup in Ubuntu:
>> http://askubuntu.com/questions/20783/how-is-the-tmp-directory-cleaned-up
>>
>> And check out this link for information on some of the work_dir's
>> contents on a Mesos agent:
>> http://mesos.apache.org/documentation/latest/sandbox/
>>
>> The work_dir contains important application state for the Mesos agent, so
>> it should not be placed in a location that will be automatically
>> garbage-collected by the OS. The choice of /tmp/mesos as a default location
>> is a bit unfortunate, and hopefully we can resolve that JIRA issue soon to
>> change it. Ideally you should be able to leave the work_dir alone and let
>> the Mesos agent manage it for you.
>>
>> In any case, I would recommend that you set the work_dir to something
>> outside of /tmp; /var/lib/mesos is a commonly-used location.
>>
>> Cheers,
>> Greg
>>
>
>


-- 
Regards,
Pradeep Chhetri


Re: Agent won't start

2016-03-30 Thread Paul Bell
Greg, thanks again - I am planning on moving my work_dir.



Pradeep, thanks again. In a slightly different scenario, namely,

service mesos-slave stop
edit /etc/default/mesos-slave   (add a port resource)
service mesos-slave start


I noticed that slave did not start and - again - the log shows the same
phenomena as in my original post. Per your suggestion, I did a

rm -Rf /tmp/mesos

and the slave service started correctly.

Questions:


   1. Did editing /etc/default/mesos-slave cause the failure of the service
   to start?
   2. Given that starting/stopping the entire cluster (stopping all
   services on all nodes) is a standard feature in our product, should I
   routinely run the above "rm" command when the mesos services are stopped?


Thanks for your help.

Cordially,

Paul

On Tue, Mar 29, 2016 at 6:16 PM, Greg Mann  wrote:

> Check out this link for info on /tmp cleanup in Ubuntu:
> http://askubuntu.com/questions/20783/how-is-the-tmp-directory-cleaned-up
>
> And check out this link for information on some of the work_dir's contents
> on a Mesos agent: http://mesos.apache.org/documentation/latest/sandbox/
>
> The work_dir contains important application state for the Mesos agent, so
> it should not be placed in a location that will be automatically
> garbage-collected by the OS. The choice of /tmp/mesos as a default location
> is a bit unfortunate, and hopefully we can resolve that JIRA issue soon to
> change it. Ideally you should be able to leave the work_dir alone and let
> the Mesos agent manage it for you.
>
> In any case, I would recommend that you set the work_dir to something
> outside of /tmp; /var/lib/mesos is a commonly-used location.
>
> Cheers,
> Greg
>


Re: Compute event at Twitter HQ - 03/31

2016-03-30 Thread Ian Downes
Hello everyone,

The talks will indeed be recorded and made available on Twitter's
YouTube channel. I will send an update after the event when the videos
have been uploaded.

Ian

On Wed, Mar 16, 2016 at 6:16 PM, Micheal Benedict Arul
 wrote:
> We won't be streaming live but are working to get the talks recorded.
>
> On Wed, Mar 16, 2016 at 9:45 AM, haosdent  wrote:
>
>> Will it have a YouTube live link?
>>
>> On Thu, Mar 17, 2016 at 12:38 AM, Ian Downes  wrote:
>>
>> > Hello everyone,
>> >
>> > I'd like to call attention to an event the Compute group at Twitter is
>> > holding at the end of the month where there will be a few
>> > Aurora/Mesos-related talks:
>> >
>> > 1. David Robinson, one of our SREs, will talk about how our small team
>> > of SREs manages what is possibly the largest Mesos cluster in
>> > existence.
>> > 2. David McLaughlin, Aurora committer/PMC member, will talk about
>> > Workflows, an internal tool we've built to orchestrate deployments
>> > across Aurora clusters.
>> > 3. David Hagar, Engineering Manager at TellApart, will talk about
>> > running Aurora/Mesos in AWS.
>> >
>> > On top of that there will be lots of other great talks about how we
>> > run the entirety of our compute infrastructure.
>> >
>> > The event is on the evening of March 31st at Twitter HQ in San
>> > Francisco. I hope to see many of you there!
>> >
>> > https://www.eventbrite.com/e/compute-tickets-22811196904
>> >
>> > Thanks,
>> >
>> > Ian
>> >
>> > Note: This is nearly a straight copy of an email that Joshua sent out
>> > to the Aurora mailing lists.
>> >
>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>


Re: MUG in Raleigh/Durham

2016-03-30 Thread haosdent
Do you mean adding your user group to
http://mesos.apache.org/community/user-groups/ ? You could submit a pull
request to
https://github.com/apache/mesos/blob/master/site/source/community/user-groups.html.md
and contact a committer to help you get it merged.
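
The entry there is typically just one short line with the group name and
link, roughly like this (please match the format of the existing entries
in that file):

    * [Triangle Area Mesos User Group](http://www.meetup.com/Triangle-Area-Mesos-User-Group/)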

On Wed, Mar 30, 2016 at 7:31 PM, John Crawford  wrote:

> Hello! I hosted a Meetup last night around Mesos and would like to know
> how we can become an "official" user group? Here is the link to the Meetup
> group I created: http://www.meetup.com/Triangle-Area-Mesos-User-Group/
>
> Thanks!
>
> --
> John Crawford
> Co-founder & CTO
>
> 733 Foster Street  l  Suite 500
> Durham, NC 27701
>
> j...@ndustrial.io 
> mobile l 704.437.9000
>



-- 
Best Regards,
Haosdent Huang


Re: Cleaning up failed tasks in the UI

2016-03-30 Thread Alberto del Barrio

Hi haosdent,

thanks for your reply. It is actually very weird; it's the first time I've
seen this situation in around one year of using Mesos.
I am pasting here the truncated output you asked for. It shows one
of the tasks with "Failed" state under "Active tasks":


{
    "executor_id": "",
    "framework_id": "c857c625-25dc-4650-89b8-de4b597026ed-",
    "id": "pixie.33f85e8f-f03b-11e5-af6c-fa6389efeef1",
    "labels": [
       ..
    ],
    "name": "myTask",
    "resources": {
        "cpus": 4.0,
        "disk": 0,
        "mem": 2560.0,
        "ports": "[31679-31679]"
    },
    "slave_id": "c857c625-25dc-4650-89b8-de4b597026ed-S878",
    "state": "TASK_FAILED",
    "statuses": [
        {
            "container_status": {
                "network_infos": [
                    {
                        "ip_address": "10.XX.XX.XX"
                    }
                ]
            },
            "state": "TASK_RUNNING",
            "timestamp": 1458657321.16671
        },
        {
            "container_status": {
                "network_infos": [
                    {
                        "ip_address": "10.XX.XX.XX"
                    }
                ]
            },
            "state": "TASK_FAILED",
            "timestamp": 1459329310.13663
        }
    ]
},


t

On 03/30/16 13:30, haosdent wrote:

>"Active tasks" with status "Failed"
A bit weird here. According to my test, it should exist in "Completed
Tasks". If possible, could you show your /master/state endpoint result?
I think the frameworks node in the state response would be helpful to
analyze the problem.


On Wed, Mar 30, 2016 at 6:26 PM, Alberto del Barrio 
> wrote:


Hi all,

after maintenance carried out on a Mesos cluster (0.25) using
Marathon (0.10) as the only scheduler, I've ended up with double
the tasks for each application, but Marathon was recognizing
only half of them.
To get rid of these orphaned tasks, I did a "kill PID" on
them, so they free up their resources.

My problem now is that the tasks I've killed are still
appearing in the Mesos UI under "Active tasks" with status
"Failed". This is not affecting my system, but I would like to
clean them up.
Googling, I can't find anything.
Can someone point me to a solution for cleaning up those tasks?

Cheers,
Alberto.




--
Best Regards,
Haosdent Huang




MUG in Raleigh/Durham

2016-03-30 Thread John Crawford
Hello! I hosted a Meetup last night around Mesos and would like to know how
we can become an "official" user group? Here is the link to the Meetup
group I created: http://www.meetup.com/Triangle-Area-Mesos-User-Group/

Thanks!

-- 
John Crawford
Co-founder & CTO

733 Foster Street  l  Suite 500
Durham, NC 27701

j...@ndustrial.io 
mobile l 704.437.9000


Re: Cleaning up failed tasks in the UI

2016-03-30 Thread haosdent
>"Active tasks" with status "Failed"
A bit weird here. According to my test, it should exist in "Completed
Tasks". If possible, could you show your /master/state endpoint result? I
think the frameworks node in the state response would be helpful to analyze
the problem.
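
For example, something like this (substitute your master's address; on
older versions the endpoint is /master/state.json):

    curl -s http://<master>:5050/master/state | python -m json.tool

and look at the "frameworks" and "completed_frameworks" sections.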

On Wed, Mar 30, 2016 at 6:26 PM, Alberto del Barrio <
alberto.delbarrio.albe...@gmail.com> wrote:

> Hi all,
>
> after maintenance carried out on a Mesos cluster (0.25) using Marathon
> (0.10) as the only scheduler, I've ended up with double the tasks for
> each application, but Marathon was recognizing only half of them.
> To get rid of these orphaned tasks, I did a "kill PID" on them, so
> they free up their resources.
>
> My problem now is that the tasks I've killed are still appearing in the
> Mesos UI under "Active tasks" with status "Failed". This is not affecting
> my system, but I would like to clean them up.
> Googling, I can't find anything.
> Can someone point me to a solution for cleaning up those tasks?
>
> Cheers,
> Alberto.
>



-- 
Best Regards,
Haosdent Huang


Cleaning up failed tasks in the UI

2016-03-30 Thread Alberto del Barrio

Hi all,

after maintenance carried out on a Mesos cluster (0.25) using Marathon
(0.10) as the only scheduler, I've ended up with double the tasks for
each application, but Marathon was recognizing only half of them.
To get rid of these orphaned tasks, I did a "kill PID" on them,
so they free up their resources.

My problem now is that the tasks I've killed are still appearing in
the Mesos UI under "Active tasks" with status "Failed". This is not
affecting my system, but I would like to clean them up.

Googling, I can't find anything.
Can someone point me to a solution for cleaning up those tasks?

Cheers,
Alberto.