Mesos agents across a WAN?
Given regional bare metal Mesos clusters on multiple continents, are there any known issues running some of the agents over the WAN? Is anyone else doing it, or is this a terrible idea that I should tell management no on?

A few specifics:

1. Are there any known limitations or configuration gotchas I might encounter?
2. Does setting up ZK observers in each non-primary DC and pointing the agents at them exclusively make sense?
3. Are there plans for a Mesos equivalent of something like Ubernetes [1], or would that be up to each framework?
4. Any suggestions on how best to do agent attributes / constraints for something like this? I was planning on having the config management add a "data_center" agent attribute to match on.

Thanks!

[1] https://github.com/kubernetes/kubernetes/blob/8813c955182e3c9daae68a8257365e02cd871c65/release-0.19.0/docs/proposals/federation.md#kubernetes-cluster-federation

--
Jeff Schroeder
Don't drink and derive, alcohol and analysis don't mix.
http://www.digitalprognosis.com
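For question 4, the attribute scheme described above might look something like the following (a sketch only; the ZooKeeper hostname, data center value, and paths are made up for illustration):

```shell
# On each agent, have config management set a data_center attribute.
# Note: changing attributes on an existing agent requires wiping its
# work_dir before restart, or the agent will refuse to recover.
mesos-slave --master=zk://zk1.dc1.example.com:2181/mesos \
            --attributes="data_center:us-east-1" \
            --work_dir=/var/lib/mesos

# A Marathon app could then be pinned to one DC with a constraint, e.g.:
#   "constraints": [["data_center", "CLUSTER", "us-east-1"]]
```

Other frameworks would need their own equivalent of Marathon's constraint matching, since Mesos itself only attaches the attributes to offers.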
Re: Cleaning up failed tasks in the UI
I suspect that after your maintenance operation, Marathon may have registered with a new frameworkId and launched its own copies of your tasks (which is why you see double). However, the old Marathon frameworkId probably has a failover_timeout of a week, so it will continue to be considered "registered" but "disconnected". Check the /frameworks endpoint to see if Mesos thinks you have two Marathons registered. If so, you can use the /teardown endpoint to unregister the old one, which will cause all of its tasks to be killed.

On Wed, Mar 30, 2016 at 4:56 AM, Alberto del Barrio <alberto.delbarrio.albe...@gmail.com> wrote:
> [...]
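The suggestion above, checking whether two Marathons are registered, can be scripted against the master's state JSON. A sketch (field names as used elsewhere in this thread; the master URL in the comment is a placeholder):

```python
import json
from collections import defaultdict
from urllib.request import urlopen  # only needed for the live-master usage below

def find_duplicate_frameworks(state):
    """Group registered frameworks by name. More than one entry for a
    name (e.g. two 'marathon' registrations, one active and one
    disconnected) is the situation described above."""
    by_name = defaultdict(list)
    for fw in state.get("frameworks", []):
        by_name[fw["name"]].append((fw["id"], fw.get("active", False)))
    return {name: entries for name, entries in by_name.items()
            if len(entries) > 1}

# Usage against a live master (URL is a placeholder):
#   state = json.load(urlopen("http://mesos-master:5050/master/state"))
#   print(find_duplicate_frameworks(state))

# Minimal stand-in for the real response, to show the shape:
sample = {"frameworks": [
    {"id": "aaa-0000", "name": "marathon", "active": True},
    {"id": "bbb-0001", "name": "marathon", "active": False},
    {"id": "ccc-0002", "name": "chronos", "active": True},
]}
print(find_duplicate_frameworks(sample))
```

Any duplicate whose entry shows active == False would be the stale registration to tear down.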
Re: Agent won't start
Hello Paul,

A few things to note here:

1. Whenever you change the value of any *resource* or any *attribute* (described at http://mesos.apache.org/documentation/latest/attributes-resources/), you need to clean up the work_dir (rm -rf /tmp/mesos) and restart the slave.

2. You must already know that all Mesos tasks/executors started by mesos-slave keep running even if the mesos-slave process dies. Once you clean up the work_dir, you will no longer be able to recover those executors/tasks, so all Mesos tasks/executors running on that slave will get killed. Ideally you *shouldn't* do this routinely. But if, as in your case, that doesn't matter, you can add the work_dir cleanup to your sysvinit/systemd/upstart script. (I can't think of a reason why stopping all services on all Mesos nodes would be a routine task unless your slaves are very temporary in nature, e.g. AWS spot instances.)

3. If your use case is that you want to change resources dynamically on each mesos-slave, I would suggest you check the Mesos dynamic reservation APIs (http://mesos.apache.org/documentation/latest/reservation/).

Hope this answers your questions. Let me know if I can help you more.

On Wed, Mar 30, 2016 at 8:20 PM, Paul Bell wrote:
> [...]

--
Regards,
Pradeep Chhetri
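If you do decide to automate the cleanup described in point 2 above, a systemd drop-in is one way to run it before each start. A sketch, assuming the unit is named mesos-slave and the work_dir is still the /tmp/mesos default; note this forfeits executor recovery on every restart:

```shell
# Create a drop-in override for the mesos-slave unit:
sudo mkdir -p /etc/systemd/system/mesos-slave.service.d
sudo tee /etc/systemd/system/mesos-slave.service.d/clean-workdir.conf <<'EOF'
[Service]
# Wipe agent state before each start; running executors will NOT be
# recovered and will be killed.
ExecStartPre=/bin/rm -rf /tmp/mesos
EOF
sudo systemctl daemon-reload
sudo systemctl restart mesos-slave
```

On upstart or sysvinit the equivalent would be an rm in the pre-start stanza or init script.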
Re: Agent won't start
Greg, thanks again - I am planning on moving my work_dir.

Pradeep, thanks again. In a slightly different scenario, namely,

service mesos-slave stop
edit /etc/default/mesos-slave (add a port resource)
service mesos-slave start

I noticed that the slave did not start and - again - the log shows the same phenomena as in my original post. Per your suggestion, I did a

rm -Rf /tmp/mesos

and the slave service started correctly.

Questions:

1. Did editing /etc/default/mesos-slave cause the failure of the service to start?
2. Given that starting/stopping the entire cluster (stopping all services on all nodes) is a standard feature in our product, should I routinely do the above "rm" command when the mesos services are stopped?

Thanks for your help.

Cordially,

Paul

On Tue, Mar 29, 2016 at 6:16 PM, Greg Mann wrote:
> Check out this link for info on /tmp cleanup in Ubuntu:
> http://askubuntu.com/questions/20783/how-is-the-tmp-directory-cleaned-up
>
> And check out this link for information on some of the work_dir's contents on a Mesos agent:
> http://mesos.apache.org/documentation/latest/sandbox/
>
> The work_dir contains important application state for the Mesos agent, so it should not be placed in a location that will be automatically garbage-collected by the OS. The choice of /tmp/mesos as a default location is a bit unfortunate, and hopefully we can resolve that JIRA issue soon to change it. Ideally you should be able to leave the work_dir alone and let the Mesos agent manage it for you.
>
> In any case, I would recommend that you set the work_dir to something outside of /tmp; /var/lib/mesos is a commonly-used location.
>
> Cheers,
> Greg
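Moving the work_dir as Greg suggests might look like this (a sketch; the MESOS_WORK_DIR variable name is an assumption based on the common Debian/Ubuntu packaging, which maps MESOS_* environment variables to agent flags):

```shell
# /etc/default/mesos-slave
MESOS_WORK_DIR=/var/lib/mesos

# Equivalent when launching the agent by hand:
#   mesos-slave --master=zk://zk1:2181/mesos --work_dir=/var/lib/mesos
```

Create the directory (owned by the user the agent runs as) before restarting the service.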
Re: Compute event at Twitter HQ - 03/31
Hello everyone,

The talks will indeed be recorded and made available on Twitter's YouTube channel. I will send an update after the event when the videos have been uploaded.

Ian

On Wed, Mar 16, 2016 at 6:16 PM, Micheal Benedict Arul wrote:
> We won't be streaming live but are working to get the talks recorded.
>
> On Wed, Mar 16, 2016 at 9:45 AM, haosdent wrote:
>> Would it have a YouTube live link?
>>
>> On Thu, Mar 17, 2016 at 12:38 AM, Ian Downes wrote:
>>> Hello everyone,
>>>
>>> I'd like to call attention to an event the Compute group at Twitter is holding at the end of the month where there will be a few Aurora/Mesos-related talks:
>>>
>>> 1. David Robinson, one of our SREs, will talk about how our small team of SREs manages what is possibly the largest Mesos cluster in existence.
>>> 2. David McLaughlin, Aurora committer/PMC member, will talk about Workflows, an internal tool we've built to orchestrate deployments across Aurora clusters.
>>> 3. David Hagar, Engineering Manager at TellApart, will talk about running Aurora/Mesos in AWS.
>>>
>>> On top of that there will be lots of other great talks about how we run the entirety of our compute infrastructure.
>>>
>>> The event is on the evening of March 31st at Twitter HQ in San Francisco. I hope to see many of you there!
>>>
>>> https://www.eventbrite.com/e/compute-tickets-22811196904
>>>
>>> Thanks,
>>>
>>> Ian
>>>
>>> Note: This is nearly a straight copy of an email that Joshua sent out to the Aurora mailing lists.
>>
>> --
>> Best Regards,
>> Haosdent Huang
Re: MUG in Raleigh/Durham
Do you mean adding your user group to http://mesos.apache.org/community/user-groups/ ? You could submit a pull request against https://github.com/apache/mesos/blob/master/site/source/community/user-groups.html.md and ask a committer to help you get it merged.

On Wed, Mar 30, 2016 at 7:31 PM, John Crawford wrote:
> [...]

--
Best Regards,
Haosdent Huang
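The pull-request flow described above might look like this (a sketch; the fork URL and branch name are placeholders, and the fork must be created on GitHub first):

```shell
# Clone your fork of apache/mesos:
git clone https://github.com/<your-username>/mesos.git
cd mesos
git checkout -b add-triangle-mug

# Add an entry for the new group to the user-groups page:
$EDITOR site/source/community/user-groups.html.md

git add site/source/community/user-groups.html.md
git commit -m "Added Triangle Area Mesos User Group to the user groups page."
git push origin add-triangle-mug
# Then open a pull request against apache/mesos on GitHub.
```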
Re: Cleaning up failed tasks in the UI
Hi haosdent,

thanks for your reply. It is actually very weird; it's the first time I have seen this situation in around one year of using Mesos.
I am pasting here the truncated output you asked for. It shows one of the tasks with "Failed" state under "Active tasks":

{
    "executor_id": "",
    "framework_id": "c857c625-25dc-4650-89b8-de4b597026ed-",
    "id": "pixie.33f85e8f-f03b-11e5-af6c-fa6389efeef1",
    "labels": [
        ..
    ],
    "name": "myTask",
    "resources": {
        "cpus": 4.0,
        "disk": 0,
        "mem": 2560.0,
        "ports": "[31679-31679]"
    },
    "slave_id": "c857c625-25dc-4650-89b8-de4b597026ed-S878",
    "state": "TASK_FAILED",
    "statuses": [
        {
            "container_status": {
                "network_infos": [
                    {
                        "ip_address": "10.XX.XX.XX"
                    }
                ]
            },
            "state": "TASK_RUNNING",
            "timestamp": 1458657321.16671
        },
        {
            "container_status": {
                "network_infos": [
                    {
                        "ip_address": "10.XX.XX.XX"
                    }
                ]
            },
            "state": "TASK_FAILED",
            "timestamp": 1459329310.13663
        }
    ]
},

On 03/30/16 13:30, haosdent wrote:
> [...]
MUG in Raleigh/Durham
Hello! I hosted a Meetup last night around Mesos and would like to know how we can become an "official" user group? Here is the link to the Meetup group I created: http://www.meetup.com/Triangle-Area-Mesos-User-Group/

Thanks!

--
John Crawford
Co-founder & CTO

733 Foster Street l Suite 500
Durham, NC 27701

j...@ndustrial.io
mobile l 704.437.9000
Re: Cleaning up failed tasks in the UI
> "Active tasks" with status "Failed"

A bit weird here. According to my test, it should exist in "Completed Tasks". If possible, could you show your /master/state endpoint result? I think the "frameworks" node in the state response would be helpful for analyzing the problem.

On Wed, Mar 30, 2016 at 6:26 PM, Alberto del Barrio <alberto.delbarrio.albe...@gmail.com> wrote:
> [...]

--
Best Regards,
Haosdent Huang
Cleaning up failed tasks in the UI
Hi all,

after a maintenance carried out on a Mesos cluster (0.25) using Marathon (0.10) as the only scheduler, I ended up with double the tasks for each application, but Marathon was recognizing only half of them. To get rid of these orphaned tasks, I did a "kill PID" on them, so they freed up their resources.

My problem now is that the tasks I killed are still appearing in the Mesos UI under "Active tasks" with status "Failed". This is not affecting my system, but I would like to clean them up. Googling, I can't find anything. Can someone point me to a solution for cleaning up those tasks?

Cheers,
Alberto.
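For anyone hitting the same thing, the stuck entries can be listed directly from the master's state JSON rather than eyeballing the UI. A sketch (task and framework field names as shown in the JSON pasted earlier in this thread; fetching the state from a live master is left as a comment since the URL would be a placeholder):

```python
def failed_active_tasks(state):
    """Return (framework_id, task_id) pairs for tasks that appear under a
    framework's active 'tasks' list but carry the terminal TASK_FAILED
    state -- the entries described above as stuck in the UI."""
    stuck = []
    for fw in state.get("frameworks", []):
        for task in fw.get("tasks", []):  # active tasks only
            if task.get("state") == "TASK_FAILED":
                stuck.append((fw["id"], task["id"]))
    return stuck

# Live usage (URL is a placeholder):
#   import json; from urllib.request import urlopen
#   state = json.load(urlopen("http://mesos-master:5050/master/state"))

# Minimal stand-in shaped like the JSON from this thread:
sample = {"frameworks": [{
    "id": "c857c625-25dc-4650-89b8-de4b597026ed-",
    "tasks": [
        {"id": "pixie.33f85e8f-f03b-11e5-af6c-fa6389efeef1",
         "state": "TASK_FAILED"},
        {"id": "web.1", "state": "TASK_RUNNING"},
    ],
}]}
print(failed_active_tasks(sample))
```

A non-empty result for a framework that Marathon no longer knows about would point back at the stale-registration explanation given earlier in the thread.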