Re: Mesos sometimes not allocating the entire cluster

2016-02-22 Thread Klaus Ma
@Tom, one more question: how about your task run time? If the task run time is very short, e.g. 100ms, the resources will be returned to the allocator when the task finishes and will not be allocated again until the next allocation cycle. Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer Platform OpenSource
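
A rough sketch of the point above (not from the thread itself): resources freed by a finished task are only re-offered on the master's next batch allocation, whose cadence is controlled by the master's --allocation_interval flag, so very short tasks spend much of their wall time waiting for the next cycle. Assuming the default of about one second, shortening the interval would look like this (work_dir and value are illustrative):

    # Run the master with a shorter batch-allocation interval so resources
    # freed by short-lived tasks are re-offered sooner.
    mesos-master --work_dir=/var/lib/mesos --allocation_interval=500ms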

Re: Mesos sometimes not allocating the entire cluster

2016-02-22 Thread Guangya Liu
Hi Tom, I saw that the two frameworks with roles are consuming most of the resources, so I think that you can do more testing by removing the two frameworks with roles. Another thing I want to mention is that the DRF allocator may have some issues when there are plenty of frameworks and the community is

Re: Safe update of agent attributes

2016-02-22 Thread Adam Bordelon
Currently, changing any --attributes or --resources requires draining the agent and killing all running tasks. See https://issues.apache.org/jira/browse/MESOS-1739 You could do a `mesos-slave --recover=cleanup`, which essentially kills all the tasks and clears the work_dir; then restart with a
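
For readers following along, a sketch of the sequence Adam describes (master address, work_dir, and attribute values are placeholders):

    # 1. Kill the recovered tasks and clean up the agent's checkpointed state.
    mesos-slave --recover=cleanup \
                --master=zk://master.example.com:2181/mesos \
                --work_dir=/var/lib/mesos

    # 2. Restart the agent with the new attributes; any tasks that were
    #    running on this agent are gone at this point.
    mesos-slave --master=zk://master.example.com:2181/mesos \
                --work_dir=/var/lib/mesos \
                --attributes='rack:r1;pool:dev'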

[RESULT][VOTE] Release Apache Mesos 0.27.1 (rc1)

2016-02-22 Thread Michael Park
Hi all, The vote for Mesos 0.27.1 (rc1) has passed with the following votes. +1 (Binding) -- Bernd Mathiske Joris Van Remoortere Vinod Kone +1 (Non-binding) -- Zhitao Li Jörg Schad There were no 0 or -1 votes. Please find the release at:

Re: Safe update of agent attributes

2016-02-22 Thread Marco Massenzio
IIRC you can avoid the issue by either using a different work_dir for the agent, or removing (and, possibly, re-creating) it. I'm afraid I don't have a running instance of Mesos on this machine and can't test it out. Also (and this is strictly my opinion :) I would consider a change of attribute
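
A minimal sketch of the two alternatives Marco mentions (paths, attribute values, and master address are placeholders); either way the agent registers with a fresh identity, so the incompatible-slave-info check is never hit:

    # Option A: point the agent at a brand-new work_dir.
    mesos-slave --master=zk://master.example.com:2181/mesos \
                --work_dir=/var/lib/mesos-new \
                --attributes='rack:r1'

    # Option B: remove and re-create the old work_dir, then restart with the
    # same path. This discards all checkpointed agent state.
    rm -rf /var/lib/mesos && mkdir -p /var/lib/mesos
    mesos-slave --master=zk://master.example.com:2181/mesos \
                --work_dir=/var/lib/mesos \
                --attributes='rack:r1'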

Re: Reusing Task IDs

2016-02-22 Thread Erik Weathers
Thanks for the responses. Filed a ticket for this: - https://issues.apache.org/jira/browse/MESOS-4737 - Erik On Mon, Feb 22, 2016 at 1:23 PM, Sargun Dhillon wrote: > As someone who has been there and back again (Reusing task-IDs, and > realizing it's a terrible idea),

Re: Reusing Task IDs

2016-02-22 Thread Sargun Dhillon
As someone who has been there and back again (reusing task IDs, and realizing it's a terrible idea), I'd put some advice in the docs + mesos.proto to compose task IDs from GUIDs, and add that it's dangerous to reuse them. I would advocate for a mechanism to prevent the usage of non-unique IDs for
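
A trivial sketch of the convention Sargun suggests (the prefix is illustrative): build each TaskID from a fresh UUID so retries of the "same" logical task never collide.

    # Compose a framework-side task ID from a readable prefix plus a fresh
    # UUID; the UUID guarantees uniqueness across retries and restarts.
    TASK_ID="billing-report-$(uuidgen)"
    echo "${TASK_ID}"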

Re: Reusing Task IDs

2016-02-22 Thread Vinod Kone
I would vote for updating comments in mesos.proto to warn users not to re-use task IDs for now. On Sun, Feb 21, 2016 at 9:05 PM, Klaus Ma wrote: > Yes, it's dangerous to reuse TaskID; there's a JIRA (MESOS-3070) that the > Master will crash when the Master fails over with duplicated

Re: Safe update of agent attributes

2016-02-22 Thread Zameer Manji
Zhitao, In my experience the best way to manage these attributes is to ensure attribute changes are minimal (i.e. one attribute at a time) and roll them out slowly across the cluster. This way you can catch unsafe mutations quickly and roll back if needed. I don't think there is a whitelist/blacklist

Safe update of agent attributes

2016-02-22 Thread Zhitao Li
Hi, We recently discovered that updating attributes on Mesos agents is a very risky operation, and has the potential to send agents into a crash loop if not done properly, with errors like "Failed to perform recovery: Incompatible slave info detected". This combined with --recovery_timeout made the

[proposal] Generalized Authorized Interface

2016-02-22 Thread Alexander Rojas
Hey guys, After some extra thought, we came to what we think is a nice interface for the Mesos authorizer [1], which will allow users of Mesos to plug in their own custom backends in a clean way. Please share your thoughts with us in case we missed something or there are improvements we can make to

Re: Mesos sometimes not allocating the entire cluster

2016-02-22 Thread Tom Arnfeld
Hi Guangya, Most of the agents do not have a role, so they use the default wildcard role for resources. Also, none of the frameworks have a role, so they fall into the wildcard role too. Frameworks are being offered resources up to a certain level of fairness, but no further. The issue
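
For anyone reproducing this, a hedged sketch of how to confirm which role each framework and agent is using (master host/port are placeholders; jq is used only for readability):

    # Each framework object in the master state includes its role.
    curl -s http://mesos-master.example.com:5050/state.json \
      | jq '.frameworks[] | {name, role}'

    # Per-role weights, registered frameworks, and allocated resources.
    curl -s http://mesos-master.example.com:5050/roles.json | jq .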

Re: Mesos sometimes not allocating the entire cluster

2016-02-22 Thread Guangya Liu
If none of the frameworks has a role, then no framework can consume reserved resources, so I think that at least the frameworks 20160219-164457-67375276-5050-28802-0014 and 20160219-164457-67375276-5050-28802-0015 should have a role. Can you please show some detail for the following: 1) Master start

Re: Mesos sometimes not allocating the entire cluster

2016-02-22 Thread Tom Arnfeld
Ah yes, sorry, my mistake: there are a couple of agents with a dev role, and only one or two frameworks connect to the cluster with that role, but not very often. Whether they’re connected or not doesn’t seem to cause any change in allocation behaviour. No other agents have roles. > 974 2420

Re: Mesos sometimes not allocating the entire cluster

2016-02-22 Thread Guangya Liu
Hi Tom, I think that your cluster should have some role/weight configuration, because I can see there are at least two agents with the "dev" role configured. 56 1363 I0219 18:08:26.284010 28810 hierarchical.hpp:1025] Filtered ports(dev):[3000-5000]; cpus(dev):10; mem(dev):63488; disk(dev):153600
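
For context, a "dev" reservation like the one in that log line typically comes from a static reservation in the agent's --resources flag; a sketch using the same values (master address and work_dir are placeholders):

    # Statically reserve this agent's resources for the "dev" role at start-up.
    mesos-slave --master=zk://master.example.com:2181/mesos \
                --work_dir=/var/lib/mesos \
                --resources='cpus(dev):10;mem(dev):63488;disk(dev):153600;ports(dev):[3000-5000]'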

RE: AW: Feature request: move in-flight containers w/o stopping them

2016-02-22 Thread Aaron Carey
If this is of any use to anyone: There is also an outstanding branch of Docker which has checkpoint/restore functionality in it (based on CRIU I believe) which is hopefully being merged into experimental soon. From: Sharma Podila [spod...@netflix.com] Sent: 19

RE: AW: Feature request: move in-flight containers w/o stopping them

2016-02-22 Thread Aaron Carey
Would you be able to elaborate a bit more on how you did this? From: Mauricio Garavaglia [mauri...@medallia.com] Sent: 19 February 2016 19:20 To: user@mesos.apache.org Subject: Re: AW: Feature request: move in-flight containers w/o stopping them Mesos is not only