Thanks everyone!

Stephan - There are a couple of useful points there; I'll definitely give it
a read.

Klaus - Thanks. We're running a number of different frameworks: Hadoop MRv1,
Apache Spark, Marathon and a couple of home-grown frameworks. In this
particular case the Hadoop framework is the main concern, as it's designed to
keep accepting offers until it has all the slots it needs. In the example from
my original message, we observe that the master never sends any sizeable
offers to some of these frameworks (the ones with the larger shares), which is
where my confusion stems from.

I've attached a snippet of our active master's logs showing the activity for
a single slave (which has no active executors). The master cycles through
offering the slave to a selection of different frameworks (in order) and
recovering the declined offers, but not every framework receives these
offers; in this case, the one left out is the Hadoop framework.
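The pattern in the attached log can be illustrated with a toy Python model. It's a sketch under simplifying assumptions, not the Mesos allocator: one idle slave, one role, fixed dominant shares, one offer per allocation tick (1s, the default `--allocation_interval`), and every framework except a hypothetical `hadoop` declining with the default 5s `refuse_seconds` filter. The framework names and shares are made up.

```python
from dataclasses import dataclass


@dataclass
class Framework:
    name: str
    share: float                 # dominant share, e.g. 0.13 for 13%
    declines: bool               # True if it always declines this slave
    filtered_until: float = 0.0  # time until which the slave is filtered
    offers: int = 0              # number of offers received


def simulate(frameworks, ticks, interval=1.0, refuse_seconds=5.0):
    """Offer one idle slave per tick to the lowest-share unfiltered framework.

    Toy model: shares never change, and an accepted offer ends the run
    because the slave's resources are then in use.
    """
    now = 0.0
    for _ in range(ticks):
        # DRF-style ordering: the lowest dominant share is offered first.
        for fw in sorted(frameworks, key=lambda f: f.share):
            if fw.filtered_until > now:
                continue  # declined recently; its filter has not expired
            fw.offers += 1
            if not fw.declines:
                return {f.name: f.offers for f in frameworks}
            # A decline filters this slave for this framework only.
            fw.filtered_until = now + refuse_seconds
            break
        now += interval
    return {f.name: f.offers for f in frameworks}


def make_cluster():
    # Hypothetical: eight small decliners (1.0% to 1.7%) and one larger acceptor.
    small = [Framework(f"small-{i}", 0.010 + i * 0.001, True) for i in range(8)]
    return small + [Framework("hadoop", 0.13, False)]


if __name__ == "__main__":
    print(simulate(make_cluster(), ticks=3600)["hadoop"])  # 0
    print(simulate(make_cluster(), ticks=3600, refuse_seconds=60.0)["hadoop"])  # 1
```

With the 5s filter, each decliner is filtered for only the next four ticks, so the five lowest-share decliners take turns and `hadoop` (along with the other three decliners) never sees the slave. With a 60s filter, all eight decliners are filtered by the ninth tick and the offer reaches `hadoop`. In Mesos, the declining scheduler controls this through `Filters.refuse_seconds` when it declines an offer.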


On 21 January 2016 at 00:26, Klaus Ma <klaus1982...@gmail.com> wrote:

> Hi Tom,
>
> Which frameworks are you using, e.g. Swarm, Marathon or something else? And
> which language bindings are you using?
>
> DRF sorts roles/frameworks by their allocation ratio and offers all
> "available" resources slave by slave. Resources that are too small
> (< 0.1 CPU) are not offered, and resources that were rejected/declined by a
> framework are not offered to that framework again until the filter times
> out. For example, in Swarm 1.0 the default filter timeout is 5s (because of
> the Go scheduler API), so here is a case that may impact utilisation: Swarm
> gets one slave with 16 CPUs but only launches one container with 1 CPU; the
> other 15 CPUs are returned to the master and are not re-offered to Swarm
> until the filter times out (5s).
> I submitted a pull request to make Swarm's parameters configurable; refer to
> https://github.com/docker/swarm/pull/1585. I think you can check for this
> case in the master log.
>
> If you have any comments, please let me know.
>
> ----
> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
> Platform OpenSource Technology, STG, IBM GCG
> +86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me
>
> On Thu, Jan 21, 2016 at 2:19 AM, Tom Arnfeld <t...@duedil.com> wrote:
>
>> Hey,
>>
>> I've noticed some interesting behaviour recently when we have lots of
>> different frameworks connected to our Mesos cluster at once, all with a
>> variety of different shares. Some of the frameworks don't get offered any
>> more resources for long periods of time (hours, even), leaving the cluster
>> under-utilised.
>>
>> Here's an example state where we see this happen:
>>
>> Framework 1 - 13% (user A)
>> Framework 2 - 22% (user B)
>> Framework 3 - 4% (user C)
>> Framework 4 - 0.5% (user C)
>> Framework 5 - 1% (user C)
>> Framework 6 - 1% (user C)
>> Framework 7 - 1% (user C)
>> Framework 8 - 0.8% (user C)
>> Framework 9 - 11% (user D)
>> Framework 10 - 7% (user C)
>> Framework 11 - 1% (user C)
>> Framework 12 - 1% (user C)
>> Framework 13 - 6% (user E)
>>
>> In this example, another ~30% of the cluster is unallocated, and it stays
>> like this for a significant amount of time until something changes
>> (perhaps another user joins and allocates the rest). Chunks of this spare
>> resource are offered to some of the frameworks, but not all of them.
>>
>> I had always assumed that, with lots of frameworks involved, the
>> frameworks that keep accepting resources indefinitely would eventually
>> consume the remaining resources, as every other framework would have
>> rejected the offers.
>>
>> Could someone elaborate a little on how the DRF allocator/sorter handles
>> this situation? Is this likely to be related to the different users
>> involved? Is there a way to mitigate this?
>>
>> We're running version 0.23.1.
>>
>> Cheers,
>>
>> Tom.
>>
>
>
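The arithmetic behind the example state, as a short Python sketch: the listed shares sum to 69.3%, leaving about 30.7% unallocated, and ordering by share shows which frameworks a DRF sorter would reach first. It assumes all thirteen frameworks are in a single role and treats each percentage as that framework's dominant share; the thread doesn't confirm either.

```python
# Example state from the original message: framework number -> share (%).
SHARES = {
    1: 13.0, 2: 22.0, 3: 4.0, 4: 0.5, 5: 1.0, 6: 1.0, 7: 1.0,
    8: 0.8, 9: 11.0, 10: 7.0, 11: 1.0, 12: 1.0, 13: 6.0,
}
# Users from the same table; every framework not listed here belongs to user C.
USERS = {1: "A", 2: "B", 9: "D", 13: "E"}


def unallocated(shares):
    """Percentage of the cluster not allocated to any framework."""
    return 100.0 - sum(shares.values())


def drf_order(shares):
    """Framework numbers, lowest share first (ties broken by number)."""
    return sorted(shares, key=lambda fw: (shares[fw], fw))


def share_by_user(shares, users, default_user="C"):
    """Total share per user, for comparison with the per-framework view."""
    totals = {}
    for fw, pct in shares.items():
        user = users.get(fw, default_user)
        totals[user] = totals.get(user, 0.0) + pct
    return totals


if __name__ == "__main__":
    print(round(unallocated(SHARES), 1))  # 30.7
    print(drf_order(SHARES))
    print(round(share_by_user(SHARES, USERS)["C"], 1))  # 17.3
```

The first seven frameworks in the order (4, 8, 5, 6, 7, 11, 12) all belong to user C and sit at or below 1%. Together user C holds 17.3%, but this sketch ranks individual frameworks, so the user plays no part in the ordering.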
I0121 10:43:27.513950 22408 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0626
I0121 10:43:28.546314 22409 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0644
I0121 10:43:30.095793 22403 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20151103-233456-100929708-5050-865-4520
I0121 10:43:31.208264 22406 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0650
I0121 10:43:32.285748 22403 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0810
I0121 10:43:32.951354 22403 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0635
I0121 10:43:34.040889 22405 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0626
I0121 10:43:35.138478 22403 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0294
I0121 10:43:36.227308 22404 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0627
I0121 10:43:37.312897 22405 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0514
I0121 10:43:38.682509 22404 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0810
I0121 10:43:39.514909 22403 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0650
I0121 10:43:40.641489 22406 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0626
I0121 10:43:42.157133 22409 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0294
I0121 10:43:43.165354 22403 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20150515-105200-84152492-5050-9915-0010
I0121 10:43:43.916021 22407 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0810
I0121 10:43:45.254372 22406 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0650
I0121 10:43:46.103548 22408 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0651
I0121 10:43:47.198281 22405 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0627
I0121 10:43:48.364265 22404 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20151103-233456-100929708-5050-865-4520
I0121 10:43:49.919083 22409 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0810
I0121 10:43:50.494624 22405 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0514
I0121 10:43:51.576524 22406 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0294
I0121 10:43:52.669220 22405 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0650
I0121 10:43:54.152811 22403 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0626
I0121 10:43:54.851694 22404 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0635
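Klaus suggests checking for this case in the master log. Here is a minimal Python sketch for tallying the excerpt above, assuming the `hierarchical.hpp` "Recovered" message format shown there. The regex also tolerates entries wrapped across lines, as they are in this email. `recovered_by_framework` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches "... Recovered <resources> on slave <slave-id> from framework <id>",
# allowing each entry to be wrapped across several lines.
RECOVERED = re.compile(
    r"hierarchical\.hpp:\d+\] Recovered\b.*?\son\s+slave\s+(\S+)"
    r"\s+from\s+framework\s+(\S+)",
    re.DOTALL,
)


def recovered_by_framework(log_text, slave_id):
    """Count 'Recovered' entries per framework for a single slave."""
    counts = Counter()
    for slave, framework in RECOVERED.findall(log_text):
        if slave == slave_id:
            counts[framework] += 1
    return counts
```

When run over the full log with slave ID `20151103-233456-100929708-5050-865-S36`, the counter lists the frameworks that were offered that slave and declined it. Since the slave stayed idle, a framework absent from the count (Hadoop, in this case) was not offered it during that window.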
