Guangya - Nope, there are no outstanding offers for any frameworks; the ones that are getting offers are responding properly.
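For reference, this is roughly how we checked (a quick sketch rather than our exact script; the master address is a placeholder, and we're just counting the per-framework "offers" entries that state.json reports):

    import json
    from urllib.request import urlopen

    # Placeholder master address - substitute your own master host:port.
    MASTER = "http://mesos-master.example.com:5050"

    with urlopen(MASTER + "/master/state.json") as resp:
        state = json.load(resp)

    # Each framework entry in state.json carries an "offers" list of the
    # offers it is currently holding (neither accepted nor declined yet).
    for fw in state.get("frameworks", []):
        outstanding = fw.get("offers", [])
        if outstanding:
            print("%s (%s): %d outstanding offer(s)"
                  % (fw["name"], fw["id"], len(outstanding)))

Nothing showed up there for any of the frameworks.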
Klaus - This was just a sample of logs for a single agent; the cluster has at least ~40 agents at any one time.

On 21 January 2016 at 15:20, Guangya Liu <gyliu...@gmail.com> wrote:

> Can you please check whether there are any outstanding offers in the cluster that have not been accepted by any framework? You can check this via the /master/state.json endpoint.
>
> If there are outstanding offers, you can start the master with the --offer_timeout flag so that the master rescinds offers that are not accepted by a framework.
>
> Cited from https://github.com/apache/mesos/blob/master/docs/configuration.md
>
> --offer_timeout=VALUE  Duration of time before an offer is rescinded from a framework.
>
> This helps fairness when running frameworks that hold on to offers, or frameworks that accidentally drop offers.
>
> Thanks,
>
> Guangya
>
> On Thu, Jan 21, 2016 at 9:44 PM, Tom Arnfeld <t...@duedil.com> wrote:
>
>> Hi Klaus,
>>
>> Sorry, I think I explained this badly: these are the logs for one slave (which is empty), and we can see that it is making offers to some frameworks. In this instance, the Hadoop framework (and others) are not among those getting any offers; they get offered nothing. The allocator is deciding to send offers in a loop to a certain set of frameworks, starving the others.
>>
>> On 21 January 2016 at 13:17, Klaus Ma <klaus1982...@gmail.com> wrote:
>>
>>> Yes, it seems the Hadoop framework did not consume all of the offered resources: if a framework launches a task (1 CPU) on an offer (10 CPUs), the other 9 CPUs are returned to the master (recoverResources).
>>>
>>> ----
>>> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
>>> Platform OpenSource Technology, STG, IBM GCG
>>> +86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me
>>>
>>> On Thu, Jan 21, 2016 at 6:46 PM, Tom Arnfeld <t...@duedil.com> wrote:
>>>
>>>> Thanks everyone!
>>>>
>>>> Stephan - There are a couple of useful points there; I'll definitely give it a read.
>>>>
>>>> Klaus - Thanks, we're running a bunch of different frameworks; in that list there's Hadoop MRv1, Apache Spark, Marathon and a couple of home-grown frameworks of our own. In this particular case the Hadoop framework is the major concern, as it's designed to keep accepting offers until it has the slots it needs. With the example I gave above, we observe that the master never sends any sizeable offers to some of these frameworks (the ones with the larger shares), which is where my confusion stems from.
>>>>
>>>> I've attached a snippet of our active master logs which shows the activity for a single slave (which has no active executors). We can see that it's cycling through sending and recovering declined offers from a selection of different frameworks (in order), but not all of the frameworks are receiving these offers; in this case that includes the Hadoop framework.
>>>>
>>>> On 21 January 2016 at 00:26, Klaus Ma <klaus1982...@gmail.com> wrote:
>>>>
>>>>> Hi Tom,
>>>>>
>>>>> Which framework are you using, e.g. Swarm, Marathon or something else? And which language bindings are you using?
>>>>>
>>>>> DRF sorts roles/frameworks by allocation ratio and offers all "available" resources slave by slave; but if the resources are too small (< 0.1 CPU), or the resources were rejected/declined by the framework, they will not be offered to it again until the filter times out.
>>>>> For example, in Swarm 1.0 the default filter timeout is 5s (because of the Go scheduler API), so here is a case that may hurt utilisation: Swarm gets one slave with 16 CPUs but only launches one container with 1 CPU; the other 15 CPUs are returned to the master and are not re-offered until the filter times out (5s).
>>>>>
>>>>> I've opened a pull request to make Swarm's parameters configurable; refer to https://github.com/docker/swarm/pull/1585. I think you can check this case in the master log.
>>>>>
>>>>> If you have any comments, please let me know.
>>>>>
>>>>> ----
>>>>> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
>>>>> Platform OpenSource Technology, STG, IBM GCG
>>>>> +86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me
>>>>>
>>>>> On Thu, Jan 21, 2016 at 2:19 AM, Tom Arnfeld <t...@duedil.com> wrote:
>>>>>
>>>>>> Hey,
>>>>>>
>>>>>> I've noticed some interesting behaviour recently when we have lots of different frameworks connected to our Mesos cluster at once, all using a variety of different shares. Some of the frameworks don't get offered more resources (for long periods of time, hours even), leaving the cluster under-utilised.
>>>>>>
>>>>>> Here's an example state where we see this happen:
>>>>>>
>>>>>> Framework 1 - 13% (user A)
>>>>>> Framework 2 - 22% (user B)
>>>>>> Framework 3 - 4% (user C)
>>>>>> Framework 4 - 0.5% (user C)
>>>>>> Framework 5 - 1% (user C)
>>>>>> Framework 6 - 1% (user C)
>>>>>> Framework 7 - 1% (user C)
>>>>>> Framework 8 - 0.8% (user C)
>>>>>> Framework 9 - 11% (user D)
>>>>>> Framework 10 - 7% (user C)
>>>>>> Framework 11 - 1% (user C)
>>>>>> Framework 12 - 1% (user C)
>>>>>> Framework 13 - 6% (user E)
>>>>>>
>>>>>> In this example, there's another ~30% of the cluster that is unallocated, and it stays like this for a significant amount of time until something changes, perhaps another user joining and allocating the rest. Chunks of this spare resource are offered to some of the frameworks, but not all of them.
>>>>>>
>>>>>> I had always assumed that when lots of frameworks were involved, the frameworks that would keep accepting resources indefinitely would eventually consume the remaining resource, since every other framework had rejected the offers.
>>>>>>
>>>>>> Could someone elaborate a little on how the DRF allocator / sorter handles this situation? Is this likely to be related to the different users being used? Is there a way to mitigate it?
>>>>>>
>>>>>> We're running version 0.23.1.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Tom.
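P.S. For anyone following along later, this is the mental model I have of the DRF sorter: each client is ranked by its dominant share (its largest allocation fraction across resource kinds), and the client with the lowest dominant share is offered resources first. A toy sketch with made-up numbers, not the actual Mesos allocator code; please correct me if it's wrong:

    # Toy illustration of DRF dominant-share ordering; NOT the Mesos allocator.
    # All totals and allocations below are made up for the example.
    CLUSTER_TOTAL = {"cpus": 400.0, "mem": 1024.0 * 1024}

    ALLOCATIONS = {
        "framework-2": {"cpus": 88.0, "mem": 230000.0},   # ~22% share
        "framework-1": {"cpus": 52.0, "mem": 130000.0},   # ~13% share
        "framework-3": {"cpus": 16.0, "mem": 40000.0},    # ~4% share
    }

    def dominant_share(alloc):
        # A client's dominant share is the max of its per-resource shares.
        return max(alloc.get(kind, 0.0) / total
                   for kind, total in CLUSTER_TOTAL.items())

    # Offers go to clients in ascending order of dominant share, so the
    # frameworks that already hold large shares sit at the back of the queue
    # while low-share frameworks keep getting (and declining) offers until
    # their decline filters expire.
    for name, alloc in sorted(ALLOCATIONS.items(),
                              key=lambda kv: dominant_share(kv[1])):
        print("%-12s dominant share = %.3f" % (name, dominant_share(alloc)))

If that model is roughly right, it would explain the loop we see of the same low-share frameworks being offered and declining, while the larger-share ones wait.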