I can’t send the entire log as there’s a lot of activity on the cluster all the time. Is there anything in particular you’re looking for?
> On 22 Jan 2016, at 12:46, Klaus Ma <klaus1982...@gmail.com> wrote:
>
> Can you share the whole log of the master? It'll be helpful :).
>
> ----
> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
> Platform OpenSource Technology, STG, IBM GCG
> +86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me
>
> On Thu, Jan 21, 2016 at 11:57 PM, Tom Arnfeld <t...@duedil.com> wrote:
>
> Guangya - Nope, there are no outstanding offers for any frameworks; the ones that are getting offers are responding properly.
>
> Klaus - This was just a sample of logs for a single agent; the cluster has at least ~40 agents at any one time.
>
> On 21 January 2016 at 15:20, Guangya Liu <gyliu...@gmail.com> wrote:
>
> Can you please check whether there are any outstanding offers in the cluster that are not being accepted by any framework? You can check this via the /master/state.json endpoint.
>
> If there are outstanding offers, you can start the master with the --offer_timeout flag to let the master rescind offers that are not accepted by a framework in time.
>
> Cited from https://github.com/apache/mesos/blob/master/docs/configuration.md
>
> --offer_timeout=VALUE
>     Duration of time before an offer is rescinded from a framework.
>     This helps fairness when running frameworks that hold on to offers,
>     or frameworks that accidentally drop offers.
>
> Thanks,
>
> Guangya
>
> On Thu, Jan 21, 2016 at 9:44 PM, Tom Arnfeld <t...@duedil.com> wrote:
>
> Hi Klaus,
>
> Sorry, I think I explained this badly. These are the logs for one slave (which is empty), and we can see that it is making offers to some frameworks. In this instance, the Hadoop framework (and others) are not among those getting any offers; they are offered nothing. The allocator is deciding to send offers in a loop to a certain set of frameworks, starving the others.
>
> On 21 January 2016 at 13:17, Klaus Ma <klaus1982...@gmail.com> wrote:
>
> Yes, it seems the Hadoop framework did not consume all of the offered resources: if a framework launches a task (1 CPU) on an offer (10 CPUs), the other 9 CPUs are returned to the master (recoverResources).
>
> ----
> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
> Platform OpenSource Technology, STG, IBM GCG
> +86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me
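To make the recoverResources behaviour above concrete, here is a minimal sketch of a scheduler that consumes only part of each offer, written against the old mesos.interface Python bindings from the 0.2x series; the task name, command and resource sizes are invented for illustration. Launching a 1-CPU task against a larger offer sends the unused remainder back to the master, where it is treated as declined and filtered for Filters.refuse_seconds (5 seconds by default) before being re-offered:

    from mesos.interface import Scheduler, mesos_pb2

    class OneCpuScheduler(Scheduler):
        """Launches one small task per offer, ignoring the rest."""

        def resourceOffers(self, driver, offers):
            for offer in offers:
                # Build a task that uses 1 CPU / 128 MB regardless of
                # how large the offer actually is.
                task = mesos_pb2.TaskInfo()
                task.task_id.value = "task-" + offer.id.value
                task.slave_id.value = offer.slave_id.value
                task.name = "one-cpu-task"          # invented name
                task.command.value = "sleep 60"     # invented command

                cpus = task.resources.add()
                cpus.name = "cpus"
                cpus.type = mesos_pb2.Value.SCALAR
                cpus.scalar.value = 1.0

                mem = task.resources.add()
                mem.name = "mem"
                mem.type = mesos_pb2.Value.SCALAR
                mem.scalar.value = 128.0

                # Only part of the offer is used; the master recovers
                # the remainder (recoverResources) and the allocator
                # will not re-offer it to this framework until the
                # decline filter expires.
                driver.launchTasks(offer.id, [task])

(Driver setup is omitted. Passing an explicit mesos_pb2.Filters() with a larger refuse_seconds to launchTasks or declineOffer is how a framework controls that re-offer delay.)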
> On Thu, Jan 21, 2016 at 6:46 PM, Tom Arnfeld <t...@duedil.com> wrote:
>
> Thanks everyone!
>
> Stephan - There are a couple of useful points there; I'll definitely give it a read.
>
> Klaus - Thanks, we're running a bunch of different frameworks; in that list there's Hadoop MRv1, Apache Spark, Marathon and a couple of home-grown frameworks of ours. In this particular case the Hadoop framework is the major concern, as it's designed to continually accept offers until it has the slots it needs. With the example I gave above, we observe that the master never sends any sizeable offers to some of these frameworks (the ones with the larger shares), which is where my confusion stems from.
>
> I've attached a snippet of our active master logs which shows the activity for a single slave (which has no active executors). We can see that it's cycling through sending and recovering declined offers from a selection of different frameworks (in order), but I can say that not all of the frameworks are receiving these offers; in this case that's the Hadoop framework.
>
> On 21 January 2016 at 00:26, Klaus Ma <klaus1982...@gmail.com> wrote:
>
> Hi Tom,
>
> Which framework are you using, e.g. Swarm, Marathon or something else? And which language package are you using?
>
> DRF sorts roles/frameworks by allocation ratio and offers all "available" resources slave by slave; but if the resources are too small (< 0.1 CPU), or the resources were rejected/declined by the framework, they will not be offered again until the filter timeout expires. For example, in Swarm 1.0 the default filter timeout is 5s (because of the Go scheduler API); so here is a case that may impact utilisation: Swarm gets an offer for a slave with 16 CPUs but only launches one container with 1 CPU; the other 15 CPUs go back to the master and are not re-offered until the filter timeout (5s) expires.
>
> I have opened a pull request to make Swarm's parameters configurable; refer to https://github.com/docker/swarm/pull/1585. I think you can check for this case in the master log.
>
> If you have any comments, please let me know.
>
> ----
> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
> Platform OpenSource Technology, STG, IBM GCG
> +86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me
>
> On Thu, Jan 21, 2016 at 2:19 AM, Tom Arnfeld <t...@duedil.com> wrote:
>
> Hey,
>
> I've noticed some interesting behaviour recently when we have lots of different frameworks connected to our Mesos cluster at once, all using a variety of different shares. Some of the frameworks don't get offered more resources (for long periods of time, hours even), leaving the cluster under-utilised.
>
> Here's an example state where we see this happen...
>
> Framework 1 - 13% (user A)
> Framework 2 - 22% (user B)
> Framework 3 - 4% (user C)
> Framework 4 - 0.5% (user C)
> Framework 5 - 1% (user C)
> Framework 6 - 1% (user C)
> Framework 7 - 1% (user C)
> Framework 8 - 0.8% (user C)
> Framework 9 - 11% (user D)
> Framework 10 - 7% (user C)
> Framework 11 - 1% (user C)
> Framework 12 - 1% (user C)
> Framework 13 - 6% (user E)
>
> In this example, there's another ~30% of the cluster that is unallocated, and it stays like this for a significant amount of time until something changes; perhaps another user joins and allocates the rest. Chunks of this spare resource are offered to some of the frameworks, but not all of them.
>
> I had always assumed that when lots of frameworks were involved, the frameworks that keep accepting resources indefinitely would eventually consume the remaining resources, as every other framework had rejected the offers.
>
> Could someone elaborate a little on how the DRF allocator / sorter handles this situation? Is this likely to be related to the different users being used? Is there a way to mitigate this?
>
> We're running version 0.23.1.
>
> Cheers,
>
> Tom.
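For readers wondering how the sorter arrives at an ordering like the one Tom describes, here is a toy sketch of DRF-style dominant-share sorting. This is plain Python, not the actual Mesos allocator code; in 0.23 the hierarchical allocator sorts roles/users first and then frameworks within each, whereas this collapses everything to one level, and all the numbers are invented:

    # Toy DRF sorter: a framework's dominant share is the largest
    # fraction it holds of any single resource kind; the framework
    # with the lowest dominant share is offered resources first.

    TOTAL = {"cpus": 100.0, "mem": 512000.0}  # invented cluster totals

    allocations = {
        # framework -> currently allocated resources (invented numbers)
        "hadoop":   {"cpus": 22.0, "mem": 96000.0},
        "spark":    {"cpus": 13.0, "mem": 64000.0},
        "marathon": {"cpus": 4.0,  "mem": 8000.0},
    }

    def dominant_share(alloc):
        return max(alloc[r] / TOTAL[r] for r in alloc)

    # Lowest dominant share first. A framework that keeps declining
    # offers holds no extra allocation, so it stays near the front of
    # this order and keeps receiving the spare resources (once its
    # decline filters expire), while higher-share frameworks wait;
    # that is consistent with the "cycling" Tom sees in the master log.
    for name, alloc in sorted(allocations.items(),
                              key=lambda kv: dominant_share(kv[1])):
        print(name, round(dominant_share(alloc), 3))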