That works out to scheduling about 1 task/sec, which is at least one order of magnitude lower than I would expect. Are you sure tasks were scheduling and continuing to run, rather than exiting/failing and triggering more scheduling work?
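For reference, that rate is just the figures from the report quoted below divided out. A quick sketch, assuming roughly 3,000 tasks and a full hour (both numbers are approximate in the report):

```python
# Figures taken from the quoted report: ~3k tasks rescheduled in about 1 hour.
# Both values are approximations, so the result is only a ballpark.
tasks = 3000
elapsed_seconds = 60 * 60

rate = tasks / elapsed_seconds  # tasks scheduled per second
print(f"{rate:.2f} tasks/sec")
```

If tasks were failing and being rescheduled during that hour, the scheduler did more than 3,000 placements, so this number would understate the real scheduling throughput.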
What build is this from? Can you share (scrubbed) scheduler logs from this period?

On Wed, Nov 29, 2017 at 11:54 AM, Mauricio Garavaglia <mauriciogaravag...@gmail.com> wrote:

> Hello!
>
> Recently, running some reliability tests, we restarted all the nodes in a
> cluster of ~300 hosts and 3k tasks. Aurora took about 1hour to reschedule
> everything, we have a change of leader in the middle of the scheduling and
> that slowed it down even more. So we started looking which aurora
> parameters needed more tuning.
>
> The value of max_tasks_per_schedule_attempt is set to the default now,
> that probably needs to be increased, is there a rule of thumb to tune it
> based on cluster size, # of jobs, # of frameworks, etc?
>
> Regarding the JVM, we are running it with Xmx=24G; so far we haven't seen
> pressure there.
>
> Any input on where to look at would be really appreciated :)
>
> Mauricio