This was on 0.17. No logs, sorry; I'll run the same test again in a week or so. I can share the new ones, and even kill the leader in the middle of the process.
Tasks continued to run; I remember I dug through the logs to see how long it took for a particular task to show up again as assigned. I'll adjust max_tasks_per_schedule_attempt and test it again. Thanks!

On Wed, Nov 29, 2017 at 12:03 PM, Bill Farner <wfar...@apache.org> wrote:

> That works out to scheduling about 1 task/sec, which is at least one order
> of magnitude lower than I would expect. Are you sure tasks were scheduling
> and continuing to run, rather than exiting/failing and triggering more
> scheduling work?
>
> What build is this from? Can you share (scrubbed) scheduler logs from
> this period?
>
> On Wed, Nov 29, 2017 at 11:54 AM, Mauricio Garavaglia <
> mauriciogaravag...@gmail.com> wrote:
>
>> Hello!
>>
>> Recently, while running some reliability tests, we restarted all the
>> nodes in a cluster of ~300 hosts and 3k tasks. Aurora took about an hour
>> to reschedule everything; we had a change of leader in the middle of the
>> scheduling, and that slowed it down even more. So we started looking at
>> which Aurora parameters needed more tuning.
>>
>> The value of max_tasks_per_schedule_attempt is set to the default now,
>> which probably needs to be increased. Is there a rule of thumb to tune it
>> based on cluster size, # of jobs, # of frameworks, etc.?
>>
>> Regarding the JVM, we are running it with Xmx=24G; so far we haven't seen
>> memory pressure there.
>>
>> Any input on where to look would be really appreciated :)
>>
>> Mauricio
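The "about 1 task/sec" figure Bill quotes follows directly from the numbers earlier in the thread (roughly 3k tasks rescheduled in about an hour). A quick back-of-the-envelope check, not part of the original exchange:

```python
# Sanity check of the scheduling rate discussed in the thread:
# ~3000 tasks rescheduled in roughly one hour.
tasks = 3000
seconds = 60 * 60  # one hour
rate = tasks / seconds
print(f"{rate:.2f} tasks/sec")  # prints "0.83 tasks/sec"
```

That is just under 1 task/sec, which is why raising max_tasks_per_schedule_attempt (so each scheduling round can place more than the default number of tasks per offer cycle) is the first knob worth turning here.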