This was on 0.17. No logs, sorry; I'll run the same test again in a week or
so. I can share the new ones and even kill the leader in the middle of the
process.

Tasks continued to run. I remember digging through the logs to see how long
it took for a particular task to show up again as assigned. I'll adjust
max_tasks_per_schedule_attempt and test it again.
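
For reference, this is roughly what I'm planning to try on the scheduler
side; the flag syntax is from memory and the value is only a starting guess
to experiment with, not a tuned recommendation:

    # Scheduler command-line flag; 20 is just an experimental value,
    # not a recommendation.
    -max_tasks_per_schedule_attempt=20

    # JVM heap we already run with (passed however your deployment
    # supplies JVM options to the scheduler).
    -Xmx24g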

Thanks!

On Wed, Nov 29, 2017 at 12:03 PM, Bill Farner <wfar...@apache.org> wrote:

> That works out to scheduling about 1 task/sec, which is at least one order
> of magnitude lower than I would expect. Are you sure tasks were scheduling
> and continuing to run, rather than exiting/failing and triggering more
> scheduling work?
>
> What build is this from?  Can you share (scrubbed) scheduler logs from
> this period?
>
> On Wed, Nov 29, 2017 at 11:54 AM, Mauricio Garavaglia <
> mauriciogaravag...@gmail.com> wrote:
>
>> Hello!
>>
>> Recently, while running some reliability tests, we restarted all the nodes
>> in a cluster of ~300 hosts and 3k tasks. Aurora took about 1 hour to
>> reschedule everything; we had a change of leader in the middle of the
>> scheduling, and that slowed it down even more. So we started looking at
>> which Aurora parameters needed more tuning.
>>
>> The value of max_tasks_per_schedule_attempt is set to the default now;
>> that probably needs to be increased. Is there a rule of thumb for tuning it
>> based on cluster size, # of jobs, # of frameworks, etc.?
>>
>> Regarding the JVM, we are running it with Xmx=24G; so far we haven't seen
>> pressure there.
>>
>> Any input on where to look would be really appreciated :)
>>
>> Mauricio
>>
>
