Thanks Bill. Please see inline:

On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfar...@apache.org> wrote:

> I suspect they are getting enqueued
>
>
> Just to be sure - the offers do eventually get through though?
>
>
In one instance the offers did get through, but it took several minutes. In
other instances we restarted the scheduler so that another instance could
become the leader.


> The most likely culprit is contention for the storage write lock,  observable
> via spikes in stat log_storage_write_lock_wait_ns_total.
>

Thanks. I will check that one.
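For the archive, in case it helps others: that stat can be read off the scheduler's /vars endpoint. A minimal sketch of how I plan to check it; the sample values and the 8081 port are illustrative, in practice the first line would be `curl -s http://<scheduler-host>:8081/vars`:

```shell
# Illustrative /vars snapshot (in reality: curl -s http://<scheduler-host>:8081/vars)
vars_sample='log_storage_write_lock_wait_ns_total 1200000000
log_storage_write_lock_wait_ns_per_event 4000000
offer_accept_races 0'

# Extract the cumulative write-lock wait time; a steep climb between two
# samples points at storage write-lock contention.
echo "$vars_sample" | awk '$1 == "log_storage_write_lock_wait_ns_total" {print $2}'
```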


>
> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>> calls are being made at that time
>
>
> This sounds like API activity.  This shouldn't interfere with offer
> processing directly, but could potentially slow down the scheduler as a
> whole.
>
>
So these won't contend for locks with the offer processing and task
assignment threads? Only 8-10 out of 24 cores were being used on the
machine. I also noticed a spike in MyBatis active connections and bad
connections. I can't tell whether the spike in active connections was
caused by the bad connections, or vice versa, or whether a third source
caused both. Are there any metrics or logs that might help here?
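For reference, this is roughly how I spotted the connection-pool spike; the stat names and values below are illustrative placeholders, not the exact counters our scheduler exports:

```shell
# Illustrative /vars snapshot (in reality: curl -s http://<scheduler-host>:8081/vars)
vars_sample='timed_out_tasks 12
mybatis_pool_active_connections 40
mybatis_pool_bad_connections 7'

# List every connection-pool related counter to watch for spikes over time.
echo "$vars_sample" | grep '^mybatis'
```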


> I also notice a lot of "Timeout reached for task..." around the same time.
>> Can this happen if task is in PENDING state and does not reach ASSIGNED due
>> to lack of offers?
>
>
> This is unusual.  Pending tasks are not timed out; this applies to tasks
> in states where the scheduler is waiting for something else to act and it
> does not hear back (via a status update).
>

Perhaps they were in ASSIGNED or some other state. If status updates from
Mesos are being delayed or processed too slowly, could that produce both of
these effects?
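To narrow this down I was thinking of tallying the timed-out tasks by the state they were stuck in. A rough sketch; the log line format below is made up for illustration and would need to be adapted to the scheduler's actual output:

```shell
# Illustrative scheduler log lines (real format will differ).
log_sample='I1110 12:00:01 Timeout reached for task www-data-0-abc [ASSIGNED]
I1110 12:00:02 Timeout reached for task www-data-1-def [KILLING]'

# Count timed-out tasks grouped by the state they were stuck in.
echo "$log_sample" | sed 's/.*\[\(.*\)\]/\1/' | sort | uniq -c
```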


>
> I suggest digging into the cause of delayed offer processing first; I
> suspect it might be related to the task timeouts as well.
>
> version close to 0.18.0
>
>
> Is the ambiguity due to custom patches?  Can you at least indicate the
> last git SHA off aurora/master?  Digging much deeper in diagnosing this may
> prove tricky without knowing what code is in play.
>
>
c85bffd10 is the commit from which we forked. The custom patch is mainly
the dynamic reservation work done by Dmitri. We also have commits for the
offer/rescind race issue and the setrootfs patch (which is not upstreamed
yet).

I have cherry-picked the fix for AURORA-1952 as well.



> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>
>> I also notice a lot of "Timeout reached for task..." around the same
>> time. Can this happen if task is in PENDING state and does not reach
>> ASSIGNED due to lack of offers?
>>
>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>
>>> Folks,
>>> I have noticed some weird behavior in Aurora (version close to 0.18.0).
>>> Sometimes, it shows no offers in the UI offers page. But if I tail the logs
>>> I can see offers are coming in. I suspect they are getting enqueued for
>>> processing by "executor" but stay there for a long time and are not
>>> processed either due to locking or thread starvation.
>>>
>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>>> calls are being made at that time. Could these calls starve the
>>> OfferManager(e.g. by contending for some lock)? What should I be looking
>>> for to debug this condition further?
>>>
>>> Mohit.
>>>
>>
>>
>