Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that cause
or effect? :-)
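
For reference, here is how I am sampling that counter (a minimal sketch,
assuming the scheduler's plain-text /vars endpoint on the default HTTP
port 8081; adjust host and port for your setup):

    import time
    import urllib.request

    VARS_URL = "http://localhost:8081/vars"  # assumed scheduler host/port
    STAT = "log_storage_write_lock_wait_ns_total"

    def read_stat():
        # /vars serves newline-separated "name value" pairs
        body = urllib.request.urlopen(VARS_URL).read().decode()
        for line in body.splitlines():
            name, _, value = line.partition(" ")
            if name == STAT:
                return int(value)
        return None

    prev = read_stat()
    while True:
        time.sleep(10)
        cur = read_stat()
        if prev is not None and cur is not None:
            # the counter is cumulative nanoseconds spent waiting on the
            # write lock, so the delta is wait time within this window
            print("write-lock wait, last 10s: %.2fs" % ((cur - prev) / 1e9))
        prev = cur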

On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> Thanks Bill. Please see inline:
>
> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfar...@apache.org> wrote:
>
>>> I suspect they are getting enqueued
>>
>>
>> Just to be sure - the offers do eventually get through though?
>>
>>
> In one instance the offers did get through, but it took several minutes.
> In other instances we restarted the scheduler to let another instance
> become the leader.
>
>
>> The most likely culprit is contention for the storage write lock,
>> observable via spikes in the stat log_storage_write_lock_wait_ns_total.
>>
>
> Thanks. I will check that one.
>
>
>>
>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>>> calls are being made at that time
>>
>>
>> This sounds like API activity.  This shouldn't interfere with offer
>> processing directly, but could potentially slow down the scheduler as a
>> whole.
>>
>>
> So these won't contend for locks with the offer processing and task
> assignment threads? Only 8-10 of the machine's 24 cores were being used. I
> also noticed a spike in MyBatis active and bad connections. I can't say
> whether the spike in active connections was caused by the many bad
> connections, or vice versa, or whether a third source caused both. Are
> there any metrics or logs that might help here?
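>
> In case I am reading the wrong stats, here is how I am pulling the
> connection numbers (a sketch; rather than assuming exact stat names, it
> greps /vars for anything that looks connection-related):
>
>     import urllib.request
>
>     # print every exported stat whose name looks connection-related
>     body = urllib.request.urlopen("http://localhost:8081/vars").read().decode()
>     for line in body.splitlines():
>         if "connection" in line.lower() or "mybatis" in line.lower():
>             print(line)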
>
>
>>> I also notice a lot of "Timeout reached for task..." around the same
>>> time. Can this happen if a task is in the PENDING state and does not
>>> reach ASSIGNED due to a lack of offers?
>>
>>
>> This is unusual.  Pending tasks are not timed out; this applies to tasks
>> in states where the scheduler is waiting for something else to act and it
>> does not hear back (via a status update).
>>
>
> Perhaps they were in ASSIGNED or some other state. If updates from Mesos
> are delayed or processed too slowly, would both of these effects occur?
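>
> To correlate the two, I am bucketing those timeout log lines per minute
> (a sketch; it assumes the "Timeout reached for task" text appears
> verbatim in the scheduler log behind a leading timestamp, so the slice
> below may need adjusting):
>
>     import collections
>     import sys
>
>     # count "Timeout reached for task" lines per timestamp prefix
>     counts = collections.Counter()
>     for line in open(sys.argv[1]):
>         if "Timeout reached for task" in line:
>             counts[line[:16]] += 1  # assumed minute-granularity prefix
>     for bucket, n in sorted(counts.items()):
>         print(bucket, n)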
>
>
>>
>> I suggest digging into the cause of delayed offer processing first; I
>> suspect it might be related to the task timeouts as well.
>>
>>> version close to 0.18.0
>>
>>
>> Is the ambiguity due to custom patches?  Can you at least indicate the
>> last git SHA off aurora/master?  Digging much deeper into diagnosing this
>> may prove tricky without knowing what code is in play.
>>
>>
> c85bffd10 is the commit from which we forked.
>
> The custom patch is mainly the dynamic reservation work done by Dmitri. We
> also have commits for the offer/rescind race issue and the setrootfs patch
> (which is not upstreamed yet).
> I have cherry-picked the fix for AURORA-1952 as well.
>
>
>
>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>
>>> I also notice a lot of "Timeout reached for task..." around the same
>>> time. Can this happen if a task is in the PENDING state and does not
>>> reach ASSIGNED due to a lack of offers?
>>>
>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>> wrote:
>>>
>>>> Folks,
>>>> I have noticed some weird behavior in Aurora (version close to 0.18.0).
>>>> Sometimes it shows no offers on the UI offers page, but if I tail the
>>>> logs I can see offers coming in. I suspect they are getting enqueued for
>>>> processing by an "executor" but stay there for a long time and are not
>>>> processed, either due to locking or thread starvation.
>>>>
>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>>>> calls are being made at that time. Could these calls starve the
>>>> OfferManager (e.g. by contending for some lock)? What should I be
>>>> looking for to debug this condition further?
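>>>>
>>>> As a first check I have been diffing the UI against the raw endpoint
>>>> (a sketch; it assumes the /offers endpoint that backs the offers page,
>>>> on the default HTTP port 8081):
>>>>
>>>>     import json
>>>>     import urllib.request
>>>>
>>>>     # the scheduler's current view of outstanding offers, as JSON
>>>>     offers = json.load(
>>>>         urllib.request.urlopen("http://localhost:8081/offers"))
>>>>     print(len(offers), "offers held by the scheduler")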
>>>>
>>>> Mohit.
>>>>
>>>
>>>
>>
>
