And I see spikes in log_storage_write_lock_wait_ns_per_event as well.
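
For reference, here is roughly how I am watching those two counters so I can
line the spikes up against the offer delays. This is only a rough sketch: it
assumes the scheduler's /vars endpoint is reachable at localhost:8081 and
that it returns plain "name value" lines; both the port and the output
format are guesses about the setup and may need adjusting.

    # Rough sketch: poll the scheduler's /vars endpoint and print the two
    # write-lock wait counters every few seconds.
    # Assumption: scheduler on localhost:8081, /vars returns "name value" lines.
    import time
    import urllib.request

    VARS_URL = "http://localhost:8081/vars"
    STATS = ("log_storage_write_lock_wait_ns_total",
             "log_storage_write_lock_wait_ns_per_event")

    def read_stats():
        body = urllib.request.urlopen(VARS_URL).read().decode()
        pairs = {}
        for line in body.splitlines():
            parts = line.split(None, 1)
            if len(parts) == 2:
                pairs[parts[0]] = parts[1]
        return {name: pairs.get(name) for name in STATS}

    if __name__ == "__main__":
        while True:
            print(time.strftime("%H:%M:%S"), read_stats())
            time.sleep(10)

Watching the per-event number alongside the total should show whether the
spikes come from a few very long waits or from many short ones.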

On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that
> cause or effect? :-)
>
> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>
>> Thanks Bill. Please see inline:
>>
>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfar...@apache.org> wrote:
>>
>>> I suspect they are getting enqueued
>>>
>>>
>>> Just to be sure - the offers do eventually get through though?
>>>
>>>
>> In one instance the offers did get through, but it took several minutes.
>> In other instances we reloaded the scheduler to let another one become the
>> leader.
>>
>>
>>> The most likely culprit is contention for the storage write lock,
>>> observable via spikes in the stat log_storage_write_lock_wait_ns_total.
>>>
>>
>> Thanks. I will check that one.
>>
>>
>>>
>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>>>> calls are being made at that time
>>>
>>>
>>> This sounds like API activity.  This shouldn't interfere with offer
>>> processing directly, but could potentially slow down the scheduler as a
>>> whole.
>>>
>>>
>> So these won't contend for locks with the offer processing and task
>> assignment threads? Only 8-10 of the 24 cores on the machine were in use.
>> I also noticed a spike in mybatis active and bad connections. I can't say
>> whether the spike in active connections was caused by the many bad
>> connections, or vice versa, or whether a third source caused both. Are
>> there any metrics or logs that might help here?
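>>
>> For my own notes, here is a rough sketch of how I plan to line these up:
>> it simply prints any stat whose name contains "mybatis" next to the
>> write-lock wait counter, since I am not sure of the exact stat names. It
>> assumes the scheduler's /vars endpoint at localhost:8081 returns plain
>> "name value" lines (both the port and the format are guesses).
>>
>>     # Rough sketch: print mybatis-related counters next to the write-lock
>>     # wait counter so the two spikes can be compared side by side.
>>     # Assumption: scheduler /vars at localhost:8081, plain "name value" lines.
>>     import time
>>     import urllib.request
>>
>>     VARS_URL = "http://localhost:8081/vars"
>>
>>     def snapshot():
>>         body = urllib.request.urlopen(VARS_URL).read().decode()
>>         for line in body.splitlines():
>>             parts = line.split(None, 1)
>>             if not parts:
>>                 continue
>>             name = parts[0]
>>             if "mybatis" in name or name == "log_storage_write_lock_wait_ns_total":
>>                 print(time.strftime("%H:%M:%S"), line)
>>
>>     if __name__ == "__main__":
>>         while True:
>>             snapshot()
>>             time.sleep(10)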
>>
>>
>>> I also notice a lot of "Timeout reached for task..." around the same
>>>> time. Can this happen if a task is in the PENDING state and does not
>>>> reach ASSIGNED due to a lack of offers?
>>>
>>>
>>> This is unusual.  Pending tasks are not timed out; the timeout applies to
>>> tasks in states where the scheduler is waiting for something else to act
>>> and does not hear back (via a status update).
>>>
>>
>> Perhaps they were in ASSIGNED or some other state. If status updates from
>> Mesos are being delayed or processed too slowly, could both of these
>> effects occur?
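>>
>> To check that, I plan to bucket the "Timeout reached for task..." log
>> lines by minute and line them up against the lock-wait spikes. A rough
>> sketch; it assumes the scheduler log is in a single file and that each
>> line starts with a glog-style timestamp such as "I1110 21:34:56"; the
>> slice below is a guess and would need to match the real log format.
>>
>>     # Rough sketch: count "Timeout reached for task" lines per minute in
>>     # the scheduler log, to see whether the timeouts cluster around the
>>     # offer delays.
>>     # Assumption: each line starts with a timestamp like "I1110 21:34:56";
>>     # adjust the slice below to the actual log format.
>>     from collections import Counter
>>     import sys
>>
>>     counts = Counter()
>>     with open(sys.argv[1]) as log:   # usage: python count_timeouts.py scheduler.log
>>         for line in log:
>>             if "Timeout reached for task" in line:
>>                 counts[line[:12]] += 1   # e.g. "I1110 21:34:" gives one bucket per minute
>>
>>     for bucket, n in sorted(counts.items()):
>>         print(bucket, n)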
>>
>>
>>>
>>> I suggest digging into the cause of the delayed offer processing first; I
>>> suspect it might be related to the task timeouts as well.
>>>
>>> version close to 0.18.0
>>>
>>>
>>> Is the ambiguity due to custom patches?  Can you at least indicate the
>>> last git SHA off aurora/master?  Digging much deeper into diagnosing this
>>> may prove tricky without knowing what code is in play.
>>>
>>>
>> c85bffd10 is the commit from which we forked. The custom patch is mainly
>> the dynamic reservation work done by Dmitri. We also have commits for the
>> offer/rescind race issue and the setrootfs patch (which is not upstreamed
>> yet).
>>
>> I have cherry-picked the fix for AURORA-1952 as well.
>>
>>
>>
>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>> wrote:
>>>
>>>> I also notice a lot of "Timeout reached for task..." around the same
>>>> time. Can this happen if a task is in the PENDING state and does not
>>>> reach ASSIGNED due to a lack of offers?
>>>>
>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>> wrote:
>>>>
>>>>> Folks,
>>>>> I have noticed some weird behavior in Aurora (version close to 0.18.0).
>>>>> Sometimes it shows no offers on the UI offers page, but if I tail the
>>>>> logs I can see that offers are coming in. I suspect they are getting
>>>>> enqueued for processing by the "executor" but stay there for a long time
>>>>> and are not processed, either due to locking or thread starvation.
>>>>>
>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>>>>> calls are being made at that time. Could these calls starve the
>>>>> OfferManager (e.g. by contending for some lock)? What should I be
>>>>> looking for to debug this condition further?
>>>>>
>>>>> Mohit.
>>>>>
>>>>
>>>>
>>>
>>
>
