Thanks Bill. We haven't been able to track down a specific root
cause (the ZK nodes are known to have issues now and then, but we don't
have logs for the specific outages we had). We plan to move to 0.19.x
soon. In addition, I want to let upstream software "know" that Aurora is
slowing down and that it should back off. To achieve this I want to send
5xx error codes back when update/rollback/kill etc. are called and certain
metrics (like log write lock wait time) indicate heavy load. Perhaps this
"defense" already exists?


On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wfar...@apache.org> wrote:

> The next level is to determine why the storage lock is being held.  Common
> causes include:
>
> 1. storage snapshot slowness, when scheduler state is very large, O(GB)
> 1a. long GC pauses in the scheduler, often induced by (1)
> 2. scheduler replicated log on slow disks
> 3. network issues between schedulers, schedulers to zookeeper, or between
> zookeepers
>
> As an immediate (partial) remedy, I suggest you upgrade to eliminate the
> use of SQL/mybatis in the scheduler.  This helped Twitter improve (1) and
> (1a).
>
> commit f2755e1
> Author: Bill Farner <wfar...@apache.org>
> Date:   Tue Oct 24 23:34:09 2017 -0700
>
>     Exclusively use Map-based in-memory stores for primary storage
>
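For correlating these stalls with (1)-(3) above, I am thinking of watching
the write lock wait counter from outside the scheduler with a small poller
along the lines of the sketch below. The host/port is a placeholder and it
assumes the scheduler's plain-text /vars page prints one "name value" pair
per line; the stat name is the one discussed further down in this thread.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/**
 * Rough poller: scrape the scheduler's /vars page and report how much the
 * storage write lock wait counter grew between scrapes. A large delta while
 * offers are stalled points at lock contention (snapshots, GC, slow
 * replicated-log disks) rather than API load alone.
 */
public class WriteLockWaitPoller {

  // Placeholder address; point this at the leading scheduler.
  private static final String VARS_URL = "http://aurora-scheduler.example.com:8081/vars";
  private static final String STAT = "log_storage_write_lock_wait_ns_total";
  private static final long SCRAPE_INTERVAL_MS = 15_000;

  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    double previous = -1;

    while (true) {
      HttpRequest request = HttpRequest.newBuilder(URI.create(VARS_URL)).GET().build();
      String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

      for (String line : body.split("\n")) {
        if (line.startsWith(STAT + " ")) {
          double current = Double.parseDouble(line.substring(STAT.length()).trim());
          if (previous >= 0) {
            // Time callers spent queued on the storage write lock since the
            // last scrape, in nanoseconds.
            System.out.printf("%s grew by %.0f ns over the last %d ms%n",
                STAT, current - previous, SCRAPE_INTERVAL_MS);
          }
          previous = current;
        }
      }
      Thread.sleep(SCRAPE_INTERVAL_MS);
    }
  }
}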
>
> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mohit.ja...@uber.com>
> wrote:
>
>> and in log_storage_write_lock_wait_ns_per_event
>>
>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mohit.ja...@uber.com>
>> wrote:
>>
>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that
>>> cause or effect? :-)
>>>
>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>> wrote:
>>>
>>>> Thanks Bill. Please see inline:
>>>>
>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfar...@apache.org>
>>>> wrote:
>>>>
>>>>> I suspect they are getting enqueued
>>>>>
>>>>>
>>>>> Just to be sure - the offers do eventually get through though?
>>>>>
>>>>>
>>>> In one instance the offers did get through but it took several minutes.
>>>> In other instances we reloaded the scheduler to let another one become the
>>>> leader.
>>>>
>>>>
>>>>> The most likely culprit is contention for the storage write lock,
>>>>> observable via spikes in stat log_storage_write_lock_wait_ns_total.
>>>>>
>>>>
>>>> Thanks. I will check that one.
>>>>
>>>>
>>>>>
>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>>>>>> calls are being made at that time
>>>>>
>>>>>
>>>>> This sounds like API activity.  This shouldn't interfere with offer
>>>>> processing directly, but could potentially slow down the scheduler as a
>>>>> whole.
>>>>>
>>>>>
>>>> So these won't contend for locks with the offer processing and task
>>>> assignment threads? Only 8-10 out of 24 cores were being used on the
>>>> machine. I also noticed a spike in mybatis active and bad connections.
>>>> I can't say whether the spike in active connections was caused by the
>>>> bad connections, or vice versa, or whether a third source caused both.
>>>> Are there any metrics or logs that might help here?
>>>>
>>>>
>>>>> I also notice a lot of "Timeout reached for task..." around the same
>>>>>> time. Can this happen if task is in PENDING state and does not reach
>>>>>> ASSIGNED due to lack of offers?
>>>>>
>>>>>
>>>>> This is unusual.  Pending tasks are not timed out; this applies to
>>>>> tasks in states where the scheduler is waiting for something else to act
>>>>> and it does not hear back (via a status update).
>>>>>
>>>>
>>>> Perhaps they were in ASSIGNED or some other state. If updates from
>>>> Mesos are being delayed or processed too slowly, would both of these
>>>> effects occur?
>>>>
>>>>
>>>>>
>>>>> I suggest digging into the cause of delayed offer processing first; I
>>>>> suspect it might be related to the task timeouts as well.
>>>>>
>>>>> version close to 0.18.0
>>>>>
>>>>>
>>>>> Is the ambiguity due to custom patches?  Can you at least indicate
>>>>> the last git SHA off aurora/master?  Digging much deeper into
>>>>> diagnosing this may prove tricky without knowing what code is in play.
>>>>>
>>>>>
>>>> c85bffd10 is the commit from which we forked.
>>>>
>>>> The custom patch is mainly the dynamic reservation work done by
>>>> Dmitri. We also have commits for the offer/rescind race issue and the
>>>> setrootfs patch (which is not upstreamed yet).
>>>>
>>>> I have cherry-picked the fix for AURORA-1952 as well.
>>>>
>>>>
>>>>
>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>>> wrote:
>>>>>
>>>>>> I also notice a lot of "Timeout reached for task..." around the same
>>>>>> time. Can this happen if task is in PENDING state and does not reach
>>>>>> ASSIGNED due to lack of offers?
>>>>>>
>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Folks,
>>>>>>> I have noticed some weird behavior in Aurora (version close to
>>>>>>> 0.18.0). Sometimes it shows no offers on the UI offers page, but if I
>>>>>>> tail the logs I can see offers coming in. I suspect they are getting
>>>>>>> enqueued for processing by the "executor" but stay there for a long
>>>>>>> time and are not processed, either due to locking or thread starvation.
>>>>>>>
>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>> getTasksWithoutConfigs() calls are being made at that time. Could these
>>>>>>> calls starve the OfferManager (e.g., by contending for some lock)? What
>>>>>>> should I be looking for to debug this condition further?
>>>>>>>
>>>>>>> Mohit.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
