Do you mean there is a shared ZK connection for leadership and log replication? I don't see "Lost leadership, committing suicide" during the outage. I do see it at other times.
On Wed, Nov 29, 2017 at 1:52 PM, Bill Farner <wfar...@apache.org> wrote:

> - Does log replication "maintain" ZK connections and suffer when a NIC flaps?

Maintain, yes. Shouldn't be impacted unless the ZK session expires, which would trigger a scheduler failover.

> - If only 1 of 5 ZK's have this issue, could there still be a problem?

Assuming this means 5 quorum members - no, that should not be a problem.

If any of the above became an issue for the scheduler, it should certainly manifest in logs.

On Wed, Nov 29, 2017 at 1:26 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

Thanks Bill. It would be the latter, as this back-pressure is only needed for calls that change state. Read-only calls should be quite quick to serve.

One potential correlated outage we may have had is a NIC flap on a Zookeeper node. The following questions come to mind:
- Does log replication "maintain" ZK connections and suffer when a NIC flaps?
- If only 1 of 5 ZK's have this issue, could there still be a problem?

On Wed, Nov 29, 2017 at 11:08 AM, Bill Farner <wfar...@apache.org> wrote:

> is there a place I can inject a pre-processor for the API calls

There is no off-the-shelf way to do this. You could intercept API calls as HTTP requests similar to the CORS filter <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/http/api/ApiModule.java#L73>. If you wanted to intercept specific calls and/or introspect arguments, you would be better off binding a layer for AuroraAdmin.Iface <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ThriftModule.java#L30>.

On Tue, Nov 28, 2017 at 11:46 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

I agree with that. I also believe that the scheduler should be resilient in the presence of external faults. Systems that export an API must take defensive steps to protect themselves.

If I wanted to experiment with this change without modifying Aurora code "inline", is there a place I can inject a pre-processor for the API calls?

On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <wfar...@apache.org> wrote:

I'd also suggest focusing on the source of the congestion. Aurora should offer quite high scheduling throughput, and I would rather focus energy on addressing bottlenecks.

On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

I think more explicit signaling is better. Increased latency can be due to other conditions like network issues etc. Right now our mitigation involves load-shedding and we would rather have load-avoidance. Indeed, a proxy is not a good option. Only Aurora "knows" when it wants to back-pressure.

On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <dmclaugh...@apache.org> wrote:

Any log write latency will be reflected in the overall latency of the request. Increased request latency is one of the main ways any server has of telling a client that it's under load. It's then up to the client to react to this.

If you want to throw error codes, you can put a proxy in front of Aurora that has request timeouts - which would send 503s to clients. But the issue with that is the requests are mostly non-idempotent, so you'll need to build reconciliation logic into it.
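A minimal sketch of the HTTP-level interception discussed above: a servlet filter that sheds state-changing calls with a 503 when a load signal crosses a threshold. The LoadSignal interface, the isMutating predicate, and the threshold are placeholders, not Aurora classes; deciding which Thrift calls actually change state is the part Bill suggests doing at the AuroraAdmin.Iface layer instead.

// Hypothetical back-pressure filter, written against the plain Servlet API.
// Aurora wires its own filters via Guice/Jetty (see ApiModule); LoadSignal and
// isMutating below are illustrative placeholders, not Aurora classes.
import java.io.IOException;
import java.util.function.Predicate;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class BackPressureFilter implements Filter {
  /** Placeholder for a load metric, e.g. recent storage write-lock wait per second. */
  public interface LoadSignal {
    double writeLockWaitMsPerSec();
  }

  private final LoadSignal signal;
  private final Predicate<HttpServletRequest> isMutating;
  private final double maxWaitMsPerSec;

  public BackPressureFilter(
      LoadSignal signal,
      Predicate<HttpServletRequest> isMutating,
      double maxWaitMsPerSec) {
    this.signal = signal;
    this.isMutating = isMutating;
    this.maxWaitMsPerSec = maxWaitMsPerSec;
  }

  @Override
  public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest http = (HttpServletRequest) req;
    // Shed only state-changing calls while the scheduler is under heavy lock
    // contention; read-only calls pass through (per the thread above).
    if (isMutating.test(http) && signal.writeLockWaitMsPerSec() > maxWaitMsPerSec) {
      HttpServletResponse out = (HttpServletResponse) resp;
      out.setHeader("Retry-After", "5");
      out.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE, "scheduler under load, retry later");
      return;
    }
    chain.doFilter(req, resp);
  }

  @Override public void init(FilterConfig config) { }
  @Override public void destroy() { }
}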
On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

Imagine something like Spinnaker using Aurora underneath to schedule services. That layer often "amplifies" human effort and may result in a lot of load on Aurora. Usually that is fine, but if Aurora slowed down due to transient problems, it can signal that to upstream software in the same way that busy web servers do during Cyber Monday sales :-)

On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wfar...@apache.org> wrote:

> I want to let upstream software "know" that Aurora is slowing down and that it should back off

Can you offer more detail about how Aurora is being used in this regard? I haven't seen use cases in the past that would be amenable to this behavior, so I would like to understand better.

On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

Thanks Bill. We haven't been able to track down a specific root cause (the ZK node is known to have issues now and then, but we don't have logs for the specific outages we had). We will plan to move to 0.19.x soon. In addition, I want to let upstream software "know" that Aurora is slowing down and that it should back off. To achieve this I want to send 5xx error codes back when update/rollback/kill etc. are called and certain metrics (like log write lock wait time) indicate heavy load. Perhaps this "defense" already exists?

On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wfar...@apache.org> wrote:

The next level is to determine why the storage lock is being held. Common causes include:

1. storage snapshot slowness, when scheduler state is very large, O(gb)
1a. long GC pauses in the scheduler, often induced by (1)
2. scheduler replicated log on slow disks
3. network issues between schedulers, schedulers to zookeeper, or between zookeepers

As an immediate (partial) remedy, I suggest you upgrade to eliminate the use of SQL/mybatis in the scheduler. This helped Twitter improve (1) and (1a).

commit f2755e1
Author: Bill Farner <wfar...@apache.org>
Date: Tue Oct 24 23:34:09 2017 -0700

    Exclusively use Map-based in-memory stores for primary storage

On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

and in log_storage_write_lock_wait_ns_per_event

On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that cause or effect? :-)
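A small sketch of turning the stat mentioned above into a usable load signal: poll log_storage_write_lock_wait_ns_total and convert the delta between polls into write-lock wait accrued per wall-clock second. This assumes the scheduler's plain-text /vars endpoint (one "name value" pair per line); adjust if your build exposes stats differently.

// Sketch of a poller that derives a back-pressure signal from the stat the
// thread discusses. The /vars parsing here is an assumption about the stats
// endpoint format, not a committed Aurora interface.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class WriteLockWaitPoller {
  private static final String STAT = "log_storage_write_lock_wait_ns_total";

  private final URL varsUrl;
  private long lastValue = -1;
  private long lastTimestampMs = -1;

  public WriteLockWaitPoller(String schedulerBase) throws Exception {
    this.varsUrl = new URL(schedulerBase + "/vars");
  }

  /** Returns write-lock wait in ms accrued per wall-clock second since the last poll. */
  public double poll() throws Exception {
    long value = readStat();
    long now = System.currentTimeMillis();
    double rate = 0;
    if (lastValue >= 0 && now > lastTimestampMs) {
      double waitMs = (value - lastValue) / 1e6;   // ns -> ms
      rate = waitMs / ((now - lastTimestampMs) / 1000.0);
    }
    lastValue = value;
    lastTimestampMs = now;
    return rate;
  }

  private long readStat() throws Exception {
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(varsUrl.openStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        if (line.startsWith(STAT + " ")) {
          return Long.parseLong(line.substring(STAT.length() + 1).trim());
        }
      }
    }
    throw new IllegalStateException(STAT + " not found in /vars");
  }
}

The per-second rate from poll() is the kind of value a shedding threshold (as in the filter sketch earlier in the thread) could be compared against.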
On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

Thanks Bill. Please see inline:

On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfar...@apache.org> wrote:

>> I suspect they are getting enqueued
>
> Just to be sure - the offers do eventually get through though?

In one instance the offers did get through, but it took several minutes. In other instances we reloaded the scheduler to let another one become the leader.

> The most likely culprit is contention for the storage write lock, observable via spikes in stat log_storage_write_lock_wait_ns_total.

Thanks. I will check that one.

>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time
>
> This sounds like API activity. This shouldn't interfere with offer processing directly, but could potentially slow down the scheduler as a whole.

So these won't contend for locks with offer processing and task assignment threads? Only 8-10 out of 24 cores were being used on the machine. I also noticed a spike in mybatis active and bad connections. Can't say if the spike in active is due to many bad connections, or vice versa, or there was a 3rd source causing both of these. Are there any metrics or logs that might help here?

>> I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if task is in PENDING state and does not reach ASSIGNED due to lack of offers?
>
> This is unusual. Pending tasks are not timed out; this applies to tasks in states where the scheduler is waiting for something else to act and it does not hear back (via a status update).

Perhaps they were in ASSIGNED or some other state. If updates from Mesos are being delayed or processed too slowly, both these effects will occur?

> I suggest digging into the cause of delayed offer processing first, I suspect it might be related to the task timeouts as well.
>
>> version close to 0.18.0
>
> Is the ambiguity due to custom patches? Can you at least indicate the last git SHA off aurora/master? Digging much deeper in diagnosing this may prove tricky without knowing what code is in play.
- c85bffd10 is the commit from which we forked.
- The custom patch is mainly the dynamic reservation work done by Dmitri. We also have commits for the offer/rescind race issue and the setrootfs patch (which is not upstreamed yet).

I have cherrypicked the fix for AURORA-1952 as well.

On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if task is in PENDING state and does not reach ASSIGNED due to lack of offers?

On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

Folks,
I have noticed some weird behavior in Aurora (version close to 0.18.0). Sometimes it shows no offers in the UI offers page, but if I tail the logs I can see offers are coming in. I suspect they are getting enqueued for processing by "executor" but stay there for a long time and are not processed, either due to locking or thread starvation.

I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time. Could these calls starve the OfferManager (e.g. by contending for some lock)? What should I be looking for to debug this condition further?

Mohit.
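On the upstream side of the back-pressure idea discussed in this thread, a rough sketch of a client that reacts to 503s by backing off exponentially instead of retrying immediately; the URL and payload handling are placeholders. As David notes above, blindly retrying non-idempotent calls still needs reconciliation logic on top of this.

// Sketch of how an upstream caller (e.g. a deploy orchestrator) could honor
// 503 back-pressure from the scheduler. Uses java.net.http (Java 11+); the
// endpoint and request body are illustrative only.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class BackoffClient {
  private static final int MAX_ATTEMPTS = 6;

  public static HttpResponse<String> postWithBackoff(String url, String body)
      throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    long delayMs = 500;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      HttpRequest request = HttpRequest.newBuilder(URI.create(url))
          .timeout(Duration.ofSeconds(30))
          .POST(HttpRequest.BodyPublishers.ofString(body))
          .build();
      HttpResponse<String> response =
          client.send(request, HttpResponse.BodyHandlers.ofString());
      if (response.statusCode() != 503) {
        return response;  // success or a non-retryable error; let the caller decide
      }
      // Scheduler signaled overload; back off before the next attempt.
      Thread.sleep(delayMs);
      delayMs = Math.min(delayMs * 2, 30_000);
    }
    throw new IllegalStateException("scheduler still overloaded after retries: " + url);
  }
}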