> Do you mean there is a shared ZK connection for leadership and log replication?
Nope, different connections. The log is all managed through code linked from libmesos, including its use of ZK. Here's an example of some logs from this code:

I1129 15:56:00.560479 9316 group.cpp:341] Group process (zookeeper-group(1)@192.168.33.7:8083) connected to ZooKeeper
I1129 15:56:00.560817 9316 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I1129 15:56:00.561249 9316 group.cpp:419] Trying to create path '/aurora/replicated-log' in ZooKeeper

Notice the group.cpp. You'll also see relevant logs coming from log.cpp and replica.cpp.

On Wed, Nov 29, 2017 at 3:25 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> Do you mean there is a shared ZK connection for leadership and log replication? I don't see "Lost leadership, committing suicide" during the outage. I do see it at other times.
>
> On Wed, Nov 29, 2017 at 1:52 PM, Bill Farner <wfar...@apache.org> wrote:
>
>> > - Does log replication "maintain" ZK connections and suffer when a NIC flaps?
>>
>> Maintain, yes. Shouldn't be impacted unless the ZK session expires, which would trigger a scheduler failover.
>>
>> > - If only 1 of 5 ZKs has this issue, could there still be a problem?
>>
>> Assuming this means 5 quorum members - no, that should not be a problem.
>>
>> If any of the above became an issue for the scheduler, it should certainly manifest in logs.
>>
>> On Wed, Nov 29, 2017 at 1:26 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>
>>> Thanks Bill. It would be the latter, as this back-pressure is only needed for calls that change state. Read-only calls should be quite quick to serve.
>>>
>>> One potential correlated outage we may have had is a NIC flap on a Zookeeper node. The following questions come to mind:
>>> - Does log replication "maintain" ZK connections and suffer when a NIC flaps?
>>> - If only 1 of 5 ZKs has this issue, could there still be a problem?
>>>
>>> On Wed, Nov 29, 2017 at 11:08 AM, Bill Farner <wfar...@apache.org> wrote:
>>>
>>>> > is there a place I can inject a pre-processor for the API calls
>>>>
>>>> There is no off-the-shelf way to do this. You could intercept API calls as HTTP requests similar to the CORS filter <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/http/api/ApiModule.java#L73>. If you wanted to intercept specific calls and/or introspect arguments, you would be better off binding a layer for AuroraAdmin.Iface <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ThriftModule.java#L30>.
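(A rough sketch of the HTTP-filter approach described above. It is hypothetical, not existing Aurora code: SchedulerLoadSignal is an assumed hook you would back with a load metric such as log write lock wait time, and the filter would be registered on the API path the same way ApiModule registers the CORS filter. Because the scheduler's Thrift calls all arrive over the same HTTP endpoint, shedding only state-changing calls would need the AuroraAdmin.Iface-level layer Bill mentions.)

```java
// Hypothetical back-pressure filter; not existing Aurora code.
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

// Assumed hook, e.g. backed by the log write lock wait stats discussed later in this thread.
interface SchedulerLoadSignal {
  boolean isOverloaded();
}

public class BackPressureFilter implements Filter {
  private final SchedulerLoadSignal load;

  public BackPressureFilter(SchedulerLoadSignal load) {
    this.load = load;
  }

  @Override public void init(FilterConfig config) {}

  @Override
  public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
      throws IOException, ServletException {
    if (load.isOverloaded()) {
      // Tell the caller to back off instead of queueing more work on the scheduler.
      HttpServletResponse httpResp = (HttpServletResponse) resp;
      httpResp.setHeader("Retry-After", "30");
      httpResp.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE,
          "Scheduler is overloaded, please retry later");
      return;
    }
    // Normal case: pass the request through untouched.
    chain.doFilter(req, resp);
  }

  @Override public void destroy() {}
}
```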
>>>> On Tue, Nov 28, 2017 at 11:46 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>
>>>>> I agree with that. I also believe that the scheduler should be resilient in the presence of external faults. Systems that export an API must take defensive steps to protect themselves.
>>>>>
>>>>> If I wanted to experiment with this change without modifying Aurora code "inline", is there a place I can inject a pre-processor for the API calls?
>>>>>
>>>>> On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>
>>>>>> I'd also suggest focusing on the source of the congestion. Aurora should offer quite high scheduling throughput, and I would rather focus energy on addressing bottlenecks.
>>>>>>
>>>>>> On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>
>>>>>>> I think more explicit signaling is better. Increased latency can be due to other conditions like network issues etc. Right now our mitigation involves load-shedding and we would rather have load-avoidance. Indeed, a proxy is not a good option. Only Aurora "knows" when it wants to back-pressure.
>>>>>>>
>>>>>>> On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <dmclaugh...@apache.org> wrote:
>>>>>>>
>>>>>>>> Any log write latency will be reflected in the overall latency of the request. Increased request latency is one of the main ways any server has of telling a client that it's under load. It's then up to the client to react to this.
>>>>>>>>
>>>>>>>> If you want to throw error codes, you can put a proxy in front of Aurora that has request timeouts - which would send 503s to clients. But the issue with that is the requests are mostly non-idempotent, so you'll need to build reconciliation logic into it.
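(To make "it's up to the client to react" concrete, here is an illustrative sketch of what upstream-side handling could look like: plain exponential backoff while the scheduler returns 503, using the JDK HTTP client. Nothing here is Aurora or Spinnaker code, and as David notes, blindly retrying is only safe if the scheduler rejected the request before acting on it; otherwise the caller needs reconciliation logic.)

```java
// Illustrative upstream-side backoff on 503; not Aurora or Spinnaker code.
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BackoffCaller {
  private static final int MAX_ATTEMPTS = 5;

  /** Sends the request, backing off exponentially while the scheduler answers 503. */
  public static HttpResponse<String> sendWithBackoff(HttpClient client, HttpRequest request)
      throws Exception {
    long sleepMs = 500;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
      if (response.statusCode() != 503) {
        return response;  // success or a non-retryable error; the caller decides what to do
      }
      // The scheduler is shedding load: wait, then try again with a doubled delay.
      Thread.sleep(sleepMs);
      sleepMs *= 2;
    }
    throw new IllegalStateException("Scheduler still overloaded after " + MAX_ATTEMPTS + " attempts");
  }
}
```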
>>>>>>>> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>
>>>>>>>>> Imagine something like Spinnaker using Aurora underneath to schedule services. That layer often "amplifies" human effort and may result in a lot of load on Aurora. Usually that is fine, but if Aurora slowed down due to transient problems, it could signal that to upstream software in the same way that busy web servers do during Cyber Monday sales :-)
>>>>>>>>>
>>>>>>>>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> > I want to let upstream software "know" that Aurora is slowing down and that it should back off
>>>>>>>>>>
>>>>>>>>>> Can you offer more detail about how Aurora is being used in this regard? I haven't seen use cases in the past that would be amenable to this behavior, so I would like to understand better.
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Bill. We haven't been able to track down a specific root cause (the ZK node is known to have issues now and then, but we don't have logs for the specific outages we had). We plan to move to 0.19.x soon. In addition, I want to let upstream software "know" that Aurora is slowing down and that it should back off. To achieve this I want to send 5xx error codes back when update/rollback/kill etc. are called and certain metrics (like log write lock wait time) indicate heavy load. Perhaps this "defense" already exists?
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The next level is to determine why the storage lock is being held. Common causes include:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. storage snapshot slowness, when scheduler state is very large, O(GB)
>>>>>>>>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>>>>>>>>> 2. scheduler replicated log on slow disks
>>>>>>>>>>>> 3. network issues between schedulers, schedulers to zookeeper, or between zookeepers
>>>>>>>>>>>>
>>>>>>>>>>>> As an immediate (partial) remedy, I suggest you upgrade to eliminate the use of SQL/mybatis in the scheduler. This helped Twitter improve (1) and (1a).
>>>>>>>>>>>>
>>>>>>>>>>>> commit f2755e1
>>>>>>>>>>>> Author: Bill Farner <wfar...@apache.org>
>>>>>>>>>>>> Date: Tue Oct 24 23:34:09 2017 -0700
>>>>>>>>>>>>
>>>>>>>>>>>>     Exclusively use Map-based in-memory stores for primary storage
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that cause or effect? :-)
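(For watching these spikes, a minimal sketch that polls the scheduler and prints the counter along with its per-interval delta. It assumes the stat is exported as a plain "name value" line on the scheduler's /vars endpoint, and the host and port below are placeholders; adjust to your deployment.)

```java
// Minimal sketch: poll the scheduler and print log_storage_write_lock_wait_ns_total
// once a minute. Host/port and the "name value" line format are assumptions.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class WriteLockWaitWatcher {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder(
        URI.create("http://aurora-scheduler.example.com:8081/vars")).GET().build();

    long previous = -1;
    while (true) {
      String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
      for (String line : body.split("\n")) {
        if (line.startsWith("log_storage_write_lock_wait_ns_total ")) {
          long current = Long.parseLong(line.split(" ")[1].trim());
          long delta = previous < 0 ? 0 : current - previous;
          // A steadily growing delta points at storage write lock contention.
          System.out.println("write lock wait ns total=" + current + " delta=" + delta);
          previous = current;
        }
      }
      Thread.sleep(60_000);
    }
  }
}
```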
>>>>>>>>>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks Bill. Please see inline:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > I suspect they are getting enqueued
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In one instance the offers did get through, but it took several minutes. In other instances we reloaded the scheduler to let another one become the leader.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The most likely culprit is contention for the storage write lock, observable via spikes in the stat log_storage_write_lock_wait_ns_total.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks. I will check that one.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This sounds like API activity. This shouldn't interfere with offer processing directly, but could potentially slow down the scheduler as a whole.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So these won't contend for locks with offer processing and task assignment threads? Only 8-10 out of 24 cores were being used on the machine. I also noticed a spike in mybatis active and bad connections. I can't say whether the spike in active connections was due to many bad connections, or vice versa, or whether a third source caused both. Are there any metrics or logs that might help here?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if a task is in PENDING state and does not reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is unusual. Pending tasks are not timed out; this applies to tasks in states where the scheduler is waiting for something else to act and it does not hear back (via a status update).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates from Mesos are being delayed or processed too slowly, would both these effects occur?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I suggest digging into the cause of delayed offer processing first; I suspect it might be related to the task timeouts as well.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > version close to 0.18.0
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is the ambiguity due to custom patches? Can you at least indicate the last git SHA off aurora/master? Digging much deeper in diagnosing this may prove tricky without knowing what code is in play.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> c85bffd10 is the commit from which we forked. The custom patch is mainly the dynamic reservation work done by Dmitri. We also have commits for the offer/rescind race issue and the setrootfs patch (which is not upstreamed yet). I have cherry-picked the fix for AURORA-1952 as well.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if a task is in PENDING state and does not reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Folks,
>>>>>>>>>>>>>>>>>> I have noticed some weird behavior in Aurora (version close to 0.18.0). Sometimes it shows no offers in the UI offers page, but if I tail the logs I can see offers coming in. I suspect they are getting enqueued for processing by "executor" but stay there for a long time and are not processed, either due to locking or thread starvation.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time. Could these calls starve the OfferManager (e.g. by contending for some lock)? What should I be looking for to debug this condition further?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Mohit.