>
> - Does log replication "maintain" ZK connections and suffer when a NIC
> flaps?


Maintain, yes.  Shouldn't be impacted unless the ZK session expires, which
would trigger a scheduler failover.

- If only 1 of 5 ZKs has this issue, could there still be a problem?


Assuming this means 5 quorum members - no, that should not be a problem.

If any of the above became an issue for the scheduler, it should certainly
manifest in logs.


On Wed, Nov 29, 2017 at 1:26 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> Thanks Bill. It would be the latter, as this back-pressure is only needed
> for calls that change state. Read-only calls should be quite quick to
> serve.
>
> One potential correlated outage we may have had is a NIC flap on a
> Zookeeper node. The following questions come to mind:
> - Does log replication "maintain" ZK connections and suffer when a NIC
> flaps?
> - If only 1 of 5 ZKs has this issue, could there still be a problem?
>
> On Wed, Nov 29, 2017 at 11:08 AM, Bill Farner <wfar...@apache.org> wrote:
>
>> is there a place I can inject a pre-processor for the API calls
>>
>>
>> There is no off-the-shelf way to do this.  You could intercept API calls
>> as HTTP requests similar to the CORS filter
>> <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/http/api/ApiModule.java#L73>.
>> If you wanted to intercept specific calls and/or introspect arguments, you
>> would be better off binding a layer for AuroraAdmin.Iface
>> <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ThriftModule.java#L30>.
>>
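>> For illustration, a minimal sketch of the HTTP-filter route, assuming you
>> register it through your own Guice/servlet wiring; the class name, the
>> overload flag, and the registration are illustrative and not existing
>> Aurora code:
>>
>>   // Hypothetical filter gating mutating API calls while the scheduler is
>>   // overloaded. The overload signal is a placeholder; registration of the
>>   // filter is assumed and not shown here.
>>   import java.io.IOException;
>>   import javax.servlet.Filter;
>>   import javax.servlet.FilterChain;
>>   import javax.servlet.FilterConfig;
>>   import javax.servlet.ServletException;
>>   import javax.servlet.ServletRequest;
>>   import javax.servlet.ServletResponse;
>>   import javax.servlet.http.HttpServletResponse;
>>
>>   public class BackPressureFilter implements Filter {
>>     private volatile boolean overloaded = false;  // placeholder load signal
>>
>>     @Override
>>     public void init(FilterConfig config) {}
>>
>>     @Override
>>     public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
>>         throws IOException, ServletException {
>>       if (overloaded && resp instanceof HttpServletResponse) {
>>         // Ask callers to back off with a retryable status.
>>         ((HttpServletResponse) resp).sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
>>         return;
>>       }
>>       chain.doFilter(req, resp);
>>     }
>>
>>     @Override
>>     public void destroy() {}
>>   }
>>
>> Binding a decorating layer for AuroraAdmin.Iface instead would give you
>> typed arguments rather than raw HTTP, at the cost of touching the Thrift
>> wiring.
>>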
>> On Tue, Nov 28, 2017 at 11:46 AM, Mohit Jaggi <mohit.ja...@uber.com>
>> wrote:
>>
>>> I agree with that. I also believe that the scheduler should be resilient
>>> in the presence of external faults. Systems that export an API must take
>>> defensive steps to protect themselves.
>>>
>>> If I wanted to experiment with this change without modifying Aurora code
>>> "inline", is there a place I can inject a pre-processor for the API calls?
>>>
>>> On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <wfar...@apache.org> wrote:
>>>
>>>> I'd also suggest focusing on the source of the congestion.  Aurora
>>>> should offer quite high scheduling throughput, and I would rather focus
>>>> energy on addressing bottlenecks.
>>>>
>>>> On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>> wrote:
>>>>
>>>>> I think more explicit signaling is better. Increased latency can be
>>>>> due to other conditions like network issues, etc. Right now our mitigation
>>>>> involves load-shedding, and we would rather have load-avoidance. Indeed, a
>>>>> proxy is not a good option. Only Aurora "knows" when it wants to apply
>>>>> back-pressure.
>>>>>
>>>>> On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <
>>>>> dmclaugh...@apache.org> wrote:
>>>>>
>>>>>> Any log write latency will be reflected in the overall latency of the
>>>>>> request. Increased request latency is one of the main ways any server has
>>>>>> of telling a client that it's under load. It's then up to the client to
>>>>>> react to this.
>>>>>>
>>>>>> If you want to throw error codes, you can put a proxy in front of
>>>>>> Aurora that has request timeouts - which would send 503s to clients. But
>>>>>> the issue with that is that the requests are mostly non-idempotent, so
>>>>>> you'll need to build reconciliation logic into it.
>>>>>>
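>>>>>> As a rough sketch of the client side of this - the class name, endpoint
>>>>>> URL, and thresholds below are illustrative, not an existing Aurora
>>>>>> client - reacting to 503s or slow responses with exponential backoff
>>>>>> could look like:
>>>>>>
>>>>>>   // Illustrative client-side backoff: retry a mutating call when the
>>>>>>   // server answers 503 or responds slowly. Uses java.net.http (Java 11+);
>>>>>>   // the endpoint and payload are placeholders.
>>>>>>   import java.net.URI;
>>>>>>   import java.net.http.HttpClient;
>>>>>>   import java.net.http.HttpRequest;
>>>>>>   import java.net.http.HttpResponse;
>>>>>>   import java.net.http.HttpTimeoutException;
>>>>>>   import java.time.Duration;
>>>>>>
>>>>>>   public final class BackoffClient {
>>>>>>     public static void main(String[] args) throws Exception {
>>>>>>       HttpClient client = HttpClient.newHttpClient();
>>>>>>       HttpRequest request = HttpRequest
>>>>>>           .newBuilder(URI.create("http://scheduler.example.com/api"))
>>>>>>           .timeout(Duration.ofSeconds(2))  // treat slowness as back-pressure
>>>>>>           .POST(HttpRequest.BodyPublishers.ofString("{}"))  // placeholder body
>>>>>>           .build();
>>>>>>
>>>>>>       long backoffMs = 100;
>>>>>>       for (int attempt = 0; attempt < 5; attempt++) {
>>>>>>         try {
>>>>>>           HttpResponse<String> response =
>>>>>>               client.send(request, HttpResponse.BodyHandlers.ofString());
>>>>>>           if (response.statusCode() != 503) {
>>>>>>             return;  // the server is keeping up; done
>>>>>>           }
>>>>>>         } catch (HttpTimeoutException timedOut) {
>>>>>>           // Fall through and back off: latency itself is the overload signal.
>>>>>>         }
>>>>>>         Thread.sleep(backoffMs);
>>>>>>         backoffMs *= 2;  // exponential backoff before retrying
>>>>>>       }
>>>>>>     }
>>>>>>   }
>>>>>>
>>>>>> For the mostly non-idempotent calls mentioned above, the retry step would
>>>>>> first need to check whether the original request actually took effect.
>>>>>>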
>>>>>> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Imagine something like Spinnaker using Aurora underneath to schedule
>>>>>>> services. That layer often "amplifies" human effort and may result in a
>>>>>>> lot of load on Aurora. Usually that is fine, but if Aurora slowed down due
>>>>>>> to transient problems, it could signal that to upstream software in the
>>>>>>> same way that busy web servers do during Cyber Monday sales :-)
>>>>>>>
>>>>>>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wfar...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I want to let upstream software "know" that Aurora is slowing down
>>>>>>>>> and that it should back off
>>>>>>>>
>>>>>>>>
>>>>>>>> Can you offer more detail about how Aurora is being used in this
>>>>>>>> regard?  I haven't seen use cases in the past that would be amenable to
>>>>>>>> this behavior, so I would like to understand better.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mohit.ja...@uber.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> Thanks Bill. We haven't been able to track down a specific root
>>>>>>>>> cause (although the ZK node is known to have issues now and then, we
>>>>>>>>> don't have logs for the specific outages we had). We plan to move to
>>>>>>>>> 0.19.x soon. In addition, I want to let upstream software "know" that
>>>>>>>>> Aurora is slowing down and that it should back off. To achieve this I
>>>>>>>>> want to send 5xx error codes back when update/rollback/kill etc. are
>>>>>>>>> called and certain metrics (like log write lock wait time) indicate
>>>>>>>>> heavy load. Perhaps this "defense" already exists?
>>>>>>>>>
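>>>>>>>>> For illustration only, a rough sketch of the kind of check such a
>>>>>>>>> defense could hinge on; the StatReader interface below is hypothetical
>>>>>>>>> and stands in for however the stat value is obtained (it is not an
>>>>>>>>> actual Aurora API):
>>>>>>>>>
>>>>>>>>>   // Hypothetical gate for mutating RPCs (update/rollback/kill): answer
>>>>>>>>>   // with a 5xx upstream when recent log write lock waits look unhealthy.
>>>>>>>>>   interface StatReader {
>>>>>>>>>     long read(String statName);
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>>   final class WriteLockBackPressure {
>>>>>>>>>     private final StatReader stats;
>>>>>>>>>     private final long maxWaitNsPerEvent;
>>>>>>>>>
>>>>>>>>>     WriteLockBackPressure(StatReader stats, long maxWaitNsPerEvent) {
>>>>>>>>>       this.stats = stats;
>>>>>>>>>       this.maxWaitNsPerEvent = maxWaitNsPerEvent;
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>>     // True when the average write lock wait suggests the scheduler is
>>>>>>>>>     // overloaded and mutating calls should be rejected.
>>>>>>>>>     boolean shouldReject() {
>>>>>>>>>       return stats.read("log_storage_write_lock_wait_ns_per_event")
>>>>>>>>>           > maxWaitNsPerEvent;
>>>>>>>>>     }
>>>>>>>>>   }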
>>>>>>>>>
>>>>>>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wfar...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> The next level is to determine why the storage lock is being
>>>>>>>>>> held.  Common causes include:
>>>>>>>>>>
>>>>>>>>>> 1. storage snapshot slowness, when scheduler state is very large,
>>>>>>>>>> O(gb)
>>>>>>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>>>>>>> 2. scheduler replicated log on slow disks
>>>>>>>>>> 3. network issues between schedulers, schedulers to zookeeper, or
>>>>>>>>>> between zookeepers
>>>>>>>>>>
>>>>>>>>>> As an immediate (partial) remedy, I suggest you upgrade to
>>>>>>>>>> eliminate the use of SQL/MyBatis in the scheduler.  This helped Twitter
>>>>>>>>>> improve (1) and (1a).
>>>>>>>>>>
>>>>>>>>>> commit f2755e1
>>>>>>>>>> Author: Bill Farner <wfar...@apache.org>
>>>>>>>>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>>>>>>>>
>>>>>>>>>>     Exclusively use Map-based in-memory stores for primary storage
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <
>>>>>>>>>> mohit.ja...@uber.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <
>>>>>>>>>>> mohit.ja...@uber.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total.
>>>>>>>>>>>> Is that cause or effect? :-)
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <
>>>>>>>>>>>> mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks Bill. Please see inline:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <
>>>>>>>>>>>>> wfar...@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I suspect they are getting enqueued
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> In one instance the offers did get through but it took several
>>>>>>>>>>>>> minutes. In other instances we reloaded the scheduler to let 
>>>>>>>>>>>>> another one
>>>>>>>>>>>>> become the leader.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The most likely culprit is contention for the storage write
>>>>>>>>>>>>>> lock,  observable via spikes in stat
>>>>>>>>>>>>>> log_storage_write_lock_wait_ns_total.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks. I will check that one.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This sounds like API activity.  This shouldn't interfere with
>>>>>>>>>>>>>> offer processing directly, but could potentially slow down the 
>>>>>>>>>>>>>> scheduler as
>>>>>>>>>>>>>> a whole.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> So these won't contend for locks with offer processing and
>>>>>>>>>>>>> task assignment threads? Only 8-10 out of 24 cores were being used
>>>>>>>>>>>>> on the machine. I also noticed a spike in mybatis active and bad
>>>>>>>>>>>>> connections. I can't say whether the spike in active connections is
>>>>>>>>>>>>> due to the bad connections, or vice versa, or whether a third source
>>>>>>>>>>>>> caused both. Are there any metrics or logs that might help here?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around
>>>>>>>>>>>>>>> the same time. Can this happen if task is in PENDING state and 
>>>>>>>>>>>>>>> does not
>>>>>>>>>>>>>>> reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is unusual.  Pending tasks are not timed out; this
>>>>>>>>>>>>>> applies to tasks in states where the scheduler is waiting for 
>>>>>>>>>>>>>> something
>>>>>>>>>>>>>> else to act and it does not hear back (via a status update).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates
>>>>>>>>>>>>> from Mesos are being delayed or processed too slowly, would both
>>>>>>>>>>>>> these effects occur?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I suggest digging into the cause of delayed offer processing
>>>>>>>>>>>>>> first; I suspect it might be related to the task timeouts as well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> version close to 0.18.0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is the ambiguity due to custom patches?  Can you at least
>>>>>>>>>>>>>> indicate the last git SHA off aurora/master?  Digging much deeper
>>>>>>>>>>>>>> into diagnosing this may prove tricky without knowing what code is
>>>>>>>>>>>>>> in play.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>    - c85bffd10 is the commit from which we forked.
>>>>>>>>>>>>>    - The custom patch is mainly the dynamic reservation work done
>>>>>>>>>>>>>    by Dmitri. We also have commits for the offer/rescind race issue,
>>>>>>>>>>>>>    setrootfs patch (which is not upstreamed yet).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have cherrypicked the fix for Aurora-1952 as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <
>>>>>>>>>>>>>> mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around
>>>>>>>>>>>>>>> the same time. Can this happen if task is in PENDING state and 
>>>>>>>>>>>>>>> does not
>>>>>>>>>>>>>>> reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <
>>>>>>>>>>>>>>> mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Folks,
>>>>>>>>>>>>>>>> I have noticed some weird behavior in Aurora (version close
>>>>>>>>>>>>>>>> to 0.18.0). Sometimes, it shows no offers in the UI offers 
>>>>>>>>>>>>>>>> page. But if I
>>>>>>>>>>>>>>>> tail the logs I can see offers are coming in. I suspect they 
>>>>>>>>>>>>>>>> are getting
>>>>>>>>>>>>>>>> enqueued for processing by "executor" but stay there for a 
>>>>>>>>>>>>>>>> long time and
>>>>>>>>>>>>>>>> are not processed either due to locking or thread starvation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time. Could
>>>>>>>>>>>>>>>> these calls starve the OfferManager (e.g. by contending for some
>>>>>>>>>>>>>>>> lock)? What should I be looking for to debug this condition
>>>>>>>>>>>>>>>> further?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Mohit.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
