> - Does log replication "maintain" ZK connections and suffer when a NIC
> flaps?

Maintain, yes. It shouldn't be impacted unless the ZK session expires, which
would trigger a scheduler failover.

> - If only 1 of 5 ZKs has this issue, could there still be a problem?

Assuming this means 5 quorum members - no, that should not be a problem.

If any of the above became an issue for the scheduler, it should certainly
manifest in logs.
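To make the connection-loss vs. session-expiry distinction concrete, here is a
minimal sketch using the plain ZooKeeper client API (the connect string and
timeout are illustrative, not Aurora's actual configuration):

import java.io.IOException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// A NIC flap typically surfaces as Disconnected: the client transparently
// retries other quorum members and the session (and any ephemeral nodes,
// such as a leader lock) survives. Expired is the fatal case that forces
// a leader to step down.
public class SessionWatcher implements Watcher {
  @Override
  public void process(WatchedEvent event) {
    switch (event.getState()) {
      case Disconnected:
        System.out.println("Connection lost; session still valid, client retrying");
        break;
      case SyncConnected:
        System.out.println("(Re)connected within the session timeout");
        break;
      case Expired:
        System.out.println("Session expired; leadership lost, failover required");
        break;
      default:
        break;
    }
  }

  public static void main(String[] args) throws IOException, InterruptedException {
    new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, new SessionWatcher());
    Thread.sleep(Long.MAX_VALUE);
  }
}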
On Wed, Nov 29, 2017 at 1:26 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> Thanks Bill. It would be the latter, as this back-pressure is only needed
> for calls that change state. Read-only calls should be quite quick to
> serve.
>
> One potential correlated outage we may have had is a NIC flap on a
> Zookeeper node. The following questions come to mind:
> - Does log replication "maintain" ZK connections and suffer when a NIC
> flaps?
> - If only 1 of 5 ZKs has this issue, could there still be a problem?
>
> On Wed, Nov 29, 2017 at 11:08 AM, Bill Farner <wfar...@apache.org> wrote:
>
>>> is there a place I can inject a pre-processor for the API calls
>>
>> There is no off-the-shelf way to do this. You could intercept API calls
>> as HTTP requests, similar to the CORS filter
>> <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/http/api/ApiModule.java#L73>.
>> If you wanted to intercept specific calls and/or introspect arguments, you
>> would be better off binding a layer for AuroraAdmin.Iface
>> <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ThriftModule.java#L30>.
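A minimal sketch of the HTTP-filter flavor of this, in the spirit of the CORS
filter referenced above. The load signal and its wiring are placeholders, not
an Aurora facility; and since every Thrift call arrives as a POST to /api, an
HTTP filter can only gate the API as a whole, which is exactly why per-call
decisions belong at the AuroraAdmin.Iface layer:

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Sheds load before a request reaches the Thrift handler.
public class BackPressureFilter implements Filter {

  // Hypothetical load signal, e.g. fed from storage lock wait-time stats.
  public interface LoadSignal {
    boolean isOverloaded();
  }

  private final LoadSignal load;

  public BackPressureFilter(LoadSignal load) {
    this.load = load;
  }

  @Override
  public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest http = (HttpServletRequest) req;
    if (http.getRequestURI().startsWith("/api") && load.isOverloaded()) {
      // 503 tells well-behaved clients to back off and retry later.
      ((HttpServletResponse) resp).sendError(
          HttpServletResponse.SC_SERVICE_UNAVAILABLE, "Scheduler under heavy load");
      return;
    }
    chain.doFilter(req, resp);
  }

  @Override
  public void init(FilterConfig config) {}

  @Override
  public void destroy() {}
}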
>> On Tue, Nov 28, 2017 at 11:46 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>
>>> I agree with that. I also believe that the scheduler should be resilient
>>> in the presence of external faults. Systems that export an API must take
>>> defensive steps to protect themselves.
>>>
>>> If I wanted to experiment with this change without modifying Aurora code
>>> "inline", is there a place I can inject a pre-processor for the API calls?
>>>
>>> On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <wfar...@apache.org> wrote:
>>>
>>>> I'd also suggest focusing on the source of the congestion. Aurora
>>>> should offer quite high scheduling throughput, and I would rather focus
>>>> energy on addressing bottlenecks.
>>>>
>>>> On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>
>>>>> I think more explicit signaling is better. Increased latency can be
>>>>> due to other conditions, like network issues. Right now our mitigation
>>>>> involves load-shedding, and we would rather have load-avoidance. Indeed,
>>>>> a proxy is not a good option. Only Aurora "knows" when it wants to
>>>>> apply back-pressure.
>>>>>
>>>>> On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <dmclaugh...@apache.org> wrote:
>>>>>
>>>>>> Any log write latency will be reflected in the overall latency of the
>>>>>> request. Increased request latency is one of the main ways any server
>>>>>> has of telling a client that it's under load. It's then up to the
>>>>>> client to react to this.
>>>>>>
>>>>>> If you want to throw error codes, you can put a proxy in front of
>>>>>> Aurora that has request timeouts - which would send 503s to clients.
>>>>>> But the issue with that is the requests are mostly non-idempotent, so
>>>>>> you'll need to build reconciliation logic into it.
>>>>>>
>>>>>> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>
>>>>>>> Imagine something like Spinnaker using Aurora underneath to schedule
>>>>>>> services. That layer often "amplifies" human effort and may result in
>>>>>>> a lot of load on Aurora. Usually that is fine, but if Aurora slows
>>>>>>> down due to transient problems, it can signal that to upstream
>>>>>>> software the same way that busy web servers do during Cyber Monday
>>>>>>> sales :-)
>>>>>>>
>>>>>>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>
>>>>>>>>> I want to let upstream software "know" that Aurora is slowing down
>>>>>>>>> and that it should back off
>>>>>>>>
>>>>>>>> Can you offer more detail about how Aurora is being used in this
>>>>>>>> regard? I haven't seen use cases in the past that would be amenable
>>>>>>>> to this behavior, so I would like to understand better.
>>>>>>>>
>>>>>>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks Bill. We haven't been able to track down a specific root
>>>>>>>>> cause (the ZK node is known to have issues now and then, but we
>>>>>>>>> don't have logs for the specific outages we had). We plan to move
>>>>>>>>> to 0.19.x soon. In addition, I want to let upstream software "know"
>>>>>>>>> that Aurora is slowing down and that it should back off. To achieve
>>>>>>>>> this I want to send 5xx error codes back when update/rollback/kill
>>>>>>>>> etc. are called and certain metrics (like log write lock wait time)
>>>>>>>>> indicate heavy load. Perhaps this "defense" already exists?
>>>>>>>>>
>>>>>>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> The next level is to determine why the storage lock is being
>>>>>>>>>> held. Common causes include:
>>>>>>>>>>
>>>>>>>>>> 1. storage snapshot slowness, when scheduler state is very large,
>>>>>>>>>> O(GB)
>>>>>>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>>>>>>> 2. scheduler replicated log on slow disks
>>>>>>>>>> 3. network issues between schedulers, schedulers to zookeeper, or
>>>>>>>>>> between zookeepers
>>>>>>>>>>
>>>>>>>>>> As an immediate (partial) remedy, I suggest you upgrade to
>>>>>>>>>> eliminate the use of SQL/mybatis in the scheduler. This helped
>>>>>>>>>> Twitter improve (1) and (1a).
>>>>>>>>>>
>>>>>>>>>> commit f2755e1
>>>>>>>>>> Author: Bill Farner <wfar...@apache.org>
>>>>>>>>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>>>>>>>>
>>>>>>>>>>     Exclusively use Map-based in-memory stores for primary storage
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> ... and in log_storage_write_lock_wait_ns_per_event.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total.
>>>>>>>>>>>> Is that cause or effect? :-)
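For reference, a small watcher for the stat discussed here could look like the
sketch below. It assumes the scheduler's plain-text /vars endpoint (one
"name value" pair per line); the host and port are illustrative, and the same
loop works for the _per_event stat:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Polls /vars and reports how quickly the lock wait counter is growing;
// a sustained climb between polls is the spike discussed above.
public class LockWaitWatcher {
  private static final String STAT = "log_storage_write_lock_wait_ns_total";

  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder(
        URI.create("http://scheduler.example.com:8081/vars")).build();

    long previous = -1;
    while (true) {
      String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
      for (String line : body.split("\n")) {
        String[] parts = line.split("\\s+");
        if (parts.length == 2 && parts[0].equals(STAT)) {
          long current = Long.parseLong(parts[1]);
          if (previous >= 0) {
            System.out.printf("%s grew %.1f ms in the last 10s%n",
                STAT, (current - previous) / 1e6);
          }
          previous = current;
        }
      }
      Thread.sleep(10_000);
    }
  }
}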
>>>>>>>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks Bill. Please see inline:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I suspect they are getting enqueued
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>>>>>>>
>>>>>>>>>>>>> In one instance the offers did get through, but it took several
>>>>>>>>>>>>> minutes. In other instances we reloaded the scheduler to let
>>>>>>>>>>>>> another one become the leader.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The most likely culprit is contention for the storage write
>>>>>>>>>>>>>> lock, observable via spikes in the stat
>>>>>>>>>>>>>> log_storage_write_lock_wait_ns_total.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks. I will check that one.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This sounds like API activity. This shouldn't interfere with
>>>>>>>>>>>>>> offer processing directly, but could potentially slow down the
>>>>>>>>>>>>>> scheduler as a whole.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So these won't contend for locks with the offer processing and
>>>>>>>>>>>>> task assignment threads? Only 8-10 out of 24 cores were being
>>>>>>>>>>>>> used on the machine. I also noticed a spike in mybatis active
>>>>>>>>>>>>> and bad connections. I can't say whether the spike in active
>>>>>>>>>>>>> connections was due to many bad connections, or vice versa, or
>>>>>>>>>>>>> whether a third source caused both. Are there any metrics or
>>>>>>>>>>>>> logs that might help here?
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around
>>>>>>>>>>>>>>> the same time. Can this happen if a task is in PENDING state
>>>>>>>>>>>>>>> and does not reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is unusual. Pending tasks are not timed out; this
>>>>>>>>>>>>>> applies to tasks in states where the scheduler is waiting for
>>>>>>>>>>>>>> something else to act and it does not hear back (via a status
>>>>>>>>>>>>>> update).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates
>>>>>>>>>>>>> from Mesos are being delayed or processed too slowly, would both
>>>>>>>>>>>>> of these effects occur?
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I suggest digging into the cause of delayed offer processing
>>>>>>>>>>>>>> first; I suspect it might be related to the task timeouts as
>>>>>>>>>>>>>> well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> version close to 0.18.0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is the ambiguity due to custom patches? Can you at least
>>>>>>>>>>>>>> indicate the last git SHA off aurora/master? Digging much
>>>>>>>>>>>>>> deeper in diagnosing this may prove tricky without knowing
>>>>>>>>>>>>>> what code is in play.
>>>>>>>>>>>>>
>>>>>>>>>>>>> c85bffd10 is the commit from which we forked. The custom patch
>>>>>>>>>>>>> is mainly the dynamic reservation work done by Dmitri. We also
>>>>>>>>>>>>> have commits for the offer/rescind race issue and the setrootfs
>>>>>>>>>>>>> patch (which is not upstreamed yet). I have cherry-picked the
>>>>>>>>>>>>> fix for AURORA-1952 as well.
>>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around
>>>>>>>>>>>>>>> the same time. Can this happen if a task is in PENDING state
>>>>>>>>>>>>>>> and does not reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Folks,
>>>>>>>>>>>>>>>> I have noticed some weird behavior in Aurora (version close
>>>>>>>>>>>>>>>> to 0.18.0). Sometimes it shows no offers in the UI offers
>>>>>>>>>>>>>>>> page, but if I tail the logs I can see offers coming in. I
>>>>>>>>>>>>>>>> suspect they are getting enqueued for processing by the
>>>>>>>>>>>>>>>> "executor" but stay there for a long time and are not
>>>>>>>>>>>>>>>> processed, either due to locking or thread starvation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time.
>>>>>>>>>>>>>>>> Could these calls starve the OfferManager (e.g., by
>>>>>>>>>>>>>>>> contending for some lock)? What should I be looking for to
>>>>>>>>>>>>>>>> debug this condition further?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Mohit.
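On the "locking or thread starvation" theory, one generic way to look is a
thread dump with lock information. This is plain JDK ThreadMXBean usage (or
the equivalent of periodic jstack snapshots), not an Aurora facility:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Dumps every thread currently blocked on or waiting for a lock, along
// with the lock owner. Captured while offers are stuck, this would show
// whether API threads and the offer processing path are contending on
// the same monitor.
public class ContentionSnapshot {
  public static void main(String[] args) {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
      if (info.getLockName() != null) {
        System.out.printf("%s [%s] waiting on %s held by %s%n",
            info.getThreadName(),
            info.getThreadState(),
            info.getLockName(),
            info.getLockOwnerName());
      }
    }
  }
}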