Do you mean there is a shared ZK connection for leadership and log replication? I don't see "Lost leadership, committing suicide" during the outage. I do see it at other times.
On Wed, Nov 29, 2017 at 1:52 PM, Bill Farner <wfar...@apache.org> wrote:

> - Does log replication "maintain" ZK connections and suffer when a NIC flaps?

Maintain, yes. Shouldn't be impacted unless the ZK session expires, which would trigger a scheduler failover.

> - If only 1 of 5 ZK's have this issue, could there still be a problem?

Assuming this means 5 quorum members - no, that should not be a problem.

If any of the above became an issue for the scheduler, it should certainly manifest in logs.

On Wed, Nov 29, 2017 at 1:26 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

Thanks Bill. It would be the latter, as this back-pressure is only needed for calls that change state. Read-only calls should be quite quick to serve.

One potential correlated outage we may have had is a NIC flap on a Zookeeper node. The following questions come to mind:
- Does log replication "maintain" ZK connections and suffer when a NIC flaps?
- If only 1 of 5 ZK's have this issue, could there still be a problem?

On Wed, Nov 29, 2017 at 11:08 AM, Bill Farner <wfar...@apache.org> wrote:

> is there a place I can inject a pre-processor for the API calls

There is no off-the-shelf way to do this. You could intercept API calls as HTTP requests similar to the CORS filter <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/http/api/ApiModule.java#L73>. If you wanted to intercept specific calls and/or introspect arguments, you would be better off binding a layer for AuroraAdmin.Iface <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ThriftModule.java#L30>.

On Tue, Nov 28, 2017 at 11:46 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

I agree with that. I also believe that the scheduler should be resilient in the presence of external faults. Systems that export an API must take defensive steps to protect themselves.

If I wanted to experiment with this change without modifying Aurora code "inline", is there a place I can inject a pre-processor for the API calls?

On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <wfar...@apache.org> wrote:

I'd also suggest focusing on the source of the congestion. Aurora should offer quite high scheduling throughput, and I would rather focus energy on addressing bottlenecks.

On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

I think more explicit signaling is better. Increased latency can be due to other conditions like network issues etc. Right now our mitigation involves load-shedding and we would rather have load-avoidance. Indeed, a proxy is not a good option. Only Aurora "knows" when it wants to back-pressure.

On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <dmclaugh...@apache.org> wrote:

Any log write latency will be reflected in the overall latency of the request. Increased request latency is one of the main ways any server has of telling a client that it's under load. It's then up to the client to react to this.

If you want to throw error codes, you can put a proxy in front of Aurora that has request timeouts - which would send 503s to clients. But the issue with that is the requests are mostly non-idempotent, so you'll need to build reconciliation logic into it.
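A minimal sketch of the HTTP-level interception discussed above: a servlet filter that sheds state-changing calls with a 503 when a load signal crosses a threshold. The LoadSignal interface, the isMutating predicate, and the threshold are placeholders, not Aurora classes; deciding which Thrift calls actually change state is the part Bill suggests doing at the AuroraAdmin.Iface layer instead.

// Hypothetical back-pressure filter, written against the plain Servlet API.
// Aurora wires its own filters via Guice/Jetty (see ApiModule); LoadSignal and
// isMutating below are illustrative placeholders, not Aurora classes.
import java.io.IOException;
import java.util.function.Predicate;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class BackPressureFilter implements Filter {
  /** Placeholder for a load metric, e.g. recent storage write-lock wait per second. */
  public interface LoadSignal {
    double writeLockWaitMsPerSec();
  }

  private final LoadSignal signal;
  private final Predicate<HttpServletRequest> isMutating;
  private final double maxWaitMsPerSec;

  public BackPressureFilter(
      LoadSignal signal,
      Predicate<HttpServletRequest> isMutating,
      double maxWaitMsPerSec) {
    this.signal = signal;
    this.isMutating = isMutating;
    this.maxWaitMsPerSec = maxWaitMsPerSec;
  }

  @Override
  public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest http = (HttpServletRequest) req;
    // Shed only state-changing calls while the scheduler is under heavy lock
    // contention; read-only calls pass through (per the thread above).
    if (isMutating.test(http) && signal.writeLockWaitMsPerSec() > maxWaitMsPerSec) {
      HttpServletResponse out = (HttpServletResponse) resp;
      out.setHeader("Retry-After", "5");
      out.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE, "scheduler under load, retry later");
      return;
    }
    chain.doFilter(req, resp);
  }

  @Override public void init(FilterConfig config) { }
  @Override public void destroy() { }
}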
On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

Imagine something like Spinnaker using Aurora underneath to schedule services. That layer often "amplifies" human effort and may result in a lot of load on Aurora. Usually that is fine, but if Aurora slowed down due to transient problems, it can signal that to upstream software in the same way that busy web servers do during Cyber Monday sales :-)

On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wfar...@apache.org> wrote:

> I want to let upstream software "know" that Aurora is slowing down and that it should back off

Can you offer more detail about how Aurora is being used in this regard? I haven't seen use cases in the past that would be amenable to this behavior, so I would like to understand better.

On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

Thanks Bill. We haven't been able to track down a specific root cause (the ZK node is known to have issues now and then, but we don't have logs for the specific outages we had). We will plan to move to 0.19.x soon. In addition, I want to let upstream software "know" that Aurora is slowing down and that it should back off. To achieve this I want to send 5xx error codes back when update/rollback/kill etc. are called and certain metrics (like log write lock wait time) indicate heavy load. Perhaps this "defense" already exists?

On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wfar...@apache.org> wrote:

The next level is to determine why the storage lock is being held. Common causes include:

1. storage snapshot slowness, when scheduler state is very large, O(gb)
1a. long GC pauses in the scheduler, often induced by (1)
2. scheduler replicated log on slow disks
3. network issues between schedulers, schedulers to zookeeper, or between zookeepers

As an immediate (partial) remedy, I suggest you upgrade to eliminate the use of SQL/mybatis in the scheduler. This helped Twitter improve (1) and (1a).

commit f2755e1
Author: Bill Farner <wfar...@apache.org>
Date: Tue Oct 24 23:34:09 2017 -0700

    Exclusively use Map-based in-memory stores for primary storage

On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

and in log_storage_write_lock_wait_ns_per_event

On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that cause or effect? :-)
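A small sketch of turning the stat mentioned above into a usable load signal: poll log_storage_write_lock_wait_ns_total and convert the delta between polls into write-lock wait accrued per wall-clock second. This assumes the scheduler's plain-text /vars endpoint (one "name value" pair per line); adjust if your build exposes stats differently.

// Sketch of a poller that derives a back-pressure signal from the stat the
// thread discusses. The /vars parsing here is an assumption about the stats
// endpoint format, not a committed Aurora interface.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class WriteLockWaitPoller {
  private static final String STAT = "log_storage_write_lock_wait_ns_total";

  private final URL varsUrl;
  private long lastValue = -1;
  private long lastTimestampMs = -1;

  public WriteLockWaitPoller(String schedulerBase) throws Exception {
    this.varsUrl = new URL(schedulerBase + "/vars");
  }

  /** Returns write-lock wait in ms accrued per wall-clock second since the last poll. */
  public double poll() throws Exception {
    long value = readStat();
    long now = System.currentTimeMillis();
    double rate = 0;
    if (lastValue >= 0 && now > lastTimestampMs) {
      double waitMs = (value - lastValue) / 1e6;   // ns -> ms
      rate = waitMs / ((now - lastTimestampMs) / 1000.0);
    }
    lastValue = value;
    lastTimestampMs = now;
    return rate;
  }

  private long readStat() throws Exception {
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(varsUrl.openStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        if (line.startsWith(STAT + " ")) {
          return Long.parseLong(line.substring(STAT.length() + 1).trim());
        }
      }
    }
    throw new IllegalStateException(STAT + " not found in /vars");
  }
}

The per-second rate from poll() is the kind of value a shedding threshold (as in the filter sketch earlier in the thread) could be compared against.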
On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

Thanks Bill. Please see inline:

On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfar...@apache.org> wrote:

>> I suspect they are getting enqueued
>
> Just to be sure - the offers do eventually get through though?

In one instance the offers did get through, but it took several minutes. In other instances we reloaded the scheduler to let another one become the leader.

> The most likely culprit is contention for the storage write lock, observable via spikes in stat log_storage_write_lock_wait_ns_total.

Thanks. I will check that one.

>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time
>
> This sounds like API activity. This shouldn't interfere with offer processing directly, but could potentially slow down the scheduler as a whole.

So these won't contend for locks with offer processing and task assignment threads? Only 8-10 out of 24 cores were being used on the machine. I also noticed a spike in mybatis active and bad connections. Can't say if the spike in active is due to many bad connections, or vice versa, or there was a 3rd source causing both of these. Are there any metrics or logs that might help here?

>> I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if task is in PENDING state and does not reach ASSIGNED due to lack of offers?
>
> This is unusual. Pending tasks are not timed out; this applies to tasks in states where the scheduler is waiting for something else to act and it does not hear back (via a status update).

Perhaps they were in ASSIGNED or some other state. If updates from Mesos are being delayed or processed too slowly, both these effects will occur?

> I suggest digging into the cause of delayed offer processing first, I suspect it might be related to the task timeouts as well.
>
>> version close to 0.18.0
>
> Is the ambiguity due to custom patches? Can you at least indicate the last git SHA off aurora/master? Digging much deeper in diagnosing this may prove tricky without knowing what code is in play.
- c85bffd10 is the commit from which we forked.
- The custom patch is mainly the dynamic reservation work done by Dmitri. We also have commits for the offer/rescind race issue and the setrootfs patch (which is not upstreamed yet).

I have cherrypicked the fix for AURORA-1952 as well.

On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if task is in PENDING state and does not reach ASSIGNED due to lack of offers?

On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

Folks,
I have noticed some weird behavior in Aurora (version close to 0.18.0). Sometimes it shows no offers in the UI offers page, but if I tail the logs I can see offers are coming in. I suspect they are getting enqueued for processing by "executor" but stay there for a long time and are not processed, either due to locking or thread starvation.

I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time. Could these calls starve the OfferManager (e.g. by contending for some lock)? What should I be looking for to debug this condition further?

Mohit.
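On the upstream side of the back-pressure idea discussed in this thread, a rough sketch of a client that reacts to 503s by backing off exponentially instead of retrying immediately; the URL and payload handling are placeholders. As David notes above, blindly retrying non-idempotent calls still needs reconciliation logic on top of this.

// Sketch of how an upstream caller (e.g. a deploy orchestrator) could honor
// 503 back-pressure from the scheduler. Uses java.net.http (Java 11+); the
// endpoint and request body are illustrative only.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class BackoffClient {
  private static final int MAX_ATTEMPTS = 6;

  public static HttpResponse<String> postWithBackoff(String url, String body)
      throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    long delayMs = 500;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      HttpRequest request = HttpRequest.newBuilder(URI.create(url))
          .timeout(Duration.ofSeconds(30))
          .POST(HttpRequest.BodyPublishers.ofString(body))
          .build();
      HttpResponse<String> response =
          client.send(request, HttpResponse.BodyHandlers.ofString());
      if (response.statusCode() != 503) {
        return response;  // success or a non-retryable error; let the caller decide
      }
      // Scheduler signaled overload; back off before the next attempt.
      Thread.sleep(delayMs);
      delayMs = Math.min(delayMs * 2, 30_000);
    }
    throw new IllegalStateException("scheduler still overloaded after retries: " + url);
  }
}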