> Do you mean there is a shared ZK connection for leadership and log replication?
Nope, different connections. The log is all managed through code linked from libmesos, including its use of ZK. Here's an example of some logs from this code:

I1129 15:56:00.560479 9316 group.cpp:341] Group process (zookeeper-group(1)@192.168.33.7:8083) connected to ZooKeeper
I1129 15:56:00.560817 9316 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I1129 15:56:00.561249 9316 group.cpp:419] Trying to create path '/aurora/replicated-log' in ZooKeeper

Notice the group.cpp. You'll also see relevant logs coming from log.cpp and replica.cpp.

On Wed, Nov 29, 2017 at 3:25 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> Do you mean there is a shared ZK connection for leadership and log replication? I don't see "Lost leadership, committing suicide" during the outage. I do see it at other times.
>
> On Wed, Nov 29, 2017 at 1:52 PM, Bill Farner <wfar...@apache.org> wrote:
>
>> > - Does log replication "maintain" ZK connections and suffer when a NIC flaps?
>>
>> Maintain, yes. Shouldn't be impacted unless the ZK session expires, which would trigger a scheduler failover.
>>
>> > - If only 1 of 5 ZKs has this issue, could there still be a problem?
>>
>> Assuming this means 5 quorum members - no, that should not be a problem.
>>
>> If any of the above became an issue for the scheduler, it should certainly manifest in logs.
>>
>> On Wed, Nov 29, 2017 at 1:26 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>
>>> Thanks Bill. It would be the latter, as this back-pressure is only needed for calls that change state. Read-only calls should be quite quick to serve.
>>>
>>> One potential correlated outage we may have had is a NIC flap on a Zookeeper node. The following questions come to mind:
>>> - Does log replication "maintain" ZK connections and suffer when a NIC flaps?
>>> - If only 1 of 5 ZKs has this issue, could there still be a problem?
>>>
>>> On Wed, Nov 29, 2017 at 11:08 AM, Bill Farner <wfar...@apache.org> wrote:
>>>
>>>> > is there a place I can inject a pre-processor for the API calls
>>>>
>>>> There is no off-the-shelf way to do this. You could intercept API calls as HTTP requests similar to the CORS filter <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/http/api/ApiModule.java#L73>. If you wanted to intercept specific calls and/or introspect arguments, you would be better off binding a layer for AuroraAdmin.Iface <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ThriftModule.java#L30>.
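(A rough sketch of the HTTP-filter approach described above. It is hypothetical, not existing Aurora code: SchedulerLoadSignal is an assumed hook you would back with a load metric such as log write lock wait time, and the filter would be registered on the API path the same way ApiModule registers the CORS filter. Because the scheduler's Thrift calls all arrive over the same HTTP endpoint, shedding only state-changing calls would need the AuroraAdmin.Iface-level layer Bill mentions.)

```java
// Hypothetical back-pressure filter; not existing Aurora code.
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

// Assumed hook, e.g. backed by the log write lock wait stats discussed later in this thread.
interface SchedulerLoadSignal {
  boolean isOverloaded();
}

public class BackPressureFilter implements Filter {
  private final SchedulerLoadSignal load;

  public BackPressureFilter(SchedulerLoadSignal load) {
    this.load = load;
  }

  @Override public void init(FilterConfig config) {}

  @Override
  public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
      throws IOException, ServletException {
    if (load.isOverloaded()) {
      // Tell the caller to back off instead of queueing more work on the scheduler.
      HttpServletResponse httpResp = (HttpServletResponse) resp;
      httpResp.setHeader("Retry-After", "30");
      httpResp.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE,
          "Scheduler is overloaded, please retry later");
      return;
    }
    // Normal case: pass the request through untouched.
    chain.doFilter(req, resp);
  }

  @Override public void destroy() {}
}
```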
>>>> On Tue, Nov 28, 2017 at 11:46 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>
>>>>> I agree with that. I also believe that the scheduler should be resilient in the presence of external faults. Systems that export an API must take defensive steps to protect themselves.
>>>>>
>>>>> If I wanted to experiment with this change without modifying Aurora code "inline", is there a place I can inject a pre-processor for the API calls?
>>>>>
>>>>> On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>
>>>>>> I'd also suggest focusing on the source of the congestion. Aurora should offer quite high scheduling throughput, and I would rather focus energy on addressing bottlenecks.
>>>>>>
>>>>>> On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>
>>>>>>> I think more explicit signaling is better. Increased latency can be due to other conditions like network issues etc. Right now our mitigation involves load-shedding and we would rather have load-avoidance. Indeed, a proxy is not a good option. Only Aurora "knows" when it wants to back-pressure.
>>>>>>>
>>>>>>> On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <dmclaugh...@apache.org> wrote:
>>>>>>>
>>>>>>>> Any log write latency will be reflected in the overall latency of the request. Increased request latency is one of the main ways any server has of telling a client that it's under load. It's then up to the client to react to this.
>>>>>>>>
>>>>>>>> If you want to throw error codes, you can put a proxy in front of Aurora that has request timeouts - which would send 503s to clients. But the issue with that is the requests are mostly non-idempotent, so you'll need to build reconciliation logic into it.
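(To make "it's up to the client to react" concrete, here is an illustrative sketch of what upstream-side handling could look like: plain exponential backoff while the scheduler returns 503, using the JDK HTTP client. Nothing here is Aurora or Spinnaker code, and as David notes, blindly retrying is only safe if the scheduler rejected the request before acting on it; otherwise the caller needs reconciliation logic.)

```java
// Illustrative upstream-side backoff on 503; not Aurora or Spinnaker code.
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BackoffCaller {
  private static final int MAX_ATTEMPTS = 5;

  /** Sends the request, backing off exponentially while the scheduler answers 503. */
  public static HttpResponse<String> sendWithBackoff(HttpClient client, HttpRequest request)
      throws Exception {
    long sleepMs = 500;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
      if (response.statusCode() != 503) {
        return response;  // success or a non-retryable error; the caller decides what to do
      }
      // The scheduler is shedding load: wait, then try again with a doubled delay.
      Thread.sleep(sleepMs);
      sleepMs *= 2;
    }
    throw new IllegalStateException("Scheduler still overloaded after " + MAX_ATTEMPTS + " attempts");
  }
}
```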
>>>>>>>> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>
>>>>>>>>> Imagine something like Spinnaker using Aurora underneath to schedule services. That layer often "amplifies" human effort and may result in a lot of load on Aurora. Usually that is fine, but if Aurora slowed down due to transient problems, it could signal that to upstream software in the same way that busy web servers do during Cyber Monday sales :-)
>>>>>>>>>
>>>>>>>>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> > I want to let upstream software "know" that Aurora is slowing down and that it should back off
>>>>>>>>>>
>>>>>>>>>> Can you offer more detail about how Aurora is being used in this regard? I haven't seen use cases in the past that would be amenable to this behavior, so I would like to understand better.
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Bill. We haven't been able to track down a specific root cause (the ZK node is known to have issues now and then, but we don't have logs for the specific outages we had). We plan to move to 0.19.x soon. In addition, I want to let upstream software "know" that Aurora is slowing down and that it should back off. To achieve this I want to send 5xx error codes back when update/rollback/kill etc. are called and certain metrics (like log write lock wait time) indicate heavy load. Perhaps this "defense" already exists?
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The next level is to determine why the storage lock is being held. Common causes include:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. storage snapshot slowness, when scheduler state is very large, O(GB)
>>>>>>>>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>>>>>>>>> 2. scheduler replicated log on slow disks
>>>>>>>>>>>> 3. network issues between schedulers, schedulers to zookeeper, or between zookeepers
>>>>>>>>>>>>
>>>>>>>>>>>> As an immediate (partial) remedy, I suggest you upgrade to eliminate the use of SQL/mybatis in the scheduler. This helped Twitter improve (1) and (1a).
>>>>>>>>>>>>
>>>>>>>>>>>> commit f2755e1
>>>>>>>>>>>> Author: Bill Farner <wfar...@apache.org>
>>>>>>>>>>>> Date: Tue Oct 24 23:34:09 2017 -0700
>>>>>>>>>>>>
>>>>>>>>>>>>     Exclusively use Map-based in-memory stores for primary storage
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that cause or effect? :-)
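(For watching these spikes, a minimal sketch that polls the scheduler and prints the counter along with its per-interval delta. It assumes the stat is exported as a plain "name value" line on the scheduler's /vars endpoint, and the host and port below are placeholders; adjust to your deployment.)

```java
// Minimal sketch: poll the scheduler and print log_storage_write_lock_wait_ns_total
// once a minute. Host/port and the "name value" line format are assumptions.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class WriteLockWaitWatcher {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder(
        URI.create("http://aurora-scheduler.example.com:8081/vars")).GET().build();

    long previous = -1;
    while (true) {
      String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
      for (String line : body.split("\n")) {
        if (line.startsWith("log_storage_write_lock_wait_ns_total ")) {
          long current = Long.parseLong(line.split(" ")[1].trim());
          long delta = previous < 0 ? 0 : current - previous;
          // A steadily growing delta points at storage write lock contention.
          System.out.println("write lock wait ns total=" + current + " delta=" + delta);
          previous = current;
        }
      }
      Thread.sleep(60_000);
    }
  }
}
```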
>>>>>>>>>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks Bill. Please see inline:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > I suspect they are getting enqueued
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In one instance the offers did get through, but it took several minutes. In other instances we reloaded the scheduler to let another one become the leader.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The most likely culprit is contention for the storage write lock, observable via spikes in the stat log_storage_write_lock_wait_ns_total.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks. I will check that one.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This sounds like API activity. This shouldn't interfere with offer processing directly, but could potentially slow down the scheduler as a whole.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So these won't contend for locks with offer processing and task assignment threads? Only 8-10 out of 24 cores were being used on the machine. I also noticed a spike in mybatis active and bad connections. I can't say whether the spike in active connections was due to many bad connections, or vice versa, or whether a third source caused both. Are there any metrics or logs that might help here?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if a task is in PENDING state and does not reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is unusual. Pending tasks are not timed out; this applies to tasks in states where the scheduler is waiting for something else to act and it does not hear back (via a status update).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates from Mesos are being delayed or processed too slowly, would both these effects occur?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I suggest digging into the cause of delayed offer processing first; I suspect it might be related to the task timeouts as well.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > version close to 0.18.0
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is the ambiguity due to custom patches? Can you at least indicate the last git SHA off aurora/master? Digging much deeper in diagnosing this may prove tricky without knowing what code is in play.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> c85bffd10 is the commit from which we forked. The custom patch is mainly the dynamic reservation work done by Dmitri. We also have commits for the offer/rescind race issue and the setrootfs patch (which is not upstreamed yet). I have cherry-picked the fix for AURORA-1952 as well.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if a task is in PENDING state and does not reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Folks,
>>>>>>>>>>>>>>>>>> I have noticed some weird behavior in Aurora (version close to 0.18.0). Sometimes it shows no offers in the UI offers page, but if I tail the logs I can see offers coming in. I suspect they are getting enqueued for processing by "executor" but stay there for a long time and are not processed, either due to locking or thread starvation.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time. Could these calls starve the OfferManager (e.g. by contending for some lock)? What should I be looking for to debug this condition further?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Mohit.