And in log_storage_write_lock_wait_ns_per_event as well.
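In case it's useful, here is the rough sketch I am using to sample those two counters over an interval and see how much the write-lock wait grows. The /vars.json endpoint, port 8081, and the 60-second window are assumptions about our deployment, not something Aurora mandates; adjust them for yours.

# Rough sketch, not a polished tool: sample the scheduler's write-lock wait
# stats twice and report the deltas. Assumes the leading scheduler exposes
# its stats at http://<host>:8081/vars.json; host, port, endpoint, and the
# 60-second window are guesses about our setup, not Aurora requirements.
import json
import time
import urllib.request

SCHEDULER_VARS = "http://localhost:8081/vars.json"  # point this at the leader
METRICS = [
    "log_storage_write_lock_wait_ns_total",
    "log_storage_write_lock_wait_ns_per_event",
]

def sample():
    # Fetch the full stats map and keep only the two counters we care about.
    with urllib.request.urlopen(SCHEDULER_VARS) as resp:
        stats = json.loads(resp.read().decode())
    return {m: stats.get(m, 0) for m in METRICS}

before = sample()
time.sleep(60)
after = sample()

for m in METRICS:
    print("%s: before=%s after=%s delta=%s"
          % (m, before[m], after[m], after[m] - before[m]))
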
On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that
> cause or effect? :-)
>
> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>
>> Thanks Bill. Please see inline:
>>
>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfar...@apache.org> wrote:
>>
>>>> I suspect they are getting enqueued
>>>
>>> Just to be sure - the offers do eventually get through though?
>>>
>> In one instance the offers did get through, but it took several minutes.
>> In other instances we reloaded the scheduler to let another one become
>> the leader.
>>
>>> The most likely culprit is contention for the storage write lock,
>>> observable via spikes in the stat log_storage_write_lock_wait_ns_total.
>>>
>> Thanks. I will check that one.
>>
>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>>>> calls are being made at that time
>>>
>>> This sounds like API activity. This shouldn't interfere with offer
>>> processing directly, but could potentially slow down the scheduler as a
>>> whole.
>>>
>> So these won't contend for locks with the offer processing and task
>> assignment threads? Only 8-10 out of 24 cores were being used on the
>> machine. I also noticed a spike in MyBatis active and bad connections. I
>> can't say whether the spike in active connections is due to the many bad
>> connections, or vice versa, or whether a third source caused both. Are
>> there any metrics or logs that might help here?
>>
>>>> I also notice a lot of "Timeout reached for task..." around the same
>>>> time. Can this happen if a task is in PENDING state and does not reach
>>>> ASSIGNED due to lack of offers?
>>>
>>> This is unusual. Pending tasks are not timed out; this applies to tasks
>>> in states where the scheduler is waiting for something else to act and
>>> it does not hear back (via a status update).
>>>
>> Perhaps they were in ASSIGNED or some other state. If updates from Mesos
>> are being delayed or processed too slowly, would both of these effects
>> occur?
>>
>>> I suggest digging into the cause of delayed offer processing first; I
>>> suspect it might be related to the task timeouts as well.
>>>
>>>> version close to 0.18.0
>>>
>>> Is the ambiguity due to custom patches? Can you at least indicate the
>>> last git SHA off aurora/master? Digging much deeper in diagnosing this
>>> may prove tricky without knowing what code is in play.
>>>
>> - c85bffd10 is the commit from which we forked.
>> - The custom patch is mainly the dynamic reservation work done by
>>   Dmitri. We also have commits for the offer/rescind race issue and the
>>   setrootfs patch (which is not upstreamed yet).
>>
>> I have cherry-picked the fix for AURORA-1952 as well.
>>
>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>> wrote:
>>>
>>>> I also notice a lot of "Timeout reached for task..." around the same
>>>> time. Can this happen if a task is in PENDING state and does not reach
>>>> ASSIGNED due to lack of offers?
>>>>
>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>> wrote:
>>>>
>>>>> Folks,
>>>>> I have noticed some weird behavior in Aurora (version close to
>>>>> 0.18.0). Sometimes it shows no offers in the UI offers page, but if I
>>>>> tail the logs I can see offers coming in. I suspect they are getting
>>>>> enqueued for processing by "executor" but stay there for a long time
>>>>> and are not processed, either due to locking or thread starvation.
>>>>>
>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>>>>> calls are being made at that time. Could these calls starve the
>>>>> OfferManager (e.g., by contending for some lock)? What should I be
>>>>> looking for to debug this condition further?
>>>>>
>>>>> Mohit.
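
Separately, here is a quick-and-dirty script I have been using to bucket the "Timeout reached for task..." log lines by minute, so they can be lined up against the window when offers stopped showing up. The log path and the glog-style timestamp prefix are assumptions about our setup, not something Aurora guarantees.

# Rough sketch: count "Timeout reached for task" occurrences per minute in a
# scheduler log. The default path "scheduler.log" and the glog-style prefix
# (e.g. "I1110 21:57:03.123") are assumptions; adjust both for your logs.
import collections
import re
import sys

log_path = sys.argv[1] if len(sys.argv) > 1 else "scheduler.log"
counts = collections.Counter()

with open(log_path, errors="replace") as f:
    for line in f:
        if "Timeout reached for task" not in line:
            continue
        # Try to pull "MMDD HH:MM" out of a glog-style prefix; otherwise
        # fall back to the raw start of the line as the bucket key.
        m = re.match(r"^[IWEF](\d{4} \d{2}:\d{2})", line)
        key = m.group(1) if m else line[:16]
        counts[key] += 1

for minute, n in sorted(counts.items()):
    print(minute, n)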