Thanks Bill. Please see inline:

On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfar...@apache.org> wrote:

>> I suspect they are getting enqueued
>
> Just to be sure - the offers do eventually get through though?
>

In one instance the offers did eventually get through, but it took several
minutes. In other instances we restarted the scheduler to let another one
become the leader.

> The most likely culprit is contention for the storage write lock,
> observable via spikes in the stat log_storage_write_lock_wait_ns_total.
>

Thanks. I will check that one.

>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>> calls are being made at that time
>
> This sounds like API activity. This shouldn't interfere with offer
> processing directly, but could potentially slow down the scheduler as a
> whole.
>

So these won't contend for locks with the offer processing and task
assignment threads? Only 8-10 of the 24 cores on the machine were in use. I
also noticed a spike in MyBatis active and bad connections. I can't tell
whether the spike in active connections was caused by the bad connections,
or vice versa, or whether a third source caused both. Are there any metrics
or logs that might help here?

>> I also notice a lot of "Timeout reached for task..." around the same
>> time. Can this happen if a task is in PENDING state and does not reach
>> ASSIGNED due to lack of offers?
>
> This is unusual. Pending tasks are not timed out; this applies to tasks
> in states where the scheduler is waiting for something else to act and it
> does not hear back (via a status update).
>

Perhaps they were in ASSIGNED or some other state. If status updates from
Mesos are delayed or processed too slowly, would both of these effects
occur?

> I suggest digging into the cause of delayed offer processing first; I
> suspect it might be related to the task timeouts as well.
>
>> version close to 0.18.0
>
> Is the ambiguity due to custom patches? Can you at least indicate the
> last git SHA off aurora/master? Digging much deeper in diagnosing this may
> prove tricky without knowing what code is in play.
>

- c85bffd is the commit from which we forked.
- The custom patches are mainly the dynamic reservation work done by
  Dmitri. We also have commits for the offer/rescind race issue and the
  setrootfs patch (which is not upstreamed yet).
- I have cherry-picked the fix for AURORA-1952 as well.

> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>
>> I also notice a lot of "Timeout reached for task..." around the same
>> time. Can this happen if a task is in PENDING state and does not reach
>> ASSIGNED due to lack of offers?
>>
>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>
>>> Folks,
>>> I have noticed some weird behavior in Aurora (version close to 0.18.0).
>>> Sometimes it shows no offers in the UI offers page, but if I tail the
>>> logs I can see offers coming in. I suspect they are getting enqueued for
>>> processing by the "executor" but stay there for a long time and are not
>>> processed, either due to locking or thread starvation.
>>>
>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>>> calls are being made at that time. Could these calls starve the
>>> OfferManager (e.g. by contending for some lock)? What should I be
>>> looking for to debug this condition further?
>>>
>>> Mohit.
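
P.S. To put numbers on Bill's write-lock theory above, here is a minimal
sketch that polls the scheduler's stats endpoint and prints deltas for the
lock-wait counter alongside the MyBatis connection gauges. It assumes the
scheduler serves JSON stats at /vars.json on port 8081, and the MyBatis stat
names below are guesses; check /vars on your build for the exact names.

#!/usr/bin/env python
# Minimal sketch: poll the Aurora scheduler stats endpoint every 10 seconds
# and print how the storage write-lock wait counter and the MyBatis
# connection gauges move over time.
#
# Assumptions (verify against your deployment):
#   - the scheduler serves JSON stats at http://<host>:8081/vars.json
#   - the MyBatis stat names below match what your build exports
import json
import time
import urllib.request

SCHEDULER_VARS = "http://localhost:8081/vars.json"  # assumed endpoint
STATS = [
    "log_storage_write_lock_wait_ns_total",  # from the thread above
    "mybatis_active_connections",            # assumed name; check /vars
    "mybatis_bad_connections_total",         # assumed name; check /vars
]

prev = {}
while True:
    with urllib.request.urlopen(SCHEDULER_VARS) as resp:
        snapshot = json.load(resp)
    for name in STATS:
        value = snapshot.get(name)
        if value is None:
            continue  # stat not exported under this name on this build
        delta = value - prev.get(name, value)
        print("%s  value=%s  delta=%s" % (name, value, delta))
        prev[name] = value
    time.sleep(10)

A steadily growing delta on the lock-wait counter while offers sit
unprocessed would support the write-lock contention theory.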
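
And to line the task timeouts up against the delayed-offer window, this
sketch buckets the "Timeout reached for task" log lines per minute. It
assumes glog-style scheduler log lines (e.g. "I1110 20:06:01.123 ...");
adjust the timestamp slicing if your log format differs.

#!/usr/bin/env python
# Minimal sketch: bucket "Timeout reached for task" log lines per minute so
# timeout spikes can be correlated with the periods of delayed offer
# processing. Read the scheduler log on stdin:
#
#   python timeout_buckets.py < scheduler.log
import collections
import sys

counts = collections.Counter()
for line in sys.stdin:
    if "Timeout reached for task" in line:
        # "I1110 20:06:01.123 ..." -> bucket on "1110 20:06"
        # (assumes a fixed-width glog timestamp prefix)
        stamp = line[1:11]
        counts[stamp] += 1

for stamp in sorted(counts):
    print("%s  %d" % (stamp, counts[stamp]))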
> I suspect they are getting enqueued > > > Just to be sure - the offers do eventually get through though? > > In one instance the offers did get through but it took several minutes. In other instances we reloaded the scheduler to let another one become the leader. > The most likely culprit is contention for the storage write lock, observable > via spikes in stat log_storage_write_lock_wait_ns_total. > Thanks. I will check that one. > > I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() >> calls are being made at that time > > > This sounds like API activity. This shouldn't interfere with offer > processing directly, but could potentially slow down the scheduler as a > whole. > > So these won't contend for locks with offer processing and task assignment threads? Only 8-10 out of 24 cores were being used on the machine. I also noticed a spike in mybatis active and bad connections. Can't say if the spike in active is due to many bad connections or vice versa or there was a 3rd source causing both of these. Are there any metrics or logs that might help here? > I also notice a lot of "Timeout reached for task..." around the same time. >> Can this happen if task is in PENDING state and does not reach ASSIGNED due >> to lack of offers? > > > This is unusual. Pending tasks are not timed out; this applies to tasks > in states where the scheduler is waiting for something else to act and it > does not hear back (via a status update). > Perhaps they were in ASSIGNED or some other state. If updates from Mesos are being delayed or processed too slowly both these effects will occur? > > I suggest digging into the cause of delayed offer processing first, i > suspect it might be related to the task timeouts as well. > > version close to 0.18.0 > > > Is the ambiguity is due to custom patches? Can you at least indicate the > last git SHA off aurora/master? Digging much deeper in diagnosing this may > prove tricky without knowing what code is in play. > > - - c85bffd - 10 i - - s the commit from which we forked. - - - - The custom patch is mainly the dynamic reservation work done by Dmitri. We also have commits for offer/rescind race issue, setrootfs patch (which is not upstreamed yet). - - - - - I have cherrypicked the fix for Aurora-1952 as well. > On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: > >> I also notice a lot of "Timeout reached for task..." around the same >> time. Can this happen if task is in PENDING state and does not reach >> ASSIGNED due to lack of offers? >> >> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: >> >>> Folks, >>> I have noticed some weird behavior in Aurora (version close to 0.18.0). >>> Sometimes, it shows no offers in the UI offers page. But if I tail the logs >>> I can see offers are coming in. I suspect they are getting enqueued for >>> processing by "executor" but stay there for a long time and are not >>> processed either due to locking or thread starvation. >>> >>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() >>> calls are being made at that time. Could these calls starve the >>> OfferManager(e.g. by contending for some lock)? What should I be looking >>> for to debug this condition further? >>> >>> Mohit. >>> >> >> >