Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that cause or effect? :-)
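In case it helps anyone else following along, this is roughly how I'm watching
that counter - just a sketch, assuming the scheduler still serves /vars.json on
its HTTP port (8081 here) and that the stat name is unchanged in our build:

    #!/usr/bin/env python3
    # Poll the scheduler's /vars.json endpoint and print the per-interval
    # growth of log_storage_write_lock_wait_ns_total, so spikes can be lined
    # up against the API traffic we see in the logs.
    # Assumptions: scheduler HTTP endpoint at localhost:8081, /vars.json
    # available, stat name unchanged in this build.
    import json
    import time
    from urllib.request import urlopen

    SCHEDULER = "http://localhost:8081"
    STAT = "log_storage_write_lock_wait_ns_total"
    INTERVAL_SECS = 10

    def fetch_stat():
        with urlopen("%s/vars.json" % SCHEDULER) as resp:
            return json.loads(resp.read().decode("utf-8")).get(STAT, 0)

    prev = fetch_stat()
    while True:
        time.sleep(INTERVAL_SECS)
        cur = fetch_stat()
        print("%s: +%d ns over %ds" % (STAT, cur - prev, INTERVAL_SECS))
        prev = cur

I've also pasted a small log-bucketing sketch at the bottom of this mail to
line the read-API burst up against the same window.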
On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> Thanks Bill. Please see inline:
>
> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfar...@apache.org> wrote:
>
>>> I suspect they are getting enqueued
>>
>> Just to be sure - the offers do eventually get through though?
>>
> In one instance the offers did get through, but it took several minutes.
> In other instances we reloaded the scheduler to let another one become
> the leader.
>
>> The most likely culprit is contention for the storage write lock,
>> observable via spikes in the stat log_storage_write_lock_wait_ns_total.
>>
> Thanks. I will check that one.
>
>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>>> calls are being made at that time
>>
>> This sounds like API activity. This shouldn't interfere with offer
>> processing directly, but could potentially slow down the scheduler as a
>> whole.
>>
> So these won't contend for locks with the offer processing and task
> assignment threads? Only 8-10 out of 24 cores were being used on the
> machine. I also noticed a spike in mybatis active and bad connections.
> I can't say whether the spike in active connections was caused by the
> bad connections, or vice versa, or whether a third source caused both.
> Are there any metrics or logs that might help here?
>
>>> I also notice a lot of "Timeout reached for task..." around the same
>>> time. Can this happen if a task is in PENDING state and does not reach
>>> ASSIGNED due to lack of offers?
>>
>> This is unusual. Pending tasks are not timed out; this applies to tasks
>> in states where the scheduler is waiting for something else to act and
>> it does not hear back (via a status update).
>>
> Perhaps they were in ASSIGNED or some other state. If status updates from
> Mesos are being delayed or processed too slowly, would both of these
> effects occur?
>
>> I suggest digging into the cause of delayed offer processing first; I
>> suspect it might be related to the task timeouts as well.
>>
>>> version close to 0.18.0
>>
>> Is the ambiguity due to custom patches? Can you at least indicate the
>> last git SHA off aurora/master? Digging much deeper in diagnosing this
>> may prove tricky without knowing what code is in play.
>>
> - c85bffd10 is the commit from which we forked.
> - The custom patch is mainly the dynamic reservation work done by Dmitri.
>   We also have commits for the offer/rescind race issue and the setrootfs
>   patch (which is not upstreamed yet).
>
> I have cherry-picked the fix for AURORA-1952 as well.
>
>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>
>>> I also notice a lot of "Timeout reached for task..." around the same
>>> time. Can this happen if a task is in PENDING state and does not reach
>>> ASSIGNED due to lack of offers?
>>>
>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>> wrote:
>>>
>>>> Folks,
>>>> I have noticed some weird behavior in Aurora (version close to 0.18.0).
>>>> Sometimes, it shows no offers in the UI offers page. But if I tail the
>>>> logs I can see offers are coming in. I suspect they are getting
>>>> enqueued for processing by "executor" but stay there for a long time
>>>> and are not processed, either due to locking or thread starvation.
>>>>
>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>>>> calls are being made at that time. Could these calls starve the
>>>> OfferManager (e.g. by contending for some lock)? What should I be
>>>> looking for to debug this condition further?
>>>>
>>>> Mohit.
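
One more thing I'm trying, in case it helps with correlation: a rough pass over
the scheduler log to bucket the getJobUpdateDetails()/getTasksWithoutConfigs()
calls per minute. Again only a sketch - it assumes those method names show up
verbatim in the log lines and that each line carries an HH:MM:SS timestamp, so
the regex will need tweaking for the actual log format:

    #!/usr/bin/env python3
    # Bucket occurrences of the chatty read-API methods per minute in a
    # scheduler log, to see whether the burst lines up with the window in
    # which offers stop showing up in the UI.
    # Assumptions (may not match your setup): method names appear verbatim
    # in log lines, and each relevant line contains an HH:MM:SS timestamp.
    import collections
    import re
    import sys

    METHODS = ("getJobUpdateDetails", "getTasksWithoutConfigs")
    TIME_RE = re.compile(r"\b(\d{2}:\d{2}):\d{2}\b")

    counts = collections.defaultdict(collections.Counter)
    for line in sys.stdin:
        match = TIME_RE.search(line)
        if not match:
            continue
        minute = match.group(1)
        for method in METHODS:
            if method in line:
                counts[minute][method] += 1

    for minute in sorted(counts):
        print(minute, dict(counts[minute]))

Run as "python3 count_calls.py < scheduler.log" (file names here are just
placeholders).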