Yes, this is definitely the case. We have ideas for a proper solution to this problem, although we're still considering exactly how to implement it. This patch is a short-term quick fix for some blocker issues that show up as a consequence of the current behaviour.

On 17 June 2016 at 17:57, Iustin Pop <[email protected]> wrote:

> 2016-06-17 9:46 GMT-07:00 Federico Pareschi <[email protected]>:
>
>> When a ganeti-watcher runs on the nodegroup, it submits the verify-group
>> job. If there is another job in the queue that is taking some locks that
>> stop the verify-group job (like an instance creation) then the whole
>> ganeti-watcher is blocked and has to wait for that job to finish.
>>
>> We have a case where we need the ganeti-watcher's periodic check to
>> restart some downed instances, but the ganeti-watcher itself gets stuck on
>> some other jobs and those downed instances don't get brought back up on
>> time (And this causes problems to us).
>>
>> There really is no reason to hold a lock to the watcher state file when
>> submitting the verify disk job anyway.
>>
>
> All this makes sense. My question is rather, if we end up with multiple
> watchers basically running (waiting) concurrently, can we end up with
> multiple (redundant) verify-group jobs?
>
> Sorry if I misunderstand the situation.
>
> iustin
>
> On 17 June 2016 at 17:18, Iustin Pop <[email protected]> wrote:
>>
>>> 2016-06-17 8:31 GMT-07:00 'Federico Morg Pareschi' via ganeti-devel <
>>> [email protected]>:
>>>
>>>> The ganeti-watcher holds the group file lock for too long, until after
>>>> the execution of a group-verify-disk job. This locks for a long time if
>>>> there are other jobs already running and blocking the verify from
>>>> executing. When the lock is held, another ganeti-watcher run cannot be
>>>> scheduled, so this prevents the ganeti-watcher from running for several
>>>> minutes.
>>>>
>>>> With this commit, the lock is released before running the VerifyDisks
>>>> operation, so even if the submitted job gets stuck in the Job Queue, a
>>>> subsequent ganeti-watcher run would still happen.
>>>>
>>>
>>> Quick question: what prevents a runaway case where VerifyDisks is
>>> blocking for hours and we have many watchers all running and submitting
>>> their own VerifyDisks?
>>>
>>> thanks,
>>> iustin
>>>
>>
>>
>
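
For anyone following the thread, the locking change described in the commit message can be sketched roughly as below. This is only an illustration and not the actual watcher code: run_watcher, the events list, the lock file path and the stuck_job callback are all hypothetical names, and a plain flock(2) on a temporary file stands in for the watcher state file lock.

```python
import fcntl
import os
import tempfile


def run_watcher(state_path, submit_verify_job, events):
    """Hypothetical watcher run: hold the state-file lock only while
    working on the state file, not while the verify job is pending."""
    fd = os.open(state_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        try:
            # Non-blocking exclusive lock: give up if another run holds it.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            events.append("skipped: lock busy")
            return False
        events.append("locked")
        # ... read/update the watcher state here (omitted) ...
        fcntl.flock(fd, fcntl.LOCK_UN)
        events.append("unlocked")
    finally:
        os.close(fd)
    # The lock is already released, so a job that sits in the queue for
    # a long time no longer prevents the next watcher run from starting.
    submit_verify_job()
    events.append("job done")
    return True


# Demo: a second run starts while the first run's job is still "stuck".
events = []
lock_path = os.path.join(tempfile.mkdtemp(), "watcher.lock")
nested = []


def stuck_job():
    # Stand-in for a verify job blocked in the queue; meanwhile the
    # next scheduled watcher run fires.
    nested.append(run_watcher(lock_path, lambda: None, events))


first_ok = run_watcher(lock_path, stuck_job, events)
```

Note that the sketch does nothing about the redundant verify jobs raised in the question above: every run still submits its own job.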
