On 17 June 2016 at 10:36, Federico Pareschi <[email protected]> wrote:

> Yes, this is definitely the case.
> We have ideas to implement something in the future to obviate this
> problem, although we're still considering exactly how to implement it.
> This is a short-term quick fix to solve some blocker issues that show up
> as a consequence of it.
>
Ack, thanks for the info!

iustin

> On 17 June 2016 at 17:57, Iustin Pop <[email protected]> wrote:
>
>> 2016-06-17 9:46 GMT-07:00 Federico Pareschi <[email protected]>:
>>
>>> When a ganeti-watcher runs on the nodegroup, it submits the verify-group
>>> job. If there is another job in the queue that is taking some locks that
>>> stop the verify-group job (like an instance creation) then the whole
>>> ganeti-watcher is blocked and has to wait for that job to finish.
>>>
>>> We have a case where we need the ganeti-watcher's periodic check to
>>> restart some downed instances, but the ganeti-watcher itself gets stuck on
>>> some other jobs and those downed instances don't get brought back up on
>>> time (and this causes problems for us).
>>>
>>> There really is no reason to hold a lock to the watcher state file when
>>> submitting the verify disk job anyway.
>>>
>>
>> All this makes sense. My question is rather, if we end up with multiple
>> watchers basically running (waiting) concurrently, can we end up with
>> multiple (redundant) verify-group jobs?
>>
>> Sorry if I misunderstand the situation.
>>
>> iustin
>>
>> On 17 June 2016 at 17:18, Iustin Pop <[email protected]> wrote:
>>>
>>>> 2016-06-17 8:31 GMT-07:00 'Federico Morg Pareschi' via ganeti-devel <
>>>> [email protected]>:
>>>>
>>>>> The ganeti-watcher holds the group file lock for too long, until after
>>>>> the execution of a group-verify-disk job. This locks for a long time if
>>>>> there are other jobs already running and blocking the verify from
>>>>> executing. When the lock is held, another ganeti-watcher run cannot be
>>>>> scheduled, so this prevents the ganeti-watcher from running for several
>>>>> minutes.
>>>>>
>>>>> With this commit, the lock is released before running the VerifyDisks
>>>>> operation, so even if the submitted job gets stuck in the Job Queue, a
>>>>> subsequent ganeti-watcher run would still happen.
>>>>>
>>>>
>>>> Quick question: what prevents a runaway case where VerifyDisks is
>>>> blocking for hours and we have many watchers all running and submitting
>>>> their own VerifyDisks?
>>>>
>>>> thanks,
>>>> iustin
>>>>
>>>
>>>
>>
>
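
For readers following the thread, the locking change described in the quoted commit message can be illustrated with a minimal Python sketch. This is not Ganeti's actual watcher code: `state_file_lock`, `watcher_run`, `can_lock`, `demo` and the two callbacks are hypothetical names chosen for illustration, and the sketch assumes a POSIX system where `fcntl.flock` is available. It only shows the ordering (update state under the lock, release, then submit the potentially slow job).

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def state_file_lock(path):
    """Hold an exclusive, non-blocking flock on `path` (hypothetical helper)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Raises BlockingIOError if another open file description holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield fd
    finally:
        os.close(fd)  # closing the descriptor releases the flock


def watcher_run(path, update_state, submit_job):
    """Do the quick state work under the lock, then submit the job without it."""
    with state_file_lock(path):
        update_state()
    # The lock is released here, so a slow or queued job no longer
    # prevents the next watcher run from taking the lock.
    return submit_job()


def can_lock(path):
    """Return True if a second watcher could take the lock right now."""
    try:
        with state_file_lock(path):
            return True
    except BlockingIOError:
        return False


def demo():
    """Record whether a second watcher could lock at each stage of a run."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    seen = {}
    try:
        watcher_run(
            path,
            update_state=lambda: seen.update(during_update=can_lock(path)),
            submit_job=lambda: seen.update(during_job=can_lock(path)),
        )
    finally:
        os.unlink(path)
    return seen
```

Calling `demo()` shows that a second run is locked out during the state update but not while the job is being submitted. The sketch does nothing about the concern raised above, namely that several watcher runs could each submit their own redundant VerifyDisks job.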
