2016-06-17 9:46 GMT-07:00 Federico Pareschi <[email protected]>:

> When a ganeti-watcher runs on the nodegroup, it submits the verify-group
> job. If there is another job in the queue that is taking some locks that
> stop the verify-group job (like an instance creation) then the whole
> ganeti-watcher is blocked and has to wait for that job to finish.
>
> We have a case where we need the ganeti-watcher's periodic check to
> restart some downed instances, but the ganeti-watcher itself gets stuck on
> some other jobs and those downed instances don't get brought back up on
> time (And this causes problems to us).
>
> There really is no reason to hold a lock to the watcher state file when
> submitting the verify disk job anyway.
>
All this makes sense. My question is rather, if we end up with multiple
watchers basically running (waiting) concurrently, can we end up with
multiple (redundant) verify-group jobs? Sorry if I misunderstand the
situation.

iustin

On 17 June 2016 at 17:18, Iustin Pop <[email protected]> wrote:
>
>> 2016-06-17 8:31 GMT-07:00 'Federico Morg Pareschi' via ganeti-devel <
>> [email protected]>:
>>
>>> The ganeti-watcher holds the group file lock for too long, until after
>>> the execution of a group-verify-disk job. This locks for a long time if
>>> there are other jobs already running and blocking the verify from
>>> executing. When the lock is held, another ganeti-watcher run cannot be
>>> scheduled, so this prevents the ganeti-watcher from running for several
>>> minutes.
>>>
>>> With this commit, the lock is released before running the VerifyDisks
>>> operation, so even if the submitted job gets stuck in the Job Queue, a
>>> subsequent ganeti-watcher run would still happen.
>>>
>>
>> Quick question: what prevents a runaway case where VerifyDisks is
>> blocking for hours and we have many watchers all running and submitting
>> their own VerifyDisks?
>>
>> thanks,
>> iustin
>>
>
>
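For readers following the thread, the lock-scope change described in the commit message can be sketched roughly as below. This is not the actual ganeti-watcher code: the names `state_file_lock`, `watcher_run`, `can_lock` and `submit_verify_job` are hypothetical, and the sketch assumes a POSIX system where `fcntl.flock` is available. It only illustrates releasing the state-file lock before a potentially blocking job submission; it does nothing about the redundant-job question raised above.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def state_file_lock(path):
    """Hold an exclusive, non-blocking flock on `path` for the with-block."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Raises BlockingIOError if another open file description holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield fd
    finally:
        # Closing the descriptor also releases the flock.
        os.close(fd)


def can_lock(path):
    """Return True if the state-file lock could be taken right now."""
    try:
        with state_file_lock(path):
            return True
    except BlockingIOError:
        return False


def watcher_run(path, submit_verify_job):
    """Hypothetical watcher pass: touch state under the lock, submit unlocked."""
    with state_file_lock(path) as fd:
        os.write(fd, b"run\n")  # stand-in for the real state update
    # The lock is released here, so a slow or stuck submission below
    # no longer prevents the next watcher run from taking the lock.
    return submit_verify_job()
```

With this shape, a second watcher run can acquire the state-file lock while the first one is still waiting on its submitted job, which is the behaviour the commit message describes.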
