2016-06-17 9:46 GMT-07:00 Federico Pareschi <[email protected]>:

> When a ganeti-watcher runs on the nodegroup, it submits the verify-group
> job. If there is another job in the queue that is taking some locks that
> stop the verify-group job (like an instance creation) then the whole
> ganeti-watcher is blocked and has to wait for that job to finish.
>
> We have a case where we need the ganeti-watcher's periodic check to
> restart some downed instances, but the ganeti-watcher itself gets stuck on
> some other jobs and those downed instances don't get brought back up on
> time (And this causes problems to us).
>
> There really is no reason to hold a lock to the watcher state file when
> submitting the verify disk job anyway.
>

All this makes sense. My question is rather, if we end up with multiple
watchers basically running (waiting) concurrently, can we end up with
multiple (redundant) verify-group jobs?

Sorry if I misunderstand the situation.

iustin

On 17 June 2016 at 17:18, Iustin Pop <[email protected]> wrote:
>
>> 2016-06-17 8:31 GMT-07:00 'Federico Morg Pareschi' via ganeti-devel <
>> [email protected]>:
>>
>>> The ganeti-watcher holds the group file lock for too long, until after
>>> the execution of a group-verify-disk job. This locks for a long time if
>>> there are other jobs already running and blocking the verify from
>>> executing. When the lock is held, another ganeti-watcher run cannot be
>>> scheduled, so this prevents the ganeti-watcher from running for several
>>> minutes.
>>>
>>> With this commit, the lock is released before running the VerifyDisks
>>> operation, so even if the submitted job gets stuck in the Job Queue, a
>>> subsequient ganeti-watcher run would still happen.
>>>
>>
>> Quick question: what prevents a runaway case where VerifyDisks is
>> blocking for hours and we have many watchers all running and submitting
>> their own VerifyDisks?
>>
>> thanks,
>> iustin
>>
>
>

Reply via email to