Yes, this is definitely the case.
We have ideas for a future change that would avoid this problem,
although we're still considering exactly how to implement it.
This patch is a short-term quick fix for some blocker issues that the
current behaviour causes.

On 17 June 2016 at 17:57, Iustin Pop <[email protected]> wrote:

> 2016-06-17 9:46 GMT-07:00 Federico Pareschi <[email protected]>:
>
>> When a ganeti-watcher runs on the nodegroup, it submits the verify-group
>> job. If there is another job in the queue that is taking some locks that
>> stop the verify-group job (like an instance creation) then the whole
>> ganeti-watcher is blocked and has to wait for that job to finish.
>>
>> We have a case where we need the ganeti-watcher's periodic check to
>> restart some downed instances, but the ganeti-watcher itself gets stuck on
>> some other jobs and those downed instances don't get brought back up on
>> time (and this causes problems for us).
>>
>> There really is no reason to hold a lock on the watcher state file when
>> submitting the verify disk job anyway.
>>
>
> All this makes sense. My question is rather, if we end up with multiple
> watchers basically running (waiting) concurrently, can we end up with
> multiple (redundant) verify-group jobs?
>
> Sorry if I misunderstand the situation.
>
> iustin
>
> On 17 June 2016 at 17:18, Iustin Pop <[email protected]> wrote:
>>
>>> 2016-06-17 8:31 GMT-07:00 'Federico Morg Pareschi' via ganeti-devel <
>>> [email protected]>:
>>>
>>>> The ganeti-watcher holds the group file lock for too long, until after
>>>> the execution of a group-verify-disk job. The lock is held for a long time if
>>>> there are other jobs already running and blocking the verify job from
>>>> executing. When the lock is held, another ganeti-watcher run cannot be
>>>> scheduled, so this prevents the ganeti-watcher from running for several
>>>> minutes.
>>>>
>>>> With this commit, the lock is released before running the VerifyDisks
>>>> operation, so even if the submitted job gets stuck in the Job Queue, a
>>>> subsequent ganeti-watcher run would still happen.
>>>>
>>>
>>> Quick question: what prevents a runaway case where VerifyDisks is
>>> blocking for hours and we have many watchers all running and submitting
>>> their own VerifyDisks?
>>>
>>> thanks,
>>> iustin
>>>
>>
>>
>
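
The change discussed in this thread (releasing the watcher state file lock
before submitting the disk-verification job) can be sketched roughly as
follows. This is a minimal illustration and not Ganeti's actual code:
state_file_lock, can_lock, run_watcher and the submit_job callback are
hypothetical names, and the lock is assumed to be a plain flock() on the
state file.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def state_file_lock(path):
    """Hold an exclusive, non-blocking flock on `path` inside the block.

    Hypothetical helper: raises OSError if another holder has the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield fd
    finally:
        # Closing the descriptor also releases the flock, if it was taken.
        os.close(fd)


def can_lock(path):
    """Return True if a new watcher run could take the lock right now."""
    try:
        with state_file_lock(path):
            return True
    except OSError:
        return False


def run_watcher(path, submit_job):
    """One watcher run: update state under the lock, then submit the job.

    `submit_job` stands in for submitting the verify job and waiting for
    it; it may block for a long time if the job queue is busy.
    """
    with state_file_lock(path) as fd:
        # Short critical section: only the state file update is protected.
        os.write(fd, b"watcher run recorded\n")
    # The lock is released here, so a slow job no longer blocks later runs.
    return submit_job()
```

With this ordering, a later run can take the lock even while an earlier
run is still waiting on its job. The sketch does nothing to stop several
runs from each submitting their own job, which is the concern raised in
the thread above.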
