On 17 June 2016 at 10:36, Federico Pareschi <[email protected]> wrote:

> Yes, this is definitely the case.
> We have ideas to implement something in the future to obviate this
> problem, although we're still considering exactly how to implement it.
> This is a short-term quick fix to solve some blocker issues that show up
> as a consequence of it.
>

Ack, thanks for the info!

iustin


> On 17 June 2016 at 17:57, Iustin Pop <[email protected]> wrote:
>
>> 2016-06-17 9:46 GMT-07:00 Federico Pareschi <[email protected]>:
>>
>>> When a ganeti-watcher runs on the nodegroup, it submits the verify-group
>>> job. If there is another job in the queue that is taking some locks that
>>> stop the verify-group job (like an instance creation) then the whole
>>> ganeti-watcher is blocked and has to wait for that job to finish.
>>>
>>> We have a case where we need the ganeti-watcher's periodic check to
>>> restart some downed instances, but the ganeti-watcher itself gets stuck on
>>> some other jobs and those downed instances don't get brought back up on
>>> time (and this causes problems for us).
>>>
>>> There really is no reason to hold a lock to the watcher state file when
>>> submitting the verify disk job anyway.
>>>
>>
>> All this makes sense. My question is rather, if we end up with multiple
>> watchers basically running (waiting) concurrently, can we end up with
>> multiple (redundant) verify-group jobs?
>>
>> Sorry if I misunderstand the situation.
>>
>> iustin
>>
>> On 17 June 2016 at 17:18, Iustin Pop <[email protected]> wrote:
>>>
>>>> 2016-06-17 8:31 GMT-07:00 'Federico Morg Pareschi' via ganeti-devel <
>>>> [email protected]>:
>>>>
>>>>> The ganeti-watcher holds the group file lock for too long, until after
>>>>> the execution of a group-verify-disk job. The lock is held for a long time if
>>>>> there are other jobs already running and blocking the verify from
>>>>> executing. When the lock is held, another ganeti-watcher run cannot be
>>>>> scheduled, so this prevents the ganeti-watcher from running for several
>>>>> minutes.
>>>>>
>>>>> With this commit, the lock is released before running the VerifyDisks
>>>>> operation, so even if the submitted job gets stuck in the Job Queue, a
>>>>> subsequent ganeti-watcher run would still happen.
>>>>>
>>>>
>>>> Quick question: what prevents a runaway case where VerifyDisks is
>>>> blocking for hours and we have many watchers all running and submitting
>>>> their own VerifyDisks?
>>>>
>>>> thanks,
>>>> iustin
>>>>
>>>
>>>
>>
>
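
For readers following the thread, the change described in the quoted commit message (release the group file lock before the potentially long-running VerifyDisks job is submitted) can be illustrated with a minimal sketch. This is not the actual ganeti-watcher code: the names `group_state_lock`, `run_watcher`, `quick_checks` and `submit_verify_disks` are hypothetical, and the use of a non-blocking `flock` is an assumption made only for illustration.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def group_state_lock(path):
    """Hold an exclusive, non-blocking flock on `path` for the `with` body.

    Hypothetical helper; raises BlockingIOError if another holder exists.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield fd
    finally:
        # Closing the descriptor also drops any lock taken through it.
        os.close(fd)


def run_watcher(lock_path, quick_checks, submit_verify_disks):
    """Run the lock-protected work first, then submit the slow job unlocked.

    Both callables are placeholders for the watcher's real steps.
    """
    with group_state_lock(lock_path):
        quick_checks()  # e.g. restart downed instances, update state file
    # The lock is released here, so a later watcher run is not blocked
    # even if the submitted job waits in the job queue for a long time.
    return submit_verify_disks()
```

Note that this sketch does nothing about the concern raised in the thread: with the lock released early, several watcher runs could each submit their own VerifyDisks job.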
