On Mon, Dec 19, 2011 at 04:38:22PM +0100, Michael Hanselmann wrote:
> When an opcode is about to be processed its dependencies are
> evaluated using “_JobDependencyManager.CheckAndRegister”. Due
> to its nature that function requires a lock on the manager's
> internal structures. All of this happens while the job queue
> lock is held in shared mode (required for the job processor).
>
> When a job has been processed any pending dependencies are re-added
> to the job workerpool. Before this patch that would require
> the manager's lock and then, for adding the jobs, the job queue
> lock. Since this is in reverse order it will lead to deadlocks.
> ---
> lib/jqueue.py | 25 ++++++++++++++++++-------
> test/ganeti.jqueue_unittest.py | 3 +++
> 2 files changed, 21 insertions(+), 7 deletions(-)
>
> diff --git a/lib/jqueue.py b/lib/jqueue.py
> index d5ea3cb..cc06b70 100644
> --- a/lib/jqueue.py
> +++ b/lib/jqueue.py
> @@ -1439,28 +1439,39 @@ class _JobDependencyManager:
> " not one of '%s' as required" %
> (dep_job_id, status, utils.CommaJoin(dep_status)))
>
> - @locking.ssynchronized(_LOCK)
> + def _RemoveEmptyWaitersUnlocked(self):
> + """Remove all jobs without actual waiters.
> +
> + """
> + for job_id in [job_id for (job_id, waiters) in self._waiters.items()
> + if not waiters]:
> + del self._waiters[job_id]
> +
> def NotifyWaiters(self, job_id):
> """Notifies all jobs waiting for a certain job ID.
>
> + @important: Do not call until L{CheckAndRegister} returned a status other
> + than L{self.WAIT} for C{job_id}, or behaviour is undefined
I'm sorry, is this really serious? We deadlock simply due to calling
functions in the wrong state??
iustin