On Mon, Feb 23, 2026 at 22:39:43 -0300, Lucas Amaral wrote:
> When creating multiple VMs in parallel, concurrent volume creation
> holds an async job on the pool. A pool refresh during this window

Other operations increase the async job count too, such as
download/upload/wipe.

> fails immediately with "pool has asynchronous jobs running", which
> causes cascading failures in parallel provisioning workflows.
> 
> The refresh operation genuinely cannot run while a volume build is
> in progress (it clears all volume metadata via
> virStoragePoolObjClearVols), but the failure is premature since

Yes, in particular the list of volumes can't be cleared while any of
the operations that set the 'in_use' semaphore of virStorageVol are
still in progress.

Without that we could perhaps do refcounted volume objects and let the
running API operate on the old variant while refresh populates a new
list, but that wouldn't work well with the interlocking.

> the async job will finish shortly.

IMO 'shortly' is relative if upload/download/wipe is considered.

> Add a condition variable to virStoragePoolObj that allows
> storagePoolRefresh() to wait up to 30 seconds for async jobs to
> drain. The volume build thread broadcasts the condition when it
> decrements asyncjobs to zero. After waking, the refresh function
> re-validates preconditions (pool still active, not starting) since

The pre-condition checks ought to be packaged into a helper function if
this approach is to be taken.
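
A rough sketch of the kind of helper I mean, so that the same check can
run both before the wait and after waking up. The struct and function
names are made up for illustration and are not the real libvirt types;
the two conditions are the ones named in the commit message:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the pool object; NOT the real virStoragePoolObj. */
typedef struct {
    bool active;    /* pool is running */
    bool starting;  /* pool startup is still in progress */
} DemoPoolState;

/* Returns NULL when a refresh may proceed, otherwise a short reason.
 * Meant to be called once before waiting for jobs and once more after
 * the wait, since the pool lock was dropped in between. */
static const char *
demoPoolRefreshBlocker(const DemoPoolState *pool)
{
    if (!pool->active)
        return "pool is not active";
    if (pool->starting)
        return "pool is starting up";
    return NULL;
}
```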

> the pool lock was released during the wait.
> 
> Only storagePoolRefresh() gets the wait mechanism. The other three
> operations (destroy, undefine, delete) keep the immediate error

This IMO doesn't make sense, and additionally our async job handling on
VM objects doesn't have this distinction. All APIs get to wait for the
async job to finish. I don't see any reason for this to be an exception.

I think that we in fact need to extract the logic you propose and apply
it to all operations which check the reference count, and also add a
variant of it which considers the 'in_use' semaphore in addition to the
'asyncjobs' semaphore.

> because waiting to destroy or delete a pool during volume creation
> is not a sensible user workflow.

Technically the 'delete'/'destroy' would happen. Also as noted there are
other operations too which take the 'async job'.

> 
> Resolves: https://issues.redhat.com/browse/RHEL-150758
> Signed-off-by: Lucas Amaral <[email protected]>
> ---

[...]
