Re: [PATCH v5 08/20] jobs: protect jobs with job_lock/unlock

2022-02-24 Thread Emanuele Giuseppe Esposito



On 24/02/2022 17:48, Stefan Hajnoczi wrote:
> On Thu, Feb 24, 2022 at 01:45:48PM +0100, Emanuele Giuseppe Esposito wrote:
>>
>>
>> On 17/02/2022 15:48, Stefan Hajnoczi wrote:
>>> On Tue, Feb 08, 2022 at 09:35:01AM -0500, Emanuele Giuseppe Esposito wrote:
 diff --git a/block/replication.c b/block/replication.c
 index 55c8f894aa..a03b28726e 100644
 --- a/block/replication.c
 +++ b/block/replication.c
 @@ -149,7 +149,9 @@ static void replication_close(BlockDriverState *bs)
  if (s->stage == BLOCK_REPLICATION_FAILOVER) {
 commit_job = &s->commit_job->job;
  assert(commit_job->aio_context == qemu_get_current_aio_context());
>>>
>>> Is it safe to access commit_job->aio_context outside job_mutex?
>>
>> No, but it is currently not done. Patch 18 takes care of protecting
>> aio_context. Remember again that job lock API is still nop.
>>>
 @@ -1838,7 +1840,9 @@ static void drive_backup_abort(BlkActionState 
 *common)
  aio_context = bdrv_get_aio_context(state->bs);
  aio_context_acquire(aio_context);
  
 -job_cancel_sync(&state->job->job, true);
 +WITH_JOB_LOCK_GUARD() {
 +job_cancel_sync(&state->job->job, true);
 +}
>>>
>>> Maybe job_cancel_sync() should take the lock internally since all
>>> callers in this patch seem to need the lock?
>>
>> The _locked version is useful because it is used where lock guards are
>> already present and cover multiple operations. There are only 3 places
>> where a lock guard is added to cover job_cancel_sync_locked. Is it worth
>> defining yet another function?
>>
>>
>>>
>>> I noticed this patch does not add WITH_JOB_LOCK_GUARD() to
>>> tests/unit/test-blockjob.c:cancel_common(). Was that an oversight or is
>>> there a reason why job_mutex is not needed around the job_cancel_sync()
>>> call there?
>>
>> No, locks in unit tests are added in patch 10 "jobs: protect jobs with
>> job_lock/unlock".
> 
> I see, it's a question of how to split up the patches. When patches
> leave the code in a state with broken invariants it becomes difficult to
> review. I can't distinguish between actual bugs and temporary violations
> that will be fixed in a later patch (unless they are clearly marked).
> 
> If you can structure patches so they are self-contained and don't leave
> the broken invariants then that would make review easier, but in this
> case it is tricky so I'll do the best I can to review it if you cannot
> restructure the sequence of commits.

Yes, the main problem is that ideally we want to add the job lock and
remove the AioContext lock. But the two can't happen together, and just
adding proper locks would create a ton of deadlocks, because in order to
maintain the invariants sometimes the job lock sits inside the
AioContext lock, and at other times the opposite happens.

The way it is done in this series is:
1) create job_lock/unlock as nop (sketched below)
2) make sure the nop job_lock is protecting the Job fields
3) do all the API renaming, according to the locking used above
4) enable job_lock (not a nop anymore) and at the same time remove the
AioContext lock
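To be concrete, step 1 boils down to something like the following (a
rough sketch of the idea, not the exact code in the series):

void job_lock(void)
{
    /* nop for now: the real mutex is only switched on in step 4 */
}

void job_unlock(void)
{
    /* nop */
}

/* the guard macros are nops as well: the guarded block simply runs */
#define JOB_LOCK_GUARD()
#define WITH_JOB_LOCK_GUARD()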

If you want, a more up-to-date branch with your feedback applied so far
is here:
https://gitlab.com/eesposit/qemu/-/commits/dp_jobs_reviewed

> 
>>>
 @@ -252,7 +258,13 @@ int block_job_add_bdrv(BlockJob *job, const char 
 *name, BlockDriverState *bs,
  
  static void block_job_on_idle(Notifier *n, void *opaque)
  {
 +/*
 + * we can't kick with job_mutex held, but we also want
 + * to protect the notifier list.
 + */
 +job_unlock();
  aio_wait_kick();
 +job_lock();
>>>
>>> I don't understand this. aio_wait_kick() looks safe to call with a mutex
>>> held?
>> You are right. It should be safe.
>>
>>>
 @@ -292,7 +304,9 @@ bool block_job_set_speed(BlockJob *job, int64_t speed, 
 Error **errp)
  job->speed = speed;
  
  if (drv->set_speed) {
 +job_unlock();
  drv->set_speed(job, speed);
 +job_lock();
>>>
>>> What guarantees that job stays alive during drv->set_speed(job)? We
>>> don't hold a ref here. Maybe the assumption is that
>>> block_job_set_speed() only gets called from the main loop thread and
>>> nothing else will modify the jobs list while we're in drv->set_speed()?
>>
>> What guaranteed this before? I am not sure.
> 
> I guess the reason is the one I suggested. It should be documented in
> the comments.
> 
>>
>>>
 @@ -545,10 +566,15 @@ BlockErrorAction block_job_error_action(BlockJob 
 *job, BlockdevOnError on_err,
  action);
  }
  if (action == BLOCK_ERROR_ACTION_STOP) {
 -if (!job->job.user_paused) {
 -job_pause(&job->job);
 -/* make the pause user visible, which will be resumed from 
 QMP. */
 -job->job.user_paused = true;
 +WITH_JOB_LOCK_GUARD() {
 +  

Re: [PATCH v5 08/20] jobs: protect jobs with job_lock/unlock

2022-02-24 Thread Stefan Hajnoczi
On Thu, Feb 24, 2022 at 01:45:48PM +0100, Emanuele Giuseppe Esposito wrote:
> 
> 
> On 17/02/2022 15:48, Stefan Hajnoczi wrote:
> > On Tue, Feb 08, 2022 at 09:35:01AM -0500, Emanuele Giuseppe Esposito wrote:
> >> diff --git a/block/replication.c b/block/replication.c
> >> index 55c8f894aa..a03b28726e 100644
> >> --- a/block/replication.c
> >> +++ b/block/replication.c
> >> @@ -149,7 +149,9 @@ static void replication_close(BlockDriverState *bs)
> >>  if (s->stage == BLOCK_REPLICATION_FAILOVER) {
> >>  commit_job = &s->commit_job->job;
> >>  assert(commit_job->aio_context == qemu_get_current_aio_context());
> > 
> > Is it safe to access commit_job->aio_context outside job_mutex?
> 
> No, but it is currently not done. Patch 18 takes care of protecting
> aio_context. Remember again that job lock API is still nop.
> > 
> >> @@ -1838,7 +1840,9 @@ static void drive_backup_abort(BlkActionState 
> >> *common)
> >>  aio_context = bdrv_get_aio_context(state->bs);
> >>  aio_context_acquire(aio_context);
> >>  
> >> -job_cancel_sync(&state->job->job, true);
> >> +WITH_JOB_LOCK_GUARD() {
> >> +job_cancel_sync(&state->job->job, true);
> >> +}
> > 
> > Maybe job_cancel_sync() should take the lock internally since all
> > callers in this patch seem to need the lock?
> 
> The _locked version is useful because it is used where lock guards are
> already present and cover multiple operations. There are only 3 places
> where a lock guard is added to cover job_cancel_sync_locked. Is it worth
> defining yet another function?
> 
> 
> > 
> > I noticed this patch does not add WITH_JOB_LOCK_GUARD() to
> > tests/unit/test-blockjob.c:cancel_common(). Was that an oversight or is
> > there a reason why job_mutex is not needed around the job_cancel_sync()
> > call there?
> 
> No, locks in unit tests are added in patch 10 "jobs: protect jobs with
> job_lock/unlock".

I see, it's a question of how to split up the patches. When patches
leave the code in a state with broken invariants it becomes difficult to
review. I can't distinguish between actual bugs and temporary violations
that will be fixed in a later patch (unless they are clearly marked).

If you can structure patches so they are self-contained and don't leave
the broken invariants then that would make review easier, but in this
case it is tricky so I'll do the best I can to review it if you cannot
restructure the sequence of commits.

> > 
> >> @@ -252,7 +258,13 @@ int block_job_add_bdrv(BlockJob *job, const char 
> >> *name, BlockDriverState *bs,
> >>  
> >>  static void block_job_on_idle(Notifier *n, void *opaque)
> >>  {
> >> +/*
> >> + * we can't kick with job_mutex held, but we also want
> >> + * to protect the notifier list.
> >> + */
> >> +job_unlock();
> >>  aio_wait_kick();
> >> +job_lock();
> > 
> > I don't understand this. aio_wait_kick() looks safe to call with a mutex
> > held?
> You are right. It should be safe.
> 
> > 
> >> @@ -292,7 +304,9 @@ bool block_job_set_speed(BlockJob *job, int64_t speed, 
> >> Error **errp)
> >>  job->speed = speed;
> >>  
> >>  if (drv->set_speed) {
> >> +job_unlock();
> >>  drv->set_speed(job, speed);
> >> +job_lock();
> > 
> > What guarantees that job stays alive during drv->set_speed(job)? We
> > don't hold a ref here. Maybe the assumption is that
> > block_job_set_speed() only gets called from the main loop thread and
> > nothing else will modify the jobs list while we're in drv->set_speed()?
> 
> What guaranteed this before? I am not sure.

I guess the reason is the one I suggested. It should be documented in
the comments.
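
Something like this, for example (just a sketch; the main-loop-only
assumption is exactly the part that needs to be checked and written
down):

    if (drv->set_speed) {
        /*
         * block_job_set_speed() runs in the main loop only, so the job
         * cannot go away while job_mutex is dropped around the driver
         * callback.
         */
        job_unlock();
        drv->set_speed(job, speed);
        job_lock();
    }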

> 
> > 
> >> @@ -545,10 +566,15 @@ BlockErrorAction block_job_error_action(BlockJob 
> >> *job, BlockdevOnError on_err,
> >>  action);
> >>  }
> >>  if (action == BLOCK_ERROR_ACTION_STOP) {
> >> -if (!job->job.user_paused) {
> >> -job_pause(&job->job);
> >> -/* make the pause user visible, which will be resumed from 
> >> QMP. */
> >> -job->job.user_paused = true;
> >> +WITH_JOB_LOCK_GUARD() {
> >> +if (!job->job.user_paused) {
> >> +job_pause(&job->job);
> >> +/*
> >> + * make the pause user visible, which will be
> >> + * resumed from QMP.
> >> + */
> >> +job->job.user_paused = true;
> >> +}
> >>  }
> >>  block_job_iostatus_set_err(job, error);
> > 
> > Does this need the lock? If not, why is block_job_iostatus_reset()
> > called with the hold?
> > 
> block_job_iostatus_set_err does not touch any Job fields. On the other
> hand block_job_iostatus_reset reads job.user_paused and job.pause_count.

BlockJob->iostatus requires no locking?




Re: [PATCH v5 08/20] jobs: protect jobs with job_lock/unlock

2022-02-24 Thread Emanuele Giuseppe Esposito



On 17/02/2022 15:48, Stefan Hajnoczi wrote:
> On Tue, Feb 08, 2022 at 09:35:01AM -0500, Emanuele Giuseppe Esposito wrote:
>> diff --git a/block/replication.c b/block/replication.c
>> index 55c8f894aa..a03b28726e 100644
>> --- a/block/replication.c
>> +++ b/block/replication.c
>> @@ -149,7 +149,9 @@ static void replication_close(BlockDriverState *bs)
>>  if (s->stage == BLOCK_REPLICATION_FAILOVER) {
>>  commit_job = &s->commit_job->job;
>>  assert(commit_job->aio_context == qemu_get_current_aio_context());
> 
> Is it safe to access commit_job->aio_context outside job_mutex?

No, but it is currently not done. Patch 18 takes care of protecting
aio_context. Remember again that job lock API is still nop.
> 
>> @@ -1838,7 +1840,9 @@ static void drive_backup_abort(BlkActionState *common)
>>  aio_context = bdrv_get_aio_context(state->bs);
>>  aio_context_acquire(aio_context);
>>  
>> -job_cancel_sync(&state->job->job, true);
>> +WITH_JOB_LOCK_GUARD() {
>> +job_cancel_sync(&state->job->job, true);
>> +}
> 
> Maybe job_cancel_sync() should take the lock internally since all
> callers in this patch seem to need the lock?

The _locked version is useful because it is used where lock guards are
already present and cover multiple operations. There are only 3 places
where a lock guard is added to cover job_cancel_sync_locked. Is it worth
defining yet another function?
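
The kind of caller I mean looks like this (hypothetical example, the
helper names are only for illustration):

    WITH_JOB_LOCK_GUARD() {
        /* check and cancel happen inside one critical section */
        if (!job_is_completed_locked(&job->job)) {
            job_cancel_sync_locked(&job->job, true);
        }
    }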


> 
> I noticed this patch does not add WITH_JOB_LOCK_GUARD() to
> tests/unit/test-blockjob.c:cancel_common(). Was that an oversight or is
> there a reason why job_mutex is not needed around the job_cancel_sync()
> call there?

No, locks in unit tests are added in patch 10 "jobs: protect jobs with
job_lock/unlock".

> 
>> @@ -252,7 +258,13 @@ int block_job_add_bdrv(BlockJob *job, const char *name, 
>> BlockDriverState *bs,
>>  
>>  static void block_job_on_idle(Notifier *n, void *opaque)
>>  {
>> +/*
>> + * we can't kick with job_mutex held, but we also want
>> + * to protect the notifier list.
>> + */
>> +job_unlock();
>>  aio_wait_kick();
>> +job_lock();
> 
> I don't understand this. aio_wait_kick() looks safe to call with a mutex
> held?
You are right. It should be safe.
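
So the hunk can simply become (sketch):

static void block_job_on_idle(Notifier *n, void *opaque)
{
    /* safe with job_mutex held, no unlock/lock dance needed */
    aio_wait_kick();
}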

> 
>> @@ -292,7 +304,9 @@ bool block_job_set_speed(BlockJob *job, int64_t speed, 
>> Error **errp)
>>  job->speed = speed;
>>  
>>  if (drv->set_speed) {
>> +job_unlock();
>>  drv->set_speed(job, speed);
>> +job_lock();
> 
> What guarantees that job stays alive during drv->set_speed(job)? We
> don't hold a ref here. Maybe the assumption is that
> block_job_set_speed() only gets called from the main loop thread and
> nothing else will modify the jobs list while we're in drv->set_speed()?

What guaranteed this before? I am not sure.

> 
>> @@ -545,10 +566,15 @@ BlockErrorAction block_job_error_action(BlockJob *job, 
>> BlockdevOnError on_err,
>>  action);
>>  }
>>  if (action == BLOCK_ERROR_ACTION_STOP) {
>> -if (!job->job.user_paused) {
>> -job_pause(&job->job);
>> -/* make the pause user visible, which will be resumed from QMP. 
>> */
>> -job->job.user_paused = true;
>> +WITH_JOB_LOCK_GUARD() {
>> +if (!job->job.user_paused) {
>> +job_pause(&job->job);
>> +/*
>> + * make the pause user visible, which will be
>> + * resumed from QMP.
>> + */
>> +job->job.user_paused = true;
>> +}
>>  }
>>  block_job_iostatus_set_err(job, error);
> 
> Does this need the lock? If not, why is block_job_iostatus_reset()
> called with the hold?
> 
block_job_iostatus_set_err does not touch any Job fields. On the other
hand block_job_iostatus_reset reads job.user_paused and job.pause_count.
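
For reference, the two helpers look roughly like this (paraphrased from
blockjob.c, worth double-checking against the tree):

static void block_job_iostatus_set_err(BlockJob *job, int error)
{
    /* only touches BlockJob->iostatus, no Job fields */
    if (job->iostatus == BLOCK_DEVICE_IO_STATUS_OK) {
        job->iostatus = error == ENOSPC ? BLOCK_DEVICE_IO_STATUS_NOSPACE :
                                          BLOCK_DEVICE_IO_STATUS_FAILED;
    }
}

void block_job_iostatus_reset(BlockJob *job)
{
    if (job->iostatus == BLOCK_DEVICE_IO_STATUS_OK) {
        return;
    }
    /* reads Job fields (user_paused, pause_count), hence the lock */
    assert(job->job.user_paused && job->job.pause_count > 0);
    job->iostatus = BLOCK_DEVICE_IO_STATUS_OK;
}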

Emanuele




Re: [PATCH v5 08/20] jobs: protect jobs with job_lock/unlock

2022-02-17 Thread Stefan Hajnoczi
On Tue, Feb 08, 2022 at 09:35:01AM -0500, Emanuele Giuseppe Esposito wrote:
> diff --git a/block/replication.c b/block/replication.c
> index 55c8f894aa..a03b28726e 100644
> --- a/block/replication.c
> +++ b/block/replication.c
> @@ -149,7 +149,9 @@ static void replication_close(BlockDriverState *bs)
>  if (s->stage == BLOCK_REPLICATION_FAILOVER) {
>  commit_job = &s->commit_job->job;
>  assert(commit_job->aio_context == qemu_get_current_aio_context());

Is it safe to access commit_job->aio_context outside job_mutex?

> @@ -1838,7 +1840,9 @@ static void drive_backup_abort(BlkActionState *common)
>  aio_context = bdrv_get_aio_context(state->bs);
>  aio_context_acquire(aio_context);
>  
> -job_cancel_sync(&state->job->job, true);
> +WITH_JOB_LOCK_GUARD() {
> +job_cancel_sync(&state->job->job, true);
> +}

Maybe job_cancel_sync() should take the lock internally since all
callers in this patch seem to need the lock?
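
Something along these lines is what I mean (sketch only, not code from
this series):

int job_cancel_sync(Job *job, bool force)
{
    /* take job_mutex internally for the common case ... */
    JOB_LOCK_GUARD();
    /* ... and keep a _locked variant for callers that already hold it */
    return job_cancel_sync_locked(job, force);
}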

I noticed this patch does not add WITH_JOB_LOCK_GUARD() to
tests/unit/test-blockjob.c:cancel_common(). Was that an oversight or is
there a reason why job_mutex is not needed around the job_cancel_sync()
call there?

> @@ -252,7 +258,13 @@ int block_job_add_bdrv(BlockJob *job, const char *name, 
> BlockDriverState *bs,
>  
>  static void block_job_on_idle(Notifier *n, void *opaque)
>  {
> +/*
> + * we can't kick with job_mutex held, but we also want
> + * to protect the notifier list.
> + */
> +job_unlock();
>  aio_wait_kick();
> +job_lock();

I don't understand this. aio_wait_kick() looks safe to call with a mutex
held?

> @@ -292,7 +304,9 @@ bool block_job_set_speed(BlockJob *job, int64_t speed, 
> Error **errp)
>  job->speed = speed;
>  
>  if (drv->set_speed) {
> +job_unlock();
>  drv->set_speed(job, speed);
> +job_lock();

What guarantees that job stays alive during drv->set_speed(job)? We
don't hold a ref here. Maybe the assumption is that
block_job_set_speed() only gets called from the main loop thread and
nothing else will modify the jobs list while we're in drv->set_speed()?

> @@ -545,10 +566,15 @@ BlockErrorAction block_job_error_action(BlockJob *job, 
> BlockdevOnError on_err,
>  action);
>  }
>  if (action == BLOCK_ERROR_ACTION_STOP) {
> -if (!job->job.user_paused) {
> -job_pause(&job->job);
> -/* make the pause user visible, which will be resumed from QMP. 
> */
> -job->job.user_paused = true;
> +WITH_JOB_LOCK_GUARD() {
> +if (!job->job.user_paused) {
> +job_pause(&job->job);
> +/*
> + * make the pause user visible, which will be
> + * resumed from QMP.
> + */
> +job->job.user_paused = true;
> +}
>  }
>  block_job_iostatus_set_err(job, error);

Does this need the lock? If not, why is block_job_iostatus_reset()
called with the hold?




[PATCH v5 08/20] jobs: protect jobs with job_lock/unlock

2022-02-08 Thread Emanuele Giuseppe Esposito
Introduce the job locking mechanism through the whole job API,
following the comments in job.h and the requirements of job-monitor
(like the functions in job-qmp.c, which assume the lock is held) and
of job-driver (like in mirror.c and all the other JobDrivers, where
the lock is not held).

Use the _locked helpers introduced earlier to differentiate between
functions called with and without job_mutex.
This only applies to functions that are called in both cases; all the
others will be renamed later.

job_{lock/unlock} is independent from real_job_{lock/unlock}.

Note: at this stage, job_{lock/unlock} and job lock guard macros
are *nop*.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c |  18 ---
 block/replication.c |   8 ++-
 blockdev.c  |  17 --
 blockjob.c  |  62 +++---
 job-qmp.c   |   2 +
 job.c   | 123 +++-
 monitor/qmp-cmds.c  |   6 ++-
 qemu-img.c  |  41 +--
 8 files changed, 191 insertions(+), 86 deletions(-)

diff --git a/block.c b/block.c
index 506c2f778d..33fe969fd9 100644
--- a/block.c
+++ b/block.c
@@ -4968,7 +4968,9 @@ static void bdrv_close(BlockDriverState *bs)
 
 void bdrv_close_all(void)
 {
-assert(job_next(NULL) == NULL);
+WITH_JOB_LOCK_GUARD() {
+assert(job_next(NULL) == NULL);
+}
 assert(qemu_in_main_thread());
 
 /* Drop references from requests still in flight, such as canceled block
@@ -6145,13 +6147,15 @@ XDbgBlockGraph *bdrv_get_xdbg_block_graph(Error **errp)
 }
 }
 
-for (job = block_job_next(NULL); job; job = block_job_next(job)) {
-GSList *el;
+WITH_JOB_LOCK_GUARD() {
+for (job = block_job_next(NULL); job; job = block_job_next(job)) {
+GSList *el;
 
-xdbg_graph_add_node(gr, job, X_DBG_BLOCK_GRAPH_NODE_TYPE_BLOCK_JOB,
-   job->job.id);
-for (el = job->nodes; el; el = el->next) {
-xdbg_graph_add_edge(gr, job, (BdrvChild *)el->data);
+xdbg_graph_add_node(gr, job, X_DBG_BLOCK_GRAPH_NODE_TYPE_BLOCK_JOB,
+job->job.id);
+for (el = job->nodes; el; el = el->next) {
+xdbg_graph_add_edge(gr, job, (BdrvChild *)el->data);
+}
 }
 }
 
diff --git a/block/replication.c b/block/replication.c
index 55c8f894aa..a03b28726e 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -149,7 +149,9 @@ static void replication_close(BlockDriverState *bs)
 if (s->stage == BLOCK_REPLICATION_FAILOVER) {
 commit_job = &s->commit_job->job;
 assert(commit_job->aio_context == qemu_get_current_aio_context());
-job_cancel_sync(commit_job, false);
+WITH_JOB_LOCK_GUARD() {
+job_cancel_sync(commit_job, false);
+}
 }
 
 if (s->mode == REPLICATION_MODE_SECONDARY) {
@@ -726,7 +728,9 @@ static void replication_stop(ReplicationState *rs, bool 
failover, Error **errp)
  * disk, secondary disk in backup_job_completed().
  */
 if (s->backup_job) {
-job_cancel_sync(&s->backup_job->job, true);
+WITH_JOB_LOCK_GUARD() {
+job_cancel_sync(&s->backup_job->job, true);
+}
 }
 
 if (!failover) {
diff --git a/blockdev.c b/blockdev.c
index e315466914..c5fba4d157 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -150,6 +150,8 @@ void blockdev_mark_auto_del(BlockBackend *blk)
 return;
 }
 
+JOB_LOCK_GUARD();
+
 for (job = block_job_next(NULL); job; job = block_job_next(job)) {
 if (block_job_has_bdrv(job, blk_bs(blk))) {
 AioContext *aio_context = job->job.aio_context;
@@ -1838,7 +1840,9 @@ static void drive_backup_abort(BlkActionState *common)
 aio_context = bdrv_get_aio_context(state->bs);
 aio_context_acquire(aio_context);
 
-job_cancel_sync(&state->job->job, true);
+WITH_JOB_LOCK_GUARD() {
+job_cancel_sync(&state->job->job, true);
+}
 
 aio_context_release(aio_context);
 }
@@ -1939,7 +1943,9 @@ static void blockdev_backup_abort(BlkActionState *common)
 aio_context = bdrv_get_aio_context(state->bs);
 aio_context_acquire(aio_context);
 
-job_cancel_sync(&state->job->job, true);
+WITH_JOB_LOCK_GUARD() {
+job_cancel_sync(&state->job->job, true);
+}
 
 aio_context_release(aio_context);
 }
@@ -2388,7 +2394,10 @@ exit:
 if (!has_props) {
 qapi_free_TransactionProperties(props);
 }
-job_txn_unref(block_job_txn);
+
+WITH_JOB_LOCK_GUARD() {
+job_txn_unref(block_job_txn);
+}
 }
 
 BlockDirtyBitmapSha256 *qmp_x_debug_block_dirty_bitmap_sha256(const char *node,
@@ -3711,6 +3720,8 @@ BlockJobInfoList *qmp_query_block_jobs(Error **errp)
 BlockJobInfoList *head = NULL, **tail = &head;
 BlockJob *job;
 
+JOB_LOCK_GUARD();
+
 for (job = block_job_next(NULL); job; job = block_job_next(job)) {