Re: Potential regression in 'qemu-img convert' to LVM

2021-03-04 Thread Stefan Reiter

On 07/01/2021 21:03, Nir Soffer wrote:

On Tue, Sep 15, 2020 at 2:51 PM Stefan Reiter  wrote:


On 9/15/20 11:08 AM, Nir Soffer wrote:

On Mon, Sep 14, 2020 at 3:25 PM Stefan Reiter  wrote:


Hi list,

following command fails since 5.1 (tested on kernel 5.4.60):

# qemu-img convert -p -f raw -O raw /dev/zvol/pool/disk-1 /dev/vg/disk-1
qemu-img: error while writing at byte 2157968896: Device or resource busy

(source is ZFS here, but doesn't matter in practice, it always fails the
same; offset changes slightly but consistently hovers around 2^31)

strace shows the following:
fallocate(13, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2157968896,
4608) = -1 EBUSY (Device or resource busy)


What is the size of the LV?



Same as the source, 5GB in my test case. Created with:

# lvcreate -ay --size 5242880k --name disk-1 vg


Does it happen if you change sparse minimum size (-S)?

For example: -S 64k

  qemu-img convert -p -f raw -O raw -S 64k /dev/zvol/pool/disk-1
/dev/vg/disk-1



Tried a few different values, always the same result: EBUSY at byte
2157968896.


Other fallocate calls leading up to this work fine.

This happens since commit edafc70c0c "qemu-img convert: Don't pre-zero
images", before that all fallocates happened at the start. Reverting the
commit and calling qemu-img exactly the same way on the same data works
fine.


But slowly, doing up to 100% more work for fully allocated images.



Of course, I'm not saying the patch is wrong, reverting it just avoids
triggering the bug.


Simply retrying the syscall on EBUSY (like EINTR) does *not* work,
once it fails it keeps failing with the same error.

I couldn't find anything related to EBUSY on fallocate, and it only
happens on LVM targets... Any idea or pointers where to look?


Is this thin LV?



No, regular LV. See command above.


This works for us using regular LVs.

Which kernel? which distro?



Reproducible on:
* PVE w/ kernel 5.4.60 (Ubuntu based)
* Manjaro w/ kernel 5.8.6

I found that it does not happen with all images; I suppose a certain number
of smaller holes is needed to trigger it. I am using a VM image with a
bare-bones Alpine Linux installation, but it's not an isolated case; we've
had two people report the issue on our bug tracker:
https://bugzilla.proxmox.com/show_bug.cgi?id=3002


I think that this issue may be fixed by
https://lists.nongnu.org/archive/html/qemu-block/2020-11/msg00358.html

Nir




Sorry for the late reply, but yes, I can confirm this fixes the issue.

~




Re: [PATCH] migration/block-dirty-bitmap: fix larger granularity bitmaps

2020-10-22 Thread Stefan Reiter

On 10/21/20 5:17 PM, Vladimir Sementsov-Ogievskiy wrote:

21.10.2020 17:44, Stefan Reiter wrote:

sectors_per_chunk is a 64 bit integer, but the calculation is done in 32
bits, leading to an overflow for coarse bitmap granularities.

If that results in the value 0, it leads to a hang where no progress is
made but send_bitmap_bits is constantly called with nr_sectors being 0.

Signed-off-by: Stefan Reiter 
---
  migration/block-dirty-bitmap.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/migration/block-dirty-bitmap.c 
b/migration/block-dirty-bitmap.c

index 5bef793ac0..5398869e2b 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -562,8 +562,9 @@ static int add_bitmaps_to_list(DBMSaveState *s, 
BlockDriverState *bs,

  dbms->bitmap_alias = g_strdup(bitmap_alias);
  dbms->bitmap = bitmap;
  dbms->total_sectors = bdrv_nb_sectors(bs);
-    dbms->sectors_per_chunk = CHUNK_SIZE * 8 *
+    dbms->sectors_per_chunk = CHUNK_SIZE * 8lu *


I'd prefer 8llu for absolute safety.


  bdrv_dirty_bitmap_granularity(bitmap) >> BDRV_SECTOR_BITS;
+    assert(dbms->sectors_per_chunk != 0);


I doubt that we need this assertion. The bug is fixed, and the overflow is
obviously impossible now. And if we really want to assert that there is no
overflow (assuming future changes), it should look like this:

   assert(bdrv_dirty_bitmap_granularity(bitmap) <
          (1ull << 63) / CHUNK_SIZE / 8 >> BDRV_SECTOR_BITS);

to cover not only the corner case but any overflow. And of course we should
modify the original expression to do ">> BDRV_SECTOR_BITS" earlier than all
the multiplies, like

   dbms->sectors_per_chunk = CHUNK_SIZE * 8llu *
       (bdrv_dirty_bitmap_granularity(bitmap) >> BDRV_SECTOR_BITS);

But I think that only the s/8/8ull/ change is enough.



I agree, and I wouldn't mind removing the assert, but just to clarify it 
was mostly meant to prevent the case where the migration gets stuck 
entirely. Even if the calculation is wrong, it would at least do 
_something_ instead of endlessly looping.


Maybe an

assert(nr_sectors != 0);

in send_bitmap_bits instead for that?


  if (bdrv_dirty_bitmap_enabled(bitmap)) {
  dbms->flags |= DIRTY_BITMAP_MIG_START_FLAG_ENABLED;
  }




With 8llu and with or without assertion:
Reviewed-by: Vladimir Sementsov-Ogievskiy 






[PATCH] migration/block-dirty-bitmap: fix larger granularity bitmaps

2020-10-21 Thread Stefan Reiter
sectors_per_chunk is a 64 bit integer, but the calculation is done in 32
bits, leading to an overflow for coarse bitmap granularities.

If that results in the value 0, it leads to a hang where no progress is
made but send_bitmap_bits is constantly called with nr_sectors being 0.
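
For illustration, a minimal stand-alone sketch of the overflow (the CHUNK_SIZE
value is assumed here purely for demonstration; see migration/block-dirty-bitmap.c
for the real definition):

    #include <stdint.h>
    #include <stdio.h>
    #include <inttypes.h>

    #define CHUNK_SIZE       (1 << 10)  /* assumed value, for demonstration only */
    #define BDRV_SECTOR_BITS 9

    int main(void)
    {
        uint32_t granularity = 1u << 26;  /* a coarse bitmap granularity: 64 MiB */

        /* Old code: the product is computed in 32 bits; for this granularity
         * it wraps around to exactly 0 before the shift and assignment. */
        uint64_t broken = CHUNK_SIZE * 8 * granularity >> BDRV_SECTOR_BITS;

        /* Fixed code: the 8ULL constant promotes the whole expression to
         * 64 bits, so nothing is truncated. */
        uint64_t fixed = CHUNK_SIZE * 8ULL * granularity >> BDRV_SECTOR_BITS;

        printf("broken=%" PRIu64 " fixed=%" PRIu64 "\n", broken, fixed);
        return 0;
    }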

Signed-off-by: Stefan Reiter 
---
 migration/block-dirty-bitmap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index 5bef793ac0..5398869e2b 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -562,8 +562,9 @@ static int add_bitmaps_to_list(DBMSaveState *s, 
BlockDriverState *bs,
 dbms->bitmap_alias = g_strdup(bitmap_alias);
 dbms->bitmap = bitmap;
 dbms->total_sectors = bdrv_nb_sectors(bs);
-dbms->sectors_per_chunk = CHUNK_SIZE * 8 *
+dbms->sectors_per_chunk = CHUNK_SIZE * 8lu *
 bdrv_dirty_bitmap_granularity(bitmap) >> BDRV_SECTOR_BITS;
+assert(dbms->sectors_per_chunk != 0);
 if (bdrv_dirty_bitmap_enabled(bitmap)) {
 dbms->flags |= DIRTY_BITMAP_MIG_START_FLAG_ENABLED;
 }
-- 
2.20.1





Re: Potential regression in 'qemu-img convert' to LVM

2020-09-15 Thread Stefan Reiter

On 9/15/20 11:08 AM, Nir Soffer wrote:

On Mon, Sep 14, 2020 at 3:25 PM Stefan Reiter  wrote:


Hi list,

following command fails since 5.1 (tested on kernel 5.4.60):

# qemu-img convert -p -f raw -O raw /dev/zvol/pool/disk-1 /dev/vg/disk-1
qemu-img: error while writing at byte 2157968896: Device or resource busy

(source is ZFS here, but doesn't matter in practice, it always fails the
same; offset changes slightly but consistently hovers around 2^31)

strace shows the following:
fallocate(13, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2157968896,
4608) = -1 EBUSY (Device or resource busy)


What is the size of the LV?



Same as the source, 5GB in my test case. Created with:

# lvcreate -ay --size 5242880k --name disk-1 vg


Does it happen if you change sparse minimum size (-S)?

For example: -S 64k

 qemu-img convert -p -f raw -O raw -S 64k /dev/zvol/pool/disk-1
/dev/vg/disk-1



Tried a few different values, always the same result: EBUSY at byte 
2157968896.



Other fallocate calls leading up to this work fine.

This happens since commit edafc70c0c "qemu-img convert: Don't pre-zero
images", before that all fallocates happened at the start. Reverting the
commit and calling qemu-img exactly the same way on the same data works
fine.


But slowly, doing up to 100% more work for fully allocated images.



Of course, I'm not saying the patch is wrong, reverting it just avoids 
triggering the bug.



Simply retrying the syscall on EBUSY (like EINTR) does *not* work,
once it fails it keeps failing with the same error.

I couldn't find anything related to EBUSY on fallocate, and it only
happens on LVM targets... Any idea or pointers where to look?


Is this thin LV?



No, regular LV. See command above.


This works for us using regular LVs.

Which kernel? which distro?



Reproducible on:
* PVE w/ kernel 5.4.60 (Ubuntu based)
* Manjaro w/ kernel 5.8.6

I found that it does not happen with all images; I suppose a certain number
of smaller holes is needed to trigger it. I am using a VM image with a
bare-bones Alpine Linux installation, but it's not an isolated case; we've
had two people report the issue on our bug tracker:
https://bugzilla.proxmox.com/show_bug.cgi?id=3002


Thanks,
Stefan


Nir







Potential regression in 'qemu-img convert' to LVM

2020-09-14 Thread Stefan Reiter

Hi list,

following command fails since 5.1 (tested on kernel 5.4.60):

# qemu-img convert -p -f raw -O raw /dev/zvol/pool/disk-1 /dev/vg/disk-1
qemu-img: error while writing at byte 2157968896: Device or resource busy

(source is ZFS here, but doesn't matter in practice, it always fails the 
same; offset changes slightly but consistently hovers around 2^31)


strace shows the following:
fallocate(13, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2157968896, 
4608) = -1 EBUSY (Device or resource busy)


Other fallocate calls leading up to this work fine.
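
For reference, the failing call can be replayed outside of qemu-img with a
small stand-alone program along these lines (offset, length and flags taken
from the strace output above; note that the EBUSY may well depend on the
concurrent I/O qemu-img issues, so a stand-alone run does not necessarily
reproduce it):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "/dev/vg/disk-1";
        int fd = open(path, O_RDWR);

        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }

        /* Same flags, offset and length as the failing call seen in strace */
        if (fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
                      2157968896, 4608) != 0) {
            perror("fallocate");  /* EBUSY would match the qemu-img failure */
        } else {
            printf("fallocate succeeded\n");
        }

        close(fd);
        return EXIT_SUCCESS;
    }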

This happens since commit edafc70c0c "qemu-img convert: Don't pre-zero 
images", before that all fallocates happened at the start. Reverting the 
commit and calling qemu-img exactly the same way on the same data works 
fine. Simply retrying the syscall on EBUSY (like EINTR) does *not* work, 
once it fails it keeps failing with the same error.


I couldn't find anything related to EBUSY on fallocate, and it only 
happens on LVM targets... Any idea or pointers where to look?


~ Stefan




Re: [PATCH 0/3] Add support for sequential backups

2020-09-14 Thread Stefan Reiter

Friendly ping :)

On 8/26/20 2:13 PM, Stefan Reiter wrote:

Backups can already be done for multiple drives in a transaction. However, these
jobs all start at once, potentially hogging a lot of disk IO at the same time.
This problem is made worse by the fact that IO throttling is only available on
a per-job basis.

Add a flag to QMP to support sequential transactions for backups. This way,
every job will be executed one after the other, while still providing the
benefit of transactions (i.e. once one fails, all remaining ones will be
cancelled).

We've internally (in Proxmox VE) been doing sequential backups for a long time
with great success, albeit in a different fashion. This series is the result of
aligning our internal changes closer to upstream, and might be useful for other
people as well.


Stefan Reiter (3):
   job: add sequential transaction support
   blockdev: add sequential mode to *-backup transactions
   backup: initialize bcs bitmap on job create, not start

  block/backup.c|  4 ++--
  blockdev.c| 25 ++---
  include/qemu/job.h| 12 
  job.c | 24 
  qapi/transaction.json |  6 +-
  5 files changed, 65 insertions(+), 6 deletions(-)






[PATCH 2/3] blockdev: add sequential mode to *-backup transactions

2020-08-26 Thread Stefan Reiter
Only supported with completion-mode 'grouped', since it relies on a
JobTxn to exist. This means that for now it is only available for
{drive,blockdev}-backup transactions.

Since only one job will be running at a time, bandwidth limits can be
applied effectively. It can also prevent overloading a host's IO
capabilities in general.

Signed-off-by: Stefan Reiter 
---
 blockdev.c| 25 ++---
 qapi/transaction.json |  6 +-
 2 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 3848a9c8ab..3691e5e791 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1826,7 +1826,10 @@ static void drive_backup_commit(BlkActionState *common)
 aio_context_acquire(aio_context);
 
 assert(state->job);
-job_start(&state->job->job);
+
+if (!common->txn_props->sequential) {
+job_start(&state->job->job);
+}
 
 aio_context_release(aio_context);
 }
@@ -1927,7 +1930,9 @@ static void blockdev_backup_commit(BlkActionState *common)
 aio_context_acquire(aio_context);
 
 assert(state->job);
-job_start(&state->job->job);
+if (!common->txn_props->sequential) {
+job_start(&state->job->job);
+}
 
 aio_context_release(aio_context);
 }
@@ -2303,6 +2308,11 @@ static TransactionProperties *get_transaction_properties(
 props->completion_mode = ACTION_COMPLETION_MODE_INDIVIDUAL;
 }
 
+if (!props->has_sequential) {
+props->has_sequential = true;
+props->sequential = false;
+}
+
 return props;
 }
 
@@ -2328,7 +2338,11 @@ void qmp_transaction(TransactionActionList *dev_list,
  */
 props = get_transaction_properties(props);
 if (props->completion_mode != ACTION_COMPLETION_MODE_INDIVIDUAL) {
-block_job_txn = job_txn_new();
+block_job_txn = props->sequential ? job_txn_new_seq() : job_txn_new();
+} else if (props->sequential) {
+error_setg(errp, "Sequential transaction mode is not supported with "
+ "completion-mode = individual");
+return;
 }
 
 /* drain all i/o before any operations */
@@ -2367,6 +2381,11 @@ void qmp_transaction(TransactionActionList *dev_list,
 }
 }
 
+/* jobs in sequential txns don't start themselves on commit */
+if (block_job_txn && props->sequential) {
+job_txn_start_seq(block_job_txn);
+}
+
 /* success */
 goto exit;
 
diff --git a/qapi/transaction.json b/qapi/transaction.json
index 15ddebdbc3..4808383088 100644
--- a/qapi/transaction.json
+++ b/qapi/transaction.json
@@ -84,11 +84,15 @@
 #   Actions will complete or fail as a group.
 #   See @ActionCompletionMode for details.
 #
+# @sequential: Run the jobs in the transaction one after the other, instead
+#  of all at once. Not supported for completion-mode 'individual'.
+#
 # Since: 2.5
 ##
 { 'struct': 'TransactionProperties',
   'data': {
-   '*completion-mode': 'ActionCompletionMode'
+   '*completion-mode': 'ActionCompletionMode',
+   '*sequential': 'bool'
   }
 }
 
-- 
2.20.1





[PATCH 1/3] job: add sequential transaction support

2020-08-26 Thread Stefan Reiter
Jobs in a sequential transaction should never be started with job_start
manually. job_txn_start_seq and the sequentially called job_start will
take care of it, 'assert'ing in case a job is already running or has
finished.

Signed-off-by: Stefan Reiter 
---
 include/qemu/job.h | 12 
 job.c  | 24 
 2 files changed, 36 insertions(+)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 32aabb1c60..f7a6a0926a 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -280,6 +280,18 @@ typedef enum JobCreateFlags {
  */
 JobTxn *job_txn_new(void);
 
+/**
+ * Create a new transaction and set it to sequential mode, i.e. run all jobs
+ * one after the other instead of at the same time.
+ */
+JobTxn *job_txn_new_seq(void);
+
+/**
+ * Helper method to start the first job in a sequential transaction to kick it
+ * off. Other jobs will be run after this one completes.
+ */
+void job_txn_start_seq(JobTxn *txn);
+
 /**
  * Release a reference that was previously acquired with job_txn_add_job or
  * job_txn_new. If it's the last reference to the object, it will be freed.
diff --git a/job.c b/job.c
index 8fecf38960..4df7c1d2ca 100644
--- a/job.c
+++ b/job.c
@@ -72,6 +72,8 @@ struct JobTxn {
 
 /* Reference count */
 int refcnt;
+
+bool sequential;
 };
 
 /* Right now, this mutex is only needed to synchronize accesses to job->busy
@@ -102,6 +104,25 @@ JobTxn *job_txn_new(void)
 return txn;
 }
 
+JobTxn *job_txn_new_seq(void)
+{
+JobTxn *txn = job_txn_new();
+txn->sequential = true;
+return txn;
+}
+
+void job_txn_start_seq(JobTxn *txn)
+{
+assert(txn->sequential);
+assert(!txn->aborting);
+
+Job *first = QLIST_FIRST(&txn->jobs);
+assert(first);
+assert(first->status == JOB_STATUS_CREATED);
+
+job_start(first);
+}
+
 static void job_txn_ref(JobTxn *txn)
 {
 txn->refcnt++;
@@ -840,6 +861,9 @@ static void job_completed_txn_success(Job *job)
  */
 QLIST_FOREACH(other_job, &txn->jobs, txn_list) {
 if (!job_is_completed(other_job)) {
+if (txn->sequential) {
+job_start(other_job);
+}
 return;
 }
 assert(other_job->ret == 0);
-- 
2.20.1





[PATCH 3/3] backup: initialize bcs bitmap on job create, not start

2020-08-26 Thread Stefan Reiter
After backup_init_bcs_bitmap the copy-before-write behaviour is active.
This way, multiple backup jobs created at once but running in a
sequential transaction will still represent the same point in time.

Signed-off-by: Stefan Reiter 
---

I'd imagine this was done on job start for a purpose, so this is potentially
wrong. In testing it works fine.

Sent along for feedback, since it would be necessary to really make use of the
sequential backup feature (without it, the individual backup jobs would not have
a consistent view of the guest).


 block/backup.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 4f13bb20a5..14660eef45 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -237,8 +237,6 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
 BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
 int ret = 0;
 
-backup_init_bcs_bitmap(s);
-
 if (s->sync_mode == MIRROR_SYNC_MODE_TOP) {
 int64_t offset = 0;
 int64_t count;
@@ -471,6 +469,8 @@ BlockJob *backup_job_create(const char *job_id, 
BlockDriverState *bs,
 block_job_add_bdrv(&job->common, "target", target, 0, BLK_PERM_ALL,
&error_abort);
 
+backup_init_bcs_bitmap(job);
+
 return &job->common;
 
  error:
-- 
2.20.1





[PATCH 0/3] Add support for sequential backups

2020-08-26 Thread Stefan Reiter
Backups can already be done for multiple drives in a transaction. However, these
jobs all start at once, potentially hogging a lot of disk IO at the same time.
This problem is made worse by the fact that IO throttling is only available on
a per-job basis.

Add a flag to QMP to support sequential transactions for backups. This way,
every job will be executed one after the other, while still providing the
benefit of transactions (i.e. once one fails, all remaining ones will be
cancelled).

We've internally (in Proxmox VE) been doing sequential backups for a long time
with great success, albeit in a different fashion. This series is the result of
aligning our internal changes closer to upstream, and might be useful for other
people as well.


Stefan Reiter (3):
  job: add sequential transaction support
  blockdev: add sequential mode to *-backup transactions
  backup: initialize bcs bitmap on job create, not start

 block/backup.c|  4 ++--
 blockdev.c| 25 ++---
 include/qemu/job.h| 12 
 job.c | 24 
 qapi/transaction.json |  6 +-
 5 files changed, 65 insertions(+), 6 deletions(-)

-- 
2.20.1





Re: [PATCH for-5.1 v2 2/2] iotests: add test for unaligned granularity bitmap backup

2020-08-10 Thread Stefan Reiter

On 8/10/20 5:11 PM, Max Reitz wrote:

(Note: When submitting a patch series with multiple patches, our
guidelines require a cover letter:
https://wiki.qemu.org/Contribute/SubmitAPatch#Include_a_meaningful_cover_letter

But not too important now.)



Sorry, remembered for next time. Thanks for applying the patches!


On 10.08.20 11:55, Stefan Reiter wrote:

Start a VM with a 4097 byte image attached, add a 4096 byte granularity
dirty bitmap, mark it dirty, and then do a backup.

This used to run into an assert and fail, check that it works as
expected and also check the created image to ensure that misaligned
backups in general work correctly.

Signed-off-by: Stefan Reiter 
---

I saw Andrey's big series covering iotest 303 so I went for 304.


Works for me.


Never submitted
one before so I hope that's okay, if not feel free to renumber it.


Yep, if there’s a clash I tend to just renumber it when applying it.


  tests/qemu-iotests/304 | 68 ++
  tests/qemu-iotests/304.out |  2 ++
  tests/qemu-iotests/group   |  1 +
  3 files changed, 71 insertions(+)
  create mode 100755 tests/qemu-iotests/304
  create mode 100644 tests/qemu-iotests/304.out

diff --git a/tests/qemu-iotests/304 b/tests/qemu-iotests/304
new file mode 100755
index 00..9a3b0224fa
--- /dev/null
+++ b/tests/qemu-iotests/304
@@ -0,0 +1,68 @@
+#!/usr/bin/env python3
+#
+# Tests dirty-bitmap backup with unaligned bitmap granularity
+#
+# Copyright (c) 2020 Proxmox Server Solutions
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+# owner=s.rei...@proxmox.com
+
+import iotests
+from iotests import qemu_img_create, qemu_img_log, file_path
+
+iotests.script_initialize(supported_fmts=['qcow2'],
+  supported_protocols=['file'])
+
+test_img = file_path('test.qcow2')
+target_img = file_path('target.qcow2')
+
+# unaligned by one byte
+image_len = 4097
+bitmap_granularity = 4096
+
+qemu_img_create('-f', iotests.imgfmt, test_img, str(image_len))
+
+# create VM and add dirty bitmap
+vm = iotests.VM().add_drive(test_img)
+vm.launch()
+
+vm.qmp('block-dirty-bitmap-add', **{
+'node': 'drive0',
+'name': 'bitmap0',
+'granularity': bitmap_granularity
+})
+
+# mark entire bitmap as dirty
+vm.hmp_qemu_io('drive0', 'write -P0x16 0 4096');
+vm.hmp_qemu_io('drive0', 'write -P0x17 4097 1');


s/4097/4096/?

(4097 works, too, because of something somewhere aligning up the 4097 to
512 byte sectors, I suppose, but I don’t think it’s the address you want
here)



You're right, it seems counting is hard. I'll take you up on the offer 
from your other mail, you can fix this please :)



+
+# do backup and wait for completion
+vm.qmp('drive-backup', **{
+'device': 'drive0',
+'sync': 'full',
+'target': target_img,
+'bitmap': 'bitmap0',
+'bitmap-mode': 'on-success'


The bitmap is unnecessary, isn’t it?  I.e., if I drop the
block-dirty-bitmap-add call and the bitmap* parameters here, I still get
an assertion failure without patch 1.

Not that it really matters, it’s just that this makes it look like less
of an issue than it actually is.  (Which is why I’d drop the bitmap
stuff in case there’s no actual reason for it.)



Oh my, I just realized that I misunderstood the root cause then. I mean, 
the fix is fine, I see it now, but you're right, no dirty bitmap needed 
- you can remove that as well if you want.



+})
+
+event = vm.event_wait(name='BLOCK_JOB_COMPLETED',
+  match={'data': {'device': 'drive0'}},
+  timeout=5.0)


(By the way, “vm.run_job('drive0', auto_dismiss=True)” would have worked
as well.  But since the backup job just needs waiting for a single
event, I suppose it doesn’t matter.  Just a hint in case you start
writing more iotests in the future.)

Max






[PATCH for-5.1 v2 2/2] iotests: add test for unaligned granularity bitmap backup

2020-08-10 Thread Stefan Reiter
Start a VM with a 4097 byte image attached, add a 4096 byte granularity
dirty bitmap, mark it dirty, and then do a backup.

This used to run into an assert and fail, check that it works as
expected and also check the created image to ensure that misaligned
backups in general work correctly.

Signed-off-by: Stefan Reiter 
---

I saw Andrey's big series covering iotest 303 so I went for 304. Never submitted
one before so I hope that's okay, if not feel free to renumber it.

 tests/qemu-iotests/304 | 68 ++
 tests/qemu-iotests/304.out |  2 ++
 tests/qemu-iotests/group   |  1 +
 3 files changed, 71 insertions(+)
 create mode 100755 tests/qemu-iotests/304
 create mode 100644 tests/qemu-iotests/304.out

diff --git a/tests/qemu-iotests/304 b/tests/qemu-iotests/304
new file mode 100755
index 00..9a3b0224fa
--- /dev/null
+++ b/tests/qemu-iotests/304
@@ -0,0 +1,68 @@
+#!/usr/bin/env python3
+#
+# Tests dirty-bitmap backup with unaligned bitmap granularity
+#
+# Copyright (c) 2020 Proxmox Server Solutions
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+# owner=s.rei...@proxmox.com
+
+import iotests
+from iotests import qemu_img_create, qemu_img_log, file_path
+
+iotests.script_initialize(supported_fmts=['qcow2'],
+  supported_protocols=['file'])
+
+test_img = file_path('test.qcow2')
+target_img = file_path('target.qcow2')
+
+# unaligned by one byte
+image_len = 4097
+bitmap_granularity = 4096
+
+qemu_img_create('-f', iotests.imgfmt, test_img, str(image_len))
+
+# create VM and add dirty bitmap
+vm = iotests.VM().add_drive(test_img)
+vm.launch()
+
+vm.qmp('block-dirty-bitmap-add', **{
+'node': 'drive0',
+'name': 'bitmap0',
+'granularity': bitmap_granularity
+})
+
+# mark entire bitmap as dirty
+vm.hmp_qemu_io('drive0', 'write -P0x16 0 4096');
+vm.hmp_qemu_io('drive0', 'write -P0x17 4097 1');
+
+# do backup and wait for completion
+vm.qmp('drive-backup', **{
+'device': 'drive0',
+'sync': 'full',
+'target': target_img,
+'bitmap': 'bitmap0',
+'bitmap-mode': 'on-success'
+})
+
+event = vm.event_wait(name='BLOCK_JOB_COMPLETED',
+  match={'data': {'device': 'drive0'}},
+  timeout=5.0)
+
+# shutdown to sync images
+vm.shutdown()
+
+# backup succeeded, check if image is correct
+qemu_img_log('compare', test_img, target_img)
diff --git a/tests/qemu-iotests/304.out b/tests/qemu-iotests/304.out
new file mode 100644
index 00..381cc056f7
--- /dev/null
+++ b/tests/qemu-iotests/304.out
@@ -0,0 +1,2 @@
+Images are identical.
+
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index 025ed5238d..7f76066640 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -309,3 +309,4 @@
 299 auto quick
 301 backing quick
 302 quick
+304 rw quick
-- 
2.20.1





[PATCH for-5.1 v2 1/2] block/block-copy: always align copied region to cluster size

2020-08-10 Thread Stefan Reiter
Since commit 42ac214406e0 (block/block-copy: refactor task creation)
block_copy_task_create calculates the area to be copied via
bdrv_dirty_bitmap_next_dirty_area, but that can return an unaligned byte
count if the image's last cluster end is not aligned to the bitmap's
granularity.

Always ALIGN_UP the resulting bytes value to satisfy block_copy_do_copy,
which requires the 'bytes' parameter to be aligned to cluster size.
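
As a small stand-alone illustration of the arithmetic (not part of the patch;
the macro is spelled out here but matches what include/qemu/osdep.h provides):

    #include <assert.h>
    #include <stdint.h>

    /* Same arithmetic as QEMU_ALIGN_UP() from include/qemu/osdep.h */
    #define ALIGN_UP(n, m) (((n) + (m) - 1) / (m) * (m))

    int main(void)
    {
        uint64_t cluster_size = 65536;  /* e.g. the 64k backup default */
        uint64_t bytes = 4097;          /* dirty area ending at the unaligned
                                         * image end, as in the accompanying
                                         * iotest (patch 2/2) */

        bytes = ALIGN_UP(bytes, cluster_size);
        assert(bytes == 65536);         /* now acceptable to block_copy_do_copy */
        return 0;
    }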

Reviewed-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Stefan Reiter 
---

I've now marked it for-5.1 since it is just a fix, but it's probably okay if
done later as well.

v2:
* add assert on offset alignment
* remove 'backing image' wording from commit
* collect R-b

 block/block-copy.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/block-copy.c b/block/block-copy.c
index f7428a7c08..a30b9097ef 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -142,6 +142,9 @@ static BlockCopyTask *block_copy_task_create(BlockCopyState 
*s,
 return NULL;
 }
 
+assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
+bytes = QEMU_ALIGN_UP(bytes, s->cluster_size);
+
 /* region is dirty, so no existent tasks possible in it */
 assert(!find_conflicting_task(s, offset, bytes));
 
-- 
2.20.1





[PATCH] block/block-copy: always align copied region to cluster size

2020-08-06 Thread Stefan Reiter
Since commit 42ac214406e0 (block/block-copy: refactor task creation)
block_copy_task_create calculates the area to be copied via
bdrv_dirty_bitmap_next_dirty_area, but that can return an unaligned byte
count if the backing image's last cluster end is not aligned to the
bitmap's granularity.

Always ALIGN_UP the resulting bytes value to satisfy block_copy_do_copy,
which requires the 'bytes' parameter to be aligned to cluster size.

Signed-off-by: Stefan Reiter 
---

This causes backups with unaligned image sizes to fail on the last block in my
testing (e.g. a backup job with 4k cluster size fails on a drive with 4097
bytes).

Alternatively one could remove the
  assert(QEMU_IS_ALIGNED(bytes, s->cluster_size));
from block_copy_do_copy, but I'd wager that's there for a reason?

 block/block-copy.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/block-copy.c b/block/block-copy.c
index f7428a7c08..023cb03200 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -142,6 +142,8 @@ static BlockCopyTask *block_copy_task_create(BlockCopyState 
*s,
 return NULL;
 }
 
+bytes = QEMU_ALIGN_UP(bytes, s->cluster_size);
+
 /* region is dirty, so no existent tasks possible in it */
 assert(!find_conflicting_task(s, offset, bytes));
 
-- 
2.20.1





Re: [RFC PATCH 2/3] block: Allow bdrv_run_co() from different AioContext

2020-05-27 Thread Stefan Reiter

On 5/26/20 6:42 PM, Kevin Wolf wrote:

Am 25.05.2020 um 18:41 hat Kevin Wolf geschrieben:

Am 25.05.2020 um 16:18 hat Stefan Reiter geschrieben:

On 5/12/20 4:43 PM, Kevin Wolf wrote:

Coroutine functions that are entered through bdrv_run_co() are already
safe to call from synchronous code in a different AioContext because
bdrv_coroutine_enter() will schedule them in the context of the node.

However, the coroutine fastpath still requires that we're already in the
right AioContext when called in coroutine context.

In order to make the behaviour more consistent and to make life a bit
easier for callers, let's check the AioContext and automatically move
the current coroutine around if we're not in the right context yet.

Signed-off-by: Kevin Wolf 
---
   block/io.c | 15 ++-
   1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/block/io.c b/block/io.c
index c1badaadc9..7808e8bdc0 100644
--- a/block/io.c
+++ b/block/io.c
@@ -895,8 +895,21 @@ static int bdrv_run_co(BlockDriverState *bs, 
CoroutineEntry *entry,
  void *opaque, int *ret)
   {
   if (qemu_in_coroutine()) {
-/* Fast-path if already in coroutine context */
+Coroutine *self = qemu_coroutine_self();
+AioContext *bs_ctx = bdrv_get_aio_context(bs);
+AioContext *co_ctx = qemu_coroutine_get_aio_context(self);
+
+if (bs_ctx != co_ctx) {
+/* Move to the iothread of the node */
+aio_co_schedule(bs_ctx, self);
+qemu_coroutine_yield();


I'm pretty sure this can lead to a race: When the thread we're re-scheduling
to is faster to schedule us than we can reach qemu_coroutine_yield, then
we'll get an abort ("Co-routine re-entered recursively"), since co->caller
is still set.

I've seen this happen in our code when I try to do the scheduling fandangle
there.


Ah, crap. I guess letting a coroutine re-schedule itself is only safe
within the same thread then.


Is there a safer way to have a coroutine reschedule itself? Some lock
missing?


There is no problem that can't be solved by adding another level of
indirection... We would have to schedule a BH in the original thread
that will only schedule the coroutine in its new thread after it has
yielded.

Maybe we should actually introduce a helper function that moves the
current coroutine to a different AioContext this way.


Like this:

https://repo.or.cz/qemu/kevin.git/commitdiff/ed0244ba4ac699f7e8eaf7512ff25645cf43bda2



Commit looks good to me, using aio_co_reschedule_self fixes all aborts 
I've been seeing.
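
Roughly, the approach described above (a BH in the old context that re-schedules
the coroutine only after it has yielded) could look like the following sketch;
the names here are illustrative, the real helper is aio_co_reschedule_self()
from the linked commit:

    #include "qemu/osdep.h"
    #include "block/aio.h"
    #include "qemu/coroutine.h"

    typedef struct {
        Coroutine *co;
        AioContext *new_ctx;
    } RescheduleSelf;

    static void reschedule_self_bh(void *opaque)
    {
        RescheduleSelf *data = opaque;
        /* Runs in the old context's thread, i.e. only after the coroutine
         * below has yielded, so entering it from the new context is safe. */
        aio_co_schedule(data->new_ctx, data->co);
    }

    static void coroutine_fn reschedule_self(AioContext *new_ctx)
    {
        AioContext *old_ctx = qemu_get_current_aio_context();

        if (old_ctx != new_ctx) {
            RescheduleSelf data = {
                .co = qemu_coroutine_self(),
                .new_ctx = new_ctx,
            };
            /*
             * Scheduling the coroutine in new_ctx directly would be racy: the
             * other thread could enter it before qemu_coroutine_yield() is
             * reached here. The BH in old_ctx cannot fire until we yield.
             */
            aio_bh_schedule_oneshot(old_ctx, reschedule_self_bh, &data);
            qemu_coroutine_yield();
        }
    }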



The series for which I need this isn't quite ready yet, so I haven't
sent it as a patch yet, but if it proves useful in other contexts, we
can always commit it without the rest.



I did a quick search for places where a similar pattern is used and 
found 'hw/9pfs/coth.h', where this behavior is already described (though 
the bh seems to be scheduled using the threadpool API, which I'm not 
really familiar with). All other places where qemu_coroutine_yield() is 
preceded by a aio_co_schedule() only do so in the same AioContext, which 
should be safe.



Kevin







Re: [RFC PATCH 2/3] block: Allow bdrv_run_co() from different AioContext

2020-05-25 Thread Stefan Reiter

On 5/12/20 4:43 PM, Kevin Wolf wrote:

Coroutine functions that are entered through bdrv_run_co() are already
safe to call from synchronous code in a different AioContext because
bdrv_coroutine_enter() will schedule them in the context of the node.

However, the coroutine fastpath still requires that we're already in the
right AioContext when called in coroutine context.

In order to make the behaviour more consistent and to make life a bit
easier for callers, let's check the AioContext and automatically move
the current coroutine around if we're not in the right context yet.

Signed-off-by: Kevin Wolf 
---
  block/io.c | 15 ++-
  1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/block/io.c b/block/io.c
index c1badaadc9..7808e8bdc0 100644
--- a/block/io.c
+++ b/block/io.c
@@ -895,8 +895,21 @@ static int bdrv_run_co(BlockDriverState *bs, 
CoroutineEntry *entry,
 void *opaque, int *ret)
  {
  if (qemu_in_coroutine()) {
-/* Fast-path if already in coroutine context */
+Coroutine *self = qemu_coroutine_self();
+AioContext *bs_ctx = bdrv_get_aio_context(bs);
+AioContext *co_ctx = qemu_coroutine_get_aio_context(self);
+
+if (bs_ctx != co_ctx) {
+/* Move to the iothread of the node */
+aio_co_schedule(bs_ctx, self);
+qemu_coroutine_yield();


I'm pretty sure this can lead to a race: When the thread we're 
re-scheduling to is faster to schedule us than we can reach 
qemu_coroutine_yield, then we'll get an abort ("Co-routine re-entered 
recursively"), since co->caller is still set.


I've seen this happen in our code when I try to do the scheduling 
fandangle there.


Is there a safer way to have a coroutine reschedule itself? Some lock 
missing?



+}
  entry(opaque);
+if (bs_ctx != co_ctx) {
+/* Move back to the original AioContext */
+aio_co_schedule(bs_ctx, self);
+qemu_coroutine_yield();
+}
  } else {
  Coroutine *co = qemu_coroutine_create(entry, opaque);
  *ret = NOT_DONE;






Re: [RFC PATCH 3/3] block: Assert we're running in the right thread

2020-05-14 Thread Stefan Reiter

On 5/12/20 4:43 PM, Kevin Wolf wrote:

tracked_request_begin() is called for most I/O operations, so it's a
good place to assert that we're indeed running in the home thread of the
node's AioContext.



Is this patch supposed to be always correct or only together with nr. 2?

I changed our code to call bdrv_flush_all from the main AIO context and
it certainly works just fine (even without this series, so I suppose that
would be the 'correct' way to fix it that you mention in the cover letter),
though of course it trips this assert without patch 2.



Signed-off-by: Kevin Wolf 
---
  block/io.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/io.c b/block/io.c
index 7808e8bdc0..924bf5ba46 100644
--- a/block/io.c
+++ b/block/io.c
@@ -695,14 +695,17 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
uint64_t bytes,
enum BdrvTrackedRequestType type)
  {
+Coroutine *self = qemu_coroutine_self();
+
  assert(bytes <= INT64_MAX && offset <= INT64_MAX - bytes);
+assert(bs->aio_context == qemu_coroutine_get_aio_context(self));
  
  *req = (BdrvTrackedRequest){

  .bs = bs,
  .offset = offset,
  .bytes  = bytes,
  .type   = type,
-.co = qemu_coroutine_self(),
+.co = self,
  .serialising= false,
  .overlap_offset = offset,
  .overlap_bytes  = bytes,






Re: [RFC] bdrv_flush: only use fast path when in owned AioContext

2020-05-12 Thread Stefan Reiter

On 5/12/20 1:32 PM, Kevin Wolf wrote:

Am 12.05.2020 um 12:57 hat Kevin Wolf geschrieben:

Am 11.05.2020 um 18:50 hat Stefan Reiter geschrieben:

Just because we're in a coroutine doesn't imply ownership of the context
of the flushed drive. In such a case use the slow path which explicitly
enters bdrv_flush_co_entry in the correct AioContext.

Signed-off-by: Stefan Reiter 
---

We've experienced some lockups in this codepath when taking snapshots of VMs
with drives that have IO-Threads enabled (we have an async 'savevm'
implementation running from a coroutine).

Currently no reproducer for upstream versions I could find, but in testing this
patch fixes all issues we're seeing and I think the logic checks out.

The fast path pattern is repeated a few times in this file, so if this change
makes sense, it's probably worth evaluating the other occurrences as well.


What do you mean by "owning" the context? If it's about taking the
AioContext lock, isn't the problem more with calling bdrv_flush() from
code that doesn't take the locks?

Though I think we have some code that doesn't only rely on holding the
AioContext locks, but that actually depends on running in the right
thread, so the change looks right anyway.


"Owning" as in it only works (doesn't hang) when bdrv_flush_co_entry 
runs on the same AioContext that the BlockDriverState it's flushing 
belongs to.


We hold the locks for all AioContexts we want to flush in our code (in 
this case called from do_vm_stop/bdrv_flush_all so we're even in a 
drained section).




Well, the idea is right, but the change itself isn't, of course. If
we're already in coroutine context, we must not busy wait with
BDRV_POLL_WHILE(). I'll see if I can put something together after lunch.

Kevin




Thanks for taking a look!




[RFC] bdrv_flush: only use fast path when in owned AioContext

2020-05-11 Thread Stefan Reiter
Just because we're in a coroutine doesn't imply ownership of the context
of the flushed drive. In such a case use the slow path which explicitly
enters bdrv_flush_co_entry in the correct AioContext.

Signed-off-by: Stefan Reiter 
---

We've experienced some lockups in this codepath when taking snapshots of VMs
with drives that have IO-Threads enabled (we have an async 'savevm'
implementation running from a coroutine).

Currently no reproducer for upstream versions I could find, but in testing this
patch fixes all issues we're seeing and I think the logic checks out.

The fast path pattern is repeated a few times in this file, so if this change
makes sense, it's probably worth evaluating the other occurrences as well.

 block/io.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/block/io.c b/block/io.c
index aba67f66b9..ee7310fa13 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2895,8 +2895,9 @@ int bdrv_flush(BlockDriverState *bs)
 .ret = NOT_DONE,
 };
 
-if (qemu_in_coroutine()) {
-/* Fast-path if already in coroutine context */
+if (qemu_in_coroutine() &&
+bdrv_get_aio_context(bs) == qemu_get_current_aio_context()) {
+/* Fast-path if already in coroutine and we own the drive's context */
 bdrv_flush_co_entry(&flush_co);
 } else {
 co = qemu_coroutine_create(bdrv_flush_co_entry, &flush_co);
-- 
2.20.1





[PATCH for-5.0 v5 3/3] backup: don't acquire aio_context in backup_clean

2020-04-07 Thread Stefan Reiter
All code-paths leading to backup_clean (via job_clean) have the job's
context already acquired. The job's context is guaranteed to be the same
as the one used by backup_top via backup_job_create.

Since the previous logic effectively acquired the lock twice, this
broke cleanup of backups for disks using IO threads: the BDRV_POLL_WHILE
in bdrv_backup_top_drop -> bdrv_do_drained_begin would only release the lock
once, thus deadlocking with the IO thread.

This is a partial revert of 0abf2581717a19.
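
The locking problem in miniature, as a stand-alone analogy that uses a plain
recursive mutex in place of the AioContext lock (illustrative only, not QEMU
code):

    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t ctx;   /* stands in for the AioContext lock */

    static void *io_thread(void *arg)
    {
        /* The IO thread needs the lock to make progress... */
        if (pthread_mutex_trylock(&ctx) == EBUSY) {
            printf("IO thread blocked: one acquisition is still held\n");
        } else {
            pthread_mutex_unlock(&ctx);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
        pthread_mutex_init(&ctx, &attr);

        pthread_mutex_lock(&ctx);    /* caller of job_clean */
        pthread_mutex_lock(&ctx);    /* backup_clean, before this patch */

        /* BDRV_POLL_WHILE drops the lock exactly once while polling... */
        pthread_mutex_unlock(&ctx);

        /* ...but the outer acquisition remains, so the polled-for work can
         * never be completed by the IO thread. */
        pthread_create(&t, NULL, io_thread, NULL);
        pthread_join(&t, NULL);

        pthread_mutex_unlock(&ctx);
        pthread_mutex_destroy(&ctx);
        return 0;
    }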

Signed-off-by: Stefan Reiter 
Reviewed-by: Max Reitz 
---
 block/backup.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 7430ca5883..a7a7dcaf4c 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -126,11 +126,7 @@ static void backup_abort(Job *job)
 static void backup_clean(Job *job)
 {
 BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
-AioContext *aio_context = bdrv_get_aio_context(s->backup_top);
-
-aio_context_acquire(aio_context);
 bdrv_backup_top_drop(s->backup_top);
-aio_context_release(aio_context);
 }
 
 void backup_do_checkpoint(BlockJob *job, Error **errp)
-- 
2.26.0





[PATCH for-5.0 v5 0/3] Fix some AIO context locking in jobs

2020-04-07 Thread Stefan Reiter
Contains three separate but related patches cleaning up and fixing some
issues regarding aio_context_acquire/aio_context_release for jobs. Mostly
affects blockjobs running for devices that have IO threads enabled AFAICT.


Changes from v4:
* Do job_ref/job_unref in job_txn_apply and job_exit since we need the job to
  survive the callback to access the potentially changed lock afterwards
* Reduce patch 2/3 to an assert, the context should already be acquired since
  it's a bdrv handler
* Collect R-by for 3/3

I've marked it 'for-5.0' this time, I think it would make sense to be
picked up together with Kevin's "block: Fix blk->in_flight during
blk_wait_while_drained()" series. With that series and these three patches
applied I can no longer reproduce any of the reported related crashes/hangs.


Changes from v3:
* commit_job appears to be unset in certain cases when replication_close is
  called, only access when necessary to avoid SIGSEGV

Missed this when shuffling around patches, sorry for noise with still-broken v3.

Changes from v2:
* reordered patch 1 to the end to not introduce temporary breakages
* added more fixes to job txn patch (should now pass the tests)

Changes from v1:
* fixed commit message for patch 1
* added patches 2 and 3


qemu: Stefan Reiter (3):
  job: take each job's lock individually in job_txn_apply
  replication: assert we own context before job_cancel_sync
  backup: don't acquire aio_context in backup_clean

 block/backup.c|  4 
 block/replication.c   |  5 -
 blockdev.c|  9 
 job-qmp.c |  9 
 job.c | 50 ++-
 tests/test-blockjob.c |  2 ++
 6 files changed, 64 insertions(+), 15 deletions(-)

-- 
2.26.0




[PATCH for-5.0 v5 1/3] job: take each job's lock individually in job_txn_apply

2020-04-07 Thread Stefan Reiter
All callers of job_txn_apply hold a single job's lock, but different
jobs within a transaction can have different contexts; thus we need to
lock each one individually before applying the callback function.

Similar to job_completed_txn_abort this also requires releasing the
caller's context before and reacquiring it after to avoid recursive
locks which might break AIO_WAIT_WHILE in the callback. This is safe, since
existing code would already have to take this into account, lest
job_completed_txn_abort might have broken.

This also brings to light a different issue: When a callback function in
job_txn_apply moves its job to a different AIO context, callers will
try to release the wrong lock (now that we re-acquire the lock
correctly, previously it would just continue with the old lock, leaving
the job unlocked for the rest of the return path). Fix this by not caching
the job's context.

This is only necessary for qmp_block_job_finalize, qmp_job_finalize and
job_exit, since everyone else calls through job_exit.

One test needed adapting: since it calls job_finalize directly, it
manually needs to acquire the correct context.

Signed-off-by: Stefan Reiter 
---
 blockdev.c|  9 
 job-qmp.c |  9 
 job.c | 50 ++-
 tests/test-blockjob.c |  2 ++
 4 files changed, 60 insertions(+), 10 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index fa8630cb41..5faddaa705 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3612,7 +3612,16 @@ void qmp_block_job_finalize(const char *id, Error **errp)
 }
 
 trace_qmp_block_job_finalize(job);
+job_ref(&job->job);
 job_finalize(&job->job, errp);
+
+/*
+ * Job's context might have changed via job_finalize (and job_txn_apply
+ * automatically acquires the new one), so make sure we release the correct
+ * one.
+ */
+aio_context = blk_get_aio_context(job->blk);
+job_unref(&job->job);
 aio_context_release(aio_context);
 }
 
diff --git a/job-qmp.c b/job-qmp.c
index fecc939ebd..f9a58832e1 100644
--- a/job-qmp.c
+++ b/job-qmp.c
@@ -114,7 +114,16 @@ void qmp_job_finalize(const char *id, Error **errp)
 }
 
 trace_qmp_job_finalize(job);
+job_ref(job);
 job_finalize(job, errp);
+
+/*
+ * Job's context might have changed via job_finalize (and job_txn_apply
+ * automatically acquires the new one), so make sure we release the correct
+ * one.
+ */
+aio_context = job->aio_context;
+job_unref(job);
 aio_context_release(aio_context);
 }
 
diff --git a/job.c b/job.c
index 134a07b92e..53be57a3a0 100644
--- a/job.c
+++ b/job.c
@@ -136,17 +136,38 @@ static void job_txn_del_job(Job *job)
 }
 }
 
-static int job_txn_apply(JobTxn *txn, int fn(Job *))
+static int job_txn_apply(Job *job, int fn(Job *))
 {
-Job *job, *next;
+AioContext *inner_ctx;
+Job *other_job, *next;
+JobTxn *txn = job->txn;
 int rc = 0;
 
-QLIST_FOREACH_SAFE(job, &txn->jobs, txn_list, next) {
-rc = fn(job);
+/*
+ * Similar to job_completed_txn_abort, we take each job's lock before
+ * applying fn, but since we assume that outer_ctx is held by the caller,
+ * we need to release it here to avoid holding the lock twice - which would
+ * break AIO_WAIT_WHILE from within fn.
+ */
+job_ref(job);
+aio_context_release(job->aio_context);
+
+QLIST_FOREACH_SAFE(other_job, &txn->jobs, txn_list, next) {
+inner_ctx = other_job->aio_context;
+aio_context_acquire(inner_ctx);
+rc = fn(other_job);
+aio_context_release(inner_ctx);
 if (rc) {
 break;
 }
 }
+
+/*
+ * Note that job->aio_context might have been changed by calling fn, so we
+ * can't use a local variable to cache it.
+ */
+aio_context_acquire(job->aio_context);
+job_unref(job);
 return rc;
 }
 
@@ -774,11 +795,11 @@ static void job_do_finalize(Job *job)
 assert(job && job->txn);
 
 /* prepare the transaction to complete */
-rc = job_txn_apply(job->txn, job_prepare);
+rc = job_txn_apply(job, job_prepare);
 if (rc) {
 job_completed_txn_abort(job);
 } else {
-job_txn_apply(job->txn, job_finalize_single);
+job_txn_apply(job, job_finalize_single);
 }
 }
 
@@ -824,10 +845,10 @@ static void job_completed_txn_success(Job *job)
 assert(other_job->ret == 0);
 }
 
-job_txn_apply(txn, job_transition_to_pending);
+job_txn_apply(job, job_transition_to_pending);
 
 /* If no jobs need manual finalization, automatically do so */
-if (job_txn_apply(txn, job_needs_finalize) == 0) {
+if (job_txn_apply(job, job_needs_finalize) == 0) {
 job_do_finalize(job);
 }
 }
@@ -849,9 +870,10 @@ static void job_completed(Job *job)
 static void jo

[PATCH for-5.0 v5 2/3] replication: assert we own context before job_cancel_sync

2020-04-07 Thread Stefan Reiter
job_cancel_sync requires the job's lock to be held, all other callers
already do this (replication_stop, drive_backup_abort,
blockdev_backup_abort, job_cancel_sync_all, cancel_common).

In this case we're in a BlockDriver handler, so we already have a lock,
just assert that it is the same as the one used for the commit_job.

Signed-off-by: Stefan Reiter 
---
 block/replication.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/replication.c b/block/replication.c
index 413d95407d..da013c2041 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -144,12 +144,15 @@ fail:
 static void replication_close(BlockDriverState *bs)
 {
 BDRVReplicationState *s = bs->opaque;
+Job *commit_job;
 
 if (s->stage == BLOCK_REPLICATION_RUNNING) {
 replication_stop(s->rs, false, NULL);
 }
 if (s->stage == BLOCK_REPLICATION_FAILOVER) {
-job_cancel_sync(&s->commit_job->job);
+commit_job = &s->commit_job->job;
+assert(commit_job->aio_context == qemu_get_current_aio_context());
+job_cancel_sync(commit_job);
 }
 
 if (s->mode == REPLICATION_MODE_SECONDARY) {
-- 
2.26.0





Re: [PATCH v4 2/3] replication: acquire aio context before calling job_cancel_sync

2020-04-02 Thread Stefan Reiter

On 02/04/2020 14:41, Max Reitz wrote:

On 01.04.20 10:15, Stefan Reiter wrote:

job_cancel_sync requires the job's lock to be held, all other callers
already do this (replication_stop, drive_backup_abort,
blockdev_backup_abort, job_cancel_sync_all, cancel_common).


I think all other callers come directly from QMP, though, so they have
no locks yet.  This OTOH is called from a block driver function, so I
would assume the BDS context is locked already (or rather, this is
executed in the BDS context).

I also think that the commit job runs in the same context.  So I would
assume that this would be a nested lock, which should be unnecessary and
might cause problems.  Maybe we should just assert that the job’s
context is the current context?

(Or would that still be problematic because now job_txn_apply() wants to
release some context, and that isn’t possible without this patch?  I
would hope it’s possible, because I think we shouldn’t have to acquire
the current context, and should be free to release it...?  I have no
idea.  Maybe we are actually free to reacquire the current context here.)



You're right, this seems to be unnecessary. Adding an

  assert(commit_job->aio_context == qemu_get_current_aio_context());

to make sure seems like a good idea though. bdrv_close appears to always 
have the lock on its driver's context held.



Signed-off-by: Stefan Reiter 
---
  block/replication.c | 8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/replication.c b/block/replication.c
index 413d95407d..17ddc31569 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -144,12 +144,18 @@ fail:
  static void replication_close(BlockDriverState *bs)
  {
  BDRVReplicationState *s = bs->opaque;
+Job *commit_job;
+AioContext *commit_ctx;
  
  if (s->stage == BLOCK_REPLICATION_RUNNING) {

  replication_stop(s->rs, false, NULL);
  }
  if (s->stage == BLOCK_REPLICATION_FAILOVER) {
-job_cancel_sync(&s->commit_job->job);
+commit_job = &s->commit_job->job;
+commit_ctx = commit_job->aio_context;
+aio_context_acquire(commit_ctx);
+job_cancel_sync(commit_job);
+aio_context_release(commit_ctx);


Anyway, there’s also the problem that I would guess the
job_cancel_sync() might move the job from its current context back into
the main context.  Then we’d release the wrong context here.

Max


  }
  
  if (s->mode == REPLICATION_MODE_SECONDARY) {










Re: [PATCH v4 1/3] job: take each job's lock individually in job_txn_apply

2020-04-02 Thread Stefan Reiter

On 02/04/2020 14:33, Max Reitz wrote:

On 01.04.20 10:15, Stefan Reiter wrote:

All callers of job_txn_apply hold a single job's lock, but different
jobs within a transaction can have different contexts; thus we need to
lock each one individually before applying the callback function.

Similar to job_completed_txn_abort this also requires releasing the
caller's context before and reacquiring it after to avoid recursive
locks which might break AIO_WAIT_WHILE in the callback.

This also brings to light a different issue: When a callback function in
job_txn_apply moves its job to a different AIO context, job_exit will
try to release the wrong lock (now that we re-acquire the lock
correctly, previously it would just continue with the old lock, leaving
the job unlocked for the rest of the codepath back to job_exit). Fix
this by not caching the job's context in job_exit and add a comment
about why this is done.

One test needed adapting: since it calls job_finalize directly, it
manually needs to acquire the correct context.

Signed-off-by: Stefan Reiter 
---
  job.c | 48 ++-
  tests/test-blockjob.c |  2 ++
  2 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/job.c b/job.c
index 134a07b92e..5fbaaabf78 100644
--- a/job.c
+++ b/job.c
@@ -136,17 +136,36 @@ static void job_txn_del_job(Job *job)
  }
  }
  
-static int job_txn_apply(JobTxn *txn, int fn(Job *))

+static int job_txn_apply(Job *job, int fn(Job *))
  {
-Job *job, *next;
+AioContext *inner_ctx;
+Job *other_job, *next;
+JobTxn *txn = job->txn;
  int rc = 0;
  
-QLIST_FOREACH_SAFE(job, &txn->jobs, txn_list, next) {

-rc = fn(job);
+/*
+ * Similar to job_completed_txn_abort, we take each job's lock before
+ * applying fn, but since we assume that outer_ctx is held by the caller,
+ * we need to release it here to avoid holding the lock twice - which would
+ * break AIO_WAIT_WHILE from within fn.
+ */
+aio_context_release(job->aio_context);


Hm, is it safe to drop the lock here and then reacquire it later?  I.e.,
is the job in a consistent state in between?  I don’t know.  Looks like
it.  Maybe someone else knows better.



I would have said so too, but the iotest segfaults Kevin discovered (I 
can reproduce them after a dozen or so cycles) seem to be triggered by 
some kind of race, which I can only imagine being caused by it not being 
safe to drop the lock.


It can be resolved by doing a job_ref/unref in job_txn_apply, but that 
might be only fixing the symptoms and ignoring the problem.



(I find the job code rather confusing.)



That seems to be common around here ;)


+
+QLIST_FOREACH_SAFE(other_job, &txn->jobs, txn_list, next) {
+inner_ctx = other_job->aio_context;
+aio_context_acquire(inner_ctx);


Ignoring the fact that fn() can change a job's lock, one idea I had was 
to do a


  if (inner_ctx != job->aio_context) {
  aio_context_acquire(inner_ctx);
  }

instead of releasing the lock prior.
However, that gets complicated when the job's context does change in fn.


+rc = fn(other_job);
+aio_context_release(inner_ctx);
  if (rc) {
  break;
  }
  }
+
+/*
+ * Note that job->aio_context might have been changed by calling fn, so we
+ * can't use a local variable to cache it.
+ */
+aio_context_acquire(job->aio_context);


But all callers of job_txn_apply() (but job_exit(), which you fix in
this patch) cache it.  Won’t they release the wrong context then?  Or is
a context change only possible when job_txn_apply() is called from
job_exit()?



You're right that it can probably change for other callers as well 
(though at least it doesn't seem to currently, since no other test cases 
fail?). Looking at the analysis Vladimir did [0], there's a few places 
where that would need fixing.


The issue is that all of these places would also need the job_ref/unref 
part AFAICT, which would make this a bit unwieldy...


[0] https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07867.html


Max






[PATCH v4 1/3] job: take each job's lock individually in job_txn_apply

2020-04-01 Thread Stefan Reiter
All callers of job_txn_apply hold a single job's lock, but different
jobs within a transaction can have different contexts; thus we need to
lock each one individually before applying the callback function.

Similar to job_completed_txn_abort this also requires releasing the
caller's context before and reacquiring it after to avoid recursive
locks which might break AIO_WAIT_WHILE in the callback.

This also brings to light a different issue: When a callback function in
job_txn_apply moves its job to a different AIO context, job_exit will
try to release the wrong lock (now that we re-acquire the lock
correctly, previously it would just continue with the old lock, leaving
the job unlocked for the rest of the codepath back to job_exit). Fix
this by not caching the job's context in job_exit and add a comment
about why this is done.

One test needed adapting: since it calls job_finalize directly, it
manually needs to acquire the correct context.

Signed-off-by: Stefan Reiter 
---
 job.c | 48 ++-
 tests/test-blockjob.c |  2 ++
 2 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/job.c b/job.c
index 134a07b92e..5fbaaabf78 100644
--- a/job.c
+++ b/job.c
@@ -136,17 +136,36 @@ static void job_txn_del_job(Job *job)
 }
 }
 
-static int job_txn_apply(JobTxn *txn, int fn(Job *))
+static int job_txn_apply(Job *job, int fn(Job *))
 {
-Job *job, *next;
+AioContext *inner_ctx;
+Job *other_job, *next;
+JobTxn *txn = job->txn;
 int rc = 0;
 
-QLIST_FOREACH_SAFE(job, &txn->jobs, txn_list, next) {
-rc = fn(job);
+/*
+ * Similar to job_completed_txn_abort, we take each job's lock before
+ * applying fn, but since we assume that outer_ctx is held by the caller,
+ * we need to release it here to avoid holding the lock twice - which would
+ * break AIO_WAIT_WHILE from within fn.
+ */
+aio_context_release(job->aio_context);
+
+QLIST_FOREACH_SAFE(other_job, &txn->jobs, txn_list, next) {
+inner_ctx = other_job->aio_context;
+aio_context_acquire(inner_ctx);
+rc = fn(other_job);
+aio_context_release(inner_ctx);
 if (rc) {
 break;
 }
 }
+
+/*
+ * Note that job->aio_context might have been changed by calling fn, so we
+ * can't use a local variable to cache it.
+ */
+aio_context_acquire(job->aio_context);
 return rc;
 }
 
@@ -774,11 +793,11 @@ static void job_do_finalize(Job *job)
 assert(job && job->txn);
 
 /* prepare the transaction to complete */
-rc = job_txn_apply(job->txn, job_prepare);
+rc = job_txn_apply(job, job_prepare);
 if (rc) {
 job_completed_txn_abort(job);
 } else {
-job_txn_apply(job->txn, job_finalize_single);
+job_txn_apply(job, job_finalize_single);
 }
 }
 
@@ -824,10 +843,10 @@ static void job_completed_txn_success(Job *job)
 assert(other_job->ret == 0);
 }
 
-job_txn_apply(txn, job_transition_to_pending);
+job_txn_apply(job, job_transition_to_pending);
 
 /* If no jobs need manual finalization, automatically do so */
-if (job_txn_apply(txn, job_needs_finalize) == 0) {
+if (job_txn_apply(job, job_needs_finalize) == 0) {
 job_do_finalize(job);
 }
 }
@@ -849,9 +868,10 @@ static void job_completed(Job *job)
 static void job_exit(void *opaque)
 {
 Job *job = (Job *)opaque;
-AioContext *ctx = job->aio_context;
+AioContext *ctx;
 
-aio_context_acquire(ctx);
+job_ref(job);
+aio_context_acquire(job->aio_context);
 
 /* This is a lie, we're not quiescent, but still doing the completion
  * callbacks. However, completion callbacks tend to involve operations that
@@ -862,6 +882,14 @@ static void job_exit(void *opaque)
 
 job_completed(job);
 
+/*
+ * Note that calling job_completed can move the job to a different
+ * aio_context, so we cannot cache from above. job_txn_apply takes care of
+ * acquiring the new lock, and we ref/unref to avoid job_completed freeing
+ * the job underneath us.
+ */
+ctx = job->aio_context;
+job_unref(job);
 aio_context_release(ctx);
 }
 
diff --git a/tests/test-blockjob.c b/tests/test-blockjob.c
index 4eeb184caf..7519847912 100644
--- a/tests/test-blockjob.c
+++ b/tests/test-blockjob.c
@@ -367,7 +367,9 @@ static void test_cancel_concluded(void)
 aio_poll(qemu_get_aio_context(), true);
 assert(job->status == JOB_STATUS_PENDING);
 
+aio_context_acquire(job->aio_context);
 job_finalize(job, &error_abort);
+aio_context_release(job->aio_context);
 assert(job->status == JOB_STATUS_CONCLUDED);
 
 cancel_common(s);
-- 
2.26.0





[PATCH v4 2/3] replication: acquire aio context before calling job_cancel_sync

2020-04-01 Thread Stefan Reiter
job_cancel_sync requires the job's lock to be held; all other callers
already do this (replication_stop, drive_backup_abort,
blockdev_backup_abort, job_cancel_sync_all, cancel_common).

Signed-off-by: Stefan Reiter 
---
 block/replication.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/replication.c b/block/replication.c
index 413d95407d..17ddc31569 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -144,12 +144,18 @@ fail:
 static void replication_close(BlockDriverState *bs)
 {
 BDRVReplicationState *s = bs->opaque;
+Job *commit_job;
+AioContext *commit_ctx;
 
 if (s->stage == BLOCK_REPLICATION_RUNNING) {
 replication_stop(s->rs, false, NULL);
 }
 if (s->stage == BLOCK_REPLICATION_FAILOVER) {
-job_cancel_sync(&s->commit_job->job);
+commit_job = &s->commit_job->job;
+commit_ctx = commit_job->aio_context;
+aio_context_acquire(commit_ctx);
+job_cancel_sync(commit_job);
+aio_context_release(commit_ctx);
 }
 
 if (s->mode == REPLICATION_MODE_SECONDARY) {
-- 
2.26.0





[PATCH v4 0/3] Fix some AIO context locking in jobs

2020-04-01 Thread Stefan Reiter
Contains three separate but related patches cleaning up and fixing some
issues regarding aio_context_acquire/aio_context_release for jobs. Mostly
affects blockjobs running for devices that have IO threads enabled AFAICT.

This is based on the discussions here:
https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07929.html

Changes from v3:
* commit_job can be unset in certain cases when replication_close is
  called, so only access it when needed to avoid a SIGSEGV

Missed this when shuffling patches around, sorry for the noise with the still-broken v3.

Changes from v2:
* reordered patch 1 to the end to not introduce temporary breakages
* added more fixes to job txn patch (should now pass the tests)

Changes from v1:
* fixed commit message for patch 1
* added patches 2 and 3


qemu: Stefan Reiter (3):
  job: take each job's lock individually in job_txn_apply
  replication: acquire aio context before calling job_cancel_sync
  backup: don't acquire aio_context in backup_clean

 block/backup.c|  4 
 block/replication.c   |  8 +++-
 job.c | 48 ++-
 tests/test-blockjob.c |  2 ++
 4 files changed, 47 insertions(+), 15 deletions(-)

-- 
2.26.0




[PATCH v4 3/3] backup: don't acquire aio_context in backup_clean

2020-04-01 Thread Stefan Reiter
All code-paths leading to backup_clean (via job_clean) have the job's
context already acquired. The job's context is guaranteed to be the same
as the one used by backup_top via backup_job_create.

Since the previous logic effectively acquired the lock twice, this
broke cleanup of backups for disks using IO threads, since the BDRV_POLL_WHILE
in bdrv_backup_top_drop -> bdrv_do_drained_begin would only release the lock
once, thus deadlocking with the IO thread.

This is a partial revert of 0abf2581717a19.

Signed-off-by: Stefan Reiter 
---

With the two previous patches applied, the commit message should now hold true.

 block/backup.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 7430ca5883..a7a7dcaf4c 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -126,11 +126,7 @@ static void backup_abort(Job *job)
 static void backup_clean(Job *job)
 {
 BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
-AioContext *aio_context = bdrv_get_aio_context(s->backup_top);
-
-aio_context_acquire(aio_context);
 bdrv_backup_top_drop(s->backup_top);
-aio_context_release(aio_context);
 }
 
 void backup_do_checkpoint(BlockJob *job, Error **errp)
-- 
2.26.0





[PATCH v3 1/3] job: take each job's lock individually in job_txn_apply

2020-03-31 Thread Stefan Reiter
All callers of job_txn_apply hold a single job's lock, but different
jobs within a transaction can have different contexts, thus we need to
lock each one individually before applying the callback function.

Similar to job_completed_txn_abort this also requires releasing the
caller's context before and reacquiring it after to avoid recursive
locks which might break AIO_WAIT_WHILE in the callback.

This also brings to light a different issue: When a callback function in
job_txn_apply moves its job to a different AIO context, job_exit will
try to release the wrong lock (now that we re-acquire the lock
correctly, previously it would just continue with the old lock, leaving
the job unlocked for the rest of the codepath back to job_exit). Fix
this by not caching the job's context in job_exit and adding a comment
about why this is done.

One test needed adapting, since it calls job_finalize directly and thus
needs to acquire the correct context manually.

Signed-off-by: Stefan Reiter 
---
 job.c | 48 ++-
 tests/test-blockjob.c |  2 ++
 2 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/job.c b/job.c
index 134a07b92e..5fbaaabf78 100644
--- a/job.c
+++ b/job.c
@@ -136,17 +136,36 @@ static void job_txn_del_job(Job *job)
 }
 }
 
-static int job_txn_apply(JobTxn *txn, int fn(Job *))
+static int job_txn_apply(Job *job, int fn(Job *))
 {
-Job *job, *next;
+AioContext *inner_ctx;
+Job *other_job, *next;
+JobTxn *txn = job->txn;
 int rc = 0;
 
-QLIST_FOREACH_SAFE(job, &txn->jobs, txn_list, next) {
-rc = fn(job);
+/*
+ * Similar to job_completed_txn_abort, we take each job's lock before
+ * applying fn, but since we assume that outer_ctx is held by the caller,
+ * we need to release it here to avoid holding the lock twice - which would
+ * break AIO_WAIT_WHILE from within fn.
+ */
+aio_context_release(job->aio_context);
+
+QLIST_FOREACH_SAFE(other_job, &txn->jobs, txn_list, next) {
+inner_ctx = other_job->aio_context;
+aio_context_acquire(inner_ctx);
+rc = fn(other_job);
+aio_context_release(inner_ctx);
 if (rc) {
 break;
 }
 }
+
+/*
+ * Note that job->aio_context might have been changed by calling fn, so we
+ * can't use a local variable to cache it.
+ */
+aio_context_acquire(job->aio_context);
 return rc;
 }
 
@@ -774,11 +793,11 @@ static void job_do_finalize(Job *job)
 assert(job && job->txn);
 
 /* prepare the transaction to complete */
-rc = job_txn_apply(job->txn, job_prepare);
+rc = job_txn_apply(job, job_prepare);
 if (rc) {
 job_completed_txn_abort(job);
 } else {
-job_txn_apply(job->txn, job_finalize_single);
+job_txn_apply(job, job_finalize_single);
 }
 }
 
@@ -824,10 +843,10 @@ static void job_completed_txn_success(Job *job)
 assert(other_job->ret == 0);
 }
 
-job_txn_apply(txn, job_transition_to_pending);
+job_txn_apply(job, job_transition_to_pending);
 
 /* If no jobs need manual finalization, automatically do so */
-if (job_txn_apply(txn, job_needs_finalize) == 0) {
+if (job_txn_apply(job, job_needs_finalize) == 0) {
 job_do_finalize(job);
 }
 }
@@ -849,9 +868,10 @@ static void job_completed(Job *job)
 static void job_exit(void *opaque)
 {
 Job *job = (Job *)opaque;
-AioContext *ctx = job->aio_context;
+AioContext *ctx;
 
-aio_context_acquire(ctx);
+job_ref(job);
+aio_context_acquire(job->aio_context);
 
 /* This is a lie, we're not quiescent, but still doing the completion
  * callbacks. However, completion callbacks tend to involve operations that
@@ -862,6 +882,14 @@ static void job_exit(void *opaque)
 
 job_completed(job);
 
+/*
+ * Note that calling job_completed can move the job to a different
+ * aio_context, so we cannot cache from above. job_txn_apply takes care of
+ * acquiring the new lock, and we ref/unref to avoid job_completed freeing
+ * the job underneath us.
+ */
+ctx = job->aio_context;
+job_unref(job);
 aio_context_release(ctx);
 }
 
diff --git a/tests/test-blockjob.c b/tests/test-blockjob.c
index 4eeb184caf..7519847912 100644
--- a/tests/test-blockjob.c
+++ b/tests/test-blockjob.c
@@ -367,7 +367,9 @@ static void test_cancel_concluded(void)
 aio_poll(qemu_get_aio_context(), true);
 assert(job->status == JOB_STATUS_PENDING);
 
+aio_context_acquire(job->aio_context);
 job_finalize(job, &error_abort);
+aio_context_release(job->aio_context);
 assert(job->status == JOB_STATUS_CONCLUDED);
 
 cancel_common(s);
-- 
2.26.0





[PATCH v3 3/3] backup: don't acquire aio_context in backup_clean

2020-03-31 Thread Stefan Reiter
All code-paths leading to backup_clean (via job_clean) have the job's
context already acquired. The job's context is guaranteed to be the same
as the one used by backup_top via backup_job_create.

Since the previous logic effectively acquired the lock twice, this
broke cleanup of backups for disks using IO threads, since the BDRV_POLL_WHILE
in bdrv_backup_top_drop -> bdrv_do_drained_begin would only release the lock
once, thus deadlocking with the IO thread.

This is a partial revert of 0abf2581717a19.

Signed-off-by: Stefan Reiter 
---

With the two previous patches applied, the commit message should now hold true.

 block/backup.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 7430ca5883..a7a7dcaf4c 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -126,11 +126,7 @@ static void backup_abort(Job *job)
 static void backup_clean(Job *job)
 {
 BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
-AioContext *aio_context = bdrv_get_aio_context(s->backup_top);
-
-aio_context_acquire(aio_context);
 bdrv_backup_top_drop(s->backup_top);
-aio_context_release(aio_context);
 }
 
 void backup_do_checkpoint(BlockJob *job, Error **errp)
-- 
2.26.0





[PATCH v3 2/3] replication: acquire aio context before calling job_cancel_sync

2020-03-31 Thread Stefan Reiter
job_cancel_sync requires the job's lock to be held; all other callers
already do this (replication_stop, drive_backup_abort,
blockdev_backup_abort, job_cancel_sync_all, cancel_common).

Signed-off-by: Stefan Reiter 
---
 block/replication.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/replication.c b/block/replication.c
index 413d95407d..6c809cda4e 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -144,12 +144,16 @@ fail:
 static void replication_close(BlockDriverState *bs)
 {
 BDRVReplicationState *s = bs->opaque;
+Job *commit_job = &s->commit_job->job;
+AioContext *commit_ctx = commit_job->aio_context;
 
 if (s->stage == BLOCK_REPLICATION_RUNNING) {
 replication_stop(s->rs, false, NULL);
 }
 if (s->stage == BLOCK_REPLICATION_FAILOVER) {
-job_cancel_sync(&s->commit_job->job);
+aio_context_acquire(commit_ctx);
+job_cancel_sync(commit_job);
+aio_context_release(commit_ctx);
 }
 
 if (s->mode == REPLICATION_MODE_SECONDARY) {
-- 
2.26.0





[PATCH v3 0/3] Fix some AIO context locking in jobs

2020-03-31 Thread Stefan Reiter
Contains three separate but related patches cleaning up and fixing some
issues regarding aio_context_acquire/aio_context_release for jobs. Mostly
affects blockjobs running for devices that have IO threads enabled AFAICT.

This is based on the discussions here:
https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07929.html

Changes from v2:
* reordered patch 1 to the end to not introduce temporary breakages
* added more fixes to job txn patch (should now pass the tests)

Changes from v1:
* fixed commit message for patch 1
* added patches 2 and 3


qemu: Stefan Reiter (3):
  job: take each job's lock individually in job_txn_apply
  replication: acquire aio context before calling job_cancel_sync
  backup: don't acquire aio_context in backup_clean

 block/backup.c|  4 
 block/replication.c   |  6 +-
 job.c | 48 ++-
 tests/test-blockjob.c |  2 ++
 4 files changed, 45 insertions(+), 15 deletions(-)

-- 
2.26.0




[PATCH v2 0/3] Fix some AIO context locking in jobs

2020-03-26 Thread Stefan Reiter
Contains three separate but related patches cleaning up and fixing some
issues regarding aio_context_acquire/aio_context_release for jobs. Mostly
affects blockjobs running for devices that have IO threads enabled AFAICT.

This is based on the discussions here:
https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07929.html

I *think* the second patch also fixes the hangs on backup abort that Dietmar
and I noticed in v1, but I'm not sure; they were somewhat intermittent
before too.

Changes from v1:
* fixed commit message for patch 1
* added patches 2 and 3

Stefan Reiter (3):
  backup: don't acquire aio_context in backup_clean
  job: take each job's lock individually in job_txn_apply
  replication: acquire aio context before calling job_cancel_sync

 block/backup.c  |  4 
 block/replication.c |  6 +-
 job.c   | 32 
 3 files changed, 29 insertions(+), 13 deletions(-)

-- 
2.26.0





[PATCH v2 1/3] backup: don't acquire aio_context in backup_clean

2020-03-26 Thread Stefan Reiter
All code-paths leading to backup_clean (via job_clean) have the job's
context already acquired. The job's context is guaranteed to be the same
as the one used by backup_top via backup_job_create.

Since the previous logic effectively acquired the lock twice, this
broke cleanup of backups for disks using IO threads, since the BDRV_POLL_WHILE
in bdrv_backup_top_drop -> bdrv_do_drained_begin would only release the lock
once, thus deadlocking with the IO thread.

This is a partial revert of 0abf2581717a19.

Signed-off-by: Stefan Reiter 
---
 block/backup.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 7430ca5883..a7a7dcaf4c 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -126,11 +126,7 @@ static void backup_abort(Job *job)
 static void backup_clean(Job *job)
 {
 BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
-AioContext *aio_context = bdrv_get_aio_context(s->backup_top);
-
-aio_context_acquire(aio_context);
 bdrv_backup_top_drop(s->backup_top);
-aio_context_release(aio_context);
 }
 
 void backup_do_checkpoint(BlockJob *job, Error **errp)
-- 
2.26.0





[PATCH v2 2/3] job: take each job's lock individually in job_txn_apply

2020-03-26 Thread Stefan Reiter
All callers of job_txn_apply hold a single job's lock, but different
jobs within a transaction can have different contexts, thus we need to
lock each one individually before applying the callback function.

Similar to job_completed_txn_abort this also requires releasing the
caller's context before and reacquiring it after to avoid recursive
locks which might break AIO_WAIT_WHILE in the callback.

Signed-off-by: Stefan Reiter 
---
 job.c | 32 
 1 file changed, 24 insertions(+), 8 deletions(-)

diff --git a/job.c b/job.c
index 134a07b92e..e0966162fa 100644
--- a/job.c
+++ b/job.c
@@ -136,17 +136,33 @@ static void job_txn_del_job(Job *job)
 }
 }
 
-static int job_txn_apply(JobTxn *txn, int fn(Job *))
+static int job_txn_apply(Job *job, int fn(Job *))
 {
-Job *job, *next;
+AioContext *outer_ctx = job->aio_context;
+AioContext *inner_ctx;
+Job *other_job, *next;
+JobTxn *txn = job->txn;
 int rc = 0;
 
-QLIST_FOREACH_SAFE(job, &txn->jobs, txn_list, next) {
-rc = fn(job);
+/*
+ * Similar to job_completed_txn_abort, we take each job's lock before
+ * applying fn, but since we assume that outer_ctx is held by the caller,
+ * we need to release it here to avoid holding the lock twice - which would
+ * break AIO_WAIT_WHILE from within fn.
+ */
+aio_context_release(outer_ctx);
+
+QLIST_FOREACH_SAFE(other_job, &txn->jobs, txn_list, next) {
+inner_ctx = other_job->aio_context;
+aio_context_acquire(inner_ctx);
+rc = fn(other_job);
+aio_context_release(inner_ctx);
 if (rc) {
 break;
 }
 }
+
+aio_context_acquire(outer_ctx);
 return rc;
 }
 
@@ -774,11 +790,11 @@ static void job_do_finalize(Job *job)
 assert(job && job->txn);
 
 /* prepare the transaction to complete */
-rc = job_txn_apply(job->txn, job_prepare);
+rc = job_txn_apply(job, job_prepare);
 if (rc) {
 job_completed_txn_abort(job);
 } else {
-job_txn_apply(job->txn, job_finalize_single);
+job_txn_apply(job, job_finalize_single);
 }
 }
 
@@ -824,10 +840,10 @@ static void job_completed_txn_success(Job *job)
 assert(other_job->ret == 0);
 }
 
-job_txn_apply(txn, job_transition_to_pending);
+job_txn_apply(job, job_transition_to_pending);
 
 /* If no jobs need manual finalization, automatically do so */
-if (job_txn_apply(txn, job_needs_finalize) == 0) {
+if (job_txn_apply(job, job_needs_finalize) == 0) {
 job_do_finalize(job);
 }
 }
-- 
2.26.0





[PATCH v2 3/3] replication: acquire aio context before calling job_cancel_sync

2020-03-26 Thread Stefan Reiter
job_cancel_sync requires the job's lock to be held; all other callers
already do this (replication_stop, drive_backup_abort,
blockdev_backup_abort, job_cancel_sync_all, cancel_common).

Signed-off-by: Stefan Reiter 
---
 block/replication.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/replication.c b/block/replication.c
index 413d95407d..6c809cda4e 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -144,12 +144,16 @@ fail:
 static void replication_close(BlockDriverState *bs)
 {
 BDRVReplicationState *s = bs->opaque;
+Job *commit_job = &s->commit_job->job;
+AioContext *commit_ctx = commit_job->aio_context;
 
 if (s->stage == BLOCK_REPLICATION_RUNNING) {
 replication_stop(s->rs, false, NULL);
 }
 if (s->stage == BLOCK_REPLICATION_FAILOVER) {
-job_cancel_sync(&s->commit_job->job);
+aio_context_acquire(commit_ctx);
+job_cancel_sync(commit_job);
+aio_context_release(commit_ctx);
 }
 
 if (s->mode == REPLICATION_MODE_SECONDARY) {
-- 
2.26.0





Re: [PATCH] backup: don't acquire aio_context in backup_clean

2020-03-26 Thread Stefan Reiter

On 26/03/2020 06:54, Vladimir Sementsov-Ogievskiy wrote:

25.03.2020 18:50, Stefan Reiter wrote:

backup_clean is only ever called as a handler via job_exit, which


Hmm.. I'm afraid it's not quite correct.

job_clean

   job_finalize_single

      job_completed_txn_abort (lock aio context)

      job_do_finalize


Hmm. job_do_finalize calls job_completed_txn_abort, which takes care to lock
the aio context. And at the same time, it directly calls
job_txn_apply(job->txn, job_finalize_single) without locking. Is it a bug?



I think, as you say, the idea is that job_do_finalize is always called 
with the lock acquired. That's why job_completed_txn_abort takes care to 
release the lock (at least of the "outer_ctx" as it calls it) before 
reacquiring it.
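
Roughly this pattern, sketched from memory rather than quoting the actual
job_completed_txn_abort code (and assuming we are inside job.c, where the
JobTxn internals are visible):

static void txn_abort_locking_sketch(Job *job)
{
    AioContext *outer_ctx = job->aio_context;
    AioContext *ctx;
    Job *other_job;

    /* drop the caller's lock so each job in the txn can be locked on its own */
    aio_context_release(outer_ctx);

    QLIST_FOREACH(other_job, &job->txn->jobs, txn_list) {
        ctx = other_job->aio_context;
        aio_context_acquire(ctx);
        /* ... per-job abort/finalize work happens here ... */
        aio_context_release(ctx);
    }

    /* reacquire the lock the caller still expects to hold */
    aio_context_acquire(outer_ctx);
}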


And, even if job_do_finalize is always called with a locked context, where is
the guarantee that the contexts of all jobs in the txn are locked?



I also don't see anything that guarantees that... I guess it could be 
adapted to handle locks like job_completed_txn_abort does?


Haven't looked into transactions too much, but does it even make sense 
to have jobs in different contexts in one transaction?



Still, let's look through its callers.

   job_finalize

      qmp_block_job_finalize (lock aio context)
      qmp_job_finalize (lock aio context)
      test_cancel_concluded (doesn't lock, but it's a test)

   job_completed_txn_success

      job_completed

         job_exit (lock aio context)

         job_cancel

            blockdev_mark_auto_del (lock aio context)

            job_user_cancel

               qmp_block_job_cancel (locks context)
               qmp_job_cancel (locks context)

            job_cancel_err

               job_cancel_sync (returns job_finish_sync(job, &job_cancel_err,
               NULL); job_finish_sync just calls the callback)

                  replication_close (it's .bdrv_close.. Hmm, I don't see
                  context locking, where is it?)

Hm, don't see it either. This might indeed be a way to get to job_clean
without a lock held.

I don't have any testing set up for replication atm, but if you believe
this would be correct I can send a patch for that as well (just acquire
the lock in replication_close before job_cancel_sync?).

                  replication_stop (locks context)

                  drive_backup_abort (locks context)

                  blockdev_backup_abort (locks context)

                  job_cancel_sync_all (locks context)

                  cancel_common (locks context)

                     test_* (I don't care)



To clarify, aside from the commit message the patch itself does not 
appear to be wrong? All paths (aside from replication_close mentioned 
above) guarantee the job lock to be held.



already acquires the job's context. The job's context is guaranteed to
be the same as the one used by backup_top via backup_job_create.

Since the previous logic effectively acquired the lock twice, this
broke cleanup of backups for disks using IO threads, since the BDRV_POLL_WHILE
in bdrv_backup_top_drop -> bdrv_do_drained_begin would only release the lock
once, thus deadlocking with the IO thread.

Signed-off-by: Stefan Reiter 


Just note that this was recently touched by 0abf2581717a19, so
add Sergio (its author) to CC.



---

This is a fix for the issue discussed in this part of the thread:
https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07639.html
...not the original problem (core dump) posted by Dietmar.

I've still seen it occasionally hang during a backup abort. I'm trying to
figure out why that happens; the stack trace indicates a similar problem with
the main thread hanging at bdrv_do_drained_begin, though I have no clue why
as of yet.


  block/backup.c | 4 
  1 file changed, 4 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 7430ca5883..a7a7dcaf4c 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -126,11 +126,7 @@ static void backup_abort(Job *job)
  static void backup_clean(Job *job)
  {
  BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
-    AioContext *aio_context = bdrv_get_aio_context(s->backup_top);
-
-    aio_context_acquire(aio_context);
  bdrv_backup_top_drop(s->backup_top);
-    aio_context_release(aio_context);
  }
  void backup_do_checkpoint(BlockJob *job, Error **errp)









[PATCH] backup: don't acquire aio_context in backup_clean

2020-03-25 Thread Stefan Reiter
backup_clean is only ever called as a handler via job_exit, which
already acquires the job's context. The job's context is guaranteed to
be the same as the one used by backup_top via backup_job_create.

Since the previous logic effectively acquired the lock twice, this
broke cleanup of backups for disks using IO threads, since the BDRV_POLL_WHILE
in bdrv_backup_top_drop -> bdrv_do_drained_begin would only release the lock
once, thus deadlocking with the IO thread.

Signed-off-by: Stefan Reiter 
---

This is a fix for the issue discussed in this part of the thread:
https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07639.html
...not the original problem (core dump) posted by Dietmar.

I've still seen it occasionally hang during a backup abort. I'm trying to figure
out why that happens; the stack trace indicates a similar problem with the main
thread hanging at bdrv_do_drained_begin, though I have no clue why as of yet.

 block/backup.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 7430ca5883..a7a7dcaf4c 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -126,11 +126,7 @@ static void backup_abort(Job *job)
 static void backup_clean(Job *job)
 {
 BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
-AioContext *aio_context = bdrv_get_aio_context(s->backup_top);
-
-aio_context_acquire(aio_context);
 bdrv_backup_top_drop(s->backup_top);
-aio_context_release(aio_context);
 }
 
 void backup_do_checkpoint(BlockJob *job, Error **errp)
-- 
2.25.2