Re: [PATCH 7/7] iotests/271: check disk usage on subcluster-based discard/unmap

2023-11-03 Thread Hanna Czenczek

On 03.11.23 16:51, Hanna Czenczek wrote:
On 20.10.23 23:56, Andrey Drobyshev wrote: 


[...]


@@ -528,6 +543,14 @@ for use_backing_file in yes no; do
  else
  _make_test_img -o extended_l2=on 1M
  fi
+    # Write cluster #0 and discard its subclusters #0-#3
+    $QEMU_IO -c 'write -q 0 64k' "$TEST_IMG"
+    before=$(disk_usage "$TEST_IMG")
+    $QEMU_IO -c 'discard -q 0 8k' "$TEST_IMG"
+    after=$(disk_usage "$TEST_IMG")
+    _verify_du_delta $before $after 8192
+    alloc="$(seq 4 31)"; zero="$(seq 0 3)"
+    _verify_l2_bitmap 0
  # Write clusters #0-#2 and then discard them
  $QEMU_IO -c 'write -q 0 128k' "$TEST_IMG"
  $QEMU_IO -c 'discard -q 0 128k' "$TEST_IMG"


Similarly to the above, I think it would be good if we combined the 
following case with the one you added, i.e. write 128k from the 
beginning, drop the write here, and change the discard to “discard 
-q 8k 120k”, i.e. skip the subclusters we have already discarded, to 
see that this is still combined to discard the whole first cluster.


...Ah, see, and when I try this, the following assertion fails:

qemu-io: ../block/qcow2-cache.c:156: qcow2_cache_destroy: Assertion 
`c->entries[i].ref == 0' failed.
./common.rc: line 220: 128894 Aborted (core dumped) ( 
VALGRIND_QEMU="${VALGRIND_QEMU_IO}" _qemu_proc_exec 
"${VALGRIND_LOGFILE}" "$QEMU_IO_PROG" $QEMU_IO_ARGS "$@" )


Looks like an L2 table is leaked somewhere.  That’s why SCRI should be 
a g_auto()-able type.
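
For reference, a minimal sketch of what a g_auto()-able SCRI could look
like (the struct shape and the extra cache field are assumptions for
illustration, not the series’ actual code):

    /* Hypothetical layout; the real SubClusterRangeInfo comes from this
     * series.  The Qcow2Cache pointer is assumed to be added so the
     * cleanup function can put the slice back on every exit path. */
    typedef struct SubClusterRangeInfo {
        Qcow2Cache *l2_cache;
        uint64_t *l2_slice;
        int l2_index;
        uint64_t l2_bitmap;
    } SubClusterRangeInfo;

    static void sc_range_info_clear(SubClusterRangeInfo *scri)
    {
        if (scri->l2_slice) {
            qcow2_cache_put(scri->l2_cache, (void **) &scri->l2_slice);
        }
    }

    G_DEFINE_AUTO_CLEANUP_CLEAR_FUNC(SubClusterRangeInfo, sc_range_info_clear)

Declaring “g_auto(SubClusterRangeInfo) scri = { 0 };” then releases the
cached L2 slice automatically on every return path.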


Forgot to add: This single test case is the only place where we test 
the added functionality.  I think there should be more cases.  It 
doesn’t really make sense that 271 has so many cases for writing zeroes 
but so few for discarding, now that discarding works on subclusters.  
For most of the write-zeroes cases, we should at least consider whether 
we can run them for discard as well.


I didn’t want to push for such an extensive set of tests, but, well, now 
it turned out I overlooked a bug in patch 4, and only found it because I 
thought “this place might also make a nice test case for this series”.


Hanna




Re: [PATCH v2 00/10] mirror: allow switching from background to active mode

2023-11-03 Thread Markus Armbruster
Kevin Wolf  writes:

> Am 03.11.2023 um 10:36 hat Markus Armbruster geschrieben:
>> Vladimir Sementsov-Ogievskiy  writes:
>> 
>> > On 11.10.23 13:18, Fiona Ebner wrote:
>> >> Am 10.10.23 um 19:55 schrieb Vladimir Sementsov-Ogievskiy:
>> >>> On 09.10.23 12:46, Fiona Ebner wrote:
>> 
>>  Initially, I tried to go for a more general 'job-change' command, but
>>  I couldn't figure out a way to avoid mutual inclusion between
>>  block-core.json and job.json.
>> 
>> >>>
>> >>> What is the problem with it? I still think that job-change would be 
>> >>> better.
>> >>>
>> >> If going for job-change in job.json, the dependencies would be
>> >> job-change -> JobChangeOptions -> JobChangeOptionsMirror -> MirrorCopyMode
>> >> query-jobs -> JobInfo -> JobInfoMirror
>> >> and we can't include block-core.json in job.json, because an inclusion
>> >> loop gives a build error.
>> 
>> Let me try to understand this.
>> 
>> Command job-change needs its argument type JobChangeOptions.
>> 
>> JobChangeOptions is a union, and JobChangeOptionsMirror is one of its
>> branches.
>> 
>> JobChangeOptionsMirror needs MirrorCopyMode from block-core.json.
>> 
>> block-core.json needs job.json for JobType and JobStatus.
>> 
>> >> Could be made to work by moving MirrorCopyMode (and
>> >> JobChangeOptionsMirror, JobInfoMirror) to job.json or some place that
>> >> can be included by both job.json and block-core.json. Moving the
>> >> type-specific definitions to the general job.json didn't feel right to
>> >> me. Including another file with type-specific definitions in job.json
>> >> feels slightly less wrong, but still not quite right and I didn't want
>> >> to create a new file just for MirrorCopyMode (and
>> >> JobChangeOptionsMirror, JobInfoMirror).
>> >> And going further and moving all mirror-related things to a separate
>> >> file would require moving along things like NewImageMode with it or
>> >> create yet another file for such general things used by multiple 
>> >> block-jobs.
>> >> If preferred, I can try and go with some version of the above.
>> >> 
>> >
>> > OK, I see the problem. It seems that all this requires some good 
>> > refactoring. But that's preexisting big work, and should not hold up 
>> > your series. I'm OK to proceed with block-job-change.
>> 
>> Saving ourselves some internal refactoring is a poor excuse for
>> undesirable external interfaces.
>
> I'm not sure how undesirable it is. We have block-job-* commands for
> pretty much every other operation, so it's only consistent to have
> block-job-change, too.

Is the job abstraction a failure?

We have

block-job- command    since   job- command    since
----------------------------------------------------
block-job-set-speed   1.1
block-job-cancel      1.1     job-cancel      3.0
block-job-pause       1.3     job-pause       3.0
block-job-resume      1.3     job-resume      3.0
block-job-complete    1.3     job-complete    3.0
block-job-dismiss     2.12    job-dismiss     3.0
block-job-finalize    2.12    job-finalize    3.0
block-job-change      8.2
query-block-jobs      1.1     query-jobs

I was under the impression that we added the (more general) job-
commands to replace the (less general) block-job commands, and we're
keeping the latter just for compatibility.  Am I mistaken?

Which one should be used?

Why not deprecate the one that shouldn't be used?

The addition of block-job-change without even trying to do job-change
makes me wonder: have we given up on the job- interface?

I'm okay with giving up on failures.  All I want is clarity.  Right now,
I feel thoroughly confused about the status of block-jobs and jobs, and how
they're related.

> Having job-change, too, might be nice in theory, but we don't have even
> a potential user for it at this point (i.e. a job type that isn't a
> block job, but for which changing options at runtime makes sense).
>
>> We need to answer two questions before we do that:
>> 
>> 1. How much work would the refactoring be?
>> 
>> 2. Is the interface improvement this enables worth the work?
>> 
>> Let's start with 1.
>> 
>> An obvious solution is to split JobType and JobStatus off job.json to
>> break the dependency of block-core.json on job.json.
>> 
>> But I'd like us to investigate another one.  block-core.json is *huge*.
>> It's almost a quarter of the entire QAPI schema.  Can we spin out block
>> jobs into block-job.json?  Moves the dependency on job.json from
>> block-core.json to block-job.json.
>
> It also makes job.json depend on block-job.json instead of
> block-core.json, so you only moved the problem without solving it.

block-job.json needs block-core.json and job.json.

job.json needs block-core.json.

No circle so far.




Re: [PATCH 4/7] qcow2: make subclusters discardable

2023-11-03 Thread Hanna Czenczek

On 20.10.23 23:56, Andrey Drobyshev wrote:

This commit makes the discard operation work on the subcluster level
rather than the cluster level.  It introduces the
discard_l2_subclusters() function and makes use of it in the qcow2
discard implementation, much like it's done with zero_in_l2_slice() /
zero_l2_subclusters().  It also changes the qcow2 driver's
pdiscard_alignment to subcluster_size.  That way subcluster-aligned
discards lead to an actual fallocate(PUNCH_HOLE) operation and free up
host disk space.

This feature will let us reclaim additional disk space on guest
TRIM/discard requests, especially when using large clusters
(1M, 2M) with subclusters enabled.

Signed-off-by: Andrey Drobyshev 
---
  block/qcow2-cluster.c | 100 --
  block/qcow2.c |   8 ++--
  2 files changed, 101 insertions(+), 7 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 7c6fa5524c..cf40f2dc12 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -2042,6 +2042,74 @@ discard_in_l2_slice(BlockDriverState *bs, uint64_t offset, uint64_t nb_clusters,
  return nb_clusters;
  }
  
+static int coroutine_fn GRAPH_RDLOCK
+discard_l2_subclusters(BlockDriverState *bs, uint64_t offset,
+   uint64_t nb_subclusters,
+   enum qcow2_discard_type type,
+   bool full_discard,
+   SubClusterRangeInfo *pscri)
+{
+BDRVQcow2State *s = bs->opaque;
+uint64_t new_l2_bitmap, l2_bitmap_mask;
+int ret, sc = offset_to_sc_index(s, offset);
+SubClusterRangeInfo scri = { 0 };
+
+if (!pscri) {
+ret = get_sc_range_info(bs, offset, nb_subclusters, &scri);
+if (ret < 0) {
+goto out;
+}
+} else {
+scri = *pscri;
+}
+
+l2_bitmap_mask = QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
+new_l2_bitmap = scri.l2_bitmap;
+new_l2_bitmap &= ~l2_bitmap_mask;
+
+/*
+ * If there're no allocated subclusters left, we might as well discard
+ * the entire cluster.  That way we'd also update the refcount table.
+ */
+if (!(new_l2_bitmap & QCOW_L2_BITMAP_ALL_ALLOC)) {
+return discard_in_l2_slice(bs,
+   QEMU_ALIGN_DOWN(offset, s->cluster_size),
+   1, type, full_discard);
+}


scri.l2_slice is leaked here.
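
A minimal sketch of one possible fix, assuming the function’s out: label
releases scri.l2_slice (as callers of get_sc_range_info() must do):
route the early exit through the cleanup path instead of returning
directly:

    if (!(new_l2_bitmap & QCOW_L2_BITMAP_ALL_ALLOC)) {
        /* Keep discard_in_l2_slice()'s return value, but fall through
         * to the out: label so the cached L2 slice is put back. */
        ret = discard_in_l2_slice(bs,
                                  QEMU_ALIGN_DOWN(offset, s->cluster_size),
                                  1, type, full_discard);
        goto out;
    }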

Hanna




Re: [PATCH 7/7] iotests/271: check disk usage on subcluster-based discard/unmap

2023-11-03 Thread Hanna Czenczek

On 20.10.23 23:56, Andrey Drobyshev wrote:

Add a _verify_du_delta() checker, which is used to check that the real
disk usage delta meets expectations.  For now we use it to check that
subcluster-based discard/unmap operations lead to an actual disk usage
decrease (i.e. a PUNCH_HOLE operation is performed).


I’m not too happy about checking the disk usage because that relies on 
the underlying filesystem actually accepting and executing the unmap.  
Why is it not enough to check the L2 bitmap?


…Coming back later (I had to fix the missing `ret = ` I mentioned in 
patch 2, or this test would hang, so I couldn’t run it at first), I note 
that checking the disk usage in fact doesn’t work on tmpfs.  I usually 
run the iotests in tmpfs, so that’s not great.



Also add a separate test case for discarding a particular subcluster
within one cluster.

Signed-off-by: Andrey Drobyshev 
---
  tests/qemu-iotests/271 | 25 -
  tests/qemu-iotests/271.out |  2 ++
  2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/271 b/tests/qemu-iotests/271
index c7c2cadda0..5fcb209f5f 100755
--- a/tests/qemu-iotests/271
+++ b/tests/qemu-iotests/271
@@ -81,6 +81,15 @@ _verify_l2_bitmap()
  fi
  }
  
+# Check disk usage delta after a discard/unmap operation
+# _verify_du_delta $before $after $expected_delta
+_verify_du_delta()
+{
+if [ $(($1 - $2)) -ne $3 ]; then
+printf "ERROR: unexpected delta: $1 - $2 = $(($1 - $2)) != $3\n"
+fi
+}
+
  # This should be called as _run_test c=XXX sc=XXX off=XXX len=XXX cmd=XXX
  # c:   cluster number (0 if unset)
  # sc:  subcluster number inside cluster @c (0 if unset)
@@ -198,9 +207,12 @@ for use_backing_file in yes no; do
  alloc="$(seq 0 31)"; zero=""
  _run_test sc=0 len=64k
  
-### Zero and unmap half of cluster #0 (this won't unmap it)
+### Zero and unmap half of cluster #0 (this will unmap it)

I think “it” refers to the cluster, and it is not unmapped.  This test 
case does not use a discard, but write -z instead, so it worked before.  
(The L2 bitmap shown in the output doesn’t change, so functionally, this 
patch series didn’t change this case.)



  alloc="$(seq 16 31)"; zero="$(seq 0 15)"
+before=$(disk_usage "$TEST_IMG")
  _run_test sc=0 len=32k cmd=unmap
+after=$(disk_usage "$TEST_IMG")
+_verify_du_delta $before $after 32768
  
  ### Zero and unmap cluster #0

  alloc=""; zero="$(seq 0 31)"


For the following case, shown truncated here, why don’t we try 
“_run_test sc=16 len=32k cmd=unmap” instead of “sc=0 len=64k”?  I.e. 
unmap only the second half, which, thanks to patch 3, should still unmap 
the whole cluster, because the first half is already unmapped.



@@ -447,7 +459,10 @@ for use_backing_file in yes no; do
  
  # Subcluster-aligned request from clusters #12 to #14

  alloc="$(seq 0 15)"; zero="$(seq 16 31)"
+before=$(disk_usage "$TEST_IMG")
  _run_test c=12 sc=16 len=128k cmd=unmap
+after=$(disk_usage "$TEST_IMG")
+_verify_du_delta $before $after $((128 * 1024))
  alloc=""; zero="$(seq 0 31)"
  _verify_l2_bitmap 13
  alloc="$(seq 16 31)"; zero="$(seq 0 15)"
@@ -528,6 +543,14 @@ for use_backing_file in yes no; do
  else
  _make_test_img -o extended_l2=on 1M
  fi
+# Write cluster #0 and discard its subclusters #0-#3
+$QEMU_IO -c 'write -q 0 64k' "$TEST_IMG"
+before=$(disk_usage "$TEST_IMG")
+$QEMU_IO -c 'discard -q 0 8k' "$TEST_IMG"
+after=$(disk_usage "$TEST_IMG")
+_verify_du_delta $before $after 8192
+alloc="$(seq 4 31)"; zero="$(seq 0 3)"
+_verify_l2_bitmap 0
  # Write clusters #0-#2 and then discard them
  $QEMU_IO -c 'write -q 0 128k' "$TEST_IMG"
  $QEMU_IO -c 'discard -q 0 128k' "$TEST_IMG"


Similarly to the above, I think it would be good if we combined the 
following case with the one you added, i.e. write 128k from the 
beginning, drop the write here, and change the discard to “discard -q 
8k 120k”, i.e. skip the subclusters we have already discarded, to see 
that this is still combined to discard the whole first cluster.


...Ah, see, and when I try this, the following assertion fails:

qemu-io: ../block/qcow2-cache.c:156: qcow2_cache_destroy: Assertion 
`c->entries[i].ref == 0' failed.
./common.rc: line 220: 128894 Aborted (core dumped) ( 
VALGRIND_QEMU="${VALGRIND_QEMU_IO}" _qemu_proc_exec 
"${VALGRIND_LOGFILE}" "$QEMU_IO_PROG" $QEMU_IO_ARGS "$@" )


Looks like an L2 table is leaked somewhere.  That’s why SCRI should be a 
g_auto()-able type.


Hanna


diff --git a/tests/qemu-iotests/271.out b/tests/qemu-iotests/271.out
index 5be780de76..0da8d72cde 100644
--- a/tests/qemu-iotests/271.out
+++ b/tests/qemu-iotests/271.out
@@ -426,6 +426,7 @@ L2 entry #29: 0x 
  ### Discarding clusters with non-zero bitmaps (backing file: yes) ###
  
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 backing_file=TES

Re: [PATCH 6/7] iotests/common.rc: add disk_usage function

2023-11-03 Thread Hanna Czenczek

On 20.10.23 23:56, Andrey Drobyshev wrote:

Move the definition from iotests/250 to common.rc.  It is used to
detect the real disk usage of sparse files.  In particular, we want to
use it for checking subcluster-based discards.

Signed-off-by: Andrey Drobyshev 
---
  tests/qemu-iotests/250   | 5 -
  tests/qemu-iotests/common.rc | 6 ++
  2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/tests/qemu-iotests/250 b/tests/qemu-iotests/250
index af48f83aba..c0a0dbc0ff 100755
--- a/tests/qemu-iotests/250
+++ b/tests/qemu-iotests/250
@@ -52,11 +52,6 @@ _unsupported_imgopts data_file
  # bdrv_co_truncate(bs->file) call in qcow2_co_truncate(), which might succeed
  # anyway.
  
-disk_usage()

-{
-du --block-size=1 $1 | awk '{print $1}'
-}
-
  size=2100M
  
  _make_test_img -o "cluster_size=1M,preallocation=metadata" $size

diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
index 95c12577dd..5d2ea26c7f 100644
--- a/tests/qemu-iotests/common.rc
+++ b/tests/qemu-iotests/common.rc
@@ -140,6 +140,12 @@ _optstr_add()
  fi
  }
  
+# report real disk usage for sparse files

+disk_usage()
+{
+du --block-size=1 $1 | awk '{print $1}'


Pre-existing, but since you’re touching this now: Can you please change 
the $1 to "$1"?


Hanna


+}
+
  # Set the variables to the empty string to turn Valgrind off
  # for specific processes, e.g.
  # $ VALGRIND_QEMU_IO= ./check -qcow2 -valgrind 015





Re: [PATCH 5/7] qcow2: zero_l2_subclusters: fall through to discard operation when requested

2023-11-03 Thread Hanna Czenczek

On 20.10.23 23:56, Andrey Drobyshev wrote:

When zeroizing subclusters within a single cluster, detect usage of the
BDRV_REQ_MAY_UNMAP flag and fall through to the subcluster-based discard
operation, much like it's done with the cluster-based discards.  That
way subcluster-aligned operations like "qemu-io -c 'write -z -u ...'"
will lead to an actual unmap.


Ever since the introduction of discard-no-unref, I wonder whether that 
is actually an advantage or not.  I can’t think of a reason why it’d be 
advantageous to drop the allocation.


On the other hand, zero_in_l2_slice() does it indeed, so consistency is 
a reasonable argument.



Signed-off-by: Andrey Drobyshev 
---
  block/qcow2-cluster.c | 17 ++---
  1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index cf40f2dc12..040251f2c3 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -2242,7 +2242,7 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
  unsigned nb_subclusters, int flags)
  {
  BDRVQcow2State *s = bs->opaque;
-uint64_t new_l2_bitmap;
+uint64_t new_l2_bitmap, l2_bitmap_mask;
  int ret, sc = offset_to_sc_index(s, offset);
  SubClusterRangeInfo scri = { 0 };
  
@@ -2251,9 +2251,10 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
  goto out;
  }
  
+l2_bitmap_mask = QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);


“l2_bitmap_mask” already wasn’t a great name in patch 4 because it 
doesn’t say what mask it is, but in patch 4, there was at least only one 
relevant mask.  Here, we have two (ALLOC_RANGE and ZERO_RANGE), so the 
name should reflect what kind of mask it is.
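
For instance, something along these lines (identifier names are
illustrative, not taken from the patch):

    /* One self-describing mask per meaning instead of a generic name */
    uint64_t sc_zero_mask  = QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
    uint64_t sc_alloc_mask = QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);

    new_l2_bitmap = (scri.l2_bitmap | sc_zero_mask) & ~sc_alloc_mask;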



  new_l2_bitmap = scri.l2_bitmap;
-new_l2_bitmap |=  QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
-new_l2_bitmap &= ~QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
+new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
+new_l2_bitmap &= ~l2_bitmap_mask;
  
  /*
   * If there're no non-zero subclusters left, we might as well zeroize
@@ -2266,6 +2267,16 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
  1, flags);
  }
  
+/*
+ * If the request allows discarding subclusters and they're actually
+ * allocated, we go down the discard path since after the discard
+ * operation the subclusters are going to be read as zeroes anyway.
+ */
+if ((flags & BDRV_REQ_MAY_UNMAP) && (scri.l2_bitmap & l2_bitmap_mask)) {
+return discard_l2_subclusters(bs, offset, nb_subclusters,
+  QCOW2_DISCARD_REQUEST, false, &scri);
+}


I wonder why it matters whether any subclusters are allocated, i.e. why 
we can’t just immediately use discard_l2_subclusters() whenever 
BDRV_REQ_MAY_UNMAP is set, without even fetching SCRI (that would also 
save us from passing SCRI here, which, as I already said on patch 4, I 
don’t find great).


Doesn’t discard_l2_subclusters() guarantee that the clusters read as 
zero when full_discard=false?  There is this case where it won’t mark 
the subclusters as zero if there is no backing file and none of the 
subclusters is allocated.  But still, (A) those subclusters will read as 
zero, and (B) if there is a problem there, why don’t we just always mark 
those subclusters as zero?  This optimization only has an effect when 
the guest discards/zeroes subclusters (not whole clusters) that were 
already discarded, so it sounds minuscule.
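
In other words, a sketch of the simplification in question (hypothetical;
it assumes discard_l2_subclusters() fetches SCRI itself when passed NULL,
as the quoted patch 4 code suggests):

    /* No SCRI needed up front: let the discard path look it up */
    if (flags & BDRV_REQ_MAY_UNMAP) {
        return discard_l2_subclusters(bs, offset, nb_subclusters,
                                      QCOW2_DISCARD_REQUEST, false, NULL);
    }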


Hanna


+
  if (new_l2_bitmap != scri.l2_bitmap) {
  set_l2_bitmap(s, scri.l2_slice, scri.l2_index, new_l2_bitmap);
  qcow2_cache_entry_mark_dirty(s->l2_table_cache, scri.l2_slice);





Re: deadlock when using iothread during backup_clean()

2023-11-03 Thread Fiona Ebner
Am 20.10.23 um 15:52 schrieb Fiona Ebner:
> And I found that with drive-mirror, the issue during starting seems to
> manifest with the bdrv_open() call. Adding a return before it, the guest
> IO didn't get stuck in my testing, but adding a return after it, it can
> get stuck. I'll try to see if I can further narrow it down next week,
> but maybe that's already a useful hint?
> 

In the end, I was able to find a reproducer that just does draining and
bisected the issue (doesn't seem related to the graph lock after all). I
replied there, to avoid all the overhead from this thread:

https://lists.nongnu.org/archive/html/qemu-devel/2023-11/msg00681.html

Best Regards,
Fiona




Re: [PULL 29/32] virtio-blk: implement BlockDevOps->drained_begin()

2023-11-03 Thread Fiona Ebner
Hi,

Am 30.05.23 um 18:32 schrieb Kevin Wolf:
> From: Stefan Hajnoczi 
> 
> Detach ioeventfds during drained sections to stop I/O submission from
> the guest. virtio-blk is no longer reliant on aio_disable_external()
> after this patch. This will allow us to remove the
> aio_disable_external() API once all other code that relies on it is
> converted.
> 
> Take extra care to avoid attaching/detaching ioeventfds if the data
> plane is started/stopped during a drained section. This should be rare,
> but maybe the mirror block job can trigger it.
> 
> Signed-off-by: Stefan Hajnoczi 
> Message-Id: <20230516190238.8401-18-stefa...@redhat.com>
> Signed-off-by: Kevin Wolf 

I ran into a strange issue where guest IO would get completely stuck
during certain block jobs a while ago and finally managed to find a
small reproducer [0]. I'm using a VM with virtio-blk-pci (or
virtio-scsi-pci) with an iothread and running

fio --name=file --size=100M --direct=1 --rw=randwrite --bs=4k
--ioengine=psync --numjobs=5 --runtime=1200 --time_based

in the guest. Then I'm issuing the QMP command with the reproducer in a
loop. Usually, the guest IO will get stuck after about 1-3 minutes,
sometimes fio can manage to continue with a lower speed for a while (but
trying to Ctrl+C it or doing other IO in the guest will already be
broken), which I guess could be a hint that it's an issue with notifiers?

Bisecting (to declare a commit good, I waited 10 minutes) led me to this
patch, i.e. commit 1665d9326f ("virtio-blk: implement
BlockDevOps->drained_begin()") and for SCSI, I verified that the issue
similarly starts happening after 766aa2de0f ("virtio-scsi: implement
BlockDevOps->drained_begin()").

Both issues are still present on current master (i.e. 1c98a821a2
("tests/qtest: Introduce tests for AMD/Xilinx Versal TRNG device"))

Happy to provide more information and hints about how to debug the issue
further.

Best Regards,
Fiona

[0]:

> diff --git a/blockdev.c b/blockdev.c
> index db2725fe74..bf2e0fc22c 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -2986,6 +2986,11 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>  bool zero_target;
>  int ret;
>  
> +bdrv_drain_all_begin();
> +bdrv_drain_all_end();
> +return;
> +
> +
>  bs = qmp_get_root_bs(arg->device, errp);
>  if (!bs) {
>  return;




Re: [PATCH 19/24] block: Introduce bdrv_co_change_backing_file()

2023-11-03 Thread Eric Blake
On Fri, Nov 03, 2023 at 11:33:47AM +0100, Kevin Wolf wrote:
> Am 30.10.2023 um 14:57 hat Eric Blake geschrieben:
> > On Fri, Oct 27, 2023 at 05:53:28PM +0200, Kevin Wolf wrote:
> > > bdrv_change_backing_file() is called both inside and outside coroutine
> > > context. This makes it difficult for it to take the graph lock
> > > internally. It also means that driver implementations need to be able to
> > > run outside of coroutines, too. Switch it to the usual model with a
> > > coroutine based implementation and a co_wrapper instead. The new
> > > function is marked GRAPH_RDLOCK.
> > > 
> > > As the co_wrapper now runs the function in the AioContext of the node
> > > (as it should always have done), this is not GLOBAL_STATE_CODE() any
> > > more.
> > > 
> > > Signed-off-by: Kevin Wolf 
> > > ---
> > >  include/block/block-global-state.h |  3 +-
> > >  include/block/block-io.h   |  8 
> > >  include/block/block_int-common.h   |  5 ++-
> > >  block.c| 11 ++---
> > >  block/qcow2.c  | 18 +
> > >  block/qed.c| 64 +++---
> > >  tests/unit/test-bdrv-drain.c   |  8 ++--
> > >  7 files changed, 65 insertions(+), 52 deletions(-)
> > > 
> > > +++ b/block/qcow2.c
> > > @@ -6155,9 +6159,9 @@ BlockDriver bdrv_qcow2 = {
> > >  .bdrv_co_save_vmstate   = qcow2_co_save_vmstate,
> > >  .bdrv_co_load_vmstate   = qcow2_co_load_vmstate,
> > >  
> > > -.is_format  = true,
> > > -.supports_backing   = true,
> > > -.bdrv_change_backing_file   = qcow2_change_backing_file,
> > > +.is_format  = true,
> > > +.supports_backing   = true,
> > > +.bdrv_co_change_backing_file= qcow2_co_change_backing_file,
> > >  
> > >  .bdrv_refresh_limits= qcow2_refresh_limits,
> > >  .bdrv_co_invalidate_cache   = qcow2_co_invalidate_cache,
> > 
> > Here, you only realigned = on a portion of the initializer...
> > 
> > > diff --git a/block/qed.c b/block/qed.c
> > > index 686ad711f7..996aa384fe 100644
> > > --- a/block/qed.c
> > > +++ b/block/qed.c
> > >  static BlockDriver bdrv_qed = {
> > > -.format_name  = "qed",
> > > -.instance_size= sizeof(BDRVQEDState),
> > > -.create_opts  = &qed_create_opts,
> > > -.is_format= true,
> > > -.supports_backing = true,
> > > -
> > > -.bdrv_probe   = bdrv_qed_probe,
> > 
> > ...while here, you are doing it on the entire block.  This shows why I
> > personally dislike aligning =, but I tolerate it when it is already
> > prevailing style.  Still, it feels weird to be inconsistent within the
> > same patch.
> 
> It's because qcow2 already had multiple different indentations, but qed
> had everything aligned to the same column. I can update qcow2.

It's your call how you want it to look (as I stated, we are already
inconsistent on style; my preference is a version with less churn, but
I also understand preserving the existing style as a way to avoid
churn).  Whatever the end result, it doesn't affect my R-b.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org




Re: [PATCH 16/24] block: Mark bdrv_replace_node() GRAPH_WRLOCK

2023-11-03 Thread Eric Blake
On Fri, Nov 03, 2023 at 11:32:39AM +0100, Kevin Wolf wrote:
...
> > > -GLOBAL_STATE_CODE();
> > > -
> > > -/* Make sure that @from doesn't go away until we have successfully 
> > > attached
> > > - * all of its parents to @to. */
> > 
> > Useful comment that you just moved here in the previous patch...
> > 
> > > -bdrv_ref(from);
...
> > > @@ -5717,9 +5699,15 @@ BlockDriverState *bdrv_insert_node(BlockDriverState *bs, QDict *options,
> > >  goto fail;
> > >  }
> > >  
> > > +bdrv_ref(bs);
> > 
> > ...but now it is gone.  Intentional?
> 
> I figured it was obvious enough that bdrv_ref() is always called to make
> sure that the node doesn't go away too early, but I can add it back.

Your call.

> > > @@ -94,8 +95,12 @@ static void commit_abort(Job *job)
> > >   * XXX Can (or should) we somehow keep 'consistent read' blocked even
> > >   * after the failed/cancelled commit job is gone? If we already wrote
> > >   * something to base, the intermediate images aren't valid any more. */
> > > -bdrv_replace_node(s->commit_top_bs, s->commit_top_bs->backing->bs,
> > > -  &error_abort);
> > > +commit_top_backing_bs = s->commit_top_bs->backing->bs;
> > > +bdrv_drained_begin(commit_top_backing_bs);
> > > +bdrv_graph_wrlock(commit_top_backing_bs);
> > 
> > Here, and elsewhere in the patch, drained_begin/end is outside
> > wr(un)lock...
> > 
> > > +bdrv_replace_node(s->commit_top_bs, commit_top_backing_bs, &error_abort);
> > > +bdrv_graph_wrunlock();
> > > +bdrv_drained_end(commit_top_backing_bs);
> > >  
> > >  bdrv_unref(s->commit_top_bs);
> > >  bdrv_unref(top_bs);
> > > @@ -425,7 +430,11 @@ fail:
> > >  /* commit_top_bs has to be replaced after deleting the block job,
> > >   * otherwise this would fail because of lack of permissions. */
> > >  if (commit_top_bs) {
> > > +bdrv_graph_wrlock(top);
> > > +bdrv_drained_begin(top);
> > >  bdrv_replace_node(commit_top_bs, top, &error_abort);
> > > +bdrv_drained_end(top);
> > > +bdrv_graph_wrunlock();
> > 
> > ...but here you do it in the opposite order.  Intentional?
> 
> No, this is actually wrong. bdrv_drained_begin() has a nested event
> loop, and running a nested event loop while holding the graph lock can
> cause deadlocks, so it's forbidden. Thanks for catching this!

That's what review is for!

> 
> Since the two comments above are the only thing you found in the review,
> I'll just directly fix them while applying the series.

Sounds good to me. With the deadlock fixed by swapping order,

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org




Re: [PATCH 04/24] block: Mark bdrv_root_attach_child() GRAPH_WRLOCK

2023-11-03 Thread Eric Blake
On Fri, Nov 03, 2023 at 10:45:11AM +0100, Kevin Wolf wrote:
> Am 27.10.2023 um 22:22 hat Eric Blake geschrieben:
> > On Fri, Oct 27, 2023 at 05:53:13PM +0200, Kevin Wolf wrote:
> > > Instead of taking the writer lock internally, require callers to already
> > > hold it when calling bdrv_root_attach_child(). These callers will
> > > typically already hold the graph lock once the locking work is
> > > completed, which means that they can't call functions that take it
> > > internally.
> > > 
> > > Signed-off-by: Kevin Wolf 
> > > ---
> > >  include/block/block_int-global-state.h | 13 +++--
> > >  block.c|  5 +
> > >  block/block-backend.c  |  2 ++
> > >  blockjob.c |  2 ++
> > >  4 files changed, 12 insertions(+), 10 deletions(-)
> > > 
> > > +++ b/block.c
> > > @@ -3214,8 +3214,6 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState 
> > > *child_bs,
> > >  
> > >  GLOBAL_STATE_CODE();
> > >  
> > > -bdrv_graph_wrlock(child_bs);
> > > -
> > >  child = bdrv_attach_child_common(child_bs, child_name, child_class,
> > 
> > Do we need some sort of assertion that the caller did indeed grab the
> > lock at this point?  Or is that redundant with assertions made in all
> > helper functions we are calling where the lock already matters?
> 
> The GRAPH_WRLOCK in the header file makes it a compiler error with TSA
> enabled if the caller doesn't already hold the lock. And our CI has some
> builds with TSA, so even if the maintainer only uses gcc, we'll catch
> it.

TSA is awesome - guaranteeing code correctness during CI has been an
awesome result of this massive refactoring effort.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org




Re: [PATCH v2 00/10] mirror: allow switching from background to active mode

2023-11-03 Thread Kevin Wolf
Am 03.11.2023 um 10:36 hat Markus Armbruster geschrieben:
> Vladimir Sementsov-Ogievskiy  writes:
> 
> > On 11.10.23 13:18, Fiona Ebner wrote:
> >> Am 10.10.23 um 19:55 schrieb Vladimir Sementsov-Ogievskiy:
> >>> On 09.10.23 12:46, Fiona Ebner wrote:
> 
>  Initially, I tried to go for a more general 'job-change' command, but
>  I couldn't figure out a way to avoid mutual inclusion between
>  block-core.json and job.json.
> 
> >>>
> >>> What is the problem with it? I still think that job-change would be 
> >>> better.
> >>>
> >> If going for job-change in job.json, the dependencies would be
> >> job-change -> JobChangeOptions -> JobChangeOptionsMirror -> MirrorCopyMode
> >> query-jobs -> JobInfo -> JobInfoMirror
> >> and we can't include block-core.json in job.json, because an inclusion
> >> loop gives a build error.
> 
> Let me try to understand this.
> 
> Command job-change needs its argument type JobChangeOptions.
> 
> JobChangeOptions is a union, and JobChangeOptionsMirror is one of its
> branches.
> 
> JobChangeOptionsMirror needs MirrorCopyMode from block-core.json.
> 
> block-core.json needs job.json for JobType and JobStatus.
> 
> >> Could be made to work by moving MirrorCopyMode (and
> >> JobChangeOptionsMirror, JobInfoMirror) to job.json or some place that
> >> can be included by both job.json and block-core.json. Moving the
> >> type-specific definitions to the general job.json didn't feel right to
> >> me. Including another file with type-specific definitions in job.json
> >> feels slightly less wrong, but still not quite right and I didn't want
> >> to create a new file just for MirrorCopyMode (and
> >> JobChangeOptionsMirror, JobInfoMirror).
> >> And going further and moving all mirror-related things to a separate
> >> file would require moving along things like NewImageMode with it or
> >> create yet another file for such general things used by multiple 
> >> block-jobs.
> >> If preferred, I can try and go with some version of the above.
> >> 
> >
> > OK, I see the problem. It seems that all this requires some good 
> > refactoring. But that's preexisting big work, and should not hold up 
> > your series. I'm OK to proceed with block-job-change.
> 
> Saving ourselves some internal refactoring is a poor excuse for
> undesirable external interfaces.

I'm not sure how undesirable it is. We have block-job-* commands for
pretty much every other operation, so it's only consistent to have
block-job-change, too.

Having job-change, too, might be nice in theory, but we don't have even
a potential user for it at this point (i.e. a job type that isn't a
block job, but for which changing options at runtime makes sense).

> We need to answer two questions before we do that:
> 
> 1. How much work would the refactoring be?
> 
> 2. Is the interface improvement this enables worth the work?
> 
> Let's start with 1.
> 
> An obvious solution is to split JobType and JobStatus off job.json to
> break the dependency of block-core.json on job.json.
> 
> But I'd like us to investigate another one.  block-core.json is *huge*.
> It's almost a quarter of the entire QAPI schema.  Can we spin out block
> jobs into block-job.json?  Moves the dependency on job.json from
> block-core.json to block-job.json.

It also makes job.json depend on block-job.json instead of
block-core.json, so you only moved the problem without solving it.

Kevin




Re: IDE pending patches

2023-11-03 Thread BALATON Zoltan

On Thu, 2 Nov 2023, Kevin Wolf wrote:

Am 02.11.2023 um 12:23 hat Niklas Cassel geschrieben:

Hello Philippe, Kevin,

The QEMU 8.2 freeze is next week,
and the IDE maintainer (John) hasn't been replying to emails lately.

Kevin, considering that you picked up Fiona's series:
https://lore.kernel.org/qemu-devel/d6286ef8-6cf0-4e72-90e9-e91cef9da...@proxmox.com/
which was sent 2023-09-06, via your tree, do you think that you could
queue up some additional pending IDE patches?

If you don't want to take them, perhaps Kevin can take them?


Yes, I can take IDE patches through my tree if necessary. And actually I
went through patches that are still open and saw yours earlier this
week, so I already made a mental note to get to them in time for 8.2.


I have these two patches:
https://lists.nongnu.org/archive/html/qemu-block/2023-10/msg00172.html
which was sent 2023-10-05
and
https://lists.nongnu.org/archive/html/qemu-block/2023-10/msg00382.html
which was sent 2023-10-11


Both of them are fixes, so they are not immediately affected by the
feature freeze. If there is feature work to do, it will take priority
for me until Tuesday.


Looking at the list, Mark's series:
https://lists.nongnu.org/archive/html/qemu-block/2023-10/msg01289.html
v2 was sent quite recently, 2023-10-24, but seems to have sufficient
tags to be ready to go in this cycle as well.


It only seems to have Tested-by tags so far, so if you have spare cycles
to give it some actual code review, that might be useful. I'll try to
have a look, too.


Bernhard has sent a R-b already:

https://patchew.org/QEMU/20231024224056.842607-1-mark.cave-ayl...@ilande.co.uk/

but review from IDE people would not hurt.

(I had a simpler version for this only touching the via-ide first:

https://patchew.org/QEMU/cover.1697661160.git.bala...@eik.bme.hu/4095e01f4596e77a478759161ae736f0c398600a.1697661160.git.bala...@eik.bme.hu/

but Mark wanted to make it more general for possible use in other devices 
at some point.)


Regards,
BALATON Zoltan



Re: [PATCH 19/24] block: Introduce bdrv_co_change_backing_file()

2023-11-03 Thread Kevin Wolf
Am 30.10.2023 um 14:57 hat Eric Blake geschrieben:
> On Fri, Oct 27, 2023 at 05:53:28PM +0200, Kevin Wolf wrote:
> > bdrv_change_backing_file() is called both inside and outside coroutine
> > context. This makes it difficult for it to take the graph lock
> > internally. It also means that driver implementations need to be able to
> > run outside of coroutines, too. Switch it to the usual model with a
> > coroutine based implementation and a co_wrapper instead. The new
> > function is marked GRAPH_RDLOCK.
> > 
> > As the co_wrapper now runs the function in the AioContext of the node
> > (as it should always have done), this is not GLOBAL_STATE_CODE() any
> > more.
> > 
> > Signed-off-by: Kevin Wolf 
> > ---
> >  include/block/block-global-state.h |  3 +-
> >  include/block/block-io.h   |  8 
> >  include/block/block_int-common.h   |  5 ++-
> >  block.c| 11 ++---
> >  block/qcow2.c  | 18 +
> >  block/qed.c| 64 +++---
> >  tests/unit/test-bdrv-drain.c   |  8 ++--
> >  7 files changed, 65 insertions(+), 52 deletions(-)
> > 
> > +++ b/block/qcow2.c
> > @@ -6155,9 +6159,9 @@ BlockDriver bdrv_qcow2 = {
> >  .bdrv_co_save_vmstate   = qcow2_co_save_vmstate,
> >  .bdrv_co_load_vmstate   = qcow2_co_load_vmstate,
> >  
> > -.is_format  = true,
> > -.supports_backing   = true,
> > -.bdrv_change_backing_file   = qcow2_change_backing_file,
> > +.is_format  = true,
> > +.supports_backing   = true,
> > +.bdrv_co_change_backing_file= qcow2_co_change_backing_file,
> >  
> >  .bdrv_refresh_limits= qcow2_refresh_limits,
> >  .bdrv_co_invalidate_cache   = qcow2_co_invalidate_cache,
> 
> Here, you only realigned = on a portion of the initializer...
> 
> > diff --git a/block/qed.c b/block/qed.c
> > index 686ad711f7..996aa384fe 100644
> > --- a/block/qed.c
> > +++ b/block/qed.c
> >  static BlockDriver bdrv_qed = {
> > -.format_name  = "qed",
> > -.instance_size= sizeof(BDRVQEDState),
> > -.create_opts  = &qed_create_opts,
> > -.is_format= true,
> > -.supports_backing = true,
> > -
> > -.bdrv_probe   = bdrv_qed_probe,
> 
> ...while here, you are doing it on the entire block.  This shows why I
> personally dislike aligning =, but I tolerate it when it is already
> prevailing style.  Still, it feels weird to be inconsistent within the
> same patch.

It's because qcow2 already had multiple different indentations, but qed
had everything aligned to the same column. I can update qcow2.

Kevin




Re: [PATCH 16/24] block: Mark bdrv_replace_node() GRAPH_WRLOCK

2023-11-03 Thread Kevin Wolf
Am 27.10.2023 um 23:33 hat Eric Blake geschrieben:
> On Fri, Oct 27, 2023 at 05:53:25PM +0200, Kevin Wolf wrote:
> > Instead of taking the writer lock internally, require callers to already
> > hold it when calling bdrv_replace_node(). Its callers may already want
> > to hold the graph lock and so wouldn't be able to call functions that
> > take it internally.
> > 
> > Signed-off-by: Kevin Wolf 
> > ---
> >  include/block/block-global-state.h |  6 --
> >  block.c| 26 +++---
> >  block/commit.c | 13 +++--
> >  block/mirror.c | 26 --
> >  blockdev.c |  5 +
> >  tests/unit/test-bdrv-drain.c   |  6 ++
> >  tests/unit/test-bdrv-graph-mod.c   | 13 +++--
> >  7 files changed, 60 insertions(+), 35 deletions(-)
> > 
> > +++ b/block.c
> > @@ -5484,25 +5484,7 @@ out:
> >  int bdrv_replace_node(BlockDriverState *from, BlockDriverState *to,
> >Error **errp)
> >  {
> > -int ret;
> > -
> > -GLOBAL_STATE_CODE();
> > -
> > -/* Make sure that @from doesn't go away until we have successfully attached
> > - * all of its parents to @to. */
> 
> Useful comment that you just moved here in the previous patch...
> 
> > -bdrv_ref(from);
> > -bdrv_drained_begin(from);
> > -bdrv_drained_begin(to);
> > -bdrv_graph_wrlock(to);
> > -
> > -ret = bdrv_replace_node_common(from, to, true, false, errp);
> > -
> > -bdrv_graph_wrunlock();
> > -bdrv_drained_end(to);
> > -bdrv_drained_end(from);
> > -bdrv_unref(from);
> > -
> > -return ret;
> > +return bdrv_replace_node_common(from, to, true, false, errp);
> >  }
> >  
> >  int bdrv_drop_filter(BlockDriverState *bs, Error **errp)
> > @@ -5717,9 +5699,15 @@ BlockDriverState *bdrv_insert_node(BlockDriverState *bs, QDict *options,
> >  goto fail;
> >  }
> >  
> > +bdrv_ref(bs);
> 
> ...but now it is gone.  Intentional?

I figured it was obvious enough that bdrv_ref() is always called to make
sure that the node doesn't go away too early, but I can add it back.

> >  bdrv_drained_begin(bs);
> > +bdrv_drained_begin(new_node_bs);
> > +bdrv_graph_wrlock(new_node_bs);
> >  ret = bdrv_replace_node(bs, new_node_bs, errp);
> > +bdrv_graph_wrunlock();
> > +bdrv_drained_end(new_node_bs);
> >  bdrv_drained_end(bs);
> > +bdrv_unref(bs);
> >  
> >  if (ret < 0) {
> >  error_prepend(errp, "Could not replace node: ");
> > diff --git a/block/commit.c b/block/commit.c
> > index d92af02ead..2fecdce86f 100644
> > --- a/block/commit.c
> > +++ b/block/commit.c
> > @@ -68,6 +68,7 @@ static void commit_abort(Job *job)
> >  {
> >  CommitBlockJob *s = container_of(job, CommitBlockJob, common.job);
> >  BlockDriverState *top_bs = blk_bs(s->top);
> > +BlockDriverState *commit_top_backing_bs;
> >  
> >  if (s->chain_frozen) {
> >  bdrv_graph_rdlock_main_loop();
> > @@ -94,8 +95,12 @@ static void commit_abort(Job *job)
> >   * XXX Can (or should) we somehow keep 'consistent read' blocked even
> >   * after the failed/cancelled commit job is gone? If we already wrote
> >   * something to base, the intermediate images aren't valid any more. */
> > -bdrv_replace_node(s->commit_top_bs, s->commit_top_bs->backing->bs,
> > -  &error_abort);
> > +commit_top_backing_bs = s->commit_top_bs->backing->bs;
> > +bdrv_drained_begin(commit_top_backing_bs);
> > +bdrv_graph_wrlock(commit_top_backing_bs);
> 
> Here, and elsewhere in the patch, drained_begin/end is outside
> wr(un)lock...
> 
> > +bdrv_replace_node(s->commit_top_bs, commit_top_backing_bs, &error_abort);
> > +bdrv_graph_wrunlock();
> > +bdrv_drained_end(commit_top_backing_bs);
> >  
> >  bdrv_unref(s->commit_top_bs);
> >  bdrv_unref(top_bs);
> > @@ -425,7 +430,11 @@ fail:
> >  /* commit_top_bs has to be replaced after deleting the block job,
> >   * otherwise this would fail because of lack of permissions. */
> >  if (commit_top_bs) {
> > +bdrv_graph_wrlock(top);
> > +bdrv_drained_begin(top);
> >  bdrv_replace_node(commit_top_bs, top, &error_abort);
> > +bdrv_drained_end(top);
> > +bdrv_graph_wrunlock();
> 
> ...but here you do it in the opposite order.  Intentional?

No, this is actually wrong. bdrv_drained_begin() has a nested event
loop, and running a nested event loop while holding the graph lock can
cause deadlocks, so it's forbidden. Thanks for catching this!
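
For the record, the corrected ordering keeps the drain (which may run a
nested event loop) strictly outside the locked section, matching the
commit_abort() hunk above:

    bdrv_drained_begin(top);
    bdrv_graph_wrlock(top);
    bdrv_replace_node(commit_top_bs, top, &error_abort);
    bdrv_graph_wrunlock();
    bdrv_drained_end(top);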

> > +++ b/tests/unit/test-bdrv-drain.c
> > @@ -2000,7 +2000,13 @@ static void do_test_replace_child_mid_drain(int old_drain_count,
> >  parent_s->was_undrained = false;
> >  
> >  g_assert(parent_bs->quiesce_counter == old_drain_count);
> > +bdrv_drained_begin(old_child_bs);
> > +bdrv_drained_begin(new_child_bs);
> > +bdrv_graph_wrlock

Re: [PATCH 23/24] block: Take graph lock for most of .bdrv_open

2023-11-03 Thread Kevin Wolf
Am 30.10.2023 um 22:34 hat Eric Blake geschrieben:
> On Fri, Oct 27, 2023 at 05:53:32PM +0200, Kevin Wolf wrote:
> > Most implementations of .bdrv_open first open their file child (which is
> > an operation that internally takes the write lock and therefore we
> > shouldn't hold the graph lock while calling it), and afterwards many
> > operations that require holding the graph lock, e.g. for accessing
> > bs->file.
> > 
> > This changes block drivers that follow this pattern to take the graph
> > lock after opening the child node.
> > 
> > Signed-off-by: Kevin Wolf 
> > ---
> >  block/blkdebug.c  | 16 ++--
> >  block/bochs.c |  4 
> >  block/cloop.c |  4 
> >  block/copy-before-write.c |  2 ++
> >  block/copy-on-read.c  |  4 ++--
> >  block/crypto.c|  4 
> >  block/dmg.c   |  5 +
> >  block/filter-compress.c   |  2 ++
> >  block/parallels.c |  4 ++--
> >  block/preallocate.c   |  4 
> >  block/qcow.c  | 11 +++
> >  block/raw-format.c|  6 --
> >  block/snapshot-access.c   |  3 +++
> >  block/throttle.c  |  3 +++
> >  block/vdi.c   |  4 ++--
> >  block/vpc.c   |  4 ++--
> >  16 files changed, 60 insertions(+), 20 deletions(-)
> > 
> 
> > +++ b/block/qcow.c
> > @@ -124,9 +124,11 @@ static int qcow_open(BlockDriverState *bs, QDict *options, int flags,
> >  
> >  ret = bdrv_open_file_child(NULL, options, "file", bs, errp);
> >  if (ret < 0) {
> > -goto fail;
> > +goto fail_unlocked;
> >  }
> >  
> > +bdrv_graph_rdlock_main_loop();
> > +
> >  ret = bdrv_pread(bs->file, 0, sizeof(header), &header, 0);
> >  if (ret < 0) {
> >  goto fail;
> > @@ -301,11 +303,9 @@ static int qcow_open(BlockDriverState *bs, QDict *options, int flags,
> >  }
> >  
> >  /* Disable migration when qcow images are used */
> > -bdrv_graph_rdlock_main_loop();
> >  error_setg(&s->migration_blocker, "The qcow format used by node '%s' "
> > "does not support live migration",
> > bdrv_get_device_or_node_name(bs));
> > -bdrv_graph_rdunlock_main_loop();
> >  
> >  ret = migrate_add_blocker(s->migration_blocker, errp);
> >  if (ret < 0) {
> > @@ -316,9 +316,12 @@ static int qcow_open(BlockDriverState *bs, QDict *options, int flags,
> >  qobject_unref(encryptopts);
> >  qapi_free_QCryptoBlockOpenOptions(crypto_opts);
> >  qemu_co_mutex_init(&s->lock);
> > +bdrv_graph_rdunlock_main_loop();
> >  return 0;
> >  
> > - fail:
> > +fail:
> 
> Why the change in indentation?  At least emacs intentionally prefers
> to indent top-level labels 1 column in, so that they cannot be
> confused with function declarations in the first column.  But that's cosmetic.

The coding style document isn't explicit about it, but the overwhelming
majority of code both in the block layer and QEMU in general prefers the
style without indentation:

$ git grep '^ [A-Za-z_]\+:$' | wc -l
392
$ git grep '^[A-Za-z_]\+:$' | wc -l
2739

$ git grep '^ [A-Za-z_]\+:$' block | wc -l
27
$ git grep '^[A-Za-z_]\+:$' block | wc -l
332

So I wanted to add the new label with the usual convention, but also not
make the function inconsistent, which is why I reindented the existing
label.

Kevin




Re: [PATCH 09/24] block: Mark bdrv_(un)freeze_backing_chain() and callers GRAPH_RDLOCK

2023-11-03 Thread Kevin Wolf
Am 27.10.2023 um 23:00 hat Eric Blake geschrieben:
> On Fri, Oct 27, 2023 at 05:53:18PM +0200, Kevin Wolf wrote:
> > This adds GRAPH_RDLOCK annotations to declare that callers of
> > bdrv_(un)freeze_backing_chain() need to hold a reader lock for the
> > graph because it calls bdrv_filter_or_cow_child(), which accesses
> > bs->file/backing.
> > 
> > Use the opportunity to make bdrv_is_backing_chain_frozen() static, it
> > has no external callers.
> > 
> > Signed-off-by: Kevin Wolf 
> > ---
> >  block/copy-on-read.h   |  3 ++-
> >  include/block/block-global-state.h | 11 ++-
> >  block.c|  5 +++--
> >  block/commit.c |  6 ++
> >  block/copy-on-read.c   | 19 +++
> >  block/mirror.c |  3 +++
> >  block/stream.c | 16 +++-
> >  7 files changed, 46 insertions(+), 17 deletions(-)
> >
> ...
> > +++ b/block/copy-on-read.c
> ...
> > -static void cor_close(BlockDriverState *bs)
> > +static void GRAPH_UNLOCKED cor_close(BlockDriverState *bs)
> >  {
> >  BDRVStateCOR *s = bs->opaque;
> >  
> > +GLOBAL_STATE_CODE();
> > +
> >  if (s->chain_frozen) {
> > +bdrv_graph_rdlock_main_loop();
> >  s->chain_frozen = false;
> >  bdrv_unfreeze_backing_chain(bs, s->bottom_bs);
> > +bdrv_graph_rdunlock_main_loop();
> 
> Why the two-line addition here...
> 
> >  }
> >  
> >  bdrv_unref(s->bottom_bs);

I don't remember if there were more reasons without having a closer look
at the code, but just here from the context, bdrv_unref() is supposed to
be called without the graph lock held. We don't enforce this yet because
there are some cases that are not easy to fix, but I don't want to make
adding the GRAPH_UNLOCKED annotation to bdrv_unref() harder than it
needs to be.

> > @@ -263,12 +271,15 @@ static BlockDriver bdrv_copy_on_read = {
> >  };
> >  
> >  
> > -void bdrv_cor_filter_drop(BlockDriverState *cor_filter_bs)
> > +void no_coroutine_fn bdrv_cor_filter_drop(BlockDriverState *cor_filter_bs)
> >  {
> >  BDRVStateCOR *s = cor_filter_bs->opaque;
> >  
> > +GLOBAL_STATE_CODE();
> > +
> >  /* unfreeze, as otherwise bdrv_replace_node() will fail */
> >  if (s->chain_frozen) {
> > +GRAPH_RDLOCK_GUARD_MAINLOOP();
> >  s->chain_frozen = false;
> >  bdrv_unfreeze_backing_chain(cor_filter_bs, s->bottom_bs);
> >  }
> 
> ...vs. the magic one-line per-scope change here?  Both work, so I
> don't see any problems, but it does seem odd to mix styles in the same
> patch.  (I can see other places where you have intentionally picked
> the version that required the least reindenting; adding a scope just
> to use GRAPH_RDLOCK_GUARD_MAINLOOP() without having to carefully pair
> an unlock on every early exit path is fewer lines of code overall, but
> more lines of churn in the patch itself.)

Generally I used the scoped locks only when I was quite sure that
nothing in the function will require dropping the lock. The safe default
is individually locking smaller sections.

With GRAPH_RDLOCK_GUARD_MAINLOOP(), the whole situation is a bit fuzzy
because it doesn't actually do anything apart from asserting that we
don't need real locking. So taking it while calling functions that want
to be called unlocked still works in practice, but it's a bit unclean,
and it would conflict with an actual GRAPH_UNLOCKED annotation.
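
The two styles side by side (function bodies are illustrative only):

    /* Scoped: fine when nothing in the function needs to drop the lock */
    static void whole_function(BlockDriverState *bs)
    {
        GRAPH_RDLOCK_GUARD_MAINLOOP();  /* held until the end of the scope */
        /* ... graph accesses only ... */
    }

    /* Sectioned: the safe default; calls that want to run unlocked,
     * such as bdrv_unref(), stay outside the critical section */
    static void small_sections(BlockDriverState *bs)
    {
        bdrv_graph_rdlock_main_loop();
        /* ... graph accesses ... */
        bdrv_graph_rdunlock_main_loop();

        bdrv_unref(bs);  /* runs without the graph lock held */
    }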

Kevin




Re: [PATCH 04/24] block: Mark bdrv_root_attach_child() GRAPH_WRLOCK

2023-11-03 Thread Kevin Wolf
Am 27.10.2023 um 22:22 hat Eric Blake geschrieben:
> On Fri, Oct 27, 2023 at 05:53:13PM +0200, Kevin Wolf wrote:
> > Instead of taking the writer lock internally, require callers to already
> > hold it when calling bdrv_root_attach_child(). These callers will
> > typically already hold the graph lock once the locking work is
> > completed, which means that they can't call functions that take it
> > internally.
> > 
> > Signed-off-by: Kevin Wolf 
> > ---
> >  include/block/block_int-global-state.h | 13 +++--
> >  block.c|  5 +
> >  block/block-backend.c  |  2 ++
> >  blockjob.c |  2 ++
> >  4 files changed, 12 insertions(+), 10 deletions(-)
> > 
> > +++ b/block.c
> > @@ -3214,8 +3214,6 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState *child_bs,
> >  
> >  GLOBAL_STATE_CODE();
> >  
> > -bdrv_graph_wrlock(child_bs);
> > -
> >  child = bdrv_attach_child_common(child_bs, child_name, child_class,
> 
> Do we need some sort of assertion that the caller did indeed grab the
> lock at this point?  Or is that redundant with assertions made in all
> helper functions we are calling where the lock already matters?

The GRAPH_WRLOCK in the header file makes it a compiler error with TSA
enabled if the caller doesn't already hold the lock. And our CI has some
builds with TSA, so even if the maintainer only uses gcc, we'll catch
it.
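
Roughly, the mechanism looks like this (simplified from QEMU's
include/block/graph-lock.h; the parameter list is abbreviated):

    /* GRAPH_WRLOCK expands to clang's Thread Safety Analysis attribute
     * that requires the (analysis-only) graph_lock capability */
    #define GRAPH_WRLOCK TSA_REQUIRES(graph_lock)

    BdrvChild * GRAPH_WRLOCK
    bdrv_root_attach_child(BlockDriverState *child_bs, /* ... */
                           Error **errp);

A caller that has not taken the write lock then fails to compile with
-Wthread-safety instead of tripping an assertion at runtime.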

Kevin




Re: [PATCH v3 0/9] mirror: allow switching from background to active mode

2023-11-03 Thread Markus Armbruster
Fiona Ebner  writes:

> Am 18.10.23 um 11:41 schrieb Markus Armbruster:
>> Fiona Ebner  writes:
>>>
>>> Initially, I tried to go for a more general 'job-change' command, but
>>> to avoid mutual inclusion of block-core.json and job.json, more
>>> preparation would be required.
>> 
>> Can you elaborate a bit?  A more generic command is preferable, and we
>> need to understand what it would take.
>> 
>
> Yes, I did so during the discussion of v2:
>
> https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg02993.html

Replied there.  Sorry for the delay!




Re: [PATCH v2 00/10] mirror: allow switching from background to active mode

2023-11-03 Thread Markus Armbruster
Vladimir Sementsov-Ogievskiy  writes:

> On 11.10.23 13:18, Fiona Ebner wrote:
>> Am 10.10.23 um 19:55 schrieb Vladimir Sementsov-Ogievskiy:
>>> On 09.10.23 12:46, Fiona Ebner wrote:

 Initially, I tried to go for a more general 'job-change' command, but
 I couldn't figure out a way to avoid mutual inclusion between
 block-core.json and job.json.

>>>
>>> What is the problem with it? I still think that job-change would be better.
>>>
>> If going for job-change in job.json, the dependencies would be
>> job-change -> JobChangeOptions -> JobChangeOptionsMirror -> MirrorCopyMode
>> query-jobs -> JobInfo -> JobInfoMirror
>> and we can't include block-core.json in job.json, because an inclusion
>> loop gives a build error.

Let me try to understand this.

Command job-change needs its argument type JobChangeOptions.

JobChangeOptions is a union, and JobChangeOptionsMirror is one of its
branches.

JobChangeOptionsMirror needs MirrorCopyMode from block-core.json.

block-core.json needs job.json for JobType and JobStatus.

>> Could be made to work by moving MirrorCopyMode (and
>> JobChangeOptionsMirror, JobInfoMirror) to job.json or some place that
>> can be included by both job.json and block-core.json. Moving the
>> type-specific definitions to the general job.json didn't feel right to
>> me. Including another file with type-specific definitions in job.json
>> feels slightly less wrong, but still not quite right and I didn't want
>> to create a new file just for MirrorCopyMode (and
>> JobChangeOptionsMirror, JobInfoMirror).
>> And going further and moving all mirror-related things to a separate
>> file would require moving along things like NewImageMode with it or
>> create yet another file for such general things used by multiple block-jobs.
>> If preferred, I can try and go with some version of the above.
>> 
>
> OK, I see the problem. It seems that all this requires some good 
> refactoring. But that's preexisting big work, and should not hold up 
> your series. I'm OK to proceed with block-job-change.

Saving ourselves some internal refactoring is a poor excuse for
undesirable external interfaces.  We need to answer two questions before
we do that:

1. How much work would the refactoring be?

2. Is the interface improvement this enables worth the work?

Let's start with 1.

An obvious solution is to split JobType and JobStatus off job.json to
break the dependency of block-core.json on job.json.

But I'd like us to investigate another one.  block-core.json is *huge*.
It's almost a quarter of the entire QAPI schema.  Can we spin out block
jobs into block-job.json?  Moves the dependency on job.json from
block-core.json to block-job.json.

Thoughts?




Re: [PATCH v5 2/4] qcow2: add configurations for zoned format extension

2023-11-03 Thread Markus Armbruster
Eric Blake  writes:

> On Mon, Oct 30, 2023 at 08:18:45PM +0800, Sam Li wrote:
>> Configuring the zoned format feature on the qcow2 driver requires
>> the following settings: the device size, zone model, zone size,
>> zone capacity, number of conventional zones, and limits on zone
>> resources (max append bytes, max open zones, and max active zones).
>> 
>> To create a qcow2 file with the zoned format, use a command like this:
>> $ qemu-img create -f qcow2 test.qcow2 -o size=768M -o
>> zone_size=64M -o zone_capacity=64M -o conventional_zones=0 -o
>> max_append_bytes=4096 -o max_open_zones=0 -o max_active_zones=0
>> -o zone_model=host-managed
>> 
>> Signed-off-by: Sam Li 
>> 
>> fix config?
>
> Is this comment supposed to be part of the commit message?  If not,...
>
>> ---
>
> ...place it here under the divider, so 'git am' won't include it, if there is 
> nothing further to change on this patch.

[...]

>> +++ b/qapi/block-core.json
>> @@ -4981,6 +4981,21 @@
>>  { 'enum': 'Qcow2CompressionType',
>>'data': [ 'zlib', { 'name': 'zstd', 'if': 'CONFIG_ZSTD' } ] }
>>  
>> +##
>> +# @Qcow2ZoneModel:
>> +#
>> +# Zoned device model used in qcow2 image file
>> +#
>> +# @non-zoned: non-zoned model is for regular block devices
>> +#
>> +# @host-managed: host-managed model only allows sequential write over the
>> +# device zones
>> +#
>> +# Since 8.2
>> +##
>> +{ 'enum': 'Qcow2ZoneModel',
>> +  'data': ['non-zoned', 'host-managed'] }
>> +
>>  ##
>>  # @BlockdevCreateOptionsQcow2:
>>  #
>> @@ -5023,6 +5038,27 @@
>>  # @compression-type: The image cluster compression method
>>  # (default: zlib, since 5.1)
>>  #
>> +# @zone-model: @Qcow2ZoneModel.  The zone device model.
>> +# (default: non-zoned, since 8.2)
>> +#
>> +# @zone-size: Total number of bytes within zones (since 8.2)
>
> If @zone-model is "non-zoned", does it make sense to even allow
> @zone-size and friends?  Should this use a QMP union, where you can
> pass in the remaining zone-* fields only when zone-model is set to
> host-managed?

Valid question; needs an answer.

>> +#
>> +# @zone-capacity: The number of usable logical blocks within zones
>> +# in bytes.  A zone capacity is always smaller or equal to the
>> +# zone size (since 8.2)
>> +#
>> +# @conventional-zones: The number of conventional zones of the
>> +# zoned device (since 8.2)
>> +#
>> +# @max-open-zones: The maximal number of open zones (since 8.2)
>> +#
>> +# @max-active-zones: The maximal number of zones in the implicit
>> +# open, explicit open or closed state (since 8.2)
>> +#
>> +# @max-append-bytes: The maximal number of bytes of a zone
>> +# append request that can be issued to the device.  It must be
>> +# 512-byte aligned (since 8.2)
>> +#
>>  # Since: 2.12
>>  ##
>>  { 'struct': 'BlockdevCreateOptionsQcow2',
>> @@ -5039,7 +5075,14 @@
>>  '*preallocation':   'PreallocMode',
>>  '*lazy-refcounts':  'bool',
>>  '*refcount-bits':   'int',
>> -'*compression-type':'Qcow2CompressionType' } }
>> +'*compression-type':'Qcow2CompressionType',
>> +'*zone-model': 'Qcow2ZoneModel',
>> +'*zone-size':  'size',
>> +'*zone-capacity':  'size',
>> +'*conventional-zones': 'uint32',
>> +'*max-open-zones': 'uint32',
>> +'*max-active-zones':   'uint32',
>> +'*max-append-bytes':   'uint32' } }
>
> In other words, I'm envisioning something like an optional
> '*zone':'ZoneStruct', where:
>
> { 'struct': 'ZoneHostManaged',
>   'data': { 'size': 'size', '*capacity': 'size', ..., '*max-append-bytes': 'uint32' } }
> { 'union': 'ZoneStruct',
>   'base': { 'model': 'Qcow2ZoneModel' },
>   'discriminator': 'model',
>   'data': { 'non-zoned': {},
> 'host-managed': 'ZoneHostManaged' } }
>
> then over the wire, QMP can use the existing:
> { ..., "compression-type":"zstd" }
>
> as a synonym for the new but explicit non-zoned:
> { ..., "compression-type":"zstd", "zone":{"mode":"non-zoned"} }

I.e. @zone is optional, and defaults to {"mode": "non-zoned"}.

> and when we want to use zones, we pass:
> { ..., "compression-type":"zstd", "zone":{"mode":"host-managed", 
> "size":16777216} }
>
> where you don't have to have zone- prefixing everywhere because it is
> instead contained in the smart union object where it is obvious from
> the 'mode' field what other fields should be present.