Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-28 Thread Damien Le Moal
On 2020/09/29 6:25, Keith Busch wrote:
> On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote:
>> On Sep 28 02:33, Dmitry Fomichev wrote:
>>> You are making it sound like the entire WDC series relies on this approach.
>>> Actually, the persistency is introduced in the second to last patch in the
>>> series and it only adds a couple of lines of code in the i/o path to mark
>>> zones dirty. This is possible because of using mmap() and I find the way
>>> it is done to be quite elegant, not ugly :)
>>>
>>
>> No, I understand that your implementation works fine without
>> persistence, but persistence is key. That is why my series adds it in
>> the first patch. Without persistence it is just a toy. And the QEMU
>> device is not just an "NVMe-version" of null_blk.
> 
> I really think we should be a bit more cautious about committing to an
> on-disk format for the persistent state. Both this and Klaus' persistent
> state feel a bit ad-hoc, and with all the other knobs provided, it
> looks too easy to have out-of-sync states, or just not being able to
> boot at all if qemu versions have different on-disk formats.
> 
> Is anyone really considering zone emulation for production-level stuff
> anyway? I can't imagine a real scenario where you'd want to put yourself
> through that: you are just giving yourself all the downsides of a zoned
> block device and none of the benefits. AFAIK, this is provided as a
> development vehicle, closer to a "toy".
> 
> I think we should consider trimming this down to a more minimal set that
> we *do* agree on and commit for inclusion ASAP. We can iterate on all the
> bells & whistles and flesh out the metadata marshalling scheme
> for persistence later.

+1 on this. Removing the persistence also removes the debate on endianness. With
that out of the way, it should be straightforward to get agreement on a series
that can be merged quickly to get developers started with testing ZNS software
with QEMU. That is the most important goal here. 5.9 is around the corner, we
need something for people to get started with ZNS quickly.


-- 
Damien Le Moal
Western Digital Research



Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-28 Thread Keith Busch
On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote:
> On Sep 28 02:33, Dmitry Fomichev wrote:
> > You are making it sound like the entire WDC series relies on this approach.
> > Actually, the persistency is introduced in the second to last patch in the
> > series and it only adds a couple of lines of code in the i/o path to mark
> > zones dirty. This is possible because of using mmap() and I find the way
> > it is done to be quite elegant, not ugly :)
> > 
> 
> No, I understand that your implementation works fine without
> persistence, but persistence is key. That is why my series adds it in
> the first patch. Without persistence it is just a toy. And the QEMU
> device is not just an "NVMe-version" of null_blk.

I really think we should be a bit more cautious about committing to an
on-disk format for the persistent state. Both this and Klaus' persistent
state feel a bit ad-hoc, and with all the other knobs provided, it
looks too easy to have out-of-sync states, or just not being able to
boot at all if qemu versions have different on-disk formats.

Is anyone really considering zone emulation for production-level stuff
anyway? I can't imagine a real scenario where you'd want to put yourself
through that: you are just giving yourself all the downsides of a zoned
block device and none of the benefits. AFAIK, this is provided as a
development vehicle, closer to a "toy".

I think we should consider trimming this down to a more minimal set that
we *do* agree on and commit for inclusion ASAP. We can iterate on all the
bells & whistles and flesh out the metadata marshalling scheme
for persistence later.



Re: [PATCH v9 0/9] Apply COR-filter to the block-stream permanently

2020-09-28 Thread no-reply
Patchew URL: 
https://patchew.org/QEMU/1601298001-774096-1-git-send-email-andrey.shinkev...@virtuozzo.com/



Hi,

This series failed the docker-quick@centos7 build test. Please find the testing 
commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

C linker for the host machine: cc ld.bfd 2.27-43
Host machine cpu family: x86_64
Host machine cpu: x86_64
../src/meson.build:10: WARNING: Module unstable-keyval has no backwards or 
forwards compatibility and might not exist in future releases.
Program sh found: YES
Program python3 found: YES (/usr/bin/python3)
Configuring ninjatool using configuration
---
Compiling C object libblock.fa.p/block_dmg-bz2.c.o
Compiling C object libblock.fa.p/block_amend.c.o
Compiling C object libblock.fa.p/block_qcow2-snapshot.c.o
../src/block/copy-on-read.c:29:32: fatal error: block/copy-on-read.h: No such 
file or directory
 #include "block/copy-on-read.h"
^
compilation terminated.
---
Compiling C object libblock.fa.p/block_throttle-groups.c.o
Compiling C object libblock.fa.p/block_accounting.c.o
Compiling C object libblock.fa.p/block_nbd.c.o
make: *** [libblock.fa.p/block_copy-on-read.c.o] Error 1
make: *** Waiting for unfinished jobs
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 709, in 
---
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--rm', 
'--label', 'com.qemu.instance.uuid=b2dcb88c84fa430991604513a90c8435', '-u', 
'1003', '--security-opt', 'seccomp=unconfined', '-e', 'TARGET_LIST=', '-e', 
'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 
'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', 
'/home/patchew2/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', 
'/var/tmp/patchew-tester-tmp-8zvp7iz2/src/docker-src.2020-09-28-16.33.58.15775:/var/tmp/qemu:z,ro',
 'qemu/centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit 
status 2.
filter=--filter=label=com.qemu.instance.uuid=b2dcb88c84fa430991604513a90c8435
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-8zvp7iz2/src'
make: *** [docker-run-test-quick@centos7] Error 2

real    3m24.503s
user    0m18.816s


The full log is available at
http://patchew.org/logs/1601298001-774096-1-git-send-email-andrey.shinkev...@virtuozzo.com/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com

Re: [PATCH v9 0/9] Apply COR-filter to the block-stream permanently

2020-09-28 Thread no-reply
Patchew URL: 
https://patchew.org/QEMU/1601298001-774096-1-git-send-email-andrey.shinkev...@virtuozzo.com/



Hi,

This series failed the docker-mingw@fedora build test. Please find the testing 
commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#! /bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-mingw@fedora J=14 NETWORK=1
=== TEST SCRIPT END ===

Host machine cpu: x86_64
Target machine cpu family: x86
Target machine cpu: x86_64
../src/meson.build:10: WARNING: Module unstable-keyval has no backwards or 
forwards compatibility and might not exist in future releases.
Program sh found: YES
Program python3 found: YES (/usr/bin/python3)
Configuring ninjatool using configuration
---
Compiling C object libblock.fa.p/block_bochs.c.obj
Compiling C object libblock.fa.p/block_filter-compress.c.obj
Compiling C object libblock.fa.p/block_qcow2.c.obj
../src/block/copy-on-read.c:29:10: fatal error: block/copy-on-read.h: No such 
file or directory
   29 | #include "block/copy-on-read.h"
  |  ^~
compilation terminated.
make: *** [Makefile.ninja:887: libblock.fa.p/block_copy-on-read.c.obj] Error 1
make: *** Waiting for unfinished jobs
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 709, in 
---
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--rm', 
'--label', 'com.qemu.instance.uuid=026d51a22d8649ff831a9ab8a970', '-u', 
'1003', '--security-opt', 'seccomp=unconfined', '-e', 'TARGET_LIST=', '-e', 
'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 
'SHOW_ENV=', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', 
'/home/patchew2/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', 
'/var/tmp/patchew-tester-tmp-_5yio3yh/src/docker-src.2020-09-28-16.28.55.8469:/var/tmp/qemu:z,ro',
 'qemu/fedora', '/var/tmp/qemu/run', 'test-mingw']' returned non-zero exit 
status 2.
filter=--filter=label=com.qemu.instance.uuid=026d51a22d8649ff831a9ab8a970
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-_5yio3yh/src'
make: *** [docker-run-test-mingw@fedora] Error 2

real    4m33.430s
user    0m18.970s


The full log is available at
http://patchew.org/logs/1601298001-774096-1-git-send-email-andrey.shinkev...@virtuozzo.com/testing.docker-mingw@fedora/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com

[PATCH] qcow2: Use L1E_SIZE in qcow2_write_l1_entry()

2020-09-28 Thread Alberto Garcia
We overlooked these in 02b1ecfa100e7ecc2306560cd27a4a2622bfeb04

Signed-off-by: Alberto Garcia 
---
 block/qcow2-cluster.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 9acc6ce4ae..aa87d3e99b 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -240,14 +240,14 @@ int qcow2_write_l1_entry(BlockDriverState *bs, int 
l1_index)
 }
 
 ret = qcow2_pre_write_overlap_check(bs, QCOW2_OL_ACTIVE_L1,
-s->l1_table_offset + 8 * l1_start_index, bufsize, false);
+s->l1_table_offset + L1E_SIZE * l1_start_index, bufsize, false);
 if (ret < 0) {
 return ret;
 }
 
 BLKDBG_EVENT(bs->file, BLKDBG_L1_UPDATE);
 ret = bdrv_pwrite_sync(bs->file,
-   s->l1_table_offset + 8 * l1_start_index,
+   s->l1_table_offset + L1E_SIZE * l1_start_index,
buf, bufsize);
 if (ret < 0) {
 return ret;
-- 
2.20.1




Re: [PATCH 2/3] nbd: Add new qemu:allocation-depth metacontext

2020-09-28 Thread Richard W.M. Jones
On Mon, Sep 28, 2020 at 09:33:22AM -0500, Eric Blake wrote:
> On 9/26/20 2:33 AM, Richard W.M. Jones wrote:
> >On Fri, Sep 25, 2020 at 03:32:48PM -0500, Eric Blake wrote:
> >>+The second is related to exposing the source of various extents within
> >>+the image, with a single context named:
> >>+
> >>+qemu:allocation-depth
> >>+
> >>+In the allocation depth context, bits 0 and 1 form a tri-state value:
> >>+
> >>+bits 0-1 clear: NBD_STATE_DEPTH_UNALLOC, means the extent is 
> >>unallocated
> >>+bit 0 set: NBD_STATE_DEPTH_LOCAL, the extent is allocated in this image
> >>+bit 1 set: NBD_STATE_DEPTH_BACKING, the extent is inherited from a
> >>+   backing layer
> >
> > From the cover description I imagined it would show the actual depth, ie:
> >
> >  top -> backing -> backing -> backing
> >  depth:   1         2          3         (0 = unallocated)
> >
> >I wonder if that is possible?  (Perhaps there's something I don't
> >understand here.)
> 
> The real reason I don't want to do a straight depth number is that
> 'qemu-img map' combined with x-dirty-bitmap is still a very
> convenient way to get at bits 0 and 1 (even if it requires
> decoding).  But if we plumb in a way for bdrv_get_status to return
> depth counts (rather than reimplementing the depth count ourselves),
> I would have no problem with returning a struct:
> 
> bits 31-4: the depth of the chain
> bits 3-2: reserved (to make reading hex values easier...)
> bits 1-0: tri-state of unalloc, local, or backing
> 
> where it would look like:
> 
> 0x0000 -> unallocated
> 0x0011 -> depth 1, local
> 0x0022 -> depth 2, from the first backing layer
> 0x0032 -> depth 3, from the second backing layer
> 0x0042 ...

This looks nice too.  However I was only bikeshedding so if any of
this is hard to do then don't worry too much.

Would like to add support for this map to nbdinfo too :-)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v




Re: [PATCH v6 01/15] block: simplify comment to BDRV_REQ_SERIALISING

2020-09-28 Thread Alberto Garcia
On Fri 18 Sep 2020 08:19:37 PM CEST, Vladimir Sementsov-Ogievskiy wrote:
> 1. BDRV_REQ_NO_SERIALISING doesn't exist already, don't mention it.
>
> 2. We are going to add one more user of BDRV_REQ_SERIALISING, so
>comment about backup becomes a bit confusing here. The use case in
>backup is documented in block/backup.c, so let's just drop
>duplication here.
>
> 3. The fact that BDRV_REQ_SERIALISING is only for write requests is
>omitted. Add a note.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> Reviewed-by: Stefan Hajnoczi 

Reviewed-by: Alberto Garcia 

Berto



Re: [PATCH v2 00/31] block/export: Add infrastructure and QAPI for block exports

2020-09-28 Thread Kevin Wolf
On 24.09.2020 at 17:26, Kevin Wolf wrote:
> We are planning to add more block export types than just NBD in the near
> future (e.g. vhost-user-blk and FUSE). This series lays the ground for
> this with some generic block export infrastructure and QAPI interfaces
> that will allow managing all of them (for now add/remove/query).
> 
> As a side effect, qemu-storage-daemon can now map --export directly to
> the block-export-add QMP command, similar to other command line options.
> The built-in NBD servers also gains new options that bring it at least a
> little closer to feature parity with qemu-nbd.

Thanks for the review, applied to the block branch.

Kevin




Re: [PATCH 3/3] nbd: Add 'qemu-nbd -A' to expose allocation depth

2020-09-28 Thread Eric Blake

On 9/26/20 8:32 AM, Vladimir Sementsov-Ogievskiy wrote:

25.09.2020 23:32, Eric Blake wrote:

Allow the server to expose an additional metacontext to be requested
by savvy clients.  qemu-nbd adds a new option -A to expose the
qemu:allocation-depth metacontext through NBD_CMD_BLOCK_STATUS; this
can also be set via QMP when using nbd-server-add.

qemu as client can be hacked into viewing this new context by using
the now-misnamed x-dirty-bitmap option when creating an NBD blockdev;


maybe rename it to x-block-status?


I thought about it.  We don't promise any back-compat with x- prefix, 
but at the same time, if there's not a strong reason to change it, 
scripts written against older qemu will not break as long as we still 
have the hack.





although it is worth noting the decoding of how such context
information will appear in 'qemu-img map --output=json':

NBD_STATE_DEPTH_UNALLOC => "zero":false, "data":true
NBD_STATE_DEPTH_LOCAL => "zero":false, "data":false
NBD_STATE_DEPTH_BACKING => "zero":true, "data":true


It wouldn't be so simple if we decide to export the exact depth number...


It's still simple to do tri-state results if we separate the exact depth 
number to bits 31-4, leaving bits 1-0 as the tri-state summary...


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 2/3] nbd: Add new qemu:allocation-depth metacontext

2020-09-28 Thread Eric Blake

On 9/26/20 2:33 AM, Richard W.M. Jones wrote:

On Fri, Sep 25, 2020 at 03:32:48PM -0500, Eric Blake wrote:

+The second is related to exposing the source of various extents within
+the image, with a single context named:
+
+qemu:allocation-depth
+
+In the allocation depth context, bits 0 and 1 form a tri-state value:
+
+bits 0-1 clear: NBD_STATE_DEPTH_UNALLOC, means the extent is unallocated
+bit 0 set: NBD_STATE_DEPTH_LOCAL, the extent is allocated in this image
+bit 1 set: NBD_STATE_DEPTH_BACKING, the extent is inherited from a
+   backing layer



From the cover description I imagined it would show the actual depth, ie:


  top -> backing -> backing -> backing
  depth:   1         2          3         (0 = unallocated)

I wonder if that is possible?  (Perhaps there's something I don't
understand here.)


The real reason I don't want to do a straight depth number is that 
'qemu-img map' combined with x-dirty-bitmap is still a very convenient 
way to get at bits 0 and 1 (even if it requires decoding).  But if we 
plumb in a way for bdrv_get_status to return depth counts (rather than 
reimplementing the depth count ourselves), I would have no problem with 
returning a struct:


bits 31-4: the depth of the chain
bits 3-2: reserved (to make reading hex values easier...)
bits 1-0: tri-state of unalloc, local, or backing

where it would look like:

0x0000 -> unallocated
0x0011 -> depth 1, local
0x0022 -> depth 2, from the first backing layer
0x0032 -> depth 3, from the second backing layer
0x0042 ...
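
(A minimal C sketch, for illustration only, of how a client could decode a
value with the layout proposed above -- bits 31-4 as depth, bits 1-0 as the
tri-state; the function name is made up.)

#include <stdint.h>
#include <stdio.h>

static void decode_allocation_depth(uint32_t val)
{
    unsigned depth = val >> 4;      /* bits 31-4: depth of the chain */
    unsigned state = val & 0x3;     /* bits 1-0: unalloc/local/backing */

    if (state == 0) {
        printf("0x%04x -> unallocated\n", val);
    } else if (state & 0x1) {
        printf("0x%04x -> depth %u, local\n", val, depth);
    } else {
        printf("0x%04x -> depth %u, backing\n", val, depth);
    }
}

int main(void)
{
    decode_allocation_depth(0x0000);    /* unallocated */
    decode_allocation_depth(0x0011);    /* depth 1, local */
    decode_allocation_depth(0x0022);    /* depth 2, first backing layer */
    return 0;
}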

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v7 06/13] qmp: Call monitor_set_cur() only in qmp_dispatch()

2020-09-28 Thread Kevin Wolf
On 28.09.2020 at 13:42, Markus Armbruster wrote:
> Kevin Wolf  writes:
> 
> > On 14.09.2020 at 17:10, Markus Armbruster wrote:
> >> Kevin Wolf  writes:
> >> 
> >> > The correct way to set the current monitor for a coroutine handler will
> >> > be different than for a blocking handler, so monitor_set_cur() needs to
> >> > be called in qmp_dispatch().
> >> >
> >> > Signed-off-by: Kevin Wolf 
> >> > ---
> >> >  include/qapi/qmp/dispatch.h | 3 ++-
> >> >  monitor/qmp.c   | 8 +---
> >> >  qapi/qmp-dispatch.c | 8 +++-
> >> >  qga/main.c  | 2 +-
> >> >  stubs/monitor-core.c| 5 +
> >> >  tests/test-qmp-cmds.c   | 6 +++---
> >> >  6 files changed, 19 insertions(+), 13 deletions(-)
> >> >
> >> > diff --git a/include/qapi/qmp/dispatch.h b/include/qapi/qmp/dispatch.h
> >> > index 5a9cf82472..0c2f467028 100644
> >> > --- a/include/qapi/qmp/dispatch.h
> >> > +++ b/include/qapi/qmp/dispatch.h
> >> > @@ -14,6 +14,7 @@
> >> >  #ifndef QAPI_QMP_DISPATCH_H
> >> >  #define QAPI_QMP_DISPATCH_H
> >> >  
> >> > +#include "monitor/monitor.h"
> >> >  #include "qemu/queue.h"
> >> >  
> >> >  typedef void (QmpCommandFunc)(QDict *, QObject **, Error **);
> >> > @@ -49,7 +50,7 @@ const char *qmp_command_name(const QmpCommand *cmd);
> >> >  bool qmp_has_success_response(const QmpCommand *cmd);
> >> >  QDict *qmp_error_response(Error *err);
> >> >  QDict *qmp_dispatch(const QmpCommandList *cmds, QObject *request,
> >> > -bool allow_oob);
> >> > +bool allow_oob, Monitor *cur_mon);
> >> >  bool qmp_is_oob(const QDict *dict);
> >> >  
> >> >  typedef void (*qmp_cmd_callback_fn)(const QmpCommand *cmd, void 
> >> > *opaque);
> >> > diff --git a/monitor/qmp.c b/monitor/qmp.c
> >> > index 8469970c69..922fdb5541 100644
> >> > --- a/monitor/qmp.c
> >> > +++ b/monitor/qmp.c
> >> > @@ -135,16 +135,10 @@ static void monitor_qmp_respond(MonitorQMP *mon, 
> >> > QDict *rsp)
> >> >  
> >> >  static void monitor_qmp_dispatch(MonitorQMP *mon, QObject *req)
> >> >  {
> >> > -Monitor *old_mon;
> >> >  QDict *rsp;
> >> >  QDict *error;
> >> >  
> >> > -old_mon = monitor_set_cur(&mon->common);
> >> > -assert(old_mon == NULL);
> >> > -
> >> > -rsp = qmp_dispatch(mon->commands, req, qmp_oob_enabled(mon));
> >> > -
> >> > -monitor_set_cur(NULL);
> >> > +rsp = qmp_dispatch(mon->commands, req, qmp_oob_enabled(mon), 
> >> > &mon->common);
> >> 
> >> Long line.  Happy to wrap it in my tree.  A few more in PATCH 08-11.
> >
> > It's 79 characters. Should be fine even with your local deviation from
> > the coding style to require less than that for comments?
> 
> Let me rephrase my remark.
> 
> For me,
> 
> rsp = qmp_dispatch(mon->commands, req, qmp_oob_enabled(mon),
>                    &mon->common);
> 
> is significantly easier to read than
> 
> rsp = qmp_dispatch(mon->commands, req, qmp_oob_enabled(mon), 
> &mon->common);

I guess this is highly subjective. I find wrapped lines harder to read.
For answering subjective questions like this, we generally use the
coding style document.

Anyway, I guess following an idiosyncratic coding style that is
different from every other subsystem in QEMU is possible (if
inconvenient) if I know what it is.

My problem is more that I don't know what the exact rules are. Can they
only be figured out experimentally by submitting patches and seeing
whether you like them or not?

> Would you mind me wrapping this line in my tree?

I have no say in this subsystem and I take it that you want all code to
look as if you had written it yourself, so do as you wish.

But I understand that I'll have to respin anyway, so if you could
explain what you're after, I might be able to apply the rules for the
next version of the series.

Kevin




Re: [PATCH 1/3] nbd: Simplify meta-context parsing

2020-09-28 Thread Eric Blake

On 9/26/20 7:58 AM, Vladimir Sementsov-Ogievskiy wrote:

25.09.2020 23:32, Eric Blake wrote:

We had a premature optimization of trying to read as little from the
wire as possible while handling NBD_OPT_SET_META_CONTEXT in phases.
But in reality, we HAVE to read the entire string from the client
before we can get to the next command, and it is easier to just read
it all at once than it is to read it in pieces.  And once we do that,
several functions end up no longer performing I/O, and no longer need
to return a value.

While simplifying this, take advantage of g_str_has_prefix for less
repetition of boilerplate string length computation.

Our iotests still pass; I also checked that libnbd's testsuite (which
covers more corner cases of odd meta context requests) still passes.

Signed-off-by: Eric Blake 
---




-/* Read strlen(@pattern) bytes, and set @match to true if they match 
@pattern.

- * @match is never set to false.
- *
- * Return -errno on I/O error, 0 if option was completely handled by
- * sending a reply about inconsistent lengths, or 1 on success.
- *
- * Note: return code = 1 doesn't mean that we've read exactly @pattern.
- * It only means that there are no errors.
+
+/*
+ * Check @ns with @len bytes, and set @match to true if it matches 
@pattern,
+ * or if @len is 0 and the client is performing _LIST_. @match is 
never set

+ * to false.
   */
-static int nbd_meta_pattern(NBDClient *client, const char *pattern, 
bool *match,

-    Error **errp)
+static void nbd_meta_empty_or_pattern(NBDClient *client, const char 
*pattern,

+  const char *ns, uint32_t len,


ns changed its meaning; it's not just a namespace but the whole query.
I think it's better to rename it.


Sure, that makes sense.



Also, it's unusual to pass a length together with a NUL-terminated string;
it seems redundant.
And it's only used to compare with zero; strlen(ns) == 0 or ns[0] == 0
is not slower.


Good point.



Also, errp is an unused argument. And it violates the Error API
recommendation not to create void functions with errp.


Another simplification.  Looks like I'll be spinning v2.



Also, we could use a bool return instead of returning through a pointer.


That changes the callers, but it's not necessarily a bad thing; whereas 
we were doing:


if (cond) {
nbd_meta_pattern(..., &match, ...);
}

which either leaves match unchanged or sets it to true, we could instead do:

if (nbd_meta_pattern(...)) {
match = true;
}

My only worry is that the more changes I make, the harder the patch is 
to read.  I don't know if it's worth breaking it into steps, or just 
doing all the simplifications in one blow.
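
(Sketch only, to make the bool-return idea above concrete; the helper name
and the standalone main() are hypothetical, not QEMU code.)

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical bool-returning variant of the pattern check. */
static bool meta_pattern_matches(const char *pattern, const char *query)
{
    return strcmp(pattern, query) == 0;
}

int main(void)
{
    bool match = false;

    /* Caller style discussed above: only ever set the flag to true. */
    if (meta_pattern_matches("allocation", "allocation")) {
        match = true;
    }
    printf("match = %d\n", match);
    return 0;
}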



+static void nbd_meta_base_query(NBDClient *client, 
NBDExportMetaContexts *meta,
+    const char *ns, uint32_t len, Error 
**errp)

  {
-    return nbd_meta_empty_or_pattern(client, "allocation", len,
- &meta->base_allocation, errp);
+    nbd_meta_empty_or_pattern(client, "allocation", ns + 5, len - 5,


This one is correct, but it would be simpler if only the subquery (after
the colon) were passed here.

(Really, no reason to pass 'base:' to _base_ func)

Hmm, I see that it gives a possibility to use 
meta->exp->export_bitmap_context directly.


Yeah, that's where it got interesting.  A direct strcmp() is nice, but 
if we strip a prefix in one place, we have to strip it in the other.  I 
could go either way.




+ * @len is the length of @ns, including the `qemu:' prefix.
+ */
+static void nbd_meta_qemu_query(NBDClient *client, 
NBDExportMetaContexts *meta,
+    const char *ns, uint32_t len, Error 
**errp)

  {
-    bool dirty_bitmap = false;
-    size_t dirty_bitmap_len = strlen("dirty-bitmap:");
-    int ret;
-
  if (!meta->exp->export_bitmap) {
  trace_nbd_negotiate_meta_query_skip("no dirty-bitmap 
exported");

-    return nbd_opt_skip(client, len, errp);
-    }
-
-    if (len == 0) {
+    } else if (len == 0) {


s/0/5/ ?


Oh, good catch.  Using 0 is an unintended logic change, but none of our 
iotests caught it, so we're also missing test coverage.


I'm working on adding 'nbd_opt_list_meta_context()' to libnbd, which 
will add testsuite coverage of this.




+    } else if (!g_str_has_prefix(ns + 5, "dirty-bitmap:")) {
  trace_nbd_negotiate_meta_query_skip("not dirty-bitmap:");
-    return nbd_opt_skip(client, len, errp);
+    } else {
+    trace_nbd_negotiate_meta_query_parse("dirty-bitmap:");
+    nbd_meta_empty_or_pattern(client, 
meta->exp->export_bitmap_context,

+  ns, len, &meta->bitmap, errp);


hmm. _empty_ is impossible here, as ns includes "qemu:dirty-bitmap"..


Ultimately, we want something like:

if starts with "base:":
  if nothing else:
if list mode:
  return "base:allocation"
else:
  return empty list
  else if "base:allocation":
return "base:allocation"
  else
return empty list
else if 

[PATCH v9 9/9] block: apply COR-filter to block-stream jobs

2020-09-28 Thread Andrey Shinkevich via
This patch completes the series with the COR-filter insertion for
block-stream operations. Adding the filter makes it possible for copied
regions to be discarded in backing files during the block-stream job,
which will reduce disk overuse.
The COR-filter insertion incurs changes in the iotest case
245:test_block_stream_4, which reopens the backing chain during a
block-stream job. There are changes in iotest 030 as well.
The iotest case 030:test_stream_parallel was deleted due to multiple
conflicts between the concurrent job operations over the same backing
chain. The base backing node for one job is the top node for another
job. It may change due to the filter node inserted into the backing
chain while both jobs are running. Another issue is that parts of
the backing chain are frozen by the running job and may not be
changed by the concurrent job when needed. The concept of parallel
jobs with common nodes is no longer considered vital.

Signed-off-by: Andrey Shinkevich 
---
 block/stream.c | 93 ++
 tests/qemu-iotests/030 | 51 +++--
 tests/qemu-iotests/030.out |  4 +-
 tests/qemu-iotests/141.out |  2 +-
 tests/qemu-iotests/245 | 19 +++---
 5 files changed, 83 insertions(+), 86 deletions(-)

diff --git a/block/stream.c b/block/stream.c
index fe2663f..240b3dc 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -17,8 +17,10 @@
 #include "block/blockjob_int.h"
 #include "qapi/error.h"
 #include "qapi/qmp/qerror.h"
+#include "qapi/qmp/qdict.h"
 #include "qemu/ratelimit.h"
 #include "sysemu/block-backend.h"
+#include "block/copy-on-read.h"
 
 enum {
 /*
@@ -33,6 +35,8 @@ typedef struct StreamBlockJob {
 BlockJob common;
 BlockDriverState *base_overlay; /* COW overlay (stream from this) */
 BlockDriverState *above_base;   /* Node directly above the base */
+BlockDriverState *cor_filter_bs;
+BlockDriverState *target_bs;
 BlockdevOnError on_error;
 bool bs_read_only;
 bool chain_frozen;
@@ -52,23 +56,20 @@ static void stream_abort(Job *job)
 StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
 
 if (s->chain_frozen) {
-BlockJob *bjob = &s->common;
-bdrv_unfreeze_backing_chain(blk_bs(bjob->blk), s->above_base);
+bdrv_unfreeze_backing_chain(s->cor_filter_bs, s->above_base);
 }
 }
 
 static int stream_prepare(Job *job)
 {
 StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
-BlockJob *bjob = &s->common;
-BlockDriverState *bs = blk_bs(bjob->blk);
-BlockDriverState *unfiltered_bs = bdrv_skip_filters(bs);
+BlockDriverState *unfiltered_bs = bdrv_skip_filters(s->target_bs);
 BlockDriverState *base = bdrv_filter_or_cow_bs(s->above_base);
 BlockDriverState *base_metadata = bdrv_skip_filters(base);
 Error *local_err = NULL;
 int ret = 0;
 
-bdrv_unfreeze_backing_chain(bs, s->above_base);
+bdrv_unfreeze_backing_chain(s->cor_filter_bs, s->above_base);
 s->chain_frozen = false;
 
 if (bdrv_cow_child(unfiltered_bs)) {
@@ -94,13 +95,14 @@ static void stream_clean(Job *job)
 {
 StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
 BlockJob *bjob = &s->common;
-BlockDriverState *bs = blk_bs(bjob->blk);
+
+bdrv_cor_filter_drop(s->cor_filter_bs);
 
 /* Reopen the image back in read-only mode if necessary */
 if (s->bs_read_only) {
 /* Give up write permissions before making it read-only */
 blk_set_perm(bjob->blk, 0, BLK_PERM_ALL, &error_abort);
-bdrv_reopen_set_read_only(bs, true, NULL);
+bdrv_reopen_set_read_only(s->target_bs, true, NULL);
 }
 }
 
@@ -108,9 +110,7 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
 {
 StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
 BlockBackend *blk = s->common.blk;
-BlockDriverState *bs = blk_bs(blk);
-BlockDriverState *unfiltered_bs = bdrv_skip_filters(bs);
-bool enable_cor = !bdrv_cow_child(s->base_overlay);
+BlockDriverState *unfiltered_bs = bdrv_skip_filters(s->target_bs);
 int64_t len;
 int64_t offset = 0;
 uint64_t delay_ns = 0;
@@ -122,21 +122,12 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
 return 0;
 }
 
-len = bdrv_getlength(bs);
+len = bdrv_getlength(s->target_bs);
 if (len < 0) {
 return len;
 }
 job_progress_set_remaining(&s->common.job, len);
 
-/* Turn on copy-on-read for the whole block device so that guest read
- * requests help us make progress.  Only do this when copying the entire
- * backing chain since the copy-on-read operation does not take base into
- * account.
- */
-if (enable_cor) {
-bdrv_enable_copy_on_read(bs);
-}
-
 for ( ; offset < len; offset += n) {
 bool copy;
 int ret;
@@ -195,10 +186,6 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
 }
 }
 
-

[PATCH v9 8/9] block: remove unused backing-file name parameter

2020-09-28 Thread Andrey Shinkevich via
The block-stream QMP parameter backing-file is no longer used. It
designated a backing file name to be set in the QCOW2 image header after
the block-stream job finished. The base file name is used instead.

Signed-off-by: Andrey Shinkevich 
---
 block/monitor/block-hmp-cmds.c |  2 +-
 block/stream.c |  6 +-
 blockdev.c | 17 +
 include/block/block_int.h  |  2 +-
 qapi/block-core.json   | 17 +
 5 files changed, 5 insertions(+), 39 deletions(-)

diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index 4e66775..5f19499 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -506,7 +506,7 @@ void hmp_block_stream(Monitor *mon, const QDict *qdict)
 int64_t speed = qdict_get_try_int(qdict, "speed", 0);
 
 qmp_block_stream(true, device, device, base != NULL, base, false, NULL,
- false, NULL, qdict_haskey(qdict, "speed"), speed, true,
+ qdict_haskey(qdict, "speed"), speed, true,
  BLOCKDEV_ON_ERROR_REPORT, false, NULL, false, false, 
false,
  false, &error);
 
diff --git a/block/stream.c b/block/stream.c
index b0719e9..fe2663f 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -34,7 +34,6 @@ typedef struct StreamBlockJob {
 BlockDriverState *base_overlay; /* COW overlay (stream from this) */
 BlockDriverState *above_base;   /* Node directly above the base */
 BlockdevOnError on_error;
-char *backing_file_str;
 bool bs_read_only;
 bool chain_frozen;
 } StreamBlockJob;
@@ -103,8 +102,6 @@ static void stream_clean(Job *job)
 blk_set_perm(bjob->blk, 0, BLK_PERM_ALL, &error_abort);
 bdrv_reopen_set_read_only(bs, true, NULL);
 }
-
-g_free(s->backing_file_str);
 }
 
 static int coroutine_fn stream_run(Job *job, Error **errp)
@@ -220,7 +217,7 @@ static const BlockJobDriver stream_job_driver = {
 };
 
 void stream_start(const char *job_id, BlockDriverState *bs,
-  BlockDriverState *base, const char *backing_file_str,
+  BlockDriverState *base,
   int creation_flags, int64_t speed,
   BlockdevOnError on_error,
   const char *filter_node_name,
@@ -295,7 +292,6 @@ void stream_start(const char *job_id, BlockDriverState *bs,
 
 s->base_overlay = base_overlay;
 s->above_base = above_base;
-s->backing_file_str = g_strdup(backing_file_str);
 s->bs_read_only = bs_read_only;
 s->chain_frozen = true;
 
diff --git a/blockdev.c b/blockdev.c
index d719c47..b223601 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2486,7 +2486,6 @@ out:
 void qmp_block_stream(bool has_job_id, const char *job_id, const char *device,
   bool has_base, const char *base,
   bool has_base_node, const char *base_node,
-  bool has_backing_file, const char *backing_file,
   bool has_speed, int64_t speed,
   bool has_on_error, BlockdevOnError on_error,
   bool has_filter_node_name, const char *filter_node_name,
@@ -2498,7 +2497,6 @@ void qmp_block_stream(bool has_job_id, const char 
*job_id, const char *device,
 BlockDriverState *base_bs = NULL;
 AioContext *aio_context;
 Error *local_err = NULL;
-const char *base_name = NULL;
 int job_flags = JOB_DEFAULT;
 
 if (!has_on_error) {
@@ -2526,7 +2524,6 @@ void qmp_block_stream(bool has_job_id, const char 
*job_id, const char *device,
 goto out;
 }
 assert(bdrv_get_aio_context(base_bs) == aio_context);
-base_name = base;
 }
 
 if (has_base_node) {
@@ -2541,7 +2538,6 @@ void qmp_block_stream(bool has_job_id, const char 
*job_id, const char *device,
 }
 assert(bdrv_get_aio_context(base_bs) == aio_context);
 bdrv_refresh_filename(base_bs);
-base_name = base_bs->filename;
 }
 
 /* Check for op blockers in the whole chain between bs and base */
@@ -2553,17 +2549,6 @@ void qmp_block_stream(bool has_job_id, const char 
*job_id, const char *device,
 }
 }
 
-/* if we are streaming the entire chain, the result will have no backing
- * file, and specifying one is therefore an error */
-if (base_bs == NULL && has_backing_file) {
-error_setg(errp, "backing file specified, but streaming the "
- "entire chain");
-goto out;
-}
-
-/* backing_file string overrides base bs filename */
-base_name = has_backing_file ? backing_file : base_name;
-
 if (has_auto_finalize && !auto_finalize) {
 job_flags |= JOB_MANUAL_FINALIZE;
 }
@@ -2571,7 +2556,7 @@ void qmp_block_stream(bool has_job_id, const char 
*job_id, const char *device,
 job_flags |= JOB_MANUAL_DISMISS;
 }
 
-stream_start(has_job_id ? job_id : NULL, bs, base_bs, base_name,
+stream_start(has_job_id ? 

[PATCH v9 2/9] copy-on-read: add filter append/drop functions

2020-09-28 Thread Andrey Shinkevich via
Provide an API for COR-filter insertion/removal.
Also, drop the filter child permissions for an inactive state when the
filter node is being removed.

Signed-off-by: Andrey Shinkevich 
---
 block/copy-on-read.c | 84 
 1 file changed, 84 insertions(+)

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index cb03e0f..3c8231f 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -23,11 +23,21 @@
 #include "qemu/osdep.h"
 #include "block/block_int.h"
 #include "qemu/module.h"
+#include "qapi/error.h"
+#include "qapi/qmp/qdict.h"
+#include "block/copy-on-read.h"
+
+
+typedef struct BDRVStateCOR {
+bool active;
+} BDRVStateCOR;
 
 
 static int cor_open(BlockDriverState *bs, QDict *options, int flags,
 Error **errp)
 {
+BDRVStateCOR *state = bs->opaque;
+
 bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds,
BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
false, errp);
@@ -42,6 +52,13 @@ static int cor_open(BlockDriverState *bs, QDict *options, 
int flags,
 ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK) &
 bs->file->bs->supported_zero_flags);
 
+state->active = true;
+
+/*
+ * We don't need to call bdrv_child_refresh_perms() now as the permissions
+ * will be updated later when the filter node gets its parent.
+ */
+
 return 0;
 }
 
@@ -57,6 +74,17 @@ static void cor_child_perm(BlockDriverState *bs, BdrvChild 
*c,
uint64_t perm, uint64_t shared,
uint64_t *nperm, uint64_t *nshared)
 {
+BDRVStateCOR *s = bs->opaque;
+
+if (!s->active) {
+/*
+ * While the filter is being removed
+ */
+*nperm = 0;
+*nshared = BLK_PERM_ALL;
+return;
+}
+
 *nperm = perm & PERM_PASSTHROUGH;
 *nshared = (shared & PERM_PASSTHROUGH) | PERM_UNCHANGED;
 
@@ -135,6 +163,7 @@ static void cor_lock_medium(BlockDriverState *bs, bool 
locked)
 
 static BlockDriver bdrv_copy_on_read = {
 .format_name= "copy-on-read",
+.instance_size  = sizeof(BDRVStateCOR),
 
 .bdrv_open  = cor_open,
 .bdrv_child_perm= cor_child_perm,
@@ -159,4 +188,59 @@ static void bdrv_copy_on_read_init(void)
 bdrv_register(&bdrv_copy_on_read);
 }
 
+
+BlockDriverState *bdrv_cor_filter_append(BlockDriverState *bs,
+ QDict *node_options,
+ int flags, Error **errp)
+{
+BlockDriverState *cor_filter_bs;
+Error *local_err = NULL;
+
+cor_filter_bs = bdrv_open(NULL, NULL, node_options, flags, errp);
+if (cor_filter_bs == NULL) {
+error_prepend(errp, "Could not create COR-filter node: ");
+return NULL;
+}
+
+bdrv_drained_begin(bs);
+bdrv_replace_node(bs, cor_filter_bs, &local_err);
+bdrv_drained_end(bs);
+
+if (local_err) {
+bdrv_unref(cor_filter_bs);
+error_propagate(errp, local_err);
+return NULL;
+}
+
+return cor_filter_bs;
+}
+
+
+void bdrv_cor_filter_drop(BlockDriverState *cor_filter_bs)
+{
+BdrvChild *child;
+BlockDriverState *bs;
+BDRVStateCOR *s = cor_filter_bs->opaque;
+
+child = bdrv_filter_child(cor_filter_bs);
+if (!child) {
+return;
+}
+bs = child->bs;
+
+/* Retain the BDS until we complete the graph change. */
+bdrv_ref(bs);
+/* Hold a guest back from writing while permissions are being reset. */
+bdrv_drained_begin(bs);
+/* Drop permissions before the graph change. */
+s->active = false;
+bdrv_child_refresh_perms(cor_filter_bs, child, &error_abort);
+bdrv_replace_node(cor_filter_bs, bs, &error_abort);
+
+bdrv_drained_end(bs);
+bdrv_unref(bs);
+bdrv_unref(cor_filter_bs);
+}
+
+
 block_init(bdrv_copy_on_read_init);
-- 
1.8.3.1




[PATCH v9 1/9] copy-on-read: Support preadv/pwritev_part functions

2020-09-28 Thread Andrey Shinkevich via
Add support for the recently introduced functions
bdrv_co_preadv_part()
and
bdrv_co_pwritev_part()
to the COR-filter driver.

Signed-off-by: Andrey Shinkevich 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/copy-on-read.c | 28 
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index 2816e61..cb03e0f 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -74,21 +74,25 @@ static int64_t cor_getlength(BlockDriverState *bs)
 }
 
 
-static int coroutine_fn cor_co_preadv(BlockDriverState *bs,
-  uint64_t offset, uint64_t bytes,
-  QEMUIOVector *qiov, int flags)
+static int coroutine_fn cor_co_preadv_part(BlockDriverState *bs,
+   uint64_t offset, uint64_t bytes,
+   QEMUIOVector *qiov,
+   size_t qiov_offset,
+   int flags)
 {
-return bdrv_co_preadv(bs->file, offset, bytes, qiov,
-  flags | BDRV_REQ_COPY_ON_READ);
+return bdrv_co_preadv_part(bs->file, offset, bytes, qiov, qiov_offset,
+   flags | BDRV_REQ_COPY_ON_READ);
 }
 
 
-static int coroutine_fn cor_co_pwritev(BlockDriverState *bs,
-   uint64_t offset, uint64_t bytes,
-   QEMUIOVector *qiov, int flags)
+static int coroutine_fn cor_co_pwritev_part(BlockDriverState *bs,
+uint64_t offset,
+uint64_t bytes,
+QEMUIOVector *qiov,
+size_t qiov_offset, int flags)
 {
-
-return bdrv_co_pwritev(bs->file, offset, bytes, qiov, flags);
+return bdrv_co_pwritev_part(bs->file, offset, bytes, qiov, qiov_offset,
+flags);
 }
 
 
@@ -137,8 +141,8 @@ static BlockDriver bdrv_copy_on_read = {
 
 .bdrv_getlength = cor_getlength,
 
-.bdrv_co_preadv = cor_co_preadv,
-.bdrv_co_pwritev= cor_co_pwritev,
+.bdrv_co_preadv_part= cor_co_preadv_part,
+.bdrv_co_pwritev_part   = cor_co_pwritev_part,
 .bdrv_co_pwrite_zeroes  = cor_co_pwrite_zeroes,
 .bdrv_co_pdiscard   = cor_co_pdiscard,
 .bdrv_co_pwritev_compressed = cor_co_pwritev_compressed,
-- 
1.8.3.1




[PATCH v9 0/9] Apply COR-filter to the block-stream permanently

2020-09-28 Thread Andrey Shinkevich via
Although the patch "freeze link to base node..." has been removed from the
series in the current version 9, the iotest case test_stream_parallel does
not pass after the COR-filter is inserted into the backing chain. As the
test case may not be initialized, it does not make sense and was removed
again.
The check with bdrv_is_allocated_above() takes place in both the COR-filter
and the block-stream job. An optimization of the block-stream job based
on the filter functionality may be made in a separate series.

v9:
  02: Refactored.
  04: Base node name is used instead of the file name.
  05: New implementation based on Max' review.
  06: New.
  07: New. The patch "freeze link to base node..." was deleted.
  08: New.
  09: The filter node options are initialized.

The v8 Message-Id:
<1598633579-221780-1-git-send-email-andrey.shinkev...@virtuozzo.com>

Andrey Shinkevich (9):
  copy-on-read: Support preadv/pwritev_part functions
  copy-on-read: add filter append/drop functions
  qapi: add filter-node-name to block-stream
  copy-on-read: pass base node name to COR driver
  copy-on-read: limit guest COR activity to base in COR driver
  copy-on-read: skip non-guest reads if no copy needed
  stream: skip filters when writing backing file name to QCOW2 header
  block: remove unused parameter backing-file name
  block: apply COR-filter to block-stream jobs

 block/copy-on-read.c   | 165 ++---
 block/io.c |   2 +-
 block/monitor/block-hmp-cmds.c |   6 +-
 block/stream.c | 112 +---
 blockdev.c |  21 +-
 include/block/block_int.h  |   9 ++-
 qapi/block-core.json   |  23 ++
 tests/qemu-iotests/030 |  51 ++---
 tests/qemu-iotests/030.out |   4 +-
 tests/qemu-iotests/141.out |   2 +-
 tests/qemu-iotests/245 |  19 +++--
 11 files changed, 267 insertions(+), 147 deletions(-)

-- 
1.8.3.1




[PATCH v9 7/9] stream: skip filters when writing backing file name to QCOW2 header

2020-09-28 Thread Andrey Shinkevich via
Avoid writing a filter JSON name to the QCOW2 image when the backing file
is changed after the block-stream job.

Signed-off-by: Andrey Shinkevich 
---
 block/stream.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/stream.c b/block/stream.c
index e0540ee..b0719e9 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -65,6 +65,7 @@ static int stream_prepare(Job *job)
 BlockDriverState *bs = blk_bs(bjob->blk);
 BlockDriverState *unfiltered_bs = bdrv_skip_filters(bs);
 BlockDriverState *base = bdrv_filter_or_cow_bs(s->above_base);
+BlockDriverState *base_metadata = bdrv_skip_filters(base);
 Error *local_err = NULL;
 int ret = 0;
 
@@ -73,10 +74,10 @@ static int stream_prepare(Job *job)
 
 if (bdrv_cow_child(unfiltered_bs)) {
 const char *base_id = NULL, *base_fmt = NULL;
-if (base) {
-base_id = s->backing_file_str;
-if (base->drv) {
-base_fmt = base->drv->format_name;
+if (base_metadata) {
+base_id = base_metadata->filename;
+if (base_metadata->drv) {
+base_fmt = base_metadata->drv->format_name;
 }
 }
 bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
-- 
1.8.3.1




[PATCH v9 4/9] copy-on-read: pass base node name to COR driver

2020-09-28 Thread Andrey Shinkevich via
To limit the guest's COR operations by the base node in the backing
chain during the stream job, pass the base node name to the copy-on-read
driver. The rest of the functionality will be implemented in the patch
that follows.

Signed-off-by: Andrey Shinkevich 
---
 block/copy-on-read.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index 3c8231f..e04092f 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -24,19 +24,23 @@
 #include "block/block_int.h"
 #include "qemu/module.h"
 #include "qapi/error.h"
+#include "qapi/qmp/qerror.h"
 #include "qapi/qmp/qdict.h"
 #include "block/copy-on-read.h"
 
 
 typedef struct BDRVStateCOR {
 bool active;
+BlockDriverState *base_bs;
 } BDRVStateCOR;
 
 
 static int cor_open(BlockDriverState *bs, QDict *options, int flags,
 Error **errp)
 {
+BlockDriverState *base_bs = NULL;
 BDRVStateCOR *state = bs->opaque;
+const char *base_node = qdict_get_try_str(options, "base");
 
 bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds,
BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
@@ -52,7 +56,16 @@ static int cor_open(BlockDriverState *bs, QDict *options, 
int flags,
 ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK) &
 bs->file->bs->supported_zero_flags);
 
+if (base_node) {
+qdict_del(options, "base");
+base_bs = bdrv_lookup_bs(NULL, base_node, errp);
+if (!base_bs) {
+error_setg(errp, QERR_BASE_NOT_FOUND, base_node);
+return -EINVAL;
+}
+}
 state->active = true;
+state->base_bs = base_bs;
 
 /*
  * We don't need to call bdrv_child_refresh_perms() now as the permissions
-- 
1.8.3.1




[PATCH v9 5/9] copy-on-read: limit guest COR activity to base in COR driver

2020-09-28 Thread Andrey Shinkevich via
Limit the guest's COR operations by the base node in the backing chain
when the base node name is given. It will be useful for a block stream
job when the COR-filter is applied.

Signed-off-by: Andrey Shinkevich 
---
 block/copy-on-read.c | 38 --
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index e04092f..f53f7e0 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -121,8 +121,42 @@ static int coroutine_fn 
cor_co_preadv_part(BlockDriverState *bs,
size_t qiov_offset,
int flags)
 {
-return bdrv_co_preadv_part(bs->file, offset, bytes, qiov, qiov_offset,
-   flags | BDRV_REQ_COPY_ON_READ);
+int64_t n = 0;
+int64_t size = offset + bytes;
+int local_flags;
+int ret;
+BDRVStateCOR *state = bs->opaque;
+
+if (!state->base_bs) {
+return bdrv_co_preadv_part(bs->file, offset, bytes, qiov, qiov_offset,
+   flags | BDRV_REQ_COPY_ON_READ);
+}
+
+while (offset < size) {
+local_flags = flags;
+
+/* In case of failure, try to copy-on-read anyway */
+ret = bdrv_is_allocated(bs->file->bs, offset, bytes, &n);
+if (!ret) {
+ret = bdrv_is_allocated_above(bdrv_cow_bs(bs->file->bs),
+  state->base_bs, false, offset, n, 
);
+if (ret > 0) {
+local_flags |= BDRV_REQ_COPY_ON_READ;
+}
+}
+
+ret = bdrv_co_preadv_part(bs->file, offset, n, qiov, qiov_offset,
+  local_flags);
+if (ret < 0) {
+return ret;
+}
+
+offset += n;
+qiov_offset += n;
+bytes -= n;
+}
+
+return 0;
 }
 
 
-- 
1.8.3.1




[PATCH v9 3/9] qapi: add filter-node-name to block-stream

2020-09-28 Thread Andrey Shinkevich via
Provide the possibility to pass the 'filter-node-name' parameter to the
block-stream job as it is done for the commit block job.

Signed-off-by: Andrey Shinkevich 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/monitor/block-hmp-cmds.c | 4 ++--
 block/stream.c | 4 +++-
 blockdev.c | 4 +++-
 include/block/block_int.h  | 7 ++-
 qapi/block-core.json   | 6 ++
 5 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index 4d3db5e..4e66775 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -507,8 +507,8 @@ void hmp_block_stream(Monitor *mon, const QDict *qdict)
 
 qmp_block_stream(true, device, device, base != NULL, base, false, NULL,
  false, NULL, qdict_haskey(qdict, "speed"), speed, true,
- BLOCKDEV_ON_ERROR_REPORT, false, false, false, false,
- &error);
+ BLOCKDEV_ON_ERROR_REPORT, false, NULL, false, false, 
false,
+ false, &error);
 
 hmp_handle_error(mon, error);
 }
diff --git a/block/stream.c b/block/stream.c
index 8ce6729..e0540ee 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -221,7 +221,9 @@ static const BlockJobDriver stream_job_driver = {
 void stream_start(const char *job_id, BlockDriverState *bs,
   BlockDriverState *base, const char *backing_file_str,
   int creation_flags, int64_t speed,
-  BlockdevOnError on_error, Error **errp)
+  BlockdevOnError on_error,
+  const char *filter_node_name,
+  Error **errp)
 {
 StreamBlockJob *s;
 BlockDriverState *iter;
diff --git a/blockdev.c b/blockdev.c
index bebd3ba..d719c47 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2489,6 +2489,7 @@ void qmp_block_stream(bool has_job_id, const char 
*job_id, const char *device,
   bool has_backing_file, const char *backing_file,
   bool has_speed, int64_t speed,
   bool has_on_error, BlockdevOnError on_error,
+  bool has_filter_node_name, const char *filter_node_name,
   bool has_auto_finalize, bool auto_finalize,
   bool has_auto_dismiss, bool auto_dismiss,
   Error **errp)
@@ -2571,7 +2572,8 @@ void qmp_block_stream(bool has_job_id, const char 
*job_id, const char *device,
 }
 
 stream_start(has_job_id ? job_id : NULL, bs, base_bs, base_name,
- job_flags, has_speed ? speed : 0, on_error, &local_err);
+ job_flags, has_speed ? speed : 0, on_error,
+ filter_node_name, &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
 goto out;
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 38cad9d..f782737 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -1134,6 +1134,9 @@ int is_windows_drive(const char *filename);
  *  See @BlockJobCreateFlags
  * @speed: The maximum speed, in bytes per second, or 0 for unlimited.
  * @on_error: The action to take upon error.
+ * @filter_node_name: The node name that should be assigned to the filter
+ * driver that the commit job inserts into the graph above @bs. NULL means
+ * that a node name should be autogenerated.
  * @errp: Error object.
  *
  * Start a streaming operation on @bs.  Clusters that are unallocated
@@ -1146,7 +1149,9 @@ int is_windows_drive(const char *filename);
 void stream_start(const char *job_id, BlockDriverState *bs,
   BlockDriverState *base, const char *backing_file_str,
   int creation_flags, int64_t speed,
-  BlockdevOnError on_error, Error **errp);
+  BlockdevOnError on_error,
+  const char *filter_node_name,
+  Error **errp);
 
 /**
  * commit_start:
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 3c16f1e..32fb097 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2533,6 +2533,11 @@
 #'stop' and 'enospc' can only be used if the block device
 #supports io-status (see BlockInfo).  Since 1.3.
 #
+# @filter-node-name: the node name that should be assigned to the
+#filter driver that the stream job inserts into the graph
+#above @device. If this option is not given, a node name is
+#autogenerated. (Since: 5.2)
+#
 # @auto-finalize: When false, this job will wait in a PENDING state after it 
has
 # finished its work, waiting for @block-job-finalize before
 # making any block graph changes.
@@ -2563,6 +2568,7 @@
   'data': { '*job-id': 'str', 'device': 'str', '*base': 'str',
 '*base-node': 'str', '*backing-file': 'str', '*speed': 'int',
 '*on-error': 'BlockdevOnError',
+ 

[PATCH v9 6/9] copy-on-read: skip non-guest reads if no copy needed

2020-09-28 Thread Andrey Shinkevich via
If the flag BDRV_REQ_PREFETCH is set, pass it on to the
COR-driver to skip unneeded reading. It can be taken into account for
COR-algorithm optimizations. At the moment, that check is made during
the block-stream job.

Signed-off-by: Andrey Shinkevich 
---
 block/copy-on-read.c | 14 ++
 block/io.c   |  2 +-
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index f53f7e0..5389dca 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -145,10 +145,16 @@ static int coroutine_fn 
cor_co_preadv_part(BlockDriverState *bs,
 }
 }
 
-ret = bdrv_co_preadv_part(bs->file, offset, n, qiov, qiov_offset,
-  local_flags);
-if (ret < 0) {
-return ret;
+if ((flags & BDRV_REQ_PREFETCH) &
+!(local_flags & BDRV_REQ_COPY_ON_READ)) {
+/* Skip non-guest reads if no copy needed */
+} else {
+
+ret = bdrv_co_preadv_part(bs->file, offset, n, qiov, qiov_offset,
+  local_flags);
+if (ret < 0) {
+return ret;
+}
 }
 
 offset += n;
diff --git a/block/io.c b/block/io.c
index 11df188..62b75a5 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1388,7 +1388,7 @@ static int coroutine_fn 
bdrv_co_do_copy_on_readv(BdrvChild *child,
 qemu_iovec_init_buf(&local_qiov, bounce_buffer, pnum);
 
 ret = bdrv_driver_preadv(bs, cluster_offset, pnum,
- _qiov, 0, 0);
+ _qiov, 0, flags & 
BDRV_REQ_PREFETCH);
 if (ret < 0) {
 goto err;
 }
-- 
1.8.3.1




Re: [PATCH v7 07/13] monitor: Make current monitor a per-coroutine property

2020-09-28 Thread Markus Armbruster
Kevin Wolf  writes:

> On 28.09.2020 at 09:47, Markus Armbruster wrote:
>> Kevin Wolf  writes:
>> 
>> > On 14.09.2020 at 17:11, Markus Armbruster wrote:
>> >> Kevin Wolf  writes:
>> >> 
>> >> > This way, a monitor command handler will still be able to access the
>> >> > current monitor, but when it yields, all other code code will correctly
>> >> > get NULL from monitor_cur().
>> >> >
>> >> > This uses a hash table to map the coroutine pointer to the current
>> >> > monitor of that coroutine.  Outside of coroutine context, we associate
>> >> > the current monitor with the leader coroutine of the current thread.
>> >> 
>> >> In qemu-system-FOO, the hash table can have only these entries:
>> >> 
>> >> * (OOB) One mapping @mon_iothread's thread leader to a QMP monitor, while
>> >>   executing a QMP command out-of-band.
>> >> 
>> >> * (QMP-CO) One mapping @qmp_dispatcher_co (a coroutine in the main
>> >>   thread) to a QMP monitor, while executing a QMP command in-band and in
>> >>   coroutine context.
>> >> 
>> >> * (QMP) One mapping the main thread's leader to a QMP monitor, while
>> >>   executing a QMP command in-band and out of coroutine context, in a
>> >>   bottom half.
>> >> 
>> >> * (HMP) One mapping the main thread's leader to an HMP monitor, while
>> >>   executing an HMP command out of coroutine context.
>> >> 
>> >> * (HMP-CO) One mapping a transient coroutine in the main thread to an
>> >>   HMP monitor, while executing an HMP command in coroutine context.
>> >> 
>> >> In-band execution is one command after the other.
>> >> 
>> >> Therefore, at most one monitor command can be executing in-band at any
>> >> time.
>> >> 
>> >> Therefore, the hash table has at most *two* entries: one (OOB), and one
>> >> of the other four.
>> >> 
>> >> Can you shoot any holes into my argument?
>> >
>> > I think with human-monitor-command, you can have three mappings:
>> >
>> > 1. The main thread's leader (it is a non-coroutine QMP command) to the
>> >QMP monitor
>> 
>> This is (QMP).
>> 
>> > 2. With a coroutine HMP command, one mapping from the transient HMP
>> >coroutine to the transient HMP monitor (with a non-coroutine HMP
>> >command, we'd instead temporarily change the mapping from 1.)
>> 
>> This is (HMP-CO).
>> 
>> > 3. The OOB entry
>> 
>> This is (OOB).
>> 
>> To get 1. (QMP) and 2, (HMP-CO) at the same time, the in-band,
>> non-coroutine QMP command needs to execute interleaved with the in-band,
>> coroutine HMP command.
>> 
>> Such an interleaving contradicts "In-band execution is one command after
>> the other", which is a fundamental assumption in-band commands may make.
>> If the assumption is invalid, we got a problem.  Is it?
>
> Interleaving, or rather executing another command in the middle of its
> implementation is the very purpose of human-monitor-command (which is
> what I was talking about, so "the in-band non-coroutine QMP command" is
> a very specific one).

Got it now, thanks.

> It's the only command I can think of that is exceptional in this way
> and would lead to three mappings.
>
> Kevin
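
(For reference, a minimal sketch of the per-coroutine mapping discussed in
this thread: a GHashTable keyed by qemu_coroutine_self(). Illustrative only,
not the actual patch; locking and initialization are omitted.)

#include "qemu/osdep.h"
#include "qemu/coroutine.h"
#include "monitor/monitor.h"

/* Coroutine* -> Monitor*; created once with g_hash_table_new(NULL, NULL). */
static GHashTable *coroutine_mon;

Monitor *monitor_cur(void)
{
    return g_hash_table_lookup(coroutine_mon, qemu_coroutine_self());
}

Monitor *monitor_set_cur(Coroutine *co, Monitor *mon)
{
    Monitor *old_monitor = g_hash_table_lookup(coroutine_mon, co);

    if (mon) {
        g_hash_table_replace(coroutine_mon, co, mon);
    } else {
        g_hash_table_remove(coroutine_mon, co);
    }
    return old_monitor;
}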




Re: [PATCH v7 06/13] qmp: Call monitor_set_cur() only in qmp_dispatch()

2020-09-28 Thread Markus Armbruster
Kevin Wolf  writes:

> On 14.09.2020 at 17:10, Markus Armbruster wrote:
>> Kevin Wolf  writes:
>> 
>> > The correct way to set the current monitor for a coroutine handler will
>> > be different than for a blocking handler, so monitor_set_cur() needs to
>> > be called in qmp_dispatch().
>> >
>> > Signed-off-by: Kevin Wolf 
>> > ---
>> >  include/qapi/qmp/dispatch.h | 3 ++-
>> >  monitor/qmp.c   | 8 +---
>> >  qapi/qmp-dispatch.c | 8 +++-
>> >  qga/main.c  | 2 +-
>> >  stubs/monitor-core.c| 5 +
>> >  tests/test-qmp-cmds.c   | 6 +++---
>> >  6 files changed, 19 insertions(+), 13 deletions(-)
>> >
>> > diff --git a/include/qapi/qmp/dispatch.h b/include/qapi/qmp/dispatch.h
>> > index 5a9cf82472..0c2f467028 100644
>> > --- a/include/qapi/qmp/dispatch.h
>> > +++ b/include/qapi/qmp/dispatch.h
>> > @@ -14,6 +14,7 @@
>> >  #ifndef QAPI_QMP_DISPATCH_H
>> >  #define QAPI_QMP_DISPATCH_H
>> >  
>> > +#include "monitor/monitor.h"
>> >  #include "qemu/queue.h"
>> >  
>> >  typedef void (QmpCommandFunc)(QDict *, QObject **, Error **);
>> > @@ -49,7 +50,7 @@ const char *qmp_command_name(const QmpCommand *cmd);
>> >  bool qmp_has_success_response(const QmpCommand *cmd);
>> >  QDict *qmp_error_response(Error *err);
>> >  QDict *qmp_dispatch(const QmpCommandList *cmds, QObject *request,
>> > -bool allow_oob);
>> > +bool allow_oob, Monitor *cur_mon);
>> >  bool qmp_is_oob(const QDict *dict);
>> >  
>> >  typedef void (*qmp_cmd_callback_fn)(const QmpCommand *cmd, void *opaque);
>> > diff --git a/monitor/qmp.c b/monitor/qmp.c
>> > index 8469970c69..922fdb5541 100644
>> > --- a/monitor/qmp.c
>> > +++ b/monitor/qmp.c
>> > @@ -135,16 +135,10 @@ static void monitor_qmp_respond(MonitorQMP *mon, 
>> > QDict *rsp)
>> >  
>> >  static void monitor_qmp_dispatch(MonitorQMP *mon, QObject *req)
>> >  {
>> > -Monitor *old_mon;
>> >  QDict *rsp;
>> >  QDict *error;
>> >  
>> > -old_mon = monitor_set_cur(&mon->common);
>> > -assert(old_mon == NULL);
>> > -
>> > -rsp = qmp_dispatch(mon->commands, req, qmp_oob_enabled(mon));
>> > -
>> > -monitor_set_cur(NULL);
>> > +rsp = qmp_dispatch(mon->commands, req, qmp_oob_enabled(mon), &mon->common);
>> 
>> Long line.  Happy to wrap it in my tree.  A few more in PATCH 08-11.
>
> It's 79 characters. Should be fine even with your local deviation from
> the coding style to require less than that for comments?

Let me rephrase my remark.

For me,

rsp = qmp_dispatch(mon->commands, req, qmp_oob_enabled(mon),
                       &mon->common);

is significantly easier to read than

rsp = qmp_dispatch(mon->commands, req, qmp_oob_enabled(mon), &mon->common);

Would you mind me wrapping this line in my tree?

A few more in PATCH 08-11.




Re: [PATCH v6] nvme: allow cmb and pmr emulation on same device

2020-09-28 Thread Klaus Jensen
On Jul 29 15:01, Andrzej Jakowski wrote:
> Resending the series recently posted on the mailing list related to the nvme
> device extension, with a couple of fixes after review.
> 
> This patch series does following:
>  - Fixes problem where CMBS bit was not set in controller capabilities
>register, so support for CMB was not correctly advertised to guest.
>This is resend of patch that has been submitted and reviewed by
>Klaus [1]
>  - Introduces BAR4 sharing between MSI-X vectors and CMB. This allows
>to have CMB and PMR emulated on the same device. This extension
>was indicated by Keith [2]
> 
> v6:
>  - instead of using the memory_region_to_absolute_addr() function, a local
>    helper (nvme_cmb_to_absolute_addr()) has been defined to calculate the
>    absolute address of the CMB in a simplified way. Also, a number of code
>    style changes have been made (function rename, use of a literal instead
>    of a macro definition, etc.)
> 
> v5:
>  - fixed problem introduced in v4 where CMB buffer was represented as
>subregion of BAR4 memory region. In that case CMB address was used
>incorrectly as it was relative to BAR4 and not absolute. Appropriate
>changes were added to v5 to calculate CMB address properly ([6])
> 
> v4:
>  - modified BAR4 initialization, so now it consists of CMB, MSIX and
>PBA memory regions overlapping on top of it. This reduces patch
>complexity significantly (Klaus [5])
> 
> v3:
>  - code style fixes including: removal of spurious line break, moving
>define into define section and code alignment (Klaus [4])
>  - removed unintended code reintroduction (Klaus [4])
> 
> v2:
>  - rebase on Kevin's latest block branch (Klaus [3])
>  - improved comments section (Klaus [3])
>  - simplified calculation of BAR4 size (Klaus [3])
> 
> v1:
>  - initial push of the patch
> 
> [1]: 
> https://lore.kernel.org/qemu-devel/20200408055607.g2ii7gwqbnv6cd3w@apples.localdomain/
> [2]: 
> https://lore.kernel.org/qemu-devel/20200330165518.ga8...@redsun51.ssa.fujisawa.hgst.com/
> [3]: 
> https://lore.kernel.org/qemu-devel/20200605181043.28782-1-andrzej.jakow...@linux.intel.com/
> [4]: 
> https://lore.kernel.org/qemu-devel/20200618092524.posxi5mysb3jjtpn@apples.localdomain/
> [5]: 
> https://lore.kernel.org/qemu-devel/20200626055033.6vxqvi4s5pll7som@apples.localdomain/
> [6]: 
> https://lore.kernel.org/qemu-devel/9143a543-d32d-f3e7-c37b-b3df7f853...@linux.intel.com/
> 

Hi Andrzej,

Any chance you can respin this on git.infradead.org/qemu-nvme.git
nvme-next branch?


Thanks,
Klaus




Re: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set

2020-09-28 Thread Klaus Jensen
On Sep 28 11:35, Dmitry Fomichev wrote:
> The emulation code has been changed to advertise NVM Command Set when
> "zoned" device property is not set (default) and Zoned Namespace
> Command Set otherwise.
> 
> Handlers for three new NVMe commands introduced in Zoned Namespace
> Command Set specification are added, namely for Zone Management
> Receive, Zone Management Send and Zone Append.
> 
> Device initialization code has been extended to create a proper
> configuration for zoned operation using device properties.
> 
> Read/Write command handler is modified to only allow writes at the
> write pointer if the namespace is zoned. For Zone Append command,
> writes implicitly happen at the write pointer and the starting write
> pointer value is returned as the result of the command. Write Zeroes
> handler is modified to add zoned checks that are identical to those
> done as a part of Write flow.
> 
> The code to support Zone Descriptor Extensions is not included in
> this commit and ZDES 0 is always reported. A later commit in this
> series will add ZDE support.
> 
> This commit doesn't yet include checks for active and open zone
> limits. It is assumed that there are no limits on either active or
> open zones.
> 

I think the fill_pattern feature stands separate, so it would be nice to
extract that to a patch on its own.

> Signed-off-by: Niklas Cassel 
> Signed-off-by: Hans Holmberg 
> Signed-off-by: Ajay Joshi 
> Signed-off-by: Chaitanya Kulkarni 
> Signed-off-by: Matias Bjorling 
> Signed-off-by: Aravind Ramesh 
> Signed-off-by: Shin'ichiro Kawasaki 
> Signed-off-by: Adam Manzanares 
> Signed-off-by: Dmitry Fomichev 
> ---
>  block/nvme.c |   2 +-
>  hw/block/nvme-ns.c   | 185 -
>  hw/block/nvme-ns.h   |   6 +-
>  hw/block/nvme.c  | 872 +--
>  include/block/nvme.h |   6 +-
>  5 files changed, 1033 insertions(+), 38 deletions(-)
> 
> diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> index 04172f083e..daa13546c4 100644
> --- a/hw/block/nvme-ns.h
> +++ b/hw/block/nvme-ns.h
> @@ -38,7 +38,6 @@ typedef struct NvmeZoneList {
>  
>  typedef struct NvmeNamespaceParams {
>  uint32_t nsid;
> -uint8_t  csi;
>  bool attached;
>  QemuUUID uuid;
>  
> @@ -52,6 +51,7 @@ typedef struct NvmeNamespace {
>  DeviceState  parent_obj;
>  BlockConfblkconf;
>  int32_t  bootindex;
> +uint8_t  csi;
>  int64_t  size;
>  NvmeIdNs id_ns;

This should be squashed into the namespace types patch.

> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 63ad03d6d6..38e25a4d1f 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -54,6 +54,7 @@
>  #include "qemu/osdep.h"
>  #include "qemu/units.h"
>  #include "qemu/error-report.h"
> +#include "crypto/random.h"

I think this is not used until the offline/read-only zones injection
patch, right?

> +static bool nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
> +  bool failed)
> +{
> +NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
> +NvmeZone *zone;
> +uint64_t slba, start_wp = req->cqe.result64;
> +uint32_t nlb, zone_idx;
> +uint8_t zs;
> +
> +if (rw->opcode != NVME_CMD_WRITE &&
> +rw->opcode != NVME_CMD_ZONE_APPEND &&
> +rw->opcode != NVME_CMD_WRITE_ZEROES) {
> +return false;
> +}
> +
> +slba = le64_to_cpu(rw->slba);
> +nlb = le16_to_cpu(rw->nlb) + 1;
> +zone_idx = nvme_zone_idx(ns, slba);
> +assert(zone_idx < ns->num_zones);
> +zone = &ns->zone_array[zone_idx];
> +
> +if (!failed && zone->w_ptr < start_wp + nlb) {
> +/*
> + * A preceding queued write to the zone has failed,
> + * now this write is not at the WP, fail it too.
> + */
> +failed = true;
> +}
> +
> +if (failed) {
> +if (zone->w_ptr > start_wp) {
> +zone->w_ptr = start_wp;
> +}

It is possible (though unlikely) that you already posted the CQE for the
write that moved the WP to w_ptr - and now you are reverting it.  This
looks like a recipe for data corruption to me.

Take this example. I use append, because if you have multiple regular
writes in queue you're screwed anyway.

  w_ptr = 0, d.wp = 0
  append 1 lba  -> w_ptr = 1, start_wp = 0, issues aio A
  append 2 lbas -> w_ptr = 3, start_wp = 1, issues aio B

  aio B success -> d.wp = 2 (since you are adding nlb),

Now, I totally do the same. Even though the zone descriptor write
pointer gets "out of sync", it will be reconciled in the absence of
failures and it's fair to define that the host cannot expect a consistent
view of the write pointer without quiescing I/O.

The problem is if a write then fails:

  aio A fails   -> w_ptr > start_wp (3 > 1), so you revert to w_ptr = 1

That looks bad to me. I don't think this is ever reconciled? If another
append then comes in:

  append 1 lba -> w_ptr = 2, start_wp = 1, issues aio C and overwrites the
  data that aio B already wrote at that LBA.
 

Re: [PATCH v7 07/13] monitor: Make current monitor a per-coroutine property

2020-09-28 Thread Kevin Wolf
Am 28.09.2020 um 09:47 hat Markus Armbruster geschrieben:
> Kevin Wolf  writes:
> 
> > Am 14.09.2020 um 17:11 hat Markus Armbruster geschrieben:
> >> Kevin Wolf  writes:
> >> 
> >> > This way, a monitor command handler will still be able to access the
> >> > current monitor, but when it yields, all other code will correctly
> >> > get NULL from monitor_cur().
> >> >
> >> > This uses a hash table to map the coroutine pointer to the current
> >> > monitor of that coroutine.  Outside of coroutine context, we associate
> >> > the current monitor with the leader coroutine of the current thread.
> >> 
> >> In qemu-system-FOO, the hash table can have only these entries:
> >> 
> >> * (OOB) One mapping @mon_iothread's thread leader to a QMP monitor, while
> >>   executing a QMP command out-of-band.
> >> 
> >> * (QMP-CO) One mapping @qmp_dispatcher_co (a coroutine in the main
> >>   thread) to a QMP monitor, while executing a QMP command in-band and in
> >>   coroutine context.
> >> 
> >> * (QMP) One mapping the main thread's leader to a QMP monitor, while
> >>   executing a QMP command in-band and out of coroutine context, in a
> >>   bottom half.
> >> 
> >> * (HMP) One mapping the main thread's leader to an HMP monitor, while
> >>   executing an HMP command out of coroutine context.
> >> 
> >> * (HMP-CO) One mapping a transient coroutine in the main thread to an
> >>   HMP monitor, while executing an HMP command in coroutine context.
> >> 
> >> In-band execution is one command after the other.
> >> 
> >> Therefore, at most one monitor command can be executing in-band at any
> >> time.
> >> 
> >> Therefore, the hash table has at most *two* entries: one (OOB), and one
> >> of the other four.
> >> 
> >> Can you shoot any holes into my argument?
> >
> > I think with human-monitor-command, you can have three mappings:
> >
> > 1. The main thread's leader (it is a non-coroutine QMP command) to the
> >QMP monitor
> 
> This is (QMP).
> 
> > 2. With a coroutine HMP command, one mapping from the transient HMP
> >coroutine to the transient HMP monitor (with a non-coroutine HMP
> >command, we'd instead temporarily change the mapping from 1.)
> 
> This is (HMP-CO).
> 
> > 3. The OOB entry
> 
> This is (OOB).
> 
> To get 1. (QMP) and 2, (HMP-CO) at the same time, the in-band,
> non-coroutine QMP command needs to execute interleaved with the in-band,
> coroutine HMP command.
> 
> Such an interleaving contradicts "In-band execution is one command after
> the other", which is a fundamental assumption in-band commands may make.
> If the assumption is invalid, we got a problem.  Is it?

Interleaving, or rather executing another command in the middle of its
implementation is the very purpose of human-monitor-command (which is
what I was talking about, so "the in-band non-coroutine QMP command" is
a very specific one).

It's the only command I can think of that is exceptional in this way
and would lead to three mappings.

Kevin




Re: [PATCH v7 13/13] block: Convert 'block_resize' to coroutine

2020-09-28 Thread Kevin Wolf
Am 28.09.2020 um 11:05 hat Stefan Hajnoczi geschrieben:
> On Fri, Sep 25, 2020 at 06:07:50PM +0200, Kevin Wolf wrote:
> > Am 15.09.2020 um 16:57 hat Stefan Hajnoczi geschrieben:
> > > On Wed, Sep 09, 2020 at 05:11:49PM +0200, Kevin Wolf wrote:
> > > > @@ -2456,8 +2456,7 @@ void qmp_block_resize(bool has_device, const char 
> > > > *device,
> > > >  return;
> > > >  }
> > > >  
> > > > -aio_context = bdrv_get_aio_context(bs);
> > > > -aio_context_acquire(aio_context);
> > > > +old_ctx = bdrv_co_move_to_aio_context(bs);
> > > >  
> > > >  if (size < 0) {
> > > >  error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "size", "a >0 
> > > > size");
> > > 
> > > Is it safe to call blk_new() outside the BQL since it mutates global 
> > > state?
> > > 
> > > In other words, could another thread race with us?
> > 
> > Hm, probably not.
> > 
> > Would it be safer to have the bdrv_co_move_to_aio_context() call only
> > immediately before the drain?
> 
> Yes, sounds good.
> 
> > > > @@ -2479,8 +2478,8 @@ void qmp_block_resize(bool has_device, const char 
> > > > *device,
> > > >  bdrv_drained_end(bs);
> > > >  
> > > >  out:
> > > > +aio_co_reschedule_self(old_ctx);
> > > >  blk_unref(blk);
> > > > -aio_context_release(aio_context);
> > > 
> > > The following precondition is violated by the blk_unref -> bdrv_drain ->
> > > AIO_WAIT_WHILE() call if blk->refcnt is 1 here:
> > > 
> > >  * The caller's thread must be the IOThread that owns @ctx or the main 
> > > loop
> > >  * thread (with @ctx acquired exactly once).
> > > 
> > > blk_unref() is called from the main loop thread without having acquired
> > > blk's AioContext.
> > > 
> > > Normally blk->refcnt will be > 1 so bdrv_drain() won't be called, but
> > > I'm not sure if that can be guaranteed.
> > > 
> > > The following seems safer although it's uglier:
> > > 
> > >   aio_context = bdrv_get_aio_context(bs);
> > >   aio_context_acquire(aio_context);
> > >   blk_unref(blk);
> > >   aio_context_release(aio_context);
> > 
> > May we actually acquire aio_context if blk is in the main thread? I
> > think we must only do this if it's in a different iothread because we'd
> > end up with a recursive lock and drain would hang.
> 
> Right :). Maybe an aio_context_acquire_once() API would help.

If you want it to work in the general case, how would you implement
this? As far as I know there is no way to tell whether we already own
the lock or not.

Something like aio_context_acquire_unless_self() might be easier to
implement.

Kevin




Re: [PATCH v7 12/13] block: Add bdrv_co_move_to_aio_context()

2020-09-28 Thread Kevin Wolf
Am 28.09.2020 um 10:59 hat Stefan Hajnoczi geschrieben:
> On Fri, Sep 25, 2020 at 06:00:51PM +0200, Kevin Wolf wrote:
> > Am 15.09.2020 um 16:31 hat Stefan Hajnoczi geschrieben:
> > > On Wed, Sep 09, 2020 at 05:11:48PM +0200, Kevin Wolf wrote:
> > > > Add a function to move the current coroutine to the AioContext of a
> > > > given BlockDriverState.
> > > > 
> > > > Signed-off-by: Kevin Wolf 
> > > > ---
> > > >  include/block/block.h |  6 ++
> > > >  block.c   | 10 ++
> > > >  2 files changed, 16 insertions(+)
> > > > 
> > > > diff --git a/include/block/block.h b/include/block/block.h
> > > > index 981ab5b314..80ab322f11 100644
> > > > --- a/include/block/block.h
> > > > +++ b/include/block/block.h
> > > > @@ -626,6 +626,12 @@ bool bdrv_debug_is_suspended(BlockDriverState *bs, 
> > > > const char *tag);
> > > >   */
> > > >  AioContext *bdrv_get_aio_context(BlockDriverState *bs);
> > > >  
> > > > +/**
> > > > + * Move the current coroutine to the AioContext of @bs and return the 
> > > > old
> > > > + * AioContext of the coroutine.
> > > > + */
> > > > +AioContext *coroutine_fn bdrv_co_move_to_aio_context(BlockDriverState 
> > > > *bs);
> > > 
> > > I'm not sure this function handles all cases:
> > > 1. Being called without the BQL (i.e. from an IOThread).
> > > 2. Being called while a device stops using its IOThread.
> > > 
> > > The races that come to mind are fetching the AioContext for bs and then
> > > scheduling a BH. The BH is executed later on by the event loop. There
> > > might be cases where the AioContext for bs is updated before the BH
> > > runs.
> 
> The scenario I'm thinking about is where bs' AioContext changes while we
> are trying to move there.
> 
> There is a window of time between fetching bs' AioContext, scheduling a
> BH in our old AioContext (not in bs' AioContext), and then scheduling
> the coroutine into the AioContext we previously fetched for bs.
> 
> Is it possible for the AioContext value we stashed to be outdated by the
> time we use it?
> 
> I think the answer is that it's safe to use this function from the main
> loop thread under the BQL. That way nothing else will change bs'
> AioContext while we're running. But it's probably not safe to use this
> function from an arbitrary IOThread (without the BQL).

It's probably the safest to treat it as such. The first part of it (the
window between fetching bs->aio_context and using it) is actually also
true for this ubiquitous sequence:

AioContext *ctx = bdrv_get_aio_context(bs);
aio_context_acquire(ctx);

I never really thought about this, but this is only safe in the main
thread. Most of its users are of course monitor command handlers, which
always run in the main thread.

> I think this limitation is okay but it needs to be documented. Maybe an
> assertion can verify that it holds.

Yes, why not.

Maybe we should actually change the interface into a pair of
bdrv_co_enter/leave() functions that also increase bs->in_flight so that
the whole section will complete before the AioContext of bs changes
(changing the AioContext while the handler coroutine has yielded and will
continue to run in the old context would be bad).

block_resize is safe without it, but it might be better to introduce
patterns that will be safe without being extra careful in each command.
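
To illustrate, a rough sketch of such a pair as it could appear in block.c
(just the idea, not a patch; that the elevated in_flight counter is enough to
hold back a concurrent AioContext change is an assumption to be verified):

    /* Enter bs' AioContext and keep bs busy until the matching leave. */
    static AioContext *coroutine_fn bdrv_co_enter(BlockDriverState *bs)
    {
        AioContext *old_ctx = qemu_get_current_aio_context();

        /* An elevated in_flight count makes the drained section of an
         * AioContext change wait until we call bdrv_co_leave(). */
        bdrv_inc_in_flight(bs);
        aio_co_reschedule_self(bdrv_get_aio_context(bs));
        return old_ctx;
    }

    static void coroutine_fn bdrv_co_leave(BlockDriverState *bs,
                                           AioContext *old_ctx)
    {
        aio_co_reschedule_self(old_ctx);
        bdrv_dec_in_flight(bs);
    }

A handler would then wrap its whole block section in bdrv_co_enter()/
bdrv_co_leave() instead of calling bdrv_co_move_to_aio_context() directly.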

Kevin




Re: [PATCH v7 00/13] monitor: Optionally run handlers in coroutines

2020-09-28 Thread Kevin Wolf
Am 28.09.2020 um 10:46 hat Stefan Hajnoczi geschrieben:
> On Fri, Sep 25, 2020 at 07:15:41PM +0200, Kevin Wolf wrote:
> > Am 10.09.2020 um 15:24 hat Stefan Hajnoczi geschrieben:
> > > On Wed, Sep 09, 2020 at 05:11:36PM +0200, Kevin Wolf wrote:
> > > > Some QMP command handlers can block the main loop for a relatively long
> > > > time, for example because they perform some I/O. This is quite nasty.
> > > > Allowing such handlers to run in a coroutine where they can yield (and
> > > > therefore release the BQL) while waiting for an event such as I/O
> > > > completion solves the problem.
> > > > 
> > > > This series adds the infrastructure to allow this and switches
> > > > block_resize to run in a coroutine as a first example.
> > > > 
> > > > This is an alternative solution to Marc-André's "monitor: add
> > > > asynchronous command type" series.
> > > 
> > > Please clarify the following in the QAPI documentation:
> > >  * Is the QMP monitor suspended while the command is pending?
> > 
> > Suspended as in monitor_suspend()? No.
> > 
> > >  * Are QMP events reported while the command is pending?
> > 
> > Hm, I don't know to be honest. But I think so.
> > 
> > Does it matter, though? I don't think events have a defined order
> > compared to command results, and the client can't respond to the event
> > anyway until the current command has completed.
> 
> You're right, I don't think it matters because the client must expect
> QMP events at any time.
> 
> I was trying to understand the semantics of coroutine monitor commands
> from two perspectives:
> 
> 1. The QMP client - do coroutine commands behave differently from
>non-coroutine commands? I think the answer is no. The monitor will
>not process more commands until the coroutine finishes?

No, on the wire, things should look exactly the same. If you consider
more than just the QMP traffic, e.g. if the client also communicates with
the guest, it might of course see that the guest isn't blocked any more.

> 2. The command implementation - which thread does the coroutine run in?
>I guess it runs in the main loop thread with the BQL and the
>AioContext acquired?

By default, yes. But the coroutine can reschedule itself to another
thread. Block-related handlers will want to reschedule themselves to
bs->aio_context because you can't just hold the AioContext lock from
another thread across yields. This is what block_resize does after this
series.
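
As a simplified sketch of that pattern (modeled on the block_resize
conversion in this series; lookup and error handling are mostly omitted and
the command name is made up):

    void coroutine_fn qmp_example_coroutine_cmd(const char *node_name,
                                                Error **errp)
    {
        BlockDriverState *bs = bdrv_lookup_bs(node_name, node_name, errp);
        AioContext *old_ctx;

        if (!bs) {
            return;
        }

        /* Hop from the main loop into the AioContext that owns bs */
        old_ctx = bdrv_co_move_to_aio_context(bs);

        /* ... the actual work goes here; it may yield and resumes in bs'
         * AioContext rather than holding another thread's lock ... */

        /* Hop back before returning to the QMP dispatcher */
        aio_co_reschedule_self(old_ctx);
    }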

Kevin




Re: [PATCH v7 00/13] monitor: Optionally run handlers in coroutines

2020-09-28 Thread Markus Armbruster
Kevin Wolf  writes:

> Am 10.09.2020 um 15:24 hat Stefan Hajnoczi geschrieben:
>> On Wed, Sep 09, 2020 at 05:11:36PM +0200, Kevin Wolf wrote:
>> > Some QMP command handlers can block the main loop for a relatively long
>> > time, for example because they perform some I/O. This is quite nasty.
>> > Allowing such handlers to run in a coroutine where they can yield (and
>> > therefore release the BQL) while waiting for an event such as I/O
>> > completion solves the problem.
>> > 
>> > This series adds the infrastructure to allow this and switches
>> > block_resize to run in a coroutine as a first example.
>> > 
>> > This is an alternative solution to Marc-André's "monitor: add
>> > asynchronous command type" series.
>> 
>> Please clarify the following in the QAPI documentation:
>>  * Is the QMP monitor suspended while the command is pending?
>
> Suspended as in monitor_suspend()? No.

A suspended monitor doesn't read monitor input.

We suspend

* a QMP monitor while the request queue is full

* an HMP monitor while it executes a command

* a multiplexed HMP monitor while the "mux-focus" is elsewhere

* an HMP monitor when it executes command "quit", forever

* an HMP monitor while it executes command "migrate" without -d

Let me explain the first item in a bit more detail.  Before OOB, a QMP
monitor was also suspended while it executed a command.  To make OOB
work, we moved the QMP monitors to an I/O thread and added a request
queue, drained by the main loop.  QMP monitors continue to read
commands, execute right away if OOB, else queue, suspend when queue gets
full, resume when it gets non-full.
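
Compressed into code, the decision in the monitor I/O thread looks roughly
like this (my paraphrase of the existing monitor/qmp.c logic, with locking
omitted; not an exact copy):

    if (qmp_oob_enabled(mon) && qmp_is_oob(qdict)) {
        /* Execute right away, in the I/O thread */
        monitor_qmp_dispatch(mon, req);
    } else {
        /* Queue for the main loop and throttle the reader if needed */
        g_queue_push_tail(mon->qmp_requests, req_obj);
        if (g_queue_get_length(mon->qmp_requests) >= QMP_REQ_QUEUE_LEN_MAX) {
            monitor_suspend(&mon->common);
        }
        qemu_bh_schedule(qmp_dispatcher_bh);    /* wakes the dispatcher */
    }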

The "run command in coroutine context" feature does not affect any of
this.

qapi-code-gen.txt does not talk about monitor suspension at all.  It's
instead discussed in qmp-spec.txt section 2.3.1 Out-of-band execution.

Stefan, what would you like us to clarify, and where?

>>  * Are QMP events reported while the command is pending?
>
> Hm, I don't know to be honest. But I think so.

Yes, events should be reported while a command is being executed.

Sending events takes locks.  Their critical sections are all
short-lived.  Another possible delay is the underlying character device
failing the send with EAGAIN.  That's all.

Fine print: qapi_event_emit() takes @monitor_lock.  It sends to each QMP
monitor with qmp_send_response(), which uses monitor_puts(), which takes
the monitor's @mon_lock.

The "run command in coroutine context" feature does not affect any of
this.

> Does it matter, though? I don't think events have a defined order
> compared to command results, and the client can't respond to the event
> anyway until the current command has completed.

Stefan, what would you like us to clarify, and where?




Re: [PATCH v7 13/13] block: Convert 'block_resize' to coroutine

2020-09-28 Thread Stefan Hajnoczi
On Fri, Sep 25, 2020 at 06:07:50PM +0200, Kevin Wolf wrote:
> Am 15.09.2020 um 16:57 hat Stefan Hajnoczi geschrieben:
> > On Wed, Sep 09, 2020 at 05:11:49PM +0200, Kevin Wolf wrote:
> > > @@ -2456,8 +2456,7 @@ void qmp_block_resize(bool has_device, const char 
> > > *device,
> > >  return;
> > >  }
> > >  
> > > -aio_context = bdrv_get_aio_context(bs);
> > > -aio_context_acquire(aio_context);
> > > +old_ctx = bdrv_co_move_to_aio_context(bs);
> > >  
> > >  if (size < 0) {
> > >  error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "size", "a >0 
> > > size");
> > 
> > Is it safe to call blk_new() outside the BQL since it mutates global state?
> > 
> > In other words, could another thread race with us?
> 
> Hm, probably not.
> 
> Would it be safer to have the bdrv_co_move_to_aio_context() call only
> immediately before the drain?

Yes, sounds good.

> > > @@ -2479,8 +2478,8 @@ void qmp_block_resize(bool has_device, const char 
> > > *device,
> > >  bdrv_drained_end(bs);
> > >  
> > >  out:
> > > +aio_co_reschedule_self(old_ctx);
> > >  blk_unref(blk);
> > > -aio_context_release(aio_context);
> > 
> > The following precondition is violated by the blk_unref -> bdrv_drain ->
> > AIO_WAIT_WHILE() call if blk->refcnt is 1 here:
> > 
> >  * The caller's thread must be the IOThread that owns @ctx or the main loop
> >  * thread (with @ctx acquired exactly once).
> > 
> > blk_unref() is called from the main loop thread without having acquired
> > blk's AioContext.
> > 
> > Normally blk->refcnt will be > 1 so bdrv_drain() won't be called, but
> > I'm not sure if that can be guaranteed.
> > 
> > The following seems safer although it's uglier:
> > 
> >   aio_context = bdrv_get_aio_context(bs);
> >   aio_context_acquire(aio_context);
> >   blk_unref(blk);
> >   aio_context_release(aio_context);
> 
> May we actually acquire aio_context if blk is in the main thread? I
> think we must only do this if it's in a different iothread because we'd
> end up with a recursive lock and drain would hang.

Right :). Maybe an aio_context_acquire_once() API would help.

Stefan




Re: [PATCH v7 12/13] block: Add bdrv_co_move_to_aio_context()

2020-09-28 Thread Stefan Hajnoczi
On Fri, Sep 25, 2020 at 06:00:51PM +0200, Kevin Wolf wrote:
> Am 15.09.2020 um 16:31 hat Stefan Hajnoczi geschrieben:
> > On Wed, Sep 09, 2020 at 05:11:48PM +0200, Kevin Wolf wrote:
> > > Add a function to move the current coroutine to the AioContext of a
> > > given BlockDriverState.
> > > 
> > > Signed-off-by: Kevin Wolf 
> > > ---
> > >  include/block/block.h |  6 ++
> > >  block.c   | 10 ++
> > >  2 files changed, 16 insertions(+)
> > > 
> > > diff --git a/include/block/block.h b/include/block/block.h
> > > index 981ab5b314..80ab322f11 100644
> > > --- a/include/block/block.h
> > > +++ b/include/block/block.h
> > > @@ -626,6 +626,12 @@ bool bdrv_debug_is_suspended(BlockDriverState *bs, 
> > > const char *tag);
> > >   */
> > >  AioContext *bdrv_get_aio_context(BlockDriverState *bs);
> > >  
> > > +/**
> > > + * Move the current coroutine to the AioContext of @bs and return the old
> > > + * AioContext of the coroutine.
> > > + */
> > > +AioContext *coroutine_fn bdrv_co_move_to_aio_context(BlockDriverState 
> > > *bs);
> > 
> > I'm not sure this function handles all cases:
> > 1. Being called without the BQL (i.e. from an IOThread).
> > 2. Being called while a device stops using its IOThread.
> > 
> > The races that come to mind are fetching the AioContext for bs and then
> > scheduling a BH. The BH is executed later on by the event loop. There
> > might be cases where the AioContext for bs is updated before the BH
> > runs.

The scenario I'm thinking about is where bs' AioContext changes while we
are trying to move there.

There is a window of time between fetching bs' AioContext, scheduling a
BH in our old AioContext (not in bs' AioContext), and then scheduling
the coroutine into the AioContext we previously fetched for bs.

Is it possible for the AioContext value we stashed to be outdated by the
time we use it?

I think the answer is that it's safe to use this function from the main
loop thread under the BQL. That way nothing else will change bs'
AioContext while we're running. But it's probably not safe to use this
function from an arbitrary IOThread (without the BQL).

I think this limitation is okay but it needs to be documented. Maybe an
assertion can verify that it holds.

Stefan




Re: [PATCH v5 01/14] hw/block/nvme: Report actual LBA data shift in LBAF

2020-09-28 Thread Klaus Jensen
On Sep 28 11:35, Dmitry Fomichev wrote:
> Calculate the data shift value to report based on the set value of
> logical_block_size device property.
> 
> In the process, use a local variable to calculate the LBA format
> index instead of the hardcoded value 0. This makes the code more
> readable and it will make it easier to add support for multiple LBA
> formats in the future.
> 
> Signed-off-by: Dmitry Fomichev 
> Reviewed-by: Klaus Jensen 
> ---
>  hw/block/nvme-ns.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index 2ba0263dda..bbd7879492 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c
> @@ -47,6 +47,8 @@ static void nvme_ns_init(NvmeNamespace *ns)
>  
>  static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
>  {
> +int lba_index;
> +
>  if (!blkconf_blocksizes(>blkconf, errp)) {
>  return -1;
>  }
> @@ -67,6 +69,9 @@ static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, 
> Error **errp)
>  n->features.vwc = 0x1;
>  }
>  
> +lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
> +ns->id_ns.lbaf[lba_index].ds = 31 - clz32(n->conf.logical_block_size);

You fix this later in the zoned support patch, but this should use
ns->blkconf.conf.logical_block_size.




Re: [PATCH v7 00/13] monitor: Optionally run handlers in coroutines

2020-09-28 Thread Stefan Hajnoczi
On Fri, Sep 25, 2020 at 07:15:41PM +0200, Kevin Wolf wrote:
> Am 10.09.2020 um 15:24 hat Stefan Hajnoczi geschrieben:
> > On Wed, Sep 09, 2020 at 05:11:36PM +0200, Kevin Wolf wrote:
> > > Some QMP command handlers can block the main loop for a relatively long
> > > time, for example because they perform some I/O. This is quite nasty.
> > > Allowing such handlers to run in a coroutine where they can yield (and
> > > therefore release the BQL) while waiting for an event such as I/O
> > > completion solves the problem.
> > > 
> > > This series adds the infrastructure to allow this and switches
> > > block_resize to run in a coroutine as a first example.
> > > 
> > > This is an alternative solution to Marc-André's "monitor: add
> > > asynchronous command type" series.
> > 
> > Please clarify the following in the QAPI documentation:
> >  * Is the QMP monitor suspended while the command is pending?
> 
> Suspended as in monitor_suspend()? No.
> 
> >  * Are QMP events reported while the command is pending?
> 
> Hm, I don't know to be honest. But I think so.
> 
> Does it matter, though? I don't think events have a defined order
> compared to command results, and the client can't respond to the event
> anyway until the current command has completed.

You're right, I don't think it matters because the client must expect
QMP events at any time.

I was trying to understand the semantics of coroutine monitor commands
from two perspectives:

1. The QMP client - do coroutine commands behave differently from
   non-coroutine commands? I think the answer is no. The monitor will
   not process more commands until the coroutine finishes?

2. The command implementation - which thread does the coroutine run in?
   I guess it runs in the main loop thread with the BQL and the
   AioContext acquired?




Re: [PATCH 4/4] block/export: add BlockExportOptions->iothread member

2020-09-28 Thread Stefan Hajnoczi
On Fri, Sep 25, 2020 at 05:01:42PM +0200, Kevin Wolf wrote:
> Am 25.09.2020 um 15:42 hat Stefan Hajnoczi geschrieben:
> > Make it possible to specify the iothread where the export will run.
> > 
> > Signed-off-by: Stefan Hajnoczi 
> > ---
> > Note the x-blockdev-set-iothread QMP command can be used to do the same,
> > but not from the command-line. And it requires sending an additional
> > command.
> > 
> > In the long run vhost-user-blk will support per-virtqueue iothread
> > mappings. But for now a single iothread makes sense and most other
> > transports will just use one iothread anyway.
> > ---
> >  qapi/block-export.json |  4 
> >  block/export/export.c  | 26 +-
> >  2 files changed, 29 insertions(+), 1 deletion(-)
> > 
> > diff --git a/qapi/block-export.json b/qapi/block-export.json
> > index 87ac5117cd..eba6f6eae9 100644
> > --- a/qapi/block-export.json
> > +++ b/qapi/block-export.json
> > @@ -219,11 +219,15 @@
> >  #export before completion is signalled. (since: 5.2;
> >  #default: false)
> >  #
> > +# @iothread: The name of the iothread object where the export will run. The
> > +#default is the main loop thread. (since: 5.2)
> 
> NBD exports currently switch automatically to a different AioContext if
> another user (usually a guest device using the same node) tries to
> change the AioContext. I believe this is also the most useful mode in
> the context of the system emulator.
> 
> I can see the need for an iothread option in qemu-storage-daemon where
> usually nobody else will move the node into a different AioContext.
> 
> But we need to define the semantics more precisely and specify what
> happens if another user wants to change the AioContext later. Currently,
> the NBD export will allow this and just switch the AioContext - after
> this patch, ignoring what the user set explicitly with this new option.
> 
> I see two options to handle this more consistently:
> 
> 1. If @iothread is set, move the block node into the requested
>AioContext, and if that fails, block-export-add fails. Other users of
>the node will be denied to change the AioContext while the export is
>active.
> 
>If @iothread is not given, it behaves like today: Use whatever
>AioContext the node is currently in and switch whenever another user
>requests it.
> 
> 2. Add a bool option @fixed-iothread that determines whether other users
>can change the AioContext while the export is active.
> 
>Giving an @iothread and fixed-iothread == true means that we'll
>enforce the given AioContext during the whole lifetime of the export.
>With fixed-iothread == false it means that we try to move the block
>node to the requested iothread if possible (but we won't fail if it
>isn't possible) and will follow any other user switching the
>AioContext of the node.
> 
>Not giving @iothread means that we start with the current AioContext
>of the node, and @fixed-iothread then means the same as before.
> 
> Does this make sense to you?

Thanks for raising this. #2 is more flexible. I suspect most users will
be happy with fixed-iothread = false by default (i.e. things keep
working even if the iothread is changed).

I can adjust this patch to follow those semantics.
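
For what it's worth, the creation path under option 2 could then look roughly
like this (just a sketch against blk_exp_add(); the option names and fields
are placeholders for whatever the QAPI ends up calling them, and iothread
lookup validation is omitted):

    if (export->has_iothread) {
        IOThread *iothread = iothread_by_id(export->iothread);
        AioContext *new_ctx = iothread_get_aio_context(iothread);
        Error *local_err = NULL;

        if (bdrv_try_set_aio_context(bs, new_ctx, &local_err) < 0) {
            if (export->fixed_iothread) {
                /* Strict mode: refuse to create the export */
                error_propagate(errp, local_err);
                return NULL;
            }
            /* Best effort: stay in the node's current AioContext */
            error_free(local_err);
        }
    }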

Stefan




Re: [PATCH v7 08/13] qapi: Add a 'coroutine' flag for commands

2020-09-28 Thread Markus Armbruster
Kevin Wolf  writes:

> Am 14.09.2020 um 17:15 hat Markus Armbruster geschrieben:
>> Kevin Wolf  writes:
>> 
>> > This patch adds a new 'coroutine' flag to QMP command definitions that
>> > tells the QMP dispatcher that the command handler is safe to be run in a
>> > coroutine.
>> >
>> > The documentation of the new flag pretends that this flag is already
>> > used as intended, which it isn't yet after this patch. We'll implement
>> > this in another patch in this series.
>> >
>> > Signed-off-by: Kevin Wolf 
>> > Reviewed-by: Marc-André Lureau 
>> > Reviewed-by: Markus Armbruster 
>> > ---
>> >  docs/devel/qapi-code-gen.txt| 12 
>> >  include/qapi/qmp/dispatch.h |  1 +
>> >  tests/test-qmp-cmds.c   |  4 
>> >  scripts/qapi/commands.py| 10 +++---
>> >  scripts/qapi/doc.py |  2 +-
>> >  scripts/qapi/expr.py| 10 --
>> >  scripts/qapi/introspect.py  |  2 +-
>> >  scripts/qapi/schema.py  | 12 
>> >  tests/qapi-schema/test-qapi.py  |  7 ---
>> >  tests/qapi-schema/meson.build   |  1 +
>> >  tests/qapi-schema/oob-coroutine.err |  2 ++
>> >  tests/qapi-schema/oob-coroutine.json|  2 ++
>> >  tests/qapi-schema/oob-coroutine.out |  0
>> >  tests/qapi-schema/qapi-schema-test.json |  1 +
>> >  tests/qapi-schema/qapi-schema-test.out  |  2 ++
>> >  15 files changed, 54 insertions(+), 14 deletions(-)
>> >  create mode 100644 tests/qapi-schema/oob-coroutine.err
>> >  create mode 100644 tests/qapi-schema/oob-coroutine.json
>> >  create mode 100644 tests/qapi-schema/oob-coroutine.out
>> >
>> > diff --git a/docs/devel/qapi-code-gen.txt b/docs/devel/qapi-code-gen.txt
>> > index f3e7ced212..36daa9b5f8 100644
>> > --- a/docs/devel/qapi-code-gen.txt
>> > +++ b/docs/devel/qapi-code-gen.txt
>> > @@ -472,6 +472,7 @@ Syntax:
>> >  '*gen': false,
>> >  '*allow-oob': true,
>> >  '*allow-preconfig': true,
>> > +'*coroutine': true,
>> >  '*if': COND,
>> >  '*features': FEATURES }
>> >  
>> > @@ -596,6 +597,17 @@ before the machine is built.  It defaults to false.  
>> > For example:
>> >  QMP is available before the machine is built only when QEMU was
>> >  started with --preconfig.
>> >  
>> > +Member 'coroutine' tells the QMP dispatcher whether the command handler
>> > +is safe to be run in a coroutine.
>> 
>> We need to document what exactly makes a command handler coroutine safe
>> / unsafe.  We discussed this at some length in review of PATCH v4 1/4:
>> 
>> Message-ID: <874kwnvgad@dusky.pond.sub.org>
>> https://lists.nongnu.org/archive/html/qemu-devel/2020-01/msg05015.html
>> 
>> I'd be willing to accept a follow-up patch, if that's more convenient
>> for you.
>
> Did we ever arrive at a conclusion for a good definition?
>
> I mean I can give a definition like "it's coroutine safe if it results
> in the right behaviour even when called in coroutine context", but
> that's not really helpful.

If an actual definition is out of reach, we should still say
*something*.  Even a mere "it's complicated" would help, because it
warns the reader he's on thin ice.

Can we give examples for "unsafe"?  You mentioned nested event loops.

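One possible illustration for the documentation (my example, not from the
series, and the async helper it starts is made up): a handler written like
this implicitly assumes it is *not* running in coroutine context, because it
spins a nested event loop instead of yielding:

    void qmp_example_blocking_cmd(Error **errp)
    {
        bool done = false;

        start_some_async_work(&done);   /* hypothetical async operation */
        while (!done) {
            /* Nested event loop -- in a coroutine handler this would have
             * to be a yield that the completion callback wakes up. */
            aio_poll(qemu_get_aio_context(), true);
        }
    }
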
> FWIW, I would also have a hard time giving a much better definition than
> this for thread safety.

For what it's worth, here's how POSIX.1-2017 defined "thread-safe":

A thread-safe function can be safely invoked concurrently with other
calls to the same function, or with calls to any other thread-safe
functions, by multiple threads.  Each function defined in the System
Interfaces volume of POSIX.1-2017 is thread-safe unless explicitly
stated otherwise.  Examples are any "pure" function, a function
which holds a mutex locked while it is accessing static storage, or
objects shared among threads.

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_407

>> > It defaults to false.  If it is true,
>> > +the command handler is called from coroutine context and may yield while
>> 
>> Is it *always* called from coroutine context, or could it be called
>> outside coroutine context, too?  I guess the former, thanks to PATCH 10,
>> and as long as we diligently mark HMP commands that such call QMP
>> handlers, like you do in PATCH 13.
>
> Yes, it must *always* be called from coroutine context. This is the
> reason why I included HMP after all, even though I'm really mostly just
> interested in QMP.
>
>> What's the worst than can happen when we neglect to mark such an HMP
>> command?
>
> When the command handler tries to yield and it's not in coroutine
> context, it will abort(). If it never tries to yield, I think it would
> just work - but of course, the ability to yield is the only reason why
you would want to run a command in a coroutine in the first place.

Re: [PATCH v7 09/13] qmp: Move dispatcher to a coroutine

2020-09-28 Thread Markus Armbruster
Kevin Wolf  writes:

> Am 14.09.2020 um 17:30 hat Markus Armbruster geschrieben:
>> Kevin Wolf  writes:
>> 
>> > This moves the QMP dispatcher to a coroutine and runs all QMP command
>> > handlers that declare 'coroutine': true in coroutine context so they
>> > can avoid blocking the main loop while doing I/O or waiting for other
>> > events.
>> >
>> > For commands that are not declared safe to run in a coroutine, the
>> > dispatcher drops out of coroutine context by calling the QMP command
>> > handler from a bottom half.
>> >
>> > Signed-off-by: Kevin Wolf 
>> > Reviewed-by: Markus Armbruster 
>> > ---
[...]
>> > diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c
>> > index 5677ba92ca..754f7b854c 100644
>> > --- a/qapi/qmp-dispatch.c
>> > +++ b/qapi/qmp-dispatch.c
[...]
>> > @@ -153,12 +181,35 @@ QDict *qmp_dispatch(const QmpCommandList *cmds, 
>> > QObject *request,
>> >  qobject_ref(args);
>> >  }
>> >  
>> > +assert(!(oob && qemu_in_coroutine()));
>> >  assert(monitor_cur() == NULL);
>> > -monitor_set_cur(qemu_coroutine_self(), cur_mon);
>> > -
>> > -cmd->fn(args, &ret, &err);
>> > -
>> > -monitor_set_cur(qemu_coroutine_self(), NULL);
>> > +if (!!(cmd->options & QCO_COROUTINE) == qemu_in_coroutine()) {
>> > +monitor_set_cur(qemu_coroutine_self(), cur_mon);
>> > +cmd->fn(args, &ret, &err);
>> > +monitor_set_cur(qemu_coroutine_self(), NULL);
>> > +} else {
>> > +/*
>> > + * Not being in coroutine context implies that we're handling
>> > + * an OOB command, which must not have QCO_COROUTINE.
>> > + *
>> > + * This implies that we are in coroutine context, but the
>> > + * command doesn't have QCO_COROUTINE. We must drop out of
>> > + * coroutine context for this one.
>> > + */
>> 
>> I had to read this several times to get it.  The first sentence leads me
>> into coroutine context, and then the next sentence tells me the
>> opposite, throwing me into confusion.
>> 
>> Perhaps something like this:
>> 
>>/*
>> * Actual context doesn't match the one the command needs.
>> * Case 1: we are in coroutine context, but command does not
>> * have QCO_COROUTINE.  We need to drop out of coroutine
>> * context for executing it.
>> * Case 2: we are outside coroutine context, but command has
>> * QCO_COROUTINE.  Can't actually happen, because we get here
>> * outside coroutine context only when executing a command
>> * out of band, and OOB commands never have QCO_COROUTINE.
>> */
>
> Works for me. Can you squash this in while applying?

Sure!




Re: [PATCH v5 13/14] hw/block/nvme: Use zone metadata file for persistence

2020-09-28 Thread Klaus Jensen
On Sep 28 11:35, Dmitry Fomichev wrote:
> A ZNS drive that is emulated by this module is currently initialized
> with all zones Empty upon startup. However, actual ZNS SSDs save the
> state and condition of all zones in their internal NVRAM in the event
> of power loss. When such a drive is powered up again, it closes or
> finishes all zones that were open at the moment of shutdown. Besides
> that, the write pointer position as well as the state and condition
> of all zones is preserved across power-downs.
> 
> This commit adds the capability to have a persistent zone metadata
> to the device. The new optional module property, "zone_file",
> is introduced. If added to the command line, this property specifies
> the name of the file that stores the zone metadata. If "zone_file" is
> omitted, the device will be initialized with all zones empty, the same
> as before.
> 
> If zone metadata is configured to be persistent, then zone descriptor
> extensions also persist across controller shutdowns.
> 
> Signed-off-by: Dmitry Fomichev 
> ---
>  hw/block/nvme-ns.c| 341 --
>  hw/block/nvme-ns.h|  33 
>  hw/block/nvme.c   |   2 +
>  hw/block/trace-events |   1 +
>  4 files changed, 362 insertions(+), 15 deletions(-)
> 
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index 47751f2d54..a94021da81 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c
> @@ -293,12 +421,180 @@ static void nvme_init_zone_meta(NvmeNamespace *ns)
>  i--;
>  }
>  }
> +
> +if (ns->params.zone_file) {
> +nvme_set_zone_meta_dirty(ns);
> +}
> +}
> +
> +static int nvme_open_zone_file(NvmeNamespace *ns, bool *init_meta,
> +   Error **errp)
> +{
> +Object *file_be;
> +HostMemoryBackend *fb;
> +struct stat statbuf;
> +int ret;
> +
> +ret = stat(ns->params.zone_file, &statbuf);
> +if (ret && errno == ENOENT) {
> +*init_meta = true;
> +} else if (!S_ISREG(statbuf.st_mode)) {
> +error_setg(errp, "\"%s\" is not a regular file",
> +   ns->params.zone_file);
> +return -1;
> +}
> +
> +file_be = object_new(TYPE_MEMORY_BACKEND_FILE);
> +object_property_set_str(file_be, "mem-path", ns->params.zone_file,
> +&error_abort);
> +object_property_set_int(file_be, "size", ns->meta_size, &error_abort);
> +object_property_set_bool(file_be, "share", true, &error_abort);
> +object_property_set_bool(file_be, "discard-data", false, &error_abort);
> +if (!user_creatable_complete(USER_CREATABLE(file_be), errp)) {
> +object_unref(file_be);
> +return -1;
> +}
> +object_property_add_child(OBJECT(ns), "_fb", file_be);
> +object_unref(file_be);
> +
> +fb = MEMORY_BACKEND(file_be);
> +ns->zone_mr = host_memory_backend_get_memory(fb);
> +
> +return 0;
> +}
> +
> +static int nvme_map_zone_file(NvmeNamespace *ns, bool *init_meta)
> +{
> +ns->zone_meta = (void *)memory_region_get_ram_ptr(ns->zone_mr);

I forgot that the HostMemoryBackend doesn't magically make the memory
available to the device, so of course this is still needed.

Anyway.

No reason for me to keep complaining about this. I do not like it, I
will not ACK it and I think I made my reasons pretty clear.




Re: [PATCH v7 07/13] monitor: Make current monitor a per-coroutine property

2020-09-28 Thread Markus Armbruster
Kevin Wolf  writes:

> Am 14.09.2020 um 17:11 hat Markus Armbruster geschrieben:
>> Kevin Wolf  writes:
>> 
>> > This way, a monitor command handler will still be able to access the
>> > current monitor, but when it yields, all other code will correctly
>> > get NULL from monitor_cur().
>> >
>> > This uses a hash table to map the coroutine pointer to the current
>> > monitor of that coroutine.  Outside of coroutine context, we associate
>> > the current monitor with the leader coroutine of the current thread.
>> 
>> In qemu-system-FOO, the hash table can have only these entries:
>> 
>> * (OOB) One mapping @mon_iothread's thread leader to a QMP monitor, while
>>   executing a QMP command out-of-band.
>> 
>> * (QMP-CO) One mapping @qmp_dispatcher_co (a coroutine in the main
>>   thread) to a QMP monitor, while executing a QMP command in-band and in
>>   coroutine context.
>> 
>> * (QMP) One mapping the main thread's leader to a QMP monitor, while
>>   executing a QMP command in-band and out of coroutine context, in a
>>   bottom half.
>> 
>> * (HMP) One mapping the main thread's leader to an HMP monitor, while
>>   executing an HMP command out of coroutine context.
>> 
>> * (HMP-CO) One mapping a transient coroutine in the main thread to an
>>   HMP monitor, while executing an HMP command in coroutine context.
>> 
>> In-band execution is one command after the other.
>> 
>> Therefore, at most one monitor command can be executing in-band at any
>> time.
>> 
>> Therefore, the hash table has at most *two* entries: one (OOB), and one
>> of the other four.
>> 
>> Can you shoot any holes into my argument?
>
> I think with human-monitor-command, you can have three mappings:
>
> 1. The main thread's leader (it is a non-coroutine QMP command) to the
>QMP monitor

This is (QMP).

> 2. With a coroutine HMP command, one mapping from the transient HMP
>coroutine to the transient HMP monitor (with a non-coroutine HMP
>command, we'd instead temporarily change the mapping from 1.)

This is (HMP-CO).

> 3. The OOB entry

This is (OOB).

To get 1. (QMP) and 2, (HMP-CO) at the same time, the in-band,
non-coroutine QMP command needs to execute interleaved with the in-band,
coroutine HMP command.

Such an interleaving contradicts "In-band execution is one command after
the other", which is a fundamental assumption in-band commands may make.
If the assumption is invalid, we got a problem.  Is it?

>> I suspect there's a simpler solution struggling to get out.  But this
>> solution works, so in it goes.  Should the simpler one succeed at
>> getting out, it can go in on top.  If not, I'll probably add even more
>> comments to remind myself of these facts.
>
> I think the simple approach you had posted could work if you can fill in
> the HMP part, but I think it wasn't completely trivial and you told me
> not to bother for now,

Correct.  I decided to go with the code you already tested.

>so I didn't. It may still be a viable path
> forward if you like it better.




Re: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set

2020-09-28 Thread Klaus Jensen
On Sep 28 11:35, Dmitry Fomichev wrote:
> The emulation code has been changed to advertise NVM Command Set when
> "zoned" device property is not set (default) and Zoned Namespace
> Command Set otherwise.
> 
> Handlers for three new NVMe commands introduced in Zoned Namespace
> Command Set specification are added, namely for Zone Management
> Receive, Zone Management Send and Zone Append.
> 
> Device initialization code has been extended to create a proper
> configuration for zoned operation using device properties.
> 
> Read/Write command handler is modified to only allow writes at the
> write pointer if the namespace is zoned. For Zone Append command,
> writes implicitly happen at the write pointer and the starting write
> pointer value is returned as the result of the command. Write Zeroes
> handler is modified to add zoned checks that are identical to those
> done as a part of Write flow.
> 
> The code to support Zone Descriptor Extensions is not included in
> this commit and ZDES 0 is always reported. A later commit in this
> series will add ZDE support.
> 
> This commit doesn't yet include checks for active and open zone
> limits. It is assumed that there are no limits on either active or
> open zones.
> 
> Signed-off-by: Niklas Cassel 
> Signed-off-by: Hans Holmberg 
> Signed-off-by: Ajay Joshi 
> Signed-off-by: Chaitanya Kulkarni 
> Signed-off-by: Matias Bjorling 
> Signed-off-by: Aravind Ramesh 
> Signed-off-by: Shin'ichiro Kawasaki 
> Signed-off-by: Adam Manzanares 
> Signed-off-by: Dmitry Fomichev 
> ---
>  block/nvme.c |   2 +-
>  hw/block/nvme-ns.c   | 185 -
>  hw/block/nvme-ns.h   |   6 +-
>  hw/block/nvme.c  | 872 +--
>  include/block/nvme.h |   6 +-
>  5 files changed, 1033 insertions(+), 38 deletions(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 05485fdd11..7a513c9a17 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -1040,18 +1318,468 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest 
> *req)
>  goto invalid;
>  }
>  
> +if (ns->params.zoned) {
> +zone_idx = nvme_zone_idx(ns, slba);
> +assert(zone_idx < ns->num_zones);
> +zone = &ns->zone_array[zone_idx];
> +
> +if (is_write) {
> +status = nvme_check_zone_write(zone, slba, nlb);
> +if (status != NVME_SUCCESS) {
> +trace_pci_nvme_err_zone_write_not_ok(slba, nlb, status);
> +goto invalid;
> +}
> +
> +assert(nvme_wp_is_valid(zone));
> +if (append) {
> +if (unlikely(slba != zone->d.zslba)) {
> +trace_pci_nvme_err_append_not_at_start(slba, 
> zone->d.zslba);
> +status = NVME_ZONE_INVALID_WRITE | NVME_DNR;
> +goto invalid;
> +}
> +if (data_size > (n->page_size << n->zasl)) {
> +trace_pci_nvme_err_append_too_large(slba, nlb, n->zasl);
> +status = NVME_INVALID_FIELD | NVME_DNR;
> +goto invalid;
> +}
> +slba = zone->w_ptr;
> +} else if (unlikely(slba != zone->w_ptr)) {
> +trace_pci_nvme_err_write_not_at_wp(slba, zone->d.zslba,
> +   zone->w_ptr);
> +status = NVME_ZONE_INVALID_WRITE | NVME_DNR;
> +goto invalid;
> +}
> +req->fill_ofs = -1LL;
> +} else {
> +status = nvme_check_zone_read(ns, zone, slba, nlb);
> +if (status != NVME_SUCCESS) {
> +trace_pci_nvme_err_zone_read_not_ok(slba, nlb, status);
> +goto invalid;
> +}
> +
> +if (slba + nlb > zone->w_ptr) {
> +/*
> + * All or some data is read above the WP. Need to
> + * fill out the buffer area that has no backing data
> + * with a predefined data pattern (zeros by default)
> + */
> +if (slba >= zone->w_ptr) {
> +req->fill_ofs = 0;
> +} else {
> +req->fill_ofs = nvme_l2b(ns, zone->w_ptr - slba);
> +}
> +req->fill_len = nvme_l2b(ns,
> +nvme_zone_rd_boundary(ns, zone) - slba);

OK then. Next edge case.

Now what happens if the read crosses into a partially written zone and
reads above the write pointer in that zone?




Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-28 Thread Klaus Jensen
On Sep 28 02:33, Dmitry Fomichev wrote:
> > -Original Message-
> > From: Klaus Jensen 
> >
> > If it really needs to be memory mapped, then I think a hostmem-based
> > approach similar to what Andrzej did for PMR is needed (I think that
> > will get rid of the CONFIG_POSIX ifdef at least, but still leave it
> > slightly tricky to get it to work on all platforms AFAIK).
> 
> Ok, it looks like using the HostMemoryBackendFile backend will be
> more appropriate. This will remove the need for conditional compilation.
> 
> The mmap() portability is pretty decent across software platforms.
> Any poor Windows user who is forced to emulate ZNS on mingw will be
> able to do so, just without having zone state persistency. Considering
> how specialized this stuff is in the first place, I estimate the number of users
> affected by this "limitation" to be exactly zero.
> 

QEMU is a cross platform project - we should strive for portability.

Alienating developers that use a Windows platform and calling them out
as "poor" is not exactly good for the zoned ecosystem.

> > But really,
> > since we do not require memory semantics for this, then I think the
> > abstraction is fundamentally wrong.
> > 
> 
> Seriously, what is wrong with using mmap :) ? It is used successfully for
> similar applications, for example -
> https://github.com/open-iscsi/tcmu-runner/blob/master/file_zbc.c
> 

There is nothing fundamentally wrong with mmap. I just think it is the
wrong abstraction here (and it limits portability for no good reason).
For PMR there is a good reason - it requires memory semantics.

> > I am, of course, blowing my own horn, since my implementation uses a
> > portable blockdev for this.
> > 
> 
> You are making it sound like the entire WDC series relies on this approach.
> Actually, the persistency is introduced in the second to last patch in the
> series and it only adds a couple of lines of code in the i/o path to mark
> zones dirty. This is possible because of using mmap() and I find the way
> it is done to be quite elegant, not ugly :)
> 

No, I understand that your implementation works fine without
persistence, but persistence is key. That is why my series adds it in
the first patch. Without persistence it is just a toy. And the QEMU
device is not just an "NVMe-version" of null_blk.

And I don't think I ever called the use of mmap ugly. I called out the
physical memory API shenanigans as a hack.

> > Another issue is the complete lack of endian conversions. Does it
> > matter? It depends. Will anyone ever use this on a big endian host and
> > move the meta data backing file to a little endian host? Probably not.
> > So does it really matter? Probably not, but it is cutting corners.
> > 

After I had replied to this, I considered a follow-up, because there are
probably QEMU developers that would call me out on this.

This definitely DOES matter to QEMU.

> 
> Great point on endianness! Naturally, all file backed values are stored in
> their native endianness. This way, there is no extra overhead on big endian
> hardware architectures. Portability concerns can be easily addressed by
> storing metadata endianness as a byte flag in its header. Then, during
> initialization, the metadata validation code can detect the possible
> discrepancy in endianness and automatically convert the metadata to the
> endianness of the host. This part is out of scope of this series, but I would
> be able to contribute such a solution as an enhancement in the future.
> 

It is not out of scope. I don't see why we should merge something that
is arguably buggy.

Bottomline is that I just don't see why we should accept an
implementation that

  a) excludes some platforms (Windows) from using persistence; and
  b) contains endianness conversion issues

when there is a portable implementation posted that at least tries to
convert endianness as needed.

