Re: [PATCH nbd 1/4] nbd: Add multi-conn option

2023-03-10 Thread Richard W.M. Jones
On Fri, Mar 10, 2023 at 04:17:17PM -0600, Eric Blake wrote:
> On Thu, Mar 09, 2023 at 11:39:43AM +, Richard W.M. Jones wrote:
> > + * safe for multi-conn, force it to 1.
> > + */
> > +if (!(s->info.flags & NBD_FLAG_CAN_MULTI_CONN)) {
> > +s->multi_conn = 1;
> > +}
> > +
> >  return 0;
> 
> Is there an intended QAPI counterpart for this command?  We'll need
> that if it is to be set during the command line of
> qemu-storage-daemon.

Does it just need to be added to qapi/block-core.json?
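Presumably the new member would slot into BlockdevOptionsNbd next to @reconnect-delay and @open-timeout. A rough sketch (member list abbreviated, and the exact wording is my guess, not a final schema):

```
# @multi-conn: If greater than 1, permit up to this many connections
#     to the server.  Forced to 1 if the server does not advertise
#     multi-conn support.  (default: 1)

{ 'struct': 'BlockdevOptionsNbd',
  'data': { 'server': 'SocketAddress',
            '*export': 'str',
            '*tls-creds': 'str',
            '*reconnect-delay': 'uint32',
            '*open-timeout': 'uint32',
            '*multi-conn': 'uint32' } }
```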

It's a shame we can't add the API in one place and have everything
generated from there.  Like some kind of 'generator' ...

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html




Re: [PATCH nbd 1/4] nbd: Add multi-conn option

2023-03-10 Thread Eric Blake
On Thu, Mar 09, 2023 at 11:39:43AM +, Richard W.M. Jones wrote:
> Add multi-conn option to the NBD client.  This commit just adds the
> option, it is not functional.

Maybe add the phrase "until later in this patch series" ?

> 
> Setting this to a value > 1 permits multiple connections to the NBD
> server; a typical value might be 4.  The default is 1, meaning only a
> single connection is made.  If the NBD server does not advertise that
> it is safe for multi-conn then this setting is forced to 1.
> 
> Signed-off-by: Richard W.M. Jones 
> ---
>  block/nbd.c | 24 
>  1 file changed, 24 insertions(+)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index bf2894ad5c..5ffae0b798 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -49,6 +49,7 @@
>  
>  #define EN_OPTSTR ":exportname="
>  #define MAX_NBD_REQUESTS16
> +#define MAX_MULTI_CONN  16
>  
>  #define HANDLE_TO_INDEX(bs, handle) ((handle) ^ (uint64_t)(intptr_t)(bs))
>  #define INDEX_TO_HANDLE(bs, index)  ((index)  ^ (uint64_t)(intptr_t)(bs))
> @@ -98,6 +99,7 @@ typedef struct BDRVNBDState {
>  /* Connection parameters */
>  uint32_t reconnect_delay;
>  uint32_t open_timeout;
> +uint32_t multi_conn;
>  SocketAddress *saddr;
>  char *export;
>  char *tlscredsid;
> @@ -1803,6 +1805,15 @@ static QemuOptsList nbd_runtime_opts = {
> >  "attempts until successful or until @open-timeout seconds "
>  "have elapsed. Default 0",
>  },
> +{
> +.name = "multi-conn",
> +.type = QEMU_OPT_NUMBER,
> +.help = "If > 1 permit up to this number of connections to the "
> +"server. The server must also advertise multi-conn "
> +"support.  If <= 1, only a single connection is made "
> > +"to the server even if the server advertises multi-conn. "
> +"Default 1",
> +},
>  { /* end of list */ }
>  },
>  };
> @@ -1858,6 +1869,10 @@ static int nbd_process_options(BlockDriverState *bs, 
> QDict *options,
>  
>  s->reconnect_delay = qemu_opt_get_number(opts, "reconnect-delay", 0);
>  s->open_timeout = qemu_opt_get_number(opts, "open-timeout", 0);
> +s->multi_conn = qemu_opt_get_number(opts, "multi-conn", 1);
> +if (s->multi_conn > MAX_MULTI_CONN) {
> +s->multi_conn = MAX_MULTI_CONN;
> +}

This silently ignores out-of-range values (negative, greater than 16)
and treats 0 as a synonym for 1.  The latter I'm okay with, the former
I wonder if we should instead raise an error that the user is
requesting something we can't honor, instead of silently bounding it.
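For concreteness, the stricter variant could look something like this — a standalone sketch only, with validate_multi_conn() and the errbuf convention made up for illustration (it is not qemu API, which would use Error **errp):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_MULTI_CONN 16

/*
 * Reject out-of-range values instead of silently clamping them.
 * qemu_opt_get_number() yields a uint64_t, so a "negative" command
 * line value shows up here as a huge number and is caught by the
 * upper-bound check.  0 is still treated as a synonym for 1.
 */
static int validate_multi_conn(uint64_t value, uint32_t *out,
                               char *errbuf, size_t errlen)
{
    if (value > MAX_MULTI_CONN) {
        snprintf(errbuf, errlen,
                 "multi-conn must be between 1 and %d", MAX_MULTI_CONN);
        return -1;
    }
    *out = value ? (uint32_t)value : 1;   /* 0 means "single connection" */
    return 0;
}
```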

>  
>  ret = 0;
>  
> @@ -1912,6 +1927,15 @@ static int nbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  
>  nbd_client_connection_enable_retry(s->conn);
>  
> +/*
> + * We set s->multi_conn in nbd_process_options above, but now that
> + * we have connected if the server doesn't advertise that it is

s/connected/connected,/

> + * safe for multi-conn, force it to 1.
> + */
> +if (!(s->info.flags & NBD_FLAG_CAN_MULTI_CONN)) {
> +s->multi_conn = 1;
> +}
> +
>  return 0;

Is there an intended QAPI counterpart for this command?  We'll need
that if it is to be set during the command line of
qemu-storage-daemon.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only]

2023-03-10 Thread Richard W.M. Jones
On Fri, Mar 10, 2023 at 01:04:12PM -0600, Eric Blake wrote:
> How many of these timing numbers can be repeated with TLS in the mix?

While I have been playing with TLS and kTLS recently, it's not
something that is especially important to v2v since all NBD traffic
goes over Unix domain sockets only (i.e. it's used as a kind of
interprocess communication).

I could certainly provide benchmarks, although as I'm going on holiday
shortly it may be a little while.

> > Curl local server test (./multi-conn.pl curlhttp)
> > =
> > 
> > Localhost Apache serving a file over http
> >   |
> >   | http
> >   v
> > nbdkit-curl-plugin   or   qemu-nbd
> >   |
> >   | nbd+unix
> >   v
> > qemu-img convert   or   nbdcopy
> > 
> > We download an image from a local web server through
> > nbdkit-curl-plugin or qemu-nbd using the curl block driver, over NBD.
> > The image is copied to /dev/null.
> > 
> >   server  clientmulti-conn
> >   ---
> >   qemu-nbd nbdcopy  1   8.88s   
> >   qemu-nbd nbdcopy  2   8.64s   
> >   qemu-nbd nbdcopy  4   8.37s   
> >   qemu-nbdqemu-img  [u/s]   6.47s
> 
> Do we have any good feel for why qemu-img is faster than nbdcopy in
> the baseline?  But improving that is orthogonal to this series.

I do not, but we have in the past found that results can be very
sensitive to request size.  By default (and also in all of these
tests) nbdcopy is using a request size of 256K, and qemu-img is using
a request size of 2M.
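Back-of-the-envelope, those request sizes alone put the two copies an 8x apart in round-trip count; trivial arithmetic, but it makes the scale of the per-request overhead explicit:

```c
#include <assert.h>
#include <stdint.h>

/* Round trips needed to copy image_size bytes at a given request size
 * (the last request may be short, hence the round-up division). */
static uint64_t requests_needed(uint64_t image_size, uint64_t request_size)
{
    return (image_size + request_size - 1) / request_size;
}
```

For a 1 GiB image that is 4096 requests at 256K but only 512 at 2M — enough of a gap that fixed per-request costs alone could explain part of the baseline difference.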

> >   qemu-nbdqemu-img  1   6.56s   
> >   qemu-nbdqemu-img  2   6.63s   
> >   qemu-nbdqemu-img  4   6.50s   
> > nbdkit nbdcopy  1   12.15s  
> 
> I'm assuming this is nbdkit with your recent in-progress patches to
> have the curl plugin serve parallel requests.  But another place where
> we can investigate why nbdkit is not as performant as qemu-nbd at
> utilizing curl.
> 
> > nbdkit nbdcopy  2   7.05s   (72.36% better)
> > nbdkit nbdcopy  4   3.54s   (242.90% better)
> 
> That one is impressive!
> 
> > nbdkitqemu-img  [u/s]   6.90s   
> > nbdkitqemu-img  1   7.00s   
> 
> Minimal penalty for adding the code but not utilizing it...

[u/s] and qemu-img with multi-conn:1 ought to be identical actually.
After all, the only difference should be the restructuring of the code
to add the intermediate NBDConnState struct. In this case it's probably
just measurement error.

> > nbdkitqemu-img  2   3.85s   (79.15% better)
> > nbdkitqemu-img  4   3.85s   (79.15% better)
> 
> ...and definitely shows its worth.
> 
> > 
> > 
> > Curl local file test (./multi-conn.pl curlfile)
> > ===
> > 
> > nbdkit-curl-plugin   using file:/// URI
> >   |
> >   | nbd+unix
> >   v
> > qemu-img convert   or   nbdcopy
> > 
> > We download from a file:/// URI.  This test is designed to exercise
> > NBD and some curl internal paths without the overhead from an external
> > server.  qemu-nbd doesn't support file:/// URIs so we cannot duplicate
> > the test for qemu as server.
> > 
> >   server  clientmulti-conn
> >   ---
> > nbdkit nbdcopy  1   31.32s  
> > nbdkit nbdcopy  2   20.29s  (54.38% better)
> > nbdkit nbdcopy  4   13.22s  (136.91% better)
> > nbdkitqemu-img  [u/s]   31.55s  
> 
> Here, the baseline is already comparable; both nbdcopy and qemu-img
> are parsing the image off nbdkit in about the same amount of time.
> 
> > nbdkitqemu-img  1   31.70s  
> 
> And again, minimal penalty for having the new code in place but not
> exploiting it.
> 
> > nbdkitqemu-img  2   21.60s  (46.07% better)
> > nbdkitqemu-img  4   13.88s  (127.25% better)
> 
> Plus an obvious benefit when the parallel sockets matter.
> 
> > 
> > 
> > Curl remote server test (./multi-conn.pl curlremote)
> > 
> > 
> > nbdkit-curl-plugin   using http://remote/*.qcow2 URI
> >  |
> >  | nbd+unix
> >  v
> > qemu-img convert
> > 
> > We download from a remote qcow2 file to a local raw file, converting
> > between formats during copying.
> > 
> > qemu-nbd   using http://remote/*.qcow2 URI
> > |
> > | nbd+unix
> > v
> > qemu-img convert
> > 
> > Similarly, replacing nbdkit with qemu-nbd (treating the remote file as
> > if it is raw, so the conversion is still done by qemu-img).
> > 
> > Additionally we compare downloading the file with wget (note this
> > doesn't include the time for conversion, but that 

Re: [PATCH] hw/block: replace TABs with space Bring the block files in line with the QEMU coding style, with spaces for indentation. Resolves: https://gitlab.com/qemu-project/qemu/-/issues/371

2023-03-10 Thread Eric Blake
Missing the blank line between the one-line summary and the rest of
the commit body, which killed the subject line.

You now have multiple threads referencing gitlab issue 371; maybe
instead of saying 'Resolves:', you should just mention that each patch
is a partial resolution (since one patch in isolation is obviously not
resolving it tree-wide, if you have other series referencing the same
bug).

On Fri, Mar 10, 2023 at 06:31:49PM +0800, Yeqi Fu wrote:
> Signed-off-by: Yeqi Fu 
> ---
>  hw/block/fdc.c |   4 +-
>  hw/block/nand.c| 222 ++---
>  hw/block/onenand.c | 126 -
>  hw/block/tc58128.c | 136 +--
>  4 files changed, 244 insertions(+), 244 deletions(-)
> 

> +++ b/hw/block/onenand.c
> @@ -35,10 +35,10 @@
>  #include "qom/object.h"
>  
>  /* 11 for 2kB-page OneNAND ("2nd generation") and 10 for 1kB-page chips */
> -#define PAGE_SHIFT   11
> +#define PAGE_SHIFT 11
>  
>  /* Fixed */
> -#define BLOCK_SHIFT  (PAGE_SHIFT + 6)
> +#define BLOCK_SHIFT (PAGE_SHIFT + 6)
>  
>  #define TYPE_ONE_NAND "onenand"
>  OBJECT_DECLARE_SIMPLE_TYPE(OneNANDState, ONE_NAND)
> @@ -408,23 +408,23 @@ static void onenand_command(OneNANDState *s)
>  int b;
>  int sec;
>  void *buf;
> -#define SETADDR(block, page) \
> -sec = (s->addr[page] & 3) +  \
> > -((((s->addr[page] >> 2) & 0x3f) +\
> -  (((s->addr[block] & 0xfff) |   \
> -(s->addr[block] >> 15 ?  \

The old code had aligned '\' (harder to see when diff's prefix messes
up tab stops)...

> +#define SETADDR(block, page)\
> +sec = (s->addr[page] & 3) + \
> > +((((s->addr[page] >> 2) & 0x3f) +   \
> +  (((s->addr[block] & 0xfff) |  \
> +(s->addr[block] >> 15 ? \
>   s->density_mask : 0)) << 6)) << (PAGE_SHIFT - 9));

...but the spacing on the new-code is inconsistent.  Whereas I'm
ambivalent on aligned '=' in struct initializations, when it comes to
preprocessor macros, I'm very much a fan of using aligned '\'.

You may also want to mention that 'git diff -w ...' shows no change,
proving that the patch is whitespace only (or if it does show change,
that it is because you rewrapped lines).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only]

2023-03-10 Thread Eric Blake
On Thu, Mar 09, 2023 at 11:39:42AM +, Richard W.M. Jones wrote:
> [ Patch series also available here, along with this cover letter and the
>   script used to generate test results:
>   https://gitlab.com/rwmjones/qemu/-/commits/2023-nbd-multi-conn-v1 ]
> 
> This patch series adds multi-conn support to the NBD block driver in
> qemu.  It is only meant for discussion and testing because it has a
> number of obvious shortcomings (see "XXX" in commit messages and
> code).  If we decided this was a good idea, we can work on a better
> patch.

Overall, I'm in favor of this.  A longer term project might be to have
qemu's NBD client code call into libnbd instead of reimplementing
things itself, at which point having libnbd manage multi-conn under
the hood would be awesome, but as that's a much bigger effort, a
shorter-term task of having qemu itself handle parallel sockets seems
worthwhile.

> 
>  - It works effectively for qemu client & nbdkit server, especially in
>cases where the server does large, heavyweight requests.  This is
>important for us because virt-v2v uses an nbdkit Python plugin and
>various other heavyweight plugins (eg. plugins that access remote
>servers for each request).
> 
>  - It seems to make little or no difference with qemu + qemu-nbd
>server.  I speculate that's because qemu-nbd doesn't support system
>threads, so networking is bottlenecked through a single core.  Even
>though there are coroutines handling different sockets, they must
>all wait in turn to issue send(3) or recv(3) calls on the same
>core.

Is the current work to teach qemu to do multi-queue (that is, spread
the I/O load for a single block device across multiple cores) going to
help here?  I haven't been following the multi-queue efforts closely
enough to know if the approach used in this series will play nicely,
or need even further overhaul.

> 
>  - qemu-img unfortunately uses a single thread for all coroutines so
>it suffers from a similar problem to qemu-nbd.  This change would
>be much more effective if we could distribute coroutines across
>threads.

qemu-img uses the same client code as qemu-nbd; any multi-queue
improvements that can spread the send()/recv() load of multiple
sockets across multiple cores will benefit both programs
simultaneously.

> 
>  - For tests which are highly bottlenecked on disk I/O (eg. the large
>local file test and null test) multi-conn doesn't make much
>difference.

As long as it isn't adding too much penalty, that's okay.  If the
saturation is truly at the point of how fast disk requests can be
served, it doesn't matter if we can queue up more of those requests in
parallel across multiple NBD sockets.

> 
>  - Multi-conn even with only 2 connections can make up for the
>overhead of range requests, exceeding the performance of wget.

That alone is a rather cool result, and an argument in favor of
further developing this.

> 
>  - In the curlremote test, qemu-nbd is especially slow, for unknown
>reasons.
> 
> 
> Integrity test (./multi-conn.pl integrity)
> ==
> 
> nbdkit-sparse-random-plugin
>   | ^
>   | nbd+unix| nbd+unix
>   v |
>qemu-img convert
> 
> Reading from and writing the same data back to nbdkit sparse-random
> plugin checks that the data written is the same as the data read.
> This uses two Unix domain sockets, with or without multi-conn.  This
> test is mainly here to check we don't crash or corrupt data with this
> patch.
> 
>   server  clientmulti-conn
>   ---
> nbdkit  qemu-img  [u/s]   9.07s   
> nbdkit  qemu-img  1   9.05s   
> nbdkit  qemu-img  2   9.02s   
> nbdkit  qemu-img  4   8.98s   
> 
> [u/s] = upstream qemu 7.2.0

How many of these timing numbers can be repeated with TLS in the mix?

> 
> 
> Curl local server test (./multi-conn.pl curlhttp)
> =
> 
> Localhost Apache serving a file over http
>   |
>   | http
>   v
> nbdkit-curl-plugin   or   qemu-nbd
>   |
>   | nbd+unix
>   v
> qemu-img convert   or   nbdcopy
> 
> We download an image from a local web server through
> nbdkit-curl-plugin or qemu-nbd using the curl block driver, over NBD.
> The image is copied to /dev/null.
> 
>   server  clientmulti-conn
>   ---
>   qemu-nbd   nbdcopy  1   8.88s   
>   qemu-nbd   nbdcopy  2   8.64s   
>   qemu-nbd   nbdcopy  4   8.37s   
>   qemu-nbd  qemu-img  [u/s]   6.47s

Do we have any good feel for why qemu-img is faster than nbdcopy in
the baseline?  But improving that is orthogonal to this series.

>   qemu-nbd  qemu-img

[PULL 3/3] qed: remove spurious BDRV_POLL_WHILE()

2023-03-10 Thread Kevin Wolf
From: Stefan Hajnoczi 

This looks like a copy-paste or merge error. BDRV_POLL_WHILE() is
already called above. It's not needed in the qemu_in_coroutine() case.

Fixes: 9fb4dfc570ce ("qed: make bdrv_qed_do_open a coroutine_fn")
Signed-off-by: Stefan Hajnoczi 
Message-Id: <20230309163134.398707-1-stefa...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 block/qed.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/qed.c b/block/qed.c
index ed94bb61ca..0705a7b4e2 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -594,7 +594,6 @@ static int bdrv_qed_open(BlockDriverState *bs, QDict *options, int flags,
 qemu_coroutine_enter(qemu_coroutine_create(bdrv_qed_open_entry, &qoc));
 BDRV_POLL_WHILE(bs, qoc.ret == -EINPROGRESS);
 }
-BDRV_POLL_WHILE(bs, qoc.ret == -EINPROGRESS);
 return qoc.ret;
 }
 
-- 
2.39.2




[PULL 0/3] Block layer patches

2023-03-10 Thread Kevin Wolf
The following changes since commit ee59483267de29056b5b2ee2421ef3844e5c9932:

  Merge tag 'qemu-openbios-20230307' of https://github.com/mcayland/qemu into 
staging (2023-03-09 16:55:03 +)

are available in the Git repository at:

  https://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to ecf8191314798391b1df80bcb829c0ead4f8acc9:

  qed: remove spurious BDRV_POLL_WHILE() (2023-03-10 15:14:46 +0100)


Block layer patches

- fuse: Fix fallocate(PUNCH_HOLE) to zero out the range
- qed: remove spurious BDRV_POLL_WHILE()


Hanna Czenczek (2):
  block/fuse: Let PUNCH_HOLE write zeroes
  iotests/308: Add test for 'write -zu'

Stefan Hajnoczi (1):
  qed: remove spurious BDRV_POLL_WHILE()

 block/export/fuse.c| 11 ++-
 block/qed.c|  1 -
 tests/qemu-iotests/308 | 43 +++
 tests/qemu-iotests/308.out | 35 +++
 4 files changed, 88 insertions(+), 2 deletions(-)




[PULL 2/3] iotests/308: Add test for 'write -zu'

2023-03-10 Thread Kevin Wolf
From: Hanna Czenczek 

Try writing zeroes to a FUSE export while allowing the area to be
unmapped; block/file-posix.c generally implements writing zeroes with
BDRV_REQ_MAY_UNMAP ('write -zu') by calling fallocate(PUNCH_HOLE).  This
used to lead to a blk_pdiscard() in the FUSE export, which may or may
not lead to the area being zeroed.  HEAD^ fixed this to use
blk_pwrite_zeroes() instead (again with BDRV_REQ_MAY_UNMAP), so verify
that running `qemu-io 'write -zu'` on a FUSE exports always results in
zeroes being written.

Signed-off-by: Hanna Czenczek 
Message-Id: <20230227104725.33511-3-hre...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/308 | 43 ++
 tests/qemu-iotests/308.out | 35 +++
 2 files changed, 78 insertions(+)

diff --git a/tests/qemu-iotests/308 b/tests/qemu-iotests/308
index 09275e9a10..de12b2b1b9 100755
--- a/tests/qemu-iotests/308
+++ b/tests/qemu-iotests/308
@@ -370,6 +370,49 @@ echo
 echo '=== Compare copy with original ==='
 
 $QEMU_IMG compare -f raw -F $IMGFMT "$COPIED_IMG" "$TEST_IMG"
+_cleanup_test_img
+
+echo
+echo '=== Writing zeroes while unmapping ==='
+# Regression test for https://gitlab.com/qemu-project/qemu/-/issues/1507
+_make_test_img 64M
+$QEMU_IO -c 'write -s /dev/urandom 0 64M' "$TEST_IMG" | _filter_qemu_io
+
+_launch_qemu
+_send_qemu_cmd $QEMU_HANDLE \
+"{'execute': 'qmp_capabilities'}" \
+'return'
+
+_send_qemu_cmd $QEMU_HANDLE \
+"{'execute': 'blockdev-add',
+  'arguments': {
+  'driver': '$IMGFMT',
+  'node-name': 'node-format',
+  'file': {
+  'driver': 'file',
+  'filename': '$TEST_IMG'
+  }
+  } }" \
+'return'
+
+fuse_export_add 'export' "'mountpoint': '$EXT_MP', 'writable': true"
+
+# Try writing zeroes by unmapping
+$QEMU_IO -f raw -c 'write -zu 0 64M' "$EXT_MP" | _filter_qemu_io
+
+# Check the result
+$QEMU_IO -f raw -c 'read -P 0 0 64M' "$EXT_MP" | _filter_qemu_io
+
+_send_qemu_cmd $QEMU_HANDLE \
+"{'execute': 'quit'}" \
+'return'
+
+wait=yes _cleanup_qemu
+
+# Check the original image
+$QEMU_IO -c 'read -P 0 0 64M' "$TEST_IMG" | _filter_qemu_io
+
+_cleanup_test_img
 
 # success, all done
 echo "*** done"
diff --git a/tests/qemu-iotests/308.out b/tests/qemu-iotests/308.out
index e4467a10cf..d5767133b1 100644
--- a/tests/qemu-iotests/308.out
+++ b/tests/qemu-iotests/308.out
@@ -171,4 +171,39 @@ OK: Post-truncate image size is as expected
 
 === Compare copy with original ===
 Images are identical.
+
+=== Writing zeroes while unmapping ===
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
+wrote 67108864/67108864 bytes at offset 0
+64 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+{'execute': 'qmp_capabilities'}
+{"return": {}}
+{'execute': 'blockdev-add',
+  'arguments': {
+  'driver': 'IMGFMT',
+  'node-name': 'node-format',
+  'file': {
+  'driver': 'file',
+  'filename': 'TEST_DIR/t.IMGFMT'
+  }
+  } }
+{"return": {}}
+{'execute': 'block-export-add',
+  'arguments': {
+  'type': 'fuse',
+  'id': 'export',
+  'node-name': 'node-format',
+  'mountpoint': 'TEST_DIR/t.IMGFMT.fuse', 'writable': true
+  } }
+{"return": {}}
+wrote 67108864/67108864 bytes at offset 0
+64 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 67108864/67108864 bytes at offset 0
+64 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+{'execute': 'quit'}
+{"return": {}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_EXPORT_DELETED", "data": {"id": "export"}}
+read 67108864/67108864 bytes at offset 0
+64 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 *** done
-- 
2.39.2




[PULL 1/3] block/fuse: Let PUNCH_HOLE write zeroes

2023-03-10 Thread Kevin Wolf
From: Hanna Czenczek 

fallocate(2) says about PUNCH_HOLE: "After a successful call, subsequent
reads from this range will return zeros."  As it is, PUNCH_HOLE is
implemented as a call to blk_pdiscard(), which does not guarantee this.

We must call blk_pwrite_zeroes() instead.  The difference to ZERO_RANGE
is that we pass the `BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK` flags to
the call -- the storage is supposed to be unmapped, and a slow fallback
by actually writing zeroes as data is not allowed.
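The fallocate(2) guarantee this relies on can be checked from userspace with a short host-side program — illustration only, independent of the qemu code, and it needs a filesystem that supports hole punching:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Punch a hole over dirty data and verify subsequent reads return zeroes.
 * Returns 1 if they do, 0 if stale data survived, -1 on environment
 * failure (e.g. the filesystem does not support hole punching). */
static int punched_range_reads_zero(void)
{
    char path[] = "/tmp/punchXXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) {
        return -1;
    }
    unlink(path);

    char buf[4096];
    memset(buf, 0xff, sizeof(buf));
    if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) {
        close(fd);
        return -1;
    }

    /* KEEP_SIZE is mandatory when combined with PUNCH_HOLE */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  0, sizeof(buf)) != 0) {
        close(fd);
        return -1;
    }

    if (pread(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) {
        close(fd);
        return -1;
    }
    close(fd);

    for (size_t i = 0; i < sizeof(buf); i++) {
        if (buf[i] != 0) {
            return 0;   /* the "subsequent reads return zeros" promise broke */
        }
    }
    return 1;
}
```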

Closes: https://gitlab.com/qemu-project/qemu/-/issues/1507
Signed-off-by: Hanna Czenczek 
Message-Id: <20230227104725.33511-2-hre...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 block/export/fuse.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index e5fc4af165..06fa41079e 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -673,7 +673,16 @@ static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
 do {
 int size = MIN(length, BDRV_REQUEST_MAX_BYTES);
 
-ret = blk_pdiscard(exp->common.blk, offset, size);
+ret = blk_pwrite_zeroes(exp->common.blk, offset, size,
+BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK);
+if (ret == -ENOTSUP) {
+/*
+ * fallocate() specifies to return EOPNOTSUPP for unsupported
+ * operations
+ */
+ret = -EOPNOTSUPP;
+}
+
 offset += size;
 length -= size;
 } while (ret == 0 && length > 0);
-- 
2.39.2




Re: [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only]

2023-03-10 Thread Vladimir Sementsov-Ogievskiy

On 09.03.23 14:39, Richard W.M. Jones wrote:

[ Patch series also available here, along with this cover letter and the
   script used to generate test results:
   https://gitlab.com/rwmjones/qemu/-/commits/2023-nbd-multi-conn-v1  ]

This patch series adds multi-conn support to the NBD block driver in
qemu.  It is only meant for discussion and testing because it has a
number of obvious shortcomings (see "XXX" in commit messages and
code).  If we decided this was a good idea, we can work on a better
patch.


I looked through the results and the code, and I think that's of course a good 
idea!

We still need smarter integration with reconnect logic.

At least, we shouldn't make several open_timer instances.


Currently, on open() we have open-timeout. That's just a limit for the whole 
nbd_open() - we can do several connection attempts during this time.

It seems we should proceed with success if we succeeded with at least one 
connection. Postponing the remaining connections to be established after open() 
seems good too [*].


Next, we have reconnect-delay. When a connection is lost, the NBD client tries 
to reconnect with no limit on attempts, but after reconnect-delay seconds of 
reconnecting, all in-flight requests that are waiting for the connection are 
simply failed.

When we have several connections and one is broken, I think we shouldn't wait, 
but instead retry the requests on the other working connections. This way we 
don't need several reconnect_delay_timer objects: we need only one, for when 
all connections are lost.
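The per-request side of that could be as small as "pick another live connection before giving up" — a sketch of the selection logic only, with all names made up for illustration (not qemu code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Pick a connection to (re)issue a request on: prefer the connection
 * the request was already using, fall back to any other live one, and
 * return -1 only when every connection is down -- the only case that
 * would need the single reconnect_delay_timer.
 */
static int pick_live_conn(const bool *alive, int nconns, int preferred)
{
    if (preferred >= 0 && preferred < nconns && alive[preferred]) {
        return preferred;
    }
    for (int i = 0; i < nconns; i++) {
        if (alive[i]) {
            return i;
        }
    }
    return -1;
}
```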


Reestablishing additional connections is better done in the background, without 
blocking in-flight requests. And that's the same mechanism by which postponing 
additional connections until after open() should work ([*]).

--
Best regards,
Vladimir




Re: [PATCH v2] block: replace TABs with space

2023-03-10 Thread Eric Blake
On Fri, Mar 10, 2023 at 11:19:05AM +0800, Yeqi Fu wrote:
> Bring the block files in line with the QEMU coding style, with spaces
> for indentation.
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/371

In v1, I mentioned I like (but don't insist on) one-space '='; then
Kevin mentioned he likes aligned '=', although I can see how his
comment may have landed after you had already started editing per my
review.  Note that we do not have a documented rule for which style
should be used.  After reading through your patch, I see v2 went the
route of uniformly using one-space '=', which causes more (one-time)
churn in some files than others.  It may be worth specifically
mentioning this as part of your commit message, especially since it
was a point brought up during review and since not everyone favors
that style.

Maybe I should have also stated earlier: when writing bulk patches, I
try to be conservative on reformatting: while project-wide consistency
is nice, I'm more worried about per-file consistency.  If a single
file has an inconsistent mix between one-space '=' and aligned '=', or
if the alignment has broken, I don't mind fixing that file to look
nicer; however, my fix may tend to move the minority formatted lines
to match the majority, rather than towards any global style.  But if a
file already has what appears to be self-consistent formatting, even
if it is unlike other files, I try to preserve that style.

But what I don't want is to delay your patch on a style argument -
it's not fair for me to hold up your work on something that does not
affect runtime correctness.  So that said, since I'm not the main
block maintainer, you are probably better off waiting for Kevin's
verdict (if he really wants aligned '=' in block files, that's worth a
v3; but if he is okay living with code that doesn't meet his personal
style, then v2 is good enough).

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 0/8] iotests: make meson aware of individual I/O tests

2023-03-10 Thread Hanna Czenczek

On 10.03.23 17:17, Daniel P. Berrangé wrote:

Kevin / Hanna, do you have any comments you want to
make on this, since it is notionally under the maintainership
of the block team? If not, Alex has volunteered to queue this
via his testing tree.


Looks good to me!

So, if it helps:

Acked-by: Hanna Czenczek 




Re: [PATCH v2 0/8] iotests: make meson aware of individual I/O tests

2023-03-10 Thread Daniel P . Berrangé
Kevin / Hanna, do you have any comments you want to
make on this, since it is notionally under the maintainership
of the block team? If not, Alex has volunteered to queue this
via his testing tree.

On Fri, Mar 03, 2023 at 04:07:19PM +, Daniel P. Berrangé wrote:
> To just repeat the patch 5 description...
> 
> Currently meson registers a single test that invokes an entire group of
> I/O tests, hiding the test granularity from meson. There are various
> downsides of doing this
> 
>  * You cannot ask 'meson test' to invoke a single I/O test
>  * The meson test timeout can't be applied to the individual
>tests
>  * Meson only gets a pass/fail for the overall I/O test group
>not individual tests
>  * Meson can't show the time of individual I/O tests, so we
>can't see why 4-5 are consuming the bulk of the time
>and ripe for optimization
>  * If a CI job gets killed by the GitLab timeout, we don't
>get visibility into how far through the I/O tests
>execution got.
> 
> This is not really specific to the I/O tests, the problem is common
> to any case of us running a test which is in fact another test
> harness which runs many tests. It would be nice to have meson have
> the full view of all tests run. Adapting the I/O tests is an easy
> win in this respect.
> 
> This switches meson to perform test discovery by invoking 'check' in
> dry-run mode. It then registers one meson test case for each I/O
> test. Parallel execution remains disabled since the I/O tests do not
> use self contained execution environments and thus conflict with
> each other.
> 
> Compare and contrast the output from a current job:
> 
>   https://gitlab.com/qemu-project/qemu/-/jobs/3863603546
> 
> [quote]
> 204/224 qemu:block / qemu-iotests qcow2   OK 329.94s   119 subtests passed
> [/quote]
> 
> Vs what is seen with this series:
> 
>   https://gitlab.com/berrange/qemu/-/jobs/3865975463
> 
> [quote]
> 204/350 qemu:block / qemu-iotests-qcow2-001   OK   2.16s   1 subtests passed
> 205/350 qemu:block / qemu-iotests-qcow2-002   OK   2.77s   1 subtests passed
> 
> ...snip...
> 
> 329/350 qemu:block / qemu-iotests-qcow2-qemu-img-close-errors   OK   6.19s   1 subtests passed
> 330/350 qemu:block / qemu-iotests-qcow2-qsd-jobs   OK   0.55s   1 subtests passed
> [/quote]
> 
> A few tweaks were needed to the iotests runner because it had a few
> assumptions about it always running in a tree that has already been
> built, which is obviously not the case at the time meson does test
> discovery.
> 
> In v2:
> 
> New example pipeline job
> 
>https://gitlab.com/berrange/qemu/-/jobs/3871446106
> 
>  * Set build/source dir defaults in CLI option parser
>instead of testenv.py (Alex)
>  * Fix messed up termios settings with parallel execution
>by connecting stdin to /dev/null (Thomas)
>  * Remove the obsolete check-block.sh script (Thomas)
>  * Use a unique sub-directory per test to allow parallelization (Thomas)
>  * Enable parallel execution by meson (Thomas)
>  * Remove leftover debugging message (Thomas)
>  * Use a shorter meson test name 'io-qcow2-012' instead of
>'qemu-iotests-qcow2-012'
> 
> Daniel P. Berrangé (8):
>   iotests: explicitly pass source/build dir to 'check' command
>   iotests: allow test discovery before building
>   iotests: strip subdir path when listing tests
>   iotests: print TAP protocol version when reporting tests
>   iotests: connect stdin to /dev/null when running tests
>   iotests: always use a unique sub-directory per test
>   iotests: register each I/O test separately with meson
>   iotests: remove the check-block.sh script
> 
>  tests/check-block.sh | 43 
>  tests/qemu-iotests/check | 30 +++---
>  tests/qemu-iotests/meson.build   | 35 +-
>  tests/qemu-iotests/testenv.py| 20 +++
>  tests/qemu-iotests/testrunner.py | 43 ++--
>  5 files changed, 78 insertions(+), 93 deletions(-)
>  delete mode 100755 tests/check-block.sh
> 
> -- 
> 2.39.2
> 

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH for-8.0] ide: Fix manual in-flight count for TRIM BH

2023-03-10 Thread Paolo Bonzini
On Fri, Mar 10, 2023 at 3:25 PM Kevin Wolf  wrote:
> > 1. The TRIM operation should be completed on the IDE level before
> > draining ends.
> > 2. Block layer requests issued after draining has begun are queued.
> >
> > To me, the conclusion seems to be:
> > Issue all block layer requests belonging to the IDE TRIM operation up
> > front.
> >
> > The other alternative I see is to break assumption 2, introduce a way
> > to not queue certain requests while drained, and use it for the
> > recursive requests issued by ide_issue_trim_cb. But not the initial
> > one, if that would defeat the purpose of request queuing. Of course
> > this can't be done if QEMU relies on the assumption in other places
> > already.
>
> I feel like this should be allowed because if anyone has exclusive
> access in this scenario, it's IDE, so it should be able to bypass the
> queuing. Of course, the queuing is still needed if someone else drained
> the backend, so we can't just make TRIM bypass it in general. And if you
> make it conditional on IDE being in blk_drain(), it already starts to
> become ugly again...
>
> So maybe the while loop is unavoidable.
>
> Hmm... But could ide_cancel_dma_sync() just directly use
> AIO_WAIT_WHILE(s->bus->dma->aiocb) instead of using blk_drain()?

While that should work, it would not fix other uses of
bdrv_drain_all(), for example in softmmu/cpus.c. Stopping the device
model relies on those to run *until the device model has finished
submitting requests*.

So I still think that this bug is a symptom of a problem in the design
of request queuing.

In fact, shouldn't request queuing be enabled at the _end_ of
bdrv_drained_begin (once the BlockBackend has reached a quiescent
state on its own terms), rather than at the beginning (which leads to
deadlocks like this one)?

blk->quiesce_counter becomes just a nesting counter for
drained_begin/end, with no uses outside, and blk_wait_while_drained
uses a new counter. Then you have something like this in
blk_root_drained_poll:

if (blk->dev_ops && blk->dev_ops->drained_poll) {
busy = blk->dev_ops->drained_poll(blk->dev_opaque);
}
busy |= !!blk->in_flight;
if (!busy) {
    qatomic_set(&blk->queue_requests, true);
}
return busy;

And there's no need to touch IDE at all.
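As a toy illustration of that ordering (plain C, not QEMU code; ToyBlk and toy_drained_poll are made-up names), queuing would only be armed once the drained-poll callback observes a quiescent backend:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the proposal above -- not QEMU code.  New-request
 * queuing is armed only once the drained-poll callback sees a
 * quiescent backend, so a device like IDE can keep submitting until
 * it has finished on its own terms. */
typedef struct ToyBlk {
    int in_flight;        /* submitted but not yet completed requests */
    bool queue_requests;  /* armed only once the backend is quiescent */
} ToyBlk;

/* Returns true while the backend is still busy (mirroring
 * blk_root_drained_poll above); arms request queuing once idle. */
static bool toy_drained_poll(ToyBlk *blk)
{
    bool busy = blk->in_flight > 0;

    if (!busy) {
        blk->queue_requests = true;
    }
    return busy;
}
```

While toy_drained_poll() keeps reporting busy, nothing is queued, which is exactly why the TRIM deadlock cannot form in this model.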

Paolo




Re: [PATCH for-8.0] ide: Fix manual in-flight count for TRIM BH

2023-03-10 Thread Kevin Wolf
On 10.03.2023 at 14:05, Fiona Ebner wrote:
> On 09.03.23 at 18:46, Kevin Wolf wrote:
> > On 09.03.2023 at 14:59, Paolo Bonzini wrote:
> >> On 3/9/23 13:31, Hanna Czenczek wrote:
> >>> On 09.03.23 13:08, Paolo Bonzini wrote:
>  On Thu, Mar 9, 2023 at 1:05 PM Paolo Bonzini  wrote:
> > I think having to do this is problematic, because the blk_drain should
> > leave no pending operation.
> >
> > Here it seems okay because you do it in a controlled situation, but the
> > main thread can also issue blk_drain(), or worse bdrv_drained_begin(),
> > and there would be pending I/O operations when it returns.
> >>>
> >>> Not really.  We would stop in the middle of a trim that processes a list
> >>> of discard requests.  So I see it more like stopping in the middle of
> >>> anything that processes guest requests.  Once drain ends, we continue
> >>> processing them, and that’s not exactly pending I/O.
> >>>
> >>> There is a pending object in s->bus->dma->aiocb on the IDE side, so
> >>> there is a pending DMA operation, but naïvely, I don’t see that as a
> >>> problem.
> >>
> >> What about the bdrv_drain_all() when a VM stops, would the guest continue 
> >> to
> >> access memory and disks after bdrv_drain() return?
> > 
> > That one shouldn't be a problem because the devices are stopped before
> > the backends.
> > 
> >> Migration could also be a problem, because the partial TRIM would not be
> >> recorded in the s->bus->error_status field of IDEState (no surprise there,
> >> it's not an error).  Also, errors happening after bdrv_drain() might not be
> >> migrated correctly.
> > 
> > Yes, I think it would be good to have the I/O operation fully completed
> > on the IDE level rather than just in the block layer.
> > 
> >>> Or the issue is generally that IDE uses dma_* functions, which might
> >>> cause I/O functions to be run from new BHs (I guess through
> >>> reschedule_dma()?).
> > 
> > None of those complicated scenarios actually. The problem solved by the
> > request queuing is simply that nothing else stops the guest from
> > submitting new requests to drained nodes if the CPUs are still running.
> > 
> > Drain uses aio_disable_external() to disable processing of external
> > events, in particular the ioeventfd used by virtio-blk and virtio-scsi.
> > But IDE submits requests through MMIO or port I/O, i.e. the vcpu thread
> > exits to userspace and calls directly into the IDE code, so it's
> > completely unaffected by aio_disable_external().
> > 
> >> Ah, you mean that you can have pending I/O operations while blk->in_flight
> >> is zero?  That would be a problem indeed.  We already have BlockDevOps for
> >> ide-cd and ide-hd, should we add a .drained_poll callback there?
> > 
> > To be more precise, you suggested in the call that .drained_poll should
> > return that IDE is still busy while aiocb != NULL. Without having looked
> > at the code in detail yet, that seems to make sense to me. And maybe
> > even the blk_inc/dec_in_flight() pair can then go completely away
> > because IDE takes care of its drain state itself then.
> > 
> 
> I assume the addition of drained_poll is meant to be orthogonal to the
> fix of the deadlock? At least I can't see how it would help there?

You're right, it doesn't.

Basically we're running into the old problem again that draining is
overloaded with two different meanings: I want exclusive access to the
backend or I want to wait for all my requests to complete. IDE or more
generally blk_drain() wants the latter, but queuing is done for
implementing the former.

> If we have the assumptions:
> 1. The TRIM operation should be completed on the IDE level before
> draining ends.
> 2. Block layer requests issued after draining has begun are queued.
> 
> To me, the conclusion seems to be:
> Issue all block layer requests belonging to the IDE TRIM operation up
> front.
> 
> The other alternative I see is to break assumption 2, introduce a way
> to not queue certain requests while drained, and use it for the
> recursive requests issued by ide_issue_trim_cb. But not the initial
> one, if that would defeat the purpose of request queuing. Of course
> this can't be done if QEMU relies on the assumption in other places
> already.

I feel like this should be allowed because if anyone has exclusive
access in this scenario, it's IDE, so it should be able to bypass the
queuing. Of course, the queuing is still needed if someone else drained
the backend, so we can't just make TRIM bypass it in general. And if you
make it conditional on IDE being in blk_drain(), it already starts to
become ugly again...

So maybe the while loop is unavoidable.

Hmm... But could ide_cancel_dma_sync() just directly use
AIO_WAIT_WHILE(s->bus->dma->aiocb) instead of using blk_drain()?

Kevin




Re: [PATCH for-8.0] ide: Fix manual in-flight count for TRIM BH

2023-03-10 Thread Fiona Ebner
On 09.03.23 at 18:46, Kevin Wolf wrote:
> On 09.03.2023 at 14:59, Paolo Bonzini wrote:
>> On 3/9/23 13:31, Hanna Czenczek wrote:
>>> On 09.03.23 13:08, Paolo Bonzini wrote:
 On Thu, Mar 9, 2023 at 1:05 PM Paolo Bonzini  wrote:
> I think having to do this is problematic, because the blk_drain should
> leave no pending operation.
>
> Here it seems okay because you do it in a controlled situation, but the
> main thread can also issue blk_drain(), or worse bdrv_drained_begin(),
> and there would be pending I/O operations when it returns.
>>>
>>> Not really.  We would stop in the middle of a trim that processes a list
>>> of discard requests.  So I see it more like stopping in the middle of
>>> anything that processes guest requests.  Once drain ends, we continue
>>> processing them, and that’s not exactly pending I/O.
>>>
>>> There is a pending object in s->bus->dma->aiocb on the IDE side, so
>>> there is a pending DMA operation, but naïvely, I don’t see that as a
>>> problem.
>>
>> What about the bdrv_drain_all() when a VM stops, would the guest continue to
>> access memory and disks after bdrv_drain() return?
> 
> That one shouldn't be a problem because the devices are stopped before
> the backends.
> 
>> Migration could also be a problem, because the partial TRIM would not be
>> recorded in the s->bus->error_status field of IDEState (no surprise there,
>> it's not an error).  Also, errors happening after bdrv_drain() might not be
>> migrated correctly.
> 
> Yes, I think it would be good to have the I/O operation fully completed
> on the IDE level rather than just in the block layer.
> 
>>> Or the issue is generally that IDE uses dma_* functions, which might
>>> cause I/O functions to be run from new BHs (I guess through
>>> reschedule_dma()?).
> 
> None of those complicated scenarios actually. The problem solved by the
> request queuing is simply that nothing else stops the guest from
> submitting new requests to drained nodes if the CPUs are still running.
> 
> Drain uses aio_disable_external() to disable processing of external
> events, in particular the ioeventfd used by virtio-blk and virtio-scsi.
> But IDE submits requests through MMIO or port I/O, i.e. the vcpu thread
> exits to userspace and calls directly into the IDE code, so it's
> completely unaffected by aio_disable_external().
> 
>> Ah, you mean that you can have pending I/O operations while blk->in_flight
>> is zero?  That would be a problem indeed.  We already have BlockDevOps for
>> ide-cd and ide-hd, should we add a .drained_poll callback there?
> 
> To be more precise, you suggested in the call that .drained_poll should
> return that IDE is still busy while aiocb != NULL. Without having looked
> at the code in detail yet, that seems to make sense to me. And maybe
> even the blk_inc/dec_in_flight() pair can then go completely away
> because IDE takes care of its drain state itself then.
> 

I assume the addition of drained_poll is meant to be orthogonal to the
fix of the deadlock? At least I can't see how it would help there?

If we have the assumptions:
1. The TRIM operation should be completed on the IDE level before
draining ends.
2. Block layer requests issued after draining has begun are queued.

To me, the conclusion seems to be:
Issue all block layer requests belonging to the IDE TRIM operation up front.

The other alternative I see is to break assumption 2, introduce a way to
not queue certain requests while drained, and use it for the recursive
requests issued by ide_issue_trim_cb. But not the initial one, if that
would defeat the purpose of request queuing. Of course this can't be
done if QEMU relies on the assumption in other places already.
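To make "issue everything up front" concrete, here is a hedged sketch (plain C, not QEMU code; TrimRange and collect_trim_ranges are made-up names). It decodes the ATA DSM TRIM payload — 64-bit words with the starting LBA in bits 47:0 and the sector count in bits 63:48 — into a flat list first, so all resulting discards could be submitted before any completion callback runs:

```c
#include <stddef.h>
#include <stdint.h>

/* Hedged sketch, not QEMU code: decode ATA DSM TRIM entries (one
 * 64-bit word each: bits 47:0 = starting LBA, bits 63:48 = sector
 * count) into a flat list up front, rather than issuing one discard
 * per completion callback as ide_issue_trim_cb does today. */
typedef struct TrimRange {
    uint64_t lba;
    uint16_t count;
} TrimRange;

static size_t collect_trim_ranges(const uint64_t *entries, size_t n,
                                  TrimRange *out)
{
    size_t m = 0;

    for (size_t i = 0; i < n; i++) {
        uint16_t count = entries[i] >> 48;

        if (count == 0) {
            continue;   /* zero-length entries are padding */
        }
        out[m].lba = entries[i] & 0xffffffffffffULL;
        out[m].count = count;
        m++;
    }
    return m;
}
```

With the ranges collected, a single loop could issue every discard before yielding, so none of them is submitted after draining has begun.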

Best Regards,
Fiona




[PATCH] hw/ide: replace TABs with space

2023-03-10 Thread Yeqi Fu
Bring the block files in line with the QEMU coding style, with spaces for
indentation.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/371

Signed-off-by: Yeqi Fu 
---
 hw/ide/cmd646.c   |  28 +--
 hw/ide/core.c |  84 -
 hw/ide/microdrive.c   | 360 +++---
 include/hw/ide/internal.h | 254 +--
 4 files changed, 363 insertions(+), 363 deletions(-)

diff --git a/hw/ide/cmd646.c b/hw/ide/cmd646.c
index 26a90ed45f..a68357c1c5 100644
--- a/hw/ide/cmd646.c
+++ b/hw/ide/cmd646.c
@@ -36,20 +36,20 @@
 #include "trace.h"
 
 /* CMD646 specific */
-#define CFR0x50
-#define   CFR_INTR_CH0 0x04
-#define CNTRL  0x51
-#define   CNTRL_EN_CH0 0x04
-#define   CNTRL_EN_CH1 0x08
-#define ARTTIM23   0x57
-#defineARTTIM23_INTR_CH1   0x10
-#define MRDMODE0x71
-#define   MRDMODE_INTR_CH0 0x04
-#define   MRDMODE_INTR_CH1 0x08
-#define   MRDMODE_BLK_CH0  0x10
-#define   MRDMODE_BLK_CH1  0x20
-#define UDIDETCR0  0x73
-#define UDIDETCR1  0x7B
+#define CFR  0x50
+#define   CFR_INTR_CH0   0x04
+#define CNTRL0x51
+#define   CNTRL_EN_CH0   0x04
+#define   CNTRL_EN_CH1   0x08
+#define ARTTIM23 0x57
+#defineARTTIM23_INTR_CH1 0x10
+#define MRDMODE  0x71
+#define   MRDMODE_INTR_CH0   0x04
+#define   MRDMODE_INTR_CH1   0x08
+#define   MRDMODE_BLK_CH00x10
+#define   MRDMODE_BLK_CH10x20
+#define UDIDETCR00x73
+#define UDIDETCR10x7B
 
 static void cmd646_update_irq(PCIDevice *pd);
 
diff --git a/hw/ide/core.c b/hw/ide/core.c
index 2d034731cf..09b2f4a310 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -318,52 +318,52 @@ static void ide_cfata_identify(IDEState *s)
 
 cur_sec = s->cylinders * s->heads * s->sectors;
 
-put_le16(p + 0, 0x848a);   /* CF Storage Card signature */
-put_le16(p + 1, s->cylinders); /* Default cylinders */
-put_le16(p + 3, s->heads); /* Default heads */
-put_le16(p + 6, s->sectors);   /* Default sectors per track */
+put_le16(p + 0, 0x848a);/* CF Storage Card signature */
+put_le16(p + 1, s->cylinders);  /* Default cylinders */
+put_le16(p + 3, s->heads);  /* Default heads */
+put_le16(p + 6, s->sectors);/* Default sectors per track */
 /* *(p + 7) := nb_sectors >> 16 -- see ide_cfata_identify_size */
 /* *(p + 8) := nb_sectors   -- see ide_cfata_identify_size */
 padstr((char *)(p + 10), s->drive_serial_str, 20); /* serial number */
-put_le16(p + 22, 0x0004);  /* ECC bytes */
-padstr((char *) (p + 23), s->version, 8);  /* Firmware Revision */
+put_le16(p + 22, 0x0004);   /* ECC bytes */
+padstr((char *) (p + 23), s->version, 8);   /* Firmware Revision */
 padstr((char *) (p + 27), s->drive_model_str, 40);/* Model number */
 #if MAX_MULT_SECTORS > 1
 put_le16(p + 47, 0x8000 | MAX_MULT_SECTORS);
 #else
 put_le16(p + 47, 0x);
 #endif
-put_le16(p + 49, 0x0f00);  /* Capabilities */
-put_le16(p + 51, 0x0002);  /* PIO cycle timing mode */
-put_le16(p + 52, 0x0001);  /* DMA cycle timing mode */
-put_le16(p + 53, 0x0003);  /* Translation params valid */
-put_le16(p + 54, s->cylinders);/* Current cylinders */
-put_le16(p + 55, s->heads);/* Current heads */
-put_le16(p + 56, s->sectors);  /* Current sectors */
-put_le16(p + 57, cur_sec); /* Current capacity */
-put_le16(p + 58, cur_sec >> 16);   /* Current capacity */
-if (s->mult_sectors)   /* Multiple sector setting */
+put_le16(p + 49, 0x0f00);   /* Capabilities */
+put_le16(p + 51, 0x0002);   /* PIO cycle timing mode */
+put_le16(p + 52, 0x0001);   /* DMA cycle timing mode */
+put_le16(p + 53, 0x0003);   /* Translation params valid */
+put_le16(p + 54, s->cylinders); /* Current cylinders */
+put_le16(p + 55, s->heads); /* Current heads */
+put_le16(p + 56, s->sectors);   /* Current sectors */
+put_le16(p + 57, cur_sec);  /* Current capacity */
+put_le16(p + 58, cur_sec >> 16);/* Current capacity */
+if (s->mult_sectors)/* Multiple sector setting */
 put_le16(p + 59, 0x100 | s->mult_sectors);
 /* *(p + 60) := nb_sectors   -- see ide_cfata_identify_size */
 /* *(p + 61) := nb_sectors >> 16 -- see ide_cfata_identify_size */
-put_le16(p + 63, 0x0203);  /* Multiword DMA capability */
-put_le16(p + 64, 0x0001);  /* Flow Control PIO support */
-put_le16(p + 65, 0x0096);  /* Min. Multiword DMA cycle */
-put_le16(p + 66, 0x0096);  /* Rec. Multiword DMA cycle */
-put_le16(p + 68, 0x00b4);  /* Min. PIO cycle time */
-put_le16(p + 82, 0x400c);

[PATCH v2] hw/block: replace TABs with space

2023-03-10 Thread Yeqi Fu
Bring the block files in line with the QEMU coding style, with spaces for
indentation.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/371

Signed-off-by: Yeqi Fu 
---
 hw/block/fdc.c   |   4 +-
 hw/block/nand.c  | 222 +++
 hw/block/onenand.c   | 126 +++---
 hw/block/tc58128.c   | 138 
 include/hw/block/flash.h |  20 ++--
 5 files changed, 255 insertions(+), 255 deletions(-)

diff --git a/hw/block/fdc.c b/hw/block/fdc.c
index 64ae4a6899..d7cc4d3ec1 100644
--- a/hw/block/fdc.c
+++ b/hw/block/fdc.c
@@ -601,8 +601,8 @@ enum {
 };
 
 enum {
-FD_STATE_MULTI  = 0x01,/* multi track flag */
-FD_STATE_FORMAT = 0x02,/* format flag */
+FD_STATE_MULTI  = 0x01, /* multi track flag */
+FD_STATE_FORMAT = 0x02, /* format flag */
 };
 
 enum {
diff --git a/hw/block/nand.c b/hw/block/nand.c
index 1aee1cb2b1..d4402f2c65 100644
--- a/hw/block/nand.c
+++ b/hw/block/nand.c
@@ -30,33 +30,33 @@
 #include "qemu/module.h"
 #include "qom/object.h"
 
-# define NAND_CMD_READ00x00
-# define NAND_CMD_READ10x01
-# define NAND_CMD_READ20x50
-# define NAND_CMD_LPREAD2  0x30
-# define NAND_CMD_NOSERIALREAD20x35
-# define NAND_CMD_RANDOMREAD1  0x05
-# define NAND_CMD_RANDOMREAD2  0xe0
-# define NAND_CMD_READID   0x90
-# define NAND_CMD_RESET0xff
-# define NAND_CMD_PAGEPROGRAM1 0x80
-# define NAND_CMD_PAGEPROGRAM2 0x10
-# define NAND_CMD_CACHEPROGRAM20x15
-# define NAND_CMD_BLOCKERASE1  0x60
-# define NAND_CMD_BLOCKERASE2  0xd0
-# define NAND_CMD_READSTATUS   0x70
-# define NAND_CMD_COPYBACKPRG1 0x85
-
-# define NAND_IOSTATUS_ERROR   (1 << 0)
-# define NAND_IOSTATUS_PLANE0  (1 << 1)
-# define NAND_IOSTATUS_PLANE1  (1 << 2)
-# define NAND_IOSTATUS_PLANE2  (1 << 3)
-# define NAND_IOSTATUS_PLANE3  (1 << 4)
+# define NAND_CMD_READ0 0x00
+# define NAND_CMD_READ1 0x01
+# define NAND_CMD_READ2 0x50
+# define NAND_CMD_LPREAD2   0x30
+# define NAND_CMD_NOSERIALREAD2 0x35
+# define NAND_CMD_RANDOMREAD1   0x05
+# define NAND_CMD_RANDOMREAD2   0xe0
+# define NAND_CMD_READID0x90
+# define NAND_CMD_RESET 0xff
+# define NAND_CMD_PAGEPROGRAM1  0x80
+# define NAND_CMD_PAGEPROGRAM2  0x10
+# define NAND_CMD_CACHEPROGRAM2 0x15
+# define NAND_CMD_BLOCKERASE1   0x60
+# define NAND_CMD_BLOCKERASE2   0xd0
+# define NAND_CMD_READSTATUS0x70
+# define NAND_CMD_COPYBACKPRG1  0x85
+
+# define NAND_IOSTATUS_ERROR(1 << 0)
+# define NAND_IOSTATUS_PLANE0   (1 << 1)
+# define NAND_IOSTATUS_PLANE1   (1 << 2)
+# define NAND_IOSTATUS_PLANE2   (1 << 3)
+# define NAND_IOSTATUS_PLANE3   (1 << 4)
 # define NAND_IOSTATUS_READY(1 << 6)
-# define NAND_IOSTATUS_UNPROTCT(1 << 7)
+# define NAND_IOSTATUS_UNPROTCT (1 << 7)
 
-# define MAX_PAGE  0x800
-# define MAX_OOB   0x40
+# define MAX_PAGE   0x800
+# define MAX_OOB0x40
 
 typedef struct NANDFlashState NANDFlashState;
 struct NANDFlashState {
@@ -102,40 +102,40 @@ static void mem_and(uint8_t *dest, const uint8_t *src, 
size_t n)
 }
 }
 
-# define NAND_NO_AUTOINCR  0x0001
-# define NAND_BUSWIDTH_16  0x0002
-# define NAND_NO_PADDING   0x0004
-# define NAND_CACHEPRG 0x0008
-# define NAND_COPYBACK 0x0010
-# define NAND_IS_AND   0x0020
-# define NAND_4PAGE_ARRAY  0x0040
-# define NAND_NO_READRDY   0x0100
-# define NAND_SAMSUNG_LP   (NAND_NO_PADDING | NAND_COPYBACK)
+# define NAND_NO_AUTOINCR   0x0001
+# define NAND_BUSWIDTH_16   0x0002
+# define NAND_NO_PADDING0x0004
+# define NAND_CACHEPRG  0x0008
+# define NAND_COPYBACK  0x0010
+# define NAND_IS_AND0x0020
+# define NAND_4PAGE_ARRAY   0x0040
+# define NAND_NO_READRDY0x0100
+# define NAND_SAMSUNG_LP(NAND_NO_PADDING | NAND_COPYBACK)
 
 # define NAND_IO
 
-# define PAGE(addr)((addr) >> ADDR_SHIFT)
-# define PAGE_START(page)   (PAGE(page) * (NAND_PAGE_SIZE + OOB_SIZE))
-# define PAGE_MASK ((1 << ADDR_SHIFT) - 1)
-# define OOB_SHIFT (PAGE_SHIFT - 5)
-# define OOB_SIZE  (1 << OOB_SHIFT)
-# define SECTOR(addr)  ((addr) >> (9 + ADDR_SHIFT - PAGE_SHIFT))
-# define SECTOR_OFFSET(addr)   ((addr) & ((511 >> PAGE_SHIFT) << 8))
-
-# define NAND_PAGE_SIZE 256
-# define PAGE_SHIFT8
-# define PAGE_SECTORS  1
-# define ADDR_SHIFT8
+# define PAGE(addr)  ((addr) >> ADDR_SHIFT)
+# define PAGE_START(page)(PAGE(page) * (NAND_PAGE_SIZE + OOB_SIZE))
+# define PAGE_MASK   ((1 << ADDR_SHIFT) - 1)
+# define OOB_SHIFT   (PAGE_SHIFT - 5)
+# define OOB_SIZE(1 << OOB_SHIFT)
+# define SECTOR(addr)((addr) >> (9 + ADDR_SHIFT - PAGE_SHIFT))
+# define SECTOR_OFFSET(addr) ((addr) & ((511 >> PAGE_SHIFT) << 8))
+
+# define NAND_PAGE_SIZE 256
+# define PAGE_SHIFT 8
+# define PAGE_SECTORS   1
+# define ADDR_SHIFT 8
 # include "nand.c"
-# define 

Re: [PATCH v10 07/12] parallels: Move check of cluster outside image to a separate function

2023-03-10 Thread Hanna Czenczek

On 10.03.23 12:08, Alexander Ivanov wrote:



On 3/7/23 13:17, Hanna Czenczek wrote:

On 03.02.23 10:18, Alexander Ivanov wrote:

We will add more and more checks so we need a better code structure in
parallels_co_check. Let each check be performed in a separate loop in a
separate helper.

Signed-off-by: Alexander Ivanov 
Reviewed-by: Denis V. Lunev 
---
  block/parallels.c | 81 
++-

  1 file changed, 52 insertions(+), 29 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index 02fbaee1f2..f9acee1fa8 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -438,14 +438,13 @@ static void 
parallels_check_unclean(BlockDriverState *bs,

  }
  }
  -static int coroutine_fn parallels_co_check(BlockDriverState *bs,
-   BdrvCheckResult *res,
-   BdrvCheckMode fix)
+static int parallels_check_outside_image(BlockDriverState *bs,
+ BdrvCheckResult *res,
+ BdrvCheckMode fix)


I wonder, should we mark this function coroutine_fn?  And with the 
graph lock changes that happened in the meantime, probably also 
GRAPH_RDLOCK (because it’s calling bdrv_getlength()).


Hanna


Thank you for your review.

It seems parallels_collect_statistics(), parallels_check_unclean() 
and parallels_set_bat_entry() should also be marked as coroutine_fn, 
shouldn't they?


There should be no harm in doing so.  I wasn’t mentioning it just 
because those functions don’t call other coroutine functions themselves, 
so I think it doesn’t really matter.
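For reference, the annotations under discussion would look roughly like this. This is only a sketch: coroutine_fn and GRAPH_RDLOCK are QEMU attribute macros (stubbed out here so the fragment stands alone), and the function body is a placeholder, not the real check:

```c
/* Stand-ins for QEMU's annotation macros, which normally expand to
 * compiler attributes used by static analysis; stubbed so this
 * sketch compiles on its own. */
#define coroutine_fn
#define GRAPH_RDLOCK

typedef struct BlockDriverState BlockDriverState;
typedef struct BdrvCheckResult BdrvCheckResult;
typedef int BdrvCheckMode;

/* coroutine_fn because the helper runs in coroutine context from
 * parallels_co_check; GRAPH_RDLOCK because it calls bdrv_getlength().
 * Body is a placeholder. */
static int coroutine_fn GRAPH_RDLOCK
parallels_check_outside_image(BlockDriverState *bs,
                              BdrvCheckResult *res,
                              BdrvCheckMode fix)
{
    (void)bs; (void)res; (void)fix;
    return 0;
}
```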


Hanna




Re: [PATCH v10 07/12] parallels: Move check of cluster outside image to a separate function

2023-03-10 Thread Alexander Ivanov




On 3/7/23 13:17, Hanna Czenczek wrote:

On 03.02.23 10:18, Alexander Ivanov wrote:

We will add more and more checks so we need a better code structure in
parallels_co_check. Let each check be performed in a separate loop in a
separate helper.

Signed-off-by: Alexander Ivanov 
Reviewed-by: Denis V. Lunev 
---
  block/parallels.c | 81 ++-
  1 file changed, 52 insertions(+), 29 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index 02fbaee1f2..f9acee1fa8 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -438,14 +438,13 @@ static void 
parallels_check_unclean(BlockDriverState *bs,

  }
  }
  -static int coroutine_fn parallels_co_check(BlockDriverState *bs,
-   BdrvCheckResult *res,
-   BdrvCheckMode fix)
+static int parallels_check_outside_image(BlockDriverState *bs,
+ BdrvCheckResult *res,
+ BdrvCheckMode fix)


I wonder, should we mark this function coroutine_fn?  And with the 
graph lock changes that happened in the meantime, probably also 
GRAPH_RDLOCK (because it’s calling bdrv_getlength()).


Hanna


Thank you for your review.

It seems parallels_collect_statistics(), parallels_check_unclean() and 
parallels_set_bat_entry() should also be marked as coroutine_fn, 
shouldn't they?




[PATCH v7 3/4] block: add accounting for zone append operation

2023-03-10 Thread Sam Li
To account for the new zone append write operation on zoned devices, the
BLOCK_ACCT_APPEND enum is introduced alongside the other I/O request types
(read, write, flush).

Signed-off-by: Sam Li 
---
 block/qapi-sysemu.c| 11 
 block/qapi.c   | 15 ++
 hw/block/virtio-blk.c  |  4 +++
 include/block/accounting.h |  1 +
 qapi/block-core.json   | 56 ++
 qapi/block.json|  4 +++
 6 files changed, 80 insertions(+), 11 deletions(-)

diff --git a/block/qapi-sysemu.c b/block/qapi-sysemu.c
index 7bd7554150..f7e56dfeb2 100644
--- a/block/qapi-sysemu.c
+++ b/block/qapi-sysemu.c
@@ -517,6 +517,7 @@ void qmp_block_latency_histogram_set(
 bool has_boundaries, uint64List *boundaries,
 bool has_boundaries_read, uint64List *boundaries_read,
 bool has_boundaries_write, uint64List *boundaries_write,
+bool has_boundaries_append, uint64List *boundaries_append,
 bool has_boundaries_flush, uint64List *boundaries_flush,
 Error **errp)
 {
@@ -557,6 +558,16 @@ void qmp_block_latency_histogram_set(
 }
 }
 
+if (has_boundaries || has_boundaries_append) {
+ret = block_latency_histogram_set(
+stats, BLOCK_ACCT_APPEND,
+has_boundaries_append ? boundaries_append : boundaries);
+if (ret) {
+error_setg(errp, "Device '%s' set append write boundaries fail", 
id);
+return;
+}
+}
+
 if (has_boundaries || has_boundaries_flush) {
 ret = block_latency_histogram_set(
 stats, BLOCK_ACCT_FLUSH,
diff --git a/block/qapi.c b/block/qapi.c
index c84147849d..d4be8ad72e 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -533,27 +533,33 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, 
BlockBackend *blk)
 
 ds->rd_bytes = stats->nr_bytes[BLOCK_ACCT_READ];
 ds->wr_bytes = stats->nr_bytes[BLOCK_ACCT_WRITE];
+ds->zap_bytes = stats->nr_bytes[BLOCK_ACCT_APPEND];
 ds->unmap_bytes = stats->nr_bytes[BLOCK_ACCT_UNMAP];
 ds->rd_operations = stats->nr_ops[BLOCK_ACCT_READ];
 ds->wr_operations = stats->nr_ops[BLOCK_ACCT_WRITE];
+ds->zap_operations = stats->nr_ops[BLOCK_ACCT_APPEND];
 ds->unmap_operations = stats->nr_ops[BLOCK_ACCT_UNMAP];
 
 ds->failed_rd_operations = stats->failed_ops[BLOCK_ACCT_READ];
 ds->failed_wr_operations = stats->failed_ops[BLOCK_ACCT_WRITE];
+ds->failed_zap_operations = stats->failed_ops[BLOCK_ACCT_APPEND];
 ds->failed_flush_operations = stats->failed_ops[BLOCK_ACCT_FLUSH];
 ds->failed_unmap_operations = stats->failed_ops[BLOCK_ACCT_UNMAP];
 
 ds->invalid_rd_operations = stats->invalid_ops[BLOCK_ACCT_READ];
 ds->invalid_wr_operations = stats->invalid_ops[BLOCK_ACCT_WRITE];
+ds->invalid_zap_operations = stats->invalid_ops[BLOCK_ACCT_APPEND];
 ds->invalid_flush_operations =
 stats->invalid_ops[BLOCK_ACCT_FLUSH];
 ds->invalid_unmap_operations = stats->invalid_ops[BLOCK_ACCT_UNMAP];
 
 ds->rd_merged = stats->merged[BLOCK_ACCT_READ];
 ds->wr_merged = stats->merged[BLOCK_ACCT_WRITE];
+ds->zap_merged = stats->merged[BLOCK_ACCT_APPEND];
 ds->unmap_merged = stats->merged[BLOCK_ACCT_UNMAP];
 ds->flush_operations = stats->nr_ops[BLOCK_ACCT_FLUSH];
 ds->wr_total_time_ns = stats->total_time_ns[BLOCK_ACCT_WRITE];
+ds->zap_total_time_ns = stats->total_time_ns[BLOCK_ACCT_APPEND];
 ds->rd_total_time_ns = stats->total_time_ns[BLOCK_ACCT_READ];
 ds->flush_total_time_ns = stats->total_time_ns[BLOCK_ACCT_FLUSH];
 ds->unmap_total_time_ns = stats->total_time_ns[BLOCK_ACCT_UNMAP];
@@ -571,6 +577,7 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, 
BlockBackend *blk)
 
TimedAverage *rd = &ts->latency[BLOCK_ACCT_READ];
TimedAverage *wr = &ts->latency[BLOCK_ACCT_WRITE];
+TimedAverage *zap = &ts->latency[BLOCK_ACCT_APPEND];
TimedAverage *fl = &ts->latency[BLOCK_ACCT_FLUSH];
 
 dev_stats->interval_length = ts->interval_length;
@@ -583,6 +590,10 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, 
BlockBackend *blk)
 dev_stats->max_wr_latency_ns = timed_average_max(wr);
 dev_stats->avg_wr_latency_ns = timed_average_avg(wr);
 
+dev_stats->min_zap_latency_ns = timed_average_min(zap);
+dev_stats->max_zap_latency_ns = timed_average_max(zap);
+dev_stats->avg_zap_latency_ns = timed_average_avg(zap);
+
 dev_stats->min_flush_latency_ns = timed_average_min(fl);
 dev_stats->max_flush_latency_ns = timed_average_max(fl);
 dev_stats->avg_flush_latency_ns = timed_average_avg(fl);
@@ -591,6 +602,8 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, 
BlockBackend *blk)
 block_acct_queue_depth(ts, BLOCK_ACCT_READ);
 dev_stats->avg_wr_queue_depth =
 block_acct_queue_depth(ts, BLOCK_ACCT_WRITE);
+dev_stats->avg_zap_queue_depth =
+block_acct_queue_depth(ts, BLOCK_ACCT_APPEND);
 
 

[PATCH v7 2/4] virtio-blk: add zoned storage emulation for zoned devices

2023-03-10 Thread Sam Li
This patch extends virtio-blk emulation to handle zoned device commands
by calling the new block layer APIs to perform zoned device I/O on
behalf of the guest. It supports Report Zone, four zone operations (open,
close, finish, reset), and Append Zone.

The VIRTIO_BLK_F_ZONED feature bit will only be set if the host supports
zoned block devices. It will not be set for regular block devices
(conventional zones).

The guest OS can use blktests and fio to test those commands on zoned
devices. Furthermore, zone append writes can also be tested with zonefs.
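The feature-bit gating described above can be pictured with a small stand-alone sketch (plain C, not the patch's code; toy_get_features is a made-up name, and the bit number 17 for VIRTIO_BLK_F_ZONED is an assumption based on the virtio 1.2 spec):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch, not QEMU code: advertise the zoned feature bit only when
 * the backing host device is actually zoned, and strip it otherwise.
 * Bit number 17 is an assumption from the virtio 1.2 spec. */
#define TOY_VIRTIO_BLK_F_ZONED 17

static uint64_t toy_get_features(uint64_t features, bool host_is_zoned)
{
    if (host_is_zoned) {
        features |= 1ULL << TOY_VIRTIO_BLK_F_ZONED;
    } else {
        features &= ~(1ULL << TOY_VIRTIO_BLK_F_ZONED);
    }
    return features;
}
```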

Signed-off-by: Sam Li 
---
 hw/block/virtio-blk-common.c |   2 +
 hw/block/virtio-blk.c| 394 +++
 2 files changed, 396 insertions(+)

diff --git a/hw/block/virtio-blk-common.c b/hw/block/virtio-blk-common.c
index ac52d7c176..e2f8e2f6da 100644
--- a/hw/block/virtio-blk-common.c
+++ b/hw/block/virtio-blk-common.c
@@ -29,6 +29,8 @@ static const VirtIOFeature feature_sizes[] = {
  .end = endof(struct virtio_blk_config, discard_sector_alignment)},
 {.flags = 1ULL << VIRTIO_BLK_F_WRITE_ZEROES,
  .end = endof(struct virtio_blk_config, write_zeroes_may_unmap)},
+{.flags = 1ULL << VIRTIO_BLK_F_ZONED,
+ .end = endof(struct virtio_blk_config, zoned)},
 {}
 };
 
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index cefca93b31..4ded625732 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -17,6 +17,7 @@
 #include "qemu/module.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
+#include "block/block_int.h"
 #include "trace.h"
 #include "hw/block/block.h"
 #include "hw/qdev-properties.h"
@@ -601,6 +602,341 @@ err:
 return err_status;
 }
 
+typedef struct ZoneCmdData {
+VirtIOBlockReq *req;
+struct iovec *in_iov;
+unsigned in_num;
+union {
+struct {
+unsigned int nr_zones;
+BlockZoneDescriptor *zones;
+} zone_report_data;
+struct {
+int64_t offset;
+} zone_append_data;
+};
+} ZoneCmdData;
+
+/*
+ * check zoned_request: error checking before issuing requests. If all checks
+ * passed, return true.
+ * append: true if only zone append requests issued.
+ */
+static bool check_zoned_request(VirtIOBlock *s, int64_t offset, int64_t len,
+ bool append, uint8_t *status) {
+BlockDriverState *bs = blk_bs(s->blk);
+int index;
+
+if (!virtio_has_feature(s->host_features, VIRTIO_BLK_F_ZONED)) {
+*status = VIRTIO_BLK_S_UNSUPP;
+return false;
+}
+
+if (offset < 0 || len < 0 || len > (bs->total_sectors << BDRV_SECTOR_BITS)
+|| offset > (bs->total_sectors << BDRV_SECTOR_BITS) - len) {
+*status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+return false;
+}
+
+if (append) {
+if (bs->bl.write_granularity) {
+if ((offset % bs->bl.write_granularity) != 0) {
+*status = VIRTIO_BLK_S_ZONE_UNALIGNED_WP;
+return false;
+}
+}
+
+index = offset / bs->bl.zone_size;
+if (BDRV_ZT_IS_CONV(bs->bl.wps->wp[index])) {
+*status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+return false;
+}
+
+if (len / 512 > bs->bl.max_append_sectors) {
+if (bs->bl.max_append_sectors == 0) {
+*status = VIRTIO_BLK_S_UNSUPP;
+} else {
+*status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+}
+return false;
+}
+}
+return true;
+}
+
+static void virtio_blk_zone_report_complete(void *opaque, int ret)
+{
+ZoneCmdData *data = opaque;
+VirtIOBlockReq *req = data->req;
+VirtIOBlock *s = req->dev;
+VirtIODevice *vdev = VIRTIO_DEVICE(req->dev);
+struct iovec *in_iov = data->in_iov;
+unsigned in_num = data->in_num;
+int64_t zrp_size, n, j = 0;
+int64_t nz = data->zone_report_data.nr_zones;
+int8_t err_status = VIRTIO_BLK_S_OK;
+
+if (ret) {
+err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+goto out;
+}
+
+struct virtio_blk_zone_report zrp_hdr = (struct virtio_blk_zone_report) {
+.nr_zones = cpu_to_le64(nz),
+};
+zrp_size = sizeof(struct virtio_blk_zone_report)
+   + sizeof(struct virtio_blk_zone_descriptor) * nz;
+n = iov_from_buf(in_iov, in_num, 0, &zrp_hdr, sizeof(zrp_hdr));
+if (n != sizeof(zrp_hdr)) {
+virtio_error(vdev, "Driver provided input buffer that is too small!");
+err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+goto out;
+}
+
+for (size_t i = sizeof(zrp_hdr); i < zrp_size;
+i += sizeof(struct virtio_blk_zone_descriptor), ++j) {
+struct virtio_blk_zone_descriptor desc =
+(struct virtio_blk_zone_descriptor) {
+.z_start = cpu_to_le64(data->zone_report_data.zones[j].start
+>> BDRV_SECTOR_BITS),
+.z_cap = cpu_to_le64(data->zone_report_data.zones[j].cap
+   

[PATCH v7 4/4] virtio-blk: add some trace events for zoned emulation

2023-03-10 Thread Sam Li
Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
---
 hw/block/trace-events |  7 +++
 hw/block/virtio-blk.c | 12 
 2 files changed, 19 insertions(+)

diff --git a/hw/block/trace-events b/hw/block/trace-events
index 2c45a62bd5..34be8b9135 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -44,9 +44,16 @@ pflash_write_unknown(const char *name, uint8_t cmd) "%s: unknown command 0x%02x"
 # virtio-blk.c
 virtio_blk_req_complete(void *vdev, void *req, int status) "vdev %p req %p status %d"
 virtio_blk_rw_complete(void *vdev, void *req, int ret) "vdev %p req %p ret %d"
+virtio_blk_zone_report_complete(void *vdev, void *req, unsigned int nr_zones, int ret) "vdev %p req %p nr_zones %u ret %d"
+virtio_blk_zone_mgmt_complete(void *vdev, void *req, int ret) "vdev %p req %p ret %d"
+virtio_blk_zone_append_complete(void *vdev, void *req, int64_t sector, int ret) "vdev %p req %p, append sector 0x%" PRIx64 " ret %d"
 virtio_blk_handle_write(void *vdev, void *req, uint64_t sector, size_t nsectors) "vdev %p req %p sector %"PRIu64" nsectors %zu"
 virtio_blk_handle_read(void *vdev, void *req, uint64_t sector, size_t nsectors) "vdev %p req %p sector %"PRIu64" nsectors %zu"
 virtio_blk_submit_multireq(void *vdev, void *mrb, int start, int num_reqs, uint64_t offset, size_t size, bool is_write) "vdev %p mrb %p start %d num_reqs %d offset %"PRIu64" size %zu is_write %d"
+virtio_blk_handle_zone_report(void *vdev, void *req, int64_t sector, unsigned int nr_zones) "vdev %p req %p sector 0x%" PRIx64 " nr_zones %u"
+virtio_blk_handle_zone_mgmt(void *vdev, void *req, uint8_t op, int64_t sector, int64_t len) "vdev %p req %p op 0x%x sector 0x%" PRIx64 " len 0x%" PRIx64 ""
+virtio_blk_handle_zone_reset_all(void *vdev, void *req, int64_t sector, int64_t len) "vdev %p req %p sector 0x%" PRIx64 " cap 0x%" PRIx64 ""
+virtio_blk_handle_zone_append(void *vdev, void *req, int64_t sector) "vdev %p req %p, append sector 0x%" PRIx64 ""
 
 # hd-geometry.c
 hd_geometry_lchs_guess(void *blk, int cyls, int heads, int secs) "blk %p LCHS %d %d %d"
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 7605ca4f03..7ad45fda03 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -676,6 +676,7 @@ static void virtio_blk_zone_report_complete(void *opaque, 
int ret)
 int64_t nz = data->zone_report_data.nr_zones;
 int8_t err_status = VIRTIO_BLK_S_OK;
 
+trace_virtio_blk_zone_report_complete(vdev, req, nz, ret);
 if (ret) {
 err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
 goto out;
@@ -792,6 +793,8 @@ static int virtio_blk_handle_zone_report(VirtIOBlockReq 
*req,
 nr_zones = (req->in_len - sizeof(struct virtio_blk_inhdr) -
 sizeof(struct virtio_blk_zone_report)) /
sizeof(struct virtio_blk_zone_descriptor);
+trace_virtio_blk_handle_zone_report(vdev, req,
+offset >> BDRV_SECTOR_BITS, nr_zones);
 
 zone_size = sizeof(BlockZoneDescriptor) * nr_zones;
 data = g_malloc(sizeof(ZoneCmdData));
@@ -818,7 +821,9 @@ static void virtio_blk_zone_mgmt_complete(void *opaque, int 
ret)
 {
 VirtIOBlockReq *req = opaque;
 VirtIOBlock *s = req->dev;
+VirtIODevice *vdev = VIRTIO_DEVICE(s);
 int8_t err_status = VIRTIO_BLK_S_OK;
+trace_virtio_blk_zone_mgmt_complete(vdev, req, ret);
 
 if (ret) {
 err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
@@ -845,6 +850,8 @@ static int virtio_blk_handle_zone_mgmt(VirtIOBlockReq *req, 
BlockZoneOp op)
 /* Entire drive capacity */
 offset = 0;
 len = capacity;
+trace_virtio_blk_handle_zone_reset_all(vdev, req, 0,
+   bs->total_sectors);
 } else {
 if (bs->bl.zone_size > capacity - offset) {
 /* The zoned device allows the last smaller zone. */
@@ -852,6 +859,9 @@ static int virtio_blk_handle_zone_mgmt(VirtIOBlockReq *req, 
BlockZoneOp op)
 } else {
 len = bs->bl.zone_size;
 }
+trace_virtio_blk_handle_zone_mgmt(vdev, req, op,
+  offset >> BDRV_SECTOR_BITS,
+  len >> BDRV_SECTOR_BITS);
 }
 
 if (!check_zoned_request(s, offset, len, false, &err_status)) {
@@ -894,6 +904,7 @@ static void virtio_blk_zone_append_complete(void *opaque, 
int ret)
 err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
 goto out;
 }
+trace_virtio_blk_zone_append_complete(vdev, req, append_sector, ret);
 
 out:
 aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
@@ -915,6 +926,7 @@ static int virtio_blk_handle_zone_append(VirtIOBlockReq 
*req,
 int64_t offset = virtio_ldq_p(vdev, &req->out.sector) << BDRV_SECTOR_BITS;
 int64_t len = iov_size(out_iov, out_num);
 
+trace_virtio_blk_handle_zone_append(vdev, req, offset >> BDRV_SECTOR_BITS);
 if (!check_zoned_request(s, offset, len, true, &err_status)) {
 goto 

[PATCH v7 1/4] include: update virtio_blk headers to v6.3-rc1

2023-03-10 Thread Sam Li
Use scripts/update-linux-headers.sh to update headers to 6.3-rc1.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Dmitry Fomichev 
---
 include/standard-headers/drm/drm_fourcc.h|  12 +++
 include/standard-headers/linux/ethtool.h |  48 -
 include/standard-headers/linux/fuse.h|  45 +++-
 include/standard-headers/linux/pci_regs.h|   1 +
 include/standard-headers/linux/vhost_types.h |   2 +
 include/standard-headers/linux/virtio_blk.h  | 105 +++
 linux-headers/asm-arm64/kvm.h|   1 +
 linux-headers/asm-x86/kvm.h  |  34 +-
 linux-headers/linux/kvm.h|   9 ++
 linux-headers/linux/vfio.h   |  15 +--
 linux-headers/linux/vhost.h  |   8 ++
 11 files changed, 270 insertions(+), 10 deletions(-)

diff --git a/include/standard-headers/drm/drm_fourcc.h 
b/include/standard-headers/drm/drm_fourcc.h
index 69cab17b38..dc3e6112c1 100644
--- a/include/standard-headers/drm/drm_fourcc.h
+++ b/include/standard-headers/drm/drm_fourcc.h
@@ -87,6 +87,18 @@ extern "C" {
  *
  * The authoritative list of format modifier codes is found in
  * `include/uapi/drm/drm_fourcc.h`
+ *
+ * Open Source User Waiver
+ * ---
+ *
+ * Because this is the authoritative source for pixel formats and modifiers
+ * referenced by GL, Vulkan extensions and other standards and hence used both
+ * by open source and closed source driver stacks, the usual requirement for an
+ * upstream in-kernel or open source userspace user does not apply.
+ *
+ * To ensure, as much as feasible, compatibility across stacks and avoid
+ * confusion with incompatible enumerations stakeholders for all relevant 
driver
+ * stacks should approve additions.
  */
 
 #define fourcc_code(a, b, c, d) ((uint32_t)(a) | ((uint32_t)(b) << 8) | \
diff --git a/include/standard-headers/linux/ethtool.h 
b/include/standard-headers/linux/ethtool.h
index 87176ab075..99fcddf04f 100644
--- a/include/standard-headers/linux/ethtool.h
+++ b/include/standard-headers/linux/ethtool.h
@@ -711,6 +711,24 @@ enum ethtool_stringset {
ETH_SS_COUNT
 };
 
+/**
+ * enum ethtool_mac_stats_src - source of ethtool MAC statistics
+ * @ETHTOOL_MAC_STATS_SRC_AGGREGATE:
+ * if device supports a MAC merge layer, this retrieves the aggregate
+ * statistics of the eMAC and pMAC. Otherwise, it retrieves just the
+ * statistics of the single (express) MAC.
+ * @ETHTOOL_MAC_STATS_SRC_EMAC:
+ * if device supports a MM layer, this retrieves the eMAC statistics.
+ * Otherwise, it retrieves the statistics of the single (express) MAC.
+ * @ETHTOOL_MAC_STATS_SRC_PMAC:
+ * if device supports a MM layer, this retrieves the pMAC statistics.
+ */
+enum ethtool_mac_stats_src {
+   ETHTOOL_MAC_STATS_SRC_AGGREGATE,
+   ETHTOOL_MAC_STATS_SRC_EMAC,
+   ETHTOOL_MAC_STATS_SRC_PMAC,
+};
+
 /**
  * enum ethtool_module_power_mode_policy - plug-in module power mode policy
  * @ETHTOOL_MODULE_POWER_MODE_POLICY_HIGH: Module is always in high power mode.
@@ -779,6 +797,31 @@ enum ethtool_podl_pse_pw_d_status {
ETHTOOL_PODL_PSE_PW_D_STATUS_ERROR,
 };
 
+/**
+ * enum ethtool_mm_verify_status - status of MAC Merge Verify function
+ * @ETHTOOL_MM_VERIFY_STATUS_UNKNOWN:
+ * verification status is unknown
+ * @ETHTOOL_MM_VERIFY_STATUS_INITIAL:
+ * the 802.3 Verify State diagram is in the state INIT_VERIFICATION
+ * @ETHTOOL_MM_VERIFY_STATUS_VERIFYING:
+ * the Verify State diagram is in the state VERIFICATION_IDLE,
+ * SEND_VERIFY or WAIT_FOR_RESPONSE
+ * @ETHTOOL_MM_VERIFY_STATUS_SUCCEEDED:
+ * indicates that the Verify State diagram is in the state VERIFIED
+ * @ETHTOOL_MM_VERIFY_STATUS_FAILED:
+ * the Verify State diagram is in the state VERIFY_FAIL
+ * @ETHTOOL_MM_VERIFY_STATUS_DISABLED:
+ * verification of preemption operation is disabled
+ */
+enum ethtool_mm_verify_status {
+   ETHTOOL_MM_VERIFY_STATUS_UNKNOWN,
+   ETHTOOL_MM_VERIFY_STATUS_INITIAL,
+   ETHTOOL_MM_VERIFY_STATUS_VERIFYING,
+   ETHTOOL_MM_VERIFY_STATUS_SUCCEEDED,
+   ETHTOOL_MM_VERIFY_STATUS_FAILED,
+   ETHTOOL_MM_VERIFY_STATUS_DISABLED,
+};
+
 /**
  * struct ethtool_gstrings - string set for data tagging
  * @cmd: Command number = %ETHTOOL_GSTRINGS
@@ -1183,7 +1226,7 @@ struct ethtool_rxnfc {
uint32_trule_cnt;
uint32_trss_context;
};
-   uint32_trule_locs[0];
+   uint32_trule_locs[];
 };
 
 
@@ -1741,6 +1784,9 @@ enum ethtool_link_mode_bit_indices {
ETHTOOL_LINK_MODE_80baseDR8_2_Full_BIT   = 96,
ETHTOOL_LINK_MODE_80baseSR8_Full_BIT = 97,
ETHTOOL_LINK_MODE_80baseVR8_Full_BIT = 98,
+   ETHTOOL_LINK_MODE_10baseT1S_Full_BIT = 99,
+   ETHTOOL_LINK_MODE_10baseT1S_Half_BIT 

[PATCH v7 0/4] Add zoned storage emulation to virtio-blk driver

2023-03-10 Thread Sam Li
This patch series adds zoned storage emulation to the virtio-blk driver.

The series implements the virtio-blk ZBD support standardization that was
recently accepted into the virtio spec. The related commit is at

https://github.com/oasis-tcs/virtio-spec/commit/b4e8efa0fa6c8d844328090ad15db65af8d7d981

The Linux zoned device code implemented by Dmitry Fomichev has been
released in Linux v6.3-rc1.

Aside: adding zoned=on and similar options to the virtio-blk device will
be considered as follow-ups in the future.

v6:
- update headers to v6.3-rc1

v5:
- address Stefan's review comments
  * restore the way writing zone append result to buffer
  * fix error checking case and other errands

v4:
- change the way writing zone append request result to buffer
- change zone state, zone type value of virtio_blk_zone_descriptor
- add trace events for new zone APIs

v3:
- use qemuio_from_buffer to write status bit [Stefan]
- avoid using req->elem directly [Stefan]
- fix error checkings and memory leak [Stefan]

v2:
- change units of emulated zone ops corresponding to block layer APIs
- modify error checking cases [Stefan, Damien]

v1:
- add zoned storage emulation

Sam Li (4):
  include: update virtio_blk headers to v6.3-rc1
  virtio-blk: add zoned storage emulation for zoned devices
  block: add accounting for zone append operation
  virtio-blk: add some trace events for zoned emulation

 block/qapi-sysemu.c  |  11 +
 block/qapi.c |  15 +
 hw/block/trace-events|   7 +
 hw/block/virtio-blk-common.c |   2 +
 hw/block/virtio-blk.c| 410 +++
 include/block/accounting.h   |   1 +
 include/standard-headers/drm/drm_fourcc.h|  12 +
 include/standard-headers/linux/ethtool.h |  48 ++-
 include/standard-headers/linux/fuse.h|  45 +-
 include/standard-headers/linux/pci_regs.h|   1 +
 include/standard-headers/linux/vhost_types.h |   2 +
 include/standard-headers/linux/virtio_blk.h  | 105 +
 linux-headers/asm-arm64/kvm.h|   1 +
 linux-headers/asm-x86/kvm.h  |  34 +-
 linux-headers/linux/kvm.h|   9 +
 linux-headers/linux/vfio.h   |  15 +-
 linux-headers/linux/vhost.h  |   8 +
 qapi/block-core.json |  56 ++-
 qapi/block.json  |   4 +
 19 files changed, 765 insertions(+), 21 deletions(-)

-- 
2.39.2




Re: [PATCH] block: replace TABs with space

2023-03-10 Thread Kevin Wolf
Am 09.03.2023 um 19:46 hat Eric Blake geschrieben:
> On Fri, Mar 10, 2023 at 12:10:08AM +0800, Yeqi Fu wrote:
> > Bring the block files in line with the QEMU coding style, with spaces
> > for indentation.
> > Resolves: https://gitlab.com/qemu-project/qemu/-/issues/371
> > 
> > Signed-off-by: Yeqi Fu 
> > ---
> >  block/bochs.c   | 12 +--
> >  block/file-posix.c  | 52 ++---
> >  block/file-win32.c  | 38 -
> >  block/parallels.c   | 10 -
> >  block/qcow.c| 10 -
> >  include/block/nbd.h |  2 +-
> >  6 files changed, 62 insertions(+), 62 deletions(-)
> > 
> > diff --git a/block/bochs.c b/block/bochs.c
> > index 2f5ae52c90..8ff38ac0d9 100644
> > --- a/block/bochs.c
> > +++ b/block/bochs.c
> > @@ -293,15 +293,15 @@ static void bochs_close(BlockDriverState *bs)
> >  }
> >  
> >  static BlockDriver bdrv_bochs = {
> > -.format_name   = "bochs",
> > -.instance_size = sizeof(BDRVBochsState),
> > -.bdrv_probe= bochs_probe,
> > -.bdrv_open = bochs_open,
> > +.format_name= "bochs",
> > +.instance_size  = sizeof(BDRVBochsState),
> > +.bdrv_probe = bochs_probe,
> > +.bdrv_open  = bochs_open,
> >  .bdrv_child_perm = bdrv_default_perms,
> >  .bdrv_refresh_limits = bochs_refresh_limits,
> 
> Hmm. When shown in the diff, it looks like you are changing the
> alignment of the .bdrv_probe line; but in reality, the extra prefix
> from diff changes which tabstop the '=' goes out to, so you are
> preserving the original column.  Still, it looks really awkward that
> we have one alignment for .format_name through .bdrv_open, and another
> alignment for .bdrv_child_perm and .bdrv_refresh_limits.
> 
> My personal preference: '=' alignment is a lost cause.  It causes
> pointless churn if we later add a field with a longer name (all other
> lines have to reindent to match the new length).  I'd prefer exactly
> one space before '=' on all lines that you touch (and maybe touch a
> few more lines in the process).  But if we DO stick with alignment,
> I'd much rather that ALL '=' be in the same column, rather than the
> pre-existing half-and-half mix, even though your patch kept that
> visually undisturbed (other than when diff injects a prefix that
> rejiggers the tab stops).

I really appreciate the visual separation between field names and values
that helps me to recognise the functions much faster. Whenever I come
across a file that is formatted like you suggest, it takes me much
longer.

Now I can agree that when doing alignment, having the = in different
columns within a single BlockDriver definition might look a bit awkward,
but as long as it's still done in some kind of blocks, the visual
separation remains and I'm happy enough from a functional perspective.

So I think the existing mix is strictly better than no alignment at all.
In principle I wouldn't object to aligning everything to the same
column, though of course it means additional churn that will affect 'git
blame' etc. Which is why I'm generally not super enthusiastic about
patches changing only the formatting. But I guess if it's a one-time
thing in each file, it's not too bad. Let's just make sure to leave
enough space for the longest field even if a BlockDriver doesn't
currently implement it so that we don't start accumulating new
inconsistencies right afterwards.

Kevin




Re: [PATCH v6 4/4] hw: replace most qemu_bh_new calls with qemu_bh_new_guarded

2023-03-10 Thread Thomas Huth

On 05/02/2023 05.07, Alexander Bulekov wrote:

This protects devices from bh->mmio reentrancy issues.

Reviewed-by: Darren Kenny 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Alexander Bulekov 
---

...

diff --git a/hw/9pfs/xen-9p-backend.c b/hw/9pfs/xen-9p-backend.c
index 65c4979c3c..f077c1b255 100644
--- a/hw/9pfs/xen-9p-backend.c
+++ b/hw/9pfs/xen-9p-backend.c
@@ -441,7 +441,9 @@ static int xen_9pfs_connect(struct XenLegacyDevice *xendev)
  xen_9pdev->rings[i].ring.out = xen_9pdev->rings[i].data +
 XEN_FLEX_RING_SIZE(ring_order);
  
-xen_9pdev->rings[i].bh = qemu_bh_new(xen_9pfs_bh, &xen_9pdev->rings[i]);

+xen_9pdev->rings[i].bh = qemu_bh_new_guarded(xen_9pfs_bh,
+ &xen_9pdev->rings[i],
+ &DEVICE(xen_9pdev)->mem_reentrancy_guard);


xen_9pdev is not derived from DeviceState, so you must not cast it with 
DEVICE().



diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 7ce001cacd..37091150cb 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -1508,7 +1508,8 @@ static void ahci_cmd_done(const IDEDMA *dma)
  ahci_write_fis_d2h(ad);
  
  if (ad->port_regs.cmd_issue && !ad->check_bh) {

-ad->check_bh = qemu_bh_new(ahci_check_cmd_bh, ad);
+ad->check_bh = qemu_bh_new_guarded(ahci_check_cmd_bh, ad,
+   &DEVICE(ad)->mem_reentrancy_guard);
  qemu_bh_schedule(ad->check_bh);
  }
  }


Ditto - ad is not derived from DeviceState, so you cannot use DEVICE() here.

(This was causing the crash in the macOS CI job)


diff --git a/hw/ide/core.c b/hw/ide/core.c
index 5d1039378f..8c8d1a8ec2 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -519,7 +519,8 @@ BlockAIOCB *ide_issue_trim(
  
  iocb = blk_aio_get(&trim_aiocb_info, s->blk, cb, cb_opaque);

  iocb->s = s;
-iocb->bh = qemu_bh_new(ide_trim_bh_cb, iocb);
+iocb->bh = qemu_bh_new_guarded(ide_trim_bh_cb, iocb,
+   &DEVICE(s)->mem_reentrancy_guard);


IDEState s is also not directly derived from DeviceState. Not sure, but 
maybe you can get to the device here in a similar way to what is done in 
ide_identify():


 IDEDevice *dev = s->unit ? s->bus->slave : s->bus->master;

?


diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 746f07c4d2..309cebacc6 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -908,8 +908,9 @@ static void virtio_balloon_device_realize(DeviceState *dev, 
Error **errp)
   precopy_add_notifier(&s->free_page_hint_notify);
  
  object_ref(OBJECT(s->iothread));

-s->free_page_bh = aio_bh_new(iothread_get_aio_context(s->iothread),
- virtio_ballloon_get_free_page_hints, s);
+s->free_page_bh = aio_bh_new_guarded(iothread_get_aio_context(s->iothread),
+ virtio_ballloon_get_free_page_hints, s,
+ &DEVICE(s)->mem_reentrancy_guard);


You could use "dev" instead of "s" here to get rid of the DEVICE() cast.

The remaining changes look fine to me.

 Thomas




[PATCH] hw/block: replace TABs with space Bring the block files in line with the QEMU coding style, with spaces for indentation. Resolves: https://gitlab.com/qemu-project/qemu/-/issues/371

2023-03-10 Thread Yeqi Fu
Signed-off-by: Yeqi Fu 
---
 hw/block/fdc.c |   4 +-
 hw/block/nand.c| 222 ++---
 hw/block/onenand.c | 126 -
 hw/block/tc58128.c | 136 +--
 4 files changed, 244 insertions(+), 244 deletions(-)

diff --git a/hw/block/fdc.c b/hw/block/fdc.c
index 64ae4a6899..d7cc4d3ec1 100644
--- a/hw/block/fdc.c
+++ b/hw/block/fdc.c
@@ -601,8 +601,8 @@ enum {
 };
 
 enum {
-FD_STATE_MULTI  = 0x01,/* multi track flag */
-FD_STATE_FORMAT = 0x02,/* format flag */
+FD_STATE_MULTI  = 0x01, /* multi track flag */
+FD_STATE_FORMAT = 0x02, /* format flag */
 };
 
 enum {
diff --git a/hw/block/nand.c b/hw/block/nand.c
index 1aee1cb2b1..79832b9cd5 100644
--- a/hw/block/nand.c
+++ b/hw/block/nand.c
@@ -30,33 +30,33 @@
 #include "qemu/module.h"
 #include "qom/object.h"
 
-# define NAND_CMD_READ00x00
-# define NAND_CMD_READ10x01
-# define NAND_CMD_READ20x50
-# define NAND_CMD_LPREAD2  0x30
-# define NAND_CMD_NOSERIALREAD20x35
-# define NAND_CMD_RANDOMREAD1  0x05
-# define NAND_CMD_RANDOMREAD2  0xe0
-# define NAND_CMD_READID   0x90
-# define NAND_CMD_RESET0xff
-# define NAND_CMD_PAGEPROGRAM1 0x80
-# define NAND_CMD_PAGEPROGRAM2 0x10
-# define NAND_CMD_CACHEPROGRAM20x15
-# define NAND_CMD_BLOCKERASE1  0x60
-# define NAND_CMD_BLOCKERASE2  0xd0
-# define NAND_CMD_READSTATUS   0x70
-# define NAND_CMD_COPYBACKPRG1 0x85
-
-# define NAND_IOSTATUS_ERROR   (1 << 0)
-# define NAND_IOSTATUS_PLANE0  (1 << 1)
-# define NAND_IOSTATUS_PLANE1  (1 << 2)
-# define NAND_IOSTATUS_PLANE2  (1 << 3)
-# define NAND_IOSTATUS_PLANE3  (1 << 4)
+# define NAND_CMD_READ0 0x00
+# define NAND_CMD_READ1 0x01
+# define NAND_CMD_READ2 0x50
+# define NAND_CMD_LPREAD2   0x30
+# define NAND_CMD_NOSERIALREAD2 0x35
+# define NAND_CMD_RANDOMREAD1   0x05
+# define NAND_CMD_RANDOMREAD2   0xe0
+# define NAND_CMD_READID0x90
+# define NAND_CMD_RESET 0xff
+# define NAND_CMD_PAGEPROGRAM1  0x80
+# define NAND_CMD_PAGEPROGRAM2  0x10
+# define NAND_CMD_CACHEPROGRAM2 0x15
+# define NAND_CMD_BLOCKERASE1   0x60
+# define NAND_CMD_BLOCKERASE2   0xd0
+# define NAND_CMD_READSTATUS0x70
+# define NAND_CMD_COPYBACKPRG1  0x85
+
+# define NAND_IOSTATUS_ERROR(1 << 0)
+# define NAND_IOSTATUS_PLANE0   (1 << 1)
+# define NAND_IOSTATUS_PLANE1   (1 << 2)
+# define NAND_IOSTATUS_PLANE2   (1 << 3)
+# define NAND_IOSTATUS_PLANE3   (1 << 4)
 # define NAND_IOSTATUS_READY(1 << 6)
-# define NAND_IOSTATUS_UNPROTCT(1 << 7)
+# define NAND_IOSTATUS_UNPROTCT (1 << 7)
 
-# define MAX_PAGE  0x800
-# define MAX_OOB   0x40
+# define MAX_PAGE   0x800
+# define MAX_OOB0x40
 
 typedef struct NANDFlashState NANDFlashState;
 struct NANDFlashState {
@@ -102,40 +102,40 @@ static void mem_and(uint8_t *dest, const uint8_t *src, 
size_t n)
 }
 }
 
-# define NAND_NO_AUTOINCR  0x0001
-# define NAND_BUSWIDTH_16  0x0002
-# define NAND_NO_PADDING   0x0004
-# define NAND_CACHEPRG 0x0008
-# define NAND_COPYBACK 0x0010
-# define NAND_IS_AND   0x0020
-# define NAND_4PAGE_ARRAY  0x0040
-# define NAND_NO_READRDY   0x0100
-# define NAND_SAMSUNG_LP   (NAND_NO_PADDING | NAND_COPYBACK)
+# define NAND_NO_AUTOINCR   0x0001
+# define NAND_BUSWIDTH_16   0x0002
+# define NAND_NO_PADDING0x0004
+# define NAND_CACHEPRG  0x0008
+# define NAND_COPYBACK  0x0010
+# define NAND_IS_AND0x0020
+# define NAND_4PAGE_ARRAY   0x0040
+# define NAND_NO_READRDY0x0100
+# define NAND_SAMSUNG_LP(NAND_NO_PADDING | NAND_COPYBACK)
 
 # define NAND_IO
 
-# define PAGE(addr)((addr) >> ADDR_SHIFT)
-# define PAGE_START(page)   (PAGE(page) * (NAND_PAGE_SIZE + OOB_SIZE))
-# define PAGE_MASK ((1 << ADDR_SHIFT) - 1)
-# define OOB_SHIFT (PAGE_SHIFT - 5)
-# define OOB_SIZE  (1 << OOB_SHIFT)
-# define SECTOR(addr)  ((addr) >> (9 + ADDR_SHIFT - PAGE_SHIFT))
-# define SECTOR_OFFSET(addr)   ((addr) & ((511 >> PAGE_SHIFT) << 8))
-
-# define NAND_PAGE_SIZE 256
-# define PAGE_SHIFT8
-# define PAGE_SECTORS  1
-# define ADDR_SHIFT8
+# define PAGE(addr)  ((addr) >> ADDR_SHIFT)
+# define PAGE_START(page)(PAGE(page) * (NAND_PAGE_SIZE + OOB_SIZE))
+# define PAGE_MASK   ((1 << ADDR_SHIFT) - 1)
+# define OOB_SHIFT   (PAGE_SHIFT - 5)
+# define OOB_SIZE(1 << OOB_SHIFT)
+# define SECTOR(addr)((addr) >> (9 + ADDR_SHIFT - PAGE_SHIFT))
+# define SECTOR_OFFSET(addr) ((addr) & ((511 >> PAGE_SHIFT) << 8))
+
+# define NAND_PAGE_SIZE 256
+# define PAGE_SHIFT 8
+# define PAGE_SECTORS   1
+# define ADDR_SHIFT 8
 # include "nand.c"
-# define NAND_PAGE_SIZE 512
-# define PAGE_SHIFT   

[PATCH v6 3/4] qemu-iotests: test zone append operation

2023-03-10 Thread Sam Li
This test is mainly a helper to verify that append writes in the block
layer behave as expected.

Signed-off-by: Sam Li 
---
 qemu-io-cmds.c | 65 ++
 tests/qemu-iotests/tests/zoned.out |  7 
 tests/qemu-iotests/tests/zoned.sh  |  9 +
 3 files changed, 81 insertions(+)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index f35ea627d7..4159f41ab9 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -1874,6 +1874,70 @@ static const cmdinfo_t zone_reset_cmd = {
 .oneline = "reset a zone write pointer in zone block device",
 };
 
+static int do_aio_zone_append(BlockBackend *blk, QEMUIOVector *qiov,
+  int64_t *offset, int flags, int *total)
+{
+int async_ret = NOT_DONE;
+
+blk_aio_zone_append(blk, offset, qiov, flags, aio_rw_done, &async_ret);
+while (async_ret == NOT_DONE) {
+main_loop_wait(false);
+}
+
+*total = qiov->size;
+return async_ret < 0 ? async_ret : 1;
+}
+
+static int zone_append_f(BlockBackend *blk, int argc, char **argv)
+{
+int ret;
+int flags = 0;
+int total = 0;
+int64_t offset;
+char *buf;
+int nr_iov;
+int pattern = 0xcd;
+QEMUIOVector qiov;
+
+if (optind > argc - 2) {
+return -EINVAL;
+}
+optind++;
+offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
+optind++;
+nr_iov = argc - optind;
+buf = create_iovec(blk, &qiov, &argv[optind], nr_iov, pattern,
+   flags & BDRV_REQ_REGISTERED_BUF);
+if (buf == NULL) {
+return -EINVAL;
+}
+ret = do_aio_zone_append(blk, &qiov, &offset, flags, &total);
+if (ret < 0) {
+printf("zone append failed: %s\n", strerror(-ret));
+goto out;
+}
+
+out:
+qemu_io_free(blk, buf, qiov.size,
+ flags & BDRV_REQ_REGISTERED_BUF);
+qemu_iovec_destroy(&qiov);
+return ret;
+}
+
+static const cmdinfo_t zone_append_cmd = {
+.name = "zone_append",
+.altname = "zap",
+.cfunc = zone_append_f,
+.argmin = 3,
+.argmax = 3,
+.args = "offset len [len..]",
+.oneline = "append write a number of bytes at a specified offset",
+};
+
 static int truncate_f(BlockBackend *blk, int argc, char **argv);
 static const cmdinfo_t truncate_cmd = {
 .name   = "truncate",
@@ -2672,6 +2736,7 @@ static void __attribute((constructor)) 
init_qemuio_commands(void)
 qemuio_add_command(&zone_close_cmd);
 qemuio_add_command(&zone_finish_cmd);
 qemuio_add_command(&zone_reset_cmd);
+qemuio_add_command(&zone_append_cmd);
 qemuio_add_command(_cmd);
 qemuio_add_command(_cmd);
 qemuio_add_command(_cmd);
diff --git a/tests/qemu-iotests/tests/zoned.out 
b/tests/qemu-iotests/tests/zoned.out
index 0c8f96deb9..b3b139b4ec 100644
--- a/tests/qemu-iotests/tests/zoned.out
+++ b/tests/qemu-iotests/tests/zoned.out
@@ -50,4 +50,11 @@ start: 0x8, len 0x8, cap 0x8, wptr 0x10, 
zcond:14, [type: 2]
 (5) resetting the second zone
 After resetting a zone:
 start: 0x8, len 0x8, cap 0x8, wptr 0x8, zcond:1, [type: 2]
+
+
+(6) append write
+After appending the first zone:
+start: 0x0, len 0x8, cap 0x8, wptr 0x18, zcond:2, [type: 2]
+After appending the second zone:
+start: 0x8, len 0x8, cap 0x8, wptr 0x80018, zcond:2, [type: 2]
 *** done
diff --git a/tests/qemu-iotests/tests/zoned.sh 
b/tests/qemu-iotests/tests/zoned.sh
index 9d7c15dde6..6c3ded6c4c 100755
--- a/tests/qemu-iotests/tests/zoned.sh
+++ b/tests/qemu-iotests/tests/zoned.sh
@@ -79,6 +79,15 @@ echo "(5) resetting the second zone"
 sudo $QEMU_IO $IMG -c "zrs 268435456 268435456"
 echo "After resetting a zone:"
 sudo $QEMU_IO $IMG -c "zrp 268435456 1"
+echo
+echo
+echo "(6) append write" # physical block size of the device is 4096
+sudo $QEMU_IO $IMG -c "zap 0 0x1000 0x2000"
+echo "After appending the first zone:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+sudo $QEMU_IO $IMG -c "zap 268435456 0x1000 0x2000"
+echo "After appending the second zone:"
+sudo $QEMU_IO $IMG -c "zrp 268435456 1"
 
 # success, all done
 echo "*** done"
-- 
2.39.2




[PATCH v6 2/4] block: introduce zone append write for zoned devices

2023-03-10 Thread Sam Li
A zone append command is a write operation that specifies the first
logical block of a zone as the write position. When writing to a zoned
block device using zone append, the byte offset of the write points to
the write pointer of that zone. Upon completion, the device responds
with the position at which the data has been written in the zone.

Signed-off-by: Sam Li 
---
 block/block-backend.c | 60 +++
 block/file-posix.c| 54 +---
 block/io.c| 21 +++
 block/io_uring.c  |  4 +++
 block/linux-aio.c |  3 ++
 block/raw-format.c|  8 +
 include/block/block-io.h  |  4 +++
 include/block/block_int-common.h  |  5 +++
 include/block/raw-aio.h   |  4 ++-
 include/sysemu/block-backend-io.h |  9 +
 10 files changed, 166 insertions(+), 6 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index f70b08e3f6..28e8f5d778 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1888,6 +1888,45 @@ BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, 
BlockZoneOp op,
 return &acb->common;
 }
 
+static void coroutine_fn blk_aio_zone_append_entry(void *opaque)
+{
+BlkAioEmAIOCB *acb = opaque;
+BlkRwCo *rwco = &acb->rwco;
+
+rwco->ret = blk_co_zone_append(rwco->blk, &acb->bytes,
+   rwco->iobuf, rwco->flags);
+blk_aio_complete(acb);
+}
+
+BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t *offset,
+QEMUIOVector *qiov, BdrvRequestFlags flags,
+BlockCompletionFunc *cb, void *opaque) {
+BlkAioEmAIOCB *acb;
+Coroutine *co;
+IO_CODE();
+
+blk_inc_in_flight(blk);
+acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
+acb->rwco = (BlkRwCo) {
+.blk= blk,
+.ret= NOT_DONE,
+.flags  = flags,
+.iobuf  = qiov,
+};
+acb->bytes = *offset;
+acb->has_returned = false;
+
+co = qemu_coroutine_create(blk_aio_zone_append_entry, acb);
+aio_co_enter(blk_get_aio_context(blk), co);
+acb->has_returned = true;
+if (acb->rwco.ret != NOT_DONE) {
+replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+ blk_aio_complete_bh, acb);
+}
+
+return &acb->common;
+}
+
 /*
  * Send a zone_report command.
  * offset is a byte offset from the start of the device. No alignment
@@ -1939,6 +1978,27 @@ int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, 
BlockZoneOp op,
 return ret;
 }
 
+/*
+ * Send a zone_append command.
+ */
+int coroutine_fn blk_co_zone_append(BlockBackend *blk, int64_t *offset,
+QEMUIOVector *qiov, BdrvRequestFlags flags)
+{
+int ret;
+IO_CODE();
+
+blk_inc_in_flight(blk);
+blk_wait_while_drained(blk);
+if (!blk_is_available(blk)) {
+blk_dec_in_flight(blk);
+return -ENOMEDIUM;
+}
+
+ret = bdrv_co_zone_append(blk_bs(blk), offset, qiov, flags);
+blk_dec_in_flight(blk);
+return ret;
+}
+
 void blk_drain(BlockBackend *blk)
 {
 BlockDriverState *bs = blk_bs(blk);
diff --git a/block/file-posix.c b/block/file-posix.c
index 61ed769ac8..2ba9174778 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -160,6 +160,7 @@ typedef struct BDRVRawState {
 bool has_write_zeroes:1;
 bool use_linux_aio:1;
 bool use_linux_io_uring:1;
+int64_t *offset; /* offset of zone append operation */
 int page_cache_inconsistent; /* errno from fdatasync failure */
 bool has_fallocate;
 bool needs_alignment;
@@ -1672,7 +1673,7 @@ static ssize_t handle_aiocb_rw_vector(RawPosixAIOData 
*aiocb)
 ssize_t len;
 
 len = RETRY_ON_EINTR(
-(aiocb->aio_type & QEMU_AIO_WRITE) ?
+(aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) ?
 qemu_pwritev(aiocb->aio_fildes,
aiocb->io.iov,
aiocb->io.niov,
@@ -1701,7 +1702,7 @@ static ssize_t handle_aiocb_rw_linear(RawPosixAIOData 
*aiocb, char *buf)
 ssize_t len;
 
 while (offset < aiocb->aio_nbytes) {
-if (aiocb->aio_type & QEMU_AIO_WRITE) {
+if (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) {
 len = pwrite(aiocb->aio_fildes,
  (const char *)buf + offset,
  aiocb->aio_nbytes - offset,
@@ -1794,7 +1795,7 @@ static int handle_aiocb_rw(void *opaque)
 }
 
 nbytes = handle_aiocb_rw_linear(aiocb, buf);
-if (!(aiocb->aio_type & QEMU_AIO_WRITE)) {
+if (!(aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND))) {
 char *p = buf;
 size_t count = aiocb->aio_nbytes, copy;
 int i;
@@ -2431,6 +2432,10 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, 
uint64_t offset,
 #if defined(CONFIG_BLKZONED)
 if (bs->bl.wps) {
 qemu_co_mutex_lock(&bs->bl.wps->colock);

[PATCH v6 4/4] block: add some trace events for zone append

2023-03-10 Thread Sam Li
Signed-off-by: Sam Li 
---
 block/file-posix.c | 3 +++
 block/trace-events | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index 2ba9174778..5187f810e5 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2489,6 +2489,8 @@ out:
 if (!BDRV_ZT_IS_CONV(wps->wp[index])) {
 if (type & QEMU_AIO_ZONE_APPEND) {
 *s->offset = wps->wp[index];
+trace_zbd_zone_append_complete(bs, *s->offset
+>> BDRV_SECTOR_BITS);
 }
 /* Advance the wp if needed */
 if (offset + bytes > wps->wp[index]) {
@@ -3537,6 +3539,7 @@ static int coroutine_fn 
raw_co_zone_append(BlockDriverState *bs,
 len += iov_len;
 }
 
+trace_zbd_zone_append(bs, *offset >> BDRV_SECTOR_BITS);
 return raw_co_prw(bs, *offset, len, qiov, QEMU_AIO_ZONE_APPEND);
 }
 #endif
diff --git a/block/trace-events b/block/trace-events
index 3f4e1d088a..32665158d6 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -211,6 +211,8 @@ file_hdev_is_sg(int type, int version) "SG device found: type=%d, version=%d"
 file_flush_fdatasync_failed(int err) "errno %d"
 zbd_zone_report(void *bs, unsigned int nr_zones, int64_t sector) "bs %p report %d zones starting at sector offset 0x%" PRIx64 ""
 zbd_zone_mgmt(void *bs, const char *op_name, int64_t sector, int64_t len) "bs %p %s starts at sector offset 0x%" PRIx64 " over a range of 0x%" PRIx64 " sectors"
+zbd_zone_append(void *bs, int64_t sector) "bs %p append at sector offset 0x%" PRIx64 ""
+zbd_zone_append_complete(void *bs, int64_t sector) "bs %p returns append sector 0x%" PRIx64 ""
 
 # ssh.c
 sftp_error(const char *op, const char *ssh_err, int ssh_err_code, int sftp_err_code) "%s failed: %s (libssh error code: %d, sftp error code: %d)"
-- 
2.39.2




[PATCH v6 1/4] file-posix: add tracking of the zone write pointers

2023-03-10 Thread Sam Li
Since Linux doesn't have a user API to issue zone append operations to
zoned devices from user space, the file-posix driver is modified to add
zone append emulation using regular writes. To do this, the file-posix
driver tracks the wp location of all zones of the device. It uses an
array of uint64_t. The most significant bit of each wp location indicates
whether the zone is a conventional zone.

A zone's wp can change due to the following operations:
- zone reset: change the wp to the start offset of that zone
- zone finish: change to the end location of that zone
- write to a zone
- zone append

Signed-off-by: Sam Li 
---
 block/file-posix.c   | 159 ++-
 include/block/block-common.h |  14 +++
 include/block/block_int-common.h |   3 +
 3 files changed, 172 insertions(+), 4 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 563acc76ae..61ed769ac8 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1324,6 +1324,77 @@ static int hdev_get_max_segments(int fd, struct stat *st)
 #endif
 }
 
+#if defined(CONFIG_BLKZONED)
+static int get_zones_wp(int fd, BlockZoneWps *wps, int64_t offset,
+unsigned int nrz) {
+struct blk_zone *blkz;
+size_t rep_size;
+uint64_t sector = offset >> BDRV_SECTOR_BITS;
+int ret, n = 0, i = 0;
+rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone);
+g_autofree struct blk_zone_report *rep = NULL;
+
+rep = g_malloc(rep_size);
+blkz = (struct blk_zone *)(rep + 1);
+while (n < nrz) {
+memset(rep, 0, rep_size);
+rep->sector = sector;
+rep->nr_zones = nrz - n;
+
+do {
+ret = ioctl(fd, BLKREPORTZONE, rep);
+} while (ret != 0 && errno == EINTR);
+if (ret != 0) {
+error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed %d",
+fd, offset, errno);
+return -errno;
+}
+
+if (!rep->nr_zones) {
+break;
+}
+
+for (i = 0; i < rep->nr_zones; i++, n++) {
+/*
+ * The wp tracking cares only about sequential write required and
+ * sequential write preferred zones so that the wp can advance to
+ * the right location.
+ * Use the most significant bit of the wp location to indicate the
+ * zone type: 0 for SWR/SWP zones and 1 for conventional zones.
+ */
+if (blkz[i].type == BLK_ZONE_TYPE_CONVENTIONAL) {
+wps->wp[i] = 1ULL << 63;
+} else {
+switch(blkz[i].cond) {
+case BLK_ZONE_COND_FULL:
+case BLK_ZONE_COND_READONLY:
+/* Zone not writable */
+wps->wp[i] = (blkz[i].start + blkz[i].len) << 
BDRV_SECTOR_BITS;
+break;
+case BLK_ZONE_COND_OFFLINE:
+/* Zone not writable nor readable */
+wps->wp[i] = (blkz[i].start) << BDRV_SECTOR_BITS;
+break;
+default:
+wps->wp[i] = blkz[i].wp << BDRV_SECTOR_BITS;
+break;
+}
+}
+}
+sector = blkz[i - 1].start + blkz[i - 1].len;
+}
+
+return 0;
+}
+
+static void update_zones_wp(int fd, BlockZoneWps *wps, int64_t offset,
+unsigned int nrz) {
+if (get_zones_wp(fd, wps, offset, nrz) < 0) {
+error_report("update zone wp failed");
+}
+}
+#endif
+
 static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 {
 BDRVRawState *s = bs->opaque;
@@ -1413,6 +1484,21 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
 if (ret >= 0) {
 bs->bl.max_active_zones = ret;
 }
+
+ret = get_sysfs_long_val(&st, "physical_block_size");
+if (ret >= 0) {
+bs->bl.write_granularity = ret;
+}
+
+bs->bl.wps = g_malloc(sizeof(BlockZoneWps) +
+sizeof(int64_t) * bs->bl.nr_zones);
+ret = get_zones_wp(s->fd, bs->bl.wps, 0, bs->bl.nr_zones);
+if (ret < 0) {
+error_setg_errno(errp, -ret, "report wps failed");
+g_free(bs->bl.wps);
+return;
+}
+qemu_co_mutex_init(&bs->bl.wps->colock);
 return;
 }
 out:
@@ -2338,9 +2424,15 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, 
uint64_t offset,
 {
 BDRVRawState *s = bs->opaque;
 RawPosixAIOData acb;
+int ret;
 
 if (fd_open(bs) < 0)
 return -EIO;
+#if defined(CONFIG_BLKZONED)
+if (bs->bl.wps) {
+qemu_co_mutex_lock(&bs->bl.wps->colock);
+}
+#endif
 
 /*
  * When using O_DIRECT, the request must be aligned to be able to use
@@ -2354,14 +2446,16 @@ static int coroutine_fn raw_co_prw(BlockDriverState 
*bs, uint64_t offset,
 } else if (s->use_linux_io_uring) {
 

[PATCH v6 0/4] Add zone append write for zoned device

2023-03-10 Thread Sam Li
This patch series adds the zone append operation based on the previous
zoned device support series. The file-posix driver is modified to
add zone append emulation using regular writes.

v6:
- add small fixes

v5:
- fix locking conditions and error handling
- drop some trivial optimizations
- add tracing points for zone append

v4:
- fix lock related issues[Damien]
- drop all field in zone_mgmt op [Damien]
- fix state checks in zone_mgmt command [Damien]
- return start sector of wp when issuing zap req [Damien]

v3:
- only read wps when it is locked [Damien]
- allow last smaller zone case [Damien]
- add zone type and state checks in zone_mgmt command [Damien]
- fix RESET_ALL related problems

v2:
- split patch to two patches for better reviewing
- change BlockZoneWps's structure to an array of integers
- use only mutex lock on locking conditions of zone wps
- coding styles and clean-ups

v1:
- introduce zone append write

Sam Li (4):
  file-posix: add tracking of the zone write pointers
  block: introduce zone append write for zoned devices
  qemu-iotests: test zone append operation
  block: add some trace events for zone append

 block/block-backend.c  |  60 
 block/file-posix.c | 212 -
 block/io.c |  21 +++
 block/io_uring.c   |   4 +
 block/linux-aio.c  |   3 +
 block/raw-format.c |   8 ++
 block/trace-events |   2 +
 include/block/block-common.h   |  14 ++
 include/block/block-io.h   |   4 +
 include/block/block_int-common.h   |   8 ++
 include/block/raw-aio.h|   4 +-
 include/sysemu/block-backend-io.h  |   9 ++
 qemu-io-cmds.c |  65 +
 tests/qemu-iotests/tests/zoned.out |   7 +
 tests/qemu-iotests/tests/zoned.sh  |   9 ++
 15 files changed, 422 insertions(+), 8 deletions(-)

-- 
2.39.2




[PATCH v16 5/8] config: add check to block layer

2023-03-10 Thread Sam Li
Putting zoned/non-zoned BlockDrivers on top of each other is not
allowed.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Dmitry Fomichev 
---
 block.c  | 19 +++
 block/file-posix.c   | 12 
 block/raw-format.c   |  1 +
 include/block/block_int-common.h |  5 +
 4 files changed, 37 insertions(+)

diff --git a/block.c b/block.c
index 0dd604d0f6..4ebf7bbc90 100644
--- a/block.c
+++ b/block.c
@@ -7953,6 +7953,25 @@ void bdrv_add_child(BlockDriverState *parent_bs, 
BlockDriverState *child_bs,
 return;
 }
 
+/*
+ * Non-zoned block drivers do not follow zoned storage constraints
+ * (i.e. sequential writes to zones). Refuse mixing zoned and non-zoned
+ * drivers in a graph.
+ */
+if (!parent_bs->drv->supports_zoned_children &&
+child_bs->bl.zoned == BLK_Z_HM) {
+/*
+ * The host-aware model allows zoned storage constraints and random
+ * write. Allow mixing host-aware and non-zoned drivers. Using
+ * host-aware device as a regular device.
+ */
+error_setg(errp, "Cannot add a %s child to a %s parent",
+   child_bs->bl.zoned == BLK_Z_HM ? "zoned" : "non-zoned",
+   parent_bs->drv->supports_zoned_children ?
+   "support zoned children" : "not support zoned children");
+return;
+}
+
 if (!QLIST_EMPTY(&child_bs->parents)) {
 error_setg(errp, "The node %s already has a parent",
child_bs->node_name);
diff --git a/block/file-posix.c b/block/file-posix.c
index df9b9f1e30..2eceb250f1 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -776,6 +776,18 @@ static int raw_open_common(BlockDriverState *bs, QDict 
*options,
 goto fail;
 }
 }
+#ifdef CONFIG_BLKZONED
+/*
+ * The kernel page cache does not reliably work for writes to SWR zones
+ * of zoned block device because it can not guarantee the order of writes.
+ */
+if ((bs->bl.zoned != BLK_Z_NONE) &&
+(!(s->open_flags & O_DIRECT))) {
+error_setg(errp, "The driver supports zoned devices, and it requires "
+ "cache.direct=on, which was not specified.");
+return -EINVAL; /* No host kernel page cache */
+}
+#endif
 
 if (S_ISBLK(st.st_mode)) {
 #ifdef __linux__
diff --git a/block/raw-format.c b/block/raw-format.c
index 6e1b9394c8..72e23e7b55 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -621,6 +621,7 @@ static void raw_child_perm(BlockDriverState *bs, BdrvChild 
*c,
 BlockDriver bdrv_raw = {
 .format_name  = "raw",
 .instance_size= sizeof(BDRVRawState),
+.supports_zoned_children = true,
 .bdrv_probe   = &raw_probe,
 .bdrv_reopen_prepare  = &raw_reopen_prepare,
 .bdrv_reopen_commit   = &raw_reopen_commit,
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index a3efb385e0..1bd2aef4d5 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -137,6 +137,11 @@ struct BlockDriver {
  */
 bool is_format;
 
+/*
+ * Set to true if the BlockDriver supports zoned children.
+ */
+bool supports_zoned_children;
+
 /*
  * Drivers not implementing bdrv_parse_filename nor bdrv_open should have
  * this field set to true, except ones that are defined only by their
-- 
2.39.2




[PATCH v16 8/8] docs/zoned-storage: add zoned device documentation

2023-03-10 Thread Sam Li
Add documentation about the zoned device support to the virtio-blk
emulation.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Damien Le Moal 
Reviewed-by: Dmitry Fomichev 
---
 docs/devel/zoned-storage.rst   | 43 ++
 docs/system/qemu-block-drivers.rst.inc |  6 
 2 files changed, 49 insertions(+)
 create mode 100644 docs/devel/zoned-storage.rst

diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
new file mode 100644
index 00..6a36133e51
--- /dev/null
+++ b/docs/devel/zoned-storage.rst
@@ -0,0 +1,43 @@
+=
+zoned-storage
+=
+
+Zoned Block Devices (ZBDs) divide the LBA space into block regions called zones
+that are larger than the LBA size. They allow only sequential writes, which
+can reduce write amplification in SSDs, and potentially lead to higher
+throughput and increased capacity. More details about ZBDs can be found at:
+
+https://zonedstorage.io/docs/introduction/zoned-storage
+
+1. Block layer APIs for zoned storage
+-
+QEMU block layer supports three zoned storage models:
+- BLK_Z_HM: The host-managed zoned model allows only sequential write access
+to zones. It supports ZBD-specific I/O commands that can be used by a host to
+manage the zones of a device.
+- BLK_Z_HA: The host-aware zoned model allows random write operations in
+zones, making it backward compatible with regular block devices.
+- BLK_Z_NONE: The non-zoned model has no zone support. It includes both
+regular and drive-managed ZBD devices. ZBD-specific I/O commands are not
+supported.
+
+The block device information resides inside BlockDriverState. QEMU uses the
+BlockLimits struct (BlockDriverState::bl) that is continuously accessed by the
+block layer while processing I/O requests. A BlockBackend has a root pointer to
+a BlockDriverState graph (for example, raw format on top of file-posix). The
+zoned storage information can be propagated from the leaf BlockDriverState all
+the way up to the BlockBackend. If the zoned storage model in file-posix is
+set to BLK_Z_HM, then block drivers will declare support for zoned host device.
+
+The block layer APIs support commands needed for zoned storage devices,
+including report zones, four zone operations, and zone append.
+
+2. Emulating zoned storage controllers
+--
+When the BlockBackend's BlockLimits model reports a zoned storage device, users
+like the virtio-blk emulation or the qemu-io-cmds.c utility can use block layer
+APIs for zoned storage emulation or testing.
+
+For example, to test zone_report on a null_blk device using qemu-io:
+$ path/to/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0
+-c "zrp offset nr_zones"
diff --git a/docs/system/qemu-block-drivers.rst.inc 
b/docs/system/qemu-block-drivers.rst.inc
index dfe5d2293d..105cb9679c 100644
--- a/docs/system/qemu-block-drivers.rst.inc
+++ b/docs/system/qemu-block-drivers.rst.inc
@@ -430,6 +430,12 @@ Hard disks
   you may corrupt your host data (use the ``-snapshot`` command
   line option or modify the device permissions accordingly).
 
+Zoned block devices
+  Zoned block devices can be passed through to the guest if the emulated 
storage
+  controller supports zoned storage. Use ``--blockdev host_device,
+  node-name=drive0,filename=/dev/nullb0,cache.direct=on`` to pass through
+  ``/dev/nullb0`` as ``drive0``.
+
 Windows
 ^^^
 
-- 
2.39.2




[PATCH v16 6/8] qemu-iotests: test new zone operations

2023-03-10 Thread Sam Li
We have added new block layer APIs for zoned block devices. Test them as
follows: run each zone operation on a newly created null_blk device and
check whether the logs show the correct zone information, by running:
$ ./tests/qemu-iotests/tests/zoned.sh

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
---
 tests/qemu-iotests/tests/zoned.out | 53 ++
 tests/qemu-iotests/tests/zoned.sh  | 86 ++
 2 files changed, 139 insertions(+)
 create mode 100644 tests/qemu-iotests/tests/zoned.out
 create mode 100755 tests/qemu-iotests/tests/zoned.sh

diff --git a/tests/qemu-iotests/tests/zoned.out 
b/tests/qemu-iotests/tests/zoned.out
new file mode 100644
index 00..0c8f96deb9
--- /dev/null
+++ b/tests/qemu-iotests/tests/zoned.out
@@ -0,0 +1,53 @@
+QA output created by zoned.sh
+Testing a null_blk device:
+Simple cases: if the operations work
+(1) report the first zone:
+start: 0x0, len 0x8, cap 0x8, wptr 0x0, zcond:1, [type: 2]
+
+report the first 10 zones
+start: 0x0, len 0x8, cap 0x8, wptr 0x0, zcond:1, [type: 2]
+start: 0x8, len 0x8, cap 0x8, wptr 0x8, zcond:1, [type: 2]
+start: 0x10, len 0x8, cap 0x8, wptr 0x10, zcond:1, [type: 2]
+start: 0x18, len 0x8, cap 0x8, wptr 0x18, zcond:1, [type: 2]
+start: 0x20, len 0x8, cap 0x8, wptr 0x20, zcond:1, [type: 2]
+start: 0x28, len 0x8, cap 0x8, wptr 0x28, zcond:1, [type: 2]
+start: 0x30, len 0x8, cap 0x8, wptr 0x30, zcond:1, [type: 2]
+start: 0x38, len 0x8, cap 0x8, wptr 0x38, zcond:1, [type: 2]
+start: 0x40, len 0x8, cap 0x8, wptr 0x40, zcond:1, [type: 2]
+start: 0x48, len 0x8, cap 0x8, wptr 0x48, zcond:1, [type: 2]
+
+report the last zone:
+start: 0x1f38, len 0x8, cap 0x8, wptr 0x1f38, zcond:1, [type: 
2]
+
+
+(2) opening the first zone
+report after:
+start: 0x0, len 0x8, cap 0x8, wptr 0x0, zcond:3, [type: 2]
+
+opening the second zone
+report after:
+start: 0x8, len 0x8, cap 0x8, wptr 0x8, zcond:3, [type: 2]
+
+opening the last zone
+report after:
+start: 0x1f38, len 0x8, cap 0x8, wptr 0x1f38, zcond:3, [type: 
2]
+
+
+(3) closing the first zone
+report after:
+start: 0x0, len 0x8, cap 0x8, wptr 0x0, zcond:1, [type: 2]
+
+closing the last zone
+report after:
+start: 0x1f38, len 0x8, cap 0x8, wptr 0x1f38, zcond:1, [type: 
2]
+
+
+(4) finishing the second zone
+After finishing a zone:
+start: 0x8, len 0x8, cap 0x8, wptr 0x10, zcond:14, [type: 2]
+
+
+(5) resetting the second zone
+After resetting a zone:
+start: 0x8, len 0x8, cap 0x8, wptr 0x8, zcond:1, [type: 2]
+*** done
diff --git a/tests/qemu-iotests/tests/zoned.sh 
b/tests/qemu-iotests/tests/zoned.sh
new file mode 100755
index 00..9d7c15dde6
--- /dev/null
+++ b/tests/qemu-iotests/tests/zoned.sh
@@ -0,0 +1,86 @@
+#!/usr/bin/env bash
+#
+# Test zone management operations.
+#
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+status=1 # failure is the default!
+
+_cleanup()
+{
+  _cleanup_test_img
+  sudo rmmod null_blk
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+. ./common.qemu
+
+# This test only runs on Linux hosts with raw image files.
+_supported_fmt raw
+_supported_proto file
+_supported_os Linux
+
+QEMU_IO="build/qemu-io"
+IMG="--image-opts -n driver=host_device,filename=/dev/nullb0"
+QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT
+
+echo "Testing a null_blk device:"
+echo "case 1: if the operations work"
+sudo modprobe null_blk nr_devices=1 zoned=1
+
+echo "(1) report the first zone:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "report the first 10 zones"
+sudo $QEMU_IO $IMG -c "zrp 0 10"
+echo
+echo "report the last zone:"
+sudo $QEMU_IO $IMG -c "zrp 0x3e7000 2" # 0x3e7000 / 512 = 0x1f38
+echo
+echo
+echo "(2) opening the first zone"
+sudo $QEMU_IO $IMG -c "zo 0 268435456"  # 268435456 / 512 = 524288
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "opening the second zone"
+sudo $QEMU_IO $IMG -c "zo 268435456 268435456" #
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 268435456 1"
+echo
+echo "opening the last zone"
+sudo $QEMU_IO $IMG -c "zo 0x3e7000 268435456"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0x3e7000 2"
+echo
+echo
+echo "(3) closing the first zone"
+sudo $QEMU_IO $IMG -c "zc 0 268435456"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "closing the last zone"
+sudo $QEMU_IO $IMG -c "zc 0x3e7000 268435456"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0x3e7000 2"
+echo
+echo
+echo "(4) finishing the second zone"
+sudo $QEMU_IO $IMG -c "zf 268435456 268435456"
+echo "After finishing a zone:"
+sudo $QEMU_IO $IMG -c "zrp 268435456 1"
+echo
+echo
+echo "(5) resetting the second zone"
+sudo $QEMU_IO $IMG -c "zrs 268435456 

[PATCH v16 4/8] raw-format: add zone operations to pass through requests

2023-03-10 Thread Sam Li
The raw-format driver usually sits on top of the file-posix driver. It needs
to pass through zone command requests.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Damien Le Moal 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Dmitry Fomichev 
---
 block/raw-format.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/block/raw-format.c b/block/raw-format.c
index 66783ed8e7..6e1b9394c8 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -317,6 +317,21 @@ raw_co_pdiscard(BlockDriverState *bs, int64_t offset, 
int64_t bytes)
 return bdrv_co_pdiscard(bs->file, offset, bytes);
 }
 
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_zone_report(BlockDriverState *bs, int64_t offset,
+   unsigned int *nr_zones,
+   BlockZoneDescriptor *zones)
+{
+return bdrv_co_zone_report(bs->file->bs, offset, nr_zones, zones);
+}
+
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
+ int64_t offset, int64_t len)
+{
+return bdrv_co_zone_mgmt(bs->file->bs, op, offset, len);
+}
+
 static int64_t coroutine_fn GRAPH_RDLOCK
 raw_co_getlength(BlockDriverState *bs)
 {
@@ -617,6 +632,8 @@ BlockDriver bdrv_raw = {
 .bdrv_co_pwritev  = &raw_co_pwritev,
 .bdrv_co_pwrite_zeroes = &raw_co_pwrite_zeroes,
 .bdrv_co_pdiscard = &raw_co_pdiscard,
+.bdrv_co_zone_report  = &raw_co_zone_report,
+.bdrv_co_zone_mgmt  = &raw_co_zone_mgmt,
 .bdrv_co_block_status = &raw_co_block_status,
 .bdrv_co_copy_range_from = &raw_co_copy_range_from,
 .bdrv_co_copy_range_to  = &raw_co_copy_range_to,
-- 
2.39.2




[PATCH v16 7/8] block: add some trace events for new block layer APIs

2023-03-10 Thread Sam Li
Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
---
 block/file-posix.c | 3 +++
 block/trace-events | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index 2eceb250f1..563acc76ae 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -3256,6 +3256,7 @@ static int coroutine_fn 
raw_co_zone_report(BlockDriverState *bs, int64_t offset,
BlockZoneDescriptor *zones) {
 BDRVRawState *s = bs->opaque;
 RawPosixAIOData acb;
+trace_zbd_zone_report(bs, *nr_zones, offset >> BDRV_SECTOR_BITS);
 
 acb = (RawPosixAIOData) {
 .bs = bs,
@@ -3334,6 +3335,8 @@ static int coroutine_fn raw_co_zone_mgmt(BlockDriverState 
*bs, BlockZoneOp op,
 },
 };
 
+trace_zbd_zone_mgmt(bs, op_name, offset >> BDRV_SECTOR_BITS,
+len >> BDRV_SECTOR_BITS);
 ret = raw_thread_pool_submit(bs, handle_aiocb_zone_mgmt, &acb);
 if (ret != 0) {
 ret = -errno;
diff --git a/block/trace-events b/block/trace-events
index 48dbf10c66..3f4e1d088a 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -209,6 +209,8 @@ file_FindEjectableOpticalMedia(const char *media) "Matching 
using %s"
 file_setup_cdrom(const char *partition) "Using %s as optical disc"
 file_hdev_is_sg(int type, int version) "SG device found: type=%d, version=%d"
 file_flush_fdatasync_failed(int err) "errno %d"
+zbd_zone_report(void *bs, unsigned int nr_zones, int64_t sector) "bs %p report 
%d zones starting at sector offset 0x%" PRIx64 ""
+zbd_zone_mgmt(void *bs, const char *op_name, int64_t sector, int64_t len) "bs 
%p %s starts at sector offset 0x%" PRIx64 " over a range of 0x%" PRIx64 " 
sectors"
 
 # ssh.c
 sftp_error(const char *op, const char *ssh_err, int ssh_err_code, int 
sftp_err_code) "%s failed: %s (libssh error code: %d, sftp error code: %d)"
-- 
2.39.2




[PATCH v16 3/8] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2023-03-10 Thread Sam Li
Add zoned device option to host_device BlockDriver. It will be presented only
for zoned host block devices. By adding zone management operations to the
host_block_device BlockDriver, users can use the new block layer APIs
including Report Zone and four zone management operations
(open, close, finish, reset, reset_all).

Qemu-io uses the new APIs to perform zoned storage commands of the device:
zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
zone_finish(zf).

For example, to test zone_report, use the following command:
$ ./build/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0
-c "zrp offset nr_zones"

Signed-off-by: Sam Li 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Stefan Hajnoczi 
Acked-by: Kevin Wolf 
---
 block/block-backend.c | 133 +
 block/file-posix.c| 309 +-
 block/io.c|  41 
 include/block/block-io.h  |   9 +
 include/block/block_int-common.h  |  21 ++
 include/block/raw-aio.h   |   6 +-
 include/sysemu/block-backend-io.h |  18 ++
 meson.build   |   4 +
 qemu-io-cmds.c| 149 ++
 9 files changed, 687 insertions(+), 3 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 278b04ce69..f70b08e3f6 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1806,6 +1806,139 @@ int coroutine_fn blk_co_flush(BlockBackend *blk)
 return ret;
 }
 
+static void coroutine_fn blk_aio_zone_report_entry(void *opaque)
+{
+BlkAioEmAIOCB *acb = opaque;
+BlkRwCo *rwco = &acb->rwco;
+
+rwco->ret = blk_co_zone_report(rwco->blk, rwco->offset,
+   (unsigned int*)acb->bytes,rwco->iobuf);
+blk_aio_complete(acb);
+}
+
+BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
+unsigned int *nr_zones,
+BlockZoneDescriptor  *zones,
+BlockCompletionFunc *cb, void *opaque)
+{
+BlkAioEmAIOCB *acb;
+Coroutine *co;
+IO_CODE();
+
+blk_inc_in_flight(blk);
+acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
+acb->rwco = (BlkRwCo) {
+.blk= blk,
+.offset = offset,
+.iobuf  = zones,
+.ret= NOT_DONE,
+};
+acb->bytes = (int64_t)nr_zones,
+acb->has_returned = false;
+
+co = qemu_coroutine_create(blk_aio_zone_report_entry, acb);
+aio_co_enter(blk_get_aio_context(blk), co);
+
+acb->has_returned = true;
+if (acb->rwco.ret != NOT_DONE) {
+replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+ blk_aio_complete_bh, acb);
+}
+
+return &acb->common;
+}
+
+static void coroutine_fn blk_aio_zone_mgmt_entry(void *opaque)
+{
+BlkAioEmAIOCB *acb = opaque;
+BlkRwCo *rwco = &acb->rwco;
+
+rwco->ret = blk_co_zone_mgmt(rwco->blk, (BlockZoneOp)rwco->iobuf,
+ rwco->offset, acb->bytes);
+blk_aio_complete(acb);
+}
+
+BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+  int64_t offset, int64_t len,
+  BlockCompletionFunc *cb, void *opaque) {
+BlkAioEmAIOCB *acb;
+Coroutine *co;
+IO_CODE();
+
+blk_inc_in_flight(blk);
+acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
+acb->rwco = (BlkRwCo) {
+.blk= blk,
+.offset = offset,
+.iobuf  = (void *)op,
+.ret= NOT_DONE,
+};
+acb->bytes = len;
+acb->has_returned = false;
+
+co = qemu_coroutine_create(blk_aio_zone_mgmt_entry, acb);
+aio_co_enter(blk_get_aio_context(blk), co);
+
+acb->has_returned = true;
+if (acb->rwco.ret != NOT_DONE) {
+replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+ blk_aio_complete_bh, acb);
+}
+
+return &acb->common;
+}
+
+/*
+ * Send a zone_report command.
+ * offset is a byte offset from the start of the device. No alignment
+ * required for offset.
+ * nr_zones represents IN maximum and OUT actual.
+ */
+int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
+unsigned int *nr_zones,
+BlockZoneDescriptor *zones)
+{
+int ret;
+IO_CODE();
+
+blk_inc_in_flight(blk); /* increase before waiting */
+blk_wait_while_drained(blk);
+if (!blk_is_available(blk)) {
+blk_dec_in_flight(blk);
+return -ENOMEDIUM;
+}
+ret = bdrv_co_zone_report(blk_bs(blk), offset, nr_zones, zones);
+blk_dec_in_flight(blk);
+return ret;
+}
+
+/*
+ * Send a zone_management command.
+ * op is the zone operation;
+ * offset is the byte offset from the start of the zoned device;
+ * len is the maximum number of bytes the command should operate on. It
+ * should be aligned with the device zone size.
+ */
+int 

[PATCH v16 2/8] file-posix: introduce helper functions for sysfs attributes

2023-03-10 Thread Sam Li
Use get_sysfs_str_val() to get the string value of the device's
zoned model. Then get_sysfs_zoned_model() converts it to
QEMU's BlockZoneModel type.

Use get_sysfs_long_val() to get long integer values of zoned
device attributes.

Signed-off-by: Sam Li 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Damien Le Moal 
Reviewed-by: Dmitry Fomichev 
---
 block/file-posix.c   | 122 ++-
 include/block/block_int-common.h |   3 +
 2 files changed, 91 insertions(+), 34 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 5760cf22d1..496edc644c 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1202,64 +1202,112 @@ static int hdev_get_max_hw_transfer(int fd, struct 
stat *st)
 #endif
 }
 
-static int hdev_get_max_segments(int fd, struct stat *st)
-{
+/*
+ * Get a sysfs attribute value as character string.
+ */
+static int get_sysfs_str_val(struct stat *st, const char *attribute,
+ char **val) {
 #ifdef CONFIG_LINUX
-char buf[32];
-const char *end;
-char *sysfspath = NULL;
+g_autofree char *sysfspath = NULL;
 int ret;
-int sysfd = -1;
-long max_segments;
+size_t len;
 
-if (S_ISCHR(st->st_mode)) {
-if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
-return ret;
-}
+if (!S_ISBLK(st->st_mode)) {
 return -ENOTSUP;
 }
 
-if (!S_ISBLK(st->st_mode)) {
-return -ENOTSUP;
+sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
+major(st->st_rdev), minor(st->st_rdev),
+attribute);
+ret = g_file_get_contents(sysfspath, val, &len, NULL);
+if (ret == -1) {
+return -ENOENT;
 }
 
-sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
-major(st->st_rdev), minor(st->st_rdev));
-sysfd = open(sysfspath, O_RDONLY);
-if (sysfd == -1) {
-ret = -errno;
-goto out;
+/* The file is ended with '\n' */
+char *p;
+p = *val;
+if (*(p + len - 1) == '\n') {
+*(p + len - 1) = '\0';
 }
-ret = RETRY_ON_EINTR(read(sysfd, buf, sizeof(buf) - 1));
+return ret;
+#else
+return -ENOTSUP;
+#endif
+}
+
+static int get_sysfs_zoned_model(struct stat *st, BlockZoneModel *zoned)
+{
+g_autofree char *val = NULL;
+int ret;
+
+ret = get_sysfs_str_val(st, "zoned", );
 if (ret < 0) {
-ret = -errno;
-goto out;
-} else if (ret == 0) {
-ret = -EIO;
-goto out;
+return ret;
 }
-buf[ret] = 0;
-/* The file is ended with '\n', pass 'end' to accept that. */
-ret = qemu_strtol(buf, &end, 10, &max_segments);
-if (ret == 0 && end && *end == '\n') {
-ret = max_segments;
+
+if (strcmp(val, "host-managed") == 0) {
+*zoned = BLK_Z_HM;
+} else if (strcmp(val, "host-aware") == 0) {
+*zoned = BLK_Z_HA;
+} else if (strcmp(val, "none") == 0) {
+*zoned = BLK_Z_NONE;
+} else {
+return -ENOTSUP;
+}
+return 0;
+}
+
+/*
+ * Get a sysfs attribute value as a long integer.
+ */
+static long get_sysfs_long_val(struct stat *st, const char *attribute)
+{
+#ifdef CONFIG_LINUX
+g_autofree char *str = NULL;
+const char *end;
+long val;
+int ret;
+
+ret = get_sysfs_str_val(st, attribute, &str);
+if (ret < 0) {
+return ret;
 }
 
-out:
-if (sysfd != -1) {
-close(sysfd);
+/* The file is ended with '\n', pass 'end' to accept that. */
+ret = qemu_strtol(str, &end, 10, &val);
+if (ret == 0 && end && *end == '\0') {
+ret = val;
 }
-g_free(sysfspath);
 return ret;
 #else
 return -ENOTSUP;
 #endif
 }
 
+static int hdev_get_max_segments(int fd, struct stat *st)
+{
+#ifdef CONFIG_LINUX
+int ret;
+
+if (S_ISCHR(st->st_mode)) {
+if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
+return ret;
+}
+return -ENOTSUP;
+}
+return get_sysfs_long_val(st, "max_segments");
+#else
+return -ENOTSUP;
+#endif
+}
+
 static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 {
 BDRVRawState *s = bs->opaque;
 struct stat st;
+int ret;
+BlockZoneModel zoned;
 
 s->needs_alignment = raw_needs_alignment(bs);
 raw_probe_alignment(bs, s->fd, errp);
@@ -1297,6 +1345,12 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
 bs->bl.max_hw_iov = ret;
 }
 }
+
+ret = get_sysfs_zoned_model(&st, &zoned);
+if (ret < 0) {
+zoned = BLK_Z_NONE;
+}
+bs->bl.zoned = zoned;
 }
 
 static int check_for_dasd(int fd)
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index d419017328..6d0f470626 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -855,6 +855,9 @@ typedef struct BlockLimits {
 
 /* maximum number of iovec elements */
 int 

[PATCH v16 1/8] include: add zoned device structs

2023-03-10 Thread Sam Li
Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Damien Le Moal 
Reviewed-by: Hannes Reinecke 
---
 include/block/block-common.h | 43 
 1 file changed, 43 insertions(+)

diff --git a/include/block/block-common.h b/include/block/block-common.h
index b5122ef8ab..1576fcf2ed 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -75,6 +75,49 @@ typedef struct BlockDriver BlockDriver;
 typedef struct BdrvChild BdrvChild;
 typedef struct BdrvChildClass BdrvChildClass;
 
+typedef enum BlockZoneOp {
+BLK_ZO_OPEN,
+BLK_ZO_CLOSE,
+BLK_ZO_FINISH,
+BLK_ZO_RESET,
+} BlockZoneOp;
+
+typedef enum BlockZoneModel {
+BLK_Z_NONE = 0x0, /* Regular block device */
+BLK_Z_HM = 0x1, /* Host-managed zoned block device */
+BLK_Z_HA = 0x2, /* Host-aware zoned block device */
+} BlockZoneModel;
+
+typedef enum BlockZoneState {
+BLK_ZS_NOT_WP = 0x0,
+BLK_ZS_EMPTY = 0x1,
+BLK_ZS_IOPEN = 0x2,
+BLK_ZS_EOPEN = 0x3,
+BLK_ZS_CLOSED = 0x4,
+BLK_ZS_RDONLY = 0xD,
+BLK_ZS_FULL = 0xE,
+BLK_ZS_OFFLINE = 0xF,
+} BlockZoneState;
+
+typedef enum BlockZoneType {
+BLK_ZT_CONV = 0x1, /* Conventional random writes supported */
+BLK_ZT_SWR = 0x2, /* Sequential writes required */
+BLK_ZT_SWP = 0x3, /* Sequential writes preferred */
+} BlockZoneType;
+
+/*
+ * Zone descriptor data structure.
+ * Provides information on a zone with all position and size values in bytes.
+ */
+typedef struct BlockZoneDescriptor {
+uint64_t start;
+uint64_t length;
+uint64_t cap;
+uint64_t wp;
+BlockZoneType type;
+BlockZoneState state;
+} BlockZoneDescriptor;
+
 typedef struct BlockDriverInfo {
 /* in bytes, 0 if irrelevant */
 int cluster_size;
-- 
2.39.2




[PATCH v16 0/8] Add support for zoned device

2023-03-10 Thread Sam Li
Zoned Block Devices (ZBDs) divide the LBA space into block regions called zones
that are larger than the LBA size. They allow only sequential writes, which
reduces write amplification in SSDs, leading to higher throughput and increased
capacity. More details about ZBDs can be found at:

https://zonedstorage.io/docs/introduction/zoned-storage

The zoned device support aims to let guests (virtual machines) access zoned
storage devices on the host (hypervisor) through a virtio-blk device. This
involves extending QEMU's block layer and virtio-blk emulation code.  In its
current status, the virtio-blk device is not aware of ZBDs but the guest sees
host-managed drives as regular drives that run correctly under the most
common write workloads.

This patch series extend the block layer APIs with the minimum set of zoned
commands that are necessary to support zoned devices. The commands are - Report
Zones, four zone operations and Zone Append.

There has been a debate about whether to introduce a new zoned_host_device
BlockDriver specifically for zoned devices. In the end, it's been decided to
stick to the existing host_device BlockDriver interface, only adding new zoned operations
inside it. The benefit of that is to avoid further changes - one example is
command line syntax - to the applications like Libvirt using QEMU zoned
emulation.

It can be tested on a null_blk device using qemu-io or qemu-iotests. For
example, to test zone report using qemu-io:
$ path/to/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0
-c "zrp offset nr_zones"

v16:
- update zoned_host device name to host_device [Stefan]
- fix probing zoned device blocksizes [Stefan]
- Use empty fields instead of changing struct size of BlkRwCo [Kevin, Stefan]

v15:
- drop zoned_host_device BlockDriver
- add zoned device option to host_device driver instead of introducing a new
  zoned_host_device BlockDriver [Stefan]

v14:
- address Stefan's comments of probing block sizes

v13:
- add some tracing points for new zone APIs [Dmitry]
- change error handling in zone_mgmt [Damien, Stefan]

v12:
- address review comments
  * drop BLK_ZO_RESET_ALL bit [Damien]
  * fix error messages, style, and typos [Damien, Hannes]

v11:
- address review comments
  * fix possible BLKZONED config compiling warnings [Stefan]
  * fix capacity field compiling warnings on older kernel [Stefan, Damien]

v10:
- address review comments
  * deal with the last small zone case in zone_mgmt operations [Damien]
  * handle the capacity field outdated in old kernel(before 5.9) [Damien]
  * use byte unit in block layer to be consistent with QEMU [Eric]
  * fix coding style related problems [Stefan]

v9:
- address review comments
  * specify units of zone commands requests [Stefan]
  * fix some error handling in file-posix [Stefan]
  * introduce zoned_host_device in the commit message [Markus]

v8:
- address review comments
  * solve patch conflicts and merge sysfs helper functions into one patch
  * add cache.direct=on check in config

v7:
- address review comments
  * modify sysfs attribute helper functions
  * move the input validation and error checking into raw_co_zone_* function
  * fix checks in config

v6:
- drop virtio-blk emulation changes
- address Stefan's review comments
  * fix CONFIG_BLKZONED configs in related functions
  * replace reading fd by g_file_get_contents() in get_sysfs_str_val()
  * rewrite documentation for zoned storage

v5:
- add zoned storage emulation to virtio-blk device
- add documentation for zoned storage
- address review comments
  * fix qemu-iotests
  * fix check to block layer
  * modify interfaces of sysfs helper functions
  * rename zoned device structs according to QEMU styles
  * reorder patches

v4:
- add virtio-blk headers for zoned device
- add configurations for zoned host device
- add zone operations for raw-format
- address review comments
  * fix memory leak bug in zone_report
  * add checks to block layers
  * fix qemu-iotests format
  * fix sysfs helper functions

v3:
- add helper functions to get sysfs attributes
- address review comments
  * fix zone report bugs
  * fix the qemu-io code path
  * use thread pool to avoid blocking ioctl() calls

v2:
- add qemu-io sub-commands
- address review comments
  * modify interfaces of APIs

v1:
- add block layer APIs resembling Linux ZoneBlockDevice ioctls

Sam Li (8):
  include: add zoned device structs
  file-posix: introduce helper functions for sysfs attributes
  block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  raw-format: add zone operations to pass through requests
  config: add check to block layer
  qemu-iotests: test new zone operations
  block: add some trace events for new block layer APIs
  docs/zoned-storage: add zoned device documentation

 block.c|  19 ++
 block/block-backend.c  | 133 
 block/file-posix.c | 446 +++--
 block/io.c |