Re: QEMU 6.2.0 and rhbz#1999878

2021-12-03 Thread Richard Henderson

On 12/3/21 2:00 PM, Richard Henderson wrote:

Oh I see, it was indeed replaced by Richard Henderson's patch:

https://src.fedoraproject.org/rpms/qemu/blob/rawhide/f/0001-tcg-arm-Reduce-vector-alignment-requirement-for-NEON.patch 




At the moment I kept it as part of 6.2.0 build, which I am just about to push
to rawhide. It builds locally, and I am only waiting for the scratch-build to
finish.


Yes looks like we need to keep it, and get it upstream too.


Whoops.  That slipped through the cracks.
I'll queue that now-ish.


https://patchew.org/QEMU/20210912174925.200132-1-richard.hender...@linaro.org/

Ah right, I was supposed to test your kernel and never got there.
Plus it never got any r-b's.

Rebase was smooth and regression testing went ok on cortex-a57 host.


r~



Re: [PATCH v2 2/2] virtio-mem: Correct default THP size for ARM64

2021-12-03 Thread Gavin Shan

On 12/4/21 5:16 AM, David Hildenbrand wrote:

On 03.12.21 04:35, Gavin Shan wrote:

The default block size is the same as the THP size, which is either
retrieved from "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
or hardcoded to 2MB. There are flaws in both mechanisms, and this
patch intends to fix them up.

   * When "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size" is
     used to get the THP size, 32MB and 512MB are valid values
     when we have 16KB and 64KB page sizes on ARM64.


Ah, right, there is 16KB as well :)



Yep, even though it's rarely used :)



   * When the hardcoded THP size is used, 2MB, 32MB and 512MB are
     valid values when we have 4KB, 16KB and 64KB page sizes on
     ARM64.

Co-developed-by: David Hildenbrand 
Signed-off-by: Gavin Shan 
---
  hw/virtio/virtio-mem.c | 32 
  1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index ac7a40f514..8f3c95300f 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -38,14 +38,25 @@
   */
  #define VIRTIO_MEM_MIN_BLOCK_SIZE ((uint32_t)(1 * MiB))
  
-#if defined(__x86_64__) || defined(__arm__) || defined(__aarch64__) || \
-defined(__powerpc64__)
-#define VIRTIO_MEM_DEFAULT_THP_SIZE ((uint32_t)(2 * MiB))
-#else
-/* fallback to 1 MiB (e.g., the THP size on s390x) */
-#define VIRTIO_MEM_DEFAULT_THP_SIZE VIRTIO_MEM_MIN_BLOCK_SIZE
+static uint32_t virtio_mem_default_thp_size(void)
+{
+uint32_t default_thp_size = VIRTIO_MEM_MIN_BLOCK_SIZE;
+
+#if defined(__x86_64__) || defined(__arm__) || defined(__powerpc64__)
+default_thp_size = (uint32_t)(2 * MiB);
+#elif defined(__aarch64__)
+if (qemu_real_host_page_size == (4 * KiB)) {


you can drop the superfluous (), also in the cases below.



It will be included in v3.


+default_thp_size = (uint32_t)(2 * MiB);


The explicit cast shouldn't be required.



It's not required, but inherited from the definition
of VIRTIO_MEM_MIN_BLOCK_SIZE. However, it's safe to
drop the cast and it will be included in v3.


+} else if (qemu_real_host_page_size == (16 * KiB)) {
+default_thp_size = (uint32_t)(32 * MiB);
+} else if (qemu_real_host_page_size == (64 * KiB)) {
+default_thp_size = (uint32_t)(512 * MiB);
+}
  #endif
  
+return default_thp_size;
+}
+
  /*
   * We want to have a reasonable default block size such that
   * 1. We avoid splitting THPs when unplugging memory, which degrades
@@ -78,11 +89,8 @@ static uint32_t virtio_mem_thp_size(void)
  if (g_file_get_contents(HPAGE_PMD_SIZE_PATH, &content, NULL, NULL) &&
  !qemu_strtou64(content, &endptr, 0, &tmp) &&
  (!endptr || *endptr == '\n')) {
-/*
- * Sanity-check the value, if it's too big (e.g., aarch64 with 64k base
- * pages) or weird, fallback to something smaller.
- */
-if (!tmp || !is_power_of_2(tmp) || tmp > 16 * MiB) {
+/* Sanity-check the value and fallback to something reasonable. */
+if (!tmp || !is_power_of_2(tmp)) {
  warn_report("Read unsupported THP size: %" PRIx64, tmp);
  } else {
  thp_size = tmp;
@@ -90,7 +98,7 @@ static uint32_t virtio_mem_thp_size(void)
  }
  
  if (!thp_size) {
-thp_size = VIRTIO_MEM_DEFAULT_THP_SIZE;
+thp_size = virtio_mem_default_thp_size();
  warn_report("Could not detect THP size, falling back to %" PRIx64
  "  MiB.", thp_size / MiB);
  }



Apart from that,

Reviewed-by: David Hildenbrand 



Thanks for your review, David!

Thanks,
Gavin
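Extracted as a standalone sketch (the function and macro names here are illustrative; the real virtio_mem_default_thp_size() in the patch has no parameter and reads qemu_real_host_page_size directly), the aarch64 mapping the patch settles on is:

```c
#include <assert.h>
#include <stdint.h>

#define KiB (1024u)
#define MiB (1024u * 1024u)
#define VIRTIO_MEM_MIN_BLOCK_SIZE (1 * MiB)

/* Illustrative version of the patch's default-THP-size logic for
 * aarch64: the PMD-level THP size follows the base page size
 * (2MB/32MB/512MB for 4KB/16KB/64KB pages), falling back to the
 * 1 MiB minimum block size for anything else. */
static uint32_t default_thp_size(uint32_t host_page_size)
{
    uint32_t thp = VIRTIO_MEM_MIN_BLOCK_SIZE;

    if (host_page_size == 4 * KiB) {
        thp = 2 * MiB;
    } else if (host_page_size == 16 * KiB) {
        thp = 32 * MiB;
    } else if (host_page_size == 64 * KiB) {
        thp = 512 * MiB;
    }
    return thp;
}
```

Note this only kicks in when the hpage_pmd_size sysfs file cannot be read; when it can, the kernel's answer wins.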




[libnbd PATCH 12/13] generator: Actually request extended headers

2021-12-03 Thread Eric Blake
This is the culmination of the previous patches' preparation work for
using extended headers when possible.  The new states in the state
machine are copied extensively from our handling of
OPT_STRUCTURED_REPLY.

At the same time I posted this patch, I had patches for qemu-nbd to
support extended headers as server (nbdkit is a bit tougher).  The
interop tests still pass when using a new enough qemu-nbd, showing
that we have cross-project interoperability and therefore an extension
worth standardizing.
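As a sketch of what the SEND state transmits (field names follow the h->sbuf.option fields used in the implementation; magic values per the NBD protocol, where NBD_NEW_VERSION is the ASCII string "IHAVEOPT"), the option request is a fixed 16-byte header with a zero-length payload:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the newstyle option request sent for
 * NBD_OPT_EXTENDED_HEADERS; the option carries no payload,
 * so optlen is 0. */
#define NBD_NEW_VERSION          UINT64_C(0x49484156454F5054) /* "IHAVEOPT" */
#define NBD_OPT_EXTENDED_HEADERS 11

struct nbd_new_option {
    uint64_t version;   /* NBD_NEW_VERSION */
    uint32_t option;    /* which option, e.g. NBD_OPT_EXTENDED_HEADERS */
    uint32_t optlen;    /* length of option payload; 0 here */
} __attribute__((packed));
```

All fields are big-endian on the wire, hence the htobe64/htobe32 conversions in the state machine code.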
---
 generator/Makefile.am |  3 +-
 generator/state_machine.ml| 41 +
 .../states-newstyle-opt-extended-headers.c| 90 +++
 generator/states-newstyle-opt-starttls.c  | 10 +--
 4 files changed, 138 insertions(+), 6 deletions(-)
 create mode 100644 generator/states-newstyle-opt-extended-headers.c

diff --git a/generator/Makefile.am b/generator/Makefile.am
index 594d23cf..c889eb7f 100644
--- a/generator/Makefile.am
+++ b/generator/Makefile.am
@@ -1,5 +1,5 @@
 # nbd client library in userspace
-# Copyright (C) 2013-2020 Red Hat Inc.
+# Copyright (C) 2013-2021 Red Hat Inc.
 #
 # This library is free software; you can redistribute it and/or
 # modify it under the terms of the GNU Lesser General Public
@@ -30,6 +30,7 @@ states_code = \
states-issue-command.c \
states-magic.c \
states-newstyle-opt-export-name.c \
+   states-newstyle-opt-extended-headers.c \
states-newstyle-opt-list.c \
states-newstyle-opt-go.c \
states-newstyle-opt-meta-context.c \
diff --git a/generator/state_machine.ml b/generator/state_machine.ml
index 99652948..ad8eba5e 100644
--- a/generator/state_machine.ml
+++ b/generator/state_machine.ml
@@ -295,6 +295,7 @@ and
* NEGOTIATING after OPT_STRUCTURED_REPLY or any failed OPT_GO.
*)
   Group ("OPT_STARTTLS", newstyle_opt_starttls_state_machine);
+  Group ("OPT_EXTENDED_HEADERS", newstyle_opt_extended_headers_state_machine);
   Group ("OPT_STRUCTURED_REPLY", newstyle_opt_structured_reply_state_machine);
   Group ("OPT_META_CONTEXT", newstyle_opt_meta_context_state_machine);
   Group ("OPT_GO", newstyle_opt_go_state_machine);
@@ -432,6 +433,46 @@ and
   };
 ]

+(* Fixed newstyle NBD_OPT_EXTENDED_HEADERS option.
+ * Implementation: generator/states-newstyle-opt-extended-headers.c
+ *)
+and newstyle_opt_extended_headers_state_machine = [
+  State {
+default_state with
+name = "START";
+comment = "Try to negotiate newstyle NBD_OPT_EXTENDED_HEADERS";
+external_events = [];
+  };
+
+  State {
+default_state with
+name = "SEND";
+comment = "Send newstyle NBD_OPT_EXTENDED_HEADERS negotiation request";
+external_events = [ NotifyWrite, "" ];
+  };
+
+  State {
+default_state with
+name = "RECV_REPLY";
+comment = "Receive newstyle NBD_OPT_EXTENDED_HEADERS option reply";
+external_events = [ NotifyRead, "" ];
+  };
+
+  State {
+default_state with
+name = "RECV_REPLY_PAYLOAD";
+comment = "Receive any newstyle NBD_OPT_EXTENDED_HEADERS reply payload";
+external_events = [ NotifyRead, "" ];
+  };
+
+  State {
+default_state with
+name = "CHECK_REPLY";
+comment = "Check newstyle NBD_OPT_EXTENDED_HEADERS option reply";
+external_events = [];
+  };
+]
+
 (* Fixed newstyle NBD_OPT_STRUCTURED_REPLY option.
  * Implementation: generator/states-newstyle-opt-structured-reply.c
  *)
diff --git a/generator/states-newstyle-opt-extended-headers.c b/generator/states-newstyle-opt-extended-headers.c
new file mode 100644
index ..e2c9890e
--- /dev/null
+++ b/generator/states-newstyle-opt-extended-headers.c
@@ -0,0 +1,90 @@
+/* nbd client library in userspace: state machine
+ * Copyright (C) 2013-2021 Red Hat Inc.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/* State machine for negotiating NBD_OPT_EXTENDED_HEADERS. */
+
+STATE_MACHINE {
+ NEWSTYLE.OPT_EXTENDED_HEADERS.START:
+  assert (h->gflags & LIBNBD_HANDSHAKE_FLAG_FIXED_NEWSTYLE);
+  if (!h->request_eh) {
+SET_NEXT_STATE (%^OPT_STRUCTURED_REPLY.START);
+return 0;
+  }
+
+  h->sbuf.option.version = htobe64 (NBD_NEW_VERSION);
+  h->sbuf.option.option = htobe32 (NBD_OPT_EXTENDED_HEADERS);
+  h->sbuf.option.optlen = htobe32 (0);
+  h->wbuf = &h->sbuf;

Re: [PATCH for 7.0 0/5] bsd-user-smoke: A simple smoke test for bsd-user

2021-12-03 Thread Warner Losh
PING!

If anybody (especially the BSD reviewers) could look at these, that would
be great!

It's been suggested I rename bsd-user-smoke to just be bsd-user and we put
our tests there until we can switch to the more generic tcg tests, so I'll
do that and resend in a few days.

Warner

On Sat, Nov 27, 2021 at 1:19 PM Warner Losh  wrote:

> This series adds a number of simple binaries that FreeBSD's clang can
> build on any system. I've kept it simple so that there's no extra
> binaries that need to be installed. Given the current state of bsd-user
> in the project's repo, this likely is as extensive a set of tests as
> should be done right now. We can load static binaries only (so these
> are static binaries) and hello world is the canonical test. I have
> binaries for all the supported FreeBSD targets, but have included only
> the ones that are in upstream (or in review) at this time.
>
> In the future, I'll integrate with the tcg tests when there's more in
> upstream they can test.  Since that requires putting together FreeBSD
> sysroots for all the supported architectures for multiple versions, I'm
> going to delay that for a while. I'll also integrate FreeBSD's 5k
> system tests when we're much further along with the upstreaming.
>
> The purpose of this is to give others doing changes in this area a
> standardized way to ensure their changes don't fundamentally break
> bsd-user. This approach will work for all setups that do a 'make check'
> to do their testing.
>
> Based-on: 20211108035136.43687-1-...@bsdimp.com
>
> Warner Losh (5):
>   h.armv7: Simple hello-world test for armv7
>   h.i386: Simple hello-world test for i386
>   h.amd64: Simple hello-world test for x86_64
>   smoke-bsd-user: A test script to run all the FreeBSD binaries
>   bsd-user-smoke: Add to build
>
>  tests/bsd-user-smoke/h.amd64.S  | 28 +
>  tests/bsd-user-smoke/h.armv7.S  | 37 +++
>  tests/bsd-user-smoke/h.i386.S   | 39 +
>  tests/bsd-user-smoke/meson.build| 31 +++
>  tests/bsd-user-smoke/smoke-bsd-user | 22 
>  tests/meson.build   |  1 +
>  6 files changed, 158 insertions(+)
>  create mode 100644 tests/bsd-user-smoke/h.amd64.S
>  create mode 100644 tests/bsd-user-smoke/h.armv7.S
>  create mode 100644 tests/bsd-user-smoke/h.i386.S
>  create mode 100644 tests/bsd-user-smoke/meson.build
>  create mode 100644 tests/bsd-user-smoke/smoke-bsd-user
>
> --
> 2.33.0
>
>


[libnbd PATCH 09/13] block_status: Accept 64-bit extents during block status

2021-12-03 Thread Eric Blake
Support a server giving us a 64-bit extent.  Note that the protocol
says a server should not give a 64-bit answer when extended headers
are not negotiated, but since the client's size is merely a hint, it
is possible for a server to have a 64-bit answer even when the
original query was 32 bits.  At any rate, it is just as easy for us to
always support the new chunk type as it is to complain when it is used
incorrectly by the server, and the user's 32-bit callback doesn't have
to care which size the server's result used (either the server's
result was a 32-bit value, or our shim silently truncates it, but the
user still makes progress).  Of course, until a later patch enables
extended headers negotiation, no compliant server will trigger the new
code here.

Implementation-wise, we don't care if we will be narrowing from the
server's 16-byte extent (including explicit padding) to a 12-byte
struct, or if our 'nbd_extent' type has implicit padding and is thus
also 16 bytes; either way, the order of our byte-swapping traversal is
safe.
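The narrowing behavior described above can be sketched as follows (the helper name is hypothetical; the real conversion lives in the generated callback glue, and this sketch clamps an over-wide length so the 32-bit caller still makes forward progress):

```c
#include <assert.h>
#include <stdint.h>

/* The 64-bit preferred internal form (matches the nbd_extent
 * mentioned above; implicit tail padding may make it 16 bytes). */
typedef struct {
    uint64_t length;
    uint32_t flags;
} nbd_extent;

/* Hypothetical shim: narrow a 64-bit extent length for a caller that
 * registered the old 32-bit callback.  A length that does not fit in
 * 32 bits is clamped in this sketch. */
static uint32_t narrow_length(uint64_t length)
{
    return length > UINT32_MAX ? UINT32_MAX : (uint32_t)length;
}
```

Either way the descriptor the user sees never exceeds what a 32-bit reply could have carried, which is why the 32-bit callback does not need to know which reply type the server used.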
---
 lib/internal.h  |  1 +
 generator/states-reply-structured.c | 75 +++--
 2 files changed, 60 insertions(+), 16 deletions(-)

diff --git a/lib/internal.h b/lib/internal.h
index 4800df83..97abf4f2 100644
--- a/lib/internal.h
+++ b/lib/internal.h
@@ -289,6 +289,7 @@ struct nbd_handle {
   union {
 nbd_extent *normal; /* Our 64-bit preferred internal form */
 uint32_t *narrow;   /* 32-bit form of NBD_REPLY_TYPE_BLOCK_STATUS */
+struct nbd_block_descriptor_ext *wide; /* NBD_REPLY_TYPE_BLOCK_STATUS_EXT */
   } bs_entries;

   /* Commands which are waiting to be issued [meaning the request
diff --git a/generator/states-reply-structured.c b/generator/states-reply-structured.c
index 71c761e9..29b1c3d8 100644
--- a/generator/states-reply-structured.c
+++ b/generator/states-reply-structured.c
@@ -22,6 +22,8 @@
 #include 
 #include 

+#include "minmax.h"
+
 /* Structured reply must be completely inside the bounds of the
  * requesting command.
  */
@@ -202,7 +204,8 @@ STATE_MACHINE {
 SET_NEXT_STATE (%RECV_OFFSET_HOLE);
 return 0;
   }
-  else if (type == NBD_REPLY_TYPE_BLOCK_STATUS) {
+  else if (type == NBD_REPLY_TYPE_BLOCK_STATUS ||
+   type == NBD_REPLY_TYPE_BLOCK_STATUS_EXT) {
 if (cmd->type != NBD_CMD_BLOCK_STATUS) {
   SET_NEXT_STATE (%.DEAD);
   set_error (0, "invalid command for receiving block-status chunk, "
@@ -211,12 +214,19 @@ STATE_MACHINE {
  cmd->type);
   return 0;
 }
-/* XXX We should be able to skip the bad reply in these two cases. */
-if (length < 12 || ((length-4) & 7) != 0) {
+/* XXX We should be able to skip the bad reply in these cases. */
+if (type == NBD_REPLY_TYPE_BLOCK_STATUS &&
+(length < 12 || (length-4) % (2 * sizeof(uint32_t)))) {
   SET_NEXT_STATE (%.DEAD);
   set_error (0, "invalid length in NBD_REPLY_TYPE_BLOCK_STATUS");
   return 0;
 }
+if (type == NBD_REPLY_TYPE_BLOCK_STATUS_EXT &&
+(length < 20 || (length-4) % sizeof(struct nbd_block_descriptor_ext))) {
+  SET_NEXT_STATE (%.DEAD);
+  set_error (0, "invalid length in NBD_REPLY_TYPE_BLOCK_STATUS_EXT");
+  return 0;
+}
 if (CALLBACK_IS_NULL (cmd->cb.fn.extent)) {
   SET_NEXT_STATE (%.DEAD);
   set_error (0, "not expecting NBD_REPLY_TYPE_BLOCK_STATUS here");
@@ -495,6 +505,7 @@ STATE_MACHINE {
   struct command *cmd = h->reply_cmd;
   uint32_t length;
   uint32_t count;
+  uint16_t type;

   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return 0;
@@ -504,24 +515,33 @@ STATE_MACHINE {
 return 0;
   case 0:
 length = h->sbuf.sr.hdr.structured_reply.length; /* normalized in CHECK */
+type = be16toh (h->sbuf.sr.hdr.structured_reply.type);

 assert (cmd); /* guaranteed by CHECK */
 assert (cmd->type == NBD_CMD_BLOCK_STATUS);
 assert (length >= 12);
 length -= sizeof h->bs_contextid;
-count = length / (2 * sizeof (uint32_t));
+if (type == NBD_REPLY_TYPE_BLOCK_STATUS)
+  count = length / (2 * sizeof (uint32_t));
+else {
+  assert (type == NBD_REPLY_TYPE_BLOCK_STATUS_EXT);
+  /* XXX Insist on h->extended_headers? */
+  count = length / sizeof (struct nbd_block_descriptor_ext);
+}

-/* Read raw data into a subset of h->bs_entries, then expand it
+/* Read raw data into an overlap of h->bs_entries, then move it
  * into place later during byte-swapping.
  */
 free (h->bs_entries.normal);
-h->bs_entries.normal = malloc (count * sizeof *h->bs_entries.normal);
+h->bs_entries.normal = malloc (MAX (count * sizeof *h->bs_entries.normal,
+length));
 if (h->bs_entries.normal == NULL) {
   SET_NEXT_STATE (%.DEAD);
   set_error (errno, "malloc");
   return 0;
 }
-h->rbuf = h->bs_entries.narrow;
+h->rbuf = type == NBD_REPLY_TYPE_BLOCK_STATUS
+  ? h->bs_

[libnbd PATCH 07/13] generator: Add struct nbd_extent in prep for 64-bit extents

2021-12-03 Thread Eric Blake
The existing nbd_block_status() callback is permanently stuck with an
array of uint32_t pairs (len/2 extents), and exposing 64-bit extents
requires a new API.  Before we get there, we first need a way to
express a struct containing uint64_t length and uint32_t flags across
the various language bindings in the callback that will be used by the
new API.

For the language bindings, we have to construct an array of a similar
struct in the target language's preferred format.  The bindings for
Python and OCaml were relatively straightforward; the Golang bindings
took a bit more effort for me to write.  Temporary unused attributes
are needed to keep the compiler happy until a later patch exposes a
new API using the new callback type.
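A sketch of the C-level shape this adds: the nbd_extent typedef matches the header fragment emitted later in this patch, while the callback prototype is illustrative, modeled on the existing 32-bit extent callback (a CBArrayAndLen argument becomes a pointer + length pair in C):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The struct the new extent64 callback receives, per the typedef this
 * patch emits into the public header. */
typedef struct {
    uint64_t length;
    uint32_t flags;
} nbd_extent;

/* Illustrative callback shape for the extent64 closure: metacontext
 * and offset as before, but entries is now an array of structs rather
 * than interleaved uint32_t pairs. */
typedef int (*extent64_fn) (void *user_data, const char *metacontext,
                            uint64_t offset, nbd_extent *entries,
                            size_t nr_entries, int *error);
```

Each language binding then builds its own array-of-struct view of `entries` before invoking user code.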
---
 generator/API.ml| 12 +++-
 generator/API.mli   |  3 ++-
 generator/C.ml  | 24 +---
 generator/GoLang.ml | 24 
 generator/OCaml.ml  | 21 +
 generator/Python.ml | 30 ++
 ocaml/helpers.c | 22 +-
 ocaml/nbd-c.h   |  3 ++-
 golang/handle.go|  6 ++
 9 files changed, 130 insertions(+), 15 deletions(-)

diff --git a/generator/API.ml b/generator/API.ml
index cf2e7543..70ae721d 100644
--- a/generator/API.ml
+++ b/generator/API.ml
@@ -42,6 +42,7 @@
 | BytesPersistOut of string * string
 | Closure of closure
 | Enum of string * enum
+| Extent64 of string
 | Fd of string
 | Flags of string * flags
 | Int of string
@@ -142,6 +143,14 @@ let extent_closure =
 "nr_entries");
  CBMutable (Int "error") ]
 }
+let extent64_closure = {
+  cbname = "extent64";
+  cbargs = [ CBString "metacontext";
+ CBUInt64 "offset";
+ CBArrayAndLen (Extent64 "entries",
+"nr_entries");
+ CBMutable (Int "error") ]
+}
 let list_closure = {
   cbname = "list";
   cbargs = [ CBString "name"; CBString "description" ]
@@ -151,7 +160,8 @@ let context_closure =
   cbargs = [ CBString "name" ]
 }
 let all_closures = [ chunk_closure; completion_closure;
- debug_closure; extent_closure; list_closure;
+ debug_closure; extent_closure; extent64_closure;
+ list_closure;
  context_closure ]

 (* Enums. *)
diff --git a/generator/API.mli b/generator/API.mli
index d284637f..922d8120 100644
--- a/generator/API.mli
+++ b/generator/API.mli
@@ -1,6 +1,6 @@
 (* hey emacs, this is OCaml code: -*- tuareg -*- *)
 (* nbd client library in userspace: the API
- * Copyright (C) 2013-2020 Red Hat Inc.
+ * Copyright (C) 2013-2021 Red Hat Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public
@@ -52,6 +52,7 @@ and
 | BytesPersistOut of string * string
 | Closure of closure   (** function pointer + void *opaque *)
 | Enum of string * enum(** enum/union type, int in C *)
+| Extent64 of string   (** extent descriptor, struct nbd_extent in C *)
 | Fd of string (** file descriptor *)
 | Flags of string * flags  (** flags, uint32_t in C *)
 | Int of string(** small int *)
diff --git a/generator/C.ml b/generator/C.ml
index 797af531..7b0be583 100644
--- a/generator/C.ml
+++ b/generator/C.ml
@@ -1,6 +1,6 @@
 (* hey emacs, this is OCaml code: -*- tuareg -*- *)
 (* nbd client library in userspace: generate the C API and documentation
- * Copyright (C) 2013-2020 Red Hat Inc.
+ * Copyright (C) 2013-2021 Red Hat Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public
@@ -90,6 +90,7 @@ let
 | Closure { cbname } ->
[ sprintf "%s_callback" cbname; sprintf "%s_user_data" cbname ]
 | Enum (n, _) -> [n]
+| Extent64 n -> [n]
 | Fd n -> [n]
 | Flags (n, _) -> [n]
 | Int n -> [n]
@@ -152,6 +153,9 @@ and
   | Enum (n, _) ->
  if types then pr "int ";
  pr "%s" n
+  | Extent64 n ->
+ if types then pr "nbd_extent ";
+ pr "%s" n
   | Flags (n, _) ->
  if types then pr "uint32_t ";
  pr "%s" n
@@ -238,6 +242,11 @@ and
  pr "%s, " n;
  if types then pr "size_t ";
  pr "%s" len
+  | CBArrayAndLen (Extent64 n, len) ->
+ if types then pr "nbd_extent *";
+ pr "%s, " n;
+ if types then pr "size_t ";
+ pr "%s" len
   | CBArrayAndLen _ -> assert false
   | CBBytesIn (n, len) ->
  if types then pr "const void *";
@@ -388,6 +397,13 @@ let
   pr "extern int nbd_get_errno (void);\n";
   pr "#define LIBNBD_HAVE_NBD_GET_ERRNO 1\n";
   pr "\n";
+  pr "/* This is used in the callback for nbd_block_status_64.\n";
+  pr " */\n";
+  pr "typedef struct {\n";
+  pr "  uint64_t length;\n";
+  pr "  uint32_t flags;\n";
+  pr "} nbd_extent;\n";
+  pr "\n";
   print_closure_structs ();
   List.iter (
 fun (name, { args; optargs; ret }) ->
@@ 

[PATCH v3 1/2] virtio-mem: Correct default THP size for ARM64

2021-12-03 Thread Gavin Shan
The default block size is the same as the THP size, which is either
retrieved from "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
or hardcoded to 2MB. There are flaws in both mechanisms, and this
patch intends to fix them up.

  * When "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size" is
    used to get the THP size, 32MB and 512MB are valid values
    when we have 16KB and 64KB page sizes on ARM64.

  * When the hardcoded THP size is used, 2MB, 32MB and 512MB are
    valid values when we have 4KB, 16KB and 64KB page sizes on
    ARM64.

Co-developed-by: David Hildenbrand 
Signed-off-by: Gavin Shan 
Reviewed-by: Jonathan Cameron 
Reviewed-by: David Hildenbrand 
---
 hw/virtio/virtio-mem.c | 32 
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index d5a578142b..b20595a496 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -38,14 +38,25 @@
  */
 #define VIRTIO_MEM_MIN_BLOCK_SIZE ((uint32_t)(1 * MiB))
 
-#if defined(__x86_64__) || defined(__arm__) || defined(__aarch64__) || \
-defined(__powerpc64__)
-#define VIRTIO_MEM_DEFAULT_THP_SIZE ((uint32_t)(2 * MiB))
-#else
-/* fallback to 1 MiB (e.g., the THP size on s390x) */
-#define VIRTIO_MEM_DEFAULT_THP_SIZE VIRTIO_MEM_MIN_BLOCK_SIZE
+static uint32_t virtio_mem_default_thp_size(void)
+{
+uint32_t default_thp_size = VIRTIO_MEM_MIN_BLOCK_SIZE;
+
+#if defined(__x86_64__) || defined(__arm__) || defined(__powerpc64__)
+default_thp_size = 2 * MiB;
+#elif defined(__aarch64__)
+if (qemu_real_host_page_size == 4 * KiB) {
+default_thp_size = 2 * MiB;
+} else if (qemu_real_host_page_size == 16 * KiB) {
+default_thp_size = 32 * MiB;
+} else if (qemu_real_host_page_size == 64 * KiB) {
+default_thp_size = 512 * MiB;
+}
 #endif
 
+return default_thp_size;
+}
+
 /*
  * We want to have a reasonable default block size such that
  * 1. We avoid splitting THPs when unplugging memory, which degrades
@@ -78,11 +89,8 @@ static uint32_t virtio_mem_thp_size(void)
 if (g_file_get_contents(HPAGE_PMD_SIZE_PATH, &content, NULL, NULL) &&
 !qemu_strtou64(content, &endptr, 0, &tmp) &&
 (!endptr || *endptr == '\n')) {
-/*
- * Sanity-check the value, if it's too big (e.g., aarch64 with 64k base
- * pages) or weird, fallback to something smaller.
- */
-if (!tmp || !is_power_of_2(tmp) || tmp > 16 * MiB) {
+/* Sanity-check the value and fallback to something reasonable. */
+if (!tmp || !is_power_of_2(tmp)) {
 warn_report("Read unsupported THP size: %" PRIx64, tmp);
 } else {
 thp_size = tmp;
@@ -90,7 +98,7 @@ static uint32_t virtio_mem_thp_size(void)
 }
 
 if (!thp_size) {
-thp_size = VIRTIO_MEM_DEFAULT_THP_SIZE;
+thp_size = virtio_mem_default_thp_size();
 warn_report("Could not detect THP size, falling back to %" PRIx64
 "  MiB.", thp_size / MiB);
 }
-- 
2.23.0




[libnbd PATCH 04/13] protocol: Prepare to send 64-bit requests

2021-12-03 Thread Eric Blake
Support sending 64-bit requests if extended headers were negotiated.

At this point, h->extended_headers is permanently false (we can't
enable it until all other aspects of the protocol have likewise been
converted).
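The two wire formats the issue-command code chooses between can be sketched like this (the classic layout follows struct nbd_request in lib/nbd-protocol.h; the extended variant is per the extension proposal). Only the magic differs and `count` widens to 64 bits; the leading fields are coincident, which is what lets the code fill them through req.request in both cases:

```c
#include <assert.h>
#include <stdint.h>

struct nbd_request {         /* classic 28-byte request */
    uint32_t magic;          /* NBD_REQUEST_MAGIC */
    uint16_t flags;
    uint16_t type;
    uint64_t handle;         /* cookie */
    uint64_t offset;
    uint32_t count;
} __attribute__((packed));

struct nbd_request_ext {     /* extended 32-byte request */
    uint32_t magic;          /* NBD_REQUEST_EXT_MAGIC */
    uint16_t flags;
    uint16_t type;
    uint64_t handle;
    uint64_t offset;
    uint64_t count;          /* widened to 64 bits */
} __attribute__((packed));
```

This coincident prefix is also why a single union over both structs suffices for the handle's request buffer.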
---
 lib/internal.h  | 12 ---
 generator/states-issue-command.c| 31 +++--
 generator/states-reply-structured.c |  2 +-
 lib/rw.c| 10 --
 4 files changed, 34 insertions(+), 21 deletions(-)

diff --git a/lib/internal.h b/lib/internal.h
index 7e96e8e9..07378588 100644
--- a/lib/internal.h
+++ b/lib/internal.h
@@ -1,5 +1,5 @@
 /* nbd client library in userspace: internal definitions
- * Copyright (C) 2013-2020 Red Hat Inc.
+ * Copyright (C) 2013-2021 Red Hat Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public
@@ -106,6 +106,9 @@ struct nbd_handle {
   char *tls_username;   /* Username, NULL = use current username */
   char *tls_psk_file;   /* PSK filename, NULL = no PSK */

+  /* Extended headers. */
+  bool extended_headers;/* If we negotiated NBD_OPT_EXTENDED_HEADERS */
+
   /* Desired metadata contexts. */
   bool request_sr;
   string_vector request_meta_contexts;
@@ -242,7 +245,10 @@ struct nbd_handle {
   /* Issuing a command must use a buffer separate from sbuf, for the
* case when we interrupt a request to service a reply.
*/
-  struct nbd_request request;
+  union {
+struct nbd_request request;
+struct nbd_request_ext request_ext;
+  } req;
   bool in_write_payload;
   bool in_write_shutdown;

@@ -347,7 +353,7 @@ struct command {
   uint16_t type;
   uint64_t cookie;
   uint64_t offset;
-  uint32_t count;
+  uint64_t count;
   void *data; /* Buffer for read/write */
   struct command_cb cb;
   enum state state; /* State to resume with on next POLLIN */
diff --git a/generator/states-issue-command.c b/generator/states-issue-command.c
index a8101144..7b1d6dc7 100644
--- a/generator/states-issue-command.c
+++ b/generator/states-issue-command.c
@@ -1,5 +1,5 @@
 /* nbd client library in userspace: state machine
- * Copyright (C) 2013-2020 Red Hat Inc.
+ * Copyright (C) 2013-2021 Red Hat Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public
@@ -41,14 +41,23 @@ STATE_MACHINE {
 return 0;
   }

-  h->request.magic = htobe32 (NBD_REQUEST_MAGIC);
-  h->request.flags = htobe16 (cmd->flags);
-  h->request.type = htobe16 (cmd->type);
-  h->request.handle = htobe64 (cmd->cookie);
-  h->request.offset = htobe64 (cmd->offset);
-  h->request.count = htobe32 ((uint32_t) cmd->count);
-  h->wbuf = &h->request;
-  h->wlen = sizeof (h->request);
+  /* These fields are coincident between req.request and req.request_ext */
+  h->req.request.flags = htobe16 (cmd->flags);
+  h->req.request.type = htobe16 (cmd->type);
+  h->req.request.handle = htobe64 (cmd->cookie);
+  h->req.request.offset = htobe64 (cmd->offset);
+  if (h->extended_headers) {
+h->req.request_ext.magic = htobe32 (NBD_REQUEST_EXT_MAGIC);
+h->req.request_ext.count = htobe64 (cmd->count);
+h->wlen = sizeof (h->req.request_ext);
+  }
+  else {
+assert (cmd->count <= UINT32_MAX);
+h->req.request.magic = htobe32 (NBD_REQUEST_MAGIC);
+h->req.request.count = htobe32 (cmd->count);
+h->wlen = sizeof (h->req.request);
+  }
+  h->wbuf = &h->req;
   if (cmd->type == NBD_CMD_WRITE || cmd->next)
 h->wflags = MSG_MORE;
   SET_NEXT_STATE (%SEND_REQUEST);
@@ -73,7 +82,7 @@ STATE_MACHINE {

   assert (h->cmds_to_issue != NULL);
   cmd = h->cmds_to_issue;
-  assert (cmd->cookie == be64toh (h->request.handle));
+  assert (cmd->cookie == be64toh (h->req.request.handle));
   if (cmd->type == NBD_CMD_WRITE) {
 h->wbuf = cmd->data;
 h->wlen = cmd->count;
@@ -119,7 +128,7 @@ STATE_MACHINE {
   assert (!h->wlen);
   assert (h->cmds_to_issue != NULL);
   cmd = h->cmds_to_issue;
-  assert (cmd->cookie == be64toh (h->request.handle));
+  assert (cmd->cookie == be64toh (h->req.request.handle));
   h->cmds_to_issue = cmd->next;
   if (h->cmds_to_issue_tail == cmd)
 h->cmds_to_issue_tail = NULL;
diff --git a/generator/states-reply-structured.c b/generator/states-reply-structured.c
index e1da850d..5524e000 100644
--- a/generator/states-reply-structured.c
+++ b/generator/states-reply-structured.c
@@ -34,7 +34,7 @@ structured_reply_in_bounds (uint64_t offset, uint32_t length,
   offset + length > cmd->offset + cmd->count) {
 set_error (0, "range of structured reply is out of bounds, "
"offset=%" PRIu64 ", cmd->offset=%" PRIu64 ", "
-   "length=%" PRIu32 ", cmd->count=%" PRIu32 ": "
+   "length=%" PRIu32 ", cmd->count=%" PRIu64 ": "
"this is likely to be a bug in the NBD server",
offset, cmd->offset, length, cmd->count);
 return false;
diff --git a/lib/rw.

[PATCH v3 0/2] hw/arm/virt: Support for virtio-mem-pci

2021-12-03 Thread Gavin Shan
This series supports the virtio-mem-pci device by simply following the
implementation on x86. The exception is that the block size is 512MB on
ARM64 instead of 128MB on x86, compatible with the memory section
size in the Linux guest.

The work was done by David Hildenbrand and then Jonathan Cameron. I'm
picking up the patch and putting in more effort, which at the current
stage is mostly about testing.

Testing
===
The upstream Linux kernel (v5.16-rc3) is used on host/guest during
the testing. The guest kernel includes a change to enable the virtio-mem
driver, which is simply enabling CONFIG_VIRTIO_MEM on ARM64.

Multiple combinations of page sizes on host/guest, memory backend
device, etc. are covered in the testing. Besides, migration is also
tested. The following command lines are used for VM or virtio-mem-pci
device hot-add. It's notable that virtio-mem-pci device hot-remove
isn't supported, similar to what we have on x86. 

  host.pgsize  guest.pgsize  backend    hot-add  hot-remove  migration
  ---------------------------------------------------------------------
     4KB          4KB        normal     ok       ok          ok
                             THP        ok       ok          ok
                             hugeTLB    ok       ok          ok
     4KB          64KB       normal     ok       ok          ok
                             THP        ok       ok          ok
                             hugeTLB    ok       ok          ok
     64KB         4KB        normal     ok       ok          ok
                             THP        ok       ok          ok
                             hugeTLB    ok       ok          ok
     64KB         64KB       normal     ok       ok          ok
                             THP        ok       ok          ok
                             hugeTLB    ok       ok          ok

The following command lines are used for the VM. When hugeTLBfs is used,
all memory backend objects are populated on /dev/hugepages-2048kB or
/dev/hugepages-524288kB, depending on the host page size.

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
  -accel kvm -machine virt,gic-version=host \
  -cpu host -smp 4,sockets=2,cores=2,threads=1 \
  -m 1024M,slots=16,maxmem=64G \
  -object memory-backend-ram,id=mem0,size=512M \
  -object memory-backend-ram,id=mem1,size=512M \
  -numa node,nodeid=0,cpus=0-1,memdev=mem0 \
  -numa node,nodeid=1,cpus=2-3,memdev=mem1 \
 :
  -kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image \
  -initrd /home/gavin/sandbox/images/rootfs.cpio.xz \
  -append earlycon=pl011,mmio,0x900 \
  -device pcie-root-port,bus=pcie.0,chassis=1,id=pcie.1 \
  -device pcie-root-port,bus=pcie.0,chassis=2,id=pcie.2 \
  -device pcie-root-port,bus=pcie.0,chassis=3,id=pcie.3 \
  -object memory-backend-ram,id=vmem0,size=512M \
  -device virtio-mem-pci,id=vm0,bus=pcie.1,memdev=vmem0,node=0,requested-size=0 \
  -object memory-backend-ram,id=vmem1,size=512M \
  -device virtio-mem-pci,id=vm1,bus=pcie.2,memdev=vmem1,node=1,requested-size=0

Command lines used for memory hot-add and hot-remove:

  (qemu) qom-set vm1 requested-size 512M
  (qemu) qom-set vm1 requested-size 0
  (qemu) qom-set vm1 requested-size 512M

Command lines used for virtio-mem-pci device hot-add:

  (qemu) object_add memory-backend-ram,id=hp-mem1,size=512M
  (qemu) device_add virtio-mem-pci,id=hp-vm1,bus=pcie.3,memdev=hp-mem1,node=1
  (qemu) qom-set hp-vm1 requested-size 512M
  (qemu) qom-set hp-vm1 requested-size 0
  (qemu) qom-set hp-vm1 requested-size 512M

Changelog
=
v3:
  * Reshuffle patches   (David)
  * Suggested code refactoring for virtio_mem_default_thp_size()(David)
  * Pick r-b from Jonathan and David(Gavin)
v2:
  * Include David/Jonathan as co-developers in the commit log   (David)
  * Decrease VIRTIO_MEM_USABLE_EXTENT to 512MB on ARM64 in PATCH[1/2]   (David)
  * PATCH[2/2] is added to correct the THP sizes on ARM64   (David)

Gavin Shan (2):
  virtio-mem: Correct default THP size for ARM64
  hw/arm/virt: Support for virtio-mem-pci

 hw/arm/Kconfig |  1 +
 hw/arm/virt.c  | 68 +-
 hw/virtio/virtio-mem.c | 36 ++
 3 files changed, 91 insertions(+), 14 deletions(-)

-- 
2.23.0




[libnbd PATCH 03/13] protocol: Add definitions for extended headers

2021-12-03 Thread Eric Blake
Add the magic numbers and new structs necessary to implement the NBD
protocol extension of extended headers providing 64-bit lengths.
---
 lib/nbd-protocol.h | 61 ++
 1 file changed, 51 insertions(+), 10 deletions(-)

diff --git a/lib/nbd-protocol.h b/lib/nbd-protocol.h
index e5d6404b..7247d775 100644
--- a/lib/nbd-protocol.h
+++ b/lib/nbd-protocol.h
@@ -1,5 +1,5 @@
 /* nbdkit
- * Copyright (C) 2013-2020 Red Hat Inc.
+ * Copyright (C) 2013-2021 Red Hat Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are
@@ -124,6 +124,7 @@ struct nbd_fixed_new_option_reply {
 #define NBD_OPT_STRUCTURED_REPLY   8
 #define NBD_OPT_LIST_META_CONTEXT  9
 #define NBD_OPT_SET_META_CONTEXT   10
+#define NBD_OPT_EXTENDED_HEADERS   11

 #define NBD_REP_ERR(val) (0x80000000 | (val))
 #define NBD_REP_IS_ERR(val) (!!((val) & 0x80000000))
@@ -188,6 +189,13 @@ struct nbd_block_descriptor {
   uint32_t status_flags;/* block type (hole etc) */
 } NBD_ATTRIBUTE_PACKED;

+/* NBD_REPLY_TYPE_BLOCK_STATUS_EXT block descriptor. */
+struct nbd_block_descriptor_ext {
+  uint64_t length;  /* length of block */
+  uint32_t status_flags;/* block type (hole etc) */
+  uint32_t pad; /* must be zero */
+} NBD_ATTRIBUTE_PACKED;
+
 /* Request (client -> server). */
 struct nbd_request {
   uint32_t magic;   /* NBD_REQUEST_MAGIC. */
@@ -197,6 +205,14 @@ struct nbd_request {
   uint64_t offset;  /* Request offset. */
   uint32_t count;   /* Request length. */
 } NBD_ATTRIBUTE_PACKED;
+struct nbd_request_ext {
+  uint32_t magic;   /* NBD_REQUEST_EXT_MAGIC. */
+  uint16_t flags;   /* Request flags. */
+  uint16_t type;/* Request type. */
+  uint64_t handle;  /* Opaque handle. */
+  uint64_t offset;  /* Request offset. */
+  uint64_t count;   /* Request length. */
+} NBD_ATTRIBUTE_PACKED;

 /* Simple reply (server -> client). */
 struct nbd_simple_reply {
@@ -204,6 +220,13 @@ struct nbd_simple_reply {
   uint32_t error;   /* NBD_SUCCESS or one of NBD_E*. */
   uint64_t handle;  /* Opaque handle. */
 } NBD_ATTRIBUTE_PACKED;
+struct nbd_simple_reply_ext {
+  uint32_t magic;   /* NBD_SIMPLE_REPLY_EXT_MAGIC. */
+  uint32_t error;   /* NBD_SUCCESS or one of NBD_E*. */
+  uint64_t handle;  /* Opaque handle. */
+  uint64_t pad1;/* Must be 0. */
+  uint64_t pad2;/* Must be 0. */
+} NBD_ATTRIBUTE_PACKED;

 /* Structured reply (server -> client). */
 struct nbd_structured_reply {
@@ -213,6 +236,14 @@ struct nbd_structured_reply {
   uint64_t handle;  /* Opaque handle. */
   uint32_t length;  /* Length of payload which follows. */
 } NBD_ATTRIBUTE_PACKED;
+struct nbd_structured_reply_ext {
+  uint32_t magic;   /* NBD_STRUCTURED_REPLY_EXT_MAGIC. */
+  uint16_t flags;   /* NBD_REPLY_FLAG_* */
+  uint16_t type;/* NBD_REPLY_TYPE_* */
+  uint64_t handle;  /* Opaque handle. */
+  uint64_t length;  /* Length of payload which follows. */
+  uint64_t pad; /* Must be 0. */
+} NBD_ATTRIBUTE_PACKED;

 struct nbd_structured_reply_offset_data {
   uint64_t offset;  /* offset */
@@ -224,15 +255,23 @@ struct nbd_structured_reply_offset_hole {
   uint32_t length;  /* Length of hole. */
 } NBD_ATTRIBUTE_PACKED;

+struct nbd_structured_reply_offset_hole_ext {
+  uint64_t offset;
+  uint64_t length;  /* Length of hole. */
+} NBD_ATTRIBUTE_PACKED;
+
 struct nbd_structured_reply_error {
   uint32_t error;   /* NBD_E* error number */
   uint16_t len; /* Length of human readable error. */
   /* Followed by human readable error string, and possibly more structure. */
 } NBD_ATTRIBUTE_PACKED;

-#define NBD_REQUEST_MAGIC   0x25609513
-#define NBD_SIMPLE_REPLY_MAGIC  0x67446698
-#define NBD_STRUCTURED_REPLY_MAGIC  0x668e33ef
+#define NBD_REQUEST_MAGIC   0x25609513
+#define NBD_REQUEST_EXT_MAGIC   0x21e41c71
+#define NBD_SIMPLE_REPLY_MAGIC  0x67446698
+#define NBD_SIMPLE_REPLY_EXT_MAGIC  0x60d12fd6
+#define NBD_STRUCTURED_REPLY_MAGIC  0x668e33ef
+#define NBD_STRUCTURED_REPLY_EXT_MAGIC  0x6e8a278c

 /* Structured reply flags. */
 #define NBD_REPLY_FLAG_DONE (1<<0)
@@ -241,12 +280,14 @@ struct nbd_structured_reply_error {
 #define NBD_REPLY_TYPE_IS_ERR(val) (!!((val) & (1<<15)))

 /* Structured reply types. */
-#define NBD_REPLY_TYPE_NONE 0
-#define NBD_REPLY_TYPE_OFFSET_DATA  1
-#define NBD_REPLY_TYPE_OFFSET_HOLE  2
-#define NBD_REPLY_TYPE_BLOCK_STATUS 5
-#define NBD_REPLY_TYPE_ERROR        NBD_REPLY_TYPE_ERR (1)
-#define NBD_REPLY_TYPE_ERROR_OFFSET NBD_REPLY_TYPE_ERR (2)
+#define NBD

[PATCH v3 2/2] hw/arm/virt: Support for virtio-mem-pci

2021-12-03 Thread Gavin Shan
This supports virtio-mem-pci device on "virt" platform, by simply
following the implementation on x86.

   * This implements the hotplug handlers to support virtio-mem-pci
 device hot-add; hot-remove isn't supported, matching the x86
 implementation.

   * The block size is 512MB on ARM64 instead of 128MB on x86.

   * It has been passing the tests with various combinations like 64KB
 and 4KB page sizes on host and guest, different memory device
 backends like normal, transparent huge page and HugeTLB, plus
 migration.

Co-developed-by: David Hildenbrand 
Co-developed-by: Jonathan Cameron 
Signed-off-by: Gavin Shan 
Reviewed-by: Jonathan Cameron 
Reviewed-by: David Hildenbrand 
---
 hw/arm/Kconfig |  1 +
 hw/arm/virt.c  | 68 +-
 hw/virtio/virtio-mem.c |  4 ++-
 3 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
index 2d37d29f02..15aff8efb8 100644
--- a/hw/arm/Kconfig
+++ b/hw/arm/Kconfig
@@ -27,6 +27,7 @@ config ARM_VIRT
 select DIMM
 select ACPI_HW_REDUCED
 select ACPI_APEI
+select VIRTIO_MEM_SUPPORTED
 
 config CHEETAH
 bool
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 30da05dfe0..db1544760d 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -72,9 +72,11 @@
 #include "hw/arm/smmuv3.h"
 #include "hw/acpi/acpi.h"
 #include "target/arm/internals.h"
+#include "hw/mem/memory-device.h"
 #include "hw/mem/pc-dimm.h"
 #include "hw/mem/nvdimm.h"
 #include "hw/acpi/generic_event_device.h"
+#include "hw/virtio/virtio-mem-pci.h"
 #include "hw/virtio/virtio-iommu.h"
 #include "hw/char/pl011.h"
 #include "qemu/guest-random.h"
@@ -2483,6 +2485,63 @@ static void virt_memory_plug(HotplugHandler *hotplug_dev,
  dev, &error_abort);
 }
 
+static void virt_virtio_md_pci_pre_plug(HotplugHandler *hotplug_dev,
+DeviceState *dev, Error **errp)
+{
+HotplugHandler *hotplug_dev2 = qdev_get_bus_hotplug_handler(dev);
+Error *local_err = NULL;
+
+if (!hotplug_dev2 && dev->hotplugged) {
+/*
+ * Without a bus hotplug handler, we cannot control the plug/unplug
+ * order. We should never reach this point when hotplugging on x86,
+ * however, better add a safety net.
+ */
+error_setg(errp, "hotplug of virtio based memory devices not supported"
+   " on this bus.");
+return;
+}
+/*
+ * First, see if we can plug this memory device at all. If that
+ * succeeds, branch off to the actual hotplug handler.
+ */
+memory_device_pre_plug(MEMORY_DEVICE(dev), MACHINE(hotplug_dev), NULL,
+   &local_err);
+if (!local_err && hotplug_dev2) {
+hotplug_handler_pre_plug(hotplug_dev2, dev, &local_err);
+}
+error_propagate(errp, local_err);
+}
+
+static void virt_virtio_md_pci_plug(HotplugHandler *hotplug_dev,
+DeviceState *dev, Error **errp)
+{
+HotplugHandler *hotplug_dev2 = qdev_get_bus_hotplug_handler(dev);
+Error *local_err = NULL;
+
+/*
+ * Plug the memory device first and then branch off to the actual
+ * hotplug handler. If that one fails, we can easily undo the memory
+ * device bits.
+ */
+memory_device_plug(MEMORY_DEVICE(dev), MACHINE(hotplug_dev));
+if (hotplug_dev2) {
+hotplug_handler_plug(hotplug_dev2, dev, &local_err);
+if (local_err) {
+memory_device_unplug(MEMORY_DEVICE(dev), MACHINE(hotplug_dev));
+}
+}
+error_propagate(errp, local_err);
+}
+
+static void virt_virtio_md_pci_unplug_request(HotplugHandler *hotplug_dev,
+  DeviceState *dev, Error **errp)
+{
+/* We don't support hot unplug of virtio based memory devices */
+error_setg(errp, "virtio based memory devices cannot be unplugged.");
+}
+
+
 static void virt_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev,
 DeviceState *dev, Error **errp)
 {
@@ -2516,6 +2575,8 @@ static void virt_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev,
 qdev_prop_set_uint32(dev, "len-reserved-regions", 1);
 qdev_prop_set_string(dev, "reserved-regions[0]", resv_prop_str);
 g_free(resv_prop_str);
+} else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
+virt_virtio_md_pci_pre_plug(hotplug_dev, dev, errp);
 }
 }
 
@@ -2541,6 +2602,8 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
 vms->iommu = VIRT_IOMMU_VIRTIO;
 vms->virtio_iommu_bdf = pci_get_bdf(pdev);
 create_virtio_iommu_dt_bindings(vms);
+} else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
+virt_virtio_md_pci_plug(hotplug_dev, dev, errp);
 }
 }
 
@@ -2591,6 +2654,8 @@ static void virt_machine_device_unplug_request_cb(HotplugHandler *hotplug_dev,
 {
 if (o

Re: [PATCH v2 1/2] hw/arm/virt: Support for virtio-mem-pci

2021-12-03 Thread Gavin Shan

On 12/4/21 5:18 AM, David Hildenbrand wrote:

On 03.12.21 04:35, Gavin Shan wrote:

This supports virtio-mem-pci device on "virt" platform, by simply
following the implementation on x86.

* This implements the hotplug handlers to support virtio-mem-pci
  device hot-add; hot-remove isn't supported, matching the x86
  implementation.

* The block size is 512MB on ARM64 instead of 128MB on x86.

* It has been passing the tests with various combinations like 64KB
  and 4KB page sizes on host and guest, different memory device
  backends like normal, transparent huge page and HugeTLB, plus
  migration.



I would turn this patch into 2/2, reshuffling both patches.


Co-developed-by: David Hildenbrand 
Co-developed-by: Jonathan Cameron 
Signed-off-by: Gavin Shan 


Reviewed-by: David Hildenbrand 

Thanks Gavin!



Yup, I thought of it. The fixed issue doesn't show up until virtio-mem
is enabled on ARM64 by PATCH[1/2], which is why I had this patch as
PATCH[2/2]. However, it also makes sense to reshuffle the patches so
that the potential issue is eliminated before virtio-mem is enabled
on ARM64. v3 will have the changes :)

Thanks,
Gavin




[libnbd PATCH 02/13] block_status: Refactor array storage

2021-12-03 Thread Eric Blake
For 32-bit block status, we were able to cheat and use an array with
an odd number of elements, with array[0] holding the context id, and
passing &array[1] to the user's callback.  But once we have 64-bit
extents, we can no longer abuse array element 0 like that.  Split out
a new state to receive the context id separately from the extents
array.  No behavioral change, other than the rare possibility of
landing in the new state.
---
 lib/internal.h  |  1 +
 generator/state_machine.ml  | 11 +-
 generator/states-reply-structured.c | 58 -
 3 files changed, 51 insertions(+), 19 deletions(-)

diff --git a/lib/internal.h b/lib/internal.h
index 0e205aba..7e96e8e9 100644
--- a/lib/internal.h
+++ b/lib/internal.h
@@ -274,6 +274,7 @@ struct nbd_handle {
   size_t querynum;

   /* When receiving block status, this is used. */
+  uint32_t bs_contextid;
   uint32_t *bs_entries;

   /* Commands which are waiting to be issued [meaning the request
diff --git a/generator/state_machine.ml b/generator/state_machine.ml
index 3bc77f24..99652948 100644
--- a/generator/state_machine.ml
+++ b/generator/state_machine.ml
@@ -1,6 +1,6 @@
 (* hey emacs, this is OCaml code: -*- tuareg -*- *)
 (* nbd client library in userspace: state machine definition
- * Copyright (C) 2013-2020 Red Hat Inc.
+ * Copyright (C) 2013-2021 Red Hat Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public
@@ -862,10 +862,17 @@ and
 external_events = [];
   };

+  State {
+default_state with
+name = "RECV_BS_CONTEXTID";
+comment = "Receive contextid of structured reply block-status payload";
+external_events = [];
+  };
+
   State {
 default_state with
 name = "RECV_BS_ENTRIES";
-comment = "Receive a structured reply block-status payload";
+comment = "Receive entries array of structured reply block-status payload";
 external_events = [];
   };

diff --git a/generator/states-reply-structured.c b/generator/states-reply-structured.c
index 70010474..e1da850d 100644
--- a/generator/states-reply-structured.c
+++ b/generator/states-reply-structured.c
@@ -1,5 +1,5 @@
 /* nbd client library in userspace: state machine
- * Copyright (C) 2013-2019 Red Hat Inc.
+ * Copyright (C) 2013-2021 Red Hat Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public
@@ -185,19 +185,10 @@ STATE_MACHINE {
   set_error (0, "not expecting NBD_REPLY_TYPE_BLOCK_STATUS here");
   return 0;
 }
-/* We read the context ID followed by all the entries into a
- * single array and deal with it at the end.
- */
-free (h->bs_entries);
-h->bs_entries = malloc (length);
-if (h->bs_entries == NULL) {
-  SET_NEXT_STATE (%.DEAD);
-  set_error (errno, "malloc");
-  return 0;
-}
-h->rbuf = h->bs_entries;
-h->rlen = length;
-SET_NEXT_STATE (%RECV_BS_ENTRIES);
+/* Start by reading the context ID. */
+h->rbuf = &h->bs_contextid;
+h->rlen = sizeof h->bs_contextid;
+SET_NEXT_STATE (%RECV_BS_CONTEXTID);
 return 0;
   }
   else {
@@ -452,9 +443,41 @@ STATE_MACHINE {
   }
   return 0;

+ REPLY.STRUCTURED_REPLY.RECV_BS_CONTEXTID:
+  struct command *cmd = h->reply_cmd;
+  uint32_t length;
+
+  switch (recv_into_rbuf (h)) {
+  case -1: SET_NEXT_STATE (%.DEAD); return 0;
+  case 1:
+save_reply_state (h);
+SET_NEXT_STATE (%.READY);
+return 0;
+  case 0:
+length = be32toh (h->sbuf.sr.structured_reply.length);
+
+assert (cmd); /* guaranteed by CHECK */
+assert (cmd->type == NBD_CMD_BLOCK_STATUS);
+assert (length >= 12);
+length -= sizeof h->bs_contextid;
+
+free (h->bs_entries);
+h->bs_entries = malloc (length);
+if (h->bs_entries == NULL) {
+  SET_NEXT_STATE (%.DEAD);
+  set_error (errno, "malloc");
+  return 0;
+}
+h->rbuf = h->bs_entries;
+h->rlen = length;
+SET_NEXT_STATE (%RECV_BS_ENTRIES);
+  }
+  return 0;
+
  REPLY.STRUCTURED_REPLY.RECV_BS_ENTRIES:
   struct command *cmd = h->reply_cmd;
   uint32_t length;
+  uint32_t count;
   size_t i;
   uint32_t context_id;
   struct meta_context *meta_context;
@@ -473,15 +496,16 @@ STATE_MACHINE {
 assert (CALLBACK_IS_NOT_NULL (cmd->cb.fn.extent));
 assert (h->bs_entries);
 assert (length >= 12);
+count = (length - sizeof h->bs_contextid) / sizeof *h->bs_entries;

 /* Need to byte-swap the entries returned, but apart from that we
  * don't validate them.
  */
-for (i = 0; i < length/4; ++i)
+for (i = 0; i < count; ++i)
   h->bs_entries[i] = be32toh (h->bs_entries[i]);

 /* Look up the context ID. */
-context_id = h->bs_entries[0];
+context_id = be32toh (h->bs_contextid);
 for (meta_context = h->meta_contexts;
  meta_context;
  meta_context = meta_context->next)
@@ -494,7 +518,7 @@ STATE_MACHIN

[libnbd PATCH 13/13] interop: Add test of 64-bit block status

2021-12-03 Thread Eric Blake
Prove that we can round-trip a block status request larger than 4G
through a new-enough qemu-nbd.  Also serves as a unit test of our shim
for converting internal 64-bit representation back to the older 32-bit
nbd_block_status callback interface.
---
 interop/Makefile.am |   6 ++
 interop/large-status.c  | 186 
 interop/large-status.sh |  49 +++
 .gitignore  |   1 +
 4 files changed, 242 insertions(+)
 create mode 100644 interop/large-status.c
 create mode 100755 interop/large-status.sh

diff --git a/interop/Makefile.am b/interop/Makefile.am
index 3a8d5677..96c0a0f6 100644
--- a/interop/Makefile.am
+++ b/interop/Makefile.am
@@ -20,6 +20,7 @@ include $(top_srcdir)/subdir-rules.mk
 EXTRA_DIST = \
dirty-bitmap.sh \
interop-qemu-storage-daemon.sh \
+   large-status.sh \
list-exports-nbd-config \
list-exports-test-dir/disk1 \
list-exports-test-dir/disk2 \
@@ -129,6 +130,7 @@ check_PROGRAMS += \
list-exports-qemu-nbd \
socket-activation-qemu-nbd \
dirty-bitmap \
+   large-status \
structured-read \
$(NULL)
 TESTS += \
@@ -138,6 +140,7 @@ TESTS += \
list-exports-qemu-nbd \
socket-activation-qemu-nbd \
dirty-bitmap.sh \
+   large-status.sh \
structured-read.sh \
$(NULL)

@@ -227,6 +230,9 @@ socket_activation_qemu_nbd_LDADD = $(top_builddir)/lib/libnbd.la
 dirty_bitmap_SOURCES = dirty-bitmap.c
 dirty_bitmap_LDADD = $(top_builddir)/lib/libnbd.la

+large_status_SOURCES = large-status.c
+large_status_LDADD = $(top_builddir)/lib/libnbd.la
+
 structured_read_SOURCES = structured-read.c
 structured_read_LDADD = $(top_builddir)/lib/libnbd.la

diff --git a/interop/large-status.c b/interop/large-status.c
new file mode 100644
index ..3cc040fe
--- /dev/null
+++ b/interop/large-status.c
@@ -0,0 +1,186 @@
+/* NBD client library in userspace
+ * Copyright (C) 2013-2021 Red Hat Inc.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/* Test 64-bit block status with qemu. */
+
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+static const char *bitmap;
+
+struct data {
+  bool req_one;/* input: true if req_one was passed to request */
+  int count;   /* input: count of expected remaining calls */
+  bool seen_base;  /* output: true if base:allocation encountered */
+  bool seen_dirty; /* output: true if qemu:dirty-bitmap encountered */
+};
+
+static int
+cb32 (void *opaque, const char *metacontext, uint64_t offset,
+  uint32_t *entries, size_t len, int *error)
+{
+  struct data *data = opaque;
+
+  assert (offset == 0);
+  assert (data->count-- > 0);
+
+  if (strcmp (metacontext, LIBNBD_CONTEXT_BASE_ALLOCATION) == 0) {
+assert (!data->seen_base);
+data->seen_base = true;
+
+/* Data block offset 0 size 64k, remainder is hole */
+assert (len == 4);
+assert (entries[0] == 65536);
+assert (entries[1] == 0);
+/* libnbd had to truncate qemu's >4G answer */
+assert (entries[2] == 4227858432);
+assert (entries[3] == (LIBNBD_STATE_HOLE|LIBNBD_STATE_ZERO));
+  }
+  else if (strcmp (metacontext, bitmap) == 0) {
+assert (!data->seen_dirty);
+data->seen_dirty = true;
+
+/* Dirty block at offset 5G-64k, remainder is clean */
+/* libnbd had to truncate qemu's >4G answer */
+assert (len == 2);
+assert (entries[0] == 4227858432);
+assert (entries[1] == 0);
+  }
+  else {
+fprintf (stderr, "unexpected context %s\n", metacontext);
+exit (EXIT_FAILURE);
+  }
+  return 0;
+}
+
+static int
+cb64 (void *opaque, const char *metacontext, uint64_t offset,
+  nbd_extent *entries, size_t len, int *error)
+{
+  struct data *data = opaque;
+
+  assert (offset == 0);
+  assert (data->count-- > 0);
+
+  if (strcmp (metacontext, LIBNBD_CONTEXT_BASE_ALLOCATION) == 0) {
+assert (!data->seen_base);
+data->seen_base = true;
+
+/* Data block offset 0 size 64k, remainder is hole */
+assert (len == 2);
+assert (entries[0].length == 65536);
+assert (entries[0].flags == 0);
+assert (entries[1].length == 5368643584ULL);
+assert (entries[1].flags == (LIBNBD_STATE_HOLE|LIBNBD_STATE_ZERO));
+  }
+  else if

[libnbd PATCH 11/13] api: Add three functions for controlling extended headers

2021-12-03 Thread Eric Blake
The new NBD_OPT_EXTENDED_HEADERS feature is worth using by default,
but there may be cases where the user explicitly wants to stick with
the older 32-bit headers.  nbd_set_request_extended_headers() will let
the client override the default, nbd_get_request_extended_headers()
determines the current state of the request, and
nbd_get_extended_headers_negotiated() determines what the client and
server actually settled on.  These use
nbd_set_request_structured_headers() and friends as a template.

Note that this patch just adds the API but ignores the state variable;
the next one will then tweak the state machine to actually request
extended headers when the state variable is set.
---
 lib/internal.h |  1 +
 generator/API.ml   | 89 --
 lib/handle.c   | 23 ++
 python/t/110-defaults.py   |  3 +-
 python/t/120-set-non-defaults.py   |  4 +-
 ocaml/tests/test_110_defaults.ml   |  4 +-
 ocaml/tests/test_120_set_non_defaults.ml   |  5 +-
 golang/libnbd_110_defaults_test.go |  8 ++
 golang/libnbd_120_set_non_defaults_test.go | 12 +++
 9 files changed, 137 insertions(+), 12 deletions(-)

diff --git a/lib/internal.h b/lib/internal.h
index 97abf4f2..a579e413 100644
--- a/lib/internal.h
+++ b/lib/internal.h
@@ -107,6 +107,7 @@ struct nbd_handle {
   char *tls_psk_file;   /* PSK filename, NULL = no PSK */

   /* Extended headers. */
+  bool request_eh;  /* Whether to request extended headers */
   bool extended_headers;/* If we negotiated NBD_OPT_EXTENDED_HEADERS */

   /* Desired metadata contexts. */
diff --git a/generator/API.ml b/generator/API.ml
index 1a452a24..e45f0c86 100644
--- a/generator/API.ml
+++ b/generator/API.ml
@@ -675,6 +675,63 @@   "get_tls_psk_file", {
   };
 *)

+  "set_request_extended_headers", {
+default_call with
+args = [Bool "request"]; ret = RErr;
+permitted_states = [ Created ];
+shortdesc = "control use of extended headers";
+longdesc = "\
+By default, libnbd tries to negotiate extended headers with the
+server, as this protocol extension permits the use of 64-bit
+zero, trim, and block status actions.  However,
+for integration testing, it can be useful to clear this flag
+rather than find a way to alter the server to fail the negotiation
+request.";
+see_also = [Link "get_request_extended_headers";
+Link "set_handshake_flags"; Link "set_strict_mode";
+Link "get_extended_headers_negotiated";
+Link "zero"; Link "trim"; Link "cache";
+Link "block_status_64";
+Link "set_request_structured_replies"];
+  };
+
+  "get_request_extended_headers", {
+default_call with
+args = []; ret = RBool;
+may_set_error = false;
+shortdesc = "see if extended headers are attempted";
+longdesc = "\
+Return the state of the request extended headers flag on this
+handle.
+
+B If you want to find out if extended headers were actually
+negotiated on a particular connection use
+L instead.";
+see_also = [Link "set_request_extended_headers";
+Link "get_extended_headers_negotiated";
+Link "get_request_extended_headers"];
+  };
+
+  "get_extended_headers_negotiated", {
+default_call with
+args = []; ret = RBool;
+permitted_states = [ Negotiating; Connected; Closed ];
+shortdesc = "see if extended headers are in use";
+longdesc = "\
+After connecting you may call this to find out if the connection is
+using extended headers.  When extended headers are not in use, commands
+are limited to a 32-bit length, even when the libnbd API uses a 64-bit
+variable to express the length.  But even when extended headers are
+supported, the server may enforce other limits, visible through
+L.";
+see_also = [Link "set_request_extended_headers";
+Link "get_request_extended_headers";
+Link "zero"; Link "trim"; Link "cache";
+Link "block_status_64"; Link "get_block_size";
+Link "get_protocol";
+Link "get_structured_replies_negotiated"];
+  };
+
   "set_request_structured_replies", {
 default_call with
 args = [Bool "request"]; ret = RErr;
@@ -690,7 +747,8 @@   "set_request_structured_replies", {
 see_also = [Link "get_request_structured_replies";
 Link "set_handshake_flags"; Link "set_strict_mode";
 Link "get_structured_replies_negotiated";
-Link "can_meta_context"; Link "can_df"];
+Link "can_meta_context"; Link "can_df";
+Link "set_request_extended_headers"];
   };

   "get_request_structured_replies", {
@@ -706,7 +764,8 @@   "get_request_structured_replies", {
 negotiated on a particular connection use
 L instead.";
 see_also = [Link "set_request_structured_replies";
-Link "get_structured_replies_negotia

[libnbd PATCH 00/13] libnbd patches for NBD_OPT_EXTENDED_HEADERS

2021-12-03 Thread Eric Blake
Available here: https://repo.or.cz/libnbd/ericb.git/shortlog/refs/tags/exthdr-v1

I also want to do followup patches to teach 'nbdinfo --map' and
'nbdcopy' to utilize 64-bit extents.

Eric Blake (13):
  golang: Simplify nbd_block_status callback array copy
  block_status: Refactor array storage
  protocol: Add definitions for extended headers
  protocol: Prepare to send 64-bit requests
  protocol: Prepare to receive 64-bit replies
  protocol: Accept 64-bit holes during pread
  generator: Add struct nbd_extent in prep for 64-bit extents
  block_status: Track 64-bit extents internally
  block_status: Accept 64-bit extents during block status
  api: Add [aio_]nbd_block_status_64
  api: Add three functions for controlling extended headers
  generator: Actually request extended headers
  interop: Add test of 64-bit block status

 lib/internal.h|  31 ++-
 lib/nbd-protocol.h|  61 -
 generator/API.ml  | 237 --
 generator/API.mli |   3 +-
 generator/C.ml|  24 +-
 generator/GoLang.ml   |  35 ++-
 generator/Makefile.am |   3 +-
 generator/OCaml.ml|  20 +-
 generator/Python.ml   |  29 ++-
 generator/state_machine.ml|  52 +++-
 generator/states-issue-command.c  |  31 ++-
 .../states-newstyle-opt-extended-headers.c|  90 +++
 generator/states-newstyle-opt-starttls.c  |  10 +-
 generator/states-reply-structured.c   | 220 
 generator/states-reply.c  |  31 ++-
 lib/handle.c  |  27 +-
 lib/rw.c  | 105 +++-
 python/t/110-defaults.py  |   3 +-
 python/t/120-set-non-defaults.py  |   4 +-
 python/t/465-block-status-64.py   |  56 +
 ocaml/helpers.c   |  22 +-
 ocaml/nbd-c.h |   3 +-
 ocaml/tests/Makefile.am   |   5 +-
 ocaml/tests/test_110_defaults.ml  |   4 +-
 ocaml/tests/test_120_set_non_defaults.ml  |   5 +-
 ocaml/tests/test_465_block_status_64.ml   |  58 +
 tests/meta-base-allocation.c  | 111 +++-
 interop/Makefile.am   |   6 +
 interop/large-status.c| 186 ++
 interop/large-status.sh   |  49 
 .gitignore|   1 +
 golang/Makefile.am|   3 +-
 golang/handle.go  |   6 +
 golang/libnbd_110_defaults_test.go|   8 +
 golang/libnbd_120_set_non_defaults_test.go|  12 +
 golang/libnbd_465_block_status_64_test.go | 119 +
 36 files changed, 1511 insertions(+), 159 deletions(-)
 create mode 100644 generator/states-newstyle-opt-extended-headers.c
 create mode 100644 python/t/465-block-status-64.py
 create mode 100644 ocaml/tests/test_465_block_status_64.ml
 create mode 100644 interop/large-status.c
 create mode 100755 interop/large-status.sh
 create mode 100644 golang/libnbd_465_block_status_64_test.go

-- 
2.33.1




[libnbd PATCH 10/13] api: Add [aio_]nbd_block_status_64

2021-12-03 Thread Eric Blake
Overcome the inherent 32-bit limitation of our existing
nbd_block_status command by adding a 64-bit variant.  The command sent
to the server does not change, but the user's callback is now handed
64-bit information regardless of whether the server replies with 32-
or 64-bit extents.

Unit tests prove that the new API works in each of C, Python, OCaml,
and Go bindings.  We can also get rid of the temporary hack added to
appease the compiler in an earlier patch.
---
 generator/API.ml  | 138 +++---
 generator/OCaml.ml|   1 -
 generator/Python.ml   |   1 -
 lib/rw.c  |  48 ++--
 python/t/465-block-status-64.py   |  56 +
 ocaml/tests/Makefile.am   |   5 +-
 ocaml/tests/test_465_block_status_64.ml   |  58 +
 tests/meta-base-allocation.c  | 111 +++--
 golang/Makefile.am|   3 +-
 golang/libnbd_465_block_status_64_test.go | 119 +++
 10 files changed, 503 insertions(+), 37 deletions(-)
 create mode 100644 python/t/465-block-status-64.py
 create mode 100644 ocaml/tests/test_465_block_status_64.ml
 create mode 100644 golang/libnbd_465_block_status_64_test.go

diff --git a/generator/API.ml b/generator/API.ml
index 70ae721d..1a452a24 100644
--- a/generator/API.ml
+++ b/generator/API.ml
@@ -1071,7 +1071,7 @@   "add_meta_context", {
 During connection libnbd can negotiate zero or more metadata
 contexts with the server.  Metadata contexts are features (such
 as C<\"base:allocation\">) which describe information returned
-by the L command (for C<\"base:allocation\">
+by the L command (for C<\"base:allocation\">
 this is whether blocks of data are allocated, zero or sparse).

 This call adds one metadata context to the list to be negotiated.
@@ -1098,7 +1098,7 @@   "add_meta_context", {
 Other metadata contexts are server-specific, but include
 C<\"qemu:dirty-bitmap:...\"> and C<\"qemu:allocation-depth\"> for
 qemu-nbd (see qemu-nbd I<-B> and I<-A> options).";
-see_also = [Link "block_status"; Link "can_meta_context";
+see_also = [Link "block_status_64"; Link "can_meta_context";
 Link "get_nr_meta_contexts"; Link "get_meta_context";
 Link "clear_meta_contexts"];
   };
@@ -,14 +,14 @@   "get_nr_meta_contexts", {
 During connection libnbd can negotiate zero or more metadata
 contexts with the server.  Metadata contexts are features (such
 as C<\"base:allocation\">) which describe information returned
-by the L command (for C<\"base:allocation\">
+by the L command (for C<\"base:allocation\">
 this is whether blocks of data are allocated, zero or sparse).

 This command returns how many meta contexts have been added to
 the list to request from the server via L.
 The server is not obligated to honor all of the requests; to see
 what it actually supports, see L.";
-see_also = [Link "block_status"; Link "can_meta_context";
+see_also = [Link "block_status_64"; Link "can_meta_context";
 Link "add_meta_context"; Link "get_meta_context";
 Link "clear_meta_contexts"];
   };
@@ -1131,13 +1131,13 @@   "get_meta_context", {
 During connection libnbd can negotiate zero or more metadata
 contexts with the server.  Metadata contexts are features (such
 as C<\"base:allocation\">) which describe information returned
-by the L command (for C<\"base:allocation\">
+by the L command (for C<\"base:allocation\">
 this is whether blocks of data are allocated, zero or sparse).

 This command returns the i'th meta context request, as added by
 L, and bounded by
 L.";
-see_also = [Link "block_status"; Link "can_meta_context";
+see_also = [Link "block_status_64"; Link "can_meta_context";
 Link "add_meta_context"; Link "get_nr_meta_contexts";
 Link "clear_meta_contexts"];
   };
@@ -1151,7 +1151,7 @@   "clear_meta_contexts", {
 During connection libnbd can negotiate zero or more metadata
 contexts with the server.  Metadata contexts are features (such
 as C<\"base:allocation\">) which describe information returned
-by the L command (for C<\"base:allocation\">
+by the L command (for C<\"base:allocation\">
 this is whether blocks of data are allocated, zero or sparse).

 This command resets the list of meta contexts to request back to
@@ -1160,7 +1160,7 @@   "clear_meta_contexts", {
 negotiation mode is selected (see L), for
 altering the list of attempted contexts between subsequent export
 queries.";
-see_also = [Link "block_status"; Link "can_meta_context";
+see_also = [Link "block_status_64"; Link "can_meta_context";
 Link "add_meta_context"; Link "get_nr_meta_contexts";
 Link "get_meta_context"; Link "set_opt_mode"];
   };
@@ -1727,7 +1727,7 @@   "can_meta_context", {
 ^ non_blocking_test_call_description;
 see_also = [SectionLink "Flag calls"; Link "opt_info";
  

[libnbd PATCH 01/13] golang: Simplify nbd_block_status callback array copy

2021-12-03 Thread Eric Blake
In the block status callback glue code, we need to copy a C uint32_t[]
into a golang []uint32.  The copy is necessary since the lifetime of
the C array is not guaranteed to outlive whatever the Go callback may
have done with what it was handed; copying ensures that the user's Go
code doesn't have to worry about lifetime issues.  But we don't have
to have quite so many casts and pointer additions: since we can assume
C.uint32_t and uint32 occupy the same amount of memory (even though
they are different types), we can exploit Go's ability to treat an
unsafe pointer as if it were an oversized array, take a slice of that
array, and then use idiomatic Go to copy from the slice.

https://github.com/golang/go/wiki/cgo#turning-c-arrays-into-go-slices
---
 generator/GoLang.ml | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/generator/GoLang.ml b/generator/GoLang.ml
index eb3aa263..d3b7dc79 100644
--- a/generator/GoLang.ml
+++ b/generator/GoLang.ml
@@ -1,6 +1,6 @@
 (* hey emacs, this is OCaml code: -*- tuareg -*- *)
 (* nbd client library in userspace: generator
- * Copyright (C) 2013-2020 Red Hat Inc.
+ * Copyright (C) 2013-2021 Red Hat Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public
@@ -514,11 +514,14 @@ let
 /* Closures. */

 func copy_uint32_array (entries *C.uint32_t, count C.size_t) []uint32 {
-ret := make([]uint32, int (count))
-for i := 0; i < int (count); i++ {
-   entry := (*C.uint32_t) (unsafe.Pointer(uintptr(unsafe.Pointer(entries)) + (unsafe.Sizeof(*entries) * uintptr(i))))
-   ret[i] = uint32 (*entry)
-}
+/* https://github.com/golang/go/wiki/cgo#turning-c-arrays-into-go-slices */
+unsafePtr := unsafe.Pointer(entries)
+/* Max structured reply payload is 64M, so this array size is more than
+ * sufficient for the underlying slice we want to access.
+ */
+arrayPtr := (*[1 << 20]uint32)(unsafePtr)
+ret := make([]uint32, count)
+copy(ret, arrayPtr[:count:count])
 return ret
 }
 ";
-- 
2.33.1




[libnbd PATCH 08/13] block_status: Track 64-bit extents internally

2021-12-03 Thread Eric Blake
When extended headers are in use, the server can send us 64-bit
extents, even for a 32-bit query (if the server knows the entire image
is data, for example).  For maximum flexibility, we are thus better
off storing 64-bit lengths internally, even if we have to convert them
back to 32-bit lengths when invoking the user's 32-bit callback.  The
next patch will then add a new API for letting the user access the
full 64-bit extent information.  The goal is to let both APIs work all
the time, regardless of the size extents that the server actually
answered with.

Note that when using the old nbd_block_status() API with a server that
lacks extended headers, we now add a double-conversion speed penalty
(converting the server's 32-bit answer into 64-bit internally and back
to 32-bit for the callback).  But the speed penalty will not be a
problem for applications using the new nbd_block_status_64() API (we
have to give a 64-bit answer no matter what the server answered), and
ideally the situation will become less common as more servers learn
extended headers.  So for now I chose to unconditionally use a 64-bit
internal representation; but if it turns out to have noticeable
degradation, we could tweak things to conditionally retain 32-bit
internal representation for servers lacking extended headers at the
expense of more code maintenance.

One of the trickier aspects of this patch is auditing that both the
user's extent and our malloc'd shim get cleaned up once on all
possible paths, so that there is neither a leak nor a double free.
---
 lib/internal.h  |  7 +++-
 generator/states-reply-structured.c | 31 ++-
 lib/handle.c|  4 +-
 lib/rw.c| 59 -
 4 files changed, 85 insertions(+), 16 deletions(-)

diff --git a/lib/internal.h b/lib/internal.h
index 06f3a65c..4800df83 100644
--- a/lib/internal.h
+++ b/lib/internal.h
@@ -75,7 +75,7 @@ struct export {

 struct command_cb {
   union {
-nbd_extent_callback extent;
+nbd_extent64_callback extent;
 nbd_chunk_callback chunk;
 nbd_list_callback list;
 nbd_context_callback context;
@@ -286,7 +286,10 @@ struct nbd_handle {

   /* When receiving block status, this is used. */
   uint32_t bs_contextid;
-  uint32_t *bs_entries;
+  union {
+nbd_extent *normal; /* Our 64-bit preferred internal form */
+uint32_t *narrow;   /* 32-bit form of NBD_REPLY_TYPE_BLOCK_STATUS */
+  } bs_entries;

   /* Commands which are waiting to be issued [meaning the request
* packet is sent to the server].  This is used as a simple linked
diff --git a/generator/states-reply-structured.c b/generator/states-reply-structured.c
index a3e0e2ac..71c761e9 100644
--- a/generator/states-reply-structured.c
+++ b/generator/states-reply-structured.c
@@ -494,6 +494,7 @@ STATE_MACHINE {
  REPLY.STRUCTURED_REPLY.RECV_BS_CONTEXTID:
   struct command *cmd = h->reply_cmd;
   uint32_t length;
+  uint32_t count;

   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return 0;
@@ -508,15 +509,19 @@ STATE_MACHINE {
 assert (cmd->type == NBD_CMD_BLOCK_STATUS);
 assert (length >= 12);
 length -= sizeof h->bs_contextid;
+count = length / (2 * sizeof (uint32_t));

-free (h->bs_entries);
-h->bs_entries = malloc (length);
-if (h->bs_entries == NULL) {
+/* Read raw data into a subset of h->bs_entries, then expand it
+ * into place later during byte-swapping.
+ */
+free (h->bs_entries.normal);
+h->bs_entries.normal = malloc (count * sizeof *h->bs_entries.normal);
+if (h->bs_entries.normal == NULL) {
   SET_NEXT_STATE (%.DEAD);
   set_error (errno, "malloc");
   return 0;
 }
-h->rbuf = h->bs_entries;
+h->rbuf = h->bs_entries.narrow;
 h->rlen = length;
 SET_NEXT_STATE (%RECV_BS_ENTRIES);
   }
@@ -528,6 +533,7 @@ STATE_MACHINE {
   uint32_t count;
   size_t i;
   uint32_t context_id;
+  uint32_t *raw;
   struct meta_context *meta_context;

   switch (recv_into_rbuf (h)) {
@@ -542,15 +548,20 @@ STATE_MACHINE {
 assert (cmd); /* guaranteed by CHECK */
 assert (cmd->type == NBD_CMD_BLOCK_STATUS);
 assert (CALLBACK_IS_NOT_NULL (cmd->cb.fn.extent));
-assert (h->bs_entries);
+assert (h->bs_entries.normal);
 assert (length >= 12);
-count = (length - sizeof h->bs_contextid) / sizeof *h->bs_entries;
+count = (length - sizeof h->bs_contextid) / (2 * sizeof (uint32_t));

 /* Need to byte-swap the entries returned, but apart from that we
- * don't validate them.
+ * don't validate them.  Reverse order is essential, since we are
+ * expanding in-place from narrow to wider type.
  */
-for (i = 0; i < count; ++i)
-  h->bs_entries[i] = be32toh (h->bs_entries[i]);
+raw = h->bs_entries.narrow;
+for (i = count; i > 0; ) {
+  --i;
+  h->bs_entries.normal[i].flags = be32toh (raw[i * 2 + 1]);
+  h->bs_entries.normal[i].length = be32toh 

[libnbd PATCH 06/13] protocol: Accept 64-bit holes during pread

2021-12-03 Thread Eric Blake
Even though we don't allow the user to request NBD_CMD_READ with more
than 64M (and even if we did, our API signature caps us at SIZE_MAX,
which is 32 bits on a 32-bit machine), the NBD extension to allow
64-bit requests implies that for symmetry we have to be able to
support 64-bit holes over the wire.  Note that we don't have to change
the signature of the callback for nbd_pread_structured; nor is it
worth adding a counterpart to LIBNBD_READ_HOLE, because it is unlikely
that a user callback will ever need to distinguish between which size
was sent over the wire, when the value is always less than 32 bits.

While we cannot guarantee which size structured reply the server will
use, it is easy enough to handle both sizes, even for a non-compliant
server that sends wide replies when extended headers were not
negotiated.  Of course, until a later patch enables extended headers
negotiation, no compliant server will trigger the new code here.
---
 lib/internal.h  |  1 +
 generator/states-reply-structured.c | 41 +
 2 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/lib/internal.h b/lib/internal.h
index c9f84441..06f3a65c 100644
--- a/lib/internal.h
+++ b/lib/internal.h
@@ -231,6 +231,7 @@ struct nbd_handle {
   union {
 struct nbd_structured_reply_offset_data offset_data;
 struct nbd_structured_reply_offset_hole offset_hole;
+struct nbd_structured_reply_offset_hole_ext offset_hole_ext;
 struct {
   struct nbd_structured_reply_error error;
   char msg[NBD_MAX_STRING]; /* Common to all error types */
diff --git a/generator/states-reply-structured.c b/generator/states-reply-structured.c
index 1b675e8d..a3e0e2ac 100644
--- a/generator/states-reply-structured.c
+++ b/generator/states-reply-structured.c
@@ -26,15 +26,16 @@
  * requesting command.
  */
 static bool
-structured_reply_in_bounds (uint64_t offset, uint32_t length,
+structured_reply_in_bounds (uint64_t offset, uint64_t length,
 const struct command *cmd)
 {
   if (offset < cmd->offset ||
   offset >= cmd->offset + cmd->count ||
-  offset + length > cmd->offset + cmd->count) {
+  length > cmd->offset + cmd->count ||
+  offset > cmd->offset + cmd->count - length) {
 set_error (0, "range of structured reply is out of bounds, "
"offset=%" PRIu64 ", cmd->offset=%" PRIu64 ", "
-   "length=%" PRIu32 ", cmd->count=%" PRIu64 ": "
+   "length=%" PRIu64 ", cmd->count=%" PRIu64 ": "
"this is likely to be a bug in the NBD server",
offset, cmd->offset, length, cmd->count);
 return false;
@@ -182,6 +183,25 @@ STATE_MACHINE {
 SET_NEXT_STATE (%RECV_OFFSET_HOLE);
 return 0;
   }
+  else if (type == NBD_REPLY_TYPE_OFFSET_HOLE_EXT) {
+if (cmd->type != NBD_CMD_READ) {
+  SET_NEXT_STATE (%.DEAD);
+  set_error (0, "invalid command for receiving offset-hole chunk, "
+ "cmd->type=%" PRIu16 ", "
+ "this is likely to be a bug in the server",
+ cmd->type);
+  return 0;
+}
+if (length != sizeof h->sbuf.sr.payload.offset_hole_ext) {
+  SET_NEXT_STATE (%.DEAD);
+  set_error (0, "invalid length in NBD_REPLY_TYPE_OFFSET_HOLE_EXT");
+  return 0;
+}
+h->rbuf = &h->sbuf.sr.payload.offset_hole_ext;
+h->rlen = sizeof h->sbuf.sr.payload.offset_hole_ext;
+SET_NEXT_STATE (%RECV_OFFSET_HOLE);
+return 0;
+  }
   else if (type == NBD_REPLY_TYPE_BLOCK_STATUS) {
 if (cmd->type != NBD_CMD_BLOCK_STATUS) {
   SET_NEXT_STATE (%.DEAD);
@@ -415,7 +435,8 @@ STATE_MACHINE {
  REPLY.STRUCTURED_REPLY.RECV_OFFSET_HOLE:
   struct command *cmd = h->reply_cmd;
   uint64_t offset;
-  uint32_t length;
+  uint64_t length;
+  uint16_t type;

   switch (recv_into_rbuf (h)) {
   case -1: SET_NEXT_STATE (%.DEAD); return 0;
@@ -425,7 +446,14 @@ STATE_MACHINE {
 return 0;
   case 0:
 offset = be64toh (h->sbuf.sr.payload.offset_hole.offset);
-length = be32toh (h->sbuf.sr.payload.offset_hole.length);
+type = be16toh (h->sbuf.sr.hdr.structured_reply.type);
+
+if (type == NBD_REPLY_TYPE_OFFSET_HOLE)
+  length = be32toh (h->sbuf.sr.payload.offset_hole.length);
+else {
+  /* XXX Insist on h->extended_headers? */
+  length = be64toh (h->sbuf.sr.payload.offset_hole_ext.length);
+}

 assert (cmd); /* guaranteed by CHECK */

@@ -443,7 +471,10 @@ STATE_MACHINE {
 /* The spec states that 0-length requests are unspecified, but
  * 0-length replies are broken. Still, it's easy enough to support
  * them as an extension, and this works even when length == 0.
+ * Although length is 64 bits, the bounds check above ensures that
+ * it is no larger than the 64M cap we put on NBD_CMD_READ.
  */
+assert (length <= SIZE_MAX);
 memset (cmd->data + offset, 0, length);
 if (CALLBACK_IS_NOT_NULL (cmd->cb.fn.chunk)) {
  

[PATCH 12/14] nbd/client: Accept 64-bit block status chunks

2021-12-03 Thread Eric Blake
Because we use NBD_CMD_FLAG_REQ_ONE with NBD_CMD_BLOCK_STATUS, a
client in narrow mode should not be able to provoke a server into
sending a block status result larger than the client's 32-bit request.
But in extended mode, a 64-bit status request must be able to handle a
64-bit status result, once a future patch enables the client
requesting extended mode.  We can also tolerate a non-compliant server
sending the new chunk even when it should not.

Signed-off-by: Eric Blake 
---
 block/nbd.c | 38 +++---
 1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index c5dea864ebb6..bd4a9c407bde 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -563,13 +563,15 @@ static int nbd_parse_offset_hole_payload(BDRVNBDState *s,
  */
 static int nbd_parse_blockstatus_payload(BDRVNBDState *s,
  NBDStructuredReplyChunk *chunk,
- uint8_t *payload, uint64_t orig_length,
- NBDExtent *extent, Error **errp)
+ uint8_t *payload, bool wide,
+ uint64_t orig_length,
+ NBDExtentExt *extent, Error **errp)
 {
 uint32_t context_id;
+size_t len = wide ? sizeof(*extent) : sizeof(NBDExtent);

 /* The server succeeded, so it must have sent [at least] one extent */
-if (chunk->length < sizeof(context_id) + sizeof(*extent)) {
+if (chunk->length < sizeof(context_id) + len) {
 error_setg(errp, "Protocol error: invalid payload for "
  "NBD_REPLY_TYPE_BLOCK_STATUS");
 return -EINVAL;
@@ -584,8 +586,16 @@ static int nbd_parse_blockstatus_payload(BDRVNBDState *s,
 return -EINVAL;
 }

-extent->length = payload_advance32(&payload);
-extent->flags = payload_advance32(&payload);
+if (wide) {
+extent->length = payload_advance64(&payload);
+extent->flags = payload_advance32(&payload);
+if (payload_advance32(&payload) != 0) {
+trace_nbd_parse_blockstatus_compliance("non-zero extent padding");
+}
+} else {
+extent->length = payload_advance32(&payload);
+extent->flags = payload_advance32(&payload);
+}

 if (extent->length == 0) {
 error_setg(errp, "Protocol error: server sent status chunk with "
@@ -625,7 +635,7 @@ static int nbd_parse_blockstatus_payload(BDRVNBDState *s,
  * connection; just ignore trailing extents, and clamp things to
  * the length of our request.
  */
-if (chunk->length > sizeof(context_id) + sizeof(*extent)) {
+if (chunk->length > sizeof(context_id) + len) {
 trace_nbd_parse_blockstatus_compliance("more than one extent");
 }
 if (extent->length > orig_length) {
@@ -1081,7 +1091,7 @@ static int nbd_co_receive_cmdread_reply(BDRVNBDState *s, uint64_t handle,

 static int nbd_co_receive_blockstatus_reply(BDRVNBDState *s,
 uint64_t handle, uint64_t length,
-NBDExtent *extent,
+NBDExtentExt *extent,
 int *request_ret, Error **errp)
 {
 NBDReplyChunkIter iter;
@@ -1098,6 +1108,11 @@ static int nbd_co_receive_blockstatus_reply(BDRVNBDState *s,
 assert(nbd_reply_is_structured(&reply));

 switch (chunk->type) {
+case NBD_REPLY_TYPE_BLOCK_STATUS_EXT:
+if (!s->info.extended_headers) {
+trace_nbd_extended_headers_compliance("block_status_ext");
+}
+/* fallthrough */
 case NBD_REPLY_TYPE_BLOCK_STATUS:
 if (received) {
 nbd_channel_error(s, -EINVAL);
@@ -1106,9 +1121,10 @@ static int nbd_co_receive_blockstatus_reply(BDRVNBDState *s,
 }
 received = true;

-ret = nbd_parse_blockstatus_payload(s, &reply.structured,
-payload, length, extent,
-&local_err);
+ret = nbd_parse_blockstatus_payload(
+s, &reply.structured, payload,
+chunk->type == NBD_REPLY_TYPE_BLOCK_STATUS_EXT,
+length, extent, &local_err);
 if (ret < 0) {
 nbd_channel_error(s, ret);
 nbd_iter_channel_error(&iter, ret, &local_err);
@@ -1337,7 +1353,7 @@ static int coroutine_fn nbd_client_co_block_status(
 int64_t *pnum, int64_t *map, BlockDriverState **file)
 {
 int ret, request_ret;
-NBDExtent extent = { 0 };
+NBDExtentExt extent = { 0 };
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 Error *local_err = NULL;

-- 
2.33.1




[libnbd PATCH 05/13] protocol: Prepare to receive 64-bit replies

2021-12-03 Thread Eric Blake
Support receiving headers for 64-bit replies if extended headers were
negotiated.  We already insist that the server not send us too much
payload in one reply, so we can exploit that and merge the 64-bit
length back into a normalized 32-bit field for the rest of the payload
length calculations.  The NBD protocol specifically made extended
simple and structured replies both occupy 32 bytes, while the handle
field remains at the same offset in all reply types.

Note that if we negotiate extended headers, but a non-compliant server
replies with a non-extended header, we will stall waiting for the
server to send more bytes rather than noticing that the magic number
is wrong.  The alternative would be to read just the first 4 bytes of
magic, then determine how many more bytes to expect; but that would
require more states and syscalls, and not worth it since the typical
server will be compliant.

At this point, h->extended_headers is permanently false (we can't
enable it until all other aspects of the protocol have likewise been
converted).
---
 lib/internal.h  |  8 +++-
 generator/states-reply-structured.c | 59 +++--
 generator/states-reply.c| 31 +++
 3 files changed, 68 insertions(+), 30 deletions(-)

diff --git a/lib/internal.h b/lib/internal.h
index 07378588..c9f84441 100644
--- a/lib/internal.h
+++ b/lib/internal.h
@@ -222,8 +222,12 @@ struct nbd_handle {
 }  __attribute__((packed)) or;
 struct nbd_export_name_option_reply export_name_reply;
 struct nbd_simple_reply simple_reply;
+struct nbd_simple_reply_ext simple_reply_ext;
 struct {
-  struct nbd_structured_reply structured_reply;
+  union {
+struct nbd_structured_reply structured_reply;
+struct nbd_structured_reply_ext structured_reply_ext;
+  } hdr;
   union {
 struct nbd_structured_reply_offset_data offset_data;
 struct nbd_structured_reply_offset_hole offset_hole;
@@ -233,7 +237,7 @@ struct nbd_handle {
   uint64_t offset; /* Only used for NBD_REPLY_TYPE_ERROR_OFFSET */
 } __attribute__((packed)) error;
   } payload;
-}  __attribute__((packed)) sr;
+} sr;
 uint16_t gflags;
 uint32_t cflags;
 uint32_t len;
diff --git a/generator/states-reply-structured.c b/generator/states-reply-structured.c
index 5524e000..1b675e8d 100644
--- a/generator/states-reply-structured.c
+++ b/generator/states-reply-structured.c
@@ -45,19 +45,23 @@ structured_reply_in_bounds (uint64_t offset, uint32_t length,

 STATE_MACHINE {
  REPLY.STRUCTURED_REPLY.START:
-  /* We've only read the simple_reply.  The structured_reply is longer,
-   * so read the remaining part.
+  /* We've only read the simple_reply.  Unless we have extended headers,
+   * the structured_reply is longer, so read the remaining part.
*/
   if (!h->structured_replies) {
 set_error (0, "server sent unexpected structured reply");
 SET_NEXT_STATE(%.DEAD);
 return 0;
   }
-  h->rbuf = &h->sbuf;
-  h->rbuf += sizeof h->sbuf.simple_reply;
-  h->rlen = sizeof h->sbuf.sr.structured_reply;
-  h->rlen -= sizeof h->sbuf.simple_reply;
-  SET_NEXT_STATE (%RECV_REMAINING);
+  if (h->extended_headers)
+SET_NEXT_STATE (%CHECK);
+  else {
+h->rbuf = &h->sbuf;
+h->rbuf += sizeof h->sbuf.simple_reply;
+h->rlen = sizeof h->sbuf.sr.hdr.structured_reply;
+h->rlen -= sizeof h->sbuf.simple_reply;
+SET_NEXT_STATE (%RECV_REMAINING);
+  }
   return 0;

  REPLY.STRUCTURED_REPLY.RECV_REMAINING:
@@ -75,12 +79,21 @@ STATE_MACHINE {
   struct command *cmd = h->reply_cmd;
   uint16_t flags, type;
   uint64_t cookie;
-  uint32_t length;
+  uint64_t length;

-  flags = be16toh (h->sbuf.sr.structured_reply.flags);
-  type = be16toh (h->sbuf.sr.structured_reply.type);
-  cookie = be64toh (h->sbuf.sr.structured_reply.handle);
-  length = be32toh (h->sbuf.sr.structured_reply.length);
+  flags = be16toh (h->sbuf.sr.hdr.structured_reply.flags);
+  type = be16toh (h->sbuf.sr.hdr.structured_reply.type);
+  cookie = be64toh (h->sbuf.sr.hdr.structured_reply.handle);
+  if (h->extended_headers) {
+length = be64toh (h->sbuf.sr.hdr.structured_reply_ext.length);
+if (h->sbuf.sr.hdr.structured_reply_ext.pad) {
+  set_error (0, "server sent non-zero padding in structured reply header");
+  SET_NEXT_STATE(%.DEAD);
+  return 0;
+}
+  }
+  else
+length = be32toh (h->sbuf.sr.hdr.structured_reply.length);

   assert (cmd);
   assert (cmd->cookie == cookie);
@@ -97,6 +110,10 @@ STATE_MACHINE {
 SET_NEXT_STATE (%.DEAD);
 return 0;
   }
+  /* For convenience, we now normalize extended replies into compact,
+   * doable since we validated length fits in 32 bits.
+   */
+  h->sbuf.sr.hdr.structured_reply.length = length;

   if (NBD_REPLY_TYPE_IS_ERR (type)) {
 if (length < sizeof h->sbuf.sr.payload.error.error) {
@@ -207,7 +224,7 @@ STATE_MACHINE {
 SET_NEXT_STATE (%.READY);
 return 0;
   case 0:
-   

[PATCH 13/14] nbd/client: Request extended headers during negotiation

2021-12-03 Thread Eric Blake
All the pieces are in place for a client to finally request extended
headers.  Note that we must not request extended headers when qemu-nbd
is used to connect to the kernel module (as nbd.ko does not expect
them), but there is no harm in all other clients requesting them.

Extended headers do not make a difference to the information collected
during 'qemu-nbd --list', but probing for it gives us one more piece
of information in that output.  Update the iotests affected by the new
line of output.

Signed-off-by: Eric Blake 
---
 nbd/client-connection.c   |  1 +
 nbd/client.c  | 26 ---
 qemu-nbd.c|  2 ++
 tests/qemu-iotests/223.out|  4 +++
 tests/qemu-iotests/233.out|  1 +
 tests/qemu-iotests/241|  8 +++---
 tests/qemu-iotests/307|  2 +-
 tests/qemu-iotests/307.out|  5 
 .../tests/nbd-qemu-allocation.out |  1 +
 9 files changed, 41 insertions(+), 9 deletions(-)

diff --git a/nbd/client-connection.c b/nbd/client-connection.c
index 695f85575414..d8b9ae230264 100644
--- a/nbd/client-connection.c
+++ b/nbd/client-connection.c
@@ -87,6 +87,7 @@ NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr,

 .initial_info.request_sizes = true,
 .initial_info.structured_reply = true,
+.initial_info.extended_headers = true,
 .initial_info.base_allocation = true,
 .initial_info.x_dirty_bitmap = g_strdup(x_dirty_bitmap),
 .initial_info.name = g_strdup(export_name ?: "")
diff --git a/nbd/client.c b/nbd/client.c
index f1aa5256c8bf..0e227255d59b 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -882,8 +882,8 @@ static int nbd_list_meta_contexts(QIOChannel *ioc,
 static int nbd_start_negotiate(AioContext *aio_context, QIOChannel *ioc,
QCryptoTLSCreds *tlscreds,
const char *hostname, QIOChannel **outioc,
-   bool structured_reply, bool *zeroes,
-   Error **errp)
+   bool structured_reply, bool *ext_hdrs,
+   bool *zeroes, Error **errp)
 {
 ERRP_GUARD();
 uint64_t magic;
@@ -960,6 +960,15 @@ static int nbd_start_negotiate(AioContext *aio_context, QIOChannel *ioc,
 if (fixedNewStyle) {
 int result = 0;

+if (ext_hdrs && *ext_hdrs) {
+result = nbd_request_simple_option(ioc,
+   NBD_OPT_EXTENDED_HEADERS,
+   false, errp);
+if (result < 0) {
+return -EINVAL;
+}
+*ext_hdrs = result == 1;
+}
 if (structured_reply) {
 result = nbd_request_simple_option(ioc,
NBD_OPT_STRUCTURED_REPLY,
@@ -970,6 +979,9 @@ static int nbd_start_negotiate(AioContext *aio_context, QIOChannel *ioc,
 }
 return 2 + result;
 } else {
+if (ext_hdrs) {
+*ext_hdrs = false;
+}
 return 1;
 }
 } else if (magic == NBD_CLIENT_MAGIC) {
@@ -977,6 +989,9 @@ static int nbd_start_negotiate(AioContext *aio_context, QIOChannel *ioc,
 error_setg(errp, "Server does not support STARTTLS");
 return -EINVAL;
 }
+if (ext_hdrs) {
+*ext_hdrs = false;
+}
 return 0;
 } else {
 error_setg(errp, "Bad server magic received: 0x%" PRIx64, magic);
@@ -1030,7 +1045,8 @@ int nbd_receive_negotiate(AioContext *aio_context, QIOChannel *ioc,
 trace_nbd_receive_negotiate_name(info->name);

 result = nbd_start_negotiate(aio_context, ioc, tlscreds, hostname, outioc,
- info->structured_reply, &zeroes, errp);
+ info->structured_reply,
+ &info->extended_headers, &zeroes, errp);

 info->structured_reply = false;
 info->base_allocation = false;
@@ -1147,10 +1163,11 @@ int nbd_receive_export_list(QIOChannel *ioc, QCryptoTLSCreds *tlscreds,
 int ret = -1;
 NBDExportInfo *array = NULL;
 QIOChannel *sioc = NULL;
+bool ext_hdrs;

 *info = NULL;
 result = nbd_start_negotiate(NULL, ioc, tlscreds, hostname, &sioc, true,
- NULL, errp);
+ &ext_hdrs, NULL, errp);
 if (tlscreds && sioc) {
 ioc = sioc;
 }
@@ -1179,6 +1196,7 @@ int nbd_receive_export_list(QIOChannel *ioc, QCryptoTLSCreds *tlscreds,
 array[count - 1].name = name;
 array[count - 1].description = desc;
 array[count - 1].structured_reply = result == 3;
+a

[PATCH 09/14] nbd/server: Support 64-bit block status

2021-12-03 Thread Eric Blake
The previous patch handled extended headers by truncating large block
status requests from the client back to 32 bits.  But this is not
ideal; for cases where we can truly determine the status of the entire
image quickly (for example, when reporting the entire image as
non-sparse because we lack the ability to probe for holes), this
causes more network traffic for the client to iterate through 4G
chunks than for us to just report the entire image at once.  For ease
of implementation, if extended headers were negotiated, then we always
reply with 64-bit block status replies, even when the result could
have fit in the older 32-bit block status chunk (clients supporting
extended headers have to be prepared for either chunk type, so
temporarily reverting this patch proves whether a client is
compliant).

Note that we previously had some interesting size-juggling on call
chains, such as:

nbd_co_send_block_status(uint32_t length)
-> blockstatus_to_extents(uint32_t bytes)
  -> bdrv_block_status_above(bytes, &uint64_t num)
  -> nbd_extent_array_add(uint64_t num)
-> store num in 32-bit length

But we were lucky that it never overflowed: bdrv_block_status_above
never sets num larger than bytes, and we had previously been capping
'bytes' at 32 bits (either by the protocol, or in the previous patch
with an explicit truncation).  This patch adds some assertions that
ensure we continue to avoid overflowing 32 bits for a narrow client,
while fully utilizing 64-bits all the way through when the client
understands that.

Signed-off-by: Eric Blake 
---
 nbd/server.c | 72 ++--
 1 file changed, 48 insertions(+), 24 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 0e496f60ffbd..7e6140350797 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2106,20 +2106,26 @@ static int coroutine_fn nbd_co_send_sparse_read(NBDClient *client,
 }

 typedef struct NBDExtentArray {
-NBDExtent *extents;
+union {
+NBDExtent *narrow;
+NBDExtentExt *extents;
+};
 unsigned int nb_alloc;
 unsigned int count;
 uint64_t total_length;
+bool extended; /* Whether 64-bit extents are allowed */
 bool can_add;
 bool converted_to_be;
 } NBDExtentArray;

-static NBDExtentArray *nbd_extent_array_new(unsigned int nb_alloc)
+static NBDExtentArray *nbd_extent_array_new(unsigned int nb_alloc,
+bool extended)
 {
 NBDExtentArray *ea = g_new0(NBDExtentArray, 1);

 ea->nb_alloc = nb_alloc;
-ea->extents = g_new(NBDExtent, nb_alloc);
+ea->extents = g_new(NBDExtentExt, nb_alloc);
+ea->extended = extended;
 ea->can_add = true;

 return ea;
@@ -2133,17 +2139,31 @@ static void nbd_extent_array_free(NBDExtentArray *ea)
 G_DEFINE_AUTOPTR_CLEANUP_FUNC(NBDExtentArray, nbd_extent_array_free);

 /* Further modifications of the array after conversion are abandoned */
-static void nbd_extent_array_convert_to_be(NBDExtentArray *ea)
+static void nbd_extent_array_convert_to_be(NBDExtentArray *ea,
+   struct iovec *iov)
 {
 int i;

 assert(!ea->converted_to_be);
+assert(iov->iov_base == ea->extents);
 ea->can_add = false;
 ea->converted_to_be = true;

-for (i = 0; i < ea->count; i++) {
-ea->extents[i].flags = cpu_to_be32(ea->extents[i].flags);
-ea->extents[i].length = cpu_to_be32(ea->extents[i].length);
+if (ea->extended) {
+for (i = 0; i < ea->count; i++) {
+ea->extents[i].length = cpu_to_be64(ea->extents[i].length);
+ea->extents[i].flags = cpu_to_be32(ea->extents[i].flags);
+assert(ea->extents[i]._pad == 0);
+}
+iov->iov_len = ea->count * sizeof(ea->extents[0]);
+} else {
+/* Conversion reduces memory usage, order of iteration matters */
+for (i = 0; i < ea->count; i++) {
+assert(ea->extents[i].length <= UINT32_MAX);
+ea->narrow[i].length = cpu_to_be32(ea->extents[i].length);
+ea->narrow[i].flags = cpu_to_be32(ea->extents[i].flags);
+}
+iov->iov_len = ea->count * sizeof(ea->narrow[0]);
 }
 }

@@ -2157,19 +2177,23 @@ static void nbd_extent_array_convert_to_be(NBDExtentArray *ea)
  * would result in an incorrect range reported to the client)
  */
 static int nbd_extent_array_add(NBDExtentArray *ea,
-uint32_t length, uint32_t flags)
+uint64_t length, uint32_t flags)
 {
 assert(ea->can_add);

 if (!length) {
 return 0;
 }
+if (!ea->extended) {
+assert(length <= UINT32_MAX);
+}

 /* Extend previous extent if flags are the same */
 if (ea->count > 0 && flags == ea->extents[ea->count - 1].flags) {
-uint64_t sum = (uint64_t)length + ea->extents[ea->count - 1].length;
+uint64_t sum = length + ea->extents[ea->count - 1].length;

-if (sum <= UINT32_MAX) {
+as

[PATCH 07/14] nbd: Add types for extended headers

2021-12-03 Thread Eric Blake
Add the constants and structs necessary for later patches to start
implementing the NBD_OPT_EXTENDED_HEADERS extension in both the client
and server.  This patch does not change any existing behavior, but
merely sets the stage.

This patch does not change the status quo that neither the client nor
server uses a packed-struct representation for the request header.

Signed-off-by: Eric Blake 
---
 docs/interop/nbd.txt |  1 +
 include/block/nbd.h  | 67 +++-
 nbd/common.c | 10 +--
 3 files changed, 62 insertions(+), 16 deletions(-)

diff --git a/docs/interop/nbd.txt b/docs/interop/nbd.txt
index bdb0f2a41aca..6229ea573c04 100644
--- a/docs/interop/nbd.txt
+++ b/docs/interop/nbd.txt
@@ -68,3 +68,4 @@ NBD_CMD_BLOCK_STATUS for "qemu:dirty-bitmap:", NBD_CMD_CACHE
 * 4.2: NBD_FLAG_CAN_MULTI_CONN for shareable read-only exports,
 NBD_CMD_FLAG_FAST_ZERO
 * 5.2: NBD_CMD_BLOCK_STATUS for "qemu:allocation-depth"
+* 7.0: NBD_OPT_EXTENDED_HEADERS
diff --git a/include/block/nbd.h b/include/block/nbd.h
index 732314aaba11..5f9d86a86352 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -69,6 +69,14 @@ typedef struct NBDSimpleReply {
 uint64_t handle;
 } QEMU_PACKED NBDSimpleReply;

+typedef struct NBDSimpleReplyExt {
+uint32_t magic;  /* NBD_SIMPLE_REPLY_EXT_MAGIC */
+uint32_t error;
+uint64_t handle;
+uint64_t _pad1;  /* Must be 0 */
+uint64_t _pad2;  /* Must be 0 */
+} QEMU_PACKED NBDSimpleReplyExt;
+
 /* Header of all structured replies */
 typedef struct NBDStructuredReplyChunk {
 uint32_t magic;  /* NBD_STRUCTURED_REPLY_MAGIC */
@@ -78,9 +86,20 @@ typedef struct NBDStructuredReplyChunk {
 uint32_t length; /* length of payload */
 } QEMU_PACKED NBDStructuredReplyChunk;

+typedef struct NBDStructuredReplyChunkExt {
+uint32_t magic;  /* NBD_STRUCTURED_REPLY_EXT_MAGIC */
+uint16_t flags;  /* combination of NBD_REPLY_FLAG_* */
+uint16_t type;   /* NBD_REPLY_TYPE_* */
+uint64_t handle; /* request handle */
+uint64_t length; /* length of payload */
+uint64_t _pad;   /* Must be 0 */
+} QEMU_PACKED NBDStructuredReplyChunkExt;
+
 typedef union NBDReply {
 NBDSimpleReply simple;
+NBDSimpleReplyExt simple_ext;
 NBDStructuredReplyChunk structured;
+NBDStructuredReplyChunkExt structured_ext;
 struct {
 /* @magic and @handle fields have the same offset and size both in
  * simple reply and structured reply chunk, so let them be accessible
@@ -106,6 +125,13 @@ typedef struct NBDStructuredReadHole {
 uint32_t length;
 } QEMU_PACKED NBDStructuredReadHole;

+/* Complete chunk for NBD_REPLY_TYPE_OFFSET_HOLE_EXT */
+typedef struct NBDStructuredReadHoleExt {
+/* header's length == 16 */
+uint64_t offset;
+uint64_t length;
+} QEMU_PACKED NBDStructuredReadHoleExt;
+
 /* Header of all NBD_REPLY_TYPE_ERROR* errors */
 typedef struct NBDStructuredError {
 /* header's length >= 6 */
@@ -113,19 +139,26 @@ typedef struct NBDStructuredError {
 uint16_t message_length;
 } QEMU_PACKED NBDStructuredError;

-/* Header of NBD_REPLY_TYPE_BLOCK_STATUS */
+/* Header of NBD_REPLY_TYPE_BLOCK_STATUS, NBD_REPLY_TYPE_BLOCK_STATUS_EXT */
 typedef struct NBDStructuredMeta {
-/* header's length >= 12 (at least one extent) */
+/* header's length >= 12 narrow, or >= 20 extended (at least one extent) */
 uint32_t context_id;
-/* extents follows */
+/* extents[] follows: NBDExtent for narrow, NBDExtentExt for extended */
 } QEMU_PACKED NBDStructuredMeta;

-/* Extent chunk for NBD_REPLY_TYPE_BLOCK_STATUS */
+/* Extent array for NBD_REPLY_TYPE_BLOCK_STATUS */
 typedef struct NBDExtent {
 uint32_t length;
 uint32_t flags; /* NBD_STATE_* */
 } QEMU_PACKED NBDExtent;

+/* Extent array for NBD_REPLY_TYPE_BLOCK_STATUS_EXT */
+typedef struct NBDExtentExt {
+uint64_t length;
+uint32_t flags; /* NBD_STATE_* */
+uint32_t _pad;  /* Must be 0 */
+} QEMU_PACKED NBDExtentExt;
+
 /* Transmission (export) flags: sent from server to client during handshake,
but describe what will happen during transmission */
 enum {
@@ -178,6 +211,7 @@ enum {
 #define NBD_OPT_STRUCTURED_REPLY  (8)
 #define NBD_OPT_LIST_META_CONTEXT (9)
 #define NBD_OPT_SET_META_CONTEXT  (10)
+#define NBD_OPT_EXTENDED_HEADERS  (11)

 /* Option reply types. */
 #define NBD_REP_ERR(value) ((UINT32_C(1) << 31) | (value))
@@ -234,12 +268,15 @@ enum {
  */
 #define NBD_MAX_STRING_SIZE 4096

-/* Transmission request structure */
+/* Two types of request structures, a given client will only use 1 */
 #define NBD_REQUEST_MAGIC   0x25609513
+#define NBD_REQUEST_EXT_MAGIC   0x21e41c71

-/* Two types of reply structures */
-#define NBD_SIMPLE_REPLY_MAGIC  0x67446698
-#define NBD_STRUCTURED_REPLY_MAGIC  0x668e33ef
+/* Four types of reply structures, a given client will only use 2 */
+#define NBD_SIMPLE_REPLY_MAGIC  0x67446698
+#define NBD_STRUCTURED_REPLY_MAGIC  0x668e33ef
+#define NBD_SI

[PATCH 11/14] nbd/client: Accept 64-bit hole chunks

2021-12-03 Thread Eric Blake
Although our read requests are sized such that servers need not send
an extended hole chunk, we still have to be prepared for one in order
to be fully compliant if we request extended headers.  We can also
tolerate a non-compliant server sending the new chunk even when it
should not.

Signed-off-by: Eric Blake 
---
 block/nbd.c| 26 --
 block/trace-events |  1 +
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index da5e6ac2d9a5..c5dea864ebb6 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -518,20 +518,26 @@ static inline uint64_t payload_advance64(uint8_t **payload)

 static int nbd_parse_offset_hole_payload(BDRVNBDState *s,
  NBDStructuredReplyChunk *chunk,
- uint8_t *payload, uint64_t orig_offset,
+ uint8_t *payload, bool wide,
+ uint64_t orig_offset,
  QEMUIOVector *qiov, Error **errp)
 {
 uint64_t offset;
-uint32_t hole_size;
+uint64_t hole_size;
+size_t len = wide ? sizeof(hole_size) : sizeof(uint32_t);

-if (chunk->length != sizeof(offset) + sizeof(hole_size)) {
+if (chunk->length != sizeof(offset) + len) {
 error_setg(errp, "Protocol error: invalid payload for "
  "NBD_REPLY_TYPE_OFFSET_HOLE");
 return -EINVAL;
 }

 offset = payload_advance64(&payload);
-hole_size = payload_advance32(&payload);
+if (wide) {
+hole_size = payload_advance64(&payload);
+} else {
+hole_size = payload_advance32(&payload);
+}

 if (!hole_size || offset < orig_offset || hole_size > qiov->size ||
 offset > orig_offset + qiov->size - hole_size) {
@@ -544,6 +550,7 @@ static int nbd_parse_offset_hole_payload(BDRVNBDState *s,
 trace_nbd_structured_read_compliance("hole");
 }

+assert(hole_size <= SIZE_MAX);
 qemu_iovec_memset(qiov, offset - orig_offset, 0, hole_size);

 return 0;
@@ -1037,9 +1044,16 @@ static int nbd_co_receive_cmdread_reply(BDRVNBDState *s, uint64_t handle,
  * in qiov
  */
 break;
+case NBD_REPLY_TYPE_OFFSET_HOLE_EXT:
+if (!s->info.extended_headers) {
+trace_nbd_extended_headers_compliance("hole_ext");
+}
+/* fallthrough */
 case NBD_REPLY_TYPE_OFFSET_HOLE:
-ret = nbd_parse_offset_hole_payload(s, &reply.structured, payload,
-offset, qiov, &local_err);
+ret = nbd_parse_offset_hole_payload(
+s, &reply.structured, payload,
+chunk->type == NBD_REPLY_TYPE_OFFSET_HOLE_EXT,
+offset, qiov, &local_err);
 if (ret < 0) {
 nbd_channel_error(s, ret);
 nbd_iter_channel_error(&iter, ret, &local_err);
diff --git a/block/trace-events b/block/trace-events
index 549090d453e7..ee65da204dde 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -168,6 +168,7 @@ iscsi_xcopy(void *src_lun, uint64_t src_off, void *dst_lun, 
uint64_t dst_off, ui
 # nbd.c
 nbd_parse_blockstatus_compliance(const char *err) "ignoring extra data from non-compliant server: %s"
 nbd_structured_read_compliance(const char *type) "server sent non-compliant unaligned read %s chunk"
+nbd_extended_headers_compliance(const char *type) "server sent non-compliant %s chunk without extended headers"
 nbd_read_reply_entry_fail(int ret, const char *err) "ret = %d, err: %s"
 nbd_co_request_fail(uint64_t from, uint32_t len, uint64_t handle, uint16_t flags, uint16_t type, const char *name, int ret, const char *err) "Request failed { .from = %" PRIu64", .len = %" PRIu32 ", .handle = %" PRIu64 ", .flags = 0x%" PRIx16 ", .type = %" PRIu16 " (%s) } ret = %d, err: %s"
 nbd_client_handshake(const char *export_name) "export '%s'"
-- 
2.33.1




[PATCH 14/14] do not apply: nbd/server: Send 64-bit hole chunk

2021-12-03 Thread Eric Blake
Since we cap NBD_CMD_READ requests to 32M, we never have a reason to
send a 64-bit chunk type for a hole; but it is worth producing these
for interoperability testing of clients that want extended headers.
---
 nbd/server.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 7e6140350797..4369a9a8ff08 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2071,19 +2071,29 @@ static int coroutine_fn nbd_co_send_sparse_read(NBDClient *client,
 if (status & BDRV_BLOCK_ZERO) {
 NBDReply hdr;
 NBDStructuredReadHole chunk;
+NBDStructuredReadHoleExt chunk_ext;
 struct iovec iov[] = {
 {.iov_base = &hdr},
-{.iov_base = &chunk, .iov_len = sizeof(chunk)},
+{.iov_base = client->extended_headers ? &chunk_ext
+ : (void *) &chunk,
+ .iov_len = client->extended_headers ? sizeof(chunk_ext)
+ : sizeof(chunk)},
 };

 trace_nbd_co_send_structured_read_hole(handle, offset + progress,
pnum);
 set_be_chunk(client, &iov[0],
  final ? NBD_REPLY_FLAG_DONE : 0,
- NBD_REPLY_TYPE_OFFSET_HOLE,
+ client->extended_headers ? NBD_REPLY_TYPE_OFFSET_HOLE_EXT
+ : NBD_REPLY_TYPE_OFFSET_HOLE,
  handle, iov[1].iov_len);
-stq_be_p(&chunk.offset, offset + progress);
-stl_be_p(&chunk.length, pnum);
+if (client->extended_headers) {
+stq_be_p(&chunk_ext.offset, offset + progress);
+stq_be_p(&chunk_ext.length, pnum);
+} else {
+stq_be_p(&chunk.offset, offset + progress);
+stl_be_p(&chunk.length, pnum);
+}
 ret = nbd_co_send_iov(client, iov, 2, errp);
 } else {
 ret = blk_pread(exp->common.blk, offset + progress,
-- 
2.33.1




[PATCH 08/14] nbd/server: Initial support for extended headers

2021-12-03 Thread Eric Blake
We have no reason to send NBD_REPLY_TYPE_OFFSET_HOLE_EXT since we
already cap NBD_CMD_READ to 32M.  For NBD_CMD_WRITE_ZEROES and
NBD_CMD_TRIM, the block layer already supports 64-bit operations
without any effort on our part.  For NBD_CMD_BLOCK_STATUS, the
client's length is a hint; the easiest approach is to truncate our
answer back to 32 bits, letting us delay the effort of implementing
NBD_REPLY_TYPE_BLOCK_STATUS_EXT to a separate patch.

Signed-off-by: Eric Blake 
---
 nbd/nbd-internal.h |   5 ++-
 nbd/server.c   | 106 ++---
 2 files changed, 85 insertions(+), 26 deletions(-)

diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
index 0016793ff4b1..875b6204c28c 100644
--- a/nbd/nbd-internal.h
+++ b/nbd/nbd-internal.h
@@ -35,8 +35,11 @@
  * https://github.com/yoe/nbd/blob/master/doc/proto.md
  */

-/* Size of all NBD_OPT_*, without payload */
+/* Size of all compact NBD_CMD_*, without payload */
 #define NBD_REQUEST_SIZE(4 + 2 + 2 + 8 + 8 + 4)
+/* Size of all extended NBD_CMD_*, without payload */
+#define NBD_REQUEST_EXT_SIZE(4 + 2 + 2 + 8 + 8 + 8)
+
 /* Size of all NBD_REP_* sent in answer to most NBD_OPT_*, without payload */
 #define NBD_REPLY_SIZE  (4 + 4 + 8)
 /* Size of reply to NBD_OPT_EXPORT_NAME */
diff --git a/nbd/server.c b/nbd/server.c
index 4306fa7b426c..0e496f60ffbd 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -142,6 +142,7 @@ struct NBDClient {
 uint32_t check_align; /* If non-zero, check for aligned client requests */

 bool structured_reply;
+bool extended_headers;
 NBDExportMetaContexts export_meta;

 uint32_t opt; /* Current option being negotiated */
@@ -1275,6 +1276,19 @@ static int nbd_negotiate_options(NBDClient *client, Error **errp)
  errp);
 break;

+case NBD_OPT_EXTENDED_HEADERS:
+if (length) {
+ret = nbd_reject_length(client, false, errp);
+} else if (client->extended_headers) {
+ret = nbd_negotiate_send_rep_err(
+client, NBD_REP_ERR_INVALID, errp,
+"extended headers already negotiated");
+} else {
+ret = nbd_negotiate_send_rep(client, NBD_REP_ACK, errp);
+client->extended_headers = true;
+}
+break;
+
 default:
 ret = nbd_opt_drop(client, NBD_REP_ERR_UNSUP, errp,
"Unsupported option %" PRIu32 " (%s)",
@@ -1410,11 +1424,13 @@ nbd_read_eof(NBDClient *client, void *buffer, size_t size, Error **errp)
 static int nbd_receive_request(NBDClient *client, NBDRequest *request,
Error **errp)
 {
-uint8_t buf[NBD_REQUEST_SIZE];
-uint32_t magic;
+uint8_t buf[NBD_REQUEST_EXT_SIZE];
+uint32_t magic, expect;
 int ret;
+size_t size = client->extended_headers ? NBD_REQUEST_EXT_SIZE
+: NBD_REQUEST_SIZE;

-ret = nbd_read_eof(client, buf, sizeof(buf), errp);
+ret = nbd_read_eof(client, buf, size, errp);
 if (ret < 0) {
 return ret;
 }
@@ -1422,13 +1438,21 @@ static int nbd_receive_request(NBDClient *client, NBDRequest *request,
 return -EIO;
 }

-/* Request
-   [ 0 ..  3]   magic   (NBD_REQUEST_MAGIC)
-   [ 4 ..  5]   flags   (NBD_CMD_FLAG_FUA, ...)
-   [ 6 ..  7]   type(NBD_CMD_READ, ...)
-   [ 8 .. 15]   handle
-   [16 .. 23]   from
-   [24 .. 27]   len
+/*
+ * Compact request
+ *  [ 0 ..  3]   magic   (NBD_REQUEST_MAGIC)
+ *  [ 4 ..  5]   flags   (NBD_CMD_FLAG_FUA, ...)
+ *  [ 6 ..  7]   type(NBD_CMD_READ, ...)
+ *  [ 8 .. 15]   handle
+ *  [16 .. 23]   from
+ *  [24 .. 27]   len
+ * Extended request
+ *  [ 0 ..  3]   magic   (NBD_REQUEST_EXT_MAGIC)
+ *  [ 4 ..  5]   flags   (NBD_CMD_FLAG_FUA, ...)
+ *  [ 6 ..  7]   type(NBD_CMD_READ, ...)
+ *  [ 8 .. 15]   handle
+ *  [16 .. 23]   from
+ *  [24 .. 31]   len
  */

 magic = ldl_be_p(buf);
@@ -1436,12 +1460,18 @@ static int nbd_receive_request(NBDClient *client, NBDRequest *request,
 request->type   = lduw_be_p(buf + 6);
 request->handle = ldq_be_p(buf + 8);
 request->from   = ldq_be_p(buf + 16);
-request->len= ldl_be_p(buf + 24); /* widen 32 to 64 bits */
+if (client->extended_headers) {
+request->len = ldq_be_p(buf + 24);
+expect = NBD_REQUEST_EXT_MAGIC;
+} else {
+request->len = ldl_be_p(buf + 24); /* widen 32 to 64 bits */
+expect = NBD_REQUEST_MAGIC;
+}

 trace_nbd_receive_request(magic, request->flags, request->type,
   request->from, request->len);

-if (magic != NBD_REQUEST_MAGIC) {
+if (magic != expect) {
 error_setg(errp, "invalid magic (got 0x%" PRIx32 ")", magic);
 retu

[PATCH 03/14] qemu-io: Allow larger write zeroes under no fallback

2021-12-03 Thread Eric Blake
When writing zeroes can fall back to a slow write, permitting an
overly large request can become an amplification denial-of-service
attack by triggering a large amount of work from a small request.  But
the whole point of the no-fallback flag is to quickly determine
whether writing an entire device to zero can be done efficiently (such
as when it is already known that the device started with zero
contents); in those cases, artificially capping things at 2G in
qemu-io itself doesn't help us.

Signed-off-by: Eric Blake 
---
 qemu-io-cmds.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index 954955c12fb9..45a957093369 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -603,10 +603,6 @@ static int do_co_pwrite_zeroes(BlockBackend *blk, int64_t offset,
 .done   = false,
 };

-if (bytes > INT_MAX) {
-return -ERANGE;
-}
-
 co = qemu_coroutine_create(co_pwrite_zeroes_entry, &data);
 bdrv_coroutine_enter(blk_bs(blk), co);
 while (!data.done) {
@@ -1160,8 +1156,9 @@ static int write_f(BlockBackend *blk, int argc, char **argv)
 if (count < 0) {
 print_cvtnum_err(count, argv[optind]);
 return count;
-} else if (count > BDRV_REQUEST_MAX_BYTES) {
-printf("length cannot exceed %" PRIu64 ", given %s\n",
+} else if (count > BDRV_REQUEST_MAX_BYTES &&
+   !(flags & BDRV_REQ_NO_FALLBACK)) {
+printf("length cannot exceed %" PRIu64 " without -n, given %s\n",
(uint64_t)BDRV_REQUEST_MAX_BYTES, argv[optind]);
 return -EINVAL;
 }
-- 
2.33.1




[PATCH 05/14] nbd/server: Prepare for alternate-size headers

2021-12-03 Thread Eric Blake
An upcoming NBD extension wants to add the ability to do 64-bit
requests.  As part of that extension, the size of the reply headers
will change in order to permit a 64-bit length in the reply for
symmetry [*].  Additionally, while the reply header is currently 16
bytes for a simple reply and 20 bytes for a structured reply, with the
extension enabled both reply type headers will be 32 bytes.  Since we
are already wired up to use iovecs, it is easiest to allow for this
change in header size by splitting each structured reply across two
iovecs, one for the header (which will become variable-length in a
future patch according to client negotiation), and the other for the
payload, and removing the header from the payload struct definitions.
Interestingly, the client side code never utilized the packed types,
so only the server code needs to be updated.

[*] Note that on the surface, this is because some server might permit
a 4G+ NBD_CMD_READ and need to reply with that much data in one
transaction.  But even though the extended reply length is widened to
64 bits, we will still never send a reply payload larger than just
over 32M (the maximum buffer we allow in NBD_CMD_READ; and we cap the
number of extents we are willing to report in NBD_CMD_BLOCK_STATUS).
Where 64-bit fields really matter in the extension is in a later patch
adding 64-bit support into a counterpart for REPLY_TYPE_BLOCK_STATUS.

Signed-off-by: Eric Blake 
---
 include/block/nbd.h | 10 +++
 nbd/server.c| 64 -
 2 files changed, 45 insertions(+), 29 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 78d101b77488..3d0689b69367 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -1,5 +1,5 @@
 /*
- *  Copyright (C) 2016-2020 Red Hat, Inc.
+ *  Copyright (C) 2016-2021 Red Hat, Inc.
  *  Copyright (C) 2005  Anthony Liguori 
  *
  *  Network Block Device
@@ -95,28 +95,28 @@ typedef union NBDReply {

 /* Header of chunk for NBD_REPLY_TYPE_OFFSET_DATA */
 typedef struct NBDStructuredReadData {
-NBDStructuredReplyChunk h; /* h.length >= 9 */
+/* header's .length >= 9 */
 uint64_t offset;
 /* At least one byte of data payload follows, calculated from h.length */
 } QEMU_PACKED NBDStructuredReadData;

 /* Complete chunk for NBD_REPLY_TYPE_OFFSET_HOLE */
 typedef struct NBDStructuredReadHole {
-NBDStructuredReplyChunk h; /* h.length == 12 */
+/* header's length == 12 */
 uint64_t offset;
 uint32_t length;
 } QEMU_PACKED NBDStructuredReadHole;

 /* Header of all NBD_REPLY_TYPE_ERROR* errors */
 typedef struct NBDStructuredError {
-NBDStructuredReplyChunk h; /* h.length >= 6 */
+/* header's length >= 6 */
 uint32_t error;
 uint16_t message_length;
 } QEMU_PACKED NBDStructuredError;

 /* Header of NBD_REPLY_TYPE_BLOCK_STATUS */
 typedef struct NBDStructuredMeta {
-NBDStructuredReplyChunk h; /* h.length >= 12 (at least one extent) */
+/* header's length >= 12 (at least one extent) */
 uint32_t context_id;
 /* extents follows */
 } QEMU_PACKED NBDStructuredMeta;
diff --git a/nbd/server.c b/nbd/server.c
index f302e1cbb03e..64845542fd6b 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1869,9 +1869,12 @@ static int coroutine_fn nbd_co_send_iov(NBDClient *client, struct iovec *iov,
 return ret;
 }

-static inline void set_be_simple_reply(NBDSimpleReply *reply, uint64_t error,
-   uint64_t handle)
+static inline void set_be_simple_reply(NBDClient *client, struct iovec *iov,
+   uint64_t error, uint64_t handle)
 {
+NBDSimpleReply *reply = iov->iov_base;
+
+iov->iov_len = sizeof(*reply);
 stl_be_p(&reply->magic, NBD_SIMPLE_REPLY_MAGIC);
 stl_be_p(&reply->error, error);
 stq_be_p(&reply->handle, handle);
@@ -1884,23 +1887,27 @@ static int nbd_co_send_simple_reply(NBDClient *client,
 size_t len,
 Error **errp)
 {
-NBDSimpleReply reply;
+NBDReply hdr;
 int nbd_err = system_errno_to_nbd_errno(error);
 struct iovec iov[] = {
-{.iov_base = &reply, .iov_len = sizeof(reply)},
+{.iov_base = &hdr},
 {.iov_base = data, .iov_len = len}
 };

 trace_nbd_co_send_simple_reply(handle, nbd_err, nbd_err_lookup(nbd_err),
len);
-set_be_simple_reply(&reply, nbd_err, handle);
+set_be_simple_reply(client, &iov[0], nbd_err, handle);

 return nbd_co_send_iov(client, iov, len ? 2 : 1, errp);
 }

-static inline void set_be_chunk(NBDStructuredReplyChunk *chunk, uint16_t flags,
-uint16_t type, uint64_t handle, uint32_t length)
+static inline void set_be_chunk(NBDClient *client, struct iovec *iov,
+uint16_t flags, uint16_t type,
+uint64_t handle, uint32_t length)
 {
+NBDStructuredReplyChunk *

[PATCH 10/14] nbd/client: Initial support for extended headers

2021-12-03 Thread Eric Blake
Update the client code to be able to send an extended request, and
parse an extended header from the server.  Note that since we reject
any structured reply with a too-large payload, we can always normalize
a valid header back into the compact form, so that the caller need not
deal with two branches of a union.  Still, until a later patch lets
the client negotiate extended headers, the code added here should not
be reached.  Note that because of the different magic numbers, it is
just as easy to trace and then tolerate a non-compliant server sending
the wrong header reply as it would be to insist that the server is
compliant.

The only caller of nbd_receive_reply() always passed NULL for errp;
since we are changing the signature anyway, I decided to sink the
decision to ignore errors one layer lower.

Signed-off-by: Eric Blake 
---
 include/block/nbd.h |   2 +-
 block/nbd.c |   3 +-
 nbd/client.c| 112 +++-
 nbd/trace-events|   1 +
 4 files changed, 84 insertions(+), 34 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 5f9d86a86352..d489c67d98dc 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -366,7 +366,7 @@ int nbd_init(int fd, QIOChannelSocket *sioc, NBDExportInfo *info,
  Error **errp);
 int nbd_send_request(QIOChannel *ioc, NBDRequest *request, bool ext_hdr);
 int coroutine_fn nbd_receive_reply(BlockDriverState *bs, QIOChannel *ioc,
-   NBDReply *reply, Error **errp);
+   NBDReply *reply, bool ext_hdrs);
 int nbd_client(int fd);
 int nbd_disconnect(int fd);
 int nbd_errno_to_system_errno(int err);
diff --git a/block/nbd.c b/block/nbd.c
index 3e9875241bec..da5e6ac2d9a5 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -401,7 +401,8 @@ static coroutine_fn int nbd_receive_replies(BDRVNBDState *s, uint64_t handle)

 /* We are under mutex and handle is 0. We have to do the dirty work. */
 assert(s->reply.handle == 0);
-ret = nbd_receive_reply(s->bs, s->ioc, &s->reply, NULL);
+ret = nbd_receive_reply(s->bs, s->ioc, &s->reply,
+s->info.extended_headers);
 if (ret <= 0) {
 ret = ret ? ret : -EIO;
 nbd_channel_error(s, ret);
diff --git a/nbd/client.c b/nbd/client.c
index aa162b9d08d5..f1aa5256c8bf 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -1347,22 +1347,28 @@ int nbd_disconnect(int fd)

 int nbd_send_request(QIOChannel *ioc, NBDRequest *request, bool ext_hdr)
 {
-uint8_t buf[NBD_REQUEST_SIZE];
+uint8_t buf[NBD_REQUEST_EXT_SIZE];
+size_t len;

-assert(!ext_hdr);
-assert(request->len <= UINT32_MAX);
 trace_nbd_send_request(request->from, request->len, request->handle,
request->flags, request->type,
nbd_cmd_lookup(request->type));

-stl_be_p(buf, NBD_REQUEST_MAGIC);
+stl_be_p(buf, ext_hdr ? NBD_REQUEST_EXT_MAGIC : NBD_REQUEST_MAGIC);
 stw_be_p(buf + 4, request->flags);
 stw_be_p(buf + 6, request->type);
 stq_be_p(buf + 8, request->handle);
 stq_be_p(buf + 16, request->from);
-stl_be_p(buf + 24, request->len);
+if (ext_hdr) {
+stq_be_p(buf + 24, request->len);
+len = NBD_REQUEST_EXT_SIZE;
+} else {
+assert(request->len <= UINT32_MAX);
+stl_be_p(buf + 24, request->len);
+len = NBD_REQUEST_SIZE;
+}

-return nbd_write(ioc, buf, sizeof(buf), NULL);
+return nbd_write(ioc, buf, len, NULL);
 }

 /* nbd_receive_simple_reply
@@ -1370,49 +1376,69 @@ int nbd_send_request(QIOChannel *ioc, NBDRequest *request, bool ext_hdr)
  * Payload is not read (payload is possible for CMD_READ, but here we even
  * don't know whether it take place or not).
  */
-static int nbd_receive_simple_reply(QIOChannel *ioc, NBDSimpleReply *reply,
+static int nbd_receive_simple_reply(QIOChannel *ioc, NBDReply *reply,
 Error **errp)
 {
 int ret;
+size_t len;

-assert(reply->magic == NBD_SIMPLE_REPLY_MAGIC);
+if (reply->magic == NBD_SIMPLE_REPLY_MAGIC) {
+len = sizeof(reply->simple);
+} else {
+assert(reply->magic == NBD_SIMPLE_REPLY_EXT_MAGIC);
+len = sizeof(reply->simple_ext);
+}

 ret = nbd_read(ioc, (uint8_t *)reply + sizeof(reply->magic),
-   sizeof(*reply) - sizeof(reply->magic), "reply", errp);
+   len - sizeof(reply->magic), "reply", errp);
 if (ret < 0) {
 return ret;
 }

-reply->error = be32_to_cpu(reply->error);
-reply->handle = be64_to_cpu(reply->handle);
+/* error and handle occupy same space between forms */
+reply->simple.error = be32_to_cpu(reply->simple.error);
+reply->simple.handle = be64_to_cpu(reply->handle);
+if (reply->magic == NBD_SIMPLE_REPLY_EXT_MAGIC) {
+if (reply->simple_ext._pad1 || reply->simple_ext._pad2) {
+er

[PATCH 04/14] nbd/client: Add safety check on chunk payload length

2021-12-03 Thread Eric Blake
Our existing use of structured replies either reads into a qiov capped
at 32M (NBD_CMD_READ) or caps allocation to 1000 bytes (see
NBD_MAX_MALLOC_PAYLOAD in block/nbd.c).  But the existing length
checks are rather late; if we encounter a buggy (or malicious) server
that sends a super-large payload length, we should drop the connection
right then rather than assuming the layer on top will be careful.
This becomes more important when we permit 64-bit lengths which are
even more likely to have the potential for attempted denial of service
abuse.

Signed-off-by: Eric Blake 
---
 nbd/client.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/nbd/client.c b/nbd/client.c
index 30d5383cb195..8f137c2320bb 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -1412,6 +1412,18 @@ static int nbd_receive_structured_reply_chunk(QIOChannel *ioc,
 chunk->handle = be64_to_cpu(chunk->handle);
 chunk->length = be32_to_cpu(chunk->length);

+/*
+ * Because we use BLOCK_STATUS with REQ_ONE, and cap READ requests
+ * at 32M, no valid server should send us payload larger than
+ * this.  Even if we stopped using REQ_ONE, sane servers will cap
+ * the number of extents they return for block status.
+ */
+if (chunk->length > NBD_MAX_BUFFER_SIZE + sizeof(NBDStructuredReadData)) {
+error_setg(errp, "server chunk %" PRIu32 " (%s) payload is too long",
+   chunk->type, nbd_rep_lookup(chunk->type));
+return -EINVAL;
+}
+
 return 0;
 }

-- 
2.33.1




[PATCH 01/14] nbd/server: Minor cleanups

2021-12-03 Thread Eric Blake
Spelling fixes, grammar improvements and consistent spacing, noticed
while preparing other patches in this file.

Signed-off-by: Eric Blake 
---
 nbd/server.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 4630dd732250..f302e1cbb03e 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2085,11 +2085,10 @@ static void nbd_extent_array_convert_to_be(NBDExtentArray *ea)
  * Add extent to NBDExtentArray. If extent can't be added (no available space),
  * return -1.
  * For safety, when returning -1 for the first time, .can_add is set to false,
- * further call to nbd_extent_array_add() will crash.
- * (to avoid the situation, when after failing to add an extent (returned -1),
- * user miss this failure and add another extent, which is successfully added
- * (array is full, but new extent may be squashed into the last one), then we
- * have invalid array with skipped extent)
+ * and further calls to nbd_extent_array_add() will crash.
+ * (this avoids the situation where a caller ignores failure to add one extent,
+ * where adding another extent that would squash into the last array entry
+ * would result in an incorrect range reported to the client)
  */
 static int nbd_extent_array_add(NBDExtentArray *ea,
 uint32_t length, uint32_t flags)
@@ -2288,7 +2287,7 @@ static int nbd_co_receive_request(NBDRequestData *req, NBDRequest *request,
 assert(client->recv_coroutine == qemu_coroutine_self());
 ret = nbd_receive_request(client, request, errp);
 if (ret < 0) {
-return  ret;
+return ret;
 }

 trace_nbd_co_receive_request_decode_type(request->handle, request->type,
@@ -2648,7 +2647,7 @@ static coroutine_fn void nbd_trip(void *opaque)
 }

 if (ret < 0) {
-/* It wans't -EIO, so, according to nbd_co_receive_request()
+/* It wasn't -EIO, so, according to nbd_co_receive_request()
  * semantics, we should return the error to the client. */
 Error *export_err = local_err;

-- 
2.33.1




[PATCH 00/14] qemu patches for NBD_OPT_EXTENDED_HEADERS

2021-12-03 Thread Eric Blake
Available at https://repo.or.cz/qemu/ericb.git/shortlog/refs/tags/exthdr-v1

Patch 14 is optional; I'm including it now because I tested with it,
but I'm also okay with dropping it based on RFC discussion.

Eric Blake (14):
  nbd/server: Minor cleanups
  qemu-io: Utilize 64-bit status during map
  qemu-io: Allow larger write zeroes under no fallback
  nbd/client: Add safety check on chunk payload length
  nbd/server: Prepare for alternate-size headers
  nbd: Prepare for 64-bit requests
  nbd: Add types for extended headers
  nbd/server: Initial support for extended headers
  nbd/server: Support 64-bit block status
  nbd/client: Initial support for extended headers
  nbd/client: Accept 64-bit hole chunks
  nbd/client: Accept 64-bit block status chunks
  nbd/client: Request extended headers during negotiation
  do not apply: nbd/server: Send 64-bit hole chunk

 docs/interop/nbd.txt  |   1 +
 include/block/nbd.h   |  94 +--
 nbd/nbd-internal.h|   8 +-
 block/nbd.c   | 102 +--
 nbd/client-connection.c   |   1 +
 nbd/client.c  | 150 +++---
 nbd/common.c  |  10 +-
 nbd/server.c  | 262 +-
 qemu-io-cmds.c|  16 +-
 qemu-nbd.c|   2 +
 block/trace-events|   1 +
 nbd/trace-events  |   9 +-
 tests/qemu-iotests/223.out|   4 +
 tests/qemu-iotests/233.out|   1 +
 tests/qemu-iotests/241|   8 +-
 tests/qemu-iotests/307|   2 +-
 tests/qemu-iotests/307.out|   5 +
 .../tests/nbd-qemu-allocation.out |   1 +
 18 files changed, 486 insertions(+), 191 deletions(-)

-- 
2.33.1




[PATCH 02/14] qemu-io: Utilize 64-bit status during map

2021-12-03 Thread Eric Blake
The block layer has supported 64-bit block status from drivers since
commit 86a3d5c688 ("block: Add .bdrv_co_block_status() callback",
v2.12) and friends, with individual driver callbacks responsible for
capping things where necessary.  Artificially capping things below 2G
in the qemu-io 'map' command, added in commit d6a644bbfe ("block: Make
bdrv_is_allocated() byte-based", v2.10) is thus no longer necessary.

One way to test this is with qemu-nbd as server on a raw file larger
than 4G (the entire file should show as allocated), plus 'qemu-io -f
raw -c map nbd://localhost --trace=nbd_\*' as client.  Prior to this
patch, the NBD_CMD_BLOCK_STATUS requests are fragmented at 0x7e00
distances; with this patch, the fragmenting changes to 0x7fff
(since the NBD protocol is currently still limited to 32-bit
transactions - see block/nbd.c:nbd_client_co_block_status).  Then in
later patches, once I add an NBD extension for a 64-bit block status,
the same map command completes with just one NBD_CMD_BLOCK_STATUS.

Signed-off-by: Eric Blake 
---
 qemu-io-cmds.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index 46593d632d8f..954955c12fb9 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -1993,11 +1993,9 @@ static int map_is_allocated(BlockDriverState *bs, int64_t offset,
 int64_t bytes, int64_t *pnum)
 {
 int64_t num;
-int num_checked;
 int ret, firstret;

-num_checked = MIN(bytes, BDRV_REQUEST_MAX_BYTES);
-ret = bdrv_is_allocated(bs, offset, num_checked, &num);
+ret = bdrv_is_allocated(bs, offset, bytes, &num);
 if (ret < 0) {
 return ret;
 }
@@ -2009,8 +2007,7 @@ static int map_is_allocated(BlockDriverState *bs, int64_t offset,
 offset += num;
 bytes -= num;

-num_checked = MIN(bytes, BDRV_REQUEST_MAX_BYTES);
-ret = bdrv_is_allocated(bs, offset, num_checked, &num);
+ret = bdrv_is_allocated(bs, offset, bytes, &num);
 if (ret == firstret && num) {
 *pnum += num;
 } else {
-- 
2.33.1




[PATCH 06/14] nbd: Prepare for 64-bit requests

2021-12-03 Thread Eric Blake
Widen the length field of NBDRequest to 64 bits, although we can
assert that all current uses are still under 32 bits.  Move the
request magic number to nbd.h, to live alongside the reply magic
number.  Add a bool that will eventually track whether the client
successfully negotiated extended headers with the server, allowing the
nbd driver to pass larger requests along where possible; although in
this patch it always remains false for no semantic change yet.

Signed-off-by: Eric Blake 
---
 include/block/nbd.h | 19 +++
 nbd/nbd-internal.h  |  3 +--
 block/nbd.c | 35 ---
 nbd/client.c|  8 +---
 nbd/server.c| 11 ---
 nbd/trace-events|  8 
 6 files changed, 53 insertions(+), 31 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 3d0689b69367..732314aaba11 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -52,17 +52,16 @@ typedef struct NBDOptionReplyMetaContext {

 /* Transmission phase structs
  *
- * Note: these are _NOT_ the same as the network representation of an NBD
- * request and reply!
+ * Note: NBDRequest is _NOT_ the same as the network representation of an NBD
+ * request!
  */
-struct NBDRequest {
+typedef struct NBDRequest {
 uint64_t handle;
 uint64_t from;
-uint32_t len;
+uint64_t len;   /* Must fit 32 bits unless extended headers negotiated */
 uint16_t flags; /* NBD_CMD_FLAG_* */
-uint16_t type; /* NBD_CMD_* */
-};
-typedef struct NBDRequest NBDRequest;
+uint16_t type;  /* NBD_CMD_* */
+} NBDRequest;

 typedef struct NBDSimpleReply {
 uint32_t magic;  /* NBD_SIMPLE_REPLY_MAGIC */
@@ -235,6 +234,9 @@ enum {
  */
 #define NBD_MAX_STRING_SIZE 4096

+/* Transmission request structure */
+#define NBD_REQUEST_MAGIC   0x25609513
+
 /* Two types of reply structures */
 #define NBD_SIMPLE_REPLY_MAGIC  0x67446698
 #define NBD_STRUCTURED_REPLY_MAGIC  0x668e33ef
@@ -293,6 +295,7 @@ struct NBDExportInfo {
 /* In-out fields, set by client before nbd_receive_negotiate() and
  * updated by server results during nbd_receive_negotiate() */
 bool structured_reply;
+bool extended_headers;
 bool base_allocation; /* base:allocation context for NBD_CMD_BLOCK_STATUS */

 /* Set by server results during nbd_receive_negotiate() and
@@ -322,7 +325,7 @@ int nbd_receive_export_list(QIOChannel *ioc, QCryptoTLSCreds *tlscreds,
 Error **errp);
 int nbd_init(int fd, QIOChannelSocket *sioc, NBDExportInfo *info,
  Error **errp);
-int nbd_send_request(QIOChannel *ioc, NBDRequest *request);
+int nbd_send_request(QIOChannel *ioc, NBDRequest *request, bool ext_hdr);
 int coroutine_fn nbd_receive_reply(BlockDriverState *bs, QIOChannel *ioc,
NBDReply *reply, Error **errp);
 int nbd_client(int fd);
diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
index 1b2141ab4b2d..0016793ff4b1 100644
--- a/nbd/nbd-internal.h
+++ b/nbd/nbd-internal.h
@@ -1,7 +1,7 @@
 /*
  * NBD Internal Declarations
  *
- * Copyright (C) 2016 Red Hat, Inc.
+ * Copyright (C) 2016-2021 Red Hat, Inc.
  *
  * This work is licensed under the terms of the GNU GPL, version 2 or later.
  * See the COPYING file in the top-level directory.
@@ -45,7 +45,6 @@
 #define NBD_OLDSTYLE_NEGOTIATE_SIZE (8 + 8 + 8 + 4 + 124)

 #define NBD_INIT_MAGIC  0x4e42444d41474943LL /* ASCII "NBDMAGIC" */
-#define NBD_REQUEST_MAGIC   0x25609513
 #define NBD_OPTS_MAGIC  0x49484156454F5054LL /* ASCII "IHAVEOPT" */
 #define NBD_CLIENT_MAGIC0x420281861253LL
 #define NBD_REP_MAGIC   0x0003e889045565a9LL
diff --git a/block/nbd.c b/block/nbd.c
index 5ef462db1b7f..3e9875241bec 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -2,7 +2,7 @@
  * QEMU Block driver for  NBD
  *
  * Copyright (c) 2019 Virtuozzo International GmbH.
- * Copyright (C) 2016 Red Hat, Inc.
+ * Copyright (C) 2016-2021 Red Hat, Inc.
  * Copyright (C) 2008 Bull S.A.S.
  * Author: Laurent Vivier 
  *
@@ -300,7 +300,7 @@ int coroutine_fn nbd_co_do_establish_connection(BlockDriverState *bs,
  */
 NBDRequest request = { .type = NBD_CMD_DISC };

-nbd_send_request(s->ioc, &request);
+nbd_send_request(s->ioc, &request, s->info.extended_headers);

 yank_unregister_function(BLOCKDEV_YANK_INSTANCE(s->bs->node_name),
  nbd_yank, bs);
@@ -470,7 +470,7 @@ static int nbd_co_send_request(BlockDriverState *bs,

 if (qiov) {
 qio_channel_set_cork(s->ioc, true);
-rc = nbd_send_request(s->ioc, request);
+rc = nbd_send_request(s->ioc, request, s->info.extended_headers);
 if (nbd_client_connected(s) && rc >= 0) {
 if (qio_channel_writev_all(s->ioc, qiov->iov, qiov->niov,
NULL) < 0) {
@@ -481,7 +481,7 @@ static int nbd_co_send_request(BlockDriverState *bs,

[PATCH] spec: Add NBD_OPT_EXTENDED_HEADERS

2021-12-03 Thread Eric Blake
Add a new negotiation feature where the client and server agree to use
larger packet headers on every packet sent during transmission phase.
This has two purposes: first, it makes it possible to perform
operations like trim, write zeroes, and block status on more than 2^32
bytes in a single command; this in turn requires that some structured
replies from the server also be extended to match.  The wording chosen
here is careful to permit a server to use either flavor in its reply
(that is, a request less than 32-bits can trigger an extended reply,
and conversely a request larger than 32-bits can trigger a compact
reply).

Second, when structured replies are active, clients have to deal with
the difference between 16- and 20-byte headers of simple
vs. structured replies, which impacts performance if the client must
perform multiple syscalls to first read the magic before knowing how
many additional bytes to read.  In extended header mode, all headers
are the same width, so the client can read a full header before
deciding whether the header describes a simple or structured reply.
Similarly, by having extended mode use a power-of-2 sizing, it becomes
easier to manipulate headers within a single cache line, even if it
requires padding bytes sent over the wire.  However, note that this
change only affects the headers; as data payloads can still be
unaligned (for example, a client performing 1-byte reads or writes),
we would need to negotiate yet another extension if we wanted to
ensure that all NBD transmission packets started on an 8-byte boundary
after option haggling has completed.

This spec addition was done in parallel with a proof of concept
implementation in qemu (server and client) and libnbd (client), and I
also have plans to implement it in nbdkit (server).

Signed-off-by: Eric Blake 
---

Available at https://repo.or.cz/nbd/ericb.git/shortlog/refs/tags/exthdr-v1

 doc/proto.md | 218 +--
 1 file changed, 177 insertions(+), 41 deletions(-)

diff --git a/doc/proto.md b/doc/proto.md
index 3a877a9..46560b6 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -295,6 +295,21 @@ reply is also problematic for error handling of the `NBD_CMD_READ`
 request.  Therefore, structured replies can be used to create a
 context-free server stream; see below.

+The results of client negotiation also determine whether the client
+and server will utilize only compact requests and replies, or whether
+both sides will use only extended packets.  Compact messages are the
+default, but inherently limit single transactions to a 32-bit window
+starting at a 64-bit offset.  Extended messages make it possible to
+perform 64-bit transactions (although typically only for commands that
+do not include a data payload).  Furthermore, when structured replies
+have been negotiated, compact messages require the client to perform
+partial reads to determine which reply packet style (simple or
+structured) is on the wire before knowing the length of the rest of
+the reply, which can reduce client performance.  With extended
+messages, all packet headers have a fixed length of 32 bytes, and
+although this results in more traffic over the network due to padding,
+the resulting layout is friendlier for performance.
+
 Replies need not be sent in the same order as requests (i.e., requests
 may be handled by the server asynchronously), and structured reply
 chunks from one request may be interleaved with reply messages from
@@ -343,7 +358,9 @@ may be useful.

  Request message

-The request message, sent by the client, looks as follows:
+The compact request message, sent by the client when extended
+transactions are not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
+looks as follows:

 C: 32 bits, 0x25609513, magic (`NBD_REQUEST_MAGIC`)  
 C: 16 bits, command flags  
@@ -353,14 +370,26 @@ C: 64 bits, offset (unsigned)
 C: 32 bits, length (unsigned)  
 C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)  

+If negotiation agreed on extended transactions with
+`NBD_OPT_EXTENDED_HEADERS`, the client instead uses extended requests:
+
+C: 32 bits, 0x21e41c71, magic (`NBD_REQUEST_EXT_MAGIC`)  
+C: 16 bits, command flags  
+C: 16 bits, type  
+C: 64 bits, handle  
+C: 64 bits, offset (unsigned)  
+C: 64 bits, length (unsigned)  
+C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)  
+
  Simple reply message

 The simple reply message MUST be sent by the server in response to all
 requests if structured replies have not been negotiated using
-`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been negotiated, a simple
-reply MAY be used as a reply to any request other than `NBD_CMD_READ`,
-but only if the reply has no data payload.  The message looks as
-follows:
+`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been
+negotiated, a simple reply MAY be used as a reply to any request other
+than `NBD_CMD_READ`, but only if the reply has no data payload.  

RFC for NBD protocol extension: extended headers

2021-12-03 Thread Eric Blake
In response to this mail, I will be cross-posting a series of patches
to multiple projects as a proof-of-concept implementation and request
for comments on a new NBD protocol extension, called
NBD_OPT_EXTENDED_HEADERS.  With this in place, it will be possible for
clients to request 64-bit zero, trim, cache, and block status
operations when supported by the server.

Not yet complete: an implementation of this in nbdkit.  I also plan to
tweak libnbd's 'nbdinfo --map' and 'nbdcopy' to take advantage of the
larger block status results.  Also, with 64-bit commands, we may want
to also make it easier to let servers advertise an actual maximum size
they are willing to accept for the commands in question (for example,
a server may be happy with a full 64-bit block status, but still want
to limit non-fast zero and cache to a smaller limit to avoid denial of
service).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 0/2] hw/arm/virt: Support for virtio-mem-pci

2021-12-03 Thread Gavin Shan

Hi Jonathan,

On 12/4/21 1:10 AM, Jonathan Cameron wrote:

On Fri,  3 Dec 2021 11:35:20 +0800
Gavin Shan  wrote:


This series supports virtio-mem-pci device, by simply following the
implementation on x86. The exception is the block size is 512MB on
ARM64 instead of 128MB on x86, compatible with the memory section
size in linux guest.

The work was done by David Hildenbrand and then Jonathan Cameron. I'm
taking over the patch and putting in more effort, which at the current
stage is all about testing for me.


Thanks for taking this forwards.  What you have here looks good to me, but
I've not looked at this for a while, so I'll go with whatever David and
others say :)



[...]

I would take this as your Reviewed-by tag, which will be added in v3.
However, it shouldn't stop you from further reviewing :)

Thanks,
Gavin




Re: [RFC PATCH 2/2] qemu-img convert: Fix sparseness detection

2021-12-03 Thread Vladimir Sementsov-Ogievskiy

03.12.2021 14:17, Peter Lieven wrote:

On 19.05.21 at 18:48, Kevin Wolf wrote:

On 19.05.2021 at 15:24, Peter Lieven wrote:

On 20.04.21 at 18:52, Vladimir Sementsov-Ogievskiy wrote:

20.04.2021 18:04, Kevin Wolf wrote:

Am 20.04.2021 um 16:31 hat Vladimir Sementsov-Ogievskiy geschrieben:

15.04.2021 18:22, Kevin Wolf wrote:

In order to avoid RMW cycles, is_allocated_sectors() treats zeroed areas
like non-zero data if the end of the checked area isn't aligned. This
can improve the efficiency of the conversion and was introduced in
commit 8dcd3c9b91a.

However, it comes with a correctness problem: qemu-img convert is
supposed to sparsify areas that contain only zeros, which it doesn't do
any more. It turns out that this even happens when not only the
unaligned area is zeroed, but also the blocks before and after it. In
the bug report, conversion of a fragmented 10G image containing only
zeros resulted in an image consuming 2.82 GiB even though the expected
size is only 4 KiB.

As a tradeoff between both, let's ignore zeroed sectors only after
non-zero data to fix the alignment, but if we're only looking at zeros,
keep them as such, even if it may mean additional RMW cycles.


Hmm.. If I understand correctly, we are going to do unaligned
write-zero. And that helps.

This can happen (mostly raw images on block devices, I think?), but
usually it just means skipping the write because we know that the target
image is already zeroed.

What it does mean is that if the next part is data, we'll have an
unaligned data write.


Doesn't that mean that alignment is wrongly detected?

The problem is that you can have bdrv_block_status_above() return the
same allocation status multiple times in a row, but *pnum can be
unaligned for the conversion.

We only look at a single range returned by it when detecting the
alignment, so it could be that we have zero buffers for both 0-11 and
12-16 and detect two misaligned ranges, when both together are a
perfectly aligned zeroed range.

In theory we could try to do some lookahead and merge ranges where
possible, which should give us the perfect result, but it would make the
code considerably more complicated. (Whether we want to merge them
doesn't only depend on the block status, but possibly also on the
content of a DATA range.)

Kevin


Oh, I understand now the problem, thanks for explanation.

Hmm, yes, that means that if the whole buf is zero, is_allocated_sectors must
not align it down, so that it can possibly be "merged" with the next chunk if
that is zero too.

But it's still good to align zeroes down, if data starts somewhere inside the 
buf, isn't it?

what about something like this:

diff --git a/qemu-img.c b/qemu-img.c
index babb5573ab..d1704584a0 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1167,19 +1167,39 @@ static int is_allocated_sectors(const uint8_t *buf, int n, int *pnum,
  }
  }
  
+    if (i == n) {
+        /*
+         * The whole buf is the same.
+         *
+         * If it's data, just return it. It's the old behavior.
+         *
+         * If it's zero, just return too. It will work well if the target is
+         * already zeroed. And if the next chunk is zero too, we'll have no
+         * reason to write data.
+         */
+        *pnum = i;
+        return !is_zero;
+    }
+
     tail = (sector_num + i) & (alignment - 1);
     if (tail) {
         if (is_zero && i <= tail) {
-            /* treat unallocated areas which only consist
-             * of a small tail as allocated. */
+            /*
+             * For sure the next sector after i is data, and it will rewrite
+             * this tail anyway due to RMW. So, let's just write data now.
+             */
             is_zero = false;
         }
         if (!is_zero) {
-            /* align up end offset of allocated areas. */
+            /* If possible, align up the end offset of allocated areas. */
             i += alignment - tail;
             i = MIN(i, n);
         } else {
-            /* align down end offset of zero areas. */
+            /*
+             * For sure the next sector after i is data, and it will rewrite
+             * this tail anyway due to RMW. Better to avoid RMW and write
+             * zeroes up to the aligned boundary.
+             */
             i -= tail;
         }
     }

I think we forgot to follow up on this. Has anyone tested this
suggestion?

Otherwise, I would try to rerun the tests I did with my old and
Kevin's suggestions.

I noticed earlier this week that these patches are still in my
development branch, but didn't actually pick it up again yet. So feel
free to try it out.



It seems this time I forgot to follow up. Is this topic still open?



Most probably yes :) I now checked that my proposed diff still applies
to master and doesn't break compilation. So, if you have a test, you can
check whether it works better with the change.

--
Best regards,
Vladimir



[PATCH v2] hw/net: npcm7xx_emc fix missing queue_flush

2021-12-03 Thread Patrick Venture
Changing the rx_active boolean to true should always trigger a try_read
call that flushes the queue.

Signed-off-by: Patrick Venture 
---
v2: introduced helper method to encapsulate rx activation and queue flush.
---
 hw/net/npcm7xx_emc.c | 18 --
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/hw/net/npcm7xx_emc.c b/hw/net/npcm7xx_emc.c
index 7c892f820f..545b2b7410 100644
--- a/hw/net/npcm7xx_emc.c
+++ b/hw/net/npcm7xx_emc.c
@@ -284,6 +284,12 @@ static void emc_halt_rx(NPCM7xxEMCState *emc, uint32_t mista_flag)
 emc_set_mista(emc, mista_flag);
 }
 
+static void emc_enable_rx_and_flush(NPCM7xxEMCState *emc)
+{
+emc->rx_active = true;
+qemu_flush_queued_packets(qemu_get_queue(emc->nic));
+}
+
 static void emc_set_next_tx_descriptor(NPCM7xxEMCState *emc,
const NPCM7xxEMCTxDesc *tx_desc,
uint32_t desc_addr)
@@ -581,13 +587,6 @@ static ssize_t emc_receive(NetClientState *nc, const uint8_t *buf, size_t len1)
 return len;
 }
 
-static void emc_try_receive_next_packet(NPCM7xxEMCState *emc)
-{
-if (emc_can_receive(qemu_get_queue(emc->nic))) {
-qemu_flush_queued_packets(qemu_get_queue(emc->nic));
-}
-}
-
 static uint64_t npcm7xx_emc_read(void *opaque, hwaddr offset, unsigned size)
 {
 NPCM7xxEMCState *emc = opaque;
@@ -703,7 +702,7 @@ static void npcm7xx_emc_write(void *opaque, hwaddr offset,
 emc->regs[REG_MGSTA] |= REG_MGSTA_RXHA;
 }
 if (value & REG_MCMDR_RXON) {
-emc->rx_active = true;
+emc_enable_rx_and_flush(emc);
 } else {
 emc_halt_rx(emc, 0);
 }
@@ -739,8 +738,7 @@ static void npcm7xx_emc_write(void *opaque, hwaddr offset,
 break;
 case REG_RSDR:
 if (emc->regs[REG_MCMDR] & REG_MCMDR_RXON) {
-emc->rx_active = true;
-emc_try_receive_next_packet(emc);
+emc_enable_rx_and_flush(emc);
 }
 break;
 case REG_MIIDA:
-- 
2.34.1.400.ga245620fadb-goog




Re: [PATCH] hw/net: npcm7xx_emc fix missing queue_flush

2021-12-03 Thread Patrick Venture
On Fri, Dec 3, 2021 at 1:54 PM Patrick Venture  wrote:

>
>
> On Fri, Dec 3, 2021 at 1:42 PM Philippe Mathieu-Daudé 
> wrote:
>
>> On 12/3/21 22:27, Patrick Venture wrote:
>> > The rx_active boolean change to true should always trigger a try_read
>> > call that flushes the queue.
>> >
>> > Signed-off-by: Patrick Venture 
>> > ---
>> >  hw/net/npcm7xx_emc.c | 10 ++
>> >  1 file changed, 2 insertions(+), 8 deletions(-)
>> >
>> > diff --git a/hw/net/npcm7xx_emc.c b/hw/net/npcm7xx_emc.c
>> > index 7c892f820f..97522e6388 100644
>> > --- a/hw/net/npcm7xx_emc.c
>> > +++ b/hw/net/npcm7xx_emc.c
>> > @@ -581,13 +581,6 @@ static ssize_t emc_receive(NetClientState *nc, const uint8_t *buf, size_t len1)
>> >  return len;
>> >  }
>> >
>> > -static void emc_try_receive_next_packet(NPCM7xxEMCState *emc)
>> > -{
>> > -if (emc_can_receive(qemu_get_queue(emc->nic))) {
>> > -qemu_flush_queued_packets(qemu_get_queue(emc->nic));
>> > -}
>> > -}
>>
>> What about modifying as emc_flush_rx() or emc_receive_and_flush()
>> helper instead?
>>
>>  static void emc_flush_rx(NPCM7xxEMCState *emc)
>>  {
>>  emc->rx_active = true;
>>  qemu_flush_queued_packets(qemu_get_queue(emc->nic));
>>  }
>>
>
> I'm ok with that idea, although I'm less fond of how it _hides_ the
> rx_active=true.  There is an emc_halt_rx that hides rx_active=false, so one
> could argue it's not an issue. Looking at ftgmac100, it mostly just calls
> the qemu_flush_queued_packets inline where it needs it.  So given the prior
> art, I'm more inclined to leave this as a two-line pair, versus collapsing
> it into a method.  Mostly because I don't anticipate this call being made
> from any other places, so it's not a "growing" device.  The method
> originally was emc_try_receive_next_packet, which didn't do anything more
> than a no-op check and the queue_flush.  The new method would move the
> rx_active setting from the call that deliberately controls it (the register
> change) into a subordinate method...
>
> Beyond all that, I think it's fine either way.  Feel free to push back and
> I'll make the change.
>

I figured why not :) And just made the change and sent out a v2.

>
>> >  static uint64_t npcm7xx_emc_read(void *opaque, hwaddr offset, unsigned size)
>> >  {
>> >  NPCM7xxEMCState *emc = opaque;
>> > @@ -704,6 +697,7 @@ static void npcm7xx_emc_write(void *opaque, hwaddr offset,
>> >  }
>> >  if (value & REG_MCMDR_RXON) {
>> >  emc->rx_active = true;
>> > +qemu_flush_queued_packets(qemu_get_queue(emc->nic));
>> >  } else {
>> >  emc_halt_rx(emc, 0);
>> >  }
>> > @@ -740,7 +734,7 @@ static void npcm7xx_emc_write(void *opaque, hwaddr offset,
>> >  case REG_RSDR:
>> >  if (emc->regs[REG_MCMDR] & REG_MCMDR_RXON) {
>> >  emc->rx_active = true;
>> > -emc_try_receive_next_packet(emc);
>> > +qemu_flush_queued_packets(qemu_get_queue(emc->nic));
>> >  }
>> >  break;
>> >  case REG_MIIDA:
>> >
>>
>>


Re: QEMU 6.2.0 and rhbz#1999878

2021-12-03 Thread Eduardo Lima
On Fri, Dec 3, 2021 at 4:37 PM Richard W.M. Jones  wrote:

> On Fri, Dec 03, 2021 at 04:20:23PM -0300, Eduardo Lima wrote:
> > Hi Rich,
> >
> > Can you confirm if the patch you added for qemu in Fedora has still not
> been
> > merged upstream? I could not find it on the git source tree.
> >
> > +Patch2: 0001-tcg-arm-Reduce-vector-alignment-requirement-for-NEON.patch
> > +From 1331e4eec016a295949009b4360c592401b089f7 Mon Sep 17 00:00:00 2001
> > +From: Richard Henderson 
> > +Date: Sun, 12 Sep 2021 10:49:25 -0700
> > +Subject: [PATCH] tcg/arm: Reduce vector alignment requirement for NEON
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1999878
> https://lists.nongnu.org/archive/html/qemu-devel/2021-09/msg01028.html
>
> The patch I posted wasn't correct (or meant to be), it was just a
> workaround.  However I think you're right - I don't believe the
> original problem was ever fixed.
>
Yes, I saw that your original patch had been replaced by this new one I
mentioned, so I thought it was the correct solution, but I could not find
the new one in the repository either.

At the moment I kept it as part of 6.2.0 build, which I am just about to
push to rawhide. It builds locally, and I am only waiting for the
scratch-build to finish.

https://koji.fedoraproject.org/koji/taskinfo?taskID=79556515

Thanks, Eduardo.



>
> Let's see what upstreams says ...
>
> Rich.
>
> --
> Richard Jones, Virtualization Group, Red Hat
> http://people.redhat.com/~rjones
> Read my programming and virtualization blog: http://rwmj.wordpress.com
> virt-p2v converts physical machines to virtual machines.  Boot with a
> live CD or over the network (PXE) and turn machines into KVM guests.
> http://libguestfs.org/virt-v2v
>
>


Re: QEMU 6.2.0 and rhbz#1999878

2021-12-03 Thread Richard Henderson

On 12/3/21 1:03 PM, Richard W.M. Jones wrote:

On Fri, Dec 03, 2021 at 05:35:41PM -0300, Eduardo Lima wrote:



On Fri, Dec 3, 2021 at 4:37 PM Richard W.M. Jones  wrote:

 On Fri, Dec 03, 2021 at 04:20:23PM -0300, Eduardo Lima wrote:
 > Hi Rich,
 >
 > Can you confirm if the patch you added for qemu in Fedora has still not
 been
 > merged upstream? I could not find it on the git source tree.
 >
 > +Patch2: 0001-tcg-arm-Reduce-vector-alignment-requirement-for-NEON.patch
 > +From 1331e4eec016a295949009b4360c592401b089f7 Mon Sep 17 00:00:00 2001
 > +From: Richard Henderson 
 > +Date: Sun, 12 Sep 2021 10:49:25 -0700
 > +Subject: [PATCH] tcg/arm: Reduce vector alignment requirement for NEON

 https://bugzilla.redhat.com/show_bug.cgi?id=1999878
 https://lists.nongnu.org/archive/html/qemu-devel/2021-09/msg01028.html

 The patch I posted wasn't correct (or meant to be), it was just a
 workaround.  However I think you're right - I don't believe the
 original problem was ever fixed.

Yes, I saw that your original patch had been replaced by this new
one I mentioned, so I thought it was the correct solution, but I
could not find this new one on the repository as well.


Oh I see, it was indeed replaced by Richard Henderson's patch:

https://src.fedoraproject.org/rpms/qemu/blob/rawhide/f/0001-tcg-arm-Reduce-vector-alignment-requirement-for-NEON.patch


At the moment I kept it as part of 6.2.0 build, which I am just about to push
to rawhide. It builds locally, and I am only waiting for the scratch-build to
finish.


Yes looks like we need to keep it, and get it upstream too.


Whoops.  That dropped through the cracks.
I'll queue that now-ish.


r~



Re: [PATCH] hw/net: npcm7xx_emc fix missing queue_flush

2021-12-03 Thread Patrick Venture
On Fri, Dec 3, 2021 at 1:42 PM Philippe Mathieu-Daudé 
wrote:

> On 12/3/21 22:27, Patrick Venture wrote:
> > The rx_active boolean change to true should always trigger a try_read
> > call that flushes the queue.
> >
> > Signed-off-by: Patrick Venture 
> > ---
> >  hw/net/npcm7xx_emc.c | 10 ++
> >  1 file changed, 2 insertions(+), 8 deletions(-)
> >
> > diff --git a/hw/net/npcm7xx_emc.c b/hw/net/npcm7xx_emc.c
> > index 7c892f820f..97522e6388 100644
> > --- a/hw/net/npcm7xx_emc.c
> > +++ b/hw/net/npcm7xx_emc.c
> > @@ -581,13 +581,6 @@ static ssize_t emc_receive(NetClientState *nc, const uint8_t *buf, size_t len1)
> >  return len;
> >  }
> >
> > -static void emc_try_receive_next_packet(NPCM7xxEMCState *emc)
> > -{
> > -if (emc_can_receive(qemu_get_queue(emc->nic))) {
> > -qemu_flush_queued_packets(qemu_get_queue(emc->nic));
> > -}
> > -}
>
> What about modifying as emc_flush_rx() or emc_receive_and_flush()
> helper instead?
>
>  static void emc_flush_rx(NPCM7xxEMCState *emc)
>  {
>  emc->rx_active = true;
>  qemu_flush_queued_packets(qemu_get_queue(emc->nic));
>  }
>

I'm ok with that idea, although I'm less fond of how it _hides_ the
rx_active=true.  There is an emc_halt_rx that hides rx_active=false, so one
could argue it's not an issue. Looking at ftgmac100, it mostly just calls
the qemu_flush_queued_packets inline where it needs it.  So given the prior
art, I'm more inclined to leave this as a two-line pair, versus collapsing
it into a method.  Mostly because I don't anticipate this call being made
from any other places, so it's not a "growing" device.  The method
originally was emc_try_receive_next_packet, which didn't do anything more
than a no-op check and the queue_flush.  The new method would move the
rx_active setting from the call that deliberately controls it (the register
change) into a subordinate method...

Beyond all that, I think it's fine either way.  Feel free to push back and
I'll make the change.

>
> >  static uint64_t npcm7xx_emc_read(void *opaque, hwaddr offset, unsigned size)
> >  {
> >  NPCM7xxEMCState *emc = opaque;
> > @@ -704,6 +697,7 @@ static void npcm7xx_emc_write(void *opaque, hwaddr offset,
> >  }
> >  if (value & REG_MCMDR_RXON) {
> >  emc->rx_active = true;
> > +qemu_flush_queued_packets(qemu_get_queue(emc->nic));
> >  } else {
> >  emc_halt_rx(emc, 0);
> >  }
> > @@ -740,7 +734,7 @@ static void npcm7xx_emc_write(void *opaque, hwaddr offset,
> >  case REG_RSDR:
> >  if (emc->regs[REG_MCMDR] & REG_MCMDR_RXON) {
> >  emc->rx_active = true;
> > -emc_try_receive_next_packet(emc);
> > +qemu_flush_queued_packets(qemu_get_queue(emc->nic));
> >  }
> >  break;
> >  case REG_MIIDA:
> >
>
>


Re: [PATCH] hw/net: npcm7xx_emc fix missing queue_flush

2021-12-03 Thread Philippe Mathieu-Daudé
On 12/3/21 22:27, Patrick Venture wrote:
> The rx_active boolean change to true should always trigger a try_read
> call that flushes the queue.
> 
> Signed-off-by: Patrick Venture 
> ---
>  hw/net/npcm7xx_emc.c | 10 ++
>  1 file changed, 2 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/net/npcm7xx_emc.c b/hw/net/npcm7xx_emc.c
> index 7c892f820f..97522e6388 100644
> --- a/hw/net/npcm7xx_emc.c
> +++ b/hw/net/npcm7xx_emc.c
> @@ -581,13 +581,6 @@ static ssize_t emc_receive(NetClientState *nc, const uint8_t *buf, size_t len1)
>  return len;
>  }
>  
> -static void emc_try_receive_next_packet(NPCM7xxEMCState *emc)
> -{
> -if (emc_can_receive(qemu_get_queue(emc->nic))) {
> -qemu_flush_queued_packets(qemu_get_queue(emc->nic));
> -}
> -}

What about modifying as emc_flush_rx() or emc_receive_and_flush()
helper instead?

 static void emc_flush_rx(NPCM7xxEMCState *emc)
 {
 emc->rx_active = true;
 qemu_flush_queued_packets(qemu_get_queue(emc->nic));
 }

>  static uint64_t npcm7xx_emc_read(void *opaque, hwaddr offset, unsigned size)
>  {
>  NPCM7xxEMCState *emc = opaque;
> @@ -704,6 +697,7 @@ static void npcm7xx_emc_write(void *opaque, hwaddr offset,
>  }
>  if (value & REG_MCMDR_RXON) {
>  emc->rx_active = true;
> +qemu_flush_queued_packets(qemu_get_queue(emc->nic));
>  } else {
>  emc_halt_rx(emc, 0);
>  }
> @@ -740,7 +734,7 @@ static void npcm7xx_emc_write(void *opaque, hwaddr offset,
>  case REG_RSDR:
>  if (emc->regs[REG_MCMDR] & REG_MCMDR_RXON) {
>  emc->rx_active = true;
> -emc_try_receive_next_packet(emc);
> +qemu_flush_queued_packets(qemu_get_queue(emc->nic));
>  }
>  break;
>  case REG_MIIDA:
> 




[PATCH] hw/net: npcm7xx_emc fix missing queue_flush

2021-12-03 Thread Patrick Venture
Changing the rx_active boolean to true should always trigger a try_read
call that flushes the queue.

Signed-off-by: Patrick Venture 
---
 hw/net/npcm7xx_emc.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/hw/net/npcm7xx_emc.c b/hw/net/npcm7xx_emc.c
index 7c892f820f..97522e6388 100644
--- a/hw/net/npcm7xx_emc.c
+++ b/hw/net/npcm7xx_emc.c
@@ -581,13 +581,6 @@ static ssize_t emc_receive(NetClientState *nc, const uint8_t *buf, size_t len1)
 return len;
 }
 
-static void emc_try_receive_next_packet(NPCM7xxEMCState *emc)
-{
-if (emc_can_receive(qemu_get_queue(emc->nic))) {
-qemu_flush_queued_packets(qemu_get_queue(emc->nic));
-}
-}
-
 static uint64_t npcm7xx_emc_read(void *opaque, hwaddr offset, unsigned size)
 {
 NPCM7xxEMCState *emc = opaque;
@@ -704,6 +697,7 @@ static void npcm7xx_emc_write(void *opaque, hwaddr offset,
 }
 if (value & REG_MCMDR_RXON) {
 emc->rx_active = true;
+qemu_flush_queued_packets(qemu_get_queue(emc->nic));
 } else {
 emc_halt_rx(emc, 0);
 }
@@ -740,7 +734,7 @@ static void npcm7xx_emc_write(void *opaque, hwaddr offset,
 case REG_RSDR:
 if (emc->regs[REG_MCMDR] & REG_MCMDR_RXON) {
 emc->rx_active = true;
-emc_try_receive_next_packet(emc);
+qemu_flush_queued_packets(qemu_get_queue(emc->nic));
 }
 break;
 case REG_MIIDA:
-- 
2.34.1.400.ga245620fadb-goog




Re: [PATCH 15/35] target/ppc: Use FloatRoundMode in do_fri

2021-12-03 Thread Philippe Mathieu-Daudé
On 11/19/21 17:04, Richard Henderson wrote:
> This is the proper type for the enumeration.
> 
> Signed-off-by: Richard Henderson 
> ---
>  target/ppc/fpu_helper.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Philippe Mathieu-Daudé 



Re: [PATCH 14/35] target/ppc: Remove inline from do_fri

2021-12-03 Thread Philippe Mathieu-Daudé
On 11/19/21 17:04, Richard Henderson wrote:
> There's no reason the callers can't tail call to one function.
> Leave it up to the compiler either way.
> 
> Signed-off-by: Richard Henderson 
> ---
>  target/ppc/fpu_helper.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé 



Re: [PATCH 01/35] softfloat: Extend float_exception_flags to 16 bits

2021-12-03 Thread Philippe Mathieu-Daudé
On 11/19/21 17:04, Richard Henderson wrote:
> We will shortly have more than 8 bits of exceptions.
> Repack the existing flags into low bits and reformat to hex.
> 
> Signed-off-by: Richard Henderson 
> ---
>  include/fpu/softfloat-types.h | 16 
>  include/fpu/softfloat.h   |  2 +-
>  2 files changed, 9 insertions(+), 9 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé 




Re: QEMU 6.2.0 and rhbz#1999878

2021-12-03 Thread Richard W.M. Jones
On Fri, Dec 03, 2021 at 05:35:41PM -0300, Eduardo Lima wrote:
> 
> 
> On Fri, Dec 3, 2021 at 4:37 PM Richard W.M. Jones  wrote:
> 
> On Fri, Dec 03, 2021 at 04:20:23PM -0300, Eduardo Lima wrote:
> > Hi Rich,
> >
> > Can you confirm if the patch you added for qemu in Fedora has still not
> been
> > merged upstream? I could not find it on the git source tree.
> >
> > +Patch2: 0001-tcg-arm-Reduce-vector-alignment-requirement-for-NEON.patch
> > +From 1331e4eec016a295949009b4360c592401b089f7 Mon Sep 17 00:00:00 2001
> > +From: Richard Henderson 
> > +Date: Sun, 12 Sep 2021 10:49:25 -0700
> > +Subject: [PATCH] tcg/arm: Reduce vector alignment requirement for NEON
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1999878
> https://lists.nongnu.org/archive/html/qemu-devel/2021-09/msg01028.html
> 
> The patch I posted wasn't correct (or meant to be), it was just a
> workaround.  However I think you're right - I don't believe the
> original problem was ever fixed.
>
> Yes, I saw that your original patch had been replaced by this new
> one I mentioned, so I thought it was the correct solution, but I
> could not find this new one on the repository as well.

Oh I see, it was indeed replaced by Richard Henderson's patch:

https://src.fedoraproject.org/rpms/qemu/blob/rawhide/f/0001-tcg-arm-Reduce-vector-alignment-requirement-for-NEON.patch

> At the moment I kept it as part of 6.2.0 build, which I am just about to push
> to rawhide. It builds locally, and I am only waiting for the scratch-build to
> finish.

Yes looks like we need to keep it, and get it upstream too.

Thanks,

Rich.

> https://koji.fedoraproject.org/koji/taskinfo?taskID=79556515
> 
> Thanks, Eduardo.
> 
>  
> 
> 
> Let's see what upstreams says ...
> 
> Rich.
> 
> --
> Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
> Read my programming and virtualization blog: http://rwmj.wordpress.com
> virt-p2v converts physical machines to virtual machines.  Boot with a
> live CD or over the network (PXE) and turn machines into KVM guests.
> http://libguestfs.org/virt-v2v
> 
> 

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top




[PATCH 06/14] test-bdrv-graph-mod: fix filters to be filters

2021-12-03 Thread Vladimir Sementsov-Ogievskiy
bdrv_pass_through is used as a filter, and all node variables have
corresponding names. We want to append it, so it should be a
backing-child-based filter like mirror_top.
So, in test_update_perm_tree, the first child should be DATA, as we
don't want filters with two filtered children.

bdrv_exclusive_writer is used as a filter once, so it should be a
filter anyway. We want to append it, so it should be a
backing-child-based filter too.

Make all FILTERED children PRIMARY as well. We are going to enforce
this rule with an assertion soon.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block_int.h|  5 +++--
 tests/unit/test-bdrv-graph-mod.c | 24 +---
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 9c06f8816e..919e33de5f 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -121,8 +121,9 @@ struct BlockDriver {
 /*
  * Only make sense for filter drivers, for others must be false.
  * If true, filtered child is bs->backing. Otherwise it's bs->file.
- * Only two internal filters use bs->backing as filtered child and has this
- * field set to true: mirror_top and commit_top.
+ * Two internal filters use bs->backing as the filtered child and have this
+ * field set to true: mirror_top and commit_top. There are also two such test
+ * filters in tests/unit/test-bdrv-graph-mod.c.
  *
  * Never create any more such filters!
  *
diff --git a/tests/unit/test-bdrv-graph-mod.c b/tests/unit/test-bdrv-graph-mod.c
index 40795d3c04..7265971013 100644
--- a/tests/unit/test-bdrv-graph-mod.c
+++ b/tests/unit/test-bdrv-graph-mod.c
@@ -26,6 +26,8 @@
 
 static BlockDriver bdrv_pass_through = {
 .format_name = "pass-through",
+.is_filter = true,
+.filtered_child_is_backing = true,
 .bdrv_child_perm = bdrv_default_perms,
 };
 
@@ -57,6 +59,8 @@ static void exclusive_write_perms(BlockDriverState *bs, BdrvChild *c,
 
 static BlockDriver bdrv_exclusive_writer = {
 .format_name = "exclusive-writer",
+.is_filter = true,
+.filtered_child_is_backing = true,
 .bdrv_child_perm = exclusive_write_perms,
 };
 
@@ -134,7 +138,7 @@ static void test_update_perm_tree(void)
 blk_insert_bs(root, bs, &error_abort);
 
 bdrv_attach_child(filter, bs, "child", &child_of_bds,
-  BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY, &error_abort);
+  BDRV_CHILD_DATA, &error_abort);
 
 ret = bdrv_append(filter, bs, NULL);
 g_assert_cmpint(ret, <, 0);
@@ -228,11 +232,14 @@ static void test_parallel_exclusive_write(void)
  */
 bdrv_ref(base);
 
-bdrv_attach_child(top, fl1, "backing", &child_of_bds, BDRV_CHILD_DATA,
+bdrv_attach_child(top, fl1, "backing", &child_of_bds,
+  BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
   &error_abort);
-bdrv_attach_child(fl1, base, "backing", &child_of_bds, BDRV_CHILD_FILTERED,
+bdrv_attach_child(fl1, base, "backing", &child_of_bds,
+  BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
   &error_abort);
-bdrv_attach_child(fl2, base, "backing", &child_of_bds, BDRV_CHILD_FILTERED,
+bdrv_attach_child(fl2, base, "backing", &child_of_bds,
+  BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
   &error_abort);
 
 bdrv_replace_node(fl1, fl2, &error_abort);
@@ -344,9 +351,11 @@ static void test_parallel_perm_update(void)
   BDRV_CHILD_DATA, &error_abort);
 c_fl2 = bdrv_attach_child(ws, fl2, "second", &child_of_bds,
   BDRV_CHILD_DATA, &error_abort);
-bdrv_attach_child(fl1, base, "backing", &child_of_bds, BDRV_CHILD_FILTERED,
+bdrv_attach_child(fl1, base, "backing", &child_of_bds,
+  BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
   &error_abort);
-bdrv_attach_child(fl2, base, "backing", &child_of_bds, BDRV_CHILD_FILTERED,
+bdrv_attach_child(fl2, base, "backing", &child_of_bds,
+  BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
   &error_abort);
 
 /* Select fl1 as first child to be active */
@@ -397,7 +406,8 @@ static void test_append_greedy_filter(void)
 BlockDriverState *base = no_perm_node("base");
 BlockDriverState *fl = exclusive_writer_node("fl1");
 
-bdrv_attach_child(top, base, "backing", &child_of_bds, BDRV_CHILD_COW,
+bdrv_attach_child(top, base, "backing", &child_of_bds,
+  BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
   &error_abort);
 
 bdrv_append(fl, base, &error_abort);
-- 
2.31.1




[PATCH 13/14] block: Manipulate bs->file / bs->backing pointers in .attach/.detach

2021-12-03 Thread Vladimir Sementsov-Ogievskiy
bs->file and bs->backing are a kind of duplication of part of
bs->children. But it is a very useful duplication, so let's not drop
them altogether :)

We should manage bs->file and bs->backing in the same place where we
manage bs->children, to keep them in sync.

Moreover, generic I/O paths are unprepared for a BdrvChild without a
bs, so it's doubly good to clear bs->file / bs->backing when we detach
the child.

Detach is simple: if we detach the bs->file or bs->backing child, just
set the corresponding field to NULL.

Attach is a bit more complicated. But we can still precisely detect
whether we should set one of bs->file / bs->backing or not:

- if the role is BDRV_CHILD_COW, we definitely deal with bs->backing
- else, if the role is BDRV_CHILD_FILTERED (it must then also be
  BDRV_CHILD_PRIMARY), it's a filtered child. Use
  bs->drv->filtered_child_is_backing to choose the pointer field to
  modify.
- else, if the role is BDRV_CHILD_PRIMARY, we deal with bs->file
- in all other cases, it's neither bs->backing nor bs->file. It's some
  other child and we shouldn't care.

OK. This change brings one more good thing: we can (and should) get rid
of all indirect pointers in the block-graph-change transactions:

bdrv_attach_child_common() stores a BdrvChild** into the transaction to
clear it on abort.

bdrv_attach_child_common() has two callers: bdrv_attach_child_noperm()
just passes this feature through, and bdrv_root_attach_child() doesn't
need the feature.

Look at the bdrv_attach_child_noperm() callers:
  - bdrv_attach_child() doesn't need the feature
  - bdrv_set_file_or_backing_noperm() uses the feature to manage
bs->file and bs->backing; we don't want that anymore
  - bdrv_append() uses the feature to manage bs->backing; again, we
don't want that anymore

So, we should drop this stuff! Great!

We still keep the BdrvChild** argument to return the child and the int
return value, and don't move to simply returning BdrvChild*, as we
don't want to lose the int return values.

However, we don't require *@child to be NULL anymore, and we even allow
@child to be NULL if the caller doesn't need the new child pointer.

Finally, we now set .file / .backing automatically in generic code and
want to restrict setting them by hand outside of .attach/.detach.
So, this patch cleans up all remaining places where they were set.
To find such places I used:

  git grep '\->file ='
  git grep '\->backing ='
  git grep '&.*\'
  git grep '&.*\'

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block_int.h|  15 +++-
 block.c  | 155 ---
 block/raw-format.c   |   4 +-
 block/snapshot.c |   1 -
 tests/unit/test-bdrv-drain.c |  10 +--
 5 files changed, 88 insertions(+), 97 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 919e33de5f..4ea800e589 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -937,9 +937,6 @@ struct BlockDriverState {
 QDict *full_open_options;
 char exact_filename[PATH_MAX];
 
-BdrvChild *backing;
-BdrvChild *file;
-
 /* I/O Limits */
 BlockLimits bl;
 
@@ -992,7 +989,19 @@ struct BlockDriverState {
  * which can affect this node by changing these defaults). This is always a
  * parent node of this node. */
 BlockDriverState *inherits_from;
+
+/*
+ * @backing and @file each point to one of @children, or are NULL. All three
+ * fields (@file, @backing and @children) are modified only in
+ * bdrv_child_cb_attach() and bdrv_child_cb_detach().
+ *
+ * See also comment in include/block/block.h, to learn how backing and file
+ * are connected with BdrvChildRole.
+ */
 QLIST_HEAD(, BdrvChild) children;
+BdrvChild *backing;
+BdrvChild *file;
+
 QLIST_HEAD(, BdrvChild) parents;
 
 QDict *options;
diff --git a/block.c b/block.c
index d57d7a80ab..0c6bbc9b0b 100644
--- a/block.c
+++ b/block.c
@@ -1388,9 +1388,33 @@ static void bdrv_child_cb_attach(BdrvChild *child)
 BlockDriverState *bs = child->opaque;
 
 QLIST_INSERT_HEAD(&bs->children, child, next);
+if (bs->drv->is_filter | (child->role & BDRV_CHILD_FILTERED)) {
+/*
+ * Here we handle filters, and block/raw-format.c when it behaves like a
+ * filter.
+ */
+assert(!(child->role & BDRV_CHILD_COW));
+if (child->role & (BDRV_CHILD_PRIMARY | BDRV_CHILD_FILTERED)) {
+assert(child->role & BDRV_CHILD_PRIMARY);
+assert(child->role & BDRV_CHILD_FILTERED);
+assert(!bs->backing);
+assert(!bs->file);
 
-if (child->role & BDRV_CHILD_COW) {
+if (bs->drv->filtered_child_is_backing) {
+bs->backing = child;
+} else {
+bs->file = child;
+}
+}
+} else if (child->role & BDRV_CHILD_COW) {
+assert(bs->drv->supports_backing);
+assert(!(child->role & BDRV_CHILD_PRIMARY));
+assert(!bs->backing);
+bs->backing = child;
 bdrv_backing_attac

[PATCH 08/14] block/snapshot: stress that we fallback to primary child

2021-12-03 Thread Vladimir Sementsov-Ogievskiy
Actually what we choose is the primary child. Let's stress that in the
code.

We are going to drop the indirect pointer logic here in the future.
This commit actually simplifies that future work: we drop the use of
indirection in the assertion now.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/snapshot.c | 30 ++
 1 file changed, 10 insertions(+), 20 deletions(-)

diff --git a/block/snapshot.c b/block/snapshot.c
index ccacda8bd5..12fa0e3904 100644
--- a/block/snapshot.c
+++ b/block/snapshot.c
@@ -158,21 +158,14 @@ bool bdrv_snapshot_find_by_id_and_name(BlockDriverState *bs,
 static BdrvChild **bdrv_snapshot_fallback_ptr(BlockDriverState *bs)
 {
 BdrvChild **fallback;
-BdrvChild *child;
+BdrvChild *child = bdrv_primary_child(bs);
 
-/*
- * The only BdrvChild pointers that are safe to modify (and which
- * we can thus return a reference to) are bs->file and
- * bs->backing.
- */
-fallback = &bs->file;
-if (!*fallback && bs->drv && bs->drv->is_filter) {
-fallback = &bs->backing;
-}
-
-if (!*fallback) {
+/* We allow fallback only to primary child */
+if (!child) {
 return NULL;
 }
+fallback = (child == bs->file ? &bs->file : &bs->backing);
+assert(*fallback == child);
 
 /*
  * Check that there are no other children that would need to be
@@ -300,15 +293,12 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
 }
 
 /*
- * fallback_ptr is &bs->file or &bs->backing.  *fallback_ptr
- * was closed above and set to NULL, but the .bdrv_open() call
- * has opened it again, because we set the respective option
- * (with the qdict_put_str() call above).
- * Assert that .bdrv_open() has attached some child on
- * *fallback_ptr, and that it has attached the one we wanted
- * it to (i.e., fallback_bs).
+ * fallback was a primary child. It was closed above and set to NULL,
+ * but the .bdrv_open() call has opened it again, because we set the
+ * respective option (with the qdict_put_str() call above).
+ * Assert that .bdrv_open() has attached some BDS as primary child.
  */
-assert(*fallback_ptr && fallback_bs == (*fallback_ptr)->bs);
+assert(bdrv_primary_bs(bs) == fallback_bs);
 bdrv_unref(fallback_bs);
 return ret;
 }
-- 
2.31.1




[PATCH 07/14] block: document connection between child roles and bs->backing/bs->file

2021-12-03 Thread Vladimir Sementsov-Ogievskiy
Make the informal rules formal. In a further commit we'll add
corresponding assertions.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block.h | 42 ++
 1 file changed, 42 insertions(+)

diff --git a/include/block/block.h b/include/block/block.h
index f885f113ef..8a3278d4b6 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -290,6 +290,48 @@ enum {
  *
  * At least one of DATA, METADATA, FILTERED, or COW must be set for
  * every child.
+ *
+ *
+ * = Connection with bs->children, bs->file and bs->backing fields =
+ *
+ * 1. Filters
+ *
+ * Filter drivers have drv->is_filter = true.
+ *
+ * A filter driver has exactly one FILTERED|PRIMARY child, and may have other
+ * children which must not have these bits (an example is the copy-before-write
+ * filter, which also has a DATA target child).
+ *
+ * A filter driver never has COW children.
+ *
+ * For all filters except mirror_top and commit_top, the filtered child is
+ * linked in bs->file and bs->backing is NULL.
+ *
+ * For mirror_top and commit_top, the filtered child is linked in bs->backing
+ * and their bs->file is NULL. These two filters have
+ * drv->filtered_child_is_backing = true.
+ *
+ * 2. "raw" driver (block/raw-format.c)
+ *
+ * Formally, it's not a filter (drv->is_filter = false).
+ *
+ * bs->backing is always NULL.
+ *
+ * It has only one child, linked in bs->file. Its role is either
+ * FILTERED|PRIMARY (like a filter) or DATA|PRIMARY, depending on options.
+ *
+ * 3. Other drivers
+ *
+ * These don't have any FILTERED children.
+ *
+ * May have at most one COW child; in this case it's linked in bs->backing,
+ * otherwise bs->backing is NULL. A COW child is never PRIMARY.
+ *
+ * May have at most one PRIMARY child. In this case it's linked in bs->file.
+ * Otherwise bs->file is NULL.
+ *
+ * May also have some other children that have neither the PRIMARY nor the
+ * COW bit set.
  */
 enum BdrvChildRoleBits {
 /*
-- 
2.31.1




[PATCH 02/14] block: introduce bdrv_open_file_child() helper

2021-12-03 Thread Vladimir Sementsov-Ogievskiy
Almost all drivers call bdrv_open_child() similarly. Let's create a
helper for this.

The only driver not updated, which still calls bdrv_open_child() to set
bs->file, is raw-format: it sometimes wants to have a filtered child
but doesn't set drv->is_filter to true.

Possibly we should implement a drv->is_filter_func() handler, to
consider raw-format a filter when it works as one. But that's another
story.

Note also that we reduce the number of assignments to bs->file in the
code: this helps us restrict modification of this field in a further
commit.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block.h |  3 +++
 block.c   | 21 +
 block/blkdebug.c  |  9 +++--
 block/blklogwrites.c  |  7 ++-
 block/blkreplay.c |  7 ++-
 block/blkverify.c |  9 +++--
 block/bochs.c |  7 +++
 block/cloop.c |  7 +++
 block/copy-before-write.c |  9 -
 block/copy-on-read.c  |  9 -
 block/crypto.c| 11 ++-
 block/dmg.c   |  7 +++
 block/filter-compress.c   |  6 ++
 block/parallels.c |  7 +++
 block/preallocate.c   |  9 -
 block/qcow.c  |  6 ++
 block/qcow2.c |  8 
 block/qed.c   |  8 
 block/replication.c   |  8 +++-
 block/throttle.c  |  8 +++-
 block/vdi.c   |  7 +++
 block/vhdx.c  |  7 +++
 block/vmdk.c  |  7 +++
 block/vpc.c   |  7 +++
 24 files changed, 94 insertions(+), 100 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index e5dd22b034..f885f113ef 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -376,6 +376,9 @@ BdrvChild *bdrv_open_child(const char *filename,
const BdrvChildClass *child_class,
BdrvChildRole child_role,
bool allow_none, Error **errp);
+int bdrv_open_file_child(const char *filename,
+ QDict *options, const char *bdref_key,
+ BlockDriverState *parent, Error **errp);
 BlockDriverState *bdrv_open_blockdev_ref(BlockdevRef *ref, Error **errp);
 int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
 Error **errp);
diff --git a/block.c b/block.c
index 0ac5b163d2..a97720c5eb 100644
--- a/block.c
+++ b/block.c
@@ -3546,6 +3546,27 @@ BdrvChild *bdrv_open_child(const char *filename,
  errp);
 }
 
+/*
+ * Wrapper on bdrv_open_child() for the most popular case: open the primary
+ * child of bs.
+ */
+int bdrv_open_file_child(const char *filename,
+ QDict *options, const char *bdref_key,
+ BlockDriverState *parent, Error **errp)
+{
+BdrvChildRole role;
+
+/* commit_top and mirror_top don't use this function */
+assert(!parent->drv->filtered_child_is_backing);
+
+role = parent->drv->is_filter ?
+(BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY) : BDRV_CHILD_IMAGE;
+
+parent->file = bdrv_open_child(filename, options, bdref_key, parent,
+   &child_of_bds, role, false, errp);
+
+return parent->file ? 0 : -EINVAL;
+}
+
 /*
  * TODO Future callers may need to specify parent/child_class in order for
  * option inheritance to work. Existing callers use it for the root node.
diff --git a/block/blkdebug.c b/block/blkdebug.c
index bbf2948703..5fcfc8ac6f 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -503,12 +503,9 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
 }
 
 /* Open the image file */
-bs->file = bdrv_open_child(qemu_opt_get(opts, "x-image"), options, "image",
-   bs, &child_of_bds,
-   BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
-   false, errp);
-if (!bs->file) {
-ret = -EINVAL;
+ret = bdrv_open_file_child(qemu_opt_get(opts, "x-image"), options, "image",
+   bs, errp);
+if (ret < 0) {
 goto out;
 }
 
diff --git a/block/blklogwrites.c b/block/blklogwrites.c
index f7a251e91f..f66a617eb3 100644
--- a/block/blklogwrites.c
+++ b/block/blklogwrites.c
@@ -155,11 +155,8 @@ static int blk_log_writes_open(BlockDriverState *bs, QDict *options, int flags,
 }
 
 /* Open the file */
-bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds,
-   BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY, false,
-   errp);
-if (!bs->file) {
-ret = -EINVAL;
+ret = bdrv_open_file_child(NULL, options, "file", bs, errp);
+if (ret < 0) {
 goto fail;
 }
 
diff --git a/block/blkreplay.c b/block/blkreplay.c
index dcbe780ddb..76a0b8d12a 100644
--- a/block/blkreplay.c
+++ b/block/blkreplay.c
@@ -26,11 +26,8 @@ sta

[PATCH 03/14] block/blklogwrites: don't care to remove bs->file child on failure

2021-12-03 Thread Vladimir Sementsov-Ogievskiy
We don't need to remove bs->file; the generic layer takes care of it.
No other driver bothers to remove bs->file by hand on failure.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/blklogwrites.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/block/blklogwrites.c b/block/blklogwrites.c
index f66a617eb3..7d25df97cc 100644
--- a/block/blklogwrites.c
+++ b/block/blklogwrites.c
@@ -254,10 +254,6 @@ fail_log:
 s->log_file = NULL;
 }
 fail:
-if (ret < 0) {
-bdrv_unref_child(bs, bs->file);
-bs->file = NULL;
-}
 qemu_opts_del(opts);
 return ret;
 }
-- 
2.31.1




[PATCH 04/14] test-bdrv-graph-mod: update test_parallel_perm_update test case

2021-12-03 Thread Vladimir Sementsov-Ogievskiy
test_parallel_perm_update() does two things that we are going to
restrict in the near future:

1. It updates the bs->file field by hand. bs->file will be managed
   automatically by generic code (together with the bs->children list).

   Let's instead refactor our "tricky" bds to have its own state where
   one of the children is linked as "selected".
   This also looks less "tricky", so avoid using that word.

2. It creates FILTERED children that are not PRIMARY. Except in tests,
   all FILTERED children in the QEMU block layer are PRIMARY as well.
   We are going to formalize this rule, so let's use DATA children
   here instead.

While here, update the picture to better correspond to the test code.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/unit/test-bdrv-graph-mod.c | 70 
 1 file changed, 44 insertions(+), 26 deletions(-)

diff --git a/tests/unit/test-bdrv-graph-mod.c b/tests/unit/test-bdrv-graph-mod.c
index a6e3bb79be..40795d3c04 100644
--- a/tests/unit/test-bdrv-graph-mod.c
+++ b/tests/unit/test-bdrv-graph-mod.c
@@ -241,13 +241,26 @@ static void test_parallel_exclusive_write(void)
 bdrv_unref(top);
 }
 
-static void write_to_file_perms(BlockDriverState *bs, BdrvChild *c,
- BdrvChildRole role,
- BlockReopenQueue *reopen_queue,
- uint64_t perm, uint64_t shared,
- uint64_t *nperm, uint64_t *nshared)
+/*
+ * The write-to-selected node may have several DATA children, one of which may
+ * be "selected". Exclusive write permission is taken on the selected child.
+ *
+ * We don't implement the write handler itself, as we only need to test how
+ * permission update works.
+ */
+typedef struct BDRVWriteToSelectedState {
+BdrvChild *selected;
+} BDRVWriteToSelectedState;
+
+static void write_to_selected_perms(BlockDriverState *bs, BdrvChild *c,
+BdrvChildRole role,
+BlockReopenQueue *reopen_queue,
+uint64_t perm, uint64_t shared,
+uint64_t *nperm, uint64_t *nshared)
 {
-if (bs->file && c == bs->file) {
+BDRVWriteToSelectedState *s = bs->opaque;
+
+if (s->selected && c == s->selected) {
 *nperm = BLK_PERM_WRITE;
 *nshared = BLK_PERM_ALL & ~BLK_PERM_WRITE;
 } else {
@@ -256,9 +269,10 @@ static void write_to_file_perms(BlockDriverState *bs, BdrvChild *c,
 }
 }
 
-static BlockDriver bdrv_write_to_file = {
-.format_name = "tricky-perm",
-.bdrv_child_perm = write_to_file_perms,
+static BlockDriver bdrv_write_to_selected = {
+.format_name = "write-to-selected",
+.instance_size = sizeof(BDRVWriteToSelectedState),
+.bdrv_child_perm = write_to_selected_perms,
 };
 
 
@@ -266,15 +280,18 @@ static BlockDriver bdrv_write_to_file = {
  * The following test shows that topological-sort order is required for
  * permission update, simple DFS is not enough.
  *
- * Consider the block driver which has two filter children: one active
- * with exclusive write access and one inactive with no specific
- * permissions.
+ * Consider the block driver (write-to-selected) which has two children: one is
+ * selected, so we have exclusive write access to it, and for the other one we
+ * don't need any specific permissions.
  *
  * And, these two children has a common base child, like this:
+ *   (an additional "top" node is used in the test just because the only public
+ *function to update permissions takes a specific child to update.
+ *Making bdrv_refresh_perms() public just for this test isn't worth it)
  *
- * ┌─┐ ┌──┐
- * │ fl2 │ ◀── │ top  │
- * └─┘ └──┘
+ * ┌─┐ ┌───┐ ┌─┐
+ * │ fl2 │ ◀── │ write-to-selected │ ◀── │ top │
+ * └─┘ └───┘ └─┘
  *   │   │
  *   │   │ w
  *   │   ▼
@@ -290,7 +307,7 @@ static BlockDriver bdrv_write_to_file = {
  *
  * So, exclusive write is propagated.
  *
- * Assume, we want to make fl2 active instead of fl1.
+ * Assume we want to select fl2 instead of fl1.
  * So, we set some option for top driver and do permission update.
  *
  * With simple DFS, if permission update goes first through
@@ -306,9 +323,10 @@ static BlockDriver bdrv_write_to_file = {
 static void test_parallel_perm_update(void)
 {
 BlockDriverState *top = no_perm_node("top");
-BlockDriverState *tricky =
-bdrv_new_open_driver(&bdrv_write_to_file, "tricky", BDRV_O_RDWR,
+BlockDriverState *ws =
+bdrv_new_open_driver(&bdrv_write_to_selected, "ws", BDRV_O_RDWR,
  &error_abort);
+BDRVWriteToSelectedState *s = ws->opaque;
 BlockDriverState *base = no_perm_node("base");
 BlockDriverState *fl1 = pass_through_node("fl1");
 BlockDriverState *fl2 = pass_through_node("fl2");
@@ -

[PATCH 14/14] block/snapshot: drop indirection around bdrv_snapshot_fallback_ptr

2021-12-03 Thread Vladimir Sementsov-Ogievskiy
Now that the indirection is not actually used, we can safely reduce it
to a simple pointer.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/snapshot.c | 39 +--
 1 file changed, 17 insertions(+), 22 deletions(-)

diff --git a/block/snapshot.c b/block/snapshot.c
index cb184d70b4..e32f9cb2ad 100644
--- a/block/snapshot.c
+++ b/block/snapshot.c
@@ -148,34 +148,29 @@ bool bdrv_snapshot_find_by_id_and_name(BlockDriverState *bs,
 }
 
 /**
- * Return a pointer to the child BDS pointer to which we can fall
+ * Return the child of the given BDS to which we can fall
  * back if the given BDS does not support snapshots.
  * Return NULL if there is no BDS to (safely) fall back to.
- *
- * We need to return an indirect pointer because bdrv_snapshot_goto()
- * has to modify the BdrvChild pointer.
  */
-static BdrvChild **bdrv_snapshot_fallback_ptr(BlockDriverState *bs)
+static BdrvChild *bdrv_snapshot_fallback_ptr(BlockDriverState *bs)
 {
-BdrvChild **fallback;
-BdrvChild *child = bdrv_primary_child(bs);
+BdrvChild *fallback = bdrv_primary_child(bs);
+BdrvChild *child;
 
 /* We allow fallback only to primary child */
-if (!child) {
+if (!fallback) {
 return NULL;
 }
-fallback = (child == bs->file ? &bs->file : &bs->backing);
-assert(*fallback == child);
 
 /*
  * Check that there are no other children that would need to be
  * snapshotted.  If there are, it is not safe to fall back to
- * *fallback.
+ * fallback.
  */
 QLIST_FOREACH(child, &bs->children, next) {
 if (child->role & (BDRV_CHILD_DATA | BDRV_CHILD_METADATA |
BDRV_CHILD_FILTERED) &&
-child != *fallback)
+child != fallback)
 {
 return NULL;
 }
@@ -186,8 +181,8 @@ static BdrvChild **bdrv_snapshot_fallback_ptr(BlockDriverState *bs)
 
 static BlockDriverState *bdrv_snapshot_fallback(BlockDriverState *bs)
 {
-BdrvChild **child_ptr = bdrv_snapshot_fallback_ptr(bs);
-return child_ptr ? (*child_ptr)->bs : NULL;
+BdrvChild *child_ptr = bdrv_snapshot_fallback_ptr(bs);
+return child_ptr ? child_ptr->bs : NULL;
 }
 
 int bdrv_can_snapshot(BlockDriverState *bs)
@@ -230,7 +225,7 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
Error **errp)
 {
 BlockDriver *drv = bs->drv;
-BdrvChild **fallback_ptr;
+BdrvChild *fallback;
 int ret, open_ret;
 
 if (!drv) {
@@ -251,13 +246,13 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
 return ret;
 }
 
-fallback_ptr = bdrv_snapshot_fallback_ptr(bs);
-if (fallback_ptr) {
+fallback = bdrv_snapshot_fallback_ptr(bs);
+if (fallback) {
 QDict *options;
 QDict *file_options;
 Error *local_err = NULL;
-BlockDriverState *fallback_bs = (*fallback_ptr)->bs;
-char *subqdict_prefix = g_strdup_printf("%s.", (*fallback_ptr)->name);
+BlockDriverState *fallback_bs = fallback->bs;
+char *subqdict_prefix = g_strdup_printf("%s.", fallback->name);
 
 options = qdict_clone_shallow(bs->options);
 
@@ -268,8 +263,8 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
 qobject_unref(file_options);
 g_free(subqdict_prefix);
 
-/* Force .bdrv_open() below to re-attach fallback_bs on *fallback_ptr */
-qdict_put_str(options, (*fallback_ptr)->name,
+/* Force .bdrv_open() below to re-attach fallback_bs on fallback */
+qdict_put_str(options, fallback->name,
   bdrv_get_node_name(fallback_bs));
 
 /* Now close bs, apply the snapshot on fallback_bs, and re-open bs */
@@ -278,7 +273,7 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
 }
 
 /* .bdrv_open() will re-attach it */
-bdrv_unref_child(bs, *fallback_ptr);
+bdrv_unref_child(bs, fallback);
 
 ret = bdrv_snapshot_goto(fallback_bs, snapshot_id, errp);
 open_ret = drv->bdrv_open(bs, options, bs->open_flags, &local_err);
-- 
2.31.1




[PATCH 12/14] Revert "block: Pass BdrvChild ** to replace_child_noperm"

2021-12-03 Thread Vladimir Sementsov-Ogievskiy
That commit was a preparation for the previously reverted
"block: Let replace_child_noperm free children". Drop it too; we don't
need it for the new approach.

This reverts commit be64bbb0149748f3999c49b13976aafb8330ea86.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block.c | 23 +++
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/block.c b/block.c
index 2ba95f71b9..d57d7a80ab 100644
--- a/block.c
+++ b/block.c
@@ -87,7 +87,7 @@ static BlockDriverState *bdrv_open_inherit(const char *filename,
 static bool bdrv_recurse_has_child(BlockDriverState *bs,
BlockDriverState *child);
 
-static void bdrv_replace_child_noperm(BdrvChild **child,
+static void bdrv_replace_child_noperm(BdrvChild *child,
   BlockDriverState *new_bs);
 static void bdrv_remove_file_or_backing_child(BlockDriverState *bs,
   BdrvChild *child,
@@ -2270,7 +2270,7 @@ static void bdrv_replace_child_abort(void *opaque)
 BlockDriverState *new_bs = s->child->bs;
 
 /* old_bs reference is transparently moved from @s to @s->child */
-bdrv_replace_child_noperm(&s->child, s->old_bs);
+bdrv_replace_child_noperm(s->child, s->old_bs);
 bdrv_unref(new_bs);
 }
 
@@ -2300,7 +2300,7 @@ static void bdrv_replace_child_tran(BdrvChild *child, BlockDriverState *new_bs,
 if (new_bs) {
 bdrv_ref(new_bs);
 }
-bdrv_replace_child_noperm(&child, new_bs);
+bdrv_replace_child_noperm(child, new_bs);
 /* old_bs reference is transparently moved from @child to @s */
 }
 
@@ -2672,10 +2672,9 @@ uint64_t bdrv_qapi_perm_to_blk_perm(BlockPermission qapi_perm)
 return permissions[qapi_perm];
 }
 
-static void bdrv_replace_child_noperm(BdrvChild **childp,
+static void bdrv_replace_child_noperm(BdrvChild *child,
   BlockDriverState *new_bs)
 {
-BdrvChild *child = *childp;
 BlockDriverState *old_bs = child->bs;
 int new_bs_quiesce_counter;
 int drain_saldo;
@@ -2768,7 +2767,7 @@ static void bdrv_attach_child_common_abort(void *opaque)
 BdrvChild *child = *s->child;
 BlockDriverState *bs = child->bs;
 
-bdrv_replace_child_noperm(s->child, NULL);
+bdrv_replace_child_noperm(child, NULL);
 
 if (bdrv_get_aio_context(bs) != s->old_child_ctx) {
 bdrv_try_set_aio_context(bs, s->old_child_ctx, &error_abort);
@@ -2868,7 +2867,7 @@ static int bdrv_attach_child_common(BlockDriverState *child_bs,
 }
 
 bdrv_ref(child_bs);
-bdrv_replace_child_noperm(&new_child, child_bs);
+bdrv_replace_child_noperm(new_child, child_bs);
 
 *child = new_child;
 
@@ -2923,12 +2922,12 @@ static int bdrv_attach_child_noperm(BlockDriverState *parent_bs,
 return 0;
 }
 
-static void bdrv_detach_child(BdrvChild **childp)
+static void bdrv_detach_child(BdrvChild *child)
 {
-BlockDriverState *old_bs = (*childp)->bs;
+BlockDriverState *old_bs = child->bs;
 
-bdrv_replace_child_noperm(childp, NULL);
-bdrv_child_free(*childp);
+bdrv_replace_child_noperm(child, NULL);
+bdrv_child_free(child);
 
 if (old_bs) {
 /*
@@ -3034,7 +3033,7 @@ void bdrv_root_unref_child(BdrvChild *child)
 BlockDriverState *child_bs;
 
 child_bs = child->bs;
-bdrv_detach_child(&child);
+bdrv_detach_child(child);
 bdrv_unref(child_bs);
 }
 
-- 
2.31.1




[PATCH 09/14] Revert "block: Let replace_child_noperm free children"

2021-12-03 Thread Vladimir Sementsov-Ogievskiy
We are going to reimplement this behavior (clear bs->file / bs->backing
pointers automatically when child->bs is cleared) in a nicer way.

This reverts commit b0a9f6fed3d80de610dcd04a7e66f9f30a04174f.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block.c | 102 +---
 1 file changed, 23 insertions(+), 79 deletions(-)

diff --git a/block.c b/block.c
index a97720c5eb..69c20c729a 100644
--- a/block.c
+++ b/block.c
@@ -87,10 +87,8 @@ static BlockDriverState *bdrv_open_inherit(const char *filename,
 static bool bdrv_recurse_has_child(BlockDriverState *bs,
BlockDriverState *child);
 
-static void bdrv_child_free(BdrvChild *child);
 static void bdrv_replace_child_noperm(BdrvChild **child,
-  BlockDriverState *new_bs,
-  bool free_empty_child);
+  BlockDriverState *new_bs);
 static void bdrv_remove_file_or_backing_child(BlockDriverState *bs,
   BdrvChild *child,
   Transaction *tran);
@@ -2258,16 +2256,12 @@ typedef struct BdrvReplaceChildState {
 BdrvChild *child;
 BdrvChild **childp;
 BlockDriverState *old_bs;
-bool free_empty_child;
 } BdrvReplaceChildState;
 
 static void bdrv_replace_child_commit(void *opaque)
 {
 BdrvReplaceChildState *s = opaque;
 
-if (s->free_empty_child && !s->child->bs) {
-bdrv_child_free(s->child);
-}
 bdrv_unref(s->old_bs);
 }
 
@@ -2284,26 +2278,22 @@ static void bdrv_replace_child_abort(void *opaque)
  * modify the BdrvChild * pointer we indirectly pass to it, i.e. it
  * will not modify s->child.  From that perspective, it does not matter
  * whether we pass s->childp or &s->child.
+ * (TODO: Right now, bdrv_replace_child_noperm() never modifies that
+ * pointer anyway (though it will in the future), so at this point it
+ * absolutely does not matter whether we pass s->childp or &s->child.)
  * (2) If new_bs is not NULL, s->childp will be NULL.  We then cannot use
  * it here.
  * (3) If new_bs is NULL, *s->childp will have been NULLed by
  * bdrv_replace_child_tran()'s bdrv_replace_child_noperm() call, and we
  * must not pass a NULL *s->childp here.
+ * (TODO: In its current state, bdrv_replace_child_noperm() will not
+ * have NULLed *s->childp, so this does not apply yet.  It will in the
+ * future.)
  *
  * So whether new_bs was NULL or not, we cannot pass s->childp here; and in
  * any case, there is no reason to pass it anyway.
  */
-bdrv_replace_child_noperm(&s->child, s->old_bs, true);
-/*
- * The child was pre-existing, so s->old_bs must be non-NULL, and
- * s->child thus must not have been freed
- */
-assert(s->child != NULL);
-if (!new_bs) {
-/* As described above, *s->childp was cleared, so restore it */
-assert(s->childp != NULL);
-*s->childp = s->child;
-}
+bdrv_replace_child_noperm(&s->child, s->old_bs);
 bdrv_unref(new_bs);
 }
 
@@ -2320,44 +2310,30 @@ static TransactionActionDrv bdrv_replace_child_drv = {
  *
  * The function doesn't update permissions, caller is responsible for this.
  *
- * (*childp)->bs must not be NULL.
- *
  * Note that if new_bs == NULL, @childp is stored in a state object attached
  * to @tran, so that the old child can be reinstated in the abort handler.
  * Therefore, if @new_bs can be NULL, @childp must stay valid until the
  * transaction is committed or aborted.
  *
- * If @free_empty_child is true and @new_bs is NULL, the BdrvChild is
- * freed (on commit).  @free_empty_child should only be false if the
- * caller will free the BDrvChild themselves (which may be important
- * if this is in turn called in another transactional context).
+ * (TODO: The reinstating does not happen yet, but it will once
+ * bdrv_replace_child_noperm() NULLs *childp when new_bs is NULL.)
  */
 static void bdrv_replace_child_tran(BdrvChild **childp,
 BlockDriverState *new_bs,
-Transaction *tran,
-bool free_empty_child)
+Transaction *tran)
 {
 BdrvReplaceChildState *s = g_new(BdrvReplaceChildState, 1);
 *s = (BdrvReplaceChildState) {
 .child = *childp,
 .childp = new_bs == NULL ? childp : NULL,
 .old_bs = (*childp)->bs,
-.free_empty_child = free_empty_child,
 };
 tran_add(tran, &bdrv_replace_child_drv, s);
 
-/* The abort handler relies on this */
-assert(s->old_bs != NULL);
-
 if (new_bs) {
 bdrv_ref(new_bs);
 }
-/*
- * Pass free_empty_child=false, we will free the child (if
- * necessary) in bdrv_replace_child_commit() (if our
-   

[PATCH 10/14] Revert "block: Let replace_child_tran keep indirect pointer"

2021-12-03 Thread Vladimir Sementsov-Ogievskiy
This was a preparation for the previously reverted
"block: Let replace_child_noperm free children". Drop it too; we don't
need it for the new approach.

This reverts commit 82b54cf51656bf3cd5ed1ac549e8a1085a0e3290.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block.c | 83 +++--
 1 file changed, 10 insertions(+), 73 deletions(-)

diff --git a/block.c b/block.c
index 69c20c729a..3eec53796b 100644
--- a/block.c
+++ b/block.c
@@ -2254,7 +2254,6 @@ static int bdrv_drv_set_perm(BlockDriverState *bs, 
uint64_t perm,
 
 typedef struct BdrvReplaceChildState {
 BdrvChild *child;
-BdrvChild **childp;
 BlockDriverState *old_bs;
 } BdrvReplaceChildState;
 
@@ -2270,29 +2269,7 @@ static void bdrv_replace_child_abort(void *opaque)
 BdrvReplaceChildState *s = opaque;
 BlockDriverState *new_bs = s->child->bs;
 
-/*
- * old_bs reference is transparently moved from @s to s->child.
- *
- * Pass &s->child here instead of s->childp, because:
- * (1) s->old_bs must be non-NULL, so bdrv_replace_child_noperm() will not
- * modify the BdrvChild * pointer we indirectly pass to it, i.e. it
- * will not modify s->child.  From that perspective, it does not matter
- * whether we pass s->childp or &s->child.
- * (TODO: Right now, bdrv_replace_child_noperm() never modifies that
- * pointer anyway (though it will in the future), so at this point it
- * absolutely does not matter whether we pass s->childp or &s->child.)
- * (2) If new_bs is not NULL, s->childp will be NULL.  We then cannot use
- * it here.
- * (3) If new_bs is NULL, *s->childp will have been NULLed by
- * bdrv_replace_child_tran()'s bdrv_replace_child_noperm() call, and we
- * must not pass a NULL *s->childp here.
- * (TODO: In its current state, bdrv_replace_child_noperm() will not
- * have NULLed *s->childp, so this does not apply yet.  It will in the
- * future.)
- *
- * So whether new_bs was NULL or not, we cannot pass s->childp here; and in
- * any case, there is no reason to pass it anyway.
- */
+/* old_bs reference is transparently moved from @s to @s->child */
 bdrv_replace_child_noperm(&s->child, s->old_bs);
 bdrv_unref(new_bs);
 }
@@ -2309,32 +2286,22 @@ static TransactionActionDrv bdrv_replace_child_drv = {
  * Note: real unref of old_bs is done only on commit.
  *
  * The function doesn't update permissions, caller is responsible for this.
- *
- * Note that if new_bs == NULL, @childp is stored in a state object attached
- * to @tran, so that the old child can be reinstated in the abort handler.
- * Therefore, if @new_bs can be NULL, @childp must stay valid until the
- * transaction is committed or aborted.
- *
- * (TODO: The reinstating does not happen yet, but it will once
- * bdrv_replace_child_noperm() NULLs *childp when new_bs is NULL.)
  */
-static void bdrv_replace_child_tran(BdrvChild **childp,
-BlockDriverState *new_bs,
+static void bdrv_replace_child_tran(BdrvChild *child, BlockDriverState *new_bs,
 Transaction *tran)
 {
 BdrvReplaceChildState *s = g_new(BdrvReplaceChildState, 1);
 *s = (BdrvReplaceChildState) {
-.child = *childp,
-.childp = new_bs == NULL ? childp : NULL,
-.old_bs = (*childp)->bs,
+.child = child,
+.old_bs = child->bs,
 };
 tran_add(tran, &bdrv_replace_child_drv, s);
 
 if (new_bs) {
 bdrv_ref(new_bs);
 }
-bdrv_replace_child_noperm(childp, new_bs);
-/* old_bs reference is transparently moved from *childp to @s */
+bdrv_replace_child_noperm(&child, new_bs);
+/* old_bs reference is transparently moved from @child to @s */
 }
 
 /*
@@ -4898,7 +4865,6 @@ static bool should_update_child(BdrvChild *c, 
BlockDriverState *to)
 
 typedef struct BdrvRemoveFilterOrCowChild {
 BdrvChild *child;
-BlockDriverState *bs;
 bool is_backing;
 } BdrvRemoveFilterOrCowChild;
 
@@ -4928,19 +4894,10 @@ static void bdrv_remove_filter_or_cow_child_commit(void 
*opaque)
 bdrv_child_free(s->child);
 }
 
-static void bdrv_remove_filter_or_cow_child_clean(void *opaque)
-{
-BdrvRemoveFilterOrCowChild *s = opaque;
-
-/* Drop the bs reference after the transaction is done */
-bdrv_unref(s->bs);
-g_free(s);
-}
-
 static TransactionActionDrv bdrv_remove_filter_or_cow_child_drv = {
 .abort = bdrv_remove_filter_or_cow_child_abort,
 .commit = bdrv_remove_filter_or_cow_child_commit,
-.clean = bdrv_remove_filter_or_cow_child_clean,
+.clean = g_free,
 };
 
 /*
@@ -4958,11 +4915,6 @@ static void 
bdrv_remove_file_or_backing_child(BlockDriverState *bs,
 return;
 }
 
-/*
- * Keep a reference to @bs so @childp will stay valid throughout the
- * transaction (required by bdrv_replace_child_tran())
- */
-bdrv_ref(bs);
 if (chi

[PATCH 05/14] tests-bdrv-drain: bdrv_replace_test driver: declare supports_backing

2021-12-03 Thread Vladimir Sementsov-Ogievskiy
We do add a COW child to the node.  In the future we are going to forbid
adding a COW child to a node that doesn't support backing, so fix it
here now.

Don't worry about setting bs->backing itself: in a further commit we'll
update the block layer to automatically set/unset this field in generic
code.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/unit/test-bdrv-drain.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index 2d3c17e566..45edbd867f 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -1944,6 +1944,7 @@ static void coroutine_fn 
bdrv_replace_test_co_drain_end(BlockDriverState *bs)
 static BlockDriver bdrv_replace_test = {
 .format_name= "replace_test",
 .instance_size  = sizeof(BDRVReplaceTestState),
+.supports_backing   = true,
 
 .bdrv_close = bdrv_replace_test_close,
 .bdrv_co_preadv = bdrv_replace_test_co_preadv,
-- 
2.31.1




[PATCH 00/14] block: cleanup backing and file handling

2021-12-03 Thread Vladimir Sementsov-Ogievskiy
Hi all!

I started this as a follow-up to
"block: Attempt on fixing 030-reported errors" by Hanna.

In Hanna's series I really like how bs->children handling moved to
.attach/.detach handlers.

.file and .backing are kind of "shortcuts" or "links" to some elements
of this list; they duplicate the information, so they should be updated
in the same place to stay in sync.

On the way to this target, I also do the following cleanups:

 - establish which restrictions we have on how many children of
 different roles a node may have, and which of them should be linked in
 .file / .backing. Add documentation and assertions.

 - drop all the complicated logic around passing a pointer to bs->backing
 / bs->file (BdrvChild **c), so that the field is automatically
 updated. Now they are natively updated in
 .attach/.detach, so the rest of the code becomes simpler.

 - now .file / .backing are updated ONLY in .attach / .detach; no other
 code modifies these fields

Vladimir Sementsov-Ogievskiy (14):
  block: BlockDriver: add .filtered_child_is_backing field
  block: introduce bdrv_open_file_child() helper
  block/blklogwrites: don't care to remove bs->file child on failure
  test-bdrv-graph-mod: update test_parallel_perm_update test case
  tests-bdrv-drain: bdrv_replace_test driver: declare supports_backing
  test-bdrv-graph-mod: fix filters to be filters
  block: document connection between child roles and
bs->backing/bs->file
  block/snapshot: stress that we fallback to primary child
  Revert "block: Let replace_child_noperm free children"
  Revert "block: Let replace_child_tran keep indirect pointer"
  Revert "block: Restructure remove_file_or_backing_child()"
  Revert "block: Pass BdrvChild ** to replace_child_noperm"
  block: Manipulate bs->file / bs->backing pointers in .attach/.detach
  block/snapshot: drop indirection around bdrv_snapshot_fallback_ptr

 include/block/block.h|  45 +
 include/block/block_int.h|  30 ++-
 block.c  | 335 ++-
 block/blkdebug.c |   9 +-
 block/blklogwrites.c |  11 +-
 block/blkreplay.c|   7 +-
 block/blkverify.c|   9 +-
 block/bochs.c|   7 +-
 block/cloop.c|   7 +-
 block/commit.c   |   1 +
 block/copy-before-write.c|   9 +-
 block/copy-on-read.c |   9 +-
 block/crypto.c   |  11 +-
 block/dmg.c  |   7 +-
 block/filter-compress.c  |   6 +-
 block/mirror.c   |   1 +
 block/parallels.c|   7 +-
 block/preallocate.c  |   9 +-
 block/qcow.c |   6 +-
 block/qcow2.c|   8 +-
 block/qed.c  |   8 +-
 block/raw-format.c   |   4 +-
 block/replication.c  |   8 +-
 block/snapshot.c |  60 ++
 block/throttle.c |   8 +-
 block/vdi.c  |   7 +-
 block/vhdx.c |   7 +-
 block/vmdk.c |   7 +-
 block/vpc.c  |   7 +-
 tests/unit/test-bdrv-drain.c |  11 +-
 tests/unit/test-bdrv-graph-mod.c |  94 ++---
 31 files changed, 343 insertions(+), 412 deletions(-)

-- 
2.31.1




[PATCH 11/14] Revert "block: Restructure remove_file_or_backing_child()"

2021-12-03 Thread Vladimir Sementsov-Ogievskiy
This was a preparation for the previously reverted
"block: Let replace_child_noperm free children". Drop it too; we don't
need it for the new approach.

This reverts commit 562bda8bb41879eeda0bd484dd3d55134579b28e.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block.c | 21 +
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/block.c b/block.c
index 3eec53796b..2ba95f71b9 100644
--- a/block.c
+++ b/block.c
@@ -4908,33 +4908,30 @@ static void 
bdrv_remove_file_or_backing_child(BlockDriverState *bs,
   BdrvChild *child,
   Transaction *tran)
 {
-BdrvChild **childp;
 BdrvRemoveFilterOrCowChild *s;
 
+assert(child == bs->backing || child == bs->file);
+
 if (!child) {
 return;
 }
 
-if (child == bs->backing) {
-childp = &bs->backing;
-} else if (child == bs->file) {
-childp = &bs->file;
-} else {
-g_assert_not_reached();
-}
-
 if (child->bs) {
-bdrv_replace_child_tran(*childp, NULL, tran);
+bdrv_replace_child_tran(child, NULL, tran);
 }
 
 s = g_new(BdrvRemoveFilterOrCowChild, 1);
 *s = (BdrvRemoveFilterOrCowChild) {
 .child = child,
-.is_backing = (childp == &bs->backing),
+.is_backing = (child == bs->backing),
 };
 tran_add(tran, &bdrv_remove_filter_or_cow_child_drv, s);
 
-*childp = NULL;
+if (s->is_backing) {
+bs->backing = NULL;
+} else {
+bs->file = NULL;
+}
 }
 
 /*
-- 
2.31.1




[PATCH 01/14] block: BlockDriver: add .filtered_child_is_backing field

2021-12-03 Thread Vladimir Sementsov-Ogievskiy
Unfortunately, not all filters use the .file child as the filtered child.
The two exceptions are mirror_top and commit_top. Happily, they are both
private filters. The bad thing is that this inconsistency is observable
through the QMP commands query-block / query-named-block-nodes. So,
whether we could simply change mirror_top and commit_top to use the file
child, as all other filter drivers do, is an open question. Probably we
could do that with some kind of deprecation period, but how would we warn
users during it?

For now, let's just add a field so we can distinguish them in generic
code, it will be used in further commits.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block_int.h | 14 ++
 block/commit.c|  1 +
 block/mirror.c|  1 +
 3 files changed, 16 insertions(+)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index f4c75e8ba9..9c06f8816e 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -117,6 +117,20 @@ struct BlockDriver {
  * (And this filtered child must then be bs->file or bs->backing.)
  */
 bool is_filter;
+
+/*
+ * Only makes sense for filter drivers; for others it must be false.
+ * If true, the filtered child is bs->backing. Otherwise it's bs->file.
+ * Only two internal filters use bs->backing as the filtered child and have
+ * this field set to true: mirror_top and commit_top.
+ *
+ * Never create any more such filters!
+ *
+ * TODO: figure out how to deprecate this behavior and make all filters work
+ * similarly using bs->file as filtered child.
+ */
+bool filtered_child_is_backing;
+
 /*
  * Set to true if the BlockDriver is a format driver.  Format nodes
  * generally do not expect their children to be other format nodes
diff --git a/block/commit.c b/block/commit.c
index 10cc5ff451..23d60aebf4 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -237,6 +237,7 @@ static BlockDriver bdrv_commit_top = {
 .bdrv_child_perm= bdrv_commit_top_child_perm,
 
 .is_filter  = true,
+.filtered_child_is_backing  = true,
 };
 
 void commit_start(const char *job_id, BlockDriverState *bs,
diff --git a/block/mirror.c b/block/mirror.c
index efec2c7674..22e2b7b110 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1587,6 +1587,7 @@ static BlockDriver bdrv_mirror_top = {
 .bdrv_child_perm= bdrv_mirror_top_child_perm,
 
 .is_filter  = true,
+.filtered_child_is_backing  = true,
 };
 
 static BlockJob *mirror_start_job(
-- 
2.31.1




Re: [PATCH v3 2/3] target/ppc: Implement Vector Extract Mask

2021-12-03 Thread Richard Henderson

On 12/3/21 11:42 AM, matheus.fe...@eldorado.org.br wrote:

From: Matheus Ferst

Implement the following PowerISA v3.1 instructions:
vextractbm: Vector Extract Byte Mask
vextracthm: Vector Extract Halfword Mask
vextractwm: Vector Extract Word Mask
vextractdm: Vector Extract Doubleword Mask
vextractqm: Vector Extract Quadword Mask

Signed-off-by: Matheus Ferst
---
  target/ppc/insn32.decode|  6 +++
  target/ppc/translate/vmx-impl.c.inc | 82 +
  2 files changed, 88 insertions(+)


Reviewed-by: Richard Henderson 

r~



Re: [PULL 0/2] Seabios 20211203 patches

2021-12-03 Thread Richard Henderson

On 12/3/21 12:55 AM, Gerd Hoffmann wrote:

The following changes since commit a69254a2b320e31d3aa63ca910b7aa02efcd5492:

   Merge tag 'ide-pull-request' of https://gitlab.com/jsnow/qemu into staging 
(2021-12-02 08:49:51 -0800)

are available in the Git repository at:

   git://git.kraxel.org/qemu tags/seabios-20211203-pull-request

for you to fetch changes up to 3bc90ac567f64fbe07b17b1174c85ec8a3e17d94:

   seabios: update binaries to 1.15.0 (2021-12-03 09:54:11 +0100)


seabios: update from snapshot to final 1.15.0 release (no code changes).



Gerd Hoffmann (2):
   seabios: update submodule to 1.15.0
   seabios: update binaries to 1.15.0

  pc-bios/bios-256k.bin | Bin 262144 -> 262144 bytes
  pc-bios/bios-microvm.bin  | Bin 131072 -> 131072 bytes
  pc-bios/bios.bin  | Bin 131072 -> 131072 bytes
  pc-bios/vgabios-ati.bin   | Bin 39424 -> 39424 bytes
  pc-bios/vgabios-bochs-display.bin | Bin 28672 -> 28672 bytes
  pc-bios/vgabios-cirrus.bin| Bin 39424 -> 39424 bytes
  pc-bios/vgabios-qxl.bin   | Bin 39424 -> 39424 bytes
  pc-bios/vgabios-ramfb.bin | Bin 28672 -> 28672 bytes
  pc-bios/vgabios-stdvga.bin| Bin 39424 -> 39424 bytes
  pc-bios/vgabios-virtio.bin| Bin 39424 -> 39424 bytes
  pc-bios/vgabios-vmware.bin| Bin 39424 -> 39424 bytes
  pc-bios/vgabios.bin   | Bin 38912 -> 38912 bytes
  roms/seabios  |   2 +-
  13 files changed, 1 insertion(+), 1 deletion(-)


Applied, thanks.

r~




[PATCH v3 3/3] target/ppc: Implement Vector Mask Move insns

2021-12-03 Thread matheus . ferst
From: Matheus Ferst 

Implement the following PowerISA v3.1 instructions:
mtvsrbm: Move to VSR Byte Mask
mtvsrhm: Move to VSR Halfword Mask
mtvsrwm: Move to VSR Word Mask
mtvsrdm: Move to VSR Doubleword Mask
mtvsrqm: Move to VSR Quadword Mask
mtvsrbmi: Move to VSR Byte Mask Immediate

Reviewed-by: Richard Henderson 
Signed-off-by: Matheus Ferst 
---
 target/ppc/insn32.decode|  11 +++
 target/ppc/translate/vmx-impl.c.inc | 115 
 2 files changed, 126 insertions(+)

diff --git a/target/ppc/insn32.decode b/target/ppc/insn32.decode
index 639ac22bf0..f68931f4f3 100644
--- a/target/ppc/insn32.decode
+++ b/target/ppc/insn32.decode
@@ -40,6 +40,10 @@
 %ds_rtp 22:4   !function=times_2
 @DS_rtp .. 0 ra:5 .. .. &D rt=%ds_rtp 
si=%ds_si
 
+&DX_b   vrt b
+%dx_b   6:10 16:5 0:1
+@DX_b   .. vrt:5  . .. . .  &DX_b b=%dx_b
+
 &DX rt d
 %dx_d   6:s10 16:5 0:1
 @DX .. rt:5  . .. . .   &DX d=%dx_d
@@ -413,6 +417,13 @@ VSRDBI  000100 . . . 01 ... 010110  @VN
 
 ## Vector Mask Manipulation Instructions
 
+MTVSRBM 000100 . 1 . 1100110@VX_tb
+MTVSRHM 000100 . 10001 . 1100110@VX_tb
+MTVSRWM 000100 . 10010 . 1100110@VX_tb
+MTVSRDM 000100 . 10011 . 1100110@VX_tb
+MTVSRQM 000100 . 10100 . 1100110@VX_tb
+MTVSRBMI000100 . . .. 01010 .   @DX_b
+
 VEXPANDBM   000100 . 0 . 1100110@VX_tb
 VEXPANDHM   000100 . 1 . 1100110@VX_tb
 VEXPANDWM   000100 . 00010 . 1100110@VX_tb
diff --git a/target/ppc/translate/vmx-impl.c.inc 
b/target/ppc/translate/vmx-impl.c.inc
index 96c97bf6e7..d5e02fd7f2 100644
--- a/target/ppc/translate/vmx-impl.c.inc
+++ b/target/ppc/translate/vmx-impl.c.inc
@@ -1607,6 +1607,121 @@ static bool trans_VEXTRACTQM(DisasContext *ctx, 
arg_VX_tb *a)
 return true;
 }
 
+static bool do_mtvsrm(DisasContext *ctx, arg_VX_tb *a, unsigned vece)
+{
+const uint64_t elem_width = 8 << vece, elem_count_half = 8 >> vece;
+uint64_t c;
+int i, j;
+TCGv_i64 hi, lo, t0, t1;
+
+REQUIRE_INSNS_FLAGS2(ctx, ISA310);
+REQUIRE_VECTOR(ctx);
+
+hi = tcg_temp_new_i64();
+lo = tcg_temp_new_i64();
+t0 = tcg_temp_new_i64();
+t1 = tcg_temp_new_i64();
+
+tcg_gen_extu_tl_i64(t0, cpu_gpr[a->vrb]);
+tcg_gen_extract_i64(hi, t0, elem_count_half, elem_count_half);
+tcg_gen_extract_i64(lo, t0, 0, elem_count_half);
+
+/*
+ * Spread the bits into their respective elements.
+ * E.g. for bytes:
+ * abcdefgh
+ *   << 32 - 4
+ * abcdefgh
+ *   |
+ * abcdefghabcdefgh
+ *   << 16 - 2
+ * 00abcdefghabcdefgh00
+ *   |
+ * 00abcdefgh00abcdefgh00abcdefgh00abcdefgh
+ *   << 8 - 1
+ * 000abcdefgh00abcdefgh00abcdefgh00abcdefgh000
+ *   |
+ * 000abcdefgXbcdefgXbcdefgXbcdefgXbcdefgXbcdefgXbcdefgXbcdefgh
+ *   & dup(1)
+ * 000a000b000c000d000e000f000g000h
+ *   * 0xff
+ * 
+ */
+for (i = elem_count_half / 2, j = 32; i > 0; i >>= 1, j >>= 1) {
+tcg_gen_shli_i64(t0, hi, j - i);
+tcg_gen_shli_i64(t1, lo, j - i);
+tcg_gen_or_i64(hi, hi, t0);
+tcg_gen_or_i64(lo, lo, t1);
+}
+
+c = dup_const(vece, 1);
+tcg_gen_andi_i64(hi, hi, c);
+tcg_gen_andi_i64(lo, lo, c);
+
+c = MAKE_64BIT_MASK(0, elem_width);
+tcg_gen_muli_i64(hi, hi, c);
+tcg_gen_muli_i64(lo, lo, c);
+
+set_avr64(a->vrt, lo, false);
+set_avr64(a->vrt, hi, true);
+
+tcg_temp_free_i64(hi);
+tcg_temp_free_i64(lo);
+tcg_temp_free_i64(t0);
+tcg_temp_free_i64(t1);
+
+return true;
+}
+
+TRANS(MTVSRBM, do_mtvsrm, MO_8)
+TRANS(MTVSRHM, do_mtvsrm, MO_16)
+TRANS(MTVSRWM, do_mtvsrm, MO_32)
+TRANS(MTVSRDM, do_mtvsrm, MO_64)
+
+static bool trans_MTVSRQM(DisasContext *ctx, arg_VX_tb *a)
+{
+TCGv_i64 tmp;
+
+REQUIRE_INSNS_FLAGS2(ctx, ISA310);
+REQUIRE_VECTOR(ctx);
+
+tmp = tcg_temp_new_i64();
+
+tcg_gen_ext_tl_i64(tmp, cpu_gpr[a->vrb]);
+tcg_gen_sextract_i64(tmp, tmp, 0, 1);
+set_avr64(a->vrt, tmp, false);
+set_avr64(a->vrt, tmp, true);
+
+tcg_temp_free_i64(tmp);
+
+return true;
+}
+
+static bool trans_MTVSRBMI(DisasContext *ctx, arg_DX_b *a)
+{
+const uint64_t mask = dup_const(MO_8, 1);
+uint64_t hi, lo;
+
+REQUIRE_INSNS_FLAGS2(ctx, ISA310);
+REQUIRE_VECTOR(ctx);
+
+hi = extract16(a->b, 8, 8);
+lo =

[PATCH v3 2/3] target/ppc: Implement Vector Extract Mask

2021-12-03 Thread matheus . ferst
From: Matheus Ferst 

Implement the following PowerISA v3.1 instructions:
vextractbm: Vector Extract Byte Mask
vextracthm: Vector Extract Halfword Mask
vextractwm: Vector Extract Word Mask
vextractdm: Vector Extract Doubleword Mask
vextractqm: Vector Extract Quadword Mask

Signed-off-by: Matheus Ferst 
---
 target/ppc/insn32.decode|  6 +++
 target/ppc/translate/vmx-impl.c.inc | 82 +
 2 files changed, 88 insertions(+)

diff --git a/target/ppc/insn32.decode b/target/ppc/insn32.decode
index 9a28f1d266..639ac22bf0 100644
--- a/target/ppc/insn32.decode
+++ b/target/ppc/insn32.decode
@@ -419,6 +419,12 @@ VEXPANDWM   000100 . 00010 . 1100110
@VX_tb
 VEXPANDDM   000100 . 00011 . 1100110@VX_tb
 VEXPANDQM   000100 . 00100 . 1100110@VX_tb
 
+VEXTRACTBM  000100 . 01000 . 1100110@VX_tb
+VEXTRACTHM  000100 . 01001 . 1100110@VX_tb
+VEXTRACTWM  000100 . 01010 . 1100110@VX_tb
+VEXTRACTDM  000100 . 01011 . 1100110@VX_tb
+VEXTRACTQM  000100 . 01100 . 1100110@VX_tb
+
 # VSX Load/Store Instructions
 
 LXV 01 . .  . 001   @DQ_TSX
diff --git a/target/ppc/translate/vmx-impl.c.inc 
b/target/ppc/translate/vmx-impl.c.inc
index ebb0484323..96c97bf6e7 100644
--- a/target/ppc/translate/vmx-impl.c.inc
+++ b/target/ppc/translate/vmx-impl.c.inc
@@ -1525,6 +1525,88 @@ static bool trans_VEXPANDQM(DisasContext *ctx, arg_VX_tb 
*a)
 return true;
 }
 
+static bool do_vextractm(DisasContext *ctx, arg_VX_tb *a, unsigned vece)
+{
+const uint64_t elem_width = 8 << vece, elem_count_half = 8 >> vece,
+   mask = dup_const(vece, 1 << (elem_width - 1));
+uint64_t i, j;
+TCGv_i64 lo, hi, t0, t1;
+
+REQUIRE_INSNS_FLAGS2(ctx, ISA310);
+REQUIRE_VECTOR(ctx);
+
+hi = tcg_temp_new_i64();
+lo = tcg_temp_new_i64();
+t0 = tcg_temp_new_i64();
+t1 = tcg_temp_new_i64();
+
+get_avr64(lo, a->vrb, false);
+get_avr64(hi, a->vrb, true);
+
+tcg_gen_andi_i64(lo, lo, mask);
+tcg_gen_andi_i64(hi, hi, mask);
+
+/*
+ * Gather the most significant bit of each element in the highest element
+ * element. E.g. for bytes:
+ * aXXXbXXXcXXXdXXXeXXXfXXXgXXXhXXX
+ * & dup(1 << (elem_width - 1))
+ * a000b000c000d000e000f000g000h000
+ * << 32 - 4
+ * e000f000g000h000
+ * |
+ * a000e000b000f000c000g000d000h000e000f000g000h000
+ * << 16 - 2
+ * 00c000g000d000h000e000f000g000h0
+ * |
+ * a0c0e0g0b0d0f0h0c0e0g000d0f0h000e0g0f0h0g000h000
+ * << 8 - 1
+ * 0b0d0f0h0c0e0g000d0f0h000e0g0f0h0g000h00
+ * |
+ * abcdefghbcdefgh0cdefgh00defgh000efghfgh0gh00h000
+ */
+for (i = elem_count_half / 2, j = 32; i > 0; i >>= 1, j >>= 1) {
+tcg_gen_shli_i64(t0, hi, j - i);
+tcg_gen_shli_i64(t1, lo, j - i);
+tcg_gen_or_i64(hi, hi, t0);
+tcg_gen_or_i64(lo, lo, t1);
+}
+
+tcg_gen_shri_i64(hi, hi, 64 - elem_count_half);
+tcg_gen_extract2_i64(lo, lo, hi, 64 - elem_count_half);
+tcg_gen_trunc_i64_tl(cpu_gpr[a->vrt], lo);
+
+tcg_temp_free_i64(hi);
+tcg_temp_free_i64(lo);
+tcg_temp_free_i64(t0);
+tcg_temp_free_i64(t1);
+
+return true;
+}
+
+TRANS(VEXTRACTBM, do_vextractm, MO_8)
+TRANS(VEXTRACTHM, do_vextractm, MO_16)
+TRANS(VEXTRACTWM, do_vextractm, MO_32)
+TRANS(VEXTRACTDM, do_vextractm, MO_64)
+
+static bool trans_VEXTRACTQM(DisasContext *ctx, arg_VX_tb *a)
+{
+TCGv_i64 tmp;
+
+REQUIRE_INSNS_FLAGS2(ctx, ISA310);
+REQUIRE_VECTOR(ctx);
+
+tmp = tcg_temp_new_i64();
+
+get_avr64(tmp, a->vrb, true);
+tcg_gen_shri_i64(tmp, tmp, 63);
+tcg_gen_trunc_i64_tl(cpu_gpr[a->vrt], tmp);
+
+tcg_temp_free_i64(tmp);
+
+return true;
+}
+
 #define GEN_VAFORM_PAIRED(name0, name1, opc2)   \
 static void glue(gen_, name0##_##name1)(DisasContext *ctx)  \
 {   \
-- 
2.25.1




[PATCH v3 1/3] target/ppc: Implement Vector Expand Mask

2021-12-03 Thread matheus . ferst
From: Matheus Ferst 

Implement the following PowerISA v3.1 instructions:
vexpandbm: Vector Expand Byte Mask
vexpandhm: Vector Expand Halfword Mask
vexpandwm: Vector Expand Word Mask
vexpanddm: Vector Expand Doubleword Mask
vexpandqm: Vector Expand Quadword Mask

Reviewed-by: Richard Henderson 
Signed-off-by: Matheus Ferst 
---
 target/ppc/insn32.decode| 11 ++
 target/ppc/translate/vmx-impl.c.inc | 34 +
 2 files changed, 45 insertions(+)

diff --git a/target/ppc/insn32.decode b/target/ppc/insn32.decode
index e135b8aba4..9a28f1d266 100644
--- a/target/ppc/insn32.decode
+++ b/target/ppc/insn32.decode
@@ -56,6 +56,9 @@
 &VX_uim4vrt uim vrb
 @VX_uim4.. vrt:5 . uim:4 vrb:5 ...  &VX_uim4
 
+&VX_tb  vrt vrb
+@VX_tb  .. vrt:5 . vrb:5 ...&VX_tb
+
 &X  rt ra rb
 @X  .. rt:5 ra:5 rb:5 .. .  &X
 
@@ -408,6 +411,14 @@ VINSWVRX000100 . . . 0011000@VX
 VSLDBI  000100 . . . 00 ... 010110  @VN
 VSRDBI  000100 . . . 01 ... 010110  @VN
 
+## Vector Mask Manipulation Instructions
+
+VEXPANDBM   000100 . 0 . 1100110@VX_tb
+VEXPANDHM   000100 . 1 . 1100110@VX_tb
+VEXPANDWM   000100 . 00010 . 1100110@VX_tb
+VEXPANDDM   000100 . 00011 . 1100110@VX_tb
+VEXPANDQM   000100 . 00100 . 1100110@VX_tb
+
 # VSX Load/Store Instructions
 
 LXV 01 . .  . 001   @DQ_TSX
diff --git a/target/ppc/translate/vmx-impl.c.inc 
b/target/ppc/translate/vmx-impl.c.inc
index 8eb8d3a067..ebb0484323 100644
--- a/target/ppc/translate/vmx-impl.c.inc
+++ b/target/ppc/translate/vmx-impl.c.inc
@@ -1491,6 +1491,40 @@ static bool trans_VSRDBI(DisasContext *ctx, arg_VN *a)
 return true;
 }
 
+static bool do_vexpand(DisasContext *ctx, arg_VX_tb *a, unsigned vece)
+{
+REQUIRE_INSNS_FLAGS2(ctx, ISA310);
+REQUIRE_VECTOR(ctx);
+
+tcg_gen_gvec_sari(vece, avr_full_offset(a->vrt), avr_full_offset(a->vrb),
+  (8 << vece) - 1, 16, 16);
+
+return true;
+}
+
+TRANS(VEXPANDBM, do_vexpand, MO_8)
+TRANS(VEXPANDHM, do_vexpand, MO_16)
+TRANS(VEXPANDWM, do_vexpand, MO_32)
+TRANS(VEXPANDDM, do_vexpand, MO_64)
+
+static bool trans_VEXPANDQM(DisasContext *ctx, arg_VX_tb *a)
+{
+TCGv_i64 tmp;
+
+REQUIRE_INSNS_FLAGS2(ctx, ISA310);
+REQUIRE_VECTOR(ctx);
+
+tmp = tcg_temp_new_i64();
+
+get_avr64(tmp, a->vrb, true);
+tcg_gen_sari_i64(tmp, tmp, 63);
+set_avr64(a->vrt, tmp, false);
+set_avr64(a->vrt, tmp, true);
+
+tcg_temp_free_i64(tmp);
+return true;
+}
+
 #define GEN_VAFORM_PAIRED(name0, name1, opc2)   \
 static void glue(gen_, name0##_##name1)(DisasContext *ctx)  \
 {   \
-- 
2.25.1




[PATCH v3 0/3] target/ppc: Implement Vector Expand/Extract Mask and Vector Mask

2021-12-03 Thread matheus . ferst
From: Matheus Ferst 

This is a small patch series just to allow Ubuntu 21.10 to boot with
-cpu POWER10. Glibc 2.34 is using vextractbm, so the init is killed by
SIGILL without the second patch of this series. The other two insns. are
included as they are somewhat close to Vector Extract Mask (at least in
pseudocode).

v3:
- VEXTRACT[BHWDQ]M optimized following rth suggestions

v2:
- Applied rth suggestions to VEXTRACT[BHWDQ]M and MTVSR[BHWDQ]M[I]

Matheus Ferst (3):
  target/ppc: Implement Vector Expand Mask
  target/ppc: Implement Vector Extract Mask
  target/ppc: Implement Vector Mask Move insns

 target/ppc/insn32.decode|  28 
 target/ppc/translate/vmx-impl.c.inc | 231 
 2 files changed, 259 insertions(+)

-- 
2.25.1




Re: QEMU 6.2.0 and rhbz#1999878

2021-12-03 Thread Richard W.M. Jones
On Fri, Dec 03, 2021 at 04:20:23PM -0300, Eduardo Lima wrote:
> Hi Rich,
> 
> Can you confirm if the patch you added for qemu in Fedora has still not been
> merged upstream? I could not find it on the git source tree.
> 
> +Patch2: 0001-tcg-arm-Reduce-vector-alignment-requirement-for-NEON.patch
> +From 1331e4eec016a295949009b4360c592401b089f7 Mon Sep 17 00:00:00 2001
> +From: Richard Henderson 
> +Date: Sun, 12 Sep 2021 10:49:25 -0700
> +Subject: [PATCH] tcg/arm: Reduce vector alignment requirement for NEON

https://bugzilla.redhat.com/show_bug.cgi?id=1999878
https://lists.nongnu.org/archive/html/qemu-devel/2021-09/msg01028.html

The patch I posted wasn't correct (nor meant to be); it was just a
workaround.  However, I think you're right - I don't believe the
original problem was ever fixed.

Let's see what upstream says ...

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v




Re: [PATCH 1/4] s390x/pci: use a reserved ID for the default PCI group

2021-12-03 Thread David Hildenbrand
On 03.12.21 03:25, Matthew Rosato wrote:
> On 12/2/21 6:06 PM, Halil Pasic wrote:
>> On Thu, 2 Dec 2021 12:11:38 -0500
>> Matthew Rosato  wrote:
>>

 What happens if we migrate a VM from old to new QEMU? Won't the guest be
 able to observe the change?

>>>
>>> Yes, technically --  But # itself is not really all that important, it
>>> is provided from CLP Q PCI FN to be subsequently used as input into Q
>>> PCI FNGRP -- With the fundamental notion being that all functions that
>>> share the same group # share the same group CLP info.  Whether the
>>> number is, say, 1 or 5 doesn't matter so much.
>>>
>>> However..  0xF0 and greater are the only values reserved for hypervisor
>>> use.  By using 0x20 we run the risk of accidentally conflating simulated
>>> devices and real hardware, hence the desire to change it.
>>>
>>> Is your concern about a migrated guest with a virtio device trying to do
>>> a CLP QUERY PCI FNGRP using 0x20 on a new QEMU?  I suppose we could
>>> modify 'clp_service_call, case CLP_QUERY_PCI_FNGRP' to silently catch
>>> simulated devices trying to use something other than the default group,
>>> e.g.:
>>>
>>> if ((pbdev->fh & FH_SHM_EMUL) &&
>>>     (pbdev->zpci_fn.pfgid != ZPCI_DEFAULT_FN_GRP)) {
>>>     /* Simulated device MUST have default group */
>>>     pbdev->zpci_fn.pfgid = ZPCI_DEFAULT_FN_GRP;
>>>     group = s390_group_find(ZPCI_DEFAULT_FN_GRP);
>>> }
>>>
>>> What do you think?
>>
>> Another option, and in my opinion the cleaner one would be to tie this
>> change to a new machine version. That is if a post-change qemu is used
>> in compatibility mode, we would still have the old behavior.
>>
>> What do you think?
>>
> 
> The problem there is that the old behavior goes against the architecture 
> (group 0x20 could belong to real hardware) and AFAIU assigning this new 
> behavior only to a new machine version means we can't fix old stable 
> QEMU versions.
> 
> Also, wait a minute -- migration isn't even an option right now, it's 
> blocked for zpci devices, both passthrough and simulated (see 
> aede5d5dfc5f 's390x/pci: mark zpci devices as unmigratable') so I say 
> let's just move to a proper default group now before we potentially 
> allow migration later.

Perfect, thanks for confirming!


-- 
Thanks,

David / dhildenb




Re: [PATCH v2 1/2] hw/arm/virt: Support for virtio-mem-pci

2021-12-03 Thread David Hildenbrand
On 03.12.21 04:35, Gavin Shan wrote:
> This supports the virtio-mem-pci device on the "virt" platform, by simply
> following the implementation on x86.
> 
>* This implements the hotplug handlers to support virtio-mem-pci
>  device hot-add, while hot-remove isn't supported, just as on x86.
> 
>* The block size is 512MB on ARM64 instead of 128MB on x86.
> 
>* It has been passing the tests with various combinations like 64KB
>  and 4KB page sizes on host and guest, different memory device
>  backends like normal, transparent huge page and HugeTLB, plus
>  migration.
> 

I would turn this patch into 2/2, reshuffling both patches.

> Co-developed-by: David Hildenbrand 
> Co-developed-by: Jonathan Cameron 
> Signed-off-by: Gavin Shan 

Reviewed-by: David Hildenbrand 

Thanks Gavin!

-- 
Thanks,

David / dhildenb




Re: [PATCH v2 2/2] virtio-mem: Correct default THP size for ARM64

2021-12-03 Thread David Hildenbrand
On 03.12.21 04:35, Gavin Shan wrote:
> The default block size is same as to the THP size, which is either
> retrieved from "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
> or hardcoded to 2MB. There are flaws in both mechanisms and this
> intends to fix them up.
> 
>   * When "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size" is
> used to get the THP size, 32MB and 512MB are valid values
> when we have 16KB and 64KB page sizes on ARM64.

Ah, right, there is 16KB as well :)

> 
>   * When the hardcoded THP size is used, 2MB, 32MB and 512MB are
> valid values when we have 4KB, 16KB and 64KB page sizes on
> ARM64.
> 
> Co-developed-by: David Hildenbrand 
> Signed-off-by: Gavin Shan 
> ---
>  hw/virtio/virtio-mem.c | 32 ++++++++++++++++++++------------
>  1 file changed, 20 insertions(+), 12 deletions(-)
> 
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index ac7a40f514..8f3c95300f 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -38,14 +38,25 @@
>   */
>  #define VIRTIO_MEM_MIN_BLOCK_SIZE ((uint32_t)(1 * MiB))
>  
> -#if defined(__x86_64__) || defined(__arm__) || defined(__aarch64__) || \
> -defined(__powerpc64__)
> -#define VIRTIO_MEM_DEFAULT_THP_SIZE ((uint32_t)(2 * MiB))
> -#else
> -/* fallback to 1 MiB (e.g., the THP size on s390x) */
> -#define VIRTIO_MEM_DEFAULT_THP_SIZE VIRTIO_MEM_MIN_BLOCK_SIZE
> +static uint32_t virtio_mem_default_thp_size(void)
> +{
> +uint32_t default_thp_size = VIRTIO_MEM_MIN_BLOCK_SIZE;
> +
> +#if defined(__x86_64__) || defined(__arm__) || defined(__powerpc64__)
> +default_thp_size = (uint32_t)(2 * MiB);
> +#elif defined(__aarch64__)
> +if (qemu_real_host_page_size == (4 * KiB)) {

you can drop the superfluous (), also in the cases below.

> +default_thp_size = (uint32_t)(2 * MiB);

The explicit cast shouldn't be required.

> +} else if (qemu_real_host_page_size == (16 * KiB)) {
> +default_thp_size = (uint32_t)(32 * MiB);
> +} else if (qemu_real_host_page_size == (64 * KiB)) {
> +default_thp_size = (uint32_t)(512 * MiB);
> +}
>  #endif
>  
> +return default_thp_size;
> +}
> +
>  /*
>   * We want to have a reasonable default block size such that
>   * 1. We avoid splitting THPs when unplugging memory, which degrades
> @@ -78,11 +89,8 @@ static uint32_t virtio_mem_thp_size(void)
>  if (g_file_get_contents(HPAGE_PMD_SIZE_PATH, &content, NULL, NULL) &&
>  !qemu_strtou64(content, &endptr, 0, &tmp) &&
>  (!endptr || *endptr == '\n')) {
> -/*
> - * Sanity-check the value, if it's too big (e.g., aarch64 with 64k 
> base
> - * pages) or weird, fallback to something smaller.
> - */
> -if (!tmp || !is_power_of_2(tmp) || tmp > 16 * MiB) {
> +/* Sanity-check the value and fallback to something reasonable. */
> +if (!tmp || !is_power_of_2(tmp)) {
>  warn_report("Read unsupported THP size: %" PRIx64, tmp);
>  } else {
>  thp_size = tmp;
> @@ -90,7 +98,7 @@ static uint32_t virtio_mem_thp_size(void)
>  }
>  
>  if (!thp_size) {
> -thp_size = VIRTIO_MEM_DEFAULT_THP_SIZE;
> +thp_size = virtio_mem_default_thp_size();
>  warn_report("Could not detect THP size, falling back to %" PRIx64
>  "  MiB.", thp_size / MiB);
>  }
> 

Apart from that,

Reviewed-by: David Hildenbrand 


Thanks!

-- 
Thanks,

David / dhildenb




RFC: x86 memory map, where to put CXL ranges?

2021-12-03 Thread Jonathan Cameron via
Hi All,

For CXL emulation we require a couple of types of memory range that
are then provided to the OS via the CEDT ACPI table.

1) CXL Host Bridge Structures point to CXL Host Bridge Component Registers.
Small regions for each CXL Host bridge that are mapped into the memory space.
64k each.  In theory we may have a huge number of these but in reality I
think 16 will do for any reasonable system.

2) CXL Fixed Memory Window Structures (CFMWS)
Large PA space ranges (multiple TB) to which various CXL devices can be assigned
and their address decoders appropriately programmed.
Each such CFMWS will have particular characteristics such as interleaving across
multiple host bridges.  They can potentially be huge but are a system
characteristic.  For emulation purposes it won't matter if they move around
depending on what else the machine has configured. So I'd like to
just configure their size rather than fully specify them at the command line
and possibly clash on PA space with something else.  Alternatively could
leave them as fully specified at the command line (address and size) and just
error out if they hit memory already in use for something else.

Now unfortunately there are no systems out there yet that we can just
copy the memory map from...

Coming from an Arm background I have only a vague idea of how this should be
done for x86 so apologies if it is a stupid question.

My current approach is to put these above device_memory and moving
the pci hole up appropriately.

Is that the right choice?

On Arm I currently have the Host Bridge Structures low down in the MemMap and
the CFMWS can go above the device memory.  Comments on that also welcome.

In Ben's RFC the host bridge component register location was marked as a TODO
and an arbitrary address used in the meantime, so it's time to figure out how
to clean that up.

Thanks,

Jonathan





Re: Suggestions for TCG performance improvements

2021-12-03 Thread Alex Bennée


Vasilev Oleg  writes:

> On 12/2/2021 7:02 PM, Alex Bennée wrote:
>
>> Vasilev Oleg  writes:
>>
>>> I've discovered some MMU-related suggestions in the 2018 letter[2], and
>>> those seem to be still not implemented (flush still uses memset[3]).
>>> Do you think we should go forward with implementing those?
>> I doubt you can do better than memset which should be the most optimised
>> memory clear for the platform. We could consider a separate thread to
>> proactively allocate and clear new TLBs so we don't have to do it at
>> flush time. However we wouldn't have complete information about what
>> size we want the new table to be.
>>
>> When a TLB flush is performed it could be that the majority of the old
>> table is still perfectly valid. 
>
> In that case, do you think it would be possible, instead of flushing
> the TLB, to store it somewhere and bring it back when the address
> space changes back?

It would need a new interface into cputlb but I don't see why not.

>
>> However we would need a reliable mechanism to work out which entries in the 
>> table could be kept. 
>
> We could invalidate entries in those stored TLBs the same way we
> invalidate the active TLB. If we are going to have a new thread to
> manage TLB allocation, invalidation could also be offloaded to it.
>
>> I did ponder a debug mode which would keep the last N tables dropped by
>> tlb_mmu_resize_locked and then measure the differences in the entries
>> before submitting the free to an rcu tasks.
>>> The mentioned paper[4] also describes other possible improvements.
>>> Some of those are already implemented (such as victim TLB and dynamic
>>> size for TLB), but others are not (e.g. TLB lookup uninlining and
>>> set-associative TLB layer). Do you think those improvements
>>> worth trying?
>> Anything is worth trying but you would need hard numbers. Also its all
>> too easy to target micro benchmarks which might not show much difference
>> in real world use. 
>
> The mentioned paper presents some benchmarking, e.g. Linux kernel
> compilation and some other stuff. Do you think those shouldn't be
> trusted?

No they are good. To be honest it's the context switches that get you.
Look at "info jit" between a normal distro and a initramfs shell. Places
where the kernel is switching between multiple maps means a churn of TLB
data.

See my other post with a match of "msr ttbr"

>
>> The best thing you can do at the moment is give the
>> guest plenty of RAM so page updates are limited because the guest OS
>> doesn't have to swap RAM around.
>>
>> Another optimisation would be looking at bigger page sizes. For example
>> the kernel (in a Linux setup) usually has a contiguous flat map for
>> kernel space. If we could represent that at a larger granularity then
>> not only could we make the page lookup tighter for kernel mode we could
>> also achieve things like cross-page TB chaining for kernel functions.
>
> Do I understand correctly that currently softmmu doesn't treat
> hugepages specially, and you are suggesting we add such support, so
> that a particular region of memory occupies fewer TLB entries? This
> probably means TLB lookup would become quite a bit more complex.
>
>>> Another idea for decreasing the occurrence of TLB refills is to make the TB key
>>> in htable independent of physical address. I assume it is only needed
>>> to distinguish different processes where VAs can be the same.
>>> Is that assumption correct?
>
> This one, what do you think? Can we replace physical address as part
> of a key in TB htable with some sort of address space identifier?

Hmm maybe - so a change in ASID wouldn't need a total flush?

>
>>> Do you have any other ideas which parts of TCG could require our
>>> attention w.r.t the flamegraph I attached?
>> It's been done before but not via upstream patches but improving code
>> generation for hot loops would be a potential performance win. 
>
> I am not sure optimizing the code generation itself would help much,
> at least in our case. The flamegraph I attached to the previous email
> shows that only about 10% of the time is spent in generated code. The
> rest is helpers, searching for the next block, TLB-related stuff and
> so on.
>
>> That would require some changes to the translation model to allow for
>> multiple exit points and probably introducing a new code generator
>> (gccjit or llvm) to generate highly optimised code.
>
> This, however, could bring a lot of performance gain, translation blocks 
> would become bigger, and we would spend less time searching for the next 
> block.
>
>>> I am also CCing my teammates. We are eager to improve the QEMU TCG
>>> performance for our needs and to contribute our patches to upstream.
>> Do you have any particular goal in mind or just "better"? The current
>> MTTCG scaling tends to drop off as we go above 10-12 vCPUs due to the
>> cost of synchronous flushing across all those vCPUs.
>
> We have some internal ways to measure performance, but we are looking
> for an alternative metric that we could share and you could reproduce.
> Sysbench in threads mode is the closest we have found so far by
> comparing flamegraphs, but we are testing more benchmarking software.

Re: [PATCH for-7.0] ppc: Mark the 'taihu' machine as deprecated

2021-12-03 Thread Daniel Henrique Barboza




On 12/3/21 13:49, Thomas Huth wrote:

The PPC 405 CPU is a system-on-a-chip, so all 405 machines are very similar,
except for some external periphery. However, the periphery of the 'taihu'
machine is hardly emulated at all (e.g. neither the LCD nor the USB part had
been implemented), so there is not much value added by this board. The users
can use the 'ref405ep' machine to test their PPC405 code instead.

Signed-off-by: Thomas Huth 
---


Reviewed-by: Daniel Henrique Barboza 


  docs/about/deprecated.rst | 9 +++++++++
  hw/ppc/ppc405_boards.c    | 1 +
  2 files changed, 10 insertions(+)

diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
index ff7488cb63..5693abb663 100644
--- a/docs/about/deprecated.rst
+++ b/docs/about/deprecated.rst
@@ -315,6 +315,15 @@ This machine is deprecated because we have enough AST2500 based OpenPOWER
  machines. It can be easily replaced by the ``witherspoon-bmc`` or the
  ``romulus-bmc`` machines.
  
+PPC 405 ``taihu`` machine (since 7.0)
+'''''''''''''''''''''''''''''''''''''
+
+The PPC 405 CPU is a system-on-a-chip, so all 405 machines are very similar,
+except for some external periphery. However, the periphery of the ``taihu``
+machine is hardly emulated at all (e.g. neither the LCD nor the USB part had
+been implemented), so there is not much value added by this board. Use the
+``ref405ep`` machine instead.
+
  Backend options
  ---
  
diff --git a/hw/ppc/ppc405_boards.c b/hw/ppc/ppc405_boards.c

index 972a7a4a3e..ff6a6d26b4 100644
--- a/hw/ppc/ppc405_boards.c
+++ b/hw/ppc/ppc405_boards.c
@@ -547,6 +547,7 @@ static void taihu_class_init(ObjectClass *oc, void *data)
  mc->init = taihu_405ep_init;
   mc->default_ram_size = 0x08000000;
  mc->default_ram_id = "taihu_405ep.ram";
+mc->deprecation_reason = "incomplete, use 'ref405ep' instead";
  }
  
  static const TypeInfo taihu_type = {






[PATCH for-7.0] ppc: Mark the 'taihu' machine as deprecated

2021-12-03 Thread Thomas Huth
The PPC 405 CPU is a system-on-a-chip, so all 405 machines are very similar,
except for some external periphery. However, the periphery of the 'taihu'
machine is hardly emulated at all (e.g. neither the LCD nor the USB part had
been implemented), so there is not much value added by this board. The users
can use the 'ref405ep' machine to test their PPC405 code instead.

Signed-off-by: Thomas Huth 
---
 docs/about/deprecated.rst | 9 +++++++++
 hw/ppc/ppc405_boards.c    | 1 +
 2 files changed, 10 insertions(+)

diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
index ff7488cb63..5693abb663 100644
--- a/docs/about/deprecated.rst
+++ b/docs/about/deprecated.rst
@@ -315,6 +315,15 @@ This machine is deprecated because we have enough AST2500 based OpenPOWER
 machines. It can be easily replaced by the ``witherspoon-bmc`` or the
 ``romulus-bmc`` machines.
 
+PPC 405 ``taihu`` machine (since 7.0)
+'''''''''''''''''''''''''''''''''''''
+
+The PPC 405 CPU is a system-on-a-chip, so all 405 machines are very similar,
+except for some external periphery. However, the periphery of the ``taihu``
+machine is hardly emulated at all (e.g. neither the LCD nor the USB part had
+been implemented), so there is not much value added by this board. Use the
+``ref405ep`` machine instead.
+
 Backend options
 ---
 
diff --git a/hw/ppc/ppc405_boards.c b/hw/ppc/ppc405_boards.c
index 972a7a4a3e..ff6a6d26b4 100644
--- a/hw/ppc/ppc405_boards.c
+++ b/hw/ppc/ppc405_boards.c
@@ -547,6 +547,7 @@ static void taihu_class_init(ObjectClass *oc, void *data)
 mc->init = taihu_405ep_init;
 mc->default_ram_size = 0x08000000;
 mc->default_ram_id = "taihu_405ep.ram";
+mc->deprecation_reason = "incomplete, use 'ref405ep' instead";
 }
 
 static const TypeInfo taihu_type = {
-- 
2.27.0







Re: Suggestions for TCG performance improvements

2021-12-03 Thread Vasilev Oleg via
On 12/2/2021 7:02 PM, Alex Bennée wrote:

> Vasilev Oleg  writes:
>
>> I've discovered some MMU-related suggestions in the 2018 letter[2], and
>> those seem to be still not implemented (flush still uses memset[3]).
>> Do you think we should go forward with implementing those?
> I doubt you can do better than memset which should be the most optimised
> memory clear for the platform. We could consider a separate thread to
> proactively allocate and clear new TLBs so we don't have to do it at
> flush time. However we wouldn't have complete information about what
> size we want the new table to be.
>
> When a TLB flush is performed it could be that the majority of the old
> table is still perfectly valid. 

In that case, do you think it would be possible, instead of flushing the TLB, 
to store it somewhere and bring it back when the address space changes back? 

> However we would need a reliable mechanism to work out which entries in the 
> table could be kept. 

We could invalidate entries in those stored TLBs the same way we invalidate the 
active TLB. If we are going to have a new thread to manage TLB allocation, 
invalidation could also be offloaded to it.

> I did ponder a debug mode which would keep the last N tables dropped by
> tlb_mmu_resize_locked and then measure the differences in the entries
> before submitting the free to an rcu tasks.
>> The mentioned paper[4] also describes other possible improvements.
>> Some of those are already implemented (such as victim TLB and dynamic
>> size for TLB), but others are not (e.g. TLB lookup uninlining and
>> set-associative TLB layer). Do you think those improvements
>> worth trying?
> Anything is worth trying but you would need hard numbers. Also its all
> too easy to target micro benchmarks which might not show much difference
> in real world use. 

The mentioned paper presents some benchmarking, e.g. Linux kernel compilation 
and some other stuff. Do you think those shouldn't be trusted?

> The best thing you can do at the moment is give the
> guest plenty of RAM so page updates are limited because the guest OS
> doesn't have to swap RAM around.
>
> Another optimisation would be looking at bigger page sizes. For example
> the kernel (in a Linux setup) usually has a contiguous flat map for
> kernel space. If we could represent that at a larger granularity then
> not only could we make the page lookup tighter for kernel mode we could
> also achieve things like cross-page TB chaining for kernel functions.

Do I understand correctly that currently softmmu doesn't treat hugepages 
specially, and you are suggesting we add such support, so that a particular 
region of memory occupies fewer TLB entries? This probably means TLB lookup 
would become quite a bit more complex.

>> Another idea for decreasing the occurrence of TLB refills is to make the TB key
>> in htable independent of physical address. I assume it is only needed
>> to distinguish different processes where VAs can be the same.
>> Is that assumption correct?

This one, what do you think? Can we replace physical address as part of a key 
in TB htable with some sort of address space identifier?

>> Do you have any other ideas which parts of TCG could require our
>> attention w.r.t the flamegraph I attached?
> It's been done before but not via upstream patches but improving code
> generation for hot loops would be a potential performance win. 

I am not sure optimizing the code generation itself would help much, at least 
in our case. The flamegraph I attached to the previous email shows that only 
about 10% of the time is spent in generated code. The rest is helpers, 
searching for the next block, TLB-related stuff and so on.

> That would require some changes to the translation model to allow for
> multiple exit points and probably introducing a new code generator
> (gccjit or llvm) to generate highly optimised code.

This, however, could bring a lot of performance gain, translation blocks would 
become bigger, and we would spend less time searching for the next block.

>> I am also CCing my teammates. We are eager to improve the QEMU TCG
>> performance for our needs and to contribute our patches to upstream.
> Do you have any particular goal in mind or just "better"? The current
> MTTCG scaling tends to drop off as we go above 10-12 vCPUs due to the
> cost of synchronous flushing across all those vCPUs.

We have some internal ways to measure performance, but we are looking for an 
alternative metric that we could share and you could reproduce. Sysbench in 
threads mode is the closest we have found so far by comparing flamegraphs, but 
we are testing more benchmarking software.

>> [1]: https://github.com/akopytov/sysbench
>> [2]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg562103.html
>> [3]: 
>> https://github.com/qemu/qemu/blob/14d02cfbe4adaeebe7cb833a8cc71191352cf03b/accel/tcg/cputlb.c#L239
>> [4]: https://dl.acm.org/doi/pdf/10.1145/2686034
>>
>> [2. flamegraph.svg --- image/svg+xml; flamegraph.svg]...
>>
>> 

Re: [RFC PATCH for-7.0 00/35] target/ppc fpu fixes and cleanups

2021-12-03 Thread Matheus K. Ferst

On 19/11/2021 13:04, Richard Henderson wrote:

This is a partial patch set showing the direction I believe
the cleanups should go, as opposed to a complete conversion.

I add a bunch of float_flag_* bits that diagnose the reason
for most float_flag_invalid, as guided by the requirements
of the PowerPC VX* bits.  I have converted some helpers to
use these new flags but not all.  A good signal for unconverted
routines is the use of float*_is_signalling_nan, which should
now be using float_flag_invalid_snan.

I added float64x32_* arithmetic routines, which take float64
arguments and round the result to the range and precision of
float32, while giving the result in float64 form.  This is
exactly what PowerPC requires for its single-precision math.
This fixes double-rounding problems that exist in the current
code, and are visible in the float_madds test.

I add test reference files for float_madds and float_convs
after fixing the bugs required to make the tests pass.


With this series and a few other VSX instructions[1], QEMU now passes the 
glibc math test suite.


Tested-by: Matheus Ferst 

[1] https://github.com/PPC64/qemu/tree/ferst-tcg-xsmaddqp (WIP)

Thanks,
Matheus K. Ferst
Instituto de Pesquisas ELDORADO 
Software Analyst
Legal Notice - Disclaimer 



Re: [PATCH v2 06/15] target/m68k: Fix address argument for EXCP_CHK

2021-12-03 Thread Laurent Vivier

Le 03/12/2021 à 15:29, Richard Henderson a écrit :

On 12/3/21 6:27 AM, Laurent Vivier wrote:

Le 02/12/2021 à 21:48, Richard Henderson a écrit :

According to the M68040 Users Manual, section 8.4.3,
Six word stack frame (format 2), CHK, CHK2 (and others)
are supposed to record the next insn in PC and the
address of the trapping instruction in ADDRESS.

Create a raise_exception_format2 function to centralize recording
of the trapping pc in mmu.ar, plus advancing to the next insn.


It's weird to use mmu.ar as the field is used for MMU exceptions.


Should I rename the field to "excp_addr" or something?



No, I'm wondering if we should move it or duplicate it. It's not clear. I think we can keep it like 
this and later do a cleanup.


But I think you should add a comment in CPUM68KState next to ar to point out that it is also used to 
store the address of CHK/CHK2/DIV/TRAP/


Thanks,
Laurent



[RFC PATCH 1/2] tests/plugin: allow libinsn.so per-CPU counts

2021-12-03 Thread Alex Bennée
We won't go fully flexible but for most system emulation 8 vCPUs
resolution should be enough for anybody ;-)

Signed-off-by: Alex Bennée 
---
 tests/plugin/insn.c | 39 +++++++++++++++++++++++++++++++--------
 1 file changed, 31 insertions(+), 8 deletions(-)

diff --git a/tests/plugin/insn.c b/tests/plugin/insn.c
index d229fdc001..d5a0a08cb4 100644
--- a/tests/plugin/insn.c
+++ b/tests/plugin/insn.c
@@ -16,22 +16,33 @@
 
 QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;
 
-static uint64_t insn_count;
+#define MAX_CPUS 8 /* lets not go nuts */
+
+typedef struct {
+uint64_t last_pc;
+uint64_t insn_count;
+} InstructionCount;
+
+static InstructionCount counts[MAX_CPUS];
+static uint64_t inline_insn_count;
+
 static bool do_inline;
 static bool do_size;
+static bool do_frequency;
 static GArray *sizes;
 
 static void vcpu_insn_exec_before(unsigned int cpu_index, void *udata)
 {
-static uint64_t last_pc;
+unsigned int i = cpu_index % MAX_CPUS;
+InstructionCount *c = &counts[i];
 uint64_t this_pc = GPOINTER_TO_UINT(udata);
-if (this_pc == last_pc) {
+if (this_pc == c->last_pc) {
g_autofree gchar *out = g_strdup_printf("detected repeat execution @ 0x%"
PRIx64 "\n", this_pc);
 qemu_plugin_outs(out);
 }
-last_pc = this_pc;
-insn_count++;
+c->last_pc = this_pc;
+c->insn_count++;
 }
 
 static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
@@ -44,7 +55,7 @@ static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
 
 if (do_inline) {
 qemu_plugin_register_vcpu_insn_exec_inline(
-insn, QEMU_PLUGIN_INLINE_ADD_U64, &insn_count, 1);
+insn, QEMU_PLUGIN_INLINE_ADD_U64, &inline_insn_count, 1);
 } else {
 uint64_t vaddr = qemu_plugin_insn_vaddr(insn);
 qemu_plugin_register_vcpu_insn_exec_cb(
@@ -66,9 +77,9 @@ static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
 static void plugin_exit(qemu_plugin_id_t id, void *p)
 {
 g_autoptr(GString) out = g_string_new(NULL);
+int i;
 
 if (do_size) {
-int i;
 for (i = 0; i <= sizes->len; i++) {
 unsigned long *cnt = &g_array_index(sizes, unsigned long, i);
 if (*cnt) {
@@ -76,8 +87,20 @@ static void plugin_exit(qemu_plugin_id_t id, void *p)
"len %d bytes: %ld insns\n", i, *cnt);
 }
 }
+} else if (do_inline) {
+g_string_append_printf(out, "insns: %" PRIu64 "\n", inline_insn_count);
 } else {
-g_string_append_printf(out, "insns: %" PRIu64 "\n", insn_count);
+uint64_t total_insns = 0;
+for (i = 0; i < MAX_CPUS; i++) {
+InstructionCount *c = &counts[i];
+if (c->insn_count) {
+g_string_append_printf(out, "cpu %d insns: %" PRIu64 "\n",
+   i, c->insn_count);
+total_insns += c->insn_count;
+}
+}
+g_string_append_printf(out, "total insns: %" PRIu64 "\n",
+   total_insns);
 }
 qemu_plugin_outs(out->str);
 }
-- 
2.30.2




[RFC PATCH 2/2] tests/plugins: add instruction matching to libinsn.so

2021-12-03 Thread Alex Bennée
This adds simple instruction matching to the libinsn.so plugin which
is useful for examining the execution distance between instructions.
For example to track how often we flush in ARM due to TLB updates:

  -plugin ./tests/plugin/libinsn.so,match=tlbi

which leads to output like this:

  0xffc01018fa00, tlbi aside1is, x0,  339, 32774 match hits, 23822 since last, avg 47279
  0xffc01018fa00, tlbi aside1is, x0,  340, 32775 match hits, 565051 since last, avg 47295
  0xffc0101915a4, tlbi vae1is, x0,  155, 32776 match hits, 151135 since last, avg 47298
  0xffc01018fc60, tlbi vae1is, x4,  224, 32777 match hits, 814 since last, avg 47297
  0xffc010194a44, tlbi vale1is, x1,  8835, 32778 match hits, 52027 since last, avg 47297
  0xffc010194a44, tlbi vale1is, x1,  8836, 32779 match hits, 8347 since last, avg 47296
  0xffc010194a44, tlbi vale1is, x1,  8837, 32780 match hits, 33677 since last, avg 47295

showing we do some sort of TLBI invalidation every 47 thousand
instructions.

Cc: Vasilev Oleg 
Cc: Richard Henderson 
Cc: Emilio Cota 
Signed-off-by: Alex Bennée 
---
 tests/plugin/insn.c | 88 -
 1 file changed, 87 insertions(+), 1 deletion(-)

diff --git a/tests/plugin/insn.c b/tests/plugin/insn.c
index d5a0a08cb4..3f48c86fd7 100644
--- a/tests/plugin/insn.c
+++ b/tests/plugin/insn.c
@@ -28,9 +28,25 @@ static uint64_t inline_insn_count;
 
 static bool do_inline;
 static bool do_size;
-static bool do_frequency;
 static GArray *sizes;
 
+typedef struct {
+char *match_string;
+uint64_t hits[MAX_CPUS];
+uint64_t last_hit[MAX_CPUS];
+uint64_t total_delta[MAX_CPUS];
+GPtrArray *history[MAX_CPUS];
+} Match;
+
+static GArray *matches;
+
+typedef struct {
+Match *match;
+uint64_t vaddr;
+uint64_t hits;
+char *disas;
+} Instruction;
+
 static void vcpu_insn_exec_before(unsigned int cpu_index, void *udata)
 {
 unsigned int i = cpu_index % MAX_CPUS;
@@ -45,6 +61,36 @@ static void vcpu_insn_exec_before(unsigned int cpu_index, void *udata)
 c->insn_count++;
 }
 
+static void vcpu_insn_matched_exec_before(unsigned int cpu_index, void *udata)
+{
+unsigned int i = cpu_index % MAX_CPUS;
+Instruction *insn = (Instruction *) udata;
+Match *match = insn->match;
+g_autoptr(GString) ts = g_string_new("");
+
+insn->hits++;
+g_string_append_printf(ts, "0x%" PRIx64 ", %s, % "PRId64,
+   insn->vaddr, insn->disas, insn->hits);
+
+uint64_t icount = counts[i].insn_count;
+uint64_t delta = icount - match->last_hit[i];
+
+match->hits[i]++;
+match->total_delta[i] += delta;
+
+g_string_append_printf(ts,
+   ", %"PRId64" match hits, %"PRId64
+   " since last, avg %"PRId64"\n",
+   match->hits[i], delta,
+   match->total_delta[i] / match->hits[i]);
+
+match->last_hit[i] = icount;
+
+qemu_plugin_outs(ts->str);
+
+g_ptr_array_add(match->history[i], insn);
+}
+
 static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
 {
 size_t n = qemu_plugin_tb_n_insns(tb);
@@ -71,6 +117,29 @@ static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
 unsigned long *cnt = &g_array_index(sizes, unsigned long, sz);
 (*cnt)++;
 }
+
+/*
+ * If we are tracking certain instructions we will need more
+ * information about the instruction which we also need to
+ * save if there is a hit.
+ */
+if (matches) {
+char *insn_disas = qemu_plugin_insn_disas(insn);
+int j;
+for (j = 0; j < matches->len; j++) {
+Match *m = &g_array_index(matches, Match, j);
+if (g_str_has_prefix(insn_disas, m->match_string)) {
+Instruction *rec = g_new0(Instruction, 1);
+rec->disas = g_strdup(insn_disas);
+rec->vaddr = qemu_plugin_insn_vaddr(insn);
+rec->match = m;
+qemu_plugin_register_vcpu_insn_exec_cb(
+insn, vcpu_insn_matched_exec_before,
+QEMU_PLUGIN_CB_NO_REGS, rec);
+}
+}
+g_free(insn_disas);
+}
 }
 }
 
@@ -105,6 +174,21 @@ static void plugin_exit(qemu_plugin_id_t id, void *p)
 qemu_plugin_outs(out->str);
 }
 
+
+/* Add a match to the array of matches */
+static void parse_match(char *match)
+{
+Match new_match = { .match_string = match };
+int i;
+for (i = 0; i < MAX_CPUS; i++) {
+new_match.history[i] = g_ptr_array_new();
+}
+if (!matches) {
+matches = g_array_new(false, true, sizeof(Match));
+}
+g_array_append_val(matches, new_match);
+}
+
 QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
const qemu_info_t *info,

[RFC PATCH 0/2] insn plugin tweaks for measuring frequency

2021-12-03 Thread Alex Bennée
Hi,

This series was prompted by yesterdays email thread:

  Subject: Suggestions for TCG performance improvements
  Date: Thu, 2 Dec 2021 09:47:13 +
  Message-ID: 

which made me wonder whether we could leverage the TCG plugins to measure
how frequently these flush events happen. My initial measurements
with a Debian arm64 system indicate we do some sort of tlbi
instruction every 47 thousand instructions.

Alex Bennée (2):
  tests/plugin: allow libinsn.so per-CPU counts
  tests/plugins: add instruction matching to libinsn.so

 tests/plugin/insn.c | 125 +---
 1 file changed, 117 insertions(+), 8 deletions(-)

-- 
2.30.2




[PATCH v3 4/4] s390x/pci: add supported DT information to clp response

2021-12-03 Thread Matthew Rosato
The DTSM is a mask that specifies which I/O Address Translation designation
types are supported.  Today QEMU only supports DT=1.

Signed-off-by: Matthew Rosato 
---
 hw/s390x/s390-pci-bus.c         | 1 +
 hw/s390x/s390-pci-inst.c        | 1 +
 hw/s390x/s390-pci-vfio.c        | 1 +
 include/hw/s390x/s390-pci-bus.h | 1 +
 include/hw/s390x/s390-pci-clp.h | 3 ++-
 5 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index 1b51a72838..01b58ebc70 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -782,6 +782,7 @@ static void s390_pci_init_default_group(void)
 resgrp->i = 128;
 resgrp->maxstbl = 128;
 resgrp->version = 0;
+resgrp->dtsm = ZPCI_DTSM;
 }
 
 static void set_pbdev_info(S390PCIBusDevice *pbdev)
diff --git a/hw/s390x/s390-pci-inst.c b/hw/s390x/s390-pci-inst.c
index 07bab85ce5..6d400d4147 100644
--- a/hw/s390x/s390-pci-inst.c
+++ b/hw/s390x/s390-pci-inst.c
@@ -329,6 +329,7 @@ int clp_service_call(S390CPU *cpu, uint8_t r2, uintptr_t ra)
 stw_p(&resgrp->i, group->zpci_group.i);
 stw_p(&resgrp->maxstbl, group->zpci_group.maxstbl);
 resgrp->version = group->zpci_group.version;
+resgrp->dtsm = group->zpci_group.dtsm;
 stw_p(&resgrp->hdr.rsp, CLP_RC_OK);
 break;
 }
diff --git a/hw/s390x/s390-pci-vfio.c b/hw/s390x/s390-pci-vfio.c
index 2a153fa8c9..6f80a47e29 100644
--- a/hw/s390x/s390-pci-vfio.c
+++ b/hw/s390x/s390-pci-vfio.c
@@ -160,6 +160,7 @@ static void s390_pci_read_group(S390PCIBusDevice *pbdev,
 resgrp->i = cap->noi;
 resgrp->maxstbl = cap->maxstbl;
 resgrp->version = cap->version;
+resgrp->dtsm = ZPCI_DTSM;
 }
 }
 
diff --git a/include/hw/s390x/s390-pci-bus.h b/include/hw/s390x/s390-pci-bus.h
index 2727e7bdef..da3cde2bb4 100644
--- a/include/hw/s390x/s390-pci-bus.h
+++ b/include/hw/s390x/s390-pci-bus.h
@@ -37,6 +37,7 @@
 #define ZPCI_MAX_UID 0x
 #define UID_UNDEFINED 0
 #define UID_CHECKING_ENABLED 0x01
+#define ZPCI_DTSM 0x40
 
 OBJECT_DECLARE_SIMPLE_TYPE(S390pciState, S390_PCI_HOST_BRIDGE)
 OBJECT_DECLARE_SIMPLE_TYPE(S390PCIBus, S390_PCI_BUS)
diff --git a/include/hw/s390x/s390-pci-clp.h b/include/hw/s390x/s390-pci-clp.h
index 96b8e3f133..cc8c8662b8 100644
--- a/include/hw/s390x/s390-pci-clp.h
+++ b/include/hw/s390x/s390-pci-clp.h
@@ -163,7 +163,8 @@ typedef struct ClpRspQueryPciGrp {
 uint8_t fr;
 uint16_t maxstbl;
 uint16_t mui;
-uint64_t reserved3;
+uint8_t dtsm;
+uint8_t reserved3[7];
 uint64_t dasm; /* dma address space mask */
 uint64_t msia; /* MSI address */
 uint64_t reserved4;
-- 
2.27.0




Re: [PATCH v2 06/15] target/m68k: Fix address argument for EXCP_CHK

2021-12-03 Thread Richard Henderson

On 12/3/21 6:27 AM, Laurent Vivier wrote:

On 02/12/2021 at 21:48, Richard Henderson wrote:

According to the M68040 User's Manual, section 8.4.3,
Six word stack frame (format 2), CHK, CHK2 (and others)
are supposed to record the next insn in PC and the
address of the trapping instruction in ADDRESS.

Create a raise_exception_format2 function to centralize recording
of the trapping pc in mmu.ar, plus advancing to the next insn.


It's weird to use mmu.ar as the field is used for MMU exceptions.


Should I rename the field to "excp_addr" or something?


r~




Update m68k_interrupt_all to pass mmu.ar to do_stack_frame.
Update cpu_loop to pass mmu.ar to siginfo.si_addr, as the
kernel does in trap_c().

Signed-off-by: Richard Henderson 
---
  linux-user/m68k/cpu_loop.c |  2 +-
  target/m68k/op_helper.c    | 54 --
  2 files changed, 30 insertions(+), 26 deletions(-)


Reviewed-by: Laurent Vivier 





[PATCH v3 2/4] s390x/pci: don't use hard-coded dma range in reg_ioat

2021-12-03 Thread Matthew Rosato
Instead use the values from clp info, they will either be the hard-coded
values or what came from the host driver via vfio.

Fixes: 9670ee752727 ("s390x/pci: use a PCI Function structure")
Reviewed-by: Eric Farman 
Reviewed-by: Pierre Morel 
Signed-off-by: Matthew Rosato 
---
 hw/s390x/s390-pci-inst.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/hw/s390x/s390-pci-inst.c b/hw/s390x/s390-pci-inst.c
index 1c8ad91175..11b7f6bfa1 100644
--- a/hw/s390x/s390-pci-inst.c
+++ b/hw/s390x/s390-pci-inst.c
@@ -916,9 +916,10 @@ int pci_dereg_irqs(S390PCIBusDevice *pbdev)
 return 0;
 }
 
-static int reg_ioat(CPUS390XState *env, S390PCIIOMMU *iommu, ZpciFib fib,
+static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib,
 uintptr_t ra)
 {
+S390PCIIOMMU *iommu = pbdev->iommu;
 uint64_t pba = ldq_p(&fib.pba);
 uint64_t pal = ldq_p(&fib.pal);
 uint64_t g_iota = ldq_p(&fib.iota);
@@ -927,7 +928,7 @@ static int reg_ioat(CPUS390XState *env, S390PCIIOMMU 
*iommu, ZpciFib fib,
 
 pba &= ~0xfff;
 pal |= 0xfff;
-if (pba > pal || pba < ZPCI_SDMA_ADDR || pal > ZPCI_EDMA_ADDR) {
+if (pba > pal || pba < pbdev->zpci_fn.sdma || pal > pbdev->zpci_fn.edma) {
 s390_program_interrupt(env, PGM_OPERAND, ra);
 return -EINVAL;
 }
@@ -1125,7 +1126,7 @@ int mpcifc_service_call(S390CPU *cpu, uint8_t r1, 
uint64_t fiba, uint8_t ar,
 } else if (pbdev->iommu->enabled) {
 cc = ZPCI_PCI_LS_ERR;
 s390_set_status_code(env, r1, ZPCI_MOD_ST_SEQUENCE);
-} else if (reg_ioat(env, pbdev->iommu, fib, ra)) {
+} else if (reg_ioat(env, pbdev, fib, ra)) {
 cc = ZPCI_PCI_LS_ERR;
 s390_set_status_code(env, r1, ZPCI_MOD_ST_INSUF_RES);
 }
@@ -1150,7 +1151,7 @@ int mpcifc_service_call(S390CPU *cpu, uint8_t r1, 
uint64_t fiba, uint8_t ar,
 s390_set_status_code(env, r1, ZPCI_MOD_ST_SEQUENCE);
 } else {
 pci_dereg_ioat(pbdev->iommu);
-if (reg_ioat(env, pbdev->iommu, fib, ra)) {
+if (reg_ioat(env, pbdev, fib, ra)) {
 cc = ZPCI_PCI_LS_ERR;
 s390_set_status_code(env, r1, ZPCI_MOD_ST_INSUF_RES);
 }
-- 
2.27.0




Re: [PATCH v2 06/15] target/m68k: Fix address argument for EXCP_CHK

2021-12-03 Thread Laurent Vivier

On 02/12/2021 at 21:48, Richard Henderson wrote:

According to the M68040 User's Manual, section 8.4.3,
Six word stack frame (format 2), CHK, CHK2 (and others)
are supposed to record the next insn in PC and the
address of the trapping instruction in ADDRESS.

Create a raise_exception_format2 function to centralize recording
of the trapping pc in mmu.ar, plus advancing to the next insn.


It's weird to use mmu.ar as the field is used for MMU exceptions.


Update m68k_interrupt_all to pass mmu.ar to do_stack_frame.
Update cpu_loop to pass mmu.ar to siginfo.si_addr, as the
kernel does in trap_c().

Signed-off-by: Richard Henderson 
---
  linux-user/m68k/cpu_loop.c |  2 +-
  target/m68k/op_helper.c| 54 --
  2 files changed, 30 insertions(+), 26 deletions(-)


Reviewed-by: Laurent Vivier 



[PATCH v3 1/4] s390x/pci: use a reserved ID for the default PCI group

2021-12-03 Thread Matthew Rosato
The current default PCI group being used can technically collide with a
real group ID passed from a hostdev.  Let's instead use a group ID that
comes from a special pool (0xF0-0xFF) that is architected to be reserved
for simulated devices.

Fixes: 28dc86a072 ("s390x/pci: use a PCI Group structure")
Reviewed-by: Eric Farman 
Reviewed-by: Pierre Morel 
Signed-off-by: Matthew Rosato 
---
 include/hw/s390x/s390-pci-bus.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/hw/s390x/s390-pci-bus.h b/include/hw/s390x/s390-pci-bus.h
index aa891c178d..2727e7bdef 100644
--- a/include/hw/s390x/s390-pci-bus.h
+++ b/include/hw/s390x/s390-pci-bus.h
@@ -313,7 +313,7 @@ typedef struct ZpciFmb {
 } ZpciFmb;
 QEMU_BUILD_BUG_MSG(offsetof(ZpciFmb, fmt0) != 48, "padding in ZpciFmb");
 
-#define ZPCI_DEFAULT_FN_GRP 0x20
+#define ZPCI_DEFAULT_FN_GRP 0xFF
 typedef struct S390PCIGroup {
 ClpRspQueryPciGrp zpci_group;
 int id;
-- 
2.27.0




[PATCH v3 3/4] s390x/pci: use the passthrough measurement update interval

2021-12-03 Thread Matthew Rosato
We may have gotten a measurement update interval from the underlying host
via vfio; use it to set the interval at which we update the function
measurement block.

Fixes: 28dc86a072 ("s390x/pci: use a PCI Group structure")
Reviewed-by: Eric Farman 
Reviewed-by: Pierre Morel 
Signed-off-by: Matthew Rosato 
---
 hw/s390x/s390-pci-inst.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/hw/s390x/s390-pci-inst.c b/hw/s390x/s390-pci-inst.c
index 11b7f6bfa1..07bab85ce5 100644
--- a/hw/s390x/s390-pci-inst.c
+++ b/hw/s390x/s390-pci-inst.c
@@ -1046,7 +1046,7 @@ static void fmb_update(void *opaque)
   sizeof(pbdev->fmb.last_update))) {
 return;
 }
-timer_mod(pbdev->fmb_timer, t + DEFAULT_MUI);
+timer_mod(pbdev->fmb_timer, t + pbdev->pci_group->zpci_group.mui);
 }
 
 int mpcifc_service_call(S390CPU *cpu, uint8_t r1, uint64_t fiba, uint8_t ar,
@@ -1204,7 +1204,8 @@ int mpcifc_service_call(S390CPU *cpu, uint8_t r1, 
uint64_t fiba, uint8_t ar,
 }
 pbdev->fmb_addr = fmb_addr;
 timer_mod(pbdev->fmb_timer,
-  qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + DEFAULT_MUI);
+  qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
+pbdev->pci_group->zpci_group.mui);
 break;
 }
 default:
-- 
2.27.0




[PATCH v3 0/4] s390x/pci: some small fixes

2021-12-03 Thread Matthew Rosato
A collection of small fixes for s390x PCI (not urgent).  The first 3 are
fixes related to always using the host-provided CLP value when provided
vs a hard-coded value.  The last patch adds logic for QEMU to report a
proper DTSM clp response rather than just 0s (guest linux doesn't look
at this field today).

Changes for v3:
- Actually fix patch 4 this time (Pierre)

Matthew Rosato (4):
  s390x/pci: use a reserved ID for the default PCI group
  s390x/pci: don't use hard-coded dma range in reg_ioat
  s390x/pci: use the passthrough measurement update interval
  s390x/pci: add supported DT information to clp response

 hw/s390x/s390-pci-bus.c |  1 +
 hw/s390x/s390-pci-inst.c| 15 +--
 hw/s390x/s390-pci-vfio.c|  1 +
 include/hw/s390x/s390-pci-bus.h |  3 ++-
 include/hw/s390x/s390-pci-clp.h |  3 ++-
 5 files changed, 15 insertions(+), 8 deletions(-)

-- 
2.27.0




Re: [PATCH v2 08/15] target/m68k: Fix address argument for EXCP_TRACE

2021-12-03 Thread Richard Henderson

On 12/2/21 12:48 PM, Richard Henderson wrote:

+static void gen_raise_exception_format2(DisasContext *s, int nr)
+{
+/*
+ * Pass the address of the insn to the exception handler,
+ * for recording in the Format $2 (6-word) stack frame.
+ * Re-use mmu.ar for the purpose, since that's only valid
+ * after tlb_fill.
+ */
+tcg_gen_st_i32(tcg_constant_i32(s->base.pc_next), cpu_env,
+   offsetof(CPUM68KState, mmu.ar));
+gen_raise_exception(nr);
+s->base.is_jmp = DISAS_NORETURN;
+}


Hmph, I think this only really works from within m68k_tr_translate_insn. But
most of the uses are from within m68k_tr_tb_stop, where we have already
advanced pc_next to the next instruction.


I'm not sure how to test this...


r~



Re: [PATCH v2 0/2] hw/arm/virt: Support for virtio-mem-pci

2021-12-03 Thread Jonathan Cameron via
On Fri,  3 Dec 2021 11:35:20 +0800
Gavin Shan  wrote:

> This series supports virtio-mem-pci device, by simply following the
> implementation on x86. The exception is the block size is 512MB on
> ARM64 instead of 128MB on x86, compatible with the memory section
> size in linux guest.
> 
> The work was done by David Hildenbrand and then Jonathan Cameron. I'm
> taking the patch forward and putting in more effort, which at the current
> stage is mostly about testing.

Hi Gavin,

Thanks for taking this forwards.  What you have here looks good to me, but
I've not looked at this for a while, so I'll go with whatever David and
others say :)

Jonathan

> 
> Testing
> ===
> The upstream linux kernel (v5.16.rc3) is used on host/guest during
> the testing. The guest kernel includes changes to enable virtio-mem
> driver, which is simply to enable CONFIG_VIRTIO_MEM on ARM64.
> 
> Multiple combinations of page sizes on host/guest, memory backend
> devices, etc. are covered in the testing. Besides, migration is also
> tested. The following command lines are used for VM or virtio-mem-pci
> device hot-add. It's notable that virtio-mem-pci device hot-remove
> isn't supported, similar to what we have on x86. 
> 
>   host.pgsize  guest.pgsize  backend   hot-add  hot-remove  migration
>   --------------------------------------------------------------------
>   4KB          4KB           normal    ok       ok          ok
>                              THP       ok       ok          ok
>                              hugeTLB   ok       ok          ok
>   4KB          64KB          normal    ok       ok          ok
>                              THP       ok       ok          ok
>                              hugeTLB   ok       ok          ok
>   64KB         4KB           normal    ok       ok          ok
>                              THP       ok       ok          ok
>                              hugeTLB   ok       ok          ok
>   64KB         64KB          normal    ok       ok          ok
>                              THP       ok       ok          ok
>                              hugeTLB   ok       ok          ok
> 
> The following command lines are used for the VM. When hugeTLBfs is used, all
> memory backend objects are populated on /dev/hugepages-2048kB or
> /dev/hugepages-524288kB, depending on the host page size.
> 
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>   -accel kvm -machine virt,gic-version=host \
>   -cpu host -smp 4,sockets=2,cores=2,threads=1 \
>   -m 1024M,slots=16,maxmem=64G \
>   -object memory-backend-ram,id=mem0,size=512M \
>   -object memory-backend-ram,id=mem1,size=512M \
>   -numa node,nodeid=0,cpus=0-1,memdev=mem0 \
>   -numa node,nodeid=1,cpus=2-3,memdev=mem1 \
>   : \
>   -kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image \
>   -initrd /home/gavin/sandbox/images/rootfs.cpio.xz \
>   -append earlycon=pl011,mmio,0x900 \
>   -device pcie-root-port,bus=pcie.0,chassis=1,id=pcie.1 \
>   -device pcie-root-port,bus=pcie.0,chassis=2,id=pcie.2 \
>   -device pcie-root-port,bus=pcie.0,chassis=3,id=pcie.3 \
>   -object memory-backend-ram,id=vmem0,size=512M \
>   -device virtio-mem-pci,id=vm0,bus=pcie.1,memdev=vmem0,node=0,requested-size=0 \
>   -object memory-backend-ram,id=vmem1,size=512M \
>   -device virtio-mem-pci,id=vm1,bus=pcie.2,memdev=vmem1,node=1,requested-size=0
> 
> Command lines used for memory hot-add and hot-remove:
> 
>   (qemu) qom-set vm1 requested-size 512M
>   (qemu) qom-set vm1 requested-size 0
>   (qemu) qom-set vm1 requested-size 512M
> 
> Command lines used for virtio-mem-pci device hot-add:
> 
>   (qemu) object_add memory-backend-ram,id=hp-mem1,size=512M
>   (qemu) device_add virtio-mem-pci,id=hp-vm1,bus=pcie.3,memdev=hp-mem1,node=1
>   (qemu) qom-set hp-vm1 requested-size 512M
>   (qemu) qom-set hp-vm1 requested-size 0
>   (qemu) qom-set hp-vm1 requested-size 512M
> 
> Changelog
> =
> v2:
>   * Include David/Jonathan as co-developers in the commit log   
> (David)
>   * Decrease VIRTIO_MEM_USABLE_EXTENT to 512MB on ARM64 in PATCH[1/2]   
> (David)
>   * PATCH[2/2] is added to correct the THP sizes on ARM64   
> (David)
> 
> Gavin Shan (2):
>   hw/arm/virt: Support for virtio-mem-pci
>   virtio-mem: Correct default THP size for ARM64
> 
>  hw/arm/Kconfig |  1 +
>  hw/arm/virt.c  | 68 +-
>  hw/virtio/virtio-mem.c | 36 

Re: [PATCH v2 1/1] multifd: Shut down the QIO channels to avoid blocking the send threads when they are terminated.

2021-12-03 Thread Daniel P. Berrangé
On Fri, Dec 03, 2021 at 12:55:33PM +0100, Li Zhang wrote:
> When doing live migration with 8, 16, or a larger number of multifd channels,
> the guest hangs in the presence of the network errors such as missing TCP 
> ACKs.
> 
> At sender's side:
> The main thread is blocked on qemu_thread_join, migration_fd_cleanup
> is called because one thread fails on qio_channel_write_all when
> the network problem happens and other send threads are blocked on sendmsg.
> They cannot be terminated. So the main thread is blocked in
> qemu_thread_join, waiting for the send threads to terminate.
> 
> (gdb) bt
> 0  0x7f30c8dcffc0 in __pthread_clockjoin_ex () at /lib64/libpthread.so.0
> 1  0x55cbb716084b in qemu_thread_join (thread=0x55cbb881f418) at 
> ../util/qemu-thread-posix.c:627
> 2  0x55cbb6b54e40 in multifd_save_cleanup () at ../migration/multifd.c:542
> 3  0x55cbb6b4de06 in migrate_fd_cleanup (s=0x55cbb8024000) at 
> ../migration/migration.c:1808
> 4  0x55cbb6b4dfb4 in migrate_fd_cleanup_bh (opaque=0x55cbb8024000) at 
> ../migration/migration.c:1850
> 5  0x55cbb7173ac1 in aio_bh_call (bh=0x55cbb7eb98e0) at 
> ../util/async.c:141
> 6  0x55cbb7173bcb in aio_bh_poll (ctx=0x55cbb7ebba80) at 
> ../util/async.c:169
> 7  0x55cbb715ba4b in aio_dispatch (ctx=0x55cbb7ebba80) at 
> ../util/aio-posix.c:381
> 8  0x55cbb7173ffe in aio_ctx_dispatch (source=0x55cbb7ebba80, 
> callback=0x0, user_data=0x0) at ../util/async.c:311
> 9  0x7f30c9c8cdf4 in g_main_context_dispatch () at 
> /usr/lib64/libglib-2.0.so.0
> 10 0x55cbb71851a2 in glib_pollfds_poll () at ../util/main-loop.c:232
> 11 0x55cbb718521c in os_host_main_loop_wait (timeout=42251070366) at 
> ../util/main-loop.c:255
> 12 0x55cbb7185321 in main_loop_wait (nonblocking=0) at 
> ../util/main-loop.c:531
> 13 0x55cbb6e6ba27 in qemu_main_loop () at ../softmmu/runstate.c:726
> 14 0x55cbb6ad6fd7 in main (argc=68, argv=0x7ffc0c57, 
> envp=0x7ffc0c578ab0) at ../softmmu/main.c:50
> 
> To make sure that the send threads can be terminated, the IO channels should
> be shut down so that the threads do not block forever waiting on IO.
> 
> Signed-off-by: Li Zhang 
> ---
>  migration/multifd.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 7c9deb1921..33f8287969 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -523,6 +523,9 @@ static void multifd_send_terminate_threads(Error *err)
>  qemu_mutex_lock(&p->mutex);
>  p->quit = true;
>  qemu_sem_post(&p->sem);
> +if (p->c) {
> +qio_channel_shutdown(p->c, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
> +}
>  qemu_mutex_unlock(&p->mutex);
>  }
>  }

Reviewed-by: Daniel P. Berrangé 


Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|




[PATCH v4 16/19] iotest 214: explicit compression type

2021-12-03 Thread Vladimir Sementsov-Ogievskiy
The test-case "Corrupted size field in compressed cluster descriptor"
heavily depends on the zlib compression type, so make it explicit. This
way test passes with IMGOPTS='compression_type=zstd'.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Max Reitz 
---
 tests/qemu-iotests/214 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/214 b/tests/qemu-iotests/214
index 0889089d81..c66e246ba2 100755
--- a/tests/qemu-iotests/214
+++ b/tests/qemu-iotests/214
@@ -51,7 +51,7 @@ echo
 # The L2 entries of the two compressed clusters are located at
 # 0x80 and 0x88, their original values are 0x400800a0
 # and 0x400800a00802 (5 sectors for compressed data each).
-_make_test_img 8M -o cluster_size=2M
+_make_test_img 8M -o cluster_size=2M,compression_type=zlib
 $QEMU_IO -c "write -c -P 0x11 0 2M" -c "write -c -P 0x11 2M 2M" "$TEST_IMG" \
  2>&1 | _filter_qemu_io | _filter_testdir
 
-- 
2.31.1




[PATCH v4 15/19] iotests 60: more accurate set dirty bit in qcow2 header

2021-12-03 Thread Vladimir Sementsov-Ogievskiy
Don't touch other incompatible bits, like compression-type. This makes
the test pass with IMGOPTS='compression_type=zstd'.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Max Reitz 
---
 tests/qemu-iotests/060 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/060 b/tests/qemu-iotests/060
index d1e3204d4e..df87d600f7 100755
--- a/tests/qemu-iotests/060
+++ b/tests/qemu-iotests/060
@@ -326,7 +326,7 @@ _make_test_img 64M
 # Let the refblock appear unaligned
poke_file "$TEST_IMG" "$rt_offset" "\x00\x00\x00\x00\xff\xff\x2a\x00"
 # Mark the image dirty, thus forcing an automatic check when opening it
-poke_file "$TEST_IMG" 72 "\x00\x00\x00\x00\x00\x00\x00\x01"
+$PYTHON qcow2.py "$TEST_IMG" set-feature-bit incompatible 0
 # Open the image (qemu should refuse to do so)
 $QEMU_IO -c close "$TEST_IMG" 2>&1 | _filter_testdir | _filter_imgfmt
 
-- 
2.31.1




[PATCH v4 14/19] iotests: bash tests: filter compression type

2021-12-03 Thread Vladimir Sementsov-Ogievskiy
We want iotests pass with both the default zlib compression and with
IMGOPTS='compression_type=zstd'.

Actually the only test that is interested in real compression type in
test output is 287 (the test for qcow2 compression type), so implement a
specific option for it.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Hanna Reitz 
---
 tests/qemu-iotests/060.out   |  2 +-
 tests/qemu-iotests/061.out   | 12 ++--
 tests/qemu-iotests/082.out   | 14 +++---
 tests/qemu-iotests/198.out   |  4 ++--
 tests/qemu-iotests/287   |  8 
 tests/qemu-iotests/common.filter |  8 
 tests/qemu-iotests/common.rc | 14 +-
 7 files changed, 41 insertions(+), 21 deletions(-)

diff --git a/tests/qemu-iotests/060.out b/tests/qemu-iotests/060.out
index b74540bafb..329977d9b9 100644
--- a/tests/qemu-iotests/060.out
+++ b/tests/qemu-iotests/060.out
@@ -17,7 +17,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 corrupt: true
diff --git a/tests/qemu-iotests/061.out b/tests/qemu-iotests/061.out
index 7ecbd4dea8..139fc68177 100644
--- a/tests/qemu-iotests/061.out
+++ b/tests/qemu-iotests/061.out
@@ -525,7 +525,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 data file: TEST_DIR/t.IMGFMT.data
@@ -552,7 +552,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 data file: foo
@@ -567,7 +567,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 data file raw: false
@@ -583,7 +583,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 data file: TEST_DIR/t.IMGFMT.data
@@ -597,7 +597,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 data file: TEST_DIR/t.IMGFMT.data
@@ -612,7 +612,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 data file: TEST_DIR/t.IMGFMT.data
diff --git a/tests/qemu-iotests/082.out b/tests/qemu-iotests/082.out
index 077ed0f2c7..d0dd333117 100644
--- a/tests/qemu-iotests/082.out
+++ b/tests/qemu-iotests/082.out
@@ -17,7 +17,7 @@ virtual size: 128 MiB (134217728 bytes)
 cluster_size: 4096
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: true
 refcount bits: 16
 corrupt: false
@@ -31,7 +31,7 @@ virtual size: 128 MiB (134217728 bytes)
 cluster_size: 8192
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: true
 refcount bits: 16
 corrupt: false
@@ -329,7 +329,7 @@ virtual size: 128 MiB (134217728 bytes)
 cluster_size: 4096
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: true
 refcount bits: 16
 corrupt: false
@@ -342,7 +342,7 @@ virtual size: 128 MiB (134217728 bytes)
 cluster_size: 8192
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: true
 refcount bits: 16
 corrupt: false
@@ -639,7 +639,7 @@ virtual size: 128 MiB (134217728 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: true
 refcount bits: 16
 corrupt: false
@@ -652,7 +652,7 @@ virtual size: 130 MiB (136314880 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: false
 refcount bits: 16
 corrupt: false
@@ -665,7 +665,7 @@ virtual size: 132 MiB (138412032 bytes)
 cluster_size: 65536
 Format specific information:
 compat: 1.1
-compression type: zlib
+compression type: COMPRESSION_TYPE
 lazy refcounts: true
 refcount bits:
