Re: [PATCH for-9.1 6/9] block/nbd: Use URI parsing code from glib

2024-09-24 Thread Richard W.M. Jones
On Tue, Sep 24, 2024 at 09:52:04AM +0200, Thomas Huth wrote:
> On 23/09/2024 18.03, Eric Blake wrote:
> >On Sun, Sep 22, 2024 at 08:51:22PM GMT, Richard W.M. Jones wrote:
> >>On Thu, Mar 28, 2024 at 02:13:42PM +0000, Richard W.M. Jones wrote:
> >>>On Thu, Mar 28, 2024 at 03:06:03PM +0100, Thomas Huth wrote:
> >>>>Since version 2.66, glib has useful URI parsing functions, too.
> >>>>Use those instead of the QEMU-internal ones to be finally able
> >>>>to get rid of the latter. The g_uri_get_host() also takes care
> >>>>of removing the square brackets from IPv6 addresses, so we can
> >>>>drop that part of the QEMU code now, too.
> >>>>
> >
> >>>>-p = uri->path ? uri->path : "";
> >>>>+p = g_uri_get_path(uri) ?: "";
> >>>>  if (p[0] == '/') {
> >>>>  p++;
> >>>>  }
> >
> >>>Looks ok,
> >>>
> >>>Reviewed-by: Richard W.M. Jones 
> >>
> >>Or maybe not.  This caused a regression in the nbdkit test suite (when
> >>we use qemu-img from 9.1).  It seems the exportname part of the NBD
> >>URI gets munged:
> >>
> >>https://gitlab.com/qemu-project/qemu/-/issues/2584
> >
> >To be more specific, it looks like
> >g_uri_get_path("./name//with//..//slashes") is getting munged to
> >"name/slashes".  That is, glib is blindly assuming that ./ and XXX/../
> >can be dropped, and // can be simplified to /, which may be true for
> >arbitrary file names but not true for arbitrary URIs (since URIs have
> >application-specific semantics, which may not match path name
> >traversal semantics).  Looks like we need to report a bug to glib,
> >and/or see if glib's URI functions have a flag for turning off this
> >unwanted munging.
> >
> >Or we may just want to document this corner case change as
> >intentional.
> 
> Ok ... so how bad is this for NBD? Can we go along with the oddity
> or is this breaking some real world NBD scenarios?

After sleeping on it, I don't think it's serious at all.  It's also
permitted (albeit not required) by the URI RFCs, so other
implementations in future might do the same thing.
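
For reference, a minimal reproducer (my own sketch, assuming glib >=
2.66) showing the normalization glib applies while parsing:

  /* Build: gcc uri-demo.c $(pkg-config --cflags --libs glib-2.0) */
  #include <stdio.h>
  #include <glib.h>

  int main(void)
  {
      GUri *uri = g_uri_parse("nbd://localhost/./name//with//..//slashes",
                              G_URI_FLAGS_NONE, NULL);
      if (!uri) {
          return 1;
      }
      /* Per the behaviour observed above, this prints "/name/slashes",
       * not the literal "/./name//with//..//slashes". */
      printf("path = %s\n", g_uri_get_path(uri));
      g_uri_unref(uri);
      return 0;
  }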

> ... in the worst case, we have to revert the patch ...

I closed the bug just now and have decided to go with a documentation
update instead.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top




Re: [PATCH for-9.1 6/9] block/nbd: Use URI parsing code from glib

2024-09-23 Thread Richard W.M. Jones
On Mon, Sep 23, 2024 at 06:38:27PM +0200, Daniel P. Berrangé wrote:
> On Mon, Sep 23, 2024 at 11:03:08AM -0500, Eric Blake wrote:
> > On Sun, Sep 22, 2024 at 08:51:22PM GMT, Richard W.M. Jones wrote:
> > > On Thu, Mar 28, 2024 at 02:13:42PM +0000, Richard W.M. Jones wrote:
> > > > On Thu, Mar 28, 2024 at 03:06:03PM +0100, Thomas Huth wrote:
> > > > > Since version 2.66, glib has useful URI parsing functions, too.
> > > > > Use those instead of the QEMU-internal ones to be finally able
> > > > > to get rid of the latter. The g_uri_get_host() also takes care
> > > > > of removing the square brackets from IPv6 addresses, so we can
> > > > > drop that part of the QEMU code now, too.
> > > > > 
> > 
> > > > >  
> > > > > -p = uri->path ? uri->path : "";
> > > > > +p = g_uri_get_path(uri) ?: "";
> > > > >  if (p[0] == '/') {
> > > > >  p++;
> > > > >  }
> > 
> > > > Looks ok,
> > > >
> > > > Reviewed-by: Richard W.M. Jones 
> > > 
> > > Or maybe not.  This caused a regression in the nbdkit test suite (when
> > > we use qemu-img from 9.1).  It seems the exportname part of the NBD
> > > URI gets munged:
> > > 
> > > https://gitlab.com/qemu-project/qemu/-/issues/2584
> > 
> > To be more specific, it looks like
> > g_uri_get_path("./name//with//..//slashes") is getting munged to
> > "name/slashes".  That is, glib is blindly assuming that ./ and XXX/../
> > can be dropped, and // can be simplified to /, which may be true for
> > arbitrary file names but not true for arbitrary URIs (since URIs have
> > application-specific semantics, which may not match path name
> > traversal semantics).  Looks like we need to report a bug to glib,
> > and/or see if glib's URI functions have a flag for turning off this
> > unwanted munging.
> 
> The source code indicates it is doing some normalization
> based on this:
> 
>   https://datatracker.ietf.org/doc/html/rfc3986#section-5.2.4

I wrote a bit about this in the bug:

https://gitlab.com/qemu-project/qemu/-/issues/2584#note_2125192404

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org




Re: [PATCH for-9.1 6/9] block/nbd: Use URI parsing code from glib

2024-09-22 Thread Richard W.M. Jones
On Thu, Mar 28, 2024 at 02:13:42PM +0000, Richard W.M. Jones wrote:
> On Thu, Mar 28, 2024 at 03:06:03PM +0100, Thomas Huth wrote:
> > Since version 2.66, glib has useful URI parsing functions, too.
> > Use those instead of the QEMU-internal ones to be finally able
> > to get rid of the latter. The g_uri_get_host() also takes care
> > of removing the square brackets from IPv6 addresses, so we can
> > drop that part of the QEMU code now, too.
> > 
> > Signed-off-by: Thomas Huth 
> > ---
> >  block/nbd.c | 66 ++---
> >  1 file changed, 38 insertions(+), 28 deletions(-)
> > 
> > diff --git a/block/nbd.c b/block/nbd.c
> > index ef05f7cdfd..95b507f872 100644
> > --- a/block/nbd.c
> > +++ b/block/nbd.c
> > @@ -31,7 +31,6 @@
> >  #include "qemu/osdep.h"
> >  
> >  #include "trace.h"
> > -#include "qemu/uri.h"
> >  #include "qemu/option.h"
> >  #include "qemu/cutils.h"
> >  #include "qemu/main-loop.h"
> > @@ -1514,30 +1513,34 @@ static void nbd_client_close(BlockDriverState *bs)
> >  
> >  static int nbd_parse_uri(const char *filename, QDict *options)
> >  {
> > -URI *uri;
> > +GUri *uri;
> >  const char *p;
> > -QueryParams *qp = NULL;
> > +GHashTable *qp = NULL;
> > +int qp_n;
> >  int ret = 0;
> >  bool is_unix;
> > +const char *uri_scheme, *uri_query, *uri_server;
> > +int uri_port;
> >  
> > -uri = uri_parse(filename);
> > +uri = g_uri_parse(filename, G_URI_FLAGS_NONE, NULL);
> >  if (!uri) {
> >  return -EINVAL;
> >  }
> >  
> >  /* transport */
> > -if (!g_strcmp0(uri->scheme, "nbd")) {
> > +uri_scheme = g_uri_get_scheme(uri);
> > +if (!g_strcmp0(uri_scheme, "nbd")) {
> >  is_unix = false;
> > -} else if (!g_strcmp0(uri->scheme, "nbd+tcp")) {
> > +} else if (!g_strcmp0(uri_scheme, "nbd+tcp")) {
> >  is_unix = false;
> > -} else if (!g_strcmp0(uri->scheme, "nbd+unix")) {
> > +} else if (!g_strcmp0(uri_scheme, "nbd+unix")) {
> >  is_unix = true;
> >  } else {
> >  ret = -EINVAL;
> >  goto out;
> >  }
> >  
> > -p = uri->path ? uri->path : "";
> > +p = g_uri_get_path(uri) ?: "";
> >  if (p[0] == '/') {
> >  p++;
> >  }
> > @@ -1545,51 +1548,58 @@ static int nbd_parse_uri(const char *filename, 
> > QDict *options)
> >  qdict_put_str(options, "export", p);
> >  }
> >  
> > -qp = query_params_parse(uri->query);
> > -if (qp->n > 1 || (is_unix && !qp->n) || (!is_unix && qp->n)) {
> > -ret = -EINVAL;
> > -goto out;
> > +uri_query = g_uri_get_query(uri);
> > +if (uri_query) {
> > +qp = g_uri_parse_params(uri_query, -1, "&", G_URI_PARAMS_NONE, 
> > NULL);
> > +if (!qp) {
> > +ret = -EINVAL;
> > +goto out;
> > +}
> > +qp_n = g_hash_table_size(qp);
> > +if (qp_n > 1 || (is_unix && !qp_n) || (!is_unix && qp_n)) {
> > +ret = -EINVAL;
> > +goto out;
> > +}
> > + }
> > +
> > +uri_server = g_uri_get_host(uri);
> > +if (uri_server && !uri_server[0]) {
> > +uri_server = NULL;
> >  }
> > +uri_port = g_uri_get_port(uri);
> >  
> >  if (is_unix) {
> >  /* nbd+unix:///export?socket=path */
> > -if (uri->server || uri->port || strcmp(qp->p[0].name, "socket")) {
> > +const char *uri_socket = g_hash_table_lookup(qp, "socket");
> > +if (uri_server || uri_port != -1 || !uri_socket) {
> >  ret = -EINVAL;
> >  goto out;
> >  }
> >  qdict_put_str(options, "server.type", "unix");
> > -qdict_put_str(options, "server.path", qp->p[0].value);
> > +qdict_put_str(options, "server.path", uri_socket);
> >  } else {
> > -QString *host;
> >  char *port_str;
> >  
> >  /* nbd[+tcp]://host[:port]/export */
> > -if (!uri->server) {
> > +if (!uri_server) 

Re: [PATCH 01/39] docs/spin: replace assert(0) with g_assert_not_reached()

2024-09-11 Thread Richard W.M. Jones
On Wed, Sep 11, 2024 at 02:46:18PM +0200, Maciej S. Szmigiero wrote:
> On 11.09.2024 14:37, Eric Blake wrote:
> >On Wed, Sep 11, 2024 at 07:33:59AM GMT, Eric Blake wrote:
> >>On Tue, Sep 10, 2024 at 03:15:28PM GMT, Pierrick Bouvier wrote:
> >>>Signed-off-by: Pierrick Bouvier 
> >>>---
> >>
> >>A general suggestion for the entire series: please use a commit
> >>message that explains why this is a good idea.  Even something as
> >>boiler-plate as "refer to commit XXX for rationale" that can be
> >>copy-pasted into all the other commits is better than nothing,
> >>although a self-contained message is best.  Maybe:
> >>
> >>This patch is part of a series that moves towards a consistent use of
> >>g_assert_not_reached() rather than an ad hoc mix of different
> >>assertion mechanisms.
> >
> >Or summarize your cover letter:
> >
> >Use of assert(false) can trip spurious control flow warnings from some
> >versions of gcc:
> >https://lore.kernel.org/qemu-devel/54bb02a6-1b12-460a-97f6-3f478ef76...@linaro.org/
> >Solve that by unifying the code base on g_assert_not_reached()
> >instead.
> >
> 
> If using g_assert_not_reached() instead of assert(false) silences
> the warning about missing return value in such impossible to reach
> locations should we also be deleting the now-unnecessary "return"
> statements after g_assert_not_reached()?

Although it's unlikely to be used on any compiler that can also
compile qemu, there is a third implementation of g_assert_not_reached
that does nothing, see:

https://gitlab.gnome.org/GNOME/glib/-/blob/927683ebd94eb66c0d7868b77863f57ce9c5bc76/glib/gtestutils.h#L269
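
So on such a build the statement after the macro is not dead code.  A
sketch of the concern (my example; the function name is made up, and it
assumes a glib built with G_DISABLE_ASSERT, where g_assert_not_reached()
expands to a no-op):

  #include <glib.h>

  static int mode_to_fd(int mode)
  {
      switch (mode) {
      case 0:
          return 1;   /* stdout */
      case 1:
          return 2;   /* stderr */
      default:
          g_assert_not_reached();
          return -1;  /* only reachable when the macro is a no-op */
      }
  }

  int main(void)
  {
      return mode_to_fd(0) == 1 ? 0 : 1;
  }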

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html




Re: [PATCH 25/39] block: remove break after g_assert_not_reached()

2024-09-11 Thread Richard W.M. Jones
On Tue, Sep 10, 2024 at 03:15:52PM -0700, Pierrick Bouvier wrote:
> Signed-off-by: Pierrick Bouvier 
> ---
>  block/ssh.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/block/ssh.c b/block/ssh.c
> index 27d582e0e3d..871e1d47534 100644
> --- a/block/ssh.c
> +++ b/block/ssh.c
> @@ -474,7 +474,6 @@ static int check_host_key(BDRVSSHState *s, 
> SshHostKeyCheck *hkc, Error **errp)
> errp);
>  }
>  g_assert_not_reached();
> -break;
>  case SSH_HOST_KEY_CHECK_MODE_KNOWN_HOSTS:
>  return check_host_key_knownhosts(s, errp);
>      default:
> -- 
> 2.39.2

Reviewed-by: Richard W.M. Jones 

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW




Re: [PATCH 07/39] hw/watchdog: replace assert(0) with g_assert_not_reached()

2024-09-11 Thread Richard W.M. Jones
On Tue, Sep 10, 2024 at 03:15:34PM -0700, Pierrick Bouvier wrote:
> Signed-off-by: Pierrick Bouvier 
> ---
>  hw/watchdog/watchdog.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/hw/watchdog/watchdog.c b/hw/watchdog/watchdog.c
> index 955046161bf..d0ce3c4ac55 100644
> --- a/hw/watchdog/watchdog.c
> +++ b/hw/watchdog/watchdog.c
> @@ -85,7 +85,7 @@ void watchdog_perform_action(void)
>  break;
>  
>  default:
> -assert(0);
> +g_assert_not_reached();
>  }
>  }
>  

Reviewed-by: Richard W.M. Jones 

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
nbdkit - Flexible, fast NBD server with plugins
https://gitlab.com/nbdkit/nbdkit




Re: [PATCH for-9.1 0/2] NBD: don't print raw server error text to terminal

2024-08-05 Thread Richard W.M. Jones
On Mon, Aug 05, 2024 at 01:48:12PM -0500, Eric Blake wrote:
> On Fri, Aug 02, 2024 at 02:26:04PM GMT, Eric Blake wrote:
> > I've requested a CVE from Red Hat, and hope to have an assigned number
> > soon.  Meanwhile, we can get review started, to make sure this is
> > ready to include in 9.1.  'qemu-img info' should never print untrusted
> > data in a way that might take over a user's terminal.
> > 
> > There are probably other spots where qemu-img info is printing
> > untrusted data (such as filenames), where we probably ought to use the
> > same sanitization tactics as shown here.  Identifying those spots
> > would be a useful part of this review, and may mean a v2 that is even
> > more extensive in the number of patches.
> 
> In fact, should we insist that 'qemu-img info XXX' refuse to accept
> any control characters on any command-line filename, and that it
> reject any backing file name with control characters as a malformed
> qcow2 file?  For reference, we JUST fixed qemu-img info to change
> qcow2 files with a claimed backing file of json:... to favor the local
> file ./json:... over the potentially dangerous user-controlled
> format/protocol description, so we are _already_ changing how strict
> qemu-img is for 9.1, and adding one more restriction to avoid
> escape-sequence madness makes sense.
> 
> Note that with:
> 
> touch $'\e[m' && qemu-img info --output=json $'\e[m'
> 
> we already escape our output, but without --output=json, we are
> vulnerable to user-controlled ESC leaking through to stdout for more
> than just the NBD server errors that I addressed in v1 of this patch
> series.  Hence my question on whether v2 of the series should touch
> more places in the code, and whether doing something like flat-out
> refusing users stupid enough to embed control characters in their
> filenames is a safe change this close to release.

You could say if someone gives you a "malicious" text file which you
cat to stdout, it could change your terminal settings.  I don't think
therefore any of this is very serious.  We should probably fix any
obvious things, but it doesn't need to happen right before 9.1 is
released, we can take our time.
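
(To illustrate: a plain text file containing the bytes "\033]0;hi\007"
will silently retitle an xterm-compatible terminal when cat'd -
annoying, but hardly remote code execution.)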

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top




Re: [PATCH 2/2] qemu-img: CVE-XXX Sanitize untrusted output from NBD server

2024-08-03 Thread Richard W.M. Jones
On Fri, Aug 02, 2024 at 11:01:36PM +0100, Richard W.M. Jones wrote:
> On Fri, Aug 02, 2024 at 02:26:06PM -0500, Eric Blake wrote:
> > Error messages from an NBD server must be treated as untrusted; a
> > malicious server can inject escape sequences to try and trigger RCE
> > flaws via escape sequences to whatever terminal happens to be running
> > qemu-img.
> 
> This presentation is relevant:
> 
> https://dgl.cx/2023/09/ansi-terminal-security

This took way too long, but ...

$ wget http://oirase.annexia.org/tmp/nyan.c
$ nbdkit --log=null cc /tmp/nyan.c --run 'qemu-img info "$uri"'

Needs nbdkit >= 1.40, and don't worry, it doesn't exploit the terminal
except for silly internet memes.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html




Re: [PATCH 2/2] qemu-img: CVE-XXX Sanitize untrusted output from NBD server

2024-08-02 Thread Richard W.M. Jones
On Fri, Aug 02, 2024 at 02:26:06PM -0500, Eric Blake wrote:
> Error messages from an NBD server must be treated as untrusted; a
> malicious server can inject escape sequences to try and trigger RCE
> flaws via escape sequences to whatever terminal happens to be running
> qemu-img.

This presentation is relevant:

https://dgl.cx/2023/09/ansi-terminal-security

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org




Re: [PATCH 2/2] qemu-img: CVE-XXX Sanitize untrusted output from NBD server

2024-08-02 Thread Richard W.M. Jones
On Fri, Aug 02, 2024 at 02:26:06PM -0500, Eric Blake wrote:
> Error messages from an NBD server must be treated as untrusted; a
> malicious server can inject escape sequences to try and trigger RCE
> flaws via escape sequences to whatever terminal happens to be running
> qemu-img.  The easiest solution is to sanitize the output with the
> same code we use to produce sanitized (pseudo-)JSON over QMP.
> 
> Rich Jones originally pointed this flaw out at:
> https://lists.libguestfs.org/archives/list/gues...@lists.libguestfs.org/thread/2NXA23G2V3HPWJYAO726PLNBEAAEUJAU/
> 
> With this patch, and a malicious server run with nbdkit 1.40 as:
> 
> $ nbdkit --log=null eval open=' printf \
>   "EPERM x\\r mess up the output \e[31mmess up the output\e[m mess up" >&2; \
>   exit 1 ' get_size=' echo 0 ' --run 'qemu-img info "$uri"'
> 
> we now get:
> 
> qemu-img: Could not open 'nbd://localhost': Requested export not available
> server reported: /tmp/nbdkitOZHOKB/open: x\r mess up the output 
> \u001B[31mmess up the output\u001B[m mess up
> 
> instead of an attempt to hide the name of the Unix socket and forcing
> the terminal to render part of the text red.
> 
> Note that I did _not_ sanitize the string being sent through
> trace-events in trace_nbd_server_error_msg; this is because I assume
> that our trace engines already treat all strings as untrusted
> input and apply their own escaping as needed.
> 
> Reported-by: "Richard W.M. Jones" 
> Signed-off-by: Eric Blake 
> 
> ---
> 
> If my assumption about allowing raw escape bytes through to trace_
> calls is wrong (such as when tracing to stderr), let me know.  That's
> a much bigger audit to determine which trace points, if any, should
> sanitize data before tracing, and/or change the trace engines to
> sanitize all strings (with possible knock-on effects if trace output
> changes unexpectedly for a tool expecting something unsanitized).
> ---
>  nbd/client.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/nbd/client.c b/nbd/client.c
> index c89c7504673..baa20d10d69 100644
> --- a/nbd/client.c
> +++ b/nbd/client.c
> @@ -23,6 +23,7 @@
>  #include "trace.h"
>  #include "nbd-internal.h"
>  #include "qemu/cutils.h"
> +#include "qemu/unicode.h"
> 
>  /* Definitions for opaque data types */
> 
> @@ -230,7 +231,9 @@ static int nbd_handle_reply_err(QIOChannel *ioc, 
> NBDOptionReply *reply,
>  }
> 
>  if (msg) {
> -error_append_hint(errp, "server reported: %s\n", msg);
> +g_autoptr(GString) buf = g_string_sized_new(reply->length);
> +mod_utf8_sanitize(buf, msg);
> +error_append_hint(errp, "server reported: %s\n", buf->str);
>  }

Reviewed-by: Richard W.M. Jones 

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
nbdkit - Flexible, fast NBD server with plugins
https://gitlab.com/nbdkit/nbdkit




Re: [PATCH 1/2] util: Refactor json-writer's string sanitizer to be public

2024-08-02 Thread Richard W.M. Jones
On Fri, Aug 02, 2024 at 02:26:05PM -0500, Eric Blake wrote:
> My next patch needs to convert text from an untrusted input into an
> output representation that is suitable for display on a terminal;
> such a sanitizer is useful to more than just the json-writer.  The
> text should normally be UTF-8, but blindly allowing all Unicode code
> points (including ASCII ESC) through to a terminal risks
> remote-code-execution attacks on some terminals.  Extract the
> existing body of json-writer's quoted_str into a new helper routine
> mod_utf8_sanitize, and generalize it to also work
> on data that is length-limited rather than NUL-terminated.  [I was
> actually surprised that glib does not have such a sanitizer already -
> Google turns up lots of examples of rolling your own string
> sanitizer.]
> 
> If desired in the future, we may want to tweak whether the output is
> guaranteed to be ASCII (using lots of \u escape sequences, including
> surrogate pairs for code points outside the BMP) or if we are okay
> passing printable Unicode through (we still need to escape control
> characters).  But for now, I went for minimal code churn, including
> the fact that the resulting function allows a non-UTF-8 2-byte synonym
> for U+0000.
> 
> Signed-off-by: Eric Blake 

I have to admit I'd never heard of "modified UTF-8" before, but it's a
thing:

https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
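
As an aside, here is a worked instance (my own example, not from the
patch) of the surrogate-pair arithmetic in quoted_str below: for
U+1F600, cp - 0x10000 = 0xF600, so the high surrogate is
0xD800 + (0xF600 >> 10) = 0xD83D and the low surrogate is
0xDC00 + (0xF600 & 0x3FF) = 0xDE00, i.e. the writer emits
\uD83D\uDE00.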

As the patch is almost a simple code motion:

Reviewed-by: Richard W.M. Jones 

Rich.

> ---
>  include/qemu/unicode.h |  3 ++
>  qobject/json-writer.c  | 47 +--
>  util/unicode.c | 84 ++
>  3 files changed, 88 insertions(+), 46 deletions(-)
> 
> diff --git a/include/qemu/unicode.h b/include/qemu/unicode.h
> index 7fa10b8e604..e1013b29f72 100644
> --- a/include/qemu/unicode.h
> +++ b/include/qemu/unicode.h
> @@ -4,4 +4,7 @@
>  int mod_utf8_codepoint(const char *s, size_t n, char **end);
>  ssize_t mod_utf8_encode(char buf[], size_t bufsz, int codepoint);
> 
> +void mod_utf8_sanitize(GString *buf, const char *str);
> +void mod_utf8_sanitize_len(GString *buf, const char *str, ssize_t len);
> +
>  #endif
> diff --git a/qobject/json-writer.c b/qobject/json-writer.c
> index 309a31d57a9..5c14574efee 100644
> --- a/qobject/json-writer.c
> +++ b/qobject/json-writer.c
> @@ -104,53 +104,8 @@ static void pretty_newline_or_space(JSONWriter *writer)
> 
>  static void quoted_str(JSONWriter *writer, const char *str)
>  {
> -const char *ptr;
> -char *end;
> -int cp;
> -
>  g_string_append_c(writer->contents, '"');
> -
> -for (ptr = str; *ptr; ptr = end) {
> -cp = mod_utf8_codepoint(ptr, 6, &end);
> -switch (cp) {
> -case '\"':
> -g_string_append(writer->contents, "\\\"");
> -break;
> -case '\\':
> -g_string_append(writer->contents, "\\\\");
> -break;
> -case '\b':
> -g_string_append(writer->contents, "\\b");
> -break;
> -case '\f':
> -g_string_append(writer->contents, "\\f");
> -break;
> -case '\n':
> -g_string_append(writer->contents, "\\n");
> -break;
> -case '\r':
> -g_string_append(writer->contents, "\\r");
> -break;
> -case '\t':
> -g_string_append(writer->contents, "\\t");
> -break;
> -default:
> -if (cp < 0) {
> -cp = 0xFFFD; /* replacement character */
> -}
> -if (cp > 0xFFFF) {
> -/* beyond BMP; need a surrogate pair */
> -g_string_append_printf(writer->contents, "\\u%04X\\u%04X",
> -   0xD800 + ((cp - 0x10000) >> 10),
> -   0xDC00 + ((cp - 0x10000) & 0x3FF));
> -} else if (cp < 0x20 || cp >= 0x7F) {
> -g_string_append_printf(writer->contents, "\\u%04X", cp);
> -} else {
> -g_string_append_c(writer->contents, cp);
> -}
> -}
> -};
> -
> +mod_utf8_sanitize(writer->contents, str);
>  g_string_append_c(writer->contents, '"');
>  }
> 
> diff --git a/util/unicode.c b/util/unicode.c
> index 8580bc598b3..a419ed4de35 100644
> --- a/util/unicode.c
> +++ b/util/unicode.c
> @@ -154,3 +154,87 @@ ssize_t mod_ut

Re: [Libguestfs] Passing block size preferences from NBD -> qemu -> Linux

2024-07-15 Thread Richard W.M. Jones
On Mon, Jul 15, 2024 at 09:44:08AM -0500, Eric Blake wrote:
> [adding qemu-block in cc]
> 
> On Sat, Jul 13, 2024 at 03:40:36PM GMT, Richard W.M. Jones wrote:
> > This is expanding on the commit message I wrote here:
> > 
> > https://gitlab.com/nbdkit/nbdkit/-/commit/780599d2e77c7cc4c1a7e99d0a933289761a9b27
> > 
> > A simple "one-liner" to test if NBD block size preferences are passed
> > correctly through qemu and into a Linux guest is this:
> > 
> >   $ nbdkit memory 1G --filter=blocksize-policy \
> >blocksize-minimum=4096 \
> >blocksize-preferred=65536 \
> >blocksize-maximum=8M \
> > --run '
> >   LIBGUESTFS_HV=/path/to/qemu-system-x86_64 \
> >   LIBGUESTFS_BACKEND=direct \
> >   guestfish --format=raw -a "$uri" \
> > run : \
> > debug sh "head -1 /sys/block/*/queue/*_io_size" : \
> > debug sh "for d in /dev/sd? ; do sg_inq -p 0xb0 \$d ; done" \
> > '
> > 
> > Current qemu (9.0.0) does not pass the block size preferences
> > correctly.  It's a problem in qemu, not in Linux.
> > 
> > qemu's NBD client requests the block size preferences from nbdkit and
> > reads them correctly.  I verified this by adding some print statements
> > into nbd/client.c.  The sizes are stored in BDRVNBDState 'info' field.
> > 
> > qemu's virtio-scsi driver *can* present a block limits VPD page (0xb0)
> > containing these limits (see hw/scsi/scsi-disk.c), and Linux is able
> > to see the contents of this page using tools like 'sg_inq'.  Linux
> > appears to translate the information faithfully into
> > /sys/block/sdX/queue/{minimum,optimal}_io_size files.
> > 
> > However the virtio-scsi driver in qemu populates this information from
> > the qemu command line (-device [...]min_io_size=512,opt_io_size=4096).
> > It doesn't pass the information through from the NBD source backing
> > the drive.
> 
> Is guestfish the one synthesizing the '-device min_io_size=512' used
> by qemu?

We don't add those parameters because we want it passed through from
the backing NBD device.  We do generate the -drive/-device parameter
(or do it via libvirt).  Here's an example to make this clearer:

  $ nbdkit memory 1G --run '
  LIBGUESTFS_BACKEND=direct guestfish -vx --format=raw -a "$uri" run
'
  ...
  /usr/bin/qemu-kvm \
  ...
  -drive 
file=nbd:unix:/tmp/nbdkitiJ5lz0/socket,cache=writeback,format=raw,id=hd0,if=none
 \
  -device scsi-hd,drive=hd0 \
  ...

> I don't see it in the nbdkit command line posted above.  Or
> is guestfish leaving it up to qemu to advertise its defaults, and this
> is merely a case of qemu favoring its defaults over what the device
> advertised?

qemu's NBD client fetches the I/O size preferences from nbdkit and
stores them in BDRVNBDState->info, but they get no further than that
inside qemu.  I expected they'd be passed through to the virtio-scsi
device and from there into the guest.

> > Fixing this seems like a non-trivial amount of work.
> 
> Indeed, if guestfish is passing command-line defaults for qemu to use,
> we have to determine when to prioritize hardware advertisements over
> command-line defaults, while still maintaining flexibility to
> intentionally pick different sizes than what hardware advertised for
> the purposes of performance testing.

We're not setting anything on the command line.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org




Re: [PATCH for-9.1 6/9] block/nbd: Use URI parsing code from glib

2024-04-04 Thread Richard W.M. Jones
On Thu, Mar 28, 2024 at 04:40:10PM +, Richard W.M. Jones wrote:
> libnbd absolutely does *not* get this right, eg:
> 
>   $ nbdinfo NBD://localhost
>   nbdinfo: nbd_connect_uri: unknown NBD URI scheme: NBD: Invalid argument
> 
> so that's a bug too.

Proposed fix:
https://gitlab.com/nbdkit/libnbd/-/merge_requests/6

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top




Re: [PATCH for-9.1 6/9] block/nbd: Use URI parsing code from glib

2024-03-28 Thread Richard W.M. Jones
On Thu, Mar 28, 2024 at 10:06:01AM -0500, Eric Blake wrote:
> Adjusting cc list to add upstream NBD and drop developers unrelated to
> this part of the qemu series...
> 
> On Thu, Mar 28, 2024 at 02:13:42PM +0000, Richard W.M. Jones wrote:
> > On Thu, Mar 28, 2024 at 03:06:03PM +0100, Thomas Huth wrote:
> > > Since version 2.66, glib has useful URI parsing functions, too.
> > > Use those instead of the QEMU-internal ones to be finally able
> > > to get rid of the latter. The g_uri_get_host() also takes care
> > > of removing the square brackets from IPv6 addresses, so we can
> > > drop that part of the QEMU code now, too.
> > > 
> 
> > >  
> > >  if (is_unix) {
> > >  /* nbd+unix:///export?socket=path */
> > > -if (uri->server || uri->port || strcmp(qp->p[0].name, "socket")) 
> > > {
> > > +const char *uri_socket = g_hash_table_lookup(qp, "socket");
> > > +if (uri_server || uri_port != -1 || !uri_socket) {
> > >  ret = -EINVAL;
> > >  goto out;
> > >  }
> 
> The spec for NBD URIs is at:
> 
> https://github.com/NetworkBlockDevice/nbd/blob/master/doc/uri.md
> 
> Should any of this spec mention case-insensitive concerns, such as
> whether 'NBD://' may be equivalent to 'nbd://', and whether
> 'nbd+unix:///?socket=x' is equivalent to 'nbd+unix:///?Socket=x'?
> Right now, I think that most implementations of NBD servers and
> clients happen to use case-sensitive parsing; but glib provides the
> option to do case-insensitive query parsing.

I haven't thought about this before, but do note that the NBD URI spec
defers to "IETF standards describing URIs" for all unanswered
questions.  RFC3986 does talk about this incidentally.  About the
scheme field it says (section 3.1):

   Although schemes are case-
   insensitive, the canonical form is lowercase and documents that
   specify schemes must do so with lowercase letters.  An implementation
   should accept uppercase letters as equivalent to lowercase in scheme
   names (e.g., allow "HTTP" as well as "http") for the sake of
   robustness but should only produce lowercase scheme names for
   consistency.

Hostname is also (obviously) case insensitive.  There's also a section
(6.2.3) which talks about normalization of URIs.

Overall it seems the intention of the RFC writer is that parsers
should handle any case; but when generating URIs (and for examples,
documentation etc) we should only generate lowercase.

libnbd absolutely does *not* get this right, eg:

  $ nbdinfo NBD://localhost
  nbdinfo: nbd_connect_uri: unknown NBD URI scheme: NBD: Invalid argument

so that's a bug too.

> If I read https://docs.gtk.org/glib/type_func.Uri.parse_params.html
> correctly, passing G_URI_PARAMS_CASE_INSENSITIVE (which you did not
> do) would mean that 'nbd+unix:///?socket=ignore&Socket=/for/real'
> would result in this g_hash_table_lookup finding only "Socket", not
> "socket".  Maybe it is worth an explicit addition to the NBD URI spec
> to mention that we intend to be case-sensitive (in the parts where it
> can be; I'm not sure if the schema part must be handled
> case-insensitively without re-reading the RFCs), and therefore that
> 'Socket=' does NOT have the same meaning as 'socket='.

We could mention this in the spec for clarity, but the current meaning
(as above) would be that case-insensitive parsing is recommended.
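
If we ever wanted the case-insensitive behaviour, the glib side is
simple enough.  A sketch (my example, assuming glib >= 2.66; with
G_URI_PARAMS_CASE_INSENSITIVE the returned hash table compares
parameter names case-insensitively, so a lookup of "socket" should
also find "Socket"):

  #include <stdio.h>
  #include <glib.h>

  int main(void)
  {
      GHashTable *qp = g_uri_parse_params("Socket=/tmp/sock", -1, "&",
                                          G_URI_PARAMS_CASE_INSENSITIVE,
                                          NULL);
      const char *sock = qp ? g_hash_table_lookup(qp, "socket") : NULL;
      printf("socket = %s\n", sock ? sock : "(not found)");
      if (qp) {
          g_hash_table_unref(qp);
      }
      return 0;
  }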

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org




Re: [PATCH for-9.1 8/9] block/ssh: Use URI parsing code from glib

2024-03-28 Thread Richard W.M. Jones
On Thu, Mar 28, 2024 at 03:06:05PM +0100, Thomas Huth wrote:
> Since version 2.66, glib has useful URI parsing functions, too.
> Use those instead of the QEMU-internal ones to be finally able
> to get rid of the latter.
> 
> Signed-off-by: Thomas Huth 
> ---
>  block/ssh.c | 69 +++--
>  1 file changed, 40 insertions(+), 29 deletions(-)
> 
> diff --git a/block/ssh.c b/block/ssh.c
> index 2748253d4a..c0caf59793 100644
> --- a/block/ssh.c
> +++ b/block/ssh.c
> @@ -37,7 +37,6 @@
>  #include "qemu/ctype.h"
>  #include "qemu/cutils.h"
>  #include "qemu/sockets.h"
> -#include "qemu/uri.h"
>  #include "qapi/qapi-visit-sockets.h"
>  #include "qapi/qapi-visit-block-core.h"
>  #include "qapi/qmp/qdict.h"
> @@ -181,64 +180,76 @@ static void sftp_error_trace(BDRVSSHState *s, const 
> char *op)
>  
>  static int parse_uri(const char *filename, QDict *options, Error **errp)
>  {
> -URI *uri = NULL;
> -QueryParams *qp;
> +GUri *uri;
> +const char *uri_host, *uri_path, *uri_user, *uri_query;
>  char *port_str;
> -int i;
> +int port;
> +g_autoptr(GError) gerror = NULL;
> +char *qp_name, *qp_value;
> +GUriParamsIter qp;
>  
> -uri = uri_parse(filename);
> +uri = g_uri_parse(filename, G_URI_FLAGS_NONE, NULL);
>  if (!uri) {
>  return -EINVAL;
>  }
>  
> -if (g_strcmp0(uri->scheme, "ssh") != 0) {
> +if (g_strcmp0(g_uri_get_scheme(uri), "ssh") != 0) {
>  error_setg(errp, "URI scheme must be 'ssh'");
>  goto err;
>  }
>  
> -if (!uri->server || strcmp(uri->server, "") == 0) {
> +uri_host = g_uri_get_host(uri);
> +if (!uri_host || g_str_equal(uri_host, "")) {
>  error_setg(errp, "missing hostname in URI");
>  goto err;
>  }
>  
> -if (!uri->path || strcmp(uri->path, "") == 0) {
> +uri_path = g_uri_get_path(uri);
> +if (!uri_path || g_str_equal(uri_path, "")) {
>  error_setg(errp, "missing remote path in URI");
>  goto err;
>  }
>  
> -qp = query_params_parse(uri->query);
> -if (!qp) {
> -error_setg(errp, "could not parse query parameters");
> -goto err;
> -}
> -
> -if(uri->user && strcmp(uri->user, "") != 0) {
> -qdict_put_str(options, "user", uri->user);
> +uri_user = g_uri_get_user(uri);
> +if (uri_user && !g_str_equal(uri_user, "")) {
> +qdict_put_str(options, "user", uri_user);
>  }
>  
> -qdict_put_str(options, "server.host", uri->server);
> +qdict_put_str(options, "server.host", uri_host);
>  
> -port_str = g_strdup_printf("%d", uri->port ?: 22);
> +port = g_uri_get_port(uri);
> +port_str = g_strdup_printf("%d", port != -1 ? port : 22);
>  qdict_put_str(options, "server.port", port_str);
>  g_free(port_str);
>  
> -qdict_put_str(options, "path", uri->path);
> -
> -/* Pick out any query parameters that we understand, and ignore
> - * the rest.
> - */
> -for (i = 0; i < qp->n; ++i) {
> -if (strcmp(qp->p[i].name, "host_key_check") == 0) {
> -qdict_put_str(options, "host_key_check", qp->p[i].value);
> +qdict_put_str(options, "path", uri_path);
> +
> +uri_query = g_uri_get_query(uri);
> +if (uri_query) {
> +g_uri_params_iter_init(&qp, uri_query, -1, "&", G_URI_PARAMS_NONE);
> +while (g_uri_params_iter_next(&qp, &qp_name, &qp_value, &gerror)) {
> +if (!qp_name || !qp_value || gerror) {
> +warn_report("Failed to parse SSH URI parameters '%s'.",
> +            uri_query);
> +break;
> +}
> +/*
> + * Pick out the query parameters that we understand, and ignore
> + * (or rather warn about) the rest.
> + */
> +if (g_str_equal(qp_name, "host_key_check")) {
> +qdict_put_str(options, "host_key_check", qp_value);
> +} else {
> +warn_report("Unsupported parameter '%s' in URI.", qp_name);
> +}
>  }
>  }
>  
> -query_params_free(qp);
> -uri_free(uri);
> +g_uri_unref(uri);
>  return 0;
>  
>   err:
> -uri_free(uri);
> +g_uri_unref(uri);
>  return -EINVAL;
>  }

Seems reasonable too,

Reviewed-by: Richard W.M. Jones 


-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v




Re: [PATCH for-9.1 6/9] block/nbd: Use URI parsing code from glib

2024-03-28 Thread Richard W.M. Jones
On Thu, Mar 28, 2024 at 03:06:03PM +0100, Thomas Huth wrote:
> Since version 2.66, glib has useful URI parsing functions, too.
> Use those instead of the QEMU-internal ones to be finally able
> to get rid of the latter. The g_uri_get_host() also takes care
> of removing the square brackets from IPv6 addresses, so we can
> drop that part of the QEMU code now, too.
> 
> Signed-off-by: Thomas Huth 
> ---
>  block/nbd.c | 66 ++---
>  1 file changed, 38 insertions(+), 28 deletions(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index ef05f7cdfd..95b507f872 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -31,7 +31,6 @@
>  #include "qemu/osdep.h"
>  
>  #include "trace.h"
> -#include "qemu/uri.h"
>  #include "qemu/option.h"
>  #include "qemu/cutils.h"
>  #include "qemu/main-loop.h"
> @@ -1514,30 +1513,34 @@ static void nbd_client_close(BlockDriverState *bs)
>  
>  static int nbd_parse_uri(const char *filename, QDict *options)
>  {
> -URI *uri;
> +GUri *uri;
>  const char *p;
> -QueryParams *qp = NULL;
> +GHashTable *qp = NULL;
> +int qp_n;
>  int ret = 0;
>  bool is_unix;
> +const char *uri_scheme, *uri_query, *uri_server;
> +int uri_port;
>  
> -uri = uri_parse(filename);
> +uri = g_uri_parse(filename, G_URI_FLAGS_NONE, NULL);
>  if (!uri) {
>  return -EINVAL;
>  }
>  
>  /* transport */
> -if (!g_strcmp0(uri->scheme, "nbd")) {
> +uri_scheme = g_uri_get_scheme(uri);
> +if (!g_strcmp0(uri_scheme, "nbd")) {
>  is_unix = false;
> -} else if (!g_strcmp0(uri->scheme, "nbd+tcp")) {
> +} else if (!g_strcmp0(uri_scheme, "nbd+tcp")) {
>  is_unix = false;
> -} else if (!g_strcmp0(uri->scheme, "nbd+unix")) {
> +} else if (!g_strcmp0(uri_scheme, "nbd+unix")) {
>  is_unix = true;
>  } else {
>  ret = -EINVAL;
>  goto out;
>  }
>  
> -p = uri->path ? uri->path : "";
> +p = g_uri_get_path(uri) ?: "";
>  if (p[0] == '/') {
>  p++;
>  }
> @@ -1545,51 +1548,58 @@ static int nbd_parse_uri(const char *filename, QDict 
> *options)
>  qdict_put_str(options, "export", p);
>  }
>  
> -qp = query_params_parse(uri->query);
> -if (qp->n > 1 || (is_unix && !qp->n) || (!is_unix && qp->n)) {
> -ret = -EINVAL;
> -goto out;
> +uri_query = g_uri_get_query(uri);
> +if (uri_query) {
> +qp = g_uri_parse_params(uri_query, -1, "&", G_URI_PARAMS_NONE, NULL);
> +if (!qp) {
> +ret = -EINVAL;
> +goto out;
> +}
> +qp_n = g_hash_table_size(qp);
> +if (qp_n > 1 || (is_unix && !qp_n) || (!is_unix && qp_n)) {
> +ret = -EINVAL;
> +goto out;
> +}
> + }
> +
> +uri_server = g_uri_get_host(uri);
> +if (uri_server && !uri_server[0]) {
> +uri_server = NULL;
>  }
> +uri_port = g_uri_get_port(uri);
>  
>  if (is_unix) {
>  /* nbd+unix:///export?socket=path */
> -if (uri->server || uri->port || strcmp(qp->p[0].name, "socket")) {
> +const char *uri_socket = g_hash_table_lookup(qp, "socket");
> +if (uri_server || uri_port != -1 || !uri_socket) {
>  ret = -EINVAL;
>  goto out;
>  }
>  qdict_put_str(options, "server.type", "unix");
> -qdict_put_str(options, "server.path", qp->p[0].value);
> +qdict_put_str(options, "server.path", uri_socket);
>  } else {
> -QString *host;
>  char *port_str;
>  
>  /* nbd[+tcp]://host[:port]/export */
> -if (!uri->server) {
> +if (!uri_server) {
>  ret = -EINVAL;
>  goto out;
>  }
>  
> -/* strip braces from literal IPv6 address */
> -if (uri->server[0] == '[') {
> -host = qstring_from_substr(uri->server, 1,
> -   strlen(uri->server) - 1);
> -} else {
> -host = qstring_from_str(uri->server);
> -    }
> -
>      qdict_put_str(options, "server.type", "inet");
> -qdict_put(options, "server.host", host);
> +qdict_put_str(options, "server.host"

[PATCH v2] block/blkio: Make s->mem_region_alignment be 64 bits

2024-01-30 Thread Richard W.M. Jones
With GCC 14 the code failed to compile on i686 (and was wrong for any
version of GCC):

../block/blkio.c: In function ‘blkio_file_open’:
../block/blkio.c:857:28: error: passing argument 3 of ‘blkio_get_uint64’ from 
incompatible pointer type [-Wincompatible-pointer-types]
  857 |&s->mem_region_alignment);
  |^~~~
  ||
  |size_t * {aka unsigned int *}
In file included from ../block/blkio.c:12:
/usr/include/blkio.h:49:67: note: expected ‘uint64_t *’ {aka ‘long long 
unsigned int *’} but argument is of type ‘size_t *’ {aka ‘unsigned int *’}
   49 | int blkio_get_uint64(struct blkio *b, const char *name, uint64_t 
*value);
  | ~~^

Signed-off-by: Richard W.M. Jones 
---
 block/blkio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blkio.c b/block/blkio.c
index 0a0a6c0f5fd..bc2f21784c7 100644
--- a/block/blkio.c
+++ b/block/blkio.c
@@ -68,7 +68,7 @@ typedef struct {
 CoQueue bounce_available;
 
 /* The value of the "mem-region-alignment" property */
-size_t mem_region_alignment;
+uint64_t mem_region_alignment;
 
 /* Can we skip adding/deleting blkio_mem_regions? */
 bool needs_mem_regions;
-- 
2.43.0




Re: [PATCH [repost]] block/blkio: Don't assume size_t is 64 bit

2024-01-30 Thread Richard W.M. Jones
On Tue, Jan 30, 2024 at 01:04:46PM +0100, Kevin Wolf wrote:
> Am 30.01.2024 um 11:30 hat Richard W.M. Jones geschrieben:
> > On Tue, Jan 30, 2024 at 09:51:59AM +0100, Kevin Wolf wrote:
> > > Am 29.01.2024 um 19:53 hat Richard W.M. Jones geschrieben:
> > > > With GCC 14 the code failed to compile on i686 (and was wrong for any
> > > > version of GCC):
> > > > 
> > > > ../block/blkio.c: In function ‘blkio_file_open’:
> > > > ../block/blkio.c:857:28: error: passing argument 3 of 
> > > > ‘blkio_get_uint64’ from incompatible pointer type 
> > > > [-Wincompatible-pointer-types]
> > > >   857 |&s->mem_region_alignment);
> > > >   |^~~~
> > > >   ||
> > > >   |size_t * {aka unsigned int *}
> > > > In file included from ../block/blkio.c:12:
> > > > /usr/include/blkio.h:49:67: note: expected ‘uint64_t *’ {aka ‘long long 
> > > > unsigned int *’} but argument is of type ‘size_t *’ {aka ‘unsigned int 
> > > > *’}
> > > >49 | int blkio_get_uint64(struct blkio *b, const char *name, 
> > > > uint64_t *value);
> > > >   | 
> > > > ~~^
> > > > 
> > > > Signed-off-by: Richard W.M. Jones 
> > > 
> > > Why not simply make BDRVBlkioState.mem_region_alignment a uint64_t
> > > instead of keeping it size_t and doing an additional conversion with
> > > a check that requires an #if (probably to avoid a warning on 64 bit
> > > hosts because the condition is never true)?
> > 
> > The smaller change (attached) does work on i686, but this worries me a
> > little (although it doesn't give any error or warning):
> > 
> > if (((uintptr_t)host | size) % s->mem_region_alignment) {
> > error_setg(errp, "unaligned buf %p with size %zu", host, size);
> > return BMRR_FAIL;
> > }
> 
> I don't see the problem? The calculation will now be done in 64 bits
> even on a 32 bit host, but that seems fine to me. Is there a trap I'm
> missing?

I guess not.  Stefan, any comments on whether we need to worry about
huge mem-region-alignment?  I'll post the updated patch as a new
message in a second.
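
For the record, a quick sketch (my own) of the promotion Kevin
describes: because s->mem_region_alignment is now uint64_t, the 32-bit
left operand is converted to 64 bits before the %, so a huge alignment
simply makes every buffer look unaligned instead of wrapping:

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      uint32_t host_or_size = 0x1000;   /* stands in for (uintptr_t)host | size */
      uint64_t alignment = 1ULL << 33;  /* larger than SIZE_MAX on i686 */

      /* Computed in 64 bits; prints 4096, i.e. "unaligned". */
      printf("remainder = %llu\n",
             (unsigned long long)(host_or_size % alignment));
      return 0;
  }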

Rich.

> Kevin
> 
> > From 500f3a81652dcefa79a4864c1f3fa6747c16952e Mon Sep 17 00:00:00 2001
> > From: "Richard W.M. Jones" 
> > Date: Mon, 29 Jan 2024 18:20:46 +0000
> > Subject: [PATCH] block/blkio: Make s->mem_region_alignment be 64 bits
> > MIME-Version: 1.0
> > Content-Type: text/plain; charset=UTF-8
> > Content-Transfer-Encoding: 8bit
> > 
> > With GCC 14 the code failed to compile on i686 (and was wrong for any
> > version of GCC):
> > 
> > ../block/blkio.c: In function ‘blkio_file_open’:
> > ../block/blkio.c:857:28: error: passing argument 3 of ‘blkio_get_uint64’ 
> > from incompatible pointer type [-Wincompatible-pointer-types]
> >   857 |&s->mem_region_alignment);
> >   |^~~~~~~~~~~~
> >   ||
> >   |size_t * {aka unsigned int *}
> > In file included from ../block/blkio.c:12:
> > /usr/include/blkio.h:49:67: note: expected ‘uint64_t *’ {aka ‘long long 
> > unsigned int *’} but argument is of type ‘size_t *’ {aka ‘unsigned int *’}
> >49 | int blkio_get_uint64(struct blkio *b, const char *name, uint64_t 
> > *value);
> >   | 
> > ~~^
> > 
> > Signed-off-by: Richard W.M. Jones 
> > ---
> >  block/blkio.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/block/blkio.c b/block/blkio.c
> > index 0a0a6c0f5fd..bc2f21784c7 100644
> > --- a/block/blkio.c
> > +++ b/block/blkio.c
> > @@ -68,7 +68,7 @@ typedef struct {
> >  CoQueue bounce_available;
> >  
> >  /* The value of the "mem-region-alignment" property */
> > -size_t mem_region_alignment;
> > +uint64_t mem_region_alignment;
> >  
> >  /* Can we skip adding/deleting blkio_mem_regions? */
> >  bool needs_mem_regions;
> > -- 
> > 2.43.0
> > 

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW




Re: [PATCH [repost]] block/blkio: Don't assume size_t is 64 bit

2024-01-30 Thread Richard W.M. Jones
On Tue, Jan 30, 2024 at 09:51:59AM +0100, Kevin Wolf wrote:
> Am 29.01.2024 um 19:53 hat Richard W.M. Jones geschrieben:
> > With GCC 14 the code failed to compile on i686 (and was wrong for any
> > version of GCC):
> > 
> > ../block/blkio.c: In function ‘blkio_file_open’:
> > ../block/blkio.c:857:28: error: passing argument 3 of ‘blkio_get_uint64’ 
> > from incompatible pointer type [-Wincompatible-pointer-types]
> >   857 |&s->mem_region_alignment);
> >   |^~~~
> >   ||
> >   |size_t * {aka unsigned int *}
> > In file included from ../block/blkio.c:12:
> > /usr/include/blkio.h:49:67: note: expected ‘uint64_t *’ {aka ‘long long 
> > unsigned int *’} but argument is of type ‘size_t *’ {aka ‘unsigned int *’}
> >49 | int blkio_get_uint64(struct blkio *b, const char *name, uint64_t 
> > *value);
> >   |             
> > ~~^
> > 
> > Signed-off-by: Richard W.M. Jones 
> 
> Why not simply make BDRVBlkioState.mem_region_alignment a uint64_t
> instead of keeping it size_t and doing an additional conversion with
> a check that requires an #if (probably to avoid a warning on 64 bit
> hosts because the condition is never true)?

The smaller change (attached) does work on i686, but this worries me a
little (although it doesn't give any error or warning):

if (((uintptr_t)host | size) % s->mem_region_alignment) {
error_setg(errp, "unaligned buf %p with size %zu", host, size);
return BMRR_FAIL;
}

Rich.

> Kevin
> 
> >  block/blkio.c | 12 +++-
> >  1 file changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/blkio.c b/block/blkio.c
> > index 0a0a6c0f5fd..52d78935147 100644
> > --- a/block/blkio.c
> > +++ b/block/blkio.c
> > @@ -794,6 +794,7 @@ static int blkio_file_open(BlockDriverState *bs, QDict 
> > *options, int flags,
> >  const char *blkio_driver = bs->drv->protocol_name;
> >  BDRVBlkioState *s = bs->opaque;
> >  int ret;
> > +uint64_t val;
> >  
> >  ret = blkio_create(blkio_driver, &s->blkio);
> >  if (ret < 0) {
> > @@ -854,7 +855,7 @@ static int blkio_file_open(BlockDriverState *bs, QDict 
> > *options, int flags,
> >  
> >  ret = blkio_get_uint64(s->blkio,
> > "mem-region-alignment",
> > -   &s->mem_region_alignment);
> > +   &val);
> >  if (ret < 0) {
> >  error_setg_errno(errp, -ret,
> >   "failed to get mem-region-alignment: %s",
> > @@ -862,6 +863,15 @@ static int blkio_file_open(BlockDriverState *bs, QDict 
> > *options, int flags,
> >  blkio_destroy(&s->blkio);
> >  return ret;
> >  }
> > +#if HOST_LONG_BITS == 32
> > +if (val > SIZE_MAX) {
> > +error_setg_errno(errp, ERANGE,
> > + "mem-region-alignment too large for size_t");
> > +blkio_destroy(&s->blkio);
> > +return -ERANGE;
> > +}
> > +#endif
> > +s->mem_region_alignment = (size_t)val;
> >  
> >  ret = blkio_get_bool(s->blkio,
> >   "may-pin-mem-regions",
> > -- 
> > 2.43.0
> > 

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
>From 500f3a81652dcefa79a4864c1f3fa6747c16952e Mon Sep 17 00:00:00 2001
From: "Richard W.M. Jones" 
Date: Mon, 29 Jan 2024 18:20:46 +0000
Subject: [PATCH] block/blkio: Make s->mem_region_alignment be 64 bits
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

With GCC 14 the code failed to compile on i686 (and was wrong for any
version of GCC):

../block/blkio.c: In function ‘blkio_file_open’:
../block/blkio.c:857:28: error: passing argument 3 of ‘blkio_get_uint64’ from 
incompatible pointer type [-Wincompatible-pointer-types]
  857 |&s->mem_region_alignment);
  |^~~~
  ||
  |size_t * {aka unsigned int *}
In file included from ../block/blkio.c:12:
/usr/include/blkio.h:49:67: 

[PATCH [repost]] block/blkio: Don't assume size_t is 64 bit

2024-01-29 Thread Richard W.M. Jones
Repost of the same patch as a minute ago because I messed up a couple
of email addresses in the CC.

Rich.




[PATCH [repost]] block/blkio: Don't assume size_t is 64 bit

2024-01-29 Thread Richard W.M. Jones
With GCC 14 the code failed to compile on i686 (and was wrong for any
version of GCC):

../block/blkio.c: In function ‘blkio_file_open’:
../block/blkio.c:857:28: error: passing argument 3 of ‘blkio_get_uint64’ from 
incompatible pointer type [-Wincompatible-pointer-types]
  857 |&s->mem_region_alignment);
  |^~~~
  ||
  |size_t * {aka unsigned int *}
In file included from ../block/blkio.c:12:
/usr/include/blkio.h:49:67: note: expected ‘uint64_t *’ {aka ‘long long 
unsigned int *’} but argument is of type ‘size_t *’ {aka ‘unsigned int *’}
   49 | int blkio_get_uint64(struct blkio *b, const char *name, uint64_t 
*value);
  | ~~^

Signed-off-by: Richard W.M. Jones 
---
 block/blkio.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/block/blkio.c b/block/blkio.c
index 0a0a6c0f5fd..52d78935147 100644
--- a/block/blkio.c
+++ b/block/blkio.c
@@ -794,6 +794,7 @@ static int blkio_file_open(BlockDriverState *bs, QDict 
*options, int flags,
 const char *blkio_driver = bs->drv->protocol_name;
 BDRVBlkioState *s = bs->opaque;
 int ret;
+uint64_t val;
 
 ret = blkio_create(blkio_driver, &s->blkio);
 if (ret < 0) {
@@ -854,7 +855,7 @@ static int blkio_file_open(BlockDriverState *bs, QDict 
*options, int flags,
 
 ret = blkio_get_uint64(s->blkio,
"mem-region-alignment",
-   &s->mem_region_alignment);
+   &val);
 if (ret < 0) {
 error_setg_errno(errp, -ret,
  "failed to get mem-region-alignment: %s",
@@ -862,6 +863,15 @@ static int blkio_file_open(BlockDriverState *bs, QDict 
*options, int flags,
 blkio_destroy(&s->blkio);
 return ret;
 }
+#if HOST_LONG_BITS == 32
+if (val > SIZE_MAX) {
+error_setg_errno(errp, ERANGE,
+ "mem-region-alignment too large for size_t");
+blkio_destroy(&s->blkio);
+return -ERANGE;
+}
+#endif
+s->mem_region_alignment = (size_t)val;
 
 ret = blkio_get_bool(s->blkio,
  "may-pin-mem-regions",
-- 
2.43.0




Re: [PATCH 4/7] hw/scsi/virtio-scsi: Use VIRTIO_SCSI_COMMON() macro

2023-10-31 Thread Richard W.M. Jones
On Tue, Oct 31, 2023 at 05:42:37PM +0100, Kevin Wolf wrote:
> Am 31.10.2023 um 14:48 hat Richard W.M. Jones geschrieben:
> > On Tue, Oct 31, 2023 at 02:17:56PM +0100, Kevin Wolf wrote:
> > > Am 17.10.2023 um 16:01 hat Philippe Mathieu-Daudé geschrieben:
> > > > Access QOM parent with the proper QOM VIRTIO_SCSI_COMMON() macro.
> > > > 
> > > > Signed-off-by: Philippe Mathieu-Daudé 
> > > > ---
> > > >  hw/scsi/virtio-scsi.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> > > > index 45b95ea070..fa53f0902c 100644
> > > > --- a/hw/scsi/virtio-scsi.c
> > > > +++ b/hw/scsi/virtio-scsi.c
> > > > @@ -761,7 +761,7 @@ static void virtio_scsi_fail_cmd_req(VirtIOSCSIReq 
> > > > *req)
> > > >  
> > > >  static int virtio_scsi_handle_cmd_req_prepare(VirtIOSCSI *s, 
> > > > VirtIOSCSIReq *req)
> > > >  {
> > > > -VirtIOSCSICommon *vs = &s->parent_obj;
> > > > +VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
> > > >  SCSIDevice *d;
> > > >  int rc;
> > > 
> > > Why is a dynamic cast more "proper" than a static type-safe access, even
> > > more so in a hot I/O path?
> > > 
> > > Rich Jones posted a flamegraph the other day that surprised me because
> > > object_class_dynamic_class_assert() and object_dynamic_cast_assert()
> > > were shown to be a big part of scsi_req_new(). In the overall
> > > performance, it's probably dwarfed by other issues, but unnecessary
> > > little things can add up, too.
> > 
> > I think Kevin is referring to one of these flamegraphs:
> > 
> >   http://oirase.annexia.org/tmp/2023-kvm-build-on-device.svg
> >   http://oirase.annexia.org/tmp/2023-kvm-build.svg
> > 
> > Here's a zoom showing scsi_req_new (hopefully this URL is stable ...):
> > 
> >   
> > http://oirase.annexia.org/tmp/2023-kvm-build-on-device.svg?s=scsi_req_new&x=512.9&y=501
> 
> Oh, this one (kvm-build-on-device) doesn't even show the object cast.
> I was looking at kvm-build, it seems, where both the class and the
> object cast are visible:
> 
> http://oirase.annexia.org/tmp/2023-kvm-build.svg?s=scsi_req_new&x=455.4&y=533

Not sure if this is the reason why, but the difference between these
two runs is that kvm-build is backed by a qcow2 file and
kvm-build-on-device is backed by a host block device.  I believe they
both were otherwise identically configured qemu.

More background in this Fedora thread:

https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/thread/MSJAL7OE2TBO6U4ZWXKTKQLDSGRFK6YR/

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v




Re: [PATCH 4/7] hw/scsi/virtio-scsi: Use VIRTIO_SCSI_COMMON() macro

2023-10-31 Thread Richard W.M. Jones
On Tue, Oct 31, 2023 at 02:17:56PM +0100, Kevin Wolf wrote:
> Am 17.10.2023 um 16:01 hat Philippe Mathieu-Daudé geschrieben:
> > Access QOM parent with the proper QOM VIRTIO_SCSI_COMMON() macro.
> > 
> > Signed-off-by: Philippe Mathieu-Daudé 
> > ---
> >  hw/scsi/virtio-scsi.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> > index 45b95ea070..fa53f0902c 100644
> > --- a/hw/scsi/virtio-scsi.c
> > +++ b/hw/scsi/virtio-scsi.c
> > @@ -761,7 +761,7 @@ static void virtio_scsi_fail_cmd_req(VirtIOSCSIReq *req)
> >  
> >  static int virtio_scsi_handle_cmd_req_prepare(VirtIOSCSI *s, VirtIOSCSIReq 
> > *req)
> >  {
> > -VirtIOSCSICommon *vs = &s->parent_obj;
> > +VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
> >  SCSIDevice *d;
> >  int rc;
> 
> Why is a dynamic cast more "proper" than a static type-safe access, even
> more so in a hot I/O path?
> 
> Rich Jones posted a flamegraph the other day that surprised me because
> object_class_dynamic_class_assert() and object_dynamic_cast_assert()
> were shown to be a big part of scsi_req_new(). In the overall
> performance, it's probably dwarfed by other issues, but unnecessary
> little things can add up, too.

I think Kevin is referring to one of these flamegraphs:

  http://oirase.annexia.org/tmp/2023-kvm-build-on-device.svg
  http://oirase.annexia.org/tmp/2023-kvm-build.svg

Here's a zoom showing scsi_req_new (hopefully this URL is stable ...):

  
http://oirase.annexia.org/tmp/2023-kvm-build-on-device.svg?s=scsi_req_new&x=512.9&y=501

Note that qemu has been compiled with QOM cast debug.  This is the
default for Fedora (not RHEL) because we'd like to get early detection
of bugs from Fedora users.

There was another patch recently where a simple change saved about 5%
of total runtime in RISC-V TCG guests (admittedly a much more hot path
than this one).

  https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg02388.html

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW




Re: zlib-ng as a compat replacement for zlib

2023-09-01 Thread Richard W.M. Jones
On Fri, Sep 01, 2023 at 06:45:22PM +0200, Florian Weimer wrote:
> * Richard W. M. Jones:
> 
> > I tested the speed of decompression using:
> >
> >   $ hyperfine 'qemu-img convert -W -m 16 -f qcow2 test.qcow2.XXX -O raw 
> > test.out'
> >   (qemu 8.0.0-4.fc39.x86_64)
> >
> >   $ hyperfine 'nbdkit -U - --filter=qcow2dec file test.qcow2.XXX --run 
> > '\''nbdcopy --request-size "$uri" test.out'\'' '
> >   (nbdkit-1.35.11-2.fc40.x86_64)
> 
> How realistic is that?  Larger cluster sizes will make random access
> perform noticeably worse in some cases.  Think about reading a few bytes
> towards the end of the cluster.  It makes a difference whether you have
> to decompress 64 KiB bytes for that, or 2 MiB.  As far as I understand
> it, the above commands use all data decompressed, so they don't suffer
> from this issue (particularly with read-ahead to deal with unfortunate
> cluster boundaries).
> 
> Time to first HTTP request served after boot or something like that
> might be a better comparison.

Yes, this is a good point.

Current Fedora images use 64k cluster size which I think is also the
default:

$ qemu-img info https://download.fedoraproject.org/pub/fedora/linux/releases/38/Cloud/x86_64/images/Fedora-Cloud-Base-38-1.6.x86_64.qcow2
image: https://download.fedoraproject.org/pub/fedora/linux/releases/38/Cloud/x86_64/images/Fedora-Cloud-Base-38-1.6.x86_64.qcow2
file format: qcow2
virtual size: 5 GiB (5368709120 bytes)
disk size: unavailable
cluster_size: 65536
Format specific information:
compat: 0.10
compression type: zlib
refcount bits: 16

And the tests showed there's barely any performance gain above the
default cluster size anyway.  Only very small clusters should be
avoided, but there are already good reasons to avoid those, such as
metadata overhead and a small maximum image size.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top




Re: zlib-ng as a compat replacement for zlib

2023-09-01 Thread Richard W.M. Jones
On Fri, Sep 01, 2023 at 01:10:16PM +0200, Kevin Wolf wrote:
> And nbdkit seems to get worse instead of better with larger cluster
> size, no matter whether zlib or zstd is used.

It's caused by nbdcopy's default request size being 256k.  Increasing
it to 2M cures the scaling problem - see updated results below.  (Note
nbdkit & nbdcopy are being used together so we're copying between two
programs.  The request size is the size of NBD requests between the two.)

> If you think using more threads is the key for the remaining difference
> at 64k, would increasing QCOW2_MAX_THREADS (currently only 4) help on
> the qemu-img side?

Results:

  qemu800 = qemu-img-8.0.0-4.fc39.x86_64 [previous results]
  qemugit = qemu @ 17780edd81d
  qemuthr = qemu @ 17780edd81d with QCOW2_MAX_THREADS changed from 4 to 16
  nbdkit = nbdkit-1.35.11-2.fc40.x86_64 [previous results]
  nbdkit2M = nbdkit with nbdcopy --request-size=$((2*1024*1024))


Cluster  Compression  Compressed size  Prog  Decompression speed

4k   zlib 3228811264   qemu800   5.921 s ±  0.074 s
4k   zstd 3258097664   qemu800   5.189 s ±  0.158 s

4k   zlib 3228811264   qemugit   7.021 s ±  0.234 s
4k   zstd 3258097664   qemugit   6.594 s ±  0.170 s

4k   zlib 3228811264   qemuthr   6.744 s ±  0.111 s
4k   zstd 3258097664   qemuthr   6.428 s ±  0.206 s

4k   zlib 3228811264   nbdkit1.390 s ±  0.094 s
4k   zstd 3258097664   nbdkit1.328 s ±  0.055 s


64k  zlib 3164667904   qemu800   3.579 s ±  0.094 s
64k  zstd 3132686336   qemu800   1.770 s ±  0.060 s

64k  zlib 3164667904   qemugit   3.644 s ±  0.018 s
64k  zstd 3132686336   qemugit   1.814 s ±  0.098 s

64k  zlib 3164667904   qemuthr   1.356 s ±  0.058 s
64k  zstd 3132686336   qemuthr   1.266 s ±  0.064 s

64k  zlib 3164667904   nbdkit1.254 s ±  0.065 s
64k  zstd 3132686336   nbdkit1.315 s ±  0.037 s


512k zlib 3158744576   qemu800   4.008 s ±  0.058 s
512k zstd 3032697344   qemu800   1.503 s ±  0.072 s

512k zlib 3158744576   qemugit   4.015 s ±  0.040 s
512k zstd 3032697344   qemugit   1.557 s ±  0.025 s

512k zlib 3158744576   qemuthr   1.233 s ±  0.050 s
512k zstd 3032697344   qemuthr   1.149 s ±  0.032 s

512k zlib 3158744576   nbdkit1.702 s ±  0.026 s
512k zstd 3032697344   nbdkit1.593 s ±  0.039 s


2048kzlib 3197569024   qemu800   4.327 s ±  0.051 s
2048kzstd 2995143168   qemu800   1.465 s ±  0.085 s

2048kzlib 3197569024   qemugit   4.323 s ±  0.031 s
2048kzstd 2995143168   qemugit   1.484 s ±  0.067 s

2048kzlib 3197569024   qemuthr   1.299 s ±  0.055 s
2048kzstd 2995143168   qemuthr   1.229 s ±  0.046 s

2048kzlib 3197569024   nbdkit2M  1.636 s ±  0.071 s
2048kzstd 2995143168   nbdkit2M  1.644 s ±  0.040 s


Increasing the number of threads makes a big difference, so I think
changing the default (or making it run-time adjustable somehow) is a
good idea, and an easy win too.
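
For reference, the "qemuthr" build above is just this one-line change
(the constant lives in qemu's block/qcow2.h, if I remember the path
correctly):

  -#define QCOW2_MAX_THREADS 4
  +#define QCOW2_MAX_THREADS 16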

Increased qcow2 threads + zlib-ng would be _very_ interesting.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
nbdkit - Flexible, fast NBD server with plugins
https://gitlab.com/nbdkit/nbdkit




Re: [PATCH] qemu-img: Update documentation for compressed images

2023-09-01 Thread Richard W.M. Jones
On Fri, Sep 01, 2023 at 12:24:30PM +0200, Kevin Wolf wrote:
> Document the 'compression_type' option for qcow2, and mention that
> streamOptimized vmdk supports compression, too.
> 
> Reported-by: Richard W.M. Jones 
> Signed-off-by: Kevin Wolf 

Looks good, so:

Reviewed-by: Richard W.M. Jones 

> ---
>  docs/tools/qemu-img.rst | 19 +--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
> index 15aeddc6d8..ca5a2773cf 100644
> --- a/docs/tools/qemu-img.rst
> +++ b/docs/tools/qemu-img.rst
> @@ -106,7 +106,11 @@ by the used format or see the format descriptions below 
> for details.
>  
>  .. option:: -c
>  
> -  Indicates that target image must be compressed (qcow format only).
> +  Indicates that target image must be compressed (qcow/qcow2 and vmdk with
> +  streamOptimized subformat only).
> +
> +  For qcow2, the compression algorithm can be specified with the ``-o
> +  compression_type=...`` option (see below).
>  
>  .. option:: -h
>  
> @@ -776,7 +780,7 @@ Supported image file formats:
>  
>QEMU image format, the most versatile format. Use it to have smaller
>images (useful if your filesystem does not supports holes, for example
> -  on Windows), optional AES encryption, zlib based compression and
> +  on Windows), optional AES encryption, zlib or zstd based compression and
>support of multiple VM snapshots.
>  
>Supported options:
> @@ -794,6 +798,17 @@ Supported image file formats:
>``backing_fmt``
>  Image format of the base image
>  
> +  ``compression_type``
> +This option configures which compression algorithm will be used for
> +compressed clusters on the image. Note that setting this option doesn't 
> yet
> +cause the image to actually receive compressed writes. It is most 
> commonly
> +used with the ``-c`` option of ``qemu-img convert``, but can also be used
> +with the ``compress`` filter driver or backup block jobs with compression
> +enabled.
> +
> +Valid values are ``zlib`` and ``zstd``. For images that use
> +``compat=0.10``, only ``zlib`` compression is available.
> +
>``encryption``
>  If this option is set to ``on``, the image is encrypted with
>  128-bit AES-CBC.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org




Re: zlib-ng as a compat replacement for zlib

2023-09-01 Thread Richard W.M. Jones
On Fri, Sep 01, 2023 at 11:06:20AM +0100, Daniel P. Berrangé wrote:
> On Fri, Sep 01, 2023 at 11:03:54AM +0100, Richard W.M. Jones wrote:
> > I forgot to say that nbdkit is using zlib-ng, since I made the source
> > level changes a few weeks back (but most of the nbdkit performance
> > improvement comes from being able to use lots of threads).
> 
> Ah that last point is interesting. If we look at nbdkit results we can
> see that while zstd is clearly faster, the margin of the win is massively
> lower. So I presume we can infer similar margins if qemu-img were
> switched too.

FWIW here's the nbdkit commit which added zlib-ng support (assuming
you don't want to wait for a compat package).  It's not a massive
change so something similar might be done for qemu:

https://gitlab.com/nbdkit/nbdkit/-/commit/1b67e323e998a5d719f1afe43d5be84e45c6739b
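
To give a flavour of what's involved, here's a hand-written sketch of
cluster decompression against zlib-ng's native zng_* API.  This is
illustrative only - it's not the nbdkit patch - and the error handling
is minimal.  (qcow2 clusters are raw deflate streams, hence the -12
window bits.)

  #include <stdint.h>
  #include <zlib-ng.h>

  /* Decompress one compressed qcow2 cluster into 'out'. */
  static int
  decompress_cluster (const uint8_t *in, size_t inlen,
                      uint8_t *out, size_t outlen)
  {
    zng_stream strm = { 0 };
    int r;

    if (zng_inflateInit2 (&strm, -12) != Z_OK)   /* raw deflate */
      return -1;
    strm.next_in = in;
    strm.avail_in = (uint32_t) inlen;
    strm.next_out = out;
    strm.avail_out = (uint32_t) outlen;
    r = zng_inflate (&strm, Z_FINISH);
    zng_inflateEnd (&strm);
    return r == Z_STREAM_END ? 0 : -1;
  }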

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org




Re: zlib-ng as a compat replacement for zlib

2023-09-01 Thread Richard W.M. Jones
On Fri, Sep 01, 2023 at 10:55:50AM +0100, Daniel P. Berrangé wrote:
> On Fri, Sep 01, 2023 at 10:42:16AM +0100, Richard W.M. Jones wrote:
> > On Fri, Sep 01, 2023 at 10:48:14AM +0200, Kevin Wolf wrote:
> > > On 31.08.2023 at 11:20, Richard W.M. Jones wrote:
> > > > On Thu, Aug 31, 2023 at 11:05:55AM +0200, Kevin Wolf wrote:
> > > > > [ Cc: qemu-block ]
> > > > > 
> > > > > On 30.08.2023 at 20:26, Richard W.M. Jones wrote:
> > > > > > On Tue, Aug 29, 2023 at 05:49:24PM -, Daniel Alley wrote:
> > > > > > > > The background to this is I've spent far too long trying to 
> > > > > > > > optimize
> > > > > > > > the conversion of qcow2 files to raw files.  Most existing 
> > > > > > > > qcow2 files
> > > > > > > > that you can find online are zlib compressed, including the 
> > > > > > > > qcow2
> > > > > > > > images provided by Fedora.  Each cluster in the file is 
> > > > > > > > separately
> > > > > > > > compressed as a zlib stream, and qemu uses zlib library 
> > > > > > > > functions to
> > > > > > > > decompress them.  When downloading and decompressing these 
> > > > > > > > files, I
> > > > > > > > measured 40%+ of the total CPU time is doing zlib decompression.
> > > > > > > > 
> > > > > > > > [You don't need to tell me how great Zstd is, qcow2 supports 
> > > > > > > > this for
> > > > > > > > compression also, but it is not widely used by existing 
> > > > > > > > content.]
> > > > > 
> > > > > You make it sound like compressing each cluster individually has a big
> > > > > impact. If so, does increasing the cluster size make a difference, 
> > > > > too?
> > > > > That could be a change with fewer compatibility concerns.
> > > > 
> > > > The issue we're discussing in the original thread is speed of
> > > > decompression.  We noted that using zlib-ng (a not-quite drop-in
> > > > replacement for zlib) improves decompression speed by 40% or more.
> > > > 
> > > > Original thread:
> > > > https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/thread/CDNPJ4SOTRQMYVCDI3ZSY4SP4FYESCWD/
> > > > zlib-ng proposed change:
> > > > https://src.fedoraproject.org/rpms/zlib-ng/pull-request/3
> > > > 
> > > > Size of the compressed file is also a concern, but wasn't discussed.
> > > 
> > > I understand the context and didn't really think about file size at all.
> > > 
> > > My question was essentially if decompressing many small blocks (as we do
> > > today) performs significantly different from decompressing fewer larger
> > > blocks (as we would do with a larger cluster size).
> > 
> > I did a quick test just by adjusting the cluster size of a qcow2 file:
> > 
> >   $ virt-builder fedora-36
> >   $ ls -lsh fedora-36.img 
> >   1.2G -rw-r--r--. 1 rjones rjones 6.0G Sep  1 09:53 fedora-36.img
> >   $ cat fedora-36.img fedora-36.img fedora-36.img fedora-36.img  > test.raw
> >   $ ls -lsh test.raw 
> >   4.7G -rw-r--r--. 1 rjones rjones 24G Sep  1 09:53 test.raw
> >   $ qemu-img convert -f raw test.raw -O qcow2 test.qcow2.zlib.4k -c -o 
> > compression_type=zlib,cluster_size=4096
> > 
> > (for cluster sizes 4k, 64k, 512k, 2048k, and
> > compression types zlib & zstd)
> > 
> > I tested the speed of decompression using:
> > 
> >   $ hyperfine 'qemu-img convert -W -m 16 -f qcow2 test.qcow2.XXX -O raw 
> > test.out'
> >   (qemu 8.0.0-4.fc39.x86_64)
> > 
> >   $ hyperfine 'nbdkit -U - --filter=qcow2dec file test.qcow2.XXX --run 
> > '\''nbdcopy --request-size "$uri" test.out'\'' '
> >   (nbdkit-1.35.11-2.fc40.x86_64)
> > 
> > Results:
> > 
> >   Cluster  Compression  Compressed size  Prog   Decompression speed
> > 
> >   4k   zlib 3228811264   qemu   5.921 s ±  0.074 s
> >   4k   zstd 3258097664   qemu   5.189 s ±  0.158 s
> > 
> >   4k   zlib/zstd nbdkit failed, bug!!
> > 
> >   64k  zlib 3164667904   qemu   3.579 s ±  0.094 s
> > 

Re: zlib-ng as a compat replacement for zlib

2023-09-01 Thread Richard W.M. Jones
On Fri, Sep 01, 2023 at 10:42:16AM +0100, Richard W.M. Jones wrote:
> Results:
> 
>   Cluster  Compression  Compressed size  Prog   Decompression speed
> 
>   4k   zlib 3228811264   qemu   5.921 s ±  0.074 s
>   4k   zstd 3258097664   qemu   5.189 s ±  0.158 s
> 
>   4k   zlib/zstd nbdkit failed, bug!!

Stupid bounds checking bug, fixed now [1]:

4k   zlib 3228811264   nbdkit 1.390 s ±  0.094 s
4k   zstd 3258097664   nbdkit 1.328 s ±  0.055 s

>   64k  zlib 3164667904   qemu   3.579 s ±  0.094 s
>   64k  zstd 3132686336   qemu   1.770 s ±  0.060 s
> 
>   64k  zlib 3164667904   nbdkit 1.254 s ±  0.065 s
>   64k  zstd 3132686336   nbdkit 1.315 s ±  0.037 s
> 
>   512k zlib 3158744576   qemu   4.008 s ±  0.058 s
>   512k zstd 3032697344   qemu   1.503 s ±  0.072 s
> 
>   512k zlib 3158744576   nbdkit 1.702 s ±  0.026 s
>   512k zstd 3032697344   nbdkit 1.593 s ±  0.039 s
> 
>   2048kzlib 3197569024   qemu   4.327 s ±  0.051 s
>   2048kzstd 2995143168   qemu   1.465 s ±  0.085 s
> 
>   2048kzlib 3197569024   nbdkit 3.660 s ±  0.011 s
>   2048kzstd 2995143168   nbdkit 3.368 s ±  0.057 s

[1] https://gitlab.com/nbdkit/nbdkit/-/commit/73b58dc13dd7a3bbc7e7b3b718a535ee362caa92

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html




Re: zlib-ng as a compat replacement for zlib

2023-09-01 Thread Richard W.M. Jones
On Fri, Sep 01, 2023 at 10:42:16AM +0100, Richard W.M. Jones wrote:
>   $ hyperfine 'nbdkit -U - --filter=qcow2dec file test.qcow2.XXX --run 
> '\''nbdcopy --request-size "$uri" test.out'\'' '

Sorry, copy and paste error, the command is:

hyperfine 'nbdkit -U - --filter=qcow2dec file test.qcow2.XXX --run '\''nbdcopy "$uri" test.out'\'' '

unless you want to adjust the --request-size parameter to fix
the scaling problem at the largest cluster size.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v




Re: zlib-ng as a compat replacement for zlib

2023-09-01 Thread Richard W.M. Jones
On Fri, Sep 01, 2023 at 10:48:14AM +0200, Kevin Wolf wrote:
> On 31.08.2023 at 11:20, Richard W.M. Jones wrote:
> > On Thu, Aug 31, 2023 at 11:05:55AM +0200, Kevin Wolf wrote:
> > > [ Cc: qemu-block ]
> > > 
> > > On 30.08.2023 at 20:26, Richard W.M. Jones wrote:
> > > > On Tue, Aug 29, 2023 at 05:49:24PM -, Daniel Alley wrote:
> > > > > > The background to this is I've spent far too long trying to optimize
> > > > > > the conversion of qcow2 files to raw files.  Most existing qcow2 
> > > > > > files
> > > > > > that you can find online are zlib compressed, including the qcow2
> > > > > > images provided by Fedora.  Each cluster in the file is separately
> > > > > > compressed as a zlib stream, and qemu uses zlib library functions to
> > > > > > decompress them.  When downloading and decompressing these files, I
> > > > > > measured 40%+ of the total CPU time is doing zlib decompression.
> > > > > > 
> > > > > > [You don't need to tell me how great Zstd is, qcow2 supports this 
> > > > > > for
> > > > > > compression also, but it is not widely used by existing content.]
> > > 
> > > You make it sound like compressing each cluster individually has a big
> > > impact. If so, does increasing the cluster size make a difference, too?
> > > That could be a change with fewer compatibility concerns.
> > 
> > The issue we're discussing in the original thread is speed of
> > decompression.  We noted that using zlib-ng (a not-quite drop-in
> > replacement for zlib) improves decompression speed by 40% or more.
> > 
> > Original thread:
> > https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/thread/CDNPJ4SOTRQMYVCDI3ZSY4SP4FYESCWD/
> > zlib-ng proposed change:
> > https://src.fedoraproject.org/rpms/zlib-ng/pull-request/3
> > 
> > Size of the compressed file is also a concern, but wasn't discussed.
> 
> I understand the context and didn't really think about file size at all.
> 
> My question was essentially if decompressing many small blocks (as we do
> today) performs significantly different from decompressing fewer larger
> blocks (as we would do with a larger cluster size).

I did a quick test just by adjusting the cluster size of a qcow2 file:

  $ virt-builder fedora-36
  $ ls -lsh fedora-36.img 
  1.2G -rw-r--r--. 1 rjones rjones 6.0G Sep  1 09:53 fedora-36.img
  $ cat fedora-36.img fedora-36.img fedora-36.img fedora-36.img  > test.raw
  $ ls -lsh test.raw 
  4.7G -rw-r--r--. 1 rjones rjones 24G Sep  1 09:53 test.raw
  $ qemu-img convert -f raw test.raw -O qcow2 test.qcow2.zlib.4k -c -o compression_type=zlib,cluster_size=4096

(for cluster sizes 4k, 64k, 512k, 2048k, and
compression types zlib & zstd)

I tested the speed of decompression using:

  $ hyperfine 'qemu-img convert -W -m 16 -f qcow2 test.qcow2.XXX -O raw test.out'
  (qemu 8.0.0-4.fc39.x86_64)

  $ hyperfine 'nbdkit -U - --filter=qcow2dec file test.qcow2.XXX --run '\''nbdcopy --request-size "$uri" test.out'\'' '
  (nbdkit-1.35.11-2.fc40.x86_64)

Results:

  Cluster  Compression  Compressed size  Prog   Decompression speed

  4k   zlib 3228811264   qemu   5.921 s ±  0.074 s
  4k   zstd 3258097664   qemu   5.189 s ±  0.158 s

  4k   zlib/zstd nbdkit failed, bug!!

  64k  zlib 3164667904   qemu   3.579 s ±  0.094 s
  64k  zstd 3132686336   qemu   1.770 s ±  0.060 s

  64k  zlib 3164667904   nbdkit 1.254 s ±  0.065 s
  64k  zstd 3132686336   nbdkit 1.315 s ±  0.037 s

  512k zlib 3158744576   qemu   4.008 s ±  0.058 s
  512k zstd 3032697344   qemu   1.503 s ±  0.072 s

  512k zlib 3158744576   nbdkit 1.702 s ±  0.026 s
  512k zstd 3032697344   nbdkit 1.593 s ±  0.039 s

  2048kzlib 3197569024   qemu   4.327 s ±  0.051 s
  2048kzstd 2995143168   qemu   1.465 s ±  0.085 s

  2048kzlib 3197569024   nbdkit 3.660 s ±  0.011 s
  2048kzstd 2995143168   nbdkit 3.368 s ±  0.057 s

No great surprises - very small cluster size is inefficient, but
scaling up the cluster size gains performance, and zstd performs better
than zlib once the cluster size is sufficiently large.

Something strange happens with 2048k clusters + zlib - the file gets
larger and decompression gets slower again.

nbdkit has a couple of issues: it fails completely at 4k, and has
scaling issues at the largest cluster size.  The scaling issue turns
out to be caused by the default request size used by nbdcopy (256k;
increasing it cures the problem).

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW




Re: zlib-ng as a compat replacement for zlib

2023-08-31 Thread Richard W.M. Jones
On Thu, Aug 31, 2023 at 11:05:55AM +0200, Kevin Wolf wrote:
> [ Cc: qemu-block ]
> 
> On 30.08.2023 at 20:26, Richard W.M. Jones wrote:
> > On Tue, Aug 29, 2023 at 05:49:24PM -, Daniel Alley wrote:
> > > > The background to this is I've spent far too long trying to optimize
> > > > the conversion of qcow2 files to raw files.  Most existing qcow2 files
> > > > that you can find online are zlib compressed, including the qcow2
> > > > images provided by Fedora.  Each cluster in the file is separately
> > > > compressed as a zlib stream, and qemu uses zlib library functions to
> > > > decompress them.  When downloading and decompressing these files, I
> > > > measured 40%+ of the total CPU time is doing zlib decompression.
> > > > 
> > > > [You don't need to tell me how great Zstd is, qcow2 supports this for
> > > > compression also, but it is not widely used by existing content.]
> 
> You make it sound like compressing each cluster individually has a big
> impact. If so, does increasing the cluster size make a difference, too?
> That could be a change with fewer compatibility concerns.

The issue we're discussing in the original thread is speed of
decompression.  We noted that using zlib-ng (a not-quite drop-in
replacement for zlib) improves decompression speed by 40% or more.

Original thread:
https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/thread/CDNPJ4SOTRQMYVCDI3ZSY4SP4FYESCWD/
zlib-ng proposed change:
https://src.fedoraproject.org/rpms/zlib-ng/pull-request/3

Size of the compressed file is also a concern, but wasn't discussed.

> > > Independent from the decision to use zlib-ng as a distro-wide zlib
> > > replacement, which is a good idea regardless.
> > >
> > > Are there reasons why Fedora's qcow2 images cannot switch to Zstd
> > > compression?  Zstd support appears to have been added by QEMU 5.1 in
> > > August 2020, and both EL8 and EL9 appear to have newer versions QEMU
> > > available (therefore, they ought to be able to support those
> > > images).
> > 
> > TBH I think the most probable reason is that people don't know about
> > it and it is not obvious that you have to enable it.  To generate a
> > zlib-compressed qcow2 file, you simply add the -c option, easy.  To
> > use zstd compression you have to use this mouthful:
> > 
> >   qemu-img convert -f raw disk.img -O qcow2 disk.qcow2 -c -o 
> > compression_type=zstd
> > 
> > The qemu-img man page doesn't even mention it.
> 
> Good point, that needs to be fixed.
> 
> (Though I don't think that '-o compression_type=zstd' is an unreasonable
> mouthful for enabling a non-standard compression algorithm.)
> 
> > I think all recent qcow2-based tools should be fine with zstd, but I
> > didn't check them all (RHEL 7 is still quite popular so that platform
> > would no longer be compatible).
> 
> So my first thought was that maybe we can just change the default now
> that the option has been there for quite a while. But then it occurred
> to me that it's not a hard dependency. So at least, the default would
> still have to be zlib if zstd isn't even compiled in, which makes it a
> bit less nice in theory. In practice, it depends on what build options
> distros actually use.
> 
> Unfortunately, we seem to build the RHEL packages with --disable-zstd
> (I suppose just because we tend to disable everything nobody explicitly
> asked for). Maybe we should check other distros. If zstd is commonly
> enabled, we could still make it the default upstream (because honestly,
> it probably does make sense from an upstream perspective). Of course,
> for RHEL this would mean that images out there are likely to use zstd
> soon, so it might need to enable zstd in newer versions, too.

Oh that is bad actually.  We really should enable zstd support in RHEL :-/

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW




Re: [Libguestfs] [libnbd PATCH v3 03/22] protocol: Add definitions for extended headers

2023-06-08 Thread Richard W.M. Jones
On Thu, Jun 08, 2023 at 02:38:40PM +0200, Laszlo Ersek wrote:
> On 6/8/23 14:20, Richard W.M. Jones wrote:
> > On Thu, Jun 08, 2023 at 01:48:41PM +0200, Laszlo Ersek wrote:
> >> On 6/7/23 12:00, Richard W.M. Jones wrote:
> >>> On Tue, May 30, 2023 at 05:48:25PM +0200, Laszlo Ersek wrote:
> >>>> BTW I'm foreseeing a problem: if the extended block descriptor can
> >>>> provide an unsigned 64-bit length, we're going to have trouble exposing
> >>>> that in OCaml, because OCaml only has signed 64-bit integers. So that's
> >>>> going to reproduce the same issue, only for OCaml callers of the *new* 
> >>>> API.
> >>>>
> >>>> I can see Eric's series includes patches like "ocaml: Add example for
> >>>> 64-bit extents" -- I've not looked at those yet; for now I'm just
> >>>> wondering what tricks we might need in the bindings generator. The
> >>>> method seen in the "middle patch" above won't work; we don't have a
> >>>> native OCaml "i128" type for example that we could use as an escape
> >>>> hatch, for representing C's uint64_t.
> >>>
> >>> I think that's OK because disk sizes are already limited to
> >>> 2^63 - 1 by the kernel (and for qemu even less than that).
> >>> The OCaml bindings return a (signed) int64 for NBD.get_size.
> >>
> >> Under patch#7 yesterday, I made a proposal for "armoring" at least one
> >> instance / direction of the uint64_t <-> int64 conversion. It raised an
> >> interesting problem: raising OCaml exceptions in such C functions that
> >> are *not* directly called by the OCaml runtime. Comments would be much
> >> appreciated in that subthread!
> > 
> > I can't seem to find that thread (but also gmail split the messages
> > randomly over the 3 different mailing lists, because Google doesn't
> > understand how email works).  Do you have a link?
> 
> It starts here (link to patch#7):
> 
> http://mid.mail-archive.com/20230525130108.757242-8-eblake@redhat.com

OK I see it now (in the reply).  I have answered there.

> So for example, in extent64_wrapper_locked(), if we exited the called
> nbd_internal_ocaml_alloc_extent64_array() function with caml_failwith(),
> the caml_copy_string() and caml_copy_int64() allocations, stored earlier
> into "CAMLlocal"s "metacontextv" and "offsetv", would not be leaked?

Correct.  When unwinding the stack, those frame structs created by
CAML* macros are unlinked.  Then the heap variables will have no
references and will be freed in the course of garbage collection.

> > 
> >> (On a tangent: I've also noticed we use CAMLparam0() & friends in some
> >> of our functions that are *not* directly called by the OCaml runtime.
> >> They certainly run on the OCaml runtime's stack, but there's at least
> >> one intervening stack frame where the C-language function is provided by
> >> us. Now I know we must use CAMLparam0() in our *outermost* such
> >> function, but what about the further functions (inner C-language
> >> functions) that our outermost function calls in turn? I think the inner
> >> functions are at liberty not to use CAMLparam0() -- otherwise, our
> >> functions couldn't even call normal C library functions!)
> > 
> > These macros just set up a linked list of frames.  You don't need to
> > use them in every function, only ones which are using OCaml values.
> 
> Ah, understood.
> 
> > The macros are fairly easy to understand by reading them:
> > 
> > https://github.com/ocaml/ocaml/blob/864f772e5338dcf6be2093d5cc3ed6f7fbce16b7/runtime/caml/memory.h#L270
> > 
> > When the GC runs it walks up the linked list of the current thread to
> > find roots.  The only tricky thing about it is making sure that at any
> > point where the GC could run, each slot contains a valid entry and not
> > some intermediate or uninitialized value, since this is precise (not
> > conservative) garbage collection.
> 
> Right, <https://v2.ocaml.org/manual/intfc.html> too contains a related
> warning:
> 
>   Rule 5  After a structured block (a block with tag less than
>   No_scan_tag) is allocated with the low-level functions, all fields of
>   this block must be filled with well-formed values before the next
>   allocation operation. [...]
> 
> Thankfully it also says, "You can ignore those rules if you stick to t

Re: [Libguestfs] [libnbd PATCH v3 07/22] generator: Add struct nbd_extent in prep for 64-bit extents

2023-06-08 Thread Richard W.M. Jones
On Wed, Jun 07, 2023 at 04:23:27PM +0200, Laszlo Ersek wrote:
[...]
> > diff --git a/ocaml/helpers.c b/ocaml/helpers.c
> > index 3361a696..09666daf 100644
> > --- a/ocaml/helpers.c
> > +++ b/ocaml/helpers.c
> > @@ -133,6 +133,26 @@ nbd_internal_ocaml_alloc_i64_from_u32_array (uint32_t 
> > *a, size_t len)
> >CAMLreturn (rv);
> >  }
> >
> > +value
> > +nbd_internal_ocaml_alloc_extent64_array (nbd_extent *a, size_t len)
> > +{
> > +  CAMLparam0 ();
> > +  CAMLlocal3 (s, v, rv);
> > +  size_t i;
> > +
> > +  rv = caml_alloc (len, 0);
> > +  for (i = 0; i < len; ++i) {
> > +s = caml_alloc (2, 0);
> > +v = caml_copy_int64 (a[i].length);
> > +Store_field (s, 0, v);
> > +v = caml_copy_int64 (a[i].flags);
> > +Store_field (s, 1, v);
> > +Store_field (rv, i, s);
> > +  }
> > +
> > +  CAMLreturn (rv);
> > +}
> > +
> >  /* Convert a Unix.sockaddr to a C struct sockaddr. */
> >  void
> >  nbd_internal_unix_sockaddr_to_sa (value sockaddrv,
> 
> (19) I'd suggest the following addition:
> 
> > diff --git a/ocaml/helpers.c b/ocaml/helpers.c
> > index 09666dafa7d1..db652943141d 100644
> > --- a/ocaml/helpers.c
> > +++ b/ocaml/helpers.c
> > @@ -25,6 +25,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  #include 
> >  #include 
> > @@ -140,6 +141,16 @@ nbd_internal_ocaml_alloc_extent64_array (nbd_extent 
> > *a, size_t len)
> >CAMLlocal3 (s, v, rv);
> >size_t i;
> >
> > +  for (i = 0; i < len; ++i)
> > +if (a[i].length > INT64_MAX || a[i].flags > INT64_MAX) {
> > +  char errmsg[256];
> > +
> > +  snprintf (errmsg, sizeof errmsg,
> > +"%s: extent[%zu] = { .length = %"PRIu64", .flags = 
> > %"PRIu64"}",
> > +__func__, i, a[i].length, a[i].flags);
> > +  caml_failwith (errmsg);
> > +}
> > +
> >rv = caml_alloc (len, 0);
> >for (i = 0; i < len; ++i) {
> >  s = caml_alloc (2, 0);
>
> *However*, considering the nbd_internal_ocaml_alloc_extent64_array()
> call site, in the generated extent64_wrapper_locked() function
> [ocaml/nbd-c.c]:
> 
> > /* Wrapper for extent64 callback. */
> > static int
> > extent64_wrapper_locked (void *user_data, const char *metacontext,
> >  uint64_t offset, nbd_extent *entries,
> >  size_t nr_entries, int *error)
> > {
> >   CAMLparam0 ();
> >   CAMLlocal4 (metacontextv, offsetv, entriesv, errorv);
> >   CAMLlocal2 (exn, rv);
> >   const struct user_data *data = user_data;
> >   int r;
> >   value args[4];
> >
> >   metacontextv = caml_copy_string (metacontext);
> >   offsetv = caml_copy_int64 (offset);
> >   entriesv = nbd_internal_ocaml_alloc_extent64_array (
> >entries,
> >nr_entries
> >  );
> >   errorv = caml_alloc_tuple (1);
> >   Store_field (errorv, 0, Val_int (*error));
> >   args[0] = metacontextv;
> >   args[1] = offsetv;
> >   args[2] = entriesv;
> >   args[3] = errorv;
> >   rv = caml_callbackN_exn (data->fnv, 4, args);
> >   *error = Int_val (Field (errorv, 0));
> >   if (Is_exception_result (rv)) {
> > nbd_internal_ocaml_exception_in_wrapper ("extent64", rv);
> > CAMLreturnT (int, -1);
> >   }
> >
> >   r = Int_val (rv);
> >   assert (r >= 0);
> >   CAMLreturnT (int, r);
> > }
> 
> I'm not sure if raising an OCaml exception like this, in an *inner* C
> function, is appropriate. caml_failwith() may only be suitable for C
> functions *directly* called by the OCaml runtime.

caml_failwith is just a longjmp.  The OCaml values on the stack are
cleaned up by following a chain of stack frames which are created by
the CAMLparam* macros.  These macros don't need to be used in every
function, although they should be used in functions which have OCaml
value parameters or value local variables (but there's no harm in
using them unnecessarily).  They can also be omitted for functions
which do not invoke the OCaml GC (don't allocate on the OCaml heap,
basically).

C local variables however will not be cleaned up, so if there are any
arrays that have to be freed then it gets a bit more complicated.
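
To illustrate that last point with a made-up helper (this is not
libnbd code):

  #include <stdlib.h>
  #include <string.h>

  #include <caml/alloc.h>
  #include <caml/fail.h>
  #include <caml/memory.h>
  #include <caml/mlvalues.h>

  static value
  copy_or_fail (const char *msg)
  {
    CAMLparam0 ();
    CAMLlocal1 (v);                 /* initialized to Val_unit, so always
                                       a valid root */
    char *buf = strdup (msg);       /* C allocation, invisible to the GC */

    if (buf == NULL)
      caml_failwith ("strdup");     /* safe: the longjmp unlinks the CAML*
                                       frames set up so far */
    v = caml_copy_string (buf);
    free (buf);                     /* C memory is ours to free; raising
                                       between the strdup and here would
                                       have leaked buf */
    CAMLreturn (v);
  }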

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW




Re: [libnbd PATCH v3 17/22] ocaml: Add example for 64-bit extents

2023-06-08 Thread Richard W.M. Jones
On Thu, Jun 08, 2023 at 10:37:59AM +0100, Richard W.M. Jones wrote:
> Yes, the API is nicer now that we return the subelements as a list instead
> of having to iterate over the list in pairs.  I might change that to
> an array or struct after these patches go upstream as that will be a
> tiny bit more efficient.

Actually the second sentence here is wrong.  It's allocated as a pair,
not a list, and a pair has the same internal representation as a
2-element array.
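
(From the C side the two even look alike: a pair and a 2-element array
are both heap blocks with tag 0 and two fields, so the same accesses
work for either - a sketch:

  value len   = Field (v, 0);   /* v is (len, flags) or [| len; flags |] */
  value flags = Field (v, 1);

so only the OCaml-visible type would change.)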

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW




Re: [Libguestfs] [libnbd PATCH v3 03/22] protocol: Add definitions for extended headers

2023-06-08 Thread Richard W.M. Jones
On Thu, Jun 08, 2023 at 01:48:41PM +0200, Laszlo Ersek wrote:
> On 6/7/23 12:00, Richard W.M. Jones wrote:
> > On Tue, May 30, 2023 at 05:48:25PM +0200, Laszlo Ersek wrote:
> >> BTW I'm foreseeing a problem: if the extended block descriptor can
> >> provide an unsigned 64-bit length, we're going to have trouble exposing
> >> that in OCaml, because OCaml only has signed 64-bit integers. So that's
> >> going to reproduce the same issue, only for OCaml callers of the *new* API.
> >>
> >> I can see Eric's series includes patches like "ocaml: Add example for
> >> 64-bit extents" -- I've not looked at those yet; for now I'm just
> >> wondering what tricks we might need in the bindings generator. The
> >> method seen in the "middle patch" above won't work; we don't have a
> >> native OCaml "i128" type for example that we could use as an escape
> >> hatch, for representing C's uint64_t.
> > 
> > I think that's OK because disk sizes are already limited to
> > 2^63 - 1 by the kernel (and for qemu even less than that).
> > The OCaml bindings return a (signed) int64 for NBD.get_size.
> 
> Under patch#7 yesterday, I made a proposal for "armoring" at least one
> instance / direction of the uint64_t <-> int64 conversion. It raised an
> interesting problem: raising OCaml exceptions in such C functions that
> are *not* directly called by the OCaml runtime. Comments would be much
> appreciated in that subthread!

I can't seem to find that thread (but also gmail split the messages
randomly over the 3 different mailing lists, because Google doesn't
understand how email works).  Do you have a link?

But the answer is you can raise an exception anywhere since it's
really just a longjmp.  The only issue is if you need stack-allocated
variables to be freed, which for OCaml values is handled by the
CAMLlocal* macros and for C variables you need to be careful about and
deal with yourself.

> (On a tangent: I've also noticed we use CAMLparam0() & friends in some
> of our functions that are *not* directly called by the OCaml runtime.
> They certainly run on the OCaml runtime's stack, but there's at least
> one intervening stack frame where the C-language function is provided by
> us. Now I know we must use CAMLparam0() in our *outermost* such
> function, but what about the further functions (inner C-language
> functions) that our outermost function calls in turn? I think the inner
> functions are at liberty not to use CAMLparam0() -- otherwise, our
> functions couldn't even call normal C library functions!)

These macros just set up a linked list of frames.  You don't need to
use them in every function, only ones which are using OCaml values.

The macros are fairly easy to understand by reading them:

https://github.com/ocaml/ocaml/blob/864f772e5338dcf6be2093d5cc3ed6f7fbce16b7/runtime/caml/memory.h#L270

When the GC runs it walks up the linked list of the current thread to
find roots.  The only tricky thing about it is making sure that at any
point where the GC could run, each slot contains a valid entry and not
some intermediate or uninitialized value, since this is precise (not
conservative) garbage collection.
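
For reference, the node type of that linked list (copied from
caml/memory.h, from memory, so double-check me) is:

  struct caml__roots_block {
    struct caml__roots_block *next;
    intnat ntables;
    intnat nitems;
    value *tables[5];
  };

CAMLparam* saves the current head of the list, the CAMLlocal* macros
push slots for their values, and CAMLreturn restores the saved head -
which is also exactly what unwinding past the frame with caml_failwith
achieves.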

This mechanism is only used by C code.  In OCaml code there's a bitmap
generated for each function showing which stack slots contain values
(versus ints, return addresses, other stuff).

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
nbdkit - Flexible, fast NBD server with plugins
https://gitlab.com/nbdkit/nbdkit




Re: [Libguestfs] [libnbd PATCH v3 00/22] NBD 64-bit extensions (libnbd portion)

2023-06-08 Thread Richard W.M. Jones


Sorry it took me so long to get around to this one.  Hopefully
the next version will need only cursory approval.

I did read all of the patches, and I basically agree with Laszlo on
the ones he reviewed already.  Therefore I didn't add any comment
except where necessary.  You can assume Acked-by on those patches.

Thanks!

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW




Re: [Libguestfs] [libnbd PATCH v3 22/22] api: Add nbd_[aio_]block_status_filter()

2023-06-08 Thread Richard W.M. Jones
 bytes_sent = nbd_stats_bytes_sent (nbd);
> +  seen = 0;
> +  if (nbd_block_status_filter (nbd, exportsize, 0, list (0x0),
> +   (nbd_extent64_callback) { .callback = cb,
> + .user_data = &seen 
> },
> +   0) != -1) {
> +fprintf (stderr, "expecting block status failure\n");
> +exit (EXIT_FAILURE);
> +  }
> +  assert (seen == 0x0);
> +  if (nbd_get_errno () != EINVAL) {
> +fprintf (stderr, "expecting EINVAL after block status failure\n");
> +exit (EXIT_FAILURE);
> +  }
> +  if (nbd_stats_bytes_sent (nbd) <= bytes_sent) {
> +fprintf (stderr, "expecting server-side rejection of bad request\n");
> +exit (EXIT_FAILURE);
> +  }
> +
> +  /* Giving unknown string triggers EINVAL from libnbd */
> +  bytes_sent = nbd_stats_bytes_sent (nbd);
> +  seen = 0;
> +  {
> +const char *bogus[] = { "qemu:dirty-bitmap:bitmap2", NULL };
> +if (nbd_block_status_filter (nbd, exportsize, 0, (char **) bogus,
> + (nbd_extent64_callback) { .callback = cb,
> +   .user_data = 
> &seen },
> + 0) != -1) {
> +  fprintf (stderr, "expecting block status failure\n");
> +  exit (EXIT_FAILURE);
> +}
> +  }
> +  if (nbd_get_errno () != EINVAL) {
> +fprintf (stderr, "expecting EINVAL after block status failure\n");
> +exit (EXIT_FAILURE);
> +  }
> +  assert (seen == 0x0);
> +  if (nbd_stats_bytes_sent (nbd) != bytes_sent) {
> +fprintf (stderr, "expecting client-side rejection of bad request\n");
> +exit (EXIT_FAILURE);
> +  }
> +
> +  /* Giving same string twice triggers EINVAL from qemu */
> +  seen = 0;
> +  {
> +const char *dupes[] = { "base:allocation", "base:allocation", NULL };
> +if (nbd_block_status_filter (nbd, exportsize, 0, (char **) dupes,
> + (nbd_extent64_callback) { .callback = cb,
> +   .user_data = 
> &seen },
> + 0) != -1) {
> +  fprintf (stderr, "expecting block status failure\n");
> +  exit (EXIT_FAILURE);
> +}
> +  }
> +  if (nbd_get_errno () != EINVAL) {
> +fprintf (stderr, "expecting EINVAL after block status failure\n");
> +exit (EXIT_FAILURE);
> +  }
> +  assert (seen == 0x0);
> +  if (nbd_stats_bytes_sent (nbd) <= bytes_sent) {
> +fprintf (stderr, "expecting server-side rejection of bad request\n");
> +exit (EXIT_FAILURE);
> +  }
> +
> +  /* Done */
>if (nbd_shutdown (nbd, 0) == -1) {
>  fprintf (stderr, "%s\n", nbd_get_error ());
>  exit (EXIT_FAILURE);
> diff --git a/interop/block-status-payload.sh b/interop/block-status-payload.sh
> index a12cfc8a..0e6681b6 100755
> --- a/interop/block-status-payload.sh
> +++ b/interop/block-status-payload.sh
> @@ -49,6 +49,7 @@ args = ["qemu-nbd", "-f", "qcow2", "-A", "-B", "bitmap0", 
> "-B", "bitmap1",
>  h.connect_systemd_socket_activation(args)
>  assert h.aio_is_negotiating() is True
>  assert h.get_extended_headers_negotiated() is False
> +
>  # Flag not available until info or go
>  try:
>h.can_block_status_payload()
> @@ -58,7 +59,18 @@ except nbd.Error:
>  h.opt_info()
>  assert h.can_block_status_payload() is False
>  assert h.can_meta_context("base:allocation") is True
> -h.opt_abort()
> +
> +# Filter request not allowed if not advertised
> +def f():
> +  assert False
> +h.opt_go()
> +assert h.can_block_status_payload() is False
> +try:
> +  h.block_status_filter(0, 512, ["base:allocation"], f)
> +  assert False
> +except nbd.Error:
> +  pass
> +h.shutdown()
>  '
> 
>  # Conditional part of test: if qemu is new enough to support extended
> diff --git a/info/info-can.sh b/info/info-can.sh
> index 8154d1ce..097837d2 100755
> --- a/info/info-can.sh
> +++ b/info/info-can.sh
> @@ -38,6 +38,9 @@ requires bash -c "nbdkit sh --dump-plugin | grep 
> has_can_cache=1"
>  # and oldstyle never, but that feels like depending a bit too much on
>  # the implementation.
> 
> +# --can block-status-payload is not supported by nbdkit yet. Testing
> +# is done during interop with new-enough qemu.
> +
>  # --can structured-reply is not a per-export setting, but rather
>  # something set on the server as a whole.

Seems generally OK, so:

Acked-by: Richard W.M. Jones 

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html




Re: [libnbd PATCH v3 21/22] api: Add nbd_can_block_status_payload()

2023-06-08 Thread Richard W.M. Jones
he GNU
> +# Lesser General Public License for more details.
> +#
> +# You should have received a copy of the GNU Lesser General Public
> +# License along with this library; if not, write to the Free Software
> +# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 
> USA
> +
> +# Test use of block status payload for server filtering
> +
> +source ../tests/functions.sh
> +set -e
> +set -x
> +
> +requires qemu-img bitmap --help
> +# This test uses the qemu-nbd -A and -B options.
> +requires qemu-nbd -A -BA --version
> +
> +file="block-status-payload.qcow2"
> +rm -f $file
> +cleanup_fn rm -f $file
> +
> +# Create sparse file with two bitmaps.
> +qemu-img create -f qcow2 $file 1M
> +qemu-img bitmap --add --enable -f qcow2 $file bitmap0
> +qemu-img bitmap --add --enable -f qcow2 $file bitmap1
> +
> +# Unconditional part of test: qemu should not advertise block status payload
> +# support if extended headers are not in use
> +nbdsh -c '
> +h.set_request_extended_headers(False)
> +h.add_meta_context("base:allocation")
> +h.add_meta_context("qemu:allocation-depth")
> +h.add_meta_context("qemu:dirty-bitmap:bitmap0")
> +h.add_meta_context("qemu:dirty-bitmap:bitmap1")
> +h.set_opt_mode(True)
> +args = ["qemu-nbd", "-f", "qcow2", "-A", "-B", "bitmap0", "-B", "bitmap1",
> +"'"$file"'"]
> +h.connect_systemd_socket_activation(args)
> +assert h.aio_is_negotiating() is True
> +assert h.get_extended_headers_negotiated() is False
> +# Flag not available until info or go
> +try:
> +  h.can_block_status_payload()
> +  assert False
> +except nbd.Error:
> +  pass
> +h.opt_info()
> +assert h.can_block_status_payload() is False
> +assert h.can_meta_context("base:allocation") is True
> +h.opt_abort()
> +'
> +
> +# Conditional part of test: if qemu is new enough to support extended
> +# headers, we assume it can also support block status payload.
> +requires nbdinfo --can extended-headers -- [ qemu-nbd -r -f qcow2 "$file" ]
> +$VG ./block-status-payload \
> +qemu-nbd -f qcow2 -A -B bitmap0 -B bitmap1 $file
> diff --git a/.gitignore b/.gitignore
> index fd81357b..a2d052bd 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -101,6 +101,7 @@ Makefile.in
>  /info/nbdinfo
>  /info/nbdinfo.1
>  /install-sh
> +/interop/block-status-payload
>  /interop/dirty-bitmap
>  /interop/interop-nbd-server
>  /interop/interop-nbd-server-tls
> diff --git a/info/can.c b/info/can.c
> index 31c4a1ca..6dd68eeb 100644
> --- a/info/can.c
> +++ b/info/can.c
> @@ -72,6 +72,11 @@ do_can (void)
>else if (strcasecmp (can, "rotational") == 0)
>  feature = nbd_is_rotational (nbd);
> 
> +  else if (strcasecmp (can, "block status payload") == 0 ||
> +   strcasecmp (can, "block-status-payload") == 0 ||
> +   strcasecmp (can, "block_status_payload") == 0)
> +feature = nbd_can_block_status_payload (nbd);
> +
>else if (strcasecmp (can, "cache") == 0)
>  feature = nbd_can_cache (nbd);
> 
> diff --git a/info/show.c b/info/show.c
> index 920bbb0a..8914f927 100644
> --- a/info/show.c
> +++ b/info/show.c
> @@ -54,7 +54,7 @@ show_one_export (struct nbd_handle *nbd, const char *desc,
>char *uri = NULL;
>int is_rotational, is_read_only;
>int can_cache, can_df, can_fast_zero, can_flush, can_fua,
> -can_multi_conn, can_trim, can_zero;
> +can_multi_conn, can_trim, can_zero, can_block_status_payload;
>int64_t block_minimum, block_preferred, block_maximum;
>string_vector contexts = empty_vector;
>bool show_context = false;
> @@ -120,6 +120,7 @@ show_one_export (struct nbd_handle *nbd, const char *desc,
>can_multi_conn = nbd_can_multi_conn (nbd);
>can_trim = nbd_can_trim (nbd);
>can_zero = nbd_can_zero (nbd);
> +  can_block_status_payload = nbd_can_block_status_payload (nbd);
>block_minimum = nbd_get_block_size (nbd, LIBNBD_SIZE_MINIMUM);
>block_preferred = nbd_get_block_size (nbd, LIBNBD_SIZE_PREFERRED);
>block_maximum = nbd_get_block_size (nbd, LIBNBD_SIZE_MAXIMUM);
> @@ -161,6 +162,8 @@ show_one_export (struct nbd_handle *nbd, const char *desc,
>  if (is_read_only >= 0)
>fprintf (fp, "\t%s: %s\n", "is_read_only",
> is_read_only ? "true" : "false");
> +if (can_block_status_payload >= 0)
> +  show_boolean ("can_block_status_payload", can_block_status_payload);
>  if (can_cache >= 0)
>show_boolean ("can_cache", can_cache);
>  if (can_df >= 0)
> @@ -230,6 +233,10 @@ show_one_export (struct nbd_handle *nbd, const char 
> *desc,
>  if (is_read_only >= 0)
>fprintf (fp, "\t\"%s\": %s,\n",
>"is_read_only", is_read_only ? "true" : "false");
> +if (can_block_status_payload >= 0)
> +  fprintf (fp, "\t\"%s\": %s,\n",
> +  "can_block_status_payload",
> +   can_block_status_payload ? "true" : "false");
>  if (can_cache >= 0)
>fprintf (fp, "\t\"%s\": %s,\n",
>"can_cache", can_cache ? "true" : "false");

Reviewed-by: Richard W.M. Jones 

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org




Re: [Libguestfs] [libnbd PATCH v3 20/22] interop: Add test of 64-bit block status

2023-06-08 Thread Richard W.M. Jones
On Thu, May 25, 2023 at 08:01:06AM -0500, Eric Blake wrote:
> Prove that we can round-trip a block status request larger than 4G
> through a new-enough qemu-nbd.  Also serves as a unit test of our shim
> for converting internal 64-bit representation back to the older 32-bit
> nbd_block_status callback interface.

I think it would be best to call this test "large-block-status.{c,sh}"
as "large-status" is ambiguous.  (Or even "block-status-64"?)

The test itself is fine, so if renamed:

Reviewed-by: Richard W.M. Jones 

> Signed-off-by: Eric Blake 
> ---
>  interop/Makefile.am |   6 ++
>  interop/large-status.c  | 186 
>  interop/large-status.sh |  49 +++
>  .gitignore  |   1 +
>  4 files changed, 242 insertions(+)
>  create mode 100644 interop/large-status.c
>  create mode 100755 interop/large-status.sh
> 
> diff --git a/interop/Makefile.am b/interop/Makefile.am
> index 3f81df0c..9a7a5967 100644
> --- a/interop/Makefile.am
> +++ b/interop/Makefile.am
> @@ -21,6 +21,7 @@ EXTRA_DIST = \
>   dirty-bitmap.sh \
>   interop-qemu-storage-daemon.sh \
>   interop-qemu-block-size.sh \
> + large-status.sh \
>   list-exports-nbd-config \
>   list-exports-test-dir/disk1 \
>   list-exports-test-dir/disk2 \
> @@ -134,6 +135,7 @@ check_PROGRAMS += \
>   list-exports-qemu-nbd \
>   socket-activation-qemu-nbd \
>   dirty-bitmap \
> + large-status \
>   structured-read \
>   opt-extended-headers \
>   $(NULL)
> @@ -144,6 +146,7 @@ TESTS += \
>   list-exports-qemu-nbd \
>   socket-activation-qemu-nbd \
>   dirty-bitmap.sh \
> + large-status.sh \
>   structured-read.sh \
>   interop-qemu-block-size.sh \
>   opt-extended-headers.sh \
> @@ -235,6 +238,9 @@ socket_activation_qemu_nbd_LDADD = 
> $(top_builddir)/lib/libnbd.la
>  dirty_bitmap_SOURCES = dirty-bitmap.c
>  dirty_bitmap_LDADD = $(top_builddir)/lib/libnbd.la
> 
> +large_status_SOURCES = large-status.c
> +large_status_LDADD = $(top_builddir)/lib/libnbd.la
> +
>  structured_read_SOURCES = structured-read.c
>  structured_read_LDADD = $(top_builddir)/lib/libnbd.la
> 
> diff --git a/interop/large-status.c b/interop/large-status.c
> new file mode 100644
> index ..36415653
> --- /dev/null
> +++ b/interop/large-status.c
> @@ -0,0 +1,186 @@
> +/* NBD client library in userspace
> + * Copyright Red Hat
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 
> USA
> + */
> +
> +/* Test 64-bit block status with qemu. */
> +
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +static const char *bitmap;
> +
> +struct data {
> +  bool req_one;/* input: true if req_one was passed to request */
> +  int count;   /* input: count of expected remaining calls */
> +  bool seen_base;  /* output: true if base:allocation encountered */
> +  bool seen_dirty; /* output: true if qemu:dirty-bitmap encountered */
> +};
> +
> +static int
> +cb32 (void *opaque, const char *metacontext, uint64_t offset,
> +  uint32_t *entries, size_t len, int *error)
> +{
> +  struct data *data = opaque;
> +
> +  assert (offset == 0);
> +  assert (data->count-- > 0);
> +
> +  if (strcmp (metacontext, LIBNBD_CONTEXT_BASE_ALLOCATION) == 0) {
> +assert (!data->seen_base);
> +data->seen_base = true;
> +
> +/* Data block offset 0 size 64k, remainder is hole */
> +assert (len == 4);
> +assert (entries[0] == 65536);
> +assert (entries[1] == 0);
> +/* libnbd had to truncate qemu's >4G answer */
> +assert (entries[2] == 4227858432);
> +assert (entries[3] == (LIBNBD_STATE_HOLE|LIBNBD_STATE_ZERO));
> +  }
> +  else if (strcmp (metacontext, bitmap) == 0) {
> +assert (!data->seen_dirty);
> +data->seen_dirty =

Re: [libnbd PATCH v3 19/22] api: Add nbd_[aio_]opt_extended_headers()

2023-06-08 Thread Richard W.M. Jones
rue);
> +
> +  /* request_eh is ignored if request_sr is false. */
> +  nbd = prep (false, true, &argv[1]);
> +  check (nbd_get_extended_headers_negotiated (nbd), false);
> +  check (nbd_get_structured_replies_negotiated (nbd), false);
> +  cleanup (nbd, false);
> +
> +  /* Swap order, requesting structured replies before extended headers */
> +  nbd = prep (false, false, &argv[1]);
> +  check (nbd_get_extended_headers_negotiated (nbd), false);
> +  check (nbd_get_structured_replies_negotiated (nbd), false);
> +  check (nbd_opt_structured_reply (nbd), true);
> +  check (nbd_get_extended_headers_negotiated (nbd), false);
> +  check (nbd_get_structured_replies_negotiated (nbd), true);
> +  check (nbd_opt_extended_headers (nbd), true);
> +  check (nbd_get_extended_headers_negotiated (nbd), true);
> +  check (nbd_get_structured_replies_negotiated (nbd), true);
> +  cleanup (nbd, true);
> +
> +  exit (EXIT_SUCCESS);
> +}
> diff --git a/interop/opt-extended-headers.sh b/interop/opt-extended-headers.sh
> new file mode 100755
> index ..41322f36
> --- /dev/null
> +++ b/interop/opt-extended-headers.sh
> @@ -0,0 +1,29 @@
> +#!/usr/bin/env bash
> +# nbd client library in userspace
> +# Copyright Red Hat
> +#
> +# This library is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU Lesser General Public
> +# License as published by the Free Software Foundation; either
> +# version 2 of the License, or (at your option) any later version.
> +#
> +# This library is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +# Lesser General Public License for more details.
> +#
> +# You should have received a copy of the GNU Lesser General Public
> +# License along with this library; if not, write to the Free Software
> +# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 
> USA
> +
> +# Test low-level nbd_opt_extended_headers() details with qemu-nbd
> +
> +source ../tests/functions.sh
> +set -e
> +set -x
> +
> +requires qemu-nbd --version
> +requires nbdinfo --can extended-headers -- [ qemu-nbd -r -f raw "$0" ]
> +
> +# Run the test.
> +$VG ./opt-extended-headers qemu-nbd -r -f raw "$0"
> diff --git a/.gitignore b/.gitignore
> index bc7c2c37..24642748 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -118,6 +118,7 @@ Makefile.in
>  /interop/list-exports-nbdkit
>  /interop/list-exports-qemu-nbd
>  /interop/nbd-server-tls.conf
> +/interop/opt-extended-headers
>  /interop/requires.c
>  /interop/socket-activation-nbdkit
>  /interop/socket-activation-qemu-nbd

Reviewed-by: Richard W.M. Jones 

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top




Re: [libnbd PATCH v3 18/22] generator: Actually request extended headers

2023-06-08 Thread Richard W.M. Jones
. */
> +h->structured_replies = true;
> +break;
> +  default:
> +if (handle_reply_error (h) == -1) {
> +  SET_NEXT_STATE (%.DEAD);
> +  return 0;
> +}
> +
> +debug (h, "extended headers are not supported by this server");
> +break;
> +  }
> +
> +  /* Next option. */
> +  SET_NEXT_STATE (%^OPT_STRUCTURED_REPLY.START);
> +  return 0;
> +
> +} /* END STATE MACHINE */
> diff --git a/generator/states-newstyle-opt-starttls.c 
> b/generator/states-newstyle-opt-starttls.c
> index e497548c..1e2997a3 100644
> --- a/generator/states-newstyle-opt-starttls.c
> +++ b/generator/states-newstyle-opt-starttls.c
> @@ -26,7 +26,7 @@  NEWSTYLE.OPT_STARTTLS.START:
>else {
>  /* If TLS was not requested we skip this option and go to the next one. 
> */
>  if (h->tls == LIBNBD_TLS_DISABLE) {
> -  SET_NEXT_STATE (%^OPT_STRUCTURED_REPLY.START);
> +  SET_NEXT_STATE (%^OPT_EXTENDED_HEADERS.START);
>return 0;
>  }
>  assert (CALLBACK_IS_NULL (h->opt_cb.completion));
> @@ -128,7 +128,7 @@  NEWSTYLE.OPT_STARTTLS.CHECK_REPLY:
>SET_NEXT_STATE (%.NEGOTIATING);
>  else {
>debug (h, "continuing with unencrypted connection");
> -  SET_NEXT_STATE (%^OPT_STRUCTURED_REPLY.START);
> +  SET_NEXT_STATE (%^OPT_EXTENDED_HEADERS.START);
>  }
>  return 0;
>}
> @@ -185,7 +185,7 @@  NEWSTYLE.OPT_STARTTLS.TLS_HANDSHAKE_DONE:
>if (h->opt_current == NBD_OPT_STARTTLS)
>  SET_NEXT_STATE (%.NEGOTIATING);
>else
> -SET_NEXT_STATE (%^OPT_STRUCTURED_REPLY.START);
> +SET_NEXT_STATE (%^OPT_EXTENDED_HEADERS.START);
>return 0;
> 
>  } /* END STATE MACHINE */

Seems pretty straightforward.  We add a new state machine between
NEWSTYLE.OPT_STARTTLS and NEWSTYLE.OPT_STRUCTURED_REPLY which sends
the new extended headers option (if enabled).
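
Roughly, I'd expect the new group's START state to be a clone of
OPT_STRUCTURED_REPLY.START, something like this sketch (my guess at
the shape, not quoted from the patch; h->request_eh is an assumed
handle field recording whether the user asked for extended headers):

 NEWSTYLE.OPT_EXTENDED_HEADERS.START:
  /* If extended headers were not requested, fall through to the
   * structured replies option as before.
   */
  if (!h->request_eh || !h->request_sr) {
    SET_NEXT_STATE (%^OPT_STRUCTURED_REPLY.START);
    return 0;
  }

  h->sbuf.option.version = htobe64 (NBD_NEW_VERSION);
  h->sbuf.option.option = htobe32 (NBD_OPT_EXTENDED_HEADERS);
  h->sbuf.option.optlen = htobe32 (0);
  h->wbuf = &h->sbuf;
  h->wlen = sizeof h->sbuf.option;
  SET_NEXT_STATE (%SEND);
  return 0;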

Reviewed-by: Richard W.M. Jones 

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top




Re: [libnbd PATCH v3 17/22] ocaml: Add example for 64-bit extents

2023-06-08 Thread Richard W.M. Jones
On Thu, May 25, 2023 at 08:01:03AM -0500, Eric Blake wrote:
> Since our example program for 32-bit extents is inherently limited to
> 32-bit lengths, it is also worth demonstrating the 64-bit extent API,
> including the difference in the array indexing being saner.
> 
> Signed-off-by: Eric Blake 
> ---
>  ocaml/examples/Makefile.am  |  1 +
>  ocaml/examples/extents64.ml | 42 +
>  2 files changed, 43 insertions(+)
>  create mode 100644 ocaml/examples/extents64.ml
> 
> diff --git a/ocaml/examples/Makefile.am b/ocaml/examples/Makefile.am
> index 28b4ab94..a4eb47a5 100644
> --- a/ocaml/examples/Makefile.am
> +++ b/ocaml/examples/Makefile.am
> @@ -20,6 +20,7 @@ include $(top_srcdir)/subdir-rules.mk
>  ml_examples = \
>   asynch_copy.ml \
>   extents.ml \
> + extents64.ml \
>   get_size.ml \
>   open_qcow2.ml \
>   server_flags.ml \
> diff --git a/ocaml/examples/extents64.ml b/ocaml/examples/extents64.ml
> new file mode 100644
> index ..8ee7e218
> --- /dev/null
> +++ b/ocaml/examples/extents64.ml
> @@ -0,0 +1,42 @@
> +open Printf
> +
> +let () =
> +  NBD.with_handle (
> +fun nbd ->
> +  NBD.add_meta_context nbd "base:allocation";
> +  NBD.connect_command nbd
> +  ["nbdkit"; "-s"; "--exit-with-parent"; "-r";
> +   "sparse-random"; "8G"];
> +
> +  (* Read the extents and print them. *)
> +  let size = NBD.get_size nbd in
> +  let cap =
> +match NBD.get_extended_headers_negotiated nbd with
> +| true -> size
> +| false -> 0x8000_0000_L in
> +  let fetch_offset = ref 0_L in
> +  while !fetch_offset < size do
> +let remaining = Int64.sub size !fetch_offset in
> +let fetch_size = min remaining cap in
> +NBD.block_status_64 nbd fetch_size !fetch_offset (
> +  fun meta _ entries err ->
> +printf "nbd_block_status callback: meta=%s err=%d\n" meta !err;
> +if meta = "base:allocation" then (
> +  printf "index\t%16s %16s %s\n" "offset" "length" "flags";
> +  for i = 0 to Array.length entries - 1 do
> +let len = fst entries.(i)
> +and flags =
> +  match snd entries.(i) with
> +  | 0_L -> "data"
> +  | 1_L -> "hole"
> +  | 2_L -> "zero"
> +  | 3_L -> "hole+zero"
> +  | unknown -> sprintf "unknown (%Ld)" unknown in
> +printf "%d:\t%16Ld %16Ld %s\n" i !fetch_offset len flags;
> +    fetch_offset := Int64.add !fetch_offset len
> +  done;
> +);
> +0
> +) (* NBD.block_status *)
> +  done
> +  )

Yes, the API is nicer now that we return the subelements as
(length, flags) pairs instead of having to iterate over a flat list
two entries at a time.  I might change that to
an array or struct after these patches go upstream as that will be a
tiny bit more efficient.

Reviewed-by: Richard W.M. Jones 

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top




Re: [libnbd PATCH v3 16/22] examples: Update copy-libev to use 64-bit block status

2023-06-08 Thread Richard W.M. Jones
On Thu, May 25, 2023 at 08:01:02AM -0500, Eric Blake wrote:
> Although our use of "base:allocation" doesn't require the use of the
> 64-bit API for flags, we might perform slightly faster for a server
> that does give us 64-bit extent lengths and honors larger nbd_zero
> lengths.
> 
> Signed-off-by: Eric Blake 
> ---
>  examples/copy-libev.c | 21 ++---
>  1 file changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/examples/copy-libev.c b/examples/copy-libev.c
> index 32cb46b3..9346faf7 100644
> --- a/examples/copy-libev.c
> +++ b/examples/copy-libev.c
> @@ -96,7 +96,7 @@ struct request {
>  };
> 
>  struct extent {
> -uint32_t length;
> +uint64_t length;
>  bool zero;
>  };
> 
> @@ -184,7 +184,7 @@ get_events (struct connection *c)
> 
>  static int
>  extent_callback (void *user_data, const char *metacontext, uint64_t offset,
> - uint32_t *entries, size_t nr_entries, int *error)
> + nbd_extent *entries, size_t nr_entries, int *error)
>  {
>  struct request *r = user_data;
> 
> @@ -199,22 +199,21 @@ extent_callback (void *user_data, const char 
> *metacontext, uint64_t offset,
>  return 1;
>  }
> 
> -/* Libnbd returns uint32_t pair (length, flags) for each extent. */
> -extents_len = nr_entries / 2;
> +extents_len = nr_entries;
> 
>  extents = malloc (extents_len * sizeof *extents);
>  if (extents == NULL)
>  FAIL ("Cannot allocated extents: %s", strerror (errno));
> 
>  /* Copy libnbd entries to extents array. */
> -for (int i = 0, j = 0; i < extents_len; i++, j=i*2) {
> -extents[i].length = entries[j];
> +for (int i = 0; i < extents_len; i++) {
> +extents[i].length = entries[i].length;
> 
>  /* Libnbd exposes both ZERO and HOLE flags. We care only about
>   * ZERO status, meaning we can copy this extent using efficinet
>   * zero method.
>   */
> -extents[i].zero = (entries[j + 1] & LIBNBD_STATE_ZERO) != 0;
> +extents[i].zero = (entries[i].flags & LIBNBD_STATE_ZERO) != 0;
>  }
> 
>  DEBUG ("r%zu: received %zu extents for %s",
> @@ -286,10 +285,10 @@ start_extents (struct request *r)
>  DEBUG ("r%zu: start extents offset=%" PRIi64 " count=%zu",
> r->index, offset, count);
> 
> -cookie = nbd_aio_block_status (
> +cookie = nbd_aio_block_status_64 (
>  src.nbd, count, offset,
> -(nbd_extent_callback) { .callback=extent_callback,
> -.user_data=r },
> +(nbd_extent64_callback) { .callback=extent_callback,
> +  .user_data=r },
>  (nbd_completion_callback) { .callback=extents_completed,
>  .user_data=r },
>  0);
> @@ -324,7 +323,7 @@ next_extent (struct request *r)
>  limit = MIN (REQUEST_SIZE, size - offset);
> 
>  while (length < limit) {
> -    DEBUG ("e%zu: offset=%" PRIi64 " len=%" PRIu32 " zero=%d",
> +DEBUG ("e%zu: offset=%" PRIi64 " len=%" PRIu64 " zero=%d",
> extents_pos, offset, extents[extents_pos].length, is_zero);
> 
>  /* If this extent is too large, steal some data from it to

Reviewed-by: Richard W.M. Jones 

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
nbdkit - Flexible, fast NBD server with plugins
https://gitlab.com/nbdkit/nbdkit




Re: [libnbd PATCH v3 15/22] info: Update nbdinfo --map to use 64-bit block status

2023-06-08 Thread Richard W.M. Jones
 = 0; i < entries->len; i += 2) {
> -uint32_t type = entries->ptr[last+1];
> +uint64_t type = entries->ptr[last+1];
> 
>  /* If we're coalescing and the current type is different from the
>   * previous one then we should print everything up to this entry.
> @@ -157,7 +160,7 @@ print_extents (uint32_vector *entries)
> 
>/* Print the last extent if there is one. */
>if (last != i) {
> -uint32_t type = entries->ptr[last+1];
> +uint64_t type = entries->ptr[last+1];
>  uint64_t len;
> 
>  for (j = last, len = 0; j < i; j += 2)
> @@ -169,7 +172,7 @@ print_extents (uint32_vector *entries)
>  }
> 
>  static void
> -print_one_extent (uint64_t offset, uint64_t len, uint32_t type)
> +print_one_extent (uint64_t offset, uint64_t len, uint64_t type)
>  {
>static bool comma = false;
>char *descr;
> @@ -185,7 +188,7 @@ print_one_extent (uint64_t offset, uint64_t len, uint32_t 
> type)
>ansi_colour (bg, fp);
>  fprintf (fp, "%10" PRIu64 "  "
>   "%10" PRIu64 "  "
> - "%3" PRIu32,
> + "%3" PRIu64,
>   offset, len, type);
>  if (descr)
>fprintf (fp, "  %s", descr);
> @@ -199,7 +202,7 @@ print_one_extent (uint64_t offset, uint64_t len, uint32_t 
> type)
> 
>  fprintf (fp, "{ \"offset\": %" PRIu64 ", "
>   "\"length\": %" PRIu64 ", "
> - "\"type\": %" PRIu32,
> + "\"type\": %" PRIu64,
>   offset, len, type);
>  if (descr) {
>fprintf (fp, ", \"description\": ");
> @@ -215,9 +218,9 @@ print_one_extent (uint64_t offset, uint64_t len, uint32_t 
> type)
> 
>  /* --map --totals suboption */
>  static void
> -print_totals (uint32_vector *entries, int64_t size)
> +print_totals (uint64_vector *entries, int64_t size)
>  {
> -  uint32_t type;
> +  uint64_t type;
>bool comma = false;
> 
>/* This is necessary to avoid a divide by zero below, but if the
> @@ -237,16 +240,16 @@ print_totals (uint32_vector *entries, int64_t size)
> */
>type = 0;
>for (;;) {
> -uint64_t next_type = (uint64_t)UINT32_MAX + 1;
> +uint64_t next_type = 0;
>  uint64_t c = 0;
>  size_t i;
> 
>  for (i = 0; i < entries->len; i += 2) {
> -  uint32_t t = entries->ptr[i+1];
> +  uint64_t t = entries->ptr[i+1];
> 
>if (t == type)
>  c += entries->ptr[i];
> -  else if (type < t && t < next_type)
> +  else if (type < t && (next_type == 0 || t < next_type))
>  next_type = t;
>  }
> 
> @@ -263,7 +266,7 @@ print_totals (uint32_vector *entries, int64_t size)
>ansi_colour (fg, fp);
>  if (bg)
>ansi_colour (bg, fp);
> -fprintf (fp, "%10" PRIu64 " %5.1f%% %3" PRIu32,
> +fprintf (fp, "%10" PRIu64 " %5.1f%% %3" PRIu64,
>   c, percent, type);
>  if (descr)
>fprintf (fp, " %s", descr);
> @@ -278,7 +281,7 @@ print_totals (uint32_vector *entries, int64_t size)
>  fprintf (fp,
>   "{ \"size\": %" PRIu64 ", "
>   "\"percent\": %g, "
> - "\"type\": %" PRIu32,
> + "\"type\": %" PRIu64,
>       c, percent, type);
>  if (descr) {
>fprintf (fp, ", \"description\": ");
> @@ -292,7 +295,7 @@ print_totals (uint32_vector *entries, int64_t size)
>  free (descr);
>  }
> 
> -if (next_type == (uint64_t)UINT32_MAX + 1)
> +if (next_type == 0)
>break;
>  type = next_type;
>}
> @@ -301,7 +304,7 @@ print_totals (uint32_vector *entries, int64_t size)
>  }
> 
>  static void
> -extent_description (const char *metacontext, uint32_t type,
> +extent_description (const char *metacontext, uint64_t type,
>  char **descr, bool *free_descr,
>  const char **fg, const char **bg)
>  {
> @@ -348,7 +351,7 @@ extent_description (const char *metacontext, uint32_t 
> type,
>*fg = ANSI_FG_BRIGHT_WHITE; *bg = ANSI_BG_BLACK;
>return;
>  default:
> -  if (asprintf (descr, "backing depth %u", type) == -1) {
> +  if (asprintf (descr, "backing depth %" PRIu64, type) == -1) {
>  perror ("asprintf");
>  exit (EXIT_FAILURE);
>}

Reviewed-by: Richard W.M. Jones 

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
nbdkit - Flexible, fast NBD server with plugins
https://gitlab.com/nbdkit/nbdkit




Re: [libnbd PATCH v3 14/22] info: Expose extended-headers support through nbdinfo

2023-06-08 Thread Richard W.M. Jones
> @@ -27,12 +27,27 @@ requires nbdkit --no-sr memory --version
>  out=info-packets.out
>  cleanup_fn rm -f $out
> 
> +# Older nbdkit does not support extended headers; --no-eh is a reliable
> +# witness of whether nbdkit is new enough.
> +
> +no_eh=
> +if nbdkit --no-eh --help >/dev/null 2>/dev/null; then
> +no_eh=--no-eh
> +fi
> +
>  nbdkit --no-sr -U - memory size=1M \
> --run '$VG nbdinfo "nbd+unix:///?socket=$unixsocket"' > $out
>  cat $out
>  grep "protocol: .*using simple packets" $out
> 
> -nbdkit -U - memory size=1M \
> +nbdkit $no_eh -U - memory size=1M \
> --run '$VG nbdinfo "nbd+unix:///?socket=$unixsocket"' > $out
>  cat $out
>  grep "protocol: .*using structured packets" $out
> +
> +if test x != "x$no_eh"; then
> +nbdkit -U - memory size=1M \
> +   --run '$VG nbdinfo "nbd+unix:///?socket=$unixsocket"' > $out
> +cat $out
> +grep "protocol: .*using extended packets" $out
> +fi
> diff --git a/info/main.c b/info/main.c
> index 1b99e089..8c923266 100644
> --- a/info/main.c
> +++ b/info/main.c
> @@ -302,11 +302,13 @@ main (int argc, char *argv[])
>  const char *protocol;
>  int tls_negotiated;
>  int sr_negotiated;
> +int eh_negotiated;
> 
>  /* Print per-connection fields. */
>  protocol = nbd_get_protocol (nbd);
>  tls_negotiated = nbd_get_tls_negotiated (nbd);
>  sr_negotiated = nbd_get_structured_replies_negotiated (nbd);
> +eh_negotiated = nbd_get_extended_headers_negotiated (nbd);
> 
>  if (!json_output) {
>if (protocol) {
> @@ -314,8 +316,9 @@ main (int argc, char *argv[])
>  fprintf (fp, "protocol: %s", protocol);
>  if (tls_negotiated >= 0)
>fprintf (fp, " %s TLS", tls_negotiated ? "with" : "without");
> -if (sr_negotiated >= 0)
> +if (eh_negotiated >= 0 && sr_negotiated >= 0)
>fprintf (fp, ", using %s packets",
> +   eh_negotiated ? "extended" :
> sr_negotiated ? "structured" : "simple");
>  fprintf (fp, "\n");
>  ansi_restore (fp);
> @@ -333,6 +336,8 @@ main (int argc, char *argv[])
>  fprintf (fp, "\"TLS\": %s,\n", tls_negotiated ? "true" : "false");
>if (sr_negotiated >= 0)
>  fprintf (fp, "\"structured\": %s,\n", sr_negotiated ? "true" : 
> "false");
> +  if (eh_negotiated >= 0)
> +fprintf (fp, "\"extended\": %s,\n", eh_negotiated ? "true" : 
> "false");
>  }
> 
>  if (!list_all)
> -- 
> 2.40.1

Reviewed-by: Richard W.M. Jones 

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
nbdkit - Flexible, fast NBD server with plugins
https://gitlab.com/nbdkit/nbdkit




Re: [Libguestfs] [libnbd PATCH v3 13/22] dump: Update nbddump to use 64-bit block status

2023-06-08 Thread Richard W.M. Jones
On Thu, May 25, 2023 at 08:00:59AM -0500, Eric Blake wrote:
> Although our use of "base:allocation" doesn't require the use of the
> 64-bit API for flags, we might perform slightly faster for a server
> that does give us 64-bit extent lengths.
> 
> Signed-off-by: Eric Blake 
> ---
>  dump/dump.c | 27 ++-
>  1 file changed, 14 insertions(+), 13 deletions(-)
> 
> diff --git a/dump/dump.c b/dump/dump.c
> index b4aebe84..71053277 100644
> --- a/dump/dump.c
> +++ b/dump/dump.c
> @@ -38,7 +38,7 @@
>  #include "version.h"
>  #include "vector.h"
> 
> -DEFINE_VECTOR_TYPE (uint32_vector, uint32_t);
> +DEFINE_VECTOR_TYPE (uint64_vector, uint64_t);
> 
>  static const char *progname;
>  static struct nbd_handle *nbd;
> @@ -262,10 +262,10 @@ catch_signal (int sig)
>  static int
>  extent_callback (void *user_data, const char *metacontext,
>   uint64_t offset,
> - uint32_t *entries, size_t nr_entries,
> + nbd_extent *entries, size_t nr_entries,
>   int *error)
>  {
> -  uint32_vector *list = user_data;
> +  uint64_vector *list = user_data;
>size_t i;
> 
>if (strcmp (metacontext, LIBNBD_CONTEXT_BASE_ALLOCATION) != 0)
> @@ -273,7 +273,8 @@ extent_callback (void *user_data, const char *metacontext,
> 
>/* Just append the entries we got to the list. */
>for (i = 0; i < nr_entries; ++i) {
> -if (uint32_vector_append (list, entries[i]) == -1) {
> +if (uint64_vector_append (list, entries[i].length) == -1 ||
> +uint64_vector_append (list, entries[i].flags) == -1) {
>perror ("realloc");
>exit (EXIT_FAILURE);
>  }
> @@ -284,7 +285,7 @@ extent_callback (void *user_data, const char *metacontext,
>  static bool
>  test_all_zeroes (uint64_t offset, size_t count)
>  {
> -  uint32_vector entries = empty_vector;
> +  uint64_vector entries = empty_vector;
>size_t i;
>uint64_t count_read;
> 
> @@ -296,22 +297,22 @@ test_all_zeroes (uint64_t offset, size_t count)
> * false, causing the main code to do a full read.  We could be
> * smarter and keep asking the server (XXX).
> */
> -  if (nbd_block_status (nbd, count, offset,
> -(nbd_extent_callback) {
> -  .callback = extent_callback,
> -  .user_data = &entries },
> -0) == -1) {
> +  if (nbd_block_status_64 (nbd, count, offset,
> +   (nbd_extent64_callback) {
> + .callback = extent_callback,
> + .user_data = &entries },
> +   0) == -1) {
>  fprintf (stderr, "%s: %s\n", progname, nbd_get_error ());
>  exit (EXIT_FAILURE);
>}
> 
>count_read = 0;
>for (i = 0; i < entries.len; i += 2) {
> -uint32_t len = entries.ptr[i];
> -uint32_t type = entries.ptr[i+1];
> +uint64_t len = entries.ptr[i];
> +uint64_t type = entries.ptr[i+1];
> 
>  count_read += len;
> -if (!(type & 2))/* not zero */
> +if (!(type & LIBNBD_STATE_ZERO))/* not zero */
>return false;
>}

Reviewed-by: Richard W.M. Jones 

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v




Re: [Libguestfs] [libnbd PATCH v3 12/22] copy: Update nbdcopy to use 64-bit block status

2023-06-08 Thread Richard W.M. Jones
On Thu, May 25, 2023 at 08:00:58AM -0500, Eric Blake wrote:
> Although our use of "base:allocation" doesn't require the use of the
> 64-bit API for flags, we might perform slightly faster for a server
> that does give us 64-bit extent lengths and honors larger nbd_zero
> lengths.

I think this change may turn out to be more significant than you have
described here.

Consider copying from a remote, very large, mostly blank NBD source,
and note that our block status querying is synchronous.  The old
nbdcopy might end up doing lots of serialized round trips over 4GB
segments of the source to discover that it is sparse.  With this
change, as I understand it, those would be coalesced into a single RTT.

(Similarly when zeroing on the output side, but to a lesser extent
because those requests can be issued in parallel.)
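
To put an illustrative (made-up) number on it: with extent replies
effectively capped at 32 bits, scanning a fully sparse 1 PiB source
in 4GB segments costs on the order of

  2^50 / 2^32 = 262,144

serialized round trips just to discover that it is sparse, versus
(in the best case) a single round trip with 64-bit extent lengths.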

> Signed-off-by: Eric Blake 
> ---
>  copy/nbd-ops.c | 22 +++---
>  1 file changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/copy/nbd-ops.c b/copy/nbd-ops.c
> index f3b3bed3..71ee0a3b 100644
> --- a/copy/nbd-ops.c
> +++ b/copy/nbd-ops.c
> @@ -428,7 +428,7 @@ nbd_ops_asynch_notify_write (struct rw *rw, size_t index)
>   * request for extents in a single round trip.
>   */
>  static int add_extent (void *vp, const char *metacontext,
> -   uint64_t offset, uint32_t *entries, size_t nr_entries,
> +   uint64_t offset, nbd_extent *entries, size_t 
> nr_entries,
> int *error);
> 
>  static void
> @@ -449,11 +449,11 @@ nbd_ops_get_extents (struct rw *rw, size_t index,
>  size_t i;
> 
>  exts.len = 0;
> -if (nbd_block_status (nbd, count, offset,
> -  (nbd_extent_callback) {
> -.user_data = &exts,
> -.callback = add_extent
> -  }, 0) == -1) {
> +if (nbd_block_status_64 (nbd, count, offset,
> + (nbd_extent64_callback) {
> +   .user_data = &exts,
> +   .callback = add_extent
> + }, 0) == -1) {
>/* XXX We could call default_get_extents, but unclear if it's
> * the right thing to do if the server is returning errors.
> */
> @@ -496,7 +496,7 @@ nbd_ops_get_extents (struct rw *rw, size_t index,
> 
>  static int
>  add_extent (void *vp, const char *metacontext,
> -uint64_t offset, uint32_t *entries, size_t nr_entries,
> +uint64_t offset, nbd_extent *entries, size_t nr_entries,
>  int *error)
>  {
>extent_list *ret = vp;
> @@ -505,25 +505,25 @@ add_extent (void *vp, const char *metacontext,
>if (strcmp (metacontext, "base:allocation") != 0 || *error)
>  return 0;
> 
> -  for (i = 0; i < nr_entries; i += 2) {
> +  for (i = 0; i < nr_entries; i++) {
>  struct extent e;
> 
>  e.offset = offset;
> -e.length = entries[i];
> +e.length = entries[i].length;
> 
>  /* Note we deliberately don't care about the HOLE flag.  There is
>   * no need to read extent that reads as zeroes.  We will convert
>   * to it to a hole or allocated extents based on the command line
>   * arguments.
>   */
> -e.zero = (entries[i+1] & LIBNBD_STATE_ZERO) != 0;
> +e.zero = (entries[i].flags & LIBNBD_STATE_ZERO) != 0;
> 
>  if (extent_list_append (ret, e) == -1) {
>perror ("realloc");
>exit (EXIT_FAILURE);
>  }
> 
> -offset += entries[i];
> +offset += entries[i].length;
>}
> 
>return 0;
> -- 
> 2.40.1

Reviewed-by: Richard W.M. Jones 

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org




Re: [Libguestfs] [libnbd PATCH v3 05/22] states: Prepare to receive 64-bit replies

2023-06-07 Thread Richard W.M. Jones
On Thu, Jun 01, 2023 at 11:04:05AM +0200, Laszlo Ersek wrote:
> On 5/25/23 15:00, Eric Blake wrote:
> > @@ -69,11 +75,18 @@  REPLY.STRUCTURED_REPLY.RECV_REMAINING:
> >   REPLY.STRUCTURED_REPLY.CHECK:
> >struct command *cmd = h->reply_cmd;
> >uint16_t flags, type;
> > -  uint32_t length;
> > +  uint64_t length;
> > +  uint64_t offset = -1;
> 
> (6) I disagree with initializing the local variable "offset" here.
> 
> Below (in the rest of REPLY.STRUCTURED_REPLY.CHECK), we only read
> "offset" back if "extended_headers" is set. But if "extended_headers" is
> set, we also store a value to "offset", before the read.
> 
> Initializing "offset" to -1 suggests that the code might otherwise read
> an indeterminate value from "offset" -- but that's not the case.

You may find that the compiler will give a warning.  Compilers are
usually not good at dealing with the case where whether a variable is
initialized before use depends on another variable being true.
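
A self-contained example of the pattern (mine, not from the patch)
which often trips gcc's -Wmaybe-uninitialized even though the logic
is safe:

#include <stdint.h>

uint64_t
f (int extended_headers, const uint64_t *wire)
{
  uint64_t offset;              /* deliberately no initializer */

  if (extended_headers)
    offset = wire[0];           /* the only store, guarded by the flag */

  /* ... other parsing ... */

  if (extended_headers)
    return offset;              /* the only read, guarded by the same flag */
  return 0;
}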

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top




Re: [libnbd PATCH v3 04/22] states: Prepare to send 64-bit requests

2023-06-07 Thread Richard W.M. Jones
/* Other commands are limited by the 32 bit field in the command
> + * structure on the wire, unless extended headers were negotiated.
>   */
>default:
> -if (count > UINT32_MAX) {
> +if (!h->extended_headers && count > UINT32_MAX) {
>set_error (ERANGE, "request too large: maximum request size is %" 
> PRIu32,
>   UINT32_MAX);
>goto err;
> @@ -358,6 +356,15 @@ nbd_unlocked_aio_pwrite (struct nbd_handle *h, const 
> void *buf,
>return -1;
>  }
>}
> +  /* It is more convenient to manage PAYLOAD_LEN by what was negotiated
> +   * than to require the user to have to set it correctly.
> +   * TODO: Add new h->strict bit to allow intentional protocol violation
> +   * for interoperability testing.
> +   */
> +  if (h->extended_headers)
> +flags |= LIBNBD_CMD_FLAG_PAYLOAD_LEN;
> +  else
> +flags &= ~LIBNBD_CMD_FLAG_PAYLOAD_LEN;
> 
>SET_CALLBACK_TO_NULL (*completion);
>return nbd_internal_command_common (h, flags, NBD_CMD_WRITE, offset, count,
> diff --git a/tests/Makefile.am b/tests/Makefile.am
> index 3a93251e..8b839bf5 100644
> --- a/tests/Makefile.am
> +++ b/tests/Makefile.am
> @@ -232,6 +232,7 @@ check_PROGRAMS += \
>   closure-lifetimes \
>   pread-initialize \
>   socket-activation-name \
> +  pwrite-extended \

Spaces vs tabs here, as Laszlo already pointed out.

>   $(NULL)
> 
>  TESTS += \
> @@ -650,6 +651,9 @@ socket_activation_name_SOURCES = \
>   requires.h
>  socket_activation_name_LDADD = $(top_builddir)/lib/libnbd.la
> 
> +pwrite_extended_SOURCES = pwrite-extended.c
> +pwrite_extended_LDADD = $(top_builddir)/lib/libnbd.la
> +
>  #--
>  # Testing TLS support.
> 
> diff --git a/tests/pwrite-extended.c b/tests/pwrite-extended.c
> new file mode 100644
> index ..f0b5a3f3
> --- /dev/null
> +++ b/tests/pwrite-extended.c
> @@ -0,0 +1,112 @@
> +/* NBD client library in userspace
> + * Copyright Red Hat
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 
> USA
> + */
> +
> +/* Check behavior of pwrite with PAYLOAD_LEN flag for extended headers. */
> +
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +static char *progname;
> +static char buf[512];
> +
> +static void
> +check (int experr, const char *prefix)
> +{
> +  const char *msg = nbd_get_error ();
> +  int errnum = nbd_get_errno ();
> +
> +  fprintf (stderr, "error: \"%s\"\n", msg);
> +  fprintf (stderr, "errno: %d (%s)\n", errnum, strerror (errnum));
> +  if (strncmp (msg, prefix, strlen (prefix)) != 0) {
> +fprintf (stderr, "%s: test failed: missing context prefix: %s\n",
> + progname, msg);
> +exit (EXIT_FAILURE);
> +  }
> +  if (errnum != experr) {
> +fprintf (stderr, "%s: test failed: "
> + "expected errno = %d (%s), but got %d\n",
> + progname, experr, strerror (experr), errnum);
> +exit (EXIT_FAILURE);
> +  }
> +}
> +
> +int
> +main (int argc, char *argv[])
> +{
> +  struct nbd_handle *nbd;
> +  const char *cmd[] = {
> +"nbdkit", "-s", "-v", "--exit-with-parent", "memory", "1048576", NULL
> +  };
> +  uint32_t strict;
> +
> +  progname = argv[0];
> +
> +  nbd = nbd_create ();
> +  if (nbd == NULL) {
> +    fprintf (stderr, "%s\n", nbd_get_error ());
> +exit (EXIT_FAILURE);
> +  }
> +
> +  /* Connect to the server. */
> +  if (nbd_connect_command (nbd, (char **)cmd) == -1) {
> +fprintf (stderr, "%s: %s\n", argv[0], nbd_get_error ());
> +exit (EXIT_FAILURE);
> +  }
> +
> +  /* FIXME: future API addition to test if server negotiated extended mode.
> +   * Until then, strict flags must ignore the PAYLOAD_LEN flag for pwrite,
> +   * even though it is rejected for other commands.
> +   */
> +  strict = nbd_get_strict_mode (nbd);
> +  if (!(strict & LIBNBD_STRICT_FLAGS)) {
> +fprintf (stderr, "%s: test failed: "
> + "nbd_get_strict_mode did not have expected flag set\n",
> + argv[0]);
> +exit (EXIT_FAILURE);
> +  }
> +  if (nbd_aio_pread (nbd, buf, 512, 0, NBD_NULL_COMPLETION,
> + LIBNBD_CMD_FLAG_PAYLOAD_LEN) != -1) {
> +fprintf (stderr, "%s: test failed: "
> + "nbd_aio_pread did not fail with unexpected flag\n",
> + argv[0]);
> +exit (EXIT_FAILURE);
> +  }
> +  check (EINVAL, "nbd_aio_pread: ");
> +
> +  if (nbd_aio_pwrite (nbd, buf, 512, 0, NBD_NULL_COMPLETION,
> + LIBNBD_CMD_FLAG_PAYLOAD_LEN) == -1) {
> +fprintf (stderr, "%s: %s\n", argv[0], nbd_get_error ());
> +exit (EXIT_FAILURE);
> +  }
> +
> +  if (nbd_aio_pwrite (nbd, buf, 512, 0, NBD_NULL_COMPLETION, 0) == -1) {
> +fprintf (stderr, "%s: %s\n", argv[0], nbd_get_error ());
> +exit (EXIT_FAILURE);
> +  }
> +
> +  nbd_close (nbd);
> +  exit (EXIT_SUCCESS);
> +}
> diff --git a/.gitignore b/.gitignore
> index fe7feffa..bc7c2c37 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -250,6 +250,7 @@ Makefile.in
>  /tests/pki/
>  /tests/pread-initialize
>  /tests/private-data
> +/tests/pwrite-extended
>  /tests/read-only-flag
>  /tests/read-write-flag
>  /tests/server-death

Reviewed-by: Richard W.M. Jones 

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html




Re: [Libguestfs] [libnbd PATCH v3 03/22] protocol: Add definitions for extended headers

2023-06-07 Thread Richard W.M. Jones
On Tue, May 30, 2023 at 05:48:25PM +0200, Laszlo Ersek wrote:
> BTW I'm foreseeing a problem: if the extended block descriptor can
> provide an unsigned 64-bit length, we're going to have trouble exposing
> that in OCaml, because OCaml only has signed 64-bit integers. So that's
> going to reproduce the same issue, only for OCaml callers of the *new* API.
> 
> I can see Eric's series includes patches like "ocaml: Add example for
> 64-bit extents" -- I've not looked at those yet; for now I'm just
> wondering what tricks we might need in the bindings generator. The
> method seen in the "middle patch" above won't work; we don't have a
> native OCaml "i128" type for example that we could use as an escape
> hatch, for representing C's uint64_t.

I think that's OK because disk sizes are already limited to
2^63 - 1 by the kernel (and for qemu even less than that).
The OCaml bindings return a (signed) int64 for NBD.get_size.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
nbdkit - Flexible, fast NBD server with plugins
https://gitlab.com/nbdkit/nbdkit




Re: [Libguestfs] [PATCH v3 6/6] RFC: spec: Introduce NBD_REPLY_TYPE_OFFSET_HOLE_EXT

2023-04-18 Thread Richard W.M. Jones
On Tue, Apr 18, 2023 at 02:33:41PM +0200, Wouter Verhelst wrote:
> On Thu, Apr 13, 2023 at 05:02:41PM -0500, Eric Blake wrote:
> > Rather than requiring all servers and clients to have a 32-bit limit
> > on maximum NBD_CMD_READ/WRITE sizes, we can choose to standardize
> > support for a 64-bit single I/O transaction now.
> > NBD_REPLY_TYPE_OFFSET_DATA can already handle a large reply, but
> > NBD_REPLY_TYPE_OFFSET_HOLE needs a 64-bit counterpart.
> > 
> > By standardizing this, all clients must be prepared to support both
> > types of hole type replies, even though most server implementations of
> > extended replies are likely to only send one hole type.
> 
> I think it's probably a better idea to push this patch to a separate
> "extension-*" branch, and link to that from proto.md on master. Those
> are documented as "we standardized this, but no first implementor exists
> yet".
> 
> If someone actually comes up with a reason for 64-bit transactions, we
> can then see if the spec matches the need and merge it to master.
> 
> Otherwise this feels too much like a solution in search of a problem to
> me.

I'd like to push back a bit on this.  Firstly Eric does have two
complete implementations.  It's true, however, that they are not
upstream in either case.

But we also need this because there are relatively serious issues
observed, particularly around trimming/zeroing, and extents.  The
trimming problem can be demonstrated very easily in fact:

$ nbdkit -U - --filter=stats memory 1P statsfile=/dev/stdout --run ' time 
guestfish add "" protocol:nbd server:unix:$unixsocket discard:enable format:raw 
: run : mkfs xfs /dev/sda '

real 4m17.531s
user 0m0.032s
sys  0m0.040s
total: 1066328 ops, 257.558068 s, 1048578.04 GiB, 4071.23 GiB/s
read: 4356 ops, 0.003335 s, 14.61 MiB, 4.28 GiB/s op, 58.08 KiB/s total
  Request size and alignment breakdown:
12 bits: 50.8% (2215 reqs, 8.65 MiB total)
 12 bit aligned: 100.0% (2215)
 13 bit aligned:  51.6% (1143)
 14 bit aligned:  26.9% (595)
 15 bit aligned:  14.6% (323)
 16 bit aligned:   8.4% (185)
 17+ bit-aligned:  4.9% (109)
 9 bits: 47.4% (2064 reqs, 1.01 MiB total)
  9 bit aligned: 100.0% (2064)
 10+ bit-aligned:  0.6% (13)
other sizes:  1.8% (77 reqs, 14.61 MiB total)

write: 13325 ops, 0.046782 s, 31.29 MiB, 668.91 MiB/s op, 124.41 KiB/s total
  Request size and alignment breakdown:
12 bits: 53.8% (7170 reqs, 28.01 MiB total)
 12 bit aligned: 100.0% (7170)
 13 bit aligned:  50.0% (3585)
 14 bit aligned:  25.0% (1793)
 15 bit aligned:  12.5% (896)
 16 bit aligned:   6.2% (448)
 17+ bit-aligned:  3.1% (224)
 9 bits: 46.2% (6150 reqs, 3.00 MiB total)
  9 bit aligned: 100.0% (6150)
 10 bit aligned:  33.4% (2054)
 12 bit aligned:  16.7% (1029)
 13 bit aligned:   8.4% (515)
 14+ bit-aligned:  4.2% (259)
other sizes:  0.0% (5 reqs, 31.29 MiB total)

trim: 1048576 ops, 306.059735 s, 1048576.00 GiB, 3426.05 GiB/s op, 4071.22 
GiB/s total
  Request size and alignment breakdown:
30 bits: 100.0% (1048576 reqs, 1048576.00 GiB total)
 30 bit aligned: 100.0% (1048576)
 31 bit aligned:  50.0% (524288)
 32 bit aligned:  25.0% (262144)
 33 bit aligned:  12.5% (131072)
 34 bit aligned:   6.2% (65536)
 35+ bit-aligned:  3.1% (32768)

zero: 64 ops, 0.003912 s, 1.99 GiB, 508.75 GiB/s op, 7.91 MiB/s total
  Request size and alignment breakdown:
25 bits: 98.4% (63 reqs, 1.97 GiB total)
 13 bit aligned: 100.0% (63)
other sizes:  1.6% (1 reqs, 1.99 GiB total)

flush: 7 ops, 0.01 s, 0 bytes, 0 bytes/s op, 0 bytes/s total

Note how trim takes a million operations and most of the time.  That
should be done in one operation.  If you stop advertising discard
support on the disk ("discard:disable") it takes only a fraction of
the time.
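
(The arithmetic is visible in the stats: the trims arrive as
2^30-byte (1 GiB) requests, so covering the 2^50-byte disk takes
2^50 / 2^30 = 1,048,576 of them, which is exactly the op count
reported above.  A 64-bit request length would let the whole device
be trimmed in a single command.)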

The extents one is harder to demonstrate, but it makes our code
considerably more complex because we cannot just grab the extent map
for a whole disk larger than 4GB in a single command.  (The complexity
won't go away, but with this change the querying will be faster,
needing fewer round trips.)

Nevertheless I'm not opposed to keeping this as an extension until the
implementations are upstream and bedded in.

Rich.


> With that said, for the things I didn't reply to, you can add:
> 
> Reviewed-By: Wouter Verhelst 
> 
> -- 
>  w@uter.{be,co.za}
> wouter@{grep.be,fosdem.org,debian.org}
> 
> I will have a Tin-Actinium-Potassium mixture, thanks.
> 
> ___
> Libguestfs mailing list
> libgues...@redhat.com
> https://listman.redhat.com/mailman/listinfo/libguestfs

-- 

Re: [Libguestfs] [PATCH v3 6/6] RFC: spec: Introduce NBD_REPLY_TYPE_OFFSET_HOLE_EXT

2023-04-14 Thread Richard W.M. Jones


So I read through the whole series this morning and it all seems
reasonable.  Therefore:

Reviewed-by: Richard W.M. Jones 

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html




Re: [PATCH nbd 1/4] nbd: Add multi-conn option

2023-03-10 Thread Richard W.M. Jones
On Fri, Mar 10, 2023 at 04:17:17PM -0600, Eric Blake wrote:
> On Thu, Mar 09, 2023 at 11:39:43AM +0000, Richard W.M. Jones wrote:
> > + * safe for multi-conn, force it to 1.
> > + */
> > +if (!(s->info.flags & NBD_FLAG_CAN_MULTI_CONN)) {
> > +s->multi_conn = 1;
> > +}
> > +
> >  return 0;
> 
> Is there an intended QAPI counterpart for this command?  We'll need
> that if it is to be set during the command line of
> qemu-storage-daemon.

Does it just need to be added to qapi/block-core.json?
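
Something along these lines next to the existing reconnect-delay and
open-timeout members, I guess (a sketch only, untested):

{ 'struct': 'BlockdevOptionsNbd',
  'data': { ...
            '*reconnect-delay': 'uint32',
            '*open-timeout': 'uint32',
            '*multi-conn': 'uint32' } }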

It's a shame we can't add the API in one place and have everything
generated from there.  Like some kind of 'generator' ...

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html




Re: [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only]

2023-03-10 Thread Richard W.M. Jones
On Fri, Mar 10, 2023 at 01:04:12PM -0600, Eric Blake wrote:
> How many of these timing numbers can be repeated with TLS in the mix?

While I have been playing with TLS and kTLS recently, it's not
something that is especially important to v2v since all NBD traffic
goes over Unix domain sockets only (ie. it's used as a kind of
interprocess communication).

I could certainly provide benchmarks, although as I'm going on holiday
shortly it may be a little while.

> > Curl local server test (./multi-conn.pl curlhttp)
> > =================================================
> > 
> > Localhost Apache serving a file over http
> >   |
> >   | http
> >   v
> > nbdkit-curl-plugin   or   qemu-nbd
> >   |
> >   | nbd+unix
> >   v
> > qemu-img convert   or   nbdcopy
> > 
> > We download an image from a local web server through
> > nbdkit-curl-plugin or qemu-nbd using the curl block driver, over NBD.
> > The image is copied to /dev/null.
> > 
> >   server      client      multi-conn
> >   --------------------------------------------------
> >   qemu-nbd    nbdcopy     1             8.88s
> >   qemu-nbd    nbdcopy     2             8.64s
> >   qemu-nbd    nbdcopy     4             8.37s
> >   qemu-nbd    qemu-img    [u/s]         6.47s
> 
> Do we have any good feel for why qemu-img is faster than nbdcopy in
> the baseline?  But improving that is orthogonal to this series.

I do not, but we have in the past found that results can be very
sensitive to request size.  By default (and also in all of these
tests) nbdcopy is using a request size of 256K, and qemu-img is using
a request size of 2M.
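
One way to test that theory (an illustrative command, not one of the
runs above) would be to level the request sizes using nbdcopy's
--request-size option to match qemu-img's 2M, eg:

  $VG nbdcopy --request-size=$((2 * 1024 * 1024)) \
      "nbd+unix:///?socket=$unixsocket" null:

and see whether the gap to qemu-img closes.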

> >   qemu-nbd    qemu-img    1             6.56s
> >   qemu-nbd    qemu-img    2             6.63s
> >   qemu-nbd    qemu-img    4             6.50s
> >   nbdkit      nbdcopy     1             12.15s
> 
> I'm assuming this is nbdkit with your recent in-progress patches to
> have the curl plugin serve parallel requests.  But another place where
> we can investigate why nbdkit is not as performant as qemu-nbd at
> utilizing curl.
> 
> >   nbdkit      nbdcopy     2             7.05s   (72.36% better)
> >   nbdkit      nbdcopy     4             3.54s   (242.90% better)
> 
> That one is impressive!
> 
> >   nbdkit      qemu-img    [u/s]         6.90s
> >   nbdkit      qemu-img    1             7.00s
> 
> Minimal penalty for adding the code but not utilizing it...

[u/s] and qemu-img with multi-conn:1 ought to be identical actually.
After all, the only difference should be the restructuring of the code
to add the intermediate NBDConnState struct.  In this case it's probably
just measurement error.

> >   nbdkit      qemu-img    2             3.85s   (79.15% better)
> >   nbdkit      qemu-img    4             3.85s   (79.15% better)
> 
> ...and definitely shows its worth.
> 
> > 
> > 
> > Curl local file test (./multi-conn.pl curlfile)
> > ===============================================
> > 
> > nbdkit-curl-plugin   using file:/// URI
> >   |
> >   | nbd+unix
> >   v
> > qemu-img convert   or   nbdcopy
> > 
> > We download from a file:/// URI.  This test is designed to exercise
> > NBD and some curl internal paths without the overhead from an external
> > server.  qemu-nbd doesn't support file:/// URIs so we cannot duplicate
> > the test for qemu as server.
> > 
> >   server      client      multi-conn
> >   --------------------------------------------------
> >   nbdkit      nbdcopy     1             31.32s
> >   nbdkit      nbdcopy     2             20.29s  (54.38% better)
> >   nbdkit      nbdcopy     4             13.22s  (136.91% better)
> >   nbdkit      qemu-img    [u/s]         31.55s
> 
> Here, the baseline is already comparable; both nbdcopy and qemu-img
> are parsing the image off nbdkit in about the same amount of time.
> 
> >   nbdkit      qemu-img    1             31.70s
> 
> And again, minimal penalty for having the new code in place but not
> exploiting it.
> 
> >   nbdkit      qemu-img    2             21.60s  (46.07% better)
> >   nbdkit      qemu-img    4             13.88s  (127.25% better)
> 
> Plus an obvious benefit when the parallel sockets matter.
> 
> > 
> > 
> > Curl remote server test (./multi-conn.pl curlremote)
> > 
> > ====================================================
> > nbdkit-curl-plugin   using http://remote/*.qcow2 URI
> >  |
> >  | nbd+unix
> >  v
> > qemu-img convert
> > 
> > We download from a remote qcow2 file to a local raw file, converting
> > between formats during copying.
> > 
> > qemu-nbd   using http://remote/*.qcow2 URI
> > |
> > | nbd+unix
> > v
> > qemu-img convert
> > 
> > Similarly, replacing nbdkit with qemu-nbd (treating the remote file as
> > if it is raw, so the conversion is still done by qemu-img).
> > 
> > Additionally we compare downloading the file with wget (note this
> > doesn't include the time for conversion, but that 

[PATCH nbd 1/4] nbd: Add multi-conn option

2023-03-09 Thread Richard W.M. Jones
Add multi-conn option to the NBD client.  This commit just adds the
option, it is not functional.

Setting this to a value > 1 permits multiple connections to the NBD
server; a typical value might be 4.  The default is 1, meaning only a
single connection is made.  If the NBD server does not advertise that
it is safe for multi-conn then this setting is forced to 1.
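
An illustrative invocation (not tested; it assumes the usual
--image-opts plumbing picks up the new runtime option):

  qemu-img convert -p --image-opts \
    driver=nbd,server.type=unix,server.path=/tmp/nbd.sock,multi-conn=4 \
    out.raw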

Signed-off-by: Richard W.M. Jones 
---
 block/nbd.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/block/nbd.c b/block/nbd.c
index bf2894ad5c..5ffae0b798 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -49,6 +49,7 @@
 
 #define EN_OPTSTR ":exportname="
 #define MAX_NBD_REQUESTS16
+#define MAX_MULTI_CONN  16
 
 #define HANDLE_TO_INDEX(bs, handle) ((handle) ^ (uint64_t)(intptr_t)(bs))
 #define INDEX_TO_HANDLE(bs, index)  ((index)  ^ (uint64_t)(intptr_t)(bs))
@@ -98,6 +99,7 @@ typedef struct BDRVNBDState {
 /* Connection parameters */
 uint32_t reconnect_delay;
 uint32_t open_timeout;
+uint32_t multi_conn;
 SocketAddress *saddr;
 char *export;
 char *tlscredsid;
@@ -1803,6 +1805,15 @@ static QemuOptsList nbd_runtime_opts = {
 "attempts until successful or until @open-timeout seconds "
 "have elapsed. Default 0",
 },
+{
+.name = "multi-conn",
+.type = QEMU_OPT_NUMBER,
+.help = "If > 1 permit up to this number of connections to the "
+"server. The server must also advertise multi-conn "
+"support.  If <= 1, only a single connection is made "
+"to the server even if the server advertises multi-conn. "
+"Default 1",
+},
 { /* end of list */ }
 },
 };
@@ -1858,6 +1869,10 @@ static int nbd_process_options(BlockDriverState *bs, 
QDict *options,
 
 s->reconnect_delay = qemu_opt_get_number(opts, "reconnect-delay", 0);
 s->open_timeout = qemu_opt_get_number(opts, "open-timeout", 0);
+s->multi_conn = qemu_opt_get_number(opts, "multi-conn", 1);
+if (s->multi_conn > MAX_MULTI_CONN) {
+s->multi_conn = MAX_MULTI_CONN;
+}
 
 ret = 0;
 
@@ -1912,6 +1927,15 @@ static int nbd_open(BlockDriverState *bs, QDict 
*options, int flags,
 
 nbd_client_connection_enable_retry(s->conn);
 
+/*
+ * We set s->multi_conn in nbd_process_options above, but now that
+ * we have connected if the server doesn't advertise that it is
+ * safe for multi-conn, force it to 1.
+ */
+if (!(s->info.flags & NBD_FLAG_CAN_MULTI_CONN)) {
+s->multi_conn = 1;
+}
+
 return 0;
 
 fail:
-- 
2.39.2




[PATCH nbd 2/4] nbd: Split out block device state from underlying NBD connections

2023-03-09 Thread Richard W.M. Jones
To implement multi-conn, we will put multiple underlying NBD
connections (ie. NBDClientConnection) inside the NBD block device
handle (BDRVNBDState).  This requires first breaking the one-to-one
relationship between NBDClientConnection and BDRVNBDState.

To do this a new structure (NBDConnState) is implemented.
NBDConnState takes all the per-connection state out of the block
driver struct.  BDRVNBDState now contains a conns[] array of pointers
to NBDConnState, one for each underlying multi-conn connection.

After this change there is still a one-to-one relationship because we
only ever use the zeroth slot in the conns[] array.  Thus this does
not implement multi-conn yet.

Signed-off-by: Richard W.M. Jones 
---
 block/coroutines.h |   5 +-
 block/nbd.c| 674 -
 2 files changed, 358 insertions(+), 321 deletions(-)

diff --git a/block/coroutines.h b/block/coroutines.h
index dd9f3d449b..14b01d8591 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -62,7 +62,7 @@ int coroutine_fn GRAPH_RDLOCK
 bdrv_co_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
 
 int coroutine_fn
-nbd_co_do_establish_connection(BlockDriverState *bs, bool blocking,
+nbd_co_do_establish_connection(BlockDriverState *bs, void *cs, bool blocking,
Error **errp);
 
 
@@ -86,6 +86,7 @@ bdrv_common_block_status_above(BlockDriverState *bs,
BlockDriverState **file,
int *depth);
 int co_wrapper_mixed
-nbd_do_establish_connection(BlockDriverState *bs, bool blocking, Error **errp);
+nbd_do_establish_connection(BlockDriverState *bs, void *cs, bool blocking,
+Error **errp);
 
 #endif /* BLOCK_COROUTINES_H */
diff --git a/block/nbd.c b/block/nbd.c
index 5ffae0b798..84e8a1add0 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -51,8 +51,8 @@
 #define MAX_NBD_REQUESTS16
 #define MAX_MULTI_CONN  16
 
-#define HANDLE_TO_INDEX(bs, handle) ((handle) ^ (uint64_t)(intptr_t)(bs))
-#define INDEX_TO_HANDLE(bs, index)  ((index)  ^ (uint64_t)(intptr_t)(bs))
+#define HANDLE_TO_INDEX(cs, handle) ((handle) ^ (uint64_t)(intptr_t)(cs))
+#define INDEX_TO_HANDLE(cs, index)  ((index)  ^ (uint64_t)(intptr_t)(cs))
 
 typedef struct {
 Coroutine *coroutine;
@@ -67,7 +67,10 @@ typedef enum NBDClientState {
 NBD_CLIENT_QUIT
 } NBDClientState;
 
-typedef struct BDRVNBDState {
+/* A single client connection. */
+typedef struct NBDConnState {
+struct BDRVNBDState *s; /* Points to enclosing BDRVNBDState */
+
 QIOChannel *ioc; /* The current I/O channel */
 NBDExportInfo info;
 
@@ -94,7 +97,12 @@ typedef struct BDRVNBDState {
 
 QEMUTimer *open_timer;
 
-BlockDriverState *bs;
+NBDClientConnection *conn;
+} NBDConnState;
+
+typedef struct BDRVNBDState {
+/* The underlying NBD connections */
+NBDConnState *conns[MAX_MULTI_CONN];
 
 /* Connection parameters */
 uint32_t reconnect_delay;
@@ -108,7 +116,7 @@ typedef struct BDRVNBDState {
 char *x_dirty_bitmap;
 bool alloc_depth;
 
-NBDClientConnection *conn;
+BlockDriverState *bs;
 } BDRVNBDState;
 
 static void nbd_yank(void *opaque);
@@ -117,14 +125,16 @@ static void nbd_clear_bdrvstate(BlockDriverState *bs)
 {
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
-nbd_client_connection_release(s->conn);
-s->conn = NULL;
+nbd_client_connection_release(s->conns[0]->conn);
+s->conns[0]->conn = NULL;
 
 yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
 
 /* Must not leave timers behind that would access freed data */
-assert(!s->reconnect_delay_timer);
-assert(!s->open_timer);
+assert(!s->conns[0]->reconnect_delay_timer);
+assert(!s->conns[0]->open_timer);
+
+g_free(s->conns[0]);
 
 object_unref(OBJECT(s->tlscreds));
 qapi_free_SocketAddress(s->saddr);
@@ -151,139 +161,143 @@ static bool coroutine_fn 
nbd_recv_coroutine_wake_one(NBDClientRequest *req)
 return false;
 }
 
-static void coroutine_fn nbd_recv_coroutines_wake(BDRVNBDState *s)
+static void coroutine_fn nbd_recv_coroutines_wake(NBDConnState *cs)
 {
 int i;
 
-QEMU_LOCK_GUARD(&s->receive_mutex);
+QEMU_LOCK_GUARD(&cs->receive_mutex);
 for (i = 0; i < MAX_NBD_REQUESTS; i++) {
-if (nbd_recv_coroutine_wake_one(&s->requests[i])) {
+if (nbd_recv_coroutine_wake_one(&cs->requests[i])) {
 return;
 }
 }
 }
 
 /* Called with s->requests_lock held.  */
-static void coroutine_fn nbd_channel_error_locked(BDRVNBDState *s, int ret)
+static void coroutine_fn nbd_channel_error_locked(NBDConnState *cs, int ret)
 {
-if (s->state == NBD_CLIENT_CONNECTED) {
-qio_channel_shutdown(s->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
+if (cs->state == NBD_CLIENT_CONNECTED) {
+qio_channel_shutdown(cs-

[PATCH nbd 3/4] nbd: Open multiple NBD connections if multi-conn is set

2023-03-09 Thread Richard W.M. Jones
If the user opts in to multi-conn and the server advertises it, open
multiple NBD connections.  They are opened with the same parameters as
the initial connection.  Similarly on the close path the connections
are closed.

However only the zeroth connection is used, so this commit does not
actually enable multi-conn capabilities.

(XXX) The strategy here is very naive.  Firstly if you were going to
open them, you'd probably want to open them in parallel.  Secondly it
would make sense to delay opening until multiple parallel requests are
seen (perhaps above some threshold), so that simple or short-lived NBD
operations do not require multiple connections to be made.
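
A sketch of the second idea (untested; conns_open and in_flight are
counters that this patch does not maintain):

static void
maybe_open_another_conn (BlockDriverState *bs, BDRVNBDState *s)
{
    /* Open one more connection once enough requests are in flight. */
    if (s->conns_open < s->multi_conn &&
        qatomic_read (&s->in_flight) > s->conns_open * MAX_NBD_REQUESTS / 2) {
        NBDConnState *cs = init_conn_state (s);
        Error *local_err = NULL;

        if (conn_state_connect (bs, cs, &local_err) == 0) {
            s->conns[s->conns_open++] = cs;
        } else {
            error_free (local_err);
            g_free (cs);
        }
    }
}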

Signed-off-by: Richard W.M. Jones 
---
 block/nbd.c | 128 
 1 file changed, 90 insertions(+), 38 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 84e8a1add0..4c99c3f865 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -124,18 +124,23 @@ static void nbd_yank(void *opaque);
 static void nbd_clear_bdrvstate(BlockDriverState *bs)
 {
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
+size_t i;
 
-nbd_client_connection_release(s->conns[0]->conn);
-s->conns[0]->conn = NULL;
+for (i = 0; i < MAX_MULTI_CONN; ++i) {
+if (s->conns[i]) {
+nbd_client_connection_release(s->conns[i]->conn);
+s->conns[i]->conn = NULL;
+
+/* Must not leave timers behind that would access freed data */
+assert(!s->conns[i]->reconnect_delay_timer);
+assert(!s->conns[i]->open_timer);
+
+g_free(s->conns[i]);
+}
+}
 
 yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
 
-/* Must not leave timers behind that would access freed data */
-assert(!s->conns[0]->reconnect_delay_timer);
-assert(!s->conns[0]->open_timer);
-
-g_free(s->conns[0]);
-
 object_unref(OBJECT(s->tlscreds));
 qapi_free_SocketAddress(s->saddr);
 s->saddr = NULL;
@@ -1905,43 +1910,39 @@ static int nbd_process_options(BlockDriverState *bs, 
QDict *options,
 return ret;
 }
 
-static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
-Error **errp)
+static NBDConnState *init_conn_state(BDRVNBDState *s)
 {
+NBDConnState *cs;
+
+cs = g_new0(NBDConnState, 1);
+cs->s = s;
+qemu_mutex_init(&cs->requests_lock);
+qemu_co_queue_init(&cs->free_sema);
+qemu_co_mutex_init(&cs->send_mutex);
+qemu_co_mutex_init(&cs->receive_mutex);
+
+return cs;
+}
+
+static int conn_state_connect(BlockDriverState *bs, NBDConnState *cs,
+  Error **errp)
+{
+BDRVNBDState *s = cs->s;
 int ret;
-BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
-s->bs = bs;
-
-s->conns[0] = g_new0(NBDConnState, 1);
-s->conns[0]->s = s;
-qemu_mutex_init(&s->conns[0]->requests_lock);
-qemu_co_queue_init(&s->conns[0]->free_sema);
-qemu_co_mutex_init(&s->conns[0]->send_mutex);
-qemu_co_mutex_init(&s->conns[0]->receive_mutex);
-
-if (!yank_register_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name), errp)) {
-return -EEXIST;
-}
-
-ret = nbd_process_options(bs, options, errp);
-if (ret < 0) {
-goto fail;
-}
-
-s->conns[0]->conn =
+cs->conn =
 nbd_client_connection_new(s->saddr, true, s->export,
   s->x_dirty_bitmap, s->tlscreds,
   s->tlshostname);
 
 if (s->open_timeout) {
-nbd_client_connection_enable_retry(s->conns[0]->conn);
-open_timer_init(s->conns[0], qemu_clock_get_ns(QEMU_CLOCK_REALTIME) +
+nbd_client_connection_enable_retry(cs->conn);
+open_timer_init(cs, qemu_clock_get_ns(QEMU_CLOCK_REALTIME) +
 s->open_timeout * NANOSECONDS_PER_SECOND);
 }
 
-s->conns[0]->state = NBD_CLIENT_CONNECTING_WAIT;
-ret = nbd_do_establish_connection(bs, s->conns[0], true, errp);
+cs->state = NBD_CLIENT_CONNECTING_WAIT;
+ret = nbd_do_establish_connection(bs, cs, true, errp);
 if (ret < 0) {
 goto fail;
 }
@@ -1951,9 +1952,44 @@ static int nbd_open(BlockDriverState *bs, QDict 
*options, int flags,
  * Delete it, because we do not want it to be around when this node
  * is drained or closed.
  */
-open_timer_del(s->conns[0]);
+open_timer_del(cs);
 
-nbd_client_connection_enable_retry(s->conns[0]->conn);
+nbd_client_connection_enable_retry(cs->conn);
+
+return 0;
+
+fail:
+open_timer_del(cs);
+return ret;
+}
+
+static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
+Error **errp)
+{
+int ret;
+BDRVNBDState *s = (BDRVNBDState *)bs-

[PATCH nbd 4/4] nbd: Enable multi-conn using round-robin

2023-03-09 Thread Richard W.M. Jones
Enable NBD multi-conn by spreading operations across multiple
connections.

(XXX) This uses a naive round-robin approach which could be improved.
For example we could look at how many requests are in flight and
assign operations to the connection with the fewest.  Or we could try to
estimate (based on size of requests outstanding) the load on each
connection.  But this implementation doesn't do any of that.
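
The first idea might look like this sketch (it assumes a
per-connection in_flight counter, which this series does not
maintain):

static NBDConnState *
choose_connection_least_loaded (BDRVNBDState *s)
{
    size_t i, best = 0;
    unsigned min_inflight = UINT_MAX;

    /* Scan the connections and pick the least busy one. */
    for (i = 0; i < s->multi_conn; i++) {
        unsigned n = qatomic_read (&s->conns[i]->in_flight);
        if (n < min_inflight) {
            min_inflight = n;
            best = i;
        }
    }
    return s->conns[best];
}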

Signed-off-by: Richard W.M. Jones 
---
 block/nbd.c | 67 +++--
 1 file changed, 49 insertions(+), 18 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 4c99c3f865..df32ba67ed 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -1232,6 +1232,26 @@ static int coroutine_fn nbd_co_request(NBDConnState *cs, 
NBDRequest *request,
 return ret ? ret : request_ret;
 }
 
+/*
+ * If multi-conn, choose a connection for this operation.
+ */
+static NBDConnState *choose_connection(BDRVNBDState *s)
+{
+static size_t next;
+size_t i;
+
+if (s->multi_conn <= 1) {
+return s->conns[0];
+}
+
+/* XXX Stupid simple round robin. */
+i = qatomic_fetch_inc(&next);
+i %= s->multi_conn;
+
+assert(s->conns[i] != NULL);
+return s->conns[i];
+}
+
 static int coroutine_fn nbd_client_co_preadv(BlockDriverState *bs, int64_t 
offset,
  int64_t bytes, QEMUIOVector *qiov,
  BdrvRequestFlags flags)
@@ -1244,7 +1264,7 @@ static int coroutine_fn 
nbd_client_co_preadv(BlockDriverState *bs, int64_t offse
 .from = offset,
 .len = bytes,
 };
-NBDConnState * const cs = s->conns[0];
+NBDConnState * const cs = choose_connection(s);
 
 assert(bytes <= NBD_MAX_BUFFER_SIZE);
 
@@ -1301,7 +1321,7 @@ static int coroutine_fn 
nbd_client_co_pwritev(BlockDriverState *bs, int64_t offs
 .from = offset,
 .len = bytes,
 };
-NBDConnState * const cs = s->conns[0];
+NBDConnState * const cs = choose_connection(s);
 
 assert(!(cs->info.flags & NBD_FLAG_READ_ONLY));
 if (flags & BDRV_REQ_FUA) {
@@ -1326,7 +1346,7 @@ static int coroutine_fn 
nbd_client_co_pwrite_zeroes(BlockDriverState *bs, int64_
 .from = offset,
 .len = bytes,  /* .len is uint32_t actually */
 };
-NBDConnState * const cs = s->conns[0];
+NBDConnState * const cs = choose_connection(s);
 
 assert(bytes <= UINT32_MAX); /* rely on max_pwrite_zeroes */
 
@@ -1357,7 +1377,13 @@ static int coroutine_fn 
nbd_client_co_flush(BlockDriverState *bs)
 {
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 NBDRequest request = { .type = NBD_CMD_FLUSH };
-NBDConnState * const cs = s->conns[0];
+
+/*
+ * Multi-conn (if used) guarantees that flushing on any connection
+ * flushes caches on all connections, so we can perform this
+ * operation on any.
+ */
+NBDConnState * const cs = choose_connection(s);
 
 if (!(cs->info.flags & NBD_FLAG_SEND_FLUSH)) {
 return 0;
@@ -1378,7 +1404,7 @@ static int coroutine_fn 
nbd_client_co_pdiscard(BlockDriverState *bs, int64_t off
 .from = offset,
 .len = bytes, /* len is uint32_t */
 };
-NBDConnState * const cs = s->conns[0];
+NBDConnState * const cs = choose_connection(s);
 
 assert(bytes <= UINT32_MAX); /* rely on max_pdiscard */
 
@@ -1398,7 +1424,7 @@ static int coroutine_fn nbd_client_co_block_status(
 NBDExtent extent = { 0 };
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 Error *local_err = NULL;
-NBDConnState * const cs = s->conns[0];
+NBDConnState * const cs = choose_connection(s);
 
 NBDRequest request = {
 .type = NBD_CMD_BLOCK_STATUS,
@@ -2027,7 +2053,7 @@ static int coroutine_fn nbd_co_flush(BlockDriverState *bs)
 static void nbd_refresh_limits(BlockDriverState *bs, Error **errp)
 {
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
-NBDConnState * const cs = s->conns[0];
+NBDConnState * const cs = choose_connection(s);
 uint32_t min = cs->info.min_block;
 uint32_t max = MIN_NON_ZERO(NBD_MAX_BUFFER_SIZE, cs->info.max_block);
 
@@ -2085,7 +2111,7 @@ static int coroutine_fn nbd_co_truncate(BlockDriverState 
*bs, int64_t offset,
 BdrvRequestFlags flags, Error **errp)
 {
 BDRVNBDState *s = bs->opaque;
-NBDConnState * const cs = s->conns[0];
+NBDConnState * const cs = choose_connection(s);
 
 if (offset != cs->info.size && exact) {
 error_setg(errp, "Cannot resize NBD nodes");
@@ -2168,24 +2194,29 @@ static const char *const nbd_strong_runtime_opts[] = {
 static void nbd_cancel_in_flight(BlockDriverState *bs)
 {
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
-NBDConnState * const cs = s->conns[0];
+size_t i;
+NBDConnState *cs;
 
-reconnect_delay_timer_del(cs);
+  

[PATCH nbd 0/4] Enable multi-conn NBD [for discussion only]

2023-03-09 Thread Richard W.M. Jones
[ Patch series also available here, along with this cover letter and the
  script used to generate test results:
  https://gitlab.com/rwmjones/qemu/-/commits/2023-nbd-multi-conn-v1 ]

This patch series adds multi-conn support to the NBD block driver in
qemu.  It is only meant for discussion and testing because it has a
number of obvious shortcomings (see "XXX" in commit messages and
code).  If we decided this was a good idea, we can work on a better
patch.

The Network Block Device (NBD) protocol allows servers to advertise
that they are capable of multi-conn.  This means they obey certain
requirements around how data is cached, allowing multiple connections
to be safely opened to the NBD server.  For example, a flush or FUA
operation on one connection is guaranteed to flush the cache on all
connections.

Clients that use multi-conn can achieve better performance.  This
seems to be down to at least two factors:

 - Avoids "head of line blocking" of large requests.

 - With NBD over Unix domain sockets, more cores can be used.

qemu-nbd, nbdkit and libnbd have all supported multi-conn for ages,
but qemu's built-in NBD client does not, which is what this patch
fixes.

Below I've produced a few benchmarks.

Note these are mostly concocted to try to test NBD performance and may
not make sense in their own terms (eg. qemu's disk image layer has a
curl client so you wouldn't need to run one separately).  In the real
world we use long pipelines of NBD operations where different tools
are mixed together to achieve efficient downloads, conversions, disk
modifications and sparsification, and they would exercise different
aspects of this.

I've also included nbdcopy as a client for comparison in some tests.

All tests were run 4 times, the first result discarded, and the last 3
averaged.  If any of the last 3 were > 10% away from the average then
the test was stopped.
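
In pseudo-Python the measurement protocol was roughly this (a sketch;
run_test is a hypothetical stand-in for one timed run):

  def benchmark(run_test, runs=4, tolerance=0.10):
      times = [run_test() for _ in range(runs)]
      kept = times[1:]                  # discard the first (warm-up) run
      avg = sum(kept) / len(kept)
      # stop if any kept result deviates from the average by > 10%
      if any(abs(t - avg) > tolerance * avg for t in kept):
          raise RuntimeError("results too noisy, test stopped")
      return avg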

My summary:

 - It works effectively for qemu client & nbdkit server, especially in
   cases where the server does large, heavyweight requests.  This is
   important for us because virt-v2v uses an nbdkit Python plugin and
   various other heavyweight plugins (eg. plugins that access remote
   servers for each request).

 - It seems to make little or no difference with qemu + qemu-nbd
   server.  I speculate that's because qemu-nbd doesn't support system
   threads, so networking is bottlenecked through a single core.  Even
   though there are coroutines handling different sockets, they must
   all wait in turn to issue send(3) or recv(3) calls on the same
   core.

 - qemu-img unfortunately uses a single thread for all coroutines so
   it suffers from a similar problem to qemu-nbd.  This change would
   be much more effective if we could distribute coroutines across
   threads.

 - For tests which are highly bottlenecked on disk I/O (eg. the large
   local file test and null test) multi-conn doesn't make much
   difference.

 - Multi-conn even with only 2 connections can make up for the
   overhead of range requests, exceeding the performance of wget.

 - In the curlremote test, qemu-nbd is especially slow, for unknown
   reasons.


Integrity test (./multi-conn.pl integrity)
==========================================

nbdkit-sparse-random-plugin
           |  ^
  nbd+unix |  | nbd+unix
           v  |
   qemu-img convert

Reading from and writing the same data back to nbdkit sparse-random
plugin checks that the data written is the same as the data read.
This uses two Unix domain sockets, with or without multi-conn.  This
test is mainly here to check we don't crash or corrupt data with this
patch.

  server    client      multi-conn
  ----------------------------------------------
  nbdkit    qemu-img    [u/s]           9.07s
  nbdkit    qemu-img    1               9.05s
  nbdkit    qemu-img    2               9.02s
  nbdkit    qemu-img    4               8.98s

[u/s] = upstream qemu 7.2.0


Curl local server test (./multi-conn.pl curlhttp)
=================================================

Localhost Apache serving a file over http
  |
  | http
  v
nbdkit-curl-plugin   or   qemu-nbd
  |
  | nbd+unix
  v
qemu-img convert   or   nbdcopy

We download an image from a local web server through
nbdkit-curl-plugin or qemu-nbd using the curl block driver, over NBD.
The image is copied to /dev/null.

  server      client      multi-conn
  ----------------------------------------------
  qemu-nbd    nbdcopy     1               8.88s
  qemu-nbd    nbdcopy     2               8.64s
  qemu-nbd    nbdcopy     4               8.37s
  qemu-nbd    qemu-img    [u/s]           6.47s
  qemu-nbd    qemu-img    1               6.56s
  qemu-nbd    qemu-img    2               6.63s
  qemu-nbd    qemu-img    4               6.50s
  nbdkit      nbdcopy     1               12.15s
nbdkit nbdc

Re: [PATCH] docs: Prefer 'cookie' over 'handle'

2023-03-04 Thread Richard W.M. Jones
On Fri, Mar 03, 2023 at 04:15:03PM -0600, Eric Blake wrote:
> In libnbd, we quickly learned that distinguishing between 'handle'
> (verb for acting on an object) and 'handle' (noun describing which
> object to act on) could get confusing; we solved it by renaming the
> latter to 'cookie'.  Copy that approach into the NBD spec, and make it
> obvious that a cookie is opaque data from the point of view of the
> server.  Makes no difference to implementations (other than older code
> still using 'handle' may be slightly harder to tie back to the spec).
> 
> Suggested-by: Richard W.M. Jones 
> Signed-off-by: Eric Blake 
> ---
>  doc/proto.md | 18 +-
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index 3a96703..abb23e8 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -301,11 +301,11 @@ may be handled by the server asynchronously), and structured reply
>  chunks from one request may be interleaved with reply messages from
>  other requests; however, there may be constraints that prevent
>  arbitrary reordering of structured reply chunks within a given reply.
> -Clients SHOULD use a handle that is distinct from all other currently
> -pending transactions, but MAY reuse handles that are no longer in
> -flight; handles need not be consecutive.  In each reply message
> +Clients SHOULD use a cookie that is distinct from all other currently
> +pending transactions, but MAY reuse cookies that are no longer in
> +flight; cookies need not be consecutive.  In each reply message
>  (whether simple or structured), the server MUST use the same value for
> -handle as was sent by the client in the corresponding request.  In
> +cookie as was sent by the client in the corresponding request.  In
>  this way, the client can correlate which request is receiving a
>  response.
> 
> @@ -349,7 +349,7 @@ The request message, sent by the client, looks as follows:
>  C: 32 bits, 0x25609513, magic (`NBD_REQUEST_MAGIC`)  
>  C: 16 bits, command flags  
>  C: 16 bits, type  
> -C: 64 bits, handle  
> +C: 64 bits, cookie  
>  C: 64 bits, offset (unsigned)  
>  C: 32 bits, length (unsigned)  
>  C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)  
> @@ -366,7 +366,7 @@ follows:
>  S: 32 bits, 0x67446698, magic (`NBD_SIMPLE_REPLY_MAGIC`; used to be
> `NBD_REPLY_MAGIC`)  
>  S: 32 bits, error (MAY be zero)  
> -S: 64 bits, handle  
> +S: 64 bits, cookie  
>  S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
>  *error* is zero)  
> 
> @@ -381,7 +381,7 @@ server must initiate a hard disconnect).  Second, there is no way to
>  efficiently skip over portions of a sparse file that are known to
>  contain all zeroes.  Finally, it is not possible to reliably decode
>  the server traffic without also having context of what pending read
> -requests were sent by the client, to see which *handle* values will
> +requests were sent by the client, to see which *cookie* values will
>  have accompanying payload on success.  Therefore structured replies
>  are also permitted if negotiated.
> 
> @@ -398,7 +398,7 @@ sending errors via a structured reply, as the error can then be
>  accompanied by a string payload to present to a human user.
> 
>  A structured reply MAY occupy multiple structured chunk messages
> -(all with the same value for "handle"), and the
> +(all with the same value for "cookie"), and the
>  `NBD_REPLY_FLAG_DONE` reply flag is used to identify the final
>  chunk.  Unless further documented by individual requests below,
>  the chunks MAY be sent in any order, except that the chunk with
> @@ -418,7 +418,7 @@ A structured reply chunk message looks as follows:
>  S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)  
>  S: 16 bits, flags  
>  S: 16 bits, type  
> -S: 64 bits, handle  
> +S: 64 bits, cookie  
>  S: 32 bits, length of payload (unsigned)  
>  S: *length* bytes of payload data (if *length* is nonzero)  

Acked-by: Richard W.M. Jones 

I think this change is very sensible.  libnbd has always called this a
cookie, because "handle" is a very generic term often used to refer to
things inside the program.  Leading to this code:

https://gitlab.com/nbdkit/libnbd/-/blob/d169661119f66893e2c3050ce25f76554c519b59/generator/states-issue-command.c#L47

nbdkit uses "handle" but we should change that (whether or not this
goes upstream in fact).
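
For anyone writing a client from scratch, the renaming is purely
cosmetic on the wire; the request header still packs the same 28
bytes.  A sketch in Python, following the layout quoted above:

  import struct

  NBD_REQUEST_MAGIC = 0x25609513

  def nbd_request(flags, type_, cookie, offset, length):
      # 32-bit magic, 16-bit flags, 16-bit type, 64-bit cookie,
      # 64-bit offset, 32-bit length -- all big-endian.
      return struct.pack(">IHHQQI", NBD_REQUEST_MAGIC,
                         flags, type_, cookie, offset, length)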

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top




Re: MBR plus emulated FAT

2023-02-14 Thread Richard W.M. Jones
On Sat, Jan 28, 2023 at 08:48:13PM +0100, Csepp wrote:
> 
> Eric Blake  writes:
> 
> > On Fri, Jan 27, 2023 at 01:00:05AM +0100, Csepp wrote:
> >> Hi!
> >> 
> >> Would it be possible to store the metadata for emulated FAT partitions
> >> backed by host directories?  It would make installing Windows 98 much
> >> more seamless, since it would be able to set the boot flag during
> >> install. I have a 9front install that uses no block devices and gets its
> >> root file system via a simple 9P server, I've found that extremely
> >> useful, since it lets me back up or modify files directly from the host.
> >> 
> >> I don't have that much free time, but if it wouldn't be too difficult to
> >> implement this and someone helped, I could try to do it myself.  But
> >> honestly I would be super thankful if someone else implemented it
> >> instead.
> >
> > I wonder if you can already accomplish this using nbdkit-floppy-plugin
> > (expose a directory as a FAT file system over NBD), then connect qemu
> > as an NBD client, rather than having to implement this all in qemu
> > proper.
> 
> Didn't know abut that program, thanks!  Unfortunately:
> > The plugin does not support writes.
> https://manpages.debian.org/bullseye/nbdkit/nbdkit-floppy-plugin.1.en.html

(Late reply - Eric just told me about this thread)

You can place the COW filter on top to make this plugin writable.
However that doesn't persist the changes back to the disk (VVFAT is
kind of crazy).

$ nbdkit --filter=cow floppy /directory

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top




Re: RFC: towards systemd socket activation in q-s-d

2023-01-30 Thread Richard W.M. Jones
On Mon, Jan 30, 2023 at 04:45:08PM +, Daniel P. Berrangé wrote:
> which is LISTEN_FDS=2, LISTEN_FDNAMES=control,vnc

Colon for separating the elements, not comma.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
nbdkit - Flexible, fast NBD server with plugins
https://gitlab.com/nbdkit/nbdkit




Re: RFC: towards systemd socket activation in q-s-d

2023-01-30 Thread Richard W.M. Jones
On Mon, Jan 30, 2023 at 02:58:01PM +, Daniel P. Berrangé wrote:
> Obviously at startup QEMU can trivially inherit the FDs from whatever
> spawned it. The only task is to identify the FDs that are passed into,
> and systemd defined a mechanism for this using LISTEN_FDNAMES. IOW the
> socket activation can fully replace 'getfd' for purpose of initial
> startup. This will get rid of the annoying difference that SocketAddress
> only allows numeric FDs at startup and named FDs at runtime, by making
> named FDs the consistent standard. We could thus deprecate the use of
> non-named numeric FDs in SocketAddress to improve our sanity.
> 
> The question is how to define semantics for the LISTEN_FDNAMES while
> also still remaining back compat with the existing QEMU utilities
> that allow socket activation. Some kind of naming scheme would need
> to be decided upon, as well as handling the use of activation without
> LISTEN_FDNAMES being set. 

If I understand LISTEN_FDNAMES correctly, it's the names of the
protocols to be used (rather clumsily expressed through IANA
registered names from /etc/services).  It would be valid to use
LISTEN_FDNAMES=http:http for example, for a service that must use HTTP
on both sockets.

In other words it's not just names of file descriptors that you can
make up.

However as there is zero documentation for this environment variable,
who knows what it's really supposed to be ...

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
nbdkit - Flexible, fast NBD server with plugins
https://gitlab.com/nbdkit/nbdkit




Re: RFC: towards systemd socket activation in q-s-d

2023-01-27 Thread Richard W.M. Jones
On Fri, Jan 27, 2023 at 03:26:15PM -0600, Eric Blake wrote:
> In https://bugzilla.redhat.com/show_bug.cgi?id=2055229, the question
> was raised on how to make qemu-storage-daemon sufficiently powerful to
> be a full-blown replacement to qemu-nbd.  One of the features still
> lacking is the ability to do systemd socket activation (qemu-nbd does
> this, qemu-storage-daemon needs a way to do it).
> 
> But that bug further noted that systemd supports LISTEN_FDNAMES to
> supply names to a passed-in fd (right now, qemu-nbd does not use
> names, but merely expects one fd in LISTEN_FDS).  Dan had the idea
> that it would be nice to write a systemd file that passes in a socket
> name for a QMP socket, as in:
> 
>  [Socket]
>  ListenStream=/var/run/myapp/qsd.qmp
>  FileDescriptorName=qmp
>  Service=myapp-qsd.service
> 
> and further notes that QAPI SocketAddressType supports @fd which is a
> name in QMP (a previously-added fd passed through the older 'getfd'
> command, rather than the newer 'add-fd' command), but an integer on
> the command line.  With LISTEN_FDNAMES, we could mix systemd socket
> activation with named fds for any command line usage that already
> supports SocketAddressType (not limited to just q-s-d usage).

I couldn't find a good description of LISTEN_FDNAMES anywhere, and we
don't use it in libnbd / nbdkit / qemu-nbd right now.  So here's my
interpretation of what this environment variable means ...

Currently systemd socket activation in qemu-nbd or nbdkit uses only
LISTEN_PID and LISTEN_FDS, usually with a single fd being passed.
(I'll ignore LISTEN_PID.)  So:

  LISTEN_FDS=1
fd 3 --> NBD socket

This works because there's only one NBD protocol, it doesn't need to
be named.

However if we imagine that a superserver wants to run a webserver, it
might need to pass two sockets carrying distinct protocols (HTTP and
HTTPS).  In this case it would use:

  LISTEN_FDS=2
fd 3 --> HTTP socket
fd 4 --> HTTPS socket
  LISTEN_FDNAMES=http:https

LISTEN_FDNAMES is telling the webserver that the first socket is HTTP
and the second is HTTPS, so it knows which protocol to negotiate on
each socket.

I believe what you're saying above is that you'd like to have the
ability to pass multiple sockets to q-s-d carrying distinct protocols,
with the obvious ones being NBD + QMP (although other combinations
could be possible, eg. NBD + vhost + QMP?):

  LISTEN_FDS=2
fd 3 --> NBD socket
fd 4 --> QMP socket
  LISTEN_FDNAMES=nbd:qmp

If my understanding is correct, then this makes sense.  We might also
modify libnbd to pass LISTEN_FDNAMES=nbd in:

  
https://gitlab.com/nbdkit/libnbd/-/blob/master/generator/states-connect-socket-activation.c

(Current servers would ignore the new environment variable.)
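
To make the interpretation above concrete, here is a sketch of the
parsing a server would do, mapping each passed-in fd (they start at
fd 3 by convention) to its LISTEN_FDNAMES entry:

  import os

  LISTEN_FDS_START = 3    # first activation fd, by systemd convention

  def activation_sockets():
      # Ignore the variables if they were meant for another process.
      if os.environ.get("LISTEN_PID") != str(os.getpid()):
          return {}
      nfds = int(os.environ.get("LISTEN_FDS", "0"))
      names = os.environ.get("LISTEN_FDNAMES", "").split(":")
      return {LISTEN_FDS_START + i:
              (names[i] if i < len(names) else "unknown")
              for i in range(nfds)}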

> I'm at a point where I can take a shot at implementing this, but want
> some feedback on whether it is better to try to shoehorn a generic
> solution into the existing @fd member of the SocketAddressType union,
> or whether it would be better to add yet another union member
> @systemd-fd or some similar name to make it explicit when a command
> line parameter wants to refer to an fd being passed through systemd
> socket activation LISTEN_FDS and friends.

It sounds like the q-s-d command lines will be even more convoluted
and inelegant than before.

That's fine, but please keep qemu-nbd around as a sane alternative.
It might in the end just be a wrapper that translates the command line
to q-s-d.  I don't believe it's ever going to be possible or desirable
to deprecate or remove qemu-nbd.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html




Re: [RFC v2 02/10] Drop unused static function return values

2022-08-03 Thread Richard W.M. Jones
On Wed, Aug 03, 2022 at 01:25:34PM +0100, Peter Maydell wrote:
> On Wed, 3 Aug 2022 at 12:44, Daniel P. Berrangé  wrote:
> > Inconsistent return value checking is designed-in behaviour for
> > QEMU's current Error handling coding pattern with error_abort/fatal.
> 
> Yes; I habitually mark as false-positive Coverity reports about
> missing error checks where it has not noticed that the error
> handling is done via the errp pointer.

Presumably the advantage of having a qemu-specific static analyser is
it'll be able to ignore certain cases, eg. spotting if error_abort is
a parameter and allowing (requiring even?) the return value to be
ignored.

Coverity allows custom models too, but obviously that's all
proprietary software.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW




Re: [RFC v2 02/10] Drop unused static function return values

2022-08-03 Thread Richard W.M. Jones
On Wed, Aug 03, 2022 at 12:07:19PM +0100, Alberto Faria wrote:
> On Wed, Aug 3, 2022 at 11:46 AM Dr. David Alan Gilbert
>  wrote:
> >
> > * Alberto Faria (afa...@redhat.com) wrote:
> > > Make non-void static functions whose return values are ignored by
> > > all callers return void instead.
> > >
> > > These functions were found by static-analyzer.py.
> > >
> > > Not all occurrences of this problem were fixed.
> > >
> > > Signed-off-by: Alberto Faria 
> >
> > 
> >
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index e03f698a3c..4698080f96 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -175,7 +175,7 @@ static MigrationIncomingState *current_incoming;
> > >
> > >  static GSList *migration_blockers;
> > >
> > > -static bool migration_object_check(MigrationState *ms, Error **errp);
> > > +static void migration_object_check(MigrationState *ms, Error **errp);
> > >  static int migration_maybe_pause(MigrationState *s,
> > >   int *current_active_state,
> > >   int new_state);
> > > @@ -4485,15 +4485,15 @@ static void migration_instance_init(Object *obj)
> > >   * Return true if check pass, false otherwise. Error will be put
> > >   * inside errp if provided.
> > >   */
> > > -static bool migration_object_check(MigrationState *ms, Error **errp)
> > > +static void migration_object_check(MigrationState *ms, Error **errp)
> > >  {
> >
> > I'm not sure if this is a good change.
> > Where we have a function that returns an error via an Error ** it's
> > normal practice for us to return a bool to say whether it generated an
> > error.
> >
> > Now, in our case we only call it with error_fatal:
> >
> > migration_object_check(current_migration, &error_fatal);
> >
> > so the bool isn't used/checked.
> >
> > So I'm a bit conflicted:
> >
> >   a) Using error_fatal is the easiest way to handle this function
> >   b) Things taking Error ** normally do return a flag value
> >   c) But it's not used in this case.
> >
> > Hmm.
> 
> I guess this generalizes to the bigger question of whether a global
> "return-value-never-used" check makes sense and brings value. Maybe
> there are too many cases where it would be preferable to keep the
> return value for consistency? Maybe they're not that many and could be
> tagged with __attribute__((unused))?
> 
> But in this particular case, perhaps we could drop the Error **errp
> parameter and directly pass &error_fatal to migrate_params_check() and
> migrate_caps_check().

If it helps to think about this, Coverity checks for consistency.
Across the whole code base, is the return value of a function used or
ignored consistently.  You will see Coverity errors like:

  Error: CHECKED_RETURN (CWE-252): [#def37]
  libnbd-1.12.5/fuse/operations.c:180: check_return: Calling "nbd_poll" without checking return value (as is done elsewhere 5 out of 6 times).
  libnbd-1.12.5/examples/aio-connect-read.c:96: example_checked: Example 1: "nbd_poll(nbd, -1)" has its value checked in "nbd_poll(nbd, -1) == -1".
  libnbd-1.12.5/examples/aio-connect-read.c:128: example_checked: Example 2: "nbd_poll(nbd, -1)" has its value checked in "nbd_poll(nbd, -1) == -1".
  libnbd-1.12.5/examples/strict-structured-reads.c:246: example_checked: Example 3: "nbd_poll(nbd, -1)" has its value checked in "nbd_poll(nbd, -1) == -1".
  libnbd-1.12.5/ocaml/nbd-c.c:2599: example_assign: Example 4: Assigning: "r" = return value from "nbd_poll(h, timeout)".
  libnbd-1.12.5/ocaml/nbd-c.c:2602: example_checked: Example 4 (cont.): "r" has its value checked in "r == -1".
  libnbd-1.12.5/python/methods.c:2806: example_assign: Example 5: Assigning: "ret" = return value from "nbd_poll(h, timeout)".
  libnbd-1.12.5/python/methods.c:2808: example_checked: Example 5 (cont.): "ret" has its value checked in "ret == -1".
  #  178|   /* Dispatch work while there are commands in flight. */
  #  179|   while (thread->in_flight > 0)
  #  180|->   nbd_poll (h, -1);
  #  181| }
  #  182|

What it's saying is that in this code base, nbd_poll's return value
was checked by the caller 5 out of 6 times, but ignored here.  (This
turned out to be a real bug which we fixed).

It seems like the check implemented in your patch is: If the return
value is used 0 times anywhere in the code base, change the return
value to 'void'.  Coverity would not flag this.

Maybe a consistent use check is better?
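
That is, something along these lines (a toy sketch, not Coverity's
actual algorithm; call_sites is a hypothetical list of
(location, used) pairs for one function):

  def check_consistent_use(call_sites, threshold=0.8):
      # Warn about ignored return values only when most other callers
      # of the same function do check the value.
      used = sum(1 for _, u in call_sites if u)
      ratio = used / len(call_sites)
      if ratio >= threshold:
          return [loc for loc, u in call_sites if not u]
      return []    # usage too mixed (or mostly ignored) to warn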

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org




Re: [Libguestfs] [PATCH 1/2] spec: Recommend cap on NBD_REPLY_TYPE_BLOCK_STATUS length

2022-04-08 Thread Richard W.M. Jones
On Fri, Apr 08, 2022 at 08:45:56PM +0300, Nir Soffer wrote:
> On Fri, Apr 8, 2022 at 6:47 PM Eric Blake  wrote:
> >
> > On Fri, Apr 08, 2022 at 04:48:59PM +0300, Nir Soffer wrote:
> ...
> > > > BTW attached is an nbdkit plugin that creates an NBD server that
> > > > responds with massive numbers of byte-granularity extents, in case
> > > > anyone wants to test how nbdkit and/or clients respond:
> > > >
> > > > $ chmod +x /var/tmp/lots-of-extents.py
> > > > $ /var/tmp/lots-of-extents.py -f
> > > >
> > > > $ nbdinfo --map nbd://localhost | head
> > > >  0  1  3  hole,zero
> > > >  1  1  0  data
> > > >  2  1  3  hole,zero
> > > >  3  1  0  data
> > > >  4  1  3  hole,zero
> > > >  5  1  0  data
> > > >  6  1  3  hole,zero
> > > >  7  1  0  data
> > > >  8  1  3  hole,zero
> > > >  9  1  0  data
> > > > $ nbdinfo --map --totals nbd://localhost
> > > > 524288  50.0%   0 data
> > > > 524288  50.0%   3 hole,zero
> > >
> > > This is a malicious server. A good client will drop the connection when
> > > receiving the first 1 byte chunk.
> >
> > Depends on the server.  Most servers don't serve 1-byte extents, and
> > the NBD spec even recommends that extents be at least 512 bytes in
> > size, and requires that extents be a multiple of any minimum block
> > size if one was advertised by the server.
> >
> > But even though most servers don't have 1-byte extents does not mean
> > that the NBD protocol must forbid them.
> 
> Forbidding this simplifies clients without limiting real world use cases.

I'm not even sure this is true.  Clients are quite free to only make
requests on 512 byte block boundaries if they want, and refuse to deal
with servers which offer non-aligned disk sizes or extents.

If clients are doing this and still have problems they ought to be
fixed, although I don't know what those would be.

> What is a reason to allow this?

Because you don't always want to serve things which are exactly disk
images, or which make assumptions related to modern PCs (256 byte
sectors were used into the 90s).

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html




Re: [Libguestfs] [PATCH 1/2] spec: Recommend cap on NBD_REPLY_TYPE_BLOCK_STATUS length

2022-04-08 Thread Richard W.M. Jones
On Fri, Apr 08, 2022 at 04:48:59PM +0300, Nir Soffer wrote:
> This is a malicious server. A good client will drop the connection when
> receiving the first 1 byte chunk.
> 
> The real issue here is not enforcing or suggesting a limit on the number of
> extents the server returns, but enforcing a limit on the minimum size of
> a chunk.
> 
> Since this is the network *block device* protocol it should not allow chunks
> smaller than the device block size, so anything smaller than 512 bytes
> should be invalid response from the server.
> 
> Even the last chunk should not be smaller than 512 bytes. The fact that you
> can serve a file with size that is not aligned to 512 bytes does not mean that
> the export size can be unaligned to the logical block size. There are no real
> block devices that have such alignment so the protocol should not allow this.
> A good server will round the file size down to the logical block size to avoid this issue.
> 
> How about letting the client set a minimum size of a chunk? This way we
> avoid the issue of limiting the number of chunks. Merging small chunks
> is best done on the server side instead of wasting bandwidth and doing
> this on the client side.

While it's interesting to know if chunks should obey the
(server-specified) minimum block size, I don't think we should force
operations to only work on sector boundaries.  That's a step
backwards.  We've spent a long time and effort making nbdkit & NBD
work well for < 512 byte images, byte granularity tails, and disk
operations.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top




Re: [Libguestfs] [PATCH 1/2] spec: Recommend cap on NBD_REPLY_TYPE_BLOCK_STATUS length

2022-04-08 Thread Richard W.M. Jones
On Fri, Apr 08, 2022 at 12:52:44PM +0100, Richard W.M. Jones wrote:
> From experimentation I found this really hurts my laptop :-(

... because of a memory leak in the nbdkit Python bindings as it
turned out.  This function:

> def extents(h, count, offset, flags):
> e = [(i, 1, typ(i)) for i in range(offset, offset + count + 1)]
> return e

returns tuples which never got dereferenced in the C part.

Fixed in:

https://gitlab.com/nbdkit/nbdkit/-/commit/4ede3d7b0778cc1fe661307d2289fadfc18a1af2

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
nbdkit - Flexible, fast NBD server with plugins
https://gitlab.com/nbdkit/nbdkit




Re: [Libguestfs] [PATCH 1/2] spec: Recommend cap on NBD_REPLY_TYPE_BLOCK_STATUS length

2022-04-08 Thread Richard W.M. Jones
On Fri, Apr 08, 2022 at 09:25:01AM +0200, Wouter Verhelst wrote:
> Hi Eric,
> 
> On Thu, Apr 07, 2022 at 04:37:19PM -0500, Eric Blake wrote:
> > The spec was silent on how many extents a server could reply with.
> > However, both qemu and nbdkit (the two server implementations known to
> > have implemented the NBD_CMD_BLOCK_STATUS extension) implement a hard
> > cap, and will truncate the amount of extents in a reply to avoid
> > sending a client a reply larger than the maximum NBD_CMD_READ response
> > they are willing to tolerate:
> > 
> > When qemu first implemented NBD_CMD_BLOCK_STATUS for the
> > base:allocation context (qemu commit e7b1948d51, Mar 2018), it behaved
> > as if NBD_CMD_FLAG_REQ_ONE were always passed by the client, and never
> > responded with more than one extent.  Later, when adding its
> > qemu:dirty-bitmap:XXX context extension (qemu commit 3d068aff16, Jun
> > 2018), it added a cap to 128k extents (1M+4 bytes), and that cap was
> > applied to base:allocation once qemu started sending multiple extents
> > for that context as well (qemu commit fb7afc797e, Jul 2018).  Qemu
> > extents are never smaller than 512 bytes (other than an exception at
> > the end of a file whose size is not aligned to 512), but even so, a
> > request for just under 4G of block status could produce 8M extents,
> > resulting in a reply of 64M if it were not capped smaller.
> > 
> > When nbdkit first implemented NBD_CMD_BLOCK_STATUS (nbdkit 4ca66f70a5,
> > Mar 2019), it did not impose any restriction on the number of extents
> > in the reply chunk.  But because it allows extents as small as one
> > byte, it is easy to write a server that can amplify a client's request
> > of status over 1M of the image into a reply over 8M in size, and it
> > was very easy to demonstrate that a hard cap was needed to avoid
> > crashing clients or otherwise killing the connection (a bad server
> > impacting the client negatively); unique to nbdkit's situation is the
> > fact that because it is designed for plugin server implementations,
> > not capping the length of extent also posed a problem to nbdkit as the
> > server (a client requesting large block status could cause the server
> > to run out of memory depending on the plugin providing the server
> > callbacks).  So nbdkit enforced a bound of 1M extents (8M+4 bytes,
> > nbdkit commit 6e0dc839ea, Jun 2019).
> > 
> > Since the limit chosen by these two implementations is different, and
> > since nbdkit has versions that were not limited, add this as a SHOULD
> > NOT instead of MUST NOT constraint on servers implementing block
> > status.  It does not matter that qemu picked a smaller limit that it
> > truncates to, since we have already documented that the server may
> > truncate for other reasons (such as it being inefficient to collect
> > that many extents in the first place).  But documenting the limit now
> > becomes even more important in the face of a future addition of 64-bit
> > requests, where a client's request is no longer bounded to 4G and
> > could thereby produce even more than 8M extents for the corner case
> > when every 512 bytes is a new extent, if it were not for this
> > recommendation.
> 
> It feels backwards to me to make this a restriction on the server side.
> You're saying there are server implementations that will be inefficient
> if there are more than 2^20 extents, and therefore no server should send
> more than those, even if it can do so efficiently.
> 
> Isn't it up to the server implementation to decide what can be done
> efficiently?
> 
> Perhaps we can make the language about possibly reducing length of
> extens a bit stronger; but I don't think adding explicit limits for a
> server's own protection is necessary.

I agree, but for a different reason.

I think Eric should add language that servers can consider limiting
response sizes in order to prevent possible amplification issues
and/or simply overwhelming the client with work (bad server DoS
attacks against clients are a thing!), but I don't think it's
necessarily a "SHOULD" issue.

BTW attached is an nbdkit plugin that creates an NBD server that
responds with massive numbers of byte-granularity extents, in case
anyone wants to test how nbdkit and/or clients respond:

$ chmod +x /var/tmp/lots-of-extents.py
$ /var/tmp/lots-of-extents.py -f

$ nbdinfo --map nbd://localhost | head
 0  1  3  hole,zero
 1  1  0  data
 2  1  3  hole,zero
 3  1  0  data
 4  1  3  hole,zero
 5  1  0  data
 6  1  3  hole,zero
 7  1  0  data
 8  1  3  hole,zero
 9  1  0  data
$ nbdinfo --map --totals nbd://localhost 
524288  50.0%   0 data
524288  50.0%   3 hole,zero
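
(In case the attachment is stripped by the archive, a minimal
reconstruction of such a plugin looks roughly like this, reusing the
extents function quoted elsewhere in the thread:)

  #!/usr/sbin/nbdkit python
  # A 1M disk reporting alternating 1-byte "hole,zero" (3) and
  # "data" (0) extents.

  def typ(i):
      return 3 if i % 2 == 0 else 0

  def open(readonly):
      return 1

  def get_size(h):
      return 1024 * 1024

  def pread(h, count, offset):
      return bytearray(count)

  def extents(h, count, offset, flags):
      return [(i, 1, typ(i)) for i in range(offset, offset + count + 1)]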

From experimentation I found this really hurts my laptop :-(

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read

Re: [PATCH v3] spec: Clarify BLOCK_STATUS reply details

2022-04-04 Thread Richard W.M. Jones
On Fri, Apr 01, 2022 at 04:08:07PM -0500, Eric Blake wrote:
> Our docs were inconsistent on whether a NBD_REPLY_TYPE_BLOCK_STATUS
> reply chunk can exceed the client's requested length, and silent on
> whether the lengths must be consistent when multiple contexts were
> negotiated.  Clarify this to match existing practice as implemented in
> qemu-nbd.  Clean up some nearby grammatical errors while at it.
> ---
> 
> Another round of rewording attempts, based on feedback from Rich on
> v2.  I went ahead and pushed patch 1 and 2 of the v2 series, as they
> were less controversial.
> 
>  doc/proto.md | 42 --
>  1 file changed, 28 insertions(+), 14 deletions(-)
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index 8a817d2..bacccfa 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -882,15 +882,25 @@ The procedure works as follows:
>server supports.
>  - During transmission, a client can then indicate interest in metadata
>for a given region by way of the `NBD_CMD_BLOCK_STATUS` command,
> -  where *offset* and *length* indicate the area of interest. The
> -  server MUST then respond with the requested information, for all
> -  contexts which were selected during negotiation. For every metadata
> -  context, the server sends one set of extent chunks, where the sizes
> -  of the extents MUST be less than or equal to the length as specified
> -  in the request. Each extent comes with a *flags* field, the
> -  semantics of which are defined by the metadata context.
> -- A server MUST reply to `NBD_CMD_BLOCK_STATUS` with a structured
> -  reply of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
> +  where *offset* and *length* indicate the area of interest. On
> +  success, the server MUST respond with one structured reply chunk of
> +  type `NBD_REPLY_TYPE_BLOCK_STATUS` per metadata context selected
> +  during negotiation, where each reply chunk is a list of one or more
> +  consecutive extents for that context.  Each extent comes with a
> +  *flags* field, the semantics of which are defined by the metadata
> +  context.
> +
> +The client's requested *length* is only a hint to the server, so the
> +cumulative extent length contained in a chunk of the server's reply
> +may be shorter or longer the original request.  When more than one
> +metadata context was negotiated, the reply chunks for the different
> +contexts of a single block status request need not have the same
> +number of extents or cumulative extent length.
> +
> +In the request, the client may use the `NBD_CMD_FLAG_REQ_ONE` command
> +flag to further constrain the server's reply so that each chunk
> +contains exactly one extent whose length does not exceed the client's
> +original *length*.
> 
>  A client MUST NOT use `NBD_CMD_BLOCK_STATUS` unless it selected a
>  nonzero number of metadata contexts during negotiation, and used the
> @@ -1778,8 +1788,8 @@ MUST initiate a hard disconnect.
>*length* MUST be 4 + (a positive integer multiple of 8).  This reply
>represents a series of consecutive block descriptors where the sum
>of the length fields within the descriptors is subject to further
> -  constraints documented below. This chunk type MUST appear
> -  exactly once per metadata ID in a structured reply.
> +  constraints documented below.  A successful block status request MUST
> +  have exactly one status chunk per negotiated metadata context ID.
> 
>The payload starts with:
> 
> @@ -1801,15 +1811,19 @@ MUST initiate a hard disconnect.
>*length* of the final extent MAY result in a sum larger than the
>original requested length, if the server has that information anyway
>as a side effect of reporting the status of the requested region.
> +  When multiple metadata contexts are negotiated, the reply chunks for
> +  the different contexts need not have the same number of extents or
> +  cumulative extent length.
> 
>Even if the client did not use the `NBD_CMD_FLAG_REQ_ONE` flag in
>its request, the server MAY return fewer descriptors in the reply
>than would be required to fully specify the whole range of requested
>information to the client, if looking up the information would be
>too resource-intensive for the server, so long as at least one
> -  extent is returned. Servers should however be aware that most
> -  clients implementations will then simply ask for the next extent
> -  instead.
> +  extent is returned.  Servers should however be aware that most
> +  client implementations will likely follow up with a request for
> +  extent information at the first offset not covered by a
> +  reduced-length reply.
> 
>  All error chunk types have bit 15 set, and begin with the same
>  *

Re: [Libguestfs] [PATCH v2 3/3] spec: Clarify BLOCK_STATUS reply details

2022-03-25 Thread Richard W.M. Jones
On Fri, Mar 25, 2022 at 07:41:02AM -0500, Eric Blake wrote:
> Our docs were inconsistent on whether a NBD_REPLY_TYPE_BLOCK_STATUS
> reply chunk can exceed the client's requested length, and silent on
> whether the lengths must be consistent whem multiple contexts were

"when"

> negotiated.  Clarify this to match existing practice as implemented in
> qemu-nbd.  Clean up some nearby grammatical errors while at it.
> ---
>  doc/proto.md | 38 --
>  1 file changed, 24 insertions(+), 14 deletions(-)
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index 8a817d2..8276201 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -882,15 +882,22 @@ The procedure works as follows:
>server supports.
>  - During transmission, a client can then indicate interest in metadata
>for a given region by way of the `NBD_CMD_BLOCK_STATUS` command,
> -  where *offset* and *length* indicate the area of interest. The
> -  server MUST then respond with the requested information, for all
> -  contexts which were selected during negotiation. For every metadata
> -  context, the server sends one set of extent chunks, where the sizes
> -  of the extents MUST be less than or equal to the length as specified
> -  in the request. Each extent comes with a *flags* field, the
> -  semantics of which are defined by the metadata context.
> -- A server MUST reply to `NBD_CMD_BLOCK_STATUS` with a structured
> -  reply of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
> +  where *offset* and *length* indicate the area of interest. On
> +  success, the server MUST respond with one structured reply chunk of
> +  type `NBD_REPLY_TYPE_BLOCK_STATUS` per metadata context selected
> +  during negotiation, where each reply chunk is a list of one or more
> +  extents for that context.
> +
> +The client's requested *length* is only a hint to the server, so the
> +cumulative size of the extents in each chunk of the server's reply may
> +be shorter or longer the original request; and when more than one

^ than the original request

> +metadata context was negotiated, the cumulative length per context may

Why is the word "cumulative" here?

> +differ within a single block status request.  Each extent comes with a
> +*flags* field, the semantics of which are defined by the metadata
> +context.  The client may use the `NBD_CMD_FLAG_REQ_ONE` flag to
> +further constrains the server's reply so that each chunk contains
> +exactly one extent whose length does not exceed the client's original
> +*length*.

In these two adjacent sentences you talk about the flags field for
each extent (which is really the type of the extent), and then the
client request flags, and it's kind of confusing.  Maybe make that a
bit clearer, eg. "In the request, the client may set the
NBD_CMD_FLAG_REQ_ONE flag ...".

>  A client MUST NOT use `NBD_CMD_BLOCK_STATUS` unless it selected a
>  nonzero number of metadata contexts during negotiation, and used the
> @@ -1778,8 +1785,8 @@ MUST initiate a hard disconnect.
>*length* MUST be 4 + (a positive integer multiple of 8).  This reply
>represents a series of consecutive block descriptors where the sum
>of the length fields within the descriptors is subject to further
> -  constraints documented below. This chunk type MUST appear
> -  exactly once per metadata ID in a structured reply.
> +  constraints documented below.  A successful block status request MUST
> +  have exactly one status chunk per negotiated metadata context ID.
> 
>The payload starts with:
> 
> @@ -1801,15 +1808,18 @@ MUST initiate a hard disconnect.
>*length* of the final extent MAY result in a sum larger than the
>original requested length, if the server has that information anyway
>as a side effect of reporting the status of the requested region.
> +  When multiple metadata contexts are negotiated, the cumulative
> +  lengths in each chunk reply need not be identical.

Again, I'm unclear what is "cumulative" about the lengths.

>Even if the client did not use the `NBD_CMD_FLAG_REQ_ONE` flag in
>its request, the server MAY return fewer descriptors in the reply
>than would be required to fully specify the whole range of requested
>information to the client, if looking up the information would be
>too resource-intensive for the server, so long as at least one
> -  extent is returned. Servers should however be aware that most
> -  clients implementations will then simply ask for the next extent
> -  instead.
> +  extent is returned.  Servers should however be aware that most
> +  client implementations will likely follow up with a request for
> +  extent information at the first offset not covered by a
> +  reduced-length reply.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
nbdkit - Flexible, fast NBD server with plugins
https://gitlab.com/nbdkit/nbdkit




Re: [PATCH v2 2/3] docs: Clarify structured reads vs. block size constraints

2022-03-25 Thread Richard W.M. Jones
On Fri, Mar 25, 2022 at 07:41:01AM -0500, Eric Blake wrote:
> The text for structured reads mentioned a mandatory split of certain
> large reads, without also mentioning that large reads are generally
> not possible when block constraints are in play.
> 
> Signed-off-by: Eric Blake 
> ---
>  doc/proto.md | 15 ---
>  1 file changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index bfebb5a..8a817d2 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -1911,13 +1911,14 @@ The following request types exist:
>  chunks that describe data outside the offset and length of the
>  request, but MAY send the content chunks in any order (the client
>  MUST reassemble content chunks into the correct order), and MAY
> -send additional content chunks even after reporting an error chunk.
> -Note that a request for more than 2^32 - 8 bytes MUST be split
> -into at least two chunks, so as not to overflow the length field
> -of a reply while still allowing space for the offset of each
> -chunk.  When no error is detected, the server MUST send enough
> -data chunks to cover the entire region described by the offset and
> -length of the client's request.
> +send additional content chunks even after reporting an error
> +chunk.  Note that a request for more than 2^32 - 8 bytes (if
> +permitted by block size constraints) MUST be split into at least

The actual change is just the "(if ... constraints)" subclause.
The rest is formatting.

> +two chunks, so as not to overflow the length field of a reply
> +while still allowing space for the offset of each chunk.  When no
> +error is detected, the server MUST send enough data chunks to
> +cover the entire region described by the offset and length of the
> +client's request.
> 
>  To minimize traffic, the server MAY use a content or error chunk
>  as the final chunk by setting the `NBD_REPLY_FLAG_DONE` flag, but

ACK

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org




Re: [Libguestfs] [PATCH v2 1/3] docs: Clarify NBD_REPLY_TYPE_ERROR lengths

2022-03-25 Thread Richard W.M. Jones
On Fri, Mar 25, 2022 at 07:41:00AM -0500, Eric Blake wrote:
> Add explicit mention that human-readable error message strings must
> comply with the overall NBD string limits.
> 
> Signed-off-by: Eric Blake 
> ---
>  doc/proto.md | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index 81ac755..bfebb5a 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -1824,7 +1824,9 @@ remaining structured fields at the end.
>be at least 6.  This chunk represents that an error occurred,
>and the client MAY NOT make any assumptions about partial
>success. This type SHOULD NOT be used more than once in a
> -  structured reply.  Valid as a reply to any request.
> +  structured reply.  Valid as a reply to any request.  Note that
> +  *message length* MUST NOT exceed the 4096 bytes string length
> +  limit.

ACK
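
For server implementers, respecting that limit just means capping the
message when building the chunk.  A sketch (using the structured reply
layout from elsewhere in the spec):

  import struct

  NBD_STRUCTURED_REPLY_MAGIC = 0x668e33ef
  NBD_REPLY_TYPE_ERROR = (1 << 15) | 1    # error chunks have bit 15 set

  def error_chunk(handle, error, message, flags=0):
      # Payload: 32-bit error, 16-bit message length, then the message,
      # crudely truncated to the 4096-byte string limit.
      msg = message.encode("utf-8")[:4096]
      payload = struct.pack(">IH", error, len(msg)) + msg
      return struct.pack(">IHHQI", NBD_STRUCTURED_REPLY_MAGIC, flags,
                         NBD_REPLY_TYPE_ERROR, handle, len(payload)) + payload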

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
nbdkit - Flexible, fast NBD server with plugins
https://gitlab.com/nbdkit/nbdkit




Re: [PATCH v3 3/3] nbd/server: Allow MULTI_CONN for shared writable exports

2022-03-16 Thread Richard W.M. Jones
On Wed, Mar 16, 2022 at 04:15:53PM -0500, Eric Blake wrote:
> On Wed, Mar 16, 2022 at 04:07:21PM -0500, Eric Blake wrote:
> > On Tue, Mar 15, 2022 at 01:14:41PM +, Richard W.M. Jones wrote:
> > > The patches seem OK to me, but I don't really know enough about the
> > > internals of qemu-nbd to give a line-by-line review.  I did however
> > > build and test qemu-nbd with the patches:
> > > 
> > >   $ ./build/qemu-nbd /var/tmp/test.qcow2 
> > >   $ nbdinfo nbd://localhost
> > >   ...
> > >   can_multi_conn: false
> > > 
> > > 
> > >   $ ./build/qemu-nbd -e 2 /var/tmp/test.qcow2 
> > >   $ nbdinfo nbd://localhost
> > >   ...
> > >   can_multi_conn: false
> > > 
> > > ^^^ Is this expected?  It also happens with -e 0.
> > 
> > Yes, because qemu-nbd defaults to read-write connections, but to be
> > conservative, this patch defaults '-m auto' to NOT advertise
> > multi-conn for read-write; you need to be explicit:
> > 
> > > 
> > > 
> > >   $ ./build/qemu-nbd -e 2 -m on /var/tmp/test.qcow2 
> > >   $ nbdinfo nbd://localhost
> > >   ...
> > >   can_multi_conn: true
> > 
> > either with '-m on' as you did here, or with
> > 
> > build/qemu-nbd -r -e 2 /var/tmp/test.qcow2
> > 
> > where the '-m auto' default exposes multi-conn for a readonly client.
> 
> I hit send prematurely - one more thought I wanted to make clear.  For
> _this_ series, the behavior of '-m auto' depends solely on readonly
> vs. read-write; that being the most conservative choice of preserving
> pre-existing qemu-nbd behavior (that is, if you don't use -m, the only
> change in behavior should be the fact that --help output now lists
> -m).
> 
> But in future versions of qemu, we might be able to improve '-m auto'
> to also take into consideration whether the backing image of a
> read-write device is known-cache-consistent (such as a local file
> system), where we can then manage to default to advertising multi-conn
> for those writable images, while still avoiding the advertisement when
> exposing other devices such as network-mounted devices where we are
> less confident on whether there are sufficient cache consistency
> guarantees.  Use of explicit '-m off' or '-m on' gives guaranteed
> results no matter the qemu version, so it is only explicit (or
> implied) '-m auto' where we might get smarter defaults over time.
> 
> Thus testing whether advertising multi-conn makes a difference in
> other tools like nbdcopy thus requires checking if qemu-nbd is new
> enough to support -m, and then being explicit that we are opting in to
> using it.

I see.  The commit message of patch 3 confused me because
superficially it seems to say that multi-conn is safe (and therefore I
assumed safe == enabled) when backed by a local filesystem.  Could the
commit message be improved to list the default for common combinations
of flags?
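
(For scripts that need to cope with both old and new qemu-nbd, a crude
feature test is possible; a sketch, assuming the option text shows up
in --help output if this lands as proposed:)

  import subprocess

  def qemu_nbd_has_multi_conn_option():
      # Probe --help rather than parsing the version number.
      out = subprocess.run(["qemu-nbd", "--help"],
                           capture_output=True, text=True).stdout
      return "multi-conn" in out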

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/




Re: [PATCH 07/27] Replace GCC_FMT_ATTR with G_GNUC_PRINTF

2022-03-16 Thread Richard W.M. Jones
On Wed, Mar 16, 2022 at 01:52:48PM +0400, marcandre.lur...@redhat.com wrote:
> From: Marc-André Lureau 
> 
> One less qemu-specific macro. It also helps to make some headers/units
> only depend on glib, and thus moved in standalone projects eventually.
> 
> Signed-off-by: Marc-André Lureau 

I checked the replacements and couldn't spot any differences (I assume
you used a 'perl -pi.bak -e s///' or similar rather than doing it by
hand?).  Also I checked the macro definitions in
include/qemu/compiler.h vs /usr/include/glib-2.0/glib/gmacros.h and
they're pretty much identical.  I even learned about gnu_printf.  So:

Reviewed-by: Richard W.M. Jones 

Shouldn't there be a hunk which removes the definition of GCC_FMT_ATTR
from include/qemu/compiler.h?  Maybe that's in another place in the
patch series.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top




Re: [PATCH v3 3/3] nbd/server: Allow MULTI_CONN for shared writable exports

2022-03-15 Thread Richard W.M. Jones
The patches seem OK to me, but I don't really know enough about the
internals of qemu-nbd to give a line-by-line review.  I did however
build and test qemu-nbd with the patches:

  $ ./build/qemu-nbd /var/tmp/test.qcow2 
  $ nbdinfo nbd://localhost
  ...
can_multi_conn: false


  $ ./build/qemu-nbd -e 2 /var/tmp/test.qcow2 
  $ nbdinfo nbd://localhost
  ...
can_multi_conn: false

^^^ Is this expected?  It also happens with -e 0.


  $ ./build/qemu-nbd -e 2 -m on /var/tmp/test.qcow2 
  $ nbdinfo nbd://localhost
  ...
can_multi_conn: true


  $ ./build/qemu-nbd -e 2 -m off /var/tmp/test.qcow2 
  $ nbdinfo nbd://localhost
  ...
can_multi_conn: false


Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW




Re: [PATCH v2] nbd/server: Allow MULTI_CONN for shared writable exports

2022-02-16 Thread Richard W.M. Jones
On Tue, Feb 15, 2022 at 05:24:14PM -0600, Eric Blake wrote:
> Oh. The QMP command (which is immediately visible through
> nbd-server-add/block-storage-add to qemu and qemu-storage-daemon)
> gains "multi-conn":"on", but you may be right that qemu-nbd would want
> a command line option (either that, or we accellerate our plans that
> qsd should replace qemu-nbd).

I really hope there will always be something called "qemu-nbd"
that acts like qemu-nbd.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/




Re: Block alignment of qcow2 compress driver

2022-01-28 Thread Richard W.M. Jones
On Fri, Jan 28, 2022 at 02:19:44PM +0100, Kevin Wolf wrote:
> Am 28.01.2022 um 13:30 hat Hanna Reitz geschrieben:
> > > > I just changed that line of code [2], as shown in [4].  I suppose
> > > > the better thing to do would be to have an option for the NBD server
> > > > to force-change the announced request alignment, because it can
> > > > expect the qemu block layer code to auto-align requests through
> > > > RMW.  Doing it in the client is wrong, because the NBD server might
> > > > want to detect that the client sends unaligned requests and reject
> > > > them (though ours doesn’t, it just traces such events[5] – note that
> > > > it’s explicitly noted there that qemu will auto-align requests).
> > > I know I said I didn't care about performance (in this case), but is
> > > there in fact a penalty to sending unaligned requests to the qcow2
> > > layer?  Or perhaps it cannot compress them?
> > 
> > In qcow2, only the whole cluster can be compressed, so writing compressed
> > data means having to write the whole cluster.  qcow2 could implement the
> > padding by itself, but we decided to just leave the burden of only writing
> > full clusters (with the COMPRESSED write flag) on the callers.
> > 
> > Things like qemu-img convert and blockdev-backup just adhere to that by
> > design; and the compress driver makes sure to set its request alignment
> > accordingly so that requests to it will always be aligned to the cluster
> > size (either by its user, or by the qemu block layer which performs the
> > padding automatically).
> 
> I thought the more limiting factor would be that after auto-aligning the
> first request by padding with zeros, the second request to the same
> cluster would fail because compression doesn't allow using an already
> allocated cluster:
> 
> /* Compression can't overwrite anything. Fail if the cluster was already
>  * allocated. */
> cluster_offset = get_l2_entry(s, l2_slice, l2_index);
> if (cluster_offset & L2E_OFFSET_MASK) {
> qcow2_cache_put(s->l2_table_cache, (void **) &l2_slice);
> return -EIO;
> }
> 
> Did you always just test a single request or why don't you run into
> this?

I didn't test that one specifically and yes it does fail:

$ qemu-img create -f qcow2 output.qcow2 1M
Formatting 'output.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=1048576 lazy_refcounts=off refcount_bits=16
$ qemu-nbd -t --image-opts driver=compress,file.driver=qcow2,file.file.driver=file,file.file.filename=output.qcow2 &
[1] 2069037

$ nbdsh -u nbd://localhost
nbd> h.set_strict_mode(h.get_strict_mode() & ~nbd.STRICT_ALIGN)
nbd> buf = b'1' * 1024
nbd> h.pwrite(buf, 0)
nbd> h.pwrite(buf, 1024)
Traceback (most recent call last):
  File "/usr/lib64/python3.10/code.py", line 90, in runcode
exec(code, self.locals)
  File "", line 1, in 
  File "/usr/lib64/python3.10/site-packages/nbd.py", line 1631, in pwrite
return libnbdmod.pwrite(self._o, buf, offset, flags)
nbd.Error: nbd_pwrite: write: command failed: Input/output error (EIO)

So what I said in the previous email about minimum vs preferred
is wrong :-(

What's more interesting is that nbdcopy still appeared to work.
Simulating what it was doing gives something like the following,
which also fails when I do it directly:

nbd> h.pwrite(buf, 0)
nbd> h.zero(1024, 1024)
Traceback (most recent call last):
  File "/usr/lib64/python3.10/code.py", line 90, in runcode
exec(code, self.locals)
  File "", line 1, in 
  File "/usr/lib64/python3.10/site-packages/nbd.py", line 1782, in zero
return libnbdmod.zero(self._o, count, offset, flags)
nbd.Error: nbd_zero: write-zeroes: command failed: Input/output error (EIO)

Anyway back to poking at nbdcopy to make it support block sizes ...
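
(The nbdcopy change amounts to clamping request boundaries to the
advertised minimum block size before issuing them; roughly, as a
sketch:)

  def align_down(n, a):
      return n - (n % a)

  def align_up(n, a):
      return n + (-n % a)

  # eg. turn a 1K request at offset 1K into one aligned 64K request,
  # then slice the wanted bytes out of the result client-side.
  offset, length, blksize = 1024, 1024, 65536
  start = align_down(offset, blksize)
  end = align_up(offset + length, blksize)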

> I guess checking L2E_OFFSET_MASK is strictly speaking wrong because it's
> invalid for compressed clusters (qcow2_get_cluster_type() feels more
> appropriate), but in practice, you will always have non-zero data there,
> so it should error out here.
> 
> Kevin

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/




Re: Block alignment of qcow2 compress driver

2022-01-28 Thread Richard W.M. Jones
On Fri, Jan 28, 2022 at 01:30:53PM +, Richard W.M. Jones wrote:
> I feel like this may be a bug in what qemu-nbd advertises.  Currently
> it is:

Ignore this email, see other reply.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org




Re: Block alignment of qcow2 compress driver

2022-01-28 Thread Richard W.M. Jones


On Fri, Jan 28, 2022 at 01:30:43PM +0100, Hanna Reitz wrote:
> On 28.01.22 13:18, Richard W.M. Jones wrote:
> >On Fri, Jan 28, 2022 at 12:57:47PM +0100, Hanna Reitz wrote:
> >>On 28.01.22 12:48, Richard W.M. Jones wrote:
> >>>On Fri, Jan 28, 2022 at 12:39:11PM +0100, Hanna Reitz wrote:
> >>>>So I actually don’t know why it works for you.  OTOH, I don’t
> >>>>understand why the block size affects you over NBD, because I would
> >>>>have expected qemu to internally auto-align requests when they are
> >>>>not aligned (in bdrv_co_pwritev_part()).
> >>>I checked it again and my hack definitely fixes nbdcopy.  But maybe
> >>>that's expected if qemu-nbd is auto-aligning requests?  (I'm only
> >>>accessing the block layer through qemu-nbd, not with qemu-io)
> >>It’s not just qemu-io, with your diff[3] I get the same EINVAL over
> >>NBD, too:
> >>
> >>$ ./qemu-img create -f qcow2 test.qcow2 64M
> >>Formatting 'test.qcow2', fmt=qcow2 cluster_size=65536
> >>extended_l2=off compression_type=zlib size=67108864
> >>lazy_refcounts=off refcount_bits=16
> >>
> >>$ ./qemu-nbd --fork --image-opts \
> >>driver=compress,file.driver=qcow2,file.file.driver=file,file.file.filename=test.qcow2
> >>
> >>$ ./qemu-io -c 'write 0 32k' -f raw nbd://localhost
> >>write failed: Invalid argument
> >Strange - is that error being generated by qemu's nbd client code?
> 
> It’s generated by qcow2, namely the exact place I pointed out (as
> [1]).  I can see that when I put an fprintf there.

I can't reproduce this behaviour (with qemu @ cfe63e46be0a, the head
of git at time of writing).  I wonder if I'm doing something wrong?

  ++ /home/rjones/d/qemu/build/qemu-img create -f qcow2 output.qcow2 64k
  Formatting 'output.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off 
compression_type=zlib size=65536 lazy_refcounts=off refcount_bits=16
  ++ sleep 1
  ++ /home/rjones/d/qemu/build/qemu-nbd -t --image-opts 
driver=compress,file.driver=qcow2,file.file.driver=file,file.file.filename=output.qcow2
  ++ /home/rjones/d/qemu/build/qemu-io -c 'write 0 32k' -f raw nbd://localhost
  wrote 32768/32768 bytes at offset 0
  32 KiB, 1 ops; 00.02 sec (1.547 MiB/sec and 49.5067 ops/sec)

> >I know I said I didn't care about performance (in this case), but is
> >there in fact a penalty to sending unaligned requests to the qcow2
> >layer?  Or perhaps it cannot compress them?
> 
> In qcow2, only the whole cluster can be compressed, so writing
> compressed data means having to write the whole cluster.  qcow2
> could implement the padding by itself, but we decided to just leave
> the burden of only writing full clusters (with the COMPRESSED write
> flag) on the callers.
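
So over NBD a client that wants everything compressed has to produce
full 64K clusters itself.  Roughly this (a sketch; write_cluster is a
hypothetical helper, 65536 assumes the default cluster size, and
zero-padding the tail is only safe when writing a fresh image
end-to-end):

  CLUSTER = 65536

  def write_cluster(h, data, offset):
      # the compress filter only accepts whole clusters, so pad the
      # tail of the buffer with zeroes up to the cluster boundary
      assert offset % CLUSTER == 0
      pad = (-len(data)) % CLUSTER
      h.pwrite(data + b"\0" * pad, offset)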

I feel like this may be a bug in what qemu-nbd advertises.  Currently
it is:

$ qemu-nbd -t --image-opts 
driver=compress,file.driver=qcow2,file.file.driver=file,file.file.filename=output.qcow2
 &
[2] 2068900
$ nbdinfo nbd://localhost
protocol: newstyle-fixed without TLS
export="":
export-size: 65536 (64K)
uri: nbd://localhost:10809/
contexts:
base:allocation
is_rotational: false
is_read_only: false
can_cache: true
can_df: true
can_fast_zero: true
can_flush: true
can_fua: true
can_multi_conn: false
can_trim: true
can_zero: true
block_size_minimum: 65536<---
block_size_preferred: 65536
block_size_maximum: 33554432

block_size_preferred is (rightly) set to 64K, as that's what the
compress + qcow2 combination prefers.

But block_size_minimum sounds as if it should be 512 or 1, if qemu-nbd
is able to reassemble smaller than preferred requests, even if they
are suboptimal.
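
For reference, clients can read all three advertised constraints with
libnbd (a sketch; the printed values are the ones nbdinfo shows above):

  import nbd
  h = nbd.NBD()
  h.connect_uri("nbd://localhost")
  for c in (nbd.SIZE_MINIMUM, nbd.SIZE_PREFERRED, nbd.SIZE_MAXIMUM):
      print(h.get_block_size(c))      # 65536, 65536, 33554432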

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v




Re: Block alignment of qcow2 compress driver

2022-01-28 Thread Richard W.M. Jones
On Fri, Jan 28, 2022 at 12:57:47PM +0100, Hanna Reitz wrote:
> On 28.01.22 12:48, Richard W.M. Jones wrote:
> >On Fri, Jan 28, 2022 at 12:39:11PM +0100, Hanna Reitz wrote:
> >>So I actually don’t know why it works for you.  OTOH, I don’t
> >>understand why the block size affects you over NBD, because I would
> >>have expected qemu to internally auto-align requests when they are
> >>not aligned (in bdrv_co_pwritev_part()).
> >I checked it again and my hack definitely fixes nbdcopy.  But maybe
> >that's expected if qemu-nbd is auto-aligning requests?  (I'm only
> >accessing the block layer through qemu-nbd, not with qemu-io)
> 
> It’s not just qemu-io, with your diff[3] I get the same EINVAL over
> NBD, too:
> 
> $ ./qemu-img create -f qcow2 test.qcow2 64M
> Formatting 'test.qcow2', fmt=qcow2 cluster_size=65536
> extended_l2=off compression_type=zlib size=67108864
> lazy_refcounts=off refcount_bits=16
> 
> $ ./qemu-nbd --fork --image-opts \
> driver=compress,file.driver=qcow2,file.file.driver=file,file.file.filename=test.qcow2
> 
> $ ./qemu-io -c 'write 0 32k' -f raw nbd://localhost
> write failed: Invalid argument

Strange - is that error being generated by qemu's nbd client code?

Here's my test not involving qemu's client code:

$ qemu-nbd --version
qemu-nbd 6.2.0 (qemu-6.2.0-2.fc36)

$ qemu-img create -f qcow2 output.qcow2 1M
Formatting 'output.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off 
compression_type=zlib size=1048576 lazy_refcounts=off refcount_bits=16

$ qemu-nbd --fork --image-opts 
driver=compress,file.driver=qcow2,file.file.driver=file,file.file.filename=output.qcow2

$ nbdsh -u nbd://localhost
nbd> h.get_strict_mode()
31
nbd> h.set_strict_mode(31 & ~nbd.STRICT_ALIGN)
nbd> h.get_strict_mode()
15
nbd> h.pwrite(b'1'*1024, 0)
nbd> exit

So an unaligned 1K write works (after disabling libnbd's client-side
alignment checks).

> I just changed that line of code [2], as shown in [4].  I suppose
> the better thing to do would be to have an option for the NBD server
> to force-change the announced request alignment, because it can
> expect the qemu block layer code to auto-align requests through
> RMW.  Doing it in the client is wrong, because the NBD server might
> want to detect that the client sends unaligned requests and reject
> them (though ours doesn’t, it just traces such events[5] – note that
> it’s explicitly noted there that qemu will auto-align requests).

I know I said I didn't care about performance (in this case), but is
there in fact a penalty to sending unaligned requests to the qcow2
layer?  Or perhaps it cannot compress them?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top




Re: Block alignment of qcow2 compress driver

2022-01-28 Thread Richard W.M. Jones


I hacked nbdcopy to ignore block alignment (the error actually comes
from libnbd refusing to send the unaligned request, not from
qemu-nbd), and indeed qemu-nbd accepts the unaligned request without
complaint.

Eric - maybe having some flag for nbdcopy to ignore unaligned requests
when we know the server doesn't care (qemu-nbd) would work?

Rich.

--- a/copy/nbd-ops.c
+++ b/copy/nbd-ops.c
@@ -59,6 +59,10 @@ open_one_nbd_handle (struct rw_nbd *rwn)
 exit (EXIT_FAILURE);
   }
 
+  uint32_t sm = nbd_get_strict_mode (nbd);
+  sm &= ~LIBNBD_STRICT_ALIGN;
+  nbd_set_strict_mode (nbd, sm);
+
   nbd_set_debug (nbd, verbose);
 
   if (extents && rwn->d == READING &&


-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW




Re: Block alignment of qcow2 compress driver

2022-01-28 Thread Richard W.M. Jones
On Fri, Jan 28, 2022 at 12:39:11PM +0100, Hanna Reitz wrote:
> So I actually don’t know why it works for you.  OTOH, I don’t
> understand why the block size affects you over NBD, because I would
> have expected qemu to internally auto-align requests when they are
> not aligned (in bdrv_co_pwritev_part()).

I checked it again and my hack definitely fixes nbdcopy.  But maybe
that's expected if qemu-nbd is auto-aligning requests?  (I'm only
accessing the block layer through qemu-nbd, not with qemu-io)

> Like, when I set the NBD
> block driver’s alignment to 512[2], the following still succeeds:

Did you just patch that line in the code or is there a qemu-nbd
option/image-opts to do this?

Rich.

> [1] https://gitlab.com/qemu-project/qemu/-/blob/master/block/qcow2.c#L4662
> [2] https://gitlab.com/qemu-project/qemu/-/blob/master/block/nbd.c#L1918

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW




Block alignment of qcow2 compress driver

2022-01-28 Thread Richard W.M. Jones
The commands below set up a sparse RAM disk, with an allocated block
at offset 32K and another one at offset 1M-32K.  Then it tries to copy
this to a compressed qcow2 file using qemu-nbd + the qemu compress
filter:

  $ qemu-img create -f qcow2 output.qcow2 1M
  $ qemu-nbd -t --image-opts 
driver=compress,file.driver=qcow2,file.file.driver=file,file.file.filename=output.qcow2
 & sleep 1
  $ nbdkit -U - \
   data '@32768 1*32768 @1015808 1*32768' \
   --run 'nbdcopy $uri nbd://localhost -p'

The nbdcopy command fails when zeroing the first 32K with:

  nbd://localhost: nbd_aio_zero: request is unaligned: Invalid argument

This is a bug in nbdcopy because it ignores the minimum block size
being correctly declared by the compress filter:

  $ nbdinfo nbd://localhost
  protocol: newstyle-fixed without TLS
  export="":
export-size: 1048576 (1M)
uri: nbd://localhost:10809/
contexts:
  ...
block_size_minimum: 65536  <---
block_size_preferred: 65536
block_size_maximum: 33554432

The compress filter sets the minimum block size to the same as the
qcow2 cluster size here:

  
https://gitlab.com/qemu-project/qemu/-/blob/cfe63e46be0a1f8a7fd2fd5547222f8344a43279/block/filter-compress.c#L117

I patched qemu to force this to 4K:

-bs->bl.request_alignment = bdi.cluster_size;
+//bs->bl.request_alignment = bdi.cluster_size;
+bs->bl.request_alignment = 4096;

and the copy above works, and the output file is compressed!

So my question is, does the compress filter in qemu really need to
declare the large minimum block size?  I'm not especially concerned
about efficiency, I'd prefer it just worked, and changing nbdcopy to
understand block sizes is painful.

Is it already adjustable at run time?  (I tried using --image-opts
like compress.request_alignment=4096 but it seems like the filter
doesn't support anything I could think of, and I don't know how to
list the supported options.)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html




Re: [PATCH v2 0/6] aio-posix: split poll check from ready handler

2021-12-02 Thread Richard W.M. Jones


Not sure if this is related, but builds are failing with:

FAILED: libblockdev.fa.p/block_export_fuse.c.o 
cc -m64 -mcx16 -Ilibblockdev.fa.p -I. -I.. -Iqapi -Itrace -Iui -Iui/shader 
-I/usr/include/fuse3 -I/usr/include/p11-kit-1 -I/usr/include/glib-2.0 
-I/usr/lib64/glib-2.0/include -I/usr/include/sysprof-4 -fdiagnostics-color=auto 
-Wall -Winvalid-pch -std=gnu11 -O2 -g -isystem 
/home/rjones/d/qemu/linux-headers -isystem linux-headers -iquote . -iquote 
/home/rjones/d/qemu -iquote /home/rjones/d/qemu/include -iquote 
/home/rjones/d/qemu/disas/libvixl -iquote /home/rjones/d/qemu/tcg/i386 -pthread 
-DSTAP_SDT_V2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE 
-D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes 
-Wredundant-decls -Wundef -Wwrite-strings -Wmissing-prototypes 
-fno-strict-aliasing -fno-common -fwrapv -Wold-style-declaration 
-Wold-style-definition -Wtype-limits -Wformat-security -Wformat-y2k -Winit-self 
-Wignored-qualifiers -Wempty-body -Wnested-externs -Wendif-labels 
-Wexpansion-to-defined -Wimplicit-fallthrough=2 -Wno-missing-include-dirs 
-Wno-shift-negative-value -Wno-psabi -fstack-protector-strong -fPIE -MD -MQ 
libblockdev.fa.p/block_export_fuse.c.o -MF 
libblockdev.fa.p/block_export_fuse.c.o.d -o 
libblockdev.fa.p/block_export_fuse.c.o -c ../block/export/fuse.c
../block/export/fuse.c: In function ‘setup_fuse_export’:
../block/export/fuse.c:226:59: warning: passing argument 7 of 
‘aio_set_fd_handler’ from incompatible pointer type 
[-Wincompatible-pointer-types]
  226 |read_from_fuse_export, NULL, NULL, exp);
  |   ^~~
  |   |
  |   FuseExport *
In file included from ../block/export/fuse.c:22:
/home/rjones/d/qemu/include/block/aio.h:472:36: note: expected ‘void (*)(void 
*)’ but argument is of type ‘FuseExport *’
  472 | IOHandler *io_poll_ready,
  | ~~~^
../block/export/fuse.c:224:5: error: too few arguments to function 
‘aio_set_fd_handler’
  224 | aio_set_fd_handler(exp->common.ctx,
  | ^~
In file included from ../block/export/fuse.c:22:
/home/rjones/d/qemu/include/block/aio.h:466:6: note: declared here
  466 | void aio_set_fd_handler(AioContext *ctx,
  |  ^~
../block/export/fuse.c: In function ‘fuse_export_shutdown’:
../block/export/fuse.c:268:13: error: too few arguments to function 
‘aio_set_fd_handler’
  268 | aio_set_fd_handler(exp->common.ctx,
  | ^~
In file included from ../block/export/fuse.c:22:
/home/rjones/d/qemu/include/block/aio.h:466:6: note: declared here
  466 | void aio_set_fd_handler(AioContext *ctx,
  |  ^~

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top




Re: [PATCH v3] nbd/server: Add --selinux-label option

2021-09-30 Thread Richard W.M. Jones
On Thu, Sep 30, 2021 at 02:00:11PM -0300, Willian Rampazzo wrote:
> On Thu, Sep 30, 2021 at 5:55 AM Vladimir Sementsov-Ogievskiy
>  wrote:
> >
> > 9/30/21 11:47, Richard W.M. Jones wrote:
> > > Under SELinux, Unix domain sockets have two labels.  One is on the
> > > disk and can be set with commands such as chcon(1).  There is a
> > > different label stored in memory (called the process label).  This can
> > > only be set by the process creating the socket.  When using SELinux +
> > > SVirt and wanting qemu to be able to connect to a qemu-nbd instance,
> > > you must set both labels correctly first.
> > >
> > > For qemu-nbd the options to set the second label are awkward.  You can
> > > create the socket in a wrapper program and then exec into qemu-nbd.
> > > Or you could try something with LD_PRELOAD.
> > >
> > > This commit adds the ability to set the label straightforwardly on the
> > > command line, via the new --selinux-label flag.  (The name of the flag
> > > is the same as the equivalent nbdkit option.)
> > >
> > > A worked example showing how to use the new option can be found in
> > > this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1984938
> > >
> > > Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1984938
> > > Signed-off-by: Richard W.M. Jones 
> > > Reviewed-by: Daniel P. Berrangé 
> > > Signed-off-by: Eric Blake 
> >
> > this should be Reviewed-by?
> 
> Maybe, because of this:
> https://lists.gnu.org/archive/html/qemu-devel/2021-09/msg07081.html
> 
> I got confused with this v3.

Yes, I'd somehow lost the original patch and picked it up from Eric's
queue to make v3.

Having said that I'm not sure what the objection above means.  Do you
mean Eric's tag should be Reviewed-by instead of S-o-b?  (And why?)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/




[PATCH v3] nbd/server: Add --selinux-label option

2021-09-30 Thread Richard W.M. Jones
Under SELinux, Unix domain sockets have two labels.  One is on the
disk and can be set with commands such as chcon(1).  There is a
different label stored in memory (called the process label).  This can
only be set by the process creating the socket.  When using SELinux +
SVirt and wanting qemu to be able to connect to a qemu-nbd instance,
you must set both labels correctly first.

For qemu-nbd the options to set the second label are awkward.  You can
create the socket in a wrapper program and then exec into qemu-nbd.
Or you could try something with LD_PRELOAD.

This commit adds the ability to set the label straightforwardly on the
command line, via the new --selinux-label flag.  (The name of the flag
is the same as the equivalent nbdkit option.)

A worked example showing how to use the new option can be found in
this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1984938

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1984938
Signed-off-by: Richard W.M. Jones 
Reviewed-by: Daniel P. Berrangé 
Signed-off-by: Eric Blake 
---
 configure |  8 +++-
 meson.build   | 10 -
 meson_options.txt |  3 ++
 qemu-nbd.c| 39 +++
 tests/docker/dockerfiles/centos8.docker   |  1 +
 .../dockerfiles/fedora-i386-cross.docker  |  1 +
 tests/docker/dockerfiles/fedora.docker|  1 +
 tests/docker/dockerfiles/opensuse-leap.docker |  1 +
 tests/docker/dockerfiles/ubuntu1804.docker|  1 +
 tests/docker/dockerfiles/ubuntu2004.docker|  1 +
 10 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/configure b/configure
index 1043ccce4f..b3211a66ee 100755
--- a/configure
+++ b/configure
@@ -445,6 +445,7 @@ fuse="auto"
 fuse_lseek="auto"
 multiprocess="auto"
 slirp_smbd="$default_feature"
+selinux="auto"
 
 malloc_trim="auto"
 gio="$default_feature"
@@ -1576,6 +1577,10 @@ for opt do
   ;;
   --disable-slirp-smbd) slirp_smbd=no
   ;;
+  --enable-selinux) selinux="enabled"
+  ;;
+  --disable-selinux) selinux="disabled"
+  ;;
   *)
   echo "ERROR: unknown option $opt"
   echo "Try '$0 --help' for more information"
@@ -1963,6 +1968,7 @@ disabled with --disable-FEATURE, default is enabled if 
available
   multiprocessOut of process device emulation support
   gio libgio support
   slirp-smbd  use smbd (at path --smbd=*) in slirp networking
+  selinux SELinux support in qemu-nbd
 
 NOTE: The object files are built at the place where configure is launched
 EOF
@@ -5207,7 +5213,7 @@ if test "$skip_meson" = no; then
 -Dattr=$attr -Ddefault_devices=$default_devices 
-Dvirglrenderer=$virglrenderer \
 -Ddocs=$docs -Dsphinx_build=$sphinx_build -Dinstall_blobs=$blobs \
 -Dvhost_user_blk_server=$vhost_user_blk_server 
-Dmultiprocess=$multiprocess \
--Dfuse=$fuse -Dfuse_lseek=$fuse_lseek 
-Dguest_agent_msi=$guest_agent_msi -Dbpf=$bpf\
+-Dselinux=$selinux \
 $(if test "$default_feature" = no; then echo 
"-Dauto_features=disabled"; fi) \
-Dtcg_interpreter=$tcg_interpreter \
 $cross_arg \
diff --git a/meson.build b/meson.build
index 15ef4d3c41..0ded2ac5eb 100644
--- a/meson.build
+++ b/meson.build
@@ -1072,6 +1072,11 @@ keyutils = dependency('libkeyutils', required: false,
 
 has_gettid = cc.has_function('gettid')
 
+# libselinux
+selinux = dependency('libselinux',
+ required: get_option('selinux'),
+ method: 'pkg-config', kwargs: static_kwargs)
+
 # Malloc tests
 
 malloc = []
@@ -1300,6 +1305,7 @@ config_host_data.set('CONFIG_FUSE', fuse.found())
 config_host_data.set('CONFIG_FUSE_LSEEK', fuse_lseek.found())
 config_host_data.set('CONFIG_X11', x11.found())
 config_host_data.set('CONFIG_CFI', get_option('cfi'))
+config_host_data.set('CONFIG_SELINUX', selinux.found())
 config_host_data.set('QEMU_VERSION', '"@0@"'.format(meson.project_version()))
 config_host_data.set('QEMU_VERSION_MAJOR', 
meson.project_version().split('.')[0])
 config_host_data.set('QEMU_VERSION_MINOR', 
meson.project_version().split('.')[1])
@@ -2759,7 +2765,8 @@ if have_tools
   qemu_io = executable('qemu-io', files('qemu-io.c'),
  dependencies: [block, qemuutil], install: true)
   qemu_nbd = executable('qemu-nbd', files('qemu-nbd.c'),
-   dependencies: [blockdev, qemuutil, gnutls], install: true)
+   dependencies: [blockdev, qemuutil, gnutls, selinux],
+   install: true)
 
   subdir('storage-daemon')
   subdir('contrib/rdmacm

Re: [PATCH v2] nbd/server: Add --selinux-label option

2021-09-27 Thread Richard W.M. Jones
On Mon, Sep 27, 2021 at 04:18:34PM -0500, Eric Blake wrote:
> On Fri, Jul 23, 2021 at 11:33:03AM +0100, Richard W.M. Jones wrote:
> > Under SELinux, Unix domain sockets have two labels.  One is on the
> > disk and can be set with commands such as chcon(1).  There is a
> > different label stored in memory (called the process label).  This can
> > only be set by the process creating the socket.  When using SELinux +
> > SVirt and wanting qemu to be able to connect to a qemu-nbd instance,
> > you must set both labels correctly first.
> > 
> > For qemu-nbd the options to set the second label are awkward.  You can
> > create the socket in a wrapper program and then exec into qemu-nbd.
> > Or you could try something with LD_PRELOAD.
> > 
> > This commit adds the ability to set the label straightforwardly on the
> > command line, via the new --selinux-label flag.  (The name of the flag
> > is the same as the equivalent nbdkit option.)
> > 
> > A worked example showing how to use the new option can be found in
> > this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1984938
> > 
> > Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1984938
> > Signed-off-by: Richard W.M. Jones 
> > ---
> 
> I'm making one tweak to your patch before sending the pull request:
> 
> > +++ b/qemu-nbd.c
> > @@ -64,6 +68,7 @@
> >  #define QEMU_NBD_OPT_FORK  263
> >  #define QEMU_NBD_OPT_TLSAUTHZ  264
> >  #define QEMU_NBD_OPT_PID_FILE  265
> > +#define QEMU_NBD_OPT_SELINUX_LABEL 266
> >  
> >  #define MBR_SIZE 512
> >  
> > @@ -116,6 +121,9 @@ static void usage(const char *name)
> >  "  --forkfork off the server process and exit the 
> > parent\n"
> >  "once the server is running\n"
> >  "  --pid-file=PATH   store the server's process ID in the given 
> > file\n"
> > +#ifdef CONFIG_SELINUX
> > +"  --selinux-label=LABEL set SELinux process label on listening 
> > socket\n"
> > +#endif
> 
> The new option is only conditionally advertised under --help (qemu-nbd
> lacks a stable machine-parseable output, so scraping --help output
> will have to do for now)...
> 
> >  #if HAVE_NBD_DEVICE
> >  "\n"
> >  "Kernel NBD client support:\n"
> > @@ -532,6 +540,8 @@ int main(int argc, char **argv)
> >  { "trace", required_argument, NULL, 'T' },
> >  { "fork", no_argument, NULL, QEMU_NBD_OPT_FORK },
> >  { "pid-file", required_argument, NULL, QEMU_NBD_OPT_PID_FILE },
> > +{ "selinux-label", required_argument, NULL,
> > +  QEMU_NBD_OPT_SELINUX_LABEL },
> 
> ...but is unconditionally supported as a long option even when support
> was not compiled in...
> 
> >  { NULL, 0, NULL, 0 }
> >  };
> >  int ch;
> > @@ -558,6 +568,7 @@ int main(int argc, char **argv)
> >  int old_stderr = -1;
> >  unsigned socket_activation;
> >  const char *pid_file_name = NULL;
> > +const char *selinux_label = NULL;
> >  BlockExportOptions *export_opts;
> >  
> >  #ifdef CONFIG_POSIX
> > @@ -747,6 +758,9 @@ int main(int argc, char **argv)
> >  case QEMU_NBD_OPT_PID_FILE:
> >  pid_file_name = optarg;
> >  break;
> > +case QEMU_NBD_OPT_SELINUX_LABEL:
> > +selinux_label = optarg;
> > +break;
> >  }
> >  }
> >  
> > @@ -938,6 +952,16 @@ int main(int argc, char **argv)
> >  } else {
> >  backlog = MIN(shared, SOMAXCONN);
> >  }
> > +if (sockpath && selinux_label) {
> > +#ifdef CONFIG_SELINUX
> > +if (setsockcreatecon_raw(selinux_label) == -1) {
> > +error_report("Cannot set SELinux socket create context "
> > + "to %s: %s",
> > + selinux_label, strerror(errno));
> > +exit(EXIT_FAILURE);
> > +}
> > +#endif
> 
> ...but here we silently ignore it if support is not compiled in.
> Better is to issue an error message about using an unsupported option,
> so I'll squash this in:
> 
> diff --git i/qemu-nbd.c w/qemu-nbd.c
> index 5dc82c419255..94f8ec07c064 100644
> --- i/qemu-nbd.c
> +++ w/qemu-nbd.c
> @@ -962,6 +962,9 @@ int main(int argc, char **argv)
>   selinux_label, strerror(err

armv7 guest on aarch64 host giving I/O errors

2021-09-15 Thread Richard W.M. Jones
https://bugzilla.redhat.com/show_bug.cgi?id=2004120

qemu-6.1.0-7.fc36.aarch64
Host kernel: 5.14.0-60.fc35.aarch64
Guest kernel: 5.14.1-300.fc35.armv7hl+lpae

Under I/O load, I get intermittent write errors inside the Fedora
armv7 guest running on aarch64 host.

[ 2063.691215] print_req_error: 6 callbacks suppressed
[ 2063.691249] blk_update_request: I/O error, dev vda, sector 8990744 op 
0x1:(WRITE) flags 0x10 phys_seg 60 prio class 0
[ 2063.713844] blk_update_request: I/O error, dev vda, sector 8992744 op 
0x1:(WRITE) flags 0x104000 phys_seg 254 prio class 0
[ 2063.722337] blk_update_request: I/O error, dev vda, sector 8994992 op 
0x1:(WRITE) flags 0x10 phys_seg 2 prio class 0
[ 2063.730617] blk_update_request: I/O error, dev vda, sector 8995008 op 
0x1:(WRITE) flags 0x104000 phys_seg 191 prio class 0
[ 2063.745903] blk_update_request: I/O error, dev vda, sector 8997568 op 
0x1:(WRITE) flags 0x10 phys_seg 65 prio class 0
[ 2063.760612] blk_update_request: I/O error, dev vda, sector 8998552 op 
0x1:(WRITE) flags 0x104000 phys_seg 158 prio class 0
[ 2063.774761] blk_update_request: I/O error, dev vda, sector 9001112 op 
0x1:(WRITE) flags 0x10 phys_seg 98 prio class 0
[ 2063.785913] blk_update_request: I/O error, dev vda, sector 9002712 op 
0x1:(WRITE) flags 0x104000 phys_seg 234 prio class 0
[ 2063.798290] blk_update_request: I/O error, dev vda, sector 9005272 op 
0x1:(WRITE) flags 0x10 phys_seg 120 prio class 0

On the host side, no errors are seen, so it doesn't look like
this is an actual I/O error.

Is there anything I can do to debug this further?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/




qemu 6.1 iotests hanging intermittently

2021-09-15 Thread Richard W.M. Jones
[Copying an email previously sent here:
https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/thread/U6C6TTITVJVQFXUMUCPYB5PKXT2NMENI/#ZRVNBUMQ26GF3WSGDEVU2YAFIJFJS3QS]

It seems like the qemu test suite now hangs intermittently after
running iotests.  Here are some examples:

(on x86-64)
...
  TEST   iotest-qcow2: 313
  TEST   iotest-qcow2: nbd-qemu-allocation
  TEST   iotest-qcow2: qsd-jobs
Not run: 181
Passed all 122 iotests

(on i686)
...
  TEST   iotest-qcow2: 292
  TEST   iotest-qcow2: 299
  TEST   iotest-qcow2: 313
  TEST   iotest-qcow2: nbd-qemu-allocation
  TEST   iotest-qcow2: qsd-jobs
Not run: 172 181 186 192
Passed all 119 iotests

(on i686)
...
  TEST   iotest-qcow2: 292
  TEST   iotest-qcow2: 299
  TEST   iotest-qcow2: 313
  TEST   iotest-qcow2: nbd-qemu-allocation
  TEST   iotest-qcow2: qsd-jobs
Not run: 172 181 186 192
Passed all 119 iotests

The code that invokes the tests is:
https://src.fedoraproject.org/rpms/qemu/blob/rawhide/f/qemu.spec#_1725

I'm not sure exactly why.  It seems like the next line of output
should be "popd" which is a trivial bash command and so shouldn't take
too much time.

This seems to have started to happen round about the time qemu 6.1.0
was released.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW




Re: [PATCH v2] nbd/server: Suppress Broken pipe errors on abrupt disconnection

2021-09-15 Thread Richard W.M. Jones
On Wed, Sep 15, 2021 at 10:15:20AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 14.09.2021 19:32, Richard W.M. Jones wrote:
> >On Tue, Sep 14, 2021 at 06:21:58PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> >>14.09.2021 17:52, Richard W.M. Jones wrote:
> >>>On the
> >>>server side when the server receives NBD_CMD_DISC it must complete any
> >>>in-flight requests, but there's no requirement for the server to
> >>>commit anything to disk.  IOW you can still lose data even though you
> >>>took the time to disconnect.
> >>>
> >>>So I don't think there's any reason for libnbd to always gracefully
> >>
> >>Hmm. Me go to NBD spec :)
> >>
> >>I think, there is a reason:
> >>
> >>"The client and the server MUST NOT initiate any form of disconnect other 
> >>than in one of the above circumstances."
> >>
> >>And the only possibility for client to initiate a hard disconnect listed 
> >>above is "if it detects a violation by the other party of a mandatory 
> >>condition within this document".
> >>
> >>So at least, nbdsh violates NBD protocol. May be spec should be updated to 
> >>satisfy your needs.
> >
> >I would say the spec is at best contradictory, but if you read other
> >parts of the spec, then it's pretty clear that we're allowed to drop
> >the connection whenever we like.  This section says as much:
> >
> >https://github.com/NetworkBlockDevice/nbd/blob/5805b25ad3da96e7c0b3160cda51ea19eb518d5b/doc/proto.md#terminating-the-transmission-phase
> 
> Hmm, that was exactly the section I read and quoted :)
> 
> >
> >   There are two methods of terminating the transmission phase:
> >   ...
> >   "The client or the server drops the TCP session (in which case it
> >   SHOULD shut down the TLS session first). This is referred to as
> >   'initiating a hard disconnect'."
> 
> Right. Here the method is defined, no word that client can do it at any time.

I don't read this as a definition.

But we should probably clarify the spec to note that dropping the
connection is fine, because it is - no one has made any argument so
far that there's an actual reason to waste time on NBD_CMD_DISC.

Rich.

> Next, in same section:
> 
>Either side MAY initiate a hard disconnect if it detects a violation by 
> the other party of a mandatory condition within this document.
> 
> Next
>The client MAY issue a soft disconnect at any time
> 
> And finally:
> 
>The client and the server MUST NOT initiate any form of disconnect other 
> than in one of the above circumstances.
> 
> 
> Or do you mean some other spec section I missed?
> 
> >
> >Anyway we're dropping the TCP connection because sometimes we are just
> >interrogating an NBD server eg to find out what it supports, and doing
> >a graceful shutdown is a waste of time and internet.
> >
> >>>shut down (especially in this case where there are no in-flight
> >>>requests), and anyway it would break ABI to make that change and slow
> >>>down the client in cases when there's nothing to clean up.
> >>
> >>Which ABI will it break?
> >
> >Our contract with callers using nbd_close(3), if nbd_shutdown(3) is
> >not called beforehand.
> >https://libguestfs.org/nbd_shutdown.3.html
> >https://libguestfs.org/nbd_create.3.html (really nbd_close ...)
> >
> >Rich.
> >
> 
> 
> -- 
> Best regards,
> Vladimir

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/




Re: [PATCH v2] nbd/server: Suppress Broken pipe errors on abrupt disconnection

2021-09-14 Thread Richard W.M. Jones
On Tue, Sep 14, 2021 at 06:21:58PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 14.09.2021 17:52, Richard W.M. Jones wrote:
> > On the
> >server side when the server receives NBD_CMD_DISC it must complete any
> >in-flight requests, but there's no requirement for the server to
> >commit anything to disk.  IOW you can still lose data even though you
> >took the time to disconnect.
> >
> >So I don't think there's any reason for libnbd to always gracefully
> 
> Hmm. Me go to NBD spec :)
> 
> I think, there is a reason:
> 
> "The client and the server MUST NOT initiate any form of disconnect other 
> than in one of the above circumstances."
> 
> And the only possibility for client to initiate a hard disconnect listed 
> above is "if it detects a violation by the other party of a mandatory 
> condition within this document".
> 
> So at least, nbdsh violates NBD protocol. May be spec should be updated to 
> satisfy your needs.

I would say the spec is at best contradictory, but if you read other
parts of the spec, then it's pretty clear that we're allowed to drop
the connection whenever we like.  This section says as much:

https://github.com/NetworkBlockDevice/nbd/blob/5805b25ad3da96e7c0b3160cda51ea19eb518d5b/doc/proto.md#terminating-the-transmission-phase

  There are two methods of terminating the transmission phase:
  ...
  "The client or the server drops the TCP session (in which case it
  SHOULD shut down the TLS session first). This is referred to as
  'initiating a hard disconnect'."

Anyway we're dropping the TCP connection because sometimes we are just
interrogating an NBD server eg to find out what it supports, and doing
a graceful shutdown is a waste of time and internet.

> >shut down (especially in this case where there are no in-flight
> >requests), and anyway it would break ABI to make that change and slow
> >down the client in cases when there's nothing to clean up.
> 
> Which ABI will it break?

Our contract with callers using nbd_close(3), if nbd_shutdown(3) is
not called beforehand.
https://libguestfs.org/nbd_shutdown.3.html
https://libguestfs.org/nbd_create.3.html (really nbd_close ...)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v




Re: [PATCH v2] nbd/server: Suppress Broken pipe errors on abrupt disconnection

2021-09-14 Thread Richard W.M. Jones
On Tue, Sep 14, 2021 at 05:40:59PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 13.09.2021 18:19, Richard W.M. Jones wrote:
> >$ rm -f /tmp/sock /tmp/pid
> >$ qemu-img create -f qcow2 /tmp/disk.qcow2 1M
> >$ qemu-nbd -t --format=qcow2 --socket=/tmp/sock --pid-file=/tmp/pid 
> >/tmp/disk.qcow2 &
> >$ nbdsh -u 'nbd+unix:///?socket=/tmp/sock' -c 'h.get_size()'
> >qemu-nbd: Disconnect client, due to: Failed to send reply: Unable to write 
> >to socket: Broken pipe
> >$ killall qemu-nbd
> >
> >nbdsh is abruptly dropping the NBD connection here which is a valid
> >way to close the connection.  It seems unnecessary to print an error
> >in this case so this commit suppresses it.
> >
> >Note that if you call the nbdsh h.shutdown() method then the message
> >was not printed:
> >
> >$ nbdsh -u 'nbd+unix:///?socket=/tmp/sock' -c 'h.get_size()' -c 
> >'h.shutdown()'
>
> My personal opinion, is that this warning doesn't hurt in general. I
> think in production tools should gracefully shutdown any connection,
> and abrupt shutdown is a sign of something wrong - i.e., worth
> warning.
>
> Shouldn't nbdsh do graceful shutdown by default?

On the client side the only difference is that nbd_shutdown sends
NBD_CMD_DISC to the server (versus simply closing the socket).  On the
server side when the server receives NBD_CMD_DISC it must complete any
in-flight requests, but there's no requirement for the server to
commit anything to disk.  IOW you can still lose data even though you
took the time to disconnect.

So I don't think there's any reason for libnbd to always gracefully
shut down (especially in this case where there are no in-flight
requests), and anyway it would break ABI to make that change and slow
down the client in cases when there's nothing to clean up.
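
To make the two client behaviours concrete, this is the whole
difference as seen from libnbd (a sketch):

  import nbd
  h = nbd.NBD()
  h.connect_uri("nbd://localhost")
  h.get_size()
  h.shutdown()   # soft disconnect: sends NBD_CMD_DISC first
  # versus simply dropping the handle, as nbdsh did above: nbd_close
  # just closes the socket, i.e. a hard disconnect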

> >Signed-off-by: Richard W.M. Jones 
> >---
> >  nbd/server.c | 7 ++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> >diff --git a/nbd/server.c b/nbd/server.c
> >index 3927f7789d..e0a43802b2 100644
> >--- a/nbd/server.c
> >+++ b/nbd/server.c
> >@@ -2669,7 +2669,12 @@ static coroutine_fn void nbd_trip(void *opaque)
> >  ret = nbd_handle_request(client, &request, req->data, &local_err);
> >  }
> >  if (ret < 0) {
> >-error_prepend(&local_err, "Failed to send reply: ");
> >+if (errno != EPIPE) {
>
> Both nbd_handle_request() and nbd_send_generic_reply() declares that
> they return -errno on failure in communication with client. I think,
> you should use ret here: if (ret != -EPIPE). It's safer: who knows,
> does errno really set on all error paths of called functions? If
> not, we may see here errno of some another previous operation.

Should we set errno = 0 earlier in nbd_trip()?  I don't really know
how coroutines in qemu interact with thread-local variables though.

Rich.

> >+error_prepend(&local_err, "Failed to send reply: ");
> >+} else {
> >+error_free(local_err);
> >+local_err = NULL;
> >+}
> >  goto disconnect;
> >  }
> >
> 
> 
> -- 
> Best regards,
> Vladimir

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v




[PATCH v2] nbd/server: Suppress Broken pipe errors on abrupt disconnection

2021-09-13 Thread Richard W.M. Jones
$ rm -f /tmp/sock /tmp/pid
$ qemu-img create -f qcow2 /tmp/disk.qcow2 1M
$ qemu-nbd -t --format=qcow2 --socket=/tmp/sock --pid-file=/tmp/pid 
/tmp/disk.qcow2 &
$ nbdsh -u 'nbd+unix:///?socket=/tmp/sock' -c 'h.get_size()'
qemu-nbd: Disconnect client, due to: Failed to send reply: Unable to write to 
socket: Broken pipe
$ killall qemu-nbd

nbdsh is abruptly dropping the NBD connection here which is a valid
way to close the connection.  It seems unnecessary to print an error
in this case so this commit suppresses it.

Note that if you call the nbdsh h.shutdown() method then the message
was not printed:

$ nbdsh -u 'nbd+unix:///?socket=/tmp/sock' -c 'h.get_size()' -c 'h.shutdown()'

Signed-off-by: Richard W.M. Jones 
---
 nbd/server.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/nbd/server.c b/nbd/server.c
index 3927f7789d..e0a43802b2 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2669,7 +2669,12 @@ static coroutine_fn void nbd_trip(void *opaque)
 ret = nbd_handle_request(client, &request, req->data, &local_err);
 }
 if (ret < 0) {
-error_prepend(&local_err, "Failed to send reply: ");
+if (errno != EPIPE) {
+error_prepend(&local_err, "Failed to send reply: ");
+} else {
+error_free(local_err);
+local_err = NULL;
+}
 goto disconnect;
 }
 
-- 
2.32.0




Re: [PATCH] nbd/server: Suppress Broken pipe errors on abrupt disconnection

2021-09-13 Thread Richard W.M. Jones
On Mon, Sep 13, 2021 at 05:07:12PM +0200, Kevin Wolf wrote:
> > >  if (ret < 0) {
> > > -error_prepend(&local_err, "Failed to send reply: ");
> > > +if (errno != EPIPE) {
> > > +error_prepend(&local_err, "Failed to send reply: ");
> > > +} else {
> > > +local_err = NULL;
> > 
> > This line should be error_free(local_err) to avoid a memleak.
> 
> Actually, you want both error_free(local_err) and local_err = NULL.

Give me a few mins to test and post a new version that at least
fixes this bug ...

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/




Re: [PATCH] nbd/server: Advertise MULTI_CONN for shared writable exports

2021-08-27 Thread Richard W.M. Jones
On Fri, Aug 27, 2021 at 10:09:16AM -0500, Eric Blake wrote:
> +# Parallel client connections are easier with libnbd than qemu-io:
> +if ! (nbdsh --version) >/dev/null 2>&1; then

I'm curious why the parentheses are needed here?

> +export nbd_unix_socket
> +nbdsh -c '
> +import os
> +sock = os.getenv("nbd_unix_socket")
> +h = []
> +
> +for i in range(3):
> +  h.append(nbd.NBD())
> +  h[i].connect_unix(sock)
> +  assert h[i].can_multi_conn()
> +
> +buf1 = h[0].pread(1024 * 1024, 0)
> +if buf1 != b"\x01" * 1024 * 1024:
> +  print("Unexpected initial read")
> +buf2 = b"\x03" * 1024 * 1024
> +h[1].pwrite(buf2, 0)   #1
> +h[2].flush()
> +buf1 = h[0].pread(1024 * 1024, 0)
> +if buf1 == buf2:
> +  print("Flush appears to be consistent across connections")

The test is ... simple.

Would it be a better test if the statement at line #1 above was
h.aio_pwrite instead, so the operation is only started?  This would
depend on some assumptions inside libnbd: That the h[1] socket is
immediately writable and that libnbd's statement will write before
returning, which are likely to be true, but perhaps you could do
something with parsing libnbd debug statements to check that the state
machine has started the write.

Another idea would be to make the write at line #1 be very large, so
that perhaps the qemu block layer would split it up.  After the flush
you would read the beginning and end bytes of the written part to
approximate a check that the whole write has been flushed so parts of
it are not hanging around in the cache of the first connection.
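
The first idea would look roughly like this (a sketch using libnbd's
aio calls; it relies on h[1]'s socket accepting the whole request
immediately, which is likely but not guaranteed):

  wbuf = nbd.Buffer.from_bytearray(bytearray(b"\x03" * 1024 * 1024))
  cookie = h[1].aio_pwrite(wbuf, 0)    # write started, not completed
  h[2].flush()
  while not h[1].aio_command_completed(cookie):
      h[1].poll(-1)                    # now let the write finish
  buf1 = h[0].pread(1024 * 1024, 0)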

Anyway the patch as written seems fine to me so:

Reviewed-by: Richard W.M. Jones 

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html




Re: [PATCH v2] nbd/server: Add --selinux-label option

2021-07-23 Thread Richard W.M. Jones
On Fri, Jul 23, 2021 at 06:18:55PM +0200, Kevin Wolf wrote:
> Am 23.07.2021 um 12:33 hat Richard W.M. Jones geschrieben:
> > Under SELinux, Unix domain sockets have two labels.  One is on the
> > disk and can be set with commands such as chcon(1).  There is a
> > different label stored in memory (called the process label).  This can
> > only be set by the process creating the socket.  When using SELinux +
> > SVirt and wanting qemu to be able to connect to a qemu-nbd instance,
> > you must set both labels correctly first.
> > 
> > For qemu-nbd the options to set the second label are awkward.  You can
> > create the socket in a wrapper program and then exec into qemu-nbd.
> > Or you could try something with LD_PRELOAD.
> > 
> > This commit adds the ability to set the label straightforwardly on the
> > command line, via the new --selinux-label flag.  (The name of the flag
> > is the same as the equivalent nbdkit option.)
> > 
> > A worked example showing how to use the new option can be found in
> > this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1984938
> > 
> > Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1984938
> > Signed-off-by: Richard W.M. Jones 
> 
> I suppose this would also be relevant for the built-in NBD server,
> especially in the context of qemu-storage-daemon?
>
> If so, is this something specific to NBD sockets, or would it actually
> make sense to have it as a generic option in UnixSocketAddress?

For other NBD sockets, most likely.  I'm not sure about Unix sockets
in general (as in: I know they also have the two label thing, but I
don't know if there's a situation where SVirt protects other sockets
apart from NBD sockets).  I'm sure Dan will know ...

By the way although it appears that setsockcreatecon_raw is setting a
global flag, it seems to actually use a thread-local variable, so
implementing this (although still ugly) would not require locks.

https://github.com/SELinuxProject/selinux/blob/32611aea6543e3a8f32635857e37b4332b0b5c99/libselinux/src/procattr.c#L347
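
For reference, the wrapper-program approach from the commit message
would look something like this - only a sketch, assuming the
libselinux Python bindings and qemu-nbd's socket activation support,
and with a made-up label:

  import os, socket, selinux

  # process label applied to any socket created from here on
  selinux.setsockcreatecon_raw("system_u:object_r:svirt_socket_t:s0")
  s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
  s.bind("/tmp/sock")
  s.listen(1)
  # hand the listening socket to qemu-nbd as fd 3 (socket activation)
  os.dup2(s.fileno(), 3)
  os.environ["LISTEN_FDS"] = "1"
  os.environ["LISTEN_PID"] = str(os.getpid())  # pid survives execvp
  os.execvp("qemu-nbd", ["qemu-nbd", "-t", "disk.qcow2"])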

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top




[PATCH v2] nbd/server: Add --selinux-label option

2021-07-23 Thread Richard W.M. Jones
v1 was here:
https://lists.nongnu.org/archive/html/qemu-block/2021-07/threads.html#00713

v2 adds the changes to CI docker files as suggested by
Dan Berrange in his review.

Rich.




