Re: [PATCH] aio_wait_kick: add missing memory barrier

2022-06-04 Thread Roman Kagan
On Tue, May 24, 2022 at 01:30:54PM -0400, Emanuele Giuseppe Esposito wrote:
> It seems that aio_wait_kick always required a memory barrier
> or atomic operation in the caller, but nobody actually
> took care of doing it.
> 
> Let's put the barrier in the function instead, and pair it
> with another one in AIO_WAIT_WHILE. Read aio_wait_kick()
> comment for further explanation.
> 
> Suggested-by: Paolo Bonzini 
> Signed-off-by: Emanuele Giuseppe Esposito 
> ---
>  include/block/aio-wait.h |  2 ++
>  util/aio-wait.c  | 16 +++-
>  2 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/include/block/aio-wait.h b/include/block/aio-wait.h
> index b39eefb38d..54840f8622 100644
> --- a/include/block/aio-wait.h
> +++ b/include/block/aio-wait.h
> @@ -81,6 +81,8 @@ extern AioWait global_aio_wait;
>  AioContext *ctx_ = (ctx);  \
>  /* Increment wait_->num_waiters before evaluating cond. */ \
>  qatomic_inc(&wait_->num_waiters);  \
> +/* Paired with smp_mb in aio_wait_kick(). */   \
> +smp_mb();  \

IIRC qatomic_inc() ensures sequential consistency, isn't it enough here?

>  if (ctx_ && in_aio_context_home_thread(ctx_)) {\
>  while ((cond)) {   \
>  aio_poll(ctx_, true);  \
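
For reference, the pairing being discussed is roughly the following pattern (a
sketch, not the actual QEMU code; condition_holds(), update_condition() and
dummy_cb are made-up placeholders):

    /* waiter side, AIO_WAIT_WHILE-like */
    qatomic_inc(&wait_->num_waiters);         /* write: announce the waiter */
    smp_mb();                                 /* or rely on the seq-cst RMW above */
    while (condition_holds()) {               /* read: re-evaluate cond */
        aio_poll(ctx_, true);
    }
    qatomic_dec(&wait_->num_waiters);

    /* kicker side, aio_wait_kick-like */
    update_condition();                       /* write: make cond false */
    smp_mb();                                 /* paired with the barrier above */
    if (qatomic_read(&wait_->num_waiters)) {  /* read: anybody waiting? */
        aio_bh_schedule_oneshot(qemu_get_aio_context(), dummy_cb, NULL);
    }

Either the waiter observes the updated condition, or the kicker observes
num_waiters != 0 and schedules the dummy BH, so a wakeup cannot be lost.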

Roman.



Re: [PATCH 00/10] vhost: stick to -errno error return convention

2021-11-29 Thread Roman Kagan
On Sun, Nov 28, 2021 at 04:47:20PM -0500, Michael S. Tsirkin wrote:
> On Thu, Nov 11, 2021 at 06:33:44PM +0300, Roman Kagan wrote:
> > Error propagation between the generic vhost code and the specific backends is
> > not quite consistent: some places follow "return -1 and set errno" convention,
> > while others assume "return negated errno".  Furthermore, not enough care is
> > taken not to clobber errno.
> > 
> > As a result, on certain code paths the errno resulting from a failure may get
> > overridden by another function call, and then that zero errno indicating
> > success is propagated up the stack, leading to failures being lost.  In
> > particular, we've seen errors in the communication with a vhost-user-blk slave
> > not trigger an immediate connection drop and reconnection, leaving it in a
> > broken state.
> > 
> > Rework error propagation to always return negated errno on errors and
> > correctly pass it up the stack.
> 
> Hi Roman,
> if there are bugfixes here I'll be happy to take them right now.
> The wholesale rework seems inappropriate for 6.2, I'll be
> happy to tag it for after 6.2. Pls ping me after the release to help
> make sure it's not lost.

All these patches are bugfixes in one way or another.  That said, none
of the problems being addressed are recent regressions.  OTOH the
patches introduce non-zero churn and change behavior on some error
paths, so I'd suggest postponing the whole series till after 6.2 is
out.

Thanks,
Roman.



Re: [PATCH 01/10] vhost-user-blk: reconnect on any error during realize

2021-11-12 Thread Roman Kagan
On Fri, Nov 12, 2021 at 12:37:59PM +0100, Kevin Wolf wrote:
> Am 12.11.2021 um 08:39 hat Roman Kagan geschrieben:
> > On Thu, Nov 11, 2021 at 06:52:30PM +0100, Kevin Wolf wrote:
> > > Am 11.11.2021 um 16:33 hat Roman Kagan geschrieben:
> > > > vhost-user-blk realize only attempts to reconnect if the previous
> > > > connection attempt failed on "a problem with the connection and not an
> > > > error related to the content (which would fail again the same way in the
> > > > next attempt)".
> > > > 
> > > > However this distinction is very subtle, and may be inadvertently broken
> > > > if the code changes somewhere deep down the stack and a new error gets
> > > > propagated up to here.
> > > > 
> > > > OTOH now that the number of reconnection attempts is limited it seems
> > > > harmless to try reconnecting on any error.
> > > > 
> > > > So relax the condition of whether to retry connecting to check for any
> > > > error.
> > > > 
> > > > This patch amends a527e312b5 "vhost-user-blk: Implement reconnection
> > > > during realize".
> > > > 
> > > > Signed-off-by: Roman Kagan 
> > > 
> > > It results in less than perfect error messages. With a modified export
> > > that just crashes qemu-storage-daemon during get_features, I get:
> > > 
> > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to read 
> > > msg header. Read 0 instead of 12. Original request 1.
> > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting 
> > > after error: vhost_backend_init failed: Protocol error
> > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting 
> > > after error: Failed to connect to '/tmp/vsock': Connection refused
> > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting 
> > > after error: Failed to connect to '/tmp/vsock': Connection refused
> > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to 
> > > connect to '/tmp/vsock': Connection refused
> > 
> > This patch doesn't change any error messages.  Which ones specifically
> > became less than perfect as a result of this patch?
> 
> But it adds error messages (for each retry), which are different from
> the first error message. As I said this is not the end of the world, but
> maybe a bit more confusing.

Ah, now I see what you mean: it adds reconnection attempts where there
used to be immediate failure return, so now every failed attempt logs
its own message.

> > > I guess this might be tolerable. On the other hand, the patch doesn't
> > > really fix anything either, but just gets rid of possible subtleties.
> > 
> > The remaining patches in the series make other errors besides -EPROTO
> > propagate up to this point, and some (most) of them are retryable.  This
> > was the reason to include this patch at the beginning of the series (I
> > guess I should've mentioned that in the patch log).
> 
> I see. I hadn't looked at the rest of the series yet because I ran out
> of time, but now that I'm skimming them, I see quite a few places that
> use non-EPROTO, but I wonder which of them actually should be
> reconnected. So far all I saw were presumably persistent errors where a
> retry won't help. Can you give me some examples?

E.g. the particular case you mention earlier, -ECONNREFUSED, is not
unlikely to happen due to the vhost-user server restart for maintenance;
in this case retrying looks like a reasonable thing to do, doesn't it?

Thanks,
Roman.



Re: [PATCH 04/10] chardev/char-fe: don't allow EAGAIN from blocking read

2021-11-12 Thread Roman Kagan
On Fri, Nov 12, 2021 at 12:24:06PM +0400, Marc-André Lureau wrote:
> Hi
> 
> On Thu, Nov 11, 2021 at 7:44 PM Roman Kagan  wrote:
> 
> > As its name suggests, ChardevClass.chr_sync_read is supposed to do a
> > blocking read.  The only implementation of it, tcp_chr_sync_read, does
> > set the underlying io channel to the blocking mode indeed.
> >
> > Therefore a failure return with EAGAIN is not expected from this call.
> >
> > So do not retry it in qemu_chr_fe_read_all; instead place an assertion
> > that it doesn't fail with EAGAIN.
> >
> 
> The code was introduced in :
> commit 7b0bfdf52d694c9a3a96505aa42ce3f8d63acd35
> Author: Nikolay Nikolaev 
> Date:   Tue May 27 15:03:48 2014 +0300
> 
> Add chardev API qemu_chr_fe_read_all

Right, but at that point chr_sync_read wasn't made to block.  It
happened later in

commit bcdeb9be566ded2eb35233aaccf38742a21e5daa
Author: Marc-André Lureau 
Date:   Thu Jul 6 19:03:53 2017 +0200

chardev: block during sync read

A sync read should block until all requested data is
available (instead of retrying in qemu_chr_fe_read_all). Change the
channel to blocking during sync_read.
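
That commit is also why the assertion below should hold in practice:
tcp_chr_sync_read switches the channel to blocking mode before reading,
roughly:

    qio_channel_set_blocking(s->ioc, true, NULL);
    size = tcp_chr_recv(chr, (void *) buf, len);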

> > @@ -68,13 +68,10 @@ int qemu_chr_fe_read_all(CharBackend *be, uint8_t
> > *buf, int len)
> >  }
> >
> >  while (offset < len) {
> > -retry:
> >  res = CHARDEV_GET_CLASS(s)->chr_sync_read(s, buf + offset,
> >len - offset);
> > -if (res == -1 && errno == EAGAIN) {
> > -g_usleep(100);
> > -goto retry;
> > -}
> > +/* ->chr_sync_read should block */
> > +assert(!(res < 0 && errno == EAGAIN));
> >
> >
> While I agree with the rationale to clean this code a bit, I am not so sure
> about replacing it with an assert(). In the past, when we did such things
> we had unexpected regressions :)

Valid point: qemu may be run against some OS where a blocking call may
sporadically fail with EAGAIN, and it would be hard to catch this
reliably in testing.

> A slightly better approach perhaps is g_warn_if_fail(), although it's not
> very popular in qemu.

I think the first thing to decide is whether EAGAIN from a blocking
call is benign enough to justify (unlimited) retries.  I'm tempted to
just remove any special handling of EAGAIN and treat it as any other
error, leaving it up to the caller to handle (most probably by failing
the call and initiating recovery, if possible).
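
For concreteness, a rough sketch of what the loop could look like in that case
(the error path here is an assumption, not a finished patch):

    while (offset < len) {
        res = CHARDEV_GET_CLASS(s)->chr_sync_read(s, buf + offset,
                                                  len - offset);
        if (res == 0) {
            break;              /* EOF / disconnect */
        }
        if (res < 0) {
            return res;         /* any error, including a stray EAGAIN,
                                 * is reported to the caller */
        }
        offset += res;
    }
    return offset;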

Does this make sense?

Thanks,
Roman.



Re: [PATCH 05/10] vhost-backend: avoid overflow on memslots_limit

2021-11-12 Thread Roman Kagan
On Fri, Nov 12, 2021 at 09:56:17AM +, Daniel P. Berrangé wrote:
> On Fri, Nov 12, 2021 at 10:46:46AM +0300, Roman Kagan wrote:
> > On Thu, Nov 11, 2021 at 06:59:43PM +0100, Philippe Mathieu-Daudé wrote:
> > > On 11/11/21 16:33, Roman Kagan wrote:
> > > > Fix the (hypothetical) potential problem when the value parsed out of
> > > > the vhost module parameter in sysfs overflows the return value from
> > > > vhost_kernel_memslots_limit.
> > > > 
> > > > Signed-off-by: Roman Kagan 
> > > > ---
> > > >  hw/virtio/vhost-backend.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
> > > > index b65f8f7e97..44f7dbb243 100644
> > > > --- a/hw/virtio/vhost-backend.c
> > > > +++ b/hw/virtio/vhost-backend.c
> > > > @@ -58,7 +58,7 @@ static int vhost_kernel_memslots_limit(struct vhost_dev *dev)
> > > >  if (g_file_get_contents("/sys/module/vhost/parameters/max_mem_regions",
> > > >  &s, NULL, NULL)) {
> > > >  uint64_t val = g_ascii_strtoull(s, NULL, 10);
> > > 
> > > Would using qemu_strtou64() simplify this?
> > 
> > I'm afraid not.  None of the existing strtoXX converting functions has
> > the desired output range (0 < retval < INT_MAX), so the following
> > condition will remain necessary anyway; then it doesn't seem to matter
> > which particular parser is used to extract the value which is in the
> > range, so I left the one that was already there to reduce churn.
> 
> If  qemu_strtou64() can't handle all values in (0 < retval < INT_MAX)
> isn't that a bug in qemu_strtou64 ?

I must have been unclear.  It sure can handle all values in this range;
the point is that the range check after it would still be needed, so
switching from g_ascii_strtoull to qemu_strtoXX saves nothing, therefore
I left it as it was.
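
Just to illustrate, a hypothetical variant based on qemu_strtou64() would still
need the same range check:

    uint64_t val;

    if (qemu_strtou64(s, NULL, 10, &val) == 0 &&
        val > 0 && val < INT_MAX) {
        g_free(s);
        return val;
    }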

Thanks,
Roman.



Re: [PATCH 00/10] vhost: stick to -errno error return convention

2021-11-12 Thread Roman Kagan
On Thu, Nov 11, 2021 at 03:14:56PM -0500, Michael S. Tsirkin wrote:
> On Thu, Nov 11, 2021 at 06:33:44PM +0300, Roman Kagan wrote:
> > Error propagation between the generic vhost code and the specific backends is
> > not quite consistent: some places follow "return -1 and set errno" convention,
> > while others assume "return negated errno".  Furthermore, not enough care is
> > taken not to clobber errno.
> > 
> > As a result, on certain code paths the errno resulting from a failure may get
> > overridden by another function call, and then that zero errno indicating
> > success is propagated up the stack, leading to failures being lost.  In
> > particular, we've seen errors in the communication with a vhost-user-blk slave
> > not trigger an immediate connection drop and reconnection, leaving it in a
> > broken state.
> > 
> > Rework error propagation to always return negated errno on errors and
> > correctly pass it up the stack.
> 
> Looks like something we want post release. I'll tag it
> but pls ping me after the release to help make sure
> it's not lost.

It doesn't introduce new features so I guess it might qualify for rc0,
but the churn is somewhat too big indeed.

OK, I'll re-send the series once 6.2 is out; meanwhile, if anyone has spare
cycles to review it, it'll be much appreciated.

Thanks,
Roman.

> 
> 
> > Roman Kagan (10):
> >   vhost-user-blk: reconnect on any error during realize
> >   chardev/char-socket: tcp_chr_recv: don't clobber errno
> >   chardev/char-socket: tcp_chr_sync_read: don't clobber errno
> >   chardev/char-fe: don't allow EAGAIN from blocking read
> >   vhost-backend: avoid overflow on memslots_limit
> >   vhost-backend: stick to -errno error return convention
> >   vhost-vdpa: stick to -errno error return convention
> >   vhost-user: stick to -errno error return convention
> >   vhost: stick to -errno error return convention
> >   vhost-user-blk: propagate error return from generic vhost
> > 
> >  chardev/char-fe.c |   7 +-
> >  chardev/char-socket.c |  17 +-
> >  hw/block/vhost-user-blk.c |   4 +-
> >  hw/virtio/vhost-backend.c |   4 +-
> >  hw/virtio/vhost-user.c| 401 +-
> >  hw/virtio/vhost-vdpa.c|  37 ++--
> >  hw/virtio/vhost.c |  98 +-
> >  7 files changed, 307 insertions(+), 261 deletions(-)
> > 
> > -- 
> > 2.33.1
> > 
> 



Re: [PATCH 05/10] vhost-backend: avoid overflow on memslots_limit

2021-11-11 Thread Roman Kagan
On Thu, Nov 11, 2021 at 06:59:43PM +0100, Philippe Mathieu-Daudé wrote:
> On 11/11/21 16:33, Roman Kagan wrote:
> > Fix the (hypothetical) potential problem when the value parsed out of
> > the vhost module parameter in sysfs overflows the return value from
> > vhost_kernel_memslots_limit.
> > 
> > Signed-off-by: Roman Kagan 
> > ---
> >  hw/virtio/vhost-backend.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
> > index b65f8f7e97..44f7dbb243 100644
> > --- a/hw/virtio/vhost-backend.c
> > +++ b/hw/virtio/vhost-backend.c
> > @@ -58,7 +58,7 @@ static int vhost_kernel_memslots_limit(struct vhost_dev *dev)
> >  if (g_file_get_contents("/sys/module/vhost/parameters/max_mem_regions",
> >  &s, NULL, NULL)) {
> >  uint64_t val = g_ascii_strtoull(s, NULL, 10);
> 
> Would using qemu_strtou64() simplify this?

I'm afraid not.  None of the existing strtoXX converting functions has
the desired output range (0 < retval < INT_MAX), so the following
condition will remain necessary anyway; then it doesn't seem to matter
which particular parser is used to extract the value which is in the
range, so I left the one that was already there to reduce churn.

> 
> > -if (!((val == G_MAXUINT64 || !val) && errno)) {
> > +if (val < INT_MAX && val > 0) {
> >  g_free(s);
> >  return val;
> >  }

Thanks,
Roman.



Re: [PATCH 01/10] vhost-user-blk: reconnect on any error during realize

2021-11-11 Thread Roman Kagan
On Thu, Nov 11, 2021 at 06:52:30PM +0100, Kevin Wolf wrote:
> Am 11.11.2021 um 16:33 hat Roman Kagan geschrieben:
> > vhost-user-blk realize only attempts to reconnect if the previous
> > connection attempt failed on "a problem with the connection and not an
> > error related to the content (which would fail again the same way in the
> > next attempt)".
> > 
> > However this distinction is very subtle, and may be inadvertently broken
> > if the code changes somewhere deep down the stack and a new error gets
> > propagated up to here.
> > 
> > OTOH now that the number of reconnection attempts is limited it seems
> > harmless to try reconnecting on any error.
> > 
> > So relax the condition of whether to retry connecting to check for any
> > error.
> > 
> > This patch amends a527e312b5 "vhost-user-blk: Implement reconnection
> > during realize".
> > 
> > Signed-off-by: Roman Kagan 
> 
> It results in less than perfect error messages. With a modified export
> that just crashes qemu-storage-daemon during get_features, I get:
> 
> qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to read msg 
> header. Read 0 instead of 12. Original request 1.
> qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after 
> error: vhost_backend_init failed: Protocol error
> qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after 
> error: Failed to connect to '/tmp/vsock': Connection refused
> qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after 
> error: Failed to connect to '/tmp/vsock': Connection refused
> qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to connect 
> to '/tmp/vsock': Connection refused

This patch doesn't change any error messages.  Which ones specifically
became less than perfect as a result of this patch?

> I guess this might be tolerable. On the other hand, the patch doesn't
> really fix anything either, but just gets rid of possible subtleties.

The remaining patches in the series make other errors besides -EPROTO
propagate up to this point, and some (most) of them are retryable.  This
was the reason to include this patch at the beginning of the series (I
guess I should've mentioned that in the patch log).

Thanks,
Roman.



[PATCH 09/10] vhost: stick to -errno error return convention

2021-11-11 Thread Roman Kagan
The generic vhost code expects that many of the VhostOps methods in the
respective backends set errno on errors.  However, none of the existing
backends actually bothers to do so.  In a number of those methods errno
from the failed call is clobbered by successful later calls to some
library functions; on a few code paths the generic vhost code then
negates and returns that errno, thus making failures look like successes
to the caller.
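
To illustrate the failure mode (a sketch of the pre-existing pattern, not code
added by this patch):

    r = dev->vhost_ops->vhost_set_features(dev, features);
    if (r < 0) {
        /* the backend returned -1; errno may already be stale, and the
         * error_report() inside the macro may clobber it further */
        VHOST_OPS_DEBUG("vhost_set_features failed");
    }
    return r < 0 ? -errno : 0;   /* can turn a failure into "success" (0) */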

As a result, in certain scenarios (e.g. live migration) the device
doesn't notice the first failure and goes on through its state
transitions as if everything is ok, instead of taking recovery actions
(break and reestablish the vhost-user connection, cancel migration, etc)
before it's too late.

To fix this, consolidate on the convention to return negated errno on
failures throughout generic vhost, and use it for error propagation.

Signed-off-by: Roman Kagan 
---
 hw/virtio/vhost.c | 98 ++-
 1 file changed, 45 insertions(+), 53 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 437347ad01..4f20d4a714 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -33,11 +33,13 @@
 #define _VHOST_DEBUG 1
 
 #ifdef _VHOST_DEBUG
-#define VHOST_OPS_DEBUG(fmt, ...) \
-do { error_report(fmt ": %s (%d)", ## __VA_ARGS__, \
-  strerror(errno), errno); } while (0)
+#define VHOST_OPS_DEBUG(retval, fmt, ...) \
+do { \
+error_report(fmt ": %s (%d)", ## __VA_ARGS__, \
+ strerror(-retval), -retval); \
+} while (0)
 #else
-#define VHOST_OPS_DEBUG(fmt, ...) \
+#define VHOST_OPS_DEBUG(retval, fmt, ...) \
 do { } while (0)
 #endif
 
@@ -297,7 +299,7 @@ static inline void vhost_dev_log_resize(struct vhost_dev 
*dev, uint64_t size)
releasing the current log, to ensure no logging is lost */
 r = dev->vhost_ops->vhost_set_log_base(dev, log_base, log);
 if (r < 0) {
-VHOST_OPS_DEBUG("vhost_set_log_base failed");
+VHOST_OPS_DEBUG(r, "vhost_set_log_base failed");
 }
 
 vhost_log_put(dev, true);
@@ -550,7 +552,7 @@ static void vhost_commit(MemoryListener *listener)
 if (!dev->log_enabled) {
 r = dev->vhost_ops->vhost_set_mem_table(dev, dev->mem);
 if (r < 0) {
-VHOST_OPS_DEBUG("vhost_set_mem_table failed");
+VHOST_OPS_DEBUG(r, "vhost_set_mem_table failed");
 }
 goto out;
 }
@@ -564,7 +566,7 @@ static void vhost_commit(MemoryListener *listener)
 }
 r = dev->vhost_ops->vhost_set_mem_table(dev, dev->mem);
 if (r < 0) {
-VHOST_OPS_DEBUG("vhost_set_mem_table failed");
+VHOST_OPS_DEBUG(r, "vhost_set_mem_table failed");
 }
 /* To log less, can only decrease log size after table update. */
 if (dev->log_size > log_size + VHOST_LOG_BUFFER) {
@@ -803,8 +805,8 @@ static int vhost_virtqueue_set_addr(struct vhost_dev *dev,
 if (dev->vhost_ops->vhost_vq_get_addr) {
 r = dev->vhost_ops->vhost_vq_get_addr(dev, &addr, vq);
 if (r < 0) {
-VHOST_OPS_DEBUG("vhost_vq_get_addr failed");
-return -errno;
+VHOST_OPS_DEBUG(r, "vhost_vq_get_addr failed");
+return r;
 }
 } else {
 addr.desc_user_addr = (uint64_t)(unsigned long)vq->desc;
@@ -816,10 +818,9 @@ static int vhost_virtqueue_set_addr(struct vhost_dev *dev,
 addr.flags = enable_log ? (1 << VHOST_VRING_F_LOG) : 0;
 r = dev->vhost_ops->vhost_set_vring_addr(dev, &addr);
 if (r < 0) {
-VHOST_OPS_DEBUG("vhost_set_vring_addr failed");
-return -errno;
+VHOST_OPS_DEBUG(r, "vhost_set_vring_addr failed");
 }
-return 0;
+return r;
 }
 
 static int vhost_dev_set_features(struct vhost_dev *dev,
@@ -840,19 +841,19 @@ static int vhost_dev_set_features(struct vhost_dev *dev,
 }
 r = dev->vhost_ops->vhost_set_features(dev, features);
 if (r < 0) {
-VHOST_OPS_DEBUG("vhost_set_features failed");
+VHOST_OPS_DEBUG(r, "vhost_set_features failed");
 goto out;
 }
 if (dev->vhost_ops->vhost_set_backend_cap) {
 r = dev->vhost_ops->vhost_set_backend_cap(dev);
 if (r < 0) {
-VHOST_OPS_DEBUG("vhost_set_backend_cap failed");
+VHOST_OPS_DEBUG(r, "vhost_set_backend_cap failed");
 goto out;
 }
 }
 
 out:
-return r < 0 ? -errno : 0;
+return r;
 }
 
 static int vhost_dev_set_log(struct vhost_dev *dev, bool enable_log)
@@ -999,22 +1000,17 @@ static int 
vhost_virtqueue_set_vring_endian_legacy(struct vhost_dev *dev,
bool is_big_endian,
int vhost_vq

[PATCH 08/10] vhost-user: stick to -errno error return convention

2021-11-11 Thread Roman Kagan
VhostOps methods in user_ops are not very consistent in their error
returns: some return negated errno while others just -1.

Make sure all of them consistently return negated errno.  This also
helps error propagation from the functions being called inside.
Besides, this synchronizes the error return convention with the other
two vhost backends, kernel and vdpa, and will therefore allow for
consistent error propagation in the generic vhost code (in a followup
patch).

Signed-off-by: Roman Kagan 
---
 hw/virtio/vhost-user.c | 401 +++--
 1 file changed, 223 insertions(+), 178 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index bf6e50223c..662853513e 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -280,9 +280,10 @@ static int vhost_user_read_header(struct vhost_dev *dev, 
VhostUserMsg *msg)
 
 r = qemu_chr_fe_read_all(chr, p, size);
 if (r != size) {
+int saved_errno = errno;
 error_report("Failed to read msg header. Read %d instead of %d."
  " Original request %d.", r, size, msg->hdr.request);
-return -1;
+return r < 0 ? -saved_errno : -EIO;
 }
 
 /* validate received flags */
@@ -290,7 +291,7 @@ static int vhost_user_read_header(struct vhost_dev *dev, 
VhostUserMsg *msg)
 error_report("Failed to read msg header."
 " Flags 0x%x instead of 0x%x.", msg->hdr.flags,
 VHOST_USER_REPLY_MASK | VHOST_USER_VERSION);
-return -1;
+return -EPROTO;
 }
 
 return 0;
@@ -314,8 +315,9 @@ static gboolean vhost_user_read_cb(void *do_not_use, 
GIOCondition condition,
 uint8_t *p = (uint8_t *) msg;
 int r, size;
 
-if (vhost_user_read_header(dev, msg) < 0) {
-data->ret = -1;
+r = vhost_user_read_header(dev, msg);
+if (r < 0) {
+data->ret = r;
 goto end;
 }
 
@@ -324,7 +326,7 @@ static gboolean vhost_user_read_cb(void *do_not_use, 
GIOCondition condition,
 error_report("Failed to read msg header."
 " Size %d exceeds the maximum %zu.", msg->hdr.size,
 VHOST_USER_PAYLOAD_SIZE);
-data->ret = -1;
+data->ret = -EPROTO;
 goto end;
 }
 
@@ -333,9 +335,10 @@ static gboolean vhost_user_read_cb(void *do_not_use, 
GIOCondition condition,
 size = msg->hdr.size;
 r = qemu_chr_fe_read_all(chr, p, size);
 if (r != size) {
+int saved_errno = errno;
 error_report("Failed to read msg payload."
  " Read %d instead of %d.", r, msg->hdr.size);
-data->ret = -1;
+data->ret = r < 0 ? -saved_errno : -EIO;
 goto end;
 }
 }
@@ -418,24 +421,26 @@ static int vhost_user_read(struct vhost_dev *dev, 
VhostUserMsg *msg)
 static int process_message_reply(struct vhost_dev *dev,
  const VhostUserMsg *msg)
 {
+int ret;
 VhostUserMsg msg_reply;
 
 if ((msg->hdr.flags & VHOST_USER_NEED_REPLY_MASK) == 0) {
 return 0;
 }
 
-if (vhost_user_read(dev, &msg_reply) < 0) {
-return -1;
+ret = vhost_user_read(dev, &msg_reply);
+if (ret < 0) {
+return ret;
 }
 
 if (msg_reply.hdr.request != msg->hdr.request) {
 error_report("Received unexpected msg type. "
  "Expected %d received %d",
  msg->hdr.request, msg_reply.hdr.request);
-return -1;
+return -EPROTO;
 }
 
-return msg_reply.payload.u64 ? -1 : 0;
+return msg_reply.payload.u64 ? -EIO : 0;
 }
 
 static bool vhost_user_one_time_request(VhostUserRequest request)
@@ -472,14 +477,15 @@ static int vhost_user_write(struct vhost_dev *dev, 
VhostUserMsg *msg,
 
 if (qemu_chr_fe_set_msgfds(chr, fds, fd_num) < 0) {
 error_report("Failed to set msg fds.");
-return -1;
+return -EINVAL;
 }
 
 ret = qemu_chr_fe_write_all(chr, (const uint8_t *) msg, size);
 if (ret != size) {
+int saved_errno = errno;
 error_report("Failed to write msg."
  " Wrote %d instead of %d.", ret, size);
-return -1;
+return ret < 0 ? -saved_errno : -EIO;
 }
 
 return 0;
@@ -502,6 +508,7 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, 
uint64_t base,
 size_t fd_num = 0;
 bool shmfd = virtio_has_feature(dev->protocol_features,
 VHOST_USER_PROTOCOL_F_LOG_SHMFD);
+int ret;
 VhostUserMsg msg = {
 .hdr.request = VHOST_USER_SET_LOG_BASE,
 .hdr.flags = VHOST_USER_VERSION,
@@ -514,21 +521,23 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, 
uint64_t base,
 fds[fd_num++] = l

[PATCH 07/10] vhost-vdpa: stick to -errno error return convention

2021-11-11 Thread Roman Kagan
Almost all VhostOps methods in vdpa_ops follow the convention of
returning negated errno on error.

Adjust the few that don't.  To that end, rework vhost_vdpa_add_status to
check if setting of the requested status bits has succeeded and return
the respective error code if it hasn't, and propagate the error codes
wherever it's appropriate.

Signed-off-by: Roman Kagan 
---
 hw/virtio/vhost-vdpa.c | 37 +++--
 1 file changed, 23 insertions(+), 14 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 0d8051426c..a3b885902a 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -292,18 +292,34 @@ static int vhost_vdpa_call(struct vhost_dev *dev, 
unsigned long int request,
 return ret < 0 ? -errno : ret;
 }
 
-static void vhost_vdpa_add_status(struct vhost_dev *dev, uint8_t status)
+static int vhost_vdpa_add_status(struct vhost_dev *dev, uint8_t status)
 {
 uint8_t s;
+int ret;
 
 trace_vhost_vdpa_add_status(dev, status);
-if (vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &s)) {
-return;
+ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &s);
+if (ret < 0) {
+return ret;
 }
 
 s |= status;
 
-vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &s);
+ret = vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &s);
+if (ret < 0) {
+return ret;
+}
+
+ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &s);
+if (ret < 0) {
+return ret;
+}
+
+if (!(s & status)) {
+return -EIO;
+}
+
+return 0;
 }
 
 static void vhost_vdpa_get_iova_range(struct vhost_vdpa *v)
@@ -484,7 +500,7 @@ static int vhost_vdpa_set_mem_table(struct vhost_dev *dev,
 }
 }
 if (mem->padding) {
-return -1;
+return -EINVAL;
 }
 
 return 0;
@@ -501,14 +517,11 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
 
 trace_vhost_vdpa_set_features(dev, features);
 ret = vhost_vdpa_call(dev, VHOST_SET_FEATURES, );
-uint8_t status = 0;
 if (ret) {
 return ret;
 }
-vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
-vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &status);
 
-return !(status & VIRTIO_CONFIG_S_FEATURES_OK);
+return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
 }
 
 static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
@@ -650,12 +663,8 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, 
bool started)
 }
 
 if (started) {
-uint8_t status = 0;
 memory_listener_register(&v->listener, &address_space_memory);
-vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
-vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &status);
-
-return !(status & VIRTIO_CONFIG_S_DRIVER_OK);
+return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
 } else {
 vhost_vdpa_reset_device(dev);
 vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
-- 
2.33.1




[PATCH 04/10] chardev/char-fe: don't allow EAGAIN from blocking read

2021-11-11 Thread Roman Kagan
As its name suggests, ChardevClass.chr_sync_read is supposed to do a
blocking read.  The only implementation of it, tcp_chr_sync_read, does
set the underlying io channel to the blocking mode indeed.

Therefore a failure return with EAGAIN is not expected from this call.

So do not retry it in qemu_chr_fe_read_all; instead place an assertion
that it doesn't fail with EAGAIN.

Signed-off-by: Roman Kagan 
---
 chardev/char-fe.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/chardev/char-fe.c b/chardev/char-fe.c
index 7789f7be9c..f94efe928e 100644
--- a/chardev/char-fe.c
+++ b/chardev/char-fe.c
@@ -68,13 +68,10 @@ int qemu_chr_fe_read_all(CharBackend *be, uint8_t *buf, int 
len)
 }
 
 while (offset < len) {
-retry:
 res = CHARDEV_GET_CLASS(s)->chr_sync_read(s, buf + offset,
   len - offset);
-if (res == -1 && errno == EAGAIN) {
-g_usleep(100);
-goto retry;
-}
+/* ->chr_sync_read should block */
+assert(!(res < 0 && errno == EAGAIN));
 
 if (res == 0) {
 break;
-- 
2.33.1




[PATCH 06/10] vhost-backend: stick to -errno error return convention

2021-11-11 Thread Roman Kagan
Almost all VhostOps methods in kernel_ops follow the convention of
returning negated errno on error.

Adjust the only one that doesn't.

Signed-off-by: Roman Kagan 
---
 hw/virtio/vhost-backend.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
index 44f7dbb243..e409a865ae 100644
--- a/hw/virtio/vhost-backend.c
+++ b/hw/virtio/vhost-backend.c
@@ -47,7 +47,7 @@ static int vhost_kernel_cleanup(struct vhost_dev *dev)
 
 assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_KERNEL);
 
-return close(fd);
+return close(fd) < 0 ? -errno : 0;
 }
 
 static int vhost_kernel_memslots_limit(struct vhost_dev *dev)
-- 
2.33.1




[PATCH 10/10] vhost-user-blk: propagate error return from generic vhost

2021-11-11 Thread Roman Kagan
Fix the only callsite that doesn't propagate the error code from the
generic vhost code.

Signed-off-by: Roman Kagan 
---
 hw/block/vhost-user-blk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index f9b17f6813..ab11ce8252 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -100,7 +100,7 @@ static int vhost_user_blk_handle_config_change(struct 
vhost_dev *dev)
 &local_err);
 if (ret < 0) {
 error_report_err(local_err);
-return -1;
+return ret;
 }
 
 /* valid for resize only */
-- 
2.33.1




[PATCH 00/10] vhost: stick to -errno error return convention

2021-11-11 Thread Roman Kagan
Error propagation between the generic vhost code and the specific backends is
not quite consistent: some places follow "return -1 and set errno" convention,
while others assume "return negated errno".  Furthermore, not enough care is
taken not to clobber errno.

As a result, on certain code paths the errno resulting from a failure may get
overridden by another function call, and then that zero errno indicating
success is propagated up the stack, leading to failures being lost.  In
particular, we've seen errors in the communication with a vhost-user-blk slave
not trigger an immediate connection drop and reconnection, leaving it in a
broken state.

Rework error propagation to always return negated errno on errors and
correctly pass it up the stack.

Roman Kagan (10):
  vhost-user-blk: reconnect on any error during realize
  chardev/char-socket: tcp_chr_recv: don't clobber errno
  chardev/char-socket: tcp_chr_sync_read: don't clobber errno
  chardev/char-fe: don't allow EAGAIN from blocking read
  vhost-backend: avoid overflow on memslots_limit
  vhost-backend: stick to -errno error return convention
  vhost-vdpa: stick to -errno error return convention
  vhost-user: stick to -errno error return convention
  vhost: stick to -errno error return convention
  vhost-user-blk: propagate error return from generic vhost

 chardev/char-fe.c |   7 +-
 chardev/char-socket.c |  17 +-
 hw/block/vhost-user-blk.c |   4 +-
 hw/virtio/vhost-backend.c |   4 +-
 hw/virtio/vhost-user.c| 401 +-
 hw/virtio/vhost-vdpa.c|  37 ++--
 hw/virtio/vhost.c |  98 +-
 7 files changed, 307 insertions(+), 261 deletions(-)

-- 
2.33.1




[PATCH 02/10] chardev/char-socket: tcp_chr_recv: don't clobber errno

2021-11-11 Thread Roman Kagan
tcp_chr_recv communicates the specific error condition to the caller via
errno.  However, after setting it, it may call into some system calls or
library functions which can clobber the errno.

Avoid this by moving the errno assignment to the end of the function.

Signed-off-by: Roman Kagan 
---
 chardev/char-socket.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/chardev/char-socket.c b/chardev/char-socket.c
index 836cfa0bc2..90054ce58c 100644
--- a/chardev/char-socket.c
+++ b/chardev/char-socket.c
@@ -346,13 +346,6 @@ static ssize_t tcp_chr_recv(Chardev *chr, char *buf, 
size_t len)
  NULL);
 }
 
-if (ret == QIO_CHANNEL_ERR_BLOCK) {
-errno = EAGAIN;
-ret = -1;
-} else if (ret == -1) {
-errno = EIO;
-}
-
 if (msgfds_num) {
 /* close and clean read_msgfds */
 for (i = 0; i < s->read_msgfds_num; i++) {
@@ -381,6 +374,13 @@ static ssize_t tcp_chr_recv(Chardev *chr, char *buf, 
size_t len)
 #endif
 }
 
+if (ret == QIO_CHANNEL_ERR_BLOCK) {
+errno = EAGAIN;
+ret = -1;
+} else if (ret == -1) {
+errno = EIO;
+}
+
 return ret;
 }
 
-- 
2.33.1




[PATCH 03/10] chardev/char-socket: tcp_chr_sync_read: don't clobber errno

2021-11-11 Thread Roman Kagan
After the return from tcp_chr_recv, tcp_chr_sync_read calls into a
function which eventually makes a system call and may clobber errno.

Make a copy of errno right after tcp_chr_recv and restore the errno on
return from tcp_chr_sync_read.

Signed-off-by: Roman Kagan 
---
 chardev/char-socket.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/chardev/char-socket.c b/chardev/char-socket.c
index 90054ce58c..cf7f2ba65a 100644
--- a/chardev/char-socket.c
+++ b/chardev/char-socket.c
@@ -581,6 +581,7 @@ static int tcp_chr_sync_read(Chardev *chr, const uint8_t 
*buf, int len)
 {
 SocketChardev *s = SOCKET_CHARDEV(chr);
 int size;
+int saved_errno;
 
 if (s->state != TCP_CHARDEV_STATE_CONNECTED) {
 return 0;
@@ -588,6 +589,7 @@ static int tcp_chr_sync_read(Chardev *chr, const uint8_t 
*buf, int len)
 
 qio_channel_set_blocking(s->ioc, true, NULL);
 size = tcp_chr_recv(chr, (void *) buf, len);
+saved_errno = errno;
 if (s->state != TCP_CHARDEV_STATE_DISCONNECTED) {
 qio_channel_set_blocking(s->ioc, false, NULL);
 }
@@ -596,6 +598,7 @@ static int tcp_chr_sync_read(Chardev *chr, const uint8_t 
*buf, int len)
 tcp_chr_disconnect(chr);
 }
 
+errno = saved_errno;
 return size;
 }
 
-- 
2.33.1




[PATCH 01/10] vhost-user-blk: reconnect on any error during realize

2021-11-11 Thread Roman Kagan
vhost-user-blk realize only attempts to reconnect if the previous
connection attempt failed on "a problem with the connection and not an
error related to the content (which would fail again the same way in the
next attempt)".

However this distinction is very subtle, and may be inadvertently broken
if the code changes somewhere deep down the stack and a new error gets
propagated up to here.

OTOH now that the number of reconnection attempts is limited it seems
harmless to try reconnecting on any error.

So relax the condition of whether to retry connecting to check for any
error.

This patch amends a527e312b5 "vhost-user-blk: Implement reconnection
during realize".

Signed-off-by: Roman Kagan 
---
 hw/block/vhost-user-blk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index ba13cb87e5..f9b17f6813 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -511,7 +511,7 @@ static void vhost_user_blk_device_realize(DeviceState *dev, 
Error **errp)
 *errp = NULL;
 }
 ret = vhost_user_blk_realize_connect(s, errp);
-} while (ret == -EPROTO && retries--);
+} while (ret < 0 && retries--);
 
 if (ret < 0) {
 goto virtio_err;
-- 
2.33.1




[PATCH 05/10] vhost-backend: avoid overflow on memslots_limit

2021-11-11 Thread Roman Kagan
Fix the (hypothetical) potential problem when the value parsed out of
the vhost module parameter in sysfs overflows the return value from
vhost_kernel_memslots_limit.

Signed-off-by: Roman Kagan 
---
 hw/virtio/vhost-backend.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
index b65f8f7e97..44f7dbb243 100644
--- a/hw/virtio/vhost-backend.c
+++ b/hw/virtio/vhost-backend.c
@@ -58,7 +58,7 @@ static int vhost_kernel_memslots_limit(struct vhost_dev *dev)
 if (g_file_get_contents("/sys/module/vhost/parameters/max_mem_regions",
 &s, NULL, NULL)) {
 uint64_t val = g_ascii_strtoull(s, NULL, 10);
-if (!((val == G_MAXUINT64 || !val) && errno)) {
+if (val < INT_MAX && val > 0) {
 g_free(s);
 return val;
 }
-- 
2.33.1




Re: [PATCH v0 0/2] virtio-blk and vhost-user-blk cross-device migration

2021-10-04 Thread Roman Kagan
On Mon, Oct 04, 2021 at 11:11:00AM -0400, Michael S. Tsirkin wrote:
> On Mon, Oct 04, 2021 at 06:07:29PM +0300, Denis Plotnikov wrote:
> > It might be useful for the cases when a slow block layer should be replaced
> > with a more performant one on a running VM without stopping it, i.e. with a
> > very low downtime comparable to the one on migration.
> > 
> > It's possible to achieve that for two reasons:
> > 
> > 1. The VMStates of "virtio-blk" and "vhost-user-blk" are almost the same.
> >    They consist of the identical VMSTATE_VIRTIO_DEVICE and differ from
> >    each other only in the values of migration service fields.
> > 2. The device driver used in the guest is the same: virtio-blk
> > 
> > In the series cross-migration is achieved by adding a new type.
> > The new type uses the virtio-blk VMState instead of the vhost-user-blk
> > specific VMState, and it also implements migration save/load callbacks to be
> > compatible with the migration stream produced by the "virtio-blk" device.
> > 
> > Adding a new type instead of modifying the existing one is convenient.
> > It makes it easy to distinguish the new virtio-blk-compatible vhost-user-blk
> > device from the existing non-compatible one using the existing qemu
> > machinery, without any other modifications. That gives all the variety of
> > qemu device-related constraints out of the box.
> 
> Hmm I'm not sure I understand. What is the advantage for the user?
> What if vhost-user-blk became an alias for vhost-user-virtio-blk?
> We could add some hacks to make it compatible for old machine types.

The point is that virtio-blk and vhost-user-blk are not
migration-compatible ATM.  OTOH they are the same device from the guest
POV so there's nothing fundamentally preventing the migration between
the two.  In particular, we see it as a means to switch between the
storage backend transports via live migration without disrupting the
guest.

Migration-wise virtio-blk and vhost-user-blk have in common

- the content of the VMState -- VMSTATE_VIRTIO_DEVICE

The two differ in

- the name and the version of the VMStateDescription

- virtio-blk has an extra migration section (via .save/.load callbacks
  on VirtioDeviceClass) containing requests in flight

It looks like to become migration-compatible with virtio-blk,
vhost-user-blk has to start using VMStateDescription of virtio-blk and
provide compatible .save/.load callbacks.  It isn't entirely obvious how
to make this machine-type-dependent, so we came up with a simpler idea
of defining a new device that shares most of the implementation with the
original vhost-user-blk except for the migration stuff.  We're certainly
open to suggestions on how to reconcile this under a single
vhost-user-blk device, as this would be more user-friendly indeed.
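
For reference, the two descriptions under discussion look roughly like this
(sketch from memory, the exact version numbers may differ):

    static const VMStateDescription vmstate_virtio_blk = {
        .name = "virtio-blk",
        .minimum_version_id = 2,
        .version_id = 2,
        .fields = (VMStateField[]) {
            VMSTATE_VIRTIO_DEVICE,
            VMSTATE_END_OF_LIST()
        },
    };

    static const VMStateDescription vmstate_vhost_user_blk = {
        .name = "vhost-user-blk",
        .minimum_version_id = 1,
        .version_id = 1,
        .fields = (VMStateField[]) {
            VMSTATE_VIRTIO_DEVICE,
            VMSTATE_END_OF_LIST()
        },
    };

The payload is the same VMSTATE_VIRTIO_DEVICE in both; what differs on the wire
is the section name/version plus the extra in-flight-requests data that
virtio-blk saves via its .save/.load callbacks.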

We considered using a class property for this and defining the
respective compat clause, but IIUC the class constructors (where .vmsd
and .save/.load are defined) are not supposed to depend on class
properties.

Thanks,
Roman.



Re: [PATCH v3 06/33] util/async: aio_co_schedule(): support reschedule in same ctx

2021-05-14 Thread Roman Kagan
On Thu, May 13, 2021 at 11:04:37PM +0200, Paolo Bonzini wrote:
> On 12/05/21 09:15, Vladimir Sementsov-Ogievskiy wrote:
> > > > 
> > > 
> > > I don't understand.  Why doesn't aio_co_enter go through the ctx !=
> > > qemu_get_current_aio_context() branch and just do aio_co_schedule?
> > > That was at least the idea behind aio_co_wake and aio_co_enter.
> > > 
> > 
> > Because ctx is exactly qemu_get_current_aio_context(), as we are not in
> > iothread but in nbd connection thread. So,
> > qemu_get_current_aio_context() returns qemu_aio_context.
> 
> So the problem is that threads other than the main thread and
> the I/O thread should not return qemu_aio_context.  The vCPU thread
> may need to return it when running with BQL taken, though.

I'm not sure this is the only case.

AFAICS your patch has basically the same effect as Vladimir's
patch "util/async: aio_co_enter(): do aio_co_schedule in general case"
(https://lore.kernel.org/qemu-devel/20210408140827.332915-4-vsement...@virtuozzo.com/).
That one was found to break e.g. aio=threads cases.  I guessed it
implicitly relied upon aio_co_enter() acquiring the aio_context but we
didn't dig further to pinpoint the exact scenario.

Roman.

> Something like this (untested):
> 
> diff --git a/include/block/aio.h b/include/block/aio.h
> index 5f342267d5..10fcae1515 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -691,10 +691,13 @@ void aio_co_enter(AioContext *ctx, struct Coroutine 
> *co);
>   * Return the AioContext whose event loop runs in the current thread.
>   *
>   * If called from an IOThread this will be the IOThread's AioContext.  If
> - * called from another thread it will be the main loop AioContext.
> + * called from the main thread or with the "big QEMU lock" taken it
> + * will be the main loop AioContext.
>   */
>  AioContext *qemu_get_current_aio_context(void);
> +void qemu_set_current_aio_context(AioContext *ctx);
> +
>  /**
>   * aio_context_setup:
>   * @ctx: the aio context
> diff --git a/iothread.c b/iothread.c
> index 7f086387be..22b967e77c 100644
> --- a/iothread.c
> +++ b/iothread.c
> @@ -39,11 +39,23 @@ DECLARE_CLASS_CHECKERS(IOThreadClass, IOTHREAD,
>  #define IOTHREAD_POLL_MAX_NS_DEFAULT 0ULL
>  #endif
> -static __thread IOThread *my_iothread;
> +static __thread AioContext *my_aiocontext;
> +
> +void qemu_set_current_aio_context(AioContext *ctx)
> +{
> +assert(!my_aiocontext);
> +my_aiocontext = ctx;
> +}
>  AioContext *qemu_get_current_aio_context(void)
>  {
> -return my_iothread ? my_iothread->ctx : qemu_get_aio_context();
> +if (my_aiocontext) {
> +return my_aiocontext;
> +}
> +if (qemu_mutex_iothread_locked()) {
> +return qemu_get_aio_context();
> +}
> +return NULL;
>  }
>  static void *iothread_run(void *opaque)
> @@ -56,7 +68,7 @@ static void *iothread_run(void *opaque)
>   * in this new thread uses glib.
>   */
>  g_main_context_push_thread_default(iothread->worker_context);
> -my_iothread = iothread;
> +qemu_set_current_aio_context(iothread->ctx);
>  iothread->thread_id = qemu_get_thread_id();
>  qemu_sem_post(&iothread->init_done_sem);
> diff --git a/stubs/iothread.c b/stubs/iothread.c
> index 8cc9e28c55..25ff398894 100644
> --- a/stubs/iothread.c
> +++ b/stubs/iothread.c
> @@ -6,3 +6,7 @@ AioContext *qemu_get_current_aio_context(void)
>  {
>  return qemu_get_aio_context();
>  }
> +
> +void qemu_set_current_aio_context(AioContext *ctx)
> +{
> +}
> diff --git a/tests/unit/iothread.c b/tests/unit/iothread.c
> index afde12b4ef..cab38b3da8 100644
> --- a/tests/unit/iothread.c
> +++ b/tests/unit/iothread.c
> @@ -30,13 +30,26 @@ struct IOThread {
>  bool stopping;
>  };
> -static __thread IOThread *my_iothread;
> +static __thread AioContext *my_aiocontext;
> +
> +void qemu_set_current_aio_context(AioContext *ctx)
> +{
> +assert(!my_aiocontext);
> +my_aiocontext = ctx;
> +}
>  AioContext *qemu_get_current_aio_context(void)
>  {
> -return my_iothread ? my_iothread->ctx : qemu_get_aio_context();
> +if (my_aiocontext) {
> +return my_aiocontext;
> +}
> +if (qemu_mutex_iothread_locked()) {
> +return qemu_get_aio_context();
> +}
> +return NULL;
>  }
> +
>  static void iothread_init_gcontext(IOThread *iothread)
>  {
>  GSource *source;
> @@ -54,7 +67,7 @@ static void *iothread_run(void *opaque)
>  rcu_register_thread();
> -my_iothread = iothread;
> +qemu_set_current_aio_context(iothread->ctx);
>  qemu_mutex_lock(&iothread->init_done_lock);
>  iothread->ctx = aio_context_new(_abort);
> diff --git a/util/main-loop.c b/util/main-loop.c
> index d9c55df6f5..4ae5b23e99 100644
> --- a/util/main-loop.c
> +++ b/util/main-loop.c
> @@ -170,6 +170,7 @@ int qemu_init_main_loop(Error **errp)
>  if (!qemu_aio_context) {
>  return -EMFILE;
>  }
> +qemu_set_current_aio_context(qemu_aio_context);
>  qemu_notify_bh = qemu_bh_new(notify_event_cb, NULL);
>  gpollfds = 

Re: [PATCH v3 21/33] qemu-socket: pass monitor link to socket_get_fd directly

2021-05-12 Thread Roman Kagan
On Mon, Apr 19, 2021 at 10:34:49AM +0100, Daniel P. Berrangé wrote:
> On Fri, Apr 16, 2021 at 11:08:59AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > Detecting monitor by current coroutine works bad when we are not in
> > coroutine context. And that's exactly so in nbd reconnect code, where
> > qio_channel_socket_connect_sync() is called from thread.
> > 
> > Add a possibility to pass monitor by hand, to be used in the following
> > commit.
> > 
> > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > ---
> >  include/io/channel-socket.h| 20 
> >  include/qemu/sockets.h |  2 +-
> >  io/channel-socket.c| 17 +
> >  tests/unit/test-util-sockets.c | 16 
> >  util/qemu-sockets.c| 10 +-
> >  5 files changed, 47 insertions(+), 18 deletions(-)
> > 
> > diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
> > index e747e63514..6d0915420d 100644
> > --- a/include/io/channel-socket.h
> > +++ b/include/io/channel-socket.h
> > @@ -78,6 +78,23 @@ qio_channel_socket_new_fd(int fd,
> >Error **errp);
> >  
> >  
> > +/**
> > + * qio_channel_socket_connect_sync_mon:
> > + * @ioc: the socket channel object
> > + * @addr: the address to connect to
> > + * @mon: current monitor. If NULL, it will be detected by
> > + *   current coroutine.
> > + * @errp: pointer to a NULL-initialized error object
> > + *
> > + * Attempt to connect to the address @addr. This method
> > + * will run in the foreground so the caller will not regain
> > + * execution control until the connection is established or
> > + * an error occurs.
> > + */
> > +int qio_channel_socket_connect_sync_mon(QIOChannelSocket *ioc,
> > +SocketAddress *addr,
> > +Monitor *mon,
> > +Error **errp);
> 
> I don't really like exposing the concept of the QEMU monitor in
> the IO layer APIs. IMHO these ought to remain completely separate
> subsystems from the API pov,

Agreed. 

> and we ought to fix this problem by
> making monitor_cur() work better in the scenario required.

Would it make sense instead to resolve the fdstr into the actual file
descriptor number in the context where monitor_cur() works and makes
sense, prior to passing it to the connection thread?
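
Roughly what I have in mind (a sketch, assuming monitor_fd_param() remains the
helper that does the resolution):

    /* in the caller, where monitor_cur() is still meaningful */
    int fd = monitor_fd_param(monitor_cur(), fdstr, errp);
    if (fd < 0) {
        return -1;
    }
    /* ... then hand the numeric fd, not fdstr, to the thread doing the
     * synchronous connect, so it needs no monitor lookup at all ... */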

Roman.



Re: [PATCH v3 19/33] block/nbd: split nbd_handle_updated_info out of nbd_client_handshake()

2021-05-12 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:57AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> To be reused in the following patch.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 99 ++---
>  1 file changed, 57 insertions(+), 42 deletions(-)

Reviewed-by: Roman Kagan 



Re: [PATCH v3 18/33] nbd/client-connection: shutdown connection on release

2021-05-11 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:56AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Now that the thread can do negotiation and retry, it may run relatively
> long. We need a mechanism to stop it when the user is no longer
> interested in the result. So, on nbd_client_connection_release() let's
> shut down the socket, and do not retry the connection if the thread is
> detached.

This kinda answers my question to the previous patch about reconnect
cancellation.

> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  nbd/client-connection.c | 36 ++--
>  1 file changed, 26 insertions(+), 10 deletions(-)
> 
> diff --git a/nbd/client-connection.c b/nbd/client-connection.c
> index 002bd91f42..54f73c6c24 100644
> --- a/nbd/client-connection.c
> +++ b/nbd/client-connection.c
> @@ -158,9 +158,13 @@ static void *connect_thread_func(void *opaque)
>  uint64_t timeout = 1;
>  uint64_t max_timeout = 16;
>  
> -while (true) {
> +qemu_mutex_lock(&conn->mutex);
> +while (!conn->detached) {
> +assert(!conn->sioc);
>  conn->sioc = qio_channel_socket_new();
>  
> +qemu_mutex_unlock(&conn->mutex);
> +
>  error_free(conn->err);
>  conn->err = NULL;
>  conn->updated_info = conn->initial_info;
> @@ -171,14 +175,20 @@ static void *connect_thread_func(void *opaque)
>  conn->updated_info.x_dirty_bitmap = NULL;
>  conn->updated_info.name = NULL;
>  
> +qemu_mutex_lock(&conn->mutex);
> +
>  if (ret < 0) {
>  object_unref(OBJECT(conn->sioc));
>  conn->sioc = NULL;
> -if (conn->do_retry) {
> +if (conn->do_retry && !conn->detached) {
> +qemu_mutex_unlock(&conn->mutex);
> +
>  sleep(timeout);
>  if (timeout < max_timeout) {
>  timeout *= 2;
>  }
> +
> +qemu_mutex_lock(&conn->mutex);
>  continue;
>  }
>  }
> @@ -186,15 +196,17 @@ static void *connect_thread_func(void *opaque)
>  break;
>  }
>  
> -WITH_QEMU_LOCK_GUARD(&conn->mutex) {
> -assert(conn->running);
> -conn->running = false;
> -if (conn->wait_co) {
> -aio_co_schedule(NULL, conn->wait_co);
> -conn->wait_co = NULL;
> -}
> -do_free = conn->detached;
> +/* mutex is locked */
> +
> +assert(conn->running);
> +conn->running = false;
> +if (conn->wait_co) {
> +aio_co_schedule(NULL, conn->wait_co);
> +conn->wait_co = NULL;
>  }
> +do_free = conn->detached;
> +
> +qemu_mutex_unlock(&conn->mutex);

This basically reverts some hunks from patch 15 "nbd/client-connection:
use QEMU_LOCK_GUARD".  How about dropping them there (they weren't an
obvious improvement even then)?

Roman.

>  
>  if (do_free) {
>  nbd_client_connection_do_free(conn);
> @@ -215,6 +227,10 @@ void nbd_client_connection_release(NBDClientConnection 
> *conn)
>  if (conn->running) {
>  conn->detached = true;
>  }
> +if (conn->sioc) {
> +qio_channel_shutdown(QIO_CHANNEL(conn->sioc),
> + QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
> +}
>  do_free = !conn->running && !conn->detached;
>  qemu_mutex_unlock(&conn->mutex);
>  
> -- 
> 2.29.2
> 



Re: [PATCH v3 17/33] nbd/client-connection: implement connection retry

2021-05-11 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:55AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Add an option for thread to retry connection until success. We'll use
> nbd/client-connection both for reconnect and for initial connection in
> nbd_open(), so we need a possibility to use same NBDClientConnection
> instance to connect once in nbd_open() and then use retry semantics for
> reconnect.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  include/block/nbd.h |  2 ++
>  nbd/client-connection.c | 55 +
>  2 files changed, 41 insertions(+), 16 deletions(-)
> 
> diff --git a/include/block/nbd.h b/include/block/nbd.h
> index 5d86e6a393..5bb54d831c 100644
> --- a/include/block/nbd.h
> +++ b/include/block/nbd.h
> @@ -409,6 +409,8 @@ const char *nbd_err_lookup(int err);
>  /* nbd/client-connection.c */
>  typedef struct NBDClientConnection NBDClientConnection;
>  
> +void nbd_client_connection_enable_retry(NBDClientConnection *conn);
> +
>  NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr,
> bool do_negotiation,
> const char *export_name,
> diff --git a/nbd/client-connection.c b/nbd/client-connection.c
> index ae4a77f826..002bd91f42 100644
> --- a/nbd/client-connection.c
> +++ b/nbd/client-connection.c
> @@ -36,6 +36,8 @@ struct NBDClientConnection {
>  NBDExportInfo initial_info;
>  bool do_negotiation;
>  
> +bool do_retry;
> +
>  /*
>   * Result of last attempt. Valid in FAIL and SUCCESS states.
>   * If you want to steal error, don't forget to set pointer to NULL.
> @@ -52,6 +54,15 @@ struct NBDClientConnection {
>  Coroutine *wait_co; /* nbd_co_establish_connection() wait in yield() */
>  };
>  
> +/*
> + * The function isn't protected by any mutex, so call it when thread is not
> + * running.
> + */
> +void nbd_client_connection_enable_retry(NBDClientConnection *conn)
> +{
> +conn->do_retry = true;
> +}
> +
>  NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr,
> bool do_negotiation,
> const char *export_name,
> @@ -144,24 +155,37 @@ static void *connect_thread_func(void *opaque)
>  NBDClientConnection *conn = opaque;
>  bool do_free;
>  int ret;
> +uint64_t timeout = 1;
> +uint64_t max_timeout = 16;
> +
> +while (true) {
> +conn->sioc = qio_channel_socket_new();
> +
> +error_free(conn->err);
> +conn->err = NULL;
> +conn->updated_info = conn->initial_info;
> +
> +ret = nbd_connect(conn->sioc, conn->saddr,
> +  conn->do_negotiation ? &conn->updated_info : NULL,
> +  conn->tlscreds, &conn->ioc, &conn->err);
> +conn->updated_info.x_dirty_bitmap = NULL;
> +conn->updated_info.name = NULL;
> +
> +if (ret < 0) {
> +object_unref(OBJECT(conn->sioc));
> +conn->sioc = NULL;
> +if (conn->do_retry) {
> +sleep(timeout);
> +if (timeout < max_timeout) {
> +timeout *= 2;
> +}
> +continue;
> +}
> +}

How is it supposed to get canceled?

Roman.

> -conn->sioc = qio_channel_socket_new();
> -
> -error_free(conn->err);
> -conn->err = NULL;
> -conn->updated_info = conn->initial_info;
> -
> -ret = nbd_connect(conn->sioc, conn->saddr,
> -  conn->do_negotiation ? &conn->updated_info : NULL,
> -  conn->tlscreds, &conn->ioc, &conn->err);
> -if (ret < 0) {
> -object_unref(OBJECT(conn->sioc));
> -conn->sioc = NULL;
> +break;
>  }
>  
> -conn->updated_info.x_dirty_bitmap = NULL;
> -conn->updated_info.name = NULL;
> -
>  WITH_QEMU_LOCK_GUARD(&conn->mutex) {
>  assert(conn->running);
>  conn->running = false;
> @@ -172,7 +196,6 @@ static void *connect_thread_func(void *opaque)
>  do_free = conn->detached;
>  }
>  
> -
>  if (do_free) {
>  nbd_client_connection_do_free(conn);
>  }
> -- 
> 2.29.2
> 



Re: [PATCH v3 16/33] nbd/client-connection: add possibility of negotiation

2021-05-11 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:54AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Add arguments and logic to support nbd negotiation in the same thread
> after successful connection.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  include/block/nbd.h |   9 +++-
>  block/nbd.c |   4 +-
>  nbd/client-connection.c | 105 ++--
>  3 files changed, 109 insertions(+), 9 deletions(-)
> 
> diff --git a/include/block/nbd.h b/include/block/nbd.h
> index 57381be76f..5d86e6a393 100644
> --- a/include/block/nbd.h
> +++ b/include/block/nbd.h
> @@ -409,11 +409,16 @@ const char *nbd_err_lookup(int err);
>  /* nbd/client-connection.c */
>  typedef struct NBDClientConnection NBDClientConnection;
>  
> -NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr);
> +NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr,
> +   bool do_negotiation,
> +   const char *export_name,
> +   const char *x_dirty_bitmap,
> +   QCryptoTLSCreds *tlscreds);
>  void nbd_client_connection_release(NBDClientConnection *conn);
>  
>  QIOChannelSocket *coroutine_fn
> -nbd_co_establish_connection(NBDClientConnection *conn, Error **errp);
> +nbd_co_establish_connection(NBDClientConnection *conn, NBDExportInfo *info,
> +QIOChannel **ioc, Error **errp);
>  
>  void coroutine_fn nbd_co_establish_connection_cancel(NBDClientConnection 
> *conn);
>  
> diff --git a/block/nbd.c b/block/nbd.c
> index 9bd68dcf10..5e63caaf4b 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -361,7 +361,7 @@ static coroutine_fn void 
> nbd_reconnect_attempt(BDRVNBDState *s)
>  s->ioc = NULL;
>  }
>  
> -s->sioc = nbd_co_establish_connection(s->conn, NULL);
> +s->sioc = nbd_co_establish_connection(s->conn, NULL, NULL, NULL);
>  if (!s->sioc) {
>  ret = -ECONNREFUSED;
>  goto out;
> @@ -2033,7 +2033,7 @@ static int nbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  goto fail;
>  }
>  
> -s->conn = nbd_client_connection_new(s->saddr);
> +s->conn = nbd_client_connection_new(s->saddr, false, NULL, NULL, NULL);
>  
>  /*
>   * establish TCP connection, return error if it fails
> diff --git a/nbd/client-connection.c b/nbd/client-connection.c
> index b45a0bd5f6..ae4a77f826 100644
> --- a/nbd/client-connection.c
> +++ b/nbd/client-connection.c
> @@ -30,14 +30,19 @@
>  #include "qapi/clone-visitor.h"
>  
>  struct NBDClientConnection {
> -/* Initialization constants */
> +/* Initialization constants, never change */
>  SocketAddress *saddr; /* address to connect to */
> +QCryptoTLSCreds *tlscreds;
> +NBDExportInfo initial_info;
> +bool do_negotiation;
>  
>  /*
>   * Result of last attempt. Valid in FAIL and SUCCESS states.
>   * If you want to steal error, don't forget to set pointer to NULL.
>   */
> +NBDExportInfo updated_info;
>  QIOChannelSocket *sioc;
> +QIOChannel *ioc;
>  Error *err;
>  
>  QemuMutex mutex;
> @@ -47,12 +52,25 @@ struct NBDClientConnection {
>  Coroutine *wait_co; /* nbd_co_establish_connection() wait in yield() */
>  };
>  
> -NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr)
> +NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr,
> +   bool do_negotiation,
> +   const char *export_name,
> +   const char *x_dirty_bitmap,
> +   QCryptoTLSCreds *tlscreds)
>  {
>  NBDClientConnection *conn = g_new(NBDClientConnection, 1);
>  
> +object_ref(OBJECT(tlscreds));
>  *conn = (NBDClientConnection) {
>  .saddr = QAPI_CLONE(SocketAddress, saddr),
> +.tlscreds = tlscreds,
> +.do_negotiation = do_negotiation,
> +
> +.initial_info.request_sizes = true,
> +.initial_info.structured_reply = true,
> +.initial_info.base_allocation = true,
> +.initial_info.x_dirty_bitmap = g_strdup(x_dirty_bitmap),
> +.initial_info.name = g_strdup(export_name ?: "")
>  };
>  
>  qemu_mutex_init(&conn->mutex);
> @@ -68,9 +86,59 @@ static void 
> nbd_client_connection_do_free(NBDClientConnection *conn)
>  }
>  error_free(conn->err);
>  qapi_free_SocketAddress(conn->saddr);
> +object_unref(OBJECT(conn->tlscreds));
> +g_free(conn->initial_info.x_dirty_bitmap);
> +g_free(conn->initial_info.name);
>  g_free(conn);
>  }
>  
> +/*
> + * Connect to @addr and do NBD negotiation if @info is not null. If @tlscreds
> + * given @outioc is provided. @outioc is provided only on success.  The call 
> may

s/given/are given/
s/provided/returned/g

> + * 

Re: [PATCH v3 15/33] nbd/client-connection: use QEMU_LOCK_GUARD

2021-04-28 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:53AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  nbd/client-connection.c | 94 ++---
>  1 file changed, 42 insertions(+), 52 deletions(-)
> 
> diff --git a/nbd/client-connection.c b/nbd/client-connection.c
> index 4e39a5b1af..b45a0bd5f6 100644
> --- a/nbd/client-connection.c
> +++ b/nbd/client-connection.c
> @@ -87,17 +87,16 @@ static void *connect_thread_func(void *opaque)
>  conn->sioc = NULL;
>  }
>  
> -qemu_mutex_lock(&conn->mutex);
> -
> -assert(conn->running);
> -conn->running = false;
> -if (conn->wait_co) {
> -aio_co_schedule(NULL, conn->wait_co);
> -conn->wait_co = NULL;
> +WITH_QEMU_LOCK_GUARD(&conn->mutex) {
> +assert(conn->running);
> +conn->running = false;
> +if (conn->wait_co) {
> +aio_co_schedule(NULL, conn->wait_co);
> +conn->wait_co = NULL;
> +}
>  }
>  do_free = conn->detached;

->detached is now accessed outside the mutex
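
Presumably the assignment just needs to move inside the guarded block,
something like (untested):

    WITH_QEMU_LOCK_GUARD(&conn->mutex) {
        assert(conn->running);
        conn->running = false;
        if (conn->wait_co) {
            aio_co_schedule(NULL, conn->wait_co);
            conn->wait_co = NULL;
        }
        do_free = conn->detached;
    }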

>  
> -qemu_mutex_unlock(&conn->mutex);
>  
>  if (do_free) {
>  nbd_client_connection_do_free(conn);
> @@ -136,61 +135,54 @@ void nbd_client_connection_release(NBDClientConnection 
> *conn)
>  QIOChannelSocket *coroutine_fn
>  nbd_co_establish_connection(NBDClientConnection *conn, Error **errp)
>  {
> -QIOChannelSocket *sioc = NULL;
>  QemuThread thread;
>  
> -qemu_mutex_lock(&conn->mutex);
> -
> -/*
> - * Don't call nbd_co_establish_connection() in several coroutines in
> - * parallel. Only one call at once is supported.
> - */
> -assert(!conn->wait_co);
> -
> -if (!conn->running) {
> -if (conn->sioc) {
> -/* Previous attempt finally succeeded in background */
> -sioc = g_steal_pointer(&conn->sioc);
> -qemu_mutex_unlock(&conn->mutex);
> -
> -return sioc;
> +WITH_QEMU_LOCK_GUARD(&conn->mutex) {
> +/*
> + * Don't call nbd_co_establish_connection() in several coroutines in
> + * parallel. Only one call at once is supported.
> + */
> +assert(!conn->wait_co);
> +
> +if (!conn->running) {
> +if (conn->sioc) {
> +/* Previous attempt finally succeeded in background */
> +return g_steal_pointer(&conn->sioc);
> +}
> +
> +conn->running = true;
> +error_free(conn->err);
> +conn->err = NULL;
> +qemu_thread_create(&thread, "nbd-connect",
> +   connect_thread_func, conn, 
> QEMU_THREAD_DETACHED);
>  }
>  
> -conn->running = true;
> -error_free(conn->err);
> -conn->err = NULL;
> -qemu_thread_create(&thread, "nbd-connect",
> -   connect_thread_func, conn, QEMU_THREAD_DETACHED);
> +conn->wait_co = qemu_coroutine_self();
>  }
>  
> -conn->wait_co = qemu_coroutine_self();
> -
> -qemu_mutex_unlock(&conn->mutex);
> -
>  /*
>   * We are going to wait for connect-thread finish, but
>   * nbd_co_establish_connection_cancel() can interrupt.
>   */
>  qemu_coroutine_yield();
>  
> -qemu_mutex_lock(&conn->mutex);
> -
> -if (conn->running) {
> -/*
> - * Obviously, drained section wants to start. Report the attempt as
> - * failed. Still connect thread is executing in background, and its
> - * result may be used for next connection attempt.
> - */
> -error_setg(errp, "Connection attempt cancelled by other operation");
> -} else {
> -error_propagate(errp, conn->err);
> -conn->err = NULL;
> -sioc = g_steal_pointer(&conn->sioc);
> +WITH_QEMU_LOCK_GUARD(&conn->mutex) {
> +if (conn->running) {
> +/*
> + * Obviously, drained section wants to start. Report the attempt 
> as
> + * failed. Still connect thread is executing in background, and 
> its
> + * result may be used for next connection attempt.
> + */
> +error_setg(errp, "Connection attempt cancelled by other 
> operation");
> +return NULL;
> +} else {
> +error_propagate(errp, conn->err);
> +conn->err = NULL;
> +return g_steal_pointer(&conn->sioc);
> +}
>  }
>  
> -qemu_mutex_unlock(&conn->mutex);
> -
> -return sioc;
> +abort(); /* unreachable */
>  }
>  
>  /*
> @@ -201,12 +193,10 @@ nbd_co_establish_connection(NBDClientConnection *conn, 
> Error **errp)
>   */
>  void coroutine_fn nbd_co_establish_connection_cancel(NBDClientConnection 
> *conn)
>  {
> -qemu_mutex_lock(&conn->mutex);
> +QEMU_LOCK_GUARD(&conn->mutex);
>  
>  if (conn->wait_co) {
>  aio_co_schedule(NULL, conn->wait_co);
>  conn->wait_co = NULL;
>  }
> -
> -qemu_mutex_unlock(&conn->mutex);
>  }
> -- 
> 2.29.2
> 



Re: [PATCH v3 14/33] nbd: move connection code from block/nbd to nbd/client-connection

2021-04-27 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:52AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We now have bs-independent connection API, which consists of four
> functions:
> 
>   nbd_client_connection_new()
>   nbd_client_connection_unref()
>   nbd_co_establish_connection()
>   nbd_co_establish_connection_cancel()
> 
> Move them to a separate file together with NBDClientConnection
> structure which becomes private to the new API.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  include/block/nbd.h |  11 +++
>  block/nbd.c | 187 ---
>  nbd/client-connection.c | 212 
>  nbd/meson.build |   1 +
>  4 files changed, 224 insertions(+), 187 deletions(-)
>  create mode 100644 nbd/client-connection.c
> 
> diff --git a/include/block/nbd.h b/include/block/nbd.h
> index 5f34d23bb0..57381be76f 100644
> --- a/include/block/nbd.h
> +++ b/include/block/nbd.h
> @@ -406,4 +406,15 @@ const char *nbd_info_lookup(uint16_t info);
>  const char *nbd_cmd_lookup(uint16_t info);
>  const char *nbd_err_lookup(int err);
>  
> +/* nbd/client-connection.c */
> +typedef struct NBDClientConnection NBDClientConnection;
> +
> +NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr);
> +void nbd_client_connection_release(NBDClientConnection *conn);
> +
> +QIOChannelSocket *coroutine_fn
> +nbd_co_establish_connection(NBDClientConnection *conn, Error **errp);
> +
> +void coroutine_fn nbd_co_establish_connection_cancel(NBDClientConnection 
> *conn);
> +
>  #endif
> diff --git a/block/nbd.c b/block/nbd.c
> index 8531d019b2..9bd68dcf10 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -66,24 +66,6 @@ typedef enum NBDClientState {
>  NBD_CLIENT_QUIT
>  } NBDClientState;
>  
> -typedef struct NBDClientConnection {
> -/* Initialization constants */
> -SocketAddress *saddr; /* address to connect to */
> -
> -/*
> - * Result of last attempt. Valid in FAIL and SUCCESS states.
> - * If you want to steal error, don't forget to set pointer to NULL.
> - */
> -QIOChannelSocket *sioc;
> -Error *err;
> -
> -QemuMutex mutex;
> -/* All further fields are protected by mutex */
> -bool running; /* thread is running now */
> -bool detached; /* thread is detached and should cleanup the state */
> -Coroutine *wait_co; /* nbd_co_establish_connection() wait in yield() */
> -} NBDClientConnection;
> -
>  typedef struct BDRVNBDState {
>  QIOChannelSocket *sioc; /* The master data channel */
>  QIOChannel *ioc; /* The current I/O channel which may differ (eg TLS) */
> @@ -118,12 +100,8 @@ typedef struct BDRVNBDState {
>  NBDClientConnection *conn;
>  } BDRVNBDState;
>  
> -static void nbd_client_connection_release(NBDClientConnection *conn);
>  static int nbd_establish_connection(BlockDriverState *bs, SocketAddress 
> *saddr,
>  Error **errp);
> -static coroutine_fn QIOChannelSocket *
> -nbd_co_establish_connection(NBDClientConnection *conn, Error **errp);
> -static void nbd_co_establish_connection_cancel(NBDClientConnection *conn);
>  static int nbd_client_handshake(BlockDriverState *bs, Error **errp);
>  static void nbd_yank(void *opaque);
>  
> @@ -340,171 +318,6 @@ static bool nbd_client_connecting_wait(BDRVNBDState *s)
>  return qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTING_WAIT;
>  }
>  
> -static NBDClientConnection *
> -nbd_client_connection_new(const SocketAddress *saddr)
> -{
> -NBDClientConnection *conn = g_new(NBDClientConnection, 1);
> -
> -*conn = (NBDClientConnection) {
> -.saddr = QAPI_CLONE(SocketAddress, saddr),
> -};
> -
> -qemu_mutex_init(&conn->mutex);
> -
> -return conn;
> -}
> -
> -static void nbd_client_connection_do_free(NBDClientConnection *conn)
> -{
> -if (conn->sioc) {
> -qio_channel_close(QIO_CHANNEL(conn->sioc), NULL);
> -object_unref(OBJECT(conn->sioc));
> -}
> -error_free(conn->err);
> -qapi_free_SocketAddress(conn->saddr);
> -g_free(conn);
> -}
> -
> -static void *connect_thread_func(void *opaque)
> -{
> -NBDClientConnection *conn = opaque;
> -bool do_free;
> -int ret;
> -
> -conn->sioc = qio_channel_socket_new();
> -
> -error_free(conn->err);
> -conn->err = NULL;
> -ret = qio_channel_socket_connect_sync(conn->sioc, conn->saddr, 
> &conn->err);
> -if (ret < 0) {
> -object_unref(OBJECT(conn->sioc));
> -conn->sioc = NULL;
> -}
> -
> -qemu_mutex_lock(&conn->mutex);
> -
> -assert(conn->running);
> -conn->running = false;
> -if (conn->wait_co) {
> -aio_co_schedule(NULL, conn->wait_co);
> -conn->wait_co = NULL;
> -}
> -do_free = conn->detached;
> -
> -qemu_mutex_unlock(&conn->mutex);
> -
> -if (do_free) {
> -nbd_client_connection_do_free(conn);
> -}
> -
> -return NULL;
> -}
> -
> -static void nbd_client_connection_release(NBDClientConnection *conn)
> -{
> -

Re: [PATCH v3 13/33] block/nbd: introduce nbd_client_connection_release()

2021-04-27 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:51AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 43 ++-
>  1 file changed, 26 insertions(+), 17 deletions(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index 21a4039359..8531d019b2 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -118,7 +118,7 @@ typedef struct BDRVNBDState {
>  NBDClientConnection *conn;
>  } BDRVNBDState;
>  
> -static void nbd_free_connect_thread(NBDClientConnection *conn);
> +static void nbd_client_connection_release(NBDClientConnection *conn);
>  static int nbd_establish_connection(BlockDriverState *bs, SocketAddress 
> *saddr,
>  Error **errp);
>  static coroutine_fn QIOChannelSocket *
> @@ -130,20 +130,9 @@ static void nbd_yank(void *opaque);
>  static void nbd_clear_bdrvstate(BlockDriverState *bs)
>  {
>  BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
> -NBDClientConnection *conn = s->conn;
> -bool do_free;
> -
> -qemu_mutex_lock(&conn->mutex);
> -if (conn->running) {
> -conn->detached = true;
> -}
> -do_free = !conn->running && !conn->detached;
> -qemu_mutex_unlock(&conn->mutex);
>  
> -/* the runaway thread will clean it up itself */
> -if (do_free) {
> -nbd_free_connect_thread(conn);
> -}
> +nbd_client_connection_release(s->conn);
> +s->conn = NULL;
>  
>  yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
>  
> @@ -365,7 +354,7 @@ nbd_client_connection_new(const SocketAddress *saddr)
>  return conn;
>  }
>  
> -static void nbd_free_connect_thread(NBDClientConnection *conn)
> +static void nbd_client_connection_do_free(NBDClientConnection *conn)
>  {
>  if (conn->sioc) {
>  qio_channel_close(QIO_CHANNEL(conn->sioc), NULL);
> @@ -379,8 +368,8 @@ static void nbd_free_connect_thread(NBDClientConnection 
> *conn)
>  static void *connect_thread_func(void *opaque)
>  {
>  NBDClientConnection *conn = opaque;
> +bool do_free;
>  int ret;
> -bool do_free = false;
>  

This hunk belongs in patch 8.

Roman.

>  conn->sioc = qio_channel_socket_new();
>  
> @@ -405,12 +394,32 @@ static void *connect_thread_func(void *opaque)
>  qemu_mutex_unlock(>mutex);
>  
>  if (do_free) {
> -nbd_free_connect_thread(conn);
> +nbd_client_connection_do_free(conn);
>  }
>  
>  return NULL;
>  }
>  
> +static void nbd_client_connection_release(NBDClientConnection *conn)
> +{
> +bool do_free;
> +
> +if (!conn) {
> +return;
> +}
> +
> +qemu_mutex_lock(&conn->mutex);
> +if (conn->running) {
> +conn->detached = true;
> +}
> +do_free = !conn->running && !conn->detached;
> +qemu_mutex_unlock(&conn->mutex);
> +
> +if (do_free) {
> +nbd_client_connection_do_free(conn);
> +}
> +}
> +
>  /*
>   * Get a new connection in context of @conn:
>   *   if thread is running, wait for completion
> -- 
> 2.29.2
> 



Re: [PATCH v3 11/33] block/nbd: rename NBDConnectThread to NBDClientConnection

2021-04-27 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:49AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We are going to move connection code to own file and want clear names
> and APIs.
> 
> The structure is shared between user and (possibly) several runs of
> connect-thread. So it's wrong to call it "thread". Let's rename to
> something more generic.
> 
> Appropriately rename connect_thread and thr variables to conn.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 137 ++--
>  1 file changed, 68 insertions(+), 69 deletions(-)

Reviewed-by: Roman Kagan 



Re: [PATCH v3 08/33] block/nbd: drop thr->state

2021-04-27 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:46AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We don't need all these states. The code refactored to use two boolean
> variables looks simpler.

Indeed.

> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 125 ++--
>  1 file changed, 34 insertions(+), 91 deletions(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index e1f39eda6c..2b26a033a4 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -66,24 +66,6 @@ typedef enum NBDClientState {
>  NBD_CLIENT_QUIT
>  } NBDClientState;
>  
> -typedef enum NBDConnectThreadState {
> -/* No thread, no pending results */
> -CONNECT_THREAD_NONE,
> -
> -/* Thread is running, no results for now */
> -CONNECT_THREAD_RUNNING,
> -
> -/*
> - * Thread is running, but requestor exited. Thread should close
> - * the new socket and free the connect state on exit.
> - */
> -CONNECT_THREAD_RUNNING_DETACHED,
> -
> -/* Thread finished, results are stored in a state */
> -CONNECT_THREAD_FAIL,
> -CONNECT_THREAD_SUCCESS
> -} NBDConnectThreadState;
> -
>  typedef struct NBDConnectThread {
>  /* Initialization constants */
>  SocketAddress *saddr; /* address to connect to */
> @@ -97,7 +79,8 @@ typedef struct NBDConnectThread {
>  
>  QemuMutex mutex;
>  /* All further fields are protected by mutex */
> -NBDConnectThreadState state; /* current state of the thread */
> +bool running; /* thread is running now */
> +bool detached; /* thread is detached and should cleanup the state */
>  Coroutine *wait_co; /* nbd_co_establish_connection() wait in yield() */
>  } NBDConnectThread;
>  
> @@ -147,17 +130,17 @@ static void nbd_clear_bdrvstate(BlockDriverState *bs)
>  {
>  BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
>  NBDConnectThread *thr = s->connect_thread;
> -bool thr_running;
> +bool do_free;
>  
>  qemu_mutex_lock(&thr->mutex);
> -thr_running = thr->state == CONNECT_THREAD_RUNNING;
> -if (thr_running) {
> -thr->state = CONNECT_THREAD_RUNNING_DETACHED;
> +if (thr->running) {
> +thr->detached = true;
>  }
> +do_free = !thr->running && !thr->detached;

This is redundant.  You can unconditionally set ->detached and only
depend on ->running for the rest of this function.
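
I.e. something like (untested):

    qemu_mutex_lock(&thr->mutex);
    thr->detached = true;
    do_free = !thr->running;
    qemu_mutex_unlock(&thr->mutex);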

>  qemu_mutex_unlock(&thr->mutex);
>  
>  /* the runaway thread will clean it up itself */
> -if (!thr_running) {
> +if (do_free) {
>  nbd_free_connect_thread(thr);
>  }
>  
> @@ -373,7 +356,6 @@ static void nbd_init_connect_thread(BDRVNBDState *s)
>  
>  *s->connect_thread = (NBDConnectThread) {
>  .saddr = QAPI_CLONE(SocketAddress, s->saddr),
> -.state = CONNECT_THREAD_NONE,
>  };
>  
>  qemu_mutex_init(&s->connect_thread->mutex);
> @@ -408,20 +390,13 @@ static void *connect_thread_func(void *opaque)
>  
>  qemu_mutex_lock(&thr->mutex);
>  
> -switch (thr->state) {
> -case CONNECT_THREAD_RUNNING:
> -thr->state = ret < 0 ? CONNECT_THREAD_FAIL : CONNECT_THREAD_SUCCESS;
> -if (thr->wait_co) {
> -aio_co_schedule(NULL, thr->wait_co);
> -thr->wait_co = NULL;
> -}
> -break;
> -case CONNECT_THREAD_RUNNING_DETACHED:
> -do_free = true;
> -break;
> -default:
> -abort();
> +assert(thr->running);
> +thr->running = false;
> +if (thr->wait_co) {
> +aio_co_schedule(NULL, thr->wait_co);
> +thr->wait_co = NULL;
>  }
> +do_free = thr->detached;
>  
>  qemu_mutex_unlock(&thr->mutex);
>  
> @@ -435,36 +410,24 @@ static void *connect_thread_func(void *opaque)
>  static int coroutine_fn
>  nbd_co_establish_connection(BlockDriverState *bs, Error **errp)
>  {
> -int ret;
>  QemuThread thread;
>  BDRVNBDState *s = bs->opaque;
>  NBDConnectThread *thr = s->connect_thread;
>  
> +assert(!s->sioc);
> +
>  qemu_mutex_lock(&thr->mutex);
>  
> -switch (thr->state) {
> -case CONNECT_THREAD_FAIL:
> -case CONNECT_THREAD_NONE:
> +if (!thr->running) {
> +if (thr->sioc) {
> +/* Previous attempt finally succeeded in background */
> +goto out;
> +}
> +thr->running = true;
>  error_free(thr->err);
>  thr->err = NULL;
> -thr->state = CONNECT_THREAD_RUNNING;
>  qemu_thread_create(&thread, "nbd-connect",
> connect_thread_func, thr, QEMU_THREAD_DETACHED);
> -break;
> -case CONNECT_THREAD_SUCCESS:
> -/* Previous attempt finally succeeded in background */
> -thr->state = CONNECT_THREAD_NONE;
> -s->sioc = thr->sioc;
> -thr->sioc = NULL;
> -yank_register_function(BLOCKDEV_YANK_INSTANCE(bs->node_name),
> -   nbd_yank, bs);
> -qemu_mutex_unlock(&thr->mutex);
> -return 0;
> -case CONNECT_THREAD_RUNNING:
> -/* Already running, will 

Re: [PATCH v3 07/33] block/nbd: simplify waking of nbd_co_establish_connection()

2021-04-27 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:45AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Instead of connect_bh, bh_ctx and wait_connect fields we can live with
> only one link to waiting coroutine, protected by mutex.
> 
> So new logic is:
> 
> nbd_co_establish_connection() sets wait_co under mutex, release the
> mutex and do yield(). Note, that wait_co may be scheduled by thread
> immediately after unlocking the mutex. Still, in main thread (or
> iothread) we'll not reach the code for entering the coroutine until the
> yield() so we are safe.
> 
> Both connect_thread_func() and nbd_co_establish_connection_cancel() do
> the following to handle wait_co:
> 
> Under mutex, if thr->wait_co is not NULL, call aio_co_wake() (which
> never tries to acquire aio context since previous commit, so we are
> safe to do it under thr->mutex) and set thr->wait_co to NULL.
> This way we protect ourselves of scheduling it twice.
> 
> Also this commit make nbd_co_establish_connection() less connected to
> bs (we have generic pointer to the coroutine, not use s->connection_co
> directly). So, we are on the way of splitting connection API out of
> nbd.c (which is overcomplicated now).
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 49 +
>  1 file changed, 9 insertions(+), 40 deletions(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index d67556c7ee..e1f39eda6c 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -87,12 +87,6 @@ typedef enum NBDConnectThreadState {
>  typedef struct NBDConnectThread {
>  /* Initialization constants */
>  SocketAddress *saddr; /* address to connect to */
> -/*
> - * Bottom half to schedule on completion. Scheduled only if bh_ctx is not
> - * NULL
> - */
> -QEMUBHFunc *bh_func;
> -void *bh_opaque;
>  
>  /*
>   * Result of last attempt. Valid in FAIL and SUCCESS states.
> @@ -101,10 +95,10 @@ typedef struct NBDConnectThread {
>  QIOChannelSocket *sioc;
>  Error *err;
>  
> -/* state and bh_ctx are protected by mutex */
>  QemuMutex mutex;
> +/* All further fields are protected by mutex */
>  NBDConnectThreadState state; /* current state of the thread */
> -AioContext *bh_ctx; /* where to schedule bh (NULL means don't schedule) 
> */
> +Coroutine *wait_co; /* nbd_co_establish_connection() wait in yield() */
>  } NBDConnectThread;
>  
>  typedef struct BDRVNBDState {
> @@ -138,7 +132,6 @@ typedef struct BDRVNBDState {
>  char *x_dirty_bitmap;
>  bool alloc_depth;
>  
> -bool wait_connect;
>  NBDConnectThread *connect_thread;
>  } BDRVNBDState;
>  
> @@ -374,15 +367,6 @@ static bool nbd_client_connecting_wait(BDRVNBDState *s)
>  return qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTING_WAIT;
>  }
>  
> -static void connect_bh(void *opaque)
> -{
> -BDRVNBDState *state = opaque;
> -
> -assert(state->wait_connect);
> -state->wait_connect = false;
> -aio_co_wake(state->connection_co);
> -}
> -
>  static void nbd_init_connect_thread(BDRVNBDState *s)
>  {
>  s->connect_thread = g_new(NBDConnectThread, 1);
> @@ -390,8 +374,6 @@ static void nbd_init_connect_thread(BDRVNBDState *s)
>  *s->connect_thread = (NBDConnectThread) {
>  .saddr = QAPI_CLONE(SocketAddress, s->saddr),
>  .state = CONNECT_THREAD_NONE,
> -.bh_func = connect_bh,
> -.bh_opaque = s,
>  };
>  
>  qemu_mutex_init(&s->connect_thread->mutex);
> @@ -429,11 +411,9 @@ static void *connect_thread_func(void *opaque)
>  switch (thr->state) {
>  case CONNECT_THREAD_RUNNING:
>  thr->state = ret < 0 ? CONNECT_THREAD_FAIL : CONNECT_THREAD_SUCCESS;
> -if (thr->bh_ctx) {
> -aio_bh_schedule_oneshot(thr->bh_ctx, thr->bh_func, 
> thr->bh_opaque);
> -
> -/* play safe, don't reuse bh_ctx on further connection attempts 
> */
> -thr->bh_ctx = NULL;
> +if (thr->wait_co) {
> +aio_co_schedule(NULL, thr->wait_co);
> +thr->wait_co = NULL;
>  }
>  break;
>  case CONNECT_THREAD_RUNNING_DETACHED:
> @@ -487,20 +467,14 @@ nbd_co_establish_connection(BlockDriverState *bs, Error 
> **errp)
>  abort();
>  }
>  
> -thr->bh_ctx = qemu_get_current_aio_context();
> +thr->wait_co = qemu_coroutine_self();
>  
>  qemu_mutex_unlock(&thr->mutex);
>  
> -
>  /*
>   * We are going to wait for connect-thread finish, but
>   * nbd_client_co_drain_begin() can interrupt.
> - *
> - * Note that wait_connect variable is not visible for connect-thread. It
> - * doesn't need mutex protection, it used only inside home aio context of
> - * bs.
>   */
> -s->wait_connect = true;
>  qemu_coroutine_yield();
>  
>  qemu_mutex_lock(&thr->mutex);
> @@ -555,24 +529,19 @@ static void 
> nbd_co_establish_connection_cancel(BlockDriverState *bs)
>  {
>  BDRVNBDState *s = bs->opaque;
>  NBDConnectThread *thr = 

Re: [PATCH v3 06/33] util/async: aio_co_schedule(): support reschedule in same ctx

2021-04-23 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:44AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> With the following patch we want to call wake coroutine from thread.
> And it doesn't work with aio_co_wake:
> Assume we have no iothreads.
> Assume we have a coroutine A, which waits in the yield point for
> external aio_co_wake(), and no progress can be done until it happen.
> Main thread is in blocking aio_poll() (for example, in blk_read()).
> 
> Now, in a separate thread we do aio_co_wake(). It calls  aio_co_enter(),
> which goes through last "else" branch and do aio_context_acquire(ctx).
> 
> Now we have a deadlock, as aio_poll() will not release the context lock
> until some progress is done, and progress can't be done until
> aio_co_wake() wake the coroutine A. And it can't because it wait for
> aio_context_acquire().
> 
> Still, aio_co_schedule() works well in parallel with blocking
> aio_poll(). So we want use it. Let's add a possibility of rescheduling
> coroutine in same ctx where it was yield'ed.
> 
> Fetch co->ctx in same way as it is done in aio_co_wake().
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  include/block/aio.h | 2 +-
>  util/async.c| 8 
>  2 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/include/block/aio.h b/include/block/aio.h
> index 5f342267d5..744b695525 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -643,7 +643,7 @@ static inline bool aio_node_check(AioContext *ctx, bool 
> is_external)
>  
>  /**
>   * aio_co_schedule:
> - * @ctx: the aio context
> + * @ctx: the aio context, if NULL, the current ctx of @co will be used.
>   * @co: the coroutine
>   *
>   * Start a coroutine on a remote AioContext.
> diff --git a/util/async.c b/util/async.c
> index 674dbefb7c..750be555c6 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -545,6 +545,14 @@ fail:
>  
>  void aio_co_schedule(AioContext *ctx, Coroutine *co)
>  {
> +if (!ctx) {
> +/*
> + * Read coroutine before co->ctx.  Matches smp_wmb in
> + * qemu_coroutine_enter.
> + */
> +smp_read_barrier_depends();
> +ctx = qatomic_read(&co->ctx);
> +}

I'd rather not extend this interface, but add a new one on top.  And
document how it's different from aio_co_wake().
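
Something along these lines, perhaps (untested, the name is only a
strawman):

    /*
     * Like aio_co_wake(), but always defers to aio_co_schedule(), so it
     * never tries to take the AioContext lock and is therefore safe to
     * call from an arbitrary thread even while the target context sits
     * in a blocking aio_poll().
     */
    void aio_co_schedule_wake(Coroutine *co)
    {
        AioContext *ctx;

        /*
         * Read coroutine before co->ctx.  Matches smp_wmb in
         * qemu_coroutine_enter.
         */
        smp_read_barrier_depends();
        ctx = qatomic_read(&co->ctx);

        aio_co_schedule(ctx, co);
    }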

Roman.

>  trace_aio_co_schedule(ctx, co);
>  const char *scheduled = qatomic_cmpxchg(&co->scheduled, NULL,
> __func__);
> -- 
> 2.29.2
> 



Re: [PATCH v3 04/33] block/nbd: nbd_client_handshake(): fix leak of s->ioc

2021-04-22 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:42AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 2 ++
>  1 file changed, 2 insertions(+)

Reviewed-by: Roman Kagan 



Re: [PATCH v3 02/33] block/nbd: fix how state is cleared on nbd_open() failure paths

2021-04-22 Thread Roman Kagan
On Thu, Apr 22, 2021 at 01:27:22AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 21.04.2021 17:00, Roman Kagan wrote:
> > On Fri, Apr 16, 2021 at 11:08:40AM +0300, Vladimir Sementsov-Ogievskiy 
> > wrote:
> > > @@ -2305,20 +2301,23 @@ static int nbd_open(BlockDriverState *bs, QDict 
> > > *options, int flags,
> > >   return -EEXIST;
> > >   }
> > > +ret = nbd_process_options(bs, options, errp);
> > > +if (ret < 0) {
> > > +goto fail;
> > > +}
> > > +
> > >   /*
> > >* establish TCP connection, return error if it fails
> > >* TODO: Configurable retry-until-timeout behaviour.
> > >*/
> > >   if (nbd_establish_connection(bs, s->saddr, errp) < 0) {
> > > -yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
> > > -return -ECONNREFUSED;
> > > +ret = -ECONNREFUSED;
> > > +goto fail;
> > >   }
> > >   ret = nbd_client_handshake(bs, errp);
> > Not that this was introduced by this patch, but while you're at it:
> > AFAICT nbd_client_handshake() calls yank_unregister_instance() on some
> > error path(s); I assume this needs to go too, otherwise it's called
> > twice (and asserts).
> > 
> > Roman.
> > 
> 
> No, nbd_client_handshake() only calls yank_unregister_function(), not 
> instance.

Indeed.  Sorry for the confusion.

Reviewed-by: Roman Kagan 



Re: [PATCH v3 02/33] block/nbd: fix how state is cleared on nbd_open() failure paths

2021-04-21 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:40AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We have two "return error" paths in nbd_open() after
> nbd_process_options(). Actually we should call nbd_clear_bdrvstate()
> on these paths. Interesting that nbd_process_options() calls
> nbd_clear_bdrvstate() by itself.
> 
> Let's fix leaks and refactor things to be more obvious:
> 
> - intialize yank at top of nbd_open()
> - move yank cleanup to nbd_clear_bdrvstate()
> - refactor nbd_open() so that all failure paths except for
>   yank-register goes through nbd_clear_bdrvstate()
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 36 ++--
>  1 file changed, 18 insertions(+), 18 deletions(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index 739ae2941f..a407a3814b 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -152,8 +152,12 @@ static void 
> nbd_co_establish_connection_cancel(BlockDriverState *bs,
>  static int nbd_client_handshake(BlockDriverState *bs, Error **errp);
>  static void nbd_yank(void *opaque);
>  
> -static void nbd_clear_bdrvstate(BDRVNBDState *s)
> +static void nbd_clear_bdrvstate(BlockDriverState *bs)
>  {
> +BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
> +
> +yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
> +
>  object_unref(OBJECT(s->tlscreds));
>  qapi_free_SocketAddress(s->saddr);
>  s->saddr = NULL;
> @@ -2279,9 +2283,6 @@ static int nbd_process_options(BlockDriverState *bs, 
> QDict *options,
>  ret = 0;
>  
>   error:
> -if (ret < 0) {
> -nbd_clear_bdrvstate(s);
> -}
>  qemu_opts_del(opts);
>  return ret;
>  }
> @@ -2292,11 +2293,6 @@ static int nbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  int ret;
>  BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
>  
> -ret = nbd_process_options(bs, options, errp);
> -if (ret < 0) {
> -return ret;
> -}
> -
>  s->bs = bs;
>  qemu_co_mutex_init(&s->send_mutex);
>  qemu_co_queue_init(&s->free_sema);
> @@ -2305,20 +2301,23 @@ static int nbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  return -EEXIST;
>  }
>  
> +ret = nbd_process_options(bs, options, errp);
> +if (ret < 0) {
> +goto fail;
> +}
> +
>  /*
>   * establish TCP connection, return error if it fails
>   * TODO: Configurable retry-until-timeout behaviour.
>   */
>  if (nbd_establish_connection(bs, s->saddr, errp) < 0) {
> -yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
> -return -ECONNREFUSED;
> +ret = -ECONNREFUSED;
> +goto fail;
>  }
>  
>  ret = nbd_client_handshake(bs, errp);

Not that this was introduced by this patch, but while you're at it:
AFAICT nbd_client_handshake() calls yank_unregister_instance() on some
error path(s); I assume this needs to go too, otherwise it's called
twice (and asserts).

Roman.

>  if (ret < 0) {
> -yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
> -nbd_clear_bdrvstate(s);
> -return ret;
> +goto fail;
>  }
>  /* successfully connected */
>  s->state = NBD_CLIENT_CONNECTED;
> @@ -2330,6 +2329,10 @@ static int nbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  aio_co_schedule(bdrv_get_aio_context(bs), s->connection_co);
>  
>  return 0;
> +
> +fail:
> +nbd_clear_bdrvstate(bs);
> +return ret;
>  }
>  
>  static int nbd_co_flush(BlockDriverState *bs)
> @@ -2373,11 +2376,8 @@ static void nbd_refresh_limits(BlockDriverState *bs, 
> Error **errp)
>  
>  static void nbd_close(BlockDriverState *bs)
>  {
> -BDRVNBDState *s = bs->opaque;
> -
>  nbd_client_close(bs);
> -yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
> -nbd_clear_bdrvstate(s);
> +nbd_clear_bdrvstate(bs);
>  }
>  
>  /*



Re: [PATCH for-6.0] block/nbd: fix possible use after free of s->connect_thread

2021-04-12 Thread Roman Kagan
On Tue, Apr 06, 2021 at 06:51:14PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> If on nbd_close() we detach the thread (in
> nbd_co_establish_connection_cancel() thr->state becomes
> CONNECT_THREAD_RUNNING_DETACHED), after that point we should not use
> s->connect_thread (which is set to NULL), as running thread may free it
> at any time.
> 
> Still nbd_co_establish_connection() does exactly this: it saves
> s->connect_thread to local variable (just for better code style) and
> use it even after yield point, when thread may be already detached.
> 
> Fix that. Also check thr to be non-NULL on
> nbd_co_establish_connection() start for safety.
> 
> After this patch "case CONNECT_THREAD_RUNNING_DETACHED" becomes
> impossible in the second switch in nbd_co_establish_connection().
> Still, don't add extra abort() just before the release. If it somehow
> possible to reach this "case:" it won't hurt. Anyway, good refactoring
> of all this reconnect mess will come soon.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
> 
> Hi all! I faced a crash, just running 277 iotest in a loop. I can't
> reproduce it on master, it reproduces only on my branch with nbd
> reconnect refactorings.
> 
> Still, it seems very possible that it may crash under some conditions.
> So I propose this patch for 6.0. It's written so that it's obvious that
> it will not hurt:
> 
>  pre-patch, on first hunk we'll just crash if thr is NULL,
>  on second hunk it's safe to return -1, and using thr when
>  s->connect_thread is already zeroed is obviously wrong.
> 
>  block/nbd.c | 11 +++
>  1 file changed, 11 insertions(+)

Can we please get it merged in 6.0 as it's a genuine crasher fix?

Reviewed-by: Roman Kagan 



Re: [PATCH for-6.0 2/2] block/nbd: ensure ->connection_thread is always valid

2021-04-10 Thread Roman Kagan
On Sat, Apr 10, 2021 at 12:56:34PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 10.04.2021 11:38, Vladimir Sementsov-Ogievskiy wrote:
> > 10.04.2021 11:06, Vladimir Sementsov-Ogievskiy wrote:
> > > 09.04.2021 19:04, Roman Kagan wrote:
> > > > Simplify lifetime management of BDRVNBDState->connection_thread by
> > > > delaying the possible cleanup of it until the BDRVNBDState itself goes
> > > > away.
> > > > 
> > > > This also fixes possible use-after-free in nbd_co_establish_connection
> > > > when it races with nbd_co_establish_connection_cancel.
> > > > 
> > > > Signed-off-by: Roman Kagan
> > > 
> > > Reviewed-by: Vladimir Sementsov-Ogievskiy 
> > > 
> > 
> > Ha stop, it crashes iotest 51, as nbd_clear_bdrvstate is called also from 
> > nbd_process_options.
> > 
> > And this shows that we also do wrong thing when simply return from two ifs 
> > pre-patch (and one after-patch). Yes, after successful nbd_process options 
> > we should call nbd_clear_bdrvstate() on failure path.

The test didn't crash for me somehow, but you're right, there's a bug here.

> And also it shows that patch is more difficult than it seems. I still think 
> that for 6.0 we should take a simple use-after-free-only fix, and don't care 
> about anything else.

I agree it turned out more complex than appropriate for 6.0.  Let's
proceed with the one you've posted.

Thanks,
Roman.



[PATCH for-6.0 1/2] block/nbd: fix channel object leak

2021-04-09 Thread Roman Kagan
nbd_free_connect_thread leaks the channel object if it hasn't been
stolen.

Unref it and fix the leak.

Signed-off-by: Roman Kagan 
---
 block/nbd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/nbd.c b/block/nbd.c
index c26dc5a54f..d86df3afcb 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -385,6 +385,7 @@ static void nbd_free_connect_thread(NBDConnectThread *thr)
 {
 if (thr->sioc) {
 qio_channel_close(QIO_CHANNEL(thr->sioc), NULL);
+object_unref(OBJECT(thr->sioc));
 }
 error_free(thr->err);
 qapi_free_SocketAddress(thr->saddr);
-- 
2.30.2




[PATCH for-6.0 0/2] block/nbd: assorted bugfixes

2021-04-09 Thread Roman Kagan
A couple of bugfixes to block/nbd that look appropriate for 6.0.

Roman Kagan (2):
  block/nbd: fix channel object leak
  block/nbd: ensure ->connection_thread is always valid

 block/nbd.c | 59 +++--
 1 file changed, 30 insertions(+), 29 deletions(-)

-- 
2.30.2




[PATCH for-6.0 2/2] block/nbd: ensure ->connection_thread is always valid

2021-04-09 Thread Roman Kagan
Simplify lifetime management of BDRVNBDState->connection_thread by
delaying the possible cleanup of it until the BDRVNBDState itself goes
away.

This also fixes possible use-after-free in nbd_co_establish_connection
when it races with nbd_co_establish_connection_cancel.

Signed-off-by: Roman Kagan 
---
This is a more future-proof version of the fix for use-after-free but
less intrusive than Vladimir's series so that it can go in 6.0.

 block/nbd.c | 58 ++---
 1 file changed, 29 insertions(+), 29 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index d86df3afcb..6e3879c0c5 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -144,16 +144,31 @@ typedef struct BDRVNBDState {
 NBDConnectThread *connect_thread;
 } BDRVNBDState;
 
+static void nbd_free_connect_thread(NBDConnectThread *thr);
 static int nbd_establish_connection(BlockDriverState *bs, SocketAddress *saddr,
 Error **errp);
 static int nbd_co_establish_connection(BlockDriverState *bs, Error **errp);
-static void nbd_co_establish_connection_cancel(BlockDriverState *bs,
-   bool detach);
+static void nbd_co_establish_connection_cancel(BlockDriverState *bs);
 static int nbd_client_handshake(BlockDriverState *bs, Error **errp);
 static void nbd_yank(void *opaque);
 
 static void nbd_clear_bdrvstate(BDRVNBDState *s)
 {
+NBDConnectThread *thr = s->connect_thread;
+bool thr_running;
+
+qemu_mutex_lock(&thr->mutex);
+thr_running = thr->state == CONNECT_THREAD_RUNNING;
+if (thr_running) {
+thr->state = CONNECT_THREAD_RUNNING_DETACHED;
+}
+qemu_mutex_unlock(&thr->mutex);
+
+/* the runaway thread will clean it up itself */
+if (!thr_running) {
+nbd_free_connect_thread(thr);
+}
+
 object_unref(OBJECT(s->tlscreds));
 qapi_free_SocketAddress(s->saddr);
 s->saddr = NULL;
@@ -293,7 +308,7 @@ static void coroutine_fn 
nbd_client_co_drain_begin(BlockDriverState *bs)
 qemu_co_sleep_wake(s->connection_co_sleep_ns_state);
 }
 
-nbd_co_establish_connection_cancel(bs, false);
+nbd_co_establish_connection_cancel(bs);
 
 reconnect_delay_timer_del(s);
 
@@ -333,7 +348,7 @@ static void nbd_teardown_connection(BlockDriverState *bs)
 if (s->connection_co_sleep_ns_state) {
 qemu_co_sleep_wake(s->connection_co_sleep_ns_state);
 }
-nbd_co_establish_connection_cancel(bs, true);
+nbd_co_establish_connection_cancel(bs);
 }
 if (qemu_in_coroutine()) {
 s->teardown_co = qemu_coroutine_self();
@@ -534,18 +549,12 @@ nbd_co_establish_connection(BlockDriverState *bs, Error 
**errp)
  * nbd_co_establish_connection_cancel
  * Cancel nbd_co_establish_connection asynchronously: it will finish soon, to
  * allow drained section to begin.
- *
- * If detach is true, also cleanup the state (or if thread is running, move it
- * to CONNECT_THREAD_RUNNING_DETACHED state). s->connect_thread becomes NULL if
- * detach is true.
  */
-static void nbd_co_establish_connection_cancel(BlockDriverState *bs,
-   bool detach)
+static void nbd_co_establish_connection_cancel(BlockDriverState *bs)
 {
 BDRVNBDState *s = bs->opaque;
 NBDConnectThread *thr = s->connect_thread;
 bool wake = false;
-bool do_free = false;
 
 qemu_mutex_lock(&thr->mutex);
 
@@ -556,21 +565,10 @@ static void 
nbd_co_establish_connection_cancel(BlockDriverState *bs,
 s->wait_connect = false;
 wake = true;
 }
-if (detach) {
-thr->state = CONNECT_THREAD_RUNNING_DETACHED;
-s->connect_thread = NULL;
-}
-} else if (detach) {
-do_free = true;
 }
 
 qemu_mutex_unlock(&thr->mutex);
 
-if (do_free) {
-nbd_free_connect_thread(thr);
-s->connect_thread = NULL;
-}
-
 if (wake) {
 aio_co_wake(s->connection_co);
 }
@@ -2294,31 +2292,33 @@ static int nbd_open(BlockDriverState *bs, QDict 
*options, int flags,
 return -EEXIST;
 }
 
+nbd_init_connect_thread(s);
+
 /*
  * establish TCP connection, return error if it fails
  * TODO: Configurable retry-until-timeout behaviour.
  */
 if (nbd_establish_connection(bs, s->saddr, errp) < 0) {
-yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
-return -ECONNREFUSED;
+ret = -ECONNREFUSED;
+goto fail;
 }
 
 ret = nbd_client_handshake(bs, errp);
 if (ret < 0) {
-yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
-nbd_clear_bdrvstate(s);
-return ret;
+goto fail;
 }
 /* successfully connected */
 s->state = NBD_CLIENT_CONNECTED;
 
-nbd_init_connect_thread(s);
-
 s->connection_co = qemu_coroutine_create(nbd_connectio

Re: [PATCH v2 03/10] util/async: aio_co_enter(): do aio_co_schedule in general case

2021-04-09 Thread Roman Kagan
On Thu, Apr 08, 2021 at 06:54:30PM +0300, Roman Kagan wrote:
> On Thu, Apr 08, 2021 at 05:08:20PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > With the following patch we want to call aio_co_wake() from thread.
> > And it works bad.
> > Assume we have no iothreads.
> > Assume we have a coroutine A, which waits in the yield point for external
> > aio_co_wake(), and no progress can be done until it happen.
> > Main thread is in blocking aio_poll() (for example, in blk_read()).
> > 
> > Now, in a separate thread we do aio_co_wake(). It calls  aio_co_enter(),
> > which goes through last "else" branch and do aio_context_acquire(ctx).
> > 
> > Now we have a deadlock, as aio_poll() will not release the context lock
> > until some progress is done, and progress can't be done until
> > aio_co_wake() wake the coroutine A. And it can't because it wait for
> > aio_context_acquire().
> > 
> > Still, aio_co_schedule() works well in parallel with blocking
> > aio_poll(). So let's use it in generic case and drop
> > aio_context_acquire/aio_context_release branch from aio_co_enter().
> > 
> > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > ---
> >  util/async.c | 11 ++-
> >  1 file changed, 2 insertions(+), 9 deletions(-)
> > 
> > diff --git a/util/async.c b/util/async.c
> > index 674dbefb7c..f05b883a39 100644
> > --- a/util/async.c
> > +++ b/util/async.c
> > @@ -614,19 +614,12 @@ void aio_co_wake(struct Coroutine *co)
> >  
> >  void aio_co_enter(AioContext *ctx, struct Coroutine *co)
> >  {
> > -if (ctx != qemu_get_current_aio_context()) {
> > -aio_co_schedule(ctx, co);
> > -return;
> > -}
> > -
> > -if (qemu_in_coroutine()) {
> > +if (ctx == qemu_get_current_aio_context() && qemu_in_coroutine()) {
> >  Coroutine *self = qemu_coroutine_self();
> >  assert(self != co);
> >  QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
> >  } else {
> > -aio_context_acquire(ctx);
> > -qemu_aio_coroutine_enter(ctx, co);
> > -aio_context_release(ctx);
> > +aio_co_schedule(ctx, co);
> >  }
> >  }
> 
> I'm fine with the change, but I find the log message to be a bit
> confusing (although correct).  AFAICS the problem is that calling
> aio_co_enter from a thread which has no associated aio_context works
> differently compared to calling it from a proper iothread: if the target
> context was qemu_aio_context, an iothread would just schedule the
> coroutine there, while a "dumb" thread would try to lock the context
> potentially resulting in a deadlock.  This patch makes "dumb" threads
> and iothreads behave identically when entering a coroutine on a foreign
> context.
> 
> You may want to rephrase the log message to that end.
> 
> Anyway
> 
> Reviewed-by: Roman Kagan 

I was too quick to reply.  Turns out this patch breaks a lot of stuff;
apparently the original behavior is relied upon somewhere.
In particular, in iotest 008 basic checks with '--aio threads' fail due
to the device being closed before the operation is performed, so the
latter returns with -ENOMEDIUM:

# .../qemu-io --cache writeback --aio threads -f qcow2 \
-c 'aio_read -P 0xa 0 128M' .../scratch/t.qcow2 -T blk_\*
blk_root_attach child 0x560e9fb65a70 blk 0x560e9fb647b0 bs 0x560e9fb5f4b0
blk_root_detach child 0x560e9fb65a70 blk 0x560e9fb647b0 bs 0x560e9fb5f4b0
blk_root_attach child 0x560e9fb65a70 blk 0x560e9fb4c420 bs 0x560e9fb581a0
blk_root_detach child 0x560e9fb65a70 blk 0x560e9fb4c420 bs 0x560e9fb581a0
blk_co_preadv blk 0x560e9fb4c420 bs (nil) offset 0 bytes 134217728 flags 0x0
readv failed: No medium found

Let's drop this and get back to the original scheme with wakeup via BH.
It'll look just as nice, but won't touch the generic infrastructure with
unpredictable consequences.
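
For reference, by the original scheme I mean the pre-series path where
the connect thread only schedules a bottom half in the waiter's context:

    if (thr->bh_ctx) {
        aio_bh_schedule_oneshot(thr->bh_ctx, thr->bh_func, thr->bh_opaque);
        thr->bh_ctx = NULL;
    }

and the bottom half (connect_bh) then does aio_co_wake() from within
that context, so the connect thread itself never touches the AioContext
lock.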

Thanks,
Roman.



Re: [PATCH v2 00/10] block/nbd: move connection code to separate file

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:17PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> This substitutes "[PATCH 00/14] nbd: move reconnect-thread to separate file"
> Supersedes: <20210407104637.36033-1-vsement...@virtuozzo.com>
> 
> I want to simplify block/nbd.c which is overcomplicated now. First step
> is splitting out what could be split.
> 
> These series creates new file nbd/client-connection.c and part of
> block/nbd.c is refactored and moved.
> 
> v2 is mostly rewritten. I decided move larger part, otherwise it doesn't
> make real sense.
> 
> Note also that v2 is based on master. Patch 01 actually solves same
> problem as
> "[PATCH for-6.0] block/nbd: fix possible use after free of s->connect_thread" 
> [*]
> in a smarter way. So, if [*] goes first, this will be rebased to undo
> [*].
> 
> Vladimir Sementsov-Ogievskiy (10):
>   block/nbd: introduce NBDConnectThread reference counter
>   block/nbd: BDRVNBDState: drop unused connect_err and connect_status
>   util/async: aio_co_enter(): do aio_co_schedule in general case
>   block/nbd: simplify waking of nbd_co_establish_connection()
>   block/nbd: drop thr->state
>   block/nbd: bs-independent interface for nbd_co_establish_connection()
>   block/nbd: make nbd_co_establish_connection_cancel() bs-independent
>   block/nbd: rename NBDConnectThread to NBDClientConnection
>   block/nbd: introduce nbd_client_connection_new()
>   nbd: move connection code from block/nbd to nbd/client-connection
> 
>  include/block/nbd.h |  11 ++
>  block/nbd.c | 288 ++--
>  nbd/client-connection.c | 192 +++
>  util/async.c|  11 +-
>  nbd/meson.build |   1 +
>  5 files changed, 218 insertions(+), 285 deletions(-)
>  create mode 100644 nbd/client-connection.c

I think this is a nice cleanup overall, and makes the logic in
block/nbd.c much easier to reason about.

I guess it's 6.1 material though, as it looks somewhat too big for 6.0,
and the only serious bug it actually fixes can be addressed with the
band-aid mentioned above.

The problem I originally came across, that of the requests being
canceled on drain despite reconnect, still remains, but I think the fix
for it should build on this series (and thus probably wait till after
6.0).

Thanks,
Roman.



Re: [PATCH v2 10/10] nbd: move connection code from block/nbd to nbd/client-connection

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:27PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We now have bs-independent connection API, which consists of four
> functions:
> 
>   nbd_client_connection_new()
>   nbd_client_connection_unref()
>   nbd_co_establish_connection()
>   nbd_co_establish_connection_cancel()
> 
> Move them to a separate file together with NBDClientConnection
> structure which becomes private to the new API.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
> 
> Hmm. I keep only Virtuozzo's copyright in a new file, as actually I've
> developed nbd-reconnection code. Still probably safer to save all
> copyrights. Let me now if you think so and I'll add them.

Not my call.

>  include/block/nbd.h |  11 +++
>  block/nbd.c | 167 --
>  nbd/client-connection.c | 192 
>  nbd/meson.build |   1 +
>  4 files changed, 204 insertions(+), 167 deletions(-)
>  create mode 100644 nbd/client-connection.c

Reviewed-by: Roman Kagan 



Re: [PATCH v2 09/10] block/nbd: introduce nbd_client_connection_new()

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:26PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> This is the last step of creating bs-independing nbd connection

s/independing/independent/

> interface. With next commit we can finally move it to separate file.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 15 +--
>  1 file changed, 9 insertions(+), 6 deletions(-)

Reviewed-by: Roman Kagan 



Re: [PATCH v2 08/10] block/nbd: rename NBDConnectThread to NBDClientConnection

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:25PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We are going to move connection code to own file and want clear names
> and APIs.
> 
> The structure is shared between user and (possibly) several runs of
> connect-thread. So it's wrong to call it "thread". Let's rename to
> something more generic.
> 
> Appropriately rename connect_thread and thr variables to conn.
> 
> connect_thread_state_unref() function gets new appropriate name too
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 127 ++--
>  1 file changed, 63 insertions(+), 64 deletions(-)

[To other reviewers: in addition to renaming there's one blank line
removed, hence the difference between (+) and (-)]

Reviewed-by: Roman Kagan 



Re: [PATCH v2 07/10] block/nbd: make nbd_co_establish_connection_cancel() bs-independent

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:24PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> nbd_co_establish_connection_cancel() actually needs only pointer to
> NBDConnectThread. So, make it clean.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 16 +++-
>  1 file changed, 7 insertions(+), 9 deletions(-)

Reviewed-by: Roman Kagan 



Re: [PATCH v2 06/10] block/nbd: bs-independent interface for nbd_co_establish_connection()

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:23PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We are going to split connection code to separate file. Now we are
> ready to give nbd_co_establish_connection() clean and bs-independent
> interface.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 49 +++--
>  1 file changed, 31 insertions(+), 18 deletions(-)

Reviewed-by: Roman Kagan 



Re: [PATCH v2 05/10] block/nbd: drop thr->state

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:22PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Actually, the only bit of information we need is "is thread running or
> not". We don't need all these states. So, instead of thr->state add
> boolean variable thr->running and refactor the code.

There's a certain redundancy between thr->refcnt and thr->running.  I
wonder if it makes sense to exploit it.

Can be done later if deemed useful.

> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 103 ++--
>  1 file changed, 27 insertions(+), 76 deletions(-)

Reviewed-by: Roman Kagan 



Re: [PATCH v2 04/10] block/nbd: simplify waking of nbd_co_establish_connection()

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:21PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Instead of connect_bh, bh_ctx and wait_connect fields we can live with
> only one link to waiting coroutine, protected by mutex.
> 
> So new logic is:
> 
> nbd_co_establish_connection() sets wait_co under mutex, release the
> mutex and do yield(). Note, that wait_co may be scheduled by thread
> immediately after unlocking the mutex. Still, in main thread (or
> iothread) we'll not reach the code for entering the coroutine until the
> yield() so we are safe.
> 
> Both connect_thread_func() and nbd_co_establish_connection_cancel() do
> the following to handle wait_co:
> 
> Under mutex, if thr->wait_co is not NULL, call aio_co_wake() (which
> never tries to acquire aio context since previous commit, so we are
> safe to do it under thr->mutex) and set thr->wait_co to NULL.
> This way we protect ourselves of scheduling it twice.
> 
> Also this commit make nbd_co_establish_connection() less connected to
> bs (we have generic pointer to the coroutine, not use s->connection_co
> directly). So, we are on the way of splitting connection API out of
> nbd.c (which is overcomplicated now).
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 49 +--------
>  1 file changed, 9 insertions(+), 40 deletions(-)

Reviewed-by: Roman Kagan 



Re: [PATCH v2 03/10] util/async: aio_co_enter(): do aio_co_schedule in general case

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:20PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> With the following patch we want to call aio_co_wake() from thread.
> And it works bad.
> Assume we have no iothreads.
> Assume we have a coroutine A, which waits in the yield point for external
> aio_co_wake(), and no progress can be done until it happen.
> Main thread is in blocking aio_poll() (for example, in blk_read()).
> 
> Now, in a separate thread we do aio_co_wake(). It calls  aio_co_enter(),
> which goes through last "else" branch and do aio_context_acquire(ctx).
> 
> Now we have a deadlock, as aio_poll() will not release the context lock
> until some progress is done, and progress can't be done until
> aio_co_wake() wake the coroutine A. And it can't because it wait for
> aio_context_acquire().
> 
> Still, aio_co_schedule() works well in parallel with blocking
> aio_poll(). So let's use it in generic case and drop
> aio_context_acquire/aio_context_release branch from aio_co_enter().
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  util/async.c | 11 ++-
>  1 file changed, 2 insertions(+), 9 deletions(-)
> 
> diff --git a/util/async.c b/util/async.c
> index 674dbefb7c..f05b883a39 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -614,19 +614,12 @@ void aio_co_wake(struct Coroutine *co)
>  
>  void aio_co_enter(AioContext *ctx, struct Coroutine *co)
>  {
> -if (ctx != qemu_get_current_aio_context()) {
> -aio_co_schedule(ctx, co);
> -return;
> -}
> -
> -if (qemu_in_coroutine()) {
> +if (ctx == qemu_get_current_aio_context() && qemu_in_coroutine()) {
>  Coroutine *self = qemu_coroutine_self();
>  assert(self != co);
>  QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
>  } else {
> -aio_context_acquire(ctx);
> -qemu_aio_coroutine_enter(ctx, co);
> -aio_context_release(ctx);
> +aio_co_schedule(ctx, co);
>  }
>  }

I'm fine with the change, but I find the log message to be a bit
confusing (although correct).  AFAICS the problem is that calling
aio_co_enter from a thread which has no associated aio_context works
differently compared to calling it from a proper iothread: if the target
context was qemu_aio_context, an iothread would just schedule the
coroutine there, while a "dumb" thread would try to lock the context
potentially resulting in a deadlock.  This patch makes "dumb" threads
and iothreads behave identically when entering a coroutine on a foreign
context.

You may want to rephrase the log message to that end.

Anyway

Reviewed-by: Roman Kagan 



Re: [PATCH v2 02/10] block/nbd: BDRVNBDState: drop unused connect_err and connect_status

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:19PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> These fields are write-only. Drop them.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 12 ++--
>  1 file changed, 2 insertions(+), 10 deletions(-)

Reviewed-by: Roman Kagan 



Re: [PATCH v2 01/10] block/nbd: introduce NBDConnectThread reference counter

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:18PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> The structure is shared between NBD BDS and connection thread. And it
> is possible the connect thread will finish after closing and releasing
> for the bs. To handle this we have a concept of
> CONNECT_THREAD_RUNNING_DETACHED state and when thread is running and
> BDS is going to be closed we don't free the structure, but instead move
> it to CONNECT_THREAD_RUNNING_DETACHED state, so that thread will free
> it.
> 
> Still more native way to solve the problem is using reference counter
> for shared structure. Let's use it. It makes code smaller and more
> readable.
> 
> New approach also makes checks in nbd_co_establish_connection()
> redundant: now we are sure that s->connect_thread is valid during the
> whole life of NBD BDS.
> 
> This also fixes possible use-after-free of s->connect_thread if
> nbd_co_establish_connection_cancel() clears it during
> nbd_co_establish_connection(), and nbd_co_establish_connection() uses
> local copy of s->connect_thread after yield point.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 62 +--------
>  1 file changed, 20 insertions(+), 42 deletions(-)

Reviewed-by: Roman Kagan 



Re: [PATCH 06/14] block/nbd: further segregation of connect-thread

2021-04-08 Thread Roman Kagan
On Wed, Apr 07, 2021 at 01:46:29PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Add personal state NBDConnectThread for connect-thread and
> nbd_connect_thread_start() function. Next step would be moving
> connect-thread to a separate file.
> 
> Note that we stop publishing thr->sioc during
> qio_channel_socket_connect_sync(). Which probably means that we can't
> early-cancel qio_channel_socket_connect_sync() call in
> nbd_free_connect_thread(). Still, when thread is detached it doesn't
> make big sense, and we have no guarantee anyway.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 57 -
>  1 file changed, 39 insertions(+), 18 deletions(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index e16c6d636a..23eb8adab1 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -85,8 +85,6 @@ typedef enum NBDConnectThreadState {
>  } NBDConnectThreadState;
>  
>  typedef struct NBDConnectCB {
> -/* Initialization constants */
> -SocketAddress *saddr; /* address to connect to */
>  /*
>   * Bottom half to schedule on completion. Scheduled only if bh_ctx is not
>   * NULL
> @@ -103,6 +101,15 @@ typedef struct NBDConnectCB {
>  AioContext *bh_ctx; /* where to schedule bh (NULL means don't schedule) 
> */
>  } NBDConnectCB;
>  
> +typedef void (*NBDConnectThreadCallback)(QIOChannelSocket *sioc, int ret,
> + void *opaque);
> +
> +typedef struct NBDConnectThread {
> +SocketAddress *saddr; /* address to connect to */
> +NBDConnectThreadCallback cb;
> +void *cb_opaque;
> +} NBDConnectThread;
> +
>  typedef struct BDRVNBDState {
>  QIOChannelSocket *sioc; /* The master data channel */
>  QIOChannel *ioc; /* The current I/O channel which may differ (eg TLS) */
> @@ -367,7 +374,6 @@ static void nbd_init_connect_thread(BDRVNBDState *s)
>  s->connect_thread = g_new(NBDConnectCB, 1);
>  
>  *s->connect_thread = (NBDConnectCB) {
> -.saddr = QAPI_CLONE(SocketAddress, s->saddr),
>  .state = CONNECT_THREAD_NONE,
>  .bh_func = connect_bh,
>  .bh_opaque = s,
> @@ -378,20 +384,18 @@ static void nbd_init_connect_thread(BDRVNBDState *s)
>  
>  static void nbd_free_connect_thread(NBDConnectCB *thr)
>  {
> -if (thr->sioc) {
> -qio_channel_close(QIO_CHANNEL(thr->sioc), NULL);
> -}

Doesn't this result in an open channel leak?  (The original code here
seems to be missing an unref, too.)
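
(For the record, a leak-free teardown would presumably look something
like this; just a sketch:)

    if (thr->sioc) {
        qio_channel_close(QIO_CHANNEL(thr->sioc), NULL);
        object_unref(OBJECT(thr->sioc));
        thr->sioc = NULL;
    }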

> -qapi_free_SocketAddress(thr->saddr);
>  g_free(thr);
>  }
>  
> -static void connect_thread_cb(int ret, void *opaque)
> +static void connect_thread_cb(QIOChannelSocket *sioc, int ret, void *opaque)
>  {
>  NBDConnectCB *thr = opaque;
>  bool do_free = false;
>  
>  qemu_mutex_lock(>mutex);
>  
> +thr->sioc = sioc;
> +
>  switch (thr->state) {
>  case CONNECT_THREAD_RUNNING:
>  thr->state = ret < 0 ? CONNECT_THREAD_FAIL : CONNECT_THREAD_SUCCESS;
> @@ -418,27 +422,45 @@ static void connect_thread_cb(int ret, void *opaque)
>  
>  static void *connect_thread_func(void *opaque)
>  {
> -NBDConnectCB *thr = opaque;
> +NBDConnectThread *thr = opaque;
>  int ret;
> +QIOChannelSocket *sioc = qio_channel_socket_new();
>  
> -thr->sioc = qio_channel_socket_new();
> -
> -ret = qio_channel_socket_connect_sync(thr->sioc, thr->saddr, NULL);
> +ret = qio_channel_socket_connect_sync(sioc, thr->saddr, NULL);
>  if (ret < 0) {
> -object_unref(OBJECT(thr->sioc));
> -thr->sioc = NULL;
> +object_unref(OBJECT(sioc));
> +sioc = NULL;
>  }
>  
> -connect_thread_cb(ret, thr);
> +thr->cb(sioc, ret, thr->cb_opaque);
> +
> +qapi_free_SocketAddress(thr->saddr);
> +g_free(thr);
>  
>  return NULL;
>  }
>  
> +static void nbd_connect_thread_start(const SocketAddress *saddr,
> + NBDConnectThreadCallback cb,
> + void *cb_opaque)
> +{
> +QemuThread thread;
> +NBDConnectThread *thr = g_new(NBDConnectThread, 1);
> +
> +*thr = (NBDConnectThread) {
> +.saddr = QAPI_CLONE(SocketAddress, saddr),
> +.cb = cb,
> +.cb_opaque = cb_opaque,
> +};
> +
> +qemu_thread_create(, "nbd-connect",
> +   connect_thread_func, thr, QEMU_THREAD_DETACHED);
> +}
> +
>  static int coroutine_fn
>  nbd_co_establish_connection(BlockDriverState *bs)
>  {
>  int ret;
> -QemuThread thread;
>  BDRVNBDState *s = bs->opaque;
>  NBDConnectCB *thr = s->connect_thread;
>  
> @@ -448,8 +470,7 @@ nbd_co_establish_connection(BlockDriverState *bs)
>  case CONNECT_THREAD_FAIL:
>  case CONNECT_THREAD_NONE:
>  thr->state = CONNECT_THREAD_RUNNING;
> -qemu_thread_create(, "nbd-connect",
> -   connect_thread_func, thr, QEMU_THREAD_DETACHED);
> +nbd_connect_thread_start(s->saddr, connect_thread_cb, 

Re: [PATCH 03/14] block/nbd: drop unused NBDConnectThread::err field

2021-04-07 Thread Roman Kagan
On Wed, Apr 07, 2021 at 01:46:26PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> The field is used only to free it. Let's drop it for now for
> simplicity.

Well, it's *now* (after your patch 2) only used to free it.  This makes
the reconnect process even further concealed from the user: the client
may be struggling to re-establish the connection but you'll never know
when looking at the logs.

As I wrote in my comment to patch 2 I believe these errors need to be
logged.  Whether to do it directly in the reconnection thread or punt it
to the spawner of it is another matter; I'd personally prefer to do
logging in the original bdrv context as it makes it easier to tag the
log messages with an indication of which connection is suffering.
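
Something along these lines in the completion path on the bdrv side, for
instance (only a sketch; the message text and format are just an
illustration):

    if (ret < 0) {
        error_report("nbd (%s): connection attempt failed: %s",
                     s->export ? s->export : "<no export>",
                     strerror(-ret));
    }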

Thanks,
Roman.



Re: [PATCH 02/14] block/nbd: nbd_co_establish_connection(): drop unused errp

2021-04-07 Thread Roman Kagan
On Wed, Apr 07, 2021 at 01:46:25PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We are going to refactor connection logic to make it more
> understandable. Every bit that we can simplify in advance will help.
> Drop errp for now, it's unused anyway. We'll probably reimplement it in
> future.

Although I agree that passing these errors around is a bit of an
overkill, my problem with the NBD client is that it's notoriously silent
about the problems it experiences, and those errors never pop up in the
logs.

Given that these errors are not guest-triggerable, and probably indicate
serious problems at the infrastructure level, instead of endlessly
passing them around (as in the code ATM) or dropping them on the floor
(as you propose in the patch) I'd much rather log them immediately upon
encountering them.

I have a patch to that end, I'll try to port it on top of your series.

Thanks,
Roman.



Re: [PATCH 01/14] block/nbd: BDRVNBDState: drop unused connect_err

2021-04-07 Thread Roman Kagan
On Wed, Apr 07, 2021 at 01:46:24PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> The field is actually unused. Let's make things a bit simpler dropping
> it and corresponding logic.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 9 ++---
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index c26dc5a54f..a47d6cfea3 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -122,7 +122,6 @@ typedef struct BDRVNBDState {
>  int in_flight;
>  NBDClientState state;
>  int connect_status;

This field is write-only too.  Do you have any plans on using it later?
Otherwise you may want to nuke it in this patch too.

Roman.



Re: [PATCH 0/7] block/nbd: decouple reconnect from drain

2021-04-07 Thread Roman Kagan
On Wed, Mar 17, 2021 at 11:35:31AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 15.03.2021 09:06, Roman Kagan wrote:
> > The reconnection logic doesn't need to stop while in a drained section.
> > Moreover it has to be active during the drained section, as the requests
> > that were caught in-flight with the connection to the server broken can
> > only usefully get drained if the connection is restored.  Otherwise such
> > requests can only either stall resulting in a deadlock (before
> > 8c517de24a), or be aborted defeating the purpose of the reconnection
> > machinery (after 8c517de24a).
> > 
> > This series aims to just stop messing with the drained section in the
> > reconnection code.
> > 
> > While doing so it undoes the effect of 5ad81b4946 ("nbd: Restrict
> > connection_co reentrance"); as I've missed the point of that commit I'd
> > appreciate more scrutiny in this area.
> > 
> > Roman Kagan (7):
> >block/nbd: avoid touching freed connect_thread
> >block/nbd: use uniformly nbd_client_connecting_wait
> >block/nbd: assert attach/detach runs in the proper context
> >block/nbd: transfer reconnection stuff across aio_context switch
> >block/nbd: better document a case in nbd_co_establish_connection
> >block/nbd: decouple reconnect from drain
> >block/nbd: stop manipulating in_flight counter
> > 
> >   block/nbd.c  | 191 +++
> >   nbd/client.c |   2 -
> >   2 files changed, 86 insertions(+), 107 deletions(-)
> > 
> 
> 
> Hmm. The huge source of problems for this series is weird logic around
> drain and aio context switch in NBD driver.
> 
> Why do we have all these too complicated logic with abuse of in_flight
> counter in NBD? The answer is connection_co. NBD differs from other
> drivers, it has a coroutine independent of request coroutines. And we
> have to move this coroutine carefully to new aio context. We can't
> just enter it from the new context, we want to be sure that
> connection_co is in one of yield points that supports reentering.
> 
> I have an idea of how to avoid this thing: drop connection_co at all.
> 
> 1. nbd negotiation goes to connection thread and becomes independent
> of any aio context.
> 
> 2. waiting for server reply goes to request code. So, instead of
> reading the replay from socket always in connection_co, we read in the
> request coroutine, after sending the request. We'll need a CoMutex for
> it (as only one request coroutine should read from socket), and be
> prepared to coming reply is not for _this_ request (in this case we
> should wake another request and continue read from socket).

The problem with this approach is that it would change the reconnect
behavior.

Currently connection_co purpose is three-fold:

1) receive the header of the server response, identify the request it
   pertains to, and wake the respective request coroutine

2) take on the responsibility to reestablish the connection when it's
   lost

3) monitor the idle connection and initiate the reconnect as soon as the
   connection is lost

Points 1 and 2 can be moved to the request coroutines indeed.  However I
don't see how to do 3 without an extra ever-running coroutine.
Sacrificing it would mean that a connection loss wouldn't be noticed and
the recovery wouldn't be attempted until a request arrived.

This change looks to me like a degradation compared to the current
state.

Roman.



Re: [PATCH 0/7] block/nbd: decouple reconnect from drain

2021-03-26 Thread Roman Kagan
On Wed, Mar 17, 2021 at 11:35:31AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 15.03.2021 09:06, Roman Kagan wrote:
> > The reconnection logic doesn't need to stop while in a drained section.
> > Moreover it has to be active during the drained section, as the requests
> > that were caught in-flight with the connection to the server broken can
> > only usefully get drained if the connection is restored.  Otherwise such
> > requests can only either stall resulting in a deadlock (before
> > 8c517de24a), or be aborted defeating the purpose of the reconnection
> > machinery (after 8c517de24a).
> > 
> > This series aims to just stop messing with the drained section in the
> > reconnection code.
> > 
> > While doing so it undoes the effect of 5ad81b4946 ("nbd: Restrict
> > connection_co reentrance"); as I've missed the point of that commit I'd
> > appreciate more scrutiny in this area.
> > 
> > Roman Kagan (7):
> >block/nbd: avoid touching freed connect_thread
> >block/nbd: use uniformly nbd_client_connecting_wait
> >block/nbd: assert attach/detach runs in the proper context
> >block/nbd: transfer reconnection stuff across aio_context switch
> >block/nbd: better document a case in nbd_co_establish_connection
> >block/nbd: decouple reconnect from drain
> >block/nbd: stop manipulating in_flight counter
> > 
> >   block/nbd.c  | 191 +++
> >   nbd/client.c |   2 -
> >   2 files changed, 86 insertions(+), 107 deletions(-)
> > 
> 
> 
> Hmm. The huge source of problems for this series is weird logic around
> drain and aio context switch in NBD driver.
> 
> Why do we have all these too complicated logic with abuse of in_flight
> counter in NBD? The answer is connection_co. NBD differs from other
> drivers, it has a coroutine independent of request coroutines. And we
> have to move this coroutine carefully to new aio context. We can't
> just enter it from the new context, we want to be sure that
> connection_co is in one of yield points that supports reentering.
> 
> I have an idea of how to avoid this thing: drop connection_co at all.
> 
> 1. nbd negotiation goes to connection thread and becomes independent
> of any aio context.
> 
> 2. waiting for server reply goes to request code. So, instead of
> reading the replay from socket always in connection_co, we read in the
> request coroutine, after sending the request. We'll need a CoMutex for
> it (as only one request coroutine should read from socket), and be
> prepared to coming reply is not for _this_ request (in this case we
> should wake another request and continue read from socket).
> 
> but this may be too much for soft freeze.

This approach does look appealing to me, and I gave it a quick shot, but
the amount of changes it involves indeed exceeds the rc tolerance.

> Another idea:
> 
> You want all the requests be completed on drain_begin(), not
> cancelled. Actually, you don't need reconnect runnning during drained
> section for it. It should be enough just wait for all currenct
> requests before disabling the reconnect in drain_begin handler.

So effectively you suggest doing nbd's own drain within
the bdrv_co_drain_begin callback.  I'm not totally sure there are no
assumptions this may break, but I'll try to look into this possibility.
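
One possible shape of that suggestion, heavily hand-waved (only a sketch
of the idea, not a tested patch; the interaction with the rest of the
drain machinery is ignored):

static void coroutine_fn nbd_client_co_drain_begin(BlockDriverState *bs)
{
    BDRVNBDState *s = (BDRVNBDState *)bs->opaque;

    /* Let reconnect keep running until our own requests have completed;
     * free_sema is kicked whenever a request finishes. */
    while (s->in_flight > 0) {
        qemu_co_queue_wait(&s->free_sema, NULL);
    }

    /* ... and only then suspend the reconnect machinery, as the current
     * drain_begin handler does (delete the timer, wake the sleep, etc.). */
}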

Thanks,
Roman.



Re: [PATCH 7/7] block/nbd: stop manipulating in_flight counter

2021-03-26 Thread Roman Kagan
On Tue, Mar 16, 2021 at 09:37:13PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 16.03.2021 19:08, Roman Kagan wrote:
> > On Mon, Mar 15, 2021 at 11:15:44PM +0300, Vladimir Sementsov-Ogievskiy 
> > wrote:
> > > 15.03.2021 09:06, Roman Kagan wrote:
> > > > As the reconnect logic no longer interferes with drained sections, it
> > > > appears unnecessary to explicitly manipulate the in_flight counter.
> > > > 
> > > > Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
> > > 
> > > And here you actually allow qemu_aio_coroutine_enter() call in
> > > nbd_client_attach_aio_context_bh() to enter connection_co in any yield
> > > point which is possible during drained section. The analysis should be
> > > done to be sure that all these yield points are safe for reentering by
> > > external qemu_aio_coroutine_enter(). (By external I mean not by the
> > > actual enter() we are waiting for at the yield() point. For example
> > > qemu_channel_yield() supports reentering.. And therefore (as I
> > > understand after fast looking through) nbd_read() should support
> > > reentering too..
> > 
> > I'll do a more thorough analysis of how safe it is.
> > 
> > FWIW this hasn't triggered any test failures yet, but that assert in
> > patch 3 didn't ever go off either so I'm not sure I can trust the tests
> > on this.
> > 
> 
> Hmm. First, we should consider qemu_coroutine_yield() in
> nbd_co_establish_connection().
> 
> Most of nbd_co_establish_connection_cancel() purpose is to avoid
> reentering this yield()..

Unless I'm overlooking something, nbd_co_establish_connection() is fine
with spurious entering at this yield point.  What does look problematic,
though, is your next point:

> And I don't know, how to make it safely reenterable: keep in mind bh
> that may be already scheduled by connect_thread_func(). And if bh is
> already scheduled, can we cancel it? I'm not sure.
> 
> We have qemu_bh_delete(). But is it possible, that BH is near to be
> executed and already cannot be removed by qemu_bh_delete()? I don't
> know.
> 
> And if we can't safely drop the bh at any moment, we should wait in
> nbd_client_detach_aio_context until the scheduled bh enters the
> connection_co.. Or something like this

So I think it's not the reentry at this yield point per se which is
problematic; it's that the bh may have been scheduled before the
aio_context switch, so once it runs it would wake up connection_co on the
old aio_context.  I think it may be possible to address this by adding a
check into connect_bh().
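
Roughly what I have in mind (an untested sketch; the existing body of
connect_bh is paraphrased from the patches in this thread, only the
context check at the top is new):

static void connect_bh(void *opaque)
{
    BDRVNBDState *state = opaque;

    if (qemu_get_current_aio_context() != bdrv_get_aio_context(state->bs)) {
        /*
         * The BH was scheduled before the node moved to another
         * aio_context; the attach path re-arms delivery in the new
         * context, so don't wake connection_co from the stale one.
         */
        return;
    }

    assert(state->wait_connect);
    state->wait_connect = false;
    aio_co_wake(state->connection_co);
}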

Thanks,
Roman.



Re: [PATCH 6/7] block/nbd: decouple reconnect from drain

2021-03-26 Thread Roman Kagan
On Tue, Mar 16, 2021 at 09:09:12PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 16.03.2021 19:03, Roman Kagan wrote:
> > On Mon, Mar 15, 2021 at 11:10:14PM +0300, Vladimir Sementsov-Ogievskiy 
> > wrote:
> > > 15.03.2021 09:06, Roman Kagan wrote:
> > > > The reconnection logic doesn't need to stop while in a drained section.
> > > > Moreover it has to be active during the drained section, as the requests
> > > > that were caught in-flight with the connection to the server broken can
> > > > only usefully get drained if the connection is restored.  Otherwise such
> > > > requests can only either stall resulting in a deadlock (before
> > > > 8c517de24a), or be aborted defeating the purpose of the reconnection
> > > > machinery (after 8c517de24a).
> > > > 
> > > > Since the pieces of the reconnection logic are now properly migrated
> > > > from one aio_context to another, it appears safe to just stop messing
> > > > with the drained section in the reconnection code.
> > > > 
> > > > Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
> > > 
> > > I'd not think that it "fixes" it. Behavior changes.. But 5ad81b4946
> > > didn't introduce any bugs.
> > 
> > I guess you're right.
> > 
> > In fact I did reproduce the situation when the drain made no progress
> > exactly becase the only in-flight reference was taken by the
> > connection_co, but it may be due to some intermediate stage of the patch
> > development so I need to do a more thorough analysis to tell if it was
> > triggerable with the original code.
> > 
> > > > Fixes: 8c517de24a ("block/nbd: fix drain dead-lock because of nbd 
> > > > reconnect-delay")
> > > 
> > > And here..
> > > 
> > > 1. There is an existing problem (unrelated to nbd) in Qemu that long
> > > io request which we have to wait for at drained_begin may trigger a
> > > dead lock
> > > (https://lists.gnu.org/archive/html/qemu-devel/2020-09/msg01339.html)
> > > 
> > > 2. So, when we have nbd reconnect (and therefore long io requests) we
> > > simply trigger this deadlock.. That's why I decided to cancel the
> > > requests (assuming they will most probably fail anyway).
> > > 
> > > I agree that nbd driver is wrong place for fixing the problem
> > > described in
> > > (https://lists.gnu.org/archive/html/qemu-devel/2020-09/msg01339.html),
> > > but if you just revert 8c517de24a, you'll see the deadlock again..
> > 
> > I may have misunderstood that thread, but isn't that deadlock exactly
> > due to the requests being unable to ever drain because the
> > reconnection process is suspended while the drain is in progress?
> > 
> 
> 
> Hmm, I didn't thought about it this way.  What you mean is that
> reconnection is cancelled on drain_begin, so drain_begin will never
> finish, because it waits for requests, which will never be
> reconnected. So, you are right it's a deadlock too.

Right.

> But as I remember what is described in 8c517de24a is another deadlock,
> triggered by "just a long request during drain_begin". And it may be
> triggered again, if the we'll try to reconnect for several seconds
> during drained_begin() instead of cancelling requests.

IMO it wasn't a different deadlock, it was exactly this one: the drain
got stuck the way described above with the qemu global mutex taken, so
the rest of qemu got stuck too.

> Didn't you try the scenario described in 8c517de24a on top of your
> series?

I've tried it with the current master with just 8c517de24a reverted --
the deadlock reproduced in all of my several attempts.  Then with my series --
the deadlock didn't reproduce in any of my several attempts.  More
precisely, qemu appeared frozen once the guest timed out the requests
and initiated ATA link soft reset, which presumably caused that drain,
but, as soon as the reconnect timer expired (resulting in requests
completed into the guest with errors) or the connection was restored
(resulting in requests completed successfully), it resumed normal
operation.

So yes, I think this series does address that deadlock.

Thanks,
Roman.



Re: [PATCH 0/7] block/nbd: decouple reconnect from drain

2021-03-16 Thread Roman Kagan
On Tue, Mar 16, 2021 at 09:41:36AM -0500, Eric Blake wrote:
> On 3/15/21 1:06 AM, Roman Kagan wrote:
> > The reconnection logic doesn't need to stop while in a drained section.
> > Moreover it has to be active during the drained section, as the requests
> > that were caught in-flight with the connection to the server broken can
> > only usefully get drained if the connection is restored.  Otherwise such
> > requests can only either stall resulting in a deadlock (before
> > 8c517de24a), or be aborted defeating the purpose of the reconnection
> > machinery (after 8c517de24a).
> > 
> > This series aims to just stop messing with the drained section in the
> > reconnection code.
> > 
> > While doing so it undoes the effect of 5ad81b4946 ("nbd: Restrict
> > connection_co reentrance"); as I've missed the point of that commit I'd
> > appreciate more scrutiny in this area.
> 
> Soft freeze is today.  I'm leaning towards declaring this series as a
> bug fix (and so give it some more soak time to get right, but still okay
> for -rc1) rather than a feature addition (and therefore would need to be
> in a pull request today).  Speak up now if this characterization is off
> base.

Yes, I'd consider it a bug fix, too.  I'll do my best to beat it into
shape before -rc2.

Thanks,
Roman.



Re: [PATCH 7/7] block/nbd: stop manipulating in_flight counter

2021-03-16 Thread Roman Kagan
On Mon, Mar 15, 2021 at 11:15:44PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 15.03.2021 09:06, Roman Kagan wrote:
> > As the reconnect logic no longer interferes with drained sections, it
> > appears unnecessary to explicitly manipulate the in_flight counter.
> > 
> > Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
> 
> And here you actually allow qemu_aio_coroutine_enter() call in
> nbd_client_attach_aio_context_bh() to enter connection_co in any yield
> point which is possible during drained section. The analysis should be
> done to be sure that all these yield points are safe for reentering by
> external qemu_aio_coroutine_enter(). (By external I mean not by the
> actual enter() we are waiting for at the yield() point. For example
> qemu_channel_yield() supports reentering.. And therefore (as I
> understand after fast looking through) nbd_read() should support
> reentering too..

I'll do a more thorough analysis of how safe it is.

FWIW this hasn't triggered any test failures yet, but that assert in
patch 3 didn't ever go off either so I'm not sure I can trust the tests
on this.

Thanks,
Roman.



Re: [PATCH 6/7] block/nbd: decouple reconnect from drain

2021-03-16 Thread Roman Kagan
On Mon, Mar 15, 2021 at 11:10:14PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 15.03.2021 09:06, Roman Kagan wrote:
> > The reconnection logic doesn't need to stop while in a drained section.
> > Moreover it has to be active during the drained section, as the requests
> > that were caught in-flight with the connection to the server broken can
> > only usefully get drained if the connection is restored.  Otherwise such
> > requests can only either stall resulting in a deadlock (before
> > 8c517de24a), or be aborted defeating the purpose of the reconnection
> > machinery (after 8c517de24a).
> > 
> > Since the pieces of the reconnection logic are now properly migrated
> > from one aio_context to another, it appears safe to just stop messing
> > with the drained section in the reconnection code.
> > 
> > Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
> 
> I'd not think that it "fixes" it. Behavior changes.. But 5ad81b4946
> didn't introduce any bugs.

I guess you're right.

In fact I did reproduce the situation when the drain made no progress
exactly because the only in-flight reference was taken by the
connection_co, but it may be due to some intermediate stage of the patch
development so I need to do a more thorough analysis to tell if it was
triggerable with the original code.

> > Fixes: 8c517de24a ("block/nbd: fix drain dead-lock because of nbd 
> > reconnect-delay")
> 
> And here..
> 
> 1. There is an existing problem (unrelated to nbd) in Qemu that long
> io request which we have to wait for at drained_begin may trigger a
> dead lock
> (https://lists.gnu.org/archive/html/qemu-devel/2020-09/msg01339.html)
> 
> 2. So, when we have nbd reconnect (and therefore long io requests) we
> simply trigger this deadlock.. That's why I decided to cancel the
> requests (assuming they will most probably fail anyway).
> 
> I agree that nbd driver is wrong place for fixing the problem
> described in
> (https://lists.gnu.org/archive/html/qemu-devel/2020-09/msg01339.html),
> but if you just revert 8c517de24a, you'll see the deadlock again..

I may have misunderstood that thread, but isn't that deadlock exactly
due to the requests being unable to ever drain because the
reconnection process is suspended while the drain is in progress?

Thanks,
Roman.



Re: [PATCH 0/7] block/nbd: decouple reconnect from drain

2021-03-16 Thread Roman Kagan
On Mon, Mar 15, 2021 at 10:45:39PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 15.03.2021 09:06, Roman Kagan wrote:
> > The reconnection logic doesn't need to stop while in a drained section.
> > Moreover it has to be active during the drained section, as the requests
> > that were caught in-flight with the connection to the server broken can
> > only usefully get drained if the connection is restored.  Otherwise such
> > requests can only either stall resulting in a deadlock (before
> > 8c517de24a), or be aborted defeating the purpose of the reconnection
> > machinery (after 8c517de24a).
> > 
> > This series aims to just stop messing with the drained section in the
> > reconnection code.
> > 
> > While doing so it undoes the effect of 5ad81b4946 ("nbd: Restrict
> > connection_co reentrance"); as I've missed the point of that commit I'd
> > appreciate more scrutiny in this area.
> 
> 
> The actual point is:
> 
> connection_co (together with all functions called from it) has a lot of yield 
> points. And we can't just enter the coroutine in any of the when we want, as 
> it may break some BH which is actually waited for in this yield point..
> 
> Still, we should care only about yield points possible during drained 
> section, so we don't need to care about direct qemu_coroutine_yield() inside 
> nbd_connection_entry().
> 
> Many things changed since 5ad81b4946.. So probably, now all the (possible 
> during drained section) yield points in nbd_connection_entry support 
> reentering. But some analysis of possible yield points should be done.

Thanks for the explanation.  Will do this analysis.

Roman.



Re: [PATCH 1/7] block/nbd: avoid touching freed connect_thread

2021-03-16 Thread Roman Kagan
On Mon, Mar 15, 2021 at 06:40:12PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 15.03.2021 09:06, Roman Kagan wrote:
> > When the NBD connection is being torn down, the connection thread gets
> > canceled and "detached", meaning it is about to get freed.
> > 
> > If this happens while the connection coroutine yielded waiting for the
> > connection thread to complete, when it resumes it may access the
> > invalidated connection thread data.
> > 
> > To prevent this, revalidate the ->connect_thread pointer in
> > nbd_co_establish_connection_cancel before using it after the yield.
> > 
> > Signed-off-by: Roman Kagan 
> 
> Seems possible. Do you have a reproducer? Would be great to make an iotest.

It triggered on me in iotest 277, but it seems to be timing-dependent.
I'll see if I can reproduce it reliably.

Thanks,
Roman.



Re: [PATCH 3/7] block/nbd: assert attach/detach runs in the proper context

2021-03-15 Thread Roman Kagan
On Mon, Mar 15, 2021 at 07:41:32PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 15.03.2021 09:06, Roman Kagan wrote:
> > Document (via a comment and an assert) that
> > nbd_client_detach_aio_context and nbd_client_attach_aio_context_bh run
> > in the desired aio_context
> > 
> > Signed-off-by: Roman Kagan 
> > ---
> >   block/nbd.c | 12 
> >   1 file changed, 12 insertions(+)
> > 
> > diff --git a/block/nbd.c b/block/nbd.c
> > index 1d8edb5b21..658b827d24 100644
> > --- a/block/nbd.c
> > +++ b/block/nbd.c
> > @@ -241,6 +241,12 @@ static void 
> > nbd_client_detach_aio_context(BlockDriverState *bs)
> >   {
> >   BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
> > +/*
> > + * This runs in the (old, about to be detached) aio context of the @bs 
> > so
> > + * accessing the stuff on @s is concurrency-free.
> > + */
> > +assert(qemu_get_current_aio_context() == bdrv_get_aio_context(bs));
> 
> Hmm. I don't think so. The handler is called from 
> bdrv_set_aio_context_ignore(), which have the assertion 
> g_assert(qemu_get_current_aio_context() == qemu_get_aio_context());. There is 
> also a comment above bdrv_set_aio_context_ignore() "The caller must own the 
> AioContext lock for the old AioContext of bs".
> 
> So, we are not in the home context of bs here. We are in the main aio context 
> and we hold AioContext lock of old bs context.

You're absolutely right.  I'm wondering where I got the idea of this
assertion from...
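
(IOW, the assertion that would actually hold there is rather

    assert(qemu_get_current_aio_context() == qemu_get_aio_context());

i.e. "we run in the main loop context", which is not what my patch
claims.)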

> 
> > +
> >   /* Timer is deleted in nbd_client_co_drain_begin() */
> >   assert(!s->reconnect_delay_timer);
> >   /*
> > @@ -258,6 +264,12 @@ static void nbd_client_attach_aio_context_bh(void 
> > *opaque)
> >   BlockDriverState *bs = opaque;
> >   BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
> > +/*
> > + * This runs in the (new, just attached) aio context of the @bs so
> > + * accessing the stuff on @s is concurrency-free.
> > + */
> > +assert(qemu_get_current_aio_context() == bdrv_get_aio_context(bs));
> 
> This is correct just because we are in a BH, scheduled for this
> context (I hope we can't reattach some third context prior to entering
> the BH in the second:). So, I don't think this assertion really adds
> something.

Indeed.

> > +
> >   if (s->connection_co) {
> >   /*
> >* The node is still drained, so we know the coroutine has 
> > yielded in
> > 
> 
> I'm not sure that the asserted fact gives us "concurrency-free". For
> this we also need to ensure that all other things in the driver are
> always called in same aio context.. Still, it's a general assumption
> we have when writing block drivers "everything in one aio context, so
> don't care".. Sometime it leads to bugs, as some things are still
> called even from non-coroutine context. Also, Paolo said
> (https://lists.gnu.org/archive/html/qemu-devel/2020-08/msg03814.html)
> that many iothreads will send requests to bs, and the code in driver
> should be thread safe. I don't have good understanding of all these
> things, and what I have is:
> 
> For now (at least we don't have problems in Rhel based downstream) it
> maybe OK to think that in block-driver everything is protected by
> AioContext lock and all concurrency we have inside block driver is
> switching between coroutines but never real parallelism. But in
> general it's not so, and with multiqueue it's not so.. So, I'd not put
> such a comment :)

So the patch is bogus in every respect; let's just drop it.

I hope it doesn't completely invalidate the rest of the series, though.

Meanwhile I certainly need to update my idea of concurrency assumptions
in the block layer.

Thanks,
Roman.



[PATCH 2/7] block/nbd: use uniformly nbd_client_connecting_wait

2021-03-15 Thread Roman Kagan
Use nbd_client_connecting_wait uniformly all over block/nbd.c.

While at it, drop the redundant check for nbd_client_connecting_wait
in reconnect_delay_timer_init, as all its callsites do this check too.

Signed-off-by: Roman Kagan 
---
 block/nbd.c | 34 +++---
 1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 447d176b76..1d8edb5b21 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -165,6 +165,18 @@ static void nbd_clear_bdrvstate(BDRVNBDState *s)
 s->x_dirty_bitmap = NULL;
 }
 
+static bool nbd_client_connecting(BDRVNBDState *s)
+{
+NBDClientState state = qatomic_load_acquire(>state);
+return state == NBD_CLIENT_CONNECTING_WAIT ||
+state == NBD_CLIENT_CONNECTING_NOWAIT;
+}
+
+static bool nbd_client_connecting_wait(BDRVNBDState *s)
+{
+return qatomic_load_acquire(>state) == NBD_CLIENT_CONNECTING_WAIT;
+}
+
 static void nbd_channel_error(BDRVNBDState *s, int ret)
 {
 if (ret == -EIO) {
@@ -205,7 +217,7 @@ static void reconnect_delay_timer_cb(void *opaque)
 {
 BDRVNBDState *s = opaque;
 
-if (qatomic_load_acquire(>state) == NBD_CLIENT_CONNECTING_WAIT) {
+if (nbd_client_connecting_wait(s)) {
 s->state = NBD_CLIENT_CONNECTING_NOWAIT;
 while (qemu_co_enter_next(>free_sema, NULL)) {
 /* Resume all queued requests */
@@ -217,10 +229,6 @@ static void reconnect_delay_timer_cb(void *opaque)
 
 static void reconnect_delay_timer_init(BDRVNBDState *s, uint64_t 
expire_time_ns)
 {
-if (qatomic_load_acquire(>state) != NBD_CLIENT_CONNECTING_WAIT) {
-return;
-}
-
 assert(!s->reconnect_delay_timer);
 s->reconnect_delay_timer = aio_timer_new(bdrv_get_aio_context(s->bs),
  QEMU_CLOCK_REALTIME,
@@ -346,18 +354,6 @@ static void nbd_teardown_connection(BlockDriverState *bs)
 assert(!s->connection_co);
 }
 
-static bool nbd_client_connecting(BDRVNBDState *s)
-{
-NBDClientState state = qatomic_load_acquire(>state);
-return state == NBD_CLIENT_CONNECTING_WAIT ||
-state == NBD_CLIENT_CONNECTING_NOWAIT;
-}
-
-static bool nbd_client_connecting_wait(BDRVNBDState *s)
-{
-return qatomic_load_acquire(>state) == NBD_CLIENT_CONNECTING_WAIT;
-}
-
 static void connect_bh(void *opaque)
 {
 BDRVNBDState *state = opaque;
@@ -667,7 +663,7 @@ static coroutine_fn void nbd_co_reconnect_loop(BDRVNBDState 
*s)
 uint64_t timeout = 1 * NANOSECONDS_PER_SECOND;
 uint64_t max_timeout = 16 * NANOSECONDS_PER_SECOND;
 
-if (qatomic_load_acquire(>state) == NBD_CLIENT_CONNECTING_WAIT) {
+if (nbd_client_connecting_wait(s)) {
 reconnect_delay_timer_init(s, qemu_clock_get_ns(QEMU_CLOCK_REALTIME) +
s->reconnect_delay * 
NANOSECONDS_PER_SECOND);
 }
@@ -2473,7 +2469,7 @@ static void nbd_cancel_in_flight(BlockDriverState *bs)
 
 reconnect_delay_timer_del(s);
 
-if (s->state == NBD_CLIENT_CONNECTING_WAIT) {
+if (nbd_client_connecting_wait(s)) {
 s->state = NBD_CLIENT_CONNECTING_NOWAIT;
 qemu_co_queue_restart_all(>free_sema);
 }
-- 
2.30.2




[PATCH 3/7] block/nbd: assert attach/detach runs in the proper context

2021-03-15 Thread Roman Kagan
Document (via a comment and an assert) that
nbd_client_detach_aio_context and nbd_client_attach_aio_context_bh run
in the desired aio_context.

Signed-off-by: Roman Kagan 
---
 block/nbd.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/block/nbd.c b/block/nbd.c
index 1d8edb5b21..658b827d24 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -241,6 +241,12 @@ static void nbd_client_detach_aio_context(BlockDriverState 
*bs)
 {
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
+/*
+ * This runs in the (old, about to be detached) aio context of the @bs so
+ * accessing the stuff on @s is concurrency-free.
+ */
+assert(qemu_get_current_aio_context() == bdrv_get_aio_context(bs));
+
 /* Timer is deleted in nbd_client_co_drain_begin() */
 assert(!s->reconnect_delay_timer);
 /*
@@ -258,6 +264,12 @@ static void nbd_client_attach_aio_context_bh(void *opaque)
 BlockDriverState *bs = opaque;
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
+/*
+ * This runs in the (new, just attached) aio context of the @bs so
+ * accessing the stuff on @s is concurrency-free.
+ */
+assert(qemu_get_current_aio_context() == bdrv_get_aio_context(bs));
+
 if (s->connection_co) {
 /*
  * The node is still drained, so we know the coroutine has yielded in
-- 
2.30.2




[PATCH 6/7] block/nbd: decouple reconnect from drain

2021-03-15 Thread Roman Kagan
The reconnection logic doesn't need to stop while in a drained section.
Moreover it has to be active during the drained section, as the requests
that were caught in-flight with the connection to the server broken can
only usefully get drained if the connection is restored.  Otherwise such
requests can only either stall resulting in a deadlock (before
8c517de24a), or be aborted defeating the purpose of the reconnection
machinery (after 8c517de24a).

Since the pieces of the reconnection logic are now properly migrated
from one aio_context to another, it appears safe to just stop messing
with the drained section in the reconnection code.

Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
Fixes: 8c517de24a ("block/nbd: fix drain dead-lock because of nbd 
reconnect-delay")
Signed-off-by: Roman Kagan 
---
 block/nbd.c | 79 +++--
 1 file changed, 4 insertions(+), 75 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index a3eb9b9079..a5a9e4aca5 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -117,8 +117,6 @@ typedef struct BDRVNBDState {
 Coroutine *connection_co;
 Coroutine *teardown_co;
 QemuCoSleepState *connection_co_sleep_ns_state;
-bool drained;
-bool wait_drained_end;
 int in_flight;
 NBDClientState state;
 int connect_status;
@@ -311,12 +309,6 @@ static void nbd_client_attach_aio_context_bh(void *opaque)
 qemu_mutex_unlock(>mutex);
 
 if (s->connection_co) {
-/*
- * The node is still drained, so we know the coroutine has yielded in
- * nbd_read_eof(), the only place where bs->in_flight can reach 0, or
- * it is entered for the first time. Both places are safe for entering
- * the coroutine.
- */
 qemu_aio_coroutine_enter(bs->aio_context, s->connection_co);
 }
 bdrv_dec_in_flight(bs);
@@ -344,37 +336,6 @@ static void nbd_client_attach_aio_context(BlockDriverState 
*bs,
 aio_wait_bh_oneshot(new_context, nbd_client_attach_aio_context_bh, bs);
 }
 
-static void coroutine_fn nbd_client_co_drain_begin(BlockDriverState *bs)
-{
-BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
-
-s->drained = true;
-if (s->connection_co_sleep_ns_state) {
-qemu_co_sleep_wake(s->connection_co_sleep_ns_state);
-}
-
-nbd_co_establish_connection_cancel(bs, false);
-
-reconnect_delay_timer_del(s);
-
-if (qatomic_load_acquire(>state) == NBD_CLIENT_CONNECTING_WAIT) {
-s->state = NBD_CLIENT_CONNECTING_NOWAIT;
-qemu_co_queue_restart_all(>free_sema);
-}
-}
-
-static void coroutine_fn nbd_client_co_drain_end(BlockDriverState *bs)
-{
-BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
-
-s->drained = false;
-if (s->wait_drained_end) {
-s->wait_drained_end = false;
-aio_co_wake(s->connection_co);
-}
-}
-
-
 static void nbd_teardown_connection(BlockDriverState *bs)
 {
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
@@ -686,16 +647,6 @@ static coroutine_fn void 
nbd_reconnect_attempt(BDRVNBDState *s)
 
 ret = nbd_client_handshake(s->bs, _err);
 
-if (s->drained) {
-s->wait_drained_end = true;
-while (s->drained) {
-/*
- * We may be entered once from nbd_client_attach_aio_context_bh
- * and then from nbd_client_co_drain_end. So here is a loop.
- */
-qemu_coroutine_yield();
-}
-}
 bdrv_inc_in_flight(s->bs);
 
 out:
@@ -724,26 +675,10 @@ static coroutine_fn void 
nbd_co_reconnect_loop(BDRVNBDState *s)
 nbd_reconnect_attempt(s);
 
 while (nbd_client_connecting(s)) {
-if (s->drained) {
-bdrv_dec_in_flight(s->bs);
-s->wait_drained_end = true;
-while (s->drained) {
-/*
- * We may be entered once from nbd_client_attach_aio_context_bh
- * and then from nbd_client_co_drain_end. So here is a loop.
- */
-qemu_coroutine_yield();
-}
-bdrv_inc_in_flight(s->bs);
-} else {
-qemu_co_sleep_ns_wakeable(QEMU_CLOCK_REALTIME, timeout,
-  >connection_co_sleep_ns_state);
-if (s->drained) {
-continue;
-}
-if (timeout < max_timeout) {
-timeout *= 2;
-}
+qemu_co_sleep_ns_wakeable(QEMU_CLOCK_REALTIME, timeout,
+  >connection_co_sleep_ns_state);
+if (timeout < max_timeout) {
+timeout *= 2;
 }
 
 nbd_reconnect_attempt(s);
@@ -2548,8 +2483,6 @@ static BlockDriver bdrv_nbd = {
 .bdrv_getlength = nbd_getlength,
 .bdrv_detach_aio_context= nbd_client_detach_aio_context,
 .bdrv_attach_aio_context   

[PATCH 7/7] block/nbd: stop manipulating in_flight counter

2021-03-15 Thread Roman Kagan
As the reconnect logic no longer interferes with drained sections, it
appears unnecessary to explicitly manipulate the in_flight counter.

Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
Signed-off-by: Roman Kagan 
---
 block/nbd.c  | 6 --
 nbd/client.c | 2 --
 2 files changed, 8 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index a5a9e4aca5..3a22f5d897 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -311,7 +311,6 @@ static void nbd_client_attach_aio_context_bh(void *opaque)
 if (s->connection_co) {
 qemu_aio_coroutine_enter(bs->aio_context, s->connection_co);
 }
-bdrv_dec_in_flight(bs);
 }
 
 static void nbd_client_attach_aio_context(BlockDriverState *bs,
@@ -327,7 +326,6 @@ static void nbd_client_attach_aio_context(BlockDriverState 
*bs,
 qio_channel_attach_aio_context(QIO_CHANNEL(s->ioc), new_context);
 }
 
-bdrv_inc_in_flight(bs);
 
 /*
  * Need to wait here for the BH to run because the BH must run while the
@@ -643,11 +641,9 @@ static coroutine_fn void 
nbd_reconnect_attempt(BDRVNBDState *s)
 goto out;
 }
 
-bdrv_dec_in_flight(s->bs);
 
 ret = nbd_client_handshake(s->bs, _err);
 
-bdrv_inc_in_flight(s->bs);
 
 out:
 s->connect_status = ret;
@@ -759,7 +755,6 @@ static coroutine_fn void nbd_connection_entry(void *opaque)
 
 qemu_co_queue_restart_all(>free_sema);
 nbd_recv_coroutines_wake_all(s);
-bdrv_dec_in_flight(s->bs);
 
 s->connection_co = NULL;
 if (s->ioc) {
@@ -2307,7 +2302,6 @@ static int nbd_open(BlockDriverState *bs, QDict *options, 
int flags,
 nbd_init_connect_thread(s);
 
 s->connection_co = qemu_coroutine_create(nbd_connection_entry, s);
-bdrv_inc_in_flight(bs);
 aio_co_schedule(bdrv_get_aio_context(bs), s->connection_co);
 
 return 0;
diff --git a/nbd/client.c b/nbd/client.c
index 0c2db4bcba..30d5383cb1 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -1434,9 +1434,7 @@ nbd_read_eof(BlockDriverState *bs, QIOChannel *ioc, void 
*buffer, size_t size,
 
 len = qio_channel_readv(ioc, , 1, errp);
 if (len == QIO_CHANNEL_ERR_BLOCK) {
-bdrv_dec_in_flight(bs);
 qio_channel_yield(ioc, G_IO_IN);
-bdrv_inc_in_flight(bs);
 continue;
 } else if (len < 0) {
 return -EIO;
-- 
2.30.2




[PATCH 0/7] block/nbd: decouple reconnect from drain

2021-03-15 Thread Roman Kagan
The reconnection logic doesn't need to stop while in a drained section.
Moreover it has to be active during the drained section, as the requests
that were caught in-flight with the connection to the server broken can
only usefully get drained if the connection is restored.  Otherwise such
requests can only either stall resulting in a deadlock (before
8c517de24a), or be aborted defeating the purpose of the reconnection
machinery (after 8c517de24a).

This series aims to just stop messing with the drained section in the
reconnection code.

While doing so it undoes the effect of 5ad81b4946 ("nbd: Restrict
connection_co reentrance"); as I've missed the point of that commit I'd
appreciate more scrutiny in this area.

Roman Kagan (7):
  block/nbd: avoid touching freed connect_thread
  block/nbd: use uniformly nbd_client_connecting_wait
  block/nbd: assert attach/detach runs in the proper context
  block/nbd: transfer reconnection stuff across aio_context switch
  block/nbd: better document a case in nbd_co_establish_connection
  block/nbd: decouple reconnect from drain
  block/nbd: stop manipulating in_flight counter

 block/nbd.c  | 191 +++
 nbd/client.c |   2 -
 2 files changed, 86 insertions(+), 107 deletions(-)

-- 
2.30.2




[PATCH 5/7] block/nbd: better document a case in nbd_co_establish_connection

2021-03-15 Thread Roman Kagan
Cosmetic: adjust the comment and the return value in
nbd_co_establish_connection where it's entered while the connection
thread is still running.

Signed-off-by: Roman Kagan 
---
 block/nbd.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index a6d713ba58..a3eb9b9079 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -562,11 +562,12 @@ nbd_co_establish_connection(BlockDriverState *bs, Error 
**errp)
 case CONNECT_THREAD_RUNNING:
 case CONNECT_THREAD_RUNNING_DETACHED:
 /*
- * Obviously, drained section wants to start. Report the attempt as
- * failed. Still connect thread is executing in background, and its
+ * Spurious coroutine resume before connection attempt has finished,
+ * presumably upon attaching a new aio_context. Report the attempt as
+ * failed.  Still connect thread is executing in background, and its
  * result may be used for next connection attempt.
  */
-ret = -1;
+ret = -ECONNABORTED;
 error_setg(errp, "Connection attempt cancelled by other operation");
 break;
 
-- 
2.30.2




[PATCH 1/7] block/nbd: avoid touching freed connect_thread

2021-03-15 Thread Roman Kagan
When the NBD connection is being torn down, the connection thread gets
canceled and "detached", meaning it is about to get freed.

If this happens while the connection coroutine yielded waiting for the
connection thread to complete, when it resumes it may access the
invalidated connection thread data.

To prevent this, revalidate the ->connect_thread pointer in
nbd_co_establish_connection_cancel before using it after the yield.

Signed-off-by: Roman Kagan 
---
 block/nbd.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/block/nbd.c b/block/nbd.c
index c26dc5a54f..447d176b76 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -486,6 +486,15 @@ nbd_co_establish_connection(BlockDriverState *bs, Error 
**errp)
 s->wait_connect = true;
 qemu_coroutine_yield();
 
+/*
+ * If nbd_co_establish_connection_cancel had a chance to run it may have
+ * invalidated ->connect_thread.
+ */
+thr = s->connect_thread;
+if (!thr) {
+return -ECONNABORTED;
+}
+
 qemu_mutex_lock(>mutex);
 
 switch (thr->state) {
-- 
2.30.2




[PATCH 4/7] block/nbd: transfer reconnection stuff across aio_context switch

2021-03-15 Thread Roman Kagan
Make various pieces of the reconnection logic correctly survive the
transition of the BDRVNBDState from one aio_context to another.  In
particular,

- cancel the reconnect_delay_timer and rearm it in the new context;
- cancel the sleep of the connection_co between reconnect attempts so
  that it continues in the new context;
- prevent the connection thread from delivering its status to the old
  context, and retarget it to the new context on attach.

None of these is needed at the moment because the aio_context switch
happens within a drained section and that effectively terminates the
reconnection logic on entry and starts it over on exit.  However, this
patch paves the way to keeping the reconnection process active across
the drained section (in a followup patch).

Signed-off-by: Roman Kagan 
---
 block/nbd.c | 44 ++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 658b827d24..a6d713ba58 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -126,6 +126,7 @@ typedef struct BDRVNBDState {
 bool wait_in_flight;
 
 QEMUTimer *reconnect_delay_timer;
+int64_t reconnect_expire_time_ns;
 
 NBDClientRequest requests[MAX_NBD_REQUESTS];
 NBDReply reply;
@@ -240,6 +241,7 @@ static void reconnect_delay_timer_init(BDRVNBDState *s, 
uint64_t expire_time_ns)
 static void nbd_client_detach_aio_context(BlockDriverState *bs)
 {
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
+NBDConnectThread *thr = s->connect_thread;
 
 /*
  * This runs in the (old, about to be detached) aio context of the @bs so
@@ -247,8 +249,31 @@ static void nbd_client_detach_aio_context(BlockDriverState 
*bs)
  */
 assert(qemu_get_current_aio_context() == bdrv_get_aio_context(bs));
 
-/* Timer is deleted in nbd_client_co_drain_begin() */
-assert(!s->reconnect_delay_timer);
+/*
+ * Make sure the connection thread doesn't try to deliver its status to the
+ * old context.
+ */
+qemu_mutex_lock(>mutex);
+thr->bh_ctx = NULL;
+qemu_mutex_unlock(>mutex);
+
+/*
+ * Preserve the expiration time of the reconnect_delay_timer in order to
+ * resume it on the new aio context.
+ */
+s->reconnect_expire_time_ns = s->reconnect_delay_timer ?
+timer_expire_time_ns(s->reconnect_delay_timer) : -1;
+reconnect_delay_timer_del(s);
+
+/*
+ * If the connection coroutine was sleeping between reconnect attempts,
+ * wake it up now and let it continue the process in the new aio context.
+ * This will distort the exponential back-off but that's probably ok.
+ */
+if (s->connection_co_sleep_ns_state) {
+qemu_co_sleep_wake(s->connection_co_sleep_ns_state);
+}
+
 /*
  * If reconnect is in progress we may have no ->ioc.  It will be
  * re-instantiated in the proper aio context once the connection is
@@ -263,6 +288,7 @@ static void nbd_client_attach_aio_context_bh(void *opaque)
 {
 BlockDriverState *bs = opaque;
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
+NBDConnectThread *thr = s->connect_thread;
 
 /*
  * This runs in the (new, just attached) aio context of the @bs so
@@ -270,6 +296,20 @@ static void nbd_client_attach_aio_context_bh(void *opaque)
  */
 assert(qemu_get_current_aio_context() == bdrv_get_aio_context(bs));
 
+if (nbd_client_connecting_wait(s) && s->reconnect_expire_time_ns >= 0) {
+reconnect_delay_timer_init(s, s->reconnect_expire_time_ns);
+}
+
+/*
+ * If the connection thread hasn't completed connecting yet, make sure it
+ * can deliver its status in the new context.
+ */
+qemu_mutex_lock(>mutex);
+if (thr->state == CONNECT_THREAD_RUNNING) {
+thr->bh_ctx = qemu_get_current_aio_context();
+}
+qemu_mutex_unlock(>mutex);
+
 if (s->connection_co) {
 /*
  * The node is still drained, so we know the coroutine has yielded in
-- 
2.30.2




Re: [RFC] nbd: decouple reconnect from drain

2021-03-14 Thread Roman Kagan
On Fri, Mar 12, 2021 at 03:35:25PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 10.03.2021 12:32, Roman Kagan wrote:
> > NBD connect coroutine takes an extra in_flight reference as if it's a
> > request handler.  This prevents drain from completing until the
> > connection coroutine releases the reference.
> > 
> > When NBD is configured to reconnect, however, this appears to be fatal
> > to the reconnection idea: the drain procedure wants the reconnection to
> > be suspended, but this is only possible if the in-flight requests are
> > canceled.
> 
> As I remember from our conversation, the problem is not that we don't
> reconnect during drained section, but exactly that we cancel requests
> on drained begins starting from 8c517de24a8a1dcbeb.

Well, I'd put it the other way around: the problem is that we don't
reconnect during the drained section, so the requests can't be
successfully completed if the connection breaks: they can either stall
forever (before 8c517de24a8a1dcbeb) or be aborted (after
8c517de24a8a1dcbeb).

> This is not a problem in scenarios when reconnect is rare case and
> failed request is acceptable. But if we have bad connection and
> requests should often wait for reconnect (so, it may be considered as
> a kind of "latency") then really, cancelling the waiting requests on
> any drain() kills the reconnect feature.

In our experience the primary purpose of reconnect is not to withstand
poor network links, but to be able to restart the NBD server
without disrupting the guest operation.  So failing the requests during
the server maintenance window is never acceptable, no matter how rare it
is (and in our practice it isn't).

> > Fix this by making the connection coroutine stop messing with the
> > in-flight counter.  Instead, certain care is taken to properly move the
> > reconnection stuff from one aio_context to another in
> > .bdrv_{attach,detach}_aio_context callbacks.
> > 
> > Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
> > Signed-off-by: Roman Kagan 
> > ---
> > This patch passes the regular make check but fails some extra iotests,
> > in particular 277.  It obviously lacks more robust interaction with the
> > connection thread (which in general is fairly complex and hard to reason
> > about), and perhaps has some other drawbacks, so I'll work on this
> > further, but I'd appreciate some feedback on whether the idea is sound.
> > 
> 
> In general I like the idea. The logic around drain in nbd is
> overcomplicated. And I never liked the fact that nbd_read_eof() does
> dec-inc inflight section. Some notes:
> 
> 1. I hope, the patch can be divided into several ones, as there are
> several things done:
> 
> - removing use of in_flight counter introduced by 5ad81b4946
> - do reconnect during drained section
> - stop cancelling requests on .drained_begin

I've split the (somewhat extended) patch into a series of 7, but not
exactly as you suggested.  In particular, I can't just stop aborting the
requests on .drained_begin as this would reintroduce the deadlock, so I
just remove the whole .drained_begin/end in a single patch.

> 2. 5ad81b4946 was needed to make nbd_client_attach_aio_context()
> reenter connection_co only in one (or two) possible places, not on any
> yield.. And I don't see how it is achieved now.. This should be
> described in commit msg..

My problem is that I failed to figure out why it was necessary in the
first place, so I don't think I achieve this with the series.

> 3. About cancelling requests on drained_begin. The behavior was
> introduced by 8c517de24a8a1dcbeb, to fix a deadlock. So, if now the
> deadlock is fixed another way, let's change the logic (don't cancel
> requests) in a separate patch, and note 8c517de24a8a1dcbeb commit and
> the commit that fixes deadlock the other way in the commit message.

As I mentioned earlier I did a patch that removed the root cause; a
separate patch removing just the request cancelation didn't look
justified.  I did mention the commits in the log.

Thanks for the review!  I'm now on to submitting a non-RFC version.
Roman.



[RFC] nbd: decouple reconnect from drain

2021-03-10 Thread Roman Kagan
NBD connect coroutine takes an extra in_flight reference as if it's a
request handler.  This prevents drain from completing until the
connection coroutine releases the reference.

When NBD is configured to reconnect, however, this appears to be fatal
to the reconnection idea: the drain procedure wants the reconnection to
be suspended, but this is only possible if the in-flight requests are
canceled.

Fix this by making the connection coroutine stop messing with the
in-flight counter.  Instead, certain care is taken to properly move the
reconnection stuff from one aio_context to another in
.bdrv_{attach,detach}_aio_context callbacks.

Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
Signed-off-by: Roman Kagan 
---
This patch passes the regular make check but fails some extra iotests,
in particular 277.  It obviously lacks more robust interaction with the
connection thread (which in general is fairly complex and hard to reason
about), and perhaps has some other drawbacks, so I'll work on this
further, but I'd appreciate some feedback on whether the idea is sound.

 block/nbd.c  | 134 ---
 nbd/client.c |   2 -
 2 files changed, 41 insertions(+), 95 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index c26dc5a54f..5319e543ab 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -117,8 +117,6 @@ typedef struct BDRVNBDState {
 Coroutine *connection_co;
 Coroutine *teardown_co;
 QemuCoSleepState *connection_co_sleep_ns_state;
-bool drained;
-bool wait_drained_end;
 int in_flight;
 NBDClientState state;
 int connect_status;
@@ -126,6 +124,7 @@ typedef struct BDRVNBDState {
 bool wait_in_flight;
 
 QEMUTimer *reconnect_delay_timer;
+int64_t reconnect_expire_time_ns;
 
 NBDClientRequest requests[MAX_NBD_REQUESTS];
 NBDReply reply;
@@ -165,6 +164,18 @@ static void nbd_clear_bdrvstate(BDRVNBDState *s)
 s->x_dirty_bitmap = NULL;
 }
 
+static bool nbd_client_connecting(BDRVNBDState *s)
+{
+NBDClientState state = qatomic_load_acquire(>state);
+return state == NBD_CLIENT_CONNECTING_WAIT ||
+state == NBD_CLIENT_CONNECTING_NOWAIT;
+}
+
+static bool nbd_client_connecting_wait(BDRVNBDState *s)
+{
+return qatomic_load_acquire(>state) == NBD_CLIENT_CONNECTING_WAIT;
+}
+
 static void nbd_channel_error(BDRVNBDState *s, int ret)
 {
 if (ret == -EIO) {
@@ -217,10 +228,6 @@ static void reconnect_delay_timer_cb(void *opaque)
 
 static void reconnect_delay_timer_init(BDRVNBDState *s, uint64_t 
expire_time_ns)
 {
-if (qatomic_load_acquire(>state) != NBD_CLIENT_CONNECTING_WAIT) {
-return;
-}
-
 assert(!s->reconnect_delay_timer);
 s->reconnect_delay_timer = aio_timer_new(bdrv_get_aio_context(s->bs),
  QEMU_CLOCK_REALTIME,
@@ -233,8 +240,20 @@ static void nbd_client_detach_aio_context(BlockDriverState 
*bs)
 {
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
-/* Timer is deleted in nbd_client_co_drain_begin() */
-assert(!s->reconnect_delay_timer);
+/*
+ * This runs in the (old, about to be detached) aio context of the @bs so
+ * accessing the stuff on @s is concurrency-free.
+ */
+assert(qemu_get_current_aio_context() == bdrv_get_aio_context(bs));
+
+/*
+ * Preserve the expiration time of the reconnect_delay_timer in order to
+ * resume it on the new aio context.
+ */
+s->reconnect_expire_time_ns = s->reconnect_delay_timer ?
+timer_expire_time_ns(s->reconnect_delay_timer) : -1;
+reconnect_delay_timer_del(s);
+
 /*
  * If reconnect is in progress we may have no ->ioc.  It will be
  * re-instantiated in the proper aio context once the connection is
@@ -250,6 +269,16 @@ static void nbd_client_attach_aio_context_bh(void *opaque)
 BlockDriverState *bs = opaque;
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
+/*
+ * This runs in the (new, just attached) aio context of the @bs so
+ * accessing the stuff on @s is concurrency-free.
+ */
+assert(qemu_get_current_aio_context() == bdrv_get_aio_context(bs));
+
+if (nbd_client_connecting_wait(s) && s->reconnect_expire_time_ns >= 0) {
+reconnect_delay_timer_init(s, s->reconnect_expire_time_ns);
+}
+
 if (s->connection_co) {
 /*
  * The node is still drained, so we know the coroutine has yielded in
@@ -259,7 +288,6 @@ static void nbd_client_attach_aio_context_bh(void *opaque)
  */
 qemu_aio_coroutine_enter(bs->aio_context, s->connection_co);
 }
-bdrv_dec_in_flight(bs);
 }
 
 static void nbd_client_attach_aio_context(BlockDriverState *bs,
@@ -275,7 +303,6 @@ static void nbd_client_attach_aio_context(BlockDriverState 
*bs,
 qio_channel_attach_aio_context(QIO_CHANNEL(s->ioc), new_context);
 }
 
-bdrv_inc_in_flight(bs);
 
 /*
  * Need t

Re: [PATCH 1/4] block/nbd: fix drain dead-lock because of nbd reconnect-delay

2021-02-03 Thread Roman Kagan
On Wed, Feb 03, 2021 at 05:44:34PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 03.02.2021 17:21, Roman Kagan wrote:
> > On Wed, Feb 03, 2021 at 04:10:41PM +0300, Vladimir Sementsov-Ogievskiy 
> > wrote:
> > > 03.02.2021 13:53, Roman Kagan wrote:
> > > > On Thu, Sep 03, 2020 at 10:02:58PM +0300, Vladimir Sementsov-Ogievskiy 
> > > > wrote:
> > > > > We pause reconnect process during drained section. So, if we have some
> > > > > requests, waiting for reconnect we should cancel them, otherwise they
> > > > > deadlock the drained section.
> > > > > 
> > > > > How to reproduce:
> > > > > 
> > > > > 1. Create an image:
> > > > >  qemu-img create -f qcow2 xx 100M
> > > > > 
> > > > > 2. Start NBD server:
> > > > >  qemu-nbd xx
> > > > > 
> > > > > 3. Start vm with second nbd disk on node2, like this:
> > > > > 
> > > > > ./build/x86_64-softmmu/qemu-system-x86_64 -nodefaults -drive \
> > > > >file=/work/images/cent7.qcow2 -drive \
> > > > >
> > > > > driver=nbd,server.type=inet,server.host=192.168.100.5,server.port=10809,reconnect-delay=60
> > > > >  \
> > > > >-vnc :0 -m 2G -enable-kvm -vga std
> > > > > 
> > > > > 4. Access the vm through vnc (or some other way?), and check that NBD
> > > > >  drive works:
> > > > > 
> > > > >  dd if=/dev/sdb of=/dev/null bs=1M count=10
> > > > > 
> > > > >  - the command should succeed.
> > > > > 
> > > > > 5. Now, kill the nbd server, and run dd in the guest again:
> > > > > 
> > > > >  dd if=/dev/sdb of=/dev/null bs=1M count=10
> > > > > 
> > > > > Now Qemu is trying to reconnect, and dd-generated requests are waiting
> > > > > for the connection (they will wait up to 60 seconds (see
> > > > > reconnect-delay option above) and than fail). But suddenly, vm may
> > > > > totally hang in the deadlock. You may need to increase reconnect-delay
> > > > > period to catch the dead-lock.
> > > > > 
> > > > > VM doesn't respond because drain dead-lock happens in cpu thread with
> > > > > global mutex taken. That's not good thing by itself and is not fixed
> > > > > by this commit (true way is using iothreads). Still this commit fixes
> > > > > drain dead-lock itself.
> > > > > 
> > > > > Note: probably, we can instead continue to reconnect during drained
> > > > > section. To achieve this, we may move negotiation to the connect 
> > > > > thread
> > > > > to make it independent of bs aio context. But expanding drained 
> > > > > section
> > > > > doesn't seem good anyway. So, let's now fix the bug the simplest way.
> > > > > 
> > > > > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > > > > ---
> > > > >block/nbd.c | 5 +
> > > > >1 file changed, 5 insertions(+)
> > > > > 
> > > > > diff --git a/block/nbd.c b/block/nbd.c
> > > > > index 9daf003bea..912ea27be7 100644
> > > > > --- a/block/nbd.c
> > > > > +++ b/block/nbd.c
> > > > > @@ -242,6 +242,11 @@ static void coroutine_fn 
> > > > > nbd_client_co_drain_begin(BlockDriverState *bs)
> > > > >}
> > > > >nbd_co_establish_connection_cancel(bs, false);
> > > > > +
> > > > > +if (s->state == NBD_CLIENT_CONNECTING_WAIT) {
> > > > > +s->state = NBD_CLIENT_CONNECTING_NOWAIT;
> > > > > +qemu_co_queue_restart_all(&s->free_sema);
> > > > > +}
> > > > >}
> > > > >static void coroutine_fn nbd_client_co_drain_end(BlockDriverState 
> > > > > *bs)
> > > > 
> > > > This basically defeats the whole purpose of reconnect: if the nbd client
> > > > is trying to reconnect, drain effectively cancels that and makes all
> > > > in-flight requests to complete with an error.
> > > > 
> > > > I'm not suggesting to revert this patch (it's now in the tree as commit
> > > > 8c517de24a), because the deadlock is no better, but I'm afraid the only
> > > > real

Re: [PATCH 1/4] block/nbd: fix drain dead-lock because of nbd reconnect-delay

2021-02-03 Thread Roman Kagan
On Wed, Feb 03, 2021 at 04:10:41PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 03.02.2021 13:53, Roman Kagan wrote:
> > On Thu, Sep 03, 2020 at 10:02:58PM +0300, Vladimir Sementsov-Ogievskiy 
> > wrote:
> > > We pause reconnect process during drained section. So, if we have some
> > > requests, waiting for reconnect we should cancel them, otherwise they
> > > deadlock the drained section.
> > > 
> > > How to reproduce:
> > > 
> > > 1. Create an image:
> > > qemu-img create -f qcow2 xx 100M
> > > 
> > > 2. Start NBD server:
> > > qemu-nbd xx
> > > 
> > > 3. Start vm with second nbd disk on node2, like this:
> > > 
> > >./build/x86_64-softmmu/qemu-system-x86_64 -nodefaults -drive \
> > >   file=/work/images/cent7.qcow2 -drive \
> > >   
> > > driver=nbd,server.type=inet,server.host=192.168.100.5,server.port=10809,reconnect-delay=60
> > >  \
> > >   -vnc :0 -m 2G -enable-kvm -vga std
> > > 
> > > 4. Access the vm through vnc (or some other way?), and check that NBD
> > > drive works:
> > > 
> > > dd if=/dev/sdb of=/dev/null bs=1M count=10
> > > 
> > > - the command should succeed.
> > > 
> > > 5. Now, kill the nbd server, and run dd in the guest again:
> > > 
> > > dd if=/dev/sdb of=/dev/null bs=1M count=10
> > > 
> > > Now Qemu is trying to reconnect, and dd-generated requests are waiting
> > > for the connection (they will wait up to 60 seconds (see
> > > reconnect-delay option above) and than fail). But suddenly, vm may
> > > totally hang in the deadlock. You may need to increase reconnect-delay
> > > period to catch the dead-lock.
> > > 
> > > VM doesn't respond because drain dead-lock happens in cpu thread with
> > > global mutex taken. That's not good thing by itself and is not fixed
> > > by this commit (true way is using iothreads). Still this commit fixes
> > > drain dead-lock itself.
> > > 
> > > Note: probably, we can instead continue to reconnect during drained
> > > section. To achieve this, we may move negotiation to the connect thread
> > > to make it independent of bs aio context. But expanding drained section
> > > doesn't seem good anyway. So, let's now fix the bug the simplest way.
> > > 
> > > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > > ---
> > >   block/nbd.c | 5 +
> > >   1 file changed, 5 insertions(+)
> > > 
> > > diff --git a/block/nbd.c b/block/nbd.c
> > > index 9daf003bea..912ea27be7 100644
> > > --- a/block/nbd.c
> > > +++ b/block/nbd.c
> > > @@ -242,6 +242,11 @@ static void coroutine_fn 
> > > nbd_client_co_drain_begin(BlockDriverState *bs)
> > >   }
> > >   nbd_co_establish_connection_cancel(bs, false);
> > > +
> > > +if (s->state == NBD_CLIENT_CONNECTING_WAIT) {
> > > +s->state = NBD_CLIENT_CONNECTING_NOWAIT;
> > > +qemu_co_queue_restart_all(&s->free_sema);
> > > +}
> > >   }
> > >   static void coroutine_fn nbd_client_co_drain_end(BlockDriverState *bs)
> > 
> > This basically defeats the whole purpose of reconnect: if the nbd client
> > is trying to reconnect, drain effectively cancels that and makes all
> > in-flight requests to complete with an error.
> > 
> > I'm not suggesting to revert this patch (it's now in the tree as commit
> > 8c517de24a), because the deadlock is no better, but I'm afraid the only
> > real fix is to implement reconnect during the drain section.  I'm still
> > trying to get my head around it so no patch yet, but I just wanted to
> > bring this up in case anybody beats me to it.
> > 
> 
> 
> What do you mean by "reconnect during drained section"? Trying to
> establish the connection? Or keeping in-flight requests instead of
> cancelling them? We can't keep in-flight requests during drained
> section, as it's the whole sense of drained-section: no in-flight
> requests. So we'll have to wait for them at drain_begin (waiting up to
> reconnect-delay, which doesn't seem good and triggers dead-lock for
> non-iothread environment) or just cancel them..
> 
> Do you argue that waiting on drained-begin is somehow better than
> cancelling?

Sorry I should have used more accurate wording to be clear.

Yes, my point is that canceling the requests on entry to a drained
section is incorrect.  And no, it doesn't matter if it can be long:
that's the price you pay for doing the drain.  Think of reconnect as a
special case of a slow connection: if an nbd reply from the server is
delayed for whatever reason without dropping the connection, you have to
wait here, too.  (In fact, reconnect is even slightly better here since
it has a configurable finite timeout while the message delay does not
AFAIK.)

Does it make sense?

> Or, the problem is that after drained section (assume it was short) we
> are in NBD_CLIENT_CONNECTING_NOWAIT ?  Fixing it should be simple
> enough, we just need to check the time at drained_end and if the delay
> is not expired enable NBD_CLIENT_CONNECTING_WAIT again...

Hmm, I didn't get thus far but this is probably an issue too.

Thanks,
Roman.



Re: [PATCH 1/4] block/nbd: fix drain dead-lock because of nbd reconnect-delay

2021-02-03 Thread Roman Kagan
On Thu, Sep 03, 2020 at 10:02:58PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We pause reconnect process during drained section. So, if we have some
> requests, waiting for reconnect we should cancel them, otherwise they
> deadlock the drained section.
> 
> How to reproduce:
> 
> 1. Create an image:
>qemu-img create -f qcow2 xx 100M
> 
> 2. Start NBD server:
>qemu-nbd xx
> 
> 3. Start vm with second nbd disk on node2, like this:
> 
>   ./build/x86_64-softmmu/qemu-system-x86_64 -nodefaults -drive \
>  file=/work/images/cent7.qcow2 -drive \
>  
> driver=nbd,server.type=inet,server.host=192.168.100.5,server.port=10809,reconnect-delay=60
>  \
>  -vnc :0 -m 2G -enable-kvm -vga std
> 
> 4. Access the vm through vnc (or some other way?), and check that NBD
>drive works:
> 
>dd if=/dev/sdb of=/dev/null bs=1M count=10
> 
>- the command should succeed.
> 
> 5. Now, kill the nbd server, and run dd in the guest again:
> 
>dd if=/dev/sdb of=/dev/null bs=1M count=10
> 
> Now Qemu is trying to reconnect, and dd-generated requests are waiting
> for the connection (they will wait up to 60 seconds (see
> reconnect-delay option above) and than fail). But suddenly, vm may
> totally hang in the deadlock. You may need to increase reconnect-delay
> period to catch the dead-lock.
> 
> VM doesn't respond because drain dead-lock happens in cpu thread with
> global mutex taken. That's not good thing by itself and is not fixed
> by this commit (true way is using iothreads). Still this commit fixes
> drain dead-lock itself.
> 
> Note: probably, we can instead continue to reconnect during drained
> section. To achieve this, we may move negotiation to the connect thread
> to make it independent of bs aio context. But expanding drained section
> doesn't seem good anyway. So, let's now fix the bug the simplest way.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index 9daf003bea..912ea27be7 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -242,6 +242,11 @@ static void coroutine_fn 
> nbd_client_co_drain_begin(BlockDriverState *bs)
>  }
>  
>  nbd_co_establish_connection_cancel(bs, false);
> +
> +if (s->state == NBD_CLIENT_CONNECTING_WAIT) {
> +s->state = NBD_CLIENT_CONNECTING_NOWAIT;
> +qemu_co_queue_restart_all(&s->free_sema);
> +}
>  }
>  
>  static void coroutine_fn nbd_client_co_drain_end(BlockDriverState *bs)

This basically defeats the whole purpose of reconnect: if the nbd client
is trying to reconnect, drain effectively cancels that and makes all
in-flight requests complete with an error.

I'm not suggesting to revert this patch (it's now in the tree as commit
8c517de24a), because the deadlock is no better, but I'm afraid the only
real fix is to implement reconnect during the drain section.  I'm still
trying to get my head around it so no patch yet, but I just wanted to
bring this up in case anybody beats me to it.

Thanks,
Roman.



[PATCH v2 2/3] block/nbd: only enter connection coroutine if it's present

2021-01-28 Thread Roman Kagan
When an NBD block driver state is moved from one aio_context to another
(e.g. when doing a drain in a migration thread),
nbd_client_attach_aio_context_bh is executed, which enters the
connection coroutine.

However, the assumption that ->connection_co is always present here
appears incorrect: the connection may have encountered an error other
than -EIO in the underlying transport, and thus may have decided to quit
rather than keep trying to reconnect, and therefore it may have
terminated the connection coroutine.  As a result an attempt to reassign
the client in this state (NBD_CLIENT_QUIT) to a different aio_context
leads to a null pointer dereference:

  #0  qio_channel_detach_aio_context (ioc=0x0)
  at /build/qemu-gYtjVn/qemu-5.0.1/io/channel.c:452
  #1  0x562a242824b3 in bdrv_detach_aio_context (bs=0x562a268d6a00)
  at /build/qemu-gYtjVn/qemu-5.0.1/block.c:6151
  #2  bdrv_set_aio_context_ignore (bs=bs@entry=0x562a268d6a00,
  new_context=new_context@entry=0x562a260c9580,
  ignore=ignore@entry=0x7feeadc9b780)
  at /build/qemu-gYtjVn/qemu-5.0.1/block.c:6230
  #3  0x562a24282969 in bdrv_child_try_set_aio_context
  (bs=bs@entry=0x562a268d6a00, ctx=0x562a260c9580,
  ignore_child=, errp=)
  at /build/qemu-gYtjVn/qemu-5.0.1/block.c:6332
  #4  0x562a242bb7db in blk_do_set_aio_context (blk=0x562a2735d0d0,
  new_context=0x562a260c9580,
  update_root_node=update_root_node@entry=true, errp=errp@entry=0x0)
  at /build/qemu-gYtjVn/qemu-5.0.1/block/block-backend.c:1989
  #5  0x562a242be0bd in blk_set_aio_context (blk=,
  new_context=, errp=errp@entry=0x0)
  at /build/qemu-gYtjVn/qemu-5.0.1/block/block-backend.c:2010
  #6  0x562a23fbd953 in virtio_blk_data_plane_stop (vdev=)
  at /build/qemu-gYtjVn/qemu-5.0.1/hw/block/dataplane/virtio-blk.c:292
  #7  0x562a241fc7bf in virtio_bus_stop_ioeventfd (bus=0x562a260dbf08)
  at /build/qemu-gYtjVn/qemu-5.0.1/hw/virtio/virtio-bus.c:245
  #8  0x562a23fefb2e in virtio_vmstate_change (opaque=0x562a260dbf90,
  running=0, state=)
  at /build/qemu-gYtjVn/qemu-5.0.1/hw/virtio/virtio.c:3220
  #9  0x562a2402ebfd in vm_state_notify (running=running@entry=0,
  state=state@entry=RUN_STATE_FINISH_MIGRATE)
  at /build/qemu-gYtjVn/qemu-5.0.1/softmmu/vl.c:1275
  #10 0x562a23f7bc02 in do_vm_stop (state=RUN_STATE_FINISH_MIGRATE,
  send_stop=)
  at /build/qemu-gYtjVn/qemu-5.0.1/cpus.c:1032
  #11 0x562a24209765 in migration_completion (s=0x562a260e83a0)
  at /build/qemu-gYtjVn/qemu-5.0.1/migration/migration.c:2914
  #12 migration_iteration_run (s=0x562a260e83a0)
  at /build/qemu-gYtjVn/qemu-5.0.1/migration/migration.c:3275
  #13 migration_thread (opaque=opaque@entry=0x562a260e83a0)
  at /build/qemu-gYtjVn/qemu-5.0.1/migration/migration.c:3439
  #14 0x562a2435ca96 in qemu_thread_start (args=)
  at /build/qemu-gYtjVn/qemu-5.0.1/util/qemu-thread-posix.c:519
  #15 0x7feed31466ba in start_thread (arg=0x7feeadc9c700)
  at pthread_create.c:333
  #16 0x7feed2e7c41d in __GI___sysctl (name=0x0, nlen=608471908,
  oldval=0x562a2452b138, oldlenp=0x0, newval=0x562a2452c5e0
  <__func__.28102>, newlen=0)
  at ../sysdeps/unix/sysv/linux/sysctl.c:30
  #17 0x in ?? ()

Fix it by checking that the connection coroutine is non-null before
trying to enter it.  If it is null, no entering is needed, as the
connection is probably going down anyway.

Signed-off-by: Roman Kagan 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/nbd.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index bcd6641e90..b3cbbeb4b0 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -250,13 +250,15 @@ static void nbd_client_attach_aio_context_bh(void *opaque)
 BlockDriverState *bs = opaque;
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
-/*
- * The node is still drained, so we know the coroutine has yielded in
- * nbd_read_eof(), the only place where bs->in_flight can reach 0, or it is
- * entered for the first time. Both places are safe for entering the
- * coroutine.
- */
-qemu_aio_coroutine_enter(bs->aio_context, s->connection_co);
+if (s->connection_co) {
+/*
+ * The node is still drained, so we know the coroutine has yielded in
+ * nbd_read_eof(), the only place where bs->in_flight can reach 0, or
+ * it is entered for the first time. Both places are safe for entering
+ * the coroutine.
+ */
+qemu_aio_coroutine_enter(bs->aio_context, s->connection_co);
+}
 bdrv_dec_in_flight(bs);
 }
 
-- 
2.29.2




[PATCH v2 3/3] nbd: make nbd_read* return -EIO on error

2021-01-28 Thread Roman Kagan
NBD reconnect logic considers the error code from the functions that
read NBD messages to tell if reconnect should be attempted or not: it is
attempted on -EIO, otherwise the client transitions to NBD_CLIENT_QUIT
state (see nbd_channel_error).  This error code is propagated from the
primitives like nbd_read.
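
For reference, the dispatch on that error code looks roughly as follows
(paraphrased from block/nbd.c, so take the details with a grain of salt);
only -EIO keeps the reconnect machinery alive, everything else goes to
the terminal QUIT state:

static void nbd_channel_error(BDRVNBDState *s, int ret)
{
    if (ret == -EIO) {
        /* transport error: reconnect if the user configured a delay */
        if (s->state == NBD_CLIENT_CONNECTED) {
            s->state = s->reconnect_delay ? NBD_CLIENT_CONNECTING_WAIT
                                          : NBD_CLIENT_CONNECTING_NOWAIT;
        }
    } else {
        /* anything else is treated as fatal: shut the channel and quit */
        if (s->state == NBD_CLIENT_CONNECTED) {
            qio_channel_shutdown(s->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
        }
        s->state = NBD_CLIENT_QUIT;
    }
}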

The problem, however, is that nbd_read itself turns every error into -1
rather than -EIO.  As a result, if the NBD server happens to die while
sending the message, the client in QEMU receives less data than it
expects, considers it as a fatal error, and wouldn't attempt
reestablishing the connection.

Fix it by turning every negative return from qio_channel_read_all into
-EIO returned from nbd_read.  Apparently that was the original behavior,
but got broken later.  Also adjust nbd_readXX to follow.

Fixes: e6798f06a6 ("nbd: generalize usage of nbd_read")
Signed-off-by: Roman Kagan 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/nbd.h | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 4a52a43ef5..5f34d23bb0 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -364,7 +364,7 @@ static inline int nbd_read(QIOChannel *ioc, void *buffer, 
size_t size,
 if (desc) {
 error_prepend(errp, "Failed to read %s: ", desc);
 }
-return -1;
+return ret;
 }
 
 return 0;
@@ -375,8 +375,9 @@ static inline int nbd_read##bits(QIOChannel *ioc,   
\
  uint##bits##_t *val,   \
  const char *desc, Error **errp)\
 {   \
-if (nbd_read(ioc, val, sizeof(*val), desc, errp) < 0) { \
-return -1;  \
+int ret = nbd_read(ioc, val, sizeof(*val), desc, errp); \
+if (ret < 0) {  \
+return ret; \
 }   \
 *val = be##bits##_to_cpu(*val); \
 return 0;   \
-- 
2.29.2




[PATCH v2 0/3] block/nbd: fix crashers in reconnect while migrating

2021-01-28 Thread Roman Kagan
During the final phase of migration the NBD reconnection logic may
encounter situations it doesn't expect during regular operation.

This series addresses some of them that make qemu crash.  They are
reproducible when a vm with a secondary drive attached via nbd with
non-zero "reconnect-delay" runs a stress load (fio with big queue depth)
in the guest on that drive and is migrated (e.g. to a file), while the
nbd server is SIGKILL-ed and restarted every second.

See the individual patches for specific crash conditions and more
detailed explanations.

v1 -> v2:
- fix corrupted backtraces in log messages
- add r-b by Vladimir

Roman Kagan (3):
  block/nbd: only detach existing iochannel from aio_context
  block/nbd: only enter connection coroutine if it's present
  nbd: make nbd_read* return -EIO on error

 include/block/nbd.h |  7 ---
 block/nbd.c | 25 +
 2 files changed, 21 insertions(+), 11 deletions(-)

-- 
2.29.2




[PATCH v2 1/3] block/nbd: only detach existing iochannel from aio_context

2021-01-28 Thread Roman Kagan
When the reconnect in NBD client is in progress, the iochannel used for
NBD connection doesn't exist.  Therefore an attempt to detach it from
the aio_context of the parent BlockDriverState results in a NULL pointer
dereference.

The problem is triggerable, in particular, when an outgoing migration is
about to finish, and stopping the dataplane tries to move the
BlockDriverState from the iothread aio_context to the main loop.  If the
NBD connection is lost before this point, and the NBD client has entered
the reconnect procedure, QEMU crashes:

  #0  qemu_aio_coroutine_enter (ctx=0x5618056c7580, co=0x0)
  at /build/qemu-6MF7tq/qemu-5.0.1/util/qemu-coroutine.c:109
  #1  0x5618034b1b68 in nbd_client_attach_aio_context_bh (
  opaque=0x561805ed4c00) at /build/qemu-6MF7tq/qemu-5.0.1/block/nbd.c:164
  #2  0x56180353116b in aio_wait_bh (opaque=0x7f60e1e63700)
  at /build/qemu-6MF7tq/qemu-5.0.1/util/aio-wait.c:55
  #3  0x561803530633 in aio_bh_call (bh=0x7f60d40a7e80)
  at /build/qemu-6MF7tq/qemu-5.0.1/util/async.c:136
  #4  aio_bh_poll (ctx=ctx@entry=0x5618056c7580)
  at /build/qemu-6MF7tq/qemu-5.0.1/util/async.c:164
  #5  0x561803533e5a in aio_poll (ctx=ctx@entry=0x5618056c7580,
  blocking=blocking@entry=true)
  at /build/qemu-6MF7tq/qemu-5.0.1/util/aio-posix.c:650
  #6  0x56180353128d in aio_wait_bh_oneshot (ctx=0x5618056c7580,
  cb=, opaque=)
  at /build/qemu-6MF7tq/qemu-5.0.1/util/aio-wait.c:71
  #7  0x56180345c50a in bdrv_attach_aio_context (new_context=0x5618056c7580,
  bs=0x561805ed4c00) at /build/qemu-6MF7tq/qemu-5.0.1/block.c:6172
  #8  bdrv_set_aio_context_ignore (bs=bs@entry=0x561805ed4c00,
  new_context=new_context@entry=0x5618056c7580,
  ignore=ignore@entry=0x7f60e1e63780)
  at /build/qemu-6MF7tq/qemu-5.0.1/block.c:6237
  #9  0x56180345c969 in bdrv_child_try_set_aio_context (
  bs=bs@entry=0x561805ed4c00, ctx=0x5618056c7580,
  ignore_child=, errp=)
  at /build/qemu-6MF7tq/qemu-5.0.1/block.c:6332
  #10 0x5618034957db in blk_do_set_aio_context (blk=0x56180695b3f0,
  new_context=0x5618056c7580, update_root_node=update_root_node@entry=true,
  errp=errp@entry=0x0)
  at /build/qemu-6MF7tq/qemu-5.0.1/block/block-backend.c:1989
  #11 0x5618034980bd in blk_set_aio_context (blk=,
  new_context=, errp=errp@entry=0x0)
  at /build/qemu-6MF7tq/qemu-5.0.1/block/block-backend.c:2010
  #12 0x561803197953 in virtio_blk_data_plane_stop (vdev=)
  at /build/qemu-6MF7tq/qemu-5.0.1/hw/block/dataplane/virtio-blk.c:292
  #13 0x5618033d67bf in virtio_bus_stop_ioeventfd (bus=0x5618056d9f08)
  at /build/qemu-6MF7tq/qemu-5.0.1/hw/virtio/virtio-bus.c:245
  #14 0x5618031c9b2e in virtio_vmstate_change (opaque=0x5618056d9f90,
  running=0, state=)
  at /build/qemu-6MF7tq/qemu-5.0.1/hw/virtio/virtio.c:3220
  #15 0x561803208bfd in vm_state_notify (running=running@entry=0,
  state=state@entry=RUN_STATE_FINISH_MIGRATE)
  at /build/qemu-6MF7tq/qemu-5.0.1/softmmu/vl.c:1275
  #16 0x561803155c02 in do_vm_stop (state=RUN_STATE_FINISH_MIGRATE,
  send_stop=) at /build/qemu-6MF7tq/qemu-5.0.1/cpus.c:1032
  #17 0x5618033e3765 in migration_completion (s=0x5618056e6960)
  at /build/qemu-6MF7tq/qemu-5.0.1/migration/migration.c:2914
  #18 migration_iteration_run (s=0x5618056e6960)
  at /build/qemu-6MF7tq/qemu-5.0.1/migration/migration.c:3275
  #19 migration_thread (opaque=opaque@entry=0x5618056e6960)
  at /build/qemu-6MF7tq/qemu-5.0.1/migration/migration.c:3439
  #20 0x561803536ad6 in qemu_thread_start (args=)
  at /build/qemu-6MF7tq/qemu-5.0.1/util/qemu-thread-posix.c:519
  #21 0x7f61085d06ba in start_thread ()
 from /lib/x86_64-linux-gnu/libpthread.so.0
  #22 0x7f610830641d in sysctl () from /lib/x86_64-linux-gnu/libc.so.6
  #23 0x in ?? ()

Fix it by checking that the iochannel is non-null before trying to
detach it from the aio_context.  If it is null, no detaching is needed,
and it will get reattached in the proper aio_context once the connection
is reestablished.

Signed-off-by: Roman Kagan 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/nbd.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/block/nbd.c b/block/nbd.c
index 42e10c7c93..bcd6641e90 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -235,7 +235,14 @@ static void nbd_client_detach_aio_context(BlockDriverState 
*bs)
 
 /* Timer is deleted in nbd_client_co_drain_begin() */
 assert(!s->reconnect_delay_timer);
-qio_channel_detach_aio_context(QIO_CHANNEL(s->ioc));
+/*
+ * If reconnect is in progress we may have no ->ioc.  It will be
+ * re-instantiated in the proper aio context once the connection is
+ * reestablished.
+ */
+if (s->ioc) {
+qio_channel_detach_aio_context(QIO_CHANNEL(s->ioc));
+}
 }
 
 static void nbd_client_attach_aio_context_bh(void *opaque)
-- 
2.29.2




Re: [PATCH 0/3] block/nbd: fix crashers in reconnect while migrating

2021-01-28 Thread Roman Kagan
On Fri, Jan 29, 2021 at 08:51:39AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 28.01.2021 23:14, Roman Kagan wrote:
> > During the final phase of migration the NBD reconnection logic may
> > encounter situations it doesn't expect during regular operation.
> > 
> > This series addresses some of them that make qemu crash.  They are
> > reproducible when a vm with a secondary drive attached via nbd with
> > non-zero "reconnect-delay" runs a stress load (fio with big queue depth)
> > in the guest on that drive and is migrated (e.g. to a file), while the
> > nbd server is SIGKILL-ed and restarted every second.
> > 
> > See the individual patches for specific crash conditions and more
> > detailed explanations.
> > 
> > Roman Kagan (3):
> >block/nbd: only detach existing iochannel from aio_context
> >block/nbd: only enter connection coroutine if it's present
> >nbd: make nbd_read* return -EIO on error
> > 
> >   include/block/nbd.h |  7 ---
> >   block/nbd.c | 25 +
> >   2 files changed, 21 insertions(+), 11 deletions(-)
> > 
> 
> Thanks a lot for fixing!
> 
> Do you have some reproducer scripts? Could you post them or may be add
> an iotest?

I don't have it scripted, just ad hoc command lines.  I'll look into
making up a test.  Can you perhaps suggest an existing test to base it
on?

Thanks,
Roman.



Re: [PATCH 1/3] block/nbd: only detach existing iochannel from aio_context

2021-01-28 Thread Roman Kagan
On Fri, Jan 29, 2021 at 08:37:13AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 28.01.2021 23:14, Roman Kagan wrote:
> > When the reconnect in NBD client is in progress, the iochannel used for
> > NBD connection doesn't exist.  Therefore an attempt to detach it from
> > the aio_context of the parent BlockDriverState results in a NULL pointer
> > dereference.
> > 
> > The problem is triggerable, in particular, when an outgoing migration is
> > about to finish, and stopping the dataplane tries to move the
> > BlockDriverState from the iothread aio_context to the main loop.  If the
> > NBD connection is lost before this point, and the NBD client has entered
> > the reconnect procedure, QEMU crashes:
> > 
> >  at /build/qemu-gYtjVn/qemu-5.0.1/io/channel.c:452
> >  at /build/qemu-gYtjVn/qemu-5.0.1/block.c:6151
> >  new_context=new_context@entry=0x562a260c9580,
> >  ignore=ignore@entry=0x7feeadc9b780)
> >  at /build/qemu-gYtjVn/qemu-5.0.1/block.c:6230
> >  (bs=bs@entry=0x562a268d6a00, ctx=0x562a260c9580,
> >  ignore_child=, errp=)
> >  at /build/qemu-gYtjVn/qemu-5.0.1/block.c:6332
> >  new_context=0x562a260c9580,
> >  update_root_node=update_root_node@entry=true, errp=errp@entry=0x0)
> >  at /build/qemu-gYtjVn/qemu-5.0.1/block/block-backend.c:1989
> >  new_context=, errp=errp@entry=0x0)
> >  at /build/qemu-gYtjVn/qemu-5.0.1/block/block-backend.c:2010
> >  out>)
> >  at /build/qemu-gYtjVn/qemu-5.0.1/hw/block/dataplane/virtio-blk.c:292
> >  at /build/qemu-gYtjVn/qemu-5.0.1/hw/virtio/virtio-bus.c:245
> >  running=0, state=)
> >  at /build/qemu-gYtjVn/qemu-5.0.1/hw/virtio/virtio.c:3220
> >  state=state@entry=RUN_STATE_FINISH_MIGRATE)
> >  at /build/qemu-gYtjVn/qemu-5.0.1/softmmu/vl.c:1275
> >  send_stop=)
> >  at /build/qemu-gYtjVn/qemu-5.0.1/cpus.c:1032
> >  at /build/qemu-gYtjVn/qemu-5.0.1/migration/migration.c:2914
> >  at /build/qemu-gYtjVn/qemu-5.0.1/migration/migration.c:3275
> >  at /build/qemu-gYtjVn/qemu-5.0.1/migration/migration.c:3439
> >  at /build/qemu-gYtjVn/qemu-5.0.1/util/qemu-thread-posix.c:519
> >  at pthread_create.c:333
> >  oldval=0x562a2452b138, oldlenp=0x0, newval=0x562a2452c5e0
> >  <__func__.28102>, newlen=0)
> >  at ../sysdeps/unix/sysv/linux/sysctl.c:30
> 
> function names are somehow lost :(

Oops.  Backtraces in gdb have frame numbers prefixed with a hash, which
got interpreted as comments by git commit; I should have indented the
backtrace when pasting.

Will respin with fixed log messages, thanks.

Roman.



[PATCH 1/3] block/nbd: only detach existing iochannel from aio_context

2021-01-28 Thread Roman Kagan
When the reconnect in NBD client is in progress, the iochannel used for
NBD connection doesn't exist.  Therefore an attempt to detach it from
the aio_context of the parent BlockDriverState results in a NULL pointer
dereference.

The problem is triggerable, in particular, when an outgoing migration is
about to finish, and stopping the dataplane tries to move the
BlockDriverState from the iothread aio_context to the main loop.  If the
NBD connection is lost before this point, and the NBD client has entered
the reconnect procedure, QEMU crashes:

at /build/qemu-gYtjVn/qemu-5.0.1/io/channel.c:452
at /build/qemu-gYtjVn/qemu-5.0.1/block.c:6151
new_context=new_context@entry=0x562a260c9580,
ignore=ignore@entry=0x7feeadc9b780)
at /build/qemu-gYtjVn/qemu-5.0.1/block.c:6230
(bs=bs@entry=0x562a268d6a00, ctx=0x562a260c9580,
ignore_child=, errp=)
at /build/qemu-gYtjVn/qemu-5.0.1/block.c:6332
new_context=0x562a260c9580,
update_root_node=update_root_node@entry=true, errp=errp@entry=0x0)
at /build/qemu-gYtjVn/qemu-5.0.1/block/block-backend.c:1989
new_context=, errp=errp@entry=0x0)
at /build/qemu-gYtjVn/qemu-5.0.1/block/block-backend.c:2010
out>)
at /build/qemu-gYtjVn/qemu-5.0.1/hw/block/dataplane/virtio-blk.c:292
at /build/qemu-gYtjVn/qemu-5.0.1/hw/virtio/virtio-bus.c:245
running=0, state=)
at /build/qemu-gYtjVn/qemu-5.0.1/hw/virtio/virtio.c:3220
state=state@entry=RUN_STATE_FINISH_MIGRATE)
at /build/qemu-gYtjVn/qemu-5.0.1/softmmu/vl.c:1275
send_stop=)
at /build/qemu-gYtjVn/qemu-5.0.1/cpus.c:1032
at /build/qemu-gYtjVn/qemu-5.0.1/migration/migration.c:2914
at /build/qemu-gYtjVn/qemu-5.0.1/migration/migration.c:3275
at /build/qemu-gYtjVn/qemu-5.0.1/migration/migration.c:3439
at /build/qemu-gYtjVn/qemu-5.0.1/util/qemu-thread-posix.c:519
at pthread_create.c:333
oldval=0x562a2452b138, oldlenp=0x0, newval=0x562a2452c5e0
<__func__.28102>, newlen=0)
at ../sysdeps/unix/sysv/linux/sysctl.c:30

Fix it by checking that the iochannel is non-null before trying to
detach it from the aio_context.  If it is null, no detaching is needed,
and it will get reattached in the proper aio_context once the connection
is reestablished.

Signed-off-by: Roman Kagan 
---
 block/nbd.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/block/nbd.c b/block/nbd.c
index 42e10c7c93..bcd6641e90 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -235,7 +235,14 @@ static void nbd_client_detach_aio_context(BlockDriverState 
*bs)
 
 /* Timer is deleted in nbd_client_co_drain_begin() */
 assert(!s->reconnect_delay_timer);
-qio_channel_detach_aio_context(QIO_CHANNEL(s->ioc));
+/*
+ * If reconnect is in progress we may have no ->ioc.  It will be
+ * re-instantiated in the proper aio context once the connection is
+ * reestablished.
+ */
+if (s->ioc) {
+qio_channel_detach_aio_context(QIO_CHANNEL(s->ioc));
+}
 }
 
 static void nbd_client_attach_aio_context_bh(void *opaque)
-- 
2.29.2




[PATCH 2/3] block/nbd: only enter connection coroutine if it's present

2021-01-28 Thread Roman Kagan
When an NBD block driver state is moved from one aio_context to another
(e.g. when doing a drain in a migration thread),
nbd_client_attach_aio_context_bh is executed, which enters the
connection coroutine.

However, the assumption that ->connection_co is always present here
appears incorrect: the connection may have encountered an error other
than -EIO in the underlying transport, and thus may have decided to quit
rather than keep trying to reconnect, and therefore it may have
terminated the connection coroutine.  As a result an attempt to reassign
the client in this state (NBD_CLIENT_QUIT) to a different aio_context
leads to a null pointer dereference:

at /build/qemu-6MF7tq/qemu-5.0.1/util/qemu-coroutine.c:109
opaque=0x561805ed4c00) at /build/qemu-6MF7tq/qemu-5.0.1/block/nbd.c:164
at /build/qemu-6MF7tq/qemu-5.0.1/util/aio-wait.c:55
at /build/qemu-6MF7tq/qemu-5.0.1/util/async.c:136
at /build/qemu-6MF7tq/qemu-5.0.1/util/async.c:164
blocking=blocking@entry=true)
at /build/qemu-6MF7tq/qemu-5.0.1/util/aio-posix.c:650
cb=, opaque=)
at /build/qemu-6MF7tq/qemu-5.0.1/util/aio-wait.c:71
bs=0x561805ed4c00) at /build/qemu-6MF7tq/qemu-5.0.1/block.c:6172
new_context=new_context@entry=0x5618056c7580,
ignore=ignore@entry=0x7f60e1e63780)
at /build/qemu-6MF7tq/qemu-5.0.1/block.c:6237
bs=bs@entry=0x561805ed4c00, ctx=0x5618056c7580,
ignore_child=, errp=)
at /build/qemu-6MF7tq/qemu-5.0.1/block.c:6332
new_context=0x5618056c7580, update_root_node=update_root_node@entry=true,
errp=errp@entry=0x0)
at /build/qemu-6MF7tq/qemu-5.0.1/block/block-backend.c:1989
new_context=, errp=errp@entry=0x0)
at /build/qemu-6MF7tq/qemu-5.0.1/block/block-backend.c:2010
at /build/qemu-6MF7tq/qemu-5.0.1/hw/block/dataplane/virtio-blk.c:292
at /build/qemu-6MF7tq/qemu-5.0.1/hw/virtio/virtio-bus.c:245
running=0, state=)
at /build/qemu-6MF7tq/qemu-5.0.1/hw/virtio/virtio.c:3220
state=state@entry=RUN_STATE_FINISH_MIGRATE)
at /build/qemu-6MF7tq/qemu-5.0.1/softmmu/vl.c:1275
send_stop=) at /build/qemu-6MF7tq/qemu-5.0.1/cpus.c:1032
at /build/qemu-6MF7tq/qemu-5.0.1/migration/migration.c:2914
at /build/qemu-6MF7tq/qemu-5.0.1/migration/migration.c:3275
at /build/qemu-6MF7tq/qemu-5.0.1/migration/migration.c:3439
at /build/qemu-6MF7tq/qemu-5.0.1/util/qemu-thread-posix.c:519
   from /lib/x86_64-linux-gnu/libpthread.so.0

Fix it by checking that the connection coroutine is non-null before
trying to enter it.  If it is null, no entering is needed, as the
connection is probably going down anyway.

Signed-off-by: Roman Kagan 
---
 block/nbd.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index bcd6641e90..b3cbbeb4b0 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -250,13 +250,15 @@ static void nbd_client_attach_aio_context_bh(void *opaque)
 BlockDriverState *bs = opaque;
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
-/*
- * The node is still drained, so we know the coroutine has yielded in
- * nbd_read_eof(), the only place where bs->in_flight can reach 0, or it is
- * entered for the first time. Both places are safe for entering the
- * coroutine.
- */
-qemu_aio_coroutine_enter(bs->aio_context, s->connection_co);
+if (s->connection_co) {
+/*
+ * The node is still drained, so we know the coroutine has yielded in
+ * nbd_read_eof(), the only place where bs->in_flight can reach 0, or
+ * it is entered for the first time. Both places are safe for entering
+ * the coroutine.
+ */
+qemu_aio_coroutine_enter(bs->aio_context, s->connection_co);
+}
 bdrv_dec_in_flight(bs);
 }
 
-- 
2.29.2




[PATCH 3/3] nbd: make nbd_read* return -EIO on error

2021-01-28 Thread Roman Kagan
NBD reconnect logic considers the error code from the functions that
read NBD messages to tell if reconnect should be attempted or not: it is
attempted on -EIO, otherwise the client transitions to NBD_CLIENT_QUIT
state (see nbd_channel_error).  This error code is propagated from the
primitives like nbd_read.

The problem, however, is that nbd_read itself turns every error into -1
rather than -EIO.  As a result, if the NBD server happens to die while
sending the message, the client in QEMU receives less data than it
expects, considers it as a fatal error, and wouldn't attempt
reestablishing the connection.

Fix it by turning every negative return from qio_channel_read_all into
-EIO returned from nbd_read.  Apparently that was the original behavior,
but got broken later.  Also adjust nbd_readXX to follow.

Fixes: e6798f06a6 ("nbd: generalize usage of nbd_read")
Signed-off-by: Roman Kagan 
---
 include/block/nbd.h | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 4a52a43ef5..5f34d23bb0 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -364,7 +364,7 @@ static inline int nbd_read(QIOChannel *ioc, void *buffer, 
size_t size,
 if (desc) {
 error_prepend(errp, "Failed to read %s: ", desc);
 }
-return -1;
+return ret;
 }
 
 return 0;
@@ -375,8 +375,9 @@ static inline int nbd_read##bits(QIOChannel *ioc,   
\
  uint##bits##_t *val,   \
  const char *desc, Error **errp)\
 {   \
-if (nbd_read(ioc, val, sizeof(*val), desc, errp) < 0) { \
-return -1;  \
+int ret = nbd_read(ioc, val, sizeof(*val), desc, errp); \
+if (ret < 0) {  \
+return ret; \
 }   \
 *val = be##bits##_to_cpu(*val); \
 return 0;   \
-- 
2.29.2




[PATCH 0/3] block/nbd: fix crashers in reconnect while migrating

2021-01-28 Thread Roman Kagan
During the final phase of migration the NBD reconnection logic may
encounter situations it doesn't expect during regular operation.

This series addresses some of them that make qemu crash.  They are
reproducible when a vm with a secondary drive attached via nbd with
non-zero "reconnect-delay" runs a stress load (fio with big queue depth)
in the guest on that drive and is migrated (e.g. to a file), while the
nbd server is SIGKILL-ed and restarted every second.

See the individual patches for specific crash conditions and more
detailed explanations.

Roman Kagan (3):
  block/nbd: only detach existing iochannel from aio_context
  block/nbd: only enter connection coroutine if it's present
  nbd: make nbd_read* return -EIO on error

 include/block/nbd.h |  7 ---
 block/nbd.c | 25 +
 2 files changed, 21 insertions(+), 11 deletions(-)

-- 
2.29.2




Re: [PATCH v8 0/8] block: enhance handling of size-related BlockConf properties

2020-06-15 Thread Roman Kagan
On Fri, May 29, 2020 at 01:55:08AM +0300, Roman Kagan wrote:
> BlockConf includes several properties counted in bytes.
> 
> Enhance their handling in some aspects, specifically
> 
> - accept common size suffixes (k, m)
> - perform consistency checks on the values
> - lift the upper limit on physical_block_size and logical_block_size
> 
> Also fix the accessor for opt_io_size in virtio-blk to make it consistent with
> the size of the field.
> 
> History:
> v7 -> v8:
> - replace stringify with %u in error messages [Eric]
> - fix wording in logs [Eric]
> 
> v6 -> v7:
> - avoid overflow in min_io_size check [Eric]
> - try again to perform the art form in patch splitting [Eric]
> 
> v5 -> v6:
> - fix forgotten xen-block and swim
> - add prop_size32 instead of going with 64bit
> 
> v4 -> v5:
> - re-split the patches [Philippe]
> - fix/reword error messages [Philippe, Kevin]
> - do early return on failed consistency check [Philippe]
> - use QEMU_IS_ALIGNED instead of open coding [Philippe]
> - make all BlockConf size props support suffixes
> - expand the log for virtio-blk opt_io_size [Michael]
> 
> v3 -> v4:
> - add patch to fix opt_io_size width in virtio-blk
> - add patch to perform consistency checks [Kevin]
> - check min_io_size against truncation [Kevin]
> 
> v2 -> v3:
> - mention qcow2 cluster size limit in the log and comment [Eric]
> 
> v1 -> v2:
> - cap the property at 2 MiB [Eric]
> - accept size suffixes
> 
> Roman Kagan (8):
>   virtio-blk: store opt_io_size with correct size
>   block: consolidate blocksize properties consistency checks
>   qdev-properties: blocksize: use same limits in code and description
>   qdev-properties: add size32 property type
>   qdev-properties: make blocksize accept size suffixes
>   block: make BlockConf size props 32bit and accept size suffixes
>   qdev-properties: add getter for size32 and blocksize
>   block: lift blocksize property limit to 2 MiB
> 
>  include/hw/block/block.h |  14 +-
>  include/hw/qdev-properties.h |   5 +-
>  hw/block/block.c |  40 ++-
>  hw/block/fdc.c   |   5 +-
>  hw/block/nvme.c  |   5 +-
>  hw/block/swim.c  |   5 +-
>  hw/block/virtio-blk.c|   9 +-
>  hw/block/xen-block.c |   6 +-
>  hw/core/qdev-properties.c|  85 +-
>  hw/ide/qdev.c|   5 +-
>  hw/scsi/scsi-disk.c  |  12 +-
>  hw/usb/dev-storage.c |   5 +-
>  tests/qemu-iotests/172.out   | 532 +--
>  13 files changed, 419 insertions(+), 309 deletions(-)

Ping?



Re: [PATCH v8 2/8] block: consolidate blocksize properties consistency checks

2020-05-29 Thread Roman Kagan
On Fri, May 29, 2020 at 11:53:26AM +0200, Markus Armbruster wrote:
> Roman Kagan  writes:
> 
> > Several block device properties related to blocksize configuration must
> > be in certain relationship WRT each other: physical block must be no
> > smaller than logical block; min_io_size, opt_io_size, and
> > discard_granularity must be a multiple of a logical block.
> >
> > To ensure these requirements are met, add corresponding consistency
> > checks to blkconf_blocksizes, adjusting its signature to communicate
> > possible error to the caller.  Also remove the now redundant consistency
> > checks from the specific devices.
> >
> > Signed-off-by: Roman Kagan 
> > Reviewed-by: Eric Blake 
> > Reviewed-by: Paul Durrant 
> > ---
> >  include/hw/block/block.h   |  2 +-
> >  hw/block/block.c   | 30 +-
> >  hw/block/fdc.c |  5 -
> >  hw/block/nvme.c|  5 -
> >  hw/block/swim.c|  5 -
> >  hw/block/virtio-blk.c  |  7 +--
> >  hw/block/xen-block.c   |  6 +-
> >  hw/ide/qdev.c  |  5 -
> >  hw/scsi/scsi-disk.c| 12 +---
> >  hw/usb/dev-storage.c   |  5 -
> >  tests/qemu-iotests/172.out |  2 +-
> >  11 files changed, 58 insertions(+), 26 deletions(-)
> >
> > diff --git a/include/hw/block/block.h b/include/hw/block/block.h
> > index d7246f3862..784953a237 100644
> > --- a/include/hw/block/block.h
> > +++ b/include/hw/block/block.h
> > @@ -87,7 +87,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void 
> > *buf, hwaddr size,
> >  bool blkconf_geometry(BlockConf *conf, int *trans,
> >unsigned cyls_max, unsigned heads_max, unsigned 
> > secs_max,
> >Error **errp);
> > -void blkconf_blocksizes(BlockConf *conf);
> > +bool blkconf_blocksizes(BlockConf *conf, Error **errp);
> >  bool blkconf_apply_backend_options(BlockConf *conf, bool readonly,
> > bool resizable, Error **errp);
> >  
> > diff --git a/hw/block/block.c b/hw/block/block.c
> > index bf56c7612b..b22207c921 100644
> > --- a/hw/block/block.c
> > +++ b/hw/block/block.c
> > @@ -61,7 +61,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void 
> > *buf, hwaddr size,
> >  return true;
> >  }
> >  
> > -void blkconf_blocksizes(BlockConf *conf)
> > +bool blkconf_blocksizes(BlockConf *conf, Error **errp)
> >  {
> >  BlockBackend *blk = conf->blk;
> >  BlockSizes blocksizes;
> > @@ -83,6 +83,34 @@ void blkconf_blocksizes(BlockConf *conf)
> >  conf->logical_block_size = BDRV_SECTOR_SIZE;
> >  }
> >  }
> > +
> > +if (conf->logical_block_size > conf->physical_block_size) {
> > +error_setg(errp,
> > +   "logical_block_size > physical_block_size not 
> > supported");
> > +return false;
> > +}
> 
> Pardon me if this has been answered already for prior revisions: do we
> really support physical block sizes that are not a multiple of the
> logical block size?

Both physical and logical block sizes are required to be powers of two,
so the former is certain to be a multiple of the latter.
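
(If it helps, here is a standalone check of that claim -- plain C, not
QEMU code, with the current 512..32K range hardcoded:)

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool is_pow2(uint32_t x)
{
    return x && !(x & (x - 1));
}

int main(void)
{
    /* walk every power-of-two pair with physical >= logical */
    for (uint32_t lbs = 512; lbs <= 32768; lbs <<= 1) {
        for (uint32_t pbs = lbs; pbs <= 32768; pbs <<= 1) {
            assert(is_pow2(lbs) && is_pow2(pbs));
            assert(pbs % lbs == 0);   /* never fires */
        }
    }
    return 0;
}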

> > +
> > +if (!QEMU_IS_ALIGNED(conf->min_io_size, conf->logical_block_size)) {
> > +error_setg(errp,
> > +   "min_io_size must be a multiple of logical_block_size");
> > +return false;
> > +}
> > +
> > +if (!QEMU_IS_ALIGNED(conf->opt_io_size, conf->logical_block_size)) {
> > +error_setg(errp,
> > +   "opt_io_size must be a multiple of logical_block_size");
> > +return false;
> > +}
> > +
> > +if (conf->discard_granularity != -1 &&
> > +!QEMU_IS_ALIGNED(conf->discard_granularity,
> > + conf->logical_block_size)) {
> > +error_setg(errp, "discard_granularity must be "
> > +   "a multiple of logical_block_size");
> > +return false;
> > +}
> > +
> > +return true;
> >  }
> >  
> >  bool blkconf_apply_backend_options(BlockConf *conf, bool readonly,
> > diff --git a/hw/block/fdc.c b/hw/block/fdc.c
> > index c5fb9d6ece..8eda572ef4 100644
> > --- a/hw/block/fdc.c
> > +++ b/hw/block/fdc.c
> > @@ -554,7 +554,10 @@ static voi

[PATCH v8 8/8] block: lift blocksize property limit to 2 MiB

2020-05-28 Thread Roman Kagan
Logical and physical block sizes in QEMU are limited to 32 KiB.

This appears unnecessarily tight, and we've found bigger block sizes
handy at times.

Lift the limitation up to 2 MiB which appears to be good enough for
everybody, and matches the qcow2 cluster size limit.

Signed-off-by: Roman Kagan 
Reviewed-by: Eric Blake 
---
 hw/core/qdev-properties.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/hw/core/qdev-properties.c b/hw/core/qdev-properties.c
index 63d48db70c..ead35d7ffd 100644
--- a/hw/core/qdev-properties.c
+++ b/hw/core/qdev-properties.c
@@ -784,9 +784,12 @@ const PropertyInfo qdev_prop_size32 = {
 /* lower limit is sector size */
 #define MIN_BLOCK_SIZE  512
 #define MIN_BLOCK_SIZE_STR  "512 B"
-/* upper limit is the max power of 2 that fits in uint16_t */
-#define MAX_BLOCK_SIZE  (32 * KiB)
-#define MAX_BLOCK_SIZE_STR  "32 KiB"
+/*
+ * upper limit is arbitrary, 2 MiB looks sufficient for all sensible uses, and
+ * matches qcow2 cluster size limit
+ */
+#define MAX_BLOCK_SIZE  (2 * MiB)
+#define MAX_BLOCK_SIZE_STR  "2 MiB"
 
 static void set_blocksize(Object *obj, Visitor *v, const char *name,
   void *opaque, Error **errp)
-- 
2.26.2




[PATCH v8 6/8] block: make BlockConf size props 32bit and accept size suffixes

2020-05-28 Thread Roman Kagan
Convert all size-related properties in BlockConf to 32bit.  This will
accommodate bigger block sizes (in a followup patch).  This also allows
making them all accept size suffixes, either via DEFINE_PROP_BLOCKSIZE
or via DEFINE_PROP_SIZE32.

Also, since min_io_size is exposed to the guest by scsi and virtio-blk
devices as a uint16_t in units of logical blocks, introduce an
additional check in blkconf_blocksizes to prevent its silent truncation.
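
To illustrate the magnitude (standalone C, not QEMU code): with 512-byte
logical blocks, a min_io_size of 32 MiB already needs more than 16 bits
once expressed in blocks, so it would be silently truncated without the
check:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t logical_block_size = 512;
    uint32_t min_io_size = 32 * 1024 * 1024;            /* 32 MiB */
    uint32_t blocks = min_io_size / logical_block_size; /* 65536 */

    printf("%u blocks, fits in uint16_t: %s\n",
           blocks, blocks <= UINT16_MAX ? "yes" : "no");
    /* prints: 65536 blocks, fits in uint16_t: no */
    return 0;
}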

Signed-off-by: Roman Kagan 
---
v7 -> v8:
- replace stringify with %u in the error message [Eric]
- fix wording in the log [Eric]

 include/hw/block/block.h | 12 ++--
 include/hw/qdev-properties.h |  2 +-
 hw/block/block.c | 10 ++
 hw/core/qdev-properties.c|  4 ++--
 4 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/include/hw/block/block.h b/include/hw/block/block.h
index 784953a237..1e8b6253dd 100644
--- a/include/hw/block/block.h
+++ b/include/hw/block/block.h
@@ -18,9 +18,9 @@
 
 typedef struct BlockConf {
 BlockBackend *blk;
-uint16_t physical_block_size;
-uint16_t logical_block_size;
-uint16_t min_io_size;
+uint32_t physical_block_size;
+uint32_t logical_block_size;
+uint32_t min_io_size;
 uint32_t opt_io_size;
 int32_t bootindex;
 uint32_t discard_granularity;
@@ -51,9 +51,9 @@ static inline unsigned int get_physical_block_exp(BlockConf 
*conf)
   _conf.logical_block_size),\
 DEFINE_PROP_BLOCKSIZE("physical_block_size", _state,\
   _conf.physical_block_size),   \
-DEFINE_PROP_UINT16("min_io_size", _state, _conf.min_io_size, 0),\
-DEFINE_PROP_UINT32("opt_io_size", _state, _conf.opt_io_size, 0),\
-DEFINE_PROP_UINT32("discard_granularity", _state,   \
+DEFINE_PROP_SIZE32("min_io_size", _state, _conf.min_io_size, 0),\
+DEFINE_PROP_SIZE32("opt_io_size", _state, _conf.opt_io_size, 0),\
+DEFINE_PROP_SIZE32("discard_granularity", _state,   \
_conf.discard_granularity, -1),  \
 DEFINE_PROP_ON_OFF_AUTO("write-cache", _state, _conf.wce,   \
 ON_OFF_AUTO_AUTO),  \
diff --git a/include/hw/qdev-properties.h b/include/hw/qdev-properties.h
index c03eadfad6..5252bb6b1a 100644
--- a/include/hw/qdev-properties.h
+++ b/include/hw/qdev-properties.h
@@ -200,7 +200,7 @@ extern const PropertyInfo qdev_prop_pcie_link_width;
 #define DEFINE_PROP_SIZE32(_n, _s, _f, _d)   \
 DEFINE_PROP_UNSIGNED(_n, _s, _f, _d, qdev_prop_size32, uint32_t)
 #define DEFINE_PROP_BLOCKSIZE(_n, _s, _f) \
-DEFINE_PROP_UNSIGNED(_n, _s, _f, 0, qdev_prop_blocksize, uint16_t)
+DEFINE_PROP_UNSIGNED(_n, _s, _f, 0, qdev_prop_blocksize, uint32_t)
 #define DEFINE_PROP_PCI_HOST_DEVADDR(_n, _s, _f) \
 DEFINE_PROP(_n, _s, _f, qdev_prop_pci_host_devaddr, PCIHostDeviceAddress)
 #define DEFINE_PROP_OFF_AUTO_PCIBAR(_n, _s, _f, _d) \
diff --git a/hw/block/block.c b/hw/block/block.c
index b22207c921..1e34573da7 100644
--- a/hw/block/block.c
+++ b/hw/block/block.c
@@ -96,6 +96,16 @@ bool blkconf_blocksizes(BlockConf *conf, Error **errp)
 return false;
 }
 
+/*
+ * all devices which support min_io_size (scsi and virtio-blk) expose it to
+ * the guest as a uint16_t in units of logical blocks
+ */
+if (conf->min_io_size / conf->logical_block_size > UINT16_MAX) {
+error_setg(errp, "min_io_size must not exceed %u logical blocks",
+   UINT16_MAX);
+return false;
+}
+
 if (!QEMU_IS_ALIGNED(conf->opt_io_size, conf->logical_block_size)) {
 error_setg(errp,
"opt_io_size must be a multiple of logical_block_size");
diff --git a/hw/core/qdev-properties.c b/hw/core/qdev-properties.c
index c9af6a1341..bd4abdc1d1 100644
--- a/hw/core/qdev-properties.c
+++ b/hw/core/qdev-properties.c
@@ -782,7 +782,7 @@ static void set_blocksize(Object *obj, Visitor *v, const 
char *name,
 {
 DeviceState *dev = DEVICE(obj);
 Property *prop = opaque;
-uint16_t *ptr = qdev_get_prop_ptr(dev, prop);
+uint32_t *ptr = qdev_get_prop_ptr(dev, prop);
 uint64_t value;
 Error *local_err = NULL;
 
@@ -821,7 +821,7 @@ const PropertyInfo qdev_prop_blocksize = {
 .name  = "size",
 .description = "A power of two between " MIN_BLOCK_SIZE_STR
" and " MAX_BLOCK_SIZE_STR,
-.get   = get_uint16,
+.get   = get_uint32,
 .set   = set_blocksize,
 .set_default_value = set_default_value_uint,
 };
-- 
2.26.2




[PATCH v8 2/8] block: consolidate blocksize properties consistency checks

2020-05-28 Thread Roman Kagan
Several block device properties related to blocksize configuration must
be in a certain relationship with each other: the physical block must be
no smaller than the logical block; min_io_size, opt_io_size, and
discard_granularity must each be a multiple of the logical block size.

To ensure these requirements are met, add corresponding consistency
checks to blkconf_blocksizes, adjusting its signature to communicate
possible error to the caller.  Also remove the now redundant consistency
checks from the specific devices.
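
For a quick way to play with the rules outside QEMU, they condense to
something like this (illustrative only; the authoritative version is the
blkconf_blocksizes hunk below):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool blocksizes_ok(uint32_t logical, uint32_t physical,
                          uint32_t min_io, uint32_t opt_io, uint32_t discard)
{
    return logical <= physical &&
           min_io % logical == 0 &&
           opt_io % logical == 0 &&
           (discard == UINT32_MAX || discard % logical == 0);
}

int main(void)
{
    /* 4K logical over 512B physical is rejected ... */
    printf("%d\n", blocksizes_ok(4096, 512, 0, 0, UINT32_MAX));   /* 0 */
    /* ... while the usual 512-logical/4K-physical layout is fine */
    printf("%d\n", blocksizes_ok(512, 4096, 0, 0, UINT32_MAX));   /* 1 */
    return 0;
}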

Signed-off-by: Roman Kagan 
Reviewed-by: Eric Blake 
Reviewed-by: Paul Durrant 
---
 include/hw/block/block.h   |  2 +-
 hw/block/block.c   | 30 +-
 hw/block/fdc.c |  5 -
 hw/block/nvme.c|  5 -
 hw/block/swim.c|  5 -
 hw/block/virtio-blk.c  |  7 +--
 hw/block/xen-block.c   |  6 +-
 hw/ide/qdev.c  |  5 -
 hw/scsi/scsi-disk.c| 12 +---
 hw/usb/dev-storage.c   |  5 -
 tests/qemu-iotests/172.out |  2 +-
 11 files changed, 58 insertions(+), 26 deletions(-)

diff --git a/include/hw/block/block.h b/include/hw/block/block.h
index d7246f3862..784953a237 100644
--- a/include/hw/block/block.h
+++ b/include/hw/block/block.h
@@ -87,7 +87,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void 
*buf, hwaddr size,
 bool blkconf_geometry(BlockConf *conf, int *trans,
   unsigned cyls_max, unsigned heads_max, unsigned secs_max,
   Error **errp);
-void blkconf_blocksizes(BlockConf *conf);
+bool blkconf_blocksizes(BlockConf *conf, Error **errp);
 bool blkconf_apply_backend_options(BlockConf *conf, bool readonly,
bool resizable, Error **errp);
 
diff --git a/hw/block/block.c b/hw/block/block.c
index bf56c7612b..b22207c921 100644
--- a/hw/block/block.c
+++ b/hw/block/block.c
@@ -61,7 +61,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void 
*buf, hwaddr size,
 return true;
 }
 
-void blkconf_blocksizes(BlockConf *conf)
+bool blkconf_blocksizes(BlockConf *conf, Error **errp)
 {
 BlockBackend *blk = conf->blk;
 BlockSizes blocksizes;
@@ -83,6 +83,34 @@ void blkconf_blocksizes(BlockConf *conf)
 conf->logical_block_size = BDRV_SECTOR_SIZE;
 }
 }
+
+if (conf->logical_block_size > conf->physical_block_size) {
+error_setg(errp,
+   "logical_block_size > physical_block_size not supported");
+return false;
+}
+
+if (!QEMU_IS_ALIGNED(conf->min_io_size, conf->logical_block_size)) {
+error_setg(errp,
+   "min_io_size must be a multiple of logical_block_size");
+return false;
+}
+
+if (!QEMU_IS_ALIGNED(conf->opt_io_size, conf->logical_block_size)) {
+error_setg(errp,
+   "opt_io_size must be a multiple of logical_block_size");
+return false;
+}
+
+if (conf->discard_granularity != -1 &&
+!QEMU_IS_ALIGNED(conf->discard_granularity,
+ conf->logical_block_size)) {
+error_setg(errp, "discard_granularity must be "
+   "a multiple of logical_block_size");
+return false;
+}
+
+return true;
 }
 
 bool blkconf_apply_backend_options(BlockConf *conf, bool readonly,
diff --git a/hw/block/fdc.c b/hw/block/fdc.c
index c5fb9d6ece..8eda572ef4 100644
--- a/hw/block/fdc.c
+++ b/hw/block/fdc.c
@@ -554,7 +554,10 @@ static void floppy_drive_realize(DeviceState *qdev, Error 
**errp)
 read_only = !blk_bs(dev->conf.blk) || blk_is_read_only(dev->conf.blk);
 }
 
-blkconf_blocksizes(&dev->conf);
+if (!blkconf_blocksizes(&dev->conf, errp)) {
+return;
+}
+
 if (dev->conf.logical_block_size != 512 ||
 dev->conf.physical_block_size != 512)
 {
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 2f3100e56c..672650e162 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1390,7 +1390,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
**errp)
 host_memory_backend_set_mapped(n->pmrdev, true);
 }
 
-blkconf_blocksizes(&n->conf);
+if (!blkconf_blocksizes(&n->conf, errp)) {
+return;
+}
+
 if (!blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
false, errp)) {
 return;
diff --git a/hw/block/swim.c b/hw/block/swim.c
index 8f124782f4..74f56e8f46 100644
--- a/hw/block/swim.c
+++ b/hw/block/swim.c
@@ -189,7 +189,10 @@ static void swim_drive_realize(DeviceState *qdev, Error 
**errp)
 assert(ret == 0);
 }
 
-blkconf_blocksizes(&dev->conf);
+if (!blkconf_blocksizes(&dev->conf, errp)) {
+return;
+}
+
 if (dev->conf.logical_block_size != 512 ||
 dev->conf.physical_block_size != 512)
 {
diff --git a/hw/block/virtio-blk.c b/h

[PATCH v8 7/8] qdev-properties: add getter for size32 and blocksize

2020-05-28 Thread Roman Kagan
Add getter for size32, and use it for blocksize, too.

In its human-readable branch, it reports approximate size in
human-readable units next to the exact byte value, like the getter for
64bit size does.

Adjust the expected test output accordingly.
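
A toy approximation of the new rendering (standalone C, not QEMU code;
the real formatting goes through the size visitor), matching lines like
"4294967295 (4 GiB)" in the output below:

#include <inttypes.h>
#include <stdio.h>

static void print_size(uint64_t v)
{
    static const char *suffix[] = { "B", "KiB", "MiB", "GiB", "TiB" };
    int i = 0;
    double d = v;

    while (d >= 1024 && i < 4) {
        d /= 1024;
        i++;
    }
    /* exact byte value first, approximate human-readable value second */
    printf("%" PRIu64 " (%.3g %s)\n", v, d, suffix[i]);
}

int main(void)
{
    print_size(512);           /* 512 (512 B) */
    print_size(4294967295ull); /* 4294967295 (4 GiB) */
    return 0;
}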

Signed-off-by: Roman Kagan 
Reviewed-by: Eric Blake 
---
 hw/core/qdev-properties.c  |  15 +-
 tests/qemu-iotests/172.out | 530 ++---
 2 files changed, 278 insertions(+), 267 deletions(-)

diff --git a/hw/core/qdev-properties.c b/hw/core/qdev-properties.c
index bd4abdc1d1..63d48db70c 100644
--- a/hw/core/qdev-properties.c
+++ b/hw/core/qdev-properties.c
@@ -730,6 +730,17 @@ const PropertyInfo qdev_prop_pci_devfn = {
 
 /* --- 32bit unsigned int 'size' type --- */
 
+static void get_size32(Object *obj, Visitor *v, const char *name, void *opaque,
+   Error **errp)
+{
+DeviceState *dev = DEVICE(obj);
+Property *prop = opaque;
+uint32_t *ptr = qdev_get_prop_ptr(dev, prop);
+uint64_t value = *ptr;
+
+visit_type_size(v, name, &value, errp);
+}
+
 static void set_size32(Object *obj, Visitor *v, const char *name, void *opaque,
Error **errp)
 {
@@ -763,7 +774,7 @@ static void set_size32(Object *obj, Visitor *v, const char 
*name, void *opaque,
 
 const PropertyInfo qdev_prop_size32 = {
 .name  = "size",
-.get = get_uint32,
+.get = get_size32,
 .set = set_size32,
 .set_default_value = set_default_value_uint,
 };
@@ -821,7 +832,7 @@ const PropertyInfo qdev_prop_blocksize = {
 .name  = "size",
 .description = "A power of two between " MIN_BLOCK_SIZE_STR
" and " MAX_BLOCK_SIZE_STR,
-.get   = get_uint32,
+.get   = get_size32,
 .set   = set_blocksize,
 .set_default_value = set_default_value_uint,
 };
diff --git a/tests/qemu-iotests/172.out b/tests/qemu-iotests/172.out
index 59cc70aebb..e782c5957e 100644
--- a/tests/qemu-iotests/172.out
+++ b/tests/qemu-iotests/172.out
@@ -24,11 +24,11 @@ Testing:
   dev: floppy, id ""
 unit = 0 (0x0)
 drive = "floppy0"
-logical_block_size = 512 (0x200)
-physical_block_size = 512 (0x200)
-min_io_size = 0 (0x0)
-opt_io_size = 0 (0x0)
-discard_granularity = 4294967295 (0xffffffff)
+logical_block_size = 512 (512 B)
+physical_block_size = 512 (512 B)
+min_io_size = 0 (0 B)
+opt_io_size = 0 (0 B)
+discard_granularity = 4294967295 (4 GiB)
 write-cache = "auto"
 share-rw = false
 drive-type = "288"
@@ -54,11 +54,11 @@ Testing: -fda TEST_DIR/t.qcow2
   dev: floppy, id ""
 unit = 0 (0x0)
 drive = "floppy0"
-logical_block_size = 512 (0x200)
-physical_block_size = 512 (0x200)
-min_io_size = 0 (0x0)
-opt_io_size = 0 (0x0)
-discard_granularity = 4294967295 (0xffffffff)
+logical_block_size = 512 (512 B)
+physical_block_size = 512 (512 B)
+min_io_size = 0 (0 B)
+opt_io_size = 0 (0 B)
+discard_granularity = 4294967295 (4 GiB)
 write-cache = "auto"
 share-rw = false
 drive-type = "144"
@@ -81,22 +81,22 @@ Testing: -fdb TEST_DIR/t.qcow2
   dev: floppy, id ""
 unit = 1 (0x1)
 drive = "floppy1"
-logical_block_size = 512 (0x200)
-physical_block_size = 512 (0x200)
-min_io_size = 0 (0x0)
-opt_io_size = 0 (0x0)
-discard_granularity = 4294967295 (0xffffffff)
+logical_block_size = 512 (512 B)
+physical_block_size = 512 (512 B)
+min_io_size = 0 (0 B)
+opt_io_size = 0 (0 B)
+discard_granularity = 4294967295 (4 GiB)
 write-cache = "auto"
 share-rw = false
 drive-type = "144"
   dev: floppy, id ""
 unit = 0 (0x0)
 drive = "floppy0"
-logical_block_size = 512 (0x200)
-physical_block_size = 512 (0x200)
-min_io_size = 0 (0x0)
-opt_io_size = 0 (0x0)
-discard_granularity = 4294967295 (0xffffffff)
+logical_block_size = 512 (512 B)
+physical_block_size = 512 (512 B)
+min_io_size = 0 (0 B)
+opt_io_size = 0 (0 B)
+discard_granularity = 4294967295 (4 GiB)
 write-cache = "auto"

[PATCH v8 5/8] qdev-properties: make blocksize accept size suffixes

2020-05-28 Thread Roman Kagan
It appears convenient to be able to specify physical_block_size and
logical_block_size using common size suffixes.

Teach the blocksize property setter to interpret them.  Also express the
upper and lower limits in the respective units.

Signed-off-by: Roman Kagan 
Reviewed-by: Eric Blake 
---
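For reference, a standalone sketch (not the QEMU implementation -- the
patch itself goes through visit_type_size(), as the diff shows) of how a
suffixed string such as "4k" maps to the byte count that the setter then
range-checks:

/* Standalone sketch of size-suffix parsing; QEMU's own parsing is done
 * by the size visitor, not by this code. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static int parse_size(const char *s, uint64_t *out)
{
    char *end;
    uint64_t v = strtoull(s, &end, 0);

    switch (*end) {
    case 'k': case 'K': v <<= 10; end++; break;
    case 'm': case 'M': v <<= 20; end++; break;
    case 'g': case 'G': v <<= 30; end++; break;
    default: break;
    }
    if (*end != '\0') {
        return -1;
    }
    *out = v;
    return 0;
}

int main(void)
{
    uint64_t v;

    if (parse_size("4k", &v) == 0) {
        /* 4096: a power of two within [512 B, 32 KiB], so the blocksize
         * setter would accept it */
        printf("4k -> %llu bytes\n", (unsigned long long)v);
    }
    return 0;
}
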
 hw/core/qdev-properties.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/hw/core/qdev-properties.c b/hw/core/qdev-properties.c
index 40c13f6ebe..c9af6a1341 100644
--- a/hw/core/qdev-properties.c
+++ b/hw/core/qdev-properties.c
@@ -14,6 +14,7 @@
 #include "qapi/visitor.h"
 #include "chardev/char.h"
 #include "qemu/uuid.h"
+#include "qemu/units.h"
 
 void qdev_prop_set_after_realize(DeviceState *dev, const char *name,
   Error **errp)
@@ -771,17 +772,18 @@ const PropertyInfo qdev_prop_size32 = {
 
 /* lower limit is sector size */
 #define MIN_BLOCK_SIZE  512
-#define MIN_BLOCK_SIZE_STR  stringify(MIN_BLOCK_SIZE)
+#define MIN_BLOCK_SIZE_STR  "512 B"
 /* upper limit is the max power of 2 that fits in uint16_t */
-#define MAX_BLOCK_SIZE  32768
-#define MAX_BLOCK_SIZE_STR  stringify(MAX_BLOCK_SIZE)
+#define MAX_BLOCK_SIZE  (32 * KiB)
+#define MAX_BLOCK_SIZE_STR  "32 KiB"
 
 static void set_blocksize(Object *obj, Visitor *v, const char *name,
   void *opaque, Error **errp)
 {
 DeviceState *dev = DEVICE(obj);
 Property *prop = opaque;
-uint16_t value, *ptr = qdev_get_prop_ptr(dev, prop);
+uint16_t *ptr = qdev_get_prop_ptr(dev, prop);
+uint64_t value;
 Error *local_err = NULL;
 
 if (dev->realized) {
@@ -789,7 +791,7 @@ static void set_blocksize(Object *obj, Visitor *v, const char *name,
 return;
 }
 
-visit_type_uint16(v, name, &value, &local_err);
+visit_type_size(v, name, &value, &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
 return;
@@ -797,7 +799,7 @@ static void set_blocksize(Object *obj, Visitor *v, const char *name,
 /* value of 0 means "unset" */
 if (value && (value < MIN_BLOCK_SIZE || value > MAX_BLOCK_SIZE)) {
 error_setg(errp,
-   "Property %s.%s doesn't take value %" PRIu16
+   "Property %s.%s doesn't take value %" PRIu64
" (minimum: " MIN_BLOCK_SIZE_STR
", maximum: " MAX_BLOCK_SIZE_STR ")",
dev->id ? : "", name, value);
@@ -816,7 +818,7 @@ static void set_blocksize(Object *obj, Visitor *v, const char *name,
 }
 
 const PropertyInfo qdev_prop_blocksize = {
-.name  = "uint16",
+.name  = "size",
 .description = "A power of two between " MIN_BLOCK_SIZE_STR
" and " MAX_BLOCK_SIZE_STR,
 .get   = get_uint16,
-- 
2.26.2




[PATCH v8 4/8] qdev-properties: add size32 property type

2020-05-28 Thread Roman Kagan
Introduce a size32 property type which handles size suffixes (k, m, g)
just like the size property, but is uint32_t rather than uint64_t.  It
will be useful for properties that are byte sizes but are inherently
32-bit, like BlockConf.opt_io_size or .discard_granularity (they are
switched to this new property type in a followup commit).

The getter for size32 is left to a separate patch, as its benefit is
less obvious and it affects test output; for now the regular uint32
getter is used.

Signed-off-by: Roman Kagan 
---
v7 -> v8:
- replace stringify with %u in the error message [Eric]
- fix wording in the log [Eric]
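
As a usage sketch (hypothetical: DemoState and the "opt-io-size" name are
made up, and QEMU's qdev-properties.h is assumed), a device would declare
a 32-bit byte-size property with the new macro just like any other qdev
property, and the user could then presumably set it with a suffix such as
64k on the command line:

/* Hypothetical example: only DEFINE_PROP_SIZE32 (added by this patch)
 * and DEFINE_PROP_END_OF_LIST() are real; DemoState is assumed to be a
 * device state struct with a uint32_t opt_io_size field. */
static Property demo_properties[] = {
    DEFINE_PROP_SIZE32("opt-io-size", DemoState, opt_io_size, 0),
    DEFINE_PROP_END_OF_LIST(),
};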

 include/hw/qdev-properties.h |  3 +++
 hw/core/qdev-properties.c| 40 
 2 files changed, 43 insertions(+)

diff --git a/include/hw/qdev-properties.h b/include/hw/qdev-properties.h
index f161604fb6..c03eadfad6 100644
--- a/include/hw/qdev-properties.h
+++ b/include/hw/qdev-properties.h
@@ -29,6 +29,7 @@ extern const PropertyInfo qdev_prop_drive;
 extern const PropertyInfo qdev_prop_drive_iothread;
 extern const PropertyInfo qdev_prop_netdev;
 extern const PropertyInfo qdev_prop_pci_devfn;
+extern const PropertyInfo qdev_prop_size32;
 extern const PropertyInfo qdev_prop_blocksize;
 extern const PropertyInfo qdev_prop_pci_host_devaddr;
 extern const PropertyInfo qdev_prop_uuid;
@@ -196,6 +197,8 @@ extern const PropertyInfo qdev_prop_pcie_link_width;
 BlockdevOnError)
 #define DEFINE_PROP_BIOS_CHS_TRANS(_n, _s, _f, _d) \
 DEFINE_PROP_SIGNED(_n, _s, _f, _d, qdev_prop_bios_chs_trans, int)
+#define DEFINE_PROP_SIZE32(_n, _s, _f, _d)   \
+DEFINE_PROP_UNSIGNED(_n, _s, _f, _d, qdev_prop_size32, uint32_t)
 #define DEFINE_PROP_BLOCKSIZE(_n, _s, _f) \
 DEFINE_PROP_UNSIGNED(_n, _s, _f, 0, qdev_prop_blocksize, uint16_t)
 #define DEFINE_PROP_PCI_HOST_DEVADDR(_n, _s, _f) \
diff --git a/hw/core/qdev-properties.c b/hw/core/qdev-properties.c
index 249dc69bd8..40c13f6ebe 100644
--- a/hw/core/qdev-properties.c
+++ b/hw/core/qdev-properties.c
@@ -727,6 +727,46 @@ const PropertyInfo qdev_prop_pci_devfn = {
 .set_default_value = set_default_value_int,
 };
 
+/* --- 32bit unsigned int 'size' type --- */
+
+static void set_size32(Object *obj, Visitor *v, const char *name, void *opaque,
+   Error **errp)
+{
+DeviceState *dev = DEVICE(obj);
+Property *prop = opaque;
+uint32_t *ptr = qdev_get_prop_ptr(dev, prop);
+uint64_t value;
+Error *local_err = NULL;
+
+if (dev->realized) {
+qdev_prop_set_after_realize(dev, name, errp);
+return;
+}
+
+visit_type_size(v, name, &value, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+
+if (value > UINT32_MAX) {
+error_setg(errp,
+   "Property %s.%s doesn't take value %" PRIu64
+   " (maximum: %u)",
+   dev->id ? : "", name, value, UINT32_MAX);
+return;
+}
+
+*ptr = value;
+}
+
+const PropertyInfo qdev_prop_size32 = {
+.name  = "size",
+.get = get_uint32,
+.set = set_size32,
+.set_default_value = set_default_value_uint,
+};
+
 /* --- blocksize --- */
 
 /* lower limit is sector size */
-- 
2.26.2




[PATCH v8 0/8] block: enhance handling of size-related BlockConf properties

2020-05-28 Thread Roman Kagan
BlockConf includes several properties counted in bytes.

Enhance their handling in some aspects, specifically

- accept common size suffixes (k, m)
- perform consistency checks on the values
- lift the upper limit on physical_block_size and logical_block_size

Also fix the accessor for opt_io_size in virtio-blk to make it consistent with
the size of the field.

History:
v7 -> v8:
- replace stringify with %u in error messages [Eric]
- fix wording in logs [Eric]

v6 -> v7:
- avoid overflow in min_io_size check [Eric]
- try again to perform the art form in patch splitting [Eric]

v5 -> v6:
- fix forgotten xen-block and swim
- add prop_size32 instead of going with 64bit

v4 -> v5:
- re-split the patches [Philippe]
- fix/reword error messages [Philippe, Kevin]
- do early return on failed consistency check [Philippe]
- use QEMU_IS_ALIGNED instead of open coding [Philippe]
- make all BlockConf size props support suffixes
- expand the log for virtio-blk opt_io_size [Michael]

v3 -> v4:
- add patch to fix opt_io_size width in virtio-blk
- add patch to perform consistency checks [Kevin]
- check min_io_size against truncation [Kevin]

v2 -> v3:
- mention qcow2 cluster size limit in the log and comment [Eric]

v1 -> v2:
- cap the property at 2 MiB [Eric]
- accept size suffixes

Roman Kagan (8):
  virtio-blk: store opt_io_size with correct size
  block: consolidate blocksize properties consistency checks
  qdev-properties: blocksize: use same limits in code and description
  qdev-properties: add size32 property type
  qdev-properties: make blocksize accept size suffixes
  block: make BlockConf size props 32bit and accept size suffixes
  qdev-properties: add getter for size32 and blocksize
  block: lift blocksize property limit to 2 MiB

 include/hw/block/block.h |  14 +-
 include/hw/qdev-properties.h |   5 +-
 hw/block/block.c |  40 ++-
 hw/block/fdc.c   |   5 +-
 hw/block/nvme.c  |   5 +-
 hw/block/swim.c  |   5 +-
 hw/block/virtio-blk.c|   9 +-
 hw/block/xen-block.c |   6 +-
 hw/core/qdev-properties.c|  85 +-
 hw/ide/qdev.c|   5 +-
 hw/scsi/scsi-disk.c  |  12 +-
 hw/usb/dev-storage.c |   5 +-
 tests/qemu-iotests/172.out   | 532 +--
 13 files changed, 419 insertions(+), 309 deletions(-)

-- 
2.26.2




[PATCH v8 1/8] virtio-blk: store opt_io_size with correct size

2020-05-28 Thread Roman Kagan
The width of opt_io_size in virtio_blk_config is 32 bits.  However, it's
written with virtio_stw_p; this may result in value truncation and, on
big-endian systems with legacy virtio, in completely bogus readings in
the guest.

Use the appropriate accessor to store it.

Signed-off-by: Roman Kagan 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Kevin Wolf 
---
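For illustration, a standalone sketch (not QEMU code; the value 0x12345
is made up) of why a 16-bit store into a 32-bit field is wrong: the value
is truncated, and since only two of the field's four bytes are written,
the damage also differs by host endianness.

/* Standalone sketch, not QEMU code: a 16-bit store of a 32-bit value. */
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint32_t field = 0;               /* stands in for the 32-bit config field */
    uint32_t value = 0x12345;         /* needs more than 16 bits */
    uint16_t half  = (uint16_t)value; /* truncated to 0x2345 */

    /* A 16-bit store writes only two of the four bytes: the field reads
     * back as 0x2345 on little-endian hosts and 0x23450000 on big-endian
     * ones -- wrong either way. */
    memcpy(&field, &half, sizeof(half));
    printf("after 16-bit store: 0x%" PRIx32 "\n", field);

    field = value;                    /* a full 32-bit store keeps the value */
    printf("after 32-bit store: 0x%" PRIx32 "\n", field);
    return 0;
}
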
 hw/block/virtio-blk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index f5f6fc925e..413083e62f 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -918,7 +918,7 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
 virtio_stw_p(vdev, &blkcfg.cylinders, conf->cyls);
 virtio_stl_p(vdev, &blkcfg.blk_size, blk_size);
 virtio_stw_p(vdev, &blkcfg.min_io_size, conf->min_io_size / blk_size);
-virtio_stw_p(vdev, &blkcfg.opt_io_size, conf->opt_io_size / blk_size);
+virtio_stl_p(vdev, &blkcfg.opt_io_size, conf->opt_io_size / blk_size);
 blkcfg.geometry.heads = conf->heads;
 /*
  * We must ensure that the block device capacity is a multiple of
-- 
2.26.2



