Re: [PATCH v3 4/4] docs: vhost-user: add subsection for non-Linux platforms

2022-03-04 Thread Sergio Lopez
On Fri, Mar 04, 2022 at 05:35:01AM -0500, Michael S. Tsirkin wrote:
> On Thu, Mar 03, 2022 at 12:59:11PM +0100, Sergio Lopez wrote:
> > Add a section explaining how vhost-user is supported on platforms
> > other than Linux.
> > 
> > Signed-off-by: Sergio Lopez 
> > ---
> >  docs/interop/vhost-user.rst | 18 ++
> >  1 file changed, 18 insertions(+)
> > 
> > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> > index edc3ad84a3..590a626b92 100644
> > --- a/docs/interop/vhost-user.rst
> > +++ b/docs/interop/vhost-user.rst
> > @@ -38,6 +38,24 @@ conventions `.
> >  *Master* and *slave* can be either a client (i.e. connecting) or
> >  server (listening) in the socket communication.
> >  
> > +Support for platforms other than Linux
> 
> 
> It's not just Linux - any platform without eventfd.
> 
> So I think we should have a section explaining that wherever the
> spec says eventfd it can be a pipe if the system does not
> support creating eventfd.

I'm confused. This is exactly what this subsection intends to do...

Thanks,
Sergio.

> > +--------------------------------------
> > +
> > +While vhost-user was initially developed targeting Linux, nowadays is
> > +supported on any platform that provides the following features:
> > +
> > +- The ability to share a mapping injected into the guest between
> > +  multiple processes, so both QEMU and the vhost-user daemon servicing
> > +  the device can access simultaneously the memory regions containing
> > +  the virtqueues and the data associated with each request.
> > +
> > +- AF_UNIX sockets with SCM_RIGHTS, so QEMU can communicate with the
> > +  vhost-user daemon and send it file descriptors when needed.
> > +
> > +- Either eventfd or pipe/pipe2. On platforms where eventfd is not
> > +  available, QEMU will automatically fallback to pipe2 or, as a last
> > +  resort, pipe.
> > +
> >  Message Specification
> >  =====================
> >  
> > -- 
> > 2.35.1
> 




[PATCH v4 4/4] docs: vhost-user: add subsection for non-Linux platforms

2022-03-04 Thread Sergio Lopez
Add a section explaining how vhost-user is supported on platforms
other than Linux.

Signed-off-by: Sergio Lopez 
Reviewed-by: Stefan Hajnoczi 
---
 docs/interop/vhost-user.rst | 20 
 1 file changed, 20 insertions(+)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index edc3ad84a3..4dbc84fd00 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -38,6 +38,26 @@ conventions `.
 *Master* and *slave* can be either a client (i.e. connecting) or
 server (listening) in the socket communication.
 
+Support for platforms other than Linux
+--------------------------------------
+
+While vhost-user was initially developed targeting Linux, nowadays it
+is supported on any platform that provides the following features:
+
+- A way for requesting shared memory represented by a file descriptor
+  so it can be passed over a UNIX domain socket and then mapped by the
+  other process.
+
+- AF_UNIX sockets with SCM_RIGHTS, so QEMU and the other process can
+  exchange messages through it, including ancillary data when needed.
+
+- Either eventfd or pipe/pipe2. On platforms where eventfd is not
+  available, QEMU will automatically fall back to pipe2 or, as a last
+  resort, pipe. Each file descriptor will be used for receiving or
+  sending events by reading or writing (respectively) an 8-byte value
+  to the corresponding file descriptor. The 8-byte value itself has no
+  meaning and should not be interpreted.
+
 Message Specification
 =====================
 
-- 
2.35.1




[PATCH v4 2/4] vhost: use wfd on functions setting vring call fd

2022-03-04 Thread Sergio Lopez
When ioeventfd is emulated using qemu_pipe(), only EventNotifier's wfd
can be used for writing.

Use the recently introduced event_notifier_get_wfd() function to
obtain the fd that our peer must use to signal the vring.

Signed-off-by: Sergio Lopez 
Reviewed-by: Stefan Hajnoczi 
---
 hw/virtio/vhost.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 7b03efccec..b643f42ea4 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1287,7 +1287,7 @@ static int vhost_virtqueue_init(struct vhost_dev *dev,
 return r;
 }
 
-file.fd = event_notifier_get_fd(&vq->masked_notifier);
+file.fd = event_notifier_get_wfd(&vq->masked_notifier);
 r = dev->vhost_ops->vhost_set_vring_call(dev, &file);
 if (r) {
 VHOST_OPS_DEBUG(r, "vhost_set_vring_call failed");
@@ -1542,9 +1542,9 @@ void vhost_virtqueue_mask(struct vhost_dev *hdev, 
VirtIODevice *vdev, int n,
 
 if (mask) {
 assert(vdev->use_guest_notifier_mask);
-file.fd = event_notifier_get_fd(&hdev->vqs[index].masked_notifier);
+file.fd = event_notifier_get_wfd(&hdev->vqs[index].masked_notifier);
 } else {
-file.fd = event_notifier_get_fd(virtio_queue_get_guest_notifier(vvq));
+file.fd = event_notifier_get_wfd(virtio_queue_get_guest_notifier(vvq));
 }
 
 file.index = hdev->vhost_ops->vhost_get_vq_index(hdev, n);
-- 
2.35.1




[PATCH v4 3/4] configure, meson: allow enabling vhost-user on all POSIX systems

2022-03-04 Thread Sergio Lopez
With the possibility of using a pipe pair via qemu_pipe() as a
replacement on operating systems that don't support eventfd,
vhost-user can also work on all POSIX systems.

This change allows enabling vhost-user on all non-Windows platforms
and makes libvhost_user (which still depends on eventfd) a Linux-only
feature.

Signed-off-by: Sergio Lopez 
Reviewed-by: Stefan Hajnoczi 
---
 configure   | 4 ++--
 meson.build | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/configure b/configure
index c56ed53ee3..daccf4be7c 100755
--- a/configure
+++ b/configure
@@ -1659,8 +1659,8 @@ fi
 # vhost interdependencies and host support
 
 # vhost backends
-if test "$vhost_user" = "yes" && test "$linux" != "yes"; then
-  error_exit "vhost-user is only available on Linux"
+if test "$vhost_user" = "yes" && test "$mingw32" = "yes"; then
+  error_exit "vhost-user is not available on Windows"
 fi
 test "$vhost_vdpa" = "" && vhost_vdpa=$linux
 if test "$vhost_vdpa" = "yes" && test "$linux" != "yes"; then
diff --git a/meson.build b/meson.build
index 8df40bfac4..f2bc439c30 100644
--- a/meson.build
+++ b/meson.build
@@ -2701,7 +2701,7 @@ if have_system or have_user
 endif
 
 vhost_user = not_found
-if 'CONFIG_VHOST_USER' in config_host
+if targetos == 'linux' and 'CONFIG_VHOST_USER' in config_host
   libvhost_user = subproject('libvhost-user')
   vhost_user = libvhost_user.get_variable('vhost_user_dep')
 endif
-- 
2.35.1




[PATCH v4 1/4] event_notifier: add event_notifier_get_wfd()

2022-03-04 Thread Sergio Lopez
event_notifier_get_fd(const EventNotifier *e) always returns
EventNotifier's read file descriptor (rfd). This is not a problem when
the EventNotifier is backed by an eventfd, as a single file
descriptor is used both for reading and triggering events (rfd ==
wfd).

But, when EventNotifier is backed by a pipe pair, we have two file
descriptors, one that can only be used for reads (rfd), and the other
only for writes (wfd).

There's at least one known situation in which we need to obtain wfd
instead of rfd, which is when setting up the file that's going to be
sent to the peer in vhost's SET_VRING_CALL.

Add a new event_notifier_get_wfd(const EventNotifier *e) that can be
used to obtain wfd where needed.

Signed-off-by: Sergio Lopez 
Reviewed-by: Stefan Hajnoczi 
---
 include/qemu/event_notifier.h | 1 +
 util/event_notifier-posix.c   | 5 +
 2 files changed, 6 insertions(+)

diff --git a/include/qemu/event_notifier.h b/include/qemu/event_notifier.h
index b79add035d..8a4ff308e1 100644
--- a/include/qemu/event_notifier.h
+++ b/include/qemu/event_notifier.h
@@ -38,6 +38,7 @@ int event_notifier_test_and_clear(EventNotifier *);
 #ifdef CONFIG_POSIX
 void event_notifier_init_fd(EventNotifier *, int fd);
 int event_notifier_get_fd(const EventNotifier *);
+int event_notifier_get_wfd(const EventNotifier *);
 #else
 HANDLE event_notifier_get_handle(EventNotifier *);
 #endif
diff --git a/util/event_notifier-posix.c b/util/event_notifier-posix.c
index 8307013c5d..16294e98d4 100644
--- a/util/event_notifier-posix.c
+++ b/util/event_notifier-posix.c
@@ -99,6 +99,11 @@ int event_notifier_get_fd(const EventNotifier *e)
 return e->rfd;
 }
 
+int event_notifier_get_wfd(const EventNotifier *e)
+{
+return e->wfd;
+}
+
 int event_notifier_set(EventNotifier *e)
 {
 static const uint64_t value = 1;
-- 
2.35.1




[PATCH v4 0/4] Enable vhost-user to be used on BSD systems

2022-03-04 Thread Sergio Lopez
Since QEMU is already able to emulate ioeventfd using pipefd, we're already
pretty close to supporting vhost-user on non-Linux systems.

These patches bridge the gap by:

1. Adding a new event_notifier_get_wfd() to return wfd in the places where
   the peer is expected to write to the notifier.

2. Modifying the build system so it allows enabling vhost-user on BSD.

v1->v2:
  - Drop: "Allow returning EventNotifier's wfd" (Alex Williamson)
  - Add: "event_notifier: add event_notifier_get_wfd()" (Alex Williamson)
  - Add: "vhost: use wfd on functions setting vring call fd"
  - Rename: "Allow building vhost-user in BSD" to "configure, meson: allow
enabling vhost-user on all POSIX systems"
  - Instead of making it possible to enable vhost-user only on Linux and
BSD systems, allow enabling it on all non-Windows platforms. (Paolo Bonzini)

v2->v3:
  - Add a section to docs/interop/vhost-user.rst explaining how vhost-user
is supported on non-Linux platforms. (Stefan Hajnoczi)

v3->v4:
  - Some documentation fixes. (Stefan Hajnoczi)
  - Pick up Reviewed-by tags.

Sergio Lopez (4):
  event_notifier: add event_notifier_get_wfd()
  vhost: use wfd on functions setting vring call fd
  configure, meson: allow enabling vhost-user on all POSIX systems
  docs: vhost-user: add subsection for non-Linux platforms

 configure |  4 ++--
 docs/interop/vhost-user.rst   | 20 
 hw/virtio/vhost.c |  6 +++---
 include/qemu/event_notifier.h |  1 +
 meson.build   |  2 +-
 util/event_notifier-posix.c   |  5 +
 6 files changed, 32 insertions(+), 6 deletions(-)

-- 
2.35.1





[PATCH v3 3/4] configure, meson: allow enabling vhost-user on all POSIX systems

2022-03-03 Thread Sergio Lopez
With the possibility of using a pipe pair via qemu_pipe() as a
replacement on operating systems that don't support eventfd,
vhost-user can also work on all POSIX systems.

This change allows enabling vhost-user on all non-Windows platforms
and makes libvhost_user (which still depends on eventfd) a Linux-only
feature.

Signed-off-by: Sergio Lopez 
---
 configure   | 4 ++--
 meson.build | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/configure b/configure
index c56ed53ee3..daccf4be7c 100755
--- a/configure
+++ b/configure
@@ -1659,8 +1659,8 @@ fi
 # vhost interdependencies and host support
 
 # vhost backends
-if test "$vhost_user" = "yes" && test "$linux" != "yes"; then
-  error_exit "vhost-user is only available on Linux"
+if test "$vhost_user" = "yes" && test "$mingw32" = "yes"; then
+  error_exit "vhost-user is not available on Windows"
 fi
 test "$vhost_vdpa" = "" && vhost_vdpa=$linux
 if test "$vhost_vdpa" = "yes" && test "$linux" != "yes"; then
diff --git a/meson.build b/meson.build
index 8df40bfac4..f2bc439c30 100644
--- a/meson.build
+++ b/meson.build
@@ -2701,7 +2701,7 @@ if have_system or have_user
 endif
 
 vhost_user = not_found
-if 'CONFIG_VHOST_USER' in config_host
+if targetos == 'linux' and 'CONFIG_VHOST_USER' in config_host
   libvhost_user = subproject('libvhost-user')
   vhost_user = libvhost_user.get_variable('vhost_user_dep')
 endif
-- 
2.35.1




[PATCH v3 4/4] docs: vhost-user: add subsection for non-Linux platforms

2022-03-03 Thread Sergio Lopez
Add a section explaining how vhost-user is supported on platforms
other than Linux.

Signed-off-by: Sergio Lopez 
---
 docs/interop/vhost-user.rst | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index edc3ad84a3..590a626b92 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -38,6 +38,24 @@ conventions `.
 *Master* and *slave* can be either a client (i.e. connecting) or
 server (listening) in the socket communication.
 
+Support for platforms other than Linux
+--------------------------------------
+
+While vhost-user was initially developed targeting Linux, nowadays is
+supported on any platform that provides the following features:
+
+- The ability to share a mapping injected into the guest between
+  multiple processes, so both QEMU and the vhost-user daemon servicing
+  the device can access simultaneously the memory regions containing
+  the virtqueues and the data associated with each request.
+
+- AF_UNIX sockets with SCM_RIGHTS, so QEMU can communicate with the
+  vhost-user daemon and send it file descriptors when needed.
+
+- Either eventfd or pipe/pipe2. On platforms where eventfd is not
+  available, QEMU will automatically fallback to pipe2 or, as a last
+  resort, pipe.
+
 Message Specification
 =====================
 
-- 
2.35.1




[PATCH v3 2/4] vhost: use wfd on functions setting vring call fd

2022-03-03 Thread Sergio Lopez
When ioeventfd is emulated using qemu_pipe(), only EventNotifier's wfd
can be used for writing.

Use the recently introduced event_notifier_get_wfd() function to
obtain the fd that our peer must use to signal the vring.

Signed-off-by: Sergio Lopez 
---
 hw/virtio/vhost.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 7b03efccec..b643f42ea4 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1287,7 +1287,7 @@ static int vhost_virtqueue_init(struct vhost_dev *dev,
 return r;
 }
 
-file.fd = event_notifier_get_fd(&vq->masked_notifier);
+file.fd = event_notifier_get_wfd(&vq->masked_notifier);
 r = dev->vhost_ops->vhost_set_vring_call(dev, &file);
 if (r) {
 VHOST_OPS_DEBUG(r, "vhost_set_vring_call failed");
@@ -1542,9 +1542,9 @@ void vhost_virtqueue_mask(struct vhost_dev *hdev, 
VirtIODevice *vdev, int n,
 
 if (mask) {
 assert(vdev->use_guest_notifier_mask);
-file.fd = event_notifier_get_fd(&hdev->vqs[index].masked_notifier);
+file.fd = event_notifier_get_wfd(&hdev->vqs[index].masked_notifier);
 } else {
-file.fd = event_notifier_get_fd(virtio_queue_get_guest_notifier(vvq));
+file.fd = event_notifier_get_wfd(virtio_queue_get_guest_notifier(vvq));
 }
 
 file.index = hdev->vhost_ops->vhost_get_vq_index(hdev, n);
-- 
2.35.1




[PATCH v3 1/4] event_notifier: add event_notifier_get_wfd()

2022-03-03 Thread Sergio Lopez
event_notifier_get_fd(const EventNotifier *e) always returns
EventNotifier's read file descriptor (rfd). This is not a problem when
the EventNotifier is backed by an eventfd, as a single file
descriptor is used both for reading and triggering events (rfd ==
wfd).

But, when EventNotifier is backed by a pipe pair, we have two file
descriptors, one that can only be used for reads (rfd), and the other
only for writes (wfd).

There's at least one known situation in which we need to obtain wfd
instead of rfd, which is when setting up the file that's going to be
sent to the peer in vhost's SET_VRING_CALL.

Add a new event_notifier_get_wfd(const EventNotifier *e) that can be
used to obtain wfd where needed.

Signed-off-by: Sergio Lopez 
---
 include/qemu/event_notifier.h | 1 +
 util/event_notifier-posix.c   | 5 +
 2 files changed, 6 insertions(+)

diff --git a/include/qemu/event_notifier.h b/include/qemu/event_notifier.h
index b79add035d..8a4ff308e1 100644
--- a/include/qemu/event_notifier.h
+++ b/include/qemu/event_notifier.h
@@ -38,6 +38,7 @@ int event_notifier_test_and_clear(EventNotifier *);
 #ifdef CONFIG_POSIX
 void event_notifier_init_fd(EventNotifier *, int fd);
 int event_notifier_get_fd(const EventNotifier *);
+int event_notifier_get_wfd(const EventNotifier *);
 #else
 HANDLE event_notifier_get_handle(EventNotifier *);
 #endif
diff --git a/util/event_notifier-posix.c b/util/event_notifier-posix.c
index 8307013c5d..16294e98d4 100644
--- a/util/event_notifier-posix.c
+++ b/util/event_notifier-posix.c
@@ -99,6 +99,11 @@ int event_notifier_get_fd(const EventNotifier *e)
 return e->rfd;
 }
 
+int event_notifier_get_wfd(const EventNotifier *e)
+{
+return e->wfd;
+}
+
 int event_notifier_set(EventNotifier *e)
 {
 static const uint64_t value = 1;
-- 
2.35.1




[PATCH v3 0/4] Enable vhost-user to be used on BSD systems

2022-03-03 Thread Sergio Lopez
Since QEMU is already able to emulate ioeventfd using pipefd, we're already
pretty close to supporting vhost-user on non-Linux systems.

These patches bridge the gap by:

1. Adding a new event_notifier_get_wfd() to return wfd in the places where
   the peer is expected to write to the notifier.

2. Modifying the build system so it allows enabling vhost-user on BSD.

v1->v2:
  - Drop: "Allow returning EventNotifier's wfd" (Alex Williamson)
  - Add: "event_notifier: add event_notifier_get_wfd()" (Alex Williamson)
  - Add: "vhost: use wfd on functions setting vring call fd"
  - Rename: "Allow building vhost-user in BSD" to "configure, meson: allow
enabling vhost-user on all POSIX systems"
  - Instead of making it possible to enable vhost-user only on Linux and
BSD systems, allow enabling it on all non-Windows platforms. (Paolo Bonzini)

v2->v3:
  - Add a section to docs/interop/vhost-user.rst explaining how vhost-user
is supported on non-Linux platforms. (Stefan Hajnoczi)

Sergio Lopez (4):
  event_notifier: add event_notifier_get_wfd()
  vhost: use wfd on functions setting vring call fd
  configure, meson: allow enabling vhost-user on all POSIX systems
  docs: vhost-user: add subsection for non-Linux platforms

 configure |  4 ++--
 docs/interop/vhost-user.rst   | 18 ++
 hw/virtio/vhost.c |  6 +++---
 include/qemu/event_notifier.h |  1 +
 meson.build   |  2 +-
 util/event_notifier-posix.c   |  5 +
 6 files changed, 30 insertions(+), 6 deletions(-)

-- 
2.35.1





[PATCH v2 2/3] vhost: use wfd on functions setting vring call fd

2022-03-02 Thread Sergio Lopez
When ioeventfd is emulated using qemu_pipe(), only EventNotifier's wfd
can be used for writing.

Use the recently introduced event_notifier_get_wfd() function to
obtain the fd that our peer must use to signal the vring.

Signed-off-by: Sergio Lopez 
---
 hw/virtio/vhost.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 7b03efccec..b643f42ea4 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1287,7 +1287,7 @@ static int vhost_virtqueue_init(struct vhost_dev *dev,
 return r;
 }
 
-file.fd = event_notifier_get_fd(&vq->masked_notifier);
+file.fd = event_notifier_get_wfd(&vq->masked_notifier);
 r = dev->vhost_ops->vhost_set_vring_call(dev, &file);
 if (r) {
 VHOST_OPS_DEBUG(r, "vhost_set_vring_call failed");
@@ -1542,9 +1542,9 @@ void vhost_virtqueue_mask(struct vhost_dev *hdev, 
VirtIODevice *vdev, int n,
 
 if (mask) {
 assert(vdev->use_guest_notifier_mask);
-file.fd = event_notifier_get_fd(&hdev->vqs[index].masked_notifier);
+file.fd = event_notifier_get_wfd(&hdev->vqs[index].masked_notifier);
 } else {
-file.fd = event_notifier_get_fd(virtio_queue_get_guest_notifier(vvq));
+file.fd = event_notifier_get_wfd(virtio_queue_get_guest_notifier(vvq));
 }
 
 file.index = hdev->vhost_ops->vhost_get_vq_index(hdev, n);
-- 
2.35.1




[PATCH v2 0/3] Enable vhost-user to be used on BSD systems

2022-03-02 Thread Sergio Lopez
Since QEMU is already able to emulate ioeventfd using pipefd, we're already
pretty close to supporting vhost-user on non-Linux systems.

These patches bridge the gap by:

1. Adding a new event_notifier_get_wfd() to return wfd in the places where
   the peer is expected to write to the notifier.

2. Modifying the build system so it allows enabling vhost-user on BSD.

v1->v2:
  - Drop: "Allow returning EventNotifier's wfd" (Alex Williamson)
  - Add: "event_notifier: add event_notifier_get_wfd()" (Alex Williamson)
  - Add: "vhost: use wfd on functions setting vring call fd"
  - Rename: "Allow building vhost-user in BSD" to "configure, meson: allow
enabling vhost-user on all POSIX systems"
  - Instead of making it possible to enable vhost-user only on Linux and
BSD systems, allow enabling it on all non-Windows platforms. (Paolo)

Sergio Lopez (3):
  event_notifier: add event_notifier_get_wfd()
  vhost: use wfd on functions setting vring call fd
  configure, meson: allow enabling vhost-user on all POSIX systems

 configure | 4 ++--
 hw/virtio/vhost.c | 6 +++---
 include/qemu/event_notifier.h | 1 +
 meson.build   | 2 +-
 util/event_notifier-posix.c   | 5 +
 5 files changed, 12 insertions(+), 6 deletions(-)

-- 
2.35.1





[PATCH v2 1/3] event_notifier: add event_notifier_get_wfd()

2022-03-02 Thread Sergio Lopez
event_notifier_get_fd(const EventNotifier *e) always returns
EventNotifier's read file descriptor (rfd). This is not a problem when
the EventNotifier is backed by an eventfd, as a single file
descriptor is used both for reading and triggering events (rfd ==
wfd).

But, when EventNotifier is backed by a pipe pair, we have two file
descriptors, one that can only be used for reads (rfd), and the other
only for writes (wfd).

There's at least one known situation in which we need to obtain wfd
instead of rfd, which is when setting up the file that's going to be
sent to the peer in vhost's SET_VRING_CALL.

Add a new event_notifier_get_wfd(const EventNotifier *e) that can be
used to obtain wfd where needed.

Signed-off-by: Sergio Lopez 
---
 include/qemu/event_notifier.h | 1 +
 util/event_notifier-posix.c   | 5 +
 2 files changed, 6 insertions(+)

diff --git a/include/qemu/event_notifier.h b/include/qemu/event_notifier.h
index b79add035d..8a4ff308e1 100644
--- a/include/qemu/event_notifier.h
+++ b/include/qemu/event_notifier.h
@@ -38,6 +38,7 @@ int event_notifier_test_and_clear(EventNotifier *);
 #ifdef CONFIG_POSIX
 void event_notifier_init_fd(EventNotifier *, int fd);
 int event_notifier_get_fd(const EventNotifier *);
+int event_notifier_get_wfd(const EventNotifier *);
 #else
 HANDLE event_notifier_get_handle(EventNotifier *);
 #endif
diff --git a/util/event_notifier-posix.c b/util/event_notifier-posix.c
index 8307013c5d..16294e98d4 100644
--- a/util/event_notifier-posix.c
+++ b/util/event_notifier-posix.c
@@ -99,6 +99,11 @@ int event_notifier_get_fd(const EventNotifier *e)
 return e->rfd;
 }
 
+int event_notifier_get_wfd(const EventNotifier *e)
+{
+return e->wfd;
+}
+
 int event_notifier_set(EventNotifier *e)
 {
 static const uint64_t value = 1;
-- 
2.35.1




[PATCH v2 3/3] configure, meson: allow enabling vhost-user on all POSIX systems

2022-03-02 Thread Sergio Lopez
With the possibility of using a pipe pair via qemu_pipe() as a
replacement on operating systems that don't support eventfd,
vhost-user can also work on all POSIX systems.

This change allows enabling vhost-user on all non-Windows platforms
and makes libvhost_user (which still depends on eventfd) a Linux-only
feature.

Signed-off-by: Sergio Lopez 
---
 configure   | 4 ++--
 meson.build | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/configure b/configure
index c56ed53ee3..daccf4be7c 100755
--- a/configure
+++ b/configure
@@ -1659,8 +1659,8 @@ fi
 # vhost interdependencies and host support
 
 # vhost backends
-if test "$vhost_user" = "yes" && test "$linux" != "yes"; then
-  error_exit "vhost-user is only available on Linux"
+if test "$vhost_user" = "yes" && test "$mingw32" = "yes"; then
+  error_exit "vhost-user is not available on Windows"
 fi
 test "$vhost_vdpa" = "" && vhost_vdpa=$linux
 if test "$vhost_vdpa" = "yes" && test "$linux" != "yes"; then
diff --git a/meson.build b/meson.build
index 8df40bfac4..f2bc439c30 100644
--- a/meson.build
+++ b/meson.build
@@ -2701,7 +2701,7 @@ if have_system or have_user
 endif
 
 vhost_user = not_found
-if 'CONFIG_VHOST_USER' in config_host
+if targetos == 'linux' and 'CONFIG_VHOST_USER' in config_host
   libvhost_user = subproject('libvhost-user')
   vhost_user = libvhost_user.get_variable('vhost_user_dep')
 endif
-- 
2.35.1




Re: [PATCH 2/2] Allow building vhost-user in BSD

2022-03-02 Thread Sergio Lopez
On Wed, Mar 02, 2022 at 06:38:07PM +0100, Philippe Mathieu-Daudé wrote:
> On 2/3/22 18:31, Sergio Lopez wrote:
> > On Wed, Mar 02, 2022 at 06:18:59PM +0100, Philippe Mathieu-Daudé wrote:
> > > On 2/3/22 18:10, Paolo Bonzini wrote:
> > > > On 3/2/22 12:36, Sergio Lopez wrote:
> > > > > With the possibility of using pipefd as a replacement on operating
> > > > > systems that don't support eventfd, vhost-user can also work on BSD
> > > > > systems.
> > > > > 
> > > > > This change allows enabling vhost-user on BSD platforms too and
> > > > > makes libvhost_user (which still depends on eventfd) a Linux-only
> > > > > feature.
> > > > > 
> > > > > Signed-off-by: Sergio Lopez 
> > > > 
> > > > I would just check for !windows.
> > > 
> > > What about Darwin / Haiku / illumos?
> > 
> > It should work on every system providing pipe() or pipe2(), so I guess
> > Paolo's right, every platform except Windows. FWIW, I already tested
> > it with Darwin.
> 
> Wow, nice.
> 
> So maybe simply check for pipe/pipe2 rather than !windows?
> 

Is it worth it? In "configure", CONFIG_POSIX is set to "y" if the
target isn't "mingw32", and CONFIG_POSIX brings "util/oslib-posix.c"
into the build, which expects to have either pipe or pipe2.

Thanks,
Sergio.




Re: [PATCH 2/2] Allow building vhost-user in BSD

2022-03-02 Thread Sergio Lopez
On Wed, Mar 02, 2022 at 06:18:59PM +0100, Philippe Mathieu-Daudé wrote:
> On 2/3/22 18:10, Paolo Bonzini wrote:
> > On 3/2/22 12:36, Sergio Lopez wrote:
> > > With the possibility of using pipefd as a replacement on operating
> > > systems that don't support eventfd, vhost-user can also work on BSD
> > > systems.
> > > 
> > > This change allows enabling vhost-user on BSD platforms too and
> > > makes libvhost_user (which still depends on eventfd) a Linux-only
> > > feature.
> > > 
> > > Signed-off-by: Sergio Lopez 
> > 
> > I would just check for !windows.
> 
> What about Darwin / Haiku / illumos?

It should work on every system providing pipe() or pipe2(), so I guess
Paolo's right, every platform except Windows. FWIW, I already tested
it with Darwin.

Thanks,
Sergio.




Re: [PATCH 1/2] Allow returning EventNotifier's wfd

2022-03-02 Thread Sergio Lopez
On Wed, Mar 02, 2022 at 08:12:34AM -0700, Alex Williamson wrote:
> On Wed,  2 Mar 2022 12:36:43 +0100
> Sergio Lopez  wrote:
> 
> > event_notifier_get_fd(const EventNotifier *e) always returns
> > EventNotifier's read file descriptor (rfd). This is not a problem when
> > the EventNotifier is backed by an eventfd, as a single file
> > descriptor is used both for reading and triggering events (rfd ==
> > wfd).
> > 
> > But, when EventNotifier is backed by a pipefd, we have two file
> > descriptors, one that can only be used for reads (rfd), and the other
> > only for writes (wfd).
> > 
> > There's at least one known situation in which we need to obtain wfd
> > instead of rfd, which is when setting up the file that's going to be
> > sent to the peer in vhost's SET_VRING_CALL.
> > 
> > Extend event_notifier_get_fd() to receive an argument which indicates
> > whether the caller wants to obtain rfd (false) or wfd (true).
> 
> There are about 50 places where we add the false arg here and 1 where
> we use true.  Seems it would save a lot of churn to hide this
> internally, event_notifier_get_fd() returns an rfd, a new
> event_notifier_get_wfd() returns the wfd.  Thanks,

I agree. In fact, that's what I implemented in the first place. I
changed it to this version, in which event_notifier_get_fd() is
extended, because it feels more "correct". But yes, the pragmatic
option would be adding a new event_notifier_get_wfd().

I'll wait for more reviews, and unless someone voices against it, I'll
respin the patches with that strategy (I already have it around here).

Thanks,
Sergio.




[PATCH 1/2] Allow returning EventNotifier's wfd

2022-03-02 Thread Sergio Lopez
event_notifier_get_fd(const EventNotifier *e) always returns
EventNotifier's read file descriptor (rfd). This is not a problem when
the EventNotifier is backed by an eventfd, as a single file
descriptor is used both for reading and triggering events (rfd ==
wfd).

But, when EventNotifier is backed by a pipefd, we have two file
descriptors, one that can only be used for reads (rfd), and the other
only for writes (wfd).

There's at least one known situation in which we need to obtain wfd
instead of rfd, which is when setting up the file that's going to be
sent to the peer in vhost's SET_VRING_CALL.

Extend event_notifier_get_fd() to receive an argument which indicates
whether the caller wants to obtain rfd (false) or wfd (true).

Signed-off-by: Sergio Lopez 
---
 accel/kvm/kvm-all.c | 12 +++
 block/linux-aio.c   |  2 +-
 block/nvme.c|  2 +-
 contrib/ivshmem-server/ivshmem-server.c |  5 +--
 hw/hyperv/hyperv.c  |  2 +-
 hw/misc/ivshmem.c   |  2 +-
 hw/remote/iohub.c   | 13 +++
 hw/remote/proxy.c   |  4 +--
 hw/vfio/ccw.c   |  4 +--
 hw/vfio/pci-quirks.c|  6 ++--
 hw/vfio/pci.c   | 48 +
 hw/vfio/platform.c  | 16 -
 hw/virtio/vhost.c   | 10 +++---
 include/qemu/event_notifier.h   |  2 +-
 target/s390x/kvm/kvm.c  |  2 +-
 util/aio-posix.c|  4 +--
 util/event_notifier-posix.c |  5 ++-
 util/vfio-helpers.c |  2 +-
 18 files changed, 75 insertions(+), 66 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 0e66ebb497..c84ee98b17 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1573,7 +1573,7 @@ static void kvm_mem_ioeventfd_add(MemoryListener 
*listener,
   bool match_data, uint64_t data,
   EventNotifier *e)
 {
-int fd = event_notifier_get_fd(e);
+int fd = event_notifier_get_fd(e, false);
 int r;
 
 r = kvm_set_ioeventfd_mmio(fd, section->offset_within_address_space,
@@ -1591,7 +1591,7 @@ static void kvm_mem_ioeventfd_del(MemoryListener 
*listener,
   bool match_data, uint64_t data,
   EventNotifier *e)
 {
-int fd = event_notifier_get_fd(e);
+int fd = event_notifier_get_fd(e, false);
 int r;
 
 r = kvm_set_ioeventfd_mmio(fd, section->offset_within_address_space,
@@ -1609,7 +1609,7 @@ static void kvm_io_ioeventfd_add(MemoryListener *listener,
  bool match_data, uint64_t data,
  EventNotifier *e)
 {
-int fd = event_notifier_get_fd(e);
+int fd = event_notifier_get_fd(e, false);
 int r;
 
 r = kvm_set_ioeventfd_pio(fd, section->offset_within_address_space,
@@ -1628,7 +1628,7 @@ static void kvm_io_ioeventfd_del(MemoryListener *listener,
  EventNotifier *e)
 
 {
-int fd = event_notifier_get_fd(e);
+int fd = event_notifier_get_fd(e, false);
 int r;
 
 r = kvm_set_ioeventfd_pio(fd, section->offset_within_address_space,
@@ -2045,8 +2045,8 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, 
EventNotifier *event,
 EventNotifier *resample, int virq,
 bool assign)
 {
-int fd = event_notifier_get_fd(event);
-int rfd = resample ? event_notifier_get_fd(resample) : -1;
+int fd = event_notifier_get_fd(event, false);
+int rfd = resample ? event_notifier_get_fd(resample, false) : -1;
 
 struct kvm_irqfd irqfd = {
 .fd = fd,
diff --git a/block/linux-aio.c b/block/linux-aio.c
index 4c423fcccf..6068353528 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -390,7 +390,7 @@ static int laio_do_submit(int fd, struct qemu_laiocb 
*laiocb, off_t offset,
 __func__, type);
 return -EIO;
 }
-io_set_eventfd(&laiocb->iocb, event_notifier_get_fd(&s->e));
+io_set_eventfd(&laiocb->iocb, event_notifier_get_fd(&s->e, false));
 
 QSIMPLEQ_INSERT_TAIL(>io_q.pending, laiocb, next);
 s->io_q.in_queue++;
diff --git a/block/nvme.c b/block/nvme.c
index dd20de3865..851c552a4f 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -229,7 +229,7 @@ static NVMeQueuePair *nvme_create_queue_pair(BDRVNVMeState 
*s,
 return NULL;
 }
 trace_nvme_create_queue_pair(idx, q, size, aio_context,
- event_notifier_get_fd(s->irq_notifier));
+ event_notifier_get_fd(s->irq_notifier, false));
 bytes = QEMU_ALIGN_UP(s->page_size * NVME_NUM_REQS,
   qemu_real_host_page_size);
 q->prp_list_pages = qemu_tr

[PATCH 0/2] Enable vhost-user to be used on BSD systems

2022-03-02 Thread Sergio Lopez
Since QEMU is already able to emulate ioeventfd using pipefd, we're quite
close to supporting vhost-user on non-Linux systems.

These two patches bridge the gap by:

1. Extending event_notifier_get_fd() to be able to return wfd when needed.

2. Modifying the build system so it allows enabling vhost-user on BSD.
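The wfd extension in point 1 can be pictured with a minimal, self-contained sketch. This is an illustration of the approach only, not QEMU's actual event_notifier-posix.c; the two-argument event_notifier_get_fd() signature follows the one proposed by this series, and the CONFIG_EVENTFD guard is an assumption standing in for the real build-time detection:

```c
/*
 * Sketch: an EventNotifier backed by either eventfd or a pipe pair.
 * With eventfd one fd is both readable and writable; with a pipe the
 * two ends differ, so the accessor must be told which end is wanted.
 */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <unistd.h>

typedef struct EventNotifier {
    int rfd;   /* end to poll/read from */
    int wfd;   /* end to write to; equals rfd when backed by eventfd */
} EventNotifier;

static int event_notifier_init(EventNotifier *e)
{
#ifdef CONFIG_EVENTFD
    int fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    if (fd >= 0) {
        e->rfd = e->wfd = fd;
        return 0;
    }
#endif
    /* fallback for platforms without eventfd: a plain pipe */
    int fds[2];
    if (pipe(fds) < 0) {
        return -errno;
    }
    e->rfd = fds[0];
    e->wfd = fds[1];
    return 0;
}

static int event_notifier_get_fd(const EventNotifier *e, bool want_wfd)
{
    return want_wfd ? e->wfd : e->rfd;
}
```

Callers that hand the fd to something that will signal the notifier (as vhost-user does when passing kick/call fds to the daemon) would ask for the write end; callers that poll it would ask for the read end.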

Sergio Lopez (2):
  Allow returning EventNotifier's wfd
  Allow building vhost-user in BSD

 accel/kvm/kvm-all.c | 12 +++
 block/linux-aio.c   |  2 +-
 block/nvme.c|  2 +-
 configure   |  5 +--
 contrib/ivshmem-server/ivshmem-server.c |  5 +--
 hw/hyperv/hyperv.c  |  2 +-
 hw/misc/ivshmem.c   |  2 +-
 hw/remote/iohub.c   | 13 +++
 hw/remote/proxy.c   |  4 +--
 hw/vfio/ccw.c   |  4 +--
 hw/vfio/pci-quirks.c|  6 ++--
 hw/vfio/pci.c   | 48 +
 hw/vfio/platform.c  | 16 -
 hw/virtio/vhost.c   | 10 +++---
 include/qemu/event_notifier.h   |  2 +-
 meson.build |  2 +-
 target/s390x/kvm/kvm.c  |  2 +-
 util/aio-posix.c|  4 +--
 util/event_notifier-posix.c |  5 ++-
 util/vfio-helpers.c |  2 +-
 20 files changed, 79 insertions(+), 69 deletions(-)

-- 
2.35.1





[PATCH 2/2] Allow building vhost-user in BSD

2022-03-02 Thread Sergio Lopez
With the possibility of using pipefd as a replacement on operating
systems that don't support eventfd, vhost-user can also work on BSD
systems.

This change allows enabling vhost-user on BSD platforms too and
makes libvhost_user (which still depends on eventfd) a Linux-only
feature.

Signed-off-by: Sergio Lopez 
---
 configure   | 5 +++--
 meson.build | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/configure b/configure
index c56ed53ee3..93aa22e345 100755
--- a/configure
+++ b/configure
@@ -1659,8 +1659,9 @@ fi
 # vhost interdependencies and host support
 
 # vhost backends
-if test "$vhost_user" = "yes" && test "$linux" != "yes"; then
-  error_exit "vhost-user is only available on Linux"
+if test "$vhost_user" = "yes" && \
+test "$linux" != "yes" && test "$bsd" != "yes" ; then
+  error_exit "vhost-user is only available on Linux and BSD"
 fi
 test "$vhost_vdpa" = "" && vhost_vdpa=$linux
 if test "$vhost_vdpa" = "yes" && test "$linux" != "yes"; then
diff --git a/meson.build b/meson.build
index 8df40bfac4..f2bc439c30 100644
--- a/meson.build
+++ b/meson.build
@@ -2701,7 +2701,7 @@ if have_system or have_user
 endif
 
 vhost_user = not_found
-if 'CONFIG_VHOST_USER' in config_host
+if targetos == 'linux' and 'CONFIG_VHOST_USER' in config_host
   libvhost_user = subproject('libvhost-user')
   vhost_user = libvhost_user.get_variable('vhost_user_dep')
 endif
-- 
2.35.1




Re: [PATCH v2 2/2] nbd/server: Use drained block ops to quiesce the server

2021-06-02 Thread Sergio Lopez
On Wed, Jun 02, 2021 at 03:06:02PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 02.06.2021 09:05, Sergio Lopez wrote:
> > Before switching between AioContexts we need to make sure that we're
> > fully quiesced ("nb_requests == 0" for every client) when entering the
> > drained section.
> > 
> > To do this, we set "quiescing = true" for every client on
> > ".drained_begin" to prevent new coroutines from being created, and
> > check if "nb_requests == 0" on ".drained_poll". Finally, once we're
> > exiting the drained section, on ".drained_end" we set "quiescing =
> > false" and call "nbd_client_receive_next_request()" to resume the
> > processing of new requests.
> > 
> > With these changes, "blk_aio_attach()" and "blk_aio_detach()" can be
> > reverted to be as simple as they were before f148ae7d36.
> > 
> > RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1960137
> > Suggested-by: Kevin Wolf 
> > Signed-off-by: Sergio Lopez 
> > ---
> >   nbd/server.c | 82 ++--
> >   1 file changed, 61 insertions(+), 21 deletions(-)
> > 
> > diff --git a/nbd/server.c b/nbd/server.c
> > index 86a44a9b41..b60ebc3ab6 100644
> > --- a/nbd/server.c
> > +++ b/nbd/server.c
> > @@ -1513,6 +1513,11 @@ static void nbd_request_put(NBDRequestData *req)
> >   g_free(req);
> >   client->nb_requests--;
> > +
> > +if (client->quiescing && client->nb_requests == 0) {
> > +aio_wait_kick();
> > +}
> > +
> >   nbd_client_receive_next_request(client);
> >   nbd_client_put(client);
> > @@ -1530,49 +1535,68 @@ static void blk_aio_attached(AioContext *ctx, void 
> > *opaque)
> > >   QTAILQ_FOREACH(client, &exp->clients, next) {
> >   qio_channel_attach_aio_context(client->ioc, ctx);
> > +assert(client->nb_requests == 0);
> >   assert(client->recv_coroutine == NULL);
> >   assert(client->send_coroutine == NULL);
> > -
> > -if (client->quiescing) {
> > -client->quiescing = false;
> > -nbd_client_receive_next_request(client);
> > -}
> >   }
> >   }
> > -static void nbd_aio_detach_bh(void *opaque)
> > +static void blk_aio_detach(void *opaque)
> >   {
> >   NBDExport *exp = opaque;
> >   NBDClient *client;
> > +trace_nbd_blk_aio_detach(exp->name, exp->common.ctx);
> > +
> > >   QTAILQ_FOREACH(client, &exp->clients, next) {
> >   qio_channel_detach_aio_context(client->ioc);
> > +}
> > +
> > +exp->common.ctx = NULL;
> > +}
> > +
> > +static void nbd_drained_begin(void *opaque)
> > +{
> > +NBDExport *exp = opaque;
> > +NBDClient *client;
> > +
> > > +QTAILQ_FOREACH(client, &exp->clients, next) {
> >   client->quiescing = true;
> > +}
> > +}
> > -if (client->recv_coroutine) {
> > -if (client->read_yielding) {
> > -qemu_aio_coroutine_enter(exp->common.ctx,
> > - client->recv_coroutine);
> > -} else {
> > -AIO_WAIT_WHILE(exp->common.ctx, client->recv_coroutine != 
> > NULL);
> > -}
> > -}
> > +static void nbd_drained_end(void *opaque)
> > +{
> > +NBDExport *exp = opaque;
> > +NBDClient *client;
> > -if (client->send_coroutine) {
> > -AIO_WAIT_WHILE(exp->common.ctx, client->send_coroutine != 
> > NULL);
> > -}
> > > +QTAILQ_FOREACH(client, &exp->clients, next) {
> > +client->quiescing = false;
> > +nbd_client_receive_next_request(client);
> >   }
> >   }
> > -static void blk_aio_detach(void *opaque)
> > +static bool nbd_drained_poll(void *opaque)
> >   {
> >   NBDExport *exp = opaque;
> > +NBDClient *client;
> > -trace_nbd_blk_aio_detach(exp->name, exp->common.ctx);
> > > +QTAILQ_FOREACH(client, &exp->clients, next) {
> > +if (client->nb_requests != 0) {
> > +/*
> > + * If there's a coroutine waiting for a request on 
> > nbd_read_eof()
> > + * enter it here so we don't depend on the client to wake it 
> > up.
> > + */
> > +if (client->recv_coroutine != NULL && 

[PATCH v2 2/2] nbd/server: Use drained block ops to quiesce the server

2021-06-02 Thread Sergio Lopez
Before switching between AioContexts we need to make sure that we're
fully quiesced ("nb_requests == 0" for every client) when entering the
drained section.

To do this, we set "quiescing = true" for every client on
".drained_begin" to prevent new coroutines from being created, and
check if "nb_requests == 0" on ".drained_poll". Finally, once we're
exiting the drained section, on ".drained_end" we set "quiescing =
false" and call "nbd_client_receive_next_request()" to resume the
processing of new requests.

With these changes, "blk_aio_attach()" and "blk_aio_detach()" can be
reverted to be as simple as they were before f148ae7d36.

RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1960137
Suggested-by: Kevin Wolf 
Signed-off-by: Sergio Lopez 
---
 nbd/server.c | 82 ++--
 1 file changed, 61 insertions(+), 21 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 86a44a9b41..b60ebc3ab6 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1513,6 +1513,11 @@ static void nbd_request_put(NBDRequestData *req)
 g_free(req);
 
 client->nb_requests--;
+
+if (client->quiescing && client->nb_requests == 0) {
+aio_wait_kick();
+}
+
 nbd_client_receive_next_request(client);
 
 nbd_client_put(client);
@@ -1530,49 +1535,68 @@ static void blk_aio_attached(AioContext *ctx, void 
*opaque)
QTAILQ_FOREACH(client, &exp->clients, next) {
 qio_channel_attach_aio_context(client->ioc, ctx);
 
+assert(client->nb_requests == 0);
 assert(client->recv_coroutine == NULL);
 assert(client->send_coroutine == NULL);
-
-if (client->quiescing) {
-client->quiescing = false;
-nbd_client_receive_next_request(client);
-}
 }
 }
 
-static void nbd_aio_detach_bh(void *opaque)
+static void blk_aio_detach(void *opaque)
 {
 NBDExport *exp = opaque;
 NBDClient *client;
 
+trace_nbd_blk_aio_detach(exp->name, exp->common.ctx);
+
QTAILQ_FOREACH(client, &exp->clients, next) {
 qio_channel_detach_aio_context(client->ioc);
+}
+
+exp->common.ctx = NULL;
+}
+
+static void nbd_drained_begin(void *opaque)
+{
+NBDExport *exp = opaque;
+NBDClient *client;
+
+QTAILQ_FOREACH(client, &exp->clients, next) {
 client->quiescing = true;
+}
+}
 
-if (client->recv_coroutine) {
-if (client->read_yielding) {
-qemu_aio_coroutine_enter(exp->common.ctx,
- client->recv_coroutine);
-} else {
-AIO_WAIT_WHILE(exp->common.ctx, client->recv_coroutine != 
NULL);
-}
-}
+static void nbd_drained_end(void *opaque)
+{
+NBDExport *exp = opaque;
+NBDClient *client;
 
-if (client->send_coroutine) {
-AIO_WAIT_WHILE(exp->common.ctx, client->send_coroutine != NULL);
-}
+QTAILQ_FOREACH(client, &exp->clients, next) {
+client->quiescing = false;
+nbd_client_receive_next_request(client);
 }
 }
 
-static void blk_aio_detach(void *opaque)
+static bool nbd_drained_poll(void *opaque)
 {
 NBDExport *exp = opaque;
+NBDClient *client;
 
-trace_nbd_blk_aio_detach(exp->name, exp->common.ctx);
+QTAILQ_FOREACH(client, &exp->clients, next) {
+if (client->nb_requests != 0) {
+/*
+ * If there's a coroutine waiting for a request on nbd_read_eof()
+ * enter it here so we don't depend on the client to wake it up.
+ */
+if (client->recv_coroutine != NULL && client->read_yielding) {
+qemu_aio_coroutine_enter(exp->common.ctx,
+ client->recv_coroutine);
+}
 
-aio_wait_bh_oneshot(exp->common.ctx, nbd_aio_detach_bh, exp);
+return true;
+}
+}
 
-exp->common.ctx = NULL;
+return false;
 }
 
 static void nbd_eject_notifier(Notifier *n, void *data)
@@ -1594,6 +1618,12 @@ void nbd_export_set_on_eject_blk(BlockExport *exp, 
BlockBackend *blk)
blk_add_remove_bs_notifier(blk, &nbd_exp->eject_notifier);
 }
 
+static const BlockDevOps nbd_block_ops = {
+.drained_begin = nbd_drained_begin,
+.drained_end = nbd_drained_end,
+.drained_poll = nbd_drained_poll,
+};
+
 static int nbd_export_create(BlockExport *blk_exp, BlockExportOptions 
*exp_args,
  Error **errp)
 {
@@ -1715,8 +1745,17 @@ static int nbd_export_create(BlockExport *blk_exp, 
BlockExportOptions *exp_args,
 
 exp->allocation_depth = arg->allocation_depth;
 
+/*
+ * We need to inhibit request queuing in the block layer to ensure we can
+ * be properly quiesced when entering a drained section, as our coroutines
+  

[PATCH v2 1/2] block-backend: add drained_poll

2021-06-02 Thread Sergio Lopez
Allow block backends to poll their devices/users to check if they have
been quiesced when entering a drained section.

This will be used in the next patch to wait for the NBD server to be
completely quiesced.

Suggested-by: Kevin Wolf 
Reviewed-by: Kevin Wolf 
Reviewed-by: Eric Blake 
Signed-off-by: Sergio Lopez 
---
 block/block-backend.c  | 7 ++-
 include/sysemu/block-backend.h | 4 
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index de5496af66..8fcc2b4b3d 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2393,8 +2393,13 @@ static void blk_root_drained_begin(BdrvChild *child)
 static bool blk_root_drained_poll(BdrvChild *child)
 {
 BlockBackend *blk = child->opaque;
+bool busy = false;
 assert(blk->quiesce_counter);
-return !!blk->in_flight;
+
+if (blk->dev_ops && blk->dev_ops->drained_poll) {
+busy = blk->dev_ops->drained_poll(blk->dev_opaque);
+}
+return busy || !!blk->in_flight;
 }
 
 static void blk_root_drained_end(BdrvChild *child, int *drained_end_counter)
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 880e903293..5423e3d9c6 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -66,6 +66,10 @@ typedef struct BlockDevOps {
  * Runs when the backend's last drain request ends.
  */
 void (*drained_end)(void *opaque);
+/*
+ * Is the device still busy?
+ */
+bool (*drained_poll)(void *opaque);
 } BlockDevOps;
 
 /* This struct is embedded in (the private) BlockBackend struct and contains
-- 
2.26.2




[PATCH v2 0/2] nbd/server: Quiesce server on drained section

2021-06-02 Thread Sergio Lopez
Before switching between AioContexts we need to make sure that we're
fully quiesced ("nb_requests == 0" for every client) when entering the
drained section. Otherwise, coroutines may be run in the wrong context
after the switch, leading to a number of critical issues.

To accomplish this, we add ".drained_poll" to BlockDevOps and use it
in the NBD server, along with ".drained_begin" and ".drained_end", to
coordinate the quiescing of the server while entering a drained
section.
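The handshake between the three callbacks can be modelled with a small toy (assumptions only: ToyServer and the synthetic "one request completes per poll" step are illustrations, not the NBD server; in QEMU the drain loop would run aio_poll() between checks instead of spinning):

```c
/*
 * Toy model of the drain protocol: drained_begin stops new work,
 * drained_poll answers "still busy?", and drained_end resumes the
 * device once the AioContext switch is done.
 */
#include <assert.h>
#include <stdbool.h>

typedef struct BlockDevOps {
    void (*drained_begin)(void *opaque);
    bool (*drained_poll)(void *opaque);  /* true while still busy */
    void (*drained_end)(void *opaque);
} BlockDevOps;

typedef struct ToyServer {
    int nb_requests;
    bool quiescing;
} ToyServer;

static void toy_drained_begin(void *opaque)
{
    ToyServer *s = opaque;
    s->quiescing = true;          /* stop spawning request coroutines */
}

static bool toy_drained_poll(void *opaque)
{
    ToyServer *s = opaque;
    if (s->nb_requests > 0) {
        s->nb_requests--;         /* pretend one in-flight request completes */
    }
    return s->nb_requests > 0;
}

static void toy_drained_end(void *opaque)
{
    ToyServer *s = opaque;
    s->quiescing = false;         /* resume processing new requests */
}

static const BlockDevOps toy_ops = {
    .drained_begin = toy_drained_begin,
    .drained_poll  = toy_drained_poll,
    .drained_end   = toy_drained_end,
};

static void toy_drain_section(const BlockDevOps *ops, void *opaque)
{
    ops->drained_begin(opaque);
    while (ops->drained_poll(opaque)) {
        /* QEMU would run aio_poll(ctx, true) here */
    }
    /* ...switch AioContexts while fully quiesced... */
    ops->drained_end(opaque);
}
```

The point of the poll callback is exactly what the cover letter states: the drain only completes once the device itself, not just the block backend, reports that nothing is in flight.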

v2:
 - Use a bool for the value returned by .drained_poll [Kevin]
 - Change .drained_poll comment to reflect that the returned boolean
   value will be true if the device is still busy, or false otherwise
 - Drop yield_co_list and use recv_coroutine and read_yielding [Kevin]
 - Return "true" or "false" in nbd_drained_poll [Kevin]
 - Fix grammar in the commit message of patch 2 [Eric]

Sergio Lopez (2):
  block-backend: add drained_poll
  nbd/server: Use drained block ops to quiesce the server

 block/block-backend.c  |  7 ++-
 include/sysemu/block-backend.h |  4 ++
 nbd/server.c   | 82 +-
 3 files changed, 71 insertions(+), 22 deletions(-)

-- 
2.26.2





Re: [PATCH 2/2] nbd/server: Use drained block ops to quiesce the server

2021-06-01 Thread Sergio Lopez
On Tue, Jun 01, 2021 at 04:29:07PM -0500, Eric Blake wrote:
> On Tue, Jun 01, 2021 at 07:57:28AM +0200, Sergio Lopez wrote:
> > Before switching between AioContexts we need to make sure that we're
> > fully quiesced ("nb_requests == 0" for every client) when entering the
> > drained section.
> > 
> > To do this, we set "quiescing = true" for every client on
> > ".drained_begin" to prevent new coroutines to be created, and check if
> 
> s/to be created/from being created/
> 
> > "nb_requests == 0" on ".drained_poll". Finally, once we're exiting the
> > drained section, on ".drained_end" we set "quiescing = false" and
> > call "nbd_client_receive_next_request()" to resume the processing of
> > new requests.
> > 
> > With these changes, "blk_aio_attach()" and "blk_aio_detach()" can be
> > reverted to be as simple as they were before f148ae7d36.
> 
> Is that reversion planned to be patch 3 of your series in v2?

Actually, we need part of the changes introduced in f148ae7d36, so
it's probably simpler to manually revert "blk_aio_attach()" and
"blk_aio_detach()" here than doing an actual reversion and then
reintroducing the changes.

Thanks,
Sergio.


signature.asc
Description: PGP signature


Re: [PATCH 1/2] block-backend: add drained_poll

2021-06-01 Thread Sergio Lopez
On Tue, Jun 01, 2021 at 05:59:10PM +0200, Kevin Wolf wrote:
> Am 01.06.2021 um 07:57 hat Sergio Lopez geschrieben:
> > Allow block backends to poll their devices/users to check if they have
> > been quiesced when entering a drained section.
> > 
> > This will be used in the next patch to wait for the NBD server to be
> > completely quiesced.
> > 
> > Suggested-by: Kevin Wolf 
> > Signed-off-by: Sergio Lopez 
> > ---
> >  block/block-backend.c  | 7 ++-
> >  include/sysemu/block-backend.h | 4 
> >  2 files changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/block-backend.c b/block/block-backend.c
> > index de5496af66..163ca05b97 100644
> > --- a/block/block-backend.c
> > +++ b/block/block-backend.c
> > @@ -2393,8 +2393,13 @@ static void blk_root_drained_begin(BdrvChild *child)
> >  static bool blk_root_drained_poll(BdrvChild *child)
> >  {
> >  BlockBackend *blk = child->opaque;
> > +int ret = 0;
> 
> It's really a bool.

I'll fix this in v2.

Thanks,
Sergio.

> >  assert(blk->quiesce_counter);
> > -return !!blk->in_flight;
> > +
> > +if (blk->dev_ops && blk->dev_ops->drained_poll) {
> > +ret = blk->dev_ops->drained_poll(blk->dev_opaque);
> > +}
> > +return ret || !!blk->in_flight;
> >  }
> 
> Doesn't make a difference for correctness, of course, so whether you
> change it or not:
> 
> Reviewed-by: Kevin Wolf 
> 




Re: [PATCH 2/2] nbd/server: Use drained block ops to quiesce the server

2021-06-01 Thread Sergio Lopez
On Tue, Jun 01, 2021 at 06:08:41PM +0200, Kevin Wolf wrote:
> Am 01.06.2021 um 07:57 hat Sergio Lopez geschrieben:
> > Before switching between AioContexts we need to make sure that we're
> > fully quiesced ("nb_requests == 0" for every client) when entering the
> > drained section.
> > 
> > To do this, we set "quiescing = true" for every client on
> > ".drained_begin" to prevent new coroutines to be created, and check if
> > "nb_requests == 0" on ".drained_poll". Finally, once we're exiting the
> > drained section, on ".drained_end" we set "quiescing = false" and
> > call "nbd_client_receive_next_request()" to resume the processing of
> > new requests.
> > 
> > With these changes, "blk_aio_attach()" and "blk_aio_detach()" can be
> > reverted to be as simple as they were before f148ae7d36.
> > 
> > RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1960137
> > Suggested-by: Kevin Wolf 
> > Signed-off-by: Sergio Lopez 
> > ---
> >  nbd/server.c | 99 +++-
> >  1 file changed, 75 insertions(+), 24 deletions(-)
> > 
> > diff --git a/nbd/server.c b/nbd/server.c
> > index 86a44a9b41..33e55479d7 100644
> > --- a/nbd/server.c
> > +++ b/nbd/server.c
> > @@ -132,7 +132,7 @@ struct NBDClient {
> >  CoMutex send_lock;
> >  Coroutine *send_coroutine;
> >  
> > -bool read_yielding;
> > +GSList *yield_co_list; /* List of coroutines yielding on nbd_read_eof 
> > */
> >  bool quiescing;
> 
> Hm, how do you get more than one coroutine per client yielding in
> nbd_read_eof() at the same time? I thought the model is that you always
> have one coroutine reading the next request (which is
> client->recv_coroutine) and all the others are just processing the
> request they had read earlier. Multiple coroutines reading from the
> same socket would sound like a bad idea.

You're right, there's only a single coroutine yielding on
nbd_read_eof(). I added the list at a moment when I was trying to
keep track of every coroutine, and I kept it without thinking about
whether it was really needed.

I'll drop it, entering just client->recv_coroutine if it isn't NULL.

> >  QTAILQ_ENTRY(NBDClient) next;
> > @@ -1367,6 +1367,7 @@ static inline int coroutine_fn
> >  nbd_read_eof(NBDClient *client, void *buffer, size_t size, Error **errp)
> >  {
> >  bool partial = false;
> > +Coroutine *co;
> >  
> >  assert(size);
> >  while (size > 0) {
> > @@ -1375,9 +1376,12 @@ nbd_read_eof(NBDClient *client, void *buffer, size_t 
> > size, Error **errp)
> >  
> >  len = qio_channel_readv(client->ioc, , 1, errp);
> >  if (len == QIO_CHANNEL_ERR_BLOCK) {
> > -client->read_yielding = true;
> > +co = qemu_coroutine_self();
> > +
> > +client->yield_co_list = g_slist_prepend(client->yield_co_list, 
> > co);
> >  qio_channel_yield(client->ioc, G_IO_IN);
> > -client->read_yielding = false;
> > +client->yield_co_list = g_slist_remove(client->yield_co_list, 
> > co);
> > +
> >  if (client->quiescing) {
> >  return -EAGAIN;
> >  }
> > @@ -1513,6 +1517,11 @@ static void nbd_request_put(NBDRequestData *req)
> >  g_free(req);
> >  
> >  client->nb_requests--;
> > +
> > +if (client->quiescing && client->nb_requests == 0) {
> > +aio_wait_kick();
> > +}
> > +
> >  nbd_client_receive_next_request(client);
> >  
> >  nbd_client_put(client);
> > @@ -1530,49 +1539,75 @@ static void blk_aio_attached(AioContext *ctx, void 
> > *opaque)
> > >  QTAILQ_FOREACH(client, &exp->clients, next) {
> >  qio_channel_attach_aio_context(client->ioc, ctx);
> >  
> > +assert(client->nb_requests == 0);
> >  assert(client->recv_coroutine == NULL);
> >  assert(client->send_coroutine == NULL);
> > -
> > -if (client->quiescing) {
> > -client->quiescing = false;
> > -nbd_client_receive_next_request(client);
> > -}
> >  }
> >  }
> >  
> > -static void nbd_aio_detach_bh(void *opaque)
> > +static void blk_aio_detach(void *opaque)
> >  {
> >  NBDExport *exp = opaque;
> >  NBDClient *client;
> >  
> > +trace_nbd_blk_aio_detach

[PATCH 1/2] block-backend: add drained_poll

2021-05-31 Thread Sergio Lopez
Allow block backends to poll their devices/users to check if they have
been quiesced when entering a drained section.

This will be used in the next patch to wait for the NBD server to be
completely quiesced.

Suggested-by: Kevin Wolf 
Signed-off-by: Sergio Lopez 
---
 block/block-backend.c  | 7 ++-
 include/sysemu/block-backend.h | 4 
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index de5496af66..163ca05b97 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2393,8 +2393,13 @@ static void blk_root_drained_begin(BdrvChild *child)
 static bool blk_root_drained_poll(BdrvChild *child)
 {
 BlockBackend *blk = child->opaque;
+int ret = 0;
 assert(blk->quiesce_counter);
-return !!blk->in_flight;
+
+if (blk->dev_ops && blk->dev_ops->drained_poll) {
+ret = blk->dev_ops->drained_poll(blk->dev_opaque);
+}
+return ret || !!blk->in_flight;
 }
 
 static void blk_root_drained_end(BdrvChild *child, int *drained_end_counter)
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 880e903293..9992072e18 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -66,6 +66,10 @@ typedef struct BlockDevOps {
  * Runs when the backend's last drain request ends.
  */
 void (*drained_end)(void *opaque);
+/*
+ * Is the device drained?
+ */
+bool (*drained_poll)(void *opaque);
 } BlockDevOps;
 
 /* This struct is embedded in (the private) BlockBackend struct and contains
-- 
2.26.2




[PATCH 0/2] nbd/server: Quiesce server on drained section

2021-05-31 Thread Sergio Lopez
Before switching between AioContexts we need to make sure that we're
fully quiesced ("nb_requests == 0" for every client) when entering the
drained section. Otherwise, coroutines may be run in the wrong context
after the switch, leading to a number of critical issues.

To accomplish this, we add ".drained_poll" to BlockDevOps and use it
in the NBD server, along with ".drained_begin" and ".drained_end", to
coordinate the quiescing of the server while entering a drained
section.

Sergio Lopez (2):
  block-backend: add drained_poll
  nbd/server: Use drained block ops to quiesce the server

 block/block-backend.c  |  7 ++-
 include/sysemu/block-backend.h |  4 ++
 nbd/server.c   | 99 +-
 3 files changed, 85 insertions(+), 25 deletions(-)

-- 
2.26.2





[PATCH 2/2] nbd/server: Use drained block ops to quiesce the server

2021-05-31 Thread Sergio Lopez
Before switching between AioContexts we need to make sure that we're
fully quiesced ("nb_requests == 0" for every client) when entering the
drained section.

To do this, we set "quiescing = true" for every client on
".drained_begin" to prevent new coroutines to be created, and check if
"nb_requests == 0" on ".drained_poll". Finally, once we're exiting the
drained section, on ".drained_end" we set "quiescing = false" and
call "nbd_client_receive_next_request()" to resume the processing of
new requests.

With these changes, "blk_aio_attach()" and "blk_aio_detach()" can be
reverted to be as simple as they were before f148ae7d36.

RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1960137
Suggested-by: Kevin Wolf 
Signed-off-by: Sergio Lopez 
---
 nbd/server.c | 99 +++-
 1 file changed, 75 insertions(+), 24 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 86a44a9b41..33e55479d7 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -132,7 +132,7 @@ struct NBDClient {
 CoMutex send_lock;
 Coroutine *send_coroutine;
 
-bool read_yielding;
+GSList *yield_co_list; /* List of coroutines yielding on nbd_read_eof */
 bool quiescing;
 
 QTAILQ_ENTRY(NBDClient) next;
@@ -1367,6 +1367,7 @@ static inline int coroutine_fn
 nbd_read_eof(NBDClient *client, void *buffer, size_t size, Error **errp)
 {
 bool partial = false;
+Coroutine *co;
 
 assert(size);
 while (size > 0) {
@@ -1375,9 +1376,12 @@ nbd_read_eof(NBDClient *client, void *buffer, size_t 
size, Error **errp)
 
 len = qio_channel_readv(client->ioc, , 1, errp);
 if (len == QIO_CHANNEL_ERR_BLOCK) {
-client->read_yielding = true;
+co = qemu_coroutine_self();
+
+client->yield_co_list = g_slist_prepend(client->yield_co_list, co);
 qio_channel_yield(client->ioc, G_IO_IN);
-client->read_yielding = false;
+client->yield_co_list = g_slist_remove(client->yield_co_list, co);
+
 if (client->quiescing) {
 return -EAGAIN;
 }
@@ -1513,6 +1517,11 @@ static void nbd_request_put(NBDRequestData *req)
 g_free(req);
 
 client->nb_requests--;
+
+if (client->quiescing && client->nb_requests == 0) {
+aio_wait_kick();
+}
+
 nbd_client_receive_next_request(client);
 
 nbd_client_put(client);
@@ -1530,49 +1539,75 @@ static void blk_aio_attached(AioContext *ctx, void 
*opaque)
QTAILQ_FOREACH(client, &exp->clients, next) {
 qio_channel_attach_aio_context(client->ioc, ctx);
 
+assert(client->nb_requests == 0);
 assert(client->recv_coroutine == NULL);
 assert(client->send_coroutine == NULL);
-
-if (client->quiescing) {
-client->quiescing = false;
-nbd_client_receive_next_request(client);
-}
 }
 }
 
-static void nbd_aio_detach_bh(void *opaque)
+static void blk_aio_detach(void *opaque)
 {
 NBDExport *exp = opaque;
 NBDClient *client;
 
+trace_nbd_blk_aio_detach(exp->name, exp->common.ctx);
+
QTAILQ_FOREACH(client, &exp->clients, next) {
 qio_channel_detach_aio_context(client->ioc);
+}
+
+exp->common.ctx = NULL;
+}
+
+static void nbd_drained_begin(void *opaque)
+{
+NBDExport *exp = opaque;
+NBDClient *client;
+
+QTAILQ_FOREACH(client, &exp->clients, next) {
 client->quiescing = true;
+}
+}
 
-if (client->recv_coroutine) {
-if (client->read_yielding) {
-qemu_aio_coroutine_enter(exp->common.ctx,
- client->recv_coroutine);
-} else {
-AIO_WAIT_WHILE(exp->common.ctx, client->recv_coroutine != 
NULL);
-}
-}
+static void nbd_drained_end(void *opaque)
+{
+NBDExport *exp = opaque;
+NBDClient *client;
 
-if (client->send_coroutine) {
-AIO_WAIT_WHILE(exp->common.ctx, client->send_coroutine != NULL);
-}
+QTAILQ_FOREACH(client, &exp->clients, next) {
+client->quiescing = false;
+nbd_client_receive_next_request(client);
 }
 }
 
-static void blk_aio_detach(void *opaque)
+static bool nbd_drained_poll(void *opaque)
 {
 NBDExport *exp = opaque;
+NBDClient *client;
+Coroutine *co;
+GSList *entry;
+GSList *coroutine_list;
 
-trace_nbd_blk_aio_detach(exp->name, exp->common.ctx);
+QTAILQ_FOREACH(client, &exp->clients, next) {
+if (client->nb_requests != 0) {
+/*
+ * Enter coroutines waiting for new requests on nbd_read_eof(), so
+ * we don't depend on the client to wake us up.
+ */
+coroutine_list = g_slist_copy(client->

[PATCH v4 2/2] block: move blk_exp_close_all() to qemu_cleanup()

2021-02-01 Thread Sergio Lopez
Move blk_exp_close_all() from bdrv_close() to qemu_cleanup(), before
bdrv_drain_all_begin().

Export drivers may have coroutines yielding at some point in the block
layer, so we need to shut them down before draining the block layer,
as otherwise they may get stuck in blk_wait_while_drained().

RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1900505
Signed-off-by: Sergio Lopez 
---
 block.c  | 1 -
 qemu-nbd.c   | 1 +
 softmmu/runstate.c   | 9 +
 storage-daemon/qemu-storage-daemon.c | 1 +
 4 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 3da99312db..9682c82fa8 100644
--- a/block.c
+++ b/block.c
@@ -4435,7 +4435,6 @@ static void bdrv_close(BlockDriverState *bs)
 void bdrv_close_all(void)
 {
 assert(job_next(NULL) == NULL);
-blk_exp_close_all();
 
 /* Drop references from requests still in flight, such as canceled block
  * jobs whose AIO context has not been polled yet */
diff --git a/qemu-nbd.c b/qemu-nbd.c
index 0d513cb38c..608c63e82a 100644
--- a/qemu-nbd.c
+++ b/qemu-nbd.c
@@ -503,6 +503,7 @@ static const char *socket_activation_validate_opts(const 
char *device,
 static void qemu_nbd_shutdown(void)
 {
 job_cancel_sync_all();
+blk_exp_close_all();
 bdrv_close_all();
 }
 
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index 6177693a30..ac4b2e2540 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -25,6 +25,7 @@
 #include "qemu/osdep.h"
 #include "audio/audio.h"
 #include "block/block.h"
+#include "block/export.h"
 #include "chardev/char.h"
 #include "crypto/cipher.h"
 #include "crypto/init.h"
@@ -783,6 +784,14 @@ void qemu_cleanup(void)
  */
 migration_shutdown();
 
+/*
+ * Close the exports before draining the block layer. The export
+ * drivers may have coroutines yielding on it, so we need to clean
 * them up before the drain, as otherwise they may get stuck in
+ * blk_wait_while_drained().
+ */
+blk_exp_close_all();
+
 /*
  * We must cancel all block jobs while the block layer is drained,
  * or cancelling will be affected by throttling and thus may block
diff --git a/storage-daemon/qemu-storage-daemon.c 
b/storage-daemon/qemu-storage-daemon.c
index e0c87edbdd..d8d172cc60 100644
--- a/storage-daemon/qemu-storage-daemon.c
+++ b/storage-daemon/qemu-storage-daemon.c
@@ -314,6 +314,7 @@ int main(int argc, char *argv[])
 main_loop_wait(false);
 }
 
+blk_exp_close_all();
 bdrv_drain_all_begin();
 bdrv_close_all();
 
-- 
2.26.2




[PATCH v4 1/2] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()

2021-02-01 Thread Sergio Lopez
Some graphs may contain an indirect reference to the first BDS in the
chain that can be reached while walking it bottom->up from one of its
children.

Double-processing of a BDS is especially problematic for the
aio_notifiers, as they might attempt to work on both the old and the
new AIO contexts.

To avoid this problem, add every child and parent to the ignore list
before actually processing them.
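The strategy can be shown on a hypothetical graph (this toy adjacency matrix is an illustration, not QEMU's BDS graph code): marking every neighbour as visited before recursing into any of them guarantees that a node reachable through two paths is processed exactly once.

```c
/*
 * Mark-first traversal: phase 1 adds all unvisited neighbours to the
 * ignore set, phase 2 recurses. A diamond-shaped graph would otherwise
 * process the shared node once per incoming path.
 */
#include <assert.h>

enum { MAX_NODES = 8 };
static int edge[MAX_NODES][MAX_NODES];  /* adjacency matrix */
static int ignored[MAX_NODES];          /* the "ignore list" */
static int processed[MAX_NODES];

static void visit(int n)
{
    int to_process[MAX_NODES];
    int count = 0;

    /* phase 1: mark every not-yet-ignored neighbour */
    for (int m = 0; m < MAX_NODES; m++) {
        if (edge[n][m] && !ignored[m]) {
            ignored[m] = 1;
            to_process[count++] = m;
        }
    }
    /* phase 2: only now recurse into them */
    for (int i = 0; i < count; i++) {
        visit(to_process[i]);
    }
    processed[n]++;
}
```

Deferring the recursion is what matters: by the time any branch of the walk reaches the shared node through a second path, it is already on the ignore list and gets skipped.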

Suggested-by: Kevin Wolf 
Signed-off-by: Sergio Lopez 
---
 block.c | 34 +++---
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/block.c b/block.c
index 8b9d457546..3da99312db 100644
--- a/block.c
+++ b/block.c
@@ -6414,7 +6414,10 @@ void bdrv_set_aio_context_ignore(BlockDriverState *bs,
  AioContext *new_context, GSList **ignore)
 {
 AioContext *old_context = bdrv_get_aio_context(bs);
-BdrvChild *child;
+GSList *children_to_process = NULL;
+GSList *parents_to_process = NULL;
+GSList *entry;
+BdrvChild *child, *parent;
 
 g_assert(qemu_get_current_aio_context() == qemu_get_aio_context());
 
@@ -6429,16 +6432,33 @@ void bdrv_set_aio_context_ignore(BlockDriverState *bs,
 continue;
 }
 *ignore = g_slist_prepend(*ignore, child);
-bdrv_set_aio_context_ignore(child->bs, new_context, ignore);
+children_to_process = g_slist_prepend(children_to_process, child);
 }
-QLIST_FOREACH(child, >parents, next_parent) {
-if (g_slist_find(*ignore, child)) {
+
+QLIST_FOREACH(parent, >parents, next_parent) {
+if (g_slist_find(*ignore, parent)) {
 continue;
 }
-assert(child->klass->set_aio_ctx);
-*ignore = g_slist_prepend(*ignore, child);
-child->klass->set_aio_ctx(child, new_context, ignore);
+*ignore = g_slist_prepend(*ignore, parent);
+parents_to_process = g_slist_prepend(parents_to_process, parent);
+}
+
+for (entry = children_to_process;
+ entry != NULL;
+ entry = g_slist_next(entry)) {
+child = entry->data;
+bdrv_set_aio_context_ignore(child->bs, new_context, ignore);
+}
+g_slist_free(children_to_process);
+
+for (entry = parents_to_process;
+ entry != NULL;
+ entry = g_slist_next(entry)) {
+parent = entry->data;
+assert(parent->klass->set_aio_ctx);
+parent->klass->set_aio_ctx(parent, new_context, ignore);
 }
+g_slist_free(parents_to_process);
 
 bdrv_detach_aio_context(bs);
 
-- 
2.26.2




[PATCH v4 0/2] nbd/server: Quiesce coroutines on context switch

2021-02-01 Thread Sergio Lopez
This series allows the NBD server to properly switch between AIO contexts,
having quiesced recv_coroutine and send_coroutine before doing the transition.

We need this because we send back devices running in IO Thread owned contexts
to the main context when stopping the data plane, something that can happen
multiple times during the lifetime of a VM (usually during the boot sequence or
on a reboot), and we drag the NBD server of the corresponding export with it.

While there, fix also a problem caused by a cross-dependency between
closing the export's client connections and draining the block
layer. The visible effect of this problem was QEMU getting hung when
the guest requests a power off while there's an active NBD client.

v4:
 - Call to blk_exp_close_all() from qemu-nbd and qemu-storage-daemon
 too. (Kevin Wolf)

v3:
 - Drop already merged "block: Honor blk_set_aio_context() context
 requirements" and "nbd/server: Quiesce coroutines on context switch"
 - Change the strategy for avoiding processing BDS twice to adding
 every child and parent to the ignore list in advance before
 processing them. (Kevin Wolf)
 - Replace "nbd/server: Quiesce coroutines on context switch" with
 "block: move blk_exp_close_all() to qemu_cleanup()"

v2:
 - Replace "virtio-blk: Acquire context while switching them on
 dataplane start" with "block: Honor blk_set_aio_context() context
 requirements" (Kevin Wolf)
 - Add "block: Avoid processing BDS twice in
 bdrv_set_aio_context_ignore()"
 - Add "block: Close block exports in two steps"
 - Rename nbd_read_eof() to nbd_server_read_eof() (Eric Blake)
 - Fix double space and typo in comment. (Eric Blake)

Sergio Lopez (2):
  block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
  block: move blk_exp_close_all() to qemu_cleanup()

 block.c  | 35 +---
 qemu-nbd.c   |  1 +
 softmmu/runstate.c   |  9 +++
 storage-daemon/qemu-storage-daemon.c |  1 +
 4 files changed, 38 insertions(+), 8 deletions(-)

-- 
2.26.2





Re: [PATCH v3 2/2] block: move blk_exp_close_all() to qemu_cleanup()

2021-02-01 Thread Sergio Lopez
On Mon, Feb 01, 2021 at 01:20:30PM +0100, Kevin Wolf wrote:
> Am 21.01.2021 um 18:07 hat Sergio Lopez geschrieben:
> > Move blk_exp_close_all() from bdrv_close() to qemu_cleanup(), before
> > bdrv_drain_all_begin().
> > 
> > Export drivers may have coroutines yielding at some point in the block
> > layer, so we need to shut them down before draining the block layer,
> > as otherwise they may get stuck in blk_wait_while_drained().
> > 
> > RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1900505
> > Signed-off-by: Sergio Lopez 
> 
> This patch loses the call in qemu-nbd and qemu-storage-daemon.

You're right, I'll prepare a v4 right away.

Thanks,
Sergio.


signature.asc
Description: PGP signature


Re: [PATCH v3 1/2] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()

2021-02-01 Thread Sergio Lopez
On Mon, Feb 01, 2021 at 01:06:31PM +0100, Kevin Wolf wrote:
> Am 21.01.2021 um 18:06 hat Sergio Lopez geschrieben:
> > Some graphs may contain an indirect reference to the first BDS in the
> > chain that can be reached while walking it bottom->up from one of its
> > children.
> > 
> > Double-processing of a BDS is especially problematic for the
> > aio_notifiers, as they might attempt to work on both the old and the
> > new AIO contexts.
> > 
> > To avoid this problem, add every child and parent to the ignore list
> > before actually processing them.
> > 
> > Suggested-by: Kevin Wolf 
> > Signed-off-by: Sergio Lopez 
> > ---
> >  block.c | 34 +++---
> >  1 file changed, 27 insertions(+), 7 deletions(-)
> 
> The patch looks correct to me, I'm just wondering about one thing:
> 
> > diff --git a/block.c b/block.c
> > index 8b9d457546..3da99312db 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -6414,7 +6414,10 @@ void bdrv_set_aio_context_ignore(BlockDriverState *bs,
> >   AioContext *new_context, GSList **ignore)
> >  {
> >  AioContext *old_context = bdrv_get_aio_context(bs);
> > -BdrvChild *child;
> > +GSList *children_to_process = NULL;
> > +GSList *parents_to_process = NULL;
> 
> Why do we need these separate lists? Can't we just iterate over
> bs->parents/children a second time? I don't think the graph can change
> between the first and the second loop (and if it could, the result would
> be broken anyway).

It's not strictly needed, but this makes the code more readable by
making our intentions clearer. To my eyes, at least.

Sergio.


signature.asc
Description: PGP signature


[PATCH v3 2/2] block: move blk_exp_close_all() to qemu_cleanup()

2021-01-21 Thread Sergio Lopez
Move blk_exp_close_all() from bdrv_close() to qemu_cleanup(), before
bdrv_drain_all_begin().

Export drivers may have coroutines yielding at some point in the block
layer, so we need to shut them down before draining the block layer,
as otherwise they may get stuck in blk_wait_while_drained().

RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1900505
Signed-off-by: Sergio Lopez 
---
 block.c| 1 -
 softmmu/runstate.c | 9 +
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 3da99312db..9682c82fa8 100644
--- a/block.c
+++ b/block.c
@@ -4435,7 +4435,6 @@ static void bdrv_close(BlockDriverState *bs)
 void bdrv_close_all(void)
 {
 assert(job_next(NULL) == NULL);
-blk_exp_close_all();
 
 /* Drop references from requests still in flight, such as canceled block
  * jobs whose AIO context has not been polled yet */
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index 6177693a30..ac4b2e2540 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -25,6 +25,7 @@
 #include "qemu/osdep.h"
 #include "audio/audio.h"
 #include "block/block.h"
+#include "block/export.h"
 #include "chardev/char.h"
 #include "crypto/cipher.h"
 #include "crypto/init.h"
@@ -783,6 +784,14 @@ void qemu_cleanup(void)
  */
 migration_shutdown();
 
+/*
+ * Close the exports before draining the block layer. The export
+ * drivers may have coroutines yielding on it, so we need to clean
+ * them up before the drain, as otherwise they may get stuck in
+ * blk_wait_while_drained().
+ */
+blk_exp_close_all();
+
 /*
  * We must cancel all block jobs while the block layer is drained,
  * or cancelling will be affected by throttling and thus may block
-- 
2.26.2




[PATCH v3 0/2] nbd/server: Quiesce coroutines on context switch

2021-01-21 Thread Sergio Lopez
This series allows the NBD server to properly switch between AIO contexts,
having quiesced recv_coroutine and send_coroutine before doing the transition.

We need this because we send back devices running in IO Thread owned contexts
to the main context when stopping the data plane, something that can happen
multiple times during the lifetime of a VM (usually during the boot sequence or
on a reboot), and we drag the NBD server of the corresponding export with it.

While there, fix also a problem caused by a cross-dependency between
closing the export's client connections and draining the block
layer. The visible effect of this problem was QEMU getting hung when
the guest requests a power off while there's an active NBD client.

v3:
 - Drop already merged "block: Honor blk_set_aio_context() context
 requirements" and "nbd/server: Quiesce coroutines on context switch"
 - Change the strategy for avoiding processing BDS twice to adding
 every child and parent to the ignore list in advance before
 processing them. (Kevin Wolf)
 - Replace "nbd/server: Quiesce coroutines on context switch" with
 "block: move blk_exp_close_all() to qemu_cleanup()"

v2:
 - Replace "virtio-blk: Acquire context while switching them on
 dataplane start" with "block: Honor blk_set_aio_context() context
 requirements" (Kevin Wolf)
 - Add "block: Avoid processing BDS twice in
 bdrv_set_aio_context_ignore()"
 - Add "block: Close block exports in two steps"
 - Rename nbd_read_eof() to nbd_server_read_eof() (Eric Blake)
 - Fix double space and typo in comment. (Eric Blake)

Sergio Lopez (2):
  block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
  block: move blk_exp_close_all() to qemu_cleanup()

 block.c| 35 +++
 softmmu/runstate.c |  9 +
 2 files changed, 36 insertions(+), 8 deletions(-)

-- 
2.26.2





[PATCH v3 1/2] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()

2021-01-21 Thread Sergio Lopez
Some graphs may contain an indirect reference to the first BDS in the
chain that can be reached while walking it bottom->up from one of its
children.

Double-processing of a BDS is especially problematic for the
aio_notifiers, as they might attempt to work on both the old and the
new AIO contexts.

To avoid this problem, add every child and parent to the ignore list
before actually processing them.

Suggested-by: Kevin Wolf 
Signed-off-by: Sergio Lopez 
---
 block.c | 34 +++---
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/block.c b/block.c
index 8b9d457546..3da99312db 100644
--- a/block.c
+++ b/block.c
@@ -6414,7 +6414,10 @@ void bdrv_set_aio_context_ignore(BlockDriverState *bs,
  AioContext *new_context, GSList **ignore)
 {
 AioContext *old_context = bdrv_get_aio_context(bs);
-BdrvChild *child;
+GSList *children_to_process = NULL;
+GSList *parents_to_process = NULL;
+GSList *entry;
+BdrvChild *child, *parent;
 
 g_assert(qemu_get_current_aio_context() == qemu_get_aio_context());
 
@@ -6429,16 +6432,33 @@ void bdrv_set_aio_context_ignore(BlockDriverState *bs,
 continue;
 }
 *ignore = g_slist_prepend(*ignore, child);
-bdrv_set_aio_context_ignore(child->bs, new_context, ignore);
+children_to_process = g_slist_prepend(children_to_process, child);
 }
-QLIST_FOREACH(child, &bs->parents, next_parent) {
-if (g_slist_find(*ignore, child)) {
+
+QLIST_FOREACH(parent, &bs->parents, next_parent) {
+if (g_slist_find(*ignore, parent)) {
 continue;
 }
-assert(child->klass->set_aio_ctx);
-*ignore = g_slist_prepend(*ignore, child);
-child->klass->set_aio_ctx(child, new_context, ignore);
+*ignore = g_slist_prepend(*ignore, parent);
+parents_to_process = g_slist_prepend(parents_to_process, parent);
+}
+
+for (entry = children_to_process;
+ entry != NULL;
+ entry = g_slist_next(entry)) {
+child = entry->data;
+bdrv_set_aio_context_ignore(child->bs, new_context, ignore);
+}
+g_slist_free(children_to_process);
+
+for (entry = parents_to_process;
+ entry != NULL;
+ entry = g_slist_next(entry)) {
+parent = entry->data;
+assert(parent->klass->set_aio_ctx);
+parent->klass->set_aio_ctx(parent, new_context, ignore);
 }
+g_slist_free(parents_to_process);
 
 bdrv_detach_aio_context(bs);
 
-- 
2.26.2




Re: [PATCH v2 0/4] nbd/server: Quiesce coroutines on context switch

2021-01-20 Thread Sergio Lopez
On Wed, Jan 20, 2021 at 02:49:14PM -0600, Eric Blake wrote:
> On 12/14/20 11:05 AM, Sergio Lopez wrote:
> > This series allows the NBD server to properly switch between AIO contexts,
> > having quiesced recv_coroutine and send_coroutine before doing the 
> > transition.
> > 
> > We need this because we send back devices running in IO Thread owned 
> > contexts
> > to the main context when stopping the data plane, something that can happen
> > multiple times during the lifetime of a VM (usually during the boot 
> > sequence or
> > on a reboot), and we drag the NBD server of the corresponding export with it.
> > 
> > While there, fix also a problem caused by a cross-dependency between
> > closing the export's client connections and draining the block
> > layer. The visible effect of this problem was QEMU getting hung when
> > the guest requests a power off while there's an active NBD client.
> > 
> > v2:
> >  - Replace "virtio-blk: Acquire context while switching them on
> >  dataplane start" with "block: Honor blk_set_aio_context() context
> >  requirements" (Kevin Wolf)
> >  - Add "block: Avoid processing BDS twice in
> >  bdrv_set_aio_context_ignore()"
> >  - Add "block: Close block exports in two steps"
> >  - Rename nbd_read_eof() to nbd_server_read_eof() (Eric Blake)
> >  - Fix double space and typo in comment. (Eric Blake)
> 
> ping - where do we stand on this series?  Patches 1 and 3 have positive
> reviews, I'll queue them now; patches 2 and 4 had enough comments that
> I'm guessing I should wait for a v3?

Yes, I have almost ready a v3. Will send it out today. I think it'd be
better to pull all four patches at the same time, as "block: Honor
blk_set_aio_context() context requirements" may cause trouble without
the patch to avoid double processing in
"bdrv_set_aio_context_ignore()".

Thanks,
Sergio.
 
> > 
> > Sergio Lopez (4):
> >   block: Honor blk_set_aio_context() context requirements
> >   block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
> >   nbd/server: Quiesce coroutines on context switch
> >   block: Close block exports in two steps
> > 
> >  block.c |  27 ++-
> >  block/export/export.c   |  10 +--
> >  blockdev-nbd.c  |   2 +-
> >  hw/block/dataplane/virtio-blk.c |   4 ++
> >  hw/block/dataplane/xen-block.c  |   7 +-
> >  hw/scsi/virtio-scsi.c   |   6 +-
> >  include/block/export.h  |   4 +-
> >  nbd/server.c| 120 
> >  qemu-nbd.c  |   2 +-
> >  stubs/blk-exp-close-all.c   |   2 +-
> >  10 files changed, 156 insertions(+), 28 deletions(-)
> > 
> 
> -- 
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.   +1-919-301-3226
> Virtualization:  qemu.org | libvirt.org
> 


signature.asc
Description: PGP signature


Re: [PATCH v2 4/4] block: Close block exports in two steps

2020-12-21 Thread Sergio Lopez
On Tue, Dec 15, 2020 at 04:34:05PM +0100, Kevin Wolf wrote:
> Am 14.12.2020 um 18:05 hat Sergio Lopez geschrieben:
> > There's a cross-dependency between closing the block exports and
> > draining the block layer. The latter needs that we close all export's
> > client connections to ensure they won't queue more requests, but the
> > exports may have coroutines yielding in the block layer, which implies
> > they can't be fully closed until we drain it.
> 
> A coroutine that yielded must have some way to be reentered. So I guess
> the question becomes why they aren't reentered until drain. We do
> process events:
> 
> AIO_WAIT_WHILE(NULL, blk_exp_has_type(type));
> 
> So in theory, anything that would finalise the block export closing
> should still execute.
> 
> What is the difference that drain makes compared to a simple
> AIO_WAIT_WHILE, so that coroutines are reentered during drain, but not
> during AIO_WAIT_WHILE?
> 
> This is an even more interesting question because the NBD server isn't a
> block node nor a BdrvChildClass implementation, so it shouldn't even
> notice a drain operation.

OK, took a deeper dive into the issue. While shutting down the guest,
some co-routines from the NBD server are stuck here:

nbd_trip
 nbd_handle_request
  nbd_do_cmd_read
   nbd_co_send_sparse_read
blk_pread
 blk_prw
  blk_read_entry
   blk_do_preadv
blk_wait_while_drained
 qemu_co_queue_wait

This happens because bdrv_close_all() is called after
bdrv_drain_all_begin(), so all block backends are quiesced.

An alternative approach to this patch would be moving
blk_exp_close_all() to vl.c:qemu_cleanup, before
bdrv_drain_all_begin().

Do you have a preference for one of these options?

Thanks,
Sergio.


signature.asc
Description: PGP signature


Re: [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()

2020-12-17 Thread Sergio Lopez
On Thu, Dec 17, 2020 at 02:06:02PM +0100, Kevin Wolf wrote:
> Am 17.12.2020 um 13:50 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > 17.12.2020 13:58, Kevin Wolf wrote:
> > > Am 17.12.2020 um 10:37 hat Sergio Lopez geschrieben:
> > > > On Wed, Dec 16, 2020 at 07:31:02PM +0100, Kevin Wolf wrote:
> > > > > Am 16.12.2020 um 15:55 hat Sergio Lopez geschrieben:
> > > > > > On Wed, Dec 16, 2020 at 01:35:14PM +0100, Kevin Wolf wrote:
> > > > > > > Anyway, trying to reconstruct the block graph with BdrvChild 
> > > > > > > pointers
> > > > > > > annotated at the edges:
> > > > > > > 
> > > > > > > BlockBackend
> > > > > > >|
> > > > > > >v
> > > > > > >backup-top +
> > > > > > >|   |  |
> > > > > > >|   +---+  |
> > > > > > >|0x5655068b8510 |  | 0x565505e3c450
> > > > > > >|   |  |
> > > > > > >| 0x565505e42090|  |
> > > > > > >v   |  |
> > > > > > >  qcow2 -+  |  |
> > > > > > >||  |  |
> > > > > > >| 0x565505e52060 |  |  | ??? [1]
> > > > > > >||  |  |  |
> > > > > > >v 0x5655066a34d0 |  |  |  | 0x565505fc7aa0
> > > > > > >  file   v  v  v  v
> > > > > > >   qcow2 (backing)
> > > > > > >  |
> > > > > > >  | 0x565505e41d20
> > > > > > >  v
> > > > > > >file
> > > > > > > 
> > > > > > > [1] This seems to be a BdrvChild with a non-BDS parent. Probably a
> > > > > > >  BdrvChild directly owned by the backup job.
> > > > > > > 
> > > > > > > > So it seems this is happening:
> > > > > > > > 
> > > > > > > > backup-top (5e48030) <-| (5)
> > > > > > > > ||  |
> > > > > > > > || (6) > qcow2 (5fbf660)
> > > > > > > > |   ^|
> > > > > > > > |   (3) || (4)
> > > > > > > > |-> (1) qcow2 (5e5d420) -|-> file (6bc0c00)
> > > > > > > > |
> > > > > > > > |-> (2) file (5e52060)
> > > > > > > > 
> > > > > > > > backup-top (5e48030), the BDS that was passed as argument in 
> > > > > > > > the first
> > > > > > > > bdrv_set_aio_context_ignore() call, is re-entered when qcow2 
> > > > > > > > (5fbf660)
> > > > > > > > is processing its parents, and the latter is also re-entered 
> > > > > > > > when the
> > > > > > > > first one starts processing its children again.
> > > > > > > 
> > > > > > > Yes, but look at the BdrvChild pointers, it is through different 
> > > > > > > edges
> > > > > > > that we come back to the same node. No BdrvChild is used twice.
> > > > > > > 
> > > > > > > If backup-top had added all of its children to the ignore list 
> > > > > > > before
> > > > > > > calling into the overlay qcow2, the backing qcow2 wouldn't 
> > > > > > > eventually
> > > > > > > have called back into backup-top.
> > > > > > 
> > > > > > I've tested a patch that first adds every child to the ignore list,
> > > > > > and then processes those that weren't there before, as you suggested
> > > > > > on a previous email. With that, the offending qcow2 is not 
> > > > > > re-entered,
> > > > > > so we avoid the crash, but backup-top is still entered twice:
> > > >

Re: [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()

2020-12-17 Thread Sergio Lopez
On Thu, Dec 17, 2020 at 11:58:30AM +0100, Kevin Wolf wrote:
> Am 17.12.2020 um 10:37 hat Sergio Lopez geschrieben:
> > Do you think it's safe to re-enter backup-top, or should we look for a
> > way to avoid this?
> 
> I think it should be avoided, but I don't understand why putting all
> children of backup-top into the ignore list doesn't already avoid it. If
> backup-top is in the parents list of qcow2, then qcow2 should be in the
> children list of backup-top and therefore the BdrvChild should already
> be in the ignore list.
> 
> The only way I can explain this is that backup-top and qcow2 have
> different ideas about which BdrvChild objects exist that connect them.
> Or that the graph changes between both places, but I don't see how that
> could happen in bdrv_set_aio_context_ignore().

I've been digging around with gdb, and found that, at that point, the
backup-top BDS is actually referenced by two different BdrvChild
objects:

(gdb) p *(BdrvChild *) 0x560c40f7e400
$84 = {bs = 0x560c40c4c030, name = 0x560c41ca4960 "root", klass = 
0x560c3eae7c20 , 
  role = 20, opaque = 0x560c41ca4610, perm = 3, shared_perm = 29, 
has_backup_perm = false, 
  backup_perm = 0, backup_shared_perm = 31, frozen = false, 
parent_quiesce_counter = 2, next = {
le_next = 0x0, le_prev = 0x0}, next_parent = {le_next = 0x0, le_prev = 
0x560c40c44338}}

(gdb) p sibling
$72 = (BdrvChild *) 0x560c40981840
(gdb) p *sibling
$73 = {bs = 0x560c40c4c030, name = 0x560c4161be20 "main node", klass = 
0x560c3eae6a40 , 
  role = 0, opaque = 0x560c4161bc00, perm = 0, shared_perm = 31, 
has_backup_perm = false, 
  backup_perm = 0, backup_shared_perm = 0, frozen = false, 
parent_quiesce_counter = 2, next = {
le_next = 0x0, le_prev = 0x0}, next_parent = {le_next = 0x560c40c442d0, 
le_prev = 0x560c40c501c0}}

When the chain of calls to switch AIO contexts is started, backup-top
is the first one to be processed. blk_do_set_aio_context() instructs
bdrv_child_try_set_aio_context() to add blk->root (0x560c40f7e400) as
the first element in ignore list, but the referenced BDS is still
re-entered through the other BdrvChild (0x560c40981840) by one of the
children of the latter.

I can't think of a way of preventing this other than keeping track of
BDS pointers in the ignore list too. Do you think there are any
alternatives?

Thanks,
Sergio.


signature.asc
Description: PGP signature


Re: [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()

2020-12-17 Thread Sergio Lopez
On Wed, Dec 16, 2020 at 07:31:02PM +0100, Kevin Wolf wrote:
> Am 16.12.2020 um 15:55 hat Sergio Lopez geschrieben:
> > On Wed, Dec 16, 2020 at 01:35:14PM +0100, Kevin Wolf wrote:
> > > Am 15.12.2020 um 18:23 hat Sergio Lopez geschrieben:
> > > > On Tue, Dec 15, 2020 at 04:01:19PM +0100, Kevin Wolf wrote:
> > > > > Am 15.12.2020 um 14:15 hat Sergio Lopez geschrieben:
> > > > > > On Tue, Dec 15, 2020 at 01:12:33PM +0100, Kevin Wolf wrote:
> > > > > > > Am 14.12.2020 um 18:05 hat Sergio Lopez geschrieben:
> > > > > > > > While processing the parents of a BDS, one of the parents may 
> > > > > > > > process
> > > > > > > > the child that's doing the tail recursion, which leads to a BDS 
> > > > > > > > being
> > > > > > > > processed twice. This is especially problematic for the 
> > > > > > > > aio_notifiers,
> > > > > > > > as they might attempt to work on both the old and the new AIO
> > > > > > > > contexts.
> > > > > > > > 
> > > > > > > > To avoid this, add the BDS pointer to the ignore list, and 
> > > > > > > > check the
> > > > > > > > child BDS pointer while iterating over the children.
> > > > > > > > 
> > > > > > > > Signed-off-by: Sergio Lopez 
> > > > > > > 
> > > > > > > Ugh, so we get a mixed list of BdrvChild and BlockDriverState? :-/
> > > > > > 
> > > > > > I know, it's effective but quite ugly...
> > > > > > 
> > > > > > > What is the specific scenario where you saw this breaking? Did 
> > > > > > > you have
> > > > > > > multiple BdrvChild connections between two nodes so that we would 
> > > > > > > go to
> > > > > > > the parent node through one and then come back to the child node 
> > > > > > > through
> > > > > > > the other?
> > > > > > 
> > > > > > I don't think this is a corner case. If the graph is walked 
> > > > > > top->down,
> > > > > > there's no problem since children are added to the ignore list 
> > > > > > before
> > > > > > getting processed, and siblings don't process each other. But, if 
> > > > > > the
> > > > > > graph is walked bottom->up, a BDS will start processing its parents
> > > > > > without adding itself to the ignore list, so there's nothing
> > > > > > preventing them from processing it again.
> > > > > 
> > > > > I don't understand. child is added to ignore before calling the parent
> > > > > callback on it, so how can we come back through the same BdrvChild?
> > > > > 
> > > > > QLIST_FOREACH(child, &bs->parents, next_parent) {
> > > > > if (g_slist_find(*ignore, child)) {
> > > > > continue;
> > > > > }
> > > > > assert(child->klass->set_aio_ctx);
> > > > > *ignore = g_slist_prepend(*ignore, child);
> > > > > child->klass->set_aio_ctx(child, new_context, ignore);
> > > > > }
> > > > 
> > > > Perhaps I'm missing something, but the way I understand it, that loop
> > > > is adding the BdrvChild pointer of each of its parents, but not the
> > > > BdrvChild pointer of the BDS that was passed as an argument to
> > > > b_s_a_c_i.
> > > 
> > > Generally, the caller has already done that.
> > > 
> > > In the theoretical case that it was the outermost call in the recursion
> > > and it hasn't (I couldn't find any such case), I think we should still
> > > call the callback for the passed BdrvChild like we currently do.
> > > 
> > > > > You didn't dump the BdrvChild here. I think that would add some
> > > > > information on why we re-entered 0x555ee2fbf660. Maybe you can also 
> > > > > add
> > > > > bs->drv->format_name for each node to make the scenario less abstract?
> > > > 
> > > > I've generated another trace with more data:
> > > > 
> > > > bs=0x565505e48030 (backup-top) enter
> > > > bs=0x565505e48030 (backup-top) processing 

Re: [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()

2020-12-16 Thread Sergio Lopez
On Wed, Dec 16, 2020 at 01:35:14PM +0100, Kevin Wolf wrote:
> Am 15.12.2020 um 18:23 hat Sergio Lopez geschrieben:
> > On Tue, Dec 15, 2020 at 04:01:19PM +0100, Kevin Wolf wrote:
> > > Am 15.12.2020 um 14:15 hat Sergio Lopez geschrieben:
> > > > On Tue, Dec 15, 2020 at 01:12:33PM +0100, Kevin Wolf wrote:
> > > > > Am 14.12.2020 um 18:05 hat Sergio Lopez geschrieben:
> > > > > > While processing the parents of a BDS, one of the parents may 
> > > > > > process
> > > > > > the child that's doing the tail recursion, which leads to a BDS 
> > > > > > being
> > > > > > processed twice. This is especially problematic for the 
> > > > > > aio_notifiers,
> > > > > > as they might attempt to work on both the old and the new AIO
> > > > > > contexts.
> > > > > > 
> > > > > > To avoid this, add the BDS pointer to the ignore list, and check the
> > > > > > child BDS pointer while iterating over the children.
> > > > > > 
> > > > > > Signed-off-by: Sergio Lopez 
> > > > > 
> > > > > Ugh, so we get a mixed list of BdrvChild and BlockDriverState? :-/
> > > > 
> > > > I know, it's effective but quite ugly...
> > > > 
> > > > > What is the specific scenario where you saw this breaking? Did you 
> > > > > have
> > > > > multiple BdrvChild connections between two nodes so that we would go 
> > > > > to
> > > > > the parent node through one and then come back to the child node 
> > > > > through
> > > > > the other?
> > > > 
> > > > I don't think this is a corner case. If the graph is walked top->down,
> > > > there's no problem since children are added to the ignore list before
> > > > getting processed, and siblings don't process each other. But, if the
> > > > graph is walked bottom->up, a BDS will start processing its parents
> > > > without adding itself to the ignore list, so there's nothing
> > > > preventing them from processing it again.
> > > 
> > > I don't understand. child is added to ignore before calling the parent
> > > callback on it, so how can we come back through the same BdrvChild?
> > > 
> > > QLIST_FOREACH(child, &bs->parents, next_parent) {
> > > if (g_slist_find(*ignore, child)) {
> > > continue;
> > > }
> > > assert(child->klass->set_aio_ctx);
> > > *ignore = g_slist_prepend(*ignore, child);
> > > child->klass->set_aio_ctx(child, new_context, ignore);
> > > }
> > 
> > Perhaps I'm missing something, but the way I understand it, that loop
> > is adding the BdrvChild pointer of each of its parents, but not the
> > BdrvChild pointer of the BDS that was passed as an argument to
> > b_s_a_c_i.
> 
> Generally, the caller has already done that.
> 
> In the theoretical case that it was the outermost call in the recursion
> and it hasn't (I couldn't find any such case), I think we should still
> call the callback for the passed BdrvChild like we currently do.
> 
> > > You didn't dump the BdrvChild here. I think that would add some
> > > information on why we re-entered 0x555ee2fbf660. Maybe you can also add
> > > bs->drv->format_name for each node to make the scenario less abstract?
> > 
> > I've generated another trace with more data:
> > 
> > bs=0x565505e48030 (backup-top) enter
> > bs=0x565505e48030 (backup-top) processing children
> > bs=0x565505e48030 (backup-top) calling bsaci child=0x565505e42090 
> > (child->bs=0x565505e5d420)
> > bs=0x565505e5d420 (qcow2) enter
> > bs=0x565505e5d420 (qcow2) processing children
> > bs=0x565505e5d420 (qcow2) calling bsaci child=0x565505e41ea0 
> > (child->bs=0x565505e52060)
> > bs=0x565505e52060 (file) enter
> > bs=0x565505e52060 (file) processing children
> > bs=0x565505e52060 (file) processing parents
> > bs=0x565505e52060 (file) processing itself
> > bs=0x565505e5d420 (qcow2) processing parents
> > bs=0x565505e5d420 (qcow2) calling set_aio_ctx child=0x5655066a34d0
> > bs=0x565505fbf660 (qcow2) enter
> > bs=0x565505fbf660 (qcow2) processing children
> > bs=0x565505fbf660 (qcow2) calling bsaci child=0x565505e41d20 
> > (child->bs=0x565506bc0c00)
> > bs=0x565506bc0c00 (file) enter
> > bs=0x565506bc0c00 (file) processing children
>

Re: [PATCH v2 4/4] block: Close block exports in two steps

2020-12-15 Thread Sergio Lopez
On Tue, Dec 15, 2020 at 04:34:05PM +0100, Kevin Wolf wrote:
> Am 14.12.2020 um 18:05 hat Sergio Lopez geschrieben:
> > There's a cross-dependency between closing the block exports and
> > draining the block layer. The latter needs that we close all export's
> > client connections to ensure they won't queue more requests, but the
> > exports may have coroutines yielding in the block layer, which implies
> > they can't be fully closed until we drain it.
> 
> A coroutine that yielded must have some way to be reentered. So I guess
> the question becomes why they aren't reentered until drain. We do
> process events:
> 
> AIO_WAIT_WHILE(NULL, blk_exp_has_type(type));
> 
> So in theory, anything that would finalise the block export closing
> should still execute.
> 
> What is the difference that drain makes compared to a simple
> AIO_WAIT_WHILE, so that coroutines are reentered during drain, but not
> during AIO_WAIT_WHILE?
> 
> This is an even more interesting question because the NBD server isn't a
> block node nor a BdrvChildClass implementation, so it shouldn't even
> notice a drain operation.

I agree in that this deserves a deeper analysis. I'm going to drop
this patch from the series, and will re-analyze the issue later.

Thanks,
Sergio.


signature.asc
Description: PGP signature


Re: [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()

2020-12-15 Thread Sergio Lopez
On Tue, Dec 15, 2020 at 04:01:19PM +0100, Kevin Wolf wrote:
> Am 15.12.2020 um 14:15 hat Sergio Lopez geschrieben:
> > On Tue, Dec 15, 2020 at 01:12:33PM +0100, Kevin Wolf wrote:
> > > Am 14.12.2020 um 18:05 hat Sergio Lopez geschrieben:
> > > > While processing the parents of a BDS, one of the parents may process
> > > > the child that's doing the tail recursion, which leads to a BDS being
> > > > processed twice. This is especially problematic for the aio_notifiers,
> > > > as they might attempt to work on both the old and the new AIO
> > > > contexts.
> > > > 
> > > > To avoid this, add the BDS pointer to the ignore list, and check the
> > > > child BDS pointer while iterating over the children.
> > > > 
> > > > Signed-off-by: Sergio Lopez 
> > > 
> > > Ugh, so we get a mixed list of BdrvChild and BlockDriverState? :-/
> > 
> > I know, it's effective but quite ugly...
> > 
> > > What is the specific scenario where you saw this breaking? Did you have
> > > multiple BdrvChild connections between two nodes so that we would go to
> > > the parent node through one and then come back to the child node through
> > > the other?
> > 
> > I don't think this is a corner case. If the graph is walked top->down,
> > there's no problem since children are added to the ignore list before
> > getting processed, and siblings don't process each other. But, if the
> > graph is walked bottom->up, a BDS will start processing its parents
> > without adding itself to the ignore list, so there's nothing
> > preventing them from processing it again.
> 
> I don't understand. child is added to ignore before calling the parent
> callback on it, so how can we come back through the same BdrvChild?
> 
> QLIST_FOREACH(child, &bs->parents, next_parent) {
> if (g_slist_find(*ignore, child)) {
> continue;
> }
> assert(child->klass->set_aio_ctx);
> *ignore = g_slist_prepend(*ignore, child);
> child->klass->set_aio_ctx(child, new_context, ignore);
> }

Perhaps I'm missing something, but the way I understand it, that loop
is adding the BdrvChild pointer of each of its parents, but not the
BdrvChild pointer of the BDS that was passed as an argument to
b_s_a_c_i.

> You didn't dump the BdrvChild here. I think that would add some
> information on why we re-entered 0x555ee2fbf660. Maybe you can also add
> bs->drv->format_name for each node to make the scenario less abstract?

I've generated another trace with more data:

bs=0x565505e48030 (backup-top) enter
bs=0x565505e48030 (backup-top) processing children
bs=0x565505e48030 (backup-top) calling bsaci child=0x565505e42090 (child->bs=0x565505e5d420)
bs=0x565505e5d420 (qcow2) enter
bs=0x565505e5d420 (qcow2) processing children
bs=0x565505e5d420 (qcow2) calling bsaci child=0x565505e41ea0 (child->bs=0x565505e52060)
bs=0x565505e52060 (file) enter
bs=0x565505e52060 (file) processing children
bs=0x565505e52060 (file) processing parents
bs=0x565505e52060 (file) processing itself
bs=0x565505e5d420 (qcow2) processing parents
bs=0x565505e5d420 (qcow2) calling set_aio_ctx child=0x5655066a34d0
bs=0x565505fbf660 (qcow2) enter
bs=0x565505fbf660 (qcow2) processing children
bs=0x565505fbf660 (qcow2) calling bsaci child=0x565505e41d20 (child->bs=0x565506bc0c00)
bs=0x565506bc0c00 (file) enter
bs=0x565506bc0c00 (file) processing children
bs=0x565506bc0c00 (file) processing parents
bs=0x565506bc0c00 (file) processing itself
bs=0x565505fbf660 (qcow2) processing parents
bs=0x565505fbf660 (qcow2) calling set_aio_ctx child=0x565505fc7aa0
bs=0x565505fbf660 (qcow2) calling set_aio_ctx child=0x5655068b8510
bs=0x565505e48030 (backup-top) enter
bs=0x565505e48030 (backup-top) processing children
bs=0x565505e48030 (backup-top) calling bsaci child=0x565505e3c450 (child->bs=0x565505fbf660)
bs=0x565505fbf660 (qcow2) enter
bs=0x565505fbf660 (qcow2) processing children
bs=0x565505fbf660 (qcow2) processing parents
bs=0x565505fbf660 (qcow2) processing itself
bs=0x565505e48030 (backup-top) processing parents
bs=0x565505e48030 (backup-top) calling set_aio_ctx child=0x565505e402d0
bs=0x565505e48030 (backup-top) processing itself
bs=0x565505fbf660 (qcow2) processing itself


So it seems this is happening:

backup-top (5e48030) <-------------------+
   |   |                                 |
   |   | (6)                             | (5)
   |   +----> qcow2 (5fbf660) -----------+
   |              ^       |
   |          (3) |       | (4)
   |              |       +--> file (6bc0c00)
   |-> (1) qcow2 (5e5d420)
   |
   |-> (2) file (5e52060)

backup-top (5e48030), the BDS that was passed as argument in the first
bdrv_set_aio_context_ignore() call, is re-entered when qcow2 (5fbf660)
is processing its parents, and the latter is also re-entered when the
first one starts processing its children again.
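The re-entrancy can be sketched with a toy model of the walk (illustrative Python, not the QEMU API; names are made up for the sketch). The ignore list tracks edges (BdrvChild-like objects), never nodes, so when two distinct edges connect the same pair of nodes, the bottom->up walk lets a node be processed twice:

```python
# Toy model of the walk in bdrv_set_aio_context_ignore(): the ignore
# list tracks edges (BdrvChild), never nodes (BDS). All names are
# illustrative; this is not the QEMU API.

class Edge:
    def __init__(self, parent, child):
        self.parent, self.child = parent, child

class Node:
    def __init__(self, name):
        self.name = name
        self.children = []  # edges pointing down
        self.parents = []   # edges pointing up

def link(parent, child):
    e = Edge(parent, child)
    parent.children.append(e)
    child.parents.append(e)

processed = []

def set_ctx(node, ignore):
    for e in node.children:      # walk down, skipping visited edges
        if e in ignore:
            continue
        ignore.add(e)
        set_ctx(e.child, ignore)
    for e in node.parents:       # walk up; note that 'node' itself is
        if e in ignore:          # never added to the ignore list
            continue
        ignore.add(e)
        set_ctx(e.parent, ignore)
    processed.append(node.name)  # "processing itself"

# Two distinct edges between the same pair of nodes, as with
# backup-top and qcow2 (5fbf660) in the trace above.
top, q = Node("backup-top"), Node("qcow2")
link(top, q)
link(top, q)

set_ctx(q, set())
print(processed)  # -> ['qcow2', 'backup-top', 'qcow2']: qcow2 twice
```

The upward walk through the first edge enters the parent, whose own children loop finds the second edge not in the ignore list and descends back into the starting node.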

Sergio.




Re: [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()

2020-12-15 Thread Sergio Lopez
On Tue, Dec 15, 2020 at 01:12:33PM +0100, Kevin Wolf wrote:
> Am 14.12.2020 um 18:05 hat Sergio Lopez geschrieben:
> > While processing the parents of a BDS, one of the parents may process
> > the child that's doing the tail recursion, which leads to a BDS being
> > processed twice. This is especially problematic for the aio_notifiers,
> > as they might attempt to work on both the old and the new AIO
> > contexts.
> > 
> > To avoid this, add the BDS pointer to the ignore list, and check the
> > child BDS pointer while iterating over the children.
> > 
> > Signed-off-by: Sergio Lopez 
> 
> Ugh, so we get a mixed list of BdrvChild and BlockDriverState? :-/

I know, it's effective but quite ugly...

> What is the specific scenario where you saw this breaking? Did you have
> multiple BdrvChild connections between two nodes so that we would go to
> the parent node through one and then come back to the child node through
> the other?

I don't think this is a corner case. If the graph is walked top->down,
there's no problem since children are added to the ignore list before
getting processed, and siblings don't process each other. But, if the
graph is walked bottom->up, a BDS will start processing its parents
without adding itself to the ignore list, so there's nothing
preventing them from processing it again.

I'm pasting here an annotated trace of bdrv_set_aio_context_ignore I
generated while triggering the issue:

<-- begin -->
bdrv_set_aio_context_ignore: bs=0x555ee2e48030 enter
bdrv_set_aio_context_ignore: bs=0x555ee2e48030 processing children
bdrv_set_aio_context_ignore: bs=0x555ee2e5d420 enter
bdrv_set_aio_context_ignore: bs=0x555ee2e5d420 processing children
bdrv_set_aio_context_ignore: bs=0x555ee2e52060 enter
bdrv_set_aio_context_ignore: bs=0x555ee2e52060 processing children
bdrv_set_aio_context_ignore: bs=0x555ee2e52060 processing parents
bdrv_set_aio_context_ignore: bs=0x555ee2e52060 processing itself
bdrv_set_aio_context_ignore: bs=0x555ee2e5d420 processing parents

 - We enter b_s_a_c_i with BDS 2fbf660 the first time:
 
bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 enter
bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing children

 - We enter b_s_a_c_i with BDS 3bc0c00, a child of 2fbf660:
 
bdrv_set_aio_context_ignore: bs=0x555ee3bc0c00 enter
bdrv_set_aio_context_ignore: bs=0x555ee3bc0c00 processing children
bdrv_set_aio_context_ignore: bs=0x555ee3bc0c00 processing parents
bdrv_set_aio_context_ignore: bs=0x555ee3bc0c00 processing itself

 - We start processing its parents:
 
bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing parents

 - We enter b_s_a_c_i with BDS 2e48030, a parent of 2fbf660:
 
bdrv_set_aio_context_ignore: bs=0x555ee2e48030 enter
bdrv_set_aio_context_ignore: bs=0x555ee2e48030 processing children

 - We enter b_s_a_c_i with BDS 2fbf660 again, because parent 2e48030
   didn't find it in the ignore list:
   
bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 enter
bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing children
bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing parents
bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing itself
bdrv_set_aio_context_ignore: bs=0x555ee2e48030 processing parents
bdrv_set_aio_context_ignore: bs=0x555ee2e48030 processing itself

 - BDS 2fbf660 will be processed here a second time, triggering the
   issue:
   
bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing itself
<-- end -->

I suspect this has been happening for a while, and has only surfaced
now due to the need to run an AIO context BH in an aio_notifier
function that the "nbd/server: Quiesce coroutines on context switch"
patch introduces. There the problem is that the first time the
aio_notifier AIO detach function is called, it works on the old
context (as it should be), and the second one works on the new context
(which is wrong).

> Maybe if what we really need to do is not processing every edge once,
> but processing every node once, the list should be changed to contain
> _only_ BDS objects. But then blk_do_set_aio_context() probably won't
> work any more because it can't have blk->root ignored any more...

I tried that in my first attempt and it broke badly. I didn't take a
deeper look at the causes.

> Anyway, if we end up changing what the list contains, the comment needs
> an update, too. Currently it says:
> 
>  * @ignore will accumulate all visited BdrvChild object. The caller is
>  * responsible for freeing the list afterwards.
> 
> Another option: Split the parents QLIST_FOREACH loop in two. First add
> all parent BdrvChild objects to the ignore list, remember which of them
> were newly added, and only after adding all of them call
> child->klass->set_aio_ctx() for each parent that was previously not on
> the ignore list. This will avoid tha

[PATCH v2 4/4] block: Close block exports in two steps

2020-12-14 Thread Sergio Lopez
There's a cross-dependency between closing the block exports and
draining the block layer. The latter requires that we close all exports'
client connections to ensure they won't queue more requests, but the
exports may have coroutines yielding in the block layer, which implies
they can't be fully closed until we drain it.

To break this cross-dependency, this change adds a "bool wait"
argument to blk_exp_close_all() and blk_exp_close_all_type(), so
callers can decide whether they want to wait for the exports to be
fully quiesced, or just return after requesting them to shut down.

Then, in bdrv_close_all() we make two calls: one without waiting, to
close all client connections, and another after draining the block
layer, this time waiting for the exports to be fully quiesced.
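The ordering constraint can be sketched as a toy model (illustrative Python with hypothetical names, not the QEMU API): waiting before the drain would spin on in-flight requests, while draining before closing the connections would let exports queue new ones.

```python
# Toy model of the two-step shutdown. 'Export', 'close_all' and
# 'drain_all' are stand-ins for the real QEMU functions.

class Export:
    def __init__(self):
        self.accepting = True   # client connections still open
        self.inflight = 2       # requests yielding in the block layer

    def request_shutdown(self):
        self.accepting = False  # close client connections

def close_all(exports, wait):
    for exp in exports:
        exp.request_shutdown()
    if wait:
        # AIO_WAIT_WHILE() stand-in: would spin forever if requests
        # were still in flight, hence the cross-dependency.
        assert all(exp.inflight == 0 for exp in exports)

def drain_all(exports):
    # the drain can only finish once no export queues new requests
    assert all(not exp.accepting for exp in exports)
    for exp in exports:
        exp.inflight = 0        # yielding coroutines run to completion

exports = [Export(), Export()]
close_all(exports, wait=False)  # 1st call: request shutdown, don't wait
drain_all(exports)              # bdrv_drain_all() stand-in
close_all(exports, wait=True)   # 2nd call: exports can now fully quiesce
print("all exports quiesced")
```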

RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1900505
Signed-off-by: Sergio Lopez 
---
 block.c   | 20 +++-
 block/export/export.c | 10 ++
 blockdev-nbd.c|  2 +-
 include/block/export.h|  4 ++--
 qemu-nbd.c|  2 +-
 stubs/blk-exp-close-all.c |  2 +-
 6 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/block.c b/block.c
index bc8a66ab6e..41db70ac07 100644
--- a/block.c
+++ b/block.c
@@ -4472,13 +4472,31 @@ static void bdrv_close(BlockDriverState *bs)
 void bdrv_close_all(void)
 {
 assert(job_next(NULL) == NULL);
-blk_exp_close_all();
+
+/*
+ * There's a cross-dependency between closing the block exports and
+ * draining the block layer. The latter requires that we close all exports'
+ * client connections to ensure they won't queue more requests, but the
+ * exports may have coroutines yielding in the block layer, which implies
+ * they can't be fully closed until we drain it.
+ *
+ * Make a first call to close all export's client connections, without
+ * waiting for each export to be fully quiesced.
+ */
+blk_exp_close_all(false);
 
 /* Drop references from requests still in flight, such as canceled block
  * jobs whose AIO context has not been polled yet */
 bdrv_drain_all();
 
 blk_remove_all_bs();
+
+/*
+ * Make a second call to shut down the exports, this time waiting for them
+ * to be fully quiesced.
+ */
+blk_exp_close_all(true);
+
 blockdev_close_all_bdrv_states();
 
 assert(QTAILQ_EMPTY(&all_bdrv_states));
diff --git a/block/export/export.c b/block/export/export.c
index bad6f21b1c..0124ebd9f9 100644
--- a/block/export/export.c
+++ b/block/export/export.c
@@ -280,7 +280,7 @@ static bool blk_exp_has_type(BlockExportType type)
 }
 
 /* type == BLOCK_EXPORT_TYPE__MAX for all types */
-void blk_exp_close_all_type(BlockExportType type)
+void blk_exp_close_all_type(BlockExportType type, bool wait)
 {
 BlockExport *exp, *next;
 
@@ -293,12 +293,14 @@ void blk_exp_close_all_type(BlockExportType type)
 blk_exp_request_shutdown(exp);
 }
 
-AIO_WAIT_WHILE(NULL, blk_exp_has_type(type));
+if (wait) {
+AIO_WAIT_WHILE(NULL, blk_exp_has_type(type));
+}
 }
 
-void blk_exp_close_all(void)
+void blk_exp_close_all(bool wait)
 {
-blk_exp_close_all_type(BLOCK_EXPORT_TYPE__MAX);
+blk_exp_close_all_type(BLOCK_EXPORT_TYPE__MAX, wait);
 }
 
 void qmp_block_export_add(BlockExportOptions *export, Error **errp)
diff --git a/blockdev-nbd.c b/blockdev-nbd.c
index d8443d235b..d71d4da7c2 100644
--- a/blockdev-nbd.c
+++ b/blockdev-nbd.c
@@ -266,7 +266,7 @@ void qmp_nbd_server_stop(Error **errp)
 return;
 }
 
-blk_exp_close_all_type(BLOCK_EXPORT_TYPE_NBD);
+blk_exp_close_all_type(BLOCK_EXPORT_TYPE_NBD, true);
 
 nbd_server_free(nbd_server);
 nbd_server = NULL;
diff --git a/include/block/export.h b/include/block/export.h
index 7feb02e10d..71c25928ce 100644
--- a/include/block/export.h
+++ b/include/block/export.h
@@ -83,7 +83,7 @@ BlockExport *blk_exp_find(const char *id);
 void blk_exp_ref(BlockExport *exp);
 void blk_exp_unref(BlockExport *exp);
 void blk_exp_request_shutdown(BlockExport *exp);
-void blk_exp_close_all(void);
-void blk_exp_close_all_type(BlockExportType type);
+void blk_exp_close_all(bool wait);
+void blk_exp_close_all_type(BlockExportType type, bool wait);
 
 #endif
diff --git a/qemu-nbd.c b/qemu-nbd.c
index a7075c5419..928f4466f6 100644
--- a/qemu-nbd.c
+++ b/qemu-nbd.c
@@ -1122,7 +1122,7 @@ int main(int argc, char **argv)
 do {
 main_loop_wait(false);
 if (state == TERMINATE) {
-blk_exp_close_all();
+blk_exp_close_all(true);
 state = TERMINATED;
 }
 } while (state != TERMINATED);
diff --git a/stubs/blk-exp-close-all.c b/stubs/blk-exp-close-all.c
index 1c71316763..ecd0ce611f 100644
--- a/stubs/blk-exp-close-all.c
+++ b/stubs/blk-exp-close-all.c
@@ -2,6 +2,6 @@
 #include "block/export.h"
 
 /* Only used in programs that support block exports (libblockdev.fa) */
-void blk_exp_close_all(void)
+void blk_exp

[PATCH v2 1/4] block: Honor blk_set_aio_context() context requirements

2020-12-14 Thread Sergio Lopez
The documentation for bdrv_set_aio_context_ignore() states this:

 * The caller must own the AioContext lock for the old AioContext of bs, but it
 * must not own the AioContext lock for new_context (unless new_context is the
 * same as the current context of bs).

As blk_set_aio_context() makes use of this function, this rule also
applies to it.

Fix all occurrences where this rule wasn't honored.
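The rule can be modeled with a toy sketch (illustrative Python, hypothetical names, not the QEMU API): the caller acquires the lock of the BlockBackend's old context, switches, and releases it immediately afterwards, mirroring the fixed call sites below.

```python
# Toy model of the AioContext locking rule: the old context's lock
# must be held by the caller, the new context's lock must not be.
import threading

class AioContext:
    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self.owner = None

    def acquire(self):
        self._lock.acquire()
        self.owner = threading.current_thread()

    def release(self):
        self.owner = None
        self._lock.release()

def blk_set_aio_context(blk, new_ctx):
    me = threading.current_thread()
    assert blk["ctx"].owner is me          # old context lock held
    if new_ctx is not blk["ctx"]:
        assert new_ctx.owner is not me     # new context lock NOT held
    blk["ctx"] = new_ctx

main_ctx, io_ctx = AioContext("main"), AioContext("iothread")
blk = {"ctx": main_ctx}

# Mirror the fixed call sites: acquire the old context, switch,
# release immediately afterwards.
old_context = blk["ctx"]
old_context.acquire()
blk_set_aio_context(blk, io_ctx)
old_context.release()
print(blk["ctx"].name)  # -> iothread
```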

Suggested-by: Kevin Wolf 
Signed-off-by: Sergio Lopez 
---
 hw/block/dataplane/virtio-blk.c | 4 
 hw/block/dataplane/xen-block.c  | 7 ++-
 hw/scsi/virtio-scsi.c   | 6 --
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 37499c5564..e9050c8987 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -172,6 +172,7 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
 VirtIOBlockDataPlane *s = vblk->dataplane;
 BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vblk)));
 VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
+AioContext *old_context;
 unsigned i;
 unsigned nvqs = s->conf->num_queues;
 Error *local_err = NULL;
@@ -214,7 +215,10 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
 vblk->dataplane_started = true;
 trace_virtio_blk_data_plane_start(s);
 
+old_context = blk_get_aio_context(s->conf->conf.blk);
+aio_context_acquire(old_context);
 r = blk_set_aio_context(s->conf->conf.blk, s->ctx, &local_err);
+aio_context_release(old_context);
 if (r < 0) {
 error_report_err(local_err);
 goto fail_guest_notifiers;
diff --git a/hw/block/dataplane/xen-block.c b/hw/block/dataplane/xen-block.c
index 71c337c7b7..3675f8deaf 100644
--- a/hw/block/dataplane/xen-block.c
+++ b/hw/block/dataplane/xen-block.c
@@ -725,6 +725,7 @@ void xen_block_dataplane_start(XenBlockDataPlane *dataplane,
 {
 ERRP_GUARD();
 XenDevice *xendev = dataplane->xendev;
+AioContext *old_context;
 unsigned int ring_size;
 unsigned int i;
 
@@ -808,10 +809,14 @@ void xen_block_dataplane_start(XenBlockDataPlane *dataplane,
 goto stop;
 }
 
-aio_context_acquire(dataplane->ctx);
+old_context = blk_get_aio_context(dataplane->blk);
+aio_context_acquire(old_context);
 /* If other users keep the BlockBackend in the iothread, that's ok */
 blk_set_aio_context(dataplane->blk, dataplane->ctx, NULL);
+aio_context_release(old_context);
+
 /* Only reason for failure is a NULL channel */
+aio_context_acquire(dataplane->ctx);
 xen_device_set_event_channel_context(xendev, dataplane->event_channel,
  dataplane->ctx, &error_abort);
 aio_context_release(dataplane->ctx);
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 3db9a8aae9..7a347ceac5 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -821,15 +821,17 @@ static void virtio_scsi_hotplug(HotplugHandler *hotplug_dev, DeviceState *dev,
 VirtIODevice *vdev = VIRTIO_DEVICE(hotplug_dev);
 VirtIOSCSI *s = VIRTIO_SCSI(vdev);
 SCSIDevice *sd = SCSI_DEVICE(dev);
+AioContext *old_context;
 int ret;
 
 if (s->ctx && !s->dataplane_fenced) {
 if (blk_op_is_blocked(sd->conf.blk, BLOCK_OP_TYPE_DATAPLANE, errp)) {
 return;
 }
-virtio_scsi_acquire(s);
+old_context = blk_get_aio_context(sd->conf.blk);
+aio_context_acquire(old_context);
 ret = blk_set_aio_context(sd->conf.blk, s->ctx, errp);
-virtio_scsi_release(s);
+aio_context_release(old_context);
 if (ret < 0) {
 return;
 }
-- 
2.26.2




[PATCH v2 0/4] nbd/server: Quiesce coroutines on context switch

2020-12-14 Thread Sergio Lopez
This series allows the NBD server to properly switch between AIO contexts,
having quiesced recv_coroutine and send_coroutine before doing the transition.

We need this because we send back devices running in IOThread-owned
contexts to the main context when stopping the data plane, something
that can happen multiple times during the lifetime of a VM (usually
during the boot sequence or on a reboot), and we drag the NBD server
of the corresponding export with it.

While there, also fix a problem caused by a cross-dependency between
closing the export's client connections and draining the block
layer. The visible effect of this problem was QEMU getting hung when
the guest requests a power-off while there's an active NBD client.

v2:
 - Replace "virtio-blk: Acquire context while switching them on
 dataplane start" with "block: Honor blk_set_aio_context() context
 requirements" (Kevin Wolf)
 - Add "block: Avoid processing BDS twice in
 bdrv_set_aio_context_ignore()"
 - Add "block: Close block exports in two steps"
 - Rename nbd_read_eof() to nbd_server_read_eof() (Eric Blake)
 - Fix double space and typo in comment. (Eric Blake)

Sergio Lopez (4):
  block: Honor blk_set_aio_context() context requirements
  block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
  nbd/server: Quiesce coroutines on context switch
  block: Close block exports in two steps

 block.c |  27 ++-
 block/export/export.c   |  10 +--
 blockdev-nbd.c  |   2 +-
 hw/block/dataplane/virtio-blk.c |   4 ++
 hw/block/dataplane/xen-block.c  |   7 +-
 hw/scsi/virtio-scsi.c   |   6 +-
 include/block/export.h  |   4 +-
 nbd/server.c| 120 
 qemu-nbd.c  |   2 +-
 stubs/blk-exp-close-all.c   |   2 +-
 10 files changed, 156 insertions(+), 28 deletions(-)

-- 
2.26.2





[PATCH v2 3/4] nbd/server: Quiesce coroutines on context switch

2020-12-14 Thread Sergio Lopez
When switching between AIO contexts we need to make sure that both
recv_coroutine and send_coroutine are not scheduled to run. Otherwise,
QEMU may crash while attaching the new context with an error like
this one:

aio_co_schedule: Co-routine was already scheduled in 'aio_co_schedule'

To achieve this we need a local implementation of
'qio_channel_readv_all_eof' named 'nbd_read_eof' (a trick already done
by 'nbd/client.c') that allows us to interrupt the operation and to
know when recv_coroutine is yielding.

With this in place, we delegate detaching the AIO context to the
owning context with a BH ('nbd_aio_detach_bh') scheduled using
'aio_wait_bh_oneshot'. This BH signals that we need to quiesce the
channel by setting 'client->quiescing' to 'true', and either waits for
the coroutine to finish using AIO_WAIT_WHILE or, if it's yielding in
'nbd_read_eof', actively enters the coroutine to interrupt it.
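The handshake can be sketched with a generator standing in for the coroutine (illustrative Python, hypothetical names, not the QEMU API): the detach BH sets the quiescing flag and, if the read side is yielding on the channel, actively resumes it so it can observe the flag and bail out.

```python
# Toy model of the quiesce handshake between nbd_aio_detach_bh()
# and the recv coroutine yielding in nbd_read_eof().
EAGAIN = -11

class Client:
    def __init__(self):
        self.quiescing = False
        self.read_yielding = False
        self.recv_coroutine = self._recv()

    def _recv(self):
        # mimic nbd_read_eof(): yield while no data is available,
        # re-check 'quiescing' every time we are resumed
        while True:
            self.read_yielding = True
            yield                    # qio_channel_yield(G_IO_IN) stand-in
            self.read_yielding = False
            if self.quiescing:
                return EAGAIN

def aio_detach_bh(client):
    client.quiescing = True
    if client.recv_coroutine and client.read_yielding:
        # actively enter the coroutine so it sees 'quiescing' and bails
        try:
            next(client.recv_coroutine)
        except StopIteration as e:
            assert e.value == EAGAIN
            client.recv_coroutine = None

c = Client()
next(c.recv_coroutine)   # coroutine runs until it yields on the channel
assert c.read_yielding
aio_detach_bh(c)
assert c.recv_coroutine is None and c.quiescing
print("quiesced")
```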

RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1900326
Signed-off-by: Sergio Lopez 
Reviewed-by: Eric Blake 
---
 nbd/server.c | 120 +--
 1 file changed, 106 insertions(+), 14 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 613ed2634a..7229f487d2 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -132,6 +132,9 @@ struct NBDClient {
 CoMutex send_lock;
 Coroutine *send_coroutine;
 
+bool read_yielding;
+bool quiescing;
+
 QTAILQ_ENTRY(NBDClient) next;
 int nb_requests;
 bool closing;
@@ -1352,14 +1355,60 @@ static coroutine_fn int nbd_negotiate(NBDClient *client, Error **errp)
 return 0;
 }
 
-static int nbd_receive_request(QIOChannel *ioc, NBDRequest *request,
+/* nbd_read_eof
+ * Tries to read @size bytes from @ioc. This is a local implementation of
+ * qio_channel_readv_all_eof. We have it here because we need it to be
+ * interruptible and to know when the coroutine is yielding.
+ * Returns 1 on success
+ * 0 on eof, when no data was read (errp is not set)
+ * negative errno on failure (errp is set)
+ */
+static inline int coroutine_fn
+nbd_read_eof(NBDClient *client, void *buffer, size_t size, Error **errp)
+{
+bool partial = false;
+
+assert(size);
+while (size > 0) {
+struct iovec iov = { .iov_base = buffer, .iov_len = size };
+ssize_t len;
+
+len = qio_channel_readv(client->ioc, &iov, 1, errp);
+if (len == QIO_CHANNEL_ERR_BLOCK) {
+client->read_yielding = true;
+qio_channel_yield(client->ioc, G_IO_IN);
+client->read_yielding = false;
+if (client->quiescing) {
+return -EAGAIN;
+}
+continue;
+} else if (len < 0) {
+return -EIO;
+} else if (len == 0) {
+if (partial) {
+error_setg(errp,
+   "Unexpected end-of-file before all bytes were read");
+return -EIO;
+} else {
+return 0;
+}
+}
+
+partial = true;
+size -= len;
+buffer = (uint8_t *) buffer + len;
+}
+return 1;
+}
+
+static int nbd_receive_request(NBDClient *client, NBDRequest *request,
Error **errp)
 {
 uint8_t buf[NBD_REQUEST_SIZE];
 uint32_t magic;
 int ret;
 
-ret = nbd_read(ioc, buf, sizeof(buf), "request", errp);
+ret = nbd_read_eof(client, buf, sizeof(buf), errp);
 if (ret < 0) {
 return ret;
 }
@@ -1480,11 +1529,37 @@ static void blk_aio_attached(AioContext *ctx, void *opaque)
 
 QTAILQ_FOREACH(client, &exp->clients, next) {
 qio_channel_attach_aio_context(client->ioc, ctx);
+
+assert(client->recv_coroutine == NULL);
+assert(client->send_coroutine == NULL);
+
+if (client->quiescing) {
+client->quiescing = false;
+nbd_client_receive_next_request(client);
+}
+}
+}
+
+static void nbd_aio_detach_bh(void *opaque)
+{
+NBDExport *exp = opaque;
+NBDClient *client;
+
+QTAILQ_FOREACH(client, &exp->clients, next) {
+qio_channel_detach_aio_context(client->ioc);
+client->quiescing = true;
+
 if (client->recv_coroutine) {
-aio_co_schedule(ctx, client->recv_coroutine);
+if (client->read_yielding) {
+qemu_aio_coroutine_enter(exp->common.ctx,
+ client->recv_coroutine);
+} else {
+AIO_WAIT_WHILE(exp->common.ctx, client->recv_coroutine != NULL);
+}
 }
+
 if (client->send_coroutine) {
-aio_co_schedule(ctx, client->send_coroutine);
+AIO_WAIT_WHILE(exp->common.ctx, client->send_coroutine != NULL);
 }
 }
 }
@@ -1492,13 +1567,10 @@ static void blk_aio_attached(AioContext *ctx, void *opaque)
 static void blk_aio_detach
 static void blk_aio_detach

[PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()

2020-12-14 Thread Sergio Lopez
While processing the parents of a BDS, one of the parents may process
the child that's doing the tail recursion, which leads to a BDS being
processed twice. This is especially problematic for the aio_notifiers,
as they might attempt to work on both the old and the new AIO
contexts.

To avoid this, add the BDS pointer to the ignore list, and check the
child BDS pointer while iterating over the children.

Signed-off-by: Sergio Lopez 
---
 block.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index f1cedac362..bc8a66ab6e 100644
--- a/block.c
+++ b/block.c
@@ -6465,12 +6465,17 @@ void bdrv_set_aio_context_ignore(BlockDriverState *bs,
 bdrv_drained_begin(bs);
 
 QLIST_FOREACH(child, &bs->children, next) {
-if (g_slist_find(*ignore, child)) {
+if (g_slist_find(*ignore, child) || g_slist_find(*ignore, child->bs)) {
 continue;
 }
 *ignore = g_slist_prepend(*ignore, child);
 bdrv_set_aio_context_ignore(child->bs, new_context, ignore);
 }
+/*
+ * Add a reference to this BS to the ignore list, so its
+ * parents won't attempt to process it again.
+ */
+*ignore = g_slist_prepend(*ignore, bs);
 QLIST_FOREACH(child, &bs->parents, next_parent) {
 if (g_slist_find(*ignore, child)) {
 continue;
-- 
2.26.2




Re: [PATCH 2/2] nbd/server: Quiesce coroutines on context switch

2020-12-09 Thread Sergio Lopez
On Fri, Dec 04, 2020 at 12:39:07PM -0600, Eric Blake wrote:
> On 12/4/20 10:53 AM, Sergio Lopez wrote:
> > When switching between AIO contexts we need to make sure that both
> > recv_coroutine and send_coroutine are not scheduled to run. Otherwise,
> > QEMU may crash while attaching the new context with an error like
> > this one:
> > 
> > aio_co_schedule: Co-routine was already scheduled in 'aio_co_schedule'
> > 
> > To achieve this we need a local implementation of
> > 'qio_channel_readv_all_eof' named 'nbd_read_eof' (a trick already done
> > by 'nbd/client.c') that allows us to interrupt the operation and to
> > know when recv_coroutine is yielding.
> > 
> > With this in place, we delegate detaching the AIO context to the
> > owning context with a BH ('nbd_aio_detach_bh') scheduled using
> > 'aio_wait_bh_oneshot'. This BH signals that we need to quiesce the
> > channel by setting 'client->quiescing' to 'true', and either waits for
> > the coroutine to finish using AIO_WAIT_WHILE or, if it's yielding in
> > 'nbd_read_eof', actively enters the coroutine to interrupt it.
> > 
> > RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1900326
> > Signed-off-by: Sergio Lopez 
> > ---
> >  nbd/server.c | 120 +--
> >  1 file changed, 106 insertions(+), 14 deletions(-)
> 
> A complex patch, so I'd appreciate a second set of eyes.
> 
> > 
> > diff --git a/nbd/server.c b/nbd/server.c
> > index 613ed2634a..7229f487d2 100644
> > --- a/nbd/server.c
> > +++ b/nbd/server.c
> > @@ -132,6 +132,9 @@ struct NBDClient {
> >  CoMutex send_lock;
> >  Coroutine *send_coroutine;
> >  
> > +bool read_yielding;
> > +bool quiescing;
> 
> Will either of these fields need to be accessed atomically once the
> 'yank' code is added, or are we still safe with direct access because
> coroutines are not multithreaded?

Yes, those are only accessed from coroutines, which will be scheduled
on the same thread.

> > +
> >  QTAILQ_ENTRY(NBDClient) next;
> >  int nb_requests;
> >  bool closing;
> > @@ -1352,14 +1355,60 @@ static coroutine_fn int nbd_negotiate(NBDClient *client, Error **errp)
> >  return 0;
> >  }
> >  
> > -static int nbd_receive_request(QIOChannel *ioc, NBDRequest *request,
> > +/* nbd_read_eof
> > + * Tries to read @size bytes from @ioc. This is a local implementation of
> > + * qio_channel_readv_all_eof. We have it here because we need it to be
> > + * interruptible and to know when the coroutine is yielding.
> > + * Returns 1 on success
> > + * 0 on eof, when no data was read (errp is not set)
> > + * negative errno on failure (errp is set)
> > + */
> > +static inline int coroutine_fn
> > +nbd_read_eof(NBDClient *client, void *buffer, size_t size, Error **errp)
> > +{
> > +bool partial = false;
> > +
> > +assert(size);
> > +while (size > 0) {
> > +struct iovec iov = { .iov_base = buffer, .iov_len = size };
> > +ssize_t len;
> > +
> > +len = qio_channel_readv(client->ioc, &iov, 1, errp);
> > +if (len == QIO_CHANNEL_ERR_BLOCK) {
> > +client->read_yielding = true;
> > +qio_channel_yield(client->ioc, G_IO_IN);
> > +client->read_yielding = false;
> 
> nbd/client.c:nbd_read_eof() uses bdrv_dec/inc_in_flight instead of
> read_yielding...
> 
> > +if (client->quiescing) {
> > +return -EAGAIN;
> > +}
> 
> and the quiescing check is new; otherwise, these two functions look
> identical.  Having two static functions with the same name makes gdb a
> bit more annoying (which one of the two did you want your breakpoint
> on?).  Is there any way we could write this code only once in
> nbd/common.c for reuse by both client and server?  But I can live with
> it as written.

I'm not happy with this either, but in my first attempt I tried to
come up with a single function for both use cases, and it looked
terrible.

We can easily use a different name, though.

> > @@ -2151,20 +2223,23 @@ static int nbd_co_send_bitmap(NBDClient *client, uint64_t handle,
> >  
> >  /* nbd_co_receive_request
> >   * Collect a client request. Return 0 if request looks valid, -EIO to drop
> > - * connection right away, and any other negative value to report an error 
> > to
> > - * the client (although the caller may still need to disconnect after 
> > reporting
> > - * t

Re: [PATCH 1/2] virtio-blk: Acquire context while switching them on dataplane start

2020-12-09 Thread Sergio Lopez
On Mon, Dec 07, 2020 at 04:37:53PM +0100, Kevin Wolf wrote:
> Am 04.12.2020 um 17:53 hat Sergio Lopez geschrieben:
> > On dataplane start, acquire the new AIO context before calling
> > 'blk_set_aio_context', releasing it immediately afterwards. This
> > prevents reaching the AIO context attach/detach notifier functions
> > without having acquired it first.
> > 
> > It was also the only place where 'blk_set_aio_context' was called with
> > an unprotected AIO context.
> > 
> > Signed-off-by: Sergio Lopez 
> > ---
> >  hw/block/dataplane/virtio-blk.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
> > index 37499c5564..034e43cb1f 100644
> > --- a/hw/block/dataplane/virtio-blk.c
> > +++ b/hw/block/dataplane/virtio-blk.c
> > @@ -214,7 +214,9 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
> >  vblk->dataplane_started = true;
> >  trace_virtio_blk_data_plane_start(s);
> >  
> > +aio_context_acquire(s->ctx);
> > +r = blk_set_aio_context(s->conf->conf.blk, s->ctx, &local_err);
> > +aio_context_release(s->ctx);
> 
> bdrv_set_aio_context_ignore() is documented like this:
> 
>  * The caller must own the AioContext lock for the old AioContext of bs, but 
> it
>  * must not own the AioContext lock for new_context (unless new_context is the
>  * same as the current context of bs).

Does that rule apply to blk_set_aio_context too? All use cases I can
find in the code are acquiring the new context:

hw/block/dataplane/xen-block.c:
 719 void xen_block_dataplane_start(XenBlockDataPlane *dataplane,
 720const unsigned int ring_ref[],
 721unsigned int nr_ring_ref,
 722unsigned int event_channel,
 723unsigned int protocol,
 724Error **errp)
 725 {
 ...
 811 aio_context_acquire(dataplane->ctx);
 812 /* If other users keep the BlockBackend in the iothread, that's ok */
 813 blk_set_aio_context(dataplane->blk, dataplane->ctx, NULL);
 814 /* Only reason for failure is a NULL channel */
 815 xen_device_set_event_channel_context(xendev, dataplane->event_channel,
 816  dataplane->ctx, &error_abort);
 817 aio_context_release(dataplane->ctx);

hw/scsi/virtio-scsi.c:
 818 static void virtio_scsi_hotplug(HotplugHandler *hotplug_dev, DeviceState *dev,
 819 Error **errp)
 820 {
 ...
 830 virtio_scsi_acquire(s);
 831 ret = blk_set_aio_context(sd->conf.blk, s->ctx, errp);
 832 virtio_scsi_release(s);

Thanks,
Sergio.




[PATCH 1/2] virtio-blk: Acquire context while switching them on dataplane start

2020-12-04 Thread Sergio Lopez
On dataplane start, acquire the new AIO context before calling
'blk_set_aio_context', releasing it immediately afterwards. This
prevents reaching the AIO context attach/detach notifier functions
without having acquired it first.

It was also the only place where 'blk_set_aio_context' was called with
an unprotected AIO context.

Signed-off-by: Sergio Lopez 
---
 hw/block/dataplane/virtio-blk.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 37499c5564..034e43cb1f 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -214,7 +214,9 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
 vblk->dataplane_started = true;
 trace_virtio_blk_data_plane_start(s);
 
+aio_context_acquire(s->ctx);
+r = blk_set_aio_context(s->conf->conf.blk, s->ctx, &local_err);
+aio_context_release(s->ctx);
 if (r < 0) {
 error_report_err(local_err);
 goto fail_guest_notifiers;
-- 
2.26.2




[PATCH 2/2] nbd/server: Quiesce coroutines on context switch

2020-12-04 Thread Sergio Lopez
When switching between AIO contexts we need to make sure that both
recv_coroutine and send_coroutine are not scheduled to run. Otherwise,
QEMU may crash while attaching the new context with an error like
this one:

aio_co_schedule: Co-routine was already scheduled in 'aio_co_schedule'

To achieve this we need a local implementation of
'qio_channel_readv_all_eof' named 'nbd_read_eof' (a trick already done
by 'nbd/client.c') that allows us to interrupt the operation and to
know when recv_coroutine is yielding.

With this in place, we delegate detaching the AIO context to the
owning context with a BH ('nbd_aio_detach_bh') scheduled using
'aio_wait_bh_oneshot'. This BH signals that we need to quiesce the
channel by setting 'client->quiescing' to 'true', and either waits for
the coroutine to finish using AIO_WAIT_WHILE or, if it's yielding in
'nbd_read_eof', actively enters the coroutine to interrupt it.
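
The quiescing handshake described above can be modeled in a few lines of plain Python (threads standing in for coroutines, an `Event` standing in for `qio_channel_yield()`; this is an illustration of the flag protocol only, not QEMU code):

```python
import threading, time

class Client:
    def __init__(self):
        self.read_yielding = False
        self.quiescing = False
        self._wake = threading.Event()   # stands in for qio_channel_yield()
        self._done = threading.Event()   # "recv_coroutine == NULL"

    def recv_loop(self):
        while not self.quiescing:
            self.read_yielding = True     # about to block waiting for data
            self._wake.wait(timeout=0.1)
            self._wake.clear()
            self.read_yielding = False    # woke up; re-check quiescing
        self._done.set()                  # coroutine finished

    def detach(self):
        self.quiescing = True
        if self.read_yielding:
            self._wake.set()   # "enter the coroutine to interrupt it"
        self._done.wait()      # AIO_WAIT_WHILE(recv_coroutine != NULL)

client = Client()
t = threading.Thread(target=client.recv_loop)
t.start()
time.sleep(0.05)
client.detach()
t.join()
```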

RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1900326
Signed-off-by: Sergio Lopez 
---
 nbd/server.c | 120 +--
 1 file changed, 106 insertions(+), 14 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 613ed2634a..7229f487d2 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -132,6 +132,9 @@ struct NBDClient {
 CoMutex send_lock;
 Coroutine *send_coroutine;
 
+bool read_yielding;
+bool quiescing;
+
 QTAILQ_ENTRY(NBDClient) next;
 int nb_requests;
 bool closing;
@@ -1352,14 +1355,60 @@ static coroutine_fn int nbd_negotiate(NBDClient *client, Error **errp)
 return 0;
 }
 
-static int nbd_receive_request(QIOChannel *ioc, NBDRequest *request,
+/* nbd_read_eof
+ * Tries to read @size bytes from @ioc. This is a local implementation of
+ * qio_channel_readv_all_eof. We have it here because we need it to be
+ * interruptible and to know when the coroutine is yielding.
+ * Returns 1 on success
+ * 0 on eof, when no data was read (errp is not set)
+ * negative errno on failure (errp is set)
+ */
+static inline int coroutine_fn
+nbd_read_eof(NBDClient *client, void *buffer, size_t size, Error **errp)
+{
+bool partial = false;
+
+assert(size);
+while (size > 0) {
+struct iovec iov = { .iov_base = buffer, .iov_len = size };
+ssize_t len;
+
+len = qio_channel_readv(client->ioc, &iov, 1, errp);
+if (len == QIO_CHANNEL_ERR_BLOCK) {
+client->read_yielding = true;
+qio_channel_yield(client->ioc, G_IO_IN);
+client->read_yielding = false;
+if (client->quiescing) {
+return -EAGAIN;
+}
+continue;
+} else if (len < 0) {
+return -EIO;
+} else if (len == 0) {
+if (partial) {
+error_setg(errp,
+   "Unexpected end-of-file before all bytes were read");
+return -EIO;
+} else {
+return 0;
+}
+}
+
+partial = true;
+size -= len;
+buffer = (uint8_t *) buffer + len;
+}
+return 1;
+}
+
+static int nbd_receive_request(NBDClient *client, NBDRequest *request,
Error **errp)
 {
 uint8_t buf[NBD_REQUEST_SIZE];
 uint32_t magic;
 int ret;
 
-ret = nbd_read(ioc, buf, sizeof(buf), "request", errp);
+ret = nbd_read_eof(client, buf, sizeof(buf), errp);
 if (ret < 0) {
 return ret;
 }
@@ -1480,11 +1529,37 @@ static void blk_aio_attached(AioContext *ctx, void *opaque)
 
QTAILQ_FOREACH(client, &exp->clients, next) {
 qio_channel_attach_aio_context(client->ioc, ctx);
+
+assert(client->recv_coroutine == NULL);
+assert(client->send_coroutine == NULL);
+
+if (client->quiescing) {
+client->quiescing = false;
+nbd_client_receive_next_request(client);
+}
+}
+}
+
+static void nbd_aio_detach_bh(void *opaque)
+{
+NBDExport *exp = opaque;
+NBDClient *client;
+
+QTAILQ_FOREACH(client, &exp->clients, next) {
+qio_channel_detach_aio_context(client->ioc);
+client->quiescing = true;
+
 if (client->recv_coroutine) {
-aio_co_schedule(ctx, client->recv_coroutine);
+if (client->read_yielding) {
+qemu_aio_coroutine_enter(exp->common.ctx,
+ client->recv_coroutine);
+} else {
+AIO_WAIT_WHILE(exp->common.ctx, client->recv_coroutine != NULL);
+}
 }
+
 if (client->send_coroutine) {
-aio_co_schedule(ctx, client->send_coroutine);
+AIO_WAIT_WHILE(exp->common.ctx, client->send_coroutine != NULL);
 }
 }
 }
@@ -1492,13 +1567,10 @@ static void blk_aio_attached(AioContext *ctx, void *opaque)
 static void blk_aio_detach(void *opaque)
 {
 

[PATCH 0/2] nbd/server: Quiesce coroutines on context switch

2020-12-04 Thread Sergio Lopez
This series allows the NBD server to properly switch between AIO contexts,
having quiesced recv_coroutine and send_coroutine before doing the transition.

We need this because we send back devices running in IOThread-owned contexts
to the main context when stopping the data plane, something that can happen
multiple times during the lifetime of a VM (usually during the boot sequence or
on a reboot), and we drag the NBD server of the corresponding export along with it.

Sergio Lopez (2):
  virtio-blk: Acquire context while switching them on dataplane start
  nbd/server: Quiesce coroutines on context switch

 hw/block/dataplane/virtio-blk.c |   2 +
 nbd/server.c| 120 
 2 files changed, 108 insertions(+), 14 deletions(-)

-- 
2.26.2





Re: Libvirt driver iothread property for virtio-scsi disks

2020-11-04 Thread Sergio Lopez
On Wed, Nov 04, 2020 at 05:48:40PM +0200, Nir Soffer wrote:
> The docs[1] say:
> 
> - The optional iothread attribute assigns the disk to an IOThread as defined 
> by
>   the range for the domain iothreads value. Multiple disks may be assigned to
>   the same IOThread and are numbered from 1 to the domain iothreads value.
>   Available for a disk device target configured to use "virtio" bus and "pci"
>   or "ccw" address types. Since 1.2.8 (QEMU 2.1)
> 
> Does it mean that virtio-scsi disks do not use iothreads?

virtio-scsi disks can use iothreads, but they are configured on the
SCSI controller, not on the disk itself. All disks attached to the
same controller share that controller's iothread, but you can also attach
multiple controllers.
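
For reference, the controller-level setup could look like this hypothetical libvirt domain snippet (indexes and device names are placeholders): the `<driver iothread='1'/>` element on the virtio-scsi controller assigns it to iothread 1, and every disk addressed through controller 0 then shares that iothread:

```xml
<domain type='kvm'>
  <iothreads>2</iothreads>
  <devices>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <driver iothread='1'/>
    </controller>
    <disk type='file' device='disk'>
      <target dev='sda' bus='scsi'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
  </devices>
</domain>
```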

> I'm experiencing a horrible performance using nested vms (up to 2 levels of
> nesting) when accessing NFS storage running on one of the VMs. The NFS
> server is using scsi disk.
> 
> My theory is:
> - Writing to NFS server is very slow (too much nesting, slow disk)
> - Not using iothreads (because we don't use virtio?)
> - Guest CPU is blocked by slow I/O

I would rule out the lack of iothreads as the culprit. They do improve
performance, but without them performance should still be quite
decent. Something else is probably causing the trouble.

I would do a step-by-step analysis, testing the NFS performance from
outside the VM first, and then working upwards from there.
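
For that step-by-step measurement, even a crude, hypothetical helper like the one below can expose the slow layer: run it against a path on the NFS mount from the host first, then from each nesting level, and compare the numbers (path and sizes are placeholders; fio or dd would give more detailed results):

```python
import os, time

def write_throughput(path, total_mb=16, block_kb=64):
    """Write total_mb of zeros to 'path' in block_kb chunks, return MiB/s."""
    block = b"\0" * (block_kb * 1024)
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(total_mb * 1024 // block_kb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually hits the backend
    elapsed = time.monotonic() - start
    os.unlink(path)
    return total_mb / elapsed

# e.g. against the NFS mount: write_throughput("/mnt/nfs/testfile")
```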

Sergio.

> Does this make sense?
> 
> [1] https://libvirt.org/formatdomain.html#hard-drives-floppy-disks-cdroms
> 
> Nir
> 




Re: virtio-scsi and another complex AioContext issue

2020-06-24 Thread Sergio Lopez
On Tue, Jun 23, 2020 at 03:24:54PM +0100, Stefan Hajnoczi wrote:
> On Mon, Jun 22, 2020 at 04:16:04PM +0200, Sergio Lopez wrote:
> > On Fri, Jun 19, 2020 at 02:04:06PM +0100, Stefan Hajnoczi wrote:
> > > On Thu, Jun 11, 2020 at 10:36:22AM +0200, Sergio Lopez wrote:
> > > > Hi,
> > > >
> > > > While debugging BZ#1844343, I managed to reproduce the issue which
> > > > leads to crash with a backtrace like this one:
> > > >
> > > > < snip >
> > > > Thread 2 (Thread 0x7fe208463f00 (LWP 1659571)):
> > > > #0  0x7fe2033b78ed in __lll_lock_wait () at /lib64/libpthread.so.0
> > > > #1  0x7fe2033b0bd4 in pthread_mutex_lock () at 
> > > > /lib64/libpthread.so.0
> > > > #2  0x560caa8f1e6d in qemu_mutex_lock_impl
> > > > (mutex=0x560cacc68a10, file=0x560caaa9797f "util/async.c", 
> > > > line=521) at util/qemu-thread-posix.c:78
> > > > #3  0x560caa82414d in bdrv_set_aio_context_ignore
> > > > (bs=bs@entry=0x560cacc73570, 
> > > > new_context=new_context@entry=0x560cacc5fed0, 
> > > > ignore=ignore@entry=0x7ffe388b1cc0) at block.c:6192
> > > > #4  0x560caa824503 in bdrv_child_try_set_aio_context
> > > > (bs=bs@entry=0x560cacc73570, ctx=0x560cacc5fed0, 
> > > > ignore_child=, errp=)
> > > > at block.c:6272
> > > > #5  0x560caa859e6b in blk_do_set_aio_context
> > > > (blk=0x560cacecf370, new_context=0x560cacc5fed0, 
> > > > update_root_node=update_root_node@entry=true, errp=errp@entry=0x0) at 
> > > > block/block-backend.c:1989
> > > > #6  0x560caa85c501 in blk_set_aio_context
> > > > (blk=, new_context=, 
> > > > errp=errp@entry=0x0) at block/block-backend.c:2010
> > > > #7  0x560caa61db30 in virtio_scsi_hotunplug
> > > > (hotplug_dev=0x560cadaafbd0, dev=0x560cacec1210, 
> > > > errp=0x7ffe388b1d80)
> > > > at 
> > > > /usr/src/debug/qemu-kvm-4.2.0-22.module+el8.2.1+6758+cb8d64c2.x86_64/hw/scsi/virtio-scsi.c:869
> > > > #8  0x560caa6ccd1e in qdev_unplug (dev=0x560cacec1210, 
> > > > errp=errp@entry=0x7ffe388b1db8)
> > > > at qdev-monitor.c:872
> > > > #9  0x560caa6ccd9e in qmp_device_del (id=, 
> > > > errp=errp@entry=0x7ffe388b1db8)
> > > > at qdev-monitor.c:884
> > > > #10 0x560caa7ec4d3 in qmp_marshal_device_del
> > > > (args=, ret=, errp=0x7ffe388b1e18) at 
> > > > qapi/qapi-commands-qdev.c:99
> > > > #11 0x560caa8a45ec in do_qmp_dispatch
> > > > (errp=0x7ffe388b1e10, allow_oob=, request= > > > out>, cmds=0x560cab1928a0 ) at qapi/qmp-dispatch.c:132
> > > > #12 0x560caa8a45ec in qmp_dispatch
> > > > (cmds=0x560cab1928a0 , request=, 
> > > > allow_oob=)
> > > > at qapi/qmp-dispatch.c:175
> > > > #13 0x560caa7c2521 in monitor_qmp_dispatch (mon=0x560cacca2f00, 
> > > > req=)
> > > > at monitor/qmp.c:145
> > > > #14 0x560caa7c2bba in monitor_qmp_bh_dispatcher (data= > > > out>) at monitor/qmp.c:234
> > > > #15 0x560caa8ec716 in aio_bh_call (bh=0x560cacbd80e0) at 
> > > > util/async.c:117
> > > > #16 0x560caa8ec716 in aio_bh_poll (ctx=ctx@entry=0x560cacbd6da0) at 
> > > > util/async.c:117
> > > > #17 0x560caa8efb04 in aio_dispatch (ctx=0x560cacbd6da0) at 
> > > > util/aio-posix.c:459
> > > > #18 0x560caa8ec5f2 in aio_ctx_dispatch
> > > > (source=, callback=, 
> > > > user_data=) at util/async.c:260
> > > > #19 0x7fe2078d167d in g_main_context_dispatch () at 
> > > > /lib64/libglib-2.0.so.0
> > > > #20 0x560caa8eebb8 in glib_pollfds_poll () at util/main-loop.c:219
> > > > #21 0x560caa8eebb8 in os_host_main_loop_wait (timeout= > > > out>) at util/main-loop.c:242
> > > > #22 0x560caa8eebb8 in main_loop_wait (nonblocking=) 
> > > > at util/main-loop.c:518
> > > > #23 0x560caa6cfe51 in main_loop () at vl.c:1828
> > > > #24 0x560caa57b322 in main (argc=, argv= > > > out>, envp=)
> > > > at vl.c:4504
> > > >
> > > > Thread 1 (Thread 0x7fe1fb059700 (LWP 1659573)):
> > > > #0  0x7fe20301b70f in raise () at /lib64/libc.so.6
> > > > #1  0x7fe203005b25 in abort () at /lib64/libc.so.6
> > > &g

Re: virtio-scsi and another complex AioContext issue

2020-06-22 Thread Sergio Lopez
On Fri, Jun 19, 2020 at 02:04:06PM +0100, Stefan Hajnoczi wrote:
> On Thu, Jun 11, 2020 at 10:36:22AM +0200, Sergio Lopez wrote:
> > Hi,
> >
> > While debugging BZ#1844343, I managed to reproduce the issue which
> > leads to crash with a backtrace like this one:
> >
> > < snip >
> > Thread 2 (Thread 0x7fe208463f00 (LWP 1659571)):
> > #0  0x7fe2033b78ed in __lll_lock_wait () at /lib64/libpthread.so.0
> > #1  0x7fe2033b0bd4 in pthread_mutex_lock () at /lib64/libpthread.so.0
> > #2  0x560caa8f1e6d in qemu_mutex_lock_impl
> > (mutex=0x560cacc68a10, file=0x560caaa9797f "util/async.c", line=521) at 
> > util/qemu-thread-posix.c:78
> > #3  0x560caa82414d in bdrv_set_aio_context_ignore
> > (bs=bs@entry=0x560cacc73570, 
> > new_context=new_context@entry=0x560cacc5fed0, 
> > ignore=ignore@entry=0x7ffe388b1cc0) at block.c:6192
> > #4  0x560caa824503 in bdrv_child_try_set_aio_context
> > (bs=bs@entry=0x560cacc73570, ctx=0x560cacc5fed0, 
> > ignore_child=, errp=)
> > at block.c:6272
> > #5  0x560caa859e6b in blk_do_set_aio_context
> > (blk=0x560cacecf370, new_context=0x560cacc5fed0, 
> > update_root_node=update_root_node@entry=true, errp=errp@entry=0x0) at 
> > block/block-backend.c:1989
> > #6  0x560caa85c501 in blk_set_aio_context
> > (blk=, new_context=, errp=errp@entry=0x0) 
> > at block/block-backend.c:2010
> > #7  0x560caa61db30 in virtio_scsi_hotunplug
> > (hotplug_dev=0x560cadaafbd0, dev=0x560cacec1210, errp=0x7ffe388b1d80)
> > at 
> > /usr/src/debug/qemu-kvm-4.2.0-22.module+el8.2.1+6758+cb8d64c2.x86_64/hw/scsi/virtio-scsi.c:869
> > #8  0x560caa6ccd1e in qdev_unplug (dev=0x560cacec1210, 
> > errp=errp@entry=0x7ffe388b1db8)
> > at qdev-monitor.c:872
> > #9  0x560caa6ccd9e in qmp_device_del (id=, 
> > errp=errp@entry=0x7ffe388b1db8)
> > at qdev-monitor.c:884
> > #10 0x560caa7ec4d3 in qmp_marshal_device_del
> > (args=, ret=, errp=0x7ffe388b1e18) at 
> > qapi/qapi-commands-qdev.c:99
> > #11 0x560caa8a45ec in do_qmp_dispatch
> > (errp=0x7ffe388b1e10, allow_oob=, request= > out>, cmds=0x560cab1928a0 ) at qapi/qmp-dispatch.c:132
> > #12 0x560caa8a45ec in qmp_dispatch
> > (cmds=0x560cab1928a0 , request=, 
> > allow_oob=)
> > at qapi/qmp-dispatch.c:175
> > #13 0x560caa7c2521 in monitor_qmp_dispatch (mon=0x560cacca2f00, 
> > req=)
> > at monitor/qmp.c:145
> > #14 0x560caa7c2bba in monitor_qmp_bh_dispatcher (data=) 
> > at monitor/qmp.c:234
> > #15 0x560caa8ec716 in aio_bh_call (bh=0x560cacbd80e0) at 
> > util/async.c:117
> > #16 0x560caa8ec716 in aio_bh_poll (ctx=ctx@entry=0x560cacbd6da0) at 
> > util/async.c:117
> > #17 0x560caa8efb04 in aio_dispatch (ctx=0x560cacbd6da0) at 
> > util/aio-posix.c:459
> > #18 0x560caa8ec5f2 in aio_ctx_dispatch
> > (source=, callback=, user_data= > out>) at util/async.c:260
> > #19 0x7fe2078d167d in g_main_context_dispatch () at 
> > /lib64/libglib-2.0.so.0
> > #20 0x560caa8eebb8 in glib_pollfds_poll () at util/main-loop.c:219
> > #21 0x560caa8eebb8 in os_host_main_loop_wait (timeout=) 
> > at util/main-loop.c:242
> > #22 0x560caa8eebb8 in main_loop_wait (nonblocking=) at 
> > util/main-loop.c:518
> > #23 0x560caa6cfe51 in main_loop () at vl.c:1828
> > #24 0x560caa57b322 in main (argc=, argv=, 
> > envp=)
> > at vl.c:4504
> >
> > Thread 1 (Thread 0x7fe1fb059700 (LWP 1659573)):
> > #0  0x7fe20301b70f in raise () at /lib64/libc.so.6
> > #1  0x7fe203005b25 in abort () at /lib64/libc.so.6
> > #2  0x7fe2030059f9 in _nl_load_domain.cold.0 () at /lib64/libc.so.6
> > #3  0x7fe203013cc6 in .annobin_assert.c_end () at /lib64/libc.so.6
> > #4  0x560caa85bfe4 in blk_get_aio_context (blk=0x560cacecf370) at 
> > block/block-backend.c:1968
> > #5  0x560caa85bfe4 in blk_get_aio_context (blk=0x560cacecf370) at 
> > block/block-backend.c:1962
> > #6  0x560caa61d79c in virtio_scsi_ctx_check (s=0x560cadaafbd0, 
> > s=0x560cadaafbd0, d=0x560cacec1210)
> > at 
> > /usr/src/debug/qemu-kvm-4.2.0-22.module+el8.2.1+6758+cb8d64c2.x86_64/hw/scsi/virtio-scsi.c:250
> > #7  0x560caa61d79c in virtio_scsi_handle_cmd_req_prepare 
> > (req=0x7fe1ec013880, s=0x560cadaafbd0)
> > at 
> > /usr/src/debug/qemu-kvm-4.2.0-22.module+el8.2.1+6758+cb8d64c2.x86_64/hw/scsi/virtio-scsi.c:569
> > #8  0x560caa61d79c in virtio_scsi_handle_cmd_vq
> > (s=s@entr

Re: [PATCH v2 1/7] block/nvme: poll queues without q->lock

2020-06-22 Thread Sergio Lopez
On Wed, Jun 17, 2020 at 02:21:55PM +0100, Stefan Hajnoczi wrote:
> A lot of CPU time is spent simply locking/unlocking q->lock during
> polling. Check for completion outside the lock to make q->lock disappear
> from the profile.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  block/nvme.c | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index eb2f54dd9d..e4375ec245 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -512,6 +512,18 @@ static bool nvme_poll_queues(BDRVNVMeState *s)
>  
>  for (i = 0; i < s->nr_queues; i++) {
>  NVMeQueuePair *q = s->queues[i];
> +const size_t cqe_offset = q->cq.head * NVME_CQ_ENTRY_BYTES;
> +NvmeCqe *cqe = (NvmeCqe *)&q->cq.queue[cqe_offset];
> +
> +/*
> + * Do an early check for completions. q->lock isn't needed because
> + * nvme_process_completion() only runs in the event loop thread and
> + * cannot race with itself.
> + */
> +if ((le16_to_cpu(cqe->status) & 0x1) == q->cq_phase) {
> +continue;
> +}
> +
>  qemu_mutex_lock(&q->lock);
>  while (nvme_process_completion(s, q)) {
>  /* Keep polling */
> -- 
> 2.26.2
>

Thanks for extending the comment.

Reviewed-by: Sergio Lopez 
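
The convention behind that early check, as the patch applies it, can be stated in one line: the device posts a fresh completion-queue entry with the phase bit flipped relative to the host's recorded `cq_phase`, so an entry whose bit still matches `cq_phase` is stale and the locked path can be skipped. A tiny Python restatement (illustrative only):

```python
def queue_has_completion(cqe_status: int, cq_phase: int) -> bool:
    # Mirrors the patch's check: a status phase bit that still equals the
    # host's recorded cq_phase means the entry is stale ('continue' in the
    # patch); a flipped bit means a new completion landed and it is worth
    # taking q->lock to process it.
    return (cqe_status & 0x1) != cq_phase
```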




[PATCH 0/2] virtio-blk: Avoid processing requests on the main context on restart

2020-06-03 Thread Sergio Lopez
On restart, we were scheduling a BH to process queued requests, which
would run before starting up the data plane, leading to those requests
being assigned and started on coroutines on the main context.

This could cause requests to be wrongly processed in parallel from
different threads (the main thread and the iothread managing the data
plane), potentially leading to multiple issues.
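
The bug class and the fix can be modeled with a toy event loop in Python (made-up names; each `AioContext` is just a thread draining a callback queue): queued requests must be restarted via a BH on the context that owns the dataplane, so only that context's thread ever touches them:

```python
import queue, threading

class AioContext:
    """Toy stand-in for an AIO context: one thread draining a BH queue."""
    def __init__(self, name):
        self.name = name
        self._bhs = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            bh = self._bhs.get()
            bh(self)           # run the callback in this context's thread

    def schedule_bh(self, bh):
        self._bhs.put(bh)

main_ctx = AioContext("main")
iothread_ctx = AioContext("iothread")

ran_in = []
done = threading.Event()

def process_queued_requests(ctx):
    ran_in.append(ctx.name)    # record which context actually ran the work
    done.set()

# The fix: schedule the restart BH on the context that owns the dataplane
# (the iothread), not unconditionally on the main context.
device_ctx = iothread_ctx
device_ctx.schedule_bh(process_queued_requests)
done.wait(timeout=2)
```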

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1812765

See "virtio-blk: Disable request queuing while switching contexts" for
previous discussion:

 - https://lists.gnu.org/archive/html/qemu-devel/2020-06/msg00304.html

Sergio Lopez (2):
  virtio-blk: Refactor the code that processes queued requests
  virtio-blk: On restart, process queued requests in the proper context

 include/hw/virtio/virtio-blk.h  |  1 +
 hw/block/dataplane/virtio-blk.c |  8 
 hw/block/virtio-blk.c   | 30 +-
 3 files changed, 30 insertions(+), 9 deletions(-)

-- 
2.26.2





[PATCH 1/2] virtio-blk: Refactor the code that processes queued requests

2020-06-03 Thread Sergio Lopez
Move the code that processes queued requests from
virtio_blk_dma_restart_bh() to its own, non-static, function. This
will allow us to call it from the virtio_blk_data_plane_start() in a
future patch.

Signed-off-by: Sergio Lopez 
---
 include/hw/virtio/virtio-blk.h |  1 +
 hw/block/virtio-blk.c  | 16 +++-
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index 1e62f869b2..f584ad9b86 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -86,5 +86,6 @@ typedef struct MultiReqBuffer {
 } MultiReqBuffer;
 
 bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq);
+void virtio_blk_process_queued_requests(VirtIOBlock *s);
 
 #endif
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index f5f6fc925e..978574e4da 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -819,15 +819,11 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 virtio_blk_handle_output_do(s, vq);
 }
 
-static void virtio_blk_dma_restart_bh(void *opaque)
+void virtio_blk_process_queued_requests(VirtIOBlock *s)
 {
-VirtIOBlock *s = opaque;
 VirtIOBlockReq *req = s->rq;
 MultiReqBuffer mrb = {};
 
-qemu_bh_delete(s->bh);
-s->bh = NULL;
-
 s->rq = NULL;
 
 aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
@@ -855,6 +851,16 @@ static void virtio_blk_dma_restart_bh(void *opaque)
 aio_context_release(blk_get_aio_context(s->conf.conf.blk));
 }
 
+static void virtio_blk_dma_restart_bh(void *opaque)
+{
+VirtIOBlock *s = opaque;
+
+qemu_bh_delete(s->bh);
+s->bh = NULL;
+
+virtio_blk_process_queued_requests(s);
+}
+
 static void virtio_blk_dma_restart_cb(void *opaque, int running,
   RunState state)
 {
-- 
2.26.2




[PATCH 2/2] virtio-blk: On restart, process queued requests in the proper context

2020-06-03 Thread Sergio Lopez
On restart, we were scheduling a BH to process queued requests, which
would run before starting up the data plane, leading to those requests
being assigned and started on coroutines on the main context.

This could cause requests to be wrongly processed in parallel from
different threads (the main thread and the iothread managing the data
plane), potentially leading to multiple issues.

For example, stopping and resuming a VM multiple times while the guest
is generating I/O on a virtio_blk device can trigger a crash with a
stack tracing looking like this one:

<-->
 Thread 2 (Thread 0x7ff736765700 (LWP 1062503)):
 #0  0x5567a13b99d6 in iov_memset
 (iov=0x6563617073206f4e, iov_cnt=1717922848, offset=516096, fillc=0, 
bytes=7018105756081554803)
 at util/iov.c:69
 #1  0x5567a13bab73 in qemu_iovec_memset
 (qiov=0x7ff73ec99748, offset=516096, fillc=0, bytes=7018105756081554803) 
at util/iov.c:530
 #2  0x5567a12f411c in qemu_laio_process_completion (laiocb=0x7ff6512ee6c0) 
at block/linux-aio.c:86
 #3  0x5567a12f42ff in qemu_laio_process_completions (s=0x7ff7182e8420) at 
block/linux-aio.c:217
 #4  0x5567a12f480d in ioq_submit (s=0x7ff7182e8420) at 
block/linux-aio.c:323
 #5  0x5567a12f43d9 in qemu_laio_process_completions_and_submit 
(s=0x7ff7182e8420)
 at block/linux-aio.c:236
 #6  0x5567a12f44c2 in qemu_laio_poll_cb (opaque=0x7ff7182e8430) at 
block/linux-aio.c:267
 #7  0x5567a13aed83 in run_poll_handlers_once (ctx=0x5567a2b58c70, 
timeout=0x7ff7367645f8)
 at util/aio-posix.c:520
 #8  0x5567a13aee9f in run_poll_handlers (ctx=0x5567a2b58c70, max_ns=16000, 
timeout=0x7ff7367645f8)
 at util/aio-posix.c:562
 #9  0x5567a13aefde in try_poll_mode (ctx=0x5567a2b58c70, 
timeout=0x7ff7367645f8)
 at util/aio-posix.c:597
 #10 0x5567a13af115 in aio_poll (ctx=0x5567a2b58c70, blocking=true) at 
util/aio-posix.c:639
 #11 0x5567a109acca in iothread_run (opaque=0x5567a2b29760) at iothread.c:75
 #12 0x5567a13b2790 in qemu_thread_start (args=0x5567a2b694c0) at 
util/qemu-thread-posix.c:519
 #13 0x7ff73eedf2de in start_thread () at /lib64/libpthread.so.0
 #14 0x7ff73ec10e83 in clone () at /lib64/libc.so.6

 Thread 1 (Thread 0x7ff743986f00 (LWP 1062500)):
 #0  0x5567a13b99d6 in iov_memset
 (iov=0x6563617073206f4e, iov_cnt=1717922848, offset=516096, fillc=0, 
bytes=7018105756081554803)
 at util/iov.c:69
 #1  0x5567a13bab73 in qemu_iovec_memset
 (qiov=0x7ff73ec99748, offset=516096, fillc=0, bytes=7018105756081554803) 
at util/iov.c:530
 #2  0x5567a12f411c in qemu_laio_process_completion (laiocb=0x7ff6512ee6c0) 
at block/linux-aio.c:86
 #3  0x5567a12f42ff in qemu_laio_process_completions (s=0x7ff7182e8420) at 
block/linux-aio.c:217
 #4  0x5567a12f480d in ioq_submit (s=0x7ff7182e8420) at 
block/linux-aio.c:323
 #5  0x5567a12f4a2f in laio_do_submit (fd=19, laiocb=0x7ff5f4ff9ae0, 
offset=472363008, type=2)
 at block/linux-aio.c:375
 #6  0x5567a12f4af2 in laio_co_submit
 (bs=0x5567a2b8c460, s=0x7ff7182e8420, fd=19, offset=472363008, 
qiov=0x7ff5f4ff9ca0, type=2)
 at block/linux-aio.c:394
 #7  0x5567a12f1803 in raw_co_prw
 (bs=0x5567a2b8c460, offset=472363008, bytes=20480, qiov=0x7ff5f4ff9ca0, 
type=2)
 at block/file-posix.c:1892
 #8  0x5567a12f1941 in raw_co_pwritev
 (bs=0x5567a2b8c460, offset=472363008, bytes=20480, qiov=0x7ff5f4ff9ca0, 
flags=0)
 at block/file-posix.c:1925
 #9  0x5567a12fe3e1 in bdrv_driver_pwritev
 (bs=0x5567a2b8c460, offset=472363008, bytes=20480, qiov=0x7ff5f4ff9ca0, 
qiov_offset=0, flags=0)
 at block/io.c:1183
 #10 0x5567a1300340 in bdrv_aligned_pwritev
 (child=0x5567a2b5b070, req=0x7ff5f4ff9db0, offset=472363008, bytes=20480, 
align=512, qiov=0x7ff72c0425b8, qiov_offset=0, flags=0) at block/io.c:1980
 #11 0x5567a1300b29 in bdrv_co_pwritev_part
 (child=0x5567a2b5b070, offset=472363008, bytes=20480, qiov=0x7ff72c0425b8, 
qiov_offset=0, flags=0)
 at block/io.c:2137
 #12 0x5567a12baba1 in qcow2_co_pwritev_task
 (bs=0x5567a2b92740, file_cluster_offset=472317952, offset=487305216, 
bytes=20480, qiov=0x7ff72c0425b8, qiov_offset=0, l2meta=0x0) at 
block/qcow2.c:2444
 #13 0x5567a12bacdb in qcow2_co_pwritev_task_entry (task=0x5567a2b48540) at 
block/qcow2.c:2475
 #14 0x5567a13167d8 in aio_task_co (opaque=0x5567a2b48540) at 
block/aio_task.c:45
 #15 0x5567a13cf00c in coroutine_trampoline (i0=738245600, i1=32759) at 
util/coroutine-ucontext.c:115
 #16 0x7ff73eb622e0 in __start_context () at /lib64/libc.so.6
 #17 0x7ff6626f1350 in  ()
 #18 0x in  ()
<-->

This is also known to cause crashes with this message (assertion
failed):

 aio_co_schedule: Co-routine was already scheduled in 'aio_co_schedule'

RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1812765
Signed-off-by: Sergio Lopez 
---
 include/hw/virtio/virtio-blk.h  |  2 +-
 hw/block/dataplane/virtio-blk.c |  8 

Re: [PATCH] virtio-blk: Disable request queuing while switching contexts

2020-06-02 Thread Sergio Lopez
On Tue, Jun 02, 2020 at 03:04:33PM +0200, Kevin Wolf wrote:
> Am 02.06.2020 um 14:18 hat Sergio Lopez geschrieben:
> > On Tue, Jun 02, 2020 at 01:23:14PM +0200, Kevin Wolf wrote:
> > > Am 02.06.2020 um 10:11 hat Sergio Lopez geschrieben:
> > > > Disable request queuing while switching contexts on
> > > > virtio_blk_data_plane_[start|stop](), preventing requests from getting
> > > > queued on the wrong context.
> > > >
> > > > Placing requests on the wrong context may lead to them being wrongly
> > > > accessed in parallel from different threads, potentially leading to
> > > > multiple issues.
> > > >
> > > > For example, stopping and resuming a VM multiple times while the guest
> > > > is generating I/O on a virtio_blk device can trigger a crash with a
> > > > stack tracing looking like this one:
> > > >
> > > > <-->
> > > >  Thread 2 (Thread 0x7ff736765700 (LWP 1062503)):
> > > >  #0  0x5567a13b99d6 in iov_memset
> > > >  (iov=0x6563617073206f4e, iov_cnt=1717922848, offset=516096, 
> > > > fillc=0, bytes=7018105756081554803)
> > > >  at util/iov.c:69
> > > >  #1  0x5567a13bab73 in qemu_iovec_memset
> > > >  (qiov=0x7ff73ec99748, offset=516096, fillc=0, 
> > > > bytes=7018105756081554803) at util/iov.c:530
> > > >  #2  0x5567a12f411c in qemu_laio_process_completion 
> > > > (laiocb=0x7ff6512ee6c0) at block/linux-aio.c:86
> > > >  #3  0x5567a12f42ff in qemu_laio_process_completions 
> > > > (s=0x7ff7182e8420) at block/linux-aio.c:217
> > > >  #4  0x5567a12f480d in ioq_submit (s=0x7ff7182e8420) at 
> > > > block/linux-aio.c:323
> > > >  #5  0x5567a12f43d9 in qemu_laio_process_completions_and_submit 
> > > > (s=0x7ff7182e8420)
> > > >  at block/linux-aio.c:236
> > > >  #6  0x5567a12f44c2 in qemu_laio_poll_cb (opaque=0x7ff7182e8430) at 
> > > > block/linux-aio.c:267
> > > >  #7  0x5567a13aed83 in run_poll_handlers_once (ctx=0x5567a2b58c70, 
> > > > timeout=0x7ff7367645f8)
> > > >  at util/aio-posix.c:520
> > > >  #8  0x5567a13aee9f in run_poll_handlers (ctx=0x5567a2b58c70, 
> > > > max_ns=16000, timeout=0x7ff7367645f8)
> > > >  at util/aio-posix.c:562
> > > >  #9  0x5567a13aefde in try_poll_mode (ctx=0x5567a2b58c70, 
> > > > timeout=0x7ff7367645f8)
> > > >  at util/aio-posix.c:597
> > > >  #10 0x5567a13af115 in aio_poll (ctx=0x5567a2b58c70, blocking=true) 
> > > > at util/aio-posix.c:639
> > > >  #11 0x5567a109acca in iothread_run (opaque=0x5567a2b29760) at 
> > > > iothread.c:75
> > > >  #12 0x5567a13b2790 in qemu_thread_start (args=0x5567a2b694c0) at 
> > > > util/qemu-thread-posix.c:519
> > > >  #13 0x7ff73eedf2de in start_thread () at /lib64/libpthread.so.0
> > > >  #14 0x7ff73ec10e83 in clone () at /lib64/libc.so.6
> > > >
> > > >  Thread 1 (Thread 0x7ff743986f00 (LWP 1062500)):
> > > >  #0  0x5567a13b99d6 in iov_memset
> > > >  (iov=0x6563617073206f4e, iov_cnt=1717922848, offset=516096, 
> > > > fillc=0, bytes=7018105756081554803)
> > > >  at util/iov.c:69
> > > >  #1  0x5567a13bab73 in qemu_iovec_memset
> > > >  (qiov=0x7ff73ec99748, offset=516096, fillc=0, 
> > > > bytes=7018105756081554803) at util/iov.c:530
> > > >  #2  0x5567a12f411c in qemu_laio_process_completion 
> > > > (laiocb=0x7ff6512ee6c0) at block/linux-aio.c:86
> > > >  #3  0x5567a12f42ff in qemu_laio_process_completions 
> > > > (s=0x7ff7182e8420) at block/linux-aio.c:217
> > > >  #4  0x5567a12f480d in ioq_submit (s=0x7ff7182e8420) at 
> > > > block/linux-aio.c:323
> > > >  #5  0x5567a12f4a2f in laio_do_submit (fd=19, 
> > > > laiocb=0x7ff5f4ff9ae0, offset=472363008, type=2)
> > > >  at block/linux-aio.c:375
> > > >  #6  0x5567a12f4af2 in laio_co_submit
> > > >  (bs=0x5567a2b8c460, s=0x7ff7182e8420, fd=19, offset=472363008, 
> > > > qiov=0x7ff5f4ff9ca0, type=2)
> > > >  at block/linux-aio.c:394
> > > >  #7  0x5567a12f1803 in raw_co_prw
> > > >  (bs=0x5567a2b8c460, offset=472363008, bytes=20480, 
> > > > qiov=0x7ff5f4ff9ca0, type=2)
> > > >  at block/file-posix.c:1892
> &g

Re: [PATCH] virtio-blk: Disable request queuing while switching contexts

2020-06-02 Thread Sergio Lopez
On Tue, Jun 02, 2020 at 01:23:14PM +0200, Kevin Wolf wrote:
> Am 02.06.2020 um 10:11 hat Sergio Lopez geschrieben:
> > Disable request queuing while switching contexts on
> > virtio_blk_data_plane_[start|stop](), preventing requests from getting
> > queued on the wrong context.
> >
> > Placing requests on the wrong context may lead to them being wrongly
> > accessed in parallel from different threads, potentially leading to
> > multiple issues.
> >
> > For example, stopping and resuming a VM multiple times while the guest
> > is generating I/O on a virtio_blk device can trigger a crash with a
> > stack tracing looking like this one:
> >
> > <-->
> >  Thread 2 (Thread 0x7ff736765700 (LWP 1062503)):
> >  #0  0x5567a13b99d6 in iov_memset
> >  (iov=0x6563617073206f4e, iov_cnt=1717922848, offset=516096, fillc=0, 
> > bytes=7018105756081554803)
> >  at util/iov.c:69
> >  #1  0x5567a13bab73 in qemu_iovec_memset
> >  (qiov=0x7ff73ec99748, offset=516096, fillc=0, 
> > bytes=7018105756081554803) at util/iov.c:530
> >  #2  0x5567a12f411c in qemu_laio_process_completion 
> > (laiocb=0x7ff6512ee6c0) at block/linux-aio.c:86
> >  #3  0x5567a12f42ff in qemu_laio_process_completions (s=0x7ff7182e8420) 
> > at block/linux-aio.c:217
> >  #4  0x5567a12f480d in ioq_submit (s=0x7ff7182e8420) at 
> > block/linux-aio.c:323
> >  #5  0x5567a12f43d9 in qemu_laio_process_completions_and_submit 
> > (s=0x7ff7182e8420)
> >  at block/linux-aio.c:236
> >  #6  0x5567a12f44c2 in qemu_laio_poll_cb (opaque=0x7ff7182e8430) at 
> > block/linux-aio.c:267
> >  #7  0x5567a13aed83 in run_poll_handlers_once (ctx=0x5567a2b58c70, 
> > timeout=0x7ff7367645f8)
> >  at util/aio-posix.c:520
> >  #8  0x5567a13aee9f in run_poll_handlers (ctx=0x5567a2b58c70, 
> > max_ns=16000, timeout=0x7ff7367645f8)
> >  at util/aio-posix.c:562
> >  #9  0x5567a13aefde in try_poll_mode (ctx=0x5567a2b58c70, 
> > timeout=0x7ff7367645f8)
> >  at util/aio-posix.c:597
> >  #10 0x5567a13af115 in aio_poll (ctx=0x5567a2b58c70, blocking=true) at 
> > util/aio-posix.c:639
> >  #11 0x5567a109acca in iothread_run (opaque=0x5567a2b29760) at 
> > iothread.c:75
> >  #12 0x5567a13b2790 in qemu_thread_start (args=0x5567a2b694c0) at 
> > util/qemu-thread-posix.c:519
> >  #13 0x7ff73eedf2de in start_thread () at /lib64/libpthread.so.0
> >  #14 0x7ff73ec10e83 in clone () at /lib64/libc.so.6
> >
> >  Thread 1 (Thread 0x7ff743986f00 (LWP 1062500)):
> >  #0  0x5567a13b99d6 in iov_memset
> >  (iov=0x6563617073206f4e, iov_cnt=1717922848, offset=516096, fillc=0, 
> > bytes=7018105756081554803)
> >  at util/iov.c:69
> >  #1  0x5567a13bab73 in qemu_iovec_memset
> >  (qiov=0x7ff73ec99748, offset=516096, fillc=0, 
> > bytes=7018105756081554803) at util/iov.c:530
> >  #2  0x5567a12f411c in qemu_laio_process_completion 
> > (laiocb=0x7ff6512ee6c0) at block/linux-aio.c:86
> >  #3  0x5567a12f42ff in qemu_laio_process_completions (s=0x7ff7182e8420) 
> > at block/linux-aio.c:217
> >  #4  0x5567a12f480d in ioq_submit (s=0x7ff7182e8420) at 
> > block/linux-aio.c:323
> >  #5  0x5567a12f4a2f in laio_do_submit (fd=19, laiocb=0x7ff5f4ff9ae0, 
> > offset=472363008, type=2)
> >  at block/linux-aio.c:375
> >  #6  0x5567a12f4af2 in laio_co_submit
> >  (bs=0x5567a2b8c460, s=0x7ff7182e8420, fd=19, offset=472363008, 
> > qiov=0x7ff5f4ff9ca0, type=2)
> >  at block/linux-aio.c:394
> >  #7  0x5567a12f1803 in raw_co_prw
> >  (bs=0x5567a2b8c460, offset=472363008, bytes=20480, 
> > qiov=0x7ff5f4ff9ca0, type=2)
> >  at block/file-posix.c:1892
> >  #8  0x5567a12f1941 in raw_co_pwritev
> >  (bs=0x5567a2b8c460, offset=472363008, bytes=20480, 
> > qiov=0x7ff5f4ff9ca0, flags=0)
> >  at block/file-posix.c:1925
> >  #9  0x5567a12fe3e1 in bdrv_driver_pwritev
> >  (bs=0x5567a2b8c460, offset=472363008, bytes=20480, 
> > qiov=0x7ff5f4ff9ca0, qiov_offset=0, flags=0)
> >  at block/io.c:1183
> >  #10 0x5567a1300340 in bdrv_aligned_pwritev
> >  (child=0x5567a2b5b070, req=0x7ff5f4ff9db0, offset=472363008, 
> > bytes=20480, align=512, qiov=0x7ff72c0425b8, qiov_offset=0, flags=0) at 
> > block/io.c:1980
> >  #11 0x5567a1300b29 in bdrv_co_pwritev_part
> >  (child=0x5567a2b5b070, offset=472363008, bytes=20480, 
> > qiov=0x7ff72c0425b8, qiov_offset=0, flags=0)
> >  at block/io.c:2137
> >  #12

[PATCH] virtio-blk: Disable request queuing while switching contexts

2020-06-02 Thread Sergio Lopez
Disable request queuing while switching contexts on
virtio_blk_data_plane_[start|stop](), preventing requests from getting
queued on the wrong context.

Placing requests on the wrong context may lead to them being wrongly
accessed in parallel from different threads, potentially leading to
multiple issues.

For example, stopping and resuming a VM multiple times while the guest
is generating I/O on a virtio_blk device can trigger a crash with a
stack trace like this one:

<-->
 Thread 2 (Thread 0x7ff736765700 (LWP 1062503)):
 #0  0x5567a13b99d6 in iov_memset
 (iov=0x6563617073206f4e, iov_cnt=1717922848, offset=516096, fillc=0, 
bytes=7018105756081554803)
 at util/iov.c:69
 #1  0x5567a13bab73 in qemu_iovec_memset
 (qiov=0x7ff73ec99748, offset=516096, fillc=0, bytes=7018105756081554803) 
at util/iov.c:530
 #2  0x5567a12f411c in qemu_laio_process_completion (laiocb=0x7ff6512ee6c0) 
at block/linux-aio.c:86
 #3  0x5567a12f42ff in qemu_laio_process_completions (s=0x7ff7182e8420) at 
block/linux-aio.c:217
 #4  0x5567a12f480d in ioq_submit (s=0x7ff7182e8420) at 
block/linux-aio.c:323
 #5  0x5567a12f43d9 in qemu_laio_process_completions_and_submit 
(s=0x7ff7182e8420)
 at block/linux-aio.c:236
 #6  0x5567a12f44c2 in qemu_laio_poll_cb (opaque=0x7ff7182e8430) at 
block/linux-aio.c:267
 #7  0x5567a13aed83 in run_poll_handlers_once (ctx=0x5567a2b58c70, 
timeout=0x7ff7367645f8)
 at util/aio-posix.c:520
 #8  0x5567a13aee9f in run_poll_handlers (ctx=0x5567a2b58c70, max_ns=16000, 
timeout=0x7ff7367645f8)
 at util/aio-posix.c:562
 #9  0x5567a13aefde in try_poll_mode (ctx=0x5567a2b58c70, 
timeout=0x7ff7367645f8)
 at util/aio-posix.c:597
 #10 0x5567a13af115 in aio_poll (ctx=0x5567a2b58c70, blocking=true) at 
util/aio-posix.c:639
 #11 0x5567a109acca in iothread_run (opaque=0x5567a2b29760) at iothread.c:75
 #12 0x5567a13b2790 in qemu_thread_start (args=0x5567a2b694c0) at 
util/qemu-thread-posix.c:519
 #13 0x7ff73eedf2de in start_thread () at /lib64/libpthread.so.0
 #14 0x7ff73ec10e83 in clone () at /lib64/libc.so.6

 Thread 1 (Thread 0x7ff743986f00 (LWP 1062500)):
 #0  0x5567a13b99d6 in iov_memset
 (iov=0x6563617073206f4e, iov_cnt=1717922848, offset=516096, fillc=0, 
bytes=7018105756081554803)
 at util/iov.c:69
 #1  0x5567a13bab73 in qemu_iovec_memset
 (qiov=0x7ff73ec99748, offset=516096, fillc=0, bytes=7018105756081554803) 
at util/iov.c:530
 #2  0x5567a12f411c in qemu_laio_process_completion (laiocb=0x7ff6512ee6c0) 
at block/linux-aio.c:86
 #3  0x5567a12f42ff in qemu_laio_process_completions (s=0x7ff7182e8420) at 
block/linux-aio.c:217
 #4  0x5567a12f480d in ioq_submit (s=0x7ff7182e8420) at 
block/linux-aio.c:323
 #5  0x5567a12f4a2f in laio_do_submit (fd=19, laiocb=0x7ff5f4ff9ae0, 
offset=472363008, type=2)
 at block/linux-aio.c:375
 #6  0x5567a12f4af2 in laio_co_submit
 (bs=0x5567a2b8c460, s=0x7ff7182e8420, fd=19, offset=472363008, 
qiov=0x7ff5f4ff9ca0, type=2)
 at block/linux-aio.c:394
 #7  0x5567a12f1803 in raw_co_prw
 (bs=0x5567a2b8c460, offset=472363008, bytes=20480, qiov=0x7ff5f4ff9ca0, 
type=2)
 at block/file-posix.c:1892
 #8  0x5567a12f1941 in raw_co_pwritev
 (bs=0x5567a2b8c460, offset=472363008, bytes=20480, qiov=0x7ff5f4ff9ca0, 
flags=0)
 at block/file-posix.c:1925
 #9  0x5567a12fe3e1 in bdrv_driver_pwritev
 (bs=0x5567a2b8c460, offset=472363008, bytes=20480, qiov=0x7ff5f4ff9ca0, 
qiov_offset=0, flags=0)
 at block/io.c:1183
 #10 0x5567a1300340 in bdrv_aligned_pwritev
 (child=0x5567a2b5b070, req=0x7ff5f4ff9db0, offset=472363008, bytes=20480, 
align=512, qiov=0x7ff72c0425b8, qiov_offset=0, flags=0) at block/io.c:1980
 #11 0x5567a1300b29 in bdrv_co_pwritev_part
 (child=0x5567a2b5b070, offset=472363008, bytes=20480, qiov=0x7ff72c0425b8, 
qiov_offset=0, flags=0)
 at block/io.c:2137
 #12 0x5567a12baba1 in qcow2_co_pwritev_task
 (bs=0x5567a2b92740, file_cluster_offset=472317952, offset=487305216, 
bytes=20480, qiov=0x7ff72c0425b8, qiov_offset=0, l2meta=0x0) at 
block/qcow2.c:2444
 #13 0x5567a12bacdb in qcow2_co_pwritev_task_entry (task=0x5567a2b48540) at 
block/qcow2.c:2475
 #14 0x5567a13167d8 in aio_task_co (opaque=0x5567a2b48540) at 
block/aio_task.c:45
 #15 0x5567a13cf00c in coroutine_trampoline (i0=738245600, i1=32759) at 
util/coroutine-ucontext.c:115
 #16 0x7ff73eb622e0 in __start_context () at /lib64/libc.so.6
 #17 0x7ff6626f1350 in  ()
 #18 0x in  ()
<-->

This is also known to cause crashes with this message (assertion
failed):

 aio_co_schedule: Co-routine was already scheduled in 'aio_co_schedule'

RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1812765
Signed-off-by: Sergio Lopez 
---
 hw/block/dataplane/virtio-blk.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio

Re: [PATCH 1/7] block/nvme: poll queues without q->lock

2020-05-29 Thread Sergio Lopez
On Thu, May 28, 2020 at 04:23:50PM +0100, Stefan Hajnoczi wrote:
> On Mon, May 25, 2020 at 10:07:13AM +0200, Sergio Lopez wrote:
> > On Tue, May 19, 2020 at 06:11:32PM +0100, Stefan Hajnoczi wrote:
> > > A lot of CPU time is spent simply locking/unlocking q->lock during
> > > polling. Check for completion outside the lock to make q->lock disappear
> > > from the profile.
> > >
> > > Signed-off-by: Stefan Hajnoczi 
> > > ---
> > >  block/nvme.c | 12 
> > >  1 file changed, 12 insertions(+)
> > >
> > > diff --git a/block/nvme.c b/block/nvme.c
> > > index eb2f54dd9d..7eb4512666 100644
> > > --- a/block/nvme.c
> > > +++ b/block/nvme.c
> > > @@ -512,6 +512,18 @@ static bool nvme_poll_queues(BDRVNVMeState *s)
> > >
> > >  for (i = 0; i < s->nr_queues; i++) {
> > >  NVMeQueuePair *q = s->queues[i];
> > > +const size_t cqe_offset = q->cq.head * NVME_CQ_ENTRY_BYTES;
> > > +NvmeCqe *cqe = (NvmeCqe *)&q->cq.queue[cqe_offset];
> > > +
> > > +/*
> > > + * q->lock isn't needed for checking completion because
> > > + * nvme_process_completion() only runs in the event loop thread 
> > > and
> > > + * cannot race with itself.
> > > + */
> > > +if ((le16_to_cpu(cqe->status) & 0x1) == q->cq_phase) {
> > > +continue;
> > > +}
> > > +
> >
> > IIUC, this is introducing an early check of the phase bit to determine
> > if there is something new in the queue.
> >
> > I'm fine with this optimization, but I have the feeling that the
> > comment doesn't properly describe it.
>
> I'm not sure I understand. The comment explains why it's safe not to
> take q->lock. Normally it would be taken. Without the comment readers
> could be confused why we ignore the locking rules here.
>
> As for documenting the cqe->status expression itself, I didn't think of
> explaining it since it's part of the theory of operation of this device.
> Any polling driver will do this, there's nothing QEMU-specific or
> unusual going on here.
>
> Would you like me to expand the comment explaining that NVMe polling
> consists of checking the phase bit of the latest cqe to check for
> readiness?
>
> Or maybe I misunderstood? :)

I was thinking of something like "Do an early check for
completions. We don't need q->lock here because
nvme_process_completion() only runs (...)"

Sergio.


signature.asc
Description: PGP signature


Re: [PATCH 7/7] block/nvme: support nested aio_poll()

2020-05-25 Thread Sergio Lopez
On Tue, May 19, 2020 at 06:11:38PM +0100, Stefan Hajnoczi wrote:
> QEMU block drivers are supposed to support aio_poll() from I/O
> completion callback functions. This means completion processing must be
> re-entrant.
> 
> The standard approach is to schedule a BH during completion processing
> and cancel it at the end of processing. If aio_poll() is invoked by a
> callback function then the BH will run. The BH continues the suspended
> completion processing.
> 
> All of this means that request A's cb() can synchronously wait for
> request B to complete. Previously the nvme block driver would hang
> because it didn't process completions from nested aio_poll().
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  block/nvme.c   | 67 --
>  block/trace-events |  2 +-
>  2 files changed, 60 insertions(+), 9 deletions(-)

Reviewed-by: Sergio Lopez 


signature.asc
Description: PGP signature


Re: [PATCH 6/7] block/nvme: keep BDRVNVMeState pointer in NVMeQueuePair

2020-05-25 Thread Sergio Lopez
On Tue, May 19, 2020 at 06:11:37PM +0100, Stefan Hajnoczi wrote:
> Passing around both BDRVNVMeState and NVMeQueuePair is unwieldy. Reduce
> the number of function arguments by keeping the BDRVNVMeState pointer in
> NVMeQueuePair. This will come in handy when a BH is introduced in a
> later patch and only one argument can be passed to it.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  block/nvme.c | 70 
>  1 file changed, 38 insertions(+), 32 deletions(-)

Reviewed-by: Sergio Lopez 


signature.asc
Description: PGP signature


Re: [PATCH 5/7] block/nvme: clarify that free_req_queue is protected by q->lock

2020-05-25 Thread Sergio Lopez
On Tue, May 19, 2020 at 06:11:36PM +0100, Stefan Hajnoczi wrote:
> Existing users access free_req_queue under q->lock. Document this.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  block/nvme.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Sergio Lopez 


signature.asc
Description: PGP signature


Re: [PATCH 3/7] block/nvme: don't access CQE after moving cq.head

2020-05-25 Thread Sergio Lopez
On Tue, May 19, 2020 at 06:11:34PM +0100, Stefan Hajnoczi wrote:
> Do not access a CQE after incrementing q->cq.head and releasing q->lock.
> It is unlikely that this causes problems in practice but it's a latent
> bug.
> 
> The reason why it should be safe at the moment is that completion
> processing is not re-entrant and the CQ doorbell isn't written until the
> end of nvme_process_completion().
> 
> Make this change now because QEMU expects completion processing to be
> re-entrant and later patches will do that.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  block/nvme.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)

Reviewed-by: Sergio Lopez 


signature.asc
Description: PGP signature


Re: [PATCH 4/7] block/nvme: switch to a NVMeRequest freelist

2020-05-25 Thread Sergio Lopez
On Tue, May 19, 2020 at 06:11:35PM +0100, Stefan Hajnoczi wrote:
> There are three issues with the current NVMeRequest->busy field:
> 1. The busy field is accidentally accessed outside q->lock when request
>submission fails.
> 2. Waiters on free_req_queue are not woken when a request is returned
>early due to submission failure.
> 3. Finding a free request involves scanning all requests. This makes
>request submission O(n^2).
> 
> Switch to an O(1) freelist that is always accessed under the lock.
> 
> Also differentiate between NVME_QUEUE_SIZE, the actual SQ/CQ size, and
> NVME_NUM_REQS, the number of usable requests. This makes the code
> simpler than using NVME_QUEUE_SIZE everywhere and having to keep in mind
> that one slot is reserved.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  block/nvme.c | 81 ++--
>  1 file changed, 54 insertions(+), 27 deletions(-)

Reviewed-by: Sergio Lopez 


signature.asc
Description: PGP signature


Re: [PATCH 2/7] block/nvme: drop tautologous assertion

2020-05-25 Thread Sergio Lopez
On Tue, May 19, 2020 at 06:11:33PM +0100, Stefan Hajnoczi wrote:
> nvme_process_completion() explicitly checks cid so the assertion that
> follows is always true:
> 
>   if (cid == 0 || cid > NVME_QUEUE_SIZE) {
>   ...
>   continue;
>   }
>   assert(cid <= NVME_QUEUE_SIZE);
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  block/nvme.c | 1 -
>  1 file changed, 1 deletion(-)

Reviewed-by: Sergio Lopez 


signature.asc
Description: PGP signature


Re: [PATCH 1/7] block/nvme: poll queues without q->lock

2020-05-25 Thread Sergio Lopez
On Tue, May 19, 2020 at 06:11:32PM +0100, Stefan Hajnoczi wrote:
> A lot of CPU time is spent simply locking/unlocking q->lock during
> polling. Check for completion outside the lock to make q->lock disappear
> from the profile.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  block/nvme.c | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index eb2f54dd9d..7eb4512666 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -512,6 +512,18 @@ static bool nvme_poll_queues(BDRVNVMeState *s)
>  
>  for (i = 0; i < s->nr_queues; i++) {
>  NVMeQueuePair *q = s->queues[i];
> +const size_t cqe_offset = q->cq.head * NVME_CQ_ENTRY_BYTES;
> +NvmeCqe *cqe = (NvmeCqe *)&q->cq.queue[cqe_offset];
> +
> +/*
> + * q->lock isn't needed for checking completion because
> + * nvme_process_completion() only runs in the event loop thread and
> + * cannot race with itself.
> + */
> +if ((le16_to_cpu(cqe->status) & 0x1) == q->cq_phase) {
> +continue;
> +}
> +

IIUC, this is introducing an early check of the phase bit to determine
if there is something new in the queue.

I'm fine with this optimization, but I have the feeling that the
comment doesn't properly describe it.

Sergio.

>  qemu_mutex_lock(&q->lock);
>  while (nvme_process_completion(s, q)) {
>  /* Keep polling */
> -- 
> 2.25.3
> 


signature.asc
Description: PGP signature


Re: [PATCH] backup: don't acquire aio_context in backup_clean

2020-03-26 Thread Sergio Lopez
On Thu, Mar 26, 2020 at 08:54:53AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 25.03.2020 18:50, Stefan Reiter wrote:
> > backup_clean is only ever called as a handler via job_exit, which
> 
> Hmm.. I'm afraid it's not quite correct.
> 
> job_clean
> 
>   job_finalize_single
> 
>  job_completed_txn_abort (lock aio context)
> 
>  job_do_finalize
> 
> 
> Hmm. job_do_finalize calls job_completed_txn_abort, which takes care to lock 
> the aio context..
> And at the same time, it directly calls job_txn_apply(job->txn, 
> job_finalize_single)
> without locking. Is it a bug?

Indeed, looks like a bug to me. In fact, that's the one causing the
issue that Dietmar initially reported.

I think the proper fix is to drop the context acquisition/release in
backup_clean() that I added in 0abf2581717a19, as Stefan proposed,
and also to acquire the context of "foreign" jobs in job_txn_apply,
just as job_completed_txn_abort does.

Thanks,
Sergio.

> And, even if job_do_finalize called always with locked context, where is 
> guarantee that all
> context of all jobs in txn are locked?
> 
> Still, let's look through its callers.
> 
> job_finalize
> 
>qmp_block_job_finalize (lock aio context)
>qmp_job_finalize (lock aio context)
>test_cancel_concluded (doesn't lock, but it's a test)
> 
>   job_completed_txn_success
> 
>job_completed
> 
> job_exit (lock aio context)
> 
> job_cancel
>   
>  blockdev_mark_auto_del (lock aio context)
> 
>  job_user_cancel
> 
>  qmp_block_job_cancel (locks context)
>  qmp_job_cancel  (locks context)
> 
>  job_cancel_err
> 
>   job_cancel_sync (return job_finish_sync(job, 
> &job_cancel_err, NULL);, job_finish_sync just calls callback)
> 
>replication_close (it's .bdrv_close.. Hmm, 
> I don't see context locking, where is it ?)
> 
>replication_stop (locks context)
> 
>drive_backup_abort (locks context)
> 
>blockdev_backup_abort (locks context)
> 
>job_cancel_sync_all (locks context)
> 
>cancel_common (locks context)
> 
>  test_* (I don't care)
> 
> > already acquires the job's context. The job's context is guaranteed to
> > be the same as the one used by backup_top via backup_job_create.
> > 
> > Since the previous logic effectively acquired the lock twice, this
> > broke cleanup of backups for disks using IO threads, since the 
> > BDRV_POLL_WHILE
> > in bdrv_backup_top_drop -> bdrv_do_drained_begin would only release the lock
> > once, thus deadlocking with the IO thread.
> > 
> > Signed-off-by: Stefan Reiter 
> 
> Just note, that this thing were recently touched by 0abf2581717a19 , so add 
> Sergio (its author) to CC.
> 
> > ---
> > 
> > This is a fix for the issue discussed in this part of the thread:
> > https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07639.html
> > ...not the original problem (core dump) posted by Dietmar.
> > 
> > I've still seen it occasionally hang during a backup abort. I'm trying to 
> > figure
> > out why that happens, stack trace indicates a similar problem with the main
> > thread hanging at bdrv_do_drained_begin, though I have no clue why as of 
> > yet.
> > 
> >   block/backup.c | 4 
> >   1 file changed, 4 deletions(-)
> > 
> > diff --git a/block/backup.c b/block/backup.c
> > index 7430ca5883..a7a7dcaf4c 100644
> > --- a/block/backup.c
> > +++ b/block/backup.c
> > @@ -126,11 +126,7 @@ static void backup_abort(Job *job)
> >   static void backup_clean(Job *job)
> >   {
> >   BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
> > -AioContext *aio_context = bdrv_get_aio_context(s->backup_top);
> > -
> > -aio_context_acquire(aio_context);
> >   bdrv_backup_top_drop(s->backup_top);
> > -aio_context_release(aio_context);
> >   }
> >   void backup_do_checkpoint(BlockJob *job, Error **errp)
> > 
> 
> 
> -- 
> Best regards,
> Vladimir
> 


signature.asc
Description: PGP signature


Re: [PATCH 5/5] aio-posix: make AioHandler dispatch O(1) with epoll

2020-02-19 Thread Sergio Lopez
On Fri, Feb 14, 2020 at 05:17:12PM +, Stefan Hajnoczi wrote:
> File descriptor monitoring is O(1) with epoll(7), but
> aio_dispatch_handlers() still scans all AioHandlers instead of
> dispatching just those that are ready.  This makes aio_poll() O(n) with
> respect to the total number of registered handlers.
> 
> Add a local ready_list to aio_poll() so that each nested aio_poll()
> builds a list of handlers ready to be dispatched.  Since file descriptor
> polling is level-triggered, nested aio_poll() calls also see fds that
> were ready in the parent but not yet dispatched.  This guarantees that
> nested aio_poll() invocations will dispatch all fds, even those that
> became ready before the nested invocation.
> 
> Since only handlers ready to be dispatched are placed onto the
> ready_list, the new aio_dispatch_ready_handlers() function provides O(1)
> dispatch.
> 
> Note that AioContext polling is still O(n) and currently cannot be fully
> disabled.  This still needs to be fixed before aio_poll() is fully O(1).
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  util/aio-posix.c | 106 +--
>  1 file changed, 76 insertions(+), 30 deletions(-)

Reviewed-by: Sergio Lopez 


signature.asc
Description: PGP signature


Re: [PATCH 4/5] aio-posix: make AioHandler deletion O(1)

2020-02-19 Thread Sergio Lopez
On Fri, Feb 14, 2020 at 05:17:11PM +, Stefan Hajnoczi wrote:
> It is not necessary to scan all AioHandlers for deletion.  Keep a list
> of deleted handlers instead of scanning the full list of all handlers.
> 
> The AioHandler->deleted field can be dropped.  Let's check if the
> handler has been inserted into the deleted list instead.  Add a new
> QLIST_IS_INSERTED() API for this check.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  include/block/aio.h  |  6 -
>  include/qemu/queue.h |  3 +++
>  util/aio-posix.c | 53 +---
>  3 files changed, 43 insertions(+), 19 deletions(-)

Reviewed-by: Sergio Lopez 


signature.asc
Description: PGP signature


Re: [PATCH 3/5] qemu/queue.h: add QLIST_SAFE_REMOVE()

2020-02-19 Thread Sergio Lopez
On Fri, Feb 14, 2020 at 05:17:10PM +, Stefan Hajnoczi wrote:
> QLIST_REMOVE() assumes the element is in a list.  It also leaves the
> element's linked list pointers dangling.
> 
> Introduce a safe version of QLIST_REMOVE() and convert open-coded
> instances of this pattern.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  block.c  |  5 +
>  chardev/spice.c  |  4 +---
>  include/qemu/queue.h | 14 ++
>  3 files changed, 16 insertions(+), 7 deletions(-)

Reviewed-by: Sergio Lopez 


signature.asc
Description: PGP signature


Re: [PATCH 2/5] aio-posix: don't pass ns timeout to epoll_wait()

2020-02-19 Thread Sergio Lopez
On Fri, Feb 14, 2020 at 05:17:09PM +, Stefan Hajnoczi wrote:
> Don't pass the nanosecond timeout into epoll_wait(), which expects
> milliseconds.
> 
> The epoll_wait() timeout value does not matter if qemu_poll_ns()
> determined that the poll fd is ready, but passing a value in the wrong
> units is still ugly.  Pass a 0 timeout to epoll_wait() instead.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  util/aio-posix.c | 3 +++
>  1 file changed, 3 insertions(+)

Reviewed-by: Sergio Lopez 


signature.asc
Description: PGP signature


Re: [PATCH 1/5] aio-posix: fix use after leaving scope in aio_poll()

2020-02-18 Thread Sergio Lopez
On Fri, Feb 14, 2020 at 05:17:08PM +, Stefan Hajnoczi wrote:
> epoll_handler is a stack variable and must not be accessed after it goes
> out of scope:
> 
>   if (aio_epoll_check_poll(ctx, pollfds, npfd, timeout)) {
>   AioHandler epoll_handler;
>   ...
>   add_pollfd(&epoll_handler);
>   ret = aio_epoll(ctx, pollfds, npfd, timeout);
>   } ...
> 
>   ...
> 
>   /* if we have any readable fds, dispatch event */
>   if (ret > 0) {
>   for (i = 0; i < npfd; i++) {
>   nodes[i]->pfd.revents = pollfds[i].revents;
>   }
>   }
> 
> nodes[0] is &epoll_handler, which has already gone out of scope.
> 
> There is no need to use pollfds[] for epoll.  We don't need an
> AioHandler for the epoll fd.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  util/aio-posix.c | 20 ++++----
>  1 file changed, 8 insertions(+), 12 deletions(-)

Reviewed-by: Sergio Lopez 


signature.asc
Description: PGP signature


Re: [PATCH v2 2/4] virtio-scsi: default num_queues to -smp N

2020-02-03 Thread Sergio Lopez
On Mon, Feb 03, 2020 at 10:57:44AM +, Daniel P. Berrangé wrote:
> On Mon, Feb 03, 2020 at 11:25:29AM +0100, Sergio Lopez wrote:
> > On Thu, Jan 30, 2020 at 10:52:35AM +, Stefan Hajnoczi wrote:
> > > On Thu, Jan 30, 2020 at 01:29:16AM +0100, Paolo Bonzini wrote:
> > > > On 29/01/20 16:44, Stefan Hajnoczi wrote:
> > > > > On Mon, Jan 27, 2020 at 02:10:31PM +0100, Cornelia Huck wrote:
> > > > >> On Fri, 24 Jan 2020 10:01:57 +
> > > > >> Stefan Hajnoczi  wrote:
> > > > >>> @@ -47,10 +48,15 @@ static void 
> > > > >>> vhost_scsi_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
> > > > >>>  {
> > > > >>>  VHostSCSIPCI *dev = VHOST_SCSI_PCI(vpci_dev);
> > > > >>>  DeviceState *vdev = DEVICE(&dev->vdev);
> > > > >>> -VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(vdev);
> > > > >>> +VirtIOSCSIConf *conf = &dev->vdev.parent_obj.parent_obj.conf;
> > > > >>> +
> > > > >>> +/* 1:1 vq to vcpu mapping is ideal because it avoids IPIs */
> > > > >>> +if (conf->num_queues == VIRTIO_SCSI_AUTO_NUM_QUEUES) {
> > > > >>> +conf->num_queues = current_machine->smp.cpus;
> > > > >> This now maps the request vqs 1:1 to the vcpus. What about the fixed
> > > > >> vqs? If they don't really matter, amend the comment to explain that?
> > > > > The fixed vqs don't matter.  They are typically not involved in the 
> > > > > data
> > > > > path, only the control path where performance doesn't matter.
> > > > 
> > > > Should we put a limit on the number of vCPUs?  For anything above ~128
> > > > the guest is probably not going to be disk or network bound.
> > > 
> > > Michael Tsirkin pointed out there's a hard limit of VIRTIO_QUEUE_MAX
> > > (1024).  We need to at least stay under that limit.
> > > 
> > > Should the guest have >128 virtqueues?  Each virtqueue requires guest
> > > RAM and 2 host eventfds.  Eventually these resource requirements will
> > > become a scalability problem, but how do we choose a hard limit and what
> > > happens to guest performance above that limit?
> > 
> > From the UX perspective, I think it's safer to use a rather low upper
> > limit for the automatic configuration.
> > 
> > Users of large VMs (>=32 vCPUs) aiming for the optimal performance are
> > already facing the need of manually tuning (or relying on a software
> > to do that for them) other aspects of it, like vNUMA, IOThreads and
> > CPU pinning, so I don't think we should focus on this group.
> 
> Whether they're tuning manually, or relying on software to tune for
> them, we (QEMU maintainers) still need to provide credible guidance
> on what to do when tuning for large CPU counts. Without clear info
> from QEMU, it just descends into hearsay and guesswork, both of which
> approaches leave QEMU looking bad.

I agree. Good documentation, ideally with some benchmarks, and safe
defaults sound like a good approach to me.

> So I think we need to, at the very least, make a clear statement here
> about what tuning approach should be applied vCPU count gets high,
> and probably even apply that  as a default out of the box approach.

In general, I would agree, but in this particular case the
optimization has an impact on something outside's QEMU control (host's
resources), so we lack the information needed to make a proper guess.

My main concern here is users upgrading QEMU to hit some kind of crash
or performance issue, without having touched their VM config. And
let's not forget that Stefan said in the cover that this amounts to a
1-4% improvement on 4k operations on an SSD, and I guess that's with
iodepth=1. I suspect with a larger block size and/or higher iodepth
the improvement will be barely noticeable, which means it'll only have
a positive impact on users running DB/OLTP or similar workloads on
dedicated, directly attached, low-latency storage.

But don't get me wrong, this is a *good* optimization. It's just I
think we should play safe here.

Sergio.


signature.asc
Description: PGP signature


Re: [PATCH v2 2/4] virtio-scsi: default num_queues to -smp N

2020-02-03 Thread Sergio Lopez
On Thu, Jan 30, 2020 at 10:52:35AM +, Stefan Hajnoczi wrote:
> On Thu, Jan 30, 2020 at 01:29:16AM +0100, Paolo Bonzini wrote:
> > On 29/01/20 16:44, Stefan Hajnoczi wrote:
> > > On Mon, Jan 27, 2020 at 02:10:31PM +0100, Cornelia Huck wrote:
> > >> On Fri, 24 Jan 2020 10:01:57 +
> > >> Stefan Hajnoczi  wrote:
> > >>> @@ -47,10 +48,15 @@ static void vhost_scsi_pci_realize(VirtIOPCIProxy 
> > >>> *vpci_dev, Error **errp)
> > >>>  {
> > >>>  VHostSCSIPCI *dev = VHOST_SCSI_PCI(vpci_dev);
> > >>>  DeviceState *vdev = DEVICE(&dev->vdev);
> > >>> -VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(vdev);
> > >>> +VirtIOSCSIConf *conf = &dev->vdev.parent_obj.parent_obj.conf;
> > >>> +
> > >>> +/* 1:1 vq to vcpu mapping is ideal because it avoids IPIs */
> > >>> +if (conf->num_queues == VIRTIO_SCSI_AUTO_NUM_QUEUES) {
> > >>> +conf->num_queues = current_machine->smp.cpus;
> > >> This now maps the request vqs 1:1 to the vcpus. What about the fixed
> > >> vqs? If they don't really matter, amend the comment to explain that?
> > > The fixed vqs don't matter.  They are typically not involved in the data
> > > path, only the control path where performance doesn't matter.
> > 
> > Should we put a limit on the number of vCPUs?  For anything above ~128
> > the guest is probably not going to be disk or network bound.
> 
> Michael Tsirkin pointed out there's a hard limit of VIRTIO_QUEUE_MAX
> (1024).  We need to at least stay under that limit.
> 
> Should the guest have >128 virtqueues?  Each virtqueue requires guest
> RAM and 2 host eventfds.  Eventually these resource requirements will
> become a scalability problem, but how do we choose a hard limit and what
> happens to guest performance above that limit?

From the UX perspective, I think it's safer to use a rather low upper
limit for the automatic configuration.

Users of large VMs (>=32 vCPUs) aiming for the optimal performance are
already facing the need of manually tuning (or relying on a software
to do that for them) other aspects of it, like vNUMA, IOThreads and
CPU pinning, so I don't think we should focus on this group.

On the other hand, the increase in host resource requirements may have
unforeseen consequences in some environments, especially for virtio-blk
users with multiple disks.

All in all, I don't have data that would justify setting the limit to
one value or the other. The only argument I can put on the table is
that, so far, we only had one VQ per device, so perhaps a conservative
value (4? 8?) would make sense from a safety and compatibility point
of view.

Thanks,
Sergio.



signature.asc
Description: PGP signature


[PATCH v6 5/8] block/backup-top: Don't acquire context while dropping top

2020-01-08 Thread Sergio Lopez
All paths that lead to bdrv_backup_top_drop(), except for the call
from backup_clean(), imply that the BDS AioContext has already been
acquired, so doing it there too can potentially lead to QEMU hanging
on AIO_WAIT_WHILE().

An easy way to trigger this situation is by issuing a two-action
transaction, with a proper and a bogus blockdev-backup, so the second
one will trigger a rollback. This will trigger a hang with a stack
trace like this one:

 #0  0x7fb680c75016 in __GI_ppoll (fds=0x55e74580f7c0, nfds=1, 
timeout=,
 timeout@entry=0x0, sigmask=sigmask@entry=0x0) at 
../sysdeps/unix/sysv/linux/ppoll.c:39
 #1  0x55e743386e09 in ppoll (__ss=0x0, __timeout=0x0, __nfds=, __fds=)
 at /usr/include/bits/poll2.h:77
 #2  0x55e743386e09 in qemu_poll_ns
 (fds=, nfds=, timeout=) at 
util/qemu-timer.c:336
 #3  0x55e743388dc4 in aio_poll (ctx=0x55e7458925d0, 
blocking=blocking@entry=true)
 at util/aio-posix.c:669
 #4  0x55e743305dea in bdrv_flush (bs=bs@entry=0x55e74593c0d0) at 
block/io.c:2878
 #5  0x55e7432be58e in bdrv_close (bs=0x55e74593c0d0) at block.c:4017
 #6  0x55e7432be58e in bdrv_delete (bs=) at block.c:4262
 #7  0x55e7432be58e in bdrv_unref (bs=bs@entry=0x55e74593c0d0) at 
block.c:5644
 #8  0x55e743316b9b in bdrv_backup_top_drop (bs=bs@entry=0x55e74593c0d0) at 
block/backup-top.c:273
 #9  0x55e74331461f in backup_job_create
 (job_id=0x0, bs=bs@entry=0x55e7458d5820, 
target=target@entry=0x55e74589f640, speed=0, sync_mode=MIRROR_SYNC_MODE_FULL, 
sync_bitmap=sync_bitmap@entry=0x0, bitmap_mode=BITMAP_SYNC_MODE_ON_SUCCESS, 
compress=false, filter_node_name=0x0, on_source_error=BLOCKDEV_ON_ERROR_REPORT, 
on_target_error=BLOCKDEV_ON_ERROR_REPORT, creation_flags=0, cb=0x0, opaque=0x0, 
txn=0x0, errp=0x7ffddfd1efb0) at block/backup.c:478
 #10 0x55e74315bc52 in do_backup_common
 (backup=backup@entry=0x55e746c066d0, bs=bs@entry=0x55e7458d5820, 
target_bs=target_bs@entry=0x55e74589f640, 
aio_context=aio_context@entry=0x55e7458a91e0, txn=txn@entry=0x0, 
errp=errp@entry=0x7ffddfd1efb0)
 at blockdev.c:3580
 #11 0x55e74315c37c in do_blockdev_backup
 (backup=backup@entry=0x55e746c066d0, txn=0x0, 
errp=errp@entry=0x7ffddfd1efb0)
 at 
/usr/src/debug/qemu-kvm-4.2.0-2.module+el8.2.0+5135+ed3b2489.x86_64/./qapi/qapi-types-block-core.h:1492
 #12 0x55e74315c449 in blockdev_backup_prepare (common=0x55e746a8de90, 
errp=0x7ffddfd1f018)
 at blockdev.c:1885
 #13 0x55e743160152 in qmp_transaction
 (dev_list=, has_props=, 
props=0x55e7467fe2c0, errp=errp@entry=0x7ffddfd1f088) at blockdev.c:2340
 #14 0x55e743287ff5 in qmp_marshal_transaction
 (args=, ret=, errp=0x7ffddfd1f0f8)
 at qapi/qapi-commands-transaction.c:44
 #15 0x55e74333de6c in do_qmp_dispatch
 (errp=0x7ffddfd1f0f0, allow_oob=, request=, 
cmds=0x55e743c28d60 ) at qapi/qmp-dispatch.c:132
 #16 0x55e74333de6c in qmp_dispatch
 (cmds=0x55e743c28d60 , request=, 
allow_oob=)
 at qapi/qmp-dispatch.c:175
 #17 0x55e74325c061 in monitor_qmp_dispatch (mon=0x55e745908030, 
req=)
 at monitor/qmp.c:145
 #18 0x55e74325c6fa in monitor_qmp_bh_dispatcher (data=) at 
monitor/qmp.c:234
 #19 0x55e743385866 in aio_bh_call (bh=0x55e745807ae0) at util/async.c:117
 #20 0x55e743385866 in aio_bh_poll (ctx=ctx@entry=0x55e7458067a0) at 
util/async.c:117
 #21 0x55e743388c54 in aio_dispatch (ctx=0x55e7458067a0) at 
util/aio-posix.c:459
 #22 0x55e743385742 in aio_ctx_dispatch
 (source=, callback=, user_data=) at util/async.c:260
 #23 0x7fb68543e67d in g_main_dispatch (context=0x55e745893a40) at 
gmain.c:3176
 #24 0x7fb68543e67d in g_main_context_dispatch 
(context=context@entry=0x55e745893a40) at gmain.c:3829
 #25 0x55e743387d08 in glib_pollfds_poll () at util/main-loop.c:219
 #26 0x55e743387d08 in os_host_main_loop_wait (timeout=) at 
util/main-loop.c:242
 #27 0x55e743387d08 in main_loop_wait (nonblocking=) at 
util/main-loop.c:518
 #28 0x55e74316a3c1 in main_loop () at vl.c:1828
 #29 0x55e743016a72 in main (argc=, argv=, 
envp=)
 at vl.c:4504

Fix this by not acquiring the AioContext there, and ensuring all paths
leading to it have it already acquired (backup_clean()).

RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1782111
Signed-off-by: Sergio Lopez 
---
 block/backup-top.c | 5 -
 block/backup.c | 3 +++
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/block/backup-top.c b/block/backup-top.c
index 818d3f26b4..b8d863ff08 100644
--- a/block/backup-top.c
+++ b/block/backup-top.c
@@ -255,9 +255,6 @@ append_failed:
 void bdrv_backup_top_drop(BlockDriverState *bs)
 {
 BDRVBackupTopState *s = bs->opaque;
-AioContext *aio_context = bdrv_get_aio_context(bs);
-
-aio_context_acquire(aio_context);
 
 bdrv_drained_begin(bs);
 
@@ -271,6 +268,4 @@ void bdrv_backup_top_drop(BlockDriverState *bs)
 bdrv_drained_end(bs);
 
 bdrv_unref(bs);
-
-aio_context_release(aio_cont

[PATCH v6 8/8] iotests: Test handling of AioContexts with some blockdev actions

2020-01-08 Thread Sergio Lopez
Includes the following tests:

 - Adding a dirty bitmap.
   * RHBZ: 1782175

 - Starting a drive-mirror to an NBD-backed target.
   * RHBZ: 1746217, 1773517

 - Aborting an external snapshot transaction.
   * RHBZ: 1779036

 - Aborting a blockdev backup transaction.
   * RHBZ: 1782111

For each one of them, a VM with a number of disks running in an
IOThread AioContext is used.

Signed-off-by: Sergio Lopez 
---
 tests/qemu-iotests/281 | 247 +
 tests/qemu-iotests/281.out |   5 +
 tests/qemu-iotests/group   |   1 +
 3 files changed, 253 insertions(+)
 create mode 100755 tests/qemu-iotests/281
 create mode 100644 tests/qemu-iotests/281.out

diff --git a/tests/qemu-iotests/281 b/tests/qemu-iotests/281
new file mode 100755
index 00..269d583b2c
--- /dev/null
+++ b/tests/qemu-iotests/281
@@ -0,0 +1,247 @@
+#!/usr/bin/env python
+#
+# Test cases for blockdev + IOThread interactions
+#
+# Copyright (C) 2019 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os
+import iotests
+from iotests import qemu_img
+
+image_len = 64 * 1024 * 1024
+
+# Test for RHBZ#1782175
+class TestDirtyBitmapIOThread(iotests.QMPTestCase):
+drive0_img = os.path.join(iotests.test_dir, 'drive0.img')
+images = { 'drive0': drive0_img }
+
+def setUp(self):
+for name in self.images:
+qemu_img('create', '-f', iotests.imgfmt,
+ self.images[name], str(image_len))
+
+self.vm = iotests.VM()
+self.vm.add_object('iothread,id=iothread0')
+
+for name in self.images:
+self.vm.add_blockdev('driver=file,filename=%s,node-name=file_%s'
+ % (self.images[name], name))
+self.vm.add_blockdev('driver=qcow2,file=file_%s,node-name=%s'
+ % (name, name))
+
+self.vm.launch()
+self.vm.qmp('x-blockdev-set-iothread',
+node_name='drive0', iothread='iothread0',
+force=True)
+
+def tearDown(self):
+self.vm.shutdown()
+for name in self.images:
+os.remove(self.images[name])
+
+def test_add_dirty_bitmap(self):
+result = self.vm.qmp(
+'block-dirty-bitmap-add',
+node='drive0',
+name='bitmap1',
+persistent=True,
+)
+
+self.assert_qmp(result, 'return', {})
+
+
+# Test for RHBZ#1746217 & RHBZ#1773517
+class TestNBDMirrorIOThread(iotests.QMPTestCase):
+nbd_sock = os.path.join(iotests.sock_dir, 'nbd.sock')
+drive0_img = os.path.join(iotests.test_dir, 'drive0.img')
+mirror_img = os.path.join(iotests.test_dir, 'mirror.img')
+images = { 'drive0': drive0_img, 'mirror': mirror_img }
+
+def setUp(self):
+for name in self.images:
+qemu_img('create', '-f', iotests.imgfmt,
+ self.images[name], str(image_len))
+
+self.vm_src = iotests.VM(path_suffix='src')
+self.vm_src.add_object('iothread,id=iothread0')
+self.vm_src.add_blockdev('driver=file,filename=%s,node-name=file0'
+  % (self.drive0_img))
+self.vm_src.add_blockdev('driver=qcow2,file=file0,node-name=drive0')
+self.vm_src.launch()
+self.vm_src.qmp('x-blockdev-set-iothread',
+node_name='drive0', iothread='iothread0',
+force=True)
+
+self.vm_tgt = iotests.VM(path_suffix='tgt')
+self.vm_tgt.add_object('iothread,id=iothread0')
+self.vm_tgt.add_blockdev('driver=file,filename=%s,node-name=file0'
+  % (self.mirror_img))
+self.vm_tgt.add_blockdev('driver=qcow2,file=file0,node-name=drive0')
+self.vm_tgt.launch()
+self.vm_tgt.qmp('x-blockdev-set-iothread',
+node_name='drive0', iothread='iothread0',
+force=True)
+
+def tearDown(self):
+self.vm_src.shutdown()
+self.vm_tgt.shutdown()
+for name in self.images:
+os.remove(self.images[name])
+
+def test_nbd_mirror(self):
+result = self.vm_tgt.qmp(
+'nbd-server-start',
+addr={
+'type': 'unix',
+'data': { 'path': self.nbd_sock }
+}
+)
+self.assert_qmp(result, 'return', {})
+
+

[PATCH v6 7/8] blockdev: Return bs to the proper context on snapshot abort

2020-01-08 Thread Sergio Lopez
external_snapshot_abort() calls bdrv_set_backing_hd(), which
returns state->old_bs to the main AioContext, as it's intended to be
used when the BDS is about to be released. As that's not the case when
aborting an external snapshot, return it to the AioContext it was in
before the call.
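The remember-and-restore pattern the fix applies can be sketched as a toy model (names are illustrative, not QEMU API): record the node's context before the operation that resets it to the main context, and move it back if it changed.

```python
# Toy model of a block node moving between AioContexts.
class Ctx:
    def __init__(self, name):
        self.name = name

MAIN_CTX = Ctx("main")

class Node:
    def __init__(self, ctx):
        self.ctx = ctx

def set_backing_hd(node, backing):
    # Side effect being worked around: the node ends up in the main context.
    node.ctx = MAIN_CTX

def snapshot_abort(old_bs):
    ctx_before = old_bs.ctx           # aio_context in the patch
    set_backing_hd(old_bs, None)
    if old_bs.ctx is not ctx_before:  # tmp_context != aio_context
        old_bs.ctx = ctx_before       # bdrv_try_set_aio_context(...)

iothread_ctx = Ctx("iothread0")
bs = Node(iothread_ctx)
snapshot_abort(bs)
assert bs.ctx is iothread_ctx  # still in its IOThread context after abort
```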

This issue can be triggered by issuing a transaction with two actions,
a proper blockdev-snapshot-sync and a bogus one, so the second will
trigger a transaction abort. This results in a crash with a stack
trace like this one:

 #0  0x7fa1048b28df in __GI_raise (sig=sig@entry=6) at 
../sysdeps/unix/sysv/linux/raise.c:50
 #1  0x7fa10489ccf5 in __GI_abort () at abort.c:79
 #2  0x7fa10489cbc9 in __assert_fail_base
 (fmt=0x7fa104a03300 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
assertion=0x5572240b44d8 "bdrv_get_aio_context(old_bs) == 
bdrv_get_aio_context(new_bs)", file=0x557224014d30 "block.c", line=2240, 
function=) at assert.c:92
 #3  0x7fa1048aae96 in __GI___assert_fail
 (assertion=assertion@entry=0x5572240b44d8 "bdrv_get_aio_context(old_bs) == 
bdrv_get_aio_context(new_bs)", file=file@entry=0x557224014d30 "block.c", 
line=line@entry=2240, function=function@entry=0x5572240b5d60 
<__PRETTY_FUNCTION__.31620> "bdrv_replace_child_noperm") at assert.c:101
 #4  0x557223e631f8 in bdrv_replace_child_noperm (child=0x557225b9c980, 
new_bs=new_bs@entry=0x557225c42e40) at block.c:2240
 #5  0x557223e68be7 in bdrv_replace_node (from=0x557226951a60, 
to=0x557225c42e40, errp=0x5572247d6138 ) at block.c:4196
 #6  0x557223d069c4 in external_snapshot_abort (common=0x557225d7e170) at 
blockdev.c:1731
 #7  0x557223d069c4 in external_snapshot_abort (common=0x557225d7e170) at 
blockdev.c:1717
 #8  0x557223d09013 in qmp_transaction (dev_list=, 
has_props=, props=0x557225cc7d70, 
errp=errp@entry=0x7ffe704c0c98) at blockdev.c:2360
 #9  0x557223e32085 in qmp_marshal_transaction (args=, 
ret=, errp=0x7ffe704c0d08) at qapi/qapi-commands-transaction.c:44
 #10 0x557223ee798c in do_qmp_dispatch (errp=0x7ffe704c0d00, 
allow_oob=, request=, cmds=0x5572247d3cc0 
) at qapi/qmp-dispatch.c:132
 #11 0x557223ee798c in qmp_dispatch (cmds=0x5572247d3cc0 , 
request=, allow_oob=) at qapi/qmp-dispatch.c:175
 #12 0x557223e06141 in monitor_qmp_dispatch (mon=0x557225c69ff0, 
req=) at monitor/qmp.c:120
 #13 0x557223e0678a in monitor_qmp_bh_dispatcher (data=) at 
monitor/qmp.c:209
 #14 0x557223f2f366 in aio_bh_call (bh=0x557225b9dc60) at util/async.c:117
 #15 0x557223f2f366 in aio_bh_poll (ctx=ctx@entry=0x557225b9c840) at 
util/async.c:117
 #16 0x557223f32754 in aio_dispatch (ctx=0x557225b9c840) at 
util/aio-posix.c:459
 #17 0x557223f2f242 in aio_ctx_dispatch (source=, 
callback=, user_data=) at util/async.c:260
 #18 0x7fa10913467d in g_main_dispatch (context=0x557225c28e80) at 
gmain.c:3176
 #19 0x7fa10913467d in g_main_context_dispatch 
(context=context@entry=0x557225c28e80) at gmain.c:3829
 #20 0x557223f31808 in glib_pollfds_poll () at util/main-loop.c:219
 #21 0x557223f31808 in os_host_main_loop_wait (timeout=) at 
util/main-loop.c:242
 #22 0x557223f31808 in main_loop_wait (nonblocking=) at 
util/main-loop.c:518
 #23 0x557223d13201 in main_loop () at vl.c:1828
 #24 0x557223bbfb82 in main (argc=, argv=, 
envp=) at vl.c:4504

RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1779036
Signed-off-by: Sergio Lopez 
---
 blockdev.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/blockdev.c b/blockdev.c
index 292961a88d..cfed87f434 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1731,6 +1731,8 @@ static void external_snapshot_abort(BlkActionState 
*common)
 if (state->new_bs) {
 if (state->overlay_appended) {
 AioContext *aio_context;
+AioContext *tmp_context;
+int ret;
 
 aio_context = bdrv_get_aio_context(state->old_bs);
 aio_context_acquire(aio_context);
@@ -1738,6 +1740,25 @@ static void external_snapshot_abort(BlkActionState 
*common)
+bdrv_ref(state->old_bs);   /* we can't let bdrv_set_backing_hd()
+  close state->old_bs; we need it */
+bdrv_set_backing_hd(state->new_bs, NULL, &error_abort);
+
+/*
+ * The call to bdrv_set_backing_hd() above returns state->old_bs to
+ * the main AioContext. As we're still going to be using it, return
+ * it to the AioContext it was before.
+ */
+tmp_context = bdrv_get_aio_context(state->old_bs);
+if (aio_context != tmp_context) {
+aio_context_release(aio_context);
+aio_context_acquire(tmp_context);
+
+ret = bdrv_try_set_aio_context(state->old_bs,
+   aio_context, NULL);
+assert(ret == 0);
+
+  

[PATCH v6 1/8] blockdev: fix coding style issues in drive_backup_prepare

2020-01-08 Thread Sergio Lopez
Fix a couple of minor coding style issues in drive_backup_prepare.

Signed-off-by: Sergio Lopez 
Reviewed-by: Max Reitz 
Reviewed-by: Kevin Wolf 
---
 blockdev.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 8e029e9c01..553e315972 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3620,7 +3620,7 @@ static BlockJob *do_drive_backup(DriveBackup *backup, 
JobTxn *txn,
 
 if (!backup->has_format) {
 backup->format = backup->mode == NEW_IMAGE_MODE_EXISTING ?
- NULL : (char*) bs->drv->format_name;
+ NULL : (char *) bs->drv->format_name;
 }
 
 /* Early check to avoid creating target */
@@ -3630,8 +3630,10 @@ static BlockJob *do_drive_backup(DriveBackup *backup, 
JobTxn *txn,
 
 flags = bs->open_flags | BDRV_O_RDWR;
 
-/* See if we have a backing HD we can use to create our new image
- * on top of. */
+/*
+ * See if we have a backing HD we can use to create our new image
+ * on top of.
+ */
 if (backup->sync == MIRROR_SYNC_MODE_TOP) {
 source = backing_bs(bs);
 if (!source) {
-- 
2.23.0




[PATCH v6 0/8] blockdev: Fix AioContext handling for various blockdev actions

2020-01-08 Thread Sergio Lopez
This patch series includes fixes for various issues related to
AioContext mismanagement for various blockdev-initiated actions.

It also adds tests for those actions when executed on drives running
on an IOThread AioContext.

---
Changelog:

v6:
 - Rename the patch series.
 - Add "block/backup-top: Don't acquire context while dropping top"
 - Add "blockdev: Acquire AioContext on dirty bitmap functions"
 - Add "blockdev: Return bs to the proper context on snapshot abort"
 - Add "iotests: Test handling of AioContexts with some blockdev
   actions" (thanks Kevin Wolf)
 - Fix context release on error. (thanks Kevin Wolf)

v5:
 - Include fix for iotest 141 in patch 2. (thanks Max Reitz)
 - Also fix iotest 185 and 219 in patch 2. (thanks Max Reitz)
 - Move the error block after context acquisition/release, so we can
   use goto to bail out and release resources. (thanks Max Reitz)
 - Properly release old_context in qmp_blockdev_mirror. (thanks Max
   Reitz)

v4:
 - Unify patches 1-4 and 5-7 to avoid producing broken interim
   states. (thanks Max Reitz)
 - Include a fix for iotest 141. (thanks Kevin Wolf)

v3:
 - Rework the whole patch series to fix the issue by consolidating all
   operations in the transaction model. (thanks Kevin Wolf)

v2:
 - Honor bdrv_try_set_aio_context() context acquisition requirements
   (thanks Max Reitz).
 - Release the context at drive_backup_prepare() instead of avoiding
   re-acquiring it at do_drive_backup(). (thanks Max Reitz)
 - Convert a single patch into a two-patch series.
---

Sergio Lopez (8):
  blockdev: fix coding style issues in drive_backup_prepare
  blockdev: unify qmp_drive_backup and drive-backup transaction paths
  blockdev: unify qmp_blockdev_backup and blockdev-backup transaction
paths
  blockdev: honor bdrv_try_set_aio_context() context requirements
  block/backup-top: Don't acquire context while dropping top
  blockdev: Acquire AioContext on dirty bitmap functions
  blockdev: Return bs to the proper context on snapshot abort
  iotests: Test handling of AioContexts with some blockdev actions

 block/backup-top.c |   5 -
 block/backup.c |   3 +
 blockdev.c | 391 -
 tests/qemu-iotests/141.out |   2 +
 tests/qemu-iotests/185.out |   2 +
 tests/qemu-iotests/219 |   7 +-
 tests/qemu-iotests/219.out |   8 +
 tests/qemu-iotests/281 | 247 +++
 tests/qemu-iotests/281.out |   5 +
 tests/qemu-iotests/group   |   1 +
 10 files changed, 484 insertions(+), 187 deletions(-)
 create mode 100755 tests/qemu-iotests/281
 create mode 100644 tests/qemu-iotests/281.out

-- 
2.23.0




[PATCH v6 6/8] blockdev: Acquire AioContext on dirty bitmap functions

2020-01-08 Thread Sergio Lopez
Dirty bitmap addition and removal functions do not acquire the BDS
AioContext, while they may call code that expects it to be
acquired.

This may trigger a crash with a stack trace like this one:

 #0  0x7f0ef146370f in __GI_raise (sig=sig@entry=6)
 at ../sysdeps/unix/sysv/linux/raise.c:50
 #1  0x7f0ef144db25 in __GI_abort () at abort.c:79
 #2  0x565022294dce in error_exit
 (err=, msg=msg@entry=0x56502243a730 <__func__.16350> 
"qemu_mutex_unlock_impl") at util/qemu-thread-posix.c:36
 #3  0x5650222950ba in qemu_mutex_unlock_impl
 (mutex=mutex@entry=0x5650244b0240, file=file@entry=0x565022439adf 
"util/async.c", line=line@entry=526) at util/qemu-thread-posix.c:108
 #4  0x565022290029 in aio_context_release
 (ctx=ctx@entry=0x5650244b01e0) at util/async.c:526
 #5  0x56502221cd08 in bdrv_can_store_new_dirty_bitmap
 (bs=bs@entry=0x5650244dc820, name=name@entry=0x56502481d360 "bitmap1", 
granularity=granularity@entry=65536, errp=errp@entry=0x7fff22831718)
 at block/dirty-bitmap.c:542
 #6  0x56502206ae53 in qmp_block_dirty_bitmap_add
 (errp=0x7fff22831718, disabled=false, has_disabled=, 
persistent=, has_persistent=true, granularity=65536, 
has_granularity=, name=0x56502481d360 "bitmap1", node=) at blockdev.c:2894
 #7  0x56502206ae53 in qmp_block_dirty_bitmap_add
 (node=, name=0x56502481d360 "bitmap1", 
has_granularity=, granularity=, 
has_persistent=true, persistent=, has_disabled=false, 
disabled=false, errp=0x7fff22831718) at blockdev.c:2856
 #8  0x5650221847a3 in qmp_marshal_block_dirty_bitmap_add
 (args=, ret=, errp=0x7fff22831798)
 at qapi/qapi-commands-block-core.c:651
 #9  0x565022247e6c in do_qmp_dispatch
 (errp=0x7fff22831790, allow_oob=, request=, 
cmds=0x565022b32d60 ) at qapi/qmp-dispatch.c:132
 #10 0x565022247e6c in qmp_dispatch
 (cmds=0x565022b32d60 , request=, 
allow_oob=) at qapi/qmp-dispatch.c:175
 #11 0x565022166061 in monitor_qmp_dispatch
 (mon=0x56502450faa0, req=) at monitor/qmp.c:145
 #12 0x5650221666fa in monitor_qmp_bh_dispatcher
 (data=) at monitor/qmp.c:234
 #13 0x56502228f866 in aio_bh_call (bh=0x56502440eae0)
 at util/async.c:117
 #14 0x56502228f866 in aio_bh_poll (ctx=ctx@entry=0x56502440d7a0)
 at util/async.c:117
 #15 0x565022292c54 in aio_dispatch (ctx=0x56502440d7a0)
 at util/aio-posix.c:459
 #16 0x56502228f742 in aio_ctx_dispatch
 (source=, callback=, user_data=) at util/async.c:260
 #17 0x7f0ef5ce667d in g_main_dispatch (context=0x56502449aa40)
 at gmain.c:3176
 #18 0x7f0ef5ce667d in g_main_context_dispatch
 (context=context@entry=0x56502449aa40) at gmain.c:3829
 #19 0x565022291d08 in glib_pollfds_poll () at util/main-loop.c:219
 #20 0x565022291d08 in os_host_main_loop_wait
 (timeout=) at util/main-loop.c:242
 #21 0x565022291d08 in main_loop_wait (nonblocking=)
 at util/main-loop.c:518
 #22 0x5650220743c1 in main_loop () at vl.c:1828
 #23 0x565021f20a72 in main
 (argc=, argv=, envp=)
 at vl.c:4504

Fix this by acquiring the AioContext at qmp_block_dirty_bitmap_add()
and qmp_block_dirty_bitmap_remove().
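The shape of the fix — take the context up front and route every early exit through one release point — maps in Python to try/finally, the equivalent of the diff's `goto out` / `out:` label (illustrative sketch, not QEMU API):

```python
import threading

ctx_lock = threading.Lock()

def dirty_bitmap_add(granularity):
    ctx_lock.acquire()           # aio_context_acquire(aio_context)
    try:
        # Validation mirroring the granularity check in the diff:
        if granularity < 512 or granularity & (granularity - 1):
            return None          # 'goto out' — still releases below
        return {"granularity": granularity}
    finally:
        ctx_lock.release()       # single release point ('out:' label)

assert dirty_bitmap_add(100) is None                      # too small
assert dirty_bitmap_add(65536) == {"granularity": 65536}  # power of two
assert not ctx_lock.locked()     # balanced on every path
```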

RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1782175
Signed-off-by: Sergio Lopez 
---
 blockdev.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 1dacbc20ec..292961a88d 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2984,6 +2984,7 @@ void qmp_block_dirty_bitmap_add(const char *node, const 
char *name,
 {
 BlockDriverState *bs;
 BdrvDirtyBitmap *bitmap;
+AioContext *aio_context;
 
 if (!name || name[0] == '\0') {
 error_setg(errp, "Bitmap name cannot be empty");
@@ -2995,11 +2996,14 @@ void qmp_block_dirty_bitmap_add(const char *node, const 
char *name,
 return;
 }
 
+aio_context = bdrv_get_aio_context(bs);
+aio_context_acquire(aio_context);
+
 if (has_granularity) {
 if (granularity < 512 || !is_power_of_2(granularity)) {
 error_setg(errp, "Granularity must be power of 2 "
  "and at least 512");
-return;
+goto out;
 }
 } else {
 /* Default to cluster size, if available: */
@@ -3017,12 +3021,12 @@ void qmp_block_dirty_bitmap_add(const char *node, const 
char *name,
 if (persistent &&
 !bdrv_can_store_new_dirty_bitmap(bs, name, granularity, errp))
 {
-return;
+goto out;
 }
 
 bitmap = bdrv_create_dirty_bitmap(bs, granularity, name, errp);
 if (bitmap == NULL) {
-return;
+goto out;
 }
 
 if (disabled) {
@@ -3030,6 +3034,9 @@ void qmp_block_dirty_bitmap_add(const char *node, const 
char *name,
 }
 
 bdrv_dirty_bitmap_set_persistence(bitmap, persistent);
+
+out:
+aio_context_release(aio_context);
 }
 
 sta

[PATCH v6 4/8] blockdev: honor bdrv_try_set_aio_context() context requirements

2020-01-08 Thread Sergio Lopez
bdrv_try_set_aio_context() requires that the old context is held, and
the new context is not held. Fix all the occurrences where it's not
done this way.
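The locking rule stated above implies the release/acquire dance that the hunks below repeat: drop the working context, take the node's old context, move the node, then swap back. A toy Python model of that ordering (illustrative only — QEMU's AioContext lock is recursive, this one is not):

```python
import threading

class AioCtx:
    def __init__(self, name):
        self.name, self._lock, self.held = name, threading.Lock(), False
    def acquire(self):
        self._lock.acquire(); self.held = True
    def release(self):
        self.held = False; self._lock.release()

class BS:
    def __init__(self, ctx):
        self.ctx = ctx

def try_set_aio_context(bs, new_ctx):
    # Precondition from the commit message: old held, new not held.
    assert bs.ctx.held, "old context must be held by the caller"
    assert not new_ctx.held, "new context must not be held"
    bs.ctx = new_ctx

aio_context = AioCtx("iothread0")   # caller's working context
old_context = AioCtx("main")        # target_bs's current context
target_bs = BS(old_context)

aio_context.acquire()

# The dance from the diff:
aio_context.release()
old_context.acquire()
try_set_aio_context(target_bs, aio_context)
old_context.release()
aio_context.acquire()

assert target_bs.ctx is aio_context
aio_context.release()
```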

Suggested-by: Max Reitz 
Signed-off-by: Sergio Lopez 
---
 blockdev.c | 68 +++---
 1 file changed, 60 insertions(+), 8 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 152a0f7454..1dacbc20ec 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1535,6 +1535,7 @@ static void external_snapshot_prepare(BlkActionState 
*common,
  DO_UPCAST(ExternalSnapshotState, common, common);
 TransactionAction *action = common->action;
 AioContext *aio_context;
+AioContext *old_context;
 int ret;
 
 /* 'blockdev-snapshot' and 'blockdev-snapshot-sync' have similar
@@ -1675,7 +1676,16 @@ static void external_snapshot_prepare(BlkActionState 
*common,
 goto out;
 }
 
+/* Honor bdrv_try_set_aio_context() context acquisition requirements. */
+old_context = bdrv_get_aio_context(state->new_bs);
+aio_context_release(aio_context);
+aio_context_acquire(old_context);
+
 ret = bdrv_try_set_aio_context(state->new_bs, aio_context, errp);
+
+aio_context_release(old_context);
+aio_context_acquire(aio_context);
+
 if (ret < 0) {
 goto out;
 }
@@ -1775,11 +1785,13 @@ static void drive_backup_prepare(BlkActionState 
*common, Error **errp)
 BlockDriverState *target_bs;
 BlockDriverState *source = NULL;
 AioContext *aio_context;
+AioContext *old_context;
 QDict *options;
 Error *local_err = NULL;
 int flags;
 int64_t size;
 bool set_backing_hd = false;
+int ret;
 
 assert(common->action->type == TRANSACTION_ACTION_KIND_DRIVE_BACKUP);
 backup = common->action->u.drive_backup.data;
@@ -1868,6 +1880,21 @@ static void drive_backup_prepare(BlkActionState *common, 
Error **errp)
 goto out;
 }
 
+/* Honor bdrv_try_set_aio_context() context acquisition requirements. */
+old_context = bdrv_get_aio_context(target_bs);
+aio_context_release(aio_context);
+aio_context_acquire(old_context);
+
+ret = bdrv_try_set_aio_context(target_bs, aio_context, errp);
+if (ret < 0) {
+bdrv_unref(target_bs);
+aio_context_release(old_context);
+return;
+}
+
+aio_context_release(old_context);
+aio_context_acquire(aio_context);
+
 if (set_backing_hd) {
bdrv_set_backing_hd(target_bs, source, &local_err);
 if (local_err) {
@@ -1947,6 +1974,8 @@ static void blockdev_backup_prepare(BlkActionState 
*common, Error **errp)
 BlockDriverState *bs;
 BlockDriverState *target_bs;
 AioContext *aio_context;
+AioContext *old_context;
+int ret;
 
 assert(common->action->type == TRANSACTION_ACTION_KIND_BLOCKDEV_BACKUP);
 backup = common->action->u.blockdev_backup.data;
@@ -1961,7 +1990,18 @@ static void blockdev_backup_prepare(BlkActionState 
*common, Error **errp)
 return;
 }
 
+/* Honor bdrv_try_set_aio_context() context acquisition requirements. */
 aio_context = bdrv_get_aio_context(bs);
+old_context = bdrv_get_aio_context(target_bs);
+aio_context_acquire(old_context);
+
+ret = bdrv_try_set_aio_context(target_bs, aio_context, errp);
+if (ret < 0) {
+aio_context_release(old_context);
+return;
+}
+
+aio_context_release(old_context);
 aio_context_acquire(aio_context);
 state->bs = bs;
 
@@ -3562,7 +3602,6 @@ static BlockJob *do_backup_common(BackupCommon *backup,
 BlockJob *job = NULL;
 BdrvDirtyBitmap *bmap = NULL;
 int job_flags = JOB_DEFAULT;
-int ret;
 
 if (!backup->has_speed) {
 backup->speed = 0;
@@ -3586,11 +3625,6 @@ static BlockJob *do_backup_common(BackupCommon *backup,
 backup->compress = false;
 }
 
-ret = bdrv_try_set_aio_context(target_bs, aio_context, errp);
-if (ret < 0) {
-return NULL;
-}
-
 if ((backup->sync == MIRROR_SYNC_MODE_BITMAP) ||
 (backup->sync == MIRROR_SYNC_MODE_INCREMENTAL)) {
 /* done before desugaring 'incremental' to print the right message */
@@ -3825,6 +3859,7 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
 BlockDriverState *bs;
 BlockDriverState *source, *target_bs;
 AioContext *aio_context;
+AioContext *old_context;
 BlockMirrorBackingMode backing_mode;
 Error *local_err = NULL;
 QDict *options = NULL;
@@ -3937,12 +3972,22 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
(arg->mode == NEW_IMAGE_MODE_EXISTING ||
 !bdrv_has_zero_init(target_bs)));
 
+
+/* Honor bdrv_try_set_aio_context() context acquisition requirements. */
+old_context = bdrv_get_aio_context(target_bs);
+aio_context_release(aio_context);
+aio_context_acquire(old_context);
+
 ret = bdrv_try_set_aio_context(targ

[PATCH v6 3/8] blockdev: unify qmp_blockdev_backup and blockdev-backup transaction paths

2020-01-08 Thread Sergio Lopez
Issuing a blockdev-backup from qmp_blockdev_backup takes a slightly
different path than when it's issued from a transaction. In the code,
this is manifested as some redundancy between do_blockdev_backup() and
blockdev_backup_prepare().

This change unifies both paths, merging do_blockdev_backup() and
blockdev_backup_prepare(), and changing qmp_blockdev_backup() to
create a transaction instead of calling do_backup_common() directly.

As a side-effect, now qmp_blockdev_backup() is executed inside a
drained section, as it happens when creating a blockdev-backup
transaction. This change is visible from the user's perspective, as
the job gets paused and immediately resumed before starting the actual
work.

Signed-off-by: Sergio Lopez 
Reviewed-by: Max Reitz 
Reviewed-by: Kevin Wolf 
---
 blockdev.c | 60 --
 1 file changed, 13 insertions(+), 47 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 5e85fc042e..152a0f7454 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1940,16 +1940,13 @@ typedef struct BlockdevBackupState {
 BlockJob *job;
 } BlockdevBackupState;
 
-static BlockJob *do_blockdev_backup(BlockdevBackup *backup, JobTxn *txn,
-Error **errp);
-
 static void blockdev_backup_prepare(BlkActionState *common, Error **errp)
 {
 BlockdevBackupState *state = DO_UPCAST(BlockdevBackupState, common, 
common);
 BlockdevBackup *backup;
-BlockDriverState *bs, *target;
+BlockDriverState *bs;
+BlockDriverState *target_bs;
 AioContext *aio_context;
-Error *local_err = NULL;
 
 assert(common->action->type == TRANSACTION_ACTION_KIND_BLOCKDEV_BACKUP);
 backup = common->action->u.blockdev_backup.data;
@@ -1959,8 +1956,8 @@ static void blockdev_backup_prepare(BlkActionState 
*common, Error **errp)
 return;
 }
 
-target = bdrv_lookup_bs(backup->target, backup->target, errp);
-if (!target) {
+target_bs = bdrv_lookup_bs(backup->target, backup->target, errp);
+if (!target_bs) {
 return;
 }
 
@@ -1971,13 +1968,10 @@ static void blockdev_backup_prepare(BlkActionState 
*common, Error **errp)
 /* Paired with .clean() */
 bdrv_drained_begin(state->bs);
 
-state->job = do_blockdev_backup(backup, common->block_job_txn, &local_err);
-if (local_err) {
-error_propagate(errp, local_err);
-goto out;
-}
+state->job = do_backup_common(qapi_BlockdevBackup_base(backup),
+  bs, target_bs, aio_context,
+  common->block_job_txn, errp);
 
-out:
 aio_context_release(aio_context);
 }
 
@@ -3695,41 +3689,13 @@ XDbgBlockGraph *qmp_x_debug_query_block_graph(Error 
**errp)
 return bdrv_get_xdbg_block_graph(errp);
 }
 
-BlockJob *do_blockdev_backup(BlockdevBackup *backup, JobTxn *txn,
- Error **errp)
+void qmp_blockdev_backup(BlockdevBackup *backup, Error **errp)
 {
-BlockDriverState *bs;
-BlockDriverState *target_bs;
-AioContext *aio_context;
-BlockJob *job;
-
-bs = bdrv_lookup_bs(backup->device, backup->device, errp);
-if (!bs) {
-return NULL;
-}
-
-target_bs = bdrv_lookup_bs(backup->target, backup->target, errp);
-if (!target_bs) {
-return NULL;
-}
-
-aio_context = bdrv_get_aio_context(bs);
-aio_context_acquire(aio_context);
-
-job = do_backup_common(qapi_BlockdevBackup_base(backup),
-   bs, target_bs, aio_context, txn, errp);
-
-aio_context_release(aio_context);
-return job;
-}
-
-void qmp_blockdev_backup(BlockdevBackup *arg, Error **errp)
-{
-BlockJob *job;
-job = do_blockdev_backup(arg, NULL, errp);
-if (job) {
-job_start(&job->job);
-}
+TransactionAction action = {
+.type = TRANSACTION_ACTION_KIND_BLOCKDEV_BACKUP,
+.u.blockdev_backup.data = backup,
+};
+blockdev_do_action(&action, errp);
 }
 
 /* Parameter check and block job starting for drive mirroring.
-- 
2.23.0




[PATCH v6 2/8] blockdev: unify qmp_drive_backup and drive-backup transaction paths

2020-01-08 Thread Sergio Lopez
Issuing a drive-backup from qmp_drive_backup takes a slightly
different path than when it's issued from a transaction. In the code,
this is manifested as some redundancy between do_drive_backup() and
drive_backup_prepare().

This change unifies both paths, merging do_drive_backup() and
drive_backup_prepare(), and changing qmp_drive_backup() to create a
transaction instead of calling do_backup_common() directly.

As a side-effect, now qmp_drive_backup() is executed inside a drained
section, as it happens when creating a drive-backup transaction. This
change is visible from the user's perspective, as the job gets paused
and immediately resumed before starting the actual work.

Also fix tests 141, 185 and 219 to cope with the extra
JOB_STATUS_CHANGE lines.
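The extra lines come from the transient pause/resume the drained section introduces. A hedged sketch of how a test could normalize the two event orderings so they compare equal (an illustrative helper, not part of iotests.py):

```python
def collapse_transient_pause(statuses):
    """Drop immediate 'paused' -> 'running' pairs from a job status list."""
    out = []
    i = 0
    while i < len(statuses):
        if (statuses[i] == "paused" and i + 1 < len(statuses)
                and statuses[i + 1] == "running"):
            i += 2          # skip the transient pause/resume pair
            continue
        out.append(statuses[i])
        i += 1
    return out

old = ["created", "running", "ready"]
new = ["created", "running", "paused", "running", "ready"]
assert collapse_transient_pause(new) == old
```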

Signed-off-by: Sergio Lopez 
Reviewed-by: Kevin Wolf 
---
 blockdev.c | 224 +
 tests/qemu-iotests/141.out |   2 +
 tests/qemu-iotests/185.out |   2 +
 tests/qemu-iotests/219 |   7 +-
 tests/qemu-iotests/219.out |   8 ++
 5 files changed, 117 insertions(+), 126 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 553e315972..5e85fc042e 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1761,39 +1761,128 @@ typedef struct DriveBackupState {
 BlockJob *job;
 } DriveBackupState;
 
-static BlockJob *do_drive_backup(DriveBackup *backup, JobTxn *txn,
-Error **errp);
+static BlockJob *do_backup_common(BackupCommon *backup,
+  BlockDriverState *bs,
+  BlockDriverState *target_bs,
+  AioContext *aio_context,
+  JobTxn *txn, Error **errp);
 
 static void drive_backup_prepare(BlkActionState *common, Error **errp)
 {
 DriveBackupState *state = DO_UPCAST(DriveBackupState, common, common);
-BlockDriverState *bs;
 DriveBackup *backup;
+BlockDriverState *bs;
+BlockDriverState *target_bs;
+BlockDriverState *source = NULL;
 AioContext *aio_context;
+QDict *options;
 Error *local_err = NULL;
+int flags;
+int64_t size;
+bool set_backing_hd = false;
 
 assert(common->action->type == TRANSACTION_ACTION_KIND_DRIVE_BACKUP);
 backup = common->action->u.drive_backup.data;
 
+if (!backup->has_mode) {
+backup->mode = NEW_IMAGE_MODE_ABSOLUTE_PATHS;
+}
+
 bs = bdrv_lookup_bs(backup->device, backup->device, errp);
 if (!bs) {
 return;
 }
 
+if (!bs->drv) {
+error_setg(errp, "Device has no medium");
+return;
+}
+
 aio_context = bdrv_get_aio_context(bs);
 aio_context_acquire(aio_context);
 
 /* Paired with .clean() */
 bdrv_drained_begin(bs);
 
-state->bs = bs;
+if (!backup->has_format) {
+backup->format = backup->mode == NEW_IMAGE_MODE_EXISTING ?
+ NULL : (char *) bs->drv->format_name;
+}
+
+/* Early check to avoid creating target */
+if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_BACKUP_SOURCE, errp)) {
+goto out;
+}
+
+flags = bs->open_flags | BDRV_O_RDWR;
+
+/*
+ * See if we have a backing HD we can use to create our new image
+ * on top of.
+ */
+if (backup->sync == MIRROR_SYNC_MODE_TOP) {
+source = backing_bs(bs);
+if (!source) {
+backup->sync = MIRROR_SYNC_MODE_FULL;
+}
+}
+if (backup->sync == MIRROR_SYNC_MODE_NONE) {
+source = bs;
+flags |= BDRV_O_NO_BACKING;
+set_backing_hd = true;
+}
+
+size = bdrv_getlength(bs);
+if (size < 0) {
+error_setg_errno(errp, -size, "bdrv_getlength failed");
+goto out;
+}
+
+if (backup->mode != NEW_IMAGE_MODE_EXISTING) {
+assert(backup->format);
+if (source) {
+bdrv_refresh_filename(source);
+bdrv_img_create(backup->target, backup->format, source->filename,
+source->drv->format_name, NULL,
+size, flags, false, _err);
+} else {
+bdrv_img_create(backup->target, backup->format, NULL, NULL, NULL,
+size, flags, false, _err);
+}
+}
 
-state->job = do_drive_backup(backup, common->block_job_txn, &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
 goto out;
 }
 
+options = qdict_new();
+qdict_put_str(options, "discard", "unmap");
+qdict_put_str(options, "detect-zeroes", "unmap");
+if (backup->format) {
+qdict_put_str(options, "driver", backup->format);
+}
+
+target_bs = bdrv_open(backup->target, NULL, options, flags, errp);
+if (!target_bs) {
+goto out;
+}
+
+if (set_backing_hd) {
+bdrv_set_backing_hd(target_bs, 

Re: [PATCH v5 4/4] blockdev: honor bdrv_try_set_aio_context() context requirements

2019-12-20 Thread Sergio Lopez

Sergio Lopez  writes:

> Sergio Lopez  writes:
>
>> Kevin Wolf  writes:
>>
>>> Am 13.12.2019 um 21:59 hat Eric Blake geschrieben:
>>>> On 12/9/19 10:06 AM, Kevin Wolf wrote:
>>>> > Am 28.11.2019 um 11:41 hat Sergio Lopez geschrieben:
>>>> > > bdrv_try_set_aio_context() requires that the old context is held, and
>>>> > > the new context is not held. Fix all the occurrences where it's not
>>>> > > done this way.
>>>> > > 
>>>> > > Suggested-by: Max Reitz 
>>>> > > Signed-off-by: Sergio Lopez 
>>>> > > ---
>>>> 
>>>> > Or in fact, I think you need to hold the AioContext of a bs to
>>>> > bdrv_unref() it, so maybe 'goto out' is right, but you need to unref
>>>> > target_bs while you still hold old_context.
>>>> 
>>>> I suspect https://bugzilla.redhat.com/show_bug.cgi?id=1779036 is also a
>>>> symptom of this.  The v5 patch did not fix this simple test case:
>>>
>>> Speaking of a test case... I think this series should probably add
>>> something to iotests.
>>
>> Okay, I'll try to add one.
>
> So I've been working on this, but it turns out that the issue isn't
> reproducible with 'accel=qtest'. I guess that's because the devices
> using the nodes aren't really active in this situation.
>
> Is it allowed to use 'accel=kvm:tcg' in iotests? I see test 235 does
> that, but I'm not sure if that's just an exception.

Please ignore this, it was another issue.

Thanks,
Sergio.






Re: [PATCH v5 4/4] blockdev: honor bdrv_try_set_aio_context() context requirements

2019-12-19 Thread Sergio Lopez

Sergio Lopez  writes:

> Kevin Wolf  writes:
>
>> Am 13.12.2019 um 21:59 hat Eric Blake geschrieben:
>>> On 12/9/19 10:06 AM, Kevin Wolf wrote:
>>> > Am 28.11.2019 um 11:41 hat Sergio Lopez geschrieben:
>>> > > bdrv_try_set_aio_context() requires that the old context is held, and
>>> > > the new context is not held. Fix all the occurrences where it's not
>>> > > done this way.
>>> > > 
>>> > > Suggested-by: Max Reitz 
>>> > > Signed-off-by: Sergio Lopez 
>>> > > ---
>>> 
>>> > Or in fact, I think you need to hold the AioContext of a bs to
>>> > bdrv_unref() it, so maybe 'goto out' is right, but you need to unref
>>> > target_bs while you still hold old_context.
>>> 
>>> I suspect https://bugzilla.redhat.com/show_bug.cgi?id=1779036 is also a
>>> symptom of this.  The v5 patch did not fix this simple test case:
>>
>> Speaking of a test case... I think this series should probably add
>> something to iotests.
>
> Okay, I'll try to add one.

So I've been working on this, but it turns out that the issue isn't
reproducible with 'accel=qtest'. I guess that's because the devices
using the nodes aren't really active in this situation.

Is it allowed to use 'accel=kvm:tcg' in iotests? I see test 235 does
that, but I'm not sure if that's just an exception.

Thanks,
Sergio.




Re: [PATCH v5 4/4] blockdev: honor bdrv_try_set_aio_context() context requirements

2019-12-18 Thread Sergio Lopez

Kevin Wolf  writes:

> Am 13.12.2019 um 21:59 hat Eric Blake geschrieben:
>> On 12/9/19 10:06 AM, Kevin Wolf wrote:
>> > Am 28.11.2019 um 11:41 hat Sergio Lopez geschrieben:
>> > > bdrv_try_set_aio_context() requires that the old context is held, and
>> > > the new context is not held. Fix all the occurrences where it's not
>> > > done this way.
>> > > 
>> > > Suggested-by: Max Reitz 
>> > > Signed-off-by: Sergio Lopez 
>> > > ---
>> 
>> > Or in fact, I think you need to hold the AioContext of a bs to
>> > bdrv_unref() it, so maybe 'goto out' is right, but you need to unref
>> > target_bs while you still hold old_context.
>> 
>> I suspect https://bugzilla.redhat.com/show_bug.cgi?id=1779036 is also a
>> symptom of this.  The v5 patch did not fix this simple test case:
>
> Speaking of a test case... I think this series should probably add
> something to iotests.

Okay, I'll try to add one.

Thanks,
Sergio.





Re: [PATCH v5 4/4] blockdev: honor bdrv_try_set_aio_context() context requirements

2019-12-18 Thread Sergio Lopez

Kevin Wolf  writes:

> Am 28.11.2019 um 11:41 hat Sergio Lopez geschrieben:
>> bdrv_try_set_aio_context() requires that the old context is held, and
>> the new context is not held. Fix all the occurrences where it's not
>> done this way.
>> 
>> Suggested-by: Max Reitz 
>> Signed-off-by: Sergio Lopez 
>> ---
>>  blockdev.c | 72 ++
>>  1 file changed, 62 insertions(+), 10 deletions(-)
>> 
>> diff --git a/blockdev.c b/blockdev.c
>> index 152a0f7454..e33abd7fd2 100644
>> --- a/blockdev.c
>> +++ b/blockdev.c
>> @@ -1535,6 +1535,7 @@ static void external_snapshot_prepare(BlkActionState *common,
>>   DO_UPCAST(ExternalSnapshotState, common, common);
>>  TransactionAction *action = common->action;
>>  AioContext *aio_context;
>> +AioContext *old_context;
>>  int ret;
>>  
>>  /* 'blockdev-snapshot' and 'blockdev-snapshot-sync' have similar
>> @@ -1675,7 +1676,16 @@ static void external_snapshot_prepare(BlkActionState *common,
>>  goto out;
>>  }
>>  
>> +/* Honor bdrv_try_set_aio_context() context acquisition requirements. */
>> +old_context = bdrv_get_aio_context(state->new_bs);
>> +aio_context_release(aio_context);
>> +aio_context_acquire(old_context);
>> +
>>  ret = bdrv_try_set_aio_context(state->new_bs, aio_context, errp);
>> +
>> +aio_context_release(old_context);
>> +aio_context_acquire(aio_context);
>> +
>>  if (ret < 0) {
>>  goto out;
>>  }
>> @@ -1775,11 +1785,13 @@ static void drive_backup_prepare(BlkActionState *common, Error **errp)
>>  BlockDriverState *target_bs;
>>  BlockDriverState *source = NULL;
>>  AioContext *aio_context;
>> +AioContext *old_context;
>>  QDict *options;
>>  Error *local_err = NULL;
>>  int flags;
>>  int64_t size;
>>  bool set_backing_hd = false;
>> +int ret;
>>  
>>  assert(common->action->type == TRANSACTION_ACTION_KIND_DRIVE_BACKUP);
>>  backup = common->action->u.drive_backup.data;
>> @@ -1868,6 +1880,20 @@ static void drive_backup_prepare(BlkActionState *common, Error **errp)
>>  goto out;
>>  }
>>  
>> +/* Honor bdrv_try_set_aio_context() context acquisition requirements. */
>> +old_context = bdrv_get_aio_context(target_bs);
>> +aio_context_release(aio_context);
>> +aio_context_acquire(old_context);
>> +
>> +ret = bdrv_try_set_aio_context(target_bs, aio_context, errp);
>> +aio_context_release(old_context);
>> +aio_context_acquire(aio_context);
>> +
>> +if (ret < 0) {
>> +goto out;
>
> I think this needs to be 'goto unref'.
>
> Or in fact, I think you need to hold the AioContext of a bs to
> bdrv_unref() it, so maybe 'goto out' is right, but you need to unref
> target_bs while you still hold old_context.

Thanks for catching this one. To avoid making the error bailout path
even more complicated, in v6 I'll be moving the check just after
bdrv_try_set_aio_context(), doing the unref, the release of old_context,
and a direct return.

>> +}
>> +
>>  if (set_backing_hd) {
>>  bdrv_set_backing_hd(target_bs, source, &local_err);
>>  if (local_err) {
>> @@ -1947,6 +1973,8 @@ static void blockdev_backup_prepare(BlkActionState *common, Error **errp)
>>  BlockDriverState *bs;
>>  BlockDriverState *target_bs;
>>  AioContext *aio_context;
>> +AioContext *old_context;
>> +int ret;
>>  
>>  assert(common->action->type == TRANSACTION_ACTION_KIND_BLOCKDEV_BACKUP);
>>  backup = common->action->u.blockdev_backup.data;
>> @@ -1961,7 +1989,18 @@ static void blockdev_backup_prepare(BlkActionState *common, Error **errp)
>>  return;
>>  }
>>  
>> +/* Honor bdrv_try_set_aio_context() context acquisition requirements. */
>>  aio_context = bdrv_get_aio_context(bs);
>> +old_context = bdrv_get_aio_context(target_bs);
>> +aio_context_acquire(old_context);
>> +
>> +ret = bdrv_try_set_aio_context(target_bs, aio_context, errp);
>> +if (ret < 0) {
>> +aio_context_release(old_context);
>> +return;
>> +}
>> +
>> +aio_context_release(old_context);
>>  aio_context_acquire(aio_context);
>>  state->bs = bs;
>>  

Re: [PATCH v5 4/4] blockdev: honor bdrv_try_set_aio_context() context requirements

2019-12-18 Thread Sergio Lopez

Eric Blake  writes:

> On 12/9/19 10:06 AM, Kevin Wolf wrote:
>> Am 28.11.2019 um 11:41 hat Sergio Lopez geschrieben:
>>> bdrv_try_set_aio_context() requires that the old context is held, and
>>> the new context is not held. Fix all the occurrences where it's not
>>> done this way.
>>>
>>> Suggested-by: Max Reitz 
>>> Signed-off-by: Sergio Lopez 
>>> ---
>
>> Or in fact, I think you need to hold the AioContext of a bs to
>> bdrv_unref() it, so maybe 'goto out' is right, but you need to unref
>> target_bs while you still hold old_context.
>
> I suspect https://bugzilla.redhat.com/show_bug.cgi?id=1779036 is also
> a symptom of this.  The v5 patch did not fix this simple test case:
>
>
> $ qemu-img create -f qcow2 f1 100m
> $ qemu-img create -f qcow2 f2 100m
> $ ./qemu-kvm -nodefaults -nographic -qmp stdio -object iothread,id=io0 \
>     -drive driver=qcow2,id=drive1,file=f1,if=none \
>     -device virtio-scsi-pci,id=scsi0,iothread=io0 \
>     -device scsi-hd,id=image1,drive=drive1 \
>     -drive driver=qcow2,id=drive2,file=f2,if=none \
>     -device virtio-blk-pci,id=image2,drive=drive2,iothread=io0
>
> {'execute':'qmp_capabilities'}
>
> {'execute':'transaction','arguments':{'actions':[
> {'type':'blockdev-snapshot-sync','data':{'device':'drive1',
> 'snapshot-file':'sn1','mode':'absolute-paths','format':'qcow2'}},
> {'type':'blockdev-snapshot-sync','data':{'device':'drive2',
> 'snapshot-file':'/aa/sn1','mode':'absolute-paths','format':'qcow2'}}]}}
>
> which is an aio context bug somewhere on the error path of
> blockdev-snapshot-sync (the first one has to be rolled back because
> the second part of the transaction fails early on a nonexistent
> directory)

This is slightly different. The problem resides in
external_snapshot_abort():

static void external_snapshot_abort(BlkActionState *common)
{
    ExternalSnapshotState *state =
                             DO_UPCAST(ExternalSnapshotState, common, common);
    if (state->new_bs) {
        if (state->overlay_appended) {
            AioContext *aio_context;

            aio_context = bdrv_get_aio_context(state->old_bs);
            aio_context_acquire(aio_context);

            bdrv_ref(state->old_bs);   /* we can't let bdrv_set_backing_hd()
                                          close state->old_bs; we need it */
            bdrv_set_backing_hd(state->new_bs, NULL, &error_abort);
            bdrv_replace_node(state->new_bs, state->old_bs, &error_abort);
            bdrv_unref(state->old_bs); /* bdrv_replace_node() ref'ed old_bs */

            aio_context_release(aio_context);
        }
    }
}

bdrv_set_backing_hd() returns state->old_bs to the main AioContext,
while bdrv_replace_node() expects state->new_bs and state->old_bs to be
using the same AioContext.

I'm thinking of sending this as a separate patch:

diff --git a/blockdev.c b/blockdev.c
index e33abd7fd2..6c73ac4e32 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1731,6 +1731,8 @@ static void external_snapshot_abort(BlkActionState *common)
     if (state->new_bs) {
         if (state->overlay_appended) {
             AioContext *aio_context;
+            AioContext *tmp_context;
+            int ret;
 
             aio_context = bdrv_get_aio_context(state->old_bs);
             aio_context_acquire(aio_context);
@@ -1738,6 +1740,25 @@ static void external_snapshot_abort(BlkActionState *common)
             bdrv_ref(state->old_bs);   /* we can't let bdrv_set_backing_hd()
                                           close state->old_bs; we need it */
             bdrv_set_backing_hd(state->new_bs, NULL, &error_abort);
+
+            /*
+             * The call to bdrv_set_backing_hd() above returns state->old_bs to
+             * the main AioContext. As we're still going to be using it, return
+             * it to the AioContext it was before.
+             */
+            tmp_context = bdrv_get_aio_context(state->old_bs);
+            if (aio_context != tmp_context) {
+                aio_context_release(aio_context);
+                aio_context_acquire(tmp_context);
+
+                ret = bdrv_try_set_aio_context(state->old_bs,
+                                               aio_context, NULL);
+                assert(ret == 0);
+
+                aio_context_release(tmp_context);
+                aio_context_acquire(aio_context);
+            }
+
             bdrv_replace_node(state->new_bs, state->old_bs, &error_abort);
             bdrv_unref(state->old_bs); /* bdrv_replace_node() ref'ed old_bs */

What do you think?

Sergio.




[PATCH v5 4/4] blockdev: honor bdrv_try_set_aio_context() context requirements

2019-11-28 Thread Sergio Lopez
bdrv_try_set_aio_context() requires that the old context is held, and
the new context is not held. Fix all the occurrences where it's not
done this way.

Suggested-by: Max Reitz 
Signed-off-by: Sergio Lopez 
---
 blockdev.c | 72 ++
 1 file changed, 62 insertions(+), 10 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 152a0f7454..e33abd7fd2 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1535,6 +1535,7 @@ static void external_snapshot_prepare(BlkActionState *common,
  DO_UPCAST(ExternalSnapshotState, common, common);
 TransactionAction *action = common->action;
 AioContext *aio_context;
+AioContext *old_context;
 int ret;
 
 /* 'blockdev-snapshot' and 'blockdev-snapshot-sync' have similar
@@ -1675,7 +1676,16 @@ static void external_snapshot_prepare(BlkActionState *common,
 goto out;
 }
 
+/* Honor bdrv_try_set_aio_context() context acquisition requirements. */
+old_context = bdrv_get_aio_context(state->new_bs);
+aio_context_release(aio_context);
+aio_context_acquire(old_context);
+
 ret = bdrv_try_set_aio_context(state->new_bs, aio_context, errp);
+
+aio_context_release(old_context);
+aio_context_acquire(aio_context);
+
 if (ret < 0) {
 goto out;
 }
@@ -1775,11 +1785,13 @@ static void drive_backup_prepare(BlkActionState *common, Error **errp)
 BlockDriverState *target_bs;
 BlockDriverState *source = NULL;
 AioContext *aio_context;
+AioContext *old_context;
 QDict *options;
 Error *local_err = NULL;
 int flags;
 int64_t size;
 bool set_backing_hd = false;
+int ret;
 
 assert(common->action->type == TRANSACTION_ACTION_KIND_DRIVE_BACKUP);
 backup = common->action->u.drive_backup.data;
@@ -1868,6 +1880,20 @@ static void drive_backup_prepare(BlkActionState *common, Error **errp)
 goto out;
 }
 
+/* Honor bdrv_try_set_aio_context() context acquisition requirements. */
+old_context = bdrv_get_aio_context(target_bs);
+aio_context_release(aio_context);
+aio_context_acquire(old_context);
+
+ret = bdrv_try_set_aio_context(target_bs, aio_context, errp);
+
+aio_context_release(old_context);
+aio_context_acquire(aio_context);
+
+if (ret < 0) {
+goto out;
+}
+
 if (set_backing_hd) {
bdrv_set_backing_hd(target_bs, source, &local_err);
 if (local_err) {
@@ -1947,6 +1973,8 @@ static void blockdev_backup_prepare(BlkActionState *common, Error **errp)
 BlockDriverState *bs;
 BlockDriverState *target_bs;
 AioContext *aio_context;
+AioContext *old_context;
+int ret;
 
 assert(common->action->type == TRANSACTION_ACTION_KIND_BLOCKDEV_BACKUP);
 backup = common->action->u.blockdev_backup.data;
@@ -1961,7 +1989,18 @@ static void blockdev_backup_prepare(BlkActionState *common, Error **errp)
 return;
 }
 
+/* Honor bdrv_try_set_aio_context() context acquisition requirements. */
 aio_context = bdrv_get_aio_context(bs);
+old_context = bdrv_get_aio_context(target_bs);
+aio_context_acquire(old_context);
+
+ret = bdrv_try_set_aio_context(target_bs, aio_context, errp);
+if (ret < 0) {
+aio_context_release(old_context);
+return;
+}
+
+aio_context_release(old_context);
 aio_context_acquire(aio_context);
 state->bs = bs;
 
@@ -3562,7 +3601,6 @@ static BlockJob *do_backup_common(BackupCommon *backup,
 BlockJob *job = NULL;
 BdrvDirtyBitmap *bmap = NULL;
 int job_flags = JOB_DEFAULT;
-int ret;
 
 if (!backup->has_speed) {
 backup->speed = 0;
@@ -3586,11 +3624,6 @@ static BlockJob *do_backup_common(BackupCommon *backup,
 backup->compress = false;
 }
 
-ret = bdrv_try_set_aio_context(target_bs, aio_context, errp);
-if (ret < 0) {
-return NULL;
-}
-
 if ((backup->sync == MIRROR_SYNC_MODE_BITMAP) ||
 (backup->sync == MIRROR_SYNC_MODE_INCREMENTAL)) {
 /* done before desugaring 'incremental' to print the right message */
@@ -3825,6 +3858,7 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
 BlockDriverState *bs;
 BlockDriverState *source, *target_bs;
 AioContext *aio_context;
+AioContext *old_context;
 BlockMirrorBackingMode backing_mode;
 Error *local_err = NULL;
 QDict *options = NULL;
@@ -3937,10 +3971,19 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
(arg->mode == NEW_IMAGE_MODE_EXISTING ||
 !bdrv_has_zero_init(target_bs)));
 
+
+/* Honor bdrv_try_set_aio_context() context acquisition requirements. */
+old_context = bdrv_get_aio_context(target_bs);
+aio_context_release(aio_context);
+aio_context_acquire(old_context);
+
 ret = bdrv_try_set_aio_context(target_bs, aio_context, errp);
+
+aio_context_release(old_c

[PATCH v5 2/4] blockdev: unify qmp_drive_backup and drive-backup transaction paths

2019-11-28 Thread Sergio Lopez
Issuing a drive-backup from qmp_drive_backup takes a slightly
different path than when it's issued from a transaction. In the code,
this is manifested as some redundancy between do_drive_backup() and
drive_backup_prepare().

This change unifies both paths, merging do_drive_backup() and
drive_backup_prepare(), and changing qmp_drive_backup() to create a
transaction instead of calling do_backup_common() directly.

As a side effect, qmp_drive_backup() now runs inside a drained
section, just as it does when the backup is issued through a
transaction. This change is visible from the user's perspective, as
the job gets paused and immediately resumed before starting the
actual work.

Also fix tests 141, 185 and 219 to cope with the extra
JOB_STATUS_CHANGE lines.

Signed-off-by: Sergio Lopez 
---
 blockdev.c | 224 +
 tests/qemu-iotests/141.out |   2 +
 tests/qemu-iotests/185.out |   2 +
 tests/qemu-iotests/219 |   7 +-
 tests/qemu-iotests/219.out |   8 ++
 5 files changed, 117 insertions(+), 126 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 553e315972..5e85fc042e 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1761,39 +1761,128 @@ typedef struct DriveBackupState {
 BlockJob *job;
 } DriveBackupState;
 
-static BlockJob *do_drive_backup(DriveBackup *backup, JobTxn *txn,
-Error **errp);
+static BlockJob *do_backup_common(BackupCommon *backup,
+  BlockDriverState *bs,
+  BlockDriverState *target_bs,
+  AioContext *aio_context,
+  JobTxn *txn, Error **errp);
 
 static void drive_backup_prepare(BlkActionState *common, Error **errp)
 {
 DriveBackupState *state = DO_UPCAST(DriveBackupState, common, common);
-BlockDriverState *bs;
 DriveBackup *backup;
+BlockDriverState *bs;
+BlockDriverState *target_bs;
+BlockDriverState *source = NULL;
 AioContext *aio_context;
+QDict *options;
 Error *local_err = NULL;
+int flags;
+int64_t size;
+bool set_backing_hd = false;
 
 assert(common->action->type == TRANSACTION_ACTION_KIND_DRIVE_BACKUP);
 backup = common->action->u.drive_backup.data;
 
+if (!backup->has_mode) {
+backup->mode = NEW_IMAGE_MODE_ABSOLUTE_PATHS;
+}
+
 bs = bdrv_lookup_bs(backup->device, backup->device, errp);
 if (!bs) {
 return;
 }
 
+if (!bs->drv) {
+error_setg(errp, "Device has no medium");
+return;
+}
+
 aio_context = bdrv_get_aio_context(bs);
 aio_context_acquire(aio_context);
 
 /* Paired with .clean() */
 bdrv_drained_begin(bs);
 
-state->bs = bs;
+if (!backup->has_format) {
+backup->format = backup->mode == NEW_IMAGE_MODE_EXISTING ?
+ NULL : (char *) bs->drv->format_name;
+}
+
+/* Early check to avoid creating target */
+if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_BACKUP_SOURCE, errp)) {
+goto out;
+}
+
+flags = bs->open_flags | BDRV_O_RDWR;
+
+/*
+ * See if we have a backing HD we can use to create our new image
+ * on top of.
+ */
+if (backup->sync == MIRROR_SYNC_MODE_TOP) {
+source = backing_bs(bs);
+if (!source) {
+backup->sync = MIRROR_SYNC_MODE_FULL;
+}
+}
+if (backup->sync == MIRROR_SYNC_MODE_NONE) {
+source = bs;
+flags |= BDRV_O_NO_BACKING;
+set_backing_hd = true;
+}
+
+size = bdrv_getlength(bs);
+if (size < 0) {
+error_setg_errno(errp, -size, "bdrv_getlength failed");
+goto out;
+}
+
+if (backup->mode != NEW_IMAGE_MODE_EXISTING) {
+assert(backup->format);
+if (source) {
+bdrv_refresh_filename(source);
+bdrv_img_create(backup->target, backup->format, source->filename,
+source->drv->format_name, NULL,
+size, flags, false, &local_err);
+} else {
+bdrv_img_create(backup->target, backup->format, NULL, NULL, NULL,
+size, flags, false, &local_err);
+}
+}
 
-state->job = do_drive_backup(backup, common->block_job_txn, &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
 goto out;
 }
 
+options = qdict_new();
+qdict_put_str(options, "discard", "unmap");
+qdict_put_str(options, "detect-zeroes", "unmap");
+if (backup->format) {
+qdict_put_str(options, "driver", backup->format);
+}
+
+target_bs = bdrv_open(backup->target, NULL, options, flags, errp);
+if (!target_bs) {
+goto out;
+}
+
+if (set_backing_hd) {
+bdrv_set_backing_hd(target_bs, 
