Re: [PATCH v2 5/5] vhost: add device started check in migration set log

2020-05-11 Thread Li Feng
Hi, Dima.

If vhost_migration_log return < 0, then vhost_log_global_start will
trigger a crash.
Does your patch have process this abort?
If a disconnect happens in the migration stage, the correct operation
is to stop the migration, right?

 841 static void vhost_log_global_start(MemoryListener *listener)
 842 {
 843 int r;
 844
 845 r = vhost_migration_log(listener, true);
 846 if (r < 0) {
 847 abort();
 848 }
 849 }

Thanks,

Feng Li

Jason Wang  于2020年5月12日周二 上午11:33写道:
>
>
> On 2020/5/11 下午5:25, Dima Stepanov wrote:
> > On Mon, May 11, 2020 at 11:15:53AM +0800, Jason Wang wrote:
> >> On 2020/4/30 下午9:36, Dima Stepanov wrote:
> >>> If vhost-user daemon is used as a backend for the vhost device, then we
> >>> should consider a possibility of disconnect at any moment. If such
> >>> disconnect happened in the vhost_migration_log() routine the vhost
> >>> device structure will be clean up.
> >>> At the start of the vhost_migration_log() function there is a check:
> >>>if (!dev->started) {
> >>>dev->log_enabled = enable;
> >>>return 0;
> >>>}
> >>> To be consistent with this check add the same check after calling the
> >>> vhost_dev_set_log() routine. This in general help not to break a
> >>> migration due the assert() message. But it looks like that this code
> >>> should be revised to handle these errors more carefully.
> >>>
> >>> In case of vhost-user device backend the fail paths should consider the
> >>> state of the device. In this case we should skip some function calls
> >>> during rollback on the error paths, so not to get the NULL dereference
> >>> errors.
> >>>
> >>> Signed-off-by: Dima Stepanov 
> >>> ---
> >>>   hw/virtio/vhost.c | 39 +++
> >>>   1 file changed, 35 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >>> index 3ee50c4..d5ab96d 100644
> >>> --- a/hw/virtio/vhost.c
> >>> +++ b/hw/virtio/vhost.c
> >>> @@ -787,6 +787,17 @@ static int vhost_dev_set_features(struct vhost_dev 
> >>> *dev,
> >>>   static int vhost_dev_set_log(struct vhost_dev *dev, bool enable_log)
> >>>   {
> >>>   int r, i, idx;
> >>> +
> >>> +if (!dev->started) {
> >>> +/*
> >>> + * If vhost-user daemon is used as a backend for the
> >>> + * device and the connection is broken, then the vhost_dev
> >>> + * structure will be reset all its values to 0.
> >>> + * Add additional check for the device state.
> >>> + */
> >>> +return -1;
> >>> +}
> >>> +
> >>>   r = vhost_dev_set_features(dev, enable_log);
> >>>   if (r < 0) {
> >>>   goto err_features;
> >>> @@ -801,12 +812,19 @@ static int vhost_dev_set_log(struct vhost_dev *dev, 
> >>> bool enable_log)
> >>>   }
> >>>   return 0;
> >>>   err_vq:
> >>> -for (; i >= 0; --i) {
> >>> +/*
> >>> + * Disconnect with the vhost-user daemon can lead to the
> >>> + * vhost_dev_cleanup() call which will clean up vhost_dev
> >>> + * structure.
> >>> + */
> >>> +for (; dev->started && (i >= 0); --i) {
> >>>   idx = dev->vhost_ops->vhost_get_vq_index(
> >>
> >> Why need the check of dev->started here, can started be modified outside
> >> mainloop? If yes, I don't get the check of !dev->started in the beginning 
> >> of
> >> this function.
> >>
> > No dev->started can't change outside the mainloop. The main problem is
> > only for the vhost_user_blk daemon. Consider the case when we
> > successfully pass the dev->started check at the beginning of the
> > function, but after it we hit the disconnect on the next call on the
> > second or third iteration:
> >   r = vhost_virtqueue_set_addr(dev, dev->vqs + i, idx, enable_log);
> > The unix socket backend device will call the disconnect routine for this
> > device and reset the structure. So the structure will be reset (and
> > dev->started set to false) inside this set_addr() call.
>
>
> I still don't get here. I think the disconnect can not happen in the
> middle of vhost_dev_set_log() since both of them were running in
> mainloop. And even if it can, we probably need other synchronization
> mechanism other than simple check here.
>
>
> >   So
> > we shouldn't call the clean up calls because this virtqueues were clean
> > up in the disconnect call. But we should protect these calls somehow, so
> > it will not hit SIGSEGV and we will be able to pass migration.
> >
> > Just to summarize it:
> > For the vhost-user-blk devices we ca hit clean up calls twice in case of
> > vhost disconnect:
> > 1. The first time during the disconnect process. The clean up is called
> > inside it.
> > 2. The second time during roll back clean up.
> > So if it is the case we should skip p2.
> >
> >>> dev, dev->vq_index + i);
> >>>   vhost_virtqueue_set_addr(dev, dev->vqs + i, idx,
> >>>dev->log_enabled);
> >>>   }
> >>> -vhost_dev_set_features(dev, 

Re: [PATCH v2 5/5] vhost: add device started check in migration set log

2020-05-11 Thread Jason Wang



On 2020/5/11 下午5:25, Dima Stepanov wrote:

On Mon, May 11, 2020 at 11:15:53AM +0800, Jason Wang wrote:

On 2020/4/30 下午9:36, Dima Stepanov wrote:

If vhost-user daemon is used as a backend for the vhost device, then we
should consider a possibility of disconnect at any moment. If such
disconnect happened in the vhost_migration_log() routine the vhost
device structure will be clean up.
At the start of the vhost_migration_log() function there is a check:
   if (!dev->started) {
   dev->log_enabled = enable;
   return 0;
   }
To be consistent with this check add the same check after calling the
vhost_dev_set_log() routine. This in general help not to break a
migration due the assert() message. But it looks like that this code
should be revised to handle these errors more carefully.

In case of vhost-user device backend the fail paths should consider the
state of the device. In this case we should skip some function calls
during rollback on the error paths, so not to get the NULL dereference
errors.

Signed-off-by: Dima Stepanov 
---
  hw/virtio/vhost.c | 39 +++
  1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 3ee50c4..d5ab96d 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -787,6 +787,17 @@ static int vhost_dev_set_features(struct vhost_dev *dev,
  static int vhost_dev_set_log(struct vhost_dev *dev, bool enable_log)
  {
  int r, i, idx;
+
+if (!dev->started) {
+/*
+ * If vhost-user daemon is used as a backend for the
+ * device and the connection is broken, then the vhost_dev
+ * structure will be reset all its values to 0.
+ * Add additional check for the device state.
+ */
+return -1;
+}
+
  r = vhost_dev_set_features(dev, enable_log);
  if (r < 0) {
  goto err_features;
@@ -801,12 +812,19 @@ static int vhost_dev_set_log(struct vhost_dev *dev, bool 
enable_log)
  }
  return 0;
  err_vq:
-for (; i >= 0; --i) {
+/*
+ * Disconnect with the vhost-user daemon can lead to the
+ * vhost_dev_cleanup() call which will clean up vhost_dev
+ * structure.
+ */
+for (; dev->started && (i >= 0); --i) {
  idx = dev->vhost_ops->vhost_get_vq_index(


Why need the check of dev->started here, can started be modified outside
mainloop? If yes, I don't get the check of !dev->started in the beginning of
this function.


No dev->started can't change outside the mainloop. The main problem is
only for the vhost_user_blk daemon. Consider the case when we
successfully pass the dev->started check at the beginning of the
function, but after it we hit the disconnect on the next call on the
second or third iteration:
  r = vhost_virtqueue_set_addr(dev, dev->vqs + i, idx, enable_log);
The unix socket backend device will call the disconnect routine for this
device and reset the structure. So the structure will be reset (and
dev->started set to false) inside this set_addr() call.



I still don't get here. I think the disconnect can not happen in the 
middle of vhost_dev_set_log() since both of them were running in 
mainloop. And even if it can, we probably need other synchronization 
mechanism other than simple check here.




  So
we shouldn't call the clean up calls because this virtqueues were clean
up in the disconnect call. But we should protect these calls somehow, so
it will not hit SIGSEGV and we will be able to pass migration.

Just to summarize it:
For the vhost-user-blk devices we ca hit clean up calls twice in case of
vhost disconnect:
1. The first time during the disconnect process. The clean up is called
inside it.
2. The second time during roll back clean up.
So if it is the case we should skip p2.


dev, dev->vq_index + i);
  vhost_virtqueue_set_addr(dev, dev->vqs + i, idx,
   dev->log_enabled);
  }
-vhost_dev_set_features(dev, dev->log_enabled);
+if (dev->started) {
+vhost_dev_set_features(dev, dev->log_enabled);
+}
  err_features:
  return r;
  }
@@ -832,7 +850,15 @@ static int vhost_migration_log(MemoryListener *listener, 
int enable)
  } else {
  vhost_dev_log_resize(dev, vhost_get_log_size(dev));
  r = vhost_dev_set_log(dev, true);
-if (r < 0) {
+/*
+ * The dev log resize can fail, because of disconnect
+ * with the vhost-user-blk daemon. Check the device
+ * state before calling the vhost_dev_set_log()
+ * function.
+ * Don't return error if device isn't started to be
+ * consistent with the check above.
+ */
+if (dev->started && r < 0) {
  return r;
  }
  }
@@ -1739,7 +1765,12 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice 
*vdev)
  fail_log:
  vhost_log_put(hdev, false);
  fail_vq:
-while (--i >= 0) {
+/*
+ * Disconnect with the vhost-user daemon can lead 

Re: [PATCH v2 4/5] vhost: check vring address before calling unmap

2020-05-11 Thread Jason Wang



On 2020/5/11 下午5:11, Dima Stepanov wrote:

On Mon, May 11, 2020 at 11:05:58AM +0800, Jason Wang wrote:

On 2020/4/30 下午9:36, Dima Stepanov wrote:

Since disconnect can happen at any time during initialization not all
vring buffers (for instance used vring) can be intialized successfully.
If the buffer was not initialized then vhost_memory_unmap call will lead
to SIGSEGV. Add checks for the vring address value before calling unmap.
Also add assert() in the vhost_memory_unmap() routine.

Signed-off-by: Dima Stepanov 
---
  hw/virtio/vhost.c | 27 +--
  1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index ddbdc53..3ee50c4 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -314,6 +314,8 @@ static void vhost_memory_unmap(struct vhost_dev *dev, void 
*buffer,
 hwaddr len, int is_write,
 hwaddr access_len)
  {
+assert(buffer);
+
  if (!vhost_dev_has_iommu(dev)) {
  cpu_physical_memory_unmap(buffer, len, is_write, access_len);
  }
@@ -1132,12 +1134,25 @@ static void vhost_virtqueue_stop(struct vhost_dev *dev,
  vhost_vq_index);
  }
-vhost_memory_unmap(dev, vq->used, virtio_queue_get_used_size(vdev, idx),
-   1, virtio_queue_get_used_size(vdev, idx));
-vhost_memory_unmap(dev, vq->avail, virtio_queue_get_avail_size(vdev, idx),
-   0, virtio_queue_get_avail_size(vdev, idx));
-vhost_memory_unmap(dev, vq->desc, virtio_queue_get_desc_size(vdev, idx),
-   0, virtio_queue_get_desc_size(vdev, idx));
+/*
+ * Since the vhost-user disconnect can happen during initialization
+ * check if vring was initialized, before making unmap.
+ */
+if (vq->used) {
+vhost_memory_unmap(dev, vq->used,
+   virtio_queue_get_used_size(vdev, idx),
+   1, virtio_queue_get_used_size(vdev, idx));
+}
+if (vq->avail) {
+vhost_memory_unmap(dev, vq->avail,
+   virtio_queue_get_avail_size(vdev, idx),
+   0, virtio_queue_get_avail_size(vdev, idx));
+}
+if (vq->desc) {
+vhost_memory_unmap(dev, vq->desc,
+   virtio_queue_get_desc_size(vdev, idx),
+   0, virtio_queue_get_desc_size(vdev, idx));
+}


Any reason not checking hdev->started instead? vhost_dev_start() will set it
to true if virtqueues were correctly mapped.

Thanks

Well i see it a little bit different:
  - vhost_dev_start() sets hdev->started to true before starting
virtqueues
  - vhost_virtqueue_start() maps all the memory
If we hit the vhost disconnect at the start of the
vhost_virtqueue_start(), for instance for this call:
   r = dev->vhost_ops->vhost_set_vring_base(dev, );
Then we will call vhost_user_blk_disconnect:
   vhost_user_blk_disconnect()->
 vhost_user_blk_stop()->
   vhost_dev_stop()->
 vhost_virtqueue_stop()
As a result we will come in this routine with the hdev->started still
set to true, but if used/avail/desc fields still uninitialized and set
to 0.



I may miss something, but consider both vhost_dev_start() and 
vhost_user_blk_disconnect() were serialized in main loop. Can this 
really happen?


Thanks







  }
  static void vhost_eventfd_add(MemoryListener *listener,





Re: [PATCH v3 3/9] block: Make it easier to learn which BDS support bitmaps

2020-05-11 Thread Eric Blake

On 5/11/20 1:16 PM, Eric Blake wrote:

On 5/11/20 4:21 AM, Max Reitz wrote:



+++ b/include/block/block_int.h
@@ -560,6 +560,7 @@ struct BlockDriver {
   uint64_t parent_perm, uint64_t 
parent_shared,

   uint64_t *nperm, uint64_t *nshared);

+    bool (*bdrv_dirty_bitmap_supported)(BlockDriverState *bs);


All BDSs support bitmaps, but only some support persistent dirty
bitmaps, so I think the name should reflect that.


How about .bdrv_dirty_bitmap_supports_persistent?


Bike-shedding myself, it looks like 
.bdrv_supports_persistent_dirty_bitmap is better (if you go by the 
naming convention 'noun-verb-details', it makes more sense that a 'bdrv' 
supports 'persistent dirty bitmaps', than that a 'bdrv dirty bitmap' 
supports 'persistence', particularly when the parameter is a 
BlockDriverState rather than a BdrvDirtyBitmap).


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




[PATCH] hw: Use QEMU_IS_ALIGNED() on parallel flash block size

2020-05-11 Thread Philippe Mathieu-Daudé
Use the QEMU_IS_ALIGNED() macro to verify the flash block size
is properly aligned. It is quicker to process when reviewing.

Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/arm/sbsa-ref.c   | 2 +-
 hw/arm/virt.c   | 2 +-
 hw/block/pflash_cfi01.c | 2 +-
 hw/block/pflash_cfi02.c | 2 +-
 hw/i386/pc_sysfw.c  | 2 +-
 hw/riscv/virt.c | 2 +-
 6 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/hw/arm/sbsa-ref.c b/hw/arm/sbsa-ref.c
index 8409ba853d..b379e4a76a 100644
--- a/hw/arm/sbsa-ref.c
+++ b/hw/arm/sbsa-ref.c
@@ -241,7 +241,7 @@ static void sbsa_flash_map1(PFlashCFI01 *flash,
 {
 DeviceState *dev = DEVICE(flash);
 
-assert(size % SBSA_FLASH_SECTOR_SIZE == 0);
+assert(QEMU_IS_ALIGNED(size, SBSA_FLASH_SECTOR_SIZE));
 assert(size / SBSA_FLASH_SECTOR_SIZE <= UINT32_MAX);
 qdev_prop_set_uint32(dev, "num-blocks", size / SBSA_FLASH_SECTOR_SIZE);
 qdev_init_nofail(dev);
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 634db0cfe9..0a99fddb3d 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -978,7 +978,7 @@ static void virt_flash_map1(PFlashCFI01 *flash,
 {
 DeviceState *dev = DEVICE(flash);
 
-assert(size % VIRT_FLASH_SECTOR_SIZE == 0);
+assert(QEMU_IS_ALIGNED(size, VIRT_FLASH_SECTOR_SIZE));
 assert(size / VIRT_FLASH_SECTOR_SIZE <= UINT32_MAX);
 qdev_prop_set_uint32(dev, "num-blocks", size / VIRT_FLASH_SECTOR_SIZE);
 qdev_init_nofail(dev);
diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
index f586bac269..11922c0f96 100644
--- a/hw/block/pflash_cfi01.c
+++ b/hw/block/pflash_cfi01.c
@@ -964,7 +964,7 @@ PFlashCFI01 *pflash_cfi01_register(hwaddr base,
 if (blk) {
 qdev_prop_set_drive(dev, "drive", blk, _abort);
 }
-assert(size % sector_len == 0);
+assert(QEMU_IS_ALIGNED(size, sector_len));
 qdev_prop_set_uint32(dev, "num-blocks", size / sector_len);
 qdev_prop_set_uint64(dev, "sector-length", sector_len);
 qdev_prop_set_uint8(dev, "width", bank_width);
diff --git a/hw/block/pflash_cfi02.c b/hw/block/pflash_cfi02.c
index c6b6f2d082..895f7daee3 100644
--- a/hw/block/pflash_cfi02.c
+++ b/hw/block/pflash_cfi02.c
@@ -1003,7 +1003,7 @@ PFlashCFI02 *pflash_cfi02_register(hwaddr base,
 if (blk) {
 qdev_prop_set_drive(dev, "drive", blk, _abort);
 }
-assert(size % sector_len == 0);
+assert(QEMU_IS_ALIGNED(size, sector_len));
 qdev_prop_set_uint32(dev, "num-blocks", size / sector_len);
 qdev_prop_set_uint32(dev, "sector-length", sector_len);
 qdev_prop_set_uint8(dev, "width", width);
diff --git a/hw/i386/pc_sysfw.c b/hw/i386/pc_sysfw.c
index f5f3f466b0..fad41f0e73 100644
--- a/hw/i386/pc_sysfw.c
+++ b/hw/i386/pc_sysfw.c
@@ -168,7 +168,7 @@ static void pc_system_flash_map(PCMachineState *pcms,
  blk_name(blk), strerror(-size));
 exit(1);
 }
-if (size == 0 || size % FLASH_SECTOR_SIZE != 0) {
+if (size == 0 || !QEMU_IS_ALIGNED(size, FLASH_SECTOR_SIZE)) {
 error_report("system firmware block device %s has invalid size "
  "%" PRId64,
  blk_name(blk), size);
diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
index daae3ebdbb..71481d59c2 100644
--- a/hw/riscv/virt.c
+++ b/hw/riscv/virt.c
@@ -112,7 +112,7 @@ static void virt_flash_map1(PFlashCFI01 *flash,
 {
 DeviceState *dev = DEVICE(flash);
 
-assert(size % VIRT_FLASH_SECTOR_SIZE == 0);
+assert(QEMU_IS_ALIGNED(size, VIRT_FLASH_SECTOR_SIZE));
 assert(size / VIRT_FLASH_SECTOR_SIZE <= UINT32_MAX);
 qdev_prop_set_uint32(dev, "num-blocks", size / VIRT_FLASH_SECTOR_SIZE);
 qdev_init_nofail(dev);
-- 
2.21.3




Re: [PATCH 2/5] io/channel.c,io/channel-socket.c: Add yank feature

2020-05-11 Thread Lukas Straub
On Mon, 11 May 2020 12:51:46 +0100
Daniel P. Berrangé  wrote:

> On Mon, May 11, 2020 at 01:14:41PM +0200, Lukas Straub wrote:
> > Add qio_channel_set_yank function to channel and to channel-socket,
> > which will register a yank function. The yank function calls
> > shutdown() on the socket.
> > 
> > Signed-off-by: Lukas Straub 
> > ---
> >  Makefile.objs   |  1 +
> >  include/io/channel-socket.h |  1 +
> >  include/io/channel.h| 12 
> >  io/channel-socket.c | 29 +
> >  io/channel.c|  9 +
> >  5 files changed, 52 insertions(+)  
> 
> Assuming we want the yank feature (which I'm not entirely convinced
> of), then I don't think any of this addition should exist. The
> QIOChannel class already provides a "qio_channel_shutdown" method
> which can be invoked. The layer above which is using the QIOChannel
> should be calling this existing qio_channel_shutdown method in
> response to any yank request.  The I/O layer shouldn't have any
> direct dependancy on the yank feature.

Having the code here simplifys the code in the other places.

Regards,
Lukas Straub

> 
> Regards,
> Daniel



pgpqVPtmDZNZf.pgp
Description: OpenPGP digital signature


Re: [PATCH 0/5] Introduce 'yank' oob qmp command to recover from hanging qemu

2020-05-11 Thread Lukas Straub
On Mon, 11 May 2020 13:17:14 +0100
Daniel P. Berrangé  wrote:

> On Mon, May 11, 2020 at 01:07:18PM +0100, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrangé (berra...@redhat.com) wrote:  
> > > On Mon, May 11, 2020 at 01:14:34PM +0200, Lukas Straub wrote:  
> > > > Hello Everyone,
> > > > In many cases, if qemu has a network connection (qmp, migration, 
> > > > chardev, etc.)
> > > > to some other server and that server dies or hangs, qemu hangs too.  
> > > 
> > > If qemu as a whole hangs due to a stalled network connection, that is a
> > > bug in QEMU that we should be fixing IMHO. QEMU should be doing 
> > > non-blocking
> > > I/O in general, such that if the network connection or remote server 
> > > stalls,
> > > we simply stop sending I/O - we shouldn't ever hang the QEMU process or 
> > > main
> > > loop.
> > > 
> > > There are places in QEMU code which are not well behaved in this respect,
> > > but many are, and others are getting fixed where found to be important.
> > > 
> > > Arguably any place in QEMU code which can result in a hang of QEMU in the
> > > event of a stalled network should be considered a security flaw, because
> > > the network is untrusted in general.  
> > 
> > That's not really true of the 'management network' - people trust that
> > and I don't see a lot of the qemu code getting fixed safely for all of
> > them.  
> 
> It depends on the user / app / deployment scenario. In OpenStack alot of
> work was done to beef up security between services on the mgmt network,
> with TLS encryption as standard to reduce attack vectors.
> 
> > > > These patches introduce the new 'yank' out-of-band qmp command to 
> > > > recover from
> > > > these kinds of hangs. The different subsystems register callbacks which 
> > > > get
> > > > executed with the yank command. For example the callback can shutdown() 
> > > > a
> > > > socket. This is intended for the colo use-case, but it can be used for 
> > > > other
> > > > things too of course.  
> > > 
> > > IIUC, invoking the "yank" command unconditionally kills every single
> > > network connection in QEMU that has registered with the "yank" subsystem.
> > > IMHO this is way too big of a hammer, even if we accept there are bugs in
> > > QEMU not handling stalled networking well.  
> > 
> > But isn't this hammer conditional - I see that it's a migration
> > capabiltiy for the migration socket, and a flag in nbd - so it only
> > yanks things you've told it to.  
> 
> IIUC, you have to set these flags upfront when you launch QEMU, or
> hotplug the device using the feature. When something gets stuck,
> and you issue the "yank" command, then everything that has the flag
> enabled gets torn down. So in practice it looks like the flag will
> get enabled for everything at QEMU startup, and yanking down tear
> down everything.
> 
> > > eg if a chardev hangs QEMU, and we tear down everything, killing the NBD
> > > connection used for the guest disk, we needlessly break I/O.
> > > 
> > > eg doing this in the chardev backend is not desirable, because the bugs
> > > with hanging QEMU are typically caused by the way the frontend device
> > > uses the chardev blocking I/O calls, instead of non-blocking I/O calls.
> > >   
> > 
> > Having a way to get out of any of these problems from a single point is
> > quite nice.  To be useful in COLO you need to know for sure you can get
> > out of any network screwup.
> > 
> > We already use shutdown(2) in migrate_cancel and migrate-pause for
> > basically the same reason; I don't think we've got anything similar for
> > NBD, and we probably should have (I think I asked for it fairly
> > recently).  
> 
> Yes, the migrate_cancel is an example of a more fine grained way to
> recover. I was thinking that we need an equivalent fine control knob
> for NBD too.

One reason why the yank feature is done this way is that the management 
application may not know in what state qemu is and so it doesn't know what to 
yank. Poking in the dark would work too in my case, but it's not that nice.

Regards,
Lukas Straub

> That way if QEMU does get stuck, you can start by tearing down the
> least distruptive channel. eg try tearing down the migration connection
> first (which shouldn't negatively impact the guest), and only if that
> doesn't work then, move on to tear down the NBD connection (which risks
> data loss)
> 
> Regards,
> Daniel



pgpBc8TJR5Rou.pgp
Description: OpenPGP digital signature


Re: [PATCH v5 14/15] acpi: q35: drop _SB.PCI0.ISA.LPCD opregion.

2020-05-11 Thread Igor Mammedov
On Thu,  7 May 2020 15:16:39 +0200
Gerd Hoffmann  wrote:

> Seems to be unused.
> 
> Signed-off-by: Gerd Hoffmann 

Reviewed-by: Igor Mammedov 

> ---
>  hw/i386/acpi-build.c | 11 ---
>  1 file changed, 11 deletions(-)
> 
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index c1e63cce5e8e..1afb47b09ee9 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -1417,7 +1417,6 @@ static void build_q35_isa_bridge(Aml *table)
>  {
>  Aml *dev;
>  Aml *scope;
> -Aml *field;
>  
>  scope =  aml_scope("_SB.PCI0");
>  dev = aml_device("ISA");
> @@ -1427,16 +1426,6 @@ static void build_q35_isa_bridge(Aml *table)
>  aml_append(dev, aml_operation_region("PIRQ", AML_PCI_CONFIG,
>   aml_int(0x60), 0x0C));
>  
> -aml_append(dev, aml_operation_region("LPCD", AML_PCI_CONFIG,
> - aml_int(0x80), 0x02));
> -field = aml_field("LPCD", AML_ANY_ACC, AML_NOLOCK, AML_PRESERVE);
> -aml_append(field, aml_named_field("COMA", 3));
> -aml_append(field, aml_reserved_field(1));
> -aml_append(field, aml_named_field("COMB", 3));
> -aml_append(field, aml_reserved_field(1));
> -aml_append(field, aml_named_field("LPTD", 2));
> -aml_append(dev, field);
> -
>  aml_append(scope, dev);
>  aml_append(table, scope);
>  }




Re: [PATCH v5 13/15] acpi: drop build_piix4_pm()

2020-05-11 Thread Igor Mammedov
On Thu,  7 May 2020 15:16:38 +0200
Gerd Hoffmann  wrote:

> The _SB.PCI0.PX13.P13C opregion (holds isa device enable bits)
> is not used any more, remove it from DSDT.
> 
> Signed-off-by: Gerd Hoffmann 
> ---
>  hw/i386/acpi-build.c | 16 
>  1 file changed, 16 deletions(-)
> 
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index 765409a90eb6..c1e63cce5e8e 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -1441,21 +1441,6 @@ static void build_q35_isa_bridge(Aml *table)
>  aml_append(table, scope);
>  }
>  
> -static void build_piix4_pm(Aml *table)
> -{
> -Aml *dev;
> -Aml *scope;
> -
> -scope =  aml_scope("_SB.PCI0");
> -dev = aml_device("PX13");
> -aml_append(dev, aml_name_decl("_ADR", aml_int(0x00010003)));
I agree about removing P13C but I'm not sure if it's safe to remove
whole isa bridge

> -
> -aml_append(dev, aml_operation_region("P13C", AML_PCI_CONFIG,
> - aml_int(0x00), 0xff));
> -aml_append(scope, dev);
> -aml_append(table, scope);
> -}
> -
>  static void build_piix4_isa_bridge(Aml *table)
>  {
>  Aml *dev;
> @@ -1607,7 +1592,6 @@ build_dsdt(GArray *table_data, BIOSLinker *linker,
>  aml_append(dsdt, sb_scope);
>  
>  build_hpet_aml(dsdt);
> -build_piix4_pm(dsdt);
>  build_piix4_isa_bridge(dsdt);
>  build_isa_devices_aml(dsdt);
>  build_piix4_pci_hotplug(dsdt);




Re: [PATCH v5 12/15] acpi: drop serial/parallel enable bits from dsdt

2020-05-11 Thread Igor Mammedov
On Thu,  7 May 2020 15:16:37 +0200
Gerd Hoffmann  wrote:

> The _STA methods for COM+LPT used to reference them,
> but that isn't the case any more.
> 
> Signed-off-by: Gerd Hoffmann 

Reviewed-by: Igor Mammedov 


> ---
>  hw/i386/acpi-build.c | 23 ---
>  1 file changed, 23 deletions(-)
> 
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index 1922868f3401..765409a90eb6 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -1437,15 +1437,6 @@ static void build_q35_isa_bridge(Aml *table)
>  aml_append(field, aml_named_field("LPTD", 2));
>  aml_append(dev, field);
>  
> -aml_append(dev, aml_operation_region("LPCE", AML_PCI_CONFIG,
> - aml_int(0x82), 0x02));
> -/* enable bits */
> -field = aml_field("LPCE", AML_ANY_ACC, AML_NOLOCK, AML_PRESERVE);
> -aml_append(field, aml_named_field("CAEN", 1));
> -aml_append(field, aml_named_field("CBEN", 1));
> -aml_append(field, aml_named_field("LPEN", 1));
> -aml_append(dev, field);
> -
>  aml_append(scope, dev);
>  aml_append(table, scope);
>  }
> @@ -1469,7 +1460,6 @@ static void build_piix4_isa_bridge(Aml *table)
>  {
>  Aml *dev;
>  Aml *scope;
> -Aml *field;
>  
>  scope =  aml_scope("_SB.PCI0");
>  dev = aml_device("ISA");
> @@ -1478,19 +1468,6 @@ static void build_piix4_isa_bridge(Aml *table)
>  /* PIIX PCI to ISA irq remapping */
>  aml_append(dev, aml_operation_region("P40C", AML_PCI_CONFIG,
>   aml_int(0x60), 0x04));
> -/* enable bits */
> -field = aml_field("^PX13.P13C", AML_ANY_ACC, AML_NOLOCK, AML_PRESERVE);
> -/* Offset(0x5f),, 7, */
> -aml_append(field, aml_reserved_field(0x2f8));
> -aml_append(field, aml_reserved_field(7));
> -aml_append(field, aml_named_field("LPEN", 1));
> -/* Offset(0x67),, 3, */
> -aml_append(field, aml_reserved_field(0x38));
> -aml_append(field, aml_reserved_field(3));
> -aml_append(field, aml_named_field("CAEN", 1));
> -aml_append(field, aml_reserved_field(3));
> -aml_append(field, aml_named_field("CBEN", 1));
> -aml_append(dev, field);
>  
>  aml_append(scope, dev);
>  aml_append(table, scope);




Re: [PATCH v5 03/15] acpi: rtc: use a single crs range

2020-05-11 Thread Igor Mammedov
On Thu,  7 May 2020 15:16:28 +0200
Gerd Hoffmann  wrote:

> Use a single io range for _CRS instead of two,
> following what real hardware does.
> 
> Signed-off-by: Gerd Hoffmann 

Reviewed-by: Igor Mammedov 

> ---
>  hw/rtc/mc146818rtc.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/rtc/mc146818rtc.c b/hw/rtc/mc146818rtc.c
> index 2104e0aa3b14..ab0cc59973b3 100644
> --- a/hw/rtc/mc146818rtc.c
> +++ b/hw/rtc/mc146818rtc.c
> @@ -1013,12 +1013,14 @@ static void rtc_build_aml(ISADevice *isadev, Aml 
> *scope)
>  Aml *dev;
>  Aml *crs;
>  
> +/*
> + * Reserving 8 io ports here, following what physical hardware
> + * does, even though qemu only responds to the first two ports.
> + */
>  crs = aml_resource_template();
>  aml_append(crs, aml_io(AML_DECODE16, RTC_ISA_BASE, RTC_ISA_BASE,
> -   0x10, 0x02));
> +   0x01, 0x08));
>  aml_append(crs, aml_irq_no_flags(RTC_ISA_IRQ));
> -aml_append(crs, aml_io(AML_DECODE16, RTC_ISA_BASE + 2, RTC_ISA_BASE + 2,
> -   0x02, 0x06));
>  
>  dev = aml_device("RTC");
>  aml_append(dev, aml_name_decl("_HID", aml_eisaid("PNP0B00")));




[PATCH 1/2] aio-posix: don't duplicate fd handler deletion in fdmon_io_uring_destroy()

2020-05-11 Thread Stefan Hajnoczi
The io_uring file descriptor monitoring implementation has an internal
list of fd handlers that are pending submission to io_uring.
fdmon_io_uring_destroy() deletes all fd handlers on the list.

Don't delete fd handlers directly in fdmon_io_uring_destroy() for two
reasons:
1. This duplicates the aio-posix.c AioHandler deletion code and could
   become outdated if the struct changes.
2. Only handlers with the FDMON_IO_URING_REMOVE flag set are safe to
   remove. If the flag is not set then something still has a pointer to
   the fd handler. Let aio-posix.c and its user worry about that. In
   practice this isn't an issue because fdmon_io_uring_destroy() is only
   called when shutting down so all users have removed their fd
   handlers, but the next patch will need this!

Signed-off-by: Stefan Hajnoczi 
---
 util/aio-posix.c  |  1 +
 util/fdmon-io_uring.c | 13 ++---
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/util/aio-posix.c b/util/aio-posix.c
index c3613d299e..8af334ab19 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -679,6 +679,7 @@ void aio_context_destroy(AioContext *ctx)
 {
 fdmon_io_uring_destroy(ctx);
 fdmon_epoll_disable(ctx);
+aio_free_deleted_handlers(ctx);
 }
 
 void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index d5a80ed6fb..1d14177df0 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -342,11 +342,18 @@ void fdmon_io_uring_destroy(AioContext *ctx)
 
 io_uring_queue_exit(>fdmon_io_uring);
 
-/* No need to submit these anymore, just free them. */
+/* Move handlers due to be removed onto the deleted list */
 while ((node = QSLIST_FIRST_RCU(>submit_list))) {
+unsigned flags = atomic_fetch_and(>flags,
+~(FDMON_IO_URING_PENDING |
+  FDMON_IO_URING_ADD |
+  FDMON_IO_URING_REMOVE));
+
+if (flags & FDMON_IO_URING_REMOVE) {
+QLIST_INSERT_HEAD_RCU(>deleted_aio_handlers, node, 
node_deleted);
+}
+
 QSLIST_REMOVE_HEAD_RCU(>submit_list, node_submitted);
-QLIST_REMOVE(node, node);
-g_free(node);
 }
 
 ctx->fdmon_ops = _poll_ops;
-- 
2.25.3



[PATCH 2/2] aio-posix: disable fdmon-io_uring when GSource is used

2020-05-11 Thread Stefan Hajnoczi
The glib event loop does not call fdmon_io_uring_wait() so fd handlers
waiting to be submitted build up in the list. There is no benefit is
using io_uring when the glib GSource is being used, so disable it
instead of implementing a more complex fix.

This fixes a memory leak where AioHandlers would build up and increasing
amounts of CPU time were spent iterating them in aio_pending(). The
symptom is that guests become slow when QEMU is built with io_uring
support.

Buglink: https://bugs.launchpad.net/qemu/+bug/1877716
Fixes: 73fd282e7b6dd4e4ea1c3bbb3d302c8db51e4ccf ("aio-posix: add io_uring fd 
monitoring implementation")
Signed-off-by: Stefan Hajnoczi 
---
 include/block/aio.h |  3 +++
 util/aio-posix.c| 12 
 util/aio-win32.c|  4 
 util/async.c|  1 +
 4 files changed, 20 insertions(+)

diff --git a/include/block/aio.h b/include/block/aio.h
index 62ed954344..b2f703fa3f 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -701,6 +701,9 @@ void aio_context_setup(AioContext *ctx);
  */
 void aio_context_destroy(AioContext *ctx);
 
+/* Used internally, do not call outside AioContext code */
+void aio_context_use_g_source(AioContext *ctx);
+
 /**
  * aio_context_set_poll_params:
  * @ctx: the aio context
diff --git a/util/aio-posix.c b/util/aio-posix.c
index 8af334ab19..1b2a3af65b 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -682,6 +682,18 @@ void aio_context_destroy(AioContext *ctx)
 aio_free_deleted_handlers(ctx);
 }
 
+void aio_context_use_g_source(AioContext *ctx)
+{
+/*
+ * Disable io_uring when the glib main loop is used because it doesn't
+ * support mixed glib/aio_poll() usage. It relies on aio_poll() being
+ * called regularly so that changes to the monitored file descriptors are
+ * submitted, otherwise a list of pending fd handlers builds up.
+ */
+fdmon_io_uring_destroy(ctx);
+aio_free_deleted_handlers(ctx);
+}
+
 void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
  int64_t grow, int64_t shrink, Error **errp)
 {
diff --git a/util/aio-win32.c b/util/aio-win32.c
index 729d533faf..953c56ab48 100644
--- a/util/aio-win32.c
+++ b/util/aio-win32.c
@@ -414,6 +414,10 @@ void aio_context_destroy(AioContext *ctx)
 {
 }
 
+void aio_context_use_g_source(AioContext *ctx)
+{
+}
+
 void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
  int64_t grow, int64_t shrink, Error **errp)
 {
diff --git a/util/async.c b/util/async.c
index 3165a28f2f..1319eee3bc 100644
--- a/util/async.c
+++ b/util/async.c
@@ -362,6 +362,7 @@ static GSourceFuncs aio_source_funcs = {
 
 GSource *aio_get_g_source(AioContext *ctx)
 {
+aio_context_use_g_source(ctx);
 g_source_ref(>source);
 return >source;
 }
-- 
2.25.3



[PATCH 0/2] aio-posix: fix fdmon-io_uring memory leak

2020-05-11 Thread Stefan Hajnoczi
This bug was introduced in QEMU 5.0 and causes guests to slow down because
AioHandlers are not freed when the fdmon-io_uring file descriptor monitoring
implementation is used by the main loop thread's glib event loop. This issue
does not apply to IOThread usage of fdmon-io_uring.

In practice few distros build with io_uring support enabled at the moment, so
the number of affected users is likely to be small. The fix is still suitable
for a stable release though.

https://bugs.launchpad.net/qemu/+bug/1877716
https://bugs.launchpad.net/qemu/+bug/1873032

Stefan Hajnoczi (2):
  aio-posix: don't duplicate fd handler deletion in
fdmon_io_uring_destroy()
  aio-posix: disable fdmon-io_uring when GSource is used

 include/block/aio.h   |  3 +++
 util/aio-posix.c  | 13 +
 util/aio-win32.c  |  4 
 util/async.c  |  1 +
 util/fdmon-io_uring.c | 13 ++---
 5 files changed, 31 insertions(+), 3 deletions(-)

-- 
2.25.3



Re: [PATCH v3 05/17] block/io: support int64_t bytes in bdrv_co_do_pwrite_zeroes()

2020-05-11 Thread Eric Blake

On 5/11/20 12:17 PM, Alberto Garcia wrote:

On Thu 30 Apr 2020 01:10:21 PM CEST, Vladimir Sementsov-Ogievskiy wrote:

 compute 'int tail' via % 'int alignment' - safe


 tail = (offset + bytes) % alignment;

both are int64_t, no chance of overflow here?


Good question - I know several places check that offset+bytes does not 
overflow, but did not specifically audit if this one does.  Adding an 
assert() in this function may be easier than trying to prove all callers 
pass in safe values.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v3 7/9] qcow2: Expose bitmaps' size during measure

2020-05-11 Thread Eric Blake

On 5/11/20 6:50 AM, Max Reitz wrote:

On 08.05.20 20:03, Eric Blake wrote:

It's useful to know how much space can be occupied by qcow2 persistent
bitmaps, even though such metadata is unrelated to the guest-visible
data.  Report this value as an additional field, present when
measuring an existing image and the output format supports bitmaps.
Update iotest 178 and 190 to updated output, as well as new coverage
in 190 demonstrating non-zero values made possible with the
recently-added qemu-img bitmap command.

The addition of a new field demonstrates why we should always
zero-initialize qapi C structs; while the qcow2 driver still fully
populates all fields, the raw and crypto drivers had to be tweaked to
avoid uninitialized data.

See also: https://bugzilla.redhat.com/1779904

Reported-by: Nir Soffer 
Signed-off-by: Eric Blake 
---



+#
+# @bitmaps: Additional size required for bitmap metadata in a source image,


s/in/from/?  Otherwise it sounds like this would be about allocation in
the source, which it clear can’t be, but, well.



Yes, 'from' sounds nicer, especially since the size requirements being 
measured depend on the destination's cluster size (which may be 
different from the source's cluster size).



+#   if that bitmap metadata can be copied in addition to guest
+#   contents. (since 5.1)


[...]




+/*
+ * Remove data clusters that are not required.  This overestimates the
   * required size because metadata needed for the fully allocated file is
- * still counted.
+ * still counted.  Show bitmaps only if both source and destination
+ * would support them.
   */
  info->required = info->fully_allocated - virtual_size + required;
+info->has_bitmaps = version >= 3 && in_bs &&
+bdrv_dirty_bitmap_supported(in_bs);


Why is it important whether the source format supports persistent dirty
bitmaps?


If the source format does not support bitmaps, there is nothing to copy 
over.  Reporting '0' would work, but adds verbosity.  It also becomes a 
question as to whether 'qemu-img convert --bitmaps' should silently 
ignore such sources, or loudly error out that the option is unsupported 
because the source lacks bitmaps.  I could lean either way.




I’m asking because I’d like there to be some concise reason when and why
the @bitmaps field appears.  “Whenever the target supports bitmaps” is
more concise than “When both source and target support bitmaps”.  Also,
the latter is not really different from “When any bitmap data can be
copied”, but in the latter case we should not show it when there are no
bitmaps in the source (even though the format supports them).

Or from the other perspective: As a user, I would never be annoyed by
the @bitmaps field being present.  I don’t mind a “0”.
OTOH, what information can it convey to me that it it’s optional and
sometimes not present?


The impact to the iotests .out files is larger if I do not require that 
the source supports bitmaps (more lines of 'bitmaps: 0' added).  I'm 
fine doing that, if we decide we're okay with the simpler definition of 
'"bitmaps" is present if the destination supports them' (rather than 
this version's implementation of '"bitmaps" is present if both the 
source and destination support them').



I can see these cases:

- That the source format doesn’t support bitmaps?  I want to convert it
to something else anyway, so I don’t really care about what the source
format can or can’t do.

- That the destination doesn’t support bitmaps?  Ah, yes, the fact that
the bitmap field is missing might be a warning sign for this.

- That qemu is too old to copy bitmaps?  Same here.


In fact, that argument is a GOOD reason to output 'bitmaps: 0' in as 
many cases as possible, because it then becomes a side-effect witness of 
whether 'qemu-img convert --bitmaps' is even understood.




- That there are no bitmaps in the source?  OK, but then I disregard the
@bitmaps field anyway, present or not.

So from that standpoint, the best use seems to me to take “The @bitmaps
field isn’t present” as kind of a warning that something in the convert
process won’t support copying bitmaps.  If it’s present, all is well.
So basically there’d be an iff relationship between “measure reports
@bitmaps” and “convert --bitmap can work”.


Yes, I can make that tweak for v4.



But the distinction between “the source format doesn’t support bitmaps”
and “the source image doesn’t have bitmaps” doesn’t seem that important
to me to make it visible in the interface.



--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v3 6/9] qemu-img: Add bitmap sub-command

2020-05-11 Thread Eric Blake

On 5/11/20 6:10 AM, Max Reitz wrote:

On 08.05.20 20:03, Eric Blake wrote:

Include actions for --add, --remove, --clear, --enable, --disable, and
--merge (note that --clear is a bit of fluff, because the same can be
accomplished by removing a bitmap and then adding a new one in its
place, but it matches what QMP commands exist).  Listing is omitted,
because it does not require a bitmap name and because it was already
possible with 'qemu-img info'.  A single command line can play one or
more bitmap commands in sequence on the same bitmap name (although all
added bitmaps share the same granularity, and and all merged bitmaps
come from the same source file).  Merge defaults to other bitmaps in
the primary image, but can also be told to merge bitmaps from a
distinct image.


For the record: Yes, my comment was mostly about my confusion around the
{}.  So just replacing them by () would have pacified me.

But this is more fun, of course.




+++ b/docs/tools/qemu-img.rst
@@ -281,6 +281,29 @@ Command description:


[...]


+  Additional options ``-g`` set a non-default *GRANULARITY* for


sets?


Or maybe:

Additional options include ``-g`` which sets a non-default *GRANULARITY* 
for ``--add``, and ``-b`` and ``-F`` which select an alternative source 
file for all *SOURCE* bitmaps used by ``--merge``.


And in writing this, I just realized - even though you _can_ use --add 
more than once in a command line, the command is still limited to 
operating on a single bitmap name, so unless you write contortions like:


qemu-img bitmap --add --remove --add -g 1024 file.qcow2 bitmapname

there will normally be at most one --add operation for a -g to be used 
with (because otherwise the second --add will fail when attempting to 
create an already-existing bitmap name).




With that fixed (or maybe not, you know that better than me):

Reviewed-by: Max Reitz 




--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v3 3/9] block: Make it easier to learn which BDS support bitmaps

2020-05-11 Thread Eric Blake

On 5/11/20 4:21 AM, Max Reitz wrote:

On 08.05.20 20:03, Eric Blake wrote:

Upcoming patches will enhance bitmap support in qemu-img, but in doing
so, it turns out to be nice to suppress output when bitmaps make no
sense (such as on a qcow2 v2 image).  Add a hook to make this easier
to query.

In the future, when we improve the ability to look up bitmaps through
a filter, we will probably also want to teach the block layer to
automatically let filters pass this request on through.

Signed-off-by: Eric Blake 
---
  block/qcow2.h| 1 +
  include/block/block_int.h| 1 +
  include/block/dirty-bitmap.h | 1 +
  block/dirty-bitmap.c | 9 +
  block/qcow2-bitmap.c | 7 +++
  block/qcow2.c| 1 +
  6 files changed, 20 insertions(+)

diff --git a/block/qcow2.h b/block/qcow2.h
index f4de0a27d5c3..fb2b2b5a7b4d 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -764,6 +764,7 @@ bool qcow2_co_can_store_new_dirty_bitmap(BlockDriverState 
*bs,
  int qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
  const char *name,
  Error **errp);
+bool qcow2_dirty_bitmap_supported(BlockDriverState *bs);

  ssize_t coroutine_fn
  qcow2_co_compress(BlockDriverState *bs, void *dest, size_t dest_size,
diff --git a/include/block/block_int.h b/include/block/block_int.h
index df6d0273d679..cb1082da4c43 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -560,6 +560,7 @@ struct BlockDriver {
   uint64_t parent_perm, uint64_t parent_shared,
   uint64_t *nperm, uint64_t *nshared);

+bool (*bdrv_dirty_bitmap_supported)(BlockDriverState *bs);


All BDSs support bitmaps, but only some support persistent dirty
bitmaps, so I think the name should reflect that.


How about .bdrv_dirty_bitmap_supports_persistent?



Conceptually, this looks reasonable.  This information might indeed be
nice to have, and I’m not sure whether we should extend any existing
interface to return it.

(The interfaces that come to my mind are:
(1) bdrv_can_store_new_dirty_bitmap() below, which we could make accept
a NULL @name to return basically the same information.  But it’s still a
bit different, because I’d expect that function to return whether any
bitmap can be stored then, not whether the node supports bitmaps at all.
  So e.g. if there are already too many bitmaps, it should return false,
even though the node itself does support bitmaps.


[which reminds me - a while ago, we had patches for qcow2 handling with 
64k bitmaps, or whatever insane number it took to overflow data 
structures, and I don't know if those ever got applied...]




(2) bdrv_get_info()/BlockDriverInfo: This information would fit in very
nicely here, but do we have to put it here just because it does?  I
don’t think so.  This patch adds 20 lines of code, that shows that it’s
very simple to add a dedicated method, and it’s certainly a bit easier
to use than to invoke bdrv_get_info() and throw away all the other
information.  Perhaps this patch only shows that BlockDriverInfo doesn’t
make much sense in the first place, and most of its fields should have
been scalar return values from dedicated functions.)


Indeed, you (re-)discovered some of the very reasons why I chose to make 
a new interface.  I could tweak the commit message to mention 
alternatives, if that would help.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 0/5] Introduce 'yank' oob qmp command to recover from hanging qemu

2020-05-11 Thread Lukas Straub
On Mon, 11 May 2020 12:49:47 +0100
Daniel P. Berrangé  wrote:

> On Mon, May 11, 2020 at 01:14:34PM +0200, Lukas Straub wrote:
> > Hello Everyone,
> > In many cases, if qemu has a network connection (qmp, migration, chardev, 
> > etc.)
> > to some other server and that server dies or hangs, qemu hangs too.  
> 
> If qemu as a whole hangs due to a stalled network connection, that is a
> bug in QEMU that we should be fixing IMHO. QEMU should be doing non-blocking
> I/O in general, such that if the network connection or remote server stalls,
> we simply stop sending I/O - we shouldn't ever hang the QEMU process or main
> loop.
> 
> There are places in QEMU code which are not well behaved in this respect,
> but many are, and others are getting fixed where found to be important.
> 
> Arguably any place in QEMU code which can result in a hang of QEMU in the
> event of a stalled network should be considered a security flaw, because
> the network is untrusted in general.

The fact that out-of-band qmp commands exist at all shows that we have to make 
tradeoffs of developer time vs. doing things right. Sure, the migration code 
can be rewritten to use non-blocking i/o and finegrained locks. But as a 
hobbyist I don't have time to fix this.

> > These patches introduce the new 'yank' out-of-band qmp command to recover 
> > from
> > these kinds of hangs. The different subsystems register callbacks which get
> > executed with the yank command. For example the callback can shutdown() a
> > socket. This is intended for the colo use-case, but it can be used for other
> > things too of course.  
> 
> IIUC, invoking the "yank" command unconditionally kills every single
> network connection in QEMU that has registered with the "yank" subsystem.
> IMHO this is way too big of a hammer, even if we accept there are bugs in
> QEMU not handling stalled networking well.
> 
> eg if a chardev hangs QEMU, and we tear down everything, killing the NBD
> connection used for the guest disk, we needlessly break I/O.

Yeah, these patches are intended to solve the problems with the colo use-case 
where all external connections (migration, chardevs, nbd) are just for 
replication. In other use-cases you'd enable the yank feature only on the 
non-essential connections.

> eg doing this in the chardev backend is not desirable, because the bugs
> with hanging QEMU are typically caused by the way the frontend device
> uses the chardev blocking I/O calls, instead of non-blocking I/O calls.
> 
> 
> Regards,
> Daniel



pgp96GYh_tGsk.pgp
Description: OpenPGP digital signature


Re: [PATCH 3/5] block/nbd.c: Add yank feature

2020-05-11 Thread Dr. David Alan Gilbert
* Lukas Straub (lukasstra...@web.de) wrote:
> On Mon, 11 May 2020 17:19:09 +0100
> "Dr. David Alan Gilbert"  wrote:
> 
> > * Lukas Straub (lukasstra...@web.de) wrote:
> > > Add yank option, pass it to the socket-channel and register a yank
> > > function which sets s->state = NBD_CLIENT_QUIT. This is the same
> > > behaviour as if an error occured.
> > > 
> > > Signed-off-by: Lukas Straub   
> > 
> > > +static void nbd_yank(void *opaque)
> > > +{
> > > +BlockDriverState *bs = opaque;
> > > +BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
> > > +
> > > +atomic_set(>state, NBD_CLIENT_QUIT);  
> > 
> > I think I was expecting a shutdown on the socket here - why doesn't it
> > have one?
> 
> For nbd, we register two yank functions: This one and we enable the yank 
> feature on the qio channel (see function nbd_establish_connection below).

Oh I see; yeh that still surprises me a little; I'd expected one yank
per item.

Dave

> Regards,
> Lukas Straub
> 
> > Dave
> > 
> > > +}
> > > +
> > >  static void nbd_client_close(BlockDriverState *bs)
> > >  {
> > >  BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
> > > @@ -1407,14 +1421,17 @@ static void nbd_client_close(BlockDriverState *bs)
> > >  nbd_teardown_connection(bs);
> > >  }
> > >  
> > > -static QIOChannelSocket *nbd_establish_connection(SocketAddress *saddr,
> > > +static QIOChannelSocket *nbd_establish_connection(BlockDriverState *bs,
> > > +  SocketAddress *saddr,
> > >Error **errp)
> > >  {
> > > +BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
> > >  QIOChannelSocket *sioc;
> > >  Error *local_err = NULL;
> > >  
> > >  sioc = qio_channel_socket_new();
> > >  qio_channel_set_name(QIO_CHANNEL(sioc), "nbd-client");
> > > +qio_channel_set_yank(QIO_CHANNEL(sioc), s->yank);
> > >  
> > >  qio_channel_socket_connect_sync(sioc, saddr, _err);
> > >  if (local_err) {
> > > @@ -1438,7 +1455,7 @@ static int nbd_client_connect(BlockDriverState *bs, 
> > > Error **errp)
> > >   * establish TCP connection, return error if it fails
> > >   * TODO: Configurable retry-until-timeout behaviour.
> > >   */
> > > -QIOChannelSocket *sioc = nbd_establish_connection(s->saddr, errp);
> > > +QIOChannelSocket *sioc = nbd_establish_connection(bs, s->saddr, 
> > > errp);
> > >  
> > >  if (!sioc) {
> > >  return -ECONNREFUSED;
> > > @@ -1829,6 +1846,12 @@ static QemuOptsList nbd_runtime_opts = {
> > >  "future requests before a successful reconnect will "
> > >  "immediately fail. Default 0",
> > >  },
> > > +{
> > > +.name = "yank",
> > > +.type = QEMU_OPT_BOOL,
> > > +.help = "Forcibly close the connection and don't attempt to "
> > > +"reconnect when the 'yank' qmp command is executed.",
> > > +},
> > >  { /* end of list */ }
> > >  },
> > >  };
> > > @@ -1888,6 +1911,8 @@ static int nbd_process_options(BlockDriverState 
> > > *bs, QDict *options,
> > >  
> > >  s->reconnect_delay = qemu_opt_get_number(opts, "reconnect-delay", 0);
> > >  
> > > +s->yank = qemu_opt_get_bool(opts, "yank", false);
> > > +
> > >  ret = 0;
> > >  
> > >   error:
> > > @@ -1921,6 +1946,10 @@ static int nbd_open(BlockDriverState *bs, QDict 
> > > *options, int flags,
> > >  /* successfully connected */
> > >  s->state = NBD_CLIENT_CONNECTED;
> > >  
> > > +if (s->yank) {
> > > +yank_register_function(nbd_yank, bs);
> > > +}
> > > +
> > >  s->connection_co = qemu_coroutine_create(nbd_connection_entry, s);
> > >  bdrv_inc_in_flight(bs);
> > >  aio_co_schedule(bdrv_get_aio_context(bs), s->connection_co);
> > > @@ -1972,6 +2001,11 @@ static void nbd_close(BlockDriverState *bs)
> > >  BDRVNBDState *s = bs->opaque;
> > >  
> > >  nbd_client_close(bs);
> > > +
> > > +if (s->yank) {
> > > +yank_unregister_function(nbd_yank, bs);
> > > +}
> > > +
> > >  nbd_clear_bdrvstate(s);
> > >  }
> > >  
> > > diff --git a/qapi/block-core.json b/qapi/block-core.json
> > > index 943df1926a..1c1578160e 100644
> > > --- a/qapi/block-core.json
> > > +++ b/qapi/block-core.json
> > > @@ -3862,6 +3862,8 @@
> > >  #   reconnect. After that time, any delayed requests and 
> > > all
> > >  #   future requests before a successful reconnect will
> > >  #   immediately fail. Default 0 (Since 4.2)
> > > +# @yank: Forcibly close the connection and don't attempt to reconnect 
> > > when
> > > +#the 'yank' qmp command is executed. (Since: 5.1)
> > >  #
> > >  # Since: 2.9
> > >  ##
> > > @@ -3870,7 +3872,8 @@
> > >  '*export': 'str',
> > >  '*tls-creds': 'str',
> > >  '*x-dirty-bitmap': 'str',
> > > -'*reconnect-delay': 'uint32' } }
> > > +

Re: [PATCH v3 05/17] block/io: support int64_t bytes in bdrv_co_do_pwrite_zeroes()

2020-05-11 Thread Alberto Garcia
On Thu 30 Apr 2020 01:10:21 PM CEST, Vladimir Sementsov-Ogievskiy wrote:
> compute 'int tail' via % 'int alignment' - safe

tail = (offset + bytes) % alignment;

both are int64_t, no chance of overflow here?

Berto



Re: [PATCH 3/5] block/nbd.c: Add yank feature

2020-05-11 Thread Lukas Straub
On Mon, 11 May 2020 17:19:09 +0100
"Dr. David Alan Gilbert"  wrote:

> * Lukas Straub (lukasstra...@web.de) wrote:
> > Add yank option, pass it to the socket-channel and register a yank
> > function which sets s->state = NBD_CLIENT_QUIT. This is the same
> > behaviour as if an error occured.
> > 
> > Signed-off-by: Lukas Straub   
> 
> > +static void nbd_yank(void *opaque)
> > +{
> > +BlockDriverState *bs = opaque;
> > +BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
> > +
> > +atomic_set(>state, NBD_CLIENT_QUIT);  
> 
> I think I was expecting a shutdown on the socket here - why doesn't it
> have one?

For nbd, we register two yank functions: This one and we enable the yank 
feature on the qio channel (see function nbd_establish_connection below).

Regards,
Lukas Straub

> Dave
> 
> > +}
> > +
> >  static void nbd_client_close(BlockDriverState *bs)
> >  {
> >  BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
> > @@ -1407,14 +1421,17 @@ static void nbd_client_close(BlockDriverState *bs)
> >  nbd_teardown_connection(bs);
> >  }
> >  
> > -static QIOChannelSocket *nbd_establish_connection(SocketAddress *saddr,
> > +static QIOChannelSocket *nbd_establish_connection(BlockDriverState *bs,
> > +  SocketAddress *saddr,
> >Error **errp)
> >  {
> > +BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
> >  QIOChannelSocket *sioc;
> >  Error *local_err = NULL;
> >  
> >  sioc = qio_channel_socket_new();
> >  qio_channel_set_name(QIO_CHANNEL(sioc), "nbd-client");
> > +qio_channel_set_yank(QIO_CHANNEL(sioc), s->yank);
> >  
> >  qio_channel_socket_connect_sync(sioc, saddr, _err);
> >  if (local_err) {
> > @@ -1438,7 +1455,7 @@ static int nbd_client_connect(BlockDriverState *bs, 
> > Error **errp)
> >   * establish TCP connection, return error if it fails
> >   * TODO: Configurable retry-until-timeout behaviour.
> >   */
> > -QIOChannelSocket *sioc = nbd_establish_connection(s->saddr, errp);
> > +QIOChannelSocket *sioc = nbd_establish_connection(bs, s->saddr, errp);
> >  
> >  if (!sioc) {
> >  return -ECONNREFUSED;
> > @@ -1829,6 +1846,12 @@ static QemuOptsList nbd_runtime_opts = {
> >  "future requests before a successful reconnect will "
> >  "immediately fail. Default 0",
> >  },
> > +{
> > +.name = "yank",
> > +.type = QEMU_OPT_BOOL,
> > +.help = "Forcibly close the connection and don't attempt to "
> > +"reconnect when the 'yank' qmp command is executed.",
> > +},
> >  { /* end of list */ }
> >  },
> >  };
> > @@ -1888,6 +1911,8 @@ static int nbd_process_options(BlockDriverState *bs, 
> > QDict *options,
> >  
> >  s->reconnect_delay = qemu_opt_get_number(opts, "reconnect-delay", 0);
> >  
> > +s->yank = qemu_opt_get_bool(opts, "yank", false);
> > +
> >  ret = 0;
> >  
> >   error:
> > @@ -1921,6 +1946,10 @@ static int nbd_open(BlockDriverState *bs, QDict 
> > *options, int flags,
> >  /* successfully connected */
> >  s->state = NBD_CLIENT_CONNECTED;
> >  
> > +if (s->yank) {
> > +yank_register_function(nbd_yank, bs);
> > +}
> > +
> >  s->connection_co = qemu_coroutine_create(nbd_connection_entry, s);
> >  bdrv_inc_in_flight(bs);
> >  aio_co_schedule(bdrv_get_aio_context(bs), s->connection_co);
> > @@ -1972,6 +2001,11 @@ static void nbd_close(BlockDriverState *bs)
> >  BDRVNBDState *s = bs->opaque;
> >  
> >  nbd_client_close(bs);
> > +
> > +if (s->yank) {
> > +yank_unregister_function(nbd_yank, bs);
> > +}
> > +
> >  nbd_clear_bdrvstate(s);
> >  }
> >  
> > diff --git a/qapi/block-core.json b/qapi/block-core.json
> > index 943df1926a..1c1578160e 100644
> > --- a/qapi/block-core.json
> > +++ b/qapi/block-core.json
> > @@ -3862,6 +3862,8 @@
> >  #   reconnect. After that time, any delayed requests and 
> > all
> >  #   future requests before a successful reconnect will
> >  #   immediately fail. Default 0 (Since 4.2)
> > +# @yank: Forcibly close the connection and don't attempt to reconnect when
> > +#the 'yank' qmp command is executed. (Since: 5.1)
> >  #
> >  # Since: 2.9
> >  ##
> > @@ -3870,7 +3872,8 @@
> >  '*export': 'str',
> >  '*tls-creds': 'str',
> >  '*x-dirty-bitmap': 'str',
> > -'*reconnect-delay': 'uint32' } }
> > +'*reconnect-delay': 'uint32',
> > +   'yank': 'bool' } }
> >  
> >  ##
> >  # @BlockdevOptionsRaw:
> > -- 
> > 2.20.1
> >   
> 
> 
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> 



pgpgbyJtzVMo1.pgp
Description: OpenPGP digital signature


[RFC] bdrv_flush: only use fast path when in owned AioContext

2020-05-11 Thread Stefan Reiter
Just because we're in a coroutine doesn't imply ownership of the context
of the flushed drive. In such a case use the slow path which explicitly
enters bdrv_flush_co_entry in the correct AioContext.

Signed-off-by: Stefan Reiter 
---

We've experienced some lockups in this codepath when taking snapshots of VMs
with drives that have IO-Threads enabled (we have an async 'savevm'
implementation running from a coroutine).

Currently no reproducer for upstream versions I could find, but in testing this
patch fixes all issues we're seeing and I think the logic checks out.

The fast path pattern is repeated a few times in this file, so if this change
makes sense, it's probably worth evaluating the other occurences as well.

 block/io.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/block/io.c b/block/io.c
index aba67f66b9..ee7310fa13 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2895,8 +2895,9 @@ int bdrv_flush(BlockDriverState *bs)
 .ret = NOT_DONE,
 };
 
-if (qemu_in_coroutine()) {
-/* Fast-path if already in coroutine context */
+if (qemu_in_coroutine() &&
+bdrv_get_aio_context(bs) == qemu_get_current_aio_context()) {
+/* Fast-path if already in coroutine and we own the drive's context */
 bdrv_flush_co_entry(_co);
 } else {
 co = qemu_coroutine_create(bdrv_flush_co_entry, _co);
-- 
2.20.1





[PATCH 0/2] iotests: Run pylint and mypy in a testcase

2020-05-11 Thread Kevin Wolf
Kevin Wolf (2):
  iotests: Fix incomplete type declarations
  iotests: Run pylint and mypy in a testcase

 tests/qemu-iotests/iotests.py |  8 +++
 tests/qemu-iotests/297| 44 +++
 tests/qemu-iotests/297.out|  3 +++
 tests/qemu-iotests/group  |  1 +
 4 files changed, 52 insertions(+), 4 deletions(-)
 create mode 100755 tests/qemu-iotests/297
 create mode 100644 tests/qemu-iotests/297.out

-- 
2.25.3




[PATCH 2/2] iotests: Run pylint and mypy in a testcase

2020-05-11 Thread Kevin Wolf
We made sure that iotests.py passes pylint. It would be a shame if we
allowed new patches in that break this again, so let's just add a
meta-test case that runs pylint on it.

While we don't pass mypy --strict yet, we can already run it with a few
options that would be part of --strict to make sure that we won't
regress on these aspects at least until we can enable the full thing.

Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/297 | 44 ++
 tests/qemu-iotests/297.out |  3 +++
 tests/qemu-iotests/group   |  1 +
 3 files changed, 48 insertions(+)
 create mode 100755 tests/qemu-iotests/297
 create mode 100644 tests/qemu-iotests/297.out

diff --git a/tests/qemu-iotests/297 b/tests/qemu-iotests/297
new file mode 100755
index 00..5c5420712b
--- /dev/null
+++ b/tests/qemu-iotests/297
@@ -0,0 +1,44 @@
+#!/usr/bin/env bash
+#
+# Copyright (C) 2020 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+
+seq=$(basename $0)
+echo "QA output created by $seq"
+
+status=1   # failure is the default!
+
+# get standard environment
+. ./common.rc
+
+if ! type -p "pylint-3" > /dev/null; then
+_notrun "pylint-3 not found"
+fi
+if ! type -p "mypy" > /dev/null; then
+_notrun "mypy not found"
+fi
+
+pylint-3 --score=n iotests.py
+
+MYPYPATH=../../python/ mypy --warn-unused-configs --disallow-subclassing-any \
+--disallow-any-generics --disallow-incomplete-defs \
+--disallow-untyped-decorators --no-implicit-optional \
+--warn-redundant-casts --warn-unused-ignores \
+--no-implicit-reexport iotests.py
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/297.out b/tests/qemu-iotests/297.out
new file mode 100644
index 00..6acc843649
--- /dev/null
+++ b/tests/qemu-iotests/297.out
@@ -0,0 +1,3 @@
+QA output created by 297
+Success: no issues found in 1 source file
+*** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index fe649c5b73..c514975880 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -299,3 +299,4 @@
 289 rw quick
 290 rw auto quick
 292 rw auto quick
+297 meta
-- 
2.25.3




[PATCH 1/2] iotests: Fix incomplete type declarations

2020-05-11 Thread Kevin Wolf
We need to fix only a few places so that iotests.py can pass
mypy --disallow-incomplete-defs, which seems to be a desirable option to
have enabled in the long run.

Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/iotests.py | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 6c0e781af7..1d7f6fd7cf 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -1040,7 +1040,7 @@ def _verify_cache_mode(supported_cache_modes: 
Sequence[str] = ()) -> None:
 if supported_cache_modes and (cachemode not in supported_cache_modes):
 notrun('not suitable for this cache mode: %s' % cachemode)
 
-def _verify_aio_mode(supported_aio_modes: Sequence[str] = ()):
+def _verify_aio_mode(supported_aio_modes: Sequence[str] = ()) -> None:
 if supported_aio_modes and (aiomode not in supported_aio_modes):
 notrun('not suitable for this aio mode: %s' % aiomode)
 
@@ -1087,7 +1087,8 @@ def skip_if_unsupported(required_formats=(), 
read_only=False):
 '''Skip Test Decorator
Runs the test if all the required formats are whitelisted'''
 def skip_test_decorator(func):
-def func_wrapper(test_case: QMPTestCase, *args, **kwargs):
+def func_wrapper(test_case: QMPTestCase, *args: List[Any],
+ **kwargs: Dict[str, Any]) -> None:
 if callable(required_formats):
 fmts = required_formats(test_case)
 else:
@@ -1097,9 +1098,8 @@ def skip_if_unsupported(required_formats=(), 
read_only=False):
 if usf_list:
 msg = f'{test_case}: formats {usf_list} are not whitelisted'
 test_case.case_skip(msg)
-return None
 else:
-return func(test_case, *args, **kwargs)
+func(test_case, *args, **kwargs)
 return func_wrapper
 return skip_test_decorator
 
-- 
2.25.3




Re: [PATCH v3 04/17] block/io: use int64_t bytes in driver wrappers

2020-05-11 Thread Alberto Garcia
On Thu 30 Apr 2020 01:10:20 PM CEST, Vladimir Sementsov-Ogievskiy wrote:
> We are generally moving to int64_t for both offset and bytes parameters
> on all io paths.
>
> Main motivation is realization of 64-bit write_zeroes operation for
> fast zeroing large disk chunks, up to the whole disk.
>
> We chose signed type, to be consistent with off_t (which is signed) and
> with possibility for signed return type (where negative value means
> error).
>
> So, convert driver wrappers parameters which are already 64bit to
> signed type.
>
> Patch-correctness audit by Eric Blake:
>
>   bdrv_driver_preadv
>
> Remains 64 bits, the question is whether any caller could pass in
> something larger than 63 bits.  Callers are:
>
> bdrv_co_do_copy_on_readv() - passes 'int64_t pnum', but sets pnum
>   in a fragmenting loop, MAX_BOUNCE_BUFFER when copy-on-read is
>   needed, or max_transfer bounded by BDRV_REQUEST_MAX_BYTES
>   otherwise
> bdrv_aligned_preadv() - passes 'unsigned int bytes' further limited
>   by fragmenting loop on max_transfer <= INT_MAX
>
> Input is bounded to < 2G, internal use of 'bytes' is:
>
> drv->bdrv_co_preadv_part(uint64_t) - safe
> qemu_iovec_init_slice(size_t) - potential truncation on 32-bit
>   platform [*]
> drv->bdrv_co_preadv(uint64_t) - safe
> drv->bdrv_aio_preadv(uint64_t) - safe
> drv->bdrv_co_readv(int) after assertion of <2G - safe
>
>   bdrv_driver_pwritev
>
> Remains 64 bits, the question is whether any caller could pass in
> something larger than 63.  Callers are:
>
> bdrv_co_do_copy_on_readv() - passes 'int64_t pnum', but set in a
>   fragmenting loop bounded by MAX_BOUNCE_BUFFER
> bdrv_co_do_pwrite_zeroes() - passes 'int num'
> bdrv_aligned_pwritev() - passes 'unsigned int bytes' further
>   limited by fragmenting loop on max_transfer <= INT_MAX
>
> Input is bounded to < 2G, internal use of 'bytes' is:
>
> drv->bdrv_co_pwritev_part(uint64_t) - safe
> qemu_iovec_init_slice(size_t) - potential truncation on 32-bit
>   platform [*]
> drv->bdrv_co_pwritev(uint64_t) - safe
> drv->bdrv_aio_pwritev(uint64_t) - safe
> drv->bdrv_co_writev(int) after assertion of <2G - safe
>
>   bdrv_driver_pwritev_compressed
>
> bdrv_aligned_pwritev() - passes 'unsigned int bytes'
>
> Input is bounded to < 4G, internal use of 'bytes' is:
>
> drv->bdrv_co_pwritev_compressed_part(uint64_t) - safe
> drv->bdrv_co_pwritev_compressed(uint64_t) - safe
> qemu_iovec_init_slice(size_t) - potential truncation on 32-bit
>   platform [*]
>
> [*]: QEMUIOVector is in-RAM data, so size_t seems a valid type for
> it's operation. To avoid truncations we should require max_transfer to
> be not greater than SIZE_MAX, limiting RAM-occupying operations and
> QEMUIOVector usage. Still, 64bit discard and write_zeroes (which
> doesn't use QEMUIOVector) should work even on 32bit machines, not being
> limited by max_transfer.
>
> For now, we safe anyway, as all input goes through
> bdrv_aligned_pwritev() and bdrv_aligned_preadv(), which are limiting
> max_transfer to INT_MAX.
>
> Series: 64bit-block-status
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> Reviewed-by: Eric Blake 

Reviewed-by: Alberto Garcia 

Berto



Re: [PATCH 3/5] block/nbd.c: Add yank feature

2020-05-11 Thread Dr. David Alan Gilbert
* Lukas Straub (lukasstra...@web.de) wrote:
> Add yank option, pass it to the socket-channel and register a yank
> function which sets s->state = NBD_CLIENT_QUIT. This is the same
> behaviour as if an error occured.
> 
> Signed-off-by: Lukas Straub 

> +static void nbd_yank(void *opaque)
> +{
> +BlockDriverState *bs = opaque;
> +BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
> +
> +atomic_set(>state, NBD_CLIENT_QUIT);

I think I was expecting a shutdown on the socket here - why doesn't it
have one?

Dave

> +}
> +
>  static void nbd_client_close(BlockDriverState *bs)
>  {
>  BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
> @@ -1407,14 +1421,17 @@ static void nbd_client_close(BlockDriverState *bs)
>  nbd_teardown_connection(bs);
>  }
>  
> -static QIOChannelSocket *nbd_establish_connection(SocketAddress *saddr,
> +static QIOChannelSocket *nbd_establish_connection(BlockDriverState *bs,
> +  SocketAddress *saddr,
>Error **errp)
>  {
> +BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
>  QIOChannelSocket *sioc;
>  Error *local_err = NULL;
>  
>  sioc = qio_channel_socket_new();
>  qio_channel_set_name(QIO_CHANNEL(sioc), "nbd-client");
> +qio_channel_set_yank(QIO_CHANNEL(sioc), s->yank);
>  
>  qio_channel_socket_connect_sync(sioc, saddr, _err);
>  if (local_err) {
> @@ -1438,7 +1455,7 @@ static int nbd_client_connect(BlockDriverState *bs, 
> Error **errp)
>   * establish TCP connection, return error if it fails
>   * TODO: Configurable retry-until-timeout behaviour.
>   */
> -QIOChannelSocket *sioc = nbd_establish_connection(s->saddr, errp);
> +QIOChannelSocket *sioc = nbd_establish_connection(bs, s->saddr, errp);
>  
>  if (!sioc) {
>  return -ECONNREFUSED;
> @@ -1829,6 +1846,12 @@ static QemuOptsList nbd_runtime_opts = {
>  "future requests before a successful reconnect will "
>  "immediately fail. Default 0",
>  },
> +{
> +.name = "yank",
> +.type = QEMU_OPT_BOOL,
> +.help = "Forcibly close the connection and don't attempt to "
> +"reconnect when the 'yank' qmp command is executed.",
> +},
>  { /* end of list */ }
>  },
>  };
> @@ -1888,6 +1911,8 @@ static int nbd_process_options(BlockDriverState *bs, 
> QDict *options,
>  
>  s->reconnect_delay = qemu_opt_get_number(opts, "reconnect-delay", 0);
>  
> +s->yank = qemu_opt_get_bool(opts, "yank", false);
> +
>  ret = 0;
>  
>   error:
> @@ -1921,6 +1946,10 @@ static int nbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  /* successfully connected */
>  s->state = NBD_CLIENT_CONNECTED;
>  
> +if (s->yank) {
> +yank_register_function(nbd_yank, bs);
> +}
> +
>  s->connection_co = qemu_coroutine_create(nbd_connection_entry, s);
>  bdrv_inc_in_flight(bs);
>  aio_co_schedule(bdrv_get_aio_context(bs), s->connection_co);
> @@ -1972,6 +2001,11 @@ static void nbd_close(BlockDriverState *bs)
>  BDRVNBDState *s = bs->opaque;
>  
>  nbd_client_close(bs);
> +
> +if (s->yank) {
> +yank_unregister_function(nbd_yank, bs);
> +}
> +
>  nbd_clear_bdrvstate(s);
>  }
>  
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 943df1926a..1c1578160e 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -3862,6 +3862,8 @@
>  #   reconnect. After that time, any delayed requests and all
>  #   future requests before a successful reconnect will
>  #   immediately fail. Default 0 (Since 4.2)
> +# @yank: Forcibly close the connection and don't attempt to reconnect when
> +#the 'yank' qmp command is executed. (Since: 5.1)
>  #
>  # Since: 2.9
>  ##
> @@ -3870,7 +3872,8 @@
>  '*export': 'str',
>  '*tls-creds': 'str',
>  '*x-dirty-bitmap': 'str',
> -'*reconnect-delay': 'uint32' } }
> +'*reconnect-delay': 'uint32',
> + 'yank': 'bool' } }
>  
>  ##
>  # @BlockdevOptionsRaw:
> -- 
> 2.20.1
> 


--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: [PATCH v3 03/17] block/io: use int64_t bytes parameter in bdrv_check_byte_request()

2020-05-11 Thread Alberto Garcia
On Thu 30 Apr 2020 01:10:19 PM CEST, Vladimir Sementsov-Ogievskiy wrote:
> We are generally moving to int64_t for both offset and bytes parameters
> on all io paths.
>
> Main motivation is realization of 64-bit write_zeroes operation for
> fast zeroing large disk chunks, up to the whole disk.
>
> We chose signed type, to be consistent with off_t (which is signed) and
> with possibility for signed return type (where negative value means
> error).
>
> So, convert bdrv_check_byte_request() now.
>
> Patch-correctness audit by Eric Blake:
>
>   This changes an unsigned to signed value on 64-bit machines, and
>   additionally widens the parameter on 32-bit machines.  Existing
>   callers:
>
>   bdrv_co_preadv_part() with 'unsigned int' - no impact
>   bdrv_co_pwritev_part() with 'unsigned int' - no impact
>   bdrv_co_copy_range_internal() with 'uint64_t' -
>   potentially fixes a latent bug on 32-bit machines. Requires a
>   larger audit to see how bdrv_co_copy_range() and friends are
>   used:
>
>   block/block-backend.c:blk_co_copy_range() - passes 'int', thus < 2G
>   block/block-copy.c:block_copy_do_copy() -
>   passes 'int64_t', but only after assert(nbytes < INT_MAX); also
>   it has a BLOCK_COPY_MAX_COPY_RANGE set to 16M that factors into
>   its calculations on how much to do per iteration
>
>   So it looks like at present we are safe.
>
> Series: 64bit-block-status
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> Reviewed-by: Eric Blake 

Reviewed-by: Alberto Garcia 

Berto



Re: [PATCH 0/5] Introduce 'yank' oob qmp command to recover from hanging qemu

2020-05-11 Thread Dr. David Alan Gilbert
* Daniel P. Berrangé (berra...@redhat.com) wrote:
> On Mon, May 11, 2020 at 01:07:18PM +0100, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrangé (berra...@redhat.com) wrote:
> > > On Mon, May 11, 2020 at 01:14:34PM +0200, Lukas Straub wrote:
> > > > Hello Everyone,
> > > > In many cases, if qemu has a network connection (qmp, migration, 
> > > > chardev, etc.)
> > > > to some other server and that server dies or hangs, qemu hangs too.
> > > 
> > > If qemu as a whole hangs due to a stalled network connection, that is a
> > > bug in QEMU that we should be fixing IMHO. QEMU should be doing 
> > > non-blocking
> > > I/O in general, such that if the network connection or remote server 
> > > stalls,
> > > we simply stop sending I/O - we shouldn't ever hang the QEMU process or 
> > > main
> > > loop.
> > > 
> > > There are places in QEMU code which are not well behaved in this respect,
> > > but many are, and others are getting fixed where found to be important.
> > > 
> > > Arguably any place in QEMU code which can result in a hang of QEMU in the
> > > event of a stalled network should be considered a security flaw, because
> > > the network is untrusted in general.
> > 
> > That's not really true of the 'management network' - people trust that
> > and I don't see a lot of the qemu code getting fixed safely for all of
> > them.
> 
> It depends on the user / app / deployment scenario. In OpenStack alot of
> work was done to beef up security between services on the mgmt network,
> with TLS encryption as standard to reduce attack vectors.
> 
> > > > These patches introduce the new 'yank' out-of-band qmp command to 
> > > > recover from
> > > > these kinds of hangs. The different subsystems register callbacks which 
> > > > get
> > > > executed with the yank command. For example the callback can shutdown() 
> > > > a
> > > > socket. This is intended for the colo use-case, but it can be used for 
> > > > other
> > > > things too of course.
> > > 
> > > IIUC, invoking the "yank" command unconditionally kills every single
> > > network connection in QEMU that has registered with the "yank" subsystem.
> > > IMHO this is way too big of a hammer, even if we accept there are bugs in
> > > QEMU not handling stalled networking well.
> > 
> > But isn't this hammer conditional - I see that it's a migration
> > capabiltiy for the migration socket, and a flag in nbd - so it only
> > yanks things you've told it to.
> 
> IIUC, you have to set these flags upfront when you launch QEMU, or
> hotplug the device using the feature. When something gets stuck,
> and you issue the "yank" command, then everything that has the flag
> enabled gets torn down. So in practice it looks like the flag will
> get enabled for everything at QEMU startup, and yanking down tear
> down everything.

For COLO I really expect it for the migration stream, the disk mirroring
stream and probably the network comparison/forwarding streams.

> > > eg if a chardev hangs QEMU, and we tear down everything, killing the NBD
> > > connection used for the guest disk, we needlessly break I/O.
> > > 
> > > eg doing this in the chardev backend is not desirable, because the bugs
> > > with hanging QEMU are typically caused by the way the frontend device
> > > uses the chardev blocking I/O calls, instead of non-blocking I/O calls.
> > > 
> > 
> > Having a way to get out of any of these problems from a single point is
> > quite nice.  To be useful in COLO you need to know for sure you can get
> > out of any network screwup.
> > 
> > We already use shutdown(2) in migrate_cancel and migrate-pause for
> > basically the same reason; I don't think we've got anything similar for
> > NBD, and we probably should have (I think I asked for it fairly
> > recently).
> 
> Yes, the migrate_cancel is an example of a more fine grained way to
> recover. I was thinking that we need an equivalent fine control knob
> for NBD too.

I feel it might be nice not to have to create so many separate knobs.

> That way if QEMU does get stuck, you can start by tearing down the
> least distruptive channel. eg try tearing down the migration connection
> first (which shouldn't negatively impact the guest), and only if that
> doesn't work then, move on to tear down the NBD connection (which risks
> data loss)

I wonder if a different way would be to make all network connections
register with yank, but then make yank take a list of connections to
shutdown(2).

Dave

> Regards,
> Daniel
> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: [PATCH v2 4/4] iotests: Mirror with different source/target size

2020-05-11 Thread Max Reitz
On 11.05.20 15:58, Kevin Wolf wrote:
> This tests that the mirror job catches situations where the target node
> has a different size than the source node. It must also forbid resize
> operations when the job is already running.
> 
> Signed-off-by: Kevin Wolf 
> Message-Id: <20200507145228.323412-4-kw...@redhat.com>
> Reviewed-by: Eric Blake 
> Signed-off-by: Kevin Wolf 
> ---
>  tests/qemu-iotests/041 | 45 ++
>  tests/qemu-iotests/041.out |  4 ++--
>  2 files changed, 47 insertions(+), 2 deletions(-)

Reviewed-by: Max Reitz 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2 1/4] iotests/109: Don't mirror with mismatched size

2020-05-11 Thread Max Reitz
On 11.05.20 17:29, Kevin Wolf wrote:
> Am 11.05.2020 um 17:08 hat Max Reitz geschrieben:
>> On 11.05.20 15:58, Kevin Wolf wrote:
>>> This patch makes the raw image the same size as the file in a different
>>> format that is mirrored as raw to it to avoid errors when mirror starts
>>> to enforce that source and target are the same size.
>>>
>>> We check only that the first 512 bytes are zeroed (instead of 64k)
>>> because some image formats create image files that are smaller than 64k,
>>> so trying to read 64k would result in I/O errors. Apart from this, 512
>>> is more appropriate anyway because the raw format driver protects
>>> specifically the first 512 bytes.
>>>
>>> Signed-off-by: Kevin Wolf 
>>> ---
>>>  tests/qemu-iotests/109   | 10 ++---
>>>  tests/qemu-iotests/109.out   | 74 +---
>>>  tests/qemu-iotests/common.filter |  5 +++
>>>  3 files changed, 41 insertions(+), 48 deletions(-)
>>>
>>> diff --git a/tests/qemu-iotests/109 b/tests/qemu-iotests/109
>>> index 5bc2e9b001..3ffeaf3c55 100755
>>> --- a/tests/qemu-iotests/109
>>> +++ b/tests/qemu-iotests/109
>>> @@ -77,14 +77,14 @@ for fmt in qcow qcow2 qed vdi vmdk vpc; do
>>>  echo "=== Writing a $fmt header into raw ==="
>>>  echo
>>>  
>>> -_make_test_img 64M
>>>  TEST_IMG="$TEST_IMG.src" IMGFMT=$fmt _make_test_img 64M
>>> +_make_test_img $(du -b "$TEST_IMG.src" | cut -f1) | 
>>> _filter_img_create_size
>>
>> Why du and not the file length (stat -c '%s')?
> 
> Because the test from which I copied had 'du' and the internet claimed
> that 'stat -c' isn't portable. Now I see that we do use it in other test
> cases, so I guess it would have been fine, too. Is there a good reason
> why 'stat' would be better?

Oh, I didn’t know that du -b reports the file length.  Well, then that
works, too.  (I’ve never seen du used for anything but getting the,
well, disk usage.)

(I figured -b would just report the size in bytes.)

Then:

Reviewed-by: Max Reitz 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2 2/4] iotests/229: Use blkdebug to inject an error

2020-05-11 Thread Kevin Wolf
Am 11.05.2020 um 17:18 hat Max Reitz geschrieben:
> On 11.05.20 15:58, Kevin Wolf wrote:
> > 229 relies on the mirror running into an I/O error when the target is
> > smaller than the source. After changing mirror to catch this condition
> > while starting the job, this test case won't get a job that is paused
> > for an I/O error any more. Use blkdebug instead to inject an error.
> > 
> > Signed-off-by: Kevin Wolf 
> > Message-Id: <20200507145228.323412-2-kw...@redhat.com>
> > Reviewed-by: Eric Blake 
> > Signed-off-by: Kevin Wolf 
> 
> Did you do this on purpose, or did you send the patch mail this way
> accidentally (with Message-ID tag and double S-o-b)?

It's an artifact on applying my own patch from the list with Message-Id
tags (as we're asked to do), rebasing my development branch (which got
rid of the original version of the commit message) and then noticing in
the last minute that it doesn't pass iotests.

Kevin


signature.asc
Description: PGP signature


Re: [PATCH v3 02/17] block: use int64_t as bytes type in tracked requests

2020-05-11 Thread Alberto Garcia
On Thu 30 Apr 2020 01:10:18 PM CEST, Vladimir Sementsov-Ogievskiy wrote:
> We are generally moving to int64_t for both offset and bytes parameters
> on all io paths.
>
> Main motivation is realization of 64-bit write_zeroes operation for
> fast zeroing large disk chunks, up to the whole disk.
>
> We chose signed type, to be consistent with off_t (which is signed) and
> with possibility for signed return type (where negative value means
> error).
>
> So, convert tracked requests now.
>
> Series: 64bit-block-status
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> Reviewed-by: Stefan Hajnoczi 
> Reviewed-by: Eric Blake 

Reviewed-by: Alberto Garcia 

Berto



Re: [PATCH v2 3/4] mirror: Make sure that source and target size match

2020-05-11 Thread Max Reitz
On 11.05.20 15:58, Kevin Wolf wrote:
> If the target is shorter than the source, mirror would copy data until
> it reaches the end of the target and then fail with an I/O error when
> trying to write past the end.
> 
> If the target is longer than the source, the mirror job would complete
> successfully, but the target wouldn't actually be an accurate copy of
> the source image (it would contain some additional garbage at the end).
> 
> Fix this by checking that both images have the same size when the job
> starts.
> 
> Signed-off-by: Kevin Wolf 
> Message-Id: <20200507145228.323412-3-kw...@redhat.com>
> Reviewed-by: Eric Blake 
> Signed-off-by: Kevin Wolf 
> ---
>  block/mirror.c | 21 -
>  1 file changed, 12 insertions(+), 9 deletions(-)

Reviewed-by: Max Reitz 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2 1/4] iotests/109: Don't mirror with mismatched size

2020-05-11 Thread Kevin Wolf
Am 11.05.2020 um 17:08 hat Max Reitz geschrieben:
> On 11.05.20 15:58, Kevin Wolf wrote:
> > This patch makes the raw image the same size as the file in a different
> > format that is mirrored as raw to it to avoid errors when mirror starts
> > to enforce that source and target are the same size.
> > 
> > We check only that the first 512 bytes are zeroed (instead of 64k)
> > because some image formats create image files that are smaller than 64k,
> > so trying to read 64k would result in I/O errors. Apart from this, 512
> > is more appropriate anyway because the raw format driver protects
> > specifically the first 512 bytes.
> > 
> > Signed-off-by: Kevin Wolf 
> > ---
> >  tests/qemu-iotests/109   | 10 ++---
> >  tests/qemu-iotests/109.out   | 74 +---
> >  tests/qemu-iotests/common.filter |  5 +++
> >  3 files changed, 41 insertions(+), 48 deletions(-)
> > 
> > diff --git a/tests/qemu-iotests/109 b/tests/qemu-iotests/109
> > index 5bc2e9b001..3ffeaf3c55 100755
> > --- a/tests/qemu-iotests/109
> > +++ b/tests/qemu-iotests/109
> > @@ -77,14 +77,14 @@ for fmt in qcow qcow2 qed vdi vmdk vpc; do
> >  echo "=== Writing a $fmt header into raw ==="
> >  echo
> >  
> > -_make_test_img 64M
> >  TEST_IMG="$TEST_IMG.src" IMGFMT=$fmt _make_test_img 64M
> > +_make_test_img $(du -b "$TEST_IMG.src" | cut -f1) | 
> > _filter_img_create_size
> 
> Why du and not the file length (stat -c '%s')?

Because the test from which I copied had 'du' and the internet claimed
that 'stat -c' isn't portable. Now I see that we do use it in other test
cases, so I guess it would have been fine, too. Is there a good reason
why 'stat' would be better?

Kevin


signature.asc
Description: PGP signature


Re: [PATCH v3 01/17] block/throttle-groups: throttle_group_co_io_limits_intercept(): 64bit bytes

2020-05-11 Thread Alberto Garcia
On Thu 30 Apr 2020 01:10:17 PM CEST, Vladimir Sementsov-Ogievskiy wrote:
> The function is called from 64bit io handlers, and bytes is just passed
> to throttle_account() which is 64bit too (unsigned though). So, let's
> convert intermediate argument to 64bit too.
>
> This patch is a first in the 64-bit-blocklayer series, so we are
> generally moving to int64_t for both offset and bytes parameters on all
> io paths. Main motivation is realization of 64-bit write_zeroes
> operation for fast zeroing large disk chunks, up to the whole disk.
>
> We chose signed type, to be consistent with off_t (which is signed) and
> with possibility for signed return type (where negative value means
> error).
>
> Patch-correctness audit by Eric Blake:
>
>   Caller has 32-bit, this patch now causes widening which is safe:
>   block/block-backend.c: blk_do_preadv() passes 'unsigned int'
>   block/block-backend.c: blk_do_pwritev_part() passes 'unsigned int'
>   block/throttle.c: throttle_co_pwrite_zeroes() passes 'int'
>   block/throttle.c: throttle_co_pdiscard() passes 'int'
>
>   Caller has 64-bit, this patch fixes potential bug where pre-patch
>   could narrow, except it's easy enough to trace that callers are still
>   capped at 2G actions:
>   block/throttle.c: throttle_co_preadv() passes 'uint64_t'
>   block/throttle.c: throttle_co_pwritev() passes 'uint64_t'
>
>   Implementation in question: block/throttle-groups.c
>   throttle_group_co_io_limits_intercept() takes 'unsigned int bytes'
>   and uses it: argument to util/throttle.c throttle_account(uint64_t)
>
>   All safe: it patches a latent bug, and does not introduce any 64-bit
>   gotchas once throttle_co_p{read,write}v are relaxed, and assuming
>   throttle_account() is not buggy.
>
> Series: 64bit-block-status
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> Reviewed-by: Eric Blake 

Reviewed-by: Alberto Garcia 

Berto



Re: [PATCH v2 2/4] iotests/229: Use blkdebug to inject an error

2020-05-11 Thread Max Reitz
On 11.05.20 15:58, Kevin Wolf wrote:
> 229 relies on the mirror running into an I/O error when the target is
> smaller than the source. After changing mirror to catch this condition
> while starting the job, this test case won't get a job that is paused
> for an I/O error any more. Use blkdebug instead to inject an error.
> 
> Signed-off-by: Kevin Wolf 
> Message-Id: <20200507145228.323412-2-kw...@redhat.com>
> Reviewed-by: Eric Blake 
> Signed-off-by: Kevin Wolf 

Did you do this on purpose, or did you send the patch mail this way
accidentally (with Message-ID tag and double S-o-b)?

(Because I’ve never seen anyone else do it)

> ---
>  tests/qemu-iotests/229 | 15 +++
>  tests/qemu-iotests/229.out |  6 +++---
>  2 files changed, 14 insertions(+), 7 deletions(-)

Reviewed-by: Max Reitz 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2 1/4] iotests/109: Don't mirror with mismatched size

2020-05-11 Thread Max Reitz
On 11.05.20 15:58, Kevin Wolf wrote:
> This patch makes the raw image the same size as the file in a different
> format that is mirrored as raw to it to avoid errors when mirror starts
> to enforce that source and target are the same size.
> 
> We check only that the first 512 bytes are zeroed (instead of 64k)
> because some image formats create image files that are smaller than 64k,
> so trying to read 64k would result in I/O errors. Apart from this, 512
> is more appropriate anyway because the raw format driver protects
> specifically the first 512 bytes.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  tests/qemu-iotests/109   | 10 ++---
>  tests/qemu-iotests/109.out   | 74 +---
>  tests/qemu-iotests/common.filter |  5 +++
>  3 files changed, 41 insertions(+), 48 deletions(-)
> 
> diff --git a/tests/qemu-iotests/109 b/tests/qemu-iotests/109
> index 5bc2e9b001..3ffeaf3c55 100755
> --- a/tests/qemu-iotests/109
> +++ b/tests/qemu-iotests/109
> @@ -77,14 +77,14 @@ for fmt in qcow qcow2 qed vdi vmdk vpc; do
>  echo "=== Writing a $fmt header into raw ==="
>  echo
>  
> -_make_test_img 64M
>  TEST_IMG="$TEST_IMG.src" IMGFMT=$fmt _make_test_img 64M
> +_make_test_img $(du -b "$TEST_IMG.src" | cut -f1) | 
> _filter_img_create_size

Why du and not the file length (stat -c '%s')?

Max



signature.asc
Description: OpenPGP digital signature


[PATCH 1/2] When del child, next_child_index need to subtract 1. Otherwise, the index will get bigger and bigger.

2020-05-11 Thread quweijie
From: quweijie 

Signed-off-by: quweijie 
---
 block/quorum.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/quorum.c b/block/quorum.c
index 6d7a56b..a8272fe 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -1096,6 +1096,7 @@ static void quorum_del_child(BlockDriverState *bs, 
BdrvChild *child,
 memmove(>children[i], >children[i + 1],
 (s->num_children - i - 1) * sizeof(BdrvChild *));
 s->children = g_renew(BdrvChild *, s->children, --s->num_children);
+s->next_child_index--;
 bdrv_unref_child(bs, child);
 
 bdrv_drained_end(bs);
-- 
1.8.3.1




[PATCH v2 0/4] mirror: Make sure that source and target size match

2020-05-11 Thread Kevin Wolf
Same thing as the recent fix for backup, except that mirror already
forbids resizing during the job. So what remains is checking that the
sizes match at the start of the job.

v2:
- Added patch 1 to fix a test that used different source/target size

Kevin Wolf (4):
  iotests/109: Don't mirror with mismatched size
  iotests/229: Use blkdebug to inject an error
  mirror: Make sure that source and target size match
  iotests: Mirror with different source/target size

 block/mirror.c   | 21 +
 tests/qemu-iotests/041   | 45 +++
 tests/qemu-iotests/041.out   |  4 +-
 tests/qemu-iotests/109   | 10 ++---
 tests/qemu-iotests/109.out   | 74 +---
 tests/qemu-iotests/229   | 15 +--
 tests/qemu-iotests/229.out   |  6 +--
 tests/qemu-iotests/common.filter |  5 +++
 8 files changed, 114 insertions(+), 66 deletions(-)

-- 
2.25.3




[PATCH v2 4/4] iotests: Mirror with different source/target size

2020-05-11 Thread Kevin Wolf
This tests that the mirror job catches situations where the target node
has a different size than the source node. It must also forbid resize
operations when the job is already running.

Signed-off-by: Kevin Wolf 
Message-Id: <20200507145228.323412-4-kw...@redhat.com>
Reviewed-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/041 | 45 ++
 tests/qemu-iotests/041.out |  4 ++--
 2 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/041 b/tests/qemu-iotests/041
index 1812dd8479..601c756117 100755
--- a/tests/qemu-iotests/041
+++ b/tests/qemu-iotests/041
@@ -240,6 +240,49 @@ class TestSingleBlockdev(TestSingleDrive):
  target=self.qmp_target)
 self.assert_qmp(result, 'error/class', 'GenericError')
 
+def do_test_resize(self, device, node):
+def pre_finalize():
+if device:
+result = self.vm.qmp('block_resize', device=device, size=65536)
+self.assert_qmp(result, 'error/class', 'GenericError')
+
+result = self.vm.qmp('block_resize', node_name=node, size=65536)
+self.assert_qmp(result, 'error/class', 'GenericError')
+
+result = self.vm.qmp(self.qmp_cmd, job_id='job0', device='drive0',
+ sync='full', target=self.qmp_target,
+ auto_finalize=False, auto_dismiss=False)
+self.assert_qmp(result, 'return', {})
+
+result = self.vm.run_job('job0', auto_finalize=False,
+ pre_finalize=pre_finalize)
+self.assertEqual(result, None)
+
+def test_source_resize(self):
+self.do_test_resize('drive0', 'top')
+
+def test_target_resize(self):
+self.do_test_resize(None, self.qmp_target)
+
+def do_test_target_size(self, size):
+result = self.vm.qmp('block_resize', node_name=self.qmp_target,
+ size=size)
+self.assert_qmp(result, 'return', {})
+
+result = self.vm.qmp(self.qmp_cmd, job_id='job0',
+ device='drive0', sync='full', auto_dismiss=False,
+ target=self.qmp_target)
+self.assert_qmp(result, 'return', {})
+
+result = self.vm.run_job('job0')
+self.assertEqual(result, 'Source and target image have different 
sizes')
+
+def test_small_target(self):
+self.do_test_target_size(self.image_len // 2)
+
+def test_large_target(self):
+self.do_test_target_size(self.image_len * 2)
+
 test_large_cluster = None
 test_image_not_found = None
 test_small_buffer2 = None
@@ -251,6 +294,8 @@ class TestSingleDriveZeroLength(TestSingleDrive):
 
 class TestSingleBlockdevZeroLength(TestSingleBlockdev):
 image_len = 0
+test_small_target = None
+test_large_target = None
 
 class TestSingleDriveUnalignedLength(TestSingleDrive):
 image_len = 1025 * 1024
diff --git a/tests/qemu-iotests/041.out b/tests/qemu-iotests/041.out
index 877b76fd31..53abe11d73 100644
--- a/tests/qemu-iotests/041.out
+++ b/tests/qemu-iotests/041.out
@@ -1,5 +1,5 @@
-..
+
 --
-Ran 94 tests
+Ran 104 tests
 
 OK
-- 
2.25.3




[PATCH v2 2/4] iotests/229: Use blkdebug to inject an error

2020-05-11 Thread Kevin Wolf
229 relies on the mirror running into an I/O error when the target is
smaller than the source. After changing mirror to catch this condition
while starting the job, this test case won't get a job that is paused
for an I/O error any more. Use blkdebug instead to inject an error.

Signed-off-by: Kevin Wolf 
Message-Id: <20200507145228.323412-2-kw...@redhat.com>
Reviewed-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/229 | 15 +++
 tests/qemu-iotests/229.out |  6 +++---
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/tests/qemu-iotests/229 b/tests/qemu-iotests/229
index 866168b236..a5d4e5d4f2 100755
--- a/tests/qemu-iotests/229
+++ b/tests/qemu-iotests/229
@@ -33,6 +33,7 @@ _cleanup()
 _cleanup_test_img
 _rm_test_img "$TEST_IMG"
 _rm_test_img "$DEST_IMG"
+rm -f "$TEST_DIR/blkdebug.conf"
 }
 trap "_cleanup; exit \$status" 0 1 2 3 15
 
@@ -49,11 +50,10 @@ _supported_os Linux
 
 DEST_IMG="$TEST_DIR/d.$IMGFMT"
 TEST_IMG="$TEST_DIR/b.$IMGFMT"
+BLKDEBUG_CONF="$TEST_DIR/blkdebug.conf"
 
 _make_test_img 2M
-
-# destination for mirror will be too small, causing error
-TEST_IMG=$DEST_IMG _make_test_img 1M
+TEST_IMG=$DEST_IMG _make_test_img 2M
 
 $QEMU_IO -c 'write 0 2M' "$TEST_IMG" | _filter_qemu_io
 
@@ -67,11 +67,18 @@ echo
 echo '=== Starting drive-mirror, causing error & stop  ==='
 echo
 
+cat > "$BLKDEBUG_CONF" <

[PATCH v2 3/4] mirror: Make sure that source and target size match

2020-05-11 Thread Kevin Wolf
If the target is shorter than the source, mirror would copy data until
it reaches the end of the target and then fail with an I/O error when
trying to write past the end.

If the target is longer than the source, the mirror job would complete
successfully, but the target wouldn't actually be an accurate copy of
the source image (it would contain some additional garbage at the end).

Fix this by checking that both images have the same size when the job
starts.

Signed-off-by: Kevin Wolf 
Message-Id: <20200507145228.323412-3-kw...@redhat.com>
Reviewed-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 block/mirror.c | 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index aca95c9bc9..201ffa26f9 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -872,6 +872,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 BlockDriverState *target_bs = blk_bs(s->target);
 bool need_drain = true;
 int64_t length;
+int64_t target_length;
 BlockDriverInfo bdi;
 char backing_filename[2]; /* we only need 2 characters because we are only
  checking for a NULL string */
@@ -887,24 +888,26 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 goto immediate_exit;
 }
 
+target_length = blk_getlength(s->target);
+if (target_length < 0) {
+ret = target_length;
+goto immediate_exit;
+}
+
 /* Active commit must resize the base image if its size differs from the
  * active layer. */
 if (s->base == blk_bs(s->target)) {
-int64_t base_length;
-
-base_length = blk_getlength(s->target);
-if (base_length < 0) {
-ret = base_length;
-goto immediate_exit;
-}
-
-if (s->bdev_length > base_length) {
+if (s->bdev_length > target_length) {
 ret = blk_truncate(s->target, s->bdev_length, false,
PREALLOC_MODE_OFF, 0, NULL);
 if (ret < 0) {
 goto immediate_exit;
 }
 }
+} else if (s->bdev_length != target_length) {
+error_setg(errp, "Source and target image have different sizes");
+ret = -EINVAL;
+goto immediate_exit;
 }
 
 if (s->bdev_length == 0) {
-- 
2.25.3




[PATCH v2 1/4] iotests/109: Don't mirror with mismatched size

2020-05-11 Thread Kevin Wolf
This patch makes the raw image the same size as the file in a different
format that is mirrored as raw to it to avoid errors when mirror starts
to enforce that source and target are the same size.

We check only that the first 512 bytes are zeroed (instead of 64k)
because some image formats create image files that are smaller than 64k,
so trying to read 64k would result in I/O errors. Apart from this, 512
is more appropriate anyway because the raw format driver protects
specifically the first 512 bytes.

Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/109   | 10 ++---
 tests/qemu-iotests/109.out   | 74 +---
 tests/qemu-iotests/common.filter |  5 +++
 3 files changed, 41 insertions(+), 48 deletions(-)

diff --git a/tests/qemu-iotests/109 b/tests/qemu-iotests/109
index 5bc2e9b001..3ffeaf3c55 100755
--- a/tests/qemu-iotests/109
+++ b/tests/qemu-iotests/109
@@ -77,14 +77,14 @@ for fmt in qcow qcow2 qed vdi vmdk vpc; do
 echo "=== Writing a $fmt header into raw ==="
 echo
 
-_make_test_img 64M
 TEST_IMG="$TEST_IMG.src" IMGFMT=$fmt _make_test_img 64M
+_make_test_img $(du -b "$TEST_IMG.src" | cut -f1) | _filter_img_create_size
 
 # This first test should fail: The image format was probed, we may not
 # write an image header at the start of the image
 run_qemu "$TEST_IMG" "$TEST_IMG.src" "" "BLOCK_JOB_ERROR" |
 _filter_block_job_len
-$QEMU_IO -c 'read -P 0 0 64k' "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c 'read -P 0 0 512' "$TEST_IMG" | _filter_qemu_io
 
 
 # When raw was explicitly specified, the same must succeed
@@ -103,12 +103,12 @@ for sample_img in empty.bochs iotest-dirtylog-10G-4M.vhdx 
parallels-v1 \
 
 # Can't use _use_sample_img because that isn't designed to be used multiple
 # times and it overwrites $TEST_IMG (both breaks cleanup)
-_make_test_img 64M
 bzcat "$SAMPLE_IMG_DIR/$sample_img.bz2" > "$TEST_IMG.src"
+_make_test_img $(du -b "$TEST_IMG.src" | cut -f1) | _filter_img_create_size
 
 run_qemu "$TEST_IMG" "$TEST_IMG.src" "" "BLOCK_JOB_ERROR" |
 _filter_block_job_offset | _filter_block_job_len
-$QEMU_IO -c 'read -P 0 0 64k' "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c 'read -P 0 0 512' "$TEST_IMG" | _filter_qemu_io
 
 run_qemu "$TEST_IMG" "$TEST_IMG.src" "'format': 'raw'," "BLOCK_JOB_READY"
 $QEMU_IMG compare -f raw -F raw "$TEST_IMG" "$TEST_IMG.src"
@@ -119,8 +119,8 @@ echo "=== Write legitimate MBR into raw ==="
 echo
 
 for sample_img in grub_mbr.raw; do
-_make_test_img 64M
 bzcat "$SAMPLE_IMG_DIR/$sample_img.bz2" > "$TEST_IMG.src"
+_make_test_img $(du -b "$TEST_IMG.src" | cut -f1) | _filter_img_create_size
 
 run_qemu "$TEST_IMG" "$TEST_IMG.src" "" "BLOCK_JOB_READY"
 $QEMU_IMG compare -f raw -F raw "$TEST_IMG" "$TEST_IMG.src"
diff --git a/tests/qemu-iotests/109.out b/tests/qemu-iotests/109.out
index 884f65f18d..ad739df46c 100644
--- a/tests/qemu-iotests/109.out
+++ b/tests/qemu-iotests/109.out
@@ -2,8 +2,8 @@ QA output created by 109
 
 === Writing a qcow header into raw ===
 
-Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
 Formatting 'TEST_DIR/t.raw.src', fmt=IMGFMT size=67108864
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=SIZE
 { 'execute': 'qmp_capabilities' }
 {"return": {}}
 {'execute':'drive-mirror', 'arguments':{ 'device': 'src', 'target': 
'TEST_DIR/t.IMGFMT', 'mode': 'existing', 'sync': 'full'}}
@@ -23,8 +23,8 @@ WARNING: Image format was not specified for 'TEST_DIR/t.raw' 
and probing guessed
 {"execute":"quit"}
 {"return": {}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": 
"SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
-read 65536/65536 bytes at offset 0
-64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 0
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 { 'execute': 'qmp_capabilities' }
 {"return": {}}
 {'execute':'drive-mirror', 'arguments':{ 'device': 'src', 'target': 
'TEST_DIR/t.IMGFMT', 'format': 'IMGFMT', 'mode': 'existing', 'sync': 'full'}}
@@ -43,13 +43,12 @@ read 65536/65536 bytes at offset 0
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": 
"BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 1024, "offset": 1024, 
"speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": 
"JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": 
"JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
-Warning: Image size mismatch!
 Images are identical.
 
 === Writing a qcow2 header into raw ===
 
-Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
 Formatting 'TEST_DIR/t.raw.src', fmt=IMGFMT size=67108864
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=SIZE
 { 'execute': 'qmp_capabilities' }
 {"return": {}}
 {'execute':'drive-mirror', 

Re: [PATCH v5 00/18] nvme: refactoring and cleanups

2020-05-11 Thread Kevin Wolf
Am 11.05.2020 um 09:09 hat Klaus Jensen geschrieben:
> On May 11 09:00, Philippe Mathieu-Daudé wrote:
> > Hi Klaus,
> > 
> > On 5/11/20 8:25 AM, Klaus Jensen wrote:
> > > On May  5 07:48, Klaus Jensen wrote:
> > > > From: Klaus Jensen 
> > > > 
> > > > Changes since v5
> > > > 
> > > > No functional changes, just updated Reviewed-by tags. Also, I screwed up
> > > > the CC list when sending v4.
> > > > 
> > > > Philippe and Keith, please add a Reviewed-by to
> > > > 
> > > >* "nvme: factor out pmr setup" and
> > > >* "do cmb/pmr init as part of pci init"
> > > > 
> > > > since the first one was added and the second one was changed in v4 when
> > > > rebasing on Kevins block-next tree which had the PMR work that was not
> > > > in master at the time.
> > > > 
> > > > With those in place, it should be ready for Kevin to merge.
> > > > 
> > > Gentle ping on this.
> > > 
> > > Also, please see the two patches in "[PATCH 0/2] hw/block/nvme: fixes
> > > for interrupt behavior". I think they should go in preparation to this
> > > series.
> > 
> > I was going to ping Kevin last week, but then read your comment on pach #7
> > "nvme: add max_ioqpairs device parameter", so I interpreted you would
> > respin.
> > Now it is clearer, applying in the following order you don't need to respin,
> > right?
> > 
> > - [PATCH 0/2] hw/block/nvme: fixes for interrupt behavior"
> > - [PATCH v5 00/18] nvme: refactoring and cleanups

I was waiting for the review Klaus asked for. You had a comment about
renaming patches, but I didn't see any comments about the patches in
question.

> Ugh. "[PATCH v5 00/18] nvme: refactoring and cleanups" doesn't apply
> completely cleanly.
> 
> "[PATCH 0/2] hw/block/nvme: fixes for interrupt behavior" was intented
> to go into master because it fixes a bug, that is why I split them up.
> 
> But looks like it is better to just roll it into this series. I'll
> respin a v6 with the two interrupt fixes.

Ok, I'll wait for that one then. I'm still not sure, though, whether I
should then wait for additional review or just apply the patches.

Kevin




Re: [PATCH 0/5] Introduce 'yank' oob qmp command to recover from hanging qemu

2020-05-11 Thread Daniel P . Berrangé
On Mon, May 11, 2020 at 01:07:18PM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berra...@redhat.com) wrote:
> > On Mon, May 11, 2020 at 01:14:34PM +0200, Lukas Straub wrote:
> > > Hello Everyone,
> > > In many cases, if qemu has a network connection (qmp, migration, chardev, 
> > > etc.)
> > > to some other server and that server dies or hangs, qemu hangs too.
> > 
> > If qemu as a whole hangs due to a stalled network connection, that is a
> > bug in QEMU that we should be fixing IMHO. QEMU should be doing non-blocking
> > I/O in general, such that if the network connection or remote server stalls,
> > we simply stop sending I/O - we shouldn't ever hang the QEMU process or main
> > loop.
> > 
> > There are places in QEMU code which are not well behaved in this respect,
> > but many are, and others are getting fixed where found to be important.
> > 
> > Arguably any place in QEMU code which can result in a hang of QEMU in the
> > event of a stalled network should be considered a security flaw, because
> > the network is untrusted in general.
> 
> That's not really true of the 'management network' - people trust that
> and I don't see a lot of the qemu code getting fixed safely for all of
> them.

It depends on the user / app / deployment scenario. In OpenStack alot of
work was done to beef up security between services on the mgmt network,
with TLS encryption as standard to reduce attack vectors.

> > > These patches introduce the new 'yank' out-of-band qmp command to recover 
> > > from
> > > these kinds of hangs. The different subsystems register callbacks which 
> > > get
> > > executed with the yank command. For example the callback can shutdown() a
> > > socket. This is intended for the colo use-case, but it can be used for 
> > > other
> > > things too of course.
> > 
> > IIUC, invoking the "yank" command unconditionally kills every single
> > network connection in QEMU that has registered with the "yank" subsystem.
> > IMHO this is way too big of a hammer, even if we accept there are bugs in
> > QEMU not handling stalled networking well.
> 
> But isn't this hammer conditional - I see that it's a migration
> capabiltiy for the migration socket, and a flag in nbd - so it only
> yanks things you've told it to.

IIUC, you have to set these flags upfront when you launch QEMU, or
hotplug the device using the feature. When something gets stuck,
and you issue the "yank" command, then everything that has the flag
enabled gets torn down. So in practice it looks like the flag will
get enabled for everything at QEMU startup, and yanking down tear
down everything.

> > eg if a chardev hangs QEMU, and we tear down everything, killing the NBD
> > connection used for the guest disk, we needlessly break I/O.
> > 
> > eg doing this in the chardev backend is not desirable, because the bugs
> > with hanging QEMU are typically caused by the way the frontend device
> > uses the chardev blocking I/O calls, instead of non-blocking I/O calls.
> > 
> 
> Having a way to get out of any of these problems from a single point is
> quite nice.  To be useful in COLO you need to know for sure you can get
> out of any network screwup.
> 
> We already use shutdown(2) in migrate_cancel and migrate-pause for
> basically the same reason; I don't think we've got anything similar for
> NBD, and we probably should have (I think I asked for it fairly
> recently).

Yes, the migrate_cancel is an example of a more fine grained way to
recover. I was thinking that we need an equivalent fine control knob
for NBD too.

That way if QEMU does get stuck, you can start by tearing down the
least distruptive channel. eg try tearing down the migration connection
first (which shouldn't negatively impact the guest), and only if that
doesn't work then, move on to tear down the NBD connection (which risks
data loss)

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH 0/5] Introduce 'yank' oob qmp command to recover from hanging qemu

2020-05-11 Thread Dr. David Alan Gilbert
* Daniel P. Berrangé (berra...@redhat.com) wrote:
> On Mon, May 11, 2020 at 01:14:34PM +0200, Lukas Straub wrote:
> > Hello Everyone,
> > In many cases, if qemu has a network connection (qmp, migration, chardev, 
> > etc.)
> > to some other server and that server dies or hangs, qemu hangs too.
> 
> If qemu as a whole hangs due to a stalled network connection, that is a
> bug in QEMU that we should be fixing IMHO. QEMU should be doing non-blocking
> I/O in general, such that if the network connection or remote server stalls,
> we simply stop sending I/O - we shouldn't ever hang the QEMU process or main
> loop.
> 
> There are places in QEMU code which are not well behaved in this respect,
> but many are, and others are getting fixed where found to be important.
> 
> Arguably any place in QEMU code which can result in a hang of QEMU in the
> event of a stalled network should be considered a security flaw, because
> the network is untrusted in general.

That's not really true of the 'management network' - people trust that
and I don't see a lot of the qemu code getting fixed safely for all of
them.

> > These patches introduce the new 'yank' out-of-band qmp command to recover 
> > from
> > these kinds of hangs. The different subsystems register callbacks which get
> > executed with the yank command. For example the callback can shutdown() a
> > socket. This is intended for the colo use-case, but it can be used for other
> > things too of course.
> 
> IIUC, invoking the "yank" command unconditionally kills every single
> network connection in QEMU that has registered with the "yank" subsystem.
> IMHO this is way too big of a hammer, even if we accept there are bugs in
> QEMU not handling stalled networking well.

But isn't this hammer conditional - I see that it's a migration
capabiltiy for the migration socket, and a flag in nbd - so it only
yanks things you've told it to.

> eg if a chardev hangs QEMU, and we tear down everything, killing the NBD
> connection used for the guest disk, we needlessly break I/O.
> 
> eg doing this in the chardev backend is not desirable, because the bugs
> with hanging QEMU are typically caused by the way the frontend device
> uses the chardev blocking I/O calls, instead of non-blocking I/O calls.
> 

Having a way to get out of any of these problems from a single point is
quite nice.  To be useful in COLO you need to know for sure you can get
out of any network screwup.

We already use shutdown(2) in migrate_cancel and migrate-pause for
basically the same reason; I don't think we've got anything similar for
NBD, and we probably should have (I think I asked for it fairly
recently).

Dave



> Regards,
> Daniel
> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: [PATCH v3 9/9] iotests: Add test 291 to for qemu-img bitmap coverage

2020-05-11 Thread Max Reitz
On 08.05.20 20:03, Eric Blake wrote:
> Add a new test covering the 'qemu-img bitmap' subcommand, as well as
> 'qemu-img convert --bitmaps', both added in recent patches.
> 
> Signed-off-by: Eric Blake 
> ---
>  tests/qemu-iotests/291 | 112 +
>  tests/qemu-iotests/291.out |  78 ++
>  tests/qemu-iotests/group   |   1 +
>  3 files changed, 191 insertions(+)
>  create mode 100755 tests/qemu-iotests/291
>  create mode 100644 tests/qemu-iotests/291.out

Reviewed-by: Max Reitz 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v3 8/9] qemu-img: Add convert --bitmaps option

2020-05-11 Thread Max Reitz
On 08.05.20 20:03, Eric Blake wrote:
> Make it easier to copy all the persistent bitmaps of (the top layer
> of) a source image along with its guest-visible contents, by adding a
> boolean flag for use with qemu-img convert.  This is basically
> shorthand, as the same effect could be accomplished with a series of
> 'qemu-img bitmap --add' and 'qemu-img bitmap --merge -b source'
> commands, or by QMP commands.
> 
> See also https://bugzilla.redhat.com/show_bug.cgi?id=1779893
> 
> While touching this, clean up a couple coding issues spotted in the
> same function: an extra blank line, and merging back-to-back 'if
> (!skip_create)' blocks.
> 
> Signed-off-by: Eric Blake 
> ---
>  docs/tools/qemu-img.rst |  6 ++-
>  qemu-img.c  | 81 +++--
>  qemu-img-cmds.hx|  4 +-
>  3 files changed, 85 insertions(+), 6 deletions(-)

Reviewed-by: Max Reitz 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 2/5] io/channel.c,io/channel-socket.c: Add yank feature

2020-05-11 Thread Daniel P . Berrangé
On Mon, May 11, 2020 at 01:14:41PM +0200, Lukas Straub wrote:
> Add qio_channel_set_yank function to channel and to channel-socket,
> which will register a yank function. The yank function calls
> shutdown() on the socket.
> 
> Signed-off-by: Lukas Straub 
> ---
>  Makefile.objs   |  1 +
>  include/io/channel-socket.h |  1 +
>  include/io/channel.h| 12 
>  io/channel-socket.c | 29 +
>  io/channel.c|  9 +
>  5 files changed, 52 insertions(+)

Assuming we want the yank feature (which I'm not entirely convinced
of), then I don't think any of this addition should exist. The
QIOChannel class already provides a "qio_channel_shutdown" method
which can be invoked. The layer above which is using the QIOChannel
should be calling this existing qio_channel_shutdown method in
response to any yank request.  The I/O layer shouldn't have any
direct dependancy on the yank feature.


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v3 7/9] qcow2: Expose bitmaps' size during measure

2020-05-11 Thread Max Reitz
On 08.05.20 20:03, Eric Blake wrote:
> It's useful to know how much space can be occupied by qcow2 persistent
> bitmaps, even though such metadata is unrelated to the guest-visible
> data.  Report this value as an additional field, present when
> measuring an existing image and the output format supports bitmaps.
> Update iotest 178 and 190 to updated output, as well as new coverage
> in 190 demonstrating non-zero values made possible with the
> recently-added qemu-img bitmap command.
> 
> The addition of a new field demonstrates why we should always
> zero-initialize qapi C structs; while the qcow2 driver still fully
> populates all fields, the raw and crypto drivers had to be tweaked to
> avoid uninitialized data.
> 
> See also: https://bugzilla.redhat.com/1779904
> 
> Reported-by: Nir Soffer 
> Signed-off-by: Eric Blake 
> ---
>  qapi/block-core.json | 15 +++
>  block/crypto.c   |  2 +-
>  block/qcow2.c| 37 ---
>  block/raw-format.c   |  2 +-
>  qemu-img.c   |  3 +++
>  tests/qemu-iotests/178.out.qcow2 | 16 
>  tests/qemu-iotests/190   | 43 ++--
>  tests/qemu-iotests/190.out   | 23 -
>  8 files changed, 128 insertions(+), 13 deletions(-)
> 
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 943df1926a91..9a7a388c7ad3 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -633,18 +633,23 @@
>  # efficiently so file size may be smaller than virtual disk size.
>  #
>  # The values are upper bounds that are guaranteed to fit the new image file.
> -# Subsequent modification, such as internal snapshot or bitmap creation, may
> -# require additional space and is not covered here.
> +# Subsequent modification, such as internal snapshot or further bitmap
> +# creation, may require additional space and is not covered here.
>  #
> -# @required: Size required for a new image file, in bytes.
> +# @required: Size required for a new image file, in bytes, when copying just
> +#guest-visible contents.
>  #
>  # @fully-allocated: Image file size, in bytes, once data has been written
> -#   to all sectors.
> +#   to all sectors, when copying just guest-visible contents.
> +#
> +# @bitmaps: Additional size required for bitmap metadata in a source image,

s/in/from/?  Otherwise it sounds like this would be about allocation in
the source, which it clear can’t be, but, well.

> +#   if that bitmap metadata can be copied in addition to guest
> +#   contents. (since 5.1)

[...]

> diff --git a/block/qcow2.c b/block/qcow2.c
> index 838d810ca5ec..f836a6047879 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c

[...]

> @@ -4849,16 +4875,21 @@ static BlockMeasureInfo *qcow2_measure(QemuOpts 
> *opts, BlockDriverState *in_bs,
>  required = virtual_size;
>  }
> 
> -info = g_new(BlockMeasureInfo, 1);
> +info = g_new0(BlockMeasureInfo, 1);
>  info->fully_allocated =
>  qcow2_calc_prealloc_size(virtual_size, cluster_size,
>   ctz32(refcount_bits)) + luks_payload_size;
> 
> -/* Remove data clusters that are not required.  This overestimates the
> +/*
> + * Remove data clusters that are not required.  This overestimates the
>   * required size because metadata needed for the fully allocated file is
> - * still counted.
> + * still counted.  Show bitmaps only if both source and destination
> + * would support them.
>   */
>  info->required = info->fully_allocated - virtual_size + required;
> +info->has_bitmaps = version >= 3 && in_bs &&
> +bdrv_dirty_bitmap_supported(in_bs);

Why is it important whether the source format supports persistent dirty
bitmaps?

I’m asking because I’d like there to be some concise reason when and why
the @bitmaps field appears.  “Whenever the target supports bitmaps” is
more concise than “When both source and target support bitmaps”.  Also,
the latter is not really different from “When any bitmap data can be
copied”, but in the latter case we should not show it when there are no
bitmaps in the source (even though the format supports them).

Or from the other perspective: As a user, I would never be annoyed by
the @bitmaps field being present.  I don’t mind a “0”.
OTOH, what information can it convey to me that it it’s optional and
sometimes not present?
I can see these cases:

- That the source format doesn’t support bitmaps?  I want to convert it
to something else anyway, so I don’t really care about what the source
format can or can’t do.

- That the destination doesn’t support bitmaps?  Ah, yes, the fact that
the bitmap field is missing might be a warning sign for this.

- That qemu is too old to copy bitmaps?  Same here.

- That there are no bitmaps in the source?  OK, but then I disregard the
@bitmaps field anyway, present or 

Re: [PATCH 0/5] Introduce 'yank' oob qmp command to recover from hanging qemu

2020-05-11 Thread Daniel P . Berrangé
On Mon, May 11, 2020 at 01:14:34PM +0200, Lukas Straub wrote:
> Hello Everyone,
> In many cases, if qemu has a network connection (qmp, migration, chardev, 
> etc.)
> to some other server and that server dies or hangs, qemu hangs too.

If qemu as a whole hangs due to a stalled network connection, that is a
bug in QEMU that we should be fixing IMHO. QEMU should be doing non-blocking
I/O in general, such that if the network connection or remote server stalls,
we simply stop sending I/O - we shouldn't ever hang the QEMU process or main
loop.

There are places in QEMU code which are not well behaved in this respect,
but many are, and others are getting fixed where found to be important.

Arguably any place in QEMU code which can result in a hang of QEMU in the
event of a stalled network should be considered a security flaw, because
the network is untrusted in general.

> These patches introduce the new 'yank' out-of-band qmp command to recover from
> these kinds of hangs. The different subsystems register callbacks which get
> executed with the yank command. For example the callback can shutdown() a
> socket. This is intended for the colo use-case, but it can be used for other
> things too of course.

IIUC, invoking the "yank" command unconditionally kills every single
network connection in QEMU that has registered with the "yank" subsystem.
IMHO this is way too big of a hammer, even if we accept there are bugs in
QEMU not handling stalled networking well.

eg if a chardev hangs QEMU, and we tear down everything, killing the NBD
connection used for the guest disk, we needlessly break I/O.

eg doing this in the chardev backend is not desirable, because the bugs
with hanging QEMU are typically caused by the way the frontend device
uses the chardev blocking I/O calls, instead of non-blocking I/O calls.


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




[PATCH 4/5] chardev/char-socket.c: Add yank feature

2020-05-11 Thread Lukas Straub
Add yank option which is passed to the socket-channel.

Signed-off-by: Lukas Straub 
---
 chardev/char-socket.c | 8 
 chardev/char.c| 3 +++
 qapi/char.json| 5 -
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/chardev/char-socket.c b/chardev/char-socket.c
index 185fe38dda..e476358941 100644
--- a/chardev/char-socket.c
+++ b/chardev/char-socket.c
@@ -65,6 +65,7 @@ typedef struct {
 int max_size;
 int do_telnetopt;
 int do_nodelay;
+int do_yank;
 int *read_msgfds;
 size_t read_msgfds_num;
 int *write_msgfds;
@@ -877,6 +878,9 @@ static int tcp_chr_new_client(Chardev *chr, 
QIOChannelSocket *sioc)
 if (s->do_nodelay) {
 qio_channel_set_delay(s->ioc, false);
 }
+if (s->do_yank) {
+qio_channel_set_yank(s->ioc, true);
+}
 if (s->listener) {
 qio_net_listener_set_client_func_full(s->listener, NULL, NULL,
   NULL, chr->gcontext);
@@ -1297,6 +1301,7 @@ static void qmp_chardev_open_socket(Chardev *chr,
 SocketChardev *s = SOCKET_CHARDEV(chr);
 ChardevSocket *sock = backend->u.socket.data;
 bool do_nodelay = sock->has_nodelay ? sock->nodelay : false;
+bool do_yank= sock->has_yank? sock->yank: false;
 bool is_listen  = sock->has_server  ? sock->server  : true;
 bool is_telnet  = sock->has_telnet  ? sock->telnet  : false;
 bool is_tn3270  = sock->has_tn3270  ? sock->tn3270  : false;
@@ -1310,6 +1315,7 @@ static void qmp_chardev_open_socket(Chardev *chr,
 s->is_tn3270 = is_tn3270;
 s->is_websock = is_websock;
 s->do_nodelay = do_nodelay;
+s->do_yank = do_yank;
 if (sock->tls_creds) {
 Object *creds;
 creds = object_resolve_path_component(
@@ -1400,6 +1406,8 @@ static void qemu_chr_parse_socket(QemuOpts *opts, 
ChardevBackend *backend,
 
 sock->has_nodelay = qemu_opt_get(opts, "delay");
 sock->nodelay = !qemu_opt_get_bool(opts, "delay", true);
+sock->has_yank = qemu_opt_get(opts, "yank");
+sock->yank = qemu_opt_get_bool(opts, "yank", false);
 /*
  * We have different default to QMP for 'server', hence
  * we can't just check for existence of 'server'
diff --git a/chardev/char.c b/chardev/char.c
index e77564060d..04075389bf 100644
--- a/chardev/char.c
+++ b/chardev/char.c
@@ -939,6 +939,9 @@ QemuOptsList qemu_chardev_opts = {
 },{
 .name = "logappend",
 .type = QEMU_OPT_BOOL,
+},{
+.name = "yank",
+.type = QEMU_OPT_BOOL,
 },
 { /* end of list */ }
 },
diff --git a/qapi/char.json b/qapi/char.json
index daceb20f84..f9c04e720c 100644
--- a/qapi/char.json
+++ b/qapi/char.json
@@ -270,6 +270,8 @@
 # then attempt a reconnect after the given number of seconds.
 # Setting this to zero disables this function. (default: 0)
 # (Since: 2.2)
+# @yank: Shutdown the socket when the 'yank' qmp command is executed.
+#(Since: 5.1)
 #
 # Since: 1.4
 ##
@@ -283,7 +285,8 @@
 '*telnet': 'bool',
 '*tn3270': 'bool',
 '*websocket': 'bool',
-'*reconnect': 'int' },
+'*reconnect': 'int',
+'*yank': 'bool' },
   'base': 'ChardevCommon' }
 
 ##
-- 
2.20.1



pgpFCbuPiBUEl.pgp
Description: OpenPGP digital signature


[PATCH 0/5] Introduce 'yank' oob qmp command to recover from hanging qemu

2020-05-11 Thread Lukas Straub
Hello Everyone,
In many cases, if qemu has a network connection (qmp, migration, chardev, etc.)
to some other server and that server dies or hangs, qemu hangs too.
These patches introduce the new 'yank' out-of-band qmp command to recover from
these kinds of hangs. The different subsystems register callbacks which get
executed with the yank command. For example the callback can shutdown() a
socket. This is intended for the colo use-case, but it can be used for other
things too of course.

Regards,
Lukas Straub

Lukas Straub (5):
  Introduce yank feature
  io/channel.c,io/channel-socket.c: Add yank feature
  block/nbd.c: Add yank feature
  chardev/char-socket.c: Add yank feature
  migration: Add yank feature

 Makefile.objs   |  2 +
 block/nbd.c | 68 -
 chardev/char-socket.c   |  8 
 chardev/char.c  |  3 ++
 include/io/channel-socket.h |  1 +
 include/io/channel.h| 12 ++
 io/channel-socket.c | 29 ++
 io/channel.c|  9 +
 migration/channel.c |  2 +
 migration/migration.c   | 11 ++
 qapi/block-core.json|  5 ++-
 qapi/char.json  |  5 ++-
 qapi/migration.json | 17 +++--
 qapi/misc.json  | 15 
 softmmu/vl.c|  2 +
 yank.c  | 75 +
 yank.h  | 12 ++
 17 files changed, 254 insertions(+), 22 deletions(-)
 create mode 100644 yank.c
 create mode 100644 yank.h

-- 
2.20.1


pgp3hgigll8XF.pgp
Description: OpenPGP digital signature


[PATCH 5/5] migration: Add yank feature

2020-05-11 Thread Lukas Straub
Add yank option which is passed to the socket-channel.

Signed-off-by: Lukas Straub 
---
 migration/channel.c   |  2 ++
 migration/migration.c | 11 +++
 qapi/migration.json   | 17 ++---
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/migration/channel.c b/migration/channel.c
index 20e4c8e2dc..498af99104 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -35,6 +35,7 @@ void migration_channel_process_incoming(QIOChannel *ioc)
 trace_migration_set_incoming_channel(
 ioc, object_get_typename(OBJECT(ioc)));
 
+qio_channel_set_yank(ioc, s->parameters.yank);
 if (s->parameters.tls_creds &&
 *s->parameters.tls_creds &&
 !object_dynamic_cast(OBJECT(ioc),
@@ -67,6 +68,7 @@ void migration_channel_connect(MigrationState *s,
 ioc, object_get_typename(OBJECT(ioc)), hostname, error);
 
 if (!error) {
+qio_channel_set_yank(ioc, s->parameters.yank);
 if (s->parameters.tls_creds &&
 *s->parameters.tls_creds &&
 !object_dynamic_cast(OBJECT(ioc),
diff --git a/migration/migration.c b/migration/migration.c
index 187ac0410c..b6f2f82dfb 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -814,6 +814,8 @@ MigrationParameters *qmp_query_migrate_parameters(Error 
**errp)
 params->max_postcopy_bandwidth = s->parameters.max_postcopy_bandwidth;
 params->has_max_cpu_throttle = true;
 params->max_cpu_throttle = s->parameters.max_cpu_throttle;
+params->has_yank = true;
+params->yank = s->parameters.yank;
 params->has_announce_initial = true;
 params->announce_initial = s->parameters.announce_initial;
 params->has_announce_max = true;
@@ -1364,6 +1366,9 @@ static void 
migrate_params_test_apply(MigrateSetParameters *params,
 if (params->has_max_cpu_throttle) {
 dest->max_cpu_throttle = params->max_cpu_throttle;
 }
+if (params->has_yank) {
+dest->yank = params->yank;
+}
 if (params->has_announce_initial) {
 dest->announce_initial = params->announce_initial;
 }
@@ -1472,6 +1477,9 @@ static void migrate_params_apply(MigrateSetParameters 
*params, Error **errp)
 if (params->has_max_cpu_throttle) {
 s->parameters.max_cpu_throttle = params->max_cpu_throttle;
 }
+if (params->has_yank) {
+s->parameters.yank = params->yank;
+}
 if (params->has_announce_initial) {
 s->parameters.announce_initial = params->announce_initial;
 }
@@ -3623,6 +3631,8 @@ static Property migration_properties[] = {
 DEFINE_PROP_UINT8("max-cpu-throttle", MigrationState,
   parameters.max_cpu_throttle,
   DEFAULT_MIGRATE_MAX_CPU_THROTTLE),
+DEFINE_PROP_BOOL("yank", MigrationState,
+  parameters.yank, false),
 DEFINE_PROP_SIZE("announce-initial", MigrationState,
   parameters.announce_initial,
   DEFAULT_MIGRATE_ANNOUNCE_INITIAL),
@@ -3711,6 +3721,7 @@ static void migration_instance_init(Object *obj)
 params->has_xbzrle_cache_size = true;
 params->has_max_postcopy_bandwidth = true;
 params->has_max_cpu_throttle = true;
+params->has_yank = true;
 params->has_announce_initial = true;
 params->has_announce_max = true;
 params->has_announce_rounds = true;
diff --git a/qapi/migration.json b/qapi/migration.json
index eca2981d0a..ad9e431a8f 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -623,6 +623,9 @@
 #  will consume more CPU.
 #  Defaults to 1. (Since 5.0)
 #
+# @yank: Shutdown the migration socket when the 'yank' qmp command is
+#executed. (Since: 5.1)
+#
 # Since: 2.4
 ##
 { 'enum': 'MigrationParameter',
@@ -636,7 +639,7 @@
'multifd-channels',
'xbzrle-cache-size', 'max-postcopy-bandwidth',
'max-cpu-throttle', 'multifd-compression',
-   'multifd-zlib-level' ,'multifd-zstd-level' ] }
+   'multifd-zlib-level' ,'multifd-zstd-level', 'yank' ] }
 
 ##
 # @MigrateSetParameters:
@@ -747,6 +750,9 @@
 #  will consume more CPU.
 #  Defaults to 1. (Since 5.0)
 #
+# @yank: Shutdown the migration socket when the 'yank' qmp command is
+#executed. (Since: 5.1)
+#
 # Since: 2.4
 ##
 # TODO either fuse back into MigrationParameters, or make
@@ -776,7 +782,8 @@
 '*max-cpu-throttle': 'int',
 '*multifd-compression': 'MultiFDCompression',
 '*multifd-zlib-level': 'int',
-'*multifd-zstd-level': 'int' } }
+'*multifd-zstd-level': 'int',
+'*yank': 'bool'} }
 
 ##
 # @migrate-set-parameters:
@@ -907,6 +914,9 @@
 #  will consume more CPU.
 #  Defaults to 1. (Since 5.0)
 #
+# @yank: Shutdown the migration socket when the 'yank' qmp command is
+#executed. (Since: 5.1)
+#
 # Since: 2.4
 ##
 { 'struct': 'MigrationParameters',
@@ -934,7 +944,8 @@
 '*max-cpu-throttle': 'uint8',
  

[PATCH 2/5] io/channel.c,io/channel-socket.c: Add yank feature

2020-05-11 Thread Lukas Straub
Add qio_channel_set_yank function to channel and to channel-socket,
which will register a yank function. The yank function calls
shutdown() on the socket.

Signed-off-by: Lukas Straub 
---
 Makefile.objs   |  1 +
 include/io/channel-socket.h |  1 +
 include/io/channel.h| 12 
 io/channel-socket.c | 29 +
 io/channel.c|  9 +
 5 files changed, 52 insertions(+)

diff --git a/Makefile.objs b/Makefile.objs
index a7c967633a..889115775c 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -24,6 +24,7 @@ block-obj-m = block/
 crypto-obj-y = crypto/
 
 io-obj-y = io/
+io-obj-y += yank.o
 
 endif # CONFIG_SOFTMMU or CONFIG_TOOLS
 
diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
index 777ff5954e..0fa7a364f3 100644
--- a/include/io/channel-socket.h
+++ b/include/io/channel-socket.h
@@ -48,6 +48,7 @@ struct QIOChannelSocket {
 socklen_t localAddrLen;
 struct sockaddr_storage remoteAddr;
 socklen_t remoteAddrLen;
+bool yank;
 };
 
 
diff --git a/include/io/channel.h b/include/io/channel.h
index d4557f0930..782b618694 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -132,6 +132,8 @@ struct QIOChannelClass {
 bool enabled);
 void (*io_set_delay)(QIOChannel *ioc,
  bool enabled);
+void (*io_set_yank)(QIOChannel *ioc,
+bool enabled);
 off_t (*io_seek)(QIOChannel *ioc,
  off_t offset,
  int whence,
@@ -550,6 +552,16 @@ int qio_channel_shutdown(QIOChannel *ioc,
 void qio_channel_set_delay(QIOChannel *ioc,
bool enabled);
 
+/**
+ * qio_channel_set_yank:
+ * @ioc: the channel object
+ * @enabled: the new flag state
+ *
+ * Controls wether this channel participates in yanking.
+ */
+void qio_channel_set_yank(QIOChannel *ioc,
+  bool enabled);
+
 /**
  * qio_channel_set_cork:
  * @ioc: the channel object
diff --git a/io/channel-socket.c b/io/channel-socket.c
index b74f5b92a0..be03946d29 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -26,6 +26,7 @@
 #include "io/channel-watch.h"
 #include "trace.h"
 #include "qapi/clone-visitor.h"
+#include "yank.h"
 
 #define SOCKET_MAX_FDS 16
 
@@ -55,6 +56,7 @@ qio_channel_socket_new(void)
 
 sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
 sioc->fd = -1;
+sioc->yank = 0;
 
 ioc = QIO_CHANNEL(sioc);
 qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
@@ -395,10 +397,19 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
 return NULL;
 }
 
+static void qio_channel_socket_yank(void *opaque)
+{
+QIOChannel *ioc = opaque;
+QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+
+shutdown(sioc->fd, SHUT_RDWR);
+}
+
 static void qio_channel_socket_init(Object *obj)
 {
 QIOChannelSocket *ioc = QIO_CHANNEL_SOCKET(obj);
 ioc->fd = -1;
+ioc->yank = 0;
 }
 
 static void qio_channel_socket_finalize(Object *obj)
@@ -422,6 +433,9 @@ static void qio_channel_socket_finalize(Object *obj)
 closesocket(ioc->fd);
 ioc->fd = -1;
 }
+if (ioc->yank) {
+yank_unregister_function(qio_channel_socket_yank, ioc);
+}
 }
 
 
@@ -686,6 +700,20 @@ qio_channel_socket_set_delay(QIOChannel *ioc,
 , sizeof(v));
 }
 
+static void
+qio_channel_socket_set_yank(QIOChannel *ioc,
+bool enabled)
+{
+QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+
+if (sioc->yank) {
+yank_unregister_function(qio_channel_socket_yank, ioc);
+}
+sioc->yank = enabled;
+if (sioc->yank) {
+yank_register_function(qio_channel_socket_yank, ioc);
+}
+}
 
 static void
 qio_channel_socket_set_cork(QIOChannel *ioc,
@@ -784,6 +812,7 @@ static void qio_channel_socket_class_init(ObjectClass 
*klass,
 ioc_klass->io_shutdown = qio_channel_socket_shutdown;
 ioc_klass->io_set_cork = qio_channel_socket_set_cork;
 ioc_klass->io_set_delay = qio_channel_socket_set_delay;
+ioc_klass->io_set_yank = qio_channel_socket_set_yank;
 ioc_klass->io_create_watch = qio_channel_socket_create_watch;
 ioc_klass->io_set_aio_fd_handler = qio_channel_socket_set_aio_fd_handler;
 }
diff --git a/io/channel.c b/io/channel.c
index e4376eb0bc..0c4095e0e0 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -373,6 +373,15 @@ void qio_channel_set_delay(QIOChannel *ioc,
 }
 }
 
+void qio_channel_set_yank(QIOChannel *ioc,
+  bool enabled)
+{
+QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+if (klass->io_set_yank) {
+klass->io_set_yank(ioc, enabled);
+}
+}
 
 void qio_channel_set_cork(QIOChannel *ioc,
   bool enabled)
-- 
2.20.1



pgpsnu9amdfGI.pgp
Description: OpenPGP digital signature


[PATCH 3/5] block/nbd.c: Add yank feature

2020-05-11 Thread Lukas Straub
Add yank option, pass it to the socket-channel and register a yank
function which sets s->state = NBD_CLIENT_QUIT. This is the same
behaviour as if an error occured.

Signed-off-by: Lukas Straub 
---
 Makefile.objs|  1 +
 block/nbd.c  | 68 +---
 qapi/block-core.json |  5 +++-
 3 files changed, 56 insertions(+), 18 deletions(-)

diff --git a/Makefile.objs b/Makefile.objs
index 889115775c..4b213b3e78 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -18,6 +18,7 @@ block-obj-y += block.o blockjob.o job.o
 block-obj-y += block/ scsi/
 block-obj-y += qemu-io-cmds.o
 block-obj-$(CONFIG_REPLICATION) += replication.o
+block-obj-y += yank.o
 
 block-obj-m = block/
 
diff --git a/block/nbd.c b/block/nbd.c
index 2160859f64..3c0fd3abb8 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -35,6 +35,7 @@
 #include "qemu/option.h"
 #include "qemu/cutils.h"
 #include "qemu/main-loop.h"
+#include "qemu/atomic.h"
 
 #include "qapi/qapi-visit-sockets.h"
 #include "qapi/qmp/qstring.h"
@@ -43,6 +44,8 @@
 #include "block/nbd.h"
 #include "block/block_int.h"
 
+#include "yank.h"
+
 #define EN_OPTSTR ":exportname="
 #define MAX_NBD_REQUESTS16
 
@@ -91,6 +94,7 @@ typedef struct BDRVNBDState {
 QCryptoTLSCreds *tlscreds;
 const char *hostname;
 char *x_dirty_bitmap;
+bool yank;
 } BDRVNBDState;
 
 static int nbd_client_connect(BlockDriverState *bs, Error **errp);
@@ -111,12 +115,12 @@ static void nbd_clear_bdrvstate(BDRVNBDState *s)
 static void nbd_channel_error(BDRVNBDState *s, int ret)
 {
 if (ret == -EIO) {
-if (s->state == NBD_CLIENT_CONNECTED) {
+if (atomic_read(>state) == NBD_CLIENT_CONNECTED) {
 s->state = s->reconnect_delay ? NBD_CLIENT_CONNECTING_WAIT :
 NBD_CLIENT_CONNECTING_NOWAIT;
 }
 } else {
-if (s->state == NBD_CLIENT_CONNECTED) {
+if (atomic_read(>state) == NBD_CLIENT_CONNECTED) {
 qio_channel_shutdown(s->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
 }
 s->state = NBD_CLIENT_QUIT;
@@ -167,7 +171,7 @@ static void nbd_client_attach_aio_context(BlockDriverState 
*bs,
  * s->connection_co is either yielded from nbd_receive_reply or from
  * nbd_co_reconnect_loop()
  */
-if (s->state == NBD_CLIENT_CONNECTED) {
+if (atomic_read(>state) == NBD_CLIENT_CONNECTED) {
 qio_channel_attach_aio_context(QIO_CHANNEL(s->ioc), new_context);
 }
 
@@ -206,7 +210,7 @@ static void nbd_teardown_connection(BlockDriverState *bs)
 {
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
-if (s->state == NBD_CLIENT_CONNECTED) {
+if (atomic_read(>state) == NBD_CLIENT_CONNECTED) {
 /* finish any pending coroutines */
 assert(s->ioc);
 qio_channel_shutdown(s->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
@@ -230,13 +234,14 @@ static void nbd_teardown_connection(BlockDriverState *bs)
 
 static bool nbd_client_connecting(BDRVNBDState *s)
 {
-return s->state == NBD_CLIENT_CONNECTING_WAIT ||
-s->state == NBD_CLIENT_CONNECTING_NOWAIT;
+NBDClientState state = atomic_read(>state);
+return state == NBD_CLIENT_CONNECTING_WAIT ||
+state == NBD_CLIENT_CONNECTING_NOWAIT;
 }
 
 static bool nbd_client_connecting_wait(BDRVNBDState *s)
 {
-return s->state == NBD_CLIENT_CONNECTING_WAIT;
+return atomic_read(>state) == NBD_CLIENT_CONNECTING_WAIT;
 }
 
 static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s)
@@ -305,7 +310,7 @@ static coroutine_fn void nbd_co_reconnect_loop(BDRVNBDState 
*s)
 nbd_reconnect_attempt(s);
 
 while (nbd_client_connecting(s)) {
-if (s->state == NBD_CLIENT_CONNECTING_WAIT &&
+if (atomic_read(>state) == NBD_CLIENT_CONNECTING_WAIT &&
 qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start_time_ns > delay_ns)
 {
 s->state = NBD_CLIENT_CONNECTING_NOWAIT;
@@ -341,7 +346,7 @@ static coroutine_fn void nbd_connection_entry(void *opaque)
 int ret = 0;
 Error *local_err = NULL;
 
-while (s->state != NBD_CLIENT_QUIT) {
+while (atomic_read(>state) != NBD_CLIENT_QUIT) {
 /*
  * The NBD client can only really be considered idle when it has
  * yielded from qio_channel_readv_all_eof(), waiting for data. This is
@@ -356,7 +361,7 @@ static coroutine_fn void nbd_connection_entry(void *opaque)
 nbd_co_reconnect_loop(s);
 }
 
-if (s->state != NBD_CLIENT_CONNECTED) {
+if (atomic_read(>state) != NBD_CLIENT_CONNECTED) {
 continue;
 }
 
@@ -435,7 +440,7 @@ static int nbd_co_send_request(BlockDriverState *bs,
 qemu_co_queue_wait(>free_sema, >send_mutex);
 }
 
-if (s->state != NBD_CLIENT_CONNECTED) {
+if (atomic_read(>state) != NBD_CLIENT_CONNECTED) {
 rc = -EIO;
 goto err;
 }
@@ -462,7 +467,7 @@ static int nbd_co_send_request(BlockDriverState *bs,
 if (qiov) {
 

[PATCH 1/5] Introduce yank feature

2020-05-11 Thread Lukas Straub
The yank feature allows to recover from hanging qemu by "yanking"
at various parts. Other qemu systems can register yank functions
which will be called by the 'yank' out-of-band qmp command.

Signed-off-by: Lukas Straub 
---
 qapi/misc.json | 15 ++
 softmmu/vl.c   |  2 ++
 yank.c | 75 ++
 yank.h | 12 
 4 files changed, 104 insertions(+)
 create mode 100644 yank.c
 create mode 100644 yank.h

diff --git a/qapi/misc.json b/qapi/misc.json
index 99b90ac80b..de1ee494ae 100644
--- a/qapi/misc.json
+++ b/qapi/misc.json
@@ -1550,3 +1550,18 @@
 ##
 { 'command': 'query-vm-generation-id', 'returns': 'GuidInfo' }
 
+##
+# @yank:
+#
+# Recover from hanging qemu by calling yank functions.
+#
+# Returns: nothing.
+#
+# Example:
+#
+# -> { "execute": "yank" }
+# <- { "return": {} }
+#
+# Since: 5.1
+##
+{ 'command': 'yank', 'allow-oob': true }
diff --git a/softmmu/vl.c b/softmmu/vl.c
index 32c0047889..5d99749d29 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -112,6 +112,7 @@
 #include "qapi/qmp/qerror.h"
 #include "sysemu/iothread.h"
 #include "qemu/guest-random.h"
+#include "yank.h"
 
 #define MAX_VIRTIO_CONSOLES 1
 
@@ -2906,6 +2907,7 @@ void qemu_init(int argc, char **argv, char **envp)
 precopy_infrastructure_init();
 postcopy_infrastructure_init();
 monitor_init_globals();
+yank_init();
 
 if (qcrypto_init() < 0) {
 error_reportf_err(err, "cannot initialize crypto: ");
diff --git a/yank.c b/yank.c
new file mode 100644
index 00..cefbfd8ab5
--- /dev/null
+++ b/yank.c
@@ -0,0 +1,75 @@
+/*
+ * QEMU yank feature
+ *
+ * Copyright (c) Lukas Straub 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu/thread.h"
+#include "qemu/queue.h"
+#include "yank.h"
+
+struct YankFuncAndParam {
+YankFn *func;
+void *opaque;
+QLIST_ENTRY(YankFuncAndParam) next;
+};
+
+static QemuMutex lock;
+static QLIST_HEAD(qlisthead, YankFuncAndParam) head
+= QLIST_HEAD_INITIALIZER(head);
+
+void yank_register_function(YankFn *func, void *opaque)
+{
+struct YankFuncAndParam *tmp = g_malloc(sizeof(struct YankFuncAndParam));
+tmp->func = func;
+tmp->opaque = opaque;
+
+qemu_mutex_lock();
+QLIST_INSERT_HEAD(, tmp, next);
+qemu_mutex_unlock();
+}
+
+void yank_unregister_function(YankFn *func, void *opaque)
+{
+qemu_mutex_lock();
+
+struct YankFuncAndParam *tmp;
+QLIST_FOREACH(tmp, , next) {
+if (tmp->func == func && tmp->opaque == opaque) {
+QLIST_REMOVE(tmp, next);
+g_free(tmp);
+qemu_mutex_unlock();
+return;
+}
+}
+
+abort();
+}
+
+void yank_call_functions(void)
+{
+qemu_mutex_lock();
+
+struct YankFuncAndParam *tmp;
+QLIST_FOREACH(tmp, , next) {
+tmp->func(tmp->opaque);
+}
+
+qemu_mutex_unlock();
+}
+
+void qmp_yank(Error **errp)
+{
+yank_call_functions();
+}
+
+void yank_init(void)
+{
+qemu_mutex_init();
+QLIST_INIT();
+}
diff --git a/yank.h b/yank.h
new file mode 100644
index 00..7376224219
--- /dev/null
+++ b/yank.h
@@ -0,0 +1,12 @@
+
+#ifndef YANK_H
+#define YANK_H
+
+typedef void (YankFn) (void *opaque);
+
+void yank_register_function(YankFn *func, void *opaque);
+void yank_unregister_function(YankFn *func, void *opaque);
+void yank_call_functions(void);
+void yank_init(void);
+void qmp_yank(Error **errp);
+#endif
-- 
2.20.1



pgpfGVrPSoKBb.pgp
Description: OpenPGP digital signature


Re: [PATCH v3 6/9] qemu-img: Add bitmap sub-command

2020-05-11 Thread Max Reitz
On 08.05.20 20:03, Eric Blake wrote:
> Include actions for --add, --remove, --clear, --enable, --disable, and
> --merge (note that --clear is a bit of fluff, because the same can be
> accomplished by removing a bitmap and then adding a new one in its
> place, but it matches what QMP commands exist).  Listing is omitted,
> because it does not require a bitmap name and because it was already
> possible with 'qemu-img info'.  A single command line can play one or
> more bitmap commands in sequence on the same bitmap name (although all
> added bitmaps share the same granularity, and and all merged bitmaps
> come from the same source file).  Merge defaults to other bitmaps in
> the primary image, but can also be told to merge bitmaps from a
> distinct image.

For the record: Yes, my comment was mostly about my confusion around the
{}.  So just replacing them by () would have pacified me.

But this is more fun, of course.

> While this supports --image-opts for the file being modified, I did
> not think it worth the extra complexity to support that for the source
> file in a cross-file merges.  Likewise, I chose to have --merge only
> take a single source rather than following the QMP support for
> multiple merges in one go (although you can still use more than one
> --merge in the command line); in part because qemu-img is offline and
> therefore atomicity is not an issue.
> 
> Upcoming patches will add iotest coverage of these commands while
> also testing other features.
> 
> Signed-off-by: Eric Blake 
> ---
>  docs/tools/qemu-img.rst |  23 
>  qemu-img.c  | 254 
>  qemu-img-cmds.hx|   7 ++
>  3 files changed, 284 insertions(+)
> 
> diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
> index 7d08c48d308f..68393c357386 100644
> --- a/docs/tools/qemu-img.rst
> +++ b/docs/tools/qemu-img.rst
> @@ -281,6 +281,29 @@ Command description:

[...]

> +  Additional options ``-g`` set a non-default *GRANULARITY* for

sets?

With that fixed (or maybe not, you know that better than me):

Reviewed-by: Max Reitz 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v3 5/9] blockdev: Split off basic bitmap operations for qemu-img

2020-05-11 Thread Max Reitz
On 08.05.20 20:03, Eric Blake wrote:
> Upcoming patches want to add some basic bitmap manipulation abilities
> to qemu-img.  But blockdev.o is too heavyweight to link into qemu-img
> (among other things, it would drag in block jobs and transaction
> support - qemu-img does offline manipulation, where atomicity is less
> important because there are no concurrent modifications to compete
> with), so it's time to split off the bare bones of what we will need
> into a new file block/monitor/bitmap-qmp-cmds.o.
> 
> This is sufficient to expose 6 QMP commands for use by qemu-img (add,
> remove, clear, enable, disable, merge), as well as move the three
> helper functions touched in the previous patch.  Regarding
> MAINTAINERS, the new file is automatically part of block core, but
> also makes sense as related to other dirty bitmap files.
> 
> Signed-off-by: Eric Blake 
> ---
>  Makefile.objs   |   3 +-
>  block/monitor/bitmap-qmp-cmds.c | 323 
>  blockdev.c  | 284 
>  MAINTAINERS |   1 +
>  block/monitor/Makefile.objs |   1 +
>  5 files changed, 326 insertions(+), 286 deletions(-)
>  create mode 100644 block/monitor/bitmap-qmp-cmds.c
> 
> diff --git a/Makefile.objs b/Makefile.objs
> index a7c967633acf..99774cfd2545 100644
> --- a/Makefile.objs
> +++ b/Makefile.objs
> @@ -13,9 +13,8 @@ chardev-obj-y = chardev/
> 
>  authz-obj-y = authz/
> 
> -block-obj-y = nbd/
> +block-obj-y = block/ block/monitor/ nbd/ scsi/

This reads weird because it’s precisely the monitor that we don’t want
in qemu-img.

But I suppose block/monitor is the natural place for block functions
that are to be used by the monitor (and maybe other parties like
qemu-img).  And the monitor itself would never be placed under block/.
So I suppose it does make sense and I have no better suggestion.

Reviewed-by: Max Reitz 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 1/2] When del child, next_child_index need to subtract 1. Otherwise, the index will get bigger and bigger.

2020-05-11 Thread Alberto Garcia
On Mon 11 May 2020 12:00:30 PM CEST, quwei...@huayun.com wrote:
> diff --git a/block/quorum.c b/block/quorum.c
> index 6d7a56b..a8272fe 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -1096,6 +1096,7 @@ static void quorum_del_child(BlockDriverState *bs, 
> BdrvChild *child,
>  memmove(>children[i], >children[i + 1],
>  (s->num_children - i - 1) * sizeof(BdrvChild *));
>  s->children = g_renew(BdrvChild *, s->children, --s->num_children);
> +s->next_child_index--;
>  bdrv_unref_child(bs, child);

This is not correct, quorum_del_child() allows you to remove any child
from the Quorum device, so nothing guarantees you that
next_child_index-1 is free.

Berto



Re: [PATCH 0/2] hw/block/nvme: fixes for interrupt behavior

2020-05-11 Thread Klaus Jensen
On May 11 09:09, Philippe Mathieu-Daudé wrote:
> + Michael & Marcel for MSIX,
> and ping to Keith :)
> 

I'll await some feedback here before moving these fixes into my "nvme:
refactoring and cleanups" series. Not sure if this should go directly to
master or block-next.



Re: [PATCH v2 4/4] exec: Rename qemu_ram_writeback() as qemu_ram_msync()

2020-05-11 Thread Stefan Hajnoczi
On Mon, May 11, 2020 at 10:17:19AM +0200, Philippe Mathieu-Daudé wrote:
> Rename qemu_ram_writeback() as qemu_ram_msync() to better
> match what it does.

Based on Juan's comment in the other email thread I think this commit
description could be expanded. Let's explain the rationale for this
change:

  qemu_ram_writeback() and memory_region_do_writeback() have similar names
  but perform different functions. qemu_ram_writeback() is concerned with
  syncing changes to the underlying memory device (NVDIMM or file-backed
  RAM). memory_region_do_writeback() is concerned with dirty memory
  logging.

  Rename qemu_ram_writeback() to prevent confusion with
  memory_region_do_writeback().

> 
> Suggested-by: Stefan Hajnoczi 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  include/exec/ram_addr.h | 4 ++--
>  exec.c  | 2 +-
>  memory.c| 3 +--
>  3 files changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
> index a94809f89b..84817e19fa 100644
> --- a/include/exec/ram_addr.h
> +++ b/include/exec/ram_addr.h
> @@ -136,12 +136,12 @@ void qemu_ram_free(RAMBlock *block);
>  
>  int qemu_ram_resize(RAMBlock *block, ram_addr_t newsize, Error **errp);
>  
> -void qemu_ram_writeback(RAMBlock *block, ram_addr_t start, ram_addr_t 
> length);
> +void qemu_ram_msync(RAMBlock *block, ram_addr_t start, ram_addr_t length);
>  
>  /* Clear whole block of mem */
>  static inline void qemu_ram_block_writeback(RAMBlock *block)
>  {
> -qemu_ram_writeback(block, 0, block->used_length);
> +qemu_ram_msync(block, 0, block->used_length);
>  }
>  
>  #define DIRTY_CLIENTS_ALL ((1 << DIRTY_MEMORY_NUM) - 1)
> diff --git a/exec.c b/exec.c
> index 2874bb5088..f5a8cdf370 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -2127,7 +2127,7 @@ int qemu_ram_resize(RAMBlock *block, ram_addr_t 
> newsize, Error **errp)
>   * Otherwise no-op.
>   * @Note: this is supposed to be a synchronous op.
>   */
> -void qemu_ram_writeback(RAMBlock *block, ram_addr_t start, ram_addr_t length)
> +void qemu_ram_msync(RAMBlock *block, ram_addr_t start, ram_addr_t length)
>  {
>  /* The requested range should fit in within the block range */
>  g_assert((start + length) <= block->used_length);
> diff --git a/memory.c b/memory.c
> index 73534b26f4..5bfa1429df 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -2197,7 +2197,6 @@ void memory_region_ram_resize(MemoryRegion *mr, 
> ram_addr_t newsize, Error **errp
>  qemu_ram_resize(mr->ram_block, newsize, errp);
>  }
>  
> -
>  void memory_region_sync(MemoryRegion *mr, hwaddr addr, hwaddr size)
>  {
>  /*
> @@ -2205,7 +2204,7 @@ void memory_region_sync(MemoryRegion *mr, hwaddr addr, 
> hwaddr size)
>   * different types of memory regions
>   */
>  if (mr->ram_block) {
> -qemu_ram_writeback(mr->ram_block, addr, size);
> +qemu_ram_msync(mr->ram_block, addr, size);
>  }
>  }
>  
> -- 
> 2.21.3
> 


signature.asc
Description: PGP signature


Re: [PATCH v3 4/9] blockdev: Promote several bitmap functions to non-static

2020-05-11 Thread Max Reitz
On 08.05.20 20:03, Eric Blake wrote:
> The next patch will split blockdev.c, which will require accessing
> some previously-static functions from more than one .c file.  But part
> of promoting a function to public is picking a naming scheme that does
> not reek of exposing too many internals (two of the three functions
> were named starting with 'do_').  To make future code motion easier,
> perform the function rename and non-static promotion into its own
> patch.
> 
> Signed-off-by: Eric Blake 
> ---
>  include/block/block_int.h | 12 +++
>  blockdev.c| 45 ---
>  2 files changed, 30 insertions(+), 27 deletions(-)

Reviewed-by: Max Reitz 

> diff --git a/blockdev.c b/blockdev.c
> index b3c840ec0312..fbeb38437869 100644
> --- a/blockdev.c
> +++ b/blockdev.c

[...]

> @@ -2504,9 +2495,10 @@ out:
>  aio_context_release(aio_context);
>  }
> 
> -static BdrvDirtyBitmap *do_block_dirty_bitmap_remove(
> -const char *node, const char *name, bool release,
> -BlockDriverState **bitmap_bs, Error **errp)
> +BdrvDirtyBitmap *block_dirty_bitmap_remove(const char *node, const char 
> *name,
> +   bool release,
> +   BlockDriverState **bitmap_bs,

I’m not entirely sure what this parameter is even used for, but it
obviously doesn’t concern this patch.

Max

> +   Error **errp)
>  {
>  BlockDriverState *bs;
>  BdrvDirtyBitmap *bitmap;



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2 5/5] vhost: add device started check in migration set log

2020-05-11 Thread Dima Stepanov
On Sun, May 10, 2020 at 08:03:39PM -0400, Raphael Norwitz wrote:
> On Thu, May 7, 2020 at 11:35 AM Dima Stepanov  wrote:
> >
> > What do you think?
> >
> 
> Apologies - I tripped over the if (dev->started && r < 0) check.
> Never-mind my point with race conditions and failing migrations.
> 
> Rather than modifying vhost_dev_set_log(), it may be clearer to put a
> check after vhost_dev_log_resize()? Something like:
> 
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -829,11 +829,22 @@ static int vhost_migration_log(MemoryListener
> *listener, int enable)
>  vhost_log_put(dev, false);
>  } else {
>  vhost_dev_log_resize(dev, vhost_get_log_size(dev));
> +/*
> + * A device can be stopped because of backend disconnect inside
> + * vhost_dev_log_resize(). In this case we should mark logging
> + * enabled and return without attempting to set the backend
> + * logging state.
> + */
> +if (!dev->started) {
> +goto out_success;
> +}
>  r = vhost_dev_set_log(dev, true);
>  if (r < 0) {
>  return r;
>  }
>  }
> +
> +out_success:
>  dev->log_enabled = enable;
>  return 0;
>  }
This patch will not fix all the issues. Consider the case than you will
hit disconnect inside vhost_dev_set_log. For instance for the 3rd
virtqueue, for the following call:
  vhost_virtqueue_set_addr(...)
Maybe i didn't explain very clearly the problem. The problem i've tried
to fix is only for the vhost-user-blk devices. This issue can be hit
during VHOST_USER commands "handshake". If we hit disconnect on any step
of this "handshake" then we will try to make clean up twice:
1. First during disconnect cleanup (unix socket backend).
2. Second as roll back for initialization.
If this is the case, then we shouldn't call p2, as everything was clean
up on p1. And the complicated thing is that there are several VHOST_USER
commands and we should consider the state after each. And even more,
initialization could fail because of some other reason and we hit
disconnect inside roll back clean up, in this case we should complete
clean up in the disconnect function and stop rolling back.

Hope it helps ).

> 
> This seems harmless enough to me, and I see how it fixes your
> particular crash, but I would still prefer we worked towards a more
> robust solution. In particular I think we could handle this inside
> vhost-user-blk if we let the device state persist between connections
> (i.e. call vhost_dev_cleanup() inside vhost_user_blk_connect() before
> vhost_dev_init() on reconnect). This should also fix some of the
> crashes Li Feng has hit, and probably others which haven’t been
> reported yet. What do you think?
Yes, this looks like a good direction. Because all my patches are only
workarounds and i believe there can be some other issues which haven't
been reported or will be introduced ).
I still think that these patches are good to submit and to think about
more complicated/refactoring solution as the next step.

> 
> If that’s unworkable I guess we will need to add these vhost level
> checks.
At least for now, i don't think its unworkable, i just think that it
will take some time to figure out how to refactor it properly. But the
SIGSEGV issue is real.

> In that case I would still prefer we add a “disconnected” flag
> in struct vhost_dev struct, and make sure it isn’t cleared by
> vhost_dev_cleanup(). That way we don’t conflate stopping a device with
> backend disconnect at the vhost level and potentially regress behavior
> for other device types.
It is also possible, but should be analyzed and properly tested. So as i
said it will take some time to figure out how to refactor it properly.



Re: [PATCH v2 5/5] vhost: add device started check in migration set log

2020-05-11 Thread Dima Stepanov
On Mon, May 11, 2020 at 11:15:53AM +0800, Jason Wang wrote:
> 
> On 2020/4/30 下午9:36, Dima Stepanov wrote:
> >If vhost-user daemon is used as a backend for the vhost device, then we
> >should consider a possibility of disconnect at any moment. If such
> >disconnect happened in the vhost_migration_log() routine the vhost
> >device structure will be clean up.
> >At the start of the vhost_migration_log() function there is a check:
> >   if (!dev->started) {
> >   dev->log_enabled = enable;
> >   return 0;
> >   }
> >To be consistent with this check add the same check after calling the
> >vhost_dev_set_log() routine. This in general help not to break a
> >migration due the assert() message. But it looks like that this code
> >should be revised to handle these errors more carefully.
> >
> >In case of vhost-user device backend the fail paths should consider the
> >state of the device. In this case we should skip some function calls
> >during rollback on the error paths, so not to get the NULL dereference
> >errors.
> >
> >Signed-off-by: Dima Stepanov 
> >---
> >  hw/virtio/vhost.c | 39 +++
> >  1 file changed, 35 insertions(+), 4 deletions(-)
> >
> >diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >index 3ee50c4..d5ab96d 100644
> >--- a/hw/virtio/vhost.c
> >+++ b/hw/virtio/vhost.c
> >@@ -787,6 +787,17 @@ static int vhost_dev_set_features(struct vhost_dev *dev,
> >  static int vhost_dev_set_log(struct vhost_dev *dev, bool enable_log)
> >  {
> >  int r, i, idx;
> >+
> >+if (!dev->started) {
> >+/*
> >+ * If vhost-user daemon is used as a backend for the
> >+ * device and the connection is broken, then the vhost_dev
> >+ * structure will be reset all its values to 0.
> >+ * Add additional check for the device state.
> >+ */
> >+return -1;
> >+}
> >+
> >  r = vhost_dev_set_features(dev, enable_log);
> >  if (r < 0) {
> >  goto err_features;
> >@@ -801,12 +812,19 @@ static int vhost_dev_set_log(struct vhost_dev *dev, 
> >bool enable_log)
> >  }
> >  return 0;
> >  err_vq:
> >-for (; i >= 0; --i) {
> >+/*
> >+ * Disconnect with the vhost-user daemon can lead to the
> >+ * vhost_dev_cleanup() call which will clean up vhost_dev
> >+ * structure.
> >+ */
> >+for (; dev->started && (i >= 0); --i) {
> >  idx = dev->vhost_ops->vhost_get_vq_index(
> 
> 
> Why need the check of dev->started here, can started be modified outside
> mainloop? If yes, I don't get the check of !dev->started in the beginning of
> this function.
> 
No dev->started can't change outside the mainloop. The main problem is
only for the vhost_user_blk daemon. Consider the case when we
successfully pass the dev->started check at the beginning of the
function, but after it we hit the disconnect on the next call on the
second or third iteration:
 r = vhost_virtqueue_set_addr(dev, dev->vqs + i, idx, enable_log);
The unix socket backend device will call the disconnect routine for this
device and reset the structure. So the structure will be reset (and
dev->started set to false) inside this set_addr() call. So
we shouldn't call the clean up calls because this virtqueues were clean
up in the disconnect call. But we should protect these calls somehow, so
it will not hit SIGSEGV and we will be able to pass migration.

Just to summarize it:
For the vhost-user-blk devices we ca hit clean up calls twice in case of
vhost disconnect:
1. The first time during the disconnect process. The clean up is called
inside it.
2. The second time during roll back clean up.
So if it is the case we should skip p2.

> 
> >dev, dev->vq_index + i);
> >  vhost_virtqueue_set_addr(dev, dev->vqs + i, idx,
> >   dev->log_enabled);
> >  }
> >-vhost_dev_set_features(dev, dev->log_enabled);
> >+if (dev->started) {
> >+vhost_dev_set_features(dev, dev->log_enabled);
> >+}
> >  err_features:
> >  return r;
> >  }
> >@@ -832,7 +850,15 @@ static int vhost_migration_log(MemoryListener 
> >*listener, int enable)
> >  } else {
> >  vhost_dev_log_resize(dev, vhost_get_log_size(dev));
> >  r = vhost_dev_set_log(dev, true);
> >-if (r < 0) {
> >+/*
> >+ * The dev log resize can fail, because of disconnect
> >+ * with the vhost-user-blk daemon. Check the device
> >+ * state before calling the vhost_dev_set_log()
> >+ * function.
> >+ * Don't return error if device isn't started to be
> >+ * consistent with the check above.
> >+ */
> >+if (dev->started && r < 0) {
> >  return r;
> >  }
> >  }
> >@@ -1739,7 +1765,12 @@ int vhost_dev_start(struct vhost_dev *hdev, 
> >VirtIODevice *vdev)
> >  fail_log:
> >  vhost_log_put(hdev, false);
> >  fail_vq:
> >-while (--i >= 0) {
> >+/*
> >+ * Disconnect with the vhost-user daemon can 

Re: [PATCH v3 3/9] block: Make it easier to learn which BDS support bitmaps

2020-05-11 Thread Max Reitz
On 08.05.20 20:03, Eric Blake wrote:
> Upcoming patches will enhance bitmap support in qemu-img, but in doing
> so, it turns out to be nice to suppress output when bitmaps make no
> sense (such as on a qcow2 v2 image).  Add a hook to make this easier
> to query.
> 
> In the future, when we improve the ability to look up bitmaps through
> a filter, we will probably also want to teach the block layer to
> automatically let filters pass this request on through.
> 
> Signed-off-by: Eric Blake 
> ---
>  block/qcow2.h| 1 +
>  include/block/block_int.h| 1 +
>  include/block/dirty-bitmap.h | 1 +
>  block/dirty-bitmap.c | 9 +
>  block/qcow2-bitmap.c | 7 +++
>  block/qcow2.c| 1 +
>  6 files changed, 20 insertions(+)
> 
> diff --git a/block/qcow2.h b/block/qcow2.h
> index f4de0a27d5c3..fb2b2b5a7b4d 100644
> --- a/block/qcow2.h
> +++ b/block/qcow2.h
> @@ -764,6 +764,7 @@ bool qcow2_co_can_store_new_dirty_bitmap(BlockDriverState 
> *bs,
>  int qcow2_co_remove_persistent_dirty_bitmap(BlockDriverState *bs,
>  const char *name,
>  Error **errp);
> +bool qcow2_dirty_bitmap_supported(BlockDriverState *bs);
> 
>  ssize_t coroutine_fn
>  qcow2_co_compress(BlockDriverState *bs, void *dest, size_t dest_size,
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index df6d0273d679..cb1082da4c43 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -560,6 +560,7 @@ struct BlockDriver {
>   uint64_t parent_perm, uint64_t parent_shared,
>   uint64_t *nperm, uint64_t *nshared);
> 
> +bool (*bdrv_dirty_bitmap_supported)(BlockDriverState *bs);

All BDSs support bitmaps, but only some support persistent dirty
bitmaps, so I think the name should reflect that.

Conceptually, this looks reasonable.  This information might indeed be
nice to have, and I’m not sure whether we should extend any existing
interface to return it.

(The interfaces that come to my mind are:
(1) bdrv_can_store_new_dirty_bitmap() below, which we could make accept
a NULL @name to return basically the same information.  But it’s still a
bit different, because I’d expect that function to return whether any
bitmap can be stored then, not whether the node supports bitmaps at all.
 So e.g. if there are already too many bitmaps, it should return false,
even though the node itself does support bitmaps.

(2) bdrv_get_info()/BlockDriverInfo: This information would fit in very
nicely here, but do we have to put it here just because it does?  I
don’t think so.  This patch adds 20 lines of code, that shows that it’s
very simple to add a dedicated method, and it’s certainly a bit easier
to use than to invoke bdrv_get_info() and throw away all the other
information.  Perhaps this patch only shows that BlockDriverInfo doesn’t
make much sense in the first place, and most of its fields should have
been scalar return values from dedicated functions.)

Max



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2 4/5] vhost: check vring address before calling unmap

2020-05-11 Thread Dima Stepanov
On Mon, May 11, 2020 at 11:05:58AM +0800, Jason Wang wrote:
> 
> On 2020/4/30 下午9:36, Dima Stepanov wrote:
> >Since disconnect can happen at any time during initialization not all
> >vring buffers (for instance used vring) can be intialized successfully.
> >If the buffer was not initialized then vhost_memory_unmap call will lead
> >to SIGSEGV. Add checks for the vring address value before calling unmap.
> >Also add assert() in the vhost_memory_unmap() routine.
> >
> >Signed-off-by: Dima Stepanov 
> >---
> >  hw/virtio/vhost.c | 27 +--
> >  1 file changed, 21 insertions(+), 6 deletions(-)
> >
> >diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >index ddbdc53..3ee50c4 100644
> >--- a/hw/virtio/vhost.c
> >+++ b/hw/virtio/vhost.c
> >@@ -314,6 +314,8 @@ static void vhost_memory_unmap(struct vhost_dev *dev, 
> >void *buffer,
> > hwaddr len, int is_write,
> > hwaddr access_len)
> >  {
> >+assert(buffer);
> >+
> >  if (!vhost_dev_has_iommu(dev)) {
> >  cpu_physical_memory_unmap(buffer, len, is_write, access_len);
> >  }
> >@@ -1132,12 +1134,25 @@ static void vhost_virtqueue_stop(struct vhost_dev 
> >*dev,
> >  vhost_vq_index);
> >  }
> >-vhost_memory_unmap(dev, vq->used, virtio_queue_get_used_size(vdev, idx),
> >-   1, virtio_queue_get_used_size(vdev, idx));
> >-vhost_memory_unmap(dev, vq->avail, virtio_queue_get_avail_size(vdev, 
> >idx),
> >-   0, virtio_queue_get_avail_size(vdev, idx));
> >-vhost_memory_unmap(dev, vq->desc, virtio_queue_get_desc_size(vdev, idx),
> >-   0, virtio_queue_get_desc_size(vdev, idx));
> >+/*
> >+ * Since the vhost-user disconnect can happen during initialization
> >+ * check if vring was initialized, before making unmap.
> >+ */
> >+if (vq->used) {
> >+vhost_memory_unmap(dev, vq->used,
> >+   virtio_queue_get_used_size(vdev, idx),
> >+   1, virtio_queue_get_used_size(vdev, idx));
> >+}
> >+if (vq->avail) {
> >+vhost_memory_unmap(dev, vq->avail,
> >+   virtio_queue_get_avail_size(vdev, idx),
> >+   0, virtio_queue_get_avail_size(vdev, idx));
> >+}
> >+if (vq->desc) {
> >+vhost_memory_unmap(dev, vq->desc,
> >+   virtio_queue_get_desc_size(vdev, idx),
> >+   0, virtio_queue_get_desc_size(vdev, idx));
> >+}
> 
> 
> Any reason not checking hdev->started instead? vhost_dev_start() will set it
> to true if virtqueues were correctly mapped.
> 
> Thanks
Well i see it a little bit different:
 - vhost_dev_start() sets hdev->started to true before starting
   virtqueues
 - vhost_virtqueue_start() maps all the memory
If we hit the vhost disconnect at the start of the
vhost_virtqueue_start(), for instance for this call:
  r = dev->vhost_ops->vhost_set_vring_base(dev, );
Then we will call vhost_user_blk_disconnect:
  vhost_user_blk_disconnect()->
vhost_user_blk_stop()->
  vhost_dev_stop()->
vhost_virtqueue_stop()
As a result we will come in this routine with the hdev->started still
set to true, but if used/avail/desc fields still uninitialized and set
to 0.

> 
> 
> >  }
> >  static void vhost_eventfd_add(MemoryListener *listener,
> 



Re: [PATCH v3 2/9] qemu-img: Fix stale comments on doc location

2020-05-11 Thread Max Reitz
On 08.05.20 20:03, Eric Blake wrote:
> Missed in commit e13c59fa.
> 
> Signed-off-by: Eric Blake 
> ---
>  qemu-img.c   | 2 +-
>  qemu-img-cmds.hx | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Max Reitz 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2 2/5] vhost: introduce wrappers to set guest notifiers for virtio device

2020-05-11 Thread Dima Stepanov
On Mon, May 11, 2020 at 11:03:01AM +0800, Jason Wang wrote:
> 
> On 2020/4/30 下午9:36, Dima Stepanov wrote:
> >Introduce new wrappers to set/reset guest notifiers for the virtio
> >device in the vhost device module:
> >   vhost_dev_assign_guest_notifiers
> > ->set_guest_notifiers(..., ..., true);
> >   vhost_dev_drop_guest_notifiers
> > ->set_guest_notifiers(..., ..., false);
> >This is a preliminary step to refactor code,
> 
> 
> Maybe I miss something, I don't see any add-on patch to modify the new
> wrapper in this series?
Hi, in fact the next 3/5 patch:
  "[PATCH v2 3/5] vhost-user-blk: add mechanism to track the guest
notifiers init state"
is about using these wrappers. But disregard it, i decided to follow
Raphael suggestion. So we will fix the vhost-user-blk case first, so i
will not introduce these wrappers. And the code will be more easier to
read and straightforward.
I will send v3 as soon as we decide what to do with the migration fix
in this patchset.

No other comments mixed in below.

> 
> 
> >  so the set_guest_notifiers
> >methods could be called based on the vhost device state.
> >Update all vhost used devices to use these wrappers instead of direct
> >method call.
> >
> >Signed-off-by: Dima Stepanov 
> >---
> >  backends/cryptodev-vhost.c  | 26 +++---
> >  backends/vhost-user.c   | 16 +---
> >  hw/block/vhost-user-blk.c   | 15 +--
> >  hw/net/vhost_net.c  | 30 +-
> >  hw/scsi/vhost-scsi-common.c | 15 +--
> >  hw/virtio/vhost-user-fs.c   | 17 +++--
> >  hw/virtio/vhost-vsock.c | 18 --
> >  hw/virtio/vhost.c   | 38 ++
> >  hw/virtio/virtio.c  | 13 +
> >  include/hw/virtio/vhost.h   |  4 
> >  include/hw/virtio/virtio.h  |  1 +
> >  11 files changed, 118 insertions(+), 75 deletions(-)
> >
> >diff --git a/backends/cryptodev-vhost.c b/backends/cryptodev-vhost.c
> >index 8337c9a..4522195 100644
> >--- a/backends/cryptodev-vhost.c
> >+++ b/backends/cryptodev-vhost.c
> >@@ -169,16 +169,13 @@ vhost_set_vring_enable(CryptoDevBackendClient *cc,
> >  int cryptodev_vhost_start(VirtIODevice *dev, int total_queues)
> >  {
> >  VirtIOCrypto *vcrypto = VIRTIO_CRYPTO(dev);
> >-BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(dev)));
> >-VirtioBusState *vbus = VIRTIO_BUS(qbus);
> >-VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(vbus);
> >  int r, e;
> >  int i;
> >  CryptoDevBackend *b = vcrypto->cryptodev;
> >  CryptoDevBackendVhost *vhost_crypto;
> >  CryptoDevBackendClient *cc;
> >-if (!k->set_guest_notifiers) {
> >+if (!virtio_device_guest_notifiers_initialized(dev)) {
> >  error_report("binding does not support guest notifiers");
> >  return -ENOSYS;
> >  }
> >@@ -198,9 +195,13 @@ int cryptodev_vhost_start(VirtIODevice *dev, int 
> >total_queues)
> >  }
> >   }
> >-r = k->set_guest_notifiers(qbus->parent, total_queues, true);
> >+/*
> >+ * Since all the states are handled by one vhost device,
> >+ * use the first one in array.
> >+ */
> >+vhost_crypto = cryptodev_get_vhost(b->conf.peers.ccs[0], b, 0);
> >+r = vhost_dev_assign_guest_notifiers(_crypto->dev, dev, 
> >total_queues);
> >  if (r < 0) {
> >-error_report("error binding guest notifier: %d", -r);
> >  goto err;
> >  }
> >@@ -232,7 +233,8 @@ err_start:
> >  vhost_crypto = cryptodev_get_vhost(cc, b, i);
> >  cryptodev_vhost_stop_one(vhost_crypto, dev);
> >  }
> >-e = k->set_guest_notifiers(qbus->parent, total_queues, false);
> >+vhost_crypto = cryptodev_get_vhost(b->conf.peers.ccs[0], b, 0);
> >+e = vhost_dev_drop_guest_notifiers(_crypto->dev, dev, 
> >total_queues);
> >  if (e < 0) {
> >  error_report("vhost guest notifier cleanup failed: %d", e);
> >  }
> >@@ -242,9 +244,6 @@ err:
> >  void cryptodev_vhost_stop(VirtIODevice *dev, int total_queues)
> >  {
> >-BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(dev)));
> >-VirtioBusState *vbus = VIRTIO_BUS(qbus);
> >-VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(vbus);
> >  VirtIOCrypto *vcrypto = VIRTIO_CRYPTO(dev);
> >  CryptoDevBackend *b = vcrypto->cryptodev;
> >  CryptoDevBackendVhost *vhost_crypto;
> >@@ -259,7 +258,12 @@ void cryptodev_vhost_stop(VirtIODevice *dev, int 
> >total_queues)
> >  cryptodev_vhost_stop_one(vhost_crypto, dev);
> >  }
> >-r = k->set_guest_notifiers(qbus->parent, total_queues, false);
> >+/*
> >+ * Since all the states are handled by one vhost device,
> >+ * use the first one in array.
> >+ */
> >+vhost_crypto = cryptodev_get_vhost(b->conf.peers.ccs[0], b, 0);
> >+r = vhost_dev_drop_guest_notifiers(_crypto->dev, dev, 
> >total_queues);
> >  if (r < 0) {
> >  error_report("vhost guest notifier cleanup failed: %d", r);
> >  }
> >diff 

[PATCH v2 4/4] exec: Rename qemu_ram_writeback() as qemu_ram_msync()

2020-05-11 Thread Philippe Mathieu-Daudé
Rename qemu_ram_writeback() as qemu_ram_msync() to better
match what it does.

Suggested-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
---
 include/exec/ram_addr.h | 4 ++--
 exec.c  | 2 +-
 memory.c| 3 +--
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index a94809f89b..84817e19fa 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -136,12 +136,12 @@ void qemu_ram_free(RAMBlock *block);
 
 int qemu_ram_resize(RAMBlock *block, ram_addr_t newsize, Error **errp);
 
-void qemu_ram_writeback(RAMBlock *block, ram_addr_t start, ram_addr_t length);
+void qemu_ram_msync(RAMBlock *block, ram_addr_t start, ram_addr_t length);
 
 /* Clear whole block of mem */
 static inline void qemu_ram_block_writeback(RAMBlock *block)
 {
-qemu_ram_writeback(block, 0, block->used_length);
+qemu_ram_msync(block, 0, block->used_length);
 }
 
 #define DIRTY_CLIENTS_ALL ((1 << DIRTY_MEMORY_NUM) - 1)
diff --git a/exec.c b/exec.c
index 2874bb5088..f5a8cdf370 100644
--- a/exec.c
+++ b/exec.c
@@ -2127,7 +2127,7 @@ int qemu_ram_resize(RAMBlock *block, ram_addr_t newsize, 
Error **errp)
  * Otherwise no-op.
  * @Note: this is supposed to be a synchronous op.
  */
-void qemu_ram_writeback(RAMBlock *block, ram_addr_t start, ram_addr_t length)
+void qemu_ram_msync(RAMBlock *block, ram_addr_t start, ram_addr_t length)
 {
 /* The requested range should fit in within the block range */
 g_assert((start + length) <= block->used_length);
diff --git a/memory.c b/memory.c
index 73534b26f4..5bfa1429df 100644
--- a/memory.c
+++ b/memory.c
@@ -2197,7 +2197,6 @@ void memory_region_ram_resize(MemoryRegion *mr, 
ram_addr_t newsize, Error **errp
 qemu_ram_resize(mr->ram_block, newsize, errp);
 }
 
-
 void memory_region_sync(MemoryRegion *mr, hwaddr addr, hwaddr size)
 {
 /*
@@ -2205,7 +2204,7 @@ void memory_region_sync(MemoryRegion *mr, hwaddr addr, 
hwaddr size)
  * different types of memory regions
  */
 if (mr->ram_block) {
-qemu_ram_writeback(mr->ram_block, addr, size);
+qemu_ram_msync(mr->ram_block, addr, size);
 }
 }
 
-- 
2.21.3




[PATCH v2 0/4] memory: Add memory_region_sync() & make NVMe emulated device generic

2020-05-11 Thread Philippe Mathieu-Daudé
Remove the pointless dirty_log_mask check before msync'ing,
let the NVMe emulated device be target-agnostic.

Supersedes: <20200508062456.23344-1-phi...@redhat.com>

Philippe Mathieu-Daudé (4):
  memory: Simplify memory_region_do_writeback()
  memory: Rename memory_region_do_writeback() -> memory_region_sync()
  hw/block: Let the NVMe emulated device be target-agnostic
  exec: Rename qemu_ram_writeback() as qemu_ram_msync()

 include/exec/memory.h   | 13 +++--
 include/exec/ram_addr.h |  4 ++--
 exec.c  |  2 +-
 hw/block/nvme.c |  6 ++
 memory.c|  7 +++
 target/arm/helper.c |  2 +-
 hw/block/Makefile.objs  |  2 +-
 7 files changed, 17 insertions(+), 19 deletions(-)

-- 
2.21.3




[PATCH v2 2/4] memory: Rename memory_region_do_writeback() -> memory_region_sync()

2020-05-11 Thread Philippe Mathieu-Daudé
We usually use '_do_' for internal functions. Rename
memory_region_do_writeback() as memory_region_sync()
to better reflect what it does.

Reviewed-by: Richard Henderson 
Signed-off-by: Philippe Mathieu-Daudé 
---
 include/exec/memory.h | 13 +++--
 memory.c  |  2 +-
 target/arm/helper.c   |  2 +-
 3 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index e000bd2f97..4fc1d85b99 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1474,14 +1474,15 @@ void *memory_region_get_ram_ptr(MemoryRegion *mr);
 void memory_region_ram_resize(MemoryRegion *mr, ram_addr_t newsize,
   Error **errp);
 /**
- * memory_region_do_writeback: Trigger cache writeback or msync for
- * selected address range
+ * memory_region_sync: Synchronize selected address range
  *
- * @mr: the memory region to be updated
- * @addr: the initial address of the range to be written back
- * @size: the size of the range to be written back
+ * It is only meaningful for RAM regions, otherwise it is no-op.
+ *
+ * @mr: the memory region to be synchronized
+ * @addr: the initial address of the range to be sync
+ * @size: the size of the range to be sync
  */
-void memory_region_do_writeback(MemoryRegion *mr, hwaddr addr, hwaddr size);
+void memory_region_sync(MemoryRegion *mr, hwaddr addr, hwaddr size);
 
 /**
  * memory_region_set_log: Turn dirty logging on or off for a region.
diff --git a/memory.c b/memory.c
index fd5c3af535..73534b26f4 100644
--- a/memory.c
+++ b/memory.c
@@ -2198,7 +2198,7 @@ void memory_region_ram_resize(MemoryRegion *mr, 
ram_addr_t newsize, Error **errp
 }
 
 
-void memory_region_do_writeback(MemoryRegion *mr, hwaddr addr, hwaddr size)
+void memory_region_sync(MemoryRegion *mr, hwaddr addr, hwaddr size)
 {
 /*
  * Might be extended case needed to cover
diff --git a/target/arm/helper.c b/target/arm/helper.c
index a94f650795..c2697ed7c0 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -6829,7 +6829,7 @@ static void dccvap_writefn(CPUARMState *env, const 
ARMCPRegInfo *opaque,
 mr = memory_region_from_host(haddr, );
 
 if (mr) {
-memory_region_do_writeback(mr, offset, dline_size);
+memory_region_sync(mr, offset, dline_size);
 }
 }
 }
-- 
2.21.3




[PATCH v2 3/4] hw/block: Let the NVMe emulated device be target-agnostic

2020-05-11 Thread Philippe Mathieu-Daudé
Now than the non-target specific memory_region_sync() function
is available, use it to make this device target-agnostic.

Reviewed-by: Richard Henderson 
Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/block/nvme.c| 6 ++
 hw/block/Makefile.objs | 2 +-
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 9b453423cf..d9d0649540 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -46,8 +46,7 @@
 #include "qapi/visitor.h"
 #include "sysemu/hostmem.h"
 #include "sysemu/block-backend.h"
-#include "exec/ram_addr.h"
-
+#include "exec/memory.h"
 #include "qemu/log.h"
 #include "qemu/module.h"
 #include "qemu/cutils.h"
@@ -1207,8 +1206,7 @@ static uint64_t nvme_mmio_read(void *opaque, hwaddr addr, 
unsigned size)
  */
 if (addr == 0xE08 &&
 (NVME_PMRCAP_PMRWBM(n->bar.pmrcap) & 0x02)) {
-qemu_ram_writeback(n->pmrdev->mr.ram_block,
-   0, n->pmrdev->size);
+memory_region_sync(>pmrdev->mr, 0, n->pmrdev->size);
 }
 memcpy(, ptr + addr, size);
 } else {
diff --git a/hw/block/Makefile.objs b/hw/block/Makefile.objs
index 47960b5f0d..8855c22656 100644
--- a/hw/block/Makefile.objs
+++ b/hw/block/Makefile.objs
@@ -13,6 +13,6 @@ common-obj-$(CONFIG_SH4) += tc58128.o
 
 obj-$(CONFIG_VIRTIO_BLK) += virtio-blk.o
 obj-$(CONFIG_VHOST_USER_BLK) += vhost-user-blk.o
-obj-$(CONFIG_NVME_PCI) += nvme.o
+common-obj-$(CONFIG_NVME_PCI) += nvme.o
 
 obj-y += dataplane/
-- 
2.21.3




[PATCH v2 1/4] memory: Simplify memory_region_do_writeback()

2020-05-11 Thread Philippe Mathieu-Daudé
mr->dirty_log_mask tells if dirty tracking has been enabled,
not if the page is dirty. It would always be true during live
migration and when running on TCG, but otherwise it would
always be false.
As the value of mr->dirty_log_mask does not matter, remove
the check.

Cc: Beata Michalska 
Suggested-by: Paolo Bonzini 
Signed-off-by: Philippe Mathieu-Daudé 
---
 memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/memory.c b/memory.c
index 601b749906..fd5c3af535 100644
--- a/memory.c
+++ b/memory.c
@@ -2204,7 +2204,7 @@ void memory_region_do_writeback(MemoryRegion *mr, hwaddr 
addr, hwaddr size)
  * Might be extended case needed to cover
  * different types of memory regions
  */
-if (mr->ram_block && mr->dirty_log_mask) {
+if (mr->ram_block) {
 qemu_ram_writeback(mr->ram_block, addr, size);
 }
 }
-- 
2.21.3




Re: [PATCH v5 00/18] nvme: refactoring and cleanups

2020-05-11 Thread Klaus Jensen
On May 11 09:00, Philippe Mathieu-Daudé wrote:
> Hi Klaus,
> 
> On 5/11/20 8:25 AM, Klaus Jensen wrote:
> > On May  5 07:48, Klaus Jensen wrote:
> > > From: Klaus Jensen 
> > > 
> > > Changes since v5
> > > 
> > > No functional changes, just updated Reviewed-by tags. Also, I screwed up
> > > the CC list when sending v4.
> > > 
> > > Philippe and Keith, please add a Reviewed-by to
> > > 
> > >* "nvme: factor out pmr setup" and
> > >* "do cmb/pmr init as part of pci init"
> > > 
> > > since the first one was added and the second one was changed in v4 when
> > > rebasing on Kevins block-next tree which had the PMR work that was not
> > > in master at the time.
> > > 
> > > With those in place, it should be ready for Kevin to merge.
> > > 
> > Gentle ping on this.
> > 
> > Also, please see the two patches in "[PATCH 0/2] hw/block/nvme: fixes
> > for interrupt behavior". I think they should go in preparation to this
> > series.
> 
> I was going to ping Kevin last week, but then read your comment on pach #7
> "nvme: add max_ioqpairs device parameter", so I interpreted you would
> respin.
> Now it is clearer, applying in the following order you don't need to respin,
> right?
> 
> - [PATCH 0/2] hw/block/nvme: fixes for interrupt behavior"
> - [PATCH v5 00/18] nvme: refactoring and cleanups
> 

Ugh. "[PATCH v5 00/18] nvme: refactoring and cleanups" doesn't apply
completely cleanly.

"[PATCH 0/2] hw/block/nvme: fixes for interrupt behavior" was intented
to go into master because it fixes a bug, that is why I split them up.

But looks like it is better to just roll it into this series. I'll
respin a v6 with the two interrupt fixes.



Re: [PATCH 0/2] hw/block/nvme: fixes for interrupt behavior

2020-05-11 Thread Philippe Mathieu-Daudé

+ Michael & Marcel for MSIX,
and ping to Keith :)

On 5/5/20 10:36 PM, Klaus Jensen wrote:

From: Klaus Jensen 

Klaus Jensen (2):
   hw/block/nvme: fix pin-based interrupt behavior
   hw/block/nvme: allow use of any valid msix vector

  hw/block/nvme.c | 14 +-
  hw/block/nvme.h |  2 +-
  2 files changed, 10 insertions(+), 6 deletions(-)





[PATCH] block/replication.c: Avoid cancelling the job twice

2020-05-11 Thread Lukas Straub
If qemu in colo secondary mode is stopped, it crashes because
s->backup_job is canceled twice: First with job_cancel_sync_all()
in qemu_cleanup() and then in replication_stop().

Fix this by assigning NULL to s->backup_job when the job completes
so replication_stop() and replication_do_checkpoint() won't touch
the job.

Signed-off-by: Lukas Straub 
---
 block/replication.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/replication.c b/block/replication.c
index da013c2041..33f2f62a44 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -398,6 +398,8 @@ static void backup_job_cleanup(BlockDriverState *bs)
 BDRVReplicationState *s = bs->opaque;
 BlockDriverState *top_bs;
 
+s->backup_job = NULL;
+
 top_bs = bdrv_lookup_bs(s->top_id, s->top_id, NULL);
 if (!top_bs) {
 return;
-- 
2.20.1


pgptRbKh2CJoM.pgp
Description: OpenPGP digital signature


Re: [PATCH v5 00/18] nvme: refactoring and cleanups

2020-05-11 Thread Philippe Mathieu-Daudé

Hi Klaus,

On 5/11/20 8:25 AM, Klaus Jensen wrote:

On May  5 07:48, Klaus Jensen wrote:

From: Klaus Jensen 

Changes since v5

No functional changes, just updated Reviewed-by tags. Also, I screwed up
the CC list when sending v4.

Philippe and Keith, please add a Reviewed-by to

   * "nvme: factor out pmr setup" and
   * "do cmb/pmr init as part of pci init"

since the first one was added and the second one was changed in v4 when
rebasing on Kevins block-next tree which had the PMR work that was not
in master at the time.

With those in place, it should be ready for Kevin to merge.

  
Gentle ping on this.


Also, please see the two patches in "[PATCH 0/2] hw/block/nvme: fixes
for interrupt behavior". I think they should go in preparation to this
series.


I was going to ping Kevin last week, but then read your comment on pach 
#7 "nvme: add max_ioqpairs device parameter", so I interpreted you would 
respin.
Now it is clearer, applying in the following order you don't need to 
respin, right?


- [PATCH 0/2] hw/block/nvme: fixes for interrupt behavior"
- [PATCH v5 00/18] nvme: refactoring and cleanups




Re: [PATCH v5 00/18] nvme: refactoring and cleanups

2020-05-11 Thread Klaus Jensen
On May  5 07:48, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Changes since v5
> 
> No functional changes, just updated Reviewed-by tags. Also, I screwed up
> the CC list when sending v4.
> 
> Philippe and Keith, please add a Reviewed-by to
> 
>   * "nvme: factor out pmr setup" and
>   * "do cmb/pmr init as part of pci init"
> 
> since the first one was added and the second one was changed in v4 when
> rebasing on Kevins block-next tree which had the PMR work that was not
> in master at the time.
> 
> With those in place, it should be ready for Kevin to merge.
> 
 
Gentle ping on this.

Also, please see the two patches in "[PATCH 0/2] hw/block/nvme: fixes
for interrupt behavior". I think they should go in preparation to this
series.