Re: [PATCH v6 0/9] Packed virtqueue for virtio

2019-10-24 Thread Jason Wang



On 2019/10/25 1:13 AM, Eugenio Pérez wrote:

Hi:

This is an updated version of the packed virtqueue support, based on Wei and
Jason's V5, mainly solving the clang leak-detector error reported by CI.

Please review.

Changes from V5:
- Fix qemu's CI asan error.
- Move/copy rcu comments.
- Merge the duplicated vdev->broken check between the split and packed versions.

Eugenio Pérez (3):
   virtio: Free rng and blk virtqueues
   virtio: add some rcu comments
   virtio: Move vdev->broken check to dispatch drop_all

Jason Wang (4):
   virtio: basic packed virtqueue support
   virtio: event suppression support for packed ring
   vhost_net: enable packed ring support
   virtio: add property to enable packed virtqueue

Wei Xu (2):
   virtio: basic structure for packed ring
   virtio: device/driver area size calculation refactor for split ring



Looks good to me.

Just two nits:

I tend to squash patch 8 and patch 9 into the patches that introduce 
those issues, and to split patch 3 into two parts.


Btw, if you wish you can add your s-o-b to the series.

Do you want to post a new version, or should I tweak them myself?

Thanks




  hw/block/virtio-blk.c   |7 +-
  hw/char/virtio-serial-bus.c |2 +-
  hw/net/vhost_net.c  |2 +
  hw/scsi/virtio-scsi.c   |3 +-
  hw/virtio/virtio-rng.c  |1 +
  hw/virtio/virtio.c  | 1154 ++-
  include/hw/virtio/virtio.h  |   14 +-
  7 files changed, 1045 insertions(+), 138 deletions(-)






Re: [PATCH v2 0/4] apic: Fix migration breakage of >255 vcpus

2019-10-24 Thread Peter Xu
On Thu, Oct 24, 2019 at 01:49:11PM -0400, John Snow wrote:
> 
> 
> On 10/23/19 4:17 AM, Kevin Wolf wrote:
> > The important difference here is legacy IDE (which works) vs. AHCI
> > (which doesn't work). If you add a -device ahci to the -M pc case, it
> > starts failing, too.
> > 
> > Not sure why AHCI fails, but I'll just CC John who is the lucky
> > maintainer of this device. :-)
> 
> Hm... It looks like SeaBIOS is identifying the drive correctly and
> perfectly well, but we're failing at boot_disk(u8 bootdrv, int
> checksig), about here:
> 
> call16_int(0x13, &br);
> 
> if (br.flags & F_CF) {
> printf("Boot failed: could not read the boot disk\n\n");
> return;
> }
> 
> Looking at AHCI tracing (From the QEMU side), it looks like we set up
> the drive correctly, and then never touch the port ever again -- I don't
> see an attempted read on QEMU's end.
> 
> I'll need to look through SeaBIOS source for hints, I'm not sure right
> yet. If anyone is more familiar with the SeaBIOS boot code, maybe they
> can give a pointer faster than I'll figure it out myself.

Hi, John,

I don't know SeaBIOS well, but I did include a pointer in my previous
email to where it faulted.  It seems to me that the issue is that
SeaBIOS may have got incorrect secs/cyls/heads information (and
explicitly setting secs=1,cyls=1,heads=1 on the block device fixes the
issue).
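
For reference (illustrative only; the drive id d0 and the ide-hd device
are made up, and the exact device type depends on the setup), such an
override could look like:

    -drive file=disk.img,if=none,id=d0 \
    -device ide-hd,drive=d0,cyls=1,heads=1,secs=1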

Thanks,

-- 
Peter Xu



Re: qemu crashing when attaching an ISO file to a virtio-scsi CD-ROM device through libvirt

2019-10-24 Thread Fernando Casas Schössow
BTW, just to be clear, qemu is crashing in this scenario *only* if 
iothread is enabled for the guest.
Without iothread enabled, the operation completes without any 
problems.

On Thu, Oct 24, 2019 at 11:07 PM, Fernando Casas Schössow wrote:
> Today I updated to qemu 4.0.1, since this was the latest version 
> available for Alpine, and I can confirm that I can reproduce the issue 
> with this version as well.
> Not sure if relevant, but I can also confirm that the problem happens 
> with Windows Server 2012 R2 as well as with Linux guests (it doesn't 
> matter whether the guest uses UEFI or BIOS firmware). I ran these tests 
> just to rule things out.
> 
> Also, as discussed, I compiled qemu with debug symbols, reproduced the 
> problem, collected a core dump and ran both through gdb. This is the 
> result:
> 
> (gdb) thread apply all bt
> 
> [backtrace snipped; identical to the one in the full message below]

Re: qemu crashing when attaching an ISO file to a virtio-scsi CD-ROM device through libvirt

2019-10-24 Thread Fernando Casas Schössow
Today I updated to qemu 4.0.1, since this was the latest version 
available for Alpine, and I can confirm that I can reproduce the issue 
with this version as well.
Not sure if relevant, but I can also confirm that the problem happens 
with Windows Server 2012 R2 as well as with Linux guests (it doesn't 
matter whether the guest uses UEFI or BIOS firmware). I ran these tests 
just to rule things out.

Also, as discussed, I compiled qemu with debug symbols, reproduced the 
problem, collected a core dump and ran both through gdb. This is the 
result:

(gdb) thread apply all bt

Thread 42 (LWP 33704):
#0 0x7fee04252080 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x7fee0424f4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x7fee02380b64 in ?? ()
#3 0x in ?? ()

Thread 41 (LWP 33837):
#0 0x7fee04252080 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x7fee0424f4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x7fedc1ad5b64 in ?? ()
#3 0x in ?? ()

Thread 40 (LWP 33719):
#0 0x7fee04252080 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x7fee0424f4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x7fee02266b64 in ?? ()
#3 0x in ?? ()

Thread 39 (LWP 33696):
#0 0x7fee04233171 in syscall () from /lib/ld-musl-x86_64.so.1
#1 0x7fee02be8b64 in ?? ()
#2 0x0030 in ?? ()
#3 0x7fee02be2540 in ?? ()
#4 0x7fee02be2500 in ?? ()
#5 0x7fee02be2548 in ?? ()
#6 0x55d7e4987f28 in rcu_gp_event ()
#7 0x in ?? ()

Thread 38 (LWP 33839):
#0 0x7fee04252080 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x7fee0424f4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x7fedc1a83b64 in ?? ()
#3 0x in ?? ()

Thread 37 (LWP 33841):
#0 0x7fee04252080 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x7fee0424f4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x7fedc1737b64 in ?? ()
#3 0x in ?? ()

Thread 36 (LWP 33863):
#0 0x7fee04252080 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x7fee0424f4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x7fedb8c83b64 in ?? ()
#3 0x in ?? ()

Thread 35 (LWP 33842):
#0 0x7fee04252080 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x7fee0424f4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x7fedc170eb64 in ?? ()
#3 0x in ?? ()

Thread 34 (LWP 33862):
#0 0x7fee04252080 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x7fee0424f4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x7fedb8cacb64 in ?? ()
#3 0x in ?? ()

Thread 33 (LWP 33843):
#0 0x7fee04252080 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x7fee0424f4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x7fedc16e5b64 in ?? ()
#3 0x in ?? ()

Thread 32 (LWP 33861):
#0 0x7fee04252080 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x7fee0424f4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x7fedb8cd5b64 in ?? ()
#3 0x in ?? ()

Thread 31 (LWP 33844):
#0 0x7fee04252080 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x7fee0424f4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x7fedc16bcb64 in ?? ()
#3 0x in ?? ()

Thread 30 (LWP 33858):
#0 0x7fee04252080 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x7fee0424f4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x7fedb8e83b64 in ?? ()
#3 0x in ?? ()

Thread 29 (LWP 33845):
#0 0x7fee04252080 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x7fee0424f4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x7fedc1693b64 in ?? ()
#3 0x in ?? ()

Thread 28 (LWP 33857):
#0 0x7fee04252080 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x7fee0424f4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x7fedb8eacb64 in ?? ()
#3 0x in ?? ()

Thread 27 (LWP 33846):
#0 0x7fee04252080 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x7fee0424f4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x7fedc166ab64 in ?? ()
#3 0x in ?? ()

Thread 26 (LWP 33856):
#0 0x7fee04252080 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x7fee0424f4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x7fedb8ed5b64 in ?? ()
#3 0x in ?? ()

Thread 25 (LWP 33847):
#0 0x7fee04252080 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x7fee0424f4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x7fedc142ab64 in ?? ()
#3 0x in ?? ()

Thread 24 (LWP 33855):
#0 0x7fee04252080 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x7fee0424f4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x7fedb8efeb64 in ?? ()
#3 0x in ?? ()

Thread 23 (LWP 33848):
#0 0x7fee04252080 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x7fee0424f4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x7fedbd0feb64 in ?? ()
#3 0x in ?? ()

Thread 22 (LWP 33854):
#0 0x7fee04252080 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x7fee0424f4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x7fedbd031b64 in 

[PATCH v6 2/9] virtio: device/driver area size calculation refactor for split ring

2019-10-24 Thread Eugenio Pérez
From: Wei Xu 

There is a slight size difference between the split and packed rings.

This refactors the split ring code and adds a helper for expanding the
device and driver area size calculation to the packed ring.

Signed-off-by: Wei Xu 
Signed-off-by: Jason Wang 
Reviewed-by: Jens Freimann 
---
 hw/virtio/virtio.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index fdac203cdf..74cc10fad9 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -159,10 +159,8 @@ static void virtio_init_region_cache(VirtIODevice *vdev, int n)
 VRingMemoryRegionCaches *old = vq->vring.caches;
 VRingMemoryRegionCaches *new = NULL;
 hwaddr addr, size;
-int event_size;
 int64_t len;
 
-event_size = virtio_vdev_has_feature(vq->vdev, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
 
 addr = vq->vring.desc;
 if (!addr) {
@@ -177,7 +175,7 @@ static void virtio_init_region_cache(VirtIODevice *vdev, int n)
 goto err_desc;
 }
 
-size = virtio_queue_get_used_size(vdev, n) + event_size;
+size = virtio_queue_get_used_size(vdev, n);
 len = address_space_cache_init(&new->used, vdev->dma_as,
vq->vring.used, size, true);
 if (len < size) {
@@ -185,7 +183,7 @@ static void virtio_init_region_cache(VirtIODevice *vdev, int n)
 goto err_used;
 }
 
-size = virtio_queue_get_avail_size(vdev, n) + event_size;
+size = virtio_queue_get_avail_size(vdev, n);
 len = address_space_cache_init(&new->avail, vdev->dma_as,
vq->vring.avail, size, false);
 if (len < size) {
@@ -2414,14 +2412,20 @@ hwaddr virtio_queue_get_desc_size(VirtIODevice *vdev, int n)
 
 hwaddr virtio_queue_get_avail_size(VirtIODevice *vdev, int n)
 {
+int s;
+
+s = virtio_vdev_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
 return offsetof(VRingAvail, ring) +
-sizeof(uint16_t) * vdev->vq[n].vring.num;
+sizeof(uint16_t) * vdev->vq[n].vring.num + s;
 }
 
 hwaddr virtio_queue_get_used_size(VirtIODevice *vdev, int n)
 {
+int s;
+
+s = virtio_vdev_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
 return offsetof(VRingUsed, ring) +
-sizeof(VRingUsedElem) * vdev->vq[n].vring.num;
+sizeof(VRingUsedElem) * vdev->vq[n].vring.num + s;
 }
 
 uint16_t virtio_queue_get_last_avail_idx(VirtIODevice *vdev, int n)
-- 
2.16.5
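
As a sanity check of the refactored arithmetic, a standalone sketch
(not the patch's code; the structs are minimal copies of the VRingAvail
and VRingUsed layouts defined in hw/virtio/virtio.c):

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint16_t flags;
        uint16_t idx;
        uint16_t ring[];
    } VRingAvail;

    typedef struct {
        uint32_t id;
        uint32_t len;
    } VRingUsedElem;

    typedef struct {
        uint16_t flags;
        uint16_t idx;
        VRingUsedElem ring[];
    } VRingUsed;

    int main(void)
    {
        unsigned num = 256; /* example queue size */
        int s = 2;          /* 2 extra bytes iff VIRTIO_RING_F_EVENT_IDX */

        /* avail: 4 + 2 * 256 + 2 = 518 bytes */
        printf("avail: %zu\n", offsetof(VRingAvail, ring)
                               + sizeof(uint16_t) * num + s);
        /* used:  4 + 8 * 256 + 2 = 2054 bytes */
        printf("used:  %zu\n", offsetof(VRingUsed, ring)
                               + sizeof(VRingUsedElem) * num + s);
        return 0;
    }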




[PATCH v6 3/9] virtio: Free rng and blk virtqueues

2019-10-24 Thread Eugenio Pérez
The function virtio_del_queue was not called in these devices'
unrealize() callbacks.

This was detected by adding an allocated element (used_elems) in the
next commit and running the address sanitizer's memory leak detector.

Signed-off-by: Eugenio Pérez 
---
 hw/block/virtio-blk.c  | 5 +
 hw/virtio/virtio-rng.c | 1 +
 2 files changed, 6 insertions(+)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index ed2ddebd2b..ba846fe9dc 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -1206,9 +1206,14 @@ static void virtio_blk_device_unrealize(DeviceState *dev, Error **errp)
 {
 VirtIODevice *vdev = VIRTIO_DEVICE(dev);
 VirtIOBlock *s = VIRTIO_BLK(dev);
+VirtIOBlkConf *conf = &s->conf;
+unsigned i;
 
 virtio_blk_data_plane_destroy(s->dataplane);
 s->dataplane = NULL;
+for (i = 0; i < conf->num_queues; i++) {
+virtio_del_queue(vdev, i);
+}
 qemu_del_vm_change_state_handler(s->change);
 blockdev_mark_auto_del(s->blk);
 virtio_cleanup(vdev);
diff --git a/hw/virtio/virtio-rng.c b/hw/virtio/virtio-rng.c
index e93bed020f..b498a20332 100644
--- a/hw/virtio/virtio-rng.c
+++ b/hw/virtio/virtio-rng.c
@@ -238,6 +238,7 @@ static void virtio_rng_device_unrealize(DeviceState *dev, Error **errp)
 qemu_del_vm_change_state_handler(vrng->vmstate);
 timer_del(vrng->rate_limit_timer);
 timer_free(vrng->rate_limit_timer);
+virtio_del_queue(vdev, 0);
 virtio_cleanup(vdev);
 }
 
-- 
2.16.5
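
For context, the teardown shape both fixes converge on, as a hedged
sketch with a made-up virtio-foo device and queue count (not code from
the patch):

    /* Every queue added with virtio_add_queue() in realize() should be
     * released with virtio_del_queue() before virtio_cleanup();
     * otherwise per-queue allocations (e.g. used_elems, added in the
     * next commit) are leaked. */
    static void virtio_foo_device_unrealize(DeviceState *dev, Error **errp)
    {
        VirtIODevice *vdev = VIRTIO_DEVICE(dev);
        unsigned i;

        for (i = 0; i < FOO_NUM_QUEUES; i++) { /* FOO_NUM_QUEUES: made up */
            virtio_del_queue(vdev, i);
        }
        virtio_cleanup(vdev);
    }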




[PATCH v6 0/9] Packed virtqueue for virtio

2019-10-24 Thread Eugenio Pérez
Hi:

This is an updated version of the packed virtqueue support, based on Wei and
Jason's V5, mainly solving the clang leak-detector error reported by CI.

Please review.

Changes from V5:
- Fix qemu's CI asan error.
- Move/copy rcu comments.
- Merge the duplicated vdev->broken check between the split and packed versions.

Eugenio Pérez (3):
  virtio: Free rng and blk virtqueues
  virtio: add some rcu comments
  virtio: Move vdev->broken check to dispatch drop_all

Jason Wang (4):
  virtio: basic packed virtqueue support
  virtio: event suppression support for packed ring
  vhost_net: enable packed ring support
  virtio: add property to enable packed virtqueue

Wei Xu (2):
  virtio: basic structure for packed ring
  virtio: device/driver area size calculation refactor for split ring

 hw/block/virtio-blk.c   |7 +-
 hw/char/virtio-serial-bus.c |2 +-
 hw/net/vhost_net.c  |2 +
 hw/scsi/virtio-scsi.c   |3 +-
 hw/virtio/virtio-rng.c  |1 +
 hw/virtio/virtio.c  | 1154 ++-
 include/hw/virtio/virtio.h  |   14 +-
 7 files changed, 1045 insertions(+), 138 deletions(-)

-- 
2.16.5




[PATCH v6 9/9] virtio: Move vdev->broken check to dispatch drop_all

2019-10-24 Thread Eugenio Pérez
Previous commits did the same for other virtqueue_* functions, but this
check was still duplicated between the split and the packed versions.

Signed-off-by: Eugenio Pérez 
---
 hw/virtio/virtio.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 9195b08da8..828c27de1f 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -1631,10 +1631,6 @@ static unsigned int virtqueue_packed_drop_all(VirtQueue *vq)
 VirtIODevice *vdev = vq->vdev;
 VRingPackedDesc desc;
 
-if (unlikely(vdev->broken)) {
-return 0;
-}
-
 caches = vring_get_region_caches(vq);
 desc_cache = &caches->desc;
 
@@ -1680,10 +1676,6 @@ static unsigned int virtqueue_split_drop_all(VirtQueue *vq)
 VirtIODevice *vdev = vq->vdev;
 bool fEventIdx = virtio_vdev_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
 
-if (unlikely(vdev->broken)) {
-return 0;
-}
-
 while (!virtio_queue_empty(vq) && vq->inuse < vq->vring.num) {
 /* works similar to virtqueue_pop but does not map buffers
 * and does not allocate any memory */
@@ -1715,6 +1707,10 @@ unsigned int virtqueue_drop_all(VirtQueue *vq)
 {
 struct VirtIODevice *vdev = vq->vdev;
 
+if (unlikely(vdev->broken)) {
+return 0;
+}
+
 if (virtio_vdev_has_feature(vdev, VIRTIO_F_RING_PACKED)) {
 return virtqueue_packed_drop_all(vq);
 } else {
-- 
2.16.5




[PATCH v6 5/9] virtio: event suppression support for packed ring

2019-10-24 Thread Eugenio Pérez
From: Jason Wang 

This patch implements event suppression through the device/driver
area. Please refer to the virtio specification for more information.

Signed-off-by: Wei Xu 
Signed-off-by: Jason Wang 
---
 hw/virtio/virtio.c | 142 +
 1 file changed, 133 insertions(+), 9 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 6e7a034d2a..3cf12a62c0 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -240,6 +240,44 @@ static void vring_split_desc_read(VirtIODevice *vdev, VRingDesc *desc,
 virtio_tswap16s(vdev, >next);
 }
 
+static void vring_packed_event_read(VirtIODevice *vdev,
+MemoryRegionCache *cache,
+VRingPackedDescEvent *e)
+{
+hwaddr off_off = offsetof(VRingPackedDescEvent, off_wrap);
+hwaddr off_flags = offsetof(VRingPackedDescEvent, flags);
+
+address_space_read_cached(cache, off_flags, &e->flags,
+  sizeof(e->flags));
+/* Make sure flags is seen before off_wrap */
+smp_rmb();
+address_space_read_cached(cache, off_off, &e->off_wrap,
+  sizeof(e->off_wrap));
+virtio_tswap16s(vdev, >off_wrap);
+virtio_tswap16s(vdev, >flags);
+}
+
+static void vring_packed_off_wrap_write(VirtIODevice *vdev,
+MemoryRegionCache *cache,
+uint16_t off_wrap)
+{
+hwaddr off = offsetof(VRingPackedDescEvent, off_wrap);
+
+virtio_tswap16s(vdev, &off_wrap);
+address_space_write_cached(cache, off, &off_wrap, sizeof(off_wrap));
+address_space_cache_invalidate(cache, off, sizeof(off_wrap));
+}
+
+static void vring_packed_flags_write(VirtIODevice *vdev,
+ MemoryRegionCache *cache, uint16_t flags)
+{
+hwaddr off = offsetof(VRingPackedDescEvent, flags);
+
+virtio_tswap16s(vdev, &flags);
+address_space_write_cached(cache, off, &flags, sizeof(flags));
+address_space_cache_invalidate(cache, off, sizeof(flags));
+}
+
 static VRingMemoryRegionCaches *vring_get_region_caches(struct VirtQueue *vq)
 {
 VRingMemoryRegionCaches *caches = atomic_rcu_read(&vq->vring.caches);
@@ -346,14 +384,8 @@ static inline void vring_set_avail_event(VirtQueue *vq, uint16_t val)
 address_space_cache_invalidate(&caches->used, pa, sizeof(val));
 }
 
-void virtio_queue_set_notification(VirtQueue *vq, int enable)
+static void virtio_queue_split_set_notification(VirtQueue *vq, int enable)
 {
-vq->notification = enable;
-
-if (!vq->vring.desc) {
-return;
-}
-
 rcu_read_lock();
 if (virtio_vdev_has_feature(vq->vdev, VIRTIO_RING_F_EVENT_IDX)) {
 vring_set_avail_event(vq, vring_avail_idx(vq));
@@ -369,6 +401,51 @@ void virtio_queue_set_notification(VirtQueue *vq, int enable)
 rcu_read_unlock();
 }
 
+static void virtio_queue_packed_set_notification(VirtQueue *vq, int enable)
+{
+uint16_t off_wrap;
+VRingPackedDescEvent e;
+VRingMemoryRegionCaches *caches;
+
+rcu_read_lock();
+caches = vring_get_region_caches(vq);
+vring_packed_event_read(vq->vdev, &caches->used, &e);
+
+if (!enable) {
+e.flags = VRING_PACKED_EVENT_FLAG_DISABLE;
+} else if (virtio_vdev_has_feature(vq->vdev, VIRTIO_RING_F_EVENT_IDX)) {
+off_wrap = vq->shadow_avail_idx | vq->shadow_avail_wrap_counter << 15;
+vring_packed_off_wrap_write(vq->vdev, &caches->used, off_wrap);
+/* Make sure off_wrap is written before flags */
+smp_wmb();
+e.flags = VRING_PACKED_EVENT_FLAG_DESC;
+} else {
+e.flags = VRING_PACKED_EVENT_FLAG_ENABLE;
+}
+
+vring_packed_flags_write(vq->vdev, &caches->used, e.flags);
+if (enable) {
+/* Expose avail event/used flags before caller checks the avail idx. */
+smp_mb();
+}
+rcu_read_unlock();
+}
+
+void virtio_queue_set_notification(VirtQueue *vq, int enable)
+{
+vq->notification = enable;
+
+if (!vq->vring.desc) {
+return;
+}
+
+if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
+virtio_queue_packed_set_notification(vq, enable);
+} else {
+virtio_queue_split_set_notification(vq, enable);
+}
+}
+
 int virtio_queue_ready(VirtQueue *vq)
 {
 return vq->vring.avail != 0;
@@ -2290,8 +2367,7 @@ static void virtio_set_isr(VirtIODevice *vdev, int value)
 }
 }
 
-/* Called within rcu_read_lock().  */
-static bool virtio_should_notify(VirtIODevice *vdev, VirtQueue *vq)
+static bool virtio_split_should_notify(VirtIODevice *vdev, VirtQueue *vq)
 {
 uint16_t old, new;
 bool v;
@@ -2314,6 +2390,54 @@ static bool virtio_should_notify(VirtIODevice *vdev, VirtQueue *vq)
 return !v || vring_need_event(vring_get_used_event(vq), new, old);
 }
 
+static bool vring_packed_need_event(VirtQueue *vq, bool wrap,
+uint16_t off_wrap, uint16_t new,
+uint16_t old)
+{
+int off = 

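For orientation, a standalone sketch of the off_wrap encoding the
event-suppression code relies on; per the virtio 1.1 spec, bits 0-14
carry the event offset and bit 15 the wrap counter (illustrative
helpers, not the patch's code):

    #include <stdint.h>
    #include <stdbool.h>

    /* Mirrors: vq->shadow_avail_idx | vq->shadow_avail_wrap_counter << 15 */
    static inline uint16_t packed_event_off_wrap(uint16_t off, bool wrap)
    {
        return (off & 0x7fff) | ((uint16_t)wrap << 15);
    }

    static inline void packed_event_unpack(uint16_t off_wrap,
                                           uint16_t *off, bool *wrap)
    {
        *off = off_wrap & 0x7fff;
        *wrap = off_wrap >> 15;
    }
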
[PATCH v6 6/9] vhost_net: enable packed ring support

2019-10-24 Thread Eugenio Pérez
From: Jason Wang 

Signed-off-by: Jason Wang 
Reviewed-by: Jens Freimann 
---
 hw/net/vhost_net.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index e975700f95..6b82803fa7 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -49,6 +49,7 @@ static const int kernel_feature_bits[] = {
 VIRTIO_F_VERSION_1,
 VIRTIO_NET_F_MTU,
 VIRTIO_F_IOMMU_PLATFORM,
+VIRTIO_F_RING_PACKED,
 VHOST_INVALID_FEATURE_BIT
 };
 
@@ -74,6 +75,7 @@ static const int user_feature_bits[] = {
 VIRTIO_NET_F_MRG_RXBUF,
 VIRTIO_NET_F_MTU,
 VIRTIO_F_IOMMU_PLATFORM,
+VIRTIO_F_RING_PACKED,
 
 /* This bit implies RARP isn't sent by QEMU out of band */
 VIRTIO_NET_F_GUEST_ANNOUNCE,
-- 
2.16.5




[PATCH v6 7/9] virtio: add property to enable packed virtqueue

2019-10-24 Thread Eugenio Pérez
From: Jason Wang 

Signed-off-by: Jason Wang 
Reviewed-by: Jens Freimann 
---
 include/hw/virtio/virtio.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index d123d5b181..40ddeafadb 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -285,7 +285,9 @@ typedef struct VirtIORNGConf VirtIORNGConf;
 DEFINE_PROP_BIT64("any_layout", _state, _field, \
   VIRTIO_F_ANY_LAYOUT, true), \
 DEFINE_PROP_BIT64("iommu_platform", _state, _field, \
-  VIRTIO_F_IOMMU_PLATFORM, false)
+  VIRTIO_F_IOMMU_PLATFORM, false), \
+DEFINE_PROP_BIT64("packed", _state, _field, \
+  VIRTIO_F_RING_PACKED, false)
 
 hwaddr virtio_queue_get_desc_addr(VirtIODevice *vdev, int n);
 bool virtio_queue_enabled(VirtIODevice *vdev, int n);
-- 
2.16.5
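
With this property, packed ring support can be toggled per device. A
hypothetical invocation (the netdev and device ids are made up):

    qemu-system-x86_64 ... \
        -netdev tap,id=hn0 \
        -device virtio-net-pci,netdev=hn0,packed=on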




[PATCH v6 4/9] virtio: basic packed virtqueue support

2019-10-24 Thread Eugenio Pérez
From: Jason Wang 

This patch implements basic support for the packed virtqueue. Compared
to the split virtqueue, which has three rings, the packed virtqueue has
only one, which is supposed to give better cache utilization and be
more hardware friendly.

Please refer to the virtio specification for more information.

Signed-off-by: Wei Xu 
Signed-off-by: Jason Wang 
---
 hw/block/virtio-blk.c   |   2 +-
 hw/char/virtio-serial-bus.c |   2 +-
 hw/scsi/virtio-scsi.c   |   3 +-
 hw/virtio/virtio.c  | 900 
 include/hw/virtio/virtio.h  |  10 +-
 5 files changed, 837 insertions(+), 80 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index ba846fe9dc..7dbdeaaab9 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -1052,7 +1052,7 @@ static void virtio_blk_save_device(VirtIODevice *vdev, QEMUFile *f)
 qemu_put_be32(f, virtio_get_queue_index(req->vq));
 }
 
-qemu_put_virtqueue_element(f, &req->elem);
+qemu_put_virtqueue_element(vdev, f, &req->elem);
 req = req->next;
 }
 qemu_put_sbyte(f, 0);
diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
index 4e0ed829ae..33259042a9 100644
--- a/hw/char/virtio-serial-bus.c
+++ b/hw/char/virtio-serial-bus.c
@@ -708,7 +708,7 @@ static void virtio_serial_save_device(VirtIODevice *vdev, QEMUFile *f)
 if (elem_popped) {
 qemu_put_be32s(f, >iov_idx);
 qemu_put_be64s(f, >iov_offset);
-qemu_put_virtqueue_element(f, port->elem);
+qemu_put_virtqueue_element(vdev, f, port->elem);
 }
 }
 }
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index ee52aa7d17..e8b2b64d09 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -190,11 +190,12 @@ static void virtio_scsi_save_request(QEMUFile *f, SCSIRequest *sreq)
 {
 VirtIOSCSIReq *req = sreq->hba_private;
 VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(req->dev);
+VirtIODevice *vdev = VIRTIO_DEVICE(req->dev);
 uint32_t n = virtio_get_queue_index(req->vq) - 2;
 
 assert(n < vs->conf.num_queues);
 qemu_put_be32s(f, &n);
-qemu_put_virtqueue_element(f, &req->elem);
+qemu_put_virtqueue_element(vdev, f, &req->elem);
 }
 
 static void *virtio_scsi_load_request(QEMUFile *f, SCSIRequest *sreq)
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 74cc10fad9..6e7a034d2a 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -96,6 +96,7 @@ typedef struct VRingPackedDescEvent {
 struct VirtQueue
 {
 VRing vring;
+VirtQueueElement *used_elems;
 
 /* Next head to pop */
 uint16_t last_avail_idx;
@@ -160,6 +161,7 @@ static void virtio_init_region_cache(VirtIODevice *vdev, int n)
 VRingMemoryRegionCaches *new = NULL;
 hwaddr addr, size;
 int64_t len;
+bool packed;
 
 
 addr = vq->vring.desc;
@@ -168,8 +170,10 @@ static void virtio_init_region_cache(VirtIODevice *vdev, int n)
 }
 new = g_new0(VRingMemoryRegionCaches, 1);
 size = virtio_queue_get_desc_size(vdev, n);
+packed = virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED) ?
+   true : false;
 len = address_space_cache_init(&new->desc, vdev->dma_as,
-   addr, size, false);
+   addr, size, packed);
 if (len < size) {
 virtio_error(vdev, "Cannot map desc");
 goto err_desc;
@@ -225,8 +229,8 @@ void virtio_queue_update_rings(VirtIODevice *vdev, int n)
 }
 
 /* Called within rcu_read_lock().  */
-static void vring_desc_read(VirtIODevice *vdev, VRingDesc *desc,
-MemoryRegionCache *cache, int i)
+static void vring_split_desc_read(VirtIODevice *vdev, VRingDesc *desc,
+  MemoryRegionCache *cache, int i)
 {
 address_space_read_cached(cache, i * sizeof(VRingDesc),
   desc, sizeof(VRingDesc));
@@ -370,6 +374,95 @@ int virtio_queue_ready(VirtQueue *vq)
 return vq->vring.avail != 0;
 }
 
+static void vring_packed_desc_read_flags(VirtIODevice *vdev,
+ uint16_t *flags,
+ MemoryRegionCache *cache,
+ int i)
+{
+address_space_read_cached(cache,
+  i * sizeof(VRingPackedDesc) +
+  offsetof(VRingPackedDesc, flags),
+  flags, sizeof(*flags));
+virtio_tswap16s(vdev, flags);
+}
+
+static void vring_packed_desc_read(VirtIODevice *vdev,
+   VRingPackedDesc *desc,
+   MemoryRegionCache *cache,
+   int i, bool strict_order)
+{
+hwaddr off = i * sizeof(VRingPackedDesc);
+
+vring_packed_desc_read_flags(vdev, &desc->flags, cache, i);
+
+if (strict_order) {
+/* Make sure flags is read before the rest fields. 

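To visualize the split-vs-packed contrast from the commit message, a
compilable sketch of the two layouts (illustrative types only, not
QEMU's definitions):

    #include <stdint.h>

    #define QUEUE_SIZE 256 /* example ring size */

    /* Split ring: three separate areas in guest memory. */
    struct SplitRing {
        struct { uint64_t addr; uint32_t len;
                 uint16_t flags, next; } desc[QUEUE_SIZE];
        struct { uint16_t flags, idx,
                 ring[QUEUE_SIZE]; } avail;          /* driver area */
        struct { uint16_t flags, idx;
                 struct { uint32_t id, len; }
                     ring[QUEUE_SIZE]; } used;       /* device area */
    };

    /* Packed ring: one descriptor array that driver and device update
     * in place, plus two small event-suppression structures. */
    struct PackedRing {
        struct { uint64_t addr; uint32_t len;
                 uint16_t id, flags; } desc[QUEUE_SIZE];
        struct { uint16_t off_wrap, flags; } driver_event, device_event;
    };
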
Re: [PATCH v2 0/4] apic: Fix migration breakage of >255 vcpus

2019-10-24 Thread John Snow



On 10/23/19 4:17 AM, Kevin Wolf wrote:
> The important difference here is legacy IDE (which works) vs. AHCI
> (which doesn't work). If you add a -device ahci to the -M pc case, it
> starts failing, too.
> 
> Not sure why AHCI fails, but I'll just CC John who is the lucky
> maintainer of this device. :-)

Hm... It looks like SeaBIOS is identifying the drive correctly and
perfectly well, but we're failing at boot_disk(u8 bootdrv, int
checksig), about here:

call16_int(0x13, &br);

if (br.flags & F_CF) {
printf("Boot failed: could not read the boot disk\n\n");
return;
}

Looking at AHCI tracing (From the QEMU side), it looks like we set up
the drive correctly, and then never touch the port ever again -- I don't
see an attempted read on QEMU's end.

I'll need to look through SeaBIOS source for hints, I'm not sure right
yet. If anyone is more familiar with the SeaBIOS boot code, maybe they
can give a pointer faster than I'll figure it out myself.

--js




[PATCH v6 1/9] virtio: basic structure for packed ring

2019-10-24 Thread Eugenio Pérez
From: Wei Xu 

Define the packed ring structure according to QEMU nomenclature;
field data (wrap counter, etc.) is also included.

Signed-off-by: Wei Xu 
Signed-off-by: Jason Wang 
Reviewed-by: Jens Freimann 
---
 hw/virtio/virtio.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 527df03bfd..fdac203cdf 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -43,6 +43,13 @@ typedef struct VRingDesc
 uint16_t next;
 } VRingDesc;
 
+typedef struct VRingPackedDesc {
+uint64_t addr;
+uint32_t len;
+uint16_t id;
+uint16_t flags;
+} VRingPackedDesc;
+
 typedef struct VRingAvail
 {
 uint16_t flags;
@@ -81,17 +88,25 @@ typedef struct VRing
 VRingMemoryRegionCaches *caches;
 } VRing;
 
+typedef struct VRingPackedDescEvent {
+uint16_t off_wrap;
+uint16_t flags;
+} VRingPackedDescEvent ;
+
 struct VirtQueue
 {
 VRing vring;
 
 /* Next head to pop */
 uint16_t last_avail_idx;
+bool last_avail_wrap_counter;
 
 /* Last avail_idx read from VQ. */
 uint16_t shadow_avail_idx;
+bool shadow_avail_wrap_counter;
 
 uint16_t used_idx;
+bool used_wrap_counter;
 
 /* Last used index value we have signalled on */
 uint16_t signalled_used;
-- 
2.16.5
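
For orientation, a minimal sketch (not the patch's code) of how an
index/wrap-counter pair such as last_avail_idx and
last_avail_wrap_counter advances on a packed ring:

    #include <stdint.h>
    #include <stdbool.h>

    /* When the index steps past the ring size it wraps to 0 and the
     * counter flips; the counter is what lets old and new descriptors
     * be told apart, since the packed ring is reused in place. */
    static inline void packed_idx_advance(uint16_t *idx, bool *wrap,
                                          uint16_t queue_size)
    {
        if (++*idx >= queue_size) {
            *idx -= queue_size;
            *wrap = !*wrap;
        }
    }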




[PATCH v6 8/9] virtio: add some rcu comments

2019-10-24 Thread Eugenio Pérez
Signed-off-by: Eugenio Pérez 
---
 hw/virtio/virtio.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 3cf12a62c0..9195b08da8 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -278,6 +278,7 @@ static void vring_packed_flags_write(VirtIODevice *vdev,
 address_space_cache_invalidate(cache, off, sizeof(flags));
 }
 
+/* Called within rcu_read_lock().  */
 static VRingMemoryRegionCaches *vring_get_region_caches(struct VirtQueue *vq)
 {
 VRingMemoryRegionCaches *caches = atomic_rcu_read(&vq->vring.caches);
@@ -721,7 +722,6 @@ bool virtqueue_rewind(VirtQueue *vq, unsigned int num)
 return true;
 }
 
-/* Called within rcu_read_lock().  */
 static void virtqueue_split_fill(VirtQueue *vq, const VirtQueueElement *elem,
 unsigned int len, unsigned int idx)
 {
@@ -780,6 +780,7 @@ static void virtqueue_packed_fill_desc(VirtQueue *vq,
vring_packed_desc_write(vq->vdev, &desc, &caches->desc, head, strict_order);
 }
 
+/* Called within rcu_read_lock().  */
 void virtqueue_fill(VirtQueue *vq, const VirtQueueElement *elem,
 unsigned int len, unsigned int idx)
 {
-- 
2.16.5
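
The calling convention the comments document, as a hedged usage sketch
(example_peek_used_idx is made up; the helper names are the ones
touched by this series):

    /* Callers enter an RCU read-side critical section before obtaining
     * the region caches and hold it for as long as they use them. */
    static uint16_t example_peek_used_idx(VirtQueue *vq)
    {
        VRingMemoryRegionCaches *caches;
        uint16_t val;

        rcu_read_lock();
        caches = vring_get_region_caches(vq);
        address_space_read_cached(&caches->used,
                                  offsetof(VRingUsed, idx),
                                  &val, sizeof(val));
        rcu_read_unlock();
        return val;
    }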




Re: [PATCH v2 1/2] coroutine: Add qemu_co_mutex_assert_locked()

2019-10-24 Thread Denis Lunev
On 10/24/19 6:35 PM, Vladimir Sementsov-Ogievskiy wrote:
> 24.10.2019 17:26, Kevin Wolf wrote:
>> Some functions require that the caller holds a certain CoMutex for them
>> to operate correctly. Add a function so that they can assert the lock is
>> really held.
>>
>> Cc: qemu-sta...@nongnu.org
>> Signed-off-by: Kevin Wolf 
> Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Denis V. Lunev 



Re: [PATCH v5 1/4] block: support compressed write at generic layer

2019-10-24 Thread Max Reitz
On 24.10.19 16:07, Andrey Shinkevich wrote:
> 
> 
> On 24/10/2019 16:48, Max Reitz wrote:
>> On 24.10.19 14:56, Andrey Shinkevich wrote:
>>>
>>>
>>> On 24/10/2019 12:34, Max Reitz wrote:
 On 22.10.19 15:53, Andrey Shinkevich wrote:

 [...]

> If the support of COW for compressed writes is found feasible, will it
> make a sense to implement? Then this series will follow.

 Hm, what exactly do you mean by support of COW for compressed writes?

>>>
>>> I spoke in terms of the commit message with the following ID:
>>>
>>> b0b6862e5e1a1394e0ab3d5da94ba8b0da8664e2
>>>
>>> "qcow2: Fail write_compressed when overwriting data"
>>>
>>> "...qcow2_write_compressed() doesn't perform COW as it would have to do..."
>>>
>>> So, I suggest that we implement writing compressed data to the allocated
>>> clusters rather than having qcow2_alloc_compressed_cluster_offset() return
>>> an error. Particularly, when it comes to an NBD server connection failure
>>> while writing a compressed cluster, it may not be rewritten after the
>>> connection is restored.
>>> Are there any issues with that implementation idea?
>>
>> Well, the COW in that commit is meant differently, because it refers to
>> the COW that’s required when writing to a cluster shared by an internal
>> snapshot.
>>
>> OTOH, you could say that all compressed writes to a cluster that is
>> already allocated would need to do COW because we’d always have to fully
>> rewrite that cluster in an RMW cycle.
>>
>> I don’t see how letting qcow2_alloc_compressed_cluster_offset() use the
>> existing cluster would solve the problem, though.  You’d generally need
>> to allocate a new cluster; or attempt to reuse the existing space in a
>> compressed cluster.
>>
>> Max
>>
> 
> Yes, new clusters would be allocated for the compressed RMW and the 
> existing ones would be reused if possible. It seems inefficient, but 
> users are supposed to know what they are doing.
> So, is it worth checking the feasibility of the implementation?

I don’t know, Vladimir said that use case wouldn’t be needed.

I think if the option was there, people would actually use it.  But I
doubt that anyone misses it sufficiently to warrant the effort.

In addition, there’s still VMDK to consider, too.

Max





Re: [PATCH v2 1/2] coroutine: Add qemu_co_mutex_assert_locked()

2019-10-24 Thread Vladimir Sementsov-Ogievskiy
24.10.2019 17:26, Kevin Wolf wrote:
> Some functions require that the caller holds a certain CoMutex for them
> to operate correctly. Add a function so that they can assert the lock is
> really held.
> 
> Cc: qemu-sta...@nongnu.org
> Signed-off-by: Kevin Wolf 

Reviewed-by: Vladimir Sementsov-Ogievskiy 

> ---
>   include/qemu/coroutine.h | 15 +++
>   1 file changed, 15 insertions(+)
> 
> diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
> index 9801e7f5a4..f4843b5f59 100644
> --- a/include/qemu/coroutine.h
> +++ b/include/qemu/coroutine.h
> @@ -167,6 +167,21 @@ void coroutine_fn qemu_co_mutex_lock(CoMutex *mutex);
>*/
>   void coroutine_fn qemu_co_mutex_unlock(CoMutex *mutex);
>   
> +/**
> + * Assert that the current coroutine holds @mutex.
> + */
> +static inline coroutine_fn void qemu_co_mutex_assert_locked(CoMutex *mutex)
> +{
> +/*
> + * mutex->holder doesn't need any synchronisation if the assertion holds
> + * true because the mutex protects it. If it doesn't hold true, we still
> + * don't mind if another thread takes or releases mutex behind our back,
> + * because the condition will be false no matter whether we read NULL or
> + * the pointer for any other coroutine.
> + */
> +assert(atomic_read(&mutex->locked) &&
> +   mutex->holder == qemu_coroutine_self());
> +}
>   
>   /**
>* CoQueues are a mechanism to queue coroutines in order to continue 
> executing
> 


-- 
Best regards,
Vladimir
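
As a usage sketch (the helper below is hypothetical, not from the
series):

    /* A function that documents and enforces its locking contract:
     * the caller must already hold s->lock. */
    static int coroutine_fn frobnicate_metadata(BDRVQcow2State *s)
    {
        qemu_co_mutex_assert_locked(&s->lock);
        /* ... safely touch state protected by s->lock ... */
        return 0;
    }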



[PATCH v7 4/4] colo: Update Documentation for continuous replication

2019-10-24 Thread Lukas Straub
Document the QEMU command line and QMP commands for continuous replication.

Signed-off-by: Lukas Straub 
---
 docs/COLO-FT.txt   | 224 +++--
 docs/block-replication.txt |  28 +++--
 2 files changed, 184 insertions(+), 68 deletions(-)

diff --git a/docs/COLO-FT.txt b/docs/COLO-FT.txt
index ad24680d13..c8e1740935 100644
--- a/docs/COLO-FT.txt
+++ b/docs/COLO-FT.txt
@@ -145,81 +145,189 @@ The diagram just shows the main qmp command, you can get the detail
 in test procedure.
 
 == Test procedure ==
-1. Startup qemu
-Primary:
-# qemu-system-x86_64 -accel kvm -m 2048 -smp 2 -qmp stdio -name primary \
-  -device piix3-usb-uhci -vnc :7 \
-  -device usb-tablet -netdev tap,id=hn0,vhost=off \
-  -device virtio-net-pci,id=net-pci0,netdev=hn0 \
-  -drive if=virtio,id=primary-disk0,driver=quorum,read-pattern=fifo,vote-threshold=1,\
- children.0.file.filename=1.raw,\
- children.0.driver=raw -S
-Secondary:
-# qemu-system-x86_64 -accel kvm -m 2048 -smp 2 -qmp stdio -name secondary \
-  -device piix3-usb-uhci -vnc :7 \
-  -device usb-tablet -netdev tap,id=hn0,vhost=off \
-  -device virtio-net-pci,id=net-pci0,netdev=hn0 \
-  -drive if=none,id=secondary-disk0,file.filename=1.raw,driver=raw,node-name=node0 \
-  -drive if=virtio,id=active-disk0,driver=replication,mode=secondary,\
- file.driver=qcow2,top-id=active-disk0,\
- file.file.filename=/mnt/ramfs/active_disk.img,\
- file.backing.driver=qcow2,\
- file.backing.file.filename=/mnt/ramfs/hidden_disk.img,\
- file.backing.backing=secondary-disk0 \
-  -incoming tcp:0:
-
-2. On Secondary VM's QEMU monitor, issue command
+Note: Here we are running both instances on the same host for testing;
+change the IP addresses if you want to run it on two hosts. Initially,
+127.0.0.1 is the primary host and 127.0.0.2 is the secondary host.
+
+== Startup qemu ==
+1. Primary:
+Note: Initially, $imagefolder/primary.qcow2 needs to be copied to all hosts.
+You don't need to change any IPs here, because 0.0.0.0 listens on any
+interface. The chardevs with 127.0.0.1 IPs loop back to the local qemu
+instance.
+
+# imagefolder="/mnt/vms/colo-test-primary"
+
+# qemu-system-x86_64 -enable-kvm -cpu qemu64,+kvmclock -m 512 -smp 1 -qmp stdio \
+   -device piix3-usb-uhci -device usb-tablet -name primary \
+   -netdev tap,id=hn0,vhost=off,helper=/usr/lib/qemu/qemu-bridge-helper \
+   -device rtl8139,id=e0,netdev=hn0 \
+   -chardev socket,id=mirror0,host=0.0.0.0,port=9003,server,nowait \
+   -chardev socket,id=compare1,host=0.0.0.0,port=9004,server,wait \
+   -chardev socket,id=compare0,host=127.0.0.1,port=9001,server,nowait \
+   -chardev socket,id=compare0-0,host=127.0.0.1,port=9001 \
+   -chardev socket,id=compare_out,host=127.0.0.1,port=9005,server,nowait \
+   -chardev socket,id=compare_out0,host=127.0.0.1,port=9005 \
+   -object filter-mirror,id=m0,netdev=hn0,queue=tx,outdev=mirror0 \
+   -object filter-redirector,netdev=hn0,id=redire0,queue=rx,indev=compare_out \
+   -object filter-redirector,netdev=hn0,id=redire1,queue=rx,outdev=compare0 \
+   -object iothread,id=iothread1 \
+   -object colo-compare,id=comp0,primary_in=compare0-0,secondary_in=compare1,\
+outdev=compare_out0,iothread=iothread1 \
+   -drive if=ide,id=colo-disk0,driver=quorum,read-pattern=fifo,vote-threshold=1,\
+children.0.file.filename=$imagefolder/primary.qcow2,children.0.driver=qcow2 -S
+
+2. Secondary:
+Note: Active and hidden images need to be created only once and the
+size should be the same as primary.qcow2. Again, you don't need to change
+any IPs here, except for the $primary_ip variable.
+
+# imagefolder="/mnt/vms/colo-test-secondary"
+# primary_ip=127.0.0.1
+
+# qemu-img create -f qcow2 $imagefolder/secondary-active.qcow2 10G
+
+# qemu-img create -f qcow2 $imagefolder/secondary-hidden.qcow2 10G
+
+# qemu-system-x86_64 -enable-kvm -cpu qemu64,+kvmclock -m 512 -smp 1 -qmp stdio \
+   -device piix3-usb-uhci -device usb-tablet -name secondary \
+   -netdev tap,id=hn0,vhost=off,helper=/usr/lib/qemu/qemu-bridge-helper \
+   -device rtl8139,id=e0,netdev=hn0 \
+   -chardev socket,id=red0,host=$primary_ip,port=9003,reconnect=1 \
+   -chardev socket,id=red1,host=$primary_ip,port=9004,reconnect=1 \
+   -object filter-redirector,id=f1,netdev=hn0,queue=tx,indev=red0 \
+   -object filter-redirector,id=f2,netdev=hn0,queue=rx,outdev=red1 \
+   -object filter-rewriter,id=rew0,netdev=hn0,queue=all \
+   -drive if=none,id=parent0,file.filename=$imagefolder/primary.qcow2,driver=qcow2 \
+   -drive if=none,id=childs0,driver=replication,mode=secondary,file.driver=qcow2,\
+top-id=colo-disk0,file.file.filename=$imagefolder/secondary-active.qcow2,\
+file.backing.driver=qcow2,file.backing.file.filename=$imagefolder/secondary-hidden.qcow2,\
+file.backing.backing=parent0 \
+   -drive if=ide,id=colo-disk0,driver=quorum,read-pattern=fifo,vote-threshold=1,\
+children.0=childs0 \
+   -incoming tcp:0.0.0.0:9998
+
+
+3. On Secondary VM's QEMU 

Re: [PATCH v5 1/4] block: support compressed write at generic layer

2019-10-24 Thread Andrey Shinkevich


On 24/10/2019 16:48, Max Reitz wrote:
> On 24.10.19 14:56, Andrey Shinkevich wrote:
>>
>>
>> On 24/10/2019 12:34, Max Reitz wrote:
>>> On 22.10.19 15:53, Andrey Shinkevich wrote:
>>>
>>> [...]
>>>
 If the support of COW for compressed writes is found feasible, will it
 make a sense to implement? Then this series will follow.
>>>
>>> Hm, what exactly do you mean by support of COW for compressed writes?
>>>
>>
>> I spoke in terms of the commit message with the following ID:
>>
>> b0b6862e5e1a1394e0ab3d5da94ba8b0da8664e2
>>
>> "qcow2: Fail write_compressed when overwriting data"
>>
>> "...qcow2_write_compressed() doesn't perform COW as it would have to do..."
>>
>> So, I suggest that we implement writing compressed data to the allocated
>> clusters rather than having qcow2_alloc_compressed_cluster_offset() return
>> an error. Particularly, when it comes to an NBD server connection failure
>> while writing a compressed cluster, it may not be rewritten after the
>> connection is restored.
>> Are there any issues with that implementation idea?
> 
> Well, the COW in that commit is meant differently, because it refers to
> the COW that’s required when writing to a cluster shared by an internal
> snapshot.
> 
> OTOH, you could say that all compressed writes to a cluster that is
> already allocated would need to do COW because we’d always have to fully
> rewrite that cluster in an RMW cycle.
> 
> I don’t see how letting qcow2_alloc_compressed_cluster_offset() use the
> existing cluster would solve the problem, though.  You’d generally need
> to allocate a new cluster; or attempt to reuse the existing space in a
> compressed cluster.
> 
> Max
> 

Yes, new clusters would be allocated for the compressed RMW and the 
existing ones would be reused if possible. It seems inefficient, but 
users are supposed to know what they are doing.
So, is it worth checking the feasibility of the implementation?

Andrey



[PATCH v7 2/4] tests/test-replication.c: Add test for for secondary node continuing replication

2019-10-24 Thread Lukas Straub
This simulates the case that happens when we resume COLO after failover.

Signed-off-by: Lukas Straub 
---
 tests/test-replication.c | 52 
 1 file changed, 52 insertions(+)

diff --git a/tests/test-replication.c b/tests/test-replication.c
index f085d1993a..8e18ecd998 100644
--- a/tests/test-replication.c
+++ b/tests/test-replication.c
@@ -489,6 +489,56 @@ static void test_secondary_stop(void)
 teardown_secondary();
 }

+static void test_secondary_continuous_replication(void)
+{
+BlockBackend *top_blk, *local_blk;
+Error *local_err = NULL;
+
+top_blk = start_secondary();
+replication_start_all(REPLICATION_MODE_SECONDARY, &local_err);
+g_assert(!local_err);
+
+/* write 0x22 to s_local_disk (IMG_SIZE / 2, IMG_SIZE) */
+local_blk = blk_by_name(S_LOCAL_DISK_ID);
+test_blk_write(local_blk, 0x22, IMG_SIZE / 2, IMG_SIZE / 2, false);
+
+/* replication will backup s_local_disk to s_hidden_disk */
+test_blk_read(top_blk, 0x11, IMG_SIZE / 2,
+  IMG_SIZE / 2, 0, IMG_SIZE, false);
+
+/* write 0x33 to s_active_disk (0, IMG_SIZE / 2) */
+test_blk_write(top_blk, 0x33, 0, IMG_SIZE / 2, false);
+
+/* do failover (active commit) */
+replication_stop_all(true, &local_err);
+g_assert(!local_err);
+
+/* it should ignore all requests from now on */
+
+/* start after failover */
+replication_start_all(REPLICATION_MODE_PRIMARY, &local_err);
+g_assert(!local_err);
+
+/* checkpoint */
+replication_do_checkpoint_all(&local_err);
+g_assert(!local_err);
+
+/* stop */
+replication_stop_all(true, &local_err);
+g_assert(!local_err);
+
+/* read from s_local_disk (0, IMG_SIZE / 2) */
+test_blk_read(top_blk, 0x33, 0, IMG_SIZE / 2,
+  0, IMG_SIZE / 2, false);
+
+
+/* read from s_local_disk (IMG_SIZE / 2, IMG_SIZE) */
+test_blk_read(top_blk, 0x22, IMG_SIZE / 2,
+  IMG_SIZE / 2, 0, IMG_SIZE, false);
+
+teardown_secondary();
+}
+
 static void test_secondary_do_checkpoint(void)
 {
 BlockBackend *top_blk, *local_blk;
@@ -584,6 +634,8 @@ int main(int argc, char **argv)
 g_test_add_func("/replication/secondary/write", test_secondary_write);
 g_test_add_func("/replication/secondary/start", test_secondary_start);
 g_test_add_func("/replication/secondary/stop",  test_secondary_stop);
+g_test_add_func("/replication/secondary/continuous_replication",
+test_secondary_continuous_replication);
 g_test_add_func("/replication/secondary/do_checkpoint",
 test_secondary_do_checkpoint);
 g_test_add_func("/replication/secondary/get_error_all",
--
2.20.1




[PATCH v7 3/4] net/filter.c: Add Options to insert filters anywhere in the filter list

2019-10-24 Thread Lukas Straub
To switch the Secondary to Primary, we need to insert new filters
before the filter-rewriter.

Add the options insert= and position= to be able to insert filters
anywhere in the filter list.

position should be "head" or "tail" to insert at the head or
tail of the filter list, or it should be "id=<id>" to specify
the id of another filter.
insert should be either "before" or "behind" to specify where to
insert the new filter relative to the one specified with position.

Signed-off-by: Lukas Straub 
Reviewed-by: Zhang Chen 
---
 include/net/filter.h |  2 +
 net/filter.c | 92 +++-
 qemu-options.hx  | 31 ---
 3 files changed, 119 insertions(+), 6 deletions(-)
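
For example, a hypothetical invocation inserting a mirror filter before
an existing filter-rewriter (all ids are made up):

    -object filter-rewriter,id=rew0,netdev=hn0,queue=all \
    -object filter-mirror,id=m0,netdev=hn0,queue=tx,outdev=mir0,\
        insert=before,position=id=rew0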

diff --git a/include/net/filter.h b/include/net/filter.h
index 49da666ac0..22a723305b 100644
--- a/include/net/filter.h
+++ b/include/net/filter.h
@@ -62,6 +62,8 @@ struct NetFilterState {
 NetClientState *netdev;
 NetFilterDirection direction;
 bool on;
+char *position;
+bool insert_before_flag;
 QTAILQ_ENTRY(NetFilterState) next;
 };

diff --git a/net/filter.c b/net/filter.c
index 28d1930db7..cd2ef9e979 100644
--- a/net/filter.c
+++ b/net/filter.c
@@ -171,11 +171,47 @@ static void netfilter_set_status(Object *obj, const char *str, Error **errp)
 }
 }

+static char *netfilter_get_position(Object *obj, Error **errp)
+{
+NetFilterState *nf = NETFILTER(obj);
+
+return g_strdup(nf->position);
+}
+
+static void netfilter_set_position(Object *obj, const char *str, Error **errp)
+{
+NetFilterState *nf = NETFILTER(obj);
+
+nf->position = g_strdup(str);
+}
+
+static char *netfilter_get_insert(Object *obj, Error **errp)
+{
+NetFilterState *nf = NETFILTER(obj);
+
+return nf->insert_before_flag ? g_strdup("before") : g_strdup("behind");
+}
+
+static void netfilter_set_insert(Object *obj, const char *str, Error **errp)
+{
+NetFilterState *nf = NETFILTER(obj);
+
+if (strcmp(str, "before") && strcmp(str, "behind")) {
+error_setg(errp, "Invalid value for netfilter insert, "
+ "should be 'before' or 'behind'");
+return;
+}
+
+nf->insert_before_flag = !strcmp(str, "before");
+}
+
 static void netfilter_init(Object *obj)
 {
 NetFilterState *nf = NETFILTER(obj);

 nf->on = true;
+nf->insert_before_flag = false;
+nf->position = g_strdup("tail");

 object_property_add_str(obj, "netdev",
 netfilter_get_netdev_id, netfilter_set_netdev_id,
@@ -187,11 +223,18 @@ static void netfilter_init(Object *obj)
 object_property_add_str(obj, "status",
 netfilter_get_status, netfilter_set_status,
 NULL);
+object_property_add_str(obj, "position",
+netfilter_get_position, netfilter_set_position,
+NULL);
+object_property_add_str(obj, "insert",
+netfilter_get_insert, netfilter_set_insert,
+NULL);
 }

 static void netfilter_complete(UserCreatable *uc, Error **errp)
 {
 NetFilterState *nf = NETFILTER(uc);
+NetFilterState *position = NULL;
 NetClientState *ncs[MAX_QUEUE_NUM];
 NetFilterClass *nfc = NETFILTER_GET_CLASS(uc);
 int queues;
@@ -219,6 +262,41 @@ static void netfilter_complete(UserCreatable *uc, Error **errp)
 return;
 }

+if (strcmp(nf->position, "head") && strcmp(nf->position, "tail")) {
+Object *container;
+Object *obj;
+char *position_id;
+
+if (!g_str_has_prefix(nf->position, "id=")) {
+error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "position",
+   "'head', 'tail' or 'id=<id>'");
+return;
+}
+
+/* get the id from the string */
+position_id = g_strndup(nf->position + 3, strlen(nf->position) - 3);
+
+/* Search for the position to insert before/behind */
+container = object_get_objects_root();
+obj = object_resolve_path_component(container, position_id);
+if (!obj) {
+error_setg(errp, "filter '%s' not found", position_id);
+g_free(position_id);
+return;
+}
+
+position = NETFILTER(obj);
+
+if (position->netdev != ncs[0]) {
+error_setg(errp, "filter '%s' belongs to a different netdev",
+position_id);
+g_free(position_id);
+return;
+}
+
+g_free(position_id);
+}
+
 nf->netdev = ncs[0];

 if (nfc->setup) {
@@ -228,7 +306,18 @@ static void netfilter_complete(UserCreatable *uc, Error **errp)
 return;
 }
 }
-QTAILQ_INSERT_TAIL(&nf->netdev->filters, nf, next);
+
+if (position) {
+if (nf->insert_before_flag) {
+QTAILQ_INSERT_BEFORE(position, nf, next);
+} else {
+QTAILQ_INSERT_AFTER(&nf->netdev->filters, position, nf, next);
+}

[PATCH v7 1/4] block/replication.c: Ignore requests after failover

2019-10-24 Thread Lukas Straub
After failover, the Secondary side of replication shouldn't change state,
because it now functions as our primary disk.

In replication_start, replication_do_checkpoint and replication_stop, ignore
the request if the current state is BLOCK_REPLICATION_DONE (successful
failover) or BLOCK_REPLICATION_FAILOVER (failover in progress, i.e. currently
merging the active and hidden images into the base image).

Signed-off-by: Lukas Straub 
Reviewed-by: Zhang Chen 
---
 block/replication.c | 35 ++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/block/replication.c b/block/replication.c
index 3d4dedddfc..78234d1a04 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -454,6 +454,17 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
 aio_context_acquire(aio_context);
 s = bs->opaque;

+if (s->stage == BLOCK_REPLICATION_DONE ||
+s->stage == BLOCK_REPLICATION_FAILOVER) {
+/*
+ * This case happens when a secondary is promoted to primary.
+ * Ignore the request because the secondary side of replication
+ * doesn't have to do anything anymore.
+ */
+aio_context_release(aio_context);
+return;
+}
+
 if (s->stage != BLOCK_REPLICATION_NONE) {
 error_setg(errp, "Block replication is running or done");
 aio_context_release(aio_context);
@@ -577,6 +588,17 @@ static void replication_do_checkpoint(ReplicationState *rs, Error **errp)
 aio_context_acquire(aio_context);
 s = bs->opaque;

+if (s->stage == BLOCK_REPLICATION_DONE ||
+s->stage == BLOCK_REPLICATION_FAILOVER) {
+/*
+ * This case happens when a secondary was promoted to primary.
+ * Ignore the request because the secondary side of replication
+ * doesn't have to do anything anymore.
+ */
+aio_context_release(aio_context);
+return;
+}
+
 if (s->mode == REPLICATION_MODE_SECONDARY) {
 secondary_do_checkpoint(s, errp);
 }
@@ -593,7 +615,7 @@ static void replication_get_error(ReplicationState *rs, Error **errp)
 aio_context_acquire(aio_context);
 s = bs->opaque;

-if (s->stage != BLOCK_REPLICATION_RUNNING) {
+if (s->stage == BLOCK_REPLICATION_NONE) {
 error_setg(errp, "Block replication is not running");
 aio_context_release(aio_context);
 return;
@@ -635,6 +657,17 @@ static void replication_stop(ReplicationState *rs, bool failover, Error **errp)
 aio_context_acquire(aio_context);
 s = bs->opaque;

+if (s->stage == BLOCK_REPLICATION_DONE ||
+s->stage == BLOCK_REPLICATION_FAILOVER) {
+/*
+ * This case happens when a secondary was promoted to primary.
+ * Ignore the request because the secondary side of replication
+ * doesn't have to do anything anymore.
+ */
+aio_context_release(aio_context);
+return;
+}
+
 if (s->stage != BLOCK_REPLICATION_RUNNING) {
 error_setg(errp, "Block replication is not running");
 aio_context_release(aio_context);
--
2.20.1




[PATCH v2 2/2] qcow2: Fix corruption bug in qcow2_detect_metadata_preallocation()

2019-10-24 Thread Kevin Wolf
qcow2_detect_metadata_preallocation() calls qcow2_get_refcount() which
requires s->lock to be taken to protect its accesses to the refcount
table and refcount blocks. However, nothing in this code path actually
took the lock. This could cause the same cache entry to be used by two
requests at the same time, for different tables at different offsets,
resulting in image corruption.

As it would be preferable to base the detection on consistent data (even
though it's just heuristics), let's take the lock not only around the
qcow2_get_refcount() calls, but around the whole function.

This patch takes the lock in qcow2_co_block_status() earlier and asserts
in qcow2_detect_metadata_preallocation() that we hold the lock.

Fixes: 69f47505ee66afaa513305de0c1895a224e52c45
Cc: qemu-sta...@nongnu.org
Reported-by: Michael Weiser 
Signed-off-by: Kevin Wolf 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2-refcount.c | 2 ++
 block/qcow2.c  | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index ef965d7895..0d64bf5a5e 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -3455,6 +3455,8 @@ int qcow2_detect_metadata_preallocation(BlockDriverState *bs)
 int64_t i, end_cluster, cluster_count = 0, threshold;
 int64_t file_length, real_allocation, real_clusters;
 
+qemu_co_mutex_assert_locked(&s->lock);
+
 file_length = bdrv_getlength(bs->file->bs);
 if (file_length < 0) {
 return file_length;
diff --git a/block/qcow2.c b/block/qcow2.c
index 8b05933565..0bc69e6996 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1916,6 +1916,8 @@ static int coroutine_fn qcow2_co_block_status(BlockDriverState *bs,
 unsigned int bytes;
 int status = 0;
 
+qemu_co_mutex_lock(&s->lock);
+
 if (!s->metadata_preallocation_checked) {
 ret = qcow2_detect_metadata_preallocation(bs);
 s->metadata_preallocation = (ret == 1);
@@ -1923,7 +1925,6 @@ static int coroutine_fn qcow2_co_block_status(BlockDriverState *bs,
 }
 
 bytes = MIN(INT_MAX, count);
-qemu_co_mutex_lock(&s->lock);
 ret = qcow2_get_cluster_offset(bs, offset, &bytes, &cluster_offset);
 qemu_co_mutex_unlock(&s->lock);
 if (ret < 0) {
-- 
2.20.1




Re: [PATCH v5 1/4] block: support compressed write at generic layer

2019-10-24 Thread Max Reitz
On 24.10.19 14:56, Andrey Shinkevich wrote:
> 
> 
> On 24/10/2019 12:34, Max Reitz wrote:
>> On 22.10.19 15:53, Andrey Shinkevich wrote:
>>
>> [...]
>>
>>> If the support of COW for compressed writes is found feasible, will it
>>> make a sense to implement? Then this series will follow.
>>
>> Hm, what exactly do you mean by support of COW for compressed writes?
>>
> 
> I spoke in terms of the commit message with the following ID:
> 
> b0b6862e5e1a1394e0ab3d5da94ba8b0da8664e2
> 
> "qcow2: Fail write_compressed when overwriting data"
> 
> "...qcow2_write_compressed() doesn't perform COW as it would have to do..."
> 
> So, I suggest that we implement writing compressed data to the allocated 
> clusters rather than having qcow2_alloc_compressed_cluster_offset() return 
> an error. Particularly, when it comes to an NBD server connection failure 
> while writing a compressed cluster, it may not be rewritten after the 
> connection is restored.
> Are there any issues with that implementation idea?

Well, the COW in that commit is meant differently, because it refers to
the COW that’s required when writing to a cluster shared by an internal
snapshot.

OTOH, you could say that all compressed writes to a cluster that is
already allocated would need to do COW because we’d always have to fully
rewrite that cluster in an RMW cycle.

I don’t see how letting qcow2_alloc_compressed_cluster_offset() use the
existing cluster would solve the problem, though.  You’d generally need
to allocate a new cluster; or attempt to reuse the existing space in a
compressed cluster.

Max



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v5 1/4] block: support compressed write at generic layer

2019-10-24 Thread Andrey Shinkevich


On 24/10/2019 16:48, Max Reitz wrote:
> On 24.10.19 14:56, Andrey Shinkevich wrote:
>>
>>
>> On 24/10/2019 12:34, Max Reitz wrote:
>>> On 22.10.19 15:53, Andrey Shinkevich wrote:
>>>
>>> [...]
>>>
>>>> If the support of COW for compressed writes is found feasible, will it
>>>> make sense to implement? Then this series will follow.
>>>
>>> Hm, what exactly do you mean by support of COW for compressed writes?
>>>
>>
>> I spoke in terms of the commit message with the following ID:
>>
>> b0b6862e5e1a1394e0ab3d5da94ba8b0da8664e2
>>
>> "qcow2: Fail write_compressed when overwriting data"
>>
>> "...qcow2_write_compressed() doesn't perform COW as it would have to do..."
>>
>> So, I suggest that we implement writing compressed data to the allocated
>> clusters rather than having qcow2_alloc_compressed_cluster_offset() return
>> an error. Particularly, when an NBD server connection fails while writing
>> a compressed cluster, the cluster may not be rewritable after the
>> connection is restored.
>> Are there any issues with that implementation idea?
> 
> Well, the COW in that commit is meant differently, because it refers to
> the COW that’s required when writing to a cluster shared by an internal
> snapshot.
> 
> OTOH, you could say that all compressed writes to a cluster that is
> already allocated would need to do COW because we’d always have to fully
> rewrite that cluster in an RMW cycle.
> 
> I don’t see how letting qcow2_alloc_compressed_cluster_offset() use the
> existing cluster would solve the problem, though.  You’d generally need
> to allocate a new cluster; or attempt to reuse the existing space in a
> compressed cluster.
> 
> Max
> 

The idea was to avoid the following error:

$ ./qemu-img create -f qcow2 -o size=10M ./image.qcow2
Formatting './image.qcow2', fmt=qcow2 size=10485760 cluster_size=65536 
lazy_refcounts=off refcount_bits=16
[andrey@dhcp-172-16-25-136 qemu]$ git diff --name-only
nbd/server.c
$ ./qemu-io -c "write -P 0x12 0 64k" ./image.qcow2
wrote 65536/65536 bytes at offset 0
64 KiB, 1 ops; 0.0152 sec (4.101 MiB/sec and 65.6082 ops/sec)
$ ./qemu-io -c "write -c -P 0x12 0 64k" ./image.qcow2
write failed: Input/output error

If it incurs a high cost, we can choose the filter option as discussed 
above.

Andrey



Re: [PATCH 2/3] qcow2: Assert that qcow2_cache_get() callers hold s->lock

2019-10-24 Thread Vladimir Sementsov-Ogievskiy
24.10.2019 13:57, Kevin Wolf wrote:
> Am 24.10.2019 um 12:01 hat Denis Lunev geschrieben:
>> On 10/23/19 6:26 PM, Kevin Wolf wrote:
>>> qcow2_cache_do_get() requires that s->lock is locked because it can
>>> yield between picking a cache entry and actually taking ownership of it
>>> by setting offset and increasing the reference count.
>>>
>>> Add an assertion to make sure the caller really holds the lock. The
>>> function can be called outside of coroutine context, where bdrv_pread
>>> and flushes become synchronous operations. The lock cannot and need not
>>> be taken in this case.
>>>
>>> Signed-off-by: Kevin Wolf 
>>> ---
>>>   block/qcow2-cache.c | 5 +
>>>   1 file changed, 5 insertions(+)
>>>
>>> diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
>>> index d29b038a67..75b13dad99 100644
>>> --- a/block/qcow2-cache.c
>>> +++ b/block/qcow2-cache.c
>>> @@ -327,6 +327,9 @@ static int qcow2_cache_do_get(BlockDriverState *bs, 
>>> Qcow2Cache *c,
>>>   int min_lru_index = -1;
>>>   
>>>   assert(offset != 0);
>>> +if (qemu_in_coroutine()) {
>>> +qemu_co_mutex_assert_locked(&s->lock);
>>> +}
>>
>> That does not look good to me. If this really requires the lock, we should
>> check for the lock always. On the other hand, we could face a missed
>> lock outside of coroutine context.
> 
> As the commit message explains, outside of coroutine context, we can't
> yield and bdrv_pread and bdrv_flush become synchronous operations
> instead, so there is nothing else that we need to protect against.
> 

Recently we discussed similar problems with the block-dirty-bitmap-* QMP
commands, which wanted to update qcow2 bitmap metadata from non-coroutine context.
"qcow2 lock"
<135df452-397a-30bb-7518-2184fa597...@virtuozzo.com>
https://lists.gnu.org/archive/html/qemu-devel/2019-09/msg01419.html

And, as I understand it, the correct way is to convert to a coroutine and lock
the mutex appropriately. Alternatively, we can take the AioContext lock and be
in a drained section to avoid parallel requests accessing the critical section.
Should we assert such conditions in the !qemu_in_coroutine() case?

--
Final commit to fix bug about bitmaps:

commit d2c3080e41fd2c9bc36c996cc9d33804462ba803
Author: Vladimir Sementsov-Ogievskiy 
Date:   Fri Sep 20 11:25:43 2019 +0300

 block/qcow2: proper locking on bitmap add/remove paths
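
(A minimal sketch of that conversion pattern, with hypothetical names;
qemu_coroutine_create(), qemu_coroutine_enter() and BDRV_POLL_WHILE() are
the real primitives used by the fix above:)

typedef struct UpdateCo {
    BlockDriverState *bs;
    bool done;
} UpdateCo;                            /* hypothetical helper struct */

static void coroutine_fn update_entry(void *opaque)
{
    UpdateCo *uc = opaque;
    BDRVQcow2State *s = uc->bs->opaque;

    qemu_co_mutex_lock(&s->lock);      /* safe to touch the qcow2 caches now */
    /* ... update the metadata here ... */
    qemu_co_mutex_unlock(&s->lock);
    uc->done = true;
}

static void update_metadata_locked(BlockDriverState *bs)
{
    UpdateCo uc = { .bs = bs, .done = false };

    if (qemu_in_coroutine()) {
        update_entry(&uc);
    } else {
        Coroutine *co = qemu_coroutine_create(update_entry, &uc);
        qemu_coroutine_enter(co);
        BDRV_POLL_WHILE(bs, !uc.done);
    }
}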




-- 
Best regards,
Vladimir



Re: [RFC PATCH 00/18] Add qemu-storage-daemon

2019-10-24 Thread Vladimir Sementsov-Ogievskiy
Hi!

This reflects our idea of using the qemu binary instead of qemu-img for doing
block-layer operations offline.

What is the practical difference between qemu-storage-daemon and starting
the qemu binary in a stopped state?
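
(For context, "the qemu binary in a stopped state" means something along these
lines; the flags are real, the disk name is illustrative, and -S keeps the
guest paused so only QMP-driven block operations run:)

$ qemu-system-x86_64 -S -nodefaults -display none -qmp stdio \
    -blockdev driver=file,node-name=file0,filename=disk.img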

17.10.2019 16:01, Kevin Wolf wrote:
> This series adds a new tool 'qemu-storage-daemon', which can be used to
> export and perform operations on block devices. There is some overlap
> between qemu-img/qemu-nbd and the new qemu-storage-daemon, but there are
> a few important differences:
> 
> * The qemu-storage-daemon has QMP support. The command set is obviously
>restricted compared to the system emulator because there is no guest,
>but all of the block operations are present.
> 
>This means that it can access advanced options or operations that the
>qemu-img command line doesn't expose. For example, blockdev-create is
>a lot more powerful than 'qemu-img create', and qemu-storage-daemon
>allows to execute it without starting a guest.

The qemu binary can do that too, with the -S option.

> 
>Compared to qemu-nbd it means that, for example, block jobs can now be
>executed on the server side, and backing chains shared by multiple VMs
>can be modified this way.
> 
> * The existing tools all have a separately invented one-off syntax for
>the job at hand, which usually comes with restrictions compared to the
>system emulator. qemu-storage-daemon shares the same syntax with the
>system emulator for most options and prefers QAPI based interfaces
>where possible (such as --blockdev), so it should be easy to make use
>of in libvirt.
> 
> * While this series implements only NBD exports, the storage daemon is
>intended to serve multiple protocols and its syntax reflects this. In
>the past, we had proposals to add new one-off tools for exporting over
>new protocols like FUSE or TCMU.
> 
>With a generic storage daemon, additional export methods have a home
>without adding a new tool for each of them.
> 
> I'm posting this as an RFC mainly for two reasons:
> 
> 1. The monitor integration, which could be argued to be a little hackish
> (some generated QAPI source files are built from a separate QAPI
> schema, but the per-module ones are taken from the system emulator)
> and Markus will want to have a closer look there. But from the IRC
> discussions we had, we seem to agree on the general approach here.
> 
> 2. I'm not completely sure if the command line syntax is the final
> version that we want to support long-term. Many options directly use
> QAPI visitors (--blockdev, --export, --nbd-server) and should be
> fine. However, others use QemuOpts (--chardev, --monitor, --object).
> 
> This is the same as in the system emulator, so we wouldn't be adding
> a new problem, but as there was talk about QAPIfying the command
> line, and I wouldn't want later syntax changes or adding lots of
> compatibility code to a new tool, I thought we should probably
> discuss whether QAPIfying from the start would be an option.
> 
> I would like to have something merged for 4.2, but I'm considering
> marking the whole tool as experimental and unstable ABI by requiring a
> command line option like --experimental for the tool to even start if we
> know that we want to change some things later.
> 
> This series is based on:
> * [PATCH] blockdev: Use error_report() in hmp_commit()
> * [PATCH 0/7] qapi: Cleanups and test speedup
> 
> Based-on: <20191015123932.12214-1-kw...@redhat.com>
> Based-on: <20191001191514.11208-1-arm...@redhat.com>
> 
> Kevin Wolf (18):
>qemu-storage-daemon: Add barebone tool
>qemu-storage-daemon: Add --object option
>stubs: Add arch_type
>stubs: Add blk_by_qdev_id()
>qemu-storage-daemon: Add --blockdev option
>qemu-storage-daemon: Add --nbd-server option
>blockdev-nbd: Boxed argument type for nbd-server-add
>qemu-storage-daemon: Add --export option
>qemu-storage-daemon: Add main loop
>qemu-storage-daemon: Add --chardev option
>monitor: Move monitor option parsing to monitor/monitor.c
>stubs: Update monitor stubs for qemu-storage-daemon
>qapi: Create module 'monitor'
>monitor: Create monitor/qmp-cmds-monitor.c
>qapi: Support empty modules
>qapi: Create 'pragma' module
>monitor: Move qmp_query_qmp_schema to qmp-cmds-monitor.c
>qemu-storage-daemon: Add --monitor option
> 
>   qapi/block.json  |  65 -
>   qapi/misc.json   | 212 ---
>   qapi/monitor.json| 218 +++
>   qapi/pragma.json |  24 ++
>   qapi/qapi-schema.json|  26 +-
>   storage-daemon/qapi/qapi-schema.json |  15 ++
>   configure|   2 +-
>   include/block/nbd.h  |   1 +
>   include/monitor/monitor.h|   4 +
>   include/sysemu/arch_init.h   |   2 

Re: [RFC PATCH 01/18] qemu-storage-daemon: Add barebone tool

2019-10-24 Thread Eric Blake

On 10/17/19 8:01 AM, Kevin Wolf wrote:

This adds a new binary qemu-storage-daemon that doesn't yet do more than
some typical initialisation for tools and parsing the basic command
options --version, --help and --trace.

Signed-off-by: Kevin Wolf 
---
  configure |   2 +-
  qemu-storage-daemon.c | 141 ++
  Makefile  |   1 +
  3 files changed, 143 insertions(+), 1 deletion(-)
  create mode 100644 qemu-storage-daemon.c

diff --git a/configure b/configure



+++ b/qemu-storage-daemon.c
@@ -0,0 +1,141 @@
+/*
+ * QEMU storage daemon
+ *
+ * Copyright (c) 2019 Kevin Wolf 
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.


Is there an intent to license this binary as BSD (by restricting sources 
that can be linked in), or is it going to end up as GPLv2+ for other 
reasons? If the latter, would it be better to license this file GPLv2+?




+ */
+
+#include "qemu/osdep.h"
+
+#include "block/block.h"
+#include "crypto/init.h"
+
+#include "qapi/error.h"
+#include "qemu-common.h"
+#include "qemu-version.h"
+#include "qemu/config-file.h"
+#include "qemu/error-report.h"
+#include "qemu/log.h"
+#include "qemu/main-loop.h"
+#include "qemu/module.h"
+
+#include "trace/control.h"
+
+#include <getopt.h>


Shouldn't system headers appear right after qemu/osdep.h?


+
+static void help(void)
+{
+printf(
+"Usage: %s [options]\n"
+"QEMU storage daemon\n"
+"\n"
+"  -h, --help display this help and exit\n"
+"  -T, --trace [[enable=]][,events=][,file=]\n"
+" specify tracing options\n"
+"  -V, --version  output version information and exit\n"
+"\n"
+QEMU_HELP_BOTTOM "\n",
+error_get_progname());
+}
+
+static int process_options(int argc, char *argv[], Error **errp)
+{
+int c;
+char *trace_file = NULL;
+int ret = -EINVAL;
+
+static const struct option long_options[] = {
+{"help", no_argument, 0, 'h'},
+{"version", no_argument, 0, 'V'},
+{"trace", required_argument, NULL, 'T'},


I find it harder to maintain lists of options (which will get longer 
over time) when the order of the struct...



+{0, 0, 0, 0}
+};
+
+while ((c = getopt_long(argc, argv, ":hT:V", long_options, NULL)) != -1) {


...the order of the short options...


+switch (c) {
+case '?':
+error_setg(errp, "Unknown option '%s'", argv[optind - 1]);
+goto out;
+case ':':
+error_setg(errp, "Missing option argument for '%s'",
+   argv[optind - 1]);
+goto out;
+case 'h':
+help();
+exit(EXIT_SUCCESS);
+case 'V':
+printf("qemu-storage-daemon version "
+   QEMU_FULL_VERSION "\n" QEMU_COPYRIGHT "\n");
+exit(EXIT_SUCCESS);
+case 'T':
+g_free(trace_file);
+trace_file = trace_opt_parse(optarg);
+break;


...and the order of the case statements are not identical. 
Case-insensitive alphabetical may be easiest (matching your shortopt 
ordering of ":hT:V").



+}
+}
+if (optind != argc) {
+error_setg(errp, "Unexpected argument: %s", argv[optind]);
+goto out;
+}
+
+if (!trace_init_backends()) {
+error_setg(errp, "Could not initialize trace backends");
+goto out;
+}
+trace_init_file(trace_file);
+qemu_set_log(LOG_TRACE);
+
+ret = 0;
+out:
+g_free(trace_file);


Mark trace_file as g_auto, and you can avoid the out: label altogether.
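
(A sketch of that suggestion; everything not shown is assumed unchanged from
the patch, and each 'goto out' on an error path becomes a direct return:)

static int process_options(int argc, char *argv[], Error **errp)
{
    g_autofree char *trace_file = NULL;   /* freed automatically on return */
    int c;

    while ((c = getopt_long(argc, argv, ":hT:V", long_options, NULL)) != -1) {
        switch (c) {
        case 'T':
            g_free(trace_file);
            trace_file = trace_opt_parse(optarg);
            break;
        /* ... other cases as in the patch, returning directly ... */
        }
    }

    if (!trace_init_backends()) {
        error_setg(errp, "Could not initialize trace backends");
        return -EINVAL;
    }
    trace_init_file(trace_file);
    qemu_set_log(LOG_TRACE);
    return 0;
}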


+return ret;
+}
+
+int main(int argc, char *argv[])
+{
+Error *local_err = NULL;
+int ret;
+
+#ifdef CONFIG_POSIX
+signal(SIGPIPE, SIG_IGN);
+#endif
+
+error_init(argv[0]);
+qemu_init_exec_dir(argv[0]);
+
+module_call_init(MODULE_INIT_QOM);
+module_call_init(MODULE_INIT_TRACE);
+qemu_add_opts(_trace_opts);
+qcrypto_init(_fatal);
+bdrv_init();

Re: [PATCH 2/3] qcow2: Assert that qcow2_cache_get() callers hold s->lock

2019-10-24 Thread Kevin Wolf
Am 24.10.2019 um 15:03 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 24.10.2019 13:57, Kevin Wolf wrote:
> > Am 24.10.2019 um 12:01 hat Denis Lunev geschrieben:
> >> On 10/23/19 6:26 PM, Kevin Wolf wrote:
> >>> qcow2_cache_do_get() requires that s->lock is locked because it can
> >>> yield between picking a cache entry and actually taking ownership of it
> >>> by setting offset and increasing the reference count.
> >>>
> >>> Add an assertion to make sure the caller really holds the lock. The
> >>> function can be called outside of coroutine context, where bdrv_pread
> >>> and flushes become synchronous operations. The lock cannot and need not
> >>> be taken in this case.
> >>>
> >>> Signed-off-by: Kevin Wolf 
> >>> ---
> >>>   block/qcow2-cache.c | 5 +
> >>>   1 file changed, 5 insertions(+)
> >>>
> >>> diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
> >>> index d29b038a67..75b13dad99 100644
> >>> --- a/block/qcow2-cache.c
> >>> +++ b/block/qcow2-cache.c
> >>> @@ -327,6 +327,9 @@ static int qcow2_cache_do_get(BlockDriverState *bs, 
> >>> Qcow2Cache *c,
> >>>   int min_lru_index = -1;
> >>>   
> >>>   assert(offset != 0);
> >>> +if (qemu_in_coroutine()) {
> >>> +qemu_co_mutex_assert_locked(&s->lock);
> >>> +}
> >>
> >> That does not look good to me. If this really requires the lock, we should
> >> check for the lock always. On the other hand, we could face a missed
> >> lock outside of coroutine context.
> > 
> > As the commit message explains, outside of coroutine context, we can't
> > yield and bdrv_pread and bdrv_flush become synchronous operations
> > instead, so there is nothing else that we need to protect against.
> > 
> 
> Recently we discussed similar problems with the block-dirty-bitmap-* QMP
> commands, which wanted to update qcow2 bitmap metadata from
> non-coroutine context.  "qcow2 lock"
> <135df452-397a-30bb-7518-2184fa597...@virtuozzo.com>
> https://lists.gnu.org/archive/html/qemu-devel/2019-09/msg01419.html

Hm, right, I already forgot about the nested event loop again...

> And, as I understand it, the correct way is to convert to a coroutine and
> lock the mutex appropriately. Alternatively, we can take the AioContext
> lock and be in a drained section to avoid parallel requests accessing the
> critical section. Should we assert such conditions in the
> !qemu_in_coroutine() case?

The AioContext lock must be held anyway, so I don't think this would be
a new requirement. As for draining, I'll have to see.

I'm currently still auditing all the callers of qcow2_cache_do_get().
The synchronous callers I already know are the snapshot functions. I
think these happen to be in a drained section anyway (or should be at
least), so AioContext lock + drain seems like a very reasonable option
for them.

For other synchronous callers, if any, maybe conversion to a coroutine
would make more sense.
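
(The AioContext lock + drain option, sketched with real block-layer
primitives around a hypothetical synchronous metadata update:)

AioContext *ctx = bdrv_get_aio_context(bs);

aio_context_acquire(ctx);
bdrv_drained_begin(bs);      /* no in-flight requests can race with us */

/* ... synchronous qcow2 metadata access, e.g. the snapshot table ... */

bdrv_drained_end(bs);
aio_context_release(ctx);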

Kevin




Re: [PATCH v5 1/4] block: support compressed write at generic layer

2019-10-24 Thread Andrey Shinkevich


On 24/10/2019 12:34, Max Reitz wrote:
> On 22.10.19 15:53, Andrey Shinkevich wrote:
> 
> [...]
> 
>> If the support of COW for compressed writes is found feasible, will it
>> make sense to implement? Then this series will follow.
> 
> Hm, what exactly do you mean by support of COW for compressed writes?
> 

I spoke in terms of the commit message with the following ID:

b0b6862e5e1a1394e0ab3d5da94ba8b0da8664e2

"qcow2: Fail write_compressed when overwriting data"

"...qcow2_write_compressed() doesn't perform COW as it would have to do..."

So, I suggest that we implement writing compressed data to the allocated 
clusters rather than having qcow2_alloc_compressed_cluster_offset() return 
an error. Particularly, when an NBD server connection fails while writing 
a compressed cluster, the cluster may not be rewritable after the 
connection is restored.
Are there any issues with that implementation idea?

Andrey

>> A comment for the option would warn a user that guest writing will work
>> slower if compression is selected.
> 
> I’m not sure whether that’s necessary.  As a user, I would assume that
> enabling compression will always make something slower.
> 
> Max
> 



Re: [PATCH 3/3] qcow2: Fix corruption bug in qcow2_detect_metadata_preallocation()

2019-10-24 Thread Vladimir Sementsov-Ogievskiy
24.10.2019 14:17, Kevin Wolf wrote:
> Am 24.10.2019 um 12:46 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 23.10.2019 18:26, Kevin Wolf wrote:
>>> qcow2_detect_metadata_preallocation() calls qcow2_get_refcount() which
>>> requires s->lock to be taken to protect its accesses to the refcount
>>> table and refcount blocks. However, nothing in this code path actually
>>> took the lock. This could cause the same cache entry to be used by two
>>> requests at the same time, for different tables at different offsets,
>>> resulting in image corruption.
>>>
>>> As it would be preferable to base the detection on consistent data (even
>>> though it's just heuristics), let's take the lock not only around the
>>> qcow2_get_refcount() calls, but around the whole function.
>>>
>>> This patch takes the lock in qcow2_co_block_status() earlier and asserts
>>> in qcow2_detect_metadata_preallocation() that we hold the lock.
>>>
>>> Fixes: 69f47505ee66afaa513305de0c1895a224e52c45
>>> Cc: qemu-sta...@nongnu.org
>>> Reported-by: Michael Weiser 
>>> Signed-off-by: Kevin Wolf 
>>
>> Oh, I'm very sorry. I was going to backport this patch, and found that it's
>> fixed in our downstream long ago, even before the last upstream version of
>> the patch was sent :(
> 
> Seriously? Is your downstream QEMU so divergent from upstream that you
> wouldn't notice something like this? I think you really have to improve
> something there.

How to improve it? It takes months and years to bring patches into upstream;
of course we have to live with a bunch (now it's nearer to 300, which is a lot
better than the 500+ of the recent past) of patches on top of the RHEL release.

> 
> I mean, we have a serious data corruptor in the 4.1 release and I wasted
> days to debug this, and you've been sitting on a fix for months without
> telling anyone?

Of course, it was not my goal to hide the fix; I'll do my best to avoid this
in the future. Very sorry for the time you wasted.

> This isn't a memory leak or something, this kills
> people's images. It's eating their data.

Possibly; I didn't know about the data corruption. For some reason I decided
that the lock should be taken here, I don't remember why. Still, I should have
applied it to the upstream version too, of course.

> 
> Modifying an image format driver requires utmost care and I think I'll
> try to make sure to allow only few qcow2 changes per release in the
> future. Too many changes were made in too short time, and with too
> little care, and there are even more qcow2 patches pending. Please check
> whether these series actually match your downstream code.

The difference is that I didn't move the lock() but added a separate
lock()/unlock() pair around qcow2_detect_metadata_preallocation(). I think
there is no serious difference.

> Anyway, we'll
> tread very carefully now with the pending patches, even if it means
> delaying them for another release or two.

What do you mean? What should happen to the patches that have been on the
list for several months?

> Stability is way more
> important for an image format driver than new features and
> optimisations.

Agreed. But I think the main source of stability is testing.

> 
> Do whatever you need to fix your downstream process, but seriously, this
> must not ever happen again.
> 

I don't see how to improve the process to exclude such problems. But of
course I will do my best to avoid them.



-- 
Best regards,
Vladimir



Re: [PATCH v3 5/6] iotests: Enable more tests in the 'auto' group to improve test coverage

2019-10-24 Thread Alex Bennée


Thomas Huth  writes:


>
> According to Max, it would be good to have a test for iothreads and
> migration. 127 and 256 seem to be good candidates for iothreads. For
> migration, let's enable 091, 181, and 203 (which also tests iothreads).

> @@ -112,7 +112,7 @@
>  088 rw quick
>  089 rw auto quick
>  090 rw auto quick
> -091 rw migration
> +091 rw auto migration


This is breaking consistently on my ZFS machine:

TESTiotest-qcow2: 091 [fail]
QEMU  -- 
"/home/alex.bennee/lsrc/qemu.git/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64"
 -nodefaults -display none -machine accel=qtest
QEMU_IMG  -- 
"/home/alex.bennee/lsrc/qemu.git/tests/qemu-iotests/../../qemu-img"
QEMU_IO   -- 
"/home/alex.bennee/lsrc/qemu.git/tests/qemu-iotests/../../qemu-io"  --cache 
writeback -f qcow2
QEMU_NBD  -- 
"/home/alex.bennee/lsrc/qemu.git/tests/qemu-iotests/../../qemu-nbd"
IMGFMT-- qcow2 (compat=1.1)
IMGPROTO  -- file
PLATFORM  -- Linux/x86_64 hackbox2 4.15.0-66-generic
TEST_DIR  -- /home/alex.bennee/lsrc/qemu.git/tests/qemu-iotests/scratch
SOCKET_SCM_HELPER -- 
/home/alex.bennee/lsrc/qemu.git/tests/qemu-iotests/socket_scm_helper

--- /home/alex.bennee/lsrc/qemu.git/tests/qemu-iotests/091.out  2019-02-19 
12:32:03.231730789 +
+++ /home/alex.bennee/lsrc/qemu.git/tests/qemu-iotests/091.out.bad  
2019-10-24 12:07:00.624967556 +0100
@@ -9,20 +9,4 @@

 === VM 1: Migrate from VM1 to VM2 ===

-vm1: qemu-io disk write complete
-vm1: live migration started
-vm1: live migration completed
-
-=== VM 2: Post-migration, write to disk, verify running ===
-
-vm2: qemu-io disk write complete
-vm2: qemu process running successfully
-vm2: flush io, and quit
-Check image pattern
-read 4194304/4194304 bytes at offset 0
-4 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-Running 'qemu-img check -r all $TEST_IMG'
-No errors were found on the image.
-80/16384 = 0.49% allocated, 0.00% fragmented, 0.00% compressed clusters
-Image end offset: 5570560
-*** done
+Timeout waiting for (qemu) on handle 0


--
Alex Bennée



Re: [PATCH 1/3] coroutine: Add qemu_co_mutex_assert_locked()

2019-10-24 Thread Denis Lunev
On 10/24/19 1:54 PM, Kevin Wolf wrote:
> Am 24.10.2019 um 11:59 hat Denis Lunev geschrieben:
>> On 10/23/19 6:26 PM, Kevin Wolf wrote:
>>> Some functions require that the caller holds a certain CoMutex for them
>>> to operate correctly. Add a function so that they can assert the lock is
>>> really held.
>>>
>>> Cc: qemu-sta...@nongnu.org
>>> Signed-off-by: Kevin Wolf 
>>> ---
>>>  include/qemu/coroutine.h | 7 +++
>>>  1 file changed, 7 insertions(+)
>>>
>>> diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
>>> index 9801e7f5a4..a36bcfe5c8 100644
>>> --- a/include/qemu/coroutine.h
>>> +++ b/include/qemu/coroutine.h
>>> @@ -167,6 +167,13 @@ void coroutine_fn qemu_co_mutex_lock(CoMutex *mutex);
>>>   */
>>>  void coroutine_fn qemu_co_mutex_unlock(CoMutex *mutex);
>>>  
>>> +/**
>>> + * Assert that the current coroutine holds @mutex.
>>> + */
>>> +static inline coroutine_fn void qemu_co_mutex_assert_locked(CoMutex *mutex)
>>> +{
>>> +assert(mutex->locked && mutex->holder == qemu_coroutine_self());
>>> +}
>>>  
>>>  /**
>>>   * CoQueues are a mechanism to queue coroutines in order to continue 
>>> executing
>> I think that we should use atomic_read(&mutex->locked) and require barriers
>> when working with holder.
> Hm, this would only be necessary for the case that the assertion doesn't
> hold true. I'll do the atomic_read() because it's easy enough, but I
> don't think we need or want barriers here. If another thread modifies
> this concurrently, the condition is false either way.
>
> Kevin
>
It looks like you are correct. We will see either NULL or some other value
if the lock is in the process of being taken. Is this worth a comment? :)
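
(For reference, a sketch of the assertion with the suggested atomic_read();
this is an assumption about what v2 might look like, not the committed code:)

static inline coroutine_fn void qemu_co_mutex_assert_locked(CoMutex *mutex)
{
    /*
     * mutex->locked is read atomically, but no barriers are needed for
     * holder: if another thread holds the lock, holder is simply some
     * other value (or NULL mid-acquisition), and the assertion condition
     * is false either way.
     */
    assert(atomic_read(&mutex->locked) &&
           mutex->holder == qemu_coroutine_self());
}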

Den



Re: [PATCH 2/3] qcow2: Assert that qcow2_cache_get() callers hold s->lock

2019-10-24 Thread Kevin Wolf
Am 24.10.2019 um 13:14 hat Denis Lunev geschrieben:
> On 10/24/19 1:57 PM, Kevin Wolf wrote:
> > Am 24.10.2019 um 12:01 hat Denis Lunev geschrieben:
> >> On 10/23/19 6:26 PM, Kevin Wolf wrote:
> >>> qcow2_cache_do_get() requires that s->lock is locked because it can
> >>> yield between picking a cache entry and actually taking ownership of it
> >>> by setting offset and increasing the reference count.
> >>>
> >>> Add an assertion to make sure the caller really holds the lock. The
> >>> function can be called outside of coroutine context, where bdrv_pread
> >>> and flushes become synchronous operations. The lock cannot and need not
> >>> be taken in this case.
> >>>
> >>> Signed-off-by: Kevin Wolf 
> >>> ---
> >>>  block/qcow2-cache.c | 5 +
> >>>  1 file changed, 5 insertions(+)
> >>>
> >>> diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
> >>> index d29b038a67..75b13dad99 100644
> >>> --- a/block/qcow2-cache.c
> >>> +++ b/block/qcow2-cache.c
> >>> @@ -327,6 +327,9 @@ static int qcow2_cache_do_get(BlockDriverState *bs, 
> >>> Qcow2Cache *c,
> >>>  int min_lru_index = -1;
> >>>  
> >>>  assert(offset != 0);
> >>> +if (qemu_in_coroutine()) {
> >>> +qemu_co_mutex_assert_locked(&s->lock);
> >>> +}
> >> That does not look good to me. If this really requires the lock, we should
> >> check for the lock always. On the other hand, we could face a missed
> >> lock outside of coroutine context.
> > As the commit message explains, outside of coroutine context, we can't
> > yield and bdrv_pread and bdrv_flush become synchronous operations
> > instead, so there is nothing else that we need to protect against.
> >
> Hmm. It seems I was not careful enough reading the entire message.
> I am fine with this, though it looks a bit tricky to me, as such things
> can change in the future.

In which way do you think this could change? It's a pretty fundamental
fact about non-coroutine code that it can't yield.

What could change, of course, is that some code switches from being
synchronous to using a coroutine. The assertion would automatically
apply then and catch the bug if adding proper locking is forgotten.

> Anyway, you could consider this as
> 
> Reviewed-by: Denis V. Lunev 

Thanks!

Kevin




Re: [RFC PATCH 00/18] Add qemu-storage-daemon

2019-10-24 Thread Kevin Wolf
Am 17.10.2019 um 15:01 hat Kevin Wolf geschrieben:
> This series adds a new tool 'qemu-storage-daemon', which can be used to
> export and perform operations on block devices. There is some overlap
> between qemu-img/qemu-nbd and the new qemu-storage-daemon, but there are
> a few important differences:
> [...]

Ping?




Re: [PATCH 1/3] coroutine: Add qemu_co_mutex_assert_locked()

2019-10-24 Thread Kevin Wolf
Am 24.10.2019 um 11:59 hat Denis Lunev geschrieben:
> On 10/23/19 6:26 PM, Kevin Wolf wrote:
> > Some functions require that the caller holds a certain CoMutex for them
> > to operate correctly. Add a function so that they can assert the lock is
> > really held.
> >
> > Cc: qemu-sta...@nongnu.org
> > Signed-off-by: Kevin Wolf 
> > ---
> >  include/qemu/coroutine.h | 7 +++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
> > index 9801e7f5a4..a36bcfe5c8 100644
> > --- a/include/qemu/coroutine.h
> > +++ b/include/qemu/coroutine.h
> > @@ -167,6 +167,13 @@ void coroutine_fn qemu_co_mutex_lock(CoMutex *mutex);
> >   */
> >  void coroutine_fn qemu_co_mutex_unlock(CoMutex *mutex);
> >  
> > +/**
> > + * Assert that the current coroutine holds @mutex.
> > + */
> > +static inline coroutine_fn void qemu_co_mutex_assert_locked(CoMutex *mutex)
> > +{
> > +assert(mutex->locked && mutex->holder == qemu_coroutine_self());
> > +}
> >  
> >  /**
> >   * CoQueues are a mechanism to queue coroutines in order to continue 
> > executing
> I think that we should use atomic_read(&mutex->locked) and require barriers
> when working with holder.

Hm, this would only be necessary for the case that the assertion doesn't
hold true. I'll do the atomic_read() because it's easy enough, but I
don't think we need or want barriers here. If another thread modifies
this concurrently, the condition is false either way.

Kevin




Re: [PATCH 3/3] qcow2: Fix corruption bug in qcow2_detect_metadata_preallocation()

2019-10-24 Thread Vladimir Sementsov-Ogievskiy
23.10.2019 18:26, Kevin Wolf wrote:
> qcow2_detect_metadata_preallocation() calls qcow2_get_refcount() which
> requires s->lock to be taken to protect its accesses to the refcount
> table and refcount blocks. However, nothing in this code path actually
> took the lock. This could cause the same cache entry to be used by two
> requests at the same time, for different tables at different offsets,
> resulting in image corruption.
> 
> As it would be preferable to base the detection on consistent data (even
> though it's just heuristics), let's take the lock not only around the
> qcow2_get_refcount() calls, but around the whole function.
> 
> This patch takes the lock in qcow2_co_block_status() earlier and asserts
> in qcow2_detect_metadata_preallocation() that we hold the lock.
> 
> Fixes: 69f47505ee66afaa513305de0c1895a224e52c45
> Cc: qemu-sta...@nongnu.org
> Reported-by: Michael Weiser 
> Signed-off-by: Kevin Wolf 

Oh, I'm very sorry. I was going to backport this patch, and found that it's
fixed in our downstream long ago, even before the last upstream version of the
patch was sent :(

Reviewed-by: Vladimir Sementsov-Ogievskiy 

> ---
>   block/qcow2-refcount.c | 2 ++
>   block/qcow2.c  | 3 ++-
>   2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index ef965d7895..0d64bf5a5e 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -3455,6 +3455,8 @@ int 
> qcow2_detect_metadata_preallocation(BlockDriverState *bs)
>   int64_t i, end_cluster, cluster_count = 0, threshold;
>   int64_t file_length, real_allocation, real_clusters;
>   
> +qemu_co_mutex_assert_locked(&s->lock);
> +
>   file_length = bdrv_getlength(bs->file->bs);
>   if (file_length < 0) {
>   return file_length;
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 8b05933565..0bc69e6996 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -1916,6 +1916,8 @@ static int coroutine_fn 
> qcow2_co_block_status(BlockDriverState *bs,
>   unsigned int bytes;
>   int status = 0;
>   
> +qemu_co_mutex_lock(&s->lock);
> +
>   if (!s->metadata_preallocation_checked) {
>   ret = qcow2_detect_metadata_preallocation(bs);
>   s->metadata_preallocation = (ret == 1);
> @@ -1923,7 +1925,6 @@ static int coroutine_fn 
> qcow2_co_block_status(BlockDriverState *bs,
>   }
>   
>   bytes = MIN(INT_MAX, count);
> -qemu_co_mutex_lock(&s->lock);
>   ret = qcow2_get_cluster_offset(bs, offset, &bytes, &cluster_offset);
>   qemu_co_mutex_unlock(&s->lock);
>   if (ret < 0) {
> 


-- 
Best regards,
Vladimir



Re: [PATCH 2/3] qcow2: Assert that qcow2_cache_get() callers hold s->lock

2019-10-24 Thread Denis Lunev
On 10/24/19 1:57 PM, Kevin Wolf wrote:
> Am 24.10.2019 um 12:01 hat Denis Lunev geschrieben:
>> On 10/23/19 6:26 PM, Kevin Wolf wrote:
>>> qcow2_cache_do_get() requires that s->lock is locked because it can
>>> yield between picking a cache entry and actually taking ownership of it
>>> by setting offset and increasing the reference count.
>>>
>>> Add an assertion to make sure the caller really holds the lock. The
>>> function can be called outside of coroutine context, where bdrv_pread
>>> and flushes become synchronous operations. The lock cannot and need not
>>> be taken in this case.
>>>
>>> Signed-off-by: Kevin Wolf 
>>> ---
>>>  block/qcow2-cache.c | 5 +
>>>  1 file changed, 5 insertions(+)
>>>
>>> diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
>>> index d29b038a67..75b13dad99 100644
>>> --- a/block/qcow2-cache.c
>>> +++ b/block/qcow2-cache.c
>>> @@ -327,6 +327,9 @@ static int qcow2_cache_do_get(BlockDriverState *bs, 
>>> Qcow2Cache *c,
>>>  int min_lru_index = -1;
>>>  
>>>  assert(offset != 0);
>>> +if (qemu_in_coroutine()) {
>>> +qemu_co_mutex_assert_locked(&s->lock);
>>> +}
>> That does not look good to me. If this really requires the lock, we should
>> check for the lock always. On the other hand, we could face a missed
>> lock outside of coroutine context.
> As the commit message explains, outside of coroutine context, we can't
> yield and bdrv_pread and bdrv_flush become synchronous operations
> instead, so there is nothing else that we need to protect against.
>
> Kevin
>
Hmm. It seems I was not careful enough reading the entire message.
I am fine with this, though it looks a bit tricky to me, as such things
can change in the future.

Anyway, you could consider this as

Reviewed-by: Denis V. Lunev 

Den



Re: [PATCH 1/3] coroutine: Add qemu_co_mutex_assert_locked()

2019-10-24 Thread Denis Lunev
On 10/23/19 6:26 PM, Kevin Wolf wrote:
> Some functions require that the caller holds a certain CoMutex for them
> to operate correctly. Add a function so that they can assert the lock is
> really held.
>
> Cc: qemu-sta...@nongnu.org
> Signed-off-by: Kevin Wolf 
> ---
>  include/qemu/coroutine.h | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
> index 9801e7f5a4..a36bcfe5c8 100644
> --- a/include/qemu/coroutine.h
> +++ b/include/qemu/coroutine.h
> @@ -167,6 +167,13 @@ void coroutine_fn qemu_co_mutex_lock(CoMutex *mutex);
>   */
>  void coroutine_fn qemu_co_mutex_unlock(CoMutex *mutex);
>  
> +/**
> + * Assert that the current coroutine holds @mutex.
> + */
> +static inline coroutine_fn void qemu_co_mutex_assert_locked(CoMutex *mutex)
> +{
> +assert(mutex->locked && mutex->holder == qemu_coroutine_self());
> +}
>  
>  /**
>   * CoQueues are a mechanism to queue coroutines in order to continue 
> executing
I think that we should use atomic_read(&mutex->locked) and require barriers
when working with holder.

Den



Re: [PATCH 3/3] qcow2: Fix corruption bug in qcow2_detect_metadata_preallocation()

2019-10-24 Thread Kevin Wolf
Am 24.10.2019 um 12:46 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 23.10.2019 18:26, Kevin Wolf wrote:
> > qcow2_detect_metadata_preallocation() calls qcow2_get_refcount() which
> > requires s->lock to be taken to protect its accesses to the refcount
> > table and refcount blocks. However, nothing in this code path actually
> > took the lock. This could cause the same cache entry to be used by two
> > requests at the same time, for different tables at different offsets,
> > resulting in image corruption.
> > 
> > As it would be preferable to base the detection on consistent data (even
> > though it's just heuristics), let's take the lock not only around the
> > qcow2_get_refcount() calls, but around the whole function.
> > 
> > This patch takes the lock in qcow2_co_block_status() earlier and asserts
> > in qcow2_detect_metadata_preallocation() that we hold the lock.
> > 
> > Fixes: 69f47505ee66afaa513305de0c1895a224e52c45
> > Cc: qemu-sta...@nongnu.org
> > Reported-by: Michael Weiser 
> > Signed-off-by: Kevin Wolf 
> 
> Oh, I'm very sorry. I was going to backport this patch, and found that it's
> fixed in our downstream long ago, even before the last upstream version of
> the patch was sent :(

Seriously? Is your downstream QEMU so divergent from upstream that you
wouldn't notice something like this? I think you really have to improve
something there.

I mean, we have a serious data corruptor in the 4.1 release and I wasted
days to debug this, and you've been sitting on a fix for months without
telling anyone? This isn't a memory leak or something, this kills
people's images. It's eating their data.

Modifying an image format driver requires utmost care and I think I'll
try to make sure to allow only few qcow2 changes per release in the
future. Too many changes were made in too short time, and with too
little care, and there are even more qcow2 patches pending. Please check
whether these series actually match your downstream code. Anyway, we'll
tread very carefully now with the pending patches, even if it means
delaying them for another release or two. Stability is way more
important for an image format driver than new features and
optimisations.

Do whatever you need to fix your downstream process, but seriously, this
must not ever happen again.

Kevin




Re: [PATCH v5 1/4] block: support compressed write at generic layer

2019-10-24 Thread Max Reitz
On 22.10.19 15:53, Andrey Shinkevich wrote:

[...]

> If the support of COW for compressed writes is found feasible, will it 
> make sense to implement? Then this series will follow.

Hm, what exactly do you mean by support of COW for compressed writes?

> A comment for the option would warn a user that guest writing will work 
> slower if compression is selected.

I’m not sure whether that’s necessary.  As a user, I would assume that
enabling compression will always make something slower.

Max



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 2/3] qcow2: Assert that qcow2_cache_get() callers hold s->lock

2019-10-24 Thread Kevin Wolf
Am 24.10.2019 um 12:01 hat Denis Lunev geschrieben:
> On 10/23/19 6:26 PM, Kevin Wolf wrote:
> > qcow2_cache_do_get() requires that s->lock is locked because it can
> > yield between picking a cache entry and actually taking ownership of it
> > by setting offset and increasing the reference count.
> >
> > Add an assertion to make sure the caller really holds the lock. The
> > function can be called outside of coroutine context, where bdrv_pread
> > and flushes become synchronous operations. The lock cannot and need not
> > be taken in this case.
> >
> > Signed-off-by: Kevin Wolf 
> > ---
> >  block/qcow2-cache.c | 5 +
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
> > index d29b038a67..75b13dad99 100644
> > --- a/block/qcow2-cache.c
> > +++ b/block/qcow2-cache.c
> > @@ -327,6 +327,9 @@ static int qcow2_cache_do_get(BlockDriverState *bs, 
> > Qcow2Cache *c,
> >  int min_lru_index = -1;
> >  
> >  assert(offset != 0);
> > +if (qemu_in_coroutine()) {
> > +qemu_co_mutex_assert_locked(&s->lock);
> > +}
> 
> That does not look good to me. If this really requires the lock, we should
> check for the lock always. On the other hand, we could face a missed
> lock outside of coroutine context.

As the commit message explains, outside of coroutine context, we can't
yield and bdrv_pread and bdrv_flush become synchronous operations
instead, so there is nothing else that we need to protect against.

Kevin




Re: [PATCH 2/3] qcow2: Assert that qcow2_cache_get() callers hold s->lock

2019-10-24 Thread Denis Lunev
On 10/23/19 6:26 PM, Kevin Wolf wrote:
> qcow2_cache_do_get() requires that s->lock is locked because it can
> yield between picking a cache entry and actually taking ownership of it
> by setting offset and increasing the reference count.
>
> Add an assertion to make sure the caller really holds the lock. The
> function can be called outside of coroutine context, where bdrv_pread
> and flushes become synchronous operations. The lock cannot and need not
> be taken in this case.
>
> Signed-off-by: Kevin Wolf 
> ---
>  block/qcow2-cache.c | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
> index d29b038a67..75b13dad99 100644
> --- a/block/qcow2-cache.c
> +++ b/block/qcow2-cache.c
> @@ -327,6 +327,9 @@ static int qcow2_cache_do_get(BlockDriverState *bs, 
> Qcow2Cache *c,
>  int min_lru_index = -1;
>  
>  assert(offset != 0);
> +if (qemu_in_coroutine()) {
> +qemu_co_mutex_assert_locked(&s->lock);
> +}

That does not look good to me. If this really requires the lock, we should
check for the lock always. On the other hand, we could face a missed
lock outside of coroutine context.

Den

>  
>  trace_qcow2_cache_get(qemu_coroutine_self(), c == s->l2_table_cache,
>offset, read_from_disk);
> @@ -386,6 +389,8 @@ static int qcow2_cache_do_get(BlockDriverState *bs, 
> Qcow2Cache *c,
>  }
>  }
>  
> +assert(c->entries[i].ref == 0);
> +assert(c->entries[i].offset == 0);
>  c->entries[i].offset = offset;
>  
>  /* And return the right table */




Re: [PATCH] blockdev: modify blockdev-change-medium to change non-removable device

2019-10-24 Thread Max Reitz
On 23.10.19 16:10, Vladimir Sementsov-Ogievskiy wrote:
> 23.10.2019 16:56, Vladimir Sementsov-Ogievskiy wrote:
>> 22.10.2019 14:05, Max Reitz wrote:
>>> On 21.10.19 08:50, Denis Plotnikov wrote:

 On 18.10.2019 18:02, Max Reitz wrote:
> On 18.10.19 14:09, Denis Plotnikov wrote:
>> The modification is useful to workaround exclusive file access 
>> restrictions,
>> e.g. to implement VM migration with shared disk stored on a storage with
>> the exclusive file opening model: a destination VM is started waiting for
>> incoming migration with a fake image drive, and later, on the last 
>> migration
>> phase, the fake image file is replaced with the real one.
>>
>> Signed-off-by: Denis Plotnikov 
> Isn’t this what we would want to use reopen for?
>
> Max

 Could you please explain what is "use reopen"?
>>>
>>> I was thinking of using (x-)blockdev-reopen to change the file that is
>>> used by the format node (e.g. from a null-co node to a real file); or to
>>> change the filename of the protocol node.
>>>
>>> Kevin has pointed out (on IRC) that this will not allow you to change
>>> the node that is directly attached to the device.  While I don’t know
>>> whether that’s really necessary in this case, if it were indeed
>>> necessary, I’d prefer a method to change a guest device’s @drive option
>>> because that seems more natural to me.
>>>
>>> In contrast, the approach taken in this patch seems not quite right to
>>> me, because it overloads the whole blockdev-change-medium command with a
>>> completely new and different implementation based on whether there’s a
>>> removable medium or not.  If the implementation is so different (and the
>>> interface is, too, because in one path you must give @medium whereas the
>>> other doesn’t evaluate it at all), it should be a new command.
>>>
>>> I don’t know whether we need a new command at all, though.  On the node
>>> level, we have (x-)blockdev-reopen.  So assuming we need something to
>>> change the link between the guest device and the block layer, I wonder
>>> whether there isn’t something similar; specifically, I’d prefer
>>> something to simply change the device’s @drive option.
>>
>> Ok, assume we can set the @drive option with the help of an improved
>> qom-set. But how do we create this new blk? blockdev-add doesn't have an
>> 'id' parameter anymore and doesn't create a blk...
> 
> 
> Hmm, I see, we have the command blockdev-insert-medium(id, node-name).
> Isn't that what we want?

blockdev-insert-medium requires that the device be empty, so it would
need to be preceded by a blockdev-remove-medium.  And then it isn’t a
transaction...

Also, it currently is limited (intentionally) to devices with removable
media.

I suppose all of this could be changed, but basically I don’t see why we
wouldn’t use qom-set.  (Well, because it’s ugly to call this new
functionality from set_drive(), but other than that...)

(And I don’t quite know whether there isn’t more to be done for devices
that don’t inherently support removable media than just to swap the node
attached to the BB.  Don’t they also need to be drained or something?)

Max



signature.asc
Description: OpenPGP digital signature


Re: Problems with c8bb23cbdbe3 on ppc64le

2019-10-24 Thread Max Reitz
On 10.10.19 17:17, Max Reitz wrote:
> Hi everyone,
> 
> (CCs just based on tags in the commit in question)
> 
> I have two bug reports which claim problems of qcow2 on XFS on ppc64le
> machines since qemu 4.1.0.  One of those is about bad performance
> (sorry, is isn’t public :-/), the other about data corruption
> (https://bugzilla.redhat.com/show_bug.cgi?id=1751934).
> 
> It looks like in both cases reverting c8bb23cbdbe3 solves the problem
> (which optimized COW of unallocated areas).
> 
> I think I’ve looked at every angle but can‘t find what could be wrong
> with it.  Do any of you have any idea? :-/

It looks to me like an XFS bug.

On XFS, if you do FALLOC_FL_ZERO_RANGE past the EOF and an AIO pwrite
even further after that range, the pwrite will be discarded if the
fallocate settles after the pwrite (and both have been started before
either has finished).  That is, the file length will be increased as if
only the fallocate had been executed, but not the pwrite, so the
pwrite’s data is lost.

(Interestingly, this is pretty similar to the bug I introduced in qemu
in 50ba5b2d994853b38fed10e0841b119da0f8b8e5, where the ftruncate() would
not consider parallel in-flight writes.)

I’ve attached a C program to show the problem.  It creates an empty
file, issues FALLOC_FL_ZERO_RANGE on the first 4 kB in a thread, and an
AIO pwrite in parallel on the second 4 kB.  It then runs hexdump -C on
the file.

On XFS, the hexdump shows only 4 kB of 0s.  On ext4 and btrfs, it shows
4 kB of 0s and 4 kB of 42s.

(You can uncomment the IN_ORDER to execute the fallocate and pwrite
sequentially, then XFS will show the same output.)

(Note that it is possible that pwrite and fallocate are not issued
before the other is finished, or that fallocate settles before pwrite.
In such cases, the file will probably be written correctly.  However, I
see the wrong result pretty much 100 % of the time.  (So on my machine,
pwrite and fallocate pretty much always run in parallel and fallocate
finishes after pwrite.))

Compile the program like so:

$ gcc parallel-falloc-and-pwrite.c -pthread -laio -Wall -Wextra
-pedantic -std=c11

And run it like so:

$ ./a.out tmp-file

Max
#define _GNU_SOURCE

#include <assert.h>
#include <fcntl.h>
#include <libaio.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// Define this to perform the fallocate and the pwrite sequentially
// instead of in parallel

// #define IN_ORDER
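
// What this program does: falloc_thread() zeroes the first 4 kB of an
// empty file with FALLOC_FL_ZERO_RANGE while main() submits an AIO pwrite
// to the second 4 kB. Both operations extend the file past EOF; on XFS,
// if the fallocate settles last, the pwrite's data is lost.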


int fd;

void *falloc_thread(void *arg)
{
int ret;

(void)arg;

puts("starting fallocate");

ret = fallocate(fd, FALLOC_FL_ZERO_RANGE, 0, 4096);
assert(ret == 0);

puts("fallocate done");

return NULL;
}

int main(int argc, char *argv[])
{
pthread_t falloc_thr;
int ret;

if (argc != 2) {
fprintf(stderr, "Usage: %s \n", argv[0]);
return 1;
}

fd = open(argv[1], O_CREAT | O_RDWR | O_TRUNC | O_DIRECT, 0666);
assert(fd >= 0);

void *buf = aligned_alloc(4096, 4096);
memset(buf, 42, 4096);

io_context_t ctx = 0;
ret = io_setup(1, &ctx);
assert(ret == 0);

ret = pthread_create(&falloc_thr, NULL, &falloc_thread, NULL);
assert(ret == 0);

#ifdef IN_ORDER
ret = pthread_join(falloc_thr, NULL);
assert(ret == 0);
#endif

struct iocb ior;
io_prep_pwrite(&ior, fd, buf, 4096, 4096);

puts("submitting pwrite");

struct iocb *ios[] = { &ior };
ret = io_submit(ctx, 1, ios);
assert(ret == 1);

struct io_event evs[1];
ret = io_getevents(ctx, 1, 1, evs, NULL);
assert(ret == 1);

puts("pwrite done");

#ifndef IN_ORDER
ret = pthread_join(falloc_thr, NULL);
assert(ret == 0);
#endif

close(fd);
free(buf);

puts("\nHexdump should show 4k of 0s and 4k of 42s:\n");

execlp("hexdump", "hexdump", "-C", argv[1], NULL);
return 1;
}


signature.asc
Description: OpenPGP digital signature