Re: [PATCH v8 00/34] Add subcluster allocation to qcow2

2020-06-10 Thread no-reply
Patchew URL: https://patchew.org/QEMU/cover.1591801197.git.be...@igalia.com/



Hi,

This series failed the docker-quick@centos7 build test. Please find the testing 
commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

  CC  block/blkdebug.o
  CC  block/blkverify.o
/tmp/qemu-test/src/block/qcow2-cluster.c: In function 'qcow2_get_host_offset':
/tmp/qemu-test/src/block/qcow2-cluster.c:473:19: error: 'expected_type' may be 
used uninitialized in this function [-Werror=maybe-uninitialized]
 } else if (type != expected_type) {
   ^
/tmp/qemu-test/src/block/qcow2-cluster.c:449:25: note: 'expected_type' was 
declared here
 QCow2SubclusterType expected_type, type;
 ^
/tmp/qemu-test/src/block/qcow2-cluster.c:475:19: error: 'check_offset' may be 
used uninitialized in this function [-Werror=maybe-uninitialized]
 } else if (check_offset) {
   ^
/tmp/qemu-test/src/block/qcow2-cluster.c:447:10: note: 'check_offset' was 
declared here
 bool check_offset;
  ^
/tmp/qemu-test/src/block/qcow2-cluster.c:476:29: error: 'expected_offset' may 
be used uninitialized in this function [-Werror=maybe-uninitialized]
 expected_offset += s->cluster_size;
 ^
/tmp/qemu-test/src/block/qcow2-cluster.c:448:14: note: 'expected_offset' was 
declared here
 uint64_t expected_offset;
  ^
cc1: all warnings being treated as errors
make: *** [block/qcow2-cluster.o] Error 1
make: *** Waiting for unfinished jobs
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 665, in 
---
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', 
'--label', 'com.qemu.instance.uuid=c26843e7a1d24a3c860bd5ab2506a33c', '-u', 
'1001', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', 
'-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 
'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', 
'/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', 
'/var/tmp/patchew-tester-tmp-h2erlgmp/src/docker-src.2020-06-10-18.50.30.7791:/var/tmp/qemu:z,ro',
 'qemu:centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit 
status 2.
filter=--filter=label=com.qemu.instance.uuid=c26843e7a1d24a3c860bd5ab2506a33c
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-h2erlgmp/src'
make: *** [docker-run-test-quick@centos7] Error 2

real    2m20.876s
user    0m8.581s


The full log is available at
http://patchew.org/logs/cover.1591801197.git.be...@igalia.com/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com
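
For context: GCC's -Wmaybe-uninitialized fires above because expected_type,
check_offset and expected_offset are assigned only on some control-flow paths
(the first iteration of a loop) before being read on others, and the compiler
cannot prove that the assigning path always runs first. The sketch below is a
hypothetical reconstruction of that pattern with the usual remedy already
applied (give the locals a defined value at declaration); it is not the
qcow2_get_host_offset() code from the series, and the types, names and
contiguity check are illustrative assumptions only.

/* Illustrative sketch only -- not the qcow2_get_host_offset() code. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { SC_NORMAL, SC_ZERO, SC_UNALLOCATED } QCow2SubclusterType;

typedef struct {
    QCow2SubclusterType type;
    uint64_t host_offset;
} Entry;

/* Count how many leading entries are contiguous and share one type. */
size_t count_contiguous(const Entry *e, size_t n, uint64_t cluster_size)
{
    /* The initializers are what silences -Wmaybe-uninitialized: GCC cannot
     * see that the i == 0 branch always assigns these three before the
     * other branches read them. */
    QCow2SubclusterType expected_type = SC_UNALLOCATED;
    uint64_t expected_offset = 0;
    bool check_offset = false;

    for (size_t i = 0; i < n; i++) {
        if (i == 0) {
            expected_type = e[i].type;
            check_offset = (expected_type == SC_NORMAL);
            expected_offset = e[i].host_offset;
        } else if (e[i].type != expected_type) {
            return i;                       /* type changed */
        } else if (check_offset && e[i].host_offset != expected_offset) {
            return i;                       /* not contiguous */
        }
        expected_offset += cluster_size;
    }
    return n;
}

With the initializers in place every reachable path reads a defined value, so
the warning (an error under -Werror) disappears without changing behaviour.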

Re: [PATCH v8 00/34] Add subcluster allocation to qcow2

2020-06-10 Thread no-reply
Patchew URL: https://patchew.org/QEMU/cover.1591801197.git.be...@igalia.com/



Hi,

This series failed the docker-quick@centos7 build test. Please find the testing 
commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

  CC  block/vhdx.o
  CC  block/vhdx-endian.o
/tmp/qemu-test/src/block/qcow2-cluster.c: In function 'qcow2_get_host_offset':
/tmp/qemu-test/src/block/qcow2-cluster.c:473:19: error: 'expected_type' may be 
used uninitialized in this function [-Werror=maybe-uninitialized]
 } else if (type != expected_type) {
   ^
/tmp/qemu-test/src/block/qcow2-cluster.c:449:25: note: 'expected_type' was 
declared here
 QCow2SubclusterType expected_type, type;
 ^
/tmp/qemu-test/src/block/qcow2-cluster.c:475:19: error: 'check_offset' may be 
used uninitialized in this function [-Werror=maybe-uninitialized]
 } else if (check_offset) {
   ^
/tmp/qemu-test/src/block/qcow2-cluster.c:447:10: note: 'check_offset' was 
declared here
 bool check_offset;
  ^
/tmp/qemu-test/src/block/qcow2-cluster.c:476:29: error: 'expected_offset' may 
be used uninitialized in this function [-Werror=maybe-uninitialized]
 expected_offset += s->cluster_size;
 ^
/tmp/qemu-test/src/block/qcow2-cluster.c:448:14: note: 'expected_offset' was 
declared here
 uint64_t expected_offset;
  ^
cc1: all warnings being treated as errors
make: *** [block/qcow2-cluster.o] Error 1
make: *** Waiting for unfinished jobs
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 665, in 
---
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', 
'--label', 'com.qemu.instance.uuid=080447e1604744fb934c6e9a0210ed36', '-u', 
'1003', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', 
'-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 
'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', 
'/home/patchew2/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', 
'/var/tmp/patchew-tester-tmp-yrq9p6br/src/docker-src.2020-06-10-17.42.20.24850:/var/tmp/qemu:z,ro',
 'qemu:centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit 
status 2.
filter=--filter=label=com.qemu.instance.uuid=080447e1604744fb934c6e9a0210ed36
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-yrq9p6br/src'
make: *** [docker-run-test-quick@centos7] Error 2

real    2m47.601s
user    0m8.384s


The full log is available at
http://patchew.org/logs/cover.1591801197.git.be...@igalia.com/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com

Re: [PATCH v3 0/4] nbd: reduce max_block restrictions

2020-06-10 Thread no-reply
Patchew URL: 
https://patchew.org/QEMU/20200610182305.3462-1-vsement...@virtuozzo.com/



Hi,

This series failed the docker-quick@centos7 build test. Please find the testing 
commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

--- /tmp/qemu-test/src/tests/qemu-iotests/251.out   2020-06-10 
18:56:36.0 +
+++ /tmp/qemu-test/build/tests/qemu-iotests/251.out.bad 2020-06-10 
20:24:40.007412790 +
@@ -18,26 +18,16 @@
 qemu-img: warning: error while reading offset read_fail_offset_8: Input/output 
error
 qemu-img: warning: error while reading offset read_fail_offset_9: Input/output 
error
 
-wrote 512/512 bytes at offset read_fail_offset_0
-512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
---
Not run: 259
Failures: 033 034 154 177 251
Failed 5 of 119 iotests
make: *** [check-tests/check-block.sh] Error 1
make: *** Waiting for unfinished jobs
  TESTcheck-qtest-aarch64: tests/qtest/qos-test
Traceback (most recent call last):
---
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', 
'--label', 'com.qemu.instance.uuid=faa1edfc69684422b314343cf30174a5', '-u', 
'1003', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', 
'-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 
'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', 
'/home/patchew2/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', 
'/var/tmp/patchew-tester-tmp-9egury9q/src/docker-src.2020-06-10-16.12.07.11584:/var/tmp/qemu:z,ro',
 'qemu:centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit 
status 2.
filter=--filter=label=com.qemu.instance.uuid=faa1edfc69684422b314343cf30174a5
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-9egury9q/src'
make: *** [docker-run-test-quick@centos7] Error 2

real    13m12.298s
user    0m8.670s


The full log is available at
http://patchew.org/logs/20200610182305.3462-1-vsement...@virtuozzo.com/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com

Re: [PATCH v8 34/34] iotests: Add tests for qcow2 images with extended L2 entries

2020-06-10 Thread Eric Blake

On 6/10/20 10:03 AM, Alberto Garcia wrote:

Signed-off-by: Alberto Garcia 
---
  tests/qemu-iotests/271 | 801 +
  tests/qemu-iotests/271.out | 676 +++
  tests/qemu-iotests/group   |   1 +
  3 files changed, 1478 insertions(+)
  create mode 100755 tests/qemu-iotests/271
  create mode 100644 tests/qemu-iotests/271.out



Big, but looking rather thorough.

Patch 31 has conflicts on 31, 36, 61, and 291, when compared with my 
pending pull request that improves qcow2.py output:

https://lists.gnu.org/archive/html/qemu-devel/2020-06/msg02527.html

although the resolution is obvious enough: regenerate those .out files. 
With that done, I was able to apply the series and test this.


Tested-by: Eric Blake 
Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v7 0/9] acpi: i386 tweaks

2020-06-10 Thread Michael S. Tsirkin
On Wed, Jun 10, 2020 at 05:53:46PM +0200, Gerd Hoffmann wrote:
> On Wed, Jun 10, 2020 at 10:54:26AM -0400, Michael S. Tsirkin wrote:
> > On Wed, Jun 10, 2020 at 01:40:02PM +0200, Igor Mammedov wrote:
> > > On Wed, 10 Jun 2020 11:41:22 +0200
> > > Gerd Hoffmann  wrote:
> > > 
> > > > First batch of microvm patches, some generic acpi stuff.
> > > > Split the acpi-build.c monster, specifically split the
> > > > pc and q35 and pci bits into a separate file which we
> > > > can skip building at some point in the future.
> > > > 
> > > It looks like series is missing patch to whitelist changed ACPI tables in
> > > bios-table-test.
> > 
> > Right. Does it pass make check?
> 
> No, but after 'git cherry-pick 9b20a3365d73dad4ad144eab9c5827dbbb2e9f21' it 
> does.


OK pls post a complete series, ok?

> > > Do we already have test case for microvm in bios-table-test,
> > > if not it's probably time to add it.
> > 
> > Separately :)
> 
> Especially as this series is just preparing cleanups and doesn't
> actually add acpi support to microvm yet.
> 
> But, yes, adding a testcase sounds useful.
> 
> take care,
>   Gerd




Re: [PATCH v8 00/34] Add subcluster allocation to qcow2

2020-06-10 Thread Eric Blake

On 6/10/20 10:02 AM, Alberto Garcia wrote:

Hi,

here's the new version of the patches to add subcluster allocation
support to qcow2.

Please refer to the cover letter of the first version for a full
description of the patches:

https://lists.gnu.org/archive/html/qemu-block/2019-10/msg00983.html

The big change here is that now when an image is preallocated then the
requested clusters are allocated but the L2 bitmap is left untouched.
This makes it possible to preallocate an image that has a backing
file.

If you want to test this series make sure to apply this patch first:

https://lists.gnu.org/archive/html/qemu-block/2020-06/msg00504.html


Let's spell that the way patchew can recognize:
Based-on: <20200610094600.4029-1-be...@igalia.com>



Berto

v8:
- Patch 30: New patch
- Patch 31: Update test expectations after commit cf2d1203dc
- Patch 32: New patch
- Patch 34: New tests, fixes and general refactoring of the code




--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH] qcow2: Fix preallocation on images with unaligned sizes

2020-06-10 Thread Eric Blake

On 6/10/20 4:46 AM, Alberto Garcia wrote:

When resizing an image with qcow2_co_truncate() using the falloc or
full preallocation modes the code assumes that both the old and new
sizes are cluster-aligned.

There are two problems with this:

   1) The calculation of how many clusters are involved does not always
  get the right result.

  Example: creating a 60KB image and resizing it (with
  preallocation=full) to 80KB won't allocate the second cluster.

   2) No copy-on-write is performed, so in the previous example if
  there is a backing file then the first 60KB of the first cluster
  won't be filled with data from the backing file.

This patch fixes both issues.

Signed-off-by: Alberto Garcia 
---
  block/qcow2.c  | 17 ++---
  tests/qemu-iotests/125 | 21 +
  tests/qemu-iotests/125.out |  9 +
  3 files changed, 44 insertions(+), 3 deletions(-)
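
To make the first problem concrete: with the default 64 KiB cluster size, the
20 KiB being added straddles the tail of the first cluster and the whole
second cluster, so two clusters are involved, whereas arithmetic that assumes
both the old and the new size are cluster-aligned computes zero and the second
cluster is never preallocated. A minimal sketch of that arithmetic follows;
the 64 KiB cluster size is an assumption here, and this is not code from the
patch.

/* Back-of-the-envelope check of the 60 KB -> 80 KB example above, assuming
 * the default 64 KiB cluster size.  Not code from the patch. */
#include <stdint.h>
#include <stdio.h>

#define CLUSTER_SIZE (64 * 1024)

/* Clusters a falloc/full preallocating resize actually has to touch. */
uint64_t clusters_touched(uint64_t old_size, uint64_t new_size)
{
    uint64_t first = old_size / CLUSTER_SIZE;                     /* cluster holding old EOF */
    uint64_t last = (new_size + CLUSTER_SIZE - 1) / CLUSTER_SIZE; /* round the new size up */
    return last - first;
}

int main(void)
{
    /* Naive count assuming cluster-aligned sizes: (80K - 60K) / 64K == 0,
     * i.e. the second cluster would never be allocated. */
    printf("naive:   %llu\n",
           (unsigned long long)((80 * 1024 - 60 * 1024) / CLUSTER_SIZE));

    /* Correct count: the tail of the first cluster plus the new second one. */
    printf("correct: %llu\n",
           (unsigned long long)clusters_touched(60 * 1024, 80 * 1024));
    return 0;
}

Run as-is this prints 0 for the naive count and 2 for the correct one, which
is exactly the discrepancy the commit message describes.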



Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v8 33/34] qcow2: Assert that expand_zero_clusters_in_l1() does not support subclusters

2020-06-10 Thread Eric Blake

On 6/10/20 10:03 AM, Alberto Garcia wrote:

This function is only used by qcow2_expand_zero_clusters() to
downgrade a qcow2 image to a previous version. It is however not
possible to downgrade an image with extended L2 entries because older
versions of qcow2 do not have this feature.


Well, it _is_ possible, but it would involve rewriting the entire L1/L2 
tables (including all internal snapshots), as well as causing I/O to COW 
every cluster where not all subclusters are allocated; and doing that 
conversion while remaining crash-consistent requires some thought and a 
temporary extra load on disk space (we can't discard the old table until 
the new one is completely written).


It would be more accurate to merely state that we are not prepared to 
implement it at this time.




Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
  block/qcow2-cluster.c  | 8 +++-
  tests/qemu-iotests/061 | 6 ++
  tests/qemu-iotests/061.out | 5 +
  3 files changed, 18 insertions(+), 1 deletion(-)



Whether or not we update the commit message, R-b stands for the code.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v8 32/34] qcow2: Allow preallocation and backing files if extended_l2 is set

2020-06-10 Thread Eric Blake

On 6/10/20 10:03 AM, Alberto Garcia wrote:

Traditional qcow2 images don't allow preallocation if a backing file
is set. This is because once a cluster is allocated there is no way to
tell that its data should be read from the backing file.

Extended L2 entries have individual allocation bits for each
subcluster, and therefore it is perfectly possible to have an
allocated cluster with all its subclusters unallocated.

Signed-off-by: Alberto Garcia 
---
  block/qcow2.c  | 7 ---
  tests/qemu-iotests/206.out | 2 +-
  2 files changed, 5 insertions(+), 4 deletions(-)
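
In other words, with extended L2 entries the cluster-level mapping and the
per-subcluster status are decoupled: an L2 entry can carry a host cluster
offset while its subcluster bitmap is still all zeroes, so every read still
falls through to the backing file. The sketch below illustrates that check;
the bit layout (bits 0-31 = subcluster allocated, bits 32-63 = subcluster
reads as zeroes) is assumed from the series' design rather than quoted from
the patches.

/* Rough sketch of per-subcluster status with extended L2 entries; the bit
 * layout is an assumption, not a quote from the series. */
#include <stdbool.h>
#include <stdint.h>

#define SC_PER_CLUSTER 32u   /* subclusters per cluster with extended L2 */

bool subcluster_allocated(uint64_t l2_bitmap, unsigned sc)
{
    return l2_bitmap & (1ULL << sc);
}

bool subcluster_reads_zero(uint64_t l2_bitmap, unsigned sc)
{
    return l2_bitmap & (1ULL << (SC_PER_CLUSTER + sc));
}

/* A preallocated cluster in front of a backing file: the L2 entry has a host
 * offset, but the bitmap stays 0, so the subcluster is neither allocated nor
 * a zero subcluster and its data still comes from the backing file. */
bool falls_through_to_backing(uint64_t l2_bitmap, unsigned sc)
{
    return !subcluster_allocated(l2_bitmap, sc) &&
           !subcluster_reads_zero(l2_bitmap, sc);
}

That decoupling is what makes preallocation with a backing file safe here:
the host clusters already exist, but no subcluster claims to hold guest data
yet.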


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 2/2] qcow2: improve savevm performance - please ignore

2020-06-10 Thread Denis V. Lunev
On 6/10/20 9:58 PM, Denis V. Lunev wrote:
> This patch does 2 standard basic things:
> - it creates intermediate buffer for all writes from QEMU migration code
>   to QCOW2 image,
> - this buffer is sent to disk asynchronously, allowing several writes to
>   run in parallel.
>
> In general, the migration code is fantastically inefficient (by
> observation): buffers are not aligned and are sent in arbitrary pieces,
> often less than 100 bytes per chunk, which results in read-modify-write
> cycles when non-cached I/O is used. It should also be noted that all
> writes go into unallocated image blocks, which also suffer from such
> partial writes to newly allocated clusters.
>
> Snapshot creation time (2 GB Fedora-31 VM running over NVME storage):
>                original   fixed
> cached:        1.79s      1.27s
> non-cached:    3.29s      0.81s
>
> The difference over HDD would be more significant :)
>
> Signed-off-by: Denis V. Lunev 
> CC: Kevin Wolf 
> CC: Max Reitz 
> CC: Vladimir Sementsov-Ogievskiy 
> CC: Denis Plotnikov 
> ---
>  block/qcow2.c | 111 +-
>  block/qcow2.h |   4 ++
>  2 files changed, 113 insertions(+), 2 deletions(-)
>
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 0cd2e6757e..e6232f32e2 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -4797,11 +4797,43 @@ static int qcow2_make_empty(BlockDriverState *bs)
>  return ret;
>  }
>  
> +
> +typedef struct Qcow2VMStateTask {
> +AioTask task;
> +
> +BlockDriverState *bs;
> +int64_t offset;
> +void *buf;
> +size_t bytes;
> +} Qcow2VMStateTask;
> +
> +typedef struct Qcow2SaveVMState {
> +AioTaskPool *pool;
> +Qcow2VMStateTask *t;
> +} Qcow2SaveVMState;
> +
>  static coroutine_fn int qcow2_co_flush_to_os(BlockDriverState *bs)
>  {
>  BDRVQcow2State *s = bs->opaque;
> +Qcow2SaveVMState *state = s->savevm_state;
>  int ret;
>  
> +if (state != NULL) {
> +aio_task_pool_start_task(state->pool, &state->t->task);
> +
> +aio_task_pool_wait_all(state->pool);
> +ret = aio_task_pool_status(state->pool);
> +
> +aio_task_pool_free(state->pool);
> +g_free(state);
> +
> +s->savevm_state = NULL;
> +
> +if (ret < 0) {
> +return ret;
> +}
> +}
> +
>  qemu_co_mutex_lock(&s->lock);
>  ret = qcow2_write_caches(bs);
>  qemu_co_mutex_unlock(&s->lock);
> @@ -5098,14 +5130,89 @@ static int qcow2_has_zero_init(BlockDriverState *bs)
>  }
>  }
>  
> +
> +static coroutine_fn int qcow2_co_vmstate_task_entry(AioTask *task)
> +{
> +int err = 0;
> +Qcow2VMStateTask *t = container_of(task, Qcow2VMStateTask, task);
> +
> +if (t->bytes != 0) {
> +QEMUIOVector local_qiov;
> +qemu_iovec_init_buf(&local_qiov, t->buf, t->bytes);
> +err = t->bs->drv->bdrv_co_pwritev_part(t->bs, t->offset, t->bytes,
> +   &local_qiov, 0, 0);
> +}
> +
> +qemu_vfree(t->buf);
> +return err;
> +}
> +
> +static Qcow2VMStateTask *qcow2_vmstate_task_create(BlockDriverState *bs,
> +int64_t pos, size_t size)
> +{
> +BDRVQcow2State *s = bs->opaque;
> +Qcow2VMStateTask *t = g_new(Qcow2VMStateTask, 1);
> +
> +*t = (Qcow2VMStateTask) {
> +.task.func = qcow2_co_vmstate_task_entry,
> +.buf = qemu_blockalign(bs, size),
> +.offset = qcow2_vm_state_offset(s) + pos,
> +.bs = bs,
> +};
> +
> +return t;
> +}
> +
>  static int qcow2_save_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
>int64_t pos)
>  {
>  BDRVQcow2State *s = bs->opaque;
> +Qcow2SaveVMState *state = s->savevm_state;
> +Qcow2VMStateTask *t;
> +size_t buf_size = MAX(s->cluster_size, 1 * MiB);
> +size_t to_copy;
> +size_t off;
>  
>  BLKDBG_EVENT(bs->file, BLKDBG_VMSTATE_SAVE);
> -return bs->drv->bdrv_co_pwritev_part(bs, qcow2_vm_state_offset(s) + pos,
> - qiov->size, qiov, 0, 0);
> +
> +if (state == NULL) {
> +state = g_new(Qcow2SaveVMState, 1);
> +*state = (Qcow2SaveVMState) {
> +.pool = aio_task_pool_new(QCOW2_MAX_WORKERS),
> +.t = qcow2_vmstate_task_create(bs, pos, buf_size),
> +};
> +
> +s->savevm_state = state;
> +}
> +
> +if (aio_task_pool_status(state->pool) != 0) {
> +return aio_task_pool_status(state->pool);
> +}
> +
> +t = state->t;
> +if (t->offset + t->bytes != qcow2_vm_state_offset(s) + pos) {
> +/* Normally this branch is not reachable from migration */
> +return bs->drv->bdrv_co_pwritev_part(bs,
> +qcow2_vm_state_offset(s) + pos, qiov->size, qiov, 0, 0);
> +}
> +
> +off = 0;
> +while (1) {
> +to_copy = MIN(qiov->size - off, buf_size - t->bytes);
> +qemu_iovec_to_buf(qiov, off, t->buf + t->bytes, to_copy);
> + 

Re: [PATCH 1/2] aio: allow to wait for coroutine pool from different coroutine - please ignore

2020-06-10 Thread Denis V. Lunev
On 6/10/20 9:58 PM, Denis V. Lunev wrote:
> The patch preserves the constraint that the only waiter is allowed.
>
> The patch renames AioTaskPool->main_co to wake_co and removes
> AioTaskPool->waiting flag. wake_co keeps coroutine, which is
> waiting for wakeup on worker completion. Thus 'waiting' flag
> in this semantics is equivalent to 'wake_co != NULL'.
>
> Signed-off-by: Denis V. Lunev 
> CC: Kevin Wolf 
> CC: Max Reitz 
> CC: Vladimir Sementsov-Ogievskiy 
> CC: Denis Plotnikov 
> ---
>  block/aio_task.c | 17 -
>  1 file changed, 8 insertions(+), 9 deletions(-)
>
> diff --git a/block/aio_task.c b/block/aio_task.c
> index 88989fa248..5183b0729d 100644
> --- a/block/aio_task.c
> +++ b/block/aio_task.c
> @@ -27,11 +27,10 @@
>  #include "block/aio_task.h"
>  
>  struct AioTaskPool {
> -Coroutine *main_co;
> +Coroutine *wake_co;
>  int status;
>  int max_busy_tasks;
>  int busy_tasks;
> -bool waiting;
>  };
>  
>  static void coroutine_fn aio_task_co(void *opaque)
> @@ -52,21 +51,21 @@ static void coroutine_fn aio_task_co(void *opaque)
>  
>  g_free(task);
>  
> -if (pool->waiting) {
> -pool->waiting = false;
> -aio_co_wake(pool->main_co);
> +if (pool->wake_co != NULL) {
> +aio_co_wake(pool->wake_co);
> +pool->wake_co = NULL;
>  }
>  }
>  
>  void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool)
>  {
>  assert(pool->busy_tasks > 0);
> -assert(qemu_coroutine_self() == pool->main_co);
> +assert(pool->wake_co == NULL);
>  
> -pool->waiting = true;
> +pool->wake_co = qemu_coroutine_self();
>  qemu_coroutine_yield();
>  
> -assert(!pool->waiting);
> +assert(pool->wake_co == NULL);
>  assert(pool->busy_tasks < pool->max_busy_tasks);
>  }
>  
> @@ -98,7 +97,7 @@ AioTaskPool *coroutine_fn aio_task_pool_new(int 
> max_busy_tasks)
>  {
>  AioTaskPool *pool = g_new0(AioTaskPool, 1);
>  
> -pool->main_co = qemu_coroutine_self();
> +pool->wake_co = NULL;
>  pool->max_busy_tasks = max_busy_tasks;
>  
>  return pool;
please ignore



[PATCH 2/2] qcow2: improve savevm performance

2020-06-10 Thread Denis V. Lunev
This patch does 2 standard basic things:
- it creates intermediate buffer for all writes from QEMU migration code
  to QCOW2 image,
- this buffer is sent to disk asynchronously, allowing several writes to
  run in parallel.

In general, the migration code is fantastically inefficient (by observation):
buffers are not aligned and are sent in arbitrary pieces, often less than
100 bytes per chunk, which results in read-modify-write cycles when non-cached
I/O is used. It should also be noted that all writes go into unallocated image
blocks, which also suffer from such partial writes to newly allocated clusters.

Snapshot creation time (2 GB Fedora-31 VM running over NVME storage):
               original   fixed
cached:        1.79s      1.27s
non-cached:    3.29s      0.81s

The difference over HDD would be more significant :)

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Max Reitz 
CC: Vladimir Sementsov-Ogievskiy 
CC: Denis Plotnikov 
---
 block/qcow2.c | 111 +-
 block/qcow2.h |   4 ++
 2 files changed, 113 insertions(+), 2 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 0cd2e6757e..e6232f32e2 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -4797,11 +4797,43 @@ static int qcow2_make_empty(BlockDriverState *bs)
 return ret;
 }
 
+
+typedef struct Qcow2VMStateTask {
+AioTask task;
+
+BlockDriverState *bs;
+int64_t offset;
+void *buf;
+size_t bytes;
+} Qcow2VMStateTask;
+
+typedef struct Qcow2SaveVMState {
+AioTaskPool *pool;
+Qcow2VMStateTask *t;
+} Qcow2SaveVMState;
+
 static coroutine_fn int qcow2_co_flush_to_os(BlockDriverState *bs)
 {
 BDRVQcow2State *s = bs->opaque;
+Qcow2SaveVMState *state = s->savevm_state;
 int ret;
 
+if (state != NULL) {
+aio_task_pool_start_task(state->pool, &state->t->task);
+
+aio_task_pool_wait_all(state->pool);
+ret = aio_task_pool_status(state->pool);
+
+aio_task_pool_free(state->pool);
+g_free(state);
+
+s->savevm_state = NULL;
+
+if (ret < 0) {
+return ret;
+}
+}
+
 qemu_co_mutex_lock(&s->lock);
 ret = qcow2_write_caches(bs);
 qemu_co_mutex_unlock(&s->lock);
@@ -5098,14 +5130,89 @@ static int qcow2_has_zero_init(BlockDriverState *bs)
 }
 }
 
+
+static coroutine_fn int qcow2_co_vmstate_task_entry(AioTask *task)
+{
+int err = 0;
+Qcow2VMStateTask *t = container_of(task, Qcow2VMStateTask, task);
+
+if (t->bytes != 0) {
+QEMUIOVector local_qiov;
+qemu_iovec_init_buf(&local_qiov, t->buf, t->bytes);
+err = t->bs->drv->bdrv_co_pwritev_part(t->bs, t->offset, t->bytes,
+   &local_qiov, 0, 0);
+}
+
+qemu_vfree(t->buf);
+return err;
+}
+
+static Qcow2VMStateTask *qcow2_vmstate_task_create(BlockDriverState *bs,
+int64_t pos, size_t size)
+{
+BDRVQcow2State *s = bs->opaque;
+Qcow2VMStateTask *t = g_new(Qcow2VMStateTask, 1);
+
+*t = (Qcow2VMStateTask) {
+.task.func = qcow2_co_vmstate_task_entry,
+.buf = qemu_blockalign(bs, size),
+.offset = qcow2_vm_state_offset(s) + pos,
+.bs = bs,
+};
+
+return t;
+}
+
 static int qcow2_save_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
   int64_t pos)
 {
 BDRVQcow2State *s = bs->opaque;
+Qcow2SaveVMState *state = s->savevm_state;
+Qcow2VMStateTask *t;
+size_t buf_size = MAX(s->cluster_size, 1 * MiB);
+size_t to_copy;
+size_t off;
 
 BLKDBG_EVENT(bs->file, BLKDBG_VMSTATE_SAVE);
-return bs->drv->bdrv_co_pwritev_part(bs, qcow2_vm_state_offset(s) + pos,
- qiov->size, qiov, 0, 0);
+
+if (state == NULL) {
+state = g_new(Qcow2SaveVMState, 1);
+*state = (Qcow2SaveVMState) {
+.pool = aio_task_pool_new(QCOW2_MAX_WORKERS),
+.t = qcow2_vmstate_task_create(bs, pos, buf_size),
+};
+
+s->savevm_state = state;
+}
+
+if (aio_task_pool_status(state->pool) != 0) {
+return aio_task_pool_status(state->pool);
+}
+
+t = state->t;
+if (t->offset + t->bytes != qcow2_vm_state_offset(s) + pos) {
+/* Normally this branch is not reachable from migration */
+return bs->drv->bdrv_co_pwritev_part(bs,
+qcow2_vm_state_offset(s) + pos, qiov->size, qiov, 0, 0);
+}
+
+off = 0;
+while (1) {
+to_copy = MIN(qiov->size - off, buf_size - t->bytes);
+qemu_iovec_to_buf(qiov, off, t->buf + t->bytes, to_copy);
+t->bytes += to_copy;
+if (t->bytes < buf_size) {
+return 0;
+}
+
+aio_task_pool_start_task(state->pool, &t->task);
+
+pos += to_copy;
+off += to_copy;
+state->t = t = qcow2_vmstate_task_create(bs, pos, buf_size);
+}
+
+return 0;
 }
 
 static int 

[PATCH 1/2] aio: allow to wait for coroutine pool from different coroutine

2020-06-10 Thread Denis V. Lunev
The patch preserves the constraint that the only waiter is allowed.

The patch renames AioTaskPool->main_co to wake_co and removes
AioTaskPool->waiting flag. wake_co keeps coroutine, which is
waiting for wakeup on worker completion. Thus 'waiting' flag
in this semantics is equivalent to 'wake_co != NULL'.

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Max Reitz 
CC: Vladimir Sementsov-Ogievskiy 
CC: Denis Plotnikov 
---
 block/aio_task.c | 17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/block/aio_task.c b/block/aio_task.c
index 88989fa248..5183b0729d 100644
--- a/block/aio_task.c
+++ b/block/aio_task.c
@@ -27,11 +27,10 @@
 #include "block/aio_task.h"
 
 struct AioTaskPool {
-Coroutine *main_co;
+Coroutine *wake_co;
 int status;
 int max_busy_tasks;
 int busy_tasks;
-bool waiting;
 };
 
 static void coroutine_fn aio_task_co(void *opaque)
@@ -52,21 +51,21 @@ static void coroutine_fn aio_task_co(void *opaque)
 
 g_free(task);
 
-if (pool->waiting) {
-pool->waiting = false;
-aio_co_wake(pool->main_co);
+if (pool->wake_co != NULL) {
+aio_co_wake(pool->wake_co);
+pool->wake_co = NULL;
 }
 }
 
 void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool)
 {
 assert(pool->busy_tasks > 0);
-assert(qemu_coroutine_self() == pool->main_co);
+assert(pool->wake_co == NULL);
 
-pool->waiting = true;
+pool->wake_co = qemu_coroutine_self();
 qemu_coroutine_yield();
 
-assert(!pool->waiting);
+assert(pool->wake_co == NULL);
 assert(pool->busy_tasks < pool->max_busy_tasks);
 }
 
@@ -98,7 +97,7 @@ AioTaskPool *coroutine_fn aio_task_pool_new(int 
max_busy_tasks)
 {
 AioTaskPool *pool = g_new0(AioTaskPool, 1);
 
-pool->main_co = qemu_coroutine_self();
+pool->wake_co = NULL;
 pool->max_busy_tasks = max_busy_tasks;
 
 return pool;
-- 
2.17.1




[PATCH v2 0/2] qcow2: seriously improve savevm performance

2020-06-10 Thread Denis V. Lunev
This series does two basic things:
- it creates intermediate buffer for all writes from QEMU migration code
  to QCOW2 image,
- this buffer is sent to disk asynchronously, allowing several writes to
  run in parallel.

In general, the migration code is fantastically inefficient (by observation):
buffers are not aligned and are sent in arbitrary pieces, often less than
100 bytes per chunk, which results in read-modify-write cycles when non-cached
I/O is used. It should also be noted that all writes go into unallocated image
blocks, which also suffer from such partial writes to newly allocated clusters.

This patch series is an implementation of idea discussed in the RFC
posted by Denis
https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg01925.html
Results with this series over NVMe are better than with the original code:

             original   rfc     this
cached:      1.79s      2.38s   1.27s
non-cached:  3.29s      1.31s   0.81s

Changes from v1:
- patchew warning fixed
- fixed validation that only 1 waiter is allowed in patch 1

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Max Reitz 
CC: Vladimir Sementsov-Ogievskiy 
CC: Denis Plotnikov 




[PATCH 2/2] qcow2: improve savevm performance

2020-06-10 Thread Denis V. Lunev
This patch does 2 standard basic things:
- it creates intermediate buffer for all writes from QEMU migration code
  to QCOW2 image,
- this buffer is sent to disk asynchronously, allowing several writes to
  run in parallel.

In general, the migration code is fantastically inefficient (by observation):
buffers are not aligned and are sent in arbitrary pieces, often less than
100 bytes per chunk, which results in read-modify-write cycles when non-cached
I/O is used. It should also be noted that all writes go into unallocated image
blocks, which also suffer from such partial writes to newly allocated clusters.

Snapshot creation time (2 GB Fedora-31 VM running over NVME storage):
               original   fixed
cached:        1.79s      1.27s
non-cached:    3.29s      0.81s

The difference over HDD would be more significant :)

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Max Reitz 
CC: Vladimir Sementsov-Ogievskiy 
CC: Denis Plotnikov 
---
 block/qcow2.c | 111 +-
 block/qcow2.h |   4 ++
 2 files changed, 113 insertions(+), 2 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 0cd2e6757e..e6232f32e2 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -4797,11 +4797,43 @@ static int qcow2_make_empty(BlockDriverState *bs)
 return ret;
 }
 
+
+typedef struct Qcow2VMStateTask {
+AioTask task;
+
+BlockDriverState *bs;
+int64_t offset;
+void *buf;
+size_t bytes;
+} Qcow2VMStateTask;
+
+typedef struct Qcow2SaveVMState {
+AioTaskPool *pool;
+Qcow2VMStateTask *t;
+} Qcow2SaveVMState;
+
 static coroutine_fn int qcow2_co_flush_to_os(BlockDriverState *bs)
 {
 BDRVQcow2State *s = bs->opaque;
+Qcow2SaveVMState *state = s->savevm_state;
 int ret;
 
+if (state != NULL) {
+aio_task_pool_start_task(state->pool, &state->t->task);
+
+aio_task_pool_wait_all(state->pool);
+ret = aio_task_pool_status(state->pool);
+
+aio_task_pool_free(state->pool);
+g_free(state);
+
+s->savevm_state = NULL;
+
+if (ret < 0) {
+return ret;
+}
+}
+
 qemu_co_mutex_lock(&s->lock);
 ret = qcow2_write_caches(bs);
 qemu_co_mutex_unlock(&s->lock);
@@ -5098,14 +5130,89 @@ static int qcow2_has_zero_init(BlockDriverState *bs)
 }
 }
 
+
+static coroutine_fn int qcow2_co_vmstate_task_entry(AioTask *task)
+{
+int err = 0;
+Qcow2VMStateTask *t = container_of(task, Qcow2VMStateTask, task);
+
+if (t->bytes != 0) {
+QEMUIOVector local_qiov;
+qemu_iovec_init_buf(&local_qiov, t->buf, t->bytes);
+err = t->bs->drv->bdrv_co_pwritev_part(t->bs, t->offset, t->bytes,
+   &local_qiov, 0, 0);
+}
+
+qemu_vfree(t->buf);
+return err;
+}
+
+static Qcow2VMStateTask *qcow2_vmstate_task_create(BlockDriverState *bs,
+int64_t pos, size_t size)
+{
+BDRVQcow2State *s = bs->opaque;
+Qcow2VMStateTask *t = g_new(Qcow2VMStateTask, 1);
+
+*t = (Qcow2VMStateTask) {
+.task.func = qcow2_co_vmstate_task_entry,
+.buf = qemu_blockalign(bs, size),
+.offset = qcow2_vm_state_offset(s) + pos,
+.bs = bs,
+};
+
+return t;
+}
+
 static int qcow2_save_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
   int64_t pos)
 {
 BDRVQcow2State *s = bs->opaque;
+Qcow2SaveVMState *state = s->savevm_state;
+Qcow2VMStateTask *t;
+size_t buf_size = MAX(s->cluster_size, 1 * MiB);
+size_t to_copy;
+size_t off;
 
 BLKDBG_EVENT(bs->file, BLKDBG_VMSTATE_SAVE);
-return bs->drv->bdrv_co_pwritev_part(bs, qcow2_vm_state_offset(s) + pos,
- qiov->size, qiov, 0, 0);
+
+if (state == NULL) {
+state = g_new(Qcow2SaveVMState, 1);
+*state = (Qcow2SaveVMState) {
+.pool = aio_task_pool_new(QCOW2_MAX_WORKERS),
+.t = qcow2_vmstate_task_create(bs, pos, buf_size),
+};
+
+s->savevm_state = state;
+}
+
+if (aio_task_pool_status(state->pool) != 0) {
+return aio_task_pool_status(state->pool);
+}
+
+t = state->t;
+if (t->offset + t->bytes != qcow2_vm_state_offset(s) + pos) {
+/* Normally this branch is not reachable from migration */
+return bs->drv->bdrv_co_pwritev_part(bs,
+qcow2_vm_state_offset(s) + pos, qiov->size, qiov, 0, 0);
+}
+
+off = 0;
+while (1) {
+to_copy = MIN(qiov->size - off, buf_size - t->bytes);
+qemu_iovec_to_buf(qiov, off, t->buf + t->bytes, to_copy);
+t->bytes += to_copy;
+if (t->bytes < buf_size) {
+return 0;
+}
+
+aio_task_pool_start_task(state->pool, &t->task);
+
+pos += to_copy;
+off += to_copy;
+state->t = t = qcow2_vmstate_task_create(bs, pos, buf_size);
+}
+
+return 0;
 }
 
 static int 

[PATCH 1/2] aio: allow to wait for coroutine pool from different coroutine

2020-06-10 Thread Denis V. Lunev
The patch preserves the constraint that the only waiter is allowed.

The patch renames AioTaskPool->main_co to wake_co and removes
AioTaskPool->waiting flag. wake_co keeps coroutine, which is
waiting for wakeup on worker completion. Thus 'waiting' flag
in this semantics is equivalent to 'wake_co != NULL'.

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Max Reitz 
CC: Vladimir Sementsov-Ogievskiy 
CC: Denis Plotnikov 
---
 block/aio_task.c | 17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/block/aio_task.c b/block/aio_task.c
index 88989fa248..5183b0729d 100644
--- a/block/aio_task.c
+++ b/block/aio_task.c
@@ -27,11 +27,10 @@
 #include "block/aio_task.h"
 
 struct AioTaskPool {
-Coroutine *main_co;
+Coroutine *wake_co;
 int status;
 int max_busy_tasks;
 int busy_tasks;
-bool waiting;
 };
 
 static void coroutine_fn aio_task_co(void *opaque)
@@ -52,21 +51,21 @@ static void coroutine_fn aio_task_co(void *opaque)
 
 g_free(task);
 
-if (pool->waiting) {
-pool->waiting = false;
-aio_co_wake(pool->main_co);
+if (pool->wake_co != NULL) {
+aio_co_wake(pool->wake_co);
+pool->wake_co = NULL;
 }
 }
 
 void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool)
 {
 assert(pool->busy_tasks > 0);
-assert(qemu_coroutine_self() == pool->main_co);
+assert(pool->wake_co == NULL);
 
-pool->waiting = true;
+pool->wake_co = qemu_coroutine_self();
 qemu_coroutine_yield();
 
-assert(!pool->waiting);
+assert(pool->wake_co == NULL);
 assert(pool->busy_tasks < pool->max_busy_tasks);
 }
 
@@ -98,7 +97,7 @@ AioTaskPool *coroutine_fn aio_task_pool_new(int 
max_busy_tasks)
 {
 AioTaskPool *pool = g_new0(AioTaskPool, 1);
 
-pool->main_co = qemu_coroutine_self();
+pool->wake_co = NULL;
 pool->max_busy_tasks = max_busy_tasks;
 
 return pool;
-- 
2.17.1




Re: [PATCH 0/2] qcow2: seriously improve savevm performance

2020-06-10 Thread no-reply
Patchew URL: https://patchew.org/QEMU/20200610144129.27659-1-...@openvz.org/



Hi,

This series failed the asan build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-debug@fedora TARGET_LIST=x86_64-softmmu J=14 NETWORK=1
=== TEST SCRIPT END ===

  CC  block/gluster.o
  CC  block/ssh.o
  CC  block/dmg-bz2.o
/tmp/qemu-test/src/block/qcow2.c:5139:9: error: variable 'err' is used 
uninitialized whenever 'if' condition is false 
[-Werror,-Wsometimes-uninitialized]
if (t->bytes != 0) {
^
/tmp/qemu-test/src/block/qcow2.c:5147:12: note: uninitialized use occurs here
---
   ^
= 0
1 error generated.
make: *** [/tmp/qemu-test/src/rules.mak:69: block/qcow2.o] Error 1
make: *** Waiting for unfinished jobs
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 665, in 
---
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', 
'--label', 'com.qemu.instance.uuid=213a8da69081459b91db63888e1cc6a0', '-u', 
'1001', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 
'TARGET_LIST=x86_64-softmmu', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 
'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=', '-e', 'CCACHE_DIR=/var/tmp/ccache', 
'-v', '/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', 
'/var/tmp/patchew-tester-tmp-v54hgiy2/src/docker-src.2020-06-10-14.20.39.19315:/var/tmp/qemu:z,ro',
 'qemu:fedora', '/var/tmp/qemu/run', 'test-debug']' returned non-zero exit 
status 2.
filter=--filter=label=com.qemu.instance.uuid=213a8da69081459b91db63888e1cc6a0
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-v54hgiy2/src'
make: *** [docker-run-test-debug@fedora] Error 2

real    4m8.609s
user    0m8.917s


The full log is available at
http://patchew.org/logs/20200610144129.27659-1-...@openvz.org/testing.asan/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com
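
The clang diagnostic above is the classic -Wsometimes-uninitialized shape: a
local assigned only inside an 'if' but returned on every path. Below is a
minimal, hypothetical reconstruction of that pattern with the usual fix
already applied; do_write() is an invented stand-in, and this is not the code
from the posted patch.

/* Minimal sketch of the pattern behind the warning above, with the usual
 * fix applied; do_write() is an invented stand-in. */
#include <stddef.h>

static int do_write(const void *buf, size_t bytes)
{
    (void)buf;
    return bytes > 4096 ? -1 : 0;   /* pretend oversized writes fail */
}

int flush_buffer(const void *buf, size_t bytes)
{
    /* Declaring plain 'int err;' here is what clang flags: when bytes == 0
     * the 'if' body never runs and 'return err' reads an indeterminate
     * value.  Initializing to 0 makes the empty case an explicit success. */
    int err = 0;

    if (bytes != 0) {
        err = do_write(buf, bytes);
    }
    return err;
}

Presumably an empty task should simply report success, which is what the
explicit initializer expresses.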

Re: [PATCH 0/2] qcow2: seriously improve savevm performance

2020-06-10 Thread no-reply
Patchew URL: https://patchew.org/QEMU/20200610144129.27659-1-...@openvz.org/



Hi,

This series failed the docker-mingw@fedora build test. Please find the testing 
commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#! /bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-mingw@fedora J=14 NETWORK=1
=== TEST SCRIPT END ===

  BUNZIP2 pc-bios/edk2-i386-code.fd.bz2
  BUNZIP2 pc-bios/edk2-arm-vars.fd.bz2
/tmp/qemu-test/src/block/qcow2.c: In function 'qcow2_co_vmstate_task_entry':
/tmp/qemu-test/src/block/qcow2.c:5147:12: error: 'err' may be used 
uninitialized in this function [-Werror=maybe-uninitialized]
 return err;
^~~
cc1: all warnings being treated as errors
make: *** [/tmp/qemu-test/src/rules.mak:69: block/qcow2.o] Error 1
make: *** Waiting for unfinished jobs
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 665, in 
---
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', 
'--label', 'com.qemu.instance.uuid=a0327ae2ef3c4163bdd307b30bc90a7c', '-u', 
'1003', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', 
'-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 
'SHOW_ENV=', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', 
'/home/patchew2/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', 
'/var/tmp/patchew-tester-tmp-fbvrtr6u/src/docker-src.2020-06-10-14.22.01.21453:/var/tmp/qemu:z,ro',
 'qemu:fedora', '/var/tmp/qemu/run', 'test-mingw']' returned non-zero exit 
status 2.
filter=--filter=label=com.qemu.instance.uuid=a0327ae2ef3c4163bdd307b30bc90a7c
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-fbvrtr6u/src'
make: *** [docker-run-test-mingw@fedora] Error 2

real    2m20.791s
user    0m8.483s


The full log is available at
http://patchew.org/logs/20200610144129.27659-1-...@openvz.org/testing.docker-mingw@fedora/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com

[PATCH v3 4/4] block/io: auto-no-fallback for write-zeroes

2020-06-10 Thread Vladimir Sementsov-Ogievskiy
When BDRV_REQ_NO_FALLBACK is supported, the NBD driver supports a
larger request size.  Add code to try large zero requests with a
NO_FALLBACK request prior to having to split a request into chunks
according to max_pwrite_zeroes.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/io.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/block/io.c b/block/io.c
index 3fae97da2d..9a6dabb595 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1778,6 +1778,7 @@ static int coroutine_fn 
bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
 int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
 bs->bl.request_alignment);
 int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer, MAX_BOUNCE_BUFFER);
+bool auto_no_fallback;
 
 assert(alignment % bs->bl.request_alignment == 0);
 
@@ -1785,6 +1786,16 @@ static int coroutine_fn 
bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
 return -ENOMEDIUM;
 }
 
+if (!(flags & BDRV_REQ_NO_FALLBACK) &&
+(bs->supported_zero_flags & BDRV_REQ_NO_FALLBACK) &&
+bs->bl.max_pwrite_zeroes && bs->bl.max_pwrite_zeroes < bytes &&
+bs->bl.max_pwrite_zeroes < bs->bl.max_pwrite_zeroes_fast)
+{
+assert(drv->bdrv_co_pwrite_zeroes);
+flags |= BDRV_REQ_NO_FALLBACK;
+auto_no_fallback = true;
+}
+
 if ((flags & ~bs->supported_zero_flags) & BDRV_REQ_NO_FALLBACK) {
 return -ENOTSUP;
 }
@@ -1829,6 +1840,13 @@ static int coroutine_fn 
bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
 if (drv->bdrv_co_pwrite_zeroes) {
 ret = drv->bdrv_co_pwrite_zeroes(bs, offset, num,
  flags & bs->supported_zero_flags);
+if (ret == -ENOTSUP && auto_no_fallback) {
+flags &= ~BDRV_REQ_NO_FALLBACK;
+max_write_zeroes =
+QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pwrite_zeroes,
+ INT_MAX), alignment);
+continue;
+}
 if (ret != -ENOTSUP && (flags & BDRV_REQ_FUA) &&
 !(bs->supported_zero_flags & BDRV_REQ_FUA)) {
 need_flush = true;
-- 
2.21.0




[PATCH v3 2/4] block/nbd: define new max_write_zero_fast limit

2020-06-10 Thread Vladimir Sementsov-Ogievskiy
The NBD spec was recently updated to clarify that max_block doesn't
relate to NBD_CMD_WRITE_ZEROES with NBD_CMD_FLAG_FAST_ZERO (which
mirrors Qemu flag BDRV_REQ_NO_FALLBACK).

bs->bl.max_write_zero_fast is zero by default which means using
max_pwrite_zeroes. Update nbd driver to allow larger requests with
BDRV_REQ_NO_FALLBACK.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/nbd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/nbd.c b/block/nbd.c
index 4ac23c8f62..b0584cf68d 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -1956,6 +1956,7 @@ static void nbd_refresh_limits(BlockDriverState *bs, 
Error **errp)
 
 bs->bl.request_alignment = min;
 bs->bl.max_pdiscard = QEMU_ALIGN_DOWN(INT_MAX, min);
+bs->bl.max_pwrite_zeroes_fast = bs->bl.max_pdiscard;
 bs->bl.max_pwrite_zeroes = max;
 bs->bl.max_transfer = max;
 
-- 
2.21.0




[PATCH v3 3/4] block/io: refactor bdrv_co_do_pwrite_zeroes head calculation

2020-06-10 Thread Vladimir Sementsov-Ogievskiy
It's wrong to update head using num in this place, as num may be
reduced during the iteration (seems it doesn't, but it's not obvious),
and we'll have wrong head value on next iteration.

Instead update head at iteration end.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
---
 block/io.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/io.c b/block/io.c
index 0af62a53fd..3fae97da2d 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1813,7 +1813,6 @@ static int coroutine_fn 
bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
  * convenience, limit this request to max_transfer even if
  * we don't need to fall back to writes.  */
 num = MIN(MIN(bytes, max_transfer), alignment - head);
-head = (head + num) % alignment;
 assert(num < max_write_zeroes);
 } else if (tail && num > alignment) {
 /* Shorten the request to the last aligned sector.  */
@@ -1872,6 +1871,9 @@ static int coroutine_fn 
bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
 
 offset += num;
 bytes -= num;
+if (head) {
+head = offset % alignment;
+}
 }
 
 fail:
-- 
2.21.0




[PATCH v3 1/4] block: add max_pwrite_zeroes_fast to BlockLimits

2020-06-10 Thread Vladimir Sementsov-Ogievskiy
The NBD spec was recently updated to clarify that max_block doesn't
relate to NBD_CMD_WRITE_ZEROES with NBD_CMD_FLAG_FAST_ZERO (which
mirrors Qemu flag BDRV_REQ_NO_FALLBACK). To drop the restriction we
need new max_pwrite_zeroes_fast.

Default value of new max_pwrite_zeroes_fast is zero and it means
use max_pwrite_zeroes. So this commit semantically changes nothing.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block_int.h |  8 
 block/io.c| 17 -
 2 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 791de6a59c..277e32fe31 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -626,6 +626,14 @@ typedef struct BlockLimits {
  * pwrite_zeroes_alignment. May be 0 if no inherent 32-bit limit */
 int32_t max_pwrite_zeroes;
 
+/*
+ * Maximum number of bytes that can be zeroed at once if the flag
+ * BDRV_REQ_NO_FALLBACK is specified. Must be a multiple of
+ * pwrite_zeroes_alignment.
+ * If 0, max_pwrite_zeroes is used for no-fallback case.
+ */
+int64_t max_pwrite_zeroes_fast;
+
 /* Optimal alignment for write zeroes requests in bytes. A power
  * of 2 is best but not mandatory.  Must be a multiple of
  * bl.request_alignment, and must be less than max_pwrite_zeroes
diff --git a/block/io.c b/block/io.c
index df8f2a98d4..0af62a53fd 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1774,12 +1774,13 @@ static int coroutine_fn 
bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
 bool need_flush = false;
 int head = 0;
 int tail = 0;
-
-int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
+int max_write_zeroes;
 int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
 bs->bl.request_alignment);
 int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer, MAX_BOUNCE_BUFFER);
 
+assert(alignment % bs->bl.request_alignment == 0);
+
 if (!drv) {
 return -ENOMEDIUM;
 }
@@ -1788,12 +1789,18 @@ static int coroutine_fn 
bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
 return -ENOTSUP;
 }
 
-assert(alignment % bs->bl.request_alignment == 0);
-head = offset % alignment;
-tail = (offset + bytes) % alignment;
+if ((flags & BDRV_REQ_NO_FALLBACK) && bs->bl.max_pwrite_zeroes_fast) {
+max_write_zeroes = bs->bl.max_pwrite_zeroes_fast;
+} else {
+max_write_zeroes = bs->bl.max_pwrite_zeroes;
+}
+max_write_zeroes = MIN_NON_ZERO(max_write_zeroes, INT_MAX);
 max_write_zeroes = QEMU_ALIGN_DOWN(max_write_zeroes, alignment);
 assert(max_write_zeroes >= bs->bl.request_alignment);
 
+head = offset % alignment;
+tail = (offset + bytes) % alignment;
+
 while (bytes > 0 && !ret) {
 int num = bytes;
 
-- 
2.21.0




[PATCH v3 0/4] nbd: reduce max_block restrictions

2020-06-10 Thread Vladimir Sementsov-Ogievskiy
Recent changes in the NBD protocol allow some commands to be used without
the max_block restriction. Let's drop those restrictions.

NBD change is here:
https://github.com/NetworkBlockDevice/nbd/commit/9f30fedb8699f151e7ef4ccc07e624330be3316b#diff-762fb7c670348da69cc9050ef58fe3ae

v3: first two patches from v2 was merged. Let's continue with the rest.

Vladimir Sementsov-Ogievskiy (4):
  block: add max_pwrite_zeroes_fast to BlockLimits
  block/nbd: define new max_write_zero_fast limit
  block/io: refactor bdrv_co_do_pwrite_zeroes head calculation
  block/io: auto-no-fallback for write-zeroes

 include/block/block_int.h |  8 
 block/io.c| 39 +--
 block/nbd.c   |  1 +
 3 files changed, 42 insertions(+), 6 deletions(-)

-- 
2.21.0




Re: [PATCH 0/2] qcow2: seriously improve savevm performance

2020-06-10 Thread no-reply
Patchew URL: https://patchew.org/QEMU/20200610144129.27659-1-...@openvz.org/



Hi,

This series failed the docker-quick@centos7 build test. Please find the testing 
commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

  CC  crypto/hash.o
  CC  crypto/hash-nettle.o
/tmp/qemu-test/src/block/qcow2.c: In function 'qcow2_co_vmstate_task_entry':
/tmp/qemu-test/src/block/qcow2.c:5147:5: error: 'err' may be used uninitialized 
in this function [-Werror=maybe-uninitialized]
 return err;
 ^
cc1: all warnings being treated as errors
  CC  crypto/hmac.o
  CC  crypto/hmac-nettle.o
make: *** [block/qcow2.o] Error 1
make: *** Waiting for unfinished jobs
  CC  crypto/desrfb.o
Traceback (most recent call last):
---
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', 
'--label', 'com.qemu.instance.uuid=3bb6d855342d412ca997d990b1688b3c', '-u', 
'1001', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', 
'-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 
'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', 
'/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', 
'/var/tmp/patchew-tester-tmp-0hsvevb2/src/docker-src.2020-06-10-14.17.46.12598:/var/tmp/qemu:z,ro',
 'qemu:centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit 
status 2.
filter=--filter=label=com.qemu.instance.uuid=3bb6d855342d412ca997d990b1688b3c
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-0hsvevb2/src'
make: *** [docker-run-test-quick@centos7] Error 2

real    2m9.722s
user    0m9.192s


The full log is available at
http://patchew.org/logs/20200610144129.27659-1-...@openvz.org/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com

[PULL v2 2/3] nbd/server: Avoid long error message assertions CVE-2020-10761

2020-06-10 Thread Eric Blake
Ever since commit 36683283 (v2.8), the server code asserts that error
strings sent to the client are well-formed per the protocol by not
exceeding the maximum string length of 4096.  At the time the server
first started sending error messages, the assertion could not be
triggered, because messages were completely under our control.
However, over the years, we have added latent scenarios where a client
could trigger the server to attempt an error message that would
include the client's information if it passed other checks first:

- requesting NBD_OPT_INFO/GO on an export name that is not present
  (commit 0cfae925 in v2.12 echoes the name)

- requesting NBD_OPT_LIST/SET_META_CONTEXT on an export name that is
  not present (commit e7b1948d in v2.12 echoes the name)

At the time, those were still safe because we flagged names larger
than 256 bytes with a different message; but that changed in commit
93676c88 (v4.2) when we raised the name limit to 4096 to match the NBD
string limit.  (That commit also failed to change the magic number
4096 in nbd_negotiate_send_rep_err to the just-introduced named
constant.)  So with that commit, long client names appended to server
text can now trigger the assertion, and thus be used as a denial of
service attack against a server.  As a mitigating factor, if the
server requires TLS, the client cannot trigger the problematic paths
unless it first supplies TLS credentials, and such trusted clients are
less likely to try to intentionally crash the server.

We may later want to further sanitize the user-supplied strings we
place into our error messages, such as scrubbing out control
characters, but that is less important to the CVE fix, so it can be a
later patch to the new nbd_sanitize_name.

Consideration was given to changing the assertion in
nbd_negotiate_send_rep_verr to instead merely log a server error and
truncate the message, to avoid leaving a latent path that could
trigger a future CVE DoS on any new error message.  However, this
merely complicates the code for something that is already (correctly)
flagging coding errors, and now that we are aware of the long message
pitfall, we are less likely to introduce such errors in the future,
which would make such error handling dead code.

Reported-by: Xueqiang Wei 
CC: qemu-sta...@nongnu.org
Fixes: https://bugzilla.redhat.com/1843684 CVE-2020-10761
Fixes: 93676c88d7
Signed-off-by: Eric Blake 
Message-Id: <20200610163741.3745251-2-ebl...@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 nbd/server.c   | 23 ---
 tests/qemu-iotests/143 |  4 
 tests/qemu-iotests/143.out |  2 ++
 3 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 02b1ed080145..20754e9ebc3c 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -217,7 +217,7 @@ nbd_negotiate_send_rep_verr(NBDClient *client, uint32_t 
type,

 msg = g_strdup_vprintf(fmt, va);
 len = strlen(msg);
-assert(len < 4096);
+assert(len < NBD_MAX_STRING_SIZE);
 trace_nbd_negotiate_send_rep_err(msg);
 ret = nbd_negotiate_send_rep_len(client, type, len, errp);
 if (ret < 0) {
@@ -231,6 +231,19 @@ nbd_negotiate_send_rep_verr(NBDClient *client, uint32_t 
type,
 return 0;
 }

+/*
+ * Return a malloc'd copy of @name suitable for use in an error reply.
+ */
+static char *
+nbd_sanitize_name(const char *name)
+{
+if (strnlen(name, 80) < 80) {
+return g_strdup(name);
+}
+/* XXX Should we also try to sanitize any control characters? */
+return g_strdup_printf("%.80s...", name);
+}
+
 /* Send an error reply.
  * Return -errno on error, 0 on success. */
 static int GCC_FMT_ATTR(4, 5)
@@ -595,9 +608,11 @@ static int nbd_negotiate_handle_info(NBDClient *client, 
Error **errp)

 exp = nbd_export_find(name);
 if (!exp) {
+g_autofree char *sane_name = nbd_sanitize_name(name);
+
 return nbd_negotiate_send_rep_err(client, NBD_REP_ERR_UNKNOWN,
   errp, "export '%s' not present",
-  name);
+  sane_name);
 }

 /* Don't bother sending NBD_INFO_NAME unless client requested it */
@@ -995,8 +1010,10 @@ static int nbd_negotiate_meta_queries(NBDClient *client,

 meta->exp = nbd_export_find(export_name);
 if (meta->exp == NULL) {
+g_autofree char *sane_name = nbd_sanitize_name(export_name);
+
 return nbd_opt_drop(client, NBD_REP_ERR_UNKNOWN, errp,
-"export '%s' not present", export_name);
+"export '%s' not present", sane_name);
 }

 ret = nbd_opt_read(client, &nb_queries, sizeof(nb_queries), errp);
diff --git a/tests/qemu-iotests/143 b/tests/qemu-iotests/143
index f649b3619501..d2349903b1b5 100755
--- a/tests/qemu-iotests/143
+++ b/tests/qemu-iotests/143
@@ -58,6 +58,10 @@ _send_qemu_cmd $QEMU_HANDLE \
 $QEMU_IO_PROG -f raw -c quit \
 

Re: [PATCH v2 1/2] nbd/server: Avoid long error message assertions CVE-2020-10761

2020-06-10 Thread Vladimir Sementsov-Ogievskiy

10.06.2020 19:37, Eric Blake wrote:

Ever since commit 36683283 (v2.8), the server code asserts that error
strings sent to the client are well-formed per the protocol by not
exceeding the maximum string length of 4096.  At the time the server
first started sending error messages, the assertion could not be
triggered, because messages were completely under our control.
However, over the years, we have added latent scenarios where a client
could trigger the server to attempt an error message that would
include the client's information if it passed other checks first:

- requesting NBD_OPT_INFO/GO on an export name that is not present
   (commit 0cfae925 in v2.12 echoes the name)

- requesting NBD_OPT_LIST/SET_META_CONTEXT on an export name that is
   not present (commit e7b1948d in v2.12 echoes the name)

At the time, those were still safe because we flagged names larger
than 256 bytes with a different message; but that changed in commit
93676c88 (v4.2) when we raised the name limit to 4096 to match the NBD
string limit.  (That commit also failed to change the magic number
4096 in nbd_negotiate_send_rep_err to the just-introduced named
constant.)  So with that commit, long client names appended to server
text can now trigger the assertion, and thus be used as a denial of
service attack against a server.  As a mitigating factor, if the
server requires TLS, the client cannot trigger the problematic paths
unless it first supplies TLS credentials, and such trusted clients are
less likely to try to intentionally crash the server.

We may later want to further sanitize the user-supplied strings we
place into our error messages, such as scrubbing out control
characters, but that is less important to the CVE fix, so it can be a
later patch to the new nbd_sanitize_name.

Consideration was given to changing the assertion in
nbd_negotiate_send_rep_verr to instead merely log a server error and
truncate the message, to avoid leaving a latent path that could
trigger a future CVE DoS on any new error message.  However, this
merely complicates the code for something that is already (correctly)
flagging coding errors, and now that we are aware of the long message
pitfall, we are less likely to introduce such errors in the future,
which would make such error handling dead code.

Reported-by: Xueqiang Wei 
CC: qemu-sta...@nongnu.org
Fixes: https://bugzilla.redhat.com/1843684 CVE-2020-10761
Fixes: 93676c88d7
Signed-off-by: Eric Blake 


Reviewed-by: Vladimir Sementsov-Ogievskiy 


---
  nbd/server.c   | 23 ---
  tests/qemu-iotests/143 |  4 
  tests/qemu-iotests/143.out |  2 ++
  3 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 02b1ed080145..20754e9ebc3c 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -217,7 +217,7 @@ nbd_negotiate_send_rep_verr(NBDClient *client, uint32_t 
type,

  msg = g_strdup_vprintf(fmt, va);
  len = strlen(msg);
-assert(len < 4096);
+assert(len < NBD_MAX_STRING_SIZE);
  trace_nbd_negotiate_send_rep_err(msg);
  ret = nbd_negotiate_send_rep_len(client, type, len, errp);
  if (ret < 0) {
@@ -231,6 +231,19 @@ nbd_negotiate_send_rep_verr(NBDClient *client, uint32_t 
type,
  return 0;
  }

+/*
+ * Return a malloc'd copy of @name suitable for use in an error reply.
+ */
+static char *
+nbd_sanitize_name(const char *name)
+{
+if (strnlen(name, 80) < 80) {
+return g_strdup(name);
+}
+/* XXX Should we also try to sanitize any control characters? */
+return g_strdup_printf("%.80s...", name);
+}
+
  /* Send an error reply.
   * Return -errno on error, 0 on success. */
  static int GCC_FMT_ATTR(4, 5)
@@ -595,9 +608,11 @@ static int nbd_negotiate_handle_info(NBDClient *client, 
Error **errp)

  exp = nbd_export_find(name);
  if (!exp) {
+g_autofree char *sane_name = nbd_sanitize_name(name);


Cool! Somehow I forgot about this feature when writing my answer on v1.


+
  return nbd_negotiate_send_rep_err(client, NBD_REP_ERR_UNKNOWN,
errp, "export '%s' not present",
-  name);
+  sane_name);
  }

  /* Don't bother sending NBD_INFO_NAME unless client requested it */
@@ -995,8 +1010,10 @@ static int nbd_negotiate_meta_queries(NBDClient *client,

  meta->exp = nbd_export_find(export_name);
  if (meta->exp == NULL) {
+g_autofree char *sane_name = nbd_sanitize_name(export_name);
+
  return nbd_opt_drop(client, NBD_REP_ERR_UNKNOWN, errp,
-"export '%s' not present", export_name);
+"export '%s' not present", sane_name);
  }

 ret = nbd_opt_read(client, &nb_queries, sizeof(nb_queries), errp);
diff --git a/tests/qemu-iotests/143 b/tests/qemu-iotests/143
index f649b3619501..b4acc4372542 100755
--- a/tests/qemu-iotests/143
+++ b/tests/qemu-iotests/143

Re: [PATCH 1/2] aio: allow to wait for coroutine pool from different coroutine

2020-06-10 Thread Denis V. Lunev
On 6/10/20 6:10 PM, Vladimir Sementsov-Ogievskiy wrote:
> 10.06.2020 17:41, Denis V. Lunev wrote:
>> The patch preserves the constraint that the only waiter is allowed.
>>
>> Signed-off-by: Denis V. Lunev 
>> CC: Kevin Wolf 
>> CC: Max Reitz 
>> CC: Vladimir Sementsov-Ogievskiy 
>> CC: Denis Plotnikov 
>> ---
>>   block/aio_task.c | 8 
>>   1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/block/aio_task.c b/block/aio_task.c
>> index 88989fa248..f338049147 100644
>> --- a/block/aio_task.c
>> +++ b/block/aio_task.c
>> @@ -27,7 +27,7 @@
>>   #include "block/aio_task.h"
>>     struct AioTaskPool {
>> -    Coroutine *main_co;
>> +    Coroutine *wake_co;
>>   int status;
>>   int max_busy_tasks;
>>   int busy_tasks;
>> @@ -54,15 +54,15 @@ static void coroutine_fn aio_task_co(void *opaque)
>>     if (pool->waiting) {
>>   pool->waiting = false;
>> -    aio_co_wake(pool->main_co);
>> +    aio_co_wake(pool->wake_co);
>>   }
>>   }
>>     void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool)
>>   {
>>   assert(pool->busy_tasks > 0);
>> -    assert(qemu_coroutine_self() == pool->main_co);
>>   +    pool->wake_co = qemu_coroutine_self();
>>   pool->waiting = true;
>>   qemu_coroutine_yield();
>>   @@ -98,7 +98,7 @@ AioTaskPool *coroutine_fn aio_task_pool_new(int
>> max_busy_tasks)
>>   {
>>   AioTaskPool *pool = g_new0(AioTaskPool, 1);
>>   -    pool->main_co = qemu_coroutine_self();
>> +    pool->wake_co = NULL;
>>   pool->max_busy_tasks = max_busy_tasks;
>>     return pool;
>>
>
> With such an approach, if several coroutines wait simultaneously, only one
> will finally be woken and the others will hang.
>
> I think we should use CoQueue here: CoQueue instead of wake_co,
> qemu_co_queue_wait in wait_one, and qemu_co_queue_next instead of
> aio_co_wake.
>
>
I will make a check, but for now it would be enough to
add
  assert(!pool->waiting);
at the beginning of aio_task_pool_wait_one
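
That is, roughly (sketch only, based on the hunks above; not a tested change):

void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool)
{
    assert(pool->busy_tasks > 0);
    /* reject a second concurrent waiter instead of silently
     * overwriting wake_co and leaving the first waiter hanging */
    assert(!pool->waiting);

    pool->wake_co = qemu_coroutine_self();
    pool->waiting = true;
    qemu_coroutine_yield();
    /* ... rest of the function unchanged ... */
}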

Den



Re: [PATCH v8 30/34] qcow2: Add prealloc field to QCowL2Meta

2020-06-10 Thread Eric Blake

On 6/10/20 10:03 AM, Alberto Garcia wrote:

This field allows us to indicate that the L2 metadata update does not
come from a write request with actual data but from a preallocation
request.

For traditional images this does not make any difference, but for
images with extended L2 entries this means that the clusters are
allocated normally in the L2 table but individual subclusters are
marked as unallocated.

This will allow preallocating images that have a backing file.

There is one special case: when we resize an existing image we can
also request that the new clusters are preallocated. If the image
already had a backing file then we have to hide any possible stale
data and zero out the new clusters (see commit 955c7d6687 for more
details).

In this case the subclusters cannot be left as unallocated so the L2
bitmap must be updated.

Signed-off-by: Alberto Garcia 
---


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 1/6] iotests: 194: wait migration completion on target too

2020-06-10 Thread Alex Bennée


Alex Bennée  writes:

> From: Vladimir Sementsov-Ogievskiy 
>
> It is possible, that shutdown on target occurs earlier than migration
> finish. In this case we crash in bdrv_release_dirty_bitmap_locked()
> on assertion "assert(!bdrv_dirty_bitmap_busy(bitmap));" as we do have
> busy bitmap, as bitmap migration is ongoing.
>
> We'll fix bitmap migration to gracefully cancel on early shutdown soon.
> Now let's fix iotest 194 to wait migration completion before shutdown.
>
> Note that in this test dest_vm.shutdown() is called implicitly, as vms
> used as context-providers, see __exit__() method of QEMUMachine class.
>
> Actually, not waiting migration finish is a wrong thing, but the test
> started to crash after commit ae00aa239847682
> "iotests: 194: test also migration of dirty bitmap", which added dirty
> bitmaps here. So, Fixes: tag won't hurt.
>
> Fixes: ae00aa2398476824f0eca80461da215e7cdc1c3b
> Reported-by: Thomas Huth 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> Tested-by: Thomas Huth 
> Reviewed-by: Eric Blake 
> Signed-off-by: Alex Bennée 
> Message-Id: <20200604083341.26978-1-vsement...@virtuozzo.com>

Obviously this patch isn't going in via plugins/next - I had it in my
tree to keep CI green and forgot to take that into account when
generating the series!

-- 
Alex Bennée



[PATCH v2 2/2] block: Call attention to truncation of long NBD exports

2020-06-10 Thread Eric Blake
Commit 93676c88 relaxed our NBD client code to request export names up
to the NBD protocol maximum of 4096 bytes without NUL terminator, even
though the block layer can't store anything longer than 4096 bytes
including NUL terminator for display to the user.  Since this means
there are some export names where we have to truncate things, we can
at least try to make the truncation a bit more obvious for the user.
Note that in spite of the truncated display name, we can still
communicate with an NBD server using such a long export name; this was
deemed nicer than refusing to even connect to such a server (since the
server may not be under our control, and since determining our actual
length limits gets tricky when nbd://host:port/export and
nbd+unix:///export?socket=/path are themselves variable-length
expansions beyond the export name but count towards the block layer
name length).
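
For reference, the hint added to bdrv_refresh_filename() boils down to checking
snprintf's return value against the buffer size; a standalone sketch of the same
idiom (illustration only, not the patch itself):

#include <assert.h>
#include <stdio.h>
#include <string.h>

/* snprintf reports the length it wanted to write, so a value >= the
 * buffer size means the output was cut short; overwrite the tail with
 * "..." as a hint to the reader. */
static void format_with_hint(char *buf, size_t size, const char *json)
{
    int n;

    assert(size >= 4);                       /* room for "..." plus NUL */
    n = snprintf(buf, size, "json:%s", json);
    if (n < 0 || (size_t)n >= size) {
        strcpy(buf + size - 4, "...");
    }
}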

Reported-by: Xueqiang Wei 
Fixes: https://bugzilla.redhat.com/1843684
Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block.c |  7 +--
 block/nbd.c | 21 +
 2 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/block.c b/block.c
index 8416376c9b71..6dbcb7e083ea 100644
--- a/block.c
+++ b/block.c
@@ -6809,8 +6809,11 @@ void bdrv_refresh_filename(BlockDriverState *bs)
 pstrcpy(bs->filename, sizeof(bs->filename), bs->exact_filename);
 } else {
 QString *json = qobject_to_json(QOBJECT(bs->full_open_options));
-snprintf(bs->filename, sizeof(bs->filename), "json:%s",
- qstring_get_str(json));
+if (snprintf(bs->filename, sizeof(bs->filename), "json:%s",
+ qstring_get_str(json)) >= sizeof(bs->filename)) {
+/* Give user a hint if we truncated things. */
+strcpy(bs->filename + sizeof(bs->filename) - 4, "...");
+}
 qobject_unref(json);
 }
 }
diff --git a/block/nbd.c b/block/nbd.c
index 4ac23c8f6299..eed160c5cda1 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -1984,6 +1984,7 @@ static void nbd_refresh_filename(BlockDriverState *bs)
 {
 BDRVNBDState *s = bs->opaque;
 const char *host = NULL, *port = NULL, *path = NULL;
+size_t len = 0;

 if (s->saddr->type == SOCKET_ADDRESS_TYPE_INET) {
 const InetSocketAddress *inet = &s->saddr->u.inet;
@@ -1996,17 +1997,21 @@ static void nbd_refresh_filename(BlockDriverState *bs)
 } /* else can't represent as pseudo-filename */

 if (path && s->export) {
-snprintf(bs->exact_filename, sizeof(bs->exact_filename),
- "nbd+unix:///%s?socket=%s", s->export, path);
+len = snprintf(bs->exact_filename, sizeof(bs->exact_filename),
+   "nbd+unix:///%s?socket=%s", s->export, path);
 } else if (path && !s->export) {
-snprintf(bs->exact_filename, sizeof(bs->exact_filename),
- "nbd+unix://?socket=%s", path);
+len = snprintf(bs->exact_filename, sizeof(bs->exact_filename),
+   "nbd+unix://?socket=%s", path);
 } else if (host && s->export) {
-snprintf(bs->exact_filename, sizeof(bs->exact_filename),
- "nbd://%s:%s/%s", host, port, s->export);
+len = snprintf(bs->exact_filename, sizeof(bs->exact_filename),
+   "nbd://%s:%s/%s", host, port, s->export);
 } else if (host && !s->export) {
-snprintf(bs->exact_filename, sizeof(bs->exact_filename),
- "nbd://%s:%s", host, port);
+len = snprintf(bs->exact_filename, sizeof(bs->exact_filename),
+   "nbd://%s:%s", host, port);
+}
+if (len > sizeof(bs->exact_filename)) {
+/* Name is too long to represent exactly, so leave it empty. */
+bs->exact_filename[0] = '\0';
 }
 }

-- 
2.27.0




[PATCH v2 0/2] Fix NBD CVE-2020-10761

2020-06-10 Thread Eric Blake
In qemu 4.2, I accidentally introduced the ability for an NBD client
obeying the specification to kill qemu as NBD server with an assertion
failure when the client requests an unusually long export name, as a
regression from the intended graceful server error message back to the
client.

In v2:
- use strnlen instead of strlen
- malloc a sane string rather than using a static buffer [Vladimir]
- enhance commit message
- tweak iotest to use different abbreviation than qemu-nbd
- add R-b on patch 2

Once this is reviewed, I'll then spin v2 of my NBD pull request.

Eric Blake (2):
  nbd/server: Avoid long error message assertions CVE-2020-10761
  block: Call attention to truncation of long NBD exports

 block.c|  7 +--
 block/nbd.c| 21 +
 nbd/server.c   | 23 ---
 tests/qemu-iotests/143 |  4 
 tests/qemu-iotests/143.out |  2 ++
 5 files changed, 44 insertions(+), 13 deletions(-)

-- 
2.27.0




[PATCH v2 1/2] nbd/server: Avoid long error message assertions CVE-2020-10761

2020-06-10 Thread Eric Blake
Ever since commit 36683283 (v2.8), the server code asserts that error
strings sent to the client are well-formed per the protocol by not
exceeding the maximum string length of 4096.  At the time the server
first started sending error messages, the assertion could not be
triggered, because messages were completely under our control.
However, over the years, we have added latent scenarios where a client
could trigger the server to attempt an error message that would
include the client's information if it passed other checks first:

- requesting NBD_OPT_INFO/GO on an export name that is not present
  (commit 0cfae925 in v2.12 echoes the name)

- requesting NBD_OPT_LIST/SET_META_CONTEXT on an export name that is
  not present (commit e7b1948d in v2.12 echoes the name)

At the time, those were still safe because we flagged names larger
than 256 bytes with a different message; but that changed in commit
93676c88 (v4.2) when we raised the name limit to 4096 to match the NBD
string limit.  (That commit also failed to change the magic number
4096 in nbd_negotiate_send_rep_err to the just-introduced named
constant.)  So with that commit, long client names appended to server
text can now trigger the assertion, and thus be used as a denial of
service attack against a server.  As a mitigating factor, if the
server requires TLS, the client cannot trigger the problematic paths
unless it first supplies TLS credentials, and such trusted clients are
less likely to try to intentionally crash the server.

We may later want to further sanitize the user-supplied strings we
place into our error messages, such as scrubbing out control
characters, but that is less important to the CVE fix, so it can be a
later patch to the new nbd_sanitize_name.
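
If that later patch materializes, the scrubbing could be as simple as replacing
non-printable bytes before applying the existing length cap; a rough sketch
using the glib helpers already available here (illustrative only, not part of
this patch, the name is made up):

static char *nbd_scrub_name(const char *name)
{
    g_autofree char *copy = g_strdup(name);
    char *p;

    /* replace control and other non-printable bytes with '.' */
    for (p = copy; *p; p++) {
        if (!g_ascii_isprint(*p)) {
            *p = '.';
        }
    }
    if (strnlen(copy, 80) < 80) {
        return g_steal_pointer(&copy);
    }
    return g_strdup_printf("%.80s...", copy);
}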

Consideration was given to changing the assertion in
nbd_negotiate_send_rep_verr to instead merely log a server error and
truncate the message, to avoid leaving a latent path that could
trigger a future CVE DoS on any new error message.  However, this
merely complicates the code for something that is already (correctly)
flagging coding errors, and now that we are aware of the long message
pitfall, we are less likely to introduce such errors in the future,
which would make such error handling dead code.

Reported-by: Xueqiang Wei 
CC: qemu-sta...@nongnu.org
Fixes: https://bugzilla.redhat.com/1843684 CVE-2020-10761
Fixes: 93676c88d7
Signed-off-by: Eric Blake 
---
 nbd/server.c   | 23 ---
 tests/qemu-iotests/143 |  4 
 tests/qemu-iotests/143.out |  2 ++
 3 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 02b1ed080145..20754e9ebc3c 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -217,7 +217,7 @@ nbd_negotiate_send_rep_verr(NBDClient *client, uint32_t 
type,

 msg = g_strdup_vprintf(fmt, va);
 len = strlen(msg);
-assert(len < 4096);
+assert(len < NBD_MAX_STRING_SIZE);
 trace_nbd_negotiate_send_rep_err(msg);
 ret = nbd_negotiate_send_rep_len(client, type, len, errp);
 if (ret < 0) {
@@ -231,6 +231,19 @@ nbd_negotiate_send_rep_verr(NBDClient *client, uint32_t 
type,
 return 0;
 }

+/*
+ * Return a malloc'd copy of @name suitable for use in an error reply.
+ */
+static char *
+nbd_sanitize_name(const char *name)
+{
+if (strnlen(name, 80) < 80) {
+return g_strdup(name);
+}
+/* XXX Should we also try to sanitize any control characters? */
+return g_strdup_printf("%.80s...", name);
+}
+
 /* Send an error reply.
  * Return -errno on error, 0 on success. */
 static int GCC_FMT_ATTR(4, 5)
@@ -595,9 +608,11 @@ static int nbd_negotiate_handle_info(NBDClient *client, 
Error **errp)

 exp = nbd_export_find(name);
 if (!exp) {
+g_autofree char *sane_name = nbd_sanitize_name(name);
+
 return nbd_negotiate_send_rep_err(client, NBD_REP_ERR_UNKNOWN,
   errp, "export '%s' not present",
-  name);
+  sane_name);
 }

 /* Don't bother sending NBD_INFO_NAME unless client requested it */
@@ -995,8 +1010,10 @@ static int nbd_negotiate_meta_queries(NBDClient *client,

 meta->exp = nbd_export_find(export_name);
 if (meta->exp == NULL) {
+g_autofree char *sane_name = nbd_sanitize_name(export_name);
+
 return nbd_opt_drop(client, NBD_REP_ERR_UNKNOWN, errp,
-"export '%s' not present", export_name);
+"export '%s' not present", sane_name);
 }

 ret = nbd_opt_read(client, &nb_queries, sizeof(nb_queries), errp);
diff --git a/tests/qemu-iotests/143 b/tests/qemu-iotests/143
index f649b3619501..b4acc4372542 100755
--- a/tests/qemu-iotests/143
+++ b/tests/qemu-iotests/143
@@ -58,6 +58,10 @@ _send_qemu_cmd $QEMU_HANDLE \
 $QEMU_IO_PROG -f raw -c quit \
 "nbd+unix:///no_such_export?socket=$SOCK_DIR/nbd" 2>&1 \
 | _filter_qemu_io | _filter_nbd
+# 

Re: Clarification regarding new qemu-img convert --target-is-zero flag

2020-06-10 Thread Kevin Wolf
Am 10.06.2020 um 17:26 hat Sam Eiderman geschrieben:
> Thanks for the clarification Kevin,
> 
> Well first I want to discuss unallocated blocks.
> From my understanding operating systems do not rely on disks to be
> zero initialized, on the contrary, physical disks usually contain
> garbage.
> So an unallocated block should never be treated as zero by any real
> world application.

I think this is a dangerous assumption to make. The guest did have
access to these unallocated blocks before, and they read as zero, so not
writing these to the conversion target does change the virtual disk.
Whether or not this is a harmless change for the guest depends on the
software running in the VM.

> Now assuming that I only care about the allocated content of the
> disks, I would like to save io/time zeroing out unallocated blocks.
> 
> A real world example would be flushing a 500GB vmdk on a real SSD
> disk, if the vmdk contained only 2GB of data, no point in writing
> 498GB of zeroes to that SSD - reducing its lifespan for nothing.

Don't pretty much all SSDs support efficient zeroing/hole punching these
days so that the blocks would actually be deallocated on the disk level?

> Now from what I understand --target-is-zero will give me this behavior
> even though that I really use it as a "--skip-prezeroing-target"
> (sorry for the bad name)
> (This is only true if later *allocated zeroes* are indeed copied correctly)

As you noticed later, it doesn't.

The behaviour you want is more like -B, except that you don't have a
backing file. If you also pass -n, the actual filename you pass isn't
even used, so I guess '-B "" -n' should do the trick?

Kevin




Re: [PATCH 2/2] block: Call attention to truncation of long NBD exports

2020-06-10 Thread Eric Blake

On 6/10/20 4:24 AM, Vladimir Sementsov-Ogievskiy wrote:

08.06.2020 21:26, Eric Blake wrote:

Commit 93676c88 relaxed our NBD client code to request export names up
to the NBD protocol maximum of 4096 bytes without NUL terminator, even
though the block layer can't store anything longer than 4096 bytes
including NUL terminator for display to the user.  Since this means
there are some export names where we have to truncate things, we can
at least try to make the truncation a bit more obvious for the user.
Note that in spite of the truncated display name, we can still
communicate with an NBD server using such a long export name; this was
deemed nicer than refusing to even connect to such a server (since the
server may not be under our control, and since determining our actual
length limits gets tricky when nbd://host:port/export and
nbd+unix:///export?socket=/path are themselves variable-length
expansions beyond the export name but count towards the block layer
name length).

Reported-by: Xueqiang Wei 
Fixes: https://bugzilla.redhat.com/1843684
Signed-off-by: Eric Blake 


Reviewed-by: Vladimir Sementsov-Ogievskiy 


---
  block.c |  7 +--
  block/nbd.c | 21 +
  2 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/block.c b/block.c
index 8416376c9b71..6dbcb7e083ea 100644
--- a/block.c
+++ b/block.c
@@ -6809,8 +6809,11 @@ void bdrv_refresh_filename(BlockDriverState *bs)
  pstrcpy(bs->filename, sizeof(bs->filename), 
bs->exact_filename);

  } else {
  QString *json = 
qobject_to_json(QOBJECT(bs->full_open_options));

-    snprintf(bs->filename, sizeof(bs->filename), "json:%s",
- qstring_get_str(json));
+    if (snprintf(bs->filename, sizeof(bs->filename), "json:%s",
+ qstring_get_str(json)) >= sizeof(bs->filename)) {
+    /* Give user a hint if we truncated things. */
+    strcpy(bs->filename + sizeof(bs->filename) - 4, "...");
+    }


Is  4096 really enough for json in normal cases?


By its very nature, a json string tends to be longer than a counterpart URI
string representing the same information (when such an explicit name 
exists) because of the extra characters burned in adding "key":value 
pairs wrapping the data that was compact in explicit form.  But 4k is 
still quite a lot, and the only cases I've seen where names don't fit in 
JSON form is where the user was explicitly trying to break things with 
corner-case testing, rather than what you get with day-to-day use.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: Clarification regarding new qemu-img convert --target-is-zero flag

2020-06-10 Thread Eric Blake

On 6/10/20 10:57 AM, David Edmondson wrote:

On Wednesday, 2020-06-10 at 10:48:52 -05, Eric Blake wrote:


On 6/10/20 10:42 AM, David Edmondson wrote:

On Wednesday, 2020-06-10 at 18:29:33 +03, Sam Eiderman wrote:


Excuse me,

Vladimir already pointed out in the first comment that it will skip
writing real zeroes later.


Right. That's why you want something like "--no-need-to-zero-initialise"
(the name keeps getting longer!), which would still write zeroes to the
blocks that should contain zeroes, as opposed to writing zeroes to
prepare the device.


Or maybe something like:

qemu-img convert --skip-unallocated


This seems fine.


which says that a pre-zeroing pass may be attempted, but it if fails,


This bit puzzles me. In what circumstances might we attempt but fail?
Does it really mean "if it can be done instantly, it will be done, but
not if it costs something"?


A fast pre-zeroing pass is faster than writing explicit zeroes.  If such 
a fast pass works, then you can avoid further I/O for all subsequent 
zero sections; the unallocated sections will now happen to read as zero, 
but that is not a problem since the content of unallocated portions is 
not guaranteed.


But if pre-zeroing is not fast, then you have to spend the extra I/O to 
explicitly zero the portions that are allocated but read as zero, while 
still skipping the unallocated portions.




I'd be more inclined to go for "unallocated blocks will not be written",
without any attempts to pre-zero.


But that can be slower, when pre-zeroing is fast.  "Unallocated blocks 
need not be written" allows for optimizations, "unallocated blocks must 
not be touched" does not.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: Clarification regarding new qemu-img convert --target-is-zero flag

2020-06-10 Thread David Edmondson
On Wednesday, 2020-06-10 at 10:48:52 -05, Eric Blake wrote:

> On 6/10/20 10:42 AM, David Edmondson wrote:
>> On Wednesday, 2020-06-10 at 18:29:33 +03, Sam Eiderman wrote:
>> 
>>> Excuse me,
>>>
>>> Vladimir already pointed out in the first comment that it will skip
>>> writing real zeroes later.
>> 
>> Right. That's why you want something like "--no-need-to-zero-initialise"
>> (the name keeps getting longer!), which would still write zeroes to the
>> blocks that should contain zeroes, as opposed to writing zeroes to
>> prepare the device.
>
> Or maybe something like:
>
> qemu-img convert --skip-unallocated

This seems fine.

> which says that a pre-zeroing pass may be attempted, but it if fails, 

This bit puzzles me. In what circumstances might we attempt but fail?
Does it really mean "if it can be done instantly, it will be done, but
not if it costs something"?

I'd be more inclined to go for "unallocated blocks will not be written",
without any attempts to pre-zero.

> only the explicit zeroes need to be written rather than zeroes for all
> unallocated areas in the source (so the resulting image will NOT be an
> identical copy if there were any unallocated areas, but that the user
> is okay with that).

dme.
-- 
Too much information, running through my brain.



[PATCH v2 1/6] iotests: 194: wait migration completion on target too

2020-06-10 Thread Alex Bennée
From: Vladimir Sementsov-Ogievskiy 

It is possible, that shutdown on target occurs earlier than migration
finish. In this case we crash in bdrv_release_dirty_bitmap_locked()
on assertion "assert(!bdrv_dirty_bitmap_busy(bitmap));" as we do have
busy bitmap, as bitmap migration is ongoing.

We'll fix bitmap migration to gracefully cancel on early shutdown soon.
Now let's fix iotest 194 to wait migration completion before shutdown.

Note that in this test dest_vm.shutdown() is called implicitly, as vms
used as context-providers, see __exit__() method of QEMUMachine class.

Actually, not waiting migration finish is a wrong thing, but the test
started to crash after commit ae00aa239847682
"iotests: 194: test also migration of dirty bitmap", which added dirty
bitmaps here. So, Fixes: tag won't hurt.

Fixes: ae00aa2398476824f0eca80461da215e7cdc1c3b
Reported-by: Thomas Huth 
Signed-off-by: Vladimir Sementsov-Ogievskiy 
Tested-by: Thomas Huth 
Reviewed-by: Eric Blake 
Signed-off-by: Alex Bennée 
Message-Id: <20200604083341.26978-1-vsement...@virtuozzo.com>
---
 tests/qemu-iotests/194 | 10 ++
 tests/qemu-iotests/194.out |  5 +
 2 files changed, 15 insertions(+)

diff --git a/tests/qemu-iotests/194 b/tests/qemu-iotests/194
index 3fad7c6c1ab..6dc2bc94d7e 100755
--- a/tests/qemu-iotests/194
+++ b/tests/qemu-iotests/194
@@ -87,4 +87,14 @@ with iotests.FilePath('source.img') as source_img_path, \
 iotests.log(dest_vm.qmp('nbd-server-stop'))
 break
 
+iotests.log('Wait migration completion on target...')
+migr_events = (('MIGRATION', {'data': {'status': 'completed'}}),
+   ('MIGRATION', {'data': {'status': 'failed'}}))
+event = dest_vm.events_wait(migr_events)
+iotests.log(event, filters=[iotests.filter_qmp_event])
+
+iotests.log('Check bitmaps on source:')
 iotests.log(source_vm.qmp('query-block')['return'][0]['dirty-bitmaps'])
+
+iotests.log('Check bitmaps on target:')
+iotests.log(dest_vm.qmp('query-block')['return'][0]['dirty-bitmaps'])
diff --git a/tests/qemu-iotests/194.out b/tests/qemu-iotests/194.out
index dd60dcc14f1..f70cf7610e0 100644
--- a/tests/qemu-iotests/194.out
+++ b/tests/qemu-iotests/194.out
@@ -21,4 +21,9 @@ Gracefully ending the `drive-mirror` job on source...
 {"data": {"device": "mirror-job0", "len": 1073741824, "offset": 1073741824, 
"speed": 0, "type": "mirror"}, "event": "BLOCK_JOB_COMPLETED", "timestamp": 
{"microseconds": "USECS", "seconds": "SECS"}}
 Stopping the NBD server on destination...
 {"return": {}}
+Wait migration completion on target...
+{"data": {"status": "completed"}, "event": "MIGRATION", "timestamp": 
{"microseconds": "USECS", "seconds": "SECS"}}
+Check bitmaps on source:
+[{"busy": false, "count": 0, "granularity": 65536, "name": "bitmap0", 
"persistent": false, "recording": true, "status": "active"}]
+Check bitmaps on target:
 [{"busy": false, "count": 0, "granularity": 65536, "name": "bitmap0", 
"persistent": false, "recording": true, "status": "active"}]
-- 
2.20.1




Re: [PATCH v7 0/9] acpi: i386 tweaks

2020-06-10 Thread Gerd Hoffmann
On Wed, Jun 10, 2020 at 10:54:26AM -0400, Michael S. Tsirkin wrote:
> On Wed, Jun 10, 2020 at 01:40:02PM +0200, Igor Mammedov wrote:
> > On Wed, 10 Jun 2020 11:41:22 +0200
> > Gerd Hoffmann  wrote:
> > 
> > > First batch of microvm patches, some generic acpi stuff.
> > > Split the acpi-build.c monster, specifically split the
> > > pc and q35 and pci bits into a separate file which we
> > > can skip building at some point in the future.
> > > 
> > It looks like series is missing patch to whitelist changed ACPI tables in
> > bios-table-test.
> 
> Right. Does it pass make check?

No, but after 'git cherry-pick 9b20a3365d73dad4ad144eab9c5827dbbb2e9f21' it 
does.

> > Do we already have test case for microvm in bios-table-test,
> > if not it's probably time to add it.
> 
> Separately :)

Especially as this series is just preparing cleanups and doesn't
actually add acpi support to microvm yet.

But, yes, adding a testcase sounds useful.

take care,
  Gerd




Re: Clarification regarding new qemu-img convert --target-is-zero flag

2020-06-10 Thread Eric Blake

On 6/10/20 10:42 AM, David Edmondson wrote:

On Wednesday, 2020-06-10 at 18:29:33 +03, Sam Eiderman wrote:


Excuse me,

Vladimir already pointed out in the first comment that it will skip
writing real zeroes later.


Right. That's why you want something like "--no-need-to-zero-initialise"
(the name keeps getting longer!), which would still write zeroes to the
blocks that should contain zeroes, as opposed to writing zeroes to
prepare the device.


Or maybe something like:

qemu-img convert --skip-unallocated

which says that a pre-zeroing pass may be attempted, but it if fails, 
only the explicit zeroes need to be written rather than zeroes for all 
unallocated areas in the source (so the resulting image will NOT be an 
identical copy if there were any unallocated areas, but that the user is 
okay with that).


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: Clarification regarding new qemu-img convert --target-is-zero flag

2020-06-10 Thread Sam Eiderman
Ok great, thanks for making it clear.

On Wed, Jun 10, 2020 at 6:42 PM David Edmondson  wrote:
>
> On Wednesday, 2020-06-10 at 18:29:33 +03, Sam Eiderman wrote:
>
> > Excuse me,
> >
> > Vladimir already pointed out in the first comment that it will skip
> > writing real zeroes later.
>
> Right. That's why you want something like "--no-need-to-zero-initialise"
> (the name keeps getting longer!), which would still write zeroes to the
> blocks that should contain zeroes, as opposed to writing zeroes to
> prepare the device.
>
> > Sam
> >
> > On Wed, Jun 10, 2020 at 6:26 PM Sam Eiderman  wrote:
> >>
> >> Thanks for the clarification Kevin,
> >>
> >> Well first I want to discuss unallocated blocks.
> >> From my understanding operating systems do not rely on disks to be
> >> zero initialized, on the contrary, physical disks usually contain
> >> garbage.
> >> So an unallocated block should never be treated as zero by any real
> >> world application.
> >>
> >> Now assuming that I only care about the allocated content of the
> >> disks, I would like to save io/time zeroing out unallocated blocks.
> >>
> >> A real world example would be flushing a 500GB vmdk on a real SSD
> >> disk, if the vmdk contained only 2GB of data, no point in writing
> >> 498GB of zeroes to that SSD - reducing its lifespan for nothing.
> >>
> >> Now from what I understand --target-is-zero will give me this behavior
> >> even though that I really use it as a "--skip-prezeroing-target"
> >> (sorry for the bad name)
> >> (This is only true if later *allocated zeroes* are indeed copied correctly)
> >>
> >> Sam
> >>
> >> On Wed, Jun 10, 2020 at 5:06 PM Kevin Wolf  wrote:
> >> >
> >> > Am 10.06.2020 um 14:19 hat Sam Eiderman geschrieben:
> >> > > Thanks David,
> >> > >
> >> > > Yes, I imagine the following use case:
> >> > >
> >> > > disk.vmdk is a 50 GB disk that contains 12 MB binary of zeroes in its 
> >> > > beginning.
> >> > > /dev/sda is a raw disk containing garbage
> >> > >
> >> > > I invoke:
> >> > > qemu-img convert disk.vmdk -O raw /dev/sda
> >> > >
> >> > > Required output:
> >> > > The first 12 MB of /dev/sda contain zeros, the rest garbage, qemu-img
> >> > > finishes fast.
> >> > >
> >> > > Kevin, from what I understood from you, this is the default behavior.
> >> >
> >> > Sorry, I misunderstood what you want. qemu-img will write zeros to all
> >> > unallocated parts, too. If it didn't do that, the resulting image on
> >> > /dev/sda wouldn't be a copy of disk.vmdk.
> >> >
> >> > As the metadata (which blocks are allocated) cannot be preserved in raw
> >> > images, you wouldn't be able to tell which part of the image contains
> >> > valid data and which part needs to be interpreted as zeros even though
> >> > it contains random garbage.
> >> >
> >> > What is your use case for this result where the actual virtual disk
> >> > content is mixed with garbage?
> >> >
> >> > Kevin
> >> >
>
> dme.
> --
> He caught a fleeting glimpse of a man, moving uphill pursued by a bus.



Re: Clarification regarding new qemu-img convert --target-is-zero flag

2020-06-10 Thread David Edmondson
On Wednesday, 2020-06-10 at 18:29:33 +03, Sam Eiderman wrote:

> Excuse me,
>
> Vladimir already pointed out in the first comment that it will skip
> writing real zeroes later.

Right. That's why you want something like "--no-need-to-zero-initialise"
(the name keeps getting longer!), which would still write zeroes to the
blocks that should contain zeroes, as opposed to writing zeroes to
prepare the device.

> Sam
>
> On Wed, Jun 10, 2020 at 6:26 PM Sam Eiderman  wrote:
>>
>> Thanks for the clarification Kevin,
>>
>> Well first I want to discuss unallocated blocks.
>> From my understanding operating systems do not rely on disks to be
>> zero initialized, on the contrary, physical disks usually contain
>> garbage.
>> So an unallocated block should never be treated as zero by any real
>> world application.
>>
>> Now assuming that I only care about the allocated content of the
>> disks, I would like to save io/time zeroing out unallocated blocks.
>>
>> A real world example would be flushing a 500GB vmdk on a real SSD
>> disk, if the vmdk contained only 2GB of data, no point in writing
>> 498GB of zeroes to that SSD - reducing its lifespan for nothing.
>>
>> Now from what I understand --target-is-zero will give me this behavior
>> even though that I really use it as a "--skip-prezeroing-target"
>> (sorry for the bad name)
>> (This is only true if later *allocated zeroes* are indeed copied correctly)
>>
>> Sam
>>
>> On Wed, Jun 10, 2020 at 5:06 PM Kevin Wolf  wrote:
>> >
>> > Am 10.06.2020 um 14:19 hat Sam Eiderman geschrieben:
>> > > Thanks David,
>> > >
>> > > Yes, I imagine the following use case:
>> > >
>> > > disk.vmdk is a 50 GB disk that contains 12 MB binary of zeroes in its 
>> > > beginning.
>> > > /dev/sda is a raw disk containing garbage
>> > >
>> > > I invoke:
>> > > qemu-img convert disk.vmdk -O raw /dev/sda
>> > >
>> > > Required output:
>> > > The first 12 MB of /dev/sda contain zeros, the rest garbage, qemu-img
>> > > finishes fast.
>> > >
>> > > Kevin, from what I understood from you, this is the default behavior.
>> >
>> > Sorry, I misunderstood what you want. qemu-img will write zeros to all
>> > unallocated parts, too. If it didn't do that, the resulting image on
>> > /dev/sda wouldn't be a copy of disk.vmdk.
>> >
>> > As the metadata (which blocks are allocated) cannot be preserved in raw
>> > images, you wouldn't be able to tell which part of the image contains
>> > valid data and which part needs to be interpreted as zeros even though
>> > it contains random garbage.
>> >
>> > What is your use case for this result where the actual virtual disk
>> > content is mixed with garbage?
>> >
>> > Kevin
>> >

dme.
-- 
He caught a fleeting glimpse of a man, moving uphill pursued by a bus.



Re: Clarification regarding new qemu-img convert --target-is-zero flag

2020-06-10 Thread Sam Eiderman
Excuse me,

Vladimir already pointed out in the first comment that it will skip
writing real zeroes later.

Sam

On Wed, Jun 10, 2020 at 6:26 PM Sam Eiderman  wrote:
>
> Thanks for the clarification Kevin,
>
> Well first I want to discuss unallocated blocks.
> From my understanding operating systems do not rely on disks to be
> zero initialized, on the contrary, physical disks usually contain
> garbage.
> So an unallocated block should never be treated as zero by any real
> world application.
>
> Now assuming that I only care about the allocated content of the
> disks, I would like to save io/time zeroing out unallocated blocks.
>
> A real world example would be flushing a 500GB vmdk on a real SSD
> disk, if the vmdk contained only 2GB of data, no point in writing
> 498GB of zeroes to that SSD - reducing its lifespan for nothing.
>
> Now from what I understand --target-is-zero will give me this behavior
> even though that I really use it as a "--skip-prezeroing-target"
> (sorry for the bad name)
> (This is only true if later *allocated zeroes* are indeed copied correctly)
>
> Sam
>
> On Wed, Jun 10, 2020 at 5:06 PM Kevin Wolf  wrote:
> >
> > Am 10.06.2020 um 14:19 hat Sam Eiderman geschrieben:
> > > Thanks David,
> > >
> > > Yes, I imagine the following use case:
> > >
> > > disk.vmdk is a 50 GB disk that contains 12 MB binary of zeroes in its 
> > > beginning.
> > > /dev/sda is a raw disk containing garbage
> > >
> > > I invoke:
> > > qemu-img convert disk.vmdk -O raw /dev/sda
> > >
> > > Required output:
> > > The first 12 MB of /dev/sda contain zeros, the rest garbage, qemu-img
> > > finishes fast.
> > >
> > > Kevin, from what I understood from you, this is the default behavior.
> >
> > Sorry, I misunderstood what you want. qemu-img will write zeros to all
> > unallocated parts, too. If it didn't do that, the resulting image on
> > /dev/sda wouldn't be a copy of disk.vmdk.
> >
> > As the metadata (which blocks are allocated) cannot be preserved in raw
> > images, you wouldn't be able to tell which part of the image contains
> > valid data and which part needs to be interpreted as zeros even though
> > it contains random garbage.
> >
> > What is your use case for this result where the actual virtual disk
> > content is mixed with garbage?
> >
> > Kevin
> >



Re: Clarification regarding new qemu-img convert --target-is-zero flag

2020-06-10 Thread Sam Eiderman
Thanks for the clarification Kevin,

Well first I want to discuss unallocated blocks.
From my understanding operating systems do not rely on disks to be
zero initialized, on the contrary, physical disks usually contain
garbage.
So an unallocated block should never be treated as zero by any real
world application.

Now assuming that I only care about the allocated content of the
disks, I would like to save io/time zeroing out unallocated blocks.

A real world example would be flushing a 500GB vmdk on a real SSD
disk, if the vmdk contained only 2GB of data, no point in writing
498GB of zeroes to that SSD - reducing its lifespan for nothing.

Now from what I understand --target-is-zero will give me this behavior
even though that I really use it as a "--skip-prezeroing-target"
(sorry for the bad name)
(This is only true if later *allocated zeroes* are indeed copied correctly)

Sam

On Wed, Jun 10, 2020 at 5:06 PM Kevin Wolf  wrote:
>
> Am 10.06.2020 um 14:19 hat Sam Eiderman geschrieben:
> > Thanks David,
> >
> > Yes, I imagine the following use case:
> >
> > disk.vmdk is a 50 GB disk that contains 12 MB binary of zeroes in its 
> > beginning.
> > /dev/sda is a raw disk containing garbage
> >
> > I invoke:
> > qemu-img convert disk.vmdk -O raw /dev/sda
> >
> > Required output:
> > The first 12 MB of /dev/sda contain zeros, the rest garbage, qemu-img
> > finishes fast.
> >
> > Kevin, from what I understood from you, this is the default behavior.
>
> Sorry, I misunderstood what you want. qemu-img will write zeros to all
> unallocated parts, too. If it didn't do that, the resulting image on
> /dev/sda wouldn't be a copy of disk.vmdk.
>
> As the metadata (which blocks are allocated) cannot be preserved in raw
> images, you wouldn't be able to tell which part of the image contains
> valid data and which part needs to be interpreted as zeros even though
> it contains random garbage.
>
> What is your use case for this result where the actual virtual disk
> content is mixed with garbage?
>
> Kevin
>



[PATCH v8 21/34] qcow2: Add subcluster support to qcow2_get_host_offset()

2020-06-10 Thread Alberto Garcia
The logic of this function remains pretty much the same, except that
it uses count_contiguous_subclusters(), which combines the logic of
count_contiguous_clusters() / count_contiguous_clusters_unallocated()
and checks individual subclusters.

qcow2_cluster_to_subcluster_type() is not necessary as a separate
function anymore so it's inlined into its caller.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 block/qcow2.h |  38 ---
 block/qcow2-cluster.c | 150 ++
 2 files changed, 92 insertions(+), 96 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 5df761edc3..4fad40b96b 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -710,29 +710,6 @@ static inline QCow2ClusterType 
qcow2_get_cluster_type(BlockDriverState *bs,
 }
 }
 
-/*
- * For an image without extended L2 entries, return the
- * QCow2SubclusterType equivalent of a given QCow2ClusterType.
- */
-static inline
-QCow2SubclusterType qcow2_cluster_to_subcluster_type(QCow2ClusterType type)
-{
-switch (type) {
-case QCOW2_CLUSTER_COMPRESSED:
-return QCOW2_SUBCLUSTER_COMPRESSED;
-case QCOW2_CLUSTER_ZERO_PLAIN:
-return QCOW2_SUBCLUSTER_ZERO_PLAIN;
-case QCOW2_CLUSTER_ZERO_ALLOC:
-return QCOW2_SUBCLUSTER_ZERO_ALLOC;
-case QCOW2_CLUSTER_NORMAL:
-return QCOW2_SUBCLUSTER_NORMAL;
-case QCOW2_CLUSTER_UNALLOCATED:
-return QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN;
-default:
-g_assert_not_reached();
-}
-}
-
 /*
  * In an image without subsclusters @l2_bitmap is ignored and
  * @sc_index must be 0.
@@ -776,7 +753,20 @@ QCow2SubclusterType 
qcow2_get_subcluster_type(BlockDriverState *bs,
 g_assert_not_reached();
 }
 } else {
-return qcow2_cluster_to_subcluster_type(type);
+switch (type) {
+case QCOW2_CLUSTER_COMPRESSED:
+return QCOW2_SUBCLUSTER_COMPRESSED;
+case QCOW2_CLUSTER_ZERO_PLAIN:
+return QCOW2_SUBCLUSTER_ZERO_PLAIN;
+case QCOW2_CLUSTER_ZERO_ALLOC:
+return QCOW2_SUBCLUSTER_ZERO_ALLOC;
+case QCOW2_CLUSTER_NORMAL:
+return QCOW2_SUBCLUSTER_NORMAL;
+case QCOW2_CLUSTER_UNALLOCATED:
+return QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN;
+default:
+g_assert_not_reached();
+}
 }
 }
 
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 59dd9bda29..2f3bd3a882 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -426,66 +426,66 @@ static int 
qcow2_get_subcluster_range_type(BlockDriverState *bs,
 }
 
 /*
- * Checks how many clusters in a given L2 slice are contiguous in the image
- * file. As soon as one of the flags in the bitmask stop_flags changes compared
- * to the first cluster, the search is stopped and the cluster is not counted
- * as contiguous. (This allows it, for example, to stop at the first compressed
- * cluster which may require a different handling)
+ * Return the number of contiguous subclusters of the exact same type
+ * in a given L2 slice, starting from cluster @l2_index, subcluster
+ * @sc_index. Allocated subclusters are required to be contiguous in
+ * the image file.
+ * At most @nb_clusters are checked (note that this means clusters,
+ * not subclusters).
+ * Compressed clusters are always processed one by one but for the
+ * purpose of this count they are treated as if they were divided into
+ * subclusters of size s->subcluster_size.
+ * On failure return -errno and update @l2_index to point to the
+ * invalid entry.
  */
-static int count_contiguous_clusters(BlockDriverState *bs, int nb_clusters,
-int cluster_size, uint64_t *l2_slice, int l2_index, uint64_t 
stop_flags)
+static int count_contiguous_subclusters(BlockDriverState *bs, int nb_clusters,
+unsigned sc_index, uint64_t *l2_slice,
+unsigned *l2_index)
 {
 BDRVQcow2State *s = bs->opaque;
-int i;
-QCow2ClusterType first_cluster_type;
-uint64_t mask = stop_flags | L2E_OFFSET_MASK | QCOW_OFLAG_COMPRESSED;
-uint64_t first_entry = get_l2_entry(s, l2_slice, l2_index);
-uint64_t offset = first_entry & mask;
+int i, count = 0;
+bool check_offset;
+uint64_t expected_offset;
+QCow2SubclusterType expected_type, type;
 
-first_cluster_type = qcow2_get_cluster_type(bs, first_entry);
-if (first_cluster_type == QCOW2_CLUSTER_UNALLOCATED) {
-return 0;
-}
-
-/* must be allocated */
-assert(first_cluster_type == QCOW2_CLUSTER_NORMAL ||
-   first_cluster_type == QCOW2_CLUSTER_ZERO_ALLOC);
+assert(*l2_index + nb_clusters <= s->l2_size);
 
 for (i = 0; i < nb_clusters; i++) {
-uint64_t l2_entry = get_l2_entry(s, l2_slice, l2_index + i) & mask;
-if (offset + (uint64_t) i * cluster_size != l2_entry) {
+unsigned first_sc = (i == 0) ? sc_index : 0;
+uint64_t l2_entry = 

Re: [PATCH 1/2] aio: allow to wait for coroutine pool from different coroutine

2020-06-10 Thread Vladimir Sementsov-Ogievskiy

10.06.2020 17:41, Denis V. Lunev wrote:

The patch preserves the constraint that the only waiter is allowed.

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Max Reitz 
CC: Vladimir Sementsov-Ogievskiy 
CC: Denis Plotnikov 
---
  block/aio_task.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/aio_task.c b/block/aio_task.c
index 88989fa248..f338049147 100644
--- a/block/aio_task.c
+++ b/block/aio_task.c
@@ -27,7 +27,7 @@
  #include "block/aio_task.h"
  
  struct AioTaskPool {

-Coroutine *main_co;
+Coroutine *wake_co;
  int status;
  int max_busy_tasks;
  int busy_tasks;
@@ -54,15 +54,15 @@ static void coroutine_fn aio_task_co(void *opaque)
  
  if (pool->waiting) {

  pool->waiting = false;
-aio_co_wake(pool->main_co);
+aio_co_wake(pool->wake_co);
  }
  }
  
  void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool)

  {
  assert(pool->busy_tasks > 0);
-assert(qemu_coroutine_self() == pool->main_co);
  
+pool->wake_co = qemu_coroutine_self();

  pool->waiting = true;
  qemu_coroutine_yield();
  
@@ -98,7 +98,7 @@ AioTaskPool *coroutine_fn aio_task_pool_new(int max_busy_tasks)

  {
  AioTaskPool *pool = g_new0(AioTaskPool, 1);
  
-pool->main_co = qemu_coroutine_self();

+pool->wake_co = NULL;
  pool->max_busy_tasks = max_busy_tasks;
  
  return pool;




With such an approach, if several coroutines wait simultaneously, only one
will finally be woken and the others will hang.

I think we should use CoQueue here: CoQueue instead of wake_co,
qemu_co_queue_wait in wait_one, and qemu_co_queue_next instead of aio_co_wake.
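
Roughly (untested sketch of the idea, assuming the usual CoQueue helpers
qemu_co_queue_init/wait/next):

struct AioTaskPool {
    CoQueue waiters;        /* replaces main_co/wake_co and the waiting flag */
    int status;
    int max_busy_tasks;
    int busy_tasks;
};

/* in aio_task_pool_new(): */
qemu_co_queue_init(&pool->waiters);

/* in aio_task_co(), instead of checking pool->waiting: */
qemu_co_queue_next(&pool->waiters);      /* wakes one waiter, if any */

/* in aio_task_pool_wait_one(): */
assert(pool->busy_tasks > 0);
qemu_co_queue_wait(&pool->waiters, NULL);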


--
Best regards,
Vladimir



[PATCH v8 31/34] qcow2: Add the 'extended_l2' option and the QCOW2_INCOMPAT_EXTL2 bit

2020-06-10 Thread Alberto Garcia
Now that the implementation of subclusters is complete we can finally
add the necessary options to create and read images with this feature,
which we call "extended L2 entries".

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 qapi/block-core.json |   7 +++
 block/qcow2.h|   8 ++-
 include/block/block_int.h|   1 +
 block/qcow2.c|  74 --
 tests/qemu-iotests/031.out   |   8 +--
 tests/qemu-iotests/036.out   |   4 +-
 tests/qemu-iotests/049.out   | 102 +++
 tests/qemu-iotests/060.out   |   1 +
 tests/qemu-iotests/061.out   |  20 +++---
 tests/qemu-iotests/065   |  12 ++--
 tests/qemu-iotests/082.out   |  48 ---
 tests/qemu-iotests/085.out   |  38 ++--
 tests/qemu-iotests/144.out   |   4 +-
 tests/qemu-iotests/182.out   |   2 +-
 tests/qemu-iotests/185.out   |   8 +--
 tests/qemu-iotests/198.out   |   2 +
 tests/qemu-iotests/206.out   |   4 ++
 tests/qemu-iotests/242.out   |   5 ++
 tests/qemu-iotests/255.out   |   8 +--
 tests/qemu-iotests/274.out   |  49 ---
 tests/qemu-iotests/280.out   |   2 +-
 tests/qemu-iotests/291.out   |   2 +
 tests/qemu-iotests/common.filter |   1 +
 23 files changed, 270 insertions(+), 140 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 0e1c6a59f2..24e002ebae 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -66,6 +66,9 @@
 # standalone (read-only) raw image without looking at qcow2
 # metadata (since: 4.0)
 #
+# @extended-l2: true if the image has extended L2 entries; only valid for
+#   compat >= 1.1 (since 5.1)
+#
 # @lazy-refcounts: on or off; only valid for compat >= 1.1
 #
 # @corrupt: true if the image has been marked corrupt; only valid for
@@ -87,6 +90,7 @@
   'compat': 'str',
   '*data-file': 'str',
   '*data-file-raw': 'bool',
+  '*extended-l2': 'bool',
   '*lazy-refcounts': 'bool',
   '*corrupt': 'bool',
   'refcount-bits': 'int',
@@ -4318,6 +4322,8 @@
 # @data-file-raw: True if the external data file must stay valid as a
 # standalone (read-only) raw image without looking at qcow2
 # metadata (default: false; since: 4.0)
+# @extended-l2  True to make the image have extended L2 entries
+#   (default: false; since 5.1)
 # @size: Size of the virtual disk in bytes
 # @version: Compatibility level (default: v3)
 # @backing-file: File name of the backing file if a backing file
@@ -4338,6 +4344,7 @@
   'data': { 'file': 'BlockdevRef',
 '*data-file':   'BlockdevRef',
 '*data-file-raw':   'bool',
+'*extended-l2': 'bool',
 'size': 'size',
 '*version': 'BlockdevQcow2Version',
 '*backing-file':'str',
diff --git a/block/qcow2.h b/block/qcow2.h
index f3499e53bf..065ec3df0b 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -246,15 +246,18 @@ enum {
 QCOW2_INCOMPAT_CORRUPT_BITNR= 1,
 QCOW2_INCOMPAT_DATA_FILE_BITNR  = 2,
 QCOW2_INCOMPAT_COMPRESSION_BITNR = 3,
+QCOW2_INCOMPAT_EXTL2_BITNR  = 4,
 QCOW2_INCOMPAT_DIRTY= 1 << QCOW2_INCOMPAT_DIRTY_BITNR,
 QCOW2_INCOMPAT_CORRUPT  = 1 << QCOW2_INCOMPAT_CORRUPT_BITNR,
 QCOW2_INCOMPAT_DATA_FILE= 1 << QCOW2_INCOMPAT_DATA_FILE_BITNR,
 QCOW2_INCOMPAT_COMPRESSION  = 1 << QCOW2_INCOMPAT_COMPRESSION_BITNR,
+QCOW2_INCOMPAT_EXTL2= 1 << QCOW2_INCOMPAT_EXTL2_BITNR,
 
 QCOW2_INCOMPAT_MASK = QCOW2_INCOMPAT_DIRTY
 | QCOW2_INCOMPAT_CORRUPT
 | QCOW2_INCOMPAT_DATA_FILE
-| QCOW2_INCOMPAT_COMPRESSION,
+| QCOW2_INCOMPAT_COMPRESSION
+| QCOW2_INCOMPAT_EXTL2,
 };
 
 /* Compatible feature bits */
@@ -581,8 +584,7 @@ typedef enum QCow2MetadataOverlap {
 
 static inline bool has_subclusters(BDRVQcow2State *s)
 {
-/* FIXME: Return false until this feature is complete */
-return false;
+return s->incompatible_features & QCOW2_INCOMPAT_EXTL2;
 }
 
 static inline size_t l2_entry_size(BDRVQcow2State *s)
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 791de6a59c..36e1993788 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -58,6 +58,7 @@
 #define BLOCK_OPT_DATA_FILE "data_file"
 #define BLOCK_OPT_DATA_FILE_RAW "data_file_raw"
 #define BLOCK_OPT_COMPRESSION_TYPE  "compression_type"
+#define BLOCK_OPT_EXTL2 "extended_l2"
 
 #define BLOCK_PROBE_BUF_SIZE512
 
diff --git a/block/qcow2.c b/block/qcow2.c
index 003f166024..37bfae823c 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1438,6 +1438,12 @@ static 

[PATCH v8 34/34] iotests: Add tests for qcow2 images with extended L2 entries

2020-06-10 Thread Alberto Garcia
Signed-off-by: Alberto Garcia 
---
 tests/qemu-iotests/271 | 801 +
 tests/qemu-iotests/271.out | 676 +++
 tests/qemu-iotests/group   |   1 +
 3 files changed, 1478 insertions(+)
 create mode 100755 tests/qemu-iotests/271
 create mode 100644 tests/qemu-iotests/271.out

diff --git a/tests/qemu-iotests/271 b/tests/qemu-iotests/271
new file mode 100755
index 00..9c1f50a5b8
--- /dev/null
+++ b/tests/qemu-iotests/271
@@ -0,0 +1,801 @@
+#!/bin/bash
+#
+# Test qcow2 images with extended L2 entries
+#
+# Copyright (C) 2019-2020 Igalia, S.L.
+# Author: Alberto Garcia 
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+# creator
+owner=be...@igalia.com
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+
+here="$PWD"
+status=1   # failure is the default!
+
+_cleanup()
+{
+_cleanup_test_img
+rm -f "$TEST_IMG.raw"
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+
+_supported_fmt qcow2
+_supported_proto file nfs
+_supported_os Linux
+_unsupported_imgopts extended_l2 compat=0.10 cluster_size data_file
+
+l2_offset=$((0x4))
+
+_verify_img()
+{
+$QEMU_IMG compare "$TEST_IMG" "$TEST_IMG.raw" | grep -v 'Images are 
identical'
+$QEMU_IMG check "$TEST_IMG" | _filter_qemu_img_check | \
+grep -v 'No errors were found on the image'
+}
+
+# Compare the bitmap of an extended L2 entry against an expected value
+_verify_l2_bitmap()
+{
+entry_no="$1"# L2 entry number, starting from 0
+expected_alloc="$alloc"  # Space-separated list of allocated subcluster 
indexes
+expected_zero="$zero"# Space-separated list of zero subcluster indexes
+
+offset=$(($l2_offset + $entry_no * 16))
+entry=$(peek_file_be "$TEST_IMG" $offset 8)
+offset=$(($offset + 8))
+bitmap=$(peek_file_be "$TEST_IMG" $offset 8)
+
+expected_bitmap=0
+for bit in $expected_alloc; do
+expected_bitmap=$(($expected_bitmap | (1 << $bit)))
+done
+for bit in $expected_zero; do
+expected_bitmap=$(($expected_bitmap | (1 << (32 + $bit
+done
+printf -v expected_bitmap "%llu" $expected_bitmap # Convert to unsigned
+
+printf "L2 entry #%d: 0x%016lx %016lx\n" "$entry_no" "$entry" "$bitmap"
+if [ "$bitmap" != "$expected_bitmap" ]; then
+printf "ERROR: expecting bitmap   0x%016lx\n" "$expected_bitmap"
+fi
+}
+
+# This should be called as _run_test c=XXX sc=XXX off=XXX len=XXX cmd=XXX
+# c:   cluster number (0 if unset)
+# sc:  subcluster number inside cluster @c (0 if unset)
+# off: offset inside subcluster @sc, in kilobytes (0 if unset)
+# len: request length, passed directly to qemu-io (e.g: 256, 4k, 1M, ...)
+# cmd: the command to pass to qemu-io, must be one of
+#  write-> write
+#  zero -> write -z
+#  unmap-> write -z -u
+#  compress -> write -c
+#  discard  -> discard
+_run_test()
+{
+unset c sc off len cmd
+for var in "$@"; do eval "$var"; done
+case "${cmd:-write}" in
+zero)
+cmd="write -q -z";;
+unmap)
+cmd="write -q -z -u";;
+compress)
+pat=$((${pat:-0} + 1))
+cmd="write -q -c -P ${pat}";;
+write)
+pat=$((${pat:-0} + 1))
+cmd="write -q -P ${pat}";;
+discard)
+cmd="discard -q";;
+*)
+echo "Unknown option $cmd"
+exit 1;;
+esac
+c="${c:-0}"
+sc="${sc:-0}"
+off="${off:-0}"
+offset="$(($c * 64 + $sc * 2 + $off))"
+[ "$offset" != 0 ] && offset="${offset}k"
+cmd="$cmd ${offset} ${len}"
+raw_cmd=$(echo $cmd | sed s/-c//) # Raw images don't support -c
+echo $cmd | sed 's/-P [0-9][0-9]\?/-P PATTERN/'
+$QEMU_IO -c "$cmd" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "$raw_cmd" -f raw "$TEST_IMG.raw" | _filter_qemu_io
+_verify_img
+_verify_l2_bitmap "$c"
+}
+
+_reset_img()
+{
+size="$1"
+$QEMU_IMG create -f raw "$TEST_IMG.raw" "$size" | _filter_img_create
+if [ "$use_backing_file" = "yes" ]; then
+$QEMU_IMG create -f raw "$TEST_IMG.base" "$size" | _filter_img_create
+$QEMU_IO -c "write -q -P 0xFF 0 $size" -f raw "$TEST_IMG.base" | 
_filter_qemu_io
+$QEMU_IO -c "write -q -P 0xFF 0 $size" -f raw 

[PATCH v8 28/34] qcow2: Add subcluster support to qcow2_co_pwrite_zeroes()

2020-06-10 Thread Alberto Garcia
This works now at the subcluster level and pwrite_zeroes_alignment is
updated accordingly.

qcow2_cluster_zeroize() is turned into qcow2_subcluster_zeroize() with
the following changes:

   - The request can now be subcluster-aligned.

   - The cluster-aligned body of the request is still zeroized using
 zero_in_l2_slice() as before.

   - The subcluster-aligned head and tail of the request are zeroized
 with the new zero_l2_subclusters() function.
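
Schematically, the split of a subcluster-aligned request into head, body and
tail works like this (standalone illustration, not the code in this patch):

#include <stdint.h>

/* Split a request into the part up to the next cluster boundary (head),
 * the cluster-aligned middle (body) and the remainder after the last
 * cluster boundary (tail).  Head and tail are zeroized subcluster-wise,
 * the body cluster-wise. */
static void split_request(uint64_t offset, uint64_t bytes, uint64_t cluster_size,
                          uint64_t *head, uint64_t *body, uint64_t *tail)
{
    uint64_t end = offset + bytes;
    uint64_t first_boundary = ((offset + cluster_size - 1) / cluster_size)
                              * cluster_size;        /* ROUND_UP(offset) */
    uint64_t last_boundary = (end / cluster_size) * cluster_size;

    if (first_boundary >= end) {
        *head = bytes;           /* request fits inside a single cluster */
        *body = *tail = 0;
        return;
    }
    *head = first_boundary - offset;
    *body = last_boundary - first_boundary;
    *tail = end - last_boundary;
}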

There is just one thing to take into account for a possible future
improvement: compressed clusters cannot be partially zeroized so
zero_l2_subclusters() on the head or the tail can return -ENOTSUP.
This makes the caller repeat the *complete* request and write actual
zeroes to disk. This is sub-optimal because

   1) if the head area was compressed we would still be able to use
  the fast path for the body and possibly the tail.

   2) if the tail area was compressed we are writing zeroes to the
  head and the body areas, which are already zeroized.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 block/qcow2.h |  4 +--
 block/qcow2-cluster.c | 80 +++
 block/qcow2.c | 27 ---
 3 files changed, 90 insertions(+), 21 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 4fad40b96b..4ef4ae4ab0 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -898,8 +898,8 @@ void qcow2_alloc_cluster_abort(BlockDriverState *bs, 
QCowL2Meta *m);
 int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset,
   uint64_t bytes, enum qcow2_discard_type type,
   bool full_discard);
-int qcow2_cluster_zeroize(BlockDriverState *bs, uint64_t offset,
-  uint64_t bytes, int flags);
+int qcow2_subcluster_zeroize(BlockDriverState *bs, uint64_t offset,
+ uint64_t bytes, int flags);
 
 int qcow2_expand_zero_clusters(BlockDriverState *bs,
BlockDriverAmendStatusCB *status_cb,
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index deff838fe8..1641976028 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -2015,12 +2015,58 @@ static int zero_in_l2_slice(BlockDriverState *bs, 
uint64_t offset,
 return nb_clusters;
 }
 
-int qcow2_cluster_zeroize(BlockDriverState *bs, uint64_t offset,
-  uint64_t bytes, int flags)
+static int zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
+   unsigned nb_subclusters)
+{
+BDRVQcow2State *s = bs->opaque;
+uint64_t *l2_slice;
+uint64_t old_l2_bitmap, l2_bitmap;
+int l2_index, ret, sc = offset_to_sc_index(s, offset);
+
+/* For full clusters use zero_in_l2_slice() instead */
+assert(nb_subclusters > 0 && nb_subclusters < s->subclusters_per_cluster);
+assert(sc + nb_subclusters <= s->subclusters_per_cluster);
+
+ret = get_cluster_table(bs, offset, &l2_slice, &l2_index);
+if (ret < 0) {
+return ret;
+}
+
+switch (qcow2_get_cluster_type(bs, get_l2_entry(s, l2_slice, l2_index))) {
+case QCOW2_CLUSTER_COMPRESSED:
+ret = -ENOTSUP; /* We cannot partially zeroize compressed clusters */
+goto out;
+case QCOW2_CLUSTER_NORMAL:
+case QCOW2_CLUSTER_UNALLOCATED:
+break;
+default:
+g_assert_not_reached();
+}
+
+old_l2_bitmap = l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index);
+
+l2_bitmap |=  QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
+l2_bitmap &= ~QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
+
+if (old_l2_bitmap != l2_bitmap) {
+set_l2_bitmap(s, l2_slice, l2_index, l2_bitmap);
+qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
+}
+
+ret = 0;
+out:
+qcow2_cache_put(s->l2_table_cache, (void **) &l2_slice);
+
+return ret;
+}
+
+int qcow2_subcluster_zeroize(BlockDriverState *bs, uint64_t offset,
+ uint64_t bytes, int flags)
 {
 BDRVQcow2State *s = bs->opaque;
 uint64_t end_offset = offset + bytes;
 uint64_t nb_clusters;
+unsigned head, tail;
 int64_t cleared;
 int ret;
 
@@ -2035,8 +2081,8 @@ int qcow2_cluster_zeroize(BlockDriverState *bs, uint64_t 
offset,
 }
 
 /* Caller must pass aligned values, except at image end */
-assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
-assert(QEMU_IS_ALIGNED(end_offset, s->cluster_size) ||
+assert(offset_into_subcluster(s, offset) == 0);
+assert(offset_into_subcluster(s, end_offset) == 0 ||
end_offset >= bs->total_sectors << BDRV_SECTOR_BITS);
 
 /* The zero flag is only supported by version 3 and newer */
@@ -2044,11 +2090,26 @@ int qcow2_cluster_zeroize(BlockDriverState *bs, 
uint64_t offset,
 return -ENOTSUP;
 }
 
-/* Each L2 slice is handled by its own loop iteration */
-nb_clusters = size_to_clusters(s, bytes);
+head = MIN(end_offset, 

[PATCH v8 18/34] qcow2: Replace QCOW2_CLUSTER_* with QCOW2_SUBCLUSTER_*

2020-06-10 Thread Alberto Garcia
In order to support extended L2 entries some functions of the qcow2
driver need to start dealing with subclusters instead of clusters.

qcow2_get_host_offset() is modified to return the subcluster type
instead of the cluster type, and all callers are updated to replace
all values of QCow2ClusterType with their QCow2SubclusterType
equivalents.

This patch only changes the data types, there are no semantic changes.
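
The mapping between the two enums is one-to-one; a sketch of what the
conversion helper looks like:

    static QCow2SubclusterType
    qcow2_cluster_to_subcluster_type(QCow2ClusterType type)
    {
        switch (type) {
        case QCOW2_CLUSTER_COMPRESSED:
            return QCOW2_SUBCLUSTER_COMPRESSED;
        case QCOW2_CLUSTER_ZERO_PLAIN:
            return QCOW2_SUBCLUSTER_ZERO_PLAIN;
        case QCOW2_CLUSTER_ZERO_ALLOC:
            return QCOW2_SUBCLUSTER_ZERO_ALLOC;
        case QCOW2_CLUSTER_NORMAL:
            return QCOW2_SUBCLUSTER_NORMAL;
        case QCOW2_CLUSTER_UNALLOCATED:
            return QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN;
        default:
            g_assert_not_reached();
        }
    }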

Signed-off-by: Alberto Garcia 
Reviewed-by: Max Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2.h |  2 +-
 block/qcow2-cluster.c | 10 +++
 block/qcow2.c | 70 ++-
 3 files changed, 42 insertions(+), 40 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 74f65793bd..5df761edc3 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -894,7 +894,7 @@ int qcow2_encrypt_sectors(BDRVQcow2State *s, int64_t 
sector_num,
 
 int qcow2_get_host_offset(BlockDriverState *bs, uint64_t offset,
   unsigned int *bytes, uint64_t *host_offset,
-  QCow2ClusterType *cluster_type);
+  QCow2SubclusterType *subcluster_type);
 int qcow2_alloc_cluster_offset(BlockDriverState *bs, uint64_t offset,
unsigned int *bytes, uint64_t *host_offset,
QCowL2Meta **m);
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 1e5681b0c6..ed7b92dbb2 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -564,15 +564,15 @@ static int coroutine_fn 
do_perform_cow_write(BlockDriverState *bs,
  * offset that we are interested in.
  *
  * On exit, *bytes is the number of bytes starting at offset that have the same
- * cluster type and (if applicable) are stored contiguously in the image file.
- * The cluster type is stored in *cluster_type.
- * Compressed clusters are always returned one by one.
+ * subcluster type and (if applicable) are stored contiguously in the image
+ * file. The subcluster type is stored in *subcluster_type.
+ * Compressed clusters are always processed one by one.
  *
  * Returns 0 on success, -errno in error cases.
  */
 int qcow2_get_host_offset(BlockDriverState *bs, uint64_t offset,
   unsigned int *bytes, uint64_t *host_offset,
-  QCow2ClusterType *cluster_type)
+  QCow2SubclusterType *subcluster_type)
 {
 BDRVQcow2State *s = bs->opaque;
 unsigned int l2_index;
@@ -713,7 +713,7 @@ out:
 assert(bytes_available - offset_in_cluster <= UINT_MAX);
 *bytes = bytes_available - offset_in_cluster;
 
-*cluster_type = type;
+*subcluster_type = qcow2_cluster_to_subcluster_type(type);
 
 return 0;
 
diff --git a/block/qcow2.c b/block/qcow2.c
index 89e17bdaba..2f7a2a7c7a 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2033,7 +2033,7 @@ static int coroutine_fn 
qcow2_co_block_status(BlockDriverState *bs,
 BDRVQcow2State *s = bs->opaque;
 uint64_t host_offset;
 unsigned int bytes;
-QCow2ClusterType type;
+QCow2SubclusterType type;
 int ret, status = 0;
 
qemu_co_mutex_lock(&s->lock);
@@ -2053,15 +2053,16 @@ static int coroutine_fn 
qcow2_co_block_status(BlockDriverState *bs,
 
 *pnum = bytes;
 
-if ((type == QCOW2_CLUSTER_NORMAL || type == QCOW2_CLUSTER_ZERO_ALLOC) &&
-!s->crypto) {
+if ((type == QCOW2_SUBCLUSTER_NORMAL ||
+ type == QCOW2_SUBCLUSTER_ZERO_ALLOC) && !s->crypto) {
 *map = host_offset;
 *file = s->data_file->bs;
 status |= BDRV_BLOCK_OFFSET_VALID;
 }
-if (type == QCOW2_CLUSTER_ZERO_PLAIN || type == QCOW2_CLUSTER_ZERO_ALLOC) {
+if (type == QCOW2_SUBCLUSTER_ZERO_PLAIN ||
+type == QCOW2_SUBCLUSTER_ZERO_ALLOC) {
 status |= BDRV_BLOCK_ZERO;
-} else if (type != QCOW2_CLUSTER_UNALLOCATED) {
+} else if (type != QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN) {
 status |= BDRV_BLOCK_DATA;
 }
 if (s->metadata_preallocation && (status & BDRV_BLOCK_DATA) &&
@@ -2158,7 +2159,7 @@ typedef struct Qcow2AioTask {
 AioTask task;
 
 BlockDriverState *bs;
-QCow2ClusterType cluster_type; /* only for read */
+QCow2SubclusterType subcluster_type; /* only for read */
 uint64_t host_offset; /* or full descriptor in compressed clusters */
 uint64_t offset;
 uint64_t bytes;
@@ -2171,7 +2172,7 @@ static coroutine_fn int 
qcow2_co_preadv_task_entry(AioTask *task);
 static coroutine_fn int qcow2_add_task(BlockDriverState *bs,
AioTaskPool *pool,
AioTaskFunc func,
-   QCow2ClusterType cluster_type,
+   QCow2SubclusterType subcluster_type,
uint64_t host_offset,
uint64_t offset,
uint64_t bytes,
@@ -2185,7 

[PATCH v8 05/34] qcow2: Process QCOW2_CLUSTER_ZERO_ALLOC clusters in handle_copied()

2020-06-10 Thread Alberto Garcia
When writing to a qcow2 file there are two functions that take a
virtual offset and return a host offset, possibly allocating new
clusters if necessary:

   - handle_copied() looks for normal data clusters that are already
 allocated and have a reference count of 1. In those clusters we
 can simply write the data and there is no need to perform any
 copy-on-write.

   - handle_alloc() looks for clusters that do need copy-on-write,
 either because they haven't been allocated yet, because their
 reference count is != 1 or because they are ZERO_ALLOC clusters.

The ZERO_ALLOC case is a bit special because those are clusters that
are already allocated and they could perfectly be dealt with in
handle_copied() (as long as copy-on-write is performed when required).

In fact, there is extra code specifically for them in handle_alloc()
that tries to reuse the existing allocation if possible and frees them
otherwise.

This patch changes the handling of ZERO_ALLOC clusters so the
semantics of these two functions are now like this:

   - handle_copied() looks for clusters that are already allocated and
 which we can overwrite (NORMAL and ZERO_ALLOC clusters with a
 reference count of 1).

   - handle_alloc() looks for clusters for which we need a new
 allocation (all other cases).

One important difference after this change is that clusters found
in handle_copied() may now require copy-on-write, but this will be
necessary anyway once we add support for subclusters.
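
Put differently, handle_copied() now accepts any cluster for which a check
along these lines succeeds (an illustrative predicate, not a helper added
by this patch):

    /* Illustrative only: can this cluster be overwritten in place? */
    static bool can_overwrite_in_place(QCow2ClusterType type, uint64_t l2_entry)
    {
        /* QCOW_OFLAG_COPIED implies that the refcount is exactly 1 */
        return (type == QCOW2_CLUSTER_NORMAL ||
                type == QCOW2_CLUSTER_ZERO_ALLOC) &&
               (l2_entry & QCOW_OFLAG_COPIED);
    }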

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 block/qcow2-cluster.c | 256 +++---
 1 file changed, 141 insertions(+), 115 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 80f9787461..fce0be7a08 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1039,13 +1039,18 @@ void qcow2_alloc_cluster_abort(BlockDriverState *bs, 
QCowL2Meta *m)
 
 /*
  * For a given write request, create a new QCowL2Meta structure, add
- * it to @m and the BDRVQcow2State.cluster_allocs list.
+ * it to @m and the BDRVQcow2State.cluster_allocs list. If the write
+ * request does not need copy-on-write or changes to the L2 metadata
+ * then this function does nothing.
  *
  * @host_cluster_offset points to the beginning of the first cluster.
  *
  * @guest_offset and @bytes indicate the offset and length of the
  * request.
  *
+ * @l2_slice contains the L2 entries of all clusters involved in this
+ * write request.
+ *
  * If @keep_old is true it means that the clusters were already
  * allocated and will be overwritten. If false then the clusters are
  * new and we have to decrease the reference count of the old ones.
@@ -1053,15 +1058,53 @@ void qcow2_alloc_cluster_abort(BlockDriverState *bs, 
QCowL2Meta *m)
 static void calculate_l2_meta(BlockDriverState *bs,
   uint64_t host_cluster_offset,
   uint64_t guest_offset, unsigned bytes,
-  QCowL2Meta **m, bool keep_old)
+  uint64_t *l2_slice, QCowL2Meta **m, bool 
keep_old)
 {
 BDRVQcow2State *s = bs->opaque;
-unsigned cow_start_from = 0;
+int l2_index = offset_to_l2_slice_index(s, guest_offset);
+uint64_t l2_entry;
+unsigned cow_start_from, cow_end_to;
 unsigned cow_start_to = offset_into_cluster(s, guest_offset);
 unsigned cow_end_from = cow_start_to + bytes;
-unsigned cow_end_to = ROUND_UP(cow_end_from, s->cluster_size);
 unsigned nb_clusters = size_to_clusters(s, cow_end_from);
 QCowL2Meta *old_m = *m;
+QCow2ClusterType type;
+
+assert(nb_clusters <= s->l2_slice_size - l2_index);
+
+/* Return if there's no COW (all clusters are normal and we keep them) */
+if (keep_old) {
+int i;
+for (i = 0; i < nb_clusters; i++) {
+l2_entry = be64_to_cpu(l2_slice[l2_index + i]);
+if (qcow2_get_cluster_type(bs, l2_entry) != QCOW2_CLUSTER_NORMAL) {
+break;
+}
+}
+if (i == nb_clusters) {
+return;
+}
+}
+
+/* Get the L2 entry of the first cluster */
+l2_entry = be64_to_cpu(l2_slice[l2_index]);
+type = qcow2_get_cluster_type(bs, l2_entry);
+
+if (type == QCOW2_CLUSTER_NORMAL && keep_old) {
+cow_start_from = cow_start_to;
+} else {
+cow_start_from = 0;
+}
+
+/* Get the L2 entry of the last cluster */
+l2_entry = be64_to_cpu(l2_slice[l2_index + nb_clusters - 1]);
+type = qcow2_get_cluster_type(bs, l2_entry);
+
+if (type == QCOW2_CLUSTER_NORMAL && keep_old) {
+cow_end_to = cow_end_from;
+} else {
+cow_end_to = ROUND_UP(cow_end_from, s->cluster_size);
+}
 
 *m = g_malloc0(sizeof(**m));
 **m = (QCowL2Meta) {
@@ -1087,18 +1130,22 @@ static void calculate_l2_meta(BlockDriverState *bs,
QLIST_INSERT_HEAD(&s->cluster_allocs, *m, next_in_flight);
 }
 
-/* Returns true if writing 

[PATCH v8 20/34] qcow2: Add subcluster support to calculate_l2_meta()

2020-06-10 Thread Alberto Garcia
If an image has subclusters then there are more copy-on-write
scenarios that we need to consider. Let's say we have a write request
from the middle of subcluster #3 until the end of the cluster:

1) If we are writing to a newly allocated cluster then we need
   copy-on-write. The previous contents of subclusters #0 to #3 must
   be copied to the new cluster. We can optimize this process by
   skipping all leading unallocated or zero subclusters (the status of
   those skipped subclusters will be reflected in the new L2 bitmap).

2) If we are overwriting an existing cluster:

   2.1) If subcluster #3 is unallocated or has the all-zeroes bit set
then we need copy-on-write (on subcluster #3 only).

   2.2) If subcluster #3 was already allocated then there is no need
for any copy-on-write. However we still need to update the L2
bitmap to reflect possible changes in the allocation status of
subclusters #4 to #31. Because of this, this function checks
if all the overwritten subclusters are already allocated and
in this case it returns without creating a new QCowL2Meta
structure.

After all these changes l2meta_cow_start() and l2meta_cow_end()
are not necessarily cluster-aligned anymore. We need to update the
calculation of old_start and old_end in handle_dependencies() to
guarantee that no two requests try to write on the same cluster.
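
To put numbers on the scenarios above (assuming 64 KiB clusters, i.e. 32
subclusters of 2 KiB each, and a write starting at offset 7 KiB into the
cluster, in the middle of subcluster #3):

   1)   new cluster:               COW region = [0 KiB, 7 KiB)  (subclusters #0-#3)
   2.1) subcluster #3 unallocated
        or all-zeroes:             COW region = [6 KiB, 7 KiB)  (subcluster #3 only)
   2.2) subcluster #3 allocated:   no COW, only the allocation bits of
                                   subclusters #4-#31 may need updating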

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 block/qcow2-cluster.c | 163 +-
 1 file changed, 131 insertions(+), 32 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index ed7b92dbb2..59dd9bda29 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -387,7 +387,6 @@ fail:
  * If the L2 entry is invalid return -errno and set @type to
  * QCOW2_SUBCLUSTER_INVALID.
  */
-G_GNUC_UNUSED
 static int qcow2_get_subcluster_range_type(BlockDriverState *bs,
uint64_t l2_entry,
uint64_t l2_bitmap,
@@ -1110,56 +1109,148 @@ void qcow2_alloc_cluster_abort(BlockDriverState *bs, 
QCowL2Meta *m)
  * If @keep_old is true it means that the clusters were already
  * allocated and will be overwritten. If false then the clusters are
  * new and we have to decrease the reference count of the old ones.
+ *
+ * Returns 0 on success, -errno on failure.
  */
-static void calculate_l2_meta(BlockDriverState *bs,
-  uint64_t host_cluster_offset,
-  uint64_t guest_offset, unsigned bytes,
-  uint64_t *l2_slice, QCowL2Meta **m, bool 
keep_old)
+static int calculate_l2_meta(BlockDriverState *bs, uint64_t 
host_cluster_offset,
+ uint64_t guest_offset, unsigned bytes,
+ uint64_t *l2_slice, QCowL2Meta **m, bool keep_old)
 {
 BDRVQcow2State *s = bs->opaque;
-int l2_index = offset_to_l2_slice_index(s, guest_offset);
-uint64_t l2_entry;
+int sc_index, l2_index = offset_to_l2_slice_index(s, guest_offset);
+uint64_t l2_entry, l2_bitmap;
 unsigned cow_start_from, cow_end_to;
 unsigned cow_start_to = offset_into_cluster(s, guest_offset);
 unsigned cow_end_from = cow_start_to + bytes;
 unsigned nb_clusters = size_to_clusters(s, cow_end_from);
 QCowL2Meta *old_m = *m;
-QCow2ClusterType type;
+QCow2SubclusterType type;
+int i;
+bool skip_cow = keep_old;
 
 assert(nb_clusters <= s->l2_slice_size - l2_index);
 
-/* Return if there's no COW (all clusters are normal and we keep them) */
-if (keep_old) {
-int i;
-for (i = 0; i < nb_clusters; i++) {
-l2_entry = get_l2_entry(s, l2_slice, l2_index + i);
-if (qcow2_get_cluster_type(bs, l2_entry) != QCOW2_CLUSTER_NORMAL) {
-break;
+/* Check the type of all affected subclusters */
+for (i = 0; i < nb_clusters; i++) {
+l2_entry = get_l2_entry(s, l2_slice, l2_index + i);
+l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index + i);
+if (skip_cow) {
+unsigned write_from = MAX(cow_start_to, i << s->cluster_bits);
+unsigned write_to = MIN(cow_end_from, (i + 1) << s->cluster_bits);
+int first_sc = offset_to_sc_index(s, write_from);
+int last_sc = offset_to_sc_index(s, write_to - 1);
+int cnt = qcow2_get_subcluster_range_type(bs, l2_entry, l2_bitmap,
+  first_sc, &type);
+/* Is any of the subclusters of type != QCOW2_SUBCLUSTER_NORMAL ? 
*/
+if (type != QCOW2_SUBCLUSTER_NORMAL || first_sc + cnt <= last_sc) {
+skip_cow = false;
 }
+} else {
+/* If we can't skip the cow we can still look for invalid entries 
*/
+type = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, 0);
 }
-if 

[PATCH v8 19/34] qcow2: Handle QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC

2020-06-10 Thread Alberto Garcia
When dealing with subcluster types there is a new value called
QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC that has no equivalent in
QCow2ClusterType.

This patch handles that value in all places where subcluster types
are processed.

Signed-off-by: Alberto Garcia 
Reviewed-by: Max Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 2f7a2a7c7a..a3481cd85b 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2054,7 +2054,8 @@ static int coroutine_fn 
qcow2_co_block_status(BlockDriverState *bs,
 *pnum = bytes;
 
 if ((type == QCOW2_SUBCLUSTER_NORMAL ||
- type == QCOW2_SUBCLUSTER_ZERO_ALLOC) && !s->crypto) {
+ type == QCOW2_SUBCLUSTER_ZERO_ALLOC ||
+ type == QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC) && !s->crypto) {
 *map = host_offset;
 *file = s->data_file->bs;
 status |= BDRV_BLOCK_OFFSET_VALID;
@@ -2062,7 +2063,8 @@ static int coroutine_fn 
qcow2_co_block_status(BlockDriverState *bs,
 if (type == QCOW2_SUBCLUSTER_ZERO_PLAIN ||
 type == QCOW2_SUBCLUSTER_ZERO_ALLOC) {
 status |= BDRV_BLOCK_ZERO;
-} else if (type != QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN) {
+} else if (type != QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN &&
+   type != QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC) {
 status |= BDRV_BLOCK_DATA;
 }
 if (s->metadata_preallocation && (status & BDRV_BLOCK_DATA) &&
@@ -2225,6 +2227,7 @@ static coroutine_fn int 
qcow2_co_preadv_task(BlockDriverState *bs,
 g_assert_not_reached();
 
 case QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN:
+case QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC:
 assert(bs->backing); /* otherwise handled in qcow2_co_preadv_part */
 
 BLKDBG_EVENT(bs->file, BLKDBG_READ_BACKING_AIO);
@@ -2293,7 +2296,8 @@ static coroutine_fn int 
qcow2_co_preadv_part(BlockDriverState *bs,
 
 if (type == QCOW2_SUBCLUSTER_ZERO_PLAIN ||
 type == QCOW2_SUBCLUSTER_ZERO_ALLOC ||
-(type == QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN && !bs->backing))
+(type == QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN && !bs->backing) ||
+(type == QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC && !bs->backing))
 {
 qemu_iovec_memset(qiov, qiov_offset, 0, cur_bytes);
 } else {
@@ -3865,6 +3869,7 @@ static coroutine_fn int 
qcow2_co_pwrite_zeroes(BlockDriverState *bs,
 ret = qcow2_get_host_offset(bs, offset, , , );
 if (ret < 0 ||
 (type != QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN &&
+ type != QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC &&
  type != QCOW2_SUBCLUSTER_ZERO_PLAIN &&
  type != QCOW2_SUBCLUSTER_ZERO_ALLOC)) {
 qemu_co_mutex_unlock(>lock);
@@ -3943,6 +3948,7 @@ qcow2_co_copy_range_from(BlockDriverState *bs,
 
 switch (type) {
 case QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN:
+case QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC:
 if (bs->backing && bs->backing->bs) {
 int64_t backing_length = bdrv_getlength(bs->backing->bs);
 if (src_offset >= backing_length) {
-- 
2.20.1




[PATCH v8 26/34] qcow2: Clear the L2 bitmap when allocating a compressed cluster

2020-06-10 Thread Alberto Garcia
Compressed clusters always have the bitmap part of the extended L2
entry set to 0.

Signed-off-by: Alberto Garcia 
Reviewed-by: Max Reitz 
---
 block/qcow2-cluster.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 2276cee6d6..deff838fe8 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -861,6 +861,9 @@ int qcow2_alloc_compressed_cluster_offset(BlockDriverState 
*bs,
 BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE_COMPRESSED);
 qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
 set_l2_entry(s, l2_slice, l2_index, cluster_offset);
+if (has_subclusters(s)) {
+set_l2_bitmap(s, l2_slice, l2_index, 0);
+}
 qcow2_cache_put(s->l2_table_cache, (void **) _slice);
 
 *host_offset = cluster_offset & s->cluster_offset_mask;
-- 
2.20.1




[PATCH v8 24/34] qcow2: Add subcluster support to check_refcounts_l2()

2020-06-10 Thread Alberto Garcia
Setting the QCOW_OFLAG_ZERO bit of the L2 entry is forbidden if an
image has subclusters. Instead, the individual 'all zeroes' bits must
be used.

Signed-off-by: Alberto Garcia 
Reviewed-by: Max Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
---
 block/qcow2-refcount.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 770c5dbc83..696e4dad07 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1686,8 +1686,13 @@ static int check_refcounts_l2(BlockDriverState *bs, 
BdrvCheckResult *res,
 int ign = active ? QCOW2_OL_ACTIVE_L2 :
QCOW2_OL_INACTIVE_L2;
 
-l2_entry = QCOW_OFLAG_ZERO;
-set_l2_entry(s, l2_table, i, l2_entry);
+if (has_subclusters(s)) {
+set_l2_entry(s, l2_table, i, 0);
+set_l2_bitmap(s, l2_table, i,
+  QCOW_L2_BITMAP_ALL_ZEROES);
+} else {
+set_l2_entry(s, l2_table, i, QCOW_OFLAG_ZERO);
+}
 ret = qcow2_pre_write_overlap_check(bs, ign,
 l2e_offset, l2_entry_size(s), false);
 if (ret < 0) {
-- 
2.20.1




[PATCH v8 25/34] qcow2: Update L2 bitmap in qcow2_alloc_cluster_link_l2()

2020-06-10 Thread Alberto Garcia
The L2 bitmap needs to be updated after each write to indicate what
new subclusters are now allocated. This needs to happen even if the
cluster was already allocated and the L2 entry was otherwise valid.

In some cases, however, a write operation doesn't need to change the
L2 bitmap (because all affected subclusters were already allocated). This
is detected in calculate_l2_meta(), and qcow2_alloc_cluster_link_l2()
is never called in those cases.
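
For example, if a write touches subclusters #4 to #7 of a cluster the
bitmap update amounts to the following (illustrative use of the macros
introduced earlier in this series):

    l2_bitmap |=  QCOW_OFLAG_SUB_ALLOC_RANGE(4, 8); /* mark #4-#7 as allocated */
    l2_bitmap &= ~QCOW_OFLAG_SUB_ZERO_RANGE(4, 8);  /* clear their zero bits   */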

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 block/qcow2-cluster.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index edfc8ea91c..2276cee6d6 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1061,6 +1061,24 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, 
QCowL2Meta *m)
 assert((offset & L2E_OFFSET_MASK) == offset);
 
 set_l2_entry(s, l2_slice, l2_index + i, offset | QCOW_OFLAG_COPIED);
+
+/* Update bitmap with the subclusters that were just written */
+if (has_subclusters(s)) {
+uint64_t l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index + i);
+unsigned written_from = m->cow_start.offset;
+unsigned written_to = m->cow_end.offset + m->cow_end.nb_bytes ?:
+m->nb_clusters << s->cluster_bits;
+int first_sc, last_sc;
+/* Narrow written_from and written_to down to the current cluster 
*/
+written_from = MAX(written_from, i << s->cluster_bits);
+written_to   = MIN(written_to, (i + 1) << s->cluster_bits);
+assert(written_from < written_to);
+first_sc = offset_to_sc_index(s, written_from);
+last_sc  = offset_to_sc_index(s, written_to - 1);
+l2_bitmap |= QCOW_OFLAG_SUB_ALLOC_RANGE(first_sc, last_sc + 1);
+l2_bitmap &= ~QCOW_OFLAG_SUB_ZERO_RANGE(first_sc, last_sc + 1);
+set_l2_bitmap(s, l2_slice, l2_index + i, l2_bitmap);
+}
  }
 
 
-- 
2.20.1




[PATCH v8 32/34] qcow2: Allow preallocation and backing files if extended_l2 is set

2020-06-10 Thread Alberto Garcia
Traditional qcow2 images don't allow preallocation if a backing file
is set. This is because once a cluster is allocated there is no way to
tell that its data should be read from the backing file.

Extended L2 entries have individual allocation bits for each
subcluster, and therefore it is perfectly possible to have an
allocated cluster with all its subclusters unallocated.

Signed-off-by: Alberto Garcia 
---
 block/qcow2.c  | 7 ---
 tests/qemu-iotests/206.out | 2 +-
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 37bfae823c..1ea8d3b87e 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -3451,10 +3451,11 @@ qcow2_co_create(BlockdevCreateOptions *create_options, 
Error **errp)
 qcow2_opts->preallocation = PREALLOC_MODE_OFF;
 }
 if (qcow2_opts->has_backing_file &&
-qcow2_opts->preallocation != PREALLOC_MODE_OFF)
+qcow2_opts->preallocation != PREALLOC_MODE_OFF &&
+!qcow2_opts->extended_l2)
 {
-error_setg(errp, "Backing file and preallocation cannot be used at "
-   "the same time");
+error_setg(errp, "Backing file and preallocation can only be used at "
+   "the same time if extended_l2 is on");
 ret = -EINVAL;
 goto out;
 }
diff --git a/tests/qemu-iotests/206.out b/tests/qemu-iotests/206.out
index 363c5abe35..a100849fcb 100644
--- a/tests/qemu-iotests/206.out
+++ b/tests/qemu-iotests/206.out
@@ -203,7 +203,7 @@ Job failed: Different refcount widths than 16 bits require 
compatibility level 1
 === Invalid backing file options ===
 {"execute": "blockdev-create", "arguments": {"job-id": "job0", "options": 
{"backing-file": "/dev/null", "driver": "qcow2", "file": "node0", 
"preallocation": "full", "size": 67108864}}}
 {"return": {}}
-Job failed: Backing file and preallocation cannot be used at the same time
+Job failed: Backing file and preallocation can only be used at the same time 
if extended_l2 is on
 {"execute": "job-dismiss", "arguments": {"id": "job0"}}
 {"return": {}}
 
-- 
2.20.1




[PATCH v8 09/34] qcow2: Add subcluster-related fields to BDRVQcow2State

2020-06-10 Thread Alberto Garcia
This patch adds the following new fields to BDRVQcow2State:

- subclusters_per_cluster: Number of subclusters in a cluster
- subcluster_size: The size of each subcluster, in bytes
- subcluster_bits: No. of bits so 1 << subcluster_bits = subcluster_size

Images without subclusters are treated as if they had exactly one
subcluster per cluster (i.e. subcluster_size = cluster_size).
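
For example, with the default 64 KiB clusters and extended L2 entries these
fields come out as subclusters_per_cluster = 32, subcluster_size = 2048 and
subcluster_bits = 11, and the subcluster containing a given guest offset can
be computed roughly like this (a sketch; the actual helper is only added in
a later patch of this series):

    sc = (offset >> s->subcluster_bits) & (s->subclusters_per_cluster - 1);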

Signed-off-by: Alberto Garcia 
Reviewed-by: Max Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2.h | 5 +
 block/qcow2.c | 5 +
 2 files changed, 10 insertions(+)

diff --git a/block/qcow2.h b/block/qcow2.h
index 2064dd3d85..eee4c8de9c 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -78,6 +78,8 @@
 /* The cluster reads as all zeros */
 #define QCOW_OFLAG_ZERO (1ULL << 0)
 
+#define QCOW_EXTL2_SUBCLUSTERS_PER_CLUSTER 32
+
 #define MIN_CLUSTER_BITS 9
 #define MAX_CLUSTER_BITS 21
 
@@ -295,6 +297,9 @@ typedef struct BDRVQcow2State {
 int cluster_bits;
 int cluster_size;
 int l2_slice_size;
+int subcluster_bits;
+int subcluster_size;
+int subclusters_per_cluster;
 int l2_bits;
 int l2_size;
 int l1_size;
diff --git a/block/qcow2.c b/block/qcow2.c
index afb00ada42..5c175a314c 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1433,6 +1433,11 @@ static int coroutine_fn qcow2_do_open(BlockDriverState 
*bs, QDict *options,
 }
 }
 
+s->subclusters_per_cluster =
+has_subclusters(s) ? QCOW_EXTL2_SUBCLUSTERS_PER_CLUSTER : 1;
+s->subcluster_size = s->cluster_size / s->subclusters_per_cluster;
+s->subcluster_bits = ctz32(s->subcluster_size);
+
 /* Check support for various header values */
 if (header.refcount_order > 6) {
 error_setg(errp, "Reference count entry width too large; may not "
-- 
2.20.1




[PATCH v8 17/34] qcow2: Add cluster type parameter to qcow2_get_host_offset()

2020-06-10 Thread Alberto Garcia
This function returns an integer that can be either an error code or a
cluster type (a value from the QCow2ClusterType enum).

We are going to start using subcluster types instead of cluster types
in some functions so it's better to use the exact data types instead
of integers for clarity and in order to detect errors more easily.

This patch makes qcow2_get_host_offset() return 0 on success and
puts the returned cluster type in a separate parameter. There are no
semantic changes.
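
Callers therefore switch from checking the return value to checking a
separate variable, along these lines (illustrative):

    QCow2ClusterType type;

    ret = qcow2_get_host_offset(bs, offset, &bytes, &host_offset, &type);
    if (ret < 0) {
        return ret;                     /* a real error */
    }
    if (type == QCOW2_CLUSTER_ZERO_PLAIN) {
        /* ... act on the cluster type ... */
    }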

Signed-off-by: Alberto Garcia 
Reviewed-by: Max Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2.h |  3 ++-
 block/qcow2-cluster.c | 11 +++
 block/qcow2.c | 37 ++---
 3 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index ea647c8bb5..74f65793bd 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -893,7 +893,8 @@ int qcow2_encrypt_sectors(BDRVQcow2State *s, int64_t 
sector_num,
   uint8_t *buf, int nb_sectors, bool enc, Error 
**errp);
 
 int qcow2_get_host_offset(BlockDriverState *bs, uint64_t offset,
-  unsigned int *bytes, uint64_t *host_offset);
+  unsigned int *bytes, uint64_t *host_offset,
+  QCow2ClusterType *cluster_type);
 int qcow2_alloc_cluster_offset(BlockDriverState *bs, uint64_t offset,
unsigned int *bytes, uint64_t *host_offset,
QCowL2Meta **m);
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 32dc6e75e3..1e5681b0c6 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -565,13 +565,14 @@ static int coroutine_fn 
do_perform_cow_write(BlockDriverState *bs,
  *
  * On exit, *bytes is the number of bytes starting at offset that have the same
  * cluster type and (if applicable) are stored contiguously in the image file.
+ * The cluster type is stored in *cluster_type.
  * Compressed clusters are always returned one by one.
  *
- * Returns the cluster type (QCOW2_CLUSTER_*) on success, -errno in error
- * cases.
+ * Returns 0 on success, -errno in error cases.
  */
 int qcow2_get_host_offset(BlockDriverState *bs, uint64_t offset,
-  unsigned int *bytes, uint64_t *host_offset)
+  unsigned int *bytes, uint64_t *host_offset,
+  QCow2ClusterType *cluster_type)
 {
 BDRVQcow2State *s = bs->opaque;
 unsigned int l2_index;
@@ -712,7 +713,9 @@ out:
 assert(bytes_available - offset_in_cluster <= UINT_MAX);
 *bytes = bytes_available - offset_in_cluster;
 
-return type;
+*cluster_type = type;
+
+return 0;
 
 fail:
qcow2_cache_put(s->l2_table_cache, (void **)&l2_slice);
diff --git a/block/qcow2.c b/block/qcow2.c
index 7df55a88a8..89e17bdaba 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2033,6 +2033,7 @@ static int coroutine_fn 
qcow2_co_block_status(BlockDriverState *bs,
 BDRVQcow2State *s = bs->opaque;
 uint64_t host_offset;
 unsigned int bytes;
+QCow2ClusterType type;
 int ret, status = 0;
 
qemu_co_mutex_lock(&s->lock);
@@ -2044,7 +2045,7 @@ static int coroutine_fn 
qcow2_co_block_status(BlockDriverState *bs,
 }
 
 bytes = MIN(INT_MAX, count);
-ret = qcow2_get_host_offset(bs, offset, &bytes, &host_offset);
+ret = qcow2_get_host_offset(bs, offset, &bytes, &host_offset, &type);
qemu_co_mutex_unlock(&s->lock);
 if (ret < 0) {
 return ret;
@@ -2052,15 +2053,15 @@ static int coroutine_fn 
qcow2_co_block_status(BlockDriverState *bs,
 
 *pnum = bytes;
 
-if ((ret == QCOW2_CLUSTER_NORMAL || ret == QCOW2_CLUSTER_ZERO_ALLOC) &&
+if ((type == QCOW2_CLUSTER_NORMAL || type == QCOW2_CLUSTER_ZERO_ALLOC) &&
 !s->crypto) {
 *map = host_offset;
 *file = s->data_file->bs;
 status |= BDRV_BLOCK_OFFSET_VALID;
 }
-if (ret == QCOW2_CLUSTER_ZERO_PLAIN || ret == QCOW2_CLUSTER_ZERO_ALLOC) {
+if (type == QCOW2_CLUSTER_ZERO_PLAIN || type == QCOW2_CLUSTER_ZERO_ALLOC) {
 status |= BDRV_BLOCK_ZERO;
-} else if (ret != QCOW2_CLUSTER_UNALLOCATED) {
+} else if (type != QCOW2_CLUSTER_UNALLOCATED) {
 status |= BDRV_BLOCK_DATA;
 }
 if (s->metadata_preallocation && (status & BDRV_BLOCK_DATA) &&
@@ -2269,6 +2270,7 @@ static coroutine_fn int 
qcow2_co_preadv_part(BlockDriverState *bs,
 int ret = 0;
 unsigned int cur_bytes; /* number of bytes in current iteration */
 uint64_t host_offset = 0;
+QCow2ClusterType type;
 AioTaskPool *aio = NULL;
 
 while (bytes != 0 && aio_task_pool_status(aio) == 0) {
@@ -2280,22 +2282,23 @@ static coroutine_fn int 
qcow2_co_preadv_part(BlockDriverState *bs,
 }
 
qemu_co_mutex_lock(&s->lock);
-ret = qcow2_get_host_offset(bs, offset, &cur_bytes, &host_offset);
+ret = qcow2_get_host_offset(bs, offset, &cur_bytes,
+&host_offset, &type);
 

[PATCH v8 27/34] qcow2: Add subcluster support to handle_alloc_space()

2020-06-10 Thread Alberto Garcia
The bdrv_co_pwrite_zeroes() call here fills complete clusters with
zeroes, but it can happen that some subclusters are not part of the
write request or the copy-on-write. This patch makes sure that only
the affected subclusters are overwritten.

A potential improvement would be to also fill with zeroes the other
subclusters if we can guarantee that we are not overwriting existing
data. However this would waste more disk space, so we should first
evaluate if it's really worth doing.

Signed-off-by: Alberto Garcia 
Reviewed-by: Max Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index a3481cd85b..86258fbc22 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2411,6 +2411,9 @@ static int handle_alloc_space(BlockDriverState *bs, 
QCowL2Meta *l2meta)
 
 for (m = l2meta; m != NULL; m = m->next) {
 int ret;
+uint64_t start_offset = m->alloc_offset + m->cow_start.offset;
+unsigned nb_bytes = m->cow_end.offset + m->cow_end.nb_bytes -
+m->cow_start.offset;
 
 if (!m->cow_start.nb_bytes && !m->cow_end.nb_bytes) {
 continue;
@@ -2425,16 +2428,14 @@ static int handle_alloc_space(BlockDriverState *bs, 
QCowL2Meta *l2meta)
  * efficiently zero out the whole clusters
  */
 
-ret = qcow2_pre_write_overlap_check(bs, 0, m->alloc_offset,
-m->nb_clusters * s->cluster_size,
+ret = qcow2_pre_write_overlap_check(bs, 0, start_offset, nb_bytes,
 true);
 if (ret < 0) {
 return ret;
 }
 
 BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SPACE);
-ret = bdrv_co_pwrite_zeroes(s->data_file, m->alloc_offset,
-m->nb_clusters * s->cluster_size,
+ret = bdrv_co_pwrite_zeroes(s->data_file, start_offset, nb_bytes,
 BDRV_REQ_NO_FALLBACK);
 if (ret < 0) {
 if (ret != -ENOTSUP && ret != -EAGAIN) {
-- 
2.20.1




[PATCH v8 30/34] qcow2: Add prealloc field to QCowL2Meta

2020-06-10 Thread Alberto Garcia
This field allows us to indicate that the L2 metadata update does not
come from a write request with actual data but from a preallocation
request.

For traditional images this does not make any difference, but for
images with extended L2 entries this means that the clusters are
allocated normally in the L2 table but individual subclusters are
marked as unallocated.

This will allow preallocating images that have a backing file.

There is one special case: when we resize an existing image we can
also request that the new clusters are preallocated. If the image
already had a backing file then we have to hide any possible stale
data and zero out the new clusters (see commit 955c7d6687 for more
details).

In this case the subclusters cannot be left as unallocated so the L2
bitmap must be updated.

Signed-off-by: Alberto Garcia 
---
 block/qcow2.h | 8 
 block/qcow2-cluster.c | 2 +-
 block/qcow2.c | 6 ++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 4ef4ae4ab0..f3499e53bf 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -463,6 +463,14 @@ typedef struct QCowL2Meta
  */
 bool skip_cow;
 
+/**
+ * Indicates that this is not a normal write request but a preallocation.
+ * If the image has extended L2 entries this means that no new individual
+ * subclusters will be marked as allocated in the L2 bitmap (but any
+ * existing contents of that bitmap will be kept).
+ */
+bool prealloc;
+
 /**
  * The I/O vector with the data from the actual guest write request.
  * If non-NULL, this is meant to be merged together with the data
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 1641976028..c8217081f2 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1066,7 +1066,7 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, 
QCowL2Meta *m)
 set_l2_entry(s, l2_slice, l2_index + i, offset | QCOW_OFLAG_COPIED);
 
 /* Update bitmap with the subclusters that were just written */
-if (has_subclusters(s)) {
+if (has_subclusters(s) && !m->prealloc) {
 uint64_t l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index + i);
 unsigned written_from = m->cow_start.offset;
 unsigned written_to = m->cow_end.offset + m->cow_end.nb_bytes ?:
diff --git a/block/qcow2.c b/block/qcow2.c
index 72bd25e774..003f166024 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2086,6 +2086,7 @@ static coroutine_fn int 
qcow2_handle_l2meta(BlockDriverState *bs,
 QCowL2Meta *next;
 
 if (link_l2) {
+assert(!l2meta->prealloc);
 ret = qcow2_alloc_cluster_link_l2(bs, l2meta);
 if (ret) {
 goto out;
@@ -3131,6 +3132,7 @@ static int coroutine_fn preallocate_co(BlockDriverState 
*bs, uint64_t offset,
 
 while (meta) {
 QCowL2Meta *next = meta->next;
+meta->prealloc = true;
 
 ret = qcow2_alloc_cluster_link_l2(bs, meta);
 if (ret < 0) {
@@ -4224,6 +4226,7 @@ static int coroutine_fn 
qcow2_co_truncate(BlockDriverState *bs, int64_t offset,
 int64_t clusters_allocated;
 int64_t old_file_size, last_cluster, new_file_size;
 uint64_t nb_new_data_clusters, nb_new_l2_tables;
+bool subclusters_need_allocation = false;
 
 /* With a data file, preallocation means just allocating the metadata
  * and forwarding the truncate request to the data file */
@@ -4305,6 +4308,8 @@ static int coroutine_fn 
qcow2_co_truncate(BlockDriverState *bs, int64_t offset,
BDRV_REQ_ZERO_WRITE, NULL);
 if (ret >= 0) {
 flags &= ~BDRV_REQ_ZERO_WRITE;
+/* Ensure that we read zeroes and not backing file data */
+subclusters_need_allocation = true;
 }
 } else {
 ret = -1;
@@ -4343,6 +4348,7 @@ static int coroutine_fn 
qcow2_co_truncate(BlockDriverState *bs, int64_t offset,
 .offset   = nb_clusters << s->cluster_bits,
 .nb_bytes = 0,
 },
+.prealloc = !subclusters_need_allocation,
 };
 qemu_co_queue_init(_requests);
 
-- 
2.20.1




[PATCH v8 14/34] qcow2: Add QCow2SubclusterType and qcow2_get_subcluster_type()

2020-06-10 Thread Alberto Garcia
This patch adds QCow2SubclusterType, which is the subcluster-level
version of QCow2ClusterType. All QCOW2_SUBCLUSTER_* values have
the same meaning as their QCOW2_CLUSTER_* equivalents (when they
exist). See below for details and caveats.

In images without extended L2 entries clusters are treated as having
exactly one subcluster so it is possible to replace one data type with
the other while keeping the exact same semantics.

With extended L2 entries there are new possible values, and every
subcluster in the same cluster can obviously have a different
QCow2SubclusterType so functions need to be adapted to work on the
subcluster level.

There are several things that have to be taken into account:

  a) QCOW2_SUBCLUSTER_COMPRESSED means that the whole cluster is
 compressed. We do not support compression at the subcluster
 level.

  b) There are two different values for unallocated subclusters:
 QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN which means that the whole
 cluster is unallocated, and QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC
 which means that the cluster is allocated but the subcluster is
 not. The latter can only happen in images with extended L2
 entries.

  c) QCOW2_SUBCLUSTER_INVALID is used to detect the cases where an L2
 entry has a value that violates the specification. The caller is
 responsible for handling these situations.

 To prevent compatibility problems with images that have invalid
 values but are currently being read by QEMU without causing side
 effects, QCOW2_SUBCLUSTER_INVALID is only returned for images
 with extended L2 entries.

qcow2_cluster_to_subcluster_type() is added as a separate function
from qcow2_get_subcluster_type(), but this is only temporary and both
will be merged in a subsequent patch.
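
For an allocated (QCOW2_CLUSTER_NORMAL) cluster of an image with extended
L2 entries the subcluster type then follows from the two bitmap bits of the
subcluster in question, roughly like this (a simplified sketch of the
logic, not the complete helper):

    if (l2_bitmap & QCOW_OFLAG_SUB_ZERO(sc)) {
        /* the "reads as zeroes" bit requires the allocation bit to be unset */
        type = (l2_bitmap & QCOW_OFLAG_SUB_ALLOC(sc)) ?
               QCOW2_SUBCLUSTER_INVALID : QCOW2_SUBCLUSTER_ZERO_ALLOC;
    } else if (l2_bitmap & QCOW_OFLAG_SUB_ALLOC(sc)) {
        type = QCOW2_SUBCLUSTER_NORMAL;
    } else {
        type = QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC;
    }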

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 block/qcow2.h | 126 +-
 1 file changed, 125 insertions(+), 1 deletion(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 82b86f6cec..3aec6f452a 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -80,6 +80,21 @@
 
 #define QCOW_EXTL2_SUBCLUSTERS_PER_CLUSTER 32
 
+/* The subcluster X [0..31] is allocated */
+#define QCOW_OFLAG_SUB_ALLOC(X)   (1ULL << (X))
+/* The subcluster X [0..31] reads as zeroes */
+#define QCOW_OFLAG_SUB_ZERO(X)(QCOW_OFLAG_SUB_ALLOC(X) << 32)
+/* Subclusters [X, Y) (0 <= X <= Y <= 32) are allocated */
+#define QCOW_OFLAG_SUB_ALLOC_RANGE(X, Y) \
+(QCOW_OFLAG_SUB_ALLOC(Y) - QCOW_OFLAG_SUB_ALLOC(X))
+/* Subclusters [X, Y) (0 <= X <= Y <= 32) read as zeroes */
+#define QCOW_OFLAG_SUB_ZERO_RANGE(X, Y) \
+(QCOW_OFLAG_SUB_ALLOC_RANGE(X, Y) << 32)
+/* L2 entry bitmap with all allocation bits set */
+#define QCOW_L2_BITMAP_ALL_ALLOC  (QCOW_OFLAG_SUB_ALLOC_RANGE(0, 32))
+/* L2 entry bitmap with all "read as zeroes" bits set */
+#define QCOW_L2_BITMAP_ALL_ZEROES (QCOW_OFLAG_SUB_ZERO_RANGE(0, 32))
+
 /* Size of normal and extended L2 entries */
 #define L2E_SIZE_NORMAL   (sizeof(uint64_t))
 #define L2E_SIZE_EXTENDED (sizeof(uint64_t) * 2)
@@ -462,6 +477,33 @@ typedef struct QCowL2Meta
 QLIST_ENTRY(QCowL2Meta) next_in_flight;
 } QCowL2Meta;
 
+/*
+ * In images with standard L2 entries all clusters are treated as if
+ * they had one subcluster so QCow2ClusterType and QCow2SubclusterType
+ * can be mapped to each other and have the exact same meaning
+ * (QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC cannot happen in these images).
+ *
+ * In images with extended L2 entries QCow2ClusterType refers to the
+ * complete cluster and QCow2SubclusterType to each of the individual
+ * subclusters, so there are several possible combinations:
+ *
+ * |--+---|
+ * | Cluster type | Possible subcluster types |
+ * |--+---|
+ * | UNALLOCATED  | UNALLOCATED_PLAIN |
+ * |  |ZERO_PLAIN |
+ * |--+---|
+ * | NORMAL   | UNALLOCATED_ALLOC |
+ * |  |ZERO_ALLOC |
+ * |  |NORMAL |
+ * |--+---|
+ * | COMPRESSED   |COMPRESSED |
+ * |--+---|
+ *
+ * QCOW2_SUBCLUSTER_INVALID means that the L2 entry is incorrect and
+ * the image should be marked corrupt.
+ */
+
 typedef enum QCow2ClusterType {
 QCOW2_CLUSTER_UNALLOCATED,
 QCOW2_CLUSTER_ZERO_PLAIN,
@@ -470,6 +512,16 @@ typedef enum QCow2ClusterType {
 QCOW2_CLUSTER_COMPRESSED,
 } QCow2ClusterType;
 
+typedef enum QCow2SubclusterType {
+QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN,
+QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC,
+QCOW2_SUBCLUSTER_ZERO_PLAIN,
+QCOW2_SUBCLUSTER_ZERO_ALLOC,
+QCOW2_SUBCLUSTER_NORMAL,
+QCOW2_SUBCLUSTER_COMPRESSED,
+QCOW2_SUBCLUSTER_INVALID,
+} QCow2SubclusterType;
+
 typedef enum QCow2MetadataOverlap {
  

[PATCH v8 29/34] qcow2: Add subcluster support to qcow2_measure()

2020-06-10 Thread Alberto Garcia
Extended L2 entries are bigger than normal L2 entries so this has an
impact on the amount of metadata needed for a qcow2 file.
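
For example, with the default 64 KiB cluster size a standard L2 table holds
65536 / 8 = 8192 entries while a table of extended entries holds
65536 / 16 = 4096, so covering the same virtual size takes roughly twice as
many L2 tables (and a correspondingly larger L1 table).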

Signed-off-by: Alberto Garcia 
Reviewed-by: Max Reitz 
---
 block/qcow2.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 4edc3c72b9..72bd25e774 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -3233,28 +3233,31 @@ int64_t qcow2_refcount_metadata_size(int64_t clusters, 
size_t cluster_size,
  * @total_size: virtual disk size in bytes
  * @cluster_size: cluster size in bytes
  * @refcount_order: refcount bits power-of-2 exponent
+ * @extended_l2: true if the image has extended L2 entries
  *
  * Returns: Total number of bytes required for the fully allocated image
  * (including metadata).
  */
 static int64_t qcow2_calc_prealloc_size(int64_t total_size,
 size_t cluster_size,
-int refcount_order)
+int refcount_order,
+bool extended_l2)
 {
 int64_t meta_size = 0;
 uint64_t nl1e, nl2e;
 int64_t aligned_total_size = ROUND_UP(total_size, cluster_size);
+size_t l2e_size = extended_l2 ? L2E_SIZE_EXTENDED : L2E_SIZE_NORMAL;
 
 /* header: 1 cluster */
 meta_size += cluster_size;
 
 /* total size of L2 tables */
 nl2e = aligned_total_size / cluster_size;
-nl2e = ROUND_UP(nl2e, cluster_size / sizeof(uint64_t));
-meta_size += nl2e * sizeof(uint64_t);
+nl2e = ROUND_UP(nl2e, cluster_size / l2e_size);
+meta_size += nl2e * l2e_size;
 
 /* total size of L1 tables */
-nl1e = nl2e * sizeof(uint64_t) / cluster_size;
+nl1e = nl2e * l2e_size / cluster_size;
 nl1e = ROUND_UP(nl1e, cluster_size / sizeof(uint64_t));
 meta_size += nl1e * sizeof(uint64_t);
 
@@ -4845,6 +4848,8 @@ static BlockMeasureInfo *qcow2_measure(QemuOpts *opts, 
BlockDriverState *in_bs,
 PreallocMode prealloc;
 bool has_backing_file;
 bool has_luks;
+bool extended_l2 = false; /* Set to false until the option is added */
+size_t l2e_size;
 
 /* Parse image creation options */
cluster_size = qcow2_opt_get_cluster_size_del(opts, &local_err);
@@ -4910,8 +4915,9 @@ static BlockMeasureInfo *qcow2_measure(QemuOpts *opts, 
BlockDriverState *in_bs,
 virtual_size = ROUND_UP(virtual_size, cluster_size);
 
 /* Check that virtual disk size is valid */
+l2e_size = extended_l2 ? L2E_SIZE_EXTENDED : L2E_SIZE_NORMAL;
 l2_tables = DIV_ROUND_UP(virtual_size / cluster_size,
- cluster_size / sizeof(uint64_t));
+ cluster_size / l2e_size);
 if (l2_tables * sizeof(uint64_t) > QCOW_MAX_L1_SIZE) {
error_setg(&local_err, "The image size is too large "
"(try using a larger cluster size)");
@@ -4974,9 +4980,9 @@ static BlockMeasureInfo *qcow2_measure(QemuOpts *opts, 
BlockDriverState *in_bs,
 }
 
 info = g_new0(BlockMeasureInfo, 1);
-info->fully_allocated =
+info->fully_allocated = luks_payload_size +
 qcow2_calc_prealloc_size(virtual_size, cluster_size,
- ctz32(refcount_bits)) + luks_payload_size;
+ ctz32(refcount_bits), extended_l2);
 
 /*
  * Remove data clusters that are not required.  This overestimates the
-- 
2.20.1




[PATCH v8 33/34] qcow2: Assert that expand_zero_clusters_in_l1() does not support subclusters

2020-06-10 Thread Alberto Garcia
This function is only used by qcow2_expand_zero_clusters() to
downgrade a qcow2 image to a previous version. It is however not
possible to downgrade an image with extended L2 entries because older
versions of qcow2 do not have this feature.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 block/qcow2-cluster.c  | 8 +++-
 tests/qemu-iotests/061 | 6 ++
 tests/qemu-iotests/061.out | 5 +
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index c8217081f2..e8bb1f32f3 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -2157,6 +2157,9 @@ static int expand_zero_clusters_in_l1(BlockDriverState 
*bs, uint64_t *l1_table,
 int ret;
 int i, j;
 
+/* qcow2_downgrade() is not allowed in images with subclusters */
+assert(!has_subclusters(s));
+
 slice_size2 = s->l2_slice_size * l2_entry_size(s);
 n_slices = s->cluster_size / slice_size2;
 
@@ -2225,7 +2228,8 @@ static int expand_zero_clusters_in_l1(BlockDriverState 
*bs, uint64_t *l1_table,
 if (cluster_type == QCOW2_CLUSTER_ZERO_PLAIN) {
 if (!bs->backing) {
 /* not backed; therefore we can simply deallocate the
- * cluster */
+ * cluster. No need to call set_l2_bitmap(), this
+ * function doesn't support images with subclusters. */
 set_l2_entry(s, l2_slice, j, 0);
 l2_dirty = true;
 continue;
@@ -2296,6 +2300,8 @@ static int expand_zero_clusters_in_l1(BlockDriverState 
*bs, uint64_t *l1_table,
 } else {
 set_l2_entry(s, l2_slice, j, offset);
 }
+/* No need to call set_l2_bitmap() after set_l2_entry() because
+ * this function doesn't support images with subclusters. */
 l2_dirty = true;
 }
 
diff --git a/tests/qemu-iotests/061 b/tests/qemu-iotests/061
index 10eb243164..23add2dfe3 100755
--- a/tests/qemu-iotests/061
+++ b/tests/qemu-iotests/061
@@ -303,6 +303,12 @@ $QEMU_IMG amend -o "compat=0.10" "$TEST_IMG"
 _img_info --format-specific
 _check_test_img
 
+echo
+echo "=== Testing version downgrade with extended L2 entries ==="
+echo
+_make_test_img -o "compat=1.1,extended_l2=on" 64M
+$QEMU_IMG amend -o "compat=0.10" "$TEST_IMG"
+
 echo
 echo "=== Try changing the external data file ==="
 echo
diff --git a/tests/qemu-iotests/061.out b/tests/qemu-iotests/061.out
index 39812d8cf8..c1acdbd751 100644
--- a/tests/qemu-iotests/061.out
+++ b/tests/qemu-iotests/061.out
@@ -528,6 +528,11 @@ Format specific information:
 extended l2: false
 No errors were found on the image.
 
+=== Testing version downgrade with extended L2 entries ===
+
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
+qemu-img: Cannot downgrade an image with incompatible features 0x10 set
+
 === Try changing the external data file ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
-- 
2.20.1




[PATCH v8 02/34] qcow2: Convert qcow2_get_cluster_offset() into qcow2_get_host_offset()

2020-06-10 Thread Alberto Garcia
qcow2_get_cluster_offset() takes an (unaligned) guest offset and
returns the (aligned) offset of the corresponding cluster in the qcow2
image.

In practice none of the callers need to know where the cluster starts
so this patch makes the function calculate and return the final host
offset directly. The function is also renamed accordingly.

There is a pre-existing exception with compressed clusters: in this
case the function returns the complete cluster descriptor (containing
the offset and size of the compressed data). This does not change with
this patch but it is now documented.
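
In effect, for normal clusters the function now ends up doing something
equivalent to this (illustrative, not the literal code):

    *host_offset = (l2_entry & L2E_OFFSET_MASK) + offset_into_cluster(s, offset);

while compressed clusters keep returning the full cluster descriptor, as
described above.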

Signed-off-by: Alberto Garcia 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2.h |  4 ++--
 block/qcow2-cluster.c | 42 +++---
 block/qcow2.c | 24 +++-
 3 files changed, 32 insertions(+), 38 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 7ce2c23bdb..06475e0849 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -694,8 +694,8 @@ int qcow2_write_l1_entry(BlockDriverState *bs, int 
l1_index);
 int qcow2_encrypt_sectors(BDRVQcow2State *s, int64_t sector_num,
   uint8_t *buf, int nb_sectors, bool enc, Error 
**errp);
 
-int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t offset,
- unsigned int *bytes, uint64_t *cluster_offset);
+int qcow2_get_host_offset(BlockDriverState *bs, uint64_t offset,
+  unsigned int *bytes, uint64_t *host_offset);
 int qcow2_alloc_cluster_offset(BlockDriverState *bs, uint64_t offset,
unsigned int *bytes, uint64_t *host_offset,
QCowL2Meta **m);
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 4b5fc8c4a7..9ab41cb728 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -496,10 +496,15 @@ static int coroutine_fn 
do_perform_cow_write(BlockDriverState *bs,
 
 
 /*
- * get_cluster_offset
+ * get_host_offset
  *
- * For a given offset of the virtual disk, find the cluster type and offset in
- * the qcow2 file. The offset is stored in *cluster_offset.
+ * For a given offset of the virtual disk find the equivalent host
+ * offset in the qcow2 file and store it in *host_offset. Neither
+ * offset needs to be aligned to a cluster boundary.
+ *
+ * If the cluster is unallocated then *host_offset will be 0.
+ * If the cluster is compressed then *host_offset will contain the
+ * complete compressed cluster descriptor.
  *
  * On entry, *bytes is the maximum number of contiguous bytes starting at
  * offset that we are interested in.
@@ -511,12 +516,12 @@ static int coroutine_fn 
do_perform_cow_write(BlockDriverState *bs,
  * Returns the cluster type (QCOW2_CLUSTER_*) on success, -errno in error
  * cases.
  */
-int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t offset,
- unsigned int *bytes, uint64_t *cluster_offset)
+int qcow2_get_host_offset(BlockDriverState *bs, uint64_t offset,
+  unsigned int *bytes, uint64_t *host_offset)
 {
 BDRVQcow2State *s = bs->opaque;
 unsigned int l2_index;
-uint64_t l1_index, l2_offset, *l2_slice;
+uint64_t l1_index, l2_offset, *l2_slice, l2_entry;
 int c;
 unsigned int offset_in_cluster;
 uint64_t bytes_available, bytes_needed, nb_clusters;
@@ -537,8 +542,6 @@ int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t 
offset,
 bytes_needed = bytes_available;
 }
 
-*cluster_offset = 0;
-
 /* seek to the l2 offset in the l1 table */
 
 l1_index = offset_to_l1_index(s, offset);
@@ -570,7 +573,7 @@ int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t 
offset,
 /* find the cluster offset for the given disk offset */
 
 l2_index = offset_to_l2_slice_index(s, offset);
-*cluster_offset = be64_to_cpu(l2_slice[l2_index]);
+l2_entry = be64_to_cpu(l2_slice[l2_index]);
 
 nb_clusters = size_to_clusters(s, bytes_needed);
 /* bytes_needed <= *bytes + offset_in_cluster, both of which are unsigned
@@ -578,7 +581,7 @@ int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t 
offset,
  * true */
 assert(nb_clusters <= INT_MAX);
 
-type = qcow2_get_cluster_type(bs, *cluster_offset);
+type = qcow2_get_cluster_type(bs, l2_entry);
 if (s->qcow_version < 3 && (type == QCOW2_CLUSTER_ZERO_PLAIN ||
 type == QCOW2_CLUSTER_ZERO_ALLOC)) {
 qcow2_signal_corruption(bs, true, -1, -1, "Zero cluster entry found"
@@ -599,42 +602,43 @@ int qcow2_get_cluster_offset(BlockDriverState *bs, 
uint64_t offset,
 }
 /* Compressed clusters can only be processed one by one */
 c = 1;
-*cluster_offset &= L2E_COMPRESSED_OFFSET_SIZE_MASK;
+*host_offset = l2_entry & L2E_COMPRESSED_OFFSET_SIZE_MASK;
 break;
 case QCOW2_CLUSTER_ZERO_PLAIN:
 case QCOW2_CLUSTER_UNALLOCATED:
 /* how many empty clusters ? */
 

[PATCH v8 07/34] qcow2: Document the Extended L2 Entries feature

2020-06-10 Thread Alberto Garcia
Subcluster allocation in qcow2 is implemented by extending the
existing L2 table entries and adding additional information to
indicate the allocation status of each subcluster.

This patch documents the changes to the qcow2 format and how they
affect the calculation of the L2 cache size.
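
Since extended L2 entries are twice as large, an L2 table (and therefore
the L2 cache) covers half as much guest data, so the cache must be twice as
big to map the same amount of virtual disk. For example, using the formulas
from docs/qcow2-cache.txt, fully caching the L2 metadata of a 1 TB image
with 64 KB clusters takes 128 MB of cache with normal L2 entries and 256 MB
with extended L2 entries.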

Signed-off-by: Alberto Garcia 
Reviewed-by: Max Reitz 
Reviewed-by: Eric Blake 
---
 docs/interop/qcow2.txt | 68 --
 docs/qcow2-cache.txt   | 19 +++-
 2 files changed, 83 insertions(+), 4 deletions(-)

diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
index cb723463f2..64e9345fb4 100644
--- a/docs/interop/qcow2.txt
+++ b/docs/interop/qcow2.txt
@@ -42,6 +42,9 @@ The first cluster of a qcow2 image contains the file header:
 as the maximum cluster size and won't be able to open 
images
 with larger cluster sizes.
 
+Note: if the image has Extended L2 Entries then 
cluster_bits
+must be at least 14 (i.e. 16384 byte clusters).
+
  24 - 31:   size
 Virtual disk size in bytes.
 
@@ -117,7 +120,12 @@ the next fields through header_length.
 clusters. The compression_type field must be
 present and not zero.
 
-Bits 4-63:  Reserved (set to 0)
+Bit 4:  Extended L2 Entries.  If this bit is set then
+L2 table entries use an extended format that
+allows subcluster-based allocation. See the
+Extended L2 Entries section for more details.
+
+Bits 5-63:  Reserved (set to 0)
 
  80 -  87:  compatible_features
 Bitmask of compatible features. An implementation can
@@ -498,7 +506,7 @@ cannot be relaxed without an incompatible layout change).
 Given an offset into the virtual disk, the offset into the image file can be
 obtained as follows:
 
-l2_entries = (cluster_size / sizeof(uint64_t))
+l2_entries = (cluster_size / sizeof(uint64_t))[*]
 
 l2_index = (offset / cluster_size) % l2_entries
 l1_index = (offset / cluster_size) / l2_entries
@@ -508,6 +516,8 @@ obtained as follows:
 
 return cluster_offset + (offset % cluster_size)
 
+[*] this changes if Extended L2 Entries are enabled, see next section
+
 L1 table entry:
 
 Bit  0 -  8:Reserved (set to 0)
@@ -548,7 +558,8 @@ Standard Cluster Descriptor:
 nor is data read from the backing file if the cluster is
 unallocated.
 
-With version 2, this is always 0.
+With version 2 or with extended L2 entries (see the next
+section), this is always 0.
 
  1 -  8:Reserved (set to 0)
 
@@ -585,6 +596,57 @@ file (except if bit 0 in the Standard Cluster Descriptor 
is set). If there is
 no backing file or the backing file is smaller than the image, they shall read
 zeros for all parts that are not covered by the backing file.
 
+== Extended L2 Entries ==
+
+An image uses Extended L2 Entries if bit 4 is set on the incompatible_features
+field of the header.
+
+In these images standard data clusters are divided into 32 subclusters of the
+same size. They are contiguous and start from the beginning of the cluster.
+Subclusters can be allocated independently and the L2 entry contains 
information
+indicating the status of each one of them. Compressed data clusters don't have
+subclusters so they are treated the same as in images without this feature.
+
+The size of an extended L2 entry is 128 bits so the number of entries per table
+is calculated using this formula:
+
+l2_entries = (cluster_size / (2 * sizeof(uint64_t)))
+
+The first 64 bits have the same format as the standard L2 table entry described
+in the previous section, with the exception of bit 0 of the standard cluster
+descriptor.
+
+The last 64 bits contain a subcluster allocation bitmap with this format:
+
+Subcluster Allocation Bitmap (for standard clusters):
+
+Bit  0 - 31:Allocation status (one bit per subcluster)
+
+1: the subcluster is allocated. In this case the
+   host cluster offset field must contain a valid
+   offset.
+0: the subcluster is not allocated. In this case
+   read requests shall go to the backing file or
+   return zeros if there is no backing file data.
+
+Bits are assigned starting from the least significant
+one (i.e. bit x is used for subcluster x).
+
+32 - 63 Subcluster reads as zeros (one bit per subcluster)
+
+1: the subcluster reads as zeros. In this case the
+   allocation status bit must be unset. The host
+   

[PATCH v8 00/34] Add subcluster allocation to qcow2

2020-06-10 Thread Alberto Garcia
Hi,

here's the new version of the patches to add subcluster allocation
support to qcow2.

Please refer to the cover letter of the first version for a full
description of the patches:

   https://lists.gnu.org/archive/html/qemu-block/2019-10/msg00983.html

The big change here is that when an image is preallocated the requested
clusters are now allocated but the L2 bitmap is left untouched.
This makes it possible to preallocate an image that has a backing
file.

If you want to test this series make sure to apply this patch first:

   https://lists.gnu.org/archive/html/qemu-block/2020-06/msg00504.html

Berto

v8:
- Patch 30: New patch
- Patch 31: Update test expectations after commit cf2d1203dc
- Patch 32: New patch
- Patch 34: New tests, fixes and general refactoring of the code

v7: https://lists.gnu.org/archive/html/qemu-block/2020-05/msg01683.html
v6: https://lists.gnu.org/archive/html/qemu-block/2020-05/msg01583.html
v5: https://lists.gnu.org/archive/html/qemu-block/2020-05/msg00251.html
v4: https://lists.gnu.org/archive/html/qemu-block/2020-03/msg00966.html
v3: https://lists.gnu.org/archive/html/qemu-block/2019-12/msg00587.html
v2: https://lists.gnu.org/archive/html/qemu-block/2019-10/msg01642.html
v1: https://lists.gnu.org/archive/html/qemu-block/2019-10/msg00983.html

Output of git backport-diff against v7:

Key:
[] : patches are identical
[] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/34:[] [--] 'qcow2: Make Qcow2AioTask store the full host offset'
002/34:[] [--] 'qcow2: Convert qcow2_get_cluster_offset() into 
qcow2_get_host_offset()'
003/34:[] [--] 'qcow2: Add calculate_l2_meta()'
004/34:[] [--] 'qcow2: Split cluster_needs_cow() out of 
count_cow_clusters()'
005/34:[] [--] 'qcow2: Process QCOW2_CLUSTER_ZERO_ALLOC clusters in 
handle_copied()'
006/34:[] [--] 'qcow2: Add get_l2_entry() and set_l2_entry()'
007/34:[] [--] 'qcow2: Document the Extended L2 Entries feature'
008/34:[] [--] 'qcow2: Add dummy has_subclusters() function'
009/34:[] [--] 'qcow2: Add subcluster-related fields to BDRVQcow2State'
010/34:[] [--] 'qcow2: Add offset_to_sc_index()'
011/34:[] [--] 'qcow2: Add offset_into_subcluster() and 
size_to_subclusters()'
012/34:[] [--] 'qcow2: Add l2_entry_size()'
013/34:[] [--] 'qcow2: Update get/set_l2_entry() and add 
get/set_l2_bitmap()'
014/34:[] [--] 'qcow2: Add QCow2SubclusterType and 
qcow2_get_subcluster_type()'
015/34:[] [--] 'qcow2: Add qcow2_get_subcluster_range_type()'
016/34:[] [--] 'qcow2: Add qcow2_cluster_is_allocated()'
017/34:[] [--] 'qcow2: Add cluster type parameter to 
qcow2_get_host_offset()'
018/34:[] [--] 'qcow2: Replace QCOW2_CLUSTER_* with QCOW2_SUBCLUSTER_*'
019/34:[] [--] 'qcow2: Handle QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC'
020/34:[] [--] 'qcow2: Add subcluster support to calculate_l2_meta()'
021/34:[] [--] 'qcow2: Add subcluster support to qcow2_get_host_offset()'
022/34:[] [--] 'qcow2: Add subcluster support to zero_in_l2_slice()'
023/34:[] [--] 'qcow2: Add subcluster support to discard_in_l2_slice()'
024/34:[] [--] 'qcow2: Add subcluster support to check_refcounts_l2()'
025/34:[] [--] 'qcow2: Update L2 bitmap in qcow2_alloc_cluster_link_l2()'
026/34:[] [--] 'qcow2: Clear the L2 bitmap when allocating a compressed 
cluster'
027/34:[] [--] 'qcow2: Add subcluster support to handle_alloc_space()'
028/34:[] [--] 'qcow2: Add subcluster support to qcow2_co_pwrite_zeroes()'
029/34:[] [-C] 'qcow2: Add subcluster support to qcow2_measure()'
030/34:[down] 'qcow2: Add prealloc field to QCowL2Meta'
031/34:[0002] [FC] 'qcow2: Add the 'extended_l2' option and the 
QCOW2_INCOMPAT_EXTL2 bit'
032/34:[down] 'qcow2: Allow preallocation and backing files if extended_l2 is 
set'
033/34:[] [--] 'qcow2: Assert that expand_zero_clusters_in_l1() does not 
support subclusters'
034/34:[0669] [FC] 'iotests: Add tests for qcow2 images with extended L2 
entries'

Alberto Garcia (34):
  qcow2: Make Qcow2AioTask store the full host offset
  qcow2: Convert qcow2_get_cluster_offset() into qcow2_get_host_offset()
  qcow2: Add calculate_l2_meta()
  qcow2: Split cluster_needs_cow() out of count_cow_clusters()
  qcow2: Process QCOW2_CLUSTER_ZERO_ALLOC clusters in handle_copied()
  qcow2: Add get_l2_entry() and set_l2_entry()
  qcow2: Document the Extended L2 Entries feature
  qcow2: Add dummy has_subclusters() function
  qcow2: Add subcluster-related fields to BDRVQcow2State
  qcow2: Add offset_to_sc_index()
  qcow2: Add offset_into_subcluster() and size_to_subclusters()
  qcow2: Add l2_entry_size()
  qcow2: Update get/set_l2_entry() and add get/set_l2_bitmap()
  qcow2: Add QCow2SubclusterType and qcow2_get_subcluster_type()
  qcow2: Add qcow2_get_subcluster_range_type()
  qcow2: Add qcow2_cluster_is_allocated()
  qcow2: Add 

[PATCH v8 10/34] qcow2: Add offset_to_sc_index()

2020-06-10 Thread Alberto Garcia
For a given offset, return the subcluster number within its cluster
(i.e. with 32 subclusters per cluster it returns a number between 0
and 31).

Signed-off-by: Alberto Garcia 
Reviewed-by: Max Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/block/qcow2.h b/block/qcow2.h
index eee4c8de9c..2503374677 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -581,6 +581,11 @@ static inline int offset_to_l2_slice_index(BDRVQcow2State 
*s, int64_t offset)
 return (offset >> s->cluster_bits) & (s->l2_slice_size - 1);
 }
 
+static inline int offset_to_sc_index(BDRVQcow2State *s, int64_t offset)
+{
+return (offset >> s->subcluster_bits) & (s->subclusters_per_cluster - 1);
+}
+
 static inline int64_t qcow2_vm_state_offset(BDRVQcow2State *s)
 {
 return (int64_t)s->l1_vm_state_index << (s->cluster_bits + s->l2_bits);
-- 
2.20.1
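
To see what the new helper computes, here is a hedged standalone sketch of the same
arithmetic on plain integers (the 64 KB cluster size, the fixed 32 subclusters per
cluster and the sample offsets are assumptions for the example; this is a local
model, not the function from the patch, which reads the values from BDRVQcow2State):

    #include <stdint.h>
    #include <stdio.h>

    /* 32 subclusters per cluster means subcluster_bits = cluster_bits - 5 */
    static int offset_to_sc_index(int64_t offset, int cluster_bits)
    {
        int subcluster_bits = cluster_bits - 5;     /* log2(32) == 5 */
        int subclusters_per_cluster = 32;

        return (offset >> subcluster_bits) & (subclusters_per_cluster - 1);
    }

    int main(void)
    {
        int cluster_bits = 16;                      /* 64 KB clusters */
        int64_t offsets[] = {0, 2048, 4096, 65535, 65536 + 2048};

        for (unsigned i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++) {
            printf("offset %8lld -> subcluster %d\n",
                   (long long)offsets[i],
                   offset_to_sc_index(offsets[i], cluster_bits));
        }
        return 0;
    }
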




[PATCH v8 23/34] qcow2: Add subcluster support to discard_in_l2_slice()

2020-06-10 Thread Alberto Garcia
Two things need to be taken into account here:

1) With full_discard == true the L2 entry must be cleared completely.
   This also includes the L2 bitmap if the image has extended L2
   entries.

2) With full_discard == false we have to make the discarded cluster
   read back as zeroes. With normal L2 entries this is done with the
   QCOW_OFLAG_ZERO bit, whereas with extended L2 entries this is done
   with the individual 'all zeroes' bits for each subcluster.

   Note however that QCOW_OFLAG_ZERO is not supported in v2 qcow2
   images so, if there is a backing file, discard cannot guarantee
   that the image will read back as zeroes. If this is important for
   the caller it should forbid it as qcow2_co_pdiscard() does (see
   80f5c01183 for more details).

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 block/qcow2-cluster.c | 52 +++
 1 file changed, 23 insertions(+), 29 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 4e59bbd545..edfc8ea91c 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1847,11 +1847,17 @@ static int discard_in_l2_slice(BlockDriverState *bs, 
uint64_t offset,
 assert(nb_clusters <= INT_MAX);
 
 for (i = 0; i < nb_clusters; i++) {
-uint64_t old_l2_entry;
-
-old_l2_entry = get_l2_entry(s, l2_slice, l2_index + i);
+uint64_t old_l2_entry = get_l2_entry(s, l2_slice, l2_index + i);
+uint64_t old_l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index + i);
+uint64_t new_l2_entry = old_l2_entry;
+uint64_t new_l2_bitmap = old_l2_bitmap;
+QCow2ClusterType cluster_type =
+qcow2_get_cluster_type(bs, old_l2_entry);
 
 /*
+ * If full_discard is true, the cluster should not read back as zeroes,
+ * but rather fall through to the backing file.
+ *
  * If full_discard is false, make sure that a discarded area reads back
  * as zeroes for v3 images (we cannot do it for v2 without actually
  * writing a zero-filled buffer). We can skip the operation if the
@@ -1860,40 +1866,28 @@ static int discard_in_l2_slice(BlockDriverState *bs, 
uint64_t offset,
  *
  * TODO We might want to use bdrv_block_status(bs) here, but we're
  * holding s->lock, so that doesn't work today.
- *
- * If full_discard is true, the sector should not read back as zeroes,
- * but rather fall through to the backing file.
  */
-switch (qcow2_get_cluster_type(bs, old_l2_entry)) {
-case QCOW2_CLUSTER_UNALLOCATED:
-if (full_discard || !bs->backing) {
-continue;
+if (full_discard) {
+new_l2_entry = new_l2_bitmap = 0;
+} else if (bs->backing || qcow2_cluster_is_allocated(cluster_type)) {
+if (has_subclusters(s)) {
+new_l2_entry = 0;
+new_l2_bitmap = QCOW_L2_BITMAP_ALL_ZEROES;
+} else {
+new_l2_entry = s->qcow_version >= 3 ? QCOW_OFLAG_ZERO : 0;
 }
-break;
+}
 
-case QCOW2_CLUSTER_ZERO_PLAIN:
-if (!full_discard) {
-continue;
-}
-break;
-
-case QCOW2_CLUSTER_ZERO_ALLOC:
-case QCOW2_CLUSTER_NORMAL:
-case QCOW2_CLUSTER_COMPRESSED:
-break;
-
-default:
-abort();
+if (old_l2_entry == new_l2_entry && old_l2_bitmap == new_l2_bitmap) {
+continue;
 }
 
 /* First remove L2 entries */
 qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
-if (!full_discard && s->qcow_version >= 3) {
-set_l2_entry(s, l2_slice, l2_index + i, QCOW_OFLAG_ZERO);
-} else {
-set_l2_entry(s, l2_slice, l2_index + i, 0);
+set_l2_entry(s, l2_slice, l2_index + i, new_l2_entry);
+if (has_subclusters(s)) {
+set_l2_bitmap(s, l2_slice, l2_index + i, new_l2_bitmap);
 }
-
 /* Then decrease the refcount */
 qcow2_free_any_clusters(bs, old_l2_entry, 1, type);
 }
-- 
2.20.1
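
A condensed standalone model of the decision described above may help. It is a
sketch, not the QEMU code; OFLAG_ZERO and BITMAP_ALL_ZEROES are stand-in constants
for QCOW_OFLAG_ZERO and QCOW_L2_BITMAP_ALL_ZEROES, and the sample values are
invented:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define OFLAG_ZERO          (1ULL << 0)             /* stand-in */
    #define BITMAP_ALL_ZEROES   0xffffffff00000000ULL   /* all "reads as zero" bits */

    /*
     * @entry/@bitmap hold the old L2 entry and bitmap on input and are updated
     * to the values a discard request would write back.
     */
    static void discard_decision(bool full_discard, bool has_backing,
                                 bool allocated, bool extended_l2,
                                 int qcow_version,
                                 uint64_t *entry, uint64_t *bitmap)
    {
        if (full_discard) {
            /* Drop everything: reads fall through to the backing file */
            *entry = 0;
            *bitmap = 0;
        } else if (has_backing || allocated) {
            if (extended_l2) {
                /* Per-subcluster "reads as zero" bits replace QCOW_OFLAG_ZERO */
                *entry = 0;
                *bitmap = BITMAP_ALL_ZEROES;
            } else {
                /* v3 can mark the cluster as zero; v2 can only deallocate */
                *entry = qcow_version >= 3 ? OFLAG_ZERO : 0;
            }
        }
        /* Otherwise the entry is left untouched and the discard is a no-op */
    }

    int main(void)
    {
        uint64_t entry = 0x50000;   /* old host offset, made up */
        uint64_t bitmap = 0xf;      /* old bitmap: subclusters 0-3 allocated */

        discard_decision(false, true, false, true, 3, &entry, &bitmap);
        printf("new entry=%#llx new bitmap=%#llx\n",
               (unsigned long long)entry, (unsigned long long)bitmap);
        return 0;
    }
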




[PATCH v8 16/34] qcow2: Add qcow2_cluster_is_allocated()

2020-06-10 Thread Alberto Garcia
This helper function tells us if a cluster is allocated (that is,
there is an associated host offset for it).

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 block/qcow2.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/block/qcow2.h b/block/qcow2.h
index 3aec6f452a..ea647c8bb5 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -780,6 +780,12 @@ QCow2SubclusterType 
qcow2_get_subcluster_type(BlockDriverState *bs,
 }
 }
 
+static inline bool qcow2_cluster_is_allocated(QCow2ClusterType type)
+{
+return (type == QCOW2_CLUSTER_COMPRESSED || type == QCOW2_CLUSTER_NORMAL ||
+type == QCOW2_CLUSTER_ZERO_ALLOC);
+}
+
 /* Check whether refcounts are eager or lazy */
 static inline bool qcow2_need_accurate_refcounts(BDRVQcow2State *s)
 {
-- 
2.20.1




[PATCH v8 12/34] qcow2: Add l2_entry_size()

2020-06-10 Thread Alberto Garcia
qcow2 images with subclusters have 128-bit L2 entries. The first 64
bits contain the same information as traditional images and the last
64 bits form a bitmap with the status of each individual subcluster.

Because of that we cannot assume that L2 entries are sizeof(uint64_t)
anymore. This function returns the proper value for the image.

Signed-off-by: Alberto Garcia 
Reviewed-by: Max Reitz 
---
 block/qcow2.h  |  9 +
 block/qcow2-cluster.c  | 12 ++--
 block/qcow2-refcount.c | 14 --
 block/qcow2.c  |  8 
 4 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 4fe31adfd3..46b351229a 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -80,6 +80,10 @@
 
 #define QCOW_EXTL2_SUBCLUSTERS_PER_CLUSTER 32
 
+/* Size of normal and extended L2 entries */
+#define L2E_SIZE_NORMAL   (sizeof(uint64_t))
+#define L2E_SIZE_EXTENDED (sizeof(uint64_t) * 2)
+
 #define MIN_CLUSTER_BITS 9
 #define MAX_CLUSTER_BITS 21
 
@@ -521,6 +525,11 @@ static inline bool has_subclusters(BDRVQcow2State *s)
 return false;
 }
 
+static inline size_t l2_entry_size(BDRVQcow2State *s)
+{
+return has_subclusters(s) ? L2E_SIZE_EXTENDED : L2E_SIZE_NORMAL;
+}
+
 static inline uint64_t get_l2_entry(BDRVQcow2State *s, uint64_t *l2_slice,
 int idx)
 {
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 76fd0f3cdb..8b2fc550b7 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -208,7 +208,7 @@ static int l2_load(BlockDriverState *bs, uint64_t offset,
uint64_t l2_offset, uint64_t **l2_slice)
 {
 BDRVQcow2State *s = bs->opaque;
-int start_of_slice = sizeof(uint64_t) *
+int start_of_slice = l2_entry_size(s) *
 (offset_to_l2_index(s, offset) - offset_to_l2_slice_index(s, offset));
 
 return qcow2_cache_get(bs, s->l2_table_cache, l2_offset + start_of_slice,
@@ -281,7 +281,7 @@ static int l2_allocate(BlockDriverState *bs, int l1_index)
 
 /* allocate a new l2 entry */
 
-l2_offset = qcow2_alloc_clusters(bs, s->l2_size * sizeof(uint64_t));
+l2_offset = qcow2_alloc_clusters(bs, s->l2_size * l2_entry_size(s));
 if (l2_offset < 0) {
 ret = l2_offset;
 goto fail;
@@ -305,7 +305,7 @@ static int l2_allocate(BlockDriverState *bs, int l1_index)
 
 /* allocate a new entry in the l2 cache */
 
-slice_size2 = s->l2_slice_size * sizeof(uint64_t);
+slice_size2 = s->l2_slice_size * l2_entry_size(s);
 n_slices = s->cluster_size / slice_size2;
 
 trace_qcow2_l2_allocate_get_empty(bs, l1_index);
@@ -369,7 +369,7 @@ fail:
 }
 s->l1_table[l1_index] = old_l2_offset;
 if (l2_offset > 0) {
-qcow2_free_clusters(bs, l2_offset, s->l2_size * sizeof(uint64_t),
+qcow2_free_clusters(bs, l2_offset, s->l2_size * l2_entry_size(s),
 QCOW2_DISCARD_ALWAYS);
 }
 return ret;
@@ -716,7 +716,7 @@ static int get_cluster_table(BlockDriverState *bs, uint64_t 
offset,
 
 /* Then decrease the refcount of the old table */
 if (l2_offset) {
-qcow2_free_clusters(bs, l2_offset, s->l2_size * sizeof(uint64_t),
+qcow2_free_clusters(bs, l2_offset, s->l2_size * l2_entry_size(s),
 QCOW2_DISCARD_OTHER);
 }
 
@@ -1913,7 +1913,7 @@ static int expand_zero_clusters_in_l1(BlockDriverState 
*bs, uint64_t *l1_table,
 int ret;
 int i, j;
 
-slice_size2 = s->l2_slice_size * sizeof(uint64_t);
+slice_size2 = s->l2_slice_size * l2_entry_size(s);
 n_slices = s->cluster_size / slice_size2;
 
 if (!is_active_l1) {
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 04546838e8..770c5dbc83 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1254,7 +1254,7 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 l2_slice = NULL;
 l1_table = NULL;
 l1_size2 = l1_size * sizeof(uint64_t);
-slice_size2 = s->l2_slice_size * sizeof(uint64_t);
+slice_size2 = s->l2_slice_size * l2_entry_size(s);
 n_slices = s->cluster_size / slice_size2;
 
 s->cache_discards = true;
@@ -1605,7 +1605,7 @@ static int check_refcounts_l2(BlockDriverState *bs, 
BdrvCheckResult *res,
 int i, l2_size, nb_csectors, ret;
 
 /* Read L2 table from disk */
-l2_size = s->l2_size * sizeof(uint64_t);
+l2_size = s->l2_size * l2_entry_size(s);
 l2_table = g_malloc(l2_size);
 
 ret = bdrv_pread(bs->file, l2_offset, l2_table, l2_size);
@@ -1680,15 +1680,16 @@ static int check_refcounts_l2(BlockDriverState *bs, 
BdrvCheckResult *res,
 fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR",
 offset);
 if (fix & BDRV_FIX_ERRORS) {
+int idx = i * (l2_entry_size(s) / sizeof(uint64_t));
 uint64_t l2e_offset =
-l2_offset + 

[PATCH v8 08/34] qcow2: Add dummy has_subclusters() function

2020-06-10 Thread Alberto Garcia
This function will be used by the qcow2 code to check if an image has
subclusters or not.

At the moment this simply returns false. Once all patches needed for
subcluster support are ready then QEMU will be able to create and
read images with subclusters and this function will return the actual
value.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/block/qcow2.h b/block/qcow2.h
index eecbadc4cb..2064dd3d85 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -510,6 +510,12 @@ typedef enum QCow2MetadataOverlap {
 
 #define INV_OFFSET (-1ULL)
 
+static inline bool has_subclusters(BDRVQcow2State *s)
+{
+/* FIXME: Return false until this feature is complete */
+return false;
+}
+
 static inline uint64_t get_l2_entry(BDRVQcow2State *s, uint64_t *l2_slice,
 int idx)
 {
-- 
2.20.1




[PATCH v8 15/34] qcow2: Add qcow2_get_subcluster_range_type()

2020-06-10 Thread Alberto Garcia
There are situations in which we want to know how many contiguous
subclusters of the same type there are in a given cluster. This can be
done by simply iterating over the subclusters and repeatedly calling
qcow2_get_subcluster_type() for each one of them.

However once we determined the type of a subcluster we can check the
rest efficiently by counting the number of adjacent ones (or zeroes)
in the bitmap. This is what this function does.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 block/qcow2-cluster.c | 51 +++
 1 file changed, 51 insertions(+)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 8b2fc550b7..32dc6e75e3 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -375,6 +375,57 @@ fail:
 return ret;
 }
 
+/*
+ * For a given L2 entry, count the number of contiguous subclusters of
+ * the same type starting from @sc_from. Compressed clusters are
+ * treated as if they were divided into subclusters of size
+ * s->subcluster_size.
+ *
+ * Return the number of contiguous subclusters and set @type to the
+ * subcluster type.
+ *
+ * If the L2 entry is invalid return -errno and set @type to
+ * QCOW2_SUBCLUSTER_INVALID.
+ */
+G_GNUC_UNUSED
+static int qcow2_get_subcluster_range_type(BlockDriverState *bs,
+   uint64_t l2_entry,
+   uint64_t l2_bitmap,
+   unsigned sc_from,
+   QCow2SubclusterType *type)
+{
+BDRVQcow2State *s = bs->opaque;
+uint32_t val;
+
+*type = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, sc_from);
+
+if (*type == QCOW2_SUBCLUSTER_INVALID) {
+return -EINVAL;
+} else if (!has_subclusters(s) || *type == QCOW2_SUBCLUSTER_COMPRESSED) {
+return s->subclusters_per_cluster - sc_from;
+}
+
+switch (*type) {
+case QCOW2_SUBCLUSTER_NORMAL:
+val = l2_bitmap | QCOW_OFLAG_SUB_ALLOC_RANGE(0, sc_from);
+return cto32(val) - sc_from;
+
+case QCOW2_SUBCLUSTER_ZERO_PLAIN:
+case QCOW2_SUBCLUSTER_ZERO_ALLOC:
+val = (l2_bitmap | QCOW_OFLAG_SUB_ZERO_RANGE(0, sc_from)) >> 32;
+return cto32(val) - sc_from;
+
+case QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN:
+case QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC:
+val = ((l2_bitmap >> 32) | l2_bitmap)
+& ~QCOW_OFLAG_SUB_ALLOC_RANGE(0, sc_from);
+return ctz32(val) - sc_from;
+
+default:
+g_assert_not_reached();
+}
+}
+
 /*
  * Checks how many clusters in a given L2 slice are contiguous in the image
  * file. As soon as one of the flags in the bitmask stop_flags changes compared
-- 
2.20.1
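
The bit-counting idea above can be shown with a standalone sketch. QEMU's
cto32()/ctz32() helpers are replaced by a local helper built on the GCC/Clang
__builtin_ctz, the example bitmap is made up, and only the "allocated" case is
modelled:

    #include <stdint.h>
    #include <stdio.h>

    /* Count trailing ones: index of the first clear bit */
    static unsigned cto32(uint32_t val)
    {
        return val == UINT32_MAX ? 32 : (unsigned)__builtin_ctz(~val);
    }

    /*
     * Contiguous allocated subclusters starting at @sc_from, given the low
     * 32 bits of the L2 bitmap (the allocation bits).
     */
    static unsigned contiguous_allocated(uint32_t alloc_bits, unsigned sc_from)
    {
        /* Pretend everything below sc_from is set so it cannot end the run */
        uint32_t below = sc_from ? (1u << sc_from) - 1 : 0;

        return cto32(alloc_bits | below) - sc_from;
    }

    int main(void)
    {
        uint32_t alloc_bits = 0x3fc;    /* subclusters 2..9 allocated */

        printf("run from sc 2: %u subclusters\n",
               contiguous_allocated(alloc_bits, 2));
        printf("run from sc 5: %u subclusters\n",
               contiguous_allocated(alloc_bits, 5));
        return 0;
    }
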




[PATCH v8 04/34] qcow2: Split cluster_needs_cow() out of count_cow_clusters()

2020-06-10 Thread Alberto Garcia
We are going to need it in other places.

Signed-off-by: Alberto Garcia 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Max Reitz 
---
 block/qcow2-cluster.c | 34 +++---
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 61ad638bdc..80f9787461 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1087,6 +1087,24 @@ static void calculate_l2_meta(BlockDriverState *bs,
 QLIST_INSERT_HEAD(&s->cluster_allocs, *m, next_in_flight);
 }
 
+/* Returns true if writing to a cluster requires COW */
+static bool cluster_needs_cow(BlockDriverState *bs, uint64_t l2_entry)
+{
+switch (qcow2_get_cluster_type(bs, l2_entry)) {
+case QCOW2_CLUSTER_NORMAL:
+if (l2_entry & QCOW_OFLAG_COPIED) {
+return false;
+}
+case QCOW2_CLUSTER_UNALLOCATED:
+case QCOW2_CLUSTER_COMPRESSED:
+case QCOW2_CLUSTER_ZERO_PLAIN:
+case QCOW2_CLUSTER_ZERO_ALLOC:
+return true;
+default:
+abort();
+}
+}
+
 /*
  * Returns the number of contiguous clusters that can be used for an allocating
  * write, but require COW to be performed (this includes yet unallocated space,
@@ -1099,25 +1117,11 @@ static int count_cow_clusters(BlockDriverState *bs, int 
nb_clusters,
 
 for (i = 0; i < nb_clusters; i++) {
 uint64_t l2_entry = be64_to_cpu(l2_slice[l2_index + i]);
-QCow2ClusterType cluster_type = qcow2_get_cluster_type(bs, l2_entry);
-
-switch(cluster_type) {
-case QCOW2_CLUSTER_NORMAL:
-if (l2_entry & QCOW_OFLAG_COPIED) {
-goto out;
-}
+if (!cluster_needs_cow(bs, l2_entry)) {
 break;
-case QCOW2_CLUSTER_UNALLOCATED:
-case QCOW2_CLUSTER_COMPRESSED:
-case QCOW2_CLUSTER_ZERO_PLAIN:
-case QCOW2_CLUSTER_ZERO_ALLOC:
-break;
-default:
-abort();
 }
 }
 
-out:
 assert(i <= nb_clusters);
 return i;
 }
-- 
2.20.1




[PATCH v8 01/34] qcow2: Make Qcow2AioTask store the full host offset

2020-06-10 Thread Alberto Garcia
The file_cluster_offset field of Qcow2AioTask stores a cluster-aligned
host offset. In practice this is not very useful because all users(*)
of this structure need the final host offset into the cluster, which
they calculate using

   host_offset = file_cluster_offset + offset_into_cluster(s, offset)

There is no reason why Qcow2AioTask cannot store host_offset directly
and that is what this patch does.

(*) compressed clusters are the exception: in this case what
file_cluster_offset was storing was the full compressed cluster
descriptor (offset + size). This does not change with this patch
but it is documented now.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2.c  | 69 ++
 block/trace-events |  2 +-
 2 files changed, 34 insertions(+), 37 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index e20590c3b7..d792137af6 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -74,7 +74,7 @@ typedef struct {
 
 static int coroutine_fn
 qcow2_co_preadv_compressed(BlockDriverState *bs,
-   uint64_t file_cluster_offset,
+   uint64_t cluster_descriptor,
uint64_t offset,
uint64_t bytes,
QEMUIOVector *qiov,
@@ -2103,7 +2103,7 @@ out:
 
 static coroutine_fn int
 qcow2_co_preadv_encrypted(BlockDriverState *bs,
-   uint64_t file_cluster_offset,
+   uint64_t host_offset,
uint64_t offset,
uint64_t bytes,
QEMUIOVector *qiov,
@@ -2130,16 +2130,12 @@ qcow2_co_preadv_encrypted(BlockDriverState *bs,
 }
 
 BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
-ret = bdrv_co_pread(s->data_file,
-file_cluster_offset + offset_into_cluster(s, offset),
-bytes, buf, 0);
+ret = bdrv_co_pread(s->data_file, host_offset, bytes, buf, 0);
 if (ret < 0) {
 goto fail;
 }
 
-if (qcow2_co_decrypt(bs,
- file_cluster_offset + offset_into_cluster(s, offset),
- offset, buf, bytes) < 0)
+if (qcow2_co_decrypt(bs, host_offset, offset, buf, bytes) < 0)
 {
 ret = -EIO;
 goto fail;
@@ -2157,7 +2153,7 @@ typedef struct Qcow2AioTask {
 
 BlockDriverState *bs;
 QCow2ClusterType cluster_type; /* only for read */
-uint64_t file_cluster_offset;
+uint64_t host_offset; /* or full descriptor in compressed clusters */
 uint64_t offset;
 uint64_t bytes;
 QEMUIOVector *qiov;
@@ -2170,7 +2166,7 @@ static coroutine_fn int qcow2_add_task(BlockDriverState 
*bs,
AioTaskPool *pool,
AioTaskFunc func,
QCow2ClusterType cluster_type,
-   uint64_t file_cluster_offset,
+   uint64_t host_offset,
uint64_t offset,
uint64_t bytes,
QEMUIOVector *qiov,
@@ -2185,7 +2181,7 @@ static coroutine_fn int qcow2_add_task(BlockDriverState 
*bs,
 .bs = bs,
 .cluster_type = cluster_type,
 .qiov = qiov,
-.file_cluster_offset = file_cluster_offset,
+.host_offset = host_offset,
 .offset = offset,
 .bytes = bytes,
 .qiov_offset = qiov_offset,
@@ -2194,7 +2190,7 @@ static coroutine_fn int qcow2_add_task(BlockDriverState 
*bs,
 
 trace_qcow2_add_task(qemu_coroutine_self(), bs, pool,
  func == qcow2_co_preadv_task_entry ? "read" : "write",
- cluster_type, file_cluster_offset, offset, bytes,
+ cluster_type, host_offset, offset, bytes,
  qiov, qiov_offset);
 
 if (!pool) {
@@ -2208,13 +2204,12 @@ static coroutine_fn int qcow2_add_task(BlockDriverState 
*bs,
 
 static coroutine_fn int qcow2_co_preadv_task(BlockDriverState *bs,
  QCow2ClusterType cluster_type,
- uint64_t file_cluster_offset,
+ uint64_t host_offset,
  uint64_t offset, uint64_t bytes,
  QEMUIOVector *qiov,
  size_t qiov_offset)
 {
 BDRVQcow2State *s = bs->opaque;
-int offset_in_cluster = offset_into_cluster(s, offset);
 
 switch (cluster_type) {
 case QCOW2_CLUSTER_ZERO_PLAIN:
@@ -2230,19 +2225,17 @@ static coroutine_fn int 
qcow2_co_preadv_task(BlockDriverState *bs,
qiov, 

[PATCH v8 03/34] qcow2: Add calculate_l2_meta()

2020-06-10 Thread Alberto Garcia
handle_alloc() creates a QCowL2Meta structure in order to update the
image metadata and perform the necessary copy-on-write operations.

This patch moves that code to a separate function so it can be used
from other places.

Signed-off-by: Alberto Garcia 
Reviewed-by: Max Reitz 
---
 block/qcow2-cluster.c | 77 +--
 1 file changed, 53 insertions(+), 24 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 9ab41cb728..61ad638bdc 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1037,6 +1037,56 @@ void qcow2_alloc_cluster_abort(BlockDriverState *bs, 
QCowL2Meta *m)
 }
 }
 
+/*
+ * For a given write request, create a new QCowL2Meta structure, add
+ * it to @m and the BDRVQcow2State.cluster_allocs list.
+ *
+ * @host_cluster_offset points to the beginning of the first cluster.
+ *
+ * @guest_offset and @bytes indicate the offset and length of the
+ * request.
+ *
+ * If @keep_old is true it means that the clusters were already
+ * allocated and will be overwritten. If false then the clusters are
+ * new and we have to decrease the reference count of the old ones.
+ */
+static void calculate_l2_meta(BlockDriverState *bs,
+  uint64_t host_cluster_offset,
+  uint64_t guest_offset, unsigned bytes,
+  QCowL2Meta **m, bool keep_old)
+{
+BDRVQcow2State *s = bs->opaque;
+unsigned cow_start_from = 0;
+unsigned cow_start_to = offset_into_cluster(s, guest_offset);
+unsigned cow_end_from = cow_start_to + bytes;
+unsigned cow_end_to = ROUND_UP(cow_end_from, s->cluster_size);
+unsigned nb_clusters = size_to_clusters(s, cow_end_from);
+QCowL2Meta *old_m = *m;
+
+*m = g_malloc0(sizeof(**m));
+**m = (QCowL2Meta) {
+.next   = old_m,
+
+.alloc_offset   = host_cluster_offset,
+.offset = start_of_cluster(s, guest_offset),
+.nb_clusters= nb_clusters,
+
+.keep_old_clusters = keep_old,
+
+.cow_start = {
+.offset = cow_start_from,
+.nb_bytes   = cow_start_to - cow_start_from,
+},
+.cow_end = {
+.offset = cow_end_from,
+.nb_bytes   = cow_end_to - cow_end_from,
+},
+};
+
+qemu_co_queue_init(&(*m)->dependent_requests);
+QLIST_INSERT_HEAD(&s->cluster_allocs, *m, next_in_flight);
+}
+
 /*
  * Returns the number of contiguous clusters that can be used for an allocating
  * write, but require COW to be performed (this includes yet unallocated space,
@@ -1435,35 +1485,14 @@ static int handle_alloc(BlockDriverState *bs, uint64_t 
guest_offset,
 uint64_t requested_bytes = *bytes + offset_into_cluster(s, guest_offset);
 int avail_bytes = nb_clusters << s->cluster_bits;
 int nb_bytes = MIN(requested_bytes, avail_bytes);
-QCowL2Meta *old_m = *m;
-
-*m = g_malloc0(sizeof(**m));
-
-**m = (QCowL2Meta) {
-.next   = old_m,
-
-.alloc_offset   = alloc_cluster_offset,
-.offset = start_of_cluster(s, guest_offset),
-.nb_clusters= nb_clusters,
-
-.keep_old_clusters  = keep_old_clusters,
-
-.cow_start = {
-.offset = 0,
-.nb_bytes   = offset_into_cluster(s, guest_offset),
-},
-.cow_end = {
-.offset = nb_bytes,
-.nb_bytes   = avail_bytes - nb_bytes,
-},
-};
-qemu_co_queue_init(&(*m)->dependent_requests);
-QLIST_INSERT_HEAD(&s->cluster_allocs, *m, next_in_flight);
 
 *host_offset = alloc_cluster_offset + offset_into_cluster(s, guest_offset);
 *bytes = MIN(*bytes, nb_bytes - offset_into_cluster(s, guest_offset));
 assert(*bytes != 0);
 
+calculate_l2_meta(bs, alloc_cluster_offset, guest_offset, *bytes,
+  m, keep_old_clusters);
+
 return 1;
 
 fail:
-- 
2.20.1
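
For a worked example of the COW regions computed above, here is a hedged standalone
sketch with made-up numbers (a 4 KB write starting 2 KB into a 64 KB cluster).
round_up() is a local re-implementation, and the offsets are relative to the start
of the first touched cluster, as in QCowL2Meta:

    #include <stdint.h>
    #include <stdio.h>

    #define CLUSTER_SIZE 65536ULL

    static uint64_t round_up(uint64_t x, uint64_t align)
    {
        return (x + align - 1) / align * align;
    }

    int main(void)
    {
        /* A 4 KB write starting 2 KB into a cluster */
        uint64_t guest_offset = 3 * CLUSTER_SIZE + 2048;
        uint64_t bytes = 4096;

        uint64_t offset_into_cluster = guest_offset % CLUSTER_SIZE;

        /* Head COW region: from the cluster start up to the write */
        uint64_t cow_start_from = 0;
        uint64_t cow_start_to = offset_into_cluster;

        /* Tail COW region: from the end of the write to the next cluster boundary */
        uint64_t cow_end_from = cow_start_to + bytes;
        uint64_t cow_end_to = round_up(cow_end_from, CLUSTER_SIZE);

        uint64_t nb_clusters = cow_end_to / CLUSTER_SIZE;

        printf("head COW: [%llu, %llu)\n",
               (unsigned long long)cow_start_from,
               (unsigned long long)cow_start_to);
        printf("tail COW: [%llu, %llu)\n",
               (unsigned long long)cow_end_from,
               (unsigned long long)cow_end_to);
        printf("clusters touched: %llu\n", (unsigned long long)nb_clusters);
        return 0;
    }
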




[PATCH v8 22/34] qcow2: Add subcluster support to zero_in_l2_slice()

2020-06-10 Thread Alberto Garcia
The QCOW_OFLAG_ZERO bit that indicates that a cluster reads as
zeroes is only used in standard L2 entries. Extended L2 entries use
individual 'all zeroes' bits for each subcluster.

This must be taken into account when updating the L2 entry and also
when deciding that an existing entry does not need to be updated.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 block/qcow2-cluster.c | 36 +++-
 1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 2f3bd3a882..4e59bbd545 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1956,7 +1956,6 @@ static int zero_in_l2_slice(BlockDriverState *bs, 
uint64_t offset,
 int l2_index;
 int ret;
 int i;
-bool unmap = !!(flags & BDRV_REQ_MAY_UNMAP);
 
 ret = get_cluster_table(bs, offset, &l2_slice, &l2_index);
 if (ret < 0) {
@@ -1968,28 +1967,31 @@ static int zero_in_l2_slice(BlockDriverState *bs, 
uint64_t offset,
 assert(nb_clusters <= INT_MAX);
 
 for (i = 0; i < nb_clusters; i++) {
-uint64_t old_offset;
-QCow2ClusterType cluster_type;
+uint64_t old_l2_entry = get_l2_entry(s, l2_slice, l2_index + i);
+uint64_t old_l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index + i);
+QCow2ClusterType type = qcow2_get_cluster_type(bs, old_l2_entry);
+bool unmap = (type == QCOW2_CLUSTER_COMPRESSED) ||
+((flags & BDRV_REQ_MAY_UNMAP) && qcow2_cluster_is_allocated(type));
+uint64_t new_l2_entry = unmap ? 0 : old_l2_entry;
+uint64_t new_l2_bitmap = old_l2_bitmap;
 
-old_offset = get_l2_entry(s, l2_slice, l2_index + i);
+if (has_subclusters(s)) {
+new_l2_bitmap = QCOW_L2_BITMAP_ALL_ZEROES;
+} else {
+new_l2_entry |= QCOW_OFLAG_ZERO;
+}
 
-/*
- * Minimize L2 changes if the cluster already reads back as
- * zeroes with correct allocation.
- */
-cluster_type = qcow2_get_cluster_type(bs, old_offset);
-if (cluster_type == QCOW2_CLUSTER_ZERO_PLAIN ||
-(cluster_type == QCOW2_CLUSTER_ZERO_ALLOC && !unmap)) {
+if (old_l2_entry == new_l2_entry && old_l2_bitmap == new_l2_bitmap) {
 continue;
 }
 
 qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
-if (cluster_type == QCOW2_CLUSTER_COMPRESSED || unmap) {
-set_l2_entry(s, l2_slice, l2_index + i, QCOW_OFLAG_ZERO);
-qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
-} else {
-uint64_t entry = get_l2_entry(s, l2_slice, l2_index + i);
-set_l2_entry(s, l2_slice, l2_index + i, entry | QCOW_OFLAG_ZERO);
+if (unmap) {
+qcow2_free_any_clusters(bs, old_l2_entry, 1, 
QCOW2_DISCARD_REQUEST);
+}
+set_l2_entry(s, l2_slice, l2_index + i, new_l2_entry);
+if (has_subclusters(s)) {
+set_l2_bitmap(s, l2_slice, l2_index + i, new_l2_bitmap);
 }
 }
 
-- 
2.20.1
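
A minimal standalone sketch of the write-zeroes decision described above (again not
the QEMU code; OFLAG_ZERO and BITMAP_ALL_ZEROES are stand-ins and the sample values
are invented):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define OFLAG_ZERO        (1ULL << 0)             /* stand-in */
    #define BITMAP_ALL_ZEROES 0xffffffff00000000ULL   /* stand-in */

    /* New (entry, bitmap) pair written by a write-zeroes request on one cluster */
    static void zero_decision(uint64_t old_entry, uint64_t old_bitmap,
                              bool compressed, bool allocated, bool may_unmap,
                              bool extended_l2,
                              uint64_t *entry, uint64_t *bitmap)
    {
        /* Compressed clusters are always freed; normal ones only on MAY_UNMAP */
        bool unmap = compressed || (may_unmap && allocated);

        *entry = unmap ? 0 : old_entry;
        *bitmap = old_bitmap;

        if (extended_l2) {
            *bitmap = BITMAP_ALL_ZEROES;     /* per-subcluster zero bits */
        } else {
            *entry |= OFLAG_ZERO;            /* single cluster-wide zero flag */
        }
    }

    int main(void)
    {
        uint64_t e, b;

        zero_decision(0x50000, 0xf, false, true, false, true, &e, &b);
        printf("entry=%#llx bitmap=%#llx\n",
               (unsigned long long)e, (unsigned long long)b);
        return 0;
    }
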




[PATCH v8 06/34] qcow2: Add get_l2_entry() and set_l2_entry()

2020-06-10 Thread Alberto Garcia
The size of an L2 entry is 64 bits, but if we want to have subclusters
we need extended L2 entries. This means that we have to access L2
tables and slices differently depending on whether an image has
extended L2 entries or not.

This patch replaces all l2_slice[] accesses with calls to
get_l2_entry() and set_l2_entry().

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2.h  | 12 
 block/qcow2-cluster.c  | 63 ++
 block/qcow2-refcount.c | 17 ++--
 3 files changed, 54 insertions(+), 38 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 06475e0849..eecbadc4cb 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -510,6 +510,18 @@ typedef enum QCow2MetadataOverlap {
 
 #define INV_OFFSET (-1ULL)
 
+static inline uint64_t get_l2_entry(BDRVQcow2State *s, uint64_t *l2_slice,
+int idx)
+{
+return be64_to_cpu(l2_slice[idx]);
+}
+
+static inline void set_l2_entry(BDRVQcow2State *s, uint64_t *l2_slice,
+int idx, uint64_t entry)
+{
+l2_slice[idx] = cpu_to_be64(entry);
+}
+
 static inline bool has_data_file(BlockDriverState *bs)
 {
 BDRVQcow2State *s = bs->opaque;
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index fce0be7a08..76fd0f3cdb 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -383,12 +383,13 @@ fail:
  * cluster which may require a different handling)
  */
 static int count_contiguous_clusters(BlockDriverState *bs, int nb_clusters,
-int cluster_size, uint64_t *l2_slice, uint64_t stop_flags)
+int cluster_size, uint64_t *l2_slice, int l2_index, uint64_t 
stop_flags)
 {
+BDRVQcow2State *s = bs->opaque;
 int i;
 QCow2ClusterType first_cluster_type;
 uint64_t mask = stop_flags | L2E_OFFSET_MASK | QCOW_OFLAG_COMPRESSED;
-uint64_t first_entry = be64_to_cpu(l2_slice[0]);
+uint64_t first_entry = get_l2_entry(s, l2_slice, l2_index);
 uint64_t offset = first_entry & mask;
 
 first_cluster_type = qcow2_get_cluster_type(bs, first_entry);
@@ -401,7 +402,7 @@ static int count_contiguous_clusters(BlockDriverState *bs, 
int nb_clusters,
first_cluster_type == QCOW2_CLUSTER_ZERO_ALLOC);
 
 for (i = 0; i < nb_clusters; i++) {
-uint64_t l2_entry = be64_to_cpu(l2_slice[i]) & mask;
+uint64_t l2_entry = get_l2_entry(s, l2_slice, l2_index + i) & mask;
 if (offset + (uint64_t) i * cluster_size != l2_entry) {
 break;
 }
@@ -417,14 +418,16 @@ static int count_contiguous_clusters(BlockDriverState 
*bs, int nb_clusters,
 static int count_contiguous_clusters_unallocated(BlockDriverState *bs,
  int nb_clusters,
  uint64_t *l2_slice,
+ int l2_index,
  QCow2ClusterType wanted_type)
 {
+BDRVQcow2State *s = bs->opaque;
 int i;
 
 assert(wanted_type == QCOW2_CLUSTER_ZERO_PLAIN ||
wanted_type == QCOW2_CLUSTER_UNALLOCATED);
 for (i = 0; i < nb_clusters; i++) {
-uint64_t entry = be64_to_cpu(l2_slice[i]);
+uint64_t entry = get_l2_entry(s, l2_slice, l2_index + i);
 QCow2ClusterType type = qcow2_get_cluster_type(bs, entry);
 
 if (type != wanted_type) {
@@ -573,7 +576,7 @@ int qcow2_get_host_offset(BlockDriverState *bs, uint64_t 
offset,
 /* find the cluster offset for the given disk offset */
 
 l2_index = offset_to_l2_slice_index(s, offset);
-l2_entry = be64_to_cpu(l2_slice[l2_index]);
+l2_entry = get_l2_entry(s, l2_slice, l2_index);
 
 nb_clusters = size_to_clusters(s, bytes_needed);
 /* bytes_needed <= *bytes + offset_in_cluster, both of which are unsigned
@@ -608,7 +611,7 @@ int qcow2_get_host_offset(BlockDriverState *bs, uint64_t 
offset,
 case QCOW2_CLUSTER_UNALLOCATED:
 /* how many empty clusters ? */
 c = count_contiguous_clusters_unallocated(bs, nb_clusters,
-  &l2_slice[l2_index], type);
+  l2_slice, l2_index, type);
 *host_offset = 0;
 break;
 case QCOW2_CLUSTER_ZERO_ALLOC:
@@ -617,7 +620,7 @@ int qcow2_get_host_offset(BlockDriverState *bs, uint64_t 
offset,
 *host_offset = host_cluster_offset + offset_in_cluster;
 /* how many allocated clusters ? */
 c = count_contiguous_clusters(bs, nb_clusters, s->cluster_size,
-  &l2_slice[l2_index], QCOW_OFLAG_ZERO);
+  l2_slice, l2_index, QCOW_OFLAG_ZERO);
 if (offset_into_cluster(s, host_cluster_offset)) {
 qcow2_signal_corruption(bs, true, -1, -1,
 "Cluster allocation offset %#"

[PATCH v8 11/34] qcow2: Add offset_into_subcluster() and size_to_subclusters()

2020-06-10 Thread Alberto Garcia
Like offset_into_cluster() and size_to_clusters(), but for
subclusters.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 block/qcow2.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/block/qcow2.h b/block/qcow2.h
index 2503374677..4fe31adfd3 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -555,11 +555,21 @@ static inline int64_t offset_into_cluster(BDRVQcow2State 
*s, int64_t offset)
 return offset & (s->cluster_size - 1);
 }
 
+static inline int64_t offset_into_subcluster(BDRVQcow2State *s, int64_t offset)
+{
+return offset & (s->subcluster_size - 1);
+}
+
 static inline uint64_t size_to_clusters(BDRVQcow2State *s, uint64_t size)
 {
 return (size + (s->cluster_size - 1)) >> s->cluster_bits;
 }
 
+static inline uint64_t size_to_subclusters(BDRVQcow2State *s, uint64_t size)
+{
+return (size + (s->subcluster_size - 1)) >> s->subcluster_bits;
+}
+
 static inline int64_t size_to_l1(BDRVQcow2State *s, int64_t size)
 {
 int shift = s->cluster_bits + s->l2_bits;
-- 
2.20.1
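
For illustration, a standalone sketch of the round-up arithmetic that
size_to_subclusters() performs, assuming 64 KB clusters and therefore 2 KB
subclusters (the sample sizes are invented):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* 64 KB clusters, 32 subclusters => 2 KB subclusters (bits: 16 and 11) */
        int subcluster_bits = 11;
        uint64_t subcluster_size = 1ULL << subcluster_bits;

        uint64_t sizes[] = {1, 2048, 2049, 6144, 65536};

        for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
            /* Same round-up formula as size_to_subclusters() */
            uint64_t n = (sizes[i] + subcluster_size - 1) >> subcluster_bits;
            printf("%6llu bytes -> %llu subcluster(s)\n",
                   (unsigned long long)sizes[i], (unsigned long long)n);
        }
        return 0;
    }
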




[PATCH v8 13/34] qcow2: Update get/set_l2_entry() and add get/set_l2_bitmap()

2020-06-10 Thread Alberto Garcia
Extended L2 entries are 128-bit wide: 64 bits for the entry itself and
64 bits for the subcluster allocation bitmap.

In order to support them correctly get/set_l2_entry() need to be
updated so they take the entry width into account in order to
calculate the correct offset.

This patch also adds the get/set_l2_bitmap() functions that are
used to access the bitmaps. For convenience we allow calling
get_l2_bitmap() on images without subclusters. In this case the
returned value is always 0 and has no meaning.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 block/qcow2.h | 21 +
 1 file changed, 21 insertions(+)

diff --git a/block/qcow2.h b/block/qcow2.h
index 46b351229a..82b86f6cec 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -533,15 +533,36 @@ static inline size_t l2_entry_size(BDRVQcow2State *s)
 static inline uint64_t get_l2_entry(BDRVQcow2State *s, uint64_t *l2_slice,
 int idx)
 {
+idx *= l2_entry_size(s) / sizeof(uint64_t);
 return be64_to_cpu(l2_slice[idx]);
 }
 
+static inline uint64_t get_l2_bitmap(BDRVQcow2State *s, uint64_t *l2_slice,
+ int idx)
+{
+if (has_subclusters(s)) {
+idx *= l2_entry_size(s) / sizeof(uint64_t);
+return be64_to_cpu(l2_slice[idx + 1]);
+} else {
+return 0; /* For convenience only; this value has no meaning. */
+}
+}
+
 static inline void set_l2_entry(BDRVQcow2State *s, uint64_t *l2_slice,
 int idx, uint64_t entry)
 {
+idx *= l2_entry_size(s) / sizeof(uint64_t);
 l2_slice[idx] = cpu_to_be64(entry);
 }
 
+static inline void set_l2_bitmap(BDRVQcow2State *s, uint64_t *l2_slice,
+ int idx, uint64_t bitmap)
+{
+assert(has_subclusters(s));
+idx *= l2_entry_size(s) / sizeof(uint64_t);
+l2_slice[idx + 1] = cpu_to_be64(bitmap);
+}
+
 static inline bool has_data_file(BlockDriverState *bs)
 {
 BDRVQcow2State *s = bs->opaque;
-- 
2.20.1
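
The indexing that these accessors perform can be sketched standalone. This is a
host-endian model only: the real helpers additionally convert entries with
be64_to_cpu()/cpu_to_be64(), which is omitted here so the example stays
self-contained, and the sample slice contents are invented:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t get_entry(const uint64_t *l2_slice, int idx, bool extended_l2)
    {
        int width = extended_l2 ? 2 : 1;   /* uint64_t words per L2 entry */
        return l2_slice[idx * width];
    }

    static uint64_t get_bitmap(const uint64_t *l2_slice, int idx, bool extended_l2)
    {
        return extended_l2 ? l2_slice[idx * 2 + 1] : 0;
    }

    int main(void)
    {
        /* Two extended entries laid out as (entry, bitmap), (entry, bitmap) */
        uint64_t slice[] = {0x10000, 0xf, 0x20000, 0xffffffff00000000ULL};

        for (int idx = 0; idx < 2; idx++) {
            printf("entry %d: l2=%#llx bitmap=%#llx\n", idx,
                   (unsigned long long)get_entry(slice, idx, true),
                   (unsigned long long)get_bitmap(slice, idx, true));
        }
        return 0;
    }
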




Re: [PATCH v7 0/9] acpi: i386 tweaks

2020-06-10 Thread Michael S. Tsirkin
On Wed, Jun 10, 2020 at 01:40:02PM +0200, Igor Mammedov wrote:
> On Wed, 10 Jun 2020 11:41:22 +0200
> Gerd Hoffmann  wrote:
> 
> > First batch of microvm patches, some generic acpi stuff.
> > Split the acpi-build.c monster, specifically split the
> > pc and q35 and pci bits into a separate file which we
> > can skip building at some point in the future.
> > 
> It looks like the series is missing a patch to whitelist the changed ACPI
> tables in bios-table-test.

Right. Does it pass make check?

> Do we already have a test case for microvm in bios-table-test?
> If not, it's probably time to add one.

Separately :)

> > v2 changes: leave acpi-build.c largely as-is, move useful
> > bits to other places to allow them being reused, specifically:
> > 
> >  * move isa device generator functions to individual isa devices.
> >  * move fw_cfg generator function to fw_cfg.c
> > 
> > v3 changes: fix rtc, support multiple lpt devices.
> > 
> > v4 changes:
> >  * drop merged patches.
> >  * split rtc crs change to separata patch.
> >  * added two cleanup patches.
> >  * picked up ack & review tags.
> > 
> > v5 changes:
> >  * add comment for rtc crs update.
> >  * add even more cleanup patches.
> >  * picked up ack & review tags.
> > 
> > v6 changes:
> >  * floppy: move cmos_get_fd_drive_type.
> >  * picked up ack & review tags.
> > 
> > v7 changes:
> >  * rebased to mst/pci branch, resolved stubs conflict.
> >  * dropped patches already queued up in mst/pci.
> >  * added missing sign-off.
> >  * picked up ack & review tags.
> > 
> > take care,
> >   Gerd
> > 
> > Gerd Hoffmann (9):
> >   acpi: move aml builder code for floppy device
> >   floppy: make isa_fdc_get_drive_max_chs static
> >   floppy: move cmos_get_fd_drive_type() from pc
> >   acpi: move aml builder code for i8042 (kbd+mouse) device
> >   acpi: factor out fw_cfg_add_acpi_dsdt()
> >   acpi: simplify build_isa_devices_aml()
> >   acpi: drop serial/parallel enable bits from dsdt
> >   acpi: drop build_piix4_pm()
> >   acpi: q35: drop _SB.PCI0.ISA.LPCD opregion.
> > 
> >  hw/i386/fw_cfg.h   |   1 +
> >  include/hw/block/fdc.h |   3 +-
> >  include/hw/i386/pc.h   |   1 -
> >  hw/block/fdc.c | 111 +-
> >  hw/i386/acpi-build.c   | 211 ++---
> >  hw/i386/fw_cfg.c   |  28 ++
> >  hw/i386/pc.c   |  25 -
> >  hw/input/pckbd.c   |  31 ++
> >  stubs/cmos.c   |   7 ++
> >  stubs/Makefile.objs|   1 +
> >  10 files changed, 184 insertions(+), 235 deletions(-)
> >  create mode 100644 stubs/cmos.c
> > 




[PATCH 0/2] qcow2: seriously improve savevm performance

2020-06-10 Thread Denis V. Lunev
This series do standard basic things:
- it creates intermediate buffer for all writes from QEMU migration code
  to QCOW2 image,
- this buffer is sent to disk asynchronously, allowing several writes to
  run in parallel.

In general, migration code is fantastically inefficient (by observation):
buffers are not aligned and sent in arbitrary pieces, a lot of the time
less than 100 bytes at a chunk, which results in read-modify-write
operations with non-cached operations. It should also be noted that all
operations are performed into unallocated image blocks, which also suffer
due to partial writes to such new clusters.

This patch series is an implementation of the idea discussed in the RFC
posted by Denis:
https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg01925.html
Results with this series over NVME are better than with the original code:

              original   rfc     this
cached:       1.79s      2.38s   1.27s
non-cached:   3.29s      1.31s   0.81s

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Max Reitz 
CC: Vladimir Sementsov-Ogievskiy 
CC: Denis Plotnikov 




[PATCH 2/2] qcow2: improve savevm performance

2020-06-10 Thread Denis V. Lunev
This patch does two standard basic things:
- it creates an intermediate buffer for all writes from the QEMU migration
  code to the QCOW2 image,
- this buffer is sent to disk asynchronously, allowing several writes to
  run in parallel.

In general, migration code is fantastically inefficient (by observation):
buffers are not aligned and sent in arbitrary pieces, a lot of the time
less than 100 bytes at a chunk, which results in read-modify-write
operations with non-cached operations. It should also be noted that all
operations are performed into unallocated image blocks, which also suffer
due to partial writes to such new clusters.

Snapshot creation time (2 GB Fedora-31 VM running over NVME storage):
              original   fixed
cached:       1.79s      1.27s
non-cached:   3.29s      0.81s

The difference over HDD would be more significant :)

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Max Reitz 
CC: Vladimir Sementsov-Ogievskiy 
CC: Denis Plotnikov 
---
 block/qcow2.c | 111 +-
 block/qcow2.h |   4 ++
 2 files changed, 113 insertions(+), 2 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 0cd2e6757e..e2ae69422a 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -4797,11 +4797,43 @@ static int qcow2_make_empty(BlockDriverState *bs)
 return ret;
 }
 
+
+typedef struct Qcow2VMStateTask {
+AioTask task;
+
+BlockDriverState *bs;
+int64_t offset;
+void *buf;
+size_t bytes;
+} Qcow2VMStateTask;
+
+typedef struct Qcow2SaveVMState {
+AioTaskPool *pool;
+Qcow2VMStateTask *t;
+} Qcow2SaveVMState;
+
 static coroutine_fn int qcow2_co_flush_to_os(BlockDriverState *bs)
 {
 BDRVQcow2State *s = bs->opaque;
+Qcow2SaveVMState *state = s->savevm_state;
 int ret;
 
+if (state != NULL) {
+aio_task_pool_start_task(state->pool, &state->t->task);
+
+aio_task_pool_wait_all(state->pool);
+ret = aio_task_pool_status(state->pool);
+
+aio_task_pool_free(state->pool);
+g_free(state);
+
+s->savevm_state = NULL;
+
+if (ret < 0) {
+return ret;
+}
+}
+
 qemu_co_mutex_lock(&s->lock);
 ret = qcow2_write_caches(bs);
 qemu_co_mutex_unlock(&s->lock);
@@ -5098,14 +5130,89 @@ static int qcow2_has_zero_init(BlockDriverState *bs)
 }
 }
 
+
+static coroutine_fn int qcow2_co_vmstate_task_entry(AioTask *task)
+{
+int err;
+Qcow2VMStateTask *t = container_of(task, Qcow2VMStateTask, task);
+
+if (t->bytes != 0) {
+QEMUIOVector local_qiov;
+qemu_iovec_init_buf(&local_qiov, t->buf, t->bytes);
+err = t->bs->drv->bdrv_co_pwritev_part(t->bs, t->offset, t->bytes,
+   &local_qiov, 0, 0);
+}
+
+qemu_vfree(t->buf);
+return err;
+}
+
+static Qcow2VMStateTask *qcow2_vmstate_task_create(BlockDriverState *bs,
+int64_t pos, size_t size)
+{
+BDRVQcow2State *s = bs->opaque;
+Qcow2VMStateTask *t = g_new(Qcow2VMStateTask, 1);
+
+*t = (Qcow2VMStateTask) {
+.task.func = qcow2_co_vmstate_task_entry,
+.buf = qemu_blockalign(bs, size),
+.offset = qcow2_vm_state_offset(s) + pos,
+.bs = bs,
+};
+
+return t;
+}
+
 static int qcow2_save_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
   int64_t pos)
 {
 BDRVQcow2State *s = bs->opaque;
+Qcow2SaveVMState *state = s->savevm_state;
+Qcow2VMStateTask *t;
+size_t buf_size = MAX(s->cluster_size, 1 * MiB);
+size_t to_copy;
+size_t off;
 
 BLKDBG_EVENT(bs->file, BLKDBG_VMSTATE_SAVE);
-return bs->drv->bdrv_co_pwritev_part(bs, qcow2_vm_state_offset(s) + pos,
- qiov->size, qiov, 0, 0);
+
+if (state == NULL) {
+state = g_new(Qcow2SaveVMState, 1);
+*state = (Qcow2SaveVMState) {
+.pool = aio_task_pool_new(QCOW2_MAX_WORKERS),
+.t = qcow2_vmstate_task_create(bs, pos, buf_size),
+};
+
+s->savevm_state = state;
+}
+
+if (aio_task_pool_status(state->pool) != 0) {
+return aio_task_pool_status(state->pool);
+}
+
+t = state->t;
+if (t->offset + t->bytes != qcow2_vm_state_offset(s) + pos) {
+/* Normally this branch is not reachable from migration */
+return bs->drv->bdrv_co_pwritev_part(bs,
+qcow2_vm_state_offset(s) + pos, qiov->size, qiov, 0, 0);
+}
+
+off = 0;
+while (1) {
+to_copy = MIN(qiov->size - off, buf_size - t->bytes);
+qemu_iovec_to_buf(qiov, off, t->buf + t->bytes, to_copy);
+t->bytes += to_copy;
+if (t->bytes < buf_size) {
+return 0;
+}
+
+aio_task_pool_start_task(state->pool, &t->task);
+
+pos += to_copy;
+off += to_copy;
+state->t = t = qcow2_vmstate_task_create(bs, pos, buf_size);
+}
+
+return 0;
 }
 
 static int 

[PATCH 1/2] aio: allow to wait for coroutine pool from different coroutine

2020-06-10 Thread Denis V. Lunev
The patch preserves the constraint that the only waiter is allowed.

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Max Reitz 
CC: Vladimir Sementsov-Ogievskiy 
CC: Denis Plotnikov 
---
 block/aio_task.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/aio_task.c b/block/aio_task.c
index 88989fa248..f338049147 100644
--- a/block/aio_task.c
+++ b/block/aio_task.c
@@ -27,7 +27,7 @@
 #include "block/aio_task.h"
 
 struct AioTaskPool {
-Coroutine *main_co;
+Coroutine *wake_co;
 int status;
 int max_busy_tasks;
 int busy_tasks;
@@ -54,15 +54,15 @@ static void coroutine_fn aio_task_co(void *opaque)
 
 if (pool->waiting) {
 pool->waiting = false;
-aio_co_wake(pool->main_co);
+aio_co_wake(pool->wake_co);
 }
 }
 
 void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool)
 {
 assert(pool->busy_tasks > 0);
-assert(qemu_coroutine_self() == pool->main_co);
 
+pool->wake_co = qemu_coroutine_self();
 pool->waiting = true;
 qemu_coroutine_yield();
 
@@ -98,7 +98,7 @@ AioTaskPool *coroutine_fn aio_task_pool_new(int 
max_busy_tasks)
 {
 AioTaskPool *pool = g_new0(AioTaskPool, 1);
 
-pool->main_co = qemu_coroutine_self();
+pool->wake_co = NULL;
 pool->max_busy_tasks = max_busy_tasks;
 
 return pool;
-- 
2.17.1
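
For context, this is roughly how the AioTaskPool helpers touched by this patch are
used, in the style of the qcow2 patch in this series (a sketch that assumes the QEMU
tree; MyTask, my_task_func and run_tasks are invented, only the aio_task_pool_*
calls come from block/aio_task.h):

    #include "qemu/osdep.h"
    #include "block/aio_task.h"

    typedef struct MyTask {
        AioTask task;        /* embedded first, so container_of() recovers MyTask */
        int payload;
    } MyTask;

    static int coroutine_fn my_task_func(AioTask *task)
    {
        MyTask *t = container_of(task, MyTask, task);

        /* ... do the actual I/O for t->payload here ... */
        return t->payload < 0 ? -EINVAL : 0;
    }

    static int coroutine_fn run_tasks(void)
    {
        AioTaskPool *pool = aio_task_pool_new(4 /* max busy tasks */);
        int ret;

        for (int i = 0; i < 8; i++) {
            /* The pool takes ownership and frees the task after it has run */
            MyTask *t = g_new0(MyTask, 1);

            t->task.func = my_task_func;
            t->payload = i;
            /* May yield until a slot frees up; with this patch the waiter no
             * longer has to be the coroutine that created the pool */
            aio_task_pool_start_task(pool, &t->task);
        }

        aio_task_pool_wait_all(pool);
        ret = aio_task_pool_status(pool);
        aio_task_pool_free(pool);
        return ret;
    }
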




Re: [PATCH v10 1/9] error: auto propagated local_err

2020-06-10 Thread Greg Kurz
On Tue, 17 Mar 2020 18:16:17 +0300
Vladimir Sementsov-Ogievskiy  wrote:

> Introduce a new ERRP_AUTO_PROPAGATE macro, to be used at start of
> functions with an errp OUT parameter.
> 
> It has three goals:
> 
> 1. Fix issue with error_fatal and error_prepend/error_append_hint: user
> can't see this additional information, because exit() happens in
> error_setg earlier than information is added. [Reported by Greg Kurz]
> 

I have more of these coming and I'd really like to use ERRP_AUTO_PROPAGATE.

It seems we have a consensus on the macro itself but this series is gated
by the conversion of the existing code base.

What about merging this patch separately so that people can start using
it at least?

> 2. Fix issue with error_abort and error_propagate: when we wrap
> error_abort by local_err+error_propagate, the resulting coredump will
> refer to error_propagate and not to the place where error happened.
> (the macro itself doesn't fix the issue, but it allows us to [3.] drop
> the local_err+error_propagate pattern, which will definitely fix the
> issue) [Reported by Kevin Wolf]
> 
> 3. Drop local_err+error_propagate pattern, which is used to workaround
> void functions with errp parameter, when caller wants to know resulting
> status. (Note: actually these functions could be merely updated to
> return int error code).
> 
> To achieve these goals, later patches will add invocations
> of this macro at the start of functions with either use
> error_prepend/error_append_hint (solving 1) or which use
> local_err+error_propagate to check errors, switching those
> functions to use *errp instead (solving 2 and 3).
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> Reviewed-by: Paul Durrant 
> Reviewed-by: Greg Kurz 
> Reviewed-by: Eric Blake 
> ---
> 
> Cc: Eric Blake 
> Cc: Kevin Wolf 
> Cc: Max Reitz 
> Cc: Greg Kurz 
> Cc: Christian Schoenebeck 
> Cc: Stefan Hajnoczi 
> Cc: Stefano Stabellini 
> Cc: Anthony Perard 
> Cc: Paul Durrant 
> Cc: "Philippe Mathieu-Daudé" 
> Cc: Laszlo Ersek 
> Cc: Gerd Hoffmann 
> Cc: Stefan Berger 
> Cc: Markus Armbruster 
> Cc: Michael Roth 
> Cc: qemu-de...@nongnu.org
> Cc: qemu-block@nongnu.org
> Cc: xen-de...@lists.xenproject.org
> 
>  include/qapi/error.h | 205 ---
>  1 file changed, 173 insertions(+), 32 deletions(-)
> 
> diff --git a/include/qapi/error.h b/include/qapi/error.h
> index ad5b6e896d..30140d9bfe 100644
> --- a/include/qapi/error.h
> +++ b/include/qapi/error.h
> @@ -15,6 +15,8 @@
>  /*
>   * Error reporting system loosely patterned after Glib's GError.
>   *
> + * = Deal with Error object =
> + *
>   * Create an error:
>   * error_setg(&err, "situation normal, all fouled up");
>   *
> @@ -47,28 +49,91 @@
>   * reporting it (primarily useful in testsuites):
>   * error_free_or_abort(&err);
>   *
> - * Pass an existing error to the caller:
> - * error_propagate(errp, err);
> - * where Error **errp is a parameter, by convention the last one.
> + * = Deal with Error ** function parameter =
>   *
> - * Pass an existing error to the caller with the message modified:
> - * error_propagate_prepend(errp, err);
> + * A function may use the error system to return errors. In this case, the
> + * function defines an Error **errp parameter, by convention the last one 
> (with
> + * exceptions for functions using ... or va_list).
>   *
> - * Avoid
> - * error_propagate(errp, err);
> - * error_prepend(errp, "Could not frobnicate '%s': ", name);
> - * because this fails to prepend when @errp is &error_fatal.
> + * The caller may then pass in the following errp values:
>   *
> - * Create a new error and pass it to the caller:
> + * 1. &error_abort
> + *Any error will result in abort().
> + * 2. &error_fatal
> + *Any error will result in exit() with a non-zero status.
> + * 3. NULL
> + *No error reporting through errp parameter.
> + * 4. The address of a NULL-initialized Error *err
> + *Any error will populate errp with an error object.
> + *
> + * The following rules then implement the correct semantics desired by the
> + * caller.
> + *
> + * Create a new error to pass to the caller:
>   * error_setg(errp, "situation normal, all fouled up");
>   *
> - * Call a function and receive an error from it:
> + * Calling another errp-based function:
> + * f(..., errp);
> + *
> + * == Checking success of subcall ==
> + *
> + * If a function returns a value indicating an error in addition to setting
> + * errp (which is recommended), then you don't need any additional code, just
> + * do:
> + *
> + * int ret = f(..., errp);
> + * if (ret < 0) {
> + * ... handle error ...
> + * return ret;
> + * }
> + *
> + * If a function returns nothing (not recommended for new code), the only way
> + * to check success is by consulting errp; doing this safely requires the use
> + * of the ERRP_AUTO_PROPAGATE macro, like this:
> + *
> + * int our_func(..., Error **errp) {
> + * ERRP_AUTO_PROPAGATE();
> + 

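As a rough, hedged sketch of the ERRP_AUTO_PROPAGATE pattern described in the quoted
documentation (Foo, foo_frobnicate and foo_subcall are invented names; only
ERRP_AUTO_PROPAGATE, error_prepend and error_append_hint come from the patch, and
the snippet assumes the QEMU tree's qapi/error.h):

    #include "qemu/osdep.h"
    #include "qapi/error.h"

    typedef struct Foo { char *name; } Foo;     /* invented for the example */

    void foo_subcall(Foo *foo, Error **errp);   /* invented; may set *errp */

    void foo_frobnicate(Foo *foo, Error **errp)
    {
        ERRP_AUTO_PROPAGATE();

        foo_subcall(foo, errp);
        if (*errp) {
            /* Safe even if the caller passed NULL or &error_fatal: the macro
             * re-points errp at a local Error * for the scope of this call. */
            error_prepend(errp, "Could not frobnicate '%s': ", foo->name);
            error_append_hint(errp, "Check the frob level and try again.\n");
            return;
        }
    }
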
Re: [PATCH 06/16] fdc: Deprecate configuring floppies with -global isa-fdc

2020-06-10 Thread John Snow



On 6/5/20 10:56 AM, Markus Armbruster wrote:
> Deprecate
> 
> -global isa-fdc.driveA=...
> -global isa-fdc.driveB=...
> 
> in favour of
> 
> -device floppy,unit=0,drive=...
> -device floppy,unit=1,drive=...
> 
> Same for the other floppy controller devices.
> 

If you're not aware of any reason for why we need to keep global, then
neither am I.

> Signed-off-by: Markus Armbruster 

Acked-by: John Snow 

> ---
>  docs/qdev-device-use.txt   | 13 -
>  docs/system/deprecated.rst | 26 ++
>  hw/block/fdc.c | 17 +
>  tests/qemu-iotests/172.out | 30 ++
>  4 files changed, 77 insertions(+), 9 deletions(-)
> 
> diff --git a/docs/qdev-device-use.txt b/docs/qdev-device-use.txt
> index cc53e97dcd..3d781be547 100644
> --- a/docs/qdev-device-use.txt
> +++ b/docs/qdev-device-use.txt
> @@ -104,15 +104,10 @@ The -device argument differs in detail for each type of 
> drive:
>  
>  * if=floppy
>  
> -  -global isa-fdc.driveA=DRIVE-ID
> -  -global isa-fdc.driveB=DRIVE-ID
> +  -device floppy,unit=UNIT,drive=DRIVE-ID
>  
> -  This is -global instead of -device, because the floppy controller is
> -  created automatically, and we want to configure that one, not create
> -  a second one (which isn't possible anyway).
> -
> -  Without any -global isa-fdc,... you get an empty driveA and no
> -  driveB.  You can use -nodefaults to suppress the default driveA, see
> +  Without any -device floppy,... you get an empty unit 0 and no unit
> +  1.  You can use -nodefaults to suppress the default unit 0, see
>"Default Devices".
>  
>  * if=virtio
> @@ -385,7 +380,7 @@ some DEVNAMEs:
>  
>  default device  suppressing DEVNAMEs
>  CD-ROM  ide-cd, ide-drive, ide-hd, scsi-cd, scsi-hd
> -isa-fdc's driveAfloppy, isa-fdc
> +floppy  floppy, isa-fdc
>  parallelisa-parallel
>  serial  isa-serial
>  VGA VGA, cirrus-vga, isa-vga, isa-cirrus-vga,
> diff --git a/docs/system/deprecated.rst b/docs/system/deprecated.rst
> index f0061f94aa..9bd11c1e95 100644
> --- a/docs/system/deprecated.rst
> +++ b/docs/system/deprecated.rst
> @@ -172,6 +172,32 @@ previously available ``-tb-size`` option.
>  Use ``-display sdl,show-cursor=on`` or
>   ``-display gtk,show-cursor=on`` instead.
>  
> +``Configuring floppies with ``-global``
> +'''
> +
> +Use ``-device floppy,...`` instead:
> +::
> +
> +-global isa-fdc.driveA=...
> +-global sysbus-fdc.driveA=...
> +-global SUNW,fdtwo.drive=...
> +
> +become
> +::
> +
> +-device floppy,unit=0,drive=...
> +
> +and
> +::
> +
> +-global isa-fdc.driveB=...
> +-global sysbus-fdc.driveB=...
> +
> +become
> +::
> +
> +-device floppy,unit=1,drive=...
> +
>  QEMU Machine Protocol (QMP) commands
>  
>  
> diff --git a/hw/block/fdc.c b/hw/block/fdc.c
> index 35e734b6fb..4191d5b006 100644
> --- a/hw/block/fdc.c
> +++ b/hw/block/fdc.c
> @@ -2525,6 +2525,7 @@ static void fdctrl_connect_drives(FDCtrl *fdctrl, 
> DeviceState *fdc_dev,
>  DeviceState *dev;
>  BlockBackend *blk;
>  Error *local_err = NULL;
> +const char *fdc_name, *drive_suffix;
>  
>  for (i = 0; i < MAX_FD; i++) {
>  drive = &fdctrl->drives[i];
> @@ -2539,10 +2540,26 @@ static void fdctrl_connect_drives(FDCtrl *fdctrl, 
> DeviceState *fdc_dev,
>  continue;
>  }
>  
> +fdc_name = object_get_typename(OBJECT(fdc_dev));
> +drive_suffix = !strcmp(fdc_name, "SUNW,fdtwo") ? "" : i ? "B" : "A";
> +warn_report("warning: property %s.drive%s is deprecated",
> +fdc_name, drive_suffix);
> +error_printf("Use -device floppy,unit=%d,drive=... instead.\n", i);
> +
>  dev = qdev_new("floppy");
>  qdev_prop_set_uint32(dev, "unit", i);
>  qdev_prop_set_enum(dev, "drive-type", 
> fdctrl->qdev_for_drives[i].type);
>  
> +/*
> + * Hack alert: we move the backend from the floppy controller
> + * device to the floppy device.  We first need to detach the
> + * controller, or else floppy_create()'s qdev_prop_set_drive()
> + * will die when it attaches floppy device.  We also need to
> + * take another reference so that blk_detach_dev() doesn't
> + * free blk while we still need it.
> + *
> + * The hack is probably a bad idea.
> + */
>  blk_ref(blk);
>  blk_detach_dev(blk, fdc_dev);
>  fdctrl->qdev_for_drives[i].blk = NULL;
> diff --git a/tests/qemu-iotests/172.out b/tests/qemu-iotests/172.out
> index ba15a85c88..253f35111d 100644
> --- a/tests/qemu-iotests/172.out
> +++ b/tests/qemu-iotests/172.out
> @@ -383,6 +383,8 @@ sd0: [not inserted]
>  === Using -drive if=none and -global ===
>  
>  Testing: -drive if=none,file=TEST_DIR/t.qcow2 -global isa-fdc.driveA=none0
> +QEMU_PROG: 

Re: [PATCH 00/16] Crazy shit around -global (pardon my french)

2020-06-10 Thread John Snow



On 6/5/20 10:56 AM, Markus Armbruster wrote:
> There are three ways to configure backends:
> 
> * -nic, -serial, -drive, ... (onboard devices)
> 
> * Set the property with -device, or, if you feel masochistic, with
>   -set device (pluggable devices)
> 
> * Set the property with -global (both)
> 
> The trouble is -global is terrible.
> 
> It gets applied in object_new(), which can't fail.  We treat failure
> to apply -global as fatal error, except when hot-plugging, where we
> treat it as warning *boggle*.  I'm not addressing that today.
> 
> Some code falls apart when you use both -global and the other way.
> 
> To make life more interesting, we gave -drive two roles: with
> interface type other than none, it's for configuring onboard devices,
> and with interface type none, it's for defining backends for use with
> -device and such.  Since we neglect to require interface type none for
> the latter, you can use one -drive in both roles.  This confuses the
> code about as much as you, dear reader, probably are by now.
> 
> Because this still isn't interesting enough, there's yet another way
> to configure backends, just for floppies: set the floppy controller's
> property.  Goes back to the time when floppy wasn't a separate device,
> and involves some Bad Magic.  Now -global can interact with itself!
> 
> Digging through all this took me an embarrassing amount of time.
> Hair, too.
> 
> My patches reject some of the silliest uses outright, and deprecate some
> not so silly ones that have replacements.
> 
> Apply on top of my "[PATCH v2 00/58] qdev: Rework how we plug into the
> parent bus".
> 
> Enjoy!
> 
> Based-on: <20200529134523.8477-1-arm...@redhat.com>
> 
> Markus Armbruster (16):
>   iotests/172: Include "info block" in test output
>   iotests/172: Cover empty filename and multiple use of drives
>   iotests/172: Cover -global floppy.drive=...
>   fdc: Reject clash between -drive if=floppy and -global isa-fdc
>   fdc: Open-code fdctrl_init_isa()
>   fdc: Deprecate configuring floppies with -global isa-fdc
>   docs/qdev-device-use.txt: Update section "Default Devices"
>   blockdev: Deprecate -drive with bogus interface type
>   qdev: Eliminate get_pointer(), set_pointer()
>   qdev: Improve netdev property override error a bit
>   qdev: Reject drive property override
>   qdev: Reject chardev property override
>   qdev: Make qdev_prop_set_drive() match the other helpers
>   arm/aspeed: Drop aspeed_board_init_flashes() parameter @errp
>   sd/pxa2xx_mmci: Don't crash on pxa2xx_mmci_init() error
>   sd/milkymist-memcard: Fix error API violation
> 
>  docs/qdev-device-use.txt|  17 +-
>  docs/system/deprecated.rst  |  34 ++
>  include/hw/block/fdc.h  |   2 +-
>  include/hw/qdev-properties.h|  18 +-
>  include/sysemu/blockdev.h   |   2 +
>  blockdev.c  |  27 +-
>  hw/arm/aspeed.c |  16 +-
>  hw/arm/cubieboard.c |   2 +-
>  hw/arm/exynos4210.c |   2 +-
>  hw/arm/imx25_pdk.c  |   2 +-
>  hw/arm/mcimx6ul-evk.c   |   2 +-
>  hw/arm/mcimx7d-sabre.c  |   2 +-
>  hw/arm/msf2-som.c   |   4 +-
>  hw/arm/nseries.c|   4 +-
>  hw/arm/orangepi.c   |   2 +-
>  hw/arm/raspi.c  |   2 +-
>  hw/arm/sabrelite.c  |   6 +-
>  hw/arm/vexpress.c   |   3 +-
>  hw/arm/xilinx_zynq.c|   7 +-
>  hw/arm/xlnx-versal-virt.c   |   2 +-
>  hw/arm/xlnx-zcu102.c|  10 +-
>  hw/block/fdc.c  |  82 ++--
>  hw/block/nand.c |   2 +-
>  hw/block/pflash_cfi01.c |   6 +-
>  hw/block/pflash_cfi02.c |   2 +-
>  hw/core/qdev-properties-system.c| 151 ---
>  hw/core/qdev-properties.c   |  17 +
>  hw/i386/pc.c|   8 +-
>  hw/ide/qdev.c   |   4 +-
>  hw/isa/isa-superio.c|  18 +-
>  hw/m68k/q800.c  |   3 +-
>  hw/microblaze/petalogix_ml605_mmu.c |   5 +-
>  hw/ppc/pnv.c|   3 +-
>  hw/ppc/spapr.c  |   4 +-
>  hw/scsi/scsi-bus.c  |   2 +-
>  hw/sd/milkymist-memcard.c   |   2 +-
>  hw/sd/pxa2xx_mmci.c |  15 +-
>  hw/sd/sd.c  |   2 +-
>  hw/sd/ssi-sd.c  |   3 +-
>  hw/sparc64/sun4u.c  |   9 +-
>  hw/xtensa/xtfpga.c  |   3 +-
>  softmmu/vl.c|   8 +
>  tests/qemu-iotests/172  |  27 +-
>  tests/qemu-iotests/172.out  | 656 +---
>  44 files changed, 928 insertions(+), 270 deletions(-)
> 

I'll be honest that I'm a little preoccupied and possibly unable to
review the fdc-related changes in depth. I generally trust your
judgment, and will try to give it a quick scan.

You may treat any 

Re: Clarification regarding new qemu-img convert --target-is-zero flag

2020-06-10 Thread Kevin Wolf
Am 10.06.2020 um 14:19 hat Sam Eiderman geschrieben:
> Thanks David,
> 
> Yes, I'm imagining the following use case:
> 
> disk.vmdk is a 50 GB disk that contains 12 MB of binary zeroes at its
> beginning.
> /dev/sda is a raw disk containing garbage
> 
> I invoke:
> qemu-img convert disk.vmdk -O raw /dev/sda
> 
> Required output:
> The first 12 MB of /dev/sda contain zeros, the rest garbage, qemu-img
> finishes fast.
> 
> Kevin, from what I understood from you, this is the default behavior.

Sorry, I misunderstood what you want. qemu-img will write zeros to all
unallocated parts, too. If it didn't do that, the resulting image on
/dev/sda wouldn't be a copy of disk.vmdk.

As the metadata (which blocks are allocated) cannot be preserved in raw
images, you wouldn't be able to tell which part of the image contains
valid data and which part needs to be interpreted as zeros even though
it contains random garbage.

What is your use case for this result where the actual virtual disk
content is mixed with garbage?

Kevin




Re: [PATCH 1/2] nbd/server: Avoid long error message assertions CVE-2020-10761

2020-06-10 Thread Vladimir Sementsov-Ogievskiy

10.06.2020 16:39, Eric Blake wrote:

On 6/10/20 3:57 AM, Vladimir Sementsov-Ogievskiy wrote:

08.06.2020 21:26, Eric Blake wrote:

Ever since commit 36683283 (v2.8), the server code asserts that error
strings sent to the client are well-formed per the protocol by not
exceeding the maximum string length of 4096.  At the time the server
first started sending error messages, the assertion could not be
triggered, because messages were completely under our control.
However, over the years, we have added latent scenarios where a client
could trigger the server to attempt an error message that would
include the client's information if it passed other checks first:

- requesting NBD_OPT_INFO/GO on an export name that is not present
   (commit 0cfae925 in v2.12 echoes the name)

- requesting NBD_OPT_LIST/SET_META_CONTEXT on an export name that is
   not present (commit e7b1948d in v2.12 echoes the name)

At the time, those were still safe because we flagged names larger
than 256 bytes with a different message; but that changed in commit
93676c88 (v4.2) when we raised the name limit to 4096 to match the NBD
string limit.  (That commit also failed to change the magic number
4096 in nbd_negotiate_send_rep_err to the just-introduced named
constant.)  So with that commit, long client names appended to server
text can now trigger the assertion, and thus be used as a denial of
service attack against a server.  As a mitigating factor, if the
server requires TLS, the client cannot trigger the problematic paths
unless it first supplies TLS credentials, and such trusted clients are
less likely to try to intentionally crash the server.

Reported-by: Xueqiang Wei 
CC: qemu-sta...@nongnu.org
Fixes: https://bugzilla.redhat.com/1843684 CVE-2020-10761
Fixes: 93676c88d7
Signed-off-by: Eric Blake 
---
  nbd/server.c   | 28 +---
  tests/qemu-iotests/143 |  4 
  tests/qemu-iotests/143.out |  2 ++
  3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 02b1ed080145..ec130303586d 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -217,7 +217,7 @@ nbd_negotiate_send_rep_verr(NBDClient *client, uint32_t 
type,

  msg = g_strdup_vprintf(fmt, va);
  len = strlen(msg);
-    assert(len < 4096);
+    assert(len < NBD_MAX_STRING_SIZE);
  trace_nbd_negotiate_send_rep_err(msg);
  ret = nbd_negotiate_send_rep_len(client, type, len, errp);
  if (ret < 0) {
@@ -231,6 +231,27 @@ nbd_negotiate_send_rep_verr(NBDClient *client, uint32_t 
type,
  return 0;
  }

+/*
+ * Truncate a potentially-long user-supplied string into something
+ * more suitable for an error reply.
+ */
+static const char *
+nbd_truncate_name(const char *name)
+{
+#define SANE_LENGTH 80
+    static char buf[SANE_LENGTH + 3 + 1]; /* Trailing '...', NUL */


s/NUL/NULL/


NULL is the pointer (typically 4 or 8 bytes); NUL is the character (exactly one 
byte in all multi-byte-encodings like UTF-8, or sizeof(wchar_t) when using wide 
characters).



Hmm. It may break if we use it in parallel in two coroutines or threads. I'm not 
sure whether that is possible now, nor, of course, whether it will be possible in future.


After some testing (including adding some temporary sleep() into the code), it 
looks like 'qemu-nbd -e 2' is currently serialized (we don't start responding 
to a second client until we are done negotiating with the first); on those 
grounds, we are not risking that information leaks from one client to another.  
But you are correct that it is not obvious, and that if we do have a situation 
where two threads can try to allow an NBD connection, then this static buffer 
could leak information from one client to another.  So I'll need to post a v2.



I'd avoid creating functions returning a static buffer; instead use g_strdup_printf(), like

char *tmp = g_strdup_printf("%.80s...", name);

   ( OR, if you want explicit constant: g_strdup_printf("%.*s...", SANE_LENGTH, 
name) )

... report error ...

g_free(tmp)

Using g_strdup_printf also is safer as we don't need to care about buf size.


malloc'ing the buffer is not too bad; error messages are not the hot path. I'll 
change it along those lines for v2.


@@ -996,7 +1017,8 @@ static int nbd_negotiate_meta_queries(NBDClient *client,
  meta->exp = nbd_export_find(export_name);
  if (meta->exp == NULL) {
  return nbd_opt_drop(client, NBD_REP_ERR_UNKNOWN, errp,
-    "export '%s' not present", export_name);
+    "export '%s' not present",
+    nbd_truncate_name(export_name));
  }



Hmm, maybe instead of assertion, shrink message in 
nbd_negotiate_send_rep_verr() too?
This will save us from forgotten (or future) uses of the function.


Truncating in nbd_negotiate_send_rep_verr is arbitrary; it does not have the context of 
what makes sense to truncate.  With an artificially short length, and a client request 
for "longname_from_the_client", the difference would be 

Re: Clarification regarding new qemu-img convert --target-is-zero flag

2020-06-10 Thread Vladimir Sementsov-Ogievskiy

10.06.2020 15:19, Sam Eiderman wrote:

Thanks David,

Yes, I'm imagining the following use case:

disk.vmdk is a 50 GB disk that contains 12 MB of binary zeroes at its beginning.
/dev/sda is a raw disk containing garbage

I invoke:
qemu-img convert disk.vmdk -O raw /dev/sda

Required output:
The first 12 MB of /dev/sda contain zeros, the rest garbage, qemu-img
finishes fast.

Kevin, from what I understood from you, this is the default behavior.

So if my VMDK is causing trouble (the whole virtual size is being written),
this is probably because all the grains in the VMDK are zero-allocated,
right?

Thanks!


I'm not sure that skipping unallocated clusters in qcow2/vmdk is the default.
As far as I can see, briefly looking at the code, unallocated clusters are
skipped with the -B option. But that assumes using some backing file, which is
not your case.

Let's check:
]# ./qemu-img create -f raw b 1M
Formatting 'b', fmt=raw size=1048576
]# ./qemu-img create -f qcow2 a 1M
Formatting 'a', fmt=qcow2 size=1048576 cluster_size=65536 lazy_refcounts=off 
refcount_bits=16 compression_type=zlib
]# ./qemu-io -c 'write -P 0xff 0 1M' -f raw b
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 00.05 sec (21.646 MiB/sec and 21.6457 ops/sec)
]# xxd b | head
00000000: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00000010: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00000020: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00000030: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00000040: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00000050: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00000060: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00000070: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00000080: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00000090: ffff ffff ffff ffff ffff ffff ffff ffff  ................
]# ./qemu-img convert -f qcow2 -O raw a b
]# xxd b | head
00000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000080: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
]# ./qemu-io -c 'write -P 0xff 0 1M' -f raw b
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 00.05 sec (20.648 MiB/sec and 20.6478 ops/sec)
]# ./qemu-img create -f qcow2 base 1M
Formatting 'base', fmt=qcow2 size=1048576 cluster_size=65536 lazy_refcounts=off 
refcount_bits=16 compression_type=zlib
]# ./qemu-img convert -f qcow2 -O raw -B base a b
qemu-img: Backing file not supported for file format 'raw'


So you see, in a newly created qcow2 file all clusters are unallocated. Still,
by default qemu-img convert writes all zeroes. And we can't use -B with a raw
target.
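
For contrast (a sketch, not part of the transcript above): -B does work when
the target format can record a backing file, e.g. converting to qcow2 instead
of raw:

  ./qemu-img convert -f qcow2 -O qcow2 -B base a c

Here c would be created as a qcow2 overlay of base, and clusters that are
unallocated in a would simply be left unallocated in c instead of being
written out as explicit zeroes.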



On Wed, Jun 10, 2020 at 2:56 PM David Edmondson  wrote:


On Wednesday, 2020-06-10 at 08:28:29 +03, Sam Eiderman wrote:


Hi,

168468fe19c8 ("qemu-img: Add --target-is-zero to convert") has added a
nice functionality for cloud scenarios:

* Create a virtual disk
* Convert a sparse image (qcow2, vmdk) to the virtual disk using
--target-is-zero
* Use the virtual disk

This saves many unnecessary writes - a qcow2 with 1MB of allocated
data but with 100GB virtual size will be converted efficiently.

However, does this pose a problem if the virtual disk is not zero initialized?


As Vladimir indicated, the intent of the flag is supposed to be clear
from the name :-) If your storage doesn't read zeroes absent any earlier
writes, you probably don't want to be using it.


Theoretically - if all unallocated blocks contain garbage - this
shouldn't matter, however what about allocated blocks of zero? Will
convert skip copying allocated zero blocks in the source image to the
target since it assumes that the target is zeroed out first thing?


So something like a "--no-need-to-zero" flag would do what you want,
presuming that it would write known zeroes but no longer clean the
device before use?

dme.
--
You can't hide from the flipside.



--
Best regards,
Vladimir



Re: [PATCH 1/2] nbd/server: Avoid long error message assertions CVE-2020-10761

2020-06-10 Thread Eric Blake

On 6/10/20 3:57 AM, Vladimir Sementsov-Ogievskiy wrote:

08.06.2020 21:26, Eric Blake wrote:

Ever since commit 36683283 (v2.8), the server code asserts that error
strings sent to the client are well-formed per the protocol by not
exceeding the maximum string length of 4096.  At the time the server
first started sending error messages, the assertion could not be
triggered, because messages were completely under our control.
However, over the years, we have added latent scenarios where a client
could trigger the server to attempt an error message that would
include the client's information if it passed other checks first:

- requesting NBD_OPT_INFO/GO on an export name that is not present
   (commit 0cfae925 in v2.12 echoes the name)

- requesting NBD_OPT_LIST/SET_META_CONTEXT on an export name that is
   not present (commit e7b1948d in v2.12 echoes the name)

At the time, those were still safe because we flagged names larger
than 256 bytes with a different message; but that changed in commit
93676c88 (v4.2) when we raised the name limit to 4096 to match the NBD
string limit.  (That commit also failed to change the magic number
4096 in nbd_negotiate_send_rep_err to the just-introduced named
constant.)  So with that commit, long client names appended to server
text can now trigger the assertion, and thus be used as a denial of
service attack against a server.  As a mitigating factor, if the
server requires TLS, the client cannot trigger the problematic paths
unless it first supplies TLS credentials, and such trusted clients are
less likely to try to intentionally crash the server.

Reported-by: Xueqiang Wei 
CC: qemu-sta...@nongnu.org
Fixes: https://bugzilla.redhat.com/1843684 CVE-2020-10761
Fixes: 93676c88d7
Signed-off-by: Eric Blake 
---
  nbd/server.c   | 28 +---
  tests/qemu-iotests/143 |  4 
  tests/qemu-iotests/143.out |  2 ++
  3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 02b1ed080145..ec130303586d 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -217,7 +217,7 @@ nbd_negotiate_send_rep_verr(NBDClient *client, 
uint32_t type,


  msg = g_strdup_vprintf(fmt, va);
  len = strlen(msg);
-    assert(len < 4096);
+    assert(len < NBD_MAX_STRING_SIZE);
  trace_nbd_negotiate_send_rep_err(msg);
  ret = nbd_negotiate_send_rep_len(client, type, len, errp);
  if (ret < 0) {
@@ -231,6 +231,27 @@ nbd_negotiate_send_rep_verr(NBDClient *client, 
uint32_t type,

  return 0;
  }

+/*
+ * Truncate a potentially-long user-supplied string into something
+ * more suitable for an error reply.
+ */
+static const char *
+nbd_truncate_name(const char *name)
+{
+#define SANE_LENGTH 80
+    static char buf[SANE_LENGTH + 3 + 1]; /* Trailing '...', NUL */


s/NUL/NULL/


NULL is the pointer (typically 4 or 8 bytes); NUL is the character 
(exactly one byte in all multi-byte-encodings like UTF-8, or 
sizeof(wchar_t) when using wide characters).




Hmm. It may break if we use it in parallel in two coroutines or 
threads. I'm not sure whether that is possible now, nor, of course, 
whether it will be possible in future.


After some testing (including adding some temporary sleep() into the 
code), it looks like 'qemu-nbd -e 2' is currently serialized (we don't 
start responding to a second client until we are done negotiating with 
the first); on those grounds, we are not risking that information leaks 
from one client to another.  But you are correct that it is not obvious, 
and that if we do have a situation where two threads can try to allow an 
NBD connection, then this static buffer could leak information from one 
client to another.  So I'll need to post a v2.




I'd avoid creating functions returning a static buffer; instead use g_strdup_printf(), like

char *tmp = g_strdup_printf("%.80s...", name);

   ( OR, if you want explicit constant: g_strdup_printf("%.*s...", 
SANE_LENGTH, name) )


... report error ...

g_free(tmp)

Using g_strdup_printf also is safer as we don't need to care about buf 
size.


malloc'ing the buffer is not too bad; error messages are not the hot 
path. I'll change it along those lines for v2.


@@ -996,7 +1017,8 @@ static int nbd_negotiate_meta_queries(NBDClient 
*client,

  meta->exp = nbd_export_find(export_name);
  if (meta->exp == NULL) {
  return nbd_opt_drop(client, NBD_REP_ERR_UNKNOWN, errp,
-    "export '%s' not present", export_name);
+    "export '%s' not present",
+    nbd_truncate_name(export_name));
  }



Hmm, maybe instead of assertion, shrink message in 
nbd_negotiate_send_rep_verr() too?

This will save us from forgotten (or future) uses of the function.


Truncating in nbd_negotiate_send_rep_verr is arbitrary; it does not have 
the context of what makes sense to truncate.  With an artificially short 
length, and a client request for "longname_from_the_client", the 
difference would be between:


export 

Re: Clarification regarding new qemu-img convert --target-is-zero flag

2020-06-10 Thread Sam Eiderman
Thanks David,

Yes, I'm imagining the following use case:

disk.vmdk is a 50 GB disk that contains 12 MB of binary zeroes at its beginning.
/dev/sda is a raw disk containing garbage

I invoke:
qemu-img convert disk.vmdk -O raw /dev/sda

Required output:
The first 12 MB of /dev/sda contain zeros, the rest garbage, qemu-img
finishes fast.

Kevin, from what I understood from you, this is the default behavior.

So if my VMDK is causing trouble (the whole virtual size is being written),
this is probably because all the grains in the VMDK are zero-allocated,
right?

Thanks!

On Wed, Jun 10, 2020 at 2:56 PM David Edmondson  wrote:
>
> On Wednesday, 2020-06-10 at 08:28:29 +03, Sam Eiderman wrote:
>
> > Hi,
> >
> > 168468fe19c8 ("qemu-img: Add --target-is-zero to convert") has added a
> > nice functionality for cloud scenarios:
> >
> > * Create a virtual disk
> > * Convert a sparse image (qcow2, vmdk) to the virtual disk using
> > --target-is-zero
> > * Use the virtual disk
> >
> > This saves many unnecessary writes - a qcow2 with 1MB of allocated
> > data but with 100GB virtual size will be converted efficiently.
> >
> > However, does this pose a problem if the virtual disk is not zero 
> > initialized?
>
> As Vladimir indicated, the intent of the flag is supposed to be clear
> from the name :-) If your storage doesn't read zeroes absent any earlier
> writes, you probably don't want to be using it.
>
> > Theoretically - if all unallocated blocks contain garbage - this
> > shouldn't matter, however what about allocated blocks of zero? Will
> > convert skip copying allocated zero blocks in the source image to the
> > target since it assumes that the target is zeroed out first thing?
>
> So something like a "--no-need-to-zero" flag would do what you want,
> presuming that it would write known zeroes but no longer clean the
> device before use?
>
> dme.
> --
> You can't hide from the flipside.



[PATCH v5 4/5] block/io: fix bdrv_is_allocated_above

2020-06-10 Thread Vladimir Sementsov-Ogievskiy
bdrv_is_allocated_above wrongly handles short backing files: it reports
after-EOF space as UNALLOCATED, which is wrong, because on read the data is
generated at the level of the short backing file (if all overlays have an
unallocated area at that place).

Reusing bdrv_common_block_status_above fixes the issue and unifies the code
paths.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
---
 block/io.c | 43 +--
 1 file changed, 5 insertions(+), 38 deletions(-)

diff --git a/block/io.c b/block/io.c
index 3df3a5b8ee..e80f7ad527 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2471,52 +2471,19 @@ int coroutine_fn bdrv_is_allocated(BlockDriverState 
*bs, int64_t offset,
  * at 'offset + *pnum' may return the same allocation status (in other
  * words, the result is not necessarily the maximum possible range);
  * but 'pnum' will only be 0 when end of file is reached.
- *
  */
 int bdrv_is_allocated_above(BlockDriverState *top,
 BlockDriverState *base,
 bool include_base, int64_t offset,
 int64_t bytes, int64_t *pnum)
 {
-BlockDriverState *intermediate;
-int ret;
-int64_t n = bytes;
-
-assert(base || !include_base);
-
-intermediate = top;
-while (include_base || intermediate != base) {
-int64_t pnum_inter;
-int64_t size_inter;
-
-assert(intermediate);
-ret = bdrv_is_allocated(intermediate, offset, bytes, &pnum_inter);
-if (ret < 0) {
-return ret;
-}
-if (ret) {
-*pnum = pnum_inter;
-return 1;
-}
-
-size_inter = bdrv_getlength(intermediate);
-if (size_inter < 0) {
-return size_inter;
-}
-if (n > pnum_inter &&
-(intermediate == top || offset + pnum_inter < size_inter)) {
-n = pnum_inter;
-}
-
-if (intermediate == base) {
-break;
-}
-
-intermediate = backing_bs(intermediate);
+int ret = bdrv_common_block_status_above(top, base, include_base, false,
+ offset, bytes, pnum, NULL, NULL);
+if (ret < 0) {
+return ret;
 }
 
-*pnum = n;
-return 0;
+return !!(ret & BDRV_BLOCK_ALLOCATED);
 }
 
 int coroutine_fn
-- 
2.21.0




[PATCH v5 3/5] block/io: bdrv_common_block_status_above: support bs == base

2020-06-10 Thread Vladimir Sementsov-Ogievskiy
We are going to reuse bdrv_common_block_status_above in
bdrv_is_allocated_above. bdrv_is_allocated_above may be called with
include_base == false and still bs == base (for ex. from img_rebase()).

So, support this corner case.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Kevin Wolf 
Reviewed-by: Eric Blake 
---
 block/io.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/io.c b/block/io.c
index c3ef387f7e..3df3a5b8ee 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2369,7 +2369,11 @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
 int ret = 0;
 bool first = true;
 
-assert(include_base || bs != base);
+if (!include_base && bs == base) {
+*pnum = bytes;
+return 0;
+}
+
 for (p = bs; include_base || p != base; p = backing_bs(p)) {
 ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
file);
-- 
2.21.0




[PATCH v5 1/5] block/io: fix bdrv_co_block_status_above

2020-06-10 Thread Vladimir Sementsov-Ogievskiy
bdrv_co_block_status_above has several design problems with handling
short backing files:

1. With want_zero=true, it may return ret with BDRV_BLOCK_ZERO but
without the BDRV_BLOCK_ALLOCATED flag, even though the short backing
file which produces these after-EOF zeros is inside the requested
backing sequence.

2. With want_zero=false, it may return pnum=0 prior to the actual EOF,
because of the EOF of a short backing file.

Fix these things, making the logic about short backing files clearer.

With bdrv_block_status_above fixed we also have to improve is_zero in
the qcow2 code, otherwise iotest 154 will fail, because with this patch
we stop merging zeros of different types (those produced by regions
unallocated in the whole backing chain vs those produced by short
backing files).

Note also that this patch leaves for another day the general problem
around block-status: the misuse of BDRV_BLOCK_ALLOCATED as
is-fs-allocated vs go-to-backing.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
---
 block/io.c| 39 ++-
 block/qcow2.c | 16 ++--
 2 files changed, 44 insertions(+), 11 deletions(-)

diff --git a/block/io.c b/block/io.c
index 83ffc7d390..f2a89d9417 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2373,25 +2373,46 @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
 ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
file);
 if (ret < 0) {
-break;
+return ret;
 }
-if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
+if (*pnum == 0) {
+if (first) {
+return ret;
+}
+
 /*
- * Reading beyond the end of the file continues to read
- * zeroes, but we can only widen the result to the
- * unallocated length we learned from an earlier
- * iteration.
+ * The top layer deferred to this layer, and because this layer is
+ * short, any zeroes that we synthesize beyond EOF behave as if 
they
+ * were allocated at this layer
  */
+assert(ret & BDRV_BLOCK_EOF);
 *pnum = bytes;
+if (file) {
+*file = p;
+}
+return BDRV_BLOCK_ZERO | BDRV_BLOCK_ALLOCATED;
 }
-if (ret & (BDRV_BLOCK_ZERO | BDRV_BLOCK_DATA)) {
-break;
+if (ret & BDRV_BLOCK_ALLOCATED) {
+/* We've found the node and the status, we must return. */
+
+if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
+/*
+ * This level is also responsible for reads after EOF inside
+ * the unallocated region in the previous level.
+ */
+*pnum = bytes;
+}
+
+return ret;
 }
+
 /* [offset, pnum] unallocated on this layer, which could be only
  * the first part of [offset, bytes].  */
-bytes = MIN(bytes, *pnum);
+assert(*pnum <= bytes);
+bytes = *pnum;
 first = false;
 }
+
 return ret;
 }
 
diff --git a/block/qcow2.c b/block/qcow2.c
index 0cd2e6757e..ce4cf00770 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -3827,8 +3827,20 @@ static bool is_zero(BlockDriverState *bs, int64_t 
offset, int64_t bytes)
 if (!bytes) {
 return true;
 }
-res = bdrv_block_status_above(bs, NULL, offset, bytes, &nr, NULL, NULL);
-return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == bytes;
+
+/*
+ * bdrv_block_status_above doesn't merge different types of zeros, for
+ * example, zeros which come from the region which is unallocated in
+ * the whole backing chain, and zeros which comes because of a short
+ * backing file. So, we need a loop.
+ */
+do {
+res = bdrv_block_status_above(bs, NULL, offset, bytes, &nr, NULL, 
NULL);
+offset += nr;
+bytes -= nr;
+} while (res >= 0 && (res & BDRV_BLOCK_ZERO) && nr && bytes);
+
+return res >= 0 && (res & BDRV_BLOCK_ZERO) && bytes == 0;
 }
 
 static coroutine_fn int qcow2_co_pwrite_zeroes(BlockDriverState *bs,
-- 
2.21.0




[PATCH v5 2/5] block/io: bdrv_common_block_status_above: support include_base

2020-06-10 Thread Vladimir Sementsov-Ogievskiy
In order to reuse bdrv_common_block_status_above in
bdrv_is_allocated_above, let's support include_base parameter.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/coroutines.h |  2 ++
 block/io.c | 14 ++
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/block/coroutines.h b/block/coroutines.h
index f69179f5ef..1cb3128b94 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -41,6 +41,7 @@ bdrv_pwritev(BdrvChild *child, int64_t offset, unsigned int 
bytes,
 int coroutine_fn
 bdrv_co_common_block_status_above(BlockDriverState *bs,
   BlockDriverState *base,
+  bool include_base,
   bool want_zero,
   int64_t offset,
   int64_t bytes,
@@ -50,6 +51,7 @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
 int generated_co_wrapper
 bdrv_common_block_status_above(BlockDriverState *bs,
BlockDriverState *base,
+   bool include_base,
bool want_zero,
int64_t offset,
int64_t bytes,
diff --git a/block/io.c b/block/io.c
index f2a89d9417..c3ef387f7e 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2357,6 +2357,7 @@ early_out:
 int coroutine_fn
 bdrv_co_common_block_status_above(BlockDriverState *bs,
   BlockDriverState *base,
+  bool include_base,
   bool want_zero,
   int64_t offset,
   int64_t bytes,
@@ -2368,8 +2369,8 @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
 int ret = 0;
 bool first = true;
 
-assert(bs != base);
-for (p = bs; p != base; p = backing_bs(p)) {
+assert(include_base || bs != base);
+for (p = bs; include_base || p != base; p = backing_bs(p)) {
 ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
file);
 if (ret < 0) {
@@ -2408,6 +2409,11 @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
 
 /* [offset, pnum] unallocated on this layer, which could be only
  * the first part of [offset, bytes].  */
+
+if (p == base) {
+break;
+}
+
 assert(*pnum <= bytes);
 bytes = *pnum;
 first = false;
@@ -2420,7 +2426,7 @@ int bdrv_block_status_above(BlockDriverState *bs, 
BlockDriverState *base,
 int64_t offset, int64_t bytes, int64_t *pnum,
 int64_t *map, BlockDriverState **file)
 {
-return bdrv_common_block_status_above(bs, base, true, offset, bytes,
+return bdrv_common_block_status_above(bs, base, false, true, offset, bytes,
   pnum, map, file);
 }
 
@@ -2437,7 +2443,7 @@ int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, 
int64_t offset,
 int ret;
 int64_t dummy;
 
-ret = bdrv_common_block_status_above(bs, backing_bs(bs), false, offset,
+ret = bdrv_common_block_status_above(bs, bs, true, false, offset,
 bytes, pnum ? pnum : &dummy, NULL,
  NULL);
 if (ret < 0) {
-- 
2.21.0




[PATCH v5 5/5] iotests: add commit top->base cases to 274

2020-06-10 Thread Vladimir Sementsov-Ogievskiy
These cases are fixed by previous patches around block_status and
is_allocated.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
---
 tests/qemu-iotests/274 | 20 
 tests/qemu-iotests/274.out | 65 ++
 2 files changed, 85 insertions(+)

diff --git a/tests/qemu-iotests/274 b/tests/qemu-iotests/274
index 5d1bf34dff..e910455f13 100755
--- a/tests/qemu-iotests/274
+++ b/tests/qemu-iotests/274
@@ -115,6 +115,26 @@ with iotests.FilePath('base') as base, \
 iotests.qemu_io_log('-c', 'read -P 1 0 %d' % size_short, mid)
 iotests.qemu_io_log('-c', 'read -P 0 %d %d' % (size_short, size_diff), mid)
 
+iotests.log('=== Testing qemu-img commit (top -> base) ===')
+
+create_chain()
+iotests.qemu_img_log('commit', '-b', base, top)
+iotests.img_info_log(base)
+iotests.qemu_io_log('-c', 'read -P 1 0 %d' % size_short, base)
+iotests.qemu_io_log('-c', 'read -P 0 %d %d' % (size_short, size_diff), 
base)
+
+iotests.log('=== Testing QMP active commit (top -> base) ===')
+
+create_chain()
+with create_vm() as vm:
+vm.launch()
+vm.qmp_log('block-commit', device='top', base_node='base',
+   job_id='job0', auto_dismiss=False)
+vm.run_job('job0', wait=5)
+
+iotests.img_info_log(mid)
+iotests.qemu_io_log('-c', 'read -P 1 0 %d' % size_short, base)
+iotests.qemu_io_log('-c', 'read -P 0 %d %d' % (size_short, size_diff), 
base)
 
 iotests.log('== Resize tests ==')
 
diff --git a/tests/qemu-iotests/274.out b/tests/qemu-iotests/274.out
index d24ff681af..9806dea8b6 100644
--- a/tests/qemu-iotests/274.out
+++ b/tests/qemu-iotests/274.out
@@ -129,6 +129,71 @@ read 1048576/1048576 bytes at offset 0
 read 1048576/1048576 bytes at offset 1048576
 1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
+=== Testing qemu-img commit (top -> base) ===
+Formatting 'TEST_DIR/PID-base', fmt=qcow2 size=2097152 cluster_size=65536 
lazy_refcounts=off refcount_bits=16 compression_type=zlib
+
+Formatting 'TEST_DIR/PID-mid', fmt=qcow2 size=1048576 
backing_file=TEST_DIR/PID-base cluster_size=65536 lazy_refcounts=off 
refcount_bits=16 compression_type=zlib
+
+Formatting 'TEST_DIR/PID-top', fmt=qcow2 size=2097152 
backing_file=TEST_DIR/PID-mid cluster_size=65536 lazy_refcounts=off 
refcount_bits=16 compression_type=zlib
+
+wrote 2097152/2097152 bytes at offset 0
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+Image committed.
+
+image: TEST_IMG
+file format: IMGFMT
+virtual size: 2 MiB (2097152 bytes)
+cluster_size: 65536
+Format specific information:
+compat: 1.1
+compression type: zlib
+lazy refcounts: false
+refcount bits: 16
+corrupt: false
+
+read 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+read 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+=== Testing QMP active commit (top -> base) ===
+Formatting 'TEST_DIR/PID-base', fmt=qcow2 size=2097152 cluster_size=65536 
lazy_refcounts=off refcount_bits=16 compression_type=zlib
+
+Formatting 'TEST_DIR/PID-mid', fmt=qcow2 size=1048576 
backing_file=TEST_DIR/PID-base cluster_size=65536 lazy_refcounts=off 
refcount_bits=16 compression_type=zlib
+
+Formatting 'TEST_DIR/PID-top', fmt=qcow2 size=2097152 
backing_file=TEST_DIR/PID-mid cluster_size=65536 lazy_refcounts=off 
refcount_bits=16 compression_type=zlib
+
+wrote 2097152/2097152 bytes at offset 0
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+{"execute": "block-commit", "arguments": {"auto-dismiss": false, "base-node": 
"base", "device": "top", "job-id": "job0"}}
+{"return": {}}
+{"execute": "job-complete", "arguments": {"id": "job0"}}
+{"return": {}}
+{"data": {"device": "job0", "len": 1048576, "offset": 1048576, "speed": 0, 
"type": "commit"}, "event": "BLOCK_JOB_READY", "timestamp": {"microseconds": 
"USECS", "seconds": "SECS"}}
+{"data": {"device": "job0", "len": 1048576, "offset": 1048576, "speed": 0, 
"type": "commit"}, "event": "BLOCK_JOB_COMPLETED", "timestamp": 
{"microseconds": "USECS", "seconds": "SECS"}}
+{"execute": "job-dismiss", "arguments": {"id": "job0"}}
+{"return": {}}
+image: TEST_IMG
+file format: IMGFMT
+virtual size: 1 MiB (1048576 bytes)
+cluster_size: 65536
+backing file: TEST_DIR/PID-base
+Format specific information:
+compat: 1.1
+compression type: zlib
+lazy refcounts: false
+refcount bits: 16
+corrupt: false
+
+read 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+read 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
 == Resize tests ==
 === preallocation=off ===
 Formatting 'TEST_DIR/PID-base', fmt=qcow2 size=6442450944 cluster_size=65536 
lazy_refcounts=off refcount_bits=16 compression_type=zlib
-- 
2.21.0




[PATCH v5 0/5] fix & merge block_status_above and is_allocated_above

2020-06-10 Thread Vladimir Sementsov-Ogievskiy
v5: rebase on coroutine-wrappers series, 02 changed correspondingly

Based on series "[PATCH v7 0/7] coroutines: generate wrapper code", or
in other words:
Based-on: <20200610100336.23451-1-vsement...@virtuozzo.com>

Hi all!

This series addresses the following problem:
the block-status-above functions may consider space after the EOF of
intermediate backing files as unallocated, which is wrong: these
backing files are the reason zeroes are produced, and we never go
further down the backing chain after a short backing file. So, if such
a short backing file is _inside_ the requested sub-chain of the backing
chain, we should never report space after its EOF as unallocated.

See patches 01,04,05 for details.
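
As a rough illustration (file names and sizes are arbitrary, mirroring what
the new 274 cases construct), the problematic layout is a chain whose middle
layer is shorter than its top and base:

  qemu-img create -f qcow2 base.qcow2 2M
  qemu-img create -f qcow2 -b base.qcow2 -F qcow2 mid.qcow2 1M
  qemu-img create -f qcow2 -b mid.qcow2 -F qcow2 top.qcow2 2M
  qemu-io -f qcow2 -c 'write -P 1 0 2M' base.qcow2
  qemu-io -f qcow2 -c 'read -P 0 1M 1M' top.qcow2

The second megabyte of top must read as zeroes: mid is short, so the lookup
stops there and never reaches base's data. Block status over the top..mid
sub-chain therefore must not report that range as unallocated.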

Note that this series leaves for another day the general problem
around block-status: misuse of BDRV_BLOCK_ALLOCATED as is-fs-allocated
vs go-to-backing.
Audit for this problem is done here:
"backing chain & block status & filters"
https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg04706.html
And I'm going to prepare series to address this problem.

Also, the get_block_status function has the same disease, but remains unfixed
here: I want to make a separate series for it, as it needs some more
refactoring, which should be based on the series
"[PATCH v5 0/7] coroutines: generate wrapper code"

Vladimir Sementsov-Ogievskiy (5):
  block/io: fix bdrv_co_block_status_above
  block/io: bdrv_common_block_status_above: support include_base
  block/io: bdrv_common_block_status_above: support bs == base
  block/io: fix bdrv_is_allocated_above
  iotests: add commit top->base cases to 274

 block/coroutines.h |   2 +
 block/io.c | 100 ++---
 block/qcow2.c  |  16 +-
 tests/qemu-iotests/274 |  20 
 tests/qemu-iotests/274.out |  65 
 5 files changed, 150 insertions(+), 53 deletions(-)

-- 
2.21.0




Re: Clarification regarding new qemu-img convert --target-is-zero flag

2020-06-10 Thread David Edmondson
On Wednesday, 2020-06-10 at 08:28:29 +03, Sam Eiderman wrote:

> Hi,
>
> 168468fe19c8 ("qemu-img: Add --target-is-zero to convert") has added a
> nice functionality for cloud scenarios:
>
> * Create a virtual disk
> * Convert a sparse image (qcow2, vmdk) to the virtual disk using
> --target-is-zero
> * Use the virtual disk
>
> This saves many unnecessary writes - a qcow2 with 1MB of allocated
> data but with 100GB virtual size will be converted efficiently.
>
> However, does this pose a problem if the virtual disk is not zero initialized?

As Vladimir indicated, the intent of the flag is supposed to be clear
from the name :-) If your storage doesn't read zeroes absent any earlier
writes, you probably don't want to be using it.

> Theoretically - if all unallocated blocks contain garbage - this
> shouldn't matter, however what about allocated blocks of zero? Will
> convert skip copying allocated zero blocks in the source image to the
> target since it assumes that the target is zeroed out first thing?

So something like a "--no-need-to-zero" flag would do what you want,
presuming that it would write known zeroes but no longer clean the
device before use?

dme.
-- 
You can't hide from the flipside.



Re: Clarification regarding new qemu-img convert --target-is-zero flag

2020-06-10 Thread Sam Eiderman
I see,

I thought qemu-img (by default) checks the virtual size of the disk
before starting to copy allocated data, zeroes out all of the virtual
size (slowly) and then writes all the allocated data except for
zeroes.

But from what I understand now, qemu-img finds that the target is raw
and can not be efficiently zeroed, so it just writes all the allocated
data, including zeroes, leaving unallocated gaps in the virtual size
unwritten.

I have an 800 MB VMDK image with a virtual size of 24 GB

So if the following:
qemu-img convert "${IMAGE_PATH}" -p -O raw -S 512b /dev/sdc 2>&1
Takes roughly 3 minutes and 40 seconds (qemu 3.1.0)

And:
qemu-img convert "${IMAGE_PATH}" -n --target-is-zero -p -O raw /dev/sdc 2>&1
Takes roughly 2 seconds (qemu 5.0.0)

This means that probably there are ~23 GB of zeroes *allocated* in this VMDK;
I'll check that.
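
One way to check (a sketch, using the same "${IMAGE_PATH}" as above) is
qemu-img map's JSON output, which reports for every range whether it is
allocated ("data") and whether it reads as zeroes ("zero"):

  qemu-img map --output=json "${IMAGE_PATH}" |
    python3 -c 'import json,sys; m=json.load(sys.stdin); print(sum(e["length"] for e in m if e["data"]), "bytes allocated")'

If most of the 24 GB shows up as allocated data, the slow conversion is
explained by grains that are allocated but zero-filled.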

Sam


On Wed, Jun 10, 2020 at 2:37 PM Kevin Wolf  wrote:
>
> Am 10.06.2020 um 08:28 hat Sam Eiderman geschrieben:
> > Hi,
> >
> > My target format is a Persistent Disk on GCP.
> > https://cloud.google.com/persistent-disk
> >
> > And my use case is converting VMDKs to PDs so I'm just using qemu-img
> > for the conversion (not using qemu as a hypervisor).
> >
> > Luckily PDs are zeroed out when allocated but I was asking to
> > understand the restrictions of qemu-img convert.
> >
> > It could be useful for qemu-img convert to not zero out the disk, but
> > do write allocated zeroes, I'm imagining cloud scenarios where instead
> > of virtual disks the customer receives an attached physical SSD device
> > that is not zeroed out beforehand (only encryption key changed, for
> > privacy/security sake) so reads will return garbage.
>
> But that's the default mode? Zeroing out the whole disk upfront is an
> optimisation that we do if efficient zeroing is possible, but if we
> can't, we just write explicit zeros where needed.
>
> --target-is-zero means that you promise that the target is already
> pre-zeroed so qemu-img can further optimise things. If you specify it
> and the target doesn't contain zeros, but random data, you get garbage.
>
> Kevin
>



Re: [PATCH] iotests: Fix 291 across more file systems

2020-06-10 Thread Kevin Wolf
Am 09.06.2020 um 22:46 hat Eric Blake geschrieben:
> On 6/9/20 8:32 AM, Kevin Wolf wrote:
> > Am 08.06.2020 um 21:56 hat Eric Blake geschrieben:
> > > Depending on the granularity of holes and amount of metadata consumed
> > > by a file, the 'disk size:' number of 'qemu-img info' is not reliable.
> > > Adjust our test to use a different set of filters to avoid spurious
> > > failures.
> > > 
> > > Reported-by: Kevin Wolf 
> > > Fixes: cf2d1203dc
> > > Signed-off-by: Eric Blake 
> > 
> > Thanks, applied to the block branch.
> 
> It has a conflict with one of Vladimir's bitmaps patches that I'm about to
> send a pull request for; so I'll resolve the conflict and include it in my
> bitmaps tree instead, and you can drop it from yours.  I'm assuming I can
> add your Acked-by since you were willing to stage it.

Ok, no problem.

Kevin



