Re: [PATCH v3 26/30] qemu-img: Use user_creatable_process_cmdline() for --object

2021-03-12 Thread Paolo Bonzini

On 13/03/21 08:40, Markus Armbruster wrote:

+if (!user_creatable_add_from_str(optarg, &local_err)) {
+if (local_err) {
+error_report_err(local_err);
+exit(2);
+} else {
+/* Help was printed */
+exit(EXIT_SUCCESS);
+}
+}
+break;
  }
-}   break;
  case OPTION_IMAGE_OPTS:
  image_opts = true;
  break;

Why is this one different?  The others all call
user_creatable_process_cmdline().




It's to exit with status code 2 instead of 1.

Paolo




Re: [PATCH v3 26/30] qemu-img: Use user_creatable_process_cmdline() for --object

2021-03-12 Thread Markus Armbruster
Kevin Wolf  writes:

> This switches qemu-img from a QemuOpts-based parser for --object to
> user_creatable_process_cmdline() which uses a keyval parser and enforces
> the QAPI schema.
>
> Apart from being a cleanup, this makes non-scalar properties accessible.
>
> Signed-off-by: Kevin Wolf 
> Acked-by: Peter Krempa 
> ---
>  qemu-img.c | 251 ++---
>  1 file changed, 45 insertions(+), 206 deletions(-)
>
> diff --git a/qemu-img.c b/qemu-img.c
> index e2952fe955..babb5573ab 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -226,23 +226,6 @@ static void QEMU_NORETURN help(void)
>  exit(EXIT_SUCCESS);
>  }
>  
> -static QemuOptsList qemu_object_opts = {
> -.name = "object",
> -.implied_opt_name = "qom-type",
> -.head = QTAILQ_HEAD_INITIALIZER(qemu_object_opts.head),
> -.desc = {
> -{ }
> -},
> -};
> -
> -static bool qemu_img_object_print_help(const char *type, QemuOpts *opts)
> -{
> -if (user_creatable_print_help(type, opts)) {
> -exit(0);
> -}
> -return true;
> -}
> -
>  /*
>   * Is @optarg safe for accumulate_options()?
>   * It is when multiple of them can be joined together separated by ','.
> @@ -566,14 +549,9 @@ static int img_create(int argc, char **argv)
>  case 'u':
>  flags |= BDRV_O_NO_BACKING;
>  break;
> -case OPTION_OBJECT: {
> -QemuOpts *opts;
> -opts = qemu_opts_parse_noisily(&qemu_object_opts,
> -   optarg, true);
> -if (!opts) {
> -goto fail;
> -}
> -}   break;
> +case OPTION_OBJECT:
> +user_creatable_process_cmdline(optarg);
> +break;
>  }
>  }
>  
> @@ -589,12 +567,6 @@ static int img_create(int argc, char **argv)
>  }
>  optind++;
>  
> -if (qemu_opts_foreach(&qemu_object_opts,
> -  user_creatable_add_opts_foreach,
> -  qemu_img_object_print_help, &error_fatal)) {
> -goto fail;
> -}
> -
>  /* Get image size, if specified */
>  if (optind < argc) {
>  int64_t sval;
> @@ -804,14 +776,9 @@ static int img_check(int argc, char **argv)
>  case 'U':
>  force_share = true;
>  break;
> -case OPTION_OBJECT: {
> -QemuOpts *opts;
> -opts = qemu_opts_parse_noisily(&qemu_object_opts,
> -   optarg, true);
> -if (!opts) {
> -return 1;
> -}
> -}   break;
> +case OPTION_OBJECT:
> +user_creatable_process_cmdline(optarg);
> +break;
>  case OPTION_IMAGE_OPTS:
>  image_opts = true;
>  break;
> @@ -831,12 +798,6 @@ static int img_check(int argc, char **argv)
>  return 1;
>  }
>  
> -if (qemu_opts_foreach(&qemu_object_opts,
> -  user_creatable_add_opts_foreach,
> -  qemu_img_object_print_help, &error_fatal)) {
> -return 1;
> -}
> -
>  ret = bdrv_parse_cache_mode(cache, &flags, &writethrough);
>  if (ret < 0) {
>  error_report("Invalid source cache option: %s", cache);
> @@ -1034,14 +995,9 @@ static int img_commit(int argc, char **argv)
>  return 1;
>  }
>  break;
> -case OPTION_OBJECT: {
> -QemuOpts *opts;
> -opts = qemu_opts_parse_noisily(&qemu_object_opts,
> -   optarg, true);
> -if (!opts) {
> -return 1;
> -}
> -}   break;
> +case OPTION_OBJECT:
> +user_creatable_process_cmdline(optarg);
> +break;
>  case OPTION_IMAGE_OPTS:
>  image_opts = true;
>  break;
> @@ -1058,12 +1014,6 @@ static int img_commit(int argc, char **argv)
>  }
>  filename = argv[optind++];
>  
> -if (qemu_opts_foreach(&qemu_object_opts,
> -  user_creatable_add_opts_foreach,
> -  qemu_img_object_print_help, &error_fatal)) {
> -return 1;
> -}
> -
>  flags = BDRV_O_RDWR | BDRV_O_UNMAP;
>  ret = bdrv_parse_cache_mode(cache, &flags, &writethrough);
>  if (ret < 0) {
> @@ -1353,7 +1303,7 @@ static int check_empty_sectors(BlockBackend *blk, 
> int64_t offset,
>  /*
>   * Compares two images. Exit codes:
>   *
> - * 0 - Images are identical
> + * 0 - Images are identical or the requested help was printed
>   * 1 - Images differ
>   * >1 - Error occurred
>   */
> @@ -1423,15 +1373,21 @@ static int img_compare(int argc, char **argv)
>  case 'U':
>  force_share = true;
>  break;
> -case OPTION_OBJECT: {
> -QemuOpts *opts;
> -opts = qemu_opts_parse_noisily(&qemu_object_opts,
> -   optarg, true);
> -if (!opts) {
> -ret = 2;
> -  

Re: [PATCH 2/9] block: Replaced qemu_mutex_lock calls with QEMU_LOCK_GUARD

2021-03-12 Thread Mahmoud Mandour
Thank you for the fast review, and I'm sorry for the silly and obvious style
errors. Unfortunately I did not notice the section on using the checkpatch
script on the Contributing page of the wiki before committing, but I assure
you that such errors will not occur again.

On Fri, Mar 12, 2021 at 12:23 PM Vladimir Sementsov-Ogievskiy <
vsement...@virtuozzo.com> wrote:

> 11.03.2021 06:15, Mahmoud Mandour wrote:
> > Replaced various qemu_mutex_lock/qemu_mutex_unlock calls with
> > lock guard macros (QEMU_LOCK_GUARD() and WITH_QEMU_LOCK_GUARD).
> > This slightly simplifies the code by eliminating calls to
> > qemu_mutex_unlock and eliminating goto paths.
> >
> > Signed-off-by: Mahmoud Mandour 
> > ---
> >   block/curl.c |  13 ++--
> >   block/nbd.c  | 188 ---
>
> Better would be to make two separate patches, I think.
>
> >   2 files changed, 95 insertions(+), 106 deletions(-)
> >
> > diff --git a/block/curl.c b/block/curl.c
> > index 4ff895df8f..56a217951a 100644
> > --- a/block/curl.c
> > +++ b/block/curl.c
> > @@ -832,12 +832,12 @@ static void curl_setup_preadv(BlockDriverState
> *bs, CURLAIOCB *acb)
> >   uint64_t start = acb->offset;
> >   uint64_t end;
> >
> > -qemu_mutex_lock(&s->mutex);
> > +QEMU_LOCK_GUARD(&s->mutex);
> >
> >   // In case we have the requested data already (e.g. read-ahead),
> >   // we can just call the callback and be done.
> >   if (curl_find_buf(s, start, acb->bytes, acb)) {
> > -goto out;
> > +return;
> >   }
> >
> >   // No cache found, so let's start a new request
> > @@ -852,7 +852,7 @@ static void curl_setup_preadv(BlockDriverState *bs,
> CURLAIOCB *acb)
> >   if (curl_init_state(s, state) < 0) {
> >   curl_clean_state(state);
> >   acb->ret = -EIO;
> > -goto out;
> > +return;
> >   }
> >
> >   acb->start = 0;
> > @@ -867,7 +867,7 @@ static void curl_setup_preadv(BlockDriverState *bs,
> CURLAIOCB *acb)
> >   if (state->buf_len && state->orig_buf == NULL) {
> >   curl_clean_state(state);
> >   acb->ret = -ENOMEM;
> > -goto out;
> > +return;
> >   }
> >   state->acb[0] = acb;
> >
> > @@ -880,14 +880,11 @@ static void curl_setup_preadv(BlockDriverState
> *bs, CURLAIOCB *acb)
> >   acb->ret = -EIO;
> >
> >   curl_clean_state(state);
> > -goto out;
> > +return;
> >   }
> >
> >   /* Tell curl it needs to kick things off */
> > +curl_multi_socket_action(s->multi, CURL_SOCKET_TIMEOUT, 0, &running);
> > -
> > -out:
> > -qemu_mutex_unlock(>mutex);
> >   }
>
> This change is obvious and good.
>
> >
> >   static int coroutine_fn curl_co_preadv(BlockDriverState *bs,
> > diff --git a/block/nbd.c b/block/nbd.c
> > index c26dc5a54f..28ba7aad61 100644
> > --- a/block/nbd.c
> > +++ b/block/nbd.c
> > @@ -407,27 +407,25 @@ static void *connect_thread_func(void *opaque)
> >   thr->sioc = NULL;
> >   }
> >
> > -qemu_mutex_lock(&thr->mutex);
> > -
> > -switch (thr->state) {
> > -case CONNECT_THREAD_RUNNING:
> > -thr->state = ret < 0 ? CONNECT_THREAD_FAIL :
> CONNECT_THREAD_SUCCESS;
> > -if (thr->bh_ctx) {
> > -aio_bh_schedule_oneshot(thr->bh_ctx, thr->bh_func,
> thr->bh_opaque);
> > -
> > -/* play safe, don't reuse bh_ctx on further connection
> attempts */
> > -thr->bh_ctx = NULL;
> > +WITH_QEMU_LOCK_GUARD(&thr->mutex) {
> > +switch (thr->state) {
> > +case CONNECT_THREAD_RUNNING:
> > +thr->state = ret < 0 ? CONNECT_THREAD_FAIL :
> CONNECT_THREAD_SUCCESS;
> > +if (thr->bh_ctx) {
> > +aio_bh_schedule_oneshot(thr->bh_ctx, thr->bh_func,
> thr->bh_opaque);
>
> over-80 line
>
> > +
> > +/* play safe, don't reuse bh_ctx on further connection
> attempts */
>
> and here
>
> > +thr->bh_ctx = NULL;
> > +}
> > +break;
> > +case CONNECT_THREAD_RUNNING_DETACHED:
> > +do_free = true;
> > +break;
> > +default:
> > +abort();
> >   }
> > -break;
> > -case CONNECT_THREAD_RUNNING_DETACHED:
> > -do_free = true;
> > -break;
> > -default:
> > -abort();
> >   }
> >
> > -qemu_mutex_unlock(&thr->mutex);
> > -
> >   if (do_free) {
> >   nbd_free_connect_thread(thr);
> >   }
> > @@ -443,36 +441,33 @@ nbd_co_establish_connection(BlockDriverState *bs,
> Error **errp)
> >   BDRVNBDState *s = bs->opaque;
> >   NBDConnectThread *thr = s->connect_thread;
> >
> > -qemu_mutex_lock(&thr->mutex);
> > -
> > -switch (thr->state) {
> > -case CONNECT_THREAD_FAIL:
> > -case CONNECT_THREAD_NONE:
> > -error_free(thr->err);
> > -thr->err = NULL;
> > -thr->state = CONNECT_THREAD_RUNNING;
> > -qemu_thread_create(&thread, "nbd-connect",
> > -   connect_thread_func, thr,

Re: [PULL 00/38] Block layer patches and object-add QAPIfication

2021-03-12 Thread Peter Maydell
On Thu, 11 Mar 2021 at 14:48, Kevin Wolf  wrote:
>
> The following changes since commit d689ecad073e0289afa8ca863e45879d719e5c21:
>
>   Merge remote-tracking branch 'remotes/nvme/tags/nvme-next-pull-request' 
> into staging (2021-03-10 20:11:33 +)
>
> are available in the Git repository at:
>
>   git://repo.or.cz/qemu/kevin.git tags/for-upstream
>
> for you to fetch changes up to 4756658df7d5104b36ee2f40f30f2d0f10225a53:
>
>   qom: Add user_creatable_parse_str() (2021-03-11 13:13:49 +0100)
>
> 
> Block layer patches and object-add QAPIfication
>
> - QAPIfy object-add and --object for tools
> - Add vhost-user-blk-test
> - stream: Fail gracefully if permission is denied
> - storage-daemon: Fix crash on quit when job is still running
> - curl: Fix use after free
> - Fix image creation option defaults that exist in both the format and
>   the protocol layer (e.g. 'cluster_size' in qcow2 and rbd; the qcow2
>   default was incorrectly applied to the rbd layer)
>

This generates new warnings in 'make check':
MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}
G_TEST_SRCDIR=/home/petmay01/linaro/qemu-for-merges/tests
G_TEST_BUILDDIR=/home/petmay01/linaro/qemu-for-merges/build/all/tests
tests/test-char --tap -k
warning: The alias 'tty' is deprecated, use 'serial' instead
PASS 1 test-char /char/null
PASS 2 test-char /char/invalid
PASS 3 test-char /char/ringbuf
PASS 4 test-char /char/mux
[etc]

thanks
-- PMM



Re: [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes

2021-03-12 Thread Max Reitz

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:

Compressed writes are not aligned to 512 bytes, which is very slow in
O_DIRECT mode. Let's use the cache.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  block/coroutines.h |   3 +
  block/qcow2.h  |   4 ++
  block/qcow2-refcount.c |  10 +++
  block/qcow2.c  | 158 ++---
  4 files changed, 164 insertions(+), 11 deletions(-)

diff --git a/block/coroutines.h b/block/coroutines.h
index 4cfb4946e6..cb8e05830e 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -66,4 +66,7 @@ int coroutine_fn bdrv_co_readv_vmstate(BlockDriverState *bs,
  int coroutine_fn bdrv_co_writev_vmstate(BlockDriverState *bs,
  QEMUIOVector *qiov, int64_t pos);
  
+int coroutine_fn qcow2_co_flush_compressed_cache(BlockDriverState *bs);

+int generated_co_wrapper qcow2_flush_compressed_cache(BlockDriverState *bs);
+
  #endif /* BLOCK_COROUTINES_INT_H */
diff --git a/block/qcow2.h b/block/qcow2.h
index e18d8dfbae..8b8fed0950 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -28,6 +28,7 @@
  #include "crypto/block.h"
  #include "qemu/coroutine.h"
  #include "qemu/units.h"
+#include "qemu/seqcache.h"
  #include "block/block_int.h"
  
  //#define DEBUG_ALLOC

@@ -422,6 +423,9 @@ typedef struct BDRVQcow2State {
  Qcow2CompressionType compression_type;
  
  GHashTable *inflight_writes_counters;

+
+SeqCache *compressed_cache;
+int compressed_flush_ret;
  } BDRVQcow2State;
  
  typedef struct Qcow2COWRegion {

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 464d133368..218917308e 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -29,6 +29,7 @@
  #include "qemu/bswap.h"
  #include "qemu/cutils.h"
  #include "trace.h"
+#include "block/coroutines.h"
  
  static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size,

  uint64_t max);
@@ -1040,6 +1041,10 @@ static int QEMU_WARN_UNUSED_RESULT 
update_refcount(BlockDriverState *bs,
  qcow2_cache_discard(s->l2_table_cache, table);
  }
  
+if (s->compressed_cache) {

+seqcache_discard_cluster(s->compressed_cache, cluster_offset);
+}
+
  if (s->discard_passthrough[type]) {
  update_refcount_discard(bs, cluster_offset, s->cluster_size);
  }
@@ -1349,6 +1354,11 @@ int coroutine_fn qcow2_write_caches(BlockDriverState *bs)
  BDRVQcow2State *s = bs->opaque;
  int ret;
  
+ret = qcow2_flush_compressed_cache(bs);

+if (ret < 0) {
+return ret;
+}
+


I wonder a bit why this function doesn’t work on a best-effort basis, 
but that isn’t your problem.



  ret = qcow2_cache_write(bs, s->l2_table_cache);
  if (ret < 0) {
  return ret;
diff --git a/block/qcow2.c b/block/qcow2.c
index 6ee94421e0..5f3713cd6f 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -42,6 +42,7 @@
  #include "qapi/qapi-visit-block-core.h"
  #include "crypto.h"
  #include "block/aio_task.h"
+#include "block/coroutines.h"
  
  /*

Differences with QCOW:
@@ -1834,6 +1835,10 @@ static int coroutine_fn qcow2_do_open(BlockDriverState 
*bs, QDict *options,
  s->inflight_writes_counters =
  g_hash_table_new_full(g_int64_hash, g_int64_equal, g_free, g_free);
  
+if (!has_data_file(bs) && (bs->file->bs->open_flags & BDRV_O_NOCACHE)) {


Looks a bit like a layering violation, but I have no better proposal and 
you gave your reasoning, so, OK.


I wonder whether this should be re-checked during reopen, though.


+s->compressed_cache = seqcache_new(s->cluster_size);
+}
+
  return ret;
  
   fail:

@@ -2653,6 +2658,91 @@ fail_nometa:
  return ret;
  }
  
+/*

+ * Flush one cluster of compressed cache if exists. If @unfinished is true and
+ * there is no finished cluster to flush, flush the unfinished one. Otherwise
+ * flush only finished clusters.
+ *
+ * Return 0 if nothing to flush, 1 if one cluster successfully flushed and
+ * <0 on error.
+ */
+static int coroutine_fn
+qcow2_co_compressed_flush_one(BlockDriverState *bs, bool unfinished)
+{
+BDRVQcow2State *s = bs->opaque;
+int ret;
+int64_t align = 1;
+int64_t offset, bytes;
+uint8_t *buf;
+
+if (!s->compressed_cache) {
+return 0;
+}
+
+if (!seqcache_get_next_flush(s->compressed_cache, &offset, &bytes, &buf,
+ &unfinished))
+{
+return 0;
+}
+
+qcow2_inflight_writes_inc(bs, offset, bytes);
+
+/*
+ * Don't try to align-up unfinished cached cluster, parallel write to it is
+ * possible! For finished host clusters data in the tail of the cluster will
+ * be never used. So, take some good alignment for speed.
+ *
+ * Note also, that seqcache guarantees that allocated size of @buf is enough
+ * to fill the cluster up to the end, so we are safe to align up with
+ * align 

Re: [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes

2021-03-12 Thread Vladimir Sementsov-Ogievskiy

12.03.2021 21:15, Max Reitz wrote:

@@ -1834,6 +1835,10 @@ static int coroutine_fn qcow2_do_open(BlockDriverState 
*bs, QDict *options,
  s->inflight_writes_counters =
  g_hash_table_new_full(g_int64_hash, g_int64_equal, g_free, g_free);
+    if (!has_data_file(bs) && (bs->file->bs->open_flags & BDRV_O_NOCACHE)) {


Looks a bit like a layering violation, but I have no better proposal and you 
gave your reasoning, so, OK.


Probably better is to check the request_alignment of bs->file->bs. If it is
greater than 1, then enable the cache.

--
Best regards,
Vladimir



Re: [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes

2021-03-12 Thread Vladimir Sementsov-Ogievskiy

12.03.2021 21:15, Max Reitz wrote:

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:

Compressed writes are not aligned to 512 bytes, which is very slow in
O_DIRECT mode. Let's use the cache.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  block/coroutines.h |   3 +
  block/qcow2.h  |   4 ++
  block/qcow2-refcount.c |  10 +++
  block/qcow2.c  | 158 ++---
  4 files changed, 164 insertions(+), 11 deletions(-)

diff --git a/block/coroutines.h b/block/coroutines.h
index 4cfb4946e6..cb8e05830e 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -66,4 +66,7 @@ int coroutine_fn bdrv_co_readv_vmstate(BlockDriverState *bs,
  int coroutine_fn bdrv_co_writev_vmstate(BlockDriverState *bs,
  QEMUIOVector *qiov, int64_t pos);
+int coroutine_fn qcow2_co_flush_compressed_cache(BlockDriverState *bs);
+int generated_co_wrapper qcow2_flush_compressed_cache(BlockDriverState *bs);
+
  #endif /* BLOCK_COROUTINES_INT_H */
diff --git a/block/qcow2.h b/block/qcow2.h
index e18d8dfbae..8b8fed0950 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -28,6 +28,7 @@
  #include "crypto/block.h"
  #include "qemu/coroutine.h"
  #include "qemu/units.h"
+#include "qemu/seqcache.h"
  #include "block/block_int.h"
  //#define DEBUG_ALLOC
@@ -422,6 +423,9 @@ typedef struct BDRVQcow2State {
  Qcow2CompressionType compression_type;
  GHashTable *inflight_writes_counters;
+
+    SeqCache *compressed_cache;
+    int compressed_flush_ret;
  } BDRVQcow2State;
  typedef struct Qcow2COWRegion {
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 464d133368..218917308e 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -29,6 +29,7 @@
  #include "qemu/bswap.h"
  #include "qemu/cutils.h"
  #include "trace.h"
+#include "block/coroutines.h"
  static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size,
  uint64_t max);
@@ -1040,6 +1041,10 @@ static int QEMU_WARN_UNUSED_RESULT 
update_refcount(BlockDriverState *bs,
  qcow2_cache_discard(s->l2_table_cache, table);
  }
+    if (s->compressed_cache) {
+    seqcache_discard_cluster(s->compressed_cache, cluster_offset);
+    }
+
  if (s->discard_passthrough[type]) {
  update_refcount_discard(bs, cluster_offset, s->cluster_size);
  }
@@ -1349,6 +1354,11 @@ int coroutine_fn qcow2_write_caches(BlockDriverState *bs)
  BDRVQcow2State *s = bs->opaque;
  int ret;
+    ret = qcow2_flush_compressed_cache(bs);
+    if (ret < 0) {
+    return ret;
+    }
+


I wonder a bit why this function doesn’t work on a best-effort basis, but that 
isn’t your problem.


  ret = qcow2_cache_write(bs, s->l2_table_cache);
  if (ret < 0) {
  return ret;
diff --git a/block/qcow2.c b/block/qcow2.c
index 6ee94421e0..5f3713cd6f 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -42,6 +42,7 @@
  #include "qapi/qapi-visit-block-core.h"
  #include "crypto.h"
  #include "block/aio_task.h"
+#include "block/coroutines.h"
  /*
    Differences with QCOW:
@@ -1834,6 +1835,10 @@ static int coroutine_fn qcow2_do_open(BlockDriverState 
*bs, QDict *options,
  s->inflight_writes_counters =
  g_hash_table_new_full(g_int64_hash, g_int64_equal, g_free, g_free);
+    if (!has_data_file(bs) && (bs->file->bs->open_flags & BDRV_O_NOCACHE)) {


Looks a bit like a layering violation, but I have no better proposal and you 
gave your reasoning, so, OK.

I wonder whether this should be re-checked during reopen, though.


Hmm, yes. Somehow I thought qcow2_do_open() would be called again, but that's
about inactivation/invalidation, not about reopen.




+    s->compressed_cache = seqcache_new(s->cluster_size);
+    }
+
  return ret;
   fail:
@@ -2653,6 +2658,91 @@ fail_nometa:
  return ret;
  }
+/*
+ * Flush one cluster of compressed cache if exists. If @unfinished is true and
+ * there is no finished cluster to flush, flush the unfinished one. Otherwise
+ * flush only finished clusters.
+ *
+ * Return 0 if nothing to flush, 1 if one cluster successfully flushed and
+ * <0 on error.
+ */
+static int coroutine_fn
+qcow2_co_compressed_flush_one(BlockDriverState *bs, bool unfinished)
+{
+    BDRVQcow2State *s = bs->opaque;
+    int ret;
+    int64_t align = 1;
+    int64_t offset, bytes;
+    uint8_t *buf;
+
+    if (!s->compressed_cache) {
+    return 0;
+    }
+
+    if (!seqcache_get_next_flush(s->compressed_cache, &offset, &bytes, &buf,
+ &unfinished))
+    {
+    return 0;
+    }
+
+    qcow2_inflight_writes_inc(bs, offset, bytes);
+
+    /*
+ * Don't try to align-up unfinished cached cluster, parallel write to it is
+ * possible! For finished host clusters data in the tail of the cluster will
+ * be never used. So, take some good alignment for speed.
+ *
+ * Note also, that seqcache guarantees that 

Re: [PATCH v3 5/6] block-coroutine-wrapper: allow non bdrv_ prefix

2021-03-12 Thread Max Reitz

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:

We are going to reuse the script to generate a qcow2_ function in
further commit. Prepare the script now.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  scripts/block-coroutine-wrapper.py | 7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)


Reviewed-by: Max Reitz 




Re: [PATCH] fdc: fix floppy boot for Red Hat Linux 5.2

2021-03-12 Thread John Snow

On 3/12/21 3:01 AM, Thomas Huth wrote:


That whole table-based approach seems quite unreliable to me - I've seen 
floppy disks with 80, 81, 82 or sometimes even 83 tracks in the past, so 
I think we would do better with a more flexible way of guessing ... but 
for the time being, this is certainly a quick and easy fix that also 
should not have any negative impact, thus:


Reviewed-by: Thomas Huth 


Yes, that's my thought. I can't personally justify tackling this hokey 
system -- I simply don't know enough about the history of the device to 
design a suitable replacement.


So, it's a band-aid, but one that fits with the way the floppy has 
worked for quite a long time, and seems the least likely to bother 
anything further.


--js




Re: [PATCH v2(RFC) 0/3] qcow2: fix parallel rewrite and discard

2021-03-12 Thread Vladimir Sementsov-Ogievskiy

12.03.2021 18:24, Kevin Wolf wrote:

Am 25.02.2021 um 12:52 hat Vladimir Sementsov-Ogievskiy geschrieben:

Hi all! It turns out that nothing prevents discarding and reallocating a host
cluster during data writing. This way a data write can pollute another
newly allocated cluster of data or metadata.

OK, v2 is an attempt to solve the problem with CoRwlock. It is marked
RFC because of a lot of iotest failures. Some of the problems with v2:

1. It's more complicated to make a test, as everything is blocking
and I can't just break a write and do a discard. I have to implement
aio_discard in qemu-io and rewrite the test into several portions of io
commands split by "sleep 1". OK, it's not a big problem, and I've
solved it.


Right, this just demonstrates that it's doing what it's supposed to.


2. iotest 7 fails with several leaked clusters. It seems to depend on
the fact that discard may be done in parallel with writes. Iotest 7 does
snapshots, so I think the l1 table is updated by the moment the discard is
finally unlocked. But I didn't dig into it; these are all my assumptions.


This one looks a bit odd, but it may be related to the bug in your code:
you forgot that qcow2_cluster_discard() is not a coroutine_fn.
Later tests fail during the unlock:

qemu-img: ../util/qemu-coroutine-lock.c:358: qemu_co_rwlock_unlock: Assertion 
`qemu_in_coroutine()' failed.

#0  0x7ff33f7d89d5 in raise () from /lib64/libc.so.6
#1  0x7ff33f7c18a4 in abort () from /lib64/libc.so.6
#2  0x7ff33f7c1789 in __assert_fail_base.cold () from /lib64/libc.so.6
#3  0x7ff33f7d1026 in __assert_fail () from /lib64/libc.so.6
#4  0x556f4ffd3c94 in qemu_co_rwlock_unlock (lock=0x556f51f63ca0) at 
../util/qemu-coroutine-lock.c:358
#5  0x556f4fef5e09 in qcow2_cluster_discard (bs=0x556f51f56a80, 
offset=37748736, bytes=0, type=QCOW2_DISCARD_NEVER, full_discard=false) at 
../block/qcow2-cluster.c:1970
#6  0x556f4ff46c23 in qcow2_snapshot_create (bs=0x556f51f56a80, 
sn_info=0x7fff89fb9a30) at ../block/qcow2-snapshot.c:736
#7  0x556f4ff0d7b6 in bdrv_snapshot_create (bs=0x556f51f56a80, 
sn_info=0x7fff89fb9a30) at ../block/snapshot.c:227
#8  0x556f4fe85526 in img_snapshot (argc=4, argv=0x7fff89fb9d30) at 
../qemu-img.c:3337
#9  0x556f4fe8a227 in main (argc=4, argv=0x7fff89fb9d30) at 
../qemu-img.c:5375


3. iotest 13 (and I think a lot more iotests) crashes on
assert(!to->locks_held); so with this assertion we can't keep the rwlock
locked during data writing...

   #3  in __assert_fail () from /lib64/libc.so.6
   #4  in qemu_aio_coroutine_enter (ctx=0x55762120b700, co=0x55762121d700)
   at ../util/qemu-coroutine.c:158
   #5  in aio_co_enter (ctx=0x55762120b700, co=0x55762121d700) at 
../util/async.c:628
   #6  in aio_co_wake (co=0x55762121d700) at ../util/async.c:612
   #7  in thread_pool_co_cb (opaque=0x7f17950daab0, ret=0) at 
../util/thread-pool.c:279
   #8  in thread_pool_completion_bh (opaque=0x5576211e5070) at 
../util/thread-pool.c:188
   #9  in aio_bh_call (bh=0x557621205df0) at ../util/async.c:136
   #10 in aio_bh_poll (ctx=0x55762120b700) at ../util/async.c:164
   #11 in aio_poll (ctx=0x55762120b700, blocking=true) at 
../util/aio-posix.c:659
   #12 in blk_prw (blk=0x557621205790, offset=4303351808,
   buf=0x55762123e000 '\364' , ..., bytes=12288,
   co_entry=0x557620d9dc97 , flags=0) at 
../block/block-backend.c:1335
   #13 in blk_pwrite (blk=0x557621205790, offset=4303351808, buf=0x55762123e000,
   count=12288, flags=0) at ../block/block-backend.c:1501


This is another bug in your code: A taken lock belongs to its coroutine.
You can't lock in one coroutine and unlock in another.

The changes you made to the write code seem unnecessarily complicated
anyway: Why not just qemu_co_rwlock_rdlock() right at the start of
qcow2_co_pwritev_part() and unlock at its end, without taking and
dropping the lock repeatedly? It makes both the locking more obviously
correct and also fixes the bug (013 passes with this change).


So now I think that v1 is simpler. It's more complicated (but not too
much) in code, but it keeps discards and data writes from blocking each
other and avoids yields in critical sections.


I think an approach with additional data structures is almost certainly
more complex and harder to maintain (and as the review from Max showed,
also to understand). I wouldn't give up yet on the CoRwlock based
approach, it's almost trivial code in comparison.


Sure



True, making qcow2_cluster_discard() a coroutine_fn requires a
preparational patch that is less trivial, but at least this can be seen
as something that we would want to do sooner or later anyway.



Thanks for help, I'll try your suggestions and resend.

--
Best regards,
Vladimir



Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard

2021-03-12 Thread Vladimir Sementsov-Ogievskiy

12.03.2021 18:52, Max Reitz wrote:

On 12.03.21 16:24, Vladimir Sementsov-Ogievskiy wrote:

12.03.2021 18:10, Max Reitz wrote:

On 12.03.21 13:46, Vladimir Sementsov-Ogievskiy wrote:

12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote:

12.03.2021 14:17, Max Reitz wrote:

On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:

11.03.2021 22:58, Max Reitz wrote:

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:

There is a bug in qcow2: host cluster can be discarded (refcount
becomes 0) and reused during data write. In this case data write may


[..]


@@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT 
update_refcount(BlockDriverState *bs,
  if (refcount == 0) {
  void *table;
+    Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
+
+    if (infl) {
+    infl->refcount_zero = true;
+    infl->type = type;
+    continue;
+    }


I don’t understand what this is supposed to do exactly.  It seems like it wants 
to keep metadata structures in the cache that are still in use (because 
dropping them from the caches is what happens next), but users of metadata 
structures won’t set in-flight counters for those metadata structures, will 
they?


Don't follow.

We want the code in "if (refcount == 0)" to be triggered only when the full
reference count of the host cluster becomes 0, including the inflight-write-cnt.
So, if at this point the inflight-write-cnt is not 0, we postpone freeing the
host cluster; it will be done later from the "slow path" in
update_inflight_write_cnt().


But the code under “if (refcount == 0)” doesn’t free anything, does it?  All I 
can see is code to remove metadata structures from the metadata caches (if the 
discarded cluster was an L2 table or a refblock), and finally the discard on 
the underlying file.  I don’t see how that protocol-level discard has anything 
to do with our problem, though.


Hmm. Still, if we do this discard and then our in-flight write, we'll have
data instead of a hole. Not a big deal, but it seems better to postpone the
discard.

On the other hand, clearing caches is OK, as it's related only to
qcow2-refcount, not to inflight-write-cnt.



As far as I understand, the freeing happens immediately above the “if (refcount == 
0)” block by s->set_refcount() setting the refcount to 0. (including updating 
s->free_cluster_index if the refcount is 0).


Hmm.. And that (setting s->free_cluster_index) is what I should actually
prevent until the total reference count becomes zero.

And about s->set_refcount(): it only updates the refcount itself, and doesn't
free anything.




So, it is more correct like this:

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 464d133368..1da282446d 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1012,21 +1012,12 @@ static int QEMU_WARN_UNUSED_RESULT 
update_refcount(BlockDriverState *bs,
  } else {
  refcount += addend;
  }
-    if (refcount == 0 && cluster_index < s->free_cluster_index) {
-    s->free_cluster_index = cluster_index;
-    }
  s->set_refcount(refcount_block, block_index, refcount);

  if (refcount == 0) {
  void *table;
  Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);

-    if (infl) {
-    infl->refcount_zero = true;
-    infl->type = type;
-    continue;
-    }
-
  table = qcow2_cache_is_table_offset(s->refcount_block_cache,
  offset);
  if (table != NULL) {
@@ -1040,6 +1031,16 @@ static int QEMU_WARN_UNUSED_RESULT 
update_refcount(BlockDriverState *bs,
  qcow2_cache_discard(s->l2_table_cache, table);
  }

+    if (infl) {
+    infl->refcount_zero = true;
+    infl->type = type;
+    continue;
+    }
+
+    if (cluster_index < s->free_cluster_index) {
+    s->free_cluster_index = cluster_index;
+    }
+
  if (s->discard_passthrough[type]) {
  update_refcount_discard(bs, cluster_offset, s->cluster_size);
  }


I don’t think I like using s->free_cluster_index as a protection against 
allocating something before it.


Hmm, I just propose not to update it, if refcount reached 0 but we still have 
inflight writes.




First, it brings back the problem I just described in my mail from 15:58 GMT+1,
which is that you’re changing the definition of what a free cluster is.  With
this proposal, you’re proposing yet another definition: a free cluster is
anything with refcount == 0 after free_cluster_index.


I think that a free cluster is anything with refcount = 0 and
inflight-write-cnt = 0.


Then, as I said in my other mail, update_refcount() just cannot free any 
cluster.  So changes to that function can’t be justified by preventing it from 
freeing 

Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard

2021-03-12 Thread Max Reitz

On 12.03.21 16:24, Vladimir Sementsov-Ogievskiy wrote:

12.03.2021 18:10, Max Reitz wrote:

On 12.03.21 13:46, Vladimir Sementsov-Ogievskiy wrote:

12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote:

12.03.2021 14:17, Max Reitz wrote:

On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:

11.03.2021 22:58, Max Reitz wrote:

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:

There is a bug in qcow2: a host cluster can be discarded (refcount
becomes 0) and reused during a data write. In this case the data write
may


[..]

@@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT 
update_refcount(BlockDriverState *bs,

  if (refcount == 0) {
  void *table;
+    Qcow2InFlightRefcount *infl = find_infl_wr(s, 
cluster_index);

+
+    if (infl) {
+    infl->refcount_zero = true;
+    infl->type = type;
+    continue;
+    }


I don’t understand what this is supposed to do exactly.  It seems 
like it wants to keep metadata structures in the cache that are 
still in use (because dropping them from the caches is what 
happens next), but users of metadata structures won’t set 
in-flight counters for those metadata structures, will they?


Don't follow.

We want the code in "if (refcount == 0)" to be triggered only when 
full reference count of the host cluster becomes 0, including 
inflight-write-cnt. So, if at this point inflight-write-cnt is not 
0, we postpone freeing the host cluster, it will be done later 
from "slow path" in update_inflight_write_cnt().


But the code under “if (refcount == 0)” doesn’t free anything, does 
it?  All I can see is code to remove metadata structures from the 
metadata caches (if the discarded cluster was an L2 table or a 
refblock), and finally the discard on the underlying file.  I don’t 
see how that protocol-level discard has anything to do with our 
problem, though.


Hmm. Still, if we do this discard, and then our in-flight write, 
we'll have data instead of a hole. Not a big deal, but seems better 
to postpone discard.


On the other hand, clearing the caches is OK, as it's related only to
qcow2-refcount, not to inflight-write-cnt.




As far as I understand, the freeing happens immediately above the 
“if (refcount == 0)” block by s->set_refcount() setting the 
refcount to 0. (including updating s->free_cluster_index if the 
refcount is 0).


Hmm.. And that (setting s->free_cluster_index) is what I should
actually prevent until the total reference count becomes zero.


And about s->set_refcount(): it only updates the refcount itself and
doesn't free anything.





So, it is more correct like this:

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 464d133368..1da282446d 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1012,21 +1012,12 @@ static int QEMU_WARN_UNUSED_RESULT 
update_refcount(BlockDriverState *bs,

  } else {
  refcount += addend;
  }
-    if (refcount == 0 && cluster_index < s->free_cluster_index) {
-    s->free_cluster_index = cluster_index;
-    }
  s->set_refcount(refcount_block, block_index, refcount);

  if (refcount == 0) {
  void *table;
  Qcow2InFlightRefcount *infl = find_infl_wr(s, 
cluster_index);


-    if (infl) {
-    infl->refcount_zero = true;
-    infl->type = type;
-    continue;
-    }
-
  table = 
qcow2_cache_is_table_offset(s->refcount_block_cache,

  offset);
  if (table != NULL) {
@@ -1040,6 +1031,16 @@ static int QEMU_WARN_UNUSED_RESULT 
update_refcount(BlockDriverState *bs,

  qcow2_cache_discard(s->l2_table_cache, table);
  }

+    if (infl) {
+    infl->refcount_zero = true;
+    infl->type = type;
+    continue;
+    }
+
+    if (cluster_index < s->free_cluster_index) {
+    s->free_cluster_index = cluster_index;
+    }
+
  if (s->discard_passthrough[type]) {
  update_refcount_discard(bs, cluster_offset, 
s->cluster_size);

  }


I don’t think I like using s->free_cluster_index as a protection 
against allocating something before it.


Hmm, I just propose not to update it if the refcount reached 0 but we still
have in-flight writes.





First, this brings back the problem I just described in my mail from
15:58 GMT+1, which is that you’re changing the definition of what a
free cluster is.  With this proposal, you’re proposing yet another
definition: a free cluster is anything with refcount == 0 after
free_cluster_index.


I think that a free cluster is anything with refcount = 0 and
inflight-write-cnt = 0.


Then, as I said in my other mail, update_refcount() just cannot free any 
cluster.  So changes to that function can’t be justified by preventing 
it from freeing clusters.


You 

Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard

2021-03-12 Thread Vladimir Sementsov-Ogievskiy

12.03.2021 17:58, Max Reitz wrote:

On 12.03.21 13:32, Vladimir Sementsov-Ogievskiy wrote:

12.03.2021 14:17, Max Reitz wrote:

On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:

11.03.2021 22:58, Max Reitz wrote:

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:

There is a bug in qcow2: a host cluster can be discarded (refcount
becomes 0) and reused during a data write. In this case the data write may


[..]


@@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT 
update_refcount(BlockDriverState *bs,
  if (refcount == 0) {
  void *table;
+    Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
+
+    if (infl) {
+    infl->refcount_zero = true;
+    infl->type = type;
+    continue;
+    }


I don’t understand what this is supposed to do exactly.  It seems like it wants 
to keep metadata structures in the cache that are still in use (because 
dropping them from the caches is what happens next), but users of metadata 
structures won’t set in-flight counters for those metadata structures, will 
they?


Don't follow.

We want the code in "if (refcount == 0)" to be triggered only when full reference count 
of the host cluster becomes 0, including inflight-write-cnt. So, if at this point 
inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from 
"slow path" in update_inflight_write_cnt().


But the code under “if (refcount == 0)” doesn’t free anything, does it?  All I 
can see is code to remove metadata structures from the metadata caches (if the 
discarded cluster was an L2 table or a refblock), and finally the discard on 
the underlying file.  I don’t see how that protocol-level discard has anything 
to do with our problem, though.


Hmm. Still, if we do this discard, and then our in-flight write, we'll have 
data instead of a hole. Not a big deal, but seems better to postpone discard.

On the other hand, clearing the caches is OK, as it's related only to
qcow2-refcount, not to inflight-write-cnt.



As far as I understand, the freeing happens immediately above the “if (refcount == 
0)” block by s->set_refcount() setting the refcount to 0. (including updating 
s->free_cluster_index if the refcount is 0).


Hmm.. And that (setting s->free_cluster_index) is what I should actually
prevent until the total reference count becomes zero.

And about s->set_refcount(): it only updates the refcount itself and doesn't
free anything.


That is what freeing is, though.  I consider something to be free when 
allocation functions will allocate it.  The allocation functions look at the 
refcount, so once a cluster’s refcount is 0, it is free.


And with this patch I try to update the allocation functions to also look at
the inflight-write counters. If I missed something, it's a bug in the patch.



If that isn’t what freeing is, nothing in update_refcount() frees anything 
(when looking at how data clusters are handled).  Passing the discard through 
to the protocol layer isn’t “freeing”, because it’s independent of qcow2.

Now, your patch adds an additional check to the allocation functions (whether 
there are ongoing writes on the cluster), so it’s indeed possible that a 
cluster can have a refcount of 0 but still won’t be used by allocation 
functions.

But that means you’ve just changed the definition of what a free cluster is.  
In fact, that means that nothing in update_refcount() can free a cluster that 
has active writes to it, because now a cluster is only free if there are no 
such writes.  It follows that you needn’t change update_refcount() to prevent 
clusters with such writes from being freed, because with this new definition of 
what a free cluster is, it’s impossible for update_refcount() to free them.


But as I noted somewhere else, update_refcount() should not discard the host
cluster in parallel with an in-flight write. It's not completely wrong, but
it's inefficient.



(Yes, you’re right that it would be nice to postpone the protocol-level discard 
still, but not doing so wouldn’t be a catastrophe – which shows that it has 
little to do with actually freeing something, as far as qcow2 is concerned.

If it’s just about postponing the discard, we can do exactly that: Let 
update_refcount() skip discarding for clusters that are still in use, and then 
let update_inflight_write_cnt() only do that discard instead of invoking all of 
qcow2_update_cluster_refcount().)


Agree, yes.



Alternatively, we could also not change the definition of what a free cluster 
is, which means we wouldn’t need to change the allocation functions, but 
instead postpone the refcount update that update_refcount() does.  That would 
mean we’d actually need to really drop the refcount in 
update_inflight_write_cnt() instead of doing a -0.



Hmm, that should work too. Do you think it is better? With the first approach
the meaning of a zero refcount is changed (it's not a free cluster now; keep
in mind inflight-write-cnt too). So we

Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard

2021-03-12 Thread Vladimir Sementsov-Ogievskiy

12.03.2021 18:10, Max Reitz wrote:

On 12.03.21 13:46, Vladimir Sementsov-Ogievskiy wrote:

12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote:

12.03.2021 14:17, Max Reitz wrote:

On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:

11.03.2021 22:58, Max Reitz wrote:

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:

There is a bug in qcow2: a host cluster can be discarded (refcount
becomes 0) and reused during a data write. In this case the data write may


[..]


@@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT 
update_refcount(BlockDriverState *bs,
  if (refcount == 0) {
  void *table;
+    Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
+
+    if (infl) {
+    infl->refcount_zero = true;
+    infl->type = type;
+    continue;
+    }


I don’t understand what this is supposed to do exactly.  It seems like it wants 
to keep metadata structures in the cache that are still in use (because 
dropping them from the caches is what happens next), but users of metadata 
structures won’t set in-flight counters for those metadata structures, will 
they?


Don't follow.

We want the code in "if (refcount == 0)" to be triggered only when full reference count 
of the host cluster becomes 0, including inflight-write-cnt. So, if at this point 
inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from 
"slow path" in update_inflight_write_cnt().


But the code under “if (refcount == 0)” doesn’t free anything, does it?  All I 
can see is code to remove metadata structures from the metadata caches (if the 
discarded cluster was an L2 table or a refblock), and finally the discard on 
the underlying file.  I don’t see how that protocol-level discard has anything 
to do with our problem, though.


Hmm. Still, if we do this discard, and then our in-flight write, we'll have 
data instead of a hole. Not a big deal, but seems better to postpone discard.

On the other hand, clearing the caches is OK, as it's related only to
qcow2-refcount, not to inflight-write-cnt.



As far as I understand, the freeing happens immediately above the “if (refcount == 
0)” block by s->set_refcount() setting the refcount to 0. (including updating 
s->free_cluster_index if the refcount is 0).


Hmm.. And that (setting s->free_cluster_index) is what I should actually
prevent until the total reference count becomes zero.

And about s->set_refcount(): it only updates the refcount itself and doesn't
free anything.




So, it is more correct like this:

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 464d133368..1da282446d 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1012,21 +1012,12 @@ static int QEMU_WARN_UNUSED_RESULT 
update_refcount(BlockDriverState *bs,
  } else {
  refcount += addend;
  }
-    if (refcount == 0 && cluster_index < s->free_cluster_index) {
-    s->free_cluster_index = cluster_index;
-    }
  s->set_refcount(refcount_block, block_index, refcount);

  if (refcount == 0) {
  void *table;
  Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);

-    if (infl) {
-    infl->refcount_zero = true;
-    infl->type = type;
-    continue;
-    }
-
  table = qcow2_cache_is_table_offset(s->refcount_block_cache,
  offset);
  if (table != NULL) {
@@ -1040,6 +1031,16 @@ static int QEMU_WARN_UNUSED_RESULT 
update_refcount(BlockDriverState *bs,
  qcow2_cache_discard(s->l2_table_cache, table);
  }

+    if (infl) {
+    infl->refcount_zero = true;
+    infl->type = type;
+    continue;
+    }
+
+    if (cluster_index < s->free_cluster_index) {
+    s->free_cluster_index = cluster_index;
+    }
+
  if (s->discard_passthrough[type]) {
  update_refcount_discard(bs, cluster_offset, s->cluster_size);
  }


I don’t think I like using s->free_cluster_index as a protection against 
allocating something before it.


Hmm, I just propose not to update it if the refcount reached 0 but we still
have in-flight writes.




First, this brings back the problem I just described in my mail from 15:58 GMT+1,
which is that you’re changing the definition of what a free cluster is.  With
this proposal, you’re proposing yet another definition: a free cluster is
anything with refcount == 0 after free_cluster_index.


I think that a free cluster is anything with refcount = 0 and inflight-write-cnt
= 0. And free_cluster_index is a hint of where to start searching for such a
cluster.



Now looking only at the allocation functions, it may look like that kind of is 
the definition already.  But I don’t think that was the intention when 
free_cluster_index was 

Re: [PATCH v2(RFC) 0/3] qcow2: fix parallel rewrite and discard

2021-03-12 Thread Kevin Wolf
On 25.02.2021 at 12:52, Vladimir Sementsov-Ogievskiy wrote:
> Hi all! It turns out that nothing prevents discarding and reallocating a
> host cluster during a data write. This way the data write will pollute
> another newly allocated cluster of data or metadata.
> 
> OK, v2 is a try to solve the problem with CoRwlock.. And it is marked
> RFC, because of a lot of iotest failures.. Some of problems with v2:
> 
> 1. It's more complicated to make a test, as everything is blocking
> and I can't just break a write and do a discard.. I have to implement
> aio_discard in qemu-io and rewrite the test into several portions of io
> commands split by "sleep 1".. OK, it's not a big problem, and I've
> solved it.

Right, this just demonstrates that it's doing what it's supposed to.

> 2. iotest 7 fails with several leaked clusters. It seems that it depends
> on the fact that discard may be done in parallel with writes. Iotest 7
> does snapshots, so I think the L1 table is updated by the moment when the
> discard is finally unlocked.. But I didn't dig into it; these are all my
> assumptions.

This one looks a bit odd, but it may be related to the bug in your code
that you forgot that qcow2_cluster_discard() is not a coroutine_fn.
Later tests fail during the unlock:

qemu-img: ../util/qemu-coroutine-lock.c:358: qemu_co_rwlock_unlock: Assertion 
`qemu_in_coroutine()' failed.

#0  0x7ff33f7d89d5 in raise () from /lib64/libc.so.6
#1  0x7ff33f7c18a4 in abort () from /lib64/libc.so.6
#2  0x7ff33f7c1789 in __assert_fail_base.cold () from /lib64/libc.so.6
#3  0x7ff33f7d1026 in __assert_fail () from /lib64/libc.so.6
#4  0x556f4ffd3c94 in qemu_co_rwlock_unlock (lock=0x556f51f63ca0) at 
../util/qemu-coroutine-lock.c:358
#5  0x556f4fef5e09 in qcow2_cluster_discard (bs=0x556f51f56a80, 
offset=37748736, bytes=0, type=QCOW2_DISCARD_NEVER, full_discard=false) at 
../block/qcow2-cluster.c:1970
#6  0x556f4ff46c23 in qcow2_snapshot_create (bs=0x556f51f56a80, 
sn_info=0x7fff89fb9a30) at ../block/qcow2-snapshot.c:736
#7  0x556f4ff0d7b6 in bdrv_snapshot_create (bs=0x556f51f56a80, 
sn_info=0x7fff89fb9a30) at ../block/snapshot.c:227
#8  0x556f4fe85526 in img_snapshot (argc=4, argv=0x7fff89fb9d30) at 
../qemu-img.c:3337
#9  0x556f4fe8a227 in main (argc=4, argv=0x7fff89fb9d30) at 
../qemu-img.c:5375

> 3. iotest 13 (and I think a lot more iotests) crashes on
> assert(!to->locks_held); .. So with this assertion we can't keep rwlock
> locked during data writing...
> 
>   #3  in __assert_fail () from /lib64/libc.so.6
>   #4  in qemu_aio_coroutine_enter (ctx=0x55762120b700, co=0x55762121d700)
>   at ../util/qemu-coroutine.c:158
>   #5  in aio_co_enter (ctx=0x55762120b700, co=0x55762121d700) at 
> ../util/async.c:628
>   #6  in aio_co_wake (co=0x55762121d700) at ../util/async.c:612
>   #7  in thread_pool_co_cb (opaque=0x7f17950daab0, ret=0) at 
> ../util/thread-pool.c:279
>   #8  in thread_pool_completion_bh (opaque=0x5576211e5070) at 
> ../util/thread-pool.c:188
>   #9  in aio_bh_call (bh=0x557621205df0) at ../util/async.c:136
>   #10 in aio_bh_poll (ctx=0x55762120b700) at ../util/async.c:164
>   #11 in aio_poll (ctx=0x55762120b700, blocking=true) at 
> ../util/aio-posix.c:659
>   #12 in blk_prw (blk=0x557621205790, offset=4303351808, 
>   buf=0x55762123e000 '\364' ,  \364>..., bytes=12288, 
>   co_entry=0x557620d9dc97 , flags=0) at 
> ../block/block-backend.c:1335
>   #13 in blk_pwrite (blk=0x557621205790, offset=4303351808, 
> buf=0x55762123e000, 
>   count=12288, flags=0) at ../block/block-backend.c:1501

This is another bug in your code: A taken lock belongs to its coroutine.
You can't lock in one coroutine and unlock in another.

The changes you made to the write code seem unnecessarily complicated
anyway: Why not just qemu_co_rwlock_rdlock() right at the start of
qcow2_co_pwritev_part() and unlock at its end, without taking and
dropping the lock repeatedly? It makes both the locking more obviously
correct and also fixes the bug (013 passes with this change).

> So now I think that v1 is simpler.. It's more complicated (but not too
> much) in code. But it keeps discards and data writes from blocking each
> other and avoids yields in critical sections.

I think an approach with additional data structures is almost certainly
more complex and harder to maintain (and as the review from Max showed,
also to understand). I wouldn't give up yet on the CoRwlock based
approach, it's almost trivial code in comparison.

True, making qcow2_cluster_discard() a coroutine_fn requires a
preparational patch that is less trivial, but at least this can be seen
as something that we would want to do sooner or later anyway.

Kevin




Re: [PULL v2 19/38] hw/block/nvme: align zoned.zasl with mdts

2021-03-12 Thread Klaus Jensen
On Mar 12 13:07, Peter Maydell wrote:
> On Tue, 9 Mar 2021 at 11:45, Klaus Jensen  wrote:
> >
> > From: Klaus Jensen 
> >
> > ZASL (Zone Append Size Limit) is defined exactly like MDTS (Maximum Data
> > Transfer Size), that is, it is a value in units of the minimum memory
> > page size (CAP.MPSMIN) and is reported as a power of two.
> >
> > The 'mdts' nvme device parameter is specified as in the spec, but the
> > 'zoned.append_size_limit' parameter is specified in bytes. This is
> > suboptimal for a number of reasons:
> >
> >   1. It is just plain confusing wrt. the definition of mdts.
> >   2. There is a lot of complexity involved in validating the value; it
> >  must be a power of two, it should be larger than 4k, if it is zero
> >  we set it internally to mdts, but still report it as zero.
> >   3. While "hw/block/nvme: improve invalid zasl value reporting"
> >  slightly improved the handling of the parameter, the validation is
> >  still wrong; it does not depend on CC.MPS, it depends on
> >  CAP.MPSMIN. And we are not even checking that it is actually less
> >  than or equal to MDTS, which is kinda the *one* condition it must
> >  satisfy.
> >
> > Fix this by defining zasl exactly like mdts and checking the one thing
> > that it must satisfy (that it is less than or equal to mdts). Also,
> > change the default value from 128KiB to 0 (aka, whatever mdts is).
> 
> > @@ -2144,10 +2142,9 @@ static uint16_t nvme_do_write(NvmeCtrl *n, 
> > NvmeRequest *req, bool append,
> >  goto invalid;
> >  }
> >
> > -if (nvme_l2b(ns, nlb) > (n->page_size << n->zasl)) {
> > -trace_pci_nvme_err_append_too_large(slba, nlb, n->zasl);
> > -status = NVME_INVALID_FIELD;
> > -goto invalid;
> > +if (n->params.zasl && data_size > n->page_size << 
> > n->params.zasl) {
> > +trace_pci_nvme_err_zasl(data_size);
> > +return NVME_INVALID_FIELD | NVME_DNR;
> >  }
> >
> >  slba = zone->w_ptr;
> 
> Hi; Coverity points out a possible overflow here (CID 1450756):
> n->page_size is a uint32_t, and n->params.zasl is a uint8_t, so
> the "n->page_size << n->params.zasl" will be done as 32-bit arithmetic;
> but it is then compared against a uint64_t data_size.
> 
> Is this an overflow that can never happen (ie a false positive), or
> should the RHS of the comparison be done as 64-bit arithmetic by
> adding a cast ?
> 

Thanks!

I think a cast is in order. I will get a fix out.




Re: [PATCH v3 4/6] util: implement seqcache

2021-03-12 Thread Max Reitz

On 12.03.21 15:37, Vladimir Sementsov-Ogievskiy wrote:

12.03.2021 16:41, Max Reitz wrote:

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:

Implement cache for small sequential unaligned writes, so that they may
be cached until we get a complete cluster and then write it.

The cache is intended to be used for backup to qcow2 compressed target
opened in O_DIRECT mode, but can be reused for any similar (even not
block-layer related) task.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  include/qemu/seqcache.h |  42 +
  util/seqcache.c | 361 
  MAINTAINERS |   6 +
  util/meson.build    |   1 +
  4 files changed, 410 insertions(+)
  create mode 100644 include/qemu/seqcache.h
  create mode 100644 util/seqcache.c


Looks quite good to me, thanks.  Nice explanations, too. :)

The only design question I have is whether there’s a reason you’re 
using a list again instead of a hash table.  I suppose we do need the 
list anyway because of the next_flush iterator, so using a hash table 
would only complicate the implementation, but still.


Yes, it seems correct for the flush iterator to go in the same order as the
writes come, so we need a list. We can add a hash table, but it would only
help on reads.. For the compressed cache in qcow2 we try to flush often
enough that there should not be many clusters in the cache. So I think the
addition of a hash table can be done later if needed.


Sure.  The problem I see is that we’ll probably never reach the point of 
it really being needed. O:)


So I think it’s a question of now or never.

[...]


+ */
+bool seqcache_get_next_flush(SeqCache *s, int64_t *offset, int64_t 
*bytes,

+ uint8_t **buf, bool *unfinished)


Could be “uint8_t *const *buf”, I suppose.  Don’t know how much the 
callers would hate that, though.


Will do. And actually I wrote quite a big explanation but missed the fact
that the caller doesn't get ownership of buf; that should be mentioned.


Great, thanks.


+{
+    Cluster *req = s->next_flush;
+
+    if (s->next_flush) {
+    *unfinished = false;
+    req = s->next_flush;
+    s->next_flush = QSIMPLEQ_NEXT(req, entry);
+    if (s->next_flush == s->cur_write) {
+    s->next_flush = NULL;
+    }
+    } else if (s->cur_write && *unfinished) {
+    req = s->cur_write;


I was wondering whether flushing an unfinished cluster wouldn’t kind 
of finalize it, but I suppose the problem with that would be that you 
can’t add data to a finished cluster, which wouldn’t be that great if 
you’re just flushing the cache without wanting to drop it all.


(The problem I see is that flushing it later will mean all the data 
that already has been written here will have to be rewritten.  Not 
that bad, I suppose.)


Yes, that's all correct. Also there is an additional strong reason: qcow2
depends on the fact that clusters become "finished" by defined rules: only
when a cluster is really filled up to the end, or when qcow2 starts writing
another cluster.


For "finished" clusters with unaligned end we can safely align this end 
up to some good alignment writing a bit more data than needed. It's safe 
because tail of the cluster is never used. And we'll perform better with 
aligned write avoiding RMW.


But when flushing an "unfinished" cluster, we should write exactly what we
have in the cache, as a parallel write to the same cluster may happen, which
will continue the sequential process.


OK, thanks for the explanation.

Max




Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard

2021-03-12 Thread Max Reitz

On 12.03.21 13:46, Vladimir Sementsov-Ogievskiy wrote:

12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote:

12.03.2021 14:17, Max Reitz wrote:

On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:

11.03.2021 22:58, Max Reitz wrote:

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:

There is a bug in qcow2: a host cluster can be discarded (refcount
becomes 0) and reused during a data write. In this case the data write may


[..]

@@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT 
update_refcount(BlockDriverState *bs,

  if (refcount == 0) {
  void *table;
+    Qcow2InFlightRefcount *infl = find_infl_wr(s, 
cluster_index);

+
+    if (infl) {
+    infl->refcount_zero = true;
+    infl->type = type;
+    continue;
+    }


I don’t understand what this is supposed to do exactly.  It seems 
like it wants to keep metadata structures in the cache that are 
still in use (because dropping them from the caches is what happens 
next), but users of metadata structures won’t set in-flight 
counters for those metadata structures, will they?


Don't follow.

We want the code in "if (refcount == 0)" to be triggered only when 
full reference count of the host cluster becomes 0, including 
inflight-write-cnt. So, if at this point inflight-write-cnt is not 
0, we postpone freeing the host cluster, it will be done later from 
"slow path" in update_inflight_write_cnt().


But the code under “if (refcount == 0)” doesn’t free anything, does 
it?  All I can see is code to remove metadata structures from the 
metadata caches (if the discarded cluster was an L2 table or a 
refblock), and finally the discard on the underlying file.  I don’t 
see how that protocol-level discard has anything to do with our 
problem, though.


Hmm. Still, if we do this discard, and then our in-flight write, we'll 
have data instead of a hole. Not a big deal, but seems better to 
postpone discard.


On the other hand, clearing the caches is OK, as it's related only to
qcow2-refcount, not to inflight-write-cnt.




As far as I understand, the freeing happens immediately above the “if 
(refcount == 0)” block by s->set_refcount() setting the refcount to 
0. (including updating s->free_cluster_index if the refcount is 0).


Hmm.. And that (setting s->free_cluster_index) is what I should actually
prevent until the total reference count becomes zero.


And about s->set_refcount(): it only updates the refcount itself and
doesn't free anything.





So, it is more correct like this:

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 464d133368..1da282446d 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1012,21 +1012,12 @@ static int QEMU_WARN_UNUSED_RESULT 
update_refcount(BlockDriverState *bs,

  } else {
  refcount += addend;
  }
-    if (refcount == 0 && cluster_index < s->free_cluster_index) {
-    s->free_cluster_index = cluster_index;
-    }
  s->set_refcount(refcount_block, block_index, refcount);

  if (refcount == 0) {
  void *table;
  Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);

-    if (infl) {
-    infl->refcount_zero = true;
-    infl->type = type;
-    continue;
-    }
-
  table = qcow2_cache_is_table_offset(s->refcount_block_cache,
  offset);
  if (table != NULL) {
@@ -1040,6 +1031,16 @@ static int QEMU_WARN_UNUSED_RESULT 
update_refcount(BlockDriverState *bs,

  qcow2_cache_discard(s->l2_table_cache, table);
  }

+    if (infl) {
+    infl->refcount_zero = true;
+    infl->type = type;
+    continue;
+    }
+
+    if (cluster_index < s->free_cluster_index) {
+    s->free_cluster_index = cluster_index;
+    }
+
  if (s->discard_passthrough[type]) {
  update_refcount_discard(bs, cluster_offset, 
s->cluster_size);

  }


I don’t think I like using s->free_cluster_index as a protection against 
allocating something before it.


First, this brings back the problem I just described in my mail from 15:58
GMT+1, which is that you’re changing the definition of what a free
cluster is.  With this proposal, you’re proposing yet another definition:
a free cluster is anything with refcount == 0 after free_cluster_index.


Now looking only at the allocation functions, it may look like that kind 
of is the definition already.  But I don’t think that was the intention 
when free_cluster_index was introduced, so we’d have to check every 
place that sets free_cluster_index, to see whether it adheres to this 
definition.


And I think it’s clear that there is a place that won’t adhere to this 
definition, and that is this very place here, in update_refcount().  Say 
free_cluster_index 

Re: [PULL v2 36/38] hw/block/nvme: support namespace attachment command

2021-03-12 Thread Klaus Jensen
On Mar 12 13:12, Peter Maydell wrote:
> On Tue, 9 Mar 2021 at 11:46, Klaus Jensen  wrote:
> >
> > From: Minwoo Im 
> >
> > This patch supports Namespace Attachment command for the pre-defined
> > nvme-ns device nodes.  Of course, attach/detach namespace should only be
> > supported in case 'subsys' is given.  This is because if we detach a
> > namespace from a controller, somebody needs to manage the detached, but
> > allocated namespace in the NVMe subsystem.
> >
> > As command effect for the namespace attachment command is registered,
> > the host will be notified that namespace inventory is changed so that
> > host will rescan the namespace inventory after this command.  For
> > example, kernel driver manages this command effect via passthru IOCTL.
> 
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index 85a7b5a14f4e..1287bc2cd17a 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -235,6 +235,11 @@ static inline void nvme_ns_attach(NvmeCtrl *n, 
> > NvmeNamespace *ns)
> >  n->namespaces[nvme_nsid(ns) - 1] = ns;
> >  }
> >
> > +static inline void nvme_ns_detach(NvmeCtrl *n, NvmeNamespace *ns)
> > +{
> > +n->namespaces[nvme_nsid(ns) - 1] = NULL;
> > +}
> 
> Hi; Coverity complains about a possible array overflow both here
> in nvme_ns_detach() (CID 1450757) and in nvme_ns_attach() (CID 1450758):
> nvme_nsid() can return -1, but this code does not check for that.
> 
> If these functions both assume that their ns argument is non-NULL,
> then adding an "assert(ns)" would probably placate Coverity and also
> would mean that any bugs elsewhere resulting in accidentally passing
> a NULL pointer would result in a clean assertion failure rather than
> memory corruption. (Or you could directly assert that the array index
> is in-bounds, I guess.)
> 

Hi Peter,

Thanks!

As far as I can tell, we never reach this point with an invalid nsid or a
NULL ns, but as you say, future misuse would be bad. I will post a fix to
make sure such misuse does not happen in the future.




Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard

2021-03-12 Thread Max Reitz

On 12.03.21 13:42, Vladimir Sementsov-Ogievskiy wrote:

12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote:

12.03.2021 14:17, Max Reitz wrote:

On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:

11.03.2021 22:58, Max Reitz wrote:

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:

There is a bug in qcow2: a host cluster can be discarded (refcount
becomes 0) and reused during a data write. In this case the data write may


[..]

@@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
         if (refcount == 0) {
             void *table;
+            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
+
+            if (infl) {
+                infl->refcount_zero = true;
+                infl->type = type;
+                continue;
+            }


I don’t understand what this is supposed to do exactly.  It seems 
like it wants to keep metadata structures in the cache that are 
still in use (because dropping them from the caches is what happens 
next), but users of metadata structures won’t set in-flight 
counters for those metadata structures, will they?


Don't follow.

We want the code in "if (refcount == 0)" to be triggered only when the 
full reference count of the host cluster becomes 0, including 
inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, 
we postpone freeing the host cluster; it will be done later from the 
"slow path" in update_inflight_write_cnt().


But the code under “if (refcount == 0)” doesn’t free anything, does 
it?  All I can see is code to remove metadata structures from the 
metadata caches (if the discarded cluster was an L2 table or a 
refblock), and finally the discard on the underlying file.  I don’t 
see how that protocol-level discard has anything to do with our 
problem, though.


Hmm. Still, if we do this discard, and then our in-flight write, we'll 
have data instead of a hole. Not a big deal, but seems better to 
postpone discard.


On the other hand, clearing caches is OK, as it's related only to 
qcow2-refcount, not to inflight-write-cnt




As far as I understand, the freeing happens immediately above the “if 
(refcount == 0)” block by s->set_refcount() setting the refcount to 
0. (including updating s->free_cluster_index if the refcount is 0).


Hmm.. And that (setting s->free_cluster_index) is what I should actually 
prevent until the total reference count becomes zero.


And about s->set_refcount(): it only updates the refcount itself, and 
doesn't free anything.






Looking now at this place:


    if (refcount == 0 && cluster_index < s->free_cluster_index) {
        s->free_cluster_index = cluster_index;
    }
    s->set_refcount(refcount_block, block_index, refcount);
    if (refcount == 0) {
        void *table;
        Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
        if (infl) {
            infl->refcount_zero = true;
            infl->type = type;
            continue;
        }
        table = qcow2_cache_is_table_offset(s->refcount_block_cache,
                                            offset);
        if (table != NULL) {
            qcow2_cache_put(s->refcount_block_cache, &refcount_block);
            old_table_index = -1;
            qcow2_cache_discard(s->refcount_block_cache, table);
        }
        table = qcow2_cache_is_table_offset(s->l2_table_cache, offset);
        if (table != NULL) {
            qcow2_cache_discard(s->l2_table_cache, table);
        }
        if (s->compressed_cache) {
            seqcache_discard_cluster(s->compressed_cache, cluster_offset);
        }
        if (s->discard_passthrough[type]) {
            update_refcount_discard(bs, cluster_offset, s->cluster_size);
        }
    }


Hmm. Is it OK that we use "offset" to discard qcow2 metadata caches? 
offset is the start of the whole loop and is not updated on iterations. 
Isn't it more correct to use cluster_offset here? Or are we sure that 
refcount and L2 metadata are always discarded by exactly one cluster? 
Then we are OK, but it's still worth an assertion that offset == cluster_offset.


Spontaneously, I think it's a bug that hasn't caused any problems yet, 
because I suppose L2 tables and refblocks are indeed always discarded 
one by one (i.e., cluster by cluster).  It definitely looks like this 
should be cluster_offset, yes.


Max




Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard

2021-03-12 Thread Max Reitz

On 12.03.21 13:32, Vladimir Sementsov-Ogievskiy wrote:

12.03.2021 14:17, Max Reitz wrote:

On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:

11.03.2021 22:58, Max Reitz wrote:

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:

There is a bug in qcow2: host cluster can be discarded (refcount
becomes 0) and reused during data write. In this case data write may


[..]

@@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
         if (refcount == 0) {
             void *table;
+            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
+
+            if (infl) {
+                infl->refcount_zero = true;
+                infl->type = type;
+                continue;
+            }


I don’t understand what this is supposed to do exactly.  It seems 
like it wants to keep metadata structures in the cache that are 
still in use (because dropping them from the caches is what happens 
next), but users of metadata structures won’t set in-flight counters 
for those metadata structures, will they?


Don't follow.

We want the code in "if (refcount == 0)" to be triggered only when the 
full reference count of the host cluster becomes 0, including 
inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, 
we postpone freeing the host cluster; it will be done later from the 
"slow path" in update_inflight_write_cnt().


But the code under “if (refcount == 0)” doesn’t free anything, does 
it?  All I can see is code to remove metadata structures from the 
metadata caches (if the discarded cluster was an L2 table or a 
refblock), and finally the discard on the underlying file.  I don’t 
see how that protocol-level discard has anything to do with our 
problem, though.


Hmm. Still, if we do this discard, and then our in-flight write, we'll 
have data instead of a hole. Not a big deal, but seems better to 
postpone discard.


On the other hand, clearing caches is OK, as it's related only to 
qcow2-refcount, not to inflight-write-cnt




As far as I understand, the freeing happens immediately above the “if 
(refcount == 0)” block by s->set_refcount() setting the refcount to 0. 
(including updating s->free_cluster_index if the refcount is 0).


Hmm.. And that (setting s->free_cluster_index) is what I should actually 
prevent until the total reference count becomes zero.


And about s->set_refcount(): it only updates the refcount itself, and doesn't 
free anything.


That is what freeing is, though.  I consider something to be free when 
allocation functions will allocate it.  The allocation functions look at 
the refcount, so once a cluster’s refcount is 0, it is free.


If that isn’t what freeing is, nothing in update_refcount() frees 
anything (when looking at how data clusters are handled).  Passing the 
discard through to the protocol layer isn’t “freeing”, because it’s 
independent of qcow2.


Now, your patch adds an additional check to the allocation functions 
(whether there are ongoing writes on the cluster), so it’s indeed 
possible that a cluster can have a refcount of 0 but still won’t be used 
by allocation functions.


But that means you’ve just changed the definition of what a free cluster 
is.  In fact, that means that nothing in update_refcount() can free a 
cluster that has active writes to it, because now a cluster is only free 
if there are no such writes.  It follows that you needn’t change 
update_refcount() to prevent clusters with such writes from being freed, 
because with this new definition of what a free cluster is, it’s 
impossible for update_refcount() to free them.


(Yes, you’re right that it would be nice to postpone the protocol-level 
discard still, but not doing so wouldn’t be a catastrophe – which shows 
that it has little to do with actually freeing something, as far as 
qcow2 is concerned.


If it’s just about postponing the discard, we can do exactly that: Let 
update_refcount() skip discarding for clusters that are still in use, 
and then let update_inflight_write_cnt() only do that discard instead of 
invoking all of qcow2_update_cluster_refcount().)


Alternatively, we could also not change the definition of what a free 
cluster is, which means we wouldn’t need to change the allocation 
functions, but instead postpone the refcount update that 
update_refcount() does.  That would mean we’d actually need to really 
drop the refcount in update_inflight_write_cnt() instead of doing a -0.


Max




Re: [PATCH v3 4/6] util: implement seqcache

2021-03-12 Thread Vladimir Sementsov-Ogievskiy

12.03.2021 16:41, Max Reitz wrote:

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:

Implement cache for small sequential unaligned writes, so that they may
be cached until we get a complete cluster and then write it.

The cache is intended to be used for backup to qcow2 compressed target
opened in O_DIRECT mode, but can be reused for any similar (even not
block-layer related) task.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  include/qemu/seqcache.h |  42 +
  util/seqcache.c | 361 
  MAINTAINERS |   6 +
  util/meson.build    |   1 +
  4 files changed, 410 insertions(+)
  create mode 100644 include/qemu/seqcache.h
  create mode 100644 util/seqcache.c


Looks quite good to me, thanks.  Nice explanations, too. :)

The only design question I have is whether there’s a reason you’re using a list 
again instead of a hash table.  I suppose we do need the list anyway because of 
the next_flush iterator, so using a hash table would only complicate the 
implementation, but still.


Yes, it seems correct for the flush iterator to go in the same order as writes 
come, so we need a list. We could add a hash table, but it would only help on 
reads.. And for the compressed cache in qcow2 we try to flush often enough that 
there should not be many clusters in the cache. So I think a hash table may be 
added later if needed.



[...]


diff --git a/util/seqcache.c b/util/seqcache.c
new file mode 100644
index 00..d923560eab
--- /dev/null
+++ b/util/seqcache.c
@@ -0,0 +1,361 @@
+/*
+ * Cache for small sequential write requests.
+ *
+ * Copyright (c) 2021 Virtuozzo International GmbH.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ *
+ *
+ * = Description =
+ *
+ * SeqCache is an abbreviation for Sequential Cache.
+ *
+ * The Cache is intended to improve performance of small unaligned sequential
+ * writes. Cache has a cluster_size parameter and the unit of caching is 
aligned
+ * cluster.  Cache has a list of cached clusters, several "finished" ones and 
at
+ * most one "unfinished". "unfinished" cluster is a cluster where last byte of
+ * last write operation is cached and there is a free place after that byte to
+ * the end of cluster. "finished" clusters are just stored in cache to be read
+ * or flushed and don't allow any writes to them.
+ *
+ * If write to the cache intersects cluster bounds, it's splat into several


*split

(Though I like “splat”.  Sounds like a wet blotch.)


+ * requests by cluster bounds. So, consider a write that doesn't intersect
+ * cluster bounds to describe the whole picture:
+ *
+ * There are two cases allowed:
+ *
+ * 1. Sequential write to "unfinished" cluster. Actually it's a write 
sequential
+ *    previous write. The data goes to "unfinished" cluster. If "unfinished" is
+ *    filled up to the cluster bound it becomes "finished".
+ *
+ * 2. Write to new cluster, not existing in the cache. It may be sequential to
+ *    previous write or not. Current "unfinshed" cluster (if exists) becomes


*unfinished


+ *    "finished" and new "unfinished" cluster is started. Note also that offset
+ *    of the write to new cluster is not required to be aligned.
+ *
+ *    Any other write operation (non-sequential write to "unfinished" cluster
+ *    or write to any of "finished" clusters) will crash.
+ */
+
+#include "qemu/osdep.h"
+
+#include "qemu/queue.h"
+#include "qemu/seqcache.h"
+
+/*
+ * Cluster
+ *
+ * Representation of one cached cluster, aligned to SeqCache::cluster_size.
+ * Caches only one subregion of the cluster, started at @offset (may be
+ * unaligned to cluster_size) and of @bytes length (may be unaligned as well).
+ * The whole subregion always lay in one aligned cluster:
+ *
+ *  QEMU_ALIGN_DOWN(offset, cluster_size) ==
+ *  QEMU_ALIGN_DOWN(offset + bytes - 1, cluster_size)
+ *
+ * @buf is allocated to be able to fill the cluster up to the cluster end, i.e.
+ * 

Re: [PATCH v3 4/6] util: implement seqcache

2021-03-12 Thread Max Reitz

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:

Implement cache for small sequential unaligned writes, so that they may
be cached until we get a complete cluster and then write it.

The cache is intended to be used for backup to qcow2 compressed target
opened in O_DIRECT mode, but can be reused for any similar (even not
block-layer related) task.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  include/qemu/seqcache.h |  42 +
  util/seqcache.c | 361 
  MAINTAINERS |   6 +
  util/meson.build|   1 +
  4 files changed, 410 insertions(+)
  create mode 100644 include/qemu/seqcache.h
  create mode 100644 util/seqcache.c


Looks quite good to me, thanks.  Nice explanations, too. :)

The only design question I have is whether there’s a reason you’re using 
a list again instead of a hash table.  I suppose we do need the list 
anyway because of the next_flush iterator, so using a hash table would 
only complicate the implementation, but still.


[...]


diff --git a/util/seqcache.c b/util/seqcache.c
new file mode 100644
index 00..d923560eab
--- /dev/null
+++ b/util/seqcache.c
@@ -0,0 +1,361 @@
+/*
+ * Cache for small sequential write requests.
+ *
+ * Copyright (c) 2021 Virtuozzo International GmbH.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ *
+ *
+ * = Description =
+ *
+ * SeqCache is an abbreviation for Sequential Cache.
+ *
+ * The Cache is intended to improve performance of small unaligned sequential
+ * writes. Cache has a cluster_size parameter and the unit of caching is 
aligned
+ * cluster.  Cache has a list of cached clusters, several "finished" ones and 
at
+ * most one "unfinished". "unfinished" cluster is a cluster where last byte of
+ * last write operation is cached and there is a free place after that byte to
+ * the end of cluster. "finished" clusters are just stored in cache to be read
+ * or flushed and don't allow any writes to them.
+ *
+ * If write to the cache intersects cluster bounds, it's splat into several


*split

(Though I like “splat”.  Sounds like a wet blotch.)


+ * requests by cluster bounds. So, consider a write that doesn't intersect
+ * cluster bounds to describe the whole picture:
+ *
+ * There are two cases allowed:
+ *
+ * 1. Sequential write to "unfinished" cluster. Actually it's a write 
sequential
+ *previous write. The data goes to "unfinished" cluster. If "unfinished" is
+ *filled up to the cluster bound it becomes "finished".
+ *
+ * 2. Write to new cluster, not existing in the cache. It may be sequential to
+ *previous write or not. Current "unfinshed" cluster (if exists) becomes


*unfinished


+ *"finished" and new "unfinished" cluster is started. Note also that offset
+ *of the write to new cluster is not required to be aligned.
+ *
+ *Any other write operation (non-sequential write to "unfinished" cluster
+ *or write to any of "finished" clusters) will crash.
+ */
+
+#include "qemu/osdep.h"
+
+#include "qemu/queue.h"
+#include "qemu/seqcache.h"
+
+/*
+ * Cluster
+ *
+ * Representation of one cached cluster, aligned to SeqCache::cluster_size.
+ * Caches only one subregion of the cluster, started at @offset (may be
+ * unaligned to cluster_size) and of @bytes length (may be unaligned as well).
+ * The whole subregion always lay in one aligned cluster:
+ *
+ *  QEMU_ALIGN_DOWN(offset, cluster_size) ==
+ *  QEMU_ALIGN_DOWN(offset + bytes - 1, cluster_size)
+ *
+ * @buf is allocated to be able to fill the cluster up to the cluster end, i.e.
+ * allocated @buf length is at least:
+ *
+ *  cluster_size - offset % cluster_size
+ */
+typedef struct Cluster {
+   int64_t offset;
+   int64_t bytes;
+   uint8_t *buf;
+   QSIMPLEQ_ENTRY(Cluster) entry;
+} Cluster;
+
+/*
+ * SeqCache caches small sequential writes into "unfinished" @cur_write 
cluster,
+ * until entire cluster (of @cluster_size bytes) is filled 

Re: [PULL v2 36/38] hw/block/nvme: support namespace attachment command

2021-03-12 Thread Peter Maydell
On Tue, 9 Mar 2021 at 11:46, Klaus Jensen  wrote:
>
> From: Minwoo Im 
>
> This patch supports Namespace Attachment command for the pre-defined
> nvme-ns device nodes.  Of course, attach/detach namespace should only be
> supported in case 'subsys' is given.  This is because if we detach a
> namespace from a controller, somebody needs to manage the detached, but
> allocated namespace in the NVMe subsystem.
>
> As command effect for the namespace attachment command is registered,
> the host will be notified that namespace inventory is changed so that
> host will rescan the namespace inventory after this command.  For
> example, kernel driver manages this command effect via passthru IOCTL.

> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 85a7b5a14f4e..1287bc2cd17a 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -235,6 +235,11 @@ static inline void nvme_ns_attach(NvmeCtrl *n, 
> NvmeNamespace *ns)
>  n->namespaces[nvme_nsid(ns) - 1] = ns;
>  }
>
> +static inline void nvme_ns_detach(NvmeCtrl *n, NvmeNamespace *ns)
> +{
> +n->namespaces[nvme_nsid(ns) - 1] = NULL;
> +}

Hi; Coverity complains about a possible array overflow both here
in nvme_ns_detach() (CID 1450757) and in nvme_ns_attach() (CID 1450758):
nvme_nsid() can return -1, but this code does not check for that.

If these functions both assume that their ns argument is non-NULL,
then adding an "assert(ns)" would probably placate Coverity and also
would mean that any bugs elsewhere resulting in accidentally passing
a NULL pointer would result in a clean assertion failure rather than
memory corruption. (Or you could directly assert that the array index
is in-bounds, I guess.)

thanks
-- PMM



Re: [PULL v2 19/38] hw/block/nvme: align zoned.zasl with mdts

2021-03-12 Thread Peter Maydell
On Tue, 9 Mar 2021 at 11:45, Klaus Jensen  wrote:
>
> From: Klaus Jensen 
>
> ZASL (Zone Append Size Limit) is defined exactly like MDTS (Maximum Data
> Transfer Size), that is, it is a value in units of the minimum memory
> page size (CAP.MPSMIN) and is reported as a power of two.
>
> The 'mdts' nvme device parameter is specified as in the spec, but the
> 'zoned.append_size_limit' parameter is specified in bytes. This is
> suboptimal for a number of reasons:
>
>   1. It is just plain confusing wrt. the definition of mdts.
>   2. There is a lot of complexity involved in validating the value; it
>  must be a power of two, it should be larger than 4k, if it is zero
>  we set it internally to mdts, but still report it as zero.
>   3. While "hw/block/nvme: improve invalid zasl value reporting"
>  slightly improved the handling of the parameter, the validation is
>  still wrong; it does not depend on CC.MPS, it depends on
>  CAP.MPSMIN. And we are not even checking that it is actually less
>  than or equal to MDTS, which is kinda the *one* condition it must
>  satisfy.
>
> Fix this by defining zasl exactly like mdts and checking the one thing
> that it must satisfy (that it is less than or equal to mdts). Also,
> change the default value from 128KiB to 0 (aka, whatever mdts is).

> @@ -2144,10 +2142,9 @@ static uint16_t nvme_do_write(NvmeCtrl *n, NvmeRequest 
> *req, bool append,
>  goto invalid;
>  }
>
> -if (nvme_l2b(ns, nlb) > (n->page_size << n->zasl)) {
> -trace_pci_nvme_err_append_too_large(slba, nlb, n->zasl);
> -status = NVME_INVALID_FIELD;
> -goto invalid;
> +if (n->params.zasl && data_size > n->page_size << 
> n->params.zasl) {
> +trace_pci_nvme_err_zasl(data_size);
> +return NVME_INVALID_FIELD | NVME_DNR;
>  }
>
>  slba = zone->w_ptr;

Hi; Coverity points out a possible overflow here (CID 1450756):
n->page_size is a uint32_t, and n->params.zasl is a uint8_t, so
the "n->page_size << n->params.zasl" will be done as 32-bit arithmetic;
but it is then compared against a uint64_t data_size.

Is this an overflow that can never happen (ie a false positive), or
should the RHS of the comparison be done as 64-bit arithmetic by
adding a cast ?

thanks
-- PMM



Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard

2021-03-12 Thread Vladimir Sementsov-Ogievskiy

12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote:

12.03.2021 14:17, Max Reitz wrote:

On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:

11.03.2021 22:58, Max Reitz wrote:

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:

There is a bug in qcow2: host cluster can be discarded (refcount
becomes 0) and reused during data write. In this case data write may


[..]


@@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
         if (refcount == 0) {
             void *table;
+            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
+
+            if (infl) {
+                infl->refcount_zero = true;
+                infl->type = type;
+                continue;
+            }


I don’t understand what this is supposed to do exactly.  It seems like it wants 
to keep metadata structures in the cache that are still in use (because 
dropping them from the caches is what happens next), but users of metadata 
structures won’t set in-flight counters for those metadata structures, will 
they?


Don't follow.

We want the code in "if (refcount == 0)" to be triggered only when the full reference count 
of the host cluster becomes 0, including inflight-write-cnt. So, if at this point 
inflight-write-cnt is not 0, we postpone freeing the host cluster; it will be done later from 
the "slow path" in update_inflight_write_cnt().


But the code under “if (refcount == 0)” doesn’t free anything, does it?  All I 
can see is code to remove metadata structures from the metadata caches (if the 
discarded cluster was an L2 table or a refblock), and finally the discard on 
the underlying file.  I don’t see how that protocol-level discard has anything 
to do with our problem, though.


Hmm. Still, if we do this discard, and then our in-flight write, we'll have 
data instead of a hole. Not a big deal, but seems better to postpone discard.

On the other hand, clearing caches is OK, as it's related only to 
qcow2-refcount, not to inflight-write-cnt



As far as I understand, the freeing happens immediately above the “if (refcount == 
0)” block by s->set_refcount() setting the refcount to 0. (including updating 
s->free_cluster_index if the refcount is 0).


Hmm.. And that (setting s->free_cluster_index) is what I should actually prevent 
until the total reference count becomes zero.

And about s->set_refcount(): it only updates the refcount itself, and doesn't free 
anything.




So, it is more correct like this:

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 464d133368..1da282446d 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1012,21 +1012,12 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
         } else {
             refcount += addend;
         }
-        if (refcount == 0 && cluster_index < s->free_cluster_index) {
-            s->free_cluster_index = cluster_index;
-        }
         s->set_refcount(refcount_block, block_index, refcount);
 
         if (refcount == 0) {
             void *table;
             Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
 
-            if (infl) {
-                infl->refcount_zero = true;
-                infl->type = type;
-                continue;
-            }
-
             table = qcow2_cache_is_table_offset(s->refcount_block_cache,
                                                 offset);
             if (table != NULL) {
@@ -1040,6 +1031,16 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
                 qcow2_cache_discard(s->l2_table_cache, table);
             }
 
+            if (infl) {
+                infl->refcount_zero = true;
+                infl->type = type;
+                continue;
+            }
+
+            if (cluster_index < s->free_cluster_index) {
+                s->free_cluster_index = cluster_index;
+            }
+
             if (s->discard_passthrough[type]) {
                 update_refcount_discard(bs, cluster_offset, s->cluster_size);
             }



--
Best regards,
Vladimir



Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard

2021-03-12 Thread Vladimir Sementsov-Ogievskiy

12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote:

12.03.2021 14:17, Max Reitz wrote:

On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:

11.03.2021 22:58, Max Reitz wrote:

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:

There is a bug in qcow2: host cluster can be discarded (refcount
becomes 0) and reused during data write. In this case data write may


[..]


@@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
         if (refcount == 0) {
             void *table;
+            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
+
+            if (infl) {
+                infl->refcount_zero = true;
+                infl->type = type;
+                continue;
+            }


I don’t understand what this is supposed to do exactly.  It seems like it wants 
to keep metadata structures in the cache that are still in use (because 
dropping them from the caches is what happens next), but users of metadata 
structures won’t set in-flight counters for those metadata structures, will 
they?


Don't follow.

We want the code in "if (refcount == 0)" to be triggered only when the full reference count 
of the host cluster becomes 0, including inflight-write-cnt. So, if at this point 
inflight-write-cnt is not 0, we postpone freeing the host cluster; it will be done later from 
the "slow path" in update_inflight_write_cnt().


But the code under “if (refcount == 0)” doesn’t free anything, does it?  All I 
can see is code to remove metadata structures from the metadata caches (if the 
discarded cluster was an L2 table or a refblock), and finally the discard on 
the underlying file.  I don’t see how that protocol-level discard has anything 
to do with our problem, though.


Hmm. Still, if we do this discard, and then our in-flight write, we'll have 
data instead of a hole. Not a big deal, but seems better to postpone discard.

On the other hand, clearing caches is OK, as it's related only to 
qcow2-refcount, not to inflight-write-cnt



As far as I understand, the freeing happens immediately above the “if (refcount == 
0)” block by s->set_refcount() setting the refcount to 0. (including updating 
s->free_cluster_index if the refcount is 0).


Hmm.. And that (setting s->free_cluster_index) is what I should actually prevent 
until the total reference count becomes zero.

And about s->set_refcount(): it only updates the refcount itself, and doesn't free 
anything.





Looking now at this place:


    if (refcount == 0 && cluster_index < s->free_cluster_index) {
        s->free_cluster_index = cluster_index;
    }
    s->set_refcount(refcount_block, block_index, refcount);

    if (refcount == 0) {
        void *table;
        Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);

        if (infl) {
            infl->refcount_zero = true;
            infl->type = type;
            continue;
        }

        table = qcow2_cache_is_table_offset(s->refcount_block_cache,
                                            offset);
        if (table != NULL) {
            qcow2_cache_put(s->refcount_block_cache, &refcount_block);
            old_table_index = -1;
            qcow2_cache_discard(s->refcount_block_cache, table);
        }

        table = qcow2_cache_is_table_offset(s->l2_table_cache, offset);
        if (table != NULL) {
            qcow2_cache_discard(s->l2_table_cache, table);
        }

        if (s->compressed_cache) {
            seqcache_discard_cluster(s->compressed_cache, cluster_offset);
        }

        if (s->discard_passthrough[type]) {
            update_refcount_discard(bs, cluster_offset, s->cluster_size);
        }
    }


Hmm. Is it OK that we use "offset" to discard qcow2 metadata caches? offset is 
the start of the whole loop and is not updated on iterations. Isn't it more correct to 
use cluster_offset here? Or are we sure that refcount and L2 metadata are always discarded 
by exactly one cluster? Then we are OK, but it's still worth an assertion that offset == 
cluster_offset.

--
Best regards,
Vladimir



Re: [RFC] nbd: decouple reconnect from drain

2021-03-12 Thread Vladimir Sementsov-Ogievskiy

10.03.2021 12:32, Roman Kagan wrote:

The NBD connect coroutine takes an extra in_flight reference as if it
were a request handler.  This prevents drain from completing until the
connection coroutine releases the reference.

When NBD is configured to reconnect, however, this appears to be fatal
to the reconnection idea: the drain procedure wants the reconnection to
be suspended, but this is only possible if the in-flight requests are
canceled.


As I remember from our conversation, the problem is not that we don't reconnect 
during a drained section, but exactly that we cancel requests on drained_begin, 
starting from commit 8c517de24a8a1dcbeb.

This is not a problem in scenarios where reconnect is a rare case and a failed request is 
acceptable. But if we have a bad connection and requests often have to wait for reconnect 
(so it may be considered a kind of "latency"), then cancelling the waiting 
requests on any drain() really kills the reconnect feature.



Fix this by making the connection coroutine stop messing with the
in-flight counter.  Instead, certain care is taken to properly move the
reconnection stuff from one aio_context to another in
.bdrv_{attach,detach}_aio_context callbacks.

Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
Signed-off-by: Roman Kagan 
---
This patch passes the regular make check but fails some extra iotests,
in particular 277.  It obviously lacks more robust interaction with the
connection thread (which in general is fairly complex and hard to reason
about), and perhaps has some other drawbacks, so I'll work on this
further, but I'd appreciate some feedback on whether the idea is sound.



In general I like the idea. The logic around drain in NBD is overcomplicated, 
and I never liked the fact that nbd_read_eof() does a dec-inc in_flight section. 
Some notes:

1. I hope the patch can be divided into several, as there are several 
things being done:

- removing use of in_flight counter introduced by 5ad81b4946
- do reconnect during drained section
- stop cancelling requests on .drained_begin

2. 5ad81b4946 was needed to make nbd_client_attach_aio_context() reenter 
connection_co only in one (or two) possible places, not on any yield.. And I 
don't see how it is achieved now.. This should be described in commit msg..

3. About cancelling requests on drained_begin: the behavior was introduced by 
8c517de24a8a1dcbeb to fix a deadlock. So, if the deadlock is now fixed another 
way, let's change the logic (don't cancel requests) in a separate patch, and 
reference commit 8c517de24a8a1dcbeb and the commit that fixes the deadlock the 
other way in the commit message.


--
Best regards,
Vladimir



Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard

2021-03-12 Thread Vladimir Sementsov-Ogievskiy

12.03.2021 14:17, Max Reitz wrote:

On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:

11.03.2021 22:58, Max Reitz wrote:

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:

There is a bug in qcow2: host cluster can be discarded (refcount
becomes 0) and reused during data write. In this case data write may


[..]


@@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT 
update_refcount(BlockDriverState *bs,
  if (refcount == 0) {
  void *table;
+    Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
+
+    if (infl) {
+    infl->refcount_zero = true;
+    infl->type = type;
+    continue;
+    }


I don’t understand what this is supposed to do exactly.  It seems like it wants 
to keep metadata structures in the cache that are still in use (because 
dropping them from the caches is what happens next), but users of metadata 
structures won’t set in-flight counters for those metadata structures, will 
they?


I don't follow.

We want the code in "if (refcount == 0)" to be triggered only when the full reference 
count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point 
inflight-write-cnt is not 0, we postpone freeing the host cluster; it will be done later 
from the "slow path" in update_inflight_write_cnt().


But the code under “if (refcount == 0)” doesn’t free anything, does it?  All I 
can see is code to remove metadata structures from the metadata caches (if the 
discarded cluster was an L2 table or a refblock), and finally the discard on 
the underlying file.  I don’t see how that protocol-level discard has anything 
to do with our problem, though.


Hmm. Still, if we do this discard and then our in-flight write lands, we'll have 
data instead of a hole. Not a big deal, but it seems better to postpone the discard.

On the other hand, clearing the caches is OK, as that is related only to the 
qcow2 refcount, not to inflight-write-cnt.



As far as I understand, the freeing happens immediately above the “if (refcount == 
0)” block, by s->set_refcount() setting the refcount to 0 (including updating 
s->free_cluster_index if the refcount is 0).


Hmm. And that (setting s->free_cluster_index) is what I should actually prevent 
until the total reference count becomes zero.

And about s->set_refcount(): it only updates the refcount itself and doesn't free 
anything.


--
Best regards,
Vladimir



Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard

2021-03-12 Thread Max Reitz

On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:

11.03.2021 22:58, Max Reitz wrote:

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:

There is a bug in qcow2: host cluster can be discarded (refcount
becomes 0) and reused during data write. In this case data write may
pollute another cluster (recently allocated) or even metadata.


I was about to ask whether we couldn’t somehow let discard requests 
check overlapping I/O with the tracked_request infrastructure from 
block/io.c (or perhaps even just let them be serializing).


But I suppose there may be other reasons for clusters being discarded 
other than explicit discard requests, so we have to have something in 
qcow2...


For example, a host cluster which contains exactly one compressed cluster 
may be discarded when the corresponding guest cluster is rewritten by a 
non-compressed write.


Also (not a very good argument for fixing an existing bug, but still), 
serializing will not help with the compressed cache, because with the compressed 
cache we'll have data writes (cache flushes) not related to the current 
in-flight request. And that leads us to something like serializing 
operations on the .file child of the qcow2 node. But we can't do that, as 
"discarding" a host cluster is an operation on metadata, which may perform no 
operation on the underlying .file child.


Yes, makes sense.  I think it is a good argument, because the point is 
that qcow2 host cluster discards are not necessarily related to guest 
discards.  The compressed cache is just one new example for that.



To fix the issue let's track inflight writes to host cluster in the
hash table and consider new counter when discarding and reusing host
clusters.


This didn’t really explain anything that would help me understand 
what’s going on in this patch.  The patch itself doesn’t really help 
me with comments either.  It’s possible that I’m just daft, but 
honestly I have a hard time really understanding what’s supposed to be 
going on.


Sorry for that. I trust your experience in understanding my patches, so this 
shows that something is wrong with the patch itself :)




Coming back from jumping all over this patch, I also wonder why this 
isn’t split in multiple patches, where the first introduces the 
infrastructure and explains exactly what’s going on, and the next 
patches make use of it so I could understand what each check etc. is 
supposed to represent and do.


OK




To serve as an example, after reading through the patch, I still have 
no idea how it prevents that discard problem you’re trying to fix.  
Maybe I’m lazy, but I would have liked exactly that to be explained by 
this commit message.  Actually, I don’t even understand the problem 
just from this commit message alone, but I could resort to HEAD^ and 
some thinking.  Not sure if that’s ideal, though.


The problem:

1. Start write to qcow2. Assume guest cluster G, whose corresponding host 
cluster is H.


2. The write request comes to the point of writing data to .file. The 
write to .file is started and the qcow2 mutex is unlocked.


3. At this time the refcount of H becomes 0. For example, it may be due to 
a discard operation on the qcow2 node, rewriting compressed data with a normal 
write, or some operation with snapshots.


4. Next, some operation occurs and leads to allocation of H for other 
needs: it may be another write-to-qcow2-node operation, or 
allocation of an L2 table or some other data or metadata cluster.


5. So, at this point H is used for something else. Assume an L2 table is 
written into H.


6. And now our write from [2] finishes, and pollutes the L2 table in H. 
That's the bug.


Understood.

The problem becomes more likely with the compressed write cache, because 
any flush operation may trigger deferred data writes, so the problem can be 
reproduced like this:


1. Write to guest cluster G2, which triggers flushing of data to host 
cluster H, which is unrelated to G2.


2. Discard cluster G in parallel

[...]

So, the problem is simply triggered by a parallel write and discard to 
different guest clusters. That's why I should do something about this old 
(and most probably never triggered) problem in the qcow2 driver prior to 
implementing the compressed cache.


OK!  That makes sense.

Hmm, probably we can avoid all these difficulties by implementing the 
compressed cache as a filter driver between qcow2 and its .file node.


As I see it, you’ve now opened the can of worms, and I’m not sure we can 
just close it and say we haven’t seen any problem. :)


So, regardless of the compressed cache, I think this should be fixed.

I tried to do it and encountered some problems, but I don't remember exactly 
what. Probably I should return to that attempt. Some obvious problems:


1. Metadata writes will go through the cache, and we should handle that somehow 
so as not to break the sequence of unaligned chunks. The simplest way is to add 
a BDRV_REQ_COLORED flag and do compressed data writes to .file with this 
flag. And the cache filter would track this flag for sequential caching.
Re: [PATCH] stream: Don't crash when node permission is denied

2021-03-12 Thread Vladimir Sementsov-Ogievskiy

09.03.2021 20:34, Kevin Wolf wrote:

The image streaming block job restricts shared permissions of the nodes
it accesses. This can obviously fail when other users already got these
permissions. &error_abort is therefore wrong and can crash. Handle these
errors gracefully and just fail starting the block job.

Reported-by: Nini Gu 
Signed-off-by: Kevin Wolf 
---
Based-on: 20210309121814.31078-1-kw...@redhat.com
('storage-daemon: Call job_cancel_sync_all() on shutdown')

  block/stream.c| 15 +++
  tests/qemu-iotests/tests/qsd-jobs | 20 
  tests/qemu-iotests/tests/qsd-jobs.out | 10 ++
  3 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/block/stream.c b/block/stream.c
index 1fa742b0db..97bee482dc 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -206,7 +206,7 @@ void stream_start(const char *job_id, BlockDriverState *bs,
const char *filter_node_name,
Error **errp)
  {
-StreamBlockJob *s;
+StreamBlockJob *s = NULL;
  BlockDriverState *iter;
  bool bs_read_only;
  int basic_flags = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED;
@@ -214,6 +214,7 @@ void stream_start(const char *job_id, BlockDriverState *bs,
  BlockDriverState *cor_filter_bs = NULL;
  BlockDriverState *above_base;
  QDict *opts;
+int ret;
  
  assert(!(base && bottom));

  assert(!(backing_file_str && bottom));
@@ -303,7 +304,7 @@ void stream_start(const char *job_id, BlockDriverState *bs,
   * queried only at the job start and then cached.
   */
  if (block_job_add_bdrv(&s->common, "active node", bs, 0,
-   basic_flags | BLK_PERM_WRITE, &error_abort)) {
+   basic_flags | BLK_PERM_WRITE, errp)) {


While you're at it, you may also do ", errp) < 0) {" here, for the more usual 
pattern of checking errors.


  goto fail;
  }
  
@@ -320,8 +321,11 @@ void stream_start(const char *job_id, BlockDriverState *bs,

  for (iter = bdrv_filter_or_cow_bs(bs); iter != base;
   iter = bdrv_filter_or_cow_bs(iter))
  {
-block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
-   basic_flags, &error_abort);
+ret = block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
+ basic_flags, errp);
+if (ret < 0) {
+goto fail;
+}
  }
  
  s->base_overlay = base_overlay;

@@ -337,6 +341,9 @@ void stream_start(const char *job_id, BlockDriverState *bs,
  return;
  
  fail:

+if (s) {
+job_early_fail(&s->common.job);
+}
  if (cor_filter_bs) {
  bdrv_cor_filter_drop(cor_filter_bs);
  }
diff --git a/tests/qemu-iotests/tests/qsd-jobs 
b/tests/qemu-iotests/tests/qsd-jobs
index 1a1c534fac..972b6b3898 100755
--- a/tests/qemu-iotests/tests/qsd-jobs
+++ b/tests/qemu-iotests/tests/qsd-jobs
@@ -30,6 +30,7 @@ status=1  # failure is the default!
  _cleanup()
  {
  _cleanup_test_img
+rm -f "$SOCK_DIR/nbd.sock"
  }
  trap "_cleanup; exit \$status" 0 1 2 3 15
  
@@ -59,6 +60,25 @@ $QSD --chardev stdio,id=stdio --monitor chardev=stdio \

  {"execute": "quit"}
  EOF
  
+echo

+echo "=== Streaming can't get permission on base node ==="
+echo
+
+# Just make sure that this doesn't crash
+$QSD --chardev stdio,id=stdio --monitor chardev=stdio \
+--blockdev node-name=file_base,driver=file,filename="$TEST_IMG.base" \
+--blockdev node-name=fmt_base,driver=qcow2,file=file_base \
+--blockdev node-name=file_overlay,driver=file,filename="$TEST_IMG" \
+--blockdev 
node-name=fmt_overlay,driver=qcow2,file=file_overlay,backing=fmt_base \
+--nbd-server addr.type=unix,addr.path="$SOCK_DIR/nbd.sock" \
+--export type=nbd,id=export1,node-name=fmt_base,writable=on,name=export1 \


Another option is to use blkdebug with take-child-perms and/or 
unshare-child-perms set. Just a note, nbd is good too.
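As a sketch of that alternative (untested; I'm writing the blkdebug option 
spelling from memory, so the keyval form `take-child-perms.0=write` would need 
checking against the QAPI schema):

```sh
# Instead of the NBD export holding write permission on fmt_base, a
# blkdebug node could grab it, so stream's permission request fails:
$QSD \
  --blockdev node-name=file_base,driver=file,filename="$TEST_IMG.base" \
  --blockdev node-name=fmt_base,driver=qcow2,file=file_base \
  --blockdev node-name=dbg,driver=blkdebug,image=fmt_base,take-child-perms.0=write
```

This avoids the NBD socket entirely, at the cost of depending on a debug driver.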


+<


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH 2/9] block: Replaced qemu_mutex_lock calls with QEMU_LOCK_GUARD

2021-03-12 Thread Vladimir Sementsov-Ogievskiy

11.03.2021 06:15, Mahmoud Mandour wrote:

Replaced various qemu_mutex_lock/qemu_mutex_unlock calls with
lock guard macros (QEMU_LOCK_GUARD() and WITH_QEMU_LOCK_GUARD).
This slightly simplifies the code by eliminating calls to
qemu_mutex_unlock and eliminates goto paths.

Signed-off-by: Mahmoud Mandour 
---
  block/curl.c |  13 ++--
  block/nbd.c  | 188 ---


It would be better to make this two separate patches, I think.


  2 files changed, 95 insertions(+), 106 deletions(-)

diff --git a/block/curl.c b/block/curl.c
index 4ff895df8f..56a217951a 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -832,12 +832,12 @@ static void curl_setup_preadv(BlockDriverState *bs, 
CURLAIOCB *acb)
  uint64_t start = acb->offset;
  uint64_t end;
  
-qemu_mutex_lock(&s->mutex);

+QEMU_LOCK_GUARD(&s->mutex);
  
  // In case we have the requested data already (e.g. read-ahead),

  // we can just call the callback and be done.
  if (curl_find_buf(s, start, acb->bytes, acb)) {
-goto out;
+return;
  }
  
  // No cache found, so let's start a new request

@@ -852,7 +852,7 @@ static void curl_setup_preadv(BlockDriverState *bs, 
CURLAIOCB *acb)
  if (curl_init_state(s, state) < 0) {
  curl_clean_state(state);
  acb->ret = -EIO;
-goto out;
+return;
  }
  
  acb->start = 0;

@@ -867,7 +867,7 @@ static void curl_setup_preadv(BlockDriverState *bs, 
CURLAIOCB *acb)
  if (state->buf_len && state->orig_buf == NULL) {
  curl_clean_state(state);
  acb->ret = -ENOMEM;
-goto out;
+return;
  }
  state->acb[0] = acb;
  
@@ -880,14 +880,11 @@ static void curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)

  acb->ret = -EIO;
  
  curl_clean_state(state);

-goto out;
+return;
  }
  
  /* Tell curl it needs to kick things off */

  curl_multi_socket_action(s->multi, CURL_SOCKET_TIMEOUT, 0, &running);
-
-out:
-qemu_mutex_unlock(&s->mutex);
  }


This change is obvious and good.

  
  static int coroutine_fn curl_co_preadv(BlockDriverState *bs,

diff --git a/block/nbd.c b/block/nbd.c
index c26dc5a54f..28ba7aad61 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -407,27 +407,25 @@ static void *connect_thread_func(void *opaque)
  thr->sioc = NULL;
  }
  
-qemu_mutex_lock(&thr->mutex);

-
-switch (thr->state) {
-case CONNECT_THREAD_RUNNING:
-thr->state = ret < 0 ? CONNECT_THREAD_FAIL : CONNECT_THREAD_SUCCESS;
-if (thr->bh_ctx) {
-aio_bh_schedule_oneshot(thr->bh_ctx, thr->bh_func, thr->bh_opaque);
-
-/* play safe, don't reuse bh_ctx on further connection attempts */
-thr->bh_ctx = NULL;
+WITH_QEMU_LOCK_GUARD(&thr->mutex) {
+switch (thr->state) {
+case CONNECT_THREAD_RUNNING:
+thr->state = ret < 0 ? CONNECT_THREAD_FAIL : 
CONNECT_THREAD_SUCCESS;
+if (thr->bh_ctx) {
+aio_bh_schedule_oneshot(thr->bh_ctx, thr->bh_func, 
thr->bh_opaque);


over-80 line


+
+/* play safe, don't reuse bh_ctx on further connection 
attempts */


and here


+thr->bh_ctx = NULL;
+}
+break;
+case CONNECT_THREAD_RUNNING_DETACHED:
+do_free = true;
+break;
+default:
+abort();
  }
-break;
-case CONNECT_THREAD_RUNNING_DETACHED:
-do_free = true;
-break;
-default:
-abort();
  }
  
-qemu_mutex_unlock(&thr->mutex);
-
   if (do_free) {
  nbd_free_connect_thread(thr);
  }
@@ -443,36 +441,33 @@ nbd_co_establish_connection(BlockDriverState *bs, Error 
**errp)
  BDRVNBDState *s = bs->opaque;
  NBDConnectThread *thr = s->connect_thread;
  
-qemu_mutex_lock(&thr->mutex);

-
-switch (thr->state) {
-case CONNECT_THREAD_FAIL:
-case CONNECT_THREAD_NONE:
-error_free(thr->err);
-thr->err = NULL;
-thr->state = CONNECT_THREAD_RUNNING;
-qemu_thread_create(&thr->thread, "nbd-connect",
-   connect_thread_func, thr, QEMU_THREAD_DETACHED);
-break;
-case CONNECT_THREAD_SUCCESS:
-/* Previous attempt finally succeeded in background */
-thr->state = CONNECT_THREAD_NONE;
-s->sioc = thr->sioc;
-thr->sioc = NULL;
-yank_register_function(BLOCKDEV_YANK_INSTANCE(bs->node_name),
-   nbd_yank, bs);
-qemu_mutex_unlock(&thr->mutex);
-return 0;
-case CONNECT_THREAD_RUNNING:
-/* Already running, will wait */
-break;
-default:
-abort();
-}
-
-thr->bh_ctx = qemu_get_current_aio_context();
+WITH_QEMU_LOCK_GUARD(&thr->mutex) {
+switch (thr->state) {
+case CONNECT_THREAD_FAIL:
+case CONNECT_THREAD_NONE:
+error_free(thr->err);
+thr->err = NULL;
+thr->state = CONNECT_THREAD_RUNNING;
+   

Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard

2021-03-12 Thread Vladimir Sementsov-Ogievskiy

11.03.2021 22:58, Max Reitz wrote:

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:

There is a bug in qcow2: host cluster can be discarded (refcount
becomes 0) and reused during data write. In this case data write may
pollute another cluster (recently allocated) or even metadata.


I was about to ask whether we couldn’t somehow let discard requests check 
overlapping I/O with the tracked_request infrastructure from block/io.c (or 
perhaps even just let them be serializing).

But I suppose there may be other reasons for clusters being discarded other 
than explicit discard requests, so we have to have something in qcow2...


For example, a host cluster which contains exactly one compressed cluster may be 
discarded when the corresponding guest cluster is rewritten by a non-compressed write.

Also (not a very good argument for fixing an existing bug, but still), serializing will 
not help with the compressed cache, because with the compressed cache we'll have data 
writes (cache flushes) not related to the current in-flight request. And that leads us 
to something like serializing operations on the .file child of the qcow2 node. But we 
can't do that, as "discarding" a host cluster is an operation on metadata, which may 
perform no operation on the underlying .file child.




To fix the issue let's track inflight writes to host cluster in the
hash table and consider new counter when discarding and reusing host
clusters.


This didn’t really explain anything that would help me understand what’s going 
on in this patch.  The patch itself doesn’t really help me with comments 
either.  It’s possible that I’m just daft, but honestly I have a hard time 
really understanding what’s supposed to be going on.


Sorry for that. I trust your experience in understanding my patches, so this shows 
that something is wrong with the patch itself :)



Coming back from jumping all over this patch, I also wonder why this isn’t 
split in multiple patches, where the first introduces the infrastructure and 
explains exactly what’s going on, and the next patches make use of it so I 
could understand what each check etc. is supposed to represent and do.


OK




To serve as an example, after reading through the patch, I still have no idea 
how it prevents that discard problem you’re trying to fix.  Maybe I’m lazy, but 
I would have liked exactly that to be explained by this commit message.  
Actually, I don’t even understand the problem just from this commit message 
alone, but I could resort to HEAD^ and some thinking.  Not sure if that’s 
ideal, though.


The problem:

1. Start write to qcow2. Assume guest cluster G, whose corresponding host cluster 
is H.

2. The write request comes to the point of writing data to .file. The write to 
.file is started and the qcow2 mutex is unlocked.

3. At this time the refcount of H becomes 0. For example, it may be due to a discard 
operation on the qcow2 node, rewriting compressed data with a normal write, or some 
operation with snapshots.

4. Next, some operation occurs and leads to allocation of H for other 
needs: it may be another write-to-qcow2-node operation, or allocation of an L2 
table or some other data or metadata cluster.

5. So, at this point H is used for something else. Assume an L2 table is written 
into H.

6. And now our write from [2] finishes, and pollutes the L2 table in H. That's 
the bug.


The problem becomes more likely with the compressed write cache, because any 
flush operation may trigger deferred data writes, so the problem can be reproduced 
like this:

1. Write to guest cluster G2, which triggers flushing of data to host cluster 
H, which is unrelated to G2.

2. Discard cluster G in parallel

[...]

So, the problem is simply triggered by a parallel write and discard to different 
guest clusters. That's why I should do something about this old (and most 
probably never triggered) problem in the qcow2 driver prior to implementing the 
compressed cache.


Hmm, probably we can avoid all these difficulties by implementing the compressed 
cache as a filter driver between qcow2 and its .file node. I tried to do it and 
encountered some problems, but I don't remember exactly what. Probably I should 
return to that attempt. Some obvious problems:

1. Metadata writes will go through the cache, and we should handle that somehow so as 
not to break the sequence of unaligned chunks. The simplest way is to add a 
BDRV_REQ_COLORED flag and do compressed data writes to .file with this flag. And the 
cache filter would track this flag for sequential caching.

2. We can't enable caching by default this way, so the user will have to explicitly 
add a filter. And actually, it's a real problem of the qcow2 driver that in 
O_DIRECT mode it writes compressed data in small unaligned chunks. So it would be 
good to fix it inside qcow2.

With a filter we avoid the problem of parallel reuse of a host cluster, as all 
writes go through the cache and flushes will be serialized. So, the problem in 
the qcow2 driver is not enlarged by the cache and we could leave it unfixed.



So the commit message says 

Re: [PATCH v3 00/30] qapi/qom: QAPIfy --object and object-add

2021-03-12 Thread Peter Krempa
On Fri, Mar 12, 2021 at 09:46:54 +0100, Paolo Bonzini wrote:
> On 12/03/21 09:14, Markus Armbruster wrote:
> > Paolo Bonzini  writes:
> > 
> > > On 11/03/21 15:08, Markus Armbruster wrote:
> > > > > I would rather keep the OptsVisitor here.  Do the same check for JSON
> > > > > syntax that you have in qobject_input_visitor_new_str, and whenever
> > > > > you need to walk all -object arguments, use something like this:
> > > > > 
> > > > >   typedef struct ObjectArgument {
> > > > >   const char *id;
> > > > >   QDict *json;/* or NULL for QemuOpts */
> > > > >   QSIMPLEQ_ENTRY(ObjectArgument) next;
> > > > >   }
> > > > > 
> > > > > I already had patches in my queue to store -object in a GSList of
> > > > > dictionaries, changing it to use the above is easy enough.
> > > > 
> > > > I think I'd prefer following -display's precedence.  See my reply to
> > > > Kevin for details.
> > > 
> > > Yeah, I got independently to the same conclusion and posted patches
> > > for that.  I was scared that visit_type_ObjectOptions was too much for
> > > OptsVisitor but it seems to work...
> > 
> > We have reason to be scared.  I'll try to cover this in my review.
> 
> Yes, it's a good reason to possibly even delay those 3 patches to 6.1.

Is there a chance we could get the JSON syntax for -object for now? I
think it would simplify libvirt's code a bit and sidestep the issue of
converting the already existing parameters from the JSON form we have into
the command-line form.
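For what it's worth, the difference on libvirt's side would look roughly like 
this (a sketch assuming the series lands as posted; the `secret` object is just 
an example):

```sh
# Today: QemuOpts/keyval form, flat scalar keys only
qemu-system-x86_64 -object secret,id=sec0,data=letmein ...

# With this series: JSON form, detected by the leading '{',
# validated against the QAPI schema and allowing non-scalar properties
qemu-system-x86_64 -object '{"qom-type": "secret", "id": "sec0", "data": "letmein"}' ...
```

Libvirt already builds the JSON representation for object-add, so it could pass 
the same document on the command line instead of reformatting it.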




Re: [PATCH v3 00/30] qapi/qom: QAPIfy --object and object-add

2021-03-12 Thread Paolo Bonzini

On 12/03/21 09:14, Markus Armbruster wrote:

Paolo Bonzini  writes:


On 11/03/21 15:08, Markus Armbruster wrote:

I would rather keep the OptsVisitor here.  Do the same check for JSON
syntax that you have in qobject_input_visitor_new_str, and whenever
you need to walk all -object arguments, use something like this:

  typedef struct ObjectArgument {
  const char *id;
  QDict *json;/* or NULL for QemuOpts */
  QSIMPLEQ_ENTRY(ObjectArgument) next;
  }

I already had patches in my queue to store -object in a GSList of
dictionaries, changing it to use the above is easy enough.


I think I'd prefer following -display's precedence.  See my reply to
Kevin for details.


Yeah, I got independently to the same conclusion and posted patches
for that.  I was scared that visit_type_ObjectOptions was too much for
OptsVisitor but it seems to work...


We have reason to be scared.  I'll try to cover this in my review.


Yes, it's a good reason to possibly even delay those 3 patches to 6.1.

Paolo




Re: [PATCH v3 00/30] qapi/qom: QAPIfy --object and object-add

2021-03-12 Thread Markus Armbruster
Paolo Bonzini  writes:

> On 11/03/21 15:08, Markus Armbruster wrote:
>>> I would rather keep the OptsVisitor here.  Do the same check for JSON
>>> syntax that you have in qobject_input_visitor_new_str, and whenever
>>> you need to walk all -object arguments, use something like this:
>>>
>>>  typedef struct ObjectArgument {
>>>  const char *id;
>>>  QDict *json;/* or NULL for QemuOpts */
>>>  QSIMPLEQ_ENTRY(ObjectArgument) next;
>>>  }
>>>
>>> I already had patches in my queue to store -object in a GSList of
>>> dictionaries, changing it to use the above is easy enough.
>> 
>> I think I'd prefer following -display's precedence.  See my reply to
>> Kevin for details.
>
> Yeah, I got independently to the same conclusion and posted patches
> for that.  I was scared that visit_type_ObjectOptions was too much for 
> OptsVisitor but it seems to work...

We have reason to be scared.  I'll try to cover this in my review.




Re: [PATCH] fdc: fix floppy boot for Red Hat Linux 5.2

2021-03-12 Thread Thomas Huth

On 12/03/2021 07.32, John Snow wrote:

The image size indicates it's an 81 track floppy disk image, which we
don't have a listing for in the geometry table. When you force the drive
type to 1.44MB, it guesses the reasonably close 18/80. When the drive
type is allowed to auto-detect or set to 2.88, it guesses a very
incorrect geometry.

auto, 144 and 288 drive types get the right geometry with the new entry
in the table.

Reported-by: Michael Tokarev 
Signed-off-by: John Snow 
---
  hw/block/fdc.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/hw/block/fdc.c b/hw/block/fdc.c
index 198940e737..b2f26ba587 100644
--- a/hw/block/fdc.c
+++ b/hw/block/fdc.c
@@ -122,6 +122,7 @@ static const FDFormat fd_formats[] = {
  /* First entry is default format */
  /* 1.44 MB 3"1/2 floppy disks */
  { FLOPPY_DRIVE_TYPE_144, 18, 80, 1, FDRIVE_RATE_500K, }, /* 3.5" 2880 */
+{ FLOPPY_DRIVE_TYPE_144, 18, 81, 1, FDRIVE_RATE_500K, },
  { FLOPPY_DRIVE_TYPE_144, 20, 80, 1, FDRIVE_RATE_500K, }, /* 3.5" 3200 */
  { FLOPPY_DRIVE_TYPE_144, 21, 80, 1, FDRIVE_RATE_500K, },
  { FLOPPY_DRIVE_TYPE_144, 21, 82, 1, FDRIVE_RATE_500K, },


That whole table-based approach seems quite unreliable to me - I've seen 
floppy disks with 80, 81, 82 or sometimes even 83 tracks in the past, so I 
think we would do better with a more flexible way of guessing ... but for 
the time being, this is certainly a quick and easy fix that also should not 
have any negative impact, thus:


Reviewed-by: Thomas Huth