Re: [PATCH v3 26/30] qemu-img: Use user_creatable_process_cmdline() for --object
On 13/03/21 08:40, Markus Armbruster wrote:
>>> +            if (!user_creatable_add_from_str(optarg, &local_err)) {
>>> +                if (local_err) {
>>> +                    error_report_err(local_err);
>>> +                    exit(2);
>>> +                } else {
>>> +                    /* Help was printed */
>>> +                    exit(EXIT_SUCCESS);
>>> +                }
>>> +            }
>>> +            break;
>>> -        } break;
>>>          case OPTION_IMAGE_OPTS:
>>>              image_opts = true;
>>>              break;
>
> Why is this one different? The others all call user_creatable_process_cmdline().

It's to exit with status code 2 instead of 1.

Paolo
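Paolo's point is that img_compare already uses exit statuses 0 and 1 as comparison results ("identical" / "differ"), so an option-parsing failure must exit with 2, while printing help counts as success. A minimal stand-alone sketch of that pattern, with stub stand-ins (`add_from_str` and `handle_object_opt` are illustrative names, not QEMU's API):

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for user_creatable_add_from_str(): returns false on failure;
 * *errp stays NULL when the "failure" was just "help was printed". */
static bool add_from_str(const char *opt, const char **errp)
{
    if (strcmp(opt, "help") == 0) {
        puts("available object types: ...");
        return false;               /* help printed, not an error */
    }
    if (strchr(opt, '=') == NULL) {
        *errp = "expected key=value";
        return false;               /* real parse error */
    }
    return true;
}

/* img_compare-style handling: errors must map to status 2, because 0
 * and 1 already mean "identical" and "differ". */
static int handle_object_opt(const char *opt)
{
    const char *err = NULL;

    if (!add_from_str(opt, &err)) {
        if (err) {
            fprintf(stderr, "%s\n", err);
            return 2;               /* parse error */
        }
        return 0;                   /* help was printed: success */
    }
    return -1;                      /* keep parsing further options */
}
```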
Re: [PATCH v3 26/30] qemu-img: Use user_creatable_process_cmdline() for --object
Kevin Wolf writes: > This switches qemu-img from a QemuOpts-based parser for --object to > user_creatable_process_cmdline() which uses a keyval parser and enforces > the QAPI schema. > > Apart from being a cleanup, this makes non-scalar properties accessible. > > Signed-off-by: Kevin Wolf > Acked-by: Peter Krempa > --- > qemu-img.c | 251 ++--- > 1 file changed, 45 insertions(+), 206 deletions(-) > > diff --git a/qemu-img.c b/qemu-img.c > index e2952fe955..babb5573ab 100644 > --- a/qemu-img.c > +++ b/qemu-img.c > @@ -226,23 +226,6 @@ static void QEMU_NORETURN help(void) > exit(EXIT_SUCCESS); > } > > -static QemuOptsList qemu_object_opts = { > -.name = "object", > -.implied_opt_name = "qom-type", > -.head = QTAILQ_HEAD_INITIALIZER(qemu_object_opts.head), > -.desc = { > -{ } > -}, > -}; > - > -static bool qemu_img_object_print_help(const char *type, QemuOpts *opts) > -{ > -if (user_creatable_print_help(type, opts)) { > -exit(0); > -} > -return true; > -} > - > /* > * Is @optarg safe for accumulate_options()? > * It is when multiple of them can be joined together separated by ','. 
> @@ -566,14 +549,9 @@ static int img_create(int argc, char **argv) > case 'u': > flags |= BDRV_O_NO_BACKING; > break; > -case OPTION_OBJECT: { > -QemuOpts *opts; > -opts = qemu_opts_parse_noisily(&qemu_object_opts, > - optarg, true); > -if (!opts) { > -goto fail; > -} > -} break; > +case OPTION_OBJECT: > +user_creatable_process_cmdline(optarg); > +break; > } > } > > @@ -589,12 +567,6 @@ static int img_create(int argc, char **argv) > } > optind++; > > -if (qemu_opts_foreach(&qemu_object_opts, > - user_creatable_add_opts_foreach, > - qemu_img_object_print_help, &error_fatal)) { > -goto fail; > -} > - > /* Get image size, if specified */ > if (optind < argc) { > int64_t sval; > @@ -804,14 +776,9 @@ static int img_check(int argc, char **argv) > case 'U': > force_share = true; > break; > -case OPTION_OBJECT: { > -QemuOpts *opts; > -opts = qemu_opts_parse_noisily(&qemu_object_opts, > - optarg, true); > -if (!opts) { > -return 1; > -} > -} break; > +case OPTION_OBJECT: > +user_creatable_process_cmdline(optarg); > +break; > case OPTION_IMAGE_OPTS: > image_opts = true; > break; > @@ -831,12 +798,6 @@ static int img_check(int argc, char **argv) > return 1; > } > > -if (qemu_opts_foreach(&qemu_object_opts, > - user_creatable_add_opts_foreach, > - qemu_img_object_print_help, &error_fatal)) { > -return 1; > -} > - > ret = bdrv_parse_cache_mode(cache, &flags, &writethrough); > if (ret < 0) { > error_report("Invalid source cache option: %s", cache); > @@ -1034,14 +995,9 @@ static int img_commit(int argc, char **argv) > return 1; > } > break; > -case OPTION_OBJECT: { > -QemuOpts *opts; > -opts = qemu_opts_parse_noisily(&qemu_object_opts, > - optarg, true); > -if (!opts) { > -return 1; > -} > -} break; > +case OPTION_OBJECT: > +user_creatable_process_cmdline(optarg); > +break; > case OPTION_IMAGE_OPTS: > image_opts = true; > break; > @@ -1058,12 +1014,6 @@ static int img_commit(int argc, char **argv) > } > filename = argv[optind++]; > > -if (qemu_opts_foreach(&qemu_object_opts, > - user_creatable_add_opts_foreach, > - 
qemu_img_object_print_help, &error_fatal)) { > -return 1; > -} > - > flags = BDRV_O_RDWR | BDRV_O_UNMAP; > ret = bdrv_parse_cache_mode(cache, &flags, &writethrough); > if (ret < 0) { > @@ -1353,7 +1303,7 @@ static int check_empty_sectors(BlockBackend *blk, > int64_t offset, > /* > * Compares two images. Exit codes: > * > - * 0 - Images are identical > + * 0 - Images are identical or the requested help was printed > * 1 - Images differ > * >1 - Error occurred > */ > @@ -1423,15 +1373,21 @@ static int img_compare(int argc, char **argv) > case 'U': > force_share = true; > break; > -case OPTION_OBJECT: { > -QemuOpts *opts; > -opts = qemu_opts_parse_noisily(&qemu_object_opts, > - optarg, true); > -if (!opts) { > -ret = 2; > -
Re: [PATCH 2/9] block: Replaced qemu_mutex_lock calls with QEMU_LOCK_GUARD
Thank you for the fast review, and I'm sorry for the silly and obvious style errors. Unfortunately I did not notice the section on using the checkpatch script on the Contributing page of the wiki before submitting. But I assure you that such errors will not occur again. On Fri, Mar 12, 2021 at 12:23 PM Vladimir Sementsov-Ogievskiy < vsement...@virtuozzo.com> wrote: > 11.03.2021 06:15, Mahmoud Mandour wrote: > > Replaced various qemu_mutex_lock/qemu_mutex_unlock calls with > > lock guard macros (QEMU_LOCK_GUARD() and WITH_QEMU_LOCK_GUARD). > > This slightly simplifies the code by eliminating calls to > > qemu_mutex_unlock and eliminates goto paths. > > > > Signed-off-by: Mahmoud Mandour > > --- > > block/curl.c | 13 ++-- > > block/nbd.c | 188 --- > > Better would be to make two separate patches, I think. > > > 2 files changed, 95 insertions(+), 106 deletions(-) > > > > diff --git a/block/curl.c b/block/curl.c > > index 4ff895df8f..56a217951a 100644 > > --- a/block/curl.c > > +++ b/block/curl.c > > @@ -832,12 +832,12 @@ static void curl_setup_preadv(BlockDriverState > *bs, CURLAIOCB *acb) > > uint64_t start = acb->offset; > > uint64_t end; > > > > -qemu_mutex_lock(&s->mutex); > > +QEMU_LOCK_GUARD(&s->mutex); > > > > // In case we have the requested data already (e.g. read-ahead), > > // we can just call the callback and be done. 
> > if (curl_find_buf(s, start, acb->bytes, acb)) { > > -goto out; > > +return; > > } > > > > // No cache found, so let's start a new request > > @@ -852,7 +852,7 @@ static void curl_setup_preadv(BlockDriverState *bs, > CURLAIOCB *acb) > > if (curl_init_state(s, state) < 0) { > > curl_clean_state(state); > > acb->ret = -EIO; > > -goto out; > > +return; > > } > > > > acb->start = 0; > > @@ -867,7 +867,7 @@ static void curl_setup_preadv(BlockDriverState *bs, > CURLAIOCB *acb) > > if (state->buf_len && state->orig_buf == NULL) { > > curl_clean_state(state); > > acb->ret = -ENOMEM; > > -goto out; > > +return; > > } > > state->acb[0] = acb; > > > > @@ -880,14 +880,11 @@ static void curl_setup_preadv(BlockDriverState > *bs, CURLAIOCB *acb) > > acb->ret = -EIO; > > > > curl_clean_state(state); > > -goto out; > > +return; > > } > > > > /* Tell curl it needs to kick things off */ > > curl_multi_socket_action(s->multi, CURL_SOCKET_TIMEOUT, 0, > &running); > > - > > -out: > > -qemu_mutex_unlock(&s->mutex); > > } > > This change is obvious and good. > > > > > static int coroutine_fn curl_co_preadv(BlockDriverState *bs, > > diff --git a/block/nbd.c b/block/nbd.c > > index c26dc5a54f..28ba7aad61 100644 > > --- a/block/nbd.c > > +++ b/block/nbd.c > > @@ -407,27 +407,25 @@ static void *connect_thread_func(void *opaque) > > thr->sioc = NULL; > > } > > > > -qemu_mutex_lock(&thr->mutex); > > - > > -switch (thr->state) { > > -case CONNECT_THREAD_RUNNING: > > -thr->state = ret < 0 ? CONNECT_THREAD_FAIL : > CONNECT_THREAD_SUCCESS; > > -if (thr->bh_ctx) { > > -aio_bh_schedule_oneshot(thr->bh_ctx, thr->bh_func, > thr->bh_opaque); > > - > > -/* play safe, don't reuse bh_ctx on further connection > attempts */ > > -thr->bh_ctx = NULL; > > +WITH_QEMU_LOCK_GUARD(&thr->mutex) { > > +switch (thr->state) { > > +case CONNECT_THREAD_RUNNING: > > +thr->state = ret < 0 ? 
CONNECT_THREAD_FAIL : > CONNECT_THREAD_SUCCESS; > > +if (thr->bh_ctx) { > > +aio_bh_schedule_oneshot(thr->bh_ctx, thr->bh_func, > thr->bh_opaque); > > over-80 line > > > + > > +/* play safe, don't reuse bh_ctx on further connection > attempts */ > > and here > > > +thr->bh_ctx = NULL; > > +} > > +break; > > +case CONNECT_THREAD_RUNNING_DETACHED: > > +do_free = true; > > +break; > > +default: > > +abort(); > > } > > -break; > > -case CONNECT_THREAD_RUNNING_DETACHED: > > -do_free = true; > > -break; > > -default: > > -abort(); > > } > > > > -qemu_mutex_unlock(&thr->mutex); > > - > > if (do_free) { > > nbd_free_connect_thread(thr); > > } > > @@ -443,36 +441,33 @@ nbd_co_establish_connection(BlockDriverState *bs, > Error **errp) > > BDRVNBDState *s = bs->opaque; > > NBDConnectThread *thr = s->connect_thread; > > > > -qemu_mutex_lock(&thr->mutex); > > - > > -switch (thr->state) { > > -case CONNECT_THREAD_FAIL: > > -case CONNECT_THREAD_NONE: > > -error_free(thr->err); > > -thr->err = NULL; > > -thr->state = CONNECT_THREAD_RUNNING; > > -qemu_thread_create(, "nbd-connect", > > - connect_thread_func, thr,
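The transformation in this patch works because QEMU_LOCK_GUARD unlocks automatically when the guard goes out of scope, which is what lets every `goto out` become a plain `return`. A stand-alone imitation of that mechanism, using the GCC/Clang `cleanup` variable attribute as QEMU's lock guards do (`LOCK_GUARD` and `add_if_small` are illustrative names, not QEMU's):

```c
#include <pthread.h>

/* Unlock handler invoked by the cleanup attribute when the guard
 * variable leaves scope -- including via an early return. */
static void unlock_cleanup(pthread_mutex_t **m)
{
    pthread_mutex_unlock(*m);
}

/* Simplified imitation of QEMU_LOCK_GUARD(): lock now, unlock at the
 * end of the enclosing block, whichever exit path is taken. */
#define LOCK_GUARD(m) \
    __attribute__((cleanup(unlock_cleanup))) pthread_mutex_t *guard_ = (m); \
    pthread_mutex_lock(guard_)

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static int counter;

/* Early returns replace the old "goto out; ... out: unlock" pattern. */
static int add_if_small(int v)
{
    LOCK_GUARD(&mutex);
    if (counter > 100) {
        return -1;          /* guard unlocks here automatically */
    }
    counter += v;
    return counter;         /* ...and here */
}
```

This is also why the review flags style details so strictly: the guard's correctness depends only on scope, not on the control flow inside the function.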
Re: [PULL 00/38] Block layer patches and object-add QAPIfication
On Thu, 11 Mar 2021 at 14:48, Kevin Wolf wrote: > > The following changes since commit d689ecad073e0289afa8ca863e45879d719e5c21: > > Merge remote-tracking branch 'remotes/nvme/tags/nvme-next-pull-request' > into staging (2021-03-10 20:11:33 +) > > are available in the Git repository at: > > git://repo.or.cz/qemu/kevin.git tags/for-upstream > > for you to fetch changes up to 4756658df7d5104b36ee2f40f30f2d0f10225a53: > > qom: Add user_creatable_parse_str() (2021-03-11 13:13:49 +0100) > > > Block layer patches and object-add QAPIfication > > - QAPIfy object-add and --object for tools > - Add vhost-user-blk-test > - stream: Fail gracefully if permission is denied > - storage-daemon: Fix crash on quit when job is still running > - curl: Fix use after free > - Fix image creation option defaults that exist in both the format and > the protocol layer (e.g. 'cluster_size' in qcow2 and rbd; the qcow2 > default was incorrectly applied to the rbd layer) > This generates new warnings in 'make check': MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))} G_TEST_SRCDIR=/home/petmay01/linaro/qemu-for-merges/tests G_TEST_BUILDDIR=/home/petmay01/linaro/qemu-for-merges/build/all/tests tests/test-char --tap -k warning: The alias 'tty' is deprecated, use 'serial' instead PASS 1 test-char /char/null PASS 2 test-char /char/invalid PASS 3 test-char /char/ringbuf PASS 4 test-char /char/mux [etc] thanks -- PMM
Re: [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes
On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote: Compressed writes are unaligned to 512, which works very slow in O_DIRECT mode. Let's use the cache. Signed-off-by: Vladimir Sementsov-Ogievskiy --- block/coroutines.h | 3 + block/qcow2.h | 4 ++ block/qcow2-refcount.c | 10 +++ block/qcow2.c | 158 ++--- 4 files changed, 164 insertions(+), 11 deletions(-) diff --git a/block/coroutines.h b/block/coroutines.h index 4cfb4946e6..cb8e05830e 100644 --- a/block/coroutines.h +++ b/block/coroutines.h @@ -66,4 +66,7 @@ int coroutine_fn bdrv_co_readv_vmstate(BlockDriverState *bs, int coroutine_fn bdrv_co_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos); +int coroutine_fn qcow2_co_flush_compressed_cache(BlockDriverState *bs); +int generated_co_wrapper qcow2_flush_compressed_cache(BlockDriverState *bs); + #endif /* BLOCK_COROUTINES_INT_H */ diff --git a/block/qcow2.h b/block/qcow2.h index e18d8dfbae..8b8fed0950 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -28,6 +28,7 @@ #include "crypto/block.h" #include "qemu/coroutine.h" #include "qemu/units.h" +#include "qemu/seqcache.h" #include "block/block_int.h" //#define DEBUG_ALLOC @@ -422,6 +423,9 @@ typedef struct BDRVQcow2State { Qcow2CompressionType compression_type; GHashTable *inflight_writes_counters; + +SeqCache *compressed_cache; +int compressed_flush_ret; } BDRVQcow2State; typedef struct Qcow2COWRegion { diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c index 464d133368..218917308e 100644 --- a/block/qcow2-refcount.c +++ b/block/qcow2-refcount.c @@ -29,6 +29,7 @@ #include "qemu/bswap.h" #include "qemu/cutils.h" #include "trace.h" +#include "block/coroutines.h" static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size, uint64_t max); @@ -1040,6 +1041,10 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, qcow2_cache_discard(s->l2_table_cache, table); } +if (s->compressed_cache) { +seqcache_discard_cluster(s->compressed_cache, cluster_offset); +} + if 
(s->discard_passthrough[type]) { update_refcount_discard(bs, cluster_offset, s->cluster_size); } @@ -1349,6 +1354,11 @@ int coroutine_fn qcow2_write_caches(BlockDriverState *bs) BDRVQcow2State *s = bs->opaque; int ret; +ret = qcow2_flush_compressed_cache(bs); +if (ret < 0) { +return ret; +} + I wonder a bit why this function doesn’t work on a best-effort basis, but that isn’t your problem. ret = qcow2_cache_write(bs, s->l2_table_cache); if (ret < 0) { return ret; diff --git a/block/qcow2.c b/block/qcow2.c index 6ee94421e0..5f3713cd6f 100644 --- a/block/qcow2.c +++ b/block/qcow2.c @@ -42,6 +42,7 @@ #include "qapi/qapi-visit-block-core.h" #include "crypto.h" #include "block/aio_task.h" +#include "block/coroutines.h" /* Differences with QCOW: @@ -1834,6 +1835,10 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options, s->inflight_writes_counters = g_hash_table_new_full(g_int64_hash, g_int64_equal, g_free, g_free); +if (!has_data_file(bs) && (bs->file->bs->open_flags & BDRV_O_NOCACHE)) { Looks a bit like a layering violation, but I have no better proposal and you gave your reasoning, so, OK. I wonder whether this should be re-checked during reopen, though. +s->compressed_cache = seqcache_new(s->cluster_size); +} + return ret; fail: @@ -2653,6 +2658,91 @@ fail_nometa: return ret; } +/* + * Flush one cluster of compressed cache if exists. If @unfinished is true and + * there is no finished cluster to flush, flush the unfinished one. Otherwise + * flush only finished clusters. + * + * Return 0 if nothing to flush, 1 if one cluster successfully flushed and <0 on + * error. 
+ */ +static int coroutine_fn +qcow2_co_compressed_flush_one(BlockDriverState *bs, bool unfinished) +{ +BDRVQcow2State *s = bs->opaque; +int ret; +int64_t align = 1; +int64_t offset, bytes; +uint8_t *buf; + +if (!s->compressed_cache) { +return 0; +} + +if (!seqcache_get_next_flush(s->compressed_cache, &offset, &bytes, &buf, + &align)) +{ +return 0; +} + +qcow2_inflight_writes_inc(bs, offset, bytes); + +/* + * Don't try to align-up unfinished cached cluster, parallel write to it is + * possible! For finished host clusters data in the tail of the cluster will + * be never used. So, take some good alignment for speed. + * + * Note also, that seqcache guarantees that allocated size of @buf is enough + * to fill the cluster up to the end, so we are safe to align up with + * align
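The seqcache behaviour the patch relies on, buffering small unaligned compressed writes in memory per host cluster and flushing whole clusters so the O_DIRECT image file only sees aligned I/O, can be sketched in miniature. Everything below is an illustrative stand-in, not the real qemu/seqcache.h interface:

```c
#include <stdint.h>
#include <string.h>

#define CLUSTER_SIZE 8          /* tiny, for demonstration only */
#define MAX_CLUSTERS 16

/* One buffer per cached host cluster; base[i] < 0 means the slot is
 * free. A sketch: assumes the table never fills up. */
typedef struct SeqCache {
    int64_t base[MAX_CLUSTERS];
    uint8_t data[MAX_CLUSTERS][CLUSTER_SIZE];
} SeqCache;

static void seqcache_init(SeqCache *c)
{
    for (int i = 0; i < MAX_CLUSTERS; i++) {
        c->base[i] = -1;
    }
    memset(c->data, 0, sizeof(c->data));
}

static int slot_for(SeqCache *c, int64_t base)
{
    int free_slot = -1;
    for (int i = 0; i < MAX_CLUSTERS; i++) {
        if (c->base[i] == base) {
            return i;
        }
        if (c->base[i] < 0 && free_slot < 0) {
            free_slot = i;
        }
    }
    c->base[free_slot] = base;
    return free_slot;
}

/* Cache a write at any byte offset; it may span cluster boundaries. */
static void seqcache_write(SeqCache *c, int64_t offset,
                           const uint8_t *buf, size_t len)
{
    while (len) {
        int64_t base = offset - offset % CLUSTER_SIZE;
        size_t start = offset - base;
        size_t n = len < CLUSTER_SIZE - start ? len : CLUSTER_SIZE - start;

        memcpy(&c->data[slot_for(c, base)][start], buf, n);
        offset += n;
        buf += n;
        len -= n;
    }
}

/* Pop the lowest cached cluster for an aligned flush; -1 if empty. */
static int64_t seqcache_next_flush(SeqCache *c, uint8_t *out)
{
    int best = -1;
    for (int i = 0; i < MAX_CLUSTERS; i++) {
        if (c->base[i] >= 0 && (best < 0 || c->base[i] < c->base[best])) {
            best = i;
        }
    }
    if (best < 0) {
        return -1;
    }
    memcpy(out, c->data[best], CLUSTER_SIZE);
    int64_t base = c->base[best];
    c->base[best] = -1;
    memset(c->data[best], 0, CLUSTER_SIZE);
    return base;
}
```

The real cache additionally distinguishes finished from unfinished clusters and supports discarding a cached cluster when its refcount drops, which is what the `seqcache_discard_cluster()` call in update_refcount() above is for.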
Re: [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes
12.03.2021 21:15, Max Reitz wrote: @@ -1834,6 +1835,10 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options, s->inflight_writes_counters = g_hash_table_new_full(g_int64_hash, g_int64_equal, g_free, g_free); + if (!has_data_file(bs) && (bs->file->bs->open_flags & BDRV_O_NOCACHE)) { Looks a bit like a layering violation, but I have no better proposal and you gave your reasoning, so, OK. Probably better is to check the request_alignment of bs->file->bs. If it is > 1, then enable the cache. -- Best regards, Vladimir
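Vladimir's alternative, keying the cache on the child node's request alignment rather than peeking at BDRV_O_NOCACHE, could look roughly like this. The structures are minimal stand-ins for illustration, not QEMU's real BlockDriverState:

```c
#include <stdbool.h>

/* Minimal stand-ins for the QEMU structures involved (illustrative). */
typedef struct BlockLimits {
    unsigned request_alignment;      /* 1 means byte-granularity I/O */
} BlockLimits;

typedef struct BlockDriverState BlockDriverState;
struct BlockDriverState {
    BlockLimits bl;
    BlockDriverState *file_bs;       /* stands in for bs->file->bs */
    bool has_data_file;
};

static bool want_compressed_cache(const BlockDriverState *bs)
{
    /* Enable the cache when compressed data goes through bs->file and
     * that child cannot do byte-aligned writes (e.g. 512 or 4096 under
     * O_DIRECT), which is exactly when unaligned compressed writes
     * would be penalized. */
    return !bs->has_data_file && bs->file_bs->bl.request_alignment > 1;
}
```

This avoids the layering concern Max raised: the decision is based on a property the child already advertises through its block limits instead of on its open flags.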
Re: [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes
12.03.2021 21:15, Max Reitz wrote: On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote: Compressed writes are unaligned to 512, which works very slow in O_DIRECT mode. Let's use the cache. Signed-off-by: Vladimir Sementsov-Ogievskiy --- block/coroutines.h | 3 + block/qcow2.h | 4 ++ block/qcow2-refcount.c | 10 +++ block/qcow2.c | 158 ++--- 4 files changed, 164 insertions(+), 11 deletions(-) diff --git a/block/coroutines.h b/block/coroutines.h index 4cfb4946e6..cb8e05830e 100644 --- a/block/coroutines.h +++ b/block/coroutines.h @@ -66,4 +66,7 @@ int coroutine_fn bdrv_co_readv_vmstate(BlockDriverState *bs, int coroutine_fn bdrv_co_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos); +int coroutine_fn qcow2_co_flush_compressed_cache(BlockDriverState *bs); +int generated_co_wrapper qcow2_flush_compressed_cache(BlockDriverState *bs); + #endif /* BLOCK_COROUTINES_INT_H */ diff --git a/block/qcow2.h b/block/qcow2.h index e18d8dfbae..8b8fed0950 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -28,6 +28,7 @@ #include "crypto/block.h" #include "qemu/coroutine.h" #include "qemu/units.h" +#include "qemu/seqcache.h" #include "block/block_int.h" //#define DEBUG_ALLOC @@ -422,6 +423,9 @@ typedef struct BDRVQcow2State { Qcow2CompressionType compression_type; GHashTable *inflight_writes_counters; + + SeqCache *compressed_cache; + int compressed_flush_ret; } BDRVQcow2State; typedef struct Qcow2COWRegion { diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c index 464d133368..218917308e 100644 --- a/block/qcow2-refcount.c +++ b/block/qcow2-refcount.c @@ -29,6 +29,7 @@ #include "qemu/bswap.h" #include "qemu/cutils.h" #include "trace.h" +#include "block/coroutines.h" static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size, uint64_t max); @@ -1040,6 +1041,10 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, qcow2_cache_discard(s->l2_table_cache, table); } + if (s->compressed_cache) { + 
seqcache_discard_cluster(s->compressed_cache, cluster_offset); + } + if (s->discard_passthrough[type]) { update_refcount_discard(bs, cluster_offset, s->cluster_size); } @@ -1349,6 +1354,11 @@ int coroutine_fn qcow2_write_caches(BlockDriverState *bs) BDRVQcow2State *s = bs->opaque; int ret; + ret = qcow2_flush_compressed_cache(bs); + if (ret < 0) { + return ret; + } + I wonder a bit why this function doesn’t work on a best-effort basis, but that isn’t your problem. ret = qcow2_cache_write(bs, s->l2_table_cache); if (ret < 0) { return ret; diff --git a/block/qcow2.c b/block/qcow2.c index 6ee94421e0..5f3713cd6f 100644 --- a/block/qcow2.c +++ b/block/qcow2.c @@ -42,6 +42,7 @@ #include "qapi/qapi-visit-block-core.h" #include "crypto.h" #include "block/aio_task.h" +#include "block/coroutines.h" /* Differences with QCOW: @@ -1834,6 +1835,10 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options, s->inflight_writes_counters = g_hash_table_new_full(g_int64_hash, g_int64_equal, g_free, g_free); + if (!has_data_file(bs) && (bs->file->bs->open_flags & BDRV_O_NOCACHE)) { Looks a bit like a layering violation, but I have no better proposal and you gave your reasoning, so, OK. I wonder whether this should be re-checked during reopen, though. Hmm yes. Somehow I thought qcow2_do_open() will be called again, but it's about inactivation/invalidation, not about reopen. + s->compressed_cache = seqcache_new(s->cluster_size); + } + return ret; fail: @@ -2653,6 +2658,91 @@ fail_nometa: return ret; } +/* + * Flush one cluster of compressed cache if exists. If @unfinished is true and + * there is no finished cluster to flush, flush the unfinished one. Otherwise + * flush only finished clusters. + * + * Return 0 if nothing to flush, 1 if one cluster successfully flushed and <0 on + * error. 
+ */ +static int coroutine_fn +qcow2_co_compressed_flush_one(BlockDriverState *bs, bool unfinished) +{ + BDRVQcow2State *s = bs->opaque; + int ret; + int64_t align = 1; + int64_t offset, bytes; + uint8_t *buf; + + if (!s->compressed_cache) { + return 0; + } + + if (!seqcache_get_next_flush(s->compressed_cache, &offset, &bytes, &buf, + &align)) + { + return 0; + } + + qcow2_inflight_writes_inc(bs, offset, bytes); + + /* + * Don't try to align-up unfinished cached cluster, parallel write to it is + * possible! For finished host clusters data in the tail of the cluster will + * be never used. So, take some good alignment for speed. + * + * Note also, that seqcache guarantees that
Re: [PATCH v3 5/6] block-coroutine-wrapper: allow non bdrv_ prefix
On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote: We are going to reuse the script to generate a qcow2_ function in further commit. Prepare the script now. Signed-off-by: Vladimir Sementsov-Ogievskiy --- scripts/block-coroutine-wrapper.py | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) Reviewed-by: Max Reitz
Re: [PATCH] fdc: fix floppy boot for Red Hat Linux 5.2
On 3/12/21 3:01 AM, Thomas Huth wrote:
> That whole table-based approach seems quite unreliable to me - I've seen floppy disks with 80, 81, 82 or sometimes even 83 tracks in the past, so I think we would do better with a more flexible way of guessing ... but for the time being, this is certainly a quick and easy fix that also should not have any negative impact, thus:
>
> Reviewed-by: Thomas Huth

Yes, that's my thought. I can't personally justify tackling this hokey system -- I simply don't know enough about the history of the device to design a suitable replacement. So, it's a band-aid, but one that fits with the way the floppy has worked for quite a long time, and seems the least likely to bother anything further.

--js
Re: [PATCH v2(RFC) 0/3] qcow2: fix parallel rewrite and discard
12.03.2021 18:24, Kevin Wolf wrote: On 25.02.2021 at 12:52, Vladimir Sementsov-Ogievskiy wrote: Hi all! It turns out that nothing prevents discarding and reallocating a host cluster during data writing. This way a data write can pollute another newly allocated cluster of data or metadata. OK, v2 is an attempt to solve the problem with CoRwlock.. And it is marked RFC because of a lot of iotest failures.. Some of the problems with v2: 1. It's more complicated to make a test, as everything is blocking and I can't just break a write and do a discard.. I have to implement aio_discard in qemu-io and rewrite the test into several portions of io commands split by "sleep 1".. OK, it's not a big problem, and I've solved it. Right, this just demonstrates that it's doing what it's supposed to. 2. iotest 7 fails with several leaked clusters. It seems to depend on the fact that discard may be done in parallel with writes. Iotest 7 does snapshots, so I think the L1 table is updated by the moment the discard is finally unlocked.. But I didn't dig into it, it's all my assumptions. This one looks a bit odd, but it may be related to the bug in your code that you forgot that qcow2_cluster_discard() is not a coroutine_fn. Later tests fail during the unlock: qemu-img: ../util/qemu-coroutine-lock.c:358: qemu_co_rwlock_unlock: Assertion `qemu_in_coroutine()' failed. 
#0 0x7ff33f7d89d5 in raise () from /lib64/libc.so.6 #1 0x7ff33f7c18a4 in abort () from /lib64/libc.so.6 #2 0x7ff33f7c1789 in __assert_fail_base.cold () from /lib64/libc.so.6 #3 0x7ff33f7d1026 in __assert_fail () from /lib64/libc.so.6 #4 0x556f4ffd3c94 in qemu_co_rwlock_unlock (lock=0x556f51f63ca0) at ../util/qemu-coroutine-lock.c:358 #5 0x556f4fef5e09 in qcow2_cluster_discard (bs=0x556f51f56a80, offset=37748736, bytes=0, type=QCOW2_DISCARD_NEVER, full_discard=false) at ../block/qcow2-cluster.c:1970 #6 0x556f4ff46c23 in qcow2_snapshot_create (bs=0x556f51f56a80, sn_info=0x7fff89fb9a30) at ../block/qcow2-snapshot.c:736 #7 0x556f4ff0d7b6 in bdrv_snapshot_create (bs=0x556f51f56a80, sn_info=0x7fff89fb9a30) at ../block/snapshot.c:227 #8 0x556f4fe85526 in img_snapshot (argc=4, argv=0x7fff89fb9d30) at ../qemu-img.c:3337 #9 0x556f4fe8a227 in main (argc=4, argv=0x7fff89fb9d30) at ../qemu-img.c:5375 3. iotest 13 (and I think a lot more iotests) crashes on assert(!to->locks_held); .. So with this assertion we can't keep rwlock locked during data writing... 
#3 in __assert_fail () from /lib64/libc.so.6 #4 in qemu_aio_coroutine_enter (ctx=0x55762120b700, co=0x55762121d700) at ../util/qemu-coroutine.c:158 #5 in aio_co_enter (ctx=0x55762120b700, co=0x55762121d700) at ../util/async.c:628 #6 in aio_co_wake (co=0x55762121d700) at ../util/async.c:612 #7 in thread_pool_co_cb (opaque=0x7f17950daab0, ret=0) at ../util/thread-pool.c:279 #8 in thread_pool_completion_bh (opaque=0x5576211e5070) at ../util/thread-pool.c:188 #9 in aio_bh_call (bh=0x557621205df0) at ../util/async.c:136 #10 in aio_bh_poll (ctx=0x55762120b700) at ../util/async.c:164 #11 in aio_poll (ctx=0x55762120b700, blocking=true) at ../util/aio-posix.c:659 #12 in blk_prw (blk=0x557621205790, offset=4303351808, buf=0x55762123e000 '\364' , ..., bytes=12288, co_entry=0x557620d9dc97 , flags=0) at ../block/block-backend.c:1335 #13 in blk_pwrite (blk=0x557621205790, offset=4303351808, buf=0x55762123e000, count=12288, flags=0) at ../block/block-backend.c:1501 This is another bug in your code: A taken lock belongs to its coroutine. You can't lock in one coroutine and unlock in another. The changes you made to the write code seem unnecessarily complicated anyway: Why not just qemu_co_rwlock_rdlock() right at the start of qcow2_co_pwritev_part() and unlock at its end, without taking and dropping the lock repeatedly? It makes both the locking more obviously correct and also fixes the bug (013 passes with this change). So now I think that v1 is simpler.. It's more complicated (but not too much) in code. But it keeps discards and data writes non-blocking each other and avoids yields in critical sections. I think an approach with additional data structures is almost certainly more complex and harder to maintain (and as the review from Max showed, also to understand). I wouldn't give up yet on the CoRwlock based approach, it's almost trivial code in comparison. 
Sure

> True, making qcow2_cluster_discard() a coroutine_fn requires a preparational patch that is less trivial, but at least this can be seen as something that we would want to do sooner or later anyway.

Thanks for the help, I'll try your suggestions and resend.

-- Best regards, Vladimir
Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
12.03.2021 18:52, Max Reitz wrote: On 12.03.21 16:24, Vladimir Sementsov-Ogievskiy wrote: 12.03.2021 18:10, Max Reitz wrote: On 12.03.21 13:46, Vladimir Sementsov-Ogievskiy wrote: 12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote: 12.03.2021 14:17, Max Reitz wrote: On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote: 11.03.2021 22:58, Max Reitz wrote: On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote: There is a bug in qcow2: host cluster can be discarded (refcount becomes 0) and reused during data write. In this case data write may [..] @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, if (refcount == 0) { void *table; + Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index); + + if (infl) { + infl->refcount_zero = true; + infl->type = type; + continue; + } I don’t understand what this is supposed to do exactly. It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they? Don't follow. We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt(). But the code under “if (refcount == 0)” doesn’t free anything, does it? All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file. I don’t see how that protocol-level discard has anything to do with our problem, though. Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard. 
On the other hand, clearing caches is OK, as its related only to qcow2-refcount, not to inflight-write-cnt As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0. (including updating s->free_cluster_index if the refcount is 0). Hmm.. And that (setting s->free_cluster_index) what I should actually prevent until total reference count becomes zero. And about s->set_refcount(): it only update a refcount itself, and don't free anything. So, it is more correct like this: diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c index 464d133368..1da282446d 100644 --- a/block/qcow2-refcount.c +++ b/block/qcow2-refcount.c @@ -1012,21 +1012,12 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, } else { refcount += addend; } - if (refcount == 0 && cluster_index < s->free_cluster_index) { - s->free_cluster_index = cluster_index; - } s->set_refcount(refcount_block, block_index, refcount); if (refcount == 0) { void *table; Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index); - if (infl) { - infl->refcount_zero = true; - infl->type = type; - continue; - } - table = qcow2_cache_is_table_offset(s->refcount_block_cache, offset); if (table != NULL) { @@ -1040,6 +1031,16 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, qcow2_cache_discard(s->l2_table_cache, table); } + if (infl) { + infl->refcount_zero = true; + infl->type = type; + continue; + } + + if (cluster_index < s->free_cluster_index) { + s->free_cluster_index = cluster_index; + } + if (s->discard_passthrough[type]) { update_refcount_discard(bs, cluster_offset, s->cluster_size); } I don’t think I like using s->free_cluster_index as a protection against allocating something before it. Hmm, I just propose not to update it, if refcount reached 0 but we still have inflight writes. 
First, it comes back the problem I just described in my mail from 15:58 GMT+1, which is that you’re changing the definition of what a free cluster is. With this proposal, you’re proposing yet a new definition: A free cluster is anything with refcount == 0 after free_cluster_index. I think that free cluster is anything with refcount = 0 and inflight-write-cnt = 0. Then, as I said in my other mail, update_refcount() just cannot free any cluster. So changes to that function can’t be justified by preventing it from freeing
Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
On 12.03.21 16:24, Vladimir Sementsov-Ogievskiy wrote: 12.03.2021 18:10, Max Reitz wrote: On 12.03.21 13:46, Vladimir Sementsov-Ogievskiy wrote: 12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote: 12.03.2021 14:17, Max Reitz wrote: On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote: 11.03.2021 22:58, Max Reitz wrote: On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote: There is a bug in qcow2: host cluster can be discarded (refcount becomes 0) and reused during data write. In this case data write may [..] @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, if (refcount == 0) { void *table; + Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index); + + if (infl) { + infl->refcount_zero = true; + infl->type = type; + continue; + } I don’t understand what this is supposed to do exactly. It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they? Don't follow. We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt(). But the code under “if (refcount == 0)” doesn’t free anything, does it? All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file. I don’t see how that protocol-level discard has anything to do with our problem, though. Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard. 
On the other hand, clearing caches is OK, as it's related only to qcow2-refcount, not to inflight-write-cnt.

As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0 (including updating s->free_cluster_index if the refcount is 0).

Hmm.. And that (setting s->free_cluster_index) is what I should actually prevent until the total reference count becomes zero. And about s->set_refcount(): it only updates the refcount itself, and doesn't free anything. So, it is more correct like this:

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 464d133368..1da282446d 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1012,21 +1012,12 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
         } else {
             refcount += addend;
         }
-        if (refcount == 0 && cluster_index < s->free_cluster_index) {
-            s->free_cluster_index = cluster_index;
-        }
         s->set_refcount(refcount_block, block_index, refcount);
 
         if (refcount == 0) {
             void *table;
             Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
 
-            if (infl) {
-                infl->refcount_zero = true;
-                infl->type = type;
-                continue;
-            }
-
             table = qcow2_cache_is_table_offset(s->refcount_block_cache,
                                                 offset);
             if (table != NULL) {
@@ -1040,6 +1031,16 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
                 qcow2_cache_discard(s->l2_table_cache, table);
             }
 
+            if (infl) {
+                infl->refcount_zero = true;
+                infl->type = type;
+                continue;
+            }
+
+            if (cluster_index < s->free_cluster_index) {
+                s->free_cluster_index = cluster_index;
+            }
+
             if (s->discard_passthrough[type]) {
                 update_refcount_discard(bs, cluster_offset, s->cluster_size);
             }

I don’t think I like using s->free_cluster_index as a protection against allocating something before it.

Hmm, I just propose not to update it, if refcount reached 0 but we still have inflight writes.
First, it brings back the problem I just described in my mail from 15:58 GMT+1, which is that you’re changing the definition of what a free cluster is. With this proposal, you’re proposing yet another definition: a free cluster is anything with refcount == 0 after free_cluster_index.

I think that a free cluster is anything with refcount == 0 and inflight-write-cnt == 0.

Then, as I said in my other mail, update_refcount() just cannot free any cluster. So changes to that function can’t be justified by preventing it from freeing clusters. You
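The definition Vladimir argues for here (a cluster is free only when both its refcount and its in-flight write counter are zero) can be sketched as a standalone allocation check. The flat refcounts/infl_writes arrays and the function names below are illustrative stand-ins, not QEMU's actual structures:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Proposed definition from the thread: a host cluster is free only when
 * its refcount *and* its in-flight write counter are both zero.  The
 * flat arrays stand in for the refcount table and the (hypothetical)
 * in-flight counter table.
 */
static int cluster_is_free(const uint16_t *refcounts,
                           const unsigned *infl_writes, uint64_t idx)
{
    return refcounts[idx] == 0 && infl_writes[idx] == 0;
}

/* An alloc_clusters_noref()-style scan under that definition. */
static int64_t find_free_cluster(const uint16_t *refcounts,
                                 const unsigned *infl_writes,
                                 uint64_t nb_clusters,
                                 uint64_t free_cluster_index)
{
    for (uint64_t i = free_cluster_index; i < nb_clusters; i++) {
        if (cluster_is_free(refcounts, infl_writes, i)) {
            return (int64_t)i;
        }
    }
    return -1; /* no free cluster */
}
```

With this check in the allocation path, a cluster with refcount 0 but pending writes is simply skipped, which is why Max argues update_refcount() then cannot "free" such a cluster at all.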
Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
12.03.2021 17:58, Max Reitz wrote: On 12.03.21 13:32, Vladimir Sementsov-Ogievskiy wrote: 12.03.2021 14:17, Max Reitz wrote: On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote: 11.03.2021 22:58, Max Reitz wrote: On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote: There is a bug in qcow2: host cluster can be discarded (refcount becomes 0) and reused during data write. In this case data write may [..] @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, if (refcount == 0) { void *table; + Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index); + + if (infl) { + infl->refcount_zero = true; + infl->type = type; + continue; + } I don’t understand what this is supposed to do exactly. It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they? Don't follow. We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt(). But the code under “if (refcount == 0)” doesn’t free anything, does it? All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file. I don’t see how that protocol-level discard has anything to do with our problem, though. Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard. 
On the other hand, clearing caches is OK, as it's related only to qcow2-refcount, not to inflight-write-cnt.

As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0 (including updating s->free_cluster_index if the refcount is 0).

Hmm.. And that (setting s->free_cluster_index) is what I should actually prevent until the total reference count becomes zero. And about s->set_refcount(): it only updates the refcount itself, and doesn't free anything.

That is what freeing is, though. I consider something to be free when allocation functions will allocate it. The allocation functions look at the refcount, so once a cluster’s refcount is 0, it is free.

And with this patch I try to update the allocation functions to look also at the inflight-write-counters. If I missed something, it's a bug in the patch.

If that isn’t what freeing is, nothing in update_refcount() frees anything (when looking at how data clusters are handled). Passing the discard through to the protocol layer isn’t “freeing”, because it’s independent of qcow2.

Now, your patch adds an additional check to the allocation functions (whether there are ongoing writes on the cluster), so it’s indeed possible that a cluster can have a refcount of 0 but still won’t be used by allocation functions. But that means you’ve just changed the definition of what a free cluster is. In fact, that means that nothing in update_refcount() can free a cluster that has active writes to it, because now a cluster is only free if there are no such writes. It follows that you needn’t change update_refcount() to prevent clusters with such writes from being freed, because with this new definition of what a free cluster is, it’s impossible for update_refcount() to free them.

But as I noted somewhere else, update_refcount() should not discard the host cluster in parallel with an in-flight write. It's not completely wrong, but it's inefficient.
(Yes, you’re right that it would be nice to postpone the protocol-level discard still, but not doing so wouldn’t be a catastrophe – which shows that it has little to do with actually freeing something, as far as qcow2 is concerned. If it’s just about postponing the discard, we can do exactly that: let update_refcount() skip discarding for clusters that are still in use, and then let update_inflight_write_cnt() only do that discard instead of invoking all of qcow2_update_cluster_refcount().)

Agree, yes.

Alternatively, we could also not change the definition of what a free cluster is, which means we wouldn’t need to change the allocation functions, but instead postpone the refcount update that update_refcount() does. That would mean we’d actually need to really drop the refcount in update_inflight_write_cnt() instead of doing a -0.

Hmm, that should work too. Do you think it is better? With the first approach, the meaning of zero refcount is changed (it's not a free cluster now; keep the inflight-write-cnt in mind too). So we
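The compromise discussed above, where update_refcount() skips only the protocol-level discard for in-use clusters and the slow path performs just that postponed discard when the last write finishes, could look roughly like this (all types and helpers are hypothetical stubs, not code from the patch series):

```c
#include <assert.h>

/*
 * Sketch: update_refcount() sets discard_postponed instead of
 * discarding a cluster that still has writes in flight; the slow path
 * then performs only that postponed protocol-level discard when the
 * last write completes.  All names are illustrative stubs.
 */
typedef struct InFlight {
    int writes;             /* in-flight data writes to the cluster */
    int discard_postponed;  /* discard was requested at refcount == 0 */
} InFlight;

static unsigned discards_done;

static void protocol_discard(void)
{
    discards_done++;        /* stand-in for the discard on the file */
}

static void update_inflight_write_cnt(InFlight *infl, int add)
{
    infl->writes += add;
    assert(infl->writes >= 0);
    if (infl->writes == 0 && infl->discard_postponed) {
        infl->discard_postponed = 0;
        /* Only the discard, not a full qcow2_update_cluster_refcount() */
        protocol_discard();
    }
}
```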
Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
12.03.2021 18:10, Max Reitz wrote: On 12.03.21 13:46, Vladimir Sementsov-Ogievskiy wrote: 12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote: 12.03.2021 14:17, Max Reitz wrote: On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote: 11.03.2021 22:58, Max Reitz wrote: On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote: There is a bug in qcow2: host cluster can be discarded (refcount becomes 0) and reused during data write. In this case data write may [..] @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, if (refcount == 0) { void *table; + Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index); + + if (infl) { + infl->refcount_zero = true; + infl->type = type; + continue; + } I don’t understand what this is supposed to do exactly. It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they? Don't follow. We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt(). But the code under “if (refcount == 0)” doesn’t free anything, does it? All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file. I don’t see how that protocol-level discard has anything to do with our problem, though. Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard. 
On the other hand, clearing caches is OK, as it's related only to qcow2-refcount, not to inflight-write-cnt.

As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0 (including updating s->free_cluster_index if the refcount is 0).

Hmm.. And that (setting s->free_cluster_index) is what I should actually prevent until the total reference count becomes zero. And about s->set_refcount(): it only updates the refcount itself, and doesn't free anything. So, it is more correct like this:

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 464d133368..1da282446d 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1012,21 +1012,12 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
         } else {
             refcount += addend;
         }
-        if (refcount == 0 && cluster_index < s->free_cluster_index) {
-            s->free_cluster_index = cluster_index;
-        }
         s->set_refcount(refcount_block, block_index, refcount);
 
         if (refcount == 0) {
             void *table;
             Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
 
-            if (infl) {
-                infl->refcount_zero = true;
-                infl->type = type;
-                continue;
-            }
-
             table = qcow2_cache_is_table_offset(s->refcount_block_cache,
                                                 offset);
             if (table != NULL) {
@@ -1040,6 +1031,16 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
                 qcow2_cache_discard(s->l2_table_cache, table);
             }
 
+            if (infl) {
+                infl->refcount_zero = true;
+                infl->type = type;
+                continue;
+            }
+
+            if (cluster_index < s->free_cluster_index) {
+                s->free_cluster_index = cluster_index;
+            }
+
             if (s->discard_passthrough[type]) {
                 update_refcount_discard(bs, cluster_offset, s->cluster_size);
             }

I don’t think I like using s->free_cluster_index as a protection against allocating something before it.

Hmm, I just propose not to update it, if refcount reached 0 but we still have inflight writes.
First, it brings back the problem I just described in my mail from 15:58 GMT+1, which is that you’re changing the definition of what a free cluster is. With this proposal, you’re proposing yet another definition: a free cluster is anything with refcount == 0 after free_cluster_index.

I think that a free cluster is anything with refcount == 0 and inflight-write-cnt == 0. And free_cluster_index is a hint for where to start searching for such a cluster.

Now looking only at the allocation functions, it may look like that kind of is the definition already. But I don’t think that was the intention when free_cluster_index was
Re: [PATCH v2(RFC) 0/3] qcow2: fix parallel rewrite and discard
On 25.02.2021 at 12:52, Vladimir Sementsov-Ogievskiy wrote:
> Hi all! It occurs that nothing prevents discarding and reallocating a host
> cluster during data writing. This way data writing will pollute another
> newly allocated cluster of data or metadata.
>
> OK, v2 is a try to solve the problem with CoRwlock.. And it is marked
> RFC, because of a lot of iotest failures.. Some of the problems with v2:
>
> 1. It's more complicated to make a test, as everything is blocking
> and I can't just break write and do discard.. I have to implement
> aio_discard in qemu-io and rewrite the test into several portions of io
> commands split by "sleep 1".. OK, it's not a big problem, and I've
> solved it.

Right, this just demonstrates that it's doing what it's supposed to.

> 2. iotest 7 fails with several leaked clusters. Seems that it depends on
> the fact that discard may be done in parallel with writes. Iotest 7 does
> snapshots, so I think the l1 table is updated by the moment the discard is
> finally unlocked.. But I didn't dig into it, it's all my assumptions.

This one looks a bit odd, but it may be related to the bug in your code: you forgot that qcow2_cluster_discard() is not a coroutine_fn. Later tests fail during the unlock:

qemu-img: ../util/qemu-coroutine-lock.c:358: qemu_co_rwlock_unlock: Assertion `qemu_in_coroutine()' failed.
#0  0x7ff33f7d89d5 in raise () from /lib64/libc.so.6
#1  0x7ff33f7c18a4 in abort () from /lib64/libc.so.6
#2  0x7ff33f7c1789 in __assert_fail_base.cold () from /lib64/libc.so.6
#3  0x7ff33f7d1026 in __assert_fail () from /lib64/libc.so.6
#4  0x556f4ffd3c94 in qemu_co_rwlock_unlock (lock=0x556f51f63ca0) at ../util/qemu-coroutine-lock.c:358
#5  0x556f4fef5e09 in qcow2_cluster_discard (bs=0x556f51f56a80, offset=37748736, bytes=0, type=QCOW2_DISCARD_NEVER, full_discard=false) at ../block/qcow2-cluster.c:1970
#6  0x556f4ff46c23 in qcow2_snapshot_create (bs=0x556f51f56a80, sn_info=0x7fff89fb9a30) at ../block/qcow2-snapshot.c:736
#7  0x556f4ff0d7b6 in bdrv_snapshot_create (bs=0x556f51f56a80, sn_info=0x7fff89fb9a30) at ../block/snapshot.c:227
#8  0x556f4fe85526 in img_snapshot (argc=4, argv=0x7fff89fb9d30) at ../qemu-img.c:3337
#9  0x556f4fe8a227 in main (argc=4, argv=0x7fff89fb9d30) at ../qemu-img.c:5375

> 3. iotest 13 (and I think a lot more iotests) crashes on
> assert(!to->locks_held); .. So with this assertion we can't keep rwlock
> locked during data writing...
>
> #3  in __assert_fail () from /lib64/libc.so.6
> #4  in qemu_aio_coroutine_enter (ctx=0x55762120b700, co=0x55762121d700) at ../util/qemu-coroutine.c:158
> #5  in aio_co_enter (ctx=0x55762120b700, co=0x55762121d700) at ../util/async.c:628
> #6  in aio_co_wake (co=0x55762121d700) at ../util/async.c:612
> #7  in thread_pool_co_cb (opaque=0x7f17950daab0, ret=0) at ../util/thread-pool.c:279
> #8  in thread_pool_completion_bh (opaque=0x5576211e5070) at ../util/thread-pool.c:188
> #9  in aio_bh_call (bh=0x557621205df0) at ../util/async.c:136
> #10 in aio_bh_poll (ctx=0x55762120b700) at ../util/async.c:164
> #11 in aio_poll (ctx=0x55762120b700, blocking=true) at ../util/aio-posix.c:659
> #12 in blk_prw (blk=0x557621205790, offset=4303351808, buf=0x55762123e000 '\364' , \364>..., bytes=12288, co_entry=0x557620d9dc97 , flags=0) at ../block/block-backend.c:1335
> #13 in blk_pwrite (blk=0x557621205790, offset=4303351808, buf=0x55762123e000, count=12288, flags=0) at ../block/block-backend.c:1501

This is another bug in your code: A taken lock belongs to its coroutine. You can't lock in one coroutine and unlock in another.

The changes you made to the write code seem unnecessarily complicated anyway: Why not just qemu_co_rwlock_rdlock() right at the start of qcow2_co_pwritev_part() and unlock at its end, without taking and dropping the lock repeatedly? It makes both the locking more obviously correct and also fixes the bug (013 passes with this change).

> So now I think that v1 is simpler.. It's more complicated (but not too
> much) in code. But it keeps discards and data writes non-blocking for each
> other and avoids yields in critical sections.

I think an approach with additional data structures is almost certainly more complex and harder to maintain (and as the review from Max showed, also to understand). I wouldn't give up yet on the CoRwlock based approach, it's almost trivial code in comparison.
True, making qcow2_cluster_discard() a coroutine_fn requires a preparatory patch that is less trivial, but at least this can be seen as something that we would want to do sooner or later anyway.

Kevin
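Kevin's suggested locking shape, a single read lock spanning the whole write path that is taken and released in the same coroutine, with discard taking the write lock, can be sketched with a toy lock standing in for QEMU's CoRwlock (the *_sketch functions are illustrative; real code would use qemu_co_rwlock_rdlock()/wrlock()/unlock()):

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for QEMU's CoRwlock, only tracking reader/writer counts. */
typedef struct CoRwlock {
    int readers;
    int writers;
} CoRwlock;

static void rwlock_rdlock(CoRwlock *l) { assert(!l->writers); l->readers++; }
static void rwlock_wrlock(CoRwlock *l) { assert(!l->readers && !l->writers); l->writers++; }
static void rwlock_unlock(CoRwlock *l) { l->writers ? l->writers-- : l->readers--; }

static CoRwlock discard_lock;
static uint64_t written_bytes;

/*
 * Take the read lock once at the start of the write path and drop it at
 * the end, in the same coroutine, instead of locking and unlocking
 * around individual clusters (a lock taken in one coroutine must not be
 * released in another).
 */
static int qcow2_co_pwritev_part_sketch(uint64_t offset, uint64_t bytes)
{
    (void)offset;
    rwlock_rdlock(&discard_lock);
    written_bytes += bytes;   /* ...cluster allocation and data write... */
    rwlock_unlock(&discard_lock);
    return 0;
}

/* Discard takes the write lock, so it waits for all in-flight writes. */
static int qcow2_cluster_discard_sketch(uint64_t offset, uint64_t bytes)
{
    (void)offset; (void)bytes;
    rwlock_wrlock(&discard_lock);
    /* ...update refcounts, free clusters... */
    rwlock_unlock(&discard_lock);
    return 0;
}
```

Because each function both takes and releases the lock itself, the "unlocked in a different coroutine" assertion failure from the RFC cannot occur in this shape.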
Re: [PULL v2 19/38] hw/block/nvme: align zoned.zasl with mdts
On Mar 12 13:07, Peter Maydell wrote:
> On Tue, 9 Mar 2021 at 11:45, Klaus Jensen wrote:
> >
> > From: Klaus Jensen
> >
> > ZASL (Zone Append Size Limit) is defined exactly like MDTS (Maximum Data
> > Transfer Size), that is, it is a value in units of the minimum memory
> > page size (CAP.MPSMIN) and is reported as a power of two.
> >
> > The 'mdts' nvme device parameter is specified as in the spec, but the
> > 'zoned.append_size_limit' parameter is specified in bytes. This is
> > suboptimal for a number of reasons:
> >
> > 1. It is just plain confusing wrt. the definition of mdts.
> > 2. There is a lot of complexity involved in validating the value; it
> >    must be a power of two, it should be larger than 4k, if it is zero
> >    we set it internally to mdts, but still report it as zero.
> > 3. While "hw/block/nvme: improve invalid zasl value reporting"
> >    slightly improved the handling of the parameter, the validation is
> >    still wrong; it does not depend on CC.MPS, it depends on
> >    CAP.MPSMIN. And we are not even checking that it is actually less
> >    than or equal to MDTS, which is kinda the *one* condition it must
> >    satisfy.
> >
> > Fix this by defining zasl exactly like mdts and checking the one thing
> > that it must satisfy (that it is less than or equal to mdts). Also,
> > change the default value from 128KiB to 0 (aka, whatever mdts is).
> > @@ -2144,10 +2142,9 @@ static uint16_t nvme_do_write(NvmeCtrl *n, NvmeRequest *req, bool append,
> >              goto invalid;
> >          }
> >
> > -        if (nvme_l2b(ns, nlb) > (n->page_size << n->zasl)) {
> > -            trace_pci_nvme_err_append_too_large(slba, nlb, n->zasl);
> > -            status = NVME_INVALID_FIELD;
> > -            goto invalid;
> > +        if (n->params.zasl && data_size > n->page_size << n->params.zasl) {
> > +            trace_pci_nvme_err_zasl(data_size);
> > +            return NVME_INVALID_FIELD | NVME_DNR;
> >          }
> >
> >          slba = zone->w_ptr;
>
> Hi; Coverity points out a possible overflow here (CID 1450756):
> n->page_size is a uint32_t, and n->params.zasl is a uint8_t, so
> the "n->page_size << n->params.zasl" will be done as 32-bit arithmetic;
> but it is then compared against a uint64_t data_size.
>
> Is this an overflow that can never happen (ie a false positive), or
> should the RHS of the comparison be done as 64-bit arithmetic by
> adding a cast?

Thanks! I think a cast is in order. I will get a fix out.
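A minimal reproduction of the overflow Coverity flagged: with a 4 KiB minimum page size and zasl = 20, the limit should be 2^32 bytes, but the 32-bit shift wraps to 0 before the comparison. The helper names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/*
 * n->page_size is uint32_t and n->params.zasl is uint8_t, so the shift
 * below is done in 32-bit arithmetic and can wrap before the comparison
 * against the 64-bit data_size.
 */
static int zasl_exceeded_buggy(uint64_t data_size, uint32_t page_size,
                               uint8_t zasl)
{
    return data_size > page_size << zasl;             /* 32-bit shift */
}

static int zasl_exceeded_fixed(uint64_t data_size, uint32_t page_size,
                               uint8_t zasl)
{
    return data_size > (uint64_t)page_size << zasl;   /* 64-bit shift */
}
```

With page_size = 4096 and zasl = 20, the unsigned 32-bit shift wraps to 0, so the buggy check rejects every non-empty append; the cast on the left operand, as in zasl_exceeded_fixed(), is the kind of fix being discussed.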
Re: [PATCH v3 4/6] util: implement seqcache
On 12.03.21 15:37, Vladimir Sementsov-Ogievskiy wrote: 12.03.2021 16:41, Max Reitz wrote: On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:

Implement a cache for small sequential unaligned writes, so that they may be cached until we get a complete cluster and then write it. The cache is intended to be used for backup to a qcow2 compressed target opened in O_DIRECT mode, but can be reused for any similar (even not block-layer related) task.

Signed-off-by: Vladimir Sementsov-Ogievskiy
---
 include/qemu/seqcache.h |  42 +
 util/seqcache.c         | 361
 MAINTAINERS             |   6 +
 util/meson.build        |   1 +
 4 files changed, 410 insertions(+)
 create mode 100644 include/qemu/seqcache.h
 create mode 100644 util/seqcache.c

Looks quite good to me, thanks. Nice explanations, too. :)

The only design question I have is whether there’s a reason you’re using a list again instead of a hash table. I suppose we do need the list anyway because of the next_flush iterator, so using a hash table would only complicate the implementation, but still.

Yes, it seems correct for the flush iterator to go in the same order as writes come, so we need a list. We can add a hash table, it will only help on read.. But for the compressed cache in qcow2 we try to flush often enough, so there should not be many clusters in the cache. So I think a hash table may be added later if needed.

Sure. The problem I see is that we’ll probably never reach the point of it really being needed. O:) So I think it’s a question of now or never.

[...]

+ */
+bool seqcache_get_next_flush(SeqCache *s, int64_t *offset, int64_t *bytes,
+                             uint8_t **buf, bool *unfinished)

Could be “uint8_t *const *buf”, I suppose. Don’t know how much the callers would hate that, though.

Will do. And actually I wrote quite a big explanation but missed the fact that the caller doesn't get ownership of buf; it should be mentioned.

Great, thanks.
+{
+    Cluster *req = s->next_flush;
+
+    if (s->next_flush) {
+        *unfinished = false;
+        req = s->next_flush;
+        s->next_flush = QSIMPLEQ_NEXT(req, entry);
+        if (s->next_flush == s->cur_write) {
+            s->next_flush = NULL;
+        }
+    } else if (s->cur_write && *unfinished) {
+        req = s->cur_write;

I was wondering whether flushing an unfinished cluster wouldn’t kind of finalize it, but I suppose the problem with that would be that you can’t add data to a finished cluster, which wouldn’t be that great if you’re just flushing the cache without wanting to drop it all. (The problem I see is that flushing it later will mean all the data that already has been written here will have to be rewritten. Not that bad, I suppose.)

Yes, that's all correct. Also there is an additional strong reason: qcow2 depends on the fact that clusters become "finished" by defined rules: only when a cluster is really finished up to the end, or when qcow2 starts writing another cluster. For "finished" clusters with an unaligned end we can safely align this end up to some good alignment, writing a bit more data than needed. It's safe because the tail of the cluster is never used. And we'll perform better with an aligned write, avoiding RMW. But when flushing an "unfinished" cluster, we should write exactly what we have in the cache, as a parallel write to the same cluster may happen, which will continue the sequential process.

OK, thanks for the explanation.

Max
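The finished/unfinished distinction above boils down to how many bytes a flush may submit: the tail of a finished cluster may be padded up to the write alignment (it is never reused, and this avoids read-modify-write under O_DIRECT), while an unfinished cluster must be written out exactly. A sketch, where flush_len() is an invented helper rather than part of the seqcache API:

```c
#include <assert.h>
#include <stdint.h>

#define ROUND_UP(x, a) (((x) + (a) - 1) / (a) * (a))

/*
 * How many bytes to submit when flushing a cached cluster fragment of
 * length `bytes`: a finished cluster's tail is never reused, so its end
 * may be padded up to the write alignment (avoiding read-modify-write
 * in O_DIRECT mode); an unfinished cluster may still be appended to by
 * the sequential writer, so it must be written out exactly.
 */
static uint64_t flush_len(uint64_t bytes, uint64_t align, int finished)
{
    return finished ? ROUND_UP(bytes, align) : bytes;
}
```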
Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
On 12.03.21 13:46, Vladimir Sementsov-Ogievskiy wrote: 12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote: 12.03.2021 14:17, Max Reitz wrote: On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote: 11.03.2021 22:58, Max Reitz wrote: On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote: There is a bug in qcow2: host cluster can be discarded (refcount becomes 0) and reused during data write. In this case data write may [..] @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, if (refcount == 0) { void *table; + Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index); + + if (infl) { + infl->refcount_zero = true; + infl->type = type; + continue; + } I don’t understand what this is supposed to do exactly. It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they? Don't follow. We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt(). But the code under “if (refcount == 0)” doesn’t free anything, does it? All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file. I don’t see how that protocol-level discard has anything to do with our problem, though. Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard. 
On the other hand, clearing caches is OK, as it's related only to qcow2-refcount, not to inflight-write-cnt.

As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0 (including updating s->free_cluster_index if the refcount is 0).

Hmm.. And that (setting s->free_cluster_index) is what I should actually prevent until the total reference count becomes zero. And about s->set_refcount(): it only updates the refcount itself, and doesn't free anything. So, it is more correct like this:

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 464d133368..1da282446d 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1012,21 +1012,12 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
         } else {
             refcount += addend;
         }
-        if (refcount == 0 && cluster_index < s->free_cluster_index) {
-            s->free_cluster_index = cluster_index;
-        }
         s->set_refcount(refcount_block, block_index, refcount);
 
         if (refcount == 0) {
             void *table;
             Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
 
-            if (infl) {
-                infl->refcount_zero = true;
-                infl->type = type;
-                continue;
-            }
-
             table = qcow2_cache_is_table_offset(s->refcount_block_cache,
                                                 offset);
             if (table != NULL) {
@@ -1040,6 +1031,16 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
                 qcow2_cache_discard(s->l2_table_cache, table);
             }
 
+            if (infl) {
+                infl->refcount_zero = true;
+                infl->type = type;
+                continue;
+            }
+
+            if (cluster_index < s->free_cluster_index) {
+                s->free_cluster_index = cluster_index;
+            }
+
             if (s->discard_passthrough[type]) {
                 update_refcount_discard(bs, cluster_offset, s->cluster_size);
             }

I don’t think I like using s->free_cluster_index as a protection against allocating something before it.

First, it brings back the problem I just described in my mail from 15:58 GMT+1, which is that you’re changing the definition of what a free cluster is.
With this proposal, you’re proposing yet a new definition: A free cluster is anything with refcount == 0 after free_cluster_index. Now looking only at the allocation functions, it may look like that kind of is the definition already. But I don’t think that was the intention when free_cluster_index was introduced, so we’d have to check every place that sets free_cluster_index, to see whether it adheres to this definition. And I think it’s clear that there is a place that won’t adhere to this definition, and that is this very place here, in update_refcount(). Say free_cluster_index
Re: [PULL v2 36/38] hw/block/nvme: support namespace attachment command
On Mar 12 13:12, Peter Maydell wrote:
> On Tue, 9 Mar 2021 at 11:46, Klaus Jensen wrote:
> >
> > From: Minwoo Im
> >
> > This patch supports Namespace Attachment command for the pre-defined
> > nvme-ns device nodes. Of course, attach/detach namespace should only be
> > supported in case 'subsys' is given. This is because if we detach a
> > namespace from a controller, somebody needs to manage the detached, but
> > allocated namespace in the NVMe subsystem.
> >
> > As command effect for the namespace attachment command is registered,
> > the host will be notified that namespace inventory is changed so that
> > host will rescan the namespace inventory after this command. For
> > example, kernel driver manages this command effect via passthru IOCTL.
>
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index 85a7b5a14f4e..1287bc2cd17a 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -235,6 +235,11 @@ static inline void nvme_ns_attach(NvmeCtrl *n, NvmeNamespace *ns)
> >      n->namespaces[nvme_nsid(ns) - 1] = ns;
> >  }
> >
> > +static inline void nvme_ns_detach(NvmeCtrl *n, NvmeNamespace *ns)
> > +{
> > +    n->namespaces[nvme_nsid(ns) - 1] = NULL;
> > +}
>
> Hi; Coverity complains about a possible array overflow both here
> in nvme_ns_detach() (CID 1450757) and in nvme_ns_attach() (CID 1450758):
> nvme_nsid() can return -1, but this code does not check for that.
>
> If these functions both assume that their ns argument is non-NULL,
> then adding an "assert(ns)" would probably placate Coverity and also
> would mean that any bugs elsewhere resulting in accidentally passing
> a NULL pointer would result in a clean assertion failure rather than
> memory corruption. (Or you could directly assert that the array index
> is in-bounds, I guess.)

Hi Peter,

Thanks! As far as I can tell, we never reach this without nsid being a valid value or ns being NULL, but as you say, future misuse would be bad.
I will post a fix to make sure such misuse does not happen in the future.
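The assert(ns) suggestion could look as follows. The structures are cut-down stand-ins for those in hw/block/nvme.h, and the nvme_nsid() stub only mimics the NULL-to-minus-one behavior Coverity is worried about:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NVME_MAX_NAMESPACES 256

/* Cut-down stand-ins for the structures in hw/block/nvme.h. */
typedef struct NvmeNamespace {
    uint32_t nsid;
} NvmeNamespace;

typedef struct NvmeCtrl {
    NvmeNamespace *namespaces[NVME_MAX_NAMESPACES];
} NvmeCtrl;

/* Mimics only the behavior Coverity worries about: NULL maps to -1. */
static int32_t nvme_nsid(NvmeNamespace *ns)
{
    return ns ? (int32_t)ns->nsid : -1;
}

static void nvme_ns_attach(NvmeCtrl *n, NvmeNamespace *ns)
{
    assert(ns);  /* guarantees nvme_nsid() returns a valid 1-based nsid */
    n->namespaces[nvme_nsid(ns) - 1] = ns;
}

static void nvme_ns_detach(NvmeCtrl *n, NvmeNamespace *ns)
{
    assert(ns);
    n->namespaces[nvme_nsid(ns) - 1] = NULL;
}
```

With the assert in place, an accidental NULL argument fails cleanly instead of indexing namespaces[-2].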
Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
On 12.03.21 13:42, Vladimir Sementsov-Ogievskiy wrote: 12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote: 12.03.2021 14:17, Max Reitz wrote: On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote: 11.03.2021 22:58, Max Reitz wrote: On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote: There is a bug in qcow2: host cluster can be discarded (refcount becomes 0) and reused during data write. In this case data write may [..] @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, if (refcount == 0) { void *table; + Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index); + + if (infl) { + infl->refcount_zero = true; + infl->type = type; + continue; + } I don’t understand what this is supposed to do exactly. It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they? Don't follow. We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt(). But the code under “if (refcount == 0)” doesn’t free anything, does it? All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file. I don’t see how that protocol-level discard has anything to do with our problem, though. Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard. 
On the other hand, clearing caches is OK, as it's related only to qcow2-refcount, not to inflight-write-cnt.

As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0 (including updating s->free_cluster_index if the refcount is 0).

Hmm.. And that (setting s->free_cluster_index) is what I should actually prevent until the total reference count becomes zero. And about s->set_refcount(): it only updates the refcount itself, and doesn't free anything.

Looking now at this place:

    if (refcount == 0 && cluster_index < s->free_cluster_index) {
        s->free_cluster_index = cluster_index;
    }
    s->set_refcount(refcount_block, block_index, refcount);

    if (refcount == 0) {
        void *table;
        Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);

        if (infl) {
            infl->refcount_zero = true;
            infl->type = type;
            continue;
        }

        table = qcow2_cache_is_table_offset(s->refcount_block_cache, offset);
        if (table != NULL) {
            qcow2_cache_put(s->refcount_block_cache, &refcount_block);
            old_table_index = -1;
            qcow2_cache_discard(s->refcount_block_cache, table);
        }

        table = qcow2_cache_is_table_offset(s->l2_table_cache, offset);
        if (table != NULL) {
            qcow2_cache_discard(s->l2_table_cache, table);
        }

        if (s->compressed_cache) {
            seqcache_discard_cluster(s->compressed_cache, cluster_offset);
        }

        if (s->discard_passthrough[type]) {
            update_refcount_discard(bs, cluster_offset, s->cluster_size);
        }
    }

Hmm. Is it OK that we use "offset" to discard qcow2 metadata caches? offset is the start of the whole loop and is not updated on iterations. Isn't it more correct to use cluster_offset here? Or are we sure that refcount and L2 metadata are always discarded exactly one cluster at a time? Then we are OK, but it's still worth an assertion that offset == cluster_offset.

Spontaneously, I think it’s a bug that hasn’t caused any problems yet, because I suppose L2 tables and refblocks are indeed always discarded one by one (i.e., cluster by cluster).
It definitely looks like this should be cluster_offset, yes. Max
Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
On 12.03.21 13:32, Vladimir Sementsov-Ogievskiy wrote: 12.03.2021 14:17, Max Reitz wrote: On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote: 11.03.2021 22:58, Max Reitz wrote: On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote: There is a bug in qcow2: host cluster can be discarded (refcount becomes 0) and reused during data write. In this case data write may [..] @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, if (refcount == 0) { void *table; + Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index); + + if (infl) { + infl->refcount_zero = true; + infl->type = type; + continue; + } I don’t understand what this is supposed to do exactly. It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they? Don't follow. We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt(). But the code under “if (refcount == 0)” doesn’t free anything, does it? All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file. I don’t see how that protocol-level discard has anything to do with our problem, though. Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard. 
On the other hand, clearing caches is OK, as it's related only to qcow2-refcount, not to inflight-write-cnt. As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0. (including updating s->free_cluster_index if the refcount is 0). Hmm.. And that (setting s->free_cluster_index) is what I should actually prevent until the total reference count becomes zero. And about s->set_refcount(): it only updates the refcount itself and doesn't free anything. That is what freeing is, though. I consider something to be free when allocation functions will allocate it. The allocation functions look at the refcount, so once a cluster’s refcount is 0, it is free. If that isn’t what freeing is, nothing in update_refcount() frees anything (when looking at how data clusters are handled). Passing the discard through to the protocol layer isn’t “freeing”, because it’s independent of qcow2. Now, your patch adds an additional check to the allocation functions (whether there are ongoing writes on the cluster), so it’s indeed possible that a cluster can have a refcount of 0 but still won’t be used by allocation functions. But that means you’ve just changed the definition of what a free cluster is. In fact, that means that nothing in update_refcount() can free a cluster that has active writes to it, because now a cluster is only free if there are no such writes. It follows that you needn’t change update_refcount() to prevent clusters with such writes from being freed, because with this new definition of what a free cluster is, it’s impossible for update_refcount() to free them. (Yes, you’re right that it would be nice to postpone the protocol-level discard still, but not doing so wouldn’t be a catastrophe – which shows that it has little to do with actually freeing something, as far as qcow2 is concerned.
If it’s just about postponing the discard, we can do exactly that: Let update_refcount() skip discarding for clusters that are still in use, and then let update_inflight_write_cnt() only do that discard instead of invoking all of qcow2_update_cluster_refcount().) Alternatively, we could also not change the definition of what a free cluster is, which means we wouldn’t need to change the allocation functions, but instead postpone the refcount update that update_refcount() does. That would mean we’d actually need to really drop the refcount in update_inflight_write_cnt() instead of doing a -0. Max
Re: [PATCH v3 4/6] util: implement seqcache
12.03.2021 16:41, Max Reitz wrote: On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote: Implement cache for small sequential unaligned writes, so that they may be cached until we get a complete cluster and then write it. The cache is intended to be used for backup to qcow2 compressed target opened in O_DIRECT mode, but can be reused for any similar (even not block-layer related) task. Signed-off-by: Vladimir Sementsov-Ogievskiy --- include/qemu/seqcache.h | 42 + util/seqcache.c | 361 MAINTAINERS | 6 + util/meson.build | 1 + 4 files changed, 410 insertions(+) create mode 100644 include/qemu/seqcache.h create mode 100644 util/seqcache.c Looks quite good to me, thanks. Nice explanations, too. :) The only design question I have is whether there’s a reason you’re using a list again instead of a hash table. I suppose we do need the list anyway because of the next_flush iterator, so using a hash table would only complicate the implementation, but still. Yes, it seems correct for flush iterator go in same order as writes comes, so we need a list. We can add a hash table, it will only help on read.. But for compressed cache in qcow2 we try to flush often enough, so there should not be many clusters in the cache. So I think addition of hash table may be done later if needed. [...] diff --git a/util/seqcache.c b/util/seqcache.c new file mode 100644 index 00..d923560eab --- /dev/null +++ b/util/seqcache.c @@ -0,0 +1,361 @@ +/* + * Cache for small sequential write requests. + * + * Copyright (c) 2021 Virtuozzo International GmbH. 
+ * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the "Software"), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + * + * + * = Description = + * + * SeqCache is an abbreviation for Sequential Cache. + * + * The Cache is intended to improve performance of small unaligned sequential + * writes. Cache has a cluster_size parameter and the unit of caching is aligned + * cluster. Cache has a list of cached clusters, several "finished" ones and at + * most one "unfinished". "unfinished" cluster is a cluster where last byte of + * last write operation is cached and there is a free place after that byte to + * the end of cluster. "finished" clusters are just stored in cache to be read + * or flushed and don't allow any writes to them. + * + * If write to the cache intersects cluster bounds, it's splat into several *split (Though I like “splat”. Sounds like a wet blotch.) + * requests by cluster bounds. 
So, consider a write that doesn't intersect + * cluster bounds to describe the whole picture: + * + * There are two cases allowed: + * + * 1. Sequential write to "unfinished" cluster. Actually it's a write sequential + * previous write. The data goes to "unfinished" cluster. If "unfinished" is + * filled up to the cluster bound it becomes "finished". + * + * 2. Write to new cluster, not existing in the cache. It may be sequential to + * previous write or not. Current "unfinshed" cluster (if exists) becomes *unfinished + * "finished" and new "unfinished" cluster is started. Note also that offset + * of the write to new cluster is not required to be aligned. + * + * Any other write operation (non-sequential write to "unfinished" cluster + * or write to any of "finished" clusters) will crash. + */ + +#include "qemu/osdep.h" + +#include "qemu/queue.h" +#include "qemu/seqcache.h" + +/* + * Cluster + * + * Representation of one cached cluster, aligned to SeqCache::cluster_size. + * Caches only one subregion of the cluster, started at @offset (may be + * unaligned to cluster_size) and of @bytes length (may be unaligned as well). + * The whole subregion always lay in one aligned cluster: + * + * QEMU_ALIGN_DOWN(offset, cluster_size) == + * QEMU_ALIGN_DOWN(offset + bytes - 1, cluster_size) + * + * @buf is allocated to be able to fill the cluster up to the cluster end, i.e. + *
Re: [PATCH v3 4/6] util: implement seqcache
On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote: Implement cache for small sequential unaligned writes, so that they may be cached until we get a complete cluster and then write it. The cache is intended to be used for backup to qcow2 compressed target opened in O_DIRECT mode, but can be reused for any similar (even not block-layer related) task. Signed-off-by: Vladimir Sementsov-Ogievskiy --- include/qemu/seqcache.h | 42 + util/seqcache.c | 361 MAINTAINERS | 6 + util/meson.build| 1 + 4 files changed, 410 insertions(+) create mode 100644 include/qemu/seqcache.h create mode 100644 util/seqcache.c Looks quite good to me, thanks. Nice explanations, too. :) The only design question I have is whether there’s a reason you’re using a list again instead of a hash table. I suppose we do need the list anyway because of the next_flush iterator, so using a hash table would only complicate the implementation, but still. [...] diff --git a/util/seqcache.c b/util/seqcache.c new file mode 100644 index 00..d923560eab --- /dev/null +++ b/util/seqcache.c @@ -0,0 +1,361 @@ +/* + * Cache for small sequential write requests. + * + * Copyright (c) 2021 Virtuozzo International GmbH. + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the "Software"), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + * + * + * = Description = + * + * SeqCache is an abbreviation for Sequential Cache. + * + * The Cache is intended to improve performance of small unaligned sequential + * writes. Cache has a cluster_size parameter and the unit of caching is aligned + * cluster. Cache has a list of cached clusters, several "finished" ones and at + * most one "unfinished". "unfinished" cluster is a cluster where last byte of + * last write operation is cached and there is a free place after that byte to + * the end of cluster. "finished" clusters are just stored in cache to be read + * or flushed and don't allow any writes to them. + * + * If write to the cache intersects cluster bounds, it's splat into several *split (Though I like “splat”. Sounds like a wet blotch.) + * requests by cluster bounds. So, consider a write that doesn't intersect + * cluster bounds to describe the whole picture: + * + * There are two cases allowed: + * + * 1. Sequential write to "unfinished" cluster. Actually it's a write sequential + *previous write. The data goes to "unfinished" cluster. If "unfinished" is + *filled up to the cluster bound it becomes "finished". + * + * 2. Write to new cluster, not existing in the cache. It may be sequential to + *previous write or not. Current "unfinshed" cluster (if exists) becomes *unfinished + *"finished" and new "unfinished" cluster is started. Note also that offset + *of the write to new cluster is not required to be aligned. 
+ * + *Any other write operation (non-sequential write to "unfinished" cluster + *or write to any of "finished" clusters) will crash. + */ + +#include "qemu/osdep.h" + +#include "qemu/queue.h" +#include "qemu/seqcache.h" + +/* + * Cluster + * + * Representation of one cached cluster, aligned to SeqCache::cluster_size. + * Caches only one subregion of the cluster, started at @offset (may be + * unaligned to cluster_size) and of @bytes length (may be unaligned as well). + * The whole subregion always lay in one aligned cluster: + * + * QEMU_ALIGN_DOWN(offset, cluster_size) == + * QEMU_ALIGN_DOWN(offset + bytes - 1, cluster_size) + * + * @buf is allocated to be able to fill the cluster up to the cluster end, i.e. + * allocated @buf length is at least: + * + * cluster_size - offset % cluster_size + */ +typedef struct Cluster { + int64_t offset; + int64_t bytes; + uint8_t *buf; + QSIMPLEQ_ENTRY(Cluster) entry; +} Cluster; + +/* + * SeqCache caches small sequential writes into "unfinished" @cur_write cluster, + * until entire cluster (of @cluster_size bytes) is filled
Re: [PULL v2 36/38] hw/block/nvme: support namespace attachment command
On Tue, 9 Mar 2021 at 11:46, Klaus Jensen wrote: > > From: Minwoo Im > > This patch supports Namespace Attachment command for the pre-defined > nvme-ns device nodes. Of course, attach/detach namespace should only be > supported in case 'subsys' is given. This is because if we detach a > namespace from a controller, somebody needs to manage the detached, but > allocated namespace in the NVMe subsystem. > > As command effect for the namespace attachment command is registered, > the host will be notified that namespace inventory is changed so that > host will rescan the namespace inventory after this command. For > example, kernel driver manages this command effect via passthru IOCTL. > diff --git a/hw/block/nvme.h b/hw/block/nvme.h > index 85a7b5a14f4e..1287bc2cd17a 100644 > --- a/hw/block/nvme.h > +++ b/hw/block/nvme.h > @@ -235,6 +235,11 @@ static inline void nvme_ns_attach(NvmeCtrl *n, > NvmeNamespace *ns) > n->namespaces[nvme_nsid(ns) - 1] = ns; > } > > +static inline void nvme_ns_detach(NvmeCtrl *n, NvmeNamespace *ns) > +{ > +n->namespaces[nvme_nsid(ns) - 1] = NULL; > +} Hi; Coverity complains about a possible array overflow both here in nvme_ns_detach() (CID 1450757) and in nvme_ns_attach() (CID 1450758): nvme_nsid() can return -1, but this code does not check for that. If these functions both assume that their ns argument is non-NULL, then adding an "assert(ns)" would probably placate Coverity and also would mean that any bugs elsewhere resulting in accidentally passing a NULL pointer would result in a clean assertion failure rather than memory corruption. (Or you could directly assert that the array index is in-bounds, I guess.) thanks -- PMM
Re: [PULL v2 19/38] hw/block/nvme: align zoned.zasl with mdts
On Tue, 9 Mar 2021 at 11:45, Klaus Jensen wrote: > > From: Klaus Jensen > > ZASL (Zone Append Size Limit) is defined exactly like MDTS (Maximum Data > Transfer Size), that is, it is a value in units of the minimum memory > page size (CAP.MPSMIN) and is reported as a power of two. > > The 'mdts' nvme device parameter is specified as in the spec, but the > 'zoned.append_size_limit' parameter is specified in bytes. This is > suboptimal for a number of reasons: > > 1. It is just plain confusing wrt. the definition of mdts. > 2. There is a lot of complexity involved in validating the value; it > must be a power of two, it should be larger than 4k, if it is zero > we set it internally to mdts, but still report it as zero. > 3. While "hw/block/nvme: improve invalid zasl value reporting" > slightly improved the handling of the parameter, the validation is > still wrong; it does not depend on CC.MPS, it depends on > CAP.MPSMIN. And we are not even checking that it is actually less > than or equal to MDTS, which is kinda the *one* condition it must > satisfy. > > Fix this by defining zasl exactly like mdts and checking the one thing > that it must satisfy (that it is less than or equal to mdts). Also, > change the default value from 128KiB to 0 (aka, whatever mdts is). 
> @@ -2144,10 +2142,9 @@ static uint16_t nvme_do_write(NvmeCtrl *n, NvmeRequest > *req, bool append, > goto invalid; > } > > -if (nvme_l2b(ns, nlb) > (n->page_size << n->zasl)) { > -trace_pci_nvme_err_append_too_large(slba, nlb, n->zasl); > -status = NVME_INVALID_FIELD; > -goto invalid; > +if (n->params.zasl && data_size > n->page_size << > n->params.zasl) { > +trace_pci_nvme_err_zasl(data_size); > +return NVME_INVALID_FIELD | NVME_DNR; > } > > slba = zone->w_ptr; Hi; Coverity points out a possible overflow here (CID 1450756): n->page_size is a uint32_t, and n->params.zasl is a uint8_t, so the "n->page_size << n->params.zasl" will be done as 32-bit arithmetic; but it is then compared against a uint64_t data_size. Is this an overflow that can never happen (ie a false positive), or should the RHS of the comparison be done as 64-bit arithmetic by adding a cast ? thanks -- PMM
Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote: 12.03.2021 14:17, Max Reitz wrote: On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote: 11.03.2021 22:58, Max Reitz wrote: On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote: There is a bug in qcow2: host cluster can be discarded (refcount becomes 0) and reused during data write. In this case data write may [..] @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, if (refcount == 0) { void *table; + Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index); + + if (infl) { + infl->refcount_zero = true; + infl->type = type; + continue; + } I don’t understand what this is supposed to do exactly. It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they? Don't follow. We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt(). But the code under “if (refcount == 0)” doesn’t free anything, does it? All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file. I don’t see how that protocol-level discard has anything to do with our problem, though. Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard. 
On the other hand, clearing caches is OK, as it's related only to qcow2-refcount, not to inflight-write-cnt. As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0. (including updating s->free_cluster_index if the refcount is 0). Hmm.. And that (setting s->free_cluster_index) is what I should actually prevent until the total reference count becomes zero. And about s->set_refcount(): it only updates the refcount itself, and doesn't free anything. So, it is more correct like this: diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c index 464d133368..1da282446d 100644 --- a/block/qcow2-refcount.c +++ b/block/qcow2-refcount.c @@ -1012,21 +1012,12 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, } else { refcount += addend; } -if (refcount == 0 && cluster_index < s->free_cluster_index) { -s->free_cluster_index = cluster_index; -} s->set_refcount(refcount_block, block_index, refcount); if (refcount == 0) { void *table; Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index); -if (infl) { -infl->refcount_zero = true; -infl->type = type; -continue; -} - table = qcow2_cache_is_table_offset(s->refcount_block_cache, offset); if (table != NULL) { @@ -1040,6 +1031,16 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, qcow2_cache_discard(s->l2_table_cache, table); } +if (infl) { +infl->refcount_zero = true; +infl->type = type; +continue; +} + +if (cluster_index < s->free_cluster_index) { +s->free_cluster_index = cluster_index; +} + if (s->discard_passthrough[type]) { update_refcount_discard(bs, cluster_offset, s->cluster_size); } -- Best regards, Vladimir
Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote: 12.03.2021 14:17, Max Reitz wrote: On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote: 11.03.2021 22:58, Max Reitz wrote: On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote: There is a bug in qcow2: host cluster can be discarded (refcount becomes 0) and reused during data write. In this case data write may [..] @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, if (refcount == 0) { void *table; + Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index); + + if (infl) { + infl->refcount_zero = true; + infl->type = type; + continue; + } I don’t understand what this is supposed to do exactly. It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they? Don't follow. We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt(). But the code under “if (refcount == 0)” doesn’t free anything, does it? All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file. I don’t see how that protocol-level discard has anything to do with our problem, though. Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard. 
On the other hand, clearing caches is OK, as it's related only to qcow2-refcount, not to inflight-write-cnt. As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0. (including updating s->free_cluster_index if the refcount is 0). Hmm.. And that (setting s->free_cluster_index) is what I should actually prevent until the total reference count becomes zero. And about s->set_refcount(): it only updates the refcount itself and doesn't free anything. Looking now at this place: if (refcount == 0 && cluster_index < s->free_cluster_index) { s->free_cluster_index = cluster_index; } s->set_refcount(refcount_block, block_index, refcount); if (refcount == 0) { void *table; Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index); if (infl) { infl->refcount_zero = true; infl->type = type; continue; } table = qcow2_cache_is_table_offset(s->refcount_block_cache, offset); if (table != NULL) { qcow2_cache_put(s->refcount_block_cache, &refcount_block); old_table_index = -1; qcow2_cache_discard(s->refcount_block_cache, table); } table = qcow2_cache_is_table_offset(s->l2_table_cache, offset); if (table != NULL) { qcow2_cache_discard(s->l2_table_cache, table); } if (s->compressed_cache) { seqcache_discard_cluster(s->compressed_cache, cluster_offset); } if (s->discard_passthrough[type]) { update_refcount_discard(bs, cluster_offset, s->cluster_size); } } Hmm. Is it OK that we use "offset" to discard qcow2 metadata caches? offset is the start of the whole loop, and is not updated on iterations. Isn't it more correct to use cluster_offset here? Or are we sure that refcount and l2 metadata are always discarded by exactly one cluster? Then we are OK, but still worth an assertion that offset == cluster_offset. -- Best regards, Vladimir
Re: [RFC] nbd: decouple reconnect from drain
10.03.2021 12:32, Roman Kagan wrote: NBD connect coroutine takes an extra in_flight reference as if it's a request handler. This prevents drain from completing until the connection coroutine releases the reference. When NBD is configured to reconnect, however, this appears to be fatal to the reconnection idea: the drain procedure wants the reconnection to be suspended, but this is only possible if the in-flight requests are canceled. As I remember from our conversation, the problem is not that we don't reconnect during a drained section, but exactly that we cancel requests on drained begin, starting from 8c517de24a8a1dcbeb. This is not a problem in scenarios where reconnect is a rare case and a failed request is acceptable. But if we have a bad connection and requests often have to wait for reconnect (so it may be considered a kind of "latency"), then really, cancelling the waiting requests on any drain() kills the reconnect feature. Fix this by making the connection coroutine stop messing with the in-flight counter. Instead, certain care is taken to properly move the reconnection stuff from one aio_context to another in .bdrv_{attach,detach}_aio_context callbacks. Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance") Signed-off-by: Roman Kagan --- This patch passes the regular make check but fails some extra iotests, in particular 277. It obviously lacks more robust interaction with the connection thread (which in general is fairly complex and hard to reason about), and perhaps has some other drawbacks, so I'll work on this further, but I'd appreciate some feedback on whether the idea is sound. In general I like the idea. The logic around drain in nbd is overcomplicated. And I never liked the fact that nbd_read_eof() does a dec-inc in_flight section. Some notes: 1.
I hope the patch can be divided into several ones, as there are several things done: - removing the use of the in_flight counter introduced by 5ad81b4946 - doing reconnect during a drained section - no longer cancelling requests in .drained_begin 2. 5ad81b4946 was needed to make nbd_client_attach_aio_context() reenter connection_co only in one (or two) possible places, not on any yield. And I don't see how that is achieved now; this should be described in the commit message. 3. About cancelling requests on drained_begin: the behavior was introduced by 8c517de24a8a1dcbeb to fix a deadlock. So, if the deadlock is now fixed another way, let's change the logic (don't cancel requests) in a separate patch, and mention commit 8c517de24a8a1dcbeb and the commit that fixes the deadlock the other way in the commit message. -- Best regards, Vladimir
Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
12.03.2021 14:17, Max Reitz wrote: On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote: 11.03.2021 22:58, Max Reitz wrote: On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote: There is a bug in qcow2: host cluster can be discarded (refcount becomes 0) and reused during data write. In this case data write may [..] @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, if (refcount == 0) { void *table; + Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index); + + if (infl) { + infl->refcount_zero = true; + infl->type = type; + continue; + } I don’t understand what this is supposed to do exactly. It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they? Don't follow. We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt(). But the code under “if (refcount == 0)” doesn’t free anything, does it? All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file. I don’t see how that protocol-level discard has anything to do with our problem, though. Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard. On the other hand, clearing caches is OK, as its related only to qcow2-refcount, not to inflight-write-cnt As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0. 
(including updating s->free_cluster_index if the refcount is 0). Hmm.. And that (setting s->free_cluster_index) is what I should actually prevent until the total reference count becomes zero. And about s->set_refcount(): it only updates the refcount itself and doesn't free anything. -- Best regards, Vladimir
Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:

11.03.2021 22:58, Max Reitz wrote:

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:

There is a bug in qcow2: a host cluster can be discarded (refcount becomes 0) and reused during a data write. In this case the data write may pollute another cluster (recently allocated) or even metadata.

I was about to ask whether we couldn’t somehow let discard requests check overlapping I/O with the tracked_request infrastructure from block/io.c (or perhaps even just let them be serializing). But I suppose there may be other reasons for clusters being discarded than explicit discard requests, so we have to have something in qcow2...

For example, a host cluster which contains exactly one compressed cluster may be discarded when the corresponding guest cluster is rewritten by a non-compressed write.

Also (not a very good argument for fixing an existing bug, but still): serializing will not help with the compressed cache, because with the compressed cache we'll have data writes (cache flushes) not related to the current in-flight request. And that leads us to something like serializing of operations with the .file child of the qcow2 node. But we can't do that, as "discarding" a host cluster is an operation on metadata, which may do no operation with the underlying .file child at all.

Yes, makes sense. I think it is a good argument, because the point is that qcow2 host cluster discards are not necessarily related to guest discards. The compressed cache is just one new example of that.

To fix the issue, let's track in-flight writes to host clusters in a hash table and consider the new counter when discarding and reusing host clusters.

This didn’t really explain anything that would help me understand what’s going on in this patch. The patch itself doesn’t really help me with comments either. It’s possible that I’m just daft, but honestly I have a hard time really understanding what’s supposed to be going on.

Sorry for this. I believe in your experience of understanding my patches, so that shows that something is wrong with the patch itself)

Coming back from jumping all over this patch, I also wonder why this isn’t split into multiple patches, where the first introduces the infrastructure and explains exactly what’s going on, and the next patches make use of it, so I could understand what each check etc. is supposed to represent and do.

OK

To serve as an example: after reading through the patch, I still have no idea how it prevents the discard problem you’re trying to fix. Maybe I’m lazy, but I would have liked exactly that to be explained by this commit message. Actually, I don’t even understand the problem just from this commit message alone, but I could resort to HEAD^ and some thinking. Not sure if that’s ideal, though.

The problem:

1. Start a write to qcow2. Assume the guest cluster is G and the corresponding host cluster is H.
2. The write request comes to the point of writing data to .file. The write to .file is started and the qcow2 mutex is unlocked.
3. At this time the refcount of H becomes 0. For example, it may be due to a discard operation on the qcow2 node, rewriting compressed data by a normal write, or some operation with snapshots...
4. Next, some operation occurs that leads to allocation of H for some other need: it may be another write-to-qcow2-node operation, or allocation of an L2 table or some other data or metadata cluster.
5. So, at this point H is used for something else. Assume an L2 table is written into H.
6. And now our write from [2] finishes. And pollutes the L2 table in H. That's a bug.

Understood.

The problem becomes more likely with the compressed write cache, because any flush operation may trigger deferred data writes, so the problem can be reproduced like this:

1. Write to guest cluster G2, which triggers flushing of data to host cluster H, which is unrelated to G2.
2. Discard cluster G in parallel [...]

So, the problem is simply triggered by a parallel write and discard to different guest clusters. That's why I should do something about this old (and most probably never triggered) problem in the qcow2 driver prior to implementing the compressed cache.

OK! That makes sense.

Hmm, probably we can avoid all these difficulties by implementing the compressed cache as a filter driver between qcow2 and its .file nodes..

As I see it, you’ve now opened the can of worms, and I’m not sure we can just close it and say we haven’t seen any problem. :) So, regardless of the compressed cache, I think this should be fixed.

I tried to do it, encountered some problems, but I don't remember what exactly. Probably I should return to this attempt. Some obvious problems:

1. Metadata writes will go through the cache, and we should handle that somehow so as not to break the sequence of unaligned chunks. The simplest way is to add a BDRV_REQ_COLORED flag and do compressed data writes to .file with this flag. And the cache filter would track this flag for sequential
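The counter scheme described here can be illustrated outside of qcow2. All names below are made up for the sketch (they are not the patch's actual API), and a fixed array stands in for the hash table: a write increments the per-host-cluster counter before the qcow2 mutex is dropped for the data write, decrements it when the write to .file has completed, and a discard (or any reallocation) may only hand out a cluster whose counter is zero, even if its refcount already dropped to 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-image hash table: counts in-flight
 * data writes per host cluster.  A real implementation would use a hash
 * table keyed by cluster index; a fixed array keeps the sketch short. */
#define MAX_CLUSTERS 1024
static unsigned inflight_writes[MAX_CLUSTERS];

/* Called under the qcow2 mutex, before it is dropped for the data write. */
static void inflight_write_inc(uint64_t host_cluster)
{
    inflight_writes[host_cluster]++;
}

/* Called under the mutex again, after the write to .file completed. */
static void inflight_write_dec(uint64_t host_cluster)
{
    assert(inflight_writes[host_cluster] > 0);
    inflight_writes[host_cluster]--;
}

/* A discard (or any reallocation) must not reuse a cluster that still
 * has writes in flight, even if its refcount already dropped to 0. */
static bool host_cluster_can_be_reused(uint64_t host_cluster)
{
    return inflight_writes[host_cluster] == 0;
}
```

This directly models step [3] of the problem above: between the increment and the decrement, the cluster is protected from reuse regardless of its refcount.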
Re: [PATCH] stream: Don't crash when node permission is denied
09.03.2021 20:34, Kevin Wolf wrote:

The image streaming block job restricts shared permissions of the nodes it accesses. This can obviously fail when other users already got these permissions. &error_abort is therefore wrong and can crash. Handle these errors gracefully and just fail starting the block job.

Reported-by: Nini Gu
Signed-off-by: Kevin Wolf
---
Based-on: 20210309121814.31078-1-kw...@redhat.com
('storage-daemon: Call job_cancel_sync_all() on shutdown')

 block/stream.c                        | 15 +++++++++++----
 tests/qemu-iotests/tests/qsd-jobs     | 20 ++++++++++++++++++++
 tests/qemu-iotests/tests/qsd-jobs.out | 10 ++++++++++
 3 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/block/stream.c b/block/stream.c
index 1fa742b0db..97bee482dc 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -206,7 +206,7 @@ void stream_start(const char *job_id, BlockDriverState *bs,
                   const char *filter_node_name,
                   Error **errp)
 {
-    StreamBlockJob *s;
+    StreamBlockJob *s = NULL;
     BlockDriverState *iter;
     bool bs_read_only;
     int basic_flags = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED;
@@ -214,6 +214,7 @@ void stream_start(const char *job_id, BlockDriverState *bs,
     BlockDriverState *cor_filter_bs = NULL;
     BlockDriverState *above_base;
     QDict *opts;
+    int ret;

     assert(!(base && bottom));
     assert(!(backing_file_str && bottom));
@@ -303,7 +304,7 @@ void stream_start(const char *job_id, BlockDriverState *bs,
      * queried only at the job start and then cached. */
     if (block_job_add_bdrv(&s->common, "active node", bs, 0,
-                           basic_flags | BLK_PERM_WRITE, &error_abort)) {
+                           basic_flags | BLK_PERM_WRITE, errp)) {

While we are here, you may also do ", errp) < 0) {", for the more usual pattern of checking the error..

         goto fail;
     }

@@ -320,8 +321,11 @@ void stream_start(const char *job_id, BlockDriverState *bs,
     for (iter = bdrv_filter_or_cow_bs(bs); iter != base;
          iter = bdrv_filter_or_cow_bs(iter))
     {
-        block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
-                           basic_flags, &error_abort);
+        ret = block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
+                                 basic_flags, errp);
+        if (ret < 0) {
+            goto fail;
+        }
     }

     s->base_overlay = base_overlay;
@@ -337,6 +341,9 @@ void stream_start(const char *job_id, BlockDriverState *bs,
     return;

 fail:
+    if (s) {
+        job_early_fail(&s->common.job);
+    }
     if (cor_filter_bs) {
         bdrv_cor_filter_drop(cor_filter_bs);
     }
diff --git a/tests/qemu-iotests/tests/qsd-jobs b/tests/qemu-iotests/tests/qsd-jobs
index 1a1c534fac..972b6b3898 100755
--- a/tests/qemu-iotests/tests/qsd-jobs
+++ b/tests/qemu-iotests/tests/qsd-jobs
@@ -30,6 +30,7 @@ status=1 # failure is the default!
 _cleanup()
 {
     _cleanup_test_img
+    rm -f "$SOCK_DIR/nbd.sock"
 }
 trap "_cleanup; exit \$status" 0 1 2 3 15
@@ -59,6 +60,25 @@ $QSD --chardev stdio,id=stdio --monitor chardev=stdio \
 {"execute": "quit"}
 EOF

+echo
+echo "=== Streaming can't get permission on base node ==="
+echo
+
+# Just make sure that this doesn't crash
+$QSD --chardev stdio,id=stdio --monitor chardev=stdio \
+    --blockdev node-name=file_base,driver=file,filename="$TEST_IMG.base" \
+    --blockdev node-name=fmt_base,driver=qcow2,file=file_base \
+    --blockdev node-name=file_overlay,driver=file,filename="$TEST_IMG" \
+    --blockdev node-name=fmt_overlay,driver=qcow2,file=file_overlay,backing=fmt_base \
+    --nbd-server addr.type=unix,addr.path="$SOCK_DIR/nbd.sock" \
+    --export type=nbd,id=export1,node-name=fmt_base,writable=on,name=export1 \

Another option is to use blkdebug with take-child-perms and/or unshare-child-perms set. Just a note, nbd is good too.

+<

Reviewed-by: Vladimir Sementsov-Ogievskiy

--
Best regards,
Vladimir
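The `s = NULL` initialization together with the `if (s)` check in the fail path is the standard pattern for a single cleanup label that can be reached both before and after a resource exists. A stripped-down sketch of just that pattern (all names here are hypothetical, not the stream.c code):

```c
#include <assert.h>
#include <stdlib.h>

static int create_calls, destroy_calls;

struct job { int dummy; };

static struct job *job_create(void)
{
    create_calls++;
    return malloc(sizeof(struct job));
}

static void job_destroy(struct job *j)
{
    destroy_calls++;
    free(j);
}

/* fail_early: error out before the job exists; otherwise error out later,
 * e.g. when a permission is denied after the job object was created. */
static int start(int fail_early)
{
    struct job *s = NULL;   /* so the fail path can tell whether it exists */

    if (fail_early) {
        goto fail;
    }
    s = job_create();
    goto fail;              /* simulate a later failure */

fail:
    if (s) {                /* only tear down what was actually set up */
        job_destroy(s);
    }
    return -1;
}
```

Without the NULL initialization, the early-failure path would call the destructor on an uninitialized pointer, which is exactly the class of crash the `job_early_fail()` guard above avoids.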
Re: [PATCH 2/9] block: Replaced qemu_mutex_lock calls with QEMU_LOCK_GUARD
11.03.2021 06:15, Mahmoud Mandour wrote:

Replaced various qemu_mutex_lock/qemu_mutex_unlock calls with lock guard macros (QEMU_LOCK_GUARD() and WITH_QEMU_LOCK_GUARD()). This slightly simplifies the code by eliminating calls to qemu_mutex_unlock and eliminates goto paths.

Signed-off-by: Mahmoud Mandour
---
 block/curl.c |  13 ++--
 block/nbd.c  | 188 ---

Better would be to make this two separate patches, I think.

 2 files changed, 95 insertions(+), 106 deletions(-)

diff --git a/block/curl.c b/block/curl.c
index 4ff895df8f..56a217951a 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -832,12 +832,12 @@ static void curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
     uint64_t start = acb->offset;
     uint64_t end;

-    qemu_mutex_lock(&s->mutex);
+    QEMU_LOCK_GUARD(&s->mutex);

     // In case we have the requested data already (e.g. read-ahead),
     // we can just call the callback and be done.
     if (curl_find_buf(s, start, acb->bytes, acb)) {
-        goto out;
+        return;
     }

     // No cache found, so let's start a new request
@@ -852,7 +852,7 @@ static void curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
     if (curl_init_state(s, state) < 0) {
         curl_clean_state(state);
         acb->ret = -EIO;
-        goto out;
+        return;
     }

     acb->start = 0;
@@ -867,7 +867,7 @@ static void curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
     if (state->buf_len && state->orig_buf == NULL) {
         curl_clean_state(state);
         acb->ret = -ENOMEM;
-        goto out;
+        return;
     }
     state->acb[0] = acb;
@@ -880,14 +880,11 @@ static void curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
         acb->ret = -EIO;
         curl_clean_state(state);
-        goto out;
+        return;
     }

     /* Tell curl it needs to kick things off */
     curl_multi_socket_action(s->multi, CURL_SOCKET_TIMEOUT, 0, &running);
-
-out:
-    qemu_mutex_unlock(&s->mutex);
 }

This change is obvious and good.

 static int coroutine_fn curl_co_preadv(BlockDriverState *bs,

diff --git a/block/nbd.c b/block/nbd.c
index c26dc5a54f..28ba7aad61 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -407,27 +407,25 @@ static void *connect_thread_func(void *opaque)
         thr->sioc = NULL;
     }

-    qemu_mutex_lock(&thr->mutex);
-
-    switch (thr->state) {
-    case CONNECT_THREAD_RUNNING:
-        thr->state = ret < 0 ? CONNECT_THREAD_FAIL : CONNECT_THREAD_SUCCESS;
-        if (thr->bh_ctx) {
-            aio_bh_schedule_oneshot(thr->bh_ctx, thr->bh_func, thr->bh_opaque);
-
-            /* play safe, don't reuse bh_ctx on further connection attempts */
-            thr->bh_ctx = NULL;
+    WITH_QEMU_LOCK_GUARD(&thr->mutex) {
+        switch (thr->state) {
+        case CONNECT_THREAD_RUNNING:
+            thr->state = ret < 0 ? CONNECT_THREAD_FAIL : CONNECT_THREAD_SUCCESS;
+            if (thr->bh_ctx) {
+                aio_bh_schedule_oneshot(thr->bh_ctx, thr->bh_func, thr->bh_opaque);

over-80 line

+
+                /* play safe, don't reuse bh_ctx on further connection attempts */

and here

+                thr->bh_ctx = NULL;
+            }
+            break;
+        case CONNECT_THREAD_RUNNING_DETACHED:
+            do_free = true;
+            break;
+        default:
+            abort();
         }
-        break;
-    case CONNECT_THREAD_RUNNING_DETACHED:
-        do_free = true;
-        break;
-    default:
-        abort();
     }
-    qemu_mutex_unlock(&thr->mutex);
-
     if (do_free) {
         nbd_free_connect_thread(thr);
     }
@@ -443,36 +441,33 @@ nbd_co_establish_connection(BlockDriverState *bs, Error **errp)
     BDRVNBDState *s = bs->opaque;
     NBDConnectThread *thr = s->connect_thread;

-    qemu_mutex_lock(&thr->mutex);
-
-    switch (thr->state) {
-    case CONNECT_THREAD_FAIL:
-    case CONNECT_THREAD_NONE:
-        error_free(thr->err);
-        thr->err = NULL;
-        thr->state = CONNECT_THREAD_RUNNING;
-        qemu_thread_create(&thr->thread, "nbd-connect",
-                           connect_thread_func, thr, QEMU_THREAD_DETACHED);
-        break;
-    case CONNECT_THREAD_SUCCESS:
-        /* Previous attempt finally succeeded in background */
-        thr->state = CONNECT_THREAD_NONE;
-        s->sioc = thr->sioc;
-        thr->sioc = NULL;
-        yank_register_function(BLOCKDEV_YANK_INSTANCE(bs->node_name),
-                               nbd_yank, bs);
-        qemu_mutex_unlock(&thr->mutex);
-        return 0;
-    case CONNECT_THREAD_RUNNING:
-        /* Already running, will wait */
-        break;
-    default:
-        abort();
-    }
-
-    thr->bh_ctx = qemu_get_current_aio_context();
+    WITH_QEMU_LOCK_GUARD(&thr->mutex) {
+        switch (thr->state) {
+        case CONNECT_THREAD_FAIL:
+        case CONNECT_THREAD_NONE:
+            error_free(thr->err);
+            thr->err = NULL;
+            thr->state = CONNECT_THREAD_RUNNING;
+
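What makes the early `return`s in the patch above safe is that QEMU's lock guards are built on the compiler's `cleanup` variable attribute: a scope-bound dummy variable's destructor releases the mutex however the scope is left. A minimal pthread-based re-implementation of the idea (GCC/Clang only; this is an illustration, not QEMU's actual macro, which also handles nesting and other lockable types):

```c
#include <assert.h>
#include <pthread.h>

static void unlock_cleanup(pthread_mutex_t **m)
{
    pthread_mutex_unlock(*m);
}

/* Declares a scope-bound guard: the mutex is locked now and unlocked
 * automatically when the enclosing scope is exited, including via
 * early return.  (One guard per scope; QEMU's macro avoids the name
 * clash with __COUNTER__.) */
#define LOCK_GUARD(mutex)                                                 \
    pthread_mutex_t *guard_                                               \
        __attribute__((cleanup(unlock_cleanup), unused)) =                \
        (pthread_mutex_lock(mutex), (mutex))

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int value;

static int set_value(int v)
{
    LOCK_GUARD(&lock);
    if (v < 0) {
        return -1;      /* early return: the guard still unlocks */
    }
    value = v;
    return 0;
}
```

This is why the conversion can replace `goto out; ... out: qemu_mutex_unlock(...)` with plain `return`: the unlock is attached to the scope, not to a label.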
Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
11.03.2021 22:58, Max Reitz wrote:

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:

There is a bug in qcow2: a host cluster can be discarded (refcount becomes 0) and reused during a data write. In this case the data write may pollute another cluster (recently allocated) or even metadata.

I was about to ask whether we couldn’t somehow let discard requests check overlapping I/O with the tracked_request infrastructure from block/io.c (or perhaps even just let them be serializing). But I suppose there may be other reasons for clusters being discarded than explicit discard requests, so we have to have something in qcow2...

For example, a host cluster which contains exactly one compressed cluster may be discarded when the corresponding guest cluster is rewritten by a non-compressed write.

Also (not a very good argument for fixing an existing bug, but still): serializing will not help with the compressed cache, because with the compressed cache we'll have data writes (cache flushes) not related to the current in-flight request. And that leads us to something like serializing of operations with the .file child of the qcow2 node. But we can't do that, as "discarding" a host cluster is an operation on metadata, which may do no operation with the underlying .file child at all.

To fix the issue, let's track in-flight writes to host clusters in a hash table and consider the new counter when discarding and reusing host clusters.

This didn’t really explain anything that would help me understand what’s going on in this patch. The patch itself doesn’t really help me with comments either. It’s possible that I’m just daft, but honestly I have a hard time really understanding what’s supposed to be going on.

Sorry for this. I believe in your experience of understanding my patches, so that shows that something is wrong with the patch itself)

Coming back from jumping all over this patch, I also wonder why this isn’t split into multiple patches, where the first introduces the infrastructure and explains exactly what’s going on, and the next patches make use of it, so I could understand what each check etc. is supposed to represent and do.

OK

To serve as an example: after reading through the patch, I still have no idea how it prevents the discard problem you’re trying to fix. Maybe I’m lazy, but I would have liked exactly that to be explained by this commit message. Actually, I don’t even understand the problem just from this commit message alone, but I could resort to HEAD^ and some thinking. Not sure if that’s ideal, though.

The problem:

1. Start a write to qcow2. Assume the guest cluster is G and the corresponding host cluster is H.
2. The write request comes to the point of writing data to .file. The write to .file is started and the qcow2 mutex is unlocked.
3. At this time the refcount of H becomes 0. For example, it may be due to a discard operation on the qcow2 node, rewriting compressed data by a normal write, or some operation with snapshots...
4. Next, some operation occurs that leads to allocation of H for some other need: it may be another write-to-qcow2-node operation, or allocation of an L2 table or some other data or metadata cluster.
5. So, at this point H is used for something else. Assume an L2 table is written into H.
6. And now our write from [2] finishes. And pollutes the L2 table in H. That's a bug.

The problem becomes more likely with the compressed write cache, because any flush operation may trigger deferred data writes, so the problem can be reproduced like this:

1. Write to guest cluster G2, which triggers flushing of data to host cluster H, which is unrelated to G2.
2. Discard cluster G in parallel [...]

So, the problem is simply triggered by a parallel write and discard to different guest clusters. That's why I should do something about this old (and most probably never triggered) problem in the qcow2 driver prior to implementing the compressed cache.

Hmm, probably we can avoid all these difficulties by implementing the compressed cache as a filter driver between qcow2 and its .file nodes..

I tried to do it, encountered some problems, but I don't remember what exactly. Probably I should return to this attempt. Some obvious problems:

1. Metadata writes will go through the cache, and we should handle that somehow so as not to break the sequence of unaligned chunks. The simplest way is to add a BDRV_REQ_COLORED flag and do compressed data writes to .file with this flag. And the cache filter would track this flag for sequential caching.

2. We can't enable caching by default this way, so the user will have to explicitly add a filter.. And actually, it's a real problem of the qcow2 driver that in O_DIRECT mode it writes compressed data in small unaligned chunks. So it would be good to fix this inside qcow2..

With a filter we avoid the problem of parallel reuse of a host cluster, as all writes go through the cache and flushes will be serialized. So the problem in the qcow2 driver is not enlarged by the cache and we could leave it unfixed..

So the commit message says
Re: [PATCH v3 00/30] qapi/qom: QAPIfy --object and object-add
On Fri, Mar 12, 2021 at 09:46:54 +0100, Paolo Bonzini wrote:
> On 12/03/21 09:14, Markus Armbruster wrote:
> > Paolo Bonzini writes:
> > > On 11/03/21 15:08, Markus Armbruster wrote:
> > > > > I would rather keep the OptsVisitor here. Do the same check for JSON
> > > > > syntax that you have in qobject_input_visitor_new_str, and whenever
> > > > > you need to walk all -object arguments, use something like this:
> > > > >
> > > > >     typedef struct ObjectArgument {
> > > > >         const char *id;
> > > > >         QDict *json;    /* or NULL for QemuOpts */
> > > > >         QSIMPLEQ_ENTRY(ObjectArgument) next;
> > > > >     }
> > > > >
> > > > > I already had patches in my queue to store -object in a GSList of
> > > > > dictionaries, changing it to use the above is easy enough.
> > > >
> > > > I think I'd prefer following -display's precedence. See my reply to
> > > > Kevin for details.
> > >
> > > Yeah, I got independently to the same conclusion and posted patches
> > > for that. I was scared that visit_type_ObjectOptions was too much for
> > > OptsVisitor but it seems to work...
> >
> > We have reason to be scared. I'll try to cover this in my review.
>
> Yes, it's a good reason to possibly even delay those 3 patches to 6.1.

Is there a chance we could get the json syntax for -object for now? I think it would simplify libvirt's code a bit and sidestep the issue of converting the already existing parameters from JSON form we have into the commandline form.
Re: [PATCH v3 00/30] qapi/qom: QAPIfy --object and object-add
On 12/03/21 09:14, Markus Armbruster wrote:

Paolo Bonzini writes:

On 11/03/21 15:08, Markus Armbruster wrote:

I would rather keep the OptsVisitor here. Do the same check for JSON syntax that you have in qobject_input_visitor_new_str, and whenever you need to walk all -object arguments, use something like this:

    typedef struct ObjectArgument {
        const char *id;
        QDict *json;    /* or NULL for QemuOpts */
        QSIMPLEQ_ENTRY(ObjectArgument) next;
    }

I already had patches in my queue to store -object in a GSList of dictionaries, changing it to use the above is easy enough.

I think I'd prefer following -display's precedence. See my reply to Kevin for details.

Yeah, I got independently to the same conclusion and posted patches for that. I was scared that visit_type_ObjectOptions was too much for OptsVisitor but it seems to work...

We have reason to be scared. I'll try to cover this in my review.

Yes, it's a good reason to possibly even delay those 3 patches to 6.1.

Paolo
Re: [PATCH v3 00/30] qapi/qom: QAPIfy --object and object-add
Paolo Bonzini writes:

> On 11/03/21 15:08, Markus Armbruster wrote:
>>> I would rather keep the OptsVisitor here. Do the same check for JSON
>>> syntax that you have in qobject_input_visitor_new_str, and whenever
>>> you need to walk all -object arguments, use something like this:
>>>
>>>     typedef struct ObjectArgument {
>>>         const char *id;
>>>         QDict *json;    /* or NULL for QemuOpts */
>>>         QSIMPLEQ_ENTRY(ObjectArgument) next;
>>>     }
>>>
>>> I already had patches in my queue to store -object in a GSList of
>>> dictionaries, changing it to use the above is easy enough.
>>
>> I think I'd prefer following -display's precedence. See my reply to
>> Kevin for details.
>
> Yeah, I got independently to the same conclusion and posted patches
> for that. I was scared that visit_type_ObjectOptions was too much for
> OptsVisitor but it seems to work...

We have reason to be scared. I'll try to cover this in my review.
Re: [PATCH] fdc: fix floppy boot for Red Hat Linux 5.2
On 12/03/2021 07.32, John Snow wrote:

The image size indicates it's an 81 track floppy disk image, which we don't have a listing for in the geometry table. When you force the drive type to 1.44MB, it guesses the reasonably close 18/80. When the drive type is allowed to auto-detect or set to 2.88, it guesses a very incorrect geometry. auto, 144 and 288 drive types get the right geometry with the new entry in the table.

Reported-by: Michael Tokarev
Signed-off-by: John Snow
---
 hw/block/fdc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/block/fdc.c b/hw/block/fdc.c
index 198940e737..b2f26ba587 100644
--- a/hw/block/fdc.c
+++ b/hw/block/fdc.c
@@ -122,6 +122,7 @@ static const FDFormat fd_formats[] = {
     /* First entry is default format */
     /* 1.44 MB 3"1/2 floppy disks */
     { FLOPPY_DRIVE_TYPE_144, 18, 80, 1, FDRIVE_RATE_500K, }, /* 3.5" 2880 */
+    { FLOPPY_DRIVE_TYPE_144, 18, 81, 1, FDRIVE_RATE_500K, },
     { FLOPPY_DRIVE_TYPE_144, 20, 80, 1, FDRIVE_RATE_500K, }, /* 3.5" 3200 */
     { FLOPPY_DRIVE_TYPE_144, 21, 80, 1, FDRIVE_RATE_500K, },
     { FLOPPY_DRIVE_TYPE_144, 21, 82, 1, FDRIVE_RATE_500K, },

That whole table-based approach seems quite unreliable to me - I've seen floppy disks with 80, 81, 82 or sometimes even 83 tracks in the past, so I think we would do better with a more flexible way of guessing ... but for the time being, this is certainly a quick and easy fix that also should not have any negative impact, thus:

Reviewed-by: Thomas Huth
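The table match the patch adds is easy to reason about numerically: a geometry entry fits an image when heads × tracks × sectors-per-track × 512 equals the image size, so an 81-track double-sided 18-sector image is 1,492,992 bytes — slightly larger than a standard 1.44M image, which is why auto-detection previously fell through to a wrong guess. A sketch of just that arithmetic (not the fdc.c matching code, which also considers the drive type and data rate):

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of a raw floppy image with the given geometry,
 * assuming 512-byte sectors and double-sided (2 heads) media,
 * as in the FLOPPY_DRIVE_TYPE_144 table entries. */
static int64_t floppy_image_size(int last_sect, int max_track)
{
    return (int64_t)2 * max_track * last_sect * 512;
}
```

Thomas's "more flexible guessing" suggestion would essentially invert this formula: factor the image size into plausible (sectors, tracks) pairs instead of enumerating each one in a table.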