Re: [PATCH v5 4/5] vvfat: Fix reading files with non-continuous clusters

2024-07-19 Thread Kevin Wolf
Am 19.07.2024 um 02:29 hat Amjad Alsharafi geschrieben:
> 
> 
> On Jul 19 2024, at 8:20 am, Amjad Alsharafi  wrote:
> 
> > On Thu, Jul 18, 2024 at 05:20:36PM +0200, Kevin Wolf wrote:
> >> Am 12.06.2024 um 14:43 hat Amjad Alsharafi geschrieben:
> >> > When reading with `read_cluster` we get the `mapping` with
> >> > `find_mapping_for_cluster` and then we call `open_file` for this
> >> > mapping.
> >> > The issue appears when it's the same file, but a second cluster that
> >> > is not immediately after it. Imagine clusters `500 -> 503`: this will
> >> > give us 2 mappings, one with the range `500..501` and another
> >> > `503..504`; both point to the same file, but at different offsets.
> >> > 
> >> > When we don't open the file since the path is the same, we won't assign
> >> > `s->current_mapping` and thus access way out of bounds of the file.
> >> > 
> >> > From our example above, after `open_file` (which didn't open anything)
> >> > we will get the offset into the file with
> >> > `s->cluster_size*(cluster_num-s->current_mapping->begin)`, which will
> >> > give us `0x2000 * (504-500)`, which is out of bounds for this mapping
> >> > and will produce some issues.
> >> > 
> >> > Signed-off-by: Amjad Alsharafi 
> >> > ---
> >> >  block/vvfat.c | 23 ---
> >> >  1 file changed, 16 insertions(+), 7 deletions(-)
> >> > 
> >> > diff --git a/block/vvfat.c b/block/vvfat.c
> >> > index b63ac5d045..fc570d0610 100644
> >> > --- a/block/vvfat.c
> >> > +++ b/block/vvfat.c
> >> > @@ -1360,15 +1360,24 @@ static int open_file(BDRVVVFATState*
> >> s,mapping_t* mapping)
> >> >  {
> >> >  if(!mapping)
> >> >  return -1;
> >> > +int new_path = 1;
> >> >  if(!s->current_mapping ||
> >> > -strcmp(s->current_mapping->path,mapping->path)) {
> >> > -/* open file */
> >> > -int fd = qemu_open_old(mapping->path,
> >> > +s->current_mapping->info.file.offset
> >> > +!= mapping->info.file.offset ||
> >> 
> >> I'm wondering if this couldn't just be s->current_mapping != mapping?
> > 
> > Actually, you are totally right. Not sure what made me go for this.
> > 
> > I also tried testing with only checking if the path changed, but it
> > fails on some tests. So the offset is important.
> > For that reason, checking just the mapping pointer is better since we
> > won't have 2 mappings with the same file and offset.
> > 
> > I'll then use this change. Thanks
> 
> Should I send a new patch, since most commits are reviewed now?

Yes, please do. I think I reviewed the whole series.

Kevin

> > 
> >> 
> >> > +(new_path = strcmp(s->current_mapping->path,
> >> mapping->path))) {
> >> 
> >> If both the path and the offset change, we still want to set
> >> new_path, I
> >> think. And if we didn't already have a mapping, we also need to open the
> >> file.
> >> 
> >> Actually, setting a variable inside the condition makes it kind of hard
> >> to read, so if s->current_mapping != mapping works, we can do the check
> >> only in the condition below:
> >> 
> >> > +if (new_path) {
> >> 
> >> if (!s->current_mapping ||
> >> strcmp(s->current_mapping->path, mapping->path))
> >> 
> >> > +/* open file */
> >> > +int fd = qemu_open_old(mapping->path,
> >> > O_RDONLY | O_BINARY | O_LARGEFILE);
> >> > -if(fd<0)
> >> > -return -1;
> >> > -vvfat_close_current_file(s);
> >> > -s->current_fd = fd;
> >> > +if (fd < 0) {
> >> > +return -1;
> >> > +}
> >> > +vvfat_close_current_file(s);
> >> > +
> >> > +s->current_fd = fd;
> >> > +}
> >> > +assert(s->current_fd);
> >> >  s->current_mapping = mapping;
> >> >  }
> >> 
> >> Kevin
> >> 
> > 
> 




Re: [PATCH v2] block-backend: per-device throttling of BLOCK_IO_ERROR reports

2024-07-19 Thread Kevin Wolf
Am 19.07.2024 um 06:54 hat Markus Armbruster geschrieben:
> Kevin Wolf  writes:
> 
> > Am 09.01.2024 um 14:13 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >> From: Leonid Kaplan 
> >> 
> >> BLOCK_IO_ERROR events come from the guest, so we must throttle them.
> >> We still want per-device throttling, so let's use the device id as a key.
> >> 
> >> Signed-off-by: Leonid Kaplan 
> >> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> >> ---
> >> 
> >> v2: add Note: to QAPI doc
> >> 
> >>  monitor/monitor.c| 10 ++
> >>  qapi/block-core.json |  2 ++
> >>  2 files changed, 12 insertions(+)
> >> 
> >> diff --git a/monitor/monitor.c b/monitor/monitor.c
> >> index 01ede1babd..ad0243e9d7 100644
> >> --- a/monitor/monitor.c
> >> +++ b/monitor/monitor.c
> >> @@ -309,6 +309,7 @@ int error_printf_unless_qmp(const char *fmt, ...)
> >>  static MonitorQAPIEventConf monitor_qapi_event_conf[QAPI_EVENT__MAX] = {
> >>  /* Limit guest-triggerable events to 1 per second */
> >>  [QAPI_EVENT_RTC_CHANGE]= { 1000 * SCALE_MS },
> >> +[QAPI_EVENT_BLOCK_IO_ERROR]= { 1000 * SCALE_MS },
> >>  [QAPI_EVENT_WATCHDOG]  = { 1000 * SCALE_MS },
> >>  [QAPI_EVENT_BALLOON_CHANGE]= { 1000 * SCALE_MS },
> >>  [QAPI_EVENT_QUORUM_REPORT_BAD] = { 1000 * SCALE_MS },
> >> @@ -498,6 +499,10 @@ static unsigned int qapi_event_throttle_hash(const 
> >> void *key)
> >>  hash += g_str_hash(qdict_get_str(evstate->data, "qom-path"));
> >>  }
> >>  
> >> +if (evstate->event == QAPI_EVENT_BLOCK_IO_ERROR) {
> >> +hash += g_str_hash(qdict_get_str(evstate->data, "device"));
> >> +}
> >
> > Using "device" only works with -drive, i.e. when the BlockBackend
> > actually has a name. In modern configurations with a -blockdev
> > referenced by -device, the BlockBackend doesn't have a name any more.
> >
> > Maybe we should be using the qdev id (or more generally, QOM path) here,
> > but that's something the event doesn't even contain yet.
> 
> Uh, does the event reliably identify the I/O error's node or not?  If
> not, then that's a serious design defect.
> 
> There's @node-name.  Several commands use "either @device or @node-name"
> to identify a node.  Is that sufficient here?

Possibly. The QAPI event is sent by a device, not by the backend, and
the commit message claims per-device throttling. That's what made me
think that we should base it on the device id.

But it's true that the error does originate in the backend (and it's
unlikely that two devices are attached to the same node anyway), so the
node-name could be good enough if we don't have a BlockBackend name. We
should claim per-block-node rather than per-device throttling then.
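
For illustration, the hash function could fall back to the node name when
the BlockBackend is anonymous. A rough sketch only (untested; it assumes the
event always carries a "node-name" when "device" is the empty string):

    if (evstate->event == QAPI_EVENT_BLOCK_IO_ERROR) {
        const char *device = qdict_get_str(evstate->data, "device");
        const char *node_name = qdict_get_try_str(evstate->data, "node-name");

        /* Anonymous BlockBackend: fall back to the node name as the key */
        hash += g_str_hash(*device ? device : node_name);
    }

qapi_event_throttle_equal() would need the same fallback to stay consistent
with the hash function.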

Kevin




Re: [PATCH v2 1/3] block/commit: implement final flush

2024-07-18 Thread Kevin Wolf
Am 26.06.2024 um 16:50 hat Vladimir Sementsov-Ogievskiy geschrieben:
> Actually, a block job is not complete without the final flush. It's
> rather unexpected to end up with a broken target when the job was
> successfully completed long ago and we only fail to flush now, or the
> process just crashed/got killed.
> 
> The mirror job already has mirror_flush() for this, so it's OK.
> 
> Do this for the commit job too.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/commit.c | 41 +++--
>  1 file changed, 27 insertions(+), 14 deletions(-)
> 
> diff --git a/block/commit.c b/block/commit.c
> index 7c3fdcb0ca..81971692a2 100644
> --- a/block/commit.c
> +++ b/block/commit.c
> @@ -134,6 +134,7 @@ static int coroutine_fn commit_run(Job *job, Error **errp)
>  int64_t n = 0; /* bytes */
>  QEMU_AUTO_VFREE void *buf = NULL;
>  int64_t len, base_len;
> +bool need_final_flush = true;
>  
>  len = blk_co_getlength(s->top);
>  if (len < 0) {
> @@ -155,8 +156,8 @@ static int coroutine_fn commit_run(Job *job, Error **errp)
>  
>  buf = blk_blockalign(s->top, COMMIT_BUFFER_SIZE);
>  
> -for (offset = 0; offset < len; offset += n) {
> -bool copy;
> +for (offset = 0; offset < len || need_final_flush; offset += n) {

In general, the control flow would be nicer to read if the final flush
weren't integrated into the loop, but just added after it.

But I assume this is pretty much required for pausing the job during
error handling in the final flush if you don't want to duplicate a lot
of the logic into a second loop?

> +bool copy = false;
>  bool error_in_source = true;
>  
>  /* Note that even when no rate limit is applied we need to yield
> @@ -166,22 +167,34 @@ static int coroutine_fn commit_run(Job *job, Error 
> **errp)
>  if (job_is_cancelled(&s->common.job)) {
>  break;
>  }
> -/* Copy if allocated above the base */
> -ret = blk_co_is_allocated_above(s->top, s->base_overlay, true,
> -offset, COMMIT_BUFFER_SIZE, &n);
> -copy = (ret > 0);
> -trace_commit_one_iteration(s, offset, n, ret);
> -if (copy) {
> -assert(n < SIZE_MAX);
>  
> -ret = blk_co_pread(s->top, offset, n, buf, 0);
> -if (ret >= 0) {
> -ret = blk_co_pwrite(s->base, offset, n, buf, 0);
> -if (ret < 0) {
> -error_in_source = false;
> +if (offset < len) {
> +/* Copy if allocated above the base */
> +ret = blk_co_is_allocated_above(s->top, s->base_overlay, true,
> +offset, COMMIT_BUFFER_SIZE, &n);
> +copy = (ret > 0);
> +trace_commit_one_iteration(s, offset, n, ret);
> +if (copy) {
> +assert(n < SIZE_MAX);
> +
> +ret = blk_co_pread(s->top, offset, n, buf, 0);
> +if (ret >= 0) {
> +ret = blk_co_pwrite(s->base, offset, n, buf, 0);
> +if (ret < 0) {
> +error_in_source = false;
> +}
>  }
>  }
> +} else {
> +assert(need_final_flush);
> +ret = blk_co_flush(s->base);
> +if (ret < 0) {
> +error_in_source = false;
> +} else {
> +need_final_flush = false;
> +}

Should we set n = 0 in this block to avoid counting the last chunk twice
for the progress?
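
Something like this, for illustration (just a sketch, assuming the progress
update at the end of the loop keeps using n):

    } else {
        assert(need_final_flush);
        ret = blk_co_flush(s->base);
        if (ret < 0) {
            error_in_source = false;
        } else {
            need_final_flush = false;
        }
        /* The flush transfers no guest data, so don't add the last chunk's
         * byte count to the progress a second time. */
        n = 0;
    }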

>  }
> +
>  if (ret < 0) {
>  BlockErrorAction action =
> block_job_error_action(&s->common, s->on_error,

Kevin




Re: [PATCH v2] block-backend: per-device throttling of BLOCK_IO_ERROR reports

2024-07-18 Thread Kevin Wolf
Am 09.01.2024 um 14:13 hat Vladimir Sementsov-Ogievskiy geschrieben:
> From: Leonid Kaplan 
> 
> BLOCK_IO_ERROR events come from the guest, so we must throttle them.
> We still want per-device throttling, so let's use the device id as a key.
> 
> Signed-off-by: Leonid Kaplan 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
> 
> v2: add Note: to QAPI doc
> 
>  monitor/monitor.c| 10 ++
>  qapi/block-core.json |  2 ++
>  2 files changed, 12 insertions(+)
> 
> diff --git a/monitor/monitor.c b/monitor/monitor.c
> index 01ede1babd..ad0243e9d7 100644
> --- a/monitor/monitor.c
> +++ b/monitor/monitor.c
> @@ -309,6 +309,7 @@ int error_printf_unless_qmp(const char *fmt, ...)
>  static MonitorQAPIEventConf monitor_qapi_event_conf[QAPI_EVENT__MAX] = {
>  /* Limit guest-triggerable events to 1 per second */
>  [QAPI_EVENT_RTC_CHANGE]= { 1000 * SCALE_MS },
> +[QAPI_EVENT_BLOCK_IO_ERROR]= { 1000 * SCALE_MS },
>  [QAPI_EVENT_WATCHDOG]  = { 1000 * SCALE_MS },
>  [QAPI_EVENT_BALLOON_CHANGE]= { 1000 * SCALE_MS },
>  [QAPI_EVENT_QUORUM_REPORT_BAD] = { 1000 * SCALE_MS },
> @@ -498,6 +499,10 @@ static unsigned int qapi_event_throttle_hash(const void 
> *key)
>  hash += g_str_hash(qdict_get_str(evstate->data, "qom-path"));
>  }
>  
> +if (evstate->event == QAPI_EVENT_BLOCK_IO_ERROR) {
> +hash += g_str_hash(qdict_get_str(evstate->data, "device"));
> +}

Using "device" only works with -drive, i.e. when the BlockBackend
actually has a name. In modern configurations with a -blockdev
referenced by -device, the BlockBackend doesn't have a name any more.

Maybe we should be using the qdev id (or more generally, QOM path) here,
but that's something the event doesn't even contain yet.

> +
>  return hash;
>  }
>  
> @@ -525,6 +530,11 @@ static gboolean qapi_event_throttle_equal(const void *a, 
> const void *b)
> qdict_get_str(evb->data, "qom-path"));
>  }
>  
> +if (eva->event == QAPI_EVENT_BLOCK_IO_ERROR) {
> +return !strcmp(qdict_get_str(eva->data, "device"),
> +   qdict_get_str(evb->data, "device"));
> +}
> +
>  return TRUE;
>  }
>  
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index ca390c5700..32c2c2f030 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -5559,6 +5559,8 @@
>  # Note: If action is "stop", a STOP event will eventually follow the
>  # BLOCK_IO_ERROR event
>  #
> +# Note: This event is rate-limited.
> +#
>  # Since: 0.13
>  #
>  # Example:

Kevin




Re: [PATCH] block/curl: rewrite http header parsing function

2024-07-18 Thread Kevin Wolf
Am 29.06.2024 um 16:25 hat Michael Tokarev geschrieben:
> Existing code was long, unclear and twisty.
> 
> Signed-off-by: Michael Tokarev 
> ---
>  block/curl.c | 44 ++--
>  1 file changed, 18 insertions(+), 26 deletions(-)
> 
> diff --git a/block/curl.c b/block/curl.c
> index 419f7c89ef..9802d0319d 100644
> --- a/block/curl.c
> +++ b/block/curl.c
> @@ -210,37 +210,29 @@ static size_t curl_header_cb(void *ptr, size_t size, 
> size_t nmemb, void *opaque)
>  {
>  BDRVCURLState *s = opaque;
>  size_t realsize = size * nmemb;
> -const char *header = (char *)ptr;
> -const char *end = header + realsize;
> -const char *accept_ranges = "accept-ranges:";
> -const char *bytes = "bytes";
> +const char *p = ptr;
> +const char *end = p + realsize;
> +const char *t = "accept-ranges : bytes "; /* A lowercase template */

I don't think spaces between the field name and the colon are allowed
in the spec (and in the old code), only before and after the value.

> -if (realsize >= strlen(accept_ranges)
> -&& g_ascii_strncasecmp(header, accept_ranges,
> -   strlen(accept_ranges)) == 0) {
> -
> -char *p = strchr(header, ':') + 1;
> -
> -/* Skip whitespace between the header name and value. */
> -while (p < end && *p && g_ascii_isspace(*p)) {
> -p++;
> -}
> -
> -if (end - p >= strlen(bytes)
> -&& strncmp(p, bytes, strlen(bytes)) == 0) {
> -
> -/* Check that there is nothing but whitespace after the value. */
> -p += strlen(bytes);
> -while (p < end && *p && g_ascii_isspace(*p)) {
> -p++;
> -}
> -
> -if (p == end || !*p) {
> -s->accept_range = true;
> +/* check if header matches the "t" template */
> +for (;;) {
> +if (*t == ' ') { /* space in t matches any amount of isspace in p */
> +if (p < end && g_ascii_isspace(*p)) {
> +++p;
> +} else {
> +++t;
>  }
> +} else if (*t && p < end && *t == g_ascii_tolower(*p)) {
> +++p, ++t;
> +} else {
> +break;
>  }
>  }
>  
> +if (!*t && p == end) { /* if we managed to reach ends of both strings */
> +s->accept_range = true;
> +}

Maybe make the generic comparison with a template a separate function
(maybe even in cutils.c?) so that curl_header_cb() essentially only has
something like this any more:

if (!qemu_memcasecmp_space(ptr, end, "accept-ranges: bytes ")) {
s->accept_range = true;
}

(A better name for the function would be preferable, of course. Maybe
also a bool return value, but if it has a name related to memcmp() or
strcmp(), then 0 must mean it matches.)

Then this would really highlight the curl specific logic rather than the
string parser in curl_header_cb().
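
Such a helper could look roughly like this (untested sketch; name and exact
semantics to be decided, 0 means "matches" as with memcmp(), and a space in
the template matches any run of whitespace in the input, including none):

    int qemu_memcasecmp_space(const char *p, const char *end, const char *t)
    {
        while (p < end || *t) {
            if (*t == ' ') {
                if (p < end && g_ascii_isspace(*p)) {
                    p++;            /* consume one whitespace character */
                } else {
                    t++;            /* template space is done matching */
                }
            } else if (*t && p < end && *t == g_ascii_tolower(*p)) {
                p++;
                t++;
            } else {
                return 1;           /* mismatch or leftover input */
            }
        }
        return 0;    /* both input and template were fully consumed */
    }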

Kevin




Re: [PATCH] qapi-block-core: Clean up blockdev-snapshot-internal-sync doc

2024-07-18 Thread Kevin Wolf
Am 18.07.2024 um 14:36 hat Markus Armbruster geschrieben:
> BlockdevSnapshotInternal is the arguments type of command
> blockdev-snapshot-internal-sync.  Its doc comment contains this note:
> 
> # .. note:: In a transaction, if @name is empty or any snapshot matching
> #@name exists, the operation will fail.  Only some image formats
> #support it; for example, qcow2, and rbd.
> 
> "In a transaction" is misleading, and "if @name is empty or any
> snapshot matching @name exists, the operation will fail" is redundant
> with the command's Errors documentation.  Drop.
> 
> The remainder is fine.  Move it to the command's doc comment, where it
> is more prominently visible, with a slight rephrasing for clarity.
> 
> Signed-off-by: Markus Armbruster 

Thanks, applied to the block branch.

Kevin




Re: [PATCH v3 1/3] block: zero data data corruption using prealloc-filter

2024-07-18 Thread Kevin Wolf
Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
> From: "Denis V. Lunev" 
> 
> We have observed that some clusters in the QCOW2 files are zeroed
> while preallocation filter is used.
> 
> We are able to trace down the following sequence when prealloc-filter
> is used:
> co=0x55e7cbed7680 qcow2_co_pwritev_task()
> co=0x55e7cbed7680 preallocate_co_pwritev_part()
> co=0x55e7cbed7680 handle_write()
> co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
> co=0x55e7cbed7680 raw_do_pwrite_zeroes()
> co=0x7f9edb7fe500 do_fallocate()
> 
> Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
> 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
> time to handle next coroutine, which
> co=0x55e7cbee91b0 qcow2_co_pwritev_task()
> co=0x55e7cbee91b0 preallocate_co_pwritev_part()
> co=0x55e7cbee91b0 handle_write()
> co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
> co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
> co=0x7f9edb7deb00 do_fallocate()
> 
> The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
> file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
> the same area. This means that (once fallocate is started inside
> 0x7f9edb7deb00) the original fallocate could end and the real write will
> be executed. In that case the write() request is handled at the same time
> as fallocate().
> 
> The patch moves s->file_lock assignment before fallocate and that is

s/file_lock/file_end/?

> crucial. The idea is that all subsequent requests into the area
> being preallocated will be issued as plain writes without fallocate
> for this area, and they will not proceed thanks to the overlapping
> requests mechanics. If preallocation fails, we will just switch
> to the normal expand-by-write behavior, which is not a problem
> except for performance.
> 
> Signed-off-by: Denis V. Lunev 
> Tested-by: Andrey Drobyshev 
> ---
>  block/preallocate.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/block/preallocate.c b/block/preallocate.c
> index d215bc5d6d..ecf0aa4baa 100644
> --- a/block/preallocate.c
> +++ b/block/preallocate.c
> @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, 
> int64_t bytes,
>  
>  want_merge_zero = want_merge_zero && (prealloc_start <= offset);
>  
> +/*
> + * Assign file_end before making actual preallocation. This will ensure
> + * that next request performed while preallocation is in progress will
> + * be passed without preallocation.
> + */
> +s->file_end = prealloc_end;
> +
>  ret = bdrv_co_pwrite_zeroes(
>  bs->file, prealloc_start, prealloc_end - prealloc_start,
>  BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
> @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, 
> int64_t bytes,
>  return false;
>  }
>  
> -s->file_end = prealloc_end;
>  return want_merge_zero;
>  }

Until bdrv_co_pwrite_zeroes() completes successfully, the file size is
unchanged. In other words, if anything calls preallocate_co_getlength()
in the meantime, don't we run into...

ret = bdrv_co_getlength(bs->file->bs);

if (has_prealloc_perms(bs)) {
s->file_end = s->zero_start = s->data_end = ret;
}

...and reset s->file_end back to the old value, re-enabling the bug
you're trying to fix here?

Or do we know that no such code path can be called concurrently for some
reason?

Kevin




Re: [PATCH v5 3/5] vvfat: Fix wrong checks for cluster mappings invariant

2024-07-18 Thread Kevin Wolf
Am 12.06.2024 um 14:43 hat Amjad Alsharafi geschrieben:
> What this `abort` was intended to check was:
> - if the `mapping->first_mapping_index` is not the same as
>   `first_mapping_index`, which **should** happen only in one case,
>   when we are handling the first mapping, in that case
>   `mapping->first_mapping_index == -1`, in all other cases, the other
>   mappings after the first should have the condition `true`.
> - From above, we know that this is the first mapping, so if the offset
>   is not `0`, then abort, since this is an invalid state.
> 
> The issue was that `first_mapping_index` is not set if we are
> checking from the middle: the variable `first_mapping_index` is
> only set if we passed through the `cluster_was_modified` check with
> the first mapping and checked the other mappings in the same function call.
> 
> One approach is to go into the loop even if `cluster_was_modified`
> is not true so that we will be able to set `first_mapping_index` for the
> first mapping, but since `first_mapping_index` is only used here,
> another approach is to just check manually for the
> `mapping->first_mapping_index != -1` since we know that this is the
> value for the only entry where `offset == 0` (i.e. first mapping).
> 
> Signed-off-by: Amjad Alsharafi 

Reviewed-by: Kevin Wolf 




Re: [PATCH v5 2/5] vvfat: Fix usage of `info.file.offset`

2024-07-18 Thread Kevin Wolf
Am 12.06.2024 um 14:43 hat Amjad Alsharafi geschrieben:
> The field is marked as "the offset in the file (in clusters)", but it
> was being used like this
> `cluster_size*(nums)+mapping->info.file.offset`, which is incorrect.
> 
> Signed-off-by: Amjad Alsharafi 

Reviewed-by: Kevin Wolf 




Re: [PATCH v5 5/5] iotests: Add `vvfat` tests

2024-07-18 Thread Kevin Wolf
Am 12.06.2024 um 14:43 hat Amjad Alsharafi geschrieben:
> Added several tests to verify the implementation of the vvfat driver.
> 
> We needed a way to interact with it, so we created a basic `fat16.py`
> driver that handles writing correct sectors for us.
> 
> Added `vvfat` to the non-generic formats, as it's not a normal image format.
> 
> Signed-off-by: Amjad Alsharafi 

> +def truncate_file(
> +self,
> +entry: FatDirectoryEntry,
> +new_size: int,
> +allocate_non_continuous: bool = False,
> +):
> +"""
> +Truncate the file at the given path to the new size.
> +"""
> +if entry is None:
> +return Exception("entry is None")
> +if entry.attributes & 0x10 != 0:
> +raise Exception(f"{entry.whole_name()} is a directory")
> +
> +def clusters_from_size(size: int):
> +return (
> +size + self.boot_sector.cluster_bytes() - 1
> +) // self.boot_sector.cluster_bytes()
> +
> +# First, allocate new FATs if we need to
> +required_clusters = clusters_from_size(new_size)
> +current_clusters = clusters_from_size(entry.size_bytes)
> +
> +affected_clusters = set()
> +
> +# Keep at least one cluster, easier to manage this way
> +if required_clusters == 0:
> +required_clusters = 1
> +if current_clusters == 0:
> +current_clusters = 1
> +
> +if required_clusters > current_clusters:
> +# Allocate new clusters
> +cluster = entry.cluster
> +to_add = required_clusters
> +for _ in range(current_clusters - 1):
> +to_add -= 1
> +cluster = self.next_cluster(cluster)
> +assert required_clusters > 0, "No new clusters to allocate"
> +assert cluster is not None, "Cluster is None"
> +assert (
> +self.next_cluster(cluster) is None
> +), "Cluster is not the last cluster"
> +
> +# Allocate new clusters
> +for _ in range(to_add - 1):
> +new_cluster = self.next_free_cluster()
> +if allocate_non_continuous:
> +new_cluster = self.next_free_cluster_non_continuous()

The normal self.next_free_cluster() could be in an else branch. No
reason to search for a free cluster when you immediately overwrite it
anyway.

> +self.write_fat_entry(cluster, new_cluster)
> +self.write_fat_entry(new_cluster, 0x)
> +cluster = new_cluster
> +
> +elif required_clusters < current_clusters:
> +# Truncate the file
> +cluster = entry.cluster
> +for _ in range(required_clusters - 1):
> +cluster = self.next_cluster(cluster)
> +assert cluster is not None, "Cluster is None"
> +
> +next_cluster = self.next_cluster(cluster)
> +# mark last as EOF
> +self.write_fat_entry(cluster, 0x)
> +# free the rest
> +while next_cluster is not None:
> +cluster = next_cluster
> +next_cluster = self.next_cluster(next_cluster)
> +self.write_fat_entry(cluster, 0)
> +
> +self.flush_fats()
> +
> +# verify number of clusters
> +cluster = entry.cluster
> +count = 0
> +while cluster is not None:
> +count += 1
> +affected_clusters.add(cluster)
> +cluster = self.next_cluster(cluster)
> +assert (
> +count == required_clusters
> +), f"Expected {required_clusters} clusters, got {count}"
> +
> +# update the size
> +entry.size_bytes = new_size
> +self.update_direntry(entry)
> +
> +# trigger every affected cluster
> +    for cluster in affected_clusters:
> +first_sector = self.boot_sector.first_sector_of_cluster(cluster)
> +first_sector_data = self.read_sectors(first_sector, 1)
> +self.write_sectors(first_sector, first_sector_data)

Other than this, the patch looks good to me and we seem to test all the
cases that are fixed by the previous patches.

Reviewed-by: Kevin Wolf 
Tested-by: Kevin Wolf 

Kevin




Re: [PATCH v5 4/5] vvfat: Fix reading files with non-continuous clusters

2024-07-18 Thread Kevin Wolf
Am 12.06.2024 um 14:43 hat Amjad Alsharafi geschrieben:
> When reading with `read_cluster` we get the `mapping` with
> `find_mapping_for_cluster` and then we call `open_file` for this
> mapping.
> The issue appears when it's the same file, but a second cluster that is
> not immediately after it. Imagine clusters `500 -> 503`: this will give
> us 2 mappings, one with the range `500..501` and another `503..504`;
> both point to the same file, but at different offsets.
> 
> When we don't open the file since the path is the same, we won't assign
> `s->current_mapping` and thus access way out of bounds of the file.
> 
> From our example above, after `open_file` (which didn't open anything) we
> will get the offset into the file with
> `s->cluster_size*(cluster_num-s->current_mapping->begin)`, which will
> give us `0x2000 * (504-500)`, which is out of bounds for this mapping and
> will produce some issues.
> 
> Signed-off-by: Amjad Alsharafi 
> ---
>  block/vvfat.c | 23 ---
>  1 file changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/block/vvfat.c b/block/vvfat.c
> index b63ac5d045..fc570d0610 100644
> --- a/block/vvfat.c
> +++ b/block/vvfat.c
> @@ -1360,15 +1360,24 @@ static int open_file(BDRVVVFATState* s,mapping_t* 
> mapping)
>  {
>  if(!mapping)
>  return -1;
> +int new_path = 1;
>  if(!s->current_mapping ||
> -strcmp(s->current_mapping->path,mapping->path)) {
> -/* open file */
> -int fd = qemu_open_old(mapping->path,
> +s->current_mapping->info.file.offset
> +!= mapping->info.file.offset ||

I'm wondering if this couldn't just be s->current_mapping != mapping?

> +(new_path = strcmp(s->current_mapping->path, mapping->path))) {

If both the path and the offset change, we still want to set new_path, I
think. And if we didn't already have a mapping, we also need to open the
file.

Actually, setting a variable inside the condition makes it kind of hard
to read, so if s->current_mapping != mapping works, we can do the check
only in the condition below:

> +if (new_path) {

if (!s->current_mapping ||
strcmp(s->current_mapping->path, mapping->path))

> +/* open file */
> +int fd = qemu_open_old(mapping->path,
> O_RDONLY | O_BINARY | O_LARGEFILE);
> -if(fd<0)
> -return -1;
> -vvfat_close_current_file(s);
> -s->current_fd = fd;
> +if (fd < 0) {
> +return -1;
> +}
> +vvfat_close_current_file(s);
> +
> +s->current_fd = fd;
> +}
> +assert(s->current_fd);
>  s->current_mapping = mapping;
>  }
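
Putting both suggestions together (checking s->current_mapping != mapping,
and comparing the path only to decide whether to reopen), open_file() could
end up looking roughly like this; an untested sketch to illustrate the
shape, not the final patch:

    static int open_file(BDRVVVFATState* s, mapping_t* mapping)
    {
        if (!mapping) {
            return -1;
        }
        if (s->current_mapping != mapping) {
            /* Reopen the file only when the backing path actually changes */
            if (!s->current_mapping ||
                strcmp(s->current_mapping->path, mapping->path)) {
                int fd = qemu_open_old(mapping->path,
                                       O_RDONLY | O_BINARY | O_LARGEFILE);
                if (fd < 0) {
                    return -1;
                }
                vvfat_close_current_file(s);
                s->current_fd = fd;
            }
            assert(s->current_fd);
            s->current_mapping = mapping;
        }
        return 0;
    }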

Kevin




Re: [PULL 4/4] block: Parse filenames only when explicitly requested

2024-07-04 Thread Kevin Wolf
Am 03.07.2024 um 23:16 hat Michael Tokarev geschrieben:
> 02.07.2024 19:39, Kevin Wolf wrote:
> > When handling image filenames from legacy options such as -drive or from
> > tools, these filenames are parsed for protocol prefixes, including for
> > the json:{} pseudo-protocol.
> > 
> > This behaviour is intended for filenames that come directly from the
> > command line and for backing files, which may come from the image file
> > itself. Higher level management tools generally take care to verify that
> > untrusted images don't contain a bad (or any) backing file reference;
> > 'qemu-img info' is a suitable tool for this.
> > 
> > However, for other files that can be referenced in images, such as
> > qcow2 data files or VMDK extents, the string from the image file is
> > usually not verified by management tools - and 'qemu-img info' wouldn't
> > be suitable because in contrast to backing files, it already opens these
> > other referenced files. So here the string should be interpreted as a
> > literal local filename. More complex configurations need to be specified
> > explicitly on the command line or in QMP.
> > 
> > This patch changes bdrv_open_inherit() so that it only parses filenames
> > if a new parameter parse_filename is true. It is set for the top level
> > in bdrv_open(), for the file child and for the backing file child. All
> > other callers pass false and disable filename parsing this way.
> 
> I'm attaching backports of this change to 8.2 and 7.2 series.
> 
> Please check if the resulting backports are correct, or if they're needed
> in the first place.  Note: 7.2 lacks quite some locking in this area, eg
> v8.0.0-2069-g8394c35ee148 "block: Fix AioContext locking in bdrv_open_child()"
> v8.2.0-rc0-59-g6bc0bcc89f84 "block: Fix deadlocks in bdrv_graph_wrunlock()".

Apart from minor style differences, your 7.2 backport is a perfect match
of the RHEL 9.2 backport which I already reviewed downstream. The 8.2
patch is a bit different from the RHEL 9.4 one because we backported the
final bits of the multiqueue work, but the differences make sense for
upstream QEMU 8.2.

So both patches look good to me.

Kevin




[PULL 3/4] iotests/270: Don't store data-file with json: prefix in image

2024-07-02 Thread Kevin Wolf
We want to disable filename parsing for data files because it's too easy
to abuse in malicious image files. Make the test ready for the change by
passing the data file explicitly in command line options.

Cc: qemu-sta...@nongnu.org
Signed-off-by: Kevin Wolf 
Reviewed-by: Eric Blake 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Hanna Czenczek 
---
 tests/qemu-iotests/270 | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/tests/qemu-iotests/270 b/tests/qemu-iotests/270
index 74352342db..c37b674aa2 100755
--- a/tests/qemu-iotests/270
+++ b/tests/qemu-iotests/270
@@ -60,8 +60,16 @@ _make_test_img -o cluster_size=2M,data_file="$TEST_IMG.orig" 
\
 # "write" 2G of data without using any space.
 # (qemu-img create does not like it, though, because null-co does not
 # support image creation.)
-$QEMU_IMG amend -o data_file="json:{'driver':'null-co',,'size':'4294967296'}" \
-"$TEST_IMG"
+test_img_with_null_data="json:{
+'driver': '$IMGFMT',
+'file': {
+'filename': '$TEST_IMG'
+},
+'data-file': {
+'driver': 'null-co',
+'size':'4294967296'
+}
+}"
 
 # This gives us a range of:
 #   2^31 - 512 + 768 - 1 = 2^31 + 255 > 2^31
@@ -74,7 +82,7 @@ $QEMU_IMG amend -o 
data_file="json:{'driver':'null-co',,'size':'4294967296'}" \
 # on L2 boundaries, we need large L2 tables; hence the cluster size of
 # 2 MB.  (Anything from 256 kB should work, though, because then one L2
 # table covers 8 GB.)
-$QEMU_IO -c "write 768 $((2 ** 31 - 512))" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "write 768 $((2 ** 31 - 512))" "$test_img_with_null_data" | 
_filter_qemu_io
 
 _check_test_img
 
-- 
2.45.2




[PULL 0/4] Block layer patches (CVE-2024-4467)

2024-07-02 Thread Kevin Wolf
The following changes since commit c80a339587fe4148292c260716482dd2f86d4476:

  Merge tag 'pull-target-arm-20240701' of 
https://git.linaro.org/people/pmaydell/qemu-arm into staging (2024-07-01 
10:41:45 -0700)

are available in the Git repository at:

  https://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 7ead946998610657d38d1a505d5f25300d4ca613:

  block: Parse filenames only when explicitly requested (2024-07-02 18:12:30 
+0200)


Block layer patches (CVE-2024-4467)

- Don't open qcow2 data files in 'qemu-img info'
- Disallow protocol prefixes for qcow2 data files, VMDK extent files and
  other child nodes that are neither 'file' nor 'backing'


Kevin Wolf (4):
  qcow2: Don't open data_file with BDRV_O_NO_IO
  iotests/244: Don't store data-file with protocol in image
  iotests/270: Don't store data-file with json: prefix in image
  block: Parse filenames only when explicitly requested

 block.c| 90 +-
 block/qcow2.c  | 17 -
 tests/qemu-iotests/061 |  6 ++--
 tests/qemu-iotests/061.out |  8 +++--
 tests/qemu-iotests/244 | 19 --
 tests/qemu-iotests/270 | 14 ++--
 6 files changed, 110 insertions(+), 44 deletions(-)




[PULL 4/4] block: Parse filenames only when explicitly requested

2024-07-02 Thread Kevin Wolf
When handling image filenames from legacy options such as -drive or from
tools, these filenames are parsed for protocol prefixes, including for
the json:{} pseudo-protocol.

This behaviour is intended for filenames that come directly from the
command line and for backing files, which may come from the image file
itself. Higher level management tools generally take care to verify that
untrusted images don't contain a bad (or any) backing file reference;
'qemu-img info' is a suitable tool for this.

However, for other files that can be referenced in images, such as
qcow2 data files or VMDK extents, the string from the image file is
usually not verified by management tools - and 'qemu-img info' wouldn't
be suitable because in contrast to backing files, it already opens these
other referenced files. So here the string should be interpreted as a
literal local filename. More complex configurations need to be specified
explicitly on the command line or in QMP.

This patch changes bdrv_open_inherit() so that it only parses filenames
if a new parameter parse_filename is true. It is set for the top level
in bdrv_open(), for the file child and for the backing file child. All
other callers pass false and disable filename parsing this way.

Cc: qemu-sta...@nongnu.org
Signed-off-by: Kevin Wolf 
Reviewed-by: Eric Blake 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Hanna Czenczek 
---
 block.c | 90 -
 1 file changed, 57 insertions(+), 33 deletions(-)

diff --git a/block.c b/block.c
index c1cc313d21..c317de9eaa 100644
--- a/block.c
+++ b/block.c
@@ -86,6 +86,7 @@ static BlockDriverState *bdrv_open_inherit(const char 
*filename,
BlockDriverState *parent,
const BdrvChildClass *child_class,
BdrvChildRole child_role,
+   bool parse_filename,
Error **errp);
 
 static bool bdrv_recurse_has_child(BlockDriverState *bs,
@@ -2055,7 +2056,8 @@ static void parse_json_protocol(QDict *options, const 
char **pfilename,
  * block driver has been specified explicitly.
  */
 static int bdrv_fill_options(QDict **options, const char *filename,
- int *flags, Error **errp)
+ int *flags, bool allow_parse_filename,
+ Error **errp)
 {
 const char *drvname;
 bool protocol = *flags & BDRV_O_PROTOCOL;
@@ -2097,7 +2099,7 @@ static int bdrv_fill_options(QDict **options, const char 
*filename,
 if (protocol && filename) {
 if (!qdict_haskey(*options, "filename")) {
 qdict_put_str(*options, "filename", filename);
-parse_filename = true;
+parse_filename = allow_parse_filename;
 } else {
 error_setg(errp, "Can't specify 'file' and 'filename' options at "
  "the same time");
@@ -3660,7 +3662,8 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict 
*parent_options,
 }
 
 backing_hd = bdrv_open_inherit(backing_filename, reference, options, 0, bs,
-   _of_bds, bdrv_backing_role(bs), errp);
+   _of_bds, bdrv_backing_role(bs), true,
+   errp);
 if (!backing_hd) {
 bs->open_flags |= BDRV_O_NO_BACKING;
 error_prepend(errp, "Could not open backing file: ");
@@ -3694,7 +3697,8 @@ free_exit:
 static BlockDriverState *
 bdrv_open_child_bs(const char *filename, QDict *options, const char *bdref_key,
BlockDriverState *parent, const BdrvChildClass *child_class,
-   BdrvChildRole child_role, bool allow_none, Error **errp)
+   BdrvChildRole child_role, bool allow_none,
+   bool parse_filename, Error **errp)
 {
 BlockDriverState *bs = NULL;
 QDict *image_options;
@@ -3725,7 +3729,8 @@ bdrv_open_child_bs(const char *filename, QDict *options, 
const char *bdref_key,
 }
 
 bs = bdrv_open_inherit(filename, reference, image_options, 0,
-   parent, child_class, child_role, errp);
+   parent, child_class, child_role, parse_filename,
+   errp);
 if (!bs) {
 goto done;
 }
@@ -3735,6 +3740,33 @@ done:
 return bs;
 }
 
+static BdrvChild *bdrv_open_child_common(const char *filename,
+ QDict *options, const char *bdref_key,
+ BlockDriverState *parent,
+ const BdrvChildClass *child_class,
+ BdrvChildRole child_role,
+ bool allow_none, bool parse_filename,
+  

[PULL 2/4] iotests/244: Don't store data-file with protocol in image

2024-07-02 Thread Kevin Wolf
We want to disable filename parsing for data files because it's too easy
to abuse in malicious image files. Make the test ready for the change by
passing the data file explicitly in command line options.

Cc: qemu-sta...@nongnu.org
Signed-off-by: Kevin Wolf 
Reviewed-by: Eric Blake 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Hanna Czenczek 
---
 tests/qemu-iotests/244 | 19 ---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/tests/qemu-iotests/244 b/tests/qemu-iotests/244
index 3e61fa25bb..bb9cc6512f 100755
--- a/tests/qemu-iotests/244
+++ b/tests/qemu-iotests/244
@@ -215,9 +215,22 @@ $QEMU_IMG convert -f $IMGFMT -O $IMGFMT -n -C 
"$TEST_IMG.src" "$TEST_IMG"
 $QEMU_IMG compare -f $IMGFMT -F $IMGFMT "$TEST_IMG.src" "$TEST_IMG"
 
 # blkdebug doesn't support copy offloading, so this tests the error path
-$QEMU_IMG amend -f $IMGFMT -o "data_file=blkdebug::$TEST_IMG.data" "$TEST_IMG"
-$QEMU_IMG convert -f $IMGFMT -O $IMGFMT -n -C "$TEST_IMG.src" "$TEST_IMG"
-$QEMU_IMG compare -f $IMGFMT -F $IMGFMT "$TEST_IMG.src" "$TEST_IMG"
+test_img_with_blkdebug="json:{
+'driver': 'qcow2',
+'file': {
+'driver': 'file',
+'filename': '$TEST_IMG'
+},
+'data-file': {
+'driver': 'blkdebug',
+'image': {
+'driver': 'file',
+'filename': '$TEST_IMG.data'
+}
+}
+}"
+$QEMU_IMG convert -f $IMGFMT -O $IMGFMT -n -C "$TEST_IMG.src" 
"$test_img_with_blkdebug"
+$QEMU_IMG compare -f $IMGFMT -F $IMGFMT "$TEST_IMG.src" 
"$test_img_with_blkdebug"
 
 echo
 echo "=== Flushing should flush the data file ==="
-- 
2.45.2




[PULL 1/4] qcow2: Don't open data_file with BDRV_O_NO_IO

2024-07-02 Thread Kevin Wolf
One use case for 'qemu-img info' is verifying that untrusted images
don't reference an unwanted external file, be it as a backing file or an
external data file. To make sure that calling 'qemu-img info' can't
already have undesired side effects with a malicious image, just don't
open the data file at all with BDRV_O_NO_IO. If nothing ever tries to do
I/O, we don't need to have it open.

This changes the output of iotests case 061, which used 'qemu-img info'
to show that opening an image with an invalid data file fails. After
this patch, it succeeds. Replace this part of the test with a qemu-io
call, but keep the final 'qemu-img info' to show that the invalid data
file is correctly displayed in the output.

Fixes: CVE-2024-4467
Cc: qemu-sta...@nongnu.org
Signed-off-by: Kevin Wolf 
Reviewed-by: Eric Blake 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Hanna Czenczek 
---
 block/qcow2.c  | 17 -
 tests/qemu-iotests/061 |  6 --
 tests/qemu-iotests/061.out |  8 ++--
 3 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 10883a2494..70b19730a3 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1636,7 +1636,22 @@ qcow2_do_open(BlockDriverState *bs, QDict *options, int 
flags,
 goto fail;
 }
 
-if (open_data_file) {
+if (open_data_file && (flags & BDRV_O_NO_IO)) {
+/*
+ * Don't open the data file for 'qemu-img info' so that it can be used
+ * to verify that an untrusted qcow2 image doesn't refer to external
+ * files.
+ *
+ * Note: This still makes has_data_file() return true.
+ */
+if (s->incompatible_features & QCOW2_INCOMPAT_DATA_FILE) {
+s->data_file = NULL;
+} else {
+s->data_file = bs->file;
+}
+qdict_extract_subqdict(options, NULL, "data-file.");
+qdict_del(options, "data-file");
+} else if (open_data_file) {
 /* Open external data file */
 bdrv_graph_co_rdunlock();
 s->data_file = bdrv_co_open_child(NULL, options, "data-file", bs,
diff --git a/tests/qemu-iotests/061 b/tests/qemu-iotests/061
index 53c7d428e3..b71ac097d1 100755
--- a/tests/qemu-iotests/061
+++ b/tests/qemu-iotests/061
@@ -326,12 +326,14 @@ $QEMU_IMG amend -o "data_file=foo" "$TEST_IMG"
 echo
 _make_test_img -o "compat=1.1,data_file=$TEST_IMG.data" 64M
 $QEMU_IMG amend -o "data_file=foo" "$TEST_IMG"
-_img_info --format-specific
+$QEMU_IO -c "read 0 4k" "$TEST_IMG" 2>&1 | _filter_testdir | _filter_imgfmt
+$QEMU_IO -c "open -o 
data-file.filename=$TEST_IMG.data,file.filename=$TEST_IMG" -c "read 0 4k" | 
_filter_qemu_io
 TEST_IMG="data-file.filename=$TEST_IMG.data,file.filename=$TEST_IMG" _img_info 
--format-specific --image-opts
 
 echo
 $QEMU_IMG amend -o "data_file=" --image-opts 
"data-file.filename=$TEST_IMG.data,file.filename=$TEST_IMG"
-_img_info --format-specific
+$QEMU_IO -c "read 0 4k" "$TEST_IMG" 2>&1 | _filter_testdir | _filter_imgfmt
+$QEMU_IO -c "open -o 
data-file.filename=$TEST_IMG.data,file.filename=$TEST_IMG" -c "read 0 4k" | 
_filter_qemu_io
 TEST_IMG="data-file.filename=$TEST_IMG.data,file.filename=$TEST_IMG" _img_info 
--format-specific --image-opts
 
 echo
diff --git a/tests/qemu-iotests/061.out b/tests/qemu-iotests/061.out
index 139fc68177..24c33add7c 100644
--- a/tests/qemu-iotests/061.out
+++ b/tests/qemu-iotests/061.out
@@ -545,7 +545,9 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
 qemu-img: data-file can only be set for images that use an external data file
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 
data_file=TEST_DIR/t.IMGFMT.data
-qemu-img: Could not open 'TEST_DIR/t.IMGFMT': Could not open 'foo': No such 
file or directory
+qemu-io: can't open device TEST_DIR/t.IMGFMT: Could not open 'foo': No such 
file or directory
+read 4096/4096 bytes at offset 0
+4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 image: TEST_DIR/t.IMGFMT
 file format: IMGFMT
 virtual size: 64 MiB (67108864 bytes)
@@ -560,7 +562,9 @@ Format specific information:
 corrupt: false
 extended l2: false
 
-qemu-img: Could not open 'TEST_DIR/t.IMGFMT': 'data-file' is required for this 
image
+qemu-io: can't open device TEST_DIR/t.IMGFMT: 'data-file' is required for this 
image
+read 4096/4096 bytes at offset 0
+4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 image: TEST_DIR/t.IMGFMT
 file format: IMGFMT
 virtual size: 64 MiB (67108864 bytes)
-- 
2.45.2




Re: [PATCH] scsi: Don't ignore most usb-storage properties

2024-07-02 Thread Kevin Wolf
Am 01.07.2024 um 15:08 hat Fiona Ebner geschrieben:
> Hi,
> 
> we got a user report about bootindex for a 'usb-storage' device not
> working anymore [0], and I reproduced it and bisected it to this patch.
> 
> Am 31.01.24 um 14:06 schrieb Kevin Wolf:
> > @@ -399,11 +397,10 @@ SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, 
> > BlockBackend *blk,
> >  object_property_add_child(OBJECT(bus), name, OBJECT(dev));
> >  g_free(name);
> >  
> > +s = SCSI_DEVICE(dev);
> > +s->conf = *conf;
> > +
> >  qdev_prop_set_uint32(dev, "scsi-id", unit);
> > -if (bootindex >= 0) {
> > -object_property_set_int(OBJECT(dev), "bootindex", bootindex,
> > -_abort);
> > -}
> 
> The fact that this is not called anymore means that the 'set' method
> for the property is also not called. Here, that method is
> device_set_bootindex() (as configured by scsi_dev_instance_init() ->
> device_add_bootindex_property()). Therefore, the device is never
> registered via add_boot_device_path() meaning that the bootindex
> property does not have the desired effect anymore.

Hmm, yes, seems I missed this side effect.

Bringing back this one object_property_set_int() call would be the
easiest fix, but I wonder if an explicit add_boot_device_path() call
(and allowing that one to fail gracefully instead of directly calling
exit()) wouldn't be better than re-setting a property to its current
value just for the side effect.
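
Roughly this, in scsi_bus_legacy_add_drive() (sketch only; the NULL suffix
is an assumption, and add_boot_device_path() would still need to learn to
fail gracefully instead of exiting on conflicts):

    if (bootindex >= 0) {
        /* Register the boot order entry directly instead of going through
         * the "bootindex" property setter just for its side effect. */
        add_boot_device_path(bootindex, dev, NULL);
    }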

> Is it necessary to keep the object_property_set_{bool,int} and
> qdev_prop_set_enum calls around for these potential side effects? Would
> it even be necessary to introduce new similar calls for the newly
> supported properties? Or is there an easy alternative to
> s->conf = *conf;
> that does trigger the side effects?

I don't think the other properties whose setter we don't call any more
have side effects. They are processed during .realize, which is what I
probably expected for bootindex, too.

And that's really how all properties should be if we ever want to move
forward with the .instance_config approach for creating QOM objects
because then we won't call any setters during object creation any more;
they would only be for runtime changes.

Kevin




[PATCH 2/2] block/graph-lock: Make WITH_GRAPH_RDLOCK_GUARD() fully checked

2024-06-27 Thread Kevin Wolf
Upstream clang 18 (and backports to clang 17 in Fedora and RHEL)
implemented support for __attribute__((cleanup())) in its Thread Safety
Analysis, so we can now actually have a proper implementation of
WITH_GRAPH_RDLOCK_GUARD() that understands when we acquire and when we
release the lock.

-Wthread-safety is now only enabled if the compiler is new enough to
understand this pattern. In theory, we could have used some #ifdefs to
keep the existing basic checks on old compilers, but as long as someone
runs a newer compiler (and our CI does), we will catch locking problems,
so it's probably not worth keeping multiple implementations for this.

The implementation can't use g_autoptr any more because the glib macros
define wrapper functions that don't have the right TSA attributes, so
the compiler would complain about them. Just use the cleanup attribute
directly instead.

Signed-off-by: Kevin Wolf 
---
 include/block/graph-lock.h | 21 ++---
 meson.build| 14 +-
 2 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/include/block/graph-lock.h b/include/block/graph-lock.h
index d7545e82d0..dc8d949184 100644
--- a/include/block/graph-lock.h
+++ b/include/block/graph-lock.h
@@ -209,31 +209,38 @@ typedef struct GraphLockable { } GraphLockable;
  * unlocked. TSA_ASSERT_SHARED() makes sure that the following calls know that
  * we hold the lock while unlocking is left unchecked.
  */
-static inline GraphLockable * TSA_ASSERT_SHARED(graph_lock) TSA_NO_TSA 
coroutine_fn
+static inline GraphLockable * TSA_ACQUIRE_SHARED(graph_lock) coroutine_fn
 graph_lockable_auto_lock(GraphLockable *x)
 {
 bdrv_graph_co_rdlock();
 return x;
 }
 
-static inline void TSA_NO_TSA coroutine_fn
-graph_lockable_auto_unlock(GraphLockable *x)
+static inline void TSA_RELEASE_SHARED(graph_lock) coroutine_fn
+graph_lockable_auto_unlock(GraphLockable **x)
 {
 bdrv_graph_co_rdunlock();
 }
 
-G_DEFINE_AUTOPTR_CLEANUP_FUNC(GraphLockable, graph_lockable_auto_unlock)
+#define GRAPH_AUTO_UNLOCK __attribute__((cleanup(graph_lockable_auto_unlock)))
 
+/*
+ * @var is only used to break the loop after the first iteration.
+ * @unlock_var can't be unlocked and then set to NULL because TSA wants the 
lock
+ * to be held at the start of every iteration of the loop.
+ */
 #define WITH_GRAPH_RDLOCK_GUARD_(var) \
-for (g_autoptr(GraphLockable) var = graph_lockable_auto_lock(GML_OBJ_()); \
+for (GraphLockable *unlock_var GRAPH_AUTO_UNLOCK =\
+graph_lockable_auto_lock(GML_OBJ_()), \
+*var = unlock_var;\
  var; \
- graph_lockable_auto_unlock(var), var = NULL)
+ var = NULL)
 
 #define WITH_GRAPH_RDLOCK_GUARD() \
 WITH_GRAPH_RDLOCK_GUARD_(glue(graph_lockable_auto, __COUNTER__))
 
 #define GRAPH_RDLOCK_GUARD(x)   \
-g_autoptr(GraphLockable)\
+GraphLockable * GRAPH_AUTO_UNLOCK   \
 glue(graph_lockable_auto, __COUNTER__) G_GNUC_UNUSED =  \
 graph_lockable_auto_lock(GML_OBJ_())
 
diff --git a/meson.build b/meson.build
index 97e00d6f59..b1d5ce5f1d 100644
--- a/meson.build
+++ b/meson.build
@@ -624,7 +624,19 @@ warn_flags = [
 ]
 
 if host_os != 'darwin'
-  warn_flags += ['-Wthread-safety']
+  tsa_has_cleanup = cc.compiles('''
+struct __attribute__((capability("mutex"))) mutex {};
+void lock(struct mutex *m) __attribute__((acquire_capability(m)));
+void unlock(struct mutex *m) __attribute__((release_capability(m)));
+
+void test(void) {
+  struct mutex __attribute__((cleanup(unlock))) m;
+  lock(&m);
+}
+  ''', args: ['-Wthread-safety', '-Werror'])
+  if tsa_has_cleanup
+warn_flags += ['-Wthread-safety']
+  endif
 endif
 
 # Set up C++ compiler flags
-- 
2.45.2




[PATCH 1/2] block-copy: Fix missing graph lock

2024-06-27 Thread Kevin Wolf
The graph lock needs to be held when calling bdrv_co_pdiscard(). Fix
block_copy_task_entry() to take it for the call.

WITH_GRAPH_RDLOCK_GUARD() was implemented in a weak way because of
limitations in clang's Thread Safety Analysis at the time, so that it
only asserts that the lock is held (which allows calling functions that
require the lock), but we never deal with the unlocking (so even after
the scope of the guard, the compiler assumes that the lock is still
held). This is why the compiler didn't catch this locking error.

Signed-off-by: Kevin Wolf 
---
 block/block-copy.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/block-copy.c b/block/block-copy.c
index 7e3b378528..cc618e4561 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -595,7 +595,9 @@ static coroutine_fn int block_copy_task_entry(AioTask *task)
 if (s->discard_source && ret == 0) {
 int64_t nbytes =
 MIN(t->req.offset + t->req.bytes, s->len) - t->req.offset;
-bdrv_co_pdiscard(s->source, t->req.offset, nbytes);
+WITH_GRAPH_RDLOCK_GUARD() {
+bdrv_co_pdiscard(s->source, t->req.offset, nbytes);
+}
 }
 
 return ret;
-- 
2.45.2




[PATCH 0/2] block/graph-lock: Make WITH_GRAPH_RDLOCK_GUARD() fully checked

2024-06-27 Thread Kevin Wolf
Newer clang versions allow us to check scoped guards more thoroughly.
Surprisingly, we only seem to have missed one instance with the old
incomplete checks.

Kevin Wolf (2):
  block-copy: Fix missing graph lock
  block/graph-lock: Make WITH_GRAPH_RDLOCK_GUARD() fully checked

 include/block/graph-lock.h | 21 ++---
 block/block-copy.c |  4 +++-
 meson.build| 14 +-
 3 files changed, 30 insertions(+), 9 deletions(-)

-- 
2.45.2




Re: [PATCH] block/curl: explicitly assert that strchr returns non-NULL value

2024-06-27 Thread Kevin Wolf
Am 27.06.2024 um 17:30 hat Vladimir Sementsov-Ogievskiy geschrieben:
> strchr may return NULL if the colon is not found. It seems clearer to
> assert explicitly that we don't expect it here than to dereference
> address 1 in the next line.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/curl.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/block/curl.c b/block/curl.c
> index 419f7c89ef..ccfffd6c12 100644
> --- a/block/curl.c
> +++ b/block/curl.c
> @@ -219,7 +219,9 @@ static size_t curl_header_cb(void *ptr, size_t size, 
> size_t nmemb, void *opaque)
>  && g_ascii_strncasecmp(header, accept_ranges,
> strlen(accept_ranges)) == 0) {
>  
> -char *p = strchr(header, ':') + 1;
> +char *p = strchr(header, ':');
> +assert(p != NULL);
> +p += 1;

I'm not sure if this is actually much clearer because it doesn't say why
we don't expect NULL here. If you don't look at the context, it almost
looks like an assert() where proper error handling is needed. If you do,
then the original line is clear enough.

My first thought was that maybe what we want is a comment, but we
actually already know where the colon is. So how about this instead:

char *p = header + strlen(accept_ranges);

Kevin

>  /* Skip whitespace between the header name and value. */
>  while (p < end && *p && g_ascii_isspace(*p)) {
> -- 
> 2.34.1
> 




Re: [PATCH v2] Consider discard option when writing zeros

2024-06-27 Thread Kevin Wolf
Am 26.06.2024 um 18:27 hat Nir Soffer geschrieben:
> On Wed, Jun 26, 2024 at 12:17 PM Daniel P. Berrangé 
> wrote:
> 
> > On Mon, Jun 24, 2024 at 06:08:26PM +0200, Kevin Wolf wrote:
> > > Am 24.06.2024 um 17:23 hat Stefan Hajnoczi geschrieben:
> > > > On Wed, Jun 19, 2024 at 08:43:25PM +0300, Nir Soffer wrote:
> > > > > Tested using:
> > > >
> > > > Hi Nir,
> > > > This looks like a good candidate for the qemu-iotests test suite.
> > Adding
> > > > it to the automated tests will protect against future regressions.
> > > >
> > > > Please add the script and the expected output to
> > > > tests/qemu-iotests/test/write-zeroes-unmap and run it using
> > > > `(cd build && tests/qemu-iotests/check write-zeroes-unmap)`.
> > > >
> > > > See the existing test cases in tests/qemu-iotests/ and
> > > > tests/qemu-iotests/tests/ for examples. Some are shell scripts and
> > > > others are Python. I think shell makes sense for this test case. You
> > > > can copy the test framework boilerplate from an existing test case.
> > >
> > > 'du' can't be used like this in qemu-iotests because it makes
> > > assumptions that depend on the filesystem. A test case replicating what
> > > Nir did manually would likely fail on XFS with its preallocation.
> > >
> > > Maybe we could operate on a file exposed by the FUSE export that is
> > > backed by qcow2, and then you can use 'qemu-img map' on that qcow2 image
> > > to verify the allocation status. Somewhat complicated, but I think it
> > > could work.
> >
> > A simpler option would be to use 'du' but with a fuzzy range test,
> > rather than an exact equality test.
> >
> > For the tests which write 1 MB, check that the 'du' usage is "at least
> > 1 MB"; for the tests which expect to unmap blocks, check that the 'du'
> > usage is "less than 256 kB". This should be within the bounds of XFS
> > speculative allocation.
> 
> This should work, I'll start with this approach.

If we're okay with accepting tests that depend on filesystem behaviour,
then 'qemu-img map -f raw --output=json' should be a less risky
approach than checking 'du'.

Kevin




Re: [PATCH v2] Consider discard option when writing zeros

2024-06-26 Thread Kevin Wolf
Am 24.06.2024 um 23:12 hat Nir Soffer geschrieben:
> On Mon, Jun 24, 2024 at 7:08 PM Kevin Wolf  wrote:
> 
> > Am 24.06.2024 um 17:23 hat Stefan Hajnoczi geschrieben:
> > > On Wed, Jun 19, 2024 at 08:43:25PM +0300, Nir Soffer wrote:
> > > > Tested using:
> > >
> > > Hi Nir,
> > > This looks like a good candidate for the qemu-iotests test suite. Adding
> > > it to the automated tests will protect against future regressions.
> > >
> > > Please add the script and the expected output to
> > > tests/qemu-iotests/test/write-zeroes-unmap and run it using
> > > `(cd build && tests/qemu-iotests/check write-zeroes-unmap)`.
> > >
> > > See the existing test cases in tests/qemu-iotests/ and
> > > tests/qemu-iotests/tests/ for examples. Some are shell scripts and
> > > others are Python. I think shell makes sense for this test case. You
> > > can copy the test framework boilerplate from an existing test case.
> >
> > 'du' can't be used like this in qemu-iotests because it makes
> > assumptions that depend on the filesystem. A test case replicating what
> > Nir did manually would likely fail on XFS with its preallocation.
> 
> This is why I did not try to add a new qemu-iotest yet.
> 
> > Maybe we could operate on a file exposed by the FUSE export that is
> > backed by qcow2, and then you can use 'qemu-img map' on that qcow2 image
> > to verify the allocation status. Somewhat complicated, but I think it
> > could work.
> 
> Do we have examples of using the FUSE export? It sounds complicated but
> being able to test on any file system is awesome. The complexity can be
> hidden behind simple test helpers.

We seem to have a few tests that use it, and then the fuse protocol
implementation, too. 308 and file-io-error look relevant.

> Another option is to use a specific file system created for the tests,
> for example on a loop device. We used userstorage[1] in ovirt to test
> on specific file systems with known sector size.

Creating loop devices requires root privileges. If I understand
correctly, userstorage solved that by having a setup phase as root
before running the tests as a normal user? We don't really have that in
qemu-iotests.

Some tests require passwordless sudo and are skipped otherwise, but this
means that in practice they are almost always skipped.

> But more important, are you ok with the change?
> 
> I'm not sure about not creating sparse images by default - this is not
> consistent with qemu-img convert and qemu-nbd, which do sparsify by
> default. The old behavior seems better.

Well, your patches make it do what we always claimed it would do, so
that consistency is certainly a good thing. Unmapping on write_zeroes
and ignoring truncate is a weird combination anyway that doesn't really
make any sense to me, so I don't think it's worth preserving. The other
way around could have been more defensible, but that's not how our bug
works.

Now, if ignoring all discard requests is a good default these days is a
separate question and I'm not sure really. Maybe discard=unmap should
be the default (and apply to both discard and write_zeroes, of course).

Kevin




Re: [PATCH v2] Consider discard option when writing zeros

2024-06-24 Thread Kevin Wolf
Am 24.06.2024 um 17:23 hat Stefan Hajnoczi geschrieben:
> On Wed, Jun 19, 2024 at 08:43:25PM +0300, Nir Soffer wrote:
> > Tested using:
> 
> Hi Nir,
> This looks like a good candidate for the qemu-iotests test suite. Adding
> it to the automated tests will protect against future regressions.
> 
> Please add the script and the expected output to
> tests/qemu-iotests/test/write-zeroes-unmap and run it using
> `(cd build && tests/qemu-iotests/check write-zeroes-unmap)`.
> 
> See the existing test cases in tests/qemu-iotests/ and
> tests/qemu-iotests/tests/ for examples. Some are shell scripts and
> others are Python. I think shell makes sense for this test case. You
> can copy the test framework boilerplate from an existing test case.

'du' can't be used like this in qemu-iotests because it makes
assumptions that depend on the filesystem. A test case replicating what
Nir did manually would likely fail on XFS with its preallocation.

Maybe we could operate on a file exposed by the FUSE export that is
backed by qcow2, and then you can use 'qemu-img map' on that qcow2 image
to verify the allocation status. Somewhat complicated, but I think it
could work.

Kevin




Re: [PATCH] block/file-posix: Consider discard flag when opening

2024-06-19 Thread Kevin Wolf
Am 18.06.2024 um 23:24 hat Nir Soffer geschrieben:
> Set has_discard only when BDRV_O_UNMAP is not set. With this users that
> want to keep their images fully allocated can disable hole punching
> when writing zeros or discarding using:
> 
>-drive file=thick.img,discard=off
> 
> This change is not entirely correct since it changes the default discard
> behavior.  Previously we always allowed punching holes, but now you
> must use discard=unmap|on to enable it. We probably need to add the
> BDRV_O_UNMAP flag by default.
> 
> make check still works, so maybe we don't have tests for sparsifying
> images, or maybe you need to run special tests that do not run by
> default. We need tests for keeping images non-sparse.
> 
> Signed-off-by: Nir Soffer 

So first of all, I agree with you that this patch is wrong. ;-)

At first, I failed to understand the problem this is trying to solve. I
put a debug message in handle_aiocb_discard() and checked with which
options it triggers. [1] To me, this looked exactly like it should be.
We only try to discard blocks when discard=unmap is given as an option.

That leaves the case of write_zeroes. And while at first sight the
code looked good, we do seem to have a problem there: it tried to
unmap even with discard=off.

>  block/file-posix.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index be25e35ff6..acac2abadc 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -738,11 +738,11 @@ static int raw_open_common(BlockDriverState *bs, QDict 
> *options,
>  ret = -EINVAL;
>  goto fail;
>  }
>  #endif /* !defined(CONFIG_LINUX_IO_URING) */
>  
> -s->has_discard = true;
> +s->has_discard = !!(bdrv_flags & BDRV_O_UNMAP);
>  s->has_write_zeroes = true;
>  
>  if (fstat(s->fd, &st) < 0) {
>  ret = -errno;
>  error_setg_errno(errp, errno, "Could not stat file");

s->has_discard is about what the host supports, not about the semantics
of the QEMU block node. So this doesn't feel right to me.

So for the buggy case, write_zeroes, bdrv_co_pwrite_zeroes() has code
that considers the case and clears the ~BDRV_REQ_MAY_UNMAP flags:

if (!(child->bs->open_flags & BDRV_O_UNMAP)) {
flags &= ~BDRV_REQ_MAY_UNMAP;
}

But it turns out that we don't necessarily even go through this function
for the top node which has discard=off, so it can't take effect:

(gdb) bt
#0  0x74f2f144 in __pthread_kill_implementation () at /lib64/libc.so.6
#1  0x74ed765e in raise () at /lib64/libc.so.6
#2  0x74ebf902 in abort () at /lib64/libc.so.6
#3  0x5615aff0 in raw_do_pwrite_zeroes (bs=0x57f4bcf0, offset=0, 
bytes=1048576, flags=BDRV_REQ_MAY_UNMAP, blkdev=false) at 
../block/file-posix.c:3643
#4  0x5615557e in raw_co_pwrite_zeroes (bs=0x57f4bcf0, offset=0, 
bytes=1048576, flags=BDRV_REQ_MAY_UNMAP) at ../block/file-posix.c:3655
#5  0x560cde2a in bdrv_co_do_pwrite_zeroes (bs=0x57f4bcf0, 
offset=0, bytes=1048576, flags=6) at ../block/io.c:1901
#6  0x560c72f9 in bdrv_aligned_pwritev (child=0x57f51460, 
req=0x7fffed5ff800, offset=0, bytes=1048576, align=1, qiov=0x0, qiov_offset=0, 
flags=6) at ../block/io.c:2100
#7  0x560c6b41 in bdrv_co_do_zero_pwritev (child=0x57f51460, 
offset=0, bytes=1048576, flags=6, req=0x7fffed5ff800) at ../block/io.c:2183
#8  0x560c6647 in bdrv_co_pwritev_part (child=0x57f51460, offset=0, 
bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at ../block/io.c:2283
#9  0x560c634f in bdrv_co_pwritev (child=0x57f51460, offset=0, 
bytes=1048576, qiov=0x0, flags=6) at ../block/io.c:2216
#10 0x560c75b5 in bdrv_co_pwrite_zeroes (child=0x57f51460, 
offset=0, bytes=1048576, flags=BDRV_REQ_MAY_UNMAP) at ../block/io.c:2322
#11 0x56117d24 in raw_co_pwrite_zeroes (bs=0x57f44980, offset=0, 
bytes=1048576, flags=BDRV_REQ_MAY_UNMAP) at ../block/raw-format.c:307
#12 0x560cde2a in bdrv_co_do_pwrite_zeroes (bs=0x57f44980, 
offset=0, bytes=1048576, flags=6) at ../block/io.c:1901
#13 0x560c72f9 in bdrv_aligned_pwritev (child=0x57f513f0, 
req=0x7fffed5ffd90, offset=0, bytes=1048576, align=1, qiov=0x0, qiov_offset=0, 
flags=6) at ../block/io.c:2100
#14 0x560c6b41 in bdrv_co_do_zero_pwritev (child=0x57f513f0, 
offset=0, bytes=1048576, flags=6, req=0x7fffed5ffd90) at ../block/io.c:2183
#15 0x560c6647 in bdrv_co_pwritev_part (child=0x57f513f0, offset=0, 
bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at ../block/io.c:2283
#16 0x560ad741 in blk_co_do_pwritev_part (blk=0x57f51660, offset=0, 
bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at 
../block/block-backend.c:1425
#17 0x560ad5f2 in blk_co_pwritev_part (blk=0x57f51660, offset=0, 
bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at 
../block/block-backend.c:1440
#18 0x560ad8cf in blk_co_pwritev 

Re: [PATCH v5 5/5] iotests: Add `vvfat` tests

2024-06-13 Thread Kevin Wolf
Am 13.06.2024 um 16:07 hat Amjad Alsharafi geschrieben:
> On Wed, Jun 12, 2024 at 08:43:26PM +0800, Amjad Alsharafi wrote:
> > Added several tests to verify the implementation of the vvfat driver.
> > 
> > We needed a way to interact with it, so created a basic `fat16.py` driver
> > that handled writing correct sectors for us.
> > 
> > Added `vvfat` to the non-generic formats, as it's not a normal image format.
> > 
> > Signed-off-by: Amjad Alsharafi 
> > ---
> >  tests/qemu-iotests/check   |   2 +-
> >  tests/qemu-iotests/fat16.py| 673 +
> >  tests/qemu-iotests/testenv.py  |   2 +-
> >  tests/qemu-iotests/tests/vvfat | 457 
> >  tests/qemu-iotests/tests/vvfat.out |   5 +
> >  5 files changed, 1137 insertions(+), 2 deletions(-)
> >  create mode 100644 tests/qemu-iotests/fat16.py
> >  create mode 100755 tests/qemu-iotests/tests/vvfat
> >  create mode 100755 tests/qemu-iotests/tests/vvfat.out
> 
> Btw Kevin, I'm not sure if I should add myself to the MAINTAINERS file
> for the new files (fat16.py and vvfat) or not

You can if you really want to, but there is no need.

Basically if you see yourself maintaining these files for a long time
instead of just contributing them now and you want to be contacted
when they are changed in the future, that's when you could consider it.
But the patches will still go through my or Hanna's tree and we already
cover all of qemu-iotests, so we don't have to have a formal
maintainership definition for these files specifically.

Kevin




Re: [PATCH v4 5/5] iotests: add backup-discard-source

2024-06-13 Thread Kevin Wolf
Am 12.06.2024 um 21:21 hat Vladimir Sementsov-Ogievskiy geschrieben:
> On 11.06.24 20:49, Kevin Wolf wrote:
> > Am 13.03.2024 um 16:28 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > Add test for a new backup option: discard-source.
> > > 
> > > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > > Reviewed-by: Fiona Ebner 
> > > Tested-by: Fiona Ebner 
> > 
> > This test fails for me, and it already does so after this commit that
introduced it. I haven't checked what get_actual_size() does, but I'm running
> > on XFS, so its preallocation could be causing this. We generally avoid
> > checking the number of allocated blocks in image files for this reason.
> > 
> 
> Hmm right, I see that relying on allocated size is a bad thing. It's better
> to check block status, to see how many qcow2 clusters are
> allocated.
> 
> Do we have some qmp command to get such information? The simplest way
> I see is to add dirty to temp block-node, and then check its dirty
> count through query-named-block-nodes

Hm, does it have to be QMP in a running QEMU process? I'm not sure what
the best way would be there.

Otherwise, my approach would be 'qemu-img check' or even 'qemu-img map',
depending on what you want to verify with it.
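
For example, get_actual_size() could be replaced by a small helper along
these lines (a sketch only, assuming qemu-img is in PATH and that reading
the image with --force-share is acceptable while the VM still holds it):

    import json, subprocess

    def allocated_bytes(image):
        # Sum the lengths of all ranges that 'qemu-img map' reports as
        # holding data, instead of looking at the file size on disk.
        out = subprocess.run(['qemu-img', 'map', '--output=json', '-U', image],
                             capture_output=True, check=True, text=True)
        return sum(e['length'] for e in json.loads(out.stdout) if e['data'])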

Kevin




Re: [PATCH v4 2/4] vvfat: Fix usage of `info.file.offset`

2024-06-11 Thread Kevin Wolf
Am 11.06.2024 um 18:22 hat Amjad Alsharafi geschrieben:
> On Tue, Jun 11, 2024 at 04:30:53PM +0200, Kevin Wolf wrote:
> > Am 11.06.2024 um 14:31 hat Amjad Alsharafi geschrieben:
> > > On Mon, Jun 10, 2024 at 06:49:43PM +0200, Kevin Wolf wrote:
> > > > Am 05.06.2024 um 02:58 hat Amjad Alsharafi geschrieben:
> > > > > The field is marked as "the offset in the file (in clusters)", but it
> > > > > was being used like this
> > > > > `cluster_size*(nums)+mapping->info.file.offset`, which is incorrect.
> > > > > 
> > > > > Additionally, removed the `abort` when `first_mapping_index` does not
> > > > > match, as this matches the case when adding new clusters for files, 
> > > > > and
> > > > > it's inevitable that we reach this condition when doing that if the
> > > > > clusters are not after one another, so there is no reason to `abort`
> > > > > here, execution continues and the new clusters are written to disk
> > > > > correctly.
> > > > > 
> > > > > Signed-off-by: Amjad Alsharafi 
> > > > 
> > > > Can you help me understand how first_mapping_index really works?
> > > > 
> > > > It seems to me that you get a chain of mappings for each file on the FAT
> > > > filesystem, which are just the contiguous areas in it, and
> > > > first_mapping_index refers to the mapping at the start of the file. But
> > > > for much of the time, it actually doesn't seem to be set at all, so you
> > > > have mapping->first_mapping_index == -1. Do you understand the rules
> > > > around when it's set and when it isn't?
> > > 
> > > Yeah. So `first_mapping_index` is the index of the first mapping, each
> > > mapping is a group of clusters that are contiguous in the file.
> > > It's mostly `-1` because the first mapping will have the value set as
> > > `-1` and not its own index; this value will only be set when the file
> > > contains more than one mapping, and this will only happen when you add
> > > clusters to a file that are not contiguous with the existing clusters.
> > 
> > Ah, that makes some sense. Not sure if it's optimal, but it's a rule I
> > can work with. So just to confirm, this is the invariant that we think
> > should always hold true, right?
> > 
> > assert((mapping->mode & MODE_DIRECTORY) ||
> >!mapping->info.file.offset ||
> >mapping->first_mapping_index > 0);
> > 
> 
> Yes.
> 
> We can add this into `get_cluster_count_for_direntry` loop.

Maybe even find_mapping_for_cluster() because we think it should apply
always? It's called by get_cluster_count_for_direntry(), but also by
other functions.

Either way, I think this should be a separate patch.

> I'm thinking of also converting those `abort` into `assert`, since
> the line `copy_it = 1;` was confusing me, since it was after the `abort`.

I agree for the abort() that you removed, but I'm not sure about the
other one. I have a feeling the copy_it = 1 might actually be correct
there (if the copying logic is implemented correctly; I didn't check
that).

> > > And actually, thanks to that I noticed another bug not fixed in PATCH 3, 
> > > We are doing this check 
> > > `s->current_mapping->first_mapping_index != mapping->first_mapping_index`
> > > to know if we should switch to the new mapping or not. 
> > > If we were reading from the first mapping (`first_mapping_index == -1`)
> > > and we jumped to the second mapping (`first_mapping_index == n`), we
> > > will catch this condition and switch to the new mapping.
> > > 
> > > But if the file has more than 2 mappings, and we jumped to the 3rd
> > > mapping, we will not catch this since (`first_mapping_index == n`) for
> > > both of them haha. I think a better check is to check the `mapping`
> > > pointer directly. (I'll add it also in the next series together with a
> > > test for it.)
> > 
> > This comparison is exactly what confused me. I didn't realise that the
> > first mapping in the chain has a different value here, so I thought this
> > must mean that we're looking at a different file now - but of course I
> > couldn't see a reason for that because we're iterating through a single
> > file in this function.
> > 
> > But even now that I know that the condition triggers when switching from
> > the first to the second mapping, it doesn't make sense to me. We don't
have to copy things around just because a file is non-contiguous.

Re: [PATCH v4 5/5] iotests: add backup-discard-source

2024-06-11 Thread Kevin Wolf
Am 13.03.2024 um 16:28 hat Vladimir Sementsov-Ogievskiy geschrieben:
> Add test for a new backup option: discard-source.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> Reviewed-by: Fiona Ebner 
> Tested-by: Fiona Ebner 

This test fails for me, and it already does so after this commit that
introduced it. I haven't checked what get_actual_size() does, but I'm running
on XFS, so its preallocation could be causing this. We generally avoid
checking the number of allocated blocks in image files for this reason.

Kevin


backup-discard-source   fail   [19:45:49] [19:45:50]   0.8s 
failed, exit status 1
--- /home/kwolf/source/qemu/tests/qemu-iotests/tests/backup-discard-source.out
+++ 
/home/kwolf/source/qemu/build-clang/scratch/qcow2-file-backup-discard-source/backup-discard-source.out.bad
@@ -1,5 +1,14 @@
-..
+F.
+==
+FAIL: test_discard_cbw (__main__.TestBackup.test_discard_cbw)
+1. do backup(discard_source=True), which should inform
+--
+Traceback (most recent call last):
+  File 
"/home/kwolf/source/qemu/tests/qemu-iotests/tests/backup-discard-source", line 
147, in test_discard_cbw
+self.assertLess(get_actual_size(self.vm, 'temp'), 512 * 1024)
+AssertionError: 1249280 not less than 524288
+
 --
 Ran 2 tests

-OK
+FAILED (failures=1)
Failures: backup-discard-source
Failed 1 of 1 iotests




[PULL 3/8] aio: warn about iohandler_ctx special casing

2024-06-11 Thread Kevin Wolf
From: Stefan Hajnoczi 

The main loop has two AioContexts: qemu_aio_context and iohandler_ctx.
The main loop runs them both, but nested aio_poll() calls on
qemu_aio_context exclude iohandler_ctx.

Which one should qemu_get_current_aio_context() return when called from
the main loop? Document that it's always qemu_aio_context.

This has subtle effects on functions that use
qemu_get_current_aio_context(). For example, aio_co_reschedule_self()
does not work when moving from iohandler_ctx to qemu_aio_context because
qemu_get_current_aio_context() does not differentiate these two
AioContexts.

Document this in order to reduce the chance of future bugs.

Signed-off-by: Stefan Hajnoczi 
Message-ID: <20240506190622.56095-3-stefa...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 include/block/aio.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/include/block/aio.h b/include/block/aio.h
index 8378553eb9..4ee81936ed 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -629,6 +629,9 @@ void aio_co_schedule(AioContext *ctx, Coroutine *co);
  *
  * Move the currently running coroutine to new_ctx. If the coroutine is already
  * running in new_ctx, do nothing.
+ *
+ * Note that this function cannot reschedule from iohandler_ctx to
+ * qemu_aio_context.
  */
 void coroutine_fn aio_co_reschedule_self(AioContext *new_ctx);
 
@@ -661,6 +664,9 @@ void aio_co_enter(AioContext *ctx, Coroutine *co);
  * If called from an IOThread this will be the IOThread's AioContext.  If
  * called from the main thread or with the "big QEMU lock" taken it
  * will be the main loop AioContext.
+ *
+ * Note that the return value is never the main loop's iohandler_ctx and the
+ * return value is the main loop AioContext instead.
  */
 AioContext *qemu_get_current_aio_context(void);
 
-- 
2.45.2




[PULL 6/8] linux-aio: add IO_CMD_FDSYNC command support

2024-06-11 Thread Kevin Wolf
From: Prasad Pandit 

Libaio defines IO_CMD_FDSYNC command to sync all outstanding
asynchronous I/O operations, by flushing out file data to the
disk storage. Enable linux-aio to submit such aio request.

When using aio=native without fdsync() support, QEMU creates
pthreads, and destroying these pthreads results in TLB flushes.
In a real-time guest environment, TLB flushes cause a latency
spike. This patch helps to avoid such spikes.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Prasad Pandit 
Message-ID: <20240425070412.37248-1-ppan...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 include/block/raw-aio.h |  1 +
 block/file-posix.c  |  9 +
 block/linux-aio.c   | 21 -
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index 20e000b8ef..626706827f 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -60,6 +60,7 @@ void laio_cleanup(LinuxAioState *s);
 int coroutine_fn laio_co_submit(int fd, uint64_t offset, QEMUIOVector *qiov,
 int type, uint64_t dev_max_batch);
 
+bool laio_has_fdsync(int);
 void laio_detach_aio_context(LinuxAioState *s, AioContext *old_context);
 void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context);
 #endif
diff --git a/block/file-posix.c b/block/file-posix.c
index 5c46938936..be25e35ff6 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -159,6 +159,7 @@ typedef struct BDRVRawState {
 bool has_discard:1;
 bool has_write_zeroes:1;
 bool use_linux_aio:1;
+bool has_laio_fdsync:1;
 bool use_linux_io_uring:1;
 int page_cache_inconsistent; /* errno from fdatasync failure */
 bool has_fallocate;
@@ -718,6 +719,9 @@ static int raw_open_common(BlockDriverState *bs, QDict 
*options,
 ret = -EINVAL;
 goto fail;
 }
+if (s->use_linux_aio) {
+s->has_laio_fdsync = laio_has_fdsync(s->fd);
+}
 #else
 if (s->use_linux_aio) {
 error_setg(errp, "aio=native was specified, but is not supported "
@@ -2598,6 +2602,11 @@ static int coroutine_fn 
raw_co_flush_to_disk(BlockDriverState *bs)
 if (raw_check_linux_io_uring(s)) {
 return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH);
 }
+#endif
+#ifdef CONFIG_LINUX_AIO
+if (s->has_laio_fdsync && raw_check_linux_aio(s)) {
+return laio_co_submit(s->fd, 0, NULL, QEMU_AIO_FLUSH, 0);
+}
 #endif
return raw_thread_pool_submit(handle_aiocb_flush, &acb);
 }
diff --git a/block/linux-aio.c b/block/linux-aio.c
index ec05d946f3..e3b5ec9aba 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -384,6 +384,9 @@ static int laio_do_submit(int fd, struct qemu_laiocb 
*laiocb, off_t offset,
 case QEMU_AIO_READ:
 io_prep_preadv(iocbs, fd, qiov->iov, qiov->niov, offset);
 break;
+case QEMU_AIO_FLUSH:
+io_prep_fdsync(iocbs, fd);
+break;
 /* Currently Linux kernel does not support other operations */
 default:
 fprintf(stderr, "%s: invalid AIO request type 0x%x.\n",
@@ -412,7 +415,7 @@ int coroutine_fn laio_co_submit(int fd, uint64_t offset, 
QEMUIOVector *qiov,
 AioContext *ctx = qemu_get_current_aio_context();
 struct qemu_laiocb laiocb = {
 .co = qemu_coroutine_self(),
-.nbytes = qiov->size,
+.nbytes = qiov ? qiov->size : 0,
 .ctx= aio_get_linux_aio(ctx),
 .ret= -EINPROGRESS,
 .is_read= (type == QEMU_AIO_READ),
@@ -486,3 +489,19 @@ void laio_cleanup(LinuxAioState *s)
 }
 g_free(s);
 }
+
+bool laio_has_fdsync(int fd)
+{
+struct iocb cb;
+struct iocb *cbs[] = {&cb, NULL};
+
+io_context_t ctx = 0;
+io_setup(1, &ctx);
+
+/* check if host kernel supports IO_CMD_FDSYNC */
+io_prep_fdsync(&cb, fd);
+int ret = io_submit(ctx, 1, cbs);
+
+io_destroy(ctx);
+return (ret == -EINVAL) ? false : true;
+}
-- 
2.45.2




[PULL 0/8] Block layer patches

2024-06-11 Thread Kevin Wolf
The following changes since commit 80e8f0602168f451a93e71cbb1d59e93d745e62e:

  Merge tag 'bsd-user-misc-2024q2-pull-request' of gitlab.com:bsdimp/qemu into 
staging (2024-06-09 11:21:55 -0700)

are available in the Git repository at:

  https://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 3ab0f063e58ed9224237d69c4211ca83335164c4:

  crypto/block: drop qcrypto_block_open() n_threads argument (2024-06-10 
11:05:43 +0200)


Block layer patches

- crypto: Fix crash when used with multiqueue devices
- linux-aio: add IO_CMD_FDSYNC command support
- copy-before-write: Avoid integer overflows for timeout > 4s
- Fix crash with QMP block_resize and iothreads
- qemu-io: add cvtnum() error handling for zone commands
- Code cleanup


Denis V. Lunev via (1):
  block: drop force_dup parameter of raw_reconfigure_getfd()

Fiona Ebner (1):
  block/copy-before-write: use uint64_t for timeout in nanoseconds

Prasad J Pandit (1):
  linux-aio: add IO_CMD_FDSYNC command support

Stefan Hajnoczi (5):
  Revert "monitor: use aio_co_reschedule_self()"
  aio: warn about iohandler_ctx special casing
  qemu-io: add cvtnum() error handling for zone commands
  block/crypto: create ciphers on demand
  crypto/block: drop qcrypto_block_open() n_threads argument

 crypto/blockpriv.h |  13 +++--
 include/block/aio.h|   6 +++
 include/block/raw-aio.h|   1 +
 include/crypto/block.h |   2 -
 block/copy-before-write.c  |   2 +-
 block/crypto.c |   1 -
 block/file-posix.c |  17 --
 block/linux-aio.c  |  21 +++-
 block/qcow.c   |   2 +-
 block/qcow2.c  |   5 +-
 crypto/block-luks.c|   4 +-
 crypto/block-qcow.c|   8 ++-
 crypto/block.c | 114 -
 qapi/qmp-dispatch.c|   7 ++-
 qemu-io-cmds.c |  48 -
 tests/unit/test-crypto-block.c |   4 --
 16 files changed, 176 insertions(+), 79 deletions(-)




[PULL 4/8] qemu-io: add cvtnum() error handling for zone commands

2024-06-11 Thread Kevin Wolf
From: Stefan Hajnoczi 

cvtnum() parses positive int64_t values and returns a negative errno on
failure. Print errors and return early when cvtnum() fails.

While we're at it, also reject nr_zones values greater or equal to 2^32
since they cannot be represented.

Reported-by: Peter Maydell 
Cc: Sam Li 
Signed-off-by: Stefan Hajnoczi 
Message-ID: <20240507180558.377233-1-stefa...@redhat.com>
Reviewed-by: Sam Li 
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 qemu-io-cmds.c | 48 +++-
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index f5d7202a13..e2fab57183 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -1739,12 +1739,26 @@ static int zone_report_f(BlockBackend *blk, int argc, 
char **argv)
 {
 int ret;
 int64_t offset;
+int64_t val;
 unsigned int nr_zones;
 
 ++optind;
 offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
 ++optind;
-nr_zones = cvtnum(argv[optind]);
+val = cvtnum(argv[optind]);
+if (val < 0) {
+print_cvtnum_err(val, argv[optind]);
+return val;
+}
+if (val > UINT_MAX) {
+printf("Number of zones must be less than 2^32\n");
+return -ERANGE;
+}
+nr_zones = val;
 
 g_autofree BlockZoneDescriptor *zones = NULL;
 zones = g_new(BlockZoneDescriptor, nr_zones);
@@ -1780,8 +1794,16 @@ static int zone_open_f(BlockBackend *blk, int argc, char 
**argv)
 int64_t offset, len;
 ++optind;
 offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
 ++optind;
 len = cvtnum(argv[optind]);
+if (len < 0) {
+print_cvtnum_err(len, argv[optind]);
+return len;
+}
 ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len);
 if (ret < 0) {
 printf("zone open failed: %s\n", strerror(-ret));
@@ -1805,8 +1827,16 @@ static int zone_close_f(BlockBackend *blk, int argc, 
char **argv)
 int64_t offset, len;
 ++optind;
 offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
 ++optind;
 len = cvtnum(argv[optind]);
+if (len < 0) {
+print_cvtnum_err(len, argv[optind]);
+return len;
+}
 ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len);
 if (ret < 0) {
 printf("zone close failed: %s\n", strerror(-ret));
@@ -1830,8 +1860,16 @@ static int zone_finish_f(BlockBackend *blk, int argc, 
char **argv)
 int64_t offset, len;
 ++optind;
 offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
 ++optind;
 len = cvtnum(argv[optind]);
+if (len < 0) {
+print_cvtnum_err(len, argv[optind]);
+return len;
+}
 ret = blk_zone_mgmt(blk, BLK_ZO_FINISH, offset, len);
 if (ret < 0) {
 printf("zone finish failed: %s\n", strerror(-ret));
@@ -1855,8 +1893,16 @@ static int zone_reset_f(BlockBackend *blk, int argc, 
char **argv)
 int64_t offset, len;
 ++optind;
 offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
 ++optind;
 len = cvtnum(argv[optind]);
+if (len < 0) {
+print_cvtnum_err(len, argv[optind]);
+return len;
+}
 ret = blk_zone_mgmt(blk, BLK_ZO_RESET, offset, len);
 if (ret < 0) {
 printf("zone reset failed: %s\n", strerror(-ret));
-- 
2.45.2




[PULL 5/8] block/copy-before-write: use uint64_t for timeout in nanoseconds

2024-06-11 Thread Kevin Wolf
From: Fiona Ebner 

rather than the uint32_t for which the maximum is slightly more than 4
seconds and larger values would overflow. The QAPI interface allows
specifying the number of seconds, so only values 0 to 4 are safe right
now, other values lead to a much lower timeout than a user expects.
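For example, cbw-timeout=5 is 5 * 10^9 = 5,000,000,000 ns, which no longer
fits into 32 bits and silently wraps to 705,032,704 ns (about 0.7 s) when
stored in a uint32_t.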

The block_copy() call where this is used already takes a uint64_t for
the timeout, so no change required there.

Fixes: 6db7fd1ca9 ("block/copy-before-write: implement cbw-timeout option")
Reported-by: Friedrich Weber 
Signed-off-by: Fiona Ebner 
Message-ID: <20240429141934.442154-1-f.eb...@proxmox.com>
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 block/copy-before-write.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index cd65524e26..853e01a1eb 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -43,7 +43,7 @@ typedef struct BDRVCopyBeforeWriteState {
 BlockCopyState *bcs;
 BdrvChild *target;
 OnCbwError on_cbw_error;
-uint32_t cbw_timeout_ns;
+uint64_t cbw_timeout_ns;
 bool discard_source;
 
 /*
-- 
2.45.2




[PULL 8/8] crypto/block: drop qcrypto_block_open() n_threads argument

2024-06-11 Thread Kevin Wolf
From: Stefan Hajnoczi 

The n_threads argument is no longer used since the previous commit.
Remove it.

Signed-off-by: Stefan Hajnoczi 
Message-ID: <20240527155851.892885-3-stefa...@redhat.com>
Reviewed-by: Kevin Wolf 
Acked-by: Daniel P. Berrangé 
Signed-off-by: Kevin Wolf 
---
 crypto/blockpriv.h | 1 -
 include/crypto/block.h | 2 --
 block/crypto.c | 1 -
 block/qcow.c   | 2 +-
 block/qcow2.c  | 5 ++---
 crypto/block-luks.c| 1 -
 crypto/block-qcow.c| 6 ++
 crypto/block.c | 3 +--
 tests/unit/test-crypto-block.c | 4 
 9 files changed, 6 insertions(+), 19 deletions(-)

diff --git a/crypto/blockpriv.h b/crypto/blockpriv.h
index 4bf6043d5d..b8f77cb5eb 100644
--- a/crypto/blockpriv.h
+++ b/crypto/blockpriv.h
@@ -59,7 +59,6 @@ struct QCryptoBlockDriver {
 QCryptoBlockReadFunc readfunc,
 void *opaque,
 unsigned int flags,
-size_t n_threads,
 Error **errp);
 
 int (*create)(QCryptoBlock *block,
diff --git a/include/crypto/block.h b/include/crypto/block.h
index 92e823c9f2..5b5d039800 100644
--- a/include/crypto/block.h
+++ b/include/crypto/block.h
@@ -76,7 +76,6 @@ typedef enum {
  * @readfunc: callback for reading data from the volume
  * @opaque: data to pass to @readfunc
  * @flags: bitmask of QCryptoBlockOpenFlags values
- * @n_threads: allow concurrent I/O from up to @n_threads threads
  * @errp: pointer to a NULL-initialized error object
  *
  * Create a new block encryption object for an existing
@@ -113,7 +112,6 @@ QCryptoBlock *qcrypto_block_open(QCryptoBlockOpenOptions 
*options,
  QCryptoBlockReadFunc readfunc,
  void *opaque,
  unsigned int flags,
- size_t n_threads,
  Error **errp);
 
 typedef enum {
diff --git a/block/crypto.c b/block/crypto.c
index 21eed909c1..4eed3ffa6a 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -363,7 +363,6 @@ static int block_crypto_open_generic(QCryptoBlockFormat 
format,
block_crypto_read_func,
bs,
cflags,
-   1,
errp);
 
 if (!crypto->block) {
diff --git a/block/qcow.c b/block/qcow.c
index ca8e1d5ec8..c2f89db055 100644
--- a/block/qcow.c
+++ b/block/qcow.c
@@ -211,7 +211,7 @@ static int qcow_open(BlockDriverState *bs, QDict *options, 
int flags,
 cflags |= QCRYPTO_BLOCK_OPEN_NO_IO;
 }
 s->crypto = qcrypto_block_open(crypto_opts, "encrypt.",
-   NULL, NULL, cflags, 1, errp);
+   NULL, NULL, cflags, errp);
 if (!s->crypto) {
 ret = -EINVAL;
 goto fail;
diff --git a/block/qcow2.c b/block/qcow2.c
index 956128b409..10883a2494 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -321,7 +321,7 @@ qcow2_read_extensions(BlockDriverState *bs, uint64_t 
start_offset,
 }
 s->crypto = qcrypto_block_open(s->crypto_opts, "encrypt.",
qcow2_crypto_hdr_read_func,
-   bs, cflags, QCOW2_MAX_THREADS, 
errp);
+   bs, cflags, errp);
 if (!s->crypto) {
 return -EINVAL;
 }
@@ -1701,8 +1701,7 @@ qcow2_do_open(BlockDriverState *bs, QDict *options, int 
flags,
 cflags |= QCRYPTO_BLOCK_OPEN_NO_IO;
 }
 s->crypto = qcrypto_block_open(s->crypto_opts, "encrypt.",
-   NULL, NULL, cflags,
-   QCOW2_MAX_THREADS, errp);
+   NULL, NULL, cflags, errp);
 if (!s->crypto) {
 ret = -EINVAL;
 goto fail;
diff --git a/crypto/block-luks.c b/crypto/block-luks.c
index 3357852c0a..5b777c15d3 100644
--- a/crypto/block-luks.c
+++ b/crypto/block-luks.c
@@ -1189,7 +1189,6 @@ qcrypto_block_luks_open(QCryptoBlock *block,
 QCryptoBlockReadFunc readfunc,
 void *opaque,
 unsigned int flags,
-size_t n_threads,
 Error **errp)
 {
 QCryptoBlockLUKS *luks = NULL;
diff --git a/crypto/block-qcow.c b/crypto/block-qcow.c
index 02305058e3..42e9556e42 100644
--- a/crypto/block-qcow.c
+++ b/crypto/block-qcow.c
@@ -44,7 +44,6 @@ qcrypto_block_qcow_has_format(const uint8_t *buf 
G_GNUC_UNUSED,
 static int
 qcrypto_block_qcow_init(QCry

[PULL 2/8] Revert "monitor: use aio_co_reschedule_self()"

2024-06-11 Thread Kevin Wolf
From: Stefan Hajnoczi 

Commit 1f25c172f837 ("monitor: use aio_co_reschedule_self()") was a code
cleanup that uses aio_co_reschedule_self() instead of open coding
coroutine rescheduling.

Bug RHEL-34618 was reported and Kevin Wolf  identified
the root cause. I missed that aio_co_reschedule_self() ->
qemu_get_current_aio_context() only knows about
qemu_aio_context/IOThread AioContexts and not about iohandler_ctx. It
does not function correctly when going back from the iohandler_ctx to
qemu_aio_context.

Go back to open coding the AioContext transitions to avoid this bug.

This reverts commit 1f25c172f83704e350c0829438d832384084a74d.

Cc: qemu-sta...@nongnu.org
Buglink: https://issues.redhat.com/browse/RHEL-34618
Signed-off-by: Stefan Hajnoczi 
Message-ID: <20240506190622.56095-2-stefa...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 qapi/qmp-dispatch.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c
index f3488afeef..176b549473 100644
--- a/qapi/qmp-dispatch.c
+++ b/qapi/qmp-dispatch.c
@@ -212,7 +212,8 @@ QDict *coroutine_mixed_fn qmp_dispatch(const QmpCommandList 
*cmds, QObject *requ
  * executing the command handler so that it can make progress if it
  * involves an AIO_WAIT_WHILE().
  */
-aio_co_reschedule_self(qemu_get_aio_context());
+aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
+qemu_coroutine_yield();
 }
 
 monitor_set_cur(qemu_coroutine_self(), cur_mon);
@@ -226,7 +227,9 @@ QDict *coroutine_mixed_fn qmp_dispatch(const QmpCommandList 
*cmds, QObject *requ
  * Move back to iohandler_ctx so that nested event loops for
  * qemu_aio_context don't start new monitor commands.
  */
-aio_co_reschedule_self(iohandler_get_aio_context());
+aio_co_schedule(iohandler_get_aio_context(),
+qemu_coroutine_self());
+qemu_coroutine_yield();
 }
 } else {
/*
-- 
2.45.2




[PULL 1/8] block: drop force_dup parameter of raw_reconfigure_getfd()

2024-06-11 Thread Kevin Wolf
From: "Denis V. Lunev via" 

Since commit 72373e40fbc, this parameter is always passed as 'false'
from the caller.

Signed-off-by: Denis V. Lunev 
CC: Andrey Zhadchenko 
CC: Kevin Wolf 
CC: Hanna Reitz 
Message-ID: <20240430170213.148558-1-...@openvz.org>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 block/file-posix.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 35684f7e21..5c46938936 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1039,8 +1039,7 @@ static int fcntl_setfl(int fd, int flag)
 }
 
 static int raw_reconfigure_getfd(BlockDriverState *bs, int flags,
- int *open_flags, uint64_t perm, bool 
force_dup,
- Error **errp)
+ int *open_flags, uint64_t perm, Error **errp)
 {
 BDRVRawState *s = bs->opaque;
 int fd = -1;
@@ -1068,7 +1067,7 @@ static int raw_reconfigure_getfd(BlockDriverState *bs, 
int flags,
 assert((s->open_flags & O_ASYNC) == 0);
 #endif
 
-if (!force_dup && *open_flags == s->open_flags) {
+if (*open_flags == s->open_flags) {
 /* We're lucky, the existing fd is fine */
 return s->fd;
 }
@@ -3748,8 +3747,7 @@ static int raw_check_perm(BlockDriverState *bs, uint64_t 
perm, uint64_t shared,
 int ret;
 
 /* We may need a new fd if auto-read-only switches the mode */
-ret = raw_reconfigure_getfd(bs, input_flags, &open_flags, perm,
-false, errp);
+ret = raw_reconfigure_getfd(bs, input_flags, &open_flags, perm, errp);
 if (ret < 0) {
 return ret;
 } else if (ret != s->fd) {
-- 
2.45.2




[PULL 7/8] block/crypto: create ciphers on demand

2024-06-11 Thread Kevin Wolf
From: Stefan Hajnoczi 

Ciphers are pre-allocated by qcrypto_block_init_cipher() depending on
the given number of threads. The -device
virtio-blk-pci,iothread-vq-mapping= feature allows users to assign
multiple IOThreads to a virtio-blk device, but the association between
the virtio-blk device and the block driver happens after the block
driver is already open.

When the number of threads given to qcrypto_block_init_cipher() is
smaller than the actual number of threads at runtime, the
block->n_free_ciphers > 0 assertion in qcrypto_block_pop_cipher() can
fail.

Get rid of qcrypto_block_init_cipher() n_thread's argument and allocate
ciphers on demand.

Reported-by: Qing Wang 
Buglink: https://issues.redhat.com/browse/RHEL-36159
Signed-off-by: Stefan Hajnoczi 
Message-ID: <20240527155851.892885-2-stefa...@redhat.com>
Reviewed-by: Kevin Wolf 
Acked-by: Daniel P. Berrangé 
Signed-off-by: Kevin Wolf 
---
 crypto/blockpriv.h  |  12 +++--
 crypto/block-luks.c |   3 +-
 crypto/block-qcow.c |   2 +-
 crypto/block.c  | 111 ++--
 4 files changed, 78 insertions(+), 50 deletions(-)

diff --git a/crypto/blockpriv.h b/crypto/blockpriv.h
index 836f3b4726..4bf6043d5d 100644
--- a/crypto/blockpriv.h
+++ b/crypto/blockpriv.h
@@ -32,8 +32,14 @@ struct QCryptoBlock {
 const QCryptoBlockDriver *driver;
 void *opaque;
 
-QCryptoCipher **ciphers;
-size_t n_ciphers;
+/* Cipher parameters */
+QCryptoCipherAlgorithm alg;
+QCryptoCipherMode mode;
+uint8_t *key;
+size_t nkey;
+
+QCryptoCipher **free_ciphers;
+size_t max_free_ciphers;
 size_t n_free_ciphers;
 QCryptoIVGen *ivgen;
 QemuMutex mutex;
@@ -130,7 +136,7 @@ int qcrypto_block_init_cipher(QCryptoBlock *block,
   QCryptoCipherAlgorithm alg,
   QCryptoCipherMode mode,
   const uint8_t *key, size_t nkey,
-  size_t n_threads, Error **errp);
+  Error **errp);
 
 void qcrypto_block_free_cipher(QCryptoBlock *block);
 
diff --git a/crypto/block-luks.c b/crypto/block-luks.c
index 3ee928fb5a..3357852c0a 100644
--- a/crypto/block-luks.c
+++ b/crypto/block-luks.c
@@ -1262,7 +1262,6 @@ qcrypto_block_luks_open(QCryptoBlock *block,
   luks->cipher_mode,
   masterkey,
   luks->header.master_key_len,
-  n_threads,
   errp) < 0) {
 goto fail;
 }
@@ -1456,7 +1455,7 @@ qcrypto_block_luks_create(QCryptoBlock *block,
 /* Setup the block device payload encryption objects */
 if (qcrypto_block_init_cipher(block, luks_opts.cipher_alg,
   luks_opts.cipher_mode, masterkey,
-  luks->header.master_key_len, 1, errp) < 0) {
+  luks->header.master_key_len, errp) < 0) {
 goto error;
 }
 
diff --git a/crypto/block-qcow.c b/crypto/block-qcow.c
index 4d7cf36a8f..02305058e3 100644
--- a/crypto/block-qcow.c
+++ b/crypto/block-qcow.c
@@ -75,7 +75,7 @@ qcrypto_block_qcow_init(QCryptoBlock *block,
 ret = qcrypto_block_init_cipher(block, QCRYPTO_CIPHER_ALG_AES_128,
 QCRYPTO_CIPHER_MODE_CBC,
 keybuf, G_N_ELEMENTS(keybuf),
-n_threads, errp);
+errp);
 if (ret < 0) {
 ret = -ENOTSUP;
 goto fail;
diff --git a/crypto/block.c b/crypto/block.c
index 506ea1d1a3..ba6d1cebc7 100644
--- a/crypto/block.c
+++ b/crypto/block.c
@@ -20,6 +20,7 @@
 
 #include "qemu/osdep.h"
 #include "qapi/error.h"
+#include "qemu/lockable.h"
 #include "blockpriv.h"
 #include "block-qcow.h"
 #include "block-luks.h"
@@ -57,6 +58,8 @@ QCryptoBlock *qcrypto_block_open(QCryptoBlockOpenOptions 
*options,
 {
 QCryptoBlock *block = g_new0(QCryptoBlock, 1);
 
+qemu_mutex_init(&block->mutex);
+
 block->format = options->format;
 
 if (options->format >= G_N_ELEMENTS(qcrypto_block_drivers) ||
@@ -76,8 +79,6 @@ QCryptoBlock *qcrypto_block_open(QCryptoBlockOpenOptions 
*options,
 return NULL;
 }
 
-qemu_mutex_init(&block->mutex);
-
 return block;
 }
 
@@ -92,6 +93,8 @@ QCryptoBlock *qcrypto_block_create(QCryptoBlockCreateOptions 
*options,
 {
 QCryptoBlock *block = g_new0(QCryptoBlock, 1);
 
+qemu_mutex_init(&block->mutex);
+
 block->format = options->format;
 
 if (options->format >= G_N_ELEMENTS(qcrypto_block_drivers) ||
@@ -111,8 +114,6 @@ QCryptoBlock 
*qcrypto_block_create(QCryptoBlockCreateOptions *options,
 return NULL;
 }
 
-qemu_mutex_init(&block->mutex);
-
 return block;

Re: [PATCH v4 2/4] vvfat: Fix usage of `info.file.offset`

2024-06-11 Thread Kevin Wolf
Am 11.06.2024 um 14:31 hat Amjad Alsharafi geschrieben:
> On Mon, Jun 10, 2024 at 06:49:43PM +0200, Kevin Wolf wrote:
> > Am 05.06.2024 um 02:58 hat Amjad Alsharafi geschrieben:
> > > The field is marked as "the offset in the file (in clusters)", but it
> > > was being used like this
> > > `cluster_size*(nums)+mapping->info.file.offset`, which is incorrect.
> > > 
> > > Additionally, removed the `abort` when `first_mapping_index` does not
> > > match, as this matches the case when adding new clusters for files, and
> > > it's inevitable that we reach this condition when doing that if the
> > > clusters are not after one another, so there is no reason to `abort`
> > > here, execution continues and the new clusters are written to disk
> > > correctly.
> > > 
> > > Signed-off-by: Amjad Alsharafi 
> > 
> > Can you help me understand how first_mapping_index really works?
> > 
> > It seems to me that you get a chain of mappings for each file on the FAT
> > filesystem, which are just the contiguous areas in it, and
> > first_mapping_index refers to the mapping at the start of the file. But
> > for much of the time, it actually doesn't seem to be set at all, so you
> > have mapping->first_mapping_index == -1. Do you understand the rules
> > around when it's set and when it isn't?
> 
> Yeah. So `first_mapping_index` is the index of the first mapping, each
> mapping is a group of clusters that are contiguous in the file.
> It's mostly `-1` because the first mapping will have the value set as
> `-1` and not its own index; this value will only be set when the file
> contains more than one mapping, and this will only happen when you add
> clusters to a file that are not contiguous with the existing clusters.

Ah, that makes some sense. Not sure if it's optimal, but it's a rule I
can work with. So just to confirm, this is the invariant that we think
should always hold true, right?

assert((mapping->mode & MODE_DIRECTORY) ||
   !mapping->info.file.offset ||
   mapping->first_mapping_index > 0);

> And actually, thanks to that I noticed another bug not fixed in PATCH 3, 
> We are doing this check 
> `s->current_mapping->first_mapping_index != mapping->first_mapping_index`
> to know if we should switch to the new mapping or not. 
> If we were reading from the first mapping (`first_mapping_index == -1`)
> and we jumped to the second mapping (`first_mapping_index == n`), we
> will catch this condition and switch to the new mapping.
> 
> But if the file has more than 2 mappings, and we jumped to the 3rd
> mapping, we will not catch this since (`first_mapping_index == n`) for
> both of them haha. I think a better check is to check the `mapping`
> pointer directly. (I'll add it also in the next series together with a
> test for it.)

This comparison is exactly what confused me. I didn't realise that the
first mapping in the chain has a different value here, so I thought this
must mean that we're looking at a different file now - but of course I
couldn't see a reason for that because we're iterating through a single
file in this function.

But even now that I know that the condition triggers when switching from
the first to the second mapping, it doesn't make sense to me. We don't
have to copy things around just because a file is non-contiguous.

What we want to catch is if the order of mappings has changed compared
to the old state. Do we need a linked list, maybe a prev_mapping_index,
instead of first_mapping_index so that we can compare if it is still the
same as before?

Or actually, I suppose that's the first block with an abort() in the
code, just that it doesn't compare mappings, but their offsets.

> > 
> > >  block/vvfat.c | 12 +++-
> > >  1 file changed, 7 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/block/vvfat.c b/block/vvfat.c
> > > index 19da009a5b..f0642ac3e4 100644
> > > --- a/block/vvfat.c
> > > +++ b/block/vvfat.c
> > > @@ -1408,7 +1408,9 @@ read_cluster_directory:
> > >  
> > >  assert(s->current_fd);
> > >  
> > > -
> > > offset=s->cluster_size*(cluster_num-s->current_mapping->begin)+s->current_mapping->info.file.offset;
> > > +offset = s->cluster_size *
> > > +((cluster_num - s->current_mapping->begin)
> > > ++ s->current_mapping->info.file.offset);
> > >  if(lseek(s->current_fd, offset, SEEK_SET)!=offset)
> > >  return -3;
> > >  s->cluster=s->cluster_buffer;
> > > @@ -1929,8 +1931,9 @@ get_clust

Re: [PATCH v4 4/4] iotests: Add `vvfat` tests

2024-06-10 Thread Kevin Wolf
Am 10.06.2024 um 16:11 hat Amjad Alsharafi geschrieben:
> On Mon, Jun 10, 2024 at 02:01:24PM +0200, Kevin Wolf wrote:
> > With the updated test, I can catch the problems that are fixed by
> > patches 1 and 2, but it still doesn't need patch 3 to pass.
> > 
> > Kevin
> > 
> 
> Thanks for reviewing, those are all mistakes, and I fixed them (included
> a small patch to fix these issues at the end...).
> 
> Regarding the failing test, I forgot to also read the files from the fat
> driver, and instead I was just reading from the host filesystem.
> I'm not sure exactly why reading from the filesystem works, but reading
> from the driver (i.e. guest) gives the weird buggy result. 
> I have updated the test in the patch below to reflect this.
> 
> I would love if you can test the patch below and let me know if the
> issues are fixed, after that I can send the new series.

Yes, that looks good to me and reproduces a failure without patch 3.

Kevin




Re: [PATCH v4 2/4] vvfat: Fix usage of `info.file.offset`

2024-06-10 Thread Kevin Wolf
Am 05.06.2024 um 02:58 hat Amjad Alsharafi geschrieben:
> The field is marked as "the offset in the file (in clusters)", but it
> was being used like this
> `cluster_size*(nums)+mapping->info.file.offset`, which is incorrect.
> 
> Additionally, removed the `abort` when `first_mapping_index` does not
> match, as this matches the case when adding new clusters for files, and
> it's inevitable that we reach this condition when doing that if the
> clusters are not after one another, so there is no reason to `abort`
> here, execution continues and the new clusters are written to disk
> correctly.
> 
> Signed-off-by: Amjad Alsharafi 

Can you help me understand how first_mapping_index really works?

It seems to me that you get a chain of mappings for each file on the FAT
filesystem, which are just the contiguous areas in it, and
first_mapping_index refers to the mapping at the start of the file. But
for much of the time, it actually doesn't seem to be set at all, so you
have mapping->first_mapping_index == -1. Do you understand the rules
around when it's set and when it isn't?

>  block/vvfat.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/block/vvfat.c b/block/vvfat.c
> index 19da009a5b..f0642ac3e4 100644
> --- a/block/vvfat.c
> +++ b/block/vvfat.c
> @@ -1408,7 +1408,9 @@ read_cluster_directory:
>  
>  assert(s->current_fd);
>  
> -
> offset=s->cluster_size*(cluster_num-s->current_mapping->begin)+s->current_mapping->info.file.offset;
> +offset = s->cluster_size *
> +((cluster_num - s->current_mapping->begin)
> ++ s->current_mapping->info.file.offset);
>  if(lseek(s->current_fd, offset, SEEK_SET)!=offset)
>  return -3;
>  s->cluster=s->cluster_buffer;
> @@ -1929,8 +1931,9 @@ get_cluster_count_for_direntry(BDRVVVFATState* s, 
> direntry_t* direntry, const ch
>  (mapping->mode & MODE_DIRECTORY) == 0) {
>  
>  /* was modified in qcow */
> -if (offset != mapping->info.file.offset + s->cluster_size
> -* (cluster_num - mapping->begin)) {
> +if (offset != s->cluster_size
> +* ((cluster_num - mapping->begin)
> ++ mapping->info.file.offset)) {
>  /* offset of this cluster in file chain has changed 
> */
>  abort();
>  copy_it = 1;
> @@ -1944,7 +1947,6 @@ get_cluster_count_for_direntry(BDRVVVFATState* s, 
> direntry_t* direntry, const ch
>  
>  if (mapping->first_mapping_index != first_mapping_index
>  && mapping->info.file.offset > 0) {
> -abort();
>  copy_it = 1;
>  }

I'm unsure which case this represents. If first_mapping_index refers to
the mapping of the first cluster in the file, does this mean we got a
mapping for a different file here? Or is the comparison between -1 and a
real value?

In any case it doesn't seem to be the case that the comment at the
declaration of copy_it describes.

>  
> @@ -2404,7 +2406,7 @@ static int commit_mappings(BDRVVVFATState* s,
>  (mapping->end - mapping->begin);
>  } else
>  next_mapping->info.file.offset = mapping->info.file.offset +
> -mapping->end - mapping->begin;
> +(mapping->end - mapping->begin);
>  
>  mapping = next_mapping;
>  }

Kevin




Re: [PATCH v4 4/4] iotests: Add `vvfat` tests

2024-06-10 Thread Kevin Wolf
Am 05.06.2024 um 02:58 hat Amjad Alsharafi geschrieben:
> Added several tests to verify the implementation of the vvfat driver.
> 
> We needed a way to interact with it, so created a basic `fat16.py` driver 
> that handled writing correct sectors for us.
> 
> Added `vvfat` to the non-generic formats, as it's not a normal image format.
> 
> Signed-off-by: Amjad Alsharafi 
> ---
>  tests/qemu-iotests/check   |   2 +-
>  tests/qemu-iotests/fat16.py| 635 +
>  tests/qemu-iotests/testenv.py  |   2 +-
>  tests/qemu-iotests/tests/vvfat | 440 
>  tests/qemu-iotests/tests/vvfat.out |   5 +
>  5 files changed, 1082 insertions(+), 2 deletions(-)
>  create mode 100644 tests/qemu-iotests/fat16.py
>  create mode 100755 tests/qemu-iotests/tests/vvfat
>  create mode 100755 tests/qemu-iotests/tests/vvfat.out
> 
> diff --git a/tests/qemu-iotests/check b/tests/qemu-iotests/check
> index 56d88ca423..545f9ec7bd 100755
> --- a/tests/qemu-iotests/check
> +++ b/tests/qemu-iotests/check
> @@ -84,7 +84,7 @@ def make_argparser() -> argparse.ArgumentParser:
>  p.set_defaults(imgfmt='raw', imgproto='file')
>  
>  format_list = ['raw', 'bochs', 'cloop', 'parallels', 'qcow', 'qcow2',
> -   'qed', 'vdi', 'vpc', 'vhdx', 'vmdk', 'luks', 'dmg']
> +   'qed', 'vdi', 'vpc', 'vhdx', 'vmdk', 'luks', 'dmg', 
> 'vvfat']
>  g_fmt = p.add_argument_group(
>  '  image format options',
>  'The following options set the IMGFMT environment variable. '
> diff --git a/tests/qemu-iotests/fat16.py b/tests/qemu-iotests/fat16.py
> new file mode 100644
> index 00..baf801b4d5
> --- /dev/null
> +++ b/tests/qemu-iotests/fat16.py
> @@ -0,0 +1,635 @@
> +# A simple FAT16 driver that is used to test the `vvfat` driver in QEMU.
> +#
> +# Copyright (C) 2024 Amjad Alsharafi 
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 2 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see .
> +
> +from typing import List
> +import string
> +
> +SECTOR_SIZE = 512
> +DIRENTRY_SIZE = 32
> +ALLOWED_FILE_CHARS = \
> +set("!#$%&'()-@^_`{}~" + string.digits + string.ascii_uppercase)
> +
> +
> +class MBR:
> +def __init__(self, data: bytes):
> +assert len(data) == 512
> +self.partition_table = []
> +for i in range(4):
> +partition = data[446 + i * 16 : 446 + (i + 1) * 16]
> +self.partition_table.append(
> +{
> +"status": partition[0],
> +"start_head": partition[1],
> +"start_sector": partition[2] & 0x3F,
> +"start_cylinder":
> +((partition[2] & 0xC0) << 2) | partition[3],
> +"type": partition[4],
> +"end_head": partition[5],
> +"end_sector": partition[6] & 0x3F,
> +"end_cylinder":
> +((partition[6] & 0xC0) << 2) | partition[7],
> +"start_lba": int.from_bytes(partition[8:12], "little"),
> +"size": int.from_bytes(partition[12:16], "little"),
> +}
> +)
> +
> +def __str__(self):
> +return "\n".join(
> +[f"{i}: {partition}"
> +for i, partition in enumerate(self.partition_table)]
> +)
> +
> +
> +class FatBootSector:
> +def __init__(self, data: bytes):
> +assert len(data) == 512
> +self.bytes_per_sector = int.from_bytes(data[11:13], "little")
> +self.sectors_per_cluster = data[13]
> +self.reserved_sectors = int.from_bytes(data[14:16], "little")
> +self.fat_count = data[16]
> +self.root_entries = int.from_bytes(data[17:19], "little")
> +self.media_descriptor = data[21]
> +self.fat_size = int.from_bytes(data[22:24], "little")
> +self.sectors_per_fat = int.from_bytes(data[22:24], "little")

Why two different attributes self.fat_size and self.sectors_per_fat that
contain the same value?

> +self.sectors_per_track = int.from_bytes(data[24:26], "little")
> +self.heads = int.from_bytes(data[26:28], "little")
> +self.hidden_sectors = int.from_bytes(data[28:32], "little")
> +self.total_sectors = int.from_bytes(data[32:36], "little")

This value should only be considered if the 16 bit value at byte 19
(which you don't 

[PATCH] scsi-disk: Don't silently truncate serial number

2024-06-04 Thread Kevin Wolf
Before this commit, scsi-disk accepts a string of arbitrary length for
its "serial" property. However, the value visible on the guest is
actually truncated to 36 characters. This limitation doesn't come from
the SCSI specification, it is an arbitrary limit that was initially
picked as 20 and later bumped to 36 by commit 48b62063.

Similarly, device_id was introduced as a copy of the serial number,
limited to 20 characters, but commit 48b62063 forgot to actually bump
it.

As long as we silently truncate the given string, extending the limit is
actually not a harmless change, but breaks the guest ABI. This is the
most important reason why commit 48b62063 was really wrong (and it's
also why we can't change device_id to be in sync with the serial number
again and use 36 characters now, it would be another guest ABI
breakage).

In order to avoid future breakage, don't silently truncate the serial
number string any more, but just error out if it would be truncated.

Buglink: https://issues.redhat.com/browse/RHEL-3542
Suggested-by: Peter Krempa 
Signed-off-by: Kevin Wolf 
---
 hw/scsi/scsi-disk.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 4bd7af9d0c..5f55ae54e4 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -58,6 +58,9 @@
 
 #define TYPE_SCSI_DISK_BASE "scsi-disk-base"
 
+#define MAX_SERIAL_LEN  36
+#define MAX_SERIAL_LEN_FOR_DEVID20
+
 OBJECT_DECLARE_TYPE(SCSIDiskState, SCSIDiskClass, SCSI_DISK_BASE)
 
 struct SCSIDiskClass {
@@ -648,8 +651,8 @@ static int scsi_disk_emulate_vpd_page(SCSIRequest *req, 
uint8_t *outbuf)
 }
 
 l = strlen(s->serial);
-if (l > 36) {
-l = 36;
+if (l > MAX_SERIAL_LEN) {
+l = MAX_SERIAL_LEN;
 }
 
 trace_scsi_disk_emulate_vpd_page_80(req->cmd.xfer);
@@ -2501,9 +2504,20 @@ static void scsi_realize(SCSIDevice *dev, Error **errp)
 if (!s->vendor) {
 s->vendor = g_strdup("QEMU");
 }
+if (s->serial && strlen(s->serial) > MAX_SERIAL_LEN) {
+error_setg(errp, "The serial number can't be longer than %d 
characters",
+   MAX_SERIAL_LEN);
+return;
+}
 if (!s->device_id) {
 if (s->serial) {
-s->device_id = g_strdup_printf("%.20s", s->serial);
+if (strlen(s->serial) > MAX_SERIAL_LEN_FOR_DEVID) {
+error_setg(errp, "The serial number can't be longer than %d "
+   "characters when it is also used as the default for 
"
+   "device_id", MAX_SERIAL_LEN_FOR_DEVID);
+return;
+}
+s->device_id = g_strdup(s->serial);
 } else {
 const char *str = blk_name(s->qdev.conf.blk);
 if (str && *str) {
-- 
2.45.1




Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-06-04 Thread Kevin Wolf
Am 04.06.2024 um 09:58 hat Fiona Ebner geschrieben:
> Am 03.06.24 um 18:21 schrieb Kevin Wolf:
> > Am 03.06.2024 um 16:17 hat Fiona Ebner geschrieben:
> >> Am 26.03.24 um 13:44 schrieb Kevin Wolf:
> >>>
> >>> The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
> >>> with a coroutine wrapper so that the graph lock is held for the whole
> >>> function. Then calling bdrv_co_flush() while iterating the list is safe
> >>> and doesn't allow concurrent graph modifications.
> >>
> >> The second is that iotest 255 ran into an assertion failure upon QMP 
> >> 'quit':
> >>
> >>> ../block/graph-lock.c:113: bdrv_graph_wrlock: Assertion 
> >>> `!qemu_in_coroutine()' failed.
> >>
> >> Looking at the backtrace:
> >>
> >>> #5  0x762a90cc3eb2 in __GI___assert_fail
> >>> (assertion=0x5afb07991e7d "!qemu_in_coroutine()", file=0x5afb07991e00 
> >>> "../block/graph-lock.c", line=113, function=0x5afb07991f20 
> >>> <__PRETTY_FUNCTION__.4> "bdrv_graph_wrlock")
> >>> at ./assert/assert.c:101
> >>> #6  0x5afb07585311 in bdrv_graph_wrlock () at 
> >>> ../block/graph-lock.c:113
> >>> #7  0x5afb07573a36 in blk_remove_bs (blk=0x5afb0af99420) at 
> >>> ../block/block-backend.c:901
> >>> #8  0x5afb075729a7 in blk_delete (blk=0x5afb0af99420) at 
> >>> ../block/block-backend.c:487
> >>> #9  0x5afb07572d88 in blk_unref (blk=0x5afb0af99420) at 
> >>> ../block/block-backend.c:547
> >>> #10 0x5afb07572fe8 in bdrv_next (it=0x762a852fef00) at 
> >>> ../block/block-backend.c:618
> >>> #11 0x5afb0758cd65 in bdrv_co_flush_all () at ../block/io.c:2347
> >>> #12 0x5afb0753ba37 in bdrv_co_flush_all_entry (opaque=0x712c6050) 
> >>> at block/block-gen.c:1391
> >>> #13 0x5afb0773bf41 in coroutine_trampoline (i0=168365184, i1=23291)
> >>
> >> So I guess calling bdrv_next() is not safe from a coroutine, because
> >> the function doing the iteration could end up being the last thing to
> >> have a reference for the BB.
> > 
> > Does your bdrv_co_flush_all() take the graph (reader) lock? If so, this
> > is surprising, because while we hold the graph lock, no reference should
> > be able to go away - you need the writer lock for that and you won't get
> > it as long as bdrv_co_flush_all() locks the graph. So whatever had a
> > reference before the bdrv_next() loop must still have it now. Do you
> > know where it gets dropped?
> > 
> 
> AFAICT, yes, it does hold the graph reader lock. The generated code is:
> 
> > static void coroutine_fn bdrv_co_flush_all_entry(void *opaque)
> > {
> > BdrvFlushAll *s = opaque;
> > 
> > bdrv_graph_co_rdlock();
> > s->ret = bdrv_co_flush_all();
> > bdrv_graph_co_rdunlock();
> > s->poll_state.in_progress = false;
> > 
> > aio_wait_kick();
> > }
> 
> Apparently when the mirror job is aborted/exits, which can happen during
> the polling for bdrv_co_flush_all_entry(), a reference can go away
> without the write lock (at least my breakpoints didn't trigger) being held:
> 
> > #0  blk_unref (blk=0x5cdefe943d20) at ../block/block-backend.c:537
> > #1  0x5cdefb26697e in mirror_exit_common (job=0x5cdefeb53000) at 
> > ../block/mirror.c:710
> > #2  0x5cdefb263575 in mirror_abort (job=0x5cdefeb53000) at 
> > ../block/mirror.c:823
> > #3  0x5cdefb2248a6 in job_abort (job=0x5cdefeb53000) at ../job.c:825
> > #4  0x5cdefb2245f2 in job_finalize_single_locked (job=0x5cdefeb53000) 
> > at ../job.c:855
> > #5  0x5cdefb223852 in job_completed_txn_abort_locked 
> > (job=0x5cdefeb53000) at ../job.c:958
> > #6  0x5cdefb223714 in job_completed_locked (job=0x5cdefeb53000) at 
> > ../job.c:1065
> > #7  0x5cdefb224a8b in job_exit (opaque=0x5cdefeb53000) at ../job.c:1088
> > #8  0x5cdefb4134fc in aio_bh_call (bh=0x5cdefe7487c0) at 
> > ../util/async.c:171
> > #9  0x5cdefb4136ce in aio_bh_poll (ctx=0x5cdefd9cd750) at 
> > ../util/async.c:218
> > #10 0x5cdefb3efdfd in aio_poll (ctx=0x5cdefd9cd750, blocking=true) at 
> > ../util/aio-posix.c:722
> > #11 0x5cdefb20435e in bdrv_poll_co (s=0x7ffe491621d8) at 
> > ../block/block-gen.h:43
> > #12 0x5cdefb206a33 in bdrv_flush_all () at block/block-gen.c:1410
> > #13 0x5cdefae5c8ed in do_vm_stop (state=RUN_STATE_SHUTDOWN, 
> > send_st

Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-06-03 Thread Kevin Wolf
Am 03.06.2024 um 16:17 hat Fiona Ebner geschrieben:
> Hi Kevin,
> 
> Am 26.03.24 um 13:44 schrieb Kevin Wolf:
> > Am 22.03.2024 um 10:50 hat Fiona Ebner geschrieben:
> >> The old_bs variable in bdrv_next() is currently determined by looking
> >> at the old block backend. However, if the block graph changes before
> >> the next bdrv_next() call, it might be that the associated BDS is not
> >> the same that was referenced previously. In that case, the wrong BDS
> >> is unreferenced, leading to an assertion failure later:
> >>
> >>> bdrv_unref: Assertion `bs->refcnt > 0' failed.
> > 
> > Your change makes sense, but in theory it shouldn't make a difference.
> > The real bug is in the caller, you can't allow graph modifications while
> > iterating the list of nodes. Even if it doesn't crash (like after your
> > patch), you don't have any guarantee that you will have seen every node
> > that exists at the end - and maybe not even that you don't get the
> > same node twice.
> > 
> >> In particular, this can happen in the context of bdrv_flush_all(),
> >> when polling for bdrv_co_flush() in the generated co-wrapper leads to
> >> a graph change (for example with a stream block job [0]).
> > 
> > The whole locking around this case is a bit tricky and would deserve
> > some cleanup.
> > 
> > The basic rule for bdrv_next() callers is that they need to hold the
> > graph reader lock as long as they are iterating the graph, otherwise
> > it's not safe. This implies that the ref/unref pairs in it should never
> > make a difference either - which is important, because at least
> > releasing the last reference is forbidden while holding the graph lock.
> > I intended to remove the ref/unref for bdrv_next(), but I didn't because
> > I realised that the callers need to be audited first that they really
> > obey the rules. You found one that would be problematic.
> > 
> > The thing that bdrv_flush_all() gets wrong is that it promises to follow
> > the graph lock rules with GRAPH_RDLOCK_GUARD_MAINLOOP(), but then calls
> > something that polls. The compiler can't catch this because bdrv_flush()
> > is a co_wrapper_mixed_bdrv_rdlock. The behaviour for these functions is:
> > 
> > - If called outside of coroutine context, they are GRAPH_UNLOCKED
> > - If called in coroutine context, they are GRAPH_RDLOCK
> > 
> > We should probably try harder to get rid of the mixed functions, because
> > a synchronous co_wrapper_bdrv_rdlock could actually be marked
> > GRAPH_UNLOCKED in the code and then the compiler could catch this case.
> > 
> > The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
> > with a coroutine wrapper so that the graph lock is held for the whole
> > function. Then calling bdrv_co_flush() while iterating the list is safe
> > and doesn't allow concurrent graph modifications.
> 
> I attempted this now, but ran into two issues:
> 
> The first is that I had to add support for a function without arguments
> to the block-coroutine-wrapper.py script. I can send this as an RFC in
> any case if desired.

I think I wouldn't necessarily merge it if we don't have code that makes
use of it, but having it on the list as an RFC can't hurt either way.
Then we can come back to it once we do need it for something.

> The second is that iotest 255 ran into an assertion failure upon QMP 'quit':
> 
> > ../block/graph-lock.c:113: bdrv_graph_wrlock: Assertion 
> > `!qemu_in_coroutine()' failed.
> 
> Looking at the backtrace:
> 
> > #5  0x762a90cc3eb2 in __GI___assert_fail
> > (assertion=0x5afb07991e7d "!qemu_in_coroutine()", file=0x5afb07991e00 
> > "../block/graph-lock.c", line=113, function=0x5afb07991f20 
> > <__PRETTY_FUNCTION__.4> "bdrv_graph_wrlock")
> > at ./assert/assert.c:101
> > #6  0x5afb07585311 in bdrv_graph_wrlock () at ../block/graph-lock.c:113
> > #7  0x5afb07573a36 in blk_remove_bs (blk=0x5afb0af99420) at 
> > ../block/block-backend.c:901
> > #8  0x5afb075729a7 in blk_delete (blk=0x5afb0af99420) at 
> > ../block/block-backend.c:487
> > #9  0x5afb07572d88 in blk_unref (blk=0x5afb0af99420) at 
> > ../block/block-backend.c:547
> > #10 0x5afb07572fe8 in bdrv_next (it=0x762a852fef00) at 
> > ../block/block-backend.c:618
> > #11 0x5afb0758cd65 in bdrv_co_flush_all () at ../block/io.c:2347
> > #12 0x5afb0753ba37 in bdrv_co_flush_all_entry (opaque=0x712c6050) 
> > at block/block-gen.c:1391
> > #13 0x5afb0773bf41 in coroutine_trampoline (i0=168

Re: [PATCH v3 2/2] iotests: test NBD+TLS+iothread

2024-06-03 Thread Kevin Wolf
Am 03.06.2024 um 13:45 hat Daniel P. Berrangé geschrieben:
> On Fri, May 31, 2024 at 01:04:59PM -0500, Eric Blake wrote:
> > Prevent regressions when using NBD with TLS in the presence of
> > iothreads, adding coverage the fix to qio channels made in the
> > previous patch.
> > 
> > The shell function pick_unused_port() was copied from
> > nbdkit.git/tests/functions.sh.in, where it had all authors from Red
> > Hat, agreeing to the resulting relicensing from 2-clause BSD to GPLv2.
> > 
> > CC: qemu-sta...@nongnu.org
> > CC: "Richard W.M. Jones" 
> > Signed-off-by: Eric Blake 
> > ---
> >  tests/qemu-iotests/tests/nbd-tls-iothread | 168 ++
> >  tests/qemu-iotests/tests/nbd-tls-iothread.out |  54 ++
> >  2 files changed, 222 insertions(+)
> >  create mode 100755 tests/qemu-iotests/tests/nbd-tls-iothread
> >  create mode 100644 tests/qemu-iotests/tests/nbd-tls-iothread.out
> 
> 
> 
> > +# pick_unused_port
> > +#
> > +# Picks and returns an "unused" port, setting the global variable
> > +# $port.
> > +#
> > +# This is inherently racy, but we need it because qemu does not currently
> > +# permit NBD+TLS over a Unix domain socket
> > +pick_unused_port ()
> > +{
> > +if ! (ss --version) >/dev/null 2>&1; then
> > +_notrun "ss utility required, skipped this test"
> > +fi
> > +
> > +# Start at a random port to make it less likely that two parallel
> > +# tests will conflict.
> > +port=$(( 50000 + (RANDOM%15000) ))
> > +while ss -ltn | grep -sqE ":$port\b"; do
> > +((port++))
> > +if [ $port -eq 65000 ]; then port=50000; fi
> > +done
> > +echo picked unused port
> > +}
> 
> In retrospect I'd probably have suggested putting this into
> common.qemu as it's conceptually independent of this test.

If we make it generic, should nbd_server_start_tcp_socket() in
common.nbd use it, too, instead of implementing its own loop trying to
find a free port?

Kevin




Re: [PATCH] block/copy-before-write: use uint64_t for timeout in nanoseconds

2024-06-03 Thread Kevin Wolf
Am 03.06.2024 um 16:45 hat Fiona Ebner geschrieben:
> Am 28.05.24 um 18:06 schrieb Kevin Wolf:
> > Am 29.04.2024 um 16:19 hat Fiona Ebner geschrieben:
> >> rather than the uint32_t for which the maximum is slightly more than 4
> >> seconds and larger values would overflow. The QAPI interface allows
> >> specifying the number of seconds, so only values 0 to 4 are safe right
> >> now, other values lead to a much lower timeout than a user expects.
> >>
> >> The block_copy() call where this is used already takes a uint64_t for
> >> the timeout, so no change required there.
> >>
> >> Fixes: 6db7fd1ca9 ("block/copy-before-write: implement cbw-timeout option")
> >> Reported-by: Friedrich Weber 
> >> Signed-off-by: Fiona Ebner 
> > 
> > Thanks, applied to the block branch.
> > 
> > But I don't think our job is done yet with this. Increasing the limit is
> > good and useful, but even if it's now unlikely to hit with sane values,
> > we should still catch integer overflows in cbw_open() and return an
> > error on too big values instead of silently wrapping around.
> 
> NANOSECONDS_PER_SECOND is 10^9 and the QAPI type for cbw-timeout is
> uint32_t, so even with the maximum allowed value, there is no overflow.
> Should I still add such a check?

You're right, I missed that cbw_timeout is uint32_t. So uint64_t will be
always be enough to hold the result, and the calculation is also done in
64 bits because NANOSECONDS_PER_SECOND is long long. Then we don't need
a check.
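
For reference, a minimal sketch of the conversion being discussed; the
helper name is made up and this is not the actual cbw_open() code:

    #define NANOSECONDS_PER_SECOND 1000000000LL  /* long long in QEMU */

    /* cbw-timeout arrives from QAPI as a uint32_t number of seconds */
    static uint64_t cbw_timeout_to_ns(uint32_t cbw_timeout)
    {
        /*
         * The multiplication happens in 64 bits because the macro is long
         * long; the largest possible result, UINT32_MAX * 10^9, still fits
         * comfortably into uint64_t, so no overflow check is needed.
         */
        return cbw_timeout * NANOSECONDS_PER_SECOND;
    }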

Kevin




Re: [PATCH 0/2] block/crypto: do not require number of threads upfront

2024-06-03 Thread Kevin Wolf
Am 27.05.2024 um 17:58 hat Stefan Hajnoczi geschrieben:
> The block layer does not know how many threads will perform I/O. It is 
> possible
> to exceed the number of threads that is given to qcrypto_block_open() and this
> can trigger an assertion failure in qcrypto_block_pop_cipher().
> 
> This patch series removes the n_threads argument and instead handles an
> arbitrary number of threads.

Thanks, applied to the block branch.

Kevin




Re: [PATCH v3 5/6] iotests: Filter out `vvfat` fmt from failing tests

2024-05-31 Thread Kevin Wolf
Am 26.05.2024 um 11:56 hat Amjad Alsharafi geschrieben:
> `vvfat` is a special format and not all tests (even generic) can run
> without crashing.  So, added `unsupported_fmt: vvfat` to all failing
> tests.
> 
> Also added `vvfat` format into `meson.build`; vvfat tests can be run
> on the `block-thorough` suite.
> 
> Signed-off-by: Amjad Alsharafi 

I think the better approach is just not counting vvfat as generic. It's
technically not even a format, but a protocol, though I think I agree
with adding it as a format anyway because you can't store a normal image
inside of it.

This should do the trick and avoid most of the changes in this patch:

diff --git a/tests/qemu-iotests/testenv.py b/tests/qemu-iotests/testenv.py
index 588f30a4f1..4053d29de4 100644
--- a/tests/qemu-iotests/testenv.py
+++ b/tests/qemu-iotests/testenv.py
@@ -250,7 +250,7 @@ def __init__(self, source_dir: str, build_dir: str,
 self.qemu_img_options = os.getenv('QEMU_IMG_OPTIONS')
 self.qemu_nbd_options = os.getenv('QEMU_NBD_OPTIONS')

-is_generic = self.imgfmt not in ['bochs', 'cloop', 'dmg']
+is_generic = self.imgfmt not in ['bochs', 'cloop', 'dmg', 'vvfat']
 self.imgfmt_generic = 'true' if is_generic else 'false'

 self.qemu_io_options = f'--cache {self.cachemode} --aio {self.aiomode}'

Kevin




Re: [PATCH v3 0/6] vvfat: Fix write bugs for large files and add iotests

2024-05-31 Thread Kevin Wolf
Am 26.05.2024 um 11:56 hat Amjad Alsharafi geschrieben:
> These patches fix some bugs found when modifying files in vvfat.
> First, there was a bug when writing to cluster 2 or above of a file: the
> cluster before it would be copied instead, so when writing to cluster=2, the
> content of cluster=1 would end up on disk in its place.
> 
> Another issue was modifying the clusters of a file and adding new
> clusters, this showed 2 issues:
> - If the new cluster is not immediately after the last cluster, it will
> cause issues when reading from this file in the future.
> - Generally, the usage of info.file.offset was incorrect, and the
> system would crash on abort() when the file is modified and a new
> cluster was added.
> 
> Also added some iotests for vvfat, covering this fix and also general
> behavior such as reading, writing, and creating files on the filesystem,
> including tests for reading/writing the first cluster, which would pass
> even before this patch.

I was wondering how to reproduce the bugs that patches 2 and 3 fix. So I
tried to run your iotests case, and while it does catch the bug that
patch 1 fixes, it passes even without the other two fixes.

Is this expected? If so, can we add more tests that trigger the problems
the other two patches address?

Kevin




Re: [PATCH v3 1/6] vvfat: Fix bug in writing to middle of file

2024-05-31 Thread Kevin Wolf
Am 26.05.2024 um 11:56 hat Amjad Alsharafi geschrieben:
> Before this commit, the behavior when calling `commit_one_file` for
> example with `offset=0x2000` (second cluster), what will happen is that
> we won't fetch the next cluster from the fat, and instead use the first
> cluster for the read operation.
> 
> This is due to an off-by-one error here, where `i=0x2000 !< offset=0x2000`,
> thus not fetching the next cluster.
> 
> Signed-off-by: Amjad Alsharafi 

Reviewed-by: Kevin Wolf 
Tested-by: Kevin Wolf 

> diff --git a/block/vvfat.c b/block/vvfat.c
> index 9d050ba3ae..ab342f0743 100644
> --- a/block/vvfat.c
> +++ b/block/vvfat.c
> @@ -2525,7 +2525,7 @@ commit_one_file(BDRVVVFATState* s, int dir_index, 
> uint32_t offset)
>  return -1;
>  }
>  
> -for (i = s->cluster_size; i < offset; i += s->cluster_size)
> +for (i = s->cluster_size; i <= offset; i += s->cluster_size)
>  c = modified_fat_get(s, c);

While your change results in the correct behaviour, I think I would
prefer the code to be changed like this so that at the start of each
loop iteration, 'i' always refers to the offset that matches 'c':

for (i = 0; i < offset; i += s->cluster_size) {
c = modified_fat_get(s, c);
}

I'm also adding braces here to make the code conform with the QEMU
coding style. You can use scripts/checkpatch.pl to make sure that all
code you add has the correct style. Much of the vvfat code predates the
coding style, so you'll often have to change style when you touch a
line. (Which is good, because it slowly fixes the existing mess.)

You can keep my Reviewed/Tested-by if you change this.

Kevin




Re: [PATCH v2 2/2] iotests: test NBD+TLS+iothread

2024-05-31 Thread Kevin Wolf
Am 18.05.2024 um 04:50 hat Eric Blake geschrieben:
> Prevent regressions when using NBD with TLS in the presence of
> iothreads, adding coverage the fix to qio channels made in the
> previous patch.
> 
> CC: qemu-sta...@nongnu.org
> Signed-off-by: Eric Blake 

> diff --git a/tests/qemu-iotests/tests/nbd-tls-iothread.out 
> b/tests/qemu-iotests/tests/nbd-tls-iothread.out
> new file mode 100644
> index 000..b5e12dcbe7a
> --- /dev/null
> +++ b/tests/qemu-iotests/tests/nbd-tls-iothread.out
> @@ -0,0 +1,54 @@
> +QA output created by nbd-tls-iothread
> +
> +== preparing TLS creds and spare port ==
> +picked unused port
> +Generating a self signed certificate...
> +Generating a signed certificate...
> +Generating a signed certificate...
> +
> +== preparing image ==
> +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
> +Formatting 
> '/home/eblake/qemu/build/tests/qemu-iotests/scratch/qcow2-file-nbd-tls-iothread/dst.qcow2',
>  fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib 
> size=1073741824 lazy_refcounts=off refcount_bits=16

/home/eblake is suboptimal in reference output. :-)

Kevin




Re: [PATCH v1 1/2] machine/microvm: support for loading EIF image

2024-05-31 Thread Kevin Wolf
Am 31.05.2024 um 09:54 hat Alexander Graf geschrieben:
> 
> On 22.05.24 19:23, Dorjoy Chowdhury wrote:
> > Hi Daniel,
> > Thanks for reviewing.
> > 
> > On Wed, May 22, 2024 at 9:32 PM Daniel P. Berrangé  
> > wrote:
> > > On Sat, May 18, 2024 at 02:07:52PM +0600, Dorjoy Chowdhury wrote:
> > > > An EIF (Enclave Image Format)[1] image is used to boot an AWS nitro
> > > > enclave[2] virtual machine. The EIF file contains the necessary
> > > > kernel, cmdline, ramdisk(s) sections to boot.
> > > > 
> > > > This commit adds support for loading EIF image using the microvm
> > > > machine code. For microvm to boot from an EIF file, the kernel and
> > > > ramdisk(s) are extracted into a temporary kernel and a temporary
> > > > initrd file which are then hooked into the regular x86 boot mechanism
> > > > along with the extracted cmdline.
> > > I can understand why you did it this way, but I feel it's pretty
> > > gross to be loading everything into memory, writing it back to
> > > disk, and then immediately loading it all back into memory.
> > > 
> > > The root problem is the x86_load_linux() method, which directly
> > > accesses the machine properties:
> > > 
> > >  const char *kernel_filename = machine->kernel_filename;
> > >  const char *initrd_filename = machine->initrd_filename;
> > >  const char *dtb_filename = machine->dtb;
> > >  const char *kernel_cmdline = machine->kernel_cmdline;
> > > 
> > > To properly handle this, I'd say we need to introduce an abstraction
> > > for loading the kernel/initrd/cmdline data.
> > > 
> > > i.e. on the X86MachineClass object, we should introduce four virtual
> > > methods
> > > 
> > > uint8_t *(*load_kernel)(X86MachineState *machine);
> > > uint8_t *(*load_initrd)(X86MachineState *machine);
> > > uint8_t *(*load_dtb)(X86MachineState *machine);
> > > uint8_t *(*load_cmdline)(X86MachineState *machine);
> > > 
> > > The default impl of these four methods should just follow the
> > > existing logic of reading and returning the data associated with
> > > the kernel_filename, initrd_filename, dtb and kernel_cmdline
> > > properties.
> > > 
> > > The Nitro machine sub-class, however, can provide an alternative
> > > impl of these virtual methods which returns data directly from
> > > the EIF file instead.
> > > 
> > Great suggestion! I agree the implementation path you suggested would
> > look much nicer as a whole. Now that I looked a bit into the
> > "x86_load_linux" implementation, it looks like pretty much everything
> > is tied to expecting a filename. For example, after reading the header
> > from the kernel_filename x86_load_linux calls into load_multiboot,
> > load_elf (which in turn calls load_elf64, 32) and they all expect a
> > filename. I think I would need to refactor all of them to work
> > with (uint8_t *) buffers instead, right? Also in case of
> > initrd_filename the existing code maps the file using
> > g_mapped_file_new to the X86MachineState member initrd_mapped_file. So
> > that will need to be looked into and refactored. Please correct me if
> > I misunderstood something about the way to implement your suggested
> > approach.
> > 
> > If I am understanding this right, this probably requires a lot of work
> > which will also probably not be straightforward to implement or test.
> > Right now, the way the code is, it's easy to see that the existing
> > code paths are still correct as they are not changed and the new
> > nitro-enclave machine code just hooks into them. As the loading to
> > memory, writing to disk and loading back to memory only is in the
> > execution path of the new nitro-enclave machine type, I think the way
> > the patch is right now, is a good first implementation. What do you
> > think?
> 
> I think the "real" fix here is to move all of the crufty loader logic from
> "file access" to "block layer access". Along with that, create a generic
> helper function (like this[1]) that opens all -kernel/-initrd/-dtb files
> through the block layer and calls a board specific callback to perform the
> load.

Not sure if I would call that a "real fix", it's more like a hack.
Kernel images aren't block devices and their size may not even be
aligned to 512, which is the minimum block size the block layer
supports. So there might be some complications around that area.

That said, even if it's an abuse of the block layer, it might be a
useful abuse that saves you writing quite a bit of FILE * based logic
providing similar functionality, so who I am to stop you...

> With that in place, we would have a reentrant code path for the EIF case:
> EIF could plug into the generic x86 loader and when it detects EIF, create
> internal block devices that reference the existing file plus an offset/limit
> setting to ensure it only accesses the correct range in the target file. It
> can then simply reinvoke the x86 loader with the new block device objects.
> 
> With that in place, we could also finally support -kernel 

Re: [PATCH 0/2] block/crypto: do not require number of threads upfront

2024-05-29 Thread Kevin Wolf
Am 27.05.2024 um 17:58 hat Stefan Hajnoczi geschrieben:
> The block layer does not know how many threads will perform I/O. It is 
> possible
> to exceed the number of threads that is given to qcrypto_block_open() and this
> can trigger an assertion failure in qcrypto_block_pop_cipher().
> 
> This patch series removes the n_threads argument and instead handles an
> arbitrary number of threads.
> ---
> Is it secure to store the key in QCryptoBlock? In this series I assumed the
> answer is yes since the QCryptoBlock's cipher state is equally sensitive, but
> I'm not familiar with this code or a crypto expert.

I would assume the same, but I'm not merging this yet because I think
you said you'd like to have input from danpb?

Reviewed-by: Kevin Wolf 




Re: [PATCH 1/2] Revert "monitor: use aio_co_reschedule_self()"

2024-05-29 Thread Kevin Wolf
Am 29.05.2024 um 12:33 hat Fiona Ebner geschrieben:
> CC-ing stable since 1f25c172f83704e350c0829438d832384084a74d is in 9.0.0

Good point, I'm also updating the commit message in my tree to add
a Cc: line. Thanks for catching this, Fiona!

Kevin




Re: block snapshot issue with RBD

2024-05-29 Thread Kevin Wolf
Am 29.05.2024 um 12:14 hat Fiona Ebner geschrieben:
> I bisected this issue to d3007d348a ("block: Fix crash when loading
> snapshot on inactive node").
> 
> > diff --git a/block/snapshot.c b/block/snapshot.c
> > index ec8cf4810b..c4d40e80dd 100644
> > --- a/block/snapshot.c
> > +++ b/block/snapshot.c
> > @@ -196,8 +196,10 @@ bdrv_snapshot_fallback(BlockDriverState *bs)
> >  int bdrv_can_snapshot(BlockDriverState *bs)
> >  {
> >  BlockDriver *drv = bs->drv;
> > +
> >  GLOBAL_STATE_CODE();
> > -if (!drv || !bdrv_is_inserted(bs) || bdrv_is_read_only(bs)) {
> > +
> > +if (!drv || !bdrv_is_inserted(bs) || !bdrv_is_writable(bs)) {
> >  return 0;
> >  }
> >  
> 
> So I guess the issue is that the blockdev is not writable when in
> "postmigrate" state?

That makes sense. The error message really isn't great, but after
migration, the image is assumed to be owned by the destination, so we
can't use it any more. 'cont' basically asserts that the migration
failed and we can get ownership back. I don't think we can do without a
manual command reactivating the image on the source, but we could have
one that does this without resuming the VM.

Kevin




Re: [PATCH v5] linux-aio: add IO_CMD_FDSYNC command support

2024-05-28 Thread Kevin Wolf
Am 25.04.2024 um 09:04 hat Prasad Pandit geschrieben:
> From: Prasad Pandit 
> 
> Libaio defines IO_CMD_FDSYNC command to sync all outstanding
> asynchronous I/O operations, by flushing out file data to the
> disk storage. Enable linux-aio to submit such aio request.
> 
> When using aio=native without fdsync() support, QEMU creates
> pthreads, and destroying these pthreads results in TLB flushes.
> In a real-time guest environment, TLB flushes cause a latency
> spike. This patch helps to avoid such spikes.
> 
> Reviewed-by: Stefan Hajnoczi 
> Signed-off-by: Prasad Pandit 

Thanks, applied to the block branch.

Kevin
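
As background, the IO_CMD_FDSYNC request described in the commit message is
built with libaio's io_prep_fdsync() helper. A minimal sketch outside of
QEMU's linux-aio wrappers, with error handling omitted:

    #include <libaio.h>

    static int flush_fd_async(io_context_t ctx, int fd)
    {
        struct iocb iocb, *iocbs[1] = { &iocb };
        struct io_event event;

        io_prep_fdsync(&iocb, fd);      /* IO_CMD_FDSYNC: flush file data */
        if (io_submit(ctx, 1, iocbs) != 1) {
            return -1;
        }
        /* wait synchronously here; QEMU instead polls a completion eventfd */
        return io_getevents(ctx, 1, 1, &event, NULL) == 1 ? 0 : -1;
    }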




Re: [PATCH] block/copy-before-write: use uint64_t for timeout in nanoseconds

2024-05-28 Thread Kevin Wolf
Am 29.04.2024 um 16:19 hat Fiona Ebner geschrieben:
> rather than the uint32_t for which the maximum is slightly more than 4
> seconds and larger values would overflow. The QAPI interface allows
> specifying the number of seconds, so only values 0 to 4 are safe right
> now, other values lead to a much lower timeout than a user expects.
> 
> The block_copy() call where this is used already takes a uint64_t for
> the timeout, so no change required there.
> 
> Fixes: 6db7fd1ca9 ("block/copy-before-write: implement cbw-timeout option")
> Reported-by: Friedrich Weber 
> Signed-off-by: Fiona Ebner 

Thanks, applied to the block branch.

But I don't think our job is done yet with this. Increasing the limit is
good and useful, but even if it's now unlikely to hit with sane values,
we should still catch integer overflows in cbw_open() and return an
error on too big values instead of silently wrapping around.

Kevin




Re: [PATCH] qemu-io: add cvtnum() error handling for zone commands

2024-05-28 Thread Kevin Wolf
Am 07.05.2024 um 20:05 hat Stefan Hajnoczi geschrieben:
> cvtnum() parses positive int64_t values and returns a negative errno on
> failure. Print errors and return early when cvtnum() fails.
> 
> While we're at it, also reject nr_zones values greater or equal to 2^32
> since they cannot be represented.
> 
> Reported-by: Peter Maydell 
> Cc: Sam Li 
> Signed-off-by: Stefan Hajnoczi 

Thanks, applied to the block branch.

Kevin
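
A sketch of the checks the commit message describes, modeled on how other
qemu-io commands use cvtnum(); this is illustrative, not the actual patch
hunk:

    int64_t val = cvtnum(argv[optind]);
    if (val < 0) {
        print_cvtnum_err(val, argv[optind]);
        return val;
    }
    if (val > UINT32_MAX) {
        printf("Number of zones must be representable in 32 bits\n");
        return -EINVAL;
    }
    nr_zones = val;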




Re: [PATCH 0/2] Revert "monitor: use aio_co_reschedule_self()"

2024-05-28 Thread Kevin Wolf
Am 06.05.2024 um 21:06 hat Stefan Hajnoczi geschrieben:
> This series fixes RHEL-34618 "qemu crash on Assertion `luringcb->co->ctx ==
> s->aio_context' failed when do block_resize on hotplug disk with 
> aio=io_uring":
> https://issues.redhat.com/browse/RHEL-34618
> 
> Kevin identified commit 1f25c172f837 ("monitor: use aio_co_reschedule_self()")
> as the root cause. There is a subtlety regarding how
> qemu_get_current_aio_context() returns qemu_aio_context even though we may be
> running in iohandler_ctx.
> 
> Revert commit 1f25c172f837, it was just intended as a code cleanup.
> 
> Stefan Hajnoczi (2):
>   Revert "monitor: use aio_co_reschedule_self()"
>   aio: warn about iohandler_ctx special casing

Thanks, applied to the block branch.

Kevin




Re: [PATCH v2 0/3] docs: define policy forbidding use of "AI" / LLM code generators

2024-05-28 Thread Kevin Wolf
Am 16.05.2024 um 18:22 hat Daniel P. Berrangé geschrieben:
> This patch kicks the hornet's nest of AI / LLM code generators.
> 
> With the increasing interest in code generators in recent times,
> it is inevitable that QEMU contributions will include AI generated
> code. Thus far we have remained silent on the matter. Given that
> everyone knows these tools exist, our current position has to be
> considered tacit acceptance of the use of AI generated code in QEMU.
> 
> The question for the project is whether that is a good position for
> QEMU to take or not ?
> 
> IANAL, but I like to think I'm reasonably proficient at understanding
> open source licensing. I am not inherently against the use of AI tools,
> rather I am anti-risk. I also want to see OSS licenses respected and
> complied with.
> 
> AFAICT at its current state of (im)maturity the question of licensing
> of AI code generator output does not have a broadly accepted / settled
> legal position. This is an inherent bias/self-interest from the vendors
> promoting their usage, who tend to minimize/dismiss the legal questions.
> From my POV, this puts such tools in a position of elevated legal risk.
> 
> Given the fuzziness over the legal position of generated code from
> such tools, I don't consider it credible (today) for a contributor
> to assert compliance with the DCO terms (b) or (c) (which is a stated
> pre-requisite for QEMU accepting patches) when a patch includes (or is
> derived from) AI generated code.
> 
> By implication, I think that QEMU must (for now) explicitly decline
> to (knowingly) accept AI generated code.
> 
> Perhaps a few years down the line the legal uncertainty will have
> reduced and we can re-evaluate this policy.
> 
> Discuss...
> 
> Changes in v2:
> 
>  * Fix a huge number of typos in docs
>  * Clarify that maintainers should still add R-b where relevant, even
>if they are already adding their own S-oB.
>  * Clarify situation when contributor re-starts previously abandoned
>work from another contributor.
>  * Add info about Suggested-by tag
>  * Add new docs section dealing with the broad topic of "generated
>files" (whether code generators or compilers)
>  * Simplify the section related to prohibition of AI generated files
>and give further examples of tools considered covered
>  * Remove repeated references to "LLM" as a specific technology, just
>use the broad "AI" term, except for one use of LLM as an example.
>  * Add note that the policy may evolve if the legal clarity improves
>  * Add note that exceptions can be requested on case-by-case basis
>if contributor thinks they can demonstrate a credible copyright
>and licensing status
> 
> Daniel P. Berrangé (3):
>   docs: introduce dedicated page about code provenance / sign-off
>   docs: define policy limiting the inclusion of generated files
>   docs: define policy forbidding use of AI code generators
> 
>  docs/devel/code-provenance.rst| 315 ++
>  docs/devel/index-process.rst  |   1 +
>  docs/devel/submitting-a-patch.rst |  19 +-
>  3 files changed, 318 insertions(+), 17 deletions(-)
>  create mode 100644 docs/devel/code-provenance.rst

Reviewed-by: Kevin Wolf 




Re: [PATCH v2 2/3] docs: define policy limiting the inclusion of generated files

2024-05-28 Thread Kevin Wolf
Am 16.05.2024 um 18:22 hat Daniel P. Berrangé geschrieben:
> Files contributed to QEMU are generally expected to be provided in the
> preferred format for manipulation. IOW, we generally don't expect to
> have generated / compiled code included in the tree, rather, we expect
> to run the code generator / compiler as part of the build process.
> 
> There are some obvious exceptions to this seen in our existing tree, the
> biggest one being the inclusion of many binary firmware ROMs. A more
> niche example is the inclusion of a generated eBPF program. Or the CI
> dockerfiles which are mostly auto-generated. In these cases, however,
> the preferred format source code is still required to be included,
> alongside the generated output.
> 
> Tools which perform user defined algorithmic transformations on code are
> not considered to be "code generators". ie, we permit use of coccinelle,
> spell checkers, and sed/awk/etc to manipulate code. Such use of automated
> manipulation should still be declared in the commit message.
> 
> One off generators which create a boilerplate file which the author then
> fills in, are acceptable if their output has clear copyright and license
> status. This could be where a contributor writes a throwaway python
> script to automate creation of some mundane piece of code for example.
> 
> Signed-off-by: Daniel P. Berrangé 
> ---
>  docs/devel/code-provenance.rst | 55 ++
>  1 file changed, 55 insertions(+)
> 
> diff --git a/docs/devel/code-provenance.rst b/docs/devel/code-provenance.rst
> index 7c42fae571..eabb3e7c08 100644
> --- a/docs/devel/code-provenance.rst
> +++ b/docs/devel/code-provenance.rst
> @@ -210,3 +210,58 @@ mailing list.
>  It is also recommended to attempt to contact the original author to let them
>  know you are interested in taking over their work, in case they still 
> intended
>  to return to the work, or had any suggestions about the best way to continue.
> +
> +Inclusion of generated files
> +
> +
> +Files in patches contributed to QEMU are generally expected to be provided
> +only in the preferred format for making modifications. The implication of
> +this is that the output of code generators or compilers is usually not
> +appropriate to contribute to QEMU.
> +
> +For reasons of practicality there are some exceptions to this rule, where
> +generated code is permitted, provided it is also accompanied by the
> +corresponding preferred source format. This is done where it is impractical
> +to expect those building QEMU to run the code generation or compilation
> +process. A non-exhustive list of examples is:
> +
> + * Images: where a bitmap image is created from a vector file it is common
> +   to include the rendered bitmaps at desired resolution(s), since subtle
> +   changes in the rasterization process / tools may affect quality. The
> +   original vector file is expected to accompany any generated bitmaps.
> +
> + * Firmware: QEMU includes pre-compiled binary ROMs for a variety of guest
> +   firmwares. When such binary ROMs are contributed, the corresponding source
> +   must also be provided, either directly, or through a git submodule link.
> +
> + * Dockerfiles: the majority of the dockerfiles are automatically generated
> +   from a canonical list of build dependencies maintained in tree, together
> +   with the libvirt-ci git submodule link. The generated dockerfiles are
> +   included in tree because it is desirable to be able to directly build
> +   container images from a clean git checkout.
> +
> + * EBPF: QEMU includes some generated EBPF machine code, since the required
> +   eBPF compilation tools are not broadly available on all targeted OS
> +   distributions. The corresponding eBPF C code for the binary is also
> +   provided. This is a time limited exception until the eBPF toolchain is
> +   sufficiently broadly available in distros.

This paragraph is inconsistent with the spelling of "EBPF"/"eBPF".

Kevin




Re: [PATCH 1/1] block: drop force_dup parameter of raw_reconfigure_getfd()

2024-05-28 Thread Kevin Wolf
Am 30.04.2024 um 19:02 hat Denis V. Lunev via geschrieben:
> This parameter is always passed as 'false' from the caller.
> 
> Signed-off-by: Denis V. Lunev 
> CC: Andrey Zhadchenko 
> CC: Kevin Wolf 
> CC: Hanna Reitz 

Let me add a "Since commit 72373e40fbc" to the commit message.

Thanks, applied to the block branch.

Kevin




Re: [PATCH] iotests/pylintrc: allow up to 10 similar lines

2024-05-28 Thread Kevin Wolf
Am 28.05.2024 um 14:49 hat Vladimir Sementsov-Ogievskiy geschrieben:
> On 30.04.24 12:13, Vladimir Sementsov-Ogievskiy wrote:
> > We want to have similar QMP objects in different tests. Reworking these
> > objects to make common parts by calling some helper functions doesn't
> > seem good. It's a lot more comfortable to see the whole QAPI request in
> > one place.
> > 
> > So, let's increase the limit, to unblock further commit
> > "iotests: add backup-discard-source"
> > 
> > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > ---
> > 
> > Hi all! That's a patch to unblock my PR
> > "[PULL 0/6] Block jobs patches for 2024-04-29"
> ><20240429115157.2260885-1-vsement...@yandex-team.ru>
> >
> > https://patchew.org/QEMU/20240429115157.2260885-1-vsement...@yandex-team.ru/
> > 
> > 
> >   tests/qemu-iotests/pylintrc | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/tests/qemu-iotests/pylintrc b/tests/qemu-iotests/pylintrc
> > index de2e0c2781..05b75ee59b 100644
> > --- a/tests/qemu-iotests/pylintrc
> > +++ b/tests/qemu-iotests/pylintrc
> > @@ -55,4 +55,4 @@ max-line-length=79
> >   [SIMILARITIES]
> > -min-similarity-lines=6
> > +min-similarity-lines=10
> 
> 
> Hi! I hope it's OK, if I just apply this to my block branch, and
> resend "[PULL 0/6] Block jobs patches for 2024-04-29" which is blocked
> on this problem.

Sorry, I only just returned from PTO and now have to catch up with
things.

But yes, if John doesn't have any objections, it's fine with me, too.

Reviewed-by: Kevin Wolf 




Re: [PATCH v2 5/5] monitor: use aio_co_reschedule_self()

2024-05-03 Thread Kevin Wolf
Am 06.02.2024 um 20:06 hat Stefan Hajnoczi geschrieben:
> The aio_co_reschedule_self() API is designed to avoid the race
> condition between scheduling the coroutine in another AioContext and
> yielding.
> 
> The QMP dispatch code uses the open-coded version that appears
> susceptible to the race condition at first glance:
> 
>   aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
>   qemu_coroutine_yield();
> 
> The code is actually safe because the iohandler and qemu_aio_context
> AioContext run under the Big QEMU Lock. Nevertheless, set a good example
> and use aio_co_reschedule_self() so it's obvious that there is no race.
> 
> Suggested-by: Hanna Reitz 
> Reviewed-by: Manos Pitsidianakis 
> Reviewed-by: Hanna Czenczek 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  qapi/qmp-dispatch.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c
> index 176b549473..f3488afeef 100644
> --- a/qapi/qmp-dispatch.c
> +++ b/qapi/qmp-dispatch.c
> @@ -212,8 +212,7 @@ QDict *coroutine_mixed_fn qmp_dispatch(const 
> QmpCommandList *cmds, QObject *requ
>   * executing the command handler so that it can make progress if 
> it
>   * involves an AIO_WAIT_WHILE().
>   */
> -aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
> -qemu_coroutine_yield();
> +aio_co_reschedule_self(qemu_get_aio_context());

Turns out that this one actually causes a regression. [1] This code is
run in iohandler_ctx, aio_co_reschedule_self() looks at the new context
and compares it with qemu_get_current_aio_context() - and because both
are qemu_aio_context, it decides that it has nothing to do. So the
command handler coroutine actually still runs in iohandler_ctx now,
which is not what we want.

We could just revert this patch because it was only meant as a cleanup
without a semantic difference.

Or aio_co_reschedule_self() could look at qemu_coroutine_self()->ctx
instead of using qemu_get_current_aio_context(). That would be a little
more indirect, though, and I'm not sure if co->ctx is always up to date.
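
For illustration, a minimal sketch of that second option; it is not a tested
fix and inherits exactly the question of whether co->ctx is reliable here:

    void coroutine_fn aio_co_reschedule_self(AioContext *new_ctx)
    {
        Coroutine *co = qemu_coroutine_self();

        /*
         * Use the coroutine's own notion of its context instead of
         * qemu_get_current_aio_context(), which returns qemu_aio_context
         * even while we are running in iohandler_ctx.
         */
        if (co->ctx != new_ctx) {
            aio_co_schedule(new_ctx, co);
            qemu_coroutine_yield();
        }
    }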

Any opinions on what is the best way to fix this?

Kevin

[1] https://issues.redhat.com/browse/RHEL-34618




Re: [PULL 6/6] iotests: add backup-discard-source

2024-04-30 Thread Kevin Wolf
Am 29.04.2024 um 20:39 hat Vladimir Sementsov-Ogievskiy geschrieben:
> [Add John]
> 
> On 29.04.24 17:18, Richard Henderson wrote:
> > On 4/29/24 04:51, Vladimir Sementsov-Ogievskiy wrote:
> > > Add test for a new backup option: discard-source.
> > > 
> > > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > > Reviewed-by: Fiona Ebner 
> > > Tested-by: Fiona Ebner 
> > > Message-Id: <20240313152822.626493-6-vsement...@yandex-team.ru>
> > > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > > ---
> > >   .../qemu-iotests/tests/backup-discard-source  | 152 ++
> > >   .../tests/backup-discard-source.out   |   5 +
> > >   2 files changed, 157 insertions(+)
> > >   create mode 100755 tests/qemu-iotests/tests/backup-discard-source
> > >   create mode 100644 tests/qemu-iotests/tests/backup-discard-source.out
> > 
> > This fails check-python-minreqs
> > 
> >    https://gitlab.com/qemu-project/qemu/-/jobs/6739551782
> > 
> > It appears to be a pylint issue.
> > 
> > 
> 
> tests/export-incoming-iothread:1:0: R0801: Similar lines in 2 files
> ==tests.backup-discard-source:[52:61]
> ==tests.copy-before-write:[54:63]
> 'file': {
> 'driver': iotests.imgfmt,
> 'file': {
> 'driver': 'file',
> 'filename': source_img,
> }
> },
> 'target': {
> 'driver': iotests.imgfmt, (duplicate-code)
> 
> 
> 
> Hmm. I don't think, that something should be changed in code.
> splitting out part of this json object to a function? That's a test
> for QMP command, and it's good that we see the command as is in one
> place. I can swap some lines or rename variables to hack the linter,
> but I'd prefer not doing so:)
> 
> 
> For me that looks like a false-positive: yes it's a duplication, but
> it should better be duplication, than complicating raw json objects by
> reusing common parts.
> 
> 
> What to do? As described in 22305c2a081b8b6 "python: Reduce strictness
> of pylint's duplicate-code check", this check could either be
> disabled completely, or we can increase min-similarity-lines config
> value.
> 
> I'd just disable it completely. Any thoughts?

I think it would make sense to treat tests differently from real
production code. In tests/, some duplication is bound to happen and
trying to unify things across test cases (which would mean moving to
iotests.py) often doesn't make sense. On the other hand, in python/ we
would probably want to unify duplicated code.
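
If we go that way, the iotests-specific configuration could simply opt out
of the check, roughly like below (a sketch; the alternative that ended up
being applied raises min-similarity-lines to 10 instead):

    # tests/qemu-iotests/pylintrc
    [MESSAGES CONTROL]
    disable=duplicate-code,
            ...             # keep the existing entries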

Kevin




Re: [PATCH 0/3] Make it possible to compile the x86 binaries without FDC

2024-04-29 Thread Kevin Wolf
[ Cc: qemu-block ]

Am 25.04.2024 um 20:43 hat Thomas Huth geschrieben:
> For downstream versions of QEMU, we'd like to be able to compile QEMU
> without the FDC code included (since it's not required for modern VMs
> anymore and the FDC code has rather a bad reputation, see the VENOM CVE).
> 
> The q35 machine can already be instantiated without FDC, but for being
> able to link a binary without the FDC code, the Kconfig file needs some
> tweaks and there are two spots in the pc code that directly call functions
> from the FDC code - those need to be disabled via #ifdefs.
> 
> The third patch changes the i440fx and isapc machine types so that
> they can work without the FDC device, too, in case it has not been
> compiled into the binary. It's marked as RFC since I assume that the
> FDC was originally a fix compononent of these motherboards, so I'm
> unsure whether we should allow the disablement there. OTOH, it seems
> to work fine, and the FDC is only disabled when it is not available
> in the binary, so I hope this patch is fine, too.
> 
> Thomas Huth (3):
>   hw/i386/pc: Allow to compile without CONFIG_FDC_ISA
>   hw/i386/Kconfig: Allow to compile Q35 without FDC_ISA
>   hw/i386: Add the possibility to use i440fx and isapc without FDC
> 
>  hw/i386/pc.c  | 13 +
>  hw/i386/pc_piix.c |  6 --
>  hw/i386/Kconfig   |  2 +-
>  3 files changed, 14 insertions(+), 7 deletions(-)
> 
> -- 
> 2.44.0
> 
> 




Re: [PATCH v4] linux-aio: add IO_CMD_FDSYNC command support

2024-04-24 Thread Kevin Wolf
Am 14.03.2024 um 12:16 hat Prasad Pandit geschrieben:
> From: Prasad Pandit 
> 
> Libaio defines IO_CMD_FDSYNC command to sync all outstanding
> asynchronous I/O operations, by flushing out file data to the
> disk storage.
> 
> Enable linux-aio to submit such aio request. This helps to
> reduce latency induced via pthread_create calls by
> thread-pool (aio=threads).
> 
> Signed-off-by: Prasad Pandit 

As we discussed in chat, it would be good to be more detailed about the
scenario that we're really concerned about here. The commit message
above sounds like submitting fdsync takes too long, but the real concern
seems to be about the effect that creating and destroying a thread has
on a vcpu by adding a TLB flush. Describing the mechanisms, the sequence
of operations that happen and the problem this causes in more detail
would make the commit message a lot more useful.

>  block/file-posix.c  |  7 +++
>  block/linux-aio.c   | 21 -
>  include/block/raw-aio.h |  1 +
>  3 files changed, 28 insertions(+), 1 deletion(-)
> 
> v4: New boolean field to indicate if aio_fdsync is available or not.
> It is set at file open time and checked before AIO_FLUSH call.
>   - https://lists.nongnu.org/archive/html/qemu-devel/2024-03/msg03701.html
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 35684f7e21..78a8cea03b 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -159,6 +159,7 @@ typedef struct BDRVRawState {
>  bool has_discard:1;
>  bool has_write_zeroes:1;
>  bool use_linux_aio:1;
> +bool has_laio_fdsync:1;
>  bool use_linux_io_uring:1;
>  int page_cache_inconsistent; /* errno from fdatasync failure */
>  bool has_fallocate;
> @@ -718,6 +719,7 @@ static int raw_open_common(BlockDriverState *bs, QDict 
> *options,
>  ret = -EINVAL;
>  goto fail;
>  }
> +s->has_laio_fdsync = laio_has_fdsync(s->fd);

I think this should be conditional on s->use_linux_aio. No point in
probing it if we'll never call it anyway.
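
Something along these lines (sketch only):

    if (s->use_linux_aio) {
        s->has_laio_fdsync = laio_has_fdsync(s->fd);
    }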

Kevin




[PATCH for-9.0?] usb-storage: Fix BlockConf defaults

2024-04-12 Thread Kevin Wolf
Commit 30896374 started to pass the full BlockConf from usb-storage to
scsi-disk, while previously only a few select properties would be
forwarded. This enables the user to set more properties, e.g. the block
size, that are actually taking effect.

However, now the calls to blkconf_apply_backend_options() and
blkconf_blocksizes() in usb_msd_storage_realize() that modify some of
these properties take effect, too, instead of being silently ignored.
This means at least that the block sizes get an unconditional default of
512 bytes before the configuration is passed to scsi-disk.

Before commit 30896374, the property wouldn't be set for scsi-disk and
therefore the device dependent defaults would apply - 512 for scsi-hd,
but 2048 for scsi-cd. The latter default has now become 512, too, which
makes at least Windows 11 installation fail when installing from
usb-storage.

Fix this by simply not calling these functions any more in usb-storage
and passing BlockConf on unmodified (except for the BlockBackend). The
same functions are called by the SCSI code anyway and it sets the right
defaults for the actual media type.

Fixes: 308963746169 ('scsi: Don't ignore most usb-storage properties')
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2260
Reported-by: Jonas Svensson
Signed-off-by: Kevin Wolf 
---
Considering this a candidate for 9.0 given that we're already having an
rc4, it's a regression from 8.2 and breaks installing Windows from USB

 hw/usb/dev-storage-classic.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/hw/usb/dev-storage-classic.c b/hw/usb/dev-storage-classic.c
index 50a3ad6285..6147387dc6 100644
--- a/hw/usb/dev-storage-classic.c
+++ b/hw/usb/dev-storage-classic.c
@@ -38,15 +38,6 @@ static void usb_msd_storage_realize(USBDevice *dev, Error **errp)
 return;
 }
 
-if (!blkconf_blocksizes(&s->conf, errp)) {
-return;
-}
-
-if (!blkconf_apply_backend_options(&s->conf, !blk_supports_write_perm(blk),
-   true, errp)) {
-return;
-}
-
 /*
  * Hack alert: this pretends to be a block device, but it's really
  * a SCSI bus that can serve only a single device, which it
-- 
2.44.0




Re: [PATCH-for-9.0 v2 0/3] hw/block/nand: Fix out-of-bound access in NAND block buffer

2024-04-09 Thread Kevin Wolf
Am 09.04.2024 um 15:59 hat Philippe Mathieu-Daudé geschrieben:
> Fix for https://gitlab.com/qemu-project/qemu/-/issues/1446
> 
> Since v1:
> - Addressed Kevin trivial suggestions (unsigned offset)

You already kept the Reviewed-by tags, but looks good to me.

Kevin




Re: [PATCH-for-9.0? 0/3] hw/block/nand: Fix out-of-bound access in NAND block buffer

2024-04-09 Thread Kevin Wolf
Am 08.04.2024 um 10:36 hat Philippe Mathieu-Daudé geschrieben:
> Fix for https://gitlab.com/qemu-project/qemu/-/issues/1446
> 
> Philippe Mathieu-Daudé (3):
>   hw/block/nand: Factor nand_load_iolen() method out
>   hw/block/nand: Have blk_load() return boolean indicating success
>   hw/block/nand: Fix out-of-bound access in NAND block buffer

As we're short on time for 9.0:

Reviewed-by: Kevin Wolf 

But it feels to me like this device could use some more cleanup to make
the code more robust.

Kevin




Re: [PATCH-for-9.0? 1/3] hw/block/nand: Factor nand_load_iolen() method out

2024-04-09 Thread Kevin Wolf
Am 08.04.2024 um 10:36 hat Philippe Mathieu-Daudé geschrieben:
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  hw/block/nand.c | 32 +++-
>  1 file changed, 19 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/block/nand.c b/hw/block/nand.c
> index d1435f2207..6fa9038bb5 100644
> --- a/hw/block/nand.c
> +++ b/hw/block/nand.c
> @@ -243,9 +243,25 @@ static inline void nand_pushio_byte(NANDFlashState *s, 
> uint8_t value)
>  }
>  }
>  
> +/*
> + * nand_load_block: Load block containing (s->addr + @offset).
> + * Returns length of data available at @offset in this block.
> + */
> +static int nand_load_block(NANDFlashState *s, int offset)
> +{
> +int iolen;
> +
> +s->blk_load(s, s->addr, offset);

Wouldn't it make more sense for @offset to be unsigned, like in
nand_command() before this patch?

I think the values are guaranteed to be small enough to fit in either
signed or unsigned, but we never check for < 0 (maybe that should be
done in your patch to s->blk_load() anyway).

> +iolen = (1 << s->page_shift) - offset;

This is not new, but I'm confused. Can this legitimately be negative?
offset seems to be limited to (1 << s->addr_shift) + s->offset in
practice, but addr_shift > page_shift for NAND_PAGE_SIZE == 2048.

After patch 3, offset is implicitly limited to NAND_PAGE_SIZE + OOB_SIZE
because we return early if s->blk_load() fails. I wonder if it wouldn't
be better to catch this in the callers already and only assert in
s->blk_load().

Anyway, after patch 3 iolen can only temporarily become negative here...

> +if (s->gnd) {
> +iolen += 1 << s->oob_shift;

...before it becomes non-negative again here.

> +}
> +return iolen;
> +}

So none of this makes the code technically incorrect after applying the
full series, but let someone modify it who doesn't understand these
intricacies and I think chances are that it will fall apart.
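
As a sketch of how the bound could be made explicit in one place rather than
relying on every caller (the limit is taken from the discussion above, not
from the series itself):

    static int nand_load_block(NANDFlashState *s, unsigned offset)
    {
        int iolen;

        /* reject offsets beyond the page plus out-of-band area up front */
        if (offset > NAND_PAGE_SIZE + OOB_SIZE) {
            return 0;
        }

        s->blk_load(s, s->addr, offset);

        iolen = (1 << s->page_shift) - offset;
        if (s->gnd) {
            iolen += 1 << s->oob_shift;
        }
        return iolen;
    }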

Kevin




Re: [PATCH v8] arm/kvm: Enable support for KVM_ARM_VCPU_PMU_V3_FILTER

2024-04-02 Thread Kevin Wolf
Am 29.03.2024 um 04:45 hat Shaoqin Huang geschrieben:
> Hi Daniel,
> 
> On 3/25/24 16:55, Daniel P. Berrangé wrote:
> > On Mon, Mar 25, 2024 at 01:35:58PM +0800, Shaoqin Huang wrote:
> > > Hi Daniel,
> > > 
> > > Thanks for your reviewing. I see your comments in the v7.
> > > 
> > > I have some doubts about what you said about the QAPI. Do you want me to
> > > convert the current design into the QAPI parsing like the
> > > IOThreadVirtQueueMapping? And we need to add new json definition in the
> > > qapi/ directory?
> 
> I have defined the QAPI for kvm-pmu-filter like below:
> 
> +##
> +# @FilterAction:
> +#
> +# The Filter Action
> +#
> +# @a: Allow
> +#
> +# @d: Disallow
> +#
> +# Since: 9.0
> +##
> +{ 'enum': 'FilterAction',
> +  'data': [ 'a', 'd' ] }
> +
> +##
> +# @SingleFilter:
> +#
> +# Lazy
> +#
> +# @action: the action
> +#
> +# @start: the start
> +#
> +# @end: the end
> +#
> +# Since: 9.0
> +##
> +
> +{ 'struct': 'SingleFilter',
> + 'data': { 'action': 'FilterAction', 'start': 'int', 'end': 'int' } }
> +
> +##
> +# @KVMPMUFilter:
> +#
> +# Lazy
> +#
> +# @filter: the filter
> +#
> +# Since: 9.0
> +##
> +
> +{ 'struct': 'KVMPMUFilter',
> +  'data': { 'filter': ['SingleFilter'] }}
> 
> And I guess I can use it by adding code like below:
> 
> --- a/hw/core/qdev-properties-system.c
> +++ b/hw/core/qdev-properties-system.c
> @@ -1206,3 +1206,35 @@ const PropertyInfo qdev_prop_iothread_vq_mapping_list
> = {
>  .set = set_iothread_vq_mapping_list,
>  .release = release_iothread_vq_mapping_list,
>  };
> +
> +/* --- kvm-pmu-filter ---*/
> +
> +static void get_kvm_pmu_filter(Object *obj, Visitor *v,
> +const char *name, void *opaque, Error **errp)
> +{
> +KVMPMUFilter **prop_ptr = object_field_prop_ptr(obj, opaque);
> +
> +visit_type_KVMPMUFilter(v, name, prop_ptr, errp);
> +}
> +
> +static void set_kvm_pmu_filter(Object *obj, Visitor *v,
> +const char *name, void *opaque, Error **errp)
> +{
> +KVMPMUFilter **prop_ptr = object_field_prop_ptr(obj, opaque);
> +KVMPMUFilter *list;
> +
> +printf("running the %s\n", __func__);
+if (!visit_type_KVMPMUFilter(v, name, &list, errp)) {
> +return;
> +}
> +
> +printf("The name is %s\n", name);
> +*prop_ptr = list;
> +}
> +
> +const PropertyInfo qdev_prop_kvm_pmu_filter = {
> +.name = "KVMPMUFilter",
> +.description = "der der",
> +.get = get_kvm_pmu_filter,
> +.set = set_kvm_pmu_filter,
> +};
> 
> +#define DEFINE_PROP_KVM_PMU_FILTER(_name, _state, _field) \
> +DEFINE_PROP(_name, _state, _field, qdev_prop_kvm_pmu_filter, \
> +KVMPMUFilter *)
> 
> --- a/target/arm/cpu.c
> +++ b/target/arm/cpu.c
> @@ -2439,6 +2441,7 @@ static Property arm_cpu_properties[] = {
>  mp_affinity, ARM64_AFFINITY_INVALID),
>  DEFINE_PROP_INT32("node-id", ARMCPU, node_id, CPU_UNSET_NUMA_NODE_ID),
>  DEFINE_PROP_INT32("core-count", ARMCPU, core_count, -1),
> +DEFINE_PROP_KVM_PMU_FILTER("kvm-pmu-filter", ARMCPU, kvm_pmu_filter),
>  DEFINE_PROP_END_OF_LIST()
>  };
> 
> And I guess I can use the new json format input like below:
> 
> qemu-system-aarch64 \
>   -cpu host, '{"filter": [{"action": "a", "start": 0x10, "end": "0x11"}]}'
> 
> But it doesn't work. It seems like because the -cpu option doesn't
> support json format parameter.
> 
> Maybe I'm wrong. So I want to double check with if the -cpu option
> support json format nowadays?

As far as I can see, -cpu doesn't support JSON yet. But even if it did,
your command line would be invalid because the 'host,' part isn't JSON.
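
For comparison, options that do accept JSON expect the whole argument to be
a single JSON object, for example:

    -blockdev '{"driver":"file","node-name":"f0","filename":"disk.img"}'

so even a JSON-capable -cpu would have to take the model inside the object
rather than as a 'host,' prefix.
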

> If the -cpu option doesn't support json format, how I can use the QAPI
> for kvm-pmu-filter property?

This would probably mean QAPIfying all CPUs first, which sounds like a
major effort.

Kevin




Re: [PATCH] block: Remove unnecessary NULL check in bdrv_pad_request()

2024-04-02 Thread Kevin Wolf
Am 02.04.2024 um 12:53 hat Philippe Mathieu-Daudé geschrieben:
> On 27/3/24 20:27, Kevin Wolf wrote:
> > Coverity complains that the check introduced in commit 3f934817 suggests
> > that qiov could be NULL and we dereference it before reaching the check.
> > In fact, all of the callers pass a non-NULL pointer, so just remove the
> > misleading check.
> > 
> > Resolves: Coverity CID 1542668
> > Signed-off-by: Kevin Wolf 
> > ---
> >   block/io.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Since I'm not seeing other block related patch for 9.0 and
> I'm preparing a pull request, I'm queuing this one.

Thanks, Phil. I didn't send a pull request because I didn't have
anything else and silencing a Coverity false positive didn't seem urgent
for 9.0, but it certainly doesn't hurt either.

Kevin




Re: [PATCH v1] coroutine: avoid inserting duplicate coroutine to co_queue_wakeup

2024-03-28 Thread Kevin Wolf
Am 27.03.2024 um 23:13 hat Eric Blake geschrieben:
> On Mon, Mar 25, 2024 at 11:50:41AM -0400, Stefan Hajnoczi wrote:
> > On Mon, Mar 25, 2024 at 05:18:50PM +0800, zhuyangyang wrote:
> > > If g_main_loop_run()/aio_poll() is called in the coroutine context,
> > > the pending coroutine may be woken up repeatedly, and the co_queue_wakeup
> > > may be disordered.
> > 
> > aio_poll() must not be called from coroutine context:
> > 
> >   bool no_coroutine_fn aio_poll(AioContext *ctx, bool blocking);
> >^^^
> > 
> > Coroutines are not supposed to block. Instead, they should yield.
> > 
> > > When the poll() syscall exited in g_main_loop_run()/aio_poll(), it means
> > > some listened events is completed. Therefore, the completion callback
> > > function is dispatched.
> > > 
> > > If this callback function needs to invoke aio_co_enter(), it will only
> > > wake up the coroutine (because we are already in coroutine context),
> > > which may cause that the data on this listening event_fd/socket_fd
> > > is not read/cleared. When the next poll () exits, it will be woken up 
> > > again
> > > and inserted into the wakeup queue again.
> > > 
> > > For example, if TLS is enabled in NBD, the server will call 
> > > g_main_loop_run()
> > > in the coroutine, and repeatedly wake up the io_read event on a socket.
> > > The call stack is as follows:
> > > 
> > > aio_co_enter()
> > > aio_co_wake()
> > > qio_channel_restart_read()
> > > aio_dispatch_handler()
> > > aio_dispatch_handlers()
> > > aio_dispatch()
> > > aio_ctx_dispatch()
> > > g_main_context_dispatch()
> > > g_main_loop_run()
> > > nbd_negotiate_handle_starttls()
> > 
> > This code does not look like it was designed to run in coroutine
> > context. Two options:
> > 
> > 1. Don't run it in coroutine context (e.g. use a BH instead). This
> >avoids blocking the coroutine but calling g_main_loop_run() is still
> >ugly, in my opinion.
> > 
> > 2. Get rid of data.loop and use coroutine APIs instead:
> > 
> >while (!data.complete) {
> >qemu_coroutine_yield();
> >}
> > 
> >and update nbd_tls_handshake() to call aio_co_wake(data->co) instead
> >of g_main_loop_quit(data->loop).
> > 
> >This requires auditing the code to check whether the event loop might
> >invoke something that interferes with
> >nbd_negotiate_handle_starttls(). Typically this means monitor
> >commands or fd activity that could change the state of this
> >connection while it is yielded. This is where the real work is but
> >hopefully it will not be that hard to figure out.
> 
> I agree that 1) is ugly.  So far, I've added some temporary
> assertions, to see when qio_channel_tls_handshake is reached; it looks
> like nbd/server.c is calling it from within coroutine context, but
> nbd/client.c is calling it from the main loop.  The qio code handles
> either, but now I'm stuck in trying to get client.c into having the
> right coroutine context; the TLS handshake is done before the usual
> BlockDriverState *bs object is available, so I'm not even sure what
> aio context, if any, I should be using.  But on my first try, I'm
> hitting:
> 
> qemu-img: ../util/async.c:707: aio_co_enter: Assertion `self != co' failed.
> 
> so I obviously got something wrong.

Hard to tell without seeing the code, but it looks like you're trying to
wake up the coroutine while you're still executing in the context of the
same coroutine.

It looks like the documentation of qio_channel_tls_handshake() is wrong
and the function can return and call the callback immediately without
dropping out of coroutine context.

A rather heavy-handed, but obvious approach would be moving the
qio_channel_tls_handshake() into a BH, then you know you'll always be
outside of coroutine context in the callback.
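
To make that concrete, here is a rough, uncompiled sketch of the BH variant
as a fragment of the nbd code. nbd_tls_handshake() and NBDTLSHandshakeData
are the existing helpers; StartTLSData and start_tls_bh() are made-up names
for illustration, not an actual patch:

typedef struct StartTLSData {
    QIOChannelTLS *tioc;                  /* channel to run the handshake on */
    struct NBDTLSHandshakeData *hdata;    /* completion state the caller waits on */
} StartTLSData;

static void start_tls_bh(void *opaque)
{
    StartTLSData *s = opaque;

    /* Runs as a bottom half, i.e. outside of coroutine context, so even an
     * immediate (synchronous) invocation of the completion callback cannot
     * happen in the negotiating coroutine. */
    qio_channel_tls_handshake(s->tioc, nbd_tls_handshake, s->hdata,
                              NULL, NULL);
    g_free(s);
}

/* ...and in the negotiating code, instead of calling
 * qio_channel_tls_handshake() directly:
 *
 *     aio_bh_schedule_oneshot(qemu_get_aio_context(), start_tls_bh, s);
 */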

But maybe it would be enough to just check if the coroutine isn't
already active:

    if (!qemu_coroutine_entered(co)) {
        aio_co_wake(co);
    }

> It may be possible to use block_gen_c to turn nbd_receive_negotiate
> and nbd_receive_export_list into co_wrapper functions, if I can audit
> that all of their code goes through qio (and is therefore
> coroutine-safe), but that work is still ongoing.

If it's possible, I think that would be nicer in the code and would also
reduce the time the guest is blocked while we're creating a new NBD
connection.

*reads code*

Hm... Actually, one thing I was completely unaware of is that all of
this is running in a separate thread, so maybe the existing synchronous
code already doesn't block the guest. nbd_co_establish_connection()
starts this thread. The thread doesn't have an AioContext, so anything
that involves scheduling something in an AioContext (including BHs and
waking up coroutines) will result in code running in a different thread.

I'm not sure why a thread is used in the first place (does it do
something that coroutines can't do?) and if running parts of the code in
a different thread would be a problem, but we should 

[PATCH] block: Remove unnecessary NULL check in bdrv_pad_request()

2024-03-27 Thread Kevin Wolf
Coverity complains that the check introduced in commit 3f934817 suggests
that qiov could be NULL and we dereference it before reaching the check.
In fact, all of the callers pass a non-NULL pointer, so just remove the
misleading check.

Resolves: Coverity CID 1542668
Signed-off-by: Kevin Wolf 
---
 block/io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/io.c b/block/io.c
index 395bea3bac..7217cf811b 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1730,7 +1730,7 @@ static int bdrv_pad_request(BlockDriverState *bs,
  * For prefetching in stream_populate(), no qiov is passed along, because
  * only copy-on-read matters.
  */
-if (qiov && *qiov) {
+if (*qiov) {
 sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
   &sliced_head, &sliced_tail,
   &sliced_niov);
-- 
2.44.0




[PULL 3/6] block/io: accept NULL qiov in bdrv_pad_request

2024-03-26 Thread Kevin Wolf
From: Stefan Reiter 

Some operations, e.g. block-stream, perform reads while discarding the
results (only copy-on-read matters). In this case, they will pass NULL
as the target QEMUIOVector, which will however trip bdrv_pad_request,
since it wants to extend its passed vector. In particular, this is the
case for the blk_co_preadv() call in stream_populate().

If there is no qiov, no operation can be done with it, but the bytes
and offset still need to be updated, so the subsequent aligned read
will actually be aligned and not run into an assertion failure.

In particular, this can happen when the request alignment of the top
node is larger than the allocated part of the bottom node, in which
case padding becomes necessary. For example:

> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": "node0", "node-name": "node1" } }
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node1" } }
> EOF

Originally-by: Stefan Reiter 
Signed-off-by: Thomas Lamprecht 
[FE: do update bytes and offset in any case
 add reproducer to commit message]
Signed-off-by: Fiona Ebner 
Message-ID: <20240322095009.346989-2-f.eb...@proxmox.com>
Reviewed-by: Kevin Wolf 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Kevin Wolf 
---
 block/io.c | 33 -
 1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/block/io.c b/block/io.c
index 33150c0359..395bea3bac 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1726,22 +1726,29 @@ static int bdrv_pad_request(BlockDriverState *bs,
 return 0;
 }
 
-sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
-  &sliced_head, &sliced_tail,
-  &sliced_niov);
-
-/* Guaranteed by bdrv_check_request32() */
-assert(*bytes <= SIZE_MAX);
-ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
-  sliced_head, *bytes);
-if (ret < 0) {
-bdrv_padding_finalize(pad);
-return ret;
+/*
+ * For prefetching in stream_populate(), no qiov is passed along, because
+ * only copy-on-read matters.
+ */
+if (qiov && *qiov) {
+sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
+  &sliced_head, &sliced_tail,
+  &sliced_niov);
+
+/* Guaranteed by bdrv_check_request32() */
+assert(*bytes <= SIZE_MAX);
+ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
+  sliced_head, *bytes);
+if (ret < 0) {
+bdrv_padding_finalize(pad);
+return ret;
+}
+*qiov = &pad->local_qiov;
+*qiov_offset = 0;
 }
+
 *bytes += pad->head + pad->tail;
 *offset -= pad->head;
-*qiov = &pad->local_qiov;
-*qiov_offset = 0;
 if (padded) {
 *padded = true;
 }
-- 
2.44.0




[PULL 6/6] iotests: add test for stream job with an unaligned prefetch read

2024-03-26 Thread Kevin Wolf
From: Fiona Ebner 

Previously, bdrv_pad_request() could not deal with a NULL qiov when
a read needed to be aligned. During prefetch, a stream job will pass a
NULL qiov. Add a test case to cover this scenario.

By accident, also covers a previous race during shutdown, where block
graph changes during iteration in bdrv_flush_all() could lead to
unreferencing the wrong block driver state and an assertion failure
later.

Signed-off-by: Fiona Ebner 
Message-ID: <20240322095009.346989-5-f.eb...@proxmox.com>
Reviewed-by: Kevin Wolf 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Kevin Wolf 
---
 .../tests/stream-unaligned-prefetch   | 86 +++
 .../tests/stream-unaligned-prefetch.out   |  5 ++
 2 files changed, 91 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
 create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out

diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch 
b/tests/qemu-iotests/tests/stream-unaligned-prefetch
new file mode 100755
index 00..546db1d369
--- /dev/null
+++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch
@@ -0,0 +1,86 @@
+#!/usr/bin/env python3
+# group: rw quick
+#
+# Test what happens when a stream job does an unaligned prefetch read
+# which requires padding while having a NULL qiov.
+#
+# Copyright (C) Proxmox Server Solutions GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os
+import iotests
+from iotests import imgfmt, qemu_img_create, qemu_io, QMPTestCase
+
+image_size = 1 * 1024 * 1024
+cluster_size = 64 * 1024
+base = os.path.join(iotests.test_dir, 'base.img')
+top = os.path.join(iotests.test_dir, 'top.img')
+
+class TestStreamUnalignedPrefetch(QMPTestCase):
+def setUp(self) -> None:
+"""
+Create two images:
+- base image {base} with {cluster_size // 2} bytes allocated
+- top image {top} without any data allocated and coarser
+  cluster size
+
+Attach a compress filter for the top image, because that
+requires that the request alignment is the top image's cluster
+size.
+"""
+qemu_img_create('-f', imgfmt,
+'-o', 'cluster_size={}'.format(cluster_size // 2),
+base, str(image_size))
+qemu_io('-c', f'write 0 {cluster_size // 2}', base)
+qemu_img_create('-f', imgfmt,
+'-o', 'cluster_size={}'.format(cluster_size),
+top, str(image_size))
+
+self.vm = iotests.VM()
+self.vm.add_blockdev(self.vm.qmp_to_opts({
+'driver': imgfmt,
+'node-name': 'base',
+'file': {
+'driver': 'file',
+'filename': base
+}
+}))
+self.vm.add_blockdev(self.vm.qmp_to_opts({
+'driver': 'compress',
+'node-name': 'compress-top',
+'file': {
+'driver': imgfmt,
+'node-name': 'top',
+'file': {
+'driver': 'file',
+'filename': top
+},
+'backing': 'base'
+}
+}))
+self.vm.launch()
+
+def tearDown(self) -> None:
+self.vm.shutdown()
+os.remove(top)
+os.remove(base)
+
+def test_stream_unaligned_prefetch(self) -> None:
+self.vm.cmd('block-stream', job_id='stream', device='compress-top')
+
+
+if __name__ == '__main__':
+iotests.main(supported_fmts=['qcow2'], supported_protocols=['file'])
diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch.out 
b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out
new file mode 100644
index 00..ae1213e6f8
--- /dev/null
+++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out
@@ -0,0 +1,5 @@
+.
+--
+Ran 1 tests
+
+OK
-- 
2.44.0




[PULL 2/6] vdpa-dev: Fix initialisation order to restore VDUSE compatibility

2024-03-26 Thread Kevin Wolf
VDUSE requires that virtqueues are first enabled before the DRIVER_OK
status flag is set; with the current API of the kernel module, it is
impossible to enable the opposite order in our block export code because
userspace is not notified when a virtqueue is enabled.

This requirement also matches the normal initialisation order as done by
the generic vhost code in QEMU. However, commit 6c482547 accidentally
changed the order for vdpa-dev and broke access to VDUSE devices with
this.

This changes vdpa-dev to use the normal order again and use the standard
vhost callback .vhost_set_vring_enable for this. VDUSE devices can be
used with vdpa-dev again after this fix.

vhost_net intentionally avoided enabling the vrings for vdpa and does
this manually later while it does enable them for other vhost backends.
Reflect this in the vhost_net code and return early for vdpa, so that
the behaviour doesn't change for this device.

Cc: qemu-sta...@nongnu.org
Fixes: 6c4825476a43 ('vdpa: move vhost_vdpa_set_vring_ready to the caller')
Signed-off-by: Kevin Wolf 
Message-ID: <20240315155949.86066-1-kw...@redhat.com>
Reviewed-by: Eugenio Pérez 
Reviewed-by: Stefano Garzarella 
Signed-off-by: Kevin Wolf 
---
 hw/net/vhost_net.c | 10 ++
 hw/virtio/vdpa-dev.c   |  5 +
 hw/virtio/vhost-vdpa.c | 29 ++---
 hw/virtio/vhost.c  |  8 +++-
 hw/virtio/trace-events |  2 +-
 5 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index e8e1661646..fd1a93701a 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -541,6 +541,16 @@ int vhost_set_vring_enable(NetClientState *nc, int enable)
 VHostNetState *net = get_vhost_net(nc);
 const VhostOps *vhost_ops = net->dev.vhost_ops;
 
+/*
+ * vhost-vdpa network devices need to enable dataplane virtqueues after
+ * DRIVER_OK, so they can recover device state before starting dataplane.
+ * Because of that, we don't enable virtqueues here and leave it to
+ * net/vhost-vdpa.c.
+ */
+if (nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
+return 0;
+}
+
 nc->vring_enable = enable;
 
 if (vhost_ops && vhost_ops->vhost_set_vring_enable) {
diff --git a/hw/virtio/vdpa-dev.c b/hw/virtio/vdpa-dev.c
index eb9ecea83b..13e87f06f6 100644
--- a/hw/virtio/vdpa-dev.c
+++ b/hw/virtio/vdpa-dev.c
@@ -253,14 +253,11 @@ static int vhost_vdpa_device_start(VirtIODevice *vdev, 
Error **errp)
 
 s->dev.acked_features = vdev->guest_features;
 
-ret = vhost_dev_start(>dev, vdev, false);
+ret = vhost_dev_start(>dev, vdev, true);
 if (ret < 0) {
 error_setg_errno(errp, -ret, "Error starting vhost");
 goto err_guest_notifiers;
 }
-for (i = 0; i < s->dev.nvqs; ++i) {
-vhost_vdpa_set_vring_ready(>vdpa, i);
-}
 s->started = true;
 
 /*
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 3bcd05cc22..e827b9175f 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -896,19 +896,41 @@ static int vhost_vdpa_get_vq_index(struct vhost_dev *dev, 
int idx)
 return idx;
 }
 
-int vhost_vdpa_set_vring_ready(struct vhost_vdpa *v, unsigned idx)
+static int vhost_vdpa_set_vring_enable_one(struct vhost_vdpa *v, unsigned idx,
+   int enable)
 {
 struct vhost_dev *dev = v->dev;
 struct vhost_vring_state state = {
 .index = idx,
-.num = 1,
+.num = enable,
 };
 int r = vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
 
-trace_vhost_vdpa_set_vring_ready(dev, idx, r);
+trace_vhost_vdpa_set_vring_enable_one(dev, idx, enable, r);
 return r;
 }
 
+static int vhost_vdpa_set_vring_enable(struct vhost_dev *dev, int enable)
+{
+struct vhost_vdpa *v = dev->opaque;
+unsigned int i;
+int ret;
+
+for (i = 0; i < dev->nvqs; ++i) {
+ret = vhost_vdpa_set_vring_enable_one(v, i, enable);
+if (ret < 0) {
+return ret;
+}
+}
+
+return 0;
+}
+
+int vhost_vdpa_set_vring_ready(struct vhost_vdpa *v, unsigned idx)
+{
+return vhost_vdpa_set_vring_enable_one(v, idx, 1);
+}
+
 static int vhost_vdpa_set_config_call(struct vhost_dev *dev,
int fd)
 {
@@ -1536,6 +1558,7 @@ const VhostOps vdpa_ops = {
 .vhost_set_features = vhost_vdpa_set_features,
 .vhost_reset_device = vhost_vdpa_reset_device,
 .vhost_get_vq_index = vhost_vdpa_get_vq_index,
+.vhost_set_vring_enable = vhost_vdpa_set_vring_enable,
 .vhost_get_config  = vhost_vdpa_get_config,
 .vhost_set_config = vhost_vdpa_set_config,
 .vhost_requires_shm_log = NULL,
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2e4e040db8..f50180e60e 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1984,7 +1984,13 @@ static int vhost_dev_

[PULL 4/6] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-03-26 Thread Kevin Wolf
From: Fiona Ebner 

The old_bs variable in bdrv_next() is currently determined by looking
at the old block backend. However, if the block graph changes before
the next bdrv_next() call, it might be that the associated BDS is not
the same that was referenced previously. In that case, the wrong BDS
is unreferenced, leading to an assertion failure later:

> bdrv_unref: Assertion `bs->refcnt > 0' failed.

In particular, this can happen in the context of bdrv_flush_all(),
when polling for bdrv_co_flush() in the generated co-wrapper leads to
a graph change (for example with a stream block job [0]).

A racy reproducer:

> #!/bin/bash
> rm -f /tmp/backing.qcow2
> rm -f /tmp/top.qcow2
> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node0" } }
> {"execute": "quit"}
> EOF

[0]:

> #0  bdrv_replace_child_tran (child=..., new_bs=..., tran=...)
> #1  bdrv_replace_node_noperm (from=..., to=..., auto_skip=..., tran=..., 
> errp=...)
> #2  bdrv_replace_node_common (from=..., to=..., auto_skip=..., 
> detach_subchain=..., errp=...)
> #3  bdrv_drop_filter (bs=..., errp=...)
> #4  bdrv_cor_filter_drop (cor_filter_bs=...)
> #5  stream_prepare (job=...)
> #6  job_prepare_locked (job=...)
> #7  job_txn_apply_locked (fn=..., job=...)
> #8  job_do_finalize_locked (job=...)
> #9  job_exit (opaque=...)
> #10 aio_bh_poll (ctx=...)
> #11 aio_poll (ctx=..., blocking=...)
> #12 bdrv_poll_co (s=...)
> #13 bdrv_flush (bs=...)
> #14 bdrv_flush_all ()
> #15 do_vm_stop (state=..., send_stop=...)
> #16 vm_shutdown ()

Signed-off-by: Fiona Ebner 
Message-ID: <20240322095009.346989-3-f.eb...@proxmox.com>
Reviewed-by: Kevin Wolf 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Kevin Wolf 
---
 block/block-backend.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 9c4de79e6b..28af1eb17a 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -599,14 +599,14 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
 /* Must be called from the main loop */
 assert(qemu_get_current_aio_context() == qemu_get_aio_context());
 
+old_bs = it->bs;
+
 /* First, return all root nodes of BlockBackends. In order to avoid
  * returning a BDS twice when multiple BBs refer to it, we only return it
  * if the BB is the first one in the parent list of the BDS. */
 if (it->phase == BDRV_NEXT_BACKEND_ROOTS) {
 BlockBackend *old_blk = it->blk;
 
-old_bs = old_blk ? blk_bs(old_blk) : NULL;
-
 do {
 it->blk = blk_all_next(it->blk);
 bs = it->blk ? blk_bs(it->blk) : NULL;
@@ -620,11 +620,10 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
 if (bs) {
 bdrv_ref(bs);
 bdrv_unref(old_bs);
+it->bs = bs;
 return bs;
 }
 it->phase = BDRV_NEXT_MONITOR_OWNED;
-} else {
-old_bs = it->bs;
 }
 
 /* Then return the monitor-owned BDSes without a BB attached. Ignore all
-- 
2.44.0




[PULL 5/6] block-backend: fix edge case in bdrv_next_cleanup() where BDS associated to BB changes

2024-03-26 Thread Kevin Wolf
From: Fiona Ebner 

Same rationale as for commit "block-backend: fix edge case in
bdrv_next() where BDS associated to BB changes". The block graph might
change between the bdrv_next() call and the bdrv_next_cleanup() call,
so it could be that the associated BDS is not the same that was
referenced previously anymore. Instead, rely on bdrv_next() to set
it->bs to the BDS it referenced and unreference that one in any case.

Signed-off-by: Fiona Ebner 
Message-ID: <20240322095009.346989-4-f.eb...@proxmox.com>
Reviewed-by: Kevin Wolf 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Kevin Wolf 
---
 block/block-backend.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 28af1eb17a..db6f9b92a3 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -663,13 +663,10 @@ void bdrv_next_cleanup(BdrvNextIterator *it)
 /* Must be called from the main loop */
 assert(qemu_get_current_aio_context() == qemu_get_aio_context());
 
-if (it->phase == BDRV_NEXT_BACKEND_ROOTS) {
-if (it->blk) {
-bdrv_unref(blk_bs(it->blk));
-blk_unref(it->blk);
-}
-} else {
-bdrv_unref(it->bs);
+bdrv_unref(it->bs);
+
+if (it->phase == BDRV_NEXT_BACKEND_ROOTS && it->blk) {
+blk_unref(it->blk);
 }
 
 bdrv_next_reset(it);
-- 
2.44.0




[PULL 1/6] tests/qemu-iotests: Test 157 and 227 require virtio-blk

2024-03-26 Thread Kevin Wolf
From: Thomas Huth 

Tests 157 and 227 use the virtio-blk device, so we have to mark these
tests accordingly to be skipped if this device is not available (e.g.
when running the tests with qemu-system-avr only).

Signed-off-by: Thomas Huth 
Message-ID: <20240325154737.1305063-1-th...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/157 | 2 ++
 tests/qemu-iotests/227 | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/tests/qemu-iotests/157 b/tests/qemu-iotests/157
index 0dc9ba68d2..aa2ebbfb4b 100755
--- a/tests/qemu-iotests/157
+++ b/tests/qemu-iotests/157
@@ -40,6 +40,8 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 _supported_fmt generic
 _supported_proto file
 
+_require_devices virtio-blk
+
 do_run_qemu()
 {
 (
diff --git a/tests/qemu-iotests/227 b/tests/qemu-iotests/227
index 7e45a47ac6..eddaad679e 100755
--- a/tests/qemu-iotests/227
+++ b/tests/qemu-iotests/227
@@ -40,6 +40,8 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 _supported_fmt generic
 _supported_proto file
 
+_require_devices virtio-blk
+
 do_run_qemu()
 {
 echo Testing: "$@"
-- 
2.44.0




[PULL 0/6] Block layer patches

2024-03-26 Thread Kevin Wolf
The following changes since commit 096ae430a7b5a704af4cd94dca7200d6cb069991:

  Merge tag 'pull-qapi-2024-03-26' of https://repo.or.cz/qemu/armbru into 
staging (2024-03-26 09:50:21 +)

are available in the Git repository at:

  https://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 12d7b3bbdcededd3b695501d8d247239d769:

  iotests: add test for stream job with an unaligned prefetch read (2024-03-26 
14:21:26 +0100)


Block layer patches

- Fix crash with unaligned prefetch requests (e.g. in stream jobs)
- vdpa-dev: Fix initialisation order to restore VDUSE compatibility
- iotests fixes


Fiona Ebner (3):
  block-backend: fix edge case in bdrv_next() where BDS associated to BB 
changes
  block-backend: fix edge case in bdrv_next_cleanup() where BDS associated 
to BB changes
  iotests: add test for stream job with an unaligned prefetch read

Kevin Wolf (1):
  vdpa-dev: Fix initialisation order to restore VDUSE compatibility

Stefan Reiter (1):
  block/io: accept NULL qiov in bdrv_pad_request

Thomas Huth (1):
  tests/qemu-iotests: Test 157 and 227 require virtio-blk

 block/block-backend.c  | 18 ++---
 block/io.c | 33 +
 hw/net/vhost_net.c | 10 +++
 hw/virtio/vdpa-dev.c   |  5 +-
 hw/virtio/vhost-vdpa.c | 29 +++-
 hw/virtio/vhost.c  |  8 +-
 hw/virtio/trace-events |  2 +-
 tests/qemu-iotests/157 |  2 +
 tests/qemu-iotests/227 |  2 +
 tests/qemu-iotests/tests/stream-unaligned-prefetch | 86 ++
 .../tests/stream-unaligned-prefetch.out|  5 ++
 11 files changed, 167 insertions(+), 33 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
 create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out




Re: [PATCH v3 0/4] fix two edge cases related to stream block jobs

2024-03-26 Thread Kevin Wolf
Am 25.03.2024 um 21:11 hat Stefan Hajnoczi geschrieben:
> On Fri, Mar 22, 2024 at 10:50:05AM +0100, Fiona Ebner wrote:
> > Changes in v3:
> > * Also deal with edge case in bdrv_next_cleanup(). Haven't run
> >   into an actual issue there, but at least the caller in
> >   migration/block.c uses bdrv_nb_sectors() which, while not a
> >   coroutine wrapper itself (it's written manually), may call
> >   bdrv_refresh_total_sectors(), which is a generated coroutine
> >   wrapper, so AFAIU, the block graph can change during that call.
> >   And even without that, it's just better to be more consistent
> >   with bdrv_next().
> > 
> > Changes in v2:
> > * Ran into another issue while writing the IO test Stefan wanted
> >   to have (good call :)), so include a fix for that and add the
> >   test. I didn't notice during manual testing, because I hadn't
> >   used a scripted QMP 'quit', so there was no race.
> > 
> > Fiona Ebner (3):
> >   block-backend: fix edge case in bdrv_next() where BDS associated to BB
> > changes
> >   block-backend: fix edge case in bdrv_next_cleanup() where BDS
> > associated to BB changes
> >   iotests: add test for stream job with an unaligned prefetch read
> > 
> > Stefan Reiter (1):
> >   block/io: accept NULL qiov in bdrv_pad_request
> > 
> >  block/block-backend.c | 18 ++--
> >  block/io.c| 31 ---
> >  .../tests/stream-unaligned-prefetch   | 86 +++
> >  .../tests/stream-unaligned-prefetch.out   |  5 ++
> >  4 files changed, 117 insertions(+), 23 deletions(-)
> >  create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
> >  create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out
> 
> Looks good to me. I will wait until Thursday before merging in case
> Hanna, Vladimir, or Kevin have comments. Thanks!

Let's not delay it to -rc2. If something turns out to be wrong with it,
we can still revert it, but I think getting fixes in earlier is better
during freeze.

Thanks, applied to the block branch.

Kevin


signature.asc
Description: PGP signature


Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-03-26 Thread Kevin Wolf
Am 22.03.2024 um 10:50 hat Fiona Ebner geschrieben:
> The old_bs variable in bdrv_next() is currently determined by looking
> at the old block backend. However, if the block graph changes before
> the next bdrv_next() call, it might be that the associated BDS is not
> the same that was referenced previously. In that case, the wrong BDS
> is unreferenced, leading to an assertion failure later:
> 
> > bdrv_unref: Assertion `bs->refcnt > 0' failed.

Your change makes sense, but in theory it shouldn't make a difference.
The real bug is in the caller, you can't allow graph modifications while
iterating the list of nodes. Even if it doesn't crash (like after your
patch), you don't have any guarantee that you will have seen every node
that exists at the end - and maybe not even that you don't get the
same node twice.

> In particular, this can happen in the context of bdrv_flush_all(),
> when polling for bdrv_co_flush() in the generated co-wrapper leads to
> a graph change (for example with a stream block job [0]).

The whole locking around this case is a bit tricky and would deserve
some cleanup.

The basic rule for bdrv_next() callers is that they need to hold the
graph reader lock as long as they are iterating the graph, otherwise
it's not safe. This implies that the ref/unref pairs in it should never
make a difference either - which is important, because at least
releasing the last reference is forbidden while holding the graph lock.
I intended to remove the ref/unref for bdrv_next(), but I didn't because
I realised that the callers need to be audited first that they really
obey the rules. You found one that would be problematic.

The thing that bdrv_flush_all() gets wrong is that it promises to follow
the graph lock rules with GRAPH_RDLOCK_GUARD_MAINLOOP(), but then calls
something that polls. The compiler can't catch this because bdrv_flush()
is a co_wrapper_mixed_bdrv_rdlock. The behaviour for these functions is:

- If called outside of coroutine context, they are GRAPH_UNLOCKED
- If called in coroutine context, they are GRAPH_RDLOCK

We should probably try harder to get rid of the mixed functions, because
a synchronous co_wrapper_bdrv_rdlock could actually be marked
GRAPH_UNLOCKED in the code and then the compiler could catch this case.

The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
with a coroutine wrapper so that the graph lock is held for the whole
function. Then calling bdrv_co_flush() while iterating the list is safe
and doesn't allow concurrent graph modifications.
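
To make the idea concrete, a rough, uncompiled sketch of what such a
conversion might look like (the declaration style for the generated wrapper
and the error handling are assumptions for illustration, not a finished
patch):

/* In the header: let the generator create bdrv_flush_all() from this,
 * holding the graph reader lock for the whole call. */
int coroutine_fn GRAPH_RDLOCK bdrv_co_flush_all(void);
int co_wrapper_bdrv_rdlock bdrv_flush_all(void);

/* In block/io.c: */
int coroutine_fn bdrv_co_flush_all(void)
{
    BdrvNextIterator it;
    BlockDriverState *bs = NULL;
    int result = 0;

    /* The graph reader lock is held across the whole iteration, so
     * bdrv_next() cannot race with graph modifications. */
    for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
        int ret = bdrv_co_flush(bs);
        if (ret < 0 && !result) {
            result = ret;
        }
    }

    return result;
}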

Kevin




Re: [PATCH] tests/qemu-iotests: Test 157 and 227 require virtio-blk

2024-03-26 Thread Kevin Wolf
Am 25.03.2024 um 16:47 hat Thomas Huth geschrieben:
> Tests 157 and 227 use the virtio-blk device, so we have to mark these
> tests accordingly to be skipped if this device is not available (e.g.
> when running the tests with qemu-system-avr only).
> 
> Signed-off-by: Thomas Huth 

Thanks, applied to the block branch.

Kevin




Re: [PATCH] coroutine: cap per-thread local pool size

2024-03-21 Thread Kevin Wolf
Am 20.03.2024 um 15:09 hat Daniel P. Berrangé geschrieben:
> On Wed, Mar 20, 2024 at 09:35:39AM -0400, Stefan Hajnoczi wrote:
> > On Tue, Mar 19, 2024 at 08:10:49PM +, Daniel P. Berrangé wrote:
> > > On Tue, Mar 19, 2024 at 01:55:10PM -0400, Stefan Hajnoczi wrote:
> > > > On Tue, Mar 19, 2024 at 01:43:32PM +, Daniel P. Berrangé wrote:
> > > > > On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > > > > > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > > > > > index 5fd2dbaf8b..2790959eaf 100644
> > > > > > --- a/util/qemu-coroutine.c
> > > > > > +++ b/util/qemu-coroutine.c
> > > > > 
> > > > > > +static unsigned int get_global_pool_hard_max_size(void)
> > > > > > +{
> > > > > > +#ifdef __linux__
> > > > > > +g_autofree char *contents = NULL;
> > > > > > +int max_map_count;
> > > > > > +
> > > > > > +/*
> > > > > > + * Linux processes can have up to max_map_count virtual memory 
> > > > > > areas
> > > > > > + * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond 
> > > > > > this limit. We
> > > > > > + * must limit the coroutine pool to a safe size to avoid 
> > > > > > running out of
> > > > > > + * VMAs.
> > > > > > + */
> > > > > > +if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > > > > > +NULL) &&
> > > > > > +qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > > > > > +/*
> > > > > > + * This is a conservative upper bound that avoids exceeding
> > > > > > + * max_map_count. Leave half for non-coroutine users like 
> > > > > > library
> > > > > > + * dependencies, vhost-user, etc. Each coroutine takes up 
> > > > > > 2 VMAs so
> > > > > > + * halve the amount again.
> > > 
> > > Leaving half for loaded libraries, etc is quite conservative
> > > if max_map_count is the small-ish 64k default.
> > > 
> > > That reservation could perhaps a fixed number like 5,000 ?
> > 
> > While I don't want QEMU to abort, once this heuristic is in the code it
> > will be scary to make it more optimistic and we may never change it. So
> > now is the best time to try 5,000.
> > 
> > I'll send a follow-up patch that reserves 5,000 mappings. If that turns
> > out to be too optimistic we can increase the reservation.
> 
> BTW, I suggested 5,000, because I looked at a few QEM processes I have
> running on Fedora and saw just under 1,000 lines in /proc/$PID/maps,
> of which only a subset is library mappings. So multiplying that x5 felt
> like a fairly generous overhead for more complex build configurations.

On my system, the boring desktop VM with no special hardware or other
advanced configuration takes ~1500 mappings, most of which are
libraries. I'm not concerned about the library mappings, it's unlikely
that we'll double the number of libraries soon.

But I'm not sure about dynamic mappings outside of coroutines, maybe
when enabling features my simple desktop VM doesn't even use at all. If
we're sure that nothing else uses any number worth mentioning, fine with
me. But I couldn't tell.

Staying the area we know reasonably well, how many libblkio bounce
buffers could be in use at the same time? I think each one is an
individual mmap(), right?

Kevin




Re: [PATCH] coroutine: cap per-thread local pool size

2024-03-19 Thread Kevin Wolf
Am 19.03.2024 um 18:10 hat Daniel P. Berrangé geschrieben:
> On Tue, Mar 19, 2024 at 05:54:38PM +0100, Kevin Wolf wrote:
> > Am 19.03.2024 um 14:43 hat Daniel P. Berrangé geschrieben:
> > > On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > > > The coroutine pool implementation can hit the Linux vm.max_map_count
> > > > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > > > or "failed to set up stack guard page" during coroutine creation.
> > > > 
> > > > This happens because per-thread pools can grow to tens of thousands of
> > > > coroutines. Each coroutine causes 2 virtual memory areas to be created.
> > > 
> > > This sounds quite alarming. What usage scenario is justified in
> > > creating so many coroutines?
> > 
> > Basically we try to allow pooling coroutines for as many requests as
> > there can be in flight at the same time. That is, adding a virtio-blk
> > device increases the maximum pool size by num_queues * queue_size. If
> > you have a guest with many CPUs, the default num_queues is relatively
> > large (the bug referenced by Stefan had 64), and queue_size is 256 by
> > default. That's 16k potential requests in flight per disk.
> 
> If we have more than 1 virtio-blk device, does that scale up the max
> coroutines too ?
> 
> eg would 32 virtio-blks devices imply 16k * 32 -> 512k potential
> requests/coroutines ?

Yes. This is the number of request descriptors that fit in the
virtqueues, and if you add another device with additional virtqueues,
then obviously that increases the number of theoretically possible
parallel requests.
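
To put numbers on it (purely illustrative; the defaults are the ones
discussed above and the 32-disk configuration is the hypothetical from your
question):

#include <stdio.h>

int main(void)
{
    unsigned num_queues = 64;    /* default num_queues for a 64-vCPU guest */
    unsigned queue_size = 256;   /* virtio-blk default queue size */
    unsigned num_disks  = 32;    /* hypothetical configuration */

    unsigned requests = num_queues * queue_size * num_disks;  /* 524288 */
    unsigned vmas     = requests * 2;   /* each coroutine stack needs 2 VMAs */

    printf("potential requests: %u, worst-case VMAs: %u\n", requests, vmas);
    /* Compare with the common vm.max_map_count default of 65530. */
    return 0;
}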

The limits of what you can actually achieve in practice might be lower
because I/O might complete faster than the time we need to process all
of the queued requests, depending on how many vcpus are trying to
"compete" with how many iothreads. Of course, the practical limits in
five years might be different from today.

> > > IIUC, coroutine stack size is 1 MB, and so tens of thousands of
> > > coroutines implies 10's of GB of memory just on stacks alone.
> > 
> > That's only virtual memory, though. Not sure how much of it is actually
> > used in practice.
> 
> True, by default Linux wouldn't care too much about virtual memory,
> Only if 'vm.overcommit_memory' is changed from its default, such
> that Linux applies an overcommit ratio on RAM, then total virtual
> memory would be relevant.

That's a good point and one that I don't have a good answer for, short
of just replacing the whole QEMU block layer with rsd and switching to
stackless coroutines/futures this way.

> > > > Eventually vm.max_map_count is reached and memory-related syscalls fail.
> > > 
> > > On my system max_map_count is 1048576, quite alot higher than
> > > 10's of 1000's. Hitting that would imply ~500,000 coroutines and
> > > ~500 GB of stacks !
> > 
> > Did you change the configuration some time in the past, or is this just
> > a newer default? I get 65530, and that's the same default number I've
> > seen in the bug reports.
> 
> It turns out it is a Fedora change, rather than a kernel change:
> 
>   https://fedoraproject.org/wiki/Changes/IncreaseVmMaxMapCount

Good to know, thanks.

> > > > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > > > index 5fd2dbaf8b..2790959eaf 100644
> > > > --- a/util/qemu-coroutine.c
> > > > +++ b/util/qemu-coroutine.c
> > > 
> > > > +static unsigned int get_global_pool_hard_max_size(void)
> > > > +{
> > > > +#ifdef __linux__
> > > > +g_autofree char *contents = NULL;
> > > > +int max_map_count;
> > > > +
> > > > +/*
> > > > + * Linux processes can have up to max_map_count virtual memory 
> > > > areas
> > > > + * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this 
> > > > limit. We
> > > > + * must limit the coroutine pool to a safe size to avoid running 
> > > > out of
> > > > + * VMAs.
> > > > + */
> > > > +if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > > > +NULL) &&
> > > > +qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > > > +/*
> > > > + * This is a conservative upper bound that avoids exceeding
> > > > + * max_map_count. Leave half for non-coroutine users like 
> > > > library
>

Re: [PATCH] coroutine: cap per-thread local pool size

2024-03-19 Thread Kevin Wolf
Am 19.03.2024 um 14:43 hat Daniel P. Berrangé geschrieben:
> On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > The coroutine pool implementation can hit the Linux vm.max_map_count
> > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > or "failed to set up stack guard page" during coroutine creation.
> > 
> > This happens because per-thread pools can grow to tens of thousands of
> > coroutines. Each coroutine causes 2 virtual memory areas to be created.
> 
> This sounds quite alarming. What usage scenario is justified in
> creating so many coroutines?

Basically we try to allow pooling coroutines for as many requests as
there can be in flight at the same time. That is, adding a virtio-blk
device increases the maximum pool size by num_queues * queue_size. If
you have a guest with many CPUs, the default num_queues is relatively
large (the bug referenced by Stefan had 64), and queue_size is 256 by
default. That's 16k potential requests in flight per disk.

Another part of it is just that our calculation didn't make a lot of
sense. Instead of applying this number to the pool size of the iothread
that would actually get the requests, we applied it to _every_ iothread.
This is fixed with this patch, it's a global number applied to a global
pool now.

> IIUC, coroutine stack size is 1 MB, and so tens of thousands of
> coroutines implies 10's of GB of memory just on stacks alone.

That's only virtual memory, though. Not sure how much of it is actually
used in practice.

> > Eventually vm.max_map_count is reached and memory-related syscalls fail.
> 
> On my system max_map_count is 1048576, quite alot higher than
> 10's of 1000's. Hitting that would imply ~500,000 coroutines and
> ~500 GB of stacks !

Did you change the configuration some time in the past, or is this just
a newer default? I get 65530, and that's the same default number I've
seen in the bug reports.

> > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > index 5fd2dbaf8b..2790959eaf 100644
> > --- a/util/qemu-coroutine.c
> > +++ b/util/qemu-coroutine.c
> 
> > +static unsigned int get_global_pool_hard_max_size(void)
> > +{
> > +#ifdef __linux__
> > +g_autofree char *contents = NULL;
> > +int max_map_count;
> > +
> > +/*
> > + * Linux processes can have up to max_map_count virtual memory areas
> > + * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this 
> > limit. We
> > + * must limit the coroutine pool to a safe size to avoid running out of
> > + * VMAs.
> > + */
> > +if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > +NULL) &&
> > +qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > +/*
> > + * This is a conservative upper bound that avoids exceeding
> > + * max_map_count. Leave half for non-coroutine users like library
> > + * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
> > + * halve the amount again.
> > + */
> > +return max_map_count / 4;
> 
> That's 256,000 coroutines, which still sounds incredibly large
> to me.

The whole purpose of the limitation is that you won't ever get -ENOMEM
back, which will likely crash your VM. Even if this hard limit is high,
that doesn't mean that it's fully used. Your setting of 1048576 probably
means that you would never have hit the crash anyway.

Even the benchmarks that used to hit the problem don't even get close to
this hard limit any more because the actual number of coroutines stays
much smaller after applying this patch.

> > +}
> > +#endif
> > +
> > +return UINT_MAX;
> 
> Why UINT_MAX as a default ?  If we can't read procfs, we should
> assume some much smaller sane default IMHO, that corresponds to
> what current linux default max_map_count would be.

I don't think we should artificially limit the pool size, and with it
potentially the performance, even if the host could do more if we only
allowed it to. If we can't read it from procfs, then it's
your responsibility as a user to make sure that it's large enough for
your VM configuration.

Kevin




Re: [PATCH] coroutine: cap per-thread local pool size

2024-03-19 Thread Kevin Wolf
Am 18.03.2024 um 19:34 hat Stefan Hajnoczi geschrieben:
> The coroutine pool implementation can hit the Linux vm.max_map_count
> limit, causing QEMU to abort with "failed to allocate memory for stack"
> or "failed to set up stack guard page" during coroutine creation.
> 
> This happens because per-thread pools can grow to tens of thousands of
> coroutines. Each coroutine causes 2 virtual memory areas to be created.
> Eventually vm.max_map_count is reached and memory-related syscalls fail.
> The per-thread pool sizes are non-uniform and depend on past coroutine
> usage in each thread, so it's possible for one thread to have a large
> pool while another thread's pool is empty.
> 
> Switch to a new coroutine pool implementation with a global pool that
> grows to a maximum number of coroutines and per-thread local pools that
> are capped at hardcoded small number of coroutines.
> 
> This approach does not leave large numbers of coroutines pooled in a
> thread that may not use them again. In order to perform well it
> amortizes the cost of global pool accesses by working in batches of
> coroutines instead of individual coroutines.
> 
> The global pool is a list. Threads donate batches of coroutines to when
> they have too many and take batches from when they have too few:
> 
> .---.
> | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
> `---'
> 
> Each thread has up to 2 batches of coroutines:
> 
> .---.
> | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
> `---'
> 
> The goal of this change is to reduce the excessive number of pooled
> coroutines that cause QEMU to abort when vm.max_map_count is reached
> without losing the performance of an adequately sized coroutine pool.
> 
> Here are virtio-blk disk I/O benchmark results:
> 
>   RW BLKSIZE IODEPTHOLDNEW CHANGE
> randread  4k   1 113725 117451 +3.3%
> randread  4k   8 192968 198510 +2.9%
> randread  4k  16 207138 209429 +1.1%
> randread  4k  32 212399 215145 +1.3%
> randread  4k  64 218319 221277 +1.4%
> randread128k   1  17587  17535 -0.3%
> randread128k   8  17614  17616 +0.0%
> randread128k  16  17608  17609 +0.0%
> randread128k  32  17552  17553 +0.0%
> randread128k  64  17484  17484 +0.0%
> 
> See files/{fio.sh,test.xml.j2} for the benchmark configuration:
> https://gitlab.com/stefanha/virt-playbooks/-/tree/coroutine-pool-fix-sizing
> 
> Buglink: https://issues.redhat.com/browse/RHEL-28947
> Reported-by: Sanjay Rao 
> Reported-by: Boaz Ben Shabat 
> Reported-by: Joe Mario 
> Signed-off-by: Stefan Hajnoczi 

Reviewed-by: Kevin Wolf 

Though I do wonder if we can do something about the slight performance
degradation that Sanjay reported. We seem to stay well under the hard
limit, so the reduced global pool size shouldn't be the issue. Maybe
it's the locking?

Either way, even though it could be called a fix, I don't think this is
for 9.0, right?

Kevin




Re: [PATCH for-9.0 v3] vdpa-dev: Fix initialisation order to restore VDUSE compatibility

2024-03-19 Thread Kevin Wolf
Am 18.03.2024 um 20:27 hat Eugenio Perez Martin geschrieben:
> On Mon, Mar 18, 2024 at 10:02 AM Michael S. Tsirkin  wrote:
> >
> > On Mon, Mar 18, 2024 at 12:31:26PM +0800, Jason Wang wrote:
> > > On Fri, Mar 15, 2024 at 11:59 PM Kevin Wolf  wrote:
> > > >
> > > > VDUSE requires that virtqueues are first enabled before the DRIVER_OK
> > > > status flag is set; with the current API of the kernel module, it is
> > > > impossible to enable the opposite order in our block export code because
> > > > userspace is not notified when a virtqueue is enabled.
> > > >
> > > > This requirement also matches the normal initialisation order as done by
> > > > the generic vhost code in QEMU. However, commit 6c482547 accidentally
> > > > changed the order for vdpa-dev and broke access to VDUSE devices with
> > > > this.
> > > >
> > > > This changes vdpa-dev to use the normal order again and use the standard
> > > > vhost callback .vhost_set_vring_enable for this. VDUSE devices can be
> > > > used with vdpa-dev again after this fix.
> > > >
> > > > vhost_net intentionally avoided enabling the vrings for vdpa and does
> > > > this manually later while it does enable them for other vhost backends.
> > > > Reflect this in the vhost_net code and return early for vdpa, so that
> > > > the behaviour doesn't change for this device.
> > > >
> > > > Cc: qemu-sta...@nongnu.org
> > > > Fixes: 6c4825476a4351530bcac17abab72295b75ffe98
> > > > Signed-off-by: Kevin Wolf 
> > > > ---
> > > > v2:
> > > > - Actually make use of the @enable parameter
> > > > - Change vhost_net to preserve the current behaviour
> > > >
> > > > v3:
> > > > - Updated trace point [Stefano]
> > > > - Fixed typo in comment [Stefano]
> > > >
> > > >  hw/net/vhost_net.c | 10 ++
> > > >  hw/virtio/vdpa-dev.c   |  5 +
> > > >  hw/virtio/vhost-vdpa.c | 29 ++---
> > > >  hw/virtio/vhost.c  |  8 +++-
> > > >  hw/virtio/trace-events |  2 +-
> > > >  5 files changed, 45 insertions(+), 9 deletions(-)
> > > >
> > > > diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> > > > index e8e1661646..fd1a93701a 100644
> > > > --- a/hw/net/vhost_net.c
> > > > +++ b/hw/net/vhost_net.c
> > > > @@ -541,6 +541,16 @@ int vhost_set_vring_enable(NetClientState *nc, int 
> > > > enable)
> > > >  VHostNetState *net = get_vhost_net(nc);
> > > >  const VhostOps *vhost_ops = net->dev.vhost_ops;
> > > >
> > > > +/*
> > > > + * vhost-vdpa network devices need to enable dataplane virtqueues 
> > > > after
> > > > + * DRIVER_OK, so they can recover device state before starting 
> > > > dataplane.
> > > > + * Because of that, we don't enable virtqueues here and leave it to
> > > > + * net/vhost-vdpa.c.
> > > > + */
> > > > +if (nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
> > > > +return 0;
> > > > +}
> > >
> > > I think we need some inputs from Eugenio, this is only needed for
> > > shadow virtqueue during live migration but not other cases.
> > >
> > > Thanks
> >
> >
> > Yes I think we had a backend flag for this, right? Eugenio can you
> > comment please?
> >
> 
> We have the VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK backend flag,
> right. If the backend does not offer it, it is better to enable all
> the queues here and add a migration blocker in net/vhost-vdpa.c.
> 
> So the check should be:
> nc->info->type == VHOST_VDPA && (backend_features &
> VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK).
> 
> I can manage to add the migration blocker on top of this patch.

Note that my patch preserves the current behaviour for vhost_net. The
callback wasn't implemented for vdpa so far, so we never called anything
even if the flag wasn't set. This patch adds an implementation for the
callback, so we have to skip it here to have everything in vhost_net
work as before - which is what the condition as written does.

If we add a check for the flag now (I don't know if that's correct or
not), that would be a second, unrelated change of behaviour in the same
patch. So if it's necessary, that's a preexisting problem and I'd argue
it doesn't belong in this patch, but should be done separately.

Kevin




[PULL 11/15] tests/qemu-iotests: Restrict test 156 to the 'file' protocol

2024-03-18 Thread Kevin Wolf
From: Thomas Huth 

The test fails completely when you try to use it with a different
protocol, e.g. with "./check -ssh -qcow2 156".
The test uses some hand-crafted JSON statements which cannot work with other
protocols, thus let's change this test to only support the 'file' protocol.

Signed-off-by: Thomas Huth 
Message-ID: <2024031508.153201-7-th...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/156 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/156 b/tests/qemu-iotests/156
index a9540bd80d..97c2d86ce5 100755
--- a/tests/qemu-iotests/156
+++ b/tests/qemu-iotests/156
@@ -50,7 +50,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 . ./common.qemu
 
 _supported_fmt qcow2 qed
-_supported_proto generic
+_supported_proto file
 # Copying files around with cp does not work with external data files
 _unsupported_imgopts data_file
 
-- 
2.44.0




[PULL 09/15] tests/qemu-iotests: Restrict test 130 to the 'file' protocol

2024-03-18 Thread Kevin Wolf
From: Thomas Huth 

Using "-drive ...,backing.file.filename=..." only works with the
file protocol, but not with URIs, so mark this test accordingly.

Signed-off-by: Thomas Huth 
Message-ID: <2024031508.153201-5-th...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/130 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/130 b/tests/qemu-iotests/130
index 7257f09677..7af85d20a8 100755
--- a/tests/qemu-iotests/130
+++ b/tests/qemu-iotests/130
@@ -42,7 +42,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 . ./common.qemu
 
 _supported_fmt qcow2
-_supported_proto generic
+_supported_proto file
 _supported_os Linux
 # We are going to use lazy-refcounts
 _unsupported_imgopts 'compat=0.10'
-- 
2.44.0




[PULL 15/15] iotests: adapt to output change for recently introduced 'detached header' field

2024-03-18 Thread Kevin Wolf
From: Fiona Ebner 

Failure was noticed when running the tests for the qcow2 image format.

Fixes: 0bd779e27e ("crypto: Introduce 'detached-header' field in 
QCryptoBlockInfoLUKS")
Signed-off-by: Fiona Ebner 
Message-ID: <20240216101415.293769-1-f.eb...@proxmox.com>
Reviewed-by: Daniel P. Berrangé 
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/198.out | 2 ++
 tests/qemu-iotests/206.out | 1 +
 2 files changed, 3 insertions(+)

diff --git a/tests/qemu-iotests/198.out b/tests/qemu-iotests/198.out
index 805494916f..62fb73fa3e 100644
--- a/tests/qemu-iotests/198.out
+++ b/tests/qemu-iotests/198.out
@@ -39,6 +39,7 @@ Format specific information:
 compression type: COMPRESSION_TYPE
 encrypt:
 ivgen alg: plain64
+detached header: false
 hash alg: sha256
 cipher alg: aes-256
 uuid: ----
@@ -84,6 +85,7 @@ Format specific information:
 compression type: COMPRESSION_TYPE
 encrypt:
 ivgen alg: plain64
+detached header: false
 hash alg: sha256
 cipher alg: aes-256
 uuid: ----
diff --git a/tests/qemu-iotests/206.out b/tests/qemu-iotests/206.out
index 7e95694777..979f00f9bf 100644
--- a/tests/qemu-iotests/206.out
+++ b/tests/qemu-iotests/206.out
@@ -114,6 +114,7 @@ Format specific information:
 refcount bits: 16
 encrypt:
 ivgen alg: plain64
+detached header: false
 hash alg: sha1
 cipher alg: aes-128
 uuid: ----
-- 
2.44.0




[PULL 14/15] tests/qemu-iotests: Restrict tests using "--blockdev file" to the file protocol

2024-03-18 Thread Kevin Wolf
From: Thomas Huth 

Tests that use "--blockdev" with the "file" driver cannot work with
other protocols, so we should mark them accordingly.

Signed-off-by: Thomas Huth 
Message-ID: <2024031508.153201-10-th...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/tests/qcow2-internal-snapshots | 2 +-
 tests/qemu-iotests/tests/qsd-jobs | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/tests/qcow2-internal-snapshots 
b/tests/qemu-iotests/tests/qcow2-internal-snapshots
index 36523aba06..9f83aa8903 100755
--- a/tests/qemu-iotests/tests/qcow2-internal-snapshots
+++ b/tests/qemu-iotests/tests/qcow2-internal-snapshots
@@ -39,7 +39,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 
 # This tests qcow2-specific low-level functionality
 _supported_fmt qcow2
-_supported_proto generic
+_supported_proto file
 # Internal snapshots are (currently) impossible with refcount_bits=1,
 # and generally impossible with external data files
 _unsupported_imgopts 'compat=0.10' 'refcount_bits=1[^0-9]' data_file
diff --git a/tests/qemu-iotests/tests/qsd-jobs 
b/tests/qemu-iotests/tests/qsd-jobs
index 510bf0a9dc..9b843af631 100755
--- a/tests/qemu-iotests/tests/qsd-jobs
+++ b/tests/qemu-iotests/tests/qsd-jobs
@@ -40,7 +40,7 @@ cd ..
 . ./common.filter
 
 _supported_fmt qcow2
-_supported_proto generic
+_supported_proto file
 
 size=128M
 
-- 
2.44.0




[PULL 02/15] nbd/server: Fix race in draining the export

2024-03-18 Thread Kevin Wolf
When draining an NBD export, nbd_drained_begin() first sets
client->quiescing so that nbd_client_receive_next_request() won't start
any new request coroutines. Then nbd_drained_poll() tries to makes sure
that we wait for any existing request coroutines by checking that
client->nb_requests has become 0.

However, there is a small window between creating a new request
coroutine and increasing client->nb_requests. If a coroutine is in this
state, it won't be waited for and drain returns too early.

In the context of switching to a different AioContext, this means that
blk_aio_attached() will see client->recv_coroutine != NULL and fail its
assertion.

Fix this by increasing client->nb_requests immediately when starting the
coroutine. Doing this after the checks if we should create a new
coroutine is okay because client->lock is held.

Cc: qemu-sta...@nongnu.org
Fixes: fd6afc501a01 ("nbd/server: Use drained block ops to quiesce the server")
Signed-off-by: Kevin Wolf 
Message-ID: <20240314165825.40261-2-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 nbd/server.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 941832f178..c3484cc1eb 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -3007,8 +3007,8 @@ static coroutine_fn int nbd_handle_request(NBDClient 
*client,
 /* Owns a reference to the NBDClient passed as opaque.  */
 static coroutine_fn void nbd_trip(void *opaque)
 {
-NBDClient *client = opaque;
-NBDRequestData *req = NULL;
+NBDRequestData *req = opaque;
+NBDClient *client = req->client;
 NBDRequest request = { 0 };    /* GCC thinks it can be used uninitialized */
 int ret;
 Error *local_err = NULL;
@@ -3037,8 +3037,6 @@ static coroutine_fn void nbd_trip(void *opaque)
 goto done;
 }
 
-req = nbd_request_get(client);
-
 /*
  * nbd_co_receive_request() returns -EAGAIN when nbd_drained_begin() has
  * set client->quiescing but by the time we get back nbd_drained_end() may
@@ -3112,9 +3110,7 @@ static coroutine_fn void nbd_trip(void *opaque)
 }
 
 done:
-if (req) {
-nbd_request_put(req);
-}
+nbd_request_put(req);
 
 qemu_mutex_unlock(&client->lock);
 
@@ -3143,10 +3139,13 @@ disconnect:
  */
 static void nbd_client_receive_next_request(NBDClient *client)
 {
+NBDRequestData *req;
+
 if (!client->recv_coroutine && client->nb_requests < MAX_NBD_REQUESTS &&
 !client->quiescing) {
 nbd_client_get(client);
-client->recv_coroutine = qemu_coroutine_create(nbd_trip, client);
+req = nbd_request_get(client);
+client->recv_coroutine = qemu_coroutine_create(nbd_trip, req);
 aio_co_schedule(client->exp->common.ctx, client->recv_coroutine);
 }
 }
-- 
2.44.0




[PULL 08/15] tests/qemu-iotests: Restrict test 114 to the 'file' protocol

2024-03-18 Thread Kevin Wolf
From: Thomas Huth 

iotest 114 uses "truncate" and the qcow2.py script on the destination file,
which both cannot deal with URIs. Thus this test needs the "file" protocol,
otherwise it fails with an error message like this:

 truncate: cannot open 
'ssh://127.0.0.1/tmp/qemu-build/tests/qemu-iotests/scratch/qcow2-ssh-114/t.qcow2.orig'
  for writing: No such file or directory

Thus mark this test for "file protocol only" accordingly.

Signed-off-by: Thomas Huth 
Message-ID: <2024031508.153201-4-th...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/114 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/114 b/tests/qemu-iotests/114
index de6fd327ee..dccc71008b 100755
--- a/tests/qemu-iotests/114
+++ b/tests/qemu-iotests/114
@@ -38,7 +38,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 . ./common.filter
 
 _supported_fmt qcow2
-_supported_proto generic
+_supported_proto file
 # At least OpenBSD doesn't seem to have truncate
 _supported_os Linux
 # qcow2.py does not work too well with external data files
-- 
2.44.0




[PULL 06/15] tests/qemu-iotests: Fix test 033 for running with non-file protocols

2024-03-18 Thread Kevin Wolf
From: Thomas Huth 

When running iotest 033 with the ssh protocol, it fails with:

 033   fail   [14:48:31] [14:48:41]   10.2soutput mismatch
 --- /.../tests/qemu-iotests/033.out
 +++ /.../tests/qemu-iotests/scratch/qcow2-ssh-033/033.out.bad
 @@ -174,6 +174,7 @@
  512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 512/512 bytes at offset 2097152
  512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +qemu-io: warning: Failed to truncate the tail of the image: ssh driver does 
not support shrinking files
  read 512/512 bytes at offset 0
  512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)

We already check for the qcow2 format here, so let's simply also
add a check for the protocol here, too, to only test the truncation
with the file protocol.

Signed-off-by: Thomas Huth 
Message-ID: <2024031508.153201-2-th...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/033 | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tests/qemu-iotests/033 b/tests/qemu-iotests/033
index da9133c44b..4bc7a071bd 100755
--- a/tests/qemu-iotests/033
+++ b/tests/qemu-iotests/033
@@ -123,9 +123,9 @@ do_test 512 "write -P 1 0 0x200" "$TEST_IMG" | 
_filter_qemu_io
 # next L2 table
 do_test 512 "write -P 1 $L2_COVERAGE 0x200" "$TEST_IMG" | _filter_qemu_io
 
-# only interested in qcow2 here; also other formats might respond with
-#  "not supported" error message
-if [ $IMGFMT = "qcow2" ]; then
+# only interested in qcow2 with file protocol here; also other formats
+# might respond with "not supported" error message
+if [ $IMGFMT = "qcow2" ] && [ $IMGPROTO = "file" ]; then
 do_test 512 "truncate $L2_COVERAGE" "$TEST_IMG" | _filter_qemu_io
 fi
 
-- 
2.44.0




[PULL 01/15] mirror: Don't call job_pause_point() under graph lock

2024-03-18 Thread Kevin Wolf
Calling job_pause_point() while holding the graph reader lock
potentially results in a deadlock: bdrv_graph_wrlock() first drains
everything, including the mirror job, which pauses it. The job is only
unpaused at the end of the drain section, which is when the graph writer
lock has been successfully taken. However, if the job happens to be
paused at a pause point where it still holds the reader lock, the writer
lock can't be taken as long as the job is still paused.

Mark job_pause_point() as GRAPH_UNLOCKED and fix mirror accordingly.

Cc: qemu-sta...@nongnu.org
Buglink: https://issues.redhat.com/browse/RHEL-28125
Fixes: 004915a96a7a ("block: Protect bs->backing with graph_lock")
Signed-off-by: Kevin Wolf 
Message-ID: <20240313153000.33121-1-kw...@redhat.com>
Reviewed-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 include/qemu/job.h |  2 +-
 block/mirror.c | 10 ++
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 9ea98b5927..2b873f2576 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -483,7 +483,7 @@ void job_enter(Job *job);
  *
  * Called with job_mutex *not* held.
  */
-void coroutine_fn job_pause_point(Job *job);
+void coroutine_fn GRAPH_UNLOCKED job_pause_point(Job *job);
 
 /**
  * @job: The job that calls the function.
diff --git a/block/mirror.c b/block/mirror.c
index 5145eb53e1..1bdce3b657 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -479,9 +479,9 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t 
offset,
 return bytes_handled;
 }
 
-static void coroutine_fn GRAPH_RDLOCK mirror_iteration(MirrorBlockJob *s)
+static void coroutine_fn GRAPH_UNLOCKED mirror_iteration(MirrorBlockJob *s)
 {
-BlockDriverState *source = s->mirror_top_bs->backing->bs;
+BlockDriverState *source;
 MirrorOp *pseudo_op;
 int64_t offset;
 /* At least the first dirty chunk is mirrored in one iteration. */
@@ -489,6 +489,10 @@ static void coroutine_fn GRAPH_RDLOCK 
mirror_iteration(MirrorBlockJob *s)
 bool write_zeroes_ok = bdrv_can_write_zeroes_with_unmap(blk_bs(s->target));
 int max_io_bytes = MAX(s->buf_size / MAX_IN_FLIGHT, MAX_IO_BYTES);
 
+bdrv_graph_co_rdlock();
+source = s->mirror_top_bs->backing->bs;
+bdrv_graph_co_rdunlock();
+
 bdrv_dirty_bitmap_lock(s->dirty_bitmap);
 offset = bdrv_dirty_iter_next(s->dbi);
 if (offset < 0) {
@@ -1066,9 +1070,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 mirror_wait_for_free_in_flight_slot(s);
 continue;
 } else if (cnt != 0) {
-bdrv_graph_co_rdlock();
 mirror_iteration(s);
-bdrv_graph_co_rdunlock();
 }
 }
 
-- 
2.44.0



