[PATCH v1] block/raw-format: implement .bdrv_get_specific_info handler

2021-07-06 Thread Or Ozeri
When using the raw format, allow exposing format-specific info from the
underlying storage.
In particular, this will enable RBD images using the raw format to indicate
a LUKS2-encrypted image in the output of qemu-img info.

Signed-off-by: Or Ozeri 
---
 block/raw-format.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/block/raw-format.c b/block/raw-format.c
index 7717578ed6..f6e70e2356 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -369,6 +369,12 @@ static int raw_get_info(BlockDriverState *bs, 
BlockDriverInfo *bdi)
 return bdrv_get_info(bs->file->bs, bdi);
 }
 
+static ImageInfoSpecific *raw_get_specific_info(BlockDriverState *bs,
+Error **errp)
+{
+return bdrv_get_specific_info(bs->file->bs, errp);
+}
+
 static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 {
 if (bs->probed) {
@@ -603,6 +609,7 @@ BlockDriver bdrv_raw = {
 .has_variable_length  = true,
 .bdrv_measure = &raw_measure,
 .bdrv_get_info= &raw_get_info,
+.bdrv_get_specific_info = &raw_get_specific_info,
 .bdrv_refresh_limits  = &raw_refresh_limits,
 .bdrv_probe_blocksizes = &raw_probe_blocksizes,
 .bdrv_probe_geometry  = &raw_probe_geometry,
-- 
2.27.0




Re: [RFC PATCH v2 1/7] hw/misc: Add device to help managing aliased memory regions

2021-07-06 Thread Philippe Mathieu-Daudé
On 4/22/21 3:33 AM, Richard Henderson wrote:
> On 4/19/21 2:43 AM, Philippe Mathieu-Daudé wrote:
>> Not really RFC, simply that I've still to add the technical description,
>> but I'd like to know if there could be a possibility to not accept
>> this device (because I missed something) before keeping working on
>> it. So far it is only used in hobbyist boards.
>>
>> Cc: Peter Xu
>> Cc: Paolo Bonzini
>> ---
>>   include/hw/misc/aliased_region.h |  87 
>>   hw/misc/aliased_region.c | 132 +++
>>   MAINTAINERS  |   6 ++
>>   hw/misc/Kconfig  |   3 +
>>   hw/misc/meson.build  |   1 +
>>   5 files changed, 229 insertions(+)
>>   create mode 100644 include/hw/misc/aliased_region.h
>>   create mode 100644 hw/misc/aliased_region.c
> 
> Looks reasonable to me.
> 
> 
>> +    subregion_bits = 64 - clz64(s->span_size - 1);
>> +    s->mem.count = s->region_size >> subregion_bits;
>> +    assert(s->mem.count > 1);
>> +    subregion_size = 1ULL << subregion_bits;
> 
> So... subregion_size = pow2floor(s->span_size)?

No... should be subregion_size = pow2ceil(s->span_size).

> Why use a floor-ish computation here and pow2ceil later in
> aliased_mr_realize?

I missed it :)

> Why use either floor or ceil as opposed to
> asserting that the size is in fact a power of 2?

Unfortunately not all memory regions are power of 2 :(

See for example the armv7m_systick device (size 0xe0).
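
A quick standalone check of that correction: 1ULL << (64 - clz64(x - 1))
rounds x up to the next power of two, i.e. it matches pow2ceil() and not
pow2floor(). The snippet below uses the compiler builtin and a local
reference pow2ceil() instead of qemu/host-utils.h; illustration only.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t ref_pow2ceil(uint64_t x)    /* local reference, not qemu's */
{
    uint64_t p = 1;
    while (p < x) {
        p <<= 1;
    }
    return p;
}

int main(void)
{
    uint64_t sizes[] = { 0xe0, 0x100, 0x1000, 0x12345 };  /* 0xe0 = systick */

    for (int i = 0; i < 4; i++) {
        uint64_t x = sizes[i];
        unsigned bits = 64 - __builtin_clzll(x - 1);      /* as in the patch */

        assert((1ULL << bits) == ref_pow2ceil(x));
        printf("0x%llx -> 0x%llx\n",
               (unsigned long long)x, 1ULL << bits);
    }
    return 0;
}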

> Why must the region be
> a power of 2, as opposed to e.g. a multiple of page_size?

I/O regions don't have to be multiple of page_size... See
AVR devices for example:

(qemu) info mtree
address-space: memory
  - (prio 0, i/o): system
-7fff (prio 0, rom): flash
0080-008000ff (prio -1234, i/o): I/O
00800023-00800025 (prio -1000, i/o): atmega-gpio-b
00800026-00800028 (prio -1000, i/o): atmega-gpio-c
00800029-0080002b (prio -1000, i/o): atmega-gpio-d
00800035-00800035 (prio -1000, i/o): avr-timer8-intflag
00800036-00800036 (prio 0, i/o): avr-timer16-intflag
00800037-00800037 (prio -1000, i/o): avr-timer8-intflag
0080003f-00800041 (prio -1000, i/o): avr-eeprom
00800044-00800048 (prio -1000, i/o): avr-timer8
0080004c-0080004e (prio -1000, i/o): avr-spi
00800060-00800060 (prio -1000, i/o): avr-watchdog
00800064-00800064 (prio 0, i/o): avr-power
0080006e-0080006e (prio -1000, i/o): avr-timer8-intmask
0080006f-0080006f (prio 0, i/o): avr-timer16-intmask
00800070-00800070 (prio -1000, i/o): avr-timer8-intmask
00800074-00800075 (prio -1000, i/o): avr-ext-mem-ctrl
00800078-0080007f (prio -1000, i/o): avr-adc
00800080-0080008d (prio 0, i/o): avr-timer16
008000b0-008000b4 (prio -1000, i/o): avr-timer8
008000b8-008000bd (prio -1000, i/o): avr-twi
008000c0-008000c6 (prio 0, i/o): avr-usart
00800100-008008ff (prio 0, ram): sram

> Or just the
> most basic requirement that region_size be a multiple of span_size?

OK.

Thanks for the review! Now I need to fill the documentation part :/

Phil.



Re: [PATCH] hw/sd: sdhci: Enable 64-bit system bus capability in the default SD/MMC host controller

2021-07-06 Thread Joanne Koong
Awesome! Thank you, Philippe and Bin!

On Mon, Jul 5, 2021 at 2:54 AM Philippe Mathieu-Daudé 
wrote:

> On 6/23/21 8:59 PM, Joanne Koong wrote:
> > The default SD/MMC host controller uses SD spec v2.00. 64-bit system bus
> > capability was added in v2.
> >
> > In this change, we arrive at 0x157834b4 by computing (0x057834b4 | (1ul
> > << 28)) where 28 represents the BUS64BIT SDHC_CAPAB field.
> >
> > Signed-off-by: Joanne Koong 
> > ---
> >  hw/sd/sdhci-internal.h | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
>
> Thanks, series applied to sdmmc-next tree.
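
For completeness, a trivial standalone check of the arithmetic quoted above
(illustration only):

#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint64_t old_capab = 0x057834b4;          /* previous SDHC_CAPAB value */

    /* setting bit 28 (BUS64BIT) gives the value used in the patch */
    assert((old_capab | (1ul << 28)) == 0x157834b4);
    return 0;
}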
>


[PATCH] block/file-posix: Use O_RDWR for locking on macOS

2021-07-06 Thread Akihiko Odaki
qemu_lock_fd_test always returns 0 when fd is not open for writing and
exclusive is true on macOS 11.3.1.

Signed-off-by: Akihiko Odaki 
---
 block/file-posix.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index b3fbb9bd63a..017fbc6b055 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -130,6 +130,14 @@
 #define RAW_LOCK_PERM_BASE 100
 #define RAW_LOCK_SHARED_BASE   200
 
+/*
+ * qemu_lock_fd_test always returns 0 when fd is not open for writing and
+ * exclusive is true on macOS 11.3.1.
+ */
+#ifdef __APPLE__
+#define RAW_LOCK_WRITES
+#endif
+
 typedef struct BDRVRawState {
 int fd;
 bool use_lock;
@@ -638,7 +646,11 @@ static int raw_open_common(BlockDriverState *bs, QDict 
*options,
false);
 
 s->open_flags = open_flags;
+#ifdef RAW_LOCK_WRITES
+raw_parse_flags(bdrv_flags, &s->open_flags, s->use_lock);
+#else
 raw_parse_flags(bdrv_flags, &s->open_flags, false);
+#endif
 
 s->fd = -1;
 fd = qemu_open(filename, s->open_flags, errp);
@@ -1004,6 +1016,11 @@ static int raw_reconfigure_getfd(BlockDriverState *bs, 
int flags,
 bool has_writers = perm &
 (BLK_PERM_WRITE | BLK_PERM_WRITE_UNCHANGED | BLK_PERM_RESIZE);
 int fcntl_flags = O_APPEND | O_NONBLOCK;
+#ifdef RAW_LOCK_WRITES
+if (s->use_lock) {
+has_writers = true;
+}
+#endif
 #ifdef O_NOATIME
 fcntl_flags |= O_NOATIME;
 #endif
-- 
2.30.1 (Apple Git-130)




Re: [PATCH v2 0/6] block: block-status cache for data regions

2021-07-06 Thread Kevin Wolf
Am 23.06.2021 um 17:01 hat Max Reitz geschrieben:
> Hi,
> 
> See the cover letter from v1 for the general idea:
> 
> https://lists.nongnu.org/archive/html/qemu-block/2021-06/msg00843.html
> 
> 
> The biggest change here in v2 is that instead of having a common CoMutex
> protect the block-status cache, we’re using RCU now.  So to read from
> the cache, or even to invalidate it, no lock is needed, only to update
> it with new data.
> 
> Disclaimer: I have no experience with RCU in practice so far, neither in
> qemu nor anywhere else.  So I hope I’ve used it correctly...

This I hope as well. :-)

With the missing qatomic_rcu_read() added:

Reviewed-by: Kevin Wolf 
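
For anyone else new to RCU in qemu, here is a rough sketch of the read/update
pattern under discussion. rcu_read_lock()/rcu_read_unlock(),
qatomic_rcu_read()/qatomic_rcu_set() and g_free_rcu() are the real helpers
from include/qemu/rcu.h and include/qemu/atomic.h; the struct and the
BlockDriverState field names below are invented for illustration, and the
actual series differs in detail:

typedef struct BlockStatusCacheEntry {
    struct rcu_head rcu;     /* must stay first: g_free_rcu() requires it */
    int64_t data_start;      /* start of the cached data region */
    int64_t data_end;        /* end of the cached data region (exclusive) */
} BlockStatusCacheEntry;

/* Reader: no lock, only an RCU read-side critical section */
static bool bsc_covers(BlockDriverState *bs, int64_t offset)
{
    BlockStatusCacheEntry *e;
    bool covered;

    rcu_read_lock();
    e = qatomic_rcu_read(&bs->block_status_cache);
    covered = e && offset >= e->data_start && offset < e->data_end;
    rcu_read_unlock();

    return covered;
}

/* Updater: publish a new entry; the old one is only freed after a grace
 * period, so concurrent lock-free readers never see freed memory. */
static void bsc_update(BlockDriverState *bs, int64_t start, int64_t end)
{
    BlockStatusCacheEntry *new_entry = g_new0(BlockStatusCacheEntry, 1);
    BlockStatusCacheEntry *old_entry;

    new_entry->data_start = start;
    new_entry->data_end = end;

    QEMU_LOCK_GUARD(&bs->bsc_modify_lock);   /* only updates are serialized */
    old_entry = qatomic_rcu_read(&bs->block_status_cache);
    qatomic_rcu_set(&bs->block_status_cache, new_entry);
    if (old_entry) {
        g_free_rcu(old_entry, rcu);
    }
}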




Re: [PATCH v4 01/34] modules: add modinfo macros

2021-07-06 Thread Paolo Bonzini

On 24/06/21 22:37, Eduardo Habkost wrote:

On Thu, Jun 24, 2021 at 12:38:03PM +0200, Gerd Hoffmann wrote:

Add macros for module info annotations.

Instead of having that module meta-data stored in lists in util/module.c
place directly in the module source code.


[...]

+/* module implements QOM type <name> */
+#define module_obj(name) modinfo(obj, name)


Can we make OBJECT_DEFINE_TYPE*() use this macro automatically?



Yeah, that's possible.  I would do it as a separate patch though, 
because Gerd is on vacation and he asked me to include it in a pull 
request before soft freeze.


Thanks,

Paolo




Re: [PATCH v2 2/6] block: block-status cache for data regions

2021-07-06 Thread Kevin Wolf
Am 23.06.2021 um 17:01 hat Max Reitz geschrieben:
> As we have attempted before
> (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
> "file-posix: Cache lseek result for data regions";
> https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
> "file-posix: Cache next hole"), this patch seeks to reduce the number of
> SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
> main difference is that this time it is implemented as part of the
> general block layer code.
> 
> The problem we face is that on some filesystems or in some
> circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
> implementation is outside of qemu, there is little we can do about its
> performance.
> 
> We have already introduced the want_zero parameter to
> bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
> unless we really want zero information; but sometimes we do want that
> information, because for files that consist largely of zero areas,
> special-casing those areas can give large performance boosts.  So the
> real problem is with files that consist largely of data, so that
> inquiring the block status does not gain us much performance, but where
> such an inquiry itself takes a lot of time.
> 
> To address this, we want to cache data regions.  Most of the time, when
> bad performance is reported, it is in places where the image is iterated
> over from start to end (qemu-img convert or the mirror job), so a simple
> yet effective solution is to cache only the current data region.
> 
> (Note that only caching data regions but not zero regions means that
> returning false information from the cache is not catastrophic: Treating
> zeroes as data is fine.  While we try to invalidate the cache on zero
> writes and discards, such incongruences may still occur when there are
> other processes writing to the image.)
> 
> We only use the cache for nodes without children (i.e. protocol nodes),
> because that is where the problem is: Drivers that rely on block-status
> implementations outside of qemu (e.g. SEEK_DATA/HOLE).
> 
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
> Signed-off-by: Max Reitz 

Since you indicated that you'll respin the patch, I'll add my minor
comments:

> @@ -2442,9 +2445,58 @@ static int coroutine_fn 
> bdrv_co_block_status(BlockDriverState *bs,
>  aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
>  
>  if (bs->drv->bdrv_co_block_status) {
> -ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
> -aligned_bytes, pnum, &local_map,
> -&local_file);
> +bool from_cache = false;
> +
> +/*
> + * Use the block-status cache only for protocol nodes: Format
> + * drivers are generally quick to inquire the status, but protocol
> + * drivers often need to get information from outside of qemu, so
> + * we do not have control over the actual implementation.  There
> + * have been cases where inquiring the status took an unreasonably
> + * long time, and we can do nothing in qemu to fix it.
> + * This is especially problematic for images with large data areas,
> + * because finding the few holes in them and giving them special
> + * treatment does not gain much performance.  Therefore, we try to
> + * cache the last-identified data region.
> + *
> + * Second, limiting ourselves to protocol nodes allows us to assume
> + * the block status for data regions to be DATA | OFFSET_VALID, and
> + * that the host offset is the same as the guest offset.
> + *
> + * Note that it is possible that external writers zero parts of
> + * the cached regions without the cache being invalidated, and so
> + * we may report zeroes as data.  This is not catastrophic,
> + * however, because reporting zeroes as data is fine.
> + */
> +if (QLIST_EMPTY(&bs->children)) {
> +if (bdrv_bsc_is_data(bs, aligned_offset, pnum)) {
> +ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
> +local_file = bs;
> +local_map = aligned_offset;
> +
> +from_cache = true;
> +}
> +}
> +
> +if (!from_cache) {

Is having a separate variable from_cache really useful? This looks like
it could just be:

if (QLIST_EMPTY() && bdrv_bsc_is_data()) {
// The code above
} else {
// The code below
}
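
Spelled out with the lines quoted above, that would be something like:

if (QLIST_EMPTY(&bs->children) &&
    bdrv_bsc_is_data(bs, aligned_offset, pnum)) {
    /* Fast path: protocol node with a cached data region */
    ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
    local_file = bs;
    local_map = aligned_offset;
} else {
    ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
                                        aligned_bytes, pnum, &local_map,
                                        &local_file);
}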

> +ret = bs->drv->bdrv_co_block_status(bs, want_zero, 
> aligned_offset,
> +aligned_bytes, pnum, 
> &local_map,
> +&local_file);
> +
> +/*
> + * Note that checking QLIST_EMPTY(&bs->children) is also done 
> when
> + * t

[PATCH] block/replication.c: Properly attach children

2021-07-06 Thread Lukas Straub
The replication driver needs access to the child block nodes of
its child so it can issue bdrv_make_empty to manage the replication.
However, it does this by directly copying the BdrvChilds, which is
wrong.

Fix this by properly attaching the block-nodes with
bdrv_attach_child().

Also, remove a workaround introduced in commit
6ecbc6c52672db5c13805735ca02784879ce8285
"replication: Avoid blk_make_empty() on read-only child".

Signed-off-by: Lukas Straub 
---

-v2: Test for BDRV_CHILD_PRIMARY in replication_child_perm, since
 bs->file might not be set yet. (Vladimir)

 block/replication.c | 94 +
 1 file changed, 61 insertions(+), 33 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 52163f2d1f..fd8cb728a3 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -166,7 +166,12 @@ static void replication_child_perm(BlockDriverState *bs, 
BdrvChild *c,
uint64_t perm, uint64_t shared,
uint64_t *nperm, uint64_t *nshared)
 {
-*nperm = BLK_PERM_CONSISTENT_READ;
+if (role & BDRV_CHILD_PRIMARY) {
+*nperm = BLK_PERM_CONSISTENT_READ;
+} else {
+*nperm = 0;
+}
+
 if ((bs->open_flags & (BDRV_O_INACTIVE | BDRV_O_RDWR)) == BDRV_O_RDWR) {
 *nperm |= BLK_PERM_WRITE;
 }
@@ -340,17 +345,7 @@ static void secondary_do_checkpoint(BDRVReplicationState 
*s, Error **errp)
 return;
 }
 
-BlockBackend *blk = blk_new(qemu_get_current_aio_context(),
-BLK_PERM_WRITE, BLK_PERM_ALL);
-blk_insert_bs(blk, s->hidden_disk->bs, &local_err);
-if (local_err) {
-error_propagate(errp, local_err);
-blk_unref(blk);
-return;
-}
-
-ret = blk_make_empty(blk, errp);
-blk_unref(blk);
+ret = bdrv_make_empty(s->hidden_disk, errp);
 if (ret < 0) {
 return;
 }
@@ -365,27 +360,35 @@ static void reopen_backing_file(BlockDriverState *bs, 
bool writable,
 Error **errp)
 {
 BDRVReplicationState *s = bs->opaque;
+BdrvChild *hidden_disk, *secondary_disk;
 BlockReopenQueue *reopen_queue = NULL;
 
+/*
+ * s->hidden_disk and s->secondary_disk may not be set yet, as they will
+ * only be set after the children are writable.
+ */
+hidden_disk = bs->file->bs->backing;
+secondary_disk = hidden_disk->bs->backing;
+
 if (writable) {
-s->orig_hidden_read_only = bdrv_is_read_only(s->hidden_disk->bs);
-s->orig_secondary_read_only = bdrv_is_read_only(s->secondary_disk->bs);
+s->orig_hidden_read_only = bdrv_is_read_only(hidden_disk->bs);
+s->orig_secondary_read_only = bdrv_is_read_only(secondary_disk->bs);
 }
 
-bdrv_subtree_drained_begin(s->hidden_disk->bs);
-bdrv_subtree_drained_begin(s->secondary_disk->bs);
+bdrv_subtree_drained_begin(hidden_disk->bs);
+bdrv_subtree_drained_begin(secondary_disk->bs);
 
 if (s->orig_hidden_read_only) {
 QDict *opts = qdict_new();
 qdict_put_bool(opts, BDRV_OPT_READ_ONLY, !writable);
-reopen_queue = bdrv_reopen_queue(reopen_queue, s->hidden_disk->bs,
+reopen_queue = bdrv_reopen_queue(reopen_queue, hidden_disk->bs,
  opts, true);
 }
 
 if (s->orig_secondary_read_only) {
 QDict *opts = qdict_new();
 qdict_put_bool(opts, BDRV_OPT_READ_ONLY, !writable);
-reopen_queue = bdrv_reopen_queue(reopen_queue, s->secondary_disk->bs,
+reopen_queue = bdrv_reopen_queue(reopen_queue, secondary_disk->bs,
  opts, true);
 }
 
@@ -393,8 +396,8 @@ static void reopen_backing_file(BlockDriverState *bs, bool 
writable,
 bdrv_reopen_multiple(reopen_queue, errp);
 }
 
-bdrv_subtree_drained_end(s->hidden_disk->bs);
-bdrv_subtree_drained_end(s->secondary_disk->bs);
+bdrv_subtree_drained_end(hidden_disk->bs);
+bdrv_subtree_drained_end(secondary_disk->bs);
 }
 
 static void backup_job_cleanup(BlockDriverState *bs)
@@ -451,6 +454,7 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 BlockDriverState *bs = rs->opaque;
 BDRVReplicationState *s;
 BlockDriverState *top_bs;
+BdrvChild *active_disk, *hidden_disk, *secondary_disk;
 int64_t active_length, hidden_length, disk_length;
 AioContext *aio_context;
 Error *local_err = NULL;
@@ -488,32 +492,32 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 case REPLICATION_MODE_PRIMARY:
 break;
 case REPLICATION_MODE_SECONDARY:
-s->active_disk = bs->file;
-if (!s->active_disk || !s->active_disk->bs ||
-!s->active_disk->bs->backing) {
+active_disk = bs->file;
+if (!active_disk || !active_disk->bs ||
+!active_disk->bs->backing) {

Re: [PATCH v5 6/6] block: Make blockdev-reopen stable API

2021-07-06 Thread Vladimir Sementsov-Ogievskiy

06.07.2021 14:23, Kevin Wolf wrote:

From: Alberto Garcia

This patch drops the 'x-' prefix from x-blockdev-reopen.

Signed-off-by: Alberto Garcia
Signed-off-by: Kevin Wolf



Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH v5 5/6] iotests: Test reopening multiple devices at the same time

2021-07-06 Thread Vladimir Sementsov-Ogievskiy

06.07.2021 14:23, Kevin Wolf wrote:

From: Alberto Garcia

This test swaps the images used by two active block devices.

This is now possible thanks to the new ability to run
x-blockdev-reopen on multiple devices at the same time.

Signed-off-by: Alberto Garcia
Signed-off-by: Kevin Wolf




Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support

2021-07-06 Thread Peter Lieven



> Am 06.07.2021 um 17:25 schrieb Kevin Wolf :
> 
> Am 06.07.2021 um 16:55 hat Peter Lieven geschrieben:
 Am 06.07.2021 um 15:19 schrieb Kevin Wolf :
>>> 
>>> Am 02.07.2021 um 19:23 hat Ilya Dryomov geschrieben:
 This series migrates the qemu rbd driver from the old aio emulation
 to native coroutines and adds write zeroes support which is important
 for block operations.
 
 To achieve this we first bump the librbd requirement to the already
 outdated luminous release of ceph to get rid of some wrappers and
 ifdef'ry in the code.
>>> 
>>> Thanks, applied to the block branch.
>>> 
>>> I've only had a very quick look at the patches, but I think there is one
>>> suggestion for a cleanup I can make: The qemu_rbd_finish_bh()
>>> indirection is probably unnecessary now because aio_co_wake() is thread
>>> safe.
>> 
>> But this is new, isn’t it?
> 
> Not sure in what sense you mean. aio_co_wake() has always been thread
> safe, as far as I know.
> 
> Obviously, the old code didn't use aio_co_wake(), but directly called
> some callbacks, so the part that is new is your coroutine conversion
> that enables getting rid of the BH.
> 
>> We also have this indirection in iscsi and nfs drivers I think.
> 
> Indeed, the resulting code looks the same. In iscsi and nfs it doesn't
> come from an incomplete conversion to coroutines, but they both used
> qemu_coroutine_enter() originally, which resumes the coroutine in the
> current thread...

If I remember correctly this would also serialize requests and thus we used
BHs. libnfs and libiscsi are not thread safe either, and they run completely in
qemu's threads, so this wasn't the original reason.

Thanks for the hints to the relevant commit.

I will send a follow up for rbd/nfs/iscsi in about 2 weeks.

Peter

> 
>> Does it matter that the completion callback is called from an librbd
>> thread? Will the coroutine continue to run in the right thread?
> 
> ...whereas aio_co_wake() resumes the coroutine in the thread where it
> was running before.
> 
> (Before commit 5f50be9b5, this would have been buggy from an librbd
> thread, but now it should work correctly even for threads that are
> neither iothreads nor vcpu threads.)
> 
>> I will have a decent look after my vacation.
> 
> Sounds good, thanks. Enjoy your vacation!
> 
> Kevin
> 





Re: [PATCH v5 4/6] block: Support multiple reopening with x-blockdev-reopen

2021-07-06 Thread Vladimir Sementsov-Ogievskiy

06.07.2021 14:23, Kevin Wolf wrote:

From: Alberto Garcia 

[ kwolf: Fixed AioContext locking ]

Signed-off-by: Alberto Garcia 
Signed-off-by: Kevin Wolf 



Reviewed-by: Vladimir Sementsov-Ogievskiy 


---
  qapi/block-core.json  | 18 +++--
  blockdev.c| 81 ++-
  tests/qemu-iotests/155|  9 ++-
  tests/qemu-iotests/165|  4 +-
  tests/qemu-iotests/245| 27 ---
  tests/qemu-iotests/248|  2 +-
  tests/qemu-iotests/248.out|  2 +-
  tests/qemu-iotests/296|  9 ++-


[..]

  
  ##

  # @blockdev-del:
diff --git a/blockdev.c b/blockdev.c
index f657d090d3..1e8c946828 100644
--- a/blockdev.c
+++ b/blockdev.c


[..]

  
-bs = bdrv_find_node(options->node_name);

-if (!bs) {
-error_setg(errp, "Failed to find node with node-name='%s'",
+bs = bdrv_find_node(options->node_name);
+if (!bs) {
+error_setg(errp, "Failed to find node with node-name='%s'",
 options->node_name);


indentation broken


-goto fail;
-}
+goto fail;
+}
  
-/* Put all options in a QDict and flatten it */

-visit_type_BlockdevOptions(v, NULL, &options, &error_abort);
-visit_complete(v, &obj);
-qdict = qobject_to(QDict, obj);
+/* Put all options in a QDict and flatten it */
+v = qobject_output_visitor_new(&obj);
+visit_type_BlockdevOptions(v, NULL, &options, &error_abort);
+visit_complete(v, &obj);
+visit_free(v);


[..]


+# Run x-blockdev-reopen on a list of block devices
+def reopenMultiple(self, opts, errmsg = None):
+result = self.vm.qmp('x-blockdev-reopen', conv_keys = False, options = 
opts)


I don't really care if this doesn't break iotest 297, but PEP8 doesn't like 
whitespaces around '=' for named arguments..


+if errmsg:
+self.assert_qmp(result, 'error/class', 'GenericError')
+self.assert_qmp(result, 'error/desc', errmsg)
+else:
+self.assert_qmp(result, 'return', {})




--
Best regards,
Vladimir



Re: [ovirt-users] Re: Any way to terminate stuck export task

2021-07-06 Thread Nir Soffer
On Tue, Jul 6, 2021 at 5:55 PM Gianluca Cecchi
 wrote:
>
> On Tue, Jul 6, 2021 at 2:52 PM Nir Soffer  wrote:
>
>>
>>
>> Too bad.
>>
>> You can evaluate how ovirt 4.4. will work with this appliance using
>> this dd command:
>>
>> dd if=/dev/zero bs=8M count=38400 of=/path/to/new/disk
>> oflag=direct conv=fsync
>>
>> We don't use dd for this, but the operation is the same on NFS < 4.2.
>>
>
> I confirm I'm able to saturate the 1Gb/s link. tried creating a 10Gb file on 
> the StoreOnce appliance
>  # time dd if=/dev/zero bs=8M count=1280 
> of=/rhev/data-center/mnt/172.16.1.137\:_nas_EXPORT-DOMAIN/ansible_ova/test.img
>  oflag=direct conv=fsync
> 1280+0 records in
> 1280+0 records out
> 10737418240 bytes (11 GB) copied, 98.0172 s, 110 MB/s
>
> real 1m38.035s
> user 0m0.003s
> sys 0m2.366s
>
> So are you saying that after upgrading to 4.4.6 (or just released 4.4.7) I 
> should be able to export with this speed?

The preallocation part will run at the same speed, and then
you need to copy the used parts of the disk, time depending
on how much data is used.

>  Or anyway I do need NFS v4.2?

Without NFS 4.2. With NFS 4.2 the entire allocation will take less than
a second without consuming any network bandwidth.
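
To illustrate the difference at the syscall level, here is a minimal
preallocation sketch (not what qemu-img or vdsm literally run): on an NFS 4.2
mount, fallocate() maps to the server-side ALLOCATE operation and returns
almost immediately; on older NFS it fails, and the only option is to touch
every block, which is what posix_fallocate() falls back to:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <file> <size-bytes>\n", argv[0]);
        return 1;
    }
    off_t size = atoll(argv[2]);
    int fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    if (fallocate(fd, 0, 0, size) == 0) {
        printf("preallocated with fallocate()\n");
    } else {
        /* Fallback: touch every 4k block, like posix_fallocate() does */
        for (off_t off = 0; off < size; off += 4096) {
            if (pwrite(fd, "", 1, off) != 1) {
                perror("pwrite");
                return 1;
            }
        }
        printf("preallocated by writing one byte per 4k block\n");
    }
    close(fd);
    return 0;
}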

> BTW: is there any capping put in place by oVirt on the export phase (the
> qemu-img command in practice)? Designed for example not to perturb the
> activity of the hypervisor? Or do you think that if I have a 10Gb/s network
> backend and powerful disks on oVirt and powerful NFS server processing power
> I should have much more speed?

We don't have any capping in place, usually people complain that copying
images is too slow.

In general when copying to file-based storage we don't use the -W option
(unordered writes), so the copy will be slower compared with block-based
storage, where qemu-img uses 8 concurrent writes. So in a way we always
cap the copies to file-based storage. To get maximum throughput you need
to run multiple copies at the same time.

Nir




Re: [PATCH v5 3/6] block: Acquire AioContexts during bdrv_reopen_multiple()

2021-07-06 Thread Vladimir Sementsov-Ogievskiy

06.07.2021 14:23, Kevin Wolf wrote:

As the BlockReopenQueue can contain nodes in multiple AioContexts, only
one of which may be locked when AIO_WAIT_WHILE() can be called, we can't
let the caller lock the right contexts. Instead, individually lock the
AioContext of a single node when iterating the queue.

Reintroduce bdrv_reopen() as a wrapper for reopening a single node that
drains the node and temporarily drops the AioContext lock for
bdrv_reopen_multiple().

Signed-off-by: Kevin Wolf
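
For illustration, the per-node locking pattern described above looks roughly
like this (a sketch, not the literal hunk from the patch):

/* inside bdrv_reopen_multiple(), roughly: */
QTAILQ_FOREACH(bs_entry, bs_queue, entry) {
    AioContext *ctx = bdrv_get_aio_context(bs_entry->state.bs);

    /* Lock only this node's AioContext while preparing it, so that
     * AIO_WAIT_WHILE() inside the prepare callbacks runs with exactly
     * one context locked. */
    aio_context_acquire(ctx);
    ret = bdrv_reopen_prepare(&bs_entry->state, bs_queue, errp);
    aio_context_release(ctx);

    if (ret < 0) {
        goto abort;
    }
    bs_entry->prepared = true;
}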


I don't have the whole picture in mind. But looks clear:

Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support

2021-07-06 Thread Kevin Wolf
Am 06.07.2021 um 16:55 hat Peter Lieven geschrieben:
> > Am 06.07.2021 um 15:19 schrieb Kevin Wolf :
> > 
> > Am 02.07.2021 um 19:23 hat Ilya Dryomov geschrieben:
> >> This series migrates the qemu rbd driver from the old aio emulation
> >> to native coroutines and adds write zeroes support which is important
> >> for block operations.
> >> 
> >> To achieve this we first bump the librbd requirement to the already
> >> outdated luminous release of ceph to get rid of some wrappers and
> >> ifdef'ry in the code.
> > 
> > Thanks, applied to the block branch.
> > 
> > I've only had a very quick look at the patches, but I think there is one
> > suggestion for a cleanup I can make: The qemu_rbd_finish_bh()
> > indirection is probably unnecessary now because aio_co_wake() is thread
> > safe.
> 
> But this is new, isn’t it?

Not sure in what sense you mean. aio_co_wake() has always been thread
safe, as far as I know.

Obviously, the old code didn't use aio_co_wake(), but directly called
some callbacks, so the part that is new is your coroutine conversion
that enables getting rid of the BH.

> We also have this indirection in iscsi and nfs drivers I think.

Indeed, the resulting code looks the same. In iscsi and nfs it doesn't
come from an incomplete conversion to coroutines, but they both used
qemu_coroutine_enter() originally, which resumes the coroutine in the
current thread...

> Does it matter that the completion callback is called from an librbd
> thread? Will the coroutine continue to run in the right thread?

...whereas aio_co_wake() resumes the coroutine in the thread where it
was running before.

(Before commit 5f50be9b5, this would have been buggy from an librbd
thread, but now it should work correctly even for threads that are
neither iothreads nor vcpu threads.)
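
For illustration, the two completion patterns being compared might look
roughly like this. aio_co_wake() and aio_bh_schedule_oneshot() are the real
qemu APIs; the task struct and the simplified callback signatures are
invented here, and librbd's actual completion callback looks different:

typedef struct RBDTask {
    Coroutine *co;      /* coroutine parked in qemu_coroutine_yield() */
    AioContext *ctx;    /* AioContext the request was issued from */
} RBDTask;

/* Old pattern: bounce through a bottom half in the right AioContext */
static void qemu_rbd_finish_bh(void *opaque)
{
    RBDTask *task = opaque;
    aio_co_wake(task->co);
}

static void completion_with_bh(RBDTask *task)   /* runs in a librbd thread */
{
    aio_bh_schedule_oneshot(task->ctx, qemu_rbd_finish_bh, task);
}

/* Possible cleanup: aio_co_wake() is thread safe and resumes the coroutine
 * in the AioContext it was running in, so the BH indirection can go away. */
static void completion_direct(RBDTask *task)    /* runs in a librbd thread */
{
    aio_co_wake(task->co);
}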

> I will have a decent look after my vacation.

Sounds good, thanks. Enjoy your vacation!

Kevin




Re: [PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support

2021-07-06 Thread Peter Lieven



> Am 06.07.2021 um 15:19 schrieb Kevin Wolf :
> 
> Am 02.07.2021 um 19:23 hat Ilya Dryomov geschrieben:
>> This series migrates the qemu rbd driver from the old aio emulation
>> to native coroutines and adds write zeroes support which is important
>> for block operations.
>> 
>> To achieve this we first bump the librbd requirement to the already
>> outdated luminous release of ceph to get rid of some wrappers and
>> ifdef'ry in the code.
> 
> Thanks, applied to the block branch.
> 
> I've only had a very quick look at the patches, but I think there is one
> suggestion for a cleanup I can make: The qemu_rbd_finish_bh()
> indirection is probably unnecessary now because aio_co_wake() is thread
> safe.

But this is new, isn’t it?

We also have this indirection in iscsi and nfs drivers I think.

Does it matter that the completion callback is called from an librbd thread? 
Will the coroutine continue to run in the right thread?

I will have a decent look after my vacation.

Anyway, Thanks for applying,
Peter

> 
> (Also, if I were the responsible maintainer, I would prefer true/false
> rather than 0/1 for bools, but that's minor. :-))
> 
> Kevin
> 





Re: [ovirt-users] Re: Any way to terminate stuck export task

2021-07-06 Thread Gianluca Cecchi
On Tue, Jul 6, 2021 at 2:52 PM Nir Soffer  wrote:


>
> Too bad.
>
> You can evaluate how ovirt 4.4. will work with this appliance using
> this dd command:
>
> dd if=/dev/zero bs=8M count=38400 of=/path/to/new/disk
> oflag=direct conv=fsync
>
> We don't use dd for this, but the operation is the same on NFS < 4.2.
>
>
I confirm I'm able to saturate the 1Gb/s link. tried creating a 10Gb file
on the StoreOnce appliance
 # time dd if=/dev/zero bs=8M count=1280 of=/rhev/data-center/mnt/
172.16.1.137\:_nas_EXPORT-DOMAIN/ansible_ova/test.img oflag=direct
conv=fsync
1280+0 records in
1280+0 records out
10737418240 bytes (11 GB) copied, 98.0172 s, 110 MB/s

real 1m38.035s
user 0m0.003s
sys 0m2.366s

So are you saying that after upgrading to 4.4.6 (or just released 4.4.7) I
should be able to export with this speed? Or anyway I do need NFS v4.2?
BTW: is there any capping put in place by oVirt on the export phase (the
qemu-img command in practice)? Designed for example not to perturb the
activity of the hypervisor? Or do you think that if I have a 10Gb/s network
backend and powerful disks on oVirt and powerful NFS server processing
power I should have much more speed?


> Based on the 50 MiB/s rate you reported earlier, I guess you have a
> 1Gbit network to
> this appliance, so zeroing can do up to 128 MiB/s, which will take
> about 40 minutes
> for 300G.
>
> Using NFS 4.2, fallocate will complete in less than a second.
>

I can sort of confirm this also for 4.3.10.
I have a test CentOS 7.4 VM configured as NFS server and, if I configure it
as an export domain using the default autonegotiate option, it is
(strangely enough) mounted as NFS v4.1 and the initial fallocate takes some
minutes (55Gb disk).
If I reconfigure it forcing NFS v4.2, the initial fallocate is immediate,
in the sense that "ls -l" on the export domain shows the full size of the
virtual disk almost immediately.

Thanks,
Gianluca


Re: [PATCH v5 1/6] qcow2: Fix dangling pointer after reopen for 'file'

2021-07-06 Thread Kevin Wolf
Am 06.07.2021 um 15:12 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 06.07.2021 14:23, Kevin Wolf wrote:
> > Without an external data file, s->data_file is a second pointer with the
> > same value as bs->file. When changing bs->file to a different BdrvChild
> > and freeing the old BdrvChild, s->data_file must also be updated,
> > otherwise it points to freed memory and causes crashes.
> > 
> > This problem was caught by iotests case 245.
> > 
> > Fixes: df2b7086f169239ebad5d150efa29c9bb6d4f820
> > Signed-off-by: Kevin Wolf 
> 
> Reviewed-by: Vladimir Sementsov-Ogievskiy 
> 
> Still, some ideas below:
> 
> > ---
> >   block/qcow2.c | 14 ++
> >   1 file changed, 14 insertions(+)
> > 
> > diff --git a/block/qcow2.c b/block/qcow2.c
> > index ee4530cdbd..cb459ef6a6 100644
> > --- a/block/qcow2.c
> > +++ b/block/qcow2.c
> > @@ -962,6 +962,7 @@ static bool read_cache_sizes(BlockDriverState *bs, 
> > QemuOpts *opts,
> >   }
> >   typedef struct Qcow2ReopenState {
> > +bool had_data_file;
> >   Qcow2Cache *l2_table_cache;
> >   Qcow2Cache *refcount_block_cache;
> >   int l2_slice_size; /* Number of entries in a slice of the L2 table */
> > @@ -1932,6 +1933,8 @@ static int qcow2_reopen_prepare(BDRVReopenState 
> > *state,
> >   r = g_new0(Qcow2ReopenState, 1);
> >   state->opaque = r;
> > +r->had_data_file = has_data_file(state->bs);
> > +
> 
> So, during reopen, at some moment s->data_file becomes invalid. So, we
> shouldn't rely on it..
> 
> Maybe we need
> 
>s->data_file = NULL;
> 
> here..

"need" is a strong word, but I guess we shouldn't access it between
prepare and commit, so I agree setting it to NULL would make bugs in
this area very visible.

In fact, we wouldn't even need r->had_data_file then because commit
could just set s->data_file = state->bs->file if it's NULL.
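
Roughly, a sketch of that variant (untested, for illustration only):

/* in qcow2_reopen_prepare(): */
BDRVQcow2State *s = state->bs->opaque;

if (!has_data_file(state->bs)) {
    /* data_file just mirrors bs->file; invalidate it until commit */
    s->data_file = NULL;
}

/* in qcow2_reopen_commit(): */
if (!s->data_file) {
    s->data_file = state->bs->file;
}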

> >   ret = qcow2_update_options_prepare(state->bs, r, state->options,
> >  state->flags, errp);
> >   if (ret < 0) {
> > @@ -1966,7 +1969,18 @@ fail:
> >   static void qcow2_reopen_commit(BDRVReopenState *state)
> >   {
> > +BDRVQcow2State *s = state->bs->opaque;
> > +Qcow2ReopenState *r = state->opaque;
> > +
> >   qcow2_update_options_commit(state->bs, state->opaque);
> 
> Worth doing
> 
>assert(r->had_data_file == has_data_file(state->bs));
> 
> here, to be double sure?

This would be wrong because at the time that this runs, state->bs->file
is already updated and this is what has_data_file() checks against. So
you can't use has_data_file() any more until it's synced again with the
code below.

In fact, this is why I even added r->had_data_file. At first I directly
used has_data_file(state->bs) here and watched it break.

> > +if (!r->had_data_file && s->data_file != state->bs->file) {
> > +/*
> > + * If s->data_file is just a second pointer to bs->file (which is 
> > the
> > + * case without an external data file), it may need to be updated.
> > + */
> > +s->data_file = state->bs->file;
> > +assert(!has_data_file(state->bs));
> > +}
> >   g_free(state->opaque);
> >   }

Kevin




Re: [PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support

2021-07-06 Thread Kevin Wolf
Am 02.07.2021 um 19:23 hat Ilya Dryomov geschrieben:
> This series migrates the qemu rbd driver from the old aio emulation
> to native coroutines and adds write zeroes support which is important
> for block operations.
> 
> To achieve this we first bump the librbd requirement to the already
> outdated luminous release of ceph to get rid of some wrappers and
> ifdef'ry in the code.

Thanks, applied to the block branch.

I've only had a very quick look at the patches, but I think there is one
suggestion for a cleanup I can make: The qemu_rbd_finish_bh()
indirection is probably unnecessary now because aio_co_wake() is thread
safe.

(Also, if I were the responsible maintainer, I would prefer true/false
rather than 0/1 for bools, but that's minor. :-))

Kevin




Re: [PATCH v5 1/3] block/file-posix: Optimize for macOS

2021-07-06 Thread Stefan Hajnoczi
On Mon, Jul 05, 2021 at 10:04:56PM +0900, Akihiko Odaki wrote:
> This commit introduces "punch hole" operation and optimizes transfer
> block size for macOS.
> 
> Thanks to Konstantin Nazarov for detailed analysis of a flaw in an
> old version of this change:
> https://gist.github.com/akihikodaki/87df4149e7ca87f18dc56807ec5a1bc5#gistcomment-3654667
> 
> Signed-off-by: Akihiko Odaki 
> ---
>  block/file-posix.c | 27 +--
>  1 file changed, 25 insertions(+), 2 deletions(-)

Thanks, applied to my block tree:
https://gitlab.com/stefanha/qemu/commits/block

Stefan




Re: [PATCH v5 2/6] block: Add bdrv_reopen_queue_free()

2021-07-06 Thread Vladimir Sementsov-Ogievskiy

06.07.2021 14:23, Kevin Wolf wrote:

From: Alberto Garcia

Move the code to free a BlockReopenQueue to a separate function.
It will be used in a subsequent patch.

[ kwolf: Also free explicit_options and options, and explicitly
   qobject_ref() the value when it continues to be used. This avoids
   memory leaks as we saw them in two recent patches. ]


Hmm, not clear from the context which two patches you mean



Signed-off-by: Alberto Garcia
Signed-off-by: Kevin Wolf



Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH v5 1/6] qcow2: Fix dangling pointer after reopen for 'file'

2021-07-06 Thread Vladimir Sementsov-Ogievskiy

06.07.2021 14:23, Kevin Wolf wrote:

Without an external data file, s->data_file is a second pointer with the
same value as bs->file. When changing bs->file to a different BdrvChild
and freeing the old BdrvChild, s->data_file must also be updated,
otherwise it points to freed memory and causes crashes.

This problem was caught by iotests case 245.

Fixes: df2b7086f169239ebad5d150efa29c9bb6d4f820
Signed-off-by: Kevin Wolf 


Reviewed-by: Vladimir Sementsov-Ogievskiy 

Still, some ideas below:


---
  block/qcow2.c | 14 ++
  1 file changed, 14 insertions(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index ee4530cdbd..cb459ef6a6 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -962,6 +962,7 @@ static bool read_cache_sizes(BlockDriverState *bs, QemuOpts 
*opts,
  }
  
  typedef struct Qcow2ReopenState {

+bool had_data_file;
  Qcow2Cache *l2_table_cache;
  Qcow2Cache *refcount_block_cache;
  int l2_slice_size; /* Number of entries in a slice of the L2 table */
@@ -1932,6 +1933,8 @@ static int qcow2_reopen_prepare(BDRVReopenState *state,
  r = g_new0(Qcow2ReopenState, 1);
  state->opaque = r;
  
+r->had_data_file = has_data_file(state->bs);

+


So, during reopen, at some moment s->data_file becomes invalid. So, we shouldn't 
rely on it..

Maybe we need

   s->data_file = NULL;

here..


  ret = qcow2_update_options_prepare(state->bs, r, state->options,
 state->flags, errp);
  if (ret < 0) {
@@ -1966,7 +1969,18 @@ fail:
  
  static void qcow2_reopen_commit(BDRVReopenState *state)

  {
+BDRVQcow2State *s = state->bs->opaque;
+Qcow2ReopenState *r = state->opaque;
+
  qcow2_update_options_commit(state->bs, state->opaque);


Worth doing

   assert(r->had_data_file == has_data_file(state->bs));

here, to be double sure?


+if (!r->had_data_file && s->data_file != state->bs->file) {
+/*
+ * If s->data_file is just a second pointer to bs->file (which is the
+ * case without an external data file), it may need to be updated.
+ */
+s->data_file = state->bs->file;
+assert(!has_data_file(state->bs));
+}
  g_free(state->opaque);
  }
  




--
Best regards,
Vladimir



Re: [ovirt-users] Re: Any way to terminate stuck export task

2021-07-06 Thread Nir Soffer
On Tue, Jul 6, 2021 at 10:21 AM Gianluca Cecchi
 wrote:
>
> On Mon, Jul 5, 2021 at 5:06 PM Nir Soffer  wrote:
>
>>
>>
>> qemu-img is busy in posix_fallocate(), writing one byte to every 4k block.
>>
>> If you add -tt -T (as I suggested), we can see how much time each write 
>> takes,
>> which may explain why this takes so much time.
>>
>> strace -f -p 14342 -tt -T
>>
>
> It seems I missed part of your suggestion... I didn't get the "-tt -T" (or I
> didn't see it...)
>
> With it I get this during the export (the host console shows about 4
> Mbit/s of network traffic):
>
> # strace -f -p 25243 -tt -T
> strace: Process 25243 attached with 2 threads
> [pid 25243] 09:17:32.503907 ppoll([{fd=9, events=POLLIN|POLLERR|POLLHUP}], 1, 
> NULL, NULL, 8 <unfinished ...>
> [pid 25244] 09:17:32.694207 pwrite64(12, "\0", 1, 3773509631) = 1 <0.000059>
> [pid 25244] 09:17:32.694412 pwrite64(12, "\0", 1, 3773513727) = 1 <0.000078>
> [pid 25244] 09:17:32.694608 pwrite64(12, "\0", 1, 3773517823) = 1 <0.000056>
> [pid 25244] 09:17:32.694729 pwrite64(12, "\0", 1, 3773521919) = 1 <0.000024>
> [pid 25244] 09:17:32.694796 pwrite64(12, "\0", 1, 3773526015) = 1 <0.000020>
> [pid 25244] 09:17:32.694855 pwrite64(12, "\0", 1, 3773530111) = 1 <0.000015>
> [pid 25244] 09:17:32.694908 pwrite64(12, "\0", 1, 3773534207) = 1 <0.000014>
> [pid 25244] 09:17:32.694950 pwrite64(12, "\0", 1, 3773538303) = 1 <0.000016>
> [pid 25244] 09:17:32.694993 pwrite64(12, "\0", 1, 3773542399) = 1 <0.200032>
> [pid 25244] 09:17:32.895140 pwrite64(12, "\0", 1, 3773546495) = 1 <0.000034>
> [pid 25244] 09:17:32.895227 pwrite64(12, "\0", 1, 3773550591) = 1 <0.000029>
> [pid 25244] 09:17:32.895296 pwrite64(12, "\0", 1, 3773554687) = 1 <0.000024>
> [pid 25244] 09:17:32.895353 pwrite64(12, "\0", 1, 3773558783) = 1 <0.000016>
> [pid 25244] 09:17:32.895400 pwrite64(12, "\0", 1, 3773562879) = 1 <0.000015>
> [pid 25244] 09:17:32.895443 pwrite64(12, "\0", 1, 3773566975) = 1 <0.000015>
> [pid 25244] 09:17:32.895485 pwrite64(12, "\0", 1, 3773571071) = 1 <0.000015>
> [pid 25244] 09:17:32.895527 pwrite64(12, "\0", 1, 3773575167) = 1 <0.000017>
> [pid 25244] 09:17:32.895570 pwrite64(12, "\0", 1, 3773579263) = 1 <0.199493>
> [pid 25244] 09:17:33.095147 pwrite64(12, "\0", 1, 3773583359) = 1 <0.000031>
> [pid 25244] 09:17:33.095262 pwrite64(12, "\0", 1, 3773587455) = 1 <0.000061>
> [pid 25244] 09:17:33.095378 pwrite64(12, "\0", 1, 3773591551) = 1 <0.000027>
> [pid 25244] 09:17:33.095445 pwrite64(12, "\0", 1, 3773595647) = 1 <0.000021>
> [pid 25244] 09:17:33.095498 pwrite64(12, "\0", 1, 3773599743) = 1 <0.000016>
> [pid 25244] 09:17:33.095542 pwrite64(12, "\0", 1, 3773603839) = 1 <0.000014>

Most writes are pretty fast, but from time to time there is a very slow write.

From the small sample you posted, we have:

awk '{print $11}' strace.out | sed -e 's/[<>]//g' | awk
'{sum+=$1; if ($1 < 0.1) {fast+=$1; fast_nr++} else {slow+=$1;
slow_nr++}} END{printf "average: %.6f slow: %.6f fast: %.6f\n",
sum/NR, slow/slow_nr, fast/fast_nr}'
average: 0.016673 slow: 0.199763 fast: 0.000028

Preallocating a 300 GiB disk will take about 15 days :-)

>>> 300*1024**3 / 4096 * 0.016673 / 3600 / 24
15.176135

If all writes would be fast, it will take less than an hour:

>>> 300*1024**3 / 4096 * 0.000028 / 3600
0.61166933

> . . .
>
> BTW: it seems my NAS appliance doesn't support 4.2 version of NFS, because if 
> I force it, I then get an error in mount and in engine.log this error for 
> both nodes as they try to mount:
>
> 2021-07-05 17:01:56,082+02 ERROR 
> [org.ovirt.engine.core.bll.storage.connection.FileStorageHelper] 
> (EE-ManagedThreadFactory-engine-Thread-2554190) [642eb6be] The connection 
> with details '172.16.1.137:/nas/EXPORT-DOMAIN' failed because of error code 
> '477' and error message is: problem while trying to mount target
>
>
> and in vdsm.log:
> MountError: (32, ';mount.nfs: Protocol not supported\n')

Too bad.

You can evaluate how ovirt 4.4. will work with this appliance using
this dd command:

dd if=/dev/zero bs=8M count=38400 of=/path/to/new/disk
oflag=direct conv=fsync

We don't use dd for this, but the operation is the same on NFS < 4.2.

Based on the 50 MiB/s rate you reported earlier, I guess you have a
1Gbit network to
this appliance, so zeroing can do up to 128 MiB/s, which will take
about 40 minutes
for 300G.

Using NFS 4.2, fallocate will complete in less than a second.

Here is example from my test system, creating 90g raw preallocated volume:

2021-07-06 15:46:40,382+0300 INFO  (tasks/1) [storage.Volume] Request
to create RAW volume /rhev/data-center/mnt/storage2:_exp
ort_00/a600ba04-34f9-4793-a5dc-6d4150716d14/images/bcf7c623-8fd8-47b3-aaee-a65c0872536d/82def38d-b41b-4126-826e-0513d669f1b5
with capacity = 96636764160 (fileVolume:493)
...
2021-07-06 15:46:40,447+0300 INFO  (tasks/1) [storage.Volume]
Preallocating volume
/rhev/data-center/mnt/storage2:_export_00/a600ba04-34f9-4793-a5dc-6d4150716d14/images/bcf7c623-8fd8-47b3-aaee-a65c0872536d/82def38d-b41b-4126-826e-0513d66

Re: [PATCH]: /hw/nvme/ctrl error handling if descriptors are greater than 1024

2021-07-06 Thread Klaus Jensen

Hi Padmakar,

Patch looks good, but it got messed up by your MUA. You've previously 
contributed correctly formatted patches, so please revert to that method 
when sending patches in the future ;)


Reviewed-by: Klaus Jensen 

On Jul  6 16:13, Padmakar Kalghatgi wrote:

From: padmakar

if the number of descriptors or pages is more than 1024,
dma writes or reads will result in failure. Hence, we check
if the number of descriptors or pages is more than 1024
in the nvme module and return Internal Device error.

Signed-off-by: Padmakar Kalghatgi
---
hw/nvme/ctrl.c   | 14 ++
hw/nvme/trace-events |  1 +
2 files changed, 15 insertions(+)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 40a7efc..082592f 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -602,6 +602,20 @@ static uint16_t nvme_map_addr(NvmeCtrl *n, NvmeSg *sg, hwaddr addr, size_t len)
return NVME_SUCCESS;
}

+/*
+ *  The QEMU has an inherent issue where in if the no.
+ *  of descriptors is more than 1024, it will result in
+ *  failure during the dma write or reads. Hence, we need
+ *  to return the error.
+ */
+
+if (((sg->flags & NVME_SG_DMA) && ((sg->qsg.nsg + 1) > IOV_MAX)) ||
+((sg->iov.niov + 1) > IOV_MAX)) {
+NVME_GUEST_ERR(pci_nvme_ub_sg_desc_toohigh,
+   "number of descriptors is greater than 1024");
+return NVME_INTERNAL_DEV_ERROR;
+}
+
trace_pci_nvme_map_addr(addr, len);

if (nvme_addr_is_cmb(n, addr)) {

diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index ea33d0c..bfe1a3b 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -202,3 +202,4 @@ pci_nvme_ub_db_wr_invalid_cqhead(uint32_t qid, uint16_t new_head) "completion qu
pci_nvme_ub_db_wr_invalid_sq(uint32_t qid) "submission queue doorbell write for nonexistent queue, sqid=%"PRIu32", ignoring"
pci_nvme_ub_db_wr_invalid_sqtail(uint32_t qid, uint16_t new_tail) "submission queue doorbell write value beyond queue size, sqid=%"PRIu32", new_head=%"PRIu16", ignoring"
pci_nvme_ub_unknown_css_value(void) "unknown value in cc.css field"
+pci_nvme_ub_sg_desc_toohigh(void) "the number of sg descriptors is too high"

--
2.7.0.windows.1







[PATCH v5 6/6] block: Make blockdev-reopen stable API

2021-07-06 Thread Kevin Wolf
From: Alberto Garcia 

This patch drops the 'x-' prefix from x-blockdev-reopen.

Signed-off-by: Alberto Garcia 
Signed-off-by: Kevin Wolf 
---
 qapi/block-core.json|  6 +++---
 blockdev.c  |  2 +-
 tests/qemu-iotests/155  |  2 +-
 tests/qemu-iotests/165  |  2 +-
 tests/qemu-iotests/245  | 10 +-
 tests/qemu-iotests/248  |  2 +-
 tests/qemu-iotests/248.out  |  2 +-
 tests/qemu-iotests/296  |  2 +-
 tests/qemu-iotests/298  |  2 +-
 tests/qemu-iotests/tests/remove-bitmap-from-backing |  4 ++--
 10 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 052520331e..2eb399f0d4 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -4219,7 +4219,7 @@
 { 'command': 'blockdev-add', 'data': 'BlockdevOptions', 'boxed': true }
 
 ##
-# @x-blockdev-reopen:
+# @blockdev-reopen:
 #
 # Reopens one or more block devices using the given set of options.
 # Any option not specified will be reset to its default value regardless
@@ -4257,9 +4257,9 @@
 # image does not have a default backing file name as part of its
 # metadata.
 #
-# Since: 4.0
+# Since: 6.0
 ##
-{ 'command': 'x-blockdev-reopen',
+{ 'command': 'blockdev-reopen',
   'data': { 'options': ['BlockdevOptions'] } }
 
 ##
diff --git a/blockdev.c b/blockdev.c
index 1e8c946828..7639f2108e 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3560,7 +3560,7 @@ fail:
 visit_free(v);
 }
 
-void qmp_x_blockdev_reopen(BlockdevOptionsList *reopen_list, Error **errp)
+void qmp_blockdev_reopen(BlockdevOptionsList *reopen_list, Error **errp)
 {
 BlockReopenQueue *queue = NULL;
 GSList *drained = NULL;
diff --git a/tests/qemu-iotests/155 b/tests/qemu-iotests/155
index 3400b0312a..fec43d662d 100755
--- a/tests/qemu-iotests/155
+++ b/tests/qemu-iotests/155
@@ -261,7 +261,7 @@ class TestBlockdevMirrorReopen(MirrorBaseClass):
 result = self.vm.qmp('blockdev-add', node_name="backing",
  driver="null-co")
 self.assert_qmp(result, 'return', {})
-result = self.vm.qmp('x-blockdev-reopen', options = [{
+result = self.vm.qmp('blockdev-reopen', options = [{
  'node-name': "target",
  'driver': iotests.imgfmt,
  'file': "target-file",
diff --git a/tests/qemu-iotests/165 b/tests/qemu-iotests/165
index 57aa88ecae..92a431315b 100755
--- a/tests/qemu-iotests/165
+++ b/tests/qemu-iotests/165
@@ -137,7 +137,7 @@ class TestPersistentDirtyBitmap(iotests.QMPTestCase):
 assert sha256_1 == self.getSha256()
 
 # Reopen to RW
-result = self.vm.qmp('x-blockdev-reopen', options = [{
+result = self.vm.qmp('blockdev-reopen', options = [{
 'node-name': 'node0',
 'driver': iotests.imgfmt,
 'file': {
diff --git a/tests/qemu-iotests/245 b/tests/qemu-iotests/245
index 6eff352099..28a116a6aa 100755
--- a/tests/qemu-iotests/245
+++ b/tests/qemu-iotests/245
@@ -1,7 +1,7 @@
 #!/usr/bin/env python3
 # group: rw
 #
-# Test cases for the QMP 'x-blockdev-reopen' command
+# Test cases for the QMP 'blockdev-reopen' command
 #
 # Copyright (C) 2018-2019 Igalia, S.L.
 # Author: Alberto Garcia 
@@ -85,16 +85,16 @@ class TestBlockdevReopen(iotests.QMPTestCase):
  "Expected output of %d qemu-io commands, found %d" %
  (found, self.total_io_cmds))
 
-# Run x-blockdev-reopen on a list of block devices
+# Run blockdev-reopen on a list of block devices
 def reopenMultiple(self, opts, errmsg = None):
-result = self.vm.qmp('x-blockdev-reopen', conv_keys = False, options = 
opts)
+result = self.vm.qmp('blockdev-reopen', conv_keys = False, options = 
opts)
 if errmsg:
 self.assert_qmp(result, 'error/class', 'GenericError')
 self.assert_qmp(result, 'error/desc', errmsg)
 else:
 self.assert_qmp(result, 'return', {})
 
-# Run x-blockdev-reopen on a single block device (specified by
+# Run blockdev-reopen on a single block device (specified by
 # 'opts') but applying 'newopts' on top of it. The original 'opts'
 # dict is unmodified
 def reopen(self, opts, newopts = {}, errmsg = None):
@@ -161,7 +161,7 @@ class TestBlockdevReopen(iotests.QMPTestCase):
 self.reopen(opts, {'file.locking': 'off'}, "Cannot change the option 
'locking'")
 self.reopen(opts, {'file.filename': None}, "Invalid parameter type for 
'options[0].file.filename', expected: string")
 
-# node-name is optional in BlockdevOptions, but x-blockdev-reopen 
needs it
+# node-name is optional in BlockdevOptions, but blo

[PATCH v5 1/6] qcow2: Fix dangling pointer after reopen for 'file'

2021-07-06 Thread Kevin Wolf
Without an external data file, s->data_file is a second pointer with the
same value as bs->file. When changing bs->file to a different BdrvChild
and freeing the old BdrvChild, s->data_file must also be updated,
otherwise it points to freed memory and causes crashes.

This problem was caught by iotests case 245.

Fixes: df2b7086f169239ebad5d150efa29c9bb6d4f820
Signed-off-by: Kevin Wolf 
---
 block/qcow2.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index ee4530cdbd..cb459ef6a6 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -962,6 +962,7 @@ static bool read_cache_sizes(BlockDriverState *bs, QemuOpts 
*opts,
 }
 
 typedef struct Qcow2ReopenState {
+bool had_data_file;
 Qcow2Cache *l2_table_cache;
 Qcow2Cache *refcount_block_cache;
 int l2_slice_size; /* Number of entries in a slice of the L2 table */
@@ -1932,6 +1933,8 @@ static int qcow2_reopen_prepare(BDRVReopenState *state,
 r = g_new0(Qcow2ReopenState, 1);
 state->opaque = r;
 
+r->had_data_file = has_data_file(state->bs);
+
 ret = qcow2_update_options_prepare(state->bs, r, state->options,
state->flags, errp);
 if (ret < 0) {
@@ -1966,7 +1969,18 @@ fail:
 
 static void qcow2_reopen_commit(BDRVReopenState *state)
 {
+BDRVQcow2State *s = state->bs->opaque;
+Qcow2ReopenState *r = state->opaque;
+
 qcow2_update_options_commit(state->bs, state->opaque);
+if (!r->had_data_file && s->data_file != state->bs->file) {
+/*
+ * If s->data_file is just a second pointer to bs->file (which is the
+ * case without an external data file), it may need to be updated.
+ */
+s->data_file = state->bs->file;
+assert(!has_data_file(state->bs));
+}
 g_free(state->opaque);
 }
 
-- 
2.31.1




[PATCH v5 2/6] block: Add bdrv_reopen_queue_free()

2021-07-06 Thread Kevin Wolf
From: Alberto Garcia 

Move the code to free a BlockReopenQueue to a separate function.
It will be used in a subsequent patch.

[ kwolf: Also free explicit_options and options, and explicitly
  qobject_ref() the value when it continues to be used. This avoids
  memory leaks as we saw them in two recent patches. ]

Signed-off-by: Alberto Garcia 
Signed-off-by: Kevin Wolf 
---
 include/block/block.h |  1 +
 block.c   | 22 --
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 7ec77ecb1a..6d42992985 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -386,6 +386,7 @@ BlockDriverState *bdrv_new_open_driver(BlockDriver *drv, 
const char *node_name,
 BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
 BlockDriverState *bs, QDict *options,
 bool keep_old_opts);
+void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue);
 int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp);
 int bdrv_reopen_set_read_only(BlockDriverState *bs, bool read_only,
   Error **errp);
diff --git a/block.c b/block.c
index acd35cb0cb..ee9b46e95e 100644
--- a/block.c
+++ b/block.c
@@ -4095,6 +4095,19 @@ BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue 
*bs_queue,
NULL, 0, keep_old_opts);
 }
 
+void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue)
+{
+if (bs_queue) {
+BlockReopenQueueEntry *bs_entry, *next;
+QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
+qobject_unref(bs_entry->state.explicit_options);
+qobject_unref(bs_entry->state.options);
+g_free(bs_entry);
+}
+g_free(bs_queue);
+}
+}
+
 /*
  * Reopen multiple BlockDriverStates atomically & transactionally.
  *
@@ -4197,15 +4210,10 @@ abort:
 if (bs_entry->prepared) {
 bdrv_reopen_abort(&bs_entry->state);
 }
-qobject_unref(bs_entry->state.explicit_options);
-qobject_unref(bs_entry->state.options);
 }
 
 cleanup:
-QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
-g_free(bs_entry);
-}
-g_free(bs_queue);
+bdrv_reopen_queue_free(bs_queue);
 
 return ret;
 }
@@ -4573,6 +4581,8 @@ static void bdrv_reopen_commit(BDRVReopenState 
*reopen_state)
 /* set BDS specific flags now */
 qobject_unref(bs->explicit_options);
 qobject_unref(bs->options);
+qobject_ref(reopen_state->explicit_options);
+qobject_ref(reopen_state->options);
 
 bs->explicit_options   = reopen_state->explicit_options;
 bs->options= reopen_state->options;
-- 
2.31.1




[PATCH v5 0/6] Make blockdev-reopen stable

2021-07-06 Thread Kevin Wolf
This series picks up the remaining patches from Berto's series "[PATCH
v4 0/6] Allow changing bs->file on reopen", which are not merged into
master yet.

Apart from renaming 'x-blockdev-reopen' into 'blockdev-reopen', the
remaining functional change in this series is taking a list of nodes to
reopen as an argument so that multiple changes to the graph can be made
atomically that would be invalid separately (e.g. due to permission
checks on the intermediate state).

It also contains a qcow2 fix for a bug introduced by the part of the
series that was already picked up in Vladimir's "[PATCH v6 0/9] Allow
changing bs->file on reopen".

Alberto Garcia (4):
  block: Add bdrv_reopen_queue_free()
  block: Support multiple reopening with x-blockdev-reopen
  iotests: Test reopening multiple devices at the same time
  block: Make blockdev-reopen stable API

Kevin Wolf (2):
  qcow2: Fix dangling pointer after reopen for 'file'
  block: Acquire AioContexts during bdrv_reopen_multiple()

 qapi/block-core.json  | 24 +++---
 include/block/block.h |  3 +
 block.c   | 71 +
 block/qcow2.c | 14 
 block/replication.c   |  7 ++
 blockdev.c| 76 ++
 qemu-io-cmds.c|  7 +-
 tests/qemu-iotests/155|  9 ++-
 tests/qemu-iotests/165|  4 +-
 tests/qemu-iotests/245| 78 +++
 tests/qemu-iotests/245.out|  4 +-
 tests/qemu-iotests/248|  4 +-
 tests/qemu-iotests/248.out|  2 +-
 tests/qemu-iotests/296| 11 ++-
 tests/qemu-iotests/298|  4 +-
 .../tests/remove-bitmap-from-backing  | 22 +++---
 16 files changed, 240 insertions(+), 100 deletions(-)

-- 
2.31.1




[PATCH v5 4/6] block: Support multiple reopening with x-blockdev-reopen

2021-07-06 Thread Kevin Wolf
From: Alberto Garcia 

[ kwolf: Fixed AioContext locking ]

Signed-off-by: Alberto Garcia 
Signed-off-by: Kevin Wolf 
---
 qapi/block-core.json  | 18 +++--
 blockdev.c| 81 ++-
 tests/qemu-iotests/155|  9 ++-
 tests/qemu-iotests/165|  4 +-
 tests/qemu-iotests/245| 27 ---
 tests/qemu-iotests/248|  2 +-
 tests/qemu-iotests/248.out|  2 +-
 tests/qemu-iotests/296|  9 ++-
 tests/qemu-iotests/298|  4 +-
 .../tests/remove-bitmap-from-backing  | 18 +++--
 10 files changed, 99 insertions(+), 75 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 4a46552816..052520331e 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -4221,13 +4221,15 @@
 ##
 # @x-blockdev-reopen:
 #
-# Reopens a block device using the given set of options. Any option
-# not specified will be reset to its default value regardless of its
-# previous status. If an option cannot be changed or a particular
+# Reopens one or more block devices using the given set of options.
+# Any option not specified will be reset to its default value regardless
+# of its previous status. If an option cannot be changed or a particular
 # driver does not support reopening then the command will return an
-# error.
+# error. All devices in the list are reopened in one transaction, so
+# if one of them fails then the whole transaction is cancelled.
 #
-# The top-level @node-name option (from BlockdevOptions) must be
+# The command receives a list of block devices to reopen. For each one
+# of them, the top-level @node-name option (from BlockdevOptions) must be
 # specified and is used to select the block device to be reopened.
 # Other @node-name options must be either omitted or set to the
 # current name of the appropriate node. This command won't change any
@@ -4247,8 +4249,8 @@
 #
 #  4) NULL: the current child (if any) is detached.
 #
-# Options (1) and (2) are supported in all cases, but at the moment
-# only @backing allows replacing or detaching an existing child.
+# Options (1) and (2) are supported in all cases. Option (3) is
+# supported for @file and @backing, and option (4) for @backing only.
 #
 # Unlike with blockdev-add, the @backing option must always be present
 # unless the node being reopened does not have a backing file and its
@@ -4258,7 +4260,7 @@
 # Since: 4.0
 ##
 { 'command': 'x-blockdev-reopen',
-  'data': 'BlockdevOptions', 'boxed': true }
+  'data': { 'options': ['BlockdevOptions'] } }
 
 ##
 # @blockdev-del:
diff --git a/blockdev.c b/blockdev.c
index f657d090d3..1e8c946828 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3560,51 +3560,60 @@ fail:
 visit_free(v);
 }
 
-void qmp_x_blockdev_reopen(BlockdevOptions *options, Error **errp)
-{
-BlockDriverState *bs;
-AioContext *ctx;
-QObject *obj;
-Visitor *v = qobject_output_visitor_new(&obj);
-BlockReopenQueue *queue;
-QDict *qdict;
-
-/* Check for the selected node name */
-if (!options->has_node_name) {
-error_setg(errp, "node-name not specified");
-goto fail;
-}
+void qmp_x_blockdev_reopen(BlockdevOptionsList *reopen_list, Error **errp)
+{
+BlockReopenQueue *queue = NULL;
+GSList *drained = NULL;
+
+/* Add each one of the BDS that we want to reopen to the queue */
+for (; reopen_list != NULL; reopen_list = reopen_list->next) {
+BlockdevOptions *options = reopen_list->value;
+BlockDriverState *bs;
+AioContext *ctx;
+QObject *obj;
+Visitor *v;
+QDict *qdict;
+
+/* Check for the selected node name */
+if (!options->has_node_name) {
+error_setg(errp, "node-name not specified");
+goto fail;
+}
 
-bs = bdrv_find_node(options->node_name);
-if (!bs) {
-error_setg(errp, "Failed to find node with node-name='%s'",
+bs = bdrv_find_node(options->node_name);
+if (!bs) {
+error_setg(errp, "Failed to find node with node-name='%s'",
options->node_name);
-goto fail;
-}
+goto fail;
+}
 
-/* Put all options in a QDict and flatten it */
-visit_type_BlockdevOptions(v, NULL, &options, &error_abort);
-visit_complete(v, &obj);
-qdict = qobject_to(QDict, obj);
+/* Put all options in a QDict and flatten it */
+v = qobject_output_visitor_new(&obj);
+visit_type_BlockdevOptions(v, NULL, &options, &error_abort);
+visit_complete(v, &obj);
+visit_free(v);
 
-qdict_flatten(qdict);
+qdict = qobject_to(QDict, obj);
 
-/* Perform the reopen operation */
-ctx = bdrv_get_aio_context(bs);
-aio_context_acquire(ctx);
-bdrv_subtree_drained_begin(bs);
-aio_context_release(ctx);
+qdict_flatten(qdict);
 

[PATCH v5 3/6] block: Acquire AioContexts during bdrv_reopen_multiple()

2021-07-06 Thread Kevin Wolf
As the BlockReopenQueue can contain nodes in multiple AioContexts, and
AIO_WAIT_WHILE() may only be called with at most one of them locked, we
can't let the caller lock the right contexts. Instead, individually lock
the AioContext of each node while iterating the queue.

Reintroduce bdrv_reopen() as a wrapper for reopening a single node that
drains the node and temporarily drops the AioContext lock for
bdrv_reopen_multiple().

Signed-off-by: Kevin Wolf 
---
 include/block/block.h |  2 ++
 block.c   | 49 ---
 block/replication.c   |  7 +++
 blockdev.c|  5 +
 qemu-io-cmds.c|  7 +--
 5 files changed, 57 insertions(+), 13 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 6d42992985..3477290f9a 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -388,6 +388,8 @@ BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue 
*bs_queue,
 bool keep_old_opts);
 void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue);
 int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp);
+int bdrv_reopen(BlockDriverState *bs, QDict *opts, bool keep_old_opts,
+Error **errp);
 int bdrv_reopen_set_read_only(BlockDriverState *bs, bool read_only,
   Error **errp);
 int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
diff --git a/block.c b/block.c
index ee9b46e95e..06fb69df5c 100644
--- a/block.c
+++ b/block.c
@@ -4124,19 +4124,26 @@ void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue)
  *
  * All affected nodes must be drained between bdrv_reopen_queue() and
  * bdrv_reopen_multiple().
+ *
+ * To be called from the main thread, with all other AioContexts unlocked.
  */
 int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp)
 {
 int ret = -1;
 BlockReopenQueueEntry *bs_entry, *next;
+AioContext *ctx;
 Transaction *tran = tran_new();
 g_autoptr(GHashTable) found = NULL;
 g_autoptr(GSList) refresh_list = NULL;
 
+assert(qemu_get_current_aio_context() == qemu_get_aio_context());
 assert(bs_queue != NULL);
 
 QTAILQ_FOREACH(bs_entry, bs_queue, entry) {
+ctx = bdrv_get_aio_context(bs_entry->state.bs);
+aio_context_acquire(ctx);
 ret = bdrv_flush(bs_entry->state.bs);
+aio_context_release(ctx);
 if (ret < 0) {
 error_setg_errno(errp, -ret, "Error flushing drive");
 goto abort;
@@ -4145,7 +4152,10 @@ int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, 
Error **errp)
 
 QTAILQ_FOREACH(bs_entry, bs_queue, entry) {
 assert(bs_entry->state.bs->quiesce_counter > 0);
+ctx = bdrv_get_aio_context(bs_entry->state.bs);
+aio_context_acquire(ctx);
 ret = bdrv_reopen_prepare(&bs_entry->state, bs_queue, tran, errp);
+aio_context_release(ctx);
 if (ret < 0) {
 goto abort;
 }
@@ -4188,7 +4198,10 @@ int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, 
Error **errp)
  * to first element.
  */
 QTAILQ_FOREACH_REVERSE(bs_entry, bs_queue, entry) {
+ctx = bdrv_get_aio_context(bs_entry->state.bs);
+aio_context_acquire(ctx);
 bdrv_reopen_commit(&bs_entry->state);
+aio_context_release(ctx);
 }
 
 tran_commit(tran);
@@ -4197,7 +4210,10 @@ int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, 
Error **errp)
 BlockDriverState *bs = bs_entry->state.bs;
 
 if (bs->drv->bdrv_reopen_commit_post) {
+ctx = bdrv_get_aio_context(bs);
+aio_context_acquire(ctx);
 bs->drv->bdrv_reopen_commit_post(&bs_entry->state);
+aio_context_release(ctx);
 }
 }
 
@@ -4208,7 +4224,10 @@ abort:
 tran_abort(tran);
 QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
 if (bs_entry->prepared) {
+ctx = bdrv_get_aio_context(bs_entry->state.bs);
+aio_context_acquire(ctx);
 bdrv_reopen_abort(&bs_entry->state);
+aio_context_release(ctx);
 }
 }
 
@@ -4218,23 +4237,39 @@ cleanup:
 return ret;
 }
 
-int bdrv_reopen_set_read_only(BlockDriverState *bs, bool read_only,
-  Error **errp)
+int bdrv_reopen(BlockDriverState *bs, QDict *opts, bool keep_old_opts,
+Error **errp)
 {
-int ret;
+AioContext *ctx = bdrv_get_aio_context(bs);
 BlockReopenQueue *queue;
-QDict *opts = qdict_new();
-
-qdict_put_bool(opts, BDRV_OPT_READ_ONLY, read_only);
+int ret;
 
 bdrv_subtree_drained_begin(bs);
-queue = bdrv_reopen_queue(NULL, bs, opts, true);
+if (ctx != qemu_get_aio_context()) {
+aio_context_release(ctx);
+}
+
+queue = bdrv_reopen_queue(NULL, bs, opts, keep_old_opts);
 ret = bdrv_reopen_multiple(queue, errp);
+
+if (ctx != qemu_get_aio_context()) {
+aio_context_acquire(ctx);
+}
 bdrv_subtree_drained_end(b

[PATCH v5 5/6] iotests: Test reopening multiple devices at the same time

2021-07-06 Thread Kevin Wolf
From: Alberto Garcia 

This test swaps the images used by two active block devices.

This is now possible thanks to the new ability to run
x-blockdev-reopen on multiple devices at the same time.

Signed-off-by: Alberto Garcia 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/245 | 47 ++
 tests/qemu-iotests/245.out |  4 ++--
 2 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/245 b/tests/qemu-iotests/245
index b81fde8f12..6eff352099 100755
--- a/tests/qemu-iotests/245
+++ b/tests/qemu-iotests/245
@@ -649,6 +649,53 @@ class TestBlockdevReopen(iotests.QMPTestCase):
  '-c', 'read -P 0x40 0x40008 1',
  '-c', 'read -P 0x80 0x40010 1', 
hd_path[0])
 
+# Swap the disk images of two active block devices
+def test_swap_files(self):
+# Add hd0 and hd2 (none of them with backing files)
+opts0 = hd_opts(0)
+result = self.vm.qmp('blockdev-add', conv_keys = False, **opts0)
+self.assert_qmp(result, 'return', {})
+
+opts2 = hd_opts(2)
+result = self.vm.qmp('blockdev-add', conv_keys = False, **opts2)
+self.assert_qmp(result, 'return', {})
+
+# Write different data to both block devices
+self.run_qemu_io("hd0", "write -P 0xa0 0 1k")
+self.run_qemu_io("hd2", "write -P 0xa2 0 1k")
+
+# Check that the data reads correctly
+self.run_qemu_io("hd0", "read  -P 0xa0 0 1k")
+self.run_qemu_io("hd2", "read  -P 0xa2 0 1k")
+
+# It's not possible to make a block device use an image that
+# is already being used by the other device.
+self.reopen(opts0, {'file': 'hd2-file'},
+"Permission conflict on node 'hd2-file': permissions "
+"'write, resize' are both required by node 'hd2' (uses "
+"node 'hd2-file' as 'file' child) and unshared by node "
+"'hd0' (uses node 'hd2-file' as 'file' child).")
+self.reopen(opts2, {'file': 'hd0-file'},
+"Permission conflict on node 'hd0-file': permissions "
+"'write, resize' are both required by node 'hd0' (uses "
+"node 'hd0-file' as 'file' child) and unshared by node "
+"'hd2' (uses node 'hd0-file' as 'file' child).")
+
+# But we can swap the images if we reopen both devices at the
+# same time
+opts0['file'] = 'hd2-file'
+opts2['file'] = 'hd0-file'
+self.reopenMultiple([opts0, opts2])
+self.run_qemu_io("hd0", "read  -P 0xa2 0 1k")
+self.run_qemu_io("hd2", "read  -P 0xa0 0 1k")
+
+# And we can of course come back to the original state
+opts0['file'] = 'hd0-file'
+opts2['file'] = 'hd2-file'
+self.reopenMultiple([opts0, opts2])
+self.run_qemu_io("hd0", "read  -P 0xa0 0 1k")
+self.run_qemu_io("hd2", "read  -P 0xa2 0 1k")
+
 # Misc reopen tests with different block drivers
 @iotests.skip_if_unsupported(['quorum', 'throttle'])
 def test_misc_drivers(self):
diff --git a/tests/qemu-iotests/245.out b/tests/qemu-iotests/245.out
index daf1e51922..4eced19294 100644
--- a/tests/qemu-iotests/245.out
+++ b/tests/qemu-iotests/245.out
@@ -17,8 +17,8 @@ read 1/1 bytes at offset 262152
 read 1/1 bytes at offset 262160
 1 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
-..
+...
 --
-Ran 24 tests
+Ran 25 tests
 
 OK
-- 
2.31.1




[PATCH]: /hw/nvme/ctrl error handling if descriptors are greater than 1024

2021-07-06 Thread Padmakar Kalghatgi
From: padmakar

 

If the number of descriptors or pages is more than 1024,
DMA writes or reads will result in failure. Hence, check
whether the number of descriptors or pages exceeds 1024
in the nvme module and return an Internal Device Error.

Signed-off-by: Padmakar Kalghatgi
---
hw/nvme/ctrl.c   | 14 ++
hw/nvme/trace-events |  1 +
2 files changed, 15 insertions(+)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 40a7efc..082592f 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -602,6 +602,20 @@ static uint16_t nvme_map_addr(NvmeCtrl *n, NvmeSg *sg, hwaddr addr, size_t len)
     return NVME_SUCCESS;
 }
+/*
+ * QEMU has an inherent issue where, if the number of
+ * descriptors is more than 1024, it will result in
+ * failure during the DMA writes or reads. Hence, we need
+ * to return the error.
+ */
+
+    if (((sg->flags & NVME_SG_DMA) && ((sg->qsg.nsg + 1) > IOV_MAX)) ||
+        ((sg->iov.niov + 1) > IOV_MAX)) {
+        NVME_GUEST_ERR(pci_nvme_ub_sg_desc_toohigh,
+                       "number of descriptors is greater than 1024");
+        return NVME_INTERNAL_DEV_ERROR;
+    }
+
     trace_pci_nvme_map_addr(addr, len);
     if (nvme_addr_is_cmb(n, addr)) {
diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index ea33d0c..bfe1a3b 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -202,3 +202,4 @@ pci_nvme_ub_db_wr_invalid_cqhead(uint32_t qid, uint16_t new_head) "completion qu
 pci_nvme_ub_db_wr_invalid_sq(uint32_t qid) "submission queue doorbell write for nonexistent queue, sqid=%"PRIu32", ignoring"
 pci_nvme_ub_db_wr_invalid_sqtail(uint32_t qid, uint16_t new_tail) "submission queue doorbell write value beyond queue size, sqid=%"PRIu32", new_head=%"PRIu16", ignoring"
 pci_nvme_ub_unknown_css_value(void) "unknown value in cc.css field"
+pci_nvme_ub_sg_desc_toohigh(void) "the number of sg descriptors is too high"
-- 
2.7.0.windows.1

 



Re: [PATCH 07/10] iotests/297: return error code from run_linters()

2021-07-06 Thread Vladimir Sementsov-Ogievskiy

25.06.2021 21:20, John Snow wrote:

This turns run_linters() into a bit of a hybrid test; returning non-zero
on failed execution while also printing diffable information. This is
done for the benefit of the avocado simple test runner, which will soon
be attempting to execute this test from a different environment.

(Note: universal_newlines is added to the pylint invocation for type
consistency with the mypy run -- it's not strictly necessary, but it
avoids some typing errors caused by our re-use of the 'p' variable.)

Signed-off-by: John Snow 
---
  tests/qemu-iotests/297 | 10 --
  1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/297 b/tests/qemu-iotests/297
index 1e8334d1d4..7db1f9ed45 100755
--- a/tests/qemu-iotests/297
+++ b/tests/qemu-iotests/297
@@ -68,19 +68,22 @@ def run_linters(
  files: List[str],
  directory: str = '.',
  env: Optional[Mapping[str, str]] = None,
-) -> None:
+) -> int:
+ret = 0
  
  print('=== pylint ===')

  sys.stdout.flush()
  
  # Todo notes are fine, but fixme's or xxx's should probably just be

  # fixed (in tests, at least)
-subprocess.run(
+p = subprocess.run(
  ('python3', '-m', 'pylint', '--score=n', '--notes=FIXME,XXX', *files),
  cwd=directory,
  env=env,
  check=False,
+universal_newlines=True,
  )
+ret += p.returncode
  
  print('=== mypy ===')

  sys.stdout.flush()
@@ -113,9 +116,12 @@ def run_linters(
  universal_newlines=True
  )
  
+ret += p.returncode

  if p.returncode != 0:
  print(p.stdout)
  
+return ret

+
  
  def main() -> None:

  for linter in ('pylint-3', 'mypy'):



Hmm..

1. It is rather unusual for a function in Python to return an int error code;
raising exceptions is more common.

2. Making a sum of the return codes looks odd to me.

3. Do we really want to run mypy if pylint failed? Maybe it's better not to,
and to just switch s/check=False/check=True/ (a sketch follows below)? This way:

3.1 the function becomes a thin wrapper around subprocess.run and raises the
same exceptions
3.2 we don't waste CI time by running mypy when pylint failed anyway
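
A minimal sketch of (3), assuming check=True and keeping only the flags quoted
above (the extra mypy options of the real script are elided here):

    import subprocess
    import sys
    from typing import List, Mapping, Optional

    def run_linters(files: List[str], directory: str = '.',
                    env: Optional[Mapping[str, str]] = None) -> None:
        """Run pylint and mypy; the first failure raises CalledProcessError."""
        print('=== pylint ===')
        sys.stdout.flush()
        subprocess.run(
            ('python3', '-m', 'pylint', '--score=n', '--notes=FIXME,XXX',
             *files),
            cwd=directory, env=env, check=True, universal_newlines=True,
        )

        print('=== mypy ===')
        sys.stdout.flush()
        subprocess.run(
            ('python3', '-m', 'mypy', *files),
            cwd=directory, env=env, check=True, universal_newlines=True,
        )

The caller then handles subprocess.CalledProcessError instead of interpreting
a summed return code, and mypy is never run when pylint has already failed.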


--
Best regards,
Vladimir



Re: [PATCH 06/10] iotests/297: Add 'directory' argument to run_linters

2021-07-06 Thread Vladimir Sementsov-Ogievskiy

25.06.2021 21:20, John Snow wrote:

Allow run_linters to work well if it's executed from a different
directory.

Signed-off-by: John Snow



Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH 05/10] iotests/297: Separate environment setup from test execution

2021-07-06 Thread Vladimir Sementsov-Ogievskiy

25.06.2021 21:20, John Snow wrote:

Move environment setup into main(), leaving pure test execution behind
in run_linters().

Signed-off-by: John Snow


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



[PATCH 0/4] hw/nvme: fix controller hotplugging

2021-07-06 Thread Klaus Jensen
From: Klaus Jensen 

Back in May, Hannes posted a fix[1] to re-enable NVMe PCI hotplug. We
discussed it a bit back and forth and I mentioned that the core issue was
an artifact of the parent/child relationship stemming from the qdev
setup we have, with namespaces attaching to the controller through a qdev
bus.

The gist of this series is the fourth patch "hw/nvme: fix controller hot
unplugging" which basically causes namespaces to be reassigned to a bus
owned by the subsystem if the parent controller is linked to one. This
fixes `device_del/add nvme` in such settings.

Note that when no subsystem is involved, nvme devices can still be
removed from the system with `device_del`, but this *will* cause the
namespaces to be removed as well, since there is no place (i.e. no
subsystem) for them to "linger". And since this series does not add
support for hotplugging nvme-ns devices, an nvme device can be re-added
but no namespaces can. Support for hotplugging nvme-ns devices is
present in [1], but I'd rather not add that since I think '-device
nvme-ns' is already a bad design choice.

Now, I do realize that it is not "pretty" to explicitly change the
parent bus, so I do have an RFC patch in queue that replaces the
subsystem and namespace devices with objects, but keeps -device shims
available for backwards compatibility. That approach solves the problem
properly and should be a better model. However, I don't believe it will
make it for 6.1, I'd really like to at least fix the unplugging for 6.1,
and this gets the job done.

  [1]: 20210511073511.32511-1-h...@suse.de
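
For illustration only, a sketch of the flow this series enables, using the
Python machine wrapper; the device ids, serial and property values below are
assumptions for the example, not taken from these patches:

    from qemu.machine import QEMUMachine

    vm = QEMUMachine('qemu-system-x86_64')
    vm.add_args('-device', 'nvme-subsys,id=subsys0,nqn=subsys0')
    vm.add_args('-device', 'nvme,id=nvme0,serial=deadbeef,subsys=subsys0')
    vm.add_args('-drive', 'if=none,id=nvm,file=null-co://,format=raw')
    vm.add_args('-device', 'nvme-ns,drive=nvm')
    vm.launch()

    # With this series the namespace is reparented to the subsystem's bus,
    # so deleting the controller no longer unrealizes it, and the controller
    # itself can be plugged back in afterwards.
    vm.qmp('device_del', id='nvme0')
    vm.qmp('device_add', driver='nvme', id='nvme0', serial='deadbeef',
           subsys='subsys0')
    vm.shutdown()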

Klaus Jensen (4):
  hw/nvme: remove NvmeCtrl parameter from ns setup/check functions
  hw/nvme: mark nvme-subsys non-hotpluggable
  hw/nvme: unregister controller with subsystem at exit
  hw/nvme: fix controller hot unplugging

 hw/nvme/nvme.h   | 21 ---
 hw/nvme/ctrl.c   | 14 +++---
 hw/nvme/ns.c | 67 ++--
 hw/nvme/subsys.c | 10 
 4 files changed, 74 insertions(+), 38 deletions(-)

-- 
2.32.0




[PATCH 2/4] hw/nvme: mark nvme-subsys non-hotpluggable

2021-07-06 Thread Klaus Jensen
From: Klaus Jensen 

We currently lack the infrastructure to handle subsystem hotplugging, so
disable it.

Signed-off-by: Klaus Jensen 
---
 hw/nvme/subsys.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 192223d17ca1..dc7a96862f37 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -61,6 +61,7 @@ static void nvme_subsys_class_init(ObjectClass *oc, void 
*data)
 
 dc->realize = nvme_subsys_realize;
 dc->desc = "Virtual NVMe subsystem";
+dc->hotpluggable = false;
 
 device_class_set_props(dc, nvme_subsystem_props);
 }
-- 
2.32.0




[PATCH 3/4] hw/nvme: unregister controller with subsystem at exit

2021-07-06 Thread Klaus Jensen
From: Klaus Jensen 

Make sure the controller is unregistered from the subsystem when device
is removed.

Signed-off-by: Klaus Jensen 
---
 hw/nvme/nvme.h   | 1 +
 hw/nvme/ctrl.c   | 4 
 hw/nvme/subsys.c | 5 +
 3 files changed, 10 insertions(+)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 0868359a1e86..c4065467d877 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -50,6 +50,7 @@ typedef struct NvmeSubsystem {
 } NvmeSubsystem;
 
 int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp);
+void nvme_subsys_unregister_ctrl(NvmeSubsystem *subsys, NvmeCtrl *n);
 
 static inline NvmeCtrl *nvme_subsys_ctrl(NvmeSubsystem *subsys,
  uint32_t cntlid)
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index dd1801510032..90e3ee2b70ee 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6523,6 +6523,10 @@ static void nvme_exit(PCIDevice *pci_dev)
 nvme_ns_cleanup(ns);
 }
 
+if (n->subsys) {
+nvme_subsys_unregister_ctrl(n->subsys, n);
+}
+
 g_free(n->cq);
 g_free(n->sq);
 g_free(n->aer_reqs);
diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index dc7a96862f37..92caa604a280 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -32,6 +32,11 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
 return cntlid;
 }
 
+void nvme_subsys_unregister_ctrl(NvmeSubsystem *subsys, NvmeCtrl *n)
+{
+subsys->ctrls[n->cntlid] = NULL;
+}
+
 static void nvme_subsys_setup(NvmeSubsystem *subsys)
 {
 const char *nqn = subsys->params.nqn ?
-- 
2.32.0




[PATCH 1/4] hw/nvme: remove NvmeCtrl parameter from ns setup/check functions

2021-07-06 Thread Klaus Jensen
From: Klaus Jensen 

The nvme_ns_setup and nvme_ns_check_constraints should not depend on the
controller state. Refactor and remove it.

Signed-off-by: Klaus Jensen 
---
 hw/nvme/nvme.h |  2 +-
 hw/nvme/ctrl.c |  2 +-
 hw/nvme/ns.c   | 37 ++---
 3 files changed, 20 insertions(+), 21 deletions(-)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 56f8eceed2ad..0868359a1e86 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -246,7 +246,7 @@ static inline void nvme_aor_dec_active(NvmeNamespace *ns)
 }
 
 void nvme_ns_init_format(NvmeNamespace *ns);
-int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
+int nvme_ns_setup(NvmeNamespace *ns, Error **errp);
 void nvme_ns_drain(NvmeNamespace *ns);
 void nvme_ns_shutdown(NvmeNamespace *ns);
 void nvme_ns_cleanup(NvmeNamespace *ns);
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 629b0d38c2a2..dd1801510032 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6498,7 +6498,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 ns = &n->namespace;
 ns->params.nsid = 1;
 
-if (nvme_ns_setup(n, ns, errp)) {
+if (nvme_ns_setup(ns, errp)) {
 return;
 }
 
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 4275c3db6301..3c4f5b8c714a 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -346,8 +346,7 @@ static void nvme_zoned_ns_shutdown(NvmeNamespace *ns)
 assert(ns->nr_open_zones == 0);
 }
 
-static int nvme_ns_check_constraints(NvmeCtrl *n, NvmeNamespace *ns,
- Error **errp)
+static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
 {
 if (!ns->blkconf.blk) {
 error_setg(errp, "block backend not configured");
@@ -366,20 +365,6 @@ static int nvme_ns_check_constraints(NvmeCtrl *n, 
NvmeNamespace *ns,
 return -1;
 }
 
-if (!n->subsys) {
-if (ns->params.detached) {
-error_setg(errp, "detached requires that the nvme device is "
-   "linked to an nvme-subsys device");
-return -1;
-}
-
-if (ns->params.shared) {
-error_setg(errp, "shared requires that the nvme device is "
-   "linked to an nvme-subsys device");
-return -1;
-}
-}
-
 if (ns->params.zoned) {
 if (ns->params.max_active_zones) {
 if (ns->params.max_open_zones > ns->params.max_active_zones) {
@@ -411,9 +396,9 @@ static int nvme_ns_check_constraints(NvmeCtrl *n, 
NvmeNamespace *ns,
 return 0;
 }
 
-int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
+int nvme_ns_setup(NvmeNamespace *ns, Error **errp)
 {
-if (nvme_ns_check_constraints(n, ns, errp)) {
+if (nvme_ns_check_constraints(ns, errp)) {
 return -1;
 }
 
@@ -465,7 +450,21 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 uint32_t nsid = ns->params.nsid;
 int i;
 
-if (nvme_ns_setup(n, ns, errp)) {
+if (!n->subsys) {
+if (ns->params.detached) {
+error_setg(errp, "detached requires that the nvme device is "
+   "linked to an nvme-subsys device");
+return;
+}
+
+if (ns->params.shared) {
+error_setg(errp, "shared requires that the nvme device is "
+   "linked to an nvme-subsys device");
+return;
+}
+}
+
+if (nvme_ns_setup(ns, errp)) {
 return;
 }
 
-- 
2.32.0




Re: [PATCH 04/10] iotests/297: Create main() function

2021-07-06 Thread Vladimir Sementsov-Ogievskiy

25.06.2021 21:20, John Snow wrote:

Instead of running "run_linters" directly, create a main() function that
will be responsible for environment setup, leaving run_linters()
responsible only for execution of the linters.

(That environment setup will be moved over in forthcoming commits.)

Signed-off-by: John Snow


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



[PATCH 4/4] hw/nvme: fix controller hot unplugging

2021-07-06 Thread Klaus Jensen
From: Klaus Jensen 

Prior to this patch the nvme-ns devices are always children of the
NvmeBus owned by the NvmeCtrl. This causes the namespaces to be
unrealized when the parent device is removed. However, when subsystems
are involved, this is not what we want since the namespaces may be
attached to other controllers as well.

This patch adds an additional NvmeBus on the subsystem device. When
nvme-ns devices are realized, if the parent controller device is linked
to a subsystem, the parent bus is set to the subsystem one instead. This
makes sure that namespaces are kept alive and not unrealized.

Signed-off-by: Klaus Jensen 
---
 hw/nvme/nvme.h   | 18 ++
 hw/nvme/ctrl.c   |  8 +---
 hw/nvme/ns.c | 32 +---
 hw/nvme/subsys.c |  4 
 4 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index c4065467d877..9401e212f9f7 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -33,12 +33,21 @@ QEMU_BUILD_BUG_ON(NVME_MAX_NAMESPACES > NVME_NSID_BROADCAST 
- 1);
 typedef struct NvmeCtrl NvmeCtrl;
 typedef struct NvmeNamespace NvmeNamespace;
 
+#define TYPE_NVME_BUS "nvme-bus"
+OBJECT_DECLARE_SIMPLE_TYPE(NvmeBus, NVME_BUS)
+
+typedef struct NvmeBus {
+BusState parent_bus;
+bool is_subsys;
+} NvmeBus;
+
 #define TYPE_NVME_SUBSYS "nvme-subsys"
 #define NVME_SUBSYS(obj) \
 OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS)
 
 typedef struct NvmeSubsystem {
 DeviceState parent_obj;
+NvmeBus bus;
 uint8_t subnqn[256];
 
 NvmeCtrl  *ctrls[NVME_MAX_CONTROLLERS];
@@ -365,13 +374,6 @@ typedef struct NvmeCQueue {
 QTAILQ_HEAD(, NvmeRequest) req_list;
 } NvmeCQueue;
 
-#define TYPE_NVME_BUS "nvme-bus"
-#define NVME_BUS(obj) OBJECT_CHECK(NvmeBus, (obj), TYPE_NVME_BUS)
-
-typedef struct NvmeBus {
-BusState parent_bus;
-} NvmeBus;
-
 #define TYPE_NVME "nvme"
 #define NVME(obj) \
 OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
@@ -463,7 +465,7 @@ typedef struct NvmeCtrl {
 
 static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
 {
-if (!nsid || nsid > NVME_MAX_NAMESPACES) {
+if (!n || !nsid || nsid > NVME_MAX_NAMESPACES) {
 return NULL;
 }
 
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 90e3ee2b70ee..7c8fca36d9a5 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6516,11 +6516,13 @@ static void nvme_exit(PCIDevice *pci_dev)
 
 for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
 ns = nvme_ns(n, i);
-if (!ns) {
-continue;
+if (ns) {
+ns->attached--;
 }
+}
 
-nvme_ns_cleanup(ns);
+if (n->subsys) {
+nvme_subsys_unregister_ctrl(n->subsys, n);
 }
 
 if (n->subsys) {
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 3c4f5b8c714a..612a2786d75d 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -444,13 +444,29 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
 static void nvme_ns_realize(DeviceState *dev, Error **errp)
 {
 NvmeNamespace *ns = NVME_NS(dev);
-BusState *s = qdev_get_parent_bus(dev);
-NvmeCtrl *n = NVME(s->parent);
-NvmeSubsystem *subsys = n->subsys;
+BusState *qbus = qdev_get_parent_bus(dev);
+NvmeBus *bus = NVME_BUS(qbus);
+NvmeCtrl *n = NULL;
+NvmeSubsystem *subsys = NULL;
 uint32_t nsid = ns->params.nsid;
 int i;
 
-if (!n->subsys) {
+if (bus->is_subsys) {
+subsys = NVME_SUBSYS(qbus->parent);
+} else {
+n = NVME(qbus->parent);
+subsys = n->subsys;
+}
+
+if (subsys) {
+/*
+ * If this namespace belongs to a subsystem (through a link on the
+ * controller device), reparent the device.
+ */
+if (!qdev_set_parent_bus(dev, &subsys->bus.parent_bus, errp)) {
+return;
+}
+} else {
 if (ns->params.detached) {
 error_setg(errp, "detached requires that the nvme device is "
"linked to an nvme-subsys device");
@@ -470,7 +486,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 
 if (!nsid) {
 for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
-if (nvme_ns(n, i) || nvme_subsys_ns(subsys, i)) {
+if (nvme_subsys_ns(subsys, i) || nvme_ns(n, i)) {
 continue;
 }
 
@@ -483,7 +499,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 return;
 }
 } else {
-if (nvme_ns(n, nsid) || nvme_subsys_ns(subsys, nsid)) {
+if (nvme_subsys_ns(subsys, nsid) || nvme_ns(n, nsid)) {
 error_setg(errp, "namespace id '%d' already allocated", nsid);
 return;
 }
@@ -509,7 +525,9 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 }
 }
 
-nvme_attach_ns(n, ns);
+if (n) {
+nvme_attach_ns(n, ns);
+}
 }
 
 static Property nvme_ns_props[] = {
diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 92caa604a280..fb7c3a7c55fc 100644

Re: [PATCH 03/10] iotests/297: Don't rely on distro-specific linter binaries

2021-07-06 Thread Vladimir Sementsov-Ogievskiy

25.06.2021 21:20, John Snow wrote:

'pylint-3' is another Fedora-ism. Use "python3 -m pylint" or "python3 -m
mypy" to access these scripts instead. This style of invocation will
prefer the "correct" tool when run in a virtual environment.

Note that we still check for "pylint-3" before the test begins -- this
check is now "overly strict", but shouldn't cause anything that was
already running correctly to start failing.

Signed-off-by: John Snow


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH 02/10] iotests/297: Add get_files() function

2021-07-06 Thread Vladimir Sementsov-Ogievskiy

25.06.2021 21:20, John Snow wrote:

Split out file discovery into its own method to begin separating out the
"environment setup" and "test execution" phases.

Signed-off-by: John Snow 


Reviewed-by: Vladimir Sementsov-Ogievskiy 


---
  tests/qemu-iotests/297 | 11 +--
  1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/297 b/tests/qemu-iotests/297
index 493dda17fb..0bc1195805 100755
--- a/tests/qemu-iotests/297
+++ b/tests/qemu-iotests/297
@@ -21,6 +21,7 @@ import re
  import shutil
  import subprocess
  import sys
+from typing import List
  
  import iotests
  
@@ -56,9 +57,15 @@ def is_python_file(filename: str, directory: str = '.') -> bool:

  return False
  
  
+def get_test_files(directory: str = '.') -> List[str]:

+return [
+f for f in (set(os.listdir(directory)) - set(SKIP_FILES))
+if is_python_file(f, directory)
+]
+
+
  def run_linters():
-files = [filename for filename in (set(os.listdir('.')) - set(SKIP_FILES))
- if is_python_file(filename)]
+files = get_test_files()


Hmm. It looks like files in tests/qemu-iotests/tests are ignored now... That's
bad (a possible fix is sketched after the quoted hunk below).

  
  iotests.logger.debug('Files to be checked:')

  iotests.logger.debug(', '.join(sorted(files)))
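
A possible sketch (reusing SKIP_FILES, is_python_file() and the imports of the
script, and assuming the named tests live in a flat 'tests' subdirectory):

    def get_test_files(directory: str = '.') -> List[str]:
        # top-level tests, minus the explicit skip list
        candidates = set(os.listdir(directory)) - set(SKIP_FILES)
        # also pick up the named tests under tests/qemu-iotests/tests
        subdir = os.path.join(directory, 'tests')
        if os.path.isdir(subdir):
            candidates |= {os.path.join('tests', name)
                           for name in os.listdir(subdir)}
        return [f for f in candidates if is_python_file(f, directory)]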




--
Best regards,
Vladimir



Re: [PATCH 01/10] iotests/297: modify is_python_file to work from any CWD

2021-07-06 Thread Vladimir Sementsov-Ogievskiy

25.06.2021 21:20, John Snow wrote:

Add a directory argument to is_python_file to allow it to work correctly
no matter what CWD we happen to run it from. This is done in
anticipation of running the iotests from another directory (./python/).

Signed-off-by: John Snow


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH] hw/ide: Fix crash when plugging a piix3-ide device into the x-remote machine

2021-07-06 Thread Philippe Mathieu-Daudé
On 7/6/21 10:24 AM, Thomas Huth wrote:
> On 16/04/2021 14.52, Thomas Huth wrote:
>> QEMU currently crashes when the user tries to do something like:
>>
>>   qemu-system-x86_64 -M x-remote -device piix3-ide
> 
> It's now several months later already, and this crash has still not been
> fixed yet. The next softfreeze is around the corner and the
> "device-crash-test" still stumbles accross this problem.
> If the other solutions are not mergable yet (what's the status here?),

See this big thread about ISA vs PCI IDE modelling / design:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg809678.html

TL;DR: short term we are screwed. long term, not worth it.

Stefan, IIRC the multi-process conclusion was we have to reject
PCI devices briding another (non-PCI) bus, such ISA / I2C / USB
/ SD / ... because QEMU register the bus type globally and the
command line machinery resolves it to plug user-creatable devices,
so we can not share such buses. Is that correct?

> could we please include my patch for QEMU v6.1 instead, so that we get
> this silenced in the device-crash-test script?

Yes please.

Regards,

Phil.

>> This happens because the "isabus" variable is not initialized with
>> the x-remote machine yet. Add a proper check for this condition
>> and propagate the error to the caller, so we can fail there gracefully.
>>
>> Signed-off-by: Thomas Huth 
>> ---
>>   hw/ide/ioport.c   | 16 ++--
>>   hw/ide/piix.c | 22 +-
>>   hw/isa/isa-bus.c  | 14 ++
>>   include/hw/ide/internal.h |  2 +-
>>   include/hw/isa/isa.h  | 13 -
>>   5 files changed, 46 insertions(+), 21 deletions(-)




Re: [PATCH] hw/ide: Fix crash when plugging a piix3-ide device into the x-remote machine

2021-07-06 Thread Thomas Huth

On 16/04/2021 14.52, Thomas Huth wrote:

QEMU currently crashes when the user tries to do something like:

  qemu-system-x86_64 -M x-remote -device piix3-ide


It's now several months later already, and this crash has still not been 
fixed yet. The next softfreeze is around the corner and the 
"device-crash-test" still stumbles accross this problem.
If the other solutions are not mergable yet (what's the status here?), could 
we please include my patch for QEMU v6.1 instead, so that we get this 
silenced in the device-crash-test script?


 Thanks,
  Thomas



This happens because the "isabus" variable is not initialized with
the x-remote machine yet. Add a proper check for this condition
and propagate the error to the caller, so we can fail there gracefully.

Signed-off-by: Thomas Huth 
---
  hw/ide/ioport.c   | 16 ++--
  hw/ide/piix.c | 22 +-
  hw/isa/isa-bus.c  | 14 ++
  include/hw/ide/internal.h |  2 +-
  include/hw/isa/isa.h  | 13 -
  5 files changed, 46 insertions(+), 21 deletions(-)

diff --git a/hw/ide/ioport.c b/hw/ide/ioport.c
index b613ff3bba..e6caa537fa 100644
--- a/hw/ide/ioport.c
+++ b/hw/ide/ioport.c
@@ -50,15 +50,19 @@ static const MemoryRegionPortio ide_portio2_list[] = {
  PORTIO_END_OF_LIST(),
  };
  
-void ide_init_ioport(IDEBus *bus, ISADevice *dev, int iobase, int iobase2)

+int ide_init_ioport(IDEBus *bus, ISADevice *dev, int iobase, int iobase2)
  {
+int ret;
+
  /* ??? Assume only ISA and PCI configurations, and that the PCI-ISA
 bridge has been setup properly to always register with ISA.  */
-isa_register_portio_list(dev, &bus->portio_list,
- iobase, ide_portio_list, bus, "ide");
+ret = isa_register_portio_list(dev, &bus->portio_list,
+   iobase, ide_portio_list, bus, "ide");
  
-if (iobase2) {

-isa_register_portio_list(dev, &bus->portio2_list,
- iobase2, ide_portio2_list, bus, "ide");
+if (ret == 0 && iobase2) {
+ret = isa_register_portio_list(dev, &bus->portio2_list,
+   iobase2, ide_portio2_list, bus, "ide");
  }
+
+return ret;
  }
diff --git a/hw/ide/piix.c b/hw/ide/piix.c
index b9860e35a5..d3e738320b 100644
--- a/hw/ide/piix.c
+++ b/hw/ide/piix.c
@@ -26,6 +26,7 @@
  #include "qemu/osdep.h"
  #include "hw/pci/pci.h"
  #include "migration/vmstate.h"
+#include "qapi/error.h"
  #include "qemu/module.h"
  #include "sysemu/block-backend.h"
  #include "sysemu/blockdev.h"
@@ -123,7 +124,8 @@ static void piix_ide_reset(DeviceState *dev)
  pci_conf[0x20] = 0x01; /* BMIBA: 20-23h */
  }
  
-static void pci_piix_init_ports(PCIIDEState *d) {

+static int pci_piix_init_ports(PCIIDEState *d)
+{
  static const struct {
  int iobase;
  int iobase2;
@@ -132,24 +134,30 @@ static void pci_piix_init_ports(PCIIDEState *d) {
  {0x1f0, 0x3f6, 14},
  {0x170, 0x376, 15},
  };
-int i;
+int i, ret;
  
  for (i = 0; i < 2; i++) {

  ide_bus_new(&d->bus[i], sizeof(d->bus[i]), DEVICE(d), i, 2);
-ide_init_ioport(&d->bus[i], NULL, port_info[i].iobase,
-port_info[i].iobase2);
+ret = ide_init_ioport(&d->bus[i], NULL, port_info[i].iobase,
+  port_info[i].iobase2);
+if (ret) {
+return ret;
+}
  ide_init2(&d->bus[i], isa_get_irq(NULL, port_info[i].isairq));
  
  bmdma_init(&d->bus[i], &d->bmdma[i], d);

  d->bmdma[i].bus = &d->bus[i];
  ide_register_restart_cb(&d->bus[i]);
  }
+
+return 0;
  }
  
  static void pci_piix_ide_realize(PCIDevice *dev, Error **errp)

  {
  PCIIDEState *d = PCI_IDE(dev);
  uint8_t *pci_conf = dev->config;
+int rc;
  
  pci_conf[PCI_CLASS_PROG] = 0x80; // legacy ATA mode
  
@@ -158,7 +166,11 @@ static void pci_piix_ide_realize(PCIDevice *dev, Error **errp)
  
  vmstate_register(VMSTATE_IF(dev), 0, &vmstate_ide_pci, d);
  
-pci_piix_init_ports(d);

+rc = pci_piix_init_ports(d);
+if (rc) {
+error_setg_errno(errp, -rc, "Failed to realize %s",
+ object_get_typename(OBJECT(dev)));
+}
  }
  
  int pci_piix3_xen_ide_unplug(DeviceState *dev, bool aux)

diff --git a/hw/isa/isa-bus.c b/hw/isa/isa-bus.c
index 7820068e6e..cffaa35e9c 100644
--- a/hw/isa/isa-bus.c
+++ b/hw/isa/isa-bus.c
@@ -131,13 +131,17 @@ void isa_register_ioport(ISADevice *dev, MemoryRegion 
*io, uint16_t start)
  isa_init_ioport(dev, start);
  }
  
-void isa_register_portio_list(ISADevice *dev,

-  PortioList *piolist, uint16_t start,
-  const MemoryRegionPortio *pio_start,
-  void *opaque, const char *name)
+int isa_register_portio_list(ISADevice *dev,
+ PortioList *piolist, uint1

Re: [ovirt-users] Re: Any way to terminate stuck export task

2021-07-06 Thread Gianluca Cecchi
On Mon, Jul 5, 2021 at 5:06 PM Nir Soffer  wrote:


>
> qemu-img is busy in posix_fallocate(), writing one byte to every 4k block.
>
> If you add -tt -T (as I suggested), we can see how much time each write
> takes,
> which may explain why this takes so much time.
>
> strace -f -p 14342 -tt -T
>
>
It seems I missed part of your suggestion... I didn't get the "-tt -T" (or
I didn't see it...)

With it I get this during the export (the host console shows about 4
Mbit/s of network traffic):

# strace -f -p 25243 -tt -T
strace: Process 25243 attached with 2 threads
[pid 25243] 09:17:32.503907 ppoll([{fd=9, events=POLLIN|POLLERR|POLLHUP}],
1, NULL, NULL, 8 
[pid 25244] 09:17:32.694207 pwrite64(12, "\0", 1, 3773509631) = 1 <0.59>
[pid 25244] 09:17:32.694412 pwrite64(12, "\0", 1, 3773513727) = 1 <0.78>
[pid 25244] 09:17:32.694608 pwrite64(12, "\0", 1, 3773517823) = 1 <0.56>
[pid 25244] 09:17:32.694729 pwrite64(12, "\0", 1, 3773521919) = 1 <0.24>
[pid 25244] 09:17:32.694796 pwrite64(12, "\0", 1, 3773526015) = 1 <0.20>
[pid 25244] 09:17:32.694855 pwrite64(12, "\0", 1, 3773530111) = 1 <0.15>
[pid 25244] 09:17:32.694908 pwrite64(12, "\0", 1, 3773534207) = 1 <0.14>
[pid 25244] 09:17:32.694950 pwrite64(12, "\0", 1, 3773538303) = 1 <0.16>
[pid 25244] 09:17:32.694993 pwrite64(12, "\0", 1, 3773542399) = 1 <0.200032>
[pid 25244] 09:17:32.895140 pwrite64(12, "\0", 1, 3773546495) = 1 <0.34>
[pid 25244] 09:17:32.895227 pwrite64(12, "\0", 1, 3773550591) = 1 <0.29>
[pid 25244] 09:17:32.895296 pwrite64(12, "\0", 1, 3773554687) = 1 <0.24>
[pid 25244] 09:17:32.895353 pwrite64(12, "\0", 1, 3773558783) = 1 <0.16>
[pid 25244] 09:17:32.895400 pwrite64(12, "\0", 1, 3773562879) = 1 <0.15>
[pid 25244] 09:17:32.895443 pwrite64(12, "\0", 1, 3773566975) = 1 <0.15>
[pid 25244] 09:17:32.895485 pwrite64(12, "\0", 1, 3773571071) = 1 <0.15>
[pid 25244] 09:17:32.895527 pwrite64(12, "\0", 1, 3773575167) = 1 <0.17>
[pid 25244] 09:17:32.895570 pwrite64(12, "\0", 1, 3773579263) = 1 <0.199493>
[pid 25244] 09:17:33.095147 pwrite64(12, "\0", 1, 3773583359) = 1 <0.31>
[pid 25244] 09:17:33.095262 pwrite64(12, "\0", 1, 3773587455) = 1 <0.61>
[pid 25244] 09:17:33.095378 pwrite64(12, "\0", 1, 3773591551) = 1 <0.27>
[pid 25244] 09:17:33.095445 pwrite64(12, "\0", 1, 3773595647) = 1 <0.21>
[pid 25244] 09:17:33.095498 pwrite64(12, "\0", 1, 3773599743) = 1 <0.16>
[pid 25244] 09:17:33.095542 pwrite64(12, "\0", 1, 3773603839) = 1 <0.14>
. . .
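
As a rough sanity check (a sketch only; it assumes the pattern above -- one
~0.2s write roughly every nine otherwise sub-millisecond writes -- holds for
the whole image, and uses a hypothetical 100 GiB disk):

    # effective preallocation rate implied by the pwrite64 timings above
    fast, slow = 8 * 40e-6, 0.2            # eight ~40us writes, one ~0.2s write
    avg_per_write = (fast + slow) / 9      # ~22ms per one-byte write
    block = 4096                           # one write per 4 KiB block
    rate = block / avg_per_write           # ~180 KiB/s of image area per second

    size = 100 * 2**30                     # hypothetical 100 GiB image
    print(rate / 1024, size / rate / 3600) # ~180 KiB/s, ~160 hours

At that kind of rate a large image takes days to preallocate, which would
explain why the export appears to be stuck.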

BTW: it seems my NAS appliance doesn't support NFS version 4.2, because
if I force it, the mount fails and engine.log shows this error for both
nodes as they try to mount:

2021-07-05 17:01:56,082+02 ERROR
[org.ovirt.engine.core.bll.storage.connection.FileStorageHelper]
(EE-ManagedThreadFactory-engine-Thread-2554190) [642eb6be] The connection
with details '172.16.1.137:/nas/EXPORT-DOMAIN' failed because of error code
'477' and error message is: problem while trying to mount target


and in vdsm.log:
MountError: (32, ';mount.nfs: Protocol not supported\n')

With NFSv3 I apparently get the same command:

vdsm 19702  3036  7 17:15 ?00:00:02 /usr/bin/qemu-img convert
-p -t none -T none -f raw
/rhev/data-center/mnt/blockSD/679c0725-75fb-4af7-bff1-7c447c5d789c/images/530b3e7f-4ce4-4051-9cac-1112f5f9e8b5/d2a89b5e-7d62-4695-96d8-b762ce52b379
-O raw -o preallocation=falloc /rhev/data-center/mnt/172.16.1.137:
_nas_EXPORT-DOMAIN/20433d5d-9d82-4079-9252-0e746ce54106/images/530b3e7f-4ce4-4051-9cac-1112f5f9e8b5/d2a89b5e-7d62-4695-96d8-b762ce52b379

The file size seems bigger, but the throughput is just as low as with NFS v4.

Gianluca