Re: [RFC PATCH: v3 1/2] add mi device in qemu

2021-08-19 Thread Padmakar Kalghatgi

On Wed, Aug 18, 2021 at 08:01:03AM +0200, Klaus Jensen wrote:

On Aug  3 12:54, Padmakar Kalghatgi wrote:

From: padmakar 

This patch contains the implementation of certain commands
of nvme-mi specification.The MI commands are useful to
manage/configure/monitor the device.Eventhough the MI commands
can be sent via the inband NVMe-MI send/recieve commands, the idea
here is to emulate the sideband interface for MI.

The changes here includes the interface for i2c/smbus
for nvme-mi protocol. We have used i2c address of 0x15
using which the guest VM can send and recieve the nvme-mi
commands. Since the nvme-mi device uses the I2C_SLAVE as
parent, we have used the send and recieve callbacks by
which the nvme-mi device will get the required notification.
With the callback approach, we have eliminated the use of
threads.

One needs to specify the following command in the launch to
specify the nvme-mi device, link to nvme and assign the i2c address.
<-device nvme-mi,nvme=nvme0,address=0x15>

This module makes use of the NvmeCtrl structure of the nvme module,
to fetch relevant information of the nvme device which are used in
some of the mi commands. Eventhough certain commands might require
modification to the nvme module, currently we have currently refrained
from making changes to the nvme module.

cmd-name  cmd-type
*
read nvme-mi dsnvme-mi
configuration set  nvme-mi
configuration get  nvme-mi
vpd read   nvme-mi
vpd write  nvme-mi
controller Health Staus Poll   nvme-mi
nvme subsystem health status poll  nvme-mi
identify   nvme-admin
get log page   nvme-admin
get features   nvme-admin

Signed-off-by: Padmakar Kalghatgi 
Reviewed-by: Klaus Birkelund Jensen 
Reviewed-by: Jaegyu Choi 



My Reviewed-by here was added by mistake. I've not given it my formal
R-b, but I'll provide a proper review on-list ASAP.

But just glossing over it, I like this approach a lot better than v1
(vsock).


Apologies for the mistake. Looking forward for the feedback on the i2c
implementation.


Re: [PATCH] qemu-img: Allow target be aligned to sector size

2021-08-19 Thread Jose R. Ziviani
On Thu, Aug 19, 2021 at 05:14:30PM +0200, Hanna Reitz wrote:
> On 19.08.21 16:31, Jose R. Ziviani wrote:
> > Hello Hanna,
> > 
> > On Thu, Aug 19, 2021 at 12:12:00PM +0200, Hanna Reitz wrote:
> > > We cannot write to images opened with O_DIRECT unless we allow them to
> > > be resized so they are aligned to the sector size: Since 9c60a5d1978,
> > > bdrv_node_refresh_perm() ensures that for nodes whose length is not
> > > aligned to the request alignment and where someone has taken a WRITE
> > > permission, the RESIZE permission is taken, too).
> > > 
> > > Let qemu-img convert pass the BDRV_O_RESIZE flag (which causes
> > > blk_new_open() to take the RESIZE permission) when using cache=none for
> > > the target, so that when writing to it, it can be aligned to the target
> > > sector size.
> > > 
> > > Without this patch, an error is returned:
> > > 
> > > $ qemu-img convert -f raw -O raw -t none foo.img /mnt/tmp/foo.img
> > > qemu-img: Could not open '/mnt/tmp/foo.img': Cannot get 'write'
> > > permission without 'resize': Image size is not a multiple of request
> > > alignment
> > > 
> > > Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1994266
> > > Signed-off-by: Hanna Reitz 
> > > ---
> > > As I have written in the BZ linked above, I am not sure what behavior we
> > > want.  It can be argued that the current behavior is perfectly OK
> > > because we want the target to have the same size as the source, so if
> > > this cannot be done, we should just print the above error (which I think
> > > explains the problem well enough that users can figure out they need to
> > > resize the source image).
> > > 
> > > OTOH, it is difficult for me to imagine a case where the user would
> > > prefer the above error to just having qemu-img align the target image's
> > > length.
> > This is timely convenient because I'm currently investigating an issue 
> > detected
> > by a libvirt-tck test:
> > 
> > https://gitlab.com/libvirt/libvirt-tck/-/blob/master/scripts/qemu/200-qcow2-single-backing-file.t
> > 
> > It fails with the same message:
> > qemu-system-x86_64: -device 
> > virtio-blk-pci,bus=pci.0,addr=0x3,drive=libvirt-1-format,id=virtio-disk1,write-cache=on:
> >  Cannot get 'write' permission without 'resize': Image size is not a 
> > multiple of request alignment
> > 
> > Libvirt-tck basically wants to make sure that libvirt won't pass a 'backing'
> > argument if we force a QCOW2 image to be interpreted as a RAW image. But, 
> > the
> > actual size of a (not preallocated) QCOW2 is usually unaligned so we ended 
> > up
> > failing at that requirement.
> > 
> > I crafted a reproducer (tck-main is a QCOW2 image):
> >   $ ./qemu-system-x86_64 \
> >-machine pc-i440fx-6.0,accel=kvm -m 1024 -display none -no-user-config 
> > -nodefaults \
> >-kernel vmlinuz -initrd initrd \
> >-blockdev 
> > driver=file,filename=/var/cache/libvirt-tck/storage-fs/tck/tck-main,node-name=a,cache.direct=on
> >  \
> >-blockdev node-name=name,driver=raw,file=a \
> >-device virtio-blk-pci,drive=name
> > 
> > And if I remove 'cache.direct=on' OR if I remove the device virtio-blk-pci I
> > don't hit the failure.
> > 
> > In order to fix the libvirt-tck test case, I think that creating a 
> > preallocated
> > QCOW2 image is the best approach, considering their test case goal. But, if 
> > you
> > don't mind, could you explain why cache.direct=on doesn't set resize 
> > permission
> > with virtio-blk-pci?
> 
> Well, users only take the permissions they need.  Because the device doesn’t
> actively want to resize the disk, it doesn’t take the permission (because
> other simultaneous users might not share that permission, and so taking more
> permissions than you need may cause problems).
> 
> So the decision whether to take the permission or not is a tradeoff.  I
> mean, there’s always a work-around: The error message tells you that the
> image is not aligned to the request alignment, so you can align the image
> size manually.  The question is whether taking the permissions may be
> problematic, and whether the user can be expected to align the image size.
> 
> (And we also need to keep in mind that this case is extremely rare. I don’t
> think that users in practice will ever have images that are not 4k-aligned. 
> What the test is doing is of course something that would never happen in a
> practical set-up, in fact, I believe attaching a qcow2 image as a raw image
> to a VM is something that would usually be considered dangerous from a
> security perspective.[1])
> 
> For qemu-img convert (i.e. for this patch), I decided that it is extremely
> unlikely there are concurrent users for the target, so I can’t imagine
> taking more permissions would cause problems.  The only downside I could see
> is that the target image would be larger than the source image, but I think
> that is still the outcome that users want.
> 
> On the other side, applying the work-around may be problematic for users:
> The source image of qemu-img conv

Re: [PATCH 3/3] qcow2: handle_dependencies(): relax conflict detection

2021-08-19 Thread Eric Blake
On Sat, Jul 24, 2021 at 04:38:46PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> There is no conflict and no dependency if we have parallel writes to
> different subclusters of one cluster when cluster itself is already

when the cluster itself

> allocated. So, relax extra dependency.
> 
> Measure performance:
> First, prepare build/qemu-img-old and build/qemu-img-new images.
> 
> cd scripts/simplebench
> ./img_bench_templater.py
> 
> Paste the following to stdin of running script:
> 
> qemu_img=../../build/qemu-img-{old|new}
> $qemu_img create -f qcow2 -o extended_l2=on /ssd/x.qcow2 1G
> $qemu_img bench -c 10 -d 8 [-s 2K|-s 2K -o 512|-s $((1024*2+512))] \
> -w -t none -n /ssd/x.qcow2
> 
> The result:
> 
> All results are in seconds
> 
> --  -  -
> oldnew
> -s 2K   6.7 ± 15%  6.2 ± 12%
>  -7%
> -s 2K -o 51213 ± 3%11 ± 5%
>  -16%
> -s $((1024*2+512))  9.5 ± 4%   8.4
>  -12%
> --  -  -

Cool improvement.

> 
> So small writes are more independent now and that helps to keep deeper
> io queue which improves performance.
> 
> 271 iotest output becomes racy for three allocation in one cluster.
> Second and third writes may finish in different order. Second and
> third requests don't depend on each other any more. Still they both
> depend on first request anyway. Keep only one for consistent output.

Interesting fallout.  Yes, it looks like the test is still robust
enough without the extra request.

> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/qcow2-cluster.c  | 11 +++
>  tests/qemu-iotests/271 |  4 +---
>  tests/qemu-iotests/271.out |  2 --
>  3 files changed, 12 insertions(+), 5 deletions(-)

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 2/3] qcow2: refactor handle_dependencies() loop body

2021-08-19 Thread Eric Blake
On Sat, Jul 24, 2021 at 04:38:45PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> No logic change, just prepare for the following commit.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/qcow2-cluster.c | 49 ---
>  1 file changed, 28 insertions(+), 21 deletions(-)
> 
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index bd0597842f..967121c7e6 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -1400,29 +1400,36 @@ static int handle_dependencies(BlockDriverState *bs, 
> uint64_t guest_offset,
>  
>  if (end <= old_start || start >= old_end) {
>  /* No intersection */
> -} else {
> -if (start < old_start) {
> -/* Stop at the start of a running allocation */
> -bytes = old_start - start;
> -} else {
> -bytes = 0;
> -}
> +continue;
> +}
>  
> -/* Stop if already an l2meta exists. After yielding, it wouldn't

Pre-existing, but...

> - * be valid any more, so we'd have to clean up the old L2Metas
> - * and deal with requests depending on them before starting to
> - * gather new ones. Not worth the trouble. */
> -if (bytes == 0 && *m) {
> -*cur_bytes = 0;
> -return 0;
> -}
> +/* Conflict */
>  
> -if (bytes == 0) {
> -/* Wait for the dependency to complete. We need to recheck
> - * the free/allocated clusters when we continue. */
> -qemu_co_queue_wait(&old_alloc->dependent_requests, &s->lock);
> -return -EAGAIN;
> -}
> +if (start < old_start) {
> +/* Stop at the start of a running allocation */
> +bytes = old_start - start;
> +} else {
> +bytes = 0;
> +}
> +
> +/*
> + * Stop if already an l2meta exists. After yielding, it wouldn't

...might as well fix the grammar:  Stop if an l2meta already exists.

> + * be valid any more, so we'd have to clean up the old L2Metas
> + * and deal with requests depending on them before starting to
> + * gather new ones. Not worth the trouble.
> + */
> +if (bytes == 0 && *m) {
> +*cur_bytes = 0;
> +return 0;
> +}
> +
> +if (bytes == 0) {
> +/*
> + * Wait for the dependency to complete. We need to recheck
> + * the free/allocated clusters when we continue.
> + */
> +qemu_co_queue_wait(&old_alloc->dependent_requests, &s->lock);
> +return -EAGAIN;

So the change adds a short-circuiting 'continue', then reduces the
indentation of the rest of the loop body.

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 1/3] simplebench: add img_bench_templater.py

2021-08-19 Thread Hanna Reitz

On 24.07.21 15:38, Vladimir Sementsov-Ogievskiy wrote:

Add simple grammar-parsing template benchmark.


This doesn’t really say much, and FWIW, for like ten minutes I thought 
this would do something completely different than it did (while I was 
trying to parse the help text).


(I thought this was about formatting an existing test’s output, and that 
“template” were kind of the wrong word, but then it turned out it’s 
exactly the right word, only that this is not about using a test’s 
output as a template, but actually using a template of a test (i.e. a 
test template, not a template test) to generate test instances to 
run...  Which of course is much cooler.)


Functionality-wise, as far as I understand (of course I have no 
knowledge of lark), this looks good to me.  And it’s really quite cool.


I just found the documentation confusing, so I have some suggestions for 
it below.



Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  scripts/simplebench/img_bench_templater.py | 85 ++
  scripts/simplebench/table_templater.py | 62 
  2 files changed, 147 insertions(+)
  create mode 100755 scripts/simplebench/img_bench_templater.py
  create mode 100644 scripts/simplebench/table_templater.py

diff --git a/scripts/simplebench/img_bench_templater.py 
b/scripts/simplebench/img_bench_templater.py
new file mode 100755
index 00..d18a243d35
--- /dev/null
+++ b/scripts/simplebench/img_bench_templater.py
@@ -0,0 +1,85 @@
+#!/usr/bin/env python3
+#
+# Run img-bench template tests
+#
+# Copyright (c) 2021 Virtuozzo International GmbH.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+
+import sys
+import subprocess
+import re
+import json
+
+import simplebench
+from results_to_text import results_to_text
+from table_templater import Templater
+
+
+def bench_func(env, case):
+test = templater.gen(env['data'], case['data'])
+
+p = subprocess.run(test, shell=True, stdout=subprocess.PIPE,
+   stderr=subprocess.STDOUT, universal_newlines=True)
+
+if p.returncode == 0:
+try:
+m = re.search(r'Run completed in (\d+.\d+) seconds.', p.stdout)
+return {'seconds': float(m.group(1))}
+except Exception:
+return {'error': f'failed to parse qemu-img output: {p.stdout}'}
+else:
+return {'error': f'qemu-img failed: {p.returncode}: {p.stdout}'}
+
+
+if __name__ == '__main__':
+if len(sys.argv) > 1:
+print("""
+Usage: no arguments. Just pass template test to stdin. Template test is


FWIW, I completely misunderstood this.

At first, this sounded really ambiguous to me; then I thought that 
clearly this must mean that one should pipe the test’s output to this 
script, i.e.


$ path/to/test.sh | scripts/simplebench/img_bench_templater.py

But now after reading more, I finally understand that this is not what 
is meant, but actually literally passing some template of a test script 
to this script, i.e.


$ scripts/simplebench/img_bench_templater.py < path/to/test-template.sh

So, two things; first, I believe it should be a “test template”, not a 
“template test”, because this is about templates for a test, not about a 
test that has something to do with templates.


Second, perhaps we should start with what this does.

Perhaps:

“This script generates performance tests from a test template (example 
below), runs them, and displays the results in a table. The template is 
read from stdin.  It must be written in bash and end with a `qemu-img 
bench` invocation (whose result is parsed to get the test instance’s 
result).”?



+a bash script, last command should be qemu-img bench (it's output is parsed
+to get a result). For templating use the following synax:


“Use the following syntax in the template to create the various 
different test instances:”?



+
+  column templating: {var1|var2|...} - test will use different values in
+  different columns. You may use several {} constructions in the test, in this
+  case product of all choice-sets will be used.
+
+  row templating: [var1|var2|...] - similar thing to define rows (test-cases)
+
+Test tempalate example:


*template


+
+Assume you want to compare two qemu-img binaries, called qemu-img-old and
+qemu-img-new in your build directory in two test-cases with 4K writes and 64K
+writes. Test may look like this:


I’d prefer s/Test/The template/.

Re: [PATCH] qemu-img: Allow target be aligned to sector size

2021-08-19 Thread Hanna Reitz

On 19.08.21 16:31, Jose R. Ziviani wrote:

Hello Hanna,

On Thu, Aug 19, 2021 at 12:12:00PM +0200, Hanna Reitz wrote:

We cannot write to images opened with O_DIRECT unless we allow them to
be resized so they are aligned to the sector size: Since 9c60a5d1978,
bdrv_node_refresh_perm() ensures that for nodes whose length is not
aligned to the request alignment and where someone has taken a WRITE
permission, the RESIZE permission is taken, too).

Let qemu-img convert pass the BDRV_O_RESIZE flag (which causes
blk_new_open() to take the RESIZE permission) when using cache=none for
the target, so that when writing to it, it can be aligned to the target
sector size.

Without this patch, an error is returned:

$ qemu-img convert -f raw -O raw -t none foo.img /mnt/tmp/foo.img
qemu-img: Could not open '/mnt/tmp/foo.img': Cannot get 'write'
permission without 'resize': Image size is not a multiple of request
alignment

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1994266
Signed-off-by: Hanna Reitz 
---
As I have written in the BZ linked above, I am not sure what behavior we
want.  It can be argued that the current behavior is perfectly OK
because we want the target to have the same size as the source, so if
this cannot be done, we should just print the above error (which I think
explains the problem well enough that users can figure out they need to
resize the source image).

OTOH, it is difficult for me to imagine a case where the user would
prefer the above error to just having qemu-img align the target image's
length.

This is timely convenient because I'm currently investigating an issue detected
by a libvirt-tck test:

https://gitlab.com/libvirt/libvirt-tck/-/blob/master/scripts/qemu/200-qcow2-single-backing-file.t

It fails with the same message:
qemu-system-x86_64: -device 
virtio-blk-pci,bus=pci.0,addr=0x3,drive=libvirt-1-format,id=virtio-disk1,write-cache=on:
 Cannot get 'write' permission without 'resize': Image size is not a multiple 
of request alignment

Libvirt-tck basically wants to make sure that libvirt won't pass a 'backing'
argument if we force a QCOW2 image to be interpreted as a RAW image. But, the
actual size of a (not preallocated) QCOW2 is usually unaligned so we ended up
failing at that requirement.

I crafted a reproducer (tck-main is a QCOW2 image):
  $ ./qemu-system-x86_64 \
   -machine pc-i440fx-6.0,accel=kvm -m 1024 -display none -no-user-config 
-nodefaults \
   -kernel vmlinuz -initrd initrd \
   -blockdev 
driver=file,filename=/var/cache/libvirt-tck/storage-fs/tck/tck-main,node-name=a,cache.direct=on
 \
   -blockdev node-name=name,driver=raw,file=a \
   -device virtio-blk-pci,drive=name

And if I remove 'cache.direct=on' OR if I remove the device virtio-blk-pci I
don't hit the failure.

In order to fix the libvirt-tck test case, I think that creating a preallocated
QCOW2 image is the best approach, considering their test case goal. But, if you
don't mind, could you explain why cache.direct=on doesn't set resize permission
with virtio-blk-pci?


Well, users only take the permissions they need.  Because the device 
doesn’t actively want to resize the disk, it doesn’t take the permission 
(because other simultaneous users might not share that permission, and 
so taking more permissions than you need may cause problems).


So the decision whether to take the permission or not is a tradeoff.  I 
mean, there’s always a work-around: The error message tells you that the 
image is not aligned to the request alignment, so you can align the 
image size manually.  The question is whether taking the permissions may 
be problematic, and whether the user can be expected to align the image 
size.


(And we also need to keep in mind that this case is extremely rare. I 
don’t think that users in practice will ever have images that are not 
4k-aligned.  What the test is doing is of course something that would 
never happen in a practical set-up, in fact, I believe attaching a qcow2 
image as a raw image to a VM is something that would usually be 
considered dangerous from a security perspective.[1])


For qemu-img convert (i.e. for this patch), I decided that it is 
extremely unlikely there are concurrent users for the target, so I can’t 
imagine taking more permissions would cause problems.  The only downside 
I could see is that the target image would be larger than the source 
image, but I think that is still the outcome that users want.


On the other side, applying the work-around may be problematic for 
users: The source image of qemu-img convert shouldn’t have to be 
writable.  So resizing it (just so it can be converted to be stored on a 
medium with 4k sector size) may actually be impossible for the user.


Now, for the virtio-blk case, that is different.  If you’re willing to 
apply the work-around, then you don’t have to do so on an “innocent 
bystander” image.  You have to resize the very image you want to use, 
i.e. one that must be writable anyway.  So resizing the image outside o

Re: [RFC PATCH v2] hw/nvme:Adding Support for namespace management

2021-08-19 Thread Klaus Jensen
On Aug 19 18:39, Naveen Nagar wrote:
> From: Naveen 
> 
> This patch supports namespace management : create and delete operations.
> 
> Since v1:
> - Modified and moved nvme_ns_identify_common in ns.c file 
> - Added check for CSI field in NS management
> - Indentation fix in namespace create
> 
> This patch has been tested with the following command and size of image
> file for unallocated namespaces is taken as 0GB. ns_create will look into
> the list of unallocated namespaces and it will initialize the same and 
> return the nsid of the same. A new mandatory field has been added called
> tnvmcap and we have ensured that the total capacity of namespace created
> does not exceed tnvmcap
> 
> -device nvme-subsys,id=subsys0,tnvmcap=8
> -device nvme,serial=foo,id=nvme0,subsys=subsys0
> -device nvme,serial=bar,id=nvme1,subsys=subsys0
> -drive id=ns1,file=ns1.img,if=none
> -device nvme-ns,drive=ns1,bus=nvme0,nsid=1,zoned=false,shared=true
> -drive id=ns2,file=ns2.img,if=none
> -device nvme-ns,drive=ns2,bus=nvme0,nsid=2,zoned=false,shared=true
> -drive id=ns3,file=ns3.img,if=none
> -device nvme-ns,drive=ns3,bus=nvme0,nsid=3,zoned=false,shared=true
> -drive id=ns4,file=ns4.img,if=none
> -device nvme-ns,drive=ns4,bus=nvme0,nsid=4,zoned=false,shared=true
> 
> Please review and suggest if any changes are required.
> 
> Signed-off-by: Naveen Nagar 
> Reviewed-by: Klaus Jensen 
>   

Woops.

Looks like you sent it to the wrong mailing list - I'd be happy to
comment on this on qemu-{block,devel} instead :)


signature.asc
Description: PGP signature


Re: [PATCH] qemu-img: Allow target be aligned to sector size

2021-08-19 Thread Jose R. Ziviani
Hello Hanna,

On Thu, Aug 19, 2021 at 12:12:00PM +0200, Hanna Reitz wrote:
> We cannot write to images opened with O_DIRECT unless we allow them to
> be resized so they are aligned to the sector size: Since 9c60a5d1978,
> bdrv_node_refresh_perm() ensures that for nodes whose length is not
> aligned to the request alignment and where someone has taken a WRITE
> permission, the RESIZE permission is taken, too).
> 
> Let qemu-img convert pass the BDRV_O_RESIZE flag (which causes
> blk_new_open() to take the RESIZE permission) when using cache=none for
> the target, so that when writing to it, it can be aligned to the target
> sector size.
> 
> Without this patch, an error is returned:
> 
> $ qemu-img convert -f raw -O raw -t none foo.img /mnt/tmp/foo.img
> qemu-img: Could not open '/mnt/tmp/foo.img': Cannot get 'write'
> permission without 'resize': Image size is not a multiple of request
> alignment
> 
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1994266
> Signed-off-by: Hanna Reitz 
> ---
> As I have written in the BZ linked above, I am not sure what behavior we
> want.  It can be argued that the current behavior is perfectly OK
> because we want the target to have the same size as the source, so if
> this cannot be done, we should just print the above error (which I think
> explains the problem well enough that users can figure out they need to
> resize the source image).
> 
> OTOH, it is difficult for me to imagine a case where the user would
> prefer the above error to just having qemu-img align the target image's
> length.

This is timely convenient because I'm currently investigating an issue detected
by a libvirt-tck test:

https://gitlab.com/libvirt/libvirt-tck/-/blob/master/scripts/qemu/200-qcow2-single-backing-file.t

It fails with the same message:
qemu-system-x86_64: -device 
virtio-blk-pci,bus=pci.0,addr=0x3,drive=libvirt-1-format,id=virtio-disk1,write-cache=on:
 Cannot get 'write' permission without 'resize': Image size is not a multiple 
of request alignment

Libvirt-tck basically wants to make sure that libvirt won't pass a 'backing'
argument if we force a QCOW2 image to be interpreted as a RAW image. But, the
actual size of a (not preallocated) QCOW2 is usually unaligned so we ended up
failing at that requirement.

I crafted a reproducer (tck-main is a QCOW2 image):
 $ ./qemu-system-x86_64 \
  -machine pc-i440fx-6.0,accel=kvm -m 1024 -display none -no-user-config 
-nodefaults \
  -kernel vmlinuz -initrd initrd \
  -blockdev 
driver=file,filename=/var/cache/libvirt-tck/storage-fs/tck/tck-main,node-name=a,cache.direct=on
 \
  -blockdev node-name=name,driver=raw,file=a \
  -device virtio-blk-pci,drive=name

And if I remove 'cache.direct=on' OR if I remove the device virtio-blk-pci I
don't hit the failure.

In order to fix the libvirt-tck test case, I think that creating a preallocated
QCOW2 image is the best approach, considering their test case goal. But, if you
don't mind, could you explain why cache.direct=on doesn't set resize permission
with virtio-blk-pci?

Thank you!

> ---
>  qemu-img.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/qemu-img.c b/qemu-img.c
> index 908fd0cce5..d4b29bf73e 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -2628,6 +2628,14 @@ static int img_convert(int argc, char **argv)
>  goto out;
>  }
>  
> +if (flags & BDRV_O_NOCACHE) {
> +/*
> + * If we open the target with O_DIRECT, it may be necessary to
> + * extend its size to align to the physical sector size.
> + */
> +flags |= BDRV_O_RESIZE;
> +}
> +
>  if (skip_create) {
>  s.target = img_open(tgt_image_opts, out_filename, out_fmt,
>  flags, writethrough, s.quiet, false);
> -- 
> 2.31.1
> 
> 


signature.asc
Description: Digital signature


[RFC PATCH v2] hw/nvme:Adding Support for namespace management

2021-08-19 Thread Naveen Nagar
From: Naveen 

This patch supports namespace management : create and delete operations.

Since v1:
- Modified and moved nvme_ns_identify_common in ns.c file 
- Added check for CSI field in NS management
- Indentation fix in namespace create

This patch has been tested with the following command and size of image
file for unallocated namespaces is taken as 0GB. ns_create will look into
the list of unallocated namespaces and it will initialize the same and 
return the nsid of the same. A new mandatory field has been added called
tnvmcap and we have ensured that the total capacity of namespace created
does not exceed tnvmcap

-device nvme-subsys,id=subsys0,tnvmcap=8
-device nvme,serial=foo,id=nvme0,subsys=subsys0
-device nvme,serial=bar,id=nvme1,subsys=subsys0
-drive id=ns1,file=ns1.img,if=none
-device nvme-ns,drive=ns1,bus=nvme0,nsid=1,zoned=false,shared=true
-drive id=ns2,file=ns2.img,if=none
-device nvme-ns,drive=ns2,bus=nvme0,nsid=2,zoned=false,shared=true
-drive id=ns3,file=ns3.img,if=none
-device nvme-ns,drive=ns3,bus=nvme0,nsid=3,zoned=false,shared=true
-drive id=ns4,file=ns4.img,if=none
-device nvme-ns,drive=ns4,bus=nvme0,nsid=4,zoned=false,shared=true

Please review and suggest if any changes are required.

Signed-off-by: Naveen Nagar 
Reviewed-by: Klaus Jensen 
  
---
 hw/nvme/ctrl.c   | 237 +--
 hw/nvme/ns.c |  61 ++-
 hw/nvme/nvme.h   |   7 +-
 hw/nvme/subsys.c |   3 +
 include/block/nvme.h |  18 +++-
 5 files changed, 285 insertions(+), 41 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 6baf9e0420..992aaa7d02 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -219,6 +219,7 @@ static const uint32_t nvme_cse_acs[256] = {
 [NVME_ADM_CMD_SET_FEATURES] = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFF_CSUPP,
+[NVME_ADM_CMD_NS_MANAGEMENT]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC,
 [NVME_ADM_CMD_NS_ATTACHMENT]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC,
 [NVME_ADM_CMD_FORMAT_NVM]   = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
 };
@@ -4450,11 +4451,19 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, 
NvmeRequest *req, bool active)
 NvmeNamespace *ns;
 NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
 uint32_t nsid = le32_to_cpu(c->nsid);
+NvmeIdNs *id_ns = NULL;
+uint16_t ret;
 
 trace_pci_nvme_identify_ns(nsid);
 
-if (!nvme_nsid_valid(n, nsid) || nsid == NVME_NSID_BROADCAST) {
+if (!nvme_nsid_valid(n, nsid)) {
 return NVME_INVALID_NSID | NVME_DNR;
+} else if (nsid == NVME_NSID_BROADCAST) {
+id_ns = g_new0(NvmeIdNs, 1);
+nvme_ns_identify_common(id_ns);
+ret = nvme_c2h(n, (uint8_t *)id_ns, sizeof(NvmeIdNs), req);
+g_free(id_ns);
+return ret;
 }
 
 ns = nvme_ns(n, nsid);
@@ -5184,6 +5193,200 @@ static void nvme_select_iocs_ns(NvmeCtrl *n, 
NvmeNamespace *ns)
 }
 }
 
+static int nvme_blk_truncate(BlockBackend *blk, size_t len, Error **errp)
+{
+int ret;
+uint64_t perm, shared_perm;
+
+blk_get_perm(blk, &perm, &shared_perm);
+
+ret = blk_set_perm(blk, perm | BLK_PERM_RESIZE, shared_perm, errp);
+if (ret < 0) {
+return ret;
+}
+
+ret = blk_truncate(blk, len, false, PREALLOC_MODE_OFF, 0, errp);
+if (ret < 0) {
+return ret;
+}
+
+ret = blk_set_perm(blk, perm, shared_perm, errp);
+if (ret < 0) {
+return ret;
+}
+
+return 0;
+}
+
+static uint32_t nvme_allocate_nsid(NvmeCtrl *n)
+{
+uint32_t nsid = 0;
+for (int i = 1; i <= NVME_MAX_NAMESPACES; i++) {
+if (nvme_ns(n, i) || nvme_subsys_ns(n->subsys, i)) {
+continue;
+}
+
+nsid = i;
+return nsid;
+}
+return nsid;
+}
+
+static uint16_t nvme_namespace_create(NvmeCtrl *n, NvmeRequest *req)
+{
+uint32_t ret;
+NvmeIdNs id_ns_host;
+NvmeSubsystem *subsys = n->subsys;
+Error *err = NULL;
+uint8_t flbas_host;
+uint64_t ns_size;
+int lba_index;
+NvmeNamespace *ns;
+NvmeCtrl *ctrl;
+NvmeIdNs *id_ns;
+
+ret = nvme_h2c(n, (uint8_t *)&id_ns_host, sizeof(id_ns_host), req);
+if (ret) {
+return ret;
+}
+
+if (id_ns_host.ncap < id_ns_host.nsze) {
+return NVME_THIN_PROVISION_NO_SUPP | NVME_DNR;
+} else if (id_ns_host.ncap > id_ns_host.nsze) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+if (!id_ns_host.nsze) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+if (QSLIST_EMPTY(&subsys->unallocated_namespaces)) {
+return NVME_NS_ID_UNAVAILABLE;
+}
+
+ns = QSLIST_FIRST(&subsys->unallocated_namespaces);
+id_ns = &ns->id_ns;
+flbas_host = (id_ns_host.flbas) & (0xF);
+
+if (flbas_host > id_ns->nlbaf) {
+return NVME_INVALID_FORMAT | NVME_DNR;
+}
+
+ret = nvme_ns_setup(ns, &err);
+if (ret) {
+return ret;
+}
+
+id_ns

Re: [PATCH] docs: Link to protocol security considerations in uri docs

2021-08-19 Thread Wouter Verhelst
Hi Eric,

On Wed, Aug 18, 2021 at 11:02:48AM -0500, Eric Blake wrote:
> Dan Berrangé and I thought about some more potential future problems:
> right now, even with FORCEDTLS mode (in both client and server), we
> have NO way to validate that the initial NBD_FLAG_[C_] bits advertised
> between client and server were not tampered with by a MitM during the
> plaintext portion of the exchange.  The flags field is 16 bits sent
> from the server, and 16 bits mirrored back by the client, to determine
> which protocol features will be in use the remainder of the
> connection.  Right now, exactly two of those bits are defined:
> 
> NBD_FLAG_FIXED_NEWSTYLE - the spec is clear that NBD_OPT_STARTTLS
> should not be sent unless this bit was negotiated.  Thus, both client
> and server will be sending the bit set, and absence of the bit will be
> evidence of a MitM attempting a downgrade attack, and there's nothing
> further to worry about in the protocol.  Once STARTTLS is executed, we
> already know this capability was available, so we don't need a way to
> re-verify it while encrypted.
> 
> NBD_FLAG_NO_ZEROES - this bit controls how the server will respond to
> NBD_OPT_EXPORT_NAME.  A mismatch between this bit (whether the MitM
> strips or adds the bit) will only be observable if the client uses
> NBD_OPT_EXPORT_NAME, but all clients that use STARTTLS are already
> encouraged to use NBD_OPT_GO.  We may want to tighten the security
> portion of the spec to make this recommendation even stronger (that
> is, any client that wants to ensure there is no MitM downgrade attack
> MUST use NBD_OPT_GO rather than NBD_OPT_EXPORT_NAME; and all servers
> that support TLS MUST support NBD_OPT_GO), but once a client uses the
> modern export selection code, we no longer care about mismatches in
> this bit, and therefore we don't need a way to re-verify it while
> encrypted.
> 
> However, I'm also worried about future extensions.  For example, we
> have been considering the addition of a way for clients to make 64-bit
> requests (basically, allowing NBD_OPT_WRITE_ZEROES to become a
> single-transaction bulk-zero for an export larger than 4G).  If the
> way this extension is haggled between client and server utilizes only
> a new NBD_FLAG_*, then we have introduced a potential security hole,
> because we are back in the situation of having a MitM flip bits prior
> to STARTTLS so that client and server do not agree on which protocol
> changes to use.  We can avoid this by adding a way to validate which
> capability bits are actually common between client and server via a
> new NBD_OPT_ sent after STARTTLS.  But rather than needing a way to
> re-verify which flags are common, it is just as easy to merely declare
> that ALL future protocol extensions will be haggled via NBD_OPT_ in
> the first place (and not by adding new NBD_FLAG_ bits).  That way,
> there will be no further places in the NBD protocol where a MitM
> plaintext injection can interfere with what the client and server use
> to talk to one another post-encryption.
> 
> Is it worth formalizing this decision into the NBD spec, so that we
> remember down the road that adding new NBD_FLAG_ bits is an inherent
> security risk thanks to the presence of STARTTLS?

I see the flags field as a way to negotiate incompatible changes *during
the negotiation phase*. This is true for both current flags:
FIXED_NEWSTYLE is supposed to fix the newstyle negotiation, and
NO_ZEROES changes the format of the reply to the EXPORT_NAME option,
which if used is the final message exchange during the negotiation
phase. Other protocol differences are negotiated with options or with
per-export flags (which are sent in the TLS part of the connection).

The only situation that I can think of in which we would need a new flag
is if for some reason we had to change the message format of the
NBD_OPT_* messages.  I don't see this happening. Even if we did have to
change that format, there are 2^32 option numbers available, which means
they are effectively unlimited; so if for whatever reason we had a
situation where the NBD_OPT_* message format is not flexible enough, we
could use a new option to signal "enable the new message format", which
would be written in the "old" format; if that new option number returns
NBD_REP_ERR_UNSUP, then we know the new message format is not supported
and we fall back to the old message format. That's slightly less
efficient than just setting a flag, but then who cares about a few bytes
in a protocol like NBD, which is expected to transfer large amounts of
data down the line.

I think we can indeed decide that we won't add any global flags anymore.

Thanks,

-- 
 w@uter.{be,co.za}
wouter@{grep.be,fosdem.org,debian.org}



Re: [PATCH] raw-format: drop WRITE and RESIZE child perms when possible

2021-08-19 Thread Hanna Reitz

On 26.07.21 14:28, Stefan Hajnoczi wrote:

The following command-line fails due to a permissions conflict:

   $ qemu-storage-daemon \
   --blockdev driver=nvme,node-name=nvme0,device=:08:00.0,namespace=1 \
   --blockdev driver=raw,node-name=l1-1,file=nvme0,offset=0,size=1073741824 
\
   --blockdev 
driver=raw,node-name=l1-2,file=nvme0,offset=1073741824,size=1073741824 \
   --nbd-server addr.type=unix,addr.path=/tmp/nbd.sock,max-connections=2 \
   --export type=nbd,id=nbd-l1-1,node-name=l1-1,name=l1-1,writable=on \
   --export type=nbd,id=nbd-l1-2,node-name=l1-2,name=l1-2,writable=on

   qemu-storage-daemon: --export 
type=nbd,id=nbd-l1-1,node-name=l1-1,name=l1-1,writable=on: Permission conflict 
on node 'nvme0': permissions 'resize' are both required by node 'l1-1' (uses 
node 'nvme0' as 'file' child) and unshared by node 'l1-2' (uses node 'nvme0' as 
'file' child).

The problem is that block/raw-format.c relies on bdrv_default_perms() to
set permissions on the nvme node. The default permissions add RESIZE in
anticipation of a format driver like qcow2 that needs to grow the image
file. This fails because RESIZE is unshared, so we cannot get the RESIZE
permission.

Max Reitz pointed out that block/crypto.c already handles this case by
implementing a custom ->bdrv_child_perm() function that adjusts the
result of bdrv_default_perms().

This patch takes the same approach in block/raw-format.c so that RESIZE
is only required if it's actually necessary (e.g. the parent is qcow2).

Cc: Max Reitz 
Cc: Kevin Wolf 
Signed-off-by: Stefan Hajnoczi 
---
This is not a bug fix, so I didn't mark it for QEMU 6.1. It's new
behavior that hasn't been supported before. I want to split an NVMe
drive using the raw format's offset=/size= feature.
---
  block/raw-format.c | 21 -
  1 file changed, 20 insertions(+), 1 deletion(-)


Thanks, applied to my block-next branch:

https://github.com/XanClic/qemu/commits/block-next

Hanna




Re: [PATCH] block/monitor: Consolidate hmp_handle_error calls to reduce redundant code

2021-08-19 Thread Hanna Reitz

On 02.08.21 08:25, Mao Zhongyi wrote:

Signed-off-by: Mao Zhongyi 
---
  block/monitor/block-hmp-cmds.c | 12 ++--
  1 file changed, 6 insertions(+), 6 deletions(-)


Thanks, applied to my block-next branch for 6.2:

https://github.com/XanClic/qemu/commits/block-next

Hanna




Re: [PATCH v3] block/file-win32: add reopen handlers

2021-08-19 Thread Hanna Reitz

On 17.08.21 22:21, Viktor Prutyanov wrote:

Make 'qemu-img commit' work on Windows.

Command 'commit' requires reopening backing file in RW mode. So,
add reopen prepare/commit/abort handlers and change dwShareMode
for CreateFile call in order to allow further read/write reopening.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/418
Suggested-by: Hanna Reitz 
Signed-off-by: Viktor Prutyanov 
---
  v2:
 - fix indentation in raw_reopen_prepare
 - free rs if raw_reopen_prepare fails
  v3:
 - restore suggested-by field missed in v2

  block/file-win32.c | 90 +-
  1 file changed, 89 insertions(+), 1 deletion(-)


Overall, looks good to me, thanks!

I just have some questions/suggestions on places where this patch 
deviates from my draft:



diff --git a/block/file-win32.c b/block/file-win32.c
index 2642088bd6..9dcbb2b53b 100644
--- a/block/file-win32.c
+++ b/block/file-win32.c
@@ -58,6 +58,10 @@ typedef struct BDRVRawState {
  QEMUWin32AIOState *aio;
  } BDRVRawState;
  
+typedef struct BDRVRawReopenState {

+HANDLE hfile;
+} BDRVRawReopenState;
+
  /*
   * Read/writes the data to/from a given linear buffer.
   *
@@ -392,7 +396,7 @@ static int raw_open(BlockDriverState *bs, QDict *options, 
int flags,
  }
  
  s->hfile = CreateFile(filename, access_flags,

-  FILE_SHARE_READ, NULL,
+  FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
OPEN_EXISTING, overlapped, NULL);
  if (s->hfile == INVALID_HANDLE_VALUE) {
  int err = GetLastError();
@@ -634,6 +638,86 @@ static int coroutine_fn raw_co_create_opts(BlockDriver 
*drv,
  return raw_co_create(&options, errp);
  }
  
+static int raw_reopen_prepare(BDRVReopenState *state,

+  BlockReopenQueue *queue, Error **errp)
+{
+BDRVRawState *s = state->bs->opaque;
+BDRVRawReopenState *rs;
+int access_flags;
+DWORD overlapped;
+int ret = 0;


Comparing with my original draft 
(https://gitlab.com/hreitz/qemu/-/commit/433ee9a6559dad253e7553692f942dc1824132f0), 
I prevented reopening any node that is not of type FTYPE_FILE (i.e. host 
devices), just to be sure (and because I thought we wouldn’t really need 
other cases).


I don’t strongly lean either way, so perhaps we should indeed just allow 
reopening host devices, but if we do, I think the CreateFile() call in 
hdev_open() should be changed to pass FILE_SHARE_READ | 
FILE_SHARED_WRITE, too (instead of just FILE_SHARE_READ).



+
+rs = g_new0(BDRVRawReopenState, 1);
+


I had this comment here:


/*
 * We do not support changing any options (only flags). By leaving
 * all options in state->options, we tell the generic reopen code
 * that we do not support changing any of them, so it will verify
 * that their values did not change.
 */


Is there a reason you chose to not include it?  (I think it’s quite nice 
to have this comment, because otherwise it may not be clear why it’s 
“fine” that we don’t look into state->options at all – it’s fine because 
leaving it untouched will make the generic block code verify that 
nothing was changed.)



+raw_parse_flags(state->flags, s->aio != NULL, &access_flags, &overlapped);
+rs->hfile = CreateFile(state->bs->filename, access_flags,
+   FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
+   OPEN_EXISTING, overlapped, NULL);
+
+if (rs->hfile == INVALID_HANDLE_VALUE) {
+int err = GetLastError();
+
+error_setg_win32(errp, err, "Could not reopen '%s'",
+ state->bs->filename);
+if (err == ERROR_ACCESS_DENIED) {
+ret = -EACCES;
+} else {
+ret = -EINVAL;
+}
+goto fail;
+}
+
+if (s->aio) {
+ret = win32_aio_attach(s->aio, rs->hfile);
+if (ret < 0) {
+error_setg_errno(errp, -ret, "Could not enable AIO");
+goto fail;
+}
+}
+
+state->opaque = rs;
+
+return 0;
+
+fail:
+g_free(rs);
+state->opaque = NULL;
+
+return ret;
+}
+
+static void raw_reopen_commit(BDRVReopenState *state)
+{
+BDRVRawState *s = state->bs->opaque;
+BDRVRawReopenState *rs = state->opaque;
+
+if (!rs) {
+return;
+}


I hope this can’t happen (and I believe it’d be a bug if it did), so I’d 
prefer an


assert(rs != NULL);

instead.

Hanna


+
+CloseHandle(s->hfile);
+s->hfile = rs->hfile;
+
+g_free(rs);
+state->opaque = NULL;
+}





[PATCH] qemu-img: Allow target be aligned to sector size

2021-08-19 Thread Hanna Reitz
We cannot write to images opened with O_DIRECT unless we allow them to
be resized so they are aligned to the sector size: Since 9c60a5d1978,
bdrv_node_refresh_perm() ensures that for nodes whose length is not
aligned to the request alignment and where someone has taken a WRITE
permission, the RESIZE permission is taken, too).

Let qemu-img convert pass the BDRV_O_RESIZE flag (which causes
blk_new_open() to take the RESIZE permission) when using cache=none for
the target, so that when writing to it, it can be aligned to the target
sector size.

Without this patch, an error is returned:

$ qemu-img convert -f raw -O raw -t none foo.img /mnt/tmp/foo.img
qemu-img: Could not open '/mnt/tmp/foo.img': Cannot get 'write'
permission without 'resize': Image size is not a multiple of request
alignment

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1994266
Signed-off-by: Hanna Reitz 
---
As I have written in the BZ linked above, I am not sure what behavior we
want.  It can be argued that the current behavior is perfectly OK
because we want the target to have the same size as the source, so if
this cannot be done, we should just print the above error (which I think
explains the problem well enough that users can figure out they need to
resize the source image).

OTOH, it is difficult for me to imagine a case where the user would
prefer the above error to just having qemu-img align the target image's
length.
---
 qemu-img.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/qemu-img.c b/qemu-img.c
index 908fd0cce5..d4b29bf73e 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2628,6 +2628,14 @@ static int img_convert(int argc, char **argv)
 goto out;
 }
 
+if (flags & BDRV_O_NOCACHE) {
+/*
+ * If we open the target with O_DIRECT, it may be necessary to
+ * extend its size to align to the physical sector size.
+ */
+flags |= BDRV_O_RESIZE;
+}
+
 if (skip_create) {
 s.target = img_open(tgt_image_opts, out_filename, out_fmt,
 flags, writethrough, s.quiet, false);
-- 
2.31.1