Re: Virtual FAT disk images

2021-09-09 Thread Eric Blake
On Thu, Sep 09, 2021 at 09:32:48AM +0200, Pascal wrote:
> up² :-)
> can someone just ping me to make sure I'm posting on the qemu list?

Your question is reaching the list, but this mail ought to be informative:

https://lists.gnu.org/archive/html/qemu-devel/2021-09/msg02463.html

In short, the fact that you are seeing performance penalties with the
vvfat driver is unsurprising, and it may be a sign of bigger bugs
lurking.  The lack of answers is more a sign that no one actively
participating on the list currently uses virtual FAT in any serious
manner, so you are proceeding at your own risk.

That said, if you'd like to debug it and submit patches, we'd be
grateful!

> 
> On Tue, Aug 31, 2021 at 09:24, Pascal wrote:
> 
> > up :-)
> > nobody uses this feature of qemu?
> >
> > On Fri, Aug 27, 2021 at 11:11, Pascal wrote:
> >
> >> hello everybody,
> >>
> >> virtual FAT disk image - *which is a convenient way to transfer files to
> >> the guest without having to activate its network* - seems to work very
> >> poorly with Windows: do you have the same difficulties?
> >>
> >> context : up to date archlinux, qemu 6.0.0, fresh installed windows 10
> >> 21H1.
> >>
> >> /usr/bin/qemu-system-x86_64 -accel kvm -machine q35 -m 1024 -device
> >> nec-usb-xhci -device usb-tablet -cpu qemu64,kvm=off -parallel null -serial
> >> mon:stdio -hda windows.disk -hdb fat:rw:/tmp/test/

There may be other, less risky ways for easily transferring files
between host and guest; I know in the past I have seen both 9p and MTP
file systems mentioned as approaches, although I do not have
experience with setting those up myself to give you a sample command
line.
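
For what it's worth, a 9p share is commonly wired up along the following lines (an untested sketch: the mount tag, paths, and security model are illustrative, and a Windows guest would additionally need a 9p driver, so this is mostly practical for Linux guests):

```
# Host side: export /tmp/test via virtio-9p (illustrative values)
qemu-system-x86_64 -accel kvm -m 1024 \
  -virtfs local,path=/tmp/test,mount_tag=hostshare,security_model=mapped-xattr,id=share0 \
  ...

# Linux guest side: mount the exported tree
mount -t 9p -o trans=virtio,version=9p2000.L hostshare /mnt
```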

> >>
> >> access to the E: drive is extremely slow and the system events report
> >> many storahci (129: reinit \device\RaidPort0 sent) and disk (153: I/O
> >> @0x3f on disk 1 \device\0001f replayed) warnings.
> >>
> >> regards.
> >>
> >

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [RFC PATCH 03/10] block: Use qemu_security_policy_taint() API

2021-09-09 Thread Eric Blake
On Thu, Sep 09, 2021 at 01:20:17AM +0200, Philippe Mathieu-Daudé wrote:
> Add the BlockDriver::bdrv_taints_security_policy() handler.
> Drivers implementing it might taint the global QEMU security
> policy.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  include/block/block_int.h | 6 +-
>  block.c   | 6 ++
>  2 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index f1a54db0f8c..0ec0a5c06e9 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -169,7 +169,11 @@ struct BlockDriver {
>  int (*bdrv_file_open)(BlockDriverState *bs, QDict *options, int flags,
>Error **errp);
>  void (*bdrv_close)(BlockDriverState *bs);
> -
> +/*
> + * Return %true if the driver is withing QEMU security policy boundary,

within

> + * %false otherwise. See: https://www.qemu.org/contribute/security-process/
> + */
> +bool (*bdrv_taints_security_policy)(BlockDriverState *bs);
>  
>  int coroutine_fn (*bdrv_co_create)(BlockdevCreateOptions *opts,
> Error **errp);

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [RFC PATCH 02/10] accel: Use qemu_security_policy_taint(), mark KVM and Xen as safe

2021-09-09 Thread Eric Blake
On Thu, Sep 09, 2021 at 01:20:16AM +0200, Philippe Mathieu-Daudé wrote:
> Add the AccelClass::secure_policy_supported field to classify
> safe (within security boundary) vs unsafe accelerators.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  include/qemu/accel.h | 5 +
>  accel/kvm/kvm-all.c  | 1 +
>  accel/xen/xen-all.c  | 1 +
>  softmmu/vl.c | 3 +++
>  4 files changed, 10 insertions(+)
> 
> diff --git a/include/qemu/accel.h b/include/qemu/accel.h
> index 4f4c283f6fc..895e30be0de 100644
> --- a/include/qemu/accel.h
> +++ b/include/qemu/accel.h
> @@ -44,6 +44,11 @@ typedef struct AccelClass {
> hwaddr start_addr, hwaddr size);
>  #endif
>  bool *allowed;
> +/*
> + * Whether the accelerator is withing QEMU security policy boundary.

within

> + * See: https://www.qemu.org/contribute/security-process/
> + */
> +bool secure_policy_supported;
>  /*
>   * Array of global properties that would be applied when specific
>   * accelerator is chosen. It works like MachineClass.compat_props

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [RFC PATCH 01/10] sysemu: Introduce qemu_security_policy_taint() API

2021-09-09 Thread Eric Blake
On Thu, Sep 09, 2021 at 01:20:15AM +0200, Philippe Mathieu-Daudé wrote:
> Introduce qemu_security_policy_taint() which allows unsafe (read
> "not very maintained") code to 'taint' QEMU security policy.
> 
> The "security policy" is the @SecurityPolicy QAPI enum, composed of:
> - "none"   (no policy, current behavior)
> - "warn"   (display a warning when the policy is tainted, keep going)
> - "strict" (once tainted, exit QEMU before starting the VM)
> 
> The qemu_security_policy_is_strict() helper is also provided, which
> will be proved useful once a VM is started (example we do not want

s/be proved/prove/

> to kill a running VM if an unsafe device is hot-added).
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  qapi/run-state.json   | 16 +++
>  include/qemu-common.h | 19 
>  softmmu/vl.c  | 67 +++
>  qemu-options.hx   | 17 +++
>  4 files changed, 119 insertions(+)
> 
> diff --git a/qapi/run-state.json b/qapi/run-state.json
> index 43d66d700fc..b15a107fa01 100644
> --- a/qapi/run-state.json
> +++ b/qapi/run-state.json
> @@ -638,3 +638,19 @@
>  { 'struct': 'MemoryFailureFlags',
>'data': { 'action-required': 'bool',
>  'recursive': 'bool'} }
> +
> +##
> +# @SecurityPolicy:
> +#
> +# An enumeration of the actions taken when the security policy is tainted.
> +#
> +# @none: do nothing.
> +#
> +# @warn: display a warning.
> +#
> +# @strict: prohibit QEMU to start a VM.

s/to start/from starting/

> +#
> +# Since: 6.2
> +##
> +{ 'enum': 'SecurityPolicy',
> +  'data': [ 'none', 'warn', 'strict' ] }

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-09 Thread Peter Xu
On Thu, Sep 09, 2021 at 01:58:39AM -0300, Leonardo Bras Soares Passos wrote:
> FWIW, what I had in mind for a (theoretical) migration setup with
> io_async_writev() + io_async_flush():

One trivial concern is that it's not strictly just "async", because "async"
can happen on any nonblocking fd; here it's more "zerocopy"-specific.  But as
long as Dan is okay with it I'm fine with starting with "async" too, as long
as we properly document the lifecycle requirements of the buffers.

> - For guest RAM we can decide not to rw_lock memory on zerocopy,
> because there is no need,
> - For headers, we can decide to not use async  (use io_writev() instead),
> - flush() can happen each iteration of migration, or at each N
> seconds, or at the end.

I think we should only need the flush at the end of precopy, and that should
also cover when switching to postcopy.  We could merge this flush() into the
existing per-iteration sync of multi-fd, but it's not strictly necessary, imho.
We can see which is easier. The rest looks good to me.  Thanks,

-- 
Peter Xu




Re: qcow2 performance: read-only IO on the guest generates high write IO on the host

2021-09-09 Thread Christopher Pereira


On 24-08-2021 11:37, Kevin Wolf wrote:

[ Cc: qemu-block ]

On 11.08.2021 at 13:36, Christopher Pereira wrote:

Hi,

I'm reading a directory with 5.000.000 files (2,4 GB) inside a guest using
"find | grep -c".

On the host I saw high write IO (40 MB/s !) during over 1 hour using
virt-top.

I later repeated the read-only operation inside the guest and no additional
data was written on the host. The operation took only some seconds.

I believe QEMU was creating some kind of cache or metadata map the first
time I accessed the inodes.

No, at least in theory, QEMU shouldn't allocate anything when you're
just reading.

Hmm...interesting.

Are you sure that this isn't activity coming from your guest OS?


Yes. iotop was showing only read IOs on the guest, and on the host iotop
and virt-top were showing heavy write IOs for hours.
Stopping the "find" command on the guest also stopped the write IOs on 
the host.



But I wonder why the cache or metadata map wasn't available the first time
and why QEMU had to recreate it?

The VM has "compressed base <- snap 1" and base was converted without
prealloc.

Is it because we created the base using convert without metadata prealloc
and so the metadata map got lost?

I will do some experiments soon using convert + metadata prealloc and
probably find out myself, but I will happy to read your comments and gain
some additional insights.
If the problem persists, I would try again without compression.

What were the results of your experiments? Is the behaviour related to
any of these options?


I will do the experiments and report back.

It's also strange that the second time I repeat the "find" command, I 
see no more write IOs and it takes only seconds instead of hours.


I was assuming QEMU was creating some kind of map or cache on the 
snapshot for the content present in the base, but now I got more curious.
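
As a reference point for those experiments, metadata preallocation can be requested at convert time along these lines (illustrative file names):

```
# Convert the base image with qcow2 metadata preallocated up front
qemu-img convert -O qcow2 -o preallocation=metadata source.img base.qcow2
```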




Re: [RFC PATCH 00/10] security: Introduce qemu_security_policy_taint() API

2021-09-09 Thread Alexander Bulekov
On 210909 0120, Philippe Mathieu-Daudé wrote:
> Hi,
> 
> This series is experimental! The goal is to better limit the
> boundary of what code is considered security critical, and
> what is less critical (but still important!).
> 
> This approach was quickly discussed a few months ago with Markus
> then Daniel. Instead of classifying the code on a file path
> basis (see [1]), we insert (runtime) hints into the code
> (which survive code movement). Offending unsafe code can
> taint the global security policy. By default this policy
> is 'none': the current behavior. It can be changed on the
> command line to 'warn' to display warnings, and to 'strict'
> to prohibit QEMU running with a tainted policy.
> 
> As examples I started implementing unsafe code taint from
> 3 different pieces of code:
> - accelerators (KVM and Xen in allow-list)
> - block drivers (vvfat and partial null-co in deny-list)
> - qdev (hobbyist devices regularly hit by fuzzer)

Just looking through the list of hci, storage, network and graphics
devices available on i386 to see which others are potential good
candidates for this tag. Obviously a lot of guesswork here:

USB devices:
name "ich9-usb-ehci1", bus PCI
name "ich9-usb-ehci2", bus PCI
name "ich9-usb-uhci1", bus PCI
name "ich9-usb-uhci2", bus PCI
name "ich9-usb-uhci3", bus PCI
name "ich9-usb-uhci4", bus PCI
name "ich9-usb-uhci5", bus PCI
name "ich9-usb-uhci6", bus PCI
name "nec-usb-xhci", bus PCI
name "pci-ohci", bus PCI, desc "Apple USB Controller"
name "piix3-usb-uhci", bus PCI
name "piix4-usb-uhci", bus PCI
name "qemu-xhci", bus PCI
name "usb-ehci", bus PCI

Not sure about these. Maybe ohci isn't sensitive?

Storage devices:
=== Sensitive ===
name "floppy", bus floppy-bus, desc "virtual floppy drive"
name "ide-cd", bus IDE, desc "virtual IDE CD-ROM"
name "ide-hd", bus IDE, desc "virtual IDE disk"
name "isa-fdc", bus ISA, desc "virtual floppy controller"
name "isa-ide", bus ISA
name "piix3-ide", bus PCI
name "piix3-ide-xen", bus PCI
name "piix4-ide", bus PCI
name "scsi-block", bus SCSI, desc "SCSI block device passthrough"
name "scsi-cd", bus SCSI, desc "virtual SCSI CD-ROM"
name "scsi-generic", bus SCSI, desc "pass through generic scsi device (/dev/sg*)"
name "scsi-hd", bus SCSI, desc "virtual SCSI disk"
name "vhost-scsi", bus virtio-bus
name "vhost-scsi-pci", bus PCI
name "vhost-scsi-pci-non-transitional", bus PCI
name "vhost-scsi-pci-transitional", bus PCI
name "vhost-user-blk", bus virtio-bus
name "vhost-user-blk-pci", bus PCI
name "vhost-user-blk-pci-non-transitional", bus PCI
name "vhost-user-blk-pci-transitional", bus PCI
name "vhost-user-fs-device", bus virtio-bus
name "vhost-user-fs-pci", bus PCI
name "vhost-user-scsi", bus virtio-bus
name "vhost-user-scsi-pci", bus PCI
name "vhost-user-scsi-pci-non-transitional", bus PCI
name "vhost-user-scsi-pci-transitional", bus PCI
name "virtio-9p-device", bus virtio-bus
name "virtio-9p-pci", bus PCI, alias "virtio-9p"
name "virtio-9p-pci-non-transitional", bus PCI
name "virtio-9p-pci-transitional", bus PCI
name "virtio-blk-device", bus virtio-bus
name "virtio-blk-pci", bus PCI, alias "virtio-blk"
name "virtio-blk-pci-non-transitional", bus PCI
name "virtio-blk-pci-transitional", bus PCI
name "virtio-pmem", bus virtio-bus
name "virtio-scsi-device", bus virtio-bus
name "virtio-scsi-pci", bus PCI, alias "virtio-scsi"
name "virtio-scsi-pci-non-transitional", bus PCI
name "virtio-scsi-pci-transitional", bus PCI

=== Tainting/Not Sensitive ===
name "am53c974", bus PCI, desc "AMD Am53c974 PCscsi-PCI SCSI adapter"
name "dc390", bus PCI, desc "Tekram DC-390 SCSI adapter"
name "ich9-ahci", bus PCI, alias "ahci"
name "lsi53c810", bus PCI
name "lsi53c895a", bus PCI, alias "lsi"
name "megasas", bus PCI, desc "LSI MegaRAID SAS 1078"
name "megasas-gen2", bus PCI, desc "LSI MegaRAID SAS 2108"
name "mptsas1068", bus PCI, desc "LSI SAS 1068"
name "nvdimm", desc "DIMM memory module"
name "nvme", bus PCI, desc "Non-Volatile Memory Express"
name "nvme-ns", bus nvme-bus, desc "Virtual NVMe namespace"
name "nvme-subsys", desc "Virtual NVMe subsystem"
name "pvscsi", bus PCI
name "sd-card", bus sd-bus
name "sdhci-pci", bus PCI
name "usb-bot", bus usb-bus
name "usb-mtp", bus usb-bus, desc "USB Media Transfer Protocol device"
name "usb-storage", bus usb-bus
name "usb-uas", bus usb-bus

Network devices:
=== Sensitive ===
name "e1000", bus PCI, alias "e1000-82540em", desc "Intel Gigabit Ethernet"
name "e1000e", bus PCI, desc "Intel 82574L GbE Controller"
name "virtio-net-device", bus virtio-bus
name "virtio-net-pci", bus PCI, alias "virtio-net"
name "virtio-net-pci-non-transitional", bus PCI
name "virtio-net-pci-transitional", bus PCI

=== Tainting/Not Sensitive ===
name "e1000-82544gc", bus PCI, desc "Intel Gigabit Ethernet"
name "e1000-82545em", bus PCI, desc "Intel Gigabit Ethernet"
name "i82550", bus PCI, desc "Intel i82550 Ethernet"
name "i82551", bus PCI, desc "Intel i82551 Ethernet"
name "i82557a", bus PCI, desc "Intel i82557A Ethernet"
name "i82557b", 

Re: [PATCH] hw/nvme: select first free NSID for legacy drive configuration

2021-09-09 Thread Hannes Reinecke
On 9/9/21 12:52 PM, Klaus Jensen wrote:
> On Sep  9 11:51, Hannes Reinecke wrote:
>> If a legacy 'drive' argument is passed to the controller we cannot
>> assume that '1' will be a free NSID, as the subsys might already
>> have attached a namespace to this NSID. So select the first free
>> one.
>>
>> Signed-off-by: Hannes Reinecke 
>> ---
>>  hw/nvme/ctrl.c | 9 -
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
>> index 757cdff038..2c69031ca9 100644
>> --- a/hw/nvme/ctrl.c
>> +++ b/hw/nvme/ctrl.c
>> @@ -6546,8 +6546,15 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>>  
>>  /* setup a namespace if the controller drive property was given */
>>  if (n->namespace.blkconf.blk) {
>> +int i;
>>  ns = &n->namespace;
>> -ns->params.nsid = 1;
>> +for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
>> +if (nvme_ns(n, i) || nvme_subsys_ns(n->subsys, i)) {
>> +continue;
>> +}
>> +ns->params.nsid = i;
>> +break;
>> +}
>>  
>>  if (nvme_ns_setup(ns, errp)) {
>>  return;
>> -- 
>> 2.26.2
>>
> 
> Did you actually hit this?
> 
> Because then the property checking logic is bad... The device is not
> supposed to allow nvme,drive= combined with a subsystem property. In
> nvme_check_constraints():
> 
>   if (n->namespace.blkconf.blk && n->subsys) {
> /* error out */
> return;
>   }
> 
> 
Ah. Missed that.
Do ignore my patch then.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



Re: [PATCH] hw/nvme: reattach subsystem namespaces on hotplug

2021-09-09 Thread Hannes Reinecke
On 9/9/21 12:47 PM, Klaus Jensen wrote:
> On Sep  9 11:43, Hannes Reinecke wrote:
>> With commit 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
>> namespaces get moved from the controller to the subsystem if one
>> is specified.
>> That keeps the namespaces alive after a controller hot-unplug, but
>> after a controller hotplug we have to reconnect the namespaces
>> from the subsystem to the controller.
>>
>> Fixes: 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
>> Cc: Klaus Jensen 
>> Signed-off-by: Hannes Reinecke 
>> ---
>>  hw/nvme/subsys.c | 8 +++-
>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
>> index 93c35950d6..a9404f2b5e 100644
>> --- a/hw/nvme/subsys.c
>> +++ b/hw/nvme/subsys.c
>> @@ -14,7 +14,7 @@
>>  int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
>>  {
>>  NvmeSubsystem *subsys = n->subsys;
>> -int cntlid;
>> +int cntlid, nsid;
>>  
>>  for (cntlid = 0; cntlid < ARRAY_SIZE(subsys->ctrls); cntlid++) {
>>  if (!subsys->ctrls[cntlid]) {
>> @@ -29,12 +29,18 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
>>  
>>  subsys->ctrls[cntlid] = n;
>>  
>> +for (nsid = 0; nsid < ARRAY_SIZE(subsys->namespaces); nsid++) {
>> +if (subsys->namespaces[nsid]) {
>> +nvme_attach_ns(n, subsys->namespaces[nsid]);
>> +}
> 
> Thanks Hannes! I like it, keeping things simple.
> 
> But we should only attach namespaces that have the shared property or
> have ns->attached == 0. Non-shared namespaces may already be attached to
> another controller in the subsystem.
> 

Well ... I tried to avoid that subject, but as you brought it up:
There is a slightly tricky issue in fabrics, in that the 'controller' is
independent from the 'connection'.
The 'shared' bit in the CMIC setting indicates that the subsystem may
have more than one _controller_. It doesn't talk about how many
_connections_ a controller may support; that then is the realm of
dynamic or static controllers, which we don't talk about :-).
Suffice it to say, Linux only implements the dynamic controller model,
so every connection will be to a different controller.
But you have been warned :-)

However, the 'CMIC' setting is independent of the 'NMIC' setting (i.e. the
'shared' bit in the namespace).
Which leads to the interesting question on how exactly should one handle
non-shared namespaces in subsystems for which there are multiple
controllers. Especially as the NSID space is per subsystem, so each
controller will be able to figure out if there are blanked-out namespaces.
So to make this a sane setup I would propose to set the 'shared' option
automatically whenever the controller has the 'subsys' option set.
And actually, I would ditch the 'shared' option completely, and make it
dependent on the 'subsys' setting for the controller.
Much like we treat the 'CMIC' setting nowadays.
That avoids a lot of confusion, and also makes the implementation _way_
easier.

> However...
> 
> The spec says that "The attach and detach operations are persistent
> across all reset events.". This means that we should track those events
> in the subsystem and only reattach namespaces that were attached at the
> time of the "reset" event. Currently, we don't have anything mapping
> that state. But the device already has to take some liberties with
> regard to stuff that is considered persistent by the spec (SMART log
> etc.) since we do not have any way to store stuff persistently across
> qemu invocations, so I think the above is an acceptable compromise.
> 
Careful. 'attach' and 'detach' are MI (management interface) operations.
Whether and how many namespaces are visible to any given controller is
actually independent of that; and, in fact, controllers might not even
implement 'attach' or 'detach'.
But I do agree that we don't handle the 'reset' state properly.

> A potential (as good as it gets) fix would be to keep a map/list of
> "persistently" attached controllers on the namespaces and re-attach
> according to that when we see that controller joining the subsystem
> again. CNTLID would be the obvious choice for the key here, but the problem
> is that we can't really use it since we assign it sequentially from the
> subsystem, which now looks like a pretty bad choice. CNTLID should have
> been a required property of the nvme device when subsystems are
> involved. Maybe we can fix up the CNTLID assignment to take the serial
> into account (we know that is defined and *should* be unique) and not
> reuse CNTLIDs. This limits the subsystem to NVME_MAX_CONTROLLERS unique
> controllers, but I think that is fair enough.
> 
> Sigh. Need to think this through.
> 
Well, actually there is an easy way out. I do agree that we need to make
the 'cntlid' a property of the controller. And once it's set we can
track it properly, e.g. by having per-cntlid nsid lists in the subsystem.
But if it's not set we can claim that we'll be allocating a new
controller 

Re: [PATCH v1 2/4] virtio: increase virtqueue size for virtio-scsi and virtio-blk

2021-09-09 Thread Denis Plotnikov



On 09.09.2021 11:28, Stefano Garzarella wrote:

On Wed, Sep 08, 2021 at 06:20:49PM +0300, Denis Plotnikov wrote:


On 08.09.2021 16:22, Stefano Garzarella wrote:

Message bounced, I use new Denis's email address.

On Wed, Sep 08, 2021 at 03:17:16PM +0200, Stefano Garzarella wrote:

Hi Denis,
I just found this discussion since we still have the following line 
in hw/core/machine.c:

   { "vhost-blk-device", "seg_max_adjust", "off"}

IIUC it was a typo, and I think we should fix it since in the 
future we can have `vhost-blk-device`.


So, I think we have 2 options:
1. remove that line since for now it is useless
2. replace with "vhost-scsi"

I'm not sure which is the best, what do you suggest?

Thanks,
Stefano


Hi Stefano

I prefer to just remove the line without replacing it. This will keep 
things exactly as they are now.


Replacing with "vhost-scsi" will affect seg_max (limit to 126) on 
newly created VMs with machine types using "hw_compat_4_2" and older.


Now, because of the typo (error), their seg-max is queue-size 
dependent. I'm not sure if replacing the line may cause any 
problems, for example on migration: source (queue-size 256, seg max 
254) -> destination (queue-size 256, seg max 126). But it will 
definitely introduce two different behaviors for VMs with 
"hw_compat_4_2" and older. So, I'd choose the lesser of two evils and 
keep things as they are now.




Yep, makes sense. It was the same concern I had.

Do you want to send a patch to remove it with this explanation?


Yes, sure.

I'll do it today.

Denis



Thanks for the clarification,
Stefano





Re: [RFC PATCH 06/10] qdev: Use qemu_security_policy_taint() API

2021-09-09 Thread Daniel P . Berrangé
On Thu, Sep 09, 2021 at 01:20:20AM +0200, Philippe Mathieu-Daudé wrote:
> Add DeviceClass::taints_security_policy field to allow an
> unsafe device to eventually taint the global security policy
> in DeviceRealize().
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  include/hw/qdev-core.h |  6 ++
>  hw/core/qdev.c | 11 +++
>  2 files changed, 17 insertions(+)
> 
> diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
> index bafc311bfa1..ff9ce6671be 100644
> --- a/include/hw/qdev-core.h
> +++ b/include/hw/qdev-core.h
> @@ -122,6 +122,12 @@ struct DeviceClass {
>   */
>  bool user_creatable;
>  bool hotpluggable;
> +/*
> + * %false if the device is within the QEMU security policy boundary,
> + * %true if there is no guarantee this device can be used safely.
> + * See: https://www.qemu.org/contribute/security-process/
> + */
> +bool taints_security_policy;
>  
>  /* callbacks */
>  /*

Although your use case is for devices, it probably makes more sense
to push this up into the Object base class.

I think it will need to be a tri-state value too, not a simple
bool. It isn't feasible to mark all devices with this property,
so initially we'll have no information about whether most
devices are secure or insecure. This patch gives the implication
that all devices are secure, except for the few that have been
marked otherwise, which is not a good default IMHO.

We want to be able to make it clear when introspecting, that we
have no information on security available for most devices ie

 - unset  => no information on security (the current default)
 - true => considered secure against malicious guest
 - false => considered insecure against malicious guest

Then we can also extend 'ObjectTypeInfo' to have a

   '*secure': 'bool'

to make 'qom-list-types' be able to introspect this upfront.
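
Concretely, that introspection extension might look like this in the QAPI schema (a hypothetical sketch; the real ObjectTypeInfo definition lives in qapi/qom.json and has more fields):

```
{ 'struct': 'ObjectTypeInfo',
  'data': { 'name': 'str', '*abstract': 'bool', '*parent': 'str',
            '*secure': 'bool' } }
```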



> diff --git a/hw/core/qdev.c b/hw/core/qdev.c
> index cefc5eaa0a9..a5a00f3564c 100644
> --- a/hw/core/qdev.c
> +++ b/hw/core/qdev.c
> @@ -31,6 +31,7 @@
>  #include "qapi/qmp/qerror.h"
>  #include "qapi/visitor.h"
>  #include "qemu/error-report.h"
> +#include "qemu-common.h"
>  #include "qemu/option.h"
>  #include "hw/hotplug.h"
>  #include "hw/irq.h"
> @@ -257,6 +258,13 @@ bool qdev_hotplug_allowed(DeviceState *dev, Error **errp)
>  MachineClass *mc;
>  Object *m_obj = qdev_get_machine();
>  
> +if (qemu_security_policy_is_strict()
> +&& DEVICE_GET_CLASS(dev)->taints_security_policy) {
> +error_setg(errp, "Device '%s' can not be hotplugged when"
> + " 'strict' security policy is in place",
> +   object_get_typename(OBJECT(dev)));

Do you need a 'return' here to stop execution after reporting
the error?

> +}
> +
>  if (object_dynamic_cast(m_obj, TYPE_MACHINE)) {
>  machine = MACHINE(m_obj);
>  mc = MACHINE_GET_CLASS(machine);
> @@ -385,6 +393,9 @@ bool qdev_realize(DeviceState *dev, BusState *bus, Error **errp)
>  } else {
>  assert(!DEVICE_GET_CLASS(dev)->bus_type);
>  }
> +qemu_security_policy_taint(DEVICE_GET_CLASS(dev)->taints_security_policy,
> +   "device type %s",
> +   object_get_typename(OBJECT(dev)));
>  
>  return object_property_set_bool(OBJECT(dev), "realized", true, errp);
>  }
> -- 
> 2.31.1
> 

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [RFC PATCH 03/10] block: Use qemu_security_policy_taint() API

2021-09-09 Thread Daniel P . Berrangé
On Thu, Sep 09, 2021 at 11:40:07AM +0100, Daniel P. Berrangé wrote:
> On Thu, Sep 09, 2021 at 01:20:17AM +0200, Philippe Mathieu-Daudé wrote:
> > Add the BlockDriver::bdrv_taints_security_policy() handler.
> > Drivers implementing it might taint the global QEMU security
> > policy.
> > 
> > Signed-off-by: Philippe Mathieu-Daudé 
> > ---
> >  include/block/block_int.h | 6 +-
> >  block.c   | 6 ++
> >  2 files changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/block/block_int.h b/include/block/block_int.h
> > index f1a54db0f8c..0ec0a5c06e9 100644
> > --- a/include/block/block_int.h
> > +++ b/include/block/block_int.h
> > @@ -169,7 +169,11 @@ struct BlockDriver {
> >  int (*bdrv_file_open)(BlockDriverState *bs, QDict *options, int flags,
> >Error **errp);
> >  void (*bdrv_close)(BlockDriverState *bs);
> > -
> > +/*
> > + * Return %true if the driver is withing QEMU security policy boundary,
> > + * %false otherwise. See: https://www.qemu.org/contribute/security-process/
> > + */
> > +bool (*bdrv_taints_security_policy)(BlockDriverState *bs);

Also as with previous comments, I think we should not refer
to tainting or the security policy here, but instead simply
document whether we consider the bdrv to be secure or not.

Tainting is merely one action that is taken in accordance with
the security policy, as a result of the information presented.

> >  int coroutine_fn (*bdrv_co_create)(BlockdevCreateOptions *opts,
> > Error **errp);
> > diff --git a/block.c b/block.c
> > index b2b66263f9a..696ba486001 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -49,6 +49,7 @@
> >  #include "qemu/timer.h"
> >  #include "qemu/cutils.h"
> >  #include "qemu/id.h"
> > +#include "qemu-common.h"
> >  #include "block/coroutines.h"
> >  
> >  #ifdef CONFIG_BSD
> > @@ -1587,6 +1588,11 @@ static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv,
> >  }
> >  }
> >  
> > +if (drv->bdrv_taints_security_policy) {
> > +qemu_security_policy_taint(drv->bdrv_taints_security_policy(bs),
> > +                                   "Block protocol '%s'", drv->format_name);
> > +}
> > +
> >  return 0;
> >  open_failed:
> >  bs->drv = NULL;
> 
> Again we need a way to report this via QAPI, but we don't have a natural
> place to hang this off for introspection before starting a guest.
> 
> The best we can do is report the information after a block backend has
> been instantiated. eg  Modify "BlockInfo" struct to gain
> 
>'*secure': 'bool'
> 
> Note I made this an optional field, since unless we mark every single
> block driver impl straight away, we'll need to cope with the absence
> of information.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH] hw/nvme: select first free NSID for legacy drive configuration

2021-09-09 Thread Klaus Jensen
On Sep  9 11:51, Hannes Reinecke wrote:
> If a legacy 'drive' argument is passed to the controller we cannot
> assume that '1' will be a free NSID, as the subsys might already
> have attached a namespace to this NSID. So select the first free
> one.
> 
> Signed-off-by: Hannes Reinecke 
> ---
>  hw/nvme/ctrl.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 757cdff038..2c69031ca9 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -6546,8 +6546,15 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>  
>  /* setup a namespace if the controller drive property was given */
>  if (n->namespace.blkconf.blk) {
> +int i;
>  ns = &n->namespace;
> -ns->params.nsid = 1;
> +for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
> +if (nvme_ns(n, i) || nvme_subsys_ns(n->subsys, i)) {
> +continue;
> +}
> +ns->params.nsid = i;
> +break;
> +}
>  
>  if (nvme_ns_setup(ns, errp)) {
>  return;
> -- 
> 2.26.2
> 

Did you actually hit this?

Because then the property checking logic is bad... The device is not
supposed to allow nvme,drive= combined with a subsystem property. In
nvme_check_constraints():

  if (n->namespace.blkconf.blk && n->subsys) {
/* error out */
return;
  }




signature.asc
Description: PGP signature


Re: [PATCH] hw/nvme: reattach subsystem namespaces on hotplug

2021-09-09 Thread Klaus Jensen
On Sep  9 11:43, Hannes Reinecke wrote:
> With commit 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
> namespaces get moved from the controller to the subsystem if one
> is specified.
> That keeps the namespaces alive after a controller hot-unplug, but
> after a controller hotplug we have to reconnect the namespaces
> from the subsystem to the controller.
> 
> Fixes: 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
> Cc: Klaus Jensen 
> Signed-off-by: Hannes Reinecke 
> ---
>  hw/nvme/subsys.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
> index 93c35950d6..a9404f2b5e 100644
> --- a/hw/nvme/subsys.c
> +++ b/hw/nvme/subsys.c
> @@ -14,7 +14,7 @@
>  int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
>  {
>  NvmeSubsystem *subsys = n->subsys;
> -int cntlid;
> +int cntlid, nsid;
>  
>  for (cntlid = 0; cntlid < ARRAY_SIZE(subsys->ctrls); cntlid++) {
>  if (!subsys->ctrls[cntlid]) {
> @@ -29,12 +29,18 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
>  
>  subsys->ctrls[cntlid] = n;
>  
> +for (nsid = 0; nsid < ARRAY_SIZE(subsys->namespaces); nsid++) {
> +if (subsys->namespaces[nsid]) {
> +nvme_attach_ns(n, subsys->namespaces[nsid]);
> +}

Thanks Hannes! I like it, keeping things simple.

But we should only attach namespaces that have the shared property or
have ns->attached == 0. Non-shared namespaces may already be attached to
another controller in the subsystem.
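A self-contained sketch of that rule — with `shared`/`attached` as simplified stand-ins for the real namespace fields, and `subsys_reattach()` an invented name for the loop in the patch — might look like:

```c
#include <assert.h>
#include <stddef.h>

#define NVME_MAX_NAMESPACES 256

/* Simplified stand-ins for the real NvmeNamespace/NvmeSubsystem types. */
typedef struct {
    int shared;    /* namespace may be attached to several controllers */
    int attached;  /* number of controllers currently attached */
} NvmeNamespace;

typedef struct {
    NvmeNamespace *namespaces[NVME_MAX_NAMESPACES + 1];
} NvmeSubsystem;

/*
 * On controller registration, reattach only namespaces that are shared
 * or not attached to any other controller.  Returns how many were
 * attached.
 */
static int subsys_reattach(NvmeSubsystem *subsys)
{
    int attached = 0;

    for (int nsid = 1; nsid <= NVME_MAX_NAMESPACES; nsid++) {
        NvmeNamespace *ns = subsys->namespaces[nsid];

        if (!ns) {
            continue;
        }
        if (!ns->shared && ns->attached > 0) {
            continue;  /* private and already owned elsewhere */
        }
        ns->attached++;  /* stand-in for nvme_attach_ns() */
        attached++;
    }
    return attached;
}

/* Demo: one shared+attached, one private+attached, one private+free. */
static int subsys_reattach_demo(void)
{
    NvmeNamespace shared_ns    = { .shared = 1, .attached = 1 };
    NvmeNamespace private_busy = { .shared = 0, .attached = 1 };
    NvmeNamespace private_free = { .shared = 0, .attached = 0 };
    NvmeSubsystem subsys = { 0 };

    subsys.namespaces[1] = &shared_ns;
    subsys.namespaces[2] = &private_busy;
    subsys.namespaces[3] = &private_free;

    return subsys_reattach(&subsys);
}
```

The demo attaches the shared namespace and the free private one, but skips the private namespace already owned by another controller.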

However...

The spec says that "The attach and detach operations are persistent
across all reset events.". This means that we should track those events
in the subsystem and only reattach namespaces that were attached at the
time of the "reset" event. Currently, we don't have anything mapping
that state. But the device already has to take some liberties with
regard to stuff that is considered persistent by the spec (SMART log
etc.) since we do not have any way to store stuff persistently across
qemu invocations, so I think the above is an acceptable compromise.

A potential (as good as it gets) fix would be to keep a map/list of
"persistently" attached controllers on the namespaces and re-attach
according to that when we see that controller joining the subsystem
again. CNTLID would be the obvious choice for the key here, but the problem
is that we can't really use it since we assign it sequentially from the
subsystem, which now looks like a pretty bad choice. CNTLID should have
been a required property of the nvme device when subsystems are
involved. Maybe we can fix up the CNTLID assignment to take the serial
into account (we know that is defined and *should* be unique) and not
reuse CNTLIDs. This limits the subsystem to NVME_MAX_CONTROLLERS unique
controllers, but I think that is fair enough.
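That bookkeeping could be sketched as a per-namespace bitmap keyed by CNTLID — purely illustrative, with invented names, and assuming CNTLIDs are bounded and never reused as proposed above:

```c
#include <assert.h>
#include <stdint.h>

#define NVME_MAX_CONTROLLERS 32  /* the bound discussed above */

/* Per-namespace record of which CNTLIDs were attached at "reset" time. */
typedef struct {
    uint32_t attached_cntlids;  /* bit i set => cntlid i was attached */
} NsAttachState;

static void ns_mark_attached(NsAttachState *st, unsigned cntlid)
{
    st->attached_cntlids |= UINT32_C(1) << cntlid;
}

static void ns_mark_detached(NsAttachState *st, unsigned cntlid)
{
    st->attached_cntlids &= ~(UINT32_C(1) << cntlid);
}

/*
 * When a controller (re)joins the subsystem: should this namespace be
 * reattached to it?  Only if the same CNTLID was attached before.
 */
static int ns_was_attached(const NsAttachState *st, unsigned cntlid)
{
    return (st->attached_cntlids >> cntlid) & 1;
}
```

A host-initiated detach clears the bit, so a later hotplug of the same controller would not resurrect a namespace that was deliberately detached.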

Sigh. Need to think this through.

Bottomline I think I'm partial to just accepting your patch (with the
addition of taking the shared property into account) and documenting the
limitation wrt. persistency of attach/detach events. No matter how
spec-compliant we do it on a live system, we still break compliance
across QEMU invocations.




Re: [RFC PATCH 02/10] accel: Use qemu_security_policy_taint(), mark KVM and Xen as safe

2021-09-09 Thread Daniel P . Berrangé
On Thu, Sep 09, 2021 at 01:20:16AM +0200, Philippe Mathieu-Daudé wrote:
> Add the AccelClass::secure_policy_supported field to classify
> safe (within security boundary) vs unsafe accelerators.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  include/qemu/accel.h | 5 +
>  accel/kvm/kvm-all.c  | 1 +
>  accel/xen/xen-all.c  | 1 +
>  softmmu/vl.c | 3 +++
>  4 files changed, 10 insertions(+)
> 
> diff --git a/include/qemu/accel.h b/include/qemu/accel.h
> index 4f4c283f6fc..895e30be0de 100644
> --- a/include/qemu/accel.h
> +++ b/include/qemu/accel.h
> @@ -44,6 +44,11 @@ typedef struct AccelClass {
> hwaddr start_addr, hwaddr size);
>  #endif
>  bool *allowed;
> +/*
> + * Whether the accelerator is within the QEMU security policy boundary.
> + * See: https://www.qemu.org/contribute/security-process/
> + */
> +bool secure_policy_supported;

The security handling policy is a high level concept that is
open to variation over time and also by downstream distro
vendors.

At a code level we should be dealing in a more fundamental
concept. At an accelerator level we should really just
declare whether or not the accelerator impl is considered
to be secure against malicious guest code.

eg

/* Whether this accelerator is secure against execution
 * of malicious guest machine code */
bool secure;


>  /*
>   * Array of global properties that would be applied when specific
>   * accelerator is chosen. It works like MachineClass.compat_props
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 0125c17edb8..eb6b9e44df2 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -3623,6 +3623,7 @@ static void kvm_accel_class_init(ObjectClass *oc, void 
> *data)
>  ac->init_machine = kvm_init;
>  ac->has_memory = kvm_accel_has_memory;
>  ac->allowed = &kvm_allowed;
> +ac->secure_policy_supported = true;
>  
>  object_class_property_add(oc, "kernel-irqchip", "on|off|split",
>  NULL, kvm_set_kernel_irqchip,
> diff --git a/accel/xen/xen-all.c b/accel/xen/xen-all.c
> index 69aa7d018b2..57867af5faf 100644
> --- a/accel/xen/xen-all.c
> +++ b/accel/xen/xen-all.c
> @@ -198,6 +198,7 @@ static void xen_accel_class_init(ObjectClass *oc, void 
> *data)
>  ac->setup_post = xen_setup_post;
>  ac->allowed = &xen_allowed;
>  ac->compat_props = g_ptr_array_new();
> +ac->secure_policy_supported = true;
>  
>  compat_props_add(ac->compat_props, compat, G_N_ELEMENTS(compat));
>  
> diff --git a/softmmu/vl.c b/softmmu/vl.c
> index 92c05ac97ee..e4f94e159c3 100644
> --- a/softmmu/vl.c
> +++ b/softmmu/vl.c
> @@ -2388,6 +2388,9 @@ static int do_configure_accelerator(void *opaque, 
> QemuOpts *opts, Error **errp)
>  return 0;
>  }
>  
> +qemu_security_policy_taint(!ac->secure_policy_supported,
> +   "%s accelerator", acc);

We need this information to be introspectable, because stuff printed
to stderr is essentially opaque to libvirt and mgmt apps above.

We don't have a convenient "query-accel" command but I think this
could possibly fit into 'query-target', i.e. the TargetInfo struct
gains a field:
 

  ##
  # @TargetInfo:
  #
  # Information describing the QEMU target.
  #
  # @arch: the target architecture
  # @secure: Whether the currently active accelerator for this target
#  is secure against execution of malicious guest code
  #
  # Since: 1.2
  ##
  { 'struct': 'TargetInfo',
'data': { 'arch': 'SysEmuTarget',
  'secure': 'bool'} }

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [RFC PATCH 03/10] block: Use qemu_security_policy_taint() API

2021-09-09 Thread Daniel P . Berrangé
On Thu, Sep 09, 2021 at 01:20:17AM +0200, Philippe Mathieu-Daudé wrote:
> Add the BlockDriver::bdrv_taints_security_policy() handler.
> Drivers implementing it might taint the global QEMU security
> policy.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  include/block/block_int.h | 6 +-
>  block.c   | 6 ++
>  2 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index f1a54db0f8c..0ec0a5c06e9 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -169,7 +169,11 @@ struct BlockDriver {
>  int (*bdrv_file_open)(BlockDriverState *bs, QDict *options, int flags,
>Error **errp);
>  void (*bdrv_close)(BlockDriverState *bs);
> -
> +/*
> + * Return %true if the driver is within the QEMU security policy boundary,
> + * %false otherwise. See: 
> https://www.qemu.org/contribute/security-process/
> + */
> +bool (*bdrv_taints_security_policy)(BlockDriverState *bs);
>  
>  int coroutine_fn (*bdrv_co_create)(BlockdevCreateOptions *opts,
> Error **errp);
> diff --git a/block.c b/block.c
> index b2b66263f9a..696ba486001 100644
> --- a/block.c
> +++ b/block.c
> @@ -49,6 +49,7 @@
>  #include "qemu/timer.h"
>  #include "qemu/cutils.h"
>  #include "qemu/id.h"
> +#include "qemu-common.h"
>  #include "block/coroutines.h"
>  
>  #ifdef CONFIG_BSD
> @@ -1587,6 +1588,11 @@ static int bdrv_open_driver(BlockDriverState *bs, 
> BlockDriver *drv,
>  }
>  }
>  
> +if (drv->bdrv_taints_security_policy) {
> +qemu_security_policy_taint(drv->bdrv_taints_security_policy(bs),
> +   "Block protocol '%s'", drv->format_name);
> +}
> +
>  return 0;
>  open_failed:
>  bs->drv = NULL;

Again we need a way to report this via QAPI, but we don't have a natural
place to hang this off for introspection before starting a guest.

The best we can do is report the information after a block backend has
been instantiated. eg  Modify "BlockInfo" struct to gain

   '*secure': 'bool'

Note I made this an optional field, since unless we mark every single
block driver impl straight away, we'll need to cope with the absence
of information.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [RFC PATCH 00/10] security: Introduce qemu_security_policy_taint() API

2021-09-09 Thread Daniel P . Berrangé
On Thu, Sep 09, 2021 at 01:20:14AM +0200, Philippe Mathieu-Daudé wrote:
> Hi,
> 
> This series is experimental! The goal is to better limit the
boundary of what code is considered security critical, and
> what is less critical (but still important!).
> 
This approach was quickly discussed a few months ago with Markus
> then Daniel. Instead of classifying the code on a file path
> basis (see [1]), we insert (runtime) hints into the code
> (which survive code movement). Offending unsafe code can
> taint the global security policy. By default this policy
> is 'none': the current behavior. It can be changed on the
> command line to 'warn' to display warnings, and to 'strict'
> to prohibit QEMU running with a tainted policy.

Ok, so I infer that you based this idea on the "--compat-policy"
arg used to control behaviour wrt deprecations.

With the deprecation support, the QAPI introspection data can
report deprecations for machines / CPUs (and in theory devices
later).  Libvirt records this deprecation info and can report
it to the user before even starting a guest, so they can avoid
using a deprecated device in the first place.  We also use this
info to mark a guest as tainted + deprecation at the libvirt
level and let mgmt apps query this status.

The --compat-policy support has been integrated into libvirt
but it is not something we expect real world deployments to
use - rather it is targeted as a testing framework.

Essentially I see the security reporting as needing to operate
in a pretty similar manner.

IOW, the reporting via QAPI introspection is much more important
for libvirt and mgmt apps than any custom CLI arg / printfs
at the QEMU level.


In terms of QAPI design we currently have

   'deprecated': 'bool'

field against MachineInfo and CpuDefinitionInfo types.

it feels like we need

   'secure': 'bool'

field against the relevant types here too, though we might
need to make it an optional field to let us distinguish lack
of information, since it will take a long time to annotate
all areas, e.g.

   '*secure': 'bool'

 - not set  => no info available on security characteristics
 - true => device is considered secure wrt malicious guest
 - false => device is not considered secure wrt malicious guest
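In QAPI's generated C, an optional member is emitted as a `has_`/value pair, so the tri-state above would surface roughly as follows (type and helper names are invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* What the QAPI generator would emit for an optional '*secure' member. */
typedef struct {
    bool has_secure;  /* false => no info available */
    bool secure;      /* only meaningful when has_secure is true */
} DeviceSecInfo;

static const char *secure_status(const DeviceSecInfo *info)
{
    if (!info->has_secure) {
        return "unknown";   /* device not yet annotated */
    }
    return info->secure ? "secure" : "insecure";
}
```

Consumers like libvirt would then treat "unknown" differently from an explicit "insecure" verdict.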


> As examples I started implementing unsafe code taint from
> 3 different pieces of code:
> - accelerators (KVM and Xen in allow-list)
- block drivers (vvfat and partial null-co in deny-list)
> - qdev (hobbyist devices regularly hit by fuzzer)
> 
I don't want to discourage security researchers from fuzzing QEMU's
unsafe areas, but I'd like to make it clearer what the community's
priority is (currently 47 open issues on [3]).

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [RFC PATCH 01/10] sysemu: Introduce qemu_security_policy_taint() API

2021-09-09 Thread Paolo Bonzini

On 09/09/21 01:20, Philippe Mathieu-Daudé wrote:

+static QemuOptsList qemu_security_policy_opts = {
+.name = "security-policy",
+.implied_opt_name = "policy",
+.merge_lists = true,
+.head = QTAILQ_HEAD_INITIALIZER(qemu_security_policy_opts.head),
+.desc = {
+{
+.name = "policy",
+.type = QEMU_OPT_STRING,
+},
+{ /* end of list */ }
+},
+};


No new command line options please.  You could rename -compat-policy to 
just -policy, and make this a "security" suboption.


Paolo




[PATCH] hw/nvme: select first free NSID for legacy drive configuration

2021-09-09 Thread Hannes Reinecke
If a legacy 'drive' argument is passed to the controller we cannot
assume that '1' will be a free NSID, as the subsys might already
have attached a namespace to this NSID. So select the first free
one.

Signed-off-by: Hannes Reinecke 
---
 hw/nvme/ctrl.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 757cdff038..2c69031ca9 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6546,8 +6546,15 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
**errp)
 
 /* setup a namespace if the controller drive property was given */
 if (n->namespace.blkconf.blk) {
+int i;
 ns = &n->namespace;
-ns->params.nsid = 1;
+for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
+if (nvme_ns(n, i) || nvme_subsys_ns(n->subsys, i)) {
+continue;
+}
+ns->params.nsid = i;
+break;
+}
 
 if (nvme_ns_setup(ns, errp)) {
 return;
-- 
2.26.2




Re: [RFC PATCH 03/10] block: Use qemu_security_policy_taint() API

2021-09-09 Thread Philippe Mathieu-Daudé
On 9/9/21 1:20 AM, Philippe Mathieu-Daudé wrote:
> Add the BlockDriver::bdrv_taints_security_policy() handler.
> Drivers implementing it might taint the global QEMU security
> policy.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  include/block/block_int.h | 6 +-
>  block.c   | 6 ++
>  2 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index f1a54db0f8c..0ec0a5c06e9 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -169,7 +169,11 @@ struct BlockDriver {
>  int (*bdrv_file_open)(BlockDriverState *bs, QDict *options, int flags,
>Error **errp);
>  void (*bdrv_close)(BlockDriverState *bs);
> -
> +/*
> + * Return %true if the driver is within the QEMU security policy boundary,
> + * %false otherwise. See: 
> https://www.qemu.org/contribute/security-process/
> + */
> +bool (*bdrv_taints_security_policy)(BlockDriverState *bs);
>  
>  int coroutine_fn (*bdrv_co_create)(BlockdevCreateOptions *opts,
> Error **errp);
> diff --git a/block.c b/block.c
> index b2b66263f9a..696ba486001 100644
> --- a/block.c
> +++ b/block.c
> @@ -49,6 +49,7 @@
>  #include "qemu/timer.h"
>  #include "qemu/cutils.h"
>  #include "qemu/id.h"
> +#include "qemu-common.h"
>  #include "block/coroutines.h"
>  
>  #ifdef CONFIG_BSD
> @@ -1587,6 +1588,11 @@ static int bdrv_open_driver(BlockDriverState *bs, 
> BlockDriver *drv,
>  }
>  }
>  
> +if (drv->bdrv_taints_security_policy) {
> +qemu_security_policy_taint(drv->bdrv_taints_security_policy(bs),
> +   "Block protocol '%s'", drv->format_name);

Hmm I should check for phase_check(PHASE_MACHINE_READY)
and qemu_security_policy_is_strict() somehow, to refuse
adding unsafe drv at runtime instead of exiting...

> +}
> +
>  return 0;
>  open_failed:
>  bs->drv = NULL;
> 
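The runtime check mentioned above (machine phase plus strict policy) could reduce to a small decision helper; this is only a sketch, with the enum and the boolean inputs invented here:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    POLICY_ALLOW,   /* driver is within the boundary */
    POLICY_TAINT,   /* warn (or exit at startup under 'strict') */
    POLICY_REJECT,  /* refuse the runtime hotplug, keep the guest alive */
} PolicyAction;

static PolicyAction policy_check(bool driver_taints, bool machine_ready,
                                 bool strict)
{
    if (!driver_taints) {
        return POLICY_ALLOW;
    }
    if (machine_ready && strict) {
        /* exiting now would kill a running guest: reject the drv instead */
        return POLICY_REJECT;
    }
    return POLICY_TAINT;
}
```

The interesting case is the last branch before the fallthrough: under 'strict', tainting at startup can still exit, but after the machine is ready the open should fail cleanly.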




Re: [PULL for-6.1 06/11] hw/nvme: fix controller hot unplugging

2021-09-09 Thread Hannes Reinecke
On 9/9/21 9:59 AM, Klaus Jensen wrote:
> On Sep  9 09:02, Hannes Reinecke wrote:
>> On 7/26/21 9:18 PM, Klaus Jensen wrote:
>>> From: Klaus Jensen 
>>>
>>> Prior to this patch the nvme-ns devices are always children of the
>>> NvmeBus owned by the NvmeCtrl. This causes the namespaces to be
>>> unrealized when the parent device is removed. However, when subsystems
>>> are involved, this is not what we want since the namespaces may be
>>> attached to other controllers as well.
>>>
>>> This patch adds an additional NvmeBus on the subsystem device. When
>>> nvme-ns devices are realized, if the parent controller device is linked
>>> to a subsystem, the parent bus is set to the subsystem one instead. This
>>> makes sure that namespaces are kept alive and not unrealized.
>>>
>>> Reviewed-by: Hannes Reinecke 
>>> Signed-off-by: Klaus Jensen 
>>> ---
>>>   hw/nvme/nvme.h   | 15 ---
>>>   hw/nvme/ctrl.c   | 14 ++
>>>   hw/nvme/ns.c | 18 ++
>>>   hw/nvme/subsys.c |  3 +++
>>>   4 files changed, 35 insertions(+), 15 deletions(-)
>>>
>> Finally got around to test this; sadly, with mixed results.
>> On the good side: controller hotplug works.
>> Within qemu monitor I can do:
>>
>> device_del 
>> device_add 
>>
>> and OS reports:
>> [   56.564447] pcieport :00:09.0: pciehp: Slot(0-2): Attention button
>> pressed
>> [   56.564460] pcieport :00:09.0: pciehp: Slot(0-2): Powering off due to
>> button press
>> [  104.293335] pcieport :00:09.0: pciehp: Slot(0-2): Attention button
>> pressed
>> [  104.293347] pcieport :00:09.0: pciehp: Slot(0-2) Powering on due to
>> button press
>> [  104.293540] pcieport :00:09.0: pciehp: Slot(0-2): Card present
>> [  104.293544] pcieport :00:09.0: pciehp: Slot(0-2): Link Up
>> [  104.428961] pci :03:00.0: [1b36:0010] type 00 class 0x010802
>> [  104.429298] pci :03:00.0: reg 0x10: [mem 0x-0x3fff 64bit]
>> [  104.431442] pci :03:00.0: BAR 0: assigned [mem 0xc120-0xc1203fff
>> 64bit]
>> [  104.431580] pcieport :00:09.0: PCI bridge to [bus 03]
>> [  104.431604] pcieport :00:09.0:   bridge window [io  0x7000-0x7fff]
>> [  104.434815] pcieport :00:09.0:   bridge window [mem
>> 0xc120-0xc13f]
>> [  104.436685] pcieport :00:09.0:   bridge window [mem
>> 0x80400-0x805ff 64bit pref]
>> [  104.441896] nvme nvme2: pci function :03:00.0
>> [  104.442151] nvme :03:00.0: enabling device ( -> 0002)
>> [  104.455821] nvme nvme2: 1/0/0 default/read/poll queues
>>
>> So that works.
>> But: the namespace is not reconnected.
>>
>> # nvme list-ns /dev/nvme2
>>
>> draws a blank. So guess some fixup patch is in order...
>>
> 
> Hi Hannes,
> 
> I see. After the device_del/device_add dance, the namespace is there, but it is
> not automatically attached.
> 
> # nvme list-ns -a /dev/nvme0
> [   0]:0x2
> 
> # nvme attach-ns /dev/nvme0 -n 2 -c 0
> attach-ns: Success, nsid:2
> 
> # nvme list-ns /dev/nvme0
> [   0]:0x2
> 
> 
> I don't *think* the spec says that namespaces *must* be re-attached
> automatically? But I would have to check... If it does say that, then this
> is a bug of course.
> 
Errm. Yes, the namespaces must be present after a 'reset' (which a
hotunplug/hotplug cycle amounts to here).

As per spec the namespaces are a property of the _subsystem_, not the
controller. And the controller attaches to a subsystem, so it'll see any
namespaces which are present in the subsystem.
(whether it needs to see _all_ namespaces from that subsystem is another
story, but doesn't need to bother us here :-).

Just send a patch for it; it's actually quite trivial.

Cheers,

Hannes
-- 
Dr. Hannes ReineckeKernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



[PATCH] hw/nvme: reattach subsystem namespaces on hotplug

2021-09-09 Thread Hannes Reinecke
With commit 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
namespaces get moved from the controller to the subsystem if one
is specified.
That keeps the namespaces alive after a controller hot-unplug, but
after a controller hotplug we have to reconnect the namespaces
from the subsystem to the controller.

Fixes: 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
Cc: Klaus Jensen 
Signed-off-by: Hannes Reinecke 
---
 hw/nvme/subsys.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 93c35950d6..a9404f2b5e 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -14,7 +14,7 @@
 int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
 {
 NvmeSubsystem *subsys = n->subsys;
-int cntlid;
+int cntlid, nsid;
 
 for (cntlid = 0; cntlid < ARRAY_SIZE(subsys->ctrls); cntlid++) {
 if (!subsys->ctrls[cntlid]) {
@@ -29,12 +29,18 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
 
 subsys->ctrls[cntlid] = n;
 
+for (nsid = 0; nsid < ARRAY_SIZE(subsys->namespaces); nsid++) {
+if (subsys->namespaces[nsid]) {
+nvme_attach_ns(n, subsys->namespaces[nsid]);
+}
+}
 return cntlid;
 }
 
 void nvme_subsys_unregister_ctrl(NvmeSubsystem *subsys, NvmeCtrl *n)
 {
 subsys->ctrls[n->cntlid] = NULL;
+n->cntlid = -1;
 }
 
 static void nvme_subsys_setup(NvmeSubsystem *subsys)
-- 
2.26.2







Re: [PATCH v4 1/1] hw/pflash_cfi01: Allow backing devices to be smaller than memory region

2021-09-09 Thread Philippe Mathieu-Daudé
Hi David,

On 8/10/21 3:40 PM, David Edmondson wrote:
> Allow the backing device to be smaller than the extent of the flash
> device by mapping it as a subregion of the flash device region.
> 
> Return zeroes for all reads of the flash device beyond the extent of
> the backing device.
> 
> For writes beyond the extent of the underlying device, fail on
> read-only devices and discard them for writable devices.
> 
> Signed-off-by: David Edmondson 
> ---
>  hw/block/pflash_cfi01.c | 105 
>  hw/block/trace-events   |   3 ++
>  2 files changed, 87 insertions(+), 21 deletions(-)
> 
> diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
> index 81f9f971d8..f3289b6a2f 100644
> --- a/hw/block/pflash_cfi01.c
> +++ b/hw/block/pflash_cfi01.c
> @@ -83,6 +83,8 @@ struct PFlashCFI01 {
>  uint64_t counter;
>  unsigned int writeblock_size;
>  MemoryRegion mem;
> +MemoryRegion mem_outer;
> +char outer_name[64];
>  char *name;
>  void *storage;
>  VMChangeStateEntry *vmstate;
> @@ -434,7 +436,6 @@ static inline void pflash_data_write(PFlashCFI01 *pfl, 
> hwaddr offset,
>  }
>  break;
>  }
> -
>  }
>  
>  static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
> @@ -656,8 +657,44 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset,
>  }
>  
>  
> -static MemTxResult pflash_mem_read_with_attrs(void *opaque, hwaddr addr, 
> uint64_t *value,
> -  unsigned len, MemTxAttrs attrs)
> +static MemTxResult pflash_outer_read_with_attrs(void *opaque, hwaddr addr,
> +uint64_t *value, unsigned 
> len,
> +MemTxAttrs attrs)
> +{
> +PFlashCFI01 *pfl = opaque;
> +
> +trace_pflash_outer_read(pfl->name, addr, len);
> +*value = 0;

This seems to break the "width" and "old-multiple-chip-handling"
parameters ("emulating a number of flash devices wired up in parallel").

Also this breaks booting with SEV enabled on X86... See commits
9617cddb726 ("pc: add parser for OVMF reset block") and b2f73a0784b
("sev/i386: Allow AP booting under SEV-ES").

> +return MEMTX_OK;
> +}
> +
> +static MemTxResult pflash_outer_write_with_attrs(void *opaque, hwaddr addr,
> + uint64_t value, unsigned 
> len,
> + MemTxAttrs attrs)
> +{
> +PFlashCFI01 *pfl = opaque;
> +
> +trace_pflash_outer_write(pfl->name, addr, len);
> +if (pfl->ro) {
> +return MEMTX_ERROR;
> +} else {
> +warn_report_once("%s: "
> + "attempt to write outside of the backing block 
> device "
> + "(offset " TARGET_FMT_plx ") ignored",
> + pfl->name, addr);

This doesn't seem acceptable on mainstream, see:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg607292.html

> +return MEMTX_OK;
> +}
> +}
> +
> +static const MemoryRegionOps pflash_cfi01_outer_ops = {
> +.read_with_attrs = pflash_outer_read_with_attrs,
> +.write_with_attrs = pflash_outer_write_with_attrs,
> +.endianness = DEVICE_NATIVE_ENDIAN,
> +};
> +
> +static MemTxResult pflash_mem_read_with_attrs(void *opaque, hwaddr addr,
> +  uint64_t *value, unsigned len,
> +  MemTxAttrs attrs)
>  {
>  PFlashCFI01 *pfl = opaque;
>  bool be = !!(pfl->features & (1 << PFLASH_BE));
> @@ -670,8 +707,9 @@ static MemTxResult pflash_mem_read_with_attrs(void 
> *opaque, hwaddr addr, uint64_
>  return MEMTX_OK;
>  }
>  
> -static MemTxResult pflash_mem_write_with_attrs(void *opaque, hwaddr addr, 
> uint64_t value,
> -   unsigned len, MemTxAttrs 
> attrs)
> +static MemTxResult pflash_mem_write_with_attrs(void *opaque, hwaddr addr,
> +   uint64_t value, unsigned len,
> +   MemTxAttrs attrs)
>  {
>  PFlashCFI01 *pfl = opaque;
>  bool be = !!(pfl->features & (1 << PFLASH_BE));
> @@ -800,7 +838,7 @@ static void pflash_cfi01_realize(DeviceState *dev, Error 
> **errp)
>  {
>  ERRP_GUARD();
>  PFlashCFI01 *pfl = PFLASH_CFI01(dev);
> -uint64_t total_len;
> +uint64_t outer_len, inner_len;
>  int ret;
>  
>  if (pfl->sector_len == 0) {
> @@ -816,35 +854,60 @@ static void pflash_cfi01_realize(DeviceState *dev, 
> Error **errp)
>  return;
>  }
>  
> -total_len = pfl->sector_len * pfl->nb_blocs;
> -
> -memory_region_init_rom_device(
> -&pfl->mem, OBJECT(dev),
> -&pflash_cfi01_ops,
> -pfl,
> -pfl->name, total_len, errp);
> -if (*errp) {
> -return;
> -}
> -
> -pfl->storage = memory_region_get_ram_ptr(&pfl->mem);
> -sysbus_init_mmio(SYS_BUS_DEVICE(dev), &pfl->mem);
> +outer_len 

Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-09 Thread Dr. David Alan Gilbert
* Peter Xu (pet...@redhat.com) wrote:
> On Wed, Sep 08, 2021 at 09:30:58AM +0100, Dr. David Alan Gilbert wrote:
> > * Peter Xu (pet...@redhat.com) wrote:
> > > On Tue, Sep 07, 2021 at 12:06:15PM +0100, Dr. David Alan Gilbert wrote:
> > > > > > What if we do the 'flush()' before we start post-copy, instead of 
> > > > > > after each
> > > > > > iteration? would that be enough?
> > > > > 
> > > > > Possibly, yes. This really need David G's input since he understands
> > > > > the code in way more detail than me.
> > > > 
> > > > Hmm I'm not entirely sure why we have the sync after each iteration;
> > > 
> > > We don't have that yet, do we?
> > 
> > I think multifd does; I think it calls multifd_send_sync_main sometimes,
> > which I *think* corresponds to iterations.
> 
> Oh.. Then we may need to:
> 
>   (1) Make that sync work for zero-copy too; say, semaphores may not be enough
>   with it - we need to make sure all async buffers are consumed by 
> checking
>   the error queue of the sockets,
> 
>   (2) When we make zero-copy work without multi-fd, we may want some similar
>   sync on the main stream too

It might not be worth bothering with (2) - zerocopy fits a lot better
with multifd's data organisation.

Dave

> Thanks,
> 
> > 
> > Dave
> > 
> > > > the case I can think of is if we're doing async sending, we could have
> > > > two versions of the same page in flight (one from each iteration) -
> > > > you'd want those to get there in the right order.
> > > 
> > > From MSG_ZEROCOPY document:
> > > 
> > > For protocols that acknowledge data in-order, like TCP, each
> > > notification can be squashed into the previous one, so that no 
> > > more
> > > than one notification is outstanding at any one point.
> > > 
> > > Ordered delivery is the common case, but not guaranteed. 
> > > Notifications
> > > may arrive out of order on retransmission and socket teardown.
> > > 
> > > So no matter whether we have a flush() per iteration before, it seems we 
> > > may
> > > want one when zero copy enabled?
> > > 
> > > Thanks,
> > > 
> > > -- 
> > > Peter Xu
> > > 
> > -- 
> > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> > 
> 
> -- 
> Peter Xu
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: [PATCH v1 2/4] virtio: increase virtuqueue size for virtio-scsi and virtio-blk

2021-09-09 Thread Stefano Garzarella

On Wed, Sep 08, 2021 at 06:20:49PM +0300, Denis Plotnikov wrote:


On 08.09.2021 16:22, Stefano Garzarella wrote:

The message bounced, so I'm using Denis's new email address.

On Wed, Sep 08, 2021 at 03:17:16PM +0200, Stefano Garzarella wrote:

Hi Denis,
I just found this discussion since we still have the following 
line in hw/core/machine.c:

   { "vhost-blk-device", "seg_max_adjust", "off"}

IIUC it was a typo, and I think we should fix it since in the
future we may have a `vhost-blk-device`.


So, I think we have 2 options:
1. remove that line since for now is useless
2. replace with "vhost-scsi"

I'm not sure which is the best, what do you suggest?

Thanks,
Stefano


Hi Stefano

I prefer to just remove the line without replacing it. This will keep
things exactly as they are now.


Replacing it with "vhost-scsi" would affect seg_max (limiting it to 126) on
newly created VMs with machine types using "hw_compat_4_2" and older.


Now, because of the typo, their seg_max is queue-size dependent. I'm not
sure whether replacing the line may cause any problems, for example on
migration: source (queue-size 256, seg_max 254) -> destination (queue-size
256, seg_max 126). But it will definitely introduce two different behaviors
for VMs with "hw_compat_4_2" and older. So I'd choose the lesser of two
evils and keep things as they are now.
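The numbers in this thread (254 vs 126 at queue-size 256) follow from a simple rule; here is a toy model (not the actual virtio code) of what the `seg_max_adjust` property changes:

```c
#include <assert.h>

/*
 * Simplified model: with seg_max_adjust on, seg_max follows the queue
 * size (queue_size - 2); with it off, it stays at the legacy default
 * derived from the old fixed queue size of 128, i.e. 128 - 2 = 126.
 */
static unsigned seg_max(int seg_max_adjust, unsigned queue_size)
{
    return seg_max_adjust ? queue_size - 2 : 128 - 2;
}
```

With queue-size 256 this yields 254 when the adjustment is active and 126 when it is not, which is exactly the migration mismatch described above.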




Yep, makes sense. It was the same concern I had.

Do you want to send a patch to remove it with this explanation?

Thanks for the clarification,
Stefano




Re: [PULL for-6.1 06/11] hw/nvme: fix controller hot unplugging

2021-09-09 Thread Klaus Jensen
On Sep  9 09:02, Hannes Reinecke wrote:
> On 7/26/21 9:18 PM, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Prior to this patch the nvme-ns devices are always children of the
> > NvmeBus owned by the NvmeCtrl. This causes the namespaces to be
> > unrealized when the parent device is removed. However, when subsystems
> > are involved, this is not what we want since the namespaces may be
> > attached to other controllers as well.
> > 
> > This patch adds an additional NvmeBus on the subsystem device. When
> > nvme-ns devices are realized, if the parent controller device is linked
> > to a subsystem, the parent bus is set to the subsystem one instead. This
> > makes sure that namespaces are kept alive and not unrealized.
> > 
> > Reviewed-by: Hannes Reinecke 
> > Signed-off-by: Klaus Jensen 
> > ---
> >   hw/nvme/nvme.h   | 15 ---
> >   hw/nvme/ctrl.c   | 14 ++
> >   hw/nvme/ns.c | 18 ++
> >   hw/nvme/subsys.c |  3 +++
> >   4 files changed, 35 insertions(+), 15 deletions(-)
> > 
> Finally got around to test this; sadly, with mixed results.
> On the good side: controller hotplug works.
> Within qemu monitor I can do:
> 
> device_del 
> device_add 
> 
> and OS reports:
> [   56.564447] pcieport :00:09.0: pciehp: Slot(0-2): Attention button pressed
> [   56.564460] pcieport :00:09.0: pciehp: Slot(0-2): Powering off due to button press
> [  104.293335] pcieport :00:09.0: pciehp: Slot(0-2): Attention button pressed
> [  104.293347] pcieport :00:09.0: pciehp: Slot(0-2) Powering on due to button press
> [  104.293540] pcieport :00:09.0: pciehp: Slot(0-2): Card present
> [  104.293544] pcieport :00:09.0: pciehp: Slot(0-2): Link Up
> [  104.428961] pci :03:00.0: [1b36:0010] type 00 class 0x010802
> [  104.429298] pci :03:00.0: reg 0x10: [mem 0x-0x3fff 64bit]
> [  104.431442] pci :03:00.0: BAR 0: assigned [mem 0xc120-0xc1203fff 64bit]
> [  104.431580] pcieport :00:09.0: PCI bridge to [bus 03]
> [  104.431604] pcieport :00:09.0:   bridge window [io  0x7000-0x7fff]
> [  104.434815] pcieport :00:09.0:   bridge window [mem 0xc120-0xc13f]
> [  104.436685] pcieport :00:09.0:   bridge window [mem 0x80400-0x805ff 64bit pref]
> [  104.441896] nvme nvme2: pci function :03:00.0
> [  104.442151] nvme :03:00.0: enabling device ( -> 0002)
> [  104.455821] nvme nvme2: 1/0/0 default/read/poll queues
> 
> So that works.
> But: the namespace is not reconnected.
> 
> # nvme list-ns /dev/nvme2
> 
> draws a blank. So I guess some fixup patch is in order...
> 

Hi Hannes,

I see. After the device_del/device_add dance, the namespace is there, but it is
not automatically attached.

# nvme list-ns -a /dev/nvme0
[   0]:0x2

# nvme attach-ns /dev/nvme0 -n 2 -c 0
attach-ns: Success, nsid:2

# nvme list-ns /dev/nvme0
[   0]:0x2


I don't *think* the spec says that namespaces *must* be re-attached
automatically? But I would have to check... If it does say that, then this is a
bug of course.
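
For anyone wanting to script the manual dance above, the monitor steps can be
driven over QMP instead; a minimal sketch in Python that only builds the
wire-format commands (the device id "nvme1", serial and bus are hypothetical
placeholders, not taken from this thread — and the in-guest `nvme attach-ns`
step still has to run separately):

```python
import json

def qmp_cmd(execute, **arguments):
    """Build one QMP command in wire format: a JSON object with an
    "execute" key and an optional "arguments" object."""
    cmd = {"execute": execute}
    if arguments:
        cmd["arguments"] = arguments
    return json.dumps(cmd)

# Hot-replug sequence corresponding to the monitor commands above.
# "nvme1", the serial and the bus stand in for whatever -device
# options the controller was originally created with.
replug = [
    qmp_cmd("device_del", id="nvme1"),
    qmp_cmd("device_add", driver="nvme", id="nvme1",
            serial="deadbeef", bus="rp1"),
]

for line in replug:
    print(line)
```

Sending these over a -qmp socket (after the usual qmp_capabilities handshake)
should reproduce the device_del/device_add dance; whether the namespace then
still needs the explicit attach-ns is exactly the open question above.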




Re: Virtual FAT disk images

2021-09-09 Thread Pascal
up² :-)
can someone just ping me to make sure I'm posting to the qemu list?

On Tue, Aug 31, 2021 at 09:24, Pascal  wrote:

> up :-)
> nobody uses this feature of qemu?
>
> On Fri, Aug 27, 2021 at 11:11, Pascal  wrote:
>
>> hello everybody,
>>
>> virtual FAT disk image - *which is a convenient way to transfer files to
>> the guest without having to activate its network* - seems to work very
>> poorly with Windows: do you have the same difficulties?
>>
>> context: up-to-date Arch Linux, QEMU 6.0.0, freshly installed Windows 10
>> 21H1.
>>
>> /usr/bin/qemu-system-x86_64 -accel kvm -machine q35 -m 1024 -device
>> nec-usb-xhci -device usb-tablet -cpu qemu64,kvm=off -parallel null -serial
>> mon:stdio -hda windows.disk -hdb fat:rw:/tmp/test/
>>
>> access to the E: drive is extremely slow, and the system event log reports
>> many storahci (129: reinit \device\RaidPort0 sent) and disk (153: I/O
>> @0x3f on disk 1 \device\0001f replayed) warnings.
>>
>> regards.
>>
>
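
An aside on the command line in this thread: the legacy -hdb shorthand can
also be written with explicit -drive options, which makes the vvfat parameters
visible. A sketch of a roughly equivalent invocation, built as a Python argv
list purely for inspection; the media/index/snapshot choices are assumptions
to verify against the QEMU documentation, not something tested against the
Windows guest described here:

```python
import shlex

# Equivalent of "-hda windows.disk -hdb fat:rw:/tmp/test/" spelled with
# explicit -drive options. "fat:rw:DIR" is the vvfat protocol prefix;
# snapshot=on keeps guest writes in a temporary overlay, which may be
# safer given how fragile read-write vvfat is (an assumption to check).
cmd = [
    "qemu-system-x86_64",
    "-accel", "kvm",
    "-machine", "q35",
    "-m", "1024",
    "-drive", "file=windows.disk,index=0,media=disk",
    "-drive", "file=fat:rw:/tmp/test/,index=1,media=disk,snapshot=on",
]

print(shlex.join(cmd))
```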


Re: [PULL for-6.1 06/11] hw/nvme: fix controller hot unplugging

2021-09-09 Thread Hannes Reinecke

On 7/26/21 9:18 PM, Klaus Jensen wrote:

From: Klaus Jensen 

Prior to this patch the nvme-ns devices are always children of the
NvmeBus owned by the NvmeCtrl. This causes the namespaces to be
unrealized when the parent device is removed. However, when subsystems
are involved, this is not what we want since the namespaces may be
attached to other controllers as well.

This patch adds an additional NvmeBus on the subsystem device. When
nvme-ns devices are realized, if the parent controller device is linked
to a subsystem, the parent bus is set to the subsystem one instead. This
makes sure that namespaces are kept alive and not unrealized.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Klaus Jensen 
---
  hw/nvme/nvme.h   | 15 ---
  hw/nvme/ctrl.c   | 14 ++
  hw/nvme/ns.c | 18 ++
  hw/nvme/subsys.c |  3 +++
  4 files changed, 35 insertions(+), 15 deletions(-)


Finally got around to testing this; sadly, with mixed results.
On the good side: controller hotplug works.
Within qemu monitor I can do:

device_del 
device_add 

and OS reports:
[   56.564447] pcieport :00:09.0: pciehp: Slot(0-2): Attention button pressed
[   56.564460] pcieport :00:09.0: pciehp: Slot(0-2): Powering off due to button press
[  104.293335] pcieport :00:09.0: pciehp: Slot(0-2): Attention button pressed
[  104.293347] pcieport :00:09.0: pciehp: Slot(0-2) Powering on due to button press
[  104.293540] pcieport :00:09.0: pciehp: Slot(0-2): Card present
[  104.293544] pcieport :00:09.0: pciehp: Slot(0-2): Link Up
[  104.428961] pci :03:00.0: [1b36:0010] type 00 class 0x010802
[  104.429298] pci :03:00.0: reg 0x10: [mem 0x-0x3fff 64bit]
[  104.431442] pci :03:00.0: BAR 0: assigned [mem 0xc120-0xc1203fff 64bit]
[  104.431580] pcieport :00:09.0: PCI bridge to [bus 03]
[  104.431604] pcieport :00:09.0:   bridge window [io  0x7000-0x7fff]
[  104.434815] pcieport :00:09.0:   bridge window [mem 0xc120-0xc13f]
[  104.436685] pcieport :00:09.0:   bridge window [mem 0x80400-0x805ff 64bit pref]
[  104.441896] nvme nvme2: pci function :03:00.0
[  104.442151] nvme :03:00.0: enabling device ( -> 0002)
[  104.455821] nvme nvme2: 1/0/0 default/read/poll queues

So that works.
But: the namespace is not reconnected.

# nvme list-ns /dev/nvme2

draws a blank. So I guess some fixup patch is in order...

Cheers,

Hannes
--
Dr. Hannes Reinecke		Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer