Re: Virtual FAT disk images
On Thu, Sep 09, 2021 at 09:32:48AM +0200, Pascal wrote: > up² :-) > can someone just ping me to make me sure I'm posting on the qemu list ? Your question is reaching the list, but this mail ought to be informative: https://lists.gnu.org/archive/html/qemu-devel/2021-09/msg02463.html In short, the fact that you are seeing performance penalties with the vvfat driver is unsurprising, and it may be a sign of bigger bugs lurking. The lack of answers is more a sign that no one actively participating on the list currently uses virtual FAT in any serious manner, so you are taking on your own risk. That said, if you'd like to debug it and submit patches, we'd be grateful! > > Le mar. 31 août 2021 à 09:24, Pascal a écrit : > > > up :-) > > nobody uses this feature of qemu? > > > > Le ven. 27 août 2021 à 11:11, Pascal a écrit : > > > >> hello everybody, > >> > >> virtual FAT disk image - *which is a convenient way to transfer files to > >> the guest without having to activate its network* - seems to work very > >> poorly with Windows : do you have the same difficulties? > >> > >> context : up to date archlinux, qemu 6.0.0, fresh installed windows 10 > >> 21H1. > >> > >> /usr/bin/qemu-system-x86_64 -accel kvm -machine q35 -m 1024 -device > >> nec-usb-xhci -device usb-tablet -cpu qemu64,kvm=off -parallel null -serial > >> mon:stdio -hda windows.disk -hdb fat:rw:/tmp/test/ There may be other less-risky ways for easily transferring files between host and guest; I know in the past I have seen both p9 and MTP file systems mentioned as approaches, although I do not have experience with setting those up myself to give you a sample command line. > >> > >> access to the E: drive is extremely slow and the system events report > >> many storahci (129: reinit \device\RaidPort0 sent) and disk (153: I/O > >> @0x3f on disk 1 \device\0001f replayed) warnings. > >> > >> regards. > >> > > -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [RFC PATCH 03/10] block: Use qemu_security_policy_taint() API
On Thu, Sep 09, 2021 at 01:20:17AM +0200, Philippe Mathieu-Daudé wrote: > Add the BlockDriver::bdrv_taints_security_policy() handler. > Drivers implementing it might taint the global QEMU security > policy. > > Signed-off-by: Philippe Mathieu-Daudé > --- > include/block/block_int.h | 6 +- > block.c | 6 ++ > 2 files changed, 11 insertions(+), 1 deletion(-) > > diff --git a/include/block/block_int.h b/include/block/block_int.h > index f1a54db0f8c..0ec0a5c06e9 100644 > --- a/include/block/block_int.h > +++ b/include/block/block_int.h > @@ -169,7 +169,11 @@ struct BlockDriver { > int (*bdrv_file_open)(BlockDriverState *bs, QDict *options, int flags, >Error **errp); > void (*bdrv_close)(BlockDriverState *bs); > - > +/* > + * Return %true if the driver is withing QEMU security policy boundary, within > + * %false otherwise. See: > https://www.qemu.org/contribute/security-process/ > + */ > +bool (*bdrv_taints_security_policy)(BlockDriverState *bs); > > int coroutine_fn (*bdrv_co_create)(BlockdevCreateOptions *opts, > Error **errp); -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [RFC PATCH 02/10] accel: Use qemu_security_policy_taint(), mark KVM and Xen as safe
On Thu, Sep 09, 2021 at 01:20:16AM +0200, Philippe Mathieu-Daudé wrote: > Add the AccelClass::secure_policy_supported field to classify > safe (within security boundary) vs unsafe accelerators. > > Signed-off-by: Philippe Mathieu-Daudé > --- > include/qemu/accel.h | 5 + > accel/kvm/kvm-all.c | 1 + > accel/xen/xen-all.c | 1 + > softmmu/vl.c | 3 +++ > 4 files changed, 10 insertions(+) > > diff --git a/include/qemu/accel.h b/include/qemu/accel.h > index 4f4c283f6fc..895e30be0de 100644 > --- a/include/qemu/accel.h > +++ b/include/qemu/accel.h > @@ -44,6 +44,11 @@ typedef struct AccelClass { > hwaddr start_addr, hwaddr size); > #endif > bool *allowed; > +/* > + * Whether the accelerator is withing QEMU security policy boundary. within > + * See: https://www.qemu.org/contribute/security-process/ > + */ > +bool secure_policy_supported; > /* > * Array of global properties that would be applied when specific > * accelerator is chosen. It works like MachineClass.compat_props -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [RFC PATCH 01/10] sysemu: Introduce qemu_security_policy_taint() API
On Thu, Sep 09, 2021 at 01:20:15AM +0200, Philippe Mathieu-Daudé wrote: > Introduce qemu_security_policy_taint() which allows unsafe (read > "not very maintained") code to 'taint' QEMU security policy. > > The "security policy" is the @SecurityPolicy QAPI enum, composed of: > - "none" (no policy, current behavior) > - "warn" (display a warning when the policy is tainted, keep going) > - "strict" (once tainted, exit QEMU before starting the VM) > > The qemu_security_policy_is_strict() helper is also provided, which > will be proved useful once a VM is started (example we do not want s/be proved/prove/ > to kill a running VM if an unsafe device is hot-added). > > Signed-off-by: Philippe Mathieu-Daudé > --- > qapi/run-state.json | 16 +++ > include/qemu-common.h | 19 > softmmu/vl.c | 67 +++ > qemu-options.hx | 17 +++ > 4 files changed, 119 insertions(+) > > diff --git a/qapi/run-state.json b/qapi/run-state.json > index 43d66d700fc..b15a107fa01 100644 > --- a/qapi/run-state.json > +++ b/qapi/run-state.json > @@ -638,3 +638,19 @@ > { 'struct': 'MemoryFailureFlags', >'data': { 'action-required': 'bool', > 'recursive': 'bool'} } > + > +## > +# @SecurityPolicy: > +# > +# An enumeration of the actions taken when the security policy is tainted. > +# > +# @none: do nothing. > +# > +# @warn: display a warning. > +# > +# @strict: prohibit QEMU to start a VM. s/to start/from starting/ > +# > +# Since: 6.2 > +## > +{ 'enum': 'SecurityPolicy', > + 'data': [ 'none', 'warn', 'strict' ] } -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
On Thu, Sep 09, 2021 at 01:58:39AM -0300, Leonardo Bras Soares Passos wrote: > FWIW, what I had in mind for a (theoretical) migration setup with > io_async_writev() + io_async_flush(): One trivial concern is it's not strictly just "async" because "async" can happen on any nonblocking fd; here it's more "zerocopy" specific. But as long as Dan is okay with it I'm fine too to start with "async", as long as we do proper documentation on the requirement of lifecycle of the buffers. > - For guest RAM we can decide not to rw_lock memory on zerocopy, > because there is no need, > - For headers, we can decide to not use async (use io_writev() instead), > - flush() can happen each iteration of migration, or at each N > seconds, or at the end. I think we should only need the flush at the end of precopy, and that should also cover when switching to postcopy. We could merge this flush() into the existing per-iteration sync of multi-fd, but it's not strictly necessary, imho. We can see which is easier. The rest looks good to me. Thanks, -- Peter Xu
Re: qcow2 perfomance: read-only IO on the guest generates high write IO on the host
On 24-08-2021 11:37, Kevin Wolf wrote: [ Cc: qemu-block ] Am 11.08.2021 um 13:36 hat Christopher Pereira geschrieben: Hi, I'm reading a directory with 5.000.000 files (2,4 GB) inside a guest using "find | grep -c". On the host I saw high write IO (40 MB/s !) during over 1 hour using virt-top. I later repeated the read-only operation inside the guest and no additional data was written on the host. The operation took only some seconds. I believe QEMU was creating some kind of cache or metadata map the first time I accessed the inodes. No, at least in theory, QEMU shouldn't allocate anything when you're just reading. Hmm...interesting. Are you sure that this isn't activity coming from your guest OS? Yes. iotop was showing only read IOs on the guest, and on the host iotop and virt-top where showing strong write IOs for hours. Stopping the "find" command on the guest also stopped the write IOs on the host. But I wonder why the cache or metadata map wasn't available the first time and why QEMU had to recreate it? The VM has "compressed base <- snap 1" and base was converted without prealloc. Is it because we created the base using convert without metadata prealloc and so the metadata map got lost? I will do some experiments soon using convert + metadata prealloc and probably find out myself, but I will happy to read your comments and gain some additional insights. If it the problem persists, I would try again without compression. What were the results of your experiments? Is the behaviour related to any of these options? I will do the experiments and report back. It's also strange that the second time I repeat the "find" command, I see no more write IOs and it takes only seconds instead of hours. I was assuming QEMU was creating some kind of map or cache on the snapshot for the content present in the base, but now I got more curious.
Re: [RFC PATCH 00/10] security: Introduce qemu_security_policy_taint() API
On 210909 0120, Philippe Mathieu-Daudé wrote: > Hi, > > This series is experimental! The goal is to better limit the > boundary of what code is considerated security critical, and > what is less critical (but still important!). > > This approach was quickly discussed few months ago with Markus > then Daniel. Instead of classifying the code on a file path > basis (see [1]), we insert (runtime) hints into the code > (which survive code movement). Offending unsafe code can > taint the global security policy. By default this policy > is 'none': the current behavior. It can be changed on the > command line to 'warn' to display warnings, and to 'strict' > to prohibit QEMU running with a tainted policy. > > As examples I started implementing unsafe code taint from > 3 different pieces of code: > - accelerators (KVM and Xen in allow-list) > - block drivers (vvfat and parcial null-co in deny-list) > - qdev (hobbyist devices regularly hit by fuzzer) Just looking through the list of hci, storage, network and graphics devices available on i386 to see which others are potential good candidates for this tag. Obviously a lot of guesswork here: USB devices: name "ich9-usb-ehci1", bus PCI name "ich9-usb-ehci2", bus PCI name "ich9-usb-uhci1", bus PCI name "ich9-usb-uhci2", bus PCI name "ich9-usb-uhci3", bus PCI name "ich9-usb-uhci4", bus PCI name "ich9-usb-uhci5", bus PCI name "ich9-usb-uhci6", bus PCI name "nec-usb-xhci", bus PCI name "pci-ohci", bus PCI, desc "Apple USB Controller" name "piix3-usb-uhci", bus PCI name "piix4-usb-uhci", bus PCI name "qemu-xhci", bus PCI name "usb-ehci", bus PCI Not sure about these. Maybe ohci isn't sensitive? Storage devices: === Sensitive === name "floppy", bus floppy-bus, desc "virtual floppy drive" name "ide-cd", bus IDE, desc "virtual IDE CD-ROM" name "ide-hd", bus IDE, desc "virtual IDE disk" name "isa-fdc", bus ISA, desc "virtual floppy controller" name "isa-ide", bus ISA name "piix3-ide", bus PCI name "piix3-ide-xen", bus PCI name "piix4-ide", bus PCI name "scsi-block", bus SCSI, desc "SCSI block device passthrough" name "scsi-cd", bus SCSI, desc "virtual SCSI CD-ROM" name "scsi-generic", bus SCSI, desc "pass through generic scsi device (/dev/sg*)" name "scsi-hd", bus SCSI, desc "virtual SCSI disk" name "vhost-scsi", bus virtio-bus name "vhost-scsi-pci", bus PCI name "vhost-scsi-pci-non-transitional", bus PCI name "vhost-scsi-pci-transitional", bus PCI name "vhost-user-blk", bus virtio-bus name "vhost-user-blk-pci", bus PCI name "vhost-user-blk-pci-non-transitional", bus PCI name "vhost-user-blk-pci-transitional", bus PCI name "vhost-user-fs-device", bus virtio-bus name "vhost-user-fs-pci", bus PCI name "vhost-user-scsi", bus virtio-bus name "vhost-user-scsi-pci", bus PCI name "vhost-user-scsi-pci-non-transitional", bus PCI name "vhost-user-scsi-pci-transitional", bus PCI name "virtio-9p-device", bus virtio-bus name "virtio-9p-pci", bus PCI, alias "virtio-9p" name "virtio-9p-pci-non-transitional", bus PCI name "virtio-9p-pci-transitional", bus PCI name "virtio-blk-device", bus virtio-bus name "virtio-blk-pci", bus PCI, alias "virtio-blk" name "virtio-blk-pci-non-transitional", bus PCI name "virtio-blk-pci-transitional", bus PCI name "virtio-pmem", bus virtio-bus name "virtio-scsi-device", bus virtio-bus name "virtio-scsi-pci", bus PCI, alias "virtio-scsi" name "virtio-scsi-pci-non-transitional", bus PCI name "virtio-scsi-pci-transitional", bus PCI === Tainting/Not Sensitive === name "am53c974", bus PCI, desc "AMD Am53c974 PCscsi-PCI SCSI adapter" name "dc390", bus PCI, desc "Tekram DC-390 SCSI adapter" name "ich9-ahci", bus PCI, alias "ahci" name "lsi53c810", bus PCI name "lsi53c895a", bus PCI, alias "lsi" name "megasas", bus PCI, desc "LSI MegaRAID SAS 1078" name "megasas-gen2", bus PCI, desc "LSI MegaRAID SAS 2108" name "mptsas1068", bus PCI, desc "LSI SAS 1068" name "nvdimm", desc "DIMM memory module" name "nvme", bus PCI, desc "Non-Volatile Memory Express" name "nvme-ns", bus nvme-bus, desc "Virtual NVMe namespace" name "nvme-subsys", desc "Virtual NVMe subsystem" name "pvscsi", bus PCI name "sd-card", bus sd-bus name "sdhci-pci", bus PCI name "usb-bot", bus usb-bus name "usb-mtp", bus usb-bus, desc "USB Media Transfer Protocol device" name "usb-storage", bus usb-bus name "usb-uas", bus usb-bus Network devices: === Sensitive === name "e1000", bus PCI, alias "e1000-82540em", desc "Intel Gigabit Ethernet" name "e1000e", bus PCI, desc "Intel 82574L GbE Controller" name "virtio-net-device", bus virtio-bus name "virtio-net-pci", bus PCI, alias "virtio-net" name "virtio-net-pci-non-transitional", bus PCI name "virtio-net-pci-transitional", bus PCI === Tainting/Not Sensitive === name "e1000-82544gc", bus PCI, desc "Intel Gigabit Ethernet" name "e1000-82545em", bus PCI, desc "Intel Gigabit Ethernet" name "i82550", bus PCI, desc "Intel i82550 Ethernet" name "i82551", bus PCI, desc "Intel i82551 Ethernet" name "i82557a", bus PCI, desc "Intel i82557A Ethernet" name "i82557b", bu
Re: [PATCH] hw/nvme: select first free NSID for legacy drive configuration
On 9/9/21 12:52 PM, Klaus Jensen wrote: > On Sep 9 11:51, Hannes Reinecke wrote: >> If a legacy 'drive' argument is passed to the controller we cannot >> assume that '1' will be a free NSID, as the subsys might already >> have attached a namespace to this NSID. So select the first free >> one. >> >> Signed-off-by: Hannes Reinecke >> --- >> hw/nvme/ctrl.c | 9 - >> 1 file changed, 8 insertions(+), 1 deletion(-) >> >> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c >> index 757cdff038..2c69031ca9 100644 >> --- a/hw/nvme/ctrl.c >> +++ b/hw/nvme/ctrl.c >> @@ -6546,8 +6546,15 @@ static void nvme_realize(PCIDevice *pci_dev, Error >> **errp) >> >> /* setup a namespace if the controller drive property was given */ >> if (n->namespace.blkconf.blk) { >> +int i; >> ns = &n->namespace; >> -ns->params.nsid = 1; >> +for (i = 1; i <= NVME_MAX_NAMESPACES; i++) { >> +if (nvme_ns(n, i) || nvme_subsys_ns(n->subsys, i)) { >> +continue; >> +} >> +ns->params.nsid = i; >> +break; >> +} >> >> if (nvme_ns_setup(ns, errp)) { >> return; >> -- >> 2.26.2 >> > > Did you actually hit this? > > Because then then property checking logic is bad... The device is not > supposed to allow nvme,drive= combined with a subsystem property. In > nvme_check_constraints(): > > if (n->namespace.blkconf.blk && n->subsys) { > /* error out */ > return; > } > > Ah. Missed that. Do ignore my patch then. Cheers, Hannes -- Dr. Hannes ReineckeKernel Storage Architect h...@suse.de +49 911 74053 688 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), GF: Felix Imendörffer
Re: [PATCH] hw/nvme: reattach subsystem namespaces on hotplug
On 9/9/21 12:47 PM, Klaus Jensen wrote: > On Sep 9 11:43, Hannes Reinecke wrote: >> With commit 5ffbaeed16 ("hw/nvme: fix controller hot unplugging") >> namespaces get moved from the controller to the subsystem if one >> is specified. >> That keeps the namespaces alive after a controller hot-unplug, but >> after a controller hotplug we have to reconnect the namespaces >> from the subsystem to the controller. >> >> Fixes: 5ffbaeed16 ("hw/nvme: fix controller hot unplugging") >> Cc: Klaus Jensen >> Signed-off-by: Hannes Reinecke >> --- >> hw/nvme/subsys.c | 8 +++- >> 1 file changed, 7 insertions(+), 1 deletion(-) >> >> diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c >> index 93c35950d6..a9404f2b5e 100644 >> --- a/hw/nvme/subsys.c >> +++ b/hw/nvme/subsys.c >> @@ -14,7 +14,7 @@ >> int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp) >> { >> NvmeSubsystem *subsys = n->subsys; >> -int cntlid; >> +int cntlid, nsid; >> >> for (cntlid = 0; cntlid < ARRAY_SIZE(subsys->ctrls); cntlid++) { >> if (!subsys->ctrls[cntlid]) { >> @@ -29,12 +29,18 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp) >> >> subsys->ctrls[cntlid] = n; >> >> +for (nsid = 0; nsid < ARRAY_SIZE(subsys->namespaces); nsid++) { >> +if (subsys->namespaces[nsid]) { >> +nvme_attach_ns(n, subsys->namespaces[nsid]); >> +} > > Thanks Hannes! I like it, keeping things simple. > > But we should only attach namespaces that have the shared property or > have ns->attached == 0. Non-shared namespaces may already be attached to > another controller in the subsystem. > Well ... I tried to avoid that subject, but as you brought it up: There is a slightly tricky issue in fabrics, in that the 'controller' is independent from the 'connection'. The 'shared' bit in the CMIC setting indicates that the subsystem may have more than one _controller_. It doesn't talk about how many _connections_ a controller may support; that then is the realm of dynamic or static controllers, which we don't talk about :-). Sufficient to say, linux only implements the dynamic controller model, so every connection will be to a different controller. But you have been warned :-) However, the 'CMIC' setting is independent on the 'NMIC' setting (ie the 'shared' bit in the namespace). Which leads to the interesting question on how exactly should one handle non-shared namespaces in subsystems for which there are multiple controllers. Especially as the NSID space is per subsystem, so each controller will be able to figure out if there are blanked-out namespaces. So to make this a sane setup I would propose to set the 'shared' option automatically whenever the controller has the 'subsys' option set. And actually, I would ditch the 'shared' option completely, and make it dependent on the 'subsys' setting for the controller. Much like we treat the 'CMIC' setting nowadays. That avoids lots of confusions, and also make the implementation _way_ easier. > However... > > The spec says that "The attach and detach operations are persistent > across all reset events.". This means that we should track those events > in the subsystem and only reattach namespaces that were attached at the > time of the "reset" event. Currently, we don't have anything mapping > that state. But the device already has to take some liberties with > regard to stuff that is considered persistent by the spec (SMART log > etc.) since we do not have any way to store stuff persistently across > qemu invocations, so I think the above is an acceptable compromise. > Careful. 'attach' and 'detach' are MI (management interface) operations. If and how many namespaces are visible to any given controllers is actually independent on that; and, in fact, controllers might not even implement 'attach' or 'detach'. But I do agree that we don't handle the 'reset' state properly. > A potential (as good as it gets) fix would be to keep a map/list of > "persistently" attached controllers on the namespaces and re-attach > according to that when we see that controller joining the subsystem > again. CNTLID would be the obvious choice for the key here, but problem > is that we cant really use it since we assign it sequentially from the > subsystem, which now looks like a pretty bad choice. CNTLID should have > been a required property of the nvme device when subsystems are > involved. Maybe we can fix up the CNTLID assignment to take the serial > into account (we know that is defined and *should* be unique) and not > reuse CNTLIDs. This limits the subsystem to NVME_MAX_CONTROLLERS unique > controllers, but I think that is fair enough. > > Sigh. Need to think this through. > Well, actually there is an easy way out. I do agree that we need to make the 'cntlid' a property of the controller. And once it's set we can track it properly, eg by having per-cntlid nsid lists in the subsystem. But if it's not set we can claim that we'll be allocating a new controller ac
Re: [PATCH v1 2/4] virtio: increase virtuqueue size for virtio-scsi and virtio-blk
On 09.09.2021 11:28, Stefano Garzarella wrote: On Wed, Sep 08, 2021 at 06:20:49PM +0300, Denis Plotnikov wrote: On 08.09.2021 16:22, Stefano Garzarella wrote: Message bounced, I use new Denis's email address. On Wed, Sep 08, 2021 at 03:17:16PM +0200, Stefano Garzarella wrote: Hi Denis, I just found this discussion since we still have the following line in hw/core/machine.c: { "vhost-blk-device", "seg_max_adjust", "off"} IIUC it was a typo, and I think we should fix it since in the future we can have `vhost-blk-device`. So, I think we have 2 options: 1. remove that line since for now is useless 2. replace with "vhost-scsi" I'm not sure which is the best, what do you suggest? Thanks, Stefano Hi Stefano I prefer to just remove the line without replacing. This will keep things exactly like it is now. Replacing with "vhost-scsi" will affect seg_max (limit to 126) on newly created VMs with machine types using "hw_compat_4_2" and older. Now, because of the typo (error), their seg-max is queue-size dependent. I'm not sure, if replacing the line may cause any problems, for example on migration: source (queue-size 256, seg max 254) -> destination (queue-size 256, seg max 126). But it will definitely introduce two different behaviors for VMs with "hw_compat_4_2" and older. So, I'd choose the lesser of two evils and keep the things like it's now. Yep, make sense. It was the same concern I had. Do you want to send a patch to remove it with this explanation? Yes, sure. I'll do it today. Denis Thanks for the clarification, Stefano
Re: [RFC PATCH 06/10] qdev: Use qemu_security_policy_taint() API
On Thu, Sep 09, 2021 at 01:20:20AM +0200, Philippe Mathieu-Daudé wrote: > Add DeviceClass::taints_security_policy field to allow an > unsafe device to eventually taint the global security policy > in DeviceRealize(). > > Signed-off-by: Philippe Mathieu-Daudé > --- > include/hw/qdev-core.h | 6 ++ > hw/core/qdev.c | 11 +++ > 2 files changed, 17 insertions(+) > > diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h > index bafc311bfa1..ff9ce6671be 100644 > --- a/include/hw/qdev-core.h > +++ b/include/hw/qdev-core.h > @@ -122,6 +122,12 @@ struct DeviceClass { > */ > bool user_creatable; > bool hotpluggable; > +/* > + * %false if the device is within the QEMU security policy boundary, > + * %true if there is no guarantee this device can be used safely. > + * See: https://www.qemu.org/contribute/security-process/ > + */ > +bool taints_security_policy; > > /* callbacks */ > /* Although your use case is for devices, it probably makes more sense to push this up into the Object base class. I think it will need to be a tri-state value too, not a simple bool. It isn't feasible to mark all devices with this property, so initially we'll have no information about whether most devices are secure or insecure. This patch gives the implication that all devices are secure, except for the few that have been marked otherwise, which is not a good default IMHO. We want to be able to make it clear when introspecting, that we have no information on security available for most devices ie - unset => no information on security (the current default) - true => considered secure against malicious guest - false => considered insecure against malicious guest Then we can also extend 'ObjectTypeInfo' to have a '*secure': 'bool' to make 'qom-list-types' be able to introspect this upfront. > diff --git a/hw/core/qdev.c b/hw/core/qdev.c > index cefc5eaa0a9..a5a00f3564c 100644 > --- a/hw/core/qdev.c > +++ b/hw/core/qdev.c > @@ -31,6 +31,7 @@ > #include "qapi/qmp/qerror.h" > #include "qapi/visitor.h" > #include "qemu/error-report.h" > +#include "qemu-common.h" > #include "qemu/option.h" > #include "hw/hotplug.h" > #include "hw/irq.h" > @@ -257,6 +258,13 @@ bool qdev_hotplug_allowed(DeviceState *dev, Error **errp) > MachineClass *mc; > Object *m_obj = qdev_get_machine(); > > +if (qemu_security_policy_is_strict() > +&& DEVICE_GET_CLASS(dev)->taints_security_policy) { > +error_setg(errp, "Device '%s' can not be hotplugged when" > + " 'strict' security policy is in place", > + object_get_typename(OBJECT(dev))); Do you need a 'return' here to stop execution after reportig the error ? > +} > + > if (object_dynamic_cast(m_obj, TYPE_MACHINE)) { > machine = MACHINE(m_obj); > mc = MACHINE_GET_CLASS(machine); > @@ -385,6 +393,9 @@ bool qdev_realize(DeviceState *dev, BusState *bus, Error > **errp) > } else { > assert(!DEVICE_GET_CLASS(dev)->bus_type); > } > +qemu_security_policy_taint(DEVICE_GET_CLASS(dev)->taints_security_policy, > + "device type %s", > + object_get_typename(OBJECT(dev))); > > return object_property_set_bool(OBJECT(dev), "realized", true, errp); > } > -- > 2.31.1 > Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [RFC PATCH 03/10] block: Use qemu_security_policy_taint() API
On Thu, Sep 09, 2021 at 11:40:07AM +0100, Daniel P. Berrangé wrote: > On Thu, Sep 09, 2021 at 01:20:17AM +0200, Philippe Mathieu-Daudé wrote: > > Add the BlockDriver::bdrv_taints_security_policy() handler. > > Drivers implementing it might taint the global QEMU security > > policy. > > > > Signed-off-by: Philippe Mathieu-Daudé > > --- > > include/block/block_int.h | 6 +- > > block.c | 6 ++ > > 2 files changed, 11 insertions(+), 1 deletion(-) > > > > diff --git a/include/block/block_int.h b/include/block/block_int.h > > index f1a54db0f8c..0ec0a5c06e9 100644 > > --- a/include/block/block_int.h > > +++ b/include/block/block_int.h > > @@ -169,7 +169,11 @@ struct BlockDriver { > > int (*bdrv_file_open)(BlockDriverState *bs, QDict *options, int flags, > >Error **errp); > > void (*bdrv_close)(BlockDriverState *bs); > > - > > +/* > > + * Return %true if the driver is withing QEMU security policy boundary, > > + * %false otherwise. See: > > https://www.qemu.org/contribute/security-process/ > > + */ > > +bool (*bdrv_taints_security_policy)(BlockDriverState *bs); Also as with previous comments, I think we should not refer to tainting or the security policy here, but instead simply document whether we consider the bdrv to be secure or not. Tainting is merely one action that is taken in accordance with the security policy, as a result of the information presented. > > int coroutine_fn (*bdrv_co_create)(BlockdevCreateOptions *opts, > > Error **errp); > > diff --git a/block.c b/block.c > > index b2b66263f9a..696ba486001 100644 > > --- a/block.c > > +++ b/block.c > > @@ -49,6 +49,7 @@ > > #include "qemu/timer.h" > > #include "qemu/cutils.h" > > #include "qemu/id.h" > > +#include "qemu-common.h" > > #include "block/coroutines.h" > > > > #ifdef CONFIG_BSD > > @@ -1587,6 +1588,11 @@ static int bdrv_open_driver(BlockDriverState *bs, > > BlockDriver *drv, > > } > > } > > > > +if (drv->bdrv_taints_security_policy) { > > +qemu_security_policy_taint(drv->bdrv_taints_security_policy(bs), > > + "Block protocol '%s'", > > drv->format_name); > > +} > > + > > return 0; > > open_failed: > > bs->drv = NULL; > > Again we need a way to report this via QAPI, but we don't have a natural > place is hang this off for introspection before starting a guest. > > The best we can do is report the information after a block backend has > been instantiated. eg Modify "BlockInfo" struct to gain > >'*secure': 'bool' > > Note I made this an optional field, since unless we mark every single > block driver impl straight away, we'll need to cope with the absence > of information. Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH] hw/nvme: select first free NSID for legacy drive configuration
On Sep 9 11:51, Hannes Reinecke wrote: > If a legacy 'drive' argument is passed to the controller we cannot > assume that '1' will be a free NSID, as the subsys might already > have attached a namespace to this NSID. So select the first free > one. > > Signed-off-by: Hannes Reinecke > --- > hw/nvme/ctrl.c | 9 - > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c > index 757cdff038..2c69031ca9 100644 > --- a/hw/nvme/ctrl.c > +++ b/hw/nvme/ctrl.c > @@ -6546,8 +6546,15 @@ static void nvme_realize(PCIDevice *pci_dev, Error > **errp) > > /* setup a namespace if the controller drive property was given */ > if (n->namespace.blkconf.blk) { > +int i; > ns = &n->namespace; > -ns->params.nsid = 1; > +for (i = 1; i <= NVME_MAX_NAMESPACES; i++) { > +if (nvme_ns(n, i) || nvme_subsys_ns(n->subsys, i)) { > +continue; > +} > +ns->params.nsid = i; > +break; > +} > > if (nvme_ns_setup(ns, errp)) { > return; > -- > 2.26.2 > Did you actually hit this? Because then then property checking logic is bad... The device is not supposed to allow nvme,drive= combined with a subsystem property. In nvme_check_constraints(): if (n->namespace.blkconf.blk && n->subsys) { /* error out */ return; } signature.asc Description: PGP signature
Re: [PATCH] hw/nvme: reattach subsystem namespaces on hotplug
On Sep 9 11:43, Hannes Reinecke wrote: > With commit 5ffbaeed16 ("hw/nvme: fix controller hot unplugging") > namespaces get moved from the controller to the subsystem if one > is specified. > That keeps the namespaces alive after a controller hot-unplug, but > after a controller hotplug we have to reconnect the namespaces > from the subsystem to the controller. > > Fixes: 5ffbaeed16 ("hw/nvme: fix controller hot unplugging") > Cc: Klaus Jensen > Signed-off-by: Hannes Reinecke > --- > hw/nvme/subsys.c | 8 +++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c > index 93c35950d6..a9404f2b5e 100644 > --- a/hw/nvme/subsys.c > +++ b/hw/nvme/subsys.c > @@ -14,7 +14,7 @@ > int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp) > { > NvmeSubsystem *subsys = n->subsys; > -int cntlid; > +int cntlid, nsid; > > for (cntlid = 0; cntlid < ARRAY_SIZE(subsys->ctrls); cntlid++) { > if (!subsys->ctrls[cntlid]) { > @@ -29,12 +29,18 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp) > > subsys->ctrls[cntlid] = n; > > +for (nsid = 0; nsid < ARRAY_SIZE(subsys->namespaces); nsid++) { > +if (subsys->namespaces[nsid]) { > +nvme_attach_ns(n, subsys->namespaces[nsid]); > +} Thanks Hannes! I like it, keeping things simple. But we should only attach namespaces that have the shared property or have ns->attached == 0. Non-shared namespaces may already be attached to another controller in the subsystem. However... The spec says that "The attach and detach operations are persistent across all reset events.". This means that we should track those events in the subsystem and only reattach namespaces that were attached at the time of the "reset" event. Currently, we don't have anything mapping that state. But the device already has to take some liberties with regard to stuff that is considered persistent by the spec (SMART log etc.) since we do not have any way to store stuff persistently across qemu invocations, so I think the above is an acceptable compromise. A potential (as good as it gets) fix would be to keep a map/list of "persistently" attached controllers on the namespaces and re-attach according to that when we see that controller joining the subsystem again. CNTLID would be the obvious choice for the key here, but problem is that we cant really use it since we assign it sequentially from the subsystem, which now looks like a pretty bad choice. CNTLID should have been a required property of the nvme device when subsystems are involved. Maybe we can fix up the CNTLID assignment to take the serial into account (we know that is defined and *should* be unique) and not reuse CNTLIDs. This limits the subsystem to NVME_MAX_CONTROLLERS unique controllers, but I think that is fair enough. Sigh. Need to think this through. Bottomline I think I'm partial to just accepting your patch (with the addition of taking the shared property into account) and documenting the limitation wrt. persistency of attach/detach events. No matter how spec-compliant we do it on a live system, we still break compliance across QEMU invocations. signature.asc Description: PGP signature
Re: [RFC PATCH 02/10] accel: Use qemu_security_policy_taint(), mark KVM and Xen as safe
On Thu, Sep 09, 2021 at 01:20:16AM +0200, Philippe Mathieu-Daudé wrote: > Add the AccelClass::secure_policy_supported field to classify > safe (within security boundary) vs unsafe accelerators. > > Signed-off-by: Philippe Mathieu-Daudé > --- > include/qemu/accel.h | 5 + > accel/kvm/kvm-all.c | 1 + > accel/xen/xen-all.c | 1 + > softmmu/vl.c | 3 +++ > 4 files changed, 10 insertions(+) > > diff --git a/include/qemu/accel.h b/include/qemu/accel.h > index 4f4c283f6fc..895e30be0de 100644 > --- a/include/qemu/accel.h > +++ b/include/qemu/accel.h > @@ -44,6 +44,11 @@ typedef struct AccelClass { > hwaddr start_addr, hwaddr size); > #endif > bool *allowed; > +/* > + * Whether the accelerator is withing QEMU security policy boundary. > + * See: https://www.qemu.org/contribute/security-process/ > + */ > +bool secure_policy_supported; The security handling policy is a high level concept that is open to variation over time and also by downstream distro vendors. At a code level we should be dealing in a more fundamental concept. At an accelerator level we should really jsut declare whether or not the accelerator impl is considered to be secure against malicious guest code. eg /* Whether this accelerator is secure against execution * of malciious guest machine code */ bool secure; > /* > * Array of global properties that would be applied when specific > * accelerator is chosen. It works like MachineClass.compat_props > diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c > index 0125c17edb8..eb6b9e44df2 100644 > --- a/accel/kvm/kvm-all.c > +++ b/accel/kvm/kvm-all.c > @@ -3623,6 +3623,7 @@ static void kvm_accel_class_init(ObjectClass *oc, void > *data) > ac->init_machine = kvm_init; > ac->has_memory = kvm_accel_has_memory; > ac->allowed = &kvm_allowed; > +ac->secure_policy_supported = true; > > object_class_property_add(oc, "kernel-irqchip", "on|off|split", > NULL, kvm_set_kernel_irqchip, > diff --git a/accel/xen/xen-all.c b/accel/xen/xen-all.c > index 69aa7d018b2..57867af5faf 100644 > --- a/accel/xen/xen-all.c > +++ b/accel/xen/xen-all.c > @@ -198,6 +198,7 @@ static void xen_accel_class_init(ObjectClass *oc, void > *data) > ac->setup_post = xen_setup_post; > ac->allowed = &xen_allowed; > ac->compat_props = g_ptr_array_new(); > +ac->secure_policy_supported = true; > > compat_props_add(ac->compat_props, compat, G_N_ELEMENTS(compat)); > > diff --git a/softmmu/vl.c b/softmmu/vl.c > index 92c05ac97ee..e4f94e159c3 100644 > --- a/softmmu/vl.c > +++ b/softmmu/vl.c > @@ -2388,6 +2388,9 @@ static int do_configure_accelerator(void *opaque, > QemuOpts *opts, Error **errp) > return 0; > } > > +qemu_security_policy_taint(!ac->secure_policy_supported, > + "%s accelerator", acc); We need this information to be introspectable, becuase stuff printed to stderr is essentially opaque to libvirt and mgmt apps above. We don't have a convenient "query-accel" command but I think this could possibly fit into 'query-target'. ie the TargetInfo struct gain a field: ## # @TargetInfo: # # Information describing the QEMU target. # # @arch: the target architecture # @secure: Whether the currently active accelerator for this target # is secure against execution of malicous guest code # # Since: 1.2 ## { 'struct': 'TargetInfo', 'data': { 'arch': 'SysEmuTarget', 'secure': 'bool'} } Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [RFC PATCH 03/10] block: Use qemu_security_policy_taint() API
On Thu, Sep 09, 2021 at 01:20:17AM +0200, Philippe Mathieu-Daudé wrote: > Add the BlockDriver::bdrv_taints_security_policy() handler. > Drivers implementing it might taint the global QEMU security > policy. > > Signed-off-by: Philippe Mathieu-Daudé > --- > include/block/block_int.h | 6 +- > block.c | 6 ++ > 2 files changed, 11 insertions(+), 1 deletion(-) > > diff --git a/include/block/block_int.h b/include/block/block_int.h > index f1a54db0f8c..0ec0a5c06e9 100644 > --- a/include/block/block_int.h > +++ b/include/block/block_int.h > @@ -169,7 +169,11 @@ struct BlockDriver { > int (*bdrv_file_open)(BlockDriverState *bs, QDict *options, int flags, >Error **errp); > void (*bdrv_close)(BlockDriverState *bs); > - > +/* > + * Return %true if the driver is withing QEMU security policy boundary, > + * %false otherwise. See: > https://www.qemu.org/contribute/security-process/ > + */ > +bool (*bdrv_taints_security_policy)(BlockDriverState *bs); > > int coroutine_fn (*bdrv_co_create)(BlockdevCreateOptions *opts, > Error **errp); > diff --git a/block.c b/block.c > index b2b66263f9a..696ba486001 100644 > --- a/block.c > +++ b/block.c > @@ -49,6 +49,7 @@ > #include "qemu/timer.h" > #include "qemu/cutils.h" > #include "qemu/id.h" > +#include "qemu-common.h" > #include "block/coroutines.h" > > #ifdef CONFIG_BSD > @@ -1587,6 +1588,11 @@ static int bdrv_open_driver(BlockDriverState *bs, > BlockDriver *drv, > } > } > > +if (drv->bdrv_taints_security_policy) { > +qemu_security_policy_taint(drv->bdrv_taints_security_policy(bs), > + "Block protocol '%s'", drv->format_name); > +} > + > return 0; > open_failed: > bs->drv = NULL; Again we need a way to report this via QAPI, but we don't have a natural place is hang this off for introspection before starting a guest. The best we can do is report the information after a block backend has been instantiated. eg Modify "BlockInfo" struct to gain '*secure': 'bool' Note I made this an optional field, since unless we mark every single block driver impl straight away, we'll need to cope with the absence of information. Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [RFC PATCH 00/10] security: Introduce qemu_security_policy_taint() API
On Thu, Sep 09, 2021 at 01:20:14AM +0200, Philippe Mathieu-Daudé wrote: > Hi, > > This series is experimental! The goal is to better limit the > boundary of what code is considerated security critical, and > what is less critical (but still important!). > > This approach was quickly discussed few months ago with Markus > then Daniel. Instead of classifying the code on a file path > basis (see [1]), we insert (runtime) hints into the code > (which survive code movement). Offending unsafe code can > taint the global security policy. By default this policy > is 'none': the current behavior. It can be changed on the > command line to 'warn' to display warnings, and to 'strict' > to prohibit QEMU running with a tainted policy. Ok, so I infer that you based this idea on the "--compat-policy" arg used to control behaviour wrt to deprecations. With the deprecation support, the QAPI introspection data can report deprecations for machines / CPUs (and in theory devices later). Libvirt records this deprecation info and can report it to the user before even starting a guest, so they can avoid using a deprecated device in the first place. We also use this info to mark a guest as tainted + deprecation at the libvirt level and let mgmt apps query this status. The --compat-policy support has been integrated into libvirt but it is not something we expect real world deployments to use - rather it is targeted as a testing framework. Essentially I see the security reporting as needing to operate in a pretty similar manner. IOW, the reporting via QAPI introspetion is much more important for libvirt and mgmt apps, than any custom cli arg / printfs at the QEMU level. In terms of QAPI design we currently have 'deprecated': 'bool' field against MachineInfo and CpuDefinitionInfo types. it feels like we need 'secure': 'bool' field against the relevant types here too, though I think maybe we might need to make it an optional field to let us distinguish lack of information, since it will take a long time to annotate all areas. eg '*secure': 'bool' - not set => no info available on security characteristics - true => device is considered secure wrt malicious guest - false => device is not considered secure wrt malicious guest > As examples I started implementing unsafe code taint from > 3 different pieces of code: > - accelerators (KVM and Xen in allow-list) > - block drivers (vvfat and parcial null-co in deny-list) > - qdev (hobbyist devices regularly hit by fuzzer) > > I don't want the security researchers to not fuzz QEMU unsafe > areas, but I'd like to make it clearer what the community > priority is (currently 47 opened issues on [3]). Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [RFC PATCH 01/10] sysemu: Introduce qemu_security_policy_taint() API
On 09/09/21 01:20, Philippe Mathieu-Daudé wrote: +static QemuOptsList qemu_security_policy_opts = { +.name = "security-policy", +.implied_opt_name = "policy", +.merge_lists = true, +.head = QTAILQ_HEAD_INITIALIZER(qemu_security_policy_opts.head), +.desc = { +{ +.name = "policy", +.type = QEMU_OPT_STRING, +}, +{ /* end of list */ } +}, +}; No new command line options please. You could rename -compat-policy to just -policy, and make this a "security" suboption. Paolo
[PATCH] hw/nvme: select first free NSID for legacy drive configuration
If a legacy 'drive' argument is passed to the controller we cannot assume that '1' will be a free NSID, as the subsys might already have attached a namespace to this NSID. So select the first free one. Signed-off-by: Hannes Reinecke --- hw/nvme/ctrl.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index 757cdff038..2c69031ca9 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -6546,8 +6546,15 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp) /* setup a namespace if the controller drive property was given */ if (n->namespace.blkconf.blk) { +int i; ns = &n->namespace; -ns->params.nsid = 1; +for (i = 1; i <= NVME_MAX_NAMESPACES; i++) { +if (nvme_ns(n, i) || nvme_subsys_ns(n->subsys, i)) { +continue; +} +ns->params.nsid = i; +break; +} if (nvme_ns_setup(ns, errp)) { return; -- 2.26.2
Re: [RFC PATCH 03/10] block: Use qemu_security_policy_taint() API
On 9/9/21 1:20 AM, Philippe Mathieu-Daudé wrote: > Add the BlockDriver::bdrv_taints_security_policy() handler. > Drivers implementing it might taint the global QEMU security > policy. > > Signed-off-by: Philippe Mathieu-Daudé > --- > include/block/block_int.h | 6 +- > block.c | 6 ++ > 2 files changed, 11 insertions(+), 1 deletion(-) > > diff --git a/include/block/block_int.h b/include/block/block_int.h > index f1a54db0f8c..0ec0a5c06e9 100644 > --- a/include/block/block_int.h > +++ b/include/block/block_int.h > @@ -169,7 +169,11 @@ struct BlockDriver { > int (*bdrv_file_open)(BlockDriverState *bs, QDict *options, int flags, >Error **errp); > void (*bdrv_close)(BlockDriverState *bs); > - > +/* > + * Return %true if the driver is withing QEMU security policy boundary, > + * %false otherwise. See: > https://www.qemu.org/contribute/security-process/ > + */ > +bool (*bdrv_taints_security_policy)(BlockDriverState *bs); > > int coroutine_fn (*bdrv_co_create)(BlockdevCreateOptions *opts, > Error **errp); > diff --git a/block.c b/block.c > index b2b66263f9a..696ba486001 100644 > --- a/block.c > +++ b/block.c > @@ -49,6 +49,7 @@ > #include "qemu/timer.h" > #include "qemu/cutils.h" > #include "qemu/id.h" > +#include "qemu-common.h" > #include "block/coroutines.h" > > #ifdef CONFIG_BSD > @@ -1587,6 +1588,11 @@ static int bdrv_open_driver(BlockDriverState *bs, > BlockDriver *drv, > } > } > > +if (drv->bdrv_taints_security_policy) { > +qemu_security_policy_taint(drv->bdrv_taints_security_policy(bs), > + "Block protocol '%s'", drv->format_name); Hmm I should check for phase_check(PHASE_MACHINE_READY) and qemu_security_policy_is_strict() somehow, to refuse adding unsafe drv at runtime instead of exiting... > +} > + > return 0; > open_failed: > bs->drv = NULL; >
Re: [PULL for-6.1 06/11] hw/nvme: fix controller hot unplugging
On 9/9/21 9:59 AM, Klaus Jensen wrote: > On Sep 9 09:02, Hannes Reinecke wrote: >> On 7/26/21 9:18 PM, Klaus Jensen wrote: >>> From: Klaus Jensen >>> >>> Prior to this patch the nvme-ns devices are always children of the >>> NvmeBus owned by the NvmeCtrl. This causes the namespaces to be >>> unrealized when the parent device is removed. However, when subsystems >>> are involved, this is not what we want since the namespaces may be >>> attached to other controllers as well. >>> >>> This patch adds an additional NvmeBus on the subsystem device. When >>> nvme-ns devices are realized, if the parent controller device is linked >>> to a subsystem, the parent bus is set to the subsystem one instead. This >>> makes sure that namespaces are kept alive and not unrealized. >>> >>> Reviewed-by: Hannes Reinecke >>> Signed-off-by: Klaus Jensen >>> --- >>> hw/nvme/nvme.h | 15 --- >>> hw/nvme/ctrl.c | 14 ++ >>> hw/nvme/ns.c | 18 ++ >>> hw/nvme/subsys.c | 3 +++ >>> 4 files changed, 35 insertions(+), 15 deletions(-) >>> >> Finally got around to test this; sadly, with mixed results. >> On the good side: controller hotplug works. >> Within qemu monitor I can do: >> >> device_del >> device_add >> >> and OS reports: >> [ 56.564447] pcieport :00:09.0: pciehp: Slot(0-2): Attention button >> pressed >> [ 56.564460] pcieport :00:09.0: pciehp: Slot(0-2): Powering off due to >> button press >> [ 104.293335] pcieport :00:09.0: pciehp: Slot(0-2): Attention button >> pressed >> [ 104.293347] pcieport :00:09.0: pciehp: Slot(0-2) Powering on due to >> button press >> [ 104.293540] pcieport :00:09.0: pciehp: Slot(0-2): Card present >> [ 104.293544] pcieport :00:09.0: pciehp: Slot(0-2): Link Up >> [ 104.428961] pci :03:00.0: [1b36:0010] type 00 class 0x010802 >> [ 104.429298] pci :03:00.0: reg 0x10: [mem 0x-0x3fff 64bit] >> [ 104.431442] pci :03:00.0: BAR 0: assigned [mem 0xc120-0xc1203fff >> 64bit] >> [ 104.431580] pcieport :00:09.0: PCI bridge to [bus 03] >> [ 104.431604] pcieport :00:09.0: bridge window [io 0x7000-0x7fff] >> [ 104.434815] pcieport :00:09.0: bridge window [mem >> 0xc120-0xc13f] >> [ 104.436685] pcieport :00:09.0: bridge window [mem >> 0x80400-0x805ff 64bit pref] >> [ 104.441896] nvme nvme2: pci function :03:00.0 >> [ 104.442151] nvme :03:00.0: enabling device ( -> 0002) >> [ 104.455821] nvme nvme2: 1/0/0 default/read/poll queues >> >> So that works. >> But: the namespace is not reconnected. >> >> # nvme list-ns /dev/nvme2 >> >> draws a blank. So guess some fixup patch is in order... >> > > Hi Hannes, > > I see. Ater the device_del/device_add dance, the namespace is there, but it is > not automatically attached. > > # nvme list-ns -a /dev/nvme0 > [ 0]:0x2 > > # nvme attach-ns /dev/nvme0 -n 2 -c 0 > attach-ns: Success, nsid:2 > > # nvme list-ns /dev/nvme0 > [ 0]:0x2 > > > I don't *think* the spec says that namespaces *must* be re-attached > automatically? But I would have to check... If it does say that, then this is > a > bug of course. > Errm. Yes, the namespaces must be present after a 'reset' (which a hotunplug/hotplug cycle amounts to here). As per spec the namespaces are a property of the _subsystem_, not the controller. And the controller attaches to a subsystem, so it'll see any namespaces which are present in the subsystem. (whether it needs to see _all_ namespaces from that subsystem is another story, but doesn't need to bother us here :-). Just send a patch for it; is actually quite trivial. Cheers, Hannes -- Dr. Hannes ReineckeKernel Storage Architect h...@suse.de +49 911 74053 688 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), GF: Felix Imendörffer
[PATCH] hw/nvme: reattach subsystem namespaces on hotplug
With commit 5ffbaeed16 ("hw/nvme: fix controller hot unplugging") namespaces get moved from the controller to the subsystem if one is specified. That keeps the namespaces alive after a controller hot-unplug, but after a controller hotplug we have to reconnect the namespaces from the subsystem to the controller. Fixes: 5ffbaeed16 ("hw/nvme: fix controller hot unplugging") Cc: Klaus Jensen Signed-off-by: Hannes Reinecke --- hw/nvme/subsys.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c index 93c35950d6..a9404f2b5e 100644 --- a/hw/nvme/subsys.c +++ b/hw/nvme/subsys.c @@ -14,7 +14,7 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp) { NvmeSubsystem *subsys = n->subsys; -int cntlid; +int cntlid, nsid; for (cntlid = 0; cntlid < ARRAY_SIZE(subsys->ctrls); cntlid++) { if (!subsys->ctrls[cntlid]) { @@ -29,12 +29,18 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp) subsys->ctrls[cntlid] = n; +for (nsid = 0; nsid < ARRAY_SIZE(subsys->namespaces); nsid++) { +if (subsys->namespaces[nsid]) { +nvme_attach_ns(n, subsys->namespaces[nsid]); +} +} return cntlid; } void nvme_subsys_unregister_ctrl(NvmeSubsystem *subsys, NvmeCtrl *n) { subsys->ctrls[n->cntlid] = NULL; +n->cntlid = -1; } static void nvme_subsys_setup(NvmeSubsystem *subsys) -- 2.26.2
[PATCH] hw/nvme: reattach subsystem namespaces on hotplug
With commit 5ffbaeed16 ("hw/nvme: fix controller hot unplugging") namespaces get moved from the controller to the subsystem if one is specified. That keeps the namespaces alive after a controller hot-unplug, but after a controller hotplug we have to reconnect the namespaces from the subsystem to the controller. Fixes: 5ffbaeed16 ("hw/nvme: fix controller hot unplugging") Cc: Klaus Jensen Signed-off-by: Hannes Reinecke --- hw/nvme/subsys.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c index 93c35950d6..a9404f2b5e 100644 --- a/hw/nvme/subsys.c +++ b/hw/nvme/subsys.c @@ -14,7 +14,7 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp) { NvmeSubsystem *subsys = n->subsys; -int cntlid; +int cntlid, nsid; for (cntlid = 0; cntlid < ARRAY_SIZE(subsys->ctrls); cntlid++) { if (!subsys->ctrls[cntlid]) { @@ -29,12 +29,18 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp) subsys->ctrls[cntlid] = n; +for (nsid = 0; nsid < ARRAY_SIZE(subsys->namespaces); nsid++) { +if (subsys->namespaces[nsid]) { +nvme_attach_ns(n, subsys->namespaces[nsid]); +} +} return cntlid; } void nvme_subsys_unregister_ctrl(NvmeSubsystem *subsys, NvmeCtrl *n) { subsys->ctrls[n->cntlid] = NULL; +n->cntlid = -1; } static void nvme_subsys_setup(NvmeSubsystem *subsys) -- 2.26.2
Re: [PATCH v4 1/1] hw/pflash_cfi01: Allow backing devices to be smaller than memory region
Hi David, On 8/10/21 3:40 PM, David Edmondson wrote: > Allow the backing device to be smaller than the extent of the flash > device by mapping it as a subregion of the flash device region. > > Return zeroes for all reads of the flash device beyond the extent of > the backing device. > > For writes beyond the extent of the underlying device, fail on > read-only devices and discard them for writable devices. > > Signed-off-by: David Edmondson > --- > hw/block/pflash_cfi01.c | 105 > hw/block/trace-events | 3 ++ > 2 files changed, 87 insertions(+), 21 deletions(-) > > diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c > index 81f9f971d8..f3289b6a2f 100644 > --- a/hw/block/pflash_cfi01.c > +++ b/hw/block/pflash_cfi01.c > @@ -83,6 +83,8 @@ struct PFlashCFI01 { > uint64_t counter; > unsigned int writeblock_size; > MemoryRegion mem; > +MemoryRegion mem_outer; > +char outer_name[64]; > char *name; > void *storage; > VMChangeStateEntry *vmstate; > @@ -434,7 +436,6 @@ static inline void pflash_data_write(PFlashCFI01 *pfl, > hwaddr offset, > } > break; > } > - > } > > static void pflash_write(PFlashCFI01 *pfl, hwaddr offset, > @@ -656,8 +657,44 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr offset, > } > > > -static MemTxResult pflash_mem_read_with_attrs(void *opaque, hwaddr addr, > uint64_t *value, > - unsigned len, MemTxAttrs attrs) > +static MemTxResult pflash_outer_read_with_attrs(void *opaque, hwaddr addr, > +uint64_t *value, unsigned > len, > +MemTxAttrs attrs) > +{ > +PFlashCFI01 *pfl = opaque; > + > +trace_pflash_outer_read(pfl->name, addr, len); > +*value = 0; This seems to break the "width" and "old-multiple-chip-handling" parameters ("emulating a number of flash devices wired up in parallel"). Also this breaks booting with SEV enabled on X86... See commits 9617cddb726 ("pc: add parser for OVMF reset block") and b2f73a0784b ("sev/i386: Allow AP booting under SEV-ES"). > +return MEMTX_OK; > +} > + > +static MemTxResult pflash_outer_write_with_attrs(void *opaque, hwaddr addr, > + uint64_t value, unsigned > len, > + MemTxAttrs attrs) > +{ > +PFlashCFI01 *pfl = opaque; > + > +trace_pflash_outer_write(pfl->name, addr, len); > +if (pfl->ro) { > +return MEMTX_ERROR; > +} else { > +warn_report_once("%s: " > + "attempt to write outside of the backing block > device " > + "(offset " TARGET_FMT_plx ") ignored", > + pfl->name, addr); This doesn't seem acceptable on mainstream, see: https://www.mail-archive.com/qemu-devel@nongnu.org/msg607292.html > +return MEMTX_OK; > +} > +} > + > +static const MemoryRegionOps pflash_cfi01_outer_ops = { > +.read_with_attrs = pflash_outer_read_with_attrs, > +.write_with_attrs = pflash_outer_write_with_attrs, > +.endianness = DEVICE_NATIVE_ENDIAN, > +}; > + > +static MemTxResult pflash_mem_read_with_attrs(void *opaque, hwaddr addr, > + uint64_t *value, unsigned len, > + MemTxAttrs attrs) > { > PFlashCFI01 *pfl = opaque; > bool be = !!(pfl->features & (1 << PFLASH_BE)); > @@ -670,8 +707,9 @@ static MemTxResult pflash_mem_read_with_attrs(void > *opaque, hwaddr addr, uint64_ > return MEMTX_OK; > } > > -static MemTxResult pflash_mem_write_with_attrs(void *opaque, hwaddr addr, > uint64_t value, > - unsigned len, MemTxAttrs > attrs) > +static MemTxResult pflash_mem_write_with_attrs(void *opaque, hwaddr addr, > + uint64_t value, unsigned len, > + MemTxAttrs attrs) > { > PFlashCFI01 *pfl = opaque; > bool be = !!(pfl->features & (1 << PFLASH_BE)); > @@ -800,7 +838,7 @@ static void pflash_cfi01_realize(DeviceState *dev, Error > **errp) > { > ERRP_GUARD(); > PFlashCFI01 *pfl = PFLASH_CFI01(dev); > -uint64_t total_len; > +uint64_t outer_len, inner_len; > int ret; > > if (pfl->sector_len == 0) { > @@ -816,35 +854,60 @@ static void pflash_cfi01_realize(DeviceState *dev, > Error **errp) > return; > } > > -total_len = pfl->sector_len * pfl->nb_blocs; > - > -memory_region_init_rom_device( > -&pfl->mem, OBJECT(dev), > -&pflash_cfi01_ops, > -pfl, > -pfl->name, total_len, errp); > -if (*errp) { > -return; > -} > - > -pfl->storage = memory_region_get_ram_ptr(&pfl->mem); > -sysbus_init_mmio(SYS_BUS_DEVICE(dev), &pfl->me
Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
* Peter Xu (pet...@redhat.com) wrote: > On Wed, Sep 08, 2021 at 09:30:58AM +0100, Dr. David Alan Gilbert wrote: > > * Peter Xu (pet...@redhat.com) wrote: > > > On Tue, Sep 07, 2021 at 12:06:15PM +0100, Dr. David Alan Gilbert wrote: > > > > > > What if we do the 'flush()' before we start post-copy, instead of > > > > > > after each > > > > > > iteration? would that be enough? > > > > > > > > > > Possibly, yes. This really need David G's input since he understands > > > > > the code in way more detail than me. > > > > > > > > Hmm I'm not entirely sure why we have the sync after each iteration; > > > > > > We don't have that yet, do we? > > > > I think multifd does; I think it calls multifd_send_sync_main sometimes, > > which I *think* corresponds to iterations. > > Oh.. Then we may need to: > > (1) Make that sync work for zero-copy too; say, semaphores may not be enough > with it - we need to make sure all async buffers are consumed by > checking > the error queue of the sockets, > > (2) When we make zero-copy work without multi-fd, we may want some similar > sync on the main stream too It might not be worth bothering with (2) - zerocopy fits a lot better with multifd's data organisation. Dave > Thanks, > > > > > Dave > > > > > > the case I can think of is if we're doing async sending, we could have > > > > two versions of the same page in flight (one from each iteration) - > > > > you'd want those to get there in the right order. > > > > > > From MSG_ZEROCOPY document: > > > > > > For protocols that acknowledge data in-order, like TCP, each > > > notification can be squashed into the previous one, so that no > > > more > > > than one notification is outstanding at any one point. > > > > > > Ordered delivery is the common case, but not guaranteed. > > > Notifications > > > may arrive out of order on retransmission and socket teardown. > > > > > > So no matter whether we have a flush() per iteration before, it seems we > > > may > > > want one when zero copy enabled? > > > > > > Thanks, > > > > > > -- > > > Peter Xu > > > > > -- > > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK > > > > -- > Peter Xu > -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [PATCH v1 2/4] virtio: increase virtuqueue size for virtio-scsi and virtio-blk
On Wed, Sep 08, 2021 at 06:20:49PM +0300, Denis Plotnikov wrote: On 08.09.2021 16:22, Stefano Garzarella wrote: Message bounced, I use new Denis's email address. On Wed, Sep 08, 2021 at 03:17:16PM +0200, Stefano Garzarella wrote: Hi Denis, I just found this discussion since we still have the following line in hw/core/machine.c: { "vhost-blk-device", "seg_max_adjust", "off"} IIUC it was a typo, and I think we should fix it since in the future we can have `vhost-blk-device`. So, I think we have 2 options: 1. remove that line since for now is useless 2. replace with "vhost-scsi" I'm not sure which is the best, what do you suggest? Thanks, Stefano Hi Stefano I prefer to just remove the line without replacing. This will keep things exactly like it is now. Replacing with "vhost-scsi" will affect seg_max (limit to 126) on newly created VMs with machine types using "hw_compat_4_2" and older. Now, because of the typo (error), their seg-max is queue-size dependent. I'm not sure, if replacing the line may cause any problems, for example on migration: source (queue-size 256, seg max 254) -> destination (queue-size 256, seg max 126). But it will definitely introduce two different behaviors for VMs with "hw_compat_4_2" and older. So, I'd choose the lesser of two evils and keep the things like it's now. Yep, make sense. It was the same concern I had. Do you want to send a patch to remove it with this explanation? Thanks for the clarification, Stefano
Re: [PULL for-6.1 06/11] hw/nvme: fix controller hot unplugging
On Sep 9 09:02, Hannes Reinecke wrote: > On 7/26/21 9:18 PM, Klaus Jensen wrote: > > From: Klaus Jensen > > > > Prior to this patch the nvme-ns devices are always children of the > > NvmeBus owned by the NvmeCtrl. This causes the namespaces to be > > unrealized when the parent device is removed. However, when subsystems > > are involved, this is not what we want since the namespaces may be > > attached to other controllers as well. > > > > This patch adds an additional NvmeBus on the subsystem device. When > > nvme-ns devices are realized, if the parent controller device is linked > > to a subsystem, the parent bus is set to the subsystem one instead. This > > makes sure that namespaces are kept alive and not unrealized. > > > > Reviewed-by: Hannes Reinecke > > Signed-off-by: Klaus Jensen > > --- > > hw/nvme/nvme.h | 15 --- > > hw/nvme/ctrl.c | 14 ++ > > hw/nvme/ns.c | 18 ++ > > hw/nvme/subsys.c | 3 +++ > > 4 files changed, 35 insertions(+), 15 deletions(-) > > > Finally got around to test this; sadly, with mixed results. > On the good side: controller hotplug works. > Within qemu monitor I can do: > > device_del > device_add > > and OS reports: > [ 56.564447] pcieport :00:09.0: pciehp: Slot(0-2): Attention button > pressed > [ 56.564460] pcieport :00:09.0: pciehp: Slot(0-2): Powering off due to > button press > [ 104.293335] pcieport :00:09.0: pciehp: Slot(0-2): Attention button > pressed > [ 104.293347] pcieport :00:09.0: pciehp: Slot(0-2) Powering on due to > button press > [ 104.293540] pcieport :00:09.0: pciehp: Slot(0-2): Card present > [ 104.293544] pcieport :00:09.0: pciehp: Slot(0-2): Link Up > [ 104.428961] pci :03:00.0: [1b36:0010] type 00 class 0x010802 > [ 104.429298] pci :03:00.0: reg 0x10: [mem 0x-0x3fff 64bit] > [ 104.431442] pci :03:00.0: BAR 0: assigned [mem 0xc120-0xc1203fff > 64bit] > [ 104.431580] pcieport :00:09.0: PCI bridge to [bus 03] > [ 104.431604] pcieport :00:09.0: bridge window [io 0x7000-0x7fff] > [ 104.434815] pcieport :00:09.0: bridge window [mem > 0xc120-0xc13f] > [ 104.436685] pcieport :00:09.0: bridge window [mem > 0x80400-0x805ff 64bit pref] > [ 104.441896] nvme nvme2: pci function :03:00.0 > [ 104.442151] nvme :03:00.0: enabling device ( -> 0002) > [ 104.455821] nvme nvme2: 1/0/0 default/read/poll queues > > So that works. > But: the namespace is not reconnected. > > # nvme list-ns /dev/nvme2 > > draws a blank. So guess some fixup patch is in order... > Hi Hannes, I see. Ater the device_del/device_add dance, the namespace is there, but it is not automatically attached. # nvme list-ns -a /dev/nvme0 [ 0]:0x2 # nvme attach-ns /dev/nvme0 -n 2 -c 0 attach-ns: Success, nsid:2 # nvme list-ns /dev/nvme0 [ 0]:0x2 I don't *think* the spec says that namespaces *must* be re-attached automatically? But I would have to check... If it does say that, then this is a bug of course. signature.asc Description: PGP signature
Re: Virtual FAT disk images
up² :-) can someone just ping me to make me sure I'm posting on the qemu list ? Le mar. 31 août 2021 à 09:24, Pascal a écrit : > up :-) > nobody uses this feature of qemu? > > Le ven. 27 août 2021 à 11:11, Pascal a écrit : > >> hello everybody, >> >> virtual FAT disk image - *which is a convenient way to transfer files to >> the guest without having to activate its network* - seems to work very >> poorly with Windows : do you have the same difficulties? >> >> context : up to date archlinux, qemu 6.0.0, fresh installed windows 10 >> 21H1. >> >> /usr/bin/qemu-system-x86_64 -accel kvm -machine q35 -m 1024 -device >> nec-usb-xhci -device usb-tablet -cpu qemu64,kvm=off -parallel null -serial >> mon:stdio -hda windows.disk -hdb fat:rw:/tmp/test/ >> >> access to the E: drive is extremely slow and the system events report >> many storahci (129: reinit \device\RaidPort0 sent) and disk (153: I/O >> @0x3f on disk 1 \device\0001f replayed) warnings. >> >> regards. >> >
Re: [PULL for-6.1 06/11] hw/nvme: fix controller hot unplugging
On 7/26/21 9:18 PM, Klaus Jensen wrote: From: Klaus Jensen Prior to this patch the nvme-ns devices are always children of the NvmeBus owned by the NvmeCtrl. This causes the namespaces to be unrealized when the parent device is removed. However, when subsystems are involved, this is not what we want since the namespaces may be attached to other controllers as well. This patch adds an additional NvmeBus on the subsystem device. When nvme-ns devices are realized, if the parent controller device is linked to a subsystem, the parent bus is set to the subsystem one instead. This makes sure that namespaces are kept alive and not unrealized. Reviewed-by: Hannes Reinecke Signed-off-by: Klaus Jensen --- hw/nvme/nvme.h | 15 --- hw/nvme/ctrl.c | 14 ++ hw/nvme/ns.c | 18 ++ hw/nvme/subsys.c | 3 +++ 4 files changed, 35 insertions(+), 15 deletions(-) Finally got around to test this; sadly, with mixed results. On the good side: controller hotplug works. Within qemu monitor I can do: device_del device_add and OS reports: [ 56.564447] pcieport :00:09.0: pciehp: Slot(0-2): Attention button pressed [ 56.564460] pcieport :00:09.0: pciehp: Slot(0-2): Powering off due to button press [ 104.293335] pcieport :00:09.0: pciehp: Slot(0-2): Attention button pressed [ 104.293347] pcieport :00:09.0: pciehp: Slot(0-2) Powering on due to button press [ 104.293540] pcieport :00:09.0: pciehp: Slot(0-2): Card present [ 104.293544] pcieport :00:09.0: pciehp: Slot(0-2): Link Up [ 104.428961] pci :03:00.0: [1b36:0010] type 00 class 0x010802 [ 104.429298] pci :03:00.0: reg 0x10: [mem 0x-0x3fff 64bit] [ 104.431442] pci :03:00.0: BAR 0: assigned [mem 0xc120-0xc1203fff 64bit] [ 104.431580] pcieport :00:09.0: PCI bridge to [bus 03] [ 104.431604] pcieport :00:09.0: bridge window [io 0x7000-0x7fff] [ 104.434815] pcieport :00:09.0: bridge window [mem 0xc120-0xc13f] [ 104.436685] pcieport :00:09.0: bridge window [mem 0x80400-0x805ff 64bit pref] [ 104.441896] nvme nvme2: pci function :03:00.0 [ 104.442151] nvme :03:00.0: enabling device ( -> 0002) [ 104.455821] nvme nvme2: 1/0/0 default/read/poll queues So that works. But: the namespace is not reconnected. # nvme list-ns /dev/nvme2 draws a blank. So guess some fixup patch is in order... Cheers, Hannes -- Dr. Hannes ReineckeKernel Storage Architect h...@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer