Re: [PATCH v8] drivers/misc: sysgenid: add system generation id driver
Hi Greg, After your previous reply on this thread we started considering to provide this interface and framework/functionality through a userspace service instead of a kernel interface. The latest iteration on this evolving patch-set doesn’t have strong reasons for living in the kernel anymore - the only objectively strong advantage would be easier driving of ecosystem integration; but I am not sure that's a good enough reason to create a new kernel interface. I am now looking into adding this through Systemd. Either as a pluggable service or maybe even a systemd builtin offering. What are your thoughts on it? Thanks, Adrian. On 23/03/2021, 14:57, "Greg KH" wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. On Mon, Mar 08, 2021 at 05:03:58PM +0100, Alexander Graf wrote: > > > On 08.03.21 15:36, Greg KH wrote: > > > > On Mon, Mar 08, 2021 at 04:18:03PM +0200, Adrian Catangiu wrote: > > > +static struct miscdevice sysgenid_misc = { > > > + .minor = MISC_DYNAMIC_MINOR, > > > + .name = "sysgenid", > > > + .fops = , > > > +}; > > > > Much cleaner, but: > > > > > +static int __init sysgenid_init(void) > > > +{ > > > + int ret; > > > + > > > + sysgenid_data.map_buf = get_zeroed_page(GFP_KERNEL); > > > + if (!sysgenid_data.map_buf) > > > + return -ENOMEM; > > > + > > > + atomic_set(_data.generation_counter, 0); > > > + atomic_set(_data.outdated_watchers, 0); > > > + init_waitqueue_head(_data.read_waitq); > > > + init_waitqueue_head(_data.outdated_waitq); > > > + spin_lock_init(_data.lock); > > > + > > > + ret = misc_register(_misc); > > > + if (ret < 0) { > > > + pr_err("misc_register() failed for sysgenid\n"); > > > + goto err; > > > + } > > > + > > > + return 0; > > > + > > > +err: > > > + free_pages(sysgenid_data.map_buf, 0); > > > + sysgenid_data.map_buf = 0; > > > + > > > + return ret; > > > +} > > > + > > > +static void __exit sysgenid_exit(void) > > > +{ > > > + misc_deregister(_misc); > > > + free_pages(sysgenid_data.map_buf, 0); > > > + sysgenid_data.map_buf = 0; > > > +} > > > + > > > +module_init(sysgenid_init); > > > +module_exit(sysgenid_exit); > > > > So you do this for any bit of hardware that happens to be out there? > > Will that really work? You do not have any hwid to trigger off of to > > know that this is a valid device you can handle? > > The interface is already useful in a pure container context where the > generation change request is triggered by software. > > And yes, there are hardware triggers, but Michael was quite unhappy about > potential races between VMGenID change and SysGenID change and thus wanted > to ideally separate the interfaces. So we went ahead and isolated the > SysGenID one, as it's already useful as is. > > Hardware drivers to inject change events into SysGenID can then follow > later, for all different hardware platforms. But SysGenID as in this patch > is a completely hardware agnostic concept. Ok, this is going to play havoc with fuzzers and other "automated testers", should be fun to watch! :) Let's queue this up and see what happens... thanks, greg k-h Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
Re: [PATCH v7 0/2] System Generation ID driver and VMGENID backend
Hi Michael, On 24/02/2021, 11:06, "Michael S. Tsirkin" wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. On Wed, Feb 24, 2021 at 10:47:30AM +0200, Adrian Catangiu wrote: > This feature is aimed at virtualized or containerized environments > where VM or container snapshotting duplicates memory state, which is a > challenge for applications that want to generate unique data such as > request IDs, UUIDs, and cryptographic nonces. > > The patch set introduces a mechanism that provides a userspace > interface for applications and libraries to be made aware of uniqueness > breaking events such as VM or container snapshotting, and allow them to > react and adapt to such events. > > Solving the uniqueness problem strongly enough for cryptographic > purposes requires a mechanism which can deterministically reseed > userspace PRNGs with new entropy at restore time. This mechanism must > also support the high-throughput and low-latency use-cases that led > programmers to pick a userspace PRNG in the first place; be usable by > both application code and libraries; allow transparent retrofitting > behind existing popular PRNG interfaces without changing application > code; it must be efficient, especially on snapshot restore; and be > simple enough for wide adoption. > > The first patch in the set implements a device driver which exposes a > the /dev/sysgenid char device to userspace. Its associated filesystem > operations operations can be used to build a system level safe workflow > that guest software can follow to protect itself from negative system > snapshot effects. > > The second patch in the set adds a VmGenId driver which makes use of > the ACPI vmgenid device to drive SysGenId and to reseed kernel entropy > following VM snapshots. > > **Please note**, SysGenID alone does not guarantee complete snapshot > safety to applications using it. A certain workflow needs to be > followed at the system level, in order to make the system > snapshot-resilient. Please see the "Snapshot Safety Prerequisites" > section in the included SysGenID documentation. > > --- > > v6 -> v7: > - remove sysgenid uevent How about we drop mmap too? There's simply no way I can see to make it safe, and no implementation is worse than a racy one imho. Yea there's some decumentation explaining how it is not supposed to be used but it will *seem* to work for people and we will be stuck trying to maintain it. Let's see if userspace using this often enough to make the system call As Colm explained in his reply, the mmap is the only option to consume this within the strict latency constraints of PRNGs and SSL libs, so what if instead, we remove the IRQ race by removing vmgenid as an in-kernel sysgenid backend/driver? We could just drop the vmgenid driver for now and only drive sysgenid from userspace using the fs interface. Doing so will remove the IRQ race which comes from vmgenid backend, and will keep the SysGenID kernel interface safe and consistent, with a race-free mmap(). What do you think? > v5 -> v6: > > - sysgenid: watcher tracking disabled by default > - sysgenid: add SYSGENID_SET_WATCHER_TRACKING ioctl to allow each > file descriptor to set whether they should be tracked as watchers > - rename SYSGENID_FORCE_GEN_UPDATE -> SYSGENID_TRIGGER_GEN_UPDATE > - rework all documentation to clearly capture all prerequisites for > achieving snapshot safety when using the provided mechanism > - sysgenid documentation: replace individual filesystem operations > examples with a higher level example showcasing system-level > snapshot-safe workflow > > v4 -> v5: > > - sysgenid: generation changes are also exported through uevents > - remove SYSGENID_GET_OUTDATED_WATCHERS ioctl > - document sysgenid ioctl major/minor numbers > > v3 -> v4: > > - split functionality in two separate kernel modules: > 1. drivers/misc/sysgenid.c which provides the generic userspace >interface and mechanisms > 2. drivers/virt/vmgenid.c as VMGENID acpi device driver that seeds >kernel entropy and acts as a driving backend for the generic >sysgenid > - rename /dev/vmgenid -> /dev/sysgenid > - rename uapi header file vmgenid.h -> sysgenid.h > - rename ioctls VMGENID_* -> SYSGENID_* > - add ‘min_gen’ parameter to SYSGENID_FORCE_GEN_UPDATE ioctl > - fix races in documentation examples > > v2 -> v3: > > - separate the core driver logic and interface, from the ACPI device. > The ACPI vmgenid device is now one possible
Re: [PATCH v5 1/2] drivers/misc: sysgenid: add system generation id driver
On 09/02/2021, 16:53, "Michael S. Tsirkin" wrote: On Mon, Feb 01, 2021 at 07:24:53PM +0200, Adrian Catangiu wrote: > - Background and problem > > The System Generation ID feature is required in virtualized or > containerized environments by applications that work with local copies > or caches of world-unique data such as random values, uuids, > monotonically increasing counters, etc. > Such applications can be negatively affected by VM or container > snapshotting when the VM or container is either cloned or returned to > an earlier point in time. > > Furthermore, simply finding out about a system generation change is > only the starting point of a process to renew internal states of > possibly multiple applications across the system. This process could > benefit from a driver that provides an interface through which > orchestration can be easily done. > > - Solution > > The System Generation ID is a simple concept meant to alleviate the > issue by providing a monotonically increasing u32 counter that changes > each time the VM or container is restored from a snapshot. > > The `sysgenid` driver exposes a monotonic incremental System Generation > u32 counter via a char-dev FS interface that provides sync and async > SysGen counter updates notifications. It also provides SysGen counter > retrieval and confirmation mechanisms. > > The counter starts from zero when the driver is initialized and > monotonically increments every time the system generation changes. > > The `sysgenid` driver exports the `void sysgenid_bump_generation()` > symbol which can be used by backend drivers to drive system generation > changes based on hardware events. > System generation changes can also be driven by userspace software > through a dedicated driver ioctl. > > Userspace applications or libraries can then (a)synchronously consume > the system generation counter through the provided FS interface to make > any necessary internal adjustments following a system generation > update. > > Signed-off-by: Adrian Catangiu > --- > Documentation/misc-devices/sysgenid.rst| 236 > Documentation/userspace-api/ioctl/ioctl-number.rst | 1 + > MAINTAINERS| 8 + > drivers/misc/Kconfig | 16 ++ > drivers/misc/Makefile | 1 + > drivers/misc/sysgenid.c| 307 + > include/uapi/linux/sysgenid.h | 17 ++ > 7 files changed, 586 insertions(+) > create mode 100644 Documentation/misc-devices/sysgenid.rst > create mode 100644 drivers/misc/sysgenid.c > create mode 100644 include/uapi/linux/sysgenid.h > > diff --git a/Documentation/misc-devices/sysgenid.rst b/Documentation/misc-devices/sysgenid.rst > new file mode 100644 > index 000..4337ca0 > --- /dev/null > +++ b/Documentation/misc-devices/sysgenid.rst > @@ -0,0 +1,236 @@ > +.. SPDX-License-Identifier: GPL-2.0 > + > + > +SYSGENID > + > + > +The System Generation ID feature is required in virtualized or > +containerized environments by applications that work with local copies > +or caches of world-unique data such as random values, UUIDs, > +monotonically increasing counters, etc. > +Such applications can be negatively affected by VM or container > +snapshotting when the VM or container is either cloned or returned to > +an earlier point in time. > + > +The System Generation ID is a simple concept meant to alleviate the > +issue by providing a monotonically increasing counter that changes > +each time the VM or container is restored from a snapshot. > +The driver for it lives at ``drivers/misc/sysgenid.c``. > + > +The ``sysgenid`` driver exposes a monotonic incremental System > +Generation u32 counter via a char-dev FS interface accessible through > +``/dev/sysgenid`` that provides sync and async SysGen counter update > +notifications. It also provides SysGen counter retrieval and > +confirmation mechanisms. > + > +The counter starts from zero when the driver is initialized and > +monotonically increments every time the system generation changes. > + > +The ``sysgenid`` driver exports the ``void sysgenid_bump_generation()`` > +symbol which can be used by backend drivers to drive system generation > +changes based on hardware events. > +System generation changes can also be driven by userspace software > +through a dedicated driver ioctl. > + > +Userspace applications or libraries can (a)synchronously consume the > +system generation counter through the provided FS interface, to make >
Re: [PATCH v5 1/2] drivers/misc: sysgenid: add system generation id driver
On 02/02/2021, 14:09, "Greg KH" wrote: On Mon, Feb 01, 2021 at 07:24:53PM +0200, Adrian Catangiu wrote: > +static long sysgenid_ioctl(struct file *file, > +unsigned int cmd, unsigned long arg) Very odd indentation style, checkpatch.pl didn't catch this? Checkpatch.pl is happy with this, yes. Will change it to a single tab nonetheless since it does look weird :) Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
Re: [PATCH v5 1/2] drivers/misc: sysgenid: add system generation id driver
On 02/02/2021, 14:05, "Greg KH" wrote: On Mon, Feb 01, 2021 at 07:24:53PM +0200, Adrian Catangiu wrote: > +EXPORT_SYMBOL(sysgenid_bump_generation); EXPORT_SYMBOL_GPL()? I have to ask... Good catch! Will update. Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
Re: [PATCH v4 1/2] drivers/misc: sysgenid: add system generation id driver
On 12/01/2021, 15:09, "Michael S. Tsirkin" wrote: On Tue, Jan 12, 2021 at 02:15:59PM +0200, Adrian Catangiu wrote: > +3) Mapped memory polling simplified example:: > + > + /* > + * app/library function that provides cached secrets > + */ > + char * safe_cached_secret(app_data_t *app) > + { > + char *secret; > + volatile unsigned *const genid_ptr = get_sysgenid_mapping(app); > + again: > + secret = __cached_secret(app); > + > + if (unlikely(*genid_ptr != app->cached_genid)) { > + app->cached_genid = *genid_ptr; > + barrier(); > + > + // rebuild world then confirm the genid update (thru write) > + rebuild_caches(app); > + > + ack_sysgenid_update(app); > + > + goto again; > + } > + > + return secret; Hmm. This seems to rely on the assumption that if you have read the ID and it did not change, then all is well. Problem is, in the interrupt driven implementation this is not a safe assumption to make: ID from hypervisor might have changed but interrupt could be delayed. If we map the hypervisor ID to userspace then we won't have this race ... worth worry about? why not? This is a very good point! Unfortunately, there is no immediate solution here. The current patch-set makes this trade-off in order to gain the genericity of a system-level generation ID which is not limited to VMs usecases, but can also be used with things like CRIU. Directly mapping the vmgenid UUID to userspace was the initial design of this patch-set (see v1), but it was argued against and evolved into current state. I would personally rather enhance the HW support (vmgenid device for example) to also expose a configurable u32 Vm Gen Counter alongside the 128-bit UUID; and add support in SysGenID to offload the counter to HW when applicable. The broader view is we need to strike the right balance between a functional, safe mechanism with today's technology, but also design a future-proof generic mechanism. Fixing the race you mentioned in SysGenID only moves the race a bit further up the stack - you generate the secret race-free but the secret can still be duplicated in the next layer. To make any mechanism completely safe we need to conceptually disconnect ourselves from believing that a restored snapshot is immediately available. There needs to be some entity that moves the restored VM/container/system from a transient state to "available". That entity can be a process inside the VM, but it can also be something outside the VM, in the hypervisor or in the stack surrounding it. That entity is responsible for managing the transition of the VM or container from transient -> available. It is responsible for not allowing the VM/ container to be used or usable until that transition is complete. In the first generations of VM clones and CRIU production deployments, I expect this entity to be a stack-specific component with intimate knowledge of the system components, transient states, lifecycle, etc. Which this patch-set enables. Thanks, Adrian. Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
Re: [PATCH v4 0/2] System Generation ID driver and VMGENID backend
On 12/01/2021, 14:49, "Michael S. Tsirkin" wrote: On Tue, Jan 12, 2021 at 02:15:58PM +0200, Adrian Catangiu wrote: > The first patch in the set implements a device driver which exposes a > read-only device /dev/sysgenid to userspace, which contains a > monotonically increasing u32 generation counter. Libraries and > applications are expected to open() the device, and then call read() > which blocks until the SysGenId changes. Following an update, read() > calls no longer block until the application acknowledges the new > SysGenId by write()ing it back to the device. Non-blocking read() calls > return EAGAIN when there is no new SysGenId available. Alternatively, > libraries can mmap() the device to get a single shared page which > contains the latest SysGenId at offset 0. Looking at some specifications, the gen ID might actually be located at an arbitrary address. How about instead of hard-coding the offset, we expose it e.g. in sysfs? The functionality is split between SysGenID which exposes an internal u32 counter to userspace, and an (optional) VmGenID backend which drives SysGenID generation changes based on hw vmgenid updates. The hw UUID you're referring to (vmgenid) is not mmap-ed to userspace or otherwise exposed to userspace. It is only used internally by the vmgenid driver to find out about VM generation changes and drive the more generic SysGenID. The SysGenID u32 monotonic increasing counter is the one that is mmaped to userspace, but it is a software counter. I don't see any value in using a dynamic offset in the mmaped page. Offset 0 is fast and easy and most importantly it is static so no need to dynamically calculate or find it at runtime. Thanks, Adrian. Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
Re: [PATCH v4 1/2] drivers/misc: sysgenid: add system generation id driver
On 12/01/2021, 15:07, "Greg KH" wrote: On Tue, Jan 12, 2021 at 02:15:59PM +0200, Adrian Catangiu wrote: > + Partial reads are not allowed - read buffer needs to be at least > + ``sizeof(unsigned)`` in size. "sizeof(unsigned)"? How about being specific and making this a real "X bits big" value please. "unsigned" does not work well across user/kernel boundries. Ok, that's on understatement, the correct thing is "does not work at all". Please be specific in your apis. This is listed elsewhere also. Right, will do! > + - SYSGENID_GET_OUTDATED_WATCHERS: immediately returns the number of > +*outdated* watchers - number of file descriptors that were open > +during a system generation change, and which have not yet confirmed > +the new generation counter. But this number can instantly change after it is read, what good is it? It should never be relied on, so why is this needed at all? What can userspace do with this information? That is true, a userspace process either has to wait for all to adjust to the new generation or not care about other processes. Intermediate probing doesn't bring real value. Will remove. thanks, greg k-h Thanks for the feedback! Adrian. Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver
Sorry Jann for missing your original email. On 27/11/2020 20:22, Jann Horn wrote: > CAUTION: This email originated from outside of the organization. Do not click > links or open attachments unless you can confirm the sender and know the > content is safe. > > > > [resend in the hope that amazon will accept my mail this time instead > of replying "550 Too many invalid recipients" again] > > On Fri, Nov 20, 2020 at 11:29 PM Jann Horn wrote: >> On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin >> wrote: >>> This patch is a driver that exposes a monotonic incremental Virtual >>> Machine Generation u32 counter via a char-dev FS interface that >>> provides sync and async VmGen counter updates notifications. It also >>> provides VmGen counter retrieval and confirmation mechanisms. >>> >>> The hw provided UUID is not exposed to userspace, it is internally >>> used by the driver to keep accounting for the exposed VmGen counter. >>> The counter starts from zero when the driver is initialized and >>> monotonically increments every time the hw UUID changes (the VM >>> generation changes). >>> >>> On each hw UUID change, the new hypervisor-provided UUID is also fed >>> to the kernel RNG. >> As for v1: >> >> Is there a reasonable usecase for the "confirmation" mechanism? It >> doesn't seem very useful to me. I think it adds value in complex scenarios with multiple users of the mechanism, potentially at varying layers of the stack, different processes and/or runtime libraries. The driver offers a natural place to handle minimal orchestration support and offer visibility in system-wide status. A high-level service that trusts all system components to properly use the confirmation mechanism can actually block and wait patiently for the system to adjust to the new world. Even if it doesn't trust all components it can still do a best-effort, timeout block. >> >> How do you envision integrating this with libraries that have to work >> in restrictive seccomp sandboxes? If this was in the vDSO, that would >> be much easier. Since this mechanism targets all of userspace stack, the usecase greatly vary. I doubt we can have a single silver bullet interface. For example, the mmap interface targets user space RNGs, where as fast and as race free as possible is key. But there also higher level applications that don't manage their own memory or don't have access to low-level primitives so they can't use the mmap or even vDSO interfaces. That's what the rest of the logic is there for, the read+poll interface and all of the orchestration logic. Like you correctly point out, there are also scenarios like tight seccomp jails where even the FS interfaces is inaccessible. For cases like this and others, I believe we will have to work incrementally to build up the interface diversity to cater to all the user scenarios diversity. Thanks, Adrian. Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
[PATCH v3] drivers/virt: vmgenid: add vm generation id driver
- Background The VM Generation ID is a feature defined by Microsoft (paper: http://go.microsoft.com/fwlink/?LinkId=260709) and supported by multiple hypervisor vendors. The feature is required in virtualized environments by apps that work with local copies/caches of world-unique data such as random values, uuids, monotonically increasing counters, etc. Such apps can be negatively affected by VM snapshotting when the VM is either cloned or returned to an earlier point in time. The VM Generation ID is a simple concept meant to alleviate the issue by providing a unique ID that changes each time the VM is restored from a snapshot. The hw provided UUID value can be used to differentiate between VMs or different generations of the same VM. - Problem The VM Generation ID is exposed through an ACPI device by multiple hypervisor vendors but neither the vendors or upstream Linux have no default driver for it leaving users to fend for themselves. Furthermore, simply finding out about a VM generation change is only the starting point of a process to renew internal states of possibly multiple applications across the system. This process could benefit from a driver that provides an interface through which orchestration can be easily done. - Solution This patch is a driver that exposes a monotonic incremental Virtual Machine Generation u32 counter via a char-dev FS interface. The FS interface provides sync and async VmGen counter updates notifications. It also provides VmGen counter retrieval and confirmation mechanisms. The generation counter and the interface through which it is exposed are available even when there is no acpi device present. When the device is present, the hw provided UUID is not exposed to userspace, it is internally used by the driver to keep accounting for the exposed VmGen counter. The counter starts from zero when the driver is initialized and monotonically increments every time the hw UUID changes (the VM generation changes). On each hw UUID change, the new hypervisor-provided UUID is also fed to the kernel RNG. If there is no acpi vmgenid device present, the generation changes are not driven by hw vmgenid events but can be driven by software through a dedicated driver ioctl. This patch builds on top of Or Idgar 's proposal https://lkml.org/lkml/2018/3/1/498 - Future improvements Ideally we would want the driver to register itself based on devices' _CID and not _HID, but unfortunately I couldn't find a way to do that. The problem is that ACPI device matching is done by '__acpi_match_device()' which exclusively looks at 'acpi_hardware_id *hwid'. There is a path for platform devices to match on _CID when _HID is 'PRP0001' - but this is not the case for the Qemu vmgenid device. Guidance and help here would be greatly appreciated. Signed-off-by: Adrian Catangiu --- v1 -> v2: - expose to userspace a monotonically increasing u32 Vm Gen Counter instead of the hw VmGen UUID - since the hw/hypervisor-provided 128-bit UUID is not public anymore, add it to the kernel RNG as device randomness - insert driver page containing Vm Gen Counter in the user vma in the driver's mmap handler instead of using a fault handler - turn driver into a misc device driver to auto-create /dev/vmgenid - change ioctl arg to avoid leaking kernel structs to userspace - update documentation - various nits - rebase on top of linus latest v2 -> v3: - separate the core driver logic and interface, from the ACPI device. The ACPI vmgenid device is now one possible backend. - fix issue when timeout=0 in VMGENID_WAIT_WATCHERS - add locking to avoid races between fs ops handlers and hw irq driven generation updates - change VMGENID_WAIT_WATCHERS ioctl so if the current caller is outdated or a generation change happens while waiting (thus making current caller outdated), the ioctl returns -EINTR to signal the user to handle event and retry. Fixes blocking on oneself. - add VMGENID_FORCE_GEN_UPDATE ioctl conditioned by CAP_CHECKPOINT_RESTORE capability, through which software can force generation bump. --- Documentation/virt/vmgenid.rst | 240 +++ drivers/virt/Kconfig | 17 ++ drivers/virt/Makefile | 1 + drivers/virt/vmgenid.c | 435 + include/uapi/linux/vmgenid.h | 14 ++ 5 files changed, 707 insertions(+) create mode 100644 Documentation/virt/vmgenid.rst create mode 100644 drivers/virt/vmgenid.c create mode 100644 include/uapi/linux/vmgenid.h diff --git a/Documentation/virt/vmgenid.rst b/Documentation/virt/vmgenid.rst new file mode 100644 index 000..b6a9f8d --- /dev/null +++ b/Documentation/virt/vmgenid.rst @@ -0,0 +1,240 @@ +.. SPDX-License-Identifier: GPL-2.0 + + +VMGENID + + +The VM Generation ID is a feature defined by Microsoft (paper: +http://go.microsoft.com/fwlink/?LinkId=260709) and supported by +multiple hypervisor
Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver
On 18/11/2020 12:30, Alexander Graf wrote: > > > On 16.11.20 16:34, Catangiu, Adrian Costin wrote: >> - Future improvements >> >> Ideally we would want the driver to register itself based on devices' >> _CID and not _HID, but unfortunately I couldn't find a way to do that. >> The problem is that ACPI device matching is done by >> '__acpi_match_device()' which exclusively looks at >> 'acpi_hardware_id *hwid'. >> >> There is a path for platform devices to match on _CID when _HID is >> 'PRP0001' - but this is not the case for the Qemu vmgenid device. >> >> Guidance and help here would be greatly appreciated. > > That one is pretty important IMHO. How about the following (probably > pretty mangled) patch? That seems to work for me. The ACPI change > would obviously need to be its own stand alone change and needs proper > assessment whether it could possibly break any existing systems. > > diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c > index 1682f8b454a2..452443d79d87 100644 > --- a/drivers/acpi/bus.c > +++ b/drivers/acpi/bus.c > @@ -748,7 +748,7 @@ static bool __acpi_match_device(struct acpi_device > *device, > /* First, check the ACPI/PNP IDs provided by the caller. */ > if (acpi_ids) { > for (id = acpi_ids; id->id[0] || id->cls; id++) { > - if (id->id[0] && !strcmp((char *)id->id, hwid->id)) > + if (id->id[0] && !strncmp((char *)id->id, hwid->id, > ACPI_ID_LEN - 1)) > goto out_acpi_match; > if (id->cls && __acpi_match_device_cls(id, hwid)) > goto out_acpi_match; > diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c > index 75a787da8aad..0bfa422cf094 100644 > --- a/drivers/virt/vmgenid.c > +++ b/drivers/virt/vmgenid.c > @@ -356,7 +356,8 @@ static void vmgenid_acpi_notify(struct acpi_device > *device, u32 event) > } > > static const struct acpi_device_id vmgenid_ids[] = { > - {"QEMUVGID", 0}, > + /* This really is VM_Gen_Counter, but we can only match 8 > characters */ > + {"VM_GEN_C", 0}, > {"", 0}, > }; > Looks legit. I can propose a patch with it, but how do we validate it doesn't break any devices? >> +2) ASYNC simplified example:: >> + >> + void handle_io_on_vmgenfd(int vmgenfd) >> + { >> + unsigned genid; >> + >> + // because of VM generation change, we need to rebuild world >> + reseed_app_env(); >> + >> + // read new gen ID - we need it to confirm we've handled update >> + read(fd, , sizeof(genid)); > > This is racy in case two consecutive snapshots happen. The read needs > to go before the reseed. > Switched them around like you suggest to avoid confusion. But I don't see a problem with this race. The idea here is to trigger reseed_app_env() which doesn't depend on the generation counter value. Whether it gets incremented once or N times is irrelevant, we're just interested that we pause execution and reseed before resuming (in between these, whether N or M generation changes is the same thing). >> +3) Mapped memory polling simplified example:: >> + >> + /* >> + * app/library function that provides cached secrets >> + */ >> + char * safe_cached_secret(app_data_t *app) >> + { >> + char *secret; >> + volatile unsigned *const genid_ptr = get_vmgenid_mapping(app); >> + again: >> + secret = __cached_secret(app); >> + >> + if (unlikely(*genid_ptr != app->cached_genid)) { >> + // rebuild world then confirm the genid update (thru write) >> + rebuild_caches(app); >> + >> + app->cached_genid = *genid_ptr; > > This is racy again. You need to read the genid before rebuild and set > it here. > I don't see the race. Gen counter is read from volatile mapped mem, on detected change we rebuild world, confirm the update back to the driver then restart the loop. Loop will break when no more changes happen. >> + ack_vmgenid_update(app); >> + >> + goto again; >> + } >> + >> + return secret; >> + } >> + >> + >> +static int vmgenid_close(struct inode *inode, struct file *file) >> +{ >> + struct file_data *file_data = file->private_data; >> + struct dev_data *priv = file_data->dev_data; >> + >> + if (file_data->acked_gen_counter != priv->generation_counter) >> + vmgenid_put_outdated_watchers(priv); > > Is this racy?
[PATCH v2] drivers/virt: vmgenid: add vm generation id driver
- Background The VM Generation ID is a feature defined by Microsoft (paper: http://go.microsoft.com/fwlink/?LinkId=260709) and supported by multiple hypervisor vendors. The feature is required in virtualized environments by apps that work with local copies/caches of world-unique data such as random values, uuids, monotonically increasing counters, etc. Such apps can be negatively affected by VM snapshotting when the VM is either cloned or returned to an earlier point in time. The VM Generation ID is a simple concept meant to alleviate the issue by providing a unique ID that changes each time the VM is restored from a snapshot. The hw provided UUID value can be used to differentiate between VMs or different generations of the same VM. - Problem The VM Generation ID is exposed through an ACPI device by multiple hypervisor vendors but neither the vendors or upstream Linux have no default driver for it leaving users to fend for themselves. Furthermore, simply finding out about a VM generation change is only the starting point of a process to renew internal states of possibly multiple applications across the system. This process could benefit from a driver that provides an interface through which orchestration can be easily done. - Solution This patch is a driver that exposes a monotonic incremental Virtual Machine Generation u32 counter via a char-dev FS interface that provides sync and async VmGen counter updates notifications. It also provides VmGen counter retrieval and confirmation mechanisms. The hw provided UUID is not exposed to userspace, it is internally used by the driver to keep accounting for the exposed VmGen counter. The counter starts from zero when the driver is initialized and monotonically increments every time the hw UUID changes (the VM generation changes). On each hw UUID change, the new hypervisor-provided UUID is also fed to the kernel RNG. This patch builds on top of Or Idgar 's proposal https://lkml.org/lkml/2018/3/1/498 - Future improvements Ideally we would want the driver to register itself based on devices' _CID and not _HID, but unfortunately I couldn't find a way to do that. The problem is that ACPI device matching is done by '__acpi_match_device()' which exclusively looks at 'acpi_hardware_id *hwid'. There is a path for platform devices to match on _CID when _HID is 'PRP0001' - but this is not the case for the Qemu vmgenid device. Guidance and help here would be greatly appreciated. - v1 -> v2: - expose to userspace a monotonically increasing u32 Vm Gen Counter instead of the hw VmGen UUID - since the hw/hypervisor-provided 128-bit UUID is not public anymore, add it to the kernel RNG as device randomness - insert driver page containing Vm Gen Counter in the user vma in the driver's mmap handler instead of using a fault handler - turn driver into a misc device driver to auto-create /dev/vmgenid - change ioctl arg to avoid leaking kernel structs to userspace - update documentation - various nits (license, unnecessary casting, Kconfig, others) - rebase on top of linus latest Signed-off-by: Adrian Catangiu --- Documentation/virt/vmgenid.rst | 228 drivers/virt/Kconfig | 17 ++ drivers/virt/Makefile | 1 + drivers/virt/vmgenid.c | 390 + include/uapi/linux/vmgenid.h | 13 ++ 5 files changed, 649 insertions(+) create mode 100644 Documentation/virt/vmgenid.rst create mode 100644 drivers/virt/vmgenid.c create mode 100644 include/uapi/linux/vmgenid.h diff --git a/Documentation/virt/vmgenid.rst b/Documentation/virt/vmgenid.rst new file mode 100644 index 000..603e8a5 --- /dev/null +++ b/Documentation/virt/vmgenid.rst @@ -0,0 +1,228 @@ +.. SPDX-License-Identifier: GPL-2.0 + + +VMGENID + + +The VM Generation ID is a feature defined by Microsoft (paper: +http://go.microsoft.com/fwlink/?LinkId=260709) and supported by +multiple hypervisor vendors. + +The feature is required in virtualized environments by apps that work +with local copies/caches of world-unique data such as random values, +uuids, monotonically increasing counters, etc. +Such apps can be negatively affected by VM snapshotting when the VM +is either cloned or returned to an earlier point in time. + +The VM Generation ID is a simple concept meant to alleviate the issue +by providing a unique ID that changes each time the VM is restored +from a snapshot. The hw provided UUID value can be used to +differentiate between VMs or different generations of the same VM. + +The VM Generation ID is exposed through an ACPI device by multiple +hypervisor vendors. The driver for it lives at +``drivers/virt/vmgenid.c`` + +The driver exposes a monotonic incremental Virtual Machine Generation +u32 counter via a char-dev FS interface that provides sync and async +VmGen counter updates notifications. It also provides VmGen counter +retrieval and confirmation mechanisms. +
Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver
Hi all, On 17/10/2020, 21:09, "Graf (AWS), Alexander" wrote: On 17.10.20 15:24, Jason A. Donenfeld wrote: > > After discussing this offline with Jann a bit, I have a few general > comments on the design of this. > > First, the UUID communicated by the hypervisor should be consumed by > the kernel -- added as another input to the rng -- and then userspace We definitely want a kernel internal notifier as well, yes :). What would be a generic event trigger mechanism we could use from a kernel module/driver for triggering rng reseed (possibly adding the uuid to the mix as well)? Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver
After discussing this offline with Jann a bit, I have a few general comments on the design of this. First, the UUID communicated by the hypervisor should be consumed by the kernel -- added as another input to the rng -- and then userspace should be notified that it should reseed any userspace RNGs that it may have, without actually communicating that UUID to userspace. IOW, I agree with Jann there. Then, it's the functioning of this notification mechanism to userspace that is interesting to me. Agreed! The UUID/vmgenid is the glue to the hypervisor to be able to find out about VM snapshots/forks. The really interesting (and important) topic here is finding the right notification mechanism to userspace. ...In retrospect, I should have posted this as RFC instead of PATCH. So, anyway, here are a few options with some pros and cons for the kernel notifying userspace that its RNG should reseed. 1. SIGRND - a new signal. Lol. 2. Userspace opens a file descriptor that it can epoll on. Pros are that many notification mechanisms already use this. Cons is that this requires syscall and might be more racy than we want. Another con is that this a new thing for userspace programs to do. 3. We stick an atomic counter in the vDSO, Jann's suggestion. Pros are that this is extremely fast, and also simple to use and implement. There are enough sequence points in typical crypto programs that checking to see whether this counter has changed before doing whatever operation seems easy enough. Cons are that typically we've been conservative about adding things to the vDSO, and this is also a new thing for userspace programs to do. For each 1, 2, and 3 options, userspace programs _have to do smth new_ anyway, so I wouldn't weigh that as a con. An atomic counter in the vDSO looks like the most bang for the buck to me. I'm really curious to hear more opinions on why we shouldn't do it. 4. We already have a mechanism for this kind of thing, because the same issue comes up when fork()ing. The solution was MADV_WIPEONFORK, where userspace marks a page to be zeroed when forking, for the purposes of the RNG being notified when its world gets split in two. This is basically the same thing as we're discussing here with guest snapshots, except it's on the system level rather than the process level, and a system has many processes. But the problem space is still almost the same, and we could simply reuse that same mechanism. There are a few implementation strategies for that: I don't think we can piggy back on MADV_WIPEONFORK. That madvise flag has a clear contract of only wiping _on fork_. Overloading it with wiping on VM-fork - while process doesn't fork - might break some of its users. 4a. We mess with the PTEs of all processes' pages that are MADV_WIPEONFORK, like fork does now, when the hypervisor notifies us to do so. Then we wind up reusing the already existing logic for userspace RNGs. Cons might be that this usually requires semaphores, and we're in irq context, so we'd have to hoist to a workqueue, which means either more wake up latency, or a larger race window. 4b. We just memzero all processes' pages that are MADV_WIPEONFORK, when the hypervisor notifies us to do so. Then we wind up reusing the already existing logic for userspace RNGs. 4c. The guest kernel maintains an array of physical addresses that are MADV_WIPEONFORK. The hypervisor knows about this array and its location through whatever protocol, and before resuming a moved/snapshotted/duplicated VM, it takes the responsibility for memzeroing this memory. The huge pro here would be that this eliminates all races, and reduces complexity quite a bit, because the hypervisor can perfectly synchronize its bringup (and SMP bringup) with this, and it can even optimize things like on-disk memory snapshots to simply not write out those pages to disk. I've previously proposed a path similar (in concept at least) with a combination of 4 a,b and c - https://lwn.net/ml/linux-mm/b7793b7a-3660-4769-9b9a-ffcf25072...@amazon.com/ without reusing MADV_WIPEONFORK, but by adding a dedicated MADV_WIPEONSUSPEND. That proposal was clunky however with many people raising concerns around how the interface is too subtle and hard to work with. A vmgenid driver offering a clean FS interface seemed cleaner, although, like some of you observed, it still allows a window of time between actual VM fork and userspace handling of the event. One other direction that I would like to explore and I feel it’s similar to your 4c proposal is to do smth like: "mm: extend memfd with ability to create 'secret' memory" https://patchwork.kernel.org/project/linux-mm/patch/20200130162340.GA14232@rapoport-lnx/ Maybe we can combine ideas from the two patches in smth like: instead of libs using anon
Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver
Sorry, I forgot to add a few people interested in this and the KVM ML to CC. Added them. On 16/10/2020, 17:33, "Catangiu, Adrian Costin" wrote: - Background The VM Generation ID is a feature defined by Microsoft (paper: http://go.microsoft.com/fwlink/?LinkId=260709) and supported by multiple hypervisor vendors. The feature is required in virtualized environments by apps that work with local copies/caches of world-unique data such as random values, uuids, monotonically increasing counters, etc. Such apps can be negatively affected by VM snapshotting when the VM is either cloned or returned to an earlier point in time. The VM Generation ID is a simple concept meant to alleviate the issue by providing a unique ID that changes each time the VM is restored from a snapshot. The hw provided UUID value can be used to differentiate between VMs or different generations of the same VM. - Problem The VM Generation ID is exposed through an ACPI device by multiple hypervisor vendors but neither the vendors or upstream Linux have no default driver for it leaving users to fend for themselves. Furthermore, simply finding out about a VM generation change is only the starting point of a process to renew internal states of possibly multiple applications across the system. This process could benefit from a driver that provides an interface through which orchestration can be easily done. - Solution This patch is a driver which exposes the Virtual Machine Generation ID via a char-dev FS interface that provides ID update sync and async notification, retrieval and confirmation mechanisms: When the device is 'open()'ed a copy of the current vm UUID is associated with the file handle. 'read()' operations block until the associated UUID is no longer up to date - until HW vm gen id changes - at which point the new UUID is provided/returned. Nonblocking 'read()' uses EWOULDBLOCK to signal that there is no _new_ UUID available. 'poll()' is implemented to allow polling for UUID updates. Such updates result in 'EPOLLIN' events. Subsequent read()s following a UUID update no longer block, but return the updated UUID. The application needs to acknowledge the UUID update by confirming it through a 'write()'. Only on writing back to the driver the right/latest UUID, will the driver mark this "watcher" as up to date and remove EPOLLIN status. 'mmap()' support allows mapping a single read-only shared page which will always contain the latest UUID value at offset 0. The driver also adds support for tracking count of open file handles that haven't acknowledged an UUID update. This is exposed through two IOCTLs: * VMGENID_GET_OUTDATED_WATCHERS: immediately returns the number of _outdated_ watchers - number of file handles that were open during a VM generation change, and which have not yet confirmed the new Vm-Gen-Id. * VMGENID_WAIT_WATCHERS: blocks until there are no more _outdated_ watchers, or if a 'timeout' argument is provided, until the timeout expires. This patch builds on top of Or Idgar 's proposal https://lkml.org/lkml/2018/3/1/498 - Future improvements Ideally we would want the driver to register itself based on devices' _CID and not _HID, but unfortunately I couldn't find a way to do that. The problem is that ACPI device matching is done by '__acpi_match_device()' which exclusively looks at 'acpi_hardware_id *hwid'. There is a path for platform devices to match on _CID when _HID is 'PRP0001' - which is not the case for the Qemu vmgenid device. Guidance and help here would be greatly appreciated. Signed-off-by: Adrian Catangiu --- Documentation/virt/vmgenid.rst | 211 + drivers/virt/Kconfig | 13 ++ drivers/virt/Makefile | 1 + drivers/virt/vmgenid.c | 419 + include/uapi/linux/vmgenid.h | 22 +++ 5 files changed, 666 insertions(+) create mode 100644 Documentation/virt/vmgenid.rst create mode 100644 drivers/virt/vmgenid.c create mode 100644 include/uapi/linux/vmgenid.h diff --git a/Documentation/virt/vmgenid.rst b/Documentation/virt/vmgenid.rst new file mode 100644 index 000..5224415 --- /dev/null +++ b/Documentation/virt/vmgenid.rst @@ -0,0 +1,211 @@ +.. SPDX-License-Identifier: GPL-2.0 + + +VMGENID + + +The VM Generation ID is a feature defined by Microsoft (paper: +http://go.microsoft.com/fwlink/?LinkId=260709) and supported by +multiple hypervisor vendors. + +The feature is required in virtualized environmen
[PATCH] drivers/virt: vmgenid: add vm generation id driver
- Background The VM Generation ID is a feature defined by Microsoft (paper: http://go.microsoft.com/fwlink/?LinkId=260709) and supported by multiple hypervisor vendors. The feature is required in virtualized environments by apps that work with local copies/caches of world-unique data such as random values, uuids, monotonically increasing counters, etc. Such apps can be negatively affected by VM snapshotting when the VM is either cloned or returned to an earlier point in time. The VM Generation ID is a simple concept meant to alleviate the issue by providing a unique ID that changes each time the VM is restored from a snapshot. The hw provided UUID value can be used to differentiate between VMs or different generations of the same VM. - Problem The VM Generation ID is exposed through an ACPI device by multiple hypervisor vendors but neither the vendors or upstream Linux have no default driver for it leaving users to fend for themselves. Furthermore, simply finding out about a VM generation change is only the starting point of a process to renew internal states of possibly multiple applications across the system. This process could benefit from a driver that provides an interface through which orchestration can be easily done. - Solution This patch is a driver which exposes the Virtual Machine Generation ID via a char-dev FS interface that provides ID update sync and async notification, retrieval and confirmation mechanisms: When the device is 'open()'ed a copy of the current vm UUID is associated with the file handle. 'read()' operations block until the associated UUID is no longer up to date - until HW vm gen id changes - at which point the new UUID is provided/returned. Nonblocking 'read()' uses EWOULDBLOCK to signal that there is no _new_ UUID available. 'poll()' is implemented to allow polling for UUID updates. Such updates result in 'EPOLLIN' events. Subsequent read()s following a UUID update no longer block, but return the updated UUID. The application needs to acknowledge the UUID update by confirming it through a 'write()'. Only on writing back to the driver the right/latest UUID, will the driver mark this "watcher" as up to date and remove EPOLLIN status. 'mmap()' support allows mapping a single read-only shared page which will always contain the latest UUID value at offset 0. The driver also adds support for tracking count of open file handles that haven't acknowledged an UUID update. This is exposed through two IOCTLs: * VMGENID_GET_OUTDATED_WATCHERS: immediately returns the number of _outdated_ watchers - number of file handles that were open during a VM generation change, and which have not yet confirmed the new Vm-Gen-Id. * VMGENID_WAIT_WATCHERS: blocks until there are no more _outdated_ watchers, or if a 'timeout' argument is provided, until the timeout expires. This patch builds on top of Or Idgar 's proposal https://lkml.org/lkml/2018/3/1/498 - Future improvements Ideally we would want the driver to register itself based on devices' _CID and not _HID, but unfortunately I couldn't find a way to do that. The problem is that ACPI device matching is done by '__acpi_match_device()' which exclusively looks at 'acpi_hardware_id *hwid'. There is a path for platform devices to match on _CID when _HID is 'PRP0001' - which is not the case for the Qemu vmgenid device. Guidance and help here would be greatly appreciated. Signed-off-by: Adrian Catangiu --- Documentation/virt/vmgenid.rst | 211 + drivers/virt/Kconfig | 13 ++ drivers/virt/Makefile | 1 + drivers/virt/vmgenid.c | 419 + include/uapi/linux/vmgenid.h | 22 +++ 5 files changed, 666 insertions(+) create mode 100644 Documentation/virt/vmgenid.rst create mode 100644 drivers/virt/vmgenid.c create mode 100644 include/uapi/linux/vmgenid.h diff --git a/Documentation/virt/vmgenid.rst b/Documentation/virt/vmgenid.rst new file mode 100644 index 000..5224415 --- /dev/null +++ b/Documentation/virt/vmgenid.rst @@ -0,0 +1,211 @@ +.. SPDX-License-Identifier: GPL-2.0 + + +VMGENID + + +The VM Generation ID is a feature defined by Microsoft (paper: +http://go.microsoft.com/fwlink/?LinkId=260709) and supported by +multiple hypervisor vendors. + +The feature is required in virtualized environments by apps that work +with local copies/caches of world-unique data such as random values, +uuids, monotonically increasing counters, etc. +Such apps can be negatively affected by VM snapshotting when the VM +is either cloned or returned to an earlier point in time. + +The VM Generation ID is a simple concept meant to alleviate the issue +by providing a unique ID that changes each time the VM is restored +from a snapshot. The hw provided UUID value can be used to +differentiate between VMs or different generations of the same VM. + +The VM Generation ID is exposed through an ACPI device by multiple +hypervisor