Hi all, This RFC is a continuation of a longer kernel patch thread https://lkml.org/lkml/2021/3/8/677 where we originally thought such a mechanism belongs. Ultimately, consensus there was that this mechanism would be better suited in userspace, so systemd was an obvious first choice.
Current proposal: * As GitHub Issue here: https://github.com/systemd/systemd/issues/19269 * An example PoC here: https://github.com/acatangiu/sysgenid-dbus * Described in this email as follows: # SysGenID: a system generation id provider ## Background and problem The System Generation ID feature is required in virtualized or containerized environments by applications that work with local copies or caches of world-unique data such as random values, uuids, monotonically increasing counters, cryptographic nonces, etc. Such applications can be negatively affected by VM or container snapshotting when the VM or container is either cloned or returned to an earlier point in time. Solving the uniqueness problem strongly enough for cryptographic purposes requires a mechanism which can deterministically reseed userspace PRNGs with new entropy at restore time. This mechanism must also support the high-throughput and low-latency use-cases that led programmers to pick a userspace PRNG in the first place; be usable by both application code and libraries; allow transparent retrofitting behind existing popular PRNG interfaces without changing application code; it must be efficient, especially on snapshot restore; and be simple enough for wide adoption. ## Solution Introduce a mechanism that standardizes an API for applications and libraries to be made aware of uniqueness breaking events such as VM or container snapshotting, and allow them to react and adapt to such events. The System Generation ID is meant to help in these scenarios by providing a monotonically increasing u32 counter that changes each time the VM or container is restored from a snapshot. The `sysgenid` service exposes a monotonic incremental System Generation u32 counter via the DBus `com.RFC.sysgenid` accessible at `/com/RFC/sysgenid`. It provides asynchronous SysGen counter update notifications, as well as counter retrieval and confirmation mechanisms. The counter starts from zero when the service is started and monotonically increments every time the system generation changes. Userspace applications or libraries can (a)synchronously consume the system generation counter through the provided DBus interface, to make any necessary internal adjustments following a system generation update. The provided DBus interface operations can be used to build a system level safe workflow that guest software can follow to protect itself from negative system snapshot effects. System generation changes are driven by userspace software through a dedicated DBus method. ### Warning SysGenID alone does not guarantee complete snapshot safety to applications using it. A certain workflow needs to be followed at the system level, in order to make the system snapshot-resilient. Please see the "Snapshot Safety Prerequisites" section below. ## SysGenID DBus interface #### Terminology - `watcher` - a client using the SysGenID service _watching_ for system generation changes. - `untracked watcher` - default state for all clients. For a client to be tracked it has to explicitly opt-in by confirming back to the service the correct _system generation counter_. - `tracked watcher` - a client that is tracked by the service. Such a watcher is considered `up-to-date` only after confirming back to the service the correct _system generation counter_. Once tracked, a client is only _untracked_ when closing its connection to the DBus bus. - `outdated watcher` - a _tracked_ client that whose tracking has lived through a system generation change, but has not (yet) confirmed back to the service the correct _system generation counter_. **Methods:** - `GetSysGenCounter` - returns latest system generation counter. - `AckWatcherCounter` - marks the client/watcher to be tracked for ACKs, is also used by the watcher to confirm/ack the correct _sys gen counter_ to the service after every generation change so the service keeps correct track of it as `outdated` or `up-to-date`. Will error if client/watcher confirms/acks the wrong _sys gen counter_. - `CountOutdatedWatchers` - returns the number of current number of _outdated tracked watchers_. A value of `zero` can be interpreted as the system being fully re-adjusted after a generation change. - `TriggerSysGenUpdate` - triggers a generation update (should be a privileged operation). **Signals:** - `NewSystemGeneration` - system generation change notification, also carries new _sys gen counter_. - `SystemReady` - notification sent out when all tracked watchers have _acked_ the new _sys gen counter_. In other words, when all tracked software has adjusted to the new environment. The service can keep track of watchers by DBus connections (`org.freedesktop.DBus.NameOwnerChanged`). **Exported read-only file used for memory mappings:** The service also exports the current _sys gen counter_ through a simple file. The file contains only 4 bytes of data at offset 0, representing the u32 value of the system generation counter. This file is meant to be mapped by other software in the system and be used as a low-latency generation counter probe mechanism in critical sections. This mmap() interface is targeted at libraries or code that needs to check for generation changes in-line, where an event loop is not available or in cases where DBus calls are too expensive. In such cases, logic can be added in-line with the sensitive code to check the counter and trigger on-demand/just-in-time readjustments when changes are detected on the memory mapped file. Users of this interface that plan to lazily adjust most likely don't need to also use the DBus interface, since tracking or waiting on them doesn't make sense. ### Service interface DBus XML specification ```xml <node name="/com/RFC/sysgenid"> <interface name="com.RFC.sysgenid"> <method name="AckWatcherCounter"> <arg name="watcher_counter" type="u" direction="in"/> <arg name="sysgen_counter" type="u" direction="out"/> </method> <method name="CountOutdatedWatchers"> <arg name="outdated_watchers" type="u" direction="out"/> </method> <method name="GetSysGenCounter"> <arg name="sysgen_counter" type="u" direction="out"/> </method> <method name="TriggerSysGenUpdate"> <arg name="min_gen" type="u" direction="in"/> </method> <signal name="NewSystemGeneration"> <arg name="sysgen_counter" type="u"/> </signal> <signal name="SystemReady"> </signal> </interface> <interface name="org.freedesktop.DBus.Introspectable"> <method name="Introspect"> <arg name="xml_data" type="s" direction="out"/> </method> </interface> </node> ``` ## Snapshot Safety Prerequisites and Example If VM, container or other system-level snapshots happen asynchronously, at arbitrary times during an active workload there is no practical way to ensure that in-flight local copies or caches of world-unique data such as random values, secrets, UUIDs, etc are properly scrubbed and regenerated. The challenge stems from the fact that the categorization of data as snapshot-sensitive is only known to the software working with it, and this software has no logical control over the moment in time when an external system snapshot occurs. Let's take an OpenSSL session token for example. Even if the library code is made 100% snapshot-safe, meaning the library guarantees that the session token is unique (any snapshot that happened during the library call did not duplicate or leak the token), the token is still vulnerable to snapshot events while it transits the various layers of the library caller, then the various layers of the OS before leaving the system. To catch a secret while it's in-flight, we'd have to validate system generation at every layer, every step of the way. Even if that would be deemed the right solution, it would be a long road and a whole universe to patch before we get there. Bottom line is we don't have a way to track all of these in-flight secrets and dynamically scrub them from existence with snapshot events happening arbitrarily. ### Simplifying assumption - safety prerequisite **Control the snapshot flow**, disallow snapshots coming at arbitrary moments in the workload lifetime. Use a system-level overseer entity that quiesces the system before snapshot, and post-snapshot-resume oversees that software components have readjusted to new environment, to the new generation. Only after, will the overseer un-quiesce the system and allow active workloads. Software components can choose whether they want to be tracked and waited on by the overseer by using the marking themselves as tracked watchers. The sysgenid service standardizes the API for system software to find out about needing to readjust and at the same time provides a mechanism for the overseer entity to wait for everyone to be done, the system to have readjusted, so it can un-quiesce. ### Example snapshot-safe workflow 1) Before taking a snapshot, quiesce the VM/container/system. Exactly how this is achieved is very workload-specific, but the general description is to get all software to an expected state where their event loops dry up and they are effectively quiesced. 2) Take snapshot. 3) Resume the VM/container/system from said snapshot. 4) Overseer will trigger generation bump using `TriggerSysGenUpdate` method. 5) Software components which have the DBus `NewGeneration` signal in their event loops are notified of the generation change. They do their specific internal adjustments. Some may have chosen to be tracked and waited on by the overseer, others might choose to do their adjustments out of band and not block the overseer. Tracked ones *must* signal when they are done/ready by confirming the new sys gen counter using the `AckWatcherCounter` DBus method. 6) Overseer will block and wait for all tracked watchers by waiting on the `SystemReady` DBus signal. Once all tracked watchers are done in step 5, the signal is sent by `sysgenid` service and overseer will know that the system has readjusted and is ready for active workload. 7) Overseer un-quiesces system. 8) There is a class of software, usually libraries, most notably PRNGs or SSLs, that don't fit the event-loop model and also have strict latency requirements. These can take advantage of the _exported read-only file used for memory mappings_. They can map the file and check sys gen counter value in-line with the critical section and can do so with low latency. When they are called after un-quiesce, they can just-in-time adjust based on the updated mapped value. For a well-designed service stack, these libraries should not be called while system is quiesced. When workload is resumed by the overseer, on the first call into these libs, they will safely JIT readjust. Users of this lazy on-demand readjustment model should not use the DBus interface or at least not enable watcher tracking since doing so would introduce a logical deadlock: lazy adjustments happen only after un-quiesce, but un-quiesce is blocked until all tracked watchers are up-to-date. Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
_______________________________________________ systemd-devel mailing list systemd-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/systemd-devel