On Sat, Jun 13, 2020 at 10:36:07PM +0100, Salil Mehta wrote:
> This patch-set introduces the virtual cpu hotplug support for ARMv8
> architecture in QEMU. Idea is to be able to hotplug and hot-unplug the vcpus
> while guest VM is running and no reboot is required. This does *not* make any
> assumption of the physical cpu hotplug availability within the host system but
> rather tries to solve the problem at virtualizer/QEMU layer and by introducing
> cpu hotplug hooks and event handling within the guest kernel. No changes are
> required within the host kernel/KVM.
>
> Motivation:
> This allows scaling the guest VM compute capacity on-demand which would be
> useful for the following example scenarios,
> 1. Vertical Pod Autoscaling[3][4] in the cloud: Part of the orchestration
>    framework which could adjust resource requests (CPU and Mem requests) for
>    the containers in a pod, based on usage.
> 2. Pay-as-you-grow Business Model: Infrastructure provider could allocate and
>    restrict the total number of compute resources available to the guest VM
>    according to the SLA(Service Level Agreement). VM owner could request for
>    more compute to be hot-plugged for some cost.
>
> Terminology:
>
> (*) Present cpus: Total cpus with which guest has/will boot and are available
>     to guest for use and can be onlined. Qemu parameter(-smp)
> (*) Disabled cpus: Possible cpus which will not be available for the guest to
>     use. These can be hotplugged and made present. These can be
>     thought of as un-plugged vcpus. These will be included as
>     part of sizing.
> (*) Possible cpus: Total vcpus which could ever exist in VM. This includes
>     booted cpus plus any cpus which could be later plugged.
>     - Qemu parameter(-maxcpus)
>     - Possible vcpus = Present vcpus (+) Disabled vcpus
>
>
> Limitations of ARMv8 Architecture:
>
> A. Physical Limitation to CPU Hotplug:
> 1. ARMv8 architecture does not support the concept of the physical cpu
>    hotplug.
>    The closest thing which is recommended to achieve the cpu hotplug on ARM is
>    to bring down power state of the cpu using PSCI.
> 2. Other ARM components like GIC etc. have not been designed to realize
>    physical cpu hotplug capability as of now.
>
> B. Limitations of GIC to Support Virtual CPU Hotplug:
> 1. GIC requires various resources(related to GICR/redistributor, GICC/cpu
>    interface etc) like memory regions to be fixed at the VM init time and
>    these could not be changed later on after VM has inited.
> 2. Associations between GICC(GIC cpu interface) and vcpu get fixed at the VM
>    init time and GIC does not allow changing this association once GIC has
>    initialized.
>
> C. Known Limitation of the KVM:
> 1. As of now KVM allows creating VCPUs but does not allow deleting the
>    already created vcpus. QEMU already provides an interface to manage created
>    vcpus at KVM level and then to re-use them.
> 2. Inconsistency in interpretation of the MPIDR generated by KVM for vcpus
>    vis-a-vis SMT/threads. This does not look to be compliant to the MPIDR
>    format(SMT is present) as mentioned in the ARMv8 spec. (Please correct my
>    understanding if I am wrong here?)
>
>
> Workaround to the problems mentioned in Section B & C1:
> 1. We pre-size the GIC with possible vcpus at VM init time
> 2. Pre-create all possible vcpus at KVM and associate them with GICC
> 3. Park the unplugged vcpus (similar to x86)
>
>
> (*) For all of above please refer to Marc's suggestion here[1]
>
>
> Overview of the Approach:
> At the time of machvirt_init() we pre-create all of the possible ARMCPU
> objects along with the corresponding KVM vcpus at the host. Disabled KVM vcpus
> (which are *not* "present" vcpus but are part of "possible" vcpu list) are
> parked at per VM list "kvm_parked_vcpus" after their initialization.
>
> We create the ARMCPU objects(but these are not *realized* in QOM sense) even
> for the disabled vcpus to facilitate the GIC initialization (pre-sized with
> possible vcpus). After Initialization of the machine is complete we release
> the ARMCPU Objects for the disabled vcpus. These ARMCPU objects shall be
> re-created at the time when vcpu is hot plugged. This new object is then
> re-attached with the earlier parked KVM vcpu which also gets unparked. The
> ARMCPU object gets now "realized" in QEMU, which means creation of the
> corresponding threads, pre_plug/plug phases, and event notification to the
> guest using ACPI GED etc. Similarly, hot-unplug leg will lead to the
> "unrealization" of the vcpus and will lead to similar ACPI GED events to the
> guest for unplug and cleanup and eventually ARMCPU object shall be released
> and KVM vcpus shall be parked again.
>
> During machine init, ACPI MADT Table is sized with *possible* vcpus GICC
> entries. The unplugged/disabled vcpus are presented as MADT GICC DISABLED
> entries to the guest. This means the guest will have its resources pre-sized
> with possible vcpus(=present+disabled)
>
> Other approaches to deal with ARMCPU object release(after machine init):
> 1. The ARMCPU objects for the disabled vcpus are released in context of the
>    virt_machine_done() notifier(approach adopted in this patch-set).
> 2. Defer the release of current ARMCPU object till the new vcpu object is
>    hot plugged.
> 3. Never release and keep on reusing them and release once at VM exit. This
>    solves many problems with above 2 approaches but requires change in the way
>    qdev_device_add() fetches/creates the ARMCPU object for the new vcpus being
>    hotplugged. For the arm cpu hotplug case we need to figure out a way to
>    get access to old object and use it to "re-realize" instead of the new
>    ARMCPU object.
>
> Concerns/Questions:
> 1. In ARM arch a cpu is uniquely represented in hierarchy using various
>    affinity levels which could represent thread, core, cluster, package. This
>    is generally represented by a value in MPIDR register as per the format
>    mentioned in specification. Now, the way MPIDR value is derived for vcpus
>    is done using vcpu-index. The concept of thread is not quite the same and
>    rather gets lost in the derivation of MPIDR for vcpus.
> 2. The topology info used to specify the vcpu while hot-plugging might not
>    match with the MPIDR value given back by the KVM for the vcpu at the time
>    of init. Concept of SMT bit in MPIDR gets lost as per the derivation being
>    done in the KVM. Hence, concept of thread-id, core-id, socket-id if used as
>    a topology info to derive MPIDR value as per ARM specification will not
>    match with MPIDR actually assigned by the KVM?
>    Perhaps need to carry forward work of Andrew? please check here[2]
> 3. Further if this info is supplied to the guest using PPTT(once introduced in
>    QEMU) or even derived using MPIDR shall be inconsistent with the host
>    vcpu.
> 4. Any possibilities of interrupts(SGI/PPI/LPI/SPI) always remaining in
>    *pending* state for the cpus which have been hot-unplugged? IMHO it looks
>    okay but will need Marc's confirmation on this.
> 5. If the ARMCPU object is released after the machine init, UEFI could call
>    back virt_update_table() to re-build the ACPI tables which might need an
>    ARMCPU object. Please check the discussion here[5]
>
>
> Commands Used:
>
> A. Qemu launch commands to init the machine
>
> $ qemu-system-aarch64 --enable-kvm -machine virt,gic-version=3 \
>   -cpu host -smp cpus=4,maxcpus=6 \
>   -m 300M \
>   -kernel Image \
>   -initrd rootfs.cpio.gz \
>   -append "console=ttyAMA0 root=/dev/ram rdinit=/init maxcpus=2 acpi=force" \
>   -nographic \
>   -bios QEMU_EFI.fd
>
> B. Hot-(un)plug related commands
>
> # Hotplug a host vcpu(accel=kvm)
> $ device_add host-arm-cpu,id=core4,core-id=4
>
> # Hotplug a vcpu(accel=tcg)
> $ device_add cortex-a57-arm-cpu,id=core4,core-id=4
>
> # Delete the vcpu
> $ device_del core4
>
> NOTE: I have not tested the current solution with '-device' interface. The use
>       is suggested by Igor here[6]. I will test this in coming times but looks
>       it should work with existing changes.
>
>
> Sample output on guest after boot:
>
> $ cat /sys/devices/system/cpu/possible
> 0-5
> $ cat /sys/devices/system/cpu/present
> 0-3
> $ cat /sys/devices/system/cpu/online
> 0-1
> $ cat /sys/devices/system/cpu/offline
> 2-5
>
>
> Sample output on guest after hotplug of vcpu=4:
>
> $ cat /sys/devices/system/cpu/possible
> 0-5
> $ cat /sys/devices/system/cpu/present
> 0-4
> $ cat /sys/devices/system/cpu/online
> 0-1,4
> $ cat /sys/devices/system/cpu/offline
> 2-3,5
>
> Note: vcpu=4 was explicitly 'onlined' after hot-plug
> $ echo 1 > /sys/devices/system/cpu/cpu4/online
>
>
> Repository:
> (*) QEMU changes for vcpu hotplug could be cloned from below site,
>     https://github.com/salil-mehta/qemu.git virt-cpuhp-armv8/rfc-v1
>
> (*) Guest Kernel changes required to co-work with the QEMU shall be posted
>     soon and repo made available at above site.
>
>
> THINGS TO DO:
> (*) Migration support
> (*) TCG/Emulation support is not proper right now. Works to a certain extent
>     but is not complete, especially the unrealize part in which there is an
>     overflow of tcg contexts. The last is due to the fact tcg maintains a
>     count on number of context(per thread instance) so as we hotplug the
>     vcpus this counter keeps on incrementing. But during hot-unplug the
>     counter is not decremented.
> (*) Support of hotplug with NUMA is not proper
> (*) CPU Topology right now is not specified using thread/core/socket but
>     rather flatly indexed using core-id. This needs consideration[2].
> (*) Do we need PPTT Support to specify right topology info to guest about
>     hot-plugged or unplugged vcpus?
> (*) Test cases
> (*) Docs need to be updated.
>
Hi Salil,

I realize this is just a preliminary posting and the approach hasn't been
finalized, but maybe in a future posting we can put a lot of this
information into a doc patch. I think we'll need good documentation for
this feature to ensure we get it right and keep it maintained correctly.

Thanks,
drew
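
As one example of what such a doc patch could spell out, the
present/possible/disabled terminology can be checked against the sysfs
output quoted above. The sketch below is only an illustration under a few
assumptions: parse_cpu_list is a hypothetical helper (not part of QEMU or
the kernel), and the list strings are copied from the quoted sample output
after hotplug of vcpu=4 rather than read from a live guest.

```python
def parse_cpu_list(text):
    """Expand a sysfs-style CPU list such as "0-1,4" into a set of ints.

    Hypothetical helper for illustration only.
    """
    cpus = set()
    for chunk in text.strip().split(","):
        if not chunk:
            continue  # tolerate an empty list
        first, _, last = chunk.partition("-")
        # A bare number like "4" has no "-", so last is empty: use first.
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


# Values copied from the quoted sample output after hotplug of vcpu=4.
possible = parse_cpu_list("0-5")
present = parse_cpu_list("0-4")
online = parse_cpu_list("0-1,4")
offline = parse_cpu_list("2-3,5")

# Possible vcpus = Present vcpus (+) Disabled vcpus, so the disabled
# ones are whatever is possible but not present.
disabled = possible - present

assert present <= possible            # every present cpu is possible
assert online <= present              # only present cpus can be online
assert online | offline == possible   # online and offline cover all cpus
assert online.isdisjoint(offline)
print("disabled:", sorted(disabled))
```

With those values the only disabled vcpu left is 5, which matches
-smp cpus=4,maxcpus=6 with one vcpu hotplugged.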