On 8/9/22 20:06, David Hildenbrand wrote:
> On 09.08.22 12:56, Joao Martins wrote:
>> On 7/21/22 13:07, David Hildenbrand wrote:
>>> This is a follow-up on "util: NUMA aware memory preallocation" [1] by
>>> Michal.
>>>
>>> Setting the CPU affinity of threads from inside QEMU usually isn't
>>> easily possible, because we don't want QEMU -- once started and
>>> running guest code -- to be able to mess up the system. QEMU
>>> disallows relevant syscalls using seccomp, such that any such
>>> invocation will fail.
>>>
>>> Especially for memory preallocation in memory backends, the CPU
>>> affinity can significantly increase guest startup time, for example,
>>> when running large VMs backed by huge/gigantic pages, because of
>>> NUMA effects. For NUMA-aware preallocation, we have to set the CPU
>>> affinity, however:
>>>
>>> (1) Once preallocation threads are created during preallocation,
>>>     management tools cannot intercept anymore to change the
>>>     affinity. These threads are created automatically on demand.
>>> (2) QEMU cannot easily set the CPU affinity itself.
>>> (3) The CPU affinity derived from the NUMA bindings of the memory
>>>     backend might not necessarily be exactly the CPUs we actually
>>>     want to use (e.g., CPU-less NUMA nodes, CPUs that are
>>>     pinned/used for other VMs).
>>>
>>> There is an easy "workaround". If we have a thread with the right
>>> CPU affinity, we can simply create new threads on demand via that
>>> prepared context. So, all we have to do is setup and create such a
>>> context ahead of time, to then configure preallocation to create new
>>> threads via that environment.
>>>
>>> So, let's introduce a user-creatable "thread-context" object that
>>> essentially consists of a context thread used to create new threads.
>>> QEMU can either try setting the CPU affinity itself ("cpu-affinity",
>>> "node-affinity" property), or upper layers can extract the thread id
>>> ("thread-id" property) to configure it externally.
>>>
>>> Make memory-backends consume a thread-context object (via the
>>> "prealloc-context" property) and use it when preallocating to create
>>> new threads with the desired CPU affinity. Further, to make it
>>> easier to use, allow creation of "thread-context" objects, including
>>> setting the CPU affinity directly from QEMU, *before* enabling the
>>> sandbox option.
>>>
>>>
>>> Quick test on a system with 2 NUMA nodes:
>>>
>>> Without CPU affinity:
>>>
>>> time qemu-system-x86_64 \
>>>   -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind \
>>>   -nographic -monitor stdio
>>>
>>> real    0m5.383s
>>> real    0m3.499s
>>> real    0m5.129s
>>> real    0m4.232s
>>> real    0m5.220s
>>> real    0m4.288s
>>> real    0m3.582s
>>> real    0m4.305s
>>> real    0m5.421s
>>> real    0m4.502s
>>>
>>> -> It heavily depends on the scheduler CPU selection
>>>
>>> With CPU affinity:
>>>
>>> time qemu-system-x86_64 \
>>>   -object thread-context,id=tc1,node-affinity=0 \
>>>   -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind,prealloc-context=tc1 \
>>>   -sandbox enable=on,resourcecontrol=deny \
>>>   -nographic -monitor stdio
>>>
>>> real    0m1.959s
>>> real    0m1.942s
>>> real    0m1.943s
>>> real    0m1.941s
>>> real    0m1.948s
>>> real    0m1.964s
>>> real    0m1.949s
>>> real    0m1.948s
>>> real    0m1.941s
>>> real    0m1.937s
>>>
>>> On reasonably large VMs, the speedup can be quite significant.
>>>
>> Really awesome work!
>
> Thanks!
>
>>
>> I am not sure I picked this up well while reading the series, but it
>> seems to me that prealloc is still serialized per memory-backend when
>> solely configured by command line, right?
>
> I think it's serialized in any case, even when preallocation is
> triggered manually using prealloc=on. I might be wrong, but any kind
> of object creation or property changes should be serialized by the
> BQL.
>
> In theory, we can "easily" preallocate in our helper --
> qemu_prealloc_mem() -- concurrently when we don't have to bother about
> handling SIGBUS -- that is, when the kernel supports
> MADV_POPULATE_WRITE. Without MADV_POPULATE_WRITE, on older kernels,
> we'll serialize in there as well.
>
>>
>> Meaning when we start prealloc we wait until the memory-backend
>> thread-context action is completed (per memory-backend), even if
>> other to-be-configured memory-backends will use a thread-context on a
>> separate set of pinned CPUs on another node ... and wouldn't in
>> theory "need" to wait until the former prealloc finishes?
>
> Yes. This series only takes care of NUMA-aware preallocation, but
> doesn't preallocate multiple memory backends in parallel.
>
> In theory, it would be quite easy to preallocate concurrently: simply
> create the memory backend objects passed on the QEMU cmdline
> concurrently from multiple threads.
>
> In practice, we have to be careful with the BQL, I think. But it
> doesn't sound horribly complicated to achieve: we can perform
> everything synchronized under the BQL and only trigger the actual
> expensive preallocation (-> qemu_prealloc_mem()), which we know is
> MT-safe, with the BQL released.
>
>>
>> Unless, as you alluded to in one of the last patches: we can pass
>> these thread-contexts with prealloc=off (and prealloc-context=NNN)
>> while QEMU is paused (-S) and have different QMP clients set
>> prealloc=on, and thus prealloc would happen concurrently per node?
>
> I think we will serialize in any case when modifying properties. Can
> you give it a shot and see if it would work as of now? I doubt it, but
> I might be wrong.
Disclaimer: I don't know QEMU internals that much, so I might be wrong,
but even if libvirt went with -preconfig, wouldn't the monitor be stuck
for every 'set prealloc=on' call? I mean, in the end, the same set of
functions is called (e.g. touch_all_pages()) as if the configuration
was provided via the cmd line. So I don't see why there should be any
difference between the cmd line and -preconfig.

<offtopic>
In the near future, as the number of cmd line arguments that libvirt
generates grows, libvirt might need to switch to -preconfig. Or, if it
needs to query some values first and generate configuration based on
that. But for now, there are no plans.
</offtopic>

Michal