For virtualized hosts it is quite common to want to confine all host OS processes to a subset of CPUs/RAM nodes, leaving the rest available for exclusive use by QEMU/KVM. Historically people have used the "isolcpus" kernel argument to do this, but last year its semantics changed so that any CPUs listed there are also excluded from load balancing by the scheduler, making it quite useless for general non-real-time use cases where you still want QEMU threads load-balanced across CPUs.
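For context, the isolcpus approach being ruled out would look something like this (a sketch only; the CPU range matches the 12-CPU host described below, and the grub file path/variable are just the usual Fedora/RHEL conventions):

```shell
# /etc/default/grub (illustrative): reserve CPUs 3-11 via isolcpus.
# With current kernels these CPUs are also excluded from scheduler
# load balancing, which is exactly why this is unsuitable when QEMU
# threads should still be balanced across the reserved CPUs.
GRUB_CMDLINE_LINUX="rhgb quiet isolcpus=3-11"

# Then regenerate the boot config, e.g.:
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```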
So the only option is to use the cpuset cgroup controller to confine processes. AFAIK, systemd does not have explicit support for the cpuset controller at this time, so I'm trying to work out the "optimal" way to achieve this behind systemd's back while minimising the risk that future systemd releases will break things.

As an example, I have a host with 3 NUMA nodes and 12 CPUs, and want all non-QEMU processes running on CPUs 0-2, leaving 3-11 available for QEMU machines. So far my best solution looks like this:

$ cat /etc/systemd/system/cpuset.service
[Unit]
Description=Restrict CPU placement
DefaultDependencies=no
Before=sysinit.target slices.target basic.target lvm2-lvmetad.service systemd-journald.service systemd-udevd.service

[Service]
Type=oneshot
KillMode=none
RemainAfterExit=yes
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuset/system.slice
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuset/machine.slice
ExecStartPre=/bin/bash -c '/usr/bin/echo "0-2" > /sys/fs/cgroup/cpuset/system.slice/cpuset.cpus'
ExecStartPre=/bin/bash -c '/usr/bin/echo "0" > /sys/fs/cgroup/cpuset/system.slice/cpuset.mems'
ExecStartPre=/bin/bash -c '/usr/bin/echo "3-11" > /sys/fs/cgroup/cpuset/machine.slice/cpuset.cpus'
ExecStartPre=/bin/bash -c '/usr/bin/echo "0-2" > /sys/fs/cgroup/cpuset/machine.slice/cpuset.mems'
ExecStartPost=/bin/bash -c '/usr/bin/echo 1 > /sys/fs/cgroup/cpuset/system.slice/tasks'
ExecStopPost=/usr/bin/rmdir /sys/fs/cgroup/cpuset/system.slice
ExecStart=/bin/true

[Install]
WantedBy=multi-user.target

The key factor here is the use of "Before=" to ensure this gets run immediately after systemd switches root out of the initrd, and before /any/ long-lived services are run. This lets us set cpuset placement on systemd (PID 1) itself and have it inherited by everything it spawns. I felt this is better than trying to move processes after they have already started, because it ensures that any memory allocations get taken from the right NUMA node immediately.
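Once the unit has run, the inherited placement can be sanity-checked from the kernel's point of view (a quick check, assuming the cgroup layout created by the unit above):

```shell
# Show the CPU affinity the kernel reports for PID 1; after the
# ExecStartPost step has moved init into system.slice, this should
# reflect the restricted set (0-2 in the example configuration).
grep Cpus_allowed_list /proc/1/status

# Show which cpuset cgroup PID 1 belongs to (requires the cpuset
# controller to be mounted, as assumed by the unit above).
cat /proc/1/cpuset 2>/dev/null || true
```

Any service started later should show the same affinity in its own /proc/$PID/status, since the placement is inherited on fork.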
Empirically this approach seems to work on Fedora 23 (systemd 222) and RHEL 7 (systemd 219), but I'm wondering if there are any pitfalls that I've not anticipated. Conceptually I'm aiming for "Before=*" to say it should run before everything, but explicitly listing this set of units appears to be the best I can do. Any thoughts / feedback / suggestions on how to improve this are welcome.

Regards,
Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel