Hi, Sorry for the delay in response. Just landed yesterday from LPC.
Others have already commented on the naming, and I would agree that
"paravirt" is really misleading. I cannot say that the previous
"cpu-avoid" one was perfect, but it was much better.
It was my suggestion to switch names. "cpu-avoid" is definitely a
no-go, because it doesn't explain anything and only confuses. I
suggested 'paravirt' (notice - only suggested) because the patch
series is mainly discussing paravirtualized VMs. But now I'm not even
sure that the idea of the series is:
1. Applicable only to paravirtualized VMs; and
2. That preemption and rescheduling throttling require a new in-kernel
concept beyond nohz, isolcpus, cgroups and similar.
Shrikanth, can you please clarify the scope of the new feature? Would
it be useful for non-paravirtualized VMs, for example? Any other
task-to-CPU binding problems?
The current scope of the feature is virtualized environments, where the
idea is to do cooperative folding in each VM based on a hint (either a
HW hint or steal time).
If you look at it from a macro level, this is a framework which allows
one to avoid some vCPUs (in the guest) to achieve better throughput or
latency. So one could come up with more use cases even in
non-paravirtualized VMs. For example, one crazy idea is to avoid using
SMT siblings when system utilization is low, to achieve a higher IPC
(instructions per cycle) value.
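To illustrate the steal-time case, the per-vCPU decision could look roughly
like the sketch below. This is only an illustration of the idea, not code from
the series; should_avoid_vcpu(), last_steal and avoid_steal_threshold are
made-up names, and the threshold/units are arbitrary.

#include <linux/kernel_stat.h>
#include <linux/percpu.h>

/* Hypothetical tunable and state, for illustration only. */
static u64 avoid_steal_threshold;
static DEFINE_PER_CPU(u64, last_steal);

static bool should_avoid_vcpu(int cpu)
{
	/* Cumulative steal time accounted for this vCPU. */
	u64 steal = kcpustat_cpu(cpu).cpustat[CPUTIME_STEAL];
	u64 delta = steal - per_cpu(last_steal, cpu);

	per_cpu(last_steal, cpu) = steal;

	/* Lots of recent steal time: the host keeps preempting this vCPU. */
	return delta > avoid_steal_threshold;
}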
In previous rounds you tried to implement the same with cgroups, as
far as I understood. Can you discuss that? What exactly can't be done
with the existing kernel APIs?
Thanks,
Yury
We discussed this in Sched-MC this year.
https://youtu.be/zf-MBoUIz1Q?t=8581
Currently explored options:
1. CPU hotplug - slow; some efforts are underway to speed it up.
2. Creating isolated cpusets - faster, but still involves sched domain rebuilds.
The reason why both of them won't work is that they break user affinities in
the guest. i.e. the guest user can do "taskset -c <some_vcpus> <workload>";
when the last vCPU in that list goes offline (guest vCPU hotplug), the task's
affinity mask is reset so the workload can run on any online vCPU, and the
mask is not set back to its earlier value afterwards. That is okay for hotplug
or isolated cpusets, since it is driven by the user in the guest, so the user
is aware of it.
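To spell out the "affinity mask is reset" part: when the last allowed CPU of a
task goes away, the guest scheduler's fallback path widens the task's mask,
roughly like the sketch below. This is a simplification for illustration only;
the real logic lives in select_fallback_rq() in kernel/sched/core.c.

static int fallback_cpu_sketch(struct task_struct *p)
{
	int cpu;

	/* Try any online CPU the task is still allowed on. */
	for_each_cpu(cpu, p->cpus_ptr) {
		if (cpu_online(cpu))
			return cpu;
	}

	/*
	 * Nothing left: widen the affinity mask. The user-set mask is
	 * lost and is not restored when the vCPU comes back online.
	 */
	do_set_cpus_allowed(p, cpu_possible_mask);
	return cpumask_any(cpu_online_mask);
}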
Whereas here, the change is driven by the system rather than the user in the
guest, so it cannot break user-space affinities.
So we need a new interface to drive this. I think it is better if it is a
non-cgroup-based framework, since cgroups are usually user driven
(correct me if I am wrong).
PS:
There was some confusion around this affinity breaking. Note that it is the
guest vCPU being marked and the guest vCPU being hotplugged, and the
task-affined workload was running in the guest. Host CPUs (pCPUs) are not
hotplugged.
---
I had a hallway discussion with Vincent; the idea is to use the push framework
bits, set the CPU capacity to 1 (the lowest value, treated as a special value),
and use a static key check so this is done only when the HW says to do so.
Something like this (keeping the name 'paravirt' for illustration):

static inline bool cpu_paravirt(int cpu)
{
	/*
	 * Capacity 1 is reserved as the "avoid this CPU" marker; only
	 * check it when the framework has been enabled via static key.
	 */
	if (static_branch_unlikely(&cpu_paravirt_framework))
		return arch_scale_cpu_capacity(cpu) == 1;

	return false;
}
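For what it is worth, a consumer of that helper could look roughly like this
(illustrative sketch only; pick_non_paravirt_cpu() is a made-up name and not
part of the series):

static int pick_non_paravirt_cpu(struct task_struct *p)
{
	int cpu, fallback = -1;

	for_each_cpu(cpu, p->cpus_ptr) {
		if (fallback < 0)
			fallback = cpu;
		/* Prefer CPUs that are not marked to be avoided. */
		if (!cpu_paravirt(cpu))
			return cpu;
	}

	/* All allowed CPUs are marked; fall back rather than starve the task. */
	return fallback;
}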
The rest of the bits remain the same. I found an issue with the current series
where setting affinity goes wrong after a CPU is marked paravirt; I will fix it
in the next version. Will do some more testing and send the next version in
2026.
Happy Holidays!