On 1/25/2018 5:50 AM, Peter Zijlstra wrote:
On Thu, Jan 25, 2018 at 05:21:30AM -0800, Arjan van de Ven wrote:
This means that 'A -> idle -> A' should never pass through switch_mm to
begin with.
Please clarify how you think it does.
the idle code does leave_mm() to avoid having to IPI CPUs in deep sleep states
for a tlb flush.
The intel_idle code does, not the idle code. This is squirreled away in
some driver :/
afaik (but haven't looked in a while) acpi drivers did too
(trust me, you really want that; sequentially IPI'ing a pile of cores in a
deep sleep state just to flush a tlb that's empty, the performance of that
is horrific)
Hurmph. I'd rather fix that some other way than leave_mm(), this is
piling special on special.
the problem was tricky. but of course if something better is possible let's
figure this out
the problem is that an IPI to an idle cpu is both power inefficient and
will take time; exit of a deep C state can be in, say, the 50 to 100 usec
range (it varies by many things, but for thinking about the problem
abstractly one should generally round up to nice round numbers)
if you have say 64 cores that had the mm at some point, but 63 are in idle,
the 64th really does not want to IPI each of those 63 serially (technically
this does not need to be serial but IPI code is tricky, some things end up
serializing this a bit) to get the 100 usec hit 63 times. Actually, even if
it's not serialized, even ONE hit of 100 usec is unpleasant.
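
to put rough numbers on the above, here is a small sketch in C. It assumes
the rounded-up 100 usec exit latency and the two extremes mentioned (fully
serialized vs. fully overlapped wakeups); the macro and function names are
made up for illustration and are not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded-up exit latency of a deep C state (assumed round number). */
#define DEEP_CSTATE_EXIT_USEC 100u

/* Worst case: the IPIs to the idle CPUs end up fully serialized. */
static uint64_t flush_wait_serial_usec(unsigned int total_cpus,
				       unsigned int idle_cpus)
{
	assert(idle_cpus < total_cpus);	/* the flushing CPU itself is awake */
	return (uint64_t)idle_cpus * DEEP_CSTATE_EXIT_USEC;
}

/* Best case: all wakeups overlap, so the sender waits for one exit. */
static uint64_t flush_wait_parallel_usec(unsigned int total_cpus,
					 unsigned int idle_cpus)
{
	assert(idle_cpus < total_cpus);
	return idle_cpus ? DEEP_CSTATE_EXIT_USEC : 0;
}
```

for 64 cores with 63 idle, flush_wait_serial_usec(64, 63) gives 6300 usec
(6.3 msec), and flush_wait_parallel_usec(64, 63) still gives 100 usec.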
so a CPU that goes idle wants to "unsubscribe" itself from those IPIs as a
general objective.
but not getting flush IPIs is only safe if the TLBs in the CPU have nothing
that such an IPI would want to flush, so the TLB needs to be empty of those
things.
the only way to do THAT is to switch to an mm that is safe; a leave_mm()
does this, but I'm sure other options exist.
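
the "unsubscribe" idea can be modelled in a few lines of C. This is a toy
sketch only, assuming a per-mm bitmask of CPUs that may hold TLB entries
for that mm; struct toy_mm and the toy_* helpers are hypothetical names and
are not the kernel's data structures or its leave_mm() implementation.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_CPUS 64

/* Toy stand-in for an mm: one bit per CPU that may cache its TLB entries. */
struct toy_mm {
	uint64_t cpumask;
};

/* A CPU starts running this mm, so it must receive flush IPIs for it. */
static void toy_enter_mm(struct toy_mm *mm, unsigned int cpu)
{
	assert(cpu < TOY_MAX_CPUS);
	mm->cpumask |= UINT64_C(1) << cpu;
}

/*
 * Models the effect wanted before idling: the CPU has switched to a safe
 * mm (its TLB holds nothing of this mm), so it drops out of the mask.
 */
static void toy_leave_mm(struct toy_mm *mm, unsigned int cpu)
{
	assert(cpu < TOY_MAX_CPUS);
	mm->cpumask &= ~(UINT64_C(1) << cpu);
}

/* Number of IPIs a TLB flush issued from CPU 'self' would have to send. */
static unsigned int toy_flush_ipis(const struct toy_mm *mm, unsigned int self)
{
	uint64_t targets;
	unsigned int n = 0;

	assert(self < TOY_MAX_CPUS);
	targets = mm->cpumask & ~(UINT64_C(1) << self);
	for (; targets; targets &= targets - 1)	/* clear lowest set bit */
		n++;
	return n;
}
```

with all 64 CPUs in the mask, a flush from CPU 0 targets 63 others; once
CPUs 1..63 have each done toy_leave_mm() before idling, the same flush
sends no IPIs at all.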
note: While a CPU that is in a deeper C state will itself flush the TLB,
you don't know if you will actually enter that deep at the time of making
OS decisions (if an interrupt comes in the cycle before mwait, mwait
becomes a nop for example). In addition, once you wake up, you don't want
the CPU to start filling the TLBs with invalid data, so you can't really
just set a bit and flush after leaving idle.