On 1/25/2018 5:50 AM, Peter Zijlstra wrote:
On Thu, Jan 25, 2018 at 05:21:30AM -0800, Arjan van de Ven wrote:

This means that 'A -> idle -> A' should never pass through switch_mm to
begin with.

Please clarify how you think it does.


the idle code does leave_mm() to avoid having to IPI CPUs in deep sleep states
for a tlb flush.

The intel_idle code does, not the idle code. This is squirreled away in
some driver :/

afaik (but haven't looked in a while) acpi drivers did too

(trust me, you really want that; sequentially IPI'ing a pile of cores in a
deep sleep state just to flush a tlb that's empty performs horrifically)

Hurmph. I'd rather fix that some other way than leave_mm(), this is
piling special on special.

the problem was tricky, but of course if something better is possible let's
figure this out

problem is that an IPI to an idle cpu is both power inefficient and slow;
exit from a deep C state can take, say, 50 to 100 usec (it varies with
many things, but for thinking about the problem abstractly one should
generally round up to nice round numbers)

if you have say 64 cores that had the mm at some point, but 63 are in idle,
the 64th really does not want to IPI each of those 63 serially (technically
this does not need to be serial, but IPI code is tricky and some things end
up serializing this a bit) and take the 100 usec hit 63 times. Actually,
even if it's not serialized, even ONE hit of 100 usec is unpleasant.
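
to put rough numbers on that, here is a back-of-the-envelope sketch using
the rounded-up 100 usec figure from above. The function names are made up
for illustration; this is not kernel code, and real costs depend on the
platform and on how much the IPI path ends up serializing.

```c
#include <assert.h>

/* Illustrative figure from the text above, not a measurement. */
#define WAKE_LATENCY_USEC 100UL	/* rounded-up exit latency of a deep C state */

/* Worst case: the flushing CPU waits out each idle CPU's wakeup in turn. */
unsigned long serial_flush_cost_usec(unsigned int idle_cpus)
{
	return (unsigned long)idle_cpus * WAKE_LATENCY_USEC;
}

/*
 * Best case: all wakeups overlap perfectly, so a single wakeup latency
 * is paid as soon as at least one target CPU is idle.
 */
unsigned long parallel_flush_cost_usec(unsigned int idle_cpus)
{
	return idle_cpus ? WAKE_LATENCY_USEC : 0UL;
}
```

with 63 idle cores the serial case comes to 63 * 100 = 6300 usec, so about
6.3 msec for a single tlb flush, and the perfectly parallel case still
costs 100 usec.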

so a CPU that goes idle wants to "unsubscribe" itself from those IPIs as a
general objective.

but not getting flush IPIs is only safe if the TLBs in the CPU have nothing
that such an IPI would want to flush, so the TLB needs to be empty of those
things.

the only way to do THAT is to switch to an mm that is safe; a leave_mm()
does this, but I'm sure other options exist.
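
a toy model of that "unsubscribe" idea, to make the ordering concrete. This
is only a sketch under assumptions: struct toy_mm, the toy_* functions and
the 64-bit mask are hypothetical stand-ins, not the kernel's mm_struct,
cpumask API or the real leave_mm() implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NCPUS 64

/* Hypothetical stand-in for an mm: one bit per CPU subscribed to its flushes. */
struct toy_mm {
	uint64_t cpumask;
};

/* Per-CPU flag: may this CPU's TLB still hold entries for the mm? */
static bool tlb_has_entries[NCPUS];

/* CPU starts running the mm: it may cache translations, so it subscribes. */
void toy_enter_mm(struct toy_mm *mm, unsigned int cpu)
{
	assert(cpu < NCPUS);
	mm->cpumask |= UINT64_C(1) << cpu;
	tlb_has_entries[cpu] = true;
}

/*
 * CPU goes idle: empty the TLB first (modelled as clearing the flag, where
 * real code would switch to a safe mm), and only then unsubscribe.
 */
void toy_leave_mm(struct toy_mm *mm, unsigned int cpu)
{
	assert(cpu < NCPUS);
	tlb_has_entries[cpu] = false;
	mm->cpumask &= ~(UINT64_C(1) << cpu);
}

/* Skipping a CPU is only safe if it is subscribed or its TLB is empty. */
bool toy_skip_is_safe(const struct toy_mm *mm, unsigned int cpu)
{
	assert(cpu < NCPUS);
	return ((mm->cpumask >> cpu) & 1) || !tlb_has_entries[cpu];
}

/* Number of IPIs a flush issued from CPU `self` would have to send. */
unsigned int toy_flush_ipi_count(const struct toy_mm *mm, unsigned int self)
{
	uint64_t targets;
	unsigned int n = 0;

	assert(self < NCPUS);
	targets = mm->cpumask & ~(UINT64_C(1) << self);
	for (; targets; targets &= targets - 1)	/* clear lowest set bit */
		n++;
	return n;
}
```

in this model, once the 63 idle CPUs have each gone through toy_leave_mm(),
a flush from the 64th CPU sends no IPIs at all, and toy_skip_is_safe() still
holds for every CPU because each TLB was emptied before its bit was cleared.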

note: While a CPU that is in a deeper C state will itself flush the TLB, you
don't know if you will actually enter that deep at the time of making OS
decisions (if an interrupt comes in the cycle before mwait, mwait becomes a
nop for example). In addition, once you wake up, you don't want the CPU to
go start filling the TLBs with invalid data so you can't really just set a
bit and flush after leaving idle.
