On Tue, Aug 01, 2017 at 02:14:19PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 01, 2017 at 10:02:45PM +1000, Benjamin Herrenschmidt wrote:
> > On Tue, 2017-08-01 at 11:31 +0100, Will Deacon wrote:
> > > Looks like that's what's currently relied upon:
> > >
> > >   /* Clearing is done after a TLB flush, which also provides a barrier. */
> > >
> > > It also provides barrier semantics on arm/arm64. In reality, I suspect
> > > all archs have to provide some order between set_pte_at and
> > > flush_tlb_range which is sufficient to hold up clearing the flag. :/
> >
> > Hrm... not explicitly.
> >
> > Most archs (powerpc among them) have set_pte_at be just a dumb store,
> > so the only barrier it has is the surrounding PTL.
> >
> > Now flush_tlb_range() I assume has some internal strong barriers, but
> > none of that is well defined or documented at all, so I suspect all
> > bets are off.
>
> Right.. but seeing how we're in fact relying on things here, it might be
> time to go figure this out and document bits.
>
> *sigh*, I suppose it's going to be me doing this.. :-)
So on the related question: does on_each_cpu() provide a full smp_mb()?
I think we can answer: yes. on_each_cpu() sends IPIs to all _other_
CPUs, and those IPIs are queued using llist_add(), which is a cmpxchg()
and therefore implies smp_mb(). After that it runs the local function.
So we can view on_each_cpu() as doing an smp_mb() before running @func.

Going through the architectures:

xtensa - uses on_each_cpu() for TLB invalidates.

x86 - we use either on_each_cpu() (flush_tlb_all(),
  flush_tlb_kernel_range()) or flush_tlb_mm_range(), which does an
  atomic_inc_return() at the very start. Not to mention that actually
  flushing the TLB is itself a barrier. Arguably flush_tlb_mm_range()
  should flush _others_ first and then self, because others will use
  smp_call_function_many(); see above. (TODO: look into paravirt)

tile - does mb() in flush_remote().

sparc32-smp - !?

sparc64 - nope, no-op functions; TLB flushes are contained inside the
  PTL.

sh - yes, per smp_call_function().

s390 - has atomics when it flushes. ptep_modify_prot_start() can set
  mm->flush_mm = 1, at which point flush_tlb_range() will actually do
  something; in that case there will be an smp_mb() as per the atomics.
  Otherwise the TLB invalidate is contained inside the PTL.

powerpc - radix: PTESYNC; hash: flush inside the PTL.

parisc - has all PTE and TLB operations serialized using a global lock.

mn10300 - *ugh*, but yes, smp_call_function() for remote CPUs.

mips - smp_call_function() for remote CPUs.

metag - mmio write.

m32r - doesn't seem to have an smp_mb().

ia64 - smp_call_function_*().

hexagon - HVM trap, no smp_mb().

blackfin - nommu.

arm - dsb ish.

arm64 - dsb ish.

arc - no barrier.

alpha - no barrier.

Now for the architectures that do not have a barrier, like alpha, arc
and metag, the PTL spin_unlock has an smp_mb(). However, I don't think
that is enough, because then the flush_tlb_range() might still be
pending. That said, these architectures probably don't have transparent
huge pages, so it doesn't matter.

Still, this is all rather unsatisfactory.
Either we should define flush_tlb*() to imply a barrier when it's not a
no-op (sparc64/ppc-hash), or simply make clear_tlb_flush_pending() an
smp_store_release(). I prefer the latter option.

Opinions?
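For concreteness, the second option would look roughly like the sketch
below, again as a userspace C11 analogue rather than the actual
mm_struct code (field and helper names approximate the kernel's
tlb_flush_pending accessors; the acquire load on the reader side is an
assumption about how the flag would be paired):

```c
#include <stdatomic.h>

/* Stand-in for the relevant part of mm_struct. */
struct mm_like { atomic_int tlb_flush_pending; };

static void set_tlb_flush_pending(struct mm_like *mm)
{
    /* setting the flag; the kernel side orders this against the
     * subsequent PTE updates by other means (PTL, barriers) */
    atomic_store_explicit(&mm->tlb_flush_pending, 1,
                          memory_order_relaxed);
}

static void clear_tlb_flush_pending(struct mm_like *mm)
{
    /* smp_store_release(): all prior stores -- in particular the TLB
     * flush having completed -- are visible before the flag reads as
     * clear, without relying on flush_tlb_range() being a barrier */
    atomic_store_explicit(&mm->tlb_flush_pending, 0,
                          memory_order_release);
}

static int tlb_flush_pending(struct mm_like *mm)
{
    /* paired acquire load on the reader side */
    return atomic_load_explicit(&mm->tlb_flush_pending,
                                memory_order_acquire);
}
```

The attraction is that the ordering guarantee then lives in one place,
next to the flag itself, instead of being an undocumented property of
every architecture's flush_tlb*() implementation.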