On Tue, Aug 29, 2017 at 07:14:37AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2017-08-28 at 11:37 +0200, Peter Zijlstra wrote:
> > > Doing all this job and just give up because we cannot allocate page
> > > tables looks very wasteful to me.
> > >
> > > Have you considered to look how we can hand over from speculative to
> > > non-speculative path without starting from scratch (when possible)?
> >
> > So we _can_ in fact allocate and install page-tables, but we have to be
> > very careful about it. The interesting case is where we race with
> > free_pgtables() and install a page that was just taken out.
> >
> > But since we already have the VMA I think we can do something like:
>
> That makes me extremely nervous... there could be all sort of
> assumptions esp. in arch code about the fact that we never populate the
> tree without the mm sem.
That _would_ be somewhat dodgy, because that means it needs to rely on
taking mmap_sem for _writing_ to undo things, and arch/powerpc/ doesn't
have many down_write.*mmap_sem:

$ git grep "down_write.*mmap_sem" arch/powerpc/
arch/powerpc/kernel/vdso.c:             if (down_write_killable(&mm->mmap_sem))
arch/powerpc/kvm/book3s_64_vio.c:       down_write(&current->mm->mmap_sem);
arch/powerpc/mm/mmu_context_iommu.c:    down_write(&mm->mmap_sem);
arch/powerpc/mm/subpage-prot.c:         down_write(&mm->mmap_sem);
arch/powerpc/mm/subpage-prot.c:         down_write(&mm->mmap_sem);
arch/powerpc/mm/subpage-prot.c:         down_write(&mm->mmap_sem);

Then again, I suppose it could be relying on the implicit down_write
from things like munmap() and the like..

And things _ought_ to be ordered by the various PTLs
(mm->page_table_lock and pmd->lock), which of course doesn't mean
something accidentally snuck through.

> We'd have to audit archs closely. Things like the page walk cache
> flushing on power etc...

If you point me to where to look, I'll have a poke around. I'm not
quite sure what you mean by pagewalk cache flushing. Your hash thing
flushes everything inside the PTL IIRC, and the radix code appears
fairly 'normal'.

> I don't mind the "retry"... we've brought stuff into the L1 cache
> already, which I would expect to be the bulk of the overhead, and the
> allocation case isn't that common. Do we have numbers to show how
> detrimental this is today?

No numbers, afaik. And like I said, I didn't consider this an actual
problem when I did these patches. But since Kirill asked ;-)