On Thu, Jun 07, 2012 at 11:00:33PM +0200, Andrea Arcangeli wrote: > In the x86 32bit PAE CONFIG_TRANSPARENT_HUGEPAGE=y case while holding > the mmap_sem for reading, cmpxchg8b cannot be used to read pmd > contents under Xen. > > So instead of dealing only with "consistent" pmdvals in > pmd_none_or_trans_huge_or_clear_bad() (which would be conceptually > simpler) we let pmd_none_or_trans_huge_or_clear_bad() deal with pmdvals > where the low 32bit and high 32bit could be inconsistent (to avoid > having to use cmpxchg8b).
<nods> > > The only guarantee we get from pmd_read_atomic is that if the low part > of the pmd was found null, the high part will be null too (so the pmd > will be considered unstable). And if the low part of the pmd is found > "stable" later, then it means the whole pmd was read atomically > (because after a pmd is stable, neither MADV_DONTNEED nor page faults > can alter it anymore, and we read the high part after the low part). > > In the 32bit PAE x86 case, it is enough to read the low part of the > pmdval atomically to declare the pmd as "stable" and that's true for > THP and no THP, furthermore in the THP case we also have a barrier() > that will prevent any inconsistent pmdvals to be cached by a later > re-read of the *pmd. Nice. Andrew, any chane you could test this patch on the affected Xen hypervisors? Was it as easy to reproduce this on a RHEL5 (U1?) hypervisor or is it really only on Linode and Amazon EC2? > > Signed-off-by: Andrea Arcangeli <aarca...@redhat.com> > --- > arch/x86/include/asm/pgtable-3level.h | 30 +++++++++++++++++------------- > include/asm-generic/pgtable.h | 10 ++++++++++ > 2 files changed, 27 insertions(+), 13 deletions(-) > > diff --git a/arch/x86/include/asm/pgtable-3level.h > b/arch/x86/include/asm/pgtable-3level.h > index 43876f1..cb00ccc 100644 > --- a/arch/x86/include/asm/pgtable-3level.h > +++ b/arch/x86/include/asm/pgtable-3level.h > @@ -47,16 +47,26 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte) > * they can run pmd_offset_map_lock or pmd_trans_huge or other pmd > * operations. > * > - * Without THP if the mmap_sem is hold for reading, the > - * pmd can only transition from null to not null while pmd_read_atomic runs. > - * So there's no need of literally reading it atomically. > + * Without THP if the mmap_sem is hold for reading, the pmd can only > + * transition from null to not null while pmd_read_atomic runs. So > + * we can always return atomic pmd values with this function. > * > * With THP if the mmap_sem is hold for reading, the pmd can become > - * THP or null or point to a pte (and in turn become "stable") at any > - * time under pmd_read_atomic, so it's mandatory to read it atomically > - * with cmpxchg8b. > + * trans_huge or none or point to a pte (and in turn become "stable") > + * at any time under pmd_read_atomic. We could read it really > + * atomically here with a atomic64_read for the THP enabled case (and > + * it would be a whole lot simpler), but to avoid using cmpxchg8b we > + * only return an atomic pmdval if the low part of the pmdval is later > + * found stable (i.e. pointing to a pte). And we're returning a none > + * pmdval if the low part of the pmd is none. In some cases the high > + * and low part of the pmdval returned may not be consistent if THP is > + * enabled (the low part may point to previously mapped hugepage, > + * while the high part may point to a more recently mapped hugepage), > + * but pmd_none_or_trans_huge_or_clear_bad() only needs the low part > + * of the pmd to be read atomically to decide if the pmd is unstable > + * or not, with the only exception of when the low part of the pmd is > + * zero in which case we return a none pmd. > */ > -#ifndef CONFIG_TRANSPARENT_HUGEPAGE > static inline pmd_t pmd_read_atomic(pmd_t *pmdp) > { > pmdval_t ret; > @@ -74,12 +84,6 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp) > > return (pmd_t) { ret }; > } > -#else /* CONFIG_TRANSPARENT_HUGEPAGE */ > -static inline pmd_t pmd_read_atomic(pmd_t *pmdp) > -{ > - return (pmd_t) { atomic64_read((atomic64_t *)pmdp) }; > -} > -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > > static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte) > { > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h > index ae39c4b..0ff87ec 100644 > --- a/include/asm-generic/pgtable.h > +++ b/include/asm-generic/pgtable.h > @@ -484,6 +484,16 @@ static inline int > pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd) > /* > * The barrier will stabilize the pmdval in a register or on > * the stack so that it will stop changing under the code. > + * > + * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE, > + * pmd_read_atomic is allowed to return a not atomic pmdval > + * (for example pointing to an hugepage that has never been > + * mapped in the pmd). The below checks will only care about > + * the low part of the pmd with 32bit PAE x86 anyway, with the > + * exception of pmd_none(). So the important thing is that if > + * the low part of the pmd is found null, the high part will > + * be also null or the pmd_none() check below would be > + * confused. > */ > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > barrier(); > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majord...@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"d...@kvack.org"> em...@kvack.org </a> > -- To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org