Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-11-13 Thread Jerome Glisse
On Wed, Nov 13, 2019 at 02:00:06PM -0800, Dan Williams wrote:
> On Wed, Nov 13, 2019 at 11:23 AM Dan Williams  
> wrote:
> >
> > On Tue, Nov 12, 2019 at 8:27 PM John Hubbard  wrote:
> > >
> > > An upcoming patch changes and complicates the refcounting and
> > > especially the "put page" aspects of it. In order to keep
> > > everything clean, refactor the devmap page release routines:
> > >
> > > * Rename put_devmap_managed_page() to page_is_devmap_managed(),
> > >   and limit the functionality to "read only": return a bool,
> > >   with no side effects.
> > >
> > > * Add a new routine, put_devmap_managed_page(), to handle checking
> > >   what kind of page it is, and what kind of refcount handling it
> > >   requires.
> > >
> > > * Rename __put_devmap_managed_page() to free_devmap_managed_page(),
> > >   and limit the functionality to unconditionally freeing a devmap
> > >   page.
> > >
> > > This is originally based on a separate patch by Ira Weiny, which
> > > applied to an early version of the put_user_page() experiments.
> > > Since then, Jérôme Glisse suggested the refactoring described above.
> > >
> > > Suggested-by: Jérôme Glisse 
> > > Signed-off-by: Ira Weiny 
> > > Signed-off-by: John Hubbard 
> > > ---
> > >  include/linux/mm.h | 27 ---
> > >  mm/memremap.c  | 67 --
> > >  2 files changed, 53 insertions(+), 41 deletions(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index a2adf95b3f9c..96228376139c 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -967,9 +967,10 @@ static inline bool is_zone_device_page(const struct 
> > > page *page)
> > >  #endif
> > >
> > >  #ifdef CONFIG_DEV_PAGEMAP_OPS
> > > -void __put_devmap_managed_page(struct page *page);
> > > +void free_devmap_managed_page(struct page *page);
> > >  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> > > -static inline bool put_devmap_managed_page(struct page *page)
> > > +
> > > +static inline bool page_is_devmap_managed(struct page *page)
> > >  {
> > > if (!static_branch_unlikely(&devmap_managed_key))
> > > return false;
> > > @@ -978,7 +979,6 @@ static inline bool put_devmap_managed_page(struct 
> > > page *page)
> > > switch (page->pgmap->type) {
> > > case MEMORY_DEVICE_PRIVATE:
> > > case MEMORY_DEVICE_FS_DAX:
> > > -   __put_devmap_managed_page(page);
> > > return true;
> > > default:
> > > break;
> > > @@ -986,6 +986,27 @@ static inline bool put_devmap_managed_page(struct 
> > > page *page)
> > > return false;
> > >  }
> > >
> > > +static inline bool put_devmap_managed_page(struct page *page)
> > > +{
> > > +   bool is_devmap = page_is_devmap_managed(page);
> > > +
> > > +   if (is_devmap) {
> > > +   int count = page_ref_dec_return(page);
> > > +
> > > +   /*
> > > +* devmap page refcounts are 1-based, rather than 0-based: if
> > > +* refcount is 1, then the page is free and the refcount is
> > > +* stable because nobody holds a reference on the page.
> > > +*/
> > > +   if (count == 1)
> > > +   free_devmap_managed_page(page);
> > > +   else if (!count)
> > > +   __put_page(page);
> > > +   }
> > > +
> > > +   return is_devmap;
> > > +}
> > > +
> > >  #else /* CONFIG_DEV_PAGEMAP_OPS */
> > >  static inline bool put_devmap_managed_page(struct page *page)
> > >  {
> > > diff --git a/mm/memremap.c b/mm/memremap.c
> > > index 03ccbdfeb697..bc7e2a27d025 100644
> > > --- a/mm/memremap.c
> > > +++ b/mm/memremap.c
> > > @@ -410,48 +410,39 @@ struct dev_pagemap *get_dev_pagemap(unsigned long 
> > > pfn,
> > >  EXPORT_SYMBOL_GPL(get_dev_pagemap);
> > >
> > >  #ifdef CONFIG_DEV_PAGEMAP_OPS
> > > -void __put_devmap_managed_page(struct page *page)
> > > +void free_devmap_managed_page(struct page *page)
> > >  {
> > > -   int count = page_ref_dec_return(page);
> > > +   /* Clear Active bit in case of parallel mark_page_accessed */
> > > +   __ClearPageActive(page);
> > > +   __ClearPageWaiters(page);
> > > +
> > > +   mem_cgroup_uncharge(page);
> >
> > Ugh, when did all this HMM specific manipulation sneak into the
> > generic ZONE_DEVICE path? It used to be gated by pgmap type with its
> > own put_zone_device_private_page(). For example it's certainly
> > unnecessary and might be broken (would need to check) to call
> > mem_cgroup_uncharge() on a DAX page. ZONE_DEVICE users are not a
> > monolith and the HMM use case leaks pages into code paths that DAX
> > explicitly avoids.
> 
> It's been this way for a while and I did not react previously,
> apologies for that. I think __ClearPageActive, __ClearPageWaiters, and
> mem_cgroup_uncharge belong behind a device-private conditional. The
> history here is:
> 
> Move some, but 
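
(For illustration, a rough sketch of the gating being suggested above -- not
from the patch series, just an untested outline that keeps the HMM-specific
cleanup behind a device-private check:)

void free_devmap_managed_page(struct page *page)
{
	/*
	 * Only device-private (HMM) pages went through the LRU and memcg
	 * accounting paths, so only they need this cleanup.
	 */
	if (is_device_private_page(page)) {
		/* Clear Active bit in case of parallel mark_page_accessed */
		__ClearPageActive(page);
		__ClearPageWaiters(page);

		mem_cgroup_uncharge(page);

		/* Don't leak a stale (e.g. anon) mapping into page reuse. */
		page->mapping = NULL;
	}

	page->pgmap->ops->page_free(page);
}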

Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-11-13 Thread Jerome Glisse
On Tue, Nov 12, 2019 at 08:26:51PM -0800, John Hubbard wrote:
> An upcoming patch changes and complicates the refcounting and
> especially the "put page" aspects of it. In order to keep
> everything clean, refactor the devmap page release routines:
> 
> * Rename put_devmap_managed_page() to page_is_devmap_managed(),
>   and limit the functionality to "read only": return a bool,
>   with no side effects.
> 
> * Add a new routine, put_devmap_managed_page(), to handle checking
>   what kind of page it is, and what kind of refcount handling it
>   requires.
> 
> * Rename __put_devmap_managed_page() to free_devmap_managed_page(),
>   and limit the functionality to unconditionally freeing a devmap
>   page.
> 
> This is originally based on a separate patch by Ira Weiny, which
> applied to an early version of the put_user_page() experiments.
> Since then, Jérôme Glisse suggested the refactoring described above.
> 
> Suggested-by: Jérôme Glisse 
> Signed-off-by: Ira Weiny 
> Signed-off-by: John Hubbard 

Reviewed-by: Jérôme Glisse 

> ---
>  include/linux/mm.h | 27 ---
>  mm/memremap.c  | 67 --
>  2 files changed, 53 insertions(+), 41 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a2adf95b3f9c..96228376139c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -967,9 +967,10 @@ static inline bool is_zone_device_page(const struct page 
> *page)
>  #endif
>  
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page);
> +void free_devmap_managed_page(struct page *page);
>  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> -static inline bool put_devmap_managed_page(struct page *page)
> +
> +static inline bool page_is_devmap_managed(struct page *page)
>  {
>   if (!static_branch_unlikely(&devmap_managed_key))
>   return false;
> @@ -978,7 +979,6 @@ static inline bool put_devmap_managed_page(struct page 
> *page)
>   switch (page->pgmap->type) {
>   case MEMORY_DEVICE_PRIVATE:
>   case MEMORY_DEVICE_FS_DAX:
> - __put_devmap_managed_page(page);
>   return true;
>   default:
>   break;
> @@ -986,6 +986,27 @@ static inline bool put_devmap_managed_page(struct page 
> *page)
>   return false;
>  }
>  
> +static inline bool put_devmap_managed_page(struct page *page)
> +{
> + bool is_devmap = page_is_devmap_managed(page);
> +
> + if (is_devmap) {
> + int count = page_ref_dec_return(page);
> +
> + /*
> +  * devmap page refcounts are 1-based, rather than 0-based: if
> +  * refcount is 1, then the page is free and the refcount is
> +  * stable because nobody holds a reference on the page.
> +  */
> + if (count == 1)
> + free_devmap_managed_page(page);
> + else if (!count)
> + __put_page(page);
> + }
> +
> + return is_devmap;
> +}
> +
>  #else /* CONFIG_DEV_PAGEMAP_OPS */
>  static inline bool put_devmap_managed_page(struct page *page)
>  {
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 03ccbdfeb697..bc7e2a27d025 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -410,48 +410,39 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>  EXPORT_SYMBOL_GPL(get_dev_pagemap);
>  
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page)
> +void free_devmap_managed_page(struct page *page)
>  {
> - int count = page_ref_dec_return(page);
> + /* Clear Active bit in case of parallel mark_page_accessed */
> + __ClearPageActive(page);
> + __ClearPageWaiters(page);
> +
> + mem_cgroup_uncharge(page);
>  
>   /*
> -  * If refcount is 1 then page is freed and refcount is stable as nobody
> -  * holds a reference on the page.
> +  * When a device_private page is freed, the page->mapping field
> +  * may still contain a (stale) mapping value. For example, the
> +  * lower bits of page->mapping may still identify the page as
> +  * an anonymous page. Ultimately, this entire field is just
> +  * stale and wrong, and it will cause errors if not cleared.
> +  * One example is:
> +  *
> +  *  migrate_vma_pages()
> +  *migrate_vma_insert_page()
> +  *  page_add_new_anon_rmap()
> +  *__page_set_anon_rmap()
> +  *  ...checks page->mapping, via PageAnon(page) call,
> +  *and incorrectly concludes that the page is an
> +  *anonymous page. Therefore, it incorrectly,
> +  *silently fails to set up the new anon rmap.
> +  *
> +  * For other types of ZONE_DEVICE pages, migration is either
> +  * handled differently or not done at all, so there is no need
> +  * to clear page->mapping.
>*/
> - if (count == 1) {
> - /* Clear Active bit in case of parallel mark_page_accessed */
> - 

Re: [PATCH v2 12/18] mm/gup: track FOLL_PIN pages

2019-11-04 Thread Jerome Glisse
On Mon, Nov 04, 2019 at 02:49:18PM -0800, John Hubbard wrote:
> On 11/4/19 10:52 AM, Jerome Glisse wrote:
> > On Sun, Nov 03, 2019 at 01:18:07PM -0800, John Hubbard wrote:
> >> Add tracking of pages that were pinned via FOLL_PIN.
> >>
> >> As mentioned in the FOLL_PIN documentation, callers who effectively set
> >> FOLL_PIN are required to ultimately free such pages via put_user_page().
> >> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> >> for DIO and/or RDMA use".
> >>
> >> Pages that have been pinned via FOLL_PIN are identifiable via a
> >> new function call:
> >>
> >>bool page_dma_pinned(struct page *page);
> >>
> >> What to do in response to encountering such a page, is left to later
> >> patchsets. There is discussion about this in [1].
> >>
> >> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> >>
> >> This also has a couple of trivial, non-functional change fixes to
> >> try_get_compound_head(). That function got moved to the top of the
> >> file.
> > 
> > Maybe split that as a separate trivial patch.
> 
> 
> Will do.
> 
> 
> > 
> >>
> >> This includes the following fix from Ira Weiny:
> >>
> >> DAX requires detection of a page crossing to a ref count of 1.  Fix this
> >> for GUP pages by introducing put_devmap_managed_user_page() which
> >> accounts for GUP_PIN_COUNTING_BIAS now used by GUP.
> > 
> > Please do the put_devmap_managed_page() changes in a separate
> > patch, it would be a lot easier to follow, also on that front
> > see comments below.
> 
> 
> Oh! OK. It makes sense when you say it out loud. :)
> 
> 
> ...
> >> +static inline bool put_devmap_managed_page(struct page *page)
> >> +{
> >> +  bool is_devmap = page_is_devmap_managed(page);
> >> +
> >> +  if (is_devmap) {
> >> +  int count = page_ref_dec_return(page);
> >> +
> >> +  __put_devmap_managed_page(page, count);
> >> +  }
> >> +
> >> +  return is_devmap;
> >> +}
> > 
> > I think __put_devmap_managed_page() should be renamed
> > to free_devmap_managed_page(), and the count != 1
> > case moved to this inline function, i.e.:
> > 
> > static inline bool put_devmap_managed_page(struct page *page)
> > {
> > bool is_devmap = page_is_devmap_managed(page);
> > 
> > if (is_devmap) {
> > int count = page_ref_dec_return(page);
> > 
> > /*
> >  * If refcount is 1 then page is freed and refcount is stable
> >  * as nobody holds a reference on the page.
> >  */
> > if (count == 1)
> > free_devmap_managed_page(page, count);
> > else if (!count)
> > __put_page(page);
> > }
> > 
> > return is_devmap;
> > }
> > 
> 
> Thanks, that does look cleaner and easier to read.
> 
> > 
> >> +
> >>  #else /* CONFIG_DEV_PAGEMAP_OPS */
> >>  static inline bool put_devmap_managed_page(struct page *page)
> >>  {
> >> @@ -1038,6 +1051,8 @@ static inline __must_check bool try_get_page(struct 
> >> page *page)
> >>return true;
> >>  }
> >>  
> >> +__must_check bool user_page_ref_inc(struct page *page);
> >> +
> > 
> > What about having it as an inline here, as it is pretty small?
> 
> 
> You mean move it to a static inline function in mm.h? It's worse than it 
> looks, though: *everything* that it calls is also a static function, local
> to gup.c. So I'd have to expose both try_get_compound_head() and
> __update_proc_vmstat(). And that also means calling mod_node_page_state() from
> mm.h, and it goes south right about there. :)

Ok fair enough
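
(For reference, roughly what user_page_ref_inc() does per the description
above -- a sketch only, with the gup.c-internal vmstat accounting elided:)

/*
 * Pin a page by adding GUP_PIN_COUNTING_BIAS to its refcount, going
 * through try_get_compound_head() so compound pages are handled and
 * bad refcounts are rejected.
 */
__must_check bool user_page_ref_inc(struct page *page)
{
	page = try_get_compound_head(page, GUP_PIN_COUNTING_BIAS);
	if (!page)
		return false;

	/* Per-node vmstat accounting of the pin happens here (gup.c internal). */
	return true;
}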

> ...  
> >> +/**
> >> + * page_dma_pinned() - report if a page is pinned by a call to 
> >> pin_user_pages*()
> >> + * or pin_longterm_pages*()
> >> + * @page: pointer to page to be queried.
> >> + * @Return:   True, if it is likely that the page has been 
> >> "dma-pinned".
> >> + *False, if the page is definitely not dma-pinned.
> >> + */
> > 
> > Maybe add a small comment about wrap around :)
> 
> 
> I don't *think* the count can wrap around, due to the checks in 
> user_page_ref_inc().
> 
> But it's true that the documentation

Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread Jerome Glisse
On Mon, Nov 04, 2019 at 12:33:09PM -0800, David Rientjes wrote:
> 
> 
> On Sun, 3 Nov 2019, John Hubbard wrote:
> 
> > Introduce pin_user_pages*() variations of get_user_pages*() calls,
> > and also pin_longterm_pages*() variations.
> > 
> > These variants all set FOLL_PIN, which is also introduced, and
> > thoroughly documented.
> > 
> > The pin_longterm*() variants also set FOLL_LONGTERM, in addition
> > to FOLL_PIN:
> > 
> > pin_user_pages()
> > pin_user_pages_remote()
> > pin_user_pages_fast()
> > 
> > pin_longterm_pages()
> > pin_longterm_pages_remote()
> > pin_longterm_pages_fast()
> > 
> > All pages that are pinned via the above calls, must be unpinned via
> > put_user_page().
> > 
> 
> Hi John,
> 
> I'm curious what consideration is given to which pageblock migrate types
> FOLL_PIN and FOLL_LONGTERM pages originate from, assuming that
> longterm pins would want to originate from MIGRATE_UNMOVABLE pageblocks for the
> purposes of anti-fragmentation?

We do not control the page block; GUP can happen on _any_ page that is
mapped inside a process (anonymous private vma or regular file-backed one).

Cheers,
Jérôme



Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread Jerome Glisse
On Mon, Nov 04, 2019 at 12:09:05PM -0800, John Hubbard wrote:
> Jason, a question for you at the bottom.
> 
> On 11/4/19 11:52 AM, Jerome Glisse wrote:
> ...
> >> CASE 3: ODP
> >> ---
> >> RDMA hardware with page faulting support. Here, a well-written driver 
> >> doesn't
> > 
> > CASE3: Hardware with page fault support
> > ---
> > 
> > Here, a well-written 
> > 
> 
> Ah, OK. So just drop the first sentence, yes.
> 
> ...
> >>>>>> +   */
> >>>>>> +  gup_flags |= FOLL_REMOTE | FOLL_PIN;
> >>>>>
> >>>>> Wouldn't it be better to not add pin_longterm_pages_remote() until
> >>>>> it can be properly implemented ?
> >>>>>
> >>>>
> >>>> Well, the problem is that I need each call site that requires FOLL_PIN
> >>>> to use a proper wrapper. It's the FOLL_PIN that is the focus here, 
> >>>> because
> >>>> there is a hard, bright rule, which is: if and only if a caller sets
> >>>> FOLL_PIN, then the dma-page tracking happens, and put_user_page() must
> >>>> be called.
> >>>>
> >>>> So this leaves me with only two reasonable choices:
> >>>>
> >>>> a) Convert the call site as above: pin_longterm_pages_remote(), which 
> >>>> sets
> >>>> FOLL_PIN (the key point!), and leaves the FOLL_LONGTERM situation exactly
> >>>> as it has been so far. When the FOLL_LONGTERM situation is fixed, the 
> >>>> call
> >>>> site *might* not need any changes to adopt the working gup.c code.
> >>>>
> >>>> b) Convert the call site to pin_user_pages_remote(), which also sets
> >>>> FOLL_PIN, and also leaves the FOLL_LONGTERM situation exactly as before.
> >>>> There would also be a comment at the call site, to the effect of, "this
> >>>> is the wrong call to make: it really requires FOLL_LONGTERM behavior".
> >>>>
> >>>> When the FOLL_LONGTERM situation is fixed, the call site will need to be
> >>>> changed to pin_longterm_pages_remote().
> >>>>
> >>>> So you can probably see why I picked (a).
> >>>
> >>> But right now nobody has FOLL_LONGTERM and FOLL_REMOTE. So you should
> >>> never have the need for pin_longterm_pages_remote(). My fear is that
> >>> longterm has implications, and it would be better not to drop those
> >>> implications by adding a wrapper that does not do what the name says.
> >>>
> >>> So do not introduce pin_longterm_pages_remote() until its first user
> >>> happens. This is option (c).
> >>>
> >>
> >> Almost forgot, though: there is already another user: Infiniband:
> >>
> >> drivers/infiniband/core/umem_odp.c:646: npages = 
> >> pin_longterm_pages_remote(owning_process, owning_mm,
> > 
> > ODP does not need that; I thought the HMM conversion was already upstream,
> > but it seems not. In any case ODP does not need the longterm case, so
> > best is to revert that user to gup_fast or something until it gets
> > converted to HMM.
> > 
> 
> Note for Jason: the (a) or (b) items are talking about the vfio case, which is
> one of the two call sites that now use pin_longterm_pages_remote(), and the
> other one is infiniband:
> 
> drivers/infiniband/core/umem_odp.c:646: npages = 
> pin_longterm_pages_remote(owning_process, owning_mm,
> drivers/vfio/vfio_iommu_type1.c:353:ret = 
> pin_longterm_pages_remote(NULL, mm, vaddr, 1,

vfio should be reverted until it can be properly implemented.
The issue is that when you fix the implementation you might
break existing vfio users and thus regress the kernel from the
user's point of view. So I would rather have the change to vfio
reverted; I believe it was not well understood when it went
upstream. By the way, in my 5.4 tree it is still gup_remote,
not longterm.


> Jerome, Jason: I really don't want to revert the put_page() to 
> put_user_page() 
> conversions that are already throughout the IB driver--pointless churn, right?
> I'd rather either delete them in Jason's tree, or go with what I have here
> while waiting for the deletion.
> 
> Maybe we should just settle on (a) or (b), so that the IB driver ends up with
> the wrapper functions? In fact, if it's getting deleted, then I'd prefer 
> leaving
> it at (a), since that's simple...
> 
> Jason should weigh in on how he wants this to go, with respect to branching
> and merging, since it sounds like that will conflict with the hmm branch 
> (ha, I'm overdue in reviewing his mmu notifier series, that's what I get for
> being late).
> 
> thanks,
> 
> John Hubbard
> NVIDIA



Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread Jerome Glisse
On Mon, Nov 04, 2019 at 11:30:32AM -0800, John Hubbard wrote:
> On 11/4/19 11:18 AM, Jerome Glisse wrote:
> > On Mon, Nov 04, 2019 at 11:04:38AM -0800, John Hubbard wrote:
> >> On 11/4/19 9:33 AM, Jerome Glisse wrote:
> >> ...
> >>>
> >>> A few nitpicks below, nonetheless:
> >>>
> >>> Reviewed-by: Jérôme Glisse 
> >>> [...]
> >>>> +
> >>>> +CASE 3: ODP
> >>>> +---
> >>>> +(Mellanox/Infiniband On Demand Paging: the hardware supports
> >>>> +replayable page faulting). There are GUP references to pages serving as 
> >>>> DMA
> >>>> +buffers. For ODP, MMU notifiers are used to synchronize with 
> >>>> page_mkclean()
> >>>> +and munmap(). Therefore, normal GUP calls are sufficient, so neither 
> >>>> flag
> >>>> +needs to be set.
> >>>
> >>> I would not include ODP or anything like it here; they do not use
> >>> GUP anymore and I believe it is more confusing here. I would however
> >>> include some text in this documentation explaining that hardware
> >>> that supports page faults is superior, as it does not incur any
> >>> of the issues described here.
> >>
> >> OK, agreed, here's a new write up that I'll put in v3:
> >>
> >>
> >> CASE 3: ODP
> >> ---
> > 
> > ODP is RDMA, maybe Hardware with page fault support instead
> > 
> >> Advanced, but non-CPU (DMA) hardware that supports replayable page faults.
> 
> OK, so:
> 
> "RDMA hardware with page faulting support."
> 
> for the first sentence.

I would drop RDMA completely; RDMA is just one example, there are GPU, FPGA and
others in that category. See below.

> 
> 
> >> Here, a well-written driver doesn't normally need to pin pages at all. 
> >> However,
> >> if the driver does choose to do so, it can register MMU notifiers for the 
> >> range,
> >> and will be called back upon invalidation. Either way (avoiding page 
> >> pinning, or
> >> using MMU notifiers to unpin upon request), there is proper 
> >> synchronization with 
> >> both filesystem and mm (page_mkclean(), munmap(), etc).
> >>
> >> Therefore, neither flag needs to be set.
> > 
> > In fact GUP should never be used with those.
> 
> 
> Yes. The next paragraph says that, but maybe not strong enough.
> 
> 
> >>
> >> It's worth mentioning here that pinning pages should not be the first 
> >> design
> >> choice. If page fault capable hardware is available, then the software 
> >> should
> >> be written so that it does not pin pages. This allows mm and filesystems to
> >> operate more efficiently and reliably.
> 
> Here's what we have after the above changes:
> 
> CASE 3: ODP
> ---
> RDMA hardware with page faulting support. Here, a well-written driver doesn't

CASE3: Hardware with page fault support
---

Here, a well-written 


> normally need to pin pages at all. However, if the driver does choose to do 
> so,
> it can register MMU notifiers for the range, and will be called back upon
> invalidation. Either way (avoiding page pinning, or using MMU notifiers to 
> unpin
> upon request), there is proper synchronization with both filesystem and mm
> (page_mkclean(), munmap(), etc).
> 
> Therefore, neither flag needs to be set.
> 
> In this case, ideally, neither get_user_pages() nor pin_user_pages() should 
> be 
> called. Instead, the software should be written so that it does not pin 
> pages. 
> This allows mm and filesystems to operate more efficiently and reliably.
> 
> >>> [...]
> >>>
> >>>> @@ -1014,7 +1018,16 @@ static __always_inline long 
> >>>> __get_user_pages_locked(struct task_struct *tsk,
> >>>>  BUG_ON(*locked != 1);
> >>>>  }
> >>>>  
> >>>> -if (pages)
> >>>> +/*
> >>>> + * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
> >>>> + * is to set FOLL_GET if the caller wants pages[] filled in (but has
> >>>> + * carelessly failed to specify FOLL_GET), so keep doing that, but only
> >>>> + * for FOLL_GET, not for the newer FOLL_PIN.
> >>>> + *
> >>>> +   

Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread Jerome Glisse
On Mon, Nov 04, 2019 at 11:04:38AM -0800, John Hubbard wrote:
> On 11/4/19 9:33 AM, Jerome Glisse wrote:
> ...
> > 
> > A few nitpicks below, nonetheless:
> > 
> > Reviewed-by: Jérôme Glisse 
> > [...]
> >> +
> >> +CASE 3: ODP
> >> +---
> >> +(Mellanox/Infiniband On Demand Paging: the hardware supports
> >> +replayable page faulting). There are GUP references to pages serving as 
> >> DMA
> >> +buffers. For ODP, MMU notifiers are used to synchronize with 
> >> page_mkclean()
> >> +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
> >> +needs to be set.
> > 
> > I would not include ODP or anything like it here; they do not use
> > GUP anymore and I believe it is more confusing here. I would however
> > include some text in this documentation explaining that hardware
> > that supports page faults is superior, as it does not incur any
> > of the issues described here.
> 
> OK, agreed, here's a new write up that I'll put in v3:
> 
> 
> CASE 3: ODP
> ---

ODP is RDMA, maybe Hardware with page fault support instead

> Advanced, but non-CPU (DMA) hardware that supports replayable page faults.
> Here, a well-written driver doesn't normally need to pin pages at all. 
> However,
> if the driver does choose to do so, it can register MMU notifiers for the 
> range,
> and will be called back upon invalidation. Either way (avoiding page pinning, 
> or
> using MMU notifiers to unpin upon request), there is proper synchronization 
> with 
> both filesystem and mm (page_mkclean(), munmap(), etc).
> 
> Therefore, neither flag needs to be set.

In fact GUP should never be used with those.

> 
> It's worth mentioning here that pinning pages should not be the first design
> choice. If page fault capable hardware is available, then the software should
> be written so that it does not pin pages. This allows mm and filesystems to
> operate more efficiently and reliably.
> 
> > [...]
> > 
> >> diff --git a/mm/gup.c b/mm/gup.c
> >> index 199da99e8ffc..1aea48427879 100644
> >> --- a/mm/gup.c
> >> +++ b/mm/gup.c
> > 
> > [...]
> > 
> >> @@ -1014,7 +1018,16 @@ static __always_inline long 
> >> __get_user_pages_locked(struct task_struct *tsk,
> >>BUG_ON(*locked != 1);
> >>}
> >>  
> >> -  if (pages)
> >> +  /*
> >> +   * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
> >> +   * is to set FOLL_GET if the caller wants pages[] filled in (but has
> >> +   * carelessly failed to specify FOLL_GET), so keep doing that, but only
> >> +   * for FOLL_GET, not for the newer FOLL_PIN.
> >> +   *
> >> +   * FOLL_PIN always expects pages to be non-null, but no need to assert
> >> +   * that here, as any failures will be obvious enough.
> >> +   */
> >> +  if (pages && !(flags & FOLL_PIN))
> >>flags |= FOLL_GET;
> > 
> > Did you look at users that have pages but not FOLL_GET set?
> > I believe it would be better to first fix them to end up
> > with FOLL_GET set, and then error out if pages != NULL but
> > neither FOLL_GET nor FOLL_PIN is set.
> > 
> 
> I was perhaps overly cautious, and didn't go there. However, it's probably
> doable, given that there was already the following in __get_user_pages():
> 
> VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
> 
> ...which will have conditioned people and code to set FOLL_GET together with
> pages. So I agree that the time is right.
> 
> In order to make bisecting future failures simpler, I can insert a patch 
> right 
> before this one, that changes the FOLL_GET setting into an assert, like this:
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 8f236a335ae9..be338961e80d 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1014,8 +1014,8 @@ static __always_inline long 
> __get_user_pages_locked(struct task_struct *tsk,
> BUG_ON(*locked != 1);
> }
>  
> -   if (pages)
> -   flags |= FOLL_GET;
> +   if (pages && WARN_ON_ONCE(!(gup_flags & FOLL_GET)))
> +   return -EINVAL;
>  
> pages_done = 0;
> lock_dropped = false;
> 
> 
> ...and then add in FOLL_PIN, with this patch.

Looks good, but double check that it should not happen; I will try
to check on my side too.

> 
> >>  
> >>pages_done = 0;
> > 
> >> @@ -2373,24 +2402,9 @@ static int __gup_longterm_unlocked(unsigned long 
> >> start, int nr_pages,
> 

Re: [PATCH v2 12/18] mm/gup: track FOLL_PIN pages

2019-11-04 Thread Jerome Glisse
On Sun, Nov 03, 2019 at 01:18:07PM -0800, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via put_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1].
> 
> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> This also has a couple of trivial, non-functional change fixes to
> try_get_compound_head(). That function got moved to the top of the
> file.

Maybe split that as a separate trivial patch.

> 
> This includes the following fix from Ira Weiny:
> 
> DAX requires detection of a page crossing to a ref count of 1.  Fix this
> for GUP pages by introducing put_devmap_managed_user_page() which
> accounts for GUP_PIN_COUNTING_BIAS now used by GUP.

Please do the put_devmap_managed_page() changes in a separate
patch, it would be a lot easier to follow, also on that front
see comments below.

> 
> [1] https://lwn.net/Articles/784574/ "Some slow progress on
> get_user_pages()"
> 
> Suggested-by: Jan Kara 
> Suggested-by: Jérôme Glisse 
> Signed-off-by: Ira Weiny 
> Signed-off-by: John Hubbard 
> ---
>  include/linux/mm.h   |  80 +++
>  include/linux/mmzone.h   |   2 +
>  include/linux/page_ref.h |  10 ++
>  mm/gup.c | 213 +++
>  mm/huge_memory.c |  32 +-
>  mm/hugetlb.c |  28 -
>  mm/memremap.c|   4 +-
>  mm/vmstat.c  |   2 +
>  8 files changed, 300 insertions(+), 71 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index cdfb6fedb271..03b3600843b7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -972,9 +972,10 @@ static inline bool is_zone_device_page(const struct page 
> *page)
>  #endif
>  
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page);
> +void __put_devmap_managed_page(struct page *page, int count);
>  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> -static inline bool put_devmap_managed_page(struct page *page)
> +
> +static inline bool page_is_devmap_managed(struct page *page)
>  {
>   if (!static_branch_unlikely(&devmap_managed_key))
>   return false;
> @@ -983,7 +984,6 @@ static inline bool put_devmap_managed_page(struct page 
> *page)
>   switch (page->pgmap->type) {
>   case MEMORY_DEVICE_PRIVATE:
>   case MEMORY_DEVICE_FS_DAX:
> - __put_devmap_managed_page(page);
>   return true;
>   default:
>   break;
> @@ -991,6 +991,19 @@ static inline bool put_devmap_managed_page(struct page 
> *page)
>   return false;
>  }
>  
> +static inline bool put_devmap_managed_page(struct page *page)
> +{
> + bool is_devmap = page_is_devmap_managed(page);
> +
> + if (is_devmap) {
> + int count = page_ref_dec_return(page);
> +
> + __put_devmap_managed_page(page, count);
> + }
> +
> + return is_devmap;
> +}

I think __put_devmap_managed_page() should be renamed
to free_devmap_managed_page(), and the count != 1
case moved to this inline function, i.e.:

static inline bool put_devmap_managed_page(struct page *page)
{
bool is_devmap = page_is_devmap_managed(page);

if (is_devmap) {
int count = page_ref_dec_return(page);

/*
 * If refcount is 1 then page is freed and refcount is stable
 * as nobody holds a reference on the page.
 */
if (count == 1)
free_devmap_managed_page(page, count);
else if (!count)
__put_page(page);
}

return is_devmap;
}


> +
>  #else /* CONFIG_DEV_PAGEMAP_OPS */
>  static inline bool put_devmap_managed_page(struct page *page)
>  {
> @@ -1038,6 +1051,8 @@ static inline __must_check bool try_get_page(struct 
> page *page)
>   return true;
>  }
>  
> +__must_check bool user_page_ref_inc(struct page *page);
> +

What about having it as an inline here, as it is pretty small?


>  static inline void put_page(struct page *page)
>  {
>   page = compound_head(page);
> @@ -1055,31 +1070,56 @@ static inline void put_page(struct page *page)
>   __put_page(page);
>  }
>  
> -/**
> - * put_user_page() - release a gup-pinned page
> - * @page:pointer to page to be released
> +/*
> + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
> + * the page's refcount so that two separate items are tracked: the original page
> + * reference count, and also a new count of how many 

Re: [PATCH v2 09/18] drm/via: set FOLL_PIN via pin_user_pages_fast()

2019-11-04 Thread Jerome Glisse
On Sun, Nov 03, 2019 at 01:18:04PM -0800, John Hubbard wrote:
> Convert drm/via to use the new pin_user_pages_fast() call, which sets
> FOLL_PIN. Setting FOLL_PIN is now required for code that requires
> tracking of pinned pages, and therefore for any code that calls
> put_user_page().
> 
> Reviewed-by: Ira Weiny 
> Signed-off-by: John Hubbard 

Please be more explicit that via_dmablit.c is already using put_user_page(),
as I am expecting that any conversion to pin_user_pages*() must be paired with
a put_user_page(). I find the above commit message a bit unclear from that POV.

Reviewed-by: Jérôme Glisse 


> ---
>  drivers/gpu/drm/via/via_dmablit.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/via/via_dmablit.c 
> b/drivers/gpu/drm/via/via_dmablit.c
> index 3db000aacd26..37c5e572993a 100644
> --- a/drivers/gpu/drm/via/via_dmablit.c
> +++ b/drivers/gpu/drm/via/via_dmablit.c
> @@ -239,7 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t *vsg,  
> drm_via_dmablit_t *xfer)
>   vsg->pages = vzalloc(array_size(sizeof(struct page *), vsg->num_pages));
>   if (NULL == vsg->pages)
>   return -ENOMEM;
> - ret = get_user_pages_fast((unsigned long)xfer->mem_addr,
> + ret = pin_user_pages_fast((unsigned long)xfer->mem_addr,
>   vsg->num_pages,
>   vsg->direction == DMA_FROM_DEVICE ? FOLL_WRITE : 0,
>   vsg->pages);
> -- 
> 2.23.0
> 



Re: [PATCH v2 08/18] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()

2019-11-04 Thread Jerome Glisse
On Sun, Nov 03, 2019 at 01:18:03PM -0800, John Hubbard wrote:
> Convert process_vm_access to use the new pin_user_pages_remote()
> call, which sets FOLL_PIN. Setting FOLL_PIN is now required for
> code that requires tracking of pinned pages.
> 
> Also, release the pages via put_user_page*().
> 
> Also, rename "pages" to "pinned_pages", as this makes for
> easier reading of process_vm_rw_single_vec().
> 
> Reviewed-by: Ira Weiny 
> Signed-off-by: John Hubbard 

Reviewed-by: Jérôme Glisse 

> ---
>  mm/process_vm_access.c | 28 +++-
>  1 file changed, 15 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> index 357aa7bef6c0..fd20ab675b85 100644
> --- a/mm/process_vm_access.c
> +++ b/mm/process_vm_access.c
> @@ -42,12 +42,11 @@ static int process_vm_rw_pages(struct page **pages,
>   if (copy > len)
>   copy = len;
>  
> - if (vm_write) {
> + if (vm_write)
>   copied = copy_page_from_iter(page, offset, copy, iter);
> - set_page_dirty_lock(page);
> - } else {
> + else
>   copied = copy_page_to_iter(page, offset, copy, iter);
> - }
> +
>   len -= copied;
>   if (copied < copy && iov_iter_count(iter))
>   return -EFAULT;
> @@ -96,7 +95,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
>   flags |= FOLL_WRITE;
>  
>   while (!rc && nr_pages && iov_iter_count(iter)) {
> - int pages = min(nr_pages, max_pages_per_loop);
> + int pinned_pages = min(nr_pages, max_pages_per_loop);
>   int locked = 1;
>   size_t bytes;
>  
> @@ -106,14 +105,15 @@ static int process_vm_rw_single_vec(unsigned long addr,
>* current/current->mm
>*/
>   down_read(&mm->mmap_sem);
> - pages = get_user_pages_remote(task, mm, pa, pages, flags,
> -   process_pages, NULL, &locked);
> + pinned_pages = pin_user_pages_remote(task, mm, pa, pinned_pages,
> +  flags, process_pages,
> +  NULL, &locked);
>   if (locked)
>   up_read(&mm->mmap_sem);
> - if (pages <= 0)
> + if (pinned_pages <= 0)
>   return -EFAULT;
>  
> - bytes = pages * PAGE_SIZE - start_offset;
> + bytes = pinned_pages * PAGE_SIZE - start_offset;
>   if (bytes > len)
>   bytes = len;
>  
> @@ -122,10 +122,12 @@ static int process_vm_rw_single_vec(unsigned long addr,
>vm_write);
>   len -= bytes;
>   start_offset = 0;
> - nr_pages -= pages;
> - pa += pages * PAGE_SIZE;
> - while (pages)
> - put_page(process_pages[--pages]);
> + nr_pages -= pinned_pages;
> + pa += pinned_pages * PAGE_SIZE;
> +
> + /* If vm_write is set, the pages need to be made dirty: */
> + put_user_pages_dirty_lock(process_pages, pinned_pages,
> +   vm_write);
>   }
>  
>   return rc;
> -- 
> 2.23.0
> 



Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread Jerome Glisse
On Sun, Nov 03, 2019 at 01:18:00PM -0800, John Hubbard wrote:
> Introduce pin_user_pages*() variations of get_user_pages*() calls,
> and also pin_longterm_pages*() variations.
> 
> These variants all set FOLL_PIN, which is also introduced, and
> thoroughly documented.
> 
> The pin_longterm*() variants also set FOLL_LONGTERM, in addition
> to FOLL_PIN:
> 
> pin_user_pages()
> pin_user_pages_remote()
> pin_user_pages_fast()
> 
> pin_longterm_pages()
> pin_longterm_pages_remote()
> pin_longterm_pages_fast()
> 
> All pages that are pinned via the above calls, must be unpinned via
> put_user_page().
> 
> The underlying rules are:
> 
> * These are gup-internal flags, so the call sites should not directly
> set FOLL_PIN nor FOLL_LONGTERM. That behavior is enforced with
> assertions, for the new FOLL_PIN flag. However, for the pre-existing
> FOLL_LONGTERM flag, which has some call sites that still directly
> set FOLL_LONGTERM, there is no assertion yet.
> 
> * Call sites that want to indicate that they are going to do DirectIO
>   ("DIO") or something with similar characteristics, should call a
>   get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
>   will:
> * Start with "pin_user_pages" instead of "get_user_pages". That
>   makes it easy to find and audit the call sites.
> * Set FOLL_PIN
> 
> * For pages that are received via FOLL_PIN, those pages must be returned
>   via put_user_page().
> 
> Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
> in this documentation. (I've reworded it and expanded on it slightly.)
> 
> Cc: Jonathan Corbet 
> Cc: Ira Weiny 
> Signed-off-by: John Hubbard 

A few nitpicks below, nonetheless:

Reviewed-by: Jérôme Glisse 

> ---
>  Documentation/vm/index.rst  |   1 +
>  Documentation/vm/pin_user_pages.rst | 212 ++
>  include/linux/mm.h  |  62 ++-
>  mm/gup.c| 265 +---
>  4 files changed, 514 insertions(+), 26 deletions(-)
>  create mode 100644 Documentation/vm/pin_user_pages.rst
> 

[...]

> diff --git a/Documentation/vm/pin_user_pages.rst 
> b/Documentation/vm/pin_user_pages.rst
> new file mode 100644
> index ..3910f49ca98c
> --- /dev/null
> +++ b/Documentation/vm/pin_user_pages.rst

[...]

> +
> +FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
> +==
> +
> +Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for 
> describing
> +these categories:
> +
> +CASE 1: Direct IO (DIO)
> +---
> +There are GUP references to pages that are serving
> +as DIO buffers. These buffers are needed for a relatively short time (so they
> +are not "long term"). No special synchronization with page_mkclean() or
> +munmap() is provided. Therefore, flags to set at the call site are: ::
> +
> +FOLL_PIN
> +
> +...but rather than setting FOLL_PIN directly, call sites should use one of
> +the pin_user_pages*() routines that set FOLL_PIN.
> +
> +CASE 2: RDMA
> +
> +There are GUP references to pages that are serving as DMA
> +buffers. These buffers are needed for a long time ("long term"). No special
> +synchronization with page_mkclean() or munmap() is provided. Therefore, flags
> +to set at the call site are: ::
> +
> +FOLL_PIN | FOLL_LONGTERM
> +
> +NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. 
> That's
> +because DAX pages do not have a separate page cache, and so "pinning" implies
> +locking down file system blocks, which is not (yet) supported in that way.
> +
> +CASE 3: ODP
> +---
> +(Mellanox/Infiniband On Demand Paging: the hardware supports
> +replayable page faulting). There are GUP references to pages serving as DMA
> +buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean()
> +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
> +needs to be set.

I would not include ODP or anything like it here; they do not use
GUP anymore and I believe it is more confusing here. I would however
include some text in this documentation explaining that hardware
that supports page faults is superior, as it does not incur any
of the issues described here.

> +
> +CASE 4: Pinning for struct page manipulation only
> +-
> +Here, normal GUP calls are sufficient, so neither flag needs to be set.
> +

[...]
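
(As a concrete illustration of the CASE 1 pairing described above -- a sketch
of a DIO-style caller, not code from the series; the pin_user_pages() signature
is assumed to mirror get_user_pages():)

/* Pin user pages for a short-lived DIO-style operation, then release them. */
static int example_dio_rw(unsigned long uaddr, unsigned long nr_pages,
			  struct page **pages, bool write)
{
	long pinned;

	/* FOLL_PIN is set internally by the pin_user_pages*() wrapper. */
	pinned = pin_user_pages(uaddr, nr_pages,
				write ? FOLL_WRITE : 0, pages, NULL);
	if (pinned <= 0)
		return pinned ? (int)pinned : -EFAULT;

	/* ... do the I/O against pages[0..pinned-1] ... */

	/* Pages obtained via FOLL_PIN must be released via put_user_page*(). */
	put_user_pages_dirty_lock(pages, pinned, write);
	return 0;
}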

> diff --git a/mm/gup.c b/mm/gup.c
> index 199da99e8ffc..1aea48427879 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c

[...]

> @@ -1014,7 +1018,16 @@ static __always_inline long 
> __get_user_pages_locked(struct task_struct *tsk,
>   BUG_ON(*locked != 1);
>   }
>  
> - if (pages)
> + /*
> +  * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
> +  * is to set FOLL_GET if the caller wants pages[] filled in (but has
> +  * carelessly failed 

Re: [PATCH v2 03/18] goldish_pipe: rename local pin_user_pages() routine

2019-11-04 Thread Jerome Glisse
On Sun, Nov 03, 2019 at 01:17:58PM -0800, John Hubbard wrote:
> 1. Avoid naming conflicts: rename local static function from
> "pin_user_pages()" to "pin_goldfish_pages()".
> 
> An upcoming patch will introduce a global pin_user_pages()
> function.
> 
> Reviewed-by: Ira Weiny 
> Signed-off-by: John Hubbard 

Reviewed-by: Jérôme Glisse 

> ---
>  drivers/platform/goldfish/goldfish_pipe.c | 18 +-
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/platform/goldfish/goldfish_pipe.c 
> b/drivers/platform/goldfish/goldfish_pipe.c
> index cef0133aa47a..7ed2a21a0bac 100644
> --- a/drivers/platform/goldfish/goldfish_pipe.c
> +++ b/drivers/platform/goldfish/goldfish_pipe.c
> @@ -257,12 +257,12 @@ static int goldfish_pipe_error_convert(int status)
>   }
>  }
>  
> -static int pin_user_pages(unsigned long first_page,
> -   unsigned long last_page,
> -   unsigned int last_page_size,
> -   int is_write,
> -   struct page *pages[MAX_BUFFERS_PER_COMMAND],
> -   unsigned int *iter_last_page_size)
> +static int pin_goldfish_pages(unsigned long first_page,
> +   unsigned long last_page,
> +   unsigned int last_page_size,
> +   int is_write,
> +   struct page *pages[MAX_BUFFERS_PER_COMMAND],
> +   unsigned int *iter_last_page_size)
>  {
>   int ret;
>   int requested_pages = ((last_page - first_page) >> PAGE_SHIFT) + 1;
> @@ -354,9 +354,9 @@ static int transfer_max_buffers(struct goldfish_pipe 
> *pipe,
>   if (mutex_lock_interruptible(&pipe->lock))
>   return -ERESTARTSYS;
>  
> - pages_count = pin_user_pages(first_page, last_page,
> -  last_page_size, is_write,
> -  pipe->pages, &iter_last_page_size);
> + pages_count = pin_goldfish_pages(first_page, last_page,
> +  last_page_size, is_write,
> +  pipe->pages, &iter_last_page_size);
>   if (pages_count < 0) {
>   mutex_unlock(&pipe->lock);
>   return pages_count;
> -- 
> 2.23.0
> 



Re: [PATCH v2 02/18] mm/gup: factor out duplicate code from four routines

2019-11-04 Thread Jerome Glisse
On Sun, Nov 03, 2019 at 01:17:57PM -0800, John Hubbard wrote:
> There are four locations in gup.c that have a fair amount of code
> duplication. This means that changing one requires making the same
> changes in four places, not to mention reading the same code four
> times, and wondering if there are subtle differences.
> 
> Factor out the common code into static functions, thus reducing the
> overall line count and the code's complexity.
> 
> Also, take the opportunity to slightly improve the efficiency of the
> error cases, by doing a mass subtraction of the refcount, surrounded
> by get_page()/put_page().
> 
> Also, further simplify (slightly) by waiting until the successful
> end of each routine to increment *nr.
> 
> Cc: Ira Weiny 
> Cc: Christoph Hellwig 
> Cc: Aneesh Kumar K.V 
> Signed-off-by: John Hubbard 

Good cleanup.

Reviewed-by: Jérôme Glisse 

> ---
>  mm/gup.c | 104 ---
>  1 file changed, 45 insertions(+), 59 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 85caf76b3012..199da99e8ffc 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1969,6 +1969,34 @@ static int __gup_device_huge_pud(pud_t pud, pud_t 
> *pudp, unsigned long addr,
>  }
>  #endif
>  
> +static int __record_subpages(struct page *page, unsigned long addr,
> +  unsigned long end, struct page **pages, int nr)
> +{
> + int nr_recorded_pages = 0;
> +
> + do {
> + pages[nr] = page;
> + nr++;
> + page++;
> + nr_recorded_pages++;
> + } while (addr += PAGE_SIZE, addr != end);
> + return nr_recorded_pages;
> +}
> +
> +static void put_compound_head(struct page *page, int refs)
> +{
> + /* Do a get_page() first, in case refs == page->_refcount */
> + get_page(page);
> + page_ref_sub(page, refs);
> + put_page(page);
> +}
> +
> +static void __huge_pt_done(struct page *head, int nr_recorded_pages, int *nr)
> +{
> + *nr += nr_recorded_pages;
> + SetPageReferenced(head);
> +}
> +
>  #ifdef CONFIG_ARCH_HAS_HUGEPD
>  static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
> unsigned long sz)
> @@ -1998,33 +2026,20 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, 
> unsigned long addr,
>   /* hugepages are never "special" */
>   VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>  
> - refs = 0;
>   head = pte_page(pte);
> -
>   page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
> - do {
> - VM_BUG_ON(compound_head(page) != head);
> - pages[*nr] = page;
> - (*nr)++;
> - page++;
> - refs++;
> - } while (addr += PAGE_SIZE, addr != end);
> + refs = __record_subpages(page, addr, end, pages, *nr);
>  
>   head = try_get_compound_head(head, refs);
> - if (!head) {
> - *nr -= refs;
> + if (!head)
>   return 0;
> - }
>  
>   if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> - /* Could be optimized better */
> - *nr -= refs;
> - while (refs--)
> - put_page(head);
> + put_compound_head(head, refs);
>   return 0;
>   }
>  
> - SetPageReferenced(head);
> + __huge_pt_done(head, refs, nr);
>   return 1;
>  }
>  
> @@ -2071,29 +2086,19 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
> unsigned long addr,
>pages, nr);
>   }
>  
> - refs = 0;
>   page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> - do {
> - pages[*nr] = page;
> - (*nr)++;
> - page++;
> - refs++;
> - } while (addr += PAGE_SIZE, addr != end);
> + refs = __record_subpages(page, addr, end, pages, *nr);
>  
>   head = try_get_compound_head(pmd_page(orig), refs);
> - if (!head) {
> - *nr -= refs;
> + if (!head)
>   return 0;
> - }
>  
>   if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> - *nr -= refs;
> - while (refs--)
> - put_page(head);
> + put_compound_head(head, refs);
>   return 0;
>   }
>  
> - SetPageReferenced(head);
> + __huge_pt_done(head, refs, nr);
>   return 1;
>  }
>  
> @@ -2114,29 +2119,19 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, 
> unsigned long addr,
>pages, nr);
>   }
>  
> - refs = 0;
>   page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> - do {
> - pages[*nr] = page;
> - (*nr)++;
> - page++;
> - refs++;
> - } while (addr += PAGE_SIZE, addr != end);
> + refs = __record_subpages(page, addr, end, pages, *nr);
>  
>   head = try_get_compound_head(pud_page(orig), refs);
> - if (!head) {
> - *nr -= refs;
> + if (!head)

Re: [PATCH v2 01/18] mm/gup: pass flags arg to __gup_device_* functions

2019-11-04 Thread Jerome Glisse
On Sun, Nov 03, 2019 at 01:17:56PM -0800, John Hubbard wrote:
> A subsequent patch requires access to gup flags, so
> pass the flags argument through to the __gup_device_*
> functions.
> 
> Also placate checkpatch.pl by shortening a nearby line.
> 
> Reviewed-by: Ira Weiny 
> Cc: Kirill A. Shutemov 
> Signed-off-by: John Hubbard 

Reviewed-by: Jérôme Glisse 

> ---
>  mm/gup.c | 28 ++--
>  1 file changed, 18 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 8f236a335ae9..85caf76b3012 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1890,7 +1890,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
> unsigned long end,
>  
>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && 
> defined(CONFIG_TRANSPARENT_HUGEPAGE)
>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   int nr_start = *nr;
>   struct dev_pagemap *pgmap = NULL;
> @@ -1916,13 +1917,14 @@ static int __gup_device_huge(unsigned long pfn, 
> unsigned long addr,
>  }
>  
>  static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   unsigned long fault_pfn;
>   int nr_start = *nr;
>  
>   fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> - if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
> + if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
>   return 0;
>  
>   if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> @@ -1933,13 +1935,14 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t 
> *pmdp, unsigned long addr,
>  }
>  
>  static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   unsigned long fault_pfn;
>   int nr_start = *nr;
>  
>   fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> - if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
> + if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
>   return 0;
>  
>   if (unlikely(pud_val(orig) != pud_val(*pudp))) {
> @@ -1950,14 +1953,16 @@ static int __gup_device_huge_pud(pud_t orig, pud_t 
> *pudp, unsigned long addr,
>  }
>  #else
>  static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   BUILD_BUG();
>   return 0;
>  }
>  
>  static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   BUILD_BUG();
>   return 0;
> @@ -2062,7 +2067,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
> unsigned long addr,
>   if (pmd_devmap(orig)) {
>   if (unlikely(flags & FOLL_LONGTERM))
>   return 0;
> - return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
> + return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
> +  pages, nr);
>   }
>  
>   refs = 0;
> @@ -2092,7 +2098,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
> unsigned long addr,
>  }
>  
>  static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> - unsigned long end, unsigned int flags, struct page **pages, int 
> *nr)
> + unsigned long end, unsigned int flags,
> + struct page **pages, int *nr)
>  {
>   struct page *head, *page;
>   int refs;
> @@ -2103,7 +2110,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, 
> unsigned long addr,
>   if (pud_devmap(orig)) {
>   if (unlikely(flags & FOLL_LONGTERM))
>   return 0;
> - return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
> + return __gup_device_huge_pud(orig, pudp, addr, end, flags,
> +  pages, nr);
>   }
>  
>   refs = 0;
> -- 
> 2.23.0
> 



Re: [PATCH v12 22/31] mm: provide speculative fault infrastructure

2019-04-24 Thread Jerome Glisse
On Wed, Apr 24, 2019 at 04:56:14PM +0200, Laurent Dufour wrote:
> On 22/04/2019 at 23:26, Jerome Glisse wrote:
> > On Tue, Apr 16, 2019 at 03:45:13PM +0200, Laurent Dufour wrote:
> > > From: Peter Zijlstra 
> > > 
> > > Provide infrastructure to do a speculative fault (not holding
> > > mmap_sem).
> > > 
> > > The not holding of mmap_sem means we can race against VMA
> > > change/removal and page-table destruction. We use the SRCU VMA freeing
> > > to keep the VMA around. We use the VMA seqcount to detect change
> > > (including umapping / page-table deletion) and we use gup_fast() style
> > > page-table walking to deal with page-table races.
> > > 
> > > Once we've obtained the page and are ready to update the PTE, we
> > > validate if the state we started the fault with is still valid, if
> > > not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
> > > PTE and we're done.
> > > 
> > > Signed-off-by: Peter Zijlstra (Intel) 
> > > 
> > > [Manage the newly introduced pte_spinlock() for speculative page
> > >   fault to fail if the VMA is touched in our back]
> > > [Rename vma_is_dead() to vma_has_changed() and declare it here]
> > > [Fetch p4d and pud]
> > > [Set vmd.sequence in __handle_mm_fault()]
> > > [Abort speculative path when handle_userfault() has to be called]
> > > [Add additional VMA's flags checks in handle_speculative_fault()]
> > > [Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
> > > [Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
> > > [Remove warning comment about waiting for !seq&1 since we don't want
> > >   to wait]
> > > [Remove warning about no huge page support, mention it explicitly]
> > > [Don't call do_fault() in the speculative path as __do_fault() calls
> > >   vma->vm_ops->fault() which may want to release mmap_sem]
> > > [Only vm_fault pointer argument for vma_has_changed()]
> > > [Fix check against huge page, calling pmd_trans_huge()]
> > > [Use READ_ONCE() when reading VMA's fields in the speculative path]
> > > [Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support for
> > >   processing done in vm_normal_page()]
> > > [Check that vma->anon_vma is already set when starting the speculative
> > >   path]
> > > [Check for memory policy as we can't support MPOL_INTERLEAVE case due to
> > >   the processing done in mpol_misplaced()]
> > > [Don't support VMA growing up or down]
> > > [Move check on vm_sequence just before calling handle_pte_fault()]
> > > [Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT]
> > > [Add mem cgroup oom check]
> > > [Use READ_ONCE to access p*d entries]
> > > [Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()]
> > > [Don't fetch pte again in handle_pte_fault() when running the speculative
> > >   path]
> > > [Check PMD against concurrent collapsing operation]
> > > [Try spin lock the pte during the speculative path to avoid deadlock with
> > >   other CPU's invalidating the TLB and requiring this CPU to catch the
> > >   inter processor's interrupt]
> > > [Move define of FAULT_FLAG_SPECULATIVE here]
> > > [Introduce __handle_speculative_fault() and add a check against
> > >   mm->mm_users in handle_speculative_fault() defined in mm.h]
> > > [Abort if vm_ops->fault is set instead of checking only vm_ops]
> > > [Use find_vma_rcu() and call put_vma() when we are done with the VMA]
> > > Signed-off-by: Laurent Dufour 
> > 
> > 
> > Few comments and questions for this one see below.
> > 
> > 
> > > ---
> > >   include/linux/hugetlb_inline.h |   2 +-
> > >   include/linux/mm.h |  30 +++
> > >   include/linux/pagemap.h|   4 +-
> > >   mm/internal.h  |  15 ++
> > >   mm/memory.c| 344 -
> > >   5 files changed, 389 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/include/linux/hugetlb_inline.h 
> > > b/include/linux/hugetlb_inline.h
> > > index 0660a03d37d9..9e25283d6fc9 100644
> > > --- a/include/linux/hugetlb_inline.h
> > > +++ b/include/linux/hugetlb_inline.h
> > > @@ -8,7 +8,7 @@
> > >   static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
> > >   {
> > > - return !!(vma->vm_flags & VM_HUGETLB);
> > > + 

Re: [PATCH v12 23/31] mm: don't do swap readahead during speculative page fault

2019-04-22 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:45:14PM +0200, Laurent Dufour wrote:
> Vinayak Menon faced a panic because one thread was page faulting a page in
> swap, while another one was mprotecting a part of the VMA leading to a VMA
> split.
> This raised a panic in swap_vma_readahead() because the VMA's boundaries
> no longer matched the faulting address.
> 
> To avoid this, if the page is not found in the swap, the speculative page
> fault is aborted to retry a regular page fault.
> 
> Reported-by: Vinayak Menon 
> Signed-off-by: Laurent Dufour 

Reviewed-by: Jérôme Glisse 

Note that you should also skip non-swap entries in do_swap_page() when doing
a speculative page fault; at the very least you need to handle the
is_device_private_entry() case.

But this should either be part of patch 22 or another patch to fix the swap
case.
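
(Something along these lines, as a sketch of the extra bail-out being
suggested -- an untested fragment for the do_swap_page() path only:)

	/* Sketch: right after decoding the swap entry in do_swap_page(). */
	entry = pte_to_swp_entry(vmf->orig_pte);
	if ((vmf->flags & FAULT_FLAG_SPECULATIVE) &&
	    unlikely(non_swap_entry(entry))) {
		/*
		 * Migration and device-private entries need stable VMA and
		 * page-table state; fall back to a regular page fault.
		 */
		ret = VM_FAULT_RETRY;
		goto out;
	}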

> ---
>  mm/memory.c | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 6e6bf61c0e5c..1991da97e2db 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2900,6 +2900,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   lru_cache_add_anon(page);
>   swap_readpage(page, true);
>   }
> + } else if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
> + /*
> +  * Don't try readahead during a speculative page fault
> +  * as the VMA's boundaries may change in our back.
> +  * If the page is not in the swap cache and synchronous
> +  * read is disabled, fall back to the regular page
> +  * fault mechanism.
> +  */
> + delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
> + ret = VM_FAULT_RETRY;
> + goto out;
>   } else {
>   page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
>   vmf);
> -- 
> 2.21.0
> 


Re: [PATCH v12 22/31] mm: provide speculative fault infrastructure

2019-04-22 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:45:13PM +0200, Laurent Dufour wrote:
> From: Peter Zijlstra 
> 
> Provide infrastructure to do a speculative fault (not holding
> mmap_sem).
> 
> The not holding of mmap_sem means we can race against VMA
> change/removal and page-table destruction. We use the SRCU VMA freeing
> to keep the VMA around. We use the VMA seqcount to detect change
> (including unmapping / page-table deletion) and we use gup_fast() style
> page-table walking to deal with page-table races.
> 
> Once we've obtained the page and are ready to update the PTE, we
> validate if the state we started the fault with is still valid, if
> not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
> PTE and we're done.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> 
> [Manage the newly introduced pte_spinlock() for speculative page
>  fault to fail if the VMA is touched in our back]
> [Rename vma_is_dead() to vma_has_changed() and declare it here]
> [Fetch p4d and pud]
> [Set vmd.sequence in __handle_mm_fault()]
> [Abort speculative path when handle_userfault() has to be called]
> [Add additional VMA's flags checks in handle_speculative_fault()]
> [Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
> [Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
> [Remove warning comment about waiting for !seq&1 since we don't want
>  to wait]
> [Remove warning about no huge page support, mention it explicitly]
> [Don't call do_fault() in the speculative path as __do_fault() calls
>  vma->vm_ops->fault() which may want to release mmap_sem]
> [Only vm_fault pointer argument for vma_has_changed()]
> [Fix check against huge page, calling pmd_trans_huge()]
> [Use READ_ONCE() when reading VMA's fields in the speculative path]
> [Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support for
>  processing done in vm_normal_page()]
> [Check that vma->anon_vma is already set when starting the speculative
>  path]
> [Check for memory policy as we can't support MPOL_INTERLEAVE case due to
>  the processing done in mpol_misplaced()]
> [Don't support VMA growing up or down]
> [Move check on vm_sequence just before calling handle_pte_fault()]
> [Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT]
> [Add mem cgroup oom check]
> [Use READ_ONCE to access p*d entries]
> [Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()]
> [Don't fetch pte again in handle_pte_fault() when running the speculative
>  path]
> [Check PMD against concurrent collapsing operation]
> [Try spin lock the pte during the speculative path to avoid deadlock with
>  other CPU's invalidating the TLB and requiring this CPU to catch the
>  inter processor's interrupt]
> [Move define of FAULT_FLAG_SPECULATIVE here]
> [Introduce __handle_speculative_fault() and add a check against
>  mm->mm_users in handle_speculative_fault() defined in mm.h]
> [Abort if vm_ops->fault is set instead of checking only vm_ops]
> [Use find_vma_rcu() and call put_vma() when we are done with the VMA]
> Signed-off-by: Laurent Dufour 


A few comments and questions for this one, see below.


> ---
>  include/linux/hugetlb_inline.h |   2 +-
>  include/linux/mm.h |  30 +++
>  include/linux/pagemap.h|   4 +-
>  mm/internal.h  |  15 ++
>  mm/memory.c| 344 -
>  5 files changed, 389 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
> index 0660a03d37d9..9e25283d6fc9 100644
> --- a/include/linux/hugetlb_inline.h
> +++ b/include/linux/hugetlb_inline.h
> @@ -8,7 +8,7 @@
>  
>  static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
>  {
> - return !!(vma->vm_flags & VM_HUGETLB);
> + return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB);
>  }
>  
>  #else
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f761a9c65c74..ec609cbad25a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -381,6 +381,7 @@ extern pgprot_t protection_map[16];
>  #define FAULT_FLAG_USER  0x40/* The fault originated in 
> userspace */
>  #define FAULT_FLAG_REMOTE0x80/* faulting for non current tsk/mm */
>  #define FAULT_FLAG_INSTRUCTION  0x100/* The fault was during an 
> instruction fetch */
> +#define FAULT_FLAG_SPECULATIVE   0x200   /* Speculative fault, not 
> holding mmap_sem */
>  
>  #define FAULT_FLAG_TRACE \
>   { FAULT_FLAG_WRITE, "WRITE" }, \
> @@ -409,6 +410,10 @@ struct vm_fault {
>   gfp_t gfp_mask; /* gfp mask to be used for allocations 
> */
>   pgoff_t pgoff;  /* Logical page offset based on vma */
>   unsigned long address;  /* Faulting virtual address */
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + unsigned int sequence;
> + pmd_t orig_pmd; /* value of PMD at the time of fault */
> +#endif
>   pmd_t *pmd; /* Pointer to pmd entry 

Re: [PATCH v12 21/31] mm: Introduce find_vma_rcu()

2019-04-22 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:45:12PM +0200, Laurent Dufour wrote:
> This allows searching for a VMA structure without holding the mmap_sem.
> 
> The search is repeated while the mm seqlock is changing, until a valid VMA
> is found.
> 
> While under RCU protection, a reference is taken on the VMA, so the caller
> must call put_vma() once it no longer needs the VMA structure.
> 
> At the time a VMA is inserted into the MM RB tree, in vma_rb_insert(), a
> reference is taken on the VMA by calling get_vma().
> 
> When removing a VMA from the MM RB tree, the VMA is not released
> immediately but at the end of the RCU grace period, through vm_rcu_put().
> This ensures that the VMA remains allocated until the end of the RCU grace
> period.
> 
> Since the vm_file pointer, if valid, is released in put_vma(), there is no
> guarantee that the file pointer will be valid on the returned VMA.
> 
> Signed-off-by: Laurent Dufour 

Minor comments about a comment (I love recursion :)), see below.

Reviewed-by: Jérôme Glisse 
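
For reference, a caller of find_vma_rcu() is expected to do roughly the
following (just a sketch; the bounds check and the fallback label are mine,
and remember that vma->vm_file may already have been put):

        vma = find_vma_rcu(mm, address);
        if (!vma)
                goto fallback;          /* no VMA, take the regular path */
        if (address < vma->vm_start || address >= vma->vm_end) {
                put_vma(vma);
                goto fallback;
        }

        /* ... speculative handling, never touching vma->vm_file ... */

        put_vma(vma);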

> ---
>  include/linux/mm_types.h |  1 +
>  mm/internal.h|  5 ++-
>  mm/mmap.c| 76 ++--
>  3 files changed, 78 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6a6159e11a3f..9af6694cb95d 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -287,6 +287,7 @@ struct vm_area_struct {
>  
>  #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>   atomic_t vm_ref_count;
> + struct rcu_head vm_rcu;
>  #endif
>   struct rb_node vm_rb;
>  
> diff --git a/mm/internal.h b/mm/internal.h
> index 302382bed406..1e368e4afe3c 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -55,7 +55,10 @@ static inline void put_vma(struct vm_area_struct *vma)
>   __free_vma(vma);
>  }
>  
> -#else
> +extern struct vm_area_struct *find_vma_rcu(struct mm_struct *mm,
> +unsigned long addr);
> +
> +#else /* CONFIG_SPECULATIVE_PAGE_FAULT */
>  
>  static inline void get_vma(struct vm_area_struct *vma)
>  {
> diff --git a/mm/mmap.c b/mm/mmap.c
> index c106440dcae7..34bf261dc2c8 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -179,6 +179,18 @@ static inline void mm_write_sequnlock(struct mm_struct 
> *mm)
>  {
>   write_sequnlock(>mm_seq);
>  }
> +
> +static void __vm_rcu_put(struct rcu_head *head)
> +{
> + struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
> +   vm_rcu);
> + put_vma(vma);
> +}
> +static void vm_rcu_put(struct vm_area_struct *vma)
> +{
> + VM_BUG_ON_VMA(!RB_EMPTY_NODE(>vm_rb), vma);
> + call_rcu(>vm_rcu, __vm_rcu_put);
> +}
>  #else
>  static inline void mm_write_seqlock(struct mm_struct *mm)
>  {
> @@ -190,6 +202,8 @@ static inline void mm_write_sequnlock(struct mm_struct 
> *mm)
>  
>  void __free_vma(struct vm_area_struct *vma)
>  {
> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT))
> + VM_BUG_ON_VMA(!RB_EMPTY_NODE(>vm_rb), vma);
>   mpol_put(vma_policy(vma));
>   vm_area_free(vma);
>  }
> @@ -197,11 +211,24 @@ void __free_vma(struct vm_area_struct *vma)
>  /*
>   * Close a vm structure and free it, returning the next.
>   */
> -static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
> +static struct vm_area_struct *__remove_vma(struct vm_area_struct *vma)
>  {
>   struct vm_area_struct *next = vma->vm_next;
>  
>   might_sleep();
> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT) &&
> + !RB_EMPTY_NODE(>vm_rb)) {
> + /*
> +  * If the VMA is still linked in the RB tree, we must release
> +  * that reference by calling put_vma().
> +  * This should only happen when called from exit_mmap().
> +  * We forcely clear the node to satisfy the chec in
^
Typo: chec -> check

> +  * __free_vma(). This is safe since the RB tree is not walked
> +  * anymore.
> +  */
> + RB_CLEAR_NODE(>vm_rb);
> + put_vma(vma);
> + }
>   if (vma->vm_ops && vma->vm_ops->close)
>   vma->vm_ops->close(vma);
>   if (vma->vm_file)
> @@ -211,6 +238,13 @@ static struct vm_area_struct *remove_vma(struct 
> vm_area_struct *vma)
>   return next;
>  }
>  
> +static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
> +{
> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT))
> + VM_BUG_ON_VMA(!RB_EMPTY_NODE(>vm_rb), vma);

It would be worth adding a comment here explaining the BUG_ON so people can
understand what is wrong if it triggers. For instance:

/*
 * remove_vma() should be called only once a vma has been removed from the
 * rbtree, at which point vma->vm_rb is an empty node. The exception is when
 * vmas are destroyed through exit_mmap(), in which case we do not bother
 * updating the rbtree (see comment in __remove_vma()).
 */

Re: [PATCH v12 20/31] mm: introduce vma reference counter

2019-04-22 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:45:11PM +0200, Laurent Dufour wrote:
> The final goal is to be able to use a VMA structure without holding the
> mmap_sem and to be sure that the structure will not be freed behind our
> back.
> 
> The lockless use of the VMA will be done through RCU protection and thus a
> dedicated freeing service is required to manage it asynchronously.
> 
> As reported in a 2010 thread [1], this may impact file handling when a
> file is still referenced while the mapping is no longer there.  As the
> final goal is to handle anonymous VMAs in a speculative way and not file
> backed mappings, we can close and free the file pointer synchronously, as
> soon as we are guaranteed not to use it without holding the mmap_sem. As a
> sanity measure, with minimal effort, the vm_file pointer is unset once the
> file pointer is put.
> 
> [1] https://lore.kernel.org/linux-mm/20100104182429.833180...@chello.nl/
> 
> Signed-off-by: Laurent Dufour 

Using a kref would have been better from my POV, even with RCU freeing,
but anyway:

Reviewed-by: Jérôme Glisse 
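
For what it's worth, a kref-based variant would look roughly like this
(only a sketch of the idea, not what the patch does; vm_ref is a made-up
field name that would replace the atomic_t vm_ref_count and be set up with
kref_init() in vma_init()/INIT_VMA()):

        static void vma_release(struct kref *kref)
        {
                struct vm_area_struct *vma =
                        container_of(kref, struct vm_area_struct, vm_ref);

                __free_vma(vma);
        }

        static inline void get_vma(struct vm_area_struct *vma)
        {
                kref_get(&vma->vm_ref);
        }

        static inline void put_vma(struct vm_area_struct *vma)
        {
                kref_put(&vma->vm_ref, vma_release);
        }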

> ---
>  include/linux/mm.h   |  4 
>  include/linux/mm_types.h |  3 +++
>  mm/internal.h| 27 +++
>  mm/mmap.c| 13 +
>  4 files changed, 43 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f14b2c9ddfd4..f761a9c65c74 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -529,6 +529,9 @@ static inline void vma_init(struct vm_area_struct *vma, 
> struct mm_struct *mm)
>   vma->vm_mm = mm;
>   vma->vm_ops = _vm_ops;
>   INIT_LIST_HEAD(>anon_vma_chain);
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + atomic_set(>vm_ref_count, 1);
> +#endif
>  }
>  
>  static inline void vma_set_anonymous(struct vm_area_struct *vma)
> @@ -1418,6 +1421,7 @@ static inline void INIT_VMA(struct vm_area_struct *vma)
>   INIT_LIST_HEAD(>anon_vma_chain);
>  #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>   seqcount_init(>vm_sequence);
> + atomic_set(>vm_ref_count, 1);
>  #endif
>  }
>  
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 24b3f8ce9e42..6a6159e11a3f 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -285,6 +285,9 @@ struct vm_area_struct {
>   /* linked list of VM areas per task, sorted by address */
>   struct vm_area_struct *vm_next, *vm_prev;
>  
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + atomic_t vm_ref_count;
> +#endif
>   struct rb_node vm_rb;
>  
>   /*
> diff --git a/mm/internal.h b/mm/internal.h
> index 9eeaf2b95166..302382bed406 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -40,6 +40,33 @@ void page_writeback_init(void);
>  
>  vm_fault_t do_swap_page(struct vm_fault *vmf);
>  
> +
> +extern void __free_vma(struct vm_area_struct *vma);
> +
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +static inline void get_vma(struct vm_area_struct *vma)
> +{
> + atomic_inc(>vm_ref_count);
> +}
> +
> +static inline void put_vma(struct vm_area_struct *vma)
> +{
> + if (atomic_dec_and_test(>vm_ref_count))
> + __free_vma(vma);
> +}
> +
> +#else
> +
> +static inline void get_vma(struct vm_area_struct *vma)
> +{
> +}
> +
> +static inline void put_vma(struct vm_area_struct *vma)
> +{
> + __free_vma(vma);
> +}
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> +
>  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>   unsigned long floor, unsigned long ceiling);
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f7f6027a7dff..c106440dcae7 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -188,6 +188,12 @@ static inline void mm_write_sequnlock(struct mm_struct 
> *mm)
>  }
>  #endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>  
> +void __free_vma(struct vm_area_struct *vma)
> +{
> + mpol_put(vma_policy(vma));
> + vm_area_free(vma);
> +}
> +
>  /*
>   * Close a vm structure and free it, returning the next.
>   */
> @@ -200,8 +206,8 @@ static struct vm_area_struct *remove_vma(struct 
> vm_area_struct *vma)
>   vma->vm_ops->close(vma);
>   if (vma->vm_file)
>   fput(vma->vm_file);
> - mpol_put(vma_policy(vma));
> - vm_area_free(vma);
> + vma->vm_file = NULL;
> + put_vma(vma);
>   return next;
>  }
>  
> @@ -990,8 +996,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned 
> long start,
>   if (next->anon_vma)
>   anon_vma_merge(vma, next);
>   mm->map_count--;
> - mpol_put(vma_policy(next));
> - vm_area_free(next);
> + put_vma(next);
>   /*
>* In mprotect's case 6 (see comments on vma_merge),
>* we must remove another next too. It would clutter
> -- 
> 2.21.0
> 


Re: [PATCH v12 19/31] mm: protect the RB tree with a sequence lock

2019-04-22 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:45:10PM +0200, Laurent Dufour wrote:
> Introduce a per mm_struct seqlock, the mm_seq field, to protect the
> changes made to the MM RB tree. This allows walking the RB tree without
> grabbing the mmap_sem; once the walk is done, the sequence counter is
> checked to verify that it was stable during the walk.
> 
> The mm seqlock is held while inserting and removing entries in the MM RB
> tree.  Later in this series, it will be checked when looking for a VMA
> without holding the mmap_sem.
> 
> This is based on the initial work from Peter Zijlstra:
> https://lore.kernel.org/linux-mm/20100104182813.479668...@chello.nl/
> 
> Signed-off-by: Laurent Dufour 

Reviewed-by: Jérôme Glisse 
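
For context, the read side added later in the series retries the RB tree
walk along these lines (simplified sketch; the RCU protection and the VMA
reference counting from the other patches are omitted here):

        unsigned int seq;
        struct vm_area_struct *vma;

        do {
                seq = read_seqbegin(&mm->mm_seq);
                vma = find_vma(mm, addr);       /* plain RB tree walk */
        } while (read_seqretry(&mm->mm_seq, seq));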

> ---
>  include/linux/mm_types.h |  3 +++
>  kernel/fork.c|  3 +++
>  mm/init-mm.c |  3 +++
>  mm/mmap.c| 48 +++-
>  4 files changed, 46 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index e78f72eb2576..24b3f8ce9e42 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -358,6 +358,9 @@ struct mm_struct {
>   struct {
>   struct vm_area_struct *mmap;/* list of VMAs */
>   struct rb_root mm_rb;
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + seqlock_t mm_seq;
> +#endif
>   u64 vmacache_seqnum;   /* per-thread vmacache */
>  #ifdef CONFIG_MMU
>   unsigned long (*get_unmapped_area) (struct file *filp,
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 2992d2c95256..3a1739197ebc 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1008,6 +1008,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, 
> struct task_struct *p,
>   mm->mmap = NULL;
>   mm->mm_rb = RB_ROOT;
>   mm->vmacache_seqnum = 0;
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + seqlock_init(>mm_seq);
> +#endif
>   atomic_set(>mm_users, 1);
>   atomic_set(>mm_count, 1);
>   init_rwsem(>mmap_sem);
> diff --git a/mm/init-mm.c b/mm/init-mm.c
> index a787a319211e..69346b883a4e 100644
> --- a/mm/init-mm.c
> +++ b/mm/init-mm.c
> @@ -27,6 +27,9 @@
>   */
>  struct mm_struct init_mm = {
>   .mm_rb  = RB_ROOT,
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + .mm_seq = __SEQLOCK_UNLOCKED(init_mm.mm_seq),
> +#endif
>   .pgd= swapper_pg_dir,
>   .mm_users   = ATOMIC_INIT(2),
>   .mm_count   = ATOMIC_INIT(1),
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 13460b38b0fb..f7f6027a7dff 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -170,6 +170,24 @@ void unlink_file_vma(struct vm_area_struct *vma)
>   }
>  }
>  
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +static inline void mm_write_seqlock(struct mm_struct *mm)
> +{
> + write_seqlock(>mm_seq);
> +}
> +static inline void mm_write_sequnlock(struct mm_struct *mm)
> +{
> + write_sequnlock(>mm_seq);
> +}
> +#else
> +static inline void mm_write_seqlock(struct mm_struct *mm)
> +{
> +}
> +static inline void mm_write_sequnlock(struct mm_struct *mm)
> +{
> +}
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> +
>  /*
>   * Close a vm structure and free it, returning the next.
>   */
> @@ -445,26 +463,32 @@ static void vma_gap_update(struct vm_area_struct *vma)
>  }
>  
>  static inline void vma_rb_insert(struct vm_area_struct *vma,
> -  struct rb_root *root)
> +  struct mm_struct *mm)
>  {
> + struct rb_root *root = >mm_rb;
> +
>   /* All rb_subtree_gap values must be consistent prior to insertion */
>   validate_mm_rb(root, NULL);
>  
>   rb_insert_augmented(>vm_rb, root, _gap_callbacks);
>  }
>  
> -static void __vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
> +static void __vma_rb_erase(struct vm_area_struct *vma, struct mm_struct *mm)
>  {
> + struct rb_root *root = >mm_rb;
> +
>   /*
>* Note rb_erase_augmented is a fairly large inline function,
>* so make sure we instantiate it only once with our desired
>* augmented rbtree callbacks.
>*/
> + mm_write_seqlock(mm);
>   rb_erase_augmented(>vm_rb, root, _gap_callbacks);
> + mm_write_sequnlock(mm); /* wmb */
>  }
>  
>  static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
> - struct rb_root *root,
> + struct mm_struct *mm,
>   struct vm_area_struct *ignore)
>  {
>   /*
> @@ -472,21 +496,21 @@ static __always_inline void vma_rb_erase_ignore(struct 
> vm_area_struct *vma,
>* with the possible exception of the "next" vma being erased if
>* next->vm_start was reduced.
>*/
> - validate_mm_rb(root, ignore);
> + validate_mm_rb(>mm_rb, ignore);
>  
> - __vma_rb_erase(vma, root);
> + __vma_rb_erase(vma, mm);
>  }
>  
>  

Re: [PATCH v12 18/31] mm: protect against PTE changes done by dup_mmap()

2019-04-22 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:45:09PM +0200, Laurent Dufour wrote:
> Vinayak Menon and Ganesh Mahendran reported that the following scenario may
> lead to a thread being blocked due to data corruption:
> 
> CPU 1   CPU 2CPU 3
> Process 1,  Process 1,   Process 1,
> Thread AThread B Thread C
> 
> while (1) { while (1) {  while(1) {
> pthread_mutex_lock(l)   pthread_mutex_lock(l)fork
> pthread_mutex_unlock(l) pthread_mutex_unlock(l)  }
> }   }
> 
> In detail, this happens because:
> 
> CPU 1CPU 2   CPU 3
> fork()
> copy_pte_range()
>   set PTE rdonly
> got to next VMA...
>  .   PTE is seen rdonly  PTE still writable
>  .   thread is writing to page
>  .   -> page fault
>  . copy the page Thread writes to page
>  .  .-> no page fault
>  . update the PTE
>  . flush TLB for that PTE
>flush TLBPTE are now rdonly

Shouldn't the fork be on CPU 3, to be consistent with the diagram above
(just to make it easier to read and to go from one to the other, since a
thread can move from one CPU to another)?

> 
> So the write done by CPU 3 interferes with the page copy operation done by
> CPU 2, leading to the data corruption.
> 
> To avoid this we mark all the VMAs involved in the COW mechanism as
> changing by calling vm_write_begin(). This ensures that the speculative
> page fault handler will not try to handle a fault on these pages.
> The marker is kept set until the TLB is flushed, ensuring that all the
> CPUs will now see the PTE as not writable.
> Once the TLB is flushed, the marker is removed by calling vm_write_end().
> 
> The variable 'last' is used to keep track of the latest VMA marked, to
> handle the error path where only part of the VMAs may have been marked.
> 
> Since multiple VMAs from the same mm may have their sequence count
> increased during this process, vm_raw_write_begin/end() must be used to
> avoid false lockdep warning messages.
> 
> Reported-by: Ganesh Mahendran 
> Reported-by: Vinayak Menon 
> Signed-off-by: Laurent Dufour 

A minor comment (see below)

Reviewed-by: Jérome Glisse 

> ---
>  kernel/fork.c | 30 --
>  1 file changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f8dae021c2e5..2992d2c95256 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -462,7 +462,7 @@ EXPORT_SYMBOL(free_task);
>  static __latent_entropy int dup_mmap(struct mm_struct *mm,
>   struct mm_struct *oldmm)
>  {
> - struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
> + struct vm_area_struct *mpnt, *tmp, *prev, **pprev, *last = NULL;
>   struct rb_node **rb_link, *rb_parent;
>   int retval;
>   unsigned long charge;
> @@ -581,8 +581,18 @@ static __latent_entropy int dup_mmap(struct mm_struct 
> *mm,
>   rb_parent = >vm_rb;
>  
>   mm->map_count++;
> - if (!(tmp->vm_flags & VM_WIPEONFORK))
> + if (!(tmp->vm_flags & VM_WIPEONFORK)) {
> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
> + /*
> +  * Mark this VMA as changing to prevent the
> +  * speculative page fault hanlder to process
> +  * it until the TLB are flushed below.
> +  */
> + last = mpnt;
> + vm_raw_write_begin(mpnt);
> + }
>   retval = copy_page_range(mm, oldmm, mpnt);
> + }
>  
>   if (tmp->vm_ops && tmp->vm_ops->open)
>   tmp->vm_ops->open(tmp);
> @@ -595,6 +605,22 @@ static __latent_entropy int dup_mmap(struct mm_struct 
> *mm,
>  out:
>   up_write(>mmap_sem);
>   flush_tlb_mm(oldmm);
> +
> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {

You do not need to check for CONFIG_SPECULATIVE_PAGE_FAULT here, as 'last'
will always be NULL if it is not enabled; but maybe the compiler will miss
the optimization opportunity if you only have the for() loop below.
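
i.e. the unwind loop alone is already a no-op when the config is off, since
'last' stays NULL. Something like this (a sketch, assuming the loop body
ends with vm_raw_write_end(last)):

        for (; last; last = last->vm_prev) {
                if (last->vm_flags & VM_DONTCOPY)
                        continue;
                vm_raw_write_end(last);
        }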

> + /*
> +  * Since the TLB has been flush, we can safely unmark the
> +  * copied VMAs and allows the speculative page fault handler to
> +  * process them again.
> +  * Walk back the VMA list from the last marked VMA.
> +  */
> + for (; last; last = last->vm_prev) {
> + if (last->vm_flags & VM_DONTCOPY)
> + continue;
> + if 

Re: [PATCH v12 17/31] mm: introduce __page_add_new_anon_rmap()

2019-04-22 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:45:08PM +0200, Laurent Dufour wrote:
> When dealing with the speculative page fault handler, we may race with a
> VMA being split or merged. In this case the vma->vm_start and vma->vm_end
> fields may not match the address at which the page fault is occurring.
> 
> This can only happen when the VMA is split, but in that case the anon_vma
> pointer of the new VMA will be the same as the original one, because in
> __split_vma() the new->anon_vma is set to src->anon_vma when *new = *vma.
> 
> So even if the VMA boundaries are not correct, the anon_vma pointer is
> still valid.
> 
> If the VMA has been merged, then the VMA into which it has been merged
> must have the same anon_vma pointer, otherwise the merge can't be done.
> 
> So in all cases we know that the anon_vma is valid, since we have checked,
> before starting the speculative page fault, that the anon_vma pointer is
> valid for this VMA. And since there is an anon_vma, this means that at one
> time a page has been backed; before the VMA is cleaned, the page table
> lock would have to be grabbed to clean the PTE, and the anon_vma field is
> checked once the PTE is locked.
> 
> This patch introduces a new __page_add_new_anon_rmap() service which
> doesn't check the VMA boundaries, and creates a new inline one which does
> the check.
> 
> When called from a page fault handler, if this is not a speculative one,
> there is a guarantee that vm_start and vm_end match the faulting address,
> so this check is useless. In the context of the speculative page fault
> handler, this check may be wrong but the anon_vma is still valid as
> explained above.
> 
> Signed-off-by: Laurent Dufour 

Reviewed-by: Jérôme Glisse 

> ---
>  include/linux/rmap.h | 12 ++--
>  mm/memory.c  |  8 
>  mm/rmap.c|  5 ++---
>  3 files changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 988d176472df..a5d282573093 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -174,8 +174,16 @@ void page_add_anon_rmap(struct page *, struct 
> vm_area_struct *,
>   unsigned long, bool);
>  void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
>  unsigned long, int);
> -void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
> - unsigned long, bool);
> +void __page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
> +   unsigned long, bool);
> +static inline void page_add_new_anon_rmap(struct page *page,
> +   struct vm_area_struct *vma,
> +   unsigned long address, bool compound)
> +{
> + VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> + __page_add_new_anon_rmap(page, vma, address, compound);
> +}
> +
>  void page_add_file_rmap(struct page *, bool);
>  void page_remove_rmap(struct page *, bool);
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index be93f2c8ebe0..46f877b6abea 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2347,7 +2347,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>* thread doing COW.
>*/
>   ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
> - page_add_new_anon_rmap(new_page, vma, vmf->address, false);
> + __page_add_new_anon_rmap(new_page, vma, vmf->address, false);
>   mem_cgroup_commit_charge(new_page, memcg, false, false);
>   __lru_cache_add_active_or_unevictable(new_page, vmf->vma_flags);
>   /*
> @@ -2897,7 +2897,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  
>   /* ksm created a completely new copy */
>   if (unlikely(page != swapcache && swapcache)) {
> - page_add_new_anon_rmap(page, vma, vmf->address, false);
> + __page_add_new_anon_rmap(page, vma, vmf->address, false);
>   mem_cgroup_commit_charge(page, memcg, false, false);
>   __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
>   } else {
> @@ -3049,7 +3049,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault 
> *vmf)
>   }
>  
>   inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
> - page_add_new_anon_rmap(page, vma, vmf->address, false);
> + __page_add_new_anon_rmap(page, vma, vmf->address, false);
>   mem_cgroup_commit_charge(page, memcg, false, false);
>   __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
>  setpte:
> @@ -3328,7 +3328,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct 
> mem_cgroup *memcg,
>   /* copy-on-write page */
>   if (write && !(vmf->vma_flags & VM_SHARED)) {
>   inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
> - page_add_new_anon_rmap(page, vma, vmf->address, false);
> + __page_add_new_anon_rmap(page, vma, vmf->address, false);
>   mem_cgroup_commit_charge(page, 

Re: [PATCH v12 16/31] mm: introduce __vm_normal_page()

2019-04-22 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:45:07PM +0200, Laurent Dufour wrote:
> When dealing with the speculative fault path we should use the VMA field
> values cached in the vm_fault structure.
> 
> Currently vm_normal_page() uses the pointer to the VMA to fetch the
> vm_flags value. This patch provides a new __vm_normal_page() which
> receives the vm_flags value as a parameter.
> 
> Note: The speculative path is only turned on for architectures providing
> support for the special PTE flag, so only the first block of
> vm_normal_page() is used during the speculative path.
> 
> Signed-off-by: Laurent Dufour 

Reviewed-by: Jérôme Glisse 
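
In other words, with CONFIG_ARCH_HAS_PTE_SPECIAL only the first block of
__vm_normal_page() is ever exercised on the speculative path, roughly
(simplified sketch of that block):

        if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
                if (likely(!pte_special(pte)))
                        goto check_pfn;
                /* special pte: find_special_page(), zero pfn, ... */
                ...
                return NULL;
        }

        /* the !CONFIG_ARCH_HAS_PTE_SPECIAL handling below is not
         * reached on the speculative path */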

> ---
>  include/linux/mm.h | 18 +++---
>  mm/memory.c| 21 -
>  2 files changed, 27 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f465bb2b049e..f14b2c9ddfd4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1421,9 +1421,21 @@ static inline void INIT_VMA(struct vm_area_struct *vma)
>  #endif
>  }
>  
> -struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> -  pte_t pte, bool with_public_device);
> -#define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
> +struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> +   pte_t pte, bool with_public_device,
> +   unsigned long vma_flags);
> +static inline struct page *_vm_normal_page(struct vm_area_struct *vma,
> + unsigned long addr, pte_t pte,
> + bool with_public_device)
> +{
> + return __vm_normal_page(vma, addr, pte, with_public_device,
> + vma->vm_flags);
> +}
> +static inline struct page *vm_normal_page(struct vm_area_struct *vma,
> +   unsigned long addr, pte_t pte)
> +{
> + return _vm_normal_page(vma, addr, pte, false);
> +}
>  
>  struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long 
> addr,
>   pmd_t pmd);
> diff --git a/mm/memory.c b/mm/memory.c
> index 85ec5ce5c0a8..be93f2c8ebe0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -533,7 +533,8 @@ static void print_bad_pte(struct vm_area_struct *vma, 
> unsigned long addr,
>  }
>  
>  /*
> - * vm_normal_page -- This function gets the "struct page" associated with a 
> pte.
> + * __vm_normal_page -- This function gets the "struct page" associated with
> + * a pte.
>   *
>   * "Special" mappings do not wish to be associated with a "struct page" 
> (either
>   * it doesn't exist, or it exists but they don't want to touch it). In this
> @@ -574,8 +575,9 @@ static void print_bad_pte(struct vm_area_struct *vma, 
> unsigned long addr,
>   * PFNMAP mappings in order to support COWable mappings.
>   *
>   */
> -struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> -  pte_t pte, bool with_public_device)
> +struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> +   pte_t pte, bool with_public_device,
> +   unsigned long vma_flags)
>  {
>   unsigned long pfn = pte_pfn(pte);
>  
> @@ -584,7 +586,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, 
> unsigned long addr,
>   goto check_pfn;
>   if (vma->vm_ops && vma->vm_ops->find_special_page)
>   return vma->vm_ops->find_special_page(vma, addr);
> - if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> + if (vma_flags & (VM_PFNMAP | VM_MIXEDMAP))
>   return NULL;
>   if (is_zero_pfn(pfn))
>   return NULL;
> @@ -620,8 +622,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, 
> unsigned long addr,
>  
>   /* !CONFIG_ARCH_HAS_PTE_SPECIAL case follows: */
>  
> - if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
> - if (vma->vm_flags & VM_MIXEDMAP) {
> + if (unlikely(vma_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
> + if (vma_flags & VM_MIXEDMAP) {
>   if (!pfn_valid(pfn))
>   return NULL;
>   goto out;
> @@ -630,7 +632,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, 
> unsigned long addr,
>   off = (addr - vma->vm_start) >> PAGE_SHIFT;
>   if (pfn == vma->vm_pgoff + off)
>   return NULL;
> - if (!is_cow_mapping(vma->vm_flags))
> + if (!is_cow_mapping(vma_flags))
>   return NULL;
>   }
>   }
> @@ -2532,7 +2534,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>  {
>   struct vm_area_struct *vma = vmf->vma;
>  
> - vmf->page = vm_normal_page(vma, 

Re: [PATCH v12 15/31] mm: introduce __lru_cache_add_active_or_unevictable

2019-04-22 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:45:06PM +0200, Laurent Dufour wrote:
> The speculative page fault handler, which runs without holding the
> mmap_sem, calls lru_cache_add_active_or_unevictable(), but the vm_flags
> value is not guaranteed to remain constant.
> Introduce __lru_cache_add_active_or_unevictable(), which takes the vma
> flags value as a parameter instead of the vma pointer.
> 
> Acked-by: David Rientjes 
> Signed-off-by: Laurent Dufour 

Reviewed-by: Jérôme Glisse 

> ---
>  include/linux/swap.h | 10 --
>  mm/memory.c  |  8 
>  mm/swap.c|  6 +++---
>  3 files changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 4bfb5c4ac108..d33b94eb3c69 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -343,8 +343,14 @@ extern void deactivate_file_page(struct page *page);
>  extern void mark_page_lazyfree(struct page *page);
>  extern void swap_setup(void);
>  
> -extern void lru_cache_add_active_or_unevictable(struct page *page,
> - struct vm_area_struct *vma);
> +extern void __lru_cache_add_active_or_unevictable(struct page *page,
> + unsigned long vma_flags);
> +
> +static inline void lru_cache_add_active_or_unevictable(struct page *page,
> + struct vm_area_struct *vma)
> +{
> + return __lru_cache_add_active_or_unevictable(page, vma->vm_flags);
> +}
>  
>  /* linux/mm/vmscan.c */
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
> diff --git a/mm/memory.c b/mm/memory.c
> index 56802850e72c..85ec5ce5c0a8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2347,7 +2347,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>   ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
>   page_add_new_anon_rmap(new_page, vma, vmf->address, false);
>   mem_cgroup_commit_charge(new_page, memcg, false, false);
> - lru_cache_add_active_or_unevictable(new_page, vma);
> + __lru_cache_add_active_or_unevictable(new_page, vmf->vma_flags);
>   /*
>* We call the notify macro here because, when using secondary
>* mmu page tables (such as kvm shadow page tables), we want the
> @@ -2896,7 +2896,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   if (unlikely(page != swapcache && swapcache)) {
>   page_add_new_anon_rmap(page, vma, vmf->address, false);
>   mem_cgroup_commit_charge(page, memcg, false, false);
> - lru_cache_add_active_or_unevictable(page, vma);
> + __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
>   } else {
>   do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
>   mem_cgroup_commit_charge(page, memcg, true, false);
> @@ -3048,7 +3048,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault 
> *vmf)
>   inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>   page_add_new_anon_rmap(page, vma, vmf->address, false);
>   mem_cgroup_commit_charge(page, memcg, false, false);
> - lru_cache_add_active_or_unevictable(page, vma);
> + __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
>  setpte:
>   set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>  
> @@ -3327,7 +3327,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct 
> mem_cgroup *memcg,
>   inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>   page_add_new_anon_rmap(page, vma, vmf->address, false);
>   mem_cgroup_commit_charge(page, memcg, false, false);
> - lru_cache_add_active_or_unevictable(page, vma);
> + __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
>   } else {
>   inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
>   page_add_file_rmap(page, false);
> diff --git a/mm/swap.c b/mm/swap.c
> index 3a75722e68a9..a55f0505b563 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -450,12 +450,12 @@ void lru_cache_add(struct page *page)
>   * directly back onto it's zone's unevictable list, it does NOT use a
>   * per cpu pagevec.
>   */
> -void lru_cache_add_active_or_unevictable(struct page *page,
> -  struct vm_area_struct *vma)
> +void __lru_cache_add_active_or_unevictable(struct page *page,
> +unsigned long vma_flags)
>  {
>   VM_BUG_ON_PAGE(PageLRU(page), page);
>  
> - if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
> + if (likely((vma_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
>   SetPageActive(page);
>   else if (!TestSetPageMlocked(page)) {
>   /*
> -- 
> 2.21.0
> 


Re: [PATCH v12 14/31] mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()

2019-04-22 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:45:05PM +0200, Laurent Dufour wrote:
> migrate_misplaced_page() is only called during page fault handling, so
> it's better to pass a pointer to the struct vm_fault instead of the vma.
> 
> This way, during the speculative page fault path, the saved vma->vm_flags
> can be used.
> 
> Acked-by: David Rientjes 
> Signed-off-by: Laurent Dufour 

Reviewed-by: Jérôme Glisse 

> ---
>  include/linux/migrate.h | 4 ++--
>  mm/memory.c | 2 +-
>  mm/migrate.c| 4 ++--
>  3 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index e13d9bf2f9a5..0197e40325f8 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -125,14 +125,14 @@ static inline void __ClearPageMovable(struct page *page)
>  #ifdef CONFIG_NUMA_BALANCING
>  extern bool pmd_trans_migrating(pmd_t pmd);
>  extern int migrate_misplaced_page(struct page *page,
> -   struct vm_area_struct *vma, int node);
> +   struct vm_fault *vmf, int node);
>  #else
>  static inline bool pmd_trans_migrating(pmd_t pmd)
>  {
>   return false;
>  }
>  static inline int migrate_misplaced_page(struct page *page,
> -  struct vm_area_struct *vma, int node)
> +  struct vm_fault *vmf, int node)
>  {
>   return -EAGAIN; /* can't migrate now */
>  }
> diff --git a/mm/memory.c b/mm/memory.c
> index d0de58464479..56802850e72c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3747,7 +3747,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>   }
>  
>   /* Migrate to the requested node */
> - migrated = migrate_misplaced_page(page, vma, target_nid);
> + migrated = migrate_misplaced_page(page, vmf, target_nid);
>   if (migrated) {
>   page_nid = target_nid;
>   flags |= TNF_MIGRATED;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index a9138093a8e2..633bd9abac54 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1938,7 +1938,7 @@ bool pmd_trans_migrating(pmd_t pmd)
>   * node. Caller is expected to have an elevated reference count on
>   * the page that will be dropped by this function before returning.
>   */
> -int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
> +int migrate_misplaced_page(struct page *page, struct vm_fault *vmf,
>  int node)
>  {
>   pg_data_t *pgdat = NODE_DATA(node);
> @@ -1951,7 +1951,7 @@ int migrate_misplaced_page(struct page *page, struct 
> vm_area_struct *vma,
>* with execute permissions as they are probably shared libraries.
>*/
>   if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
> - (vma->vm_flags & VM_EXEC))
> + (vmf->vma_flags & VM_EXEC))
>   goto out;
>  
>   /*
> -- 
> 2.21.0
> 


Re: [PATCH v12 13/31] mm: cache some VMA fields in the vm_fault structure

2019-04-22 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:45:04PM +0200, Laurent Dufour wrote:
> When handling a speculative page fault, the vma->vm_flags and
> vma->vm_page_prot fields are read once the page table lock is released. So
> there is no longer a guarantee that these fields will not change behind
> our back. They will be saved in the vm_fault structure before the VMA is
> checked for changes.
> 
> In detail, when we deal with a speculative page fault, the mmap_sem is not
> taken, so parallel VMA changes can occur. When a VMA change is done which
> impacts the page fault processing, we assume that the VMA sequence counter
> will be changed.  In the page fault processing, at the time the PTE is
> locked, we check the VMA sequence counter to detect changes done behind
> our back. If no change is detected we can continue further. But this
> doesn't prevent the VMA from being changed behind our back while the PTE
> is locked. So VMA fields which are used while the PTE is locked must be
> saved to ensure that we are using *static* values.  This is important
> since the PTE changes will be made with regard to these VMA fields and
> they need to be consistent. This concerns the vma->vm_flags and
> vma->vm_page_prot VMA fields.
> 
> This patch also sets the fields in hugetlb_no_page() and
> __collapse_huge_page_swapin() even if they are not needed by the callee.
> 
> Signed-off-by: Laurent Dufour 

I am unsure about something (see below), so you might need to update that
one, but it would not change the structure of the patch, thus:

Reviewed-by: Jérôme Glisse 

> ---
>  include/linux/mm.h | 10 +++--
>  mm/huge_memory.c   |  6 +++---
>  mm/hugetlb.c   |  2 ++
>  mm/khugepaged.c|  2 ++
>  mm/memory.c| 53 --
>  mm/migrate.c   |  2 +-
>  6 files changed, 44 insertions(+), 31 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5d45b7d8718d..f465bb2b049e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -439,6 +439,12 @@ struct vm_fault {
>* page table to avoid allocation from
>* atomic context.
>*/
> + /*
> +  * These entries are required when handling speculative page fault.
> +  * This way the page handling is done using consistent field values.
> +  */
> + unsigned long vma_flags;
> + pgprot_t vma_page_prot;
>  };
>  
>  /* page entry size for vm->huge_fault() */
> @@ -781,9 +787,9 @@ void free_compound_page(struct page *page);
>   * pte_mkwrite.  But get_user_pages can cause write faults for mappings
>   * that do not have writing enabled, when used by access_process_vm.
>   */
> -static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> +static inline pte_t maybe_mkwrite(pte_t pte, unsigned long vma_flags)
>  {
> - if (likely(vma->vm_flags & VM_WRITE))
> + if (likely(vma_flags & VM_WRITE))
>   pte = pte_mkwrite(pte);
>   return pte;
>  }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 823688414d27..865886a689ee 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1244,8 +1244,8 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct 
> vm_fault *vmf,
>  
>   for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
>   pte_t entry;
> - entry = mk_pte(pages[i], vma->vm_page_prot);
> - entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> + entry = mk_pte(pages[i], vmf->vma_page_prot);
> + entry = maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
>   memcg = (void *)page_private(pages[i]);
>   set_page_private(pages[i], 0);
>   page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
> @@ -2228,7 +2228,7 @@ static void __split_huge_pmd_locked(struct 
> vm_area_struct *vma, pmd_t *pmd,
>   entry = pte_swp_mksoft_dirty(entry);
>   } else {
>   entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
> - entry = maybe_mkwrite(entry, vma);
> + entry = maybe_mkwrite(entry, vma->vm_flags);
>   if (!write)
>   entry = pte_wrprotect(entry);
>   if (!young)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 109f5de82910..13246da4bc50 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3812,6 +3812,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>   .vma = vma,
>   .address = haddr,
>   .flags = flags,
> + .vma_flags = vma->vm_flags,
> + .vma_page_prot = vma->vm_page_prot,

Shouldn't you use READ_ONCE here? I doubt the compiler will do something
creative with struct initialization, but as you are using WRITE_ONCE to
update those fields, reading them with READ_ONCE would be consistent.
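
i.e. something like (sketch):

        struct vm_fault vmf = {
                .vma = vma,
                .address = haddr,
                .flags = flags,
                /* snapshot the fields the speculative path relies on */
                .vma_flags = READ_ONCE(vma->vm_flags),
                .vma_page_prot = READ_ONCE(vma->vm_page_prot),
        };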

Re: [PATCH v12 12/31] mm: protect SPF handler against anon_vma changes

2019-04-22 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:45:03PM +0200, Laurent Dufour wrote:
> The speculative page fault handler must be protected against anon_vma
> changes. This is because page_add_new_anon_rmap() is called during the
> speculative path.
> 
> In addition, don't try a speculative page fault if the VMA doesn't have an
> anon_vma structure allocated, because its allocation should be protected
> by the mmap_sem.
> 
> In __vma_adjust(), when importer->anon_vma is set, there is no need to
> protect against speculative page faults, since the speculative page fault
> is aborted if vma->anon_vma is not set.
> 
> When calling page_add_new_anon_rmap(), vma->anon_vma is necessarily valid
> since we checked for it when locking the pte, and the anon_vma is only
> removed once the pte is unlocked. This holds even if the speculative page
> fault handler is running concurrently with do_unmap(): the pte is locked
> in unmap_region() - through unmap_vmas() - and the anon_vma is unlinked
> later. Because we check the vma sequence counter, which is updated in
> unmap_page_range() before locking the pte and then in free_pgtables(), the
> change will be detected when locking the pte.
> 
> Signed-off-by: Laurent Dufour 

Reviewed-by: Jérôme Glisse 
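
For context, the corresponding check on the speculative side (patch 22) is
roughly the following (a sketch only):

        /* in the speculative handler, before walking the page table */
        if (!vma->anon_vma)
                /* anon_vma allocation requires the mmap_sem, bail out */
                return VM_FAULT_RETRY;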

> ---
>  mm/memory.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 423fa8ea0569..2cf7b6185daa 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -377,7 +377,9 @@ void free_pgtables(struct mmu_gather *tlb, struct 
> vm_area_struct *vma,
>* Hide vma from rmap and truncate_pagecache before freeing
>* pgtables
>*/
> + vm_write_begin(vma);
>   unlink_anon_vmas(vma);
> + vm_write_end(vma);
>   unlink_file_vma(vma);
>  
>   if (is_vm_hugetlb_page(vma)) {
> @@ -391,7 +393,9 @@ void free_pgtables(struct mmu_gather *tlb, struct 
> vm_area_struct *vma,
>  && !is_vm_hugetlb_page(next)) {
>   vma = next;
>   next = vma->vm_next;
> + vm_write_begin(vma);
>   unlink_anon_vmas(vma);
> + vm_write_end(vma);
>   unlink_file_vma(vma);
>   }
>   free_pgd_range(tlb, addr, vma->vm_end,
> -- 
> 2.21.0
> 


Re: [PATCH v12 11/31] mm: protect mremap() against SPF hanlder

2019-04-22 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:45:02PM +0200, Laurent Dufour wrote:
> If a thread is remapping an area while another one is faulting on the
> destination area, the SPF handler may fetch the vma from the RB tree before
> the pte has been moved by the other thread. This means that the moved ptes
> will overwrite those created by the page fault handler, leading to leaked
> pages.
> 
>   CPU 1   CPU2
>   enter mremap()
>   unmap the dest area
>   copy_vma()  Enter speculative page fault handler
>  >> at this time the dest area is present in the RB tree
>   fetch the vma matching dest area
>   create a pte as the VMA matched
>   Exit the SPF handler
>   
>   move_ptes()
> > it is assumed that the dest area is empty,
> > the move ptes overwrite the page mapped by the CPU2.
> 
> To prevent that, when the VMA matching the dest area is extended or created
> by copy_vma(), it should be marked as unavailable to the SPF handler.
> The usual way to do so is to rely on vm_write_begin()/end().
> This is already done in __vma_adjust(), called by copy_vma() (through
> vma_merge()). But __vma_adjust() calls vm_write_end() before returning,
> which creates a window for another thread.
> This patch adds a new parameter to vma_merge() which is passed down to
> __vma_adjust().
> The assumption is that copy_vma() returns a vma which should be released
> by calling vm_raw_write_end() once the ptes have been moved.
> 
> Signed-off-by: Laurent Dufour 

Reviewed-by: Jérôme Glisse 

Small comment about a comment below, but it can be fixed as a fixup patch;
nothing earth-shattering.
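
For reference, the expected flow on the mremap() side is then roughly the
following (a sketch of move_vma(); error handling trimmed, signatures as in
the kernel at the time):

        new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
                           &need_rmap_locks);
        if (!new_vma)
                return -ENOMEM;
        /* new_vma is returned with vm_raw_write_begin() still held */

        moved_len = move_page_tables(vma, old_addr, new_vma, new_addr,
                                     old_len, need_rmap_locks);
        ...
        /* only now may the SPF handler use the destination VMA */
        vm_raw_write_end(new_vma);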

> ---
>  include/linux/mm.h | 24 -
>  mm/mmap.c  | 53 +++---
>  mm/mremap.c| 13 
>  3 files changed, 73 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 906b9e06f18e..5d45b7d8718d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2343,18 +2343,32 @@ void anon_vma_interval_tree_verify(struct 
> anon_vma_chain *node);
>  
>  /* mmap.c */
>  extern int __vm_enough_memory(struct mm_struct *mm, long pages, int 
> cap_sys_admin);
> +
>  extern int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>   unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
> - struct vm_area_struct *expand);
> + struct vm_area_struct *expand, bool keep_locked);
> +
>  static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
>   unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
>  {
> - return __vma_adjust(vma, start, end, pgoff, insert, NULL);
> + return __vma_adjust(vma, start, end, pgoff, insert, NULL, false);
>  }
> -extern struct vm_area_struct *vma_merge(struct mm_struct *,
> +
> +extern struct vm_area_struct *__vma_merge(struct mm_struct *mm,
> + struct vm_area_struct *prev, unsigned long addr, unsigned long end,
> + unsigned long vm_flags, struct anon_vma *anon, struct file *file,
> + pgoff_t pgoff, struct mempolicy *mpol,
> + struct vm_userfaultfd_ctx uff, bool keep_locked);
> +
> +static inline struct vm_area_struct *vma_merge(struct mm_struct *mm,
>   struct vm_area_struct *prev, unsigned long addr, unsigned long end,
> - unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
> - struct mempolicy *, struct vm_userfaultfd_ctx);
> + unsigned long vm_flags, struct anon_vma *anon, struct file *file,
> + pgoff_t off, struct mempolicy *pol, struct vm_userfaultfd_ctx uff)
> +{
> + return __vma_merge(mm, prev, addr, end, vm_flags, anon, file, off,
> +pol, uff, false);
> +}
> +
>  extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
>  extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
>   unsigned long addr, int new_below);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index b77ec0149249..13460b38b0fb 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -714,7 +714,7 @@ static inline void __vma_unlink_prev(struct mm_struct *mm,
>   */
>  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>   unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
> - struct vm_area_struct *expand)
> + struct vm_area_struct *expand, bool keep_locked)
>  {
>   struct mm_struct *mm = vma->vm_mm;
>   struct vm_area_struct *next = vma->vm_next, *orig_vma = vma;
> @@ -830,8 +830,12 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned 
> long start,
>  
>   importer->anon_vma = exporter->anon_vma;
>   error = anon_vma_clone(importer, exporter);
> - if (error)
> + if (error) {
> + if 

Re: [PATCH v12 10/31] mm: protect VMA modifications using VMA sequence count

2019-04-22 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:45:01PM +0200, Laurent Dufour wrote:
> The VMA sequence count has been introduced to allow fast detection of
> VMA modification when running a page fault handler without holding
> the mmap_sem.
> 
> This patch provides protection against the VMA modifications done in:
>   - madvise()
>   - mpol_rebind_policy()
>   - vma_replace_policy()
>   - change_prot_numa()
>   - mlock(), munlock()
>   - mprotect()
>   - mmap_region()
>   - collapse_huge_page()
>   - userfaultd registering services
> 
> In addition, VMA fields which will be read during the speculative fault
> path need to be written using WRITE_ONCE, to prevent the write from being
> split and intermediate values from being made visible to other CPUs.
> 
> Signed-off-by: Laurent Dufour 

Reviewed-by: Jérôme Glisse 
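
The pattern applied at each of these sites boils down to:

        vm_write_begin(vma);
        WRITE_ONCE(vma->vm_flags, new_flags);
        /* ... other speculatively-read fields updated the same way ... */
        vm_write_end(vma);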

> ---
>  fs/proc/task_mmu.c |  5 -
>  fs/userfaultfd.c   | 17 
>  mm/khugepaged.c|  3 +++
>  mm/madvise.c   |  6 +-
>  mm/mempolicy.c | 51 ++
>  mm/mlock.c | 13 +++-
>  mm/mmap.c  | 28 -
>  mm/mprotect.c  |  4 +++-
>  mm/swap_state.c| 10 ++---
>  9 files changed, 95 insertions(+), 42 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 01d4eb0e6bd1..0864c050b2de 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1162,8 +1162,11 @@ static ssize_t clear_refs_write(struct file *file, 
> const char __user *buf,
>   goto out_mm;
>   }
>   for (vma = mm->mmap; vma; vma = vma->vm_next) {
> - vma->vm_flags &= ~VM_SOFTDIRTY;
> + vm_write_begin(vma);
> + WRITE_ONCE(vma->vm_flags,
> +  vma->vm_flags & ~VM_SOFTDIRTY);
>   vma_set_page_prot(vma);
> + vm_write_end(vma);
>   }
>   downgrade_write(>mmap_sem);
>   break;
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 3b30301c90ec..2e0f98cadd81 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -667,8 +667,11 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct 
> list_head *fcs)
>  
>   octx = vma->vm_userfaultfd_ctx.ctx;
>   if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
> + vm_write_begin(vma);
>   vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> - vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
> + WRITE_ONCE(vma->vm_flags,
> +vma->vm_flags & ~(VM_UFFD_WP | VM_UFFD_MISSING));
> + vm_write_end(vma);
>   return 0;
>   }
>  
> @@ -908,8 +911,10 @@ static int userfaultfd_release(struct inode *inode, 
> struct file *file)
>   vma = prev;
>   else
>   prev = vma;
> - vma->vm_flags = new_flags;
> + vm_write_begin(vma);
> + WRITE_ONCE(vma->vm_flags, new_flags);
>   vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> + vm_write_end(vma);
>   }
>  skip_mm:
>   up_write(>mmap_sem);
> @@ -1474,8 +1479,10 @@ static int userfaultfd_register(struct userfaultfd_ctx 
> *ctx,
>* the next vma was merged into the current one and
>* the current one has not been updated yet.
>*/
> - vma->vm_flags = new_flags;
> + vm_write_begin(vma);
> + WRITE_ONCE(vma->vm_flags, new_flags);
>   vma->vm_userfaultfd_ctx.ctx = ctx;
> + vm_write_end(vma);
>  
>   skip:
>   prev = vma;
> @@ -1636,8 +1643,10 @@ static int userfaultfd_unregister(struct 
> userfaultfd_ctx *ctx,
>* the next vma was merged into the current one and
>* the current one has not been updated yet.
>*/
> - vma->vm_flags = new_flags;
> + vm_write_begin(vma);
> + WRITE_ONCE(vma->vm_flags, new_flags);
>   vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> + vm_write_end(vma);
>  
>   skip:
>   prev = vma;
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a335f7c1fac4..6a0cbca3885e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1011,6 +1011,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>   if (mm_find_pmd(mm, address) != pmd)
>   goto out;
>  
> + vm_write_begin(vma);
>   anon_vma_lock_write(vma->anon_vma);
>  
>   pte = pte_offset_map(pmd, address);
> @@ -1046,6 +1047,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>   pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>   

Re: [PATCH v12 09/31] mm: VMA sequence count

2019-04-22 Thread Jerome Glisse
On Fri, Apr 19, 2019 at 05:45:57PM +0200, Laurent Dufour wrote:
> Hi Jerome,
> 
> Thanks a lot for reviewing this series.
> 
> Le 19/04/2019 à 00:48, Jerome Glisse a écrit :
> > On Tue, Apr 16, 2019 at 03:45:00PM +0200, Laurent Dufour wrote:
> > > From: Peter Zijlstra 
> > > 
> > > Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
> > > counts such that we can easily test if a VMA is changed.
> > > 
> > > The calls to vm_write_begin/end() in unmap_page_range() are
> > > used to detect when a VMA is being unmapped and thus that a new page fault
> > > should not be satisfied for this VMA. If the seqcount hasn't changed when
> > > the page table are locked, this means we are safe to satisfy the page
> > > fault.
> > > 
> > > The flip side is that we cannot distinguish between a vma_adjust() and
> > > the unmap_page_range() -- where with the former we could have
> > > re-checked the vma bounds against the address.
> > > 
> > > The VMA's sequence counter is also used to detect change to various VMA's
> > > fields used during the page fault handling, such as:
> > >   - vm_start, vm_end
> > >   - vm_pgoff
> > >   - vm_flags, vm_page_prot
> > >   - vm_policy
> > 
> > ^ All above are under mmap write lock ?
> 
> Yes, changes are still made under the protection of the mmap_sem.
> 
> > 
> > >   - anon_vma
> > 
> > ^ This is either under mmap write lock or under page table lock
> > 
> > So my question is do we need the complexity of seqcount_t for this ?
> 
> The sequence counter is used to detect write operations done while readers
> (the SPF handler) are running.
> 
> The implementation is quite simple (here without the lockdep checks):
> 
> static inline void raw_write_seqcount_begin(seqcount_t *s)
> {
>   s->sequence++;
>   smp_wmb();
> }
> 
> I can't see why this is too complex here, would you elaborate on this ?
> 
> > 
> > It seems that using a regular int as the counter and also relying on
> > vm_flags when the vma is unmapped should do the trick.
> 
> vm_flags is not enough I guess, as some operations do not impact the
> vm_flags at all (resizing, for instance).
> Am I missing something?
> 
> > 
> > vma_delete(struct vm_area_struct *vma)
> > {
> >  ...
> >  /*
> >   * Make sure the vma is mark as invalid ie neither read nor write
> >   * so that speculative fault back off. A racing speculative fault
> >   * will either see the flags as 0 or the new seqcount.
> >   */
> >  vma->vm_flags = 0;
> >  smp_wmb();
> >  vma->seqcount++;
> >  ...
> > }
> 
> Well I don't think we can safely clear the vm_flags this way when the VMA
> is unmapped; I think they are used later when cleaning is done.
> 
> Later in this series, the VMA deletion is managed when the VMA is unlinked
> from the RB tree. That is checked using the vm_rb field's value, and
> managed using RCU.
> 
> > Then:
> > speculative_fault_begin(struct vm_area_struct *vma,
> >  struct spec_vmf *spvmf)
> > {
> >  ...
> >  spvmf->seqcount = vma->seqcount;
> >  smp_rmb();
> >  spvmf->vm_flags = vma->vm_flags;
> >  if (!spvmf->vm_flags) {
> >  // Back off the vma is dying ...
> >  ...
> >  }
> > }
> > 
> > bool speculative_fault_commit(struct vm_area_struct *vma,
> >struct spec_vmf *spvmf)
> > {
> >  ...
> >  seqcount = vma->seqcount;
> >  smp_rmb();
> >  vm_flags = vma->vm_flags;
> > 
> >  if (spvmf->vm_flags != vm_flags || seqcount != spvmf->seqcount) {
> >  // Something did change for the vma
> >  return false;
> >  }
> >  return true;
> > }
> > 
> > This would also avoid the lockdep issue described below. But maybe what
> > i propose is stupid and i will see it after further reviewing thing.
> 
> That's true that the lockdep is quite annoying here. But it is still
> interesting to keep in the loop to avoid 2 subsequent write_seqcount_begin()
> call being made in the same context (which would lead to an even sequence
> counter value while write operation is in progress). So I think this is
> still a good thing to have lockdep available here.

Ok so I had to read everything, and I should have read everything before
asking all of the above. It does look good in fact; what worried me in
this patch is all the lockdep avoidance, as that is usually a red flag.

But after thinking long and hard I do not see how to easily solve that
one, as unmap_page_range() is in so many different paths... So what is done
in this patch is the most sane thing. Sorry for the noise.
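
For reference, this is roughly how the speculative reader side can pair with
the vm_sequence counter this patch adds (a minimal sketch, not the series'
exact helpers; the function names are illustrative):

	/* Sketch only: snapshot and re-check the per-VMA sequence count. */
	static bool spf_vma_snapshot(struct vm_area_struct *vma, unsigned int *seq)
	{
		*seq = raw_read_seqcount(&vma->vm_sequence);
		return !(*seq & 1);	/* odd: a writer is in progress, back off */
	}

	static bool spf_vma_unchanged(struct vm_area_struct *vma, unsigned int seq)
	{
		/* true if no vm_write_begin()/vm_write_end() ran since the snapshot */
		return !read_seqcount_retry(&vma->vm_sequence, seq);
	}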

So for this patch:

Reviewed-by: Jérôme Glisse 


Re: [PATCH v12 09/31] mm: VMA sequence count

2019-04-18 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:45:00PM +0200, Laurent Dufour wrote:
> From: Peter Zijlstra 
> 
> Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
> counts such that we can easily test if a VMA is changed.
> 
> The calls to vm_write_begin/end() in unmap_page_range() are
> used to detect when a VMA is being unmap and thus that new page fault
> should not be satisfied for this VMA. If the seqcount hasn't changed when
> the page table are locked, this means we are safe to satisfy the page
> fault.
> 
> The flip side is that we cannot distinguish between a vma_adjust() and
> the unmap_page_range() -- where with the former we could have
> re-checked the vma bounds against the address.
> 
> The VMA's sequence counter is also used to detect change to various VMA's
> fields used during the page fault handling, such as:
>  - vm_start, vm_end
>  - vm_pgoff
>  - vm_flags, vm_page_prot
>  - vm_policy

^ All above are under mmap write lock ?

>  - anon_vma

^ This is either under mmap write lock or under page table lock

So my question is: do we need the complexity of seqcount_t for this?

It seems that using a regular int as the counter, and also relying on vm_flags
when the vma is unmapped, should do the trick.

vma_delete(struct vm_area_struct *vma)
{
	...
	/*
	 * Make sure the vma is marked as invalid, ie neither read nor write,
	 * so that a speculative fault backs off. A racing speculative fault
	 * will either see the flags as 0 or the new seqcount.
	 */
	vma->vm_flags = 0;
	smp_wmb();
	vma->seqcount++;
	...
}

Then:
speculative_fault_begin(struct vm_area_struct *vma,
			struct spec_vmf *spvmf)
{
	...
	spvmf->seqcount = vma->seqcount;
	smp_rmb();
	spvmf->vm_flags = vma->vm_flags;
	if (!spvmf->vm_flags) {
		// Back off, the vma is dying ...
		...
	}
}

bool speculative_fault_commit(struct vm_area_struct *vma,
			      struct spec_vmf *spvmf)
{
	...
	seqcount = vma->seqcount;
	smp_rmb();
	vm_flags = vma->vm_flags;

	if (spvmf->vm_flags != vm_flags || seqcount != spvmf->seqcount) {
		// Something did change for the vma
		return false;
	}
	return true;
}
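
To make the intended flow explicit, a fault path would use the two helpers
above roughly like this (still pseudocode):

	struct spec_vmf spvmf;

	speculative_fault_begin(vma, &spvmf);
	/* ... allocate memory, read from file, prepare the new pte ... */
	if (!speculative_fault_commit(vma, &spvmf)) {
		/* the vma changed underneath us: drop the work and retry
		 * the classic way under mmap_sem */
		return VM_FAULT_RETRY;
	}
	/* safe to install the new pte */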

This would also avoid the lockdep issue described below. But maybe what
I propose is stupid and I will see it after further reviewing things.


Cheers,
Jérôme


> 
> Signed-off-by: Peter Zijlstra (Intel) 
> 
> [Port to 4.12 kernel]
> [Build depends on CONFIG_SPECULATIVE_PAGE_FAULT]
> [Introduce vm_write_* inline function depending on
>  CONFIG_SPECULATIVE_PAGE_FAULT]
> [Fix lock dependency between mapping->i_mmap_rwsem and vma->vm_sequence by
>  using vm_raw_write* functions]
> [Fix a lock dependency warning in mmap_region() when entering the error
>  path]
> [move sequence initialisation INIT_VMA()]
> [Review the patch description about unmap_page_range()]
> Signed-off-by: Laurent Dufour 
> ---
>  include/linux/mm.h   | 44 
>  include/linux/mm_types.h |  3 +++
>  mm/memory.c  |  2 ++
>  mm/mmap.c| 30 +++
>  4 files changed, 79 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2ceb1d2869a6..906b9e06f18e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1410,6 +1410,9 @@ struct zap_details {
>  static inline void INIT_VMA(struct vm_area_struct *vma)
>  {
>   INIT_LIST_HEAD(>anon_vma_chain);
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + seqcount_init(>vm_sequence);
> +#endif
>  }
>  
>  struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> @@ -1534,6 +1537,47 @@ static inline void unmap_shared_mapping_range(struct 
> address_space *mapping,
>   unmap_mapping_range(mapping, holebegin, holelen, 0);
>  }
>  
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +static inline void vm_write_begin(struct vm_area_struct *vma)
> +{
> + write_seqcount_begin(>vm_sequence);
> +}
> +static inline void vm_write_begin_nested(struct vm_area_struct *vma,
> +  int subclass)
> +{
> + write_seqcount_begin_nested(>vm_sequence, subclass);
> +}
> +static inline void vm_write_end(struct vm_area_struct *vma)
> +{
> + write_seqcount_end(>vm_sequence);
> +}
> +static inline void vm_raw_write_begin(struct vm_area_struct *vma)
> +{
> + raw_write_seqcount_begin(>vm_sequence);
> +}
> +static inline void vm_raw_write_end(struct vm_area_struct *vma)
> +{
> + raw_write_seqcount_end(>vm_sequence);
> +}
> +#else
> +static inline void vm_write_begin(struct vm_area_struct *vma)
> +{
> +}
> +static inline void vm_write_begin_nested(struct vm_area_struct *vma,
> +  int subclass)
> +{
> +}
> +static inline void vm_write_end(struct vm_area_struct *vma)
> +{
> +}
> +static inline void vm_raw_write_begin(struct vm_area_struct *vma)
> +{
> +}
> +static inline void vm_raw_write_end(struct vm_area_struct *vma)
> +{
> +}
> +#endif 
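
For reference, a writer is then expected to wrap VMA field updates like this
(a minimal sketch mirroring the userfaultfd change quoted earlier, not a
specific call site):

	/* Bump vm_sequence around the update so a concurrent speculative
	 * fault either backs off or detects the change. */
	vm_write_begin(vma);
	WRITE_ONCE(vma->vm_flags, new_flags);
	vma->vm_page_prot = vm_get_page_prot(new_flags);
	vm_write_end(vma);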

Re: [PATCH v12 08/31] mm: introduce INIT_VMA()

2019-04-18 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:44:59PM +0200, Laurent Dufour wrote:
> Some VMA struct fields need to be initialized once the VMA structure is
> allocated.
> Currently this only concerns anon_vma_chain field but some other will be
> added to support the speculative page fault.
> 
> Instead of spreading the initialization calls all over the code, let's
> introduce a dedicated inline function.
> 
> Signed-off-by: Laurent Dufour 
> ---
>  fs/exec.c  | 1 +
>  include/linux/mm.h | 5 +
>  kernel/fork.c  | 2 +-
>  mm/mmap.c  | 3 +++
>  mm/nommu.c | 1 +
>  5 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 2e0033348d8e..9762e060295c 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -266,6 +266,7 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
>   vma->vm_start = vma->vm_end - PAGE_SIZE;
>   vma->vm_flags = VM_SOFTDIRTY | VM_STACK_FLAGS | 
> VM_STACK_INCOMPLETE_SETUP;
>   vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
> + INIT_VMA(vma);
>  
>   err = insert_vm_struct(mm, vma);
>   if (err)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 4ba2f53f9d60..2ceb1d2869a6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1407,6 +1407,11 @@ struct zap_details {
>   pgoff_t last_index; /* Highest page->index to unmap 
> */
>  };
>  
> +static inline void INIT_VMA(struct vm_area_struct *vma)

Can we leave capital names for macros? Also I prefer vma_init_struct() (the
one thing I like in C++ is namespaces, and thus I like namespace_action() for
function names).

Also, why not do a coccinelle patch for this:

@@
struct vm_area_struct *vma;
@@
-INIT_LIST_HEAD(&vma->anon_vma_chain);
+vma_init_struct(vma);


Untested ...

> +{
> + INIT_LIST_HEAD(>anon_vma_chain);
> +}
> +
>  struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>pte_t pte, bool with_public_device);
>  #define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 915be4918a2b..f8dae021c2e5 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -341,7 +341,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct 
> *orig)
>  
>   if (new) {
>   *new = *orig;
> - INIT_LIST_HEAD(>anon_vma_chain);
> + INIT_VMA(new);
>   }
>   return new;
>  }
> diff --git a/mm/mmap.c b/mm/mmap.c
> index bd7b9f293b39..5ad3a3228d76 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1765,6 +1765,7 @@ unsigned long mmap_region(struct file *file, unsigned 
> long addr,
>   vma->vm_flags = vm_flags;
>   vma->vm_page_prot = vm_get_page_prot(vm_flags);
>   vma->vm_pgoff = pgoff;
> + INIT_VMA(vma);
>  
>   if (file) {
>   if (vm_flags & VM_DENYWRITE) {
> @@ -3037,6 +3038,7 @@ static int do_brk_flags(unsigned long addr, unsigned 
> long len, unsigned long fla
>   }
>  
>   vma_set_anonymous(vma);
> + INIT_VMA(vma);
>   vma->vm_start = addr;
>   vma->vm_end = addr + len;
>   vma->vm_pgoff = pgoff;
> @@ -3395,6 +3397,7 @@ static struct vm_area_struct *__install_special_mapping(
>   if (unlikely(vma == NULL))
>   return ERR_PTR(-ENOMEM);
>  
> + INIT_VMA(vma);
>   vma->vm_start = addr;
>   vma->vm_end = addr + len;
>  
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 749276beb109..acf7ca72ca90 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -1210,6 +1210,7 @@ unsigned long do_mmap(struct file *file,
>   region->vm_flags = vm_flags;
>   region->vm_pgoff = pgoff;
>  
> + INIT_VMA(vma);
>   vma->vm_flags = vm_flags;
>   vma->vm_pgoff = pgoff;
>  
> -- 
> 2.21.0
> 


Re: [PATCH v12 07/31] mm: make pte_unmap_same compatible with SPF

2019-04-18 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:44:58PM +0200, Laurent Dufour wrote:
> pte_unmap_same() is making the assumption that the page table are still
> around because the mmap_sem is held.
> This is no more the case when running a speculative page fault and
> additional check must be made to ensure that the final page table are still
> there.
> 
> This is now done by calling pte_spinlock() to check for the VMA's
> consistency while locking for the page tables.
> 
> This is requiring passing a vm_fault structure to pte_unmap_same() which is
> containing all the needed parameters.
> 
> As pte_spinlock() may fail in the case of a speculative page fault, if the
> VMA has been touched in our back, pte_unmap_same() should now return 3
> cases :
>   1. pte are the same (0)
>   2. pte are different (VM_FAULT_PTNOTSAME)
>   3. a VMA's changes has been detected (VM_FAULT_RETRY)
> 
> The case 2 is handled by the introduction of a new VM_FAULT flag named
> VM_FAULT_PTNOTSAME which is then trapped in cow_user_page().
> If VM_FAULT_RETRY is returned, it is passed up to the callers to retry the
> page fault while holding the mmap_sem.
> 
> Acked-by: David Rientjes 
> Signed-off-by: Laurent Dufour 

Reviewed-by: Jérôme Glisse 


> ---
>  include/linux/mm_types.h |  6 +-
>  mm/memory.c  | 37 +++--
>  2 files changed, 32 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 8ec38b11b361..fd7d38ee2e33 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -652,6 +652,8 @@ typedef __bitwise unsigned int vm_fault_t;
>   * @VM_FAULT_NEEDDSYNC:  ->fault did not modify page tables and 
> needs
>   *   fsync() to complete (for synchronous page faults
>   *   in DAX)
> + * @VM_FAULT_PTNOTSAME   Page table entries have changed during a
> + *   speculative page fault handling.
>   * @VM_FAULT_HINDEX_MASK:mask HINDEX value
>   *
>   */
> @@ -669,6 +671,7 @@ enum vm_fault_reason {
>   VM_FAULT_FALLBACK   = (__force vm_fault_t)0x000800,
>   VM_FAULT_DONE_COW   = (__force vm_fault_t)0x001000,
>   VM_FAULT_NEEDDSYNC  = (__force vm_fault_t)0x002000,
> + VM_FAULT_PTNOTSAME  = (__force vm_fault_t)0x004000,
>   VM_FAULT_HINDEX_MASK= (__force vm_fault_t)0x0f,
>  };
>  
> @@ -693,7 +696,8 @@ enum vm_fault_reason {
>   { VM_FAULT_RETRY,   "RETRY" },  \
>   { VM_FAULT_FALLBACK,"FALLBACK" },   \
>   { VM_FAULT_DONE_COW,"DONE_COW" },   \
> - { VM_FAULT_NEEDDSYNC,   "NEEDDSYNC" }
> + { VM_FAULT_NEEDDSYNC,   "NEEDDSYNC" },  \
> + { VM_FAULT_PTNOTSAME,   "PTNOTSAME" }
>  
>  struct vm_special_mapping {
>   const char *name;   /* The name, e.g. "[vdso]". */
> diff --git a/mm/memory.c b/mm/memory.c
> index 221ccdf34991..d5bebca47d98 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2094,21 +2094,29 @@ static inline bool pte_map_lock(struct vm_fault *vmf)
>   * parts, do_swap_page must check under lock before unmapping the pte and
>   * proceeding (but do_wp_page is only called after already making such a 
> check;
>   * and do_anonymous_page can safely check later on).
> + *
> + * pte_unmap_same() returns:
> + *   0   if the PTE are the same
> + *   VM_FAULT_PTNOTSAME  if the PTE are different
> + *   VM_FAULT_RETRY  if the VMA has changed in our back during
> + *   a speculative page fault handling.
>   */
> -static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
> - pte_t *page_table, pte_t orig_pte)
> +static inline vm_fault_t pte_unmap_same(struct vm_fault *vmf)
>  {
> - int same = 1;
> + int ret = 0;
> +
>  #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
>   if (sizeof(pte_t) > sizeof(unsigned long)) {
> - spinlock_t *ptl = pte_lockptr(mm, pmd);
> - spin_lock(ptl);
> - same = pte_same(*page_table, orig_pte);
> - spin_unlock(ptl);
> + if (pte_spinlock(vmf)) {
> + if (!pte_same(*vmf->pte, vmf->orig_pte))
> + ret = VM_FAULT_PTNOTSAME;
> + spin_unlock(vmf->ptl);
> + } else
> + ret = VM_FAULT_RETRY;
>   }
>  #endif
> - pte_unmap(page_table);
> - return same;
> + pte_unmap(vmf->pte);
> + return ret;
>  }
>  
>  static inline void cow_user_page(struct page *dst, struct page *src, 
> unsigned long va, struct vm_area_struct *vma)
> @@ -2714,8 +2722,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   int exclusive = 0;
>   vm_fault_t ret = 0;
>  
> - if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
> + ret = pte_unmap_same(vmf);
> + if (ret) {
> + /*
> +   

Re: [PATCH v12 06/31] mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE

2019-04-18 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:44:57PM +0200, Laurent Dufour wrote:
> When handling page fault without holding the mmap_sem the fetch of the
> pte lock pointer and the locking will have to be done while ensuring
> that the VMA is not touched in our back.
> 
> So move the fetch and locking operations in a dedicated function.
> 
> Signed-off-by: Laurent Dufour 

Reviewed-by: Jérôme Glisse 


> ---
>  mm/memory.c | 15 +++
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index fc3698d13cb5..221ccdf34991 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2073,6 +2073,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned 
> long addr,
>  }
>  EXPORT_SYMBOL_GPL(apply_to_page_range);
>  
> +static inline bool pte_spinlock(struct vm_fault *vmf)
> +{
> + vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> + spin_lock(vmf->ptl);
> + return true;
> +}
> +
>  static inline bool pte_map_lock(struct vm_fault *vmf)
>  {
>   vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> @@ -3656,8 +3663,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>* validation through pte_unmap_same(). It's of NUMA type but
>* the pfn may be screwed if the read is non atomic.
>*/
> - vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
> - spin_lock(vmf->ptl);
> + if (!pte_spinlock(vmf))
> + return VM_FAULT_RETRY;
>   if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
>   goto out;
> @@ -3850,8 +3857,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>   if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
>   return do_numa_page(vmf);
>  
> - vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> - spin_lock(vmf->ptl);
> + if (!pte_spinlock(vmf))
> + return VM_FAULT_RETRY;
>   entry = vmf->orig_pte;
>   if (unlikely(!pte_same(*vmf->pte, entry)))
>   goto unlock;
> -- 
> 2.21.0
> 
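
For context, a later patch in the series extends pte_spinlock() for the
speculative case; a rough sketch of that shape (illustrative only,
vma_has_changed() is an assumed name for the vm_sequence check described in
the cover letter):

	static bool pte_spinlock_speculative(struct vm_fault *vmf)
	{
		bool ret = false;

		/* IRQs off so a concurrent khugepaged collapse cannot
		 * complete under us (see the cover letter discussion). */
		local_irq_disable();
		if (vma_has_changed(vmf))
			goto out;
		vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
		if (!spin_trylock(vmf->ptl))	/* trylock to avoid deadlock with IRQs off */
			goto out;
		if (vma_has_changed(vmf)) {
			spin_unlock(vmf->ptl);
			goto out;
		}
		ret = true;
	out:
		local_irq_enable();
		return ret;
	}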


Re: [PATCH v12 05/31] mm: prepare for FAULT_FLAG_SPECULATIVE

2019-04-18 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:44:56PM +0200, Laurent Dufour wrote:
> From: Peter Zijlstra 
> 
> When speculating faults (without holding mmap_sem) we need to validate
> that the vma against which we loaded pages is still valid when we're
> ready to install the new PTE.
> 
> Therefore, replace the pte_offset_map_lock() calls that (re)take the
> PTL with pte_map_lock() which can fail in case we find the VMA changed
> since we started the fault.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> 
> [Port to 4.12 kernel]
> [Remove the comment about the fault_env structure which has been
>  implemented as the vm_fault structure in the kernel]
> [move pte_map_lock()'s definition upper in the file]
> [move the define of FAULT_FLAG_SPECULATIVE later in the series]
> [review error path in do_swap_page(), do_anonymous_page() and
>  wp_page_copy()]
> Signed-off-by: Laurent Dufour 

Reviewed-by: Jérôme Glisse 

> ---
>  mm/memory.c | 87 +++--
>  1 file changed, 58 insertions(+), 29 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index c6ddadd9d2b7..fc3698d13cb5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2073,6 +2073,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned 
> long addr,
>  }
>  EXPORT_SYMBOL_GPL(apply_to_page_range);
>  
> +static inline bool pte_map_lock(struct vm_fault *vmf)

I am not a fan of the name, maybe pte_offset_map_lock_if_valid()? But
that is just a taste thing. So feel free to ignore this comment.


> +{
> + vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> +vmf->address, >ptl);
> + return true;
> +}
> +
>  /*
>   * handle_pte_fault chooses page fault handler according to an entry which 
> was
>   * read non-atomically.  Before making any commitment, on those architectures


Re: [PATCH v12 04/31] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

2019-04-18 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:41:56PM +0100, Mark Rutland wrote:
> On Tue, Apr 16, 2019 at 04:31:27PM +0200, Laurent Dufour wrote:
> > On 16/04/2019 at 16:27, Mark Rutland wrote:
> > > On Tue, Apr 16, 2019 at 03:44:55PM +0200, Laurent Dufour wrote:
> > > > From: Mahendran Ganesh 
> > > > 
> > > > Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for arm64. This
> > > > enables Speculative Page Fault handler.
> > > > 
> > > > Signed-off-by: Ganesh Mahendran 
> > > 
> > > This is missing your S-o-B.
> > 
> > You're right, I missed that...
> > 
> > > The first patch noted that the ARCH_SUPPORTS_* option was there because
> > > the arch code had to make an explicit call to try to handle the fault
> > > speculatively, but that isn't addeed until patch 30.
> > > 
> > > Why is this separate from that code?
> > 
> > Andrew recommended this a long time ago for bisection purposes. This
> > allows building the code with CONFIG_SPECULATIVE_PAGE_FAULT before the code
> > that triggers the spf handler is added to the per-architecture code.
> 
> Ok. I think it would be worth noting that in the commit message, to
> avoid anyone else asking the same question. :)

Should have read this thread before looking at x86 and ppc :)

In any case the patch is:

Reviewed-by: Jérôme Glisse 


Re: [PATCH v12 03/31] powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

2019-04-18 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:44:54PM +0200, Laurent Dufour wrote:
> Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for BOOK3S_64. This enables
> the Speculative Page Fault handler.
> 
> Support is only provide for BOOK3S_64 currently because:
> - require CONFIG_PPC_STD_MMU because checks done in
>   set_access_flags_filter()
> - require BOOK3S because we can't support for book3e_hugetlb_preload()
>   called by update_mmu_cache()
> 
> Cc: Michael Ellerman 
> Signed-off-by: Laurent Dufour 

Same comment as for x86.

Reviewed-by: Jérôme Glisse 

> ---
>  arch/powerpc/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 2d0be82c3061..a29887ea5383 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -238,6 +238,7 @@ config PPC
>   select PCI_SYSCALL  if PCI
>   select RTC_LIB
>   select SPARSE_IRQ
> + select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT if PPC_BOOK3S_64
>   select SYSCTL_EXCEPTION_TRACE
>   select THREAD_INFO_IN_TASK
>   select VIRT_TO_BUS  if !PPC64
> -- 
> 2.21.0
> 


Re: [PATCH v12 02/31] x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

2019-04-18 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:44:53PM +0200, Laurent Dufour wrote:
> Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT which turns on the
> Speculative Page Fault handler when building for 64bit.
> 
> Cc: Thomas Gleixner 
> Signed-off-by: Laurent Dufour 

I think this patch should be moved to be the last patch in the series, so that
the feature is not enabled mid-way, without all the pieces ready, if
someone bisects. But I have not reviewed everything yet so maybe it is
fine.

Reviewed-by: Jérôme Glisse 

> ---
>  arch/x86/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 0f2ab09da060..8bd575184d0b 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -30,6 +30,7 @@ config X86_64
>   select SWIOTLB
>   select X86_DEV_DMA_OPS
>   select ARCH_HAS_SYSCALL_WRAPPER
> + select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>  
>  #
>  # Arch settings
> -- 
> 2.21.0
> 


Re: [PATCH v12 01/31] mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT

2019-04-18 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:44:52PM +0200, Laurent Dufour wrote:
> This configuration variable will be used to build the code needed to
> handle speculative page fault.
> 
> By default it is turned off, and activated depending on architecture
> support, ARCH_HAS_PTE_SPECIAL, SMP and MMU.
> 
> The architecture support is needed since the speculative page fault handler
> is called from the architecture's page faulting code, and some code has to
> be added there to handle the speculative handler.
> 
> The dependency on ARCH_HAS_PTE_SPECIAL is required because vm_normal_page()
> does processing that is not compatible with the speculative handling in the
> case ARCH_HAS_PTE_SPECIAL is not set.
> 
> Suggested-by: Thomas Gleixner 
> Suggested-by: David Rientjes 
> Signed-off-by: Laurent Dufour 

Reviewed-by: Jérôme Glisse 

Small question below

> ---
>  mm/Kconfig | 22 ++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0eada3f818fa..ff278ac9978a 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -761,4 +761,26 @@ config GUP_BENCHMARK
>  config ARCH_HAS_PTE_SPECIAL
>   bool
>  
> +config ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
> +   def_bool n
> +
> +config SPECULATIVE_PAGE_FAULT
> + bool "Speculative page faults"
> + default y
> + depends on ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
> + depends on ARCH_HAS_PTE_SPECIAL && MMU && SMP
> + help
> +   Try to handle user space page faults without holding the mmap_sem.
> +
> +   This should allow better concurrency for massively threaded processes

Is there any case where it does not provide better concurrency? The
"should" makes me wonder :)

> +   since the page fault handler will not wait for other thread's memory
> +   layout change to be done, assuming that this change is done in
> +   another part of the process's memory space. This type of page fault
> +   is named speculative page fault.
> +
> +   If the speculative page fault fails because a concurrent modification
> +   is detected or because underlying PMD or PTE tables are not yet
> +   allocated, the speculative page fault fails and a classic page fault
> +   is then tried.
> +
>  endmenu
> -- 
> 2.21.0
> 


Re: [PATCH 0/3] move __HAVE_ARCH_PTE_SPECIAL in Kconfig

2018-04-09 Thread Jerome Glisse
On Mon, Apr 09, 2018 at 04:07:21PM +0200, Michal Hocko wrote:
> On Mon 09-04-18 15:57:06, Laurent Dufour wrote:
> > The per architecture __HAVE_ARCH_PTE_SPECIAL is defined statically in the
> > per architecture header files. This doesn't allow to make other
> > configuration dependent on it.
> > 
> > This series is moving the __HAVE_ARCH_PTE_SPECIAL into the Kconfig files,
> > setting it automatically when architectures was already setting it in
> > header file.
> > 
> > There is no functional change introduced by this series.
> 
> I would just fold all three patches into a single one. It is much easier
> to review that those selects are done properly when you can see that the
> define is set for the same architecture.
> 
> In general, I like the patch. It is always quite painful to track per
> arch defines.

You can also add Reviewed-by: Jérôme Glisse ; my grep fu
showed no place that was forgotten.

Cheers,
Jérôme


Re: [PATCH v9 15/24] mm: Introduce __vm_normal_page()

2018-04-04 Thread Jerome Glisse
On Wed, Apr 04, 2018 at 06:26:44PM +0200, Laurent Dufour wrote:
> 
> 
> On 03/04/2018 21:39, Jerome Glisse wrote:
> > On Tue, Mar 13, 2018 at 06:59:45PM +0100, Laurent Dufour wrote:
> >> When dealing with the speculative fault path we should use the VMA's field
> >> cached value stored in the vm_fault structure.
> >>
> >> Currently vm_normal_page() is using the pointer to the VMA to fetch the
> >> vm_flags value. This patch provides a new __vm_normal_page() which is
> >> receiving the vm_flags flags value as parameter.
> >>
> >> Note: The speculative path is turned on for architecture providing support
> >> for special PTE flag. So only the first block of vm_normal_page is used
> >> during the speculative path.
> > 
> > Might be a good idea to explicitly have SPECULATIVE Kconfig option depends
> > on ARCH_PTE_SPECIAL and a comment for !HAVE_PTE_SPECIAL in the function
> > explaining that speculative page fault should never reach that point.
> 
> Unfortunately there is no ARCH_PTE_SPECIAL in the config file, it is defined 
> in
> the per architecture header files.
> So I can't do anything in the Kconfig file

Maybe add a new Kconfig symbol for ARCH_PTE_SPECIAL, very much like the
other ARCH_HAS_* symbols.

> 
> However, I can check that at build time, and doing such a check in
> __vm_normal_page sounds to be a good place, like that:
> 
> @@ -869,6 +870,14 @@ struct page *__vm_normal_page(struct vm_area_struct *vma,
> unsigned long addr,
> 
> /* !HAVE_PTE_SPECIAL case follows: */
> 
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +   /* This part should never get called when the speculative page fault
> +* handler is turned on. This is mainly because we can't rely on
> +* vm_start.
> +*/
> +#error CONFIG_SPECULATIVE_PAGE_FAULT requires HAVE_PTE_SPECIAL
> +#endif
> +
> if (unlikely(vma_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
> if (vma_flags & VM_MIXEDMAP) {
> if (!pfn_valid(pfn))
> 

I am not a fan of #if/#else/#endif in code. But that's a taste thing.
I honestly think that adding a Kconfig symbol for special pte is the cleanest
solution.
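
Something along these lines, as a sketch (the v12 series quoted elsewhere in
this digest ends up doing essentially this with CONFIG_ARCH_HAS_PTE_SPECIAL):

	# A selectable symbol, mirroring the other ARCH_HAS_* options
	config ARCH_HAS_PTE_SPECIAL
		bool

	# ... which the speculative page fault option can then depend on
	config SPECULATIVE_PAGE_FAULT
		bool "Speculative page faults"
		depends on ARCH_HAS_PTE_SPECIAL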

Cheers,
Jérôme


Re: [PATCH v9 06/24] mm: make pte_unmap_same compatible with SPF

2018-04-03 Thread Jerome Glisse
On Tue, Apr 03, 2018 at 01:40:18PM -0700, David Rientjes wrote:
> On Tue, 3 Apr 2018, Jerome Glisse wrote:
> 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 21b1212a0892..4bc7b0bdcb40 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -2309,21 +2309,29 @@ static bool pte_map_lock(struct vm_fault *vmf)
> > >   * parts, do_swap_page must check under lock before unmapping the pte and
> > >   * proceeding (but do_wp_page is only called after already making such a 
> > > check;
> > >   * and do_anonymous_page can safely check later on).
> > > + *
> > > + * pte_unmap_same() returns:
> > > + *   0   if the PTE are the same
> > > + *   VM_FAULT_PTNOTSAME  if the PTE are different
> > > + *   VM_FAULT_RETRY  if the VMA has changed in our back 
> > > during
> > > + *   a speculative page fault handling.
> > >   */

[...]

> > >  
> > 
> > This change what do_swap_page() returns ie before it was returning 0
> > when locked pte lookup was different from orig_pte. After this patch
> > it returns VM_FAULT_PTNOTSAME but this is a new return value for
> > handle_mm_fault() (the do_swap_page() return value is what ultimately
> > get return by handle_mm_fault())
> > 
> > Do we really want that ? This might confuse some existing user of
> > handle_mm_fault() and i am not sure of the value of that information
> > to caller.
> > 
> > Note i do understand that you want to return retry if anything did
> > change from underneath and thus need to differentiate from when the
> > pte value are not the same.
> > 
> 
> I think VM_FAULT_RETRY should be handled appropriately for any user of 
> handle_mm_fault() already, and would be surprised to learn differently.  
> Khugepaged has the appropriate handling.  I think the concern is whether a 
> user is handling anything other than VM_FAULT_RETRY and VM_FAULT_ERROR 
> (which VM_FAULT_PTNOTSAME is not set in)?  I haven't done a full audit of 
> the users.

I am not worried about VM_FAULT_RETRY and barely have any worry about
VM_FAULT_PTNOTSAME either, as there are other comparable new return values
(VM_FAULT_NEEDDSYNC for instance, which is quite recent).

I wonder if adding a new value is really needed here. I don't see any
value to it for callers of handle_mm_fault() except for stats.

Note that I am not opposed, but while today we have free bits, maybe
tomorrow we will run out; I am always worried about things like that :)
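
For reference, one way do_swap_page() could keep the new value internal, so
nothing new leaks to handle_mm_fault() callers (a sketch of the handling
discussed, not necessarily the exact code in the series):

	ret = pte_unmap_same(vmf);
	if (ret) {
		/*
		 * If the pte changed underneath us, another thread already
		 * handled the fault, so there is nothing more to do; fold
		 * VM_FAULT_PTNOTSAME back to 0 before returning.
		 */
		if (ret == VM_FAULT_PTNOTSAME)
			ret = 0;
		goto out;
	}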

Cheers,
Jérôme


Re: [PATCH v9 00/24] Speculative page faults

2018-04-03 Thread Jerome Glisse
On Tue, Mar 13, 2018 at 06:59:30PM +0100, Laurent Dufour wrote:
> This is a port on kernel 4.16 of the work done by Peter Zijlstra to
> handle page fault without holding the mm semaphore [1].
> 
> The idea is to try to handle user space page faults without holding the
> mmap_sem. This should allow better concurrency for massively threaded
> process since the page fault handler will not wait for other threads memory
> layout change to be done, assuming that this change is done in another part
> of the process's memory space. This type page fault is named speculative
> page fault. If the speculative page fault fails because of a concurrency is
> detected or because underlying PMD or PTE tables are not yet allocating, it
> is failing its processing and a classic page fault is then tried.
> 
> The speculative page fault (SPF) has to look for the VMA matching the fault
> address without holding the mmap_sem, this is done by introducing a rwlock
> which protects the access to the mm_rb tree. Previously this was done using
> SRCU but it was introducing a lot of scheduling to process the VMA's
> freeing
> operation which was hitting the performance by 20% as reported by Kemi Wang
> [2].Using a rwlock to protect access to the mm_rb tree is limiting the
> locking contention to these operations which are expected to be in a O(log
> n)
> order. In addition to ensure that the VMA is not freed in our back a
> reference count is added and 2 services (get_vma() and put_vma()) are
> introduced to handle the reference count. When a VMA is fetch from the RB
> tree using get_vma() is must be later freeed using put_vma(). Furthermore,
> to allow the VMA to be used again by the classic page fault handler a
> service is introduced can_reuse_spf_vma(). This service is expected to be
> called with the mmap_sem hold. It checked that the VMA is still matching
> the specified address and is releasing its reference count as the mmap_sem
> is hold it is ensure that it will not be freed in our back. In general, the
> VMA's reference count could be decremented when holding the mmap_sem but it
> should not be increased as holding the mmap_sem is ensuring that the VMA is
> stable. I can't see anymore the overhead I got while will-it-scale
> benchmark anymore.
> 
> The VMA's attributes checked during the speculative page fault processing
> have to be protected against parallel changes. This is done by using a per
> VMA sequence lock. This sequence lock allows the speculative page fault
> handler to fast check for parallel changes in progress and to abort the
> speculative page fault in that case.
> 
> Once the VMA is found, the speculative page fault handler would check for
> the VMA's attributes to verify that the page fault has to be handled
> correctly or not. Thus the VMA is protected through a sequence lock which
> allows fast detection of concurrent VMA changes. If such a change is
> detected, the speculative page fault is aborted and a *classic* page fault
> is tried.  VMA sequence lockings are added when VMA attributes which are
> checked during the page fault are modified.
> 
> When the PTE is fetched, the VMA is checked to see if it has been changed,
> so once the page table is locked, the VMA is valid, so any other changes
> leading to touching this PTE will need to lock the page table, so no
> parallel change is possible at this time.

What would have been nice is some high-level pseudocode before all the
above detailed description. Something like:

  speculative_fault(addr) {
	mm_lock_for_vma_snapshot()
	vma_snapshot = snapshot_vma_infos(addr)
	mm_unlock_for_vma_snapshot()
	...
	if (!vma_can_speculatively_fault(vma_snapshot, addr))
		return;
	...
	/* Do the fault, ie alloc memory, read from file ... */
	page = ...;

	preempt_disable();
	if (vma_snapshot_still_valid(vma_snapshot, addr) &&
	    vma_pte_map_lock(vma_snapshot, addr)) {
		if (pte_same(ptep, orig_pte)) {
			/* Setup the new pte */
			page = NULL;
		}
	}
	preempt_enable();
	if (page)
		put_page(page);
  }

I just find pseudocode easier for grasping the high-level view of the
expected code flow.


> 
> The locking of the PTE is done with interrupts disabled, this allows to
> check for the PMD to ensure that there is not an ongoing collapsing
> operation. Since khugepaged is firstly set the PMD to pmd_none and then is
> waiting for the other CPU to have catch the IPI interrupt, if the pmd is
> valid at the time the PTE is locked, we have the guarantee that the
> collapsing opertion will have to wait on the PTE lock to move foward. This
> allows the SPF handler to map the PTE safely. If the PMD value is different
> than the one recorded at the beginning of the SPF operation, the classic
> page fault handler will be called to handle the operation while holding the
> mmap_sem. As the PTE lock is done with the interrupts disabled, the lock is
> done using spin_trylock() to avoid dead lock when handling a 

Re: [PATCH v9 15/24] mm: Introduce __vm_normal_page()

2018-04-03 Thread Jerome Glisse
On Tue, Mar 13, 2018 at 06:59:45PM +0100, Laurent Dufour wrote:
> When dealing with the speculative fault path we should use the VMA's field
> cached value stored in the vm_fault structure.
> 
> Currently vm_normal_page() is using the pointer to the VMA to fetch the
> vm_flags value. This patch provides a new __vm_normal_page() which is
> receiving the vm_flags flags value as parameter.
> 
> Note: The speculative path is turned on for architecture providing support
> for special PTE flag. So only the first block of vm_normal_page is used
> during the speculative path.

Might be a good idea to explicitly have the SPECULATIVE Kconfig option depend
on ARCH_PTE_SPECIAL, and a comment for the !HAVE_PTE_SPECIAL block in the
function explaining that a speculative page fault should never reach that point.

Cheers,
Jérôme


Re: [PATCH v9 06/24] mm: make pte_unmap_same compatible with SPF

2018-04-03 Thread Jerome Glisse
On Tue, Mar 13, 2018 at 06:59:36PM +0100, Laurent Dufour wrote:
> pte_unmap_same() is making the assumption that the page table are still
> around because the mmap_sem is held.
> This is no more the case when running a speculative page fault and
> additional check must be made to ensure that the final page table are still
> there.
> 
> This is now done by calling pte_spinlock() to check for the VMA's
> consistency while locking for the page tables.
> 
> This is requiring passing a vm_fault structure to pte_unmap_same() which is
> containing all the needed parameters.
> 
> As pte_spinlock() may fail in the case of a speculative page fault, if the
> VMA has been touched in our back, pte_unmap_same() should now return 3
> cases :
>   1. pte are the same (0)
>   2. pte are different (VM_FAULT_PTNOTSAME)
>   3. a VMA's changes has been detected (VM_FAULT_RETRY)
> 
> The case 2 is handled by the introduction of a new VM_FAULT flag named
> VM_FAULT_PTNOTSAME which is then trapped in cow_user_page().
> If VM_FAULT_RETRY is returned, it is passed up to the callers to retry the
> page fault while holding the mmap_sem.
> 
> Signed-off-by: Laurent Dufour 
> ---
>  include/linux/mm.h |  1 +
>  mm/memory.c| 29 +++--
>  2 files changed, 20 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2f3e98edc94a..b6432a261e63 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1199,6 +1199,7 @@ static inline void clear_page_pfmemalloc(struct page 
> *page)
>  #define VM_FAULT_NEEDDSYNC  0x2000   /* ->fault did not modify page tables
>* and needs fsync() to complete (for
>* synchronous page faults in DAX) */
> +#define VM_FAULT_PTNOTSAME 0x4000/* Page table entries have changed */
>  
>  #define VM_FAULT_ERROR   (VM_FAULT_OOM | VM_FAULT_SIGBUS | 
> VM_FAULT_SIGSEGV | \
>VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE | \
> diff --git a/mm/memory.c b/mm/memory.c
> index 21b1212a0892..4bc7b0bdcb40 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2309,21 +2309,29 @@ static bool pte_map_lock(struct vm_fault *vmf)
>   * parts, do_swap_page must check under lock before unmapping the pte and
>   * proceeding (but do_wp_page is only called after already making such a 
> check;
>   * and do_anonymous_page can safely check later on).
> + *
> + * pte_unmap_same() returns:
> + *   0   if the PTE are the same
> + *   VM_FAULT_PTNOTSAME  if the PTE are different
> + *   VM_FAULT_RETRY  if the VMA has changed in our back during
> + *   a speculative page fault handling.
>   */
> -static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
> - pte_t *page_table, pte_t orig_pte)
> +static inline int pte_unmap_same(struct vm_fault *vmf)
>  {
> - int same = 1;
> + int ret = 0;
> +
>  #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
>   if (sizeof(pte_t) > sizeof(unsigned long)) {
> - spinlock_t *ptl = pte_lockptr(mm, pmd);
> - spin_lock(ptl);
> - same = pte_same(*page_table, orig_pte);
> - spin_unlock(ptl);
> + if (pte_spinlock(vmf)) {
> + if (!pte_same(*vmf->pte, vmf->orig_pte))
> + ret = VM_FAULT_PTNOTSAME;
> + spin_unlock(vmf->ptl);
> + } else
> + ret = VM_FAULT_RETRY;
>   }
>  #endif
> - pte_unmap(page_table);
> - return same;
> + pte_unmap(vmf->pte);
> + return ret;
>  }
>  
>  static inline void cow_user_page(struct page *dst, struct page *src, 
> unsigned long va, struct vm_area_struct *vma)
> @@ -2913,7 +2921,8 @@ int do_swap_page(struct vm_fault *vmf)
>   int exclusive = 0;
>   int ret = 0;
>  
> - if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
> + ret = pte_unmap_same(vmf);
> + if (ret)
>   goto out;
>  

This changes what do_swap_page() returns, ie before it was returning 0
when the locked pte lookup was different from orig_pte. After this patch
it returns VM_FAULT_PTNOTSAME, but this is a new return value for
handle_mm_fault() (the do_swap_page() return value is what ultimately
gets returned by handle_mm_fault()).

Do we really want that? This might confuse some existing users of
handle_mm_fault() and I am not sure of the value of that information
to the caller.

Note I do understand that you want to return retry if anything did
change underneath and thus need to differentiate from when the
pte values are not the same.

Cheers,
Jérôme


Re: revamp vmem_altmap / dev_pagemap handling V3

2018-01-11 Thread Jerome Glisse
On Mon, Jan 08, 2018 at 12:26:46PM +0100, Christoph Hellwig wrote:
> Any chance to get this fully reviewed and picked up before the
> end of the merge window?

Sorry for taking so long to get to that; I looked at all the patches
and did not see anything obviously wrong, and I like the cleanup, so
Reviewed-by: Jérôme Glisse 


Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2

2017-10-23 Thread Jerome Glisse
On Sat, Oct 21, 2017 at 11:47:03AM -0400, Jerome Glisse wrote:
> On Sat, Oct 21, 2017 at 04:54:40PM +1100, Balbir Singh wrote:
> > On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote:
> > > On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote:
> > > > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse <jgli...@redhat.com> 
> > > > wrote:
> > > > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
> > > > > > On Mon, 16 Oct 2017 23:10:02 -0400
> > > > > > jgli...@redhat.com wrote:
> > > > > > > From: Jérôme Glisse <jgli...@redhat.com>

[...]

> > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > > > index c037d3d34950..ff5bc647b51d 100644
> > > > > > > --- a/mm/huge_memory.c
> > > > > > > +++ b/mm/huge_memory.c
> > > > > > > @@ -1186,8 +1186,15 @@ static int 
> > > > > > > do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> > > > > > > goto out_free_pages;
> > > > > > > VM_BUG_ON_PAGE(!PageHead(page), page);
> > > > > > > 
> > > > > > > +   /*
> > > > > > > +* Leave pmd empty until pte is filled note we must notify 
> > > > > > > here as
> > > > > > > +* concurrent CPU thread might write to new page before the 
> > > > > > > call to
> > > > > > > +* mmu_notifier_invalidate_range_end() happens which can lead 
> > > > > > > to a
> > > > > > > +* device seeing memory write in different order than CPU.
> > > > > > > +*
> > > > > > > +* See Documentation/vm/mmu_notifier.txt
> > > > > > > +*/
> > > > > > > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> > > > > > > -   /* leave pmd empty until pte is filled */
> > > > > > > 
> > > > > > > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
> > > > > > > pmd_populate(vma->vm_mm, &_pmd, pgtable);
> > > > > > > @@ -2026,8 +2033,15 @@ static void 
> > > > > > > __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> > > > > > > pmd_t _pmd;
> > > > > > > int i;
> > > > > > > 
> > > > > > > -   /* leave pmd empty until pte is filled */
> > > > > > > -   pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> > > > > > > +   /*
> > > > > > > +* Leave pmd empty until pte is filled note that it is fine 
> > > > > > > to delay
> > > > > > > +* notification until mmu_notifier_invalidate_range_end() as 
> > > > > > > we are
> > > > > > > +* replacing a zero pmd write protected page with a zero pte 
> > > > > > > write
> > > > > > > +* protected page.
> > > > > > > +*
> > > > > > > +* See Documentation/vm/mmu_notifier.txt
> > > > > > > +*/
> > > > > > > +   pmdp_huge_clear_flush(vma, haddr, pmd);
> > > > > > 
> > > > > > Shouldn't the secondary TLB know if the page size changed?
> > > > > 
> > > > > It should not matter, we are talking virtual to physical on behalf
> > > > > of a device against a process address space. So the hardware should
> > > > > not care about the page size.
> > > > > 
> > > > 
> > > > Does that not indicate how much the device can access? Could it try
> > > > to access more than what is mapped?
> > > 
> > > Assuming device has huge TLB and 2MB huge page with 4K small page.
> > > You are going from one 1 TLB covering a 2MB zero page to 512 TLB
> > > each covering 4K. Both case is read only and both case are pointing
> > > to same data (ie zero).
> > > 
> > > It is fine to delay the TLB invalidate on the device to the call of
> > > mmu_notifier_invalidate_range_end(). The device will keep using the
> > > huge TLB for a little longer but both CPU and device are looking at
> > > same data.
> > > 
> > > Now if there is a racing thread that replace one of the 512 zeor page
> > > after the split but before mmu_notifier_invalidate_range_end() that
> > > code path would call mmu_notifier_invalidate_range() before changing
> > > the pte to point to something else. Which should shoot down the device
> > > TLB (it would be a serious device bug if this did not work).
> > 
> > OK.. This seems reasonable, but I'd really like to see if it can be
> > tested
> 
> Well hard to test, many factors first each device might react differently.
> Device that only store TLB at 4k granularity are fine. Clever device that
> can store TLB for 4k, 2M, ... can ignore an invalidation that is smaller
> than their TLB entry ie getting a 4K invalidation would not invalidate a
> 2MB TLB entry in the device. I consider this as buggy. I will go look at
> the PCIE ATS specification one more time and see if there is any wording
> related that. I might bring up a question to the PCIE standard body if not.

So inside PCIe ATS there is the definition of a "minimum translation or
invalidate size" which says 4096 bytes. So my understanding is that
hardware must support 4K invalidation in all cases and thus we should
be safe from the possible hazard above.

But nonetheless I will repost without the optimization for huge pages
to be more conservative, as anyway we want to be correct before we care
about the last bit of optimization.

Cheers,
Jérôme


Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2

2017-10-21 Thread Jerome Glisse
On Sat, Oct 21, 2017 at 04:54:40PM +1100, Balbir Singh wrote:
> On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote:
> > On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote:
> > > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse <jgli...@redhat.com> wrote:
> > > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
> > > > > On Mon, 16 Oct 2017 23:10:02 -0400
> > > > > jgli...@redhat.com wrote:
> > > > > 
> > > > > > From: Jérôme Glisse <jgli...@redhat.com>
> > > > > > 
> > > > > > +   /*
> > > > > > +* No need to call mmu_notifier_invalidate_range() as 
> > > > > > we are
> > > > > > +* downgrading page table protection not changing it to 
> > > > > > point
> > > > > > +* to a new page.
> > > > > > +*
> > > > > > +* See Documentation/vm/mmu_notifier.txt
> > > > > > +*/
> > > > > > if (pmdp) {
> > > > > >  #ifdef CONFIG_FS_DAX_PMD
> > > > > > pmd_t pmd;
> > > > > > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct 
> > > > > > address_space *mapping,
> > > > > > pmd = pmd_wrprotect(pmd);
> > > > > > pmd = pmd_mkclean(pmd);
> > > > > > set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> > > > > > -   mmu_notifier_invalidate_range(vma->vm_mm, 
> > > > > > start, end);
> > > > > 
> > > > > Could the secondary TLB still see the mapping as dirty and propagate 
> > > > > the dirty bit back?
> > > > 
> > > > I am assuming hardware does sane thing of setting the dirty bit only
> > > > when walking the CPU page table when device does a write fault ie
> > > > once the device get a write TLB entry the dirty is set by the IOMMU
> > > > when walking the page table before returning the lookup result to the
> > > > device and that it won't be set again latter (ie propagated back
> > > > latter).
> > > > 
> > > 
> > > The other possibility is that the hardware things the page is writable
> > > and already
> > > marked dirty. It allows writes and does not set the dirty bit?
> > 
> > I thought about this some more and the patch can not regress anything
> > that is not broken today. So if we assume that device can propagate
> > dirty bit because it can cache the write protection than all current
> > code is broken for two reasons:
> > 
> > First one is current code clear pte entry, build a new pte value with
> > write protection and update pte entry with new pte value. So any PASID/
> > ATS platform that allows device to cache the write bit and set dirty
> > bit anytime after that can race during that window and you would loose
> > the dirty bit of the device. That is not that bad as you are gonna
> > propagate the dirty bit to the struct page.
> 
> But they stay consistent with the notifiers, so from the OS perspective
> it notifies of any PTE changes as they happen. When the ATS platform sees
> invalidation, it invalidates it's PTE's as well.
> 
> I was speaking of the case where the ATS platform could assume it has
> write access and has not seen any invalidation, the OS could return
> back to user space or the caller with write bit clear, but the ATS
> platform could still do a write since it's not seen the invalidation.

I understood what you said and what is above applies. I am removing only
one of the invalidations, not both. So with that patch the invalidation is
delayed until after the page table lock is dropped, but before dax/page_mkclean
returns. Hence any further activity will be read only on any device too
once we exit those functions.

The only difference is the window during which the device can report a dirty
pte. Before that patch the 2 "~bogus~" windows were small:
  First window: between pmd/pte_get_clear_flush and set_pte/pmd
  Second window: between set_pte/pmd and mmu_notifier_invalidate_range

The first window stays the same; the second window is bigger, potentially
a lot bigger if the thread is preempted before mmu_notifier_invalidate_range_end().

But that is fine, as in that case the page is reported as dirty and thus
we are not missing anything, and the kernel code does not care about
seeing a read-only pte marked as dirty.
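
Concretely, the pattern after the patch is roughly the following (a sketch
following the quoted dax mkclean hunk, not the exact code):

	mmu_notifier_invalidate_range_start(mm, start, end);
	spin_lock(ptl);
	pte = ptep_clear_flush(vma, address, ptep);	/* CPU TLB flushed here */
	pte = pte_wrprotect(pte);
	pte = pte_mkclean(pte);
	set_pte_at(mm, address, ptep, pte);
	/* no mmu_notifier_invalidate_range() here anymore: we only downgrade
	 * the protection, the device TLB is flushed below before returning */
	spin_unlock(ptl);
	mmu_notifier_invalidate_range_end(mm, start, end);	/* device TLB flush */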

> 
> > 
> > Second one i

Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2

2017-10-19 Thread Jerome Glisse
On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote:
> On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse <jgli...@redhat.com> wrote:
> > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
> >> On Mon, 16 Oct 2017 23:10:02 -0400
> >> jgli...@redhat.com wrote:
> >>
> >> > From: Jérôme Glisse <jgli...@redhat.com>
> >> >
> >> > +   /*
> >> > +* No need to call mmu_notifier_invalidate_range() as we are
> >> > +* downgrading page table protection not changing it to point
> >> > +* to a new page.
> >> > +*
> >> > +* See Documentation/vm/mmu_notifier.txt
> >> > +*/
> >> > if (pmdp) {
> >> >  #ifdef CONFIG_FS_DAX_PMD
> >> > pmd_t pmd;
> >> > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct 
> >> > address_space *mapping,
> >> > pmd = pmd_wrprotect(pmd);
> >> > pmd = pmd_mkclean(pmd);
> >> > set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> >> > -   mmu_notifier_invalidate_range(vma->vm_mm, start, 
> >> > end);
> >>
> >> Could the secondary TLB still see the mapping as dirty and propagate the 
> >> dirty bit back?
> >
> > I am assuming hardware does sane thing of setting the dirty bit only
> > when walking the CPU page table when device does a write fault ie
> > once the device get a write TLB entry the dirty is set by the IOMMU
> > when walking the page table before returning the lookup result to the
> > device and that it won't be set again latter (ie propagated back
> > latter).
> >
> 
> The other possibility is that the hardware things the page is writable
> and already
> marked dirty. It allows writes and does not set the dirty bit?

I thought about this some more and the patch cannot regress anything
that is not broken today. So if we assume that the device can propagate the
dirty bit because it can cache the write protection, then all current
code is broken for two reasons:

The first one is that current code clears the pte entry, builds a new pte value
with write protection and updates the pte entry with the new pte value. So any
PASID/ATS platform that allows the device to cache the write bit and set the
dirty bit anytime after that can race during that window and you would lose
the dirty bit of the device. That is not that bad as you are going to
propagate the dirty bit to the struct page.

The second one is if the dirty bit is propagated back to the new write
protected pte. A quick look at the code suggests that when we zap a pte or
do mkclean we don't check that the pte has write permission but only
care about the dirty bit. So it should not have any bad consequences.

After this patch only the second window is bigger and thus more likely
to happen. But nothing sinister should happen from that.


> 
> > I should probably have spell that out and maybe some of the ATS/PASID
> > implementer did not do that.
> >
> >>
> >> >  unlock_pmd:
> >> > spin_unlock(ptl);
> >> >  #endif
> >> > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct 
> >> > address_space *mapping,
> >> > pte = pte_wrprotect(pte);
> >> > pte = pte_mkclean(pte);
> >> > set_pte_at(vma->vm_mm, address, ptep, pte);
> >> > -   mmu_notifier_invalidate_range(vma->vm_mm, start, 
> >> > end);
> >>
> >> Ditto
> >>
> >> >  unlock_pte:
> >> > pte_unmap_unlock(ptep, ptl);
> >> > }
> >> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> >> > index 6866e8126982..49c925c96b8a 100644
> >> > --- a/include/linux/mmu_notifier.h
> >> > +++ b/include/linux/mmu_notifier.h
> >> > @@ -155,7 +155,8 @@ struct mmu_notifier_ops {
> >> >  * shared page-tables, it not necessary to implement the
> >> >  * invalidate_range_start()/end() notifiers, as
> >> >  * invalidate_range() alread catches the points in time when an
> >> > -* external TLB range needs to be flushed.
> >> > +* external TLB range needs to be flushed. For more in depth
> >> > +* discussion on this see Documentation/vm/mmu_notifier.txt
> >> >  *
> >> >  * The invalidate_range() function is called under the ptl

Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2

2017-10-18 Thread Jerome Glisse
On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
> On Mon, 16 Oct 2017 23:10:02 -0400
> jgli...@redhat.com wrote:
> 
> > From: Jérôme Glisse 
> > 
> > +   /*
> > +* No need to call mmu_notifier_invalidate_range() as we are
> > +* downgrading page table protection not changing it to point
> > +* to a new page.
> > +*
> > +* See Documentation/vm/mmu_notifier.txt
> > +*/
> > if (pmdp) {
> >  #ifdef CONFIG_FS_DAX_PMD
> > pmd_t pmd;
> > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct 
> > address_space *mapping,
> > pmd = pmd_wrprotect(pmd);
> > pmd = pmd_mkclean(pmd);
> > set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> > -   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> 
> Could the secondary TLB still see the mapping as dirty and propagate the 
> dirty bit back?

I am assuming hardware does the sane thing of setting the dirty bit only
when walking the CPU page table when the device does a write fault, ie
once the device gets a write TLB entry the dirty bit is set by the IOMMU
when walking the page table before returning the lookup result to the
device, and that it won't be set again later (ie propagated back
later).

I should probably have spelled that out, and maybe some of the ATS/PASID
implementers did not do that.

> 
> >  unlock_pmd:
> > spin_unlock(ptl);
> >  #endif
> > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct 
> > address_space *mapping,
> > pte = pte_wrprotect(pte);
> > pte = pte_mkclean(pte);
> > set_pte_at(vma->vm_mm, address, ptep, pte);
> > -   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> 
> Ditto
> 
> >  unlock_pte:
> > pte_unmap_unlock(ptep, ptl);
> > }
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 6866e8126982..49c925c96b8a 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -155,7 +155,8 @@ struct mmu_notifier_ops {
> >  * shared page-tables, it not necessary to implement the
> >  * invalidate_range_start()/end() notifiers, as
> >  * invalidate_range() alread catches the points in time when an
> > -* external TLB range needs to be flushed.
> > +* external TLB range needs to be flushed. For more in depth
> > +* discussion on this see Documentation/vm/mmu_notifier.txt
> >  *
> >  * The invalidate_range() function is called under the ptl
> >  * spin-lock and not allowed to sleep.
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index c037d3d34950..ff5bc647b51d 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct 
> > vm_fault *vmf, pmd_t orig_pmd,
> > goto out_free_pages;
> > VM_BUG_ON_PAGE(!PageHead(page), page);
> >  
> > +   /*
> > +* Leave pmd empty until pte is filled note we must notify here as
> > +* concurrent CPU thread might write to new page before the call to
> > +* mmu_notifier_invalidate_range_end() happens which can lead to a
> > +* device seeing memory write in different order than CPU.
> > +*
> > +* See Documentation/vm/mmu_notifier.txt
> > +*/
> > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> > -   /* leave pmd empty until pte is filled */
> >  
> > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
> > pmd_populate(vma->vm_mm, &_pmd, pgtable);
> > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct 
> > vm_area_struct *vma,
> > pmd_t _pmd;
> > int i;
> >  
> > -   /* leave pmd empty until pte is filled */
> > -   pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> > +   /*
> > +* Leave pmd empty until pte is filled note that it is fine to delay
> > +* notification until mmu_notifier_invalidate_range_end() as we are
> > +* replacing a zero pmd write protected page with a zero pte write
> > +* protected page.
> > +*
> > +* See Documentation/vm/mmu_notifier.txt
> > +*/
> > +   pmdp_huge_clear_flush(vma, haddr, pmd);
> 
> Shouldn't the secondary TLB know if the page size changed?

It should not matter: we are talking about virtual-to-physical
translation on behalf of a device against a process address space,
so the hardware should not care about the page size.

Moreover, if any of the new 512 zero 4K pages (assuming 2MB huge pages
and 4K base pages) is replaced by something new, then a device TLB
shootdown will happen before the new page is set.

The only issue I can think of is if the IOMMU TLB (if there is one) or
the device TLB (you do expect that there is one) does not invalidate a
TLB entry when the TLB shootdown covers a smaller range than the TLB
entry. That would be idiotic, but yes, I know, hardware bugs happen.


> 
> >  

Re: [PATCH 0/2] Optimize mmu_notifier->invalidate_range callback

2017-10-18 Thread Jerome Glisse
On Thu, Oct 19, 2017 at 01:43:19PM +1100, Balbir Singh wrote:
> On Mon, 16 Oct 2017 23:10:01 -0400
> jgli...@redhat.com wrote:
> 
> > From: Jérôme Glisse 
> > 
> > (Andrew you already have v1 in your queue of patch 1, patch 2 is new,
> >  i think you can drop it patch 1 v1 for v2, v2 is bit more conservative
> >  and i fixed typos)
> > 
> > All this only affect user of invalidate_range callback (at this time
> > CAPI arch/powerpc/platforms/powernv/npu-dma.c, IOMMU ATS/PASID in
> > drivers/iommu/amd_iommu_v2.c|intel-svm.c)
> > 
> > This patchset remove useless double call to mmu_notifier->invalidate_range
> > callback wherever it is safe to do so. The first patch just remove useless
> > call
> 
> As in an extra call? Where does that come from?

Before this patch you had the following pattern:
  mmu_notifier_invalidate_range_start();
  take_page_table_lock()
  ...
  update_page_table()
  mmu_notifier_invalidate_range()
  ...
  drop_page_table_lock()
  mmu_notifier_invalidate_range_end();

It happens that mmu_notifier_invalidate_range_end() also makes an
unconditional call to mmu_notifier_invalidate_range(), so in the
above scenario you had 2 calls to mmu_notifier_invalidate_range().

Obviously one of the 2 calls is useless. In some cases you can drop
the first call (the one under the page table lock); this is what
patch 1 does.

In other cases you can drop the second call, the one that happens
inside mmu_notifier_invalidate_range_end(); that is what patch 2 does.

Hence why I am referring to a useless double call. I have added more
documentation to explain all this in the code and also under
Documentation/vm/mmu_notifier.txt
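
To make that concrete, the pattern after the optimization looks roughly
like this (an illustrative sketch, not a hunk from the patches; the
_only_end() name is the one introduced by patch 2):

  mmu_notifier_invalidate_range_start();
  take_page_table_lock()
  ...
  update_page_table()
  mmu_notifier_invalidate_range()
  ...
  drop_page_table_lock()
  /* skip ->invalidate_range here, it already ran under the lock */
  mmu_notifier_invalidate_range_only_end();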


> 
> > and add documentation explaining why it is safe to do so. The second
> > patch go further by introducing mmu_notifier_invalidate_range_only_end()
> > which skip callback to invalidate_range this can be done when clearing a
> > pte, pmd or pud with notification which call invalidate_range right after
> > clearing under the page table lock.
> >
> 
> Balbir Singh.
> 


Re: [PATCH] mm/mmu_notifier: avoid double notification when it is useless

2017-10-03 Thread Jerome Glisse
On Tue, Oct 03, 2017 at 05:43:47PM -0700, Nadav Amit wrote:
> Jerome Glisse <jgli...@redhat.com> wrote:
> 
> > On Wed, Oct 04, 2017 at 01:42:15AM +0200, Andrea Arcangeli wrote:
> > 
> >> I'd like some more explanation about the inner working of "that new
> >> user" as per comment above.
> >> 
> >> It would be enough to drop mmu_notifier_invalidate_range from above
> >> without adding it to the filebacked case. The above gives higher prio
> >> to the hypothetical and uncertain future case, than to the current
> >> real filebacked case that doesn't need ->invalidate_range inside the
> >> PT lock, or do you see something that might already need such
> >> ->invalidate_range?
> > 
> > No i don't see any new user today that might need such invalidate but
> > i was trying to be extra cautious as i have a tendency to assume that
> > someone might do a patch that use try_to_unmap() without going through
> > all the comments in the function and thus possibly using it in an
> > unexpected way from mmu_notifier callback point of view. I am fine
> > with putting the burden on new user to get it right and adding an
> > extra warning in the function description to try to warn people in a
> > sensible way.
> 
> I must be missing something. After the PTE is changed, but before the
> secondary TLB notification/invalidation - What prevents another thread from
> changing the mappings (e.g., using munmap/mmap), and setting a new page
> at that PTE?
> 
> Wouldn’t it end with the page being mapped without a secondary TLB flush in
> between?

munmap would call the mmu_notifier to invalidate the range too, so the
secondary TLB would be properly flushed before any new pte could be set
up for that particular virtual address range. Unlike the CPU TLB flush,
the secondary TLB flush is unconditional and thus the current pte value
does not play any role.
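
As a rough timeline (illustrative only, not code from any patch here):

  munmap side:
    mmu_notifier_invalidate_range_start()
    clear the ptes for the range
    mmu_notifier_invalidate_range() / ..._range_end()
      -> unconditional secondary TLB flush

Only after that teardown completes can a new pte be set up for the same
virtual address range, so the device has to refault and will see the
new page.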

Cheers,
Jérôme


Re: [PATCH] mm/mmu_notifier: avoid double notification when it is useless

2017-10-03 Thread Jerome Glisse
On Wed, Oct 04, 2017 at 01:42:15AM +0200, Andrea Arcangeli wrote:
> Hello Jerome,
> 
> On Fri, Sep 01, 2017 at 01:30:11PM -0400, Jerome Glisse wrote:
> > +Case A is obvious you do not want to take the risk for the device to write 
> > to
> > +a page that might now be use by some completely different task.
> 
> used
> 
> > +is true ven if the thread doing the page table update is preempted right 
> > after
> 
> even
> 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 90731e3b7e58..5706252b828a 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1167,8 +1167,15 @@ static int do_huge_pmd_wp_page_fallback(struct 
> > vm_fault *vmf, pmd_t orig_pmd,
> > goto out_free_pages;
> > VM_BUG_ON_PAGE(!PageHead(page), page);
> >  
> > +   /*
> > +* Leave pmd empty until pte is filled note we must notify here as
> > +* concurrent CPU thread might write to new page before the call to
> > +* mmu_notifier_invalidate_range_end() happen which can lead to a
> 
> happens
> 
> > +* device seeing memory write in different order than CPU.
> > +*
> > +* See Documentation/vm/mmu_notifier.txt
> > +*/
> > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> > -   /* leave pmd empty until pte is filled */
> >  
> 
> Here we can change the following mmu_notifier_invalidate_range_end to
> skip calling ->invalidate_range. It could be called
> mmu_notifier_invalidate_range_only_end, or other suggestions
> welcome. Otherwise we'll repeat the call for nothing.
> 
> We need it inside the PT lock for the ordering issue, but we don't
> need to run it twice.
> 
> Same in do_huge_pmd_wp_page, wp_page_copy and
> migrate_vma_insert_page. Every time *clear_flush_notify is used
> mmu_notifier_invalidate_range_only_end should be called after it,
> instead of mmu_notifier_invalidate_range_end.
> 
> I think optimizing that part too, fits in the context of this patchset
> (if not in the same patch), because the objective is still the same:
> to remove unnecessary ->invalidate_range calls.

Yes, you are right, good idea. I will respin with that too (and with the
various typos you noted, thank you for that). I can do 2 patches or 1, I
don't mind either way. I will probably do 2 at first, and they can be
folded into 1 if people prefer just one.
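
For reference, the call-site change being discussed would look roughly
like this at each *clear_flush_notify user (a sketch only; the final
helper name is whatever the respin ends up using, e.g. the proposed
mmu_notifier_invalidate_range_only_end()):

  mmu_notifier_invalidate_range_start(mm, start, end);
  ...
  pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
  /* ->invalidate_range already ran under the page table lock above */
  ...
  -mmu_notifier_invalidate_range_end(mm, start, end);
  +mmu_notifier_invalidate_range_only_end(mm, start, end);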


> 
> > +* No need to notify as we downgrading page
> 
> are
> 
> > +* table protection not changing it to point
> > +* to a new page.
> > +*
> 
> > +* No need to notify as we downgrading page table to read only
> 
> are
> 
> > +* No need to notify as we replacing a read only page with another
> 
> are
> 
> > @@ -1510,13 +1515,43 @@ static bool try_to_unmap_one(struct page *page, 
> > struct vm_area_struct *vma,
> > if (pte_soft_dirty(pteval))
> > swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > set_pte_at(mm, address, pvmw.pte, swp_pte);
> > -   } else
> > +   } else {
> > +   /*
> > +* We should not need to notify here as we reach this
> > +* case only from freeze_page() itself only call from
> > +* split_huge_page_to_list() so everything below must
> > +* be true:
> > +*   - page is not anonymous
> > +*   - page is locked
> > +*
> > +* So as it is a shared page and it is locked, it can
> > +* not be remove from the page cache and replace by
> > +* a new page before mmu_notifier_invalidate_range_end
> > +* so no concurrent thread might update its page table
> > +* to point at new page while a device still is using
> > +* this page.
> > +*
> > +* But we can not assume that new user of try_to_unmap
> > +* will have that in mind so just to be safe here call
> > +* mmu_notifier_invalidate_range()
> > +*
> > +* See Documentation/vm/mmu_notifier.txt
> > +*/
> > dec_mm_counter(mm, mm_counter_file(page));
> > +   mmu_notifier_invalidate_range(mm, address,
> > +   

Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback

2017-08-29 Thread Jerome Glisse
On Tue, Aug 29, 2017 at 05:11:24PM -0700, Linus Torvalds wrote:
> On Tue, Aug 29, 2017 at 4:54 PM, Jérôme Glisse  wrote:
> >
> > Note this is barely tested. I intend to do more testing of next few days
> > but i do not have access to all hardware that make use of the mmu_notifier
> > API.
> 
> Thanks for doing this.
> 
> > First 2 patches convert existing call of mmu_notifier_invalidate_page()
> > to mmu_notifier_invalidate_range() and bracket those call with call to
> > mmu_notifier_invalidate_range_start()/end().
> 
> Ok, those two patches are a bit more complex than I was hoping for,
> but not *too* bad.
> 
> And the final end result certainly looks nice:
> 
> >  16 files changed, 74 insertions(+), 214 deletions(-)
> 
> Yeah, removing all those invalidate_page() notifiers certainly makes
> for a nice patch.
> 
> And I actually think you missed some more lines that can now be
> removed: kvm_arch_mmu_notifier_invalidate_page() should no longer be
> needed either, so you can remove all of those too (most of them are
> empty inline functions, but x86 has one that actually does something.
> 
> So there's an added 30 or so dead lines that should be removed in the
> kvm patch, I think.

Yes, I missed that. I will wait for people to test, and for the results
of my own testing, before reposting if need be; otherwise I will post it
as a separate patch.

> 
> But from a _very_ quick read-through this looks fine. But it obviously
> needs testing.
> 
> People - *especially* the people who saw issues under KVM - can you
> try out Jérôme's patch-series? I added some people to the cc, the full
> series is on lkml. Jérôme - do you have a git branch for people to
> test that they could easily pull and try out?

https://cgit.freedesktop.org/~glisse/linux mmu-notifier branch
git://people.freedesktop.org/~glisse/linux

(Sorry if that tree is a bit big, it has a lot of dead things; I need
 to push a clean and slim one)

Jérôme


Re: TTM placement caching issue/questions

2014-09-03 Thread Jerome Glisse
On Thu, Sep 04, 2014 at 10:12:27AM +1000, Benjamin Herrenschmidt wrote:
 Hi folks !
 
 I've been tracking down some problems with the recent DRI on powerpc and
 stumbled upon something that doesn't look right, and not necessarily
 only for us.
 
 Now it's possible that I haven't fully understood the code here and I
 also don't know to what extent some of that behaviour is necessary for
 some platforms such as Intel GTT bits.
 
 What I've observed with a simple/dumb (no DMA) driver like AST (but this
 probably happens more generally) is that when evicting a BO from VRAM
 into System memory, the TTM tries to preserve the existing caching
 attributes of the VRAM object.
 
 From what I can tell, we end up with going from VRAM to System memory
 type, and we eventually call ttm_bo_select_caching() to select the
 caching option for the target.
 
 This will, from what I can tell, try to use the same caching mode as the
 original object:
 
   if ((cur_placement & caching) != 0)
   result |= (cur_placement & caching);
 
 And cur_placement comes from bo->mem.placement which as far as I can
 tell is based on the placement array which the drivers set up.
 
 Now they tend to uniformly setup the placement for System memory as
 TTM_PL_MASK_CACHING which enables all caching modes.
 
 So I end up with, for example, my System memory BOs having
 TTM_PL_FLAG_CACHED not set (though they also don't have
 TTM_PL_FLAG_UNCACHED) and TTM_PL_FLAG_WC.
 
 We don't seem to use the man->default_caching (which will have
 TTM_PL_FLAG_CACHED) unless there is no matching bit at all between the
 proposed placement and the existing caching mode.
 
 Now this is a problem for several reason that I can think of:
 
  - On a number of powerpc platforms, such as all our server 64-bit one
 for example, it's actually illegal to map system memory non-cached. The
 system is fully cache coherent for all possible DMA originators (that we
 care about at least) and mapping memory non-cachable while it's mapped
 cachable in the linear mapping can cause nasty cache paradox which, when
 detected by HW, can checkstop the system.
 
 - A similar issue exists, afaik, on ARM >= v7, so anything mapped
 non-cachable must be removed from the linear mapping explicitly since
 otherwise it can be speculatively prefetched into the cache.
 
  - I don't know about x86, but even then, it looks quite sub-optimal to
 map the memory backing of the BOs and access it using a WC rather than a
 cachable mapping attribute.
 
 Now, some folks on IRC mentioned that there might be reasons for the
 current behaviour as to not change the caching attributes when going
 in/out of the GTT on Intel, I don't know how that relates and how that
 works, but maybe that should be enforced by having a different placement
 mask specifically on those chipsets.
 
 Dave, should we change the various PCI drivers for generally coherent
 devices such that the System memory type doesn't allow placements
 without CACHED attribute ? Or at least on coherent platforms ? How do
 detect that ? Should we have a TTM helper to establish the default
 memory placement attributes that normal PCI drivers call to set that
 up so we can have all the necessary arch ifdefs in one single place, at
 least for classic PCI/PCIe stuff (AGP might need additional tweaks) ?
 
 Non-PCI and special drivers like Intel can use a different set of
 placement attributes to represent the requirements of those specific
 platforms (mostly thinking of embedded ARM here which under some
 circumstances might actually require non-cached mappings).
 Or am I missing another part of the puzzle ?
 
 As it-is, things are broken for me even for dumb drivers, and I suspect
 to a large extent with radeon and nouveau too, though in some case we
 might get away with it most of the time ... until the machine locks up
 for some unexplainable reason... This might cause problems on existing
 distros such as RHEL7 with our radeon adapters even.
 
 Any suggestion of what's the best approach to fix it ? I'm happy to
 produce the patches but I'm not that familiar with the TTM so I would
 like to make sure I'm the right track first :-)

While I agree about the issue of incoherently double-mapping the same
page, I think we have more issues. For instance, lately AMD has been
pushing a lot of patches to move things to use uncached memory for
radeon, and as usual those patches come with no comment on the
motivation for those changes. I am ccing the usual suspect ;)

What I understand is that an uncached mapping for some frequently used
buffers gives a significant performance boost (I am assuming this has
to do with all the snooped PCI transaction overhead).

From my understanding this is something that is allowed on other OSes,
and any driver wishing to compete on performance will want that.


So I think we either need a platform flag, or set_pages_array_wc|uc
needs to fail, and this would fall back to a cached mapping if the
fallback code still works. So 

Re: TTM placement caching issue/questions

2014-09-03 Thread Jerome Glisse
On Wed, Sep 03, 2014 at 09:55:53PM -0400, Jerome Glisse wrote:
 On Thu, Sep 04, 2014 at 10:12:27AM +1000, Benjamin Herrenschmidt wrote:
  Hi folks !
  
  I've been tracking down some problems with the recent DRI on powerpc and
  stumbled upon something that doesn't look right, and not necessarily
  only for us.
  
  Now it's possible that I haven't fully understood the code here and I
  also don't know to what extent some of that behaviour is necessary for
  some platforms such as Intel GTT bits.
  
  What I've observed with a simple/dumb (no DMA) driver like AST (but this
  probably happens more generally) is that when evicting a BO from VRAM
  into System memory, the TTM tries to preserve the existing caching
  attributes of the VRAM object.
  
  From what I can tell, we end up with going from VRAM to System memory
  type, and we eventually call ttm_bo_select_caching() to select the
  caching option for the target.
  
  This will, from what I can tell, try to use the same caching mode as the
  original object:
  
  if ((cur_placement  caching) != 0)
  result |= (cur_placement  caching);
  
  And cur_placement comes from bo-mem.placement which as far as I can
  tell is based on the placement array which the drivers set up.
  
  Now they tend to uniformly setup the placement for System memory as
  TTM_PL_MASK_CACHING which enables all caching modes.
  
  So I end up with, for example, my System memory BOs having
  TTM_PL_FLAG_CACHED not set (though they also don't have
  TTM_PL_FLAG_UNCACHED) and TTM_PL_FLAG_WC.
  
  We don't seem to use the man-default_caching (which will have
  TTM_PL_FLAG_CACHED) unless there is no matching bit at all between the
  proposed placement and the existing caching mode.
  
  Now this is a problem for several reason that I can think of:
  
   - On a number of powerpc platforms, such as all our server 64-bit one
  for example, it's actually illegal to map system memory non-cached. The
  system is fully cache coherent for all possible DMA originators (that we
  care about at least) and mapping memory non-cachable while it's mapped
  cachable in the linear mapping can cause nasty cache paradox which, when
  detected by HW, can checkstop the system.
  
   - A similar issue exists, afaik, on ARM = v7, so anything mapped
  non-cachable must be removed from the linear mapping explicitly since
  otherwise it can be speculatively prefetched into the cache.
  
   - I don't know about x86, but even then, it looks quite sub-optimal to
  map the memory backing of the BOs and access it using a WC rather than a
  cachable mapping attribute.
  
  Now, some folks on IRC mentioned that there might be reasons for the
  current behaviour as to not change the caching attributes when going
  in/out of the GTT on Intel, I don't know how that relates and how that
  works, but maybe that should be enforced by having a different placement
  mask specifically on those chipsets.
  
  Dave, should we change the various PCI drivers for generally coherent
  devices such that the System memory type doesn't allow placements
  without CACHED attribute ? Or at least on coherent platforms ? How do
  detect that ? Should we have a TTM helper to establish the default
  memory placement attributes that normal PCI drivers call to set that
  up so we can have all the necessary arch ifdefs in one single place, at
  least for classic PCI/PCIe stuff (AGP might need additional tweaks) ?
  
  Non-PCI and special drivers like Intel can use a different set of
  placement attributes to represent the requirements of those specific
  platforms (mostly thinking of embedded ARM here which under some
  circumstances might actually require non-cached mappings).
  Or am I missing another part of the puzzle ?
  
  As it-is, things are broken for me even for dumb drivers, and I suspect
  to a large extent with radeon and nouveau too, though in some case we
  might get away with it most of the time ... until the machine locks up
  for some unexplainable reason... This might cause problems on existing
  distros such as RHEL7 with our radeon adapters even.
  
  Any suggestion of what's the best approach to fix it ? I'm happy to
  produce the patches but I'm not that familiar with the TTM so I would
  like to make sure I'm the right track first :-)
 
 While i agree about the issue of incoherent double map of same page, i
 think we have more issue. For instance lattely AMD have been pushing a
 lot of patches to move things to use uncached memory for radeon and as
 usual thoses patches comes with no comment to the motivations of those
 changes. I am ccing the usual suspect ;)
 
 What i understand is that uncached mapping for some frequently use buffer
 give a significant performance boost (i am assuming this has to do with
 all the snoop pci transaction overhead).
 
 From my understanding this is something that is allow on other OS and
 any driver wishing to compete from performance point of view will want
 that.
 
 
 So i

Re: TTM placement caching issue/questions

2014-09-03 Thread Jerome Glisse
On Thu, Sep 04, 2014 at 12:25:23PM +1000, Benjamin Herrenschmidt wrote:
 On Wed, 2014-09-03 at 22:07 -0400, Jerome Glisse wrote:
 
  So in the meantime the attached patch should work; it just silently
  ignores the caching attribute request on non-x86, instead of pretending
  that things are set up as expected and then later having the radeon or
  nouveau hw unset the snoop bit.
  
  It's not tested but I think it should work.
 
 I'm still getting placements with !CACHED going from bo_memcpy in
 ttm_io_prot() though ... I'm looking at filtering the placement
 attributes instead.
 
 Ben.

Ok so this one should do the trick.


 
   
   Cheers,
   Jérôme
   

Cheers,
Ben.


 
 
From def7a056d042220f91016d0a7c245ba8e96f90ba Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= jgli...@redhat.com
Date: Wed, 3 Sep 2014 22:04:34 -0400
Subject: [PATCH] drm/ttm: force cached mapping on non x86 platform.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

People interested in providing uncached or write-combined mappings
on their architecture need to do the groundwork inside their
arch-specific code to allow breaking the linear kernel mapping so that
page mapping attributes can be updated; in the meantime, force cached
mappings for non-x86 architectures.

Signed-off-by: Jérôme Glisse jgli...@redhat.com
---
 drivers/gpu/drm/radeon/radeon_ttm.c |  2 +-
 drivers/gpu/drm/ttm/ttm_bo.c|  2 +-
 drivers/gpu/drm/ttm/ttm_tt.c| 47 -
 include/drm/ttm/ttm_bo_driver.h |  2 +-
 4 files changed, 39 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c 
b/drivers/gpu/drm/radeon/radeon_ttm.c
index 72afe82..4dd5060 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -304,7 +304,7 @@ static int radeon_move_vram_ram(struct ttm_buffer_object 
*bo,
return r;
}
 
-   r = ttm_tt_set_placement_caching(bo->ttm, tmp_mem.placement);
+   r = ttm_tt_set_placement_caching(bo->ttm, &tmp_mem.placement);
if (unlikely(r)) {
goto out_cleanup;
}
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 3da89d5..4dc21c3 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -305,7 +305,7 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object 
*bo,
goto out_err;
}
 
-   ret = ttm_tt_set_placement_caching(bo->ttm, mem->placement);
+   ret = ttm_tt_set_placement_caching(bo->ttm, &mem->placement);
if (ret)
goto out_err;
 
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index bf080ab..7cbdb48 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -89,14 +89,6 @@ static inline int ttm_tt_set_page_caching(struct page *p,
 
return ret;
 }
-#else /* CONFIG_X86 */
-static inline int ttm_tt_set_page_caching(struct page *p,
- enum ttm_caching_state c_old,
- enum ttm_caching_state c_new)
-{
-   return 0;
-}
-#endif /* CONFIG_X86 */
 
 /*
  * Change caching policy for the linear kernel map
@@ -149,19 +141,52 @@ out_err:
return ret;
 }
 
-int ttm_tt_set_placement_caching(struct ttm_tt *ttm, uint32_t placement)
+int ttm_tt_set_placement_caching(struct ttm_tt *ttm, uint32_t *placement)
 {
enum ttm_caching_state state;
 
-   if (placement & TTM_PL_FLAG_WC)
+   if (*placement & TTM_PL_FLAG_WC)
state = tt_wc;
-   else if (placement & TTM_PL_FLAG_UNCACHED)
+   else if (*placement & TTM_PL_FLAG_UNCACHED)
state = tt_uncached;
else
state = tt_cached;
 
return ttm_tt_set_caching(ttm, state);
 }
+#else /* CONFIG_X86 */
+int ttm_tt_set_placement_caching(struct ttm_tt *ttm, uint32_t *placement)
+{
+   if (placement & (TTM_PL_TT | TTM_PL_FLAG_SYSTEM)) {
+   ttm->caching_state = tt_cached;
+   } else {
+   if (placement & TTM_PL_FLAG_WC)
+   ttm->caching_state = tt_wc;
+   else if (placement & TTM_PL_FLAG_UNCACHED)
+   ttm->caching_state = tt_uncached;
+   else
+   ttm->caching_state = tt_cached;
+   }
+   /*
+* Some architecture force cached so we need to reflect the
+* new ttm->caching_state into the mem->placement flags.
+*/
+   *placement &= ~TTM_PL_MASK_CACHING;
+   switch (bo->ttm->caching_state) {
+   case tt_wc:
+   *placement |= TTM_PL_FLAG_WC;
+   break;
+   case tt_uncached

Re: TTM placement caching issue/questions

2014-09-03 Thread Jerome Glisse
On Wed, Sep 03, 2014 at 10:31:18PM -0400, Jerome Glisse wrote:
 On Thu, Sep 04, 2014 at 12:25:23PM +1000, Benjamin Herrenschmidt wrote:
  On Wed, 2014-09-03 at 22:07 -0400, Jerome Glisse wrote:
  
   So in the meantime the attached patch should work, it just silently ignore
   the caching attribute request on non x86 instead of pretending that things
   are setup as expected and then latter the radeon ou nouveau hw unsetting
   the snoop bit.
   
   It's not tested but i think it should work.
  
  I'm still getting placements with !CACHED going from bo_memcpy in
  ttm_io_prot() though ... I'm looking at filtering the placement
  attributes instead.
  
  Ben.
 
 Ok so this one should do the trick.

And this one should build :)

 
 
  

Cheers,
Jérôme

 
 Cheers,
 Ben.
 
 
  
  

 From def7a056d042220f91016d0a7c245ba8e96f90ba Mon Sep 17 00:00:00 2001
 From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= jgli...@redhat.com
 Date: Wed, 3 Sep 2014 22:04:34 -0400
 Subject: [PATCH] drm/ttm: force cached mapping on non x86 platform.
 MIME-Version: 1.0
 Content-Type: text/plain; charset=UTF-8
 Content-Transfer-Encoding: 8bit
 
 People interested in providing uncached or write combined mapping
 on there architecture need to do the ground work inside there arch
 specific code to allow to break the linear kernel mapping so that
 page mapping attributes can be updated, in the meantime force cached
 mapping for non x86 architecture.
 
 Signed-off-by: Jérôme Glisse jgli...@redhat.com
 ---
  drivers/gpu/drm/radeon/radeon_ttm.c |  2 +-
  drivers/gpu/drm/ttm/ttm_bo.c|  2 +-
  drivers/gpu/drm/ttm/ttm_tt.c| 47 
 -
  include/drm/ttm/ttm_bo_driver.h |  2 +-
  4 files changed, 39 insertions(+), 14 deletions(-)
 
 diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c 
 b/drivers/gpu/drm/radeon/radeon_ttm.c
 index 72afe82..4dd5060 100644
 --- a/drivers/gpu/drm/radeon/radeon_ttm.c
 +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
 @@ -304,7 +304,7 @@ static int radeon_move_vram_ram(struct ttm_buffer_object 
 *bo,
   return r;
   }
  
 - r = ttm_tt_set_placement_caching(bo-ttm, tmp_mem.placement);
 + r = ttm_tt_set_placement_caching(bo-ttm, tmp_mem.placement);
   if (unlikely(r)) {
   goto out_cleanup;
   }
 diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
 index 3da89d5..4dc21c3 100644
 --- a/drivers/gpu/drm/ttm/ttm_bo.c
 +++ b/drivers/gpu/drm/ttm/ttm_bo.c
 @@ -305,7 +305,7 @@ static int ttm_bo_handle_move_mem(struct 
 ttm_buffer_object *bo,
   goto out_err;
   }
  
 - ret = ttm_tt_set_placement_caching(bo-ttm, mem-placement);
 + ret = ttm_tt_set_placement_caching(bo-ttm, mem-placement);
   if (ret)
   goto out_err;
  
 diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
 index bf080ab..7cbdb48 100644
 --- a/drivers/gpu/drm/ttm/ttm_tt.c
 +++ b/drivers/gpu/drm/ttm/ttm_tt.c
 @@ -89,14 +89,6 @@ static inline int ttm_tt_set_page_caching(struct page *p,
  
   return ret;
  }
 -#else /* CONFIG_X86 */
 -static inline int ttm_tt_set_page_caching(struct page *p,
 -   enum ttm_caching_state c_old,
 -   enum ttm_caching_state c_new)
 -{
 - return 0;
 -}
 -#endif /* CONFIG_X86 */
  
  /*
   * Change caching policy for the linear kernel map
 @@ -149,19 +141,52 @@ out_err:
   return ret;
  }
  
 -int ttm_tt_set_placement_caching(struct ttm_tt *ttm, uint32_t placement)
 +int ttm_tt_set_placement_caching(struct ttm_tt *ttm, uint32_t *placement)
  {
   enum ttm_caching_state state;
  
 - if (placement  TTM_PL_FLAG_WC)
 + if (*placement  TTM_PL_FLAG_WC)
   state = tt_wc;
 - else if (placement  TTM_PL_FLAG_UNCACHED)
 + else if (*placement  TTM_PL_FLAG_UNCACHED)
   state = tt_uncached;
   else
   state = tt_cached;
  
   return ttm_tt_set_caching(ttm, state);
  }
 +#else /* CONFIG_X86 */
 +int ttm_tt_set_placement_caching(struct ttm_tt *ttm, uint32_t *placement)
 +{
 + if (placement  (TTM_PL_TT | TTM_PL_FLAG_SYSTEM)) {
 + ttm-caching_state = tt_cached;
 + } else {
 + if (placement  TTM_PL_FLAG_WC)
 + ttm-caching_state = tt_wc;
 + else if (placement  TTM_PL_FLAG_UNCACHED)
 + ttm-caching_state = tt_uncached;
 + else
 + ttm-caching_state = tt_cached;
 + }
 + /*
 +  * Some architecture force cached so we need to reflect the
 +  * new ttm-caching_state into the mem-placement flags.
 +  */
 + *placement = ~TTM_PL_MASK_CACHING;
 + switch (bo

Re: TTM placement caching issue/questions

2014-09-03 Thread Jerome Glisse
On Wed, Sep 03, 2014 at 10:31:18PM -0400, Jerome Glisse wrote:
 On Thu, Sep 04, 2014 at 12:25:23PM +1000, Benjamin Herrenschmidt wrote:
  On Wed, 2014-09-03 at 22:07 -0400, Jerome Glisse wrote:
  
   So in the meantime the attached patch should work, it just silently ignore
   the caching attribute request on non x86 instead of pretending that things
   are setup as expected and then latter the radeon ou nouveau hw unsetting
   the snoop bit.
   
   It's not tested but i think it should work.
  
  I'm still getting placements with !CACHED going from bo_memcpy in
  ttm_io_prot() though ... I'm looking at filtering the placement
  attributes instead.
  
  Ben.
 
 Ok so this one should do the trick.

Ok final version ... famous last word.


 
 
  

Cheers,
Jérôme

 
 Cheers,
 Ben.
 
 
  
  

 From def7a056d042220f91016d0a7c245ba8e96f90ba Mon Sep 17 00:00:00 2001
 From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= jgli...@redhat.com
 Date: Wed, 3 Sep 2014 22:04:34 -0400
 Subject: [PATCH] drm/ttm: force cached mapping on non x86 platform.
 MIME-Version: 1.0
 Content-Type: text/plain; charset=UTF-8
 Content-Transfer-Encoding: 8bit
 
 People interested in providing uncached or write combined mapping
 on there architecture need to do the ground work inside there arch
 specific code to allow to break the linear kernel mapping so that
 page mapping attributes can be updated, in the meantime force cached
 mapping for non x86 architecture.
 
 Signed-off-by: Jérôme Glisse jgli...@redhat.com
 ---
  drivers/gpu/drm/radeon/radeon_ttm.c |  2 +-
  drivers/gpu/drm/ttm/ttm_bo.c|  2 +-
  drivers/gpu/drm/ttm/ttm_tt.c| 47 
 -
  include/drm/ttm/ttm_bo_driver.h |  2 +-
  4 files changed, 39 insertions(+), 14 deletions(-)
 
 diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c 
 b/drivers/gpu/drm/radeon/radeon_ttm.c
 index 72afe82..4dd5060 100644
 --- a/drivers/gpu/drm/radeon/radeon_ttm.c
 +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
 @@ -304,7 +304,7 @@ static int radeon_move_vram_ram(struct ttm_buffer_object 
 *bo,
   return r;
   }
  
 - r = ttm_tt_set_placement_caching(bo-ttm, tmp_mem.placement);
 + r = ttm_tt_set_placement_caching(bo-ttm, tmp_mem.placement);
   if (unlikely(r)) {
   goto out_cleanup;
   }
 diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
 index 3da89d5..4dc21c3 100644
 --- a/drivers/gpu/drm/ttm/ttm_bo.c
 +++ b/drivers/gpu/drm/ttm/ttm_bo.c
 @@ -305,7 +305,7 @@ static int ttm_bo_handle_move_mem(struct 
 ttm_buffer_object *bo,
   goto out_err;
   }
  
 - ret = ttm_tt_set_placement_caching(bo-ttm, mem-placement);
 + ret = ttm_tt_set_placement_caching(bo-ttm, mem-placement);
   if (ret)
   goto out_err;
  
 diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
 index bf080ab..7cbdb48 100644
 --- a/drivers/gpu/drm/ttm/ttm_tt.c
 +++ b/drivers/gpu/drm/ttm/ttm_tt.c
 @@ -89,14 +89,6 @@ static inline int ttm_tt_set_page_caching(struct page *p,
  
   return ret;
  }
 -#else /* CONFIG_X86 */
 -static inline int ttm_tt_set_page_caching(struct page *p,
 -   enum ttm_caching_state c_old,
 -   enum ttm_caching_state c_new)
 -{
 - return 0;
 -}
 -#endif /* CONFIG_X86 */
  
  /*
   * Change caching policy for the linear kernel map
 @@ -149,19 +141,52 @@ out_err:
   return ret;
  }
  
 -int ttm_tt_set_placement_caching(struct ttm_tt *ttm, uint32_t placement)
 +int ttm_tt_set_placement_caching(struct ttm_tt *ttm, uint32_t *placement)
  {
   enum ttm_caching_state state;
  
 - if (placement  TTM_PL_FLAG_WC)
 + if (*placement  TTM_PL_FLAG_WC)
   state = tt_wc;
 - else if (placement  TTM_PL_FLAG_UNCACHED)
 + else if (*placement  TTM_PL_FLAG_UNCACHED)
   state = tt_uncached;
   else
   state = tt_cached;
  
   return ttm_tt_set_caching(ttm, state);
  }
 +#else /* CONFIG_X86 */
 +int ttm_tt_set_placement_caching(struct ttm_tt *ttm, uint32_t *placement)
 +{
 + if (placement  (TTM_PL_TT | TTM_PL_FLAG_SYSTEM)) {
 + ttm-caching_state = tt_cached;
 + } else {
 + if (placement  TTM_PL_FLAG_WC)
 + ttm-caching_state = tt_wc;
 + else if (placement  TTM_PL_FLAG_UNCACHED)
 + ttm-caching_state = tt_uncached;
 + else
 + ttm-caching_state = tt_cached;
 + }
 + /*
 +  * Some architecture force cached so we need to reflect the
 +  * new ttm-caching_state into the mem-placement flags.
 +  */
 + *placement = ~TTM_PL_MASK_CACHING

Re: [PATCHv5 0/2] Speed Cap fixes for ppc64

2013-05-06 Thread Jerome Glisse
On Mon, May 6, 2013 at 10:32 AM, Alex Deucher alexdeuc...@gmail.com wrote:
 On Fri, May 3, 2013 at 7:01 PM, Benjamin Herrenschmidt
 b...@kernel.crashing.org wrote:
 On Fri, 2013-05-03 at 19:43 -0300, Kleber Sacilotto de Souza wrote:

 This patch series does:
   1. max_bus_speed is used to set the device to gen2 speeds
   2. on power there's no longer a conflict between the pseries call and 
 other
 architectures, because the overwrite is done via a ppc_md hook
   3. radeon is using bus-max_bus_speed instead of 
 drm_pcie_get_speed_cap_mask
 for gen2 capability detection

 The first patch consists of some architecture changes, such as adding a 
 hook on
 powerpc for pci_root_bridge_prepare, so that pseries will initialize it to a
 function, while all other architectures get a NULL pointer. So that whenever
 pci_create_root_bus is called, we'll get max_bus_speed properly setup from
 OpenFirmware.

 The second patch consists of simple radeon changes not to call
 drm_get_pcie_speed_cap_mask anymore. I assume that on x86 machines,
 the max_bus_speed property will be properly set already.

 So I'm ok with the approach now and I might even put the powerpc patch
 in for 3.10 since arguably we are fixing a nasty bug (uninitialized
 max_bus_speed).

 David, what's your feeling about the radeon change ? It would be nice if
 that could go in soon for various distro targets :-) On the other hand
 I'm not going to be pushy if you are not comfortable with it.

 FWIW, the radeon change looks fine to me.

 Alex

As said previously, looks fine to me too.

Cheers,
Jerome


Re: [PATCHv3 0/2] Speed Cap fixes for ppc64

2013-04-12 Thread Jerome Glisse
On Thu, Apr 11, 2013 at 9:13 AM, Lucas Kannebley Tavares 
luca...@linux.vnet.ibm.com wrote:

 After all the comments in the last patch series, I did a refactoring of
 what I was proposing and came up with this. Basically, now:
   1. max_bus_speed is used to set the device to gen2 speeds
   2. on power there's no longer a conflict between the pseries call and
 other architectures, because the overwrite is done via a ppc_md hook
   3. radeon is using bus-max_bus_speed instead of
 drm_pcie_get_speed_cap_mask for gen2 capability detection

 The first patch consists of some architecture changes, such as adding a
 hook on powerpc for pci_root_bridge_prepare, so that pseries will
 initialize it to a function, while all other architectures get a NULL
 pointer. So that whenever whenever pci_create_root_bus is called, we'll get
 max_bus_speed properly setup from OpenFirmware.

 The second patch consists of simple radeon changes not to call
 drm_get_pcie_speed_cap_mask anymore. I assume that on x86 machines, the
 max_bus_speed property will be properly set already.


The radeon changes are:

Reviewed-by: Jerome Glisse jgli...@redhat.com



 Lucas Kannebley Tavares (2):
   ppc64: perform proper max_bus_speed detection
   radeon: use max_bus_speed to activate gen2 speeds

  arch/powerpc/include/asm/machdep.h |2 +
  arch/powerpc/kernel/pci-common.c   |8 +
  arch/powerpc/platforms/pseries/pci.c   |   51
 
  arch/powerpc/platforms/pseries/setup.c |4 ++
  drivers/gpu/drm/radeon/evergreen.c |9 +
  drivers/gpu/drm/radeon/r600.c  |8 +
  drivers/gpu/drm/radeon/rv770.c |8 +
  7 files changed, 69 insertions(+), 21 deletions(-)

 --
 1.7.4.4


Re: [PATCHv2 0/3] PCI Speed Cap Fixes for ppc64

2013-03-20 Thread Jerome Glisse
On Wed, Mar 20, 2013 at 1:24 AM, Lucas Kannebley Tavares
luca...@linux.vnet.ibm.com wrote:
 This patch series first implements a function called pcie_get_speed_cap_mask
 in the PCI subsystem based off from drm_pcie_get_speed_cap_mask in drm. Then
 it removes the latter and fixes all references to it. And ultimately, it
 implements an architecture-specific version of the same function for ppc64.

 The refactor is done because the function that was used by drm is more
 architecture goo than module-specific. Whilst the function also needed a
 platform-specific implementation to get PCIE Gen2 speeds on ppc64.

Looks good to me, but we probably want someone from the PCI side to take a look.

Reviewed-by: Jerome Glisse jgli...@redhat.com


 Lucas Kannebley Tavares (3):
   pci: added pcie_get_speed_cap_mask function
   drm: removed drm_pcie_get_speed_cap_mask function
   ppc64: implemented pcibios_get_speed_cap_mask

  arch/powerpc/platforms/pseries/pci.c |   35 +++
  drivers/gpu/drm/drm_pci.c|   38 -
  drivers/gpu/drm/radeon/evergreen.c   |5 ++-
  drivers/gpu/drm/radeon/r600.c|5 ++-
  drivers/gpu/drm/radeon/rv770.c   |5 ++-
  drivers/pci/pci.c|   44 
 ++
  include/drm/drmP.h   |6 
  include/linux/pci.h  |6 
  8 files changed, 94 insertions(+), 50 deletions(-)

 --
 1.7.4.4
