Re: [PATCH 0/3] elide generated code for folded p4d/pud

2019-10-10 Thread Kirill A. Shutemov
On Wed, Oct 09, 2019 at 10:26:55PM +, Vineet Gupta wrote:
> Hi,
> 
> This series elides extraneous generated code for folded p4d/pud.
> This came up when trying to remove __ARCH_USE_5LEVEL_HACK from the ARC port.
> The code savings are not a whole lot, but still worthwhile IMHO.

Agreed.

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH v3] ARC: mm: remove __ARCH_USE_5LEVEL_HACK

2019-10-10 Thread Kirill A. Shutemov
On Wed, Oct 09, 2019 at 06:57:31PM +, Vineet Gupta wrote:
> Add the intermediate p4d accessors to make it 5 level compliant.
> 
> This is a non-functional change anyway since ARC has a software page walker
> with 2 lookup levels (pgd -> pte).
> 
> There is slight code bloat due to pulling in needless p*d_free_tlb()
> macros, which needs to be addressed separately.
> 
> | bloat-o-meter2 vmlinux-with-5LEVEL_HACK vmlinux-patched
> | add/remove: 0/0 grow/shrink: 2/0 up/down: 128/0 (128)
> | function old new   delta
> | free_pgd_range   546 656+110
> | p4d_clear_bad  2  20 +18
> | Total: Before=4137148, After=4137276, chg 0.00%
> 
> Cc: Kirill A. Shutemov 
> Signed-off-by: Vineet Gupta 

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present

2019-10-09 Thread Kirill A. Shutemov
On Tue, Oct 08, 2019 at 11:15:02AM +0200, Thomas Hellström (VMware) wrote:
> From: Thomas Hellstrom 
> 
> The pagewalk code was unconditionally splitting transhuge pmds when a
> pte_entry was present. However ideally we'd want to handle transhuge pmds
> in the pmd_entry function and ptes in pte_entry function. So don't split
> huge pmds when there is a pmd_entry function present, but let the callback
> take care of it if necessary.

Do we have any current user that expects split_huge_pmd() in this scenario?

> 
> In order to make sure a virtual address range is handled by one and only
> one callback, and since pmd entries may be unstable, we introduce a
> pmd_entry return code that tells the walk code to continue processing this
> pmd entry rather than to move on. Since caller-defined positive return
> codes (up to 2) are used by current callers, use a high value that allows a
> large range of positive caller-defined return codes for future users.
> 
> Cc: Matthew Wilcox 
> Cc: Will Deacon 
> Cc: Peter Zijlstra 
> Cc: Rik van Riel 
> Cc: Minchan Kim 
> Cc: Michal Hocko 
> Cc: Huang Ying 
> Cc: Jérôme Glisse 
> Cc: Kirill A. Shutemov 
> Suggested-by: Linus Torvalds 
> Signed-off-by: Thomas Hellstrom 
> ---
>  include/linux/pagewalk.h |  8 
>  mm/pagewalk.c| 28 +---
>  2 files changed, 29 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index bddd9759bab9..c4a013eb445d 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -4,6 +4,11 @@
>  
>  #include 
>  
> +/* Highest positive pmd_entry caller-specific return value */
> +#define PAGE_WALK_CALLER_MAX (INT_MAX / 2)
> +/* The handler did not handle the entry. Fall back to the next level */
> +#define PAGE_WALK_FALLBACK   (PAGE_WALK_CALLER_MAX + 1)
> +

That's hacky.

Maybe just use an error code for this? -EAGAIN?
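
A minimal sketch of that alternative, assuming -EAGAIN is reserved to mean
"did not consume this pmd, fall back to the pte level" (only an illustration
of the suggestion above, not mainline behaviour):

        if (ops->pmd_entry) {
                err = ops->pmd_entry(pmd, addr, next, walk);
                if (err == -EAGAIN) {
                        /* handler did not consume the pmd: fall back */
                        err = 0;
                        if (pmd_trans_unstable(pmd))
                                goto again;
                        /* fall through to the pte level below */
                } else if (err) {
                        break;          /* error or caller-specific value */
                } else {
                        continue;       /* pmd fully handled */
                }
        }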

>  struct mm_walk;
>  
>  /**
> @@ -16,6 +21,9 @@ struct mm_walk;
>   *   this handler is required to be able to handle
>   *   pmd_trans_huge() pmds.  They may simply choose to
>   *   split_huge_page() instead of handling it explicitly.
> + *  If the handler did not handle the PMD, or split the
> + *  PMD and wants it handled by the PTE handler, it
> + *  should return PAGE_WALK_FALLBACK.

Indentation is broken. Use tabs.

>   * @pte_entry:   if set, called for each non-empty PTE 
> (4th-level) entry
>   * @pte_hole:if set, called for each hole at all levels
>   * @hugetlb_entry:   if set, called for each hugetlb entry
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 83c0b78363b4..f844c2a2aa60 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -50,10 +50,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, 
> unsigned long end,
>* This implies that each ->pmd_entry() handler
>* needs to know about pmd_trans_huge() pmds
>*/
> - if (ops->pmd_entry)
> + if (ops->pmd_entry) {
>   err = ops->pmd_entry(pmd, addr, next, walk);
> - if (err)
> - break;
> + if (!err)
> + continue;
> + else if (err <= PAGE_WALK_CALLER_MAX)
> + break;
> + WARN_ON(err != PAGE_WALK_FALLBACK);
> + err = 0;
> + if (pmd_trans_unstable(pmd))
> + goto again;
> + /* Fall through */
> + }
>  
>   /*
>* Check this here so we only break down trans_huge
> @@ -61,8 +69,8 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, 
> unsigned long end,
>*/
>   if (!ops->pte_entry)
>   continue;
> -
> - split_huge_pmd(walk->vma, pmd, addr);
> + if (!ops->pmd_entry)
> + split_huge_pmd(walk->vma, pmd, addr);
>   if (pmd_trans_unstable(pmd))
>   goto again;
>   err = walk_pte_range(pmd, addr, next, walk);
> @@ -281,11 +289,17 @@ static int __walk_page_range(unsigned long start, 
> unsigned long end,
>   *
>   *  - 0  : succeeded to handle the current entry, and if you don't reach the
>   * end address yet, continue to walk.
> - *  - >0 : succeeded to handle the current entry, and return to the caller
> - * with caller specific value.
> + *  - >0, 

Re: [PATCH v4 2/9] mm: pagewalk: Take the pagetable lock in walk_pte_range()

2019-10-09 Thread Kirill A. Shutemov
On Tue, Oct 08, 2019 at 11:15:01AM +0200, Thomas Hellström (VMware) wrote:
> From: Thomas Hellstrom 
> 
> Without the lock, anybody modifying a pte from within this function might
> have it concurrently modified by someone else.
> 
> Cc: Matthew Wilcox 
> Cc: Will Deacon 
> Cc: Peter Zijlstra 
> Cc: Rik van Riel 
> Cc: Minchan Kim 
> Cc: Michal Hocko 
> Cc: Huang Ying 
> Cc: Jérôme Glisse 
> Cc: Kirill A. Shutemov 
> Suggested-by: Linus Torvalds 
> Signed-off-by: Thomas Hellstrom 
> ---
>  mm/pagewalk.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index d48c2a986ea3..83c0b78363b4 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -10,8 +10,9 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, 
> unsigned long end,
>   pte_t *pte;
>   int err = 0;
>   const struct mm_walk_ops *ops = walk->ops;
> + spinlock_t *ptl;
>  
> - pte = pte_offset_map(pmd, addr);
> + pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
>   for (;;) {
>   err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
>   if (err)
> @@ -22,7 +23,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, 
> unsigned long end,
>   pte++;
>   }
>  
> - pte_unmap(pte);
> + pte_unmap_unlock(pte - 1, ptl);

NAK.

If ->pte_entry() fails on the first entry of the page table, pte - 1 will
point out side the page table.

And the '- 1' is totally unnecessary as we break the loop before pte++ on
the last iteration.
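
A sketch of the fix this NAK implies (assuming the rest of the loop stays as
in mainline): the loop breaks before pte++ both on error and on the last
iteration, so 'pte' always points inside the page table and can be passed to
pte_unmap_unlock() directly:

        pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
        for (;;) {
                err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
                if (err)
                        break;
                addr += PAGE_SIZE;
                if (addr == end)
                        break;
                pte++;
        }
        pte_unmap_unlock(pte, ptl);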

-- 
 Kirill A. Shutemov


[RFC, PATCH] mm, thp: Try to bound number of pages on deferred split queue

2019-10-09 Thread Kirill A. Shutemov
THPs on the deferred split queue get split by the shrinker when memory
pressure comes.

In the absence of memory pressure, there is no bound on how long the
deferred split queue can get. In extreme cases, the deferred queue can grow
to tens of gigabytes.

It is suboptimal: even without memory pressure we can find a better way to
use the memory (page cache, for instance).

Make deferred_split_huge_page() trigger a work item that splits pages
if we have more than NR_PAGES_ON_QUEUE_TO_SPLIT pages on the queue.

The split can fail (e.g. due to memory pinning by GUP), making the
queue grow despite the effort. Rate-limit the work triggering to at most
once every NR_CALLS_TO_SPLIT calls of deferred_split_huge_page().

NR_PAGES_ON_QUEUE_TO_SPLIT and NR_CALLS_TO_SPLIT are chosen arbitrarily and
will likely require tweaking.
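
A sketch of the trigger described above (the helper name is made up for
illustration; NR_PAGES_ON_QUEUE_TO_SPLIT, NR_CALLS_TO_SPLIT and the
deferred_split fields are the ones introduced by this patch):

        static void maybe_schedule_deferred_split(struct deferred_split *ds_queue)
        {
                /* only bother once the queue is long enough */
                if (ds_queue->split_queue_len < NR_PAGES_ON_QUEUE_TO_SPLIT)
                        return;

                /* rate-limit: at most one work item per NR_CALLS_TO_SPLIT calls */
                if (ds_queue->deferred_split_calls++ % NR_CALLS_TO_SPLIT)
                        return;

                schedule_work(&ds_queue->deferred_split_work);
        }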

The patch risks introducing performance regressions. For a system with
plenty of free memory, triggering the split would cost CPU time (~100ms
per GB of THPs to split).

I have doubts about the approach, so:

Not-Signed-off-by: Kirill A. Shutemov 
---
 include/linux/mmzone.h |   5 ++
 mm/huge_memory.c   | 129 -
 mm/memcontrol.c|   3 +
 mm/page_alloc.c|   2 +
 4 files changed, 100 insertions(+), 39 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bda20282746b..f748542745ec 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -684,7 +684,12 @@ struct deferred_split {
spinlock_t split_queue_lock;
struct list_head split_queue;
unsigned long split_queue_len;
+   unsigned int deferred_split_calls;
+   struct work_struct deferred_split_work;
 };
+
+void flush_deferred_split_queue(struct work_struct *work);
+void flush_deferred_split_queue_memcg(struct work_struct *work);
 #endif
 
 /*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c5cb6dcd6c69..bb7bef856e38 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2842,43 +2842,6 @@ void free_transhuge_page(struct page *page)
free_compound_page(page);
 }
 
-void deferred_split_huge_page(struct page *page)
-{
-   struct deferred_split *ds_queue = get_deferred_split_queue(page);
-#ifdef CONFIG_MEMCG
-   struct mem_cgroup *memcg = compound_head(page)->mem_cgroup;
-#endif
-   unsigned long flags;
-
-   VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-
-   /*
-* The try_to_unmap() in page reclaim path might reach here too,
-* this may cause a race condition to corrupt deferred split queue.
-* And, if page reclaim is already handling the same page, it is
-* unnecessary to handle it again in shrinker.
-*
-* Check PageSwapCache to determine if the page is being
-* handled by page reclaim since THP swap would add the page into
-* swap cache before calling try_to_unmap().
-*/
-   if (PageSwapCache(page))
-   return;
-
-   spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-   if (list_empty(page_deferred_list(page))) {
-   count_vm_event(THP_DEFERRED_SPLIT_PAGE);
-   list_add_tail(page_deferred_list(page), &ds_queue->split_queue);
-   ds_queue->split_queue_len++;
-#ifdef CONFIG_MEMCG
-   if (memcg)
-   memcg_set_shrinker_bit(memcg, page_to_nid(page),
-  deferred_split_shrinker.id);
-#endif
-   }
-   spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
-}
-
 static unsigned long deferred_split_count(struct shrinker *shrink,
struct shrink_control *sc)
 {
@@ -2895,8 +2858,7 @@ static unsigned long deferred_split_count(struct shrinker 
*shrink,
 static unsigned long deferred_split_scan(struct shrinker *shrink,
struct shrink_control *sc)
 {
-   struct pglist_data *pgdata = NODE_DATA(sc->nid);
-   struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
+   struct deferred_split *ds_queue = NULL;
unsigned long flags;
LIST_HEAD(list), *pos, *next;
struct page *page;
@@ -2906,6 +2868,10 @@ static unsigned long deferred_split_scan(struct shrinker 
*shrink,
if (sc->memcg)
ds_queue = &sc->memcg->deferred_split_queue;
 #endif
+   if (!ds_queue) {
+   struct pglist_data *pgdata = NODE_DATA(sc->nid);
+   ds_queue = &pgdata->deferred_split_queue;
+   }
 
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
/* Take pin on all head pages to avoid freeing them under us */
@@ -2957,6 +2923,91 @@ static struct shrinker deferred_split_shrinker = {
 SHRINKER_NONSLAB,
 };
 
+static void __flush_deferred_split_queue(struct pglist_data *pgdata,
+   struct mem_cgroup *memcg)
+{
+   struct shrink_control sc;
+
+   sc.nid = pgdata ? pgdata->node_id : 0;
+   sc.memcg = memcg;
+   sc.nr_to_scan = 0; /* Unlimited */
+
+

Re: [PATCH] ARC: mm: remove __ARCH_USE_5LEVEL_HACK

2019-10-08 Thread Kirill A. Shutemov
On Tue, Oct 08, 2019 at 09:38:36PM +, Vineet Gupta wrote:
> Add the intermediate p4d accessors to make it 5 level compliant.
> 
> Thi sis non-functional change anyways since ARC has software page walker
 ^
 Typo.
> with 2 lookup levels (pgd -> pte)
> 
> Signed-off-by: Vineet Gupta 
> ---
>  arch/arc/include/asm/pgtable.h |  1 -
>  arch/arc/mm/fault.c| 10 --
>  2 files changed, 8 insertions(+), 3 deletions(-)

Have you tested it with CONFIG_HIGHMEM=y? alloc_kmap_pgtable() has to be
converted too.

-- 
 Kirill A. Shutemov


Re: [PATCH] mm: thp: move deferred split queue to memcg's nodeinfo

2019-10-08 Thread Kirill A. Shutemov
On Mon, Oct 07, 2019 at 04:30:30PM +0200, Michal Hocko wrote:
> On Mon 07-10-19 16:19:59, Vlastimil Babka wrote:
> > On 10/2/19 10:43 AM, Michal Hocko wrote:
> > > On Wed 02-10-19 06:16:43, Yang Shi wrote:
> > >> The commit 87eaceb3faa59b9b4d940ec9554ce251325d83fe ("mm: thp: make
> > >> deferred split shrinker memcg aware") makes deferred split queue per
> > >> memcg to resolve memcg pre-mature OOM problem.  But, all nodes end up
> > >> sharing the same queue instead of one queue per-node before the commit.
> > >> It is not a big deal for memcg limit reclaim, but it may cause global
> > >> kswapd to shrink THPs from a different node.
> > >>
> > >> And, 0-day testing reported -19.6% regression of stress-ng's madvise
> > >> test [1].  I didn't see that much regression on my test box (24 threads,
> > >> 48GB memory, 2 nodes), with the same test (stress-ng --timeout 1
> > >> --metrics-brief --sequential 72  --class vm --exclude spawn,exec), I saw
> > >> average -3% (run the same test 10 times then calculate the average since
> > >> the test itself may have most 15% variation according to my test)
> > >> regression sometimes (not every time, sometimes I didn't see regression
> > >> at all).
> > >>
> > >> This might be caused by deferred split queue lock contention.  With some
> > >> configuration (i.e. just one root memcg) the lock contention may be worse
> > >> than before (given 2 nodes, two locks are reduced to one lock).
> > >>
> > >> So, moving deferred split queue to memcg's nodeinfo to make it NUMA
> > >> aware again.
> > >>
> > >> With this change stress-ng's madvise test shows average 4% improvement
> > >> sometimes and I didn't see degradation anymore.
> > > 
> > > My concern about this getting more and more complex
> > > (http://lkml.kernel.org/r/20191002084014.gh15...@dhcp22.suse.cz) holds
> > > here even more. Can we step back and reconsider the whole thing please?
> > 
> > What about freeing immediately after split via workqueue and also have a
> > synchronous version called before going oom? Maybe there would be also
> > other things that would benefit from this scheme instead of traditional
> > reclaim and shrinkers?
> 
> That is exactly what we have discussed some time ago.

Yes, I've posted the patch:

http://lkml.kernel.org/r/20190827125911.boya23eowxhqmopa@box

But I'm still not sure that the approach is right. I expect it to trigger
performance regressions. For a system with plenty of free memory, we will
just pay the split cost for nothing in many cases.

-- 
 Kirill A. Shutemov


Re: [PATCH v10 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-10-08 Thread Kirill A. Shutemov
On Tue, Oct 08, 2019 at 12:58:57PM +, Justin He (Arm Technology China) 
wrote:
> Hi Will
> 
> > -Original Message-
> > From: Will Deacon 
> > Sent: 2019年10月8日 20:40
> > To: Justin He (Arm Technology China) 
> > Cc: Catalin Marinas ; Mark Rutland
> > ; James Morse ; Marc
> > Zyngier ; Matthew Wilcox ; Kirill A.
> > Shutemov ; linux-arm-
> > ker...@lists.infradead.org; linux-kernel@vger.kernel.org; linux-
> > m...@kvack.org; Punit Agrawal ; Thomas
> > Gleixner ; Andrew Morton  > foundation.org>; hejia...@gmail.com; Kaly Xin (Arm Technology China)
> > ; nd 
> > Subject: Re: [PATCH v10 3/3] mm: fix double page fault on arm64 if PTE_AF
> > is cleared
> > 
> > On Tue, Oct 08, 2019 at 02:19:05AM +, Justin He (Arm Technology
> > China) wrote:
> > > > -Original Message-
> > > > From: Will Deacon 
> > > > Sent: 2019年10月1日 20:54
> > > > To: Justin He (Arm Technology China) 
> > > > Cc: Catalin Marinas ; Mark Rutland
> > > > ; James Morse ;
> > Marc
> > > > Zyngier ; Matthew Wilcox ;
> > Kirill A.
> > > > Shutemov ; linux-arm-
> > > > ker...@lists.infradead.org; linux-kernel@vger.kernel.org; linux-
> > > > m...@kvack.org; Punit Agrawal ; Thomas
> > > > Gleixner ; Andrew Morton  > > > foundation.org>; hejia...@gmail.com; Kaly Xin (Arm Technology China)
> > > > 
> > > > Subject: Re: [PATCH v10 3/3] mm: fix double page fault on arm64 if
> > PTE_AF
> > > > is cleared
> > > >
> > > > On Mon, Sep 30, 2019 at 09:57:40AM +0800, Jia He wrote:
> > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > index b1ca51a079f2..1f56b0118ef5 100644
> > > > > --- a/mm/memory.c
> > > > > +++ b/mm/memory.c
> > > > > @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
> > > > >   2;
> > > > >  #endif
> > > > >
> > > > > +#ifndef arch_faults_on_old_pte
> > > > > +static inline bool arch_faults_on_old_pte(void)
> > > > > +{
> > > > > + return false;
> > > > > +}
> > > > > +#endif
> > > >
> > > > Kirill has acked this, so I'm happy to take the patch as-is, however 
> > > > isn't
> > > > it the case that /most/ architectures will want to return true for
> > > > arch_faults_on_old_pte()? In which case, wouldn't it make more sense
> > for
> > > > that to be the default, and have x86 and arm64 provide an override?
> > For
> > > > example, aren't most architectures still going to hit the double fault
> > > > scenario even with your patch applied?
> > >
> > > No, after applying my patch series, only those architectures which don't
> > provide
> > > setting access flag by hardware AND don't implement their
> > arch_faults_on_old_pte
> > > will hit the double page fault.
> > >
> > > The meaning of true for arch_faults_on_old_pte() is "this arch doesn't
> > > have a hardware way of setting the access flag; it might cause a page
> > > fault on an old pte".
> > > I don't want to change other architectures' default behavior here. So by
> > default,
> > > arch_faults_on_old_pte() is false.
> > 
> > ...and my complaint is that this is the majority of supported architectures,
> > so you're fixing something for arm64 which also affects arm, powerpc,
> > alpha, mips, riscv, ...
> 
> So, IIUC, you suggested that:
> 1. by default, arch_faults_on_old_pte() return true
> 2. on X86, let arch_faults_on_old_pte() be overridden as returning false
> 3. on arm64, let it be as-is my patch set.
> 4. let other architectures decide the behavior. (But by default, it will set
> pte_young)
> 
> I am ok with that if no objections from others.
> 
> @Kirill A. Shutemov Do you have any comments? Thanks

Sounds sane to me.

-- 
 Kirill A. Shutemov


Re: [PATCH V4 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-10-07 Thread Kirill A. Shutemov
On Mon, Oct 07, 2019 at 03:51:58PM +0200, Ingo Molnar wrote:
> 
> * Kirill A. Shutemov  wrote:
> 
> > On Mon, Oct 07, 2019 at 03:06:17PM +0200, Ingo Molnar wrote:
> > > 
> > > * Anshuman Khandual  wrote:
> > > 
> > > > This adds a test module which will validate architecture page table 
> > > > helpers
> > > > and accessors regarding compliance with generic MM semantics 
> > > > expectations.
> > > > This will help various architectures in validating changes to the 
> > > > existing
> > > > page table helpers or addition of new ones.
> > > > 
> > > > Test page table and memory pages creating its entries at various levels 
> > > > are
> > > > all allocated from system memory with required alignments. If memory 
> > > > pages
> > > > with required size and alignment could not be allocated, then all 
> > > > depending
> > > > individual tests are skipped.
> > > 
> > > > diff --git a/arch/x86/include/asm/pgtable_64_types.h 
> > > > b/arch/x86/include/asm/pgtable_64_types.h
> > > > index 52e5f5f2240d..b882792a3999 100644
> > > > --- a/arch/x86/include/asm/pgtable_64_types.h
> > > > +++ b/arch/x86/include/asm/pgtable_64_types.h
> > > > @@ -40,6 +40,8 @@ static inline bool pgtable_l5_enabled(void)
> > > >  #define pgtable_l5_enabled() 0
> > > >  #endif /* CONFIG_X86_5LEVEL */
> > > >  
> > > > +#define mm_p4d_folded(mm) (!pgtable_l5_enabled())
> > > > +
> > > >  extern unsigned int pgdir_shift;
> > > >  extern unsigned int ptrs_per_p4d;
> > > 
> > > Any deep reason this has to be a macro instead of proper C?
> > 
> > It's a way to override the generic mm_p4d_folded(). It can be rewritten
> > as inline function + define. Something like:
> > 
> > #define mm_p4d_folded mm_p4d_folded
> > static inline bool mm_p4d_folded(struct mm_struct *mm)
> > {
> > return !pgtable_l5_enabled();
> > }
> > 
> > But I don't see much reason to be more verbose here than needed.
> 
> C type checking? Documentation? Yeah, I know it's just a one-liner, but 
> the principle of the death by a thousand cuts applies here.

Okay, if you think it's worth it. Anshuman, could you fix it up for the next
submission?


> BTW., any reason this must be in the low level pgtable_64_types.h type 
> header, instead of one of the API level header files?

I defined it next to pgtable_l5_enabled(). What would be a more appropriate
place to you? pgtable_64.h? Yeah, that makes sense.

-- 
 Kirill A. Shutemov


Re: [PATCH V4 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-10-07 Thread Kirill A. Shutemov
On Mon, Oct 07, 2019 at 03:06:17PM +0200, Ingo Molnar wrote:
> 
> * Anshuman Khandual  wrote:
> 
> > This adds a test module which will validate architecture page table helpers
> > and accessors regarding compliance with generic MM semantics expectations.
> > This will help various architectures in validating changes to the existing
> > page table helpers or addition of new ones.
> > 
> > Test page table and memory pages creating its entries at various levels are
> > all allocated from system memory with required alignments. If memory pages
> > with required size and alignment could not be allocated, then all depending
> > individual tests are skipped.
> 
> > diff --git a/arch/x86/include/asm/pgtable_64_types.h 
> > b/arch/x86/include/asm/pgtable_64_types.h
> > index 52e5f5f2240d..b882792a3999 100644
> > --- a/arch/x86/include/asm/pgtable_64_types.h
> > +++ b/arch/x86/include/asm/pgtable_64_types.h
> > @@ -40,6 +40,8 @@ static inline bool pgtable_l5_enabled(void)
> >  #define pgtable_l5_enabled() 0
> >  #endif /* CONFIG_X86_5LEVEL */
> >  
> > +#define mm_p4d_folded(mm) (!pgtable_l5_enabled())
> > +
> >  extern unsigned int pgdir_shift;
> >  extern unsigned int ptrs_per_p4d;
> 
> Any deep reason this has to be a macro instead of proper C?

It's a way to override the generic mm_p4d_folded(). It can be rewritten
as inline function + define. Something like:

#define mm_p4d_folded mm_p4d_folded
static inline bool mm_p4d_folded(struct mm_struct *mm)
{
return !pgtable_l5_enabled();
}

But I don't see much reason to be more verbose here than needed.

-- 
 Kirill A. Shutemov


Re: [PATCH] Make SPLIT_RSS_COUNTING configurable

2019-10-06 Thread Kirill A. Shutemov
On Fri, Oct 04, 2019 at 06:45:21AM -0700, Daniel Colascione wrote:
> On Fri, Oct 4, 2019 at 6:26 AM Kirill A. Shutemov  
> wrote:
> > On Fri, Oct 04, 2019 at 02:33:49PM +0200, Michal Hocko wrote:
> > > On Wed 02-10-19 19:08:16, Daniel Colascione wrote:
> > > > On Wed, Oct 2, 2019 at 6:56 PM Qian Cai  wrote:
> > > > > > On Oct 2, 2019, at 4:29 PM, Daniel Colascione  
> > > > > > wrote:
> > > > > >
> > > > > > Adding the correct linux-mm address.
> > > > > >
> > > > > >
> > > > > >> +config SPLIT_RSS_COUNTING
> > > > > >> +   bool "Per-thread mm counter caching"
> > > > > >> +   depends on MMU
> > > > > >> +   default y if NR_CPUS >= SPLIT_PTLOCK_CPUS
> > > > > >> +   help
> > > > > >> + Cache mm counter updates in thread structures and
> > > > > >> + flush them to visible per-process statistics in batches.
> > > > > >> + Say Y here to slightly reduce cache contention in 
> > > > > >> processes
> > > > > >> + with many threads at the expense of decreasing the 
> > > > > >> accuracy
> > > > > >> + of memory statistics in /proc.
> > > > > >> +
> > > > > >> endmenu
> > > > >
> > > > > All those vague words are going to make developers almost
> > > > > impossible to decide the right selection here. It sounds like we
> > > > > should kill SPLIT_RSS_COUNTING at all to simplify the code as
> > > > > the benefit is so small vs the side-effect?
> > > >
> > > > Killing SPLIT_RSS_COUNTING would be my first choice; IME, on mobile
> > > > and a basic desktop, it doesn't make a difference. I figured making it
> > > > a knob would help allay concerns about the performance impact in more
> > > > extreme configurations.
> > >
> > > I do agree with Qian. Either it is really helpful (is it? probably on
> > > the number of cpus) and it should be auto-enabled or it should be
> > > dropped altogether. You cannot really expect people know how to enable
> > > this without a deep understanding of the MM internals. Not to mention
> > > all those users using distro kernels/configs.
> > >
> > > A config option sounds like a bad way forward.
> >
> > And I don't see much point anyway. Reading RSS counters from proc is
> > inherently racy. It can just change either way after the read due to process
> > behaviour.
> 
> Split RSS accounting doesn't make reading from mm counters racy. It
> makes these counters *wrong*. We flush task mm counters to the
> mm_struct once every 64 page faults that a task incurs or when that
> task exits. That means that if a thread takes 63 page faults and then
> sleeps for a week, that thread's process's mm counters are wrong by 63
> pages *for a week*. And some processes have a lot of threads,
> compounding the error. Split RSS accounting means that memory usage
> numbers don't add up. I don't think it's unreasonable to want a mode
> where memory counters agree with other indicators of system
> activity.

It's documented behaviour that has been upstream for 9 years. Why is your workload
special?

The documentation suggests using smaps if you want precise data.
Why would it not fly for you?

> Nobody has demonstrated that split RSS accounting actually helps in
> the real world.

The original commit 34e55232e59f ("mm: avoid false sharing of mm_counter")
shows numbers on cache misses. It's not a real world workload, but you
don't have any numbers at all to back your claim.

> But I've described above, concretely, how split RSS
> accounting hurts. I've been trying for over a year to either disable
> split RSS accounting or to let people opt out of it. If you won't
> remove split RSS accounting and you won't let me add a configuration
> knob that lets people opt out of it, what will you accept?

Keeping stats precise is welcome, but often expensive. It might be
negligible for a small machine, but becomes a problem on a multisocket
machine with dozens or hundreds of cores. We need to keep the kernel scalable.

We have other stats that update asynchronously (e.g. /proc/vmstat). Would
you like to convert them too?
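
For reference, a condensed sketch of the batching being debated, modelled on
the SPLIT_RSS_COUNTING code in mm/memory.c (simplified; see the kernel source
for the real thing):

        #define TASK_RSS_EVENTS_THRESH  (64)

        /* cheap, contention-free update into the per-task cache */
        static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
        {
                struct task_struct *task = current;

                if (likely(task->mm == mm))
                        task->rss_stat.count[member] += val;
                else
                        add_mm_counter(mm, member, val);        /* shared, atomic */
        }

        /* fold the cache into mm->rss_stat every TASK_RSS_EVENTS_THRESH events */
        static void check_sync_rss_stat(struct task_struct *task)
        {
                if (unlikely(task != current))
                        return;
                if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
                        sync_mm_rss(task->mm);
        }

This is why the per-process numbers in /proc can lag by up to 63 pages per
thread until the next flush or task exit.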

-- 
 Kirill A. Shutemov


Re: [PATCH] Make SPLIT_RSS_COUNTING configurable

2019-10-04 Thread Kirill A. Shutemov
On Fri, Oct 04, 2019 at 02:33:49PM +0200, Michal Hocko wrote:
> On Wed 02-10-19 19:08:16, Daniel Colascione wrote:
> > On Wed, Oct 2, 2019 at 6:56 PM Qian Cai  wrote:
> > > > On Oct 2, 2019, at 4:29 PM, Daniel Colascione  wrote:
> > > >
> > > > Adding the correct linux-mm address.
> > > >
> > > >
> > > >> +config SPLIT_RSS_COUNTING
> > > >> +   bool "Per-thread mm counter caching"
> > > >> +   depends on MMU
> > > >> +   default y if NR_CPUS >= SPLIT_PTLOCK_CPUS
> > > >> +   help
> > > >> + Cache mm counter updates in thread structures and
> > > >> + flush them to visible per-process statistics in batches.
> > > >> + Say Y here to slightly reduce cache contention in processes
> > > >> + with many threads at the expense of decreasing the accuracy
> > > >> + of memory statistics in /proc.
> > > >> +
> > > >> endmenu
> > >
> > > All those vague words are going to make developers almost impossible to 
> > > decide the right selection here. It sounds like we should kill 
> > > SPLIT_RSS_COUNTING at all to simplify the code as the benefit is so small 
> > > vs the side-effect?
> > 
> > Killing SPLIT_RSS_COUNTING would be my first choice; IME, on mobile
> > and a basic desktop, it doesn't make a difference. I figured making it
> > a knob would help allay concerns about the performance impact in more
> > extreme configurations.
> 
> I do agree with Qian. Either it is really helpful (is it? probably on
> the number of cpus) and it should be auto-enabled or it should be
> dropped altogether. You cannot really expect people know how to enable
> this without a deep understanding of the MM internals. Not to mention
> all those users using distro kernels/configs.
> 
> A config option sounds like a bad way forward.

And I don't see much point anyway. Reading RSS counters from proc is
inherently racy. It can just change either way after the read due to process
behaviour.

-- 
 Kirill A. Shutemov


Re: [PATCH v3 2/7] mm: Add a walk_page_mapping() function to the pagewalk code

2019-10-04 Thread Kirill A. Shutemov
On Fri, Oct 04, 2019 at 02:58:59PM +0200, Thomas Hellström (VMware) wrote:
> On 10/4/19 2:37 PM, Kirill A. Shutemov wrote:
> > On Thu, Oct 03, 2019 at 01:32:45PM +0200, Thomas Hellström (VMware) wrote:
> > > > > + *   If @mapping allows faulting of huge pmds and puds, it is 
> > > > > desirable
> > > > > + *   that its huge_fault() handler blocks while this function is 
> > > > > running on
> > > > > + *   @mapping. Otherwise a race may occur where the huge entry is 
> > > > > split when
> > > > > + *   it was intended to be handled in a huge entry callback. This 
> > > > > requires an
> > > > > + *   external lock, for example that @mapping->i_mmap_rwsem is held 
> > > > > in
> > > > > + *   write mode in the huge_fault() handlers.
> > > > Em. No. We have ptl for this. It's the only lock required (plus mmap_sem
> > > > on read) to split PMD entry into PTE table. And it can happen not only
> > > > from fault path.
> > > > 
> > > > If you care about splitting compound page under you, take a pin or lock 
> > > > a
> > > > page. It will block split_huge_page().
> > > > 
> > > > Suggestion to block fault path is not viable (and it will not happen
> > > > magically just because of this comment).
> > > > 
> > > I was specifically thinking of this:
> > > 
> > > https://elixir.bootlin.com/linux/latest/source/mm/pagewalk.c#L103
> > > 
> > > If a huge pud is concurrently faulted in here, it will immediately get 
> > > split
> > > without getting processed in pud_entry(). An external lock would protect
> > > against that, but that's perhaps a bug in the pagewalk code?  For pmds the
> > > situation is not the same since when pte_entry is used, all pmds will
> > > unconditionally get split.
> > I *think* it should be fixed with something like this (there's no
> > pud_trans_unstable() yet):
> > 
> > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > index d48c2a986ea3..221a3b945f42 100644
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -102,10 +102,11 @@ static int walk_pud_range(p4d_t *p4d, unsigned long 
> > addr, unsigned long end,
> > break;
> > continue;
> > }
> > +   } else {
> > +   split_huge_pud(walk->vma, pud, addr);
> > }
> > -   split_huge_pud(walk->vma, pud, addr);
> > -   if (pud_none(*pud))
> > +   if (pud_none(*pud) || pud_trans_unstable(*pud))
> > goto again;
> > if (ops->pmd_entry || ops->pte_entry)
> 
> Yes, this seems better. I was looking at implementing a pud_trans_unstable()
> as a basis of fixing problems like this, but when I looked at
> pmd_trans_unstable I got a bit confused:
> 
> Why are devmap huge pmds considered stable? I mean, couldn't anybody just
> run madvise() to clear those just like transhuge pmds?

Matthew, Dan, could you comment on this?

> > Or better yet converted to what we do on pmd level.
> > 
> > Honestly, all the code around PUD THP is missing a lot of ground work.
> > Rushing it upstream for DAX was not the right move.
> > 
> > > There's a similar more scary race in
> > > 
> > > https://elixir.bootlin.com/linux/latest/source/mm/memory.c#L3931
> > > 
> > > It looks like if a concurrent thread faults in a huge pud just after the
> > > test for pud_none in that pmd_alloc, things might go pretty bad.
> > Hm? It will fail the next pmd_none() check under ptl. Do you have a
> > particular racing scenario?
> > 
> Yes, I misinterpreted the code somewhat, but here's the scenario that looks
> racy:
> 
> Thread 1                               Thread 2
> huge_fault(pud) - Fell back, for example
>                   because of a write fault
>                   on dirty-tracking.
>                                         huge_fault(pud) - Taken, read fault.
> pmd_alloc()     - Will fail the pmd_none
>                   check and return a
>                   pmd_offset()

I see. It also misses a pud_trans_unstable() check or its variant.
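
There is no pud_trans_unstable() at this point; a sketch of what such a
helper could look like, modelled on pmd_trans_unstable() and
pmd_none_or_trans_huge_or_clear_bad() (an assumption, not existing code):

        static inline int pud_trans_unstable(pud_t *pud)
        {
        #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
                pud_t pudval = READ_ONCE(*pud);

                /* the pud may be gone, huge, or about to be split under us */
                if (pud_none(pudval) || pud_trans_huge(pudval) || pud_devmap(pudval))
                        return 1;
                if (unlikely(pud_bad(pudval))) {
                        pud_clear_bad(pud);
                        return 1;
                }
        #endif
                return 0;
        }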

-- 
 Kirill A. Shutemov


Re: [PATCH v3 2/7] mm: Add a walk_page_mapping() function to the pagewalk code

2019-10-04 Thread Kirill A. Shutemov
On Thu, Oct 03, 2019 at 01:32:45PM +0200, Thomas Hellström (VMware) wrote:
> > > + *   If @mapping allows faulting of huge pmds and puds, it is desirable
> > > + *   that its huge_fault() handler blocks while this function is running 
> > > on
> > > + *   @mapping. Otherwise a race may occur where the huge entry is split 
> > > when
> > > + *   it was intended to be handled in a huge entry callback. This 
> > > requires an
> > > + *   external lock, for example that @mapping->i_mmap_rwsem is held in
> > > + *   write mode in the huge_fault() handlers.
> > Em. No. We have ptl for this. It's the only lock required (plus mmap_sem
> > on read) to split PMD entry into PTE table. And it can happen not only
> > from fault path.
> > 
> > If you care about splitting compound page under you, take a pin or lock a
> > page. It will block split_huge_page().
> > 
> > Suggestion to block fault path is not viable (and it will not happen
> > magically just because of this comment).
> > 
> I was specifically thinking of this:
> 
> https://elixir.bootlin.com/linux/latest/source/mm/pagewalk.c#L103
> 
> If a huge pud is concurrently faulted in here, it will immediately get split
> without getting processed in pud_entry(). An external lock would protect
> against that, but that's perhaps a bug in the pagewalk code?  For pmds the
> situation is not the same since when pte_entry is used, all pmds will
> unconditionally get split.

I *think* it should be fixed with something like this (there's no
pud_trans_unstable() yet):

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index d48c2a986ea3..221a3b945f42 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -102,10 +102,11 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, 
unsigned long end,
break;
continue;
}
+   } else {
+   split_huge_pud(walk->vma, pud, addr);
}
 
-   split_huge_pud(walk->vma, pud, addr);
-   if (pud_none(*pud))
+   if (pud_none(*pud) || pud_trans_unstable(*pud))
goto again;
 
if (ops->pmd_entry || ops->pte_entry)

Or better yet, convert it to what we do at the pmd level.

Honestly, all the code around PUD THP is missing a lot of ground work.
Rushing it upstream for DAX was not the right move.

> There's a similar more scary race in
> 
> https://elixir.bootlin.com/linux/latest/source/mm/memory.c#L3931
> 
> It looks like if a concurrent thread faults in a huge pud just after the
> test for pud_none in that pmd_alloc, things might go pretty bad.

Hm? It will fail the next pmd_none() check under ptl. Do you have a
particular racing scenario?

-- 
 Kirill A. Shutemov


Re: [PATCH v3 2/7] mm: Add a walk_page_mapping() function to the pagewalk code

2019-10-03 Thread Kirill A. Shutemov
On Wed, Oct 02, 2019 at 03:47:25PM +0200, Thomas Hellström (VMware) wrote:
> From: Thomas Hellstrom 
> 
> For users that want to traverse all page table entries pointing into a
> region of a struct address_space mapping, introduce a walk_page_mapping()
> function.
> 
> The walk_page_mapping() function will initially be used for dirty-
> tracking in virtual graphics drivers.
> 
> Cc: Andrew Morton 
> Cc: Matthew Wilcox 
> Cc: Will Deacon 
> Cc: Peter Zijlstra 
> Cc: Rik van Riel 
> Cc: Minchan Kim 
> Cc: Michal Hocko 
> Cc: Huang Ying 
> Cc: Jérôme Glisse 
> Cc: Kirill A. Shutemov 
> Signed-off-by: Thomas Hellstrom 
> ---
>  include/linux/pagewalk.h |  9 
>  mm/pagewalk.c| 99 +++-
>  2 files changed, 107 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index bddd9759bab9..6ec82e92c87f 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -24,6 +24,9 @@ struct mm_walk;
>   *   "do page table walk over the current vma", returning
>   *   a negative value means "abort current page table walk
>   *   right now" and returning 1 means "skip the current vma"
> + * @pre_vma:if set, called before starting walk on a non-null 
> vma.
> + * @post_vma:   if set, called after a walk on a non-null vma, 
> provided
> + *  that @pre_vma and the vma walk succeeded.
>   */
>  struct mm_walk_ops {
>   int (*pud_entry)(pud_t *pud, unsigned long addr,
> @@ -39,6 +42,9 @@ struct mm_walk_ops {
>struct mm_walk *walk);
>   int (*test_walk)(unsigned long addr, unsigned long next,
>   struct mm_walk *walk);
> + int (*pre_vma)(unsigned long start, unsigned long end,
> +struct mm_walk *walk);
> + void (*post_vma)(struct mm_walk *walk);
>  };
>  
>  /**
> @@ -62,5 +68,8 @@ int walk_page_range(struct mm_struct *mm, unsigned long 
> start,
>   void *private);
>  int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
>   void *private);
> +int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
> +   pgoff_t nr, const struct mm_walk_ops *ops,
> +   void *private);
>  
>  #endif /* _LINUX_PAGEWALK_H */
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index d48c2a986ea3..658d1e5ec428 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -253,13 +253,23 @@ static int __walk_page_range(unsigned long start, 
> unsigned long end,
>  {
>   int err = 0;
>   struct vm_area_struct *vma = walk->vma;
> + const struct mm_walk_ops *ops = walk->ops;
> +
> + if (vma && ops->pre_vma) {
> + err = ops->pre_vma(start, end, walk);
> + if (err)
> + return err;
> + }
>  
>   if (vma && is_vm_hugetlb_page(vma)) {
> - if (walk->ops->hugetlb_entry)
> + if (ops->hugetlb_entry)
>   err = walk_hugetlb_range(start, end, walk);
>   } else
>   err = walk_pgd_range(start, end, walk);
>  
> + if (vma && ops->post_vma)
> + ops->post_vma(walk);
> +
>   return err;
>  }
>  
> @@ -285,11 +295,17 @@ static int __walk_page_range(unsigned long start, 
> unsigned long end,
>   *  - <0 : failed to handle the current entry, and return to the caller
>   * with error code.
>   *
> + *
>   * Before starting to walk page table, some callers want to check whether
>   * they really want to walk over the current vma, typically by checking
>   * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
>   * purpose.
>   *
> + * If operations need to be staged before and committed after a vma is 
> walked,
> + * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
> + * since it is intended to handle commit-type operations, can't return any
> + * errors.
> + *
>   * struct mm_walk keeps current values of some common data like vma and pmd,
>   * which are useful for the access from callbacks. If you want to pass some
>   * caller-specific data to callbacks, @private should be helpful.
> @@ -376,3 +392,84 @@ int walk_page_vma(struct vm_area_struct *vma, const 
> struct mm_walk_ops *ops,
>   return err;
>   return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
>  }
> +
> +/**
> + * walk_page_mapping - walk all memory areas mapped into a struct 
> ad

Re: [PATCH v3 1/7] mm: Remove BUG_ON mmap_sem not held from xxx_trans_huge_lock()

2019-10-03 Thread Kirill A. Shutemov
On Wed, Oct 02, 2019 at 03:47:24PM +0200, Thomas Hellström (VMware) wrote:
> From: Thomas Hellstrom 
> 
> The caller needs to make sure that the vma is not torn down during the
> lock operation and can also use the i_mmap_rwsem for file-backed vmas.
> Remove the BUG_ON. We could, as an alternative, add a test that either
> vma->vm_mm->mmap_sem or vma->vm_file->f_mapping->i_mmap_rwsem are held.
> 
> Cc: Andrew Morton 
> Cc: Matthew Wilcox 
> Cc: Will Deacon 
> Cc: Peter Zijlstra 
> Cc: Rik van Riel 
> Cc: Minchan Kim 
> Cc: Michal Hocko 
> Cc: Huang Ying 
> Cc: Jérôme Glisse 
> Cc: Kirill A. Shutemov 
> Signed-off-by: Thomas Hellstrom 

The patch looks good to me:

Acked-by: Kirill A. Shutemov 

But I looked at the usage in pagewalk.c and it is inconsistent. The walker
takes the ptl before calling ->pud_entry(), but not before ->pmd_entry().

It should be fixed: do not take the lock before ->pud_entry(). The
callback must take care of it.

Looks like we have a single ->pud_entry() implementation in the whole kernel.
It should be trivial to fix.

Could you do this?
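
A sketch of the requested direction (an assumption about the final shape,
not the actual fix): the walker calls ->pud_entry() without taking the lock,
and the callback does its own locking via pud_trans_huge_lock():

        /* mm/pagewalk.c: no pud_trans_huge_lock() around the callback */
        if (ops->pud_entry) {
                err = ops->pud_entry(pud, addr, next, walk);
                if (err)
                        break;
        }

        /* in the (currently single) callback -- hypothetical example */
        static int example_pud_entry(pud_t *pud, unsigned long addr,
                                     unsigned long next, struct mm_walk *walk)
        {
                spinlock_t *ptl = pud_trans_huge_lock(pud, walk->vma);

                if (!ptl)
                        return 0;       /* not a huge pud; let the walker recurse */
                /* ... handle the huge pud ... */
                spin_unlock(ptl);
                return 0;
        }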

-- 
 Kirill A. Shutemov


Re: [PATCH 14/15] mm: Align THP mappings for non-DAX

2019-10-01 Thread Kirill A. Shutemov
On Tue, Oct 01, 2019 at 10:08:30AM -0600, William Kucharski wrote:
> 
> 
> > On Oct 1, 2019, at 8:20 AM, Kirill A. Shutemov  wrote:
> > 
> > On Tue, Oct 01, 2019 at 06:18:28AM -0600, William Kucharski wrote:
> >> 
> >> 
> >> On 10/1/19 5:32 AM, Kirill A. Shutemov wrote:
> >>> On Tue, Oct 01, 2019 at 05:21:26AM -0600, William Kucharski wrote:
> >>>> 
> >>>> 
> >>>>> On Oct 1, 2019, at 4:45 AM, Kirill A. Shutemov  
> >>>>> wrote:
> >>>>> 
> >>>>> On Tue, Sep 24, 2019 at 05:52:13PM -0700, Matthew Wilcox wrote:
> >>>>>> 
> >>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>>>>> index cbe7d0619439..670a1780bd2f 100644
> >>>>>> --- a/mm/huge_memory.c
> >>>>>> +++ b/mm/huge_memory.c
> >>>>>> @@ -563,8 +563,6 @@ unsigned long thp_get_unmapped_area(struct file 
> >>>>>> *filp, unsigned long addr,
> >>>>>> 
> >>>>>>if (addr)
> >>>>>>goto out;
> >>>>>> -  if (!IS_DAX(filp->f_mapping->host) || 
> >>>>>> !IS_ENABLED(CONFIG_FS_DAX_PMD))
> >>>>>> -  goto out;
> >>>>>> 
> >>>>>>addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
> >>>>>>if (addr)
> >>>>> 
> >>>>> I think you reducing ASLR without any real indication that THP is 
> >>>>> relevant
> >>>>> for the VMA. We need to know if any huge page allocation will be
> >>>>> *attempted* for the VMA or the file.
> >>>> 
> >>>> Without a properly aligned address the code will never even attempt 
> >>>> allocating
> >>>> a THP.
> >>>> 
> >>>> I don't think rounding an address to one that would be properly aligned 
> >>>> to map
> >>>> to a THP if possible is all that detrimental to ASLR and without the 
> >>>> ability to
> >>>> pick an aligned address it's rather unlikely anyone would ever map 
> >>>> anything to
> >>>> a THP unless they explicitly designate an address with MAP_FIXED.
> >>>> 
> >>>> If you do object to the slight reduction of the ASLR address space, what
> >>>> alternative would you prefer to see?
> >>> 
> >>> We need to know by the time if THP is allowed for this
> >>> file/VMA/process/whatever. Meaning that we do not give up ASLR entropy for
> >>> nothing.
> >>> 
> >>> For instance, if THP is disabled globally, there is no reason to align the
> >>> VMA to the THP requirements.
> >> 
> >> I understand, but this code is in thp_get_unmapped_area(), which is only 
> >> called
> >> if THP is configured and the VMA can support it.
> >> 
> >> I don't see it in Matthew's patchset, so I'm not sure if it was 
> >> inadvertently
> >> missed in his merge or if he has other ideas for how it would eventually be
> >> called, but in my last patch revision the code calling it in do_mmap()
> >> looked like this:
> >> 
> >> #ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
> >>/*
> >> * If THP is enabled, it's a read-only executable that is
> >> * MAP_PRIVATE mapped, the length is larger than a PMD page
> >> * and either it's not a MAP_FIXED mapping or the passed address is
> >> * properly aligned for a PMD page, attempt to get an appropriate
> >> * address at which to map a PMD-sized THP page, otherwise call the
> >> * normal routine.
> >> */
> >>if ((prot & PROT_READ) && (prot & PROT_EXEC) &&
> >>(!(prot & PROT_WRITE)) && (flags & MAP_PRIVATE) &&
> >>(!(flags & MAP_FIXED)) && len >= HPAGE_PMD_SIZE) {
> > 
> > len and MAP_FIXED is already handled by thp_get_unmapped_area().
> > 
> > if ((prot & (PROT_READ|PROT_WRITE|PROT_EXEC)) == (PROT_READ|PROT_EXEC) &&
> > (flags & MAP_PRIVATE)) {
> 
> It is, but I wanted to avoid even calling it if conditions weren't right.
> 
> Checking twice is non-optimal but I didn't want to alter the existing use of
> the routine for anon THP.

Re: [PATCH 14/15] mm: Align THP mappings for non-DAX

2019-10-01 Thread Kirill A. Shutemov
On Tue, Oct 01, 2019 at 06:18:28AM -0600, William Kucharski wrote:
> 
> 
> On 10/1/19 5:32 AM, Kirill A. Shutemov wrote:
> > On Tue, Oct 01, 2019 at 05:21:26AM -0600, William Kucharski wrote:
> > > 
> > > 
> > > > On Oct 1, 2019, at 4:45 AM, Kirill A. Shutemov  
> > > > wrote:
> > > > 
> > > > On Tue, Sep 24, 2019 at 05:52:13PM -0700, Matthew Wilcox wrote:
> > > > > 
> > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > index cbe7d0619439..670a1780bd2f 100644
> > > > > --- a/mm/huge_memory.c
> > > > > +++ b/mm/huge_memory.c
> > > > > @@ -563,8 +563,6 @@ unsigned long thp_get_unmapped_area(struct file 
> > > > > *filp, unsigned long addr,
> > > > > 
> > > > >   if (addr)
> > > > >   goto out;
> > > > > - if (!IS_DAX(filp->f_mapping->host) || 
> > > > > !IS_ENABLED(CONFIG_FS_DAX_PMD))
> > > > > - goto out;
> > > > > 
> > > > >   addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
> > > > >   if (addr)
> > > > 
> > > > I think you reducing ASLR without any real indication that THP is 
> > > > relevant
> > > > for the VMA. We need to know if any huge page allocation will be
> > > > *attempted* for the VMA or the file.
> > > 
> > > Without a properly aligned address the code will never even attempt 
> > > allocating
> > > a THP.
> > > 
> > > I don't think rounding an address to one that would be properly aligned 
> > > to map
> > > to a THP if possible is all that detrimental to ASLR and without the 
> > > ability to
> > > pick an aligned address it's rather unlikely anyone would ever map 
> > > anything to
> > > a THP unless they explicitly designate an address with MAP_FIXED.
> > > 
> > > If you do object to the slight reduction of the ASLR address space, what
> > > alternative would you prefer to see?
> > 
> > We need to know by the time if THP is allowed for this
> > file/VMA/process/whatever. Meaning that we do not give up ASLR entropy for
> > nothing.
> > 
> > For instance, if THP is disabled globally, there is no reason to align the
> > VMA to the THP requirements.
> 
> I understand, but this code is in thp_get_unmapped_area(), which is only 
> called
> if THP is configured and the VMA can support it.
> 
> I don't see it in Matthew's patchset, so I'm not sure if it was inadvertently
> missed in his merge or if he has other ideas for how it would eventually be
> called, but in my last patch revision the code calling it in do_mmap()
> looked like this:
> 
> #ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
> /*
>  * If THP is enabled, it's a read-only executable that is
>  * MAP_PRIVATE mapped, the length is larger than a PMD page
>  * and either it's not a MAP_FIXED mapping or the passed address is
>  * properly aligned for a PMD page, attempt to get an appropriate
>  * address at which to map a PMD-sized THP page, otherwise call the
>  * normal routine.
>  */
> if ((prot & PROT_READ) && (prot & PROT_EXEC) &&
> (!(prot & PROT_WRITE)) && (flags & MAP_PRIVATE) &&
> (!(flags & MAP_FIXED)) && len >= HPAGE_PMD_SIZE) {

len and MAP_FIXED are already handled by thp_get_unmapped_area().

if ((prot & (PROT_READ|PROT_WRITE|PROT_EXEC)) == (PROT_READ|PROT_EXEC) &&
(flags & MAP_PRIVATE)) {


> addr = thp_get_unmapped_area(file, addr, len, pgoff, flags);
> 
> if (addr && (!(addr & ~HPAGE_PMD_MASK))) {

This check is broken.

For instance, if pgoff is one, (addr & ~HPAGE_PMD_MASK) has to be equal to
PAGE_SIZE to have a chance of getting a huge page in the mapping.
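
In other words (sketch only; the helper name is made up), the usable test is
congruence of the address and the file offset modulo the PMD size, not PMD
alignment of the address alone:

        static bool thp_vaddr_suits_pgoff(unsigned long addr, pgoff_t pgoff)
        {
                /* addr and pgoff must land on the same offset within a PMD */
                return (((addr >> PAGE_SHIFT) - pgoff) & (HPAGE_PMD_NR - 1)) == 0;
        }

With pgoff == 1 this accepts addresses where (addr & ~HPAGE_PMD_MASK) equals
PAGE_SIZE, which the (!(addr & ~HPAGE_PMD_MASK)) test above would reject.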

> /*
>  * If we got a suitable THP mapping address, shut off
>  * VM_MAYWRITE for the region, since it's never what
>  * we would want.
>  */
> vm_maywrite = 0;

Wouldn't it break uprobe, for instance?

> } else
> addr = get_unmapped_area(file, addr, len, pgoff, 
> flags);
> } else {
> #endif
> 
> So I think that meets your expectations regarding ASLR.
> 
>-- Bill

-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/3] mm, page_owner: decouple freeing stack trace from debug_pagealloc

2019-10-01 Thread Kirill A. Shutemov
On Tue, Oct 01, 2019 at 08:26:28AM -0400, Qian Cai wrote:
> On Tue, 2019-10-01 at 14:51 +0300, Kirill A. Shutemov wrote:
> > On Tue, Oct 01, 2019 at 10:07:44AM +0200, Vlastimil Babka wrote:
> > > On 10/1/19 1:49 AM, Qian Cai wrote:
> > > > 
> > > > 
> > > > > On Sep 30, 2019, at 5:43 PM, Vlastimil Babka  wrote:
> > > > > 
> > > > > Well, my use case is shipping production kernels with 
> > > > > CONFIG_PAGE_OWNER
> > > > > and CONFIG_DEBUG_PAGEALLOC enabled, and instructing users to boot-time
> > > > > enable only for troubleshooting a crash or memory leak, without a need
> > > > > to install a debug kernel. Things like static keys and page_ext
> > > > > allocations makes this possible without CPU and memory overhead when 
> > > > > not
> > > > > boot-time enabled. I don't know too much about KASAN internals, but I
> > > > > assume it's not possible to use it that way on production kernels yet?
> > > > 
> > > > In that case, why can’t users just simply enable page_owner=on and
> > > > debug_pagealloc=on for troubleshooting? The later makes the kernel
> > > > slower, but I am not sure if it is worth optimization by adding a new
> > > > parameter. There have already been quite a few MM-related kernel
> > > > parameters that could tidy up a bit in the future.
> > > 
> > > They can do that and it was intention, yes. The extra parameter was
> > > requested by Kirill, so I'll defer the answer to him :)
> > 
> > DEBUG_PAGEALLOC is a much more intrusive debug option. Not all architectures
> > support it in an efficient way. Some require hibernation.
> > 
> > I don't see a reason to tie these two options together.
> 
> Makes sense. How about page_owner=on having page_owner_free=on by default?
> That way we don't need the extra parameter.

It's 0.1% of system memory. Does it matter for a debug option? I don't know.

-- 
 Kirill A. Shutemov


Re: [PATCH] mm/thp: make set_huge_zero_page() return void

2019-10-01 Thread Kirill A. Shutemov
On Mon, Sep 30, 2019 at 12:55:28PM -0700, Ralph Campbell wrote:
> The return value from set_huge_zero_page() is never checked so simplify
> the code by making it return void.
> 
> Signed-off-by: Ralph Campbell 
> ---
>  mm/huge_memory.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c5cb6dcd6c69..6cf0ee65538d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -686,20 +686,18 @@ static inline gfp_t 
> alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>  }
>  
>  /* Caller must hold page table lock. */
> -static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
> +static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
>   struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
>   struct page *zero_page)
>  {
>   pmd_t entry;
> - if (!pmd_none(*pmd))
> - return false;
> +

What? So you just blindly overwrite whatever is there?

NAK.

>   entry = mk_pmd(zero_page, vma->vm_page_prot);
>   entry = pmd_mkhuge(entry);
>   if (pgtable)
>   pgtable_trans_huge_deposit(mm, pmd, pgtable);
>   set_pmd_at(mm, haddr, pmd, entry);
>   mm_inc_nr_ptes(mm);
> - return true;
>  }
>  
>  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> -- 
> 2.20.1
> 
> 

-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/3] mm, page_owner: decouple freeing stack trace from debug_pagealloc

2019-10-01 Thread Kirill A. Shutemov
On Tue, Oct 01, 2019 at 10:07:44AM +0200, Vlastimil Babka wrote:
> On 10/1/19 1:49 AM, Qian Cai wrote:
> > 
> > 
> >> On Sep 30, 2019, at 5:43 PM, Vlastimil Babka  wrote:
> >>
> >> Well, my use case is shipping production kernels with CONFIG_PAGE_OWNER
> >> and CONFIG_DEBUG_PAGEALLOC enabled, and instructing users to boot-time
> >> enable only for troubleshooting a crash or memory leak, without a need
> >> to install a debug kernel. Things like static keys and page_ext
> >> allocations makes this possible without CPU and memory overhead when not
> >> boot-time enabled. I don't know too much about KASAN internals, but I
> >> assume it's not possible to use it that way on production kernels yet?
> > 
> > In that case, why can’t users just simply enable page_owner=on and
> > debug_pagealloc=on for troubleshooting? The later makes the kernel
> > slower, but I am not sure if it is worth optimization by adding a new
> > parameter. There have already been quite a few MM-related kernel
> > parameters that could tidy up a bit in the future.
> 
> They can do that and it was intention, yes. The extra parameter was
> requested by Kirill, so I'll defer the answer to him :)

DEBUG_PAGEALLOC is a much more intrusive debug option. Not all architectures
support it in an efficient way. Some require hibernation.

I don't see a reason to tie these two options together.

-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/3] mm, page_owner: decouple freeing stack trace from debug_pagealloc

2019-10-01 Thread Kirill A. Shutemov
On Mon, Sep 30, 2019 at 11:39:34PM +0200, Vlastimil Babka wrote:
> On 9/30/19 2:49 PM, Qian Cai wrote:
> >> --- a/Documentation/admin-guide/kernel-parameters.txt
> >> +++ b/Documentation/admin-guide/kernel-parameters.txt
> >> @@ -3237,6 +3237,14 @@
> >>we can turn it on.
> >>on: enable the feature
> >>  
> >> +  page_owner_free=
> >> +  [KNL] When enabled together with page_owner, store also
> >> +  the stack of who frees a page, for error page dump
> >> +  purposes. This is also implicitly enabled by
> >> +  debug_pagealloc=on or KASAN, so only page_owner=on is
> >> +  sufficient in those cases.
> >> +  on: enable the feature
> >> +
> > 
> > If users are willing to set page_owner=on, what prevents them from enabling 
> > KASAN
> > as well? That way, we don't need this additional parameter.
> 
> Well, my use case is shipping production kernels with CONFIG_PAGE_OWNER
> and CONFIG_DEBUG_PAGEALLOC enabled, and instructing users to boot-time
> enable only for troubleshooting a crash or memory leak, without a need
> to install a debug kernel. Things like static keys and page_ext
> allocations makes this possible without CPU and memory overhead when not
> boot-time enabled. I don't know too much about KASAN internals, but I
> assume it's not possible to use it that way on production kernels yet?

I don't know about production, but QEMU (without KVM acceleration) is
painfully slow if KASAN is enabled.

-- 
 Kirill A. Shutemov


Re: [PATCH 14/15] mm: Align THP mappings for non-DAX

2019-10-01 Thread Kirill A. Shutemov
On Tue, Oct 01, 2019 at 05:21:26AM -0600, William Kucharski wrote:
> 
> 
> > On Oct 1, 2019, at 4:45 AM, Kirill A. Shutemov  wrote:
> > 
> > On Tue, Sep 24, 2019 at 05:52:13PM -0700, Matthew Wilcox wrote:
> >> 
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index cbe7d0619439..670a1780bd2f 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -563,8 +563,6 @@ unsigned long thp_get_unmapped_area(struct file *filp, 
> >> unsigned long addr,
> >> 
> >>if (addr)
> >>goto out;
> >> -  if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
> >> -  goto out;
> >> 
> >>addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
> >>if (addr)
> > 
> > I think you reducing ASLR without any real indication that THP is relevant
> > for the VMA. We need to know if any huge page allocation will be
> > *attempted* for the VMA or the file.
> 
> Without a properly aligned address the code will never even attempt allocating
> a THP.
> 
> I don't think rounding an address to one that would be properly aligned to map
> to a THP if possible is all that detrimental to ASLR and without the ability 
> to
> pick an aligned address it's rather unlikely anyone would ever map anything to
> a THP unless they explicitly designate an address with MAP_FIXED.
> 
> If you do object to the slight reduction of the ASLR address space, what
> alternative would you prefer to see?

We need to know by that time whether THP is allowed for this
file/VMA/process/whatever, meaning that we do not give up ASLR entropy for
nothing.

For instance, if THP is disabled globally, there is no reason to align the
VMA to the THP requirements.
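
A minimal sketch of the kind of gating meant here (the helper and its exact
conditions are assumptions, not existing code): only fall into the
PMD-aligned search when a huge page allocation could actually be attempted:

        static bool thp_mapping_possible(struct file *filp)
        {
                /* THP disabled globally: keep full ASLR entropy */
                if (!test_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags) &&
                    !test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags))
                        return false;
                if (IS_DAX(filp->f_mapping->host))
                        return IS_ENABLED(CONFIG_FS_DAX_PMD);
                return IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS);
        }

        /* in thp_get_unmapped_area(): */
        if (!thp_mapping_possible(filp))
                goto out;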

-- 
 Kirill A. Shutemov


Re: [PATCH 14/15] mm: Align THP mappings for non-DAX

2019-10-01 Thread Kirill A. Shutemov
On Tue, Sep 24, 2019 at 05:52:13PM -0700, Matthew Wilcox wrote:
> From: William Kucharski 
> 
> When we have the opportunity to use transparent huge pages to map a
> file, we want to follow the same rules as DAX.
> 
> Signed-off-by: William Kucharski 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/huge_memory.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index cbe7d0619439..670a1780bd2f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -563,8 +563,6 @@ unsigned long thp_get_unmapped_area(struct file *filp, 
> unsigned long addr,
>  
>   if (addr)
>   goto out;
> - if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
> - goto out;
>  
>   addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
>   if (addr)

I think you are reducing ASLR without any real indication that THP is relevant
for the VMA. We need to know if any huge page allocation will be
*attempted* for the VMA or the file.

-- 
 Kirill A. Shutemov


Re: [PATCH 13/15] mm: Add a huge page fault handler for files

2019-10-01 Thread Kirill A. Shutemov
On Tue, Sep 24, 2019 at 05:52:12PM -0700, Matthew Wilcox wrote:
> From: William Kucharski 
> 
> Add filemap_huge_fault() to attempt to satisfy page
> faults on memory-mapped read-only text pages using THP when possible.
> 
> Signed-off-by: William Kucharski 
> [rebased on top of mm prep patches -- Matthew]
> Signed-off-by: Matthew Wilcox (Oracle) 

As we discussed before, I think it's the wrong direction to separate ->fault()
from ->huge_fault(). If ->fault() returns a huge page, alloc_set_pte()
will do The Right Thing™.
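
For reference, a simplified (and abbreviated) sketch of what alloc_set_pte()
in mm/memory.c already does; not the exact code:

	vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
			struct page *page)
	{
		vm_fault_t ret;

		/* A compound page returned by ->fault() is mapped with a PMD... */
		if (pmd_none(*vmf->pmd) && PageTransCompound(page) &&
		    IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) {
			ret = do_set_pmd(vmf, page);
			if (ret != VM_FAULT_FALLBACK)
				return ret;
		}

		/* ...and we only fall back to a regular PTE otherwise. */
		return 0;	/* regular-PTE path elided in this sketch */
	}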

-- 
 Kirill A. Shutemov


Re: [PATCH 12/15] mm: Support removing arbitrary sized pages from mapping

2019-10-01 Thread Kirill A. Shutemov
On Tue, Sep 24, 2019 at 05:52:11PM -0700, Matthew Wilcox wrote:
> From: William Kucharski 
> 
> __remove_mapping() assumes that pages can only be either base pages
> or HPAGE_PMD_SIZE.  Ask the page what size it is.

You also fix the issue with CONFIG_READ_ONLY_THP_FOR_FS=y in this patch.
The new feature makes the refcount calculation relevant not only for
PageSwapCache(). It should go into v5.4.

> 
> Signed-off-by: William Kucharski 
> Signed-off-by: Matthew Wilcox (Oracle) 

> ---
>  mm/vmscan.c | 5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a7f9f379e523..9f44868e640b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -932,10 +932,7 @@ static int __remove_mapping(struct address_space 
> *mapping, struct page *page,
>* Note that if SetPageDirty is always performed via set_page_dirty,
>* and thus under the i_pages lock, then this ordering is not required.
>*/
> - if (unlikely(PageTransHuge(page)) && PageSwapCache(page))
> - refcount = 1 + HPAGE_PMD_NR;
> - else
> - refcount = 2;
> + refcount = 1 + compound_nr(page);
>   if (!page_ref_freeze(page, refcount))
>   goto cannot_free;
>   /* note: atomic_cmpxchg in page_ref_freeze provides the smp_rmb */
> -- 
> 2.23.0
> 
> 

-- 
 Kirill A. Shutemov


Re: [PATCH 11/15] mm: Remove hpage_nr_pages

2019-10-01 Thread Kirill A. Shutemov
On Tue, Sep 24, 2019 at 05:52:10PM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> This function assumed that compound pages were necessarily PMD sized.
> While that may be true for some users, it's not going to be true for
> all users forever, so it's better to remove it and avoid the confusion
> by just using compound_nr() or page_size().
> 
> Signed-off-by: Matthew Wilcox (Oracle) 

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH 10/15] mm: Allow find_get_page to be used for large pages

2019-10-01 Thread Kirill A. Shutemov
 the page cache slot at @mapping & @offset.
>   *
> - * PCG flags modify how the page is returned.
> + * FGP flags modify how the page is returned.
>   *
>   * @fgp_flags can be:
>   *
> @@ -1636,6 +1700,10 @@ EXPORT_SYMBOL(find_lock_entry);
>   * - FGP_FOR_MMAP: Similar to FGP_CREAT, only we want to allow the caller to 
> do
>   *   its own locking dance if the page is already in cache, or unlock the 
> page
>   *   before returning if we had to add the page to pagecache.
> + * - FGP_PMD: We're only interested in pages at PMD granularity.  If there
> + *   is no page here (and FGP_CREATE is set), we'll create one large enough.
> + *   If there is a smaller page in the cache that overlaps the PMD page, we
> + *   return %NULL and do not attempt to create a page.

I still think it's a suboptimal interface. It's okay to ask for a PMD page,
but if there's already a small page there, the caller should deal with it.
Otherwise the caller will do one additional lookup in the xarray for the
fallback path for no real reason.
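
A rough illustration of the interface I would prefer (hypothetical
behaviour, not what the posted patch does): hand back whatever page is
there and let the caller fall back, so no second lookup is needed:

	page = pagecache_get_page(mapping, offset, fgp_flags | FGP_PMD, gfp_mask);
	if (page && !PageTransHuge(page)) {
		/* A smaller page is already cached: reuse it for the
		 * fallback path instead of looking it up again. */
	}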

>   *
>   * If FGP_LOCK or FGP_CREAT are specified then the function may sleep even
>   * if the GFP flags specified for FGP_CREAT are atomic.
> @@ -1649,10 +1717,11 @@ struct page *pagecache_get_page(struct address_space 
> *mapping, pgoff_t offset,
>  {
>   struct page *page;
>  
> + BUILD_BUG_ON(((63 << FGP_ORDER_SHIFT) >> FGP_ORDER_SHIFT) != 63);
>  repeat:
> - page = find_get_entry(mapping, offset);
> - if (xa_is_value(page))
> - page = NULL;
> + page = __find_get_page(mapping, offset, fgp_order(fgp_flags));
> + if (pagecache_is_conflict(page))
> + return NULL;
>   if (!page)
>   goto no_page;
>  
> @@ -1686,7 +1755,7 @@ struct page *pagecache_get_page(struct address_space 
> *mapping, pgoff_t offset,
>   if (fgp_flags & FGP_NOFS)
>   gfp_mask &= ~__GFP_FS;
>  
> - page = __page_cache_alloc(gfp_mask);
> + page = __page_cache_alloc_order(gfp_mask, fgp_order(fgp_flags));
>   if (!page)
>   return NULL;
>  
> @@ -1704,13 +1773,17 @@ struct page *pagecache_get_page(struct address_space 
> *mapping, pgoff_t offset,
>   if (err == -EEXIST)
>   goto repeat;
>   }
> + if (page) {
> + if (fgp_order(fgp_flags))
> + count_vm_event(THP_FILE_ALLOC);
>  
> -     /*
> -  * add_to_page_cache_lru locks the page, and for mmap we expect
> -  * an unlocked page.
> -  */
> - if (page && (fgp_flags & FGP_FOR_MMAP))
> - unlock_page(page);
> + /*
> +  * add_to_page_cache_lru locks the page, and
> +  * for mmap we expect an unlocked page.
> +  */
> + if (fgp_flags & FGP_FOR_MMAP)
> + unlock_page(page);
> + }
>   }
>  
>   return page;
> -- 
> 2.23.0
> 

-- 
 Kirill A. Shutemov


Re: Ack to merge through DRM? WAS Re: [PATCH v2 1/5] mm: Add write-protect and clean utilities for address space ranges

2019-09-30 Thread Kirill A. Shutemov
On Fri, Sep 27, 2019 at 09:39:48AM -0700, Linus Torvalds wrote:
> On Fri, Sep 27, 2019 at 5:17 AM Kirill A. Shutemov  
> wrote:
> >
> > > Call it "walk_page_mapping()". And talk extensively about how the
> > > locking differs a lot from the usual "walk_page_vma()" things.
> >
> > Walking mappings of a page is what rmap does. This code has to be
> > integrated there.
> 
> Well, that's very questionable.
> 
> The rmap code mainly does the "page -> virtual" mapping.  One page at a time.
> 
> The page walker code does the "virtual -> pte" mapping. Always a whole
> range at a time.

Have you seen page_vma_mapped_walk()? I made it specifically for rmap code
to cover cases when a THP is mapped with PTEs. To me it's not a big
stretch to make it cover multiple pages too.
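
For reference, the usual calling pattern is just (minimal sketch):

	struct page_vma_mapped_walk pvmw = {
		.page = page,
		.vma = vma,
		.address = address,
	};

	while (page_vma_mapped_walk(&pvmw)) {
		/* pvmw.pte (or pvmw.pmd for a huge mapping) points at the
		 * entry that maps the page in this VMA. */
	}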

> So I think conceptually, mm/memory.c and unmap_mapping_range() is
> closest but I don't think it's practical to share code.
> 
> And between mm/pagewalk.c and mm/rmap.c, I think the page walking has
> way more of actual practical code sharing, and is also conceptually
> closer because most of the code is about walking a range, not looking
> up the mapping of one page.

I guess it's a matter of personal preference, but page table walkers based
on callbacks have always felt wrong to me.

-- 
 Kirill A. Shutemov


Re: [PATCH v4 03/11] mm/gup: Applies counting method to monitor gup_pgd_range

2019-09-30 Thread Kirill A. Shutemov
On Fri, Sep 27, 2019 at 08:40:00PM -0300, Leonardo Bras wrote:
> As decribed, gup_pgd_range is a lockless pagetable walk. So, in order to
 ^ typo

-- 
 Kirill A. Shutemov


Re: [PATCH] mm/page_alloc: fix a crash in free_pages_prepare()

2019-09-30 Thread Kirill A. Shutemov
On Fri, Sep 27, 2019 at 03:47:03PM -0400, Qian Cai wrote:
> On architectures like s390, arch_free_page() could mark the page unused
> (set_page_unused()) and any access later would trigger a kernel panic.
> Fix it by moving arch_free_page() after all possible accessing calls.

Looks like Power also uses the hook. Have you checked that this patch will
not affect Power?

-- 
 Kirill A. Shutemov


Re: [PATCH] mm: drop mmap_sem before calling balance_dirty_pages() in write fault

2019-09-27 Thread Kirill A. Shutemov
On Thu, Sep 26, 2019 at 03:26:24PM -0400, Johannes Weiner wrote:
> I have just one nitpick:

Here's addressed one:

From 22a9d58a79a3ebb727d9e909c8f833cd0a751c08 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" 
Date: Thu, 26 Sep 2019 16:34:26 +0300
Subject: [PATCH] shmem: Pin the file in shmem_fault() if mmap_sem is dropped

syzbot found the following crash:

 BUG: KASAN: use-after-free in perf_trace_lock_acquire+0x401/0x530
 include/trace/events/lock.h:13
 Read of size 8 at addr 8880a5cf2c50 by task syz-executor.0/26173

 CPU: 0 PID: 26173 Comm: syz-executor.0 Not tainted 5.3.0-rc6 #146
 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
 Google 01/01/2011
 Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x172/0x1f0 lib/dump_stack.c:113
   print_address_description.cold+0xd4/0x306 mm/kasan/report.c:351
   __kasan_report.cold+0x1b/0x36 mm/kasan/report.c:482
   kasan_report+0x12/0x17 mm/kasan/common.c:618
   __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
   perf_trace_lock_acquire+0x401/0x530 include/trace/events/lock.h:13
   trace_lock_acquire include/trace/events/lock.h:13 [inline]
   lock_acquire+0x2de/0x410 kernel/locking/lockdep.c:4411
   __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
   _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:151
   spin_lock include/linux/spinlock.h:338 [inline]
   shmem_fault+0x5ec/0x7b0 mm/shmem.c:2034
   __do_fault+0x111/0x540 mm/memory.c:3083
   do_shared_fault mm/memory.c:3535 [inline]
   do_fault mm/memory.c:3613 [inline]
   handle_pte_fault mm/memory.c:3840 [inline]
   __handle_mm_fault+0x2adf/0x3f20 mm/memory.c:3964
   handle_mm_fault+0x1b5/0x6b0 mm/memory.c:4001
   do_user_addr_fault arch/x86/mm/fault.c:1441 [inline]
   __do_page_fault+0x536/0xdd0 arch/x86/mm/fault.c:1506
   do_page_fault+0x38/0x590 arch/x86/mm/fault.c:1530
   page_fault+0x39/0x40 arch/x86/entry/entry_64.S:1202

It happens if the VMA got unmapped under us while we dropped mmap_sem
and inode got freed.

Pinning the file if we drop mmap_sem fixes the issue.

Signed-off-by: Kirill A. Shutemov 
Reported-by: syzbot+03ee87124ee05af99...@syzkaller.appspotmail.com
Acked-by: Johannes Weiner 
Reviewed-by: Matthew Wilcox (Oracle) 
Cc: Hillf Danton 
Cc: Matthew Wilcox 
Cc: Hugh Dickins 
---
 mm/shmem.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 30ce722c23fa..0601ad615ccb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2022,16 +2022,14 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
shmem_falloc->waitq &&
vmf->pgoff >= shmem_falloc->start &&
vmf->pgoff < shmem_falloc->next) {
+   struct file *fpin;
wait_queue_head_t *shmem_falloc_waitq;
DEFINE_WAIT_FUNC(shmem_fault_wait, 
synchronous_wake_function);
 
ret = VM_FAULT_NOPAGE;
-   if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
-  !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
-   /* It's polite to up mmap_sem if we can */
-   up_read(&vma->vm_mm->mmap_sem);
+   fpin = maybe_unlock_mmap_for_io(vmf, NULL);
+   if (fpin)
ret = VM_FAULT_RETRY;
-   }
 
shmem_falloc_waitq = shmem_falloc->waitq;
                prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
@@ -2049,6 +2047,9 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
                spin_lock(&inode->i_lock);
                finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
                spin_unlock(&inode->i_lock);
+
+   if (fpin)
+   fput(fpin);
        return ret;
}
        spin_unlock(&inode->i_lock);
-- 
 Kirill A. Shutemov


Re: [PATCH 09/15] mm: Allow large pages to be added to the page cache

2019-09-26 Thread Kirill A. Shutemov
On Tue, Sep 24, 2019 at 05:52:08PM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> We return -EEXIST if there are any non-shadow entries in the page
> cache in the range covered by the large page.  If there are multiple
> shadow entries in the range, we set *shadowp to one of them (currently
> the one at the highest index).  If that turns out to be the wrong
> answer, we can implement something more complex.  This is mostly
> modelled after the equivalent function in the shmem code.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/filemap.c | 37 ++---
>  1 file changed, 26 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index bab97addbb1d..afe8f5d95810 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -855,6 +855,7 @@ static int __add_to_page_cache_locked(struct page *page,
>   int huge = PageHuge(page);
>   struct mem_cgroup *memcg;
>   int error;
> + unsigned int nr = 1;
>   void *old;
>  
>   VM_BUG_ON_PAGE(!PageLocked(page), page);
> @@ -866,31 +867,45 @@ static int __add_to_page_cache_locked(struct page *page,
>  gfp_mask, &memcg, false);
>   if (error)
>   return error;
> + xas_set_order(&xas, offset, compound_order(page));
> + nr = compound_nr(page);
>   }
>  
> - get_page(page);
> + page_ref_add(page, nr);
>   page->mapping = mapping;
>   page->index = offset;
>  
>   do {
> + unsigned long exceptional = 0;
> + unsigned int i = 0;
> +
> xas_lock_irq(&xas);
> - old = xas_load(&xas);
> - if (old && !xa_is_value(old))
> + xas_for_each_conflict(&xas, old) {
> + if (!xa_is_value(old))
> + break;
> + exceptional++;
> + if (shadowp)
> + *shadowp = old;
> + }
> + if (old)
> xas_set_err(&xas, -EEXIST);

This confused me.

Do we rely on 'old' to be NULL if the loop has completed without 'break'?
It's not very obvious.

Can we have a comment or call xas_set_err() within the loop next to the
'break'?
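
Something like this (untested sketch) would make it obvious:

	xas_for_each_conflict(&xas, old) {
		if (!xa_is_value(old)) {
			xas_set_err(&xas, -EEXIST);
			break;
		}
		exceptional++;
		if (shadowp)
			*shadowp = old;
	}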

> - xas_store(&xas, page);
> + xas_create_range(&xas);
> if (xas_error(&xas))
>   goto unlock;
>  
> - if (xa_is_value(old)) {
> - mapping->nrexceptional--;
> - if (shadowp)
> - *shadowp = old;
> +next:
> + xas_store(&xas, page);
> + if (++i < nr) {
> + xas_next(&xas);
> + goto next;
>   }
> - mapping->nrpages++;
> + mapping->nrexceptional -= exceptional;
> + mapping->nrpages += nr;
>  
>   /* hugetlb pages do not participate in page cache accounting */
>   if (!huge)
> - __inc_node_page_state(page, NR_FILE_PAGES);
> + __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES,
> + nr);

We also need to bump NR_FILE_THPS here.
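
Something like this (untested sketch):

	/* hugetlb pages do not participate in page cache accounting */
	if (!huge) {
		__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
		if (PageTransHuge(page))
			__inc_node_page_state(page, NR_FILE_THPS);
	}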

>  unlock:
> xas_unlock_irq(&xas);
> } while (xas_nomem(&xas, gfp_mask & GFP_RECLAIM_MASK));
> @@ -907,7 +922,7 @@ static int __add_to_page_cache_locked(struct page *page,
>   /* Leave page->index set: truncation relies upon it */
>   if (!huge)
>   mem_cgroup_cancel_charge(page, memcg, false);
> - put_page(page);
> + page_ref_sub(page, nr);
> return xas_error(&xas);
>  }
>  ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO);
> -- 
> 2.23.0
> 
> 

-- 
 Kirill A. Shutemov


Re: [PATCH 08/15] mm: Add __page_cache_alloc_order

2019-09-26 Thread Kirill A. Shutemov
On Tue, Sep 24, 2019 at 05:52:07PM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> This new function allows page cache pages to be allocated that are
> larger than an order-0 page.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH 07/15] mm: Make prep_transhuge_page tail-callable

2019-09-26 Thread Kirill A. Shutemov
On Tue, Sep 24, 2019 at 05:52:06PM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> By permitting NULL or order-0 pages as an argument, and returning the
> argument, callers can write:
> 
>   return prep_transhuge_page(alloc_pages(...));
> 
> instead of assigning the result to a temporary variable and conditionally
> passing that to prep_transhuge_page().

The patch makes sense, but "tail-callable" made me think you want it to be
able to accept tail pages. Can we re-phrase this?
> 
> Signed-off-by: Matthew Wilcox (Oracle) 

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH 03/15] mm: Add file_offset_of_ helpers

2019-09-26 Thread Kirill A. Shutemov
On Tue, Sep 24, 2019 at 05:52:02PM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> The page_offset function is badly named for people reading the functions
> which call it.  The natural meaning of a function with this name would
> be 'offset within a page', not 'page offset in bytes within a file'.
> Dave Chinner suggests file_offset_of_page() as a replacement function
> name and I'm also adding file_offset_of_next_page() as a helper for the
> large page work.  Also add kernel-doc for these functions so they show
> up in the kernel API book.
> 
> page_offset() is retained as a compatibility define for now.

This should be trivial for coccinelle, right?

> ---
>  drivers/net/ethernet/ibm/ibmveth.c |  2 --
>  include/linux/pagemap.h| 25 ++---
>  2 files changed, 22 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ibm/ibmveth.c 
> b/drivers/net/ethernet/ibm/ibmveth.c
> index c5be4ebd8437..bf98aeaf9a45 100644
> --- a/drivers/net/ethernet/ibm/ibmveth.c
> +++ b/drivers/net/ethernet/ibm/ibmveth.c
> @@ -978,8 +978,6 @@ static int ibmveth_ioctl(struct net_device *dev, struct 
> ifreq *ifr, int cmd)
>   return -EOPNOTSUPP;
>  }
>  
> -#define page_offset(v) ((unsigned long)(v) & ((1 << 12) - 1))
> -
>  static int ibmveth_send(struct ibmveth_adapter *adapter,
>   union ibmveth_buf_desc *descs, unsigned long mss)
>  {
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 750770a2c685..103205494ea0 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -428,14 +428,33 @@ static inline pgoff_t page_to_pgoff(struct page *page)
>   return page_to_index(page);
>  }
>  
> -/*
> - * Return byte-offset into filesystem object for page.
> +/**
> + * file_offset_of_page - File offset of this page.
> + * @page: Page cache page.
> + *
> + * Context: Any context.
> + * Return: The offset of the first byte of this page.
>   */
> -static inline loff_t page_offset(struct page *page)
> +static inline loff_t file_offset_of_page(struct page *page)
>  {
>   return ((loff_t)page->index) << PAGE_SHIFT;
>  }
>  
> +/* Legacy; please convert callers */
> +#define page_offset(page)	file_offset_of_page(page)
> +
> +/**
> + * file_offset_of_next_page - File offset of the next page.
> + * @page: Page cache page.
> + *
> + * Context: Any context.
> + * Return: The offset of the first byte after this page.
> + */
> +static inline loff_t file_offset_of_next_page(struct page *page)
> +{
> + return ((loff_t)page->index + compound_nr(page)) << PAGE_SHIFT;

Wouldn't it be more readable as

    return file_offset_of_page(page) + page_size(page);

?

> +}
> +
>  static inline loff_t page_file_offset(struct page *page)
>  {
>   return ((loff_t)page_index(page)) << PAGE_SHIFT;
> -- 
> 2.23.0
> 
> 

-- 
 Kirill A. Shutemov


Re: [PATCH 01/15] mm: Use vm_fault error code directly

2019-09-26 Thread Kirill A. Shutemov
On Tue, Sep 24, 2019 at 05:52:00PM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> Use VM_FAULT_OOM instead of indirecting through vmf_error(-ENOMEM).
> 
> Signed-off-by: Matthew Wilcox (Oracle) 

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH] mm: drop mmap_sem before calling balance_dirty_pages() in write fault

2019-09-26 Thread Kirill A. Shutemov
On Tue, Sep 24, 2019 at 05:43:37PM -0400, Johannes Weiner wrote:
> On Tue, Sep 24, 2019 at 01:46:08PM -0700, Matthew Wilcox wrote:
> > On Tue, Sep 24, 2019 at 03:42:38PM -0400, Johannes Weiner wrote:
> > > > I'm not a fan of moving file_update_time() to _before_ the
> > > > balance_dirty_pages call.
> > > 
> > > Can you elaborate why? If the filesystem has a page_mkwrite op, it
> > > will have already called file_update_time() before this function is
> > > entered. If anything, this change makes the sequence more consistent.
> > 
> > Oh, that makes sense.  I thought it should be updated after all the data
> > was written, but it probably doesn't make much difference.
> > 
> > > > Also, this is now the third place that needs
> > > > maybe_unlock_mmap_for_io, see
> > > > https://lore.kernel.org/linux-mm/20190917120852.x6x3aypwvh573kfa@box/
> > > 
> > > Good idea, I moved the helper to internal.h and converted to it.
> > > 
> > > I left the shmem site alone, though. It doesn't require the file
> > > pinning, so it shouldn't pointlessly bump the file refcount and
> > > suggest such a dependency - that could cost somebody later quite a bit
> > > of time trying to understand the code.
> > 
> > The problem for shmem is this:
> > 
> > spin_unlock(&inode->i_lock);
> > schedule();
> > 
> > spin_lock(&inode->i_lock);
> > finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
> > spin_unlock(&inode->i_lock);
> > 
> > While scheduled, the VMA can go away and the inode be reclaimed, making
> > this a use-after-free.  The initial suggestion was an increment on
> > the inode refcount, but since we already have a pattern which involves
> > pinning the file, I thought that was a better way to go.
> 
> I completely read over the context of that email you linked - that
> there is a bug in the existing code - and looked at it as mere
> refactoring patch. My apologies.
> 
> Switching that shmem site to maybe_unlock_mmap_for_io() to indirectly
> pin the inode (in a separate bug fix patch) indeed makes sense to me.

The patch on top of this one is below. Please post them together if you are
going to resend yours.
> 
> > > Signed-off-by: Johannes Weiner 
> > 
> > Reviewed-by: Matthew Wilcox (Oracle) 

Acked-by: Kirill A. Shutemov 

From bdf96fe9e3c1a319e9fd131efbe0118ea41a41b1 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" 
Date: Thu, 26 Sep 2019 16:34:26 +0300
Subject: [PATCH] shmem: Pin the file in shmem_fault() if mmap_sem is dropped

syzbot found the following crash:

 BUG: KASAN: use-after-free in perf_trace_lock_acquire+0x401/0x530
 include/trace/events/lock.h:13
 Read of size 8 at addr 8880a5cf2c50 by task syz-executor.0/26173

 CPU: 0 PID: 26173 Comm: syz-executor.0 Not tainted 5.3.0-rc6 #146
 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
 Google 01/01/2011
 Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x172/0x1f0 lib/dump_stack.c:113
   print_address_description.cold+0xd4/0x306 mm/kasan/report.c:351
   __kasan_report.cold+0x1b/0x36 mm/kasan/report.c:482
   kasan_report+0x12/0x17 mm/kasan/common.c:618
   __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
   perf_trace_lock_acquire+0x401/0x530 include/trace/events/lock.h:13
   trace_lock_acquire include/trace/events/lock.h:13 [inline]
   lock_acquire+0x2de/0x410 kernel/locking/lockdep.c:4411
   __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
   _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:151
   spin_lock include/linux/spinlock.h:338 [inline]
   shmem_fault+0x5ec/0x7b0 mm/shmem.c:2034
   __do_fault+0x111/0x540 mm/memory.c:3083
   do_shared_fault mm/memory.c:3535 [inline]
   do_fault mm/memory.c:3613 [inline]
   handle_pte_fault mm/memory.c:3840 [inline]
   __handle_mm_fault+0x2adf/0x3f20 mm/memory.c:3964
   handle_mm_fault+0x1b5/0x6b0 mm/memory.c:4001
   do_user_addr_fault arch/x86/mm/fault.c:1441 [inline]
   __do_page_fault+0x536/0xdd0 arch/x86/mm/fault.c:1506
   do_page_fault+0x38/0x590 arch/x86/mm/fault.c:1530
   page_fault+0x39/0x40 arch/x86/entry/entry_64.S:1202

It happens if the VMA got unmapped under us while we dropped mmap_sem
and inode got freed.

Pinning the file if we drop mmap_sem fixes the issue.

Signed-off-by: Kirill A. Shutemov 
Reported-by: syzbot+03ee87124ee05af99...@syzkaller.appspotmail.com
Cc: Hillf Danton 
Cc: Matthew Wilcox 
Cc: Hugh Dickins 
---
 mm/shmem.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 30ce722c23fa..f672e4145cfd 100644
--- a/mm/shme

Re: [PATCH v9 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-26 Thread Kirill A. Shutemov
On Wed, Sep 25, 2019 at 10:59:22AM +0800, Jia He wrote:
> When we tested pmdk unit test [1] vmmalloc_fork TEST1 in arm64 guest, there
> will be a double page fault in __copy_from_user_inatomic of cow_user_page.
> 
> Below call trace is from arm64 do_page_fault for debugging purpose
> [  110.016195] Call trace:
> [  110.016826]  do_page_fault+0x5a4/0x690
> [  110.017812]  do_mem_abort+0x50/0xb0
> [  110.018726]  el1_da+0x20/0xc4
> [  110.019492]  __arch_copy_from_user+0x180/0x280
> [  110.020646]  do_wp_page+0xb0/0x860
> [  110.021517]  __handle_mm_fault+0x994/0x1338
> [  110.022606]  handle_mm_fault+0xe8/0x180
> [  110.023584]  do_page_fault+0x240/0x690
> [  110.024535]  do_mem_abort+0x50/0xb0
> [  110.025423]  el0_da+0x20/0x24
> 
> The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
> [9b007000] pgd=00023d4f8003, pud=00023da9b003, 
> pmd=00023d4b3003, pte=36298607bd3
> 
> As told by Catalin: "On arm64 without hardware Access Flag, copying from
> user will fail because the pte is old and cannot be marked young. So we
> always end up with zeroed page after fork() + CoW for pfn mappings. we
> don't always have a hardware-managed access flag on arm64."
> 
> This patch fix it by calling pte_mkyoung. Also, the parameter is
> changed because vmf should be passed to cow_user_page()
> 
> Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns error
> in case there can be some obscure use-case.(by Kirill)
> 
> [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork
> 
> Signed-off-by: Jia He 
> Reported-by: Yibo Cai 

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH v3 2/2] x86/boot/64: round memory hole size up to next PMD page.

2019-09-26 Thread Kirill A. Shutemov
On Tue, Sep 24, 2019 at 04:04:31PM -0500, Steve Wahl wrote:
> The kernel image map is created using PMD pages, which can include
> some extra space beyond what's actually needed.  Round the size of the
> memory hole we search for up to the next PMD boundary, to be certain
> all of the space to be mapped is usable RAM and includes no reserved
> areas.
> 
> Signed-off-by: Steve Wahl 
> Cc: sta...@vger.kernel.org

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH v3 1/2] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.

2019-09-26 Thread Kirill A. Shutemov
On Tue, Sep 24, 2019 at 04:03:55PM -0500, Steve Wahl wrote:
> Our hardware (UV aka Superdome Flex) has address ranges marked
> reserved by the BIOS. Access to these ranges is caught as an error,
> causing the BIOS to halt the system.
> 
> Initial page tables mapped a large range of physical addresses that
> were not checked against the list of BIOS reserved addresses, and
> sometimes included reserved addresses in part of the mapped range.
> Including the reserved range in the map allowed processor speculative
> accesses to the reserved range, triggering a BIOS halt.
> 
> Used early in booting, the page table level2_kernel_pgt addresses 1
> GiB divided into 2 MiB pages, and it was set up to linearly map a full
> 1 GiB of physical addresses that included the physical address range
> of the kernel image, as chosen by KASLR.  But this also included a
> large range of unused addresses on either side of the kernel image.
> And unlike the kernel image's physical address range, this extra
> mapped space was not checked against the BIOS tables of usable RAM
> addresses.  So there were times when the addresses chosen by KASLR
> would result in processor accessible mappings of BIOS reserved
> physical addresses.
> 
> The kernel code did not directly access any of this extra mapped
> space, but having it mapped allowed the processor to issue speculative
> accesses into reserved memory, causing system halts.
> 
> This was encountered somewhat rarely on a normal system boot, and much
> more often when starting the crash kernel if "crashkernel=512M,high"
> was specified on the command line (this heavily restricts the physical
> address of the crash kernel, in our case usually within 1 GiB of
> reserved space).
> 
> The solution is to invalidate the pages of this table outside the
> kernel image's space before the page table is activated.  This patch
> has been validated to fix this problem on our hardware.
> 
> Signed-off-by: Steve Wahl 
> Cc: sta...@vger.kernel.org

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH v3 3/4] mm: don't expose non-hugetlb page to fast gup prematurely

2019-09-26 Thread Kirill A. Shutemov
On Wed, Sep 25, 2019 at 04:26:54PM -0600, Yu Zhao wrote:
> On Wed, Sep 25, 2019 at 10:25:30AM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 24, 2019 at 05:24:58PM -0600, Yu Zhao wrote:
> > > We don't want to expose a non-hugetlb page to the fast gup running
> > > on a remote CPU before all local non-atomic ops on the page flags
> > > are visible first.
> > > 
> > > For an anon page that isn't in swap cache, we need to make sure all
> > > prior non-atomic ops, especially __SetPageSwapBacked() in
> > > page_add_new_anon_rmap(), are ordered before set_pte_at() to prevent
> > > the following race:
> > > 
> > >   CPU 1   CPU1
> > >   set_pte_at()get_user_pages_fast()
> > > page_add_new_anon_rmap()gup_pte_range()
> > > __SetPageSwapBacked() SetPageReferenced()
> > > 
> > > This demonstrates a non-fatal scenario. Though haven't been directly
> > > observed, the fatal ones can exist, e.g., PG_lock set by fast gup
> > > caller and then overwritten by __SetPageSwapBacked().
> > > 
> > > For an anon page that is already in swap cache or a file page, we
> > > don't need smp_wmb() before set_pte_at() because adding to swap or
> > > file cache serves as a valid write barrier. Using non-atomic ops
> > > thereafter is a bug, obviously.
> > > 
> > > smp_wmb() is added following 11 of total 12 page_add_new_anon_rmap()
> > > call sites, with the only exception being
> > > do_huge_pmd_wp_page_fallback() because of an existing smp_wmb().
> > > 
> > 
> > I'm thinking this patch make stuff rather fragile.. Should we instead
> > stick the barrier in set_p*d_at() instead? Or rather, make that store a
> > store-release?
> 
> I prefer it this way too, but I suspected the majority would be
> concerned with the performance implications, especially those
> looping set_pte_at()s in mm/huge_memory.c.

We can rename the current set_pte_at() to __set_pte_at() or something and
leave it in places where the barrier is not needed. The new set_pte_at() will
be used in the rest of the places, with the barrier inside.
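
Sketch of the split I mean (names and placement are illustrative only):

	static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
				      pte_t *ptep, pte_t pte)
	{
		/* Order prior non-atomic page-flag updates against the PTE store. */
		smp_wmb();
		__set_pte_at(mm, addr, ptep, pte);
	}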

BTW, have you looked at the other levels of the page table hierarchy? Do we
have the same issue for PMD/PUD/... pages?

-- 
 Kirill A. Shutemov


Re: [PATCH 3/3] mm, page_owner: rename flag indicating that page is allocated

2019-09-26 Thread Kirill A. Shutemov
On Wed, Sep 25, 2019 at 04:30:52PM +0200, Vlastimil Babka wrote:
> Commit 37389167a281 ("mm, page_owner: keep owner info when freeing the page")
> has introduced a flag PAGE_EXT_OWNER_ACTIVE to indicate that page is tracked 
> as
> being allocated.  Kirril suggested naming it PAGE_EXT_OWNER_ALLOCED to make it
^ typo

And PAGE_EXT_OWNER_ALLOCED is my typo. I meant PAGE_EXT_OWNER_ALLOCATED :P

-- 
 Kirill A. Shutemov


Re: [PATCH 2/3] mm, debug, kasan: save and dump freeing stack trace for kasan

2019-09-26 Thread Kirill A. Shutemov
 page_owner->free_handle = handle;
>   }
> @@ -450,14 +453,16 @@ void __dump_page_owner(struct page *page)
>   stack_trace_print(entries, nr_entries, 0);
>   }
>  
> -#ifdef CONFIG_DEBUG_PAGEALLOC
> - handle = READ_ONCE(page_owner->free_handle);
> - if (!handle) {
> - pr_alert("page_owner free stack trace missing\n");
> - } else {
> - nr_entries = stack_depot_fetch(handle, &entries);
> - pr_alert("page last free stack trace:\n");
> - stack_trace_print(entries, nr_entries, 0);
> +#ifdef CONFIG_PAGE_OWNER_FREE_STACK
> + if (static_branch_unlikely(&page_owner_free_stack)) {
> + handle = READ_ONCE(page_owner->free_handle);
> + if (!handle) {
> + pr_alert("page_owner free stack trace missing\n");
> + } else {
> + nr_entries = stack_depot_fetch(handle, &entries);
> + pr_alert("page last free stack trace:\n");
> + stack_trace_print(entries, nr_entries, 0);
> + }
>   }
>  #endif
>  
> -- 
> 2.23.0
> 
> 

-- 
 Kirill A. Shutemov


Re: [PATCH 1/3] mm, page_owner: fix off-by-one error in __set_page_owner_handle()

2019-09-26 Thread Kirill A. Shutemov
On Wed, Sep 25, 2019 at 04:30:50PM +0200, Vlastimil Babka wrote:
> As noted by Kirill, commit 7e2f2a0cd17c ("mm, page_owner: record page owner 
> for
> each subpage") has introduced an off-by-one error in __set_page_owner_handle()
> when looking up page_ext for subpages. As a result, the head page page_owner
> info is set twice, while for the last tail page, it's not set at all.
> 
> Fix this and also make the code more efficient by advancing the page_ext
> pointer we already have, instead of calling lookup_page_ext() for each 
> subpage.
> Since the full size of struct page_ext is not known at compile time, we can't
> use a simple page_ext++ statement, so introduce a page_ext_next() inline
> function for that.
> 
> Reported-by: Kirill A. Shutemov 
> Fixes: 7e2f2a0cd17c ("mm, page_owner: record page owner for each subpage")
> Signed-off-by: Vlastimil Babka 

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH v2] mm: don't expose page to fast gup prematurely

2019-09-25 Thread Kirill A. Shutemov
On Tue, Sep 24, 2019 at 04:05:50PM -0600, Yu Zhao wrote:
> On Tue, Sep 24, 2019 at 02:23:16PM +0300, Kirill A. Shutemov wrote:
> > On Sat, Sep 14, 2019 at 01:05:18AM -0600, Yu Zhao wrote:
> > > We don't want to expose page to fast gup running on a remote CPU
> > > before all local non-atomic ops on page flags are visible first.
> > > 
> > > For anon page that isn't in swap cache, we need to make sure all
> > > prior non-atomic ops, especially __SetPageSwapBacked() in
> > > page_add_new_anon_rmap(), are order before set_pte_at() to prevent
> > > the following race:
> > > 
> > >   CPU 1   CPU1
> > > set_pte_at()  get_user_pages_fast()
> > > page_add_new_anon_rmap()  gup_pte_range()
> > >   __SetPageSwapBacked()   SetPageReferenced()
> > 
> > Is there a particular codepath that has what you listed for CPU?
> > After a quick look, I only saw page_add_new_anon_rmap() being called
> > before set_pte_at().
> 
> I think so. One in do_swap_page() and another in unuse_pte(). Both
> are on KSM paths. Am I referencing a stale copy of the source?

I *think* it is a bug. Setting a pte before adding the page to rmap may
lead rmap (like try_to_unmap() or something) to miss the VMA.
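
The ordering I would expect looks roughly like this (sketch; the
surrounding variables are illustrative):

	page_add_new_anon_rmap(page, vma, addr, false);
	lru_cache_add_active_or_unevictable(page, vma);
	/* Only now may a PTE point at the page. */
	set_pte_at(vma->vm_mm, addr, ptep, entry);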

Am I missing something?

-- 
 Kirill A. Shutemov


Re: [PATCH v2 4/4] mm, page_owner, debug_pagealloc: save and dump freeing stack trace

2019-09-24 Thread Kirill A. Shutemov
On Tue, Sep 24, 2019 at 05:15:01PM +0200, Vlastimil Babka wrote:
> On 9/24/19 1:42 PM, Kirill A. Shutemov wrote:
> >> --- a/mm/page_owner.c
> >> +++ b/mm/page_owner.c
> >> @@ -24,6 +24,9 @@ struct page_owner {
> >>short last_migrate_reason;
> >>gfp_t gfp_mask;
> >>depot_stack_handle_t handle;
> >> +#ifdef CONFIG_DEBUG_PAGEALLOC
> >> +  depot_stack_handle_t free_handle;
> >> +#endif
> > 
> > I think it's possible to add space for the second stack handle at runtime:
> > adjust page_owner_ops->size inside the ->need().
> 
> Uh that would complicate the code needlessly? The extra memory overhead
> isn't that much for a scenario that's already a debugging one
> (page_owner), I was more concerned about the cpu overhead, thus the
> static key.

Okay, fair enough.

-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/4] mm, page_owner: record page owner for each subpage

2019-09-24 Thread Kirill A. Shutemov
On Tue, Sep 24, 2019 at 05:10:59PM +0200, Vlastimil Babka wrote:
> On 9/24/19 1:31 PM, Kirill A. Shutemov wrote:
> > On Tue, Aug 20, 2019 at 03:18:26PM +0200, Vlastimil Babka wrote:
> >> Currently, page owner info is only recorded for the first page of a 
> >> high-order
> >> allocation, and copied to tail pages in the event of a split page. With the
> >> plan to keep previous owner info after freeing the page, it would be 
> >> benefical
> >> to record page owner for each subpage upon allocation. This increases the
> >> overhead for high orders, but that should be acceptable for a debugging 
> >> option.
> >>
> >> The order stored for each subpage is the order of the whole allocation. 
> >> This
> >> makes it possible to calculate the "head" pfn and to recognize "tail" pages
> >> (quoted because not all high-order allocations are compound pages with true
> >> head and tail pages). When reading the page_owner debugfs file, keep 
> >> skipping
> >> the "tail" pages so that stats gathered by existing scripts don't get 
> >> inflated.
> >>
> >> Signed-off-by: Vlastimil Babka 
> >> ---
> >>  mm/page_owner.c | 40 
> >>  1 file changed, 28 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/mm/page_owner.c b/mm/page_owner.c
> >> index addcbb2ae4e4..813fcb70547b 100644
> >> --- a/mm/page_owner.c
> >> +++ b/mm/page_owner.c
> >> @@ -154,18 +154,23 @@ static noinline depot_stack_handle_t 
> >> save_stack(gfp_t flags)
> >>return handle;
> >>  }
> >>  
> >> -static inline void __set_page_owner_handle(struct page_ext *page_ext,
> >> -  depot_stack_handle_t handle, unsigned int order, gfp_t gfp_mask)
> >> +static inline void __set_page_owner_handle(struct page *page,
> >> +  struct page_ext *page_ext, depot_stack_handle_t handle,
> >> +  unsigned int order, gfp_t gfp_mask)
> >>  {
> >>struct page_owner *page_owner;
> >> +  int i;
> >>  
> >> -  page_owner = get_page_owner(page_ext);
> >> -  page_owner->handle = handle;
> >> -  page_owner->order = order;
> >> -  page_owner->gfp_mask = gfp_mask;
> >> -  page_owner->last_migrate_reason = -1;
> >> +  for (i = 0; i < (1 << order); i++) {
> >> +  page_owner = get_page_owner(page_ext);
> >> +  page_owner->handle = handle;
> >> +  page_owner->order = order;
> >> +  page_owner->gfp_mask = gfp_mask;
> >> +  page_owner->last_migrate_reason = -1;
> >> +  __set_bit(PAGE_EXT_OWNER, &page_ext->flags);
> >>  
> >> -  __set_bit(PAGE_EXT_OWNER, &page_ext->flags);
> >> +  page_ext = lookup_page_ext(page + i);
> > 
> > Isn't it off-by-one? You are calculating page_ext for the next page,
> > right?
> 
> You're right, thanks!
> 
> > And cant we just do page_ext++ here instead?
> 
> Unfortunately no, as that implies sizeof(page_ext), which only declares
> unsigned long flags; and the rest is runtime-determined.
> Perhaps I could add a wrapper named e.g. page_ext_next() that would use
> get_entry_size() internally and hide the necessary casts to void * and back?

Yeah, it looks less costly than calling lookup_page_ext() on each
iteration. And it looks like it can be inlined if we make 'extra_mem' visible
(under a different name) outside page_ext.c.
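
Sketch of such a helper, assuming the per-entry size gets exported, say as
page_ext_size:

	static inline struct page_ext *page_ext_next(struct page_ext *curr)
	{
		void *next = curr;

		next += page_ext_size;	/* flags plus all registered extensions */
		return next;
	}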

-- 
 Kirill A. Shutemov


Re: [PATCH v8 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-24 Thread Kirill A. Shutemov
On Tue, Sep 24, 2019 at 11:33:25AM +0100, Catalin Marinas wrote:
> On Tue, Sep 24, 2019 at 06:43:06AM +, Justin He (Arm Technology China) 
> wrote:
> > Catalin Marinas wrote:
> > > On Sat, Sep 21, 2019 at 09:50:54PM +0800, Jia He wrote:
> > > > @@ -2151,21 +2163,53 @@ static inline void cow_user_page(struct page 
> > > > *dst, struct page *src, unsigned lo
> > > >  * fails, we just zero-fill it. Live with it.
> > > >  */
> > > > if (unlikely(!src)) {
> > > > -   void *kaddr = kmap_atomic(dst);
> > > > -   void __user *uaddr = (void __user *)(va & PAGE_MASK);
> > > > +   void *kaddr;
> > > > +   pte_t entry;
> > > > +   void __user *uaddr = (void __user *)(addr & PAGE_MASK);
> > > >
> > > > +   /* On architectures with software "accessed" bits, we 
> > > > would
> > > > +* take a double page fault, so mark it accessed here.
> > > > +*/
> [...]
> > > > +   if (arch_faults_on_old_pte() && 
> > > > !pte_young(vmf->orig_pte)) {
> > > > +   vmf->pte = pte_offset_map_lock(mm, vmf->pmd, 
> > > > addr,
> > > > +      &vmf->ptl);
> > > > +   if (likely(pte_same(*vmf->pte, vmf->orig_pte))) 
> > > > {
> > > > +   entry = pte_mkyoung(vmf->orig_pte);
> > > > +   if (ptep_set_access_flags(vma, addr,
> > > > + vmf->pte, 
> > > > entry, 0))
> > > > +   update_mmu_cache(vma, addr, 
> > > > vmf->pte);
> > > > +   } else {
> > > > +   /* Other thread has already handled the 
> > > > fault
> > > > +* and we don't need to do anything. If 
> > > > it's
> > > > +* not the case, the fault will be 
> > > > triggered
> > > > +* again on the same address.
> > > > +*/
> > > > +   pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > > +   return false;
> > > > +   }
> > > > +   pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > > +   }
> [...]
> > > > +
> > > > +   kaddr = kmap_atomic(dst);
> > > 
> > > Since you moved the kmap_atomic() here, could the above
> > > arch_faults_on_old_pte() run in a preemptible context? I suggested to
> > > add a WARN_ON in patch 2 to be sure.
> > 
> > Should I move kmap_atomic back to the original line? Thus, we can make sure
> > that arch_faults_on_old_pte() is in the context of preempt_disabled?
> > Otherwise, arch_faults_on_old_pte() may cause plenty of warning if I add
> > a WARN_ON in arch_faults_on_old_pte.  I tested it when I enable the 
> > PREEMPT=y
> > on a ThunderX2 qemu guest.
> 
> So we have two options here:
> 
> 1. Change arch_faults_on_old_pte() scope to the whole system rather than
>just the current CPU. You'd have to wire up a new arm64 capability
>for the access flag but this way we don't care whether it's
>preemptible or not.
> 
> 2. Keep the arch_faults_on_old_pte() per-CPU but make sure we are not
>preempted here. The kmap_atomic() move would do but you'd have to
>kunmap_atomic() before the return.
> 
> I think the answer to my question below also has some implication on
> which option to pick:
> 
> > > > /*
> > > >  * This really shouldn't fail, because the page is there
> > > >  * in the page tables. But it might just be unreadable,
> > > >  * in which case we just give up and fill the result 
> > > > with
> > > >  * zeroes.
> > > >  */
> > > > -   if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
> > > > +   if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) 
> > > > {
> > > > +   /* Give a warn in case there can be some obscure
> > > > +* use-case
> > > > +*/
> > > > +   WARN_ON_ONCE(1);
> > > 
> > > That's more of a question for the mm guys: at this point we do the
> > > copying with the ptl released; is there anything else that could have
> > > made the pte old in the meantime? I think unuse_pte() is only called on
> > > anonymous vmas, so it shouldn't be the case here.
> 
> If we need to hold the ptl here, you could as well have an enclosing
> kmap/kunmap_atomic (option 2) with some goto instead of "return false".

Yeah, looks like we need to hold the ptl for longer. Otherwise there is
nothing I see that would prevent the young bit from being cleared under us.
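
Roughly what I have in mind, as an abbreviated, untested sketch (kaddr and
uaddr come from the existing kmap_atomic() setup):

	vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
	if (!pte_same(*vmf->pte, vmf->orig_pte)) {
		/* Other thread has already handled the fault. */
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return false;
	}
	entry = pte_mkyoung(vmf->orig_pte);
	if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0))
		update_mmu_cache(vma, addr, vmf->pte);

	/* The copy happens with the ptl still held. */
	if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
		clear_page(kaddr);

	pte_unmap_unlock(vmf->pte, vmf->ptl);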

-- 
 Kirill A. Shutemov


Re: [PATCH V3 0/2] mm/debug: Add tests for architecture exported page table helpers

2019-09-24 Thread Kirill A. Shutemov
On Fri, Sep 20, 2019 at 12:03:21PM +0530, Anshuman Khandual wrote:
> This series adds a test validation for architecture exported page table
> helpers. Patch in the series adds basic transformation tests at various
> levels of the page table. Before that it exports gigantic page allocation
> function from HugeTLB.
> 
> This test was originally suggested by Catalin during arm64 THP migration
> RFC discussion earlier. Going forward it can include more specific tests
> with respect to various generic MM functions like THP, HugeTLB etc and
> platform specific tests.
> 
> https://lore.kernel.org/linux-mm/20190628102003.ga56...@arrakis.emea.arm.com/
> 
> Testing:
> 
> Successfully build and boot tested on both arm64 and x86 platforms without
> any test failing. Only build tested on some other platforms. Build failed
> on some platforms (known) in pud_clear_tests() as there were no available
> __pgd() definitions.
> 
> - ARM32
> - IA64

Hm. Grep shows __pgd() definitions for both of them. Is it for a specific
config?


-- 
 Kirill A. Shutemov


Re: [PATCH v2 4/4] mm, page_owner, debug_pagealloc: save and dump freeing stack trace

2019-09-24 Thread Kirill A. Shutemov
On Tue, Aug 20, 2019 at 03:18:28PM +0200, Vlastimil Babka wrote:
> The debug_pagealloc functionality is useful to catch buggy page allocator 
> users
> that cause e.g. use after free or double free. When page inconsistency is
> detected, debugging is often simpler by knowing the call stack of process that
> last allocated and freed the page. When page_owner is also enabled, we record
> the allocation stack trace, but not freeing.
> 
> This patch therefore adds recording of freeing process stack trace to page
> owner info, if both page_owner and debug_pagealloc are configured and enabled.
> With only page_owner enabled, this info is not useful for the memory leak
> debugging use case. dump_page() is adjusted to print the info. An example
> result of calling __free_pages() twice may look like this (note the page last 
> free
> stack trace):
> 
> BUG: Bad page state in process bash  pfn:13d8f8
> page:c31984f63e00 refcount:-1 mapcount:0 mapping: 
> index:0x0
> flags: 0x1a8()
> raw: 01a8 dead0100 dead0122 
> raw:    
> page dumped because: nonzero _refcount
> page_owner tracks the page as freed
> page last allocated via order 0, migratetype Unmovable, gfp_mask 
> 0xcc0(GFP_KERNEL)
>  prep_new_page+0x143/0x150
>  get_page_from_freelist+0x289/0x380
>  __alloc_pages_nodemask+0x13c/0x2d0
>  khugepaged+0x6e/0xc10
>  kthread+0xf9/0x130
>  ret_from_fork+0x3a/0x50
> page last free stack trace:
>  free_pcp_prepare+0x134/0x1e0
>  free_unref_page+0x18/0x90
>  khugepaged+0x7b/0xc10
>  kthread+0xf9/0x130
>  ret_from_fork+0x3a/0x50
> Modules linked in:
> CPU: 3 PID: 271 Comm: bash Not tainted 5.3.0-rc4-2.g07a1a73-default+ #57
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
> Call Trace:
>  dump_stack+0x85/0xc0
>  bad_page.cold+0xba/0xbf
>  rmqueue_pcplist.isra.0+0x6c5/0x6d0
>  rmqueue+0x2d/0x810
>  get_page_from_freelist+0x191/0x380
>  __alloc_pages_nodemask+0x13c/0x2d0
>  __get_free_pages+0xd/0x30
>  __pud_alloc+0x2c/0x110
>  copy_page_range+0x4f9/0x630
>  dup_mmap+0x362/0x480
>  dup_mm+0x68/0x110
>  copy_process+0x19e1/0x1b40
>  _do_fork+0x73/0x310
>  __x64_sys_clone+0x75/0x80
>  do_syscall_64+0x6e/0x1e0
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x7f10af854a10
> ...
> 
> Signed-off-by: Vlastimil Babka 
> ---
>  .../admin-guide/kernel-parameters.txt |  2 +
>  mm/Kconfig.debug  |  4 +-
>  mm/page_owner.c   | 53 ++-
>  3 files changed, 45 insertions(+), 14 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index 47d981a86e2f..e813a17d622e 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -809,6 +809,8 @@
>   enables the feature at boot time. By default, it is
>   disabled and the system will work mostly the same as a
>   kernel built without CONFIG_DEBUG_PAGEALLOC.
> + Note: to get most of debug_pagealloc error reports, it's
> + useful to also enable the page_owner functionality.
>   on: enable the feature
>  
>   debugpat[X86] Enable PAT debugging
> diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> index 82b6a20898bd..327b3ebf23bf 100644
> --- a/mm/Kconfig.debug
> +++ b/mm/Kconfig.debug
> @@ -21,7 +21,9 @@ config DEBUG_PAGEALLOC
> Also, the state of page tracking structures is checked more often as
> pages are being allocated and freed, as unexpected state changes
> often happen for same reasons as memory corruption (e.g. double free,
> -   use-after-free).
> +   use-after-free). The error reports for these checks can be augmented
> +   with stack traces of last allocation and freeing of the page, when
> +   PAGE_OWNER is also selected and enabled on boot.
>  
> For architectures which don't enable ARCH_SUPPORTS_DEBUG_PAGEALLOC,
> fill the pages with poison patterns after free_pages() and verify
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index 4a48e018dbdf..dee931184788 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -24,6 +24,9 @@ struct page_owner {
>   short last_migrate_reason;
>   gfp_t gfp_mask;
>   depot_stack_handle_t handle;
> +#ifdef CONFIG_DEBUG_PAGEALLOC
> + depot_stack_handle_t free_handle;
> +#endif

I think it's possible to add space for the second stack handle at runtime:
adjust page_owner_ops->size inside the ->need(). The second stack might be
useful beyond CONFIG_DEBUG_PAGEALLOC. We probably should not tie these
features.
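
Illustrative sketch only (the runtime switch name is made up):

	static bool need_page_owner(void)
	{
		if (!page_owner_enabled)
			return false;
		if (want_free_stack)	/* hypothetical boot-time switch */
			page_owner_ops.size += sizeof(depot_stack_handle_t);
		return true;
	}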

-- 
 Kirill A. Shutemov


Re: [PATCH v2 3/4] mm, page_owner: keep owner info when freeing the page

2019-09-24 Thread Kirill A. Shutemov
On Tue, Aug 20, 2019 at 03:18:27PM +0200, Vlastimil Babka wrote:
> For debugging purposes it might be useful to keep the owner info even after
> page has been freed, and include it in e.g. dump_page() when detecting a bad
> page state. For that, change the PAGE_EXT_OWNER flag meaning to "page owner
> info has been set at least once" and add new PAGE_EXT_OWNER_ACTIVE for 
> tracking
> whether page is supposed to be currently tracked allocated or free.

Why not PAGE_EXT_OWNER_ALLOCED if it means "allocated"? "Active" is a somewhat
loaded term for a page.

-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/4] mm, page_owner: record page owner for each subpage

2019-09-24 Thread Kirill A. Shutemov
On Tue, Aug 20, 2019 at 03:18:26PM +0200, Vlastimil Babka wrote:
> Currently, page owner info is only recorded for the first page of a high-order
> allocation, and copied to tail pages in the event of a split page. With the
> plan to keep previous owner info after freeing the page, it would be benefical
> to record page owner for each subpage upon allocation. This increases the
> overhead for high orders, but that should be acceptable for a debugging 
> option.
> 
> The order stored for each subpage is the order of the whole allocation. This
> makes it possible to calculate the "head" pfn and to recognize "tail" pages
> (quoted because not all high-order allocations are compound pages with true
> head and tail pages). When reading the page_owner debugfs file, keep skipping
> the "tail" pages so that stats gathered by existing scripts don't get 
> inflated.
> 
> Signed-off-by: Vlastimil Babka 
> ---
>  mm/page_owner.c | 40 
>  1 file changed, 28 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index addcbb2ae4e4..813fcb70547b 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -154,18 +154,23 @@ static noinline depot_stack_handle_t save_stack(gfp_t 
> flags)
>   return handle;
>  }
>  
> -static inline void __set_page_owner_handle(struct page_ext *page_ext,
> - depot_stack_handle_t handle, unsigned int order, gfp_t gfp_mask)
> +static inline void __set_page_owner_handle(struct page *page,
> + struct page_ext *page_ext, depot_stack_handle_t handle,
> + unsigned int order, gfp_t gfp_mask)
>  {
>   struct page_owner *page_owner;
> + int i;
>  
> - page_owner = get_page_owner(page_ext);
> - page_owner->handle = handle;
> - page_owner->order = order;
> - page_owner->gfp_mask = gfp_mask;
> - page_owner->last_migrate_reason = -1;
> + for (i = 0; i < (1 << order); i++) {
> + page_owner = get_page_owner(page_ext);
> + page_owner->handle = handle;
> + page_owner->order = order;
> + page_owner->gfp_mask = gfp_mask;
> + page_owner->last_migrate_reason = -1;
> + __set_bit(PAGE_EXT_OWNER, &page_ext->flags);
>  
> - __set_bit(PAGE_EXT_OWNER, &page_ext->flags);
> + page_ext = lookup_page_ext(page + i);

Isn't it off-by-one? You are calculating page_ext for the next page,
right? And can't we just do page_ext++ here instead?

> + }
>  }
>  
>  noinline void __set_page_owner(struct page *page, unsigned int order,
> @@ -178,7 +183,7 @@ noinline void __set_page_owner(struct page *page, 
> unsigned int order,
>   return;
>  
>   handle = save_stack(gfp_mask);
> - __set_page_owner_handle(page_ext, handle, order, gfp_mask);
> + __set_page_owner_handle(page, page_ext, handle, order, gfp_mask);
>  }
>  
>  void __set_page_owner_migrate_reason(struct page *page, int reason)
> @@ -204,8 +209,11 @@ void __split_page_owner(struct page *page, unsigned int 
> order)
>  
>   page_owner = get_page_owner(page_ext);
>   page_owner->order = 0;
> - for (i = 1; i < (1 << order); i++)
> - __copy_page_owner(page, page + i);
> + for (i = 1; i < (1 << order); i++) {
> + page_ext = lookup_page_ext(page + i);

Again, page_ext++?

> + page_owner = get_page_owner(page_ext);
> + page_owner->order = 0;
> + }
>  }
>  
>  void __copy_page_owner(struct page *oldpage, struct page *newpage)
> @@ -483,6 +491,13 @@ read_page_owner(struct file *file, char __user *buf, 
> size_t count, loff_t *ppos)
>  
>   page_owner = get_page_owner(page_ext);
>  
> + /*
> +  * Don't print "tail" pages of high-order allocations as that
> +  * would inflate the stats.
> +  */
> + if (!IS_ALIGNED(pfn, 1 << page_owner->order))
> + continue;
> +
>   /*
>* Access to page_ext->handle isn't synchronous so we should
>* be careful to access it.
> @@ -562,7 +577,8 @@ static void init_pages_in_zone(pg_data_t *pgdat, struct 
> zone *zone)
>   continue;
>  
>   /* Found early allocated page */
> - __set_page_owner_handle(page_ext, early_handle, 0, 0);
> + __set_page_owner_handle(page, page_ext, early_handle,
> + 0, 0);
>   count++;
>   }
>   cond_resched();
> -- 
> 2.22.0
> 
> 

-- 
 Kirill A. Shutemov


Re: [PATCH v2] mm: don't expose page to fast gup prematurely

2019-09-24 Thread Kirill A. Shutemov
On Sat, Sep 14, 2019 at 01:05:18AM -0600, Yu Zhao wrote:
> We don't want to expose page to fast gup running on a remote CPU
> before all local non-atomic ops on page flags are visible first.
> 
> For anon page that isn't in swap cache, we need to make sure all
> prior non-atomic ops, especially __SetPageSwapBacked() in
> page_add_new_anon_rmap(), are order before set_pte_at() to prevent
> the following race:
> 
>   CPU 1   CPU1
> set_pte_at()  get_user_pages_fast()
> page_add_new_anon_rmap()  gup_pte_range()
>   __SetPageSwapBacked()   SetPageReferenced()

Is there a particular codepath that has what you listed for CPU?
After a quick look, I only saw page_add_new_anon_rmap() being called
before set_pte_at().

-- 
 Kirill A. Shutemov


Re: [PATCH v8 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-23 Thread Kirill A. Shutemov
On Sat, Sep 21, 2019 at 09:50:54PM +0800, Jia He wrote:
> When we tested pmdk unit test [1] vmmalloc_fork TEST1 in arm64 guest, there
> will be a double page fault in __copy_from_user_inatomic of cow_user_page.
> 
> Below call trace is from arm64 do_page_fault for debugging purpose
> [  110.016195] Call trace:
> [  110.016826]  do_page_fault+0x5a4/0x690
> [  110.017812]  do_mem_abort+0x50/0xb0
> [  110.018726]  el1_da+0x20/0xc4
> [  110.019492]  __arch_copy_from_user+0x180/0x280
> [  110.020646]  do_wp_page+0xb0/0x860
> [  110.021517]  __handle_mm_fault+0x994/0x1338
> [  110.022606]  handle_mm_fault+0xe8/0x180
> [  110.023584]  do_page_fault+0x240/0x690
> [  110.024535]  do_mem_abort+0x50/0xb0
> [  110.025423]  el0_da+0x20/0x24
> 
> The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
> [9b007000] pgd=00023d4f8003, pud=00023da9b003, 
> pmd=00023d4b3003, pte=36298607bd3
> 
> As told by Catalin: "On arm64 without hardware Access Flag, copying from
> user will fail because the pte is old and cannot be marked young. So we
> always end up with zeroed page after fork() + CoW for pfn mappings. we
> don't always have a hardware-managed access flag on arm64."
> 
> This patch fix it by calling pte_mkyoung. Also, the parameter is
> changed because vmf should be passed to cow_user_page()
> 
> Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns error
> in case there can be some obscure use-case.(by Kirill)
> 
> [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork
> 
> Reported-by: Yibo Cai 
> Signed-off-by: Jia He 

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH v7 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-20 Thread Kirill A. Shutemov
On Fri, Sep 20, 2019 at 08:53:00AM -0700, Matthew Wilcox wrote:
> On Fri, Sep 20, 2019 at 09:54:37PM +0800, Jia He wrote:
> > -static inline void cow_user_page(struct page *dst, struct page *src, 
> > unsigned long va, struct vm_area_struct *vma)
> > +static inline int cow_user_page(struct page *dst, struct page *src,
> > +   struct vm_fault *vmf)
> >  {
> 
> Can we talk about the return type here?
> 
> > +   } else {
> > +   /* Other thread has already handled the fault
> > +* and we don't need to do anything. If it's
> > +* not the case, the fault will be triggered
> > +* again on the same address.
> > +*/
> > +   pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +   return -1;
> ...
> > +   return 0;
> >  }
> 
> So -1 for "try again" and 0 for "succeeded".
> 
> > +   if (cow_user_page(new_page, old_page, vmf)) {
> 
> Then we use it like a bool.  But it's kind of backwards from a bool because
> false is success.
> 
> > +   /* COW failed, if the fault was solved by other,
> > +* it's fine. If not, userspace would re-fault on
> > +* the same address and we will handle the fault
> > +* from the second attempt.
> > +*/
> > +   put_page(new_page);
> > +   if (old_page)
> > +   put_page(old_page);
> > +   return 0;
> 
> And we don't use the return value; in fact we invert it.
> 
> Would this make more sense:
> 
> static inline bool cow_user_page(struct page *dst, struct page *src,
>   struct vm_fault *vmf)
> ...
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
>   return false;
> ...
>   return true;
> ...
>   if (!cow_user_page(new_page, old_page, vmf)) {
> 
> That reads more sensibly for me.

I like this idea too.

-- 
 Kirill A. Shutemov


Re: [PATCH v7 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-20 Thread Kirill A. Shutemov
On Fri, Sep 20, 2019 at 09:54:37PM +0800, Jia He wrote:
> When we tested pmdk unit test [1] vmmalloc_fork TEST1 in arm64 guest, there
> will be a double page fault in __copy_from_user_inatomic of cow_user_page.
> 
> Below call trace is from arm64 do_page_fault for debugging purpose
> [  110.016195] Call trace:
> [  110.016826]  do_page_fault+0x5a4/0x690
> [  110.017812]  do_mem_abort+0x50/0xb0
> [  110.018726]  el1_da+0x20/0xc4
> [  110.019492]  __arch_copy_from_user+0x180/0x280
> [  110.020646]  do_wp_page+0xb0/0x860
> [  110.021517]  __handle_mm_fault+0x994/0x1338
> [  110.022606]  handle_mm_fault+0xe8/0x180
> [  110.023584]  do_page_fault+0x240/0x690
> [  110.024535]  do_mem_abort+0x50/0xb0
> [  110.025423]  el0_da+0x20/0x24
> 
> The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
> [9b007000] pgd=00023d4f8003, pud=00023da9b003, 
> pmd=00023d4b3003, pte=36298607bd3
> 
> As told by Catalin: "On arm64 without hardware Access Flag, copying from
> user will fail because the pte is old and cannot be marked young. So we
> always end up with zeroed page after fork() + CoW for pfn mappings. we
> don't always have a hardware-managed access flag on arm64."
> 
> This patch fix it by calling pte_mkyoung. Also, the parameter is
> changed because vmf should be passed to cow_user_page()
> 
> Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns error
> in case there can be some obscure use-case.(by Kirill)
> 
> [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork
> 
> Reported-by: Yibo Cai 
> Signed-off-by: Jia He 

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH] mm, memcg: assign shrinker_map before kvfree

2019-09-20 Thread Kirill A. Shutemov
On Fri, Sep 20, 2019 at 03:29:07PM +0300, Cyrill Gorcunov wrote:
> Currently there is a small gap between fetching pointer, calling
> kvfree and assign its value to nil. In current callgraph it is
> not a problem (since memcg_free_shrinker_maps is running from
> memcg_alloc_shrinker_maps and mem_cgroup_css_free only) still
> this looks suspicious and we can easily eliminate the gap at all.

With this logic it will still look suspicious since you don't wait a grace
period before freeing the map.

-- 
 Kirill A. Shutemov


Re: [PATCH v6 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-20 Thread Kirill A. Shutemov
>* This really shouldn't fail, because the page is there
>* in the page tables. But it might just be unreadable,
>* in which case we just give up and fill the result with
>* zeroes.
>*/
> - if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
> + if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
> + /* Give a warn in case there can be some obscure
> +  * use-case
> +  */
> + WARN_ON_ONCE(1);
>   clear_page(kaddr);
> + }
>   kunmap_atomic(kaddr);
>   flush_dcache_page(dst);
>   } else
> - copy_user_highpage(dst, src, va, vma);
> + copy_user_highpage(dst, src, addr, vma);
> +
> + return 0;
>  }
>  
>  static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
> @@ -2318,7 +2360,16 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>   vmf->address);
>   if (!new_page)
>   goto oom;
> - cow_user_page(new_page, old_page, vmf->address, vma);
> +
> + if (cow_user_page(new_page, old_page, vmf)) {
> + /* COW failed, if the fault was solved by other,
> +  * it's fine. If not, userspace would re-fault on
> +  * the same address and we will handle the fault
> +  * from the second attempt.
> +  */
> + put_page(new_page);

I think you also need to give the reference on the old page back:

if (old_page)
    put_page(old_page);

> + goto normal;

I don't see much point in this goto. Just return 0.
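
IOW, the error path could look like this (untested sketch):

		if (cow_user_page(new_page, old_page, vmf)) {
			/*
			 * COW failed. If the fault was solved by the other
			 * thread, we are fine. If not, userspace will
			 * re-fault on the same address.
			 */
			put_page(new_page);
			if (old_page)
				put_page(old_page);
			return 0;
		}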
> + }
>   }
>  
>   if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg, 
> false))
> @@ -2420,6 +2471,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>   }
>   put_page(old_page);
>   }
> +
> +normal:
>   return page_copied ? VM_FAULT_WRITE : 0;
>  oom_free_new:
>   put_page(new_page);
> -- 
> 2.17.1
> 
> 

-- 
 Kirill A. Shutemov


Re: [PATCH v4 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-19 Thread Kirill A. Shutemov
On Thu, Sep 19, 2019 at 03:41:43PM +, Catalin Marinas wrote:
> On Thu, Sep 19, 2019 at 06:00:07PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Sep 18, 2019 at 07:00:30PM +0100, Catalin Marinas wrote:
> > > On Wed, Sep 18, 2019 at 05:00:27PM +0300, Kirill A. Shutemov wrote:
> > > > On Wed, Sep 18, 2019 at 09:19:14PM +0800, Jia He wrote:
> > > > > @@ -2152,20 +2163,34 @@ static inline void cow_user_page(struct page 
> > > > > *dst, struct page *src, unsigned lo
> > > > >*/
> > > > >   if (unlikely(!src)) {
> > > > >   void *kaddr = kmap_atomic(dst);
> > > > > - void __user *uaddr = (void __user *)(va & PAGE_MASK);
> > > > > + void __user *uaddr = (void __user *)(addr & PAGE_MASK);
> > > > > + pte_t entry;
> > > > >  
> > > > >   /*
> > > > >* This really shouldn't fail, because the page is there
> > > > >* in the page tables. But it might just be unreadable,
> > > > >* in which case we just give up and fill the result 
> > > > > with
> > > > > -  * zeroes.
> > > > > +  * zeroes. On architectures with software "accessed" 
> > > > > bits,
> > > > > +  * we would take a double page fault here, so mark it
> > > > > +  * accessed here.
> > > > >*/
> > > > > + if (arch_faults_on_old_pte() && 
> > > > > !pte_young(vmf->orig_pte)) {
> > > > > + spin_lock(vmf->ptl);
> > > > > + if (likely(pte_same(*vmf->pte, vmf->orig_pte))) 
> > > > > {
> > > > > + entry = pte_mkyoung(vmf->orig_pte);
> > > > > + if (ptep_set_access_flags(vma, addr,
> > > > > +   vmf->pte, 
> > > > > entry, 0))
> > > > > + update_mmu_cache(vma, addr, 
> > > > > vmf->pte);
> > > > > + }
> > > > 
> > > > I don't follow.
> > > > 
> > > > So if pte has changed under you, you don't set the accessed bit, but 
> > > > never
> > > > the less copy from the user.
> > > > 
> > > > What makes you think it will not trigger the same problem?
> > > > 
> > > > I think we need to make cow_user_page() fail in this case and caller --
> > > > wp_page_copy() -- return zero. If the fault was solved by other thread, 
> > > > we
> > > > are fine. If not userspace would re-fault on the same address and we 
> > > > will
> > > > handle the fault from the second attempt.
> > > 
> > > It would be nice to clarify the semantics of this function and do as
> > > you suggest but the current comment is slightly confusing:
> > > 
> > >   /*
> > >* If the source page was a PFN mapping, we don't have
> > >* a "struct page" for it. We do a best-effort copy by
> > >* just copying from the original user address. If that
> > >* fails, we just zero-fill it. Live with it.
> > >*/
> > > 
> > > Would any user-space rely on getting a zero-filled page here instead of
> > > a recursive fault?
> > 
> > I don't see the point in zero-filled page in this case. SIGBUS sounds like
> > more appropriate response, no?
> 
> I think misunderstood your comment. So, if !pte_same(), we should let
> userspace re-fault. This wouldn't be a user ABI change and it is
> bounded, can't end up in an infinite re-fault loop.

Right. !pte_same() can only happen if we raced with somebody else.
I cannot imagine a situation where the race would happen more than a few
times in a row.

> In case of a __copy_from_user_inatomic() error, SIGBUS would make more
> sense but it changes the current behaviour (zero-filling the page). This
> can be left for a separate patch, doesn't affect the arm64 case here.

I think it's safer to leave it as is. Maybe put WARN_ON_ONCE() or
something. There can be some obscure use-case that I don't see.

-- 
 Kirill A. Shutemov


Re: [PATCH v4 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-19 Thread Kirill A. Shutemov
On Wed, Sep 18, 2019 at 07:00:30PM +0100, Catalin Marinas wrote:
> On Wed, Sep 18, 2019 at 05:00:27PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Sep 18, 2019 at 09:19:14PM +0800, Jia He wrote:
> > > @@ -2152,20 +2163,34 @@ static inline void cow_user_page(struct page 
> > > *dst, struct page *src, unsigned lo
> > >*/
> > >   if (unlikely(!src)) {
> > >   void *kaddr = kmap_atomic(dst);
> > > - void __user *uaddr = (void __user *)(va & PAGE_MASK);
> > > + void __user *uaddr = (void __user *)(addr & PAGE_MASK);
> > > + pte_t entry;
> > >  
> > >   /*
> > >* This really shouldn't fail, because the page is there
> > >* in the page tables. But it might just be unreadable,
> > >* in which case we just give up and fill the result with
> > > -  * zeroes.
> > > +  * zeroes. On architectures with software "accessed" bits,
> > > +  * we would take a double page fault here, so mark it
> > > +  * accessed here.
> > >*/
> > > + if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
> > > + spin_lock(vmf->ptl);
> > > + if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
> > > + entry = pte_mkyoung(vmf->orig_pte);
> > > + if (ptep_set_access_flags(vma, addr,
> > > +   vmf->pte, entry, 0))
> > > + update_mmu_cache(vma, addr, vmf->pte);
> > > + }
> > 
> > I don't follow.
> > 
> > So if pte has changed under you, you don't set the accessed bit, but never
> > the less copy from the user.
> > 
> > What makes you think it will not trigger the same problem?
> > 
> > I think we need to make cow_user_page() fail in this case and caller --
> > wp_page_copy() -- return zero. If the fault was solved by other thread, we
> > are fine. If not userspace would re-fault on the same address and we will
> > handle the fault from the second attempt.
> 
> It would be nice to clarify the semantics of this function and do as
> you suggest but the current comment is slightly confusing:
> 
>   /*
>* If the source page was a PFN mapping, we don't have
>* a "struct page" for it. We do a best-effort copy by
>* just copying from the original user address. If that
>* fails, we just zero-fill it. Live with it.
>*/
> 
> Would any user-space rely on getting a zero-filled page here instead of
> a recursive fault?

I don't see the point in zero-filled page in this case. SIGBUS sounds like
more appropriate response, no?

-- 
 Kirill A. Shutemov


Re: [PATCH v4 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-19 Thread Kirill A. Shutemov
On Thu, Sep 19, 2019 at 10:16:34AM +0800, Jia He wrote:
> Hi Kirill
> 
> [On behalf of justin...@arm.com because some mails are filted...]
> 
> On 2019/9/18 22:00, Kirill A. Shutemov wrote:
> > On Wed, Sep 18, 2019 at 09:19:14PM +0800, Jia He wrote:
> > > When we tested pmdk unit test [1] vmmalloc_fork TEST1 in arm64 guest, 
> > > there
> > > will be a double page fault in __copy_from_user_inatomic of cow_user_page.
> > > 
> > > Below call trace is from arm64 do_page_fault for debugging purpose
> > > [  110.016195] Call trace:
> > > [  110.016826]  do_page_fault+0x5a4/0x690
> > > [  110.017812]  do_mem_abort+0x50/0xb0
> > > [  110.018726]  el1_da+0x20/0xc4
> > > [  110.019492]  __arch_copy_from_user+0x180/0x280
> > > [  110.020646]  do_wp_page+0xb0/0x860
> > > [  110.021517]  __handle_mm_fault+0x994/0x1338
> > > [  110.022606]  handle_mm_fault+0xe8/0x180
> > > [  110.023584]  do_page_fault+0x240/0x690
> > > [  110.024535]  do_mem_abort+0x50/0xb0
> > > [  110.025423]  el0_da+0x20/0x24
> > > 
> > > The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
> > > [9b007000] pgd=00023d4f8003, pud=00023da9b003, 
> > > pmd=00023d4b3003, pte=36298607bd3
> > > 
> > > As told by Catalin: "On arm64 without hardware Access Flag, copying from
> > > user will fail because the pte is old and cannot be marked young. So we
> > > always end up with zeroed page after fork() + CoW for pfn mappings. we
> > > don't always have a hardware-managed access flag on arm64."
> > > 
> > > This patch fix it by calling pte_mkyoung. Also, the parameter is
> > > changed because vmf should be passed to cow_user_page()
> > > 
> > > [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork
> > > 
> > > Reported-by: Yibo Cai 
> > > Signed-off-by: Jia He 
> > > ---
> > >   mm/memory.c | 35 ++-
> > >   1 file changed, 30 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index e2bb51b6242e..d2c130a5883b 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
> > >   2;
> > >   #endif
> > > +#ifndef arch_faults_on_old_pte
> > > +static inline bool arch_faults_on_old_pte(void)
> > > +{
> > > + return false;
> > > +}
> > > +#endif
> > > +
> > >   static int __init disable_randmaps(char *s)
> > >   {
> > >   randomize_va_space = 0;
> > > @@ -2140,8 +2147,12 @@ static inline int pte_unmap_same(struct mm_struct 
> > > *mm, pmd_t *pmd,
> > >   return same;
> > >   }
> > > -static inline void cow_user_page(struct page *dst, struct page *src, 
> > > unsigned long va, struct vm_area_struct *vma)
> > > +static inline void cow_user_page(struct page *dst, struct page *src,
> > > +  struct vm_fault *vmf)
> > >   {
> > > + struct vm_area_struct *vma = vmf->vma;
> > > + unsigned long addr = vmf->address;
> > > +
> > >   debug_dma_assert_idle(src);
> > >   /*
> > > @@ -2152,20 +2163,34 @@ static inline void cow_user_page(struct page 
> > > *dst, struct page *src, unsigned lo
> > >*/
> > >   if (unlikely(!src)) {
> > >   void *kaddr = kmap_atomic(dst);
> > > - void __user *uaddr = (void __user *)(va & PAGE_MASK);
> > > + void __user *uaddr = (void __user *)(addr & PAGE_MASK);
> > > + pte_t entry;
> > >   /*
> > >* This really shouldn't fail, because the page is there
> > >* in the page tables. But it might just be unreadable,
> > >* in which case we just give up and fill the result 
> > > with
> > > -  * zeroes.
> > > +  * zeroes. On architectures with software "accessed" bits,
> > > +  * we would take a double page fault here, so mark it
> > > +  * accessed here.
> > >*/
> > > + if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
> > > + spin_lock(vmf->ptl);
> > > + if (likely(pte_same(*vmf->pte, 

Re: [PATCH v4 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-18 Thread Kirill A. Shutemov
>   flush_dcache_page(dst);
>   } else
> - copy_user_highpage(dst, src, va, vma);
> + copy_user_highpage(dst, src, addr, vma);
>  }
>  
>  static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
> @@ -2318,7 +2343,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>   vmf->address);
>   if (!new_page)
>   goto oom;
> - cow_user_page(new_page, old_page, vmf->address, vma);
> + cow_user_page(new_page, old_page, vmf);
>   }
>  
>   if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg, 
> false))
> -- 
> 2.17.1
> 
> 

-- 
 Kirill A. Shutemov


Re: KASAN: use-after-free Read in shmem_fault (2)

2019-09-17 Thread Kirill A. Shutemov
On Mon, Sep 09, 2019 at 06:04:12PM +0300, Kirill A. Shutemov wrote:
> On Mon, Sep 09, 2019 at 06:55:21AM -0700, Matthew Wilcox wrote:
> > On Mon, Sep 02, 2019 at 05:20:30PM +0300, Kirill A. Shutemov wrote:
> > > On Mon, Sep 02, 2019 at 06:52:54AM -0700, Matthew Wilcox wrote:
> > > > On Sat, Aug 31, 2019 at 12:58:26PM +0800, Hillf Danton wrote:
> > > > > On Fri, 30 Aug 2019 12:40:06 -0700
> > > > > > syzbot found the following crash on:
> > > > > > 
> > > > > > HEAD commit:a55aa89a Linux 5.3-rc6
> > > > > > git tree:   upstream
> > > > > > console output: 
> > > > > > https://syzkaller.appspot.com/x/log.txt?x=12f4beb660
> > > > > > kernel config:  
> > > > > > https://syzkaller.appspot.com/x/.config?x=2a6a2b9826fdadf9
> > > > > > dashboard link: 
> > > > > > https://syzkaller.appspot.com/bug?extid=03ee87124ee05af991bd
> > > > > > compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> > > > > > 
> > > > > > ==
> > > > > > BUG: KASAN: use-after-free in perf_trace_lock_acquire+0x401/0x530  
> > > > > > include/trace/events/lock.h:13
> > > > > > Read of size 8 at addr 8880a5cf2c50 by task syz-executor.0/26173
> > > > > 
> > > > > --- a/mm/shmem.c
> > > > > +++ b/mm/shmem.c
> > > > > @@ -2021,6 +2021,12 @@ static vm_fault_t shmem_fault(struct vm_
> > > > >   shmem_falloc_waitq = shmem_falloc->waitq;
> > > > >   prepare_to_wait(shmem_falloc_waitq, 
> > > > > &shmem_fault_wait,
> > > > >   TASK_UNINTERRUPTIBLE);
> > > > > + /*
> > > > > +  * it is not trivial to see what will take 
> > > > > place after
> > > > > +  * releasing i_lock and taking a nap, so hold 
> > > > > inode to
> > > > > +  * be on the safe side.
> > > > 
> > > > I think the comment could be improved.  How about:
> > > > 
> > > >  * The file could be unmapped by another thread 
> > > > after
> > > >  * releasing i_lock, and the inode then freed.  
> > > > Hold
> > > >  * a reference to the inode to prevent this.
> > > 
> > > It only can happen if mmap_sem was released, so it's better to put
> > > __iget() to the branch above next to up_read(). I've got confused at first
> > > how it is possible from ->fault().
> > > 
> > > This way iput() below should only be called for ret == VM_FAULT_RETRY.
> > 
> > Looking at the rather similar construct in filemap.c, should we solve
> > it the same way, where we inc the refcount on the struct file instead
> > of the inode before releasing the mmap_sem?
> 
> Are you talking about maybe_unlock_mmap_for_io()? Yeah, worth moving it to
> mm/internal.h and reuse.
> 
> Care to prepare the patch? :P

Something like this? Untested.

diff --git a/mm/filemap.c b/mm/filemap.c
index d0cf700bf201..a542f72f57cc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2349,26 +2349,6 @@ EXPORT_SYMBOL(generic_file_read_iter);
 
 #ifdef CONFIG_MMU
 #define MMAP_LOTSAMISS  (100)
-static struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
-struct file *fpin)
-{
-   int flags = vmf->flags;
-
-   if (fpin)
-   return fpin;
-
-   /*
-* FAULT_FLAG_RETRY_NOWAIT means we don't want to wait on page locks or
-* anything, so we only pin the file and drop the mmap_sem if only
-* FAULT_FLAG_ALLOW_RETRY is set.
-*/
-   if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) ==
-   FAULT_FLAG_ALLOW_RETRY) {
-   fpin = get_file(vmf->vma->vm_file);
-   up_read(&vmf->vma->vm_mm->mmap_sem);
-   }
-   return fpin;
-}
 
 /*
  * lock_page_maybe_drop_mmap - lock the page, possibly dropping the mmap_sem
diff --git a/mm/internal.h b/mm/internal.h
index e32390802fd3..75ffa646de82 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -362,6 +362,27 @@ vma_address(struct page *page, struct vm_area_struct *vma)
return max(start, vma->vm_start);
 }
 
+static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
+   

Re: [PATCH v3] mm: memory: fix /proc/meminfo reporting for MLOCK_ONFAULT

2019-09-17 Thread Kirill A. Shutemov
On Mon, Sep 16, 2019 at 06:26:19PM +0300, Kirill A. Shutemov wrote:
> > ---
> >  mm/memory.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index e0c232fe81d9..55da24f33bc4 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3311,6 +3311,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct 
> > mem_cgroup *memcg,
> > } else {
> > inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
> > page_add_file_rmap(page, false);
> > +   if (vma->vm_flags & VM_LOCKED && !PageTransCompound(page))
> > +   mlock_vma_page(page);
> 
> Why do you only do this for file pages?

Because file pages are locked already, right?

-- 
 Kirill A. Shutemov


Re: [PATCH v3] mm: memory: fix /proc/meminfo reporting for MLOCK_ONFAULT

2019-09-17 Thread Kirill A. Shutemov
On Tue, Sep 17, 2019 at 12:15:19PM +0200, Michal Hocko wrote:
> On Mon 16-09-19 18:26:19, Kirill A. Shutemov wrote:
> > On Fri, Sep 13, 2019 at 02:11:19PM -0700, Lucian Adrian Grijincu wrote:
> > > As pages are faulted in MLOCK_ONFAULT correctly updates
> > > /proc/self/smaps, but doesn't update /proc/meminfo's Mlocked field.
> > 
> > I don't think there's something wrong with this behaviour. It is okay to
> > keep the page an evictable LRU list (and not account it to NR_MLOCKED).
> 
> evictable list is an implementation detail. Having an overview about an

s/evictable/unevictable/

> amount of mlocked pages can be important. Lazy accounting makes this
> more fuzzy and harder for admins to monitor.
> 
> Sure it is not a bug to panic about but it certainly makes life of poor
> admins harder.

Good luck with making mlock accounting exact :P

For a start, try to handle trylock_page() failure under the ptl sanely while
dealing with FOLL_MLOCK.

> If there is a pathological THP behavior possible then we should look
> into that as well.

There's nothing pathological about THP behaviour. See "MLOCKING
Transparent Huge Pages" section in Documentation/vm/unevictable-lru.rst.

-- 
 Kirill A. Shutemov


Re: [PATCH v3] mm: memory: fix /proc/meminfo reporting for MLOCK_ONFAULT

2019-09-16 Thread Kirill A. Shutemov
> printf("= after reading the entire file =\n");
> meminfo();
> smaps();
> 
>   } else {
> PCHECK(mlock(addr, size));
> printf("= after mlock =\n");
> meminfo();
> smaps();
>   }
> 
>   PCHECK(munmap(addr, size));
>   printf("= after munmap =\n");
>   meminfo();
>   smaps();
> 
>   return ret;
> }
> 
> ---
> 
> Signed-off-by: Lucian Adrian Grijincu 
> Acked-by: Souptick Joarder 
> ---
>  mm/memory.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index e0c232fe81d9..55da24f33bc4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3311,6 +3311,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct 
> mem_cgroup *memcg,
>   } else {
>   inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
>   page_add_file_rmap(page, false);
> + if (vma->vm_flags & VM_LOCKED && !PageTransCompound(page))
> + mlock_vma_page(page);

Why do you only do this for file pages?

>   }
>   set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>  
> -- 
> 2.17.1
> 
> 

-- 
 Kirill A. Shutemov


[tip: x86/mm] x86/mm: Enable 5-level paging support by default

2019-09-16 Thread tip-bot2 for Kirill A. Shutemov
The following commit has been merged into the x86/mm branch of tip:

Commit-ID: 18ec1eaf58fbf2d9009a752a102a3d8e0d905a0f
Gitweb:
https://git.kernel.org/tip/18ec1eaf58fbf2d9009a752a102a3d8e0d905a0f
Author:Kirill A. Shutemov 
AuthorDate:Fri, 13 Sep 2019 12:54:52 +03:00
Committer: Ingo Molnar 
CommitterDate: Mon, 16 Sep 2019 16:51:20 +02:00

x86/mm: Enable 5-level paging support by default

Support of boot-time switching between 4- and 5-level paging mode is
upstream since 4.17.

We have run internal testing with 5-level paging support enabled for a
while and it does not cause any functional or performance regression on
4-level paging hardware.

The only 5-level paging related regressions I saw were in early boot
code that runs independently from CONFIG_X86_5LEVEL.

The next major releases of distributions are expected to have
CONFIG_X86_5LEVEL=y.

Enable the option by default. It may help to catch obscure bugs early.

Signed-off-by: Kirill A. Shutemov 
Acked-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: 
https://lkml.kernel.org/r/20190913095452.40592-1-kirill.shute...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 58eae28..d4bbebe 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1483,6 +1483,7 @@ config X86_PAE
 
 config X86_5LEVEL
bool "Enable 5-level page tables support"
+   default y
select DYNAMIC_MEMORY_LAYOUT
select SPARSEMEM_VMEMMAP
depends on X86_64


Re: [PATCH v3 2/2] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-16 Thread Kirill A. Shutemov
On Mon, Sep 16, 2019 at 09:35:21AM +, Justin He (Arm Technology China) 
wrote:
> 
> Hi Kirill
> > -Original Message-
> > From: Kirill A. Shutemov 
> > Sent: 16 September 2019 17:16
> > To: Justin He (Arm Technology China) 
> > Cc: Catalin Marinas ; Will Deacon
> > ; Mark Rutland ; James Morse
> > ; Marc Zyngier ; Matthew
> > Wilcox ; Kirill A. Shutemov
> > ; linux-arm-ker...@lists.infradead.org;
> > linux-kernel@vger.kernel.org; linux...@kvack.org; Punit Agrawal
> > ; Anshuman Khandual
> > ; Jun Yao ;
> > Alex Van Brunt ; Robin Murphy
> > ; Thomas Gleixner ;
> > Andrew Morton ; Jérôme Glisse
> > ; Ralph Campbell ;
> > hejia...@gmail.com
> > Subject: Re: [PATCH v3 2/2] mm: fix double page fault on arm64 if PTE_AF
> > is cleared
> >
> > On Sat, Sep 14, 2019 at 12:32:39AM +0800, Jia He wrote:
> > > When we tested pmdk unit test [1] vmmalloc_fork TEST1 in arm64 guest,
> > there
> > > will be a double page fault in __copy_from_user_inatomic of
> > cow_user_page.
> > >
> > > Below call trace is from arm64 do_page_fault for debugging purpose
> > > [  110.016195] Call trace:
> > > [  110.016826]  do_page_fault+0x5a4/0x690
> > > [  110.017812]  do_mem_abort+0x50/0xb0
> > > [  110.018726]  el1_da+0x20/0xc4
> > > [  110.019492]  __arch_copy_from_user+0x180/0x280
> > > [  110.020646]  do_wp_page+0xb0/0x860
> > > [  110.021517]  __handle_mm_fault+0x994/0x1338
> > > [  110.022606]  handle_mm_fault+0xe8/0x180
> > > [  110.023584]  do_page_fault+0x240/0x690
> > > [  110.024535]  do_mem_abort+0x50/0xb0
> > > [  110.025423]  el0_da+0x20/0x24
> > >
> > > The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
> > > [9b007000] pgd=00023d4f8003, pud=00023da9b003,
> > pmd=00023d4b3003, pte=36298607bd3
> > >
> > > As told by Catalin: "On arm64 without hardware Access Flag, copying
> > from
> > > user will fail because the pte is old and cannot be marked young. So we
> > > always end up with zeroed page after fork() + CoW for pfn mappings. we
> > > don't always have a hardware-managed access flag on arm64."
> > >
> > > This patch fix it by calling pte_mkyoung. Also, the parameter is
> > > changed because vmf should be passed to cow_user_page()
> > >
> > > [1]
> > https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork
> > >
> > > Reported-by: Yibo Cai 
> > > Signed-off-by: Jia He 
> > > ---
> > >  mm/memory.c | 30 +-
> > >  1 file changed, 25 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index e2bb51b6242e..a64af6495f71 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
> > > 2;
> > >  #endif
> > >
> > > +#ifndef arch_faults_on_old_pte
> > > +static inline bool arch_faults_on_old_pte(void)
> > > +{
> > > +   return false;
> > > +}
> > > +#endif
> > > +
> > >  static int __init disable_randmaps(char *s)
> > >  {
> > > randomize_va_space = 0;
> > > @@ -2140,7 +2147,8 @@ static inline int pte_unmap_same(struct
> > mm_struct *mm, pmd_t *pmd,
> > > return same;
> > >  }
> > >
> > > -static inline void cow_user_page(struct page *dst, struct page *src,
> > unsigned long va, struct vm_area_struct *vma)
> > > +static inline void cow_user_page(struct page *dst, struct page *src,
> > > +   struct vm_fault *vmf)
> > >  {
> > > debug_dma_assert_idle(src);
> > >
> > > @@ -2152,20 +2160,32 @@ static inline void cow_user_page(struct page
> > *dst, struct page *src, unsigned lo
> > >  */
> > > if (unlikely(!src)) {
> > > void *kaddr = kmap_atomic(dst);
> > > -   void __user *uaddr = (void __user *)(va & PAGE_MASK);
> > > +   void __user *uaddr = (void __user *)(vmf->address &
> > PAGE_MASK);
> > > +   pte_t entry;
> > >
> > > /*
> > >  * This really shouldn't fail, because the page is there
> > >  * in the page tables. But it might just be unreadable,
> > >  * in which case we just give up and fill the result with
> > > -* zeroes.
> > > +* zeroes. If PTE_AF is cleared on arm64, it might
> > > +* cause double page fault. So makes pte young here
> > >  */
> > > +   if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte))
> > {
> > > +   spin_lock(vmf->ptl);
> > > +   entry = pte_mkyoung(vmf->orig_pte);
> >
> > Shouldn't you re-validate that orig_pte after re-taking ptl? It can be
> > stale by now.
> Thanks, do you mean flush_cache_page(vma, vmf->address, 
> pte_pfn(vmf->orig_pte))
> before pte_mkyoung?

No. You need to check pte_same(*vmf->pte, vmf->orig_pte) before modifying
anything and bail out if *vmf->pte has changed under you.
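
Roughly (untested, just to show the shape of the check):

	spin_lock(vmf->ptl);
	if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
		entry = pte_mkyoung(vmf->orig_pte);
		if (ptep_set_access_flags(vmf->vma, vmf->address,
					  vmf->pte, entry, 0))
			update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
	} else {
		/*
		 * The PTE changed under us: bail out and let the caller
		 * return so userspace re-faults on the same address.
		 */
	}
	spin_unlock(vmf->ptl);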

-- 
 Kirill A. Shutemov


Re: [PATCH v3 1/2] arm64: mm: implement arch_faults_on_old_pte() on arm64

2019-09-16 Thread Kirill A. Shutemov
On Sat, Sep 14, 2019 at 12:32:38AM +0800, Jia He wrote:
> On arm64 without hardware Access Flag, copying from user will fail because
> the pte is old and cannot be marked young. So we always end up with zeroed
> page after fork() + CoW for pfn mappings. we don't always have a
> hardware-managed access flag on arm64.
> 
> Hence implement arch_faults_on_old_pte on arm64 to indicate that it might
> cause page fault when accessing old pte.
> 
> Signed-off-by: Jia He 
> ---
>  arch/arm64/include/asm/pgtable.h | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index e09760ece844..b41399d758df 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -868,6 +868,18 @@ static inline void update_mmu_cache(struct 
> vm_area_struct *vma,
>  #define phys_to_ttbr(addr)   (addr)
>  #endif
>  
> +/*
> + * On arm64 without hardware Access Flag, copying fromuser will fail because
> + * the pte is old and cannot be marked young. So we always end up with zeroed
> + * page after fork() + CoW for pfn mappings. we don't always have a
> + * hardware-managed access flag on arm64.
> + */
> +static inline bool arch_faults_on_old_pte(void)
> +{
> + return true;

Shouldn't you check whether this particular machine supports the hardware
access bit?

> +}
> +#define arch_faults_on_old_pte arch_faults_on_old_pte
> +
>  #endif /* !__ASSEMBLY__ */
>  
>  #endif /* __ASM_PGTABLE_H */
> -- 
> 2.17.1
> 
> 

-- 
 Kirill A. Shutemov


Re: [PATCH v3 2/2] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-16 Thread Kirill A. Shutemov
On Sat, Sep 14, 2019 at 12:32:39AM +0800, Jia He wrote:
> When we tested pmdk unit test [1] vmmalloc_fork TEST1 in arm64 guest, there
> will be a double page fault in __copy_from_user_inatomic of cow_user_page.
> 
> Below call trace is from arm64 do_page_fault for debugging purpose
> [  110.016195] Call trace:
> [  110.016826]  do_page_fault+0x5a4/0x690
> [  110.017812]  do_mem_abort+0x50/0xb0
> [  110.018726]  el1_da+0x20/0xc4
> [  110.019492]  __arch_copy_from_user+0x180/0x280
> [  110.020646]  do_wp_page+0xb0/0x860
> [  110.021517]  __handle_mm_fault+0x994/0x1338
> [  110.022606]  handle_mm_fault+0xe8/0x180
> [  110.023584]  do_page_fault+0x240/0x690
> [  110.024535]  do_mem_abort+0x50/0xb0
> [  110.025423]  el0_da+0x20/0x24
> 
> The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
> [9b007000] pgd=00023d4f8003, pud=00023da9b003, 
> pmd=00023d4b3003, pte=36298607bd3
> 
> As told by Catalin: "On arm64 without hardware Access Flag, copying from
> user will fail because the pte is old and cannot be marked young. So we
> always end up with zeroed page after fork() + CoW for pfn mappings. we
> don't always have a hardware-managed access flag on arm64."
> 
> This patch fix it by calling pte_mkyoung. Also, the parameter is
> changed because vmf should be passed to cow_user_page()
> 
> [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork
> 
> Reported-by: Yibo Cai 
> Signed-off-by: Jia He 
> ---
>  mm/memory.c | 30 +-
>  1 file changed, 25 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index e2bb51b6242e..a64af6495f71 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
>   2;
>  #endif
>  
> +#ifndef arch_faults_on_old_pte
> +static inline bool arch_faults_on_old_pte(void)
> +{
> + return false;
> +}
> +#endif
> +
>  static int __init disable_randmaps(char *s)
>  {
>   randomize_va_space = 0;
> @@ -2140,7 +2147,8 @@ static inline int pte_unmap_same(struct mm_struct *mm, 
> pmd_t *pmd,
>   return same;
>  }
>  
> -static inline void cow_user_page(struct page *dst, struct page *src, 
> unsigned long va, struct vm_area_struct *vma)
> +static inline void cow_user_page(struct page *dst, struct page *src,
> + struct vm_fault *vmf)
>  {
>   debug_dma_assert_idle(src);
>  
> @@ -2152,20 +2160,32 @@ static inline void cow_user_page(struct page *dst, 
> struct page *src, unsigned lo
>*/
>   if (unlikely(!src)) {
>   void *kaddr = kmap_atomic(dst);
> - void __user *uaddr = (void __user *)(va & PAGE_MASK);
> + void __user *uaddr = (void __user *)(vmf->address & PAGE_MASK);
> + pte_t entry;
>  
>   /*
>* This really shouldn't fail, because the page is there
>* in the page tables. But it might just be unreadable,
>* in which case we just give up and fill the result with
> -  * zeroes.
> +  * zeroes. If PTE_AF is cleared on arm64, it might
> +  * cause double page fault. So makes pte young here
>*/
> + if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
> + spin_lock(vmf->ptl);
> + entry = pte_mkyoung(vmf->orig_pte);

Shouldn't you re-validate orig_pte after re-taking the ptl? It can be
stale by now.

> + if (ptep_set_access_flags(vmf->vma, vmf->address,
> +   vmf->pte, entry, 0))
> + update_mmu_cache(vmf->vma, vmf->address,
> +  vmf->pte);
> + spin_unlock(vmf->ptl);
> + }
> +
>   if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
>   clear_page(kaddr);
>   kunmap_atomic(kaddr);
>   flush_dcache_page(dst);
>   } else
> - copy_user_highpage(dst, src, va, vma);
> + copy_user_highpage(dst, src, vmf->address, vmf->vma);
>  }
>  
>  static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
> @@ -2318,7 +2338,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>   vmf->address);
>   if (!new_page)
>   goto oom;
> - cow_user_page(new_page, old_page, vmf->address, vma);
> + cow_user_page(new_page, old_page, vmf);
>   }
>  
>   if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg, 
> false))
> -- 
> 2.17.1
> 
> 

-- 
 Kirill A. Shutemov


Re: [PATCH] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.

2019-09-16 Thread Kirill A. Shutemov
On Fri, Sep 13, 2019 at 10:14:15AM -0500, Steve Wahl wrote:
> On Thu, Sep 12, 2019 at 01:19:17PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Sep 11, 2019 at 03:08:35PM -0500, Steve Wahl wrote:
> > > Thank you for your time looking into this with me!
> > 
> > With all this explanation the original patch looks sane to me.
> > 
> > But I would like to see more information from this thread in the commit
> > message and some comments in the code on why it's crucial not to map more
> > than needed.
> 
> I am working on this.
> 
> > I think we also need to make it clear that this is workaround for a broken
> > hardware: speculative execution must not trigger a halt.
> 
> I think the word broken is a bit loaded here.  According to the UEFI
> spec (version 2.8, page 167), "Regions that are backed by the physical
> hardware, but are not supposed to be accessed by the OS, must be
> returned as EfiReservedMemoryType."  Our interpretation is that
> includes speculative accesses.

+Dave.

I don't think it is. Speculative access is done by hardware, not the OS.

BTW, isn't it a BIOS issue?

I believe it should have a way to hide a range of physical address space
from the OS, or to force a caching mode on it that excludes it from
speculative execution. Like setting up MTRRs or something.

> I'd lean more toward "tightening of adherence to the spec required by
> some particular hardware."  Does that work for you?

Not really, no. I still believe it's an issue of the platform, not the OS.

-- 
 Kirill A. Shutemov


Re: [PATCH v3 1/2] mm: clean up validate_slab()

2019-09-16 Thread Kirill A. Shutemov
On Fri, Sep 13, 2019 at 06:07:42PM -0600, Yu Zhao wrote:
> The function doesn't need to return any value, and the check can be
> done in one pass.
> 
> There is a behavior change: before the patch, we stop at the first
> invalid free object; after the patch, we stop at the first invalid
> object, free or in use. This shouldn't matter because the original
> behavior isn't intended anyway.
> 
> Signed-off-by: Yu Zhao 

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


[PATCH] x86/mm: Enable 5-level paging support by default

2019-09-13 Thread Kirill A. Shutemov
Support of boot-time switching between 4- and 5-level paging mode is
upstream since 4.17.

We have run internal testing with 5-level paging support enabled for a
while and it does not cause any functional or performance regression on
4-level paging hardware.

The only 5-level paging related regressions I saw were in early boot
code that runs independently from CONFIG_X86_5LEVEL.

The next major releases of distributions are expected to have
CONFIG_X86_5LEVEL=y.

Enable the option by default. It may help to catch obscure bugs early.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 222855cc0158..2f7cb91d850e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1483,6 +1483,7 @@ config X86_PAE
 
 config X86_5LEVEL
bool "Enable 5-level page tables support"
+   default y
select DYNAMIC_MEMORY_LAYOUT
select SPARSEMEM_VMEMMAP
depends on X86_64
-- 
2.21.0



[PATCH] mm, thp: Do not queue fully unmapped pages for deferred split

2019-09-13 Thread Kirill A. Shutemov
Adding fully unmapped pages into deferred split queue is not productive:
these pages are about to be freed or they are pinned and cannot be split
anyway.

Signed-off-by: Kirill A. Shutemov 
---
 mm/rmap.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 003377e24232..45388f1bf317 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1271,12 +1271,20 @@ static void page_remove_anon_compound_rmap(struct page 
*page)
if (TestClearPageDoubleMap(page)) {
/*
 * Subpages can be mapped with PTEs too. Check how many of
-* themi are still mapped.
+* them are still mapped.
 */
for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) {
>   if (atomic_add_negative(-1, &page[i]._mapcount))
nr++;
}
+
+   /*
+* Queue the page for deferred split if at least one small
+* page of the compound page is unmapped, but at least one
+* small page is still mapped.
+*/
+   if (nr && nr < HPAGE_PMD_NR)
+   deferred_split_huge_page(page);
} else {
nr = HPAGE_PMD_NR;
}
@@ -1284,10 +1292,8 @@ static void page_remove_anon_compound_rmap(struct page 
*page)
if (unlikely(PageMlocked(page)))
clear_page_mlock(page);
 
-   if (nr) {
+   if (nr)
__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr);
-   deferred_split_huge_page(page);
-   }
 }
 
 /**
-- 
2.21.0



Re: [PATCH V2 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-13 Thread Kirill A. Shutemov
On Fri, Sep 13, 2019 at 02:32:04PM +0530, Anshuman Khandual wrote:
> 
> On 09/12/2019 10:44 PM, Christophe Leroy wrote:
> > 
> > 
> > Le 12/09/2019 à 08:02, Anshuman Khandual a écrit :
> >> This adds a test module which will validate architecture page table helpers
> >> and accessors regarding compliance with generic MM semantics expectations.
> >> This will help various architectures in validating changes to the existing
> >> page table helpers or addition of new ones.
> >>
> >> Test page table and memory pages creating it's entries at various level are
> >> all allocated from system memory with required alignments. If memory pages
> >> with required size and alignment could not be allocated, then all depending
> >> individual tests are skipped.
> >>
> > 
> > [...]
> > 
> >>
> >> Suggested-by: Catalin Marinas 
> >> Signed-off-by: Anshuman Khandual 
> >> ---
> >>   arch/x86/include/asm/pgtable_64_types.h |   2 +
> >>   mm/Kconfig.debug    |  14 +
> >>   mm/Makefile |   1 +
> >>   mm/arch_pgtable_test.c  | 429 
> >>   4 files changed, 446 insertions(+)
> >>   create mode 100644 mm/arch_pgtable_test.c
> >>
> >> diff --git a/arch/x86/include/asm/pgtable_64_types.h 
> >> b/arch/x86/include/asm/pgtable_64_types.h
> >> index 52e5f5f2240d..b882792a3999 100644
> >> --- a/arch/x86/include/asm/pgtable_64_types.h
> >> +++ b/arch/x86/include/asm/pgtable_64_types.h
> >> @@ -40,6 +40,8 @@ static inline bool pgtable_l5_enabled(void)
> >>   #define pgtable_l5_enabled() 0
> >>   #endif /* CONFIG_X86_5LEVEL */
> >>   +#define mm_p4d_folded(mm) (!pgtable_l5_enabled())
> >> +
> > 
> > This is specific to x86, should go in a separate patch.
> 
> Thought about it but its just a single line. Kirill suggested this in the
> previous version. There is a generic fallback definition but s390 has it's
> own. This change overrides the generic one for x86 probably as a fix or as
> an improvement. Kirill should be able to help classify it in which case it
> can be a separate patch.

I don't think it's worth a separate patch.

-- 
 Kirill A. Shutemov


Re: [PATCH] mm/pgtable/debug: Fix test validating architecture page table helpers

2019-09-13 Thread Kirill A. Shutemov
On Fri, Sep 13, 2019 at 02:12:45PM +0530, Anshuman Khandual wrote:
> 
> 
> On 09/13/2019 12:41 PM, Christophe Leroy wrote:
> > 
> > 
> > Le 13/09/2019 à 09:03, Christophe Leroy a écrit :
> >>
> >>
> >> Le 13/09/2019 à 08:58, Anshuman Khandual a écrit :
> >>> On 09/13/2019 11:53 AM, Christophe Leroy wrote:
> >>>> Fix build failure on powerpc.
> >>>>
> >>>> Fix preemption imbalance.
> >>>>
> >>>> Signed-off-by: Christophe Leroy 
> >>>> ---
> >>>>   mm/arch_pgtable_test.c | 3 +++
> >>>>   1 file changed, 3 insertions(+)
> >>>>
> >>>> diff --git a/mm/arch_pgtable_test.c b/mm/arch_pgtable_test.c
> >>>> index 8b4a92756ad8..f2b3c9ec35fa 100644
> >>>> --- a/mm/arch_pgtable_test.c
> >>>> +++ b/mm/arch_pgtable_test.c
> >>>> @@ -24,6 +24,7 @@
> >>>>   #include 
> >>>>   #include 
> >>>>   #include 
> >>>> +#include 
> >>>
> >>> This is okay.
> >>>
> >>>>   #include 
> >>>>   #include 
> >>>> @@ -400,6 +401,8 @@ static int __init arch_pgtable_tests_init(void)
> >>>>   p4d_clear_tests(p4dp);
> >>>>   pgd_clear_tests(mm, pgdp);
> >>>> +    pte_unmap(ptep);
> >>>> +
> >>>
> >>> Now the preemption imbalance via pte_alloc_map() path i.e
> >>>
> >>> pte_alloc_map() -> pte_offset_map() -> kmap_atomic()
> >>>
> >>> Is not this very much powerpc 32 specific or this will be applicable
> >>> for all platform which uses kmap_XXX() to map high memory ?
> >>>
> >>
> >> See 
> >> https://elixir.bootlin.com/linux/v5.3-rc8/source/include/linux/highmem.h#L91
> >>
> >> I think it applies at least to all arches using the generic implementation.
> >>
> >> Applies also to arm:
> >> https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/arm/mm/highmem.c#L52
> >>
> >> Applies also to mips:
> >> https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/mips/mm/highmem.c#L47
> >>
> >> Same on sparc:
> >> https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/sparc/mm/highmem.c#L52
> >>
> >> Same on x86:
> >> https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/x86/mm/highmem_32.c#L34
> >>
> >> I have not checked others, but I guess it is like that for all.
> >>
> > 
> > 
> > Seems like I answered too quickly. All kmap_atomic() do preempt_disable(), 
> > but not all pte_alloc_map() call kmap_atomic().
> > 
> > However, for instance ARM does:
> > 
> > https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/arm/include/asm/pgtable.h#L200
> > 
> > And X86 as well:
> > 
> > https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/x86/include/asm/pgtable_32.h#L51
> > 
> > Microblaze also:
> > 
> > https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/microblaze/include/asm/pgtable.h#L495
> 
> All the above platforms checks out to be using k[un]map_atomic(). I am 
> wondering whether
> any of the intermediate levels will have similar problems on any these 32 bit 
> platforms
> or any other platforms which might be using generic k[un]map_atomic().

No. The kernel only allocates pte page tables from highmem. All other page
tables are always visible in the kernel address space.
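
So only the pte level needs the map/unmap pairing in the test, roughly
(untested; the variable names just follow the test's style):

	ptep = pte_alloc_map(mm, pmdp, vaddr);	/* may kmap_atomic() -> preempt_disable() */
	/* ... pte level tests ... */
	pte_unmap(ptep);			/* kunmap_atomic() -> preempt_enable() */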

-- 
 Kirill A. Shutemov


Re: [PATCH v2 1/4] mm: correct mask size for slub page->objects

2019-09-12 Thread Kirill A. Shutemov
On Thu, Sep 12, 2019 at 03:11:14PM -0600, Yu Zhao wrote:
> On Thu, Sep 12, 2019 at 12:40:35PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Sep 11, 2019 at 08:31:08PM -0600, Yu Zhao wrote:
> > > Mask of slub objects per page shouldn't be larger than what
> > > page->objects can hold.
> > > 
> > > It requires more than 2^15 objects to hit the problem, and I don't
> > > think anybody would. It'd be nice to have the mask fixed, but not
> > > really worth cc'ing the stable.
> > > 
> > > Fixes: 50d5c41cd151 ("slub: Do not use frozen page flag but a bit in the 
> > > page counters")
> > > Signed-off-by: Yu Zhao 
> > 
> > I don't think the patch fixes anything.
> 
> Technically it does. It makes no sense for a mask to have more bits
> than the variable that holds the masked value. I had to look up the
> commit history to find out why and go through the code to make sure
> it doesn't actually cause any problem.
> 
> My hope is that nobody else would have to go through the same trouble.

Just put some comments then.
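
Something along these lines next to the existing definitions would be
enough (just a sketch of the comment; the defines are the current ones):

	/*
	 * page->objects is a 15-bit field, so the bit just below OO_SHIFT is
	 * never set. Keeping the mask at 16 bits is harmless and accessing
	 * the higher 16 bits can be cheaper than shifting by 15.
	 */
	#define OO_SHIFT	16
	#define OO_MASK		((1 << OO_SHIFT) - 1)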

-- 
 Kirill A. Shutemov


Re: [PATCH V2 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-12 Thread Kirill A. Shutemov
On Thu, Sep 12, 2019 at 11:32:53AM +0530, Anshuman Khandual wrote:
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Anshuman Khandual ");
> +MODULE_DESCRIPTION("Test architecture page table helpers");

It's not a module. Why?

BTW, I think we should make all the code here __init (or its variants) so it
can be discarded after boot. It has no use after that.
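
E.g. (sketch; late_initcall() is only an example of a non-module hook):

	static int __init arch_pgtable_tests_init(void)
	{
		/* ... existing test body ... */
		return 0;
	}
	late_initcall(arch_pgtable_tests_init);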

-- 
 Kirill A. Shutemov


Re: [PATCH v9 0/8] stg mail -e --version=v9 \

2019-09-12 Thread Kirill A. Shutemov
On Thu, Sep 12, 2019 at 11:19:25AM +0200, Michal Hocko wrote:
> On Wed 11-09-19 08:12:03, Alexander Duyck wrote:
> > On Wed, Sep 11, 2019 at 4:36 AM Michal Hocko  wrote:
> > >
> > > On Tue 10-09-19 14:23:40, Alexander Duyck wrote:
> > > [...]
> > > > We don't put any limitations on the allocator other then that it needs 
> > > > to
> > > > clean up the metadata on allocation, and that it cannot allocate a page
> > > > that is in the process of being reported since we pulled it from the
> > > > free_list. If the page is a "Reported" page then it decrements the
> > > > reported_pages count for the free_area and makes sure the page doesn't
> > > > exist in the "Boundary" array pointer value, if it does it moves the
> > > > "Boundary" since it is pulling the page.
> > >
> > > This is still a non-trivial limitation on the page allocation from an
> > > external code IMHO. I cannot give any explicit reason why an ordering on
> > > the free list might matter (well except for page shuffling which uses it
> > > to make physical memory pattern allocation more random) but the
> > > architecture seems hacky and dubious to be honest. It shoulds like the
> > > whole interface has been developed around a very particular and single
> > > purpose optimization.
> > 
> > How is this any different then the code that moves a page that will
> > likely be merged to the tail though?
> 
> I guess you are referring to the page shuffling. If that is the case
> then this is an integral part of the allocator for a reason and it is
> very well obvious in the code including the consequences. I do not
> really like an idea of hiding similar constrains behind a generic
> looking feature which is completely detached from the allocator and so
> any future change of the allocator might subtly break it.

I don't necessarily follow why shuffling is more integral to the page
allocator than reporting would be. It sits in shuffle.c under mm/ and is
integrated in a similar way.

-- 
 Kirill A. Shutemov


Re: [PATCH] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.

2019-09-12 Thread Kirill A. Shutemov
On Wed, Sep 11, 2019 at 03:08:35PM -0500, Steve Wahl wrote:
> Thank you for your time looking into this with me!

With all this explanation the original patch looks sane to me.

But I would like to see more information from this thread in the commit
message and some comments in the code on why it's crucial not to map more
than needed.

I think we also need to make it clear that this is a workaround for broken
hardware: speculative execution must not trigger a halt.

-- 
 Kirill A. Shutemov


Re: [PATCH v2 4/4] mm: lock slub page when listing objects

2019-09-12 Thread Kirill A. Shutemov
On Wed, Sep 11, 2019 at 08:31:11PM -0600, Yu Zhao wrote:
> Though I have no idea what the side effect of such race would be,
> apparently we want to prevent the free list from being changed
> while debugging the objects.

Codewise it looks good to me. But the commit message can be better.

Can we rephrase it to indicate that slab_lock is required to protect
page->objects?

-- 
 Kirill A. Shutemov


Re: [PATCH v2 3/4] mm: avoid slub allocation while holding list_lock

2019-09-12 Thread Kirill A. Shutemov
On Wed, Sep 11, 2019 at 08:31:10PM -0600, Yu Zhao wrote:
> If we are already under list_lock, don't call kmalloc(). Otherwise we
> will run into deadlock because kmalloc() also tries to grab the same
> lock.
> 
> Fixing the problem by using a static bitmap instead.
> 
>   WARNING: possible recursive locking detected
>   
>   mount-encrypted/4921 is trying to acquire lock:
>   (&(&n->list_lock)->rlock){-.-.}, at: ___slab_alloc+0x104/0x437
> 
>   but task is already holding lock:
>   (&(&n->list_lock)->rlock){-.-.}, at: __kmem_cache_shutdown+0x81/0x3cb
> 
>   other info that might help us debug this:
>Possible unsafe locking scenario:
> 
>  CPU0
>  
> lock(&(&n->list_lock)->rlock);
> lock(&(&n->list_lock)->rlock);
> 
>*** DEADLOCK ***
> 
> Signed-off-by: Yu Zhao 

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/4] mm: clean up validate_slab()

2019-09-12 Thread Kirill A. Shutemov
On Wed, Sep 11, 2019 at 08:31:09PM -0600, Yu Zhao wrote:
> The function doesn't need to return any value, and the check can be
> done in one pass.
> 
> There is a behavior change: before the patch, we stop at the first
> invalid free object; after the patch, we stop at the first invalid
> object, free or in use. This shouldn't matter because the original
> behavior isn't intended anyway.
> 
> Signed-off-by: Yu Zhao 
> ---
>  mm/slub.c | 21 -
>  1 file changed, 8 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 62053ceb4464..7b7e1ee264ef 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4386,31 +4386,26 @@ static int count_total(struct page *page)
>  #endif
>  
>  #ifdef CONFIG_SLUB_DEBUG
> -static int validate_slab(struct kmem_cache *s, struct page *page,
> +static void validate_slab(struct kmem_cache *s, struct page *page,
>   unsigned long *map)
>  {
>   void *p;
>   void *addr = page_address(page);
>  
> - if (!check_slab(s, page) ||
> - !on_freelist(s, page, NULL))
> - return 0;
> + if (!check_slab(s, page) || !on_freelist(s, page, NULL))
> + return;
>  
>   /* Now we know that a valid freelist exists */
>   bitmap_zero(map, page->objects);
>  
>   get_map(s, page, map);
>   for_each_object(p, s, addr, page->objects) {
> - if (test_bit(slab_index(p, s, addr), map))
> - if (!check_object(s, page, p, SLUB_RED_INACTIVE))
> - return 0;
> - }
> + u8 val = test_bit(slab_index(p, s, addr), map) ?
> +  SLUB_RED_INACTIVE : SLUB_RED_ACTIVE;

Proper 'if' would be more readable.
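
Something like:

		u8 val;

		if (test_bit(slab_index(p, s, addr), map))
			val = SLUB_RED_INACTIVE;
		else
			val = SLUB_RED_ACTIVE;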

Other than that it looks fine to me.

>  
> - for_each_object(p, s, addr, page->objects)
> - if (!test_bit(slab_index(p, s, addr), map))
> - if (!check_object(s, page, p, SLUB_RED_ACTIVE))
> - return 0;
> - return 1;
> + if (!check_object(s, page, p, val))
> + break;
> + }
>  }
>  
>  static void validate_slab_slab(struct kmem_cache *s, struct page *page,
> -- 
> 2.23.0.162.g0b9fbb3734-goog
> 

-- 
 Kirill A. Shutemov


Re: [PATCH v2 1/4] mm: correct mask size for slub page->objects

2019-09-12 Thread Kirill A. Shutemov
On Wed, Sep 11, 2019 at 08:31:08PM -0600, Yu Zhao wrote:
> Mask of slub objects per page shouldn't be larger than what
> page->objects can hold.
> 
> It requires more than 2^15 objects to hit the problem, and I don't
> think anybody would. It'd be nice to have the mask fixed, but not
> really worth cc'ing the stable.
> 
> Fixes: 50d5c41cd151 ("slub: Do not use frozen page flag but a bit in the page 
> counters")
> Signed-off-by: Yu Zhao 

I don't think the patch fixes anything.

Yes, we have one spare bit between the order and the number of objects that
is not used and is always zero. So what?

I can imagine that for some microarchitectures accessing the higher 16 bits
of an int is cheaper than shifting by 15.

-- 
 Kirill A. Shutemov


Re: [PATCH 2/3] mm: avoid slub allocation while holding list_lock

2019-09-11 Thread Kirill A. Shutemov
On Wed, Sep 11, 2019 at 06:29:28PM -0600, Yu Zhao wrote:
> If we are already under list_lock, don't call kmalloc(). Otherwise we
> will run into deadlock because kmalloc() also tries to grab the same
> lock.
> 
> Instead, statically allocate bitmap in struct kmem_cache_node. Given
> currently page->objects has 15 bits, we bloat the per-node struct by
> 4K. So we waste some memory but only do so when slub debug is on.

Why not have a single page in total, protected by a lock?

Listing objects from two pages at the same time doesn't make sense anyway.
Concurrent validation is not something sane to do.
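
Something like this (sketch, untested):

	/* One bitmap for everybody; only used with slub debug on. */
	static unsigned long object_map[BITS_TO_LONGS(MAX_OBJS_PER_PAGE)];
	static DEFINE_SPINLOCK(object_map_lock);

and then take object_map_lock around get_map()/validate_slab() instead of
allocating a bitmap per node.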

-- 
 Kirill A. Shutemov


Re: [PATCH] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.

2019-09-10 Thread Kirill A. Shutemov
On Tue, Sep 10, 2019 at 09:28:10AM -0500, Steve Wahl wrote:
> On Mon, Sep 09, 2019 at 11:14:14AM +0300, Kirill A. Shutemov wrote:
> > On Fri, Sep 06, 2019 at 04:29:50PM -0500, Steve Wahl wrote:
> > > ...
> > > The answer is to invalidate the pages of this table outside the
> > > address range occupied by the kernel before the page table is
> > > activated.  This patch has been validated to fix this problem on our
> > > hardware.
> > 
> > If the goal is to avoid *any* mapping of the reserved region to stop
> > speculation, I don't think this patch will do the job. We still (likely)
> > have the same memory mapped as part of the identity mapping. And it
> > happens at least in two places: here and before on decompression stage.
> 
> I imagine you are likely correct, ideally you would not map any
> reserved pages in these spaces.
> 
> I've been reading the code to try to understand what you say above.
> For identity mappings in the kernel, I see level2_ident_pgt mapping
> the first 1G.

This is for the XEN case. Not sure how relevant it is for you.

> And I see early_dyanmic_pgts being set up with an identity mapping of
> the kernel that seems to be pretty well restricted to the range _text
> through _end.

Right, but rounded to 2M around the place the kernel was decompressed to.
Some of the reserved areas from the listing below are smaller than 2M or not
aligned to 2M.

> Within the decompression code, I see an identity mapping of the first
> 4G set up within the 32 bit code.  I believe we go past that to the
> startup_64 entry point.  (I don't know how common that path is, but I
> don't have a way to test it without figuring out how to force it.)

The kernel can start in 64-bit mode directly, and in this case we inherit
page tables from the bootloader/BIOS. They are trusted to provide an
identity mapping that covers at least the kernel (plus some more essential
stuff), but they are free to map more.

> From a pragmatic standpoint, the guy who can verify this for me is on
> vacation, but I believe our BIOS won't ever place the halt-causing
> ranges in a space below 4GiB.  Which explains why this patch works for
> our hardware.  (We do have reserved regions below 4G, just not the
> ones that hardware causes a halt for accessing.)
> 
> In case it helps you picture the situation, our hardware takes a small
> portion of RAM from the end of each NUMA node (or it might be pairs or
> quads of NUMA nodes, I'm not entirely clear on this at the moment) for
> its own purposes.  Here's a section of our e820 table:
> 
> [0.00] BIOS-e820: [mem 0x7c00-0x8fff] reserved
> [0.00] BIOS-e820: [mem 0xf800-0xfbff] reserved
> [0.00] BIOS-e820: [mem 0xfe00-0xfe010fff] reserved
> [0.00] BIOS-e820: [mem 0x0001-0x002f7fff] usable
> [0.00] BIOS-e820: [mem 0x002f8000-0x00303fff] reserved
> [0.00] BIOS-e820: [mem 0x00304000-0x005f7bff] usable
> [0.00] BIOS-e820: [mem 0x005f7c00-0x00603fff] reserved
> [0.00] BIOS-e820: [mem 0x00604000-0x008f7bff] usable
> [0.00] BIOS-e820: [mem 0x008f7c00-0x00903fff] reserved
> [0.00] BIOS-e820: [mem 0x00904000-0x00bf7bff] usable
> [0.00] BIOS-e820: [mem 0x00bf7c00-0x00c03fff] reserved
> [0.00] BIOS-e820: [mem 0x00c04000-0x00ef7bff] usable
> [0.00] BIOS-e820: [mem 0x00ef7c00-0x00f03fff] reserved
> [0.00] BIOS-e820: [mem 0x00f04000-0x011f7bff] usable
> [0.00] BIOS-e820: [mem 0x011f7c00-0x01203fff] reserved
> [0.00] BIOS-e820: [mem 0x01204000-0x014f7bff] usable
> [0.00] BIOS-e820: [mem 0x014f7c00-0x01503fff] reserved
> [0.00] BIOS-e820: [mem 0x01504000-0x017f7bff] usable
> [0.00] BIOS-e820: [mem 0x017f7c00-0x01803fff] reserved

It would be interesting to know which of them are problematic.

> Our problem occurs when KASLR (or kexec) places the kernel close
> enough to the end of one of the usable sections, and the 1G of 1:1
> mapped space includes a portion of the following reserved section, and
> speculation touches the reserved area.

Are you sure that speculative access is to blame? A speculative access
must not cause a change in architectural state.

-- 
 Kirill A. Shutemov


Re: [PATCH 1/2] memory_hotplug: Add a bounds check to check_hotplug_memory_range()

2019-09-10 Thread Kirill A. Shutemov
On Tue, Sep 10, 2019 at 12:52:20PM +1000, Alastair D'Silva wrote:
> From: Alastair D'Silva 
> 
> On PowerPC, the address ranges allocated to OpenCAPI LPC memory
> are allocated from firmware. These address ranges may be higher
> than what older kernels permit, as we increased the maximum
> permissable address in commit 4ffe713b7587
> ("powerpc/mm: Increase the max addressable memory to 2PB"). It is
> possible that the addressable range may change again in the
> future.
> 
> In this scenario, we end up with a bogus section returned from
> __section_nr (see the discussion on the thread "mm: Trigger bug on
> if a section is not found in __section_nr").
> 
> Adding a check here means that we fail early and have an
> opportunity to handle the error gracefully, rather than rumbling
> on and potentially accessing an incorrect section.
> 
> Further discussion is also on the thread ("powerpc: Perform a bounds
> check in arch_add_memory").
> 
> Signed-off-by: Alastair D'Silva 
> ---
>  include/linux/memory_hotplug.h |  1 +
>  mm/memory_hotplug.c| 19 ++-
>  2 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index f46ea71b4ffd..bc477e98a310 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -110,6 +110,7 @@ extern void __online_page_increment_counters(struct page 
> *page);
>  extern void __online_page_free(struct page *page);
>  
>  extern int try_online_node(int nid);
> +int check_hotplug_memory_addressable(u64 start, u64 size);
>  
>  extern int arch_add_memory(int nid, u64 start, u64 size,
>   struct mhp_restrictions *restrictions);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index c73f09913165..3c5428b014f9 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1030,6 +1030,23 @@ int try_online_node(int nid)
>   return ret;
>  }
>  
> +#ifndef MAX_POSSIBLE_PHYSMEM_BITS
> +#ifdef MAX_PHYSMEM_BITS
> +#define MAX_POSSIBLE_PHYSMEM_BITS MAX_PHYSMEM_BITS
> +#endif
> +#endif
> +
> +int check_hotplug_memory_addressable(u64 start, u64 size)
> +{
> +#ifdef MAX_POSSIBLE_PHYSMEM_BITS

How can it not be defined? You've defined it 6 lines above.

> + if ((start + size - 1) >> MAX_POSSIBLE_PHYSMEM_BITS)
> + return -E2BIG;
> +#endif
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(check_hotplug_memory_addressable);
> +
>  static int check_hotplug_memory_range(u64 start, u64 size)
>  {
>   /* memory range must be block size aligned */
> @@ -1040,7 +1057,7 @@ static int check_hotplug_memory_range(u64 start, u64 
> size)
>   return -EINVAL;
>   }
>  
> - return 0;
> + return check_hotplug_memory_addressable(start, size);
>  }
>  
>  static int online_memory_block(struct memory_block *mem, void *arg)
> -- 
> 2.21.0
> 

-- 
 Kirill A. Shutemov


Re: [PATCH v2 1/2] mm/page_ext: support to record the last stack of page

2019-09-10 Thread Kirill A. Shutemov
On Tue, Sep 10, 2019 at 09:07:49AM +0800, Walter Wu wrote:
> On Mon, 2019-09-09 at 12:57 +0200, David Hildenbrand wrote:
> > On 09.09.19 10:53, Walter Wu wrote:
> > > KASAN will record last stack of page in order to help programmer
> > > to see memory corruption caused by page.
> > > 
> > > What is difference between page_owner and our patch?
> > > page_owner records alloc stack of page, but our patch is to record
> > > last stack(it may be alloc or free stack of page).
> > > 
> > > Signed-off-by: Walter Wu 
> > > ---
> > >  mm/page_ext.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/mm/page_ext.c b/mm/page_ext.c
> > > index 5f5769c7db3b..7ca33dcd9ffa 100644
> > > --- a/mm/page_ext.c
> > > +++ b/mm/page_ext.c
> > > @@ -65,6 +65,9 @@ static struct page_ext_operations *page_ext_ops[] = {
> > >  #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
> > >   _idle_ops,
> > >  #endif
> > > +#ifdef CONFIG_KASAN
> > > + _stack_ops,
> > > +#endif
> > >  };
> > >  
> > >  static unsigned long total_usage;
> > > 
> > 
> > Are you sure this patch compiles?
> > 
> This is patchsets, it need another patch2.
> We have verified it by running KASAN UT on Qemu.

Any patchset must be bisectable: do not break anything in the middle of
the patchset.

-- 
 Kirill A. Shutemov


Re: [PATCH v2] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-10 Thread Kirill A. Shutemov
On Fri, Sep 06, 2019 at 09:57:47PM +0800, Jia He wrote:
> When we tested pmdk unit test [1] vmmalloc_fork TEST1 in arm64 guest, there
> will be a double page fault in __copy_from_user_inatomic of cow_user_page.
> 
> Below call trace is from arm64 do_page_fault for debugging purpose
> [  110.016195] Call trace:
> [  110.016826]  do_page_fault+0x5a4/0x690
> [  110.017812]  do_mem_abort+0x50/0xb0
> [  110.018726]  el1_da+0x20/0xc4
> [  110.019492]  __arch_copy_from_user+0x180/0x280
> [  110.020646]  do_wp_page+0xb0/0x860
> [  110.021517]  __handle_mm_fault+0x994/0x1338
> [  110.022606]  handle_mm_fault+0xe8/0x180
> [  110.023584]  do_page_fault+0x240/0x690
> [  110.024535]  do_mem_abort+0x50/0xb0
> [  110.025423]  el0_da+0x20/0x24
> 
> The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
> [9b007000] pgd=00023d4f8003, pud=00023da9b003, 
> pmd=00023d4b3003, pte=36298607bd3
> 
> As told by Catalin: "On arm64 without hardware Access Flag, copying from
> user will fail because the pte is old and cannot be marked young. So we
> always end up with zeroed page after fork() + CoW for pfn mappings. we
> don't always have a hardware-managed access flag on arm64."
> 
> This patch fix it by calling pte_mkyoung. Also, the parameter is
> changed because vmf should be passed to cow_user_page()
> 
> [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork
> 
> Reported-by: Yibo Cai 
> Signed-off-by: Jia He 
> ---
> Changes
> v2: remove FAULT_FLAG_WRITE when setting pte access flag (by Catalin)
> 
>  mm/memory.c | 21 -
>  1 file changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index e2bb51b6242e..63d4fd285e8e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2140,7 +2140,8 @@ static inline int pte_unmap_same(struct mm_struct *mm, 
> pmd_t *pmd,
>   return same;
>  }
>  
> -static inline void cow_user_page(struct page *dst, struct page *src, 
> unsigned long va, struct vm_area_struct *vma)
> +static inline void cow_user_page(struct page *dst, struct page *src,
> + struct vm_fault *vmf)
>  {
>   debug_dma_assert_idle(src);
>  
> @@ -2152,20 +2153,30 @@ static inline void cow_user_page(struct page *dst, 
> struct page *src, unsigned lo
>*/
>   if (unlikely(!src)) {
>   void *kaddr = kmap_atomic(dst);
> - void __user *uaddr = (void __user *)(va & PAGE_MASK);
> + void __user *uaddr = (void __user *)(vmf->address & PAGE_MASK);
> + pte_t entry;
>  
>   /*
>* This really shouldn't fail, because the page is there
>* in the page tables. But it might just be unreadable,
>* in which case we just give up and fill the result with
> -  * zeroes.
> +  * zeroes. If PTE_AF is cleared on arm64, it might
> +  * cause double page fault. So makes pte young here
>*/
> + if (!pte_young(vmf->orig_pte)) {
> + entry = pte_mkyoung(vmf->orig_pte);
> + if (ptep_set_access_flags(vmf->vma, vmf->address,
> + vmf->pte, entry, 0))
> + update_mmu_cache(vmf->vma, vmf->address,
> + vmf->pte);
> + }
> +

I don't see where you take ptl.
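
What I mean is something like this (a sketch only, not the final fix; mm
and vma are taken from vmf):

/* Re-validate and update the pte under the page table lock. */
struct vm_area_struct *vma = vmf->vma;
struct mm_struct *mm = vma->vm_mm;
pte_t entry;

vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
	entry = pte_mkyoung(vmf->orig_pte);
	if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 0))
		update_mmu_cache(vma, vmf->address, vmf->pte);
}
pte_unmap_unlock(vmf->pte, vmf->ptl);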

-- 
 Kirill A. Shutemov


Re: [PATCH] mm: avoid slub allocation while holding list_lock

2019-09-10 Thread Kirill A. Shutemov
On Mon, Sep 09, 2019 at 03:39:38PM -0600, Yu Zhao wrote:
> On Tue, Sep 10, 2019 at 05:57:22AM +0900, Tetsuo Handa wrote:
> > On 2019/09/10 1:00, Kirill A. Shutemov wrote:
> > > On Mon, Sep 09, 2019 at 12:10:16AM -0600, Yu Zhao wrote:
> > >> If we are already under list_lock, don't call kmalloc(). Otherwise we
> > >> will run into deadlock because kmalloc() also tries to grab the same
> > >> lock.
> > >>
> > >> Instead, allocate pages directly. Given currently page->objects has
> > >> 15 bits, we only need 1 page. We may waste some memory but we only do
> > >> so when slub debug is on.
> > >>
> > >>   WARNING: possible recursive locking detected
> > >>   
> > >>   mount-encrypted/4921 is trying to acquire lock:
> > >>   (&(>list_lock)->rlock){-.-.}, at: ___slab_alloc+0x104/0x437
> > >>
> > >>   but task is already holding lock:
> > >>   (&(>list_lock)->rlock){-.-.}, at: __kmem_cache_shutdown+0x81/0x3cb
> > >>
> > >>   other info that might help us debug this:
> > >>Possible unsafe locking scenario:
> > >>
> > >>  CPU0
> > >>  
> > >> lock(&(>list_lock)->rlock);
> > >> lock(&(>list_lock)->rlock);
> > >>
> > >>*** DEADLOCK ***
> > >>
> > >> Signed-off-by: Yu Zhao 
> > > 
> > > Looks sane to me:
> > > 
> > > Acked-by: Kirill A. Shutemov 
> > > 
> > 
> > Really?
> > 
> > Since page->objects is handled as bitmap, alignment should be BITS_PER_LONG
> > than BITS_PER_BYTE (though in this particular case, get_order() would
> > implicitly align BITS_PER_BYTE * PAGE_SIZE). But get_order(0) is an
> > undefined behavior.
> 
> I think we can safely assume PAGE_SIZE is unsigned long aligned and
> page->objects is non-zero.

I think it's better to handle page->objects == 0 gracefully. It should not
happen, but this code exists to handle situations that should not happen.
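
Something like this, for example (illustration only):

/* Never feed a zero size into get_order(), even if page->objects is bogus. */
unsigned int nr_objects = max_t(unsigned int, page->objects, 1);
unsigned long *map;

map = (void *)__get_free_pages(GFP_ATOMIC,
		get_order(BITS_TO_LONGS(nr_objects) * sizeof(unsigned long)));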

-- 
 Kirill A. Shutemov


Re: [PATCH v9 2/8] mm: Adjust shuffle code to allow for future coalescing

2019-09-09 Thread Kirill A. Shutemov
On Mon, Sep 09, 2019 at 09:43:00AM -0700, Alexander Duyck wrote:
> I'm not sure I follow what you are saying about the free_area definition.
> It looks like it is a part of the zone structure so I would think it still
> needs to be defined in the header.

Yeah, you are right. I didn't notice this.

-- 
 Kirill A. Shutemov


Re: [PATCH v9 6/8] mm: Introduce Reported pages

2019-09-09 Thread Kirill A. Shutemov
On Mon, Sep 09, 2019 at 09:25:04AM -0700, Alexander Duyck wrote:
> > Proper description for the config option?
> 
> I can add one. However the feature doesn't do anything without a caller
> that makes use of it. I guess it would make sense to enable this for
> something such as an out-of-tree module to later use.

A description under the 'help' section will not make the option
user-selectable if you leave the 'bool' without a prompt.

> > > + mutex_lock(_reporting_mutex);
> > > +
> > > + /* nothing to do if already in use */
> > > + if (rcu_access_pointer(ph_dev_info)) {
> > > + err = -EBUSY;
> > > + goto err_out;
> > > + }
> > 
> > Again, it's from "something went horribly wrong" category.
> > Maybe WARN_ON()?
> 
> That one I am not so sure about. Right now we only have one user for the
> page reporting interface. My concern is if we ever have more than one we
> may experience collisions. The device driver requesting this should
> display an error message if it is not able tor register the interface.

Fair enough.

> > > + boundary = kcalloc(MAX_ORDER - PAGE_REPORTING_MIN_ORDER,
> > > +sizeof(struct list_head *) * MIGRATE_TYPES,
> > > +GFP_KERNEL);
> > 
> > Could you comment here on why this size of array is allocated?
> > The calculation is not obvious to a reader.
> 
> Would something like the following work for you?
> /*
>  * Allocate space to store the boundaries for the zone we are
>  * actively reporting on. We will need to store one boundary
>  * pointer per migratetype, and then we need to have one of these
>  * arrays per order for orders greater than or equal to
>  * PAGE_REPORTING_MIN_ORDER.
>  */

Ack.

-- 
 Kirill A. Shutemov


Re: [PATCH] mm: avoid slub allocation while holding list_lock

2019-09-09 Thread Kirill A. Shutemov
On Mon, Sep 09, 2019 at 12:10:16AM -0600, Yu Zhao wrote:
> If we are already under list_lock, don't call kmalloc(). Otherwise we
> will run into deadlock because kmalloc() also tries to grab the same
> lock.
> 
> Instead, allocate pages directly. Given currently page->objects has
> 15 bits, we only need 1 page. We may waste some memory but we only do
> so when slub debug is on.
> 
>   WARNING: possible recursive locking detected
>   
>   mount-encrypted/4921 is trying to acquire lock:
>   (&(>list_lock)->rlock){-.-.}, at: ___slab_alloc+0x104/0x437
> 
>   but task is already holding lock:
>   (&(>list_lock)->rlock){-.-.}, at: __kmem_cache_shutdown+0x81/0x3cb
> 
>   other info that might help us debug this:
>Possible unsafe locking scenario:
> 
>  CPU0
>  
> lock(&(>list_lock)->rlock);
> lock(&(>list_lock)->rlock);
> 
>*** DEADLOCK ***
> 
> Signed-off-by: Yu Zhao 

Looks sane to me:

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH 0/3] Remove __online_page_set_limits()

2019-09-09 Thread Kirill A. Shutemov
On Sun, Sep 08, 2019 at 03:17:01AM +0530, Souptick Joarder wrote:
> __online_page_set_limits() is a dummy function and an extra call
> to this can be avoided.
> 
> As both of the callers are now removed, __online_page_set_limits()
> can be removed permanently.
> 
> Souptick Joarder (3):
>   hv_ballon: Avoid calling dummy function __online_page_set_limits()
>   xen/ballon: Avoid calling dummy function __online_page_set_limits()
>   mm/memory_hotplug.c: Remove __online_page_set_limits()
> 
>  drivers/hv/hv_balloon.c| 1 -
>  drivers/xen/balloon.c  | 1 -
>  include/linux/memory_hotplug.h | 1 -
>  mm/memory_hotplug.c| 5 -
>  4 files changed, 8 deletions(-)

Do we really need 3 separate patches to remove 8 lines of code?

-- 
 Kirill A. Shutemov


Re: [PATCH v9 2/8] mm: Adjust shuffle code to allow for future coalescing

2019-09-09 Thread Kirill A. Shutemov
On Mon, Sep 09, 2019 at 08:22:11AM -0700, Alexander Duyck wrote:
> > > + area = >free_area[order];
> > > + if (is_shuffle_order(order) ? shuffle_pick_tail() :
> > > + buddy_merge_likely(pfn, buddy_pfn, page, order))
> > 
> > Too loaded condition to my taste. Maybe
> > 
> > bool to_tail;
> > ...
> > if (is_shuffle_order(order))
> > to_tail = shuffle_pick_tail();
> > else if (buddy_merge_likely(pfn, buddy_pfn, page, order))
> > to_tail = true;
> > else
> > to_tail = false;
> 
> I can do that, although I would tweak this slightly and do something more
> like:
> if (is_shuffle_order(order))
> to_tail = shuffle_pick_tail();
> else
> to_tail = buddy+_merge_likely(pfn, buddy_pfn, page, order);

Okay. Looks fine.

> > if (to_tail)
> > add_to_free_area_tail(page, area, migratetype);
> > else
> > add_to_free_area(page, area, migratetype);
> > 
> > > + add_to_free_area_tail(page, area, migratetype);
> > >   else
> > > - add_to_free_area(page, >free_area[order], migratetype);
> > > -
> > > + add_to_free_area(page, area, migratetype);
> > >  }
> > >  
> > >  /*
> > > diff --git a/mm/shuffle.c b/mm/shuffle.c
> > > index 9ba542ecf335..345cb4347455 100644
> > > --- a/mm/shuffle.c
> > > +++ b/mm/shuffle.c
> > > @@ -4,7 +4,6 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > -#include 
> > >  #include 
> > >  #include "internal.h"
> > >  #include "shuffle.h"
> > 
> > Why do you move #include  from .c to .h?
> > It's not obvious to me.
> 
> Because I had originally put the shuffle logic in an inline function. I
> can undo that now as I when back to doing the randomness in the .c
> sometime v5 I believe.

Yes, please. It's a needless change now.

> 
> > > @@ -190,8 +189,7 @@ struct batched_bit_entropy {
> > >  
> > >  static DEFINE_PER_CPU(struct batched_bit_entropy, batched_entropy_bool);
> > >  
> > > -void add_to_free_area_random(struct page *page, struct free_area *area,
> > > - int migratetype)
> > > +bool __shuffle_pick_tail(void)
> > >  {
> > >   struct batched_bit_entropy *batch;
> > >   unsigned long entropy;
> > > @@ -213,8 +211,5 @@ void add_to_free_area_random(struct page *page, 
> > > struct free_area *area,
> > >   batch->position = position;
> > >   entropy = batch->entropy_bool;
> > >  
> > > - if (1ul & (entropy >> position))
> > > - add_to_free_area(page, area, migratetype);
> > > - else
> > > - add_to_free_area_tail(page, area, migratetype);
> > > + return 1ul & (entropy >> position);
> > >  }
> > > diff --git a/mm/shuffle.h b/mm/shuffle.h
> > > index 777a257a0d2f..0723eb97f22f 100644
> > > --- a/mm/shuffle.h
> > > +++ b/mm/shuffle.h
> > > @@ -3,6 +3,7 @@
> > >  #ifndef _MM_SHUFFLE_H
> > >  #define _MM_SHUFFLE_H
> > >  #include 
> > > +#include 
> > >  
> > >  /*
> > >   * SHUFFLE_ENABLE is called from the command line enabling path, or by
> > > @@ -22,6 +23,7 @@ enum mm_shuffle_ctl {
> > >  DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
> > >  extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl);
> > >  extern void __shuffle_free_memory(pg_data_t *pgdat);
> > > +extern bool __shuffle_pick_tail(void);
> > >  static inline void shuffle_free_memory(pg_data_t *pgdat)
> > >  {
> > >   if (!static_branch_unlikely(_alloc_shuffle_key))
> > > @@ -43,6 +45,11 @@ static inline bool is_shuffle_order(int order)
> > >   return false;
> > >   return order >= SHUFFLE_ORDER;
> > >  }
> > > +
> > > +static inline bool shuffle_pick_tail(void)
> > > +{
> > > + return __shuffle_pick_tail();
> > > +}
> > 
> > I don't see a reason in __shuffle_pick_tail() existing if you call it
> > unconditionally.
> 
> That is for compilation purposes. The function is not used in the
> shuffle_pick_tail below that always returns false.

Wouldn't it be the same if you renamed __shuffle_pick_tail() to
shuffle_pick_tail() and put its declaration under the same #ifdef?
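
Something like this in shuffle.h (sketch; I'm assuming the usual config
symbol here):

#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
extern bool shuffle_pick_tail(void);
#else
static inline bool shuffle_pick_tail(void)
{
	return false;
}
#endif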

-- 
 Kirill A. Shutemov


Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-09 Thread Kirill A. Shutemov
On Mon, Sep 09, 2019 at 11:56:50AM +0530, Anshuman Khandual wrote:
> 
> 
> On 09/07/2019 12:33 AM, Gerald Schaefer wrote:
> > On Fri, 6 Sep 2019 11:58:59 +0530
> > Anshuman Khandual  wrote:
> > 
> >> On 09/05/2019 10:36 PM, Gerald Schaefer wrote:
> >>> On Thu, 5 Sep 2019 14:48:14 +0530
> >>> Anshuman Khandual  wrote:
> >>>   
> >>>>> [...]
> >>>>>> +
> >>>>>> +#if !defined(__PAGETABLE_PMD_FOLDED) && 
> >>>>>> !defined(__ARCH_HAS_4LEVEL_HACK)
> >>>>>> +static void pud_clear_tests(pud_t *pudp)
> >>>>>> +{
> >>>>>> +  memset(pudp, RANDOM_NZVALUE, sizeof(pud_t));
> >>>>>> +  pud_clear(pudp);
> >>>>>> +  WARN_ON(!pud_none(READ_ONCE(*pudp)));
> >>>>>> +}
> >>>>>
> >>>>> For pgd/p4d/pud_clear(), we only clear if the page table level is 
> >>>>> present
> >>>>> and not folded. The memset() here overwrites the table type bits, so
> >>>>> pud_clear() will not clear anything on s390 and the pud_none() check 
> >>>>> will
> >>>>> fail.
> >>>>> Would it be possible to OR a (larger) random value into the table, so 
> >>>>> that
> >>>>> the lower 12 bits would be preserved?
> >>>>
> >>>> So the suggestion is instead of doing memset() on entry with 
> >>>> RANDOM_NZVALUE,
> >>>> it should OR a large random value preserving lower 12 bits. Hmm, this 
> >>>> should
> >>>> still do the trick for other platforms, they just need non zero value. 
> >>>> So on
> >>>> s390, the lower 12 bits on the page table entry already has valid value 
> >>>> while
> >>>> entering this function which would make sure that pud_clear() really does
> >>>> clear the entry ?  
> >>>
> >>> Yes, in theory the table entry on s390 would have the type set in the last
> >>> 4 bits, so preserving those would be enough. If it does not conflict with
> >>> others, I would still suggest preserving all 12 bits since those would 
> >>> contain
> >>> arch-specific flags in general, just to be sure. For s390, the pte/pmd 
> >>> tests
> >>> would also work with the memset, but for consistency I think the same 
> >>> logic
> >>> should be used in all pxd_clear_tests.  
> >>
> >> Makes sense but..
> >>
> >> There is a small challenge with this. Modifying individual bits on a given
> >> page table entry from generic code like this test case is bit tricky. That
> >> is because there are not enough helpers to create entries with an absolute
> >> value. This would have been easier if all the platforms provided functions
> >> like __pxx() which is not the case now. Otherwise something like this 
> >> should
> >> have worked.
> >>
> >>
> >> pud_t pud = READ_ONCE(*pudp);
> >> pud = __pud(pud_val(pud) | RANDOM_VALUE (keeping lower 12 bits 0))
> >> WRITE_ONCE(*pudp, pud);
> >>
> >> But __pud() will fail to build in many platforms.
> > 
> > Hmm, I simply used this on my system to make pud_clear_tests() work, not
> > sure if it works on all archs:
> > 
> > pud_val(*pudp) |= RANDOM_NZVALUE;
> 
> Which compiles on arm64 but then fails on x86 because of the way pmd_val()
> has been defined there.

Use this instead:

*pudp = __pud(pud_val(*pudp) | RANDOM_NZVALUE);

It *should* be more portable.
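
In the test helper it would look something like this (sketch):

static void pud_clear_tests(pud_t *pudp)
{
	pud_t pud = READ_ONCE(*pudp);

	/* OR in a non-zero value, so arch-specific low bits are preserved. */
	WRITE_ONCE(*pudp, __pud(pud_val(pud) | RANDOM_NZVALUE));
	pud_clear(pudp);
	WARN_ON(!pud_none(READ_ONCE(*pudp)));
}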

-- 
 Kirill A. Shutemov


Re: KASAN: use-after-free Read in shmem_fault (2)

2019-09-09 Thread Kirill A. Shutemov
On Mon, Sep 09, 2019 at 06:55:21AM -0700, Matthew Wilcox wrote:
> On Mon, Sep 02, 2019 at 05:20:30PM +0300, Kirill A. Shutemov wrote:
> > On Mon, Sep 02, 2019 at 06:52:54AM -0700, Matthew Wilcox wrote:
> > > On Sat, Aug 31, 2019 at 12:58:26PM +0800, Hillf Danton wrote:
> > > > On Fri, 30 Aug 2019 12:40:06 -0700
> > > > > syzbot found the following crash on:
> > > > > 
> > > > > HEAD commit:a55aa89a Linux 5.3-rc6
> > > > > git tree:   upstream
> > > > > console output: 
> > > > > https://syzkaller.appspot.com/x/log.txt?x=12f4beb660
> > > > > kernel config:  
> > > > > https://syzkaller.appspot.com/x/.config?x=2a6a2b9826fdadf9
> > > > > dashboard link: 
> > > > > https://syzkaller.appspot.com/bug?extid=03ee87124ee05af991bd
> > > > > compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> > > > > 
> > > > > ==
> > > > > BUG: KASAN: use-after-free in perf_trace_lock_acquire+0x401/0x530  
> > > > > include/trace/events/lock.h:13
> > > > > Read of size 8 at addr 8880a5cf2c50 by task syz-executor.0/26173
> > > > 
> > > > --- a/mm/shmem.c
> > > > +++ b/mm/shmem.c
> > > > @@ -2021,6 +2021,12 @@ static vm_fault_t shmem_fault(struct vm_
> > > > shmem_falloc_waitq = shmem_falloc->waitq;
> > > > prepare_to_wait(shmem_falloc_waitq, 
> > > > _fault_wait,
> > > > TASK_UNINTERRUPTIBLE);
> > > > +   /*
> > > > +* it is not trivial to see what will take 
> > > > place after
> > > > +* releasing i_lock and taking a nap, so hold 
> > > > inode to
> > > > +* be on the safe side.
> > > 
> > > I think the comment could be improved.  How about:
> > > 
> > >* The file could be unmapped by another thread after
> > >* releasing i_lock, and the inode then freed.  Hold
> > >* a reference to the inode to prevent this.
> > 
> > It can only happen if mmap_sem was released, so it's better to put
> > __iget() in the branch above, next to up_read(). I got confused at first
> > about how it is possible from ->fault().
> > 
> > This way iput() below should only be called for ret == VM_FAULT_RETRY.
> 
> Looking at the rather similar construct in filemap.c, should we solve
> it the same way, where we inc the refcount on the struct file instead
> of the inode before releasing the mmap_sem?

Are you talking about maybe_unlock_mmap_for_io()? Yeah, it's worth moving
it to mm/internal.h and reusing it.
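
Roughly like this (sketch; maybe_unlock_mmap_for_io() as it exists in
mm/filemap.c, just made visible via mm/internal.h; the shmem_fault()
context around it is paraphrased):

struct file *fpin;

ret = VM_FAULT_NOPAGE;
/* Pins the file and drops mmap_sem if the fault flags allow a retry. */
fpin = maybe_unlock_mmap_for_io(vmf, NULL);
if (fpin)
	ret = VM_FAULT_RETRY;
/* ... wait for the hole-punch to finish ... */
if (fpin)
	fput(fpin);
return ret;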

Care to prepare the patch? :P

-- 
 Kirill A. Shutemov


Re: [PATCH v9 6/8] mm: Introduce Reported pages

2019-09-09 Thread Kirill A. Shutemov
an set or clear this bit while we are
> +  * holding the zone lock. The advantage to doing it this way is
> +  * that we don't have to dirty the cacheline unless we are
> +  * changing the value.
> +  */
> + __set_bit(ZONE_PAGE_REPORTING_REQUESTED, >flags);
> +
> + /*
> +  * Delay the start of work to allow a sizable queue to
> +  * build. For now we are limiting this to running no more
> +  * than 10 times per second.
> +  */
> + if (!atomic_fetch_inc(>refcnt))
> + schedule_delayed_work(>work, HZ / 10);
> +out:
> + rcu_read_unlock();
> +}
> +
> +void __page_reporting_free_stats(struct zone *zone)
> +{
> + /* free reported_page statisitics */
> + kfree(zone->reported_pages);
> + zone->reported_pages = NULL;
> +}
> +
> +static DEFINE_MUTEX(page_reporting_mutex);
> +DEFINE_STATIC_KEY_FALSE(page_reporting_notify_enabled);
> +
> +void page_reporting_shutdown(struct page_reporting_dev_info *phdev)
> +{
> + mutex_lock(_reporting_mutex);
> +
> + if (rcu_access_pointer(ph_dev_info) == phdev) {
> + /* Disable page reporting notification */
> + static_branch_disable(_reporting_notify_enabled);
> + RCU_INIT_POINTER(ph_dev_info, NULL);
> + synchronize_rcu();
> +
> + /* Flush any existing work, and lock it out */
> + cancel_delayed_work_sync(>work);
> +
> + /* Free scatterlist */
> + kfree(phdev->sg);
> + phdev->sg = NULL;
> +
> + /* Free boundaries */
> + kfree(boundary);
> + boundary = NULL;
> + }
> +
> + mutex_unlock(_reporting_mutex);
> +}
> +EXPORT_SYMBOL_GPL(page_reporting_shutdown);
> +
> +int page_reporting_startup(struct page_reporting_dev_info *phdev)
> +{
> + struct zone *zone;
> + int err = 0;
> +
> + /* No point in enabling this if it cannot handle any pages */
> + if (!phdev->capacity)
> + return -EINVAL;

Looks like a usage error. Maybe WARN_ON()?
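
Something like this would make the misuse loud (sketch):

if (WARN_ON(!phdev->capacity))
	return -EINVAL;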

> +
> + mutex_lock(_reporting_mutex);
> +
> + /* nothing to do if already in use */
> + if (rcu_access_pointer(ph_dev_info)) {
> + err = -EBUSY;
> + goto err_out;
> + }

Again, it's from the "something went horribly wrong" category.
Maybe WARN_ON()?

> +
> + boundary = kcalloc(MAX_ORDER - PAGE_REPORTING_MIN_ORDER,
> +sizeof(struct list_head *) * MIGRATE_TYPES,
> +GFP_KERNEL);

Could you comment here on why this size of array is allocated?
The calculation is not obvious to a reader.

> + if (!boundary) {
> + err = -ENOMEM;
> + goto err_out;
> + }
> +
> + /* allocate scatterlist to store pages being reported on */
> + phdev->sg = kcalloc(phdev->capacity, sizeof(*phdev->sg), GFP_KERNEL);
> + if (!phdev->sg) {
> + err = -ENOMEM;
> +
> + kfree(boundary);
> + boundary = NULL;
> +
> + goto err_out;
> + }
> +
> +
> + /* initialize refcnt and work structures */
> + atomic_set(>refcnt, 0);
> + INIT_DELAYED_WORK(>work, _reporting_process);
> +
> + /* assign device, and begin initial flush of populated zones */
> + rcu_assign_pointer(ph_dev_info, phdev);
> + for_each_populated_zone(zone) {
> + spin_lock_irq(>lock);
> + __page_reporting_request(zone);
> + spin_unlock_irq(>lock);
> + }
> +
> + /* enable page reporting notification */
> + static_branch_enable(_reporting_notify_enabled);
> +err_out:
> + mutex_unlock(_reporting_mutex);
> +
> + return err;
> +}
> +EXPORT_SYMBOL_GPL(page_reporting_startup);
> 
> 

-- 
 Kirill A. Shutemov


Re: [PATCH v9 3/8] mm: Move set/get_pcppage_migratetype to mmzone.h

2019-09-09 Thread Kirill A. Shutemov
On Sat, Sep 07, 2019 at 10:25:28AM -0700, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> In order to support page reporting it will be necessary to store and
> retrieve the migratetype of a page. To enable that I am moving the set and
> get operations for pcppage_migratetype into the mm/internal.h header so
> that they can be used outside of the page_alloc.c file.
> 
> Reviewed-by: Dan Williams 
> Signed-off-by: Alexander Duyck 

I'm not sure that it's a great idea to export this functionality beyond
mm/page_alloc.c without any additional safeguards. How would we avoid
messing with ->index when the page is not in the right state of its
life cycle? Can we add some VM_BUG_ON()s here?
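
For instance (illustration only; the exact predicate is up for discussion):

static inline void set_pcppage_migratetype(struct page *page, int migratetype)
{
	/* ->index doubles as the migratetype only while the page is free. */
	VM_BUG_ON_PAGE(page_count(page) != 0, page);
	page->index = migratetype;
}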

-- 
 Kirill A. Shutemov

