Agreed.
Acked-by: Kirill A. Shutemov
--
Kirill A. Shutemov
> | function old new delta
> | free_pgd_range 546 656 +110
> | p4d_clear_bad 2 20 +18
> | Total: Before=4137148, After=4137276, chg +0.00%
>
> Cc: Kirill A. Shutemov
> Signed-off-by: Vineet Gupta
Acked-by: Kirill A. Shutemov
--
Kirill A. Shutemov
use a high value that allows a
> large range of positive caller-defined return codes for future users.
>
> Cc: Matthew Wilcox
> Cc: Will Deacon
> Cc: Peter Zijlstra
> Cc: Rik van Riel
> Cc: Minchan Kim
> Cc: Michal Hocko
> Cc: Huang Ying
> Cc: Jérôme Glisse
&
> Cc: Peter Zijlstra
> Cc: Rik van Riel
> Cc: Minchan Kim
> Cc: Michal Hocko
> Cc: Huang Ying
> Cc: Jérôme Glisse
> Cc: Kirill A. Shutemov
> Suggested-by: Linus Torvalds
> Signed-off-by: Thomas Hellstrom
> ---
> mm/pagewalk.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
, triggering the split would cost CPU time (~100ms
per GB of THPs to split).
I have doubts about the approach, so:
Not-Signed-off-by: Kirill A. Shutemov
---
include/linux/mmzone.h | 5 ++
mm/huge_memory.c | 129 -
mm/memcontrol.c | 3 +
mm
-off-by: Vineet Gupta
> ---
> arch/arc/include/asm/pgtable.h | 1 -
> arch/arc/mm/fault.c | 10 ++++++++--
> 2 files changed, 8 insertions(+), 3 deletions(-)
Have you tested it with CONFIG_HIGHMEM=y? alloc_kmap_pgtable() has to be
converted too.
--
Kirill A. Shutemov
before going oom? Maybe there would be also
> > other things that would benefit from this scheme instead of traditional
> > reclaim and shrinkers?
>
> That is exactly what we have discussed some time ago.
Yes, I've posted the patch:
http://lkml.kernel.org/r/20190827125911.boya23eowxhqmopa@box
But I'm still not sure that the approach is right. I expect it to trigger
performance regressions. For systems with plenty of free memory, we will
just pay the split cost for nothing in many cases.
--
Kirill A. Shutemov
> > ; James Morse ; Marc Zyngier ; Matthew Wilcox ;
> > Kirill A. Shutemov ; linux-arm-ker...@lists.infradead.org;
> > linux-kernel@vger.kernel.org; linux-m...@kvack.org; Punit Agrawal ;
> > Thomas Gleixner ; Andrew Morton ; hejia..
On Mon, Oct 07, 2019 at 03:51:58PM +0200, Ingo Molnar wrote:
>
> * Kirill A. Shutemov wrote:
>
> > On Mon, Oct 07, 2019 at 03:06:17PM +0200, Ingo Molnar wrote:
> > >
> > > * Anshuman Khandual wrote:
> > >
> > > > This adds a t
inline function + define. Something like:
#define mm_p4d_folded mm_p4d_folded
static inline bool mm_p4d_folded(struct mm_struct *mm)
{
return !pgtable_l5_enabled();
}
But I don't see much reason to be more verbose here than needed.
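A plain macro would do, i.e. (sketch):

	#define mm_p4d_folded(mm) (!pgtable_l5_enabled())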
--
Kirill A. Shutemov
On Fri, Oct 04, 2019 at 06:45:21AM -0700, Daniel Colascione wrote:
> On Fri, Oct 4, 2019 at 6:26 AM Kirill A. Shutemov
> wrote:
> > On Fri, Oct 04, 2019 at 02:33:49PM +0200, Michal Hocko wrote:
> > > On Wed 02-10-19 19:08:16, Daniel Colascione wrote:
> > > > On
s it? probably on
> the number of cpus) and it should be auto-enabled or it should be
> dropped altogether. You cannot really expect people know how to enable
> this without a deep understanding of the MM internals. Not to mention
> all those users using distro kernels/configs.
>
> A config option sounds like a bad way forward.
And I don't see much point anyway. Reading RSS counters from proc is
inherently racy. The value can change either way right after the read due
to process behaviour.
--
Kirill A. Shutemov
On Fri, Oct 04, 2019 at 02:58:59PM +0200, Thomas Hellström (VMware) wrote:
> On 10/4/19 2:37 PM, Kirill A. Shutemov wrote:
> > On Thu, Oct 03, 2019 at 01:32:45PM +0200, Thomas Hellström (VMware) wrote:
> > > > > + * If @mapping allows faulting of huge pmds and puds, it
> It looks like if a concurrent thread faults in a huge pud just after the
> test for pud_none in that pmd_alloc, things might go pretty bad.
Hm? It will fail the next pmd_none() check under ptl. Do you have a
particular racing scenario in mind?
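For reference, the usual pattern (roughly what __do_huge_pmd_anonymous_page()
does in mm/huge_memory.c):

	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
	if (unlikely(!pmd_none(*vmf->pmd)))
		goto unlock_release;	/* lost the race, back off */

Anything faulted in concurrently is caught once we hold the lock.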
--
Kirill A. Shutemov
The walk_page_mapping() function will initially be used for dirty-
> tracking in virtual graphics drivers.
>
> Cc: Andrew Morton
> Cc: Matthew Wilcox
> Cc: Will Deacon
> Cc: Peter Zijlstra
> Cc: Rik van Riel
> Cc: Minchan Kim
> Cc: Michal Hocko
> Cc: Huang Ying
>
> Cc: Michal Hocko
> Cc: Huang Ying
> Cc: Jérôme Glisse
> Cc: Kirill A. Shutemov
> Signed-off-by: Thomas Hellstrom
The patch looks good to me:
Acked-by: Kirill A. Shutemov
But I looked at the usage in pagewalk.c and it is inconsistent. The walker
takes ptl before calling ->pud_entry(), but
On Tue, Oct 01, 2019 at 10:08:30AM -0600, William Kucharski wrote:
>
>
> > On Oct 1, 2019, at 8:20 AM, Kirill A. Shutemov wrote:
> >
> > On Tue, Oct 01, 2019 at 06:18:28AM -0600, William Kucharski wrote:
> >>
> >>
> >> On 10/1/19 5:32 AM, K
On Tue, Oct 01, 2019 at 06:18:28AM -0600, William Kucharski wrote:
>
>
> On 10/1/19 5:32 AM, Kirill A. Shutemov wrote:
> > On Tue, Oct 01, 2019 at 05:21:26AM -0600, William Kucharski wrote:
> > >
> > >
> > > > On Oct 1, 2019, a
On Tue, Oct 01, 2019 at 08:26:28AM -0400, Qian Cai wrote:
> On Tue, 2019-10-01 at 14:51 +0300, Kirill A. Shutemov wrote:
> > On Tue, Oct 01, 2019 at 10:07:44AM +0200, Vlastimil Babka wrote:
> > > On 10/1/19 1:49 AM, Qian Cai wrote:
> > > >
> > > >
> &
ntry);
> if (pgtable)
> pgtable_trans_huge_deposit(mm, pmd, pgtable);
> set_pmd_at(mm, haddr, pmd, entry);
> mm_inc_nr_ptes(mm);
> - return true;
> }
>
> vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> --
> 2.20.1
>
>
--
Kirill A. Shutemov
can do that and it was intention, yes. The extra parameter was
> requested by Kirill, so I'll defer the answer to him :)
DEBUG_PAGEALLOC is a much more intrusive debug option. Not all
architectures support it in an efficient way. Some require hibernation.
I don't see a reason to tie these two options together.
--
Kirill A. Shutemov
I don't know too much about KASAN internals, but I
> assume it's not possible to use it that way on production kernels yet?
I don't know about production, but QEMU (without KVM acceleration) is
painfully slow if KASAN is enabled.
--
Kirill A. Shutemov
On Tue, Oct 01, 2019 at 05:21:26AM -0600, William Kucharski wrote:
>
>
> > On Oct 1, 2019, at 4:45 AM, Kirill A. Shutemov wrote:
> >
> > On Tue, Sep 24, 2019 at 05:52:13PM -0700, Matthew Wilcox wrote:
> >>
> >> diff --git a/mm/huge_memory.c
lags, PMD_SIZE);
> if (addr)
I think you are reducing ASLR without any real indication that THP is
relevant for the VMA. We need to know if any huge page allocation will be
*attempted* for the VMA or the file.
--
Kirill A. Shutemov
op of mm prep patches -- Matthew]
> Signed-off-by: Matthew Wilcox (Oracle)
As we discussed before, I think it's the wrong direction to separate
->fault() from ->huge_fault(). If ->fault() returned a huge page,
alloc_set_pte() would do The Right Thing™.
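For reference, alloc_set_pte() already has the pmd path; roughly (from
mm/memory.c):

	if (pmd_none(*vmf->pmd) && PageTransCompound(page) &&
			IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) {
		/* Map the compound page with a pmd instead of ptes. */
		ret = do_set_pmd(vmf, page);
		if (ret != VM_FAULT_FALLBACK)
			return ret;
	}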
--
Kirill A. Shutemov
= 2;
> + refcount = 1 + compound_nr(page);
> if (!page_ref_freeze(page, refcount))
> goto cannot_free;
> /* note: atomic_cmpxchg in page_ref_freeze provides the smp_rmb */
> --
> 2.23.0
>
>
--
Kirill A. Shutemov
so it's better to remove it and avoid the confusion
> by just using compound_nr() or page_size().
>
> Signed-off-by: Matthew Wilcox (Oracle)
Acked-by: Kirill A. Shutemov
--
Kirill A. Shutemov
ge) {
> + if (fgp_order(fgp_flags))
> + count_vm_event(THP_FILE_ALLOC);
>
> - /*
> - * add_to_page_cache_lru locks the page, and for mmap we expect
> - * an unlocked page.
> - */
> - if (page && (fgp_flags & FGP_FOR_MMAP))
> - unlock_page(page);
> + /*
> + * add_to_page_cache_lru locks the page, and
> + * for mmap we expect an unlocked page.
> + */
> + if (fgp_flags & FGP_FOR_MMAP)
> + unlock_page(page);
> + }
> }
>
> return page;
> --
> 2.23.0
>
--
Kirill A. Shutemov
On Fri, Sep 27, 2019 at 09:39:48AM -0700, Linus Torvalds wrote:
> On Fri, Sep 27, 2019 at 5:17 AM Kirill A. Shutemov
> wrote:
> >
> > > Call it "walk_page_mapping()". And talk extensively about how the
> > > locking differs a lot from the usual "walk
On Fri, Sep 27, 2019 at 08:40:00PM -0300, Leonardo Bras wrote:
> As decribed, gup_pgd_range is a lockless pagetable walk. So, in order to
^ typo
--
Kirill A. Shutemov
ike Power also uses the hook. Have you checked that this patch will
not affect Power?
--
Kirill A. Shutemov
On Thu, Sep 26, 2019 at 03:26:24PM -0400, Johannes Weiner wrote:
> I have just one nitpick:
Here's addressed one:
>From 22a9d58a79a3ebb727d9e909c8f833cd0a751c08 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov"
Date: Thu, 26 Sep 2019 16:34:26 +0300
Subject: [PATCH] shm
;
> } while (xas_nomem(&xas, gfp_mask & GFP_RECLAIM_MASK));
> @@ -907,7 +922,7 @@ static int __add_to_page_cache_locked(struct page *page,
> /* Leave page->index set: truncation relies upon it */
> if (!huge)
> mem_cgroup_cancel_charge(page, memcg, false);
> - put_page(page);
> + page_ref_sub(page, nr);
> return xas_error(&xas);
> }
> ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO);
> --
> 2.23.0
>
>
--
Kirill A. Shutemov
On Tue, Sep 24, 2019 at 05:52:07PM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)"
>
> This new function allows page cache pages to be allocated that are
> larger than an order-0 page.
>
> Signed-off-by: Matthew Wilcox (Oracle)
Acked-by: Kirill A. Shutemov
> instead of assigning the result to a temporary variable and conditionally
> passing that to prep_transhuge_page().
The patch makes sense, but "tail-callable" made me think you want it to be
able to accept tail pages. Can we rephrase this?
>
> Signed-off-by: Matthew Wilcox (Oracle)
Acked-by: Kirill A. Shutemov
--
Kirill A. Shutemov
> +static inline loff_t file_offset_of_next_page(struct page *page)
> +{
> + return ((loff_t)page->index + compound_nr(page)) << PAGE_SHIFT;
Wouldn't it be more readable as
return file_offset_of_page(page) + page_size(page);
?
> +}
> +
> static inline loff_t page_file_offset(struct page *page)
> {
> return ((loff_t)page_index(page)) << PAGE_SHIFT;
> --
> 2.23.0
>
>
--
Kirill A. Shutemov
On Tue, Sep 24, 2019 at 05:52:00PM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)"
>
> Use VM_FAULT_OOM instead of indirecting through vmf_error(-ENOMEM).
>
> Signed-off-by: Matthew Wilcox (Oracle)
Acked-by: Kirill A. Shutemov
--
Kirill A. Shutemov
ompletely read over the context of that email you linked - that
> there is a bug in the existing code - and looked at it as mere
> refactoring patch. My apologies.
>
> Switching that shmem site to maybe_unlock_mmap_for_io() to indirectly
> pin the inode (in a separate bug fix patch)
s error
> in case there can be some obscure use-case.(by Kirill)
>
> [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork
>
> Signed-off-by: Jia He
> Reported-by: Yibo Cai
Acked-by: Kirill A. Shutemov
--
Kirill A. Shutemov
of the space to be mapped is usable RAM and includes no reserved
> areas.
>
> Signed-off-by: Steve Wahl
> Cc: sta...@vger.kernel.org
Acked-by: Kirill A. Shutemov
--
Kirill A. Shutemov
in 1 GiB of
> reserved space).
>
> The solution is to invalidate the pages of this table outside the
> kernel image's space before the page table is activated. This patch
> has been validated to fix this problem on our hardware.
>
> Signed-off-by: Steve Wahl
> Cc: sta...@vger.kernel.org
Acked-by: Kirill A. Shutemov
--
Kirill A. Shutemov
ecially those
> looping set_pte_at()s in mm/huge_memory.c.
We can rename the current set_pte_at() to __set_pte_at() or something and
leave it in places where the barrier is not needed. The new set_pte_at() will
be used in the rest of the places, with the barrier inside.
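A sketch of what I mean (untested, arm64-flavoured; the plain WRITE_ONCE()
and dsb() stand in for whatever the architecture actually needs):

	/* No barrier: for callers that set many entries in a loop. */
	static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
					pte_t *ptep, pte_t pte)
	{
		WRITE_ONCE(*ptep, pte);
	}

	static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
				      pte_t *ptep, pte_t pte)
	{
		__set_pte_at(mm, addr, ptep, pte);
		/* Make the new entry visible to the hardware walker. */
		dsb(ishst);
	}

The loops in mm/huge_memory.c would then use __set_pte_at() and issue a
single barrier after the loop.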
BTW, have you looked at other levels of page table hierarchy. Do we have
the same issue for PMD/PUD/... pages?
--
Kirill A. Shutemov
irril suggested naming it PAGE_EXT_OWNER_ALLOCED to make it
^ typo
And PAGE_EXT_OWNER_ALLOCED is my typo. I meant PAGE_EXT_OWNER_ALLOCATED :P
--
Kirill A. Shutemov
> + if (static_branch_unlikely(&page_owner_free_stack)) {
> + handle = READ_ONCE(page_owner->free_handle);
> + if (!handle) {
> + pr_alert("page_owner free stack trace missing\n");
> + } else {
> + nr_entries = stack_depot_fetch(handle, &entries);
> + pr_alert("page last free stack trace:\n");
> + stack_trace_print(entries, nr_entries, 0);
> + }
> }
> #endif
>
> --
> 2.23.0
>
>
--
Kirill A. Shutemov
ull size of struct page_ext is not known at compile time, we can't
> use a simple page_ext++ statement, so introduce a page_ext_next() inline
> function for that.
>
> Reported-by: Kirill A. Shutemov
> Fixes: 7e2f2a0cd17c ("mm, page_owner: record page owner for each s
On Tue, Sep 24, 2019 at 04:05:50PM -0600, Yu Zhao wrote:
> On Tue, Sep 24, 2019 at 02:23:16PM +0300, Kirill A. Shutemov wrote:
> > On Sat, Sep 14, 2019 at 01:05:18AM -0600, Yu Zhao wrote:
> > > We don't want to expose page to fast gup running on a remote CPU
> > > befo
On Tue, Sep 24, 2019 at 05:15:01PM +0200, Vlastimil Babka wrote:
> On 9/24/19 1:42 PM, Kirill A. Shutemov wrote:
> >> --- a/mm/page_owner.c
> >> +++ b/mm/page_owner.c
> >> @@ -24,6 +24,9 @@ struct page_owner {
> >>short last_migrate_reason;
> >>
On Tue, Sep 24, 2019 at 05:10:59PM +0200, Vlastimil Babka wrote:
> On 9/24/19 1:31 PM, Kirill A. Shutemov wrote:
> > On Tue, Aug 20, 2019 at 03:18:26PM +0200, Vlastimil Babka wrote:
> >> Currently, page owner info is only recorded for the first page of a
> >>
addr, PAGE_SIZE))
> > > > + if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
> > > > {
> > > > + /* Give a warn in case there can be some obscure
> > > > +* use-case
> > > > +*/
> > > > + WARN_ON_ONCE(1);
> > >
> > > That's more of a question for the mm guys: at this point we do the
> > > copying with the ptl released; is there anything else that could have
> > > made the pte old in the meantime? I think unuse_pte() is only called on
> > > anonymous vmas, so it shouldn't be the case here.
>
> If we need to hold the ptl here, you could as well have an enclosing
> kmap/kunmap_atomic (option 2) with some goto instead of "return false".
Yeah, looks like we need to hold the ptl for longer. There is nothing I see
that would prevent clearing the young bit under us otherwise.
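I.e. something like this (untested sketch):

	vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
	if (!pte_same(*vmf->pte, vmf->orig_pte)) {
		/* The pte changed under us: let the caller retry the fault. */
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return false;
	}
	entry = pte_mkyoung(vmf->orig_pte);
	if (ptep_set_access_flags(vmf->vma, addr, vmf->pte, entry, 0))
		update_mmu_cache(vmf->vma, addr, vmf->pte);
	/* ... do the __copy_from_user_inatomic() with the ptl still held ... */
	pte_unmap_unlock(vmf->pte, vmf->ptl);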
--
Kirill A. Shutemov
wn) in pud_clear_tests() as there were no available
> __pgd() definitions.
>
> - ARM32
> - IA64
Hm. Grep shows __pgd() definitions for both of them. Is it for a specific
config?
--
Kirill A. Shutemov
last_migrate_reason;
> gfp_t gfp_mask;
> depot_stack_handle_t handle;
> +#ifdef CONFIG_DEBUG_PAGEALLOC
> + depot_stack_handle_t free_handle;
> +#endif
I think it's possible to add space for the second stack handle at runtime:
adjust page_owner_ops->size inside the ->need(). The second stack might be
useful beyond CONFIG_DEBUG_PAGEALLOC. We probably should not tie these
features.
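Something like this in the ->need() hook, maybe (untested sketch, reusing
the existing need_page_owner()/page_owner_ops from mm/page_owner.c):

	static bool need_page_owner(void)
	{
		if (!page_owner_enabled)
			return false;
		/* Make room for the second (free) stack handle. */
		if (debug_pagealloc_enabled())
			page_owner_ops.size = sizeof(struct page_owner) +
					      sizeof(depot_stack_handle_t);
		return true;
	}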
--
Kirill A. Shutemov
to "page owner
> info has been set at least once" and add new PAGE_EXT_OWNER_ACTIVE for
> tracking
> whether page is supposed to be currently tracked allocated or free.
Why not PAGE_EXT_OWNER_ALLOCED if it means "allocated"? Active is a
somewhat loaded term for a page.
--
Kirill A. Shutemov
> @@ -562,7 +577,8 @@ static void init_pages_in_zone(pg_data_t *pgdat, struct zone *zone)
> continue;
>
> /* Found early allocated page */
> - __set_page_owner_handle(page_ext, early_handle, 0, 0);
> + __set_page_owner_handle(page, page_ext, early_handle,
> + 0, 0);
> count++;
> }
> cond_resched();
> --
> 2.22.0
>
>
--
Kirill A. Shutemov
nge()
> __SetPageSwapBacked() SetPageReferenced()
Is there a particular codepath that has what you listed for CPU?
After a quick look, I only saw that we call page_add_new_anon_rmap()
before set_pte_at().
--
Kirill A. Shutemov
s error
> in case there can be some obscure use-case.(by Kirill)
>
> [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork
>
> Reported-by: Yibo Cai
> Signed-off-by: Jia He
Acked-by: Kirill A. Shutemov
--
Kirill A. Shutemov
...
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> return false;
> ...
> return true;
> ...
> if (!cow_user_page(new_page, old_page, vmf)) {
>
> That reads more sensibly for me.
I like this idea too.
--
Kirill A. Shutemov
s error
> in case there can be some obscure use-case.(by Kirill)
>
> [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork
>
> Reported-by: Yibo Cai
> Signed-off-by: Jia He
Acked-by: Kirill A. Shutemov
--
Kirill A. Shutemov
hrinker_maps and mem_cgroup_css_free only) still
> this looks suspicious and we can easily eliminate the gap at all.
With this logic it will still look suspicious, since you don't wait for a
grace period before freeing the map.
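For reference, the existing memcg_free_shrinker_map() defers the actual
free past a grace period:

	static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
	{
		kvfree(container_of(head, struct memcg_shrinker_map, rcu));
	}
	...
	call_rcu(&map->rcu, memcg_free_shrinker_map_rcu);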
--
Kirill A. Shutemov
* from the second attempt.
> + */
> + put_page(new_page);
I think you also need to give the reference on the old page back:
if (old_page)
put_page(old_page);
> + goto normal;
I don't see much point in this goto. Just return 0.
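I.e. the tail of that branch would become just:

		put_page(new_page);
		if (old_page)
			put_page(old_page);
		return 0;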
> + }
> }
>
> if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg,
> false))
> @@ -2420,6 +2471,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> }
> put_page(old_page);
> }
> +
> +normal:
> return page_copied ? VM_FAULT_WRITE : 0;
> oom_free_new:
> put_page(new_page);
> --
> 2.17.1
>
>
--
Kirill A. Shutemov
On Thu, Sep 19, 2019 at 03:41:43PM +0000, Catalin Marinas wrote:
> On Thu, Sep 19, 2019 at 06:00:07PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Sep 18, 2019 at 07:00:30PM +0100, Catalin Marinas wrote:
> > > On Wed, Sep 18, 2019 at 05:00:27PM +0300, Kirill A. Shutemov wrote:
&
On Wed, Sep 18, 2019 at 07:00:30PM +0100, Catalin Marinas wrote:
> On Wed, Sep 18, 2019 at 05:00:27PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Sep 18, 2019 at 09:19:14PM +0800, Jia He wrote:
> > > @@ -2152,20 +2163,34 @@ static inline void cow_user_page(struct page
>
On Thu, Sep 19, 2019 at 10:16:34AM +0800, Jia He wrote:
> Hi Kirill
>
> [On behalf of justin...@arm.com because some mails are filted...]
>
> On 2019/9/18 22:00, Kirill A. Shutemov wrote:
> > On Wed, Sep 18, 2019 at 09:19:14PM +0800, Jia He wrote:
> > > Wh
copy_user_highpage(dst, src, addr, vma);
> }
>
> static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
> @@ -2318,7 +2343,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> vmf->address);
> if (!new_page)
> goto oom;
> - cow_user_page(new_page, old_page, vmf->address, vma);
> + cow_user_page(new_page, old_page, vmf);
> }
>
> if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg,
> false))
> --
> 2.17.1
>
>
--
Kirill A. Shutemov
On Mon, Sep 09, 2019 at 06:04:12PM +0300, Kirill A. Shutemov wrote:
> On Mon, Sep 09, 2019 at 06:55:21AM -0700, Matthew Wilcox wrote:
> > On Mon, Sep 02, 2019 at 05:20:30PM +0300, Kirill A. Shutemov wrote:
> > > On Mon, Sep 02, 2019 at 06:52:54AM -0700, Matthew Wilcox wrote:
>
On Mon, Sep 16, 2019 at 06:26:19PM +0300, Kirill A. Shutemov wrote:
> > ---
> > mm/memory.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index e0c232fe81d9..55da24f33bc4 100644
> > --- a/mm/memory.c
&g
On Tue, Sep 17, 2019 at 12:15:19PM +0200, Michal Hocko wrote:
> On Mon 16-09-19 18:26:19, Kirill A. Shutemov wrote:
> > On Fri, Sep 13, 2019 at 02:11:19PM -0700, Lucian Adrian Grijincu wrote:
> > > As pages are faulted in MLOCK_ONFAULT correctly updates
> > > /proc/sel
---
>
> Signed-off-by: Lucian Adrian Grijincu
> Acked-by: Souptick Joarder
> ---
> mm/memory.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index e0c232fe81d9..55da24f33bc4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3311,6 +3311,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct
> mem_cgroup *memcg,
> } else {
> inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
> page_add_file_rmap(page, false);
> + if (vma->vm_flags & VM_LOCKED && !PageTransCompound(page))
> + mlock_vma_page(page);
Why do you only do this for file pages?
> }
> set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>
> --
> 2.17.1
>
>
--
Kirill A. Shutemov
The following commit has been merged into the x86/mm branch of tip:
Commit-ID: 18ec1eaf58fbf2d9009a752a102a3d8e0d905a0f
Gitweb:
https://git.kernel.org/tip/18ec1eaf58fbf2d9009a752a102a3d8e0d905a0f
Author: Kirill A. Shutemov
AuthorDate: Fri, 13 Sep 2019 12:54:52 +03:00
On Mon, Sep 16, 2019 at 09:35:21AM +0000, Justin He (Arm Technology China)
wrote:
>
> Hi Kirill
> > -Original Message-
> > From: Kirill A. Shutemov
> > Sent: 16 September 2019 17:16
> > To: Justin He (Arm Technology China)
> > Cc: Catalin Marinas ; Will Dea
> +#define arch_faults_on_old_pte arch_faults_on_old_pte
> +
> #endif /* !__ASSEMBLY__ */
>
> #endif /* __ASM_PGTABLE_H */
> --
> 2.17.1
>
>
--
Kirill A. Shutemov
copy_user_highpage(dst, src, vmf->address, vmf->vma);
> }
>
> static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
> @@ -2318,7 +2338,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> vmf->address);
> if (!new_page)
> goto oom;
> - cow_user_page(new_page, old_page, vmf->address, vma);
> + cow_user_page(new_page, old_page, vmf);
> }
>
> if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg,
> false))
> --
> 2.17.1
>
>
--
Kirill A. Shutemov
On Fri, Sep 13, 2019 at 10:14:15AM -0500, Steve Wahl wrote:
> On Thu, Sep 12, 2019 at 01:19:17PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Sep 11, 2019 at 03:08:35PM -0500, Steve Wahl wrote:
> > > Thank you for your time looking into this with me!
> >
> > With all
nvalid
> object, free or in use. This shouldn't matter because the original
> behavior isn't intended anyway.
>
> Signed-off-by: Yu Zhao
Acked-by: Kirill A. Shutemov
--
Kirill A. Shutemov
regressions I saw were in early boot
code that runs independently from CONFIG_X86_5LEVEL.
The next major releases of distributions are expected to have
CONFIG_X86_5LEVEL=y.
Enable the option by default. It may help to catch obscure bugs early.
Signed-off-by: Kirill A. Shutemov
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)
Adding fully unmapped pages into deferred split queue is not productive:
these pages are about to be freed or they are pinned and cannot be split
anyway.
Signed-off-by: Kirill A. Shutemov
---
mm/rmap.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
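The gist of the change, as a sketch rather than the exact hunk: queue the
page for deferred split only while part of it is still mapped.

	if (nr) {
		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr);
		/*
		 * Queue the page for deferred split only if at least one
		 * small page of the compound page is still mapped.
		 */
		if (nr < HPAGE_PMD_NR)
			deferred_split_huge_page(page);
	}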
but it's just a single line. Kirill suggested this in the
> previous version. There is a generic fallback definition but s390 has its
> own. This change overrides the generic one for x86, probably as a fix or as
> an improvement. Kirill should be able to help classify it, in which case it
> can be a separate patch.
I don't think it's worth a separate patch.
--
Kirill A. Shutemov
/x86/mm/highmem_32.c#L34
> >>
> >> I have not checked others, but I guess it is like that for all.
> >>
> >
> >
> > Seems like I answered too quickly. All kmap_atomic() do preempt_disable(),
> > but not all pte_alloc_map() call kmap_atomic().
> >
> > However, for instance ARM does:
> >
> > https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/arm/include/asm/pgtable.h#L200
> >
> > And X86 as well:
> >
> > https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/x86/include/asm/pgtable_32.h#L51
> >
> > Microblaze also:
> >
> > https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/microblaze/include/asm/pgtable.h#L495
>
> All the above platforms check out to be using k[un]map_atomic(). I am
> wondering whether any of the intermediate levels will have similar problems
> on any of these 32-bit platforms, or on any other platforms which might be
> using the generic k[un]map_atomic().
No. The kernel only allocates pte page tables from highmem. All other page
tables are always visible in the kernel address space.
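E.g. on x86 (arch/x86/mm/pgtable.c, if I recall the current code right),
only user pte pages get __GFP_HIGHMEM:

	#ifdef CONFIG_HIGHPTE
	#define PGTABLE_HIGHMEM __GFP_HIGHMEM
	#else
	#define PGTABLE_HIGHMEM 0
	#endif

	gfp_t __userpte_alloc_gfp = GFP_PGTABLE_USER | PGTABLE_HIGHMEM;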
--
Kirill A. Shutemov
On Thu, Sep 12, 2019 at 03:11:14PM -0600, Yu Zhao wrote:
> On Thu, Sep 12, 2019 at 12:40:35PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Sep 11, 2019 at 08:31:08PM -0600, Yu Zhao wrote:
> > > Mask of slub objects per page shouldn't be larger than what
> >
code here __init (or its variants) so it can be discarded on boot. It has
no use after that.
--
Kirill A. Shutemov
very well obvious in the code including the consequences. I do not
> really like an idea of hiding similar constrains behind a generic
> looking feature which is completely detached from the allocator and so
> any future change of the allocator might subtly break it.
I don't necessarily follow why shuffling is more integral to the page
allocator than reporting would be. It's next to shuffle.c under mm/ and
integrated in a similar way.
--
Kirill A. Shutemov
why it's crucial not to map more
than needed.
I think we also need to make it clear that this is a workaround for broken
hardware: speculative execution must not trigger a halt.
--
Kirill A. Shutemov
Can we rephrase it to indicate that slab_lock is required to protect
page->objects?
--
Kirill A. Shutemov
ock)->rlock){-.-.}, at: __kmem_cache_shutdown+0x81/0x3cb
>
> other info that might help us debug this:
>Possible unsafe locking scenario:
>
> CPU0
>
> lock(&(&n->list_lock)->rlock);
> lock(&(&n->list_lock)->rlock);
>
>*** DEADLOCK ***
>
> Signed-off-by: Yu Zhao
Acked-by: Kirill A. Shutemov
--
Kirill A. Shutemov
> - if (!check_object(s, page, p, SLUB_RED_ACTIVE))
> - return 0;
> - return 1;
> + if (!check_object(s, page, p, val))
> + break;
> + }
> }
>
> static void validate_slab_slab(struct kmem_cache *s, struct page *page,
> --
> 2.23.0.162.g0b9fbb3734-goog
>
--
Kirill A. Shutemov
s not
used and always zero. So what?
I can imagine that for some microarchitectures accessing the higher 16 bits
of an int is cheaper than shifting by 15.
--
Kirill A. Shutemov
not something sane to do.
--
Kirill A. Shutemov
On Tue, Sep 10, 2019 at 09:28:10AM -0500, Steve Wahl wrote:
> On Mon, Sep 09, 2019 at 11:14:14AM +0300, Kirill A. Shutemov wrote:
> > On Fri, Sep 06, 2019 at 04:29:50PM -0500, Steve Wahl wrote:
> > > ...
> > > The answer is to invalidate the pages of this table outs
static int check_hotplug_memory_range(u64 start, u64 size)
> {
> /* memory range must be block size aligned */
> @@ -1040,7 +1057,7 @@ static int check_hotplug_memory_range(u64 start, u64
> size)
> return -EINVAL;
> }
>
> - return 0;
> + return check_hotplug_memory_addressable(start, size);
> }
>
> static int online_memory_block(struct memory_block *mem, void *arg)
> --
> 2.21.0
>
--
Kirill A. Shutemov
t; > +#endif
> > > };
> > >
> > > static unsigned long total_usage;
> > >
> >
> > Are you sure this patch compiles?
> >
> This is a patchset; it needs the other patch (patch 2).
> We have verified it by running KASAN unit tests on QEMU.
Any patchset must be bisectable: do not break anything in the middle of
the patchset.
--
Kirill A. Shutemov
entry = pte_mkyoung(vmf->orig_pte);
> + if (ptep_set_access_flags(vmf->vma, vmf->address,
> + vmf->pte, entry, 0))
> + update_mmu_cache(vmf->vma, vmf->address,
> + vmf->pte);
> + }
> +
I don't see where you take ptl.
--
Kirill A. Shutemov
On Mon, Sep 09, 2019 at 03:39:38PM -0600, Yu Zhao wrote:
> On Tue, Sep 10, 2019 at 05:57:22AM +0900, Tetsuo Handa wrote:
> > On 2019/09/10 1:00, Kirill A. Shutemov wrote:
> > > On Mon, Sep 09, 2019 at 12:10:16AM -0600, Yu Zhao wrote:
> > >> If we are already under
t noticed this.
--
Kirill A. Shutemov
reader.
>
> Would something like the following work for you?
> /*
> * Allocate space to store the boundaries for the zone we are
> * actively reporting on. We will need to store one boundary
> * pointer per migratetype, and then we need to have one of these
> * arrays per order for orders greater than or equal to
> * PAGE_REPORTING_MIN_ORDER.
> */
Ack.
--
Kirill A. Shutemov
0
>
> lock(&(&n->list_lock)->rlock);
> lock(&(&n->list_lock)->rlock);
>
>*** DEADLOCK ***
>
> Signed-off-by: Yu Zhao
Looks sane to me:
Acked-by: Kirill A. Shutemov
--
Kirill A. Shutemov
lloon.c | 1 -
> include/linux/memory_hotplug.h | 1 -
> mm/memory_hotplug.c| 5 -
> 4 files changed, 8 deletions(-)
Do we really need 3 separate patches to remove 8 lines of code?
--
Kirill A. Shutemov
> > > extern void __shuffle_free_memory(pg_data_t *pgdat);
> > > +extern bool __shuffle_pick_tail(void);
> > > static inline void shuffle_free_memory(pg_data_t *pgdat)
> > > {
> > > if (!static_branch_unlikely(&page_alloc_shuffle_key))
> > > @@ -43,6 +45,11 @@ static inline bool is_shuffle_order(int order)
> > > return false;
> > > return order >= SHUFFLE_ORDER;
> > > }
> > > +
> > > +static inline bool shuffle_pick_tail(void)
> > > +{
> > > + return __shuffle_pick_tail();
> > > +}
> >
> > I don't see a reason in __shuffle_pick_tail() existing if you call it
> > unconditionally.
>
> That is for compilation purposes. The function is not used in the
> shuffle_pick_tail below that always returns false.
Wouldn't it be the same if you rename __shuffle_pick_tail() to
shuffle_pick_tail() and put its declaration under the same #ifdef?
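I.e. (untested):

	#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
	extern bool shuffle_pick_tail(void);
	#else
	static inline bool shuffle_pick_tail(void)
	{
		return false;
	}
	#endif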
--
Kirill A. Shutemov
le entry from generic code like this test case is bit tricky. That
> >> is because there are not enough helpers to create entries with an absolute
> >> value. This would have been easier if all the platforms provided functions
> >> like __pxx() which is not the case now. Otherwise something like this
> >> should
> >> have worked.
> >>
> >>
> >> pud_t pud = READ_ONCE(*pudp);
> >> pud = __pud(pud_val(pud) | RANDOM_VALUE (keeping lower 12 bits 0))
> >> WRITE_ONCE(*pudp, pud);
> >>
> >> But __pud() will fail to build in many platforms.
> >
> > Hmm, I simply used this on my system to make pud_clear_tests() work, not
> > sure if it works on all archs:
> >
> > pud_val(*pudp) |= RANDOM_NZVALUE;
>
> Which compiles on arm64 but then fails on x86 because of the way pmd_val()
> has been defined there.
Use instead
*pudp = __pud(pud_val(*pudp) | RANDOM_NZVALUE);
It *should* be more portable.
--
Kirill A. Shutemov
On Mon, Sep 09, 2019 at 06:55:21AM -0700, Matthew Wilcox wrote:
> On Mon, Sep 02, 2019 at 05:20:30PM +0300, Kirill A. Shutemov wrote:
> > On Mon, Sep 02, 2019 at 06:52:54AM -0700, Matthew Wilcox wrote:
> > > On Sat, Aug 31, 2019 at 12:58:26PM +0800, Hillf Danton wrote:
> >
eporting_free_stats(struct zone *zone)
> +{
> + /* free reported_page statistics */
> + kfree(zone->reported_pages);
> + zone->reported_pages = NULL;
> +}
> +
> +static DEFINE_MUTEX(page_reporting_mutex);
> +DEFINE_STATIC_KEY_FALSE(page_reporting_notify_enabled);
> +
> +void page_reporting_shutdown(struct page_reporting_dev_info *phdev)
> +{
> + mutex_lock(&page_reporting_mutex);
> +
> + if (rcu_access_pointer(ph_dev_info) == phdev) {
> + /* Disable page reporting notification */
> + static_branch_disable(&page_reporting_notify_enabled);
> + RCU_INIT_POINTER(ph_dev_info, NULL);
> + synchronize_rcu();
> +
> + /* Flush any existing work, and lock it out */
> + cancel_delayed_work_sync(&phdev->work);
> +
> + /* Free scatterlist */
> + kfree(phdev->sg);
> + phdev->sg = NULL;
> +
> + /* Free boundaries */
> + kfree(boundary);
> + boundary = NULL;
> + }
> +
> + mutex_unlock(&page_reporting_mutex);
> +}
> +EXPORT_SYMBOL_GPL(page_reporting_shutdown);
> +
> +int page_reporting_startup(struct page_reporting_dev_info *phdev)
> +{
> + struct zone *zone;
> + int err = 0;
> +
> + /* No point in enabling this if it cannot handle any pages */
> + if (!phdev->capacity)
> + return -EINVAL;
Looks like a usage error. Maybe WARN_ON()?
> +
> + mutex_lock(&page_reporting_mutex);
> +
> + /* nothing to do if already in use */
> + if (rcu_access_pointer(ph_dev_info)) {
> + err = -EBUSY;
> + goto err_out;
> + }
Again, it's from "something went horribly wrong" category.
Maybe WARN_ON()?
> +
> + boundary = kcalloc(MAX_ORDER - PAGE_REPORTING_MIN_ORDER,
> +sizeof(struct list_head *) * MIGRATE_TYPES,
> +GFP_KERNEL);
Could you comment here on why this size of array is allocated?
The calculation is not obvious to a reader.
> + if (!boundary) {
> + err = -ENOMEM;
> + goto err_out;
> + }
> +
> + /* allocate scatterlist to store pages being reported on */
> + phdev->sg = kcalloc(phdev->capacity, sizeof(*phdev->sg), GFP_KERNEL);
> + if (!phdev->sg) {
> + err = -ENOMEM;
> +
> + kfree(boundary);
> + boundary = NULL;
> +
> + goto err_out;
> + }
> +
> +
> + /* initialize refcnt and work structures */
> + atomic_set(&phdev->refcnt, 0);
> + INIT_DELAYED_WORK(&phdev->work, &page_reporting_process);
> +
> + /* assign device, and begin initial flush of populated zones */
> + rcu_assign_pointer(ph_dev_info, phdev);
> + for_each_populated_zone(zone) {
> + spin_lock_irq(&zone->lock);
> + __page_reporting_request(zone);
> + spin_unlock_irq(&zone->lock);
> + }
> +
> + /* enable page reporting notification */
> + static_branch_enable(&page_reporting_notify_enabled);
> +err_out:
> + mutex_unlock(_reporting_mutex);
> +
> + return err;
> +}
> +EXPORT_SYMBOL_GPL(page_reporting_startup);
>
>
--
Kirill A. Shutemov
How would we avoid messing with ->index when the page is not in the right
state of its life-cycle? Can we add some VM_BUG_ON()s here?
--
Kirill A. Shutemov