Re: [PATCH v9 02/14] mm: move page zone helpers from mm.h to mmzone.h

2023-06-15 Thread Matthew Wilcox
On Thu, Jun 15, 2023 at 03:33:12PM -0400, Peter Xu wrote:
> My question is whether page_zonenum() is ready for taking all kinds of tail
> pages?
> 
> Zone device tail pages all look fine, per memmap_init_zone_device().  The
> question was other kinds of usual compound pages, like either thp or
> hugetlb.  IIUC page->flags can be uninitialized for those tail pages.

I don't think that's true.  It's my understanding that page->flags is
initialised for all pages in memmap at boot / hotplug / delayed-init
time.  So you can check things like zone, node, etc on literally any
page.  Contrariwise, those flags are not available in tail pages for
use by the entity that has allocated a compound page / large folio.
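
For reference, page_zonenum() is just a bit-field extraction from
page->flags, along these lines (sketch of the mm.h helper; the KCSAN
annotation is omitted):

static inline enum zone_type page_zonenum(const struct page *page)
{
	/* The zone was encoded into the upper flag bits at memmap init time. */
	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
}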

Also, I don't believe zone device pages support compound allocation.
I think they're always allocated as order-0.



Re: [PATCH v2 1/6] mm: introduce vma->vm_flags modifier functions

2023-01-26 Thread Matthew Wilcox
On Thu, Jan 26, 2023 at 04:50:59PM +0200, Mike Rapoport wrote:
> On Thu, Jan 26, 2023 at 11:17:09AM +0200, Mike Rapoport wrote:
> > On Wed, Jan 25, 2023 at 12:38:46AM -0800, Suren Baghdasaryan wrote:
> > > +/* Use when VMA is not part of the VMA tree and needs no locking */
> > > +static inline void init_vm_flags(struct vm_area_struct *vma,
> > > +  unsigned long flags)
> > 
> > I'd suggest to make it vm_flags_init() etc.
> 
> Thinking more about it, it will be even clearer to name these vma_flags_xyz()

Perhaps vma_VERB_flags()?

vma_init_flags()
vma_reset_flags()
vma_set_flags()
vma_clear_flags()
vma_mod_flags()




Re: [PATCH v2 1/6] mm: introduce vma->vm_flags modifier functions

2023-01-25 Thread Matthew Wilcox
On Wed, Jan 25, 2023 at 08:49:50AM -0800, Suren Baghdasaryan wrote:
> On Wed, Jan 25, 2023 at 1:10 AM Peter Zijlstra  wrote:
> > > + /*
> > > +  * Flags, see mm.h.
> > > +  * WARNING! Do not modify directly.
> > > +  * Use {init|reset|set|clear|mod}_vm_flags() functions instead.
> > > +  */
> > > + unsigned long vm_flags;
> >
> > We have __private and ACCESS_PRIVATE() to help with enforcing this.
> 
> Thanks for pointing this out, Peter! I guess for that I'll need to
> convert all read accesses and provide get_vm_flags() too? That will
> cause some additional churn (a quick search shows 801 hits over 248
> files) but maybe it's worth it? I think Michal suggested that too in
> another patch. Should I do that while we are at it?

Here's a trick I saw somewhere in the VFS:

	union {
		const vm_flags_t vm_flags;
		vm_flags_t __private __vm_flags;
	};

Now it can be read by anybody but written only by those using
ACCESS_PRIVATE.
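
A write accessor on top of that might look like this (sketch, assuming
the union replaces the plain field in struct vm_area_struct):

static inline void vm_flags_set(struct vm_area_struct *vma, vm_flags_t flags)
{
	/*
	 * Readers use vma->vm_flags directly; writers must strip the
	 * __private annotation with ACCESS_PRIVATE().
	 */
	ACCESS_PRIVATE(vma, __vm_flags) |= flags;
}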



Re: [PATCH v2 1/6] mm: introduce vma->vm_flags modifier functions

2023-01-25 Thread Matthew Wilcox
On Wed, Jan 25, 2023 at 12:38:46AM -0800, Suren Baghdasaryan wrote:
> +/* Use when VMA is not part of the VMA tree and needs no locking */
> +static inline void init_vm_flags(struct vm_area_struct *vma,
> +  unsigned long flags)
> +{
> + vma->vm_flags = flags;

vm_flags are supposed to have type vm_flags_t.  That's not been
fully realised yet, but perhaps we could avoid making it worse?

>   pgprot_t vm_page_prot;
> - unsigned long vm_flags; /* Flags, see mm.h. */
> +
> + /*
> +  * Flags, see mm.h.
> +  * WARNING! Do not modify directly.
> +  * Use {init|reset|set|clear|mod}_vm_flags() functions instead.
> +  */
> + unsigned long vm_flags;

Including changing this line to vm_flags_t.



Re: [PATCH v8 07/15] mm/gup: migrate device coherent pages when pinning instead of failing

2022-07-11 Thread Matthew Wilcox
On Mon, Jul 11, 2022 at 03:35:42PM +0200, David Hildenbrand wrote:
> > +   /*
> > +* Device coherent pages are managed by a driver and should not
> > +* be pinned indefinitely as it prevents the driver moving the
> > +* page. So when trying to pin with FOLL_LONGTERM instead try
> > +* to migrate the page out of device memory.
> > +*/
> > +   if (folio_is_device_coherent(folio)) {
> > +   WARN_ON_ONCE(PageCompound(&folio->page));
> 
> Maybe that belongs into migrate_device_page()?

... and should be simply WARN_ON_ONCE(folio_test_large(folio)).
No need to go converting it back into a page in order to test things
that can't possibly be true.


Re: [linux-next:master] BUILD REGRESSION 8cb8311e95e3bb58bd84d6350365f14a718faa6d

2022-05-26 Thread Matthew Wilcox
On Thu, May 26, 2022 at 11:48:32AM +0300, Dan Carpenter wrote:
> On Thu, May 26, 2022 at 02:16:34AM +0100, Matthew Wilcox wrote:
> > Bizarre this started showing up now.  The recent patch was:
> > 
> > -   info->alloced += compound_nr(page);
> > -   inode->i_blocks += BLOCKS_PER_PAGE << compound_order(page);
> > +   info->alloced += folio_nr_pages(folio);
> > +   inode->i_blocks += BLOCKS_PER_PAGE << folio_order(folio);
> > 
> > so it could tell that compound_order() was small, but folio_order()
> > might be large?
> 
> The old code also generates a warning on my test system.  Smatch thinks
> both compound_order() and folio_order() are 0-255.  I guess because of
> the "unsigned char compound_order;" in the struct page.

It'd be nice if we could annotate that as "contains a value between
1 and BITS_PER_LONG - PAGE_SHIFT".  Then be able to optionally enable
a checker that ensures that's true on loads/stores.  Maybe we need a
language that isn't C :-P  Ada can do this ... I don't think Rust can.


Re: [linux-next:master] BUILD REGRESSION 8cb8311e95e3bb58bd84d6350365f14a718faa6d

2022-05-25 Thread Matthew Wilcox
On Wed, May 25, 2022 at 03:20:06PM -0700, Andrew Morton wrote:
> On Wed, 25 May 2022 23:07:35 +0100 Jessica Clarke  wrote:
> 
> > This is i386, so an unsigned long is 32-bit, but i_blocks is a blkcnt_t
> > i.e. a u64, which makes the shift without a cast of the LHS fishy.
> 
> Ah, of course, thanks.  I remember 32 bits ;)
> 
> --- a/mm/shmem.c~mm-shmemc-suppress-shift-warning
> +++ a/mm/shmem.c
> @@ -1945,7 +1945,7 @@ alloc_nohuge:
>  
>   spin_lock_irq(&info->lock);
>   info->alloced += folio_nr_pages(folio);
> - inode->i_blocks += BLOCKS_PER_PAGE << folio_order(folio);
> + inode->i_blocks += (blkcnt_t)BLOCKS_PER_PAGE << folio_order(folio);

Bizarre this started showing up now.  The recent patch was:

-   info->alloced += compound_nr(page);
-   inode->i_blocks += BLOCKS_PER_PAGE << compound_order(page);
+   info->alloced += folio_nr_pages(folio);
+   inode->i_blocks += BLOCKS_PER_PAGE << folio_order(folio);

so it could tell that compound_order() was small, but folio_order()
might be large?

Silencing the warning is a good thing, but folio_order() can (at the
moment) be at most 9 on i386, so it isn't actually going to be
larger than 4096.
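
(For the arithmetic: with 4KiB pages, BLOCKS_PER_PAGE is PAGE_SIZE / 512
= 8, so the worst case is 8 << 9 = 4096, which fits comfortably even in
a 32-bit type.)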


Re: [PATCH] drm/amdkfd: Add SVM API support capability bits

2022-03-30 Thread Matthew Wilcox
On Wed, Mar 30, 2022 at 04:24:20PM -0500, Alex Sierra wrote:
> From: Philip Yang 
> 
> SVMAPISupported property added to HSA_CAPABILITY, the value match
> HSA_CAPABILITY defined in Thunk spec:
> 
> SVMAPISupported: it will not be supported on older kernels that don't
> have HMM or on systems with GFXv8 or older GPUs without support for
> 48-bit virtual addresses.
> 
> CoherentHostAccess property added to HSA_MEMORYPROPERTY, the value match
> HSA_MEMORYPROPERTY defined in Thunk spec:
> 
> CoherentHostAccess: whether or not device memory can be coherently
> accessed by the host CPU.

Could you translate this commit message into English?  Reviewing
Documentation/process/5.Posting.rst might be helpful.


Re: [PATCH v1 1/3] mm: split vm_normal_pages for LRU and non-LRU handling

2022-03-11 Thread Matthew Wilcox
On Thu, Mar 10, 2022 at 11:26:31AM -0600, Alex Sierra wrote:
> @@ -606,7 +606,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>   * PFNMAP mappings in order to support COWable mappings.
>   *
>   */
> -struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> +struct page *vm_normal_any_page(struct vm_area_struct *vma, unsigned long addr,
> 	pte_t pte)
>  {
>   unsigned long pfn = pte_pfn(pte);
> @@ -620,8 +620,6 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>   return NULL;
>   if (is_zero_pfn(pfn))
>   return NULL;
> - if (pte_devmap(pte))
> - return NULL;
>  
>   print_bad_pte(vma, addr, pte, NULL);
>   return NULL;

... what?

Haven't you just made it so that a devmap page always prints a bad PTE
message, and then returns NULL anyway?

Surely this should be:

if (pte_devmap(pte))
-   return NULL;
+   return pfn_to_page(pfn);

or maybe

+   goto check_pfn;

But I don't know about that highest_memmap_pfn check.

> @@ -661,6 +659,22 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>   return pfn_to_page(pfn);
>  }
>  
> +/*
> + * vm_normal_lru_page -- This function gets the "struct page" associated
> + * with a pte only for page cache and anon page. These pages are LRU handled.
> + */
> +struct page *vm_normal_lru_page(struct vm_area_struct *vma, unsigned long addr,
> +	pte_t pte)

It seems a shame to add a new function without proper kernel-doc.
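
For the record, kernel-doc for it would look something like this
(sketch; the wording is mine, not from the patch):

/**
 * vm_normal_lru_page - get the LRU-managed page a pte refers to
 * @vma: the VMA covering @addr
 * @addr: the virtual address mapped by @pte
 * @pte: the page table entry to look up
 *
 * Only returns pages that are handled by the LRU, i.e. page cache and
 * anonymous pages; for any other kind of mapping it returns NULL.
 *
 * Return: the struct page, or NULL if @pte does not map an LRU page.
 */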



Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-03-01 Thread Matthew Wilcox
On Tue, Mar 01, 2022 at 10:14:07AM -0800, Kees Cook wrote:
> On Mon, Feb 28, 2022 at 04:45:11PM -0800, Linus Torvalds wrote:
> > Really. The "-Wshadow doesn't work on the kernel" is not some new
> > issue, because you have to do completely insane things to the source
> > code to enable it.
> 
> The first big glitch with -Wshadow was with shadowed global variables.
> GCC 4.8 fixed that, but it still yells about shadowed functions. What
> _almost_ works is -Wshadow=local. At first glace, all the warnings
> look solvable, but then one will eventually discover __wait_event()
> and associated macros that mix when and how deeply it intentionally
> shadows variables. :)

Well, that's just disgusting.  Macros fundamentally shouldn't be
referring to things that aren't in their arguments.  The first step to
cleaning this up is ...

I'll take a look at the rest of cleaning this up soon.

From 28ffe35d56223d4242b915832299e5acc926737e Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" 
Date: Tue, 1 Mar 2022 13:47:07 -0500
Subject: [PATCH] wait: Parameterize the return variable to ___wait_event()

Macros should not refer to variables which aren't in their arguments.
Pass the name from its callers.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 include/linux/swait.h| 12 ++--
 include/linux/wait.h | 32 
 include/linux/wait_bit.h |  4 ++--
 3 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/include/linux/swait.h b/include/linux/swait.h
index 6a8c22b8c2a5..5e8e9b13be2d 100644
--- a/include/linux/swait.h
+++ b/include/linux/swait.h
@@ -191,14 +191,14 @@ do {							\
 } while (0)
 
 #define __swait_event_timeout(wq, condition, timeout)  \
-   ___swait_event(wq, ___wait_cond_timeout(condition), \
+   ___swait_event(wq, ___wait_cond_timeout(condition, __ret),  \
  TASK_UNINTERRUPTIBLE, timeout,\
  __ret = schedule_timeout(__ret))
 
 #define swait_event_timeout_exclusive(wq, condition, timeout)  \
 ({ \
long __ret = timeout;   \
-   if (!___wait_cond_timeout(condition))   \
+   if (!___wait_cond_timeout(condition, __ret))\
__ret = __swait_event_timeout(wq, condition, timeout);  \
__ret;  \
 })
@@ -216,14 +216,14 @@ do {							\
 })
 
 #define __swait_event_interruptible_timeout(wq, condition, timeout)\
-   ___swait_event(wq, ___wait_cond_timeout(condition), \
+   ___swait_event(wq, ___wait_cond_timeout(condition, __ret),  \
  TASK_INTERRUPTIBLE, timeout,  \
  __ret = schedule_timeout(__ret))
 
 #define swait_event_interruptible_timeout_exclusive(wq, condition, timeout)\
 ({ \
long __ret = timeout;   \
-   if (!___wait_cond_timeout(condition))   \
+   if (!___wait_cond_timeout(condition, __ret))\
__ret = __swait_event_interruptible_timeout(wq, \
condition, timeout);\
__ret;  \
@@ -252,7 +252,7 @@ do {						\
 } while (0)
 
 #define __swait_event_idle_timeout(wq, condition, timeout) \
-   ___swait_event(wq, ___wait_cond_timeout(condition), \
+   ___swait_event(wq, ___wait_cond_timeout(condition, __ret),  \
   TASK_IDLE, timeout,  \
   __ret = schedule_timeout(__ret))
 
@@ -278,7 +278,7 @@ do {						\
 #define swait_event_idle_timeout_exclusive(wq, condition, timeout) \
 ({ \
long __ret = timeout;   \
-   if (!___wait_cond_timeout(condition))   \
+   if (!___wait_cond_timeout(condition, __ret))\
__ret = __swait_event_idle_timeout(wq,  \
   condition, timeout); \
__ret;  \
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 851e07da2583..890cce3c0f2e 1006

Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-03-01 Thread Matthew Wilcox
On Mon, Feb 28, 2022 at 12:37:15PM -0800, Linus Torvalds wrote:
> On Mon, Feb 28, 2022 at 12:16 PM Matthew Wilcox  wrote:
> >
> > Then we can never use -Wshadow ;-(  I'd love to be able to turn it on;
> > it catches real bugs.
> 
> Oh, we already can never use -Wshadow regardless of things like this.
> That bridge hasn't just been burned, it never existed in the first
> place.
> 
> The whole '-Wshadow' thing simply cannot work with local variables in
> macros - something that we've used since day 1.
> 
> Try this (as a "p.c" file):
> 
> #define min(a,b) ({ \
> typeof(a) __a = (a);\
> typeof(b) __b = (b);\
> __a < __b ? __a : __b; })
> 
> int min3(int a, int b, int c)
> {
> return min(a,min(b,c));
> }
> 
> and now do "gcc -O2 -S t.c".
> 
> Then try it with -Wshadow.

#define ___PASTE(a, b)  a##b
#define __PASTE(a, b) ___PASTE(a, b)
#define _min(a, b, u) ({ \
	typeof(a) __PASTE(__a,u) = (a); \
	typeof(b) __PASTE(__b,u) = (b); \
	__PASTE(__a,u) < __PASTE(__b,u) ? __PASTE(__a,u) : __PASTE(__b,u); })

#define min(a, b) _min(a, b, __COUNTER__)

int min3(int a, int b, int c)
{
return min(a,min(b,c));
}

(probably there's a more elegant way to do this)
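
(For why this works: __COUNTER__ expands to a fresh integer on every
use, so the nested min() declares, say, __a1/__b1 while the outer one
declares __a0/__b0, and no declaration shadows another.)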


Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-02-28 Thread Matthew Wilcox
On Mon, Feb 28, 2022 at 12:10:24PM -0800, Linus Torvalds wrote:
> We can do
> 
> typeof(pos) pos
> 
> in the 'for ()' loop, and never use __iter at all.
> 
> That means that inside the for-loop, we use a _different_ 'pos' than outside.

Then we can never use -Wshadow ;-(  I'd love to be able to turn it on;
it catches real bugs.

> +#define list_for_each_entry(pos, head, member)				\
> +	for (typeof(pos) pos = list_first_entry(head, typeof(*pos), member);	\
> +	     !list_entry_is_head(pos, head, member);				\
> +	     pos = list_next_entry(pos, member))


Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2021-10-16 Thread Matthew Wilcox
On Sat, Oct 16, 2021 at 12:44:50PM -0300, Jason Gunthorpe wrote:
> Assuming changing FSDAX is hard.. How would DAX people feel about just
> deleting the PUD/PMD support until it can be done with compound pages?

I think there are customers who would find that an unacceptable answer :-)



Re: [PATCH v1 1/2] ext4/xfs: add page refcount helper

2021-10-15 Thread Matthew Wilcox
On Thu, Oct 14, 2021 at 10:39:27AM -0500, Alex Sierra wrote:
> From: Ralph Campbell 
> 
> There are several places where ZONE_DEVICE struct pages assume a reference
> count == 1 means the page is idle and free. Instead of open coding this,
> add a helper function to hide this detail.
> 
> Signed-off-by: Ralph Campbell 
> Signed-off-by: Alex Sierra 
> Reviewed-by: Christoph Hellwig 
> Acked-by: Theodore Ts'o 
> Acked-by: Darrick J. Wong 

Reviewed-by: Matthew Wilcox (Oracle) 


Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2021-10-15 Thread Matthew Wilcox
On Thu, Oct 14, 2021 at 10:39:28AM -0500, Alex Sierra wrote:
> From: Ralph Campbell 
> 
> ZONE_DEVICE struct pages have an extra reference count that complicates the
> code for put_page() and several places in the kernel that need to check the
> reference count to see that a page is not being used (gup, compaction,
> migration, etc.). Clean up the code so the reference count doesn't need to
> be treated specially for ZONE_DEVICE.
> 
> Signed-off-by: Ralph Campbell 
> Signed-off-by: Alex Sierra 
> Reviewed-by: Christoph Hellwig 

Reviewed-by: Matthew Wilcox (Oracle) 


Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2021-10-14 Thread Matthew Wilcox


It would probably help if you cc'd Dan on this.
As far as I know he's the only person left who cares about GUP on DAX.

On Thu, Oct 14, 2021 at 02:06:34PM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 14, 2021 at 10:39:28AM -0500, Alex Sierra wrote:
> > From: Ralph Campbell 
> > 
> > ZONE_DEVICE struct pages have an extra reference count that complicates the
> > code for put_page() and several places in the kernel that need to check the
> > reference count to see that a page is not being used (gup, compaction,
> > migration, etc.). Clean up the code so the reference count doesn't need to
> > be treated specially for ZONE_DEVICE.
> > 
> > Signed-off-by: Ralph Campbell 
> > Signed-off-by: Alex Sierra 
> > Reviewed-by: Christoph Hellwig 
> > ---
> > v2:
> > AS: merged this patch in linux 5.11 version
> > 
> > v5:
> > AS: add condition at try_grab_page to check for the zone device type, while
> > page ref counter is checked less/equal to zero. In case of device zone, 
> > pages
> > ref counter are initialized to zero.
> > 
> > v7:
> > AS: fix condition at try_grab_page added at v5, is invalid. It supposed
> > to fix xfstests/generic/413 test, however, there's a known issue on
> > this test where DAX mapped area DIO to non-DAX expect to fail.
> > https://patchwork.kernel.org/project/fstests/patch/1489463960-3579-1-git-send-email-xz...@redhat.com
> > This condition was removed after rebase over patch series
> > https://lore.kernel.org/r/20210813044133.1536842-4-jhubb...@nvidia.com
> > ---
> >  arch/powerpc/kvm/book3s_hv_uvmem.c |  2 +-
> >  drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
> >  fs/dax.c   |  4 +-
> >  include/linux/dax.h|  2 +-
> >  include/linux/memremap.h   |  7 +--
> >  include/linux/mm.h | 11 
> >  lib/test_hmm.c |  2 +-
> >  mm/internal.h  |  8 +++
> >  mm/memcontrol.c|  6 +--
> >  mm/memremap.c  | 69 +++---
> >  mm/migrate.c   |  5 --
> >  mm/page_alloc.c|  3 ++
> >  mm/swap.c  | 45 ++---
> >  13 files changed, 46 insertions(+), 120 deletions(-)
> 
> Has anyone tested this with FSDAX? Does get_user_pages() on fsdax
> backed memory still work?
> 
> What refcount value does the struct pages have when they are installed
> in the PTEs? Remember a 0 refcount will make all the get_user_pages()
> fail.
> 
> I'm looking at the call path starting in ext4_punch_hole() and I would
> expect to see something manipulating the page ref count before
> the ext4_break_layouts() call path gets to the dax_page_unused() test.
> 
> All I see is we go into unmap_mapping_pages() - that would normally
> put back the page references held by PTEs but insert_pfn() has this:
> 
>   if (pfn_t_devmap(pfn))
>   entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
> 
> And:
> 
> static inline pte_t pte_mkdevmap(pte_t pte)
> {
>   return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
> }
> 
> Which interacts with vm_normal_page():
> 
>   if (pte_devmap(pte))
>   return NULL;
> 
> To disable that refcounting?
> 
> So... I have a feeling this will have PTEs pointing to 0 refcount
> pages? Unless FSDAX is !pte_devmap which is not the case, right?
> 
> This seems further confirmed by this comment:
> 
>   /*
>* If we race get_user_pages_fast() here either we'll see the
>* elevated page count in the iteration and wait, or
>* get_user_pages_fast() will see that the page it took a reference
>* against is no longer mapped in the page tables and bail to the
>* get_user_pages() slow path.  The slow path is protected by
>* pte_lock() and pmd_lock(). New references are not taken without
>* holding those locks, and unmap_mapping_pages() will not zero the
>* pte or pmd without holding the respective lock, so we are
>* guaranteed to either see new references or prevent new
>* references from being established.
>*/
> 
> Which seems to explain this scheme relies on unmap_mapping_pages() to
> fence GUP_fast, not on GUP_fast observing 0 refcounts when it should
> stop.
> 
> This seems like it would be properly fixed by using normal page
> refcounting for PTEs - ie stop using special for these pages?
> 
> Does anyone know why devmap is pte_special anyhow?
> 
> > +void free_zone_device_page(struct page *page)
> > +{
> > +   switch (page->pgmap->type) {
> > +   case MEMORY_DEVICE_PRIVATE:
> > +   free_device_page(page);
> > +   return;
> > +   case MEMORY_DEVICE_FS_DAX:
> > +   /* notify page idle */
> > +   wake_up_var(&page->_refcount);
> > +   return;
> 
> It is not for this series, but I wonder if we should just always call
> ops->page_free and have free_device_page() logic in that call

Re: [PATCH v1 00/12] MEMORY_DEVICE_COHERENT for CPU-accessible coherent device memory

2021-10-12 Thread Matthew Wilcox
On Tue, Oct 12, 2021 at 11:39:57AM -0700, Andrew Morton wrote:
> Because I must ask: if this feature is for one single computer which
> presumably has a custom kernel, why add it to mainline Linux?

I think in particular patch 2 deserves to be merged because it removes
a ton of cruft from every call to put_page() (at least if you're using
a distro config).  It makes me nervous, but I think it's the right
thing to do.  It may well need more fixups after it has been merged,
but that's life.


Re: [PATCH v5 3/9] dyndbg: add DEFINE_DYNAMIC_DEBUG_CATEGORIES and callbacks

2021-08-13 Thread Matthew Wilcox
On Fri, Aug 13, 2021 at 06:51:05PM +0300, Andy Shevchenko wrote:
> On Fri, Aug 13, 2021 at 09:17:11AM -0600, Jim Cromie wrote:
> > +int param_set_dyndbg(const char *instr, const struct kernel_param *kp)
> > +{
> > +   unsigned long inbits;
> > +   int rc, i, chgct = 0, totct = 0;
> > +   char query[OUR_QUERY_SIZE];
> > +   struct dyndbg_bitdesc *bitmap = (struct dyndbg_bitdesc *) kp->data;
> 
> So you need space after ')' ?

More importantly, if ->data is of type 'void *', it is bad style to
cast the pointer at all.  I can't tell what type 'data' has; if it
is added to kernel_param as part of this series, I wasn't cc'd on
the patch that did that.



Re: [RFC PATCH v2 1/8] ext4/xfs: add page refcount helper

2021-06-09 Thread Matthew Wilcox
On Mon, Jun 07, 2021 at 03:42:19PM -0500, Alex Sierra wrote:
> +++ b/include/linux/dax.h
> @@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space 
> *mapping)
>   return mapping->host && IS_DAX(mapping->host);
>  }
>  
> +static inline bool dax_layout_is_idle_page(struct page *page)
> +{
> + return page_ref_count(page) == 1;
> +}

We already have something called an idle page, and that's quite a
different thing from this.  How about dax_page_unused() (it's a use
count, so once it's got down to its minimum value, it's unused)?
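
That is, keep the body and rename (sketch):

static inline bool dax_page_unused(struct page *page)
{
	/* ZONE_DEVICE page use counts bottom out at 1, not 0. */
	return page_ref_count(page) == 1;
}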



Re: [RFC PATCH v2 1/8] ext4/xfs: add page refcount helper

2021-06-08 Thread Matthew Wilcox
On Tue, Jun 08, 2021 at 12:29:04AM +, Liam Howlett wrote:
> * Alex Sierra  [210607 16:43]:
> > From: Ralph Campbell 
> > 
> > There are several places where ZONE_DEVICE struct pages assume a reference
> > count == 1 means the page is idle and free. Instead of open coding this,
> > add a helper function to hide this detail.
> > 
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index b52f084aa643..8909a91cd381 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space *mapping)
> > return mapping->host && IS_DAX(mapping->host);
> >  }
> >  
> > +static inline bool dax_layout_is_idle_page(struct page *page)
> > +{
> > +   return page_ref_count(page) == 1;
> > +}
> 
> If this races with page_ref_count(page) == 0, then it will return false
> that a page is idle when the page is being freed.  I don't know the code
> well enough to say if this is an issue or not so please let me know.
> 
> For example:
> !dax_layout_is_idle_page() will return true in dax_busy_page() above
> when the count is 0 and return the page.
> 
> Maybe you are sure to have at least one reference when calling this?  It
> might be worth adding a comment.

You're getting confused by the problem that the next patch fixes, which
is that devmap pages were stupidly given an elevated refcount.  devmap
pages are considered "free" when their refcount is 1.  See
put_page(), put_devmap_managed_page() and so on.
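
(Roughly, put_page() at that point special-cased them along these
lines -- a simplified sketch, not the verbatim implementation:)

static inline void put_page(struct page *page)
{
	page = compound_head(page);

	/* devmap pages go back to their driver at refcount 1, not 0 */
	if (page_is_devmap_managed(page)) {
		put_devmap_managed_page(page);
		return;
	}

	if (put_page_testzero(page))
		__put_page(page);
}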


Re: [RFC PATCH v2 0/8] Support DEVICE_GENERIC memory in migrate_vma_*

2021-06-08 Thread Matthew Wilcox
On Mon, Jun 07, 2021 at 03:42:18PM -0500, Alex Sierra wrote:
> v1:
> https://lore.kernel.org/linux-mm/20210529064022.gb15...@lst.de/T/

Please copy and paste the rationale into followup patch series instead
of sending a link:

AMD is building a system architecture for the Frontier supercomputer with
a coherent interconnect between CPUs and GPUs. This hardware architecture
allows the CPUs to coherently access GPU device memory. We have hardware
in our labs and we are working with our partner HPE on the BIOS, firmware
and software for delivery to the DOE.

The system BIOS advertises the GPU device memory (aka VRAM) as SPM
(special purpose memory) in the UEFI system address map. The amdgpu driver
looks it up with lookup_resource and registers it with devmap as
MEMORY_DEVICE_GENERIC using devm_memremap_pages.

Now we're trying to migrate data to and from that memory using the
migrate_vma_* helpers so we can support page-based migration in our
unified memory allocations, while also supporting CPU access to those
pages.

This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages
behave correctly in the migrate_vma_* helpers. We are looking for feedback
about this approach. If we're close, what's needed to make our patches
acceptable upstream? If we're not close, any suggestions how else to
achieve what we are trying to do (i.e. page migration and coherent CPU
access to VRAM)?

This work is based on HMM and our SVM memory manager that was recently
upstreamed to Dave Airlie's drm-next branch
[https://cgit.freedesktop.org/drm/drm/log/?h=drm-next]. On top of that we
did some rework of our VRAM management for migrations to remove some
incorrect assumptions, allow partially successful migrations and GPU
memory mappings that mix pages in VRAM and system memory.
[https://patchwork.kernel.org/project/dri-devel/list/?series=489811]

> v2:
> This patch series version has merged "[RFC PATCH v3 0/2]
> mm: remove extra ZONE_DEVICE struct page refcount" patch series made by
> Ralph Campbell. It also applies at the top of these series, our changes
> to support device generic type in migration_vma helpers.
> This has been tested in systems with device memory that has coherent
> access by CPU.
> 
> Also addresses the following feedback made in v1:
> - Isolate in one patch kernel/resource.c modification, based
> on Christoph's feedback.
> - Add helpers check for generic and private type to avoid
> duplicated long lines.
> 
> I like to provide an overview of what each of the patches does in a series:
> 
> Patches 1-2: Rebased Ralph Campbell's ZONE_DEVICE page refcounting patches
> Patch 3: Export lookup_resource
> Patches 4-5: AMDGPU driver changes to register and use DEVICE_GENERIC memory
> Patches 6-8: Handle DEVICE_GENERIC memory in migration helpers
> 
> Alex Sierra (6):
>   kernel: resource: lookup_resource as exported symbol
>   drm/amdkfd: add SPM support for SVM
>   drm/amdkfd: generic type as sys mem on migration to ram
>   include/linux/mm.h: helpers to check zone device generic type
>   mm: add generic type support to migrate_vma helpers
>   mm: call pgmap->ops->page_free for DEVICE_GENERIC pages
> 
> Ralph Campbell (2):
>   ext4/xfs: add page refcount helper
>   mm: remove extra ZONE_DEVICE struct page refcount
> 
>  arch/powerpc/kvm/book3s_hv_uvmem.c   |  2 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 15 --
>  drivers/gpu/drm/nouveau/nouveau_dmem.c   |  2 +-
>  fs/dax.c |  8 +--
>  fs/ext4/inode.c  |  5 +-
>  fs/xfs/xfs_file.c|  4 +-
>  include/linux/dax.h  | 10 
>  include/linux/memremap.h |  7 +--
>  include/linux/mm.h   | 52 +++---
>  kernel/resource.c|  2 +-
>  lib/test_hmm.c   |  2 +-
>  mm/internal.h|  8 +++
>  mm/memremap.c| 69 +++-
>  mm/migrate.c | 13 ++---
>  mm/page_alloc.c  |  3 ++
>  mm/swap.c| 45 ++--
>  16 files changed, 83 insertions(+), 164 deletions(-)
> 
> -- 
> 2.17.1
> 
> 


Re: [PATCH] drm/ttm: stop warning on TT shrinker failure

2021-03-22 Thread Matthew Wilcox
On Mon, Mar 22, 2021 at 02:49:27PM +0100, Daniel Vetter wrote:
> On Sun, Mar 21, 2021 at 03:18:28PM +0100, Christian König wrote:
> > Am 20.03.21 um 14:17 schrieb Daniel Vetter:
> > > On Sat, Mar 20, 2021 at 10:04 AM Christian König
> > >  wrote:
> > > > Am 19.03.21 um 20:06 schrieb Daniel Vetter:
> > > > > On Fri, Mar 19, 2021 at 07:53:48PM +0100, Christian König wrote:
> > > > > > Am 19.03.21 um 18:52 schrieb Daniel Vetter:
> > > > > > > On Fri, Mar 19, 2021 at 03:08:57PM +0100, Christian König wrote:
> > > > > > > > Don't print a warning when we fail to allocate a page for 
> > > > > > > > swapping things out.
> > > > > > > > 
> > > > > > > > Also rely on memalloc_nofs_save/memalloc_nofs_restore instead 
> > > > > > > > of GFP_NOFS.
> > > > > > > Uh this part doesn't make sense. Especially since you only do it 
> > > > > > > for the
> > > > > > > debugfs file, not in general. Which means you've just completely 
> > > > > > > broken
> > > > > > > the shrinker.
> > > > > > Are you sure? My impression is that GFP_NOFS should now work much 
> > > > > > more out
> > > > > > of the box with the memalloc_nofs_save()/memalloc_nofs_restore().
> > > > > Yeah, if you'd put it in the right place :-)
> > > > > 
> > > > > But also -mm folks are very clear that memalloc_no*() family is for 
> > > > > dire
> > > > > situation where there's really no other way out. For anything where 
> > > > > you
> > > > > know what you're doing, you really should use explicit gfp flags.
> > > > My impression is just the other way around. You should try to avoid the
> > > > NOFS/NOIO flags and use the memalloc_no* approach instead.
> > > Where did you get that idea?
> > 
> > Well from the kernel comment on GFP_NOFS:
> > 
> >  * %GFP_NOFS will use direct reclaim but will not use any filesystem
> > interfaces.
> >  * Please try to avoid using this flag directly and instead use
> >  * memalloc_nofs_{save,restore} to mark the whole scope which
> > cannot/shouldn't
> >  * recurse into the FS layer with a short explanation why. All allocation
> >  * requests will inherit GFP_NOFS implicitly.
> 
> Huh that's interesting, since iirc Willy or Dave told me the opposite, and
> the memalloc_no* stuff is for e.g. nfs calling into network layer (needs
> GFP_NOFS) or swap on top of a filesystems (even needs GFP_NOIO I think).
> 
> Adding them, maybe I got confused.

My impression is that the scoped API is preferred these days.

https://www.kernel.org/doc/html/latest/core-api/gfp_mask-from-fs-io.html
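
For reference, the scoped form looks like this (sketch;
do_reclaim_work() is a stand-in for whatever runs inside the scope):

	unsigned int nofs_flags = memalloc_nofs_save();

	/* every allocation in this scope implicitly behaves as GFP_NOFS */
	ret = do_reclaim_work();

	memalloc_nofs_restore(nofs_flags);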

I'd probably need to spend a few months learning the DRM subsystem to
have a more detailed opinion on whether passing GFP flags around explicitly
or using the scope API is the better approach for your situation.

I usually defer to Michal on these kinds of questions.

> > > The kernel is full of explicit gfp_t flag
> > > passing to make this as explicit as possible. The memalloc_no* stuff
> > > is just for when you go through entire subsystems and really can't
> > > wire it through. I can't find the discussion anymore, but that was the
> > > advice I got from mm/fs people.
> > > 
> > > One reason is that generally a small GFP_KERNEL allocation never
> > > fails. But it absolutely can fail if it's in a memalloc_no* section,
> > > and these kind of non-obvious non-local effects are a real pain in
> > > testing and review. Hence explicit gfp_flag passing as much as
> > > possible.

I agree with this; it's definitely a problem with the scope API.  I wanted
to extend it to include GFP_NOWAIT, but if you do that, your chances of
memory allocation failure go way up, so you really want to set __GFP_NOWARN
too, but now you need to audit all the places that you're calling to be
sure they really handle errors correctly.

So I think I'm giving up on that patch set.

> > > > > > > If this is just to paper over the seq_printf doing the wrong 
> > > > > > > allocations,
> > > > > > > then just move that out from under the fs_reclaim_acquire/release 
> > > > > > > part.
> > > > > > No, that wasn't the problem.
> > > > > > 
> > > > > > We have just seen to many failures to allocate pages for swapout 
> > > > > > and I think
> > > > > > that would improve this because in a lot of cases we can then 
> > > > > > immediately
> > > > > > swap things out instead of having to rely on upper layers.
> > > > > Yeah, you broke it. Now the real shrinker is running with GFP_KERNEL,
> > > > > because your memalloc_no is only around the debugfs function. And ofc 
> > > > > it's
> > > > > much easier to allocate with GFP_KERNEL, right until you deadlock :-)
> > > > The problem here is that for example kswapd calls the shrinker without
> > > > holding a FS lock as far as I can see.
> > > > 
> > > > And it is rather sad that we can't optimize this case directly.
> > > I'm still not clear what you want to optimize? You can check for "is
> > > this kswapd" in pf flags, but that sounds very hairy and fragile.
> > 
> > Well we only need the NOFS flag when the shrinker callback really comes f

Re: [RFC] MAINTAINERS tag for cleanup robot

2020-11-23 Thread Matthew Wilcox
On Sat, Nov 21, 2020 at 08:50:58AM -0800, t...@redhat.com wrote:
> The fixer review is
> https://reviews.llvm.org/D91789
> 
> A run over allyesconfig for x86_64 finds 62 issues, 5 are false positives.
> The false positives are caused by macros passed to other macros and by
> some macro expansions that did not have an extra semicolon.
> 
> This cleans up about 1,000 of the current 10,000 -Wextra-semi-stmt
> warnings in linux-next.

Are any of them not false-positives?  It's all very well to enable
stricter warnings, but if they don't fix any bugs, they're just churn.


Re: [RFC] MAINTAINERS tag for cleanup robot

2020-11-23 Thread Matthew Wilcox
On Sun, Nov 22, 2020 at 06:46:46AM -0800, Tom Rix wrote:
> 
> On 11/21/20 7:23 PM, Matthew Wilcox wrote:
> > On Sat, Nov 21, 2020 at 08:50:58AM -0800, t...@redhat.com wrote:
> >> The fixer review is
> >> https://reviews.llvm.org/D91789
> >>
> >> A run over allyesconfig for x86_64 finds 62 issues, 5 are false positives.
> >> The false positives are caused by macros passed to other macros and by
> >> some macro expansions that did not have an extra semicolon.
> >>
> >> This cleans up about 1,000 of the current 10,000 -Wextra-semi-stmt
> >> warnings in linux-next.
> > Are any of them not false-positives?  It's all very well to enable
> > stricter warnings, but if they don't fix any bugs, they're just churn.
> >
> While enabling additional warnings may be a side effect of this effort
> 
> the primary goal is to set up a cleaning robot. After that a refactoring 
> robot.

Why do we need such a thing?  Again, it sounds like more churn.
It's really annoying when I'm working on something important that gets
derailed by pointless churn.  Churn also makes it harder to backport
patches to earlier kernels.


Re: [Ocfs2-devel] [RFC] treewide: cleanup unreachable breaks

2020-10-19 Thread Matthew Wilcox
On Sat, Oct 17, 2020 at 09:09:28AM -0700, t...@redhat.com wrote:
> clang has a number of useful, new warnings see
> https://urldefense.com/v3/__https://clang.llvm.org/docs/DiagnosticsReference.html__;!!GqivPVa7Brio!Krxz78O3RKcB9JBMVo_F98FupVhj_jxX60ddN6tKGEbv_cnooXc1nnBmchm-e_O9ieGnyQ$
>  

Please get your IT department to remove that stupidity.  If you can't,
please send email from a non-Red Hat email address.

I don't understand why this is a useful warning to fix.  What actual
problem is caused by the code below?

> return and break
> 
>   switch (c->x86_vendor) {
>   case X86_VENDOR_INTEL:
>   intel_p5_mcheck_init(c);
>   return 1;
> - break;

Sure, it's unnecessary, but it's not masking a bug.  It's not unclear.
Why do we want to enable this warning?



Re: [Ocfs2-devel] [RFC] treewide: cleanup unreachable breaks

2020-10-19 Thread Matthew Wilcox
On Sun, Oct 18, 2020 at 12:13:35PM -0700, James Bottomley wrote:
> On Sun, 2020-10-18 at 19:59 +0100, Matthew Wilcox wrote:
> > On Sat, Oct 17, 2020 at 09:09:28AM -0700, t...@redhat.com wrote:
> > > clang has a number of useful, new warnings see
> > > https://urldefense.com/v3/__https://clang.llvm.org/docs/DiagnosticsReference.html__;!!GqivPVa7Brio!Krxz78O3RKcB9JBMVo_F98FupVhj_jxX60ddN6tKGEbv_cnooXc1nnBmchm-e_O9ieGnyQ$
> > >  
> > 
> > Please get your IT department to remove that stupidity.  If you
> > can't, please send email from a non-Red Hat email address.
> 
> Actually, the problem is at Oracle's end somewhere in the ocfs2 list
> ... if you could fix it, that would be great.  The usual real mailing
> lists didn't get this transformation
> 
> https://lore.kernel.org/bpf/20201017160928.12698-1-t...@redhat.com/
> 
> but the ocfs2 list archive did:
> 
> https://oss.oracle.com/pipermail/ocfs2-devel/2020-October/015330.html
> 
> I bet Oracle IT has put some spam filter on the list that mangles URLs
> this way.

*sigh*.  I'm sure there's a way.  I've raised it with someone who should
be able to fix it.


Re: [PATCH RFC PKS/PMEM 33/58] fs/cramfs: Utilize new kmap_thread()

2020-10-13 Thread Matthew Wilcox
On Tue, Oct 13, 2020 at 11:44:29AM -0700, Dan Williams wrote:
> On Fri, Oct 9, 2020 at 12:52 PM  wrote:
> >
> > From: Ira Weiny 
> >
> > The kmap() calls in this FS are localized to a single thread.  To avoid
> > the overhead of global PKRS updates use the new kmap_thread() call.
> >
> > Cc: Nicolas Pitre 
> > Signed-off-by: Ira Weiny 
> > ---
> >  fs/cramfs/inode.c | 10 +-
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
> > index 912308600d39..003c014a42ed 100644
> > --- a/fs/cramfs/inode.c
> > +++ b/fs/cramfs/inode.c
> > @@ -247,8 +247,8 @@ static void *cramfs_blkdev_read(struct super_block *sb, unsigned int offset,
> > struct page *page = pages[i];
> >
> > if (page) {
> > -   memcpy(data, kmap(page), PAGE_SIZE);
> > -   kunmap(page);
> > +   memcpy(data, kmap_thread(page), PAGE_SIZE);
> > +   kunmap_thread(page);
> 
> Why does this need a sleepable kmap? This looks like a textbook
> kmap_atomic() use case.

There's a lot of code of this form.  Could we perhaps have:

static inline void copy_to_highpage(struct page *to, void *vfrom,
				    unsigned int size)
{
	char *vto = kmap_atomic(to);

	memcpy(vto, vfrom, size);
	kunmap_atomic(vto);
}

in linux/highmem.h ?
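
(and presumably a matching helper for the copy-from direction that the
cramfs call site above actually needs -- sketch:)

static inline void copy_from_highpage(void *vto, struct page *from,
				      unsigned int size)
{
	char *vfrom = kmap_atomic(from);

	memcpy(vto, vfrom, size);
	kunmap_atomic(vfrom);
}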


Re: [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()

2020-10-13 Thread Matthew Wilcox
On Mon, Oct 12, 2020 at 12:53:54PM -0700, Ira Weiny wrote:
> On Mon, Oct 12, 2020 at 05:44:38PM +0100, Matthew Wilcox wrote:
> > On Mon, Oct 12, 2020 at 09:28:29AM -0700, Dave Hansen wrote:
> > > kmap_atomic() is always preferred over kmap()/kmap_thread().
> > > kmap_atomic() is _much_ more lightweight since its TLB invalidation is
> > > always CPU-local and never broadcast.
> > > 
> > > So, basically, unless you *must* sleep while the mapping is in place,
> > > kmap_atomic() is preferred.
> > 
> > But kmap_atomic() disables preemption, so the _ideal_ interface would map
> > it only locally, then on preemption make it global.  I don't even know
> > if that _can_ be done.  But this email makes it seem like kmap_atomic()
> > has no downsides.
> 
> And that is IIUC what Thomas was trying to solve.
> 
> Also, Linus brought up that kmap_atomic() has quirks in nesting.[1]
> 
> From what I can see all of these discussions support the need to have
> something between kmap() and kmap_atomic().
> 
> However, the reason behind converting call sites to kmap_thread() are 
> different
> between Thomas' patch set and mine.  Both require more kmap granularity.
> However, they do so with different reasons and underlying implementations but
> with the _same_ resulting semantics; a thread local mapping which is
> preemptable.[2]  Therefore they each focus on changing different call sites.
> 
> While this patch set is huge I think it serves a valuable purpose to identify 
> a
> large number of call sites which are candidates for this new semantic.

Yes, I agree.  My problem with this patch-set is that it ties it to
some Intel feature that almost nobody cares about.  Maybe we should
care about it, but you didn't try very hard to make anyone care about
it in the cover letter.

For a future patch-set, I'd like to see you just introduce the new
API.  Then you can optimise the Intel implementation of it afterwards.
Those patch-sets have entirely different reviewers.


Re: [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()

2020-10-12 Thread Matthew Wilcox
On Mon, Oct 12, 2020 at 09:28:29AM -0700, Dave Hansen wrote:
> kmap_atomic() is always preferred over kmap()/kmap_thread().
> kmap_atomic() is _much_ more lightweight since its TLB invalidation is
> always CPU-local and never broadcast.
> 
> So, basically, unless you *must* sleep while the mapping is in place,
> kmap_atomic() is preferred.

But kmap_atomic() disables preemption, so the _ideal_ interface would map
it only locally, then on preemption make it global.  I don't even know
if that _can_ be done.  But this email makes it seem like kmap_atomic()
has no downsides.


Re: [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()

2020-10-10 Thread Matthew Wilcox
On Fri, Oct 09, 2020 at 02:34:34PM -0700, Eric Biggers wrote:
> On Fri, Oct 09, 2020 at 12:49:57PM -0700, ira.we...@intel.com wrote:
> > The kmap() calls in this FS are localized to a single thread.  To avoid
> > the overhead of global PKRS updates use the new kmap_thread() call.
> >
> > @@ -2410,12 +2410,12 @@ static inline struct page *f2fs_pagecache_get_page(
> >  
> >  static inline void f2fs_copy_page(struct page *src, struct page *dst)
> >  {
> > -   char *src_kaddr = kmap(src);
> > -   char *dst_kaddr = kmap(dst);
> > +   char *src_kaddr = kmap_thread(src);
> > +   char *dst_kaddr = kmap_thread(dst);
> >  
> > memcpy(dst_kaddr, src_kaddr, PAGE_SIZE);
> > -   kunmap(dst);
> > -   kunmap(src);
> > +   kunmap_thread(dst);
> > +   kunmap_thread(src);
> >  }
> 
> Wouldn't it make more sense to switch cases like this to kmap_atomic()?
> The pages are only mapped to do a memcpy(), then they're immediately unmapped.

Maybe you missed the earlier thread from Thomas trying to do something
similar for rather different reasons ...

https://lore.kernel.org/lkml/20200919091751.06...@linutronix.de/


Re: [PATCH] mm/hmm: Simplify hmm_vma_walk_pud slightly

2020-03-13 Thread Matthew Wilcox
On Fri, Mar 13, 2020 at 04:55:50PM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 12, 2020 at 05:02:18PM +, Steven Price wrote:
> > On 12/03/2020 16:37, Jason Gunthorpe wrote:
> > > On Thu, Mar 12, 2020 at 04:16:33PM +, Steven Price wrote:
> > > > > Actually, while you are looking at this, do you think we should be
> > > > > adding at least READ_ONCE in the pagewalk.c walk_* functions? The
> > > > > multiple references of pmd, pud, etc without locking seems sketchy to
> > > > > me.
> > > > 
> > > > I agree it seems worrying. I'm not entirely sure whether the holding of
> > > > mmap_sem is sufficient,
> > > 
> > > I looked at this question, and at least for PMD, mmap_sem is not
> > > sufficient. I didn't easilly figure it out for the other ones
> > > 
> > > I'm guessing if PMD is not safe then none of them are.
> > > 
> > > > this isn't something that I changed so I've just
> > > > been hoping that it's sufficient since it seems to have been working
> > > > (whether that's by chance because the compiler didn't generate multiple
> > > > reads I've no idea). For walking the kernel's page tables the lack of
> > > > READ_ONCE is also not great, but at least for PTDUMP we don't care too 
> > > > much
> > > > about accuracy and it should be crash proof because there's no RCU grace
> > > > period. And again the code I was replacing didn't have any special
> > > > protection.
> > > > 
> > > > I can't see any harm in updating the code to include READ_ONCE and I'm 
> > > > happy
> > > > to review a patch.
> > > 
> > > The reason I ask is because hmm's walkers often have this pattern
> > > where they get the pointer and then de-ref it (again) then
> > > immediately have to recheck the 'again' conditions of the walker
> > > itself because the re-read may have given a different value.
> > > 
> > > Having the walker deref the pointer and pass the value it into the ops
> > > for use rather than repeatedly de-refing an unlocked value seems like
> > > a much safer design to me.
> > 
> > Yeah that sounds like a good idea.
> 
> I'm looking at this now.. The PUD is also changing under the read
> mmap_sem - and I was able to think up some race conditiony bugs
> related to this. Have some patches now..
> 
> However, I haven't been able to understand why walk_page_range()
> doesn't check pud_present() or pmd_present() before calling
> pmd_offset_map() or pte_offset_map().
> 
> As far as I can see a non-present entry has a swap entry encoded in
> it, and thus it seems like it is a bad idea to pass a non-present
> entry to the two map functions. I think those should only be called
> when the entry points to the next level in the page table  (so there
> is something to map?)
> 
> I see you added !present tests for the !vma case, but why only there?
> 
> Is this a bug? Do you know how it works?
> 
> Is it something that was missed when people added non-present PUD and
> PMD's?

... I'm sorry, I did what now?  As far as I can tell, you're talking
about mm/pagewalk.c, and the only commit I have in that file is
a00cc7d9dd93d66a3fb83fc52aa57a4bec51c517 ("mm, x86: add support for
PUD-sized transparent hugepages"), which I think I was pretty clear
from the commit message is basically copy-and-paste from the PMD code.
I have no clue why most of the decisions in the MM were made.


Re: [PATCH 00/34] put_user_pages(): miscellaneous call sites

2019-08-02 Thread Matthew Wilcox
On Fri, Aug 02, 2019 at 02:41:46PM +0200, Jan Kara wrote:
> On Fri 02-08-19 11:12:44, Michal Hocko wrote:
> > On Thu 01-08-19 19:19:31, john.hubb...@gmail.com wrote:
> > [...]
> > > 2) Convert all of the call sites for get_user_pages*(), to
> > > invoke put_user_page*(), instead of put_page(). This involves dozens of
> > > call sites, and will take some time.
> > 
> > How do we make sure this is the case and it will remain the case in the
> > future? There must be some automagic to enforce/check that. It is simply
> > not manageable to do it every now and then because then 3) will simply
> > be never safe.
> > 
> > Have you considered coccinele or some other scripted way to do the
> > transition? I have no idea how to deal with future changes that would
> > break the balance though.
> 
> Yeah, that's why I've been suggesting at LSF/MM that we may need to create
> a gup wrapper - say vaddr_pin_pages() - and track which sites dropping
> references got converted by using this wrapper instead of gup. The
> counterpart would then be more logically named as unpin_page() or whatever
> instead of put_user_page().  Sure this is not completely foolproof (you can
> create new callsite using vaddr_pin_pages() and then just drop refs using
> put_page()) but I suppose it would be a high enough barrier for missed
> conversions... Thoughts?

I think the API we really need is get_user_bvec() / put_user_bvec(),
and I know Christoph has been putting some work into that.  That avoids
doing refcount operations on hundreds of pages if the page in question is
a huge page.  Once people are switched over to that, they won't be tempted
to manually call put_page() on the individual constituent pages of a bvec.

[RFC] Rewrite page fault interception in TTM drivers

2018-04-16 Thread Matthew Wilcox

Three^W Two of the TTM drivers intercept the pagefault handler in a rather
un-Linuxy and racy way.  If they really must intercept the fault call
(and it's not clear to me that they need to), I'd rather see them do it
like this.

The QXL driver seems least likely to need it; as the virtio driver has
exactly the same code in it, only commented out.  The radeon driver takes
its own lock ... maybe there's a way to avoid that since no other driver
needs to take its own lock at this point?

diff --git a/drivers/gpu/drm/qxl/qxl_ttm.c b/drivers/gpu/drm/qxl/qxl_ttm.c
index ee2340e31f06..d2c76a3d6816 100644
--- a/drivers/gpu/drm/qxl/qxl_ttm.c
+++ b/drivers/gpu/drm/qxl/qxl_ttm.c
@@ -102,21 +102,23 @@ static void qxl_ttm_global_fini(struct qxl_device *qdev)
}
 }
 
-static struct vm_operations_struct qxl_ttm_vm_ops;
-static const struct vm_operations_struct *ttm_vm_ops;
-
 static int qxl_ttm_fault(struct vm_fault *vmf)
 {
struct ttm_buffer_object *bo;
-   int r;
 
bo = (struct ttm_buffer_object *)vmf->vma->vm_private_data;
if (bo == NULL)
return VM_FAULT_NOPAGE;
-   r = ttm_vm_ops->fault(vmf);
-   return r;
+   return ttm_bo_vm_fault(vmf);
 }
 
+static const struct vm_operations_struct qxl_ttm_vm_ops = {
+   .fault = qxl_ttm_fault,
+   .open = ttm_bo_vm_open,
+   .close = ttm_bo_vm_close,
+   .access = ttm_bo_vm_access,
+};
+
 int qxl_mmap(struct file *filp, struct vm_area_struct *vma)
 {
struct drm_file *file_priv;
@@ -139,11 +141,6 @@ int qxl_mmap(struct file *filp, struct vm_area_struct *vma)
r = ttm_bo_mmap(filp, vma, &qdev->mman.bdev);
if (unlikely(r != 0))
return r;
-   if (unlikely(ttm_vm_ops == NULL)) {
-   ttm_vm_ops = vma->vm_ops;
-   qxl_ttm_vm_ops = *ttm_vm_ops;
-   qxl_ttm_vm_ops.fault = &qxl_ttm_fault;
-   }
vma->vm_ops = &qxl_ttm_vm_ops;
return 0;
 }
diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
index 8689fcca051c..08cc0c5b9e94 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -944,9 +944,6 @@ void radeon_ttm_set_active_vram_size(struct radeon_device *rdev, u64 size)
man->size = size >> PAGE_SHIFT;
 }
 
-static struct vm_operations_struct radeon_ttm_vm_ops;
-static const struct vm_operations_struct *ttm_vm_ops = NULL;
-
 static int radeon_ttm_fault(struct vm_fault *vmf)
 {
struct ttm_buffer_object *bo;
@@ -959,11 +956,18 @@ static int radeon_ttm_fault(struct vm_fault *vmf)
}
rdev = radeon_get_rdev(bo->bdev);
down_read(&rdev->pm.mclk_lock);
-   r = ttm_vm_ops->fault(vmf);
+   r = ttm_bo_vm_fault(vmf);
up_read(&rdev->pm.mclk_lock);
return r;
 }
 
+static const struct vm_operations_struct radeon_ttm_vm_ops = {
+   .fault = radeon_ttm_fault,
+   .open = ttm_bo_vm_open,
+   .close = ttm_bo_vm_close,
+   .access = ttm_bo_vm_access,
+};
+
 int radeon_mmap(struct file *filp, struct vm_area_struct *vma)
 {
struct drm_file *file_priv;
@@ -983,11 +987,6 @@ int radeon_mmap(struct file *filp, struct vm_area_struct *vma)
if (unlikely(r != 0)) {
return r;
}
-   if (unlikely(ttm_vm_ops == NULL)) {
-   ttm_vm_ops = vma->vm_ops;
-   radeon_ttm_vm_ops = *ttm_vm_ops;
-   radeon_ttm_vm_ops.fault = &radeon_ttm_fault;
-   }
vma->vm_ops = &radeon_ttm_vm_ops;
return 0;
 }
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 8eba95b3c737..f59a8f41aae0 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -104,7 +104,7 @@ static unsigned long ttm_bo_io_mem_pfn(struct ttm_buffer_object *bo,
+ page_offset;
 }
 
-static int ttm_bo_vm_fault(struct vm_fault *vmf)
+int ttm_bo_vm_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
struct ttm_buffer_object *bo = (struct ttm_buffer_object *)
@@ -294,8 +294,9 @@ static int ttm_bo_vm_fault(struct vm_fault *vmf)
ttm_bo_unreserve(bo);
return ret;
 }
+EXPORT_SYMBOL_GPL(ttm_bo_vm_fault);
 
-static void ttm_bo_vm_open(struct vm_area_struct *vma)
+void ttm_bo_vm_open(struct vm_area_struct *vma)
 {
struct ttm_buffer_object *bo =
(struct ttm_buffer_object *)vma->vm_private_data;
@@ -304,14 +305,16 @@ static void ttm_bo_vm_open(struct vm_area_struct *vma)
 
(void)ttm_bo_reference(bo);
 }
+EXPORT_SYMBOL_GPL(ttm_bo_vm_open);
 
-static void ttm_bo_vm_close(struct vm_area_struct *vma)
+void ttm_bo_vm_close(struct vm_area_struct *vma)
 {
	struct ttm_buffer_object *bo = (struct ttm_buffer_object *)vma->vm_private_data;
 
ttm_bo_unref(&bo);
vma->vm_private_data = NULL;
 }
+EXPORT_SYMBOL_GPL(ttm_bo_vm_close);
 
 static int ttm_bo_vm_access_kmap(struct ttm_buffer_object *bo,