Re: [PATCH] KVM: SVM: Mark SEV launch secret pages as dirty.

2020-08-07 Thread David Rientjes
On Thu, 6 Aug 2020, Cfir Cohen wrote:

> The LAUNCH_SECRET command performs encryption of the
> launch secret memory contents. Mark pinned pages as
> dirty, before unpinning them.
> This matches the logic in sev_launch_update().
> 
> Signed-off-by: Cfir Cohen 

Acked-by: David Rientjes 


Re: [PATCH v3] mm/slab.c: add node spinlock protect in __cache_free_alien

2020-07-30 Thread David Rientjes
On Thu, 30 Jul 2020, qiang.zh...@windriver.com wrote:

> From: Zhang Qiang 
> 
> for example:
>                   node0
>      cpu0                                    cpu1
> slab_dead_cpu
>    >mutex_lock(&slab_mutex)
>    >cpuup_canceled                        slab_dead_cpu
>       >mask = cpumask_of_node(node)          >mutex_lock(&slab_mutex)
>       >n = get_node(cachep0, node0)
>       >spin_lock_irq(&n->list_lock)
>       >if (!cpumask_empty(mask)) == true
>          >spin_unlock_irq(&n->list_lock)
>          >goto free_slab
>
>    >mutex_unlock(&slab_mutex)
>
>                                           >cpuup_canceled
>                                              >mask = cpumask_of_node(node)
> kmem_cache_free(cachep0, objp)               >n = get_node(cachep0, node0)
>    >__cache_free_alien(cachep0, objp)        >spin_lock_irq(&n->list_lock)
>       >n = get_node(cachep0, node0)          >if (!cpumask_empty(mask)) == false
>       >if (n->alien && n->alien[page_node])  >alien = n->alien
>          >alien = n->alien[page_node]        >n->alien = NULL
>          >...                                >spin_unlock_irq(&n->list_lock)
>                                              >...
> 

As mentioned in the review of v1 of this patch, we likely want to do a fix 
for cpuup_canceled() instead.


Re: [PATCH] mm: slab: Avoid the use of one-element array and use struct_size() helper

2020-07-29 Thread David Rientjes
On Wed, 29 Jul 2020, Qianli Zhao wrote:

> From: Qianli Zhao 
> 
> There is a regular need in the kernel to provide a way to declare having a
> dynamically sized set of trailing elements in a structure. Kernel code should
> always use “flexible array members”[1] for these cases. The older style of
> one-element or zero-length arrays should no longer be used[2].
> 
> Make use of the struct_size() helper instead of an open-coded version
> in order to avoid any potential type mistakes.
> 
> [1] https://en.wikipedia.org/wiki/Flexible_array_member
> [2] https://github.com/KSPP/linux/issues/21
> 
> Signed-off-by: Qianli Zhao 
> ---
>  mm/slab.h| 2 +-
>  mm/slab_common.c | 7 ++-
>  2 files changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/slab.h b/mm/slab.h
> index 74f7e09..c12fb65 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -34,7 +34,7 @@ struct kmem_cache {
>  
>  struct memcg_cache_array {
>   struct rcu_head rcu;
> - struct kmem_cache *entries[0];
> + struct kmem_cache *entries[];
>  };
>  
>  /*

This is removed in the -mm tree, see 
https://www.ozlabs.org/~akpm/mmotm/broken-out/mm-memcg-slab-use-a-single-set-of-kmem_caches-for-all-allocations.patch

> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index fe8b684..56f4818 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -166,9 +166,7 @@ static int init_memcg_params(struct kmem_cache *s,
>   if (!memcg_nr_cache_ids)
>   return 0;
>  
> - arr = kvzalloc(sizeof(struct memcg_cache_array) +
> -memcg_nr_cache_ids * sizeof(void *),
> -GFP_KERNEL);
> + arr = kvzalloc(struct_size(arr, entries, memcg_nr_cache_ids), 
> GFP_KERNEL);
>   if (!arr)
>   return -ENOMEM;
>  
> @@ -199,8 +197,7 @@ static int update_memcg_params(struct kmem_cache *s, int 
> new_array_size)
>  {
>   struct memcg_cache_array *old, *new;
>  
> - new = kvzalloc(sizeof(struct memcg_cache_array) +
> -new_array_size * sizeof(void *), GFP_KERNEL);
> + new = kvzalloc(struct_size(new, entries, new_array_size), GFP_KERNEL);
>   if (!new)
>   return -ENOMEM;
>  
> -- 
> 2.7.4
> 
> 

Re: Reply: [PATCH] mm/slab.c: add node spinlock protect in __cache_free_alien

2020-07-29 Thread David Rientjes
On Wed, 29 Jul 2020, Zhang, Qiang wrote:

> > From: Zhang Qiang 
> >
> > We should add the node spinlock to protect "n->alien", which may be
> > assigned NULL in the cpuup_canceled func and cause an invalid
> > address access.
> >
> 
> >Hi, do you have an example NULL pointer dereference where you have hit
> >this?
> 

If you have a NULL pointer dereference or a GPF that occurred because of 
this, it would be helpful to provide as rationale.

> >This rather looks like something to fix up in cpuup_canceled() since it's
> >currently manipulating the alien cache for the canceled cpu's node.
> 
> Yes, it should be fixed up in cpuup_canceled(): it's currently
> manipulating the alien cache for the canceled cpu's node, which may be
> the same as the node being operated on in the __cache_free_alien func.
> 
> void cpuup_canceled
> {
> n = get_node(cachep, node);
> spin_lock_irq(&n->list_lock);
> ...
> n->alien = NULL;
> spin_unlock_irq(&n->list_lock);
>  
> }
> 

Right, so the idea is that this should be fixed in cpuup_canceled() 
instead -- why would we invaliate the entire node's alien cache because a 
single cpu failed to come online?
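
For illustration only, a very rough sketch of that direction -- keep the
per-cpu teardown but leave the node-wide caches alone (untested; function and
field names are from mm/slab.c, shared-cache and error details elided):

/* Sketch, not a tested patch: do not tear down n->alien or n->shared here. */
static void cpuup_canceled(long cpu)
{
	struct kmem_cache *cachep;
	int node = cpu_to_mem(cpu);

	list_for_each_entry(cachep, &slab_caches, list) {
		struct kmem_cache_node *n = get_node(cachep, node);
		struct array_cache *nc;
		LIST_HEAD(list);

		if (!n)
			continue;

		spin_lock_irq(&n->list_lock);
		n->free_limit -= cachep->batchcount;

		/* cpu is dead; no one can alloc from it. */
		nc = per_cpu_ptr(cachep->cpu_cache, cpu);
		free_block(cachep, nc->entry, nc->avail, node, &list);
		nc->avail = 0;

		/*
		 * Leave n->shared and n->alien in place even if this was the
		 * node's last online cpu: remote frees may still be looking
		 * at n->alien, and the caches are reused if a cpu comes back.
		 */
		spin_unlock_irq(&n->list_lock);
		slabs_destroy(cachep, &list);
	}
}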


Re: [PATCH] mm/slab.c: add node spinlock protect in __cache_free_alien

2020-07-28 Thread David Rientjes
On Tue, 28 Jul 2020, qiang.zh...@windriver.com wrote:

> From: Zhang Qiang 
> 
> We should add the node spinlock to protect "n->alien", which may be
> assigned NULL in the cpuup_canceled func and cause an invalid
> address access.
> 

Hi, do you have an example NULL pointer dereference where you have hit 
this?

This rather looks like something to fix up in cpuup_canceled() since it's 
currently manipulating the alien cache for the canceled cpu's node.

> Fixes: 18bf854117c6 ("slab: use get_node() and kmem_cache_node() functions")
> Signed-off-by: Zhang Qiang 
> ---
>  mm/slab.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/slab.c b/mm/slab.c
> index a89633603b2d..290523c90b4e 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -759,8 +759,10 @@ static int __cache_free_alien(struct kmem_cache *cachep, void *objp,
>  
>  	n = get_node(cachep, node);
>  	STATS_INC_NODEFREES(cachep);
> +	spin_lock(&n->list_lock);
>  	if (n->alien && n->alien[page_node]) {
>  		alien = n->alien[page_node];
> +		spin_unlock(&n->list_lock);
>  		ac = &alien->ac;
>  		spin_lock(&alien->lock);
>  		if (unlikely(ac->avail == ac->limit)) {
> @@ -769,14 +771,15 @@ static int __cache_free_alien(struct kmem_cache *cachep, void *objp,
>  		}
>  		ac->entry[ac->avail++] = objp;
>  		spin_unlock(&alien->lock);
> -		slabs_destroy(cachep, &list);
>  	} else {
> +		spin_unlock(&n->list_lock);
>  		n = get_node(cachep, page_node);
>  		spin_lock(&n->list_lock);
>  		free_block(cachep, &objp, 1, page_node, &list);
>  		spin_unlock(&n->list_lock);
> -		slabs_destroy(cachep, &list);
>  	}
> +
> +	slabs_destroy(cachep, &list);
>   return 1;
>  }
>  
> -- 
> 2.26.2
> 
> 


Re: [PATCH 2/4] dma-pool: Get rid of dma_in_atomic_pool()

2020-07-09 Thread David Rientjes
On Thu, 9 Jul 2020, Nicolas Saenz Julienne wrote:

> The function is only used once and can be simplified to a one-liner.
> 
> Signed-off-by: Nicolas Saenz Julienne 

I'll leave this one to Christoph to decide on.  One thing I really liked 
about hacking around in kernel/dma is the coding style: it really follows 
"one function does one thing and does it well" even if there is only one 
caller.  dma_in_atomic_pool() was an attempt to follow in those footsteps.


Re: [PATCH] dma-pool: use single atomic pool for both DMA zones

2020-07-09 Thread David Rientjes
On Wed, 8 Jul 2020, Christoph Hellwig wrote:

> On Wed, Jul 08, 2020 at 06:00:35PM +0200, Nicolas Saenz Julienne wrote:
> > On Wed, 2020-07-08 at 17:35 +0200, Christoph Hellwig wrote:
> > > On Tue, Jul 07, 2020 at 02:28:04PM +0200, Nicolas Saenz Julienne wrote:
> > > > When allocating atomic DMA memory for a device, the dma-pool core
> > > > queries __dma_direct_optimal_gfp_mask() to check which atomic pool to
> > > > use. It turns out the GFP flag returned is only an optimistic guess.
> > > > The pool selected might sometimes live in a zone higher than the
> > > > device's view of memory.
> > > > 
> > > > As there isn't a way to guarantee a mapping between a device's DMA
> > > > constraints and correct GFP flags this unifies both DMA atomic pools.
> > > > The resulting pool is allocated in the lower DMA zone available, if any,
> > > > so as for devices to always get accessible memory while having the
> > > > flexibility of using dma_pool_kernel for the non constrained ones.
> > > > 
> > > > Fixes: c84dc6e68a1d ("dma-pool: add additional coherent pools to map to 
> > > > gfp
> > > > mask")
> > > > Reported-by: Jeremy Linton 
> > > > Suggested-by: Robin Murphy 
> > > > Signed-off-by: Nicolas Saenz Julienne 
> > > 
> > > Hmm, this is not what I expected from the previous thread.  I thought
> > > we'd just use one dma pool based on runtime available of the zones..
> > 
> > I may be misunderstanding you, but isn't that going back to how things used 
> > to
> > be before pulling in David Rientjes' work? The benefit of having a 
> > GFP_KERNEL
> > pool is that non-address-constrained devices can get their atomic memory 
> > there,
> > instead of consuming somewhat scarcer low memory.
> 
> Yes, I think we are misunderstanding each other.  I don't want to remove
> any pool, just make better runtime decisions when to use them.
> 

Just to be extra explicit for the record and for my own understanding: 
Nicolas, your patch series "dma-pool: Fix atomic pool selection" obsoletes 
this patch, right? :)


Re: [PATCH] dma-pool: Do not allocate pool memory from CMA

2020-07-09 Thread David Rientjes
On Wed, 8 Jul 2020, Nicolas Saenz Julienne wrote:

> There is no guarantee to CMA's placement, so allocating a zone specific
> atomic pool from CMA might return memory from a completely different
> memory zone. So stop using it.
> 
> Fixes: c84dc6e68a1d ("dma-pool: add additional coherent pools to map to gfp 
> mask")
> Reported-by: Jeremy Linton 
> Signed-off-by: Nicolas Saenz Julienne 

Acked-by: David Rientjes 

Thanks Nicolas!


Re: [BUG] XHCI getting ZONE_DMA32 memory > than its bus_dma_limit

2020-07-05 Thread David Rientjes
On Fri, 3 Jul 2020, Robin Murphy wrote:

> > Just for the record the offending commit is: c84dc6e68a1d2 ("dma-pool: add
> > additional coherent pools to map to gfp mask").
> > 
> > On Thu, 2020-07-02 at 12:49 -0500, Jeremy Linton wrote:
> > > Hi,
> > > 
> > > Using 5.8rc3:
> > > 
> > > The rpi4 has a 3G dev->bus_dma_limit on its XHCI controller. With a usb3
> > > hub, plus a few devices plugged in, randomly devices will fail
> > > operations. This appears to because xhci_alloc_container_ctx() is
> > > getting buffers > 3G via dma_pool_zalloc().
> > > 
> > > Tracking that down, it seems to be caused by dma_alloc_from_pool() using
> > > dev_to_pool()->dma_direct_optimal_gfp_mask() to "optimistically" select
> > > the atomic_pool_dma32 but then failing to verify that the allocations in
> > > the pool are less than the dev bus_dma_limit.
> > 
> > I can reproduce this too.
> > 
> > The way I see it, dev_to_pool() wants a strict dma_direct_optimal_gfp_mask()
> > that is never wrong, since it's going to stick to that pool for the device's
> > lifetime. I've been looking at how to implement it, and it's not so trivial
> > as
> > I can't see a failproof way to make a distinction between who needs DMA32
> > and
> > who is OK with plain KERNEL memory.
> > 
> > Otherwise, as Jeremy points out, the patch needs to implement allocations
> > with
> > an algorithm similar to __dma_direct_alloc_pages()'s, which TBH I don't know
> > if
> > it's a little overkill for the atomic context.
> > 
> > Short of finding a fix in the coming rc's, I suggest we revert this.
> 
> Or perhaps just get rid of atomic_pool_dma32 (and allocate atomic_pool_dma
> from ZONE_DMA32 if !ZONE_DMA). That should make it fall pretty much back in
> line while still preserving the potential benefit of the kernel pool for
> non-address-constrained devices.
> 

I assume it depends on how often we have devices where 
__dma_direct_alloc_pages() behavior is required, i.e. what requires the 
dma_coherent_ok() checks and altering of the gfp flags to get memory that 
works.

Is the idea that getting rid of atomic_pool_dma32 would use GFP_KERNEL 
(and atomic_pool_kernel) as the default policy here?  That doesn't do any 
dma_coherent_ok() checks so dma_direct_alloc_pages would return from 
ZONE_NORMAL without a < 3G check?

It *seems* like we want to check if dma_coherent_ok() succeeds for ret in 
dma_direct_alloc_pages() when allocating from the atomic pool and, based 
on criteria that allows fallback, just fall into 
__dma_direct_alloc_pages()?
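
Something along these lines, maybe (sketch only; the helper name is made up
and the calls are simplified from kernel/dma/direct.c and kernel/dma/pool.c):

/*
 * Sketch: verify that what the atomic pool handed back actually fits the
 * device's addressing constraints, otherwise fall back to the slow path.
 */
static void *dma_direct_alloc_from_pool_checked(struct device *dev, size_t size,
						struct page **page, gfp_t gfp)
{
	void *ret = dma_alloc_from_pool(dev, size, page, gfp);

	/* The pool picked from the gfp mask is only an optimistic guess. */
	if (ret && !dma_coherent_ok(dev, page_to_phys(*page), size)) {
		dma_free_from_pool(dev, ret, size);
		ret = NULL;	/* caller then falls into __dma_direct_alloc_pages() */
	}
	return ret;
}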


Re: [PATCH 3/3] mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks

2020-07-01 Thread David Rientjes
On Wed, 1 Jul 2020, Dave Hansen wrote:

> On 7/1/20 1:04 PM, Ben Widawsky wrote:
> >> +static inline bool node_reclaim_enabled(void)
> >> +{
> >> +  /* Is any node_reclaim_mode bit set? */
> >> +  return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP);
> >> +}
> >> +
> >>  extern void check_move_unevictable_pages(struct pagevec *pvec);
> >>  
> >>  extern int kswapd_run(int nid);
> > If a user writes a bit that isn't a RECLAIM_* bit to vm.zone_reclaim_mode 
> > today, it acts as though RECLAIM_ZONE is enabled: we try to reclaim in 
> > zonelist order before falling back to the next zone in the page allocator.  
> > The sysctl doesn't enforce any max value :/  I don't know if there is any 
> > such user, but this would break them if there is.
> > 
> > Should this simply be return !!node_reclaim_mode?
> 
> You're right that there _could_ be a user-visible behavior change here.
>  But, if there were a change it would be for a bit which wasn't even
> mentioned in the documentation.  Somebody would have had to look at the
> doc mentioning 1,2,4 and written an 8.  If they did that, they're asking
> for trouble because we could have defined the '8' bit to do nasty things
> like auto-demote all your memory. :)
> 
> I'll mention it in the changelog, but I still think we should check the
> actual, known bits rather than check for 0.
> 
> BTW, in the hardware, they almost invariably make unused bits "reserved"
> and do mean things like #GP if someone tries to set them.  This is a
> case where the kernel probably should have done the same.  It would have
> saved us the trouble of asking these questions now.  Maybe we should
> even do that going forward.
> 

Maybe enforce it in a sysctl handler so the user catches any errors, which 
would be better than silently accepting some policy that doesn't exist?

RECLAIM_UNMAP and/or RECLAIM_WRITE should likely get -EINVAL if attempted 
to be set without RECLAIM_ZONE as well: they are no-ops without 
RECLAIM_ZONE.  This would likely have caught something wrong with commit 
648b5cf368e0 ("mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE") if it 
would have already been in place.

I don't feel strongly about this, so feel free to ignore.
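
For illustration, a rough sketch of such a handler (untested; the exact
proc_handler signature and the wiring into kernel/sysctl.c's vm_table are
glossed over):

static int sysctl_zone_reclaim_mode(struct ctl_table *table, int write,
				    void *buffer, size_t *length, loff_t *ppos)
{
	struct ctl_table tmp = *table;
	int mode = node_reclaim_mode;
	int ret;

	/* Parse into a local copy so a rejected write leaves the mode untouched. */
	tmp.data = &mode;
	ret = proc_dointvec(&tmp, write, buffer, length, ppos);
	if (ret || !write)
		return ret;

	/* Reject bits that do not correspond to any RECLAIM_* flag. */
	if (mode & ~(RECLAIM_ZONE | RECLAIM_WRITE | RECLAIM_UNMAP))
		return -EINVAL;
	/* RECLAIM_WRITE and RECLAIM_UNMAP are no-ops without RECLAIM_ZONE. */
	if ((mode & (RECLAIM_WRITE | RECLAIM_UNMAP)) && !(mode & RECLAIM_ZONE))
		return -EINVAL;

	node_reclaim_mode = mode;
	return 0;
}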


Re: [PATCH 2/3] mm/vmscan: move RECLAIM* bits to uapi header

2020-07-01 Thread David Rientjes
On Wed, 1 Jul 2020, Dave Hansen wrote:

> 
> From: Dave Hansen 
> 
> It is currently not obvious that the RECLAIM_* bits are part of the
> uapi since they are defined in vmscan.c.  Move them to a uapi header
> to make it obvious.
> 
> This should have no functional impact.
> 
> Signed-off-by: Dave Hansen 
> Cc: Ben Widawsky 
> Cc: Alex Shi 
> Cc: Daniel Wagner 
> Cc: "Tobin C. Harding" 
> Cc: Christoph Lameter 
> Cc: Andrew Morton 
> Cc: Huang Ying 
> Cc: Dan Williams 
> Cc: Qian Cai 
> Cc: Daniel Wagner 

Acked-by: David Rientjes 


Re: [PATCH 1/3] mm/vmscan: restore zone_reclaim_mode ABI

2020-07-01 Thread David Rientjes
On Wed, 1 Jul 2020, Dave Hansen wrote:

> 
> From: Dave Hansen 
> 
> I went to go add a new RECLAIM_* mode for the zone_reclaim_mode
> sysctl.  Like a good kernel developer, I also went to go update the
> documentation.  I noticed that the bits in the documentation didn't
> match the bits in the #defines.
> 
> The VM never explicitly checks the RECLAIM_ZONE bit.  The bit is,
> however implicitly checked when checking 'node_reclaim_mode==0'.
> The RECLAIM_ZONE #define was removed in a cleanup.  That, by itself
> is fine.
> 
> But, when the bit was removed (bit 0) the _other_ bit locations also
> got changed.  That's not OK because the bit values are documented to
> mean one specific thing and users surely rely on them meaning that one
> thing and not changing from kernel to kernel.  The end result is that
> if someone had a script that did:
> 
>   sysctl vm.zone_reclaim_mode=1
> 
> That script went from doing nothing to writing out pages during
> node reclaim after the commit in question.  That's not great.
> 
> Put the bits back the way they were and add a comment so something
> like this is a bit harder to do again.  Update the documentation to
> make it clear that the first bit is ignored.
> 
> Signed-off-by: Dave Hansen 
> Fixes: 648b5cf368e0 ("mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE")
> Cc: Ben Widawsky 
> Cc: Alex Shi 
> Cc: Daniel Wagner 
> Cc: "Tobin C. Harding" 
> Cc: Christoph Lameter 
> Cc: Andrew Morton 
> Cc: Huang Ying 
> Cc: Dan Williams 
> Cc: Qian Cai 
> Cc: Daniel Wagner 
> Cc: sta...@vger.kernel.org

Acked-by: David Rientjes 


Re: [PATCH 3/3] mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks

2020-07-01 Thread David Rientjes
On Wed, 1 Jul 2020, Dave Hansen wrote:

> diff -puN include/linux/swap.h~mm-vmscan-node_reclaim_mode_helper 
> include/linux/swap.h
> --- a/include/linux/swap.h~mm-vmscan-node_reclaim_mode_helper 2020-07-01 
> 08:22:13.650955330 -0700
> +++ b/include/linux/swap.h2020-07-01 08:22:13.659955330 -0700
> @@ -12,6 +12,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  struct notifier_block;
> @@ -374,6 +375,12 @@ extern int sysctl_min_slab_ratio;
>  #define node_reclaim_mode 0
>  #endif
>  
> +static inline bool node_reclaim_enabled(void)
> +{
> + /* Is any node_reclaim_mode bit set? */
> + return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP);
> +}
> +
>  extern void check_move_unevictable_pages(struct pagevec *pvec);
>  
>  extern int kswapd_run(int nid);

If a user writes a bit that isn't a RECLAIM_* bit to vm.zone_reclaim_mode 
today, it acts as though RECLAIM_ZONE is enabled: we try to reclaim in 
zonelist order before falling back to the next zone in the page allocator.  
The sysctl doesn't enforce any max value :/  I don't know if there is any 
such user, but this would break them if there is.

Should this simply be return !!node_reclaim_mode?


Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

2020-07-01 Thread David Rientjes
On Wed, 1 Jul 2020, Dave Hansen wrote:

> Even if they don't allocate directly from PMEM, is it OK for such an app
> to get its cold data migrated to PMEM?  That's a much more subtle
> question and I suspect the kernel isn't going to have a single answer
> for it.  I suspect we'll need a cpuset-level knob to turn auto-demotion
> on or off.
> 

I think the answer depends on whether the app's cold data could otherwise be 
reclaimed: compared to reclaim, migration to PMEM is likely better in terms 
of performance.  So any such app today should just be mlocking its cold data 
if it can't handle the overhead of reclaim?
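
(Userspace sketch of the mlock approach, for illustration only; the helper
name is made up:)

#include <stdio.h>
#include <sys/mman.h>

/* Pin a buffer so neither reclaim nor any demotion scheme touches it. */
static int pin_cold_data(void *addr, size_t len)
{
	/* Needs a sufficient RLIMIT_MEMLOCK or CAP_IPC_LOCK. */
	if (mlock(addr, len)) {
		perror("mlock");
		return -1;
	}
	return 0;
}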


Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

2020-07-01 Thread David Rientjes
On Wed, 1 Jul 2020, Yang Shi wrote:

> > We can do this if we consider pmem not to be a separate memory tier from
> > the system perspective, however, but rather the socket perspective.  In
> > other words, a node can only demote to a series of exclusive pmem ranges
> > and promote to the same series of ranges in reverse order.  So DRAM node 0
> > can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM
> > node 3 -- a pmem range cannot be demoted to, or promoted from, more than
> > one DRAM node.
> > 
> > This naturally takes care of mbind() and cpuset.mems if we consider pmem
> > just to be slower volatile memory and we don't need to deal with the
> > latency concerns of cross socket migration.  A user page will never be
> > demoted to a pmem range across the socket and will never be promoted to a
> > different DRAM node that it doesn't have access to.
> 
> But I don't see too much benefit to limit the migration target to the
> so-called *paired* pmem node. IMHO it is fine to migrate to a remote (on a
> different socket) pmem node since even the cross socket access should be much
> faster then refault or swap from disk.
> 

Hi Yang,

Right, but any eventual promotion path would allow this to subvert the 
user mempolicy or cpuset.mems if the demoted memory is eventually promoted 
to a DRAM node on its socket.  We've discussed not having the ability to 
map from the demoted page to either of these contexts and it becomes more 
difficult for shared memory.  We have page_to_nid() and page_zone() so we 
can always find the appropriate demotion or promotion node for a given 
page if there is a 1:1 relationship.

Do we lose anything with the strict 1:1 relationship between DRAM and PMEM 
nodes?  It seems much simpler in terms of implementation and is more 
intuitive.
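
To illustrate the 1:1 pairing (node_demotion[] and the helpers below are
made-up names for this sketch, not existing kernel API):

/* Illustrative only: each DRAM node demotes to exactly one PMEM node. */
static int node_demotion[MAX_NUMNODES] = {
	[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE,	/* e.g. [0] = 2, [1] = 3 */
};

static int next_demotion_node(int node)
{
	return node_demotion[node];
}

static int next_promotion_node(struct page *page)
{
	int pmem_node = page_to_nid(page);
	int node;

	/* Reverse the mapping: only the paired DRAM node may promote this page. */
	for_each_node_state(node, N_CPU)
		if (node_demotion[node] == pmem_node)
			return node;
	return NUMA_NO_NODE;
}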

> I think using pmem as a node is more natural than zone and less intrusive
> since we can just reuse all the numa APIs. If we treat pmem as a new zone I
> think the implementation may be more intrusive and complicated (i.e. need a
> new gfp flag) and user can't control the memory placement.
> 

This is an important decision to make, I'm not sure that we actually 
*want* all of these NUMA APIs :)  If my memory is demoted, I can simply do 
migrate_pages() back to DRAM and cause other memory to be demoted in its 
place.  Things like MPOL_INTERLEAVE over nodes {0,1,2} don't make sense.  
Kswapd for a DRAM node putting pressure on a PMEM node for demotion that 
then puts the kswapd for the PMEM node under pressure to reclaim it serves 
*only* to spend unnecessary cpu cycles.

Users could control the memory placement through a new mempolicy flag, 
which I think are needed anyway for explicit allocation policies for PMEM 
nodes.  Consider if PMEM is a zone so that it has the natural 1:1 
relationship with DRAM, now your system only has nodes {0,1} as today, no 
new NUMA topology to consider, and a mempolicy flag MPOL_F_TOPTIER that 
specifies memory must be allocated from ZONE_MOVABLE or ZONE_NORMAL (and I 
can then mlock() if I want to disable demotion on memory pressure).


Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

2020-07-01 Thread David Rientjes
On Wed, 1 Jul 2020, Dave Hansen wrote:

> > Could this cause us to break a user's mbind() or allow a user to 
> > circumvent their cpuset.mems?
> 
> In its current form, yes.
> 
> My current rationale for this is that while it's not as deferential as
> it can be to the user/kernel ABI contract, it's good *overall* behavior.
>  The auto-migration only kicks in when the data is about to go away.  So
> while the user's data might be slower than they like, it is *WAY* faster
> than they deserve because it should be off on the disk.
> 

It's outside the scope of this patchset, but eventually there will be a 
promotion path that I think requires a strict 1:1 relationship between 
DRAM and PMEM nodes because otherwise mbind(), set_mempolicy(), and 
cpuset.mems become ineffective for nodes facing memory pressure.

For the purposes of this patchset, agreed that DRAM -> PMEM -> swap makes 
perfect sense.  Theoretically, I think you could have DRAM N0 and N1 and 
then a single PMEM N2 and this N2 can be the terminal node for both N0 and 
N1.  On promotion, I think we need to rely on something stronger than 
autonuma to decide which DRAM node to promote to: specifically any user 
policy put into effect (memory tiering or autonuma shouldn't be allowed to 
subvert these user policies).

As others have mentioned, we lose the allocation or process context at the 
time of demotion or promotion and any workaround for that requires some 
hacks, such as mapping the page to cpuset (what is the right solution for 
shared pages?) or adding NUMA locality handling to memcg.

I think a 1:1 relationship between DRAM and PMEM nodes is required if we 
consider the eventual promotion of this memory so that user memory can't 
eventually reappear on a DRAM node that is not allowed by mbind(), 
set_mempolicy(), or cpuset.mems.  I think it also makes this patchset much 
simpler.

> > Because we don't have a mapping of the page back to its allocation 
> > context (or the process context in which it was allocated), it seems like 
> > both are possible.
> > 
> > So let's assume that migration nodes cannot be other DRAM nodes.  
> > Otherwise, memory pressure could be intentionally or unintentionally 
> > induced to migrate these pages to another node.  Do we have such a 
> > restriction on migration nodes?
> 
> There's nothing explicit.  On a normal, balanced system where there's a
> 1:1:1 relationship between CPU sockets, DRAM nodes and PMEM nodes, it's
> implicit since the migration path is one deep and goes from DRAM->PMEM.
> 
> If there were some oddball system where there was a memory only DRAM
> node, it might very well end up being a migration target.
> 

Shouldn't DRAM->DRAM demotion be banned?  It's all DRAM and within the 
control of mempolicies and cpusets today, so I had assumed this is outside 
the scope of memory tiering support.  I had assumed that memory tiering 
support was all about separate tiers :)

> >> +static struct page *alloc_demote_node_page(struct page *page, unsigned 
> >> long node)
> >> +{
> >> +  /*
> >> +   * 'mask' targets allocation only to the desired node in the
> >> +   * migration path, and fails fast if the allocation can not be
> >> +   * immediately satisfied.  Reclaim is already active and heroic
> >> +   * allocation efforts are unwanted.
> >> +   */
> >> +  gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
> >> +  __GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
> >> +  __GFP_MOVABLE;
> > 
> > GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we 
> > actually want to kick kswapd on the pmem node?
> 
> In my mental model, cold data flows from:
> 
>   DRAM -> PMEM -> swap
> 
> Kicking kswapd here ensures that while we're doing DRAM->PMEM migrations
> for kinda cold data, kswapd can be working on doing the PMEM->swap part
> on really cold data.
> 

Makes sense.


Re: [PATCH v3] mm, slab: Check GFP_SLAB_BUG_MASK before alloc_pages in kmalloc_order

2020-07-01 Thread David Rientjes
On Wed, 1 Jul 2020, Long Li wrote:

> diff --git a/mm/slab.c b/mm/slab.c
> index ac7a223d9ac3..2850fe3c5fb8 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -2573,13 +2573,9 @@ static struct page *cache_grow_begin(struct kmem_cache 
> *cachep,
>* Be lazy and only check for valid flags here,  keeping it out of the
>* critical path in kmem_cache_alloc().
>*/
> - if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
> - gfp_t invalid_mask = flags & GFP_SLAB_BUG_MASK;
> - flags &= ~GFP_SLAB_BUG_MASK;
> - pr_warn("Unexpected gfp: %#x (%pGg). Fixing up to gfp: %#x 
> (%pGg). Fix your code!\n",
> - invalid_mask, _mask, flags, );
> - dump_stack();
> - }
> + if (unlikely(flags & GFP_SLAB_BUG_MASK))
> + flags = kmalloc_invalid_flags(flags);
> +
>   WARN_ON_ONCE(cachep->ctor && (flags & __GFP_ZERO));
>   local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
>  

Is this name correct?  kmalloc_invalid_flags() masks off the invalid flags 
and returns the valid flags :)  Maybe kmalloc_check_flags()?
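
i.e., something like this, keeping the body being factored out but under a
name that says what the helper actually does (sketch only):

/* Warn about and strip invalid flags, returning the usable ones. */
gfp_t kmalloc_check_flags(gfp_t flags)
{
	gfp_t invalid_mask = flags & GFP_SLAB_BUG_MASK;

	flags &= ~GFP_SLAB_BUG_MASK;
	pr_warn("Unexpected gfp: %#x (%pGg). Fixing up to gfp: %#x (%pGg). Fix your code!\n",
		invalid_mask, &invalid_mask, flags, &flags);
	dump_stack();

	return flags;
}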


Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

2020-06-30 Thread David Rientjes
On Tue, 30 Jun 2020, Yang Shi wrote:

> > > From: Dave Hansen 
> > > 
> > > If a memory node has a preferred migration path to demote cold pages,
> > > attempt to move those inactive pages to that migration node before
> > > reclaiming. This will better utilize available memory, provide a faster
> > > tier than swapping or discarding, and allow such pages to be reused
> > > immediately without IO to retrieve the data.
> > > 
> > > When handling anonymous pages, this will be considered before swap if
> > > enabled. Should the demotion fail for any reason, the page reclaim
> > > will proceed as if the demotion feature was not enabled.
> > > 
> > Thanks for sharing these patches and kick-starting the conversation, Dave.
> > 
> > Could this cause us to break a user's mbind() or allow a user to
> > circumvent their cpuset.mems?
> > 
> > Because we don't have a mapping of the page back to its allocation
> > context (or the process context in which it was allocated), it seems like
> > both are possible.
> 
> Yes, this could break the memory placement policy enforced by mbind and
> cpuset. I discussed this with Michal on mailing list and tried to find a way
> to solve it, but unfortunately it seems not easy as what you mentioned above.
> The memory policy and cpuset is stored in task_struct rather than mm_struct.
> It is not easy to trace back to task_struct from page (owner field of
> mm_struct might be helpful, but it depends on CONFIG_MEMCG and is not
> preferred way).
> 

Yeah, and Ying made a similar response to this message.

We can do this if we consider pmem not to be a separate memory tier from 
the system perspective, however, but rather the socket perspective.  In 
other words, a node can only demote to a series of exclusive pmem ranges 
and promote to the same series of ranges in reverse order.  So DRAM node 0 
can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM 
node 3 -- a pmem range cannot be demoted to, or promoted from, more than 
one DRAM node.

This naturally takes care of mbind() and cpuset.mems if we consider pmem 
just to be slower volatile memory and we don't need to deal with the 
latency concerns of cross socket migration.  A user page will never be 
demoted to a pmem range across the socket and will never be promoted to a 
different DRAM node that it doesn't have access to.

That can work with the NUMA abstraction for pmem, but it could also 
theoretically be a new memory zone instead.  If all memory living on pmem 
is migratable (the natural way that memory hotplug is done, so we can 
offline), this zone would live above ZONE_MOVABLE.  Zonelist ordering 
would determine whether we can allocate directly from this memory based on 
system config or a new gfp flag that could be set for users of a mempolicy 
that allows allocations directly from pmem.  If abstracted as a NUMA node 
instead, interleave over nodes {0,2,3} or a cpuset.mems of {0,2,3} doesn't 
make much sense.

Kswapd would need to be enlightened for proper pgdat and pmem balancing 
but in theory it should be simpler because it only has its own node to 
manage.  Existing per-zone watermarks might be easy to use to fine tune 
the policy from userspace: the scale factor determines how much memory we 
try to keep free on DRAM for migration from pmem, for example.  We also 
wouldn't have to deal with node hotplug or updating of demotion/promotion 
node chains.

Maybe the strongest advantage of the node abstraction is the ability to 
use autonuma and migrate_pages()/move_pages() API for moving pages 
explicitly?  Mempolicies could be used for migration to "top-tier" memory, 
i.e. ZONE_NORMAL or ZONE_MOVABLE, instead.


Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

2020-06-30 Thread David Rientjes
On Mon, 29 Jun 2020, Dave Hansen wrote:

> From: Dave Hansen 
> 
> If a memory node has a preferred migration path to demote cold pages,
> attempt to move those inactive pages to that migration node before
> reclaiming. This will better utilize available memory, provide a faster
> tier than swapping or discarding, and allow such pages to be reused
> immediately without IO to retrieve the data.
> 
> When handling anonymous pages, this will be considered before swap if
> enabled. Should the demotion fail for any reason, the page reclaim
> will proceed as if the demotion feature was not enabled.
> 

Thanks for sharing these patches and kick-starting the conversation, Dave.

Could this cause us to break a user's mbind() or allow a user to 
circumvent their cpuset.mems?

Because we don't have a mapping of the page back to its allocation 
context (or the process context in which it was allocated), it seems like 
both are possible.

So let's assume that migration nodes cannot be other DRAM nodes.  
Otherwise, memory pressure could be intentionally or unintentionally 
induced to migrate these pages to another node.  Do we have such a 
restriction on migration nodes?

> Some places we would like to see this used:
> 
>   1. Persistent memory being as a slower, cheaper DRAM replacement
>   2. Remote memory-only "expansion" NUMA nodes
>   3. Resolving memory imbalances where one NUMA node is seeing more
>  allocation activity than another.  This helps keep more recent
>  allocations closer to the CPUs on the node doing the allocating.
> 

(3) is the concerning one given the above if we are to use 
migrate_demote_mapping() for DRAM node balancing.

> Yang Shi's patches used an alternative approach where to-be-discarded
> pages were collected on a separate discard list and then discarded
> as a batch with migrate_pages().  This results in simpler code and
> has all the performance advantages of batching, but has the
> disadvantage that pages which fail to migrate never get swapped.
> 
> #Signed-off-by: Keith Busch 
> Signed-off-by: Dave Hansen 
> Cc: Keith Busch 
> Cc: Yang Shi 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 
> ---
> 
>  b/include/linux/migrate.h|6 
>  b/include/trace/events/migrate.h |3 +-
>  b/mm/debug.c |1 
>  b/mm/migrate.c   |   52 
> +++
>  b/mm/vmscan.c|   25 ++
>  5 files changed, 86 insertions(+), 1 deletion(-)
> 
> diff -puN 
> include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard
>  include/linux/migrate.h
> --- 
> a/include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard
>2020-06-29 16:34:38.950312604 -0700
> +++ b/include/linux/migrate.h 2020-06-29 16:34:38.963312604 -0700
> @@ -25,6 +25,7 @@ enum migrate_reason {
>   MR_MEMPOLICY_MBIND,
>   MR_NUMA_MISPLACED,
>   MR_CONTIG_RANGE,
> + MR_DEMOTION,
>   MR_TYPES
>  };
>  
> @@ -78,6 +79,7 @@ extern int migrate_huge_page_move_mappin
> struct page *newpage, struct page *page);
>  extern int migrate_page_move_mapping(struct address_space *mapping,
>   struct page *newpage, struct page *page, int extra_count);
> +extern int migrate_demote_mapping(struct page *page);
>  #else
>  
>  static inline void putback_movable_pages(struct list_head *l) {}
> @@ -104,6 +106,10 @@ static inline int migrate_huge_page_move
>   return -ENOSYS;
>  }
>  
> +static inline int migrate_demote_mapping(struct page *page)
> +{
> + return -ENOSYS;
> +}
>  #endif /* CONFIG_MIGRATION */
>  
>  #ifdef CONFIG_COMPACTION
> diff -puN 
> include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard
>  include/trace/events/migrate.h
> --- 
> a/include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard
> 2020-06-29 16:34:38.952312604 -0700
> +++ b/include/trace/events/migrate.h  2020-06-29 16:34:38.963312604 -0700
> @@ -20,7 +20,8 @@
>   EM( MR_SYSCALL, "syscall_or_cpuset")\
>   EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind")  \
>   EM( MR_NUMA_MISPLACED,  "numa_misplaced")   \
> - EMe(MR_CONTIG_RANGE,"contig_range")
> + EM( MR_CONTIG_RANGE,"contig_range") \
> + EMe(MR_DEMOTION,"demotion")
>  
>  /*
>   * First define the enums in the above macros to be exported to userspace
> diff -puN 
> mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 
> mm/debug.c
> --- a

Re: [PATCH v4 01/26] mm: Do page fault accounting in handle_mm_fault

2020-06-30 Thread David Rientjes
On Tue, 30 Jun 2020, Peter Xu wrote:

> @@ -4408,6 +4440,34 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, 
> unsigned long address,
>   mem_cgroup_oom_synchronize(false);
>   }
>  
> + if (ret & (VM_FAULT_RETRY | VM_FAULT_ERROR))
> + return ret;
> +
> + /*
> +  * Do accounting in the common code, to avoid unnecessary
> +  * architecture differences or duplicated code.
> +  *
> +  * We arbitrarily make the rules be:
> +  *
> +  *  - Unsuccessful faults do not count (e.g. when the address wasn't
> +  *valid). That includes arch_vma_access_permitted() failing above.
> +  *
> +  *So this is expressly not a "this many hardware page faults"
> +  *counter. Use the hw profiling for that.
> +  *
> +  *  - Incomplete faults do not count (e.g. RETRY).  They will only
> +  *count once completed.
> +  *
> +  *  - The fault counts as a "major" fault when the final successful
> +  *fault is VM_FAULT_MAJOR, or if it was a retry (which implies that
> +  *we couldn't handle it immediately previously).
> +  *
> +  *  - If the fault is done for GUP, regs will be NULL and no accounting
> +  *will be done.
> +  */
> + mm_account_fault(regs, address, (ret & VM_FAULT_MAJOR) ||
> +  (flags & FAULT_FLAG_TRIED));
> +
>   return ret;
>  }
>  EXPORT_SYMBOL_GPL(handle_mm_fault);

Just a nit, likely not important: I wonder if it would be cleaner to pass 
the vm_fault_t into mm_account_fault() and then do the VM_FAULT_RETRY and
VM_FAULT_ERROR checks there as well as putting the comment about how 
accounting is handled in that function.  Your comment is great.
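
Roughly, a sketch of that (names reused from the patch above, details elided):

static void mm_account_fault(struct pt_regs *regs, unsigned long address,
			     unsigned int flags, vm_fault_t ret)
{
	bool major;

	/* Incomplete (retried) and failed faults are not accounted. */
	if (ret & (VM_FAULT_RETRY | VM_FAULT_ERROR))
		return;
	/* No accounting when the fault is done on behalf of GUP (no regs). */
	if (!regs)
		return;

	major = (ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED);
	if (major) {
		current->maj_flt++;
		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
	} else {
		current->min_flt++;
		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
	}
}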


Re: ERROR: "min_low_pfn" undefined!

2020-06-29 Thread David Rientjes
On Tue, 30 Jun 2020, kernel test robot wrote:

> Hi Alexander,
> 
> FYI, the error/warning still remains.
> 
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
> master
> head:   7c30b859a947535f2213277e827d7ac7dcff9c84
> commit: f220df66f67684246ae1bf4a4e479efc7c2f325a intel_th: msu-sink: An 
> example msu buffer "sink"
> date:   11 months ago
> config: microblaze-randconfig-c023-20200629 (attached as .config)
> compiler: microblaze-linux-gcc (GCC) 9.3.0
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
> 
> All errors (new ones prefixed by >>):
> 
>ERROR: "min_low_pfn" [drivers/rpmsg/virtio_rpmsg_bus.ko] undefined!
> >> ERROR: "min_low_pfn" [drivers/hwtracing/intel_th/intel_th_msu_sink.ko] 
> >> undefined!
>ERROR: "min_low_pfn" [drivers/hwtracing/intel_th/intel_th_msu.ko] 
> undefined!
>ERROR: "min_low_pfn" [drivers/mmc/core/mmc_core.ko] undefined!
>ERROR: "min_low_pfn" [drivers/md/dm-crypt.ko] undefined!
>ERROR: "min_low_pfn" [drivers/net/wireless/ath/ath6kl/ath6kl_sdio.ko] 
> undefined!
>ERROR: "min_low_pfn" [crypto/tcrypt.ko] undefined!
>ERROR: "min_low_pfn" [crypto/asymmetric_keys/asym_tpm.ko] undefined!
> 

Looks like we have precedent for min_low_pfn failures and we can't blame 
this poor commit :)

I wonder why we don't simply do EXPORT_SYMBOL() in mm/memblock.c, 
non-microblaze architectures do this themselves because it doesn't get 
generically exported:

arch/ia64/kernel/ia64_ksyms.c:EXPORT_SYMBOL(min_low_pfn);   /* defined by 
bootmem.c, but not exported by generic code */
arch/sh/kernel/sh_ksyms_32.c:EXPORT_SYMBOL(min_low_pfn);
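
i.e., something like this in mm/memblock.c (sketch; whether EXPORT_SYMBOL or
EXPORT_SYMBOL_GPL is appropriate is part of the question):

/* mm/memblock.c already defines these; only the export would be new. */
unsigned long min_low_pfn;
EXPORT_SYMBOL(min_low_pfn);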

Adding Mike Rapoport.


Re:Re: [PATCH] mm: remove the redundancy code

2020-06-29 Thread David Rientjes
On Tue, 30 Jun 2020, 苏辉 wrote:

> I am sorry that I did not consider the memory hotplug case,
> and I think we should add a new param to distinguish the two different cases.
> 

No need, we can simply continue setting zone->zone_start_pfn unless there 
is a bug to be fixed (and, if so, please send a bug report).  Removing it 
would not improve anything and would only increase risk.

Re: [PATCH] mm: remove the redundancy code

2020-06-29 Thread David Rientjes
On Tue, 30 Jun 2020, Su Hui wrote:

> remove the redundancy code, the zone_start_pfn
> is assigned from zone->zone_start_pfn
> Signed-off-by: Su Hui 

I don't think this is redundant, it's used by memory hotplug when onlining 
new memory.

> ---
>  mm/page_alloc.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3c4eb750a199..3372a8c9fbc4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6215,8 +6215,6 @@ void __meminit init_currently_empty_zone(struct zone 
> *zone,
>   if (zone_idx > pgdat->nr_zones)
>   pgdat->nr_zones = zone_idx;
>  
> - zone->zone_start_pfn = zone_start_pfn;
> -
>   mminit_dprintk(MMINIT_TRACE, "memmap_init",
>   "Initialising map node %d zone %lu pfns %lu -> %lu\n",
>   pgdat->node_id,


Re: [patch] dma-pool: warn when coherent pool is depleted

2020-06-27 Thread David Rientjes
On Sun, 21 Jun 2020, Guenter Roeck wrote:

> > When a DMA coherent pool is depleted, allocation failures may or may not
> > get reported in the kernel log depending on the allocator.
> > 
> > The admin does have a workaround, however, by using coherent_pool= on the
> > kernel command line.
> > 
> > Provide some guidance on the failure and a recommended minimum size for
> > the pools (double the size).
> > 
> > Signed-off-by: David Rientjes 
> 
> Tested-by: Guenter Roeck 
> 
> Also confirmed that coherent_pool=256k works around the crash
> I had observed.
> 

Thanks Guenter.  Christoph, does it make sense to apply this patch since 
there may not be an artifact left behind in the kernel log on allocation 
failure by the caller?

> Guenter
> 
> > ---
> >  kernel/dma/pool.c | 6 +-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
> > --- a/kernel/dma/pool.c
> > +++ b/kernel/dma/pool.c
> > @@ -239,12 +239,16 @@ void *dma_alloc_from_pool(struct device *dev, size_t 
> > size,
> > }
> >  
> > val = gen_pool_alloc(pool, size);
> > -   if (val) {
> > +   if (likely(val)) {
> > phys_addr_t phys = gen_pool_virt_to_phys(pool, val);
> >  
> > *ret_page = pfn_to_page(__phys_to_pfn(phys));
> > ptr = (void *)val;
> > memset(ptr, 0, size);
> > +   } else {
> > +   WARN_ONCE(1, "DMA coherent pool depleted, increase size "
> > +"(recommended min coherent_pool=%zuK)\n",
> > + gen_pool_size(pool) >> 9);
> > }
> > if (gen_pool_avail(pool) < atomic_pool_size)
> > schedule_work(&atomic_pool_work);
> 


Re: [mm, slub] c91e241f56: WARNING:at_mm/slub.c:#kmem_cache_open

2020-06-27 Thread David Rientjes
On Sat, 27 Jun 2020, kernel test robot wrote:

> Greeting,
> 
> FYI, we noticed the following commit (built with gcc-9):
> 
> commit: c91e241f569e7f9b0e2946841ef884b22a09f624 ("mm, slub: introduce 
> kmem_cache_debug_flags()-fix")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> 
> in testcase: kernel-selftests
> with following parameters:
> 
>   group: kselftests-x86
> 
> test-description: The kernel contains a set of "self tests" under the 
> tools/testing/selftests/ directory. These are intended to be small unit tests 
> to exercise individual code paths in the kernel.
> test-url: https://www.kernel.org/doc/Documentation/kselftest.txt
> 
> 
> on test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> 
> caused below changes (please refer to attached dmesg/kmsg for entire 
> log/backtrace):
> 
> 
> +---+++
> |   | d5cee50015 | c91e241f56 |
> +---+++
> | boot_successes| 13 | 0  |
> | boot_failures | 0  | 17 |
> | WARNING:at_mm/slub.c:#kmem_cache_open | 0  | 17 |
> | EIP:kmem_cache_open   | 0  | 17 |
> | WARNING:at_mm/slub.c:#new_slab| 0  | 17 |
> | EIP:new_slab  | 0  | 17 |
> | WARNING:at_mm/slub.c:#___slab_alloc   | 0  | 17 |
> | EIP:___slab_alloc | 0  | 17 |
> | WARNING:at_mm/slub.c:#__slab_free | 0  | 17 |
> | EIP:__slab_free   | 0  | 17 |
> | WARNING:at_mm/slub.c:#deactivate_slab | 0  | 4  |
> | EIP:deactivate_slab   | 0  | 4  |
> +---+++
> 
> 
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot 
> 
> 
> [1.934611] WARNING: CPU: 0 PID: 0 at mm/slub.c:132 
> kmem_cache_open+0x1cc/0x346
> [1.934612] Modules linked in:
> [1.934614] CPU: 0 PID: 0 Comm: swapper Not tainted 
> 5.8.0-rc1-00274-gc91e241f569e7 #1
> [1.934615] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.12.0-1 04/01/2014
> [1.934615] EIP: kmem_cache_open+0x1cc/0x346
> [1.934617] Code: 83 ca ff b9 02 00 00 00 0f bd c7 0f 44 c2 99 f7 f9 ba 0a 
> 00 00 00 83 f8 0a 0f 47 c2 ba 05 00 00 00 83 f8 05 0f 42 c2 89 43 08 <0f> 0b 
> 81 ff ff 0f 00 00 76 09 c7 43 18 02 00 00 00 eb 29 81 ff ff
> [1.934618] EAX: 0005 EBX: 8d26c820 ECX: 0002 EDX: 0005
> [1.934619] ESI: 0040 EDI: 0040 EBP: 8d181eec ESP: 8d181ed0
> [1.934620] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210093
> [1.934621] CR0: 80050033 CR2: ffbff000 CR3: 0d2bc000 CR4: 000406b0
> [1.934622] DR0:  DR1:  DR2:  DR3: 
> [1.934622] DR6: fffe0ff0 DR7: 0400
> [1.934623] Call Trace:
> [1.934624]  __kmem_cache_create+0x1f/0x100
> [1.934624]  ? vprintk_emit+0x1c4/0x1ce
> [1.934625]  ? vprintk_default+0x12/0x14
> [1.934626]  create_boot_cache+0x59/0x79
> [1.934626]  kmem_cache_init+0x48/0x11d
> [1.934627]  start_kernel+0x217/0x424
> [1.934628]  ? early_idt_handler_common+0x44/0x44
> [1.934628]  i386_start_kernel+0x43/0x45
> [1.934629]  startup_32_smp+0x164/0x168
> [1.934630] ---[ end trace 19a0735aef7d3dec ]---
> 

I think this is fixed by 
mm-slab-slub-improve-error-reporting-and-overhead-of-cache_from_obj-fix.patch 
in -mm, the config here does not have CONFIG_SLUB_DEBUG enabled so the 
VM_WARN_ON_ONCE() would always be triggered without this fix.


Re: [PATCH v8 1/4] mm/madvise: pass task and mm to do_madvise

2020-06-24 Thread David Rientjes
On Mon, 22 Jun 2020, Minchan Kim wrote:

> Patch series "introduce memory hinting API for external process", v8.
> 
> Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API.  With
> that, application could give hints to kernel what memory range are
> preferred to be reclaimed.  However, in some platform(e.g., Android), the
> information required to make the hinting decision is not known to the app.
> Instead, it is known to a centralized userspace daemon(e.g.,
> ActivityManagerService), and that daemon must be able to initiate reclaim
> on its own without any app involvement.
> 
> To solve the concern, this patch introduces new syscall -
> process_madvise(2).  Basically, it's the same as the madvise(2) syscall but it
> has some differences.
> 
> 1. It needs pidfd of target process to provide the hint
> 
> 2.  It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMERGEABLE} at this
>moment.  Other hints in madvise will be opened when there are explicit
>requests from community to prevent unexpected bugs we couldn't support.
> 
> 3.  Only privileged processes can do something for other process's
>address space.
> 
> For more detail of the new API, please see "mm: introduce external memory
> hinting API" description in this patchset.
> 
> This patch (of 4):
> 
> In upcoming patches, do_madvise will be called from external process
> context so we shouldn't assume "current" is always the hinted process's
> task_struct.
> 
> Furthermore, we must not access mm_struct via task->mm, but obtain it
> via access_mm() once (in the following patch) and only use that pointer
> [1], so pass it to do_madvise() as well.  Note the vma->vm_mm pointers
> are safe, so we can use them further down the call stack.
> 
> And let's pass *current* and current->mm as arguments of do_madvise so
> it shouldn't change existing behavior but prepare next patch to make
> review easy.
> 
> Note: io_madvise passes NULL as target_task argument of do_madvise because
> it couldn't know who is target.
> 
> [1] 
> http://lore.kernel.org/r/CAG48ez27=pwm5m_n_988xt1huo7g7h6artql44zev6td-h-...@mail.gmail.com
> 
> [vba...@suse.cz: changelog tweak]
> [minc...@kernel.org: use current->mm for io_uring]
>   Link: http://lkml.kernel.org/r/20200423145215.72666-1-minc...@kernel.org
> [a...@linux-foundation.org: fix it for upstream changes]
> [a...@linux-foundation.org: whoops]
> [rdun...@infradead.org: add missing includes]
> Link: http://lkml.kernel.org/r/20200302193630.68771-2-minc...@kernel.org
> Signed-off-by: Minchan Kim 
> Reviewed-by: Suren Baghdasaryan 
> Reviewed-by: Vlastimil Babka 
> Cc: Jens Axboe 
> Cc: Jann Horn 
> Cc: Tim Murray 
> Cc: Daniel Colascione 
> Cc: Sandeep Patil 
> Cc: Sonny Rao 
> Cc: Brian Geffon 
> Cc: Michal Hocko 
> Cc: Johannes Weiner 
> Cc: Shakeel Butt 
> Cc: John Dias 
> Cc: Joel Fernandes 
> Cc: Alexander Duyck 
> Cc: SeongJae Park 
> Cc: Christian Brauner 
> Cc: Kirill Tkhai 
> Cc: Oleksandr Natalenko 
> Cc: SeongJae Park 
> Cc: Christian Brauner 
> Cc: 

Acked-by: David Rientjes 


Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API

2020-06-24 Thread David Rientjes
On Mon, 22 Jun 2020, Minchan Kim wrote:

> diff --git a/mm/madvise.c b/mm/madvise.c
> index 551ed816eefe..23abca3f93fa 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -995,6 +996,18 @@ madvise_behavior_valid(int behavior)
>   }
>  }
>  
> +static bool
> +process_madvise_behavior_valid(int behavior)
> +{
> + switch (behavior) {
> + case MADV_COLD:
> + case MADV_PAGEOUT:
> + return true;
> + default:
> + return false;
> + }
> +}
> +
>  /*
>   * The madvise(2) system call.
>   *
> @@ -1042,6 +1055,11 @@ madvise_behavior_valid(int behavior)
>   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
>   *   from being included in its core dump.
>   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> + *  MADV_COLD - the application is not expected to use this memory soon,
> + *   deactivate pages in this range so that they can be reclaimed
> + *   easily if memory pressure happens.
> + *  MADV_PAGEOUT - the application is not expected to use this memory soon,
> + *   page out the pages in this range immediately.
>   *
>   * return values:
>   *  zero- success
> @@ -1176,3 +1194,106 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, 
> size_t, len_in, int, behavior)
>  {
>   return do_madvise(current, current->mm, start, len_in, behavior);
>  }
> +
> +static int process_madvise_vec(struct task_struct *target_task,
> + struct mm_struct *mm, struct iov_iter *iter, int behavior)
> +{
> + struct iovec iovec;
> + int ret = 0;
> +
> + while (iov_iter_count(iter)) {
> + iovec = iov_iter_iovec(iter);
> + ret = do_madvise(target_task, mm, (unsigned long)iovec.iov_base,
> + iovec.iov_len, behavior);
> + if (ret < 0)
> + break;
> + iov_iter_advance(iter, iovec.iov_len);
> + }
> +
> + return ret;
> +}
> +
> +static ssize_t do_process_madvise(int pidfd, struct iov_iter *iter,
> + int behavior, unsigned int flags)
> +{
> + ssize_t ret;
> + struct pid *pid;
> + struct task_struct *task;
> + struct mm_struct *mm;
> + size_t total_len = iov_iter_count(iter);
> +
> + if (flags != 0)
> + return -EINVAL;
> +
> + pid = pidfd_get_pid(pidfd);
> + if (IS_ERR(pid))
> + return PTR_ERR(pid);
> +
> + task = get_pid_task(pid, PIDTYPE_PID);
> + if (!task) {
> + ret = -ESRCH;
> + goto put_pid;
> + }
> +
> + if (task->mm != current->mm &&
> + !process_madvise_behavior_valid(behavior)) {
> + ret = -EINVAL;
> + goto release_task;
> + }
> +
> + mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> + if (IS_ERR_OR_NULL(mm)) {
> + ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> + goto release_task;
> + }
> 

mm is always task->mm right?  I'm wondering if it would be better to find 
the mm directly in process_madvise_vec() rather than passing it into the 
function.  I'm not sure why we'd pass both task and mm here.

> +
> + ret = process_madvise_vec(task, mm, iter, behavior);
> + if (ret >= 0)
> + ret = total_len - iov_iter_count(iter);
> +
> + mmput(mm);
> +release_task:
> + put_task_struct(task);
> +put_pid:
> + put_pid(pid);
> + return ret;
> +}
> +
> +SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, 
> vec,
> + unsigned long, vlen, int, behavior, unsigned int, flags)

I love the idea of adding the flags parameter here and I can think of an 
immediate use case for MADV_HUGEPAGE, which is overloaded.

Today, MADV_HUGEPAGE controls enablement depending on system config and 
controls defrag behavior based on system config.  It also cannot be opted 
out of without setting MADV_NOHUGEPAGE :)

I was thinking of a flag that users could use to trigger an immediate 
collapse in process context regardless of the system config.

So I'm a big advocate of this flags parameter and consider it an absolute 
must for the API.
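
For illustration, what a caller might eventually look like -- everything below
(the syscall number, the MADV_F_COLLAPSE flag, the wrapper) is hypothetical,
not merged API:

#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef __NR_process_madvise
#define __NR_process_madvise	440		/* assumed number, illustration only */
#endif
#define MADV_F_COLLAPSE		(1U << 0)	/* hypothetical flag */

/* Ask the kernel to collapse the target range into huge pages right away. */
static long hint_collapse(int pidfd, void *addr, size_t len)
{
	struct iovec iov = { .iov_base = addr, .iov_len = len };

	return syscall(__NR_process_madvise, pidfd, &iov, 1UL,
		       MADV_HUGEPAGE, MADV_F_COLLAPSE);
}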

Acked-by: David Rientjes 


Re: [PATCH v8 2/4] pid: move pidfd_get_pid() to pid.c

2020-06-24 Thread David Rientjes
On Mon, 22 Jun 2020, Minchan Kim wrote:

> process_madvise syscall needs pidfd_get_pid function to translate pidfd to
> pid so this patch move the function to kernel/pid.c.
> 
> Link: http://lkml.kernel.org/r/20200302193630.68771-5-minc...@kernel.org
> Signed-off-by: Minchan Kim 
> Reviewed-by: Suren Baghdasaryan 
> Suggested-by: Alexander Duyck 
> Reviewed-by: Alexander Duyck 
> Acked-by: Christian Brauner 
> Reviewed-by: Vlastimil Babka 
> Cc: Jens Axboe 
> Cc: Jann Horn 
> Cc: Brian Geffon 
> Cc: Daniel Colascione 
> Cc: Joel Fernandes 
> Cc: Johannes Weiner 
> Cc: John Dias 
> Cc: Kirill Tkhai 
> Cc: Michal Hocko 
> Cc: Oleksandr Natalenko 
> Cc: Sandeep Patil 
> Cc: SeongJae Park 
> Cc: SeongJae Park 
> Cc: Shakeel Butt 
> Cc: Sonny Rao 
> Cc: Tim Murray 
> Cc: Christian Brauner 
> Cc: 

Acked-by: David Rientjes 


Re: [PATCH v8 4/4] mm/madvise: check fatal signal pending of target process

2020-06-24 Thread David Rientjes
On Mon, 22 Jun 2020, Minchan Kim wrote:

> Bail out to prevent unnecessary CPU overhead if target process has pending
> fatal signal during (MADV_COLD|MADV_PAGEOUT) operation.
> 
> Link: http://lkml.kernel.org/r/20200302193630.68771-4-minc...@kernel.org
> Signed-off-by: Minchan Kim 
> Reviewed-by: Suren Baghdasaryan 
> Reviewed-by: Vlastimil Babka 
> Cc: Alexander Duyck 
> Cc: Brian Geffon 
> Cc: Christian Brauner 
> Cc: Daniel Colascione 
> Cc: Jann Horn 
> Cc: Jens Axboe 
> Cc: Joel Fernandes 
> Cc: Johannes Weiner 
> Cc: John Dias 
> Cc: Kirill Tkhai 
> Cc: Michal Hocko 
> Cc: Oleksandr Natalenko 
> Cc: Sandeep Patil 
> Cc: SeongJae Park 
> Cc: SeongJae Park 
> Cc: Shakeel Butt 
> Cc: Sonny Rao 
> Cc: Tim Murray 
> Cc: Christian Brauner 
> Cc: 

Acked-by: David Rientjes 


[patch] dma-pool: warn when coherent pool is depleted

2020-06-21 Thread David Rientjes
When a DMA coherent pool is depleted, allocation failures may or may not
get reported in the kernel log depending on the allocator.

The admin does have a workaround, however, by using coherent_pool= on the
kernel command line.

Provide some guidance on the failure and a recommended minimum size for
the pools (double the size).

Signed-off-by: David Rientjes 
---
 kernel/dma/pool.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
--- a/kernel/dma/pool.c
+++ b/kernel/dma/pool.c
@@ -239,12 +239,16 @@ void *dma_alloc_from_pool(struct device *dev, size_t size,
}
 
val = gen_pool_alloc(pool, size);
-   if (val) {
+   if (likely(val)) {
phys_addr_t phys = gen_pool_virt_to_phys(pool, val);
 
*ret_page = pfn_to_page(__phys_to_pfn(phys));
ptr = (void *)val;
memset(ptr, 0, size);
+   } else {
+   WARN_ONCE(1, "DMA coherent pool depleted, increase size "
+"(recommended min coherent_pool=%zuK)\n",
+ gen_pool_size(pool) >> 9);
}
if (gen_pool_avail(pool) < atomic_pool_size)
schedule_work(&atomic_pool_work);


Re: [PATCH v2] dma-pool: Fix too large DMA pools on medium systems

2020-06-21 Thread David Rientjes
On Sun, 21 Jun 2020, Guenter Roeck wrote:

> >> This patch results in a boot failure in some of my powerpc boot tests,
> >> specifically those testing boots from mptsas1068 devices. Error message:
> >>
> >> mptsas :00:02.0: enabling device ( -> 0002)
> >> mptbase: ioc0: Initiating bringup
> >> ioc0: LSISAS1068 A0: Capabilities={Initiator}
> >> mptbase: ioc0: ERROR - Unable to allocate Reply, Request, Chain Buffers!
> >> mptbase: ioc0: ERROR - didn't initialize properly! (-3)
> >> mptsas: probe of :00:02.0 failed with error -3
> >>
> >> Configuration is bamboo:44x/bamboo_defconfig plus various added drivers.
> >> Qemu command line is
> >>
> >> qemu-system-ppc -kernel vmlinux -M bamboo \
> >>  -m 256 -no-reboot -snapshot -device mptsas1068,id=scsi \
> >>  -device scsi-hd,bus=scsi.0,drive=d0,wwn=0x5000c50015ea71ac -drive \
> >>  file=rootfs.ext2,format=raw,if=none,id=d0 \
> >>  --append "panic=-1 slub_debug=FZPUA root=/dev/sda  mem=256M 
> >> console=ttyS0" \
> >>  -monitor none -nographic
> >>
> >> canyonlands_defconfig with sam460ex machine and otherwise similar command 
> >> line
> >> fails as well.
> >>
> >> Reverting this patch fixes the problem.
> > 
> > This looks like the minimum value of 128 KiB is not sufficient, and the
> > bug is in the intention of 1d659236fb43c4d2 ("dma-pool: scale the
> > default DMA coherent pool size with memory capacity")?
> > Before, there was a single pool of (fixed) 256 KiB size, now there are
> > up to three coherent pools (DMA, DMA32, and kernel), albeit of smaller
> > size (128 KiB each).
> > 
> > Can you print the requested size in drivers/message/fusion/mptbase.c:
> > PrimeIocFifos()?
> 
> 172928 bytes
> 
> > Does replacing all SZ_128K by SZ_256K in my patch help?
> 
> Yes, it does.
> 

The new coherent pools should auto-expand when they are close to being 
depleted, but there's no guarantee that this can happen fast enough.  
Switching the minimum back to the previous min size (256KB) seems like the 
best option, and it matches what 
Documentation/admin-guide/kernel-parameters.txt still says.

I'll also send a patch to point in the right direction when this happens.
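
To make the sizing discussion concrete, here is a minimal sketch of the
default calculation as I read this thread (not the actual patch): scale the
pool with RAM at roughly 128 KiB per GiB, but never drop below the
historical 256 KiB minimum.

	/* needs <linux/mm.h> and <linux/sizes.h> */
	static unsigned long __init dma_default_pool_size(void)
	{
		/* ~128 KiB of coherent pool per GiB of RAM */
		unsigned long pages = totalram_pages() / (SZ_1G / SZ_128K);

		/* keep the pre-5.8 256 KiB floor so existing users still fit */
		return max_t(unsigned long, pages << PAGE_SHIFT, SZ_256K);
	}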


Re: kernel BUG at mm/huge_memory.c:2613!

2020-06-21 Thread David Rientjes
On Fri, 19 Jun 2020, Roman Gushchin wrote:

> > > [   40.287524] BUG: unable to handle page fault for address: 
> > > a77b833df000
> > > [   40.287529] #PF: supervisor write access in kernel mode
> > > [   40.287531] #PF: error_code(0x000b) - reserved bit violation
> > > [   40.287532] PGD 40d14e067 P4D 40d14e067 PUD 40d14f067 PMD 3ec54d067
> > > PTE 80001688033d9163
> > > [   40.287538] Oops: 000b [#2] SMP NOPTI
> > > [   40.287542] CPU: 9 PID: 1986 Comm: pulseaudio Tainted: G  D
> > >   5.8.0-rc1+ #697
> > > [   40.287544] Hardware name: Gigabyte Technology Co., Ltd.
> > > AB350-Gaming/AB350-Gaming-CF, BIOS F25 01/16/2019
> > > [   40.287550] RIP: 0010:__memset+0x24/0x30
> > > [   40.287553] Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89
> > > d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48
> > > 0f af c6  48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89
> > > d1 f3
> > > [   40.287556] RSP: 0018:a77b827a7e08 EFLAGS: 00010216
> > > [   40.287558] RAX:  RBX: 90f77dced800 RCX: 
> > > 08a0
> > > [   40.287560] RDX:  RSI:  RDI: 
> > > a77b833df000
> > > [   40.287561] RBP: 90f7898c7000 R08: 90f78c507768 R09: 
> > > a77b833df000
> > > [   40.287563] R10: a77b833df000 R11: 90f7839f2d40 R12: 
> > > 
> > > [   40.287564] R13: 90f76a802e00 R14: c0cb8880 R15: 
> > > 90f770f4e700
> > > [   40.287567] FS:  7f3d8e8df880() GS:90f78ee4()
> > > knlGS:
> > > [   40.287569] CS:  0010 DS:  ES:  CR0: 80050033
> > > [   40.287570] CR2: a77b833df000 CR3: 0003fa556000 CR4: 
> > > 003406e0
> > > [   40.287572] Call Trace:
> > > [   40.287584]  snd_pcm_hw_params+0x3fd/0x490 [snd_pcm]
> > > [   40.287593]  snd_pcm_common_ioctl+0x1c5/0x1110 [snd_pcm]
> > > [   40.287601]  ? snd_pcm_info_user+0x64/0x80 [snd_pcm]
> > > [   40.287608]  snd_pcm_ioctl+0x23/0x30 [snd_pcm]
> > > [   40.287613]  ksys_ioctl+0x82/0xc0
> > > [   40.287617]  __x64_sys_ioctl+0x16/0x20
> > > [   40.287622]  do_syscall_64+0x4d/0x90
> > > [   40.287627]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > Hi Roman,
> > 
> > If you have CONFIG_AMD_MEM_ENCRYPT set, this should be resolved by
> > 
> > commit dbed452a078d56bc7f1abecc3edd6a75e8e4484e
> > Author: David Rientjes 
> > Date:   Thu Jun 11 00:25:57 2020 -0700
> > 
> > dma-pool: decouple DMA_REMAP from DMA_COHERENT_POOL
> > 
> > Or you might want to wait for 5.8-rc2 instead which includes this fix.
> > 
> 
> Hello, David!
> 
> Thank you for pointing at it! Unfortunately, there must be something wrong
> with drivers, your patch didn't help much. I still see the same stacktrace.
> 

Wow, I'm very surprised.  Do you have CONFIG_AMD_MEM_ENCRYPT enabled?

Adding Takashi.



Re: kernel BUG at mm/huge_memory.c:2613!

2020-06-19 Thread David Rientjes
On Fri, 19 Jun 2020, Roman Gushchin wrote:

> [   40.287524] BUG: unable to handle page fault for address: a77b833df000
> [   40.287529] #PF: supervisor write access in kernel mode
> [   40.287531] #PF: error_code(0x000b) - reserved bit violation
> [   40.287532] PGD 40d14e067 P4D 40d14e067 PUD 40d14f067 PMD 3ec54d067
> PTE 80001688033d9163
> [   40.287538] Oops: 000b [#2] SMP NOPTI
> [   40.287542] CPU: 9 PID: 1986 Comm: pulseaudio Tainted: G  D
>   5.8.0-rc1+ #697
> [   40.287544] Hardware name: Gigabyte Technology Co., Ltd.
> AB350-Gaming/AB350-Gaming-CF, BIOS F25 01/16/2019
> [   40.287550] RIP: 0010:__memset+0x24/0x30
> [   40.287553] Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89
> d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48
> 0f af c6  48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89
> d1 f3
> [   40.287556] RSP: 0018:a77b827a7e08 EFLAGS: 00010216
> [   40.287558] RAX:  RBX: 90f77dced800 RCX: 
> 08a0
> [   40.287560] RDX:  RSI:  RDI: 
> a77b833df000
> [   40.287561] RBP: 90f7898c7000 R08: 90f78c507768 R09: 
> a77b833df000
> [   40.287563] R10: a77b833df000 R11: 90f7839f2d40 R12: 
> 
> [   40.287564] R13: 90f76a802e00 R14: c0cb8880 R15: 
> 90f770f4e700
> [   40.287567] FS:  7f3d8e8df880() GS:90f78ee4()
> knlGS:
> [   40.287569] CS:  0010 DS:  ES:  CR0: 80050033
> [   40.287570] CR2: a77b833df000 CR3: 0003fa556000 CR4: 
> 003406e0
> [   40.287572] Call Trace:
> [   40.287584]  snd_pcm_hw_params+0x3fd/0x490 [snd_pcm]
> [   40.287593]  snd_pcm_common_ioctl+0x1c5/0x1110 [snd_pcm]
> [   40.287601]  ? snd_pcm_info_user+0x64/0x80 [snd_pcm]
> [   40.287608]  snd_pcm_ioctl+0x23/0x30 [snd_pcm]
> [   40.287613]  ksys_ioctl+0x82/0xc0
> [   40.287617]  __x64_sys_ioctl+0x16/0x20
> [   40.287622]  do_syscall_64+0x4d/0x90
> [   40.287627]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Hi Roman,

If you have CONFIG_AMD_MEM_ENCRYPT set, this should be resolved by

commit dbed452a078d56bc7f1abecc3edd6a75e8e4484e
Author: David Rientjes 
Date:   Thu Jun 11 00:25:57 2020 -0700

dma-pool: decouple DMA_REMAP from DMA_COHERENT_POOL

Or you might want to wait for 5.8-rc2 instead which includes this fix.


Re: [PATCH] dma-mapping: DMA_COHERENT_POOL should select GENERIC_ALLOCATOR

2020-06-18 Thread David Rientjes
On Thu, 18 Jun 2020, Christoph Hellwig wrote:

> The dma coherent pool code needs genalloc.  Move the select over
> from DMA_REMAP, which doesn't actually need it.
> 
> Fixes: dbed452a078d ("dma-pool: decouple DMA_REMAP from DMA_COHERENT_POOL")
> Reported-by: kernel test robot 

Acked-by: David Rientjes 

Thanks Christoph.  In the initial bug report from Alex Xu, his .config set 
CONFIG_GENERIC_ALLOCATOR already, so I think there is very little risk with 
this patch.


Re: Linux 5.8-rc1 BUG unable to handle page fault (snd_pcm)

2020-06-15 Thread David Rientjes
On Mon, 15 Jun 2020, Shuah Khan wrote:

> I am seeing the following problem on my system. I haven't started debug
> yet. Is this a known issue?
> 
> [9.791309] BUG: unable to handle page fault for address: b1e78165d000
> [9.791328] #PF: supervisor write access in kernel mode
> [9.791330] #PF: error_code(0x000b) - reserved bit violation
> [9.791332] PGD 23dd5c067 P4D 23dd5c067 PUD 23dd5d067 PMD 22ba8e067 PTE
> 80001a3681509163
> [9.791337] Oops: 000b [#1] SMP NOPTI
> [9.791340] CPU: 7 PID: 866 Comm: pulseaudio Not tainted 5.8.0-rc1 #1
> [9.791341] Hardware name: LENOVO 10VGCTO1WW/3130, BIOS M1XKT45A 08/21/2019
> [9.791346] RIP: 0010:__memset+0x24/0x30
> [9.791348] Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2
> 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6  48
> ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
> [9.791350] RSP: 0018:b1e7817a7dd0 EFLAGS: 00010216
> [9.791352] RAX:  RBX: 97b32dfd7400 RCX:
> 08a0
> [9.791354] RDX:  RSI:  RDI:
> b1e78165d000
> [9.791356] RBP: b1e7817a7e00 R08: b1e78000 R09:
> b1e78165d000
> [9.791358] R10:  R11: b1e78165d000 R12:
> 
> [9.791359] R13: 97b32dfd3000 R14: c0b48880 R15:
> 97b33aa42600
> [9.791361] FS:  7fa11cb34ec0() GS:97b33edc()
> knlGS:
> [9.791363] CS:  0010 DS:  ES:  CR0: 80050033
> [9.791365] CR2: b1e78165d000 CR3: 000210db6000 CR4:
> 003406e0
> [9.791367] Call Trace:
> [9.791377]  ? snd_pcm_hw_params+0x3ca/0x440 [snd_pcm]
> [9.791383]  snd_pcm_common_ioctl+0x173/0xf20 [snd_pcm]
> [9.791389]  ? snd_ctl_ioctl+0x1c5/0x710 [snd]
> [9.791394]  snd_pcm_ioctl+0x27/0x40 [snd_pcm]
> [9.791398]  ksys_ioctl+0x9d/0xd0
> [9.791400]  __x64_sys_ioctl+0x1a/0x20
> [9.791404]  do_syscall_64+0x49/0xc0
> [9.791406]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [9.791408] RIP: 0033:0x7fa11d4c137b
> [9.791410] Code: Bad RIP value.

Hi Shuah, do you have CONFIG_AMD_MEM_ENCRYPT enabled?

If so, could you try 
http://git.infradead.org/users/hch/dma-mapping.git/commitdiff/dbed452a078d56bc7f1abecc3edd6a75e8e4484e


[patch for-5.8 3/4] dma-direct: check return value when encrypting or decrypting memory

2020-06-11 Thread David Rientjes
__change_page_attr() can fail which will cause set_memory_encrypted() and
set_memory_decrypted() to return non-zero.

If the device requires unencrypted DMA memory and decryption fails, simply
free the memory and fail.

If attempting to re-encrypt in the failure path and that encryption fails,
there is no alternative other than to leak the memory.

Fixes: c10f07aa27da ("dma/direct: Handle force decryption for DMA coherent
buffers in common code")
Cc: sta...@vger.kernel.org # 4.17+
Signed-off-by: David Rientjes 
---
 kernel/dma/direct.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -158,6 +158,7 @@ void *dma_direct_alloc_pages(struct device *dev, size_t 
size,
 {
struct page *page;
void *ret;
+   int err;
 
size = PAGE_ALIGN(size);
 
@@ -210,8 +211,12 @@ void *dma_direct_alloc_pages(struct device *dev, size_t 
size,
}
 
ret = page_address(page);
-   if (force_dma_unencrypted(dev))
-   set_memory_decrypted((unsigned long)ret, 1 << get_order(size));
+   if (force_dma_unencrypted(dev)) {
+   err = set_memory_decrypted((unsigned long)ret,
+  1 << get_order(size));
+   if (err)
+   goto out_free_pages;
+   }
 
memset(ret, 0, size);
 
@@ -229,9 +234,13 @@ void *dma_direct_alloc_pages(struct device *dev, size_t 
size,
*dma_handle = phys_to_dma(dev, page_to_phys(page));
return ret;
 out_encrypt_pages:
-   if (force_dma_unencrypted(dev))
-   set_memory_encrypted((unsigned long)page_address(page),
-1 << get_order(size));
+   if (force_dma_unencrypted(dev)) {
+   err = set_memory_encrypted((unsigned long)page_address(page),
+  1 << get_order(size));
+   /* If memory cannot be re-encrypted, it must be leaked */
+   if (err)
+   return NULL;
+   }
 out_free_pages:
dma_free_contiguous(dev, page, size);
return NULL;


[patch for-5.8 2/4] dma-direct: re-encrypt memory if dma_direct_alloc_pages() fails

2020-06-11 Thread David Rientjes
If arch_dma_set_uncached() fails after memory has been decrypted, it needs
to be re-encrypted before freeing.

Fixes: fa7e2247c572 ("dma-direct: make uncached_kernel_address more
general")
Cc: sta...@vger.kernel.org # 5.7
Signed-off-by: David Rientjes 
---
 kernel/dma/direct.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -220,7 +220,7 @@ void *dma_direct_alloc_pages(struct device *dev, size_t 
size,
arch_dma_prep_coherent(page, size);
ret = arch_dma_set_uncached(ret, size);
if (IS_ERR(ret))
-   goto out_free_pages;
+   goto out_encrypt_pages;
}
 done:
if (force_dma_unencrypted(dev))
@@ -228,6 +228,10 @@ void *dma_direct_alloc_pages(struct device *dev, size_t 
size,
else
*dma_handle = phys_to_dma(dev, page_to_phys(page));
return ret;
+out_encrypt_pages:
+   if (force_dma_unencrypted(dev))
+   set_memory_encrypted((unsigned long)page_address(page),
+1 << get_order(size));
 out_free_pages:
dma_free_contiguous(dev, page, size);
return NULL;


[patch for-5.8 1/4] dma-direct: always align allocation size in dma_direct_alloc_pages()

2020-06-11 Thread David Rientjes
dma_alloc_contiguous() does size >> PAGE_SHIFT and set_memory_decrypted()
works at page granularity.  It's necessary to page align the allocation
size in dma_direct_alloc_pages() for consistent behavior.

This also fixes an issue when arch_dma_prep_coherent() is called on an
unaligned allocation size for dma_alloc_need_uncached() when
CONFIG_DMA_DIRECT_REMAP is disabled but CONFIG_ARCH_HAS_DMA_SET_UNCACHED
is enabled.

Cc: sta...@vger.kernel.org
Signed-off-by: David Rientjes 
---
 kernel/dma/direct.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -112,11 +112,12 @@ static inline bool dma_should_free_from_pool(struct 
device *dev,
 struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
gfp_t gfp, unsigned long attrs)
 {
-   size_t alloc_size = PAGE_ALIGN(size);
int node = dev_to_node(dev);
struct page *page = NULL;
u64 phys_limit;
 
+   VM_BUG_ON(!PAGE_ALIGNED(size));
+
if (attrs & DMA_ATTR_NO_WARN)
gfp |= __GFP_NOWARN;
 
@@ -124,14 +125,14 @@ struct page *__dma_direct_alloc_pages(struct device *dev, 
size_t size,
gfp &= ~__GFP_ZERO;
gfp |= dma_direct_optimal_gfp_mask(dev, dev->coherent_dma_mask,
					   &phys_limit);
-   page = dma_alloc_contiguous(dev, alloc_size, gfp);
+   page = dma_alloc_contiguous(dev, size, gfp);
if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
-   dma_free_contiguous(dev, page, alloc_size);
+   dma_free_contiguous(dev, page, size);
page = NULL;
}
 again:
if (!page)
-   page = alloc_pages_node(node, gfp, get_order(alloc_size));
+   page = alloc_pages_node(node, gfp, get_order(size));
if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
dma_free_contiguous(dev, page, size);
page = NULL;
@@ -158,8 +159,10 @@ void *dma_direct_alloc_pages(struct device *dev, size_t 
size,
struct page *page;
void *ret;
 
+   size = PAGE_ALIGN(size);
+
if (dma_should_alloc_from_pool(dev, gfp, attrs)) {
-   ret = dma_alloc_from_pool(dev, PAGE_ALIGN(size), , gfp);
+   ret = dma_alloc_from_pool(dev, size, , gfp);
if (!ret)
return NULL;
goto done;
@@ -183,10 +186,10 @@ void *dma_direct_alloc_pages(struct device *dev, size_t 
size,
 dma_alloc_need_uncached(dev, attrs)) ||
(IS_ENABLED(CONFIG_DMA_REMAP) && PageHighMem(page))) {
/* remove any dirty cache lines on the kernel alias */
-   arch_dma_prep_coherent(page, PAGE_ALIGN(size));
+   arch_dma_prep_coherent(page, size);
 
/* create a coherent mapping */
-   ret = dma_common_contiguous_remap(page, PAGE_ALIGN(size),
+   ret = dma_common_contiguous_remap(page, size,
dma_pgprot(dev, PAGE_KERNEL, attrs),
__builtin_return_address(0));
if (!ret)


[patch for-5.8 4/4] dma-direct: add missing set_memory_decrypted() for coherent mapping

2020-06-11 Thread David Rientjes
When a coherent mapping is created in dma_direct_alloc_pages(), it needs
to be decrypted if the device requires unencrypted DMA before returning.

Fixes: 3acac065508f ("dma-mapping: merge the generic remapping helpers
into dma-direct")
Cc: sta...@vger.kernel.org # 5.5+
Signed-off-by: David Rientjes 
---
 kernel/dma/direct.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -195,6 +195,12 @@ void *dma_direct_alloc_pages(struct device *dev, size_t 
size,
__builtin_return_address(0));
if (!ret)
goto out_free_pages;
+   if (force_dma_unencrypted(dev)) {
+   err = set_memory_decrypted((unsigned long)ret,
+  1 << get_order(size));
+   if (err)
+   goto out_free_pages;
+   }
memset(ret, 0, size);
goto done;
}


[patch for-5.8 0/4] dma-direct: dma_direct_alloc_pages() fixes for AMD SEV

2020-06-11 Thread David Rientjes
While debugging recently reported issues concerning DMA allocation
practices when CONFIG_AMD_MEM_ENCRYPT is enabled, some curiosities arose
when looking at dma_direct_alloc_pages() behavior.

Fix these up.  These are likely all stable material, so proposing for 5.8.
---
 kernel/dma/direct.c | 42 --
 1 file changed, 32 insertions(+), 10 deletions(-)


[patch for-5.8] dma-pool: decouple DMA_REMAP from DMA_COHERENT_POOL

2020-06-11 Thread David Rientjes
DMA_REMAP is an unnecessary requirement for AMD SEV, which requires 
DMA_COHERENT_POOL, so avoid selecting it when it is otherwise unnecessary.  

The only other requirement for DMA coherent pools is DMA_DIRECT_REMAP, so 
ensure that properly selects the config option when needed.

Fixes: 82fef0ad811f ("x86/mm: unencrypted non-blocking DMA allocations use 
coherent pools")
Suggested-by: Christoph Hellwig 
Signed-off-by: David Rientjes 
---
 kernel/dma/Kconfig | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -73,18 +73,18 @@ config SWIOTLB
 config DMA_NONCOHERENT_MMAP
bool
 
+config DMA_COHERENT_POOL
+   bool
+
 config DMA_REMAP
+   bool
depends on MMU
select GENERIC_ALLOCATOR
select DMA_NONCOHERENT_MMAP
-   bool
-
-config DMA_COHERENT_POOL
-   bool
-   select DMA_REMAP
 
 config DMA_DIRECT_REMAP
bool
+   select DMA_REMAP
select DMA_COHERENT_POOL
 
 config DMA_CMA


Re: next-0519 on thinkpad x60: sound related? window manager crash

2020-06-09 Thread David Rientjes
On Tue, 9 Jun 2020, Christoph Hellwig wrote:

> > Working theory is that CONFIG_DMA_NONCOHERENT_MMAP getting set is causing 
> > the error_code in the page fault path.  Debugging with Alex off-thread we 
> > found that dma_{alloc,free}_from_pool() are not getting called from the 
> > new code in dma_direct_{alloc,free}_pages() and he has not enabled 
> > mem_encrypt.
> 
> While DMA_COHERENT_POOL absolutely should not select DMA_NONCOHERENT_MMAP
> (and you should send your patch either way), I don't think it is going
> to make a difference here, as DMA_NONCOHERENT_MMAP just means we
> allows mmaps even for non-coherent devices, and we do not support
> non-coherent devices on x86.
> 

We haven't heard yet whether the disabling of DMA_NONCOHERENT_MMAP fixes 
Aaron's BUG(), and the patch included some other debugging hints that will 
be printed out in case it didn't, but I'll share what we figured out:

In 5.7, his config didn't have DMA_DIRECT_REMAP or DMA_REMAP (it did have 
GENERIC_ALLOCATOR already).  AMD_MEM_ENCRYPT is set.  

In Linus HEAD, AMD_MEM_ENCRYPT now selects DMA_COHERENT_POOL so it sets 
the two aforementioned options.

We also figured out that dma_should_alloc_from_pool() is always false up 
until the BUG().  So what else changed?  Only the selection of DMA_REMAP 
and DMA_NONCOHERENT_MMAP.

The comment in the Kconfig about setting "an uncached bit in the 
pagetables" led me to believe it may be related to the splat he's seeing 
(reserved bit violation).  So I suggested dropping DMA_NONCOHERENT_MMAP 
from his Kconfig for testing purposes.



If this option should not implicitly be set for DMA_COHERENT_POOL, then I 
assume we need yet another Kconfig option since DMA_REMAP selected it 
before and DMA_COHERENT_POOL selects DMA_REMAP :)

So do we want a DMA_REMAP_BUT_NO_DMA_NONCOHERENT_MMAP?  Decouple DMA_REMAP 
from DMA_NONCOHERENT_MMAP and select the latter wherever the former was 
set (but not DMA_COHERENT_POOL)?  Something else?


Re: next-0519 on thinkpad x60: sound related? window manager crash

2020-06-08 Thread David Rientjes
On Mon, 8 Jun 2020, Alex Xu (Hello71) wrote:

> Excerpts from Christoph Hellwig's message of June 8, 2020 2:19 am:
> > Can you do a listing using gdb where this happens?
> > 
> > gdb vmlinux
> > 
> > l *(snd_pcm_hw_params+0x3f3)
> > 
> > ?
> > 
> 
> (gdb) l *(snd_pcm_hw_params+0x3f3)
> 0x817efc85 is in snd_pcm_hw_params 
> (.../linux/sound/core/pcm_native.c:749).
> 744 while (runtime->boundary * 2 <= LONG_MAX - 
> runtime->buffer_size)
> 745 runtime->boundary *= 2;
> 746
> 747 /* clear the buffer for avoiding possible kernel info leaks */
> 748 if (runtime->dma_area && !substream->ops->copy_user)
> 749 memset(runtime->dma_area, 0, runtime->dma_bytes);
> 750
> 751 snd_pcm_timer_resolution_change(substream);
> 752 snd_pcm_set_state(substream, SNDRV_PCM_STATE_SETUP);
> 753
> 

Working theory is that CONFIG_DMA_NONCOHERENT_MMAP getting set is causing 
the error_code in the page fault path.  Debugging with Alex off-thread we 
found that dma_{alloc,free}_from_pool() are not getting called from the 
new code in dma_direct_{alloc,free}_pages() and he has not enabled 
mem_encrypt.

So the issue is related to setting CONFIG_DMA_COHERENT_POOL, and not 
anything else related to AMD SME.  He has a patch to try out, but I wanted 
to update the thread in case there are other ideas to try other than 
selecting CONFIG_DMA_NONCOHERENT_MMAP only when CONFIG_DMA_REMAP is set 
(and not CONFIG_DMA_COHERENT_POOL).


Re: [PATCH v2] dma-pool: Fix too large DMA pools on medium systems

2020-06-08 Thread David Rientjes
On Mon, 8 Jun 2020, Geert Uytterhoeven wrote:

> On systems with at least 32 MiB, but less than 32 GiB of RAM, the DMA
> memory pools are much larger than intended (e.g. 2 MiB instead of 128
> KiB on a 256 MiB system).
> 
> Fix this by correcting the calculation of the number of GiBs of RAM in
> the system.  Invert the order of the min/max operations, to keep on
> calculating in pages until the last step, which aids readability.
> 
> Fixes: 1d659236fb43c4d2 ("dma-pool: scale the default DMA coherent pool size 
> with memory capacity")
> Signed-off-by: Geert Uytterhoeven 

This works as well and is much more readable.  Thanks Geert!

Acked-by: David Rientjes 


Re: 82fef0ad811f "x86/mm: unencrypted non-blocking DMA allocations use coherent pools" was Re: next-0519 on thinkpad x60: sound related? window manager crash

2020-06-07 Thread David Rientjes
On Sun, 7 Jun 2020, Alex Xu (Hello71) wrote:

> > On Sun, 7 Jun 2020, Pavel Machek wrote:
> > 
> >> > I have a similar issue, caused between aaa2faab4ed8 and b170290c2836.
> >> > 
> >> > [   20.263098] BUG: unable to handle page fault for address: 
> >> > b2b582cc2000
> >> > [   20.263104] #PF: supervisor write access in kernel mode
> >> > [   20.263105] #PF: error_code(0x000b) - reserved bit violation
> >> > [   20.263107] PGD 3fd03b067 P4D 3fd03b067 PUD 3fd03c067 PMD 3f8822067 
> >> > PTE 8000273942ab2163
> >> > [   20.263113] Oops: 000b [#1] PREEMPT SMP
> >> > [   20.263117] CPU: 3 PID: 691 Comm: mpv Not tainted 
> >> > 5.7.0-11262-gb170290c2836 #1
> >> > [   20.263119] Hardware name: To Be Filled By O.E.M. To Be Filled By 
> >> > O.E.M./B450 Pro4, BIOS P4.10 03/05/2020
> >> > [   20.263125] RIP: 0010:__memset+0x24/0x30
> >> > [   20.263128] Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 
> >> > 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af 
> >> > c6  48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
> >> > [   20.263131] RSP: 0018:b2b583d07e10 EFLAGS: 00010216
> >> > [   20.263133] RAX:  RBX: 8b8000102c00 RCX: 
> >> > 4000
> >> > [   20.263134] RDX:  RSI:  RDI: 
> >> > b2b582cc2000
> >> > [   20.263136] RBP: 8b8000101000 R08:  R09: 
> >> > b2b582cc2000
> >> > [   20.263137] R10: 5356 R11: 8b8000102c18 R12: 
> >> > 
> >> > [   20.263139] R13:  R14: 8b8039944200 R15: 
> >> > 9794daa0
> >> > [   20.263141] FS:  7f41aa4b4200() GS:8b803ecc() 
> >> > knlGS:
> >> > [   20.263143] CS:  0010 DS:  ES:  CR0: 80050033
> >> > [   20.263144] CR2: b2b582cc2000 CR3: 0003b6731000 CR4: 
> >> > 003406e0
> >> > [   20.263146] Call Trace:
> >> > [   20.263151]  ? snd_pcm_hw_params+0x3f3/0x47a
> >> > [   20.263154]  ? snd_pcm_common_ioctl+0xf2/0xf73
> >> > [   20.263158]  ? snd_pcm_ioctl+0x1e/0x29
> >> > [   20.263161]  ? ksys_ioctl+0x77/0x91
> >> > [   20.263163]  ? __x64_sys_ioctl+0x11/0x14
> >> > [   20.263166]  ? do_syscall_64+0x3d/0xf5
> >> > [   20.263170]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >> > [   20.263173] Modules linked in: uvcvideo videobuf2_vmalloc 
> >> > videobuf2_memops videobuf2_v4l2 videodev snd_usb_audio videobuf2_common 
> >> > snd_hwdep snd_usbmidi_lib input_leds snd_rawmidi led_class
> >> > [   20.263182] CR2: b2b582cc2000
> >> > [   20.263184] ---[ end trace c6b47a774b91f0a0 ]---
> >> > [   20.263187] RIP: 0010:__memset+0x24/0x30
> >> > [   20.263190] Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 
> >> > 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af 
> >> > c6  48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
> >> > [   20.263192] RSP: 0018:b2b583d07e10 EFLAGS: 00010216
> >> > [   20.263193] RAX:  RBX: 8b8000102c00 RCX: 
> >> > 4000
> >> > [   20.263195] RDX:  RSI:  RDI: 
> >> > b2b582cc2000
> >> > [   20.263196] RBP: 8b8000101000 R08:  R09: 
> >> > b2b582cc2000
> >> > [   20.263197] R10: 5356 R11: 8b8000102c18 R12: 
> >> > 
> >> > [   20.263199] R13:  R14: 8b8039944200 R15: 
> >> > 9794daa0
> >> > [   20.263201] FS:  7f41aa4b4200() GS:8b803ecc() 
> >> > knlGS:
> >> > [   20.263202] CS:  0010 DS:  ES:  CR0: 80050033
> >> > [   20.263204] CR2: b2b582cc2000 CR3: 0003b6731000 CR4: 
> >> > 003406e0
> >> > 
> >> > I bisected this to 82fef0ad811f "x86/mm: unencrypted non-blocking DMA 
> >> > allocations use coherent pools". Reverting 1ee18de92927 resolves the 
> >> > issue.
> >> > 
> >> > Looks like Thinkpad X60 doesn't have VT-d, but could still be DMA 
> >> > related.
> >> 
> >> Note that newer -next releases seem to behave okay for me. The commit
> >> pointed out by siection is really simple:
> >> 
> >> AFAIK you could verify it is responsible by turning off
> >> CONFIG_AMD_MEM_ENCRYPT on latest kernel...
> >> 
> >> Best regards,
> >>Pavel
> >> 
> >> index 1d6104ea8af0..2bf819d3 100644
> >> --- a/arch/x86/Kconfig
> >> +++ b/arch/x86/Kconfig
> >> @@ -1520,6 +1520,7 @@ config X86_CPA_STATISTICS
> >>  config AMD_MEM_ENCRYPT
> >> bool "AMD Secure Memory Encryption (SME) support"
> >> depends on X86_64 && CPU_SUP_AMD
> >> +   select DMA_COHERENT_POOL
> >> select DYNAMIC_PHYSICAL_MASK
> >> select ARCH_USE_MEMREMAP_PROT
> >> select ARCH_HAS_FORCE_DMA_UNENCRYPTED
> > 
> > Thanks for the report!
> > 
> > Besides CONFIG_AMD_MEM_ENCRYPT, do you have CONFIG_DMA_DIRECT_REMAP 
> > enabled?  If so, it may be caused by the virtual address passed to the 
> 

Re: 82fef0ad811f "x86/mm: unencrypted non-blocking DMA allocations use coherent pools" was Re: next-0519 on thinkpad x60: sound related? window manager crash

2020-06-07 Thread David Rientjes
On Sun, 7 Jun 2020, Pavel Machek wrote:

> > I have a similar issue, caused between aaa2faab4ed8 and b170290c2836.
> > 
> > [   20.263098] BUG: unable to handle page fault for address: 
> > b2b582cc2000
> > [   20.263104] #PF: supervisor write access in kernel mode
> > [   20.263105] #PF: error_code(0x000b) - reserved bit violation
> > [   20.263107] PGD 3fd03b067 P4D 3fd03b067 PUD 3fd03c067 PMD 3f8822067 PTE 
> > 8000273942ab2163
> > [   20.263113] Oops: 000b [#1] PREEMPT SMP
> > [   20.263117] CPU: 3 PID: 691 Comm: mpv Not tainted 
> > 5.7.0-11262-gb170290c2836 #1
> > [   20.263119] Hardware name: To Be Filled By O.E.M. To Be Filled By 
> > O.E.M./B450 Pro4, BIOS P4.10 03/05/2020
> > [   20.263125] RIP: 0010:__memset+0x24/0x30
> > [   20.263128] Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 
> > e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 
> >  48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
> > [   20.263131] RSP: 0018:b2b583d07e10 EFLAGS: 00010216
> > [   20.263133] RAX:  RBX: 8b8000102c00 RCX: 
> > 4000
> > [   20.263134] RDX:  RSI:  RDI: 
> > b2b582cc2000
> > [   20.263136] RBP: 8b8000101000 R08:  R09: 
> > b2b582cc2000
> > [   20.263137] R10: 5356 R11: 8b8000102c18 R12: 
> > 
> > [   20.263139] R13:  R14: 8b8039944200 R15: 
> > 9794daa0
> > [   20.263141] FS:  7f41aa4b4200() GS:8b803ecc() 
> > knlGS:
> > [   20.263143] CS:  0010 DS:  ES:  CR0: 80050033
> > [   20.263144] CR2: b2b582cc2000 CR3: 0003b6731000 CR4: 
> > 003406e0
> > [   20.263146] Call Trace:
> > [   20.263151]  ? snd_pcm_hw_params+0x3f3/0x47a
> > [   20.263154]  ? snd_pcm_common_ioctl+0xf2/0xf73
> > [   20.263158]  ? snd_pcm_ioctl+0x1e/0x29
> > [   20.263161]  ? ksys_ioctl+0x77/0x91
> > [   20.263163]  ? __x64_sys_ioctl+0x11/0x14
> > [   20.263166]  ? do_syscall_64+0x3d/0xf5
> > [   20.263170]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [   20.263173] Modules linked in: uvcvideo videobuf2_vmalloc 
> > videobuf2_memops videobuf2_v4l2 videodev snd_usb_audio videobuf2_common 
> > snd_hwdep snd_usbmidi_lib input_leds snd_rawmidi led_class
> > [   20.263182] CR2: b2b582cc2000
> > [   20.263184] ---[ end trace c6b47a774b91f0a0 ]---
> > [   20.263187] RIP: 0010:__memset+0x24/0x30
> > [   20.263190] Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 
> > e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 
> >  48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
> > [   20.263192] RSP: 0018:b2b583d07e10 EFLAGS: 00010216
> > [   20.263193] RAX:  RBX: 8b8000102c00 RCX: 
> > 4000
> > [   20.263195] RDX:  RSI:  RDI: 
> > b2b582cc2000
> > [   20.263196] RBP: 8b8000101000 R08:  R09: 
> > b2b582cc2000
> > [   20.263197] R10: 5356 R11: 8b8000102c18 R12: 
> > 
> > [   20.263199] R13:  R14: 8b8039944200 R15: 
> > 9794daa0
> > [   20.263201] FS:  7f41aa4b4200() GS:8b803ecc() 
> > knlGS:
> > [   20.263202] CS:  0010 DS:  ES:  CR0: 80050033
> > [   20.263204] CR2: b2b582cc2000 CR3: 0003b6731000 CR4: 
> > 003406e0
> > 
> > I bisected this to 82fef0ad811f "x86/mm: unencrypted non-blocking DMA 
> > allocations use coherent pools". Reverting 1ee18de92927 resolves the 
> > issue.
> > 
> > Looks like Thinkpad X60 doesn't have VT-d, but could still be DMA 
> > related.
> 
> Note that newer -next releases seem to behave okay for me. The commit
> pointed out by siection is really simple:
> 
> AFAIK you could verify it is responsible by turning off
> CONFIG_AMD_MEM_ENCRYPT on latest kernel...
> 
> Best regards,
>   Pavel
> 
> index 1d6104ea8af0..2bf819d3 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1520,6 +1520,7 @@ config X86_CPA_STATISTICS
>  config AMD_MEM_ENCRYPT
> bool "AMD Secure Memory Encryption (SME) support"
> depends on X86_64 && CPU_SUP_AMD
> +   select DMA_COHERENT_POOL
> select DYNAMIC_PHYSICAL_MASK
> select ARCH_USE_MEMREMAP_PROT
> select ARCH_HAS_FORCE_DMA_UNENCRYPTED

Thanks for the report!

Besides CONFIG_AMD_MEM_ENCRYPT, do you have CONFIG_DMA_DIRECT_REMAP 
enabled?  If so, it may be caused by the virtual address passed to the 
set_memory_{decrypted,encrypted}() functions.

And I assume you are enabling SME by using mem_encrypt=on on the kernel 
command line or CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT is enabled.

We likely need an atomic pool for devices that support DMA to addresses in 
sme_me_mask as well.  I can test this tomorrow, but wanted to get it out 
early to see if 

Re: [PATCH] mm: thp: Add new kernel parameters transparent_hugepage_defrag/khugepaged_defrag

2020-06-03 Thread David Rientjes
On Wed, 3 Jun 2020, Vlastimil Babka wrote:

> > There is no way to set up the defrag options in boot time. And it's
> > useful to set it up by default instead of making it work by a
> > systemd/upstart service or put the command to set up defrag inside
> > /etc/rc.local.
> > 
> > Signed-off-by: Gavin Guo 
> 
> > Well, maybe instead of adding these handlers, we could extend the new boot
> parameter sysctl support (handling procfs /proc/sys/) to sysfs (/sys) as well,
> as Eric already suggested? [1]
> 
> [1] https://lore.kernel.org/linux-api/87bloj2skm@x220.int.ebiederm.org/
> 

Fully agreed, I think the solution needs to be more generic since thp 
defrag isn't special here.  With the generic support to tune sysctls and 
sysfs tunables from the command line it seems like this patch would be 
redundant.


Re: [PATCH] mm, compaction: Indicate when compaction is manually triggered by sysctl

2020-05-10 Thread David Rientjes
On Fri, 8 May 2020, Guilherme Piccoli wrote:

> On Fri, May 8, 2020 at 3:31 PM David Rientjes  wrote:
> > It doesn't make sense because it's only being done here for the entire
> > system; there are also per-node sysfs triggers, so you could iterate over
> > the nodemask of all nodes with memory, trigger compaction manually on
> > each, and then nothing would be emitted to the kernel log.
> >
> > There is new statsfs support that Red Hat is proposing that can be used
> > for things like this.  It currently only supports KVM statistics but
> > adding MM statistics is something that would be a natural extension and
> > avoids polluting both the kernel log and /proc/vmstat.
> 
> Thanks for the review. Is this what you're talking about [0] ? Very 
> interesting!
> 

Exactly.

> Also, I agree about the per-node compaction, it's a good point. But at
> the same time, having the information on the number of manual
> compaction triggered is interesting, at least for some users. What if
> we add that as a per-node stat in zoneinfo?
> 

The kernel log is not preferred for this (or drop_caches, really) because 
the amount of info can cause important information to be lost.  We don't 
really gain anything by printing that someone manually triggered 
compaction; they could just write to the kernel log themselves if they 
really wanted to.  The reverse is not true: we can't suppress your kernel 
message with this patch.

Instead, a statsfs-like approach could be used to indicate when this has 
happened and there is no chance of losing events because it got scrolled 
off the kernel log.  It has the added benefit of not requiring the entire 
log to be parsed for such events.


Re: [PATCH] mm, compaction: Indicate when compaction is manually triggered by sysctl

2020-05-08 Thread David Rientjes
On Thu, 7 May 2020, Guilherme G. Piccoli wrote:

> Well...you can think that the problem we are trying to solve was more
> like...admin forgot if they triggered or not the compaction hehe
> So, counting on the user to keep track of it is what I'd like to
> avoid. And thinking about drop_caches (that have an indication when
> triggered) makes me think this indeed does make sense for compaction
> too.
> Cheers,
> 

It doesn't make sense because it's only being done here for the entire 
system; there are also per-node sysfs triggers, so you could iterate over 
the nodemask of all nodes with memory, trigger compaction manually on 
each, and then nothing would be emitted to the kernel log.

There is new statsfs support that Red Hat is proposing that can be used 
for things like this.  It currently only supports KVM statistics but 
adding MM statistics is something that would be a natural extension and 
avoids polluting both the kernel log and /proc/vmstat.


Re: [PATCH] slab: Replace zero-length array with flexible-array

2020-05-08 Thread David Rientjes
On Thu, 7 May 2020, Gustavo A. R. Silva wrote:

> The current codebase makes use of the zero-length array language
> extension to the C90 standard, but the preferred mechanism to declare
> variable-length types such as these ones is a flexible array member[1][2],
> introduced in C99:
> 
> struct foo {
> int stuff;
> struct boo array[];
> };
> 
> By making use of the mechanism above, we will get a compiler warning
> in case the flexible array does not occur last in the structure, which
> will help us prevent some kind of undefined behavior bugs from being
> inadvertently introduced[3] to the codebase from now on.
> 
> Also, notice that, dynamic memory allocations won't be affected by
> this change:
> 
> "Flexible array members have incomplete type, and so the sizeof operator
> may not be applied. As a quirk of the original implementation of
> zero-length arrays, sizeof evaluates to zero."[1]
> 
> sizeof(flexible-array-member) triggers a warning because flexible array
> members have incomplete type[1]. There are some instances of code in
> which the sizeof operator is being incorrectly/erroneously applied to
> zero-length arrays and the result is zero. Such instances may be hiding
> some bugs. So, this work (flexible-array member conversions) will also
> help to get completely rid of those sorts of issues.
> 
> This issue was found with the help of Coccinelle.
> 
> [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
> [2] https://github.com/KSPP/linux/issues/21
> [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")
> 
> Signed-off-by: Gustavo A. R. Silva 

Acked-by: David Rientjes 


Re: [PATCH] slub: limit count of partial slabs scanned to gather statistics

2020-05-07 Thread David Rientjes
On Thu, 7 May 2020, Konstantin Khlebnikov wrote:

> > > > To get exact count of free and used objects slub have to scan list of
> > > > partial slabs. This may take a long time. Scanning holds spinlock and
> > > > blocks allocations which move partial slabs to per-cpu lists and back.
> > > > 
> > > > Example found in the wild:
> > > > 
> > > > # cat /sys/kernel/slab/dentry/partial
> > > > 14478538 N0=7329569 N1=7148969
> > > > # time cat /sys/kernel/slab/dentry/objects
> > > > 286225471 N0=136967768 N1=149257703
> > > > 
> > > > real0m1.722s
> > > > user0m0.001s
> > > > sys 0m1.721s
> > > > 
> > > > The same problem in slab was addressed in commit f728b0a5d72a ("mm,
> > > > slab:
> > > > faster active and free stats") by adding more kmem cache statistics.
> > > > For slub same approach requires atomic op on fast path when object
> > > > frees.
> > > > 
> > > > Let's simply limit count of scanned slabs and print warning.
> > > > Limit set in /sys/module/slub/parameters/max_partial_to_count.
> > > > Default is 1 which should be enough for most sane cases.
> > > > 
> > > > Return linear approximation if list of partials is longer than limit.
> > > > Nobody should notice difference.
> > > > 
> > > > Signed-off-by: Konstantin Khlebnikov 
> > > 
> > > This patch will trigger the warning under memory pressure, and then makes
> > > lockdep unhappy. Also,  it is almost impossible tell how many
> > > max_partial_to_count is sufficient from user perspective.
> 
> Oops, my bad. Printing under this lock indeed a bad idea.
> 
> Probably it's better to simply remove this message.
> I cannot imagine situation when precise count of object matters at such scale.
> 

If the printk is removed, then probably better to remove the 
max_partial_to_count param as well?  I doubt it would ever be used since 
nothing points to it other than the kernel code now.  If somebody 
complains about the approximation, we can (a) convince them the 
approximation is better than precise calculation to prevent irqs from 
being disabled for several seconds and (b) add it later if absolutely 
necessary.  I notice the absence of other module_param()'s in mm/slub.c, 
so likely better to avoid adding special tunables like this unless 
absolutely necessary.
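
For illustration, a stripped-down version without the printk and the
module_param might look like this (sketch only; the cap is an arbitrary
constant here):

	/* stop after this many partial slabs and extrapolate the rest */
	#define MAX_PARTIAL_TO_COUNT	10000

	static unsigned long count_partial(struct kmem_cache_node *n,
					   int (*get_count)(struct page *))
	{
		unsigned long counted = 0;
		unsigned long flags;
		unsigned long x = 0;
		struct page *page;

		spin_lock_irqsave(&n->list_lock, flags);
		list_for_each_entry(page, &n->partial, slab_list) {
			x += get_count(page);
			if (++counted >= MAX_PARTIAL_TO_COUNT) {
				/* linear approximation over the full list */
				x = mult_frac(x, n->nr_partial, counted);
				break;
			}
		}
		spin_unlock_irqrestore(&n->list_lock, flags);
		return x;
	}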


Re: [PATCH v2 0/5] Statsfs: a new ram-based file sytem for Linux kernel statistics

2020-05-05 Thread David Rientjes
On Tue, 5 May 2020, Paolo Bonzini wrote:

> >>> Since this is becoming a generic API (good!!), maybe we can discuss
> >>> possible ways to optimize gathering of stats in mass?
> >> Sure, the idea of a binary format was considered from the beginning in
> >> [1], and it can be done either together with the current filesystem, or
> >> as a replacement via different mount options.
> > 
> > ASCII stats are not scalable. A binary format is definitely the way to go.
> 
> I am totally in favor of having a binary format, but it should be
> introduced as a separate series on top of this one---and preferably by
> someone who has already put some thought into the problem (which
> Emanuele and I have not, beyond ensuring that the statsfs concept and
> API is flexible enough).
> 

The concern is that once this series is merged, /sys/kernel/stats 
could be considered an ABI and there would be a reasonable expectation 
that it will remain stable, insofar as the stats that userspace is 
interested in are stable and not obsoleted.

So is this a suggestion that the binary format becomes complementary to 
statsfs and provides a means for getting all stats from a single subsystem, 
or that this series gets converted to such a format before it is merged?

> ASCII stats are necessary for quick userspace consumption and for
> backwards compatibility with KVM debugfs (which is not an ABI, but it's
> damn useful and should not be dropped without providing something as
> handy), so this is what this series starts from.
> 


Re: [PATCH v2 0/5] Statsfs: a new ram-based file sytem for Linux kernel statistics

2020-05-04 Thread David Rientjes
On Mon, 4 May 2020, Emanuele Giuseppe Esposito wrote:

> There is currently no common way for Linux kernel subsystems to expose
> statistics to userspace shared throughout the Linux kernel; subsystems
> have to take care of gathering and displaying statistics by themselves,
> for example in the form of files in debugfs. For example KVM has its own
> code section that takes care of this in virt/kvm/kvm_main.c, where it sets
> up debugfs handlers for displaying values and aggregating them from
> various subfolders to obtain information about the system state (i.e.
> displaying the total number of exits, calculated by summing all exits of
> all cpus of all running virtual machines).
> 
> Allowing each section of the kernel to do so has two disadvantages. First,
> it will introduce redundant code. Second, debugfs is anyway not the right
> place for statistics (for example it is affected by lockdown)
> 
> In this patch series I introduce statsfs, a synthetic ram-based virtual
> filesystem that takes care of gathering and displaying statistics for the
> Linux kernel subsystems.
> 

This is exciting; we have been looking in the same area recently.  Adding 
Jonathan Adams .

One thing I notice is omitted from your diffstat: an update to 
Documentation/* :)  Any chance of getting some proposed Documentation/ 
updates with the structure of the fs, the per-subsystem breakdown, and best 
practices for managing the stats from the kernel level?

> The file system is mounted on /sys/kernel/stats and would be already used
> by kvm. Statsfs was initially introduced by Paolo Bonzini [1].
> 
> Statsfs offers a generic and stable API, allowing any kind of
> directory/file organization and supporting multiple kind of aggregations
> (not only sum, but also average, max, min and count_zero) and data types
> (all unsigned and signed types plus boolean). The implementation, which is
> a generalization of KVM’s debugfs statistics code, takes care of gathering
> and displaying information at run time; users only need to specify the
> values to be included in each source.
> 
> Statsfs would also be a different mountpoint from debugfs, and would not
> suffer from limited access due to the security lock down patches. Its main
> function is to display each statistics as a file in the desired folder
> hierarchy defined through the API. Statsfs files can be read, and possibly
> cleared if their file mode allows it.
> 
> Statsfs has two main components: the public API defined by
> include/linux/statsfs.h, and the virtual file system which should end up
> in /sys/kernel/stats.
> 
> The API has two main elements, values and sources. Kernel subsystems like
> KVM can use the API to create a source, add child
> sources/values/aggregates and register it to the root source (that on the
> virtual fs would be /sys/kernel/statsfs).
> 
> Sources are created via statsfs_source_create(), and each source becomes a
> directory in the file system. Sources form a parent-child relationship;
> root sources are added to the file system via statsfs_source_register().
> Every other source is added to or removed from a parent through the
> statsfs_source_add_subordinate and statsfs_source_remote_subordinate APIs.
> Once a source is created and added to the tree (via add_subordinate), it
> will be used to compute aggregate values in the parent source.
> 
> Values represent quantites that are gathered by the statsfs user. Examples
> of values include the number of vm exits of a given kind, the amount of
> memory used by some data structure, the length of the longest hash table
> chain, or anything like that. Values are defined with the
> statsfs_source_add_values function. Each value is defined by a struct
> statsfs_value; the same statsfs_value can be added to many different
> sources. A value can be considered "simple" if it fetches data from a
> user-provided location, or "aggregate" if it groups all values in the
> subordinates sources that include the same statsfs_value.
> 

This seems like it could have a lot of overhead if we wanted to 
periodically track the totality of subsystem stats as a form of telemetry 
gathering from userspace.  To collect telemetry for 1,000 different stats, 
do we need to issue lseek()+read() syscalls for each of them individually 
(or, worse, open()+read()+close())?

Any thoughts on how that can be optimized?  A couple of ideas:

 - an interface that allows gathering of all stats for a particular
   interface through a single file that would likely be encoded in binary
   and the responsibility of userspace to disseminate, or

 - an interface that extends beyond this proposal and allows the reader to
   specify which stats they are interested in collecting and then the
   kernel will only provide these stats in a well formed structure and 
   also be binary encoded.
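
Purely as an illustration of the first idea above (every name and field here
is made up, not a proposed ABI), a binary dump could be a stream of fixed
size records through a single file:

	#include <linux/types.h>

	/* hypothetical record layout for a single-file binary stats dump */
	struct stats_binary_record {
		__u32 name_off;		/* offset into a trailing string table */
		__u16 type;		/* u64, s64, bool, ... */
		__u16 aggr;		/* none, sum, min, max, count_zero */
		__u64 value;
	};

so that userspace reads every record for a subsystem with one read() and
resolves names against the string table once.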

We've found that the one-file-per-stat method is pretty much a show 
stopper from the performance view and we always must execute at least two 
syscalls to obtain a 

Re: [PATCH v5.6-rt] mm: slub: Always flush the delayed empty slubs in flush_all()

2020-05-04 Thread David Rientjes
On Mon, 4 May 2020, Kevin Hao wrote:

> After commit f0b231101c94 ("mm/SLUB: delay giving back empty slubs to
> IRQ enabled regions"), when the free_slab() is invoked with the IRQ
> disabled, the empty slubs are moved to a per-CPU list and will be
> freed after IRQ enabled later. But in the current codes, there is
> a check to see if there really has the cpu slub on a specific cpu
> before flushing the delayed empty slubs, this may cause a reference
> of already released kmem_cache in a scenario like below:
>   cpu 0   cpu 1
>   kmem_cache_destroy()
> flush_all()
>  --->IPI   flush_cpu_slab()
>  flush_slab()
>deactivate_slab()
>  discard_slab()
>free_slab()
>  c->page = NULL;
>   for_each_online_cpu(cpu)
> if (!has_cpu_slab(1, s))
>   continue
> this skip to flush the delayed
> empty slub released by cpu1
> kmem_cache_free(kmem_cache, s)
> 
>kmalloc()
>  __slab_alloc()
> free_delayed()
> __free_slab()
> reference to released kmem_cache
> 
> Fixes: f0b231101c94 ("mm/SLUB: delay giving back empty slubs to IRQ enabled 
> regions")
> Signed-off-by: Kevin Hao 

Acked-by: David Rientjes 


Re: [PATCH] slub: limit count of partial slabs scanned to gather statistics

2020-05-04 Thread David Rientjes
On Mon, 4 May 2020, Konstantin Khlebnikov wrote:

> To get exact count of free and used objects slub have to scan list of
> partial slabs. This may take a long time. Scanning holds spinlock and
> blocks allocations which move partial slabs to per-cpu lists and back.
> 
> Example found in the wild:
> 
> # cat /sys/kernel/slab/dentry/partial
> 14478538 N0=7329569 N1=7148969
> # time cat /sys/kernel/slab/dentry/objects
> 286225471 N0=136967768 N1=149257703
> 
> real  0m1.722s
> user  0m0.001s
> sys   0m1.721s
> 
> The same problem in slab was addressed in commit f728b0a5d72a ("mm, slab:
> faster active and free stats") by adding more kmem cache statistics.
> For slub same approach requires atomic op on fast path when object frees.
> 
> Let's simply limit count of scanned slabs and print warning.
> Limit set in /sys/module/slub/parameters/max_partial_to_count.
> Default is 1 which should be enough for most sane cases.
> 
> Return linear approximation if list of partials is longer than limit.
> Nobody should notice difference.
> 

Hi Konstantin,

Do you only exhibit this on slub for SO_ALL|SO_OBJECTS?  I notice the 
timing in the changelog is only looking at "objects" and not "partial".

If so, it seems this is also a problem for get_slabinfo() since it also 
uses the count_free() callback for count_partial().

Concern would be that the kernel has now drastically changed a statistic 
that it exports to userspace.  There was some discussion about this back 
in 2016[*] and one idea was that slabinfo would truncate its scanning and 
append a '+' to the end of the value to indicate it exceeds the max, i.e. 
1+.  I think that '+' actually caused the problem itself for userspace 
processes.

I think the patch is too far-reaching, however, since it impacts all 
count_partial() counting, not only the case cited in the changelog.  
Are there examples of things other than the count_free() callback?

 [*] https://lore.kernel.org/patchwork/patch/708427/

> Signed-off-by: Konstantin Khlebnikov 
> ---
>  mm/slub.c |   15 ++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 9bf44955c4f1..86a366f7acb6 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2407,16 +2407,29 @@ static inline unsigned long node_nr_objs(struct 
> kmem_cache_node *n)
>  #endif /* CONFIG_SLUB_DEBUG */
>  
>  #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
> +
> +static unsigned long max_partial_to_count __read_mostly = 1;
> +module_param(max_partial_to_count, ulong, 0644);
> +
>  static unsigned long count_partial(struct kmem_cache_node *n,
>   int (*get_count)(struct page *))
>  {
> + unsigned long counted = 0;
>   unsigned long flags;
>   unsigned long x = 0;
>   struct page *page;
>  
> 	spin_lock_irqsave(&n->list_lock, flags);
> - list_for_each_entry(page, >partial, slab_list)
> + list_for_each_entry(page, >partial, slab_list) {
>   x += get_count(page);
> +
> + if (++counted > max_partial_to_count) {
> + pr_warn_once("SLUB: too much partial slabs to count all 
> objects, increase max_partial_to_count.\n");
> + /* Approximate total count of objects */
> + x = mult_frac(x, n->nr_partial, counted);
> + break;
> + }
> + }
> 	spin_unlock_irqrestore(&n->list_lock, flags);
>   return x;
>  }
> 
> 


Re: [patch] mm, oom: stop reclaiming if GFP_ATOMIC will start failing soon

2020-04-28 Thread David Rientjes
On Tue, 28 Apr 2020, Vlastimil Babka wrote:

> > I took a look at doing a quick-fix for the
> > direct-reclaimers-get-their-stuff-stolen issue about a million years
> > ago.  I don't recall where it ended up.  It's pretty trivial for the
> > direct reclaimer to free pages into current->reclaimed_pages and to
> > take a look in there on the allocation path, etc.  But it's only
> > practical for order-0 pages.
> 
> FWIW there's already such approach added to compaction by Mel some time ago,
> so order>0 allocations are covered to some extent. But in this case I imagine
> that compaction won't even start because order-0 watermarks are too low.
> 
> The order-0 reclaim capture might work though - as a result the GFP_ATOMIC
> allocations would more likely fail and defer to their fallback context.
> 

Yes, order-0 reclaim capture is interesting since the issue being reported 
here is userspace going out to lunch: it loops for an unbounded amount of 
time trying to get above a watermark where it's allowed to allocate while 
other consumers are depleting that resource.

We actually prefer to oom kill earlier rather than being put in a 
perpetual state of aggressive reclaim that affects all allocators and the 
unbounded nature of those allocations leads to very poor results for 
everybody.

I'm happy to scope this solely to an order-0 reclaim capture.  I'm not 
sure whether this has been worked on before and whether patches existed 
in the past?
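
To make the capture idea quoted above concrete (free pages into
current->reclaimed_pages and check there on the allocation path), a rough
sketch could be the following; the task_struct member and the helper are
hypothetical, nothing like this exists today:

	/*
	 * Called from the direct reclaim free path.  Returns true if the page
	 * was captured for the reclaiming task instead of going back to the
	 * buddy lists, where a GFP_ATOMIC allocator could steal it.
	 */
	static bool capture_for_reclaimer(struct page *page)
	{
		if (!(current->flags & PF_MEMALLOC))
			return false;
		return cmpxchg(&current->reclaimed_page, NULL, page) == NULL;
	}

The allocation slowpath would then check and clear current->reclaimed_page
right after direct reclaim returns, before other allocators can deplete the
watermark again.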

Somewhat related to what I described in the changelog: we lost the "page 
allocation stalls" artifacts in the kernel log for 4.15.  The commit 
description references an asynchronous mechanism for getting this 
information; I don't know where this mechanism currently lives.


Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-22 Thread David Rientjes
On Tue, 22 Oct 2019, Waiman Long wrote:

> >>> and used nr_free to compute the missing count. Since MIGRATE_MOVABLE
> >>> is usually the largest one on large memory systems, this is the one
> >>> to be skipped. Since the printing order is migration-type => order, we
> >>> will have to store the counts in an internal 2D array before printing
> >>> them out.
> >>>
> >>> Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> >>> zone lock for too long blocking out other zone lock waiters from being
> >>> run. This can be problematic for systems with large amount of memory.
> >>> So a check is added to temporarily release the lock and reschedule if
> >>> more than 64k of list entries have been iterated for each order. With
> >>> a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> >>> entries before releasing the lock.
> >> But you are still iterating through the whole free_list at once so if it
> >> gets really large then this is still possible. I think it would be
> >> preferable to use per migratetype nr_free if it doesn't cause any
> >> regressions.
> >>
> > Yes, it is still theoretically possible. I will take a further look at
> > having per-migrate type nr_free. BTW, there is one more place where the
> > free lists are being iterated with zone lock held - mark_free_pages().
> 
> Looking deeper into the code, the exact migration type is not stored in
> the page itself. An initial movable page can be stolen to be put into
> another migration type. So in a delete or move from free_area, we don't
> know exactly what migration type the page is coming from. IOW, it is
> hard to get accurate counts of the number of entries in each lists.
> 

I think the suggestion is to maintain an nr_free count of the free_list for 
each order and each migratetype, so that anytime a page is added to or 
deleted from the list, the nr_free is adjusted.  Then the free_area's 
nr_free becomes the sum of its migratetypes' nr_free at that order.  That's 
possible to do if you track the migratetype per page, as you said, or track 
it as pcp pages do, as part of page->index.  The trade-off is the cost of 
maintaining these new nr_frees anytime you manipulate the freelists.
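
A minimal sketch of that bookkeeping (the nr_free_migrate[] member and the
helpers are hypothetical, not today's mm/ code):

	struct free_area {
		struct list_head	free_list[MIGRATE_TYPES];
		unsigned long		nr_free;
		/* hypothetical: one count per free_list above */
		unsigned long		nr_free_migrate[MIGRATE_TYPES];
	};

	static inline void area_add_page(struct free_area *area,
					 struct page *page, int migratetype)
	{
		list_add(&page->lru, &area->free_list[migratetype]);
		area->nr_free++;
		area->nr_free_migrate[migratetype]++;
	}

	static inline void area_del_page(struct free_area *area,
					 struct page *page, int migratetype)
	{
		list_del(&page->lru);
		area->nr_free--;
		area->nr_free_migrate[migratetype]--;
	}

The catch, as noted above, is that the delete path has to know which
free_list the page actually sits on, e.g. by recording the migratetype in
page->index the way pcp pages do.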

I think Vlastimil and I discussed per order per migratetype nr_frees in 
the past and it could be a worthwhile improvement for other reasons, 
specifically it leads to heuristics that can be used to determine how 
fragmentated a certain migratetype is for a zone, i.e. a very quick way to 
determine what ratio of pages over all MIGRATE_UNMOVABLE pageblocks are 
free.

Or maybe there are other reasons why these nr_frees can't be maintained 
anymore?  (I had a patch to do it on 4.3.)

On a severely fragmented system, MIGRATE_MOVABLE may not actually be the 
longest free_list compared to other migratetypes, so special-casing 
MIGRATE_MOVABLE might not be the best way forward.


Re: [PATCH] mm: update comments in slub.c

2019-10-13 Thread David Rientjes
On Mon, 7 Oct 2019, Yu Zhao wrote:

> Slub doesn't use PG_active and PG_error anymore.
> 
> Signed-off-by: Yu Zhao 

Acked-by: David Rientjes 


Re: [rfc] mm, hugetlb: allow hugepage allocations to excessively reclaim

2019-10-04 Thread David Rientjes
On Fri, 4 Oct 2019, Michal Hocko wrote:

> Requesting the userspace to drop _all_ page cache in order allocate a
> number of hugetlb pages or any other affected __GFP_RETRY_MAYFAIL
> requests is simply not reasonable IMHO.

It can be used as a fallback when writing to nr_hugepages does not 
allocate the expected number of pages.  Again, I'll defer all of this to 
Mike when he returns: he expressed his preference, I suggested an 
alternative to consider, and he can make the decision to ack or nack this 
patch because he has a better understanding of that expectation from users 
who use hugetlb pages.


Re: [PATCH] mm/slub: fix a deadlock in show_slab_objects()

2019-10-03 Thread David Rientjes
On Thu, 3 Oct 2019, Qian Cai wrote:

> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index 42c1b3af3c98..922cdcf5758a 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -4838,7 +4838,15 @@ static ssize_t show_slab_objects(struct kmem_cache 
> > > *s,
> > >   }
> > >   }
> > >  
> > > - get_online_mems();
> > > +/*
> > > + * It is not possible to take "mem_hotplug_lock" here, as it has already 
> > > held
> > > + * "kernfs_mutex" which could race with the lock order:
> > > + *
> > > + * mem_hotplug_lock->slab_mutex->kernfs_mutex
> > > + *
> > > + * In the worest case, it might be mis-calculated while doing NUMA node
> > > + * hotplug, but it shall be corrected by later reads of the same files.
> > > + */
> > >  #ifdef CONFIG_SLUB_DEBUG
> > >   if (flags & SO_ALL) {
> > >   struct kmem_cache_node *n;
> > 
> > No objection to removing the {get,put}_online_mems() but the comment 
> > doesn't match the kernel style.  I actually don't think we need the 
> > comment at all, actually.
> 
> I am a bit worry about later someone comes to add the lock back as he/she
> figures out that it could get more accurate statistics that way, but I agree 
> it
> is probably an overkill.
> 

Maybe just a small comment that follows the kernel coding style?
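
Something along these lines, for example (wording illustrative only):

	/*
	 * We cannot take mem_hotplug_lock here: kernfs_mutex is already held
	 * and taking it would invert the existing
	 * mem_hotplug_lock -> slab_mutex -> kernfs_mutex ordering.  Worst
	 * case the counts are slightly stale during node hotplug, and a
	 * later read of the same file corrects them.
	 */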


Re: [PATCH] mm/slub: fix a deadlock in show_slab_objects()

2019-10-03 Thread David Rientjes
On Thu, 3 Oct 2019, Qian Cai wrote:

> diff --git a/mm/slub.c b/mm/slub.c
> index 42c1b3af3c98..922cdcf5758a 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4838,7 +4838,15 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
>   }
>   }
>  
> - get_online_mems();
> +/*
> + * It is not possible to take "mem_hotplug_lock" here, as it has already held
> + * "kernfs_mutex" which could race with the lock order:
> + *
> + * mem_hotplug_lock->slab_mutex->kernfs_mutex
> + *
> + * In the worst case, it might be mis-calculated while doing NUMA node
> + * hotplug, but it shall be corrected by later reads of the same files.
> + */
>  #ifdef CONFIG_SLUB_DEBUG
>   if (flags & SO_ALL) {
>   struct kmem_cache_node *n;

No objection to removing the {get,put}_online_mems() but the comment 
doesn't match the kernel style.  I actually don't think we need the 
comment at all, actually.

> @@ -4879,7 +4887,6 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
>   x += sprintf(buf + x, " N%d=%lu",
>   node, nodes[node]);
>  #endif
> - put_online_mems();
>   kfree(nodes);
>   return x + sprintf(buf + x, "\n");
>  }


Re: [rfc] mm, hugetlb: allow hugepage allocations to excessively reclaim

2019-10-03 Thread David Rientjes
On Thu, 3 Oct 2019, Vlastimil Babka wrote:

> I think the key differences between Mike's tests and Michal's is this part
> from Mike's mail linked above:
> 
> "I 'tested' by simply creating some background activity and then seeing
> how many hugetlb pages could be allocated. Of course, many tries over
> time in a loop."
> 
> - "some background activity" might be different than Michal's pre-filling
>   of the memory with (clean) page cache
> - "many tries over time in a loop" could mean that kswapd has time to 
>   reclaim and eventually the new condition for pageblock order will pass
>   every few retries, because there's enough memory for compaction and it
>   won't return COMPACT_SKIPPED
> 

I'll rely on Mike, the hugetlb maintainer, to assess the trade-off between 
the potential for encountering very expensive reclaim as Andrea did and 
the possibility of being able to allocate additional hugetlb pages at 
runtime if we did that expensive reclaim.

For parity with previous kernels it seems reasonable to ask that this 
remains unchanged since allocating large amounts of hugetlb pages has 
different latency expectations than during page fault.  This patch is 
available if he'd prefer to go that route.

On the other hand, userspace could achieve similar results if it were to 
use vm.drop_caches and explicitly triggered compaction through either 
procfs or sysfs before writing to vm.nr_hugepages, and that would be much 
faster because it would be done in one go.  Users who allocate through the 
kernel command line would obviously be unaffected.
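
For illustration, that sequence is just a few procfs writes; this is a 
sketch only, must run as root, and the nr_hugepages target of 1024 is 
arbitrary:

#include <stdio.h>

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return;
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	write_str("/proc/sys/vm/drop_caches", "3");	/* drop clean page cache */
	write_str("/proc/sys/vm/compact_memory", "1");	/* compact all nodes */
	write_str("/proc/sys/vm/nr_hugepages", "1024");	/* then grow the pool */
	return 0;
}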

Commit b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when 
compaction may not succeed") was written with the latter in mind.  Mike 
subsequently requested that hugetlb not be impacted at least provisionally 
until it could be further assessed.

I'd suggest the latter: let the user initiate expensive reclaim and/or 
compaction when tuning vm.nr_hugepages and leave no surprises for users 
using hugetlb overcommit, but I wouldn't argue against either approach, he 
knows the users and expectations of hugetlb far better than I do.


Re: [PATCH] mm: vmscan: remove unused scan_control parameter from pageout()

2019-10-03 Thread David Rientjes
On Fri, 4 Oct 2019, Yang Shi wrote:

> Since lumpy reclaim was removed in v3.5 scan_control is not used by
> may_write_to_{queue|inode} and pageout() anymore, remove the unused
> parameter.
> 
> Cc: Mel Gorman 
> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Signed-off-by: Yang Shi 

Acked-by: David Rientjes 


[rfc] mm, hugetlb: allow hugepage allocations to excessively reclaim

2019-10-02 Thread David Rientjes
Hugetlb allocations use __GFP_RETRY_MAYFAIL to aggressively attempt to get 
hugepages that the user needs.  Commit b39d0ee2632d ("mm, page_alloc: 
avoid expensive reclaim when compaction may not succeed") intends to 
improve allocator behavior for thp allocations to prevent excessive amounts 
of reclaim especially when constrained to a single node.

Since hugetlb allocations have explicitly preferred to loop and do reclaim 
and compaction, exempt them from this new behavior at least for the time 
being.  It is not shown that hugetlb allocation success rate has been 
impacted by commit b39d0ee2632d but hugetlb allocations are admittedly 
beyond the scope of what the patch is intended to address (thp 
allocations).

Cc: Mike Kravetz 
Signed-off-by: David Rientjes 
---
 Mike, you alluded that you may want to opt hugetlbfs out of this for the
 time being in https://marc.info/?l=linux-kernel=156771690024533 --
 not sure if you want to allow this excessive amount of reclaim for 
 hugetlb allocations or not given the swap storms Andrea has shown is
 possible (and nr_hugepages_mempolicy does exist), but hugetlbfs was not
 part of the problem we are trying to address here so no objection to
 opting it out.  

 You might want to consider how expensive and disruptive to the system
 hugetlb allocations can become if they do not yield additional hugepages,
 but that can be done at any time later as a general improvement rather
 than part of a series aimed at thp.

 mm/page_alloc.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4467,12 +4467,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
order,
if (page)
goto got_pg;
 
-if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
+if (order >= pageblock_order && (gfp_mask & __GFP_IO) &&
+!(gfp_mask & __GFP_RETRY_MAYFAIL)) {
/*
 * If allocating entire pageblock(s) and compaction
 * failed because all zones are below low watermarks
 * or is prohibited because it recently failed at this
-* order, fail immediately.
+* order, fail immediately unless the allocator has
+* requested compaction and reclaim retry.
 *
 * Reclaim is
 *  - potentially very expensive because zones are far


Re: [patch for-5.3 0/4] revert immediate fallback to remote hugepages

2019-10-02 Thread David Rientjes
On Wed, 2 Oct 2019, Michal Hocko wrote:

> > > If 
> > > hugetlb wants to stress this to the fullest extent possible, it already 
> > > appropriately uses __GFP_RETRY_MAYFAIL.
> > 
> > Which doesn't work anymore right now, and should again after this patch.
> 
> I didn't get to fully digest the patch Vlastimil is proposing. (Ab)using
> __GFP_NORETRY is quite subtle but it is already in place with some
> explanation and a reference to THPs. So while I am not really happy it
> is at least something you can reason about.
> 

It's a no-op:

/* Do not loop if specifically requested */
if (gfp_mask & __GFP_NORETRY)
goto nopage;

/*
 * Do not retry costly high order allocations unless they are
 * __GFP_RETRY_MAYFAIL
 */
if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
goto nopage;

So I'm not sure we should spend too much time discussing a hunk of a patch 
that doesn't do anything.

> b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction
> may not succeed") on the other hand has added a much more wider change
> which has clearly broken hugetlb and any __GFP_RETRY_MAYFAIL user of
> pageblock_order sized allocations. And that is much worse and something
> I was pointing at during the review and those concerns were never really
> addressed before merging.
> 
> In any case this is something to be fixed ASAP. Do you have any better
> proposa? I do not assume you would be proposing yet another revert.

I thought Mike Kravetz said[*] that hugetlb was not negatively affected by 
this?  We could certainly disregard this logic for __GFP_RETRY_MAYFAIL if 
anybody is relying on excessive reclaim ("swap storms") that does not 
allow compaction to make forward progress for some reason.

 [*] https://marc.info/?l=linux-kernel=156771690024533

If not, for the purposes of this conversation we can disregard 
__GFP_NORETRY per the above because thp does not use __GFP_RETRY_MAYFAIL.


Re: [patch for-5.3 0/4] revert immediate fallback to remote hugepages

2019-10-01 Thread David Rientjes
On Tue, 1 Oct 2019, Vlastimil Babka wrote:

> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4ae967bcf954..2c48146f3ee2 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2129,18 +2129,20 @@ alloc_pages_vma(gfp_t gfp, int order, struct 
> vm_area_struct *vma,
>   nmask = policy_nodemask(gfp, pol);
>   if (!nmask || node_isset(hpage_node, *nmask)) {
>   mpol_cond_put(pol);
> + /*
> +  * First, try to allocate THP only on local node, but
> +  * don't reclaim unnecessarily, just compact.
> +  */
>   page = __alloc_pages_node(hpage_node,
> - gfp | __GFP_THISNODE, order);
> + gfp | __GFP_THISNODE | __GFP_NORETRY, order);

The page allocator has heuristics to determine when compaction should be 
retried, reclaim should be retried, and the allocation itself should retry 
for high-order allocations (see PAGE_ALLOC_COSTLY_ORDER).  
PAGE_ALLOC_COSTLY_ORDER exists solely to avoid poor allocator behavior 
when reclaim itself is unlikely -- or disruptive enough -- in making that 
amount of contiguous memory available.

Rather than papering over the poor feedback loop between compaction and 
reclaim that exists in the page allocator, it's probably better to improve 
that and determine when an allocation should fail or it's worthwhile to 
retry.  That's a separate topic from NUMA locality of thp.

In other words, we should likely address how compaction and reclaim is 
done for all high order-allocations in the page allocator itself rather 
than only here for hugepage allocations and relying on specific gfp bits 
to be set.  Ask: if the allocation here should not retry regardless of why 
compaction failed, why should any high-order allocation (user or kernel) 
retry if compaction failed and at what order we should just fail?  If 
hugetlb wants to stress this to the fullest extent possible, it already 
appropriately uses __GFP_RETRY_MAYFAIL.

The code here is saying we care more about NUMA locality than hugepages 
simply because that's where the access latency is better, and it is 
specific to hugepages; allocation behavior for high-order pages needs to 
live in 
the page allocator.

>  
>   /*
> -  * If hugepage allocations are configured to always
> -  * synchronous compact or the vma has been madvised
> -  * to prefer hugepage backing, retry allowing remote
> -  * memory as well.
> +  * If that fails, allow both compaction and reclaim,
> +  * but on all nodes.
>*/
> - if (!page && (gfp & __GFP_DIRECT_RECLAIM))
> + if (!page)
>   page = __alloc_pages_node(hpage_node,
> - gfp | __GFP_NORETRY, order);
> + gfp, order);
>  
>   goto out;
>   }

We certainly don't want this for non-MADV_HUGEPAGE regions when thp 
enabled bit is not set to "always".  We'd much rather fallback to local 
native pages because of its improved access latency.  This is allowing all 
hugepage allocations to be remote even without MADV_HUGEPAGE which is not 
even what Andrea needs: qemu uses MADV_HUGEPAGE.


Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-09-26 Thread David Rientjes
On Tue, 24 Sep 2019, Mina Almasry wrote:

> > I personally prefer the one counter approach only for the reason that it
> > exposes less information about hugetlb reservations.  I was not around
> > for the introduction of hugetlb reservations, but I have fixed several
> > issues having to do with reservations.  IMO, reservations should be hidden
> > from users as much as possible.  Others may disagree.
> >
> > I really hope that Aneesh will comment.  He added the existing hugetlb
> > cgroup code.  I was not involved in that effort, but it looks like there
> > might have been some thought given to reservations in early versions of
> > that code.  It would be interesting to get his perspective.
> >
> > Changes included in patch 4 (disable region_add file_region coalescing)
> > would be needed in a one counter approach as well, so I do plan to
> > review those changes.
> 
> OK, FWIW, the 1 counter approach should be sufficient for us, so I'm
> not really opposed. David, maybe chime in if you see a problem here?
> From the perspective of hiding reservations from the user as much as
> possible, it is an improvement.
> 
> I'm only wary about changing the behavior of the current and having
> that regress applications. I'm hoping you and Aneesh can shed light on
> this.
> 

I think neither Aneesh nor myself are going to be able to provide a 
complete answer on the use of hugetlb cgroup today, anybody could be using 
it without our knowledge and that opens up the possibility that combining 
the limits would adversely affect a real system configuration.

If that is a possibility, I think we need to do some due diligence and try 
to deprecate allocation limits if possible.  One of the benefits to 
separate limits is that we can take reasonable steps toward deprecating the 
actual allocation limits: we could add warnings about the 
deprecation of allocation limits and see if anybody complains.

That could take the form of two separate limits or a tunable in the root 
hugetlb cgroup that defines whether the limits are for allocation or 
reservation.

Combining them in the first pass seems to be very risky and could cause 
pain for users that will not detect this during an rc cycle and will 
report the issue only when their distro gets it.  Then we are left with no 
alternative other than stable backports and the separation of the limits 
anyway.


Re: [PATCH] mm, vmpressure: Fix a signedness bug in vmpressure_register_event()

2019-09-26 Thread David Rientjes
On Wed, 25 Sep 2019, Dan Carpenter wrote:

> The "mode" and "level" variables are enums and in this context GCC will
> treat them as unsigned ints so the error handling is never triggered.
> 
> I also removed the bogus initializer because it isn't required any more
> and it's sort of confusing.
> 
> Fixes: 3cadfa2b9497 ("mm/vmpressure.c: convert to use match_string() helper")
> Signed-off-by: Dan Carpenter 

Acked-by: David Rientjes 


Re: [patch for-5.3 0/4] revert immediate fallback to remote hugepages

2019-09-26 Thread David Rientjes
On Wed, 25 Sep 2019, Michal Hocko wrote:

> I am especially interested about this part. The more I think about this
> the more I am convinced that the underlying problem really is in the pre
> mature fallback in the fast path.

I appreciate you taking the time to continue to look at this but I'm 
confused about the underlying problem you're referring to: we had no 
underlying problem until 5.3 was released so we need to carry patches that 
will revert this behavior (we simply can't tolerate double digit memory 
access latency regressions lol).

If you're referring to post-5.3 behavior, this appears to override 
alloc_hugepage_direct_gfpmask() but directly in the page allocator.

static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, 
unsigned long addr)
{
...
/*
 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
 * specified, to express a general desire to stay on the current

Your patch is setting __GFP_THISNODE for __GFP_DIRECT_RECLAIM: this 
allocation will fail in the fastpath for both my case (fragmented local 
node) and Andrea's case (out of memory local node).  The first 
get_page_from_freelist() will then succeed in the slowpath for both cases; 
compaction is not tried for either.

In my case, that results in a perpetual remote access latency that we 
can't tolerate.  If Andrea's remote nodes are fragmented or low on memory, 
his case encounters swap storms over both the local node and remote nodes.

So I'm not really sure what is solved by your patch?

We're not on 5.3, but I can try it and collect data on exactly how poorly 
it performs on fragmented *hosts* (not single nodes, but really the whole 
system) because swap storms were never fixed here, it was only papered 
over.

> Does the almost-patch below helps your
> workload? It effectively reduces the fast path for higher order
> allocations to the local/requested node. The justification is that
> watermark check might be too strict for those requests as it is primary
> order-0 oriented. Low watermark target simply has no meaning for the
> higher order requests AFAIU. The min-low gap is giving kswapd a chance
> to balance and be more local node friendly while we do not have anything
> like that in compaction.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ff5484fdbdf9..09036cf55fca 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4685,7 +4685,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int 
> order, int preferred_nid,
>  {
>   struct page *page;
>   unsigned int alloc_flags = ALLOC_WMARK_LOW;
> - gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
> + gfp_t fastpath_mask, alloc_mask; /* The gfp_t that was actually used 
> for allocation */
>   struct alloc_context ac = { };
>  
>   /*
> @@ -4698,7 +4698,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int 
> order, int preferred_nid,
>   }
>  
>   gfp_mask &= gfp_allowed_mask;
> - alloc_mask = gfp_mask;
> + fastpath_mask = alloc_mask = gfp_mask;
>   if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, , 
> _mask, _flags))
>   return NULL;
>  
> @@ -4710,8 +4710,17 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int 
> order, int preferred_nid,
>*/
>   alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, 
> gfp_mask);
>  
> - /* First allocation attempt */
> - page = get_page_from_freelist(alloc_mask, order, alloc_flags, );
> + /*
> +  * First allocation attempt. If we have a high order allocation then do 
> not fall
> +  * back to a remote node just based on the watermark check on the 
> requested node
> +  * because compaction might easily free up a requested order and then 
> it would be
> +  * better to simply go to the slow path.
> +  * TODO: kcompactd should help here but nobody has woken it up unless 
> we hit the
> +  * slow path so we might need some tuning there as well.
> +  */
> + if (order && (gfp_mask & __GFP_DIRECT_RECLAIM))
> + fastpath_mask |= __GFP_THISNODE;
> + page = get_page_from_freelist(fastpath_mask, order, alloc_flags, );
>   if (likely(page))
>   goto out;


Re: [PATCH] mm: slub: print_hex_dump() with DUMP_PREFIX_OFFSET

2019-09-21 Thread David Rientjes
On Fri, 20 Sep 2019, Miles Chen wrote:

> Since commit ad67b74d2469d9b8 ("printk: hash addresses printed with %p"),
> The use DUMP_PREFIX_OFFSET instead of DUMP_PREFIX_ADDRESS with
> print_hex_dump() can generate more useful messages.
> 
> In the following example, it's easier get the offset of incorrect poison
> value with DUMP_PREFIX_OFFSET.
> 
> Before:
> Object e817b73b: 00 00 00 00 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Object 816f4601: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Object 169d2bb8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Object f4c38716: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Object 917e3491: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Object c0e33738: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Object 1755ddd1: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> 
> After:
> Object : 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Object 0010: 63 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Object 0020: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Object 0030: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5
> 
> I think it might be worth to convert all DUMP_PREFIX_ADDRESS to
> DUMP_PREFIX_OFFSET for the whole Linux kernel.
> 

I agree it looks nicer for poisoning, I'm not sure that every caller of 
print_section() is the same, however.  For example trace() seems better 
off as DUMP_PREFIX_ADDRESS since it already specifies the address of the 
object being allocated or freed and offset here wouldn't really be useful, 
no?
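
For reference, a conversion like the one being proposed is essentially just 
the prefix type argument passed to print_hex_dump() from print_section() 
(sketch from memory, the surrounding code may differ):

	print_hex_dump(level, text, DUMP_PREFIX_OFFSET, 16, 1, addr, length, 1);

whereas callers such as trace() seem better off keeping DUMP_PREFIX_ADDRESS 
because the absolute address is the useful part of their output.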


Re: [PATCH] mm/slub: fix -Wunused-function compiler warnings

2019-09-17 Thread David Rientjes
On Tue, 17 Sep 2019, Qian Cai wrote:

> tid_to_cpu() and tid_to_event() are only used in note_cmpxchg_failure()
> when SLUB_DEBUG_CMPXCHG=y, so when SLUB_DEBUG_CMPXCHG=n by default,
> Clang will complain that those unused functions.
> 
> Signed-off-by: Qian Cai 

Acked-by: David Rientjes 


Re: [RFC] mm: Proactive compaction

2019-09-17 Thread David Rientjes
On Tue, 17 Sep 2019, John Hubbard wrote:

> > We've had good success with periodically compacting memory on a regular 
> > cadence on systems with hugepages enabled.  The cadence itself is defined 
> > by the admin but it causes khugepaged[*] to periodically wakeup and invoke 
> > compaction in an attempt to keep zones as defragmented as possible 
> 
> That's an important data point, thanks for reporting it. 
> 
> And given that we have at least one data point validating it, I think we
> should feel fairly comfortable with this approach. Because the sys admin 
> probably knows  when are the best times to steal cpu cycles and recover 
> some huge pages. Unlike the kernel, the sys admin can actually see the 
> future sometimes, because he/she may know what is going to be run.
> 
> It's still sounding like we can expect excellent results from simply 
> defragmenting from user space, via a chron job and/or before running
> important tests, rather than trying to have the kernel guess whether 
> it's a performance win to defragment at some particular time.
> 
> Are you using existing interfaces, or did you need to add something? How
> exactly are you triggering compaction?
> 

It's possible to do this through a cron job but there are a few reasons 
that we preferred to do it through khugepaged:

 - we use a lighter variation of compaction (MIGRATE_SYNC_LIGHT) than what 
   the per-node trigger provides, since compact_node() forces MIGRATE_SYNC
   and can stall for minutes and become disruptive under some
   circumstances,

 - we do not ignore the pageblock skip hint which compact_node() hardcodes 
   to ignore, and 

 - we didn't want to do this in process context so that the cpu time is
   not taxed to any user cgroup since it's on behalf of the system as a
   whole.

It seems much better to do this on a per-node basis to partition the work, 
rather than through the sysctl that does it for the whole system.  Extending 
the per-node interface to use MIGRATE_SYNC_LIGHT and not ignore pageblock 
skip is possible, but the work would still be done in process context, so if 
driven from userspace it would need to be attached to a cgroup that is not 
taxed for usage done on behalf of the entire system.
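
For reference, the difference from the existing per-node trigger boils down 
to roughly this (illustrative compact_control settings only; the real code 
obviously needs the zone and the rest of the control structure):

	struct compact_control cc = {
		.order			= -1,			/* compact the whole zone */
		.mode			= MIGRATE_SYNC_LIGHT,	/* rather than MIGRATE_SYNC */
		.ignore_skip_hint	= false,		/* honor the pageblock skip bits */
		.whole_zone		= true,
		.gfp_mask		= GFP_KERNEL,
	};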

Again, we're using khugepaged and allowing the period to be defined 
through /sys/kernel/mm/transparent_hugepage/khugepaged but that is because 
we only want to do this on systems where we want to dynamically allocate 
hugepages on a regular basis.


Re: [RFC PATCH] mm/slub: remove left-over debugging code

2019-09-17 Thread David Rientjes
On Tue, 17 Sep 2019, Qian Cai wrote:

> > The cmpxchg failures could likely be more generalized beyond SLUB since 
> > there will be other dependencies in the kernel than just this allocator.
> 
> OK, SLUB_RESILIENCY_TEST is fine to keep around and maybe be turned into a
> Kconfig option to make it more visible.
> 
> Is it fine to remove SLUB_DEBUG_CMPXCHG? If somebody later want to generalize 
> it
> beyond SLUB, he/she can always find the old code somewhere anyway.
> 

Beyond the fact that your patch doesn't compile, slub is the most notable 
(only?) user of double cmpxchg in the kernel so generalizing it would only 
serve to add more indirection at the moment.  If/when it becomes more 
widely used, we can have a discussion about generalizing it so that we can 
detect failures even when SLUB is not used.

Note that the primary purpose of the option is to diagnose issues when the 
CMPXCHG_DOUBLE_FAIL is observed.  If we encounter that, we wouldn't have 
any diagnostic tools to look deeper without adding this code back.  So I 
don't think anything around cmpxchg failure notifications needs to be 
changed right now.


Re: [RFC] mm: Proactive compaction

2019-09-16 Thread David Rientjes
On Fri, 16 Aug 2019, Nitin Gupta wrote:

> For some applications we need to allocate almost all memory as
> hugepages. However, on a running system, higher order allocations can
> fail if the memory is fragmented. Linux kernel currently does
> on-demand compaction as we request more hugepages but this style of
> compaction incurs very high latency. Experiments with one-time full
> memory compaction (followed by hugepage allocations) shows that kernel
> is able to restore a highly fragmented memory state to a fairly
> compacted memory state within <1 sec for a 32G system. Such data
> suggests that a more proactive compaction can help us allocate a large
> fraction of memory as hugepages keeping allocation latencies low.
> 
> For a more proactive compaction, the approach taken here is to define
> per page-order external fragmentation thresholds and let kcompactd
> threads act on these thresholds.
> 
> The low and high thresholds are defined per page-order and exposed
> through sysfs:
> 
>   /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> 
> Per-node kcompactd thread is woken up every few seconds to check if
> any zone on its node has extfrag above the extfrag_high threshold for
> any order, in which case the thread starts compaction in the background
> till all zones are below extfrag_low level for all orders. By default
> both these thresholds are set to 100 for all orders which essentially
> disables kcompactd.
> 
> To avoid wasting CPU cycles when compaction cannot help, such as when
> memory is full, we check both, extfrag > extfrag_high and
> compaction_suitable(zone). This allows the kcompactd thread to stay inactive
> even if extfrag thresholds are not met.
> 
> This patch is largely based on ideas from Michal Hocko posted here:
> https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/
> 
> Testing done (on x86):
>  - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
>  respectively.
>  - Use a test program to fragment memory: the program allocates all memory
>  and then for each 2M aligned section, frees 3/4 of base pages using
>  munmap.
>  - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts
>  compaction till extfrag < extfrag_low for order-9.
> 
> The patch has plenty of rough edges but posting it early to see if I'm
> going in the right direction and to get some early feedback.
> 

Is there an update to this proposal or non-RFC patch that has been posted 
for proactive compaction?

We've had good success with periodically compacting memory on a regular 
cadence on systems with hugepages enabled.  The cadence itself is defined 
by the admin but it causes khugepaged[*] to periodically wakeup and invoke 
compaction in an attempt to keep zones as defragmented as possible 
(perhaps more "proactive" than what is proposed here in an attempt to keep 
all memory as unfragmented as possible regardless of extfrag thresholds).  
It also avoids corner-cases where kcompactd could become more expensive 
than what is anticipated because it is unsuccessful at compacting memory 
yet the extfrag threshold is still exceeded.

 [*] Khugepaged instead of kcompactd only because this is only enabled
 for systems where transparent hugepages are enabled, probably better
 off in kcompactd to avoid duplicating work between two kthreads if
 there is already a need for background compaction.


Re: WARNING in implement

2019-09-16 Thread David Rientjes
On Mon, 16 Sep 2019, syzbot wrote:

> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:f0df5c1b usb-fuzzer: main usb gadget fuzzer driver
> git tree:   https://github.com/google/kasan.git usb-fuzzer
> console output: https://syzkaller.appspot.com/x/log.txt?x=170b213e60
> kernel config:  https://syzkaller.appspot.com/x/.config?x=5c6633fa4ed00be5
> dashboard link: https://syzkaller.appspot.com/bug?extid=38e7237add3712479d65
> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=16830dc160
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11f2d3fa60
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+38e7237add3712479...@syzkaller.appspotmail.com
> 
> keytouch 0003:0926:.0001: implement() called with too large value 32769
> (n: 1)! (kworker/0:1)
> [ cut here ]
> WARNING: CPU: 0 PID: 12 at drivers/hid/hid-core.c:1370
> implement.cold+0x40/0x81 drivers/hid/hid-core.c:1370
> Kernel panic - not syncing: panic_on_warn set ...
> CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.3.0-rc7+ #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
> 01/01/2011
> Workqueue: events hidinput_led_worker
> Call Trace:
> __dump_stack lib/dump_stack.c:77 [inline]
> dump_stack+0xca/0x13e lib/dump_stack.c:113
> panic+0x2a3/0x6da kernel/panic.c:219
> __warn.cold+0x20/0x4a kernel/panic.c:576
> report_bug+0x262/0x2a0 lib/bug.c:186
> fixup_bug arch/x86/kernel/traps.c:179 [inline]
> fixup_bug arch/x86/kernel/traps.c:174 [inline]
> do_error_trap+0x12b/0x1e0 arch/x86/kernel/traps.c:272
> do_invalid_op+0x32/0x40 arch/x86/kernel/traps.c:291
> invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1028
> RIP: 0010:implement.cold+0x40/0x81 drivers/hid/hid-core.c:1370
> Code: 00 ef 01 00 48 c7 c2 a0 ac 52 86 48 c7 c6 c0 8a 52 86 4c 8d 88 70 06 00
> 00 e8 3f 3b 1a fe 48 c7 c7 20 8b 52 86 e8 fc 4e d7 fc <0f> 0b 44 21 e5 e9 06
> 3a ff ff e8 64 ad ec fc 49 8d bd 28 19 00 00
> RSP: 0018:8881da20fb88 EFLAGS: 00010082
> RAX: 0024 RBX:  RCX: 
> RDX:  RSI: 81288ddd RDI: ed103b441f63
> RBP: 8001 R08: 0024 R09: ed103b643ee7
> R10: ed103b643ee6 R11: 8881db21f737 R12: 0001
> R13: 8881d28d8000 R14: 0001 R15: 0001
> hid_output_field drivers/hid/hid-core.c:1543 [inline]
> hid_output_report+0x2dc/0x4c0 drivers/hid/hid-core.c:1562
> __usbhid_submit_report drivers/hid/usbhid/hid-core.c:593 [inline]
> usbhid_submit_report+0x65c/0xde0 drivers/hid/usbhid/hid-core.c:638
> usbhid_request+0x3c/0x70 drivers/hid/usbhid/hid-core.c:1252
> hidinput_led_worker+0xbd/0x360 drivers/hid/hid-input.c:1495
> process_one_work+0x92b/0x1530 kernel/workqueue.c:2269
> worker_thread+0x96/0xe20 kernel/workqueue.c:2415
> kthread+0x318/0x420 kernel/kthread.c:255
> ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

I assume this is the same issue as 
https://marc.info/?l=linux-kernel=156865976021549 and any fix for that 
issue would apply to this one as well.  Looks like syzbot found Benjamin 
and Jiri already.


Re: WARNING in __alloc_pages_nodemask

2019-09-16 Thread David Rientjes
On Mon, 16 Sep 2019, syzbot wrote:

> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:f0df5c1b usb-fuzzer: main usb gadget fuzzer driver
> git tree:   https://github.com/google/kasan.git usb-fuzzer
> console output: https://syzkaller.appspot.com/x/log.txt?x=14b1537160
> kernel config:  https://syzkaller.appspot.com/x/.config?x=5c6633fa4ed00be5
> dashboard link: https://syzkaller.appspot.com/bug?extid=e38fe539fedfc127987e
> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1093bed160
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1603cfc660
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+e38fe539fedfc1279...@syzkaller.appspotmail.com
> 
> WARNING: CPU: 0 PID: 1720 at mm/page_alloc.c:4696
> __alloc_pages_nodemask+0x36f/0x780 mm/page_alloc.c:4696
> Kernel panic - not syncing: panic_on_warn set ...
> CPU: 0 PID: 1720 Comm: syz-executor388 Not tainted 5.3.0-rc7+ #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
> 01/01/2011
> Call Trace:
> __dump_stack lib/dump_stack.c:77 [inline]
> dump_stack+0xca/0x13e lib/dump_stack.c:113
> panic+0x2a3/0x6da kernel/panic.c:219
> __warn.cold+0x20/0x4a kernel/panic.c:576
> report_bug+0x262/0x2a0 lib/bug.c:186
> fixup_bug arch/x86/kernel/traps.c:179 [inline]
> fixup_bug arch/x86/kernel/traps.c:174 [inline]
> do_error_trap+0x12b/0x1e0 arch/x86/kernel/traps.c:272
> do_invalid_op+0x32/0x40 arch/x86/kernel/traps.c:291
> invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1028
> RIP: 0010:__alloc_pages_nodemask+0x36f/0x780 mm/page_alloc.c:4696
> Code: fe ff ff 65 48 8b 04 25 00 ef 01 00 48 05 60 10 00 00 41 be 01 00 00 00
> 48 89 44 24 58 e9 ee fd ff ff 81 e5 00 20 00 00 75 02 <0f> 0b 45 31 f6 e9 6b
> ff ff ff 8b 44 24 68 89 04 24 65 8b 2d e9 7e
> RSP: 0018:8881d320f9d8 EFLAGS: 00010046
> RAX:  RBX: 11103a641f3f RCX: 
> RDX:  RSI: dc00 RDI: 00040a20
> RBP:  R08: 8881d3bcc800 R09: ed103a541d19
> R10: ed103a541d18 R11: 8881d2a0e8c7 R12: 0012
> R13: 0012 R14:  R15: 8881d2a0e8c0
> alloc_pages_current+0xff/0x200 mm/mempolicy.c:2153
> alloc_pages include/linux/gfp.h:509 [inline]
> kmalloc_order+0x1a/0x60 mm/slab_common.c:1257
> kmalloc_order_trace+0x18/0x110 mm/slab_common.c:1269
> __usbhid_submit_report drivers/hid/usbhid/hid-core.c:588 [inline]
> usbhid_submit_report+0x5b5/0xde0 drivers/hid/usbhid/hid-core.c:638
> usbhid_request+0x3c/0x70 drivers/hid/usbhid/hid-core.c:1252
> hid_hw_request include/linux/hid.h:1053 [inline]
> hiddev_ioctl+0x526/0x1550 drivers/hid/usbhid/hiddev.c:735
> vfs_ioctl fs/ioctl.c:46 [inline]
> file_ioctl fs/ioctl.c:509 [inline]
> do_vfs_ioctl+0xd2d/0x1330 fs/ioctl.c:696
> ksys_ioctl+0x9b/0xc0 fs/ioctl.c:713
> __do_sys_ioctl fs/ioctl.c:720 [inline]
> __se_sys_ioctl fs/ioctl.c:718 [inline]
> __x64_sys_ioctl+0x6f/0xb0 fs/ioctl.c:718
> do_syscall_64+0xb7/0x580 arch/x86/entry/common.c:296
> entry_SYSCALL_64_after_hwframe+0x49/0xbe

Adding Jiri and Benjamin.  The hid report length is simply too large for 
the page allocator to allocate: this is triggering because the resulting 
allocation order is > MAX_ORDER-1.  Any way to make this allocate less 
physically contiguous memory?


Re: [RFC PATCH] mm/slub: remove left-over debugging code

2019-09-16 Thread David Rientjes
On Mon, 16 Sep 2019, Qian Cai wrote:

> SLUB_RESILIENCY_TEST and SLUB_DEBUG_CMPXCHG look like some left-over
> debugging code during the internal development that probably nobody uses
> it anymore. Remove them to make the world greener.

Adding Pengfei Li, who has been working on a patchset that modifies the 
handling of kmalloc cache initialization and touches the resiliency test.

I still find the resiliency test to be helpful/instructional for handling 
unexpected conditions in these caches, so I'd suggest against removing it: 
the only downside is that it's additional source code.  But it's helpful 
source code for reference.

The cmpxchg failures could likely be more generalized beyond SLUB since 
there will be other dependencies in the kernel than just this allocator.

(I assume you didn't send a Signed-off-by line because this is an RFC.)

> ---
>  mm/slub.c | 110 
> --
>  1 file changed, 110 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 8834563cdb4b..f97155ba097d 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -150,12 +150,6 @@ static inline bool kmem_cache_has_cpu_partial(struct 
> kmem_cache *s)
>   * - Variable sizing of the per node arrays
>   */
>  
> -/* Enable to test recovery from slab corruption on boot */
> -#undef SLUB_RESILIENCY_TEST
> -
> -/* Enable to log cmpxchg failures */
> -#undef SLUB_DEBUG_CMPXCHG
> -
>  /*
>   * Mininum number of partial slabs. These will be left on the partial
>   * lists even if they are empty. kmem_cache_shrink may reclaim them.
> @@ -392,10 +386,6 @@ static inline bool __cmpxchg_double_slab(struct 
> kmem_cache *s, struct page *page
>   cpu_relax();
>   stat(s, CMPXCHG_DOUBLE_FAIL);
>  
> -#ifdef SLUB_DEBUG_CMPXCHG
> - pr_info("%s %s: cmpxchg double redo ", n, s->name);
> -#endif
> -
>   return false;
>  }
>  
> @@ -433,10 +423,6 @@ static inline bool cmpxchg_double_slab(struct kmem_cache 
> *s, struct page *page,
>   cpu_relax();
>   stat(s, CMPXCHG_DOUBLE_FAIL);
>  
> -#ifdef SLUB_DEBUG_CMPXCHG
> - pr_info("%s %s: cmpxchg double redo ", n, s->name);
> -#endif
> -
>   return false;
>  }
>  
> @@ -2004,45 +1990,11 @@ static inline unsigned long next_tid(unsigned long 
> tid)
>   return tid + TID_STEP;
>  }
>  
> -static inline unsigned int tid_to_cpu(unsigned long tid)
> -{
> - return tid % TID_STEP;
> -}
> -
> -static inline unsigned long tid_to_event(unsigned long tid)
> -{
> - return tid / TID_STEP;
> -}
> -
>  static inline unsigned int init_tid(int cpu)
>  {
>   return cpu;
>  }
>  
> -static inline void note_cmpxchg_failure(const char *n,
> - const struct kmem_cache *s, unsigned long tid)
> -{
> -#ifdef SLUB_DEBUG_CMPXCHG
> - unsigned long actual_tid = __this_cpu_read(s->cpu_slab->tid);
> -
> - pr_info("%s %s: cmpxchg redo ", n, s->name);
> -
> -#ifdef CONFIG_PREEMPT
> - if (tid_to_cpu(tid) != tid_to_cpu(actual_tid))
> - pr_warn("due to cpu change %d -> %d\n",
> - tid_to_cpu(tid), tid_to_cpu(actual_tid));
> - else
> -#endif
> - if (tid_to_event(tid) != tid_to_event(actual_tid))
> - pr_warn("due to cpu running other code. Event %ld->%ld\n",
> - tid_to_event(tid), tid_to_event(actual_tid));
> - else
> - pr_warn("for unknown reason: actual=%lx was=%lx target=%lx\n",
> - actual_tid, tid, next_tid(tid));
> -#endif
> - stat(s, CMPXCHG_DOUBLE_CPU_FAIL);
> -}
> -
>  static void init_kmem_cache_cpus(struct kmem_cache *s)
>  {
>   int cpu;
> @@ -2751,7 +2703,6 @@ static __always_inline void *slab_alloc_node(struct 
> kmem_cache *s,
>   object, tid,
>   next_object, next_tid(tid {
>  
> - note_cmpxchg_failure("slab_alloc", s, tid);
>   goto redo;
>   }
>   prefetch_freepointer(s, next_object);
> @@ -4694,66 +4645,6 @@ static int list_locations(struct kmem_cache *s, char 
> *buf,
>  }
>  #endif   /* CONFIG_SLUB_DEBUG */
>  
> -#ifdef SLUB_RESILIENCY_TEST
> -static void __init resiliency_test(void)
> -{
> - u8 *p;
> - int type = KMALLOC_NORMAL;
> -
> - BUILD_BUG_ON(KMALLOC_MIN_SIZE > 16 || KMALLOC_SHIFT_HIGH < 10);
> -
> - pr_err("SLUB resiliency testing\n");
> - pr_err("---\n");
> - pr_err("A. Corruption after allocation\n");
> -
> - p = kzalloc(16, GFP_KERNEL);
> - p[16] = 0x12;
> - pr_err("\n1. kmalloc-16: Clobber Redzone/next pointer 0x12->0x%p\n\n",
> -p + 16);
> -
> - validate_slab_cache(kmalloc_caches[type][4]);
> -
> - /* Hmmm... The next two are dangerous */
> - p = kzalloc(32, GFP_KERNEL);
> - p[32 + sizeof(void *)] = 0x34;
> - pr_err("\n2. kmalloc-32: Clobber next pointer/next slab 0x34 -> 
> -0x%p\n",
> -p);
> - pr_err("If allocated object is overwritten then not 

Re: [RESEND v4 7/7] mm, slab_common: Modify kmalloc_caches[type][idx] to kmalloc_caches[idx][type]

2019-09-15 Thread David Rientjes
On Mon, 16 Sep 2019, Pengfei Li wrote:

> KMALLOC_NORMAL is the most frequently accessed, and kmalloc_caches[]
> is initialized by different types of the same size.
> 
> So modifying kmalloc_caches[type][idx] to kmalloc_caches[idx][type]
> will benefit performance.
> 
> $ ./scripts/bloat-o-meter vmlinux.patch_1-6 vmlinux.patch_1-7
> add/remove: 0/0 grow/shrink: 2/57 up/down: 8/-457 (-449)
> Function old new   delta
> tg3_self_test   42554259  +4
> nf_queue 666 670  +4
> kmalloc_slab  97  93  -4
> i915_sw_fence_await_dma_fence441 437  -4
> __igmp_group_dropped 619 615  -4
> gss_import_sec_context   176 170  -6
> xhci_alloc_command   212 205  -7
> create_kmalloc_caches155 148  -7
> xprt_switch_alloc136 128  -8
> xhci_segment_alloc   297 289  -8
> xhci_ring_alloc  369 361  -8
> xhci_mem_init   36643656  -8
> xhci_alloc_virt_device   496 488  -8
> xhci_alloc_tt_info   346 338  -8
> xhci_alloc_stream_info   718 710  -8
> xhci_alloc_container_ctx 215 207  -8
> xfrm_policy_alloc271 263  -8
> tcp_sendmsg_locked  31203112  -8
> tcp_md5_do_add   774 766  -8
> tcp_fastopen_defer_connect   270 262  -8
> sr_read_tochdr.isra  251 243  -8
> sr_read_tocentry.isra328 320  -8
> sr_is_xa 376 368  -8
> sr_get_mcn   260 252  -8
> selinux_sk_alloc_security113 105  -8
> sdev_evt_send_simple 118 110  -8
> sdev_evt_alloc79  71  -8
> scsi_probe_and_add_lun  29382930  -8
> sbitmap_queue_init_node  418 410  -8
> ring_buffer_read_prepare  94  86  -8
> request_firmware_nowait  396 388  -8
> regulatory_hint_found_beacon 394 386  -8
> ohci_urb_enqueue31763168  -8
> nla_strdup   142 134  -8
> nfs_alloc_seqid   87  79  -8
> nfs4_get_state_owner10401032  -8
> nfs4_do_close578 570  -8
> nf_ct_tmpl_alloc  85  77  -8
> mempool_create_node  164 156  -8
> ip_setup_cork362 354  -8
> ip6_setup_cork  10211013  -8
> gss_create_cred  140 132  -8
> drm_flip_work_allocate_task   70  62  -8
> dma_pool_alloc   410 402  -8
> devres_open_group214 206  -8
> cfg80211_stop_iface  260 252  -8
> cfg80211_sinfo_alloc_tid_stats77  69  -8
> cfg80211_port_authorized 212 204  -8
> cfg80211_parse_mbssid_data  23972389  -8
> cfg80211_ibss_joined 335 327  -8
> call_usermodehelper_setup149 141  -8
> bpf_prog_alloc_no_stats  182 174  -8
> blk_alloc_flush_queue191 183  -8
> bdi_alloc_node   195 187  -8
> audit_log_d_path 196 188  -8
> _netlbl_catmap_getnode   247 239  -8
> ip_mc_inc_group  475 467  -8
> __i915_sw_fence_await_sw_fence   417 405 -12
> ida_alloc_range  955 934 -21
> Total: Before=14874316, After=14873867, chg -0.00%
> 
> Signed-off-by: Pengfei Li 

This also seems more intuitive.

Acked-by: David Rientjes 


Re: [RESEND v4 4/7] mm, slab: Return ZERO_SIZE_ALLOC for zero sized kmalloc requests

2019-09-15 Thread David Rientjes
On Mon, 16 Sep 2019, Pengfei Li wrote:

> This is a preparation patch, just replace 0 with ZERO_SIZE_ALLOC
> as the return value of zero sized requests.
> 
> Signed-off-by: Pengfei Li 

Acked-by: David Rientjes 


Re: [RESEND v4 5/7] mm, slab_common: Make kmalloc_caches[] start at size KMALLOC_MIN_SIZE

2019-09-15 Thread David Rientjes
On Mon, 16 Sep 2019, Pengfei Li wrote:

> Currently, kmalloc_cache[] is not sorted by size, kmalloc_cache[0]
> is kmalloc-96, kmalloc_cache[1] is kmalloc-192 (when ARCH_DMA_MINALIGN
> is not defined).
> 
> As suggested by Vlastimil Babka,
> 
> "Since you're doing these cleanups, have you considered reordering
> kmalloc_info, size_index, kmalloc_index() etc so that sizes 96 and 192
> are ordered naturally between 64, 128 and 256? That should remove
> various special casing such as in create_kmalloc_caches(). I can't
> guarantee it will be possible without breaking e.g. constant folding
> optimizations etc., but seems to me it should be feasible. (There are
> definitely more places to change than those I listed.)"
> 
> So this patch reordered kmalloc_info[], kmalloc_caches[], and modified
> kmalloc_index() and kmalloc_slab() accordingly.
> 
> As a result, there is no subtle judgment about size in
> create_kmalloc_caches(). And initialize kmalloc_cache[] from 0 instead
> of KMALLOC_SHIFT_LOW.
> 
> I used ./scripts/bloat-o-meter to measure the impact of this patch on
> performance. The results show that it brings some benefits.
> 
> Considering the size change of kmalloc_info[], the size of the code is
> actually about 641 bytes less.
> 

bloat-o-meter is reporting a net benefit of -241 bytes for this, so I'm not 
sure about the relevance of the difference for kmalloc_info alone.

This, to me, looks like increased complexity for the statically allocated 
arrays vs the runtime complexity when initializing the caches themselves.  
Not sure that this is an improvement given that you still need to do 
things like

+#if KMALLOC_SIZE_96_EXIST == 1
+   if (size > 64 && size <= 96) return (7 - KMALLOC_IDX_ADJ_0);
+#endif
+
+#if KMALLOC_SIZE_192_EXIST == 1
+   if (size > 128 && size <= 192) return (8 - KMALLOC_IDX_ADJ_1);
+#endif


Re: [RESEND v4 3/7] mm, slab_common: Use enum kmalloc_cache_type to iterate over kmalloc caches

2019-09-15 Thread David Rientjes
On Mon, 16 Sep 2019, Pengfei Li wrote:

> The type of local variable *type* of new_kmalloc_cache() should
> be enum kmalloc_cache_type instead of int, so correct it.
> 
> Signed-off-by: Pengfei Li 
> Acked-by: Vlastimil Babka 
> Acked-by: Roman Gushchin 

Acked-by: David Rientjes 


Re: [RESEND v4 2/7] mm, slab: Remove unused kmalloc_size()

2019-09-15 Thread David Rientjes
On Mon, 16 Sep 2019, Pengfei Li wrote:

> The size of kmalloc can be obtained from kmalloc_info[],
> so remove kmalloc_size() that will not be used anymore.
> 
> Signed-off-by: Pengfei Li 
> Acked-by: Vlastimil Babka 
> Acked-by: Roman Gushchin 

Acked-by: David Rientjes 


Re: [RESEND v4 1/7] mm, slab: Make kmalloc_info[] contain all types of names

2019-09-15 Thread David Rientjes
On Mon, 16 Sep 2019, Pengfei Li wrote:

> There are three types of kmalloc, KMALLOC_NORMAL, KMALLOC_RECLAIM
> and KMALLOC_DMA.
> 
> The name of KMALLOC_NORMAL is contained in kmalloc_info[].name,
> but the names of KMALLOC_RECLAIM and KMALLOC_DMA are dynamically
> generated by kmalloc_cache_name().
> 
> This patch predefines the names of all types of kmalloc to save
> the time spent dynamically generating names.
> 
> Besides, remove the kmalloc_cache_name() that is no longer used.
> 
> Signed-off-by: Pengfei Li 
> Acked-by: Vlastimil Babka 
> Acked-by: Roman Gushchin 

It's unfortunate the existing names are kmalloc-, dma-kmalloc-, and 
kmalloc-rcl- since they aren't following any standard naming convention.

Also not sure I understand the SET_KMALLOC_SIZE naming since this isn't 
just setting a size.  Maybe better off as INIT_KMALLOC_INFO?

Nothing major though, so:

Acked-by: David Rientjes 


Re: [RESEND v4 6/7] mm, slab_common: Initialize the same size of kmalloc_caches[]

2019-09-15 Thread David Rientjes
On Mon, 16 Sep 2019, Pengfei Li wrote:

> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 2aed30deb071..e7903bd28b1f 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1165,12 +1165,9 @@ void __init setup_kmalloc_cache_index_table(void)
>   size_index[size_index_elem(i)] = 0;
>  }
>  
> -static void __init
> +static __always_inline void __init
>  new_kmalloc_cache(int idx, enum kmalloc_cache_type type, slab_flags_t flags)
>  {
> - if (type == KMALLOC_RECLAIM)
> - flags |= SLAB_RECLAIM_ACCOUNT;
> -
>   kmalloc_caches[type][idx] = create_kmalloc_cache(
>   kmalloc_info[idx].name[type],
>   kmalloc_info[idx].size, flags, 0,
> @@ -1185,30 +1182,22 @@ new_kmalloc_cache(int idx, enum kmalloc_cache_type 
> type, slab_flags_t flags)
>  void __init create_kmalloc_caches(slab_flags_t flags)
>  {
>   int i;
> - enum kmalloc_cache_type type;
>  
> - for (type = KMALLOC_NORMAL; type <= KMALLOC_RECLAIM; type++) {
> - for (i = 0; i < KMALLOC_CACHE_NUM; i++) {
> - if (!kmalloc_caches[type][i])
> - new_kmalloc_cache(i, type, flags);
> - }
> - }
> + for (i = 0; i < KMALLOC_CACHE_NUM; i++) {
> + if (!kmalloc_caches[KMALLOC_NORMAL][i])
> + new_kmalloc_cache(i, KMALLOC_NORMAL, flags);
>  
> - /* Kmalloc array is now usable */
> - slab_state = UP;
> + new_kmalloc_cache(i, KMALLOC_RECLAIM,
> + flags | SLAB_RECLAIM_ACCOUNT);

This seems less robust, no?  Previously we verified that the cache doesn't 
exist before creating a new cache over top of it (for NORMAL and RECLAIM).  
Now we presume that the RECLAIM cache never exists.

Can we just move a check to new_kmalloc_cache() to see if 
kmalloc_caches[type][idx] already exists and, if so, just return?  This 
should be more robust and simplify create_kmalloc_caches() slightly more.
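
That is, something like this (sketch only, on top of your series):

	static __always_inline void __init
	new_kmalloc_cache(int idx, enum kmalloc_cache_type type, slab_flags_t flags)
	{
		if (kmalloc_caches[type][idx])
			return;

		kmalloc_caches[type][idx] = create_kmalloc_cache(
				kmalloc_info[idx].name[type],
				kmalloc_info[idx].size, flags, 0,
				kmalloc_info[idx].size);
	}

so that callers never need to check for an existing cache themselves.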


Re: [patch for-5.3 0/4] revert immediate fallback to remote hugepages

2019-09-08 Thread David Rientjes
On Sun, 8 Sep 2019, Vlastimil Babka wrote:

> > On Sat, 7 Sep 2019, Linus Torvalds wrote:
> > 
> >>> Andrea acknowledges the swap storm that he reported would be fixed with
> >>> the last two patches in this series
> >>
> >> The problem is that even you aren't arguing that those patches should
> >> go into 5.3.
> >>
> > 
> > For three reasons: (a) we lack a test result from Andrea,
> 
> That's argument against the rfc patches 3+4s, no? But not for including
> the reverts of reverts of reverts (patches 1+2).
> 

Yes, thanks: I would strongly prefer not to propose rfc patches 3-4 
without a testing result from Andrea and collaboration to fix the 
underlying issue.  My suggestion to Linus is to merge patches 1-2 so we 
don't have additional semantics for MADV_HUGEPAGE or thp enabled=always 
configs based on kernel version, especially since they are already 
conflated.

> > (b) there's 
> > on-going discussion, particularly based on Vlastimil's feedback, and 
> 
> I doubt this will be finished and tested with reasonable confidence even
> for the 5.4 merge window.
> 

Depends, but I probably suspect the same.  If the reverts to 5.3 are not 
applied, then I'm not at all confident that forward progress on this issue 
will be made: my suggestion about changes to the page allocator when the 
patches were initially proposed went unresponded to, as did the ping on 
those suggestions, and now we have a simplistic "this will fix the swap 
storms" but no active involvement from Andrea to improve this; he likely 
is quite content on lumping NUMA policy onto an already overloaded madvise 
mode.

 [ NOTE! The rest of this email and my responses are about how to address
   the default page allocation behavior which we can continue to discuss
   but I'd prefer it separated from the discussion of reverts for 5.3
   which needs to be done to not conflate madvise modes with mempolicies
   for a subset of kernel versions. ]

> > It indicates that progress has been made to address the actual bug without 
> > introducing long-lived access latency regressions for others, particularly 
> > those who use MADV_HUGEPAGE.  In the worst case, some systems running 
> > 5.3-rc4 and 5.3-rc5 have the same amount of memory backed by hugepages but 
> > on 5.3-rc5 the vast majority of it is allocated remotely.  This incurs a
> 
> It's been said before, but such sensitive code generally relies on
> mempolicies or node reclaim mode, not THP __GFP_THISNODE implementation
> details. Or if you know there's enough free memory and just needs to be
> compacted, you could do it once via sysfs before starting up your workload.
> 

This entire discussion is based on the long standing and default behavior 
of page allocation for transparent hugepages.  Your suggestions are not 
possible for two reasons: (1) I cannot enforce a mempolicy of MPOL_BIND 
because this doesn't allow fallback at all and would oom kill if the local 
node is oom, and (2) node reclaim mode is a system-wide setting so all 
workloads are affected for every page allocation, not only users of 
MADV_HUGEPAGE who specifically opt-in to expensive allocation.

We could make the argument that Andrea's qemu usecase could simply use 
MPOL_PREFERRED for memory that should be faulted remotely which would 
provide more control and would work for all versions of Linux regardless 
of MADV_HUGEPAGE or not; that's a much more simple workaround than 
conflating MADV_HUGEPAGE for NUMA locality, asking users who are adversely 
affected by 5.3 to create new mempolicies to work around something that 
has always worked fine, or asking users to tune page allocator policies 
with sysctls.
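
For illustration, the MPOL_PREFERRED approach is just an mbind() over the 
ranges that are allowed to be remote; this is a sketch only (qemu has its 
own policy plumbing) and links against libnuma for the declaration:

#include <numaif.h>

/*
 * Prefer allocating pages in [addr, addr + len) from 'node', while still
 * allowing fallback to other nodes if it is low on memory or fragmented.
 */
static int prefer_node(void *addr, unsigned long len, int node)
{
	unsigned long nodemask = 1UL << node;

	return mbind(addr, len, MPOL_PREFERRED, &nodemask,
		     8 * sizeof(nodemask), 0);
}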

> > I'm arguing to revert 5.3 back to the behavior that we have had for years 
> > and actually fix the bug that everybody else seems to be ignoring and then 
> > *backport* those fixes to 5.3 stable and every other stable tree that can 
> > use them.  Introducing a new mempolicy for NUMA locality into 5.3.0 that
> 
> I think it's rather removing the problematic implicit mempolicy of
> __GFP_THISNODE.
> 

I'm referring to a solution that is backwards compatible for existing 
users, which 5.3 certainly is not.

> I might have missed something, but you were asked for a reproducer of
> your use case so others can develop patches with it in mind? Mel did
> provide a simple example that shows the swap storms very easily.
> 

Are you asking for a synthetic kernel module that you can inject to induce 
fragmentation on a local node where memory compaction would be possible 
and then a userspace program that uses MADV_HUGEPAGE and fits within that 
node?  The regression I'm reporting is for workloads that fit within a 
socket, it requires local fragmentation to show a regression.

For the qemu case, it's quite easy to fill a local node and require 
additional hugepage allocations with MADV_HUGEPAGE in a test case, but
without synthetically inducing fragmentation I cannot provide a testcase 
that will show performance regression 

Re: [patch for-5.3 0/4] revert immediate fallback to remote hugepages

2019-09-07 Thread David Rientjes
On Sat, 7 Sep 2019, Linus Torvalds wrote:

> > Andrea acknowledges the swap storm that he reported would be fixed with
> > the last two patches in this series
> 
> The problem is that even you aren't arguing that those patches should
> go into 5.3.
> 

For three reasons: (a) we lack a test result from Andrea, (b) there's 
on-going discussion, particularly based on Vlastimil's feedback, and 
(c) the patches will be refreshed incorporating that feedback as well as 
Mike's suggestion to exempt __GFP_RETRY_MAYFAIL for hugetlb.

> So those fixes aren't going in, so "the swap storms would be fixed"
> argument isn't actually an argument at all as far as 5.3 is concerned.
> 

It indicates that progress has been made to address the actual bug without 
introducing long-lived access latency regressions for others, particularly 
those who use MADV_HUGEPAGE.  In the worst case, some systems running 
5.3-rc4 and 5.3-rc5 have the same amount of memory backed by hugepages but 
on 5.3-rc5 the vast majority of it is allocated remotely.  This incurs a 
significant performance regression regardless of platform; the only thing 
needed to induce this is a fragmented local node that would otherwise be 
compacted in 5.3-rc4 rather than quickly allocate remote on 5.3-rc5.

> End result: we'd have the qemu-kvm instance performance problem in 5.3
> that apparently causes distros to apply those patches that you want to
> revert anyway.
> 
> So reverting would just make distros not use 5.3 in that form.
> 

I'm arguing to revert 5.3 back to the behavior that we have had for years 
and actually fix the bug that everybody else seems to be ignoring and then 
*backport* those fixes to 5.3 stable and every other stable tree that can 
use them.  Introducing a new mempolicy for NUMA locality into 5.3.0 that 
will subsequently be changed in future 5.3 stable kernels and differs from 
all kernels from the past few years is not in anybody's best interest if 
the actual problem can be fixed.  It requires more feedback than a 
one-line "the swap storms would be fixed with this."  That collaboration 
takes time and isn't something that should be rushed into 5.3-rc5.

Yes, we can fix NUMA locality of hugepages when a workload like qemu is 
larger than a single socket; the vast majority of workloads in the 
datacenter are smaller than a socket and *cannot* incur the performance 
penalty that 5.3-rc5 introduces if local memory is fragmented.

In other words, 5.3-rc5 is only fixing a highly specialized usecase where 
remote allocation is acceptable because the workload is larger than a 
socket *and* remote memory is not low on memory or fragmented.  If you 
consider the opposite of that, workloads smaller than a socket or local 
compaction actually works, this has introduced a measurable regression for 
everybody else.

I'm not sure why we are ignoring a painfully obvious bug in the page 
allocator because of a poor feedback loop between itself and memory 
compaction and rather papering over it by falling back to remote memory 
when NUMA actually does matter.  If you release 5.3 without the first two 
patches in this series, I wouldn't expect any additional feedback or test 
results to fix this bug considering all we have gotten so far is "this 
would fix the swap storms" and no collaboration to fix the issue for 
everybody rather than only caring about their own workloads.  At least my 
patches acknowledge and try to fix the issue the other is encountering.


Re: [patch for-5.3 0/4] revert immediate fallback to remote hugepages

2019-09-07 Thread David Rientjes
Is there any objection from anybody to applying the first two patches, the 
reverts of the reverts that went into 5.3-rc5, for 5.3 and pursuing 
discussion and development using the last two patches in this series as a 
starting point for a sane allocation policy that just works by default for 
everybody?

Andrea acknowledges the swap storm that he reported would be fixed with 
the last two patches in this series and there has been discussion on how 
they can be extended at the same time that they do not impact allocations 
outside the scope of the discussion here (hugetlb).


On Thu, 5 Sep 2019, David Rientjes wrote:

> On Wed, 4 Sep 2019, Linus Torvalds wrote:
> 
> > > This series reverts those reverts and attempts to propose a more sane
> > > default allocation strategy specifically for hugepages.  Andrea
> > > acknowledges this is likely to fix the swap storms that he originally
> > > reported that resulted in the patches that removed __GFP_THISNODE from
> > > hugepage allocations.
> > 
> > There's no way we can try this for 5.3 even if looks ok. This is
> > "let's try this during the 5.4 merge window" material, and see how it
> > works.
> > 
> > But I'd love affected people to test this all on their loads and post
> > numbers, so that we have actual numbers for this series when we do try
> > to merge it.
> > 
> 
> I'm certainly not proposing the last two patches in the series marked as 
> RFC to be merged.  I'm proposing the first two patches in the series, 
> reverts of the reverts that went into 5.3-rc5, are merged for 5.3 so that 
> we return to the same behavior that we have had for years and semantics 
> that MADV_HUGEPAGE has provided that entire libraries and userspaces have 
> been based on.
> 
> It is very clear that there is a path forward here to address the *bug* 
> that Andrea is reporting: it has become conflated with NUMA allocation 
> policies which is not at all the issue.  Note that if 5.3 is released with 
> these patches, it requires a very specialized usecase to benefit from: 
> workloads that are larger than one socket and that *require* remote memory not 
> being low on memory or fragmented.  If remote memory is as low on memory 
> or fragmented as local memory (like in a datacenter), the reverts that 
> went into 5.3 will double the impact of the very bug being reported 
> because now it's causing swap storms for remote memory as well.  I don't 
> anticipate we'll get numbers for that since it's not a configuration they 
> run in.
> 
> The bug here is reclaim in the page allocator that does not benefit memory 
> compaction because we are failing per-zone watermarks already.  The last 
> two patches in this series avoid that, which is a sane default page 
> allocation policy, and then allow fallback to remote memory only when we 
> can't easily allocate locally.
> 
> We *need* the ability to allocate hugepages locally if compaction can 
> work; anything else kills performance.  5.3-rc7 won't try that, it will 
> simply fall back to remote memory.  We need to try compaction but we do not 
> want to reclaim if we are already failing watermark checks.
> 
> I hope that I'm not being unrealistically optimistic that we can make 
> progress on providing a sane default allocation policy using those last 
> two patches as a starter for 5.4, but I'm strongly suggesting that you 
> take the first two patches to return us to the policy that has existed for 
> years and not allow MADV_HUGEPAGE to be used for immediate remote 
> allocation when local is possible.
> 


Re: [rfc 3/4] mm, page_alloc: avoid expensive reclaim when compaction may not succeed

2019-09-06 Thread David Rientjes
On Thu, 5 Sep 2019, Vlastimil Babka wrote:

> >>  - failing order-0 watermark checks in memory compaction does not account
> >>for how far below the watermarks the zone actually is: to enable
> >>migration, there must be *some* free memory available.  Per the above,
> >>    watermarks are not always sufficient if isolate_freepages() cannot
> >>find the free memory but it could require hundreds of MBs of reclaim to
> >>even reach this threshold (read: potentially very expensive reclaim with
> >>no indication compaction can be successful), and
> 
> I doubt it's hundreds of MBs for a 2MB hugepage.
> 

I'm not sure how you presume to know; we certainly have incidents where 
compaction is skipped because free pages are 100MB+ under the low 
watermarks.

> >> For hugepage allocations, these are quite substantial drawbacks because
> >> these are very high order allocations (order-9 on x86) and falling back to
> >> doing reclaim can potentially be *very* expensive without any indication
> >> that compaction would even be successful.
> 
> You seem to lump together hugetlbfs and THP here, by saying "hugepage",
> but these are very different things - hugetlbfs reservations are
> expected to be potentially expensive.
> 

Mike Kravetz followed up, and I can make a simple change to this fix to 
only run the new logic if the allocation is not using __GFP_RETRY_MAYFAIL, 
which would exclude hugetlb allocations and include transparent hugepage 
allocations.
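
A rough, untested sketch of that adjustment to the hunk quoted below (the 
only change is the added gfp_mask test):

	if (order >= pageblock_order && (gfp_mask & __GFP_IO) &&
	    !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
		/*
		 * hugetlb passes __GFP_RETRY_MAYFAIL and is willing to
		 * loop through reclaim, so only fail quickly for thp
		 * faults when compaction was skipped or deferred.
		 */
		if (compact_result == COMPACT_SKIPPED ||
		    compact_result == COMPACT_DEFERRED)
			goto nopage;
	}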

> >> Reclaim itself is unlikely to free entire pageblocks and certainly no
> >> reliance should be put on it to do so in isolation (recall lumpy reclaim).
> >> This means we should avoid reclaim and simply fail hugepage allocation if
> >> compaction is deferred.
> 
> It is however possible that reclaim frees enough to make even a
> previously deferred compaction succeed.
> 

This is another way the return value we get from memory 
compaction can be improved: right now we only check 
compaction_deferred() at the priorities we care about.  This discussion 
has revealed several areas where we can get more reliable and actionable 
return values from memory compaction to implement a sane default policy in 
the page allocator that works for everybody.

> >> It is also not helpful to thrash a zone by doing excessive reclaim if
> >> compaction may not be able to access that memory.  If order-0 watermarks
> >> fail and the allocation order is sufficiently large, it is likely better
> >> to fail the allocation rather than thrashing the zone.
> >>
> >> Signed-off-by: David Rientjes 
> >> ---
> >>  mm/page_alloc.c | 22 ++
> >>  1 file changed, 22 insertions(+)
> >>
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -4458,6 +4458,28 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
> >> order,
> >>if (page)
> >>goto got_pg;
> >>  
> >> +   if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
> >> +  /*
> >> +   * If allocating entire pageblock(s) and compaction
> >> +   * failed because all zones are below low watermarks
> >> +   * or is prohibited because it recently failed at this
> >> +   * order, fail immediately.
> >> +   *
> >> +   * Reclaim is
> >> +   *  - potentially very expensive because zones are far
> >> +   *below their low watermarks or this is part of very
> >> +   *bursty high order allocations,
> >> +   *  - not guaranteed to help because isolate_freepages()
> >> +   *may not iterate over freed pages as part of its
> >> +   *linear scan, and
> >> +   *  - unlikely to make entire pageblocks free on its
> >> +   *own.
> >> +   */
> >> +  if (compact_result == COMPACT_SKIPPED ||
> >> +  compact_result == COMPACT_DEFERRED)
> >> +  goto nopage;
> 
> As I said, I expect this will make hugetlbfs reservations fail
> prematurely - Mike can probably confirm or disprove that.
> I think it also addresses consequences, not the primary problem, IMHO.
> I believe the primary problem is that we reclaim something even if
> there's enough

Re: [rfc 3/4] mm, page_alloc: avoid expensive reclaim when compaction may not succeed

2019-09-06 Thread David Rientjes
On Thu, 5 Sep 2019, Mike Kravetz wrote:

> I don't have a specific test for this.  It is somewhat common for people
> to want to allocate "as many hugetlb pages as possible".  Therefore, they
> will try to allocate more pages than reasonable for their environment and
> take what they can get.  I 'tested' by simply creating some background
> activity and then seeing how many hugetlb pages could be allocated.  Of
> course, many tries over time in a loop.
> 
> This patch did not cause premature allocation failures in my limited testing.
> The number of pages which could be allocated with and without patch were
> pretty much the same.
> 
> Do note that I tested on top of Andrew's tree which contains this series:
> http://lkml.kernel.org/r/20190806014744.15446-1-mike.krav...@oracle.com
> Patch 3 in that series causes allocations to fail sooner in the case of
> COMPACT_DEFERRED:
> http://lkml.kernel.org/r/20190806014744.15446-4-mike.krav...@oracle.com
> 
> hugetlb allocations have the __GFP_RETRY_MAYFAIL flag set.  They are willing
> to retry and wait and callers are aware of this.  Even though my limited
> testing did not show regressions caused by this patch, I would prefer if the
> quick exit did not apply to __GFP_RETRY_MAYFAIL requests.

Good!  I think that is the ideal way of handling it: we can specify the 
preference to actually loop and retry (but still eventually fail) for 
hugetlb allocations specifically for this patch by testing for 
__GFP_RETRY_MAYFAIL.

I can add that to the formal proposal of patches 3 and 4 in this series 
assuming we get 5.3 settled by applying the reverts in patches 1 and 2 so 
that we don't cause various versions of Linux to have different default 
and madvise allocation policies wrt NUMA.


Re: [patch for-5.3 0/4] revert immediate fallback to remote hugepages

2019-09-05 Thread David Rientjes
On Wed, 4 Sep 2019, Andrea Arcangeli wrote:

> > This is an admittedly hacky solution that shouldn't cause anybody to 
> > regress based on NUMA and the semantics of MADV_HUGEPAGE for the past 
> > 4 1/2 years for users whose workload does fit within a socket.
> 
> How can you live with the below if you can't live with 5.3-rc6? Here
> you allocate remote THP if the local THP allocation fails.
> 
> > page = __alloc_pages_node(hpage_node,
> > gfp | __GFP_THISNODE, order);
> > +
> > +   /*
> > +* If hugepage allocations are configured to always
> > +* synchronous compact or the vma has been madvised
> > +* to prefer hugepage backing, retry allowing remote
> > +* memory as well.
> > +*/
> > +   if (!page && (gfp & __GFP_DIRECT_RECLAIM))
> > +   page = __alloc_pages_node(hpage_node,
> > +   gfp | __GFP_NORETRY, order);
> > +
> 
> You're still going to get THP allocate remote _before_ you have a
> chance to allocate 4k local this way. __GFP_NORETRY won't make any
> difference when there's THP immediately available in the remote nodes.
> 

This is incorrect: the fallback allocation here is only if the initial 
allocation with __GFP_THISNODE fails.  In that case, we were able to try 
to compact memory to make a local hugepage available without incurring 
excessive swap, based on the RFC patch that appears as patch 3 in this 
series.  I very much believe your usecase would benefit from this as well 
(or at least not cause others to regress).  We *want* remote thp if they 
are immediately available but only after we have tried to allocate locally 
from the initial allocation and allowed memory compaction to fail first.

Likely there can be discussion around the fourth patch of this series to 
get exactly the right policy.  We can construct it as necessary for 
hugetlbfs to not have any change in behavior, that's simple.  We could 
also check per-zone watermarks in mm/huge_memory.c to determine if local 
memory is low-on-memory and, if so, allow remote allocation.  In that case 
it's certainly better to allocate remotely when we'd be reclaiming locally 
even for fallback native pages.

> I said one good thing about this patch series, that it fixes the swap
> storms. But upstream 5.3 fixes the swap storms too and what you sent
> is not nearly equivalent to the mempolicy that Michal was willing
> to provide you and that we thought you needed to get bigger guarantees
> of getting only local 2m or local 4k pages.
> 

I haven't seen such a patch series, is there a link?


Re: [patch for-5.3 0/4] revert immediate fallback to remote hugepages

2019-09-05 Thread David Rientjes
On Wed, 4 Sep 2019, Linus Torvalds wrote:

> > This series reverts those reverts and attempts to propose a more sane
> > default allocation strategy specifically for hugepages.  Andrea
> > acknowledges this is likely to fix the swap storms that he originally
> > reported that resulted in the patches that removed __GFP_THISNODE from
> > hugepage allocations.
> 
> There's no way we can try this for 5.3 even if looks ok. This is
> "let's try this during the 5.4 merge window" material, and see how it
> works.
> 
> But I'd love affected people to test this all on their loads and post
> numbers, so that we have actual numbers for this series when we do try
> to merge it.
> 

I'm certainly not proposing the last two patches in the series marked as 
RFC to be merged.  I'm proposing the first two patches in the series, 
reverts of the reverts that went into 5.3-rc5, are merged for 5.3 so that 
we return to the same behavior that we have had for years and semantics 
that MADV_HUGEPAGE has provided that entire libraries and userspaces have 
been based on.

It is very clear that there is a path forward here to address the *bug* 
that Andrea is reporting: it has become conflated with NUMA allocation 
policies which is not at all the issue.  Note that if 5.3 is released with 
these patches, it requires a very specialized usecase to benefit from: 
workloads that are larger than one socket and that *require* remote memory not 
being low on memory or fragmented.  If remote memory is as low on memory 
or fragmented as local memory (like in a datacenter), the reverts that 
went into 5.3 will double the impact of the very bug being reported 
because now it's causing swap storms for remote memory as well.  I don't 
anticipate we'll get numbers for that since it's not a configuration they 
run in.

The bug here is reclaim in the page allocator that does not benefit memory 
compaction because we are failing per-zone watermarks already.  The last 
two patches in this series avoid that, which is a sane default page 
allocation policy, and then allow fallback to remote memory only when we 
can't easily allocate locally.

We *need* the ability to allocate hugepages locally if compaction can 
work; anything else kills performance.  5.3-rc7 won't try that, it will 
simply fall back to remote memory.  We need to try compaction but we do not 
want to reclaim if we are already failing watermark checks.

I hope that I'm not being unrealistically optimistic that we can make 
progress on providing a sane default allocation policy using those last 
two patches as a starter for 5.4, but I'm strongly suggesting that you 
take the first two patches to return us to the policy that has existed for 
years and not allow MADV_HUGEPAGE to be used for immediate remote 
allocation when local is possible.


[bug] __blk_mq_run_hw_queue suspicious rcu usage

2019-09-04 Thread David Rientjes
Hi Christoph, Jens, and Ming,

While booting a 5.2 SEV-enabled guest we have encountered the following 
WARNING that is followed up by a BUG because we are in atomic context 
while trying to call set_memory_decrypted:

 WARNING: suspicious RCU usage
 5.2.0 #1 Not tainted
 -
 include/linux/rcupdate.h:266 Illegal context switch in RCU read-side critical 
section!
 
 other info that might help us debug this:
 
 
 rcu_scheduler_active = 2, debug_locks = 1
 3 locks held by kworker/0:1H/97:
  #0: 16e1b654 ((wq_completion)kblockd){+.+.}, at: 
process_one_work+0x1b5/0x5e0
  #1: 002674ff ((work_completion)(&(&hctx->run_work)->work)){+.+.}, at: 
process_one_work+0x1b5/0x5e0
  #2: addb6aba (rcu_read_lock){}, at: hctx_lock+0x17/0xe0
 
 stack backtrace:
 CPU: 0 PID: 97 Comm: kworker/0:1H Not tainted 5.2.0 #1
 Workqueue: kblockd blk_mq_run_work_fn
 Call Trace:
  dump_stack+0x67/0x90
  ___might_sleep+0xfb/0x180
  _vm_unmap_aliases+0x3e/0x1f0
  __set_memory_enc_dec+0x7b/0x120
  dma_direct_alloc_pages+0xcc/0x100
  dma_pool_alloc+0xcf/0x1e0
  nvme_queue_rq+0x5fb/0x9f0
  blk_mq_dispatch_rq_list+0x350/0x5a0
  blk_mq_do_dispatch_sched+0x76/0x110
  blk_mq_sched_dispatch_requests+0x119/0x170
  __blk_mq_run_hw_queue+0x6c/0xf0
  process_one_work+0x23b/0x5e0
  worker_thread+0x3d/0x390
  kthread+0x121/0x140
  ret_from_fork+0x27/0x50

hctx_lock() in __blk_mq_run_hw_queue() takes rcu_read_lock or 
srcu_read_lock depending on BLK_MQ_F_BLOCKING.
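
For reference, hctx_lock() in 5.2 is roughly the following (paraphrased 
from memory rather than quoted verbatim):

static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
	__acquires(hctx->srcu)
{
	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
		/* non-blocking hw queues only hold rcu_read_lock */
		*srcu_idx = 0;
		rcu_read_lock();
	} else
		*srcu_idx = srcu_read_lock(hctx->srcu);
}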

dma_direct_alloc_pages() can then call set_memory_decrypted() which must 
be allowed to block.

Any ideas on how to address this?


Re: [RFC PATCH] mm, oom: disable dump_tasks by default

2019-09-04 Thread David Rientjes
On Wed, 4 Sep 2019, Michal Hocko wrote:

> > > It's primary purpose is
> > > to help analyse oom victim selection decision.
> > 
> > I disagree, for I use the process list for understanding what / how many
> > processes are consuming what kind of memory (without crashing the system)
> > for anomaly detection purpose. Although we can't dump memory consumed by
> > e.g. file descriptors, disabling dump_tasks() loose that clue, and is
> > problematic for me.
> 
> Does anything really prevent you from enabling this by sysctl though? Or
> do you claim that this is a general usage pattern and therefore the
> default change is not acceptable or do you want a changelog to be
> updated?
> 

I think the motivation is that users don't want to have to reproduce an 
oom kill to figure out why it happened: they want to be able to figure out 
which process had higher than normal memory usage.  If oom is the normal 
case then they certainly have the ability to disable it by disabling the 
sysctl, but that seems like something better to opt out of rather than 
something they need to opt in to and reproduce.


[rfc 4/4] mm, page_alloc: allow hugepage fallback to remote nodes when madvised

2019-09-04 Thread David Rientjes
For systems configured to always try hard to allocate transparent
hugepages (thp defrag setting of "always") or for memory that has been
explicitly madvised to MADV_HUGEPAGE, it is often better to fall back to 
remote memory to allocate the hugepage if the local allocation fails
first.

The point is to allow the initial call to __alloc_pages_node() to attempt
to defragment local memory to make a hugepage available, if possible,
rather than immediately fall back to remote memory.  Local hugepages will 
always have a better access latency than remote (huge)pages, so an attempt
to make a hugepage available locally is always preferred.

If memory compaction cannot be successful locally, however, it is likely
better to fallback to remote memory.  This could take on two forms: either
allow immediate fallback to remote memory or do per-zone watermark checks.
It would be possible to fall back only when per-zone watermarks fail for
order-0 memory, since that would require local reclaim for all subsequent
faults so remote huge allocation is likely better than thrashing the local
zone for large workloads.
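
A sketch of that second form, using a hypothetical helper (the name and
the exact thresholds are made up for illustration, not part of this
series):

/*
 * Hypothetical helper: report whether the preferred node would need
 * reclaim even for order-0 allocations, in which case falling back to
 * a remote hugepage is likely better than thrashing the local zones.
 */
static bool local_node_needs_reclaim(int nid, gfp_t gfp)
{
	struct zonelist *zonelist = node_zonelist(nid, gfp);
	struct zoneref *z;
	struct zone *zone;

	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp)) {
		if (zone_to_nid(zone) != nid)
			continue;
		if (zone_watermark_ok(zone, 0, low_wmark_pages(zone),
				      gfp_zone(gfp), 0))
			return false;	/* order-0 is still fine locally */
	}
	return true;
}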

In this case, it is assumed that because the system is configured to try
hard to allocate hugepages or the vma is advised to explicitly want to try
hard for hugepages that remote allocation is better when local allocation
and memory compaction have both failed.

Signed-off-by: David Rientjes 
---
 mm/mempolicy.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2133,6 +2133,17 @@ alloc_pages_vma(gfp_t gfp, int order, struct 
vm_area_struct *vma,
mpol_cond_put(pol);
page = __alloc_pages_node(hpage_node,
gfp | __GFP_THISNODE, order);
+
+   /*
+* If hugepage allocations are configured to always
+* synchronous compact or the vma has been madvised
+* to prefer hugepage backing, retry allowing remote
+* memory as well.
+*/
+   if (!page && (gfp & __GFP_DIRECT_RECLAIM))
+   page = __alloc_pages_node(hpage_node,
+   gfp | __GFP_NORETRY, order);
+
goto out;
}
}


[rfc 3/4] mm, page_alloc: avoid expensive reclaim when compaction may not succeed

2019-09-04 Thread David Rientjes
Memory compaction has a couple significant drawbacks as the allocation
order increases, specifically:

 - isolate_freepages() is responsible for finding free pages to use as
   migration targets and is implemented as a linear scan of memory
   starting at the end of a zone,

 - failing order-0 watermark checks in memory compaction does not account
   for how far below the watermarks the zone actually is: to enable
   migration, there must be *some* free memory available.  Per the above,
   watermarks are not always sufficient if isolate_freepages() cannot
   find the free memory but it could require hundreds of MBs of reclaim to
   even reach this threshold (read: potentially very expensive reclaim with
   no indication compaction can be successful), and

 - if compaction at this order has failed recently so that it does not even
   run as a result of deferred compaction, looping through reclaim can often
   be pointless.

For hugepage allocations, these are quite substantial drawbacks because
these are very high order allocations (order-9 on x86) and falling back to
doing reclaim can potentially be *very* expensive without any indication
that compaction would even be successful.

Reclaim itself is unlikely to free entire pageblocks and certainly no
reliance should be put on it to do so in isolation (recall lumpy reclaim).
This means we should avoid reclaim and simply fail hugepage allocation if
compaction is deferred.

It is also not helpful to thrash a zone by doing excessive reclaim if
compaction may not be able to access that memory.  If order-0 watermarks
fail and the allocation order is sufficiently large, it is likely better
to fail the allocation rather than thrashing the zone.

Signed-off-by: David Rientjes 
---
 mm/page_alloc.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4458,6 +4458,28 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
order,
if (page)
goto got_pg;
 
+   if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
+   /*
+* If allocating entire pageblock(s) and compaction
+* failed because all zones are below low watermarks
+* or is prohibited because it recently failed at this
+* order, fail immediately.
+*
+* Reclaim is
+*  - potentially very expensive because zones are far
+*below their low watermarks or this is part of very
+*bursty high order allocations,
+*  - not guaranteed to help because isolate_freepages()
+*may not iterate over freed pages as part of its
+*linear scan, and
+*  - unlikely to make entire pageblocks free on its
+*own.
+*/
+   if (compact_result == COMPACT_SKIPPED ||
+   compact_result == COMPACT_DEFERRED)
+   goto nopage;
+   }
+
/*
 * Checks for costly allocations with __GFP_NORETRY, which
 * includes THP page fault allocations


[patch for-5.3 0/4] revert immediate fallback to remote hugepages

2019-09-04 Thread David Rientjes
Two commits:

commit a8282608c88e08b1782141026eab61204c1e533f
Author: Andrea Arcangeli 
Date:   Tue Aug 13 15:37:53 2019 -0700

Revert "mm, thp: restore node-local hugepage allocations"

commit 92717d429b38e4f9f934eed7e605cc42858f1839
Author: Andrea Arcangeli 
Date:   Tue Aug 13 15:37:50 2019 -0700

Revert "Revert "mm, thp: consolidate THP gfp handling into 
alloc_hugepage_direct_gfpmask""

made their way into 5.3-rc5

We (mostly Linus, Andrea, and myself) have been discussing offlist how to
implement a sane default allocation strategy for hugepages on NUMA
platforms.

With these reverts in place, the page allocator will happily allocate a
remote hugepage immediately rather than try to make a local hugepage
available.  This incurs a substantial performance degradation when
memory compaction would have otherwise made a local hugepage available.

This series reverts those reverts and attempts to propose a more sane
default allocation strategy specifically for hugepages.  Andrea
acknowledges this is likely to fix the swap storms that he originally
reported that resulted in the patches that removed __GFP_THISNODE from
hugepage allocations.

The immediate goal is to return 5.3 to the behavior the kernel has
implemented over the past several years so that remote hugepages are
not immediately allocated when local hugepages could have been made
available because the increased access latency is untenable.

The next goal is to introduce a sane default allocation strategy for
hugepages allocations in general regardless of the configuration of the
system so that we prevent thrashing of local memory when compaction is
unlikely to succeed and can prefer remote hugepages over remote native
pages when the local node is low on memory.

Merging these reverts late in the rc cycle to change behavior that has
existed for years and is known (and acknowledged) to create performance
degradations when local hugepages could have been made available serves
no purpose other than to make the development of a sane default policy
much more difficult under time pressure and to accelerate decisions that
will affect how userspace is written (and how it has regressed) that
otherwise require carefully crafted and detailed implementations.

Thus, this patch series returns 5.3 to the long-standing allocation
strategy that Linux has had for years and proposes to follow-up changes
that can be discussed that Andrea acknowledges will avoid the swap storms
that initially triggered this discussion in the first place.


[patch for-5.3 1/4] Revert "Revert "mm, thp: restore node-local hugepage allocations""

2019-09-04 Thread David Rientjes
This reverts commit a8282608c88e08b1782141026eab61204c1e533f.

The commit references the original intended semantic for MADV_HUGEPAGE
which has subsequently taken on three unique purposes:

 - enables or disables thp for a range of memory depending on the system's
   config (is thp "enabled" set to "always" or "madvise"),

 - determines the synchronous compaction behavior for thp allocations at
   fault (is thp "defrag" set to "always", "defer+madvise", or "madvise"),
   and

 - reverts a previous MADV_NOHUGEPAGE (there is no madvise mode to only
   clear previous hugepage advice).

These are the three purposes that currently exist in 5.2 and over the past
several years that userspace has been written around.  Adding a NUMA
locality preference adds a fourth dimension to an already conflated advice
mode.

Based on the semantic that MADV_HUGEPAGE has provided over the past
several years, there exist workloads that use the tunable based on these
principles: specifically that the allocation should attempt to defragment
a local node before falling back.  It is agreed that remote hugepages
typically (but not always) have a better access latency than remote native
pages, although on Naples this is at parity for intersocket.
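
To make the userspace side concrete, such a workload typically opts a
region in like this (a minimal sketch; the region size and alignment are
made up for the example):

#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64UL << 20;	/* 64MB, 2MB-aligned region */
	void *buf;

	if (posix_memalign(&buf, 2UL << 20, len))
		return 1;
	/*
	 * Advise hugepage backing; per the long-standing semantic this
	 * means the kernel should try to compact the local node before
	 * falling back.
	 */
	if (madvise(buf, len, MADV_HUGEPAGE))
		return 1;
	return 0;
}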

The revert commit that this patch reverts allows hugepage allocation to
immediately allocate remotely when local memory is fragmented.  This is
contrary to the semantic of MADV_HUGEPAGE over the past several years:
that is, memory compaction should be attempted locally before falling
back.

The performance degradation of remote hugepages over local hugepages on
Rome, for example, is 53.5% increased access latency.  For this reason,
the goal is to revert back to the 5.2 and previous behavior that would
attempt local defragmentation before falling back.  With the patch that
is reverted by this patch, we see performance degradations at the tail
because the allocator happily allocates the remote hugepage rather than
even attempting to make a local hugepage available.

zone_reclaim_mode is not a solution to this problem since it does not
only impact hugepage allocations but rather changes the memory allocation
strategy for *all* page allocations.

Signed-off-by: David Rientjes 
---
 include/linux/mempolicy.h |  2 --
 mm/huge_memory.c  | 42 +++
 mm/mempolicy.c|  2 +-
 3 files changed, 17 insertions(+), 29 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,8 +139,6 @@ struct mempolicy *mpol_shared_policy_lookup(struct 
shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
unsigned long addr);
-struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
-   unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -648,37 +648,27 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct 
vm_fault *vmf,
 static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, 
unsigned long addr)
 {
const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
-   gfp_t this_node = 0;
-
-#ifdef CONFIG_NUMA
-   struct mempolicy *pol;
-   /*
-* __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
-* specified, to express a general desire to stay on the current
-* node for optimistic allocation attempts. If the defrag mode
-* and/or madvise hint requires the direct reclaim then we prefer
-* to fallback to other node rather than node reclaim because that
-* can lead to excessive reclaim even though there is free memory
-* on other nodes. We expect that NUMA preferences are specified
-* by memory policies.
-*/
-   pol = get_vma_policy(vma, addr);
-   if (pol->mode != MPOL_BIND)
-   this_node = __GFP_THISNODE;
-   mpol_cond_put(pol);
-#endif
+   const gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
 
+   /* Always do synchronous compaction */
if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, 
&transparent_hugepage_flags))
-   return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+   return GFP_TRANSHUGE | __GFP_THISNODE |
+  (vma_madvised ? 0 : __GFP_NORETRY);
+
+   /* Kick kcompactd and fail quickly */
if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, 
&transparent_hugepage_flags))
-   return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
+   return gfp_mask | __GFP_KSWAPD_RECLAIM;
+
+   /* Synchronous compaction if madvised, otherwise kick k

[patch for-5.3 2/4] Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""

2019-09-04 Thread David Rientjes
This reverts commit 92717d429b38e4f9f934eed7e605cc42858f1839.

Since commit a8282608c88e ("Revert "mm, thp: restore node-local hugepage
allocations"") is reverted in this series, it is better to restore the
previous 5.2 behavior between the thp allocation and the page allocator
rather than to attempt any consolidation or cleanup for a policy that is
now reverted.  It's less risky during an rc cycle and subsequent patches
in this series further modify the same policy that the pre-5.3 behavior
implements.

Consolidation and cleanup can be done subsequent to a sane default page
allocation strategy, so this patch reverts a cleanup done on a strategy
that is now reverted and thus is the least risky option for 5.3.

Signed-off-by: David Rientjes 
---
 include/linux/gfp.h | 12 
 mm/huge_memory.c| 27 +--
 mm/mempolicy.c  | 32 +---
 mm/shmem.c  |  2 +-
 4 files changed, 51 insertions(+), 22 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -510,18 +510,22 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
 }
 extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
struct vm_area_struct *vma, unsigned long addr,
-   int node);
+   int node, bool hugepage);
+#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
+   alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
 #else
 #define alloc_pages(gfp_mask, order) \
alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_pages_vma(gfp_mask, order, vma, addr, node)\
+#define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
+   alloc_pages(gfp_mask, order)
+#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
alloc_pages(gfp_mask, order)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 #define alloc_page_vma(gfp_mask, vma, addr)\
-   alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id())
+   alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
 #define alloc_page_vma_node(gfp_mask, vma, addr, node) \
-   alloc_pages_vma(gfp_mask, 0, vma, addr, node)
+   alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
 
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -645,30 +645,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct 
vm_fault *vmf,
  * available
  * never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, 
unsigned long addr)
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
-   const gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
 
/* Always do synchronous compaction */
if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, 
&transparent_hugepage_flags))
-   return GFP_TRANSHUGE | __GFP_THISNODE |
-  (vma_madvised ? 0 : __GFP_NORETRY);
+   return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
 
/* Kick kcompactd and fail quickly */
if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, 
&transparent_hugepage_flags))
-   return gfp_mask | __GFP_KSWAPD_RECLAIM;
+   return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
 
/* Synchronous compaction if madvised, otherwise kick kcompactd */
if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, 
&transparent_hugepage_flags))
-   return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM :
- __GFP_KSWAPD_RECLAIM);
+   return GFP_TRANSHUGE_LIGHT |
+   (vma_madvised ? __GFP_DIRECT_RECLAIM :
+   __GFP_KSWAPD_RECLAIM);
 
/* Only do synchronous compaction if madvised */
if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, 
&transparent_hugepage_flags))
-   return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
+   return GFP_TRANSHUGE_LIGHT |
+  (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
 
-   return gfp_mask;
+   return GFP_TRANSHUGE_LIGHT;
 }
 
 /* Caller must hold page table lock. */
@@ -740,8 +740,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
pte_free(vma->vm_mm, pgtable);
return ret;
}
-   gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
-   page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, vma, haddr, 
numa_node_id());
+   gfp = alloc_hugepage_direct_gfpmask(vma);
+   page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PM

Re: [PATCH] mm, oom: consider present pages for the node size

2019-08-29 Thread David Rientjes
On Thu, 29 Aug 2019, Michal Hocko wrote:

> From: Michal Hocko 
> 
> constrained_alloc calculates the size of the oom domain by using
> node_spanned_pages which is incorrect because this is the full range of
> the physical memory range that the numa node occupies rather than the
> memory that backs that range which is represented by node_present_pages.
> 
> Sparsely populated nodes (e.g. after memory hot remove or simply sparse
> due to memory layout) can have really a large difference between the
> two. This shouldn't really cause any real user observable problems
> because the oom calculates a ratio against totalpages and used memory
> cannot exceed present pages but it is confusing and wrong from code
> point of view.
> 
> Noticed-by: David Hildenbrand 
> Signed-off-by: Michal Hocko 

Acked-by: David Rientjes 
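
For context, the change amounts to something like the following in
constrained_alloc() (paraphrased from the changelog above, not copied
from the actual diff):

	/*
	 * Size the oom domain by the memory that actually backs each
	 * allowed node rather than by the node's physical span.
	 */
	for_each_node_mask(nid, *oc->nodemask)
		oc->totalpages += node_present_pages(nid); /* was node_spanned_pages() */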


Re: [PATCH] mm/oom: Add oom_score_adj value to oom Killed process message

2019-08-21 Thread David Rientjes
On Wed, 21 Aug 2019, Michal Hocko wrote:

> > vm.oom_dump_tasks is pretty useful, however, so it's curious why you 
> > haven't left it enabled :/
> 
> Because it generates a lot of output potentially. Think of a workload
> with too many tasks which is not uncommon.

Probably better to always print all the info for the victim so we don't 
need to duplicate everything between dump_tasks() and dump_oom_summary().

Edward, how about this?

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -420,11 +420,17 @@ static int dump_task(struct task_struct *p, void *arg)
  * State information includes task's pid, uid, tgid, vm size, rss,
  * pgtables_bytes, swapents, oom_score_adj value, and name.
  */
-static void dump_tasks(struct oom_control *oc)
+static void dump_tasks(struct oom_control *oc, struct task_struct *victim)
 {
pr_info("Tasks state (memory values in pages):\n");
pr_info("[  pid  ]   uid  tgid total_vm  rss pgtables_bytes 
swapents oom_score_adj name\n");
 
+   /* If vm.oom_dump_tasks is disabled, only show the victim */
+   if (!sysctl_oom_dump_tasks) {
+   dump_task(victim, oc);
+   return;
+   }
+
if (is_memcg_oom(oc))
mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
else {
@@ -465,8 +471,8 @@ static void dump_header(struct oom_control *oc, struct 
task_struct *p)
if (is_dump_unreclaim_slabs())
dump_unreclaimable_slab();
}
-   if (sysctl_oom_dump_tasks)
-   dump_tasks(oc);
+   if (p || sysctl_oom_dump_tasks)
+   dump_tasks(oc, p);
if (p)
dump_oom_summary(oc, p);
 }

