fix filler_t callback type mismatches v2
Casting mapping->a_ops->readpage to filler_t causes an indirect call type mismatch with Control-Flow Integrity checking. This series fixes the mismatch in read_cache_page_gfp and read_mapping_page by using a NULL filler argument as an indication to call ->readpage directly, and by passing callbacks with the correct prototype in 9p and jffs2.

Changes since v1:
- add the 9p patch to the series
- drop the nfs patch that has been merged
Re: PROBLEM: Power9: kernel oops on memory hotunplug from ppc64le guest
On Mon, May 20, 2019 at 02:48:35PM +1000, Nicholas Piggin wrote:
> >> > git bisect points to
> >> >
> >> > commit 4231aba000f5a4583dd9f67057aadb68c3eca99d
> >> > Author: Nicholas Piggin
> >> > Date: Fri Jul 27 21:48:17 2018 +1000
> >> >
> >> > powerpc/64s: Fix page table fragment refcount race vs speculative
> >> > references
> >> >
> >> > The page table fragment allocator uses the main page refcount racily
> >> > with respect to speculative references. A customer observed a BUG due
> >> > to page table page refcount underflow in the fragment allocator. This
> >> > can be caused by the fragment allocator set_page_count stomping on a
> >> > speculative reference, and then the speculative failure handler
> >> > decrements the new reference, and the underflow eventually pops when
> >> > the page tables are freed.
> >> >
> >> > Fix this by using a dedicated field in the struct page for the page
> >> > table fragment allocator.
> >> >
> >> > Fixes: 5c1f6ee9a31c ("powerpc: Reduce PTE table memory wastage")
> >> > Cc: sta...@vger.kernel.org # v3.10+
> >>
> >> That's the commit that added the BUG_ON(), so prior to that you won't
> >> see the crash.
> >
> > Right, but the commit says it fixes page table page refcount underflow by
> > introducing a new field pt_frag_refcount. Now we are hitting the underflow
> > for this pt_frag_refcount.
>
> The fixed underflow is caused by a bug (race on page count) that got
> fixed by that patch. You are hitting a different underflow here. It's
> not certain my patch caused it, I'm just trying to reproduce now.

Ok.

> > > > BTW, if I go below this commit, I don't hit the pagecount
> > > >
> > > > VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> > > >
> > > > which is in pte_fragment_free() path.
>
> Do you have CONFIG_DEBUG_VM=y?

Yes.

Regards,
Bharata.
[PATCH 3/4] jffs2: pass the correct prototype to read_cache_page
Fix the callback jffs2 passes to read_cache_page to actually have the proper type expected. Casting around function pointers can easily hide typing bugs, and defeats control flow protection.

Signed-off-by: Christoph Hellwig
Reviewed-by: Kees Cook
---
 fs/jffs2/file.c     | 4 ++--
 fs/jffs2/fs.c       | 2 +-
 fs/jffs2/os-linux.h | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/jffs2/file.c b/fs/jffs2/file.c
index 7d8654a1472e..f8fb89b10227 100644
--- a/fs/jffs2/file.c
+++ b/fs/jffs2/file.c
@@ -109,9 +109,9 @@ static int jffs2_do_readpage_nolock (struct inode *inode, struct page *pg)
 	return ret;
 }

-int jffs2_do_readpage_unlock(struct inode *inode, struct page *pg)
+int jffs2_do_readpage_unlock(void *data, struct page *pg)
 {
-	int ret = jffs2_do_readpage_nolock(inode, pg);
+	int ret = jffs2_do_readpage_nolock(data, pg);
 	unlock_page(pg);
 	return ret;
 }
diff --git a/fs/jffs2/fs.c b/fs/jffs2/fs.c
index 112d85849db1..8a20ddd25f2d 100644
--- a/fs/jffs2/fs.c
+++ b/fs/jffs2/fs.c
@@ -687,7 +687,7 @@ unsigned char *jffs2_gc_fetch_page(struct jffs2_sb_info *c,
 	struct page *pg;

 	pg = read_cache_page(inode->i_mapping, offset >> PAGE_SHIFT,
-			     (void *)jffs2_do_readpage_unlock, inode);
+			     jffs2_do_readpage_unlock, inode);
 	if (IS_ERR(pg))
 		return (void *)pg;
diff --git a/fs/jffs2/os-linux.h b/fs/jffs2/os-linux.h
index a2dbbb3f4c74..bd3d5f0ddc34 100644
--- a/fs/jffs2/os-linux.h
+++ b/fs/jffs2/os-linux.h
@@ -155,7 +155,7 @@ extern const struct file_operations jffs2_file_operations;
 extern const struct inode_operations jffs2_file_inode_operations;
 extern const struct address_space_operations jffs2_file_address_operations;
 int jffs2_fsync(struct file *, loff_t, loff_t, int);
-int jffs2_do_readpage_unlock (struct inode *inode, struct page *pg);
+int jffs2_do_readpage_unlock(void *data, struct page *pg);

 /* ioctl.c */
 long jffs2_ioctl(struct file *, unsigned int, unsigned long);
--
2.20.1
[PATCH 4/4] 9p: pass the correct prototype to read_cache_page
Fix the callback 9p passes to read_cache_page to actually have the proper type expected. Casting around function pointers can easily hide typing bugs, and defeats control flow protection.

Signed-off-by: Christoph Hellwig
Reviewed-by: Kees Cook
---
 fs/9p/vfs_addr.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 0bcbcc20f769..02e0fc51401e 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -50,8 +50,9 @@
  * @page: structure to page
  *
  */
-static int v9fs_fid_readpage(struct p9_fid *fid, struct page *page)
+static int v9fs_fid_readpage(void *data, struct page *page)
 {
+	struct p9_fid *fid = data;
 	struct inode *inode = page->mapping->host;
 	struct bio_vec bvec = {.bv_page = page, .bv_len = PAGE_SIZE};
 	struct iov_iter to;
@@ -122,7 +123,8 @@ static int v9fs_vfs_readpages(struct file *filp, struct address_space *mapping,
 	if (ret == 0)
 		return ret;

-	ret = read_cache_pages(mapping, pages, (void *)v9fs_vfs_readpage, filp);
+	ret = read_cache_pages(mapping, pages, v9fs_fid_readpage,
+			filp->private_data);
 	p9_debug(P9_DEBUG_VFS, "  = %d\n", ret);
 	return ret;
 }
--
2.20.1
[PATCH 2/4] mm: don't cast ->readpage to filler_t for do_read_cache_page
We can just pass a NULL filler and do the right thing inside of do_read_cache_page based on the NULL parameter.

Signed-off-by: Christoph Hellwig
Reviewed-by: Kees Cook
---
 include/linux/pagemap.h |  3 +--
 mm/filemap.c            | 10 ++++++----
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 9ec3544baee2..6dd7ec95c778 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -396,8 +396,7 @@ extern int read_cache_pages(struct address_space *mapping,
 static inline struct page *read_mapping_page(struct address_space *mapping,
 				pgoff_t index, void *data)
 {
-	filler_t *filler = (filler_t *)mapping->a_ops->readpage;
-	return read_cache_page(mapping, index, filler, data);
+	return read_cache_page(mapping, index, NULL, data);
 }

 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 6a8048477bc6..3bec6e18b763 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2772,7 +2772,11 @@ static struct page *do_read_cache_page(struct address_space *mapping,
 	}

 filler:
-	err = filler(data, page);
+	if (filler)
+		err = filler(data, page);
+	else
+		err = mapping->a_ops->readpage(data, page);
+
 	if (err < 0) {
 		put_page(page);
 		return ERR_PTR(err);
@@ -2884,9 +2888,7 @@ struct page *read_cache_page_gfp(struct address_space *mapping,
 				pgoff_t index,
 				gfp_t gfp)
 {
-	filler_t *filler = (filler_t *)mapping->a_ops->readpage;
-
-	return do_read_cache_page(mapping, index, filler, NULL, gfp);
+	return do_read_cache_page(mapping, index, NULL, NULL, gfp);
 }
 EXPORT_SYMBOL(read_cache_page_gfp);
--
2.20.1
[PATCH 1/4] mm: fix an overly long line in read_cache_page
Signed-off-by: Christoph Hellwig
Reviewed-by: Kees Cook
---
 mm/filemap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index c5af80c43d36..6a8048477bc6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2862,7 +2862,8 @@ struct page *read_cache_page(struct address_space *mapping,
 				int (*filler)(void *, struct page *),
 				void *data)
 {
-	return do_read_cache_page(mapping, index, filler, data, mapping_gfp_mask(mapping));
+	return do_read_cache_page(mapping, index, filler, data,
+			mapping_gfp_mask(mapping));
 }
 EXPORT_SYMBOL(read_cache_page);
--
2.20.1
Re: [RFC PATCH] usb: host: xhci: allow __GFP_FS in dma allocation
Folks, you can't just pass arbitrary GFP_ flags to DMA allocation routines, because very often they are not just wrappers around the page allocator. So no, you can't fine-grained control the flags, and the existing code is just as buggy. Please switch to using memalloc_noio_save() instead.
Re: arch/xtensa/include/asm/uaccess.h:40:22: error: implicit declaration of function 'uaccess_kernel'; did you mean 'getname_kernel'?
Hello,

On Sun, May 19, 2019 at 10:39 PM kbuild test robot wrote:
> Hi Matt,
>
> FYI, the error/warning still remains.
>
> tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> head: a188339ca5a396acc588e5851ed7e19f66b0ebd9
> commit: 7df95299b94a63ec67a6389fc02dc25019a80ee8 staging: kpc2000: Add DMA driver
> date: 4 weeks ago
> config: xtensa-allmodconfig (attached as .config)
> compiler: xtensa-linux-gcc (GCC) 8.1.0
> reproduce:
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         git checkout 7df95299b94a63ec67a6389fc02dc25019a80ee8
>         # save the attached .config to linux build tree
>         GCC_VERSION=8.1.0 make.cross ARCH=xtensa
>
> If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot
>
> All errors (new ones prefixed by >>):
>
>    In file included from drivers/staging/kpc2000/kpc_dma/fileops.c:11:
>    arch/xtensa/include/asm/uaccess.h: In function 'clear_user':
> >> arch/xtensa/include/asm/uaccess.h:40:22: error: implicit declaration of function 'uaccess_kernel'; did you mean 'getname_kernel'? [-Werror=implicit-function-declaration]
>     #define __kernel_ok (uaccess_kernel())
>                          ^~

I've posted a fix for this issue here: https://lkml.org/lkml/2019/5/8/956

If there are post merge window pull requests planned for the staging tree please consider including this fix. Alternatively if I could get an ack for this fix I'd submit it in a pull request from the xtensa tree.

-- Thanks. -- Max
Re: [PATCH] intel_th: msu: Fix unused variable warning on arm64 platform
please ignore this patch, I will update it and resend it. On 2019/5/20 11:23, Shaokun Zhang wrote: > Fix this compiler warning on arm64 platform. > > Cc: Alexander Shishkin > Cc: Greg Kroah-Hartman > Signed-off-by: Shaokun Zhang > --- > drivers/hwtracing/intel_th/msu.c | 8 +++- > 1 file changed, 3 insertions(+), 5 deletions(-) > > diff --git a/drivers/hwtracing/intel_th/msu.c > b/drivers/hwtracing/intel_th/msu.c > index 81bb54fa3ce8..e15ed5c308e1 100644 > --- a/drivers/hwtracing/intel_th/msu.c > +++ b/drivers/hwtracing/intel_th/msu.c > @@ -780,7 +780,7 @@ static int __msc_buffer_win_alloc(struct msc_window *win, > static int msc_buffer_win_alloc(struct msc *msc, unsigned int nr_blocks) > { > struct msc_window *win; > - int ret = -ENOMEM, i; > + int ret = -ENOMEM; > > if (!nr_blocks) > return 0; > @@ -812,7 +812,7 @@ static int msc_buffer_win_alloc(struct msc *msc, unsigned > int nr_blocks) > goto err_nomem; > > #ifdef CONFIG_X86 > - for (i = 0; i < ret; i++) > + for (int i = 0; i < ret; i++) > /* Set the page as uncached */ > set_memory_uc((unsigned long)msc_win_block(win, i), 1); > #endif > @@ -860,8 +860,6 @@ static void __msc_buffer_win_free(struct msc *msc, struct > msc_window *win) > */ > static void msc_buffer_win_free(struct msc *msc, struct msc_window *win) > { > - int i; > - > msc->nr_pages -= win->nr_blocks; > > list_del(>entry); > @@ -871,7 +869,7 @@ static void msc_buffer_win_free(struct msc *msc, struct > msc_window *win) > } > > #ifdef CONFIG_X86 > - for (i = 0; i < win->nr_blocks; i++) > + for (int i = 0; i < win->nr_blocks; i++) > /* Reset the page to write-back */ > set_memory_wb((unsigned long)msc_win_block(win, i), 1); > #endif >
[RFC PATCH v5 12/16] slub: Enable moving objects to/from specific nodes
We have just implemented Slab Movable Objects (object migration). Currently object migration is used to defrag a cache. On NUMA systems it would be nice to be able to control the source and destination nodes when moving objects. Add CONFIG_SMO_NODE to guard this feature. CONFIG_SMO_NODE depends on CONFIG_SLUB_DEBUG because we use the full list. Implement moving all objects (including those in full slabs) to a specific node. Expose this functionality to userspace via a sysfs entry. Add sysfs entry: /sysfs/kernel/slab//move With this users get access to the following functionality: - Move all objects to specified node. echo "N1" > move - Move all objects from specified node to other specified node (from N1 -> to N2): echo "N1 N2" > move This also enables shrinking slabs on a specific node: echo "N1 N1" > move Signed-off-by: Tobin C. Harding --- mm/Kconfig | 7 ++ mm/slub.c | 249 + 2 files changed, 256 insertions(+) diff --git a/mm/Kconfig b/mm/Kconfig index ee8d1f311858..aa8d60e69a01 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -258,6 +258,13 @@ config ARCH_ENABLE_THP_MIGRATION config CONTIG_ALLOC def_bool (MEMORY_ISOLATION && COMPACTION) || CMA +config SMO_NODE + bool "Enable per node control of Slab Movable Objects" + depends on SLUB && SYSFS + select SLUB_DEBUG + help + On NUMA systems enable moving objects to and from a specified node. + config PHYS_ADDR_T_64BIT def_bool 64BIT diff --git a/mm/slub.c b/mm/slub.c index 2157205df7ba..9582f2fc97d2 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -4336,6 +4336,106 @@ static void move_slab_page(struct page *page, void *scratch, int node) s->migrate(s, vector, count, node, private); } +#ifdef CONFIG_SMO_NODE +/* + * kmem_cache_move() - Attempt to move all slab objects. + * @s: The cache we are working on. + * @node: The node to move objects away from. + * @target_node: The node to move objects on to. + * + * Attempts to move all objects (partial slabs and full slabs) to target + * node. + * + * Context: Takes the list_lock. 
+ * Return: The number of slabs remaining on node. + */ +static unsigned long kmem_cache_move(struct kmem_cache *s, +int node, int target_node) +{ + struct kmem_cache_node *n = get_node(s, node); + LIST_HEAD(move_list); + struct page *page, *page2; + unsigned long flags; + void **scratch; + + if (!s->migrate) { + pr_warn("%s SMO not enabled, cannot move objects\n", s->name); + goto out; + } + + scratch = alloc_scratch(s); + if (!scratch) + goto out; + + spin_lock_irqsave(>list_lock, flags); + + list_for_each_entry_safe(page, page2, >partial, lru) { + if (!slab_trylock(page)) + /* Busy slab. Get out of the way */ + continue; + + if (page->inuse) { + list_move(>lru, _list); + /* Stop page being considered for allocations */ + n->nr_partial--; + page->frozen = 1; + + slab_unlock(page); + } else {/* Empty slab page */ + list_del(>lru); + n->nr_partial--; + slab_unlock(page); + discard_slab(s, page); + } + } + list_for_each_entry_safe(page, page2, >full, lru) { + if (!slab_trylock(page)) + continue; + + list_move(>lru, _list); + page->frozen = 1; + slab_unlock(page); + } + + spin_unlock_irqrestore(>list_lock, flags); + + list_for_each_entry(page, _list, lru) { + if (page->inuse) + move_slab_page(page, scratch, target_node); + } + kfree(scratch); + + /* Bail here to save taking the list_lock */ + if (list_empty(_list)) + goto out; + + /* Inspect results and dispose of pages */ + spin_lock_irqsave(>list_lock, flags); + list_for_each_entry_safe(page, page2, _list, lru) { + list_del(>lru); + slab_lock(page); + page->frozen = 0; + + if (page->inuse) { + if (page->inuse == page->objects) { + list_add(>lru, >full); + slab_unlock(page); + } else { + n->nr_partial++; + list_add_tail(>lru, >partial); + slab_unlock(page); + } + } else { + slab_unlock(page); + discard_slab(s, page); + } + } +
[RFC PATCH v5 15/16] dcache: Implement partial shrink via Slab Movable Objects
The dentry slab cache is susceptible to internal fragmentation. Now that we have Slab Movable Objects we can attempt to defragment the dcache. Dentry objects are inherently _not_ relocatable however under some conditions they can be free'd. This is the same as shrinking the dcache but instead of shrinking the whole cache we only attempt to free those objects that are located in partially full slab pages. There is no guarantee that this will reduce the memory usage of the system, it is a compromise between fragmented memory and total cache shrinkage with the hope that some memory pressure can be alleviated. This is implemented using the newly added Slab Movable Objects infrastructure. The dcache 'migration' function is intentionally _not_ called 'd_migrate' because we only free, we do not migrate. Call it 'd_partial_shrink' to make explicit that no reallocation is done. Implement isolate and 'migrate' functions for the dentry slab cache. Signed-off-by: Tobin C. Harding --- fs/dcache.c | 76 + 1 file changed, 76 insertions(+) diff --git a/fs/dcache.c b/fs/dcache.c index b7318615979d..0dfe580c2d42 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -31,6 +31,7 @@ #include #include #include +#include #include "internal.h" #include "mount.h" @@ -3071,6 +3072,79 @@ void d_tmpfile(struct dentry *dentry, struct inode *inode) } EXPORT_SYMBOL(d_tmpfile); +/* + * d_isolate() - Dentry isolation callback function. + * @s: The dentry cache. + * @v: Vector of pointers to the objects to isolate. + * @nr: Number of objects in @v. + * + * The slab allocator is holding off frees. We can safely examine + * the object without the danger of it vanishing from under us. 
+ */ +static void *d_isolate(struct kmem_cache *s, void **v, int nr) +{ + struct list_head *dispose; + struct dentry *dentry; + int i; + + dispose = kmalloc(sizeof(*dispose), GFP_KERNEL); + if (!dispose) + return NULL; + + INIT_LIST_HEAD(dispose); + + for (i = 0; i < nr; i++) { + dentry = v[i]; + spin_lock(>d_lock); + + if (dentry->d_lockref.count > 0 || + dentry->d_flags & DCACHE_SHRINK_LIST) { + spin_unlock(>d_lock); + continue; + } + + if (dentry->d_flags & DCACHE_LRU_LIST) + d_lru_del(dentry); + + d_shrink_add(dentry, dispose); + spin_unlock(>d_lock); + } + + return dispose; +} + +/* + * d_partial_shrink() - Dentry migration callback function. + * @s: The dentry cache. + * @_unused: We do not access the vector. + * @__unused: No need for length of vector. + * @___unused: We do not do any allocation. + * @private: list_head pointer representing the shrink list. + * + * Dispose of the shrink list created during isolation function. + * + * Dentry objects can _not_ be relocated and shrinking the whole dcache + * can be expensive. This is an effort to free dentry objects that are + * stopping slab pages from being free'd without clearing the whole dcache. + * + * This callback is called from the SLUB allocator object migration + * infrastructure in attempt to free up slab pages by freeing dentry + * objects from partially full slabs. + */ +static void d_partial_shrink(struct kmem_cache *s, void **_unused, int __unused, +int ___unused, void *private) +{ + struct list_head *dispose = private; + + if (!private) /* kmalloc error during isolate. 
*/ + return; + + if (!list_empty(dispose)) + shrink_dentry_list(dispose); + + kfree(private); +} + static __initdata unsigned long dhash_entries; static int __init set_dhash_entries(char *str) { @@ -3116,6 +3190,8 @@ static void __init dcache_init(void) sizeof_field(struct dentry, d_iname), dcache_ctor); + kmem_cache_setup_mobility(dentry_cache, d_isolate, d_partial_shrink); + /* Hash may have been set up in dcache_init_early */ if (!hashdist) return; -- 2.21.0
[RFC PATCH v5 13/16] slub: Enable balancing slabs across nodes
We have just implemented Slab Movable Objects (SMO). On NUMA systems slabs can become unbalanced i.e. many slabs on one node while other nodes have few slabs. Using SMO we can balance the slabs across all the nodes. The algorithm used is as follows: 1. Move all objects to node 0 (this has the effect of defragmenting the cache). 2. Calculate the desired number of slabs for each node (this is done using the approximation nr_slabs / nr_nodes). 3. Loop over the nodes moving the desired number of slabs from node 0 to the node. Feature is conditionally built in with CONFIG_SMO_NODE, this is because we need the full list (we enable SLUB_DEBUG to get this). Future version may separate final list out of SLUB_DEBUG. Expose this functionality to userspace via a sysfs entry. Add sysfs entry: /sysfs/kernel/slab//balance Write of '1' to this file triggers balance, no other value accepted. This feature relies on SMO being enable for the cache, this is done with a call to, after the isolate/migrate functions have been defined. kmem_cache_setup_mobility(s, isolate, migrate) Signed-off-by: Tobin C. Harding --- mm/slub.c | 120 ++ 1 file changed, 120 insertions(+) diff --git a/mm/slub.c b/mm/slub.c index 9582f2fc97d2..25b6d1e408e3 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -4574,6 +4574,109 @@ static unsigned long kmem_cache_move_to_node(struct kmem_cache *s, int node) return left; } + +/* + * kmem_cache_move_slabs() - Attempt to move @num slabs to target_node, + * @s: The cache we are working on. + * @node: The node to move objects from. + * @target_node: The node to move objects to. + * @num: The number of slabs to move. + * + * Attempts to move @num slabs from @node to @target_node. This is done + * by migrating objects from slabs on the full_list. + * + * Return: The number of slabs moved or error code. 
+ */ +static long kmem_cache_move_slabs(struct kmem_cache *s, + int node, int target_node, long num) +{ + struct kmem_cache_node *n = get_node(s, node); + LIST_HEAD(move_list); + struct page *page, *page2; + unsigned long flags; + void **scratch; + long done = 0; + + if (node == target_node) + return -EINVAL; + + scratch = alloc_scratch(s); + if (!scratch) + return -ENOMEM; + + spin_lock_irqsave(>list_lock, flags); + list_for_each_entry_safe(page, page2, >full, lru) { + if (!slab_trylock(page)) + /* Busy slab. Get out of the way */ + continue; + + list_move(>lru, _list); + page->frozen = 1; + slab_unlock(page); + + if (++done >= num) + break; + } + spin_unlock_irqrestore(>list_lock, flags); + + list_for_each_entry(page, _list, lru) { + if (page->inuse) + move_slab_page(page, scratch, target_node); + } + kfree(scratch); + + /* Inspect results and dispose of pages */ + spin_lock_irqsave(>list_lock, flags); + list_for_each_entry_safe(page, page2, _list, lru) { + list_del(>lru); + slab_lock(page); + page->frozen = 0; + + if (page->inuse) { + /* +* This is best effort only, if slab still has +* objects just put it back on the partial list. +*/ + n->nr_partial++; + list_add_tail(>lru, >partial); + slab_unlock(page); + } else { + slab_unlock(page); + discard_slab(s, page); + } + } + spin_unlock_irqrestore(>list_lock, flags); + + return done; +} + +/* + * kmem_cache_balance_nodes() - Balance slabs across nodes. + * @s: The cache we are working on. 
+ */ +static void kmem_cache_balance_nodes(struct kmem_cache *s) +{ + struct kmem_cache_node *n = get_node(s, 0); + unsigned long desired_nr_slabs_per_node; + unsigned long nr_slabs; + int nr_nodes = 0; + int nid; + + (void)kmem_cache_move_to_node(s, 0); + + for_each_node_state(nid, N_NORMAL_MEMORY) + nr_nodes++; + + nr_slabs = atomic_long_read(>nr_slabs); + desired_nr_slabs_per_node = nr_slabs / nr_nodes; + + for_each_node_state(nid, N_NORMAL_MEMORY) { + if (nid == 0) + continue; + + kmem_cache_move_slabs(s, 0, nid, desired_nr_slabs_per_node); + } +} #endif /** @@ -5838,6 +5941,22 @@ static ssize_t move_store(struct kmem_cache *s, const char *buf, size_t length) return length; } SLAB_ATTR(move); + +static ssize_t balance_show(struct kmem_cache *s, char *buf) +{ + return 0; +} + +static
[RFC PATCH v5 16/16] dcache: Add CONFIG_DCACHE_SMO
In an attempt to make the SMO patchset as non-invasive as possible add a config option CONFIG_DCACHE_SMO (under "Memory Management options") for enabling SMO for the DCACHE. Whithout this option dcache constructor is used but no other code is built in, with this option enabled slab mobility is enabled and the isolate/migrate functions are built in. Add CONFIG_DCACHE_SMO to guard the partial shrinking of the dcache via Slab Movable Objects infrastructure. Signed-off-by: Tobin C. Harding --- fs/dcache.c | 4 mm/Kconfig | 7 +++ 2 files changed, 11 insertions(+) diff --git a/fs/dcache.c b/fs/dcache.c index 0dfe580c2d42..96063e872366 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -3072,6 +3072,7 @@ void d_tmpfile(struct dentry *dentry, struct inode *inode) } EXPORT_SYMBOL(d_tmpfile); +#ifdef CONFIG_DCACHE_SMO /* * d_isolate() - Dentry isolation callback function. * @s: The dentry cache. @@ -3144,6 +3145,7 @@ static void d_partial_shrink(struct kmem_cache *s, void **_unused, int __unused, kfree(private); } +#endif /* CONFIG_DCACHE_SMO */ static __initdata unsigned long dhash_entries; static int __init set_dhash_entries(char *str) @@ -3190,7 +3192,9 @@ static void __init dcache_init(void) sizeof_field(struct dentry, d_iname), dcache_ctor); +#ifdef CONFIG_DCACHE_SMO kmem_cache_setup_mobility(dentry_cache, d_isolate, d_partial_shrink); +#endif /* Hash may have been set up in dcache_init_early */ if (!hashdist) diff --git a/mm/Kconfig b/mm/Kconfig index aa8d60e69a01..7dcea76e5ecc 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -265,6 +265,13 @@ config SMO_NODE help On NUMA systems enable moving objects to and from a specified node. +config DCACHE_SMO + bool "Enable Slab Movable Objects for the dcache" + depends on SLUB + help + Under memory pressure we can try to free dentry slab cache objects from + the partial slab list if this is enabled. + config PHYS_ADDR_T_64BIT def_bool 64BIT -- 2.21.0
[RFC PATCH v5 11/16] tools/testing/slab: Add XArray movable objects tests
We just implemented movable objects for the XArray. Let's test it intree. Add test module for the XArray's movable objects implementation. Functionality of the XArray Slab Movable Object implementation can usually be seen by simply by using `slabinfo` on a running machine since the radix tree is typically in use on a running machine and will have partial slabs. For repeated testing we can use the test module to run to simulate a workload on the XArray then use `slabinfo` to test object migration is functioning. If testing on freshly spun up VM (low radix tree workload) it may be necessary to load/unload the module a number of times to create partial slabs. Example test session Relevant /proc/slabinfo column headers: name Prior to testing slabinfo report for radix_tree_node: # slabinfo radix_tree_node --report Slabcache: radix_tree_node Aliases: 0 Order : 2 Objects: 8352 ** Reclaim accounting active ** Defragmentation at 30% Sizes (bytes) Slabs DebugMemory Object : 576 Total : 497 Sanity Checks : On Total: 8142848 SlabObj: 912 Full : 473 Redzoning : On Used : 4810752 SlabSiz: 16384 Partial: 24 Poisoning : On Loss : 3332096 Loss : 336 CpuSlab: 0 Tracking : On Lalig: 2806272 Align : 8 Objects: 17 Tracing : Off Lpadd: 437360 Here you can see the kernel was built with Slab Movable Objects enabled for the XArray (XArray uses the radix tree below the surface). 
After inserting the test module (note we have triggered allocation of a number of radix tree nodes increasing the object count but decreasing the number of partial slabs): # slabinfo radix_tree_node --report Slabcache: radix_tree_node Aliases: 0 Order : 2 Objects: 8442 ** Reclaim accounting active ** Defragmentation at 30% Sizes (bytes) Slabs DebugMemory Object : 576 Total : 499 Sanity Checks : On Total: 8175616 SlabObj: 912 Full : 484 Redzoning : On Used : 4862592 SlabSiz: 16384 Partial: 15 Poisoning : On Loss : 3313024 Loss : 336 CpuSlab: 0 Tracking : On Lalig: 2836512 Align : 8 Objects: 17 Tracing : Off Lpadd: 439120 Now we can shrink the radix_tree_node cache: # slabinfo radix_tree_node --shrink # slabinfo radix_tree_node --report Slabcache: radix_tree_node Aliases: 0 Order : 2 Objects: 8515 ** Reclaim accounting active ** Defragmentation at 30% Sizes (bytes) Slabs DebugMemory Object : 576 Total : 501 Sanity Checks : On Total: 8208384 SlabObj: 912 Full : 500 Redzoning : On Used : 4904640 SlabSiz: 16384 Partial: 1 Poisoning : On Loss : 3303744 Loss : 336 CpuSlab: 0 Tracking : On Lalig: 2861040 Align : 8 Objects: 17 Tracing : Off Lpadd: 440880 Note the single remaining partial slab. Signed-off-by: Tobin C. Harding --- tools/testing/slab/Makefile | 2 +- tools/testing/slab/slub_defrag_xarray.c | 211 2 files changed, 212 insertions(+), 1 deletion(-) create mode 100644 tools/testing/slab/slub_defrag_xarray.c diff --git a/tools/testing/slab/Makefile b/tools/testing/slab/Makefile index 440c2e3e356f..44c18d9a4d52 100644 --- a/tools/testing/slab/Makefile +++ b/tools/testing/slab/Makefile @@ -1,4 +1,4 @@ -obj-m += slub_defrag.o +obj-m += slub_defrag.o slub_defrag_xarray.o KTREE=../../.. 
diff --git a/tools/testing/slab/slub_defrag_xarray.c b/tools/testing/slab/slub_defrag_xarray.c new file mode 100644 index ..41143f73256c --- /dev/null +++ b/tools/testing/slab/slub_defrag_xarray.c @@ -0,0 +1,211 @@ +// SPDX-License-Identifier: GPL-2.0+ +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define SMOX_CACHE_NAME "smox_test" +static struct kmem_cache *cachep; + +/* + * Declare XArrays globally so we can clean them up on module unload. + */ + +/* Used by test_smo_xarray()*/ +DEFINE_XARRAY(things); + +/* Thing to store pointers to in the XArray */ +struct smox_thing { + long id; +}; + +/* It's up to the caller to ensure id is unique */ +static struct smox_thing *alloc_thing(int id) +{ + struct smox_thing *thing; + + thing = kmem_cache_alloc(cachep, GFP_KERNEL); + if (!thing) + return ERR_PTR(-ENOMEM); + + thing->id = id; + return thing; +} + +/** + * smox_object_ctor() - SMO object constructor function. + * @ptr: Pointer to memory where the object should be constructed. + */ +void
[RFC PATCH v5 14/16] dcache: Provide a dentry constructor
In order to support object migration on the dentry cache we need to have a determined object state at all times. Without a constructor the object would have a random state after allocation. Provide a dentry constructor. Signed-off-by: Tobin C. Harding --- fs/dcache.c | 30 +- 1 file changed, 21 insertions(+), 9 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index 8136bda27a1f..b7318615979d 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -1602,6 +1602,16 @@ void d_invalidate(struct dentry *dentry) } EXPORT_SYMBOL(d_invalidate); +static void dcache_ctor(void *p) +{ + struct dentry *dentry = p; + + /* Mimic lockref_mark_dead() */ + dentry->d_lockref.count = -128; + + spin_lock_init(>d_lock); +} + /** * __d_alloc - allocate a dcache entry * @sb: filesystem it will belong to @@ -1657,7 +1667,6 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name) dentry->d_lockref.count = 1; dentry->d_flags = 0; - spin_lock_init(>d_lock); seqcount_init(>d_seq); dentry->d_inode = NULL; dentry->d_parent = dentry; @@ -3095,14 +3104,17 @@ static void __init dcache_init_early(void) static void __init dcache_init(void) { - /* -* A constructor could be added for stable state like the lists, -* but it is probably not worth it because of the cache nature -* of the dcache. -*/ - dentry_cache = KMEM_CACHE_USERCOPY(dentry, - SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|SLAB_ACCOUNT, - d_iname); + slab_flags_t flags = + SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | SLAB_MEM_SPREAD | SLAB_ACCOUNT; + + dentry_cache = + kmem_cache_create_usercopy("dentry", + sizeof(struct dentry), + __alignof__(struct dentry), + flags, + offsetof(struct dentry, d_iname), + sizeof_field(struct dentry, d_iname), + dcache_ctor); /* Hash may have been set up in dcache_init_early */ if (!hashdist) -- 2.21.0
[RFC PATCH v5 05/16] tools/vm/slabinfo: Add remote node defrag ratio output
Add output line for NUMA remote node defrag ratio. Signed-off-by: Tobin C. Harding --- tools/vm/slabinfo.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c index cbfc56c44c2f..d2c22f9ee2d8 100644 --- a/tools/vm/slabinfo.c +++ b/tools/vm/slabinfo.c @@ -34,6 +34,7 @@ struct slabinfo { unsigned int sanity_checks, slab_size, store_user, trace; int order, poison, reclaim_account, red_zone; int movable, ctor; + int remote_node_defrag_ratio; unsigned long partial, objects, slabs, objects_partial, objects_total; unsigned long alloc_fastpath, alloc_slowpath; unsigned long free_fastpath, free_slowpath; @@ -377,6 +378,10 @@ static void slab_numa(struct slabinfo *s, int mode) if (skip_zero && !s->slabs) return; + if (mode) { + printf("\nNUMA remote node defrag ratio: %3d\n", + s->remote_node_defrag_ratio); + } if (!line) { printf("\n%-21s:", mode ? "NUMA nodes" : "Slab"); for(node = 0; node <= highest_node; node++) @@ -1272,6 +1277,8 @@ static void read_slab_dir(void) slab->cpu_partial_free = get_obj("cpu_partial_free"); slab->alloc_node_mismatch = get_obj("alloc_node_mismatch"); slab->deactivate_bypass = get_obj("deactivate_bypass"); + slab->remote_node_defrag_ratio = + get_obj("remote_node_defrag_ratio"); chdir(".."); if (read_slab_obj(slab, "ops")) { if (strstr(buffer, "ctor :")) -- 2.21.0
[RFC PATCH v5 08/16] tools/testing/slab: Add object migration test suite
We just added a module that enables testing the SLUB allocator's ability to defrag/shrink caches via movable objects. Tests are better when they are automated. Add automated testing via a python script for SLUB movable objects.

Example output:

  $ cd path/to/linux/tools/testing/slab
  $ ./slub_defrag.py
  Please run script as root

  $ sudo ./slub_defrag.py

  $ sudo ./slub_defrag.py --debug
  Loading module ...
  Slab cache smo_test created
  Objects per slab: 20
  Running sanity checks ...

  Running module stress test (see dmesg for additional test output) ...
  Removing module slub_defrag ...
  Loading module ...
  Slab cache smo_test created

  Running test non-movable ...
  testing slab 'smo_test' prior to enabling movable objects ...
  verified non-movable slabs are NOT shrinkable

  Running test movable ...
  testing slab 'smo_test' after enabling movable objects ...
  verified movable slabs are shrinkable

  Removing module slub_defrag ...

Signed-off-by: Tobin C. Harding
---
 tools/testing/slab/slub_defrag.c  |   1 +
 tools/testing/slab/slub_defrag.py | 451 ++++++++++++++++++++++++++++++
 2 files changed, 452 insertions(+)
 create mode 100755 tools/testing/slab/slub_defrag.py

diff --git a/tools/testing/slab/slub_defrag.c b/tools/testing/slab/slub_defrag.c
index 4a5c24394b96..8332e69ee868 100644
--- a/tools/testing/slab/slub_defrag.c
+++ b/tools/testing/slab/slub_defrag.c
@@ -337,6 +337,7 @@ static int smo_run_module_tests(int nr_objs, int keep)
 
 /*
  * struct functions() - Map command to a function pointer.
+ * If you update this please update the documentation in slub_defrag.py
  */
 struct functions {
 	char *fn_name;
diff --git a/tools/testing/slab/slub_defrag.py b/tools/testing/slab/slub_defrag.py
new file mode 100755
index 000000000000..41747c0db39b
--- /dev/null
+++ b/tools/testing/slab/slub_defrag.py
@@ -0,0 +1,451 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import subprocess
+import sys
+from os import path
+
+# SLUB Movable Objects test suite.
+#
+# Requirements:
+#  - CONFIG_SLUB=y
+#  - CONFIG_SLUB_DEBUG=y
+#  - The slub_defrag module in this directory.
+
+# Test SMO using a kernel module that enables triggering arbitrary
+# kernel code from userspace via a debugfs file.
+#
+# Module code is in ./slub_defrag.c, basically the functionality is as
+# follows:
+#
+#  - Creates debugfs file /sys/kernel/debugfs/smo/callfn
+#  - Writes to 'callfn' are parsed as a command string and the function
+#    associated with the command is called.
+#  - Defines 4 commands (all commands operate on the smo_test cache):
+#     - 'test': Runs module stress tests.
+#     - 'alloc N': Allocates N slub objects
+#     - 'free N POS': Frees N objects starting at POS (see below)
+#     - 'enable': Enables SLUB Movable Objects
+#
+# The module maintains a list of allocated objects.  Allocation adds
+# objects to the tail of the list.  Free'ing frees from the head of the
+# list.  This has the effect of creating free slots in the slab.  For
+# finer grained control over where in the cache slots are free'd, the
+# POS (position) argument may be used.
+
+# The main() function is reasonably readable; the test suite does the
+# following:
+#
+# 1. Runs the module stress tests.
+# 2. Tests the cache without movable objects enabled.
+#    - Creates multiple partial slabs as explained above.
+#    - Verifies that partial slabs are _not_ removed by shrink (see below).
+# 3. Tests the cache with movable objects enabled.
+#    - Creates multiple partial slabs as explained above.
+#    - Verifies that partial slabs _are_ removed by shrink (see below).
+
+# The sysfs file /sys/kernel/slab/<cache>/shrink enables calling the
+# function kmem_cache_shrink() (see mm/slab_common.c and mm/slub.c).
+# Shrinking a cache attempts to consolidate all partial slabs by moving
+# objects if object migration is enabled for the cache, otherwise
+# shrinking a cache simply re-orders the partial list so that the most
+# densely populated slabs are at the head of the list.
+ +# Enable/disable debugging output (also enabled via -d | --debug). +debug = False + +# Used in debug messages and when running `insmod`. +MODULE_NAME = "slub_defrag" + +# Slab cache created by the test module. +CACHE_NAME = "smo_test" + +# Set by get_slab_config() +objects_per_slab = 0 +pages_per_slab = 0 +debugfs_mounted = False # Set to true if we mount debugfs. + + +def eprint(*args, **kwargs): +print(*args, file=sys.stderr, **kwargs) + + +def dprint(*args, **kwargs): +if debug: +print(*args, file=sys.stderr, **kwargs) + + +def run_shell(cmd): +return subprocess.call([cmd], shell=True) + + +def run_shell_get_stdout(cmd): +return subprocess.check_output([cmd], shell=True) + + +def assert_root(): +user = run_shell_get_stdout('whoami') +if user != b'root\n': +eprint("Please run script as root") +sys.exit(1) + + +def mount_debugfs(): +mounted = False + +# Check if debugfs is mounted at a known
[RFC PATCH v5 07/16] tools/testing/slab: Add object migration test module
We just implemented slab movable objects for the SLUB allocator. We should test that code. In order to do so we need to be able to do a number of things

 - Create a cache
 - Enable Slab Movable Objects for the cache
 - Allocate objects to the cache
 - Free objects from within specific slabs of the cache

We can do all this via a loadable module. Add a module that defines functions that can be triggered from userspace via a debugfs entry. From the source:

  /*
   * SLUB defragmentation a.k.a. Slab Movable Objects (SMO).
   *
   * This module is used for testing the SLUB allocator.  Enables
   * userspace to run kernel functions via a debugfs file.
   *
   * debugfs: /sys/kernel/debugfs/smo/callfn (write only)
   *
   * String written to `callfn` is parsed by the module and associated
   * function is called.  See fn_tab for mapping of strings to functions.
   */

References to allocated objects are kept by the module in a linked list so that userspace can control which object to free.

We introduce the following four functions via the function table

 "enable": Enables object migration for the test cache.
 "alloc X": Allocates X objects
 "free X [Y]": Frees X objects starting at list position Y (default Y==0)
 "test": Runs [stress] tests from within the module (see below).

 	{"enable", smo_enable_cache_mobility},
 	{"alloc", smo_alloc_objects},
 	{"free", smo_free_object},
 	{"test", smo_run_module_tests},

Freeing from the start of the list creates a hole in the slab being freed from (i.e. creates a partial slab). The results of running these commands can be seen using `slabinfo` (available in tools/vm/):

	make -o slabinfo tools/vm/slabinfo.c

Stress tests can be run from within the module. These tests are internal to the module because we verify that object references are still good after object migration. These are called 'stress' tests because it is intended that they create/free a lot of objects. Userspace can control the number of objects to create, default is 1000.
Example test session
--------------------

Relevant /proc/slabinfo column headers:

  name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>

 # mount -t debugfs none /sys/kernel/debug/
 $ cd path/to/linux/tools/testing/slab; make
 ...
 # insmod slub_defrag.ko
 # cat /proc/slabinfo | grep smo_test | sed 's/:.*//'
 smo_test           0      0    392   20    2

From this we can see that the module created cache 'smo_test' with 20 objects per slab and 2 pages per slab (and cache is currently empty).

We can play with the slab allocator manually:

 # insmod slub_defrag.ko
 # echo 'alloc 21' > callfn
 # cat /proc/slabinfo | grep smo_test | sed 's/:.*//'
 smo_test          21     40    392   20    2

We see here that 21 active objects have been allocated creating 2 slabs (40 total objects).

 # slabinfo smo_test --report

 Slabcache: smo_test             Aliases:  0 Order :  1 Objects: 21

 Sizes (bytes)     Slabs              Debug                Memory
 ------------------------------------------------------------------------
 Object :      56  Total  :       2   Sanity Checks : On   Total:   16384
 SlabObj:     392  Full   :       1   Redzoning     : On   Used :    1176
 SlabSiz:    8192  Partial:       1   Poisoning     : On   Loss :   15208
 Loss   :     336  CpuSlab:       0   Tracking      : On   Lalig:    7056
 Align  :       8  Objects:      20   Tracing       : Off  Lpadd:     704

Now free an object from the first slot of the first slab

 # echo 'free 1' > callfn
 # cat /proc/slabinfo | grep smo_test | sed 's/:.*//'
 smo_test          20     40    392   20    2

 # slabinfo smo_test --report

 Slabcache: smo_test             Aliases:  0 Order :  1 Objects: 20

 Sizes (bytes)     Slabs              Debug                Memory
 ------------------------------------------------------------------------
 Object :      56  Total  :       2   Sanity Checks : On   Total:   16384
 SlabObj:     392  Full   :       0   Redzoning     : On   Used :    1120
 SlabSiz:    8192  Partial:       2   Poisoning     : On   Loss :   15264
 Loss   :     336  CpuSlab:       0   Tracking      : On   Lalig:    6720
 Align  :       8  Objects:      20   Tracing       : Off  Lpadd:     704

Calling shrink now on the cache does nothing because object migration is not enabled (output omitted). If we enable object migration then shrink the cache we expect the object from the second slab to be moved to the first slot in the first slab and the second slab to be removed from the partial list.
 # echo 'enable' > callfn
 # slabinfo smo_test --shrink
 # slabinfo smo_test --report

 Slabcache: smo_test             Aliases:  0 Order :  1 Objects: 20
 ** Defragmentation at 30%

 Sizes (bytes)     Slabs              Debug                Memory
 ------------------------------------------------------------------------
 Object :      56  Total  :       1   Sanity Checks : On   Total:    8192
 SlabObj:     392  Full   :       1
[RFC PATCH v5 06/16] tools/vm/slabinfo: Add defrag_used_ratio output
Add output for the newly added defrag_used_ratio sysfs knob.

Signed-off-by: Tobin C. Harding
---
 tools/vm/slabinfo.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c
index d2c22f9ee2d8..ef4ff93df4cc 100644
--- a/tools/vm/slabinfo.c
+++ b/tools/vm/slabinfo.c
@@ -34,6 +34,7 @@ struct slabinfo {
 	unsigned int sanity_checks, slab_size, store_user, trace;
 	int order, poison, reclaim_account, red_zone;
 	int movable, ctor;
+	int defrag_used_ratio;
 	int remote_node_defrag_ratio;
 	unsigned long partial, objects, slabs, objects_partial, objects_total;
 	unsigned long alloc_fastpath, alloc_slowpath;
@@ -549,6 +550,8 @@ static void report(struct slabinfo *s)
 		printf("** Slabs are destroyed via RCU\n");
 	if (s->reclaim_account)
 		printf("** Reclaim accounting active\n");
+	if (s->movable)
+		printf("** Defragmentation at %d%%\n", s->defrag_used_ratio);
 
 	printf("\nSizes (bytes)     Slabs              Debug                Memory\n");
 	printf("------------------------------------------------------------------------\n");
@@ -1279,6 +1282,7 @@ static void read_slab_dir(void)
 		slab->deactivate_bypass = get_obj("deactivate_bypass");
 		slab->remote_node_defrag_ratio =
 			get_obj("remote_node_defrag_ratio");
+		slab->defrag_used_ratio = get_obj("defrag_used_ratio");
 		chdir("..");
 		if (read_slab_obj(slab, "ops")) {
 			if (strstr(buffer, "ctor :"))
-- 
2.21.0
[RFC PATCH v5 09/16] lib: Separate radix_tree_node and xa_node slab cache
Earlier, Slab Movable Objects (SMO) was implemented. The XArray is now able to take advantage of SMO in order to make xarray nodes movable (when using the SLUB allocator). Currently the radix tree uses the same slab cache as the XArray. Only XArray nodes are movable _not_ radix tree nodes. We can give the radix tree its own slab cache to overcome this. In preparation for implementing XArray object migration (xa_node objects) via Slab Movable Objects add a slab cache solely for XArray nodes and make the XArray use this slab cache instead of the radix_tree_node slab cache. Cc: Matthew Wilcox Signed-off-by: Tobin C. Harding --- include/linux/xarray.h | 3 +++ init/main.c| 2 ++ lib/radix-tree.c | 2 +- lib/xarray.c | 48 ++ 4 files changed, 45 insertions(+), 10 deletions(-) diff --git a/include/linux/xarray.h b/include/linux/xarray.h index 0e01e6129145..773f91f8e1db 100644 --- a/include/linux/xarray.h +++ b/include/linux/xarray.h @@ -42,6 +42,9 @@ #define BITS_PER_XA_VALUE (BITS_PER_LONG - 1) +/* Called from init/main.c */ +void xarray_slabcache_init(void); + /** * xa_mk_value() - Create an XArray entry from an integer. * @v: Value to store in XArray. diff --git a/init/main.c b/init/main.c index 5a2c69b4d7b3..e89915ffbe26 100644 --- a/init/main.c +++ b/init/main.c @@ -106,6 +106,7 @@ static int kernel_init(void *); extern void init_IRQ(void); extern void radix_tree_init(void); +extern void xarray_slabcache_init(void); /* * Debug helper: via this flag we know that we are in 'early bootup code' @@ -621,6 +622,7 @@ asmlinkage __visible void __init start_kernel(void) "Interrupts were enabled *very* early, fixing it\n")) local_irq_disable(); radix_tree_init(); + xarray_slabcache_init(); /* * Set up housekeeping before setting up workqueues to allow the unbound diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 14d51548bea6..edbfb530ba73 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -44,7 +44,7 @@ /* * Radix tree node cache. 
*/ -struct kmem_cache *radix_tree_node_cachep; +static struct kmem_cache *radix_tree_node_cachep; /* * The radix tree is variable-height, so an insert operation not only has diff --git a/lib/xarray.c b/lib/xarray.c index 6be3acbb861f..a528a5277c9d 100644 --- a/lib/xarray.c +++ b/lib/xarray.c @@ -27,6 +27,8 @@ * @entry refers to something stored in a slot in the xarray */ +static struct kmem_cache *xa_node_cachep; + static inline unsigned int xa_lock_type(const struct xarray *xa) { return (__force unsigned int)xa->xa_flags & 3; @@ -244,9 +246,21 @@ void *xas_load(struct xa_state *xas) } EXPORT_SYMBOL_GPL(xas_load); -/* Move the radix tree node cache here */ -extern struct kmem_cache *radix_tree_node_cachep; -extern void radix_tree_node_rcu_free(struct rcu_head *head); +void xa_node_rcu_free(struct rcu_head *head) +{ + struct xa_node *node = container_of(head, struct xa_node, rcu_head); + + /* +* Must only free zeroed nodes into the slab. We can be left with +* non-NULL entries by radix_tree_free_nodes, so clear the entries +* and tags here. 
+*/ + memset(node->slots, 0, sizeof(node->slots)); + memset(node->tags, 0, sizeof(node->tags)); + INIT_LIST_HEAD(>private_list); + + kmem_cache_free(xa_node_cachep, node); +} #define XA_RCU_FREE((struct xarray *)1) @@ -254,7 +268,7 @@ static void xa_node_free(struct xa_node *node) { XA_NODE_BUG_ON(node, !list_empty(>private_list)); node->array = XA_RCU_FREE; - call_rcu(>rcu_head, radix_tree_node_rcu_free); + call_rcu(>rcu_head, xa_node_rcu_free); } /* @@ -270,7 +284,7 @@ static void xas_destroy(struct xa_state *xas) if (!node) return; XA_NODE_BUG_ON(node, !list_empty(>private_list)); - kmem_cache_free(radix_tree_node_cachep, node); + kmem_cache_free(xa_node_cachep, node); xas->xa_alloc = NULL; } @@ -298,7 +312,7 @@ bool xas_nomem(struct xa_state *xas, gfp_t gfp) xas_destroy(xas); return false; } - xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp); + xas->xa_alloc = kmem_cache_alloc(xa_node_cachep, gfp); if (!xas->xa_alloc) return false; XA_NODE_BUG_ON(xas->xa_alloc, !list_empty(>xa_alloc->private_list)); @@ -327,10 +341,10 @@ static bool __xas_nomem(struct xa_state *xas, gfp_t gfp) } if (gfpflags_allow_blocking(gfp)) { xas_unlock_type(xas, lock_type); - xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp); + xas->xa_alloc = kmem_cache_alloc(xa_node_cachep, gfp); xas_lock_type(xas, lock_type); } else { - xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp); +
[RFC PATCH v5 10/16] xarray: Implement migration function for xa_node objects
Recently Slab Movable Objects (SMO) was implemented for the SLUB allocator. The XArray can take advantage of this and make the xa_node slab cache objects movable. Implement functions to migrate objects and activate SMO when we initialise the XArray slab cache. This is based on initial code by Matthew Wilcox and was modified to work with slab object migration. Cc: Matthew Wilcox Co-developed-by: Christoph Lameter Signed-off-by: Tobin C. Harding --- lib/xarray.c | 61 1 file changed, 61 insertions(+) diff --git a/lib/xarray.c b/lib/xarray.c index a528a5277c9d..c6b077f59e88 100644 --- a/lib/xarray.c +++ b/lib/xarray.c @@ -1993,12 +1993,73 @@ static void xa_node_ctor(void *arg) INIT_LIST_HEAD(>private_list); } +static void xa_object_migrate(struct xa_node *node, int numa_node) +{ + struct xarray *xa = READ_ONCE(node->array); + void __rcu **slot; + struct xa_node *new_node; + int i; + + /* Freed or not yet in tree then skip */ + if (!xa || xa == XA_RCU_FREE) + return; + + new_node = kmem_cache_alloc_node(xa_node_cachep, GFP_KERNEL, numa_node); + if (!new_node) { + pr_err("%s: slab cache allocation failed\n", __func__); + return; + } + + xa_lock_irq(xa); + + /* Check again. 
*/ + if (xa != node->array) { + node = new_node; + goto unlock; + } + + memcpy(new_node, node, sizeof(struct xa_node)); + + if (list_empty(>private_list)) + INIT_LIST_HEAD(_node->private_list); + else + list_replace(>private_list, _node->private_list); + + for (i = 0; i < XA_CHUNK_SIZE; i++) { + void *x = xa_entry_locked(xa, new_node, i); + + if (xa_is_node(x)) + rcu_assign_pointer(xa_to_node(x)->parent, new_node); + } + if (!new_node->parent) + slot = >xa_head; + else + slot = _parent_locked(xa, new_node)->slots[new_node->offset]; + rcu_assign_pointer(*slot, xa_mk_node(new_node)); + +unlock: + xa_unlock_irq(xa); + xa_node_free(node); + rcu_barrier(); +} + +static void xa_migrate(struct kmem_cache *s, void **objects, int nr, + int node, void *_unused) +{ + int i; + + for (i = 0; i < nr; i++) + xa_object_migrate(objects[i], node); +} + + void __init xarray_slabcache_init(void) { xa_node_cachep = kmem_cache_create("xarray_node", sizeof(struct xa_node), 0, SLAB_PANIC | SLAB_RECLAIM_ACCOUNT, xa_node_ctor); + kmem_cache_setup_mobility(xa_node_cachep, NULL, xa_migrate); } #ifdef XA_DEBUG -- 2.21.0
[RFC PATCH v5 03/16] slub: Sort slab cache list
It is advantageous to have all defragmentable slabs together at the beginning of the list of slabs so that there is no need to scan the complete list. Put defragmentable caches first when adding a slab cache and others last.

Co-developed-by: Christoph Lameter
Signed-off-by: Tobin C. Harding
---
 mm/slab_common.c | 2 +-
 mm/slub.c        | 6 ++++++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 58251ba63e4a..db5e9a0b1535 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -393,7 +393,7 @@ static struct kmem_cache *create_cache(const char *name,
 		goto out_free_cache;
 
 	s->refcount = 1;
-	list_add(&s->list, &slab_caches);
+	list_add_tail(&s->list, &slab_caches);
 	memcg_link_cache(s);
 out:
 	if (err)
diff --git a/mm/slub.c b/mm/slub.c
index 1c380a2bc78a..66d474397c0f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4333,6 +4333,8 @@ void kmem_cache_setup_mobility(struct kmem_cache *s,
 		return;
 	}
 
+	mutex_lock(&slab_mutex);
+
 	s->isolate = isolate;
 	s->migrate = migrate;
 
@@ -4341,6 +4343,10 @@ void kmem_cache_setup_mobility(struct kmem_cache *s,
 	 * to disable fast cmpxchg based processing.
 	 */
 	s->flags &= ~__CMPXCHG_DOUBLE;
+
+	list_move(&s->list, &slab_caches);	/* Move to top */
+
+	mutex_unlock(&slab_mutex);
 }
 EXPORT_SYMBOL(kmem_cache_setup_mobility);
-- 
2.21.0
[RFC PATCH v5 04/16] slub: Slab defrag core
Internal fragmentation can occur within pages used by the slub allocator. Under some workloads large numbers of pages can be used by partial slab pages. This under-utilisation is bad simply because it wastes memory but also because if the system is under memory pressure higher order allocations may become difficult to satisfy. If we can defrag slab caches we can alleviate these problems. Implement Slab Movable Objects in order to defragment slab caches. Slab defragmentation may occur: 1. Unconditionally when __kmem_cache_shrink() is called on a slab cache by the kernel calling kmem_cache_shrink(). 2. Unconditionally through the use of the slabinfo command. slabinfo -s 3. Conditionally via the use of kmem_cache_defrag() - Use Slab Movable Objects when shrinking cache. Currently when the kernel calls kmem_cache_shrink() we curate the partial slabs list. If object migration is not enabled for the cache we still do this, if however, SMO is enabled we attempt to move objects in partially full slabs in order to defragment the cache. Shrink attempts to move all objects in order to reduce the cache to a single partial slab for each node. - Add conditional per node defrag via new function: kmem_defrag_slabs(int node). kmem_defrag_slabs() attempts to defragment all slab caches for node. Defragmentation is done conditionally dependent on MAX_PARTIAL _and_ defrag_used_ratio. Caches are only considered for defragmentation if the number of partial slabs exceeds MAX_PARTIAL (per node). Also, defragmentation only occurs if the usage ratio of the slab is lower than the configured percentage (sysfs field added in this patch). Fragmentation ratios are measured by calculating the percentage of objects in use compared to the total number of objects that the slab page can accommodate. The scanning of slab caches is optimized because the defragmentable slabs come first on the list. Thus we can terminate scans on the first slab encountered that does not support defragmentation. 
kmem_defrag_slabs() takes a node parameter. This can either be -1 if defragmentation should be performed on all nodes, or a node number. Defragmentation may be disabled by setting defrag ratio to 0 echo 0 > /sys/kernel/slab//defrag_used_ratio - Add a defrag ratio sysfs field and set it to 30% by default. A limit of 30% specifies that more than 3 out of 10 available slots for objects need to be in use otherwise slab defragmentation will be attempted on the remaining objects. In order for a cache to be defragmentable the cache must support object migration (SMO). Enabling SMO for a cache is done via a call to the recently added function: void kmem_cache_setup_mobility(struct kmem_cache *, kmem_cache_isolate_func, kmem_cache_migrate_func); Co-developed-by: Christoph Lameter Signed-off-by: Tobin C. Harding --- Documentation/ABI/testing/sysfs-kernel-slab | 14 + include/linux/slab.h| 1 + include/linux/slub_def.h| 7 + mm/slub.c | 385 4 files changed, 334 insertions(+), 73 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab index 29601d93a1c2..c6f129af035a 100644 --- a/Documentation/ABI/testing/sysfs-kernel-slab +++ b/Documentation/ABI/testing/sysfs-kernel-slab @@ -180,6 +180,20 @@ Description: list. It can be written to clear the current count. Available when CONFIG_SLUB_STATS is enabled. +What: /sys/kernel/slab/cache/defrag_used_ratio +Date: May 2019 +KernelVersion: 5.2 +Contact: Christoph Lameter + Pekka Enberg , +Description: + The defrag_used_ratio file allows the control of how aggressive + slab fragmentation reduction works at reclaiming objects from + sparsely populated slabs. This is a percentage. If a slab has + less than this percentage of objects allocated then reclaim will + attempt to reclaim objects so that the whole slab page can be + freed. 0% specifies no reclaim attempt (defrag disabled), 100% + specifies attempt to reclaim all pages. The default is 30%. 
+ What: /sys/kernel/slab/cache/deactivate_to_tail Date: February 2008 KernelVersion: 2.6.25 diff --git a/include/linux/slab.h b/include/linux/slab.h index 886fc130334d..4bf381b34829 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -149,6 +149,7 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name, void (*ctor)(void *)); void kmem_cache_destroy(struct kmem_cache *); int kmem_cache_shrink(struct kmem_cache *); +unsigned long kmem_defrag_slabs(int node); void memcg_create_kmem_cache(struct
[RFC PATCH v5 02/16] tools/vm/slabinfo: Add support for -C and -M options
-C lists caches that use a ctor. -M lists caches that support object migration. Add command line options to show caches with a constructor and caches that are movable (i.e. have migrate function). Co-developed-by: Christoph Lameter Signed-off-by: Tobin C. Harding --- tools/vm/slabinfo.c | 40 1 file changed, 36 insertions(+), 4 deletions(-) diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c index 73818f1b2ef8..cbfc56c44c2f 100644 --- a/tools/vm/slabinfo.c +++ b/tools/vm/slabinfo.c @@ -33,6 +33,7 @@ struct slabinfo { unsigned int hwcache_align, object_size, objs_per_slab; unsigned int sanity_checks, slab_size, store_user, trace; int order, poison, reclaim_account, red_zone; + int movable, ctor; unsigned long partial, objects, slabs, objects_partial, objects_total; unsigned long alloc_fastpath, alloc_slowpath; unsigned long free_fastpath, free_slowpath; @@ -67,6 +68,8 @@ int show_report; int show_alias; int show_slab; int skip_zero = 1; +int show_movable; +int show_ctor; int show_numa; int show_track; int show_first_alias; @@ -109,11 +112,13 @@ static void fatal(const char *x, ...) static void usage(void) { - printf("slabinfo 4/15/2011. (c) 2007 sgi/(c) 2011 Linux Foundation.\n\n" - "slabinfo [-aADefhilnosrStTvz1LXBU] [N=K] [-dafzput] [slab-regexp]\n" + printf("slabinfo 4/15/2017. 
(c) 2007 sgi/(c) 2011 Linux Foundation/(c) 2017 Jump Trading LLC.\n\n" + "slabinfo [-aACDefhilMnosrStTvz1LXBU] [N=K] [-dafzput] [slab-regexp]\n" + "-a|--aliases Show aliases\n" "-A|--activity Most active slabs first\n" "-B|--Bytes Show size in bytes\n" + "-C|--ctor Show slabs with ctors\n" "-D|--display-activeSwitch line format to activity\n" "-e|--empty Show empty slabs\n" "-f|--first-alias Show first alias\n" @@ -121,6 +126,7 @@ static void usage(void) "-i|--inverted Inverted list\n" "-l|--slabs Show slabs\n" "-L|--Loss Sort by loss\n" + "-M|--movable Show caches that support movable objects\n" "-n|--numa Show NUMA information\n" "-N|--lines=K Show the first K slabs\n" "-o|--ops Show kmem_cache_ops\n" @@ -588,6 +594,12 @@ static void slabcache(struct slabinfo *s) if (show_empty && s->slabs) return; + if (show_ctor && !s->ctor) + return; + + if (show_movable && !s->movable) + return; + if (sort_loss == 0) store_size(size_str, slab_size(s)); else @@ -602,6 +614,10 @@ static void slabcache(struct slabinfo *s) *p++ = '*'; if (s->cache_dma) *p++ = 'd'; + if (s->ctor) + *p++ = 'C'; + if (s->movable) + *p++ = 'M'; if (s->hwcache_align) *p++ = 'A'; if (s->poison) @@ -636,7 +652,8 @@ static void slabcache(struct slabinfo *s) printf("%-21s %8ld %7d %15s %14s %4d %1d %3ld %3ld %s\n", s->name, s->objects, s->object_size, size_str, dist_str, s->objs_per_slab, s->order, - s->slabs ? (s->partial * 100) / s->slabs : 100, + s->slabs ? (s->partial * 100) / + (s->slabs * s->objs_per_slab) : 100, s->slabs ? 
(s->objects * s->object_size * 100) / (s->slabs * (page_size << s->order)) : 100, flags); @@ -1256,6 +1273,13 @@ static void read_slab_dir(void) slab->alloc_node_mismatch = get_obj("alloc_node_mismatch"); slab->deactivate_bypass = get_obj("deactivate_bypass"); chdir(".."); + if (read_slab_obj(slab, "ops")) { + if (strstr(buffer, "ctor :")) + slab->ctor = 1; + if (strstr(buffer, "migrate :")) + slab->movable = 1; + } + if (slab->name[0] == ':') alias_targets++; slab++; @@ -1332,6 +1356,8 @@ static void xtotals(void) } struct option opts[] = { + { "ctor", no_argument, NULL, 'C' }, + { "movable", no_argument, NULL, 'M' }, { "aliases", no_argument, NULL, 'a' }, { "activity", no_argument, NULL, 'A' }, { "debug", optional_argument, NULL, 'd' }, @@ -1367,7 +1393,7 @@ int main(int argc, char *argv[]) page_size = getpagesize(); -
[RFC PATCH v5 00/16] Slab Movable Objects (SMO)
Hi,

Another iteration of the SMO patch set, updates to this version are restricted to the XArray patches (#9 and #10, tested with the module implemented in #11). Applies on top of Linus' tree (tag: v5.2-rc1).

This is a patch set implementing movable objects within the SLUB allocator. This is work based on Christopher Lameter's patch set:

 https://lore.kernel.org/patchwork/project/lkml/list/?series=377335

The original code logic is from that set and implemented by Christopher. Clean up, refactoring, documentation, and additional features by myself. Responsibility for any bugs remaining falls solely with myself.

I am intending on sending a non-RFC version soon after this one (if the XArray stuff is ok). If anyone has any objections to SMO in general please yell at me now.

Changes to this version:

Patch XArray to use a separate slab cache. Currently the radix tree and XArray use the same slab cache. Radix tree nodes can not be moved but XArray nodes can.

Matthew, does this fit in ok with your plans for the XArray and radix tree? I don't really like the function names used here or the init function name (xarray_slabcache_init()). If there is a better way to do this please mercilessly correct me :)

Thanks for looking at this,
Tobin.

Tobin C. Harding (16):
  slub: Add isolate() and migrate() methods
  tools/vm/slabinfo: Add support for -C and -M options
  slub: Sort slab cache list
  slub: Slab defrag core
  tools/vm/slabinfo: Add remote node defrag ratio output
  tools/vm/slabinfo: Add defrag_used_ratio output
  tools/testing/slab: Add object migration test module
  tools/testing/slab: Add object migration test suite
  lib: Separate radix_tree_node and xa_node slab cache
  xarray: Implement migration function for xa_node objects
  tools/testing/slab: Add XArray movable objects tests
  slub: Enable moving objects to/from specific nodes
  slub: Enable balancing slabs across nodes
  dcache: Provide a dentry constructor
  dcache: Implement partial shrink via Slab Movable Objects
  dcache: Add CONFIG_DCACHE_SMO

 Documentation/ABI/testing/sysfs-kernel-slab |  14 +
 fs/dcache.c                                 | 110 ++-
 include/linux/slab.h                        |  71 ++
 include/linux/slub_def.h                    |  10 +
 include/linux/xarray.h                      |   3 +
 init/main.c                                 |   2 +
 lib/radix-tree.c                            |   2 +-
 lib/xarray.c                                | 109 ++-
 mm/Kconfig                                  |  14 +
 mm/slab_common.c                            |   2 +-
 mm/slub.c                                   | 819 ++++++++++++++++++--
 tools/testing/slab/Makefile                 |  10 +
 tools/testing/slab/slub_defrag.c            | 567 ++++++++++++++
 tools/testing/slab/slub_defrag.py           | 451 +++++++++++
 tools/testing/slab/slub_defrag_xarray.c     | 211 +++++
 tools/vm/slabinfo.c                         |  51 +-
 16 files changed, 2343 insertions(+), 103 deletions(-)
 create mode 100644 tools/testing/slab/Makefile
 create mode 100644 tools/testing/slab/slub_defrag.c
 create mode 100755 tools/testing/slab/slub_defrag.py
 create mode 100644 tools/testing/slab/slub_defrag_xarray.c

-- 
2.21.0
[RFC PATCH v5 01/16] slub: Add isolate() and migrate() methods
Add the two methods needed for moving objects and enable the display of the callbacks via the /sys/kernel/slab interface. Add documentation explaining the use of these methods and the prototypes for slab.h. Add functions to setup the callbacks method for a slab cache. Add empty functions for SLAB/SLOB. The API is generic so it could be theoretically implemented for these allocators as well. Change sysfs 'ctor' field to be 'ops' to contain all the callback operations defined for a slab cache. Display the existing 'ctor' callback in the ops fields contents along with 'isolate' and 'migrate' callbacks. Co-developed-by: Christoph Lameter Signed-off-by: Tobin C. Harding --- include/linux/slab.h | 70 include/linux/slub_def.h | 3 ++ mm/slub.c| 59 + 3 files changed, 126 insertions(+), 6 deletions(-) diff --git a/include/linux/slab.h b/include/linux/slab.h index 9449b19c5f10..886fc130334d 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -154,6 +154,76 @@ void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *); void memcg_deactivate_kmem_caches(struct mem_cgroup *); void memcg_destroy_kmem_caches(struct mem_cgroup *); +/* + * Function prototypes passed to kmem_cache_setup_mobility() to enable + * mobile objects and targeted reclaim in slab caches. + */ + +/** + * typedef kmem_cache_isolate_func - Object migration callback function. + * @s: The cache we are working on. + * @ptr: Pointer to an array of pointers to the objects to isolate. + * @nr: Number of objects in @ptr array. + * + * The purpose of kmem_cache_isolate_func() is to pin each object so that + * they cannot be freed until kmem_cache_migrate_func() has processed + * them. This may be accomplished by increasing the refcount or setting + * a flag. + * + * The object pointer array passed is also passed to + * kmem_cache_migrate_func(). The function may remove objects from the + * array by setting pointers to %NULL. 
This is useful if we can + * determine that an object is being freed because + * kmem_cache_isolate_func() was called when the subsystem was calling + * kmem_cache_free(). In that case it is not necessary to increase the + * refcount or specially mark the object because the release of the slab + * lock will lead to the immediate freeing of the object. + * + * Context: Called with locks held so that the slab objects cannot be + * freed. We are in an atomic context and no slab operations + * may be performed. + * Return: A pointer that is passed to the migrate function. If any + * objects cannot be touched at this point then the pointer may + * indicate a failure and then the migration function can simply + * remove the references that were already obtained. The private + * data could be used to track the objects that were already pinned. + */ +typedef void *kmem_cache_isolate_func(struct kmem_cache *s, void **ptr, int nr); + +/** + * typedef kmem_cache_migrate_func - Object migration callback function. + * @s: The cache we are working on. + * @ptr: Pointer to an array of pointers to the objects to migrate. + * @nr: Number of objects in @ptr array. + * @node: The NUMA node where the object should be allocated. + * @private: The pointer returned by kmem_cache_isolate_func(). + * + * This function is responsible for migrating objects. Typically, for + * each object in the input array you will want to allocate an new + * object, copy the original object, update any pointers, and free the + * old object. + * + * After this function returns all pointers to the old object should now + * point to the new object. + * + * Context: Called with no locks held and interrupts enabled. Sleeping + * is possible. Any operation may be performed. + */ +typedef void kmem_cache_migrate_func(struct kmem_cache *s, void **ptr, +int nr, int node, void *private); + +/* + * kmem_cache_setup_mobility() is used to setup callbacks for a slab cache. 
+ */ +#ifdef CONFIG_SLUB +void kmem_cache_setup_mobility(struct kmem_cache *, kmem_cache_isolate_func, + kmem_cache_migrate_func); +#else +static inline void +kmem_cache_setup_mobility(struct kmem_cache *s, kmem_cache_isolate_func isolate, + kmem_cache_migrate_func migrate) {} +#endif + /* * Please use this macro to create slab caches. Simply specify the * name of the structure and maybe some flags that are listed above. diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h index d2153789bd9f..2879a2f5f8eb 100644 --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -99,6 +99,9 @@ struct kmem_cache { gfp_t allocflags; /* gfp flags to use on each alloc */ int refcount; /* Refcount for slab cache destroy */ void (*ctor)(void *); + kmem_cache_isolate_func *isolate; +
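The isolate/migrate pair described in the kernel-doc above is a two-phase protocol: first pin every object under the slab lock, then move them with no locks held. The kernel API (`kmem_cache_setup_mobility()` and the SLUB internals) cannot run outside the kernel, so the sketch below is a user-space model of just the protocol; `struct obj`, `obj_isolate()` and `obj_migrate()` are illustrative names, not part of the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* User-space model of the two-phase isolate/migrate protocol. */
struct obj {
	int refcnt;	/* stands in for a real refcount or object flag */
	int data;
};

/* Phase 1: pin every object so it cannot be freed during migration.
 * Called with "locks held"; must not sleep or allocate in the real API. */
static void *obj_isolate(struct obj **ptr, int nr)
{
	for (int i = 0; i < nr; i++)
		ptr[i]->refcnt++;
	return NULL;	/* no private state needed in this simple model */
}

/* Phase 2: allocate replacements, copy, drop the pin, free originals.
 * Called with no locks held; sleeping allocation is allowed here. */
static void obj_migrate(struct obj **ptr, int nr, void *private)
{
	(void)private;
	for (int i = 0; i < nr; i++) {
		struct obj *new = malloc(sizeof(*new));
		*new = *ptr[i];
		new->refcnt--;	/* release the isolation pin */
		free(ptr[i]);
		ptr[i] = new;	/* all pointers now reach the new object */
	}
}
```

The model mirrors the documented contract: anything pinned in phase 1 must be unpinned in phase 2, and the pointer array is the shared hand-off between the two callbacks.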
[PATCH v1 5/9] perf diff: Use hists to manage basic blocks
The function hist__account_cycles() can account cycles per basic block. The basic block information are saved in a per-symbol cycles_hist structure. This patch processes each symbol, get basic blocks from cycles_hist and add the basic block entry to a hists. Using a hists is because we need to compare, sort and print the basic blocks. Signed-off-by: Jin Yao --- tools/perf/builtin-diff.c | 145 ++ 1 file changed, 145 insertions(+) diff --git a/tools/perf/builtin-diff.c b/tools/perf/builtin-diff.c index e067ac6..09551fe 100644 --- a/tools/perf/builtin-diff.c +++ b/tools/perf/builtin-diff.c @@ -20,6 +20,7 @@ #include "util/data.h" #include "util/config.h" #include "util/time-utils.h" +#include "util/annotate.h" #include #include @@ -30,6 +31,9 @@ struct block_hists { struct histssym_hists; struct perf_hpp_listsym_list; struct perf_hpp_fmt sym_fmt; + struct histshists; + struct perf_hpp_listlist; + struct perf_hpp_fmt fmt; }; struct perf_diff { @@ -73,6 +77,8 @@ struct data__file { struct block_hists block_hists; }; +static struct addr_location dummy_al; + static struct data__file *data__files; static int data__files_cnt; @@ -378,6 +384,7 @@ static int diff__process_sample_event(struct perf_tool *tool, pr_warning("problem counting symbol for basic block\n"); goto out_put; } + hist__account_cycles(sample->branch_stack, , sample, false); } else { if (!hists__add_entry(hists, , NULL, NULL, NULL, sample, true)) { @@ -938,13 +945,149 @@ static int block_sym_hists_init(struct block_hists *hists) return 0; } +static int64_t block_cmp(struct perf_hpp_fmt *fmt __maybe_unused, +struct hist_entry *left, struct hist_entry *right) +{ + struct block_info *bi_l = left->block_info; + struct block_info *bi_r = right->block_info; + int cmp; + + if (!bi_l->sym || !bi_r->sym) { + if (!bi_l->sym && !bi_r->sym) + return 0; + else if (!bi_l->sym) + return -1; + else + return 1; + } + + if (bi_l->sym == bi_r->sym) { + if (bi_l->start == bi_r->start) { + if (bi_l->end == bi_r->end) + return 0; 
+ else + return (int64_t)(bi_r->end - bi_l->end); + } else + return (int64_t)(bi_r->start - bi_l->start); + } else { + cmp = strcmp(bi_l->sym->name, bi_r->sym->name); + return cmp; + } + + if (bi_l->sym->start != bi_r->sym->start) + return (int64_t)(bi_r->sym->start - bi_l->sym->start); + + return (int64_t)(bi_r->sym->end - bi_l->sym->end); +} + +static int block_hists_init(struct block_hists *hists) +{ + struct perf_hpp_fmt *fmt; + + __hists__init(>hists, >list); + perf_hpp_list__init(>list); + fmt = >fmt; + INIT_LIST_HEAD(>list); + INIT_LIST_HEAD(>sort_list); + + fmt->cmp = block_cmp; + + perf_hpp_list__register_sort_field(>list, fmt); + return 0; +} + +static void *block_he_zalloc(size_t size) +{ + return zalloc(size + sizeof(struct hist_entry)); +} + +static void block_he_free(void *he) +{ + struct block_info *bi = ((struct hist_entry *)he)->block_info; + + block_info__put(bi); + free(he); +} + +struct hist_entry_ops block_he_ops = { + .new= block_he_zalloc, + .free = block_he_free, +}; + +static void init_block_info(struct block_info *bi, struct symbol *sym, + struct cyc_hist *ch, int offset) +{ + bi->sym = sym; + bi->start = ch->start; + bi->end = offset; + bi->cycles = ch->cycles; + bi->cycles_aggr = ch->cycles_aggr; + bi->num = ch->num; + bi->num_aggr = ch->num_aggr; +} + +static int process_block_per_sym(struct data__file *d) +{ + struct hists *hists = >block_hists.sym_hists; + struct rb_root_cached *root = hists->entries_in; + struct rb_node *next = rb_first_cached(root); + struct annotation *notes; + struct cyc_hist *ch; + unsigned int i; + struct block_info *bi; + struct hist_entry *he, *he_block; + + while (next != NULL) { + he = rb_entry(next, struct hist_entry, rb_node_in); + next = rb_next(>rb_node_in); + + if (!he->ms.map || !he->ms.sym) + continue; + + notes = symbol__annotation(he->ms.sym); + if (!notes || !notes->src || !notes->src->cycles_hist) + continue; + + ch = notes->src->cycles_hist; +
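The `block_cmp()` in the patch keys basic blocks hierarchically on (symbol, start, end); note that every branch of its if/else chain returns, so the trailing `sym->start`/`sym->end` comparisons are unreachable. A minimal user-space model of the intended ordering, with an illustrative `struct block` standing in for `struct block_info`:

```c
#include <assert.h>
#include <string.h>
#include <stdint.h>

/* Illustrative stand-in for struct block_info: only the keyed fields. */
struct block {
	const char *sym;	/* function name, may be NULL */
	uint64_t start, end;	/* block boundaries within the function */
};

/* Hierarchical comparison modeled on block_cmp(): NULL symbols sort
 * first, then symbol name, then start address, then end address. */
static int64_t block_cmp(const struct block *l, const struct block *r)
{
	if (!l->sym || !r->sym)
		return (l->sym ? 1 : 0) - (r->sym ? 1 : 0);

	int cmp = strcmp(l->sym, r->sym);
	if (cmp)
		return cmp;
	if (l->start != r->start)
		return (int64_t)(r->start - l->start);
	return (int64_t)(r->end - l->end);
}
```

Two blocks compare equal only when function, start and end all match, which is exactly the identity the later matching patch in this series relies on.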
[PATCH v1 4/9] perf diff: Get a list of symbols(functions)
We already have a function hist__account_cycles() which can be used to account cycles per basic block in symbol/function. But we also need to know what the symbols are, since we need to get basic blocks of all symbols(functions) before diff. This patch records the sorted symbols in sym_hists, which will be used in next patch for accounting cycles per basic block per function. Signed-off-by: Jin Yao --- tools/perf/builtin-diff.c | 81 ++- 1 file changed, 73 insertions(+), 8 deletions(-) diff --git a/tools/perf/builtin-diff.c b/tools/perf/builtin-diff.c index 6023f8c..e067ac6 100644 --- a/tools/perf/builtin-diff.c +++ b/tools/perf/builtin-diff.c @@ -26,6 +26,12 @@ #include #include +struct block_hists { + struct histssym_hists; + struct perf_hpp_listsym_list; + struct perf_hpp_fmt sym_fmt; +}; + struct perf_diff { struct perf_tool tool; const char *time_str; @@ -33,6 +39,8 @@ struct perf_diff { int range_size; int range_num; bool has_br_stack; + bool basic_block; + struct block_hists *block_hists; }; /* Diff command specific HPP columns. 
*/ @@ -62,6 +70,7 @@ struct data__file { int idx; struct hists*hists; struct diff_hpp_fmt fmt[PERF_HPP_DIFF__MAX_INDEX]; + struct block_hists block_hists; }; static struct data__file *data__files; @@ -363,9 +372,18 @@ static int diff__process_sample_event(struct perf_tool *tool, goto out_put; } - if (!hists__add_entry(hists, , NULL, NULL, NULL, sample, true)) { - pr_warning("problem incrementing symbol period, skipping event\n"); - goto out_put; + if (pdiff->has_br_stack && pdiff->basic_block) { + if (!hists__add_entry(>block_hists->sym_hists, , + NULL, NULL, NULL, sample, false)) { + pr_warning("problem counting symbol for basic block\n"); + goto out_put; + } + } else { + if (!hists__add_entry(hists, , NULL, NULL, NULL, + sample, true)) { + pr_warning("problem incrementing symbol period, skipping event\n"); + goto out_put; + } } /* @@ -899,6 +917,37 @@ static int check_file_brstack(void) return 0; } +static int64_t symbol_se_cmp(struct perf_hpp_fmt *fmt __maybe_unused, +struct hist_entry *a, struct hist_entry *b) +{ + return sort_sym.se_cmp(a, b); +} + +static int block_sym_hists_init(struct block_hists *hists) +{ + struct perf_hpp_fmt *fmt; + + __hists__init(>sym_hists, >sym_list); + perf_hpp_list__init(>sym_list); + fmt = >sym_fmt; + INIT_LIST_HEAD(>list); + INIT_LIST_HEAD(>sort_list); + + fmt->cmp = symbol_se_cmp; + perf_hpp_list__register_sort_field(>sym_list, fmt); + return 0; +} + +static void basic_block_process(void) +{ + struct data__file *d; + int i; + + data__for_each_file(i, d) { + hists__delete_entries(>block_hists.sym_hists); + } +} + static int __cmd_diff(void) { struct data__file *d; @@ -937,6 +986,16 @@ static int __cmd_diff(void) goto out_delete; } + if (pdiff.has_br_stack && pdiff.basic_block) { + ret = block_sym_hists_init(>block_hists); + if (ret) { + pr_err("Failed to initialize basic block hists\n"); + goto out_delete; + } + + pdiff.block_hists = >block_hists; + } + ret = perf_session__process_events(d->session); if (ret) { pr_err("Failed to 
process %s\n", d->data.path); @@ -949,7 +1008,10 @@ static int __cmd_diff(void) zfree(_range); } - data_process(); + if (pdiff.has_br_stack && pdiff.basic_block) + basic_block_process(); + else + data_process(); out_delete: data__for_each_file(i, d) { @@ -1019,6 +1081,8 @@ static const struct option options[] = { "only consider symbols in these pids"), OPT_STRING(0, "tid", _conf.tid_list_str, "tid[,tid...]", "only consider symbols in these tids"), + OPT_BOOLEAN(0, "basic-block", _block, + "display the differential program basic block"), OPT_END() }; @@ -1517,10 +1581,11 @@ int cmd_diff(int argc, const char **argv) if (data_init(argc, argv)
[PATCH v1 0/9] perf diff: diff cycles at basic block level
In some cases small changes in hot loops can show big differences. But it's difficult to identify these differences. perf diff currently can only diff symbols (functions). We can also expand it to diff cycles of individual programs blocks as reported by timed LBR. This would allow to identify changes in specific code accurately. With this patch set, for example, perf record -b ./div perf record -b ./div perf diff --basic-block # Cycles diff Basic block (start:end) # ... ... # -20 native_write_msr (7fff9a069900:7fff9a06990b) -3 __indirect_thunk_start (7fff9ac02ca0:7fff9ac02ca0) 1 __indirect_thunk_start (7fff9ac02cac:7fff9ac02cb0) 0 rand@plt (490:490) 0 rand (3af60:3af64) 0 rand (3af69:3af6d) 0 main (4e8:4ea) 0 main (4ef:500) 0 main (4ef:535) 0 compute_flag (640:644) 0 compute_flag (649:659) 0 __random_r (3ac40:3ac76) 0 __random_r (3ac40:3ac88) 0 __random_r (3ac90:3ac9c) 0 __random (3aac0:3aad2) 0 __random (3aae0:3aae7) 0 __random (3ab03:3ab0f) 0 __random (3ab14:3ab1b) 0 __random (3ab28:3ab2e) 0 __random (3ab4a:3ab53) The "basic-block" is a new perf-diff option, which enables the displaying of cycles difference of same program basic block amongst two or more perf.data. The program basic block is the code block between two branches in a function. Jin Yao (9): perf util: Create block_info structure perf util: Add block_info in hist_entry perf diff: Check if all data files with branch stacks perf diff: Get a list of symbols(functions) perf diff: Use hists to manage basic blocks perf diff: Link same basic blocks among different data files perf diff: Compute cycles diff of basic blocks perf diff: Print the basic block cycles diff perf diff: Documentation --basic-block option tools/perf/Documentation/perf-diff.txt | 5 + tools/perf/builtin-diff.c | 521 - tools/perf/util/hist.c | 24 +- tools/perf/util/hist.h | 6 + tools/perf/util/sort.h | 4 + tools/perf/util/symbol.c | 22 ++ tools/perf/util/symbol.h | 23 ++ 7 files changed, 594 insertions(+), 11 deletions(-) -- 2.7.4
[PATCH v1 3/9] perf diff: Check if all data files with branch stacks
We will expand perf diff to support diffing cycles of individual program blocks, which requires that all data files have branch stacks. This patch checks HEADER_BRANCH_STACK in the header, and only sets the has_br_stack flag when HEADER_BRANCH_STACK is set in all data files. Signed-off-by: Jin Yao --- tools/perf/builtin-diff.c | 30 ++ 1 file changed, 30 insertions(+) diff --git a/tools/perf/builtin-diff.c b/tools/perf/builtin-diff.c index 6e79207..6023f8c 100644 --- a/tools/perf/builtin-diff.c +++ b/tools/perf/builtin-diff.c @@ -32,6 +32,7 @@ struct perf_diff { struct perf_time_interval *ptime_range; int range_size; int range_num; + bool has_br_stack; }; /* Diff command specific HPP columns. */ @@ -873,12 +874,41 @@ static int parse_time_str(struct data__file *d, char *abstime_ostr, return ret; } +static int check_file_brstack(void) +{ + struct data__file *d; + bool has_br_stack; + int i; + + data__for_each_file(i, d) { + d->session = perf_session__new(>data, false, ); + if (!d->session) { + pr_err("Failed to open %s\n", d->data.path); + return -1; + } + + has_br_stack = perf_header__has_feat(>session->header, +HEADER_BRANCH_STACK); + perf_session__delete(d->session); + if (!has_br_stack) + return 0; + } + + /* Set only if all files have branch stacks */ + pdiff.has_br_stack = true; + return 0; +} + static int __cmd_diff(void) { struct data__file *d; int ret, i; char *abstime_ostr, *abstime_tmp; + ret = check_file_brstack(); + if (ret) + return ret; + ret = abstime_str_dup(_ostr); if (ret) return ret; -- 2.7.4
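The check above is a simple all-or-nothing predicate: basic-block mode is enabled only if every input file recorded branch stacks. A hedged user-space model (the `struct file`/`all_have_brstack` names are illustrative, not the perf ones):

```c
#include <assert.h>
#include <stdbool.h>

/* Model of check_file_brstack(): one plain perf.data disables the mode. */
struct file {
	bool has_branch_stack;	/* HEADER_BRANCH_STACK feature bit */
};

static bool all_have_brstack(const struct file *files, int n)
{
	for (int i = 0; i < n; i++)
		if (!files[i].has_branch_stack)
			return false;	/* bail out on the first file without LBR data */
	return true;		/* only now may has_br_stack be set */
}
```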
[PATCH v1 2/9] perf util: Add block_info in hist_entry
The block_info contains the program basic block information, i.e, contains the start address and the end address of this basic block and how much cycles it takes. We need to compare, sort and even print out the basic block by some orders, i.e. sort by cycles. For this purpose, we add block_info field to hist_entry. In order not to impact current interface, we creates a new function hists__add_entry_block. Signed-off-by: Jin Yao --- tools/perf/util/hist.c | 22 -- tools/perf/util/hist.h | 6 ++ tools/perf/util/sort.h | 1 + 3 files changed, 27 insertions(+), 2 deletions(-) diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c index 7ace7a10..3810460 100644 --- a/tools/perf/util/hist.c +++ b/tools/perf/util/hist.c @@ -574,6 +574,8 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists, */ mem_info__zput(entry->mem_info); + block_info__zput(entry->block_info); + /* If the map of an existing hist_entry has * become out-of-date due to an exec() or * similar, update it. Otherwise we will @@ -645,6 +647,7 @@ __hists__add_entry(struct hists *hists, struct symbol *sym_parent, struct branch_info *bi, struct mem_info *mi, + struct block_info *block_info, struct perf_sample *sample, bool sample_self, struct hist_entry_ops *ops) @@ -677,6 +680,7 @@ __hists__add_entry(struct hists *hists, .hists = hists, .branch_info = bi, .mem_info = mi, + .block_info = block_info, .transaction = sample->transaction, .raw_data = sample->raw_data, .raw_size = sample->raw_size, @@ -699,7 +703,7 @@ struct hist_entry *hists__add_entry(struct hists *hists, struct perf_sample *sample, bool sample_self) { - return __hists__add_entry(hists, al, sym_parent, bi, mi, + return __hists__add_entry(hists, al, sym_parent, bi, mi, NULL, sample, sample_self, NULL); } @@ -712,10 +716,24 @@ struct hist_entry *hists__add_entry_ops(struct hists *hists, struct perf_sample *sample, bool sample_self) { - return __hists__add_entry(hists, al, sym_parent, bi, mi, + return __hists__add_entry(hists, al, 
sym_parent, bi, mi, NULL, sample, sample_self, ops); } +struct hist_entry *hists__add_entry_block(struct hists *hists, + struct hist_entry_ops *ops, + struct addr_location *al, + struct block_info *block_info) +{ + struct hist_entry entry = { + .ops = ops, + .block_info = block_info, + .hists = hists, + }, *he = hists__findnew_entry(hists, , al, false); + + return he; +} + static int iter_next_nop_entry(struct hist_entry_iter *iter __maybe_unused, struct addr_location *al __maybe_unused) diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h index 76ff6c6..c8f7d66 100644 --- a/tools/perf/util/hist.h +++ b/tools/perf/util/hist.h @@ -16,6 +16,7 @@ struct addr_location; struct map_symbol; struct mem_info; struct branch_info; +struct block_info; struct symbol; enum hist_filter { @@ -149,6 +150,11 @@ struct hist_entry *hists__add_entry_ops(struct hists *hists, struct perf_sample *sample, bool sample_self); +struct hist_entry *hists__add_entry_block(struct hists *hists, + struct hist_entry_ops *ops, + struct addr_location *al, + struct block_info *bi); + int hist_entry_iter__add(struct hist_entry_iter *iter, struct addr_location *al, int max_stack_depth, void *arg); diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h index ce376a7..43623fa 100644 --- a/tools/perf/util/sort.h +++ b/tools/perf/util/sort.h @@ -144,6 +144,7 @@ struct hist_entry { longtime; struct hists*hists; struct mem_info *mem_info; + struct block_info *block_info; void*raw_data; u32 raw_size; int num_res; -- 2.7.4
[PATCH v1 1/9] perf util: Create block_info structure
perf diff currently can only diff symbols (functions). We should expand it to diff cycles of individual program blocks as reported by timed LBR. This would allow identifying changes in specific code accurately. We need a new structure to maintain the basic block information, such as the symbol (function), the start/end address of this block, and its cycles. This patch creates this structure along with some ops. Signed-off-by: Jin Yao --- tools/perf/util/symbol.c | 22 ++ tools/perf/util/symbol.h | 23 +++ 2 files changed, 45 insertions(+) diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c index 5cbad55..3a90e72 100644 --- a/tools/perf/util/symbol.c +++ b/tools/perf/util/symbol.c @@ -2262,3 +2262,25 @@ struct mem_info *mem_info__new(void) refcount_set(>refcnt, 1); return mi; } + +struct block_info *block_info__get(struct block_info *bi) +{ + if (bi) + refcount_inc(>refcnt); + return bi; +} + +void block_info__put(struct block_info *bi) +{ + if (bi && refcount_dec_and_test(>refcnt)) + free(bi); +} + +struct block_info *block_info__new(void) +{ + struct block_info *bi = zalloc(sizeof(*bi)); + + if (bi) + refcount_set(>refcnt, 1); + return bi; +} diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h index 9a8fe01..12755b4 100644 --- a/tools/perf/util/symbol.h +++ b/tools/perf/util/symbol.h @@ -131,6 +131,17 @@ struct mem_info { refcount_t refcnt; }; +struct block_info { + struct symbol *sym; + u64 start; + u64 end; + u64 cycles; + u64 cycles_aggr; + int num; + int num_aggr; + refcount_t refcnt; +}; + struct addr_location { struct machine *machine; struct thread *thread; @@ -332,4 +343,16 @@ static inline void __mem_info__zput(struct mem_info **mi) #define mem_info__zput(mi) __mem_info__zput() +struct block_info *block_info__new(void); +struct block_info *block_info__get(struct block_info *bi); +void block_info__put(struct block_info *bi); + +static inline void __block_info__zput(struct block_info **bi) +{ + block_info__put(*bi); + *bi = NULL; +} + +#define
block_info__zput(bi) __block_info__zput() + #endif /* __PERF_SYMBOL */ -- 2.7.4
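The new/get/put/zput lifetime rules above mirror the existing mem_info helpers. A user-space sketch of the same rules — the kernel tree uses `refcount_t`, so the plain `int` counter and the `int` return on put (added only so the sketch can be checked) are illustrative:

```c
#include <assert.h>
#include <stdlib.h>

/* User-space model of the block_info reference counting above. */
struct block_info {
	int refcnt;
};

static struct block_info *block_info__new(void)
{
	struct block_info *bi = calloc(1, sizeof(*bi));
	if (bi)
		bi->refcnt = 1;		/* creator owns one reference */
	return bi;
}

static struct block_info *block_info__get(struct block_info *bi)
{
	if (bi)
		bi->refcnt++;
	return bi;
}

/* Returns 1 when the last reference was dropped (illustration only). */
static int block_info__put(struct block_info *bi)
{
	if (bi && --bi->refcnt == 0) {
		free(bi);
		return 1;
	}
	return 0;
}

/* zput: drop the reference and NULL the caller's pointer, as the macro does. */
static void block_info__zput(struct block_info **bi)
{
	block_info__put(*bi);
	*bi = NULL;
}
```

The zput form matters in the series: hists__findnew_entry() clears `entry->block_info` on the collapse path, so a stale pointer must never survive the put.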
[PATCH v1 9/9] perf diff: Documentation --basic-block option
Document the new option '--basic-block'. Signed-off-by: Jin Yao --- tools/perf/Documentation/perf-diff.txt | 5 + 1 file changed, 5 insertions(+) diff --git a/tools/perf/Documentation/perf-diff.txt b/tools/perf/Documentation/perf-diff.txt index da7809b..b242af8 100644 --- a/tools/perf/Documentation/perf-diff.txt +++ b/tools/perf/Documentation/perf-diff.txt @@ -174,6 +174,11 @@ OPTIONS --tid=:: Only diff samples for given thread ID (comma separated list). +--basic-block:: + Display the cycles difference of same program basic block amongst + two or more perf.data. The program basic block is the code block + between two branches in a function. + COMPARISON -- The comparison is governed by the baseline file. The baseline perf.data -- 2.7.4
[PATCH v1 7/9] perf diff: Compute cycles diff of basic blocks
In previous patch, we have already linked up the same basic blocks. Now we compute the cycles diff value of basic blocks, in order to sort by diff cycles later. Signed-off-by: Jin Yao --- tools/perf/builtin-diff.c | 31 +++ tools/perf/util/sort.h| 2 ++ 2 files changed, 33 insertions(+) diff --git a/tools/perf/builtin-diff.c b/tools/perf/builtin-diff.c index 72c33ab..47e34a3 100644 --- a/tools/perf/builtin-diff.c +++ b/tools/perf/builtin-diff.c @@ -1132,6 +1132,31 @@ static void block_hists_match(struct hists *hists_base, } } +static void compute_block_hists_diff(struct block_hists *block_hists, +struct data__file *d) +{ + struct hists *hists = _hists->hists; + struct rb_root_cached *root = hists->entries_in; + struct rb_node *next = rb_first_cached(root); + + while (next != NULL) { + struct hist_entry *he = rb_entry(next, struct hist_entry, +rb_node_in); + struct hist_entry *pair = get_pair_data(he, d); + + next = rb_next(>rb_node_in); + + if (pair) { + pair->diff.computed = true; + if (pair->block_info->num && he->block_info->num) { + pair->diff.cycles_diff = + pair->block_info->cycles_aggr / pair->block_info->num_aggr - + he->block_info->cycles_aggr / he->block_info->num_aggr; + } + } + } +} + static void basic_block_process(void) { struct hists *hists_base = __files[0].block_hists.hists; @@ -1151,6 +1176,12 @@ static void basic_block_process(void) block_hists_match(hists_base, hists); } + data__for_each_file_new(i, d) { + hists = >block_hists.hists; + d->hists = hists; + compute_block_hists_diff(__files[0].block_hists, d); + } + data__for_each_file(i, d) { hists__delete_entries(>block_hists.sym_hists); hists__delete_entries(>block_hists.hists); diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h index 43623fa..de9e61a 100644 --- a/tools/perf/util/sort.h +++ b/tools/perf/util/sort.h @@ -79,6 +79,8 @@ struct hist_entry_diff { /* HISTC_WEIGHTED_DIFF */ s64 wdiff; + + s64 cycles_diff; }; }; -- 2.7.4
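The diff value computed above is the change in *average* cycles per execution of a block: aggregated cycles divided by the aggregated execution count, pair minus baseline. A small user-space model (`struct block_stats` and the guard on `num_aggr` are illustrative; the patch guards on `num`):

```c
#include <assert.h>
#include <stdint.h>

/* Model of compute_block_hists_diff(): diff of per-execution averages. */
struct block_stats {
	uint64_t cycles_aggr;	/* cycles summed over all executions */
	uint64_t num_aggr;	/* number of executions aggregated */
};

static int64_t block_cycles_diff(const struct block_stats *base,
				 const struct block_stats *pair)
{
	if (!base->num_aggr || !pair->num_aggr)
		return 0;	/* no samples on one side: nothing to compare */
	return (int64_t)(pair->cycles_aggr / pair->num_aggr) -
	       (int64_t)(base->cycles_aggr / base->num_aggr);
}
```

A negative result means the block got cheaper in the new file, matching entries like the "-20 native_write_msr" line in the cover letter's example output.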
[PATCH v1 8/9] perf diff: Print the basic block cycles diff
Currently we only support sorting by diff cycles. For example, perf record -b ./div perf record -b ./div perf diff --basic-block # Cycles diff Basic block (start:end) # ... ... # -20 native_write_msr (7fff9a069900:7fff9a06990b) -3 __indirect_thunk_start (7fff9ac02ca0:7fff9ac02ca0) 1 __indirect_thunk_start (7fff9ac02cac:7fff9ac02cb0) 0 rand@plt (490:490) 0 rand (3af60:3af64) 0 rand (3af69:3af6d) 0 main (4e8:4ea) 0 main (4ef:500) 0 main (4ef:535) 0 compute_flag (640:644) 0 compute_flag (649:659) 0 __random_r (3ac40:3ac76) 0 __random_r (3ac40:3ac88) 0 __random_r (3ac90:3ac9c) 0 __random (3aac0:3aad2) 0 __random (3aae0:3aae7) 0 __random (3ab03:3ab0f) 0 __random (3ab14:3ab1b) 0 __random (3ab28:3ab2e) 0 __random (3ab4a:3ab53) Signed-off-by: Jin Yao --- tools/perf/builtin-diff.c | 168 ++ tools/perf/util/hist.c| 2 +- tools/perf/util/sort.h| 1 + 3 files changed, 170 insertions(+), 1 deletion(-) diff --git a/tools/perf/builtin-diff.c b/tools/perf/builtin-diff.c index 47e34a3..dbf242d 100644 --- a/tools/perf/builtin-diff.c +++ b/tools/perf/builtin-diff.c @@ -27,6 +27,12 @@ #include #include +struct block_hpp_fmt { + struct perf_hpp_fmt fmt; + struct data__file *file; + int width; +}; + struct block_hists { struct histssym_hists; struct perf_hpp_listsym_list; @@ -34,6 +40,8 @@ struct block_hists { struct histshists; struct perf_hpp_listlist; struct perf_hpp_fmt fmt; + struct block_hpp_fmtblock_fmt; + struct perf_hpp_fmt desc_fmt; }; struct perf_diff { @@ -1157,6 +1165,162 @@ static void compute_block_hists_diff(struct block_hists *block_hists, } } +static int64_t block_cycles_diff_cmp(struct perf_hpp_fmt *fmt, +struct hist_entry *left, +struct hist_entry *right) +{ + struct block_hpp_fmt *block_fmt = container_of(fmt, + struct block_hpp_fmt, + fmt); + struct data__file *d = block_fmt->file; + bool pairs_left = hist_entry__has_pairs(left); + bool pairs_right = hist_entry__has_pairs(right); + struct hist_entry *p_right, *p_left; + s64 l, r; + + if (!pairs_left && !pairs_right) + 
return 0; + + if (!pairs_left || !pairs_right) + return pairs_left ? -1 : 1; + + p_left = get_pair_data(left, d); + p_right = get_pair_data(right, d); + + if (!p_left && !p_right) + return 0; + + if (!p_left || !p_right) + return p_left ? -1 : 1; + + l = abs(p_left->diff.cycles_diff); + r = abs(p_right->diff.cycles_diff); + + return r - l; +} + +static int64_t block_diff_sort(struct perf_hpp_fmt *fmt, + struct hist_entry *left, struct hist_entry *right) +{ + return block_cycles_diff_cmp(fmt, right, left); +} + +static int block_diff_header(struct perf_hpp_fmt *fmt __maybe_unused, +struct perf_hpp *hpp, +struct hists *hists __maybe_unused, +int line __maybe_unused, +int *span __maybe_unused) +{ + return scnprintf(hpp->buf, hpp->size, "Cycles diff"); +} + +static int block_diff_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp, + struct hist_entry *he) +{ + struct block_hpp_fmt *block_fmt = container_of(fmt, + struct block_hpp_fmt, + fmt); + struct data__file *d = block_fmt->file; + struct hist_entry *pair = get_pair_data(he, d); + + if (pair && pair->diff.computed) { + return scnprintf(hpp->buf, hpp->size, "%*ld", block_fmt->width, +pair->diff.cycles_diff); + } + + return scnprintf(hpp->buf, hpp->size, "%*s", block_fmt->width, " "); +} + +static int block_diff_width(struct perf_hpp_fmt *fmt, + struct perf_hpp *hpp __maybe_unused, + struct hists *hists __maybe_unused) +{ + struct block_hpp_fmt *block_fmt = + container_of(fmt, struct block_hpp_fmt, fmt); + + return block_fmt->width; +} + +static int block_sym_width(struct perf_hpp_fmt *fmt __maybe_unused, + struct perf_hpp *hpp __maybe_unused, + struct
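`block_cycles_diff_cmp()` above orders entries by the absolute value of the cycles diff (`r - l` on `abs()` values), so the biggest movers print first — as in the -20/-3/1/0… listing in the commit message. A hedged qsort-style model of that comparator:

```c
#include <assert.h>
#include <stdlib.h>

/* Model of the sort order: largest |cycles diff| first. */
static int cmp_abs_desc(const void *a, const void *b)
{
	long l = labs(*(const long *)a);
	long r = labs(*(const long *)b);
	/* avoids overflow a plain (r - l) subtraction could hit */
	return (r > l) - (r < l);
}
```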
[PATCH v1 6/9] perf diff: Link same basic blocks among different data files
The target is to compare the performance difference (cycles diff) for the same basic blocks in different data files. The same basic block means same function, same start address and same end address. This patch finds the same basic blocks from different data files and link them together. Signed-off-by: Jin Yao --- tools/perf/builtin-diff.c | 66 +++ 1 file changed, 66 insertions(+) diff --git a/tools/perf/builtin-diff.c b/tools/perf/builtin-diff.c index 09551fe..72c33ab 100644 --- a/tools/perf/builtin-diff.c +++ b/tools/perf/builtin-diff.c @@ -1073,8 +1073,69 @@ static int process_block_per_sym(struct data__file *d) return 0; } +static int block_pair_cmp(struct hist_entry *a, struct hist_entry *b) +{ + struct block_info *bi_a = a->block_info; + struct block_info *bi_b = b->block_info; + int cmp; + + if (!bi_a->sym || !bi_b->sym) + return -1; + + if (bi_a->sym->name && bi_b->sym->name) { + cmp = strcmp(bi_a->sym->name, bi_b->sym->name); + if ((!cmp) && (bi_a->start == bi_b->start) && + (bi_a->end == bi_b->end)) { + return 0; + } + } + + return -1; +} + +static struct hist_entry *get_block_pair(struct hist_entry *he, +struct hists *hists_pair) +{ + struct rb_root_cached *root = hists_pair->entries_in; + struct rb_node *next = rb_first_cached(root); + int cmp; + + while (next != NULL) { + struct hist_entry *he_pair = rb_entry(next, struct hist_entry, + rb_node_in); + + next = rb_next(_pair->rb_node_in); + + cmp = block_pair_cmp(he_pair, he); + if (!cmp) + return he_pair; + } + + return NULL; +} + +static void block_hists_match(struct hists *hists_base, + struct hists *hists_pair) +{ + struct rb_root_cached *root = hists_base->entries_in; + struct rb_node *next = rb_first_cached(root); + + while (next != NULL) { + struct hist_entry *he = rb_entry(next, struct hist_entry, +rb_node_in); + struct hist_entry *pair = get_block_pair(he, hists_pair); + + next = rb_next(>rb_node_in); + + if (pair) + hist_entry__add_pair(pair, he); + } +} + static void basic_block_process(void) 
{ + struct hists *hists_base = __files[0].block_hists.hists; + struct hists *hists; struct data__file *d; int i; @@ -1085,6 +1146,11 @@ static void basic_block_process(void) process_block_per_sym(d); } + data__for_each_file_new(i, d) { + hists = >block_hists.hists; + block_hists_match(hists_base, hists); + } + data__for_each_file(i, d) { hists__delete_entries(>block_hists.sym_hists); hists__delete_entries(>block_hists.hists); -- 2.7.4
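`get_block_pair()`/`block_hists_match()` above pair a block from the baseline with a block from another file only when function name, start and end all agree, using a linear scan of the other file's entries (O(n·m) over the two lists). A user-space model with an illustrative `struct blk`:

```c
#include <assert.h>
#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Minimal stand-in for the matched fields of struct block_info. */
struct blk {
	const char *sym;
	uint64_t start, end;
};

/* Model of get_block_pair(): exact (name, start, end) match or nothing. */
static const struct blk *find_pair(const struct blk *needle,
				   const struct blk *haystack, int n)
{
	for (int i = 0; i < n; i++) {
		const struct blk *b = &haystack[i];
		if (b->sym && needle->sym && !strcmp(b->sym, needle->sym) &&
		    b->start == needle->start && b->end == needle->end)
			return b;
	}
	return NULL;	/* block exists in only one perf.data: no diff shown */
}
```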
Re: [PATCH] mm/dev_pfn: Exclude MEMORY_DEVICE_PRIVATE while computing virtual address
On 05/18/2019 03:20 AM, Andrew Morton wrote: > On Fri, 17 May 2019 16:08:34 +0530 Anshuman Khandual > wrote: > >> The presence of struct page does not guarantee linear mapping for the pfn >> physical range. Device private memory which is non-coherent is excluded >> from linear mapping during devm_memremap_pages() though they will still >> have struct page coverage. Just check for device private memory before >> giving out virtual address for a given pfn. > > I was going to give my standard "what are the user-visible runtime > effects of this change?", but... > >> All these helper functions are all pfn_t related but could not figure out >> another way of determining a private pfn without looking into its struct >> page. pfn_t_to_virt() is not used anywhere in the mainline kernel. Is >> it used by out-of-tree drivers? Should we then drop it completely? > > Yeah, let's kill it. > > But first, let's fix it so that if someone brings it back, they bring > back a non-buggy version. Makes sense. > > So... what (would be) the user-visible runtime effects of this change? I am not very familiar with how users interact with the drivers which hotplug and manage ZONE_DEVICE memory in general. Hence I cannot comment on its user-visible runtime impact. I just figured this out from a code audit while testing ZONE_DEVICE on an arm64 platform. But the fix makes the function a bit more expensive as it now involves some additional memory references.
[PATCH V4 3/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
The arm64 page table dump code can race with concurrent modification of the kernel page tables. When leaf entries are modified concurrently, the dump code may log stale or inconsistent information for a VA range, but this is otherwise not harmful. When intermediate levels of table are freed, the dump code will continue to use memory which has been freed and potentially reallocated for another purpose. In such cases, the dump code may dereference bogus addresses, leading to a number of potential problems. Intermediate levels of table may be freed during memory hot-remove, which will be enabled by a subsequent patch. To avoid racing with this, take the memory hotplug lock when walking the kernel page table. Acked-by: David Hildenbrand Signed-off-by: Anshuman Khandual --- arch/arm64/mm/ptdump_debugfs.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c index 064163f..80171d1 100644 --- a/arch/arm64/mm/ptdump_debugfs.c +++ b/arch/arm64/mm/ptdump_debugfs.c @@ -7,7 +7,10 @@ static int ptdump_show(struct seq_file *m, void *v) { struct ptdump_info *info = m->private; + + get_online_mems(); ptdump_walk_pgd(m, info); + put_online_mems(); return 0; } DEFINE_SHOW_ATTRIBUTE(ptdump); -- 2.7.4
[PATCH V4 1/4] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()
Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs entries between memory block and node. It first checks pfn validity with pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config (arm64 has this enabled) pfn_valid_within() calls pfn_valid(). pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID) which scans all mapped memblock regions with memblock_is_map_memory(). This creates a problem in memory hot remove path which has already removed given memory range from memory block with memblock_[remove|free] before arriving at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1 skipping subsequent sysfs_remove_link() calls leaving node <-> memory block sysfs entries as is. Subsequent memory add operation hits BUG_ON() because of existing sysfs entries. [ 62.007176] NUMA: Unknown node for memory at 0x68000, assuming node 0 [ 62.052517] [ cut here ] [ 62.053211] kernel BUG at mm/memory_hotplug.c:1143! [ 62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP [ 62.054589] Modules linked in: [ 62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 5.1.0-rc2-4-g28cea40b2683 #41 [ 62.056274] Hardware name: linux,dummy-virt (DT) [ 62.057166] pstate: 4045 (nZcv daif +PAN -UAO) [ 62.058083] pc : add_memory_resource+0x1cc/0x1d8 [ 62.058961] lr : add_memory_resource+0x10c/0x1d8 [ 62.059842] sp : 168b3ce0 [ 62.060477] x29: 168b3ce0 x28: 8005db546c00 [ 62.061501] x27: x26: [ 62.062509] x25: 111ef000 x24: 111ef5d0 [ 62.063520] x23: x22: 0006bfff [ 62.064540] x21: ffef x20: 006c [ 62.065558] x19: 0068 x18: 0024 [ 62.066566] x17: x16: [ 62.067579] x15: x14: 8005e412e890 [ 62.068588] x13: 8005d6b105d8 x12: [ 62.069610] x11: 8005d6b10490 x10: 0040 [ 62.070615] x9 : 8005e412e898 x8 : 8005e412e890 [ 62.071631] x7 : 8005d6b105d8 x6 : 8005db546c00 [ 62.072640] x5 : 0001 x4 : 0002 [ 62.073654] x3 : 8005d7049480 x2 : 0002 [ 62.074666] x1 : 0003 x0 : ffef [ 62.075685] Process bash (pid: 3275, stack limit = 0xd754280f) 
[ 62.076930] Call trace: [ 62.077411] add_memory_resource+0x1cc/0x1d8 [ 62.078227] __add_memory+0x70/0xa8 [ 62.078901] probe_store+0xa4/0xc8 [ 62.079561] dev_attr_store+0x18/0x28 [ 62.080270] sysfs_kf_write+0x40/0x58 [ 62.080992] kernfs_fop_write+0xcc/0x1d8 [ 62.081744] __vfs_write+0x18/0x40 [ 62.082400] vfs_write+0xa4/0x1b0 [ 62.083037] ksys_write+0x5c/0xc0 [ 62.083681] __arm64_sys_write+0x18/0x20 [ 62.084432] el0_svc_handler+0x88/0x100 [ 62.085177] el0_svc+0x8/0xc Re-ordering arch_remove_memory() with memblock_[free|remove] solves the problem on arm64 as pfn_valid() behaves correctly and returns positive as memblock for the address range still exists. arch_remove_memory() removes applicable memory sections from zone with __remove_pages() and tears down kernel linear mapping. Removing memblock regions afterwards is safe because there is no other memblock (bootmem) allocator user that late. So nobody is going to allocate from the removed range just to blow up later. Also nobody should be using the bootmem allocated range else we wouldn't allow to remove it. So reordering is indeed safe. Acked-by: Michal Hocko Reviewed-by: David Hildenbrand Reviewed-by: Oscar Salvador Signed-off-by: Anshuman Khandual --- mm/memory_hotplug.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 328878b..1dbda48 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1850,10 +1850,10 @@ void __ref __remove_memory(int nid, u64 start, u64 size) /* remove memmap entry */ firmware_map_remove(start, start + size, "System RAM"); + arch_remove_memory(nid, start, size, NULL); memblock_free(start, size); memblock_remove(start, size); - arch_remove_memory(nid, start, size, NULL); __release_memory_resource(start, size); try_offline_node(nid); -- 2.7.4
[PATCH V4 2/4] arm64/mm: Inhibit huge-vmap with ptdump
From: Mark Rutland

The arm64 ptdump code can race with concurrent modification of the kernel page tables. At the time this was added, this was sound as:

* Modifications to leaf entries could result in stale information being logged, but would not result in a functional problem.

* Boot time modifications to non-leaf entries (e.g. freeing of initmem) were performed when the ptdump code cannot be invoked.

* At runtime, modifications to non-leaf entries only occurred in the vmalloc region, and these were strictly additive, as intermediate entries were never freed.

However, since commit:

  commit 324420bf91f6 ("arm64: add support for ioremap() block mappings")

... it has been possible to create huge mappings in the vmalloc area at runtime, and as part of this, existing intermediate levels of table may be removed and freed.

It's possible for the ptdump code to race with this, and continue to walk tables which have been freed (and potentially poisoned or reallocated). As a result of this, the ptdump code may dereference bogus addresses, which could be fatal.

Since huge-vmap is a TLB and memory optimization, we can disable it when the runtime ptdump code is in use to avoid this problem.

Fixes: 324420bf91f60582 ("arm64: add support for ioremap() block mappings")
Acked-by: Ard Biesheuvel
Signed-off-by: Mark Rutland
Signed-off-by: Anshuman Khandual
Cc: Ard Biesheuvel
Cc: Catalin Marinas
Cc: Will Deacon
---
 arch/arm64/mm/mmu.c | 11 +++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index a170c63..a1bfc44 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -955,13 +955,18 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys)

 int __init arch_ioremap_pud_supported(void)
 {
-	/* only 4k granule supports level 1 block mappings */
-	return IS_ENABLED(CONFIG_ARM64_4K_PAGES);
+	/*
+	 * Only 4k granule supports level 1 block mappings.
+	 * SW table walks can't handle removal of intermediate entries.
+	 */
+	return IS_ENABLED(CONFIG_ARM64_4K_PAGES) &&
+	       !IS_ENABLED(CONFIG_ARM64_PTDUMP_DEBUGFS);
 }

 int __init arch_ioremap_pmd_supported(void)
 {
-	return 1;
+	/* See arch_ioremap_pud_supported() */
+	return !IS_ENABLED(CONFIG_ARM64_PTDUMP_DEBUGFS);
 }

 int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot)
--
2.7.4
[PATCH V4 4/4] arm64/mm: Enable memory hot remove
The arch code for hot-remove must tear down portions of the linear map and vmemmap corresponding to memory being removed. In both cases the page tables mapping these regions must be freed, and when sparse vmemmap is in use the memory backing the vmemmap must also be freed.

This patch adds a new remove_pagetable() helper which can be used to tear down either region, and calls it from vmemmap_free() and ___remove_pgd_mapping(). The sparse_vmap argument determines whether the backing memory will be freed.

While freeing intermediate level page table pages, bail out if any of its entries are still valid. This can happen for a partially filled kernel page table, either from a previously failed memory hot add attempt, or while removing an address range which does not span the entire page table page range.

The vmemmap region may share levels of table with the vmalloc region. Take the kernel ptl so that we can safely free potentially-shared tables.

While here, update arch_add_memory() to handle __add_pages() failures by just unmapping the recently added kernel linear mapping. Now enable memory hot remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.

This implementation is overall inspired by the kernel page table tear down procedure on the x86 architecture.
Signed-off-by: Anshuman Khandual --- arch/arm64/Kconfig | 3 + arch/arm64/mm/mmu.c | 212 +++- 2 files changed, 213 insertions(+), 2 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 4780eb7..ce24427 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -267,6 +267,9 @@ config HAVE_GENERIC_GUP config ARCH_ENABLE_MEMORY_HOTPLUG def_bool y +config ARCH_ENABLE_MEMORY_HOTREMOVE + def_bool y + config SMP def_bool y diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index a1bfc44..0cf0d41 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -733,6 +733,187 @@ int kern_addr_valid(unsigned long addr) return pfn_valid(pte_pfn(pte)); } + +#ifdef CONFIG_MEMORY_HOTPLUG +static void free_hotplug_page_range(struct page *page, ssize_t size) +{ + WARN_ON(PageReserved(page)); + free_pages((unsigned long)page_address(page), get_order(size)); +} + +static void free_hotplug_pgtable_page(struct page *page) +{ + free_hotplug_page_range(page, PAGE_SIZE); +} + +static void free_pte_table(pte_t *ptep, pmd_t *pmdp, unsigned long addr) +{ + struct page *page; + int i; + + for (i = 0; i < PTRS_PER_PTE; i++) { + if (!pte_none(ptep[i])) + return; + } + + page = pmd_page(READ_ONCE(*pmdp)); + pmd_clear(pmdp); + __flush_tlb_kernel_pgtable(addr); + free_hotplug_pgtable_page(page); +} + +static void free_pmd_table(pmd_t *pmdp, pud_t *pudp, unsigned long addr) +{ + struct page *page; + int i; + + if (CONFIG_PGTABLE_LEVELS <= 2) + return; + + for (i = 0; i < PTRS_PER_PMD; i++) { + if (!pmd_none(pmdp[i])) + return; + } + + page = pud_page(READ_ONCE(*pudp)); + pud_clear(pudp); + __flush_tlb_kernel_pgtable(addr); + free_hotplug_pgtable_page(page); +} + +static void free_pud_table(pud_t *pudp, pgd_t *pgdp, unsigned long addr) +{ + struct page *page; + int i; + + if (CONFIG_PGTABLE_LEVELS <= 3) + return; + + for (i = 0; i < PTRS_PER_PUD; i++) { + if (!pud_none(pudp[i])) + return; + } + + page = pgd_page(READ_ONCE(*pgdp)); + pgd_clear(pgdp); + 
__flush_tlb_kernel_pgtable(addr); + free_hotplug_pgtable_page(page); +} + +static void +remove_pte_table(pmd_t *pmdp, unsigned long addr, + unsigned long end, bool sparse_vmap) +{ + struct page *page; + pte_t *ptep, pte; + unsigned long start = addr; + + for (; addr < end; addr += PAGE_SIZE) { + ptep = pte_offset_kernel(pmdp, addr); + pte = READ_ONCE(*ptep); + + if (pte_none(pte)) + continue; + + WARN_ON(!pte_present(pte)); + if (sparse_vmap) { + page = pte_page(pte); + free_hotplug_page_range(page, PAGE_SIZE); + } + pte_clear(_mm, addr, ptep); + } + flush_tlb_kernel_range(start, end); +} + +static void +remove_pmd_table(pud_t *pudp, unsigned long addr, + unsigned long end, bool sparse_vmap) +{ + unsigned long next; + struct page *page; + pte_t *ptep_base; + pmd_t *pmdp, pmd; + + for (; addr < end; addr = next) { + next = pmd_addr_end(addr, end); + pmdp = pmd_offset(pudp, addr); + pmd = READ_ONCE(*pmdp); + + if (pmd_none(pmd)) + continue; + +
[PATCH V4 0/4] arm64/mm: Enable memory hot remove
This series enables memory hot remove on arm64 after fixing a memblock removal ordering problem in the generic __remove_memory() and two possible arm64 platform specific kernel page table race conditions. This series is based on the latest v5.2-rc1 tag.

Testing:

Memory hot remove has been tested on arm64 for 4K, 16K, 64K page config options with all possible CONFIG_ARM64_VA_BITS and CONFIG_PGTABLE_LEVELS combinations. It is only build-tested on non-arm64 platforms.

Changes in V4:

- Implemented most of the suggestions from Mark Rutland
- Interchanged patch [PATCH 2/4] <---> [PATCH 3/4] and updated commit message
- Moved CONFIG_PGTABLE_LEVELS inside free_[pud|pmd]_table()
- Used READ_ONCE() in missing instances while accessing page table entries
- s/p???_present()/p???_none() for checking valid kernel page table entries
- WARN_ON() when an entry is !p???_none() and !p???_present() at the same time
- Updated memory hot-remove commit message with additional details as suggested
- Rebased the series on 5.2-rc1 with hotplug changes from David and Michal Hocko
- Collected all new Acked-by tags

Changes in V3: (https://lkml.org/lkml/2019/5/14/197)

- Implemented most of the suggestions from Mark Rutland for remove_pagetable()
- Fixed applicable PGTABLE_LEVEL wrappers around pgtable page freeing functions
- Replaced 'direct' with 'sparse_vmap' in remove_pagetable() with inverted polarity
- Changed pointer names ('p' at end) and removed tmp from iterations
- Perform intermediate TLB invalidation while clearing pgtable entries
- Dropped flush_tlb_kernel_range() in remove_pagetable()
- Added flush_tlb_kernel_range() in remove_pte_table() instead
- Renamed page freeing functions for pgtable page and mapped pages
- Used page range size instead of order while freeing mapped or pgtable pages
- Removed all PageReserved() handling while freeing mapped or pgtable pages
- Replaced XXX_index() with XXX_offset() while walking the kernel page table
- Used READ_ONCE() while fetching individual pgtable entries
- Taken overall init_mm.page_table_lock instead of just while changing an entry
- Dropped previously added [pmd|pud]_index() which are not required anymore
- Added a new patch to protect kernel page table race condition for ptdump
- Added a new patch from Mark Rutland to prevent huge-vmap with ptdump

Changes in V2: (https://lkml.org/lkml/2019/4/14/5)

- Added all received review and ack tags
- Split the series from ZONE_DEVICE enablement for better review
- Moved memblock re-order patch to the front as per Robin Murphy
- Updated commit message on memblock re-order patch per Michal Hocko
- Dropped [pmd|pud]_large() definitions
- Used existing [pmd|pud]_sect() instead of earlier [pmd|pud]_large()
- Removed __meminit and __ref tags as per Oscar Salvador
- Dropped unnecessary 'ret' init in arch_add_memory() per Robin Murphy
- Skipped calling into pgtable_page_dtor() for linear mapping page table pages and updated all relevant functions

Changes in V1: (https://lkml.org/lkml/2019/4/3/28)

Anshuman Khandual (3):
  mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()
  arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  arm64/mm: Enable memory hot remove

Mark Rutland (1):
  arm64/mm: Inhibit huge-vmap with ptdump

 arch/arm64/Kconfig             | 3 +
 arch/arm64/mm/mmu.c            | 223 +++++++++++++++++++++++++++++-
 arch/arm64/mm/ptdump_debugfs.c | 3 +
 mm/memory_hotplug.c            | 2 +-
 4 files changed, 225 insertions(+), 6 deletions(-)

--
2.7.4
[PATCH] intel-svm: fix typos in code comments
fix acccess to access Signed-off-by: Weitao Hou --- include/linux/intel-svm.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h index e3f76315ca4d..8dfead70699c 100644 --- a/include/linux/intel-svm.h +++ b/include/linux/intel-svm.h @@ -57,7 +57,7 @@ struct svm_dev_ops { /** * intel_svm_bind_mm() - Bind the current process to a PASID - * @dev: Device to be granted acccess + * @dev: Device to be granted access * @pasid: Address for allocated PASID * @flags: Flags. Later for requesting supervisor mode, etc. * @ops: Callbacks to device driver -- 2.18.0
Re: PROBLEM: Power9: kernel oops on memory hotunplug from ppc64le guest
Bharata B Rao's on May 20, 2019 2:25 pm: > On Mon, May 20, 2019 at 12:02:23PM +1000, Michael Ellerman wrote: >> Bharata B Rao writes: >> > On Thu, May 16, 2019 at 07:44:20PM +0530, srikanth wrote: >> >> Hello, >> >> >> >> On power9 host, performing memory hotunplug from ppc64le guest results in >> >> kernel oops. >> >> >> >> Kernel used : https://github.com/torvalds/linux/tree/v5.1 built using >> >> ppc64le_defconfig for host and ppc64le_guest_defconfig for guest. >> >> >> >> Recreation steps: >> >> >> >> 1. Boot a guest with below mem configuration: >> >> 33554432 >> >> 8388608 >> >> 4194304 >> >> >> >> >> >> >> >> >> >> >> >> >> >> 2. From host hotplug 8G memory -> verify memory hotadded succesfully -> >> >> now >> >> reboot guest -> once guest comes back try to unplug 8G memory >> >> >> >> mem.xml used: >> >> >> >> >> >> 8 >> >> 0 >> >> >> >> >> >> >> >> Memory attach and detach commands used: >> >> virsh attach-device vm1 ./mem.xml --live >> >> virsh detach-device vm1 ./mem.xml --live >> >> >> >> Trace seen inside guest after unplug, guest just hangs there forever: >> >> >> >> [ 21.962986] kernel BUG at arch/powerpc/mm/pgtable-frag.c:113! 
>> >> [ 21.963064] Oops: Exception in kernel mode, sig: 5 [#1] >> >> [ 21.963090] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA >> >> pSeries >> >> [ 21.963131] Modules linked in: xt_tcpudp iptable_filter squashfs fuse >> >> vmx_crypto ib_iser rdma_cm iw_cm ib_cm ib_core libiscsi >> >> scsi_transport_iscsi >> >> ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress >> >> lzo_compress >> >> raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx >> >> xor raid6_pq multipath crc32c_vpmsum >> >> [ 21.963281] CPU: 11 PID: 316 Comm: kworker/u64:5 Kdump: loaded Not >> >> tainted 5.1.0-dirty #2 >> >> [ 21.963323] Workqueue: pseries hotplug workque pseries_hp_work_fn >> >> [ 21.963355] NIP: c0079e18 LR: c0c79308 CTR: >> >> 8000 >> >> [ 21.963392] REGS: c003f88034f0 TRAP: 0700 Not tainted >> >> (5.1.0-dirty) >> >> [ 21.963422] MSR: 8282b033 >> >> CR: >> >> 28002884 XER: 2004 >> >> [ 21.963470] CFAR: c0c79304 IRQMASK: 0 >> >> [ 21.963470] GPR00: c0c79308 c003f8803780 c1521000 >> >> 00fff8c0 >> >> [ 21.963470] GPR04: 0001 ffe30005 0005 >> >> 0020 >> >> [ 21.963470] GPR08: 0001 c00a00fff8e0 >> >> c16d21a0 >> >> [ 21.963470] GPR12: c16e7b90 c7ff2700 c00a00a0 >> >> c003ffe30100 >> >> [ 21.963470] GPR16: c003ffe3 c14aa4de c00a009f >> >> c16d21b0 >> >> [ 21.963470] GPR20: c14de588 0001 c16d21b8 >> >> c00a00a0 >> >> [ 21.963470] GPR24: c00a00a0 >> >> c003ffe96000 >> >> [ 21.963470] GPR28: c00a00a0 c00a00a0 c003fffec000 >> >> c00a00fff8c0 >> >> [ 21.963802] NIP [c0079e18] pte_fragment_free+0x48/0xd0 >> >> [ 21.963838] LR [c0c79308] remove_pagetable+0x49c/0x5b4 >> >> [ 21.963873] Call Trace: >> >> [ 21.963890] [c003f8803780] [c003ffe997f0] 0xc003ffe997f0 >> >> (unreliable) >> >> [ 21.963933] [c003f88037b0] [] (null) >> >> [ 21.963969] [c003f88038c0] [c006f038] >> >> vmemmap_free+0x218/0x2e0 >> >> [ 21.964006] [c003f8803940] [c036f100] >> >> sparse_remove_one_section+0xd0/0x138 >> >> [ 21.964050] [c003f8803980] [c0383a50] >> >> 
__remove_pages+0x410/0x560 >> >> [ 21.964093] [c003f8803a90] [c0c784d8] >> >> arch_remove_memory+0x68/0xdc >> >> [ 21.964136] [c003f8803ad0] [c0385d74] >> >> __remove_memory+0xc4/0x110 >> >> [ 21.964180] [c003f8803b10] [c00d44e4] >> >> dlpar_remove_lmb+0x94/0x140 >> >> [ 21.964223] [c003f8803b50] [c00d52b4] >> >> dlpar_memory+0x464/0xd00 >> >> [ 21.964259] [c003f8803be0] [c00cd5c0] >> >> handle_dlpar_errorlog+0xc0/0x190 >> >> [ 21.964303] [c003f8803c50] [c00cd6bc] >> >> pseries_hp_work_fn+0x2c/0x60 >> >> [ 21.964346] [c003f8803c80] [c013a4a0] >> >> process_one_work+0x2b0/0x5a0 >> >> [ 21.964388] [c003f8803d10] [c013a818] >> >> worker_thread+0x88/0x610 >> >> [ 21.964434] [c003f8803db0] [c0143884] kthread+0x1a4/0x1b0 >> >> [ 21.964468] [c003f8803e20] [c000bdc4] >> >> ret_from_kernel_thread+0x5c/0x78 >> >> [ 21.964506] Instruction dump: >> >> [ 21.964527] fbe1fff8 f821ffd1 78638502 78633664 ebe9 7fff1a14 >> >> 395f0020 813f0020 >> >> [ 21.964569] 7d2907b4 7d2900d0 79290fe0 69290001 <0b09> 7c0004ac >> >> 7d205028 3129 >> >> [ 21.964613] ---[ end trace aaa571aa1636fee6 ]--- >> >> [ 21.966349] >> >> [ 21.966383] Sending IPI to other CPUs >> >> [
[PATCH] fddi: fix typos in code comments
fix abord to abort Signed-off-by: Weitao Hou --- drivers/net/fddi/skfp/hwmtm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/fddi/skfp/hwmtm.c b/drivers/net/fddi/skfp/hwmtm.c index abbe309051d9..3d0f417e8586 100644 --- a/drivers/net/fddi/skfp/hwmtm.c +++ b/drivers/net/fddi/skfp/hwmtm.c @@ -1206,7 +1206,7 @@ void process_receive(struct s_smc *smc) } /* * SUPERNET 3 Bug: FORMAC delivers status words -* of aborded frames to the BMU +* of aborted frames to the BMU */ if (len <= 4) { DB_RX(2, "Frame length = 0"); @@ -1343,7 +1343,7 @@ void process_receive(struct s_smc *smc) break ; default : /* -* unknown FC abord the frame +* unknown FC abort the frame */ DB_RX(2, "unknown FC error"); smt_free_mbuf(smc,mb) ; -- 2.18.0
[PATCH] mm/failslab: By default, do not fail allocations with direct reclaim only
When failslab was originally written, the intention of the "ignore-gfp-wait" flag default value ("N") was to fail GFP_ATOMIC allocations. Those were defined as (__GFP_HIGH), and the code would test for __GFP_WAIT (0x10u). However, since then, __GFP_WAIT was replaced by __GFP_RECLAIM (___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM), and GFP_ATOMIC is now defined as (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM). This means that when the flag is false, almost no allocation ever fails (as even GFP_ATOMIC allocations contain __GFP_KSWAPD_RECLAIM). Restore the original intent of the code, by ignoring calls that directly reclaim only (___GFP_DIRECT_RECLAIM), and thus, failing GFP_ATOMIC calls again by default. Fixes: 71baba4b92dc1fa1 ("mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM") Signed-off-by: Nicolas Boichat --- mm/failslab.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/failslab.c b/mm/failslab.c index ec5aad211c5be97..33efcb60e633c0a 100644 --- a/mm/failslab.c +++ b/mm/failslab.c @@ -23,7 +23,8 @@ bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags) if (gfpflags & __GFP_NOFAIL) return false; - if (failslab.ignore_gfp_reclaim && (gfpflags & __GFP_RECLAIM)) + if (failslab.ignore_gfp_reclaim && + (gfpflags & ___GFP_DIRECT_RECLAIM)) return false; if (failslab.cache_filter && !(s->flags & SLAB_FAILSLAB)) -- 2.21.0.1020.gf2820cf01a-goog
Re: [PATCH] vfio: vfio_pci_nvlink2: use a vma helper function
On 17/05/2019 00:48, Alex Williamson wrote: > [Cc Alexey + kvm] > > On Thu, 16 May 2019 20:38:26 +0800 > "richard.p...@oppo.com" wrote: > >> Use a vma helper function to simply code. >> >> Signed-off-by: Peng Hao >> --- >> drivers/vfio/pci/vfio_pci_nvlink2.c | 3 +-- >> 1 file changed, 1 insertion(+), 2 deletions(-) >> >> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c >> b/drivers/vfio/pci/vfio_pci_nvlink2.c >> index 32f695ffe128..dc42aa0e47f6 100644 >> --- a/drivers/vfio/pci/vfio_pci_nvlink2.c >> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c >> @@ -161,8 +161,7 @@ static int vfio_pci_nvgpu_mmap(struct vfio_pci_device >> *vdev, >> >> atomic_inc(>mm->mm_count); >> ret = (int) mm_iommu_newdev(data->mm, data->useraddr, >> -(vma->vm_end - vma->vm_start) >> PAGE_SHIFT, >> -data->gpu_hpa, >mem); >> +vma_pages(vma), data->gpu_hpa, >mem); I did not realize we have been having this mighty helper since 2005 :) Reviewed-by: Alexey Kardashevskiy >> >> trace_vfio_pci_nvgpu_mmap(vdev->pdev, data->gpu_hpa, data->useraddr, >> vma->vm_end - vma->vm_start, ret); >> -- >> 2.20.1 -- Alexey
Re: [PATCH 6/8] PM / OPP: Support adjusting OPP voltages at runtime
On 16-05-19, 17:08, Andrew-sh.Cheng wrote:
> From: Stephen Boyd
>
> On some SoCs the Adaptive Voltage Scaling (AVS) technique is
> employed to optimize the operating voltage of a device. At a
> given frequency, the hardware monitors dynamic factors and either
> makes a suggestion for how much to adjust a voltage for the
> current frequency, or it automatically adjusts the voltage
> without software intervention. Add an API to the OPP library for
> the former case, so that AVS type devices can update the voltages
> for an OPP when the hardware determines the voltage should
> change. The assumption is that drivers like CPUfreq or devfreq
> will register for the OPP notifiers and adjust the voltage
> according to suggestions that AVS makes.
>
> This patch is derived from [1] submitted by Stephen.
> [1] https://lore.kernel.org/patchwork/patch/599279/
>
> Signed-off-by: Stephen Boyd
> Signed-off-by: Roger Lu
> ---
>  drivers/opp/core.c     | 78 ++
>  include/linux/pm_opp.h | 11 +++
>  2 files changed, 89 insertions(+)

This is an RCU-based implementation, which was removed from the OPP core long back. Please align this with the latest changes.

--
viresh
Re: [PATCH 4/8] dt-bindings: devfreq: add compatible for mt8183 cci devfreq
On 16-05-19, 17:08, Andrew-sh.Cheng wrote: > From: "Andrew-sh.Cheng" > > This adds dt-binding documentation of cci devfreq > for Mediatek MT8183 SoC platform. > > Signed-off-by: Andrew-sh.Cheng > --- > .../bindings/devfreq/mt8183-cci-devfreq.txt | 20 > > 1 file changed, 20 insertions(+) > create mode 100644 > Documentation/devicetree/bindings/devfreq/mt8183-cci-devfreq.txt > > diff --git a/Documentation/devicetree/bindings/devfreq/mt8183-cci-devfreq.txt > b/Documentation/devicetree/bindings/devfreq/mt8183-cci-devfreq.txt > new file mode 100644 > index ..3189902902e0 > --- /dev/null > +++ b/Documentation/devicetree/bindings/devfreq/mt8183-cci-devfreq.txt > @@ -0,0 +1,20 @@ > +* Mediatek Cache Coherent Interconnect(CCI) frequency device > + > +Required properties: > +- compatible: should contain "mediatek,mt8183-cci" for frequency scaling of > CCI Example doesn't have this compatible . > +- clocks: for frequency scaling of CCI > +- clock-names: for frequency scaling of CCI driver to reference > +- regulator: for voltage scaling of CCI > +- operating-points-v2: for frequency scaling of CCI opp table > + > +Example: > + cci: cci { > + compatible = "mediatek,cci"; > + clocks = < CLK_APMIXED_CCIPLL>; > + clock-names = "cci_clock"; > + operating-points-v2 = <_opp>; > + }; > + > + { > + proc-supply = <_vproc12_reg>; > + }; > \ No newline at end of file > -- > 2.12.5 -- viresh
Re: [EXT] Re: [PATCH 1/3] enetc: add hardware timestamping support
On Mon, May 20, 2019 at 03:20:23AM +, Y.b. Lu wrote: > > > +config FSL_ENETC_HW_TIMESTAMPING > > > + bool "ENETC hardware timestamping support" > > > + depends on FSL_ENETC || FSL_ENETC_VF > > > + help > > > + Enable hardware timestamping support on the Ethernet packets > > > + using the SO_TIMESTAMPING API. Because the RX BD ring dynamic > > > + allocation hasn't been supported and it's too expensive to use > > > > s/it's/it is/ > > [Y.b. Lu] Will modify it. BTW, may I know what's the purpose of dropping > single quote character? For searching, script checking, or something else? Simply because "it's" is informal speech, but the Kconfig help is formal technical documentation. (Or at least it should be!) Thanks, Richard
Re: [RFC 5/5] cpufreq: add driver for Raspbery Pi
On 17-05-19, 17:35, Nicolas Saenz Julienne wrote: > Raspberry Pi's firmware offers and interface though which update it's > performance requirements. It allows us to request for specific runtime > frequencies, which the firmware might or might not respect, depending on > the firmware configuration and thermals. > > As the maximum and minimum frequencies are configurable in the firmware > there is no way to know in advance their values. So the Raspberry Pi > cpufreq driver queries them, builds an opp frequency table to then > launch cpufreq-dt. > > Signed-off-by: Nicolas Saenz Julienne > --- > drivers/cpufreq/Kconfig.arm | 8 +++ > drivers/cpufreq/Makefile | 1 + > drivers/cpufreq/raspberrypi-cpufreq.c | 79 +++ > 3 files changed, 88 insertions(+) > create mode 100644 drivers/cpufreq/raspberrypi-cpufreq.c > > diff --git a/drivers/cpufreq/Kconfig.arm b/drivers/cpufreq/Kconfig.arm > index 179a1d302f48..70e5f13f7632 100644 > --- a/drivers/cpufreq/Kconfig.arm > +++ b/drivers/cpufreq/Kconfig.arm > @@ -308,3 +308,11 @@ config ARM_PXA2xx_CPUFREQ > This add the CPUFreq driver support for Intel PXA2xx SOCs. > > If in doubt, say N. > + > +config ARM_RASPBERRYPI_CPUFREQ > + tristate "Raspberry Pi cpufreq support" > + depends on RASPBERRYPI_FIRMWARE || (RASPBERRYPI_FIRMWARE=n && > COMPILE_TEST) What about: depends on RASPBERRYPI_FIRMWARE || COMPILE_TEST > + help > + This adds the CPUFreq driver for Raspberry Pi > + > + If in doubt, say N. 
> diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile > index 689b26c6f949..02678e9b2ff5 100644 > --- a/drivers/cpufreq/Makefile > +++ b/drivers/cpufreq/Makefile > @@ -84,6 +84,7 @@ obj-$(CONFIG_ARM_TEGRA124_CPUFREQ) += tegra124-cpufreq.o > obj-$(CONFIG_ARM_TEGRA186_CPUFREQ) += tegra186-cpufreq.o > obj-$(CONFIG_ARM_TI_CPUFREQ) += ti-cpufreq.o > obj-$(CONFIG_ARM_VEXPRESS_SPC_CPUFREQ) += vexpress-spc-cpufreq.o > +obj-$(CONFIG_ARM_RASPBERRYPI_CPUFREQ)+= raspberrypi-cpufreq.o > > > > ## > diff --git a/drivers/cpufreq/raspberrypi-cpufreq.c > b/drivers/cpufreq/raspberrypi-cpufreq.c > new file mode 100644 > index ..53cb3e5a8457 > --- /dev/null > +++ b/drivers/cpufreq/raspberrypi-cpufreq.c > @@ -0,0 +1,79 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Raspberry Pi cpufreq driver > + * > + * Copyright (C) 2019, Nicolas Saenz Julienne > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +static const struct of_device_id machines[] __initconst = { > + { .compatible = "raspberrypi,3-model-b-plus" }, > + { .compatible = "raspberrypi,3-model-b" }, > + { /* sentinel */ } > +}; > + > +static int __init raspberrypi_cpufreq_driver_init(void) > +{ > + struct platform_device *pdev; > + struct cpumask shared_cpus; > + struct device *cpu_dev; > + struct clk *clk; > + long min, max; > + long rate; > + int ret; > + > + if (!of_match_node(machines, of_root)) > + return -ENODEV; > + > + cpu_dev = get_cpu_device(0); > + if (!cpu_dev) { > + pr_err("Cannot get CPU for cpufreq driver\n"); > + return -ENODEV; > + } > + > + clk = clk_get(cpu_dev, 0); > + if (IS_ERR(clk)) { > + dev_err(cpu_dev, "Cannot get clock for CPU0\n"); > + return PTR_ERR(clk); > + } You want to do a clk_put() somewhere ? 
> + > + /* > + * The max and min frequencies are configurable in the Raspberry Pi > + * firmware, so we query them at runtime > + */ > + min = clk_round_rate(clk, 0); > + max = clk_round_rate(clk, ULONG_MAX); > + > + for (rate = min; rate < max; rate += 1) { > + ret = dev_pm_opp_add(cpu_dev, rate, 0); > + if (ret) > + return ret; > + } > + > + ret = dev_pm_opp_add(cpu_dev, max, 0); > + if (ret) > + return ret; On errors, you should remove all previously added OPPs. > + > + cpumask_setall(_cpus); > + dev_pm_opp_set_sharing_cpus(cpu_dev, _cpus); Why are these required? This must be done by the cpufreq-dt driver anyway ? > + > + pdev = platform_device_register_data(NULL, "cpufreq-dt", -1, NULL, 0); > + ret = PTR_ERR_OR_ZERO(pdev); > + if (ret) > + dev_err(cpu_dev, "Failed to create platform device, %d\n", ret); > + > + return ret; > +} > + > +late_initcall(raspberrypi_cpufreq_driver_init); > + > +MODULE_AUTHOR("Nicolas Saenz Julienne +MODULE_DESCRIPTION("Raspberry Pi cpufreq driver"); > +MODULE_LICENSE("GPL v2"); > -- > 2.21.0 -- viresh
[PATCH] kvm: vmx: Fix -Wmissing-prototypes warnings
We get a warning when build kernel W=1: arch/x86/kvm/vmx/vmx.c:6365:6: warning: no previous prototype for ‘vmx_update_host_rsp’ [-Wmissing-prototypes] void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp) Add the missing declaration to fix this. Signed-off-by: Yi Wang --- arch/x86/kvm/vmx/vmx.h | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index f879529..9cd72de 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -314,6 +314,7 @@ struct kvm_vmx { void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu); struct shared_msr_entry *find_msr_entry(struct vcpu_vmx *vmx, u32 msr); void pt_update_intercept_for_msr(struct vcpu_vmx *vmx); +void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp); #define POSTED_INTR_ON 0 #define POSTED_INTR_SN 1 -- 1.8.3.1
Re: PROBLEM: Power9: kernel oops on memory hotunplug from ppc64le guest
On Mon, May 20, 2019 at 12:02:23PM +1000, Michael Ellerman wrote: > Bharata B Rao writes: > > On Thu, May 16, 2019 at 07:44:20PM +0530, srikanth wrote: > >> Hello, > >> > >> On power9 host, performing memory hotunplug from ppc64le guest results in > >> kernel oops. > >> > >> Kernel used : https://github.com/torvalds/linux/tree/v5.1 built using > >> ppc64le_defconfig for host and ppc64le_guest_defconfig for guest. > >> > >> Recreation steps: > >> > >> 1. Boot a guest with below mem configuration: > >> 33554432 > >> 8388608 > >> 4194304 > >> > >> > >> > >> > >> > >> > >> 2. From host hotplug 8G memory -> verify memory hotadded succesfully -> now > >> reboot guest -> once guest comes back try to unplug 8G memory > >> > >> mem.xml used: > >> > >> > >> 8 > >> 0 > >> > >> > >> > >> Memory attach and detach commands used: > >> virsh attach-device vm1 ./mem.xml --live > >> virsh detach-device vm1 ./mem.xml --live > >> > >> Trace seen inside guest after unplug, guest just hangs there forever: > >> > >> [ 21.962986] kernel BUG at arch/powerpc/mm/pgtable-frag.c:113! 
> >> [ 21.963064] Oops: Exception in kernel mode, sig: 5 [#1] > >> [ 21.963090] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA > >> pSeries > >> [ 21.963131] Modules linked in: xt_tcpudp iptable_filter squashfs fuse > >> vmx_crypto ib_iser rdma_cm iw_cm ib_cm ib_core libiscsi > >> scsi_transport_iscsi > >> ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress lzo_compress > >> raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx > >> xor raid6_pq multipath crc32c_vpmsum > >> [ 21.963281] CPU: 11 PID: 316 Comm: kworker/u64:5 Kdump: loaded Not > >> tainted 5.1.0-dirty #2 > >> [ 21.963323] Workqueue: pseries hotplug workque pseries_hp_work_fn > >> [ 21.963355] NIP: c0079e18 LR: c0c79308 CTR: > >> 8000 > >> [ 21.963392] REGS: c003f88034f0 TRAP: 0700 Not tainted > >> (5.1.0-dirty) > >> [ 21.963422] MSR: 8282b033 > >> CR: > >> 28002884 XER: 2004 > >> [ 21.963470] CFAR: c0c79304 IRQMASK: 0 > >> [ 21.963470] GPR00: c0c79308 c003f8803780 c1521000 > >> 00fff8c0 > >> [ 21.963470] GPR04: 0001 ffe30005 0005 > >> 0020 > >> [ 21.963470] GPR08: 0001 c00a00fff8e0 > >> c16d21a0 > >> [ 21.963470] GPR12: c16e7b90 c7ff2700 c00a00a0 > >> c003ffe30100 > >> [ 21.963470] GPR16: c003ffe3 c14aa4de c00a009f > >> c16d21b0 > >> [ 21.963470] GPR20: c14de588 0001 c16d21b8 > >> c00a00a0 > >> [ 21.963470] GPR24: c00a00a0 > >> c003ffe96000 > >> [ 21.963470] GPR28: c00a00a0 c00a00a0 c003fffec000 > >> c00a00fff8c0 > >> [ 21.963802] NIP [c0079e18] pte_fragment_free+0x48/0xd0 > >> [ 21.963838] LR [c0c79308] remove_pagetable+0x49c/0x5b4 > >> [ 21.963873] Call Trace: > >> [ 21.963890] [c003f8803780] [c003ffe997f0] 0xc003ffe997f0 > >> (unreliable) > >> [ 21.963933] [c003f88037b0] [] (null) > >> [ 21.963969] [c003f88038c0] [c006f038] > >> vmemmap_free+0x218/0x2e0 > >> [ 21.964006] [c003f8803940] [c036f100] > >> sparse_remove_one_section+0xd0/0x138 > >> [ 21.964050] [c003f8803980] [c0383a50] > >> __remove_pages+0x410/0x560 > >> [ 21.964093] [c003f8803a90] [c0c784d8] > 
>> arch_remove_memory+0x68/0xdc > >> [ 21.964136] [c003f8803ad0] [c0385d74] > >> __remove_memory+0xc4/0x110 > >> [ 21.964180] [c003f8803b10] [c00d44e4] > >> dlpar_remove_lmb+0x94/0x140 > >> [ 21.964223] [c003f8803b50] [c00d52b4] > >> dlpar_memory+0x464/0xd00 > >> [ 21.964259] [c003f8803be0] [c00cd5c0] > >> handle_dlpar_errorlog+0xc0/0x190 > >> [ 21.964303] [c003f8803c50] [c00cd6bc] > >> pseries_hp_work_fn+0x2c/0x60 > >> [ 21.964346] [c003f8803c80] [c013a4a0] > >> process_one_work+0x2b0/0x5a0 > >> [ 21.964388] [c003f8803d10] [c013a818] > >> worker_thread+0x88/0x610 > >> [ 21.964434] [c003f8803db0] [c0143884] kthread+0x1a4/0x1b0 > >> [ 21.964468] [c003f8803e20] [c000bdc4] > >> ret_from_kernel_thread+0x5c/0x78 > >> [ 21.964506] Instruction dump: > >> [ 21.964527] fbe1fff8 f821ffd1 78638502 78633664 ebe9 7fff1a14 > >> 395f0020 813f0020 > >> [ 21.964569] 7d2907b4 7d2900d0 79290fe0 69290001 <0b09> 7c0004ac > >> 7d205028 3129 > >> [ 21.964613] ---[ end trace aaa571aa1636fee6 ]--- > >> [ 21.966349] > >> [ 21.966383] Sending IPI to other CPUs > >> [ 21.978335] IPI complete > >> [ 21.981354] kexec: Starting switchover sequence. > >> I'm in purgatory > > > > git bisect points to > > > > commit
Re: [PATCH v3 3/3] i2c-ocores: sifive: add polling mode workaround for FU540-C000 SoC
Hi Andrew, On Thu, May 16, 2019 at 6:37 PM Andrew Lunn wrote: > > On Thu, May 16, 2019 at 10:38:40AM +0530, Sagar Shrikant Kadam wrote: > > The i2c-ocore driver already has a polling mode interface.But it needs > > a workaround for FU540 Chipset on HiFive unleashed board (RevA00). > > There is an erratum in FU540 chip that prevents interrupt driven i2c > > transfers from working, and also the I2C controller's interrupt bit > > cannot be cleared if set, due to this the existing i2c polling mode > > interface added in mainline earlier doesn't work, and CPU stall's > > infinitely, when-ever i2c transfer is initiated. > > > > Ref:previous polling mode support in mainline > > > > commit 69c8c0c0efa8 ("i2c: ocores: add polling interface") > > > > The workaround / fix under OCORES_FLAG_BROKEN_IRQ is particularly for > > FU540-COOO SoC. > > > > Signed-off-by: Sagar Shrikant Kadam > > --- > > drivers/i2c/busses/i2c-ocores.c | 34 -- > > 1 file changed, 28 insertions(+), 6 deletions(-) > > > > diff --git a/drivers/i2c/busses/i2c-ocores.c > > b/drivers/i2c/busses/i2c-ocores.c > > index aee1d86..00ee45c 100644 > > --- a/drivers/i2c/busses/i2c-ocores.c > > +++ b/drivers/i2c/busses/i2c-ocores.c > > @@ -27,6 +27,7 @@ > > #include > > > > #define OCORES_FLAG_POLL BIT(0) > > +#define OCORES_FLAG_BROKEN_IRQ BIT(2) /* Broken IRQ in HiFive Unleashed */ > > Hi Sigar > > BIT(1). Don't leave a gap. I will remove the gap and update this in V4. Thanks, Sagar > > Andrew
arch/xtensa/include/asm/uaccess.h:40:22: error: implicit declaration of function 'uaccess_kernel'; did you mean 'getname_kernel'?
Hi Matt, FYI, the error/warning still remains. tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: a188339ca5a396acc588e5851ed7e19f66b0ebd9 commit: 7df95299b94a63ec67a6389fc02dc25019a80ee8 staging: kpc2000: Add DMA driver date: 4 weeks ago config: xtensa-allmodconfig (attached as .config) compiler: xtensa-linux-gcc (GCC) 8.1.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross git checkout 7df95299b94a63ec67a6389fc02dc25019a80ee8 # save the attached .config to linux build tree GCC_VERSION=8.1.0 make.cross ARCH=xtensa If you fix the issue, kindly add following tag Reported-by: kbuild test robot All errors (new ones prefixed by >>): In file included from drivers/staging/kpc2000/kpc_dma/fileops.c:11: arch/xtensa/include/asm/uaccess.h: In function 'clear_user': >> arch/xtensa/include/asm/uaccess.h:40:22: error: implicit declaration of >> function 'uaccess_kernel'; did you mean 'getname_kernel'? 
>> [-Werror=implicit-function-declaration] #define __kernel_ok (uaccess_kernel()) ^~ arch/xtensa/include/asm/uaccess.h:43:34: note: in expansion of macro '__kernel_ok' #define __access_ok(addr, size) (__kernel_ok || __user_ok((addr), (size))) ^~~ arch/xtensa/include/asm/uaccess.h:44:31: note: in expansion of macro '__access_ok' #define access_ok(addr, size) __access_ok((unsigned long)(addr), (size)) ^~~ arch/xtensa/include/asm/uaccess.h:271:6: note: in expansion of macro 'access_ok' if (access_ok(addr, size)) ^ In file included from include/linux/printk.h:330, from include/linux/kernel.h:15, from include/linux/list.h:9, from include/linux/module.h:9, from drivers/staging/kpc2000/kpc_dma/fileops.c:2: drivers/staging/kpc2000/kpc_dma/fileops.c: In function 'kpc_dma_transfer': drivers/staging/kpc2000/kpc_dma/fileops.c:58:35: warning: format '%ld' expects argument of type 'long int', but argument 7 has type 'size_t' {aka 'unsigned int'} [-Wformat=] dev_dbg(>ldev->pldev->dev, "kpc_dma_transfer(priv = [%p], kcb = [%p], iov_base = [%p], iov_len = %ld) ldev = [%p]\n", priv, kcb, (void*)iov_base, iov_len, ldev); ^ include/linux/dynamic_debug.h:118:15: note: in definition of macro '__dynamic_func_call' func(, ##__VA_ARGS__); \ ^~~ include/linux/dynamic_debug.h:150:2: note: in expansion of macro '_dynamic_func_call' _dynamic_func_call(fmt,__dynamic_dev_dbg, \ ^~ include/linux/device.h:1493:2: note: in expansion of macro 'dynamic_dev_dbg' dynamic_dev_dbg(dev, dev_fmt(fmt), ##__VA_ARGS__) ^~~ include/linux/device.h:1493:23: note: in expansion of macro 'dev_fmt' dynamic_dev_dbg(dev, dev_fmt(fmt), ##__VA_ARGS__) ^~~ drivers/staging/kpc2000/kpc_dma/fileops.c:58:2: note: in expansion of macro 'dev_dbg' dev_dbg(>ldev->pldev->dev, "kpc_dma_transfer(priv = [%p], kcb = [%p], iov_base = [%p], iov_len = %ld) ldev = [%p]\n", priv, kcb, (void*)iov_base, iov_len, ldev); ^~~ cc1: some warnings being treated as errors -- In file included from drivers/staging//kpc2000/kpc_dma/fileops.c:11: 
arch/xtensa/include/asm/uaccess.h: In function 'clear_user': >> arch/xtensa/include/asm/uaccess.h:40:22: error: implicit declaration of >> function 'uaccess_kernel'; did you mean 'getname_kernel'? >> [-Werror=implicit-function-declaration] #define __kernel_ok (uaccess_kernel()) ^~ arch/xtensa/include/asm/uaccess.h:43:34: note: in expansion of macro '__kernel_ok' #define __access_ok(addr, size) (__kernel_ok || __user_ok((addr), (size))) ^~~ arch/xtensa/include/asm/uaccess.h:44:31: note: in expansion of macro '__access_ok' #define access_ok(addr, size) __access_ok((unsigned long)(addr), (size)) ^~~ arch/xtensa/include/asm/uaccess.h:271:6: note: in expansion of macro 'access_ok' if (access_ok(addr, size)) ^ In file included from include/linux/printk.h:330, from include/linux/kernel.h:15, from include/linux/list.h:9, from include/linux/module.h:9, from drivers/staging//kpc2000/kpc_dma/fileops.c:2: drivers/staging//kpc2000/kpc_dma/fileops.c: In function 'kpc_dma_transfer': drivers/staging//kpc2000/kpc_dma/fileops.c:58:35: warning: format '%ld' expects argument of type 'long int', but
drivers//pinctrl/pinctrl-stmfx.c:652:17: error: 'struct gpio_chip' has no member named 'of_node'
Hi Amelie, FYI, the error/warning still remains. tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: a188339ca5a396acc588e5851ed7e19f66b0ebd9 commit: 1490d9f841b186664f9d3ca213dcfa4464a60680 pinctrl: Add STMFX GPIO expander Pinctrl/GPIO driver date: 10 days ago config: um-allmodconfig (attached as .config) compiler: gcc-7 (Debian 7.3.0-1) 7.3.0 reproduce: git checkout 1490d9f841b186664f9d3ca213dcfa4464a60680 # save the attached .config to linux build tree make ARCH=um If you fix the issue, kindly add following tag Reported-by: kbuild test robot All errors (new ones prefixed by >>): drivers//pinctrl/pinctrl-stmfx.c: In function 'stmfx_pinctrl_probe': >> drivers//pinctrl/pinctrl-stmfx.c:652:17: error: 'struct gpio_chip' has no >> member named 'of_node' pctl->gpio_chip.of_node = np; ^ vim +652 drivers//pinctrl/pinctrl-stmfx.c 588 589 static int stmfx_pinctrl_probe(struct platform_device *pdev) 590 { 591 struct stmfx *stmfx = dev_get_drvdata(pdev->dev.parent); 592 struct device_node *np = pdev->dev.of_node; 593 struct stmfx_pinctrl *pctl; 594 u32 n; 595 int irq, ret; 596 597 pctl = devm_kzalloc(stmfx->dev, sizeof(*pctl), GFP_KERNEL); 598 if (!pctl) 599 return -ENOMEM; 600 601 platform_set_drvdata(pdev, pctl); 602 603 pctl->dev = >dev; 604 pctl->stmfx = stmfx; 605 606 if (!of_find_property(np, "gpio-ranges", NULL)) { 607 dev_err(pctl->dev, "missing required gpio-ranges property\n"); 608 return -EINVAL; 609 } 610 611 irq = platform_get_irq(pdev, 0); 612 if (irq <= 0) { 613 dev_err(pctl->dev, "failed to get irq\n"); 614 return -ENXIO; 615 } 616 617 mutex_init(>lock); 618 619 /* Register pin controller */ 620 pctl->pctl_desc.name = "stmfx-pinctrl"; 621 pctl->pctl_desc.pctlops = _pinctrl_ops; 622 pctl->pctl_desc.confops = _pinconf_ops; 623 pctl->pctl_desc.pins = stmfx_pins; 624 pctl->pctl_desc.npins = ARRAY_SIZE(stmfx_pins); 625 pctl->pctl_desc.owner = THIS_MODULE; 626 627 ret = devm_pinctrl_register_and_init(pctl->dev, >pctl_desc, 628 
pctl, >pctl_dev); 629 if (ret) { 630 dev_err(pctl->dev, "pinctrl registration failed\n"); 631 return ret; 632 } 633 634 ret = pinctrl_enable(pctl->pctl_dev); 635 if (ret) { 636 dev_err(pctl->dev, "pinctrl enable failed\n"); 637 return ret; 638 } 639 640 /* Register gpio controller */ 641 pctl->gpio_chip.label = "stmfx-gpio"; 642 pctl->gpio_chip.parent = pctl->dev; 643 pctl->gpio_chip.get_direction = stmfx_gpio_get_direction; 644 pctl->gpio_chip.direction_input = stmfx_gpio_direction_input; 645 pctl->gpio_chip.direction_output = stmfx_gpio_direction_output; 646 pctl->gpio_chip.get = stmfx_gpio_get; 647 pctl->gpio_chip.set = stmfx_gpio_set; 648 pctl->gpio_chip.set_config = gpiochip_generic_config; 649 pctl->gpio_chip.base = -1; 650 pctl->gpio_chip.ngpio = pctl->pctl_desc.npins; 651 pctl->gpio_chip.can_sleep = true; > 652 pctl->gpio_chip.of_node = np; 653 pctl->gpio_chip.need_valid_mask = true; 654 655 ret = devm_gpiochip_add_data(pctl->dev, >gpio_chip, pctl); 656 if (ret) { 657 dev_err(pctl->dev, "gpio_chip registration failed\n"); 658 return ret; 659 } 660 661 ret = stmfx_pinctrl_gpio_function_enable(pctl); 662 if (ret) 663 return ret; 664 665 pctl->irq_chip.name = dev_name(pctl->dev); 666 pctl->irq_chip.irq_mask = stmfx_pinctrl_irq_mask; 667 pctl->irq_chip.irq_unmask = stmfx_pinctrl_irq_unmask; 668 pctl->irq_chip.irq_set_type = stmfx_pinctrl_irq_set_type; 669 pctl->irq_chip.irq_bus_lock = stmfx_pinctrl_irq_bus_lock; 670 pctl->irq_chip.irq_bus_sync_unlock = stmfx_pinctrl_irq_bus_sync_unlock; 671 for_each_clear_bit(n, >gpio_valid_mask, pctl->gpio_chip.ngpio) 672 clear_bit(n, pctl->gpio_chip.valid_mask); 673 674 ret = gpiochip_irqchip_add_nested(>gpio_chip, >irq_chip, 6750, handle_bad_irq, IRQ_TYPE_NONE); 676 if (ret) { 677 dev_err(pctl->dev, "cannot add
ERROR: "uio_unregister_device" [drivers/staging/kpc2000/kpc2000/kpc2000.ko] undefined!
tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: a188339ca5a396acc588e5851ed7e19f66b0ebd9 commit: 7dc7967fc39af81191558f63eeaf3d2b83899b1c staging: kpc2000: add initial set of Daktronics drivers date: 4 weeks ago config: x86_64-randconfig-m3-05200349 (attached as .config) compiler: gcc-7 (Debian 7.3.0-1) 7.3.0 reproduce: git checkout 7dc7967fc39af81191558f63eeaf3d2b83899b1c # save the attached .config to linux build tree make ARCH=x86_64 If you fix the issue, kindly add following tag Reported-by: kbuild test robot All errors (new ones prefixed by >>): >> ERROR: "uio_unregister_device" [drivers/staging/kpc2000/kpc2000/kpc2000.ko] >> undefined! >> ERROR: "__uio_register_device" [drivers/staging/kpc2000/kpc2000/kpc2000.ko] >> undefined! --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
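These undefined references mean kpc2000.ko links against the UIO core without anything guaranteeing CONFIG_UIO is enabled in this randconfig. A sketch of the usual Kconfig-level fix (the option name, prompt text, and placement are assumptions, not taken from the tree):

```kconfig
# drivers/staging/kpc2000/Kconfig (sketch; names assumed)
config KPC2000
	tristate "Daktronics KPC PCIe device drivers"
	depends on PCI
	select UIO    # kpc2000.ko calls __uio_register_device()/uio_unregister_device()
	help
	  ...
```

`select UIO` forces the dependency on; alternatively `depends on UIO` would simply hide the driver when UIO is disabled.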
Re: [PATCH v2] pstore: Add boot loader log messages support
On Wed, 15 May 2019 15:04:09 -0700 Kees Cook wrote: > Hi! > > Thanks for the reminder to review this code. :) Sorry for the delay! > > On Thu, Feb 14, 2019 at 11:49 PM Yue Hu wrote: > > > > From: Yue Hu > > > > Sometimes we hope to check boot loader log messages (e.g. Android > > Verified Boot status) when kernel is coming up. Generally it does > > depend on serial device, but it will be removed for the hardware > > shipping to market by most of manufacturers. In that case better > > solder and proper serial cable for different interface (e.g. Type-C > > or microUSB) are needed. That is inconvenient and even wasting much > > time on it. > > Can you give some examples of how this would be used on a real device? > More notes below... > > > > > Therefore, let's add a logging support: PSTORE_TYPE_XBL. > > > > Signed-off-by: Yue Hu > > --- > > v2: mention info of interacting with boot loader > > > > fs/pstore/Kconfig | 10 +++ > > fs/pstore/platform.c | 16 ++ > > fs/pstore/ram.c| 81 > > -- > > include/linux/pstore.h | 21 + > > 4 files changed, 121 insertions(+), 7 deletions(-) > > > > diff --git a/fs/pstore/Kconfig b/fs/pstore/Kconfig > > index 0d19d19..ef4a2dc 100644 > > --- a/fs/pstore/Kconfig > > +++ b/fs/pstore/Kconfig > > @@ -137,6 +137,16 @@ config PSTORE_FTRACE > > > > If unsure, say N. > > > > +config PSTORE_XBL > > + bool "Log bootloader messages" > > + depends on PSTORE > > + help > > + When the option is enabled, pstore will log boot loader > > + messages to /sys/fs/pstore/xbl-ramoops-[ID] after reboot. > > + Boot loader needs to support log buffer reserved. > > + > > + If unsure, say N. 
> > + > > config PSTORE_RAM > > tristate "Log panic/oops to a RAM buffer" > > depends on PSTORE > > diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c > > index 2d1066e..2e6c3f8f 100644 > > --- a/fs/pstore/platform.c > > +++ b/fs/pstore/platform.c > > @@ -65,6 +65,7 @@ > > "mce", > > "console", > > "ftrace", > > + "xbl", > > "rtas", > > "powerpc-ofw", > > "powerpc-common", > > @@ -530,6 +531,19 @@ static void pstore_register_console(void) {} > > static void pstore_unregister_console(void) {} > > #endif > > > > +#ifdef CONFIG_PSTORE_XBL > > +static void pstore_register_xbl(void) > > +{ > > + struct pstore_record record; > > + > > + pstore_record_init(, psinfo); > > + record.type = PSTORE_TYPE_XBL; > > + psinfo->write(); > > +} > > This seems like a very strange way to get the record: this is an > "empty" write that has a side-effect of reading the XBL region and > copying it into the prz area. I would expect this to all happen in > ramoops_pstore_read() instead. > > > +#else > > +static void pstore_register_xbl(void) {} > > +#endif > > + > > static int pstore_write_user_compat(struct pstore_record *record, > > const char __user *buf) > > { > > @@ -616,6 +630,8 @@ int pstore_register(struct pstore_info *psi) > > pstore_register_ftrace(); > > if (psi->flags & PSTORE_FLAGS_PMSG) > > pstore_register_pmsg(); > > + if (psi->flags & PSTORE_FLAGS_XBL) > > + pstore_register_xbl(); > > > > /* Start watching for new records, if desired. */ > > if (pstore_update_ms >= 0) { > > diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c > > index 1adb5e3..b114b1d 100644 > > --- a/fs/pstore/ram.c > > +++ b/fs/pstore/ram.c > > @@ -56,6 +56,27 @@ > > module_param_named(pmsg_size, ramoops_pmsg_size, ulong, 0400); > > MODULE_PARM_DESC(pmsg_size, "size of user space message log"); > > How is the base address of the XBL area specified? It looks currently > like it's up to the end user to do all the math correctly to line it > up with where it's expected? 
> > > > > +/* > > + * interact with boot loader > > + * = > > + * > > + * xbl memory layout: > > + * ++ > > + * |dst | > > + * || ++ > > + * |src > | header | > > + * ++ |log messages| > > + *++ > > + * > > + * As above, src memory is used to store header and log messages generated > > + * by boot loader, pstore will copy the log messages to dst memory which > > + * has same size as src. The header in src is to record log messages size > > + * written and make xbl cookie. > > Why is such a copy needed? The log is already present in memory; why > can't pstore just use what's already there? > > > + */ > > +static ulong ramoops_xbl_size = MIN_MEM_SIZE; > > +module_param_named(xbl_size, ramoops_xbl_size, ulong, 0400); > > +MODULE_PARM_DESC(xbl_size, "size of boot loader log"); > > + > > static unsigned long long mem_address; > > module_param_hw(mem_address, ullong, other, 0400); > > MODULE_PARM_DESC(mem_address, > > @@ -88,6
Re: [PATCH v3] kbuild: check uniqueness of module names
On Mon, May 20, 2019 at 8:52 AM Stephen Rothwell wrote: > > Hi Masahiro, > > On Sat, 18 May 2019 01:07:15 +0900 Masahiro Yamada > wrote: > > > > It checks not only real modules, but also built-in modules (i.e. > > controlled by tristate CONFIG option, but currently compiled with =y). > > Non-unique names for built-in modules also cause problems because > > /sys/modules/ would fall over. > > > > I tested allmodconfig on the latest kernel, and it detected the > > following: > > A powerpc ppc64_defconfig produces: > > warning: same basename if the following are built as modules: > arch/powerpc/platforms/powermac/nvram.ko > drivers/char/nvram.ko > > Which is a false positive since > arch/powerpc/platforms/powermac/Makefile has > > # CONFIG_NVRAM is an arch. independent tristate symbol, for pmac32 we really > # need this to be a bool. Cheat here and pretend CONFIG_NVRAM=m is really > # CONFIG_NVRAM=y > obj-$(CONFIG_NVRAM:m=y) += nvram.o > > Which means that this nvram.o will never be built as a module. > -- > Cheers, > Stephen Rothwell BTW, arm64 defconfig also produces a false positive: warning: same basename if the following are built as modules: arch/arm64/lib/crc32.ko lib/crc32.ko CONFIG_CRC32 is a tristate option, but ARM64 selects CRC32. So, CRC32 is always =y. We must stop checking modules.builtin soon. Sorry about noises. -- Best Regards Masahiro Yamada
[RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
A system could have a much faster swap device like zRAM. In that case, swapping is far cheaper than file I/O on the low-end storage. In this configuration, userspace could apply a different strategy to each kind of vma. IOW, it may want to reclaim anonymous pages with MADV_COLD while keeping file-backed pages in the inactive LRU with MADV_COOL, because file I/O is more expensive in this case, so the goal is to keep them in memory until memory pressure happens. To support such a strategy more easily, this patch introduces the MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options to madvise(2), similar to the filters /proc/<pid>/clear_refs already supports. They are filters that can be ORed with other existing hints, using the top two bits of (int behavior). Once either of them is set, the hint affects only the vmas of interest, either anonymous or file-backed. With that, a user could call the process_madvise syscall with an entire range (0x0 - 0x) plus either MADV_ANONYMOUS_FILTER or MADV_FILE_FILTER, so there is no need to call the syscall range by range.
* from v1r2 * use consistent check with clear_refs to identify anon/file vma - surenb * from v1r1 * use naming "filter" for new madvise option - dancol Signed-off-by: Minchan Kim --- include/uapi/asm-generic/mman-common.h | 5 + mm/madvise.c | 14 ++ 2 files changed, 19 insertions(+) diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index b8e230de84a6..be59a1b90284 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -66,6 +66,11 @@ #define MADV_WIPEONFORK 18 /* Zero memory on fork, child only */ #define MADV_KEEPONFORK 19 /* Undo MADV_WIPEONFORK */ +#define MADV_BEHAVIOR_MASK (~(MADV_ANONYMOUS_FILTER|MADV_FILE_FILTER)) + +#define MADV_ANONYMOUS_FILTER (1<<31) /* works for only anonymous vma */ +#define MADV_FILE_FILTER (1<<30) /* works for only file-backed vma */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/mm/madvise.c b/mm/madvise.c index f4f569dac2bd..116131243540 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1002,7 +1002,15 @@ static int madvise_core(struct task_struct *tsk, unsigned long start, int write; size_t len; struct blk_plug plug; + bool anon_only, file_only; + anon_only = behavior & MADV_ANONYMOUS_FILTER; + file_only = behavior & MADV_FILE_FILTER; + + if (anon_only && file_only) + return error; + + behavior = behavior & MADV_BEHAVIOR_MASK; if (!madvise_behavior_valid(behavior)) return error; @@ -1067,12 +1075,18 @@ static int madvise_core(struct task_struct *tsk, unsigned long start, if (end < tmp) tmp = end; + if (anon_only && vma->vm_file) + goto next; + if (file_only && !vma->vm_file) + goto next; + /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */ error = madvise_vma(tsk, vma, , start, tmp, behavior, ); if (error) goto out; *nr_pages += pages; +next: start = tmp; if (prev && start < prev->vm_end) start = prev->vm_end; -- 2.21.0.1020.gf2820cf01a-goog
[RFC 4/7] mm: factor out madvise's core functionality
This patch factor out madvise's core functionality so that upcoming patch can reuse it without duplication. It shouldn't change any behavior. Signed-off-by: Minchan Kim --- mm/madvise.c | 168 +++ 1 file changed, 89 insertions(+), 79 deletions(-) diff --git a/mm/madvise.c b/mm/madvise.c index 9a6698b56845..119e82e1f065 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -742,7 +742,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma, return 0; } -static long madvise_dontneed_free(struct vm_area_struct *vma, +static long madvise_dontneed_free(struct task_struct *tsk, + struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end, int behavior) @@ -754,8 +755,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, if (!userfaultfd_remove(vma, start, end)) { *prev = NULL; /* mmap_sem has been dropped, prev is stale */ - down_read(>mm->mmap_sem); - vma = find_vma(current->mm, start); + down_read(>mm->mmap_sem); + vma = find_vma(tsk->mm, start); if (!vma) return -ENOMEM; if (start < vma->vm_start) { @@ -802,7 +803,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, * Application wants to free up the pages and associated backing store. * This is effectively punching a hole into the middle of a file. 
*/ -static long madvise_remove(struct vm_area_struct *vma, +static long madvise_remove(struct task_struct *tsk, + struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end) { @@ -836,13 +838,13 @@ static long madvise_remove(struct vm_area_struct *vma, get_file(f); if (userfaultfd_remove(vma, start, end)) { /* mmap_sem was not released by userfaultfd_remove() */ - up_read(>mm->mmap_sem); + up_read(>mm->mmap_sem); } error = vfs_fallocate(f, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, end - start); fput(f); - down_read(>mm->mmap_sem); + down_read(>mm->mmap_sem); return error; } @@ -916,12 +918,13 @@ static int madvise_inject_error(int behavior, #endif static long -madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, - unsigned long start, unsigned long end, int behavior) +madvise_vma(struct task_struct *tsk, struct vm_area_struct *vma, + struct vm_area_struct **prev, unsigned long start, + unsigned long end, int behavior) { switch (behavior) { case MADV_REMOVE: - return madvise_remove(vma, prev, start, end); + return madvise_remove(tsk, vma, prev, start, end); case MADV_WILLNEED: return madvise_willneed(vma, prev, start, end); case MADV_COOL: @@ -930,7 +933,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, return madvise_cold(vma, start, end); case MADV_FREE: case MADV_DONTNEED: - return madvise_dontneed_free(vma, prev, start, end, behavior); + return madvise_dontneed_free(tsk, vma, prev, start, + end, behavior); default: return madvise_behavior(vma, prev, start, end, behavior); } @@ -974,68 +978,8 @@ madvise_behavior_valid(int behavior) } } -/* - * The madvise(2) system call. - * - * Applications can use madvise() to advise the kernel how it should - * handle paging I/O in this VM area. The idea is to help the kernel - * use appropriate read-ahead and caching techniques. 
The information - * provided is advisory only, and can be safely disregarded by the - * kernel without affecting the correct operation of the application. - * - * behavior values: - * MADV_NORMAL - the default behavior is to read clusters. This - * results in some read-ahead and read-behind. - * MADV_RANDOM - the system should read the minimum amount of data - * on any access, since it is unlikely that the appli- - * cation will need more than what it asks for. - * MADV_SEQUENTIAL - pages in the given range will probably be accessed - * once, so they can be aggressively read ahead, and - * can be freed soon after they are accessed. - * MADV_WILLNEED - the application is notifying the system to read - * some pages ahead. - * MADV_DONTNEED - the application is finished with the given range, - * so the kernel can free resources
[RFC 3/7] mm: introduce MADV_COLD
When a process expects no accesses to a certain memory range for a long time, it can hint to the kernel that the pages can be reclaimed instantly but that their data should be preserved for future use. This could reduce working-set eviction and so end up increasing performance. This patch introduces the new MADV_COLD hint to the madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used for a long time. The hint can help the kernel decide which pages to evict proactively. Internally, it works by reclaiming memory in the context of the process that called the syscall. If a page is dirty and the backing storage is not a synchronous device, the written page will be rotated back to the LRU's tail once the write is done, so it will be reclaimed easily when memory pressure happens. If the backing storage is a synchronous device (e.g., zram), the page will be reclaimed instantly. Signed-off-by: Minchan Kim --- include/linux/swap.h | 1 + include/uapi/asm-generic/mman-common.h | 1 + mm/madvise.c | 123 + mm/vmscan.c| 74 +++ 4 files changed, 199 insertions(+) diff --git a/include/linux/swap.h b/include/linux/swap.h index 64795abea003..7f32a948fc6a 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -365,6 +365,7 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern unsigned long vm_total_pages; +extern unsigned long reclaim_pages(struct list_head *page_list); #ifdef CONFIG_NUMA extern int node_reclaim_mode; extern int sysctl_min_unmapped_ratio; diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index f7a4a5d4b642..b9b51eeb8e1a 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -43,6 +43,7 @@ #define MADV_WILLNEED 3 /* will need these pages */ #define MADV_DONTNEED 4 /* don't need these pages */ #define MADV_COOL 5 /* deactivatie these pages */ +#define MADV_COLD 6 /* reclaim these pages */ /* common parameters: try to
keep these consistent across architectures */ #define MADV_FREE 8 /* free pages only if memory pressure */ diff --git a/mm/madvise.c b/mm/madvise.c index c05817fb570d..9a6698b56845 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -42,6 +42,7 @@ static int madvise_need_mmap_write(int behavior) case MADV_WILLNEED: case MADV_DONTNEED: case MADV_COOL: + case MADV_COLD: case MADV_FREE: return 0; default: @@ -416,6 +417,125 @@ static long madvise_cool(struct vm_area_struct *vma, return 0; } +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr, + unsigned long end, struct mm_walk *walk) +{ + pte_t *orig_pte, *pte, ptent; + spinlock_t *ptl; + LIST_HEAD(page_list); + struct page *page; + int isolated = 0; + struct vm_area_struct *vma = walk->vma; + unsigned long next; + + next = pmd_addr_end(addr, end); + if (pmd_trans_huge(*pmd)) { + spinlock_t *ptl; + + ptl = pmd_trans_huge_lock(pmd, vma); + if (!ptl) + return 0; + + if (is_huge_zero_pmd(*pmd)) + goto huge_unlock; + + page = pmd_page(*pmd); + if (page_mapcount(page) > 1) + goto huge_unlock; + + if (next - addr != HPAGE_PMD_SIZE) { + int err; + + get_page(page); + spin_unlock(ptl); + lock_page(page); + err = split_huge_page(page); + unlock_page(page); + put_page(page); + if (!err) + goto regular_page; + return 0; + } + + if (isolate_lru_page(page)) + goto huge_unlock; + + list_add(>lru, _list); +huge_unlock: + spin_unlock(ptl); + reclaim_pages(_list); + return 0; + } + + if (pmd_trans_unstable(pmd)) + return 0; +regular_page: + orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, ); + for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) { + ptent = *pte; + if (!pte_present(ptent)) + continue; + + page = vm_normal_page(vma, addr, ptent); + if (!page) + continue; + + if (page_mapcount(page) > 1) + continue; + + if (isolate_lru_page(page)) + continue; + + isolated++; + list_add(>lru, _list); +
[RFC 5/7] mm: introduce external memory hinting API
There are use cases where a centralized userspace daemon wants to give a memory hint like MADV_[COOL|COLD] to another process. Android's ActivityManagerService is one of them. It's similar in spirit to madvise(MADV_DONTNEED), but the information required to make the reclaim decision is not known to the app. Instead, it is known to the centralized userspace daemon (ActivityManagerService), and that daemon must be able to initiate reclaim on its own without any app involvement. To solve the issue, this patch introduces a new syscall, process_madvise(2), which works based on a pidfd so it can give a hint to an external process. int process_madvise(int pidfd, void *addr, size_t length, int advise); All hints madvise provides can be supported in process_madvise, too. Since it can affect another process's address range, only a process with the right to ptrace the target, such as a privileged process (CAP_SYS_PTRACE) or one with the same UID, can use it successfully. Please suggest a better idea if you have other thoughts about the permission model.
* from v1r1 * use ptrace capability - surenb, dancol Signed-off-by: Minchan Kim --- arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + include/linux/proc_fs.h| 1 + include/linux/syscalls.h | 2 ++ include/uapi/asm-generic/unistd.h | 2 ++ kernel/signal.c| 2 +- kernel/sys_ni.c| 1 + mm/madvise.c | 45 ++ 8 files changed, 54 insertions(+), 1 deletion(-) diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 4cd5f982b1e5..5b9dd55d6b57 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -438,3 +438,4 @@ 425i386io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup 426i386io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter 427i386io_uring_register sys_io_uring_register __ia32_sys_io_uring_register +428i386process_madvise sys_process_madvise __ia32_sys_process_madvise diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 64ca0d06259a..0e5ee78161c9 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -355,6 +355,7 @@ 425common io_uring_setup __x64_sys_io_uring_setup 426common io_uring_enter __x64_sys_io_uring_enter 427common io_uring_register __x64_sys_io_uring_register +428common process_madvise __x64_sys_process_madvise # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h index 52a283ba0465..f8545d7c5218 100644 --- a/include/linux/proc_fs.h +++ b/include/linux/proc_fs.h @@ -122,6 +122,7 @@ static inline struct pid *tgid_pidfd_to_pid(const struct file *file) #endif /* CONFIG_PROC_FS */ +extern struct pid *pidfd_to_pid(const struct file *file); struct net; static inline struct proc_dir_entry *proc_net_mkdir( diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index e2870fe1be5b..21c6c9a62006 100644 --- a/include/linux/syscalls.h +++ 
b/include/linux/syscalls.h @@ -872,6 +872,8 @@ asmlinkage long sys_munlockall(void); asmlinkage long sys_mincore(unsigned long start, size_t len, unsigned char __user * vec); asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior); +asmlinkage long sys_process_madvise(int pid_fd, unsigned long start, + size_t len, int behavior); asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size, unsigned long prot, unsigned long pgoff, unsigned long flags); diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index dee7292e1df6..7ee82ce04620 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -832,6 +832,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup) __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter) #define __NR_io_uring_register 427 __SYSCALL(__NR_io_uring_register, sys_io_uring_register) +#define __NR_process_madvise 428 +__SYSCALL(__NR_process_madvise, sys_process_madvise) #undef __NR_syscalls #define __NR_syscalls 428 diff --git a/kernel/signal.c b/kernel/signal.c index 1c86b78a7597..04e75daab1f8 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -3620,7 +3620,7 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info) return copy_siginfo_from_user(kinfo, info); } -static struct pid *pidfd_to_pid(const struct file *file) +struct pid *pidfd_to_pid(const struct file *file) { if (file->f_op == _fops)
[RFC 6/7] mm: extend process_madvise syscall to support vector array
Currently, the process_madvise syscall works on only one address range, so the user has to call the syscall several times to give hints to multiple address ranges. This patch extends the process_madvise syscall to support multiple hints, address ranges and return values so the user can give all the hints at once. struct pr_madvise_param { int size; /* the size of this structure */ const struct iovec __user *vec; /* address range array */ } int process_madvise(int pidfd, ssize_t nr_elem, int *behavior, struct pr_madvise_param *results, struct pr_madvise_param *ranges, unsigned long flags); - pidfd target process fd - nr_elem the number of elements of the behavior, results and ranges arrays - behavior hints for each address range in the remote process, so the user can give a different hint for each range - results array of buffers to receive the results of the associated remote address range actions - ranges array of buffers holding the remote process's address ranges to be processed - flags extra argument for the future. It should be zero for now.
Example)

struct pr_madvise_param {
	int size;
	const struct iovec *vec;
};

int main(int argc, char *argv[])
{
	struct pr_madvise_param retp, rangep;
	struct iovec result_vec[2], range_vec[2];
	int hints[2];
	long ret[2];
	void *addr[2];
	pid_t pid;
	char cmd[64] = {0,};

	addr[0] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
		       MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
	if (MAP_FAILED == addr[0])
		return 1;
	addr[1] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
		       MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
	if (MAP_FAILED == addr[1])
		return 1;

	hints[0] = MADV_COLD;
	range_vec[0].iov_base = addr[0];
	range_vec[0].iov_len = ALLOC_SIZE;
	result_vec[0].iov_base = &ret[0];
	result_vec[0].iov_len = sizeof(long);
	retp.vec = result_vec;
	retp.size = sizeof(struct pr_madvise_param);

	hints[1] = MADV_COOL;
	range_vec[1].iov_base = addr[1];
	range_vec[1].iov_len = ALLOC_SIZE;
	result_vec[1].iov_base = &ret[1];
	result_vec[1].iov_len = sizeof(long);
	rangep.vec = range_vec;
	rangep.size = sizeof(struct pr_madvise_param);

	pid = fork();
	if (!pid) {
		sleep(10);
	} else {
		snprintf(cmd, sizeof(cmd), "/proc/%d", pid);
		int pidfd = open(cmd, O_DIRECTORY | O_CLOEXEC);

		if (pidfd < 0)
			return 1;

		/* munmap to make pages private for the child */
		munmap(addr[0], ALLOC_SIZE);
		munmap(addr[1], ALLOC_SIZE);
		system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
		if (syscall(__NR_process_madvise, pidfd, 2, hints,
			    &retp, &rangep, 0))
			perror("process_madvise fail\n");
		system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
	}
	return 0;
}

Signed-off-by: Minchan Kim
---
 include/uapi/asm-generic/mman-common.h |   5 +
 mm/madvise.c                           | 184 ++++++++++++++++++++++++++---
 2 files changed, 166 insertions(+), 23 deletions(-)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index b9b51eeb8e1a..b8e230de84a6 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -74,4 +74,9 @@
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
				 PKEY_DISABLE_WRITE)

+struct pr_madvise_param {
+	int size;			/* the size of this structure */
+	const struct iovec __user *vec;	/* address range array */
+};
+
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index af02aa17e5c1..f4f569dac2bd 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -320,6 +320,7 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
 	struct page *page;
 	struct vm_area_struct *vma = walk->vma;
 	unsigned long next;
+	long nr_pages = 0;

 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd)) {
@@ -380,9 +381,12 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
 		ptep_test_and_clear_young(vma, addr, pte);
 		deactivate_page(page);
+		nr_pages++;
+
 	}

 	pte_unmap_unlock(orig_pte, ptl);
+	*(long *)walk->private += nr_pages;
 	cond_resched();

 	return 0;
@@ -390,11 +394,13 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,

 static void madvise_cool_page_range(struct mmu_gather *tlb,
			struct vm_area_struct *vma,
-			unsigned long addr,
[RFC 2/7] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
The local variable references in shrink_page_list defaults to PAGEREF_RECLAIM_CLEAN in order to prevent reclaiming dirty pages when CMA tries to migrate pages. Strictly speaking, we don't need it because CMA doesn't allow write-out, via .may_writepage = 0 in reclaim_clean_pages_from_list. Moreover, it prevents anonymous pages from being swapped out even though force_reclaim = true in shrink_page_list in an upcoming patch. So this patch makes the default value of references PAGEREF_RECLAIM and renames force_reclaim to skip_reference_check to make it clearer. This is preparatory work for the next patch.

Signed-off-by: Minchan Kim
---
 mm/vmscan.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d9c3e873eca6..a28e5d17b495 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1102,7 +1102,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
				      struct scan_control *sc,
				      enum ttu_flags ttu_flags,
				      struct reclaim_stat *stat,
-				      bool force_reclaim)
+				      bool skip_reference_check)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
@@ -1116,7 +1116,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		struct address_space *mapping;
 		struct page *page;
 		int may_enter_fs;
-		enum page_references references = PAGEREF_RECLAIM_CLEAN;
+		enum page_references references = PAGEREF_RECLAIM;
 		bool dirty, writeback;

 		cond_resched();
@@ -1248,7 +1248,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}

-		if (!force_reclaim)
+		if (!skip_reference_check)
 			references = page_check_references(page, sc);

 		switch (references) {
-- 
2.21.0.1020.gf2820cf01a-goog
[RFC 1/7] mm: introduce MADV_COOL
When a process expects no accesses to a certain memory range, it can hint the kernel that the pages can be reclaimed when memory pressure happens, but that the data should be preserved for future use. This can reduce working-set eviction, so it ends up increasing performance.

This patch introduces the new MADV_COOL hint to the madvise(2) syscall. MADV_COOL can be used by a process to mark a memory range as not expected to be used in the near future. The hint helps the kernel decide which pages to evict early during memory pressure.

Internally, it works by deactivating memory from the active list to the head of the inactive list, so that when memory pressure happens, those pages will be reclaimed earlier than other active pages unless they are accessed again before then.

* v1r2
 * use clear_page_young in deactivate_page - joelaf

* v1r1
 * Revise the description - surenb
 * Renaming from MADV_WARM to MADV_COOL - surenb

Signed-off-by: Minchan Kim
---
 include/linux/page-flags.h             |   1 +
 include/linux/page_idle.h              |  15 ++++
 include/linux/swap.h                   |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/madvise.c                           | 112 +++++++++++++++++++++
 mm/swap.c                              |  43 ++++++++
 6 files changed, 173 insertions(+)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9f8712a4b1a5..58b06654c8dd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
 TESTPAGEFLAG(Young, young, PF_ANY)
 SETPAGEFLAG(Young, young, PF_ANY)
 TESTCLEARFLAG(Young, young, PF_ANY)
+CLEARPAGEFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif

diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..f3f43b317150 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
 	SetPageYoung(page);
 }

+static inline void clear_page_young(struct page *page)
+{
+	ClearPageYoung(page);
+}
+
 static inline bool test_and_clear_page_young(struct page *page)
 {
 	return TestClearPageYoung(page);
@@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
 	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
 }

+static void clear_page_young(struct page *page)
+{
+	struct page_ext *page_ext = lookup_page_ext(page);
+
+	if (unlikely(!page_ext))
+		return;
+
+	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+}
+
 static inline bool test_and_clear_page_young(struct page *page)
 {
 	struct page_ext *page_ext = lookup_page_ext(page);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bfb5c4ac108..64795abea003 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index abd238d0f7a4..f7a4a5d4b642 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -42,6 +42,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_COOL	5		/* deactivate these pages */

 /* common parameters: try to keep these consistent across architectures */
 #define MADV_FREE	8		/* free pages only if memory pressure */
diff --git a/mm/madvise.c b/mm/madvise.c
index 628022e674a7..c05817fb570d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -8,6 +8,7 @@
 #include
 #include
+#include
 #include
 #include
 #include
@@ -40,6 +41,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_COOL:
 	case MADV_FREE:
 		return 0;
 	default:
@@ -307,6 +309,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	return 0;
 }

+static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+{
+	pte_t *orig_pte, *pte, ptent;
+	spinlock_t *ptl;
+	struct page *page;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long next;
+
+	next = pmd_addr_end(addr, end);
+	if (pmd_trans_huge(*pmd)) {
+		spinlock_t *ptl;
+
+		ptl = pmd_trans_huge_lock(pmd, vma);
+		if (!ptl)
+			return 0;
+
+		if
[RFC 0/7] introduce memory hinting API for external process
- Background

The Android terminology for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster, so we are trying to make hot starts more likely than cold starts.

To increase hot starts, Android userspace manages the order in which apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd (the low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements, but those improvements will vary based on the memory requirements of individual workloads.

- Problem

Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidates for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed (we measured the performance of swapping out vs. zapping the memory by killing a process; unsurprisingly, zapping is 10x faster even though we use zram, which is much faster than real storage), so kills from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap.

- Approach

The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information.
This allowed us to bypass the inaccuracy of the kernel's LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it gives the platform many chances to use its richer information to optimize memory efficiency.

To achieve the goal, the patchset introduces two new options for madvise. One is MADV_COOL, which will deactivate activated pages, and the other is MADV_COLD, which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in that it hints the kernel that the memory region is not currently needed and should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in that it hints the kernel that the memory region is not currently needed and should be reclaimed when memory pressure rises.

This approach is similar in spirit to madvise(MADV_DONTNEED), but the information required to make the reclaim decision is not known to the app. Instead, it is known to a centralized userspace daemon, and that daemon must be able to initiate reclaim on its own without any app involvement.
To address this concern, this patch introduces a new syscall:

struct pr_madvise_param {
	int size;
	const struct iovec *vec;
}

int process_madvise(int pidfd, ssize_t nr_elem,
		    int *behavior,
		    struct pr_madvise_param *results,
		    struct pr_madvise_param *ranges,
		    unsigned long flags);

The syscall gets a pidfd to give hints to an external process and provides a pair of results/ranges vector arguments so that it can give several hints for different address ranges all at once.

I guess others have different ideas about the naming of the syscall and options, so feel free to suggest better naming.

- Experiment

We did a bunch of testing with several hundred real users on Android, not an artificial benchmark. We saw about a 17% reduction in cold starts without any significant battery or app startup latency issues. And with an artificial benchmark which launches and switches apps, we saw on average a 7% app launching improvement, 18% fewer lmkd kills and good stats from vmstat.

A is vanilla and B is process_madvise.

                     A        B       delta    ratio(%)
allocstall_dma       0        0       0        0.00
allocstall_movable   1464457  -1007   -69.00
allocstall_normal    263210   190763  -72447
Re: [Qemu-devel] [PATCH v9 2/7] virtio-pmem: Add virtio pmem driver
> > > > On 5/16/19 10:35 PM, Pankaj Gupta wrote: > > > Can I take it your reviewed/acked-by? or tested-by tag? for the virtio > > > patch :)I don't feel that I have enough expertise to give the reviewed-by > > > tag, but you can > > take my acked-by + tested-by. > > > > Acked-by: Jakub Staron > > Tested-by: Jakub Staron > > > > No kernel panics/stalls encountered during testing this patches (v9) with > > QEMU + xfstests. > > Thank you for testing and confirming the results. I will add your tested & > acked-by in v10. > > > Some CPU stalls encountered while testing with crosvm instead of QEMU with > > xfstests > > (test generic/464) but no repro for QEMU, so the fault may be on the side > > of > > crosvm. > > yes, looks like crosvm related as we did not see any of this in my and your > testing with Qemu. Also, they don't seem to be related with virtio-pmem. Thanks, Pankaj
[PATCH v2] tipc: Avoid copying bytes beyond the supplied data
TLV_SET is called with a data pointer and a len parameter that tells us how many bytes are pointed to by data. When invoking memcpy() we need to be careful to only copy len bytes. Previously we would copy TLV_LENGTH(len) bytes, which would copy an extra 4 bytes past the end of the data pointer, which newer GCC versions complain about:

  In file included from test.c:17:
  In function 'TLV_SET',
      inlined from 'test' at test.c:186:5:
  /usr/include/linux/tipc_config.h:317:3: warning: 'memcpy' forming offset [33, 36] is out of the bounds [0, 32] of object 'bearer_name' with type 'char[32]' [-Warray-bounds]
     memcpy(TLV_DATA(tlv_ptr), data, tlv_len);
     ^~~~
  test.c: In function 'test':
  test.c:161:10: note: 'bearer_name' declared here
     char bearer_name[TIPC_MAX_BEARER_NAME];
     ^~~

We still want to ensure any padding bytes at the end are initialised; do this with an explicit memset() rather than copying bytes past the end of data. Apply the same logic to TCM_SET.

Signed-off-by: Chris Packham
---
Changes in v2:
- Ensure padding bytes are initialised in both TLV_SET and TCM_SET

 include/uapi/linux/tipc_config.h | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/tipc_config.h b/include/uapi/linux/tipc_config.h
index 4b2c93b1934c..4955e1a9f1bc 100644
--- a/include/uapi/linux/tipc_config.h
+++ b/include/uapi/linux/tipc_config.h
@@ -307,8 +307,10 @@ static inline int TLV_SET(void *tlv, __u16 type, void *data, __u16 len)
 	tlv_ptr = (struct tlv_desc *)tlv;
 	tlv_ptr->tlv_type = htons(type);
 	tlv_ptr->tlv_len = htons(tlv_len);
-	if (len && data)
-		memcpy(TLV_DATA(tlv_ptr), data, tlv_len);
+	if (len && data) {
+		memcpy(TLV_DATA(tlv_ptr), data, len);
+		memset(TLV_DATA(tlv_ptr) + len, 0, TLV_SPACE(len) - tlv_len);
+	}
 	return TLV_SPACE(len);
 }

@@ -405,8 +407,10 @@ static inline int TCM_SET(void *msg, __u16 cmd, __u16 flags,
 	tcm_hdr->tcm_len = htonl(msg_len);
 	tcm_hdr->tcm_type = htons(cmd);
 	tcm_hdr->tcm_flags = htons(flags);
-	if (data_len && data)
+	if (data_len && data) {
 		memcpy(TCM_DATA(msg), data, data_len);
+		memset(TCM_DATA(msg) + data_len, 0, TCM_SPACE(data_len) - msg_len);
+	}
 	return TCM_SPACE(data_len);
 }
-- 
2.21.0
Re: [Qemu-devel] [PATCH v9 2/7] virtio-pmem: Add virtio pmem driver
> On 5/16/19 10:35 PM, Pankaj Gupta wrote: > > Can I take it your reviewed/acked-by? or tested-by tag? for the virtio > > patch :)I don't feel that I have enough expertise to give the reviewed-by > > tag, but you can > take my acked-by + tested-by. > > Acked-by: Jakub Staron > Tested-by: Jakub Staron > > No kernel panics/stalls encountered during testing this patches (v9) with > QEMU + xfstests. Thank you for testing and confirming the results. I will add your tested & acked-by in v10. > Some CPU stalls encountered while testing with crosvm instead of QEMU with > xfstests > (test generic/464) but no repro for QEMU, so the fault may be on the side of > crosvm. yes, looks like crosvm related as we did not see any of this in my and your testing with Qemu. > > > The dump for the crosvm/xfstests stall: > [ 2504.175276] rcu: INFO: rcu_sched detected stalls on CPUs/tasks: > [ 2504.176681] rcu: 0-...!: (1 GPs behind) idle=9b2/1/0x4000 > softirq=1089198/1089202 fqs=0 > [ 2504.178270] rcu: 2-...!: (1 ticks this GP) > idle=cfe/1/0x4002 softirq=1055108/1055110 fqs=0 > [ 2504.179802] rcu: 3-...!: (1 GPs behind) idle=1d6/1/0x4002 > softirq=1046798/1046802 fqs=0 > [ 2504.181215] rcu: 4-...!: (2 ticks this GP) > idle=522/1/0x4002 softirq=1249063/1249064 fqs=0 > [ 2504.182625] rcu: 5-...!: (1 GPs behind) idle=6da/1/0x4000 > softirq=1131036/1131047 fqs=0 > [ 2504.183955] (detected by 3, t=0 jiffies, g=1232529, q=1370) > [ 2504.184762] Sending NMI from CPU 3 to CPUs 0: > [ 2504.186400] NMI backtrace for cpu 0 > [ 2504.186401] CPU: 0 PID: 6670 Comm: 464 Not tainted 5.1.0+ #1 > [ 2504.186401] Hardware name: ChromiumOS crosvm, BIOS 0 > [ 2504.186402] RIP: 0010:queued_spin_lock_slowpath+0x1c/0x1e0 > [ 2504.186402] Code: e7 89 c8 f0 44 0f b1 07 39 c1 75 dc f3 c3 0f 1f 44 00 00 > ba 01 00 00 00 8b 07 85 c0 75 0a f0 0f b1 17 85 c0 75 f2 f3 c3 f3 90 ec > 81 fe 00 01 00 00 0f 84 ab 00 00 00 81 e6 00 ff ff ff 75 44 > [ 2504.186403] RSP: 0018:c9003ee8 EFLAGS: 0002 > [ 2504.186404] RAX: 0001 
RBX: 0246 RCX: > 00404044 > [ 2504.186404] RDX: 0001 RSI: 0001 RDI: > 8244a280 > [ 2504.186405] RBP: 8244a280 R08: 000f4200 R09: > 024709ed6c32 > [ 2504.186405] R10: R11: 0001 R12: > 8244a280 > [ 2504.186405] R13: 0009 R14: 0009 R15: > > [ 2504.186406] FS: () GS:8880cc60() > knlGS: > [ 2504.186406] CS: 0010 DS: ES: CR0: 80050033 > [ 2504.186406] CR2: 7efd6b0f15d8 CR3: 0260a006 CR4: > 00360ef0 > [ 2504.186407] DR0: DR1: DR2: > > [ 2504.186407] DR3: DR6: fffe0ff0 DR7: > 0400 > [ 2504.186407] Call Trace: > [ 2504.186408] > [ 2504.186408] _raw_spin_lock_irqsave+0x1d/0x30 > [ 2504.186408] rcu_core+0x3b6/0x740 > [ 2504.186408] ? __hrtimer_run_queues+0x133/0x280 > [ 2504.186409] ? recalibrate_cpu_khz+0x10/0x10 > [ 2504.186409] __do_softirq+0xd8/0x2e4 > [ 2504.186409] irq_exit+0xa3/0xb0 > [ 2504.186410] smp_apic_timer_interrupt+0x67/0x120 > [ 2504.186410] apic_timer_interrupt+0xf/0x20 > [ 2504.186410] > [ 2504.186410] RIP: 0010:unmap_page_range+0x47a/0x9b0 > [ 2504.186411] Code: 0f 46 46 10 49 39 6e 18 49 89 46 10 48 89 e8 49 0f 43 46 > 18 41 80 4e 20 08 4d 85 c9 49 89 46 18 0f 84 68 ff ff ff 49 8b 51 08 <48> 8d > 42 ff 83 e2 01 49 0f 44 c1 f6 40 18 01 75 38 48 ba ff 0f 00 > [ 2504.186411] RSP: 0018:c900036cbcc8 EFLAGS: 0282 ORIG_RAX: > ff13 > [ 2504.186412] RAX: RBX: 80003751d045 RCX: > 0001 > [ 2504.186413] RDX: ea0002e09288 RSI: 0269b000 RDI: > 8880b6525e40 > [ 2504.186413] RBP: 0269c000 R08: R09: > eadd4740 > [ 2504.186413] R10: ea0001755700 R11: 8880cc62d120 R12: > 02794000 > [ 2504.186414] R13: 0269b000 R14: c900036cbdf0 R15: > 8880572434d8 > [ 2504.186414] ? unmap_page_range+0x420/0x9b0 > [ 2504.186414] ? 
release_pages+0x175/0x390 > [ 2504.186414] unmap_vmas+0x7c/0xe0 > [ 2504.186415] exit_mmap+0xa4/0x190 > [ 2504.186415] mmput+0x3b/0x100 > [ 2504.186415] do_exit+0x276/0xc10 > [ 2504.186415] do_group_exit+0x35/0xa0 > [ 2504.186415] __x64_sys_exit_group+0xf/0x10 > [ 2504.186416] do_syscall_64+0x43/0x120 > [ 2504.186416] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 2504.186416] RIP: 0033:0x7efd6ae10618 > [ 2504.186416] Code: Bad RIP value. > [ 2504.186417] RSP: 002b:7ffcac9bde38 EFLAGS: 0246 ORIG_RAX: > 00e7 > [ 2504.186417] RAX: ffda RBX: RCX: > 7efd6ae10618 > [ 2504.186418] RDX: RSI: 003c RDI: > > [ 2504.186418] RBP: 7efd6b0ed8e0 R08: 00e7 R09: >
linux-next: stats
Hi all,

As usual, the executive friendly graph is at
http://neuling.org/linux-next-size.html :-)

(No merge commits counted, next-20190507 was the first linux-next after
the merge window opened.)

Commits in v5.2-rc1 (relative to v5.1):         12064
Commits in next-20190507:                       11388
Commits with the same SHA1:                     10638
Commits with the same patch_id:                   350 (1)
Commits with the same subject line:                61 (1)

(1) not counting those in the lines above.

So commits in -rc1 that were in next-20190507:  11049  91%

Some breakdown of the list of extra commits (relative to next-20190507)
in -rc1:

Top ten first word of commit summary:
     58 net
     46 afs
     44 perf
     35 kvm
     35 drm
     34 x86
     34 tools
     33 documentation
     25 thermal
     23 drivers

Top ten authors:
     50 dhowe...@redhat.com
     28 yamada.masah...@socionext.com
     28 tstoya...@vmware.com
     27 changbin...@intel.com
     21 amit.kuche...@linaro.org
     20 jlay...@kernel.org
     16 h...@lst.de
     15 bgolaszew...@baylibre.com
     14 del...@gmx.de
     13 yuehaib...@huawei.com

Top ten commiters:
    105 da...@davemloft.net
     73 a...@redhat.com
     59 edubez...@gmail.com
     50 lee.jo...@linaro.org
     50 dhowe...@redhat.com
     42 pbonz...@redhat.com
     31 yamada.masah...@socionext.com
     31 j...@ziepe.ca
     31 cor...@lwn.net
     29 idryo...@gmail.com

There are also 339 commits in next-20190507 that didn't make it into
v5.2-rc1.

Top ten first word of commit summary:
     27 drm
     20 xtensa
     20 mm
     20 arm
     19 nvmem
     19 i2c
     15 selftests
     14 ntb
     11 nfc
     10 arm64

Top ten authors:
     18 jcmvb...@gmail.com
     17 a...@linux-foundation.org
     12 wsa+rene...@sang-engineering.com
     12 evan.q...@amd.com
     10 stefan.wah...@i2se.com
     10 li...@rasmusvillemoes.dk
      9 zo...@linux.ibm.com
      9 log...@deltatee.com
      7 jean-philippe.bruc...@arm.com
      6 tadeusz.st...@intel.com

Some of Andrew's patches are fixes for other patches in his tree (and
have been merged into those).

Top ten commiters:
    103 s...@canb.auug.org.au
     23 jcmvb...@gmail.com
     23 alexander.deuc...@amd.com
     22 srinivas.kandaga...@linaro.org
     15 w...@the-dreams.de
     15 jarkko.sakki...@linux.intel.com
     14 jdma...@kudzu.us
     12 stefan.wah...@i2se.com
     11 sa...@linux.intel.com
     11 m...@redhat.com

Those commits by me are from the quilt series (mainly Andrew's mmotm
tree).

-- 
Cheers,
Stephen Rothwell
[PATCH] MAINTAINERS: Add mailing list for csky architecture
From: Guo Ren

Add the newly created linux-c...@vger.kernel.org mailing list for patch
reviews and discussions.

Signed-off-by: Guo Ren
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 5cfbea4..b5fadcc 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3474,6 +3474,7 @@ F:	sound/pci/oxygen/

 C-SKY ARCHITECTURE
 M:	Guo Ren
+L:	linux-c...@vger.kernel.org
 T:	git https://github.com/c-sky/csky-linux.git
 S:	Supported
 F:	arch/csky/
-- 
2.7.4
RE: [PATCH 1/3] enetc: add hardware timestamping support
Hi, > -Original Message- > From: Claudiu Manoil > Sent: Thursday, May 16, 2019 11:31 PM > To: Richard Cochran ; Y.b. Lu > > Cc: net...@vger.kernel.org; David Miller ; Shawn > Guo ; Rob Herring ; > devicet...@vger.kernel.org; linux-arm-ker...@lists.infradead.org; > linux-kernel@vger.kernel.org > Subject: RE: [PATCH 1/3] enetc: add hardware timestamping support > > > >-Original Message- > >From: Richard Cochran > >Sent: Thursday, May 16, 2019 5:33 PM > >To: Y.b. Lu > >Cc: net...@vger.kernel.org; David Miller ; Claudiu > >Manoil ; Shawn Guo ; > Rob > >Herring ; devicet...@vger.kernel.org; linux-arm- > >ker...@lists.infradead.org; linux-kernel@vger.kernel.org > >Subject: Re: [PATCH 1/3] enetc: add hardware timestamping support > > > >On Thu, May 16, 2019 at 09:59:08AM +, Y.b. Lu wrote: > > > [...] > > > >> static bool enetc_clean_tx_ring(struct enetc_bdr *tx_ring, int > >> napi_budget) { > >>struct net_device *ndev = tx_ring->ndev; > >> + struct enetc_ndev_priv *priv = netdev_priv(ndev); > >>int tx_frm_cnt = 0, tx_byte_cnt = 0; > >>struct enetc_tx_swbd *tx_swbd; > >> + union enetc_tx_bd *txbd; > >> + bool do_tstamp; > >>int i, bds_to_clean; > >> + u64 tstamp = 0; > > > >Please keep in reverse Christmas tree order as much as possible: > > For the xmass tree part, Yangbo, better move the priv and txbd declarations > inside the scope of the if() {} block where they are actually used, i.e.: > > if (unlikely(tx_swbd->check_wb)) { > struct enetc_ndev_priv *priv = netdev_priv(ndev); > union enetc_tx_bd *txbd; > [...] > } > [Y.b. Lu] Will do that. 
> >	union enetc_tx_bd *txbd;
> >	int i, bds_to_clean;
> >	bool do_tstamp;
> >	u64 tstamp = 0;
> >
> >>	i = tx_ring->next_to_clean;
> >>	tx_swbd = &tx_ring->tx_swbd[i];
> >>	bds_to_clean = enetc_bd_ready_count(tx_ring, i);
> >>
> >> +	do_tstamp = false;
> >> +
> >>	while (bds_to_clean && tx_frm_cnt < ENETC_DEFAULT_TX_WORK) {
> >>		bool is_eof = !!tx_swbd->skb;
> >>
> >> +		if (unlikely(tx_swbd->check_wb)) {
> >> +			txbd = ENETC_TXBD(*tx_ring, i);
> >> +
> >> +			if (!(txbd->flags & ENETC_TXBD_FLAGS_W))
> >> +				goto no_wb;
> >> +
> >> +			if (tx_swbd->do_tstamp) {
> >> +				enetc_get_tx_tstamp(&priv->si->hw, txbd,
> >> +						    &tstamp);
> >> +				do_tstamp = true;
> >> +			}
> >> +		}
> >> +no_wb:
> >
> >This goto seems strange and unnecessary. How about this instead?
> >
> >	if (txbd->flags & ENETC_TXBD_FLAGS_W &&
> >	    tx_swbd->do_tstamp) {
> >		enetc_get_tx_tstamp(&priv->si->hw, txbd, &tstamp);
> >		do_tstamp = true;
> >	}
> >
> 
> Absolutely, somehow I missed this. I guess the intention was to be able
> to support multiple if() blocks for the writeback case (W flag set) but
> the code is much better off without the goto.

[Y.b. Lu] Will use this to support the current single tstamp writeback case.

> >>	enetc_unmap_tx_buff(tx_ring, tx_swbd);
> >>	if (is_eof) {
> >> +		if (unlikely(do_tstamp)) {
> >> +			enetc_tstamp_tx(tx_swbd->skb, tstamp);
> >> +			do_tstamp = false;
> >> +		}
> >>		napi_consume_skb(tx_swbd->skb, napi_budget);
> >>		tx_swbd->skb = NULL;
> >>	}
> >> @@ -167,6 +169,11 @@ struct enetc_cls_rule {
> >>
> >> #define ENETC_MAX_BDR_INT	2 /* fixed to max # of available cpus */
> >>
> >> +enum enetc_hw_features {
> >
> >This is a poor choice of name. It sounds like it describes HW
> >capabilities, but you use it to track whether a feature is requested at
> >run time.
> > > >> + ENETC_F_RX_TSTAMP = BIT(0), > >> + ENETC_F_TX_TSTAMP = BIT(1), > >> +}; > >> + > >> struct enetc_ndev_priv { > >>struct net_device *ndev; > >>struct device *dev; /* dma-mapping device */ @@ -178,6 +185,7 @@ > >> struct enetc_ndev_priv { > >>u16 rx_bd_count, tx_bd_count; > >> > >>u16 msg_enable; > >> + int hw_features; > > > >This is also poorly named. How about "tstamp_request" instead? > > > > This ndev_priv variable was intended to gather flags for all the active h/w > related features, i.e. keeping count of what h/w offloads are enabled for the > current device (at least for those that don't have already a netdev_features_t > flag). > I wouldn't waste an int for 2 timestamp flags, I'd rather have a more generic > name. > Maybe active_offloads then? > > Anyway, the name can be changed later too, when other offloads will be > added. [Y.b. Lu] How about using active_offloads, and add TODO comments in enum enetc_active_offloads? > > Thanks, >
[PATCH] intel_th: msu: Fix unused variable warning on arm64 platform
Fix this compiler warning on the arm64 platform.

Cc: Alexander Shishkin
Cc: Greg Kroah-Hartman
Signed-off-by: Shaokun Zhang
---
 drivers/hwtracing/intel_th/msu.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c
index 81bb54fa3ce8..e15ed5c308e1 100644
--- a/drivers/hwtracing/intel_th/msu.c
+++ b/drivers/hwtracing/intel_th/msu.c
@@ -780,7 +780,7 @@ static int __msc_buffer_win_alloc(struct msc_window *win,
 static int msc_buffer_win_alloc(struct msc *msc, unsigned int nr_blocks)
 {
 	struct msc_window *win;
-	int ret = -ENOMEM, i;
+	int ret = -ENOMEM;

 	if (!nr_blocks)
 		return 0;
@@ -812,7 +812,7 @@ static int msc_buffer_win_alloc(struct msc *msc, unsigned int nr_blocks)
 		goto err_nomem;

 #ifdef CONFIG_X86
-	for (i = 0; i < ret; i++)
+	for (int i = 0; i < ret; i++)
 		/* Set the page as uncached */
 		set_memory_uc((unsigned long)msc_win_block(win, i), 1);
 #endif
@@ -860,8 +860,6 @@ static void __msc_buffer_win_free(struct msc *msc, struct msc_window *win)
  */
 static void msc_buffer_win_free(struct msc *msc, struct msc_window *win)
 {
-	int i;
-
 	msc->nr_pages -= win->nr_blocks;

 	list_del(&win->entry);
@@ -871,7 +869,7 @@ static void msc_buffer_win_free(struct msc *msc, struct msc_window *win)
 	}

 #ifdef CONFIG_X86
-	for (i = 0; i < win->nr_blocks; i++)
+	for (int i = 0; i < win->nr_blocks; i++)
 		/* Reset the page to write-back */
 		set_memory_wb((unsigned long)msc_win_block(win, i), 1);
 #endif
-- 
2.7.4
[PATCH] scsi: fix typos in code comments
fix abord to abort

Signed-off-by: Weitao Hou
---
 drivers/scsi/pm8001/pm8001_sas.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/pm8001/pm8001_sas.c b/drivers/scsi/pm8001/pm8001_sas.c
index 88eef3b18e41..3de57c5a3299 100644
--- a/drivers/scsi/pm8001/pm8001_sas.c
+++ b/drivers/scsi/pm8001/pm8001_sas.c
@@ -1181,7 +1181,7 @@ int pm8001_query_task(struct sas_task *task)
 	return rc;
 }

-/* mandatory SAM-3, still need free task/ccb info, abord the specified task */
+/* mandatory SAM-3, still need free task/ccb info, abort the specified task */
 int pm8001_abort_task(struct sas_task *task)
 {
 	unsigned long flags;
-- 
2.18.0
RE: [EXT] Re: [PATCH 1/3] enetc: add hardware timestamping support
Hi, > -Original Message- > From: Richard Cochran > Sent: Thursday, May 16, 2019 10:33 PM > To: Y.b. Lu > Cc: net...@vger.kernel.org; David Miller ; Claudiu > Manoil ; Shawn Guo ; Rob > Herring ; devicet...@vger.kernel.org; > linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org > Subject: [EXT] Re: [PATCH 1/3] enetc: add hardware timestamping support > > > On Thu, May 16, 2019 at 09:59:08AM +, Y.b. Lu wrote: > > > +config FSL_ENETC_HW_TIMESTAMPING > > + bool "ENETC hardware timestamping support" > > + depends on FSL_ENETC || FSL_ENETC_VF > > + help > > + Enable hardware timestamping support on the Ethernet packets > > + using the SO_TIMESTAMPING API. Because the RX BD ring dynamic > > + allocation hasn't been supported and it's too expensive to use > > s/it's/it is/ [Y.b. Lu] Will modify it. BTW, may I know what's the purpose of dropping single quote character? For searching, script checking, or something else? If require to not use single quote character, I will also modify some other places in Kconfig messages. > > > + extended RX BDs if timestamping isn't used, the option was used > > + to control hardware timestamping/extended RX BDs to be enabled > > + or not. > > ..., this option enables extended RX BDs in order to support hardware > timestamping. [Y.b. Lu] Will rephrase it. 
> > static bool enetc_clean_tx_ring(struct enetc_bdr *tx_ring, int napi_budget)
> > {
> > 	struct net_device *ndev = tx_ring->ndev;
> > +	struct enetc_ndev_priv *priv = netdev_priv(ndev);
> > 	int tx_frm_cnt = 0, tx_byte_cnt = 0;
> > 	struct enetc_tx_swbd *tx_swbd;
> > +	union enetc_tx_bd *txbd;
> > +	bool do_tstamp;
> > 	int i, bds_to_clean;
> > +	u64 tstamp = 0;
> 
> Please keep in reverse Christmas tree order as much as possible:
> 
> 	union enetc_tx_bd *txbd;
> 	int i, bds_to_clean;
> 	bool do_tstamp;
> 	u64 tstamp = 0;
> 
> > 	i = tx_ring->next_to_clean;
> > 	tx_swbd = &tx_ring->tx_swbd[i];
> > 	bds_to_clean = enetc_bd_ready_count(tx_ring, i);
> >
> > +	do_tstamp = false;
> > +
> > 	while (bds_to_clean && tx_frm_cnt < ENETC_DEFAULT_TX_WORK) {
> > 		bool is_eof = !!tx_swbd->skb;
> >
> > +		if (unlikely(tx_swbd->check_wb)) {
> > +			txbd = ENETC_TXBD(*tx_ring, i);
> > +
> > +			if (!(txbd->flags & ENETC_TXBD_FLAGS_W))
> > +				goto no_wb;
> > +
> > +			if (tx_swbd->do_tstamp) {
> > +				enetc_get_tx_tstamp(&priv->si->hw, txbd,
> > +						    &tstamp);
> > +				do_tstamp = true;
> > +			}
> > +		}
> > +no_wb:
> 
> This goto seems strange and unnecessary. How about this instead?
> 
> 	if (txbd->flags & ENETC_TXBD_FLAGS_W &&
> 	    tx_swbd->do_tstamp) {
> 		enetc_get_tx_tstamp(&priv->si->hw, txbd, &tstamp);
> 		do_tstamp = true;
> 	}
> 
> > 	enetc_unmap_tx_buff(tx_ring, tx_swbd);
> > 	if (is_eof) {
> > +		if (unlikely(do_tstamp)) {
> > +			enetc_tstamp_tx(tx_swbd->skb, tstamp);
> > +			do_tstamp = false;
> > +		}
> > 		napi_consume_skb(tx_swbd->skb, napi_budget);
> > 		tx_swbd->skb = NULL;
> > 	}
> > @@ -167,6 +169,11 @@ struct enetc_cls_rule {
> >
> > #define ENETC_MAX_BDR_INT	2 /* fixed to max # of available cpus */
> >
> > +enum enetc_hw_features {
> 
> This is a poor choice of name. It sounds like it describes HW
> capabilities, but you use it to track whether a feature is requested at
> run time.
> 
> > +	ENETC_F_RX_TSTAMP = BIT(0),
> > +	ENETC_F_TX_TSTAMP = BIT(1),
> > +};
> > +
> > struct enetc_ndev_priv {
> > 	struct net_device *ndev;
> > 	struct device *dev; /* dma-mapping device */
> > @@ -178,6 +185,7 @@ struct enetc_ndev_priv {
> > 	u16 rx_bd_count, tx_bd_count;
> >
> > 	u16 msg_enable;
> > +	int hw_features;
> 
> This is also poorly named. How about "tstamp_request" instead?
> 
> >
> > 	struct enetc_bdr *tx_ring[16];
> > 	struct enetc_bdr *rx_ring[16];
> 
> Thanks,
> Richard
[v3 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
A few new fields were added to mmu_gather to make TLB flush smarter for huge pages by telling what level of the page table is changed. __tlb_reset_range() is used to reset all this page table state to unchanged; it is called by the TLB flush for parallel mapping changes on the same range under a non-exclusive lock (i.e. read mmap_sem).

Before commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap"), the syscalls (e.g. MADV_DONTNEED, MADV_FREE) which may update PTEs in parallel didn't remove page tables. But the aforementioned commit may do munmap() under read mmap_sem and free page tables. This may result in a program hang on aarch64, reported by Jan Stancek. The problem can be reproduced by his test program, slightly modified below.

---8<---

static int map_size = 4096;
static int num_iter = 500;
static long threads_total;

static void *distant_area;

void *map_write_unmap(void *ptr)
{
	int *fd = ptr;
	unsigned char *map_address;
	int i, j = 0;

	for (i = 0; i < num_iter; i++) {
		map_address = mmap(distant_area, (size_t) map_size,
				PROT_WRITE | PROT_READ,
				MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		if (map_address == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}

		for (j = 0; j < map_size; j++)
			map_address[j] = 'b';

		if (munmap(map_address, map_size) == -1) {
			perror("munmap");
			exit(1);
		}
	}

	return NULL;
}

void *dummy(void *ptr)
{
	return NULL;
}

int main(void)
{
	pthread_t thid[2];

	/* hint for mmap in map_write_unmap() */
	distant_area = mmap(0, DISTANT_MMAP_SIZE, PROT_WRITE | PROT_READ,
			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	munmap(distant_area, (size_t)DISTANT_MMAP_SIZE);
	distant_area += DISTANT_MMAP_SIZE / 2;

	while (1) {
		pthread_create(&thid[0], NULL, map_write_unmap, NULL);
		pthread_create(&thid[1], NULL, dummy, NULL);

		pthread_join(thid[0], NULL);
		pthread_join(thid[1], NULL);
	}
}
---8<---

The program may bring in parallel execution like below:

        t1                                      t2
munmap(map_address)
  downgrade_write(&mm->mmap_sem);
  unmap_region()
    tlb_gather_mmu()
      inc_tlb_flush_pending(tlb->mm);
    free_pgtables()
      tlb->freed_tables = 1
      tlb->cleared_pmds = 1

                                        pthread_exit()
                                        madvise(thread_stack, 8M, MADV_DONTNEED)
                                          zap_page_range()
                                            tlb_gather_mmu()
                                              inc_tlb_flush_pending(tlb->mm);

    tlb_finish_mmu()
      if (mm_tlb_flush_nested(tlb->mm))
        __tlb_reset_range()

__tlb_reset_range() would reset the freed_tables and cleared_* bits, but this may cause inconsistency for munmap(), which does free page tables. As a result, some architectures, e.g. aarch64, may not flush the TLB completely as expected and may be left with stale TLB entries.

Use a fullmm flush, since it yields much better performance on aarch64 and non-fullmm doesn't yield a significant difference on x86.

The original proposed fix came from Jan Stancek, who mainly debugged this issue; I just wrapped everything up together.

Reported-by: Jan Stancek
Tested-by: Jan Stancek
Suggested-by: Will Deacon
Cc: Peter Zijlstra
Cc: Nick Piggin
Cc: "Aneesh Kumar K.V"
Cc: Nadav Amit
Cc: Minchan Kim
Cc: Mel Gorman
Cc: sta...@vger.kernel.org 4.20+
Signed-off-by: Yang Shi
Signed-off-by: Jan Stancek
---
v3: Adopted fullmm flush suggestion from Will
v2: Reworked the commit log per Peter and Will
    Adopted the suggestion from Peter

 mm/mmu_gather.c | 24 +++-
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1..289f8cf 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 {
 	/*
 	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
+	 * flush by batching, one thread may end up seeing inconsistent PTEs
+	 * and result in having stale TLB entries.
+	 * So flush TLB forcefully if we detect parallel PTE batching
+	 * threads.
+	 *
+	 * However, some syscalls, e.g. munmap(), may free page tables;
+	 * this needs to force flush everything in the given range.
+	 * Otherwise this may result in having stale TLB entries for some
+	 * architectures, e.g. aarch64, that can specify which level of
+	 * the TLB to flush.
Re: [PATCH 1/3] kselftest/cgroup: fix unexcepted testing failure on test_memcontrol
> Hi Alex!
>
> Sorry for the late reply, somehow I missed your e-mails.
>
> The patchset looks good to me, except a typo in subjects/commit logs:
> you probably meant "unexpected failures".
>
> Please, fix and feel free to use:
> Reviewed-by: Roman Gushchin
> for the series.

Thanks Roman!

BTW, do you know of any other test cases for cgroup2 functionality or performance?

Thanks
Alex
[RFC 2/2] ARM: dts: imx6ull/z: add fusable-node property
Add fusable-node property for OCOTP Signed-off-by: Peng Fan --- arch/arm/boot/dts/imx6ull.dtsi | 7 +++ arch/arm/boot/dts/imx6ulz.dtsi | 6 ++ 2 files changed, 13 insertions(+) diff --git a/arch/arm/boot/dts/imx6ull.dtsi b/arch/arm/boot/dts/imx6ull.dtsi index 22e4a307fa59..b616ed6ee4bf 100644 --- a/arch/arm/boot/dts/imx6ull.dtsi +++ b/arch/arm/boot/dts/imx6ull.dtsi @@ -32,6 +32,13 @@ { compatible = "fsl,imx6ull-ocotp", "syscon"; + + fusable-node = <0xc 22 + 0xc 26 + 0xc 27 +0x10 4 +0x10 5 + >; }; { diff --git a/arch/arm/boot/dts/imx6ulz.dtsi b/arch/arm/boot/dts/imx6ulz.dtsi index 0b5f1a763567..8edd9008e38b 100644 --- a/arch/arm/boot/dts/imx6ulz.dtsi +++ b/arch/arm/boot/dts/imx6ulz.dtsi @@ -19,6 +19,12 @@ }; }; + { + fusable-node = < 0x10 4 +0x10 5 + >; +}; + /delete-node/ /delete-node/ /delete-node/ -- 2.16.4
[RFC 1/2] dt-bindings: imx-ocotp: Add fusable-node property
Introduce the fusable-node property for the i.MX OCOTP driver. The property will only be used by firmware (e.g. U-Boot) to runtime-disable the nodes.

Take i.MX6ULL for example: there are several parts that only have a limited set of modules enabled, controlled by OCOTP fuses. It is not flexible to provide several dts files for the several parts; instead we could provide one device tree and let firmware runtime-disable the device tree nodes for those modules that are disabled (fused).

Signed-off-by: Peng Fan
---
Currently the NXP vendor uses U-Boot to set status to disabled for devices that cannot function,
https://source.codeaurora.org/external/imx/uboot-imx/tree/arch/arm/mach-imx/mx6/module_fuse.c?h=imx_v2018.03_4.14.98_2.0.0_ga#n149
but this approach will not work if the kernel dts node path changes.

There are two approaches to resolve this:
1. This patch adds a fusable-node property; firmware will parse the property and read the fuse to decide whether to disable the nodes or keep them enabled.
2. Another approach is to add nvmem-cells for all nodes that could be disabled (fused). Then each Linux driver would use the nvmem API to detect whether it is fused or not, or common Linux driver code would check with the nvmem API whether the device is functional.

To make it easy to work, we choose [1] here. Please advise whether it is acceptable, because the property is not used by a Linux driver in approach [1]. Or, if you prefer [2], please advise, or suggest any better solution. Thanks.
Documentation/devicetree/bindings/nvmem/imx-ocotp.txt | 5 + 1 file changed, 5 insertions(+) diff --git a/Documentation/devicetree/bindings/nvmem/imx-ocotp.txt b/Documentation/devicetree/bindings/nvmem/imx-ocotp.txt index 7a999a135e56..e9a998588dbd 100644 --- a/Documentation/devicetree/bindings/nvmem/imx-ocotp.txt +++ b/Documentation/devicetree/bindings/nvmem/imx-ocotp.txt @@ -21,6 +21,8 @@ Required properties: Optional properties: - read-only: disable write access +- fusable-node: array of phandles with reg base and bit offset, this + property is used by Firmware to runtime disable nodes. Optional Child nodes: @@ -42,4 +44,7 @@ Example: tempmon_temp_grade: temp-grade@20 { reg = <0x20 4>; }; + + fusable-node = < 0x10 4 +0x10 5>; }; -- 2.16.4
Re: [PATCH RESEND 1/5] ARM: dts: imx6qdl-sabresd: Assign corresponding power supply for LDOs
On Sun, May 12, 2019 at 09:57:20AM +, Anson Huang wrote: > On i.MX6Q/DL SabreSD board, vgen5 supplies vdd1p1/vdd2p5 LDO and > sw2 supplies vdd3p0 LDO, this patch assigns corresponding power > supply for vdd1p1/vdd2p5/vdd3p0 to avoid confusion by below log: > > vdd1p1: supplied by regulator-dummy > vdd3p0: supplied by regulator-dummy > vdd2p5: supplied by regulator-dummy > > With this patch, the power supply is more accurate: > > vdd1p1: supplied by VGEN5 > vdd3p0: supplied by SW2 > vdd2p5: supplied by VGEN5 > > Signed-off-by: Anson Huang Applied all, thanks.
[PATCH v2 2/3] kselftest/cgroup: fix unexpected testing failure on test_core
The cgroup testing relies on the root cgroup's subtree_control setting. If the 'memory' controller isn't set, some test cases will fail as follows:

$ sudo ./test_core
not ok 1 test_cgcore_internal_process_constraint
ok 2 test_cgcore_top_down_constraint_enable
not ok 3 test_cgcore_top_down_constraint_disable
...

To correct this unexpected failure, this patch writes 'memory' to the subtree_control of root to get the right result.

Signed-off-by: Alex Shi
Cc: Shuah Khan
Cc: Tejun Heo
Cc: Roman Gushchin
Cc: Claudio Zumbo
Cc: Claudio
Cc: linux-kselft...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Roman Gushchin
---
 tools/testing/selftests/cgroup/test_core.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/tools/testing/selftests/cgroup/test_core.c b/tools/testing/selftests/cgroup/test_core.c
index be59f9c34ea2..b1a570d7c557 100644
--- a/tools/testing/selftests/cgroup/test_core.c
+++ b/tools/testing/selftests/cgroup/test_core.c
@@ -376,6 +376,11 @@ int main(int argc, char *argv[])
 	if (cg_find_unified_root(root, sizeof(root)))
 		ksft_exit_skip("cgroup v2 isn't mounted\n");
+
+	if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+		if (cg_write(root, "cgroup.subtree_control", "+memory"))
+			ksft_exit_skip("Failed to set root memory controller\n");
+
 	for (i = 0; i < ARRAY_SIZE(tests); i++) {
 		switch (tests[i].fn(root)) {
 		case KSFT_PASS:
--
2.19.1.856.g8858448bb
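The C check added above has a simple shell equivalent: enable the controller only when subtree_control does not already list it. The sketch below models that logic on a temp file standing in for /sys/fs/cgroup/cgroup.subtree_control, so it runs without a cgroup2 mount (the file path and starting contents are made up for the example).

```shell
# Model of the new check: enable the 'memory' controller in the root's
# subtree_control only when it is not already listed.
subtree=$(mktemp)
echo "cpu io" > "$subtree"      # 'memory' not yet enabled

if ! grep -qw memory "$subtree"; then
	# On real cgroupfs this would be:
	#   echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
	# here we just append the token to the model file.
	sed -i 's/$/ memory/' "$subtree"
fi

cat "$subtree"
```

Running the snippet twice leaves the file unchanged the second time, matching the guarded cg_write() in the patch.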
[PATCH v2 3/3] kselftest/cgroup: fix incorrect test_core skip
test_core will skip the test_cgcore_no_internal_process_constraint_on_threads test case if the 'cpu' controller is missing from root's subtree_control. In fact we need to set 'cpu' in subtree_control to make the testing meaningful:

./test_core
...
ok 4 # skip test_cgcore_no_internal_process_constraint_on_threads
...

Signed-off-by: Alex Shi
Cc: Shuah Khan
Cc: Tejun Heo
Cc: Roman Gushchin
Cc: Claudio Zumbo
Cc: Claudio
Cc: linux-kselft...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Roman Gushchin
---
 tools/testing/selftests/cgroup/test_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/cgroup/test_core.c b/tools/testing/selftests/cgroup/test_core.c
index b1a570d7c557..2ffc3e07d98d 100644
--- a/tools/testing/selftests/cgroup/test_core.c
+++ b/tools/testing/selftests/cgroup/test_core.c
@@ -198,7 +198,7 @@ static int test_cgcore_no_internal_process_constraint_on_threads(const char *roo
 	char *parent = NULL, *child = NULL;

 	if (cg_read_strstr(root, "cgroup.controllers", "cpu") ||
-	    cg_read_strstr(root, "cgroup.subtree_control", "cpu")) {
+	    cg_write(root, "cgroup.subtree_control", "+cpu")) {
 		ret = KSFT_SKIP;
 		goto cleanup;
 	}
--
2.19.1.856.g8858448bb
[PATCH v2 1/3] kselftest/cgroup: fix unexpected testing failure on test_memcontrol
The cgroup testing relies on the root cgroup's subtree_control setting. If the 'memory' controller isn't set, all test cases will fail as follows:

$ sudo ./test_memcontrol
not ok 1 test_memcg_subtree_control
not ok 2 test_memcg_current
ok 3 # skip test_memcg_min
not ok 4 test_memcg_low
not ok 5 test_memcg_high
not ok 6 test_memcg_max
not ok 7 test_memcg_oom_events
ok 8 # skip test_memcg_swap_max
not ok 9 test_memcg_sock
not ok 10 test_memcg_oom_group_leaf_events
not ok 11 test_memcg_oom_group_parent_events
not ok 12 test_memcg_oom_group_score_events

To correct this unexpected failure, this patch writes 'memory' to the subtree_control of root to get the right result.

Signed-off-by: Alex Shi
Cc: Shuah Khan
Cc: Roman Gushchin
Cc: Tejun Heo
Cc: Mike Rapoport
Cc: Jay Kamat
Cc: linux-kselft...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Roman Gushchin
---
 tools/testing/selftests/cgroup/test_memcontrol.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 6f339882a6ca..73612d604a2a 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -1205,6 +1205,10 @@ int main(int argc, char **argv)
 	if (cg_read_strstr(root, "cgroup.controllers", "memory"))
 		ksft_exit_skip("memory controller isn't available\n");

+	if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+		if (cg_write(root, "cgroup.subtree_control", "+memory"))
+			ksft_exit_skip("Failed to set root memory controller\n");
+
 	for (i = 0; i < ARRAY_SIZE(tests); i++) {
 		switch (tests[i].fn(root)) {
 		case KSFT_PASS:
--
2.19.1.856.g8858448bb
Re: [PATCH v2 4/4] vfio: vfio_iommu_type1: implement VFIO_IOMMU_INFO_CAPABILITIES
Hi Pierre, I love your patch! Perhaps something to improve: [auto build test WARNING on s390/features] [also build test WARNING on v5.1 next-20190517] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Pierre-Morel/s390-pci-Exporting-access-to-CLP-PCI-function-and-PCI-group/20190520-025155 base: https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git features reproduce: # apt-get install sparse make ARCH=x86_64 allmodconfig make C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' If you fix the issue, kindly add following tag Reported-by: kbuild test robot sparse warnings: (new ones prefixed by >>) >> drivers/vfio/vfio_iommu_type1.c:1707:5: sparse: sparse: symbol >> 'vfio_iommu_type1_caps' was not declared. Should it be static? Please review and possibly fold the followup patch. --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
[RFC PATCH] vfio: vfio_iommu_type1: vfio_iommu_type1_caps() can be static
Fixes: f10b2b74bbea ("vfio: vfio_iommu_type1: implement VFIO_IOMMU_INFO_CAPABILITIES") Signed-off-by: kbuild test robot --- vfio_iommu_type1.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 9435647..46a4939 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -1704,8 +1704,8 @@ static int vfio_iommu_type1_zpci_grp(struct iommu_domain *domain, return ret; } -int vfio_iommu_type1_caps(struct vfio_iommu *iommu, struct vfio_info_cap *caps, - size_t size) +static int vfio_iommu_type1_caps(struct vfio_iommu *iommu, struct vfio_info_cap *caps, +size_t size) { struct vfio_domain *d; unsigned long total_size, fn_size, grp_size;
Re: [PATCH RESEND] ARM: dts: imx6ul: add clock-frequency to CPU node
On Sun, May 12, 2019 at 08:57:16AM +, Anson Huang wrote: > Add clock-frequency property to CPU node. Avoids warnings like > "/cpus/cpu@0 missing clock-frequency property" for "arm,cortex-a7". > > Signed-off-by: Anson Huang Applied, thanks.
Re: [v2 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
On 5/16/19 11:29 PM, Jan Stancek wrote: - Original Message - On Mon, May 13, 2019 at 04:01:09PM -0700, Yang Shi wrote: On 5/13/19 9:38 AM, Will Deacon wrote: On Fri, May 10, 2019 at 07:26:54AM +0800, Yang Shi wrote: diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c index 99740e1..469492d 100644 --- a/mm/mmu_gather.c +++ b/mm/mmu_gather.c @@ -245,14 +245,39 @@ void tlb_finish_mmu(struct mmu_gather *tlb, { /* * If there are parallel threads are doing PTE changes on same range -* under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB -* flush by batching, a thread has stable TLB entry can fail to flush -* the TLB by observing pte_none|!pte_dirty, for example so flush TLB -* forcefully if we detect parallel PTE batching threads. +* under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB +* flush by batching, one thread may end up seeing inconsistent PTEs +* and result in having stale TLB entries. So flush TLB forcefully +* if we detect parallel PTE batching threads. +* +* However, some syscalls, e.g. munmap(), may free page tables, this +* needs force flush everything in the given range. Otherwise this +* may result in having stale TLB entries for some architectures, +* e.g. aarch64, that could specify flush what level TLB. */ - if (mm_tlb_flush_nested(tlb->mm)) { - __tlb_reset_range(tlb); - __tlb_adjust_range(tlb, start, end - start); + if (mm_tlb_flush_nested(tlb->mm) && !tlb->fullmm) { + /* +* Since we can't tell what we actually should have +* flushed, flush everything in the given range. +*/ + tlb->freed_tables = 1; + tlb->cleared_ptes = 1; + tlb->cleared_pmds = 1; + tlb->cleared_puds = 1; + tlb->cleared_p4ds = 1; + + /* +* Some architectures, e.g. ARM, that have range invalidation +* and care about VM_EXEC for I-Cache invalidation, need force +* vma_exec set. 
+*/ + tlb->vma_exec = 1; + + /* Force vma_huge clear to guarantee safer flush */ + tlb->vma_huge = 0; + + tlb->start = start; + tlb->end = end; } Whilst I think this is correct, it would be interesting to see whether or not it's actually faster than just nuking the whole mm, as I mentioned before. At least in terms of getting a short-term fix, I'd prefer the diff below if it's not measurably worse. I did a quick test with ebizzy (96 threads with 5 iterations) on my x86 VM, it shows slightly slowdown on records/s but much more sys time spent with fullmm flush, the below is the data. nofullmm fullmm ops (records/s) 225606 225119 sys (s) 0.69 1.14 It looks the slight reduction of records/s is caused by the increase of sys time. That's not what I expected, and I'm unable to explain why moving to fullmm would /increase/ the system time. I would've thought the time spent doing the invalidation would decrease, with the downside that the TLB is cold when returning back to userspace. I tried ebizzy with various parameters (malloc vs mmap, ran it for hour), but performance was very similar for both patches. So, I was looking for workload that would demonstrate the largest difference. Inspired by python xml-rpc, which can handle each request in new thread, I tried following [1]: 16 threads, each looping 100k times over: mmap(16M) touch 1 page madvise(DONTNEED) munmap(16M) This yields quite significant difference for 2 patches when running on my 46 CPU arm host. I checked it twice - applied patch, recompiled, rebooted, but numbers stayed +- couple seconds the same. Thanks for the testing. I'm a little bit surprised by the significant difference. I did the same test on my x86 VM (24 cores), they yield almost same number. Given the significant improvement on arm64 with fullmm version, I'm going to respin the patch. Does it somewhat match your expectation? 
v2 patch
--------
real    2m33.460s
user    0m3.359s
sys     15m32.307s

real    2m33.895s
user    0m2.749s
sys     16m34.500s

real    2m35.666s
user    0m3.528s
sys     15m23.377s

real    2m32.898s
user    0m2.789s
sys     16m18.801s

real    2m33.087s
user    0m3.565s
sys     16m23.815s

fullmm version
--------------
real    0m46.811s
user    0m1.596s
sys     1m47.500s

real    0m47.322s
user    0m1.803s
sys     1m48.449s

real    0m46.668s
user    0m1.508s
sys     1m47.352s

real    0m46.742s
user    0m2.007s
sys     1m47.217s

real    0m46.948s
user    0m1.785s
sys     1m47.906s

[1]
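To put a single number on the difference, the snippet below averages the five 'real' wall-clock samples of each variant (converted to seconds by hand from the timings above) and prints the ratio — roughly a 3.3x improvement for the fullmm version.

```shell
# Average the five 'real' samples for each variant and print the ratio.
# Values are the wall-clock times above, converted to seconds.
speedup=$(awk 'BEGIN {
	split("153.460 153.895 155.666 152.898 153.087", v2, " ");
	split("46.811 47.322 46.668 46.742 46.948", fm, " ");
	for (i = 1; i <= 5; i++) { sv += v2[i]; sf += fm[i]; }
	printf "%.1fx", (sv / 5) / (sf / 5);
}')
echo "$speedup"
```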
Re: [PATCH V12 RESEND 1/3] ARM: imx_v6_v7_defconfig: Add TPM PWM support by default
On Sun, May 12, 2019 at 08:51:15AM +, Anson Huang wrote: > Select CONFIG_PWM_IMX_TPM by default to support i.MX7ULP > TPM PWM. > > Signed-off-by: Anson Huang Applied all, thanks.
[PATCH] kbuild: do not check name uniqueness of builtin modules
I just thought it was a good idea to scan builtin.modules in the name uniqueness checking, but Stephen reported a false positive.

ppc64_defconfig produces:

warning: same basename if the following are built as modules:
  arch/powerpc/platforms/powermac/nvram.ko
  drivers/char/nvram.ko

..., which is a false positive because the former is never built as a module, as you see in arch/powerpc/platforms/powermac/Makefile:

# CONFIG_NVRAM is an arch. independent tristate symbol, for pmac32 we really
# need this to be a bool.  Cheat here and pretend CONFIG_NVRAM=m is really
# CONFIG_NVRAM=y
obj-$(CONFIG_NVRAM:m=y)	+= nvram.o

Since we cannot predict how tricky Makefiles are written in the wild, builtin.modules may potentially contain false positives. I do not think it is a big deal as far as kmod is concerned, but false-positive warnings in the kernel build make people upset. It is better not to do it.

Even without checking builtin.modules, we have enough (and more solid) test coverage with allmodconfig.

While I touched this part, I replaced the sed code with a neater one provided by Stephen.
Link: https://lkml.org/lkml/2019/5/19/120 Link: https://lkml.org/lkml/2019/5/19/123 Fixes: 3a48a91901c5 ("kbuild: check uniqueness of module names") Reported-by: Stephen Rothwell Signed-off-by: Masahiro Yamada --- scripts/modules-check.sh | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/scripts/modules-check.sh b/scripts/modules-check.sh index 2f659530e1ec..39e8cb36ba19 100755 --- a/scripts/modules-check.sh +++ b/scripts/modules-check.sh @@ -6,10 +6,10 @@ set -e # Check uniqueness of module names check_same_name_modules() { - for m in $(sed 's:.*/::' modules.order modules.builtin | sort | uniq -d) + for m in $(sed 's:.*/::' modules.order | sort | uniq -d) do - echo "warning: same basename if the following are built as modules:" >&2 - sed "/\/$m/!d;s:^kernel/: :" modules.order modules.builtin >&2 + echo "warning: same module names found:" >&2 + sed -n "/\/$m/s:^kernel/: :p" modules.order >&2 done } -- 2.17.1
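The retained pipeline is easy to exercise outside the build system. The snippet below runs the same duplicate-basename detection against a synthetic modules.order (the module paths are made up for the example, modeled on the false positive Stephen reported).

```shell
# Build a fake modules.order and run the duplicate-basename detection
# that modules-check.sh now performs on modules.order alone.
order=$(mktemp)
cat > "$order" <<'EOF'
kernel/drivers/char/nvram.ko
kernel/arch/powerpc/platforms/powermac/nvram.ko
kernel/fs/ext4/ext4.ko
EOF

# Strip directories, then keep only basenames that occur more than once.
dup=$(sed 's:.*/::' "$order" | sort | uniq -d)

for m in $dup
do
	echo "warning: same module names found:" >&2
	sed -n "/\/$m/s:^kernel/:  :p" "$order" >&2
done
```

Only nvram.ko is flagged; ext4.ko, with a unique basename, produces no output.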
Re: [PATCH v2 3/3] arm64: dts: nxp: frwy-ls1046a: add support for micron nor flash
On Fri, May 10, 2019 at 01:00:24PM +, Pramod Kumar wrote: > add micron nor flash support for ls1046a frwy board. > > Signed-off-by: Ashish Kumar > Signed-off-by: Pramod Kumar Prefix 'arm64: dts: frwy-ls1046a: ...' would be good enough. > --- > .../boot/dts/freescale/fsl-ls1046a-frwy.dts | 17 + > 1 file changed, 17 insertions(+) > > diff --git a/arch/arm64/boot/dts/freescale/fsl-ls1046a-frwy.dts > b/arch/arm64/boot/dts/freescale/fsl-ls1046a-frwy.dts > index de0d19c02944..890f07122dd0 100644 > --- a/arch/arm64/boot/dts/freescale/fsl-ls1046a-frwy.dts > +++ b/arch/arm64/boot/dts/freescale/fsl-ls1046a-frwy.dts > @@ -113,6 +113,23 @@ > > }; > > + Unnecessary newline. Shawn > + { > + num-cs = <1>; > + bus-num = <0>; > + status = "okay"; > + > + qflash0: flash@0 { > + compatible = "jedec,spi-nor"; > + #address-cells = <1>; > + #size-cells = <1>; > + spi-max-frequency = <5000>; > + reg = <0>; > + spi-rx-bus-width = <4>; > + spi-tx-bus-width = <4>; > + }; > +}; > + > #include "fsl-ls1046-post.dtsi" > > { > -- > 2.17.1 >
RE: [PATCH 1/3] enetc: add hardware timestamping support
> -Original Message- > From: Claudiu Manoil > Sent: Thursday, May 16, 2019 9:31 PM > To: Y.b. Lu ; net...@vger.kernel.org; Richard Cochran > ; David Miller ; Shawn > Guo ; Rob Herring > Cc: devicet...@vger.kernel.org; linux-arm-ker...@lists.infradead.org; > linux-kernel@vger.kernel.org > Subject: RE: [PATCH 1/3] enetc: add hardware timestamping support > > >-Original Message- > >From: Y.b. Lu > [...] > >Subject: [PATCH 1/3] enetc: add hardware timestamping support > > > [...] > > Hi Yangbo, > > These enetc patches targeting net-next will have to be rebased on the latest > enetc net.git commits, otherwise there will be some merge conflicts for > enetc.c > and enetc_ethtool.c. > Thanks, > Claudiu > > see > 22fb43f36006 "enetc: Add missing link state info for ethtool" > f4a0be84d73e "enetc: Fix NULL dma address unmap for Tx BD extensions" [Y.b. Lu] Thanks Claudiu. I should have noticed that. Let me send v2 after net-next is open.
Re: [PATCH v2 2/3] arm64: dts: nxp: add ls1046a-frwy board support
On Fri, May 10, 2019 at 01:00:20PM +, Pramod Kumar wrote: > ls1046afrwy board is based on nxp ls1046a SoC. > Board support's 4GB ddr memory, i2c, microSD card, > serial console,qspi nor flash,ifc nand flash,qsgmii network interface, > usb 3.0 and serdes interface to support two x1gen3 pcie interface. > > Signed-off-by: Vabhav Sharma > Signed-off-by: Pramod Kumar > --- > arch/arm64/boot/dts/freescale/Makefile| 1 + > .../boot/dts/freescale/fsl-ls1046a-frwy.dts | 156 ++ > 2 files changed, 157 insertions(+) > create mode 100644 arch/arm64/boot/dts/freescale/fsl-ls1046a-frwy.dts > > diff --git a/arch/arm64/boot/dts/freescale/Makefile > b/arch/arm64/boot/dts/freescale/Makefile > index 13604e558dc1..84ff6995b41e 100644 > --- a/arch/arm64/boot/dts/freescale/Makefile > +++ b/arch/arm64/boot/dts/freescale/Makefile > @@ -8,6 +8,7 @@ dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls1028a-qds.dtb > dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls1028a-rdb.dtb > dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls1043a-qds.dtb > dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls1043a-rdb.dtb > +dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls1046a-frwy.dtb > dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls1046a-qds.dtb > dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls1046a-rdb.dtb > dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls1088a-qds.dtb > diff --git a/arch/arm64/boot/dts/freescale/fsl-ls1046a-frwy.dts > b/arch/arm64/boot/dts/freescale/fsl-ls1046a-frwy.dts > new file mode 100644 > index ..de0d19c02944 > --- /dev/null > +++ b/arch/arm64/boot/dts/freescale/fsl-ls1046a-frwy.dts > @@ -0,0 +1,156 @@ > +// SPDX-License-Identifier: (GPL-2.0+ OR MIT) > +/* > + * Device Tree Include file for Freescale Layerscape-1046A family SoC. > + * > + * Copyright 2019 NXP. 
> + * > + */ > + > +/dts-v1/; > + > +#include "fsl-ls1046a.dtsi" > + > +/ { > + model = "LS1046A FRWY Board"; > + compatible = "fsl,ls1046a-frwy", "fsl,ls1046a"; > + > + aliases { > + serial0 = > + serial1 = > + serial2 = > + serial3 = > + }; > + > + chosen { > + stdout-path = "serial0:115200n8"; > + }; > + > + sb_3v3: regulator-sb3v3 { > + compatible = "regulator-fixed"; > + regulator-name = "LT8642SEV-3.3V"; > + regulator-min-microvolt = <330>; > + regulator-max-microvolt = <330>; > + regulator-boot-on; > + regulator-always-on; > + }; > +}; > + > + { > + status = "okay"; > +}; > + > + { > + status = "okay"; > +}; > + > + { > + status = "okay"; > +}; > + > + { > + status = "okay"; > +}; > + > + { > + status = "okay"; > + > + i2c-mux@77 { > + compatible = "nxp,pca9546"; > + reg = <0x77>; > + #address-cells = <1>; > + #size-cells = <0>; > + i2c-mux-never-disable; Undocumented property? > + > + i2c@0 { > + #address-cells = <1>; > + #size-cells = <0>; > + reg = <0>; > + > + eeprom@52 { > + compatible = "atmel,24c512"; > + reg = <0x52>; > + }; > + > + eeprom@53 { > + compatible = "atmel,24c512"; > + reg = <0x53>; > + }; > + > + power-monitor@40 { Sort the nodes in unit-address. 
Shawn > + compatible = "ti,ina220"; > + reg = <0x40>; > + shunt-resistor = <1000>; > + }; > + > + rtc@51 { > + compatible = "nxp,pcf2129"; > + reg = <0x51>; > + }; > + > + temperature-sensor@4c { > + compatible = "nxp,sa56004"; > + reg = <0x4c>; > + vcc-supply = <_3v3>; > + }; > + > + }; > + }; > +}; > + > + { > + #address-cells = <2>; > + #size-cells = <1>; > + /* NAND Flash */ > + ranges = <0x0 0x0 0x0 0x7e80 0x0001>; > + status = "okay"; > + > + nand@0,0 { > + compatible = "fsl,ifc-nand"; > + #address-cells = <1>; > + #size-cells = <1>; > + reg = <0x0 0x0 0x1>; > + }; > + > +}; > + > +#include "fsl-ls1046-post.dtsi" > + > + { > + ethernet@e { > + phy-handle = <_phy4>; > + phy-connection-type = "qsgmii"; > + }; > + > + ethernet@e8000 { > + phy-handle = <_phy2>; > + phy-connection-type = "qsgmii"; > + }; > + > + ethernet@ea000 { > + phy-handle = <_phy1>; > + phy-connection-type = "qsgmii"; > + }; > + > + ethernet@f2000 { > + phy-handle =
Re: [PATCH v3] tipc: fix modprobe tipc failed after switch order of device registration
On 2019/5/19 19:48, Stephen Rothwell wrote: > Hi, > > On Sun, 19 May 2019 17:13:45 +0800 hujunwei wrote: >> >> Fixes: 7e27e8d6130c >> ("tipc: switch order of device registration to fix a crash") > > Please don't split Fixes tags over more than one line. It is OK if > they are too long. > Hi, Stephen Thanks for your suggestion, I will update it later. Regards, Junwei
Re: [PATCH v5 4/6] usb: roles: add API to get usb_role_switch by node
Hi, On Fri, 2019-05-17 at 16:05 +0300, Heikki Krogerus wrote: > Hi, > > On Fri, May 17, 2019 at 01:37:36PM +0300, Heikki Krogerus wrote: > > On Tue, May 14, 2019 at 04:47:21PM +0800, Chunfeng Yun wrote: > > > Add fwnode_usb_role_switch_get() to make easier to get > > > usb_role_switch by fwnode which register it. > > > It's useful when there is not device_connection registered > > > between two drivers and only knows the fwnode which register > > > usb_role_switch. > > > > > > Signed-off-by: Chunfeng Yun > > > Tested-by: Biju Das > > > > Acked-by: Heikki Krogerus > > Hold on. I just noticed Rob's comment on patch 2/6, where he points out > that you don't need to use device graph since the controller is the > parent of the connector. Doesn't that mean you don't really need this > API? No, I still need it. The change is about the way how to get fwnode; when use device graph, get fwnode by of_graph_get_remote_node(); but now will get fwnode by of_get_parent(); > > > > --- > > > v5 changes: > > > 1. remove linux/of.h suggested by Biju > > > 2. add tested by Biju > > > > > > Note: still depends on [1] > > > [1]: [v6,08/13] usb: roles: Introduce stubs for the exiting functions in > > > role.h > > > https://patchwork.kernel.org/patch/10909971/ > > > > > > v4 changes: > > > 1. use switch_fwnode_match() to find fwnode suggested by Heikki > > > 2. this patch now depends on [1] > > > > > > [1] [v6,08/13] usb: roles: Introduce stubs for the exiting functions in > > > role.h > > > https://patchwork.kernel.org/patch/10909971/ > > > > > > v3 changes: > > > 1. use fwnodes instead of node suggested by Andy > > > 2. 
rebuild the API suggested by Heikki > > > > > > v2 no changes > > > --- > > > drivers/usb/roles/class.c | 24 > > > include/linux/usb/role.h | 8 > > > 2 files changed, 32 insertions(+) > > > > > > diff --git a/drivers/usb/roles/class.c b/drivers/usb/roles/class.c > > > index f45d8df5cfb8..4a1f09a41ec0 100644 > > > --- a/drivers/usb/roles/class.c > > > +++ b/drivers/usb/roles/class.c > > > @@ -135,6 +135,30 @@ struct usb_role_switch *usb_role_switch_get(struct > > > device *dev) > > > } > > > EXPORT_SYMBOL_GPL(usb_role_switch_get); > > > > > > +/** > > > + * fwnode_usb_role_switch_get - Find USB role switch by it's parent > > > fwnode > > > + * @fwnode: The fwnode that register USB role switch > > > + * > > > + * Finds and returns role switch registered by @fwnode. The reference > > > count > > > + * for the found switch is incremented. > > > + */ > > > +struct usb_role_switch * > > > +fwnode_usb_role_switch_get(struct fwnode_handle *fwnode) > > > +{ > > > + struct usb_role_switch *sw; > > > + struct device *dev; > > > + > > > + dev = class_find_device(role_class, NULL, fwnode, switch_fwnode_match); > > > + if (!dev) > > > + return ERR_PTR(-EPROBE_DEFER); > > > + > > > + sw = to_role_switch(dev); > > > + WARN_ON(!try_module_get(sw->dev.parent->driver->owner)); > > > + > > > + return sw; > > > +} > > > +EXPORT_SYMBOL_GPL(fwnode_usb_role_switch_get); > > This function only basically converts the fwnode to usb_role_switch, > but I would actually prefer that we walked through the device graph > here instead of expecting the caller to do that. > > So this function should probable be called fwnode_to_usb_role_switch() > and not fwnode_usb_role_switch_get(), but I guess you don't need it > at all, right? > > > thanks, >
[PATCH v2] kernel: fix typos and some coding style in comments
fix lenght to length Signed-off-by: Weitao Hou --- Changes in v2: - fix space before tab warnings --- kernel/sysctl.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 943c89178e3d..f78f725f225e 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -187,17 +187,17 @@ extern int no_unaligned_warning; * enum sysctl_writes_mode - supported sysctl write modes * * @SYSCTL_WRITES_LEGACY: each write syscall must fully contain the sysctl value - * to be written, and multiple writes on the same sysctl file descriptor - * will rewrite the sysctl value, regardless of file position. No warning - * is issued when the initial position is not 0. + * to be written, and multiple writes on the same sysctl file descriptor + * will rewrite the sysctl value, regardless of file position. No warning + * is issued when the initial position is not 0. * @SYSCTL_WRITES_WARN: same as above but warn when the initial file position is - * not 0. + * not 0. * @SYSCTL_WRITES_STRICT: writes to numeric sysctl entries must always be at - * file position 0 and the value must be fully contained in the buffer - * sent to the write syscall. If dealing with strings respect the file - * position, but restrict this to the max length of the buffer, anything - * passed the max lenght will be ignored. Multiple writes will append - * to the buffer. + * file position 0 and the value must be fully contained in the buffer + * sent to the write syscall. If dealing with strings respect the file + * position, but restrict this to the max length of the buffer, anything + * passed the max length will be ignored. Multiple writes will append + * to the buffer. * * These write modes control how current file position affects the behavior of * updating sysctl values through the proc interface on each write. -- 2.18.0
Re: [PATCH v2 02/11] ARM: dts: imx7s: Update coresight DT bindings
On Wed, May 08, 2019 at 10:18:53AM +0800, Leo Yan wrote:
> CoreSight DT bindings have been updated, thus the old compatible strings
> are obsolete and the drivers will report a warning if a DTS uses these
> obsolete strings.
>
> This patch switches to the new bindings for the CoreSight dynamic funnel
> and static replicator, so the warning during initialisation can be
> dismissed.
>
> Cc: Shawn Guo
> Cc: Chris Healy
> Cc: Andrew Lunn
> Cc: Fabio Estevam
> Cc: Mathieu Poirier
> Cc: Suzuki K Poulose
> Signed-off-by: Leo Yan

Applied, thanks.
Re: [v3 PATCH] dt-binding: usb: add usb-role-switch property
On Fri, 2019-05-17 at 16:27 +0300, Heikki Krogerus wrote:
> On Wed, May 08, 2019 at 05:17:44PM +0800, Chunfeng Yun wrote:
> > Add a property usb-role-switch to tell the driver to use the
> > USB Role Switch framework to handle role switching; it is useful
> > when the driver already supports other ways, such as the extcon
> > framework.
> >
> > Cc: Biju Das
> > Cc: Yu Chen
> > Signed-off-by: Chunfeng Yun
>
> Who is meant to pick this? Can you include this in your series where
> you introduce that USB Type-B GPIO connector driver?

Ok, I'll do it if needed.

> FWIW:
>
> Reviewed-by: Heikki Krogerus
>
> > ---
> > v3:
> >   add property type, modify description suggested by Heikki
> >
> > v2:
> >   describe it in terms of h/w functionality suggested by Rob
> >
> > v1:
> >   the property is discussed in:
> >   [v2,2/7] dt-bindings: usb: renesas_usb3: add usb-role-switch property
> >   https://patchwork.kernel.org/patch/10852497/
> >
> >   Mediatek and Hisilicon also try to use it:
> >   [v4,3/6] dt-bindings: usb: mtu3: add properties about USB Role Switch
> >   https://patchwork.kernel.org/patch/10918385/
> >   [v4,6/6] usb: mtu3: register a USB Role Switch for dual role mode
> >   https://patchwork.kernel.org/patch/10918367/
> >
> >   [v6,10/13] usb: dwc3: Registering a role switch in the DRD code
> >   https://patchwork.kernel.org/patch/10909981/
> > ---
> >  Documentation/devicetree/bindings/usb/generic.txt | 4
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/Documentation/devicetree/bindings/usb/generic.txt b/Documentation/devicetree/bindings/usb/generic.txt
> > index 0a74ab8dfdc2..cf5a1ad456e6 100644
> > --- a/Documentation/devicetree/bindings/usb/generic.txt
> > +++ b/Documentation/devicetree/bindings/usb/generic.txt
> > @@ -30,6 +30,10 @@ Optional properties:
> >  	optional for OTG device.
> >  - adp-disable: tells OTG controllers we want to disable OTG ADP, ADP is
> >  	optional for OTG device.
> > + - usb-role-switch: boolean, indicates that the device is capable of
> > +	assigning the USB data role (USB host or USB device) for a given
> > +	USB connector, such as Type-C, Type-B(micro).
> > +	see connector/usb-connector.txt.
> >
> >  This is an attribute to a USB controller such as:
> >
> > --
> > 2.21.0

thanks,
RE: [PATCH 2/6] Revert "arm64: dts: renesas: r8a7796: Enable DMA for SCIF2"
Hi Eugeniu-san, Geert-san,

> From: Eugeniu Rosca, Sent: Tuesday, May 7, 2019 4:43 AM
> > > [0] v5.0-rc6 commit 97f26702bc95b5 ("arm64: dts: renesas: r8a7796:
> > >     Enable DMA for SCIF2")
> > > [1] v4.14.106 commit 703db5d1b1759f ("arm64: dts: renesas: r8a7796:
> > >     Enable DMA for SCIF2")
> > > [2] scif (DEBUG) and rcar-dmac logs:
> > >     https://gist.github.com/erosca/132cce76a619724a9e4fa61d1db88c66
>
> Enabling DEBUG in drivers/dma/sh/rcar-dmac.c, I can notice that one of
> the symptoms is a NULL dst_addr revealed by:
>
> rcar-dmac e730.dma-controller: chan0: queue chunk (ptrval):
> 0@0x800639eb8090 -> 0x
>
> In working scenarios, dst_addr is never zero. Does it give any hints?

Thank you for the report! It's very helpful to me.
I think we should fix the sh-sci driver at least.
According to the [2] log above:

[4.379716] sh-sci e6e88000.serial: sci_dma_tx_work_fn: 800639b55000: 0...0, cookie 126

This "0...0" means that s->tx_dma_len in sci_dma_tx_work_fn() will be
zero. And:

> rcar-dmac e730.dma-controller: chan0: queue chunk (ptrval):
> 0@0x800639eb8090 -> 0x

This means the chunk->dst_addr is not set to the "dst_addr" for SCIF,
because the len in rcar_dmac_chan_prep_sg() is zero.

So, I'm thinking:
- we have to fix the sh-sci driver to avoid submitting a transfer with
  tx_dma_len = 0, and
- we also have to fix the rcar-dmac driver to avoid this issue, because
  the DMA Engine API guide doesn't prevent len = 0.

Eugeniu-san, Geert-san, what do you think?

Best regards,
Yoshihiro Shimoda

> > > Thanks!
> >
> > Likewise!
> >
> > Gr{oetje,eeting}s,
> >
> > Geert
> >
> > --
> > Geert Uytterhoeven -- There's lots of Linux beyond ia32 --
> > ge...@linux-m68k.org
> >
> > In personal conversations with technical people, I call myself a hacker. But
> > when I'm talking to journalists I just say "programmer" or something like
> > that.
> > -- Linus Torvalds
>
> --
> Best Regards,
> Eugeniu.
Re: [f2fs-dev] [PATCH v2 2/2] f2fs: relocate chksum_offset for large_nat_bitmap feature
On 2019/5/20 1:31, Ju Hyung Park wrote:
> Hi Jaegeuk and Chao,
>
> I was semi-forced today to use the new kernel and test f2fs.
>
> My Ubuntu initramfs got a bit wonky and I had to boot into a live CD and
> fix some stuff. The live CD was using the 4.15 kernel, and just mounting
> the f2fs partition there corrupted f2fs, and my 4.19 (with 5.1-rc1-4.19
> f2fs-stable merged) refused to mount with a "SIT is corrupted node"
> message.
>
> I used the latest f2fs-tools sent by Chao, including "fsck.f2fs: fix to
> repair cp_loads blocks at correct position".
>
> It spit out 140M worth of output, but at least I didn't have to run it
> twice. Everything returned "Ok" in the 2nd run.
> The new log is at http://arter97.com/f2fs/final
>
> After fixing the image, I used my 4.19 kernel with 5.2-rc1-4.19
> f2fs-stable merged and it mounted.
>
> But, I got this:
> [1.047791] F2FS-fs (nvme0n1p3): layout of large_nat_bitmap is
> deprecated, run fsck to repair, chksum_offset: 4092
> [1.081307] F2FS-fs (nvme0n1p3): Found nat_bits in checkpoint
> [1.161520] F2FS-fs (nvme0n1p3): recover fsync data on readonly fs
> [1.162418] F2FS-fs (nvme0n1p3): Mounted with checkpoint version = 761c7e00
>
> But after doing a reboot, the message is gone:
> [1.098423] F2FS-fs (nvme0n1p3): Found nat_bits in checkpoint
> [1.11] F2FS-fs (nvme0n1p3): recover fsync data on readonly fs
> [1.178365] F2FS-fs (nvme0n1p3): Mounted with checkpoint version = 761c7eda
>
> I'm not exactly sure why the kernel detected that I'm still using the
> old layout on the first boot. Maybe fsck didn't fix it properly, or
> the check from the kernel is improper.

The fsck should have fixed the layout problem; that is just an
unnecessary message, and we can avoid it by relocating the sanity check
of the layout. I've sent one patch for that, and tested with it; it acts
as expected. :)

> I also noticed that Jaegeuk sent v1 of this patch to upstream. (Maybe
> that's why the kernel detected the old layout?)
> Please send v2 to upstream soon, as running older fsck will cause much
> more headaches.

Since v1 was upstreamed in 5.2-rc1, I guess we need another patch for
that, maybe the below patch?

[PATCH] f2fs: fix to check layout on last valid checkpoint park

Thanks,

> Thanks.
>
> On Fri, Apr 26, 2019 at 11:26 AM Chao Yu wrote:
>>
>> For large_nat_bitmap feature, there is a design flaw:
>>
>> Previous:
>>
>> struct f2fs_checkpoint layout:
>> +--------------------------+ 0x
>> | checkpoint_ver           |
>> | ..                       |
>> | checksum_offset          |--+
>> | ..                       |  |
>> | sit_nat_version_bitmap[] |<-|---+
>> | ..                       |  |   |
>> | checksum_value           |<-+   |
>> +--------------------------+ 0x1000
>> |                          |      | nat_bitmap + sit_bitmap
>> | payload blocks           |      |
>> |                          |      |
>> +--------------------------|<-----+
>>
>> Obviously, if nat_bitmap size + sit_bitmap size is larger than
>> MAX_BITMAP_SIZE_IN_CKPT, nat_bitmap or sit_bitmap may overlap
>> checkpoint checksum's position, once checkpoint() is triggered
>> from kernel, nat or sit bitmap will be damaged by checksum field.
>>
>> In order to fix this, let's relocate checksum_value's position
>> to the head of sit_nat_version_bitmap as below, then nat/sit
>> bitmap and chksum value update will become safe.
>>
>> After:
>>
>> struct f2fs_checkpoint layout:
>> +--------------------------+ 0x
>> | checkpoint_ver           |
>> | ..                       |
>> | checksum_offset          |--+
>> | ..                       |  |
>> | sit_nat_version_bitmap[] |<-+
>> | ..                       |<-----+
>> |                          |      |
>> +--------------------------+ 0x1000
>> |                          |      | nat_bitmap + sit_bitmap
>> | payload blocks           |      |
>> |                          |      |
>> +--------------------------|<-----+
>>
>> Related report and discussion:
>>
>> https://sourceforge.net/p/linux-f2fs/mailman/message/36642346/
>>
>> Reported-by: Park Ju Hyung
>> Signed-off-by: Chao Yu
>> ---
>> v2:
>> - improve hint message suggested by Ju Hyung.
>> - move verification to f2fs_sanity_check_ckpt().
>>  fs/f2fs/f2fs.h  |  4 +++-
>>  fs/f2fs/super.c | 13 +
>>  2 files changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>> index 119bc5a9783e..aa71c1aa9eaa 100644
>> --- a/fs/f2fs/f2fs.h
>> +++ b/fs/f2fs/f2fs.h
>> @@ -1909,9 +1909,11 @@ static inline void *__bitmap_ptr(struct f2fs_sb_info *sbi, int flag)
>>  	int offset;
>>
>>  	if (is_set_ckpt_flags(sbi, CP_LARGE_NAT_BITMAP_FLAG)) {
>> +		unsigned int chksum_size = sizeof(__le32);
>> +
>>  		offset = (flag == SIT_BITMAP) ?
>>  			le32_to_cpu(ckpt->nat_ver_bitmap_bytesize) :