Re: linux-next: removal of the leaks tree

2020-06-16 Thread Tobin C. Harding
On Tue, Jun 16, 2020 at 02:53:33PM +1000, Stephen Rothwell wrote:
> Hi,
>
> I have removed the leaks tree
> (https://git.kernel.org/pub/scm/linux/kernel/git/tobin/leaks.git#leaks-next)
> from linux-next because it has not been updated in more than a year.
> If you would like it reinstated, please just reply and let me know.

No worries Stephen, thanks for letting me know.


Tobin


Re: shrink_dentry_list() logics change (was Re: [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects)

2019-07-01 Thread Tobin C. Harding
On Sat, Jun 29, 2019 at 08:06:24PM +0100, Al Viro wrote:
> On Sat, Jun 29, 2019 at 05:38:03AM +0100, Al Viro wrote:
> 
> > PS: the problem is not gone in the next iteration of the patchset in
> > question.  The patch I'm proposing (including dput_to_list() and _ONLY_
> > compile-tested) follows.  Comments?
> 
> FWIW, there's another unpleasantness in the whole thing.  Suppose we have
> picked a page full of dentries, all with refcount 0.  We decide to
> evict all of them.  As it turns out, they are from two filesystems.
> Filesystem 1 is NFS on a server, with currently downed hub on the way
> to it.  Filesystem 2 is local.  We attempt to evict an NFS dentry and
> get stuck - tons of dirty data with no way to flush them on server.
> In the meanwhile, admin tries to unmount the local filesystem.  And
> gets stuck as well, since umount can't do anything to its dentries
> that happen to sit in our shrink list.
> 
> I wonder if the root of problem here isn't in shrink_dcache_for_umount();
> all it really needs is to have everything on that fs with refcount 0
> dragged through __dentry_kill().  If something had been on a shrink
> list, __dentry_kill() will just leave behind a struct dentry completely
> devoid of any connection to superblock, other dentries, filesystem
> type, etc. - it's just a piece of memory that won't be freed until
> the owner of shrink list finally gets around to it.  Which can happen
> at any point - all they'll do to it is dentry_free(), and that doesn't
> need any fs-related data structures.
> 
> The logic in shrink_dcache_parent() is
>   collect everything evictable into a shrink list
>   if anything found - kick it out and repeat the scan
>   otherwise, if something had been on other's shrink list
>           repeat the scan
> 
> I wonder if after the "no evictable candidates, but something
> on other's shrink lists" we ought to do something along the
> lines of
>   rcu_read_lock
>   walk it, doing
>           if dentry has zero refcount
>                   if it's not on a shrink list,
>                           move it to ours
>                   else
>                           store its address in 'victim'
>                           end the walk
>   if no victim found
>           rcu_read_unlock
>   else
>           lock victim for __dentry_kill
>           rcu_read_unlock
>           if it's still alive
>                   if it's not IS_ROOT
>                           if parent is not on shrink list
>                                   decrement parent's refcount
>                                   put it on our list
>                           else
>                                   decrement parent's refcount
>                   __dentry_kill(victim)
>           else
>                   unlock
>   if our list is non-empty
>           shrink_dentry_list on it
> in there...

Thanks for still thinking about this Al.  I don't have a lot of idea
about what to do with your comments until I can grok them fully but I
wanted to acknowledge having read them.

Thanks,
Tobin.


[PATCH 12/15] dcache: Provide a dentry constructor

2019-06-02 Thread Tobin C. Harding
In order to support object migration on the dentry cache we need to have
a determined object state at all times. Without a constructor the object
would have a random state after allocation.

Provide a dentry constructor.

Signed-off-by: Tobin C. Harding 
---
 fs/dcache.c | 30 +-
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index c435398f2c81..867d97a86940 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1603,6 +1603,16 @@ void d_invalidate(struct dentry *dentry)
 }
 EXPORT_SYMBOL(d_invalidate);
 
+static void dcache_ctor(void *p)
+{
+   struct dentry *dentry = p;
+
+   /* Mimic lockref_mark_dead() */
+   dentry->d_lockref.count = -128;
+
+   spin_lock_init(&dentry->d_lock);
+}
+
 /**
  * __d_alloc   -   allocate a dcache entry
  * @sb: filesystem it will belong to
@@ -1658,7 +1668,6 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 
dentry->d_lockref.count = 1;
dentry->d_flags = 0;
-   spin_lock_init(&dentry->d_lock);
seqcount_init(&dentry->d_seq);
dentry->d_inode = NULL;
dentry->d_parent = dentry;
@@ -3096,14 +3105,17 @@ static void __init dcache_init_early(void)
 
 static void __init dcache_init(void)
 {
-   /*
-* A constructor could be added for stable state like the lists,
-* but it is probably not worth it because of the cache nature
-* of the dcache.
-*/
-   dentry_cache = KMEM_CACHE_USERCOPY(dentry,
-   SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|SLAB_ACCOUNT,
-   d_iname);
+   slab_flags_t flags =
+   SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | SLAB_MEM_SPREAD | SLAB_ACCOUNT;
+
+   dentry_cache =
+   kmem_cache_create_usercopy("dentry",
+  sizeof(struct dentry),
+  __alignof__(struct dentry),
+  flags,
+  offsetof(struct dentry, d_iname),
+  sizeof_field(struct dentry, d_iname),
+  dcache_ctor);
 
/* Hash may have been set up in dcache_init_early */
if (!hashdist)
-- 
2.21.0



[PATCH 14/15] slub: Enable moving objects to/from specific nodes

2019-06-02 Thread Tobin C. Harding
We have just implemented Slab Movable Objects (SMO, object migration).
Currently object migration is used to defrag a cache.  On NUMA systems
it would be nice to be able to control the source and destination nodes
when moving objects.

Add CONFIG_SLUB_SMO_NODE to guard this feature.  CONFIG_SLUB_SMO_NODE
depends on CONFIG_SLUB_DEBUG because we use the full list.

Implement moving all objects (including those in full slabs) to a
specific node.  Expose this functionality to userspace via a sysfs
entry.

Add sysfs entry:

   /sys/kernel/slab/<cache>/move

With this users get access to the following functionality:

 - Move all objects to specified node.

echo "N1" > move

 - Move all objects from specified node to other specified
   node (from N1 -> to N2):

echo "N1 N2" > move

This also enables shrinking slabs on a specific node:

echo "N1 N1" > move

Signed-off-by: Tobin C. Harding 
---
 mm/Kconfig |   7 ++
 mm/slub.c  | 247 +
 2 files changed, 254 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index f0c76ba47695..c1438b9e578b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -259,6 +259,13 @@ config ARCH_ENABLE_THP_MIGRATION
 config CONTIG_ALLOC
def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
 
+config SLUB_SMO_NODE
+   bool "Enable per node control of Slab Movable Objects"
+   depends on SLUB && SYSFS
+   select SLUB_DEBUG
+   help
+ On NUMA systems enable moving objects to and from a specified node.
+
 config PHYS_ADDR_T_64BIT
def_bool 64BIT
 
diff --git a/mm/slub.c b/mm/slub.c
index 2157205df7ba..23566e5a712b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4336,6 +4336,130 @@ static void move_slab_page(struct page *page, void *scratch, int node)
s->migrate(s, vector, count, node, private);
 }
 
+#ifdef CONFIG_SLUB_SMO_NODE
+/*
+ * kmem_cache_move() - Attempt to move all slab objects.
+ * @s: The cache we are working on.
+ * @node: The node to move objects away from.
+ * @target_node: The node to move objects on to.
+ *
+ * Attempts to move all objects (partial slabs and full slabs) to target
+ * node.
+ *
+ * Context: Takes the list_lock.
+ * Return: The number of slabs remaining on node.
+ */
+static unsigned long kmem_cache_move(struct kmem_cache *s,
+int node, int target_node)
+{
+   struct kmem_cache_node *n = get_node(s, node);
+   LIST_HEAD(move_list);
+   struct page *page, *page2;
+   unsigned long flags;
+   void **scratch;
+
+   if (!s->migrate) {
+   pr_warn("%s SMO not enabled, cannot move objects\n", s->name);
+   goto out;
+   }
+
+   scratch = alloc_scratch(s);
+   if (!scratch)
+   goto out;
+
+   spin_lock_irqsave(&n->list_lock, flags);
+
+   list_for_each_entry_safe(page, page2, &n->partial, lru) {
+   if (!slab_trylock(page))
+   /* Busy slab. Get out of the way */
+   continue;
+
+   if (page->inuse) {
+   list_move(&page->lru, &move_list);
+   /* Stop page being considered for allocations */
+   n->nr_partial--;
+   page->frozen = 1;
+
+   slab_unlock(page);
+   } else {/* Empty slab page */
+   list_del(&page->lru);
+   n->nr_partial--;
+   slab_unlock(page);
+   discard_slab(s, page);
+   }
+   }
+   list_for_each_entry_safe(page, page2, &n->full, lru) {
+   if (!slab_trylock(page))
+   continue;
+
+   list_move(&page->lru, &move_list);
+   page->frozen = 1;
+   slab_unlock(page);
+   }
+
+   spin_unlock_irqrestore(&n->list_lock, flags);
+
+   list_for_each_entry(page, &move_list, lru) {
+   if (page->inuse)
+   move_slab_page(page, scratch, target_node);
+   }
+   kfree(scratch);
+
+   /* Bail here to save taking the list_lock */
+   if (list_empty(&move_list))
+   goto out;
+
+   /* Inspect results and dispose of pages */
+   spin_lock_irqsave(&n->list_lock, flags);
+   list_for_each_entry_safe(page, page2, &move_list, lru) {
+   list_del(&page->lru);
+   slab_lock(page);
+   page->frozen = 0;
+
+   if (page->inuse) {
+   if (page->inuse == page->objects) {
+   list_add(&page->lru, &n->full);
+   slab_unlock(page);
+   } else {
+   n->nr_partial++;
+   list_add_tail(&page->lru, &n->partial);
+ 
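
The archived diff is cut off above.  For orientation only, here is a
hedged sketch of the kind of sysfs 'move' store handler the commit
message describes; the handler name, the sscanf-based parsing and the
attribute wiring are assumptions, only kmem_cache_move() and
kmem_cache_move_to_node() come from this series.

	/* Illustrative sketch only -- not taken from the patch. */
	static ssize_t move_store(struct kmem_cache *s, const char *buf,
				  size_t length)
	{
		int node, target_node;
		int ret;

		ret = sscanf(buf, "N%d N%d", &node, &target_node);
		if (ret == 2)
			/* "N1 N2": move objects from node 1 to node 2. */
			kmem_cache_move(s, node, target_node);
		else if (ret == 1)
			/* "N1": move all objects in the cache to node 1. */
			kmem_cache_move_to_node(s, node);
		else
			return -EINVAL;

		return length;
	}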

[PATCH 15/15] slub: Enable balancing slabs across nodes

2019-06-02 Thread Tobin C. Harding
We have just implemented Slab Movable Objects (SMO).  On NUMA systems
slabs can become unbalanced i.e. many slabs on one node while other
nodes have few slabs.  Using SMO we can balance the slabs across all
the nodes.

The algorithm used is as follows:

 1. Move all objects to node 0 (this has the effect of defragmenting the
cache).

 2. Calculate the desired number of slabs for each node (this is done
using the approximation nr_slabs / nr_nodes).

 3. Loop over the nodes moving the desired number of slabs from node 0
to the node.

The feature is conditionally built in with CONFIG_SLUB_SMO_NODE because
we need the full list (we enable SLUB_DEBUG to get this).  A future
version may separate the full list out of SLUB_DEBUG.

Expose this functionality to userspace via a sysfs entry.  Add sysfs
entry:

   /sys/kernel/slab/<cache>/balance

Writing '1' to this file triggers a balance; no other value is accepted.

This feature relies on SMO being enabled for the cache; this is done,
after the isolate/migrate functions have been defined, with a call to:

kmem_cache_setup_mobility(s, isolate, migrate)

Signed-off-by: Tobin C. Harding 
---
 mm/slub.c | 130 ++
 1 file changed, 130 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index 23566e5a712b..70e46c4db757 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4458,6 +4458,119 @@ static unsigned long kmem_cache_move_to_node(struct kmem_cache *s, int node)
 
return left;
 }
+
+/*
+ * kmem_cache_move_slabs() - Attempt to move @num slabs to @target_node.
+ * @s: The cache we are working on.
+ * @node: The node to move objects from.
+ * @target_node: The node to move objects to.
+ * @num: The number of slabs to move.
+ *
+ * Attempts to move @num slabs from @node to @target_node.  This is done
+ * by migrating objects from slabs on the full_list.
+ *
+ * Return: The number of slabs moved or error code.
+ */
+static long kmem_cache_move_slabs(struct kmem_cache *s,
+ int node, int target_node, long num)
+{
+   struct kmem_cache_node *n = get_node(s, node);
+   LIST_HEAD(move_list);
+   struct page *page, *page2;
+   unsigned long flags;
+   void **scratch;
+   long done = 0;
+
+   if (!s->migrate) {
+   pr_warn("%s SMO not enabled, cannot move objects\n", s->name);
+   goto out;
+   }
+
+   if (node == target_node)
+   return -EINVAL;
+
+   scratch = alloc_scratch(s);
+   if (!scratch)
+   return -ENOMEM;
+
+   spin_lock_irqsave(&n->list_lock, flags);
+
+   list_for_each_entry_safe(page, page2, &n->full, lru) {
+   if (!slab_trylock(page))
+   /* Busy slab. Get out of the way */
+   continue;
+
+   list_move(&page->lru, &move_list);
+   page->frozen = 1;
+   slab_unlock(page);
+
+   if (++done >= num)
+   break;
+   }
+   spin_unlock_irqrestore(&n->list_lock, flags);
+
+   list_for_each_entry(page, &move_list, lru) {
+   if (page->inuse)
+   move_slab_page(page, scratch, target_node);
+   }
+   kfree(scratch);
+
+   /* Bail here to save taking the list_lock */
+   if (list_empty(&move_list))
+   goto out;
+
+   /* Inspect results and dispose of pages */
+   spin_lock_irqsave(&n->list_lock, flags);
+   list_for_each_entry_safe(page, page2, &move_list, lru) {
+   list_del(&page->lru);
+   slab_lock(page);
+   page->frozen = 0;
+
+   if (page->inuse) {
+   /*
+* This is best effort only, if slab still has
+* objects just put it back on the partial list.
+*/
+   n->nr_partial++;
+   list_add_tail(&page->lru, &n->partial);
+   slab_unlock(page);
+   } else {
+   slab_unlock(page);
+   discard_slab(s, page);
+   }
+   }
+   spin_unlock_irqrestore(&n->list_lock, flags);
+out:
+   return done;
+}
+
+/*
+ * kmem_cache_balance_nodes() - Balance slabs across nodes.
+ * @s: The cache we are working on.
+ */
+static void kmem_cache_balance_nodes(struct kmem_cache *s)
+{
+   struct kmem_cache_node *n = get_node(s, 0);
+   unsigned long desired_nr_slabs_per_node;
+   unsigned long nr_slabs;
+   int nr_nodes = 0;
+   int nid;
+
+   (void)kmem_cache_move_to_node(s, 0);
+
+   for_each_node_state(nid, N_NORMAL_MEMORY)
+   nr_nodes++;
+
+   nr_slabs = atomic_long_read(&n->nr_slabs);
+   desired_nr_slabs_per_node = nr_slabs / nr_nodes;
+
+   for_each_node_state(nid, N_NORMAL_MEMORY) {
+   if (nid == 0)
+   continue;
+
+ 
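
The archived diff is also cut off here, inside the final loop of
kmem_cache_balance_nodes().  Based on the algorithm in the commit
message, a hedged reconstruction of that loop might look like the
following (not the actual patch text; only kmem_cache_move_slabs() is
taken from the diff above):

	/* Hand every node other than node 0 its share of the slabs. */
	for_each_node_state(nid, N_NORMAL_MEMORY) {
		if (nid == 0)
			continue;

		/* Best effort; kmem_cache_move_slabs() may move fewer. */
		kmem_cache_move_slabs(s, 0, nid, desired_nr_slabs_per_node);
	}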

[PATCH 10/15] xarray: Implement migration function for xa_node objects

2019-06-02 Thread Tobin C. Harding
Recently Slab Movable Objects (SMO) was implemented for the SLUB
allocator.  The XArray can take advantage of this and make the xa_node
slab cache objects movable.

Implement functions to migrate objects and activate SMO when we
initialise the XArray slab cache.

This is based on initial code by Matthew Wilcox and was modified to work
with slab object migration.

Cc: Matthew Wilcox 
Signed-off-by: Tobin C. Harding 
---
 lib/xarray.c | 61 
 1 file changed, 61 insertions(+)

diff --git a/lib/xarray.c b/lib/xarray.c
index 861c042daa1d..9354e0f01f26 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1993,12 +1993,73 @@ static void xa_node_ctor(void *arg)
INIT_LIST_HEAD(&node->private_list);
 }
 
+static void xa_object_migrate(struct xa_node *node, int numa_node)
+{
+   struct xarray *xa = READ_ONCE(node->array);
+   void __rcu **slot;
+   struct xa_node *new_node;
+   int i;
+
+   /* Freed or not yet in tree then skip */
+   if (!xa || xa == XA_RCU_FREE)
+   return;
+
+   new_node = kmem_cache_alloc_node(xa_node_cachep, GFP_KERNEL, numa_node);
+   if (!new_node) {
+   pr_err("%s: slab cache allocation failed\n", __func__);
+   return;
+   }
+
+   xa_lock_irq(xa);
+
+   /* Check again. */
+   if (xa != node->array) {
+   node = new_node;
+   goto unlock;
+   }
+
+   memcpy(new_node, node, sizeof(struct xa_node));
+
+   if (list_empty(&node->private_list))
+   INIT_LIST_HEAD(&new_node->private_list);
+   else
+   list_replace(&node->private_list, &new_node->private_list);
+
+   for (i = 0; i < XA_CHUNK_SIZE; i++) {
+   void *x = xa_entry_locked(xa, new_node, i);
+
+   if (xa_is_node(x))
+   rcu_assign_pointer(xa_to_node(x)->parent, new_node);
+   }
+   if (!new_node->parent)
+   slot = &xa->xa_head;
+   else
+   slot = &xa_parent_locked(xa, new_node)->slots[new_node->offset];
+   rcu_assign_pointer(*slot, xa_mk_node(new_node));
+
+unlock:
+   xa_unlock_irq(xa);
+   xa_node_free(node);
+   rcu_barrier();
+}
+
+static void xa_migrate(struct kmem_cache *s, void **objects, int nr,
+  int node, void *_unused)
+{
+   int i;
+
+   for (i = 0; i < nr; i++)
+   xa_object_migrate(objects[i], node);
+}
+
 void __init xarray_slabcache_init(void)
 {
xa_node_cachep = kmem_cache_create("xarray_node",
   sizeof(struct xa_node), 0,
   SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
   xa_node_ctor);
+
+   kmem_cache_setup_mobility(xa_node_cachep, NULL, xa_migrate);
 }
 
 #ifdef XA_DEBUG
-- 
2.21.0



[PATCH 07/15] tools/testing/slab: Add object migration test module

2019-06-02 Thread Tobin C. Harding
 Total  :   1   Sanity Checks : On   Total:8192
  SlabObj: 392  Full   :   1   Redzoning : On   Used :1120
  SlabSiz:8192  Partial:   0   Poisoning : On   Loss :7072
  Loss   : 336  CpuSlab:   0   Tracking  : On   Lalig:6720
  Align  :   8  Objects:  20   Tracing   : Off  Lpadd: 352

We can run the stress tests (with the default number of objects):

  # cd /sys/kernel/debug/smo
  # echo 'test' > callfn
  [3.576617] smo: test using nr_objs: 1000 keep: 10
  [3.580169] smo: Module tests completed successfully

Signed-off-by: Tobin C. Harding 
---
 tools/testing/slab/Makefile  |  10 +
 tools/testing/slab/slub_defrag.c | 566 +++
 2 files changed, 576 insertions(+)
 create mode 100644 tools/testing/slab/Makefile
 create mode 100644 tools/testing/slab/slub_defrag.c

diff --git a/tools/testing/slab/Makefile b/tools/testing/slab/Makefile
new file mode 100644
index ..440c2e3e356f
--- /dev/null
+++ b/tools/testing/slab/Makefile
@@ -0,0 +1,10 @@
+obj-m += slub_defrag.o
+
+KTREE=../../..
+
+all:
+   make -C ${KTREE} M=$(PWD) modules
+
+clean:
+   make -C ${KTREE} M=$(PWD) clean
+
diff --git a/tools/testing/slab/slub_defrag.c b/tools/testing/slab/slub_defrag.c
new file mode 100644
index ..4a5c24394b96
--- /dev/null
+++ b/tools/testing/slab/slub_defrag.c
@@ -0,0 +1,566 @@
+// SPDX-License-Identifier: GPL-2.0+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * SLUB defragmentation a.k.a. Slab Movable Objects (SMO).
+ *
+ * This module is used for testing the SLUB allocator.  Enables
+ * userspace to run kernel functions via a debugfs file.
+ *
+ *   debugfs: /sys/kernel/debug/smo/callfn (write only)
+ *
+ * String written to `callfn` is parsed by the module and associated
+ * function is called.  See fn_tab for mapping of strings to functions.
+ */
+
+/* debugfs commands accept two optional arguments */
+#define SMO_CMD_DEFAUT_ARG -1
+
+#define SMO_DEBUGFS_DIR "smo"
+struct dentry *smo_debugfs_root;
+
+#define SMO_CACHE_NAME "smo_test"
+static struct kmem_cache *cachep;
+
+struct smo_slub_object {
+   struct list_head list;
+   char buf[32];   /* Unused except to control size of object */
+   long id;
+};
+
+/* Our list of allocated objects */
+LIST_HEAD(objects);
+
+static void list_add_to_objects(struct smo_slub_object *so)
+{
+   /*
+* We free from the front of the list so store at the
+* tail in order to put holes in the cache when we free.
+*/
+   list_add_tail(&so->list, &objects);
+}
+
+/**
+ * smo_object_ctor() - SMO object constructor function.
+ * @ptr: Pointer to memory where the object should be constructed.
+ */
+void smo_object_ctor(void *ptr)
+{
+   struct smo_slub_object *so = ptr;
+
+   INIT_LIST_HEAD(&so->list);
+   memset(so->buf, 0, sizeof(so->buf));
+   so->id = -1;
+}
+
+/**
+ * smo_cache_migrate() - kmem_cache migrate function.
+ * @cp: kmem_cache pointer.
+ * @objs: Array of pointers to objects to migrate.
+ * @size: Number of objects in @objs.
+ * @node: NUMA node where the object should be allocated.
+ * @private: Pointer returned by kmem_cache_isolate_func().
+ */
+void smo_cache_migrate(struct kmem_cache *cp, void **objs, int size,
+  int node, void *private)
+{
+   struct smo_slub_object **so_objs = (struct smo_slub_object **)objs;
+   struct smo_slub_object *so_old, *so_new;
+   int i;
+
+   for (i = 0; i < size; i++) {
+   so_old = so_objs[i];
+
+   so_new = kmem_cache_alloc_node(cachep, GFP_KERNEL, node);
+   if (!so_new) {
+   pr_debug("kmem_cache_alloc failed\n");
+   return;
+   }
+
+   /* Copy object */
+   so_new->id = so_old->id;
+
+   /* Update references to old object */
+   list_del(&so_old->list);
+   list_add_to_objects(so_new);
+
+   kmem_cache_free(cachep, so_old);
+   }
+}
+
+static int smo_enable_cache_mobility(int _unused, int __unused)
+{
+   /* Enable movable objects: BOOM! */
+   kmem_cache_setup_mobility(cachep, NULL, smo_cache_migrate);
+   pr_info("smo: kmem_cache %s defrag enabled\n", SMO_CACHE_NAME);
+   return 0;
+}
+
+/*
+ * smo_alloc_objects() - Allocate objects and store reference.
+ * @nr_objs: Number of objects to allocate.
+ * @node: NUMA node to allocate objects on.
+ *
+ * Allocates @n smo_slub_objects.  Stores a reference to them in
+ * the global list of objects (at the tail of the list).
+ *
+ * Return: The number of objects allocated.
+ */
+static int smo_alloc_objects(int nr_objs, int node)
+{
+   struct smo_slub_object *so;
+   int i;
+
+   /* Set sane parameters if no args passed in */
+   if (nr_objs == 

[PATCH 09/15] lib: Separate radix_tree_node and xa_node slab cache

2019-06-02 Thread Tobin C. Harding
Earlier, Slab Movable Objects (SMO) was implemented.  The XArray is now
able to take advantage of SMO in order to make xarray nodes
movable (when using the SLUB allocator).

Currently the radix tree uses the same slab cache as the XArray.  Only
XArray nodes are movable _not_ radix tree nodes.  We can give the radix
tree its own slab cache to overcome this.

In preparation for implementing XArray object migration (xa_node
objects) via Slab Movable Objects add a slab cache solely for XArray
nodes and make the XArray use this slab cache instead of the
radix_tree_node slab cache.

Cc: Matthew Wilcox 
Signed-off-by: Tobin C. Harding 
---
 include/linux/xarray.h |  3 +++
 init/main.c|  2 ++
 lib/radix-tree.c   |  2 +-
 lib/xarray.c   | 48 ++
 4 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 0e01e6129145..773f91f8e1db 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -42,6 +42,9 @@
 
 #define BITS_PER_XA_VALUE  (BITS_PER_LONG - 1)
 
+/* Called from init/main.c */
+void xarray_slabcache_init(void);
+
 /**
  * xa_mk_value() - Create an XArray entry from an integer.
  * @v: Value to store in XArray.
diff --git a/init/main.c b/init/main.c
index 66a196c5e4c3..8c409a5dc937 100644
--- a/init/main.c
+++ b/init/main.c
@@ -107,6 +107,7 @@ static int kernel_init(void *);
 
 extern void init_IRQ(void);
 extern void radix_tree_init(void);
+extern void xarray_slabcache_init(void);
 
 /*
  * Debug helper: via this flag we know that we are in 'early bootup code'
@@ -622,6 +623,7 @@ asmlinkage __visible void __init start_kernel(void)
 "Interrupts were enabled *very* early, fixing it\n"))
local_irq_disable();
radix_tree_init();
+   xarray_slabcache_init();
 
/*
 * Set up housekeeping before setting up workqueues to allow the unbound
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 18c1dfbb1765..e6127c4c84b5 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -31,7 +31,7 @@
 /*
  * Radix tree node cache.
  */
-struct kmem_cache *radix_tree_node_cachep;
+static struct kmem_cache *radix_tree_node_cachep;
 
 /*
  * The radix tree is variable-height, so an insert operation not only has
diff --git a/lib/xarray.c b/lib/xarray.c
index 6be3acbb861f..861c042daa1d 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -27,6 +27,8 @@
  * @entry refers to something stored in a slot in the xarray
  */
 
+static struct kmem_cache *xa_node_cachep;
+
 static inline unsigned int xa_lock_type(const struct xarray *xa)
 {
return (__force unsigned int)xa->xa_flags & 3;
@@ -244,9 +246,21 @@ void *xas_load(struct xa_state *xas)
 }
 EXPORT_SYMBOL_GPL(xas_load);
 
-/* Move the radix tree node cache here */
-extern struct kmem_cache *radix_tree_node_cachep;
-extern void radix_tree_node_rcu_free(struct rcu_head *head);
+static void xa_node_rcu_free(struct rcu_head *head)
+{
+   struct xa_node *node = container_of(head, struct xa_node, rcu_head);
+
+   /*
+* Must only free zeroed nodes into the slab.  We can be left with
+* non-NULL entries by radix_tree_free_nodes, so clear the entries
+* and tags here.
+*/
+   memset(node->slots, 0, sizeof(node->slots));
+   memset(node->tags, 0, sizeof(node->tags));
+   INIT_LIST_HEAD(&node->private_list);
+
+   kmem_cache_free(xa_node_cachep, node);
+}
 
#define XA_RCU_FREE	((struct xarray *)1)
 
@@ -254,7 +268,7 @@ static void xa_node_free(struct xa_node *node)
 {
XA_NODE_BUG_ON(node, !list_empty(&node->private_list));
node->array = XA_RCU_FREE;
-   call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
+   call_rcu(&node->rcu_head, xa_node_rcu_free);
 }
 
 /*
@@ -270,7 +284,7 @@ static void xas_destroy(struct xa_state *xas)
if (!node)
return;
XA_NODE_BUG_ON(node, !list_empty(&node->private_list));
-   kmem_cache_free(radix_tree_node_cachep, node);
+   kmem_cache_free(xa_node_cachep, node);
xas->xa_alloc = NULL;
 }
 
@@ -298,7 +312,7 @@ bool xas_nomem(struct xa_state *xas, gfp_t gfp)
xas_destroy(xas);
return false;
}
-   xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+   xas->xa_alloc = kmem_cache_alloc(xa_node_cachep, gfp);
if (!xas->xa_alloc)
return false;
XA_NODE_BUG_ON(xas->xa_alloc, !list_empty(&xas->xa_alloc->private_list));
@@ -327,10 +341,10 @@ static bool __xas_nomem(struct xa_state *xas, gfp_t gfp)
}
if (gfpflags_allow_blocking(gfp)) {
xas_unlock_type(xas, lock_type);
-   xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+   xas->xa_alloc = kmem_cache_alloc(xa_node_cachep, gfp);
xas_lock_type(xas, lock_type);
} else {
- 

[PATCH 13/15] dcache: Implement partial shrink via Slab Movable Objects

2019-06-02 Thread Tobin C. Harding
The dentry slab cache is susceptible to internal fragmentation.  Now
that we have Slab Movable Objects we can attempt to defragment the
dcache.  Dentry objects are inherently _not_ relocatable however under
some conditions they can be free'd.  This is the same as shrinking the
dcache but instead of shrinking the whole cache we only attempt to free
those objects that are located in partially full slab pages.  There is
no guarantee that this will reduce the memory usage of the system, it is
a compromise between fragmented memory and total cache shrinkage with
the hope that some memory pressure can be alleviated.

This is implemented using the newly added Slab Movable Objects
infrastructure.  The dcache 'migration' function is intentionally _not_
called 'd_migrate' because we only free, we do not migrate.  Call it
'd_partial_shrink' to make explicit that no reallocation is done.

In order to enable SMO a call to kmem_cache_setup_mobility() must be
made, we do this during initialization of the dcache.

Implement isolate and 'migrate' functions for the dentry slab cache.
Enable SMO for the dcache during initialization.

Signed-off-by: Tobin C. Harding 
---
 fs/dcache.c | 75 +
 1 file changed, 75 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 867d97a86940..3ca721752723 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3072,6 +3072,79 @@ void d_tmpfile(struct dentry *dentry, struct inode *inode)
 }
 EXPORT_SYMBOL(d_tmpfile);
 
+/*
+ * d_isolate() - Dentry isolation callback function.
+ * @s: The dentry cache.
+ * @v: Vector of pointers to the objects to isolate.
+ * @nr: Number of objects in @v.
+ *
+ * The slab allocator is holding off frees. We can safely examine
+ * the object without the danger of it vanishing from under us.
+ */
+static void *d_isolate(struct kmem_cache *s, void **v, int nr)
+{
+   struct list_head *dispose;
+   struct dentry *dentry;
+   int i;
+
+   dispose = kmalloc(sizeof(*dispose), GFP_KERNEL);
+   if (!dispose)
+   return NULL;
+
+   INIT_LIST_HEAD(dispose);
+
+   for (i = 0; i < nr; i++) {
+   dentry = v[i];
+   spin_lock(&dentry->d_lock);
+
+   if (dentry->d_lockref.count > 0 ||
+   dentry->d_flags & DCACHE_SHRINK_LIST) {
+   spin_unlock(&dentry->d_lock);
+   continue;
+   }
+
+   if (dentry->d_flags & DCACHE_LRU_LIST)
+   d_lru_del(dentry);
+
+   d_shrink_add(dentry, dispose);
+   spin_unlock(&dentry->d_lock);
+   }
+
+   return dispose;
+}
+
+/*
+ * d_partial_shrink() - Dentry migration callback function.
+ * @s: The dentry cache.
+ * @_unused: We do not access the vector.
+ * @__unused: No need for length of vector.
+ * @___unused: We do not do any allocation.
+ * @private: list_head pointer representing the shrink list.
+ *
+ * Dispose of the shrink list created during isolation function.
+ *
+ * Dentry objects can _not_ be relocated and shrinking the whole dcache
+ * can be expensive.  This is an effort to free dentry objects that are
+ * stopping slab pages from being free'd without clearing the whole dcache.
+ *
+ * This callback is called from the SLUB allocator object migration
+ * infrastructure in attempt to free up slab pages by freeing dentry
+ * objects from partially full slabs.
+ */
+static void d_partial_shrink(struct kmem_cache *s, void **_unused, int __unused,
+int ___unused, void *private)
+{
+   struct list_head *dispose = private;
+
+   if (!private)   /* kmalloc error during isolate. */
+   return;
+
+   if (!list_empty(dispose))
+   shrink_dentry_list(dispose);
+
+   kfree(private);
+}
+
 static __initdata unsigned long dhash_entries;
 static int __init set_dhash_entries(char *str)
 {
@@ -3117,6 +3190,8 @@ static void __init dcache_init(void)
   sizeof_field(struct dentry, d_iname),
   dcache_ctor);
 
+   kmem_cache_setup_mobility(dentry_cache, d_isolate, d_partial_shrink);
+
/* Hash may have been set up in dcache_init_early */
if (!hashdist)
return;
-- 
2.21.0



[PATCH 11/15] tools/testing/slab: Add XArray movable objects tests

2019-06-02 Thread Tobin C. Harding
We just implemented movable objects for the XArray.  Let's test it
in-tree.

Add test module for the XArray's movable objects implementation.

Functionality of the XArray Slab Movable Object implementation can
usually be seen simply by using `slabinfo` on a running machine since
the radix tree is typically in use on a running machine and will have
partial slabs.  For repeated testing we can use the test module to run
to simulate a workload on the XArray then use `slabinfo` to test object
migration is functioning.

If testing on freshly spun up VM (low radix tree workload) it may be
necessary to load/unload the module a number of times to create partial
slabs.

Example test session


Relevant /proc/slabinfo column headers:

  name   

Prior to testing slabinfo report for radix_tree_node:

  # slabinfo radix_tree_node --report

  Slabcache: radix_tree_node  Aliases:  0 Order :  2 Objects: 8352
  ** Reclaim accounting active
  ** Defragmentation at 30%

  Sizes (bytes) Slabs  Debug                Memory
  
  Object : 576  Total  : 497   Sanity Checks : On   Total: 8142848
  SlabObj: 912  Full   : 473   Redzoning : On   Used : 4810752
  SlabSiz:   16384  Partial:  24   Poisoning : On   Loss : 3332096
  Loss   : 336  CpuSlab:   0   Tracking  : On   Lalig: 2806272
  Align  :   8  Objects:  17   Tracing   : Off  Lpadd:  437360

Here you can see the kernel was built with Slab Movable Objects enabled
for the XArray (XArray uses the radix tree below the surface).

After inserting the test module (note we have triggered allocation of a
number of radix tree nodes increasing the object count but decreasing the
number of partial slabs):

  # slabinfo radix_tree_node --report

  Slabcache: radix_tree_node  Aliases:  0 Order :  2 Objects: 8442
  ** Reclaim accounting active
  ** Defragmentation at 30%

  Sizes (bytes) Slabs  Debug                Memory
  
  Object : 576  Total  : 499   Sanity Checks : On   Total: 8175616
  SlabObj: 912  Full   : 484   Redzoning : On   Used : 4862592
  SlabSiz:   16384  Partial:  15   Poisoning : On   Loss : 3313024
  Loss   : 336  CpuSlab:   0   Tracking  : On   Lalig: 2836512
  Align  :   8  Objects:  17   Tracing   : Off  Lpadd:  439120

Now we can shrink the radix_tree_node cache:

  # slabinfo radix_tree_node --shrink
  # slabinfo radix_tree_node --report

  Slabcache: radix_tree_node  Aliases:  0 Order :  2 Objects: 8515
  ** Reclaim accounting active
  ** Defragmentation at 30%

  Sizes (bytes) Slabs  Debug                Memory
  
  Object : 576  Total  : 501   Sanity Checks : On   Total: 8208384
  SlabObj: 912  Full   : 500   Redzoning : On   Used : 4904640
  SlabSiz:   16384  Partial:   1   Poisoning : On   Loss : 3303744
  Loss   : 336  CpuSlab:   0   Tracking  : On   Lalig: 2861040
  Align  :   8  Objects:  17   Tracing   : Off  Lpadd:  440880

Note the single remaining partial slab.

Signed-off-by: Tobin C. Harding 
---
 tools/testing/slab/Makefile |   2 +-
 tools/testing/slab/slub_defrag_xarray.c | 211 
 2 files changed, 212 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/slab/slub_defrag_xarray.c

diff --git a/tools/testing/slab/Makefile b/tools/testing/slab/Makefile
index 440c2e3e356f..44c18d9a4d52 100644
--- a/tools/testing/slab/Makefile
+++ b/tools/testing/slab/Makefile
@@ -1,4 +1,4 @@
-obj-m += slub_defrag.o
+obj-m += slub_defrag.o slub_defrag_xarray.o
 
 KTREE=../../..
 
diff --git a/tools/testing/slab/slub_defrag_xarray.c b/tools/testing/slab/slub_defrag_xarray.c
new file mode 100644
index ..41143f73256c
--- /dev/null
+++ b/tools/testing/slab/slub_defrag_xarray.c
@@ -0,0 +1,211 @@
+// SPDX-License-Identifier: GPL-2.0+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define SMOX_CACHE_NAME "smox_test"
+static struct kmem_cache *cachep;
+
+/*
+ * Declare XArrays globally so we can clean them up on module unload.
+ */
+
+/* Used by test_smo_xarray()*/
+DEFINE_XARRAY(things);
+
+/* Thing to store pointers to in the XArray */
+struct smox_thing {
+   long id;
+};
+
+/* It's up to the caller to ensure id is unique */
+static struct smox_thing *alloc_thing(int id)
+{
+   struct smox_thing *thing;
+
+   thing = kmem_cache_alloc(cachep, GFP_KERNEL);
+   if (!thing)
+   return ERR_PTR(-ENOMEM);
+
+   thing->id = id;
+   return thing;
+}
+
+/**
+ * smox_object_ctor() - SMO object constructor function.
+ * @ptr: Pointer to memory where the object should be constructe

[PATCH 08/15] tools/testing/slab: Add object migration test suite

2019-06-02 Thread Tobin C. Harding
We just added a module that enables testing the SLUB allocator's ability
to defrag/shrink caches via movable objects.  Tests are better when they
are automated.

Add automated testing via a python script for SLUB movable objects.

Example output:

  $ cd path/to/linux/tools/testing/slab
  $ ./slub_defrag.py
  Please run script as root

  $ sudo ./slub_defrag.py
  

  $ sudo ./slub_defrag.py --debug
  Loading module ...
  Slab cache smo_test created
  Objects per slab: 20
  Running sanity checks ...

  Running module stress test (see dmesg for additional test output) ...
  Removing module slub_defrag ...
  Loading module ...
  Slab cache smo_test created

  Running test non-movable ...
  testing slab 'smo_test' prior to enabling movable objects ...
  verified non-movable slabs are NOT shrinkable

  Running test movable ...
  testing slab 'smo_test' after enabling movable objects ...
  verified movable slabs are shrinkable

  Removing module slub_defrag ...

Signed-off-by: Tobin C. Harding 
---
 tools/testing/slab/slub_defrag.c  |   1 +
 tools/testing/slab/slub_defrag.py | 451 ++
 2 files changed, 452 insertions(+)
 create mode 100755 tools/testing/slab/slub_defrag.py

diff --git a/tools/testing/slab/slub_defrag.c b/tools/testing/slab/slub_defrag.c
index 4a5c24394b96..8332e69ee868 100644
--- a/tools/testing/slab/slub_defrag.c
+++ b/tools/testing/slab/slub_defrag.c
@@ -337,6 +337,7 @@ static int smo_run_module_tests(int nr_objs, int keep)
 
 /*
  * struct functions() - Map command to a function pointer.
+ * If you update this please update the documentation in slub_defrag.py
  */
 struct functions {
char *fn_name;
diff --git a/tools/testing/slab/slub_defrag.py b/tools/testing/slab/slub_defrag.py
new file mode 100755
index ..41747c0db39b
--- /dev/null
+++ b/tools/testing/slab/slub_defrag.py
@@ -0,0 +1,451 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import subprocess
+import sys
+from os import path
+
+# SLUB Movable Objects test suite.
+#
+# Requirements:
+#  - CONFIG_SLUB=y
+#  - CONFIG_SLUB_DEBUG=y
+#  - The slub_defrag module in this directory.
+
+# Test SMO using a kernel module that enables triggering arbitrary
+# kernel code from userspace via a debugfs file.
+#
+# Module code is in ./slub_defrag.c, basically the functionality is as
+# follows:
+#
+#  - Creates debugfs file /sys/kernel/debug/smo/callfn
+#  - Writes to 'callfn' are parsed as a command string and the function
+#associated with command is called.
+#  - Defines 4 commands (all commands operate on smo_test cache):
+# - 'test': Runs module stress tests.
+# - 'alloc N': Allocates N slub objects
+# - 'free N POS': Frees N objects starting at POS (see below)
+# - 'enable': Enables SLUB Movable Objects
+#
+# The module maintains a list of allocated objects.  Allocation adds
+# objects to the tail of the list.  Free'ing frees from the head of the
+# list.  This has the effect of creating free slots in the slab.  For
+# finer grained control over where in the cache slots are free'd POS
+# (position) argument may be used.
+
+# The main() function is reasonably readable; the test suite does the
+# following:
+#
+# 1. Runs the module stress tests.
+# 2. Tests the cache without movable objects enabled.
+#- Creates multiple partial slabs as explained above.
+#- Verifies that partial slabs are _not_ removed by shrink (see below).
+# 3. Tests the cache with movable objects enabled.
+#- Creates multiple partial slabs as explained above.
+#- Verifies that partial slabs _are_ removed by shrink (see below).
+
+# The sysfs file /sys/kernel/slab/<cache>/shrink enables calling the
+# function kmem_cache_shrink() (see mm/slab_common.c and mm/slub.c).
+# Shrinking a cache attempts to consolidate all partial slabs by moving
+# objects if object migration is enabled for the cache, otherwise
+# shrinking a cache simply re-orders the partial list so that the most
+# densely populated slabs are at the head of the list.
+
+# Enable/disable debugging output (also enabled via -d | --debug).
+debug = False
+
+# Used in debug messages and when running `insmod`.
+MODULE_NAME = "slub_defrag"
+
+# Slab cache created by the test module.
+CACHE_NAME = "smo_test"
+
+# Set by get_slab_config()
+objects_per_slab = 0
+pages_per_slab = 0
+debugfs_mounted = False # Set to true if we mount debugfs.
+
+
+def eprint(*args, **kwargs):
+    print(*args, file=sys.stderr, **kwargs)
+
+
+def dprint(*args, **kwargs):
+    if debug:
+        print(*args, file=sys.stderr, **kwargs)
+
+
+def run_shell(cmd):
+    return subprocess.call([cmd], shell=True)
+
+
+def run_shell_get_stdout(cmd):
+    return subprocess.check_output([cmd], shell=True)
+
+
+def assert_root():
+    user = run_shell_get_stdout('whoami')
+    if user != b'root\n':
+        eprint("Please run script as root")
+        sys.exit(1)
+
+
+def mount_debugfs():
+    mounted = False
+
+# Check if

[PATCH 06/15] tools/vm/slabinfo: Add defrag_used_ratio output

2019-06-02 Thread Tobin C. Harding
Add output for the newly added defrag_used_ratio sysfs knob.

Signed-off-by: Tobin C. Harding 
---
 tools/vm/slabinfo.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c
index d2c22f9ee2d8..ef4ff93df4cc 100644
--- a/tools/vm/slabinfo.c
+++ b/tools/vm/slabinfo.c
@@ -34,6 +34,7 @@ struct slabinfo {
unsigned int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
int movable, ctor;
+   int defrag_used_ratio;
int remote_node_defrag_ratio;
unsigned long partial, objects, slabs, objects_partial, objects_total;
unsigned long alloc_fastpath, alloc_slowpath;
@@ -549,6 +550,8 @@ static void report(struct slabinfo *s)
printf("** Slabs are destroyed via RCU\n");
if (s->reclaim_account)
printf("** Reclaim accounting active\n");
+   if (s->movable)
+   printf("** Defragmentation at %d%%\n", s->defrag_used_ratio);
 
printf("\nSizes (bytes) Slabs  Debug                Memory\n");

printf("\n");
@@ -1279,6 +1282,7 @@ static void read_slab_dir(void)
slab->deactivate_bypass = get_obj("deactivate_bypass");
slab->remote_node_defrag_ratio =
get_obj("remote_node_defrag_ratio");
+   slab->defrag_used_ratio = get_obj("defrag_used_ratio");
chdir("..");
if (read_slab_obj(slab, "ops")) {
if (strstr(buffer, "ctor :"))
-- 
2.21.0



[PATCH 05/15] tools/vm/slabinfo: Add remote node defrag ratio output

2019-06-02 Thread Tobin C. Harding
Add output line for NUMA remote node defrag ratio.

Signed-off-by: Tobin C. Harding 
---
 tools/vm/slabinfo.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c
index cbfc56c44c2f..d2c22f9ee2d8 100644
--- a/tools/vm/slabinfo.c
+++ b/tools/vm/slabinfo.c
@@ -34,6 +34,7 @@ struct slabinfo {
unsigned int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
int movable, ctor;
+   int remote_node_defrag_ratio;
unsigned long partial, objects, slabs, objects_partial, objects_total;
unsigned long alloc_fastpath, alloc_slowpath;
unsigned long free_fastpath, free_slowpath;
@@ -377,6 +378,10 @@ static void slab_numa(struct slabinfo *s, int mode)
if (skip_zero && !s->slabs)
return;
 
+   if (mode) {
+   printf("\nNUMA remote node defrag ratio: %3d\n",
+  s->remote_node_defrag_ratio);
+   }
if (!line) {
printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
for(node = 0; node <= highest_node; node++)
@@ -1272,6 +1277,8 @@ static void read_slab_dir(void)
slab->cpu_partial_free = get_obj("cpu_partial_free");
slab->alloc_node_mismatch = get_obj("alloc_node_mismatch");
slab->deactivate_bypass = get_obj("deactivate_bypass");
+   slab->remote_node_defrag_ratio =
+   get_obj("remote_node_defrag_ratio");
chdir("..");
if (read_slab_obj(slab, "ops")) {
if (strstr(buffer, "ctor :"))
-- 
2.21.0



[PATCH 01/15] slub: Add isolate() and migrate() methods

2019-06-02 Thread Tobin C. Harding
Add the two methods needed for moving objects and enable the display of
the callbacks via the /sys/kernel/slab interface.

Add documentation explaining the use of these methods and the prototypes
for slab.h. Add functions to setup the callbacks method for a slab
cache.

Add empty functions for SLAB/SLOB. The API is generic so it could be
theoretically implemented for these allocators as well.

Change sysfs 'ctor' field to be 'ops' to contain all the callback
operations defined for a slab cache.  Display the existing 'ctor'
callback in the ops fields contents along with 'isolate' and 'migrate'
callbacks.

Signed-off-by: Tobin C. Harding 
---
 include/linux/slab.h | 70 
 include/linux/slub_def.h |  3 ++
 mm/slub.c| 59 +
 3 files changed, 126 insertions(+), 6 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 9449b19c5f10..886fc130334d 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -154,6 +154,76 @@ void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
 void memcg_deactivate_kmem_caches(struct mem_cgroup *);
 void memcg_destroy_kmem_caches(struct mem_cgroup *);
 
+/*
+ * Function prototypes passed to kmem_cache_setup_mobility() to enable
+ * mobile objects and targeted reclaim in slab caches.
+ */
+
+/**
+ * typedef kmem_cache_isolate_func - Object migration callback function.
+ * @s: The cache we are working on.
+ * @ptr: Pointer to an array of pointers to the objects to isolate.
+ * @nr: Number of objects in @ptr array.
+ *
+ * The purpose of kmem_cache_isolate_func() is to pin each object so that
+ * they cannot be freed until kmem_cache_migrate_func() has processed
+ * them. This may be accomplished by increasing the refcount or setting
+ * a flag.
+ *
+ * The object pointer array passed is also passed to
+ * kmem_cache_migrate_func().  The function may remove objects from the
+ * array by setting pointers to %NULL. This is useful if we can
+ * determine that an object is being freed because
+ * kmem_cache_isolate_func() was called when the subsystem was calling
+ * kmem_cache_free().  In that case it is not necessary to increase the
+ * refcount or specially mark the object because the release of the slab
+ * lock will lead to the immediate freeing of the object.
+ *
+ * Context: Called with locks held so that the slab objects cannot be
+ *  freed.  We are in an atomic context and no slab operations
+ *  may be performed.
+ * Return: A pointer that is passed to the migrate function. If any
+ * objects cannot be touched at this point then the pointer may
+ * indicate a failure and then the migration function can simply
+ * remove the references that were already obtained. The private
+ * data could be used to track the objects that were already pinned.
+ */
+typedef void *kmem_cache_isolate_func(struct kmem_cache *s, void **ptr, int nr);
+
+/**
+ * typedef kmem_cache_migrate_func - Object migration callback function.
+ * @s: The cache we are working on.
+ * @ptr: Pointer to an array of pointers to the objects to migrate.
+ * @nr: Number of objects in @ptr array.
+ * @node: The NUMA node where the object should be allocated.
+ * @private: The pointer returned by kmem_cache_isolate_func().
+ *
+ * This function is responsible for migrating objects.  Typically, for
+ * each object in the input array you will want to allocate a new
+ * object, copy the original object, update any pointers, and free the
+ * old object.
+ *
+ * After this function returns all pointers to the old object should now
+ * point to the new object.
+ *
+ * Context: Called with no locks held and interrupts enabled.  Sleeping
+ *  is possible.  Any operation may be performed.
+ */
+typedef void kmem_cache_migrate_func(struct kmem_cache *s, void **ptr,
+int nr, int node, void *private);
+
+/*
+ * kmem_cache_setup_mobility() is used to setup callbacks for a slab cache.
+ */
+#ifdef CONFIG_SLUB
+void kmem_cache_setup_mobility(struct kmem_cache *, kmem_cache_isolate_func,
+  kmem_cache_migrate_func);
+#else
+static inline void
+kmem_cache_setup_mobility(struct kmem_cache *s, kmem_cache_isolate_func isolate,
+ kmem_cache_migrate_func migrate) {}
+#endif
+
 /*
  * Please use this macro to create slab caches. Simply specify the
  * name of the structure and maybe some flags that are listed above.
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index d2153789bd9f..2879a2f5f8eb 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -99,6 +99,9 @@ struct kmem_cache {
gfp_t allocflags;   /* gfp flags to use on each alloc */
int refcount;   /* Refcount for slab cache destroy */
void (*ctor)(void *);
+   kmem_cache_isolate_func *isolate;
+   kmem_cache_migrate_func *migrate
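
The archived diff is truncated here.  As a quick orientation, the
sketch below shows how a hypothetical cache user would wire up the two
callbacks documented above; the foo_* names, the cache parameters and
the callback bodies are invented, only the typedefs and
kmem_cache_setup_mobility() come from this patch.

	/* Illustrative only: a hypothetical user of the new API. */
	static struct kmem_cache *foo_cachep;

	static void *foo_isolate(struct kmem_cache *s, void **objs, int nr)
	{
		/*
		 * Pin each object (e.g. take a reference or set a flag) so
		 * it cannot be freed before foo_migrate() runs.  The return
		 * value is handed to foo_migrate() as 'private'.
		 */
		return NULL;
	}

	static void foo_migrate(struct kmem_cache *s, void **objs, int nr,
				int node, void *private)
	{
		/*
		 * For each object: allocate a replacement on @node, copy the
		 * contents, repoint all users, then free the old object.
		 */
	}

	static int __init foo_init(void)
	{
		foo_cachep = kmem_cache_create("foo", 128, 0, 0, NULL);
		if (!foo_cachep)
			return -ENOMEM;

		kmem_cache_setup_mobility(foo_cachep, foo_isolate, foo_migrate);
		return 0;
	}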

[PATCH 02/15] tools/vm/slabinfo: Add support for -C and -M options

2019-06-02 Thread Tobin C. Harding
-C lists caches that use a ctor.

-M lists caches that support object migration.

Add command line options to show caches with a constructor and caches
that are movable (i.e. have migrate function).

Signed-off-by: Tobin C. Harding 
---
 tools/vm/slabinfo.c | 40 
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c
index 73818f1b2ef8..cbfc56c44c2f 100644
--- a/tools/vm/slabinfo.c
+++ b/tools/vm/slabinfo.c
@@ -33,6 +33,7 @@ struct slabinfo {
unsigned int hwcache_align, object_size, objs_per_slab;
unsigned int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
+   int movable, ctor;
unsigned long partial, objects, slabs, objects_partial, objects_total;
unsigned long alloc_fastpath, alloc_slowpath;
unsigned long free_fastpath, free_slowpath;
@@ -67,6 +68,8 @@ int show_report;
 int show_alias;
 int show_slab;
 int skip_zero = 1;
+int show_movable;
+int show_ctor;
 int show_numa;
 int show_track;
 int show_first_alias;
@@ -109,11 +112,13 @@ static void fatal(const char *x, ...)
 
 static void usage(void)
 {
-   printf("slabinfo 4/15/2011. (c) 2007 sgi/(c) 2011 Linux Foundation.\n\n"
-   "slabinfo [-aADefhilnosrStTvz1LXBU] [N=K] [-dafzput] [slab-regexp]\n"
+   printf("slabinfo 4/15/2017. (c) 2007 sgi/(c) 2011 Linux Foundation/(c) 2017 Jump Trading LLC.\n\n"
+  "slabinfo [-aACDefhilMnosrStTvz1LXBU] [N=K] [-dafzput] [slab-regexp]\n"
+
"-a|--aliases   Show aliases\n"
"-A|--activity  Most active slabs first\n"
"-B|--Bytes Show size in bytes\n"
+   "-C|--ctor  Show slabs with ctors\n"
"-D|--display-activeSwitch line format to activity\n"
"-e|--empty Show empty slabs\n"
"-f|--first-alias   Show first alias\n"
@@ -121,6 +126,7 @@ static void usage(void)
"-i|--inverted  Inverted list\n"
"-l|--slabs Show slabs\n"
"-L|--Loss  Sort by loss\n"
+   "-M|--movable   Show caches that support movable objects\n"
"-n|--numa  Show NUMA information\n"
"-N|--lines=K   Show the first K slabs\n"
"-o|--ops   Show kmem_cache_ops\n"
@@ -588,6 +594,12 @@ static void slabcache(struct slabinfo *s)
if (show_empty && s->slabs)
return;
 
+   if (show_ctor && !s->ctor)
+   return;
+
+   if (show_movable && !s->movable)
+   return;
+
if (sort_loss == 0)
store_size(size_str, slab_size(s));
else
@@ -602,6 +614,10 @@ static void slabcache(struct slabinfo *s)
*p++ = '*';
if (s->cache_dma)
*p++ = 'd';
+   if (s->ctor)
+   *p++ = 'C';
+   if (s->movable)
+   *p++ = 'M';
if (s->hwcache_align)
*p++ = 'A';
if (s->poison)
@@ -636,7 +652,8 @@ static void slabcache(struct slabinfo *s)
printf("%-21s %8ld %7d %15s %14s %4d %1d %3ld %3ld %s\n",
s->name, s->objects, s->object_size, size_str, dist_str,
s->objs_per_slab, s->order,
-   s->slabs ? (s->partial * 100) / s->slabs : 100,
+   s->slabs ? (s->partial * 100) /
+   (s->slabs * s->objs_per_slab) : 100,
s->slabs ? (s->objects * s->object_size * 100) /
(s->slabs * (page_size << s->order)) : 100,
flags);
@@ -1256,6 +1273,13 @@ static void read_slab_dir(void)
slab->alloc_node_mismatch = get_obj("alloc_node_mismatch");
slab->deactivate_bypass = get_obj("deactivate_bypass");
chdir("..");
+   if (read_slab_obj(slab, "ops")) {
+   if (strstr(buffer, "ctor :"))
+   slab->ctor = 1;
+   if (strstr(buffer, "migrate :"))
+   slab->movable = 1;
+   }
+
if (slab->name[0] == ':')
alias_targets++;
slab++;
@@ -1332,6 +1356,8 @@ static void xtotals(void)
 }
 
 struct option opts[] =

[PATCH 03/15] slub: Sort slab cache list

2019-06-02 Thread Tobin C. Harding
It is advantageous to have all defragmentable slabs together at the
beginning of the list of slabs so that there is no need to scan the
complete list. Put defragmentable caches first when adding a slab cache
and others last.

Signed-off-by: Tobin C. Harding 
---
 mm/slab_common.c | 2 +-
 mm/slub.c| 6 ++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 58251ba63e4a..db5e9a0b1535 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -393,7 +393,7 @@ static struct kmem_cache *create_cache(const char *name,
goto out_free_cache;
 
s->refcount = 1;
-   list_add(&s->list, &slab_caches);
+   list_add_tail(&s->list, &slab_caches);
memcg_link_cache(s);
 out:
if (err)
diff --git a/mm/slub.c b/mm/slub.c
index 1c380a2bc78a..66d474397c0f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4333,6 +4333,8 @@ void kmem_cache_setup_mobility(struct kmem_cache *s,
return;
}
 
+   mutex_lock(&slab_mutex);
+
s->isolate = isolate;
s->migrate = migrate;
 
@@ -4341,6 +4343,10 @@ void kmem_cache_setup_mobility(struct kmem_cache *s,
 * to disable fast cmpxchg based processing.
 */
s->flags &= ~__CMPXCHG_DOUBLE;
+
+   list_move(&s->list, &slab_caches);  /* Move to top */
+
+   mutex_unlock(&slab_mutex);
 }
 EXPORT_SYMBOL(kmem_cache_setup_mobility);
 
-- 
2.21.0



[PATCH 04/15] slub: Slab defrag core

2019-06-02 Thread Tobin C. Harding
Internal fragmentation can occur within pages used by the slub
allocator.  Under some workloads large numbers of pages can be used by
partial slab pages.  This under-utilisation is bad simply because it
wastes memory but also because if the system is under memory pressure
higher order allocations may become difficult to satisfy.  If we can
defrag slab caches we can alleviate these problems.

Implement Slab Movable Objects in order to defragment slab caches.

Slab defragmentation may occur:

1. Unconditionally when __kmem_cache_shrink() is called on a slab cache
   by the kernel calling kmem_cache_shrink().

2. Unconditionally through the use of the slabinfo command.

slabinfo  -s

3. Conditionally via the use of kmem_cache_defrag()

- Use Slab Movable Objects when shrinking cache.

Currently when the kernel calls kmem_cache_shrink() we curate the
partial slabs list.  If object migration is not enabled for the cache we
still do this, if however, SMO is enabled we attempt to move objects in
partially full slabs in order to defragment the cache.  Shrink attempts
to move all objects in order to reduce the cache to a single partial
slab for each node.

- Add conditional per node defrag via new function:

kmem_defrag_slabs(int node).

kmem_defrag_slabs() attempts to defragment all slab caches for
node. Defragmentation is done conditionally dependent on MAX_PARTIAL
_and_ defrag_used_ratio.

   Caches are only considered for defragmentation if the number of
   partial slabs exceeds MAX_PARTIAL (per node).

   Also, defragmentation only occurs if the usage ratio of the slab is
   lower than the configured percentage (sysfs field added in this
   patch).  Fragmentation ratios are measured by calculating the
   percentage of objects in use compared to the total number of objects
   that the slab page can accommodate.

   The scanning of slab caches is optimized because the defragmentable
   slabs come first on the list. Thus we can terminate scans on the
   first slab encountered that does not support defragmentation.

   kmem_defrag_slabs() takes a node parameter. This can either be -1 if
   defragmentation should be performed on all nodes, or a node number.

   Defragmentation may be disabled by setting defrag ratio to 0

echo 0 > /sys/kernel/slab/<cache>/defrag_used_ratio

- Add a defrag ratio sysfs field and set it to 30% by default. A limit
of 30% specifies that more than 3 out of 10 available slots for objects
need to be in use otherwise slab defragmentation will be attempted on
the remaining objects.
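
In other words, per cache the decision described above boils down to a
usage-ratio check roughly like the hedged sketch below (the helper name
and the exact accounting fields are assumptions, not the patch's code):

	/* Illustrative only: defragment when usage drops below the ratio. */
	static bool worth_defragmenting(unsigned long objects_in_use,
					unsigned long objects_capacity,
					int defrag_used_ratio)
	{
		if (defrag_used_ratio == 0)	/* 0% means defrag is disabled. */
			return false;

		return objects_in_use * 100 <= objects_capacity * defrag_used_ratio;
	}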

In order for a cache to be defragmentable the cache must support object
migration (SMO).  Enabling SMO for a cache is done via a call to the
recently added function:

void kmem_cache_setup_mobility(struct kmem_cache *,
   kmem_cache_isolate_func,
   kmem_cache_migrate_func);

Signed-off-by: Tobin C. Harding 
---
 Documentation/ABI/testing/sysfs-kernel-slab |  14 +
 include/linux/slab.h|   1 +
 include/linux/slub_def.h|   7 +
 mm/slub.c   | 385 
 4 files changed, 334 insertions(+), 73 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab
index 29601d93a1c2..8bd893968e4f 100644
--- a/Documentation/ABI/testing/sysfs-kernel-slab
+++ b/Documentation/ABI/testing/sysfs-kernel-slab
@@ -180,6 +180,20 @@ Description:
list.  It can be written to clear the current count.
Available when CONFIG_SLUB_STATS is enabled.
 
+What:  /sys/kernel/slab/cache/defrag_used_ratio
+Date:  June 2019
+KernelVersion: 5.2
+Contact:   Christoph Lameter 
+   Pekka Enberg ,
+Description:
+   The defrag_used_ratio file allows the control of how aggressive
+   slab fragmentation reduction works at reclaiming objects from
+   sparsely populated slabs. This is a percentage. If a slab has
+   less than this percentage of objects allocated then reclaim will
+   attempt to reclaim objects so that the whole slab page can be
+   freed. 0% specifies no reclaim attempt (defrag disabled), 100%
+   specifies attempt to reclaim all pages.  The default is 30%.
+
 What:  /sys/kernel/slab/cache/deactivate_to_tail
 Date:  February 2008
 KernelVersion: 2.6.25
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 886fc130334d..4bf381b34829 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -149,6 +149,7 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
void (*ctor)(void *));
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
+unsigned long kmem_defrag_slabs(int node);
 
 void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
 v

[PATCH 00/15] Slab Movable Objects (SMO)

2019-06-02 Thread Tobin C. Harding
Hi,

TL;DR - Add object migration (SMO) to the SLUB allocator and implement
object migration for the XArray and the dcache. 

Thanks for your patience with all the RFCs of this patch set.  Here it
is, ready for prime time.

Internal fragmentation can occur within pages used by the slub
allocator.  Under some workloads large numbers of pages can be used by
partial slab pages.  This under-utilisation is bad simply because it
wastes memory but also because if the system is under memory pressure
higher order allocations may become difficult to satisfy.  If we can
defrag slab caches we can alleviate these problems.

In order to be able to defrag slab caches we need to be able to migrate
objects to a new slab.  Slab object migration is the core functionality
added by this patch series.

Internal slab fragmentation is a long known problem.  This series does
not claim to completely _fix_ the issue.  Instead we are adding core
code to the SLUB allocator to enable users of the allocator to help
mitigate internal fragmentation.  Object migration is on a per cache
basis, with each cache being able to take advantage of object migration
to varying degrees depending on the nature of the objects stored in the
cache.

Series includes test modules and test code that can be used to verify the
claimed behaviour.

Patch #1 - Adds the callbacks used to enable SMO for a particular cache.

Patch #2 - Updates the slabinfo tool to show operations related to SMO.

Patch #3 - Sorts the cache list putting migratable slabs at front.

Patch #4 - Adds the SMO infrastructure.  This is the core patch of the
   series.

Patch #5, #6 - Further update slabinfo tool for information just added.

Patch #7 - Add a module for testing SMO.

Patch #8 - Add unit test suite in Python utilising test module from #7.

Patch #9 - Add a new slab cache for the XArray (separate from radix tree).

Patch #10 - Implement SMO for the XArray.

Patch #11 - Add module for testing XArray SMO implementation.

Patch #12 - Add a dentry constructor.

Patch #13 - Use SMO to attempt to reduce fragmentation of the dcache by
selectively freeing dentry objects.

Patch #14 - Add functionality to move slab objects to a specific NUMA node.

Patch #15 - Add functionality to balance slab objects across all NUMA nodes.

The last RFC (RFCv5 and discussion on it) included code to conditionally
exclude SMO for the dcache.  This has been removed.  IMO it is now not
needed.  Al sufficiently bollock'ed me during development that I believe
the dentry code is good and does not negatively affect the dcache.  If
someone would like to prove me wrong simply remove the call to

kmem_cache_setup_mobility(dentry_cache, d_isolate, d_partial_shrink);

Testing:

The series has been tested to verify that objects are moved using bare
metal (core i5) and also Qemu.  This has not been tested on big metal or
on NUMA hardware.

I have no measurements on performance gains achievable with this set; I
have just verified that the migration works and does not appear to break
anything.

Patch #14 and #15 depend on

CONFIG_SLUB_DEBUG_ON or boot with 'slub_debug'

Thanks for taking the time to look at this.

Tobin


Tobin C. Harding (15):
  slub: Add isolate() and migrate() methods
  tools/vm/slabinfo: Add support for -C and -M options
  slub: Sort slab cache list
  slub: Slab defrag core
  tools/vm/slabinfo: Add remote node defrag ratio output
  tools/vm/slabinfo: Add defrag_used_ratio output
  tools/testing/slab: Add object migration test module
  tools/testing/slab: Add object migration test suite
  lib: Separate radix_tree_node and xa_node slab cache
  xarray: Implement migration function for xa_node objects
  tools/testing/slab: Add XArray movable objects tests
  dcache: Provide a dentry constructor
  dcache: Implement partial shrink via Slab Movable Objects
  slub: Enable moving objects to/from specific nodes
  slub: Enable balancing slabs across nodes

 Documentation/ABI/testing/sysfs-kernel-slab |  14 +
 fs/dcache.c | 105 ++-
 include/linux/slab.h|  71 ++
 include/linux/slub_def.h|  10 +
 include/linux/xarray.h  |   3 +
 init/main.c |   2 +
 lib/radix-tree.c|   2 +-
 lib/xarray.c| 109 ++-
 mm/Kconfig  |   7 +
 mm/slab_common.c|   2 +-
 mm/slub.c   | 827 ++--
 tools/testing/slab/Makefile |  10 +
 tools/testing/slab/slub_defrag.c| 567 ++
 tools/testing/slab/slub_defrag.py   | 451 +++
 tools/testing/slab/slub_defrag_xarray.c | 211 +
 tools/vm/slabinfo.c |  51 +-
 16 files changed, 2339 insertions(+), 103 deletions(-)
 create mode 100644 tools/testing/slab/Makefile
 create mode 100644 tools

Re: [RFC PATCH v5 16/16] dcache: Add CONFIG_DCACHE_SMO

2019-06-02 Thread Tobin C. Harding
On Wed, May 29, 2019 at 04:16:51PM +, Roman Gushchin wrote:
> On Wed, May 29, 2019 at 01:54:06PM +1000, Tobin C. Harding wrote:
> > On Tue, May 21, 2019 at 02:05:38AM +, Roman Gushchin wrote:
> > > On Tue, May 21, 2019 at 11:31:18AM +1000, Tobin C. Harding wrote:
> > > > On Tue, May 21, 2019 at 12:57:47AM +, Roman Gushchin wrote:
> > > > > On Mon, May 20, 2019 at 03:40:17PM +1000, Tobin C. Harding wrote:
> > > > > > In an attempt to make the SMO patchset as non-invasive as possible 
> > > > > > add a
> > > > > > config option CONFIG_DCACHE_SMO (under "Memory Management options") 
> > > > > > for
> > > > > > enabling SMO for the DCACHE.  Whithout this option dcache 
> > > > > > constructor is
> > > > > > used but no other code is built in, with this option enabled slab
> > > > > > mobility is enabled and the isolate/migrate functions are built in.
> > > > > > 
> > > > > > Add CONFIG_DCACHE_SMO to guard the partial shrinking of the dcache 
> > > > > > via
> > > > > > Slab Movable Objects infrastructure.
> > > > > 
> > > > > Hm, isn't it better to make it a static branch? Or basically anything
> > > > > that allows switching on the fly?
> > > > 
> > > > If that is wanted, turning SMO on and off per cache, we can probably do
> > > > this in the SMO code in SLUB.
> > > 
> > > Not necessarily per cache, but without recompiling the kernel.
> > > > 
> > > > > It seems that the cost of just building it in shouldn't be that high.
> > > > > And the question if the defragmentation worth the trouble is so much
> > > > > easier to answer if it's possible to turn it on and off without 
> > > > > rebooting.
> > > > 
> > > > If the question is 'is defragmentation worth the trouble for the
> > > > dcache', I'm not sure having SMO turned off helps answer that question.
> > > > If one doesn't shrink the dentry cache there should be very little
> > > > overhead in having SMO enabled.  So if one wants to explore this
> > > > question then they can turn on the config option.  Please correct me if
> > > > I'm wrong.
> > > 
> > > The problem with a config option is that it's hard to switch over.
> > > 
> > > So just to test your changes in production a new kernel should be built,
> > > tested and rolled out to a representative set of machines (which can be
> > > measured in thousands of machines). Then if results are questionable,
> > > it should be rolled back.
> > > 
> > > What you're actually guarding is the kmem_cache_setup_mobility() call,
> > > which can be perfectly avoided using a boot option, for example. Turning
> > > it on and off completely dynamic isn't that hard too.
> > 
> > Hi Roman,
> > 
> > I've added a boot parameter to SLUB so that admins can enable/disable
> > SMO at boot time system wide.  Then for each object that implements SMO
> > (currently XArray and dcache) I've also added a boot parameter to
> > enable/disable SMO for that cache specifically (these depend on SMO
> > being enabled system wide).
> > 
> > All three boot parameters default to 'off', I've added a config option
> > to default each to 'on'.
> > 
> > I've got a little more testing to do on another part of the set then the
> > PATCH version is coming at you :)
> > 
> > This is more a courtesy email than a request for comment, but please
> > feel free to shout if you don't like the method outlined above.
> > 
> > Fully dynamic config is not currently possible because currently the SMO
> > implementation does not support disabling mobility for a cache once it
> > is turned on, a bit of extra logic would need to be added and some state
> > stored - I'm not sure it warrants it ATM but that can be easily added
> > later if wanted.  Maybe Christoph will give his opinion on this.
> 
> Perfect!

Hi Roman,

I'm about to post PATCH series.  I have removed all the boot time config
options in contrast to what I stated in this thread.  I feel it requires
some comment so as not to seem rude to you.  Please feel free to
re-raise these issues on the series if you feel it is a better place to
do it than on this thread.

I still hear you re making testing easier if there are boot parameters.
I don't have extensive experience testing on a large number of machines
so I have no basis to c

Re: [RFC PATCH v5 16/16] dcache: Add CONFIG_DCACHE_SMO

2019-05-28 Thread Tobin C. Harding
On Tue, May 21, 2019 at 02:05:38AM +, Roman Gushchin wrote:
> On Tue, May 21, 2019 at 11:31:18AM +1000, Tobin C. Harding wrote:
> > On Tue, May 21, 2019 at 12:57:47AM +, Roman Gushchin wrote:
> > > On Mon, May 20, 2019 at 03:40:17PM +1000, Tobin C. Harding wrote:
> > > > In an attempt to make the SMO patchset as non-invasive as possible add a
> > > > config option CONFIG_DCACHE_SMO (under "Memory Management options") for
> > > > enabling SMO for the DCACHE.  Whithout this option dcache constructor is
> > > > used but no other code is built in, with this option enabled slab
> > > > mobility is enabled and the isolate/migrate functions are built in.
> > > > 
> > > > Add CONFIG_DCACHE_SMO to guard the partial shrinking of the dcache via
> > > > Slab Movable Objects infrastructure.
> > > 
> > > Hm, isn't it better to make it a static branch? Or basically anything
> > > that allows switching on the fly?
> > 
> > If that is wanted, turning SMO on and off per cache, we can probably do
> > this in the SMO code in SLUB.
> 
> Not necessarily per cache, but without recompiling the kernel.
> > 
> > > It seems that the cost of just building it in shouldn't be that high.
> > > And the question if the defragmentation worth the trouble is so much
> > > easier to answer if it's possible to turn it on and off without rebooting.
> > 
> > If the question is 'is defragmentation worth the trouble for the
> > dcache', I'm not sure having SMO turned off helps answer that question.
> > If one doesn't shrink the dentry cache there should be very little
> > overhead in having SMO enabled.  So if one wants to explore this
> > question then they can turn on the config option.  Please correct me if
> > I'm wrong.
> 
> The problem with a config option is that it's hard to switch over.
> 
> So just to test your changes in production a new kernel should be built,
> tested and rolled out to a representative set of machines (which can be
> measured in thousands of machines). Then if results are questionable,
> it should be rolled back.
> 
> What you're actually guarding is the kmem_cache_setup_mobility() call,
> which can be perfectly avoided using a boot option, for example. Turning
> it on and off completely dynamic isn't that hard too.

Hi Roman,

I've added a boot parameter to SLUB so that admins can enable/disable
SMO at boot time system wide.  Then for each object that implements SMO
(currently XArray and dcache) I've also added a boot parameter to
enable/disable SMO for that cache specifically (these depend on SMO
being enabled system wide).

All three boot parameters default to 'off', I've added a config option
to default each to 'on'.

I've got a little more testing to do on another part of the set then the
PATCH version is coming at you :)

This is more a courtesy email than a request for comment, but please
feel free to shout if you don't like the method outlined above.

Fully dynamic config is not currently possible because currently the SMO
implementation does not support disabling mobility for a cache once it
is turned on, a bit of extra logic would need to be added and some state
stored - I'm not sure it warrants it ATM but that can be easily added
later if wanted.  Maybe Christoph will give his opinion on this.

thanks,
Tobin.


Re: [RFC PATCH v5 16/16] dcache: Add CONFIG_DCACHE_SMO

2019-05-20 Thread Tobin C. Harding
On Tue, May 21, 2019 at 02:05:38AM +, Roman Gushchin wrote:
> On Tue, May 21, 2019 at 11:31:18AM +1000, Tobin C. Harding wrote:
> > On Tue, May 21, 2019 at 12:57:47AM +, Roman Gushchin wrote:
> > > On Mon, May 20, 2019 at 03:40:17PM +1000, Tobin C. Harding wrote:
> > > > In an attempt to make the SMO patchset as non-invasive as possible add a
> > > > config option CONFIG_DCACHE_SMO (under "Memory Management options") for
> > > > enabling SMO for the DCACHE.  Whithout this option dcache constructor is
> > > > used but no other code is built in, with this option enabled slab
> > > > mobility is enabled and the isolate/migrate functions are built in.
> > > > 
> > > > Add CONFIG_DCACHE_SMO to guard the partial shrinking of the dcache via
> > > > Slab Movable Objects infrastructure.
> > > 
> > > Hm, isn't it better to make it a static branch? Or basically anything
> > > that allows switching on the fly?
> > 
> > If that is wanted, turning SMO on and off per cache, we can probably do
> > this in the SMO code in SLUB.
> 
> Not necessarily per cache, but without recompiling the kernel.
> > 
> > > It seems that the cost of just building it in shouldn't be that high.
> > > And the question if the defragmentation worth the trouble is so much
> > > easier to answer if it's possible to turn it on and off without rebooting.
> > 
> > If the question is 'is defragmentation worth the trouble for the
> > dcache', I'm not sure having SMO turned off helps answer that question.
> > If one doesn't shrink the dentry cache there should be very little
> > overhead in having SMO enabled.  So if one wants to explore this
> > question then they can turn on the config option.  Please correct me if
> > I'm wrong.
> 
> The problem with a config option is that it's hard to switch over.
> 
> So just to test your changes in production a new kernel should be built,
> tested and rolled out to a representative set of machines (which can be
> measured in thousands of machines). Then if results are questionable,
> it should be rolled back.
> 
> What you're actually guarding is the kmem_cache_setup_mobility() call,
> which can be perfectly avoided using a boot option, for example. Turning
> it on and off completely dynamic isn't that hard too.
> 
> Of course, it's up to you, it's just probably easier to find new users
> of a new feature, when it's easy to test it.

Ok, cool - I like it.  Will add for next version.

thanks,
Tobin.


Re: [RFC PATCH v5 13/16] slub: Enable balancing slabs across nodes

2019-05-20 Thread Tobin C. Harding
On Tue, May 21, 2019 at 01:04:10AM +, Roman Gushchin wrote:
> On Mon, May 20, 2019 at 03:40:14PM +1000, Tobin C. Harding wrote:
> > We have just implemented Slab Movable Objects (SMO).  On NUMA systems
> > slabs can become unbalanced i.e. many slabs on one node while other
> > nodes have few slabs.  Using SMO we can balance the slabs across all
> > the nodes.
> > 
> > The algorithm used is as follows:
> > 
> >  1. Move all objects to node 0 (this has the effect of defragmenting the
> > cache).
> 
> This already sounds dangerous (or costly). Can't it be done without
> cross-node data moves?
>
> > 
> >  2. Calculate the desired number of slabs for each node (this is done
> > using the approximation nr_slabs / nr_nodes).
> 
> So that on this step only (actual data size - desired data size) has
> to be moved?

This is just the most braindead algorithm I could come up with.  Surely
there are a bunch of things that could be improved.  Since I don't know
the exact use case it seemed best not to optimize for any one use case.

I'll review, comment on, and test any algorithm you come up with!

thanks,
Tobin.


Re: [RFC PATCH v5 16/16] dcache: Add CONFIG_DCACHE_SMO

2019-05-20 Thread Tobin C. Harding
On Tue, May 21, 2019 at 12:57:47AM +, Roman Gushchin wrote:
> On Mon, May 20, 2019 at 03:40:17PM +1000, Tobin C. Harding wrote:
> > In an attempt to make the SMO patchset as non-invasive as possible add a
> > config option CONFIG_DCACHE_SMO (under "Memory Management options") for
> > enabling SMO for the DCACHE.  Whithout this option dcache constructor is
> > used but no other code is built in, with this option enabled slab
> > mobility is enabled and the isolate/migrate functions are built in.
> > 
> > Add CONFIG_DCACHE_SMO to guard the partial shrinking of the dcache via
> > Slab Movable Objects infrastructure.
> 
> Hm, isn't it better to make it a static branch? Or basically anything
> that allows switching on the fly?

If that is wanted, turning SMO on and off per cache, we can probably do
this in the SMO code in SLUB.

> It seems that the cost of just building it in shouldn't be that high.
> And the question if the defragmentation worth the trouble is so much
> easier to answer if it's possible to turn it on and off without rebooting.

If the question is 'is defragmentation worth the trouble for the
dcache', I'm not sure having SMO turned off helps answer that question.
If one doesn't shrink the dentry cache there should be very little
overhead in having SMO enabled.  So if one wants to explore this
question then they can turn on the config option.  Please correct me if
I'm wrong.

The ifdef guard is there so memory management is not having any negative
effects on the dcache/VFS (no matter how small).  It also means that the
VFS guys don't have to keep an eye on what SMO is doing, they can
just configure SMO out.  The dcache is already fairly complex; I'm not
sure adding complexity to it without good reason is sound practice.  At
best SMO is only going to be mildly useful to the dcache, so I feel we
should err on the side of caution.

Open to ideas.

Thanks,
Tobin.


Re: [RFC PATCH v5 04/16] slub: Slab defrag core

2019-05-20 Thread Tobin C. Harding
On Tue, May 21, 2019 at 12:51:57AM +, Roman Gushchin wrote:
> On Mon, May 20, 2019 at 03:40:05PM +1000, Tobin C. Harding wrote:
> > Internal fragmentation can occur within pages used by the slub
> > allocator.  Under some workloads large numbers of pages can be used by
> > partial slab pages.  This under-utilisation is bad simply because it
> > wastes memory but also because if the system is under memory pressure
> > higher order allocations may become difficult to satisfy.  If we can
> > defrag slab caches we can alleviate these problems.
> > 
> > Implement Slab Movable Objects in order to defragment slab caches.
> > 
> > Slab defragmentation may occur:
> > 
> > 1. Unconditionally when __kmem_cache_shrink() is called on a slab cache
> >by the kernel calling kmem_cache_shrink().
> > 
> > 2. Unconditionally through the use of the slabinfo command.
> > 
> > slabinfo <cache> -s
> > 
> > 3. Conditionally via the use of kmem_cache_defrag()
> > 
> > - Use Slab Movable Objects when shrinking cache.
> > 
> > Currently when the kernel calls kmem_cache_shrink() we curate the
> > partial slabs list.  If object migration is not enabled for the cache we
> > still do this, if however, SMO is enabled we attempt to move objects in
> > partially full slabs in order to defragment the cache.  Shrink attempts
> > to move all objects in order to reduce the cache to a single partial
> > slab for each node.
> > 
> > - Add conditional per node defrag via new function:
> > 
> > kmem_defrag_slabs(int node).
> > 
> > kmem_defrag_slabs() attempts to defragment all slab caches for
> > node. Defragmentation is done conditionally dependent on MAX_PARTIAL
> > _and_ defrag_used_ratio.
> > 
> >Caches are only considered for defragmentation if the number of
> >partial slabs exceeds MAX_PARTIAL (per node).
> > 
> >Also, defragmentation only occurs if the usage ratio of the slab is
> >lower than the configured percentage (sysfs field added in this
> >patch).  Fragmentation ratios are measured by calculating the
> >percentage of objects in use compared to the total number of objects
> >that the slab page can accommodate.
> > 
> >The scanning of slab caches is optimized because the defragmentable
> >slabs come first on the list. Thus we can terminate scans on the
> >first slab encountered that does not support defragmentation.
> > 
> >kmem_defrag_slabs() takes a node parameter. This can either be -1 if
> >defragmentation should be performed on all nodes, or a node number.
> > 
> >Defragmentation may be disabled by setting defrag ratio to 0
> > 
> > echo 0 > /sys/kernel/slab/<cache>/defrag_used_ratio
> > 
> > - Add a defrag ratio sysfs field and set it to 30% by default. A limit
> > of 30% specifies that more than 3 out of 10 available slots for objects
> > need to be in use otherwise slab defragmentation will be attempted on
> > the remaining objects.
> > 
> > In order for a cache to be defragmentable the cache must support object
> > migration (SMO).  Enabling SMO for a cache is done via a call to the
> > recently added function:
> > 
> > void kmem_cache_setup_mobility(struct kmem_cache *,
> >kmem_cache_isolate_func,
> >kmem_cache_migrate_func);
> > 
> > Co-developed-by: Christoph Lameter 
> > Signed-off-by: Tobin C. Harding 
> > ---
> >  Documentation/ABI/testing/sysfs-kernel-slab |  14 +
> >  include/linux/slab.h|   1 +
> >  include/linux/slub_def.h|   7 +
> >  mm/slub.c   | 385 
> >  4 files changed, 334 insertions(+), 73 deletions(-)
> 
> Hi Tobin!
> 
> Overall looks very good to me! I'll take another look when you'll post
> a non-RFC version, but so far I can't find any issues.

Thanks for the reviews.

> A generic question: as I understand, you do support only root kmemcaches now.
> Is kmemcg support in plans?

I know very little about cgroups, I have no plans for this work.
However, I'm not the architect behind this - Christoph is guiding the
direction on this one.  Perhaps he will comment.

> Without it the patchset isn't as attractive to anyone using cgroups,
> as it could be. Also, I hope it can solve (or mitigate) the memcg-specific
> problem of scattering vfs cache workingset over multiple generations of the
> same cgroup (their kmem_caches).

I'm keen to work on anything that makes this more useful so I'll do some
research.  Thanks for the idea.

Regards,
Tobin.


[RFC PATCH v5 12/16] slub: Enable moving objects to/from specific nodes

2019-05-19 Thread Tobin C. Harding
We have just implemented Slab Movable Objects (object migration).
Currently object migration is used to defrag a cache.  On NUMA systems
it would be nice to be able to control the source and destination nodes
when moving objects.

Add CONFIG_SMO_NODE to guard this feature.  CONFIG_SMO_NODE depends on
CONFIG_SLUB_DEBUG because we use the full list.

Implement moving all objects (including those in full slabs) to a
specific node.  Expose this functionality to userspace via a sysfs entry.

Add sysfs entry:

   /sys/kernel/slab/<cache>/move

With this users get access to the following functionality:

 - Move all objects to specified node.

echo "N1" > move

 - Move all objects from specified node to other specified
   node (from N1 -> to N2):

echo "N1 N2" > move

This also enables shrinking slabs on a specific node:

echo "N1 N1" > move

Signed-off-by: Tobin C. Harding 
---
 mm/Kconfig |   7 ++
 mm/slub.c  | 249 +
 2 files changed, 256 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index ee8d1f311858..aa8d60e69a01 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -258,6 +258,13 @@ config ARCH_ENABLE_THP_MIGRATION
 config CONTIG_ALLOC
def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
 
+config SMO_NODE
+   bool "Enable per node control of Slab Movable Objects"
+   depends on SLUB && SYSFS
+   select SLUB_DEBUG
+   help
+ On NUMA systems enable moving objects to and from a specified node.
+
 config PHYS_ADDR_T_64BIT
def_bool 64BIT
 
diff --git a/mm/slub.c b/mm/slub.c
index 2157205df7ba..9582f2fc97d2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4336,6 +4336,106 @@ static void move_slab_page(struct page *page, void 
*scratch, int node)
s->migrate(s, vector, count, node, private);
 }
 
+#ifdef CONFIG_SMO_NODE
+/*
+ * kmem_cache_move() - Attempt to move all slab objects.
+ * @s: The cache we are working on.
+ * @node: The node to move objects away from.
+ * @target_node: The node to move objects on to.
+ *
+ * Attempts to move all objects (partial slabs and full slabs) to target
+ * node.
+ *
+ * Context: Takes the list_lock.
+ * Return: The number of slabs remaining on node.
+ */
+static unsigned long kmem_cache_move(struct kmem_cache *s,
+int node, int target_node)
+{
+   struct kmem_cache_node *n = get_node(s, node);
+   LIST_HEAD(move_list);
+   struct page *page, *page2;
+   unsigned long flags;
+   void **scratch;
+
+   if (!s->migrate) {
+   pr_warn("%s SMO not enabled, cannot move objects\n", s->name);
+   goto out;
+   }
+
+   scratch = alloc_scratch(s);
+   if (!scratch)
+   goto out;
+
+   spin_lock_irqsave(&n->list_lock, flags);
+
+   list_for_each_entry_safe(page, page2, &n->partial, lru) {
+   if (!slab_trylock(page))
+   /* Busy slab. Get out of the way */
+   continue;
+
+   if (page->inuse) {
+   list_move(&page->lru, &move_list);
+   /* Stop page being considered for allocations */
+   n->nr_partial--;
+   page->frozen = 1;
+
+   slab_unlock(page);
+   } else {/* Empty slab page */
+   list_del(&page->lru);
+   n->nr_partial--;
+   slab_unlock(page);
+   discard_slab(s, page);
+   }
+   }
+   list_for_each_entry_safe(page, page2, &n->full, lru) {
+   if (!slab_trylock(page))
+   continue;
+
+   list_move(&page->lru, &move_list);
+   page->frozen = 1;
+   slab_unlock(page);
+   }
+
+   spin_unlock_irqrestore(&n->list_lock, flags);
+
+   list_for_each_entry(page, &move_list, lru) {
+   if (page->inuse)
+   move_slab_page(page, scratch, target_node);
+   }
+   kfree(scratch);
+
+   /* Bail here to save taking the list_lock */
+   if (list_empty(&move_list))
+   goto out;
+
+   /* Inspect results and dispose of pages */
+   spin_lock_irqsave(&n->list_lock, flags);
+   list_for_each_entry_safe(page, page2, &move_list, lru) {
+   list_del(&page->lru);
+   slab_lock(page);
+   page->frozen = 0;
+
+   if (page->inuse) {
+   if (page->inuse == page->objects) {
+   list_add(&page->lru, &n->full);
+   slab_unlock(page);
+   } else {
+   n->nr_partial++;
+   list_add_tail(&page->lru, &n->partial);
+ 

[RFC PATCH v5 15/16] dcache: Implement partial shrink via Slab Movable Objects

2019-05-19 Thread Tobin C. Harding
The dentry slab cache is susceptible to internal fragmentation.  Now
that we have Slab Movable Objects we can attempt to defragment the
dcache.  Dentry objects are inherently _not_ relocatable; however, under
some conditions they can be free'd.  This is the same as shrinking the
dcache but instead of shrinking the whole cache we only attempt to free
those objects that are located in partially full slab pages.  There is
no guarantee that this will reduce the memory usage of the system, it is
a compromise between fragmented memory and total cache shrinkage with
the hope that some memory pressure can be alleviated.

This is implemented using the newly added Slab Movable Objects
infrastructure.  The dcache 'migration' function is intentionally _not_
called 'd_migrate' because we only free, we do not migrate.  Call it
'd_partial_shrink' to make explicit that no reallocation is done.

Implement isolate and 'migrate' functions for the dentry slab cache.

Signed-off-by: Tobin C. Harding 
---
 fs/dcache.c | 76 +
 1 file changed, 76 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index b7318615979d..0dfe580c2d42 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 #include "mount.h"
 
@@ -3071,6 +3072,79 @@ void d_tmpfile(struct dentry *dentry, struct inode 
*inode)
 }
 EXPORT_SYMBOL(d_tmpfile);
 
+/*
+ * d_isolate() - Dentry isolation callback function.
+ * @s: The dentry cache.
+ * @v: Vector of pointers to the objects to isolate.
+ * @nr: Number of objects in @v.
+ *
+ * The slab allocator is holding off frees. We can safely examine
+ * the object without the danger of it vanishing from under us.
+ */
+static void *d_isolate(struct kmem_cache *s, void **v, int nr)
+{
+   struct list_head *dispose;
+   struct dentry *dentry;
+   int i;
+
+   dispose = kmalloc(sizeof(*dispose), GFP_KERNEL);
+   if (!dispose)
+   return NULL;
+
+   INIT_LIST_HEAD(dispose);
+
+   for (i = 0; i < nr; i++) {
+   dentry = v[i];
+   spin_lock(&dentry->d_lock);
+
+   if (dentry->d_lockref.count > 0 ||
+   dentry->d_flags & DCACHE_SHRINK_LIST) {
+   spin_unlock(&dentry->d_lock);
+   continue;
+   }
+
+   if (dentry->d_flags & DCACHE_LRU_LIST)
+   d_lru_del(dentry);
+
+   d_shrink_add(dentry, dispose);
+   spin_unlock(&dentry->d_lock);
+   }
+
+   return dispose;
+}
+
+/*
+ * d_partial_shrink() - Dentry migration callback function.
+ * @s: The dentry cache.
+ * @_unused: We do not access the vector.
+ * @__unused: No need for length of vector.
+ * @___unused: We do not do any allocation.
+ * @private: list_head pointer representing the shrink list.
+ *
+ * Dispose of the shrink list created during isolation function.
+ *
+ * Dentry objects can _not_ be relocated and shrinking the whole dcache
+ * can be expensive.  This is an effort to free dentry objects that are
+ * stopping slab pages from being free'd without clearing the whole dcache.
+ *
+ * This callback is called from the SLUB allocator object migration
+ * infrastructure in attempt to free up slab pages by freeing dentry
+ * objects from partially full slabs.
+ */
+static void d_partial_shrink(struct kmem_cache *s, void **_unused, int 
__unused,
+int ___unused, void *private)
+{
+   struct list_head *dispose = private;
+
+   if (!private)   /* kmalloc error during isolate. */
+   return;
+
+   if (!list_empty(dispose))
+   shrink_dentry_list(dispose);
+
+   kfree(private);
+}
+
 static __initdata unsigned long dhash_entries;
 static int __init set_dhash_entries(char *str)
 {
@@ -3116,6 +3190,8 @@ static void __init dcache_init(void)
   sizeof_field(struct dentry, d_iname),
   dcache_ctor);
 
+   kmem_cache_setup_mobility(dentry_cache, d_isolate, d_partial_shrink);
+
/* Hash may have been set up in dcache_init_early */
if (!hashdist)
return;
-- 
2.21.0



[RFC PATCH v5 13/16] slub: Enable balancing slabs across nodes

2019-05-19 Thread Tobin C. Harding
We have just implemented Slab Movable Objects (SMO).  On NUMA systems
slabs can become unbalanced i.e. many slabs on one node while other
nodes have few slabs.  Using SMO we can balance the slabs across all
the nodes.

The algorithm used is as follows:

 1. Move all objects to node 0 (this has the effect of defragmenting the
cache).

 2. Calculate the desired number of slabs for each node (this is done
using the approximation nr_slabs / nr_nodes).

 3. Loop over the nodes moving the desired number of slabs from node 0
to the node.

The feature is conditionally built in with CONFIG_SMO_NODE; this is
because we need the full list (we enable SLUB_DEBUG to get this).  A
future version may separate the full list out of SLUB_DEBUG.

Expose this functionality to userspace via a sysfs entry.  Add sysfs
entry:

   /sys/kernel/slab/<cache>/balance

A write of '1' to this file triggers a balance; no other value is
accepted.

This feature relies on SMO being enabled for the cache; this is done
with a call to the following, after the isolate/migrate functions have
been defined:

kmem_cache_setup_mobility(s, isolate, migrate)

Signed-off-by: Tobin C. Harding 
---
 mm/slub.c | 120 ++
 1 file changed, 120 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index 9582f2fc97d2..25b6d1e408e3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4574,6 +4574,109 @@ static unsigned long kmem_cache_move_to_node(struct 
kmem_cache *s, int node)
 
return left;
 }
+
+/*
+ * kmem_cache_move_slabs() - Attempt to move @num slabs to target_node,
+ * @s: The cache we are working on.
+ * @node: The node to move objects from.
+ * @target_node: The node to move objects to.
+ * @num: The number of slabs to move.
+ *
+ * Attempts to move @num slabs from @node to @target_node.  This is done
+ * by migrating objects from slabs on the full_list.
+ *
+ * Return: The number of slabs moved or error code.
+ */
+static long kmem_cache_move_slabs(struct kmem_cache *s,
+ int node, int target_node, long num)
+{
+   struct kmem_cache_node *n = get_node(s, node);
+   LIST_HEAD(move_list);
+   struct page *page, *page2;
+   unsigned long flags;
+   void **scratch;
+   long done = 0;
+
+   if (node == target_node)
+   return -EINVAL;
+
+   scratch = alloc_scratch(s);
+   if (!scratch)
+   return -ENOMEM;
+
+   spin_lock_irqsave(&n->list_lock, flags);
+   list_for_each_entry_safe(page, page2, &n->full, lru) {
+   if (!slab_trylock(page))
+   /* Busy slab. Get out of the way */
+   continue;
+
+   list_move(&page->lru, &move_list);
+   page->frozen = 1;
+   slab_unlock(page);
+
+   if (++done >= num)
+   break;
+   }
+   spin_unlock_irqrestore(&n->list_lock, flags);
+
+   list_for_each_entry(page, &move_list, lru) {
+   if (page->inuse)
+   move_slab_page(page, scratch, target_node);
+   }
+   kfree(scratch);
+
+   /* Inspect results and dispose of pages */
+   spin_lock_irqsave(&n->list_lock, flags);
+   list_for_each_entry_safe(page, page2, &move_list, lru) {
+   list_del(&page->lru);
+   slab_lock(page);
+   page->frozen = 0;
+
+   if (page->inuse) {
+   /*
+* This is best effort only, if slab still has
+* objects just put it back on the partial list.
+*/
+   n->nr_partial++;
+   list_add_tail(&page->lru, &n->partial);
+   slab_unlock(page);
+   } else {
+   slab_unlock(page);
+   discard_slab(s, page);
+   }
+   }
+   spin_unlock_irqrestore(&n->list_lock, flags);
+
+   return done;
+}
+
+/*
+ * kmem_cache_balance_nodes() - Balance slabs across nodes.
+ * @s: The cache we are working on.
+ */
+static void kmem_cache_balance_nodes(struct kmem_cache *s)
+{
+   struct kmem_cache_node *n = get_node(s, 0);
+   unsigned long desired_nr_slabs_per_node;
+   unsigned long nr_slabs;
+   int nr_nodes = 0;
+   int nid;
+
+   (void)kmem_cache_move_to_node(s, 0);
+
+   for_each_node_state(nid, N_NORMAL_MEMORY)
+   nr_nodes++;
+
+   nr_slabs = atomic_long_read(&n->nr_slabs);
+   desired_nr_slabs_per_node = nr_slabs / nr_nodes;
+
+   for_each_node_state(nid, N_NORMAL_MEMORY) {
+   if (nid == 0)
+   continue;
+
+   kmem_cache_move_slabs(s, 0, nid, desired_nr_slabs_per_node);
+   }
+}
 #endif
 
 /**
@@ -5838,6 +5941,22 @@ static ssize_t move_store(struct kmem_cache *s, const 
char *buf, size_t length)
return length;
 }
 SLAB_ATTR(move);
+
+static ssize_t balance_show(struct kmem_cache 

[RFC PATCH v5 16/16] dcache: Add CONFIG_DCACHE_SMO

2019-05-19 Thread Tobin C. Harding
In an attempt to make the SMO patchset as non-invasive as possible add a
config option CONFIG_DCACHE_SMO (under "Memory Management options") for
enabling SMO for the DCACHE.  Without this option the dcache constructor
is used but no other code is built in; with this option enabled, slab
mobility is enabled and the isolate/migrate functions are built in.

Add CONFIG_DCACHE_SMO to guard the partial shrinking of the dcache via
Slab Movable Objects infrastructure.

Signed-off-by: Tobin C. Harding 
---
 fs/dcache.c | 4 
 mm/Kconfig  | 7 +++
 2 files changed, 11 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 0dfe580c2d42..96063e872366 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3072,6 +3072,7 @@ void d_tmpfile(struct dentry *dentry, struct inode *inode)
 }
 EXPORT_SYMBOL(d_tmpfile);
 
+#ifdef CONFIG_DCACHE_SMO
 /*
  * d_isolate() - Dentry isolation callback function.
  * @s: The dentry cache.
@@ -3144,6 +3145,7 @@ static void d_partial_shrink(struct kmem_cache *s, void 
**_unused, int __unused,
 
kfree(private);
 }
+#endif /* CONFIG_DCACHE_SMO */
 
 static __initdata unsigned long dhash_entries;
 static int __init set_dhash_entries(char *str)
@@ -3190,7 +3192,9 @@ static void __init dcache_init(void)
   sizeof_field(struct dentry, d_iname),
   dcache_ctor);
 
+#ifdef CONFIG_DCACHE_SMO
kmem_cache_setup_mobility(dentry_cache, d_isolate, d_partial_shrink);
+#endif
 
/* Hash may have been set up in dcache_init_early */
if (!hashdist)
diff --git a/mm/Kconfig b/mm/Kconfig
index aa8d60e69a01..7dcea76e5ecc 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -265,6 +265,13 @@ config SMO_NODE
help
  On NUMA systems enable moving objects to and from a specified node.
 
+config DCACHE_SMO
+   bool "Enable Slab Movable Objects for the dcache"
+   depends on SLUB
+   help
+ Under memory pressure we can try to free dentry slab cache objects 
from
+ the partial slab list if this is enabled.
+
 config PHYS_ADDR_T_64BIT
def_bool 64BIT
 
-- 
2.21.0



[RFC PATCH v5 11/16] tools/testing/slab: Add XArray movable objects tests

2019-05-19 Thread Tobin C. Harding
We just implemented movable objects for the XArray.  Let's test it
in-tree.

Add test module for the XArray's movable objects implementation.

Functionality of the XArray Slab Movable Object implementation can
usually be seen simply by using `slabinfo` on a running machine, since
the radix tree is typically in use on a running machine and will have
partial slabs.  For repeated testing we can use the test module to
simulate a workload on the XArray and then use `slabinfo` to verify that
object migration is functioning.

If testing on a freshly spun-up VM (low radix tree workload) it may be
necessary to load/unload the module a number of times to create partial
slabs.

Example test session


Relevant /proc/slabinfo column headers:

  name   

Prior to testing slabinfo report for radix_tree_node:

  # slabinfo radix_tree_node --report

  Slabcache: radix_tree_node  Aliases:  0 Order :  2 Objects: 8352
  ** Reclaim accounting active
  ** Defragmentation at 30%

  Sizes (bytes) Slabs  DebugMemory
  
  Object : 576  Total  : 497   Sanity Checks : On   Total: 8142848
  SlabObj: 912  Full   : 473   Redzoning : On   Used : 4810752
  SlabSiz:   16384  Partial:  24   Poisoning : On   Loss : 3332096
  Loss   : 336  CpuSlab:   0   Tracking  : On   Lalig: 2806272
  Align  :   8  Objects:  17   Tracing   : Off  Lpadd:  437360

Here you can see the kernel was built with Slab Movable Objects enabled
for the XArray (XArray uses the radix tree below the surface).

After inserting the test module (note we have triggered allocation of a
number of radix tree nodes increasing the object count but decreasing the
number of partial slabs):

  # slabinfo radix_tree_node --report

  Slabcache: radix_tree_node  Aliases:  0 Order :  2 Objects: 8442
  ** Reclaim accounting active
  ** Defragmentation at 30%

  Sizes (bytes) Slabs  DebugMemory
  
  Object : 576  Total  : 499   Sanity Checks : On   Total: 8175616
  SlabObj: 912  Full   : 484   Redzoning : On   Used : 4862592
  SlabSiz:   16384  Partial:  15   Poisoning : On   Loss : 3313024
  Loss   : 336  CpuSlab:   0   Tracking  : On   Lalig: 2836512
  Align  :   8  Objects:  17   Tracing   : Off  Lpadd:  439120

Now we can shrink the radix_tree_node cache:

  # slabinfo radix_tree_node --shrink
  # slabinfo radix_tree_node --report

  Slabcache: radix_tree_node  Aliases:  0 Order :  2 Objects: 8515
  ** Reclaim accounting active
  ** Defragmentation at 30%

  Sizes (bytes) Slabs  DebugMemory
  
  Object : 576  Total  : 501   Sanity Checks : On   Total: 8208384
  SlabObj: 912  Full   : 500   Redzoning : On   Used : 4904640
  SlabSiz:   16384  Partial:   1   Poisoning : On   Loss : 3303744
  Loss   : 336  CpuSlab:   0   Tracking  : On   Lalig: 2861040
  Align  :   8  Objects:  17   Tracing   : Off  Lpadd:  440880

Note the single remaining partial slab.

Signed-off-by: Tobin C. Harding 
---
 tools/testing/slab/Makefile |   2 +-
 tools/testing/slab/slub_defrag_xarray.c | 211 
 2 files changed, 212 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/slab/slub_defrag_xarray.c

diff --git a/tools/testing/slab/Makefile b/tools/testing/slab/Makefile
index 440c2e3e356f..44c18d9a4d52 100644
--- a/tools/testing/slab/Makefile
+++ b/tools/testing/slab/Makefile
@@ -1,4 +1,4 @@
-obj-m += slub_defrag.o
+obj-m += slub_defrag.o slub_defrag_xarray.o
 
 KTREE=../../..
 
diff --git a/tools/testing/slab/slub_defrag_xarray.c 
b/tools/testing/slab/slub_defrag_xarray.c
new file mode 100644
index ..41143f73256c
--- /dev/null
+++ b/tools/testing/slab/slub_defrag_xarray.c
@@ -0,0 +1,211 @@
+// SPDX-License-Identifier: GPL-2.0+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define SMOX_CACHE_NAME "smox_test"
+static struct kmem_cache *cachep;
+
+/*
+ * Declare XArrays globally so we can clean them up on module unload.
+ */
+
+/* Used by test_smo_xarray()*/
+DEFINE_XARRAY(things);
+
+/* Thing to store pointers to in the XArray */
+struct smox_thing {
+   long id;
+};
+
+/* It's up to the caller to ensure id is unique */
+static struct smox_thing *alloc_thing(int id)
+{
+   struct smox_thing *thing;
+
+   thing = kmem_cache_alloc(cachep, GFP_KERNEL);
+   if (!thing)
+   return ERR_PTR(-ENOMEM);
+
+   thing->id = id;
+   return thing;
+}
+
+/**
+ * smox_object_ctor() - SMO object constructor function.
+ * @ptr: Pointer to memory where the object should be constructe

[RFC PATCH v5 14/16] dcache: Provide a dentry constructor

2019-05-19 Thread Tobin C. Harding
In order to support object migration on the dentry cache we need to have
a determined object state at all times. Without a constructor the object
would have a random state after allocation.

Provide a dentry constructor.

Signed-off-by: Tobin C. Harding 
---
 fs/dcache.c | 30 +-
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 8136bda27a1f..b7318615979d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1602,6 +1602,16 @@ void d_invalidate(struct dentry *dentry)
 }
 EXPORT_SYMBOL(d_invalidate);
 
+static void dcache_ctor(void *p)
+{
+   struct dentry *dentry = p;
+
+   /* Mimic lockref_mark_dead() */
+   dentry->d_lockref.count = -128;
+
+   spin_lock_init(&dentry->d_lock);
+}
+
 /**
  * __d_alloc   -   allocate a dcache entry
  * @sb: filesystem it will belong to
@@ -1657,7 +1667,6 @@ struct dentry *__d_alloc(struct super_block *sb, const 
struct qstr *name)
 
dentry->d_lockref.count = 1;
dentry->d_flags = 0;
-   spin_lock_init(&dentry->d_lock);
seqcount_init(&dentry->d_seq);
dentry->d_inode = NULL;
dentry->d_parent = dentry;
@@ -3095,14 +3104,17 @@ static void __init dcache_init_early(void)
 
 static void __init dcache_init(void)
 {
-   /*
-* A constructor could be added for stable state like the lists,
-* but it is probably not worth it because of the cache nature
-* of the dcache.
-*/
-   dentry_cache = KMEM_CACHE_USERCOPY(dentry,
-   SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|SLAB_ACCOUNT,
-   d_iname);
+   slab_flags_t flags =
+   SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | SLAB_MEM_SPREAD | 
SLAB_ACCOUNT;
+
+   dentry_cache =
+   kmem_cache_create_usercopy("dentry",
+  sizeof(struct dentry),
+  __alignof__(struct dentry),
+  flags,
+  offsetof(struct dentry, d_iname),
+  sizeof_field(struct dentry, d_iname),
+  dcache_ctor);
 
/* Hash may have been set up in dcache_init_early */
if (!hashdist)
-- 
2.21.0



[RFC PATCH v5 05/16] tools/vm/slabinfo: Add remote node defrag ratio output

2019-05-19 Thread Tobin C. Harding
Add output line for NUMA remote node defrag ratio.

Signed-off-by: Tobin C. Harding 
---
 tools/vm/slabinfo.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c
index cbfc56c44c2f..d2c22f9ee2d8 100644
--- a/tools/vm/slabinfo.c
+++ b/tools/vm/slabinfo.c
@@ -34,6 +34,7 @@ struct slabinfo {
unsigned int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
int movable, ctor;
+   int remote_node_defrag_ratio;
unsigned long partial, objects, slabs, objects_partial, objects_total;
unsigned long alloc_fastpath, alloc_slowpath;
unsigned long free_fastpath, free_slowpath;
@@ -377,6 +378,10 @@ static void slab_numa(struct slabinfo *s, int mode)
if (skip_zero && !s->slabs)
return;
 
+   if (mode) {
+   printf("\nNUMA remote node defrag ratio: %3d\n",
+  s->remote_node_defrag_ratio);
+   }
if (!line) {
printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
for(node = 0; node <= highest_node; node++)
@@ -1272,6 +1277,8 @@ static void read_slab_dir(void)
slab->cpu_partial_free = get_obj("cpu_partial_free");
slab->alloc_node_mismatch = 
get_obj("alloc_node_mismatch");
slab->deactivate_bypass = get_obj("deactivate_bypass");
+   slab->remote_node_defrag_ratio =
+   get_obj("remote_node_defrag_ratio");
chdir("..");
if (read_slab_obj(slab, "ops")) {
if (strstr(buffer, "ctor :"))
-- 
2.21.0



[RFC PATCH v5 08/16] tools/testing/slab: Add object migration test suite

2019-05-19 Thread Tobin C. Harding
We just added a module that enables testing the SLUB allocator's ability
to defrag/shrink caches via movable objects.  Tests are better when they
are automated.

Add automated testing via a python script for SLUB movable objects.

Example output:

  $ cd path/to/linux/tools/testing/slab
  $ ./slub_defrag.py
  Please run script as root

  $ sudo ./slub_defrag.py
  

  $ sudo ./slub_defrag.py --debug
  Loading module ...
  Slab cache smo_test created
  Objects per slab: 20
  Running sanity checks ...

  Running module stress test (see dmesg for additional test output) ...
  Removing module slub_defrag ...
  Loading module ...
  Slab cache smo_test created

  Running test non-movable ...
  testing slab 'smo_test' prior to enabling movable objects ...
  verified non-movable slabs are NOT shrinkable

  Running test movable ...
  testing slab 'smo_test' after enabling movable objects ...
  verified movable slabs are shrinkable

  Removing module slub_defrag ...

Signed-off-by: Tobin C. Harding 
---
 tools/testing/slab/slub_defrag.c  |   1 +
 tools/testing/slab/slub_defrag.py | 451 ++
 2 files changed, 452 insertions(+)
 create mode 100755 tools/testing/slab/slub_defrag.py

diff --git a/tools/testing/slab/slub_defrag.c b/tools/testing/slab/slub_defrag.c
index 4a5c24394b96..8332e69ee868 100644
--- a/tools/testing/slab/slub_defrag.c
+++ b/tools/testing/slab/slub_defrag.c
@@ -337,6 +337,7 @@ static int smo_run_module_tests(int nr_objs, int keep)
 
 /*
  * struct functions() - Map command to a function pointer.
+ * If you update this please update the documentation in slub_defrag.py
  */
 struct functions {
char *fn_name;
diff --git a/tools/testing/slab/slub_defrag.py 
b/tools/testing/slab/slub_defrag.py
new file mode 100755
index ..41747c0db39b
--- /dev/null
+++ b/tools/testing/slab/slub_defrag.py
@@ -0,0 +1,451 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import subprocess
+import sys
+from os import path
+
+# SLUB Movable Objects test suite.
+#
+# Requirements:
+#  - CONFIG_SLUB=y
+#  - CONFIG_SLUB_DEBUG=y
+#  - The slub_defrag module in this directory.
+
+# Test SMO using a kernel module that enables triggering arbitrary
+# kernel code from userspace via a debugfs file.
+#
+# Module code is in ./slub_defrag.c, basically the functionality is as
+# follows:
+#
+#  - Creates debugfs file /sys/kernel/debugfs/smo/callfn
+#  - Writes to 'callfn' are parsed as a command string and the function
+#associated with command is called.
+#  - Defines 4 commands (all commands operate on smo_test cache):
+# - 'test': Runs module stress tests.
+# - 'alloc N': Allocates N slub objects
+# - 'free N POS': Frees N objects starting at POS (see below)
+# - 'enable': Enables SLUB Movable Objects
+#
+# The module maintains a list of allocated objects.  Allocation adds
+# objects to the tail of the list.  Free'ing frees from the head of the
+# list.  This has the effect of creating free slots in the slab.  For
+# finer grained control over where in the cache slots are free'd POS
+# (position) argument may be used.
+
+# The main() function is reasonably readable; the test suite does the
+# following:
+#
+# 1. Runs the module stress tests.
+# 2. Tests the cache without movable objects enabled.
+#- Creates multiple partial slabs as explained above.
+#- Verifies that partial slabs are _not_ removed by shrink (see below).
+# 3. Tests the cache with movable objects enabled.
+#- Creates multiple partial slabs as explained above.
+#- Verifies that partial slabs _are_ removed by shrink (see below).
+
+# The sysfs file /sys/kernel/slab/<cache>/shrink enables calling the
+# function kmem_cache_shrink() (see mm/slab_common.c and mm/slub.c).
+# Shrinking a cache attempts to consolidate all partial slabs by moving
+# objects if object migration is enable for the cache, otherwise
+# shrinking a cache simply re-orders the partial list so as most densely
+# populated slab are at the head of the list.
+
+# Enable/disable debugging output (also enabled via -d | --debug).
+debug = False
+
+# Used in debug messages and when running `insmod`.
+MODULE_NAME = "slub_defrag"
+
+# Slab cache created by the test module.
+CACHE_NAME = "smo_test"
+
+# Set by get_slab_config()
+objects_per_slab = 0
+pages_per_slab = 0
+debugfs_mounted = False # Set to true if we mount debugfs.
+
+
+def eprint(*args, **kwargs):
+print(*args, file=sys.stderr, **kwargs)
+
+
+def dprint(*args, **kwargs):
+if debug:
+print(*args, file=sys.stderr, **kwargs)
+
+
+def run_shell(cmd):
+return subprocess.call([cmd], shell=True)
+
+
+def run_shell_get_stdout(cmd):
+return subprocess.check_output([cmd], shell=True)
+
+
+def assert_root():
+user = run_shell_get_stdout('whoami')
+if user != b'root\n':
+eprint("Please run script as root")
+sys.exit(1)
+
+
+def mount_debugfs():
+mounted = False
+
+# Check if

[RFC PATCH v5 07/16] tools/testing/slab: Add object migration test module

2019-05-19 Thread Tobin C. Harding
 Total  :   1   Sanity Checks : On   Total:8192
  SlabObj: 392  Full   :   1   Redzoning : On   Used :1120
  SlabSiz:8192  Partial:   0   Poisoning : On   Loss :7072
  Loss   : 336  CpuSlab:   0   Tracking  : On   Lalig:6720
  Align  :   8  Objects:  20   Tracing   : Off  Lpadd: 352

We can run the stress tests (with the default number of objects):

  # cd /sys/kernel/debug/smo
  # echo 'test' > callfn
  [3.576617] smo: test using nr_objs: 1000 keep: 10
  [3.580169] smo: Module tests completed successfully

Signed-off-by: Tobin C. Harding 
---
 tools/testing/slab/Makefile  |  10 +
 tools/testing/slab/slub_defrag.c | 566 +++
 2 files changed, 576 insertions(+)
 create mode 100644 tools/testing/slab/Makefile
 create mode 100644 tools/testing/slab/slub_defrag.c

diff --git a/tools/testing/slab/Makefile b/tools/testing/slab/Makefile
new file mode 100644
index ..440c2e3e356f
--- /dev/null
+++ b/tools/testing/slab/Makefile
@@ -0,0 +1,10 @@
+obj-m += slub_defrag.o
+
+KTREE=../../..
+
+all:
+   make -C ${KTREE} M=$(PWD) modules
+
+clean:
+   make -C ${KTREE} M=$(PWD) clean
+
diff --git a/tools/testing/slab/slub_defrag.c b/tools/testing/slab/slub_defrag.c
new file mode 100644
index ..4a5c24394b96
--- /dev/null
+++ b/tools/testing/slab/slub_defrag.c
@@ -0,0 +1,566 @@
+// SPDX-License-Identifier: GPL-2.0+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * SLUB defragmentation a.k.a. Slab Movable Objects (SMO).
+ *
+ * This module is used for testing the SLUB allocator.  Enables
+ * userspace to run kernel functions via a debugfs file.
+ *
+ *   debugfs: /sys/kernel/debugfs/smo/callfn (write only)
+ *
+ * String written to `callfn` is parsed by the module and associated
+ * function is called.  See fn_tab for mapping of strings to functions.
+ */
+
+/* debugfs commands accept two optional arguments */
+#define SMO_CMD_DEFAUT_ARG -1
+
+#define SMO_DEBUGFS_DIR "smo"
+struct dentry *smo_debugfs_root;
+
+#define SMO_CACHE_NAME "smo_test"
+static struct kmem_cache *cachep;
+
+struct smo_slub_object {
+   struct list_head list;
+   char buf[32];   /* Unused except to control size of object */
+   long id;
+};
+
+/* Our list of allocated objects */
+LIST_HEAD(objects);
+
+static void list_add_to_objects(struct smo_slub_object *so)
+{
+   /*
+* We free from the front of the list so store at the
+* tail in order to put holes in the cache when we free.
+*/
+   list_add_tail(&so->list, &objects);
+}
+
+/**
+ * smo_object_ctor() - SMO object constructor function.
+ * @ptr: Pointer to memory where the object should be constructed.
+ */
+void smo_object_ctor(void *ptr)
+{
+   struct smo_slub_object *so = ptr;
+
+   INIT_LIST_HEAD(&so->list);
+   memset(so->buf, 0, sizeof(so->buf));
+   so->id = -1;
+}
+
+/**
+ * smo_cache_migrate() - kmem_cache migrate function.
+ * @cp: kmem_cache pointer.
+ * @objs: Array of pointers to objects to migrate.
+ * @size: Number of objects in @objs.
+ * @node: NUMA node where the object should be allocated.
+ * @private: Pointer returned by kmem_cache_isolate_func().
+ */
+void smo_cache_migrate(struct kmem_cache *cp, void **objs, int size,
+  int node, void *private)
+{
+   struct smo_slub_object **so_objs = (struct smo_slub_object **)objs;
+   struct smo_slub_object *so_old, *so_new;
+   int i;
+
+   for (i = 0; i < size; i++) {
+   so_old = so_objs[i];
+
+   so_new = kmem_cache_alloc_node(cachep, GFP_KERNEL, node);
+   if (!so_new) {
+   pr_debug("kmem_cache_alloc failed\n");
+   return;
+   }
+
+   /* Copy object */
+   so_new->id = so_old->id;
+
+   /* Update references to old object */
+   list_del(&so_old->list);
+   list_add_to_objects(so_new);
+
+   kmem_cache_free(cachep, so_old);
+   }
+}
+
+static int smo_enable_cache_mobility(int _unused, int __unused)
+{
+   /* Enable movable objects: BOOM! */
+   kmem_cache_setup_mobility(cachep, NULL, smo_cache_migrate);
+   pr_info("smo: kmem_cache %s defrag enabled\n", SMO_CACHE_NAME);
+   return 0;
+}
+
+/*
+ * smo_alloc_objects() - Allocate objects and store reference.
+ * @nr_objs: Number of objects to allocate.
+ * @node: NUMA node to allocate objects on.
+ *
+ * Allocates @n smo_slub_objects.  Stores a reference to them in
+ * the global list of objects (at the tail of the list).
+ *
+ * Return: The number of objects allocated.
+ */
+static int smo_alloc_objects(int nr_objs, int node)
+{
+   struct smo_slub_object *so;
+   int i;
+
+   /* Set sane parameters if no args passed in */
+   if (nr_objs == 

[RFC PATCH v5 06/16] tools/vm/slabinfo: Add defrag_used_ratio output

2019-05-19 Thread Tobin C. Harding
Add output for the newly added defrag_used_ratio sysfs knob.

Signed-off-by: Tobin C. Harding 
---
 tools/vm/slabinfo.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c
index d2c22f9ee2d8..ef4ff93df4cc 100644
--- a/tools/vm/slabinfo.c
+++ b/tools/vm/slabinfo.c
@@ -34,6 +34,7 @@ struct slabinfo {
unsigned int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
int movable, ctor;
+   int defrag_used_ratio;
int remote_node_defrag_ratio;
unsigned long partial, objects, slabs, objects_partial, objects_total;
unsigned long alloc_fastpath, alloc_slowpath;
@@ -549,6 +550,8 @@ static void report(struct slabinfo *s)
printf("** Slabs are destroyed via RCU\n");
if (s->reclaim_account)
printf("** Reclaim accounting active\n");
+   if (s->movable)
+   printf("** Defragmentation at %d%%\n", s->defrag_used_ratio);
 
printf("\nSizes (bytes) Slabs  Debug
Memory\n");

printf("\n");
@@ -1279,6 +1282,7 @@ static void read_slab_dir(void)
slab->deactivate_bypass = get_obj("deactivate_bypass");
slab->remote_node_defrag_ratio =
get_obj("remote_node_defrag_ratio");
+   slab->defrag_used_ratio = get_obj("defrag_used_ratio");
chdir("..");
if (read_slab_obj(slab, "ops")) {
if (strstr(buffer, "ctor :"))
-- 
2.21.0



[RFC PATCH v5 09/16] lib: Separate radix_tree_node and xa_node slab cache

2019-05-19 Thread Tobin C. Harding
Earlier, Slab Movable Objects (SMO) was implemented.  The XArray is now
able to take advantage of SMO in order to make xarray nodes
movable (when using the SLUB allocator).

Currently the radix tree uses the same slab cache as the XArray.  Only
XArray nodes are movable _not_ radix tree nodes.  We can give the radix
tree its own slab cache to overcome this.

In preparation for implementing XArray object migration (xa_node
objects) via Slab Movable Objects add a slab cache solely for XArray
nodes and make the XArray use this slab cache instead of the
radix_tree_node slab cache.

Cc: Matthew Wilcox 
Signed-off-by: Tobin C. Harding 
---
 include/linux/xarray.h |  3 +++
 init/main.c|  2 ++
 lib/radix-tree.c   |  2 +-
 lib/xarray.c   | 48 ++
 4 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 0e01e6129145..773f91f8e1db 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -42,6 +42,9 @@
 
 #define BITS_PER_XA_VALUE  (BITS_PER_LONG - 1)
 
+/* Called from init/main.c */
+void xarray_slabcache_init(void);
+
 /**
  * xa_mk_value() - Create an XArray entry from an integer.
  * @v: Value to store in XArray.
diff --git a/init/main.c b/init/main.c
index 5a2c69b4d7b3..e89915ffbe26 100644
--- a/init/main.c
+++ b/init/main.c
@@ -106,6 +106,7 @@ static int kernel_init(void *);
 
 extern void init_IRQ(void);
 extern void radix_tree_init(void);
+extern void xarray_slabcache_init(void);
 
 /*
  * Debug helper: via this flag we know that we are in 'early bootup code'
@@ -621,6 +622,7 @@ asmlinkage __visible void __init start_kernel(void)
 "Interrupts were enabled *very* early, fixing it\n"))
local_irq_disable();
radix_tree_init();
+   xarray_slabcache_init();
 
/*
 * Set up housekeeping before setting up workqueues to allow the unbound
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 14d51548bea6..edbfb530ba73 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -44,7 +44,7 @@
 /*
  * Radix tree node cache.
  */
-struct kmem_cache *radix_tree_node_cachep;
+static struct kmem_cache *radix_tree_node_cachep;
 
 /*
  * The radix tree is variable-height, so an insert operation not only has
diff --git a/lib/xarray.c b/lib/xarray.c
index 6be3acbb861f..a528a5277c9d 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -27,6 +27,8 @@
  * @entry refers to something stored in a slot in the xarray
  */
 
+static struct kmem_cache *xa_node_cachep;
+
 static inline unsigned int xa_lock_type(const struct xarray *xa)
 {
return (__force unsigned int)xa->xa_flags & 3;
@@ -244,9 +246,21 @@ void *xas_load(struct xa_state *xas)
 }
 EXPORT_SYMBOL_GPL(xas_load);
 
-/* Move the radix tree node cache here */
-extern struct kmem_cache *radix_tree_node_cachep;
-extern void radix_tree_node_rcu_free(struct rcu_head *head);
+void xa_node_rcu_free(struct rcu_head *head)
+{
+   struct xa_node *node = container_of(head, struct xa_node, rcu_head);
+
+   /*
+* Must only free zeroed nodes into the slab.  We can be left with
+* non-NULL entries by radix_tree_free_nodes, so clear the entries
+* and tags here.
+*/
+   memset(node->slots, 0, sizeof(node->slots));
+   memset(node->tags, 0, sizeof(node->tags));
+   INIT_LIST_HEAD(&node->private_list);
+
+   kmem_cache_free(xa_node_cachep, node);
+}
 
 #define XA_RCU_FREE((struct xarray *)1)
 
@@ -254,7 +268,7 @@ static void xa_node_free(struct xa_node *node)
 {
XA_NODE_BUG_ON(node, !list_empty(&node->private_list));
node->array = XA_RCU_FREE;
-   call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
+   call_rcu(&node->rcu_head, xa_node_rcu_free);
 }
 
 /*
@@ -270,7 +284,7 @@ static void xas_destroy(struct xa_state *xas)
if (!node)
return;
XA_NODE_BUG_ON(node, !list_empty(&node->private_list));
-   kmem_cache_free(radix_tree_node_cachep, node);
+   kmem_cache_free(xa_node_cachep, node);
xas->xa_alloc = NULL;
 }
 
@@ -298,7 +312,7 @@ bool xas_nomem(struct xa_state *xas, gfp_t gfp)
xas_destroy(xas);
return false;
}
-   xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+   xas->xa_alloc = kmem_cache_alloc(xa_node_cachep, gfp);
if (!xas->xa_alloc)
return false;
XA_NODE_BUG_ON(xas->xa_alloc, !list_empty(&xas->xa_alloc->private_list));
@@ -327,10 +341,10 @@ static bool __xas_nomem(struct xa_state *xas, gfp_t gfp)
}
if (gfpflags_allow_blocking(gfp)) {
xas_unlock_type(xas, lock_type);
-   xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+   xas->xa_alloc = kmem_cache_alloc(xa_node_cachep, gfp);
xas_lock_type(xas, lock_type);
} else {
-   xas->

[RFC PATCH v5 10/16] xarray: Implement migration function for xa_node objects

2019-05-19 Thread Tobin C. Harding
Recently Slab Movable Objects (SMO) was implemented for the SLUB
allocator.  The XArray can take advantage of this and make the xa_node
slab cache objects movable.

Implement functions to migrate objects and activate SMO when we
initialise the XArray slab cache.

This is based on initial code by Matthew Wilcox and was modified to work
with slab object migration.

Cc: Matthew Wilcox 
Co-developed-by: Christoph Lameter 
Signed-off-by: Tobin C. Harding 
---
 lib/xarray.c | 61 
 1 file changed, 61 insertions(+)

diff --git a/lib/xarray.c b/lib/xarray.c
index a528a5277c9d..c6b077f59e88 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1993,12 +1993,73 @@ static void xa_node_ctor(void *arg)
INIT_LIST_HEAD(&node->private_list);
 }
 
+static void xa_object_migrate(struct xa_node *node, int numa_node)
+{
+   struct xarray *xa = READ_ONCE(node->array);
+   void __rcu **slot;
+   struct xa_node *new_node;
+   int i;
+
+   /* Freed or not yet in tree then skip */
+   if (!xa || xa == XA_RCU_FREE)
+   return;
+
+   new_node = kmem_cache_alloc_node(xa_node_cachep, GFP_KERNEL, numa_node);
+   if (!new_node) {
+   pr_err("%s: slab cache allocation failed\n", __func__);
+   return;
+   }
+
+   xa_lock_irq(xa);
+
+   /* Check again. */
+   if (xa != node->array) {
+   node = new_node;
+   goto unlock;
+   }
+
+   memcpy(new_node, node, sizeof(struct xa_node));
+
+   if (list_empty(&node->private_list))
+   INIT_LIST_HEAD(&new_node->private_list);
+   else
+   list_replace(&node->private_list, &new_node->private_list);
+
+   for (i = 0; i < XA_CHUNK_SIZE; i++) {
+   void *x = xa_entry_locked(xa, new_node, i);
+
+   if (xa_is_node(x))
+   rcu_assign_pointer(xa_to_node(x)->parent, new_node);
+   }
+   if (!new_node->parent)
+   slot = &xa->xa_head;
+   else
+   slot = &xa_parent_locked(xa, new_node)->slots[new_node->offset];
+   rcu_assign_pointer(*slot, xa_mk_node(new_node));
+
+unlock:
+   xa_unlock_irq(xa);
+   xa_node_free(node);
+   rcu_barrier();
+}
+
+static void xa_migrate(struct kmem_cache *s, void **objects, int nr,
+  int node, void *_unused)
+{
+   int i;
+
+   for (i = 0; i < nr; i++)
+   xa_object_migrate(objects[i], node);
+}
+
+
 void __init xarray_slabcache_init(void)
 {
xa_node_cachep = kmem_cache_create("xarray_node",
   sizeof(struct xa_node), 0,
   SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
   xa_node_ctor);
+   kmem_cache_setup_mobility(xa_node_cachep, NULL, xa_migrate);
 }
 
 #ifdef XA_DEBUG
-- 
2.21.0



[RFC PATCH v5 03/16] slub: Sort slab cache list

2019-05-19 Thread Tobin C. Harding
It is advantageous to have all defragmentable slabs together at the
beginning of the list of slabs so that there is no need to scan the
complete list. Put defragmentable caches first when adding a slab cache
and others last.
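
To make the benefit concrete, the kind of walk this ordering enables
looks roughly like the sketch below.  This is not code from the series,
just a minimal illustration assuming the ->migrate callback pointer
added in patch #1 of this set and a hypothetical __defrag_cache()
helper; because movable caches sit at the head of slab_caches, the scan
can stop at the first cache that has no migrate callback.

	/* Hedged sketch only: __defrag_cache() is a hypothetical helper. */
	static void defrag_movable_caches(void)
	{
		struct kmem_cache *s;

		mutex_lock(&slab_mutex);
		list_for_each_entry(s, &slab_caches, list) {
			if (!s->migrate)
				break;	/* movable caches come first; none follow */
			__defrag_cache(s);	/* defragment this cache */
		}
		mutex_unlock(&slab_mutex);
	}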

Co-developed-by: Christoph Lameter 
Signed-off-by: Tobin C. Harding 
---
 mm/slab_common.c | 2 +-
 mm/slub.c| 6 ++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 58251ba63e4a..db5e9a0b1535 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -393,7 +393,7 @@ static struct kmem_cache *create_cache(const char *name,
goto out_free_cache;
 
s->refcount = 1;
-   list_add(&s->list, &slab_caches);
+   list_add_tail(&s->list, &slab_caches);
memcg_link_cache(s);
 out:
if (err)
diff --git a/mm/slub.c b/mm/slub.c
index 1c380a2bc78a..66d474397c0f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4333,6 +4333,8 @@ void kmem_cache_setup_mobility(struct kmem_cache *s,
return;
}
 
+   mutex_lock(&slab_mutex);
+
s->isolate = isolate;
s->migrate = migrate;
 
@@ -4341,6 +4343,10 @@ void kmem_cache_setup_mobility(struct kmem_cache *s,
 * to disable fast cmpxchg based processing.
 */
s->flags &= ~__CMPXCHG_DOUBLE;
+
+   list_move(&s->list, &slab_caches);  /* Move to top */
+
+   mutex_unlock(&slab_mutex);
 }
 EXPORT_SYMBOL(kmem_cache_setup_mobility);
 
-- 
2.21.0



[RFC PATCH v5 04/16] slub: Slab defrag core

2019-05-19 Thread Tobin C. Harding
Internal fragmentation can occur within pages used by the SLUB
allocator.  Under some workloads large numbers of pages can be tied up
by partially filled slab pages.  This under-utilisation is bad simply
because it wastes memory, but also because, if the system is under
memory pressure, higher-order allocations may become difficult to
satisfy.  If we can defragment slab caches we can alleviate these
problems.

Implement Slab Movable Objects in order to defragment slab caches.

Slab defragmentation may occur:

1. Unconditionally when __kmem_cache_shrink() is called on a slab cache
   by the kernel calling kmem_cache_shrink().

2. Unconditionally through the use of the slabinfo command.

slabinfo  -s

3. Conditionally via the use of kmem_defrag_slabs()

- Use Slab Movable Objects when shrinking cache.

Currently, when the kernel calls kmem_cache_shrink(), we curate the
partial slabs list.  If object migration is not enabled for the cache we
still do this; if, however, SMO is enabled, we attempt to move objects
in partially full slabs in order to defragment the cache.  Shrink
attempts to move all objects in order to reduce the cache to a single
partial slab for each node.

- Add conditional per node defrag via new function:

kmem_defrag_slabs(int node).

kmem_defrag_slabs() attempts to defragment all slab caches for
node. Defragmentation is done conditionally dependent on MAX_PARTIAL
_and_ defrag_used_ratio.

   Caches are only considered for defragmentation if the number of
   partial slabs exceeds MAX_PARTIAL (per node).

   Also, defragmentation only occurs if the usage ratio of the slab is
   lower than the configured percentage (sysfs field added in this
   patch).  Fragmentation ratios are measured by calculating the
   percentage of objects in use compared to the total number of objects
   that the slab page can accommodate.

   The scanning of slab caches is optimized because the defragmentable
   slabs come first on the list. Thus we can terminate scans on the
   first slab encountered that does not support defragmentation.

   kmem_defrag_slabs() takes a node parameter. This can either be -1 if
   defragmentation should be performed on all nodes, or a node number.

   Defragmentation may be disabled by setting defrag ratio to 0

echo 0 > /sys/kernel/slab/<cache>/defrag_used_ratio

- Add a defrag ratio sysfs field and set it to 30% by default. A limit
of 30% specifies that more than 3 out of 10 available slots for objects
need to be in use otherwise slab defragmentation will be attempted on
the remaining objects.

In order for a cache to be defragmentable the cache must support object
migration (SMO).  Enabling SMO for a cache is done via a call to the
recently added function:

void kmem_cache_setup_mobility(struct kmem_cache *,
   kmem_cache_isolate_func,
   kmem_cache_migrate_func);
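
To make the setup step concrete, here is a hedged sketch (not part of
this patch) of how a hypothetical cache owner might wire this up.
struct foo, foo_cachep and the two callbacks are invented names for
illustration; kmem_cache_setup_mobility() and kmem_defrag_slabs() are
the interfaces added by this series.

	/* Hedged sketch only: an illustrative cache enabling SMO. */
	struct foo {
		unsigned long data;
	};

	static struct kmem_cache *foo_cachep;

	static void *foo_isolate(struct kmem_cache *s, void **objs, int nr)
	{
		/*
		 * Pin the objects here (e.g. take a reference) so they
		 * cannot be freed before foo_migrate() runs.  Nothing to
		 * do in this toy example.
		 */
		return NULL;
	}

	static void foo_migrate(struct kmem_cache *s, void **objs, int nr,
				int node, void *private)
	{
		/* Allocate replacements on @node, copy, repoint users, free old. */
	}

	static int __init foo_smo_init(void)
	{
		foo_cachep = kmem_cache_create("foo", sizeof(struct foo),
					       0, 0, NULL);
		if (!foo_cachep)
			return -ENOMEM;

		kmem_cache_setup_mobility(foo_cachep, foo_isolate, foo_migrate);
		return 0;
	}

Once a cache is set up like this, kmem_defrag_slabs(node) (or -1 for
all nodes) will consider it, subject to MAX_PARTIAL and
defrag_used_ratio as described above.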

Co-developed-by: Christoph Lameter 
Signed-off-by: Tobin C. Harding 
---
 Documentation/ABI/testing/sysfs-kernel-slab |  14 +
 include/linux/slab.h|   1 +
 include/linux/slub_def.h|   7 +
 mm/slub.c   | 385 
 4 files changed, 334 insertions(+), 73 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab
index 29601d93a1c2..c6f129af035a 100644
--- a/Documentation/ABI/testing/sysfs-kernel-slab
+++ b/Documentation/ABI/testing/sysfs-kernel-slab
@@ -180,6 +180,20 @@ Description:
list.  It can be written to clear the current count.
Available when CONFIG_SLUB_STATS is enabled.
 
+What:  /sys/kernel/slab/cache/defrag_used_ratio
+Date:  May 2019
+KernelVersion: 5.2
+Contact:   Christoph Lameter 
+   Pekka Enberg ,
+Description:
+   The defrag_used_ratio file allows control of how aggressively
+   slab fragmentation reduction works at reclaiming objects from
+   sparsely populated slabs. This is a percentage. If a slab has
+   less than this percentage of objects allocated then reclaim will
+   attempt to reclaim objects so that the whole slab page can be
+   freed. 0% specifies no reclaim attempt (defrag disabled), 100%
+   specifies attempt to reclaim all pages.  The default is 30%.
+
 What:  /sys/kernel/slab/cache/deactivate_to_tail
 Date:  February 2008
 KernelVersion: 2.6.25
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 886fc130334d..4bf381b34829 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -149,6 +149,7 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
void (*ctor)(void *));
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
+unsigned long kmem_defrag_slabs(int node);
 
 void memcg_create_kmem_cache(str

[RFC PATCH v5 02/16] tools/vm/slabinfo: Add support for -C and -M options

2019-05-19 Thread Tobin C. Harding
-C lists caches that use a ctor.

-M lists caches that support object migration.

Add command line options to show caches with a constructor and caches
that are movable (i.e. have migrate function).

Co-developed-by: Christoph Lameter 
Signed-off-by: Tobin C. Harding 
---
 tools/vm/slabinfo.c | 40 
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c
index 73818f1b2ef8..cbfc56c44c2f 100644
--- a/tools/vm/slabinfo.c
+++ b/tools/vm/slabinfo.c
@@ -33,6 +33,7 @@ struct slabinfo {
unsigned int hwcache_align, object_size, objs_per_slab;
unsigned int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
+   int movable, ctor;
unsigned long partial, objects, slabs, objects_partial, objects_total;
unsigned long alloc_fastpath, alloc_slowpath;
unsigned long free_fastpath, free_slowpath;
@@ -67,6 +68,8 @@ int show_report;
 int show_alias;
 int show_slab;
 int skip_zero = 1;
+int show_movable;
+int show_ctor;
 int show_numa;
 int show_track;
 int show_first_alias;
@@ -109,11 +112,13 @@ static void fatal(const char *x, ...)
 
 static void usage(void)
 {
-   printf("slabinfo 4/15/2011. (c) 2007 sgi/(c) 2011 Linux Foundation.\n\n"
-   "slabinfo [-aADefhilnosrStTvz1LXBU] [N=K] [-dafzput] 
[slab-regexp]\n"
+   printf("slabinfo 4/15/2017. (c) 2007 sgi/(c) 2011 Linux Foundation/(c) 
2017 Jump Trading LLC.\n\n"
+  "slabinfo [-aACDefhilMnosrStTvz1LXBU] [N=K] [-dafzput] 
[slab-regexp]\n"
+
"-a|--aliases   Show aliases\n"
"-A|--activity  Most active slabs first\n"
"-B|--Bytes Show size in bytes\n"
+   "-C|--ctor  Show slabs with ctors\n"
"-D|--display-activeSwitch line format to activity\n"
"-e|--empty Show empty slabs\n"
"-f|--first-alias   Show first alias\n"
@@ -121,6 +126,7 @@ static void usage(void)
"-i|--inverted  Inverted list\n"
"-l|--slabs Show slabs\n"
"-L|--Loss  Sort by loss\n"
+   "-M|--movable   Show caches that support movable 
objects\n"
"-n|--numa  Show NUMA information\n"
"-N|--lines=K   Show the first K slabs\n"
"-o|--ops   Show kmem_cache_ops\n"
@@ -588,6 +594,12 @@ static void slabcache(struct slabinfo *s)
if (show_empty && s->slabs)
return;
 
+   if (show_ctor && !s->ctor)
+   return;
+
+   if (show_movable && !s->movable)
+   return;
+
if (sort_loss == 0)
store_size(size_str, slab_size(s));
else
@@ -602,6 +614,10 @@ static void slabcache(struct slabinfo *s)
*p++ = '*';
if (s->cache_dma)
*p++ = 'd';
+   if (s->ctor)
+   *p++ = 'C';
+   if (s->movable)
+   *p++ = 'M';
if (s->hwcache_align)
*p++ = 'A';
if (s->poison)
@@ -636,7 +652,8 @@ static void slabcache(struct slabinfo *s)
printf("%-21s %8ld %7d %15s %14s %4d %1d %3ld %3ld %s\n",
s->name, s->objects, s->object_size, size_str, dist_str,
s->objs_per_slab, s->order,
-   s->slabs ? (s->partial * 100) / s->slabs : 100,
+   s->slabs ? (s->partial * 100) /
+   (s->slabs * s->objs_per_slab) : 100,
s->slabs ? (s->objects * s->object_size * 100) /
(s->slabs * (page_size << s->order)) : 100,
flags);
@@ -1256,6 +1273,13 @@ static void read_slab_dir(void)
slab->alloc_node_mismatch = get_obj("alloc_node_mismatch");
slab->deactivate_bypass = get_obj("deactivate_bypass");
chdir("..");
+   if (read_slab_obj(slab, "ops")) {
+   if (strstr(buffer, "ctor :"))
+   slab->ctor = 1;
+   if (strstr(buffer, "migrate :"))
+   slab->movable = 1;
+   }
+
if (slab->name[0] == ':')
alias_targets++;
slab++;
@@ -1332,6 +1356,8 @@ static void 

[RFC PATCH v5 00/16] Slab Movable Objects (SMO)

2019-05-19 Thread Tobin C. Harding
Hi,

Another iteration of the SMO patch set; updates to this version are
restricted to the XArray patches (#9 and #10, tested with the module
implemented in #11).

Applies on top of Linus' tree (tag: v5.2-rc1).

This is a patch set implementing movable objects within the SLUB
allocator.  This is work based on Christopher Lameter's patch set:

 https://lore.kernel.org/patchwork/project/lkml/list/?series=377335

The original code logic is from that set and implemented by Christopher.
Clean up, refactoring, documentation, and additional features by myself.
Responsibility for any bugs remaining falls solely with myself.

I am intending on sending a non-RFC version soon after this one (if the
XArray stuff is ok).  If anyone has any objections to SMO in general
please yell at me now.

Changes to this version:

Patch XArray to use a separate slab cache.  Currently the radix tree and
XArray use the same slab cache.  Radix tree nodes can not be moved but
XArray nodes can.

Matthew,

Does this fit in ok with your plans for the XArray and radix tree?  I
don't really like the function names used here or the init function name
(xarray_slabcache_init()).  If there is a better way to do this please
mercilessly correct me :)


Thanks for looking at this,
Tobin.


Tobin C. Harding (16):
  slub: Add isolate() and migrate() methods
  tools/vm/slabinfo: Add support for -C and -M options
  slub: Sort slab cache list
  slub: Slab defrag core
  tools/vm/slabinfo: Add remote node defrag ratio output
  tools/vm/slabinfo: Add defrag_used_ratio output
  tools/testing/slab: Add object migration test module
  tools/testing/slab: Add object migration test suite
  lib: Separate radix_tree_node and xa_node slab cache
  xarray: Implement migration function for xa_node objects
  tools/testing/slab: Add XArray movable objects tests
  slub: Enable moving objects to/from specific nodes
  slub: Enable balancing slabs across nodes
  dcache: Provide a dentry constructor
  dcache: Implement partial shrink via Slab Movable Objects
  dcache: Add CONFIG_DCACHE_SMO

 Documentation/ABI/testing/sysfs-kernel-slab |  14 +
 fs/dcache.c | 110 ++-
 include/linux/slab.h|  71 ++
 include/linux/slub_def.h|  10 +
 include/linux/xarray.h  |   3 +
 init/main.c |   2 +
 lib/radix-tree.c|   2 +-
 lib/xarray.c| 109 ++-
 mm/Kconfig  |  14 +
 mm/slab_common.c|   2 +-
 mm/slub.c   | 819 ++--
 tools/testing/slab/Makefile |  10 +
 tools/testing/slab/slub_defrag.c| 567 ++
 tools/testing/slab/slub_defrag.py   | 451 +++
 tools/testing/slab/slub_defrag_xarray.c | 211 +
 tools/vm/slabinfo.c |  51 +-
 16 files changed, 2343 insertions(+), 103 deletions(-)
 create mode 100644 tools/testing/slab/Makefile
 create mode 100644 tools/testing/slab/slub_defrag.c
 create mode 100755 tools/testing/slab/slub_defrag.py
 create mode 100644 tools/testing/slab/slub_defrag_xarray.c

-- 
2.21.0



[RFC PATCH v5 01/16] slub: Add isolate() and migrate() methods

2019-05-19 Thread Tobin C. Harding
Add the two methods needed for moving objects and enable the display of
the callbacks via the /sys/kernel/slab interface.

Add documentation explaining the use of these methods and the prototypes
for slab.h.  Add a function to set up the callbacks for a slab cache.

Add empty functions for SLAB/SLOB. The API is generic so it could be
theoretically implemented for these allocators as well.

Change sysfs 'ctor' field to be 'ops' to contain all the callback
operations defined for a slab cache.  Display the existing 'ctor'
callback in the ops fields contents along with 'isolate' and 'migrate'
callbacks.
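
For a feel of what the contract looks like from the cache owner's side,
here is a hedged sketch (not taken from this patch) of a migrate
callback for a hypothetical object type, following the semantics in the
kernel-doc below: allocate a replacement on the target node, copy the
contents, repoint any users, then free the original.  struct foo and
foo_repoint_users() are illustrative names only.

	/* Hedged sketch only; not part of this patch. */
	struct foo {
		unsigned long data;
	};

	static void foo_migrate(struct kmem_cache *s, void **objs, int nr,
				int node, void *private)
	{
		int i;

		for (i = 0; i < nr; i++) {
			struct foo *old = objs[i];
			struct foo *new = kmem_cache_alloc_node(s, GFP_KERNEL, node);

			if (!new)
				continue;	/* leave the old object in place */

			memcpy(new, old, sizeof(*new));
			foo_repoint_users(old, new);	/* hypothetical pointer fixup */
			kmem_cache_free(s, old);
		}
	}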

Co-developed-by: Christoph Lameter 
Signed-off-by: Tobin C. Harding 
---
 include/linux/slab.h | 70 
 include/linux/slub_def.h |  3 ++
 mm/slub.c| 59 +
 3 files changed, 126 insertions(+), 6 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 9449b19c5f10..886fc130334d 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -154,6 +154,76 @@ void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
 void memcg_deactivate_kmem_caches(struct mem_cgroup *);
 void memcg_destroy_kmem_caches(struct mem_cgroup *);
 
+/*
+ * Function prototypes passed to kmem_cache_setup_mobility() to enable
+ * mobile objects and targeted reclaim in slab caches.
+ */
+
+/**
+ * typedef kmem_cache_isolate_func - Object migration callback function.
+ * @s: The cache we are working on.
+ * @ptr: Pointer to an array of pointers to the objects to isolate.
+ * @nr: Number of objects in @ptr array.
+ *
+ * The purpose of kmem_cache_isolate_func() is to pin each object so that
+ * they cannot be freed until kmem_cache_migrate_func() has processed
+ * them. This may be accomplished by increasing the refcount or setting
+ * a flag.
+ *
+ * The object pointer array passed is also passed to
+ * kmem_cache_migrate_func().  The function may remove objects from the
+ * array by setting pointers to %NULL. This is useful if we can
+ * determine that an object is being freed because
+ * kmem_cache_isolate_func() was called when the subsystem was calling
+ * kmem_cache_free().  In that case it is not necessary to increase the
+ * refcount or specially mark the object because the release of the slab
+ * lock will lead to the immediate freeing of the object.
+ *
+ * Context: Called with locks held so that the slab objects cannot be
+ *  freed.  We are in an atomic context and no slab operations
+ *  may be performed.
+ * Return: A pointer that is passed to the migrate function. If any
+ * objects cannot be touched at this point then the pointer may
+ * indicate a failure and then the migration function can simply
+ * remove the references that were already obtained. The private
+ * data could be used to track the objects that were already pinned.
+ */
+typedef void *kmem_cache_isolate_func(struct kmem_cache *s, void **ptr, int nr);
+
+/**
+ * typedef kmem_cache_migrate_func - Object migration callback function.
+ * @s: The cache we are working on.
+ * @ptr: Pointer to an array of pointers to the objects to migrate.
+ * @nr: Number of objects in @ptr array.
+ * @node: The NUMA node where the object should be allocated.
+ * @private: The pointer returned by kmem_cache_isolate_func().
+ *
+ * This function is responsible for migrating objects.  Typically, for
+ * each object in the input array you will want to allocate an new
+ * object, copy the original object, update any pointers, and free the
+ * old object.
+ *
+ * After this function returns all pointers to the old object should now
+ * point to the new object.
+ *
+ * Context: Called with no locks held and interrupts enabled.  Sleeping
+ *  is possible.  Any operation may be performed.
+ */
+typedef void kmem_cache_migrate_func(struct kmem_cache *s, void **ptr,
+int nr, int node, void *private);
+
+/*
+ * kmem_cache_setup_mobility() is used to setup callbacks for a slab cache.
+ */
+#ifdef CONFIG_SLUB
+void kmem_cache_setup_mobility(struct kmem_cache *, kmem_cache_isolate_func,
+  kmem_cache_migrate_func);
+#else
+static inline void
+kmem_cache_setup_mobility(struct kmem_cache *s, kmem_cache_isolate_func isolate,
+ kmem_cache_migrate_func migrate) {}
+#endif
+
 /*
  * Please use this macro to create slab caches. Simply specify the
  * name of the structure and maybe some flags that are listed above.
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index d2153789bd9f..2879a2f5f8eb 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -99,6 +99,9 @@ struct kmem_cache {
gfp_t allocflags;   /* gfp flags to use on each alloc */
int refcount;   /* Refcount for slab cache destroy */
void (*ctor)(void *);
+   kmem_cache_isolate_func *isolate

Re: [RFC PATCH] kobject: Clean up allocated memory on failure

2019-05-16 Thread Tobin C. Harding
On Thu, May 16, 2019 at 08:40:29AM +0200, Greg Kroah-Hartman wrote:
> On Thu, May 16, 2019 at 10:07:16AM +1000, Tobin C. Harding wrote:
> > Currently kobject_add_varg() calls kobject_set_name_vargs() then returns
> > the return value of kobject_add_internal().  kobject_set_name_vargs()
> > allocates memory for the name string.  When kobject_add_varg() returns
> > an error we do not know if memory was allocated or not.  If we check the
> > return value of kobject_add_internal() instead of returning it directly
> > we can free the allocated memory if kobject_add_internal() fails.  Doing
> > this means that we now know that if kobject_add_varg() fails we do not
> > have to do any clean up, this benefit goes back up the call chain
> > meaning that we now do not need to do any cleanup if kobject_del()
> > fails.  Moving further back (in a theoretical kobject user callchain)
> > this means we now no longer need to call kobject_put() after calling
> > kobject_init_and_add(), we can just call kfree() on the enclosing
> > structure.  This makes the kobject API better follow the principle of
> > least surprise.
> > 
> > Check return value of kobject_add_internal() and free previously
> > allocated memory on failure.
> > 
> > Signed-off-by: Tobin C. Harding 
> > ---
> > 
> > Hi Greg,
> > 
> > Pretty excited by this one, if this is correct it means that kobject
> > initialisation code, in the error path, can now use either kobject_put()
> > (to trigger the release method) OR kfree().  This means most of the
> > call sites of kobject_init_and_add() will get fixed for free!
> > 
> > I've been wrong before so I'll state here that this is based on the
> > assumption that kobject_init() does nothing that causes leaked memory.
> > This is _not_ what the function docs in kobject.c say but it _is_ what
> > the code seems to say since kobject_init() does nothing except
> > initialise kobject data member values?  Or have I got the dog by the
> > tail?
> 
> I think you are correct here.  In looking at the code paths, all should
> be good and safe.
> 
> But, if you use your patch, then you have to call kfree, and you can not
> call kobject_put(), otherwise kfree_const() will be called twice on the
> same pointer, right?  So you will have to audit the kernel and change
> everything again :)

Oh my bad, I got so excited by this I read the 'if (name) {' in kobject
to be guarding the double call to kfree_const(), which clearly it doesn't.

> Or, maybe this patch would prevent that:
> 
> 
> diff --git a/lib/kobject.c b/lib/kobject.c
> index f2ccdbac8ed9..03cdec1d450a 100644
> --- a/lib/kobject.c
> +++ b/lib/kobject.c
> @@ -387,7 +387,14 @@ static __printf(3, 0) int kobject_add_varg(struct kobject *kobj,
>   return retval;
>   }
>   kobj->parent = parent;
> - return kobject_add_internal(kobj);
> +
> + retval = kobject_add_internal(kobj);
> + if (retval && !is_kernel_rodata((unsigned long)(kobj->name))) {
> + kfree_const(kobj->name);
> + kobj->name = NULL;
> + }
> +
> + return retval;
>  }
>
>  /**
> 
> 
> But that feels like a huge hack to me.

I agree, does the job but too ugly.

> I think, to be safe, we should
> keep the existing lifetime rules, as it mirrors what happens with
> 'struct device', and that is what people _should_ be using, not "raw"
> kobjects if at all possible.

Oh, I wasn't seeing this through the eyes of a driver developer, perhaps
I should have started in drivers/ not in fs/ 

> Yeah, I know filesystems don't do that, my fault, I never thought a
> filesystem would care about sysfs all those years ago :)

Tough business that, predicting the future.

Let's drop this and I'll keep plugging away.

Thanks,
Tobin.


[RFC PATCH] kobject: Clean up allocated memory on failure

2019-05-15 Thread Tobin C. Harding
Currently kobject_add_varg() calls kobject_set_name_vargs() then returns
the return value of kobject_add_internal().  kobject_set_name_vargs()
allocates memory for the name string.  When kobject_add_varg() returns
an error we do not know if memory was allocated or not.  If we check the
return value of kobject_add_internal() instead of returning it directly
we can free the allocated memory if kobject_add_internal() fails.  Doing
this means that we now know that if kobject_add_varg() fails we do not
have to do any clean up; this benefit goes back up the call chain,
meaning that we no longer need to do any cleanup if kobject_add()
fails.  Moving further back (in a theoretical kobject user callchain)
this means we no longer need to call kobject_put() after calling
kobject_init_and_add(); we can just call kfree() on the enclosing
structure.  This makes the kobject API better follow the principle of
least surprise.

Check return value of kobject_add_internal() and free previously
allocated memory on failure.

Signed-off-by: Tobin C. Harding 
---

Hi Greg,

Pretty excited by this one; if this is correct it means that kobject
initialisation code, in the error path, can now use either kobject_put()
(to trigger the release method) OR kfree().  This means most of the
call sites of kobject_init_and_add() will get fixed for free!

I've been wrong before so I'll state here that this is based on the
assumption that kobject_init() does nothing that causes leaked memory.
This is _not_ what the function docs in kobject.c say but it _is_ what
the code seems to say since kobject_init() does nothing except
initialise kobject data member values?  Or have I got the dog by the
tail?

thanks,
Tobin.

 lib/kobject.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/lib/kobject.c b/lib/kobject.c
index f2ccdbac8ed9..bb0c0d374b13 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -387,7 +387,15 @@ static __printf(3, 0) int kobject_add_varg(struct kobject *kobj,
return retval;
}
kobj->parent = parent;
-   return kobject_add_internal(kobj);
+   retval = kobject_add_internal(kobj);
+
+   if (retval) {
+   if (kobj->name)
+   kfree_const(kobj->name);
+
+   return retval;
+   }
+   return 0;
 }
 
 /**
-- 
2.21.0



Re: [PATCH 3.18 78/86] bridge: Fix error path for kobject_init_and_add()

2019-05-15 Thread Tobin C. Harding
On Wed, May 15, 2019 at 12:55:55PM +0200, Greg Kroah-Hartman wrote:
> From: "Tobin C. Harding" 
> 
> [ Upstream commit bdfad5aec1392b93495b77b864d58d7f101dc1c1 ]

Greg, you are not going to backport all of these kobject fixes, are you?
There are going to be a _lot_ of them.  I'm not super comfortable
generating all this work for you.  And besides that, I keep making
mistakes (see last night's find of a double free in powerpc that
you reviewed already); then we have to backport those too.

For the record I've been going through all uses of kobject and splitting
them into categories

 1. Broken
 2. Too complex to immediately tell
 3. Done correctly

I'm not getting many in category #3, let's hope that some in #1 and #2 are
my misunderstanding and that many in #2 should be in #3.  I'm having fun
fixing them but I shudder at making life hard for other people.

Cheers,
Tobin.


[PATCH] powerpc: Remove double free

2019-05-15 Thread Tobin C. Harding
kfree() after kobject_put().  Whoever wrote this was on crack.

Fixes: 7e8039795a80 ("powerpc/cacheinfo: Fix kobject memleak")
Signed-off-by: Tobin C. Harding 
---

FTR

git log --pretty=format:"%h%x09%an%x09%ad%x09%s" | grep 7e8039795a80
7e8039795a80Tobin C. HardingTue Apr 30 11:09:23 2019 +1000  
powerpc/cacheinfo: Fix kobject memleak

 arch/powerpc/kernel/cacheinfo.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
index f2ed3ef4b129..862e2890bd3d 100644
--- a/arch/powerpc/kernel/cacheinfo.c
+++ b/arch/powerpc/kernel/cacheinfo.c
@@ -767,7 +767,6 @@ static void cacheinfo_create_index_dir(struct cache *cache, int index,
  cache_dir->kobj, "index%d", index);
if (rc) {
kobject_put(&index_dir->kobj);
-   kfree(index_dir);
return;
}
 
-- 
2.21.0



Re: [PATCH] gfs2: Fix error path kobject memory leak

2019-05-13 Thread Tobin C. Harding
On Mon, May 13, 2019 at 09:14:05AM +0200, Greg Kroah-Hartman wrote:
> On Mon, May 13, 2019 at 01:32:13PM +1000, Tobin C. Harding wrote:
> > If a call to kobject_init_and_add() fails we must call kobject_put()
> > otherwise we leak memory.
> > 
> > Function always calls kobject_init_and_add() which always calls
> > kobject_init().
> > 
> > It is safe to leave object destruction up to the kobject release
> > function and never free it manually.
> > 
> > Remove call to kfree() and always call kobject_put() in the error path.
> > 
> > Signed-off-by: Tobin C. Harding 
> > ---
> > 
> > Is it ok to send patches during the merge window?
> > 
> > Applies on top of Linus' mainline tag: v5.1
> > 
> > Happy to rebase if there are conflicts.
> > 
> > thanks,
> > Tobin.
> > 
> >  fs/gfs2/sys.c | 7 +--
> >  1 file changed, 1 insertion(+), 6 deletions(-)
> > 
> > diff --git a/fs/gfs2/sys.c b/fs/gfs2/sys.c
> > index 1787d295834e..98586b139386 100644
> > --- a/fs/gfs2/sys.c
> > +++ b/fs/gfs2/sys.c
> > @@ -661,8 +661,6 @@ int gfs2_sys_fs_add(struct gfs2_sbd *sdp)
> > if (error)
> > goto fail_reg;
> >  
> > -   sysfs_frees_sdp = 1; /* Freeing sdp is now done by sysfs calling
> > -   function gfs2_sbd_release. */
> 
> You should also delete this variable at the top of the function, as it
> is now only set once there and never used.

Thanks, I should have gotten a compiler warning for that.  I was feeling
so confident with my builds this morning ... pays not to get too cocky
I suppose.

> With that:
> 
> Reviewed-by: Greg Kroah-Hartman 

Thanks, will re-spin.

Tobin.


[PATCH] ocfs2: Fix error path kobject memory leak

2019-05-12 Thread Tobin C. Harding
If a call to kobject_init_and_add() fails we should call kobject_put()
otherwise we leak memory.

Add call to kobject_put() in the error path of call to
kobject_init_and_add().  Please note, this has the side effect that
the release method is called if kobject_init_and_add() fails.

Signed-off-by: Tobin C. Harding 
---

Is it ok to send patches during the merge window?

Applies on top of Linus' mainline tag: v5.1

Happy to rebase if there are conflicts.

thanks,
Tobin.

 fs/ocfs2/filecheck.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ocfs2/filecheck.c b/fs/ocfs2/filecheck.c
index f65f2b2f594d..1906cc962c4d 100644
--- a/fs/ocfs2/filecheck.c
+++ b/fs/ocfs2/filecheck.c
@@ -193,6 +193,7 @@ int ocfs2_filecheck_create_sysfs(struct ocfs2_super *osb)
ret = kobject_init_and_add(&entry->fs_kobj, &ocfs2_ktype_filecheck,
NULL, "filecheck");
if (ret) {
+   kobject_put(&entry->fs_kobj);
kfree(fcheck);
return ret;
}
-- 
2.21.0



Re: kobject_init_and_add() confusion

2019-05-11 Thread Tobin C. Harding



On Fri, May 10, 2019, at 19:40, Petr Mladek wrote:
> On Fri 2019-05-10 12:35:38, Tobin C. Harding wrote:
> > On Wed, May 01, 2019 at 09:54:16AM +0200, Rafael J. Wysocki wrote:
> > > On Wed, May 1, 2019 at 1:38 AM Tobin C. Harding  wrote:
> > > > TODO
> > > > 
> > > >
> > > > - Fix all the callsites to kobject_init_and_add()
> > > > - Further clarify the function docstring for kobject_init_and_add() 
> > > > [perhaps]
> > > > - Add a section to Documentation/kobject.txt [optional]
> > > > - Add a sample usage file under samples/kobject [optional]
> > > 
> > > The plan sounds good to me, but there is one thing to note IMO:
> > > kobject_cleanup() invokes the ->release() callback for the ktype, so
> > > these callbacks need to be able to cope with kobjects after a failing
> > > kobject_add() which may not be entirely obvious to developers
> > > introducing them ATM.
> > 
> > It has taken a while for this to soak in.  This is actually quite an
> > insidious issue.  If I give an example and perhaps we can come to a
> > solution.  This example is based on the code (and assumptions) in
> > mm/slub.c
> > 
> > If a developer has an object that they wish to add to sysfs they go
> > ahead and embed a kobject in it.  Correctly set up a ktype including
> > release function that just frees the object (using container of).  Now
> > assume that the object is already set up and in use when we go to set up
> > the sysfs entry.
> 
> It would say that this is a bad design. I see the creation of the sysfs
> entry as part of the initialization. The object should not be made
> usable before it is fully initialized.

It may be a case of my lack of understanding of object lifecycles here and not
bad design.  If, as you say, creation of the sysfs entry is always part of
initialisation then the problem I describe above should not exist (and it may
well not; assumptions behind code are hard to grok).
 
> > If kobject_init_and_add() fails and we correctly call
> > kobject_put() the containing object will be free'd.  Yet the calling
> > code may not be done with the object, more to the point just because
> > sysfs setup fails the object is now unusable.  Besides the interesting
> > theoretical discussion this means we cannot just go and willy-nilly add
> > calls to kobject_put() in the error path of kobject_init_and_add() if
> > the original code was not written under the assumption that the release
> > method could be called during the error path (I have found 2 places at
> > least where behaviour of calling the release method is non-trivial to
> > ascertain).
> 
> kobject usage is complicated and it is easy to make it wrong. I think
> that this is motivation to improve the documentation and adding
> good examples.

Cool, I did work on adding your example from last week into samples/kobject but
I wasn't able to come up with anything that I was totally happy with.  A
hard-to-use API needs minimal, concise, correct examples, right?  I'm going to
keep at that as I learn more from seeing/patching current kobject code.

> > I guess, as Greg said, its just a matter that reference counting within
> > the kernel is a hard problem.  So we fix the easy ones and then look a
> > bit harder at the hard ones ...
> 
> The people working on the affected subsystem should be able to help.
> They might have misunderstood kobjects. But they should be more
> familiar with the other dependencies.

Sure thing.

> Thanks for working on it.

Things that bend one's brain are the funnest to work on ;)

Cheers,
Tobin.


Re: kobject_init_and_add() confusion

2019-05-09 Thread Tobin C. Harding
On Wed, May 01, 2019 at 09:54:16AM +0200, Rafael J. Wysocki wrote:
> On Wed, May 1, 2019 at 1:38 AM Tobin C. Harding  wrote:
> >
> > Hi,
> >
> > Looks like I've created a bit of confusion trying to fix memleaks in
> > calls to kobject_init_and_add().  Its spread over various patches and
> > mailing lists so I'm starting a new thread and CC'ing anyone that
> > commented on one of those patches.
> >
> > If there is a better way to go about this discussion please do tell me.
> >
> > The problem
> > ---
> >
> > Calls to kobject_init_and_add() are leaking memory throughout the kernel
> > because of how the error paths are handled.
> >
> > The solution
> > 
> >
> > Write the error path code correctly.
> >
> > Example
> > ---
> >
> > We have samples/kobject/kobject-example.c but it uses
> > kobject_create_and_add().  I thought of adding another example file here
> > but could not think of how to do it off the top of my head without being
> > super contrived.  Can add this to the TODO list if it will help.
> >
> > Here is an attempted canonical usage of kobject_init_and_add() typical
> > of the code that currently is getting it wrong.  This is the second time
> > I've written this and the first time it was wrong even after review (you
> > know who you are, you are definitely buying the next round of drinks :)
> >
> >
> > Assumes we have an object in memory already that has the kobject
> > embedded in it. Variable 'kobj' below would typically be >kobj
> >
> >
> > void fn(void)
> > {
> > int ret;
> >
> > ret = kobject_init_and_add(kobj, ktype, NULL, "foo");
> > if (ret) {
> > /*
> >  * This means kobject_init() has succeeded
> >  * but kobject_add() failed.
> >  */
> > goto err_put;
> > }
> >
> > ret = some_init_fn();
> > if (ret) {
> > /*
> >  * We need to wind back kobject_add() AND 
> > kobject_put().
> 
> kobject_add() and kobject_init() I suppose?
> 
> >  * kobject_add() incremented the refcount in
> >  * kobj->parent, that needs to be decremented THEN 
> > we need
> >  * the call to kobject_put() to decrement the 
> > refcount of kobj.
> >  */
> 
> So actually, if you look at kobject_cleanup(), it calls kobject_del()
> if kobj->state_in_sysfs is set.
> 
> Now, if you look at kobject_add_internal(), it sets
> kobj->state_in_sysfs when about to return 0 (success).
> 
> Therefore calling kobject_put() without the preceding kobject_del() is
> not a bug technically, even though it will trigger the "auto cleanup
> kobject_del" message with debug enabled.
> 
> > goto err_del;
> > }
> >
> > ret = some_other_init_fn();
> > if (ret)
> > goto other_err;
> >
> > kobject_uevent(kobj, KOBJ_ADD);
> > return 0;
> >
> > other_err:
> > other_clean_up_fn();
> > err_del:
> > kobject_del(kobj);
> > err_put:
> > kobject_put(kobj);
> >
> > return ret;
> > }
> >
> >
> > Have I got this correct?
> >
> > TODO
> > 
> >
> > - Fix all the callsites to kobject_init_and_add()
> > - Further clarify the function docstring for kobject_init_and_add() 
> > [perhaps]
> > - Add a section to Documentation/kobject.txt [optional]
> > - Add a sample usage file under samples/kobject [optional]
> 
> The plan sounds good to me, but there is one thing to note IMO:
> kobject_cleanup() invokes the ->release() callback for the ktype, so
> these callbacks need to be able to cope with kobjects after a failing
> kobject_add() which may not be entirely obvious to developers
> introducing them ATM.

It has taken a while for this to soak in.  This is actually quite an
insidious issue.  Let me give an example and perhaps we can come to a
solution.  This example is based on the code (and assumptions) in
mm/slub.c

If a developer has an object that they wish to add to sysfs they go
ahead and emb

Re: [RFC PATCH 3/5] kobject: Fix kernel-doc comment first line

2019-05-06 Thread Tobin C. Harding
On Fri, May 03, 2019 at 09:56:07AM +0200, Johan Hovold wrote:
> On Fri, May 03, 2019 at 11:40:15AM +1000, Tobin C. Harding wrote:
> > On Thu, May 02, 2019 at 10:39:22AM +0200, Johan Hovold wrote:
> > > On Thu, May 02, 2019 at 06:25:39PM +1000, Tobin C. Harding wrote: > 
> > > Adding Jon to CC
> > > > 
> > > > On Thu, May 02, 2019 at 09:38:23AM +0200, Johan Hovold wrote:
> > > > > On Thu, May 02, 2019 at 12:31:40PM +1000, Tobin C. Harding wrote:
> > > > > > kernel-doc comments have a prescribed format.  This includes 
> > > > > > parenthesis
> > > > > > on the function name.  To be _particularly_ correct we should also
> > > > > > capitalise the brief description and terminate it with a period.
> > > > > 
> > > > > Why do think capitalisation and full stop is required for the function
> > > > > description?
> > > > > 
> > > > > Sure, the example in the current doc happen to use that, but I'm not
> > > > > sure that's intended as a prescription.
> > > > > 
> > > > > The old kernel-doc nano-HOWTO specifically did not use this:
> > > > > 
> > > > >   
> > > > > https://www.kernel.org/doc/Documentation/kernel-doc-nano-HOWTO.txt
> > > > > 
> > > > 
> > > > Oh?  I was basing this on Documentation/doc-guide/kernel-doc.rst
> > > > 
> > > > Function documentation
> > > > --
> > > > 
> > > > The general format of a function and function-like macro 
> > > > kernel-doc comment is::
> > > > 
> > > >   /**
> > > >* function_name() - Brief description of function.
> > > >* @arg1: Describe the first argument.
> > > >* @arg2: Describe the second argument.
> > > >*One can provide multiple line descriptions
> > > >*for arguments.
> > > > 
> > > > I figured that was the canonical way to do kernel-doc function
> > > > comments.  I have however refrained from capitalising and adding the
> > > > period to argument strings to reduce code churn.  I figured if I'm
> > > > touching the line to add parenthesis then I might as well make it
> > > > perfect (if such a thing exists).
> > >
> > > I think you may have read too much into that example. Many of the
> > > current function and parameter descriptions aren't even full sentences,
> > > so sentence case and full stop doesn't really make any sense.
> > >
> > > Looks like we discussed this last fall as well:
> > 
> > Ha, this was funny.  By 'we' at first I thought you meant 'we the kernel
> > community' but you actually meant we as in 'me and you'.  Clearly you
> > failed to convince me last time :)
> > 
> > >   https://lkml.kernel.org/r/20180912093116.GC1089@localhost
> > 
> > I am totally aware this is close to code churn and any discussion is
> > bikeshedding ... for me just because loads of places don't do this it
> > still looks nicer to my eyes
> > 
> > /**
> > * sfn() - Super awesome function.
> > 
> > than
> > 
> > /**
> > */ sfn() - super awesome function
> > 
> > I most likely will keep doing these changes if I am touching the
> > kernel-doc comments for other reasons and then drop the changes if the
> > subsystem maintainer thinks its code churn.
> > 
> > I defiantly won't do theses changes in GNSS, GREYBUS, or USB SERIAL.
> 
> This isn't about any particular subsystem, but more the tendency of
> people to make up random rules and try to to force it on others. It's
> churn, and also makes things like code forensics and backports harder
> for no good reason.

Points noted.

> Both capitalisation styles are about as common for the function
> description judging from a quick grep, but only 10% or so use a full
> stop ('.'). And forcing the use of sentence case and full stop for
> things like
> 
>   /**
>* maar_init() - Initialise MAARs.
> 
> or
> 
>   * @instr: Operational instruction.
> 
> would be not just ugly, but wrong (as these are not independent
> clauses).

You are correct here.

Thanks for taking the time to flesh out your argument Johan, I am now in
agreement with you :)

Cheers,
Tobin.


Re: [RFC PATCH 3/5] kobject: Fix kernel-doc comment first line

2019-05-02 Thread Tobin C. Harding
On Thu, May 02, 2019 at 10:39:22AM +0200, Johan Hovold wrote:
> On Thu, May 02, 2019 at 06:25:39PM +1000, Tobin C. Harding wrote: > Adding 
> Jon to CC
> > 
> > On Thu, May 02, 2019 at 09:38:23AM +0200, Johan Hovold wrote:
> > > On Thu, May 02, 2019 at 12:31:40PM +1000, Tobin C. Harding wrote:
> > > > kernel-doc comments have a prescribed format.  This includes parenthesis
> > > > on the function name.  To be _particularly_ correct we should also
> > > > capitalise the brief description and terminate it with a period.
> > > 
> > > Why do think capitalisation and full stop is required for the function
> > > description?
> > > 
> > > Sure, the example in the current doc happen to use that, but I'm not
> > > sure that's intended as a prescription.
> > > 
> > > The old kernel-doc nano-HOWTO specifically did not use this:
> > > 
> > >   https://www.kernel.org/doc/Documentation/kernel-doc-nano-HOWTO.txt
> > > 
> > 
> > Oh?  I was basing this on Documentation/doc-guide/kernel-doc.rst
> > 
> > Function documentation
> > --
> > 
> > The general format of a function and function-like macro kernel-doc 
> > comment is::
> > 
> >   /**
> >* function_name() - Brief description of function.
> >* @arg1: Describe the first argument.
> >* @arg2: Describe the second argument.
> >*One can provide multiple line descriptions
> >*for arguments.
> > 
> > I figured that was the canonical way to do kernel-doc function
> > comments.  I have however refrained from capitalising and adding the
> > period to argument strings to reduce code churn.  I figured if I'm
> > touching the line to add parenthesis then I might as well make it
> > perfect (if such a thing exists).
>
> I think you may have read too much into that example. Many of the
> current function and parameter descriptions aren't even full sentences,
> so sentence case and full stop doesn't really make any sense.
>
> Looks like we discussed this last fall as well:

Ha, this was funny.  By 'we' at first I thought you meant 'we the kernel
community' but you actually meant we as in 'me and you'.  Clearly you
failed to convince me last time :)

>   https://lkml.kernel.org/r/20180912093116.GC1089@localhost

I am totally aware this is close to code churn and any discussion is
bikeshedding ... for me, just because loads of places don't do this, it
still looks nicer to my eyes

/**
* sfn() - Super awesome function.

than

/**
 * sfn() - super awesome function

I most likely will keep doing these changes if I am touching the
kernel-doc comments for other reasons and then drop the changes if the
subsystem maintainer thinks it's code churn.

I definitely won't do these changes in GNSS, GREYBUS, or USB SERIAL.

Oh, and I'm totally going to CC you now every time I flick one of these
patches, prepare to get spammed :)

Cheers,
Tobin.


Re: [PATCH] kobject: clean up the kobject add documentation a bit more

2019-05-02 Thread Tobin C. Harding
On Thu, May 02, 2019 at 12:22:24PM +0200, Greg Kroah-Hartman wrote:
> Commit 1fd7c3b438a2 ("kobject: Improve doc clarity kobject_init_and_add()")
> tried to provide more clarity, but the reference to kobject_del() was
> incorrect.  Fix that up by removing that line, and hopefully be more explicit
> as to exactly what needs to happen here once you register a kobject with the
> kobject core.
> 
> Cc: Tobin C. Harding 
> Fixes: 1fd7c3b438a2 ("kobject: Improve doc clarity kobject_init_and_add()")
> Signed-off-by: Greg Kroah-Hartman 
> 
> diff --git a/lib/kobject.c b/lib/kobject.c
> index 3f4b7e95b0c2..f2ccdbac8ed9 100644
> --- a/lib/kobject.c
> +++ b/lib/kobject.c
> @@ -416,8 +416,12 @@ static __printf(3, 0) int kobject_add_varg(struct kobject *kobj,
>   * to this function be directly freed with a call to kfree(),
>   * that can leak memory.
>   *
> - * If this call returns successfully and you later need to unwind
> - * kobject_add() for the error path you should call kobject_del().
> + * If this function returns success, kobject_put() must also be called
> + * in order to properly clean up the memory associated with the object.
> + *
> + * In short, once this function is called, kobject_put() MUST be called
> + * when the use of the object is finished in order to properly free
> + * everything.
>   */
>  int kobject_add(struct kobject *kobj, struct kobject *parent,
>   const char *fmt, ...)

Ack! (Do I get to do those :)

I'm not convinced we have the docs for kobject clear enough for a
kobject noob to read, but this patch definitely fixes the error I
introduced.

thanks,
Tobin.


Re: kobject_init_and_add() confusion

2019-05-02 Thread Tobin C. Harding
On Thu, May 02, 2019 at 10:34:12AM +0200, Petr Mladek wrote:
> On Wed 2019-05-01 09:38:03, Tobin C. Harding wrote:
> > Hi,
> > 
> > Looks like I've created a bit of confusion trying to fix memleaks in
> > calls to kobject_init_and_add().  Its spread over various patches and
> > mailing lists so I'm starting a new thread and CC'ing anyone that
> > commented on one of those patches.
> > 
> > If there is a better way to go about this discussion please do tell me.
> > 
> > The problem
> > ---
> > 
> > Calls to kobject_init_and_add() are leaking memory throughout the kernel
> > because of how the error paths are handled.
> > 
> > The solution
> > 
> > 
> > Write the error path code correctly.
> > 
> > Example
> > ---
> > 
> > We have samples/kobject/kobject-example.c but it uses
> > kobject_create_and_add().  I thought of adding another example file here
> > but could not think of how to do it off the top of my head without being
> > super contrived.  Can add this to the TODO list if it will help.
> > 
> > Here is an attempted canonical usage of kobject_init_and_add() typical
> > of the code that currently is getting it wrong.  This is the second time
> > I've written this and the first time it was wrong even after review (you
> > know who you are, you are definitely buying the next round of drinks :)
> > 
> > 
> > Assumes we have an object in memory already that has the kobject
> > embedded in it. Variable 'kobj' below would typically be >kobj
> > 
> > 
> > void fn(void)
> > {
> > int ret;
> > 
> > ret = kobject_init_and_add(kobj, ktype, NULL, "foo");
> > if (ret) {
> > /*
> >  * This means kobject_init() has succeeded
> >  * but kobject_add() failed.
> >  */
> > goto err_put;
> > }
> 
> It is strange to make the structure visible in sysfs before
> we initialize it.
> 
> > ret = some_init_fn();
> > if (ret) {
> > /*
> >  * We need to wind back kobject_add() AND kobject_put().
> >  * kobject_add() incremented the refcount in
> >  * kobj->parent, that needs to be decremented THEN we 
> > need
> >  * the call to kobject_put() to decrement the
> >  * refcount of kobj.
>*/
> > goto err_del;
> > }
> > 
> > ret = some_other_init_fn();
> > if (ret)
> > goto other_err;
> > 
> > kobject_uevent(kobj, KOBJ_ADD);
> > return 0;
> > 
> > other_err:
> > other_clean_up_fn();
> > err_del:
> > kobject_del(kobj);
> > err_put:
> > kobject_put(kobj);
> 
> IMHO, separate kobject_del() makes only sense when the sysfs
> interface must be destroyed before some other actions.
> 
> I guess that we need two examples. I currently understand
> it the following way:
> 
> 1. sysfs interface and the structure can be freed anytime:
> 
>   struct A
>   {
>   struct kobject kobj;
>   ...
>   };
> 
>   void fn(void)
>   {
>   struct A *a;
>   int ret;
> 
>   a = kzalloc(sizeof(*a), GFP_KERNEL);
>   if (!a)
>   return;
> 
>   /*
>* Initialize structure before we make it accessible via
>* sysfs.
>*/
>   ret = some_init_fn();
>   if (ret) {
>   goto init_err;
>   }
> 
>   ret = kobject_init_and_add(&a->kobj, ktype, NULL, "foo");
>   if (ret)
>   goto kobj_err;
> 
>   return 0;
> 
>   kobj_err:
>   /* kobject_init() always succeeds and takes a reference. */
>   kobject_put(kobj);
>   return ret;
> 
>   init_err:
>   /* kobject was not initialized, simple free is enough */
>   kfree(a);
>   return ret;
>   }
> 
> 
> 2. Structure must be registered into the subsystem before
>it can be made visible via sysfs:
> 
>   struct A
>   {
>  

Re: [RFC PATCH 5/5] livepatch: Do not manually track kobject initialization

2019-05-02 Thread Tobin C. Harding
On Thu, May 02, 2019 at 09:30:44AM +0200, Petr Mladek wrote:
> On Thu 2019-05-02 09:12:32, Greg Kroah-Hartman wrote:
> > On Thu, May 02, 2019 at 12:31:42PM +1000, Tobin C. Harding wrote:
> > > Currently we use custom logic to track kobject initialization.  Recently
> > > a predicate function was added to the kobject API so we now no longer
> > > need to do this.
> > > 
> > > Use kobject API to check for initialized state of kobjects instead of
> > > using custom logic to track state.
> > > 
> > > Signed-off-by: Tobin C. Harding 
> > > ---
> > >  include/linux/livepatch.h |  6 --
> > >  kernel/livepatch/core.c   | 18 +-
> > >  2 files changed, 5 insertions(+), 19 deletions(-)
> > > 
> > > @@ -626,7 +626,7 @@ static void __klp_free_objects(struct klp_patch 
> > > *patch, bool nops_only)
> > >   list_del(>node);
> > >  
> > >   /* Might be called from klp_init_patch() error path. */
> > > - if (obj->kobj_added) {
> > > + if (kobject_is_initialized(>kobj)) {
> > >   kobject_put(>kobj);
> > >   } else if (obj->dynamic) {
> > >   klp_free_object_dynamic(obj);
> > 
> > Same here, let's not be lazy.
> > 
> > The code should "know" if the kobject has been initialized or not
> > because it is the entity that asked for it to be initialized.  Don't add
> > extra logic to the kobject core (like the patch before this did) just
> > because this one subsystem wanted to only write 1 "cleanup" function.
> 
> We use kobject for a mix of statically and dynamically defined
> structures[*]. And we misunderstood the behavior of kobject_init().
> 
> Anyway, the right solution is to call kobject_init()
> already in klp_init_patch_early() for the statically
> defined structures and in klp_alloc*() for the dynamically
> allocated ones. Then we could simply call kobject_put()
> every time.
> 
> Tobin, this goes deeper into the livepatching code that
> you probably expected. Do you want to do the above
> suggested change or should I prepare the patch?

I'd love for you to handle this one Petr; I'd say it's a net gain
time-wise that way, since if I do it you'll have to review it carefully
anyway.

So that will mean patches #1 and #5 of this series are dropped and handed
off to you (thanks).  Patches #2 and #3 Greg said he will take.  Patch #4
is not needed.  That's a win in my books :)

Thanks,
Tobin.


Re: [RFC PATCH 3/5] kobject: Fix kernel-doc comment first line

2019-05-02 Thread Tobin C. Harding
Adding Jon to CC

On Thu, May 02, 2019 at 09:38:23AM +0200, Johan Hovold wrote:
> On Thu, May 02, 2019 at 12:31:40PM +1000, Tobin C. Harding wrote:
> > kernel-doc comments have a prescribed format.  This includes parenthesis
> > on the function name.  To be _particularly_ correct we should also
> > capitalise the brief description and terminate it with a period.
> 
> Why do think capitalisation and full stop is required for the function
> description?
> 
> Sure, the example in the current doc happen to use that, but I'm not
> sure that's intended as a prescription.
> 
> The old kernel-doc nano-HOWTO specifically did not use this:
> 
>   https://www.kernel.org/doc/Documentation/kernel-doc-nano-HOWTO.txt
> 

Oh?  I was basing this on Documentation/doc-guide/kernel-doc.rst

Function documentation
--

The general format of a function and function-like macro kernel-doc 
comment is::

  /**
   * function_name() - Brief description of function.
   * @arg1: Describe the first argument.
   * @arg2: Describe the second argument.
   *    One can provide multiple line descriptions
   *    for arguments.

I figured that was the canonical way to do kernel-doc function
comments.  I have, however, refrained from capitalising and adding the
period to argument strings to reduce code churn.  I figured if I'm
touching the line to add parentheses then I might as well make it
perfect (if such a thing exists).

thanks,
Tobin.


Re: memleak around kobject_init_and_add()

2019-05-02 Thread Tobin C. Harding
On Thu, May 02, 2019 at 09:28:08AM +0200, Greg Kroah-Hartman wrote:
> On Thu, May 02, 2019 at 09:17:42AM +0200, Greg Kroah-Hartman wrote:
> > On Thu, May 02, 2019 at 07:56:16AM +1000, Tobin C. Harding wrote:
> > > On Sat, Apr 27, 2019 at 09:28:09PM +0200, Greg Kroah-Hartman wrote:
> > > > On Sat, Apr 27, 2019 at 06:13:30PM +1000, Tobin C. Harding wrote:
> > > > > (Note at bottom on reasons for 'To' list 'Cc' list)
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > kobject_init_and_add() seems to be routinely misused.  A failed call 
> > > > > to this
> > > > > function requires a call to kobject_put() otherwise we leak memory.
> > > > > 
> > > > > Examples memleaks can be seen in:
> > > > > 
> > > > >   mm/slub.c
> > > > >   fs/btrfs/sysfs.c
> > > > >   fs/xfs/xfs_sysfs.h: xfs_sysfs_init()
> > > > > 
> > > > >  Question: Do we fix the misuse or fix the API?
> > > > 
> > > > Fix the misuse.
> > > > 
> > > > > $ git grep kobject_init_and_add | wc -l
> > > > > 117
> > > > > 
> > > > > Either way, we will have to go through all 117 call sites and check 
> > > > > them.
> > > > 
> > > > Yes.  Same for other functions like device_add(), that is the "pattern"
> > > > those users must follow.
> > > > 
> > > > > I
> > > > > don't mind fixing them all but I don't want to do it twice because I 
> > > > > chose the
> > > > > wrong option.  Reaching out to those more experienced for a 
> > > > > suggestion please.
> > > > > 
> > > > > Fix the API
> > > > > ---
> > > > > 
> > > > > Typically init functions do not require cleanup if they fail, this 
> > > > > argument
> > > > > leads to this patch
> > > > > 
> > > > > diff --git a/lib/kobject.c b/lib/kobject.c
> > > > > index aa89edcd2b63..62328054bbd0 100644
> > > > > --- a/lib/kobject.c
> > > > > +++ b/lib/kobject.c
> > > > > @@ -453,6 +453,9 @@ int kobject_init_and_add(struct kobject *kobj, 
> > > > > struct kobj_type *ktype,
> > > > >   retval = kobject_add_varg(kobj, parent, fmt, args);
> > > > >   va_end(args);
> > > > >  
> > > > > + if (retval)
> > > > > + kobject_put(kobj);
> > > > > +
> > > > >   return retval;
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(kobject_init_and_add);
> > > > 
> > > > I would _love_ to do this, but realize what a kobject really is.
> > > > 
> > > > It's just a "base object" that is embedded inside of some other object.
> > > > The kobject core has no idea what is going on outside of itself.  If the
> > > > kobject_init_and_add() function fails, it can NOT drop the last
> > > > reference on itself, as that would cause the memory owned by the _WHOLE_
> > > > structure the kobject is embedded in, to be freed.
> > > > 
> > > > And the kobject core can not "know" that something else needed to be
> > > > done _before_ that memory could be freed.  What if the larger structure
> > > > needs to have some other destructor called on it first?  What if
> > > > some other api initialization needs to be torn down.
> > > > 
> > > > As an example, consider this code:
> > > > 
> > > > struct foo {
> > > > struct kobject kobj;
> > > > struct baz *baz;
> > > > };
> > > > 
> > > > void foo_release(struct kobject *kobj)
> > > > {
> > > > struct foo *foo = container_of(kobj, struct foo, kobj);
> > > > kfree(foo);
> > > > }
> > > > 
> > > > struct kobj_type foo_ktype = {
> > > > .release = foo_release,
> > > > };
> > > > 
> > > > struct foo *foo_create(struct foo *parent, char *name)
> > > > {
> > > > struct *foo;
> > > > 
> > > > foo = kzalloc(sizeof(*foo), GFP_KERNEL);
> > > > if (!foo)
> > > > return NULL;
> > > > 
> > >

[RFC PATCH 0/5] kobject: Add and use init predicate

2019-05-01 Thread Tobin C. Harding
Hi,

This set patches kobject to add a predicate function for determining the
initialization state of a kobject.  Stripped down, the predicate is:

bool kobject_is_initialized(struct kobject *kobj)
{
return kobj->state_initialized
}

This is RFC because there are merge conflicts with Greg's driver-core
tree.  I'm guessing this is caused by the cleanup patches (#2 and #3).
If the set is deemed likeable then I can re-work the set targeting
whoever's tree this would go in through.

Applies on top of:

mainline tag: v5.1-rc6
livepatching branch: for-next

Series Description
------------------

Patch #1 is a memleak patch, previously posted and not overly
interesting.  Comment by Greg on the thread on that patch was the
incentive for this series.

Patch #2 and #3 are kobject kernel-doc comment clean ups.  Can be
dropped if not liked.

Patch #4 adds the predicate function to the kobject API.

Patch #5 uses the new predicate to remove the custom logic from livepatch
for tracking kobject initialization state.

Testing
-------

Kernel build configuration

$ egrep LIVEPATCH .config
CONFIG_HAVE_LIVEPATCH=y
CONFIG_LIVEPATCH=y
CONFIG_TEST_LIVEPATCH=m

$ egrep FTRACE .config
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_FTRACE_MCOUNT_RECORD=y
# CONFIG_FTRACE_STARTUP_TEST is not set

Builds fine but doesn't boot in QEMU.  I've never run dynamic Ftrace; it
appears to crash during this.  I was hoping to run the livepatch tests
but am not sure how to at this moment.  Is dynamic Ftrace and livepatch
testing something that can even be done in a VM, or do I need to do this
on bare metal?
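
I am guessing that, once I can get it to boot, the standard kselftest
invocation is the way to run them - something like the command below -
but please correct me if the livepatch tests need anything more:

  make -C tools/testing/selftests TARGETS=livepatch run_tests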

Thanks for taking the time to look at this.

Tobin


Tobin C. Harding (5):
  livepatch: Fix kobject memleak
  kobject: Remove docstring reference to kset
  kobject: Fix kernel-doc comment first line
  kobject: Add kobject initialized predicate
  livepatch: Do not manually track kobject initialization

 include/linux/kobject.h   |  2 ++
 include/linux/livepatch.h |  6 
 kernel/livepatch/core.c   | 28 +-
 lib/kobject.c | 60 +++
 4 files changed, 51 insertions(+), 45 deletions(-)

-- 
2.21.0



[RFC PATCH 1/5] livepatch: Fix kobject memleak

2019-05-01 Thread Tobin C. Harding
Currently error return from kobject_init_and_add() is not followed by a
call to kobject_put().  This means there is a memory leak.

Add call to kobject_put() in error path of kobject_init_and_add().

Signed-off-by: Tobin C. Harding 
---
 kernel/livepatch/core.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index eb0ee10a1981..98295de2172b 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -727,7 +727,9 @@ static int klp_init_func(struct klp_object *obj, struct 
klp_func *func)
ret = kobject_init_and_add(&func->kobj, &klp_ktype_func,
   &obj->kobj, "%s,%lu", func->old_name,
   func->old_sympos ? func->old_sympos : 1);
-   if (!ret)
+   if (ret)
+   kobject_put(&func->kobj);
+   else
func->kobj_added = true;
 
return ret;
@@ -803,8 +805,10 @@ static int klp_init_object(struct klp_patch *patch, struct 
klp_object *obj)
name = klp_is_module(obj) ? obj->name : "vmlinux";
ret = kobject_init_and_add(&obj->kobj, &klp_ktype_object,
   &patch->kobj, "%s", name);
-   if (ret)
+   if (ret) {
+   kobject_put(&obj->kobj);
return ret;
+   }
obj->kobj_added = true;
 
klp_for_each_func(obj, func) {
@@ -862,8 +866,10 @@ static int klp_init_patch(struct klp_patch *patch)
 
ret = kobject_init_and_add(&patch->kobj, &klp_ktype_patch,
   klp_root_kobj, "%s", patch->mod->name);
-   if (ret)
+   if (ret) {
+   kobject_put(&patch->kobj);
return ret;
+   }
patch->kobj_added = true;
 
if (patch->replace) {
-- 
2.21.0



[RFC PATCH 5/5] livepatch: Do not manually track kobject initialization

2019-05-01 Thread Tobin C. Harding
Currently we use custom logic to track kobject initialization.  Recently
a predicate function was added to the kobject API, so we no longer need
to do this.

Use kobject API to check for initialized state of kobjects instead of
using custom logic to track state.

Signed-off-by: Tobin C. Harding 
---
 include/linux/livepatch.h |  6 --
 kernel/livepatch/core.c   | 18 +-
 2 files changed, 5 insertions(+), 19 deletions(-)

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 53551f470722..955d46f37b72 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -47,7 +47,6 @@
  * @stack_node:list node for klp_ops func_stack list
  * @old_size:  size of the old function
  * @new_size:  size of the new function
- * @kobj_added: @kobj has been added and needs freeing
  * @nop:temporary patch to use the original code again; dyn. allocated
  * @patched:   the func has been added to the klp_ops list
  * @transition:the func is currently being applied or reverted
@@ -86,7 +85,6 @@ struct klp_func {
struct list_head node;
struct list_head stack_node;
unsigned long old_size, new_size;
-   bool kobj_added;
bool nop;
bool patched;
bool transition;
@@ -126,7 +124,6 @@ struct klp_callbacks {
  * @node:  list node for klp_patch obj_list
  * @mod:   kernel module associated with the patched object
  * (NULL for vmlinux)
- * @kobj_added: @kobj has been added and needs freeing
  * @dynamic:temporary object for nop functions; dynamically allocated
  * @patched:   the object's funcs have been added to the klp_ops list
  */
@@ -141,7 +138,6 @@ struct klp_object {
struct list_head func_list;
struct list_head node;
struct module *mod;
-   bool kobj_added;
bool dynamic;
bool patched;
 };
@@ -154,7 +150,6 @@ struct klp_object {
  * @list:  list node for global list of actively used patches
  * @kobj:  kobject for sysfs resources
  * @obj_list:  dynamic list of the object entries
- * @kobj_added: @kobj has been added and needs freeing
  * @enabled:   the patch is enabled (but operation may be incomplete)
  * @forced:was involved in a forced transition
  * @free_work: patch cleanup from workqueue-context
@@ -170,7 +165,6 @@ struct klp_patch {
struct list_head list;
struct kobject kobj;
struct list_head obj_list;
-   bool kobj_added;
bool enabled;
bool forced;
struct work_struct free_work;
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 98295de2172b..0b94aa5b38c9 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -590,7 +590,7 @@ static void __klp_free_funcs(struct klp_object *obj, bool 
nops_only)
list_del(>node);
 
/* Might be called from klp_init_patch() error path. */
-   if (func->kobj_added) {
+   if (kobject_is_initialized(&func->kobj)) {
kobject_put(&func->kobj);
} else if (func->nop) {
klp_free_func_nop(func);
@@ -626,7 +626,7 @@ static void __klp_free_objects(struct klp_patch *patch, 
bool nops_only)
list_del(>node);
 
/* Might be called from klp_init_patch() error path. */
-   if (obj->kobj_added) {
+   if (kobject_is_initialized(&obj->kobj)) {
kobject_put(&obj->kobj);
} else if (obj->dynamic) {
klp_free_object_dynamic(obj);
@@ -675,7 +675,7 @@ static void klp_free_patch_finish(struct klp_patch *patch)
 * this is called when the patch gets disabled and it
 * cannot get enabled again.
 */
-   if (patch->kobj_added) {
+   if (kobject_is_initialized(&patch->kobj)) {
kobject_put(&patch->kobj);
wait_for_completion(&patch->finish);
}
@@ -729,8 +729,6 @@ static int klp_init_func(struct klp_object *obj, struct 
klp_func *func)
   func->old_sympos ? func->old_sympos : 1);
if (ret)
kobject_put(&func->kobj);
-   else
-   func->kobj_added = true;
 
return ret;
 }
@@ -809,7 +807,6 @@ static int klp_init_object(struct klp_patch *patch, struct 
klp_object *obj)
kobject_put(&obj->kobj);
return ret;
}
-   obj->kobj_added = true;
 
klp_for_each_func(obj, func) {
ret = klp_init_func(obj, func);
@@ -833,7 +830,6 @@ static int klp_init_patch_early(struct klp_patch *patch)
 
INIT_LIST_HEAD(&patch->list);
INIT_LIST_HEAD(&patch->obj_list);
-   patch->kobj_added = false;
patch->enabled = false;
patch->forced = false;
INIT_WORK(&patch->free_work, klp_free_patch_work_fn);
@@ -844,13 +840,10 @@ static int klp_init_patch_early(struct klp_patch *patch)
 

[RFC PATCH 2/5] kobject: Remove docstring reference to kset

2019-05-01 Thread Tobin C. Harding
Currently the docstring for kobject_get_path() mentions 'kset'.  The
kset is not used in the function callchain starting from this function.

Remove docstring reference to kset from the function kobject_get_path().

Signed-off-by: Tobin C. Harding 
---
 lib/kobject.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/lib/kobject.c b/lib/kobject.c
index aa89edcd2b63..3eacd5b4643f 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -153,12 +153,11 @@ static void fill_kobj_path(struct kobject *kobj, char 
*path, int length)
 }
 
 /**
- * kobject_get_path - generate and return the path associated with a given 
kobj and kset pair.
- *
+ * kobject_get_path() - Allocate memory and fill in the path for @kobj.
  * @kobj:  kobject in question, with which to build the path
  * @gfp_mask:  the allocation type used to allocate the path
  *
- * The result must be freed by the caller with kfree().
+ * Return: The newly allocated memory, caller must free with kfree().
  */
 char *kobject_get_path(struct kobject *kobj, gfp_t gfp_mask)
 {
-- 
2.21.0



[RFC PATCH 4/5] kobject: Add kobject initialized predicate

2019-05-01 Thread Tobin C. Harding
A call to kobject_init() is required to be paired with a call to
kobject_put() in order to correctly free up the kobject.  During cleanup
functions it would be useful to know if a kobject was initialized in
order to correctly pair the call to kobject_put().  For example this is
necessary if we attempt to initialize multiple objects on a list and one
fails - in order to correctly do cleanup we need to know which objects
have been initialized.

Add a predicate kobject_is_initialized() to the kobject API.  This
function maintains the kobject layer of abstraction; simply returns
kobj->state_initialized.
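
As an illustration only (names are made up, this is not code from the
series), a cleanup path over a list of such objects can then do:

	list_for_each_entry(obj, &some_list, node) {
		if (kobject_is_initialized(&obj->kobj))
			kobject_put(&obj->kobj);	/* release() frees obj */
		else
			free_uninitialized_obj(obj);	/* hypothetical helper */
	}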

Signed-off-by: Tobin C. Harding 
---
 include/linux/kobject.h |  2 ++
 lib/kobject.c   | 12 
 2 files changed, 14 insertions(+)

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index 1ab0d624fb36..65a317b65d9c 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -100,6 +100,8 @@ int kobject_init_and_add(struct kobject *kobj,
 struct kobj_type *ktype, struct kobject *parent,
 const char *fmt, ...);
 
+extern bool kobject_is_initialized(struct kobject *kobj);
+
 extern void kobject_del(struct kobject *kobj);
 
 extern struct kobject * __must_check kobject_create(void);
diff --git a/lib/kobject.c b/lib/kobject.c
index 0181f102cd1c..ecddf417f452 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -366,6 +366,18 @@ void kobject_init(struct kobject *kobj, struct kobj_type 
*ktype)
 }
 EXPORT_SYMBOL(kobject_init);
 
+/**
+ * kobject_is_initialized() - Kobject initialized predicate.
+ * @kobj: The kobject to query
+ *
+ * Return: True if @kobj has been initialized.
+ */
+bool kobject_is_initialized(struct kobject *kobj)
+{
+   return kobj->state_initialized;
+}
+EXPORT_SYMBOL(kobject_is_initialized);
+
 static __printf(3, 0) int kobject_add_varg(struct kobject *kobj,
   struct kobject *parent,
   const char *fmt, va_list vargs)
-- 
2.21.0



[RFC PATCH 3/5] kobject: Fix kernel-doc comment first line

2019-05-01 Thread Tobin C. Harding
kernel-doc comments have a prescribed format.  This includes parentheses
on the function name.  To be _particularly_ correct we should also
capitalise the brief description and terminate it with a period.

In preparation for adding/updating kernel-doc function comments clean up
the ones currently present.

Signed-off-by: Tobin C. Harding 
---
 lib/kobject.c | 43 ++-
 1 file changed, 22 insertions(+), 21 deletions(-)

diff --git a/lib/kobject.c b/lib/kobject.c
index 3eacd5b4643f..0181f102cd1c 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -18,7 +18,7 @@
 #include 
 
 /**
- * kobject_namespace - return @kobj's namespace tag
+ * kobject_namespace() - Return @kobj's namespace tag.
  * @kobj: kobject in question
  *
  * Returns namespace tag of @kobj if its parent has namespace ops enabled
@@ -36,7 +36,7 @@ const void *kobject_namespace(struct kobject *kobj)
 }
 
 /**
- * kobject_get_ownership - get sysfs ownership data for @kobj
+ * kobject_get_ownership() - Get sysfs ownership data for @kobj.
  * @kobj: kobject in question
  * @uid: kernel user ID for sysfs objects
  * @gid: kernel group ID for sysfs objects
@@ -264,7 +264,7 @@ static int kobject_add_internal(struct kobject *kobj)
 }
 
 /**
- * kobject_set_name_vargs - Set the name of an kobject
+ * kobject_set_name_vargs() - Set the name of a kobject.
  * @kobj: struct kobject to set the name of
  * @fmt: format string used to build the name
  * @vargs: vargs to format the string.
@@ -304,7 +304,7 @@ int kobject_set_name_vargs(struct kobject *kobj, const char 
*fmt,
 }
 
 /**
- * kobject_set_name - Set the name of a kobject
+ * kobject_set_name() - Set the name of a kobject.
  * @kobj: struct kobject to set the name of
  * @fmt: format string used to build the name
  *
@@ -326,7 +326,7 @@ int kobject_set_name(struct kobject *kobj, const char *fmt, 
...)
 EXPORT_SYMBOL(kobject_set_name);
 
 /**
- * kobject_init - initialize a kobject structure
+ * kobject_init() - Initialize a kobject structure.
  * @kobj: pointer to the kobject to initialize
  * @ktype: pointer to the ktype for this kobject.
  *
@@ -382,7 +382,7 @@ static __printf(3, 0) int kobject_add_varg(struct kobject 
*kobj,
 }
 
 /**
- * kobject_add - the main kobject add function
+ * kobject_add() - The main kobject add function.
  * @kobj: the kobject to add
  * @parent: pointer to the parent of the kobject.
  * @fmt: format to name the kobject with.
@@ -430,7 +430,8 @@ int kobject_add(struct kobject *kobj, struct kobject 
*parent,
 EXPORT_SYMBOL(kobject_add);
 
 /**
- * kobject_init_and_add - initialize a kobject structure and add it to the 
kobject hierarchy
+ * kobject_init_and_add() - Initialize a kobject structure and add it to
+ *  the kobject hierarchy.
  * @kobj: pointer to the kobject to initialize
  * @ktype: pointer to the ktype for this kobject.
  * @parent: pointer to the parent of this kobject.
@@ -457,7 +458,7 @@ int kobject_init_and_add(struct kobject *kobj, struct 
kobj_type *ktype,
 EXPORT_SYMBOL_GPL(kobject_init_and_add);
 
 /**
- * kobject_rename - change the name of an object
+ * kobject_rename() - Change the name of an object.
  * @kobj: object in question.
  * @new_name: object's new name
  *
@@ -524,7 +525,7 @@ int kobject_rename(struct kobject *kobj, const char 
*new_name)
 EXPORT_SYMBOL_GPL(kobject_rename);
 
 /**
- * kobject_move - move object to another parent
+ * kobject_move() - Move object to another parent.
  * @kobj: object in question.
  * @new_parent: object's new parent (can be NULL)
  */
@@ -577,7 +578,7 @@ int kobject_move(struct kobject *kobj, struct kobject 
*new_parent)
 EXPORT_SYMBOL_GPL(kobject_move);
 
 /**
- * kobject_del - unlink kobject from hierarchy.
+ * kobject_del() - Unlink kobject from hierarchy.
  * @kobj: object.
  */
 void kobject_del(struct kobject *kobj)
@@ -599,7 +600,7 @@ void kobject_del(struct kobject *kobj)
 EXPORT_SYMBOL(kobject_del);
 
 /**
- * kobject_get - increment refcount for object.
+ * kobject_get() - Increment refcount for object.
  * @kobj: object.
  */
 struct kobject *kobject_get(struct kobject *kobj)
@@ -692,7 +693,7 @@ static void kobject_release(struct kref *kref)
 }
 
 /**
- * kobject_put - decrement refcount for object.
+ * kobject_put() - Decrement refcount for object.
  * @kobj: object.
  *
  * Decrement the refcount, and if 0, call kobject_cleanup().
@@ -721,7 +722,7 @@ static struct kobj_type dynamic_kobj_ktype = {
 };
 
 /**
- * kobject_create - create a struct kobject dynamically
+ * kobject_create() - Create a struct kobject dynamically.
  *
  * This function creates a kobject structure dynamically and sets it up
  * to be a "dynamic" kobject with a default release function set up.
@@ -744,8 +745,8 @@ struct kobject *kobject_create(void)
 }
 
 /**
- * kobject_create_and_add - create a struct kobject dynamically and register 
it with sysfs
- *
+ * kobject_create_and_add() - Create a struct kobject d

Re: kobject_init_and_add() confusion

2019-05-01 Thread Tobin C. Harding
On Wed, May 01, 2019 at 01:10:22PM +0200, Greg Kroah-Hartman wrote:
> On Wed, May 01, 2019 at 09:38:03AM +1000, Tobin C. Harding wrote:
> > Hi,
> > 
> > Looks like I've created a bit of confusion trying to fix memleaks in
> > calls to kobject_init_and_add().  Its spread over various patches and
> > mailing lists so I'm starting a new thread and CC'ing anyone that
> > commented on one of those patches.
> > 
> > If there is a better way to go about this discussion please do tell me.
> > 
> > The problem
> > ---
> > 
> > Calls to kobject_init_and_add() are leaking memory throughout the kernel
> > because of how the error paths are handled.
> 
> s/are leaking/have the potential to leak/
> 
> Note, no one ever hits these error paths, so it isn't a big issue, and
> is why no one has seen this except for the use of syzbot at times.

One day I'll find an important issue to fix in the kernel.  At the
moment sweeping these up is good practice/learning.  If you have any
_real_ issues that need someone to turn the crank on feel free to dump
them on me :)

> > The solution
> > 
> > 
> > Write the error path code correctly.
> > 
> > Example
> > ---
> > 
> > We have samples/kobject/kobject-example.c but it uses
> > kobject_create_and_add().  I thought of adding another example file here
> > but could not think of how to do it off the top of my head without being
> > super contrived.  Can add this to the TODO list if it will help.
> 
> You could take the example I wrote in that old email and use it, or your
> version below as well.

Responded just now to that email.

> 
> > Here is an attempted canonical usage of kobject_init_and_add() typical
> > of the code that currently is getting it wrong.  This is the second time
> > I've written this and the first time it was wrong even after review (you
> > know who you are, you are definitely buying the next round of drinks :)
> > 
> > Assumes we have an object in memory already that has the kobject
> > embedded in it. Variable 'kobj' below would typically be >kobj
> > 
> > 
> > void fn(void)
> > {
> > int ret;
> > 
> > ret = kobject_init_and_add(kobj, ktype, NULL, "foo");
> > if (ret) {
> > /*
> >  * This means kobject_init() has succeeded
> 
> kobject_init() can not fail except in fun ways that dumps the stack and
> then keeps on going due to the failure being on the caller, not the
> kobject code itself.

Cheers, writing good documentation is HARD.

> >  * but kobject_add() failed.
> >  */
> > goto err_put;
> > }
> > 
> > ret = some_init_fn();
> > if (ret) {
> > /*
> >  * We need to wind back kobject_add() AND kobject_put().
> >  * kobject_add() incremented the refcount in
> >  * kobj->parent, that needs to be decremented THEN we 
> > need
> >  * the call to kobject_put() to decrement the refcount 
> > of kobj.
> >  */
> > goto err_del;
> > }
> > 
> > ret = some_other_init_fn();
> > if (ret)
> > goto other_err;
> > 
> > kobject_uevent(kobj, KOBJ_ADD);
> > return 0;
> > 
> > other_err:
> > other_clean_up_fn();
> > err_del:
> > kobject_del(kobj);
> > err_put:
> > kobject_put(kobj);
> > 
> > return ret;
> > }
> > 
> > 
> > Have I got this correct?
> 
> From what I can tell, yes.

:)

> > TODO
> > 
> > 
> > - Fix all the callsites to kobject_init_and_add()
> > - Further clarify the function docstring for kobject_init_and_add() 
> > [perhaps]
> 
> More documentation, sure!
> 
> > - Add a section to Documentation/kobject.txt [optional]
> 
> That file should probably be reviewed and converted to .rst, I haven't
> looked at it in years.

On my TODO list once I get kobject usage clear in my head.

> > - Add a sample usage file under samples/kobject [optional]
> 
> Would be a good idea, so we can point people at it.

I'll combine your other email example with the extra init/error stuff
from this one and BOOM!

Thanks Greg.

Tobin


Re: memleak around kobject_init_and_add()

2019-05-01 Thread Tobin C. Harding
On Sat, Apr 27, 2019 at 09:28:09PM +0200, Greg Kroah-Hartman wrote:
> On Sat, Apr 27, 2019 at 06:13:30PM +1000, Tobin C. Harding wrote:
> > (Note at bottom on reasons for 'To' list 'Cc' list)
> > 
> > Hi,
> > 
> > kobject_init_and_add() seems to be routinely misused.  A failed call to this
> > function requires a call to kobject_put() otherwise we leak memory.
> > 
> > Examples memleaks can be seen in:
> > 
> > mm/slub.c
> > fs/btrfs/sysfs.c
> > fs/xfs/xfs_sysfs.h: xfs_sysfs_init()
> > 
> >  Question: Do we fix the misuse or fix the API?
> 
> Fix the misuse.
> 
> > $ git grep kobject_init_and_add | wc -l
> > 117
> > 
> > Either way, we will have to go through all 117 call sites and check them.
> 
> Yes.  Same for other functions like device_add(), that is the "pattern"
> those users must follow.
> 
> > I
> > don't mind fixing them all but I don't want to do it twice because I chose 
> > the
> > wrong option.  Reaching out to those more experienced for a suggestion 
> > please.
> > 
> > Fix the API
> > ---
> > 
> > Typically init functions do not require cleanup if they fail, this argument
> > leads to this patch
> > 
> > diff --git a/lib/kobject.c b/lib/kobject.c
> > index aa89edcd2b63..62328054bbd0 100644
> > --- a/lib/kobject.c
> > +++ b/lib/kobject.c
> > @@ -453,6 +453,9 @@ int kobject_init_and_add(struct kobject *kobj, struct 
> > kobj_type *ktype,
> > retval = kobject_add_varg(kobj, parent, fmt, args);
> > va_end(args);
> >  
> > +   if (retval)
> > +   kobject_put(kobj);
> > +
> > return retval;
> >  }
> >  EXPORT_SYMBOL_GPL(kobject_init_and_add);
> 
> I would _love_ to do this, but realize what a kobject really is.
> 
> It's just a "base object" that is embedded inside of some other object.
> The kobject core has no idea what is going on outside of itself.  If the
> kobject_init_and_add() function fails, it can NOT drop the last
> reference on itself, as that would cause the memory owned by the _WHOLE_
> structure the kobject is embedded in, to be freed.
> 
> And the kobject core can not "know" that something else needed to be
> done _before_ that memory could be freed.  What if the larger structure
> needs to have some other destructor called on it first?  What if
> some other api initialization needs to be torn down.
> 
> As an example, consider this code:
> 
> struct foo {
>   struct kobject kobj;
>   struct baz *baz;
> };
> 
> void foo_release(struct kobject *kobj)
> {
>   struct foo *foo = container_of(kobj, struct foo, kobj);
>   kfree(foo);
> }
> 
> struct kobj_type foo_ktype = {
>   .release = foo_release,
> };
> 
> struct foo *foo_create(struct foo *parent, char *name)
> {
>   struct *foo;
> 
>   foo = kzalloc(sizeof(*foo), GFP_KERNEL);
>   if (!foo)
>   return NULL;
> 
>   foo->baz = baz_create(name);
>   if (!foo->baz)
>   return NULL;
> 
>   ret = kobject_init_and_add(&foo->kobj, foo_ktype, &parent->kobj, 
> "foo-%s", name);
>   if (ret) {
>   baz_destroy(foo->baz);
>   kobject_put(&foo->kobj);
>   return NULL;
>   }
> 
>   return foo;
> }
> 
> void foo_destroy(struct foo *foo)
> {
>   baz_destroy(foo->baz);
>   kobject_del(&foo->kobj);
kobject_put(&foo->kobj);
> }

Does this need this extra call to kobject_put()?  Then foo_create()
leaves foo with a refcount of 1 and foo_destroy drops that refcount.
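
To spell out my reading of the refcounts (a sketch only, please correct
me if I have it wrong):

	foo_create():
		kobject_init_and_add()		/* foo->kobj count == 1, parent gets +1 */

	foo_destroy():
		baz_destroy(foo->baz);
		kobject_del(&foo->kobj);	/* removes sysfs entry, puts the parent */
		kobject_put(&foo->kobj);	/* count 1 -> 0, foo_release() frees foo */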

Thanks for taking the time to explain this stuff.

thanks
Tobin.


Leaving below for reference.

> Now if kobject_init_and_add() had failed, and called kobject_put() right
> away, that would have freed the larger "struct foo", but not cleaned up
> the reference to the baz pointer.
> 
> Yes, you can move all of the other destruction logic into the release
> function, to then get rid of baz, but that really doesn't work in the
> real world as there are times you want to drop that when you "know" you
> can drop it, not when the last reference goes away as those are
> different lifecycles.
> 
> Same thing goes for 'struct device'.  It too is a kobject, so think
> about if the driver core's call to initialize the kobject failed, would
> it be ok at that exact moment in time to free everything?
> 
> Look at the "joy" that is device_add().  If kobject_add() fails, we have
> to clean up the glue directory that we had created, _before_ we can then
> call put_device().  Then stack another layer on top of that, look at
> usb_new_device().  If the call to device_add() fails, it needs to do
> some housekeeping before it can drop the last reference to the device to
> free the memory up.


Re: kobject_init_and_add() confusion

2019-05-01 Thread Tobin C. Harding
On Wed, May 01, 2019 at 09:54:16AM +0200, Rafael J. Wysocki wrote:
> On Wed, May 1, 2019 at 1:38 AM Tobin C. Harding  wrote:
> >
> > Hi,
> >
> > Looks like I've created a bit of confusion trying to fix memleaks in
> > calls to kobject_init_and_add().  Its spread over various patches and
> > mailing lists so I'm starting a new thread and CC'ing anyone that
> > commented on one of those patches.
> >
> > If there is a better way to go about this discussion please do tell me.
> >
> > The problem
> > ---
> >
> > Calls to kobject_init_and_add() are leaking memory throughout the kernel
> > because of how the error paths are handled.
> >
> > The solution
> > 
> >
> > Write the error path code correctly.
> >
> > Example
> > ---
> >
> > We have samples/kobject/kobject-example.c but it uses
> > kobject_create_and_add().  I thought of adding another example file here
> > but could not think of how to do it off the top of my head without being
> > super contrived.  Can add this to the TODO list if it will help.
> >
> > Here is an attempted canonical usage of kobject_init_and_add() typical
> > of the code that currently is getting it wrong.  This is the second time
> > I've written this and the first time it was wrong even after review (you
> > know who you are, you are definitely buying the next round of drinks :)
> >
> >
> > Assumes we have an object in memory already that has the kobject
> > embedded in it. Variable 'kobj' below would typically be >kobj
> >
> >
> > void fn(void)
> > {
> > int ret;
> >
> > ret = kobject_init_and_add(kobj, ktype, NULL, "foo");
> > if (ret) {
> > /*
> >  * This means kobject_init() has succeeded
> >  * but kobject_add() failed.
> >  */
> > goto err_put;
> > }
> >
> > ret = some_init_fn();
> > if (ret) {
> > /*
> >  * We need to wind back kobject_add() AND 
> > kobject_put().
> 
> kobject_add() and kobject_init() I suppose?

You are correct, my mistake.

> >  * kobject_add() incremented the refcount in
> >  * kobj->parent, that needs to be decremented THEN 
> > we need
> >  * the call to kobject_put() to decrement the 
> > refcount of kobj.
> >  */
> 
> So actually, if you look at kobject_cleanup(), it calls kobject_del()
> if kobj->state_in_sysfs is set.
> 
> Now, if you look at kobject_add_internal(), it sets
> kobj->state_in_sysfs when about to return 0 (success).
> 
> Therefore calling kobject_put() without the preceding kobject_del() is
> not a bug technically, even though it will trigger the "auto cleanup
> kobject_del" message with debug enabled.

Thanks for this explanation.  Points noted.
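
For my own notes, the relevant logic is roughly this (paraphrasing
lib/kobject.c from memory, not verbatim):

	static void kobject_cleanup(struct kobject *kobj)
	{
		...
		/* remove from sysfs if the caller did not already do it */
		if (kobj->state_in_sysfs) {
			pr_debug("kobject: '%s': auto cleanup kobject_del\n",
				 kobject_name(kobj));
			kobject_del(kobj);
		}
		...
	}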

> 
> > goto err_del;
> > }
> >
> > ret = some_other_init_fn();
> > if (ret)
> > goto other_err;
> >
> > kobject_uevent(kobj, KOBJ_ADD);
> > return 0;
> >
> > other_err:
> > other_clean_up_fn();
> > err_del:
> > kobject_del(kobj);
> > err_put:
> > kobject_put(kobj);
> >
> > return ret;
> > }
> >
> >
> > Have I got this correct?
> >
> > TODO
> > 
> >
> > - Fix all the callsites to kobject_init_and_add()
> > - Further clarify the function docstring for kobject_init_and_add() 
> > [perhaps]
> > - Add a section to Documentation/kobject.txt [optional]
> > - Add a sample usage file under samples/kobject [optional]
> 
> The plan sounds good to me, but there is one thing to note IMO:
> kobject_cleanup() invokes the ->release() callback for the ktype, so
> these callbacks need to be able to cope with kobjects after a failing
> kobject_add() which may not be entirely obvious to developers
> introducing them ATM.

During docs fixes I'll try to work this in.

Thanks,
Tobin.


kobject_init_and_add() confusion

2019-04-30 Thread Tobin C. Harding
Hi,

Looks like I've created a bit of confusion trying to fix memleaks in
calls to kobject_init_and_add().  It's spread over various patches and
mailing lists so I'm starting a new thread and CC'ing anyone that
commented on one of those patches.

If there is a better way to go about this discussion please do tell me.

The problem
-----------

Calls to kobject_init_and_add() are leaking memory throughout the kernel
because of how the error paths are handled.

The solution
------------

Write the error path code correctly.

Example
-------

We have samples/kobject/kobject-example.c but it uses
kobject_create_and_add().  I thought of adding another example file here
but could not think of how to do it off the top of my head without being
super contrived.  Can add this to the TODO list if it will help.

Here is an attempted canonical usage of kobject_init_and_add() typical
of the code that currently is getting it wrong.  This is the second time
I've written this and the first time it was wrong even after review (you
know who you are, you are definitely buying the next round of drinks :)


Assumes we have an object in memory already that has the kobject
embedded in it. Variable 'kobj' below would typically be >kobj


void fn(void)
{
int ret;

ret = kobject_init_and_add(kobj, ktype, NULL, "foo");
if (ret) {
/*
 * This means kobject_init() has succeeded
 * but kobject_add() failed.
 */
goto err_put;
}

ret = some_init_fn();
if (ret) {
/*
 * We need to wind back kobject_add() AND kobject_put().
 * kobject_add() incremented the refcount in
 * kobj->parent, that needs to be decremented THEN we 
need
 * the call to kobject_put() to decrement the refcount 
of kobj.
 */
goto err_del;
}

ret = some_other_init_fn();
if (ret)
goto other_err;

kobject_uevent(kobj, KOBJ_ADD);
return 0;

other_err:
other_clean_up_fn();
err_del:
kobject_del(kobj);
err_put:
kobject_put(kobj);

return ret;
}


Have I got this correct?

TODO
----

- Fix all the callsites to kobject_init_and_add()
- Further clarify the function docstring for kobject_init_and_add() [perhaps]
- Add a section to Documentation/kobject.txt [optional]
- Add a sample usage file under samples/kobject [optional]


Thanks,
Tobin.


Re: [PATCH] mm: Fix kobject memleak in SLUB

2019-04-30 Thread Tobin C. Harding
On Sun, Apr 28, 2019 at 09:40:00AM +1000, Tobin C. Harding wrote:
> Currently error return from kobject_init_and_add() is not followed by a
> call to kobject_put().  This means there is a memory leak.
> 
> Add call to kobject_put() in error path of kobject_init_and_add().
> 
> Signed-off-by: Tobin C. Harding 
> ---
>  mm/slub.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index d30ede89f4a6..84a9d6c06c27 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5756,8 +5756,10 @@ static int sysfs_slab_add(struct kmem_cache *s)
>  
>   s->kobj.kset = kset;
>   err = kobject_init_and_add(&s->kobj, &slab_ktype, NULL, "%s", name);
> - if (err)
> + if (err) {
> + kobject_put(&s->kobj);
>   goto out;
> + }
>  
>   err = sysfs_create_group(&s->kobj, &slab_attr_group);
>   if (err)
> -- 
> 2.21.0
> 

This patch is not _completely_ correct.  Please do not consider for
merge.  There are a bunch of these on various LKML lists, once the
confusion has cleared I'll re-spin v2.

thanks,
Tobin.


Re: [PATCH 1/2] livepatch: Fix kobject memleak

2019-04-30 Thread Tobin C. Harding
On Tue, Apr 30, 2019 at 12:44:55PM +0200, Miroslav Benes wrote:
> On Tue, 30 Apr 2019, Greg Kroah-Hartman wrote:
> 
> > On Tue, Apr 30, 2019 at 10:15:33AM +1000, Tobin C. Harding wrote:
> > > Currently error return from kobject_init_and_add() is not followed by a
> > > call to kobject_put().  This means there is a memory leak.
> > > 
> > > Add call to kobject_put() in error path of kobject_init_and_add().
> > > 
> > > Signed-off-by: Tobin C. Harding 
> > 
> > Reviewed-by: Greg Kroah-Hartman 
> 
> Well, it does not even compile...

My apologies, I did compile this but obviously I don't know how to
configure the kernel.

Thanks for the review.

Tobin


Re: [PATCH 1/3] bridge: Fix error path for kobject_init_and_add()

2019-04-30 Thread Tobin C. Harding
On Tue, Apr 30, 2019 at 10:28:15AM +1000, Tobin C. Harding wrote:

[snip]

Please do not consider this series for merge.  There is a bit of
confusion here.

There are a few of these patches live on various LKML lists.  Have to
consolidate all the knowledge.  When I _actually_ know how to use
kobject correctly I'll re-spin.

Thanks for your patience.

Tobin


Re: [PATCH 2/2] livepatch: Use correct kobject cleanup function

2019-04-30 Thread Tobin C. Harding
On Tue, Apr 30, 2019 at 01:00:05PM +0200, Miroslav Benes wrote:
> On Tue, 30 Apr 2019, Tobin C. Harding wrote:
> 
> > The correct cleanup function after a call to kobject_init_and_add() has
> > succeeded is kobject_del() _not_ kobject_put().  kobject_del() calls
> > kobject_put().
> > 
> > Use correct cleanup function when removing a kobject.
> > 
> > Signed-off-by: Tobin C. Harding 
> > ---
> >  kernel/livepatch/core.c | 8 +++-
> >  1 file changed, 3 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> > index 98a7bec41faa..4cce6bb6e073 100644
> > --- a/kernel/livepatch/core.c
> > +++ b/kernel/livepatch/core.c
> > @@ -589,9 +589,8 @@ static void __klp_free_funcs(struct klp_object *obj, 
> > bool nops_only)
> >  
> > list_del(>node);
> >  
> > -   /* Might be called from klp_init_patch() error path. */
> 
> Could you leave the comment as is? If I am not mistaken, it is still 
> valid. func->kobj_added check is here exactly because the function may be 
> called as mentioned.

Will put it back in for you on v2

thanks,
Tobin.


Re: [PATCH 2/2] livepatch: Use correct kobject cleanup function

2019-04-30 Thread Tobin C. Harding
On Tue, Apr 30, 2019 at 05:08:11PM +0200, Petr Mladek wrote:
> On Tue 2019-04-30 10:15:34, Tobin C. Harding wrote:
> > The correct cleanup function after a call to kobject_init_and_add() has
> > succeeded is kobject_del() _not_ kobject_put().  kobject_del() calls
> > kobject_put().
> 
> Really? I see only kobject_put(kobj->parent) in kobject_del.
> It decreases a reference of the _parent_ object and not
> the given one.

Thanks Petr, you are right.  I misread kobject_del().  The story
thickens, so we need to call kobject_del() AND kobject_put().

> Also the section "Kobject removal" in Documentation/kobject.txt
> says that kobject_del() is for two-stage removal. kobject_put()
> still needs to get called at a later time.

Is this call sequence above what is meant by 'two-stage removal'?  I
didn't really understand that bit of the docs (and I almost always just
assume docs are stale and take them as a hint only :)
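
For my own notes, I currently read the two-stage pattern as:

	/* Stage 1: unlink from sysfs while the object is still fully valid. */
	kobject_del(kobj);

	/* ... any other teardown that must happen before the memory can go ... */

	/* Stage 2: drop the reference; ->release() frees the memory. */
	kobject_put(kobj);

(sketch only - happy to be corrected).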

> IMHO, this patch causes that kobject_put() would never get called.

I'll do a v2 of this one and re-check all the patches on this I've
already sent (including the docs ones).

> That said, we could probably make the removal a bit cleaner
> by using kobject_del() in klp_free_patch_start() and
> kobject_put() in klp_free_patch_finish(). But I have
> to think more about it.
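
If I follow, that would look roughly like this (sketch only, reusing the
existing function names):

	klp_free_patch_start():
		...
		kobject_del(&patch->kobj);

	klp_free_patch_finish():
		...
		kobject_put(&patch->kobj);
		wait_for_completion(&patch->finish);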

Noted, thanks for your review.

Tobin



Re: [PATCH] cpufreq: Fix kobject memleak

2019-04-30 Thread Tobin C. Harding
On Tue, Apr 30, 2019 at 11:35:52AM +0530, Viresh Kumar wrote:
> Currently the error return path from kobject_init_and_add() is not
> followed by a call to kobject_put() - which means we are leaking the
> kobject.
> 
> Fix it by adding a call to kobject_put() in the error path of
> kobject_init_and_add().
> 
> Signed-off-by: Viresh Kumar 
> ---
> Tobin fixed this for schedutil already.

For what its worth:

 Reviewed-by: Tobin C. Harding 

Thanks Viresh, one less for me to do!

Tobin


Re: [tip:sched/urgent] sched/cpufreq: Fix kobject memleak

2019-04-30 Thread Tobin C. Harding
On Tue, Apr 30, 2019 at 11:26:27AM +0530, Viresh Kumar wrote:
> On 29-04-19, 22:52, tip-bot for Tobin C. Harding wrote:
> > Commit-ID:  8bf7ab9c79f3d1a5f02ebac369f656de9ec0aca8
> > Gitweb: 
> > https://git.kernel.org/tip/8bf7ab9c79f3d1a5f02ebac369f656de9ec0aca8
> > Author: Tobin C. Harding 
> > AuthorDate: Tue, 30 Apr 2019 10:11:44 +1000
> > Committer:  Ingo Molnar 
> > CommitDate: Tue, 30 Apr 2019 06:24:09 +0200
> > 
> > sched/cpufreq: Fix kobject memleak
> > 
> > Currently the error return path from kobject_init_and_add() is not
> > followed by a call to kobject_put() - which means we are leaking
> > the kobject.
> > 
> > Fix it by adding a call to kobject_put() in the error path of
> > kobject_init_and_add().
> > 
> > Signed-off-by: Tobin C. Harding 
> > Add call to kobject_put() in error path of kobject_init_and_add().
> 
> This should have been present before the signed-off ?

Thanks.  Some face-palm fails on this patch.  It's hard to get good help
:)

Tobin


[tip:sched/urgent] sched/cpufreq: Fix kobject memleak

2019-04-30 Thread tip-bot for Tobin C. Harding
Commit-ID:  9a4f26cc98d81b67ecc23b890c28e2df324e29f3
Gitweb: https://git.kernel.org/tip/9a4f26cc98d81b67ecc23b890c28e2df324e29f3
Author: Tobin C. Harding 
AuthorDate: Tue, 30 Apr 2019 10:11:44 +1000
Committer:  Ingo Molnar 
CommitDate: Tue, 30 Apr 2019 07:57:23 +0200

sched/cpufreq: Fix kobject memleak

Currently the error return path from kobject_init_and_add() is not
followed by a call to kobject_put() - which means we are leaking
the kobject.

Fix it by adding a call to kobject_put() in the error path of
kobject_init_and_add().

Signed-off-by: Tobin C. Harding 
Cc: Greg Kroah-Hartman 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Rafael J. Wysocki 
Cc: Thomas Gleixner 
Cc: Tobin C. Harding 
Cc: Vincent Guittot 
Cc: Viresh Kumar 
Link: http://lkml.kernel.org/r/20190430001144.24890-1-to...@kernel.org
Signed-off-by: Ingo Molnar 
---
 kernel/sched/cpufreq_schedutil.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 5c41ea367422..3638d2377e3c 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -771,6 +771,7 @@ out:
return 0;
 
 fail:
+   kobject_put(&tunables->attr_set.kobj);
policy->governor_data = NULL;
sugov_tunables_free(tunables);
 


[tip:sched/urgent] sched/cpufreq: Fix kobject memleak

2019-04-29 Thread tip-bot for Tobin C. Harding
Commit-ID:  8bf7ab9c79f3d1a5f02ebac369f656de9ec0aca8
Gitweb: https://git.kernel.org/tip/8bf7ab9c79f3d1a5f02ebac369f656de9ec0aca8
Author: Tobin C. Harding 
AuthorDate: Tue, 30 Apr 2019 10:11:44 +1000
Committer:  Ingo Molnar 
CommitDate: Tue, 30 Apr 2019 06:24:09 +0200

sched/cpufreq: Fix kobject memleak

Currently the error return path from kobject_init_and_add() is not
followed by a call to kobject_put() - which means we are leaking
the kobject.

Fix it by adding a call to kobject_put() in the error path of
kobject_init_and_add().

Signed-off-by: Tobin C. Harding 
Add call to kobject_put() in error path of kobject_init_and_add().
Cc: Greg Kroah-Hartman 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Rafael J. Wysocki 
Cc: Thomas Gleixner 
Cc: Tobin C. Harding 
Cc: Vincent Guittot 
Cc: Viresh Kumar 
Link: http://lkml.kernel.org/r/20190430001144.24890-1-to...@kernel.org
Signed-off-by: Ingo Molnar 
---
 kernel/sched/cpufreq_schedutil.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 5c41ea367422..3638d2377e3c 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -771,6 +771,7 @@ out:
return 0;
 
 fail:
+   kobject_put(&tunables->attr_set.kobj);
policy->governor_data = NULL;
sugov_tunables_free(tunables);
 


Re: [PATCH RESEND] sched/cpufreq: Fix kobject memleak

2019-04-29 Thread Tobin C. Harding
On Tue, Apr 30, 2019 at 06:24:43AM +0200, Ingo Molnar wrote:
> 
> * Tobin C. Harding  wrote:
> 
> > Currently error return from kobject_init_and_add() is not followed by a
> > call to kobject_put().  This means there is a memory leak.
> > 
> > Add call to kobject_put() in error path of kobject_init_and_add().
> > 
> > Signed-off-by: Tobin C. Harding 
> > ---
> > 
> > Resend with SOB tag.
> 
> Please ignore my previous mail :-)

Cheers Ingo, caught myself not checkpatching :(

thanks,
Tobin.



[RFC PATCH v4 15/15] dcache: Add CONFIG_DCACHE_SMO

2019-04-29 Thread Tobin C. Harding
In an attempt to make the SMO patchset as non-invasive as possible, add a
config option CONFIG_DCACHE_SMO (under "Memory Management options") for
enabling SMO for the dcache.  Without this option the dcache constructor
is used but no other code is built in; with this option enabled, slab
mobility is enabled and the isolate/migrate functions are built in.

Add CONFIG_DCACHE_SMO to guard the partial shrinking of the dcache via
Slab Movable Objects infrastructure.

Signed-off-by: Tobin C. Harding 
---
 fs/dcache.c | 4 
 mm/Kconfig  | 7 +++
 2 files changed, 11 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 3f9daba1cc78..9edce104613b 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3068,6 +3068,7 @@ void d_tmpfile(struct dentry *dentry, struct inode *inode)
 }
 EXPORT_SYMBOL(d_tmpfile);
 
+#ifdef CONFIG_DCACHE_SMO
 /*
  * d_isolate() - Dentry isolation callback function.
  * @s: The dentry cache.
@@ -3140,6 +3141,7 @@ static void d_partial_shrink(struct kmem_cache *s, void 
**_unused, int __unused,
 
kfree(private);
 }
+#endif /* CONFIG_DCACHE_SMO */
 
 static __initdata unsigned long dhash_entries;
 static int __init set_dhash_entries(char *str)
@@ -3186,7 +3188,9 @@ static void __init dcache_init(void)
   sizeof_field(struct dentry, d_iname),
   dcache_ctor);
 
+#ifdef CONFIG_DCACHE_SMO
kmem_cache_setup_mobility(dentry_cache, d_isolate, d_partial_shrink);
+#endif
 
/* Hash may have been set up in dcache_init_early */
if (!hashdist)
diff --git a/mm/Kconfig b/mm/Kconfig
index 47040d939f3b..92fc27ad3472 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -265,6 +265,13 @@ config SMO_NODE
help
  On NUMA systems enable moving objects to and from a specified node.
 
+config DCACHE_SMO
+   bool "Enable Slab Movable Objects for the dcache"
+   depends on SLUB
+   help
+ Under memory pressure we can try to free dentry slab cache objects 
from
+ the partial slab list if this is enabled.
+
 config PHYS_ADDR_T_64BIT
def_bool 64BIT
 
-- 
2.21.0



[RFC PATCH v4 13/15] dcache: Provide a dentry constructor

2019-04-29 Thread Tobin C. Harding
In order to support object migration on the dentry cache we need to have
a determined object state at all times. Without a constructor the object
would have a random state after allocation.

Provide a dentry constructor.

Signed-off-by: Tobin C. Harding 
---
 fs/dcache.c | 30 +-
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index aac41adf4743..3d6cc06eca56 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1603,6 +1603,16 @@ void d_invalidate(struct dentry *dentry)
 }
 EXPORT_SYMBOL(d_invalidate);
 
+static void dcache_ctor(void *p)
+{
+   struct dentry *dentry = p;
+
+   /* Mimic lockref_mark_dead() */
+   dentry->d_lockref.count = -128;
+
+   spin_lock_init(&dentry->d_lock);
+}
+
 /**
  * __d_alloc   -   allocate a dcache entry
  * @sb: filesystem it will belong to
@@ -1658,7 +1668,6 @@ struct dentry *__d_alloc(struct super_block *sb, const 
struct qstr *name)
 
dentry->d_lockref.count = 1;
dentry->d_flags = 0;
-   spin_lock_init(&dentry->d_lock);
seqcount_init(&dentry->d_seq);
dentry->d_inode = NULL;
dentry->d_parent = dentry;
@@ -3091,14 +3100,17 @@ static void __init dcache_init_early(void)
 
 static void __init dcache_init(void)
 {
-   /*
-* A constructor could be added for stable state like the lists,
-* but it is probably not worth it because of the cache nature
-* of the dcache.
-*/
-   dentry_cache = KMEM_CACHE_USERCOPY(dentry,
-   SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|SLAB_ACCOUNT,
-   d_iname);
+   slab_flags_t flags =
+   SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | SLAB_MEM_SPREAD | 
SLAB_ACCOUNT;
+
+   dentry_cache =
+   kmem_cache_create_usercopy("dentry",
+  sizeof(struct dentry),
+  __alignof__(struct dentry),
+  flags,
+  offsetof(struct dentry, d_iname),
+  sizeof_field(struct dentry, d_iname),
+  dcache_ctor);
 
/* Hash may have been set up in dcache_init_early */
if (!hashdist)
-- 
2.21.0



[RFC PATCH v4 11/15] slub: Enable moving objects to/from specific nodes

2019-04-29 Thread Tobin C. Harding
We have just implemented Slab Movable Objects (object migration).
Currently object migration is used to defrag a cache.  On NUMA systems
it would be nice to be able to control the source and destination nodes
when moving objects.

Add CONFIG_SMO_NODE to guard this feature.  CONFIG_SMO_NODE depends on
CONFIG_SLUB_DEBUG because we use the full list.  Leave it like this for
the RFC because the patch will be less cluttered to review; separate the
full list out of CONFIG_SLUB_DEBUG before doing a PATCH version.

Implement moving all objects (including those in full slabs) to a
specific node.  Expose this functionality to userspace via a sysfs entry.

Add sysfs entry:

   /sysfs/kernel/slab/<cache>/move

With this users get access to the following functionality:

 - Move all objects to specified node.

echo "N1" > move

 - Move all objects from specified node to other specified
   node (from N1 -> to N2):

echo "N1 N2" > move

This also enables shrinking slabs on a specific node:

echo "N1 N1" > move

Signed-off-by: Tobin C. Harding 
---
 mm/Kconfig |   7 ++
 mm/slub.c  | 249 +
 2 files changed, 256 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 25c71eb8a7db..47040d939f3b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -258,6 +258,13 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
 config ARCH_ENABLE_THP_MIGRATION
bool
 
+config SMO_NODE
+   bool "Enable per node control of Slab Movable Objects"
+   depends on SLUB && SYSFS
+   select SLUB_DEBUG
+   help
+ On NUMA systems enable moving objects to and from a specified node.
+
 config PHYS_ADDR_T_64BIT
def_bool 64BIT
 
diff --git a/mm/slub.c b/mm/slub.c
index e601c804ed79..e4f3dde443f5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4345,6 +4345,106 @@ static void move_slab_page(struct page *page, void 
*scratch, int node)
s->migrate(s, vector, count, node, private);
 }
 
+#ifdef CONFIG_SMO_NODE
+/*
+ * kmem_cache_move() - Attempt to move all slab objects.
+ * @s: The cache we are working on.
+ * @node: The node to move objects away from.
+ * @target_node: The node to move objects on to.
+ *
+ * Attempts to move all objects (partial slabs and full slabs) to target
+ * node.
+ *
+ * Context: Takes the list_lock.
+ * Return: The number of slabs remaining on node.
+ */
+static unsigned long kmem_cache_move(struct kmem_cache *s,
+int node, int target_node)
+{
+   struct kmem_cache_node *n = get_node(s, node);
+   LIST_HEAD(move_list);
+   struct page *page, *page2;
+   unsigned long flags;
+   void **scratch;
+
+   if (!s->migrate) {
+   pr_warn("%s SMO not enabled, cannot move objects\n", s->name);
+   goto out;
+   }
+
+   scratch = alloc_scratch(s);
+   if (!scratch)
+   goto out;
+
+   spin_lock_irqsave(&n->list_lock, flags);
+
+   list_for_each_entry_safe(page, page2, &n->partial, lru) {
+   if (!slab_trylock(page))
+   /* Busy slab. Get out of the way */
+   continue;
+
+   if (page->inuse) {
+   list_move(&page->lru, &move_list);
+   /* Stop page being considered for allocations */
+   n->nr_partial--;
+   page->frozen = 1;
+
+   slab_unlock(page);
+   } else {/* Empty slab page */
+   list_del(&page->lru);
+   n->nr_partial--;
+   slab_unlock(page);
+   discard_slab(s, page);
+   }
+   }
+   list_for_each_entry_safe(page, page2, &n->full, lru) {
+   if (!slab_trylock(page))
+   continue;
+
+   list_move(&page->lru, &move_list);
+   page->frozen = 1;
+   slab_unlock(page);
+   }
+
+   spin_unlock_irqrestore(&n->list_lock, flags);
+
+   list_for_each_entry(page, &move_list, lru) {
+   if (page->inuse)
+   move_slab_page(page, scratch, target_node);
+   }
+   kfree(scratch);
+
+   /* Bail here to save taking the list_lock */
+   if (list_empty(&move_list))
+   goto out;
+
+   /* Inspect results and dispose of pages */
+   spin_lock_irqsave(&n->list_lock, flags);
+   list_for_each_entry_safe(page, page2, &move_list, lru) {
+   list_del(&page->lru);
+   slab_lock(page);
+   page->frozen = 0;
+
+   if (page->inuse) {
+   if (page->inuse == page->objects) {
+   list_add(&page->lru, &n->full);
+   slab_unlock(page);
+   } else {
+   n->nr_partial++;
+

[RFC PATCH v4 12/15] slub: Enable balancing slabs across nodes

2019-04-29 Thread Tobin C. Harding
We have just implemented Slab Movable Objects (SMO).  On NUMA systems
slabs can become unbalanced i.e. many slabs on one node while other
nodes have few slabs.  Using SMO we can balance the slabs across all
the nodes.

The algorithm used is as follows:

 1. Move all objects to node 0 (this has the effect of defragmenting the
cache).

 2. Calculate the desired number of slabs for each node (this is done
using the approximation nr_slabs / nr_nodes).

 3. Loop over the nodes moving the desired number of slabs from node 0
to the node.

The feature is conditionally built in with CONFIG_SMO_NODE; this is
because we need the full list (we enable SLUB_DEBUG to get this).  A
future version may separate the full list out of SLUB_DEBUG.

Expose this functionality to userspace via a sysfs entry.  Add sysfs
entry:

   /sysfs/kernel/slab/<cache>/balance

Writing '1' to this file triggers a balance; no other value is accepted.
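
For example, to balance the dentry cache (assuming SMO has been enabled
for it and sysfs is mounted at /sys):

    echo 1 > /sys/kernel/slab/dentry/balance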

This feature relies on SMO being enabled for the cache; this is done,
after the isolate/migrate functions have been defined, with a call to:

kmem_cache_setup_mobility(s, isolate, migrate)

Signed-off-by: Tobin C. Harding 
---
 mm/slub.c | 120 ++
 1 file changed, 120 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index e4f3dde443f5..a5c48c41d72b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4583,6 +4583,109 @@ static unsigned long kmem_cache_move_to_node(struct 
kmem_cache *s, int node)
 
return left;
 }
+
+/*
+ * kmem_cache_move_slabs() - Attempt to move @num slabs to target_node,
+ * @s: The cache we are working on.
+ * @node: The node to move objects from.
+ * @target_node: The node to move objects to.
+ * @num: The number of slabs to move.
+ *
+ * Attempts to move @num slabs from @node to @target_node.  This is done
+ * by migrating objects from slabs on the full_list.
+ *
+ * Return: The number of slabs moved or error code.
+ */
+static long kmem_cache_move_slabs(struct kmem_cache *s,
+ int node, int target_node, long num)
+{
+   struct kmem_cache_node *n = get_node(s, node);
+   LIST_HEAD(move_list);
+   struct page *page, *page2;
+   unsigned long flags;
+   void **scratch;
+   long done = 0;
+
+   if (node == target_node)
+   return -EINVAL;
+
+   scratch = alloc_scratch(s);
+   if (!scratch)
+   return -ENOMEM;
+
+   spin_lock_irqsave(&n->list_lock, flags);
+   list_for_each_entry_safe(page, page2, &n->full, lru) {
+   if (!slab_trylock(page))
+   /* Busy slab. Get out of the way */
+   continue;
+
+   list_move(&page->lru, &move_list);
+   page->frozen = 1;
+   slab_unlock(page);
+
+   if (++done >= num)
+   break;
+   }
+   spin_unlock_irqrestore(&n->list_lock, flags);
+
+   list_for_each_entry(page, &move_list, lru) {
+   if (page->inuse)
+   move_slab_page(page, scratch, target_node);
+   }
+   kfree(scratch);
+
+   /* Inspect results and dispose of pages */
+   spin_lock_irqsave(&n->list_lock, flags);
+   list_for_each_entry_safe(page, page2, &move_list, lru) {
+   list_del(&page->lru);
+   slab_lock(page);
+   page->frozen = 0;
+
+   if (page->inuse) {
+   /*
+* This is best effort only, if slab still has
+* objects just put it back on the partial list.
+*/
+   n->nr_partial++;
+   list_add_tail(&page->lru, &n->partial);
+   slab_unlock(page);
+   } else {
+   slab_unlock(page);
+   discard_slab(s, page);
+   }
+   }
+   spin_unlock_irqrestore(&n->list_lock, flags);
+
+   return done;
+}
+
+/*
+ * kmem_cache_balance_nodes() - Balance slabs across nodes.
+ * @s: The cache we are working on.
+ */
+static void kmem_cache_balance_nodes(struct kmem_cache *s)
+{
+   struct kmem_cache_node *n = get_node(s, 0);
+   unsigned long desired_nr_slabs_per_node;
+   unsigned long nr_slabs;
+   int nr_nodes = 0;
+   int nid;
+
+   (void)kmem_cache_move_to_node(s, 0);
+
+   for_each_node_state(nid, N_NORMAL_MEMORY)
+   nr_nodes++;
+
+   nr_slabs = atomic_long_read(&n->nr_slabs);
+   desired_nr_slabs_per_node = nr_slabs / nr_nodes;
+
+   for_each_node_state(nid, N_NORMAL_MEMORY) {
+   if (nid == 0)
+   continue;
+
+   kmem_cache_move_slabs(s, 0, nid, desired_nr_slabs_per_node);
+   }
+}
 #endif
 
 /**
@@ -5847,6 +5950,22 @@ static ssize_t move_store(struct kmem_cache *s, const 
char *buf, size_t length)
return length;
 }
 SLAB_ATTR(move);
+
+static ssize_t balance_show(struct kmem_cache 

[RFC PATCH v4 14/15] dcache: Implement partial shrink via Slab Movable Objects

2019-04-29 Thread Tobin C. Harding
The dentry slab cache is susceptible to internal fragmentation.  Now
that we have Slab Movable Objects we can attempt to defragment the
dcache.  Dentry objects are inherently _not_ relocatable; however, under
some conditions they can be free'd.  This is the same as shrinking the
dcache but instead of shrinking the whole cache we only attempt to free
those objects that are located in partially full slab pages.  There is
no guarantee that this will reduce the memory usage of the system, it is
a compromise between fragmented memory and total cache shrinkage with
the hope that some memory pressure can be alleviated.

This is implemented using the newly added Slab Movable Objects
infrastructure.  The dcache 'migration' function is intentionally _not_
called 'd_migrate' because we only free, we do not migrate.  Call it
'd_partial_shrink' to make explicit that no reallocation is done.

Implement isolate and 'migrate' functions for the dentry slab cache.

Signed-off-by: Tobin C. Harding 
---
 fs/dcache.c | 76 +
 1 file changed, 76 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 3d6cc06eca56..3f9daba1cc78 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 #include "mount.h"
 
@@ -3067,6 +3068,79 @@ void d_tmpfile(struct dentry *dentry, struct inode 
*inode)
 }
 EXPORT_SYMBOL(d_tmpfile);
 
+/*
+ * d_isolate() - Dentry isolation callback function.
+ * @s: The dentry cache.
+ * @v: Vector of pointers to the objects to isolate.
+ * @nr: Number of objects in @v.
+ *
+ * The slab allocator is holding off frees. We can safely examine
+ * the object without the danger of it vanishing from under us.
+ */
+static void *d_isolate(struct kmem_cache *s, void **v, int nr)
+{
+   struct list_head *dispose;
+   struct dentry *dentry;
+   int i;
+
+   dispose = kmalloc(sizeof(*dispose), GFP_KERNEL);
+   if (!dispose)
+   return NULL;
+
+   INIT_LIST_HEAD(dispose);
+
+   for (i = 0; i < nr; i++) {
+   dentry = v[i];
+   spin_lock(&dentry->d_lock);
+
+   if (dentry->d_lockref.count > 0 ||
+   dentry->d_flags & DCACHE_SHRINK_LIST) {
+   spin_unlock(&dentry->d_lock);
+   continue;
+   }
+
+   if (dentry->d_flags & DCACHE_LRU_LIST)
+   d_lru_del(dentry);
+
+   d_shrink_add(dentry, dispose);
+   spin_unlock(&dentry->d_lock);
+   }
+
+   return dispose;
+}
+
+/*
+ * d_partial_shrink() - Dentry migration callback function.
+ * @s: The dentry cache.
+ * @_unused: We do not access the vector.
+ * @__unused: No need for length of vector.
+ * @___unused: We do not do any allocation.
+ * @private: list_head pointer representing the shrink list.
+ *
+ * Dispose of the shrink list created during isolation function.
+ *
+ * Dentry objects can _not_ be relocated and shrinking the whole dcache
+ * can be expensive.  This is an effort to free dentry objects that are
+ * stopping slab pages from being free'd without clearing the whole dcache.
+ *
+ * This callback is called from the SLUB allocator object migration
+ * infrastructure in attempt to free up slab pages by freeing dentry
+ * objects from partially full slabs.
+ */
+static void d_partial_shrink(struct kmem_cache *s, void **_unused, int 
__unused,
+int ___unused, void *private)
+{
+   struct list_head *dispose = private;
+
+   if (!private)   /* kmalloc error during isolate. */
+   return;
+
+   if (!list_empty(dispose))
+   shrink_dentry_list(dispose);
+
+   kfree(private);
+}
+
 static __initdata unsigned long dhash_entries;
 static int __init set_dhash_entries(char *str)
 {
@@ -3112,6 +3186,8 @@ static void __init dcache_init(void)
   sizeof_field(struct dentry, d_iname),
   dcache_ctor);
 
+   kmem_cache_setup_mobility(dentry_cache, d_isolate, d_partial_shrink);
+
/* Hash may have been set up in dcache_init_early */
if (!hashdist)
return;
-- 
2.21.0



[RFC PATCH v4 10/15] tools/testing/slab: Add XArray movable objects tests

2019-04-29 Thread Tobin C. Harding
We just implemented movable objects for the XArray.  Let's test it
in-tree.

Add test module for the XArray's movable objects implementation.

Functionality of the XArray Slab Movable Object implementation can
usually be seen by simply by using `slabinfo` on a running machine since
the radix tree is typically in use on a running machine and will have
partial slabs.  For repeated testing we can use the test module to run
to simulate a workload on the XArray then use `slabinfo` to test object
migration is functioning.

If testing on a freshly spun up VM (low radix tree workload) it may be
necessary to load/unload the module a number of times to create partial
slabs.

Example test session


Relevant /proc/slabinfo column headers:

  name   

Prior to testing slabinfo report for radix_tree_node:

  # slabinfo radix_tree_node --report

  Slabcache: radix_tree_node  Aliases:  0 Order :  2 Objects: 8352
  ** Reclaim accounting active
  ** Defragmentation at 30%

  Sizes (bytes) Slabs  DebugMemory
  
  Object : 576  Total  : 497   Sanity Checks : On   Total: 8142848
  SlabObj: 912  Full   : 473   Redzoning : On   Used : 4810752
  SlabSiz:   16384  Partial:  24   Poisoning : On   Loss : 3332096
  Loss   : 336  CpuSlab:   0   Tracking  : On   Lalig: 2806272
  Align  :   8  Objects:  17   Tracing   : Off  Lpadd:  437360

Here you can see the kernel was built with Slab Movable Objects enabled
for the XArray (XArray uses the radix tree below the surface).

After inserting the test module (note we have triggered allocation of a
number of radix tree nodes increasing the object count but decreasing the
number of partial slabs):

  # slabinfo radix_tree_node --report

  Slabcache: radix_tree_node  Aliases:  0 Order :  2 Objects: 8442
  ** Reclaim accounting active
  ** Defragmentation at 30%

  Sizes (bytes) Slabs  DebugMemory
  
  Object : 576  Total  : 499   Sanity Checks : On   Total: 8175616
  SlabObj: 912  Full   : 484   Redzoning : On   Used : 4862592
  SlabSiz:   16384  Partial:  15   Poisoning : On   Loss : 3313024
  Loss   : 336  CpuSlab:   0   Tracking  : On   Lalig: 2836512
  Align  :   8  Objects:  17   Tracing   : Off  Lpadd:  439120

Now we can shrink the radix_tree_node cache:

  # slabinfo radix_tree_node --shrink
  # slabinfo radix_tree_node --report

  Slabcache: radix_tree_node  Aliases:  0 Order :  2 Objects: 8515
  ** Reclaim accounting active
  ** Defragmentation at 30%

  Sizes (bytes) Slabs  DebugMemory
  
  Object : 576  Total  : 501   Sanity Checks : On   Total: 8208384
  SlabObj: 912  Full   : 500   Redzoning : On   Used : 4904640
  SlabSiz:   16384  Partial:   1   Poisoning : On   Loss : 3303744
  Loss   : 336  CpuSlab:   0   Tracking  : On   Lalig: 2861040
  Align  :   8  Objects:  17   Tracing   : Off  Lpadd:  440880

Note the single remaining partial slab.

Signed-off-by: Tobin C. Harding 
---
 tools/testing/slab/Makefile |   2 +-
 tools/testing/slab/slub_defrag_xarray.c | 211 
 2 files changed, 212 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/slab/slub_defrag_xarray.c

diff --git a/tools/testing/slab/Makefile b/tools/testing/slab/Makefile
index 440c2e3e356f..44c18d9a4d52 100644
--- a/tools/testing/slab/Makefile
+++ b/tools/testing/slab/Makefile
@@ -1,4 +1,4 @@
-obj-m += slub_defrag.o
+obj-m += slub_defrag.o slub_defrag_xarray.o
 
 KTREE=../../..
 
diff --git a/tools/testing/slab/slub_defrag_xarray.c 
b/tools/testing/slab/slub_defrag_xarray.c
new file mode 100644
index ..41143f73256c
--- /dev/null
+++ b/tools/testing/slab/slub_defrag_xarray.c
@@ -0,0 +1,211 @@
+// SPDX-License-Identifier: GPL-2.0+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define SMOX_CACHE_NAME "smox_test"
+static struct kmem_cache *cachep;
+
+/*
+ * Declare XArrays globally so we can clean them up on module unload.
+ */
+
+/* Used by test_smo_xarray()*/
+DEFINE_XARRAY(things);
+
+/* Thing to store pointers to in the XArray */
+struct smox_thing {
+   long id;
+};
+
+/* It's up to the caller to ensure id is unique */
+static struct smox_thing *alloc_thing(int id)
+{
+   struct smox_thing *thing;
+
+   thing = kmem_cache_alloc(cachep, GFP_KERNEL);
+   if (!thing)
+   return ERR_PTR(-ENOMEM);
+
+   thing->id = id;
+   return thing;
+}
+
+/**
+ * smox_object_ctor() - SMO object constructor function.
+ * @ptr: Pointer to memory where the object should be constructe

[RFC PATCH v4 09/15] xarray: Implement migration function for objects

2019-04-29 Thread Tobin C. Harding
Implement functions to migrate objects. This is based on initial code by
Matthew Wilcox and was modified to work with slab object migration.

This patch can not be merged until all radix tree & IDR users are
converted to the XArray because xa_nodes and radix tree nodes share the
same slab cache (thanks Matthew).

Co-developed-by: Christoph Lameter 
Signed-off-by: Tobin C. Harding 
---
 lib/radix-tree.c | 13 +
 lib/xarray.c | 49 
 2 files changed, 62 insertions(+)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 14d51548bea6..9412c2853726 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1613,6 +1613,17 @@ static int radix_tree_cpu_dead(unsigned int cpu)
return 0;
 }
 
+extern void xa_object_migrate(void *tree_node, int numa_node);
+
+static void radix_tree_migrate(struct kmem_cache *s, void **objects, int nr,
+  int node, void *private)
+{
+   int i;
+
+   for (i = 0; i < nr; i++)
+   xa_object_migrate(objects[i], node);
+}
+
 void __init radix_tree_init(void)
 {
int ret;
@@ -1627,4 +1638,6 @@ void __init radix_tree_init(void)
ret = cpuhp_setup_state_nocalls(CPUHP_RADIX_DEAD, "lib/radix:dead",
NULL, radix_tree_cpu_dead);
WARN_ON(ret < 0);
+   kmem_cache_setup_mobility(radix_tree_node_cachep, NULL,
+ radix_tree_migrate);
 }
diff --git a/lib/xarray.c b/lib/xarray.c
index 6be3acbb861f..731dd3d8ddb8 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1971,6 +1971,55 @@ void xa_destroy(struct xarray *xa)
 }
 EXPORT_SYMBOL(xa_destroy);
 
+void xa_object_migrate(struct xa_node *node, int numa_node)
+{
+   struct xarray *xa = READ_ONCE(node->array);
+   void __rcu **slot;
+   struct xa_node *new_node;
+   int i;
+
+   /* Freed or not yet in tree then skip */
+   if (!xa || xa == XA_RCU_FREE)
+   return;
+
+   new_node = kmem_cache_alloc_node(radix_tree_node_cachep,
+GFP_KERNEL, numa_node);
+   if (!new_node)
+   return;
+
+   xa_lock_irq(xa);
+
+   /* Check again. */
+   if (xa != node->array) {
+   node = new_node;
+   goto unlock;
+   }
+
+   memcpy(new_node, node, sizeof(struct xa_node));
+
+   if (list_empty(&node->private_list))
+   INIT_LIST_HEAD(_node->private_list);
+   else
+   list_replace(&node->private_list, &new_node->private_list);
+
+   for (i = 0; i < XA_CHUNK_SIZE; i++) {
+   void *x = xa_entry_locked(xa, new_node, i);
+
+   if (xa_is_node(x))
+   rcu_assign_pointer(xa_to_node(x)->parent, new_node);
+   }
+   if (!new_node->parent)
+   slot = &xa->xa_head;
+   else
+   slot = &xa_parent_locked(xa, new_node)->slots[new_node->offset];
+   rcu_assign_pointer(*slot, xa_mk_node(new_node));
+
+unlock:
+   xa_unlock_irq(xa);
+   xa_node_free(node);
+   rcu_barrier();
+}
+
 #ifdef XA_DEBUG
 void xa_dump_node(const struct xa_node *node)
 {
-- 
2.21.0



[RFC PATCH v4 08/15] tools/testing/slab: Add object migration test suite

2019-04-29 Thread Tobin C. Harding
We just added a module that enables testing the SLUB allocator's ability
to defrag/shrink caches via movable objects.  Tests are better when they
are automated.

Add automated testing via a python script for SLUB movable objects.

Example output:

  $ cd path/to/linux/tools/testing/slab
  $ ./slub_defrag.py
  Please run script as root

  $ sudo ./slub_defrag.py
  

  $ sudo ./slub_defrag.py --debug
  Loading module ...
  Slab cache smo_test created
  Objects per slab: 20
  Running sanity checks ...

  Running module stress test (see dmesg for additional test output) ...
  Removing module slub_defrag ...
  Loading module ...
  Slab cache smo_test created

  Running test non-movable ...
  testing slab 'smo_test' prior to enabling movable objects ...
  verified non-movable slabs are NOT shrinkable

  Running test movable ...
  testing slab 'smo_test' after enabling movable objects ...
  verified movable slabs are shrinkable

  Removing module slub_defrag ...

Signed-off-by: Tobin C. Harding 
---
 tools/testing/slab/slub_defrag.c  |   1 +
 tools/testing/slab/slub_defrag.py | 451 ++
 2 files changed, 452 insertions(+)
 create mode 100755 tools/testing/slab/slub_defrag.py

diff --git a/tools/testing/slab/slub_defrag.c b/tools/testing/slab/slub_defrag.c
index 4a5c24394b96..8332e69ee868 100644
--- a/tools/testing/slab/slub_defrag.c
+++ b/tools/testing/slab/slub_defrag.c
@@ -337,6 +337,7 @@ static int smo_run_module_tests(int nr_objs, int keep)
 
 /*
  * struct functions() - Map command to a function pointer.
+ * If you update this please update the documentation in slub_defrag.py
  */
 struct functions {
char *fn_name;
diff --git a/tools/testing/slab/slub_defrag.py 
b/tools/testing/slab/slub_defrag.py
new file mode 100755
index ..41747c0db39b
--- /dev/null
+++ b/tools/testing/slab/slub_defrag.py
@@ -0,0 +1,451 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import subprocess
+import sys
+from os import path
+
+# SLUB Movable Objects test suite.
+#
+# Requirements:
+#  - CONFIG_SLUB=y
+#  - CONFIG_SLUB_DEBUG=y
+#  - The slub_defrag module in this directory.
+
+# Test SMO using a kernel module that enables triggering arbitrary
+# kernel code from userspace via a debugfs file.
+#
+# Module code is in ./slub_defrag.c, basically the functionality is as
+# follows:
+#
+#  - Creates debugfs file /sys/kernel/debugfs/smo/callfn
+#  - Writes to 'callfn' are parsed as a command string and the function
+#associated with command is called.
+#  - Defines 4 commands (all commands operate on smo_test cache):
+# - 'test': Runs module stress tests.
+# - 'alloc N': Allocates N slub objects
+# - 'free N POS': Frees N objects starting at POS (see below)
+# - 'enable': Enables SLUB Movable Objects
+#
+# The module maintains a list of allocated objects.  Allocation adds
+# objects to the tail of the list.  Free'ing frees from the head of the
+# list.  This has the effect of creating free slots in the slab.  For
+# finer grained control over where in the cache slots are free'd POS
+# (position) argument may be used.
+
+# The main() function is reasonably readable; the test suite does the
+# following:
+#
+# 1. Runs the module stress tests.
+# 2. Tests the cache without movable objects enabled.
+#- Creates multiple partial slabs as explained above.
+#- Verifies that partial slabs are _not_ removed by shrink (see below).
+# 3. Tests the cache with movable objects enabled.
+#- Creates multiple partial slabs as explained above.
+#- Verifies that partial slabs _are_ removed by shrink (see below).
+
+# The sysfs file /sys/kernel/slab/<cache>/shrink enables calling the
+# function kmem_cache_shrink() (see mm/slab_common.c and mm/slub.c).
+# Shrinking a cache attempts to consolidate all partial slabs by moving
+# objects if object migration is enabled for the cache, otherwise
+# shrinking a cache simply re-orders the partial list so that the most
+# densely populated slabs are at the head of the list.
+
+# Enable/disable debugging output (also enabled via -d | --debug).
+debug = False
+
+# Used in debug messages and when running `insmod`.
+MODULE_NAME = "slub_defrag"
+
+# Slab cache created by the test module.
+CACHE_NAME = "smo_test"
+
+# Set by get_slab_config()
+objects_per_slab = 0
+pages_per_slab = 0
+debugfs_mounted = False # Set to true if we mount debugfs.
+
+
+def eprint(*args, **kwargs):
+print(*args, file=sys.stderr, **kwargs)
+
+
+def dprint(*args, **kwargs):
+if debug:
+print(*args, file=sys.stderr, **kwargs)
+
+
+def run_shell(cmd):
+return subprocess.call([cmd], shell=True)
+
+
+def run_shell_get_stdout(cmd):
+return subprocess.check_output([cmd], shell=True)
+
+
+def assert_root():
+user = run_shell_get_stdout('whoami')
+if user != b'root\n':
+eprint("Please run script as root")
+sys.exit(1)
+
+
+def mount_debugfs():
+mounted = False
+
+# Check if

[RFC PATCH v4 07/15] tools/testing/slab: Add object migration test module

2019-04-29 Thread Tobin C. Harding
 Total  :   1   Sanity Checks : On   Total:8192
  SlabObj: 392  Full   :   1   Redzoning : On   Used :1120
  SlabSiz:8192  Partial:   0   Poisoning : On   Loss :7072
  Loss   : 336  CpuSlab:   0   Tracking  : On   Lalig:6720
  Align  :   8  Objects:  20   Tracing   : Off  Lpadd: 352

We can run the stress tests (with the default number of objects):

  # cd /sys/kernel/debug/smo
  # echo 'test' > callfn
  [3.576617] smo: test using nr_objs: 1000 keep: 10
  [3.580169] smo: Module tests completed successfully

Signed-off-by: Tobin C. Harding 
---
 tools/testing/slab/Makefile  |  10 +
 tools/testing/slab/slub_defrag.c | 566 +++
 2 files changed, 576 insertions(+)
 create mode 100644 tools/testing/slab/Makefile
 create mode 100644 tools/testing/slab/slub_defrag.c

diff --git a/tools/testing/slab/Makefile b/tools/testing/slab/Makefile
new file mode 100644
index ..440c2e3e356f
--- /dev/null
+++ b/tools/testing/slab/Makefile
@@ -0,0 +1,10 @@
+obj-m += slub_defrag.o
+
+KTREE=../../..
+
+all:
+   make -C ${KTREE} M=$(PWD) modules
+
+clean:
+   make -C ${KTREE} M=$(PWD) clean
+
diff --git a/tools/testing/slab/slub_defrag.c b/tools/testing/slab/slub_defrag.c
new file mode 100644
index ..4a5c24394b96
--- /dev/null
+++ b/tools/testing/slab/slub_defrag.c
@@ -0,0 +1,566 @@
+// SPDX-License-Identifier: GPL-2.0+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * SLUB defragmentation a.k.a. Slab Movable Objects (SMO).
+ *
+ * This module is used for testing the SLUB allocator.  Enables
+ * userspace to run kernel functions via a debugfs file.
+ *
+ *   debugfs: /sys/kernel/debugfs/smo/callfn (write only)
+ *
+ * String written to `callfn` is parsed by the module and associated
+ * function is called.  See fn_tab for mapping of strings to functions.
+ */
+
+/* debugfs commands accept two optional arguments */
+#define SMO_CMD_DEFAUT_ARG -1
+
+#define SMO_DEBUGFS_DIR "smo"
+struct dentry *smo_debugfs_root;
+
+#define SMO_CACHE_NAME "smo_test"
+static struct kmem_cache *cachep;
+
+struct smo_slub_object {
+   struct list_head list;
+   char buf[32];   /* Unused except to control size of object */
+   long id;
+};
+
+/* Our list of allocated objects */
+LIST_HEAD(objects);
+
+static void list_add_to_objects(struct smo_slub_object *so)
+{
+   /*
+* We free from the front of the list so store at the
+* tail in order to put holes in the cache when we free.
+*/
+   list_add_tail(&so->list, &objects);
+}
+
+/**
+ * smo_object_ctor() - SMO object constructor function.
+ * @ptr: Pointer to memory where the object should be constructed.
+ */
+void smo_object_ctor(void *ptr)
+{
+   struct smo_slub_object *so = ptr;
+
+   INIT_LIST_HEAD(&so->list);
+   memset(so->buf, 0, sizeof(so->buf));
+   so->id = -1;
+}
+
+/**
+ * smo_cache_migrate() - kmem_cache migrate function.
+ * @cp: kmem_cache pointer.
+ * @objs: Array of pointers to objects to migrate.
+ * @size: Number of objects in @objs.
+ * @node: NUMA node where the object should be allocated.
+ * @private: Pointer returned by kmem_cache_isolate_func().
+ */
+void smo_cache_migrate(struct kmem_cache *cp, void **objs, int size,
+  int node, void *private)
+{
+   struct smo_slub_object **so_objs = (struct smo_slub_object **)objs;
+   struct smo_slub_object *so_old, *so_new;
+   int i;
+
+   for (i = 0; i < size; i++) {
+   so_old = so_objs[i];
+
+   so_new = kmem_cache_alloc_node(cachep, GFP_KERNEL, node);
+   if (!so_new) {
+   pr_debug("kmem_cache_alloc failed\n");
+   return;
+   }
+
+   /* Copy object */
+   so_new->id = so_old->id;
+
+   /* Update references to old object */
+   list_del(&so_old->list);
+   list_add_to_objects(so_new);
+
+   kmem_cache_free(cachep, so_old);
+   }
+}
+
+static int smo_enable_cache_mobility(int _unused, int __unused)
+{
+   /* Enable movable objects: BOOM! */
+   kmem_cache_setup_mobility(cachep, NULL, smo_cache_migrate);
+   pr_info("smo: kmem_cache %s defrag enabled\n", SMO_CACHE_NAME);
+   return 0;
+}
+
+/*
+ * smo_alloc_objects() - Allocate objects and store reference.
+ * @nr_objs: Number of objects to allocate.
+ * @node: NUMA node to allocate objects on.
+ *
+ * Allocates @n smo_slub_objects.  Stores a reference to them in
+ * the global list of objects (at the tail of the list).
+ *
+ * Return: The number of objects allocated.
+ */
+static int smo_alloc_objects(int nr_objs, int node)
+{
+   struct smo_slub_object *so;
+   int i;
+
+   /* Set sane parameters if no args passed in */
+   if (nr_objs == 

[RFC PATCH v4 06/15] tools/vm/slabinfo: Add defrag_used_ratio output

2019-04-29 Thread Tobin C. Harding
Add output for the newly added defrag_used_ratio sysfs knob.

Signed-off-by: Tobin C. Harding 
---
 tools/vm/slabinfo.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c
index d2c22f9ee2d8..ef4ff93df4cc 100644
--- a/tools/vm/slabinfo.c
+++ b/tools/vm/slabinfo.c
@@ -34,6 +34,7 @@ struct slabinfo {
unsigned int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
int movable, ctor;
+   int defrag_used_ratio;
int remote_node_defrag_ratio;
unsigned long partial, objects, slabs, objects_partial, objects_total;
unsigned long alloc_fastpath, alloc_slowpath;
@@ -549,6 +550,8 @@ static void report(struct slabinfo *s)
printf("** Slabs are destroyed via RCU\n");
if (s->reclaim_account)
printf("** Reclaim accounting active\n");
+   if (s->movable)
+   printf("** Defragmentation at %d%%\n", s->defrag_used_ratio);
 
printf("\nSizes (bytes) Slabs  Debug
Memory\n");

printf("\n");
@@ -1279,6 +1282,7 @@ static void read_slab_dir(void)
slab->deactivate_bypass = get_obj("deactivate_bypass");
slab->remote_node_defrag_ratio =
get_obj("remote_node_defrag_ratio");
+   slab->defrag_used_ratio = get_obj("defrag_used_ratio");
chdir("..");
if (read_slab_obj(slab, "ops")) {
if (strstr(buffer, "ctor :"))
-- 
2.21.0



[RFC PATCH v4 05/15] tools/vm/slabinfo: Add remote node defrag ratio output

2019-04-29 Thread Tobin C. Harding
Add output line for NUMA remote node defrag ratio.

Signed-off-by: Tobin C. Harding 
---
 tools/vm/slabinfo.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c
index cbfc56c44c2f..d2c22f9ee2d8 100644
--- a/tools/vm/slabinfo.c
+++ b/tools/vm/slabinfo.c
@@ -34,6 +34,7 @@ struct slabinfo {
unsigned int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
int movable, ctor;
+   int remote_node_defrag_ratio;
unsigned long partial, objects, slabs, objects_partial, objects_total;
unsigned long alloc_fastpath, alloc_slowpath;
unsigned long free_fastpath, free_slowpath;
@@ -377,6 +378,10 @@ static void slab_numa(struct slabinfo *s, int mode)
if (skip_zero && !s->slabs)
return;
 
+   if (mode) {
+   printf("\nNUMA remote node defrag ratio: %3d\n",
+  s->remote_node_defrag_ratio);
+   }
if (!line) {
printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
for(node = 0; node <= highest_node; node++)
@@ -1272,6 +1277,8 @@ static void read_slab_dir(void)
slab->cpu_partial_free = get_obj("cpu_partial_free");
slab->alloc_node_mismatch = 
get_obj("alloc_node_mismatch");
slab->deactivate_bypass = get_obj("deactivate_bypass");
+   slab->remote_node_defrag_ratio =
+   get_obj("remote_node_defrag_ratio");
chdir("..");
if (read_slab_obj(slab, "ops")) {
if (strstr(buffer, "ctor :"))
-- 
2.21.0



[RFC PATCH v4 04/15] slub: Slab defrag core

2019-04-29 Thread Tobin C. Harding
Internal fragmentation can occur within pages used by the slub
allocator.  Under some workloads large numbers of pages can be tied up
in partially filled slabs.  This under-utilisation is bad simply because
it wastes memory, but also because, if the system is under memory
pressure, higher-order allocations may become difficult to satisfy.  If
we can defrag slab caches we can alleviate these problems.

Implement Slab Movable Objects in order to defragment slab caches.

Slab defragmentation may occur:

1. Unconditionally when __kmem_cache_shrink() is called on a slab cache
   by the kernel calling kmem_cache_shrink().

2. Unconditionally through the use of the slabinfo command.

slabinfo <cache> -s

3. Conditionally via the use of kmem_defrag_slabs()

- Use Slab Movable Objects when shrinking cache.

Currently when the kernel calls kmem_cache_shrink() we curate the
partial slabs list.  If object migration is not enabled for the cache we
still do this, if however, SMO is enabled we attempt to move objects in
partially full slabs in order to defragment the cache.  Shrink attempts
to move all objects in order to reduce the cache to a single partial
slab for each node.

- Add conditional per node defrag via new function:

kmem_defrag_slabs(int node).

kmem_defrag_slabs() attempts to defragment all slab caches for node.
 Defragmentation is done conditionally dependent on MAX_PARTIAL _AND_
 defrag_used_ratio.

   Caches are only considered for defragmentation if the number of
   partial slabs exceeds MAX_PARTIAL (per node).

   Also, defragmentation only occurs if the usage ratio of the slab is
   lower than the configured percentage (sysfs field added in this
   patch).  Fragmentation ratios are measured by calculating the
   percentage of objects in use compared to the total number of objects
   that the slab page can accommodate.

   The scanning of slab caches is optimized because the defragmentable
   slabs come first on the list. Thus we can terminate scans on the
   first slab encountered that does not support defragmentation.

   kmem_defrag_slabs() takes a node parameter. This can either be -1 if
   defragmentation should be performed on all nodes, or a node number.

   Defragmentation may be disabled by setting defrag ratio to 0

echo 0 > /sys/kernel/slab/<cache>/defrag_used_ratio

- Add a defrag ratio sysfs field and set it to 30% by default. A limit
of 30% specifies that more than 3 out of 10 available slots for objects
need to be in use otherwise slab defragmentation will be attempted on
the remaining objects.

In order for a cache to be defragmentable the cache must support object
migration (SMO).  Enabling SMO for a cache is done via a call to the
recently added function:

void kmem_cache_setup_mobility(struct kmem_cache *,
   kmem_cache_isolate_func,
   kmem_cache_migrate_func);
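
As an illustration only (not part of this patch), a sketch of how the
kmem_defrag_slabs() entry point described above might be driven from
kernel code.  The caller below is hypothetical; the only interface
assumed is the one added by this patch:

  /* Hypothetical caller, sketch only. */
  static unsigned long example_defrag_slabs(int nid)
  {
          unsigned long n;

          /* Try the node under pressure first ... */
          n = kmem_defrag_slabs(nid);

          /* ... then fall back to all nodes (-1) if nothing was done. */
          if (!n)
                  n = kmem_defrag_slabs(-1);

          return n;
  }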

Co-developed-by: Christoph Lameter 
Signed-off-by: Tobin C. Harding 
---
 Documentation/ABI/testing/sysfs-kernel-slab |  14 +
 include/linux/slab.h|   1 +
 include/linux/slub_def.h|   7 +
 mm/slub.c   | 385 
 4 files changed, 334 insertions(+), 73 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-slab 
b/Documentation/ABI/testing/sysfs-kernel-slab
index 29601d93a1c2..7770c03be6b4 100644
--- a/Documentation/ABI/testing/sysfs-kernel-slab
+++ b/Documentation/ABI/testing/sysfs-kernel-slab
@@ -180,6 +180,20 @@ Description:
list.  It can be written to clear the current count.
Available when CONFIG_SLUB_STATS is enabled.
 
+What:  /sys/kernel/slab/cache/defrag_used_ratio
+Date:  February 2019
+KernelVersion: 5.0
+Contact:   Christoph Lameter 
+   Pekka Enberg ,
+Description:
+   The defrag_used_ratio file allows the control of how aggressive
+   slab fragmentation reduction works at reclaiming objects from
+   sparsely populated slabs. This is a percentage. If a slab has
+   less than this percentage of objects allocated then reclaim will
+   attempt to reclaim objects so that the whole slab page can be
+   freed. 0% specifies no reclaim attempt (defrag disabled), 100%
+   specifies attempt to reclaim all pages.  The default is 30%.
+
 What:  /sys/kernel/slab/cache/deactivate_to_tail
 Date:  February 2008
 KernelVersion: 2.6.25
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 886fc130334d..4bf381b34829 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -149,6 +149,7 @@ struct kmem_cache *kmem_cache_create_usercopy(const char 
*name,
void (*ctor)(void *));
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
+unsigned long kmem_defrag_slabs(int node);
 
 void memcg_create_kmem_cache(str

[RFC PATCH v4 02/15] tools/vm/slabinfo: Add support for -C and -M options

2019-04-29 Thread Tobin C. Harding
-C lists caches that use a ctor.

-M lists caches that support object migration.

Add command line options to show caches with a constructor and caches
that are movable (i.e. have migrate function).

Co-developed-by: Christoph Lameter 
Signed-off-by: Tobin C. Harding 
---
 tools/vm/slabinfo.c | 40 
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c
index 73818f1b2ef8..cbfc56c44c2f 100644
--- a/tools/vm/slabinfo.c
+++ b/tools/vm/slabinfo.c
@@ -33,6 +33,7 @@ struct slabinfo {
unsigned int hwcache_align, object_size, objs_per_slab;
unsigned int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
+   int movable, ctor;
unsigned long partial, objects, slabs, objects_partial, objects_total;
unsigned long alloc_fastpath, alloc_slowpath;
unsigned long free_fastpath, free_slowpath;
@@ -67,6 +68,8 @@ int show_report;
 int show_alias;
 int show_slab;
 int skip_zero = 1;
+int show_movable;
+int show_ctor;
 int show_numa;
 int show_track;
 int show_first_alias;
@@ -109,11 +112,13 @@ static void fatal(const char *x, ...)
 
 static void usage(void)
 {
-   printf("slabinfo 4/15/2011. (c) 2007 sgi/(c) 2011 Linux Foundation.\n\n"
-   "slabinfo [-aADefhilnosrStTvz1LXBU] [N=K] [-dafzput] 
[slab-regexp]\n"
+   printf("slabinfo 4/15/2017. (c) 2007 sgi/(c) 2011 Linux Foundation/(c) 
2017 Jump Trading LLC.\n\n"
+  "slabinfo [-aACDefhilMnosrStTvz1LXBU] [N=K] [-dafzput] 
[slab-regexp]\n"
+
"-a|--aliases   Show aliases\n"
"-A|--activity  Most active slabs first\n"
"-B|--Bytes Show size in bytes\n"
+   "-C|--ctor  Show slabs with ctors\n"
"-D|--display-activeSwitch line format to activity\n"
"-e|--empty Show empty slabs\n"
"-f|--first-alias   Show first alias\n"
@@ -121,6 +126,7 @@ static void usage(void)
"-i|--inverted  Inverted list\n"
"-l|--slabs Show slabs\n"
"-L|--Loss  Sort by loss\n"
+   "-M|--movable   Show caches that support movable 
objects\n"
"-n|--numa  Show NUMA information\n"
"-N|--lines=K   Show the first K slabs\n"
"-o|--ops   Show kmem_cache_ops\n"
@@ -588,6 +594,12 @@ static void slabcache(struct slabinfo *s)
if (show_empty && s->slabs)
return;
 
+   if (show_ctor && !s->ctor)
+   return;
+
+   if (show_movable && !s->movable)
+   return;
+
if (sort_loss == 0)
store_size(size_str, slab_size(s));
else
@@ -602,6 +614,10 @@ static void slabcache(struct slabinfo *s)
*p++ = '*';
if (s->cache_dma)
*p++ = 'd';
+   if (s->ctor)
+   *p++ = 'C';
+   if (s->movable)
+   *p++ = 'M';
if (s->hwcache_align)
*p++ = 'A';
if (s->poison)
@@ -636,7 +652,8 @@ static void slabcache(struct slabinfo *s)
printf("%-21s %8ld %7d %15s %14s %4d %1d %3ld %3ld %s\n",
s->name, s->objects, s->object_size, size_str, dist_str,
s->objs_per_slab, s->order,
-   s->slabs ? (s->partial * 100) / s->slabs : 100,
+   s->slabs ? (s->partial * 100) /
+   (s->slabs * s->objs_per_slab) : 100,
s->slabs ? (s->objects * s->object_size * 100) /
(s->slabs * (page_size << s->order)) : 100,
flags);
@@ -1256,6 +1273,13 @@ static void read_slab_dir(void)
slab->alloc_node_mismatch = 
get_obj("alloc_node_mismatch");
slab->deactivate_bypass = get_obj("deactivate_bypass");
chdir("..");
+   if (read_slab_obj(slab, "ops")) {
+   if (strstr(buffer, "ctor :"))
+   slab->ctor = 1;
+   if (strstr(buffer, "migrate :"))
+   slab->movable = 1;
+   }
+
if (slab->name[0] == ':')
alias_targets++;
slab++;
@@ -1332,6 +1356,8 @@ static void 

[RFC PATCH v4 03/15] slub: Sort slab cache list

2019-04-29 Thread Tobin C. Harding
It is advantageous to have all defragmentable slabs together at the
beginning of the list of slabs so that there is no need to scan the
complete list. Put defragmentable caches first when adding a slab cache
and others last.

Co-developed-by: Christoph Lameter 
Signed-off-by: Tobin C. Harding 
---
 mm/slab_common.c | 2 +-
 mm/slub.c| 6 ++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 58251ba63e4a..db5e9a0b1535 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -393,7 +393,7 @@ static struct kmem_cache *create_cache(const char *name,
goto out_free_cache;
 
s->refcount = 1;
-   list_add(&s->list, &slab_caches);
+   list_add_tail(&s->list, &slab_caches);
memcg_link_cache(s);
 out:
if (err)
diff --git a/mm/slub.c b/mm/slub.c
index ae44d640b8c1..f6b0e4a395ef 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4342,6 +4342,8 @@ void kmem_cache_setup_mobility(struct kmem_cache *s,
return;
}
 
+   mutex_lock(&slab_mutex);
+
s->isolate = isolate;
s->migrate = migrate;
 
@@ -4350,6 +4352,10 @@ void kmem_cache_setup_mobility(struct kmem_cache *s,
 * to disable fast cmpxchg based processing.
 */
s->flags &= ~__CMPXCHG_DOUBLE;
+
+   list_move(&s->list, &slab_caches);  /* Move to top */
+
+   mutex_unlock(&slab_mutex);
 }
 EXPORT_SYMBOL(kmem_cache_setup_mobility);
 
-- 
2.21.0



[RFC PATCH v4 01/15] slub: Add isolate() and migrate() methods

2019-04-29 Thread Tobin C. Harding
Add the two methods needed for moving objects and enable the display of
the callbacks via the /sys/kernel/slab interface.

Add documentation explaining the use of these methods and the prototypes
for slab.h. Add functions to setup the callbacks method for a slab
cache.

Add empty functions for SLAB/SLOB. The API is generic so it could be
theoretically implemented for these allocators as well.

Change sysfs 'ctor' field to be 'ops' to contain all the callback
operations defined for a slab cache.  Display the existing 'ctor'
callback in the ops fields contents along with 'isolate' and 'migrate'
callbacks.
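
For illustration only (not code from this series), the rough shape of a
pair of callbacks and their registration for a made-up 'foo' cache might
look like the sketch below.  Everything named foo_* is hypothetical; the
only interface assumed is the one added by this patch:

  #include <linux/slab.h>

  struct foo {
          int id;
  };

  static struct kmem_cache *foo_cachep;

  /* Isolate: called with slab locks held; pin objects, do not sleep. */
  static void *foo_isolate(struct kmem_cache *s, void **objs, int nr)
  {
          /* Nothing extra to track in this toy example. */
          return NULL;
  }

  /* Migrate: reallocate each object on @node, copy it, free the old one. */
  static void foo_migrate(struct kmem_cache *s, void **objs, int nr,
                          int node, void *private)
  {
          int i;

          for (i = 0; i < nr; i++) {
                  struct foo *old_obj = objs[i];
                  struct foo *new_obj;

                  new_obj = kmem_cache_alloc_node(s, GFP_KERNEL, node);
                  if (!new_obj)
                          return;

                  new_obj->id = old_obj->id;
                  /* ... update any references to old_obj here ... */
                  kmem_cache_free(s, old_obj);
          }
  }

  static void foo_cache_init(void)
  {
          foo_cachep = KMEM_CACHE(foo, 0);
          kmem_cache_setup_mobility(foo_cachep, foo_isolate, foo_migrate);
  }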

Co-developed-by: Christoph Lameter 
Signed-off-by: Tobin C. Harding 
---
 include/linux/slab.h | 70 
 include/linux/slub_def.h |  3 ++
 mm/slub.c| 59 +
 3 files changed, 126 insertions(+), 6 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 9449b19c5f10..886fc130334d 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -154,6 +154,76 @@ void memcg_create_kmem_cache(struct mem_cgroup *, struct 
kmem_cache *);
 void memcg_deactivate_kmem_caches(struct mem_cgroup *);
 void memcg_destroy_kmem_caches(struct mem_cgroup *);
 
+/*
+ * Function prototypes passed to kmem_cache_setup_mobility() to enable
+ * mobile objects and targeted reclaim in slab caches.
+ */
+
+/**
+ * typedef kmem_cache_isolate_func - Object migration callback function.
+ * @s: The cache we are working on.
+ * @ptr: Pointer to an array of pointers to the objects to isolate.
+ * @nr: Number of objects in @ptr array.
+ *
+ * The purpose of kmem_cache_isolate_func() is to pin each object so that
+ * they cannot be freed until kmem_cache_migrate_func() has processed
+ * them. This may be accomplished by increasing the refcount or setting
+ * a flag.
+ *
+ * The object pointer array passed is also passed to
+ * kmem_cache_migrate_func().  The function may remove objects from the
+ * array by setting pointers to %NULL. This is useful if we can
+ * determine that an object is being freed because
+ * kmem_cache_isolate_func() was called when the subsystem was calling
+ * kmem_cache_free().  In that case it is not necessary to increase the
+ * refcount or specially mark the object because the release of the slab
+ * lock will lead to the immediate freeing of the object.
+ *
+ * Context: Called with locks held so that the slab objects cannot be
+ *  freed.  We are in an atomic context and no slab operations
+ *  may be performed.
+ * Return: A pointer that is passed to the migrate function. If any
+ * objects cannot be touched at this point then the pointer may
+ * indicate a failure and then the migration function can simply
+ * remove the references that were already obtained. The private
+ * data could be used to track the objects that were already pinned.
+ */
+typedef void *kmem_cache_isolate_func(struct kmem_cache *s, void **ptr, int 
nr);
+
+/**
+ * typedef kmem_cache_migrate_func - Object migration callback function.
+ * @s: The cache we are working on.
+ * @ptr: Pointer to an array of pointers to the objects to migrate.
+ * @nr: Number of objects in @ptr array.
+ * @node: The NUMA node where the object should be allocated.
+ * @private: The pointer returned by kmem_cache_isolate_func().
+ *
+ * This function is responsible for migrating objects.  Typically, for
+ * each object in the input array you will want to allocate an new
+ * object, copy the original object, update any pointers, and free the
+ * old object.
+ *
+ * After this function returns all pointers to the old object should now
+ * point to the new object.
+ *
+ * Context: Called with no locks held and interrupts enabled.  Sleeping
+ *  is possible.  Any operation may be performed.
+ */
+typedef void kmem_cache_migrate_func(struct kmem_cache *s, void **ptr,
+int nr, int node, void *private);
+
+/*
+ * kmem_cache_setup_mobility() is used to setup callbacks for a slab cache.
+ */
+#ifdef CONFIG_SLUB
+void kmem_cache_setup_mobility(struct kmem_cache *, kmem_cache_isolate_func,
+  kmem_cache_migrate_func);
+#else
+static inline void
+kmem_cache_setup_mobility(struct kmem_cache *s, kmem_cache_isolate_func 
isolate,
+ kmem_cache_migrate_func migrate) {}
+#endif
+
 /*
  * Please use this macro to create slab caches. Simply specify the
  * name of the structure and maybe some flags that are listed above.
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index d2153789bd9f..2879a2f5f8eb 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -99,6 +99,9 @@ struct kmem_cache {
gfp_t allocflags;   /* gfp flags to use on each alloc */
int refcount;   /* Refcount for slab cache destroy */
void (*ctor)(void *);
+   kmem_cache_isolate_func *isolate

[RFC PATCH v4 00/15] Slab Movable Objects (SMO)

2019-04-29 Thread Tobin C. Harding
Hi,

Another iteration of the SMO patch set, updates to this version are
restricted to the dcache patch #14.

Applies on top of Linus' tree (tag: v5.1-rc6).

This is a patch set implementing movable objects within the SLUB
allocator.  This is work based on Christopher Lameter's patch set:

 https://lore.kernel.org/patchwork/project/lkml/list/?series=377335

The original code logic is from that set and implemented by Christopher.
Clean up, refactoring, documentation, and additional features by myself.
Responsibility for any bugs remaining falls solely with myself.

Changes to this version:

Re-write the dcache Slab Movable Objects isolate/migrate functions.
Based on review/suggestions by Alexander on the last version.

In this version the isolate function loops over the object vector and
builds a shrink list for all objects that have refcount==0 AND are NOT
on anyone else's shrink list.  A pointer to this list is returned from
the isolate function and passed to the migrate function (by the SMO
infrastructure).  The dentry migration function d_partial_shrink()
simply calls shrink_dentry_list() on the received shrink list pointer
and frees the memory associated with the list_head.

Hopefully if this is all ok I can move on to violating the inode
slab cache :)

FWIW testing on a VM in Qemu brings this mild benefit to the dentry slab
cache with no _apparent_ negatives.

CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y
CONFIG_SLUB_CPU_PARTIAL=y
CONFIG_SLUB_DEBUG_ON=y
CONFIG_SLUB_STATS=y
CONFIG_SMO_NODE=y
CONFIG_DCACHE_SMO=y

[root@vm ~]# slabinfo  dentry -r | head -n 13

Slabcache: dentry   Aliases:  0 Order :  1 Objects: 38585
** Reclaim accounting active
** Defragmentation at 30%

Sizes (bytes) Slabs  DebugMemory

Object : 192  Total  :2582   Sanity Checks : On   Total: 21151744
SlabObj: 528  Full   :2547   Redzoning : On   Used : 7408320
SlabSiz:8192  Partial:  35   Poisoning : On   Loss : 13743424
Loss   : 336  CpuSlab:   0   Tracking  : On   Lalig: 12964560
Align  :   8  Objects:  15   Tracing   : Off  Lpadd:  702304

[root@vm ~]# slabinfo  dentry --shrink
[root@vm ~]# slabinfo  dentry -r | head -n 13

Slabcache: dentry   Aliases:  0 Order :  1 Objects: 38426
** Reclaim accounting active
** Defragmentation at 30%

Sizes (bytes) Slabs  DebugMemory

Object : 192  Total  :2578   Sanity Checks : On   Total: 21118976
SlabObj: 528  Full   :2547   Redzoning : On   Used : 7377792
SlabSiz:8192  Partial:  31   Poisoning : On   Loss : 13741184
Loss   : 336  CpuSlab:   0   Tracking  : On   Lalig: 12911136
Align  :   8  Objects:  15   Tracing   : Off  Lpadd:  701216


Please note, this dentry shrink implementation is 'best effort'; results
vary.  This is as expected.  We are trying to unobtrusively shrink
the dentry cache.

thanks,
Tobin.


Tobin C. Harding (15):
  slub: Add isolate() and migrate() methods
  tools/vm/slabinfo: Add support for -C and -M options
  slub: Sort slab cache list
  slub: Slab defrag core
  tools/vm/slabinfo: Add remote node defrag ratio output
  tools/vm/slabinfo: Add defrag_used_ratio output
  tools/testing/slab: Add object migration test module
  tools/testing/slab: Add object migration test suite
  xarray: Implement migration function for objects
  tools/testing/slab: Add XArray movable objects tests
  slub: Enable moving objects to/from specific nodes
  slub: Enable balancing slabs across nodes
  dcache: Provide a dentry constructor
  dcache: Implement partial shrink via Slab Movable Objects
  dcache: Add CONFIG_DCACHE_SMO

 Documentation/ABI/testing/sysfs-kernel-slab |  14 +
 fs/dcache.c | 110 ++-
 include/linux/slab.h|  71 ++
 include/linux/slub_def.h|  10 +
 lib/radix-tree.c|  13 +
 lib/xarray.c|  49 ++
 mm/Kconfig  |  14 +
 mm/slab_common.c|   2 +-
 mm/slub.c   | 819 ++--
 tools/testing/slab/Makefile |  10 +
 tools/testing/slab/slub_defrag.c| 567 ++
 tools/testing/slab/slub_defrag.py   | 451 +++
 tools/testing/slab/slub_defrag_xarray.c | 211 +
 tools/vm/slabinfo.c |  51 +-
 14 files changed, 2299 insertions(+), 93 deletions(-)
 create mode 100644 tools/testing/slab/Makefile
 create mode 100644 tools/testing/slab/slub_defrag.c
 create mode 100755 tools/testing/slab/slub_defrag.py
 create mode 100644 tools/testing/slab/slub_defrag_xarray.c

-- 
2.21.0



[PATCH RESEND] sched/cpufreq: Fix kobject memleak

2019-04-29 Thread Tobin C. Harding
Currently error return from kobject_init_and_add() is not followed by a
call to kobject_put().  This means there is a memory leak.

Add call to kobject_put() in error path of kobject_init_and_add().

Signed-off-by: Tobin C. Harding 
---

Resend with SOB tag.

 kernel/sched/cpufreq_schedutil.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 5c41ea367422..3638d2377e3c 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -771,6 +771,7 @@ static int sugov_init(struct cpufreq_policy *policy)
return 0;
 
 fail:
+   kobject_put(&tunables->attr_set.kobj);
policy->governor_data = NULL;
sugov_tunables_free(tunables);
 
-- 
2.21.0



[PATCH 2/2] livepatch: Use correct kobject cleanup function

2019-04-29 Thread Tobin C. Harding
The correct cleanup function after a call to kobject_init_and_add() has
succeeded is kobject_del() _not_ kobject_put().  kobject_del() calls
kobject_put().

Use correct cleanup function when removing a kobject.

Signed-off-by: Tobin C. Harding 
---
 kernel/livepatch/core.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 98a7bec41faa..4cce6bb6e073 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -589,9 +589,8 @@ static void __klp_free_funcs(struct klp_object *obj, bool 
nops_only)
 
list_del(&func->node);
 
-   /* Might be called from klp_init_patch() error path. */
if (func->kobj_added) {
-   kobject_put(&func->kobj);
+   kobject_del(&func->kobj);
} else if (func->nop) {
klp_free_func_nop(func);
}
@@ -625,9 +624,8 @@ static void __klp_free_objects(struct klp_patch *patch, 
bool nops_only)
 
list_del(&obj->node);
 
-   /* Might be called from klp_init_patch() error path. */
if (obj->kobj_added) {
-   kobject_put(&obj->kobj);
+   kobject_del(&obj->kobj);
} else if (obj->dynamic) {
klp_free_object_dynamic(obj);
}
@@ -676,7 +674,7 @@ static void klp_free_patch_finish(struct klp_patch *patch)
 * cannot get enabled again.
 */
if (patch->kobj_added) {
-   kobject_put(&patch->kobj);
+   kobject_del(&patch->kobj);
wait_for_completion(&patch->finish);
}
 
-- 
2.21.0



[PATCH 0/2] livepatch: Fix usage of kobject_init_and_add()

2019-04-29 Thread Tobin C. Harding
Hi,

Currently there are a few places in kernel/livepatch/ which do not
correctly use kobject_init_and_add().

An error return from kobject_init_and_add() requires a call to
kobject_put().

The cleanup function after a successful call to kobject_init_and_add()
is kobject_del(). 
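
For clarity, a minimal sketch of the pattern this set moves the code
towards (names are illustrative only):

  ret = kobject_init_and_add(kobj, ktype, parent, "%s", name);
  if (ret) {
          /* Failed: drop the reference, never kfree() the kobject. */
          kobject_put(kobj);
          return ret;
  }
  ...
  /* Unwinding after a successful kobject_init_and_add(). */
  kobject_del(kobj);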

This set is part of an effort to check/fix all callsites of
kobject_init_and_add().


This set fixes all callsites under kernel/livepatch/


thanks,
Tobin.


Tobin C. Harding (2):
  livepatch: Fix kobject memleak
  livepatch: Use correct kobject cleanup function

 kernel/livepatch/core.c | 20 
 1 file changed, 12 insertions(+), 8 deletions(-)

-- 
2.21.0



[PATCH 1/2] livepatch: Fix kobject memleak

2019-04-29 Thread Tobin C. Harding
Currently error return from kobject_init_and_add() is not followed by a
call to kobject_put().  This means there is a memory leak.

Add call to kobject_put() in error path of kobject_init_and_add().

Signed-off-by: Tobin C. Harding 
---
 kernel/livepatch/core.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index eb0ee10a1981..98a7bec41faa 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -727,7 +727,9 @@ static int klp_init_func(struct klp_object *obj, struct 
klp_func *func)
ret = kobject_init_and_add(&func->kobj, &klp_ktype_func,
   &obj->kobj, "%s,%lu", func->old_name,
   func->old_sympos ? func->old_sympos : 1);
-   if (!ret)
+   if (ret)
+   kobject_put(&func->kobj);
+   else
func->kobj_added = true;
 
return ret;
@@ -803,8 +805,10 @@ static int klp_init_object(struct klp_patch *patch, struct 
klp_object *obj)
name = klp_is_module(obj) ? obj->name : "vmlinux";
ret = kobject_init_and_add(&obj->kobj, &klp_ktype_object,
   &patch->kobj, "%s", name);
-   if (ret)
+   if (ret) {
+   kobject_put(&obj->kobj);
return ret;
+   }
obj->kobj_added = true;
 
klp_for_each_func(obj, func) {
@@ -862,8 +866,10 @@ static int klp_init_patch(struct klp_patch *patch)
 
ret = kobject_init_and_add(&patch->kobj, &klp_ktype_patch,
   klp_root_kobj, "%s", patch->mod->name);
-   if (ret)
+   if (ret) {
+   kobject_put(&patch->kobj);
return ret;
+   }
patch->kobj_added = true;
 
if (patch->replace) {
-- 
2.21.0



[PATCH] sched/cpufreq: Fix kobject memleak

2019-04-29 Thread Tobin C. Harding
Currently error return from kobject_init_and_add() is not followed by a
call to kobject_put().  This means there is a memory leak.

Add call to kobject_put() in error path of kobject_init_and_add().
---
 kernel/sched/cpufreq_schedutil.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 5c41ea367422..3638d2377e3c 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -771,6 +771,7 @@ static int sugov_init(struct cpufreq_policy *policy)
return 0;
 
 fail:
+   kobject_put(&tunables->attr_set.kobj);
policy->governor_data = NULL;
sugov_tunables_free(tunables);
 
-- 
2.21.0



Re: memleak around kobject_init_and_add()

2019-04-28 Thread Tobin C. Harding
On Mon, Apr 29, 2019, at 02:15, Greg Kroah-Hartman wrote:
> On Sun, Apr 28, 2019 at 11:19:57AM +1000, Tobin C. Harding wrote:
> > On Sat, Apr 27, 2019 at 09:28:09PM +0200, Greg Kroah-Hartman wrote:
> > > On Sat, Apr 27, 2019 at 06:13:30PM +1000, Tobin C. Harding wrote:
> > > > (Note at bottom on reasons for 'To' list 'Cc' list)
> > > > 
> > > > Hi,
> > > > 
> > > > kobject_init_and_add() seems to be routinely misused.  A failed call to 
> > > > this
> > > > function requires a call to kobject_put() otherwise we leak memory.
> > > > 
> > > > Examples memleaks can be seen in:
> > > > 
> > > > mm/slub.c
> > > > fs/btrfs/sysfs.c
> > > > fs/xfs/xfs_sysfs.h: xfs_sysfs_init()
> > > > 
> > > >  Question: Do we fix the misuse or fix the API?
> > > 
> > > Fix the misuse.
> > 
> > Following on from this.  It seems we often also forget to call
> > kobject_uevent() after calls to kobject_init_and_add().
> 
> Are you sure?  Usually if you don't call it right away, it happens much
> later when you have everything "ready to go" to tell userspace that it
> then can access that kobject successfully.
> 
> Any specific places you feel is not correct?
> 
> > Before I make a goose of myself patching the whole tree is there ever
> > any reason why we would _not_ want to call kobject_uevent() after
> > successfully calling kobject_add() (or kobject_init_and_add())?
> 
> You should always do so, but again, sometimes it can be much "later"
> after everything is properly set up.
> 
> Ok, at quick glance I see some places that do not properly call this.
> But, those places should not even be using a "raw" kobject in the first
> place, they should be using 'struct device'.  If code using a kobject,
> that should be very "rare", and not normal behavior in the first place.

Cool, thanks.


Re: memleak around kobject_init_and_add()

2019-04-27 Thread Tobin C. Harding
On Sat, Apr 27, 2019 at 09:28:09PM +0200, Greg Kroah-Hartman wrote:
> On Sat, Apr 27, 2019 at 06:13:30PM +1000, Tobin C. Harding wrote:
> > (Note at bottom on reasons for 'To' list 'Cc' list)
> > 
> > Hi,
> > 
> > kobject_init_and_add() seems to be routinely misused.  A failed call to this
> > function requires a call to kobject_put() otherwise we leak memory.
> > 
> > Examples memleaks can be seen in:
> > 
> > mm/slub.c
> > fs/btrfs/sysfs.c
> > fs/xfs/xfs_sysfs.h: xfs_sysfs_init()
> > 
> >  Question: Do we fix the misuse or fix the API?
> 
> Fix the misuse.

Following on from this.  It seems we often also forget to call
kobject_uevent() after calls to kobject_init_and_add().  Before I make a
goose of myself patching the whole tree is there ever any reason why we
would _not_ want to call kobject_uevent() after successfully calling
kobject_add() (or kobject_init_and_add())?

Cheers,
Tobin.


[PATCH] kobject: Improve docs for kobject_add/del

2019-04-27 Thread Tobin C. Harding
There is currently some confusion on how to wind back
kobject_init_and_add() during the error paths in code that uses this
function.

Add documentation to kobject_add() and kobject_del() to help clarify the
usage.

Signed-off-by: Tobin C. Harding 
---

The assumption is that this is the correct usage, and that's what I've
tried to document.  Is this correct?

void fn(void)
{
int ret;

ret = kobject_init_and_add(kobj, ktype, NULL, "foo");
if (ret) {
kobject_put(kobj);
return -1;
}

ret = some_init_fn();
if (ret)
goto err;

ret = some_other_init_fn();
if (ret)
goto other_err;

kobject_uevent(kobj, KOBJ_ADD);
return 0;

other_err:
other_clean_up_fn();
err:
kobject_del(kobj);
return ret;
}

thanks,
Tobin.

 lib/kobject.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/lib/kobject.c b/lib/kobject.c
index aa89edcd2b63..b2670671977b 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -397,15 +397,19 @@ static __printf(3, 0) int kobject_add_varg(struct kobject 
*kobj,
  * is assigned to the kobject, then the kobject will be located in the
  * root of the sysfs tree.
  *
- * If this function returns an error, kobject_put() must be called to
- * properly clean up the memory associated with the object.
- * Under no instance should the kobject that is passed to this function
- * be directly freed with a call to kfree(), that can leak memory.
- *
  * Note, no "add" uevent will be created with this call, the caller should set
  * up all of the necessary sysfs files for the object and then call
  * kobject_uevent() with the UEVENT_ADD parameter to ensure that
  * userspace is properly notified of this kobject's creation.
+ *
+ * Return: If this function returns an error, kobject_put() must be
+ * called to properly clean up the memory associated with the
+ * object.  Under no instance should the kobject that is passed
+ * to this function be directly freed with a call to kfree(),
+ * that can leak memory.
+ *
+ * If this call returns successfully and you later need to unwind
+ * kobject_add() for the error path you should call kobject_del().
  */
 int kobject_add(struct kobject *kobj, struct kobject *parent,
const char *fmt, ...)
@@ -580,6 +584,9 @@ EXPORT_SYMBOL_GPL(kobject_move);
 /**
  * kobject_del - unlink kobject from hierarchy.
  * @kobj: object.
+ *
+ * This is the function that should be called to delete an object
+ * successfully added via kobject_add().
  */
 void kobject_del(struct kobject *kobj)
 {
-- 
2.21.0


