Re: Fw: [PATCH] NUMA Slab Allocator

2005-03-16 Thread Manfred Spraul
Martin J. Bligh wrote:
> That'd be my inclination, but OTOH, we do that for pagecache OK.

The page cache doesn't have a global hash table.

> Dunno, I'm torn. Depends if there's locality on the file access or not, I guess.
> Is there any *harm* in doing it node local, perhaps creating a node
> mem pressure imbalance? (OTOH, there's loads of stuff that does that anyway ;-))

The harm is slower kmem_cache_free and a lower hit ratio for the per-cpu
caches: kmem_cache_free must identify and return wrong-node objects, and
because of these returns the per-cpu array is more often empty in
kmem_cache_alloc.

IIRC someone from SGI wrote that they have seen bad performance in
fork-bomb tests on systems with large cpu counts, which might be caused by
inter-node traffic on the mm_struct structure, and that they think a
NUMA-aware allocator would help. As far as I know, no tests were done to
verify that assumption.

--
   Manfred


Re: Fw: [PATCH] NUMA Slab Allocator

2005-03-16 Thread Martin J. Bligh
> Do you have profile data from your modification? Which percentage of the 
> allocations is node-local, which percentage is from foreign nodes? Preferably 
> per-cache. It shouldn't be difficult to add statistics counters to your patch.
> And: Can you estimate which percentage is really accessed node-local and 
> which percentage are long-living structures that are accessed from all cpus 
> in the system?
> I had discussions with guys from IBM and SGI regarding a numa allocator, and 
> we decided that we need profile data before we can decide if we need one:
> - A node-local allocator reduces the inter-node traffic, because the callers 
> get node-local memory
> - A node-local allocator increases the inter-node traffic, because objects 
> that are kfree'd on the wrong node must be returned to their home node.

One of the big problems is that much of the slab data really is more global
(i.e. dentries, inodes, etc.). Some of it is more localized (typically the 
kmalloc style stuff). I can't really generate any data easily, as most
of my NUMA boxes are either small Opterons / midsized PPC64, which have 
a fairly low NUMA factor, or large ia32, which only has kernel mem on 
node 0 ;-(

> IIRC the conclusion from our discussion was that there are at least four 
> possible implementations:
> - your version
> - Add a second per-cpu array for off-node allocations. __cache_free batches, 
> free_block then returns. Global spinlock or per-node spinlock. A patch with a 
> global spinlock is in
> http://www.colorfullife.com/~manfred/Linux-kernel/slab/patch-slab-numa-2.5.66
> per-node spinlocks would require a restructuring of free_block.
> - Add per-node array for each cpu for wrong node allocations. Allows very 
> fast batch return: each array contains memory just from one node, useful if 
> per-node spinlocks are used.
> - do nothing. Least overhead within slab.
> 
> I'm fairly certain that "do nothing" is the right answer for some caches. 
> For example the dentry-cache: The object lifetime is seconds to minutes, 
> the objects are stored in a global hashtable. They will be touched from 
> all cpus in the system, thus guaranteeing that kmem_cache_alloc returns 
> node-local memory won't help. But the added overhead within slab.c will hurt.

That'd be my inclination, but OTOH, we do that for pagecache OK. Dunno,
I'm torn. Depends if there's locality on the file access or not, I guess.
Is there any *harm* in doing it node local, perhaps creating a node
mem pressure imbalance? (OTOH, there's loads of stuff that does that anyway ;-))

The other thing that needs serious thought is how we balance reclaim pressure.

M.



Re: Fw: [PATCH] NUMA Slab Allocator

2005-03-16 Thread Manfred Spraul
Hi Christoph,
Do you have profile data from your modification? Which percentage of the 
allocations is node-local, which percentage is from foreign nodes? 
Preferably per-cache. It shouldn't be difficult to add statistics 
counters to your patch.
And: Can you estimate which percentage is really accessed node-local 
and which percentage are long-living structures that are accessed from 
all cpus in the system?
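A rough sketch of what such counters could look like (everything below is
illustrative, not code from your patch: node_local_allocs and
node_foreign_allocs would be new atomic_t fields in kmem_cache_t, and the
macros mirror the existing STATS_INC_* style in mm/slab.c):

#ifdef CONFIG_NUMA
/* Hypothetical per-cache counters, modelled on the STATS_INC_* macros. */
#define STATS_INC_NODE_LOCAL(x)		atomic_inc(&(x)->node_local_allocs)
#define STATS_INC_NODE_FOREIGN(x)	atomic_inc(&(x)->node_foreign_allocs)
#else
#define STATS_INC_NODE_LOCAL(x)		do { } while (0)
#define STATS_INC_NODE_FOREIGN(x)	do { } while (0)
#endif

/* Call this on the object that __cache_alloc() is about to return: */
static inline void count_alloc_locality(kmem_cache_t *cachep, void *objp)
{
	struct slab *slabp = GET_PAGE_SLAB(virt_to_page(objp));

	if (slabp->nodeid == numa_node_id())
		STATS_INC_NODE_LOCAL(cachep);
	else
		STATS_INC_NODE_FOREIGN(cachep);
}

The numbers could then be exported per-cache, e.g. alongside the existing
STATS-only counters in /proc/slabinfo.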
I had discussions with guys from IBM and SGI regarding a numa allocator, 
and we decided that we need profile data before we can decide if we need 
one:
- A node-local allocator reduces the inter-node traffic, because the 
callers get node-local memory
- A node-local allocator increases the inter-node traffic, because 
objects that are kfree'd on the wrong node must be returned to their 
home node.

static inline void __cache_free (kmem_cache_t *cachep, void* objp)
{
 struct array_cache *ac = ac_data(cachep);
+ struct slab *slabp;
 check_irq_off();
 objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));
- if (likely(ac->avail < ac->limit)) {
+ /* Make sure we are not freeing a object from another
+  * node to the array cache on this cpu.
+  */
+ slabp = GET_PAGE_SLAB(virt_to_page(objp));
 

This lookup is quite slow and should only be performed in NUMA builds,
not in non-NUMA builds. Some kind of wrapper is required.
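For example, something like this invented helper, which makes the check
disappear on non-NUMA builds:

#ifdef CONFIG_NUMA
static inline int slab_obj_off_node(void *objp)
{
	struct slab *slabp = GET_PAGE_SLAB(virt_to_page(objp));

	return slabp->nodeid != numa_node_id();
}
#else
static inline int slab_obj_off_node(void *objp)
{
	return 0;	/* UMA build: every object is node-local */
}
#endif

__cache_free() would then test if (unlikely(slab_obj_off_node(objp))), and
the non-NUMA build keeps its current fast path.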

+ if(unlikely(slabp->nodeid != numa_node_id())) {
+  STATS_INC_FREEMISS(cachep);
+  int nodeid = slabp->nodeid;
+  spin_lock(&(cachep->nodelists[nodeid])->list_lock);
 

This line is very dangerous: every free of a wrong-node object causes a 
spin_lock operation. I fear that the cache line traffic for the spinlock 
might kill the performance for some workloads. I personally think that 
batching is required, i.e. each cpu stores wrong-node objects in a 
separate per-cpu array, and the objects are then returned as a block to 
their home node.
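Roughly along these lines, as a sketch only (ac_foreign() is a hypothetical
accessor for such a second per-cpu array; free_block() and ac_entry() are
the existing mm/slab.c helpers, and the global cachep->spinlock is assumed
rather than the per-node lock from your patch):

/* Sketch: collect wrong-node objects per cpu and return them in batches,
 * so the lock is taken once per batch instead of once per object.
 */
static void cache_free_offnode(kmem_cache_t *cachep, void *objp)
{
	struct array_cache *foreign = ac_foreign(cachep);	/* hypothetical */

	if (unlikely(foreign->avail == foreign->limit)) {
		spin_lock(&cachep->spinlock);
		free_block(cachep, ac_entry(foreign), foreign->avail);
		spin_unlock(&cachep->spinlock);
		foreign->avail = 0;
	}
	ac_entry(foreign)[foreign->avail++] = objp;
}

The limit of that array would control how much wrong-node memory a cpu may
hold before it is pushed back.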

-/*
- * NUMA: different approach needed if the spinlock is moved into
- * the l3 structure
 

You have moved the cache spinlock into the l3 structure. Have you 
compared both approaches?
A global spinlock has the advantage that batching is possible in 
free_block: Acquire global spinlock, return objects to all nodes in the 
system, release spinlock. A node-local spinlock would mean less 
contention [multiple spinlocks instead of one global lock], but far more 
spin_lock/unlock calls.
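As a rough illustration of that trade-off (return_obj_to_node() is a made-up
stand-in for the per-object slab-list bookkeeping that free_block() performs;
nodelists[]->list_lock refers to the per-node lock from your patch):

/* Global lock: one critical section covers a batch with objects from
 * any node.
 */
static void drain_batch_global_lock(kmem_cache_t *cachep, void **batch, int nr)
{
	int i;

	spin_lock(&cachep->spinlock);
	for (i = 0; i < nr; i++)
		return_obj_to_node(cachep, batch[i]);
	spin_unlock(&cachep->spinlock);
}

/* Per-node locks: less contention, but in the worst case one
 * lock/unlock pair per object, unless free_block is restructured to
 * sort the batch by home node first.
 */
static void drain_batch_pernode_locks(kmem_cache_t *cachep, void **batch, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		int node = GET_PAGE_SLAB(virt_to_page(batch[i]))->nodeid;

		spin_lock(&cachep->nodelists[node]->list_lock);
		return_obj_to_node(cachep, batch[i]);
		spin_unlock(&cachep->nodelists[node]->list_lock);
	}
}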

IIRC the conclusion from our discussion was that there are at least 
four possible implementations:
- your version
- Add a second per-cpu array for off-node allocations. __cache_free 
batches, free_block then returns. Global spinlock or per-node spinlock. 
A patch with a global spinlock is in
http://www.colorfullife.com/~manfred/Linux-kernel/slab/patch-slab-numa-2.5.66
per-node spinlocks would require a restructuring of free_block.
- Add per-node array for each cpu for wrong node allocations. Allows 
very fast batch return: each array contains memory just from one node, 
useful if per-node spinlocks are used.
- do nothing. Least overhead within slab.

I'm fairly certain that "do nothing" is the right answer for some 
caches. For example the dentry-cache: The object lifetime is seconds to 
minutes, the objects are stored in a global hashtable. They will be 
touched from all cpus in the system, thus guaranteeing that 
kmem_cache_alloc returns node-local memory won't help. But the added 
overhead within slab.c will hurt.

--
   Manfred


[PATCH] NUMA Slab Allocator

2005-03-15 Thread Christoph Lameter
This is a NUMA slab allocator. It creates slabs on multiple nodes and
manages slabs in such a way that locality of allocations is optimized.
Each node has its own list of partial, free and full slabs. All object
allocations for a node occur from node specific slab lists.
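
For illustration only, a hypothetical caller of the kmalloc_node() interface
added by this patch (the structure and identifiers are invented; note the
argument order is kmalloc_node(size, node, flags) as defined below):

#include <linux/init.h>
#include <linux/slab.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/nodemask.h>

/* Invented example: one control structure per node, allocated on that
 * node so that later accesses from local cpus stay on-node.
 */
struct example_ctl {
	spinlock_t lock;
	struct list_head queue;
};

static struct example_ctl *example_ctl[MAX_NUMNODES];

static int __init example_init(void)
{
	int node;

	for_each_online_node(node) {
		example_ctl[node] = kmalloc_node(sizeof(struct example_ctl),
						 node, GFP_KERNEL);
		if (!example_ctl[node])
			return -ENOMEM;
		spin_lock_init(&example_ctl[node]->lock);
		INIT_LIST_HEAD(&example_ctl[node]->queue);
	}
	return 0;
}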

Signed-off-by: Alok N Kataria <[EMAIL PROTECTED]>
Signed-off-by: Shobhit Dayal <[EMAIL PROTECTED]>
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>

Index: linux-2.6.11/include/linux/slab.h
===
--- linux-2.6.11.orig/include/linux/slab.h  2005-03-15 14:47:12.567040608 -0800
+++ linux-2.6.11/include/linux/slab.h   2005-03-15 14:47:19.290018560 -0800
@@ -63,11 +63,11 @@ extern int kmem_cache_destroy(kmem_cache
 extern int kmem_cache_shrink(kmem_cache_t *);
 extern void *kmem_cache_alloc(kmem_cache_t *, int);
 #ifdef CONFIG_NUMA
-extern void *kmem_cache_alloc_node(kmem_cache_t *, int);
+extern void *kmem_cache_alloc_node(kmem_cache_t *, int, int);
 #else
-static inline void *kmem_cache_alloc_node(kmem_cache_t *cachep, int node)
+static inline void *kmem_cache_alloc_node(kmem_cache_t *cachep, int node, int gfp_mask)
 {
-   return kmem_cache_alloc(cachep, GFP_KERNEL);
+   return kmem_cache_alloc(cachep, gfp_mask);
 }
 #endif
 extern void kmem_cache_free(kmem_cache_t *, void *);
@@ -81,6 +81,33 @@ struct cache_sizes {
 };
 extern struct cache_sizes malloc_sizes[];
 extern void *__kmalloc(size_t, int);
+extern void *__kmalloc_node(size_t, int, int);
+
+/*
+ * A new interface to allow allocating memory on a specific node.
+ */
+static inline void *kmalloc_node(size_t size, int node, int flags)
+{
+   if (__builtin_constant_p(size)) {
+   int i = 0;
+#define CACHE(x) \
+   if (size <= x) \
+   goto found; \
+   else \
+   i++;
+#include "kmalloc_sizes.h"
+#undef CACHE
+   {
+   extern void __you_cannot_kmalloc_that_much(void);
+   __you_cannot_kmalloc_that_much();
+   }
+found:
+   return kmem_cache_alloc_node((flags & GFP_DMA) ?
+   malloc_sizes[i].cs_dmacachep :
+   malloc_sizes[i].cs_cachep, node, flags);
+   }
+   return __kmalloc_node(size, node, flags);
+}

 static inline void *kmalloc(size_t size, int flags)
 {
Index: linux-2.6.11/mm/slab.c
===
--- linux-2.6.11.orig/mm/slab.c 2005-03-15 14:47:12.567040608 -0800
+++ linux-2.6.11/mm/slab.c  2005-03-15 16:17:27.242884760 -0800
@@ -75,6 +75,15 @@
  *
  * At present, each engine can be growing a cache.  This should be blocked.
  *
+ * 15 March 2005. NUMA slab allocator.
+ * Shobhit Dayal <[EMAIL PROTECTED]>
+ * Alok N Kataria <[EMAIL PROTECTED]>
+ *
+ * Modified the slab allocator to be node aware on NUMA systems.
+ * Each node has its own list of partial, free and full slabs.
+ * All object allocations for a node occur from node specific slab lists.
+ * Created a new interface called kmalloc_node() for allocating memory from
+ * a specific node.
  */

 #include   
@@ -92,7 +101,7 @@
 #include   
 #include   
 #include   
-
+#include   
 #include   
 #include   
 #include   
@@ -210,6 +219,7 @@ struct slab {
void*s_mem; /* including colour offset */
unsigned intinuse;  /* num of objs active in slab */
kmem_bufctl_t   free;
+   unsigned short  nodeid;
 };

 /*
@@ -278,21 +288,58 @@ struct kmem_list3 {
int free_touched;
unsigned long   next_reap;
struct array_cache  *shared;
+   spinlock_t  list_lock;
+   unsigned intfree_limit;
 };

+/*
+ * Need this for bootstrapping a per node allocator.
+ */
+#define NUM_INIT_LISTS 3
+struct kmem_list3 __initdata initkmem_list3[NUM_INIT_LISTS];
+struct kmem_list3 __initdata kmem64_list3[MAX_NUMNODES];
+
 #define LIST3_INIT(parent) \
-   { \
-   .slabs_full = LIST_HEAD_INIT(parent.slabs_full), \
-   .slabs_partial  = LIST_HEAD_INIT(parent.slabs_partial), \
-   .slabs_free = LIST_HEAD_INIT(parent.slabs_free) \
-   }
+   do {\
+   INIT_LIST_HEAD(&(parent)->slabs_full);  \
+   INIT_LIST_HEAD(&(parent)->slabs_partial);   \
+   INIT_LIST_HEAD(&(parent)->slabs_free);  \
+   (parent)->shared = NULL; \
+   (parent)->list_lock = SPIN_LOCK_UNLOCKED;   \
+   (parent)->free_objects = 0; \
+   (parent)->free_touched = 0; \
+   } while(0)
+
+#define MAKE_LIST(cachep, listp, slab, nodeid) \
+   do {\
+   if(list_empty(&(cachep->nodelists[nodeid]->slab)))  \
+   INIT_LIS