Hi Christoph,
Do you have profile data from your modification? What percentage of the allocations are node-local, and what percentage comes from foreign nodes? Preferably per-cache. It shouldn't be difficult to add statistics counters to your patch.
And: can you estimate what percentage is really accessed node-locally, and what percentage consists of long-lived structures that are accessed from all cpus in the system?
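To make the counter idea concrete: something like the sketch below, modelled on the existing STATS_INC_* macros in mm/slab.c. Just a sketch; the struct, field and macro names (numa_stats, NUMA_STATS_INC, ...) are made up, not part of your patch.

/* Sketch only: per-cache counters for node-local vs. foreign objects.
 * All names are invented; the idea is the same as the existing
 * STATS_INC_* counters in mm/slab.c.
 */
struct numa_stats {
        unsigned long alloc_local;      /* kmem_cache_alloc returned node-local memory */
        unsigned long alloc_foreign;    /* object came from another node */
        unsigned long free_local;       /* object freed on its home node */
        unsigned long free_foreign;     /* object freed on a foreign node */
};

#ifdef CONFIG_NUMA
#define NUMA_STATS_INC(cachep, field)   ((cachep)->numa_stats.field++)
#else
#define NUMA_STATS_INC(cachep, field)   do { } while (0)
#endif

The counters could then be dumped per cache, e.g. through /proc/slabinfo or a separate debug file, which would answer the first question directly.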
I had discussions with guys from IBM and SGI regarding a NUMA allocator, and we decided that we need profile data before we can decide whether we need one:
- A node-local allocator reduces the inter-node traffic, because the callers get node-local memory
- A node-local allocator increases the inter-node traffic, because objects that are kfree'd on the wrong node must be returned to their home node.
> static inline void __cache_free (kmem_cache_t *cachep, void* objp)
> {
>         struct array_cache *ac = ac_data(cachep);
> +       struct slab *slabp;
>
>         check_irq_off();
>         objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));
>
> -       if (likely(ac->avail < ac->limit)) {
> +       /* Make sure we are not freeing a object from another
> +        * node to the array cache on this cpu.
> +        */
> +       slabp = GET_PAGE_SLAB(virt_to_page(objp));
This line is quite slow, and it should be performed only for NUMA builds, not for non-NUMA builds. Some kind of wrapper is required.
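What I mean by a wrapper is something along these lines; just a sketch, and objp_to_nodeid() is a made-up name:

/* Sketch: hide the page lookup behind a helper so that non-NUMA builds
 * pay nothing for it.  objp_to_nodeid() is an invented name.
 */
#ifdef CONFIG_NUMA
static inline int objp_to_nodeid(void *objp)
{
        struct slab *slabp = GET_PAGE_SLAB(virt_to_page(objp));

        return slabp->nodeid;
}
#else
static inline int objp_to_nodeid(void *objp)
{
        return numa_node_id();  /* always "local"; the wrong-node branch
                                 * in __cache_free compiles away */
}
#endif

__cache_free would then test objp_to_nodeid(objp) != numa_node_id(), and the whole branch disappears on !CONFIG_NUMA builds.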
> +       if(unlikely(slabp->nodeid != numa_node_id())) {
> +               STATS_INC_FREEMISS(cachep);
> +               int nodeid = slabp->nodeid;
> +               spin_lock(&(cachep->nodelists[nodeid])->list_lock);
This spin_lock is very dangerous: every wrong-node allocation causes a spin_lock operation when the object is freed. I fear that the cache line traffic for the spinlock might kill the performance for some workloads. I personally think that batching is required, i.e. each cpu stores wrong-node objects in a separate per-cpu array, and the objects are then returned as a block to their home node.
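What I have in mind with batching is roughly the following. Again only a sketch: struct foreign_cache, the cachep->foreign[] pointer and cache_drain_foreign() are all invented names.

/* Sketch: collect wrong-node objects in a small per-cpu array and take
 * the remote lock only once per batch.  All names are invented.
 */
#define FOREIGN_BATCHCOUNT      16

struct foreign_cache {
        unsigned int avail;
        void *objs[FOREIGN_BATCHCOUNT];
};

static void cache_drain_foreign(kmem_cache_t *cachep, struct foreign_cache *fc);

static inline void cache_free_foreign(kmem_cache_t *cachep, void *objp)
{
        struct foreign_cache *fc = cachep->foreign[smp_processor_id()];

        if (fc->avail == FOREIGN_BATCHCOUNT)
                cache_drain_foreign(cachep, fc);        /* one lock round-trip */
        fc->objs[fc->avail++] = objp;
}

The drain is where the global vs. node-local spinlock question shows up, see below.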
> -/*
> - * NUMA: different approach needed if the spinlock is moved into
> - * the l3 structure
You have moved the cache spinlock into the l3 structure. Have you compared both approaches? A global spinlock has the advantage that batching is possible in free_block: acquire the global spinlock, return objects to all nodes in the system, release the spinlock. A node-local spinlock would mean less contention [multiple spinlocks instead of one global lock], but far more spin_lock/unlock calls.
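With the global spinlock the drain stays trivial and keeps the batching. A sketch, reusing the invented names from above together with the existing free_block() and cachep->spinlock (free_block already expects to be called with that lock held):

/* Sketch: return a whole batch of wrong-node objects under one
 * acquisition of the existing global per-cache spinlock.
 */
static void cache_drain_foreign(kmem_cache_t *cachep, struct foreign_cache *fc)
{
        if (!fc->avail)
                return;

        spin_lock(&cachep->spinlock);
        free_block(cachep, fc->objs, fc->avail);
        spin_unlock(&cachep->spinlock);
        fc->avail = 0;
}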
IIRC the conclusion from our discussion was that there are at least four possible implementations:
- your version
- Add a second per-cpu array for off-node allocations. __cache_free batches, free_block then returns. Global spinlock or per-node spinlock. A patch with a global spinlock is in
http://www.colorfullife.com/~manfred/Linux-kernel/slab/patch-slab-numa-2.5.66
Per-node spinlocks would require a restructuring of free_block.
- Add per-node arrays for each cpu for wrong-node allocations. Allows very fast batch return: each array contains memory from just one node, useful if per-node spinlocks are used (see the sketch after this list).
- do nothing. Least overhead within slab.
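For the per-node-array variant, the per-cpu state and the free path could look roughly like this. Everything here (struct names, the cachep->foreign_pernode field) is invented; only nodelists[]->list_lock is taken from your patch.

/* Sketch for the per-node-array variant: one small array per
 * (cpu, node) pair.  Every object in one array belongs to the same
 * node, so a drain needs exactly one per-node list_lock.
 */
struct foreign_node_cache {
        unsigned int avail;
        void *objs[FOREIGN_BATCHCOUNT];
};

static inline void cache_free_foreign_pernode(kmem_cache_t *cachep,
                                              void *objp, int nodeid)
{
        struct foreign_node_cache *fnc =
                &cachep->foreign_pernode[smp_processor_id()][nodeid];

        if (fnc->avail == FOREIGN_BATCHCOUNT) {
                spin_lock(&cachep->nodelists[nodeid]->list_lock);
                /* return fnc->objs[0..avail) to the slabs of 'nodeid' */
                spin_unlock(&cachep->nodelists[nodeid]->list_lock);
                fnc->avail = 0;
        }
        fnc->objs[fnc->avail++] = objp;
}

The price is NR_CPUS*MAX_NUMNODES small arrays per cache, so this only pays off for caches that actually see lots of wrong-node frees.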
I'm fairly certain that "do nothing" is the right answer for some caches. For example the dentry cache: the object lifetime is seconds to minutes, and the objects are stored in a global hashtable. They will be touched from all cpus in the system, so guaranteeing that kmem_cache_alloc returns node-local memory won't help, but the added overhead within slab.c will hurt.
--
    Manfred

