Re: [PATCH 03/29] mm: slb: add knowledge of reserve pages
On Friday 14 December 2007 14:51, I wrote:
> On Friday 14 December 2007 07:39, Peter Zijlstra wrote:
>
> Note that false sharing of slab pages is still possible between two
> unrelated writeout processes, both of which obey rules for their own
> writeout path, but the pinned combination does not. This still
> leaves a hole through which a deadlock may slip.

Actually, no it doesn't. It in fact does not matter how many unrelated writeout
processes, block devices, or whatevers share a slab cache. Sufficient reserve
pages must have been made available (in a perfect world, by adding extra pages
to the memalloc reserve on driver initialization; in the real world, just by
having a big memalloc reserve) to populate the slab up to the sum of the
required objects for all memalloc users sharing the cache.

So I think this slab technique of yours is fundamentally sound, that is to
say, adding a new per-slab flag to keep unbounded numbers of slab objects with
unbounded lifetimes from mixing with the bounded number of slab objects with
bounded lifetimes.

Ponder.

OK, here is another issue. Suppose a driver expands the memalloc reserve by
the X number of pages it needs on initialization, and shrinks it by the same
amount on removal, as is right and proper. The problem is that fewer slab
pages may be freed (or made freeable) back to the global reserve than were
pulled into the slab on behalf of the removed driver, due to page sharing with
an unrelated user. In theory, the global reserve could be completely depleted
by this slab fragmentation.

OK, that is like the case I mistakenly raised in the previous mail, though far
less likely to occur, because driver removals are relatively rare, as would be
a fragmentation case severe enough to cause global reserve depletion. Even so,
if this possibility bothers anybody, it is fairly easy to plug the hole:
associate each slab with a given memalloc user instead of just having one bit
to classify users.

So unrelated memalloc users would never share a slab, no false sharing,
everybody happy. The cost: a new pointer field per slab and a few additional
lines of code.

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH 03/29] mm: slb: add knowledge of reserve pages
On Friday 14 December 2007 07:39, Peter Zijlstra wrote:
> Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to
> allocation contexts that are entitled to it. This is done to ensure
> reserve pages don't leak out and get consumed.

Tighter definitions of "leak out" and "get consumed" would be helpful here.
As I see it, the chain of reasoning is:

 * Any MEMALLOC mode allocation must have come from a properly throttled
   path and has a finite lifetime that must eventually produce writeout
   progress.

 * Since the transaction that made the allocation was throttled and must
   have a finite lifetime, we know that it must eventually return the
   resources it consumed to the appropriate resource pool.

Now, I think what you mean by "get consumed" and "leak out" is: "become
pinned by false sharing with other allocations that do not guarantee that
they will be returned to the resource pool". We can say "pinned" for short.

So you are attempting to prevent slab pages from becoming pinned by users
that do not obey the reserve management rules, which I think your approach
achieves. However...

Note that false sharing of slab pages is still possible between two
unrelated writeout processes, both of which obey rules for their own
writeout path, but the pinned combination does not. This still leaves a
hole through which a deadlock may slip.

My original solution was simply to allocate a full page when drawing from
the memalloc reserve, which may use a tad more reserve, but makes it
possible to prove the algorithm correct.

Regards,

Daniel
[PATCH 03/29] mm: slb: add knowledge of reserve pages
Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
contexts that are entitled to it. This is done to ensure reserve pages
don't leak out and get consumed.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/linux/slub_def.h |    1 
 mm/slab.c                |   59 +++
 mm/slub.c                |   27 -
 3 files changed, 72 insertions(+), 15 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -21,11 +21,12 @@
 #include
 #include
 #include
+#include "internal.h"

 /*
  * Lock order:
  *  1. slab_lock(page)
- *  2. slab->list_lock
+ *  2. node->list_lock
  *
  * The slab_lock protects operations on the object of a particular
  * slab and its metadata in the page struct. If the slab lock
@@ -1071,7 +1072,7 @@ static void setup_object(struct kmem_cac
 }

 static noinline struct page *new_slab(struct kmem_cache *s,
-		gfp_t flags, int node)
+		gfp_t flags, int node, int *reserve)
 {
 	struct page *page;
 	struct kmem_cache_node *n;
@@ -1087,6 +1088,7 @@ static noinline struct page *new_slab(st
 	if (!page)
 		goto out;

+	*reserve = page->reserve;
 	n = get_node(s, page_to_nid(page));
 	if (n)
 		atomic_long_inc(&n->nr_slabs);
@@ -1517,11 +1519,12 @@ static noinline unsigned long get_new_sl
 {
 	struct kmem_cache_cpu *c = *pc;
 	struct page *page;
+	int reserve;

 	if (gfpflags & __GFP_WAIT)
 		local_irq_enable();

-	page = new_slab(s, gfpflags, node);
+	page = new_slab(s, gfpflags, node, &reserve);

 	if (gfpflags & __GFP_WAIT)
 		local_irq_disable();
@@ -1530,6 +1533,7 @@ static noinline unsigned long get_new_sl
 		return 0;

 	*pc = c = get_cpu_slab(s, smp_processor_id());
+	c->reserve = reserve;
 	if (c->page)
 		flush_slab(s, c);
 	c->page = page;
@@ -1564,6 +1568,16 @@ static void *__slab_alloc(struct kmem_ca
 	local_irq_save(flags);
 	preempt_enable_no_resched();
 #endif
+	if (unlikely(c->reserve)) {
+		/*
+		 * If the current slab is a reserve slab and the current
+		 * allocation context does not allow access to the reserves we
+		 * must force an allocation to test the current levels.
+		 */
+		if (!(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
+			goto grow_slab;
+	}
+
 	if (likely(c->page)) {
 		state = slab_lock(c->page);

@@ -1586,7 +1600,7 @@ load_freelist:
 	 */
 	VM_BUG_ON(c->page->freelist == c->page->end);

-	if (unlikely(state & SLABDEBUG))
+	if (unlikely((state & SLABDEBUG) || c->reserve))
 		goto debug;

 	object = c->page->freelist;
@@ -1615,7 +1629,7 @@ grow_slab:
 /* Perform debugging */
 debug:
 	object = c->page->freelist;
-	if (!alloc_debug_processing(s, c->page, object, addr))
+	if ((state & SLABDEBUG) && !alloc_debug_processing(s, c->page, object, addr))
 		goto another_slab;

 	c->page->inuse++;
@@ -2156,10 +2170,11 @@ static struct kmem_cache_node *early_kme
 	struct page *page;
 	struct kmem_cache_node *n;
 	unsigned long flags;
+	int reserve;

 	BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));

-	page = new_slab(kmalloc_caches, gfpflags, node);
+	page = new_slab(kmalloc_caches, gfpflags, node, &reserve);

 	BUG_ON(!page);
 	if (page_to_nid(page) != node) {
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h
+++ linux-2.6/include/linux/slub_def.h
@@ -18,6 +18,7 @@ struct kmem_cache_cpu {
 	unsigned int offset;	/* Freepointer offset (in word units) */
 	unsigned int objsize;	/* Size of an object (from kmem_cache) */
 	unsigned int objects;	/* Objects per slab (from kmem_cache) */
+	int reserve;		/* Did the current page come from the reserve */
 };

 struct kmem_cache_node {
Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c
+++ linux-2.6/mm/slab.c
@@ -115,6 +115,8 @@
 #include
 #include

+#include "internal.h"
+
 /*
  * DEBUG	- 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON.
  *	  0 for faster, smaller code (especially in the critical paths).
@@ -265,7 +267,8 @@ struct array_cache {
 	unsigned int avail;
 	unsigned int limit;
 	unsigned int batchcount;
-	unsigned int touched;
+	unsigned int touched:1,