Re: [PATCH 03/29] mm: slb: add knowledge of reserve pages

2007-12-15 Thread Daniel Phillips
On Friday 14 December 2007 14:51, I wrote:
> On Friday 14 December 2007 07:39, Peter Zijlstra wrote:
> Note that false sharing of slab pages is still possible between two
> unrelated writeout processes, both of which obey rules for their own
> writeout path, but the pinned combination does not.  This still
> leaves a hole through which a deadlock may slip.

Actually, no it doesn't.  In fact it does not matter how many unrelated 
writeout processes, block devices, or whatever else share a slab cache.  
Sufficient reserve pages must have been made available (in a perfect 
world, by adding extra pages to the memalloc reserve on driver 
initialization; in the real world, just by having a big memalloc 
reserve) to populate the slab cache up to the sum of the required 
objects for all memalloc users sharing it.

So I think this slab technique of yours is fundamentally sound, that is 
to say, adding a new per-slab flag that keeps the unbounded number of 
slab objects with unbounded lifetimes from mixing with the bounded 
number of slab objects with bounded lifetimes.
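
In code, that one-bit classification amounts to something like the 
following sketch.  The names are invented here; the real check in your 
patch is the c->reserve test in __slab_alloc() against 
gfp_to_alloc_flags():

/*
 * Sketch only: a slab page allocated below the watermarks is marked as
 * a reserve slab, and only contexts themselves entitled to the reserve
 * may take objects from it.  Everyone else is pushed back to the page
 * allocator so the watermarks actually get tested.
 */
struct sketch_slab {
	void *freelist;
	int reserve;		/* page came from the memalloc reserve */
};

static void *sketch_alloc_object(struct sketch_slab *slab, int entitled)
{
	void *object;

	if (slab->reserve && !entitled)
		return NULL;	/* caller must grow a new, ordinary slab */

	object = slab->freelist;
	if (object)
		slab->freelist = *(void **)object;	/* pop the freelist */
	return object;
}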

Ponder.  OK, here is another issue.  Suppose a driver expands the 
memalloc reserve by the X pages it needs on initialization and shrinks 
it by the same amount on removal, as is right and proper.  The problem 
is that fewer slab pages than were pulled into the slab on behalf of 
the removed driver may be freed (or made freeable) back to the global 
reserve, because some of those pages are shared with an unrelated 
user.  In theory, the global reserve could be completely depleted by 
this slab fragmentation.

OK, that is like the case I mistakenly raised in the previous mail, 
though far less likely to occur: driver removals are relatively rare, 
and rarer still would be a fragmentation case severe enough to deplete 
the global reserve.

Even so, if this possibility bothers anybody, it is fairly easy to plug 
the hole: associate each slab with a given memalloc user instead of 
just having one bit to classify users.  So unrelated memalloc users 
would never share a slab, no false sharing, everybody happy.  The cost: 
a new pointer field per slab and a few additional lines of code.
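
A sketch of what that association might look like, extending the 
earlier sketch; the struct, field and helper names are again invented 
for illustration, not taken from your patches:

/*
 * Hypothetical refinement: tag each reserve slab with the memalloc user
 * it was populated for, so unrelated reserve users never share a slab
 * page.  Ordinary slabs keep owner == NULL and behave exactly as before.
 */
struct sketch_slab {
	void *freelist;
	int reserve;			/* page came from the reserve */
	struct mem_reserve *owner;	/* which memalloc user, if any */
};

static int sketch_slab_usable(struct sketch_slab *slab,
			      struct mem_reserve *user)
{
	if (!slab->reserve)
		return 1;	/* ordinary slab: anyone may use it */
	/* reserve slab: only its owning memalloc user may take objects */
	return slab->owner == user;
}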

Regards,

Daniel


Re: [PATCH 03/29] mm: slb: add knowledge of reserve pages

2007-12-14 Thread Daniel Phillips
On Friday 14 December 2007 07:39, Peter Zijlstra wrote:
> Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to
> allocation contexts that are entitled to it. This is done to ensure
> reserve pages don't leak out and get consumed.

Tighter definitions of "leak out" and "get consumed" would be helpful 
here.  As I see it, the chain of reasoning is:

  * Any MEMALLOC mode allocation must have come from a properly
    throttled path and must have a finite lifetime that eventually
    produces writeout progress.

  * Since the transaction that made the allocation was throttled and
    has a finite lifetime, we know that it must eventually return the
    resources it consumed to the appropriate resource pool.

Now, I think what you mean by "get consumed" and "leak out" is: "become 
pinned by false sharing with other allocations that do not guarantee 
that they will be returned to the resource pool".  We can say "pinned" 
for short.

So you are attempting to prevent slab pages from becoming pinned by 
users that do not obey the reserve management rules, which I think your 
approach achieves.  However...

Note that false sharing of slab pages is still possible between two 
unrelated writeout processes, both of which obey rules for their own 
writeout path, but the pinned combination does not.  This still leaves 
a hole through which a deadlock may slip.

My original solution was simply to allocate a full page when drawing 
from the memalloc reserve, which may use a tad more reserve, but makes 
it possible to prove the algorithm correct.
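
Roughly like this, as a sketch; the function names are made up here, 
and the reserve entitlement is assumed to come from the caller's 
PF_MEMALLOC context:

/*
 * Sketch of the full-page approach: a memalloc writeout path draws a
 * whole page from the reserve rather than a slab object, so no foreign
 * object can ever share, and thereby pin, that memory.  Wasteful for
 * small objects, but the accounting is exact: one allocation, one page.
 */
static void *reserve_alloc_buffer(gfp_t gfpflags)
{
	/*
	 * The caller runs with PF_MEMALLOC set, so the page allocator will
	 * dip below the watermarks for this request if it has to.
	 */
	return (void *)__get_free_page(gfpflags);
}

static void reserve_free_buffer(void *buf)
{
	free_page((unsigned long)buf);	/* the page goes straight back */
}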

Regards,

Daniel


[PATCH 03/29] mm: slb: add knowledge of reserve pages

2007-12-14 Thread Peter Zijlstra
Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
contexts that are entitled to it. This is done to ensure reserve pages don't
leak out and get consumed.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/linux/slub_def.h |    1 
 mm/slab.c                |   59 +++
 mm/slub.c                |   27 -
 3 files changed, 72 insertions(+), 15 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -21,11 +21,12 @@
 #include 
 #include 
 #include 
+#include "internal.h"
 
 /*
  * Lock order:
  *   1. slab_lock(page)
- *   2. slab->list_lock
+ *   2. node->list_lock
  *
  *   The slab_lock protects operations on the object of a particular
  *   slab and its metadata in the page struct. If the slab lock
@@ -1071,7 +1072,7 @@ static void setup_object(struct kmem_cac
 }
 
 static noinline struct page *new_slab(struct kmem_cache *s,
-   gfp_t flags, int node)
+   gfp_t flags, int node, int *reserve)
 {
struct page *page;
struct kmem_cache_node *n;
@@ -1087,6 +1088,7 @@ static noinline struct page *new_slab(st
if (!page)
goto out;
 
+   *reserve = page->reserve;
n = get_node(s, page_to_nid(page));
if (n)
atomic_long_inc(&n->nr_slabs);
@@ -1517,11 +1519,12 @@ static noinline unsigned long get_new_sl
 {
struct kmem_cache_cpu *c = *pc;
struct page *page;
+   int reserve;
 
if (gfpflags & __GFP_WAIT)
local_irq_enable();
 
-   page = new_slab(s, gfpflags, node);
+   page = new_slab(s, gfpflags, node, &reserve);
 
if (gfpflags & __GFP_WAIT)
local_irq_disable();
@@ -1530,6 +1533,7 @@ static noinline unsigned long get_new_sl
return 0;
 
*pc = c = get_cpu_slab(s, smp_processor_id());
+   c->reserve = reserve;
if (c->page)
flush_slab(s, c);
c->page = page;
@@ -1564,6 +1568,16 @@ static void *__slab_alloc(struct kmem_ca
local_irq_save(flags);
preempt_enable_no_resched();
 #endif
+   if (unlikely(c->reserve)) {
+   /*
+* If the current slab is a reserve slab and the current
+* allocation context does not allow access to the reserves we
+* must force an allocation to test the current levels.
+*/
+   if (!(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
+   goto grow_slab;
+   }
+
if (likely(c->page)) {
state = slab_lock(c->page);
 
@@ -1586,7 +1600,7 @@ load_freelist:
 */
VM_BUG_ON(c->page->freelist == c->page->end);
 
-   if (unlikely(state & SLABDEBUG))
+   if (unlikely((state & SLABDEBUG) || c->reserve))
goto debug;
 
object = c->page->freelist;
@@ -1615,7 +1629,7 @@ grow_slab:
 /* Perform debugging */
 debug:
object = c->page->freelist;
-   if (!alloc_debug_processing(s, c->page, object, addr))
+   if ((state & SLABDEBUG) && !alloc_debug_processing(s, c->page, object, addr))
goto another_slab;
 
c->page->inuse++;
@@ -2156,10 +2170,11 @@ static struct kmem_cache_node *early_kme
struct page *page;
struct kmem_cache_node *n;
unsigned long flags;
+   int reserve;
 
BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
 
-   page = new_slab(kmalloc_caches, gfpflags, node);
+   page = new_slab(kmalloc_caches, gfpflags, node, &reserve);
 
BUG_ON(!page);
if (page_to_nid(page) != node) {
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h
+++ linux-2.6/include/linux/slub_def.h
@@ -18,6 +18,7 @@ struct kmem_cache_cpu {
unsigned int offset;/* Freepointer offset (in word units) */
unsigned int objsize;   /* Size of an object (from kmem_cache) */
unsigned int objects;   /* Objects per slab (from kmem_cache) */
+   int reserve;/* Did the current page come from the reserve */
 };
 
 struct kmem_cache_node {
Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c
+++ linux-2.6/mm/slab.c
@@ -115,6 +115,8 @@
 #include   
 #include   
 
+#include   "internal.h"
+
 /*
 * DEBUG   - 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON.
  *   0 for faster, smaller code (especially in the critical paths).
@@ -265,7 +267,8 @@ struct array_cache {
unsigned int avail;
unsigned int limit;
unsigned int batchcount;
-   unsigned int touched;
+   unsigned int touched:1,
+