Re: [patch 02/10] SLUB: slab defragmentation and kmem_cache_vacate

Mel Gorman Mon, 21 May 2007 07:12:21 -0700

On (18/05/07 11:10), [EMAIL PROTECTED] didst pronounce:
> Slab defragmentation occurs when the slabs are shrunk (after inode, dentry
> shrinkers have been run from the reclaim code) or when a manual shrinking
> is requested via slabinfo. During the shrink operation SLUB will generate a
> list of partially populated slabs sorted by the number of objects in use.
> 
> We extract pages off that list that are only filled less than a quarter and
> attempt to motivate the users of those slabs to either remove the objects
> or move the objects.
>


I know I brought up this "less than a quarter" thing before and I
haven't thought of a better alternative. However, it occurs to be that
shrink_slab() is called when there is awareness of a reclaim priority.
It may be worth passing that down so that the fraction of candidates
pages is calculated based on priority.

That said..... where is kmem_cache_shrink() ever called? The freeing of
slab pages seems to be indirect these days. Way back,
kmem_cache_shrink() used to be called directly but I'm not sure where it
happens now.

> Targeted reclaim allows to target a single slab for reclaim. This is done by
> calling
> 
> kmem_cache_vacate(page);
> 
> It will return 1 on success, 0 if the operation failed.
> 
> 
> In order for a slabcache to support defragmentation a couple of functions
> must be defined via kmem_cache_ops. These are
> 
> void *get(struct kmem_cache *s, int nr, void **objects)
> 
>       Must obtain a reference to the listed objects. SLUB guarantees that
>       the objects are still allocated. However, other threads may be blocked
>       in slab_free attempting to free objects in the slab. These may succeed
>       as soon as get() returns to the slab allocator. The function must
>       be able to detect the situation and void the attempts to handle such
>       objects (by for example voiding the corresponding entry in the objects
>       array).
> 
>       No slab operations may be performed in get_reference(). Interrupts
>       are disabled. What can be done is very limited. The slab lock
>       for the page with the object is taken. Any attempt to perform a slab
>       operation may lead to a deadlock.
> 
>       get() returns a private pointer that is passed to kick. Should we
>       be unable to obtain all references then that pointer may indicate
>       to the kick() function that it should not attempt any object removal
>       or move but simply remove the reference counts.
> 

Much clearer than before.

> void kick(struct kmem_cache *, int nr, void **objects, void *get_result)
> 
>       After SLUB has established references to the objects in a
>       slab it will drop all locks and then use kick() to move objects out
>       of the slab. The existence of the object is guaranteed by virtue of
>       the earlier obtained references via get(). The callback may perform
>       any slab operation since no locks are held at the time of call.
> 
>       The callback should remove the object from the slab in some way. This
>       may be accomplished by reclaiming the object and then running
>       kmem_cache_free() or reallocating it and then running
>       kmem_cache_free(). Reallocation is advantageous because the partial
>       slabs were just sorted to have the partial slabs with the most objects
>       first. Allocation is likely to result in filling up a slab so that
>       it can be removed from the partial list.
> 
>       Kick() does not return a result. SLUB will check the number of
>       remaining objects in the slab. If all objects were removed then
>       we know that the operation was successful.
> 

Again, much clearer.

> If a kmem_cache_vacate on a page fails then the slab has usually a pretty
> low usage ratio. Go through the slab and resequence the freelist so that
> object addresses increase as we allocate objects. This will trigger the
> cacheline prefetcher when we start allocating from the slab again and
> thereby increase allocations speed.
> 

Nice idea.

> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
> 
> ---
>  include/linux/slab.h |   31 +++++
>  mm/slab.c            |    9 +
>  mm/slob.c            |    9 +
>  mm/slub.c            |  264 
> +++++++++++++++++++++++++++++++++++++++++++++++++--
>  4 files changed, 303 insertions(+), 10 deletions(-)
> 
> Index: slub/include/linux/slab.h
> ===================================================================
> --- slub.orig/include/linux/slab.h    2007-05-18 00:13:39.000000000 -0700
> +++ slub/include/linux/slab.h 2007-05-18 00:13:40.000000000 -0700
> @@ -39,6 +39,36 @@ void __init kmem_cache_init(void);
>  int slab_is_available(void);
>  
>  struct kmem_cache_ops {
> +     /*
> +      * Called with slab lock held and interrupts disabled.
> +      * No slab operation may be performed.
> +      *
> +      * Parameters passed are the number of objects to process
> +      * and a an array of pointers to objects for which we
> +      * need references.
> +      *

s/a an/an/

> +      * Returns a pointer that is passed to the kick function.
> +      * If all objects cannot be moved then the pointer may
> +      * indicate that this wont work and then kick can simply
> +      * remove the references that were already obtained.
> +      *
> +      * The array passed to get() is also passed to kick(). The
> +      * function may remove objects by setting array elements to NULL.
> +      */
> +     void *(*get)(struct kmem_cache *, int nr, void **);
> +
> +     /*
> +      * Called with no locks held and interrupts enabled.
> +      * Any operation may be performed in kick().
> +      *
> +      * Parameters passed are the number of objects in the array,
> +      * the array of pointers to the objects and the pointer
> +      * returned by get().
> +      *
> +      * Success is checked by examining the number of remaining
> +      * objects in the slab.
> +      */
> +     void (*kick)(struct kmem_cache *, int nr, void **, void *private);
>  };
>  
>  struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
> @@ -53,6 +83,7 @@ void kmem_cache_free(struct kmem_cache *
>  unsigned int kmem_cache_size(struct kmem_cache *);
>  const char *kmem_cache_name(struct kmem_cache *);
>  int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
> +int kmem_cache_vacate(struct page *);
>  
>  /*
>   * Please use this macro to create slab caches. Simply specify the
> Index: slub/mm/slub.c
> ===================================================================
> --- slub.orig/mm/slub.c       2007-05-18 00:13:39.000000000 -0700
> +++ slub/mm/slub.c    2007-05-18 09:55:47.000000000 -0700
> @@ -1043,12 +1043,11 @@ static struct page *new_slab(struct kmem
>       n = get_node(s, page_to_nid(page));
>       if (n)
>               atomic_long_inc(&n->nr_slabs);
> +
> +     page->inuse = 0;
> +     page->lockless_freelist = NULL;
>       page->offset = s->offset / sizeof(void *);
>       page->slab = s;
> -     page->flags |= 1 << PG_slab;
> -     if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
> -                     SLAB_STORE_USER | SLAB_TRACE))
> -             SetSlabDebug(page);
>  
>       start = page_address(page);
>       end = start + s->objects * s->size;
> @@ -1066,11 +1065,20 @@ static struct page *new_slab(struct kmem
>       set_freepointer(s, last, NULL);
>  
>       page->freelist = start;
> -     page->lockless_freelist = NULL;
> -     page->inuse = 0;
> -out:
> -     if (flags & __GFP_WAIT)
> -             local_irq_disable();
> +
> +     /*
> +      * page->inuse must be 0 when PageSlab(page) becomes
> +      * true so that defrag knows that this slab is not in use.
> +      */
> +     smp_wmb();
> +     __SetPageSlab(page);
> +     if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
> +                     SLAB_STORE_USER | SLAB_TRACE))
> +             SetSlabDebug(page);
> +
> + out:
> +     if (flags & __GFP_WAIT)
> +             local_irq_disable();
>       return page;
>  }
>  
> @@ -2323,6 +2331,191 @@ void kfree(const void *x)
>  EXPORT_SYMBOL(kfree);
>  
>  /*
> + * Order the freelist so that addresses increase as object are allocated.
> + * This is useful to trigger the cpu cacheline prefetching logic.
> + */

makes sense. However, it occurs to me that maybe this should be a
separate patch so it can be measured to be sure. It makes sense though.

> +void resequence_freelist(struct kmem_cache *s, struct page *page)
> +{
> +     void *p;
> +     void *last;
> +     void *addr = page_address(page);
> +     DECLARE_BITMAP(map, s->objects);
> +
> +     bitmap_zero(map, s->objects);
> +
> +     /* Figure out which objects are on the freelist */
> +     for_each_free_object(p, s, page->freelist)
> +             set_bit(slab_index(p, s, addr), map);
> +
> +     last = NULL;
> +     for_each_object(p, s, addr)
> +             if (test_bit(slab_index(p, s, addr), map)) {
> +                     if (last)
> +                             set_freepointer(s, last, p);
> +                     else
> +                             page->freelist = p;
> +                     last = p;
> +             }
> +
> +     if (last)
> +             set_freepointer(s, last, NULL);
> +     else
> +             page->freelist = NULL;
> +}
> +
> +/*
> + * Vacate all objects in the given slab.
> + *
> + * Slab must be locked and frozen. Interrupts are disabled (flags must
> + * be passed).
> + *

It may not hurt to have a VM_BUG_ON() if interrupts are still enabled when
this is called

> + * Will drop and regain and drop the slab lock. At the end the slab will
> + * either be freed or returned to the partial lists.
> + *
> + * Returns the number of remaining objects
> + */
> +static int __kmem_cache_vacate(struct kmem_cache *s,
> +             struct page *page, unsigned long flags, void **vector)
> +{
> +     void *p;
> +     void *addr = page_address(page);
> +     DECLARE_BITMAP(map, s->objects);
> +     int leftover;
> +     int objects;
> +     void *private;
> +
> +     if (!page->inuse)
> +             goto out;
> +
> +     /* Determine used objects */
> +     bitmap_fill(map, s->objects);
> +     for_each_free_object(p, s, page->freelist)
> +             __clear_bit(slab_index(p, s, addr), map);
> +
> +     objects = 0;
> +     memset(vector, 0, s->objects * sizeof(void **));
> +     for_each_object(p, s, addr) {
> +             if (test_bit(slab_index(p, s, addr), map))
> +                     vector[objects++] = p;
> +     }
> +
> +     private = s->ops->get(s, objects, vector);
> +
> +     /*
> +      * Got references. Now we can drop the slab lock. The slab
> +      * is frozen so it cannot vanish from under us nor will
> +      * allocations be performed on the slab. However, unlocking the
> +      * slab will allow concurrent slab_frees to proceed.
> +      */
> +     slab_unlock(page);
> +     local_irq_restore(flags);

I recognise that you want to restore interrupts as early as possible but
it should be noted somewhere that kmem_cache_vacate() disables
interrupts and __kmem_cache_vacate() enabled them again. I had to go
searching to see where interrupts are enabled again.

Maybe even a slab_lock_irq() and slab_unlock_irq() would clarify things
a little.

> +
> +     /*
> +      * Perform the KICK callbacks to remove the objects.
> +      */
> +     s->ops->kick(s, objects, vector, private);
> +
> +     local_irq_save(flags);
> +     slab_lock(page);
> +out:
> +     /*
> +      * Check the result and unfreeze the slab
> +      */
> +     leftover = page->inuse;
> +     if (leftover > 0)
> +             /*
> +              * Cannot free. Lets at least optimize the freelist. We have
> +              * likely touched all the cachelines with the free pointers
> +              * already so it is cheap to do here.
> +              */
> +             resequence_freelist(s, page);
> +     unfreeze_slab(s, page);
> +     local_irq_restore(flags);
> +     return leftover;
> +}
> +
> +/*
> + * Get a page off a list and freeze it. Must be holding slab lock.
> + */
> +static void freeze_from_list(struct kmem_cache *s, struct page *page)
> +{
> +     if (page->inuse < s->objects)
> +             remove_partial(s, page);
> +     else if (s->flags & SLAB_STORE_USER)
> +             remove_full(s, page);
> +     SetSlabFrozen(page);
> +}
> +
> +/*
> + * Attempt to free objects in a page. Return 1 if succesful.
> + */
> +int kmem_cache_vacate(struct page *page)
> +{
> +     unsigned long flags;
> +     struct kmem_cache *s;
> +     int vacated = 0;
> +     void **vector = NULL;
> +
> +     /*
> +      * Get a reference to the page. Return if its freed or being freed.
> +      * This is necessary to make sure that the page does not vanish
> +      * from under us before we are able to check the result.
> +      */
> +     if (!get_page_unless_zero(page))
> +             return 0;
> +
> +     if (!PageSlab(page))
> +             goto out;
> +
> +     s = page->slab;
> +     if (!s)
> +             goto out;
> +
> +     vector = kmalloc(s->objects * sizeof(void *), GFP_KERNEL);
> +     if (!vector)
> +             return 0;

Is it worth logging this event, returning -ENOMEM or something so that
callers are aware of why kmem_cache_vacate() failed in this instance?

Also.. we have called get_page_unless_zero() but if we are out of memory
here, where have we called put_page()? Maybe we should be "goto out"
here with a

if (vector)
        kfree(vector);

> +
> +     local_irq_save(flags);
> +     /*
> +      * The implicit memory barrier in slab_lock guarantees that page->inuse
> +      * is loaded after PageSlab(page) has been established to be true. This 
> is
> +      * only revelant for a  newly created slab.
> +      */
> +     slab_lock(page);
> +
> +     /*
> +      * We may now have locked a page that may be in various stages of
> +      * being freed. If the PageSlab bit is off then we have already
> +      * reached the page allocator. If page->inuse is zero then we are
> +      * in SLUB but freeing or allocating the page.
> +      * page->inuse is never modified without the slab lock held.
> +      *
> +      * Also abort if the page happens to be already frozen. If its
> +      * frozen then a concurrent vacate may be in progress.
> +      */
> +     if (!PageSlab(page) || SlabFrozen(page) || !page->inuse)
> +             goto out_locked;
> +
> +     /*
> +      * We are holding a lock on a slab page and all operations on the
> +      * slab are blocking.
> +      */
> +     if (!s->ops->get || !s->ops->kick)
> +             goto out_locked;
> +     freeze_from_list(s, page);
> +     vacated = __kmem_cache_vacate(s, page, flags, vector) == 0;

That is a little funky looking. This may be nicer;

vacated = __kmem_cache_vacate(s, page, flags, vector);
out:
....
return vacated == 0;

> +out:
> +     put_page(page);
> +     kfree(vector);
> +     return vacated;
> +out_locked:
> +     slab_unlock(page);
> +     local_irq_restore(flags);
> +     goto out;
> +
> +}
> +
> +/*
>   * kmem_cache_shrink removes empty slabs from the partial lists and sorts
>   * the remaining slabs by the number of items in use. The slabs with the
>   * most items in use come first. New allocations will then fill those up
> @@ -2337,11 +2530,12 @@ int kmem_cache_shrink(struct kmem_cache 
>       int node;
>       int i;
>       struct kmem_cache_node *n;
> -     struct page *page;
> +     struct page *page, *page2;
>       struct page *t;
>       struct list_head *slabs_by_inuse =
>               kmalloc(sizeof(struct list_head) * s->objects, GFP_KERNEL);
>       unsigned long flags;
> +     LIST_HEAD(zaplist);
>  
>       if (!slabs_by_inuse)
>               return -ENOMEM;
> @@ -2392,8 +2586,44 @@ int kmem_cache_shrink(struct kmem_cache 
>               for (i = s->objects - 1; i >= 0; i--)
>                       list_splice(slabs_by_inuse + i, n->partial.prev);
>  
> +             /*
> +              * If we have no functions available to defragment the slabs
> +              * then we are done.
> +              */
> +             if (!s->ops->get || !s->ops->kick)
> +                     goto out;
> +
> +             /* Take objects with just a few objects off the tail */
> +             while (n->nr_partial > MAX_PARTIAL) {
> +                     page = container_of(n->partial.prev, struct page, lru);
> +
> +                     /*
> +                      * We are holding the list_lock so we can only
> +                      * trylock the slab
> +                      */
> +                     if (page->inuse > s->objects / 4)
> +                             break;
> +
> +                     if (!slab_trylock(page))
> +                             break;
> +
> +                     list_move_tail(&page->lru, &zaplist);
> +                     n->nr_partial--;
> +                     SetSlabFrozen(page);
> +                     slab_unlock(page);
> +             }
>       out:
>               spin_unlock_irqrestore(&n->list_lock, flags);
> +
> +             /* Now we can free objects in the slabs on the zaplist */
> +             list_for_each_entry_safe(page, page2, &zaplist, lru) {
> +                     unsigned long flags;
> +
> +                     local_irq_save(flags);
> +                     slab_lock(page);
> +                     __kmem_cache_vacate(s, page, flags,
> +                                     (void **)slabs_by_inuse);
> +             }
>       }
>  
>       kfree(slabs_by_inuse);
> @@ -3229,6 +3459,20 @@ static ssize_t ops_show(struct kmem_cach
>               x += sprint_symbol(buf + x, (unsigned long)s->ctor);
>               x += sprintf(buf + x, "\n");
>       }
> +
> +     if (s->ops->get) {
> +             x += sprintf(buf + x, "get : ");
> +             x += sprint_symbol(buf + x,
> +                             (unsigned long)s->ops->get);
> +             x += sprintf(buf + x, "\n");
> +     }
> +
> +     if (s->ops->kick) {
> +             x += sprintf(buf + x, "kick : ");
> +             x += sprint_symbol(buf + x,
> +                             (unsigned long)s->ops->kick);
> +             x += sprintf(buf + x, "\n");
> +     }
>       return x;
>  }
>  SLAB_ATTR_RO(ops);
> Index: slub/mm/slab.c
> ===================================================================
> --- slub.orig/mm/slab.c       2007-05-18 00:13:39.000000000 -0700
> +++ slub/mm/slab.c    2007-05-18 00:13:40.000000000 -0700
> @@ -2516,6 +2516,15 @@ int kmem_cache_shrink(struct kmem_cache 
>  }
>  EXPORT_SYMBOL(kmem_cache_shrink);
>  
> +/*
> + * SLAB does not support slab defragmentation
> + */
> +int kmem_cache_vacate(struct page *page)
> +{
> +     return 0;
> +}
> +EXPORT_SYMBOL(kmem_cache_vacate);
> +
>  /**
>   * kmem_cache_destroy - delete a cache
>   * @cachep: the cache to destroy
> Index: slub/mm/slob.c
> ===================================================================
> --- slub.orig/mm/slob.c       2007-05-18 00:13:39.000000000 -0700
> +++ slub/mm/slob.c    2007-05-18 00:13:40.000000000 -0700
> @@ -394,6 +394,15 @@ int kmem_cache_shrink(struct kmem_cache 
>  }
>  EXPORT_SYMBOL(kmem_cache_shrink);
>  
> +/*
> + * SLOB does not support slab defragmentation
> + */
> +int kmem_cache_vacate(struct page *page)
> +{
> +     return 0;
> +}
> +EXPORT_SYMBOL(kmem_cache_vacate);
> +
>  int kmem_ptr_validate(struct kmem_cache *a, const void *b)
>  {
>       return 0;
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 02/10] SLUB: slab defragmentation and kmem_cache_vacate

Reply via email to