Re: Experiences with slub bulk use-case for network stack

2015-09-17 Thread Jesper Dangaard Brouer
On Wed, 16 Sep 2015 10:13:25 -0500 (CDT)
Christoph Lameter  wrote:

> On Wed, 16 Sep 2015, Jesper Dangaard Brouer wrote:
> 
> >
> > Hint, this leads up to discussing whether the current bulk *ALLOC* API
> > needs to be changed...
> >
> > Alex and I have been working hard on a practical use-case for SLAB
> > bulking (mostly slUb) in the network stack.  Here is a summary of
> > what we have learned so far.
> 
> SLAB refers to the SLAB allocator, which is one slab allocator, and SLUB
> is another slab allocator.
> 
> Please keep that consistent, otherwise things get confusing.

This naming scheme is really confusing.  I'll try to be more
consistent.  So, you want capital letters SLAB and SLUB when talking
about a specific slab allocator implementation.


> > Bulk free'ing SKBs during TX completion is a big and easy win.
> >
> > Specifically for slUb, the normal path for freeing these objects
> > (which are not on c->freelist) requires a locked double_cmpxchg per
> > object.  Bulk free (via the detached freelist patch) allows all
> > objects belonging to the same slab-page to be freed with a single
> > locked double_cmpxchg.  Thus, the bulk free speedup is quite an
> > improvement.
> 
> Yep.
> 
> > Alex and I had the idea of having bulk alloc return an "allocator
> > specific cache" data-structure (and adding some helpers to access it).
> 
> Maybe add some macros to handle this?

Yes, helpers will likely turn out to be macros.


> > In the slUb case, the freelist is a singly linked pointer list.  In
> > the network stack the skb objects have a skb->next pointer, which is
> > located at the same position as the freelist pointer.  Thus, the
> > freelist, returned directly, could be interpreted as an skb-list.
> > The helper API would then do the prefetching when pulling out
> > objects.
> 
> The problem with the SLUB case is that the objects must be on the same
> slab page.

Yes, I'm aware of that; it is exactly what we are trying to take
advantage of.


> > For the slUb case, we would simply cmpxchg either c->freelist or
> > page->freelist with a NULL ptr, and then own all objects on the
> > freelist. This also reduces the time we keep IRQs disabled.
> 
> You don't need to disable interrupts for the cmpxchges. There is
> additional state in the page struct though, so the updates must be
> done carefully.

Yes, I'm aware that cmpxchg does not require interrupts to be disabled,
and I plan to take advantage of this in the new approach for bulk alloc.

Our current bulk alloc disables interrupts for the full period (of
collecting the requested number of objects).

What I'm proposing is keeping interrupts on, and then simply
cmpxchg'ing e.g. 2 slab-pages out of the SLUB allocator (what the
SLUB code calls freelists).  The bulk call then owns these freelists,
and returns them to the caller.  The API caller gets some
helpers/macros to access the objects, to shield him from the details
(of SLUB freelists).

The pitfall with this API is that we don't know how many objects are
on a SLUB freelist.  And we cannot walk the freelist and count them,
because then we hit the memory/cache stalls (that we are trying so
hard to avoid).
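
As a rough user-space sketch of that ownership transfer (this is NOT
SLUB code; the struct and function names below are made up, and a
plain atomic exchange stands in for SLUB's cmpxchg of the freelist
head):

/* User-space model of "grab the whole freelist in one atomic op". */
#include <stdatomic.h>
#include <stdio.h>

struct object {
	struct object *next;            /* freelist pointer at offset 0 */
};

struct slab_page {
	_Atomic(struct object *) freelist;
};

/* Take ownership of every object currently on the page's freelist.
 * Afterwards the caller can hand the objects out without further
 * atomics and without disabling IRQs. */
static struct object *bulk_detach_freelist(struct slab_page *page)
{
	return atomic_exchange(&page->freelist, NULL);
}

int main(void)
{
	struct slab_page page = { .freelist = NULL };
	struct object objs[4];

	/* build a small freelist: objs[3] -> objs[2] -> objs[1] -> objs[0] */
	for (int i = 0; i < 4; i++) {
		objs[i].next = atomic_load(&page.freelist);
		atomic_store(&page.freelist, &objs[i]);
	}

	/* note: the caller only learns how many objects it got by walking */
	int n = 0;
	for (struct object *o = bulk_detach_freelist(&page); o; o = o->next)
		n++;

	printf("detached %d objects\n", n);
	return 0;
}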

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: Experiences with slub bulk use-case for network stack

2015-09-17 Thread Christoph Lameter
On Thu, 17 Sep 2015, Jesper Dangaard Brouer wrote:

> What I'm proposing is keeping interrupts on, and then simply
> cmpxchg'ing e.g. 2 slab-pages out of the SLUB allocator (what the
> SLUB code calls freelists).  The bulk call then owns these freelists,
> and returns them to the caller.  The API caller gets some
> helpers/macros to access the objects, to shield him from the details
> (of SLUB freelists).
>
> The pitfall with this API is that we don't know how many objects are
> on a SLUB freelist.  And we cannot walk the freelist and count them,
> because then we hit the memory/cache stalls (that we are trying so
> hard to avoid).

If you get a fresh page from the page allocator, then you know how many
objects are available in the slab page.

There is also a counter in each slab page for the objects allocated. The
number of free objects is page->objects - page->inuse.

This is only true for a locked cmpxchg. The unlocked cmpxchg used for the
per-cpu freelist does not use the counters in the page struct.
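
Roughly (illustration only, not a patch; the struct and helper names
below are made up, merely mirroring the two struct page counters
described above):

#include <stdio.h>

struct slab_page_counters {
	unsigned objects;	/* total objects in the slab page */
	unsigned inuse;		/* objects currently allocated */
};

static unsigned free_objects(const struct slab_page_counters *p)
{
	return p->objects - p->inuse;
}

int main(void)
{
	struct slab_page_counters page = { .objects = 32, .inuse = 20 };

	printf("free objects: %u\n", free_objects(&page));	/* 12 */
	return 0;
}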




Experiences with slub bulk use-case for network stack

2015-09-16 Thread Jesper Dangaard Brouer

Hint, this leads up to discussing whether the current bulk *ALLOC* API
needs to be changed...

Alex and I have been working hard on a practical use-case for SLAB
bulking (mostly slUb) in the network stack.  Here is a summary of
what we have learned so far.

Bulk free'ing SKBs during TX completion is a big and easy win.

Specifically for slUb, the normal path for freeing these objects
(which are not on c->freelist) requires a locked double_cmpxchg per
object.  Bulk free (via the detached freelist patch) allows all
objects belonging to the same slab-page to be freed with a single
locked double_cmpxchg.  Thus, the bulk free speedup is quite an
improvement.
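
As a rough user-space sketch of the detached freelist idea (this is
illustration only, not the actual patch; struct and function names
are made up): the objects to free are first linked into a private
chain, and the whole chain is then spliced onto the slab page's
freelist with a single compare-and-swap instead of one locked
cmpxchg per object.

#include <stdatomic.h>
#include <stddef.h>

struct object {
	struct object *next;		/* freelist pointer */
};

struct slab_page {
	_Atomic(struct object *) freelist;
};

/* Free 'count' (>= 1) objects that all belong to 'page' with one CAS. */
static void bulk_free_same_page(struct slab_page *page,
				struct object **objs, size_t count)
{
	struct object *head = objs[0];
	struct object *tail = objs[0];

	/* build the detached freelist: objs[0] -> objs[1] -> ... */
	for (size_t i = 1; i < count; i++) {
		tail->next = objs[i];
		tail = objs[i];
	}

	/* splice the whole chain onto the page freelist atomically */
	struct object *old = atomic_load(&page->freelist);
	do {
		tail->next = old;
	} while (!atomic_compare_exchange_weak(&page->freelist, &old, head));
}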

The slUb alloc is hard to beat on speed:
 * accessing c->freelist via a local cmpxchg costs 9 cycles (38% of the cost)
 * c->freelist is refilled with a single locked cmpxchg

In micro-benchmarking it looks like we can beat the single object
alloc, because we do one local_irq_{disable,enable} (cost: 7 cycles)
and then pull out all objects in c->freelist, saving 9 cycles per
object (counting from the 2nd object).  E.g. for a bulk of 16 objects
that is roughly 15 * 9 = 135 cycles saved against a one-time 7 cycle
IRQ toggle.

However, in practical use-cases we are seeing the single object alloc
win over bulk alloc; we believe this is due to prefetching.  When
c->freelist gets (semi) cache-cold, it becomes more expensive to walk
the freelist (which is a basic singly linked list of pointers to the
next free object).

For bulk alloc the full freelist is walked (right away) and the
objects are pulled out into the array.  For normal single object
alloc only a single object is returned, but it does a prefetch on the
next object pointer.  Thus, the next time single alloc is called,
that object will already have been prefetched.  Doing a prefetch in
bulk alloc only helps a little, as there is not enough "time" between
accessing/walking the consecutive freelist entries.
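
A rough user-space sketch of that difference (illustration only; the
function names are made up).  Bulk alloc walks the freelist back to
back, so a prefetch has almost no time to hide the miss; the
single-object path prefetches one ahead, and the caller's own work
hides the miss before the next call:

#include <stddef.h>

struct object {
	struct object *next;	/* freelist pointer at offset 0 */
};

/* bulk-style: walk the whole freelist right away into an array */
static size_t pull_all(struct object **freelist, void **out, size_t max)
{
	size_t n = 0;

	while (*freelist && n < max) {
		struct object *obj = *freelist;

		__builtin_prefetch(obj->next);	/* too little time to help */
		*freelist = obj->next;
		out[n++] = obj;
	}
	return n;
}

/* single-object style: return one object and prefetch the new head so
 * it is (hopefully) cache-hot by the time the caller comes back */
static void *pull_one(struct object **freelist)
{
	struct object *obj = *freelist;

	if (!obj)
		return NULL;
	*freelist = obj->next;
	__builtin_prefetch(*freelist);
	return obj;
}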

So, how can we solve this and make bulk alloc faster?


Alex and I had the idea of having bulk alloc return an "allocator
specific cache" data-structure (and adding some helpers to access it).

In the slUb case, the freelist is a singly linked pointer list.  In
the network stack the skb objects have a skb->next pointer, which is
located at the same position as the freelist pointer.  Thus, the
freelist, returned directly, could be interpreted as an skb-list.
The helper API would then do the prefetching when pulling out
objects.
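
A rough sketch of the layout trick and such a helper (illustrative
only; the fake_skb, bulk_cache and bulk_cache_pull names are made up,
and the real sk_buff/freelist-pointer offsets would need checking):

#include <stddef.h>

struct fake_skb {
	struct fake_skb *next;		/* overlaps the freelist pointer */
	unsigned int len;
	unsigned char data[64];
};

_Static_assert(offsetof(struct fake_skb, next) == 0,
	       "next pointer must overlap the freelist pointer");

/* what bulk alloc would hand back: just the detached freelist head */
struct bulk_cache {
	void *freelist;
};

/* hypothetical helper: pop one object off the returned freelist and
 * prefetch the following one before handing it to the caller */
static struct fake_skb *bulk_cache_pull(struct bulk_cache *bc)
{
	struct fake_skb *skb = bc->freelist;

	if (!skb)
		return NULL;
	bc->freelist = skb->next;
	__builtin_prefetch(bc->freelist);
	return skb;
}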

For the slUb case, we would simply cmpxchg either c->freelist or
page->freelist with a NULL ptr, and then own all objects on the
freelist. This also reduces the time we keep IRQs disabled.

API-wise, we don't (necessarily) know how many objects are on the
freelist (without first walking the list, which would cause the data
stalls we are trying to avoid).

Thus, the API of always returning the exact number of requested
objects will not work...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

(related to http://thread.gmane.org/gmane.linux.kernel.mm/137469)



Re: Experiences with slub bulk use-case for network stack

2015-09-16 Thread Christoph Lameter
On Wed, 16 Sep 2015, Jesper Dangaard Brouer wrote:

>
> Hint, this leads up to discussing whether the current bulk *ALLOC* API
> needs to be changed...
>
> Alex and I have been working hard on a practical use-case for SLAB
> bulking (mostly slUb) in the network stack.  Here is a summary of
> what we have learned so far.

SLAB refers to the SLAB allocator, which is one slab allocator, and SLUB
is another slab allocator.

Please keep that consistent, otherwise things get confusing.

> Bulk free'ing SKBs during TX completion is a big and easy win.
>
> Specifically for slUb, the normal path for freeing these objects
> (which are not on c->freelist) requires a locked double_cmpxchg per
> object.  Bulk free (via the detached freelist patch) allows all
> objects belonging to the same slab-page to be freed with a single
> locked double_cmpxchg.  Thus, the bulk free speedup is quite an
> improvement.

Yep.

> Alex and I had the idea of having bulk alloc return an "allocator
> specific cache" data-structure (and adding some helpers to access it).

Maybe add some macros to handle this?

> In the slUb case, the freelist is a singly linked pointer list.  In
> the network stack the skb objects have a skb->next pointer, which is
> located at the same position as the freelist pointer.  Thus, the
> freelist, returned directly, could be interpreted as an skb-list.
> The helper API would then do the prefetching when pulling out
> objects.

The problem with the SLUB case is that the objects must be on the same
slab page.

> For the slUb case, we would simply cmpxchg either c->freelist or
> page->freelist with a NULL ptr, and then own all objects on the
> freelist. This also reduces the time we keep IRQs disabled.

You don't need to disable interrupts for the cmpxchges. There is
additional state in the page struct though, so the updates must be done
carefully.
