On Thu, Aug 11, 2016 at 02:43:16PM +1000, David Gwynne wrote:
> ive been tinkering with per cpu memory in the kernel.

i think vi threw up a little bit on the diff i sent out, so this
should work.

it should also work on !MULTIPROCESSOR kernels now. some of that
is fixes to the percpu.h bits, but its also ifdefing bits in the
pool code so it doesnt bother trying it on UP kernels.

> per cpu memory is pretty much what it sounds like. you allocate
> memory for each cpu to operate on independently of the rest of the
> system, therefore reducing the contention between cpus on cache
> lines.
> 
> this introduces wrappers around the kernel memory allocators, so
> when you ask to allocate N bytes, you get an N-sized allocation for
> each cpu, and a way to get access to that memory from each cpu.
> 
> cpumem_get() and cpumem_put() are wrappers around pool_get() and
> pool_put(), and cpumem_malloc() and cpumem_free() are wrappers
> around malloc() and free(). instead of returning a direct reference
> to the memory, these return a struct cpumem pointer. you then later
> get a reference to the local cpu's allocation with cpumem_enter(),
> and then release that reference with cpumem_leave().
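
to make that concrete, here is a rough sketch of how a caller might
use it. the struct and function names are made up for illustration
and are not part of the diff:

	struct foo_stats {
		uint64_t	fs_in;
		uint64_t	fs_out;
	};

	struct cpumem *foo_stats;

	void
	foo_init(void)
	{
		/* allocates one struct foo_stats for every cpu */
		foo_stats = cpumem_malloc(sizeof(struct foo_stats),
		    M_DEVBUF);
	}

	void
	foo_input(void)
	{
		struct foo_stats *fs;

		/* take the local cpu's copy, bump it, give it back */
		fs = cpumem_enter(foo_stats);
		fs->fs_in++;
		cpumem_leave(foo_stats, fs);
	}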
> 
> im still debating whether the API should do protection against
> interrupts on the local cpu by handling spls for you. at the moment
> it is up to the caller to manually splfoo() and splx(), but im half
> convinced that cpumem_enter and _leave should do that on behalf of
> the caller.
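
for reference, the manual version looks something like this,
continuing the made-up foo_stats example above. the ipl is whatever
the data actually needs; splnet() here is only for illustration:

	void
	foo_intr(void)
	{
		struct foo_stats *fs;
		int s;

		/* block network interrupts on this cpu while we
		 * touch our copy of the stats */
		s = splnet();
		fs = cpumem_enter(foo_stats);
		fs->fs_in++;
		cpumem_leave(foo_stats, fs);
		splx(s);
	}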
> 
> this diff also includes two uses of the percpu code. one is to
> provide per cpu caches of pool items, and the other is per cpu
> counters for mbuf statistics.
> 
> ive added a wrapper around percpu memory for counters. basically
> you ask it for N counters, where each counter is a uint64_t.
> counters_enter will give you a per cpu reference to these N counters
> which you can increment and decrement as you wish. internally the
> api will version the per cpu counters so a reader can know when
> theyre consistent, which is important on 32bit archs (where 64bit
> ops arent necessarily atomic), or where you want several counters
> to be consistent with each other (like packet and byte counters).
> counters_read uses the above machinery to take a consistent snapshot
> of the counters, summed across all the cpus.
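
in use it ends up looking like this. it mirrors what the mbuf changes
below do, but the foo names and counter indices are made up:

	enum { FOO_IN, FOO_OUT, FOO_NCOUNTERS };

	struct cpumem *foo_counters;

	void
	foo_counters_init(void)
	{
		foo_counters = counters_alloc(FOO_NCOUNTERS, M_DEVBUF);
	}

	void
	foo_count_in(void)
	{
		struct counters_ref cr;
		uint64_t *counters;
		int s;

		s = splnet();
		counters = counters_enter(&cr, foo_counters);
		counters[FOO_IN]++;
		counters_leave(&cr, foo_counters);
		splx(s);
	}

	void
	foo_counters_read(uint64_t output[FOO_NCOUNTERS])
	{
		/* sums every cpu's counters into output, using the
		 * generation numbers to get a consistent snapshot */
		counters_read(foo_counters, output, FOO_NCOUNTERS);
	}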
> 
> the per cpu caches in pools are modelled on the ones described in
> the "Magazines and Vmem: Extending the Slab Allocator to Many CPUs
> and Arbitrary Resources" paper by Jeff Bonwick and Jonathan Adams.
> pools are based on slabs, so it seems defensible to use this as the
> basis for further improvements.
> 
> like the magazine paper, it maintains a pair of "magazines" of pool
> items on each cpu. when both are full, one gets pushed to a global
> depot. if both are empty it will try to allocate a whole magazine
> from the depot, and if that fails it will fall through to the normal
> pool_get allocation. this scheme for limiting how often the global
> data structures get touched is the big take-away from the paper in
> my opinion.
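
in the context of the pool changes below, the get side boils down to
roughly this. it is a condensed sketch of pool_cache_get(), minus the
spl, generation and stats bookkeeping:

	void *
	pool_cache_get_sketch(struct pool *pp)
	{
		struct pool_cache *pc;
		struct pool_list *pl;

		pc = cpumem_enter(pp->pr_cache);	/* this cpu's cache */

		if (pc->pc_actv != NULL) {		/* current magazine */
			pl = pc->pc_actv;
		} else if (pc->pc_prev != NULL) {	/* previous magazine */
			pl = pc->pc_prev;
			pc->pc_prev = NULL;
		} else if ((pl = pool_list_alloc(pp, pc)) == NULL) {
			/* the depot is empty too, so the caller falls
			 * back to the normal pool_get() path */
			cpumem_leave(pp->pr_cache, pc);
			return (NULL);
		}

		pc->pc_actv = pl->pl_next;		/* pop the head item */
		pc->pc_nactv = pl->pl_nitems - 1;

		cpumem_leave(pp->pr_cache, pc);
		return (pl);	/* the head item itself is the allocation */
	}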
> 
> unlike the paper though, the per cpu caches in pools take advantage
> of the fact that pools are not a caching memory allocator, ie, we
> dont keep track of pool items in a constructed state so we can
> scribble over the memory to our heart's content. with that in mind,
> the per cpu caches in pools build linked lists through the free items
> themselves rather than allocating magazines to point at pool items.
> in the future this
> will greatly simplify scaling the size of magazines. right now there
> is no point because there's no contention on anything in the kernel
> except the big lock.
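
concretely, the list header is written over the start of each free
item, so the cache needs no allocations of its own. this is the
struct from the subr_pool.c changes below:

	struct pool_list {
		struct pool_list	*pl_next;	/* next item in this list */
		unsigned long		 pl_cookie;
		struct pool_list	*pl_nextl;	/* next list in the depot */
		unsigned long		 pl_nitems;	/* items in this list */
	};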
> 
> there are some consequences of the per cpu pool caches. the most
> important is that a hard limit on pool items cannot work, because
> enforcing it means every get and put has to touch a shared count,
> which defeats the point of per cpu caches. a compromise might be to
> limit the total
> number of pages available to the pool rather than limiting individual
> pool item counts. this would work fine in the mbuf layer for example.
> 
> finally, two things to note.
> 
> firstly, ive written an alternate backend for this stuff for
> uniprocessor kernels that should collapse down to a simple pointer
> deref, rather than an indirect reference through the map of cpus
> to allocations. i havent tested it at all though.
> 
> secondly, there is a bootstrapping problem with per cpu data
> structures, which is very apparent with the mbuf layer. the problem
> is we dont know how many cpus are in the system until we're halfway
> through attaching device drivers. however, if we want
> to use percpu data structures during attach we need to know how
> many cpus we have. mbufs are allocated during attach, so we need
> to know how many cpus we have before attach.
> 
> solaris deals with this problem by assuming MAXCPUS when allocating
> the map of cpus to allocations. im not a fan of this because on
> sparc64 MAXCPUS is 256, but my v445 has 4. using MAXCPUS for the
> per cpu map will cause me to waste nearly 2k of memory (256 cpus *
> 8 bytes per pointer), which makes per cpu data structures less
> attractive.
> 
> i also want to avoid conditionals in hot code like the mbuf layer,
> so i dont want to put if (mbstat != NULL) { cpumem_enter(mbstat); }
> etc when that will evaluate as true all the time except during
> boot.
> 
> instead i have a compromise where i allocate a single cpu's worth of
> memory as a global to be used during boot. we only spin up cpus
> late during boot (relatively speaking) so we can assume ncpus is 1
> until after the hardware has attached. at that point percpu_init
> bootstraps the per cpu map allocations and the mbuf layer reallocates
> the mbstat percpu mem with it.
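
for the mbuf layer that looks like this. it is a condensed copy of
the uipc_mbuf.c changes below:

	/* boot-time storage for one cpu's worth of mbuf counters,
	 * usable before percpu_init() has run */
	COUNTERS_BOOT_MEMORY(mbstat_boot, MBSTAT_COUNT);
	struct cpumem *mbstat = COUNTERS_BOOT_INITIALIZER(mbstat_boot);

	void
	mbcache(void)
	{
		int i;

		/* called from main() after percpu_init(), once ncpus
		 * is known: swap the boot allocation for a real per
		 * cpu one and enable the per cpu pool caches */
		mbstat = counters_realloc(mbstat, MBSTAT_COUNT, M_DEVBUF);

		pool_cache_init(&mbpool);
		pool_cache_init(&mtagpool);
		for (i = 0; i < nitems(mclsizes); i++)
			pool_cache_init(&mclpools[i]);
	}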
> 
> the bet is that on the majority of systems these boot globals will
> waste less memory than sizing for MAXCPUS would, since we reallocate
> for the real cpu count once we know it. if i do end up with a sparc64
> machine with 256
> cpus, i am almost certainly going to have enough ram to cope with
> losing some bytes here and there.
> 
> thoughts?

Index: sys/percpu.h
===================================================================
RCS file: sys/percpu.h
diff -N sys/percpu.h
--- /dev/null   1 Jan 1970 00:00:00 -0000
+++ sys/percpu.h        12 Aug 2016 04:01:42 -0000
@@ -0,0 +1,157 @@
+/*     $OpenBSD$ */
+
+/*
+ * Copyright (c) 2016 David Gwynne <d...@openbsd.org>
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+
+#ifndef _SYS_PERCPU_H_
+#define _SYS_PERCPU_H_
+
+#ifndef CACHELINESIZE
+#define CACHELINESIZE 64
+#endif
+
+#ifndef __upunused /* this should go in param.h */
+#ifdef MULTIPROCESSOR
+#define __upunused
+#else
+#define __upunused __attribute__((__unused__))
+#endif
+#endif
+
+struct cpumem {
+       void            *mem;
+};
+
+struct cpumem_iter {
+       unsigned int    cpu;
+} __upunused;
+
+struct counters_ref {
+       uint64_t        *gen;
+};
+
+#ifdef _KERNEL
+struct pool;
+
+struct cpumem  *cpumem_get(struct pool *);
+void            cpumem_put(struct pool *, struct cpumem *);
+
+struct cpumem  *cpumem_malloc(size_t, int);
+struct cpumem  *cpumem_realloc(struct cpumem *, size_t, int);
+void            cpumem_free(struct cpumem *, int, size_t);
+
+#ifdef MULTIPROCESSOR
+static inline void *
+cpumem_enter(struct cpumem *cm)
+{
+       unsigned int cpu = CPU_INFO_UNIT(curcpu());
+       return (cm[cpu].mem);
+}
+
+static inline void
+cpumem_leave(struct cpumem *cm, void *mem)
+{
+       /* KDASSERT? */
+}
+
+void           *cpumem_first(struct cpumem_iter *, struct cpumem *);
+void           *cpumem_next(struct cpumem_iter *, struct cpumem *);
+
+#define CPUMEM_BOOT_MEMORY(_name, _sz)                                 \
+static struct {                                                        \
+       unsigned char   mem[_sz];                                       \
+       struct cpumem   cpumem;                                         \
+} __aligned(CACHELINESIZE) _name##_boot_cpumem = {                     \
+       .cpumem = { _name##_boot_cpumem.mem }                           \
+}
+
+#define CPUMEM_BOOT_INITIALIZER(_name)                                 \
+       { &_name##_boot_cpumem.cpumem }
+
+#else /* MULTIPROCESSOR */
+static inline void *
+cpumem_enter(struct cpumem *cm)
+{
+       return (cm);
+}
+
+static inline void
+cpumem_leave(struct cpumem *cm, void *mem)
+{
+       /* KDASSERT? */
+}
+
+static inline void *
+cpumem_first(struct cpumem_iter *i, struct cpumem *cm)
+{
+       return (cm);
+}
+
+static inline void *
+cpumem_next(struct cpumem_iter *i, struct cpumem *cm)
+{
+       return (NULL);
+}
+
+#define CPUMEM_BOOT_MEMORY(_name, _sz)                                 \
+static struct {                                                        \
+       unsigned char   mem[_sz];                                       \
+} _name##_boot_cpumem
+
+#define CPUMEM_BOOT_INITIALIZER(_name)                                 \
+       { (struct cpumem *)&_name##_boot_cpumem.mem }
+
+#endif /* MULTIPROCESSOR */
+
+#define CPUMEM_FOREACH(_var, _iter, _cpumem)                           \
+       for ((_var) = cpumem_first((_iter), (_cpumem));                 \
+           (_var) != NULL;                                             \
+           (_var) = cpumem_next((_iter), (_cpumem)))
+
+struct cpumem  *counters_alloc(unsigned int, int);
+struct cpumem  *counters_realloc(struct cpumem *, unsigned int, int);
+void            counters_free(struct cpumem *, int, unsigned int);
+void            counters_read(struct cpumem *, uint64_t *, unsigned int);
+void            counters_zero(struct cpumem *, unsigned int);
+
+#ifdef MULTIPROCESSOR
+uint64_t       *counters_enter(struct counters_ref *, struct cpumem *);
+void            counters_leave(struct counters_ref *, struct cpumem *);
+
+#define COUNTERS_BOOT_MEMORY(_name, _n)                                \
+       CPUMEM_BOOT_MEMORY(_name, ((_n) + 1) * sizeof(uint64_t))
+#else
+static inline uint64_t *
+counters_enter(struct counters_ref *r, struct cpumem *cm)
+{
+       r->gen = cpumem_enter(cm);
+       return (r->gen);
+}
+
+static inline void
+counters_leave(struct counters_ref *r, struct cpumem *cm)
+{
+       cpumem_leave(cm, r->gen);
+}
+
+#define COUNTERS_BOOT_MEMORY(_name, _n)                                \
+       CPUMEM_BOOT_MEMORY(_name, (_n) * sizeof(uint64_t))
+#endif
+
+#define COUNTERS_BOOT_INITIALIZER(_name)       CPUMEM_BOOT_INITIALIZER(_name)
+
+#endif /* _KERNEL */
+#endif /* _SYS_PERCPU_H_ */
Index: sys/mbuf.h
===================================================================
RCS file: /cvs/src/sys/sys/mbuf.h,v
retrieving revision 1.216
diff -u -p -r1.216 mbuf.h
--- sys/mbuf.h  19 Jul 2016 08:13:45 -0000      1.216
+++ sys/mbuf.h  12 Aug 2016 04:01:42 -0000
@@ -236,6 +236,7 @@ struct mbuf {
 #define        MT_FTABLE       5       /* fragment reassembly header */
 #define        MT_CONTROL      6       /* extra-data protocol message */
 #define        MT_OOBDATA      7       /* expedited data  */
+#define MT_NTYPES      8
 
 /* flowid field */
 #define M_FLOWID_VALID 0x8000  /* is the flowid set */
@@ -397,6 +398,12 @@ struct mbstat {
        u_short m_mtypes[256];  /* type specific mbuf allocations */
 };
 
+#define MBSTAT_TYPES           MT_NTYPES
+#define MBSTAT_DROPS           (MBSTAT_TYPES + 0)
+#define MBSTAT_WAIT            (MBSTAT_TYPES + 1)
+#define MBSTAT_DRAIN           (MBSTAT_TYPES + 2)
+#define MBSTAT_COUNT           (MBSTAT_TYPES + 3)
+
 #include <sys/mutex.h>
 
 struct mbuf_list {
@@ -414,7 +421,6 @@ struct mbuf_queue {
 
 #ifdef _KERNEL
 
-extern struct mbstat mbstat;
 extern int nmbclust;                   /* limit on the # of clusters */
 extern int mblowat;                    /* mbuf low water mark */
 extern int mcllowat;                   /* mbuf cluster low water mark */
@@ -423,6 +429,7 @@ extern      int max_protohdr;               /* largest pro
 extern int max_hdr;                    /* largest link+protocol header */
 
 void   mbinit(void);
+void   mbcache(void);
 struct mbuf *m_copym2(struct mbuf *, int, int, int);
 struct mbuf *m_copym(struct mbuf *, int, int, int);
 struct mbuf *m_free(struct mbuf *);
Index: sys/pool.h
===================================================================
RCS file: /cvs/src/sys/sys/pool.h,v
retrieving revision 1.59
diff -u -p -r1.59 pool.h
--- sys/pool.h  21 Apr 2016 04:09:28 -0000      1.59
+++ sys/pool.h  12 Aug 2016 04:01:42 -0000
@@ -84,6 +84,9 @@ struct pool_allocator {
 
 TAILQ_HEAD(pool_pagelist, pool_item_header);
 
+struct pool_list;
+struct cpumem;
+
 struct pool {
        struct mutex    pr_mtx;
        SIMPLEQ_ENTRY(pool)
@@ -118,12 +121,23 @@ struct pool {
 #define PR_LIMITFAIL   0x0004 /* M_CANFAIL */
 #define PR_ZERO                0x0008 /* M_ZERO */
 #define PR_WANTED      0x0100
+#define PR_CPUCACHE    0x0200
 
        int             pr_ipl;
 
        RB_HEAD(phtree, pool_item_header)
                        pr_phtree;
 
+       struct cpumem * pr_cache;
+       struct mutex    pr_cache_mtx;
+       struct pool_list *
+                       pr_cache_list;
+       u_int           pr_cache_nlist;
+       u_int           pr_cache_items;
+       u_int           pr_cache_contention;
+       u_int           pr_cache_contention_prev;
+       int             pr_cache_nout;
+
        u_int           pr_align;
        u_int           pr_maxcolors;   /* Cache coloring */
        int             pr_phoffset;    /* Offset in page of page header */
@@ -175,6 +189,7 @@ struct pool_request {
 
 void           pool_init(struct pool *, size_t, u_int, u_int, int,
                    const char *, struct pool_allocator *);
+void           pool_cache_init(struct pool *);
 void           pool_destroy(struct pool *);
 void           pool_setipl(struct pool *, int);
 void           pool_setlowat(struct pool *, int);
Index: sys/srp.h
===================================================================
RCS file: /cvs/src/sys/sys/srp.h,v
retrieving revision 1.11
diff -u -p -r1.11 srp.h
--- sys/srp.h   7 Jun 2016 07:53:33 -0000       1.11
+++ sys/srp.h   12 Aug 2016 04:01:42 -0000
@@ -21,10 +21,12 @@
 
 #include <sys/refcnt.h>
 
+#ifndef __upunused
 #ifdef MULTIPROCESSOR
 #define __upunused
 #else
 #define __upunused __attribute__((__unused__))
+#endif
 #endif
 
 struct srp {
Index: kern/subr_percpu.c
===================================================================
RCS file: kern/subr_percpu.c
diff -N kern/subr_percpu.c
--- /dev/null   1 Jan 1970 00:00:00 -0000
+++ kern/subr_percpu.c  12 Aug 2016 04:01:42 -0000
@@ -0,0 +1,343 @@
+/*     $OpenBSD$ */
+
+/*
+ * Copyright (c) 2016 David Gwynne <d...@openbsd.org>
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+
+#include <sys/param.h>
+#include <sys/systm.h>
+#include <sys/pool.h>
+#include <sys/malloc.h>
+#include <sys/types.h>
+#include <sys/atomic.h>
+
+#include <sys/percpu.h>
+
+#ifdef MULTIPROCESSOR
+struct pool cpumem_pl;
+
+void
+percpu_init(void)
+{
+       pool_init(&cpumem_pl, sizeof(struct cpumem) * ncpus, 0, 0,
+           PR_WAITOK, "percpumem", &pool_allocator_single);
+       pool_setipl(&cpumem_pl, IPL_NONE);
+}
+
+struct cpumem *
+cpumem_get(struct pool *pp)
+{
+       struct cpumem *cm;
+       unsigned int cpu;
+
+       cm = pool_get(&cpumem_pl, PR_WAITOK);
+
+       for (cpu = 0; cpu < ncpus; cpu++)
+               cm[cpu].mem = pool_get(pp, PR_WAITOK | PR_ZERO);
+
+       return (cm);
+}
+
+void
+cpumem_put(struct pool *pp, struct cpumem *cm)
+{
+       unsigned int cpu;
+
+       for (cpu = 0; cpu < ncpus; cpu++)
+               pool_put(pp, cm[cpu].mem);
+
+       pool_put(&cpumem_pl, cm);
+}
+
+struct cpumem *
+cpumem_malloc(size_t sz, int type)
+{
+       struct cpumem *cm;
+       unsigned int cpu;
+
+       sz = roundup(sz, CACHELINESIZE);
+
+       cm = pool_get(&cpumem_pl, PR_WAITOK);
+
+       for (cpu = 0; cpu < ncpus; cpu++)
+               cm[cpu].mem = malloc(sz, type, M_WAITOK | M_ZERO);
+
+       return (cm);
+}
+
+struct cpumem *
+cpumem_realloc(struct cpumem *bootcm, size_t sz, int type)
+{
+       struct cpumem *cm;
+       unsigned int cpu;
+
+       sz = roundup(sz, CACHELINESIZE);
+
+       cm = pool_get(&cpumem_pl, PR_WAITOK);
+
+       cm[0].mem = bootcm[0].mem;
+       for (cpu = 1; cpu < ncpus; cpu++)
+               cm[cpu].mem = malloc(sz, type, M_WAITOK | M_ZERO);
+
+       return (cm);
+}
+
+void
+cpumem_free(struct cpumem *cm, int type, size_t sz)
+{
+       unsigned int cpu;
+
+       sz = roundup(sz, CACHELINESIZE);
+
+       for (cpu = 0; cpu < ncpus; cpu++)
+               free(cm[cpu].mem, type, sz);
+
+       pool_put(&cpumem_pl, cm);
+}
+
+void *
+cpumem_first(struct cpumem_iter *i, struct cpumem *cm)
+{
+       i->cpu = 0;
+
+       return (cm[0].mem);
+}
+
+void *
+cpumem_next(struct cpumem_iter *i, struct cpumem *cm)
+{
+       unsigned int cpu = ++i->cpu;
+
+       if (cpu >= ncpus)
+               return (NULL);
+
+       return (cm[cpu].mem);
+}
+
+struct cpumem *
+counters_alloc(unsigned int n, int type)
+{
+       struct cpumem *cm;
+       struct cpumem_iter cmi;
+       uint64_t *counters;
+       unsigned int i;
+
+       KASSERT(n > 0);
+
+       n++; /* add space for a generation number */
+       cm = cpumem_malloc(n * sizeof(uint64_t), type);
+
+       CPUMEM_FOREACH(counters, &cmi, cm) {
+               for (i = 0; i < n; i++)
+                       counters[i] = 0;
+       }
+
+       return (cm);
+}
+
+struct cpumem *
+counters_realloc(struct cpumem *cm, unsigned int n, int type)
+{
+       n++; /* the generation number */
+       return (cpumem_realloc(cm, n * sizeof(uint64_t), type));
+}
+
+void
+counters_free(struct cpumem *cm, int type, unsigned int n)
+{
+       n++; /* generation number */
+       cpumem_free(cm, type, n * sizeof(uint64_t));
+}
+
+uint64_t *
+counters_enter(struct counters_ref *ref, struct cpumem *cm)
+{
+       ref->gen = cpumem_enter(cm);
+       (*ref->gen)++; /* make the generation number odd */
+       return (ref->gen + 1);
+}
+
+void
+counters_leave(struct counters_ref *ref, struct cpumem *cm)
+{
+       membar_producer();
+       (*ref->gen)++; /* make the generation number even again */
+       cpumem_leave(cm, ref->gen);
+}
+
+void
+counters_read(struct cpumem *cm, uint64_t *output, unsigned int n)
+{
+       struct cpumem_iter cmi;
+       uint64_t *gen, *counters, *temp;
+       uint64_t enter, leave;
+       unsigned int i;
+
+       for (i = 0; i < n; i++)
+               output[i] = 0;
+
+       temp = mallocarray(n, sizeof(uint64_t), M_TEMP, M_WAITOK);
+
+       gen = cpumem_first(&cmi, cm);
+       do {
+               counters = gen + 1;
+
+               enter = *gen;
+               for (;;) {
+                       /* the generation number is odd during an update */
+                       while (enter & 1) {
+                               yield();
+                               membar_consumer();
+                               enter = *gen;
+                       }
+
+                       for (i = 0; i < n; i++)
+                               temp[i] = counters[i];
+
+                       membar_consumer();
+                       leave = *gen;
+
+                       if (enter == leave)
+                               break;
+
+                       enter = leave;
+               }
+
+               for (i = 0; i < n; i++)
+                       output[i] += temp[i];
+
+               gen = cpumem_next(&cmi, cm);
+       } while (gen != NULL);
+
+       free(temp, M_TEMP, n * sizeof(uint64_t));
+}
+
+void
+counters_zero(struct cpumem *cm, unsigned int n)
+{
+       struct cpumem_iter cmi;
+       uint64_t *counters;
+       unsigned int i;
+
+       n++; /* zero the generation numbers too */
+
+       counters = cpumem_first(&cmi, cm);
+       do {
+               for (i = 0; i < n; i++)
+                       counters[i] = 0;
+
+               counters = cpumem_next(&cmi, cm);
+       } while (counters != NULL);
+}
+
+#else /* MULTIPROCESSOR */
+
+/*
+ * Uniprocessor implementation of per-CPU data structures.
+ *
+ * UP percpu memory is a single memory allocation cast to/from the
+ * cpumem struct. It is not rounded up to a cacheline boundary because
+ * there are no other cpu caches to contend with.
+ */
+
+void
+percpu_init(void)
+{
+       /* nop */
+}
+
+struct cpumem *
+cpumem_get(struct pool *pp)
+{
+       return (pool_get(pp, PR_WAITOK));
+}
+
+void
+cpumem_put(struct pool *pp, struct cpumem *cm)
+{
+       pool_put(pp, cm);
+}
+
+struct cpumem *
+cpumem_malloc(size_t sz, int type)
+{
+       return (malloc(sz, type, M_WAITOK));
+}
+
+struct cpumem *
+cpumem_realloc(struct cpumem *cm, size_t sz, int type)
+{
+       return (cm);
+}
+
+void
+cpumem_free(struct cpumem *cm, int type, size_t sz)
+{
+       free(cm, type, sz);
+}
+
+struct cpumem *
+counters_alloc(unsigned int n, int type)
+{
+       KASSERT(n > 0);
+
+       return (cpumem_malloc(n * sizeof(uint64_t), type));
+}
+
+struct cpumem *
+counters_realloc(struct cpumem *cm, unsigned int n, int type)
+{
+       /* this is unnecessary, but symmetrical */
+       return (cpumem_realloc(cm, n * sizeof(uint64_t), type));
+}
+
+void
+counters_free(struct cpumem *cm, int type, unsigned int n)
+{
+       cpumem_free(cm, type, n * sizeof(uint64_t));
+}
+
+void
+counters_read(struct cpumem *cm, uint64_t *output, unsigned int n)
+{
+       uint64_t *counters;
+       unsigned int i;
+       int s;
+
+       counters = (uint64_t *)cm;
+
+       s = splhigh();
+       for (i = 0; i < n; i++)
+               output[i] = counters[i];
+       splx(s);
+}
+
+void
+counters_zero(struct cpumem *cm, unsigned int n)
+{
+       uint64_t *counters;
+       unsigned int i;
+       int s;
+
+       counters = (uint64_t *)cm;
+
+       s = splhigh();
+       for (i = 0; i < n; i++)
+               counters[i] = 0;
+       splx(s);
+}
+
+#endif /* MULTIPROCESSOR */
+
Index: kern/init_main.c
===================================================================
RCS file: /cvs/src/sys/kern/init_main.c,v
retrieving revision 1.253
diff -u -p -r1.253 init_main.c
--- kern/init_main.c    17 May 2016 23:28:03 -0000      1.253
+++ kern/init_main.c    12 Aug 2016 04:01:42 -0000
@@ -143,6 +143,7 @@ void        init_exec(void);
 void   kqueue_init(void);
 void   taskq_init(void);
 void   pool_gc_pages(void *);
+void   percpu_init(void);
 
 extern char sigcode[], esigcode[], sigcoderet[];
 #ifdef SYSCALL_DEBUG
@@ -413,6 +414,9 @@ main(void *framep)
                __guard_local = newguard;
        }
 #endif
+
+       percpu_init();          /* per cpu memory allocation */
+       mbcache();              /* enable per cpu caches on mbuf pools */
 
        /* init exec and emul */
        init_exec();
Index: kern/kern_sysctl.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_sysctl.c,v
retrieving revision 1.306
diff -u -p -r1.306 kern_sysctl.c
--- kern/kern_sysctl.c  14 Jul 2016 15:39:40 -0000      1.306
+++ kern/kern_sysctl.c  12 Aug 2016 04:01:42 -0000
@@ -77,6 +77,7 @@
 #include <sys/sched.h>
 #include <sys/mount.h>
 #include <sys/syscallargs.h>
+#include <sys/percpu.h>
 
 #include <uvm/uvm_extern.h>
 
@@ -386,9 +387,24 @@ kern_sysctl(int *name, u_int namelen, vo
        case KERN_FILE:
                return (sysctl_file(name + 1, namelen - 1, oldp, oldlenp, p));
 #endif
-       case KERN_MBSTAT:
-               return (sysctl_rdstruct(oldp, oldlenp, newp, &mbstat,
-                   sizeof(mbstat)));
+       case KERN_MBSTAT: {
+               extern struct cpumem *mbstat;
+               uint64_t counters[MBSTAT_COUNT];
+               struct mbstat mbs;
+               unsigned int i;
+
+               memset(&mbs, 0, sizeof(mbs));
+               counters_read(mbstat, counters, MBSTAT_COUNT);
+               for (i = 0; i < MBSTAT_TYPES; i++)
+                       mbs.m_mtypes[i] = counters[i];
+
+               mbs.m_drops = counters[MBSTAT_DROPS];
+               mbs.m_wait = counters[MBSTAT_WAIT];
+               mbs.m_drain = counters[MBSTAT_DRAIN];
+
+               return (sysctl_rdstruct(oldp, oldlenp, newp,
+                   &mbs, sizeof(mbs)));
+       }
 #ifdef GPROF
        case KERN_PROF:
                return (sysctl_doprof(name + 1, namelen - 1, oldp, oldlenp,
Index: kern/subr_pool.c
===================================================================
RCS file: /cvs/src/sys/kern/subr_pool.c,v
retrieving revision 1.194
diff -u -p -r1.194 subr_pool.c
--- kern/subr_pool.c    15 Jan 2016 11:21:58 -0000      1.194
+++ kern/subr_pool.c    12 Aug 2016 04:01:42 -0000
@@ -42,6 +42,7 @@
 #include <sys/sysctl.h>
 #include <sys/task.h>
 #include <sys/timeout.h>
+#include <sys/percpu.h>
 
 #include <uvm/uvm_extern.h>
 
@@ -96,6 +97,33 @@ struct pool_item {
 };
 #define POOL_IMAGIC(ph, pi) ((u_long)(pi) ^ (ph)->ph_magic)
 
+#ifdef MULTIPROCESSOR
+struct pool_list {
+       struct pool_list        *pl_next;       /* next in list */
+       unsigned long            pl_cookie;
+       struct pool_list        *pl_nextl;      /* next list */
+       unsigned long            pl_nitems;     /* items in list */
+};
+
+struct pool_cache {
+       struct pool_list        *pc_actv;
+       unsigned long            pc_nactv;      /* cache pc_actv nitems */
+       struct pool_list        *pc_prev;
+
+       uint64_t                 pc_gen;        /* generation number */
+       uint64_t                 pc_gets;
+       uint64_t                 pc_puts;
+       uint64_t                 pc_fails;
+
+       int                      pc_nout;
+};
+
+void   *pool_cache_get(struct pool *);
+void    pool_cache_put(struct pool *, void *);
+void    pool_cache_destroy(struct pool *);
+#endif
+void    pool_cache_info(struct pool *, struct kinfo_pool *);
+
 #ifdef POOL_DEBUG
 int    pool_debug = 1;
 #else
@@ -363,6 +391,11 @@ pool_destroy(struct pool *pp)
        struct pool_item_header *ph;
        struct pool *prev, *iter;
 
+#ifdef MULTIPROCESSOR
+       if (pp->pr_cache != NULL)
+               pool_cache_destroy(pp);
+#endif
+
 #ifdef DIAGNOSTIC
        if (pp->pr_nout != 0)
                panic("%s: pool busy: still out: %u", __func__, pp->pr_nout);
@@ -429,8 +462,15 @@ pool_get(struct pool *pp, int flags)
        void *v = NULL;
        int slowdown = 0;
 
-       KASSERT(flags & (PR_WAITOK | PR_NOWAIT));
+#ifdef MULTIPROCESSOR
+       if (pp->pr_cache != NULL) {
+               v = pool_cache_get(pp);
+               if (v != NULL)
+                       goto good;
+       }
+#endif
 
+       KASSERT(flags & (PR_WAITOK | PR_NOWAIT));
 
        mtx_enter(&pp->pr_mtx);
        if (pp->pr_nout >= pp->pr_hardlimit) {
@@ -462,6 +502,9 @@ pool_get(struct pool *pp, int flags)
                v = mem.v;
        }
 
+#ifdef MULTIPROCESSOR
+good:
+#endif
        if (ISSET(flags, PR_ZERO))
                memset(v, 0, pp->pr_size);
 
@@ -551,7 +594,7 @@ pool_do_get(struct pool *pp, int flags, 
        MUTEX_ASSERT_LOCKED(&pp->pr_mtx);
 
        if (pp->pr_ipl != -1)
-               splassert(pp->pr_ipl);
+               splassertpl(pp->pr_ipl, pp->pr_wchan);
 
        /*
         * Account for this item now to avoid races if we need to give up
@@ -641,6 +684,13 @@ pool_put(struct pool *pp, void *v)
                panic("%s: NULL item", __func__);
 #endif
 
+#ifdef MULTIPROCESSOR
+       if (pp->pr_cache != NULL && TAILQ_EMPTY(&pp->pr_requests)) {
+               pool_cache_put(pp, v);
+               return;
+       }
+#endif
+
        mtx_enter(&pp->pr_mtx);
 
        if (pp->pr_ipl != -1)
@@ -1346,6 +1396,8 @@ sysctl_dopool(int *name, u_int namelen, 
                if (pp->pr_ipl != -1)
                        mtx_leave(&pp->pr_mtx);
 
+               pool_cache_info(pp, &pi);
+
                rv = sysctl_rdstruct(oldp, oldlenp, NULL, &pi, sizeof(pi));
                break;
        }
@@ -1512,3 +1564,261 @@ pool_multi_free_ni(struct pool *pp, void
        km_free(v, pp->pr_pgsize, &kv, pp->pr_crange);
        KERNEL_UNLOCK();
 }
+
+#ifdef MULTIPROCESSOR
+
+struct pool pool_caches; /* per cpu cache entries */
+
+void
+pool_cache_init(struct pool *pp)
+{
+       struct cpumem *cm;
+       struct pool_cache *pc;
+       struct cpumem_iter i;
+
+       if (pool_caches.pr_size == 0) {
+               pool_init(&pool_caches, sizeof(struct pool_cache), 64, 0,
+                   PR_WAITOK, "plcache", NULL);
+               pool_setipl(&pool_caches, IPL_NONE);
+       }
+
+       KASSERT(pp->pr_size >= sizeof(struct pool_list));
+
+       cm = cpumem_get(&pool_caches);
+
+       mtx_init(&pp->pr_cache_mtx, pp->pr_ipl);
+       pp->pr_cache_list = NULL;
+       pp->pr_cache_nlist = 0;
+       pp->pr_cache_items = 8;
+       pp->pr_cache_contention = 0;
+       pp->pr_cache_contention_prev = 0;
+
+       CPUMEM_FOREACH(pc, &i, cm) {
+               pc->pc_actv = NULL;
+               pc->pc_nactv = 0;
+               pc->pc_prev = NULL;
+
+               pc->pc_gets = 0;
+               pc->pc_puts = 0;
+               pc->pc_fails = 0;
+               pc->pc_nout = 0;
+       }
+
+       pp->pr_cache = cm;
+}
+
+static inline void
+pool_list_enter(struct pool *pp)
+{
+       if (mtx_enter_try(&pp->pr_cache_mtx) == 0) {
+               mtx_enter(&pp->pr_cache_mtx);
+               pp->pr_cache_contention++;
+       }
+}
+
+static inline void
+pool_list_leave(struct pool *pp)
+{
+       mtx_leave(&pp->pr_cache_mtx);
+}
+
+static inline struct pool_list *
+pool_list_alloc(struct pool *pp, struct pool_cache *pc)
+{
+       struct pool_list *pl;
+
+       pool_list_enter(pp);
+       pl = pp->pr_cache_list;
+       if (pl != NULL) {
+               pp->pr_cache_list = pl->pl_nextl;
+               pp->pr_cache_nlist--;
+       }
+
+       pp->pr_cache_nout += pc->pc_nout;
+       pc->pc_nout = 0;
+       pool_list_leave(pp);
+
+       return (pl);
+}
+
+static inline void
+pool_list_free(struct pool *pp, struct pool_cache *pc, struct pool_list *pl)
+{
+       pool_list_enter(pp);
+       pl->pl_nextl = pp->pr_cache_list;
+       pp->pr_cache_list = pl;
+       pp->pr_cache_nlist++;
+
+       pp->pr_cache_nout += pc->pc_nout;
+       pc->pc_nout = 0;
+       pool_list_leave(pp);
+}
+
+static inline struct pool_cache *
+pool_cache_enter(struct pool *pp, int *s)
+{
+       struct pool_cache *pc;
+
+       pc = cpumem_enter(pp->pr_cache);
+       *s = splraise(pp->pr_ipl);
+       pc->pc_gen++;
+
+       return (pc);
+}
+
+static inline void
+pool_cache_leave(struct pool *pp, struct pool_cache *pc, int s)
+{
+       pc->pc_gen++;
+       splx(s);
+       cpumem_leave(pp->pr_cache, pc);
+}
+
+void *
+pool_cache_get(struct pool *pp)
+{
+       struct pool_cache *pc;
+       struct pool_list *pl;
+       int s;
+
+       pc = pool_cache_enter(pp, &s);
+
+       if (pc->pc_actv != NULL) {
+               pl = pc->pc_actv;
+       } else if (pc->pc_prev != NULL) {
+               pl = pc->pc_prev;
+               pc->pc_prev = NULL;
+       } else if ((pl = pool_list_alloc(pp, pc)) == NULL) {
+               pc->pc_fails++;
+               goto done;
+       }
+
+       pc->pc_actv = pl->pl_next;
+       pc->pc_nactv = pl->pl_nitems - 1;
+       pc->pc_gets++;
+       pc->pc_nout++;
+done:
+       pool_cache_leave(pp, pc, s);
+
+       return (pl);
+}
+
+void
+pool_cache_put(struct pool *pp, void *v)
+{
+       struct pool_cache *pc;
+       struct pool_list *pl = v;
+       unsigned long cache_items = pp->pr_cache_items;
+       unsigned long nitems;
+       int s;
+
+       pc = pool_cache_enter(pp, &s);
+
+       nitems = pc->pc_nactv;
+       if (__predict_false(nitems >= cache_items)) {
+               if (pc->pc_prev != NULL)
+                       pool_list_free(pp, pc, pc->pc_prev);
+                       
+               pc->pc_prev = pc->pc_actv;
+
+               pc->pc_actv = NULL;
+               pc->pc_nactv = 0;
+               nitems = 0;
+       }
+
+       pl->pl_next = pc->pc_actv;
+       pl->pl_nitems = ++nitems;
+
+       pc->pc_actv = pl;
+       pc->pc_nactv = nitems;
+
+       pc->pc_puts++;
+       pc->pc_nout--;
+
+       pool_cache_leave(pp, pc, s);
+}
+
+struct pool_list *
+pool_list_put(struct pool *pp, struct pool_list *pl)
+{
+       struct pool_list *rpl, *npl;
+
+       if (pl == NULL)
+               return (NULL);
+
+       rpl = pl->pl_nextl;     /* the next list in the depot, if any */
+
+       do {
+               npl = pl->pl_next;
+               pool_put(pp, pl);
+               pl = npl;
+       } while (pl != NULL);
+
+       return (rpl);
+}
+
+void
+pool_cache_destroy(struct pool *pp)
+{
+       struct pool_cache *pc;
+       struct pool_list *pl;
+       struct cpumem_iter i;
+       struct cpumem *cm;
+
+       cm = pp->pr_cache;
+       pp->pr_cache = NULL; /* make pool_put avoid the cache */
+
+       CPUMEM_FOREACH(pc, &i, cm) {
+               pool_list_put(pp, pc->pc_actv);
+               pool_list_put(pp, pc->pc_prev);
+       }
+
+       cpumem_put(&pool_caches, cm);
+
+       pl = pp->pr_cache_list;
+       while (pl != NULL)
+               pl = pool_list_put(pp, pl);
+}
+
+void
+pool_cache_info(struct pool *pp, struct kinfo_pool *pi)
+{
+       struct pool_cache *pc;
+       struct cpumem_iter i;
+
+       if (pp->pr_cache == NULL)
+               return;
+
+       mtx_enter(&pp->pr_cache_mtx);
+       CPUMEM_FOREACH(pc, &i, pp->pr_cache) {
+               uint64_t gen, nget, nput;
+
+               do {
+                       while ((gen = pc->pc_gen) & 1)
+                               yield();
+
+                       nget = pc->pc_gets;
+                       nput = pc->pc_puts;
+               } while (gen != pc->pc_gen);
+
+               pi->pr_nout += pc->pc_nout;
+               pi->pr_nget += nget;
+               pi->pr_nput += nput;
+       }
+
+       pi->pr_nout += pp->pr_cache_nout;
+       mtx_leave(&pp->pr_cache_mtx);
+}
+#else /* MULTIPROCESSOR */
+void
+pool_cache_init(struct pool *pp)
+{
+       /* nop */
+}
+
+void
+pool_cache_info(struct pool *pp, struct kinfo_pool *pi)
+{
+       /* nop */
+}
+#endif /* MULTIPROCESSOR */
Index: kern/uipc_mbuf.c
===================================================================
RCS file: /cvs/src/sys/kern/uipc_mbuf.c,v
retrieving revision 1.226
diff -u -p -r1.226 uipc_mbuf.c
--- kern/uipc_mbuf.c    13 Jun 2016 21:24:43 -0000      1.226
+++ kern/uipc_mbuf.c    12 Aug 2016 04:01:42 -0000
@@ -83,6 +83,7 @@
 #include <sys/domain.h>
 #include <sys/protosw.h>
 #include <sys/pool.h>
+#include <sys/percpu.h>
 
 #include <sys/socket.h>
 #include <sys/socketvar.h>
@@ -99,9 +100,11 @@
 #include <net/pfvar.h>
 #endif /* NPF > 0 */
 
-struct mbstat mbstat;          /* mbuf stats */
-struct mutex mbstatmtx = MUTEX_INITIALIZER(IPL_NET);
-struct pool mbpool;            /* mbuf pool */
+/* mbuf stats */
+COUNTERS_BOOT_MEMORY(mbstat_boot, MBSTAT_COUNT);
+struct cpumem *mbstat = COUNTERS_BOOT_INITIALIZER(mbstat_boot);
+/* mbuf pools */
+struct pool mbpool;
 struct pool mtagpool;
 
 /* mbuf cluster pools */
@@ -133,8 +136,8 @@ void        m_zero(struct mbuf *);
 static void (*mextfree_fns[4])(caddr_t, u_int, void *);
 static u_int num_extfree_fns;
 
-const char *mclpool_warnmsg =
-    "WARNING: mclpools limit reached; increase kern.maxclusters";
+const char *mbufpl_warnmsg =
+    "WARNING: mbuf limit reached; increase kern.maxclusters";
 
 /*
  * Initialize the mbuf allocator.
@@ -167,7 +170,6 @@ mbinit(void)
                    mclnames[i], NULL);
                pool_setipl(&mclpools[i], IPL_NET);
                pool_set_constraints(&mclpools[i], &kp_dma_contig);
-               pool_setlowat(&mclpools[i], mcllowat);
        }
 
        (void)mextfree_register(m_extfree_pool);
@@ -177,27 +179,22 @@ mbinit(void)
 }
 
 void
-nmbclust_update(void)
+mbcache(void)
 {
        int i;
-       /*
-        * Set the hard limit on the mclpools to the number of
-        * mbuf clusters the kernel is to support.  Log the limit
-        * reached message max once a minute.
-        */
-       for (i = 0; i < nitems(mclsizes); i++) {
-               (void)pool_sethardlimit(&mclpools[i], nmbclust,
-                   mclpool_warnmsg, 60);
-               /*
-                * XXX this needs to be reconsidered.
-                * Setting the high water mark to nmbclust is too high
-                * but we need to have enough spare buffers around so that
-                * allocations in interrupt context don't fail or mclgeti()
-                * drivers may end up with empty rings.
-                */
-               pool_sethiwat(&mclpools[i], nmbclust);
-       }
-       pool_sethiwat(&mbpool, nmbclust);
+
+       mbstat = counters_realloc(mbstat, MBSTAT_COUNT, M_DEVBUF);
+
+       pool_cache_init(&mbpool);
+       pool_cache_init(&mtagpool);
+       for (i = 0; i < nitems(mclsizes); i++)
+               pool_cache_init(&mclpools[i]);
+}
+
+void
+nmbclust_update(void)
+{
+       (void)pool_sethardlimit(&mbpool, nmbclust, mbufpl_warnmsg, 60);
 }
 
 /*
@@ -207,14 +204,21 @@ struct mbuf *
 m_get(int nowait, int type)
 {
        struct mbuf *m;
+       struct counters_ref cr;
+       uint64_t *counters;
+       int s;
+
+       KDASSERT(type < MT_NTYPES);
 
        m = pool_get(&mbpool, nowait == M_WAIT ? PR_WAITOK : PR_NOWAIT);
        if (m == NULL)
                return (NULL);
 
-       mtx_enter(&mbstatmtx);
-       mbstat.m_mtypes[type]++;
-       mtx_leave(&mbstatmtx);
+       s = splnet();
+       counters = counters_enter(&cr, mbstat);
+       counters[type]++;
+       counters_leave(&cr, mbstat);
+       splx(s);
 
        m->m_type = type;
        m->m_next = NULL;
@@ -233,14 +237,21 @@ struct mbuf *
 m_gethdr(int nowait, int type)
 {
        struct mbuf *m;
+       struct counters_ref cr;
+       uint64_t *counters;
+       int s;
+
+       KDASSERT(type < MT_NTYPES);
 
        m = pool_get(&mbpool, nowait == M_WAIT ? PR_WAITOK : PR_NOWAIT);
        if (m == NULL)
                return (NULL);
 
-       mtx_enter(&mbstatmtx);
-       mbstat.m_mtypes[type]++;
-       mtx_leave(&mbstatmtx);
+       s = splnet();
+       counters = counters_enter(&cr, mbstat);
+       counters[type]++;
+       counters_leave(&cr, mbstat);
+       splx(s);
 
        m->m_type = type;
 
@@ -352,13 +363,18 @@ struct mbuf *
 m_free(struct mbuf *m)
 {
        struct mbuf *n;
+       struct counters_ref cr;
+       uint64_t *counters;
+       int s;
 
        if (m == NULL)
                return (NULL);
 
-       mtx_enter(&mbstatmtx);
-       mbstat.m_mtypes[m->m_type]--;
-       mtx_leave(&mbstatmtx);
+       s = splnet();
+       counters = counters_enter(&cr, mbstat);
+       counters[m->m_type]--;
+       counters_leave(&cr, mbstat);
+       splx(s);
 
        n = m->m_next;
        if (m->m_flags & M_ZEROIZE) {
Index: conf/files
===================================================================
RCS file: /cvs/src/sys/conf/files,v
retrieving revision 1.622
diff -u -p -r1.622 files
--- conf/files  5 Aug 2016 19:00:25 -0000       1.622
+++ conf/files  12 Aug 2016 04:01:42 -0000
@@ -687,6 +687,7 @@ file kern/subr_evcount.c
 file kern/subr_extent.c
 file kern/subr_hibernate.c             hibernate
 file kern/subr_log.c
+file kern/subr_percpu.c
 file kern/subr_poison.c                        diagnostic
 file kern/subr_pool.c
 file kern/dma_alloc.c
