> -----Original Message-----
> From: Andrew Rybchenko [mailto:arybche...@solarflare.com]
> Sent: Sunday, January 13, 2019 7:31 AM
> To: Eads, Gage <gage.e...@intel.com>; dev@dpdk.org
> Cc: olivier.m...@6wind.com; Richardson, Bruce <bruce.richard...@intel.com>;
> Ananyev, Konstantin <konstantin.anan...@intel.com>
> Subject: Re: [PATCH 2/3] mempool/nb_stack: add non-blocking stack mempool
>
> Hi Gage,
>
> In general looks very good.
>
> Have you considered making nb_lifo.h a library so it can be reused outside
> of the mempool driver?
I'm certainly open to it, if the community can benefit. Due to the difficulty
of adding a new lib/ directory (I believe this requires Tech Board approval),
I'll defer that work to a separate patchset.

> There are a few notes below.
>
> Thanks,
> Andrew.
>
> On 1/10/19 11:55 PM, Gage Eads wrote:
> > This commit adds support for a non-blocking (linked list based) stack
> > mempool handler. The stack uses a 128-bit compare-and-swap instruction,
> > and thus is limited to x86_64. The 128-bit CAS atomically updates the
> > stack top pointer and a modification counter, which protects against the
> > ABA problem.
> >
> > In mempool_perf_autotest the lock-based stack outperforms the
> > non-blocking handler*, however:
> > - For applications with preemptible pthreads, a lock-based stack's
> >   worst-case performance (i.e. one thread being preempted while
> >   holding the spinlock) is much worse than the non-blocking stack's.
> > - Using per-thread mempool caches will largely mitigate the performance
> >   difference.
> >
> > *Test setup: x86_64 build with default config, dual-socket Xeon E5-2699
> > v4, running on isolcpus cores with a tickless scheduler. The lock-based
> > stack's rate_persec was 1x-3.5x the non-blocking stack's.
> >
> > Signed-off-by: Gage Eads <gage.e...@intel.com>
> > ---
> >  MAINTAINERS                                     |   4 +
> >  config/common_base                              |   1 +
> >  drivers/mempool/Makefile                        |   1 +
> >  drivers/mempool/nb_stack/Makefile               |  30 +++++
> >  drivers/mempool/nb_stack/meson.build            |   4 +
> >  drivers/mempool/nb_stack/nb_lifo.h              | 132 +++++++++++++++
> >  drivers/mempool/nb_stack/rte_mempool_nb_stack.c | 125 +++++++++++++++
> >  .../nb_stack/rte_mempool_nb_stack_version.map   |   4 +
> >  mk/rte.app.mk                                   |   1 +
> >  9 files changed, 302 insertions(+)
> >  create mode 100644 drivers/mempool/nb_stack/Makefile
> >  create mode 100644 drivers/mempool/nb_stack/meson.build
> >  create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
> >  create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> >  create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 470f36b9c..5519d3323 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -416,6 +416,10 @@ M: Artem V. Andreev <artem.andr...@oktetlabs.ru>
> >  M: Andrew Rybchenko <arybche...@solarflare.com>
> >  F: drivers/mempool/bucket/
> >
> > +Non-blocking stack memory pool
> > +M: Gage Eads <gage.e...@intel.com>
> > +F: drivers/mempool/nb_stack/
> > +
> >
> >  Bus Drivers
> >  -----------
> >
> > diff --git a/config/common_base b/config/common_base
> > index 964a6956e..40ce47312 100644
> > --- a/config/common_base
> > +++ b/config/common_base
> > @@ -728,6 +728,7 @@ CONFIG_RTE_DRIVER_MEMPOOL_BUCKET=y
> >  CONFIG_RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB=64
> >  CONFIG_RTE_DRIVER_MEMPOOL_RING=y
> >  CONFIG_RTE_DRIVER_MEMPOOL_STACK=y
> > +CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK=y
>
> Typically it is alphabetically sorted.

Will fix.

> >  #
> >  # Compile PMD for octeontx fpa mempool device
> > diff --git a/drivers/mempool/Makefile b/drivers/mempool/Makefile
> > index 28c2e8360..aeae3ac12 100644
> > --- a/drivers/mempool/Makefile
> > +++ b/drivers/mempool/Makefile
> > @@ -13,5 +13,6 @@ endif
> >  DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_RING) += ring
> >  DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK) += stack
> >  DIRS-$(CONFIG_RTE_LIBRTE_OCTEONTX_MEMPOOL) += octeontx
> > +DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += nb_stack
>
> Typically it is alphabetically sorted.
> Yes, it's already broken, but please put it before ring.

Sure, will do.

> >  include $(RTE_SDK)/mk/rte.subdir.mk
> > diff --git a/drivers/mempool/nb_stack/Makefile b/drivers/mempool/nb_stack/Makefile
> > new file mode 100644
> > index 000000000..38b45f4f5
> > --- /dev/null
> > +++ b/drivers/mempool/nb_stack/Makefile
> > @@ -0,0 +1,30 @@
> > +# SPDX-License-Identifier: BSD-3-Clause
> > +# Copyright(c) 2019 Intel Corporation
> > +
> > +include $(RTE_SDK)/mk/rte.vars.mk
> > +
> > +# The non-blocking stack uses a 128-bit compare-and-swap instruction, and thus
> > +# is limited to x86_64.
> > +ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
> > +
> > +#
> > +# library name
> > +#
> > +LIB = librte_mempool_nb_stack.a
> > +
> > +CFLAGS += -O3
> > +CFLAGS += $(WERROR_FLAGS)
> > +
> > +# Headers
> > +CFLAGS += -I$(RTE_SDK)/lib/librte_mempool
>
> I guess it is derived from stack. Is it really required? There is no such
> line in ring, bucket, octeontx and dpaa2.

Good guess :). No, I don't believe so -- I'll remove this.

> > +LDLIBS += -lrte_eal -lrte_mempool
> > +
> > +EXPORT_MAP := rte_mempool_nb_stack_version.map
> > +
> > +LIBABIVER := 1
> > +
> > +SRCS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += rte_mempool_nb_stack.c
> > +
> > +endif
> > +
> > +include $(RTE_SDK)/mk/rte.lib.mk
> > diff --git a/drivers/mempool/nb_stack/meson.build b/drivers/mempool/nb_stack/meson.build
> > new file mode 100644
> > index 000000000..66d64a9ba
> > --- /dev/null
> > +++ b/drivers/mempool/nb_stack/meson.build
> > @@ -0,0 +1,4 @@
> > +# SPDX-License-Identifier: BSD-3-Clause
> > +# Copyright(c) 2019 Intel Corporation
> > +
> > +sources = files('rte_mempool_nb_stack.c')
>
> Have you tested the meson build for a non-x86_64 target?
> I guess it should be fixed to skip it on non-x86_64 builds.

I did not -- my mistake. I'll correct this.

> > diff --git a/drivers/mempool/nb_stack/nb_lifo.h b/drivers/mempool/nb_stack/nb_lifo.h
> > new file mode 100644
> > index 000000000..701d75e37
> > --- /dev/null
> > +++ b/drivers/mempool/nb_stack/nb_lifo.h
> > @@ -0,0 +1,132 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2019 Intel Corporation
> > + */
> > +
> > +#ifndef _NB_LIFO_H_
> > +#define _NB_LIFO_H_
> > +
> > +struct nb_lifo_elem {
> > +	void *data;
> > +	struct nb_lifo_elem *next;
> > +};
> > +
> > +struct nb_lifo_head {
> > +	struct nb_lifo_elem *top; /**< Stack top */
> > +	uint64_t cnt; /**< Modification counter */
> > +};
> > +
> > +struct nb_lifo {
> > +	volatile struct nb_lifo_head head __rte_aligned(16);
> > +	rte_atomic64_t len;
> > +} __rte_cache_aligned;
> > +
> > +static __rte_always_inline void
> > +nb_lifo_init(struct nb_lifo *lifo)
> > +{
> > +	memset(lifo, 0, sizeof(*lifo));
> > +	rte_atomic64_set(&lifo->len, 0);
> > +}
> > +
> > +static __rte_always_inline unsigned int
> > +nb_lifo_len(struct nb_lifo *lifo)
> > +{
> > +	return (unsigned int) rte_atomic64_read(&lifo->len);
> > +}
> > +
> > +static __rte_always_inline void
> > +nb_lifo_push(struct nb_lifo *lifo,
> > +	     struct nb_lifo_elem *first,
> > +	     struct nb_lifo_elem *last,
> > +	     unsigned int num)
> > +{
> > +	while (1) {
> > +		struct nb_lifo_head old_head, new_head;
> > +
> > +		old_head = lifo->head;
> > +
> > +		/* Swing the top pointer to the first element in the list and
> > +		 * make the last element point to the old top.
> > +		 */
> > +		new_head.top = first;
> > +		new_head.cnt = old_head.cnt + 1;
> > +
> > +		last->next = old_head.top;
> > +
> > +		if (rte_atomic128_cmpset((volatile uint64_t *) &lifo->head,
> > +					 (uint64_t *)&old_head,
> > +					 (uint64_t *)&new_head))
> > +			break;
> > +	}
> > +
> > +	rte_atomic64_add(&lifo->len, num);
>
> I'd like to understand why it is not a problem that the change of the list
> and the increase of its length are not atomic, so we can get a wrong length
> of the stack in the middle. It would be good to explain it in a comment.

Indeed, there is a window in which the list appears shorter than it is. I
don't believe this is a problem because the get_count callback is inherently
racy/approximate (if it is called while the list is being accessed). That is,
even if the list and its size were updated atomically, the size could change
between when get_count reads the size and when that value is returned to the
calling thread. I placed the lifo->len updates such that the list may appear
to have fewer elements than it does, but will never appear to have more
elements. If the mempool is near-empty to the point that this is a concern,
I think the bigger problem is the mempool size. If that seems reasonable,
I'll add that as a comment in the code (a rough draft of the wording is in
the P.S. at the end of this mail).

> > +}
> > +
> > +static __rte_always_inline void
> > +nb_lifo_push_single(struct nb_lifo *lifo, struct nb_lifo_elem *elem)
> > +{
> > +	nb_lifo_push(lifo, elem, elem, 1);
> > +}
> > +
> > +static __rte_always_inline struct nb_lifo_elem *
> > +nb_lifo_pop(struct nb_lifo *lifo,
> > +	    unsigned int num,
> > +	    void **obj_table,
> > +	    struct nb_lifo_elem **last)
> > +{
> > +	struct nb_lifo_head old_head;
> > +
> > +	/* Reserve num elements, if available */
> > +	while (1) {
> > +		uint64_t len = rte_atomic64_read(&lifo->len);
> > +
> > +		/* Does the list contain enough elements? */
> > +		if (len < num)
> > +			return NULL;
> > +
> > +		if (rte_atomic64_cmpset((volatile uint64_t *)&lifo->len,
> > +					len, len - num))
> > +			break;
> > +	}
> > +
> > +	/* Pop num elements */
> > +	while (1) {
> > +		struct nb_lifo_head new_head;
> > +		struct nb_lifo_elem *tmp;
> > +		unsigned int i;
> > +
> > +		old_head = lifo->head;
> > +
> > +		tmp = old_head.top;
> > +
> > +		/* Traverse the list to find the new head. A next pointer will
> > +		 * either point to another element or NULL; if a thread
> > +		 * encounters a pointer that has already been popped, the CAS
> > +		 * will fail.
> > +		 */
> > +		for (i = 0; i < num && tmp != NULL; i++) {
> > +			if (obj_table)
> > +				obj_table[i] = tmp->data;
> > +			if (last)
> > +				*last = tmp;
>
> Isn't it better to do the obj_table and last assignments later, once the
> elements have been successfully reserved? If there are no retries, the
> current solution is optimal, but I guess a solution with a second traversal
> to fill in obj_table will show more stable performance results under high
> load, when many retries are done.

I suspect that the latency of the writes to obj_table and last would largely
be hidden by the pointer chasing, since obj_table (aided by the next-line
prefetcher) and last should be cached but chances are tmp->next won't be.
Admittedly this is just a theory; I haven't experimentally confirmed
anything. If you prefer, I'll investigate this further.
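For concreteness, here is a rough, untested sketch of the two-pass variant I
understand you're suggesting. It assumes the same nb_lifo types and the
rte_atomic128_cmpset() used by the patch; only the traversal structure
differs:

/* Hypothetical two-pass pop: reserve and swing the head first, then fill in
 * obj_table/last from the chain once it is privately owned by this thread.
 */
static __rte_always_inline struct nb_lifo_elem *
nb_lifo_pop_two_pass(struct nb_lifo *lifo,
		     unsigned int num,
		     void **obj_table,
		     struct nb_lifo_elem **last)
{
	struct nb_lifo_head old_head;
	struct nb_lifo_elem *tmp;
	unsigned int i;

	/* Reserve num elements, if available */
	while (1) {
		uint64_t len = rte_atomic64_read(&lifo->len);

		if (len < num)
			return NULL;

		if (rte_atomic64_cmpset((volatile uint64_t *)&lifo->len,
					len, len - num))
			break;
	}

	/* Pop num elements; only chase the next pointers here */
	while (1) {
		struct nb_lifo_head new_head;

		old_head = lifo->head;

		tmp = old_head.top;
		for (i = 0; i < num && tmp != NULL; i++)
			tmp = tmp->next;

		/* The list was modified while traversing it. Retry. */
		if (i != num)
			continue;

		new_head.top = tmp;
		new_head.cnt = old_head.cnt + 1;

		if (rte_atomic128_cmpset((volatile uint64_t *)&lifo->head,
					 (uint64_t *)&old_head,
					 (uint64_t *)&new_head))
			break;
	}

	/* A successful CAS means the chain was not modified since old_head
	 * was read (any push or pop increments cnt), and the chain is now
	 * owned by this thread, so these stores are never thrown away.
	 */
	tmp = old_head.top;
	for (i = 0; i < num; i++) {
		if (obj_table)
			obj_table[i] = tmp->data;
		if (last)
			*last = tmp;
		tmp = tmp->next;
	}

	return old_head.top;
}

The trade-off is a second walk over up to num next pointers, which should be
cache-warm by then, versus never wasting the obj_table/last stores on a
failed CAS. If it measures better under contention, I can fold it into v2.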
> > +			tmp = tmp->next;
> > +		}
> > +
> > +		/* If NULL was encountered, the list was modified while
> > +		 * traversing it. Retry.
> > +		 */
> > +		if (i != num)
> > +			continue;
> > +
> > +		new_head.top = tmp;
> > +		new_head.cnt = old_head.cnt + 1;
> > +
> > +		if (rte_atomic128_cmpset((volatile uint64_t *) &lifo->head,
> > +					 (uint64_t *)&old_head,
> > +					 (uint64_t *)&new_head))
> > +			break;
> > +	}
> > +
> > +	return old_head.top;
> > +}
> > +
> > +#endif /* _NB_LIFO_H_ */
> > diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack.c b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> > new file mode 100644
> > index 000000000..1b30775f7
> > --- /dev/null
> > +++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> > @@ -0,0 +1,125 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2019 Intel Corporation
> > + */
> > +
> > +#include <stdio.h>
> > +#include <rte_mempool.h>
> > +#include <rte_malloc.h>
> > +
> > +#include "nb_lifo.h"
> > +
> > +struct rte_mempool_nb_stack {
> > +	uint64_t size;
> > +	struct nb_lifo used_lifo; /**< LIFO containing mempool pointers */
> > +	struct nb_lifo free_lifo; /**< LIFO containing unused LIFO elements */
> > +};
> > +
> > +static int
> > +nb_stack_alloc(struct rte_mempool *mp)
> > +{
> > +	struct rte_mempool_nb_stack *s;
> > +	struct nb_lifo_elem *elems;
> > +	unsigned int n = mp->size;
> > +	unsigned int size, i;
> > +
> > +	size = sizeof(*s) + n * sizeof(struct nb_lifo_elem);
> > +
> > +	/* Allocate our local memory structure */
> > +	s = rte_zmalloc_socket("mempool-nb_stack",
> > +			       size,
> > +			       RTE_CACHE_LINE_SIZE,
> > +			       mp->socket_id);
> > +	if (s == NULL) {
> > +		RTE_LOG(ERR, MEMPOOL, "Cannot allocate nb_stack!\n");
> > +		return -ENOMEM;
> > +	}
> > +
> > +	s->size = n;
> > +
> > +	nb_lifo_init(&s->used_lifo);
> > +	nb_lifo_init(&s->free_lifo);
> > +
> > +	elems = (struct nb_lifo_elem *) &s[1];
>
> Does checkpatch.sh generate a warning here because of the space after the
> type cast? There are a few similar cases in the patch.

No, because that's a --strict option. Regardless, I'm starting to prefer that
style -- I'll fix these instances in v2.

> > +	for (i = 0; i < n; i++)
> > +		nb_lifo_push_single(&s->free_lifo, &elems[i]);
> > +
> > +	mp->pool_data = s;
> > +
> > +	return 0;
> > +}
> > +
> > +static int
> > +nb_stack_enqueue(struct rte_mempool *mp, void * const *obj_table,
> > +		 unsigned int n)
> > +{
> > +	struct rte_mempool_nb_stack *s = mp->pool_data;
> > +	struct nb_lifo_elem *first, *last, *tmp;
> > +	unsigned int i;
> > +
> > +	if (unlikely(n == 0))
> > +		return 0;
> > +
> > +	/* Pop n free elements */
> > +	first = nb_lifo_pop(&s->free_lifo, n, NULL, NULL);
> > +	if (unlikely(!first))
>
> Just a nit, but as far as I know comparison with NULL is typically used in
> DPDK. (There are a few such cases in the patch.)

Looks like this is explicitly called out in the style guide:
https://doc.dpdk.org/guides/contributing/coding_style.html#null-pointers
Will fix in v2.
> > +		return -ENOBUFS;
> > +
> > +	/* Prepare the list elements */
> > +	tmp = first;
> > +	for (i = 0; i < n; i++) {
> > +		tmp->data = obj_table[i];
> > +		last = tmp;
> > +		tmp = tmp->next;
> > +	}
> > +
> > +	/* Enqueue them to the used list */
> > +	nb_lifo_push(&s->used_lifo, first, last, n);
> > +
> > +	return 0;
> > +}
> > +
> > +static int
> > +nb_stack_dequeue(struct rte_mempool *mp, void **obj_table,
> > +		 unsigned int n)
> > +{
> > +	struct rte_mempool_nb_stack *s = mp->pool_data;
> > +	struct nb_lifo_elem *first, *last;
> > +
> > +	if (unlikely(n == 0))
> > +		return 0;
> > +
> > +	/* Pop n used elements */
> > +	first = nb_lifo_pop(&s->used_lifo, n, obj_table, &last);
> > +	if (unlikely(!first))
> > +		return -ENOENT;
> > +
> > +	/* Enqueue the list elements to the free list */
> > +	nb_lifo_push(&s->free_lifo, first, last, n);
> > +
> > +	return 0;
> > +}
> > +
> > +static unsigned
> > +nb_stack_get_count(const struct rte_mempool *mp)
> > +{
> > +	struct rte_mempool_nb_stack *s = mp->pool_data;
> > +
> > +	return nb_lifo_len(&s->used_lifo);
> > +}
> > +
> > +static void
> > +nb_stack_free(struct rte_mempool *mp)
> > +{
> > +	rte_free((void *)(mp->pool_data));
>
> I think type cast is not required.

Will fix.

> > +}
> > +
> > +static struct rte_mempool_ops ops_nb_stack = {
> > +	.name = "nb_stack",
> > +	.alloc = nb_stack_alloc,
> > +	.free = nb_stack_free,
> > +	.enqueue = nb_stack_enqueue,
> > +	.dequeue = nb_stack_dequeue,
> > +	.get_count = nb_stack_get_count
> > +};
> > +
> > +MEMPOOL_REGISTER_OPS(ops_nb_stack);
> > diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map b/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> > new file mode 100644
> > index 000000000..fc8c95e91
> > --- /dev/null
> > +++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> > @@ -0,0 +1,4 @@
> > +DPDK_19.05 {
> > +
> > +	local: *;
> > +};
> > diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> > index 02e8b6f05..0b11d9417 100644
> > --- a/mk/rte.app.mk
> > +++ b/mk/rte.app.mk
> > @@ -133,6 +133,7 @@ ifeq ($(CONFIG_RTE_BUILD_SHARED_LIB),n)
> >
> >  _LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_BUCKET) += -lrte_mempool_bucket
> >  _LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK) += -lrte_mempool_stack
> > +_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += -lrte_mempool_nb_stack
>
> It is better to sort it alphabetically.

Will fix.

Appreciate the detailed review!

Thanks,
Gage
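P.S. Regarding the lifo->len comment discussed above, the wording I have in
mind is roughly the following (to be refined in v2); only the explanation is
new, the behavior it describes is unchanged:

	/* The lifo->len updates are positioned so that the stack may appear
	 * to hold fewer elements than it actually does, but never more: len
	 * is decremented before elements are removed in nb_lifo_pop() and
	 * incremented only after they are visible in nb_lifo_push().
	 * get_count is inherently approximate while the mempool is being
	 * accessed concurrently, so this is acceptable.
	 */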