> > Hi everyone,
> >
> > > > >>> I tried this. On x86 (Xeon(R) Gold 6132 CPU @ 2.60GHz), the
> > > > >>> results are as follows. The numbers in brackets are with the
> > > > >>> code on master.
> > > > >>> gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
> > > > >>>
> > > > >>> RTE>>ring_perf_elem_autotest
> > > > >>> ### Testing single element and burst enq/deq ###
> > > > >>> SP/SC single enq/dequeue: 5
> > > > >>> MP/MC single enq/dequeue: 40 (35)
> > > > >>> SP/SC burst enq/dequeue (size: 8): 2
> > > > >>> MP/MC burst enq/dequeue (size: 8): 6
> > > > >>> SP/SC burst enq/dequeue (size: 32): 1 (2)
> > > > >>> MP/MC burst enq/dequeue (size: 32): 2
> > > > >>>
> > > > >>> ### Testing empty dequeue ###
> > > > >>> SC empty dequeue: 2.11
> > > > >>> MC empty dequeue: 1.41 (2.11)
> > > > >>>
> > > > >>> ### Testing using a single lcore ###
> > > > >>> SP/SC bulk enq/dequeue (size: 8): 2.15 (2.86)
> > > > >>> MP/MC bulk enq/dequeue (size: 8): 6.35 (6.91)
> > > > >>> SP/SC bulk enq/dequeue (size: 32): 1.35 (2.06)
> > > > >>> MP/MC bulk enq/dequeue (size: 32): 2.38 (2.95)
> > > > >>>
> > > > >>> ### Testing using two physical cores ###
> > > > >>> SP/SC bulk enq/dequeue (size: 8): 73.81 (15.33)
> > > > >>> MP/MC bulk enq/dequeue (size: 8): 75.10 (71.27)
> > > > >>> SP/SC bulk enq/dequeue (size: 32): 21.14 (9.58)
> > > > >>> MP/MC bulk enq/dequeue (size: 32): 25.74 (20.91)
> > > > >>>
> > > > >>> ### Testing using two NUMA nodes ###
> > > > >>> SP/SC bulk enq/dequeue (size: 8): 164.32 (50.66)
> > > > >>> MP/MC bulk enq/dequeue (size: 8): 176.02 (173.43)
> > > > >>> SP/SC bulk enq/dequeue (size: 32): 50.78 (23)
> > > > >>> MP/MC bulk enq/dequeue (size: 32): 63.17 (46.74)
> > > > >>>
> > > > >>> On one of the Arm platforms:
> > > > >>> MP/MC bulk enq/dequeue (size: 32): 0.37 (0.33) (~12% hit, the
> > > > >>> rest are ok)
> > > >
> > > > Tried this on a Power9 platform (3.6GHz), with two NUMA nodes and
> > > > 16 cores/node (SMT=4). Applied all 3 patches in v5, test results
> > > > are as follows:
> > > >
> > > > RTE>>ring_perf_elem_autotest
> > > > ### Testing single element and burst enq/deq ###
> > > > SP/SC single enq/dequeue: 42
> > > > MP/MC single enq/dequeue: 59
> > > > SP/SC burst enq/dequeue (size: 8): 5
> > > > MP/MC burst enq/dequeue (size: 8): 7
> > > > SP/SC burst enq/dequeue (size: 32): 2
> > > > MP/MC burst enq/dequeue (size: 32): 2
> > > >
> > > > ### Testing empty dequeue ###
> > > > SC empty dequeue: 7.81
> > > > MC empty dequeue: 7.81
> > > >
> > > > ### Testing using a single lcore ###
> > > > SP/SC bulk enq/dequeue (size: 8): 5.76
> > > > MP/MC bulk enq/dequeue (size: 8): 7.66
> > > > SP/SC bulk enq/dequeue (size: 32): 2.10
> > > > MP/MC bulk enq/dequeue (size: 32): 2.57
> > > >
> > > > ### Testing using two hyperthreads ###
> > > > SP/SC bulk enq/dequeue (size: 8): 13.13
> > > > MP/MC bulk enq/dequeue (size: 8): 13.98
> > > > SP/SC bulk enq/dequeue (size: 32): 3.41
> > > > MP/MC bulk enq/dequeue (size: 32): 4.45
> > > >
> > > > ### Testing using two physical cores ###
> > > > SP/SC bulk enq/dequeue (size: 8): 11.00
> > > > MP/MC bulk enq/dequeue (size: 8): 10.95
> > > > SP/SC bulk enq/dequeue (size: 32): 3.08
> > > > MP/MC bulk enq/dequeue (size: 32): 3.40
> > > >
> > > > ### Testing using two NUMA nodes ###
> > > > SP/SC bulk enq/dequeue (size: 8): 63.41
> > > > MP/MC bulk enq/dequeue (size: 8): 62.70
> > > > SP/SC bulk enq/dequeue (size: 32): 15.39
> > > > MP/MC bulk enq/dequeue (size: 32): 22.96
> > > >
> > > Thanks for running this. There is another test, 'ring_perf_autotest',
> > > which provides the numbers with the original implementation.
> > > The goal is to make sure the numbers with the original implementation
> > > are the same as these. Can you please run that as well?
> >
> > Honnappa,
> >
> > Your earlier perf report shows cycle counts of less than 1. That is
> > because it is using the 50 or 100 MHz clock in EL0.
> > Please check with the PMU counter. See "ARM64 profiling" in
> > http://doc.dpdk.org/guides/prog_guide/profile_app.html
> >
> > Here are the octeontx2 values. There is a regression in the two-core
> > cases, as you reported earlier on x86.
> >
> > RTE>>ring_perf_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 288
> > MP/MC single enq/dequeue: 452
> > SP/SC burst enq/dequeue (size: 8): 39
> > MP/MC burst enq/dequeue (size: 8): 61
> > SP/SC burst enq/dequeue (size: 32): 13
> > MP/MC burst enq/dequeue (size: 32): 21
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 6.33
> > MC empty dequeue: 6.67
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 38.35
> > MP/MC bulk enq/dequeue (size: 8): 67.36
> > SP/SC bulk enq/dequeue (size: 32): 13.10
> > MP/MC bulk enq/dequeue (size: 32): 21.64
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 75.94
> > MP/MC bulk enq/dequeue (size: 8): 107.66
> > SP/SC bulk enq/dequeue (size: 32): 24.51
> > MP/MC bulk enq/dequeue (size: 32): 33.23
> > Test OK
> > RTE>>
> >
> > ---- after applying v5 of the patch ------
> >
> > RTE>>ring_perf_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 289
> > MP/MC single enq/dequeue: 452
> > SP/SC burst enq/dequeue (size: 8): 40
> > MP/MC burst enq/dequeue (size: 8): 64
> > SP/SC burst enq/dequeue (size: 32): 13
> > MP/MC burst enq/dequeue (size: 32): 22
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 6.33
> > MC empty dequeue: 6.67
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 39.73
> > MP/MC bulk enq/dequeue (size: 8): 69.13
> > SP/SC bulk enq/dequeue (size: 32): 13.44
> > MP/MC bulk enq/dequeue (size: 32): 22.00
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 76.02
> > MP/MC bulk enq/dequeue (size: 8): 112.50
> > SP/SC bulk enq/dequeue (size: 32): 24.71
> > MP/MC bulk enq/dequeue (size: 32): 33.34
> > Test OK
> > RTE>>
> >
> > RTE>>ring_perf_elem_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 290
> > MP/MC single enq/dequeue: 503
> > SP/SC burst enq/dequeue (size: 8): 39
> > MP/MC burst enq/dequeue (size: 8): 63
> > SP/SC burst enq/dequeue (size: 32): 11
> > MP/MC burst enq/dequeue (size: 32): 19
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 6.33
> > MC empty dequeue: 6.67
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 38.92
> > MP/MC bulk enq/dequeue (size: 8): 62.54
> > SP/SC bulk enq/dequeue (size: 32): 11.46
> > MP/MC bulk enq/dequeue (size: 32): 19.89
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 87.55
> > MP/MC bulk enq/dequeue (size: 8): 99.10
> > SP/SC bulk enq/dequeue (size: 32): 26.63
> > MP/MC bulk enq/dequeue (size: 32): 29.91
> > Test OK
> > RTE>>
> >
> As I can see, there is a copy&paste bug in patch #3 (that is probably why
> it produced some weird numbers for me at first).
> After the fix is applied (see patch below), things look pretty good on my box.
> As I can see, there are only 3 results noticeably lower:
> SP/SC (size=8) over 2 physical cores on the same numa socket
> MP/MC (size=8) over 2 physical cores on different numa sockets
> All others seem about the same or better.
> Anyway, I went ahead and reworked the code a bit (as I suggested before)
> to get rid of these huge ENQUEUE/DEQUEUE macros.
> Results are very close to the fixed patch #3 version (patch is also attached).
> Though I suggest people hold off on re-running the perf tests till we make
> the ring functional test run for the _elem_ functions too.
> I started to work on that, but I am not sure I'll finish today (most likely
> Monday).

I have sent V6. It adds test cases for the 'rte_ring_xxx_elem' APIs. All
issues are fixed in both methods of copy; more info below. I will post the
performance info soon.
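For anyone who wants to try the new APIs before reviewing v6, the 64b
round-trip check has roughly the shape below. This is a simplified sketch,
not the code in 3/6: test_ring_elem_64b is a made-up helper name, the
enqueue/dequeue prototypes are assumed, and the create prototype follows the
one quoted later in this mail (count before esize); please take the exact
prototypes from rte_ring_elem.h in v6.

#include <stdint.h>
#include <string.h>

#include <rte_lcore.h>
#include <rte_ring_elem.h>

/* Hypothetical helper: round-trip 8 64-bit elements through an elem ring. */
static int
test_ring_elem_64b(void)
{
    uint64_t src[8], dst[8];
    unsigned int i;
    struct rte_ring *r;

    /* esize is the element size in bytes and must be a multiple of 4. */
    r = rte_ring_create_elem("test_e64", 1024, sizeof(uint64_t),
            rte_socket_id(), 0);
    if (r == NULL)
        return -1;

    for (i = 0; i < 8; i++)
        src[i] = 0xdeadbeef00000000ULL | i;

    /* Enqueue a known pattern, dequeue it, and compare bit-for-bit. */
    if (rte_ring_enqueue_bulk_elem(r, src, sizeof(uint64_t), 8, NULL) != 8 ||
        rte_ring_dequeue_bulk_elem(r, dst, sizeof(uint64_t), 8, NULL) != 8 ||
        memcmp(src, dst, sizeof(src)) != 0) {
        rte_ring_free(r);
        return -1;
    }

    rte_ring_free(r);
    return 0;
}

The check is deliberately trivial: it only verifies that a non-pointer-sized
element comes back out of the ring unchanged.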
> Perf results from my box, plus patches below.
> Konstantin
>
> perf results
> ==========
>
> Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
>
> A - ring_perf_autotest
> B - ring_perf_elem_autotest + patch #3 + fix
> C - B + update
>
> ### Testing using a single lcore ###        A       B       C
> SP/SC bulk enq/dequeue (size: 8):           4.06    3.06    3.22
> MP/MC bulk enq/dequeue (size: 8):           10.05   9.04    9.38
> SP/SC bulk enq/dequeue (size: 32):          2.93    1.91    1.84
> MP/MC bulk enq/dequeue (size: 32):          4.12    3.39    3.35
>
> ### Testing using two hyperthreads ###
> SP/SC bulk enq/dequeue (size: 8):           9.24    8.92    8.89
> MP/MC bulk enq/dequeue (size: 8):           15.47   15.39   16.02
> SP/SC bulk enq/dequeue (size: 32):          5.78    3.87    3.86
> MP/MC bulk enq/dequeue (size: 32):          6.41    4.57    4.45
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8):           24.14   29.89   27.05
> MP/MC bulk enq/dequeue (size: 8):           68.61   70.55   69.85
> SP/SC bulk enq/dequeue (size: 32):          12.11   12.99   13.04
> MP/MC bulk enq/dequeue (size: 32):          22.14   17.86   18.25
>
> ### Testing using two NUMA nodes ###
> SP/SC bulk enq/dequeue (size: 8):           48.78   31.98   33.57
> MP/MC bulk enq/dequeue (size: 8):           167.53  197.29  192.13
> SP/SC bulk enq/dequeue (size: 32):          31.28   21.68   21.61
> MP/MC bulk enq/dequeue (size: 32):          53.45   49.94   48.81
>
> fix patch
> =======
>
> From a2be5a9b136333a56d466ef042c655e522ca7012 Mon Sep 17 00:00:00 2001
> From: Konstantin Ananyev <konstantin.anan...@intel.com>
> Date: Fri, 18 Oct 2019 15:50:43 +0100
> Subject: [PATCH] fix1
>
> Signed-off-by: Konstantin Ananyev <konstantin.anan...@intel.com>
> ---
>  lib/librte_ring/rte_ring_elem.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
> index 92e92f150..5e1819069 100644
> --- a/lib/librte_ring/rte_ring_elem.h
> +++ b/lib/librte_ring/rte_ring_elem.h
> @@ -118,7 +118,7 @@ struct rte_ring *rte_ring_create_elem(const char *name, unsigned count,
>      uint32_t sz = n * (esize / sizeof(uint32_t)); \
>      if (likely(idx + n < size)) { \
>          for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
> -            memcpy (ring + i, obj + i, 8 * sizeof (uint32_t)); \
> +            memcpy (ring + idx, obj + i, 8 * sizeof (uint32_t)); \
>          } \
>          switch (n & 0x7) { \
>          case 7: \
> @@ -153,7 +153,7 @@ struct rte_ring *rte_ring_create_elem(const char *name, unsigned count,
>      uint32_t sz = n * (esize / sizeof(uint32_t)); \
>      if (likely(idx + n < size)) { \
>          for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
> -            memcpy (obj + i, ring + i, 8 * sizeof (uint32_t)); \
> +            memcpy (obj + i, ring + idx, 8 * sizeof (uint32_t)); \

Actually, this fix alone is not enough. 'idx' needs to be normalized to
elements of type 'uint32_t'.
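To spell out what that normalization means, here is an illustration only
(the helper name and 'scale' variable are mine, not code from the patch):
the macros copy through uint32_t pointers, so an offset derived from
prod_head/cons_head has to be scaled from ring-slot units into uint32_t
words before it can index 'ring' or be compared against sizes.

#include <stdint.h>

/* Hypothetical helper: convert a ring-slot index into a uint32_t offset. */
static inline uint32_t
slot_to_u32_offset(uint32_t head, uint32_t mask, uint32_t esize)
{
    uint32_t scale = esize / sizeof(uint32_t); /* uint32_t words per element */

    return (head & mask) * scale;
}

The wrap-around test then has to use the same units, i.e. compare
idx + n * scale against size * scale rather than idx + n against size.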
>          } \
>          switch (n & 0x7) { \
>          case 7: \
> --
> 2.17.1
>
> update patch (remove macros)
> =========================
>
> From 18b388e877b97e243f807f27a323e876b30869dd Mon Sep 17 00:00:00 2001
> From: Konstantin Ananyev <konstantin.anan...@intel.com>
> Date: Fri, 18 Oct 2019 17:35:43 +0100
> Subject: [PATCH] update1
>
> Signed-off-by: Konstantin Ananyev <konstantin.anan...@intel.com>
> ---
>  lib/librte_ring/rte_ring_elem.h | 141 ++++++++++++++++----------------
>  1 file changed, 70 insertions(+), 71 deletions(-)
>
> diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
> index 5e1819069..eb706b12f 100644
> --- a/lib/librte_ring/rte_ring_elem.h
> +++ b/lib/librte_ring/rte_ring_elem.h
> @@ -109,75 +109,74 @@ __rte_experimental struct rte_ring
>  *rte_ring_create_elem(const char *name, unsigned count,
>              unsigned esize, int socket_id, unsigned flags);
>
> -#define ENQUEUE_PTRS_GEN(r, ring_start, prod_head, obj_table, esize, n) do { \
> -    unsigned int i; \
> -    const uint32_t size = (r)->size; \
> -    uint32_t idx = prod_head & (r)->mask; \
> -    uint32_t *ring = (uint32_t *)ring_start; \
> -    uint32_t *obj = (uint32_t *)obj_table; \
> -    uint32_t sz = n * (esize / sizeof(uint32_t)); \
> -    if (likely(idx + n < size)) { \
> -        for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
> -            memcpy (ring + idx, obj + i, 8 * sizeof (uint32_t)); \
> -        } \
> -        switch (n & 0x7) { \
> -        case 7: \
> -            ring[idx++] = obj[i++]; /* fallthrough */ \
> -        case 6: \
> -            ring[idx++] = obj[i++]; /* fallthrough */ \
> -        case 5: \
> -            ring[idx++] = obj[i++]; /* fallthrough */ \
> -        case 4: \
> -            ring[idx++] = obj[i++]; /* fallthrough */ \
> -        case 3: \
> -            ring[idx++] = obj[i++]; /* fallthrough */ \
> -        case 2: \
> -            ring[idx++] = obj[i++]; /* fallthrough */ \
> -        case 1: \
> -            ring[idx++] = obj[i++]; /* fallthrough */ \
> -        } \
> -    } else { \
> -        for (i = 0; idx < size; i++, idx++)\
> -            ring[idx] = obj[i]; \
> -        for (idx = 0; i < n; i++, idx++) \
> -            ring[idx] = obj[i]; \
> -    } \
> -} while (0)
> -
> -#define DEQUEUE_PTRS_GEN(r, ring_start, cons_head, obj_table, esize, n) do { \
> -    unsigned int i; \
> -    uint32_t idx = cons_head & (r)->mask; \
> -    const uint32_t size = (r)->size; \
> -    uint32_t *ring = (uint32_t *)ring_start; \
> -    uint32_t *obj = (uint32_t *)obj_table; \
> -    uint32_t sz = n * (esize / sizeof(uint32_t)); \
> -    if (likely(idx + n < size)) { \
> -        for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
> -            memcpy (obj + i, ring + idx, 8 * sizeof (uint32_t)); \
> -        } \
> -        switch (n & 0x7) { \
> -        case 7: \
> -            obj[i++] = ring[idx++]; /* fallthrough */ \
> -        case 6: \
> -            obj[i++] = ring[idx++]; /* fallthrough */ \
> -        case 5: \
> -            obj[i++] = ring[idx++]; /* fallthrough */ \
> -        case 4: \
> -            obj[i++] = ring[idx++]; /* fallthrough */ \
> -        case 3: \
> -            obj[i++] = ring[idx++]; /* fallthrough */ \
> -        case 2: \
> -            obj[i++] = ring[idx++]; /* fallthrough */ \
> -        case 1: \
> -            obj[i++] = ring[idx++]; /* fallthrough */ \
> -        } \
> -    } else { \
> -        for (i = 0; idx < size; i++, idx++) \
> -            obj[i] = ring[idx]; \
> -        for (idx = 0; i < n; i++, idx++) \
> -            obj[i] = ring[idx]; \
> -    } \
> -} while (0)
> +static __rte_always_inline void
> +copy_elems(uint32_t du32[], const uint32_t su32[], uint32_t num,
> +        uint32_t esize)
> +{
> +    uint32_t i, sz;
> +
> +    sz = (num * esize) / sizeof(uint32_t);
> +
> +    for (i = 0; i < (sz & ~7); i += 8)
> +        memcpy(du32 + i, su32 + i, 8 * sizeof(uint32_t));
> +
> +    switch (sz & 7) {
> +    case 7: du32[sz - 7] = su32[sz - 7]; /* fallthrough */
> +    case 6: du32[sz - 6] = su32[sz - 6]; /* fallthrough */
> +    case 5: du32[sz - 5] = su32[sz - 5]; /* fallthrough */
> +    case 4: du32[sz - 4] = su32[sz - 4]; /* fallthrough */
> +    case 3: du32[sz - 3] = su32[sz - 3]; /* fallthrough */
> +    case 2: du32[sz - 2] = su32[sz - 2]; /* fallthrough */
> +    case 1: du32[sz - 1] = su32[sz - 1]; /* fallthrough */
> +    }
> +}
> +
> +static __rte_always_inline void
> +enqueue_elems(struct rte_ring *r, void *ring_start, uint32_t prod_head,
> +        void *obj_table, uint32_t num, uint32_t esize)
> +{
> +    uint32_t idx, n;
> +    uint32_t *du32;
> +    const uint32_t *su32;
> +
> +    const uint32_t size = r->size;
> +
> +    idx = prod_head & (r)->mask;

Same here: 'idx' needs to be normalized to elements of type 'uint32_t', and
there are similar fixes needed on other variables. I have applied your
suggestion in 6/6 of v6 along with my corrections. The rte_ring_elem test
cases are added in 3/6. I have verified that they run fine (they cover 64b
elements alone for now; I will add more). Hopefully, there are no more
errors. A sketch of the corrected index computation follows the quoted
patch below.

> +
> +    du32 = (uint32_t *)ring_start + idx;
> +    su32 = obj_table;
> +
> +    if (idx + num < size)
> +        copy_elems(du32, su32, num, esize);
> +    else {
> +        n = size - idx;
> +        copy_elems(du32, su32, n, esize);
> +        copy_elems(ring_start, su32 + n, num - n, esize);
> +    }
> +}
> +
> +static __rte_always_inline void
> +dequeue_elems(struct rte_ring *r, void *ring_start, uint32_t cons_head,
> +        void *obj_table, uint32_t num, uint32_t esize)
> +{
> +    uint32_t idx, n;
> +    uint32_t *du32;
> +    const uint32_t *su32;
> +
> +    const uint32_t size = r->size;
> +
> +    idx = cons_head & (r)->mask;
> +
> +    su32 = (uint32_t *)ring_start + idx;
> +    du32 = obj_table;
> +
> +    if (idx + num < size)
> +        copy_elems(du32, su32, num, esize);
> +    else {
> +        n = size - idx;
> +        copy_elems(du32, su32, n, esize);
> +        copy_elems(du32 + n, ring_start, num - n, esize);
> +    }
> +}
>
>  /* Between load and load. there might be cpu reorder in weak model
>   * (powerpc/arm).
> @@ -232,7 +231,7 @@ __rte_ring_do_enqueue_elem(struct rte_ring *r, void * const obj_table,
>      if (n == 0)
>          goto end;
>
> -    ENQUEUE_PTRS_GEN(r, &r[1], prod_head, obj_table, esize, n);
> +    enqueue_elems(r, &r[1], prod_head, obj_table, n, esize);
>
>      update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
>  end:
> @@ -279,7 +278,7 @@ __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
>      if (n == 0)
>          goto end;
>
> -    DEQUEUE_PTRS_GEN(r, &r[1], cons_head, obj_table, esize, n);
> +    dequeue_elems(r, &r[1], cons_head, obj_table, n, esize);
>
>      update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
>
> --
> 2.17.1
>
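For reference, the sketch mentioned above: roughly what the normalization
looks like when applied to the enqueue_elems() in the quoted patch, reusing
its copy_elems(). This is only an illustration of the fix being discussed
(the names 'scale', 'nr_num' and 'nr_size' are mine); the actual code is in
6/6 of v6.

static __rte_always_inline void
enqueue_elems(struct rte_ring *r, void *ring_start, uint32_t prod_head,
        void *obj_table, uint32_t num, uint32_t esize)
{
    /* Scale everything derived from ring slots into uint32_t words. */
    const uint32_t scale = esize / sizeof(uint32_t);
    const uint32_t nr_num = num * scale;       /* words to copy */
    const uint32_t nr_size = r->size * scale;  /* ring capacity in words */
    uint32_t idx, n;
    uint32_t *du32;
    const uint32_t *su32;

    /* Slot index converted to a uint32_t word offset. */
    idx = (prod_head & r->mask) * scale;

    du32 = (uint32_t *)ring_start + idx;
    su32 = obj_table;

    if (idx + nr_num < nr_size)
        copy_elems(du32, su32, num, esize);
    else {
        /* Wrap-around: 'n' whole elements fit before the end of the ring. */
        n = (nr_size - idx) / scale;
        copy_elems(du32, su32, n, esize);
        copy_elems((uint32_t *)ring_start, su32 + n * scale,
                num - n, esize);
    }
}

The key point is that the index, the wrap-around check, and the pointer
arithmetic all have to use the same uint32_t-word units once the copies go
through uint32_t pointers.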