On Sat, Jul 13, 2024 at 08:49:49PM +0530, Vipin Varghese wrote:
> Goal of the patch is to improve SSE macswap on x86_64 by reducing
> the stalls in backend engine. Original implementation of the SSE
> macswap makes loop call to multiple load, shuffle & store. Using
> SIMD ISA interleaving we can reduce the stalls for
>  - load SSE token exhaustion
>  - Shuffle and Load dependency
>
> Also other changes which improves packet per second are
>  - Filling access to MBUF for offload flags which is separate cacheline,
>  - using register keyword
>
> Test results:
> ------------
> Platform: AMD EPYC SIENA 8594P @2.3GHz, no boost
> DPDK: 24.03
>
> ------------------------------------------------
> TEST IO 64B: baseline <NIC : MPPs>
>  - mellanox CX-7 2*200Gbps       : 42.0
>  - intel E810 1*100Gbps          : 82.0
>  - intel E810 2*200Gbps (2CQ-DA2): 83.0
> ------------------------------------------------
> TEST MACSWAP 64B: <NIC : Before : After>
>  - mellanox CX-7 2*200Gbps       : 31.533 : 31.90
>  - intel E810 1*100Gbps          : 50.380 : 47.0
>  - intel E810 2*200Gbps (2CQ-DA2): 48.840 : 49.827
> ------------------------------------------------
> TEST MACSWAP 128B: <NIC : Before : After>
>  - mellanox CX-7 2*200Gbps       : 30.946 : 31.770
>  - intel E810 1*100Gbps          : 49.386 : 46.366
>  - intel E810 2*200Gbps (2CQ-DA2): 47.979 : 49.503
> ------------------------------------------------
> TEST MACSWAP 256B: <NIC : Before : After>
>  - mellanox CX-7 2*200Gbps       : 32.480 : 33.150
>  - intel E810 1*100Gbps          : 45.29  : 44.571
>  - intel E810 2*200Gbps (2CQ-DA2): 45.033 : 45.117
> ------------------------------------------------
>
Hi, interesting patch.

Do you know why we see regressions in some of the cases above? For 1x100G
at 64B and 128B packet sizes we see perf drops of ~3 Mpps, versus smaller
gains in the other two cases at each size (much smaller gains in the 64B
case).

A couple of other questions inline below too.

Thanks,
/Bruce

> using multiple queues and lcore there is linear increase in MPPs.
>
> Signed-off-by: Vipin Varghese <vipin.vargh...@amd.com>
> ---
>  app/test-pmd/macswap_sse.h | 40 ++++++++++++++++++--------------------
>  1 file changed, 19 insertions(+), 21 deletions(-)
>
> diff --git a/app/test-pmd/macswap_sse.h b/app/test-pmd/macswap_sse.h
> index 223f87a539..a3d3a274e5 100644
> --- a/app/test-pmd/macswap_sse.h
> +++ b/app/test-pmd/macswap_sse.h
> @@ -11,21 +11,21 @@ static inline void
>  do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
>  		struct rte_port *txp)
>  {
> -	struct rte_ether_hdr *eth_hdr[4];
> -	struct rte_mbuf *mb[4];
> +	register struct rte_ether_hdr *eth_hdr[8];
> +	register struct rte_mbuf *mb[8];

Does using "register" actually make a difference to the generated code?

Also, why increase the array sizes from 4 to 8? The actual code only uses
4 elements of each array below anyway. Is it for cache alignment purposes,
perhaps? If so, please use explicit cache alignment attributes to specify
this rather than having it implicit in the array sizes (there is a rough
sketch of what I mean at the end of this mail).

>  	uint64_t ol_flags;
>  	int i;
>  	int r;
> -	__m128i addr0, addr1, addr2, addr3;
> +	register __m128i addr0, addr1, addr2, addr3;
>  	/**
>  	 * shuffle mask be used to shuffle the 16 bytes.
>  	 * byte 0-5 wills be swapped with byte 6-11.
>  	 * byte 12-15 will keep unchanged.
>  	 */
> -	__m128i shfl_msk = _mm_set_epi8(15, 14, 13, 12,
> -			5, 4, 3, 2,
> -			1, 0, 11, 10,
> -			9, 8, 7, 6);
> +	register const __m128i shfl_msk = _mm_set_epi8(15, 14, 13, 12,
> +			5, 4, 3, 2,
> +			1, 0, 11, 10,
> +			9, 8, 7, 6);
>
>  	ol_flags = ol_flags_init(txp->dev_conf.txmode.offloads);
>  	vlan_qinq_set(pkts, nb, ol_flags,
> @@ -44,23 +44,24 @@ do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
>
>  		mb[0] = pkts[i++];
>  		eth_hdr[0] = rte_pktmbuf_mtod(mb[0], struct rte_ether_hdr *);
> -		addr0 = _mm_loadu_si128((__m128i *)eth_hdr[0]);
> -
>  		mb[1] = pkts[i++];
>  		eth_hdr[1] = rte_pktmbuf_mtod(mb[1], struct rte_ether_hdr *);
> -		addr1 = _mm_loadu_si128((__m128i *)eth_hdr[1]);
> -
> -
>  		mb[2] = pkts[i++];
>  		eth_hdr[2] = rte_pktmbuf_mtod(mb[2], struct rte_ether_hdr *);
> -		addr2 = _mm_loadu_si128((__m128i *)eth_hdr[2]);
> -
>  		mb[3] = pkts[i++];
>  		eth_hdr[3] = rte_pktmbuf_mtod(mb[3], struct rte_ether_hdr *);
> -		addr3 = _mm_loadu_si128((__m128i *)eth_hdr[3]);
>
> +		/* Interleave load, shuffle & set */
> +		addr0 = _mm_loadu_si128((__m128i *)eth_hdr[0]);
> +		mbuf_field_set(mb[0], ol_flags);
> +		addr1 = _mm_loadu_si128((__m128i *)eth_hdr[1]);
> +		mbuf_field_set(mb[1], ol_flags);
>  		addr0 = _mm_shuffle_epi8(addr0, shfl_msk);
> +		addr2 = _mm_loadu_si128((__m128i *)eth_hdr[2]);
> +		mbuf_field_set(mb[2], ol_flags);
>  		addr1 = _mm_shuffle_epi8(addr1, shfl_msk);
> +		addr3 = _mm_loadu_si128((__m128i *)eth_hdr[3]);
> +		mbuf_field_set(mb[3], ol_flags);
>  		addr2 = _mm_shuffle_epi8(addr2, shfl_msk);
>  		addr3 = _mm_shuffle_epi8(addr3, shfl_msk);
>
> @@ -69,25 +70,22 @@ do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
>  		_mm_storeu_si128((__m128i *)eth_hdr[2], addr2);
>  		_mm_storeu_si128((__m128i *)eth_hdr[3], addr3);
>
> -		mbuf_field_set(mb[0], ol_flags);
> -		mbuf_field_set(mb[1], ol_flags);
> -		mbuf_field_set(mb[2], ol_flags);
> -		mbuf_field_set(mb[3], ol_flags);
>  		r -= 4;
>  	}
>
>  	for ( ; i < nb; i++) {
>  		if (i < nb - 1)
>  			rte_prefetch0(rte_pktmbuf_mtod(pkts[i+1], void *));
> +
>  		mb[0] = pkts[i];
>  		eth_hdr[0] = rte_pktmbuf_mtod(mb[0], struct rte_ether_hdr *);
>
>  		/* Swap dest and src mac addresses. */
>  		addr0 = _mm_loadu_si128((__m128i *)eth_hdr[0]);
> +		/* MBUF and Ethernet are 2 separate cacheline */
> +		mbuf_field_set(mb[0], ol_flags);
>  		addr0 = _mm_shuffle_epi8(addr0, shfl_msk);
>  		_mm_storeu_si128((__m128i *)eth_hdr[0], addr0);
> -
> -		mbuf_field_set(mb[0], ol_flags);
>  	}

Since the final loop only handles the odd leftover elements at the end of
the array of buffers, does changing the instruction ordering here really
make a perf difference?
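
As a footnote to the cache-alignment comment above: if the jump from 4 to 8
entries really is about alignment, the untested sketch below is roughly what
I had in mind - keep the arrays at 4 entries and state the alignment
explicitly. The function name and the use of C11 alignas here are purely
illustrative, not a suggested final form.

#include <stdalign.h>

#include <rte_common.h>
#include <rte_ether.h>
#include <rte_mbuf.h>

/* Illustrative only: explicit cache-line alignment on the local arrays. */
static inline void
macswap_aligned_decls_sketch(void)
{
	/*
	 * Sized for the 4 packets handled per loop iteration, aligned to a
	 * cache line explicitly rather than by oversizing the arrays to 8.
	 */
	alignas(RTE_CACHE_LINE_SIZE) struct rte_ether_hdr *eth_hdr[4];
	alignas(RTE_CACHE_LINE_SIZE) struct rte_mbuf *mb[4];

	/* Only the declarations matter for this sketch. */
	RTE_SET_USED(eth_hdr);
	RTE_SET_USED(mb);
}

If alignment of these locals does turn out to matter for perf, numbers with
just that change (and the arrays back at 4 entries) would be a useful data
point.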