Hi Ciara,

> Hi Konstantin,
> 
> 
> >-----Original Message-----
> >From: Ananyev, Konstantin <konstantin.anan...@intel.com>
> >Sent: Thursday 8 October 2020 15:40
> >To: Medvedkin, Vladimir <vladimir.medved...@intel.com>; Power, Ciara
> ><ciara.po...@intel.com>; dev@dpdk.org
> >Cc: Richardson, Bruce <bruce.richard...@intel.com>; Jerin Jacob
> ><jer...@marvell.com>; Ruifeng Wang <ruifeng.w...@arm.com>
> >Subject: RE: [dpdk-dev] [PATCH v3 18/18] lpm: choose vector path at runtime
> >
> >>
> >> Hi Ciara,
> >>
> >>
> >> On 30/09/2020 14:04, Ciara Power wrote:
> >> > When choosing the vector path, max SIMD bitwidth is now checked to
> >> > ensure a vector path is allowable. To do this, rather than the
> >> > vector lookup functions being called directly from apps, a generic
> >> > lookup function is called which will call the vector functions if 
> >> > suitable.
> >> >
> >> > Signed-off-by: Ciara Power <ciara.po...@intel.com>
> >> > ---
> >> >   lib/librte_lpm/rte_lpm.h         | 57 ++++++++++++++++++++++++++------
> >> >   lib/librte_lpm/rte_lpm_altivec.h |  2 +-
> >> >   lib/librte_lpm/rte_lpm_neon.h    |  2 +-
> >> >   lib/librte_lpm/rte_lpm_sse.h     |  2 +-
> >> >   4 files changed, 50 insertions(+), 13 deletions(-)
> >> >
> >> > diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h
> >> > index 03da2d37e0..edba7cafd5 100644
> >> > --- a/lib/librte_lpm/rte_lpm.h
> >> > +++ b/lib/librte_lpm/rte_lpm.h
> >> > @@ -397,8 +397,18 @@ rte_lpm_lookup_bulk_func(const struct rte_lpm *lpm, const uint32_t *ips,
> >> >   /* Mask four results. */
> >> >   #define         RTE_LPM_MASKX4_RES     UINT64_C(0x00ffffff00ffffff)
> >> >
> >> > +#if defined(RTE_ARCH_ARM) || defined(RTE_ARCH_ARM64)
> >> > +#include "rte_lpm_neon.h"
> >> > +#elif defined(RTE_ARCH_PPC_64)
> >> > +#include "rte_lpm_altivec.h"
> >> > +#else
> >> > +#include "rte_lpm_sse.h"
> >> > +#endif
> >> > +
> >> >   /**
> >> > - * Lookup four IP addresses in an LPM table.
> >> > + * Lookup four IP addresses in an LPM table individually by calling the
> >> > + * lookup function for each ip. This is used when lookupx4 is called but
> >> > + * the vector path is not suitable.
> >> >    *
> >> >    * @param lpm
> >> >    *   LPM object handle
> >> > @@ -417,16 +427,43 @@ rte_lpm_lookup_bulk_func(const struct rte_lpm *lpm, const uint32_t *ips,
> >> >    *   if lookup would fail.
> >> >    */
> >> >   static inline void
> >> > -rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
> >> > -        uint32_t defv);
> >> > +rte_lpm_lookupx4_scalar(struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
> >> > +        uint32_t defv)
> >> > +{
> >> > +        int i;
> >> > +        for (i = 0; i < 4; i++)
> >> > +                if (rte_lpm_lookup(lpm, ((uint32_t *) &ip)[i], &hop[i]) < 0)
> >> > +                        hop[i] = defv; /* lookupx4 expected to set on failure */
> >> > +}
> >> >
> >> > -#if defined(RTE_ARCH_ARM) || defined(RTE_ARCH_ARM64)
> >> > -#include "rte_lpm_neon.h"
> >> > -#elif defined(RTE_ARCH_PPC_64)
> >> > -#include "rte_lpm_altivec.h"
> >> > -#else
> >> > -#include "rte_lpm_sse.h"
> >> > -#endif
> >> > +/**
> >> > + * Lookup four IP addresses in an LPM table.
> >> > + *
> >> > + * @param lpm
> >> > + *   LPM object handle
> >> > + * @param ip
> >> > + *   Four IPs to be looked up in the LPM table
> >> > + * @param hop
> >> > + *   Next hop of the most specific rule found for IP (valid on lookup hit only).
> >> > + *   This is an 4 elements array of two byte values.
> >> > + *   If the lookup was successful for the given IP, then least significant byte
> >> > + *   of the corresponding element is the  actual next hop and the most
> >> > + *   significant byte is zero.
> >> > + *   If the lookup for the given IP failed, then corresponding element would
> >> > + *   contain default value, see description of then next parameter.
> >> > + * @param defv
> >> > + *   Default value to populate into corresponding element of hop[] array,
> >> > + *   if lookup would fail.
> >> > + */
> >> > +static inline void
> >> > +rte_lpm_lookupx4(struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
> >> > +        uint32_t defv)
> >> > +{
> >> > +        if (rte_get_max_simd_bitwidth() >= RTE_MAX_128_SIMD)
> >> > +                rte_lpm_lookupx4_vec(lpm, ip, hop, defv);
> >> > +        else
> >> > +                rte_lpm_lookupx4_scalar(lpm, ip, hop, defv);
> >> > +}
> >>
> >> I'm afraid this will lead to a drop in performance. rte_lpm_lookupx4
> >> is used in the hot path, and a bulk size is too small to amortize the
> >> cost of adding this extra logic.
> >
> >I do share Vladimir's concern regarding performance here.
> >As I said in other mail - it seems not much point to insert these checks into
> >inline SSE specific function, as SSE is enabled by default for all x86 
> >builds.
> >
> 
> The performance impact is quite small, thanks Vladimir for providing these 
> results:
> 
> before patches:
>       LPM LookupX4: 25.1 cycles (fails = 12.5%)
>       LPM LookupX4: 25.2 cycles (fails = 12.5%)
>       LPM LookupX4: 25.2 cycles (fails = 12.5%)
> 
> v3:
>       LPM LookupX4: 26.2 cycles (fails = 12.5%)
>       LPM LookupX4: 26.2 cycles (fails = 12.5%)
>       LPM LookupX4: 26.2 cycles (fails = 12.5%) 

Yes, the perf difference is surprisingly small...
I wonder what tests you used for that?
I'd expect it to be more noticeable on l3fwd,
especially on machines with low-end CPUs.

> v4:
> Note: I haven't sent this publicly yet, modified v3 slightly to check the 
> bitwidth
> in LPM create and set a flag that is used in lookupx4 to choose either vector 
> or scalar function.
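
Just to check I read the v4 idea correctly, the dispatch would then look
roughly like the fragment below? (The 'vector_ok' flag name is my own
invention here; it stands for a new field in struct rte_lpm that would be
set once in rte_lpm_create() from rte_get_max_simd_bitwidth().)

static inline void
rte_lpm_lookupx4(struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
        uint32_t defv)
{
        /* flag precomputed at create time, so the hot path only pays
         * for one well-predicted branch instead of a function call */
        if (lpm->vector_ok)
                rte_lpm_lookupx4_vec(lpm, ip, hop, defv);
        else
                rte_lpm_lookupx4_scalar(lpm, ip, hop, defv);
}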

Yes, avoiding the function call will definitely help here.
Though I am still not convinced we have to make such checks in that function
at all (or in other inline functions).
Inline functions are compiled within user code, and their behaviour should be
controlled together with the rest of the user code.
Say, in l3fwd on IA, rte_lpm_lookupx4 is called from l3fwd_lpm_sse.h,
which, as the name implies, is supposed to be built and used with SSE enabled.
If we'd like l3fwd to obey the 'max-simd-width' parameter, then that needs to
be done somewhere at startup, when the behaviour is selected, not inside every
possible inline function that uses SSE intrinsics.
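
To illustrate, here is a rough sketch of doing that selection once at startup
in the application (the app_* helper names are made up for the example, and it
assumes the rte_get_max_simd_bitwidth() API from this series is available):

#include <string.h>
#include <rte_eal.h>   /* assuming the max-SIMD-bitwidth API from this series */
#include <rte_vect.h>
#include <rte_lpm.h>

/* hot-path dispatch pointer, filled in once at startup */
typedef void (*lookupx4_fn)(struct rte_lpm *lpm, xmm_t ip,
                uint32_t hop[4], uint32_t defv);
static lookupx4_fn app_lookupx4;

/* scalar fallback: four single lookups */
static void
app_lookupx4_scalar(struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
                uint32_t defv)
{
        uint32_t ips[4];
        int i;

        memcpy(ips, &ip, sizeof(ips));
        for (i = 0; i < 4; i++)
                if (rte_lpm_lookup(lpm, ips[i], &hop[i]) < 0)
                        hop[i] = defv;
}

/* vector path: forwards to the existing (unmodified, SSE) inline lookupx4 */
static void
app_lookupx4_vec(struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
                uint32_t defv)
{
        rte_lpm_lookupx4(lpm, ip, hop, defv);
}

/* called once after rte_eal_init(), before entering the forwarding loop */
static void
app_setup_lookup(void)
{
        if (rte_get_max_simd_bitwidth() >= RTE_MAX_128_SIMD)
                app_lookupx4 = app_lookupx4_vec;
        else
                app_lookupx4 = app_lookupx4_scalar;
}

The forwarding code then calls app_lookupx4() unconditionally, and the library
inline functions stay as they are.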

>       LPM LookupX4: 25.5 cycles (fails = 12.5%)
>       LPM LookupX4: 25.5 cycles (fails = 12.5%)
>       LPM LookupX4: 25.5 cycles (fails = 12.5%)
> 
> 
> Thanks,
> Ciara
> 
> >As another more generic thought - might be better to avoid these checks in
> >other public SIMD-specific inline functions (if any).
> >If such function get called from some .c, then at least such SIMD ISA is
> >already enabled for that .c file and I think this check should be
> >left for caller to do.
> >
> >> >
> >> >   #ifdef __cplusplus
> >> >   }
> >> > diff --git a/lib/librte_lpm/rte_lpm_altivec.h b/lib/librte_lpm/rte_lpm_altivec.h
> >> > index 228c41b38e..82142d3351 100644
> >> > --- a/lib/librte_lpm/rte_lpm_altivec.h
> >> > +++ b/lib/librte_lpm/rte_lpm_altivec.h
> >> > @@ -16,7 +16,7 @@ extern "C" {
> >> >   #endif
> >> >
> >> >   static inline void
> >> > -rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
> >> > +rte_lpm_lookupx4_vec(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
> >> >          uint32_t defv)
> >> >   {
> >> >          vector signed int i24;
> >> > diff --git a/lib/librte_lpm/rte_lpm_neon.h b/lib/librte_lpm/rte_lpm_neon.h
> >> > index 6c131d3125..14b184515d 100644
> >> > --- a/lib/librte_lpm/rte_lpm_neon.h
> >> > +++ b/lib/librte_lpm/rte_lpm_neon.h
> >> > @@ -16,7 +16,7 @@ extern "C" {
> >> >   #endif
> >> >
> >> >   static inline void
> >> > -rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
> >> > +rte_lpm_lookupx4_vec(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
> >> >          uint32_t defv)
> >> >   {
> >> >          uint32x4_t i24;
> >> > diff --git a/lib/librte_lpm/rte_lpm_sse.h b/lib/librte_lpm/rte_lpm_sse.h
> >> > index 44770b6ff8..cb5477c6cf 100644
> >> > --- a/lib/librte_lpm/rte_lpm_sse.h
> >> > +++ b/lib/librte_lpm/rte_lpm_sse.h
> >> > @@ -15,7 +15,7 @@ extern "C" {
> >> >   #endif
> >> >
> >> >   static inline void
> >> > -rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
> >> > +rte_lpm_lookupx4_vec(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
> >> >          uint32_t defv)
> >> >   {
> >> >          __m128i i24;
> >> >
> >>
> >> --
> >> Regards,
> >> Vladimir
