Re: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler

Honnappa Nagarahalli Mon, 10 Jun 2019 18:27:56 -0700

> > > > --
> > > > > Subject: [dpdk-dev] [PATCH] acl: fix build issue with some arm64
> > > > > compiler
> > > > >
> > > > > From: Jerin Jacob <[email protected]>
> > > > >
> > > > > Some compilers reporting the following error, though the
> > > > > existing code doesn't have any uninitialized variable case.
> > > > > Just to make compiler happy, initialize the int32x4_t variable
> > > > > one shot in C language.
> > > > >
> > > > > ../lib/librte_acl/acl_run_neon.h: In function 'search_neon_4'
> > > > > ../lib/librte_acl/acl_run_neon.h:230:12: error: 'input' may be
> > > > > used uninitialized in this function [-Werror=maybe-uninitialized]
> > > > >   int32x4_t input;
> > > > >
> > > > > Fixes: 34fa6c27c156 ("acl: add NEON optimization for ARMv8")
> > > > > Cc: [email protected]
> > > > >
> > > > > Signed-off-by: Jerin Jacob <[email protected]>
> > > > > ---
> > > > >  lib/librte_acl/acl_run_neon.h | 29
> > > > > ++++++++++++-----------------
> > > > >  1 file changed, 12 insertions(+), 17 deletions(-)
> > > > >
> > > > > diff --git a/lib/librte_acl/acl_run_neon.h
> > > > > b/lib/librte_acl/acl_run_neon.h index 01b9766d8..dc9e9efe9
> > > > > 100644
> > > > > --- a/lib/librte_acl/acl_run_neon.h
> > > > > +++ b/lib/librte_acl/acl_run_neon.h
> > > > > @@ -165,7 +165,6 @@ search_neon_8(const struct rte_acl_ctx *ctx,
> > > > > const uint8_t **data,
> > > > >       uint64_t index_array[8];
> > > > >       struct completion cmplt[8];
> > > > >       struct parms parms[8];
> > > > > -     int32x4_t input0, input1;
> > > > >
> > > > >       acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > > >                    total_packets, categories, ctx->trans_table); @@ 
> > > > > -181,17
> > > > > +180,14 @@ search_neon_8(const struct rte_acl_ctx *ctx, const
> > > > > +uint8_t
> > > > > **data,
> > > > >
> > > > >       while (flows.started > 0) {
> > > > >               /* Gather 4 bytes of input data for each stream. */
> > > > > -             input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms,
> 0),
> > > > > input0, 0);
> > > > > -             input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms,
> 4),
> > > > > input1, 0);
> > > > > -
> > > > > -             input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms,
> 1),
> > > > > input0, 1);
> > > > > -             input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms,
> 5),
> > > > > input1, 1);
> > > > > -
> > > > > -             input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms,
> 2),
> > > > > input0, 2);
> > > > > -             input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms,
> 6),
> > > > > input1, 2);
> > > > > -
> > > > > -             input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms,
> 3),
> > > > > input0, 3);
> > > > > -             input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms,
> 7),
> > > > > input1, 3);
> > > > > +             int32x4_t input0 = {GET_NEXT_4BYTES(parms, 0),
> > > > > +                                 GET_NEXT_4BYTES(parms, 1),
> > > > > +                                 GET_NEXT_4BYTES(parms, 2),
> > > > > +                                 GET_NEXT_4BYTES(parms, 3)};
> > > > > +             int32x4_t input1 = {GET_NEXT_4BYTES(parms, 4),
> > > > > +                                 GET_NEXT_4BYTES(parms, 5),
> > > > > +                                 GET_NEXT_4BYTES(parms, 6),
> > > > > +                                 GET_NEXT_4BYTES(parms, 7)};
> > > > >
> > > > This mixes the use of NEON intrinsics with GCC vector extensions.
> > > > ACLE (Arm C Language Extensions) specifically recommends not to
> > > > mix the two methods in section 12.2.6. IMO, Aaron's suggestion of
> > > > using a temp vector
> > > should be good.
> > >
> > > We are using this pattern across DPDK and SSE for x86 as well.
> > > https://git.dpdk.org/dpdk/tree/drivers/net/i40e/i40e_rxtx_vec_neon.c
> > > #n
> > > 91
> > I am not sure about x86, I have not looked at a document similar to
> > ACLE for x86. IMO, it is not relevant here as this is Arm specific code.
> 
> What I meant was its been already used in DPDK for arm64.
> https://git.dpdk.org/dpdk/tree/drivers/net/i40e/i40e_rxtx_vec_neon.c#n91
Ok, got it. I have had discussion with compiler folks at Arm with mixing vector 
programming models and the recommendation has been to use NEON exclusively. I 
have had this discussion with Marvel compiler folks too some time back.


> 
> Please see offial page vector gcc gcc documentation. The examples are using
> this scheme.
> https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
> 
> This is to just create 'input' variable. I am fine to use any other scheme 
> with
> out additional cost of instructions.
> 
> >
> > >
> > > Since it used in fastpath, a temp variable would be additional cost
> > > for no reason.
> > Then, I would suggest we can go with using 'vdupq_n_s32'.
> 
> We have to form uint64x2_t with 4 x uint32_t variable, How does
> 'vdupq_n_s32' help here?
We would use 'vdupq_n_s32' only for the first initialization, the rest of the 
code remains the same (see the diff below)

> Can you share code snippet without any temp variable?
diff --git a/lib/librte_acl/acl_run_neon.h b/lib/librte_acl/acl_run_neon.h
index 01b9766d8..b3196cd12 100644
--- a/lib/librte_acl/acl_run_neon.h
+++ b/lib/librte_acl/acl_run_neon.h
@@ -181,8 +181,8 @@ search_neon_8(const struct rte_acl_ctx *ctx, const uint8_t 
**data,

        while (flows.started > 0) {
                /* Gather 4 bytes of input data for each stream. */
-               input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input0, 0);
-               input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), input1, 0);
+               input0 = vdupq_n_s32(GET_NEXT_4BYTES(parms, 0));
+               input1 = vdupq_n_s32(GET_NEXT_4BYTES(parms, 4));

                input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input0, 1);
                input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 5), input1, 1);
@@ -242,7 +242,7 @@ search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t 
**data,

        while (flows.started > 0) {
                /* Gather 4 bytes of input data for each stream. */
-               input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, 0);
+               input = vdupq_n_s32(GET_NEXT_4BYTES(parms, 0));
                input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, 1);
                input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, 2);
                input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, 3);

My understanding is that the generated code for both your patch and my changes 
above is the same. Above suggested changes will conform to ACLE recommendation.

> 
> >
> > > If GCC supports it then I think it is fine, I think, above usage
> > > matters with C++ portability.
> > I did not understand the C++ portability part. Can you elaborate more?
> >
> > >
> > >
> > > >
> > > > >               /* Process the 4 bytes of input on each stream. */
> > > > >
> > > > > @@ -227,7 +223,6 @@ search_neon_4(const struct rte_acl_ctx *ctx,
> > > > > const uint8_t **data,
> > > > >       uint64_t index_array[4];
> > > > >       struct completion cmplt[4];
> > > > >       struct parms parms[4];
> > > > > -     int32x4_t input;
> > > > >
> > > > >       acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > > >                    total_packets, categories, ctx->trans_table); @@ 
> > > > > -242,10
> > > > > +237,10 @@ search_neon_4(const struct rte_acl_ctx *ctx, const
> > > > > +uint8_t
> > > > > **data,
> > > > >
> > > > >       while (flows.started > 0) {
> > > > >               /* Gather 4 bytes of input data for each stream. */
> > > > > -             input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0),
> input,
> > > > > 0);
> > > > > -             input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1),
> input,
> > > > > 1);
> > > > > -             input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2),
> input,
> > > > > 2);
> > > > > -             input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3),
> input,
> > > > > 3);
> > > > > +             int32x4_t input = {GET_NEXT_4BYTES(parms, 0),
> > > > > +                                GET_NEXT_4BYTES(parms, 1),
> > > > > +                                GET_NEXT_4BYTES(parms, 2),
> > > > > +                                GET_NEXT_4BYTES(parms, 3)};
> > > > >
> > > > >               /* Process the 4 bytes of input on each stream. */
> > > > >               input = transition4(input, flows.trans, index_array);
> > > > > --
> > > > > 2.21.0

Re: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler

Reply via email to