> > > > --
> > > > > Subject: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler
> > > > >
> > > > > From: Jerin Jacob <jer...@marvell.com>
> > > > >
> > > > > Some compilers report the following error, though the
> > > > > existing code doesn't have any uninitialized variable case.
> > > > > Just to make the compiler happy, initialize the int32x4_t variable
> > > > > one shot in C language.
> > > > >
> > > > > ../lib/librte_acl/acl_run_neon.h: In function 'search_neon_4'
> > > > > ../lib/librte_acl/acl_run_neon.h:230:12: error: 'input' may be used uninitialized in this function [-Werror=maybe-uninitialized]
> > > > >   int32x4_t input;
> > > > >
> > > > > Fixes: 34fa6c27c156 ("acl: add NEON optimization for ARMv8")
> > > > > Cc: sta...@dpdk.org
> > > > >
> > > > > Signed-off-by: Jerin Jacob <jer...@marvell.com>
> > > > > ---
> > > > >  lib/librte_acl/acl_run_neon.h | 29 ++++++++++++-----------------
> > > > >  1 file changed, 12 insertions(+), 17 deletions(-)
> > > > >
> > > > > diff --git a/lib/librte_acl/acl_run_neon.h b/lib/librte_acl/acl_run_neon.h
> > > > > index 01b9766d8..dc9e9efe9 100644
> > > > > --- a/lib/librte_acl/acl_run_neon.h
> > > > > +++ b/lib/librte_acl/acl_run_neon.h
> > > > > @@ -165,7 +165,6 @@ search_neon_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > >      uint64_t index_array[8];
> > > > >      struct completion cmplt[8];
> > > > >      struct parms parms[8];
> > > > > -    int32x4_t input0, input1;
> > > > >
> > > > >      acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > > >          total_packets, categories, ctx->trans_table);
> > > > > @@ -181,17 +180,14 @@ search_neon_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > >
> > > > >      while (flows.started > 0) {
> > > > >          /* Gather 4 bytes of input data for each stream. */
> > > > > -        input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input0, 0);
> > > > > -        input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), input1, 0);
> > > > > -
> > > > > -        input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input0, 1);
> > > > > -        input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 5), input1, 1);
> > > > > -
> > > > > -        input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input0, 2);
> > > > > -        input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 6), input1, 2);
> > > > > -
> > > > > -        input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input0, 3);
> > > > > -        input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 7), input1, 3);
> > > > > +        int32x4_t input0 = {GET_NEXT_4BYTES(parms, 0),
> > > > > +                            GET_NEXT_4BYTES(parms, 1),
> > > > > +                            GET_NEXT_4BYTES(parms, 2),
> > > > > +                            GET_NEXT_4BYTES(parms, 3)};
> > > > > +        int32x4_t input1 = {GET_NEXT_4BYTES(parms, 4),
> > > > > +                            GET_NEXT_4BYTES(parms, 5),
> > > > > +                            GET_NEXT_4BYTES(parms, 6),
> > > > > +                            GET_NEXT_4BYTES(parms, 7)};
> > > > >
> > > > This mixes the use of NEON intrinsics with GCC vector extensions.
> > > > ACLE (Arm C Language Extensions) specifically recommends not to
> > > > mix the two methods in section 12.2.6. IMO, Aaron's suggestion of
> > > > using a temp vector should be good.
> > >
> > > We are using this pattern across DPDK and SSE for x86 as well.
> > > https://git.dpdk.org/dpdk/tree/drivers/net/i40e/i40e_rxtx_vec_neon.c#n91
> >
> > I am not sure about x86, I have not looked at a document similar to
> > ACLE for x86. IMO, it is not relevant here as this is Arm specific code.
>
> What I meant was it's been already used in DPDK for arm64.
> https://git.dpdk.org/dpdk/tree/drivers/net/i40e/i40e_rxtx_vec_neon.c#n91

Ok, got it. I have had discussions with compiler folks at Arm about mixing vector programming models, and the recommendation has been to use NEON intrinsics exclusively.
I have had this discussion with Marvell compiler folks too some time back.

> Please see the official GCC vector extensions documentation page. The
> examples use this scheme.
> https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
>
> This is just to create the 'input' variable. I am fine with using any
> other scheme that has no additional instruction cost.
>
> > > Since it is used in the fast path, a temp variable would be additional
> > > cost for no reason.
> > Then, I would suggest we can go with using 'vdupq_n_s32'.
>
> We have to form uint64x2_t from 4 x uint32_t variables. How does
> 'vdupq_n_s32' help here?

We would use 'vdupq_n_s32' only for the first initialization; the rest of the code remains the same (see the diff below).

> Can you share a code snippet without any temp variable?

diff --git a/lib/librte_acl/acl_run_neon.h b/lib/librte_acl/acl_run_neon.h
index 01b9766d8..b3196cd12 100644
--- a/lib/librte_acl/acl_run_neon.h
+++ b/lib/librte_acl/acl_run_neon.h
@@ -181,8 +181,8 @@ search_neon_8(const struct rte_acl_ctx *ctx, const uint8_t **data,

     while (flows.started > 0) {
         /* Gather 4 bytes of input data for each stream. */
-        input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input0, 0);
-        input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), input1, 0);
+        input0 = vdupq_n_s32(GET_NEXT_4BYTES(parms, 0));
+        input1 = vdupq_n_s32(GET_NEXT_4BYTES(parms, 4));

         input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input0, 1);
         input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 5), input1, 1);
@@ -242,7 +242,7 @@ search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t **data,

     while (flows.started > 0) {
         /* Gather 4 bytes of input data for each stream. */
-        input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, 0);
+        input = vdupq_n_s32(GET_NEXT_4BYTES(parms, 0));
         input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, 1);
         input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, 2);
         input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, 3);

My understanding is that the generated code for both your patch and my changes above is the same.
The changes suggested above will conform to the ACLE recommendation.

> > > If GCC supports it then I think it is fine. I think the above usage
> > > matters for C++ portability.
> > I did not understand the C++ portability part. Can you elaborate more?
> >
> > > > >
> > > > >          /* Process the 4 bytes of input on each stream. */
> > > > >
> > > > > @@ -227,7 +223,6 @@ search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > >      uint64_t index_array[4];
> > > > >      struct completion cmplt[4];
> > > > >      struct parms parms[4];
> > > > > -    int32x4_t input;
> > > > >
> > > > >      acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > > >          total_packets, categories, ctx->trans_table);
> > > > > @@ -242,10 +237,10 @@ search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > >
> > > > >      while (flows.started > 0) {
> > > > >          /* Gather 4 bytes of input data for each stream. */
> > > > > -        input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, 0);
> > > > > -        input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, 1);
> > > > > -        input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, 2);
> > > > > -        input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, 3);
> > > > > +        int32x4_t input = {GET_NEXT_4BYTES(parms, 0),
> > > > > +                           GET_NEXT_4BYTES(parms, 1),
> > > > > +                           GET_NEXT_4BYTES(parms, 2),
> > > > > +                           GET_NEXT_4BYTES(parms, 3)};
> > > > >
> > > > >          /* Process the 4 bytes of input on each stream. */
> > > > >          input = transition4(input, flows.trans, index_array);
> > > > > --
> > > > > 2.21.0
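
For anyone who wants to try the two initialization schemes from this thread outside of DPDK, here is a minimal standalone sketch. It is not the librte_acl code: the helper names gather_brace(), gather_dup() and same_lanes() are made up for illustration, and a plain array stands in for the GET_NEXT_4BYTES() macro. On Arm targets it uses the real intrinsics from arm_neon.h; on any other host it falls back to small stand-ins built on GCC/Clang vector extensions, which is an assumption made only so the sketch compiles anywhere. It checks that both schemes produce the same lane values and says nothing about the generated instructions, which would have to be compared from the compiler output on the target.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>   /* real NEON intrinsics on Arm targets */
#else
/*
 * Stand-ins for non-Arm hosts (assumption for illustration only).
 * They mimic the lane semantics using GCC/Clang vector extensions.
 */
typedef int32_t int32x4_t __attribute__((vector_size(16)));

static inline int32x4_t vdupq_n_s32(int32_t v)
{
	int32x4_t r = {v, v, v, v};
	return r;
}

static inline int32x4_t vsetq_lane_s32(int32_t v, int32x4_t vec, int lane)
{
	vec[lane] = v;
	return vec;
}

static inline int32_t vgetq_lane_s32(int32x4_t vec, int lane)
{
	return vec[lane];
}
#endif

/* Scheme from the patch: one-shot brace initializer (vector extension). */
static inline int32x4_t gather_brace(const int32_t w[4])
{
	int32x4_t input = {w[0], w[1], w[2], w[3]};
	return input;
}

/*
 * Scheme from the review: intrinsics only. vdupq_n_s32() writes every lane,
 * so 'input' is fully defined before the first vsetq_lane_s32() reads it.
 */
static inline int32x4_t gather_dup(const int32_t w[4])
{
	int32x4_t input = vdupq_n_s32(w[0]);

	input = vsetq_lane_s32(w[1], input, 1);
	input = vsetq_lane_s32(w[2], input, 2);
	input = vsetq_lane_s32(w[3], input, 3);
	return input;
}

/* Lane-by-lane comparison; lane indices are literals, as NEON requires. */
static inline int same_lanes(int32x4_t a, int32x4_t b)
{
	return vgetq_lane_s32(a, 0) == vgetq_lane_s32(b, 0) &&
	       vgetq_lane_s32(a, 1) == vgetq_lane_s32(b, 1) &&
	       vgetq_lane_s32(a, 2) == vgetq_lane_s32(b, 2) &&
	       vgetq_lane_s32(a, 3) == vgetq_lane_s32(b, 3);
}
```

Neither helper reads 'input' before all four lanes have been written, which is the property the maybe-uninitialized warning is about. The original lane-by-lane form passes the still-undefined 'input' into the first vsetq_lane_s32() call.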