> -----Original Message----- > From: Jerin Jacob Kollanukkaran <jer...@marvell.com> > Sent: Saturday, June 8, 2019 2:08 PM > To: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>; Aaron Conole > <acon...@redhat.com> > Cc: msant...@redhat.com; tho...@monjalon.net; Ruifeng Wang (Arm > Technology China) <ruifeng.w...@arm.com>; Gavin Hu (Arm Technology > China) <gavin...@arm.com>; Dharmik Thakkar <dharmik.thak...@arm.com>; > jerin.ja...@caviumnetworks.com; ys...@mellanox.com; dev@dpdk.org; > bruce.richard...@intel.com; nd <n...@arm.com>; nd <n...@arm.com> > Subject: RE: DPDK compilation on arm is failing in Travis > > > -----Original Message----- > > From: dev <dev-boun...@dpdk.org> On Behalf Of Honnappa Nagarahalli > > Sent: Friday, June 7, 2019 7:24 PM > > To: Aaron Conole <acon...@redhat.com> > > Cc: msant...@redhat.com; tho...@monjalon.net; Ruifeng Wang (Arm > > Technology China) <ruifeng.w...@arm.com>; Gavin Hu (Arm Technology > > China) <gavin...@arm.com>; Dharmik Thakkar > <dharmik.thak...@arm.com>; > > jerin.ja...@caviumnetworks.com; ys...@mellanox.com; dev@dpdk.org; > > bruce.richard...@intel.com; Honnappa Nagarahalli > > <honnappa.nagaraha...@arm.com>; nd <n...@arm.com>; nd <n...@arm.com> > > Subject: Re: [dpdk-dev] DPDK compilation on arm is failing in Travis > > > > > >> > > > > >> > Thomas Monjalon <tho...@monjalon.net> writes: > > > >> > > > > >> > > > > >> > > > > >> > The compilation of the master branch is failing for aarch64: > > > >> > > > > >> > https://travis-ci.com/DPDK/dpdk > > > >> > > > > >> > The log is so much verbose that I am not able to understand > > > >> > what > > > >> > > > > >> > is really wrong. > > > >> > > > > >> > Please help to diagnose and fix, thanks. > > > >> > > > > >> > > > > >> > > > > >> > A discussion about this: > > > >> > > > > >> > > > > >> > > > > >> > http://mails.dpdk.org/archives/dev/2019-June/134012.html > > > >> > > > > >> > > > > >> > > > > >> > I see the error now. > > > >> > > > > >> > It is printing the full log after the error, so I missed the > > > >> > error > > > >> > > > > >> > at the top. > > > >> > > > > >> > > > > >> > > > > >> > I've read your comment about a possible error with the patch > > > >> > > > > >> > removing weak functions but neither me nor Bruce were able to > > > >> > reproduce > > > >> > > > > >> > it. > > > >> > > > > >> > What is the condition to see this compiler warning? > > > >> > > > > >> > > > > >> > > > > >> > It is only on ARM, and only when the neon intrinsics are in use. > > > >> > > > > >> > I am not able to reproduce it from the tip of master. > > > >> > > > > >> > > > > >> > > > > >> > I am using: > > > >> > > > > >> > gcc (Ubuntu 8.3.0-6ubuntu1~18.04) 8.3.0 > > > >> > > > > >> > > > > >> > > > > >> > From the log on Travis, looks like the compiler is: > > > >> > > > > >> > gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609 > > > >> > > > > >> > > > > >> > > > > >> > Is this the issue? > > > >> > > > > >> > > > > >> > > > > >> > Why are we seeing the error now? > > > >> > > > > >> > I tested with gcc-5 (Ubuntu/Linaro 5.5.0-12ubuntu1) 5.5.0 > > > >> > 20171010, it > > > >> works fine. I cannot get hold of 5.4.0. Not sure if needs to be > > > >> supported. > > > >> > > > > >> > Are there any issues in upgrading to 7 or 8? > > > >> > > > > >> > I have tested it on my ubuntu 16.04 vm on commit > > > >> > 8cb511bb94ad92a76990f175cac76bb13d51daba > > > >> > (head of master seems to be failing for other reasons on my vm). > > > >> > I tested the following gcc versions: > > > >> > > > > >> > gcc 5.5.0 "cc (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010" > > > >> > gcc 7.4.0 "cc (Ubuntu 7.4.0-1ubuntu1~16.04~ppa1) 7.4.0" > > > >> > gcc 8.1.0 "cc (Ubuntu 8.1.0-5ubuntu1~16.04) 8.1.0" > > > >> > > > > >> > All tested versions failed on the exact same error shown in travis. > > > >> > I don't know if the compiler is at fault here. Maybe Aaron's > > > >> > patch is a viable > > > >> option? > > > >> > > > > >> > The issue is the vector lane setting code looks like: > > > >> > > > > >> > > > > >> > > > > >> > lval = lane_set(scalar, rval, lane id) > > > >> > > > > >> > > > > >> > > > > >> > In this case, 'rval' is being used before it is ever set, but > > > >> > it > > > >> > > > > >> > really could be just 0 for the first lane setting code. > > > >> > Thereafter, > > > >> > > > > >> > we use the old value of input as the rval, but each time a > > > >> > different lane is > > > >> set. > > > >> > > > > >> > > > > >> > > > > >> > It would be nice if there were an intrinsic that formatted > > > >> > correctly > > > >> > > > > >> > from the start (something we could call like lval = > > > >> > > > > >> > lane_set_from_array(scalar_array)). > > > >> > > > > >> > [Honnappa] This exists already. ‘vdupq_n_s32’ can be used. Can > > > >> > you try the > > > >> following? > > > >> > > > >> Well, it isn't exactly that. You are setting all lanes from a scalar. > > > > Yes, you are correct, it sets all the lanes. I am not sure on how > > > > this will affect the performance. > > > > > > > >> I'd rather be able to say: > > > >> > > > >> input0 = vdupq_nn_s32(&parms[0]); > > > >> input1 = vdupq_nn_s32(&parms[4]); > > > >> > > > >> Something like that, which lets us delete all the rest of the > > > >> lane-set code. But it seems it doesn't exist. > > > >> > > > >> Regardless, I think either patch should work (either using the 'all > > > >> lanes' > > > >> setting you have or the static variable). I have no preference > > > >> on it > > > >> - it's up to you (or someone else) to say which is preferred. I > > > >> guess your version could be preferable since there's no static to > > > >> need to "explain" :) > > > > I think we can go ahead with your patch with using a temporary > > > > vector for the first set, as it does not introduce any change to > > > > the code and hence performance should not get affected. > > > > > > > > But, I do not understand why you have added 'static'. Also, > > > > changing 'ZEROVAL' to 'tmp' or something similar will be better. > > > > > > The static is there to guarantee '0' value. Otherwise we create a > > > temp variable that has to be initialized explicitly. > > Ok, I am fine with this. I guess this is the explanation you wanted to > > avoid 😊. > > Don’t use BSS for fastpath code. Let it use stack for better cache usage and > multi thread case. I already sent a simple fix for this with temp variable. > Please > don’t complicate.
Typo: s/with temp variable/ with out temp variable/g > > > > > > > > > > >> > > > >> > honnag01@qc2400f-1:~/dpdk$ git diff > > > >> > > > > >> > diff --git a/lib/librte_acl/acl_run_neon.h > > > >> > b/lib/librte_acl/acl_run_neon.h > > > >> > > > > >> > index 01b9766d8..b3196cd12 100644 > > > >> > > > > >> > --- a/lib/librte_acl/acl_run_neon.h > > > >> > > > > >> > +++ b/lib/librte_acl/acl_run_neon.h > > > >> > > > > >> > @@ -181,8 +181,8 @@ search_neon_8(const struct rte_acl_ctx > > > >> > *ctx, const uint8_t **data, > > > >> > > > > >> > > > > >> > > > > >> > while (flows.started > 0) { > > > >> > > > > >> > /* Gather 4 bytes of input data for each stream. > > > >> > */ > > > >> > > > > >> > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), > > > >> > input0, > > > 0); > > > >> > > > > >> > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), > > > >> > input1, > > > 0); > > > >> > > > > >> > + input0 = vdupq_n_s32(GET_NEXT_4BYTES(parms, > > > >> > + 0)); > > > >> > > > > >> > + input1 = vdupq_n_s32(GET_NEXT_4BYTES(parms, > > > >> > + 4)); > > > >> > > > > >> > > > > >> > > > > >> > input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, > > > >> > 1), input0, 1); > > > >> > > > > >> > input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, > > > >> > 5), input1, 1); > > > >> > > > > >> > @@ -242,7 +242,7 @@ search_neon_4(const struct rte_acl_ctx > > > >> > *ctx, const uint8_t **data, > > > >> > > > > >> > > > > >> > > > > >> > while (flows.started > 0) { > > > >> > > > > >> > /* Gather 4 bytes of input data for each stream. > > > >> > */ > > > >> > > > > >> > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), > > > >> > input, 0); > > > >> > > > > >> > + input = vdupq_n_s32(GET_NEXT_4BYTES(parms, 0)); > > > >> > > > > >> > input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, > > > >> > 1), input, 1); > > > >> > > > > >> > input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, > > > >> > 2), input, 2); > > > >> > > > > >> > input = > > > >> > vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, 3); > > > >> > > > > >> > > > > >> > > > > >> > Then 'input' would never appear as an rval before it was set. > > > >> > > > > >> > > > > >> > > > > >> > I thought Jerin Jacob (CC'd) would have some opinion on the right > > > >> > fix. > > > >> > > > > >> > There are three 'fixes' I know exist - one is to squelch the > > > >> > warning > > > >> > > > > >> > (but I don't like it because it could hide future code that > > > >> > introduces > > > >> > > > > >> > this), one is to create a static and use assignment, one is to > > > >> > replace > > > >> > > > > >> > the first call and pass in a 0'd lane for the first one. > > > >> > > > > >> > > > > >> > > > > >> > Actually, I think I have a patch that could work to not > > > >> > introduce an > > > >> > > > > >> > assignment, but squelch the warning. Something like the > > > >> > following (not > > > >> > > > > >> > tested). > > > >> > > > > >> > > > > >> > > > > >> > --- > > > >> > > > > >> > > > > >> > > > > >> > diff --git a/lib/librte_acl/acl_run_neon.h > > > >> > > > > >> > b/lib/librte_acl/acl_run_neon.h index 01b9766d8..37c984fef > > > >> > 100644 > > > >> > > > > >> > --- a/lib/librte_acl/acl_run_neon.h > > > >> > > > > >> > +++ b/lib/librte_acl/acl_run_neon.h > > > >> > > > > >> > @@ -165,6 +165,7 @@ search_neon_8(const struct rte_acl_ctx > > > >> > *ctx, const > > > >> > > > > >> > uint8_t **data, > > > >> > > > > >> > uint64_t index_array[8]; > > > >> > > > > >> > struct completion cmplt[8]; > > > >> > > > > >> > struct parms parms[8]; > > > >> > > > > >> > + static int32x4_t ZEROVAL; > > > >> > > > > >> > int32x4_t input0, input1; > > > >> > > > > >> > > > > >> > > > > >> > acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > > > >> > @@ > > > >> > - > > > >> > > > > >> > 181,8 +182,8 @@ search_neon_8(const struct rte_acl_ctx *ctx, > > > >> > const > > > >> > > > > >> > uint8_t **data, > > > >> > > > > >> > > > > >> > > > > >> > while (flows.started > 0) { > > > >> > > > > >> > /* Gather 4 bytes of input data for each stream. */ > > > >> > > > > >> > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), > > > >> > input0, > > > >> > > > > >> > 0); > > > >> > > > > >> > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), > > > >> > input1, > > > >> > > > > >> > 0); > > > >> > > > > >> > + input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), > > > >> > > > > >> > ZEROVAL, 0); > > > >> > > > > >> > + input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), > > > >> > > > > >> > ZEROVAL, 0); > > > >> > > > > >> > > > > >> > > > > >> > input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), > > > >> > input0, > > > >> > > > > >> > 1); > > > >> > > > > >> > input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 5), > > > >> > input1, > > > >> > > > > >> > 1); @@ > > > >> > > > > >> > -227,6 +228,7 @@ search_neon_4(const struct rte_acl_ctx *ctx, > > > >> > const > > > >> > > > > >> > uint8_t **data, > > > >> > > > > >> > uint64_t index_array[4]; > > > >> > > > > >> > struct completion cmplt[4]; > > > >> > > > > >> > struct parms parms[4]; > > > >> > > > > >> > + static int32x4_t ZEROVAL; > > > >> > > > > >> > int32x4_t input; > > > >> > > > > >> > > > > >> > > > > >> > acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > > > >> > @@ > > > >> > - > > > >> > > > > >> > 242,7 +244,7 @@ search_neon_4(const struct rte_acl_ctx *ctx, > > > >> > const > > > >> > > > > >> > uint8_t **data, > > > >> > > > > >> > > > > >> > > > > >> > while (flows.started > 0) { > > > >> > > > > >> > /* Gather 4 bytes of input data for each stream. */ > > > >> > > > > >> > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, > > > >> > 0); > > > >> > > > > >> > + input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), > > > >> > > > > >> > ZEROVAL, 0); > > > >> > > > > >> > input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), > > > >> > input, 1); > > > >> > > > > >> > input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), > > > >> > input, 2); > > > >> > > > > >> > input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), > > > >> > input, 3); > > > >> > > > > >> > -- > > > >> > > > > >> > 2.21.0