Re: [dpdk-dev] DPDK compilation on arm is failing in Travis

Jerin Jacob Kollanukkaran Sat, 08 Jun 2019 01:41:39 -0700

> -----Original Message-----
> From: Jerin Jacob Kollanukkaran <[email protected]>
> Sent: Saturday, June 8, 2019 2:08 PM
> To: Honnappa Nagarahalli <[email protected]>; Aaron Conole
> <[email protected]>
> Cc: [email protected]; [email protected]; Ruifeng Wang (Arm
> Technology China) <[email protected]>; Gavin Hu (Arm Technology
> China) <[email protected]>; Dharmik Thakkar <[email protected]>;
> [email protected]; [email protected]; [email protected];
> [email protected]; nd <[email protected]>; nd <[email protected]>
> Subject: RE: DPDK compilation on arm is failing in Travis
> 
> > -----Original Message-----
> > From: dev <[email protected]> On Behalf Of Honnappa Nagarahalli
> > Sent: Friday, June 7, 2019 7:24 PM
> > To: Aaron Conole <[email protected]>
> > Cc: [email protected]; [email protected]; Ruifeng Wang (Arm
> > Technology China) <[email protected]>; Gavin Hu (Arm Technology
> > China) <[email protected]>; Dharmik Thakkar
> <[email protected]>;
> > [email protected]; [email protected]; [email protected];
> > [email protected]; Honnappa Nagarahalli
> > <[email protected]>; nd <[email protected]>; nd <[email protected]>
> > Subject: Re: [dpdk-dev] DPDK compilation on arm is failing in Travis
> >
> > > >> >
> > > >> > Thomas Monjalon <[email protected]> writes:
> > > >> >
> > > >> > The compilation of the master branch is failing for aarch64:
> > > >> > https://travis-ci.com/DPDK/dpdk
> > > >> >
> > > >> > The log is so verbose that I am not able to understand what is
> > > >> > really wrong. Please help to diagnose and fix, thanks.
> > > >> >
> > > >> > A discussion about this:
> > > >> > http://mails.dpdk.org/archives/dev/2019-June/134012.html
> > > >> >
> > > >> > I see the error now. It is printing the full log after the error,
> > > >> > so I missed the error at the top.
> > > >> >
> > > >> > I've read your comment about a possible error with the patch
> > > >> > removing weak functions, but neither Bruce nor I could reproduce
> > > >> > it. What is the condition to see this compiler warning?
> > > >> >
> > > >> > It is only on ARM, and only when the NEON intrinsics are in use.
> > > >> >
> > > >> > I am not able to reproduce it from the tip of master. I am using:
> > > >> > gcc (Ubuntu 8.3.0-6ubuntu1~18.04) 8.3.0
> > > >> >
> > > >> > From the log on Travis, it looks like the compiler is:
> > > >> > gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
> > > >> >
> > > >> > Is this the issue? Why are we seeing the error now?
> > > >> >
> > > >> > I tested with gcc-5 (Ubuntu/Linaro 5.5.0-12ubuntu1) 5.5.0 20171010,
> > > >> > and it works fine. I cannot get hold of 5.4.0. Not sure if it needs
> > > >> > to be supported. Are there any issues in upgrading to 7 or 8?
> > > >> >
> > > >> > I have tested it on my Ubuntu 16.04 VM on commit
> > > >> > 8cb511bb94ad92a76990f175cac76bb13d51daba
> > > >> > (head of master seems to be failing for other reasons on my VM).
> > > >> > I tested the following gcc versions:
> > > >> >
> > > >> > gcc 5.5.0 "cc (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010"
> > > >> > gcc 7.4.0 "cc (Ubuntu 7.4.0-1ubuntu1~16.04~ppa1) 7.4.0"
> > > >> > gcc 8.1.0 "cc (Ubuntu 8.1.0-5ubuntu1~16.04) 8.1.0"
> > > >> >
> > > >> > All tested versions failed on the exact same error shown in Travis.
> > > >> > I don't know if the compiler is at fault here. Maybe Aaron's patch
> > > >> > is a viable option?
> > > >> >
> > > >> > The issue is that the vector lane-setting code looks like:
> > > >> >
> > > >> >    lval = lane_set(scalar, rval, lane id)
> > > >> >
> > > >> > In this case, 'rval' is being used before it is ever set, but it
> > > >> > really could be just 0 for the first lane-setting call. Thereafter,
> > > >> > we use the old value of input as the rval, but each time a
> > > >> > different lane is set.
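A minimal portable C sketch of the dependency described above. It assumes a hypothetical 'vec4_model' struct and 'lane_set()' helper that only model the value semantics of vsetq_lane_s32() (they are not NEON or DPDK APIs): each call returns a copy of 'rval' with one lane replaced, so the other three lanes are read from 'rval' even on the first call. Starting the chain from an all-zero vector means no lane is ever read before it is written.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 128-bit vector holding four int32 lanes.
 * This is NOT the NEON int32x4_t type; it only mimics its value semantics. */
typedef struct {
	int32_t lane[4];
} vec4_model;

/* Models vsetq_lane_s32(scalar, rval, lane): returns a copy of 'rval' with
 * one lane replaced. The other three lanes are copied from 'rval', so 'rval'
 * must already hold defined values, even on the very first call. */
static vec4_model lane_set(int32_t scalar, vec4_model rval, int lane)
{
	assert(lane >= 0 && lane < 4);
	rval.lane[lane] = scalar;
	return rval;
}

/* Builds a vector from four scalars. Seeding 'v' with zeros means the first
 * lane_set() never reads an uninitialized value; each later call reuses the
 * previous result as its rval and overwrites a different lane. */
static vec4_model gather4(const int32_t src[4])
{
	vec4_model v = { { 0, 0, 0, 0 } };

	v = lane_set(src[0], v, 0);
	v = lane_set(src[1], v, 1);
	v = lane_set(src[2], v, 2);
	v = lane_set(src[3], v, 3);
	return v;
}
```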
> > > >> >
> > > >> > It would be nice if there were an intrinsic that formatted correctly
> > > >> > from the start (something we could call like lval =
> > > >> > lane_set_from_array(scalar_array)).
> > > >> >
> > > >> > [Honnappa] This exists already. ‘vdupq_n_s32’ can be used. Can you
> > > >> > try the following?
> > > >>
> > > >> Well, it isn't exactly that.  You are setting all lanes from a scalar.
> > > > Yes, you are correct, it sets all the lanes. I am not sure on how
> > > > this will affect the performance.
> > > >
> > > >> I'd rather be able to say:
> > > >>
> > > >>    input0 = vdupq_nn_s32(&parms[0]);
> > > >>    input1 = vdupq_nn_s32(&parms[4]);
> > > >>
> > > >> Something like that, which lets us delete all the rest of the
> > > >> lane-set code.  But it seems it doesn't exist.
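One existing intrinsic that fills all four lanes at once is vld1q_s32(), which loads four contiguous int32_t values from memory. Because the words come from different streams, they would first have to be copied into a small stack array. A sketch under that assumption: 'load4()' is a hypothetical helper, and the #else branch is only a scalar stand-in that mimics arm_neon.h so the sketch builds on non-ARM hosts. Whether the extra store and reload is cheaper than four lane inserts has not been measured.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#else
/* Scalar stand-ins so the sketch also builds off-ARM. They mimic, but are
 * not, the arm_neon.h definitions. */
typedef struct { int32_t v[4]; } int32x4_t;

static inline int32x4_t vld1q_s32(const int32_t *p)
{
	int32x4_t r;

	memcpy(r.v, p, sizeof(r.v));
	return r;
}

static inline int32_t vgetq_lane_s32(int32x4_t x, int lane)
{
	assert(lane >= 0 && lane < 4);
	return x.v[lane];
}
#endif

/* Hypothetical helper: gather four scalars (e.g. one 4-byte word per flow)
 * into a stack array, then fill all lanes with a single vector load.
 * No lane is ever read before it has been written. */
static inline int32x4_t load4(int32_t a, int32_t b, int32_t c, int32_t d)
{
	const int32_t tmp[4] = { a, b, c, d };

	return vld1q_s32(tmp);
}
```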
> > > >>
> > > >> Regardless, I think either patch should work (either using the 'all
> > > >> lanes' setting you have or the static variable). I have no preference
> > > >> on it - it's up to you (or someone else) to say which is preferred. I
> > > >> guess your version could be preferable since there's no static to
> > > >> need to "explain" :)
> > > > I think we can go ahead with your patch, which uses a temporary
> > > > vector for the first set, as it does not change the rest of the code
> > > > and hence performance should not be affected.
> > > >
> > > > But, I do not understand why you have added 'static'. Also,
> > > > changing 'ZEROVAL' to 'tmp' or something similar will be better.
> > >
> > > The static is there to guarantee '0' value.  Otherwise we create a
> > > temp variable that has to be initialized explicitly.
> > Ok, I am fine with this. I guess this is the explanation you wanted to 
> > avoid 😊.
> 
> Don’t use BSS for fastpath code. Let it use the stack for better cache
> usage and for the multi-thread case. I already sent a simple fix for this
> with temp variable. Please don’t complicate.


Typo: s/with temp variable/without temp variable/g


> 
> 
> >
> > >
> > > >>
> > > >> > honnag01@qc2400f-1:~/dpdk$ git diff
> > > >> > diff --git a/lib/librte_acl/acl_run_neon.h b/lib/librte_acl/acl_run_neon.h
> > > >> > index 01b9766d8..b3196cd12 100644
> > > >> > --- a/lib/librte_acl/acl_run_neon.h
> > > >> > +++ b/lib/librte_acl/acl_run_neon.h
> > > >> > @@ -181,8 +181,8 @@ search_neon_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > >> >
> > > >> >         while (flows.started > 0) {
> > > >> >                 /* Gather 4 bytes of input data for each stream. */
> > > >> > -               input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input0, 0);
> > > >> > -               input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), input1, 0);
> > > >> > +               input0 = vdupq_n_s32(GET_NEXT_4BYTES(parms, 0));
> > > >> > +               input1 = vdupq_n_s32(GET_NEXT_4BYTES(parms, 4));
> > > >> >
> > > >> >                 input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input0, 1);
> > > >> >                 input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 5), input1, 1);
> > > >> > @@ -242,7 +242,7 @@ search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > >> >
> > > >> >         while (flows.started > 0) {
> > > >> >                 /* Gather 4 bytes of input data for each stream. */
> > > >> > -               input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, 0);
> > > >> > +               input = vdupq_n_s32(GET_NEXT_4BYTES(parms, 0));
> > > >> >                 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, 1);
> > > >> >                 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, 2);
> > > >> >                 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, 3);
> > > >> >
> > > >> > Then 'input' would never appear as an rval before it was set.
> > > >> >
> > > >> > I thought Jerin Jacob (CC'd) would have some opinion on the right fix.
> > > >> > There are three 'fixes' I know exist: one is to squelch the warning
> > > >> > (but I don't like it because it could hide future code that
> > > >> > introduces this), one is to create a static and use assignment, and
> > > >> > one is to replace the first call and pass in a 0'd lane for the
> > > >> > first one.
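Two of the non-squelching approaches from this thread can be compared side by side. A sketch, assuming the four words are already in an array 'w' (the real code calls GET_NEXT_4BYTES() per lane) and using hypothetical helper names: 'gather_dup_first()' follows the vdupq_n_s32() diff, while 'gather_zero_first()' inserts lane 0 into a zero vector as the ZEROVAL patch does, but builds that zero locally with vdupq_n_s32(0) instead of a static. The #else branch only mimics arm_neon.h so the sketch builds on non-ARM hosts.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#else
/* Scalar stand-ins that mimic the arm_neon.h intrinsics used below. */
typedef struct { int32_t v[4]; } int32x4_t;

static inline int32x4_t vdupq_n_s32(int32_t x)
{
	int32x4_t r = { { x, x, x, x } };

	return r;
}

static inline int32x4_t vsetq_lane_s32(int32_t x, int32x4_t r, int lane)
{
	assert(lane >= 0 && lane < 4);
	r.v[lane] = x;
	return r;
}

static inline int32_t vgetq_lane_s32(int32x4_t r, int lane)
{
	assert(lane >= 0 && lane < 4);
	return r.v[lane];
}
#endif

/* vdupq_n_s32() approach: broadcast word 0 to every lane, then overwrite
 * lanes 1..3. No lane is read before it is written. */
static inline int32x4_t gather_dup_first(const int32_t w[4])
{
	int32x4_t in = vdupq_n_s32(w[0]);

	in = vsetq_lane_s32(w[1], in, 1);
	in = vsetq_lane_s32(w[2], in, 2);
	in = vsetq_lane_s32(w[3], in, 3);
	return in;
}

/* Zero-seeded approach: insert word 0 into an all-zero vector built locally
 * (not a static), then overwrite lanes 1..3. */
static inline int32x4_t gather_zero_first(const int32_t w[4])
{
	int32x4_t in = vsetq_lane_s32(w[0], vdupq_n_s32(0), 0);

	in = vsetq_lane_s32(w[1], in, 1);
	in = vsetq_lane_s32(w[2], in, 2);
	in = vsetq_lane_s32(w[3], in, 3);
	return in;
}
```

For the same input both helpers yield identical lanes, and neither reads a lane before writing it; whether a given compiler emits different code for them was not checked.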
> > > >> >
> > > >> > Actually, I think I have a patch that could work to not introduce an
> > > >> > assignment, but squelch the warning. Something like the following
> > > >> > (not tested).
> > > >> >
> > > >> > ---
> > > >> >
> > > >> > diff --git a/lib/librte_acl/acl_run_neon.h b/lib/librte_acl/acl_run_neon.h
> > > >> > index 01b9766d8..37c984fef 100644
> > > >> > --- a/lib/librte_acl/acl_run_neon.h
> > > >> > +++ b/lib/librte_acl/acl_run_neon.h
> > > >> > @@ -165,6 +165,7 @@ search_neon_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > >> >         uint64_t index_array[8];
> > > >> >         struct completion cmplt[8];
> > > >> >         struct parms parms[8];
> > > >> > +       static int32x4_t ZEROVAL;
> > > >> >         int32x4_t input0, input1;
> > > >> >
> > > >> >         acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > >> > @@ -181,8 +182,8 @@ search_neon_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > >> >
> > > >> >         while (flows.started > 0) {
> > > >> >                 /* Gather 4 bytes of input data for each stream. */
> > > >> > -               input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input0, 0);
> > > >> > -               input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), input1, 0);
> > > >> > +               input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), ZEROVAL, 0);
> > > >> > +               input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), ZEROVAL, 0);
> > > >> >
> > > >> >                 input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input0, 1);
> > > >> >                 input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 5), input1, 1);
> > > >> > @@ -227,6 +228,7 @@ search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > >> >         uint64_t index_array[4];
> > > >> >         struct completion cmplt[4];
> > > >> >         struct parms parms[4];
> > > >> > +       static int32x4_t ZEROVAL;
> > > >> >         int32x4_t input;
> > > >> >
> > > >> >         acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > >> > @@ -242,7 +244,7 @@ search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > >> >
> > > >> >         while (flows.started > 0) {
> > > >> >                 /* Gather 4 bytes of input data for each stream. */
> > > >> > -               input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, 0);
> > > >> > +               input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), ZEROVAL, 0);
> > > >> >                 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, 1);
> > > >> >                 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, 2);
> > > >> >                 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, 3);
> > > >> > --
> > > >> > 2.21.0
