> -----Original Message----- > From: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com> > Sent: Monday, June 17, 2019 6:19 AM > To: Jerin Jacob Kollanukkaran <jer...@marvell.com>; dev@dpdk.org > Cc: tho...@monjalon.net; Gavin Hu (Arm Technology China) > <gavin...@arm.com>; Honnappa Nagarahalli > <honnappa.nagaraha...@arm.com>; nd <n...@arm.com>; nd <n...@arm.com> > Subject: [EXT] RE: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 > compiler > > External Email > > ---------------------------------------------------------------------- > > > > > > Reduced the CC list (changing the topic slightly) > > > > > > > > > > > > > My understanding is that the generated code for both your patch > > > > > and my changes above is the same. Above suggested changes will > > > > > conform to ACLE recommendation. > > > > > > > > Though instructions are different. Effective cycles are same even > > > > though First dup updates the four positions. > > > Can you elaborate on how the instructions are different? > > > I wrote the following code with both the methods: > > > > > > uint32x4_t u32x4_gather_gcc (uint32_t *p0, uint32_t *p1, uint32_t > > > *p2, uint32_t *p3) { > > > uint32x4_t r = {*p0, *p1, *p2, *p3}; > > > > > > return r; > > > } > > > > > > uint32x4_t u32x4_gather_acle (uint32_t *p0, uint32_t *p1, uint32_t > > > *p2, uint32_t *p3) { > > > uint32x4_t r; > > > > > > r = vdupq_n_u32 (* p0); > > > r = vsetq_lane_u32 (*p1, r, 1); > > > r = vsetq_lane_u32 (*p2, r, 2); > > > r = vsetq_lane_u32 (*p3, r, 3); > > > > > > return r; > > > } > > > > > > The generated code has the same instructions for both (omitted the > > > unwanted > > > parts): > > > > > > u32x4_gather_gcc: > > > ld1r {v0.4s}, [x0] > > > ld1 {v0.s}[1], [x1] > > > ld1 {v0.s}[2], [x2] > > > ld1 {v0.s}[3], [x3] > > > ret > > > > > > u32x4_gather_acle: > > > ld1r {v0.4s}, [x0] > > > ld1 {v0.s}[1], [x1] > > > ld1 {v0.s}[2], [x2] > > > ld1 {v0.s}[3], [x3] > > > ret > > > > > > The first 'ld1r' updates all the lanes in both the cases. > > > > > > Please check actual generated code for ACL case. We can see difference > I think there is something wrong with the way you are looking at the > generated code. Please see comments below.
I am generating the dis assembly like below. gdb -batch -ex 'file build/app/test ' -ex 'disassemble /rm search_neon_4' You can try it out. > > > > > To make forward progress send the v2 based on the updated logic > > > > just to make ACLE Spec happy, I don’t see any real reason to do > > > > it though > > > > 😊 > > > Thanks for the patch, it was important to make forward progress. > > > But, I think we should carry forward the discussion as I plan to > > > change other parts of DPDK on similar lines. I want to understand > > > why you think there is no real reason. The ACLE recommendation > > > mentions the > > reasoning. > > > > # I see following in the ACLE spec. What is the actual reasoning? > > " > > ACLE does not define static construction of vector types. E.g. > > int32x4_t x = { 1, 2, 3, 4 }; > > Is not portable. Use the vcreate or vdup intrinsics to construct > > values from scalars. > > " > Here is the complete text from ACLE 2.1 > > 12.2.6 Compatibility with other vector programming models Programmers > should take particular care when combining the Neon Intrinsics API with > alternative vector programming models; ACLE does not specify how the > NEON Intrinsics API interoperates with them. > For instance, the GCC vector extension permits include “arm_neon.h” > ... > uint32x2_t x = {0, 1}; // GCC extension. > uint32_t y = vget_lane_s32 (x, 0); // ACLE NEON Intrinsic. > But with this code the value stored in ‘y’ will depend on both the target > architecture (AArch32 or AArch64) and whether the program is running in > big- or little-endian mode. I don’t have a big endian machine to test. I would be interesting to see The output in bigendian. > It is recommended that NEON Intrinsics be used consistently: > include “arm_neon.h” > ... > const int temp[2] = {0, 1}; > uint32x2_t x = vld1_s32 (temp); > uint32_t y = vget_lane_s32 (x, 0); > > > > > # Why does compiler(gcc) allows if it not indented to use? > I do not have an answer. This is a recommendation and all that I am trying to > say is, following the recommendation does not cost us anything in > performance. If there is no performance regression then no issue in changing to this format. > > > > > # I think, it may be time to introduce UndefinedBehaviorSanitizer > > (UBSan) Gcc feature to DPDK to detect undefined behavior checks to > > detect such case > I am not sure if it helps here. > > > > > > > > > > > > > > > > http://patches.dpdk.org/patch/54656/ > > > >