> -----Original Message-----
> From: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> Sent: Wednesday, June 12, 2019 1:18 AM
> To: Jerin Jacob Kollanukkaran <jer...@marvell.com>; dev@dpdk.org
> Cc: tho...@monjalon.net; Gavin Hu (Arm Technology China)
> <gavin...@arm.com>; nd <n...@arm.com>; nd <n...@arm.com>
> Subject: [EXT] RE: [dpdk-dev] [PATCH] acl: fix build issue with some arm64
> compiler
>
> Reduced the CC list (changing the topic slightly)
>
> > >
> > > My understanding is that the generated code for both your patch and
> > > my changes above is the same. Above suggested changes will conform
> > > to ACLE recommendation.
> >
> > Though instructions are different. Effective cycles are same even
> > though First dup updates the four positions.
> Can you elaborate on how the instructions are different?
> I wrote the following code with both the methods:
>
> uint32x4_t u32x4_gather_gcc (uint32_t *p0, uint32_t *p1, uint32_t *p2,
> uint32_t *p3) {
>     uint32x4_t r = {*p0, *p1, *p2, *p3};
>
>     return r;
> }
>
> uint32x4_t u32x4_gather_acle (uint32_t *p0, uint32_t *p1, uint32_t *p2,
> uint32_t *p3) {
>     uint32x4_t r;
>
>     r = vdupq_n_u32 (* p0);
>     r = vsetq_lane_u32 (*p1, r, 1);
>     r = vsetq_lane_u32 (*p2, r, 2);
>     r = vsetq_lane_u32 (*p3, r, 3);
>
>     return r;
> }
>
> The generated code has the same instructions for both (omitted the unwanted
> parts):
>
> u32x4_gather_gcc:
>     ld1r {v0.4s}, [x0]
>     ld1 {v0.s}[1], [x1]
>     ld1 {v0.s}[2], [x2]
>     ld1 {v0.s}[3], [x3]
>     ret
>
> u32x4_gather_acle:
>     ld1r {v0.4s}, [x0]
>     ld1 {v0.s}[1], [x1]
>     ld1 {v0.s}[2], [x2]
>     ld1 {v0.s}[3], [x3]
>     ret
>
> The first 'ld1r' updates all the lanes in both the cases.

Please check the actual generated code for the ACL case. We can see a difference:

   0x00000000005cc1dc <+1884>: 80 6a 65 bc ldr s0, [x20, x5]
vs
   0x00000000005cc1dc <+1884>: 9e 6a 65 b8 ldr w30, [x20, x5]

With patch:

244 /* Gather 4 bytes of input data for each stream. */
245 input = vdupq_n_s32(GET_NEXT_4BYTES(parms, 0));
   0x00000000005cc1c8 <+1864>: b4 4f 46 a9 ldp x20, x19, [x29, #96]
   0x00000000005cc1d8 <+1880>: 65 02 40 b9 ldr w5, [x19]
   0x00000000005cc1dc <+1884>: 80 6a 65 bc ldr s0, [x20, x5]
   0x00000000005cc26c <+2028>: 73 12 00 91 add x19, x19, #0x4
   0x00000000005cc2ac <+2092>: b3 37 00 f9 str x19, [x29, #104]

246 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, 1);
   0x00000000005cc1d0 <+1872>: a6 9f 47 a9 ldp x6, x7, [x29, #120]
   0x00000000005cc1ec <+1900>: e5 00 40 b9 ldr w5, [x7]
   0x00000000005cc1f0 <+1904>: d6 68 65 b8 ldr w22, [x6, x5]
   0x00000000005cc21c <+1948>: e7 10 00 91 add x7, x7, #0x4
   0x00000000005cc260 <+2016>: a7 43 00 f9 str x7, [x29, #128]

247 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, 2);
   0x00000000005cc1d4 <+1876>: b5 4b 40 f9 ldr x21, [x29, #144]
   0x00000000005cc1f4 <+1908>: a6 4f 40 f9 ldr x6, [x29, #152]
   0x00000000005cc1f8 <+1912>: d4 00 40 b9 ldr w20, [x6]
   0x00000000005cc1fc <+1916>: b5 6a 74 b8 ldr w21, [x21, x20]
   0x00000000005cc224 <+1956>: c6 10 00 91 add x6, x6, #0x4
   0x00000000005cc264 <+2020>: a6 4f 00 f9 str x6, [x29, #152]

248 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, 3);
   0x00000000005cc200 <+1920>: a5 5b 40 f9 ldr x5, [x29, #176]
   0x00000000005cc204 <+1924>: b4 00 40 b9 ldr w20, [x5]
   0x00000000005cc208 <+1928>: a5 10 00 91 add x5, x5, #0x4
   0x00000000005cc218 <+1944>: b7 57 40 f9 ldr x23, [x29, #168]
   0x00000000005cc220 <+1952>: f4 6a 74 b8 ldr w20, [x23, x20]
   0x00000000005cc228 <+1960>: a5 5b 00 f9 str x5, [x29, #176]

Without patch:

245 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, 0);
   0x00000000005cc1c8 <+1864>: b4 4f 46 a9 ldp x20, x19, [x29, #96]
   0x00000000005cc1d8 <+1880>: 65 02 40 b9 ldr w5, [x19]
   0x00000000005cc1dc <+1884>: 9e 6a 65 b8 ldr w30, [x20, x5]
   0x00000000005cc248 <+1992>: 73 12 00 91 add x19, x19, #0x4
   0x00000000005cc24c <+1996>: b3 37 00 f9 str x19, [x29, #104]

246 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, 1);
   0x00000000005cc1d0 <+1872>: a6 9f 47 a9 ldp x6, x7, [x29, #120]
   0x00000000005cc1ec <+1900>: e5 00 40 b9 ldr w5, [x7]
   0x00000000005cc1f0 <+1904>: d6 68 65 b8 ldr w22, [x6, x5]
   0x00000000005cc228 <+1960>: e7 10 00 91 add x7, x7, #0x4
   0x00000000005cc240 <+1984>: a7 43 00 f9 str x7, [x29, #128]

247 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, 2);
   0x00000000005cc1d4 <+1876>: b5 4b 40 f9 ldr x21, [x29, #144]
   0x00000000005cc1f4 <+1908>: a6 4f 40 f9 ldr x6, [x29, #152]
   0x00000000005cc1f8 <+1912>: d4 00 40 b9 ldr w20, [x6]
   0x00000000005cc1fc <+1916>: b5 6a 74 b8 ldr w21, [x21, x20]
   0x00000000005cc22c <+1964>: c6 10 00 91 add x6, x6, #0x4
   0x00000000005cc244 <+1988>: a6 4f 00 f9 str x6, [x29, #152]

248 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, 3);
   0x00000000005cc200 <+1920>: a5 5b 40 f9 ldr x5, [x29, #176]
   0x00000000005cc204 <+1924>: b4 00 40 b9 ldr w20, [x5]
   0x00000000005cc208 <+1928>: a5 10 00 91 add x5, x5, #0x4
   0x00000000005cc21c <+1948>: b7 57 40 f9 ldr x23, [x29, #168]
   0x00000000005cc224 <+1956>: f4 6a 74 b8 ldr w20, [x23, x20]
   0x00000000005cc230 <+1968>: a5 5b 00 f9 str x5, [x29, #176]

>
> > To make forward progress send the v2 based on the updated logic just
> > to make ACLE Spec happy, I don’t see any real reason to do it though
> > 😊
> Thanks for the patch, it was important to make forward progress.
> But, I think we should carry forward the discussion as I plan to change other
> parts of DPDK on similar lines. I want to understand why you think there is no
> real reason. The ACLE recommendation mentions the reasoning.

# I see the following in the ACLE spec. What is the actual reasoning?

"
ACLE does not define static construction of vector types. E.g.

  int32x4_t x = { 1, 2, 3, 4 };

Is not portable.
Use the vcreate or vdup intrinsics to construct values from scalars.
"

# Why does the compiler (gcc) allow it if it is not intended to be used?

# I think it may be time to introduce GCC's UndefinedBehaviorSanitizer (UBSan)
feature to DPDK to detect such cases of undefined behavior.

>
> >
> > http://patches.dpdk.org/patch/54656/
> >
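
For reference, here is a minimal sketch of the two construction styles being
discussed, reduced to the ACLE-recommended one: build the vector with a dup
followed by per-lane sets instead of a brace initializer. The names
vec_s32x4, make_s32x4 and store_s32x4 are hypothetical and are not taken from
DPDK or the patch. The #else branch is only a scalar stand-in (an assumption
on my side) so that the sketch also builds on a non-Arm host; it says nothing
about the generated NEON code.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>

typedef int32x4_t vec_s32x4;            /* real NEON vector type */

/* Construct {a, b, c, d} with intrinsics only, no brace initializer. */
static inline vec_s32x4 make_s32x4(int32_t a, int32_t b, int32_t c, int32_t d)
{
    int32x4_t v = vdupq_n_s32(a);       /* writes a into all four lanes */

    v = vsetq_lane_s32(b, v, 1);        /* then overwrite lanes 1..3 */
    v = vsetq_lane_s32(c, v, 2);
    v = vsetq_lane_s32(d, v, 3);
    return v;
}

/* Copy the four lanes to memory, lane 0 first. */
static inline void store_s32x4(int32_t out[4], vec_s32x4 v)
{
    vst1q_s32(out, v);
}

#else

/* Hypothetical scalar stand-in, only so the sketch builds off-target. */
typedef struct { int32_t lane[4]; } vec_s32x4;

static inline vec_s32x4 make_s32x4(int32_t a, int32_t b, int32_t c, int32_t d)
{
    vec_s32x4 v = { { a, a, a, a } };   /* mimic the dup step */

    v.lane[1] = b;                      /* mimic the per-lane sets */
    v.lane[2] = c;
    v.lane[3] = d;
    return v;
}

static inline void store_s32x4(int32_t out[4], vec_s32x4 v)
{
    for (int i = 0; i < 4; i++)
        out[i] = v.lane[i];
}

#endif
```

Calling store_s32x4(out, make_s32x4(1, 2, 3, 4)) should leave 1, 2, 3, 4 in
out[0..3], which is the value the non-portable { 1, 2, 3, 4 } initializer from
the ACLE quote is meant to produce.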