> -----Original Message-----
> From: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> Sent: Wednesday, June 12, 2019 1:18 AM
> To: Jerin Jacob Kollanukkaran <jer...@marvell.com>; dev@dpdk.org
> Cc: tho...@monjalon.net; Gavin Hu (Arm Technology China)
> <gavin...@arm.com>; nd <n...@arm.com>; nd <n...@arm.com>
> Subject: [EXT] RE: [dpdk-dev] [PATCH] acl: fix build issue with some arm64
> compiler
> 
> Reduced the CC list (changing the topic slightly)
> 
> > >
> > > My understanding is that the generated code for both your patch and
> > > my changes above is the same. Above suggested changes will conform
> > > to ACLE recommendation.
> >
> > Though instructions are different. Effective cycles are same even
> > though First dup updates the four positions.
> Can you elaborate on how the instructions are different?
> I wrote the following code with both the methods:
> 
> uint32x4_t u32x4_gather_gcc (uint32_t *p0, uint32_t *p1, uint32_t *p2,
> uint32_t *p3) {
>      uint32x4_t r = {*p0, *p1, *p2, *p3};
> 
>      return r;
> }
> 
> uint32x4_t u32x4_gather_acle (uint32_t *p0, uint32_t *p1, uint32_t *p2,
> uint32_t *p3) {
>      uint32x4_t r;
> 
>      r = vdupq_n_u32 (* p0);
>      r = vsetq_lane_u32 (*p1, r, 1);
>      r = vsetq_lane_u32 (*p2, r, 2);
>      r = vsetq_lane_u32 (*p3, r, 3);
> 
>      return r;
> }
> 
> The generated code has the same instructions for both (omitted the unwanted
> parts):
> 
> u32x4_gather_gcc:
>         ld1r    {v0.4s}, [x0]
>         ld1     {v0.s}[1], [x1]
>         ld1     {v0.s}[2], [x2]
>         ld1     {v0.s}[3], [x3]
>         ret
> 
> u32x4_gather_acle:
>         ld1r    {v0.4s}, [x0]
>         ld1     {v0.s}[1], [x1]
>         ld1     {v0.s}[2], [x2]
>         ld1     {v0.s}[3], [x3]
>         ret
> 
> The first 'ld1r' updates all the lanes in both the cases.


Please check actual generated code for ACL case. We can see difference
 0x00000000005cc1dc <+1884>:  80 6a 65 bc     ldr     s0, [x20, x5]
vs
  0x00000000005cc1dc <+1884>:  9e 6a 65 b8     ldr     w30, [x20, x5]

With patch:

244                     /* Gather 4 bytes of input data for each stream. */
245                     input = vdupq_n_s32(GET_NEXT_4BYTES(parms, 0));
   0x00000000005cc1c8 <+1864>:  b4 4f 46 a9     ldp     x20, x19, [x29, #96]
   0x00000000005cc1d8 <+1880>:  65 02 40 b9     ldr     w5, [x19]
   0x00000000005cc1dc <+1884>:  80 6a 65 bc     ldr     s0, [x20, x5]
   0x00000000005cc26c <+2028>:  73 12 00 91     add     x19, x19, #0x4
   0x00000000005cc2ac <+2092>:  b3 37 00 f9     str     x19, [x29, #104]

246                     input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), 
input, 1);
   0x00000000005cc1d0 <+1872>:  a6 9f 47 a9     ldp     x6, x7, [x29, #120]
   0x00000000005cc1ec <+1900>:  e5 00 40 b9     ldr     w5, [x7]
   0x00000000005cc1f0 <+1904>:  d6 68 65 b8     ldr     w22, [x6, x5]
   0x00000000005cc21c <+1948>:  e7 10 00 91     add     x7, x7, #0x4
   0x00000000005cc260 <+2016>:  a7 43 00 f9     str     x7, [x29, #128]

247                     input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), 
input, 2);
   0x00000000005cc1d4 <+1876>:  b5 4b 40 f9     ldr     x21, [x29, #144]
   0x00000000005cc1f4 <+1908>:  a6 4f 40 f9     ldr     x6, [x29, #152]
   0x00000000005cc1f8 <+1912>:  d4 00 40 b9     ldr     w20, [x6]
   0x00000000005cc1fc <+1916>:  b5 6a 74 b8     ldr     w21, [x21, x20]
   0x00000000005cc224 <+1956>:  c6 10 00 91     add     x6, x6, #0x4
   0x00000000005cc264 <+2020>:  a6 4f 00 f9     str     x6, [x29, #152]

248                     input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), 
input, 3);
   0x00000000005cc200 <+1920>:  a5 5b 40 f9     ldr     x5, [x29, #176]
   0x00000000005cc204 <+1924>:  b4 00 40 b9     ldr     w20, [x5]
   0x00000000005cc208 <+1928>:  a5 10 00 91     add     x5, x5, #0x4
   0x00000000005cc218 <+1944>:  b7 57 40 f9     ldr     x23, [x29, #168]
   0x00000000005cc220 <+1952>:  f4 6a 74 b8     ldr     w20, [x23, x20]
   0x00000000005cc228 <+1960>:  a5 5b 00 f9     str     x5, [x29, #176]
   
With out patch:
   
   245                     input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), 
input, 0);
   0x00000000005cc1c8 <+1864>:  b4 4f 46 a9     ldp     x20, x19, [x29, #96]
   0x00000000005cc1d8 <+1880>:  65 02 40 b9     ldr     w5, [x19]
   0x00000000005cc1dc <+1884>:  9e 6a 65 b8     ldr     w30, [x20, x5]
   0x00000000005cc248 <+1992>:  73 12 00 91     add     x19, x19, #0x4
   0x00000000005cc24c <+1996>:  b3 37 00 f9     str     x19, [x29, #104]

246                     input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), 
input, 1);
   0x00000000005cc1d0 <+1872>:  a6 9f 47 a9     ldp     x6, x7, [x29, #120]
   0x00000000005cc1ec <+1900>:  e5 00 40 b9     ldr     w5, [x7]
   0x00000000005cc1f0 <+1904>:  d6 68 65 b8     ldr     w22, [x6, x5]
   0x00000000005cc228 <+1960>:  e7 10 00 91     add     x7, x7, #0x4
   0x00000000005cc240 <+1984>:  a7 43 00 f9     str     x7, [x29, #128]

247                     input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), 
input, 2);
   0x00000000005cc1d4 <+1876>:  b5 4b 40 f9     ldr     x21, [x29, #144]
   0x00000000005cc1f4 <+1908>:  a6 4f 40 f9     ldr     x6, [x29, #152]
   0x00000000005cc1f8 <+1912>:  d4 00 40 b9     ldr     w20, [x6]
   0x00000000005cc1fc <+1916>:  b5 6a 74 b8     ldr     w21, [x21, x20]
   0x00000000005cc22c <+1964>:  c6 10 00 91     add     x6, x6, #0x4
   0x00000000005cc244 <+1988>:  a6 4f 00 f9     str     x6, [x29, #152]

248                     input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), 
input, 3);
   0x00000000005cc200 <+1920>:  a5 5b 40 f9     ldr     x5, [x29, #176]
   0x00000000005cc204 <+1924>:  b4 00 40 b9     ldr     w20, [x5]
   0x00000000005cc208 <+1928>:  a5 10 00 91     add     x5, x5, #0x4
   0x00000000005cc21c <+1948>:  b7 57 40 f9     ldr     x23, [x29, #168]
   0x00000000005cc224 <+1956>:  f4 6a 74 b8     ldr     w20, [x23, x20]
   0x00000000005cc230 <+1968>:  a5 5b 00 f9     str     x5, [x29, #176]




> 
> > To make forward progress send the v2 based on the updated logic  just
> > to make ACLE  Spec happy, I don’t see any real reason to do it though
> > 😊
> Thanks for the patch, it was important to make forward progress.
> But, I think we should carry forward the discussion as I plan to change other
> parts of DPDK on similar lines. I want to understand why you think there is no
> real reason. The ACLE recommendation mentions the reasoning.

# I see following in the ACLE spec. What is the actual reasoning? 
"
ACLE does not define static construction of vector types. E.g.
 int32x4_t x = { 1, 2, 3, 4 };
Is not portable. Use the vcreate or vdup intrinsics to construct values from 
scalars.
"

# Why does compiler(gcc) allows if it not indented to use? 

# I think, it may be time to introduce UndefinedBehaviorSanitizer (UBSan)
Gcc feature to DPDK to detect undefined behavior checks to detect such case

>

> >
> > http://patches.dpdk.org/patch/54656/
> >

Reply via email to