Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics

2021-09-03 Thread Christophe LYON via Gcc-patches



On 03/09/2021 10:35, Prathamesh Kulkarni via Gcc-patches wrote:

On Thu, 2 Sept 2021 at 14:32, Christophe Lyon
 wrote:



On Tue, Aug 24, 2021 at 10:17 AM Kyrylo Tkachov  wrote:




-Original Message-
From: Prathamesh Kulkarni 
Sent: 24 August 2021 09:01
To: Christophe Lyon 
Cc: Kyrylo Tkachov ; gcc Patches 
Subject: Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n
intrinsics

On Tue, 17 Aug 2021 at 11:55, Prathamesh Kulkarni
 wrote:

On Thu, 12 Aug 2021 at 19:04, Christophe Lyon
 wrote:



On Thu, Aug 12, 2021 at 1:54 PM Prathamesh Kulkarni

 wrote:

On Wed, 11 Aug 2021 at 22:23, Christophe Lyon
 wrote:



On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches 
patc...@gcc.gnu.org> wrote:




-Original Message-
From: Prathamesh Kulkarni 
Sent: 24 June 2021 12:11
To: gcc Patches ; Kyrylo Tkachov

Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n

intrinsics

Hi,
This patch replaces builtins for vdup_n and vmov_n.
The patch results in regression for pr51534.c.
Consider following function:

uint8x8_t f1 (uint8x8_t a) {
   return vcgt_u8(a, vdup_n_u8(0));
}

code-gen before patch:
f1:
 vmov.i32  d16, #0  @ v8qi
 vcgt.u8 d0, d0, d16
 bx lr

code-gen after patch:
f1:
 vceq.i8 d0, d0, #0
 vmvnd0, d0
 bx lr

I am not sure which one is better tho ?

Hi Prathamesh,

This patch introduces a regression on non-hardfp configs (eg arm-

linux-gnueabi or arm-eabi):

FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan-

assembler-times vmov.i32[ \t]+[dD][0-9]+, #0x 3

FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan-

assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3

Can you fix this?

The issue is, for following test:

#include 

uint8x8_t f1 (uint8x8_t a) {
   return vcge_u8(a, vdup_n_u8(0));
}

armhf code-gen:
f1:
 vmov.i32  d0, #0x  @ v8qi
 bxlr

arm softfp code-gen:
f1:
 mov r0, #-1
 mov r1, #-1
 bx  lr

The code-gen for both is same upto split2 pass:

(insn 10 6 11 2 (set (reg/i:V8QI 16 s0)
 (const_vector:V8QI [
 (const_int -1 [0x]) repeated x8
 ])) "foo.c":5:1 1052 {*neon_movv8qi}
  (expr_list:REG_EQUAL (const_vector:V8QI [
 (const_int -1 [0x]) repeated x8
 ])
 (nil)))
(insn 11 10 13 2 (use (reg/i:V8QI 16 s0)) "foo.c":5:1 -1
  (nil))

and for softfp target, split2 pass splits the assignment to r0 and r1:

(insn 15 6 16 2 (set (reg:SI 0 r0)
 (const_int -1 [0x])) "foo.c":5:1 740

{*thumb2_movsi_vfp}

  (nil))
(insn 16 15 11 2 (set (reg:SI 1 r1 [+4 ])
 (const_int -1 [0x])) "foo.c":5:1 740

{*thumb2_movsi_vfp}

  (nil))
(insn 11 16 13 2 (use (reg/i:V8QI 0 r0)) "foo.c":5:1 -1
  (nil))

I suppose we could use a dg-scan for r[0-9]+, #-1 for softfp targets ?


Yes, probably, or try with check-function-bodies.

Hi,
Sorry for the late response. Does the attached patch look OK ?

ping https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577532.html

Ok.



Sorry Prathamesh, this does not quite work. See  
https://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/r12-3294-g7a6f40d0452ec76e126c2612dcfa32f3c73e2315/report-build-info.html
(red cells in the gcc column)

Can you have a look?

Ah, for arm-none-eabi target the code-gen for test_vcge_u8_uint8x8_t is:
 mvn r0, #0
 mvn r1, #0
 bx  lr

instead of:
 mov r0, #-1
 mov r1, #-1
 bx  lr

I guess both code-gen sequences are correct, but I am not sure under
which circumstance either sequence gets generated.
To accomodate for both, I tried to check for either pattern with OR:
/* { dg-final { scan-assembler-times "(mov\[\]+r\[0-9\]+,
#-1)|(mvn\[\]+r\[0-9\]+, #0)" 6 { target { arm_softfp_ok } } } }
*/
but that doesn't seem to work.

Do you have any suggestions on how to proceed ?


Sorry I have no magic syntax for this :-)

Maybe it's also worth checking why we have this codegen difference?


Thanks,


Christophe





Thanks,
Prathamesh

Thanks

Christophe


Thanks,
Kyrill


Thanks,
Prathamesh

Thanks,
Prathamesh

  Christophe


Thanks,
Prathamesh

Thanks

Christophe



I think they're equivalent in practice, in any case the patch itself is

good (move away from RTL builtins).

Ok.
Thanks,
Kyrill


Also, this patch regressed bf16_dup.c on arm-linux-gnueabi,
which is due to a missed opt in lowering. I had filed it as
PR98435, and posted a fix for it here:
https://gcc.gnu.org/pipermail/gcc-patches/2021-

June/572648.html

Thanks,
Prathamesh


Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics

2021-09-03 Thread Prathamesh Kulkarni via Gcc-patches
On Thu, 2 Sept 2021 at 14:32, Christophe Lyon
 wrote:
>
>
>
> On Tue, Aug 24, 2021 at 10:17 AM Kyrylo Tkachov  
> wrote:
>>
>>
>>
>> > -Original Message-
>> > From: Prathamesh Kulkarni 
>> > Sent: 24 August 2021 09:01
>> > To: Christophe Lyon 
>> > Cc: Kyrylo Tkachov ; gcc Patches > > patc...@gcc.gnu.org>
>> > Subject: Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n
>> > intrinsics
>> >
>> > On Tue, 17 Aug 2021 at 11:55, Prathamesh Kulkarni
>> >  wrote:
>> > >
>> > > On Thu, 12 Aug 2021 at 19:04, Christophe Lyon
>> > >  wrote:
>> > > >
>> > > >
>> > > >
>> > > > On Thu, Aug 12, 2021 at 1:54 PM Prathamesh Kulkarni
>> >  wrote:
>> > > >>
>> > > >> On Wed, 11 Aug 2021 at 22:23, Christophe Lyon
>> > > >>  wrote:
>> > > >> >
>> > > >> >
>> > > >> >
>> > > >> > On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches > > patc...@gcc.gnu.org> wrote:
>> > > >> >>
>> > > >> >>
>> > > >> >>
>> > > >> >> > -Original Message-
>> > > >> >> > From: Prathamesh Kulkarni 
>> > > >> >> > Sent: 24 June 2021 12:11
>> > > >> >> > To: gcc Patches ; Kyrylo Tkachov
>> > > >> >> > 
>> > > >> >> > Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n
>> > intrinsics
>> > > >> >> >
>> > > >> >> > Hi,
>> > > >> >> > This patch replaces builtins for vdup_n and vmov_n.
>> > > >> >> > The patch results in regression for pr51534.c.
>> > > >> >> > Consider following function:
>> > > >> >> >
>> > > >> >> > uint8x8_t f1 (uint8x8_t a) {
>> > > >> >> >   return vcgt_u8(a, vdup_n_u8(0));
>> > > >> >> > }
>> > > >> >> >
>> > > >> >> > code-gen before patch:
>> > > >> >> > f1:
>> > > >> >> > vmov.i32  d16, #0  @ v8qi
>> > > >> >> > vcgt.u8 d0, d0, d16
>> > > >> >> > bx lr
>> > > >> >> >
>> > > >> >> > code-gen after patch:
>> > > >> >> > f1:
>> > > >> >> > vceq.i8 d0, d0, #0
>> > > >> >> > vmvnd0, d0
>> > > >> >> > bx lr
>> > > >> >> >
>> > > >> >> > I am not sure which one is better tho ?
>> > > >> >>
>> > > >> >
>> > > >> > Hi Prathamesh,
>> > > >> >
>> > > >> > This patch introduces a regression on non-hardfp configs (eg arm-
>> > linux-gnueabi or arm-eabi):
>> > > >> > FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan-
>> > assembler-times vmov.i32[ \t]+[dD][0-9]+, #0x 3
>> > > >> > FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan-
>> > assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3
>> > > >> >
>> > > >> > Can you fix this?
>> > > >> The issue is, for following test:
>> > > >>
>> > > >> #include 
>> > > >>
>> > > >> uint8x8_t f1 (uint8x8_t a) {
>> > > >>   return vcge_u8(a, vdup_n_u8(0));
>> > > >> }
>> > > >>
>> > > >> armhf code-gen:
>> > > >> f1:
>> > > >> vmov.i32  d0, #0x  @ v8qi
>> > > >> bxlr
>> > > >>
>> > > >> arm softfp code-gen:
>> > > >> f1:
>> > > >> mov r0, #-1
>> > > >> mov r1, #-1
>> > > >> bx  lr
>> > > >>
>> > > >> The code-gen for both is same upto split2 pass:
>> > > >>
>> > > >> (insn 10 6 11 2 (set (reg/i:V8QI 16 s0)
>> > > >> (const

Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics

2021-09-02 Thread Christophe Lyon via Gcc-patches
On Tue, Aug 24, 2021 at 10:17 AM Kyrylo Tkachov 
wrote:

>
>
> > -Original Message-
> > From: Prathamesh Kulkarni 
> > Sent: 24 August 2021 09:01
> > To: Christophe Lyon 
> > Cc: Kyrylo Tkachov ; gcc Patches  > patc...@gcc.gnu.org>
> > Subject: Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n
> > intrinsics
> >
> > On Tue, 17 Aug 2021 at 11:55, Prathamesh Kulkarni
> >  wrote:
> > >
> > > On Thu, 12 Aug 2021 at 19:04, Christophe Lyon
> > >  wrote:
> > > >
> > > >
> > > >
> > > > On Thu, Aug 12, 2021 at 1:54 PM Prathamesh Kulkarni
> >  wrote:
> > > >>
> > > >> On Wed, 11 Aug 2021 at 22:23, Christophe Lyon
> > > >>  wrote:
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches
>  > patc...@gcc.gnu.org> wrote:
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> > -Original Message-
> > > >> >> > From: Prathamesh Kulkarni 
> > > >> >> > Sent: 24 June 2021 12:11
> > > >> >> > To: gcc Patches ; Kyrylo Tkachov
> > > >> >> > 
> > > >> >> > Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n
> > intrinsics
> > > >> >> >
> > > >> >> > Hi,
> > > >> >> > This patch replaces builtins for vdup_n and vmov_n.
> > > >> >> > The patch results in regression for pr51534.c.
> > > >> >> > Consider following function:
> > > >> >> >
> > > >> >> > uint8x8_t f1 (uint8x8_t a) {
> > > >> >> >   return vcgt_u8(a, vdup_n_u8(0));
> > > >> >> > }
> > > >> >> >
> > > >> >> > code-gen before patch:
> > > >> >> > f1:
> > > >> >> > vmov.i32  d16, #0  @ v8qi
> > > >> >> > vcgt.u8 d0, d0, d16
> > > >> >> > bx lr
> > > >> >> >
> > > >> >> > code-gen after patch:
> > > >> >> > f1:
> > > >> >> > vceq.i8 d0, d0, #0
> > > >> >> > vmvnd0, d0
> > > >> >> > bx lr
> > > >> >> >
> > > >> >> > I am not sure which one is better tho ?
> > > >> >>
> > > >> >
> > > >> > Hi Prathamesh,
> > > >> >
> > > >> > This patch introduces a regression on non-hardfp configs (eg arm-
> > linux-gnueabi or arm-eabi):
> > > >> > FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan-
> > assembler-times vmov.i32[ \t]+[dD][0-9]+, #0x 3
> > > >> > FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan-
> > assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3
> > > >> >
> > > >> > Can you fix this?
> > > >> The issue is, for following test:
> > > >>
> > > >> #include 
> > > >>
> > > >> uint8x8_t f1 (uint8x8_t a) {
> > > >>   return vcge_u8(a, vdup_n_u8(0));
> > > >> }
> > > >>
> > > >> armhf code-gen:
> > > >> f1:
> > > >> vmov.i32  d0, #0x  @ v8qi
> > > >> bxlr
> > > >>
> > > >> arm softfp code-gen:
> > > >> f1:
> > > >> mov r0, #-1
> > > >> mov r1, #-1
> > > >> bx  lr
> > > >>
> > > >> The code-gen for both is same upto split2 pass:
> > > >>
> > > >> (insn 10 6 11 2 (set (reg/i:V8QI 16 s0)
> > > >> (const_vector:V8QI [
> > > >> (const_int -1 [0x]) repeated x8
> > > >> ])) "foo.c":5:1 1052 {*neon_movv8qi}
> > > >>  (expr_list:REG_EQUAL (const_vector:V8QI [
> > > >> (const_int -1 [0x]) repeated x8
> > > >> ])
> > > >> (nil)))
> > > >> (insn 11 10 13 2 (use (reg

RE: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics

2021-08-24 Thread Kyrylo Tkachov via Gcc-patches


> -Original Message-
> From: Prathamesh Kulkarni 
> Sent: 24 August 2021 09:01
> To: Christophe Lyon 
> Cc: Kyrylo Tkachov ; gcc Patches  patc...@gcc.gnu.org>
> Subject: Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n
> intrinsics
> 
> On Tue, 17 Aug 2021 at 11:55, Prathamesh Kulkarni
>  wrote:
> >
> > On Thu, 12 Aug 2021 at 19:04, Christophe Lyon
> >  wrote:
> > >
> > >
> > >
> > > On Thu, Aug 12, 2021 at 1:54 PM Prathamesh Kulkarni
>  wrote:
> > >>
> > >> On Wed, 11 Aug 2021 at 22:23, Christophe Lyon
> > >>  wrote:
> > >> >
> > >> >
> > >> >
> > >> > On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches  patc...@gcc.gnu.org> wrote:
> > >> >>
> > >> >>
> > >> >>
> > >> >> > -Original Message-
> > >> >> > From: Prathamesh Kulkarni 
> > >> >> > Sent: 24 June 2021 12:11
> > >> >> > To: gcc Patches ; Kyrylo Tkachov
> > >> >> > 
> > >> >> > Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n
> intrinsics
> > >> >> >
> > >> >> > Hi,
> > >> >> > This patch replaces builtins for vdup_n and vmov_n.
> > >> >> > The patch results in regression for pr51534.c.
> > >> >> > Consider following function:
> > >> >> >
> > >> >> > uint8x8_t f1 (uint8x8_t a) {
> > >> >> >   return vcgt_u8(a, vdup_n_u8(0));
> > >> >> > }
> > >> >> >
> > >> >> > code-gen before patch:
> > >> >> > f1:
> > >> >> > vmov.i32  d16, #0  @ v8qi
> > >> >> > vcgt.u8 d0, d0, d16
> > >> >> > bx lr
> > >> >> >
> > >> >> > code-gen after patch:
> > >> >> > f1:
> > >> >> > vceq.i8 d0, d0, #0
> > >> >> > vmvnd0, d0
> > >> >> > bx lr
> > >> >> >
> > >> >> > I am not sure which one is better tho ?
> > >> >>
> > >> >
> > >> > Hi Prathamesh,
> > >> >
> > >> > This patch introduces a regression on non-hardfp configs (eg arm-
> linux-gnueabi or arm-eabi):
> > >> > FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan-
> assembler-times vmov.i32[ \t]+[dD][0-9]+, #0x 3
> > >> > FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan-
> assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3
> > >> >
> > >> > Can you fix this?
> > >> The issue is, for following test:
> > >>
> > >> #include 
> > >>
> > >> uint8x8_t f1 (uint8x8_t a) {
> > >>   return vcge_u8(a, vdup_n_u8(0));
> > >> }
> > >>
> > >> armhf code-gen:
> > >> f1:
> > >> vmov.i32  d0, #0x  @ v8qi
> > >> bxlr
> > >>
> > >> arm softfp code-gen:
> > >> f1:
> > >> mov r0, #-1
> > >> mov r1, #-1
> > >> bx  lr
> > >>
> > >> The code-gen for both is same upto split2 pass:
> > >>
> > >> (insn 10 6 11 2 (set (reg/i:V8QI 16 s0)
> > >> (const_vector:V8QI [
> > >> (const_int -1 [0x]) repeated x8
> > >> ])) "foo.c":5:1 1052 {*neon_movv8qi}
> > >>  (expr_list:REG_EQUAL (const_vector:V8QI [
> > >> (const_int -1 [0x]) repeated x8
> > >> ])
> > >> (nil)))
> > >> (insn 11 10 13 2 (use (reg/i:V8QI 16 s0)) "foo.c":5:1 -1
> > >>  (nil))
> > >>
> > >> and for softfp target, split2 pass splits the assignment to r0 and r1:
> > >>
> > >> (insn 15 6 16 2 (set (reg:SI 0 r0)
> > >> (const_int -1 [0x])) "foo.c":5:1 740
> {*thumb2_movsi_vfp}
> > >>  (nil))
> > >> (insn 16 15 11 2 (set (reg:SI 1 r1 [+4 ])
> > >> (const_int -1 [0x])) "foo.c":5:1 740
> {*thumb2_movsi_vfp}
> > >>  (nil))
> > >> (insn 11 16 13 2 (use (reg/i:V8QI 0 r0)) "foo.c":5:1 -1
> > >>  (nil))
> > >>
> > >> I suppose we could use a dg-scan for r[0-9]+, #-1 for softfp targets ?
> > >>
> > > Yes, probably, or try with check-function-bodies.
> > Hi,
> > Sorry for the late response. Does the attached patch look OK ?
> ping https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577532.html

Ok.
Thanks,
Kyrill

> 
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
> > >
> > >  Christophe
> > >
> > >> Thanks,
> > >> Prathamesh
> > >> >
> > >> > Thanks
> > >> >
> > >> > Christophe
> > >> >
> > >> >
> > >> >>
> > >> >> I think they're equivalent in practice, in any case the patch itself 
> > >> >> is
> good (move away from RTL builtins).
> > >> >> Ok.
> > >> >> Thanks,
> > >> >> Kyrill
> > >> >>
> > >> >> >
> > >> >> > Also, this patch regressed bf16_dup.c on arm-linux-gnueabi,
> > >> >> > which is due to a missed opt in lowering. I had filed it as
> > >> >> > PR98435, and posted a fix for it here:
> > >> >> > https://gcc.gnu.org/pipermail/gcc-patches/2021-
> June/572648.html
> > >> >> >
> > >> >> > Thanks,
> > >> >> > Prathamesh


Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics

2021-08-24 Thread Prathamesh Kulkarni via Gcc-patches
On Tue, 17 Aug 2021 at 11:55, Prathamesh Kulkarni
 wrote:
>
> On Thu, 12 Aug 2021 at 19:04, Christophe Lyon
>  wrote:
> >
> >
> >
> > On Thu, Aug 12, 2021 at 1:54 PM Prathamesh Kulkarni 
> >  wrote:
> >>
> >> On Wed, 11 Aug 2021 at 22:23, Christophe Lyon
> >>  wrote:
> >> >
> >> >
> >> >
> >> > On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches 
> >> >  wrote:
> >> >>
> >> >>
> >> >>
> >> >> > -Original Message-
> >> >> > From: Prathamesh Kulkarni 
> >> >> > Sent: 24 June 2021 12:11
> >> >> > To: gcc Patches ; Kyrylo Tkachov
> >> >> > 
> >> >> > Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n 
> >> >> > intrinsics
> >> >> >
> >> >> > Hi,
> >> >> > This patch replaces builtins for vdup_n and vmov_n.
> >> >> > The patch results in regression for pr51534.c.
> >> >> > Consider following function:
> >> >> >
> >> >> > uint8x8_t f1 (uint8x8_t a) {
> >> >> >   return vcgt_u8(a, vdup_n_u8(0));
> >> >> > }
> >> >> >
> >> >> > code-gen before patch:
> >> >> > f1:
> >> >> > vmov.i32  d16, #0  @ v8qi
> >> >> > vcgt.u8 d0, d0, d16
> >> >> > bx lr
> >> >> >
> >> >> > code-gen after patch:
> >> >> > f1:
> >> >> > vceq.i8 d0, d0, #0
> >> >> > vmvnd0, d0
> >> >> > bx lr
> >> >> >
> >> >> > I am not sure which one is better tho ?
> >> >>
> >> >
> >> > Hi Prathamesh,
> >> >
> >> > This patch introduces a regression on non-hardfp configs (eg 
> >> > arm-linux-gnueabi or arm-eabi):
> >> > FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c 
> >> > scan-assembler-times vmov.i32[ \t]+[dD][0-9]+, #0x 3
> >> > FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c 
> >> > scan-assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3
> >> >
> >> > Can you fix this?
> >> The issue is, for following test:
> >>
> >> #include 
> >>
> >> uint8x8_t f1 (uint8x8_t a) {
> >>   return vcge_u8(a, vdup_n_u8(0));
> >> }
> >>
> >> armhf code-gen:
> >> f1:
> >> vmov.i32  d0, #0x  @ v8qi
> >> bxlr
> >>
> >> arm softfp code-gen:
> >> f1:
> >> mov r0, #-1
> >> mov r1, #-1
> >> bx  lr
> >>
> >> The code-gen for both is same upto split2 pass:
> >>
> >> (insn 10 6 11 2 (set (reg/i:V8QI 16 s0)
> >> (const_vector:V8QI [
> >> (const_int -1 [0x]) repeated x8
> >> ])) "foo.c":5:1 1052 {*neon_movv8qi}
> >>  (expr_list:REG_EQUAL (const_vector:V8QI [
> >> (const_int -1 [0x]) repeated x8
> >> ])
> >> (nil)))
> >> (insn 11 10 13 2 (use (reg/i:V8QI 16 s0)) "foo.c":5:1 -1
> >>  (nil))
> >>
> >> and for softfp target, split2 pass splits the assignment to r0 and r1:
> >>
> >> (insn 15 6 16 2 (set (reg:SI 0 r0)
> >> (const_int -1 [0x])) "foo.c":5:1 740 
> >> {*thumb2_movsi_vfp}
> >>  (nil))
> >> (insn 16 15 11 2 (set (reg:SI 1 r1 [+4 ])
> >> (const_int -1 [0x])) "foo.c":5:1 740 
> >> {*thumb2_movsi_vfp}
> >>  (nil))
> >> (insn 11 16 13 2 (use (reg/i:V8QI 0 r0)) "foo.c":5:1 -1
> >>  (nil))
> >>
> >> I suppose we could use a dg-scan for r[0-9]+, #-1 for softfp targets ?
> >>
> > Yes, probably, or try with check-function-bodies.
> Hi,
> Sorry for the late response. Does the attached patch look OK ?
ping https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577532.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> >  Christophe
> >
> >> Thanks,
> >> Prathamesh
> >> >
> >> > Thanks
> >> >
> >> > Christophe
> >> >
> >> >
> >> >>
> >> >> I think they're equivalent in practice, in any case the patch itself is 
> >> >> good (move away from RTL builtins).
> >> >> Ok.
> >> >> Thanks,
> >> >> Kyrill
> >> >>
> >> >> >
> >> >> > Also, this patch regressed bf16_dup.c on arm-linux-gnueabi,
> >> >> > which is due to a missed opt in lowering. I had filed it as
> >> >> > PR98435, and posted a fix for it here:
> >> >> > https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572648.html
> >> >> >
> >> >> > Thanks,
> >> >> > Prathamesh


Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics

2021-08-16 Thread Prathamesh Kulkarni via Gcc-patches
On Thu, 12 Aug 2021 at 19:04, Christophe Lyon
 wrote:
>
>
>
> On Thu, Aug 12, 2021 at 1:54 PM Prathamesh Kulkarni 
>  wrote:
>>
>> On Wed, 11 Aug 2021 at 22:23, Christophe Lyon
>>  wrote:
>> >
>> >
>> >
>> > On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches 
>> >  wrote:
>> >>
>> >>
>> >>
>> >> > -Original Message-
>> >> > From: Prathamesh Kulkarni 
>> >> > Sent: 24 June 2021 12:11
>> >> > To: gcc Patches ; Kyrylo Tkachov
>> >> > 
>> >> > Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n 
>> >> > intrinsics
>> >> >
>> >> > Hi,
>> >> > This patch replaces builtins for vdup_n and vmov_n.
>> >> > The patch results in regression for pr51534.c.
>> >> > Consider following function:
>> >> >
>> >> > uint8x8_t f1 (uint8x8_t a) {
>> >> >   return vcgt_u8(a, vdup_n_u8(0));
>> >> > }
>> >> >
>> >> > code-gen before patch:
>> >> > f1:
>> >> > vmov.i32  d16, #0  @ v8qi
>> >> > vcgt.u8 d0, d0, d16
>> >> > bx lr
>> >> >
>> >> > code-gen after patch:
>> >> > f1:
>> >> > vceq.i8 d0, d0, #0
>> >> > vmvnd0, d0
>> >> > bx lr
>> >> >
>> >> > I am not sure which one is better tho ?
>> >>
>> >
>> > Hi Prathamesh,
>> >
>> > This patch introduces a regression on non-hardfp configs (eg 
>> > arm-linux-gnueabi or arm-eabi):
>> > FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c 
>> > scan-assembler-times vmov.i32[ \t]+[dD][0-9]+, #0x 3
>> > FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c 
>> > scan-assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3
>> >
>> > Can you fix this?
>> The issue is, for following test:
>>
>> #include 
>>
>> uint8x8_t f1 (uint8x8_t a) {
>>   return vcge_u8(a, vdup_n_u8(0));
>> }
>>
>> armhf code-gen:
>> f1:
>> vmov.i32  d0, #0x  @ v8qi
>> bxlr
>>
>> arm softfp code-gen:
>> f1:
>> mov r0, #-1
>> mov r1, #-1
>> bx  lr
>>
>> The code-gen for both is same upto split2 pass:
>>
>> (insn 10 6 11 2 (set (reg/i:V8QI 16 s0)
>> (const_vector:V8QI [
>> (const_int -1 [0x]) repeated x8
>> ])) "foo.c":5:1 1052 {*neon_movv8qi}
>>  (expr_list:REG_EQUAL (const_vector:V8QI [
>> (const_int -1 [0x]) repeated x8
>> ])
>> (nil)))
>> (insn 11 10 13 2 (use (reg/i:V8QI 16 s0)) "foo.c":5:1 -1
>>  (nil))
>>
>> and for softfp target, split2 pass splits the assignment to r0 and r1:
>>
>> (insn 15 6 16 2 (set (reg:SI 0 r0)
>> (const_int -1 [0x])) "foo.c":5:1 740 
>> {*thumb2_movsi_vfp}
>>  (nil))
>> (insn 16 15 11 2 (set (reg:SI 1 r1 [+4 ])
>> (const_int -1 [0x])) "foo.c":5:1 740 
>> {*thumb2_movsi_vfp}
>>  (nil))
>> (insn 11 16 13 2 (use (reg/i:V8QI 0 r0)) "foo.c":5:1 -1
>>  (nil))
>>
>> I suppose we could use a dg-scan for r[0-9]+, #-1 for softfp targets ?
>>
> Yes, probably, or try with check-function-bodies.
Hi,
Sorry for the late response. Does the attached patch look OK ?

Thanks,
Prathamesh
>
>  Christophe
>
>> Thanks,
>> Prathamesh
>> >
>> > Thanks
>> >
>> > Christophe
>> >
>> >
>> >>
>> >> I think they're equivalent in practice, in any case the patch itself is 
>> >> good (move away from RTL builtins).
>> >> Ok.
>> >> Thanks,
>> >> Kyrill
>> >>
>> >> >
>> >> > Also, this patch regressed bf16_dup.c on arm-linux-gnueabi,
>> >> > which is due to a missed opt in lowering. I had filed it as
>> >> > PR98435, and posted a fix for it here:
>> >> > https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572648.html
>> >> >
>> >> > Thanks,
>> >> > Prathamesh
diff --git a/gcc/testsuite/gcc.target/arm/pr51534.c 
b/gcc/testsuite/gcc.target/arm/pr51534.c
index ac7f1ea4722..5e121f5fb99 100644
--- a/gcc/testsuite/gcc.target/arm/pr51534.c
+++ b/gcc/testsuite/gcc.target/arm/pr51534.c
@@ -64,8 +64,9 @@ GEN_COND_TESTS(vceq)
 /* { dg-final { scan-assembler-times "vceq\.i8\[   \]+\[qQ\]\[0-9\]+, 
\[qQ\]\[0-9\]+, #0" 4 } } */
 /* { dg-final { scan-assembler-times "vceq\.i16\[  \]+\[qQ\]\[0-9\]+, 
\[qQ\]\[0-9\]+, #0" 4 } } */
 /* { dg-final { scan-assembler-times "vceq\.i32\[  \]+\[qQ\]\[0-9\]+, 
\[qQ\]\[0-9\]+, #0" 4 } } */
-/* { dg-final { scan-assembler-times "vmov\.i32\[  \]+\[dD\]\[0-9\]+, 
#0x" 3 } } */
-/* { dg-final { scan-assembler-times "vmov\.i32\[  \]+\[qQ\]\[0-9\]+, 
#4294967295" 3 } } */
+/* { dg-final { scan-assembler-times "vmov\.i32\[  \]+\[dD\]\[0-9\]+, 
#0x" 3 { target { arm_hard_ok } } } } */
+/* { dg-final { scan-assembler-times "vmov\.i32\[  \]+\[qQ\]\[0-9\]+, 
#4294967295" 3 { target { arm_hard_ok } } } } */
+/* { dg-final { scan-assembler-times "mov\[\]+r\[0-9\]+, #-1" 6 { target { 
arm_softfp_ok } } } } */
 
 /* And ensure we don't have unexpected output too.  */
 /* { dg-final { scan-assembler-not "vc\[gl\]\[te\]\.u\[0-9\]+\[
\]+\[qQdD\]\[0-9\]+, \[qQdD\]\[0-9\]+, #0" } } */


Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics

2021-08-12 Thread Christophe Lyon via Gcc-patches
On Thu, Aug 12, 2021 at 1:54 PM Prathamesh Kulkarni <
prathamesh.kulka...@linaro.org> wrote:

> On Wed, 11 Aug 2021 at 22:23, Christophe Lyon
>  wrote:
> >
> >
> >
> > On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches <
> gcc-patches@gcc.gnu.org> wrote:
> >>
> >>
> >>
> >> > -Original Message-
> >> > From: Prathamesh Kulkarni 
> >> > Sent: 24 June 2021 12:11
> >> > To: gcc Patches ; Kyrylo Tkachov
> >> > 
> >> > Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n
> intrinsics
> >> >
> >> > Hi,
> >> > This patch replaces builtins for vdup_n and vmov_n.
> >> > The patch results in regression for pr51534.c.
> >> > Consider following function:
> >> >
> >> > uint8x8_t f1 (uint8x8_t a) {
> >> >   return vcgt_u8(a, vdup_n_u8(0));
> >> > }
> >> >
> >> > code-gen before patch:
> >> > f1:
> >> > vmov.i32  d16, #0  @ v8qi
> >> > vcgt.u8 d0, d0, d16
> >> > bx lr
> >> >
> >> > code-gen after patch:
> >> > f1:
> >> > vceq.i8 d0, d0, #0
> >> > vmvnd0, d0
> >> > bx lr
> >> >
> >> > I am not sure which one is better tho ?
> >>
> >
> > Hi Prathamesh,
> >
> > This patch introduces a regression on non-hardfp configs (eg
> arm-linux-gnueabi or arm-eabi):
> > FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c
> scan-assembler-times vmov.i32[ \t]+[dD][0-9]+, #0x 3
> > FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c
> scan-assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3
> >
> > Can you fix this?
> The issue is, for following test:
>
> #include 
>
> uint8x8_t f1 (uint8x8_t a) {
>   return vcge_u8(a, vdup_n_u8(0));
> }
>
> armhf code-gen:
> f1:
> vmov.i32  d0, #0x  @ v8qi
> bxlr
>
> arm softfp code-gen:
> f1:
> mov r0, #-1
> mov r1, #-1
> bx  lr
>
> The code-gen for both is same upto split2 pass:
>
> (insn 10 6 11 2 (set (reg/i:V8QI 16 s0)
> (const_vector:V8QI [
> (const_int -1 [0x]) repeated x8
> ])) "foo.c":5:1 1052 {*neon_movv8qi}
>  (expr_list:REG_EQUAL (const_vector:V8QI [
> (const_int -1 [0x]) repeated x8
> ])
> (nil)))
> (insn 11 10 13 2 (use (reg/i:V8QI 16 s0)) "foo.c":5:1 -1
>  (nil))
>
> and for softfp target, split2 pass splits the assignment to r0 and r1:
>
> (insn 15 6 16 2 (set (reg:SI 0 r0)
> (const_int -1 [0x])) "foo.c":5:1 740
> {*thumb2_movsi_vfp}
>  (nil))
> (insn 16 15 11 2 (set (reg:SI 1 r1 [+4 ])
> (const_int -1 [0x])) "foo.c":5:1 740
> {*thumb2_movsi_vfp}
>  (nil))
> (insn 11 16 13 2 (use (reg/i:V8QI 0 r0)) "foo.c":5:1 -1
>  (nil))
>
> I suppose we could use a dg-scan for r[0-9]+, #-1 for softfp targets ?
>
> Yes, probably, or try with check-function-bodies.

 Christophe

Thanks,
> Prathamesh
> >
> > Thanks
> >
> > Christophe
> >
> >
> >>
> >> I think they're equivalent in practice, in any case the patch itself is
> good (move away from RTL builtins).
> >> Ok.
> >> Thanks,
> >> Kyrill
> >>
> >> >
> >> > Also, this patch regressed bf16_dup.c on arm-linux-gnueabi,
> >> > which is due to a missed opt in lowering. I had filed it as
> >> > PR98435, and posted a fix for it here:
> >> > https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572648.html
> >> >
> >> > Thanks,
> >> > Prathamesh
>


Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics

2021-08-12 Thread Prathamesh Kulkarni via Gcc-patches
On Wed, 11 Aug 2021 at 22:23, Christophe Lyon
 wrote:
>
>
>
> On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches 
>  wrote:
>>
>>
>>
>> > -Original Message-
>> > From: Prathamesh Kulkarni 
>> > Sent: 24 June 2021 12:11
>> > To: gcc Patches ; Kyrylo Tkachov
>> > 
>> > Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics
>> >
>> > Hi,
>> > This patch replaces builtins for vdup_n and vmov_n.
>> > The patch results in regression for pr51534.c.
>> > Consider following function:
>> >
>> > uint8x8_t f1 (uint8x8_t a) {
>> >   return vcgt_u8(a, vdup_n_u8(0));
>> > }
>> >
>> > code-gen before patch:
>> > f1:
>> > vmov.i32  d16, #0  @ v8qi
>> > vcgt.u8 d0, d0, d16
>> > bx lr
>> >
>> > code-gen after patch:
>> > f1:
>> > vceq.i8 d0, d0, #0
>> > vmvnd0, d0
>> > bx lr
>> >
>> > I am not sure which one is better tho ?
>>
>
> Hi Prathamesh,
>
> This patch introduces a regression on non-hardfp configs (eg 
> arm-linux-gnueabi or arm-eabi):
> FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c 
> scan-assembler-times vmov.i32[ \t]+[dD][0-9]+, #0x 3
> FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c 
> scan-assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3
>
> Can you fix this?
The issue is, for following test:

#include 

uint8x8_t f1 (uint8x8_t a) {
  return vcge_u8(a, vdup_n_u8(0));
}

armhf code-gen:
f1:
vmov.i32  d0, #0x  @ v8qi
bxlr

arm softfp code-gen:
f1:
mov r0, #-1
mov r1, #-1
bx  lr

The code-gen for both is same upto split2 pass:

(insn 10 6 11 2 (set (reg/i:V8QI 16 s0)
(const_vector:V8QI [
(const_int -1 [0x]) repeated x8
])) "foo.c":5:1 1052 {*neon_movv8qi}
 (expr_list:REG_EQUAL (const_vector:V8QI [
(const_int -1 [0x]) repeated x8
])
(nil)))
(insn 11 10 13 2 (use (reg/i:V8QI 16 s0)) "foo.c":5:1 -1
 (nil))

and for softfp target, split2 pass splits the assignment to r0 and r1:

(insn 15 6 16 2 (set (reg:SI 0 r0)
(const_int -1 [0x])) "foo.c":5:1 740 {*thumb2_movsi_vfp}
 (nil))
(insn 16 15 11 2 (set (reg:SI 1 r1 [+4 ])
(const_int -1 [0x])) "foo.c":5:1 740 {*thumb2_movsi_vfp}
 (nil))
(insn 11 16 13 2 (use (reg/i:V8QI 0 r0)) "foo.c":5:1 -1
 (nil))

I suppose we could use a dg-scan for r[0-9]+, #-1 for softfp targets ?

Thanks,
Prathamesh
>
> Thanks
>
> Christophe
>
>
>>
>> I think they're equivalent in practice, in any case the patch itself is good 
>> (move away from RTL builtins).
>> Ok.
>> Thanks,
>> Kyrill
>>
>> >
>> > Also, this patch regressed bf16_dup.c on arm-linux-gnueabi,
>> > which is due to a missed opt in lowering. I had filed it as
>> > PR98435, and posted a fix for it here:
>> > https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572648.html
>> >
>> > Thanks,
>> > Prathamesh


Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics

2021-08-11 Thread Christophe Lyon via Gcc-patches
On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches <
gcc-patches@gcc.gnu.org> wrote:

>
>
> > -Original Message-
> > From: Prathamesh Kulkarni 
> > Sent: 24 June 2021 12:11
> > To: gcc Patches ; Kyrylo Tkachov
> > 
> > Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics
> >
> > Hi,
> > This patch replaces builtins for vdup_n and vmov_n.
> > The patch results in regression for pr51534.c.
> > Consider following function:
> >
> > uint8x8_t f1 (uint8x8_t a) {
> >   return vcgt_u8(a, vdup_n_u8(0));
> > }
> >
> > code-gen before patch:
> > f1:
> > vmov.i32  d16, #0  @ v8qi
> > vcgt.u8 d0, d0, d16
> > bx lr
> >
> > code-gen after patch:
> > f1:
> > vceq.i8 d0, d0, #0
> > vmvnd0, d0
> > bx lr
> >
> > I am not sure which one is better tho ?
>
>
Hi Prathamesh,

This patch introduces a regression on non-hardfp configs (eg
arm-linux-gnueabi or arm-eabi):
FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c
scan-assembler-times vmov.i32[ \t]+[dD][0-9]+, #0x 3
FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c
scan-assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3

Can you fix this?

Thanks

Christophe



> I think they're equivalent in practice, in any case the patch itself is
> good (move away from RTL builtins).
> Ok.
> Thanks,
> Kyrill
>
> >
> > Also, this patch regressed bf16_dup.c on arm-linux-gnueabi,
> > which is due to a missed opt in lowering. I had filed it as
> > PR98435, and posted a fix for it here:
> > https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572648.html
> >
> > Thanks,
> > Prathamesh
>


RE: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics

2021-06-24 Thread Kyrylo Tkachov via Gcc-patches


> -Original Message-
> From: Prathamesh Kulkarni 
> Sent: 24 June 2021 12:11
> To: gcc Patches ; Kyrylo Tkachov
> 
> Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics
> 
> Hi,
> This patch replaces builtins for vdup_n and vmov_n.
> The patch results in regression for pr51534.c.
> Consider following function:
> 
> uint8x8_t f1 (uint8x8_t a) {
>   return vcgt_u8(a, vdup_n_u8(0));
> }
> 
> code-gen before patch:
> f1:
> vmov.i32  d16, #0  @ v8qi
> vcgt.u8 d0, d0, d16
> bx lr
> 
> code-gen after patch:
> f1:
> vceq.i8 d0, d0, #0
> vmvnd0, d0
> bx lr
> 
> I am not sure which one is better tho ?

I think they're equivalent in practice, in any case the patch itself is good 
(move away from RTL builtins).
Ok.
Thanks,
Kyrill

> 
> Also, this patch regressed bf16_dup.c on arm-linux-gnueabi,
> which is due to a missed opt in lowering. I had filed it as
> PR98435, and posted a fix for it here:
> https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572648.html
> 
> Thanks,
> Prathamesh