Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics
On 03/09/2021 10:35, Prathamesh Kulkarni via Gcc-patches wrote: On Thu, 2 Sept 2021 at 14:32, Christophe Lyon wrote: On Tue, Aug 24, 2021 at 10:17 AM Kyrylo Tkachov wrote: -Original Message- From: Prathamesh Kulkarni Sent: 24 August 2021 09:01 To: Christophe Lyon Cc: Kyrylo Tkachov ; gcc Patches Subject: Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics On Tue, 17 Aug 2021 at 11:55, Prathamesh Kulkarni wrote: On Thu, 12 Aug 2021 at 19:04, Christophe Lyon wrote: On Thu, Aug 12, 2021 at 1:54 PM Prathamesh Kulkarni wrote: On Wed, 11 Aug 2021 at 22:23, Christophe Lyon wrote: On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches patc...@gcc.gnu.org> wrote: -Original Message- From: Prathamesh Kulkarni Sent: 24 June 2021 12:11 To: gcc Patches ; Kyrylo Tkachov Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics Hi, This patch replaces builtins for vdup_n and vmov_n. The patch results in regression for pr51534.c. Consider following function: uint8x8_t f1 (uint8x8_t a) { return vcgt_u8(a, vdup_n_u8(0)); } code-gen before patch: f1: vmov.i32 d16, #0 @ v8qi vcgt.u8 d0, d0, d16 bx lr code-gen after patch: f1: vceq.i8 d0, d0, #0 vmvnd0, d0 bx lr I am not sure which one is better tho ? Hi Prathamesh, This patch introduces a regression on non-hardfp configs (eg arm- linux-gnueabi or arm-eabi): FAIL: gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan- assembler-times vmov.i32[ \t]+[dD][0-9]+, #0x 3 FAIL: gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan- assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3 Can you fix this? The issue is, for following test: #include uint8x8_t f1 (uint8x8_t a) { return vcge_u8(a, vdup_n_u8(0)); } armhf code-gen: f1: vmov.i32 d0, #0x @ v8qi bxlr arm softfp code-gen: f1: mov r0, #-1 mov r1, #-1 bx lr The code-gen for both is same upto split2 pass: (insn 10 6 11 2 (set (reg/i:V8QI 16 s0) (const_vector:V8QI [ (const_int -1 [0x]) repeated x8 ])) "foo.c":5:1 1052 {*neon_movv8qi} (expr_list:REG_EQUAL (const_vector:V8QI [ (const_int -1 [0x]) repeated x8 ]) (nil))) (insn 11 10 13 2 (use (reg/i:V8QI 16 s0)) "foo.c":5:1 -1 (nil)) and for softfp target, split2 pass splits the assignment to r0 and r1: (insn 15 6 16 2 (set (reg:SI 0 r0) (const_int -1 [0x])) "foo.c":5:1 740 {*thumb2_movsi_vfp} (nil)) (insn 16 15 11 2 (set (reg:SI 1 r1 [+4 ]) (const_int -1 [0x])) "foo.c":5:1 740 {*thumb2_movsi_vfp} (nil)) (insn 11 16 13 2 (use (reg/i:V8QI 0 r0)) "foo.c":5:1 -1 (nil)) I suppose we could use a dg-scan for r[0-9]+, #-1 for softfp targets ? Yes, probably, or try with check-function-bodies. Hi, Sorry for the late response. Does the attached patch look OK ? ping https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577532.html Ok. Sorry Prathamesh, this does not quite work. See https://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/r12-3294-g7a6f40d0452ec76e126c2612dcfa32f3c73e2315/report-build-info.html (red cells in the gcc column) Can you have a look? Ah, for arm-none-eabi target the code-gen for test_vcge_u8_uint8x8_t is: mvn r0, #0 mvn r1, #0 bx lr instead of: mov r0, #-1 mov r1, #-1 bx lr I guess both code-gen sequences are correct, but I am not sure under which circumstance either sequence gets generated. To accomodate for both, I tried to check for either pattern with OR: /* { dg-final { scan-assembler-times "(mov\[\]+r\[0-9\]+, #-1)|(mvn\[\]+r\[0-9\]+, #0)" 6 { target { arm_softfp_ok } } } } */ but that doesn't seem to work. Do you have any suggestions on how to proceed ? Sorry I have no magic syntax for this :-) Maybe it's also worth checking why we have this codegen difference? Thanks, Christophe Thanks, Prathamesh Thanks Christophe Thanks, Kyrill Thanks, Prathamesh Thanks, Prathamesh Christophe Thanks, Prathamesh Thanks Christophe I think they're equivalent in practice, in any case the patch itself is good (move away from RTL builtins). Ok. Thanks, Kyrill Also, this patch regressed bf16_dup.c on arm-linux-gnueabi, which is due to a missed opt in lowering. I had filed it as PR98435, and posted a fix for it here: https://gcc.gnu.org/pipermail/gcc-patches/2021- June/572648.html Thanks, Prathamesh
Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics
On Thu, 2 Sept 2021 at 14:32, Christophe Lyon wrote: > > > > On Tue, Aug 24, 2021 at 10:17 AM Kyrylo Tkachov > wrote: >> >> >> >> > -Original Message- >> > From: Prathamesh Kulkarni >> > Sent: 24 August 2021 09:01 >> > To: Christophe Lyon >> > Cc: Kyrylo Tkachov ; gcc Patches > > patc...@gcc.gnu.org> >> > Subject: Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n >> > intrinsics >> > >> > On Tue, 17 Aug 2021 at 11:55, Prathamesh Kulkarni >> > wrote: >> > > >> > > On Thu, 12 Aug 2021 at 19:04, Christophe Lyon >> > > wrote: >> > > > >> > > > >> > > > >> > > > On Thu, Aug 12, 2021 at 1:54 PM Prathamesh Kulkarni >> > wrote: >> > > >> >> > > >> On Wed, 11 Aug 2021 at 22:23, Christophe Lyon >> > > >> wrote: >> > > >> > >> > > >> > >> > > >> > >> > > >> > On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches > > patc...@gcc.gnu.org> wrote: >> > > >> >> >> > > >> >> >> > > >> >> >> > > >> >> > -Original Message- >> > > >> >> > From: Prathamesh Kulkarni >> > > >> >> > Sent: 24 June 2021 12:11 >> > > >> >> > To: gcc Patches ; Kyrylo Tkachov >> > > >> >> > >> > > >> >> > Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n >> > intrinsics >> > > >> >> > >> > > >> >> > Hi, >> > > >> >> > This patch replaces builtins for vdup_n and vmov_n. >> > > >> >> > The patch results in regression for pr51534.c. >> > > >> >> > Consider following function: >> > > >> >> > >> > > >> >> > uint8x8_t f1 (uint8x8_t a) { >> > > >> >> > return vcgt_u8(a, vdup_n_u8(0)); >> > > >> >> > } >> > > >> >> > >> > > >> >> > code-gen before patch: >> > > >> >> > f1: >> > > >> >> > vmov.i32 d16, #0 @ v8qi >> > > >> >> > vcgt.u8 d0, d0, d16 >> > > >> >> > bx lr >> > > >> >> > >> > > >> >> > code-gen after patch: >> > > >> >> > f1: >> > > >> >> > vceq.i8 d0, d0, #0 >> > > >> >> > vmvnd0, d0 >> > > >> >> > bx lr >> > > >> >> > >> > > >> >> > I am not sure which one is better tho ? >> > > >> >> >> > > >> > >> > > >> > Hi Prathamesh, >> > > >> > >> > > >> > This patch introduces a regression on non-hardfp configs (eg arm- >> > linux-gnueabi or arm-eabi): >> > > >> > FAIL: gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan- >> > assembler-times vmov.i32[ \t]+[dD][0-9]+, #0x 3 >> > > >> > FAIL: gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan- >> > assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3 >> > > >> > >> > > >> > Can you fix this? >> > > >> The issue is, for following test: >> > > >> >> > > >> #include >> > > >> >> > > >> uint8x8_t f1 (uint8x8_t a) { >> > > >> return vcge_u8(a, vdup_n_u8(0)); >> > > >> } >> > > >> >> > > >> armhf code-gen: >> > > >> f1: >> > > >> vmov.i32 d0, #0x @ v8qi >> > > >> bxlr >> > > >> >> > > >> arm softfp code-gen: >> > > >> f1: >> > > >> mov r0, #-1 >> > > >> mov r1, #-1 >> > > >> bx lr >> > > >> >> > > >> The code-gen for both is same upto split2 pass: >> > > >> >> > > >> (insn 10 6 11 2 (set (reg/i:V8QI 16 s0) >> > > >> (const
Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics
On Tue, Aug 24, 2021 at 10:17 AM Kyrylo Tkachov wrote: > > > > -Original Message- > > From: Prathamesh Kulkarni > > Sent: 24 August 2021 09:01 > > To: Christophe Lyon > > Cc: Kyrylo Tkachov ; gcc Patches > patc...@gcc.gnu.org> > > Subject: Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n > > intrinsics > > > > On Tue, 17 Aug 2021 at 11:55, Prathamesh Kulkarni > > wrote: > > > > > > On Thu, 12 Aug 2021 at 19:04, Christophe Lyon > > > wrote: > > > > > > > > > > > > > > > > On Thu, Aug 12, 2021 at 1:54 PM Prathamesh Kulkarni > > wrote: > > > >> > > > >> On Wed, 11 Aug 2021 at 22:23, Christophe Lyon > > > >> wrote: > > > >> > > > > >> > > > > >> > > > > >> > On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches > > patc...@gcc.gnu.org> wrote: > > > >> >> > > > >> >> > > > >> >> > > > >> >> > -Original Message- > > > >> >> > From: Prathamesh Kulkarni > > > >> >> > Sent: 24 June 2021 12:11 > > > >> >> > To: gcc Patches ; Kyrylo Tkachov > > > >> >> > > > > >> >> > Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n > > intrinsics > > > >> >> > > > > >> >> > Hi, > > > >> >> > This patch replaces builtins for vdup_n and vmov_n. > > > >> >> > The patch results in regression for pr51534.c. > > > >> >> > Consider following function: > > > >> >> > > > > >> >> > uint8x8_t f1 (uint8x8_t a) { > > > >> >> > return vcgt_u8(a, vdup_n_u8(0)); > > > >> >> > } > > > >> >> > > > > >> >> > code-gen before patch: > > > >> >> > f1: > > > >> >> > vmov.i32 d16, #0 @ v8qi > > > >> >> > vcgt.u8 d0, d0, d16 > > > >> >> > bx lr > > > >> >> > > > > >> >> > code-gen after patch: > > > >> >> > f1: > > > >> >> > vceq.i8 d0, d0, #0 > > > >> >> > vmvnd0, d0 > > > >> >> > bx lr > > > >> >> > > > > >> >> > I am not sure which one is better tho ? > > > >> >> > > > >> > > > > >> > Hi Prathamesh, > > > >> > > > > >> > This patch introduces a regression on non-hardfp configs (eg arm- > > linux-gnueabi or arm-eabi): > > > >> > FAIL: gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan- > > assembler-times vmov.i32[ \t]+[dD][0-9]+, #0x 3 > > > >> > FAIL: gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan- > > assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3 > > > >> > > > > >> > Can you fix this? > > > >> The issue is, for following test: > > > >> > > > >> #include > > > >> > > > >> uint8x8_t f1 (uint8x8_t a) { > > > >> return vcge_u8(a, vdup_n_u8(0)); > > > >> } > > > >> > > > >> armhf code-gen: > > > >> f1: > > > >> vmov.i32 d0, #0x @ v8qi > > > >> bxlr > > > >> > > > >> arm softfp code-gen: > > > >> f1: > > > >> mov r0, #-1 > > > >> mov r1, #-1 > > > >> bx lr > > > >> > > > >> The code-gen for both is same upto split2 pass: > > > >> > > > >> (insn 10 6 11 2 (set (reg/i:V8QI 16 s0) > > > >> (const_vector:V8QI [ > > > >> (const_int -1 [0x]) repeated x8 > > > >> ])) "foo.c":5:1 1052 {*neon_movv8qi} > > > >> (expr_list:REG_EQUAL (const_vector:V8QI [ > > > >> (const_int -1 [0x]) repeated x8 > > > >> ]) > > > >> (nil))) > > > >> (insn 11 10 13 2 (use (reg
RE: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics
> -Original Message- > From: Prathamesh Kulkarni > Sent: 24 August 2021 09:01 > To: Christophe Lyon > Cc: Kyrylo Tkachov ; gcc Patches patc...@gcc.gnu.org> > Subject: Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n > intrinsics > > On Tue, 17 Aug 2021 at 11:55, Prathamesh Kulkarni > wrote: > > > > On Thu, 12 Aug 2021 at 19:04, Christophe Lyon > > wrote: > > > > > > > > > > > > On Thu, Aug 12, 2021 at 1:54 PM Prathamesh Kulkarni > wrote: > > >> > > >> On Wed, 11 Aug 2021 at 22:23, Christophe Lyon > > >> wrote: > > >> > > > >> > > > >> > > > >> > On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches patc...@gcc.gnu.org> wrote: > > >> >> > > >> >> > > >> >> > > >> >> > -Original Message- > > >> >> > From: Prathamesh Kulkarni > > >> >> > Sent: 24 June 2021 12:11 > > >> >> > To: gcc Patches ; Kyrylo Tkachov > > >> >> > > > >> >> > Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n > intrinsics > > >> >> > > > >> >> > Hi, > > >> >> > This patch replaces builtins for vdup_n and vmov_n. > > >> >> > The patch results in regression for pr51534.c. > > >> >> > Consider following function: > > >> >> > > > >> >> > uint8x8_t f1 (uint8x8_t a) { > > >> >> > return vcgt_u8(a, vdup_n_u8(0)); > > >> >> > } > > >> >> > > > >> >> > code-gen before patch: > > >> >> > f1: > > >> >> > vmov.i32 d16, #0 @ v8qi > > >> >> > vcgt.u8 d0, d0, d16 > > >> >> > bx lr > > >> >> > > > >> >> > code-gen after patch: > > >> >> > f1: > > >> >> > vceq.i8 d0, d0, #0 > > >> >> > vmvnd0, d0 > > >> >> > bx lr > > >> >> > > > >> >> > I am not sure which one is better tho ? > > >> >> > > >> > > > >> > Hi Prathamesh, > > >> > > > >> > This patch introduces a regression on non-hardfp configs (eg arm- > linux-gnueabi or arm-eabi): > > >> > FAIL: gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan- > assembler-times vmov.i32[ \t]+[dD][0-9]+, #0x 3 > > >> > FAIL: gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan- > assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3 > > >> > > > >> > Can you fix this? > > >> The issue is, for following test: > > >> > > >> #include > > >> > > >> uint8x8_t f1 (uint8x8_t a) { > > >> return vcge_u8(a, vdup_n_u8(0)); > > >> } > > >> > > >> armhf code-gen: > > >> f1: > > >> vmov.i32 d0, #0x @ v8qi > > >> bxlr > > >> > > >> arm softfp code-gen: > > >> f1: > > >> mov r0, #-1 > > >> mov r1, #-1 > > >> bx lr > > >> > > >> The code-gen for both is same upto split2 pass: > > >> > > >> (insn 10 6 11 2 (set (reg/i:V8QI 16 s0) > > >> (const_vector:V8QI [ > > >> (const_int -1 [0x]) repeated x8 > > >> ])) "foo.c":5:1 1052 {*neon_movv8qi} > > >> (expr_list:REG_EQUAL (const_vector:V8QI [ > > >> (const_int -1 [0x]) repeated x8 > > >> ]) > > >> (nil))) > > >> (insn 11 10 13 2 (use (reg/i:V8QI 16 s0)) "foo.c":5:1 -1 > > >> (nil)) > > >> > > >> and for softfp target, split2 pass splits the assignment to r0 and r1: > > >> > > >> (insn 15 6 16 2 (set (reg:SI 0 r0) > > >> (const_int -1 [0x])) "foo.c":5:1 740 > {*thumb2_movsi_vfp} > > >> (nil)) > > >> (insn 16 15 11 2 (set (reg:SI 1 r1 [+4 ]) > > >> (const_int -1 [0x])) "foo.c":5:1 740 > {*thumb2_movsi_vfp} > > >> (nil)) > > >> (insn 11 16 13 2 (use (reg/i:V8QI 0 r0)) "foo.c":5:1 -1 > > >> (nil)) > > >> > > >> I suppose we could use a dg-scan for r[0-9]+, #-1 for softfp targets ? > > >> > > > Yes, probably, or try with check-function-bodies. > > Hi, > > Sorry for the late response. Does the attached patch look OK ? > ping https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577532.html Ok. Thanks, Kyrill > > Thanks, > Prathamesh > > > > Thanks, > > Prathamesh > > > > > > Christophe > > > > > >> Thanks, > > >> Prathamesh > > >> > > > >> > Thanks > > >> > > > >> > Christophe > > >> > > > >> > > > >> >> > > >> >> I think they're equivalent in practice, in any case the patch itself > > >> >> is > good (move away from RTL builtins). > > >> >> Ok. > > >> >> Thanks, > > >> >> Kyrill > > >> >> > > >> >> > > > >> >> > Also, this patch regressed bf16_dup.c on arm-linux-gnueabi, > > >> >> > which is due to a missed opt in lowering. I had filed it as > > >> >> > PR98435, and posted a fix for it here: > > >> >> > https://gcc.gnu.org/pipermail/gcc-patches/2021- > June/572648.html > > >> >> > > > >> >> > Thanks, > > >> >> > Prathamesh
Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics
On Tue, 17 Aug 2021 at 11:55, Prathamesh Kulkarni wrote: > > On Thu, 12 Aug 2021 at 19:04, Christophe Lyon > wrote: > > > > > > > > On Thu, Aug 12, 2021 at 1:54 PM Prathamesh Kulkarni > > wrote: > >> > >> On Wed, 11 Aug 2021 at 22:23, Christophe Lyon > >> wrote: > >> > > >> > > >> > > >> > On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches > >> > wrote: > >> >> > >> >> > >> >> > >> >> > -Original Message- > >> >> > From: Prathamesh Kulkarni > >> >> > Sent: 24 June 2021 12:11 > >> >> > To: gcc Patches ; Kyrylo Tkachov > >> >> > > >> >> > Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n > >> >> > intrinsics > >> >> > > >> >> > Hi, > >> >> > This patch replaces builtins for vdup_n and vmov_n. > >> >> > The patch results in regression for pr51534.c. > >> >> > Consider following function: > >> >> > > >> >> > uint8x8_t f1 (uint8x8_t a) { > >> >> > return vcgt_u8(a, vdup_n_u8(0)); > >> >> > } > >> >> > > >> >> > code-gen before patch: > >> >> > f1: > >> >> > vmov.i32 d16, #0 @ v8qi > >> >> > vcgt.u8 d0, d0, d16 > >> >> > bx lr > >> >> > > >> >> > code-gen after patch: > >> >> > f1: > >> >> > vceq.i8 d0, d0, #0 > >> >> > vmvnd0, d0 > >> >> > bx lr > >> >> > > >> >> > I am not sure which one is better tho ? > >> >> > >> > > >> > Hi Prathamesh, > >> > > >> > This patch introduces a regression on non-hardfp configs (eg > >> > arm-linux-gnueabi or arm-eabi): > >> > FAIL: gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c > >> > scan-assembler-times vmov.i32[ \t]+[dD][0-9]+, #0x 3 > >> > FAIL: gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c > >> > scan-assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3 > >> > > >> > Can you fix this? > >> The issue is, for following test: > >> > >> #include > >> > >> uint8x8_t f1 (uint8x8_t a) { > >> return vcge_u8(a, vdup_n_u8(0)); > >> } > >> > >> armhf code-gen: > >> f1: > >> vmov.i32 d0, #0x @ v8qi > >> bxlr > >> > >> arm softfp code-gen: > >> f1: > >> mov r0, #-1 > >> mov r1, #-1 > >> bx lr > >> > >> The code-gen for both is same upto split2 pass: > >> > >> (insn 10 6 11 2 (set (reg/i:V8QI 16 s0) > >> (const_vector:V8QI [ > >> (const_int -1 [0x]) repeated x8 > >> ])) "foo.c":5:1 1052 {*neon_movv8qi} > >> (expr_list:REG_EQUAL (const_vector:V8QI [ > >> (const_int -1 [0x]) repeated x8 > >> ]) > >> (nil))) > >> (insn 11 10 13 2 (use (reg/i:V8QI 16 s0)) "foo.c":5:1 -1 > >> (nil)) > >> > >> and for softfp target, split2 pass splits the assignment to r0 and r1: > >> > >> (insn 15 6 16 2 (set (reg:SI 0 r0) > >> (const_int -1 [0x])) "foo.c":5:1 740 > >> {*thumb2_movsi_vfp} > >> (nil)) > >> (insn 16 15 11 2 (set (reg:SI 1 r1 [+4 ]) > >> (const_int -1 [0x])) "foo.c":5:1 740 > >> {*thumb2_movsi_vfp} > >> (nil)) > >> (insn 11 16 13 2 (use (reg/i:V8QI 0 r0)) "foo.c":5:1 -1 > >> (nil)) > >> > >> I suppose we could use a dg-scan for r[0-9]+, #-1 for softfp targets ? > >> > > Yes, probably, or try with check-function-bodies. > Hi, > Sorry for the late response. Does the attached patch look OK ? ping https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577532.html Thanks, Prathamesh > > Thanks, > Prathamesh > > > > Christophe > > > >> Thanks, > >> Prathamesh > >> > > >> > Thanks > >> > > >> > Christophe > >> > > >> > > >> >> > >> >> I think they're equivalent in practice, in any case the patch itself is > >> >> good (move away from RTL builtins). > >> >> Ok. > >> >> Thanks, > >> >> Kyrill > >> >> > >> >> > > >> >> > Also, this patch regressed bf16_dup.c on arm-linux-gnueabi, > >> >> > which is due to a missed opt in lowering. I had filed it as > >> >> > PR98435, and posted a fix for it here: > >> >> > https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572648.html > >> >> > > >> >> > Thanks, > >> >> > Prathamesh
Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics
On Thu, 12 Aug 2021 at 19:04, Christophe Lyon wrote: > > > > On Thu, Aug 12, 2021 at 1:54 PM Prathamesh Kulkarni > wrote: >> >> On Wed, 11 Aug 2021 at 22:23, Christophe Lyon >> wrote: >> > >> > >> > >> > On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches >> > wrote: >> >> >> >> >> >> >> >> > -Original Message- >> >> > From: Prathamesh Kulkarni >> >> > Sent: 24 June 2021 12:11 >> >> > To: gcc Patches ; Kyrylo Tkachov >> >> > >> >> > Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n >> >> > intrinsics >> >> > >> >> > Hi, >> >> > This patch replaces builtins for vdup_n and vmov_n. >> >> > The patch results in regression for pr51534.c. >> >> > Consider following function: >> >> > >> >> > uint8x8_t f1 (uint8x8_t a) { >> >> > return vcgt_u8(a, vdup_n_u8(0)); >> >> > } >> >> > >> >> > code-gen before patch: >> >> > f1: >> >> > vmov.i32 d16, #0 @ v8qi >> >> > vcgt.u8 d0, d0, d16 >> >> > bx lr >> >> > >> >> > code-gen after patch: >> >> > f1: >> >> > vceq.i8 d0, d0, #0 >> >> > vmvnd0, d0 >> >> > bx lr >> >> > >> >> > I am not sure which one is better tho ? >> >> >> > >> > Hi Prathamesh, >> > >> > This patch introduces a regression on non-hardfp configs (eg >> > arm-linux-gnueabi or arm-eabi): >> > FAIL: gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c >> > scan-assembler-times vmov.i32[ \t]+[dD][0-9]+, #0x 3 >> > FAIL: gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c >> > scan-assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3 >> > >> > Can you fix this? >> The issue is, for following test: >> >> #include >> >> uint8x8_t f1 (uint8x8_t a) { >> return vcge_u8(a, vdup_n_u8(0)); >> } >> >> armhf code-gen: >> f1: >> vmov.i32 d0, #0x @ v8qi >> bxlr >> >> arm softfp code-gen: >> f1: >> mov r0, #-1 >> mov r1, #-1 >> bx lr >> >> The code-gen for both is same upto split2 pass: >> >> (insn 10 6 11 2 (set (reg/i:V8QI 16 s0) >> (const_vector:V8QI [ >> (const_int -1 [0x]) repeated x8 >> ])) "foo.c":5:1 1052 {*neon_movv8qi} >> (expr_list:REG_EQUAL (const_vector:V8QI [ >> (const_int -1 [0x]) repeated x8 >> ]) >> (nil))) >> (insn 11 10 13 2 (use (reg/i:V8QI 16 s0)) "foo.c":5:1 -1 >> (nil)) >> >> and for softfp target, split2 pass splits the assignment to r0 and r1: >> >> (insn 15 6 16 2 (set (reg:SI 0 r0) >> (const_int -1 [0x])) "foo.c":5:1 740 >> {*thumb2_movsi_vfp} >> (nil)) >> (insn 16 15 11 2 (set (reg:SI 1 r1 [+4 ]) >> (const_int -1 [0x])) "foo.c":5:1 740 >> {*thumb2_movsi_vfp} >> (nil)) >> (insn 11 16 13 2 (use (reg/i:V8QI 0 r0)) "foo.c":5:1 -1 >> (nil)) >> >> I suppose we could use a dg-scan for r[0-9]+, #-1 for softfp targets ? >> > Yes, probably, or try with check-function-bodies. Hi, Sorry for the late response. Does the attached patch look OK ? Thanks, Prathamesh > > Christophe > >> Thanks, >> Prathamesh >> > >> > Thanks >> > >> > Christophe >> > >> > >> >> >> >> I think they're equivalent in practice, in any case the patch itself is >> >> good (move away from RTL builtins). >> >> Ok. >> >> Thanks, >> >> Kyrill >> >> >> >> > >> >> > Also, this patch regressed bf16_dup.c on arm-linux-gnueabi, >> >> > which is due to a missed opt in lowering. I had filed it as >> >> > PR98435, and posted a fix for it here: >> >> > https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572648.html >> >> > >> >> > Thanks, >> >> > Prathamesh diff --git a/gcc/testsuite/gcc.target/arm/pr51534.c b/gcc/testsuite/gcc.target/arm/pr51534.c index ac7f1ea4722..5e121f5fb99 100644 --- a/gcc/testsuite/gcc.target/arm/pr51534.c +++ b/gcc/testsuite/gcc.target/arm/pr51534.c @@ -64,8 +64,9 @@ GEN_COND_TESTS(vceq) /* { dg-final { scan-assembler-times "vceq\.i8\[ \]+\[qQ\]\[0-9\]+, \[qQ\]\[0-9\]+, #0" 4 } } */ /* { dg-final { scan-assembler-times "vceq\.i16\[ \]+\[qQ\]\[0-9\]+, \[qQ\]\[0-9\]+, #0" 4 } } */ /* { dg-final { scan-assembler-times "vceq\.i32\[ \]+\[qQ\]\[0-9\]+, \[qQ\]\[0-9\]+, #0" 4 } } */ -/* { dg-final { scan-assembler-times "vmov\.i32\[ \]+\[dD\]\[0-9\]+, #0x" 3 } } */ -/* { dg-final { scan-assembler-times "vmov\.i32\[ \]+\[qQ\]\[0-9\]+, #4294967295" 3 } } */ +/* { dg-final { scan-assembler-times "vmov\.i32\[ \]+\[dD\]\[0-9\]+, #0x" 3 { target { arm_hard_ok } } } } */ +/* { dg-final { scan-assembler-times "vmov\.i32\[ \]+\[qQ\]\[0-9\]+, #4294967295" 3 { target { arm_hard_ok } } } } */ +/* { dg-final { scan-assembler-times "mov\[\]+r\[0-9\]+, #-1" 6 { target { arm_softfp_ok } } } } */ /* And ensure we don't have unexpected output too. */ /* { dg-final { scan-assembler-not "vc\[gl\]\[te\]\.u\[0-9\]+\[ \]+\[qQdD\]\[0-9\]+, \[qQdD\]\[0-9\]+, #0" } } */
Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics
On Thu, Aug 12, 2021 at 1:54 PM Prathamesh Kulkarni < prathamesh.kulka...@linaro.org> wrote: > On Wed, 11 Aug 2021 at 22:23, Christophe Lyon > wrote: > > > > > > > > On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches < > gcc-patches@gcc.gnu.org> wrote: > >> > >> > >> > >> > -Original Message- > >> > From: Prathamesh Kulkarni > >> > Sent: 24 June 2021 12:11 > >> > To: gcc Patches ; Kyrylo Tkachov > >> > > >> > Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n > intrinsics > >> > > >> > Hi, > >> > This patch replaces builtins for vdup_n and vmov_n. > >> > The patch results in regression for pr51534.c. > >> > Consider following function: > >> > > >> > uint8x8_t f1 (uint8x8_t a) { > >> > return vcgt_u8(a, vdup_n_u8(0)); > >> > } > >> > > >> > code-gen before patch: > >> > f1: > >> > vmov.i32 d16, #0 @ v8qi > >> > vcgt.u8 d0, d0, d16 > >> > bx lr > >> > > >> > code-gen after patch: > >> > f1: > >> > vceq.i8 d0, d0, #0 > >> > vmvnd0, d0 > >> > bx lr > >> > > >> > I am not sure which one is better tho ? > >> > > > > Hi Prathamesh, > > > > This patch introduces a regression on non-hardfp configs (eg > arm-linux-gnueabi or arm-eabi): > > FAIL: gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c > scan-assembler-times vmov.i32[ \t]+[dD][0-9]+, #0x 3 > > FAIL: gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c > scan-assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3 > > > > Can you fix this? > The issue is, for following test: > > #include > > uint8x8_t f1 (uint8x8_t a) { > return vcge_u8(a, vdup_n_u8(0)); > } > > armhf code-gen: > f1: > vmov.i32 d0, #0x @ v8qi > bxlr > > arm softfp code-gen: > f1: > mov r0, #-1 > mov r1, #-1 > bx lr > > The code-gen for both is same upto split2 pass: > > (insn 10 6 11 2 (set (reg/i:V8QI 16 s0) > (const_vector:V8QI [ > (const_int -1 [0x]) repeated x8 > ])) "foo.c":5:1 1052 {*neon_movv8qi} > (expr_list:REG_EQUAL (const_vector:V8QI [ > (const_int -1 [0x]) repeated x8 > ]) > (nil))) > (insn 11 10 13 2 (use (reg/i:V8QI 16 s0)) "foo.c":5:1 -1 > (nil)) > > and for softfp target, split2 pass splits the assignment to r0 and r1: > > (insn 15 6 16 2 (set (reg:SI 0 r0) > (const_int -1 [0x])) "foo.c":5:1 740 > {*thumb2_movsi_vfp} > (nil)) > (insn 16 15 11 2 (set (reg:SI 1 r1 [+4 ]) > (const_int -1 [0x])) "foo.c":5:1 740 > {*thumb2_movsi_vfp} > (nil)) > (insn 11 16 13 2 (use (reg/i:V8QI 0 r0)) "foo.c":5:1 -1 > (nil)) > > I suppose we could use a dg-scan for r[0-9]+, #-1 for softfp targets ? > > Yes, probably, or try with check-function-bodies. Christophe Thanks, > Prathamesh > > > > Thanks > > > > Christophe > > > > > >> > >> I think they're equivalent in practice, in any case the patch itself is > good (move away from RTL builtins). > >> Ok. > >> Thanks, > >> Kyrill > >> > >> > > >> > Also, this patch regressed bf16_dup.c on arm-linux-gnueabi, > >> > which is due to a missed opt in lowering. I had filed it as > >> > PR98435, and posted a fix for it here: > >> > https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572648.html > >> > > >> > Thanks, > >> > Prathamesh >
Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics
On Wed, 11 Aug 2021 at 22:23, Christophe Lyon wrote: > > > > On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches > wrote: >> >> >> >> > -Original Message- >> > From: Prathamesh Kulkarni >> > Sent: 24 June 2021 12:11 >> > To: gcc Patches ; Kyrylo Tkachov >> > >> > Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics >> > >> > Hi, >> > This patch replaces builtins for vdup_n and vmov_n. >> > The patch results in regression for pr51534.c. >> > Consider following function: >> > >> > uint8x8_t f1 (uint8x8_t a) { >> > return vcgt_u8(a, vdup_n_u8(0)); >> > } >> > >> > code-gen before patch: >> > f1: >> > vmov.i32 d16, #0 @ v8qi >> > vcgt.u8 d0, d0, d16 >> > bx lr >> > >> > code-gen after patch: >> > f1: >> > vceq.i8 d0, d0, #0 >> > vmvnd0, d0 >> > bx lr >> > >> > I am not sure which one is better tho ? >> > > Hi Prathamesh, > > This patch introduces a regression on non-hardfp configs (eg > arm-linux-gnueabi or arm-eabi): > FAIL: gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c > scan-assembler-times vmov.i32[ \t]+[dD][0-9]+, #0x 3 > FAIL: gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c > scan-assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3 > > Can you fix this? The issue is, for following test: #include uint8x8_t f1 (uint8x8_t a) { return vcge_u8(a, vdup_n_u8(0)); } armhf code-gen: f1: vmov.i32 d0, #0x @ v8qi bxlr arm softfp code-gen: f1: mov r0, #-1 mov r1, #-1 bx lr The code-gen for both is same upto split2 pass: (insn 10 6 11 2 (set (reg/i:V8QI 16 s0) (const_vector:V8QI [ (const_int -1 [0x]) repeated x8 ])) "foo.c":5:1 1052 {*neon_movv8qi} (expr_list:REG_EQUAL (const_vector:V8QI [ (const_int -1 [0x]) repeated x8 ]) (nil))) (insn 11 10 13 2 (use (reg/i:V8QI 16 s0)) "foo.c":5:1 -1 (nil)) and for softfp target, split2 pass splits the assignment to r0 and r1: (insn 15 6 16 2 (set (reg:SI 0 r0) (const_int -1 [0x])) "foo.c":5:1 740 {*thumb2_movsi_vfp} (nil)) (insn 16 15 11 2 (set (reg:SI 1 r1 [+4 ]) (const_int -1 [0x])) "foo.c":5:1 740 {*thumb2_movsi_vfp} (nil)) (insn 11 16 13 2 (use (reg/i:V8QI 0 r0)) "foo.c":5:1 -1 (nil)) I suppose we could use a dg-scan for r[0-9]+, #-1 for softfp targets ? Thanks, Prathamesh > > Thanks > > Christophe > > >> >> I think they're equivalent in practice, in any case the patch itself is good >> (move away from RTL builtins). >> Ok. >> Thanks, >> Kyrill >> >> > >> > Also, this patch regressed bf16_dup.c on arm-linux-gnueabi, >> > which is due to a missed opt in lowering. I had filed it as >> > PR98435, and posted a fix for it here: >> > https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572648.html >> > >> > Thanks, >> > Prathamesh
Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics
On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches < gcc-patches@gcc.gnu.org> wrote: > > > > -Original Message- > > From: Prathamesh Kulkarni > > Sent: 24 June 2021 12:11 > > To: gcc Patches ; Kyrylo Tkachov > > > > Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics > > > > Hi, > > This patch replaces builtins for vdup_n and vmov_n. > > The patch results in regression for pr51534.c. > > Consider following function: > > > > uint8x8_t f1 (uint8x8_t a) { > > return vcgt_u8(a, vdup_n_u8(0)); > > } > > > > code-gen before patch: > > f1: > > vmov.i32 d16, #0 @ v8qi > > vcgt.u8 d0, d0, d16 > > bx lr > > > > code-gen after patch: > > f1: > > vceq.i8 d0, d0, #0 > > vmvnd0, d0 > > bx lr > > > > I am not sure which one is better tho ? > > Hi Prathamesh, This patch introduces a regression on non-hardfp configs (eg arm-linux-gnueabi or arm-eabi): FAIL: gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan-assembler-times vmov.i32[ \t]+[dD][0-9]+, #0x 3 FAIL: gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan-assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3 Can you fix this? Thanks Christophe > I think they're equivalent in practice, in any case the patch itself is > good (move away from RTL builtins). > Ok. > Thanks, > Kyrill > > > > > Also, this patch regressed bf16_dup.c on arm-linux-gnueabi, > > which is due to a missed opt in lowering. I had filed it as > > PR98435, and posted a fix for it here: > > https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572648.html > > > > Thanks, > > Prathamesh >
RE: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics
> -Original Message- > From: Prathamesh Kulkarni > Sent: 24 June 2021 12:11 > To: gcc Patches ; Kyrylo Tkachov > > Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics > > Hi, > This patch replaces builtins for vdup_n and vmov_n. > The patch results in regression for pr51534.c. > Consider following function: > > uint8x8_t f1 (uint8x8_t a) { > return vcgt_u8(a, vdup_n_u8(0)); > } > > code-gen before patch: > f1: > vmov.i32 d16, #0 @ v8qi > vcgt.u8 d0, d0, d16 > bx lr > > code-gen after patch: > f1: > vceq.i8 d0, d0, #0 > vmvnd0, d0 > bx lr > > I am not sure which one is better tho ? I think they're equivalent in practice, in any case the patch itself is good (move away from RTL builtins). Ok. Thanks, Kyrill > > Also, this patch regressed bf16_dup.c on arm-linux-gnueabi, > which is due to a missed opt in lowering. I had filed it as > PR98435, and posted a fix for it here: > https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572648.html > > Thanks, > Prathamesh