Re: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

Hongtao Liu via Gcc-patches Thu, 28 May 2020 20:25:14 -0700

On Thu, May 28, 2020 at 11:37 PM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> On Thu, May 28, 2020 at 8:00 AM Richard Sandiford
> <richard.sandif...@arm.com> wrote:
> >
> > "Yangfei (Felix)" <felix.y...@huawei.com> writes:
> > > Thanks for reviewing this.
> > > Attached please find the v5 patch.
> > > Note: we also need to modify local variable "mode" once we catch one 
> > > case.  I see test failure without this change.
> >
> > Looks good.  Patch is OK assuming the x86 folks don't want to rewrite
> > gcc.target/i386/pr67609.c to avoid the new optimisation.  I'll hold off
> > applying until the AVX512 thing is sorted.
> >
> > > Bootstrapped and tested on aarch64-linux-gnu.
> > > Also bootstrapped on x86_64-linux-gnu.  Regression test show two failed 
> > > tests on this platform:
> >
> > Thanks for the extra testing.
> >
> > > 1> FAIL: gcc.target/i386/avx512f-vcvtps2ph-2.c (test for excess errors)
> > > 2> FAIL: gcc.target/i386/pr67609.c scan-assembler movdqa
> > >
> > > I have adjust the second one in the v4 patch.
> >
> > So this is:
> >
> >         movdqa  reg(%rip), %xmm1
> >         movaps  %xmm1, -24(%rsp)
> >         movsd   %xmm0, -24(%rsp)
> >         movapd  -24(%rsp), %xmm2
> >         movaps  %xmm2, reg(%rip)
> >         ret
> >
> > to:
> >
> >         movq    %xmm0, reg(%rip)
> >         ret
> >
> > Nice.  I think it's safe to say that's an improvement :-)
> >
> > I don't know whether this means we're no longer testing what the test
> > was intended to test.  Maybe one of the x86 folks has an opinion about
> > whether we should instead rewrite the test somehow.
> >
> > > But The first one looks strange to me.
> > > I see gcc emits invalid x86 vcvtps2ph instrunctions which looks like:
> > >
> > > 125         vcvtps2ph       $0, %zmm0, -112(%rbp){%k1}
> > > 126         vcvtps2ph       $0, %zmm0, -80(%rbp){%k1}{z}
> > >
> > > This happens in the combine phase, where an combined insn looks like:
> > >
> > > 1989 Trying 31 -> 33:
> > > 1990    31: r106:V16HI=vec_merge(unspec[r103:V16SF,0] 
> > > 133,[frame:DI-0x60],r109:HI)
> > > 1991    33: [frame:DI-0x60]=r106:V16HI
> > > 1992       REG_DEAD r106:V16HI
> > > 1993 Successfully matched this instruction:
> > > 1994 (set (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
> > > 1995             (const_int -96 [0xffffffffffffffa0])) [2 res2.x+0 S32 
> > > A256])
> > > 1996     (vec_merge:V16HI (unspec:V16HI [
> > > 1997                 (reg:V16SF 103)
> > > 1998                 (const_int 0 [0])
> > > 1999             ] UNSPEC_VCVTPS2PH)
> > > 2000         (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
> > > 2001                 (const_int -96 [0xffffffffffffffa0])) [2 res2.x+0 
> > > S32 A256])
> > > 2002         (reg:HI 109)))
> > > 2003 allowing combination of insns 31 and 33
> > > 2004 original costs 16 + 4 = 20
> > > 2005 replacement cost 16
> > > 2006 deferring deletion of insn with uid = 31.
> > > 2007 modifying insn i3    33: 
> > > [frame:DI-0x60]=vec_merge(unspec[r103:V16SF,0] 
> > > 133,[frame:DI-0x60],r109:HI)
> > > 2008 deferring rescan insn with uid = 33.
> > >
> > > And this can be matched with pattern: avx512f_vcvtps2ph512_mask
> > > 2282 (insn 33 31 37 4 (set (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
> > > 2283                 (const_int -96 [0xffffffffffffffa0])) [2 res2.x+0 
> > > S32 A256])
> > > 2284         (vec_merge:V16HI (unspec:V16HI [
> > > 2285                     (reg:V16SF 103)
> > > 2286                     (const_int 0 [0])
> > > 2287                 ] UNSPEC_VCVTPS2PH)
> > > 2288             (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
> > > 2289                     (const_int -96 [0xffffffffffffffa0])) [2 
> > > res2.x+0 S32 A256])
> > > 2290             (reg:HI 109))) "avx512f-vcvtps2ph-2.c":80:10 5324 
> > > {avx512f_vcvtps2ph512_mask}
> > > 2291      (nil))
> > >
> > > gcc/config/i386/sse.md:
> > > 21663 (define_insn "<mask_codefor>avx512f_vcvtps2ph512<mask_name>"
> > > 21664   [(set (match_operand:V16HI 0 "nonimmediate_operand" "=vm")
> > > 21665         (unspec:V16HI
> > > 21666           [(match_operand:V16SF 1 "register_operand" "v")
> > > 21667            (match_operand:SI 2 "const_0_to_255_operand" "N")]
> > > 21668           UNSPEC_VCVTPS2PH))]
> > > 21669   "TARGET_AVX512F"
> > > 21670   "vcvtps2ph\t{%2, %1, %0<mask_operand3>|%0<mask_operand3>, %1, %2}"
> > > 21671   [(set_attr "type" "ssecvt")
> > > 21672    (set_attr "prefix" "evex")
> > > 21673    (set_attr "mode" "V16SF")])
> > >
> > > How can that happen?
> >
> > This is due to define_subst magic.  The generators automatically
> > create a vec_merge form of the instruction based on the information
> > in the <mode_*> attributes.
> >
> > AFAICT the rtl above is for the line-125 instruction, which looks ok.
> > The problem is the line-126 instruction, since vcvtps2ph doesn't
> > AIUI allow zero masking.
> >


zero masking is not allowed for mem_operand here, but available for
register_operand.
there's something wrong in the pattern, we need to fix it.
(define_insn "<mask_codefor>avx512f_vcvtps2ph512<mask_name>"


> > The "mask" define_subst allows both zeroing and merging,
> > so I guess this means that the pattern should either be using
> > a different define_subst, or should be enforcing merging in
> > some other way.  Please could one of the x86 devs take a look?
> >
>
> Hongtao, can you take a look?
>
> Thanks.
>
>
> --
> H.J.

BTW, i failed to build gcc when apply pr95254-v4.txt.

gcc configure:

Using built-in specs.
COLLECT_GCC=./gcc/xgcc
Target: x86_64-pc-linux-gnu
Configured with: ../../gcc/gnu-toolchain/gcc/configure
--enable-languages=c,c++,fortran --disable-bootstrap
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 11.0.0 20200528 (experimental) (GCC)

host on x86_64 rel8.

error message:

during RTL pass: expand
../../../../../gcc/gnu-toolchain/gcc/libgcc/config/libbid/_eq_dd.c: In
function ‘__bid_eqdd2’:
../../../../../gcc/gnu-toolchain/gcc/libgcc/config/libbid/_eq_dd.c:35:9:
internal compiler error: in emit_single_push_insn, at expr.c:4405
   35 |   res = __bid64_quiet_equal (ux.i, uy.i);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
during RTL pass: expand
../../../../../gcc/gnu-toolchain/gcc/libgcc/config/libbid/_gt_dd.c: In
function ‘__bid_gtdd2’:
../../../../../gcc/gnu-toolchain/gcc/libgcc/config/libbid/_gt_dd.c:35:9:
internal compiler error: in emit_single_push_insn, at expr.c:4405
   35 |   res = __bid64_quiet_greater (ux.i, uy.i);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
during RTL pass: expand
../../../../../gcc/gnu-toolchain/gcc/libgcc/config/libbid/_dd_to_di.c:
In function ‘__bid_fixdddi’:
../../../../../gcc/gnu-toolchain/gcc/libgcc/config/libbid/_dd_to_di.c:34:9:
internal compiler error: in emit_single_push_insn, at expr.c:4405
   34 |   res = __bid64_to_int64_xint (ux.i);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
during RTL pass: expand
../../../../../gcc/gnu-toolchain/gcc/libgcc/config/libbid/_ge_dd.c: In
function ‘__bid_gedd2’:
../../../../../gcc/gnu-toolchain/gcc/libgcc/config/libbid/_ge_dd.c:35:9:
internal compiler error: in emit_single_push_insn, at expr.c:4405
   35 |   res = __bid64_quiet_greater_equal (ux.i, uy.i);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
during RTL pass: expand
../../../../../gcc/gnu-toolchain/gcc/libgcc/config/libbid/_le_dd.c: In
function ‘__bid_ledd2’:
../../../../../gcc/gnu-toolchain/gcc/libgcc/config/libbid/_le_dd.c:35:9:
internal compiler error: in emit_single_push_insn, at expr.c:4405
   35 |   res = __bid64_quiet_less_equal (ux.i, uy.i);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--
BR,
Hongtao

Re: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

Reply via email to