Re: [PATCH][AArch64] ACLE intrinsics: get low/high half from BFloat16 vector
On Tue, 3 Nov 2020 at 12:17, Dennis Zhang via Gcc-patches wrote: > > Hi Richard, > > On 10/30/20 2:07 PM, Richard Sandiford wrote: > > Dennis Zhang writes: > >> diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def > >> b/gcc/config/aarch64/aarch64-simd-builtins.def > >> index 332a0b6b1ea..39ebb776d1d 100644 > >> --- a/gcc/config/aarch64/aarch64-simd-builtins.def > >> +++ b/gcc/config/aarch64/aarch64-simd-builtins.def > >> @@ -719,6 +719,9 @@ > >> VAR1 (QUADOP_LANE, bfmlalb_lane_q, 0, ALL, v4sf) > >> VAR1 (QUADOP_LANE, bfmlalt_lane_q, 0, ALL, v4sf) > >> > >> + /* Implemented by aarch64_vget_halfv8bf. */ > >> + VAR1 (GETREG, vget_half, 0, ALL, v8bf) > > > > This should be AUTO_FP, since it doesn't have any side-effects. > > (As before, we should probably rename the flag, but that's separate work.) > > > >> + > >> /* Implemented by aarch64_simd_mmlav16qi. */ > >> VAR1 (TERNOP, simd_smmla, 0, NONE, v16qi) > >> VAR1 (TERNOPU, simd_ummla, 0, NONE, v16qi) > >> diff --git a/gcc/config/aarch64/aarch64-simd.md > >> b/gcc/config/aarch64/aarch64-simd.md > >> index 9f0e2bd1e6f..f62c52ca327 100644 > >> --- a/gcc/config/aarch64/aarch64-simd.md > >> +++ b/gcc/config/aarch64/aarch64-simd.md > >> @@ -7159,6 +7159,19 @@ > >> [(set_attr "type" "neon_dot")] > >> ) > >> > >> +;; vget_low/high_bf16 > >> +(define_expand "aarch64_vget_halfv8bf" > >> + [(match_operand:V4BF 0 "register_operand") > >> + (match_operand:V8BF 1 "register_operand") > >> + (match_operand:SI 2 "aarch64_zero_or_1")] > >> + "TARGET_BF16_SIMD" > >> +{ > >> + int hbase = INTVAL (operands[2]); > >> + rtx sel = aarch64_gen_stepped_int_parallel (4, hbase * 4, 1); > > > > I think this needs to be: > > > >aarch64_simd_vect_par_cnst_half > > > > instead. The issue is that on big-endian targets, GCC assumes vector > > lane 0 is in the high part of the register, whereas for AArch64 it's > > always in the low part of the register. So we convert from AArch64 > > numbering to GCC numbering when generating the rtx and then take > > endianness into account when matching the rtx later. > > > > It would be good to have -mbig-endian tests that make sure we generate > > the right instruction for each function (i.e. we get them the right way > > round). I guess it would be good to test that for little-endian too. > > > > I've updated the expander using aarch64_simd_vect_par_cnst_half. > And the expander is divided into two for getting low and high half > seperately. > It's tested for aarch64-none-linux-gnu and aarch64_be-none-linux-gnu > targets with new tests including -mbig-endian option. > Hi, When testing with a cross x86_64 -> aarch64-none-linux-gnu, the new big-endian test fails: FAIL: gcc.target/aarch64/advsimd-intrinsics/bf16_get-be.c -O0 (test for excess errors) Excess errors: /aci-gcc-fsf/builds/gcc-fsf-gccsrc/sysroot-aarch64-none-linux-gnu/usr/include/gnu/stubs.h:11:11: fatal error: gnu/stubs-lp64_be.h: No such file or directory compilation terminated. What am I missing, since it works for you? Thanks Christophe > >> + emit_insn (gen_aarch64_get_halfv8bf (operands[0], operands[1], sel)); > >> + DONE; > >> +}) > >> + > >> ;; bfmmla > >> (define_insn "aarch64_bfmmlaqv4sf" > >> [(set (match_operand:V4SF 0 "register_operand" "=w") > >> diff --git a/gcc/config/aarch64/predicates.md > >> b/gcc/config/aarch64/predicates.md > >> index 215fcec5955..0c8bc2b0c73 100644 > >> --- a/gcc/config/aarch64/predicates.md > >> +++ b/gcc/config/aarch64/predicates.md > >> @@ -84,6 +84,10 @@ > >> (ior (match_test "op == constm1_rtx") > >>(match_test "op == const1_rtx")) > >> > >> +(define_predicate "aarch64_zero_or_1" > >> + (and (match_code "const_int") > >> + (match_test "op == const0_rtx || op == const1_rtx"))) > > > > zero_or_1 looked odd to me, feels like it should be 0_or_1 or zero_or_one. > > But I see that it's for consistency with aarch64_reg_zero_or_m1_or_1, > > so let's keep it as-is. > > > > This predicate is removed since there is no need of the imm operand in > the new expanders. > > Thanks for the reviews. > Is it OK for trunk now? > > Cheers > Dennis > >
Re: [PATCH][AArch64] ACLE intrinsics: get low/high half from BFloat16 vector
On 11/3/20 2:05 PM, Richard Sandiford wrote: Dennis Zhang writes: Hi Richard, On 10/30/20 2:07 PM, Richard Sandiford wrote: Dennis Zhang writes: diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def b/gcc/config/aarch64/aarch64-simd-builtins.def index 332a0b6b1ea..39ebb776d1d 100644 --- a/gcc/config/aarch64/aarch64-simd-builtins.def +++ b/gcc/config/aarch64/aarch64-simd-builtins.def @@ -719,6 +719,9 @@ VAR1 (QUADOP_LANE, bfmlalb_lane_q, 0, ALL, v4sf) VAR1 (QUADOP_LANE, bfmlalt_lane_q, 0, ALL, v4sf) + /* Implemented by aarch64_vget_halfv8bf. */ + VAR1 (GETREG, vget_half, 0, ALL, v8bf) This should be AUTO_FP, since it doesn't have any side-effects. (As before, we should probably rename the flag, but that's separate work.) + /* Implemented by aarch64_simd_mmlav16qi. */ VAR1 (TERNOP, simd_smmla, 0, NONE, v16qi) VAR1 (TERNOPU, simd_ummla, 0, NONE, v16qi) diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index 9f0e2bd1e6f..f62c52ca327 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -7159,6 +7159,19 @@ [(set_attr "type" "neon_dot")] ) +;; vget_low/high_bf16 +(define_expand "aarch64_vget_halfv8bf" + [(match_operand:V4BF 0 "register_operand") + (match_operand:V8BF 1 "register_operand") + (match_operand:SI 2 "aarch64_zero_or_1")] + "TARGET_BF16_SIMD" +{ + int hbase = INTVAL (operands[2]); + rtx sel = aarch64_gen_stepped_int_parallel (4, hbase * 4, 1); I think this needs to be: aarch64_simd_vect_par_cnst_half instead. The issue is that on big-endian targets, GCC assumes vector lane 0 is in the high part of the register, whereas for AArch64 it's always in the low part of the register. So we convert from AArch64 numbering to GCC numbering when generating the rtx and then take endianness into account when matching the rtx later. It would be good to have -mbig-endian tests that make sure we generate the right instruction for each function (i.e. we get them the right way round). I guess it would be good to test that for little-endian too. I've updated the expander using aarch64_simd_vect_par_cnst_half. And the expander is divided into two for getting low and high half seperately. It's tested for aarch64-none-linux-gnu and aarch64_be-none-linux-gnu targets with new tests including -mbig-endian option. + emit_insn (gen_aarch64_get_halfv8bf (operands[0], operands[1], sel)); + DONE; +}) + ;; bfmmla (define_insn "aarch64_bfmmlaqv4sf" [(set (match_operand:V4SF 0 "register_operand" "=w") diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md index 215fcec5955..0c8bc2b0c73 100644 --- a/gcc/config/aarch64/predicates.md +++ b/gcc/config/aarch64/predicates.md @@ -84,6 +84,10 @@ (ior (match_test "op == constm1_rtx") (match_test "op == const1_rtx")) +(define_predicate "aarch64_zero_or_1" + (and (match_code "const_int") + (match_test "op == const0_rtx || op == const1_rtx"))) zero_or_1 looked odd to me, feels like it should be 0_or_1 or zero_or_one. But I see that it's for consistency with aarch64_reg_zero_or_m1_or_1, so let's keep it as-is. This predicate is removed since there is no need of the imm operand in the new expanders. Thanks for the reviews. Is it OK for trunk now? Looks good. OK for trunk and branches, thanks. Richard Thanks for approval, Richard! This patch is committed at 3553c658533e430b232997bdfd97faf6606fb102 Bests Dennis
Re: [PATCH][AArch64] ACLE intrinsics: get low/high half from BFloat16 vector
Dennis Zhang writes: > Hi Richard, > > On 10/30/20 2:07 PM, Richard Sandiford wrote: >> Dennis Zhang writes: >>> diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def >>> b/gcc/config/aarch64/aarch64-simd-builtins.def >>> index 332a0b6b1ea..39ebb776d1d 100644 >>> --- a/gcc/config/aarch64/aarch64-simd-builtins.def >>> +++ b/gcc/config/aarch64/aarch64-simd-builtins.def >>> @@ -719,6 +719,9 @@ >>> VAR1 (QUADOP_LANE, bfmlalb_lane_q, 0, ALL, v4sf) >>> VAR1 (QUADOP_LANE, bfmlalt_lane_q, 0, ALL, v4sf) >>> >>> + /* Implemented by aarch64_vget_halfv8bf. */ >>> + VAR1 (GETREG, vget_half, 0, ALL, v8bf) >> >> This should be AUTO_FP, since it doesn't have any side-effects. >> (As before, we should probably rename the flag, but that's separate work.) >> >>> + >>> /* Implemented by aarch64_simd_mmlav16qi. */ >>> VAR1 (TERNOP, simd_smmla, 0, NONE, v16qi) >>> VAR1 (TERNOPU, simd_ummla, 0, NONE, v16qi) >>> diff --git a/gcc/config/aarch64/aarch64-simd.md >>> b/gcc/config/aarch64/aarch64-simd.md >>> index 9f0e2bd1e6f..f62c52ca327 100644 >>> --- a/gcc/config/aarch64/aarch64-simd.md >>> +++ b/gcc/config/aarch64/aarch64-simd.md >>> @@ -7159,6 +7159,19 @@ >>> [(set_attr "type" "neon_dot")] >>> ) >>> >>> +;; vget_low/high_bf16 >>> +(define_expand "aarch64_vget_halfv8bf" >>> + [(match_operand:V4BF 0 "register_operand") >>> + (match_operand:V8BF 1 "register_operand") >>> + (match_operand:SI 2 "aarch64_zero_or_1")] >>> + "TARGET_BF16_SIMD" >>> +{ >>> + int hbase = INTVAL (operands[2]); >>> + rtx sel = aarch64_gen_stepped_int_parallel (4, hbase * 4, 1); >> >> I think this needs to be: >> >>aarch64_simd_vect_par_cnst_half >> >> instead. The issue is that on big-endian targets, GCC assumes vector >> lane 0 is in the high part of the register, whereas for AArch64 it's >> always in the low part of the register. So we convert from AArch64 >> numbering to GCC numbering when generating the rtx and then take >> endianness into account when matching the rtx later. >> >> It would be good to have -mbig-endian tests that make sure we generate >> the right instruction for each function (i.e. we get them the right way >> round). I guess it would be good to test that for little-endian too. >> > > I've updated the expander using aarch64_simd_vect_par_cnst_half. > And the expander is divided into two for getting low and high half > seperately. > It's tested for aarch64-none-linux-gnu and aarch64_be-none-linux-gnu > targets with new tests including -mbig-endian option. > >>> + emit_insn (gen_aarch64_get_halfv8bf (operands[0], operands[1], sel)); >>> + DONE; >>> +}) >>> + >>> ;; bfmmla >>> (define_insn "aarch64_bfmmlaqv4sf" >>> [(set (match_operand:V4SF 0 "register_operand" "=w") >>> diff --git a/gcc/config/aarch64/predicates.md >>> b/gcc/config/aarch64/predicates.md >>> index 215fcec5955..0c8bc2b0c73 100644 >>> --- a/gcc/config/aarch64/predicates.md >>> +++ b/gcc/config/aarch64/predicates.md >>> @@ -84,6 +84,10 @@ >>> (ior (match_test "op == constm1_rtx") >>> (match_test "op == const1_rtx")) >>> >>> +(define_predicate "aarch64_zero_or_1" >>> + (and (match_code "const_int") >>> + (match_test "op == const0_rtx || op == const1_rtx"))) >> >> zero_or_1 looked odd to me, feels like it should be 0_or_1 or zero_or_one. >> But I see that it's for consistency with aarch64_reg_zero_or_m1_or_1, >> so let's keep it as-is. >> > > This predicate is removed since there is no need of the imm operand in > the new expanders. > > Thanks for the reviews. > Is it OK for trunk now? Looks good. OK for trunk and branches, thanks. Richard
Re: [PATCH][AArch64] ACLE intrinsics: get low/high half from BFloat16 vector
Hi Richard, On 10/30/20 2:07 PM, Richard Sandiford wrote: Dennis Zhang writes: diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def b/gcc/config/aarch64/aarch64-simd-builtins.def index 332a0b6b1ea..39ebb776d1d 100644 --- a/gcc/config/aarch64/aarch64-simd-builtins.def +++ b/gcc/config/aarch64/aarch64-simd-builtins.def @@ -719,6 +719,9 @@ VAR1 (QUADOP_LANE, bfmlalb_lane_q, 0, ALL, v4sf) VAR1 (QUADOP_LANE, bfmlalt_lane_q, 0, ALL, v4sf) + /* Implemented by aarch64_vget_halfv8bf. */ + VAR1 (GETREG, vget_half, 0, ALL, v8bf) This should be AUTO_FP, since it doesn't have any side-effects. (As before, we should probably rename the flag, but that's separate work.) + /* Implemented by aarch64_simd_mmlav16qi. */ VAR1 (TERNOP, simd_smmla, 0, NONE, v16qi) VAR1 (TERNOPU, simd_ummla, 0, NONE, v16qi) diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index 9f0e2bd1e6f..f62c52ca327 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -7159,6 +7159,19 @@ [(set_attr "type" "neon_dot")] ) +;; vget_low/high_bf16 +(define_expand "aarch64_vget_halfv8bf" + [(match_operand:V4BF 0 "register_operand") + (match_operand:V8BF 1 "register_operand") + (match_operand:SI 2 "aarch64_zero_or_1")] + "TARGET_BF16_SIMD" +{ + int hbase = INTVAL (operands[2]); + rtx sel = aarch64_gen_stepped_int_parallel (4, hbase * 4, 1); I think this needs to be: aarch64_simd_vect_par_cnst_half instead. The issue is that on big-endian targets, GCC assumes vector lane 0 is in the high part of the register, whereas for AArch64 it's always in the low part of the register. So we convert from AArch64 numbering to GCC numbering when generating the rtx and then take endianness into account when matching the rtx later. It would be good to have -mbig-endian tests that make sure we generate the right instruction for each function (i.e. we get them the right way round). I guess it would be good to test that for little-endian too. I've updated the expander using aarch64_simd_vect_par_cnst_half. And the expander is divided into two for getting low and high half seperately. It's tested for aarch64-none-linux-gnu and aarch64_be-none-linux-gnu targets with new tests including -mbig-endian option. + emit_insn (gen_aarch64_get_halfv8bf (operands[0], operands[1], sel)); + DONE; +}) + ;; bfmmla (define_insn "aarch64_bfmmlaqv4sf" [(set (match_operand:V4SF 0 "register_operand" "=w") diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md index 215fcec5955..0c8bc2b0c73 100644 --- a/gcc/config/aarch64/predicates.md +++ b/gcc/config/aarch64/predicates.md @@ -84,6 +84,10 @@ (ior (match_test "op == constm1_rtx") (match_test "op == const1_rtx")) +(define_predicate "aarch64_zero_or_1" + (and (match_code "const_int") + (match_test "op == const0_rtx || op == const1_rtx"))) zero_or_1 looked odd to me, feels like it should be 0_or_1 or zero_or_one. But I see that it's for consistency with aarch64_reg_zero_or_m1_or_1, so let's keep it as-is. This predicate is removed since there is no need of the imm operand in the new expanders. Thanks for the reviews. Is it OK for trunk now? Cheers Dennis diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def b/gcc/config/aarch64/aarch64-simd-builtins.def index eb8e6f7b3d8..f26a96042bc 100644 --- a/gcc/config/aarch64/aarch64-simd-builtins.def +++ b/gcc/config/aarch64/aarch64-simd-builtins.def @@ -722,6 +722,10 @@ VAR1 (QUADOP_LANE, bfmlalb_lane_q, 0, ALL, v4sf) VAR1 (QUADOP_LANE, bfmlalt_lane_q, 0, ALL, v4sf) + /* Implemented by aarch64_vget_lo/hi_halfv8bf. */ + VAR1 (UNOP, vget_lo_half, 0, AUTO_FP, v8bf) + VAR1 (UNOP, vget_hi_half, 0, AUTO_FP, v8bf) + /* Implemented by aarch64_simd_mmlav16qi. */ VAR1 (TERNOP, simd_smmla, 0, NONE, v16qi) VAR1 (TERNOPU, simd_ummla, 0, NONE, v16qi) diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index 381a702eba0..af29a2f26f5 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -7159,6 +7159,27 @@ [(set_attr "type" "neon_dot")] ) +;; vget_low/high_bf16 +(define_expand "aarch64_vget_lo_halfv8bf" + [(match_operand:V4BF 0 "register_operand") + (match_operand:V8BF 1 "register_operand")] + "TARGET_BF16_SIMD" +{ + rtx p = aarch64_simd_vect_par_cnst_half (V8BFmode, 8, false); + emit_insn (gen_aarch64_get_halfv8bf (operands[0], operands[1], p)); + DONE; +}) + +(define_expand "aarch64_vget_hi_halfv8bf" + [(match_operand:V4BF 0 "register_operand") + (match_operand:V8BF 1 "register_operand")] + "TARGET_BF16_SIMD" +{ + rtx p = aarch64_simd_vect_par_cnst_half (V8BFmode, 8, true); + emit_insn (gen_aarch64_get_halfv8bf (operands[0], operands[1], p)); + DONE; +}) + ;; bfmmla (define_insn "aarch64_bfmmlaqv4sf" [(set (match_operand:V4SF 0 "register_operand" "=w") diff
Re: [PATCH][AArch64] ACLE intrinsics: get low/high half from BFloat16 vector
Dennis Zhang writes: > diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def > b/gcc/config/aarch64/aarch64-simd-builtins.def > index 332a0b6b1ea..39ebb776d1d 100644 > --- a/gcc/config/aarch64/aarch64-simd-builtins.def > +++ b/gcc/config/aarch64/aarch64-simd-builtins.def > @@ -719,6 +719,9 @@ >VAR1 (QUADOP_LANE, bfmlalb_lane_q, 0, ALL, v4sf) >VAR1 (QUADOP_LANE, bfmlalt_lane_q, 0, ALL, v4sf) > > + /* Implemented by aarch64_vget_halfv8bf. */ > + VAR1 (GETREG, vget_half, 0, ALL, v8bf) This should be AUTO_FP, since it doesn't have any side-effects. (As before, we should probably rename the flag, but that's separate work.) > + >/* Implemented by aarch64_simd_mmlav16qi. */ >VAR1 (TERNOP, simd_smmla, 0, NONE, v16qi) >VAR1 (TERNOPU, simd_ummla, 0, NONE, v16qi) > diff --git a/gcc/config/aarch64/aarch64-simd.md > b/gcc/config/aarch64/aarch64-simd.md > index 9f0e2bd1e6f..f62c52ca327 100644 > --- a/gcc/config/aarch64/aarch64-simd.md > +++ b/gcc/config/aarch64/aarch64-simd.md > @@ -7159,6 +7159,19 @@ >[(set_attr "type" "neon_dot")] > ) > > +;; vget_low/high_bf16 > +(define_expand "aarch64_vget_halfv8bf" > + [(match_operand:V4BF 0 "register_operand") > + (match_operand:V8BF 1 "register_operand") > + (match_operand:SI 2 "aarch64_zero_or_1")] > + "TARGET_BF16_SIMD" > +{ > + int hbase = INTVAL (operands[2]); > + rtx sel = aarch64_gen_stepped_int_parallel (4, hbase * 4, 1); I think this needs to be: aarch64_simd_vect_par_cnst_half instead. The issue is that on big-endian targets, GCC assumes vector lane 0 is in the high part of the register, whereas for AArch64 it's always in the low part of the register. So we convert from AArch64 numbering to GCC numbering when generating the rtx and then take endianness into account when matching the rtx later. It would be good to have -mbig-endian tests that make sure we generate the right instruction for each function (i.e. we get them the right way round). I guess it would be good to test that for little-endian too. > + emit_insn (gen_aarch64_get_halfv8bf (operands[0], operands[1], sel)); > + DONE; > +}) > + > ;; bfmmla > (define_insn "aarch64_bfmmlaqv4sf" >[(set (match_operand:V4SF 0 "register_operand" "=w") > diff --git a/gcc/config/aarch64/predicates.md > b/gcc/config/aarch64/predicates.md > index 215fcec5955..0c8bc2b0c73 100644 > --- a/gcc/config/aarch64/predicates.md > +++ b/gcc/config/aarch64/predicates.md > @@ -84,6 +84,10 @@ >(ior (match_test "op == constm1_rtx") > (match_test "op == const1_rtx")) > > +(define_predicate "aarch64_zero_or_1" > + (and (match_code "const_int") > + (match_test "op == const0_rtx || op == const1_rtx"))) zero_or_1 looked odd to me, feels like it should be 0_or_1 or zero_or_one. But I see that it's for consistency with aarch64_reg_zero_or_m1_or_1, so let's keep it as-is. Thanks, Richard