On Thu, 25 May 2023 at 01:28, Richard Sandiford <richard.sandif...@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes:
> > On Wed, 24 May 2023 at 15:40, Richard Sandiford
> > <richard.sandif...@arm.com> wrote:
> >>
> >> Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes:
> >> > On Mon, 22 May 2023 at 14:18, Richard Sandiford
> >> > <richard.sandif...@arm.com> wrote:
> >> >>
> >> >> Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes:
> >> >> > Hi Richard,
> >> >> > Thanks for the suggestions. Does the attached patch look OK ?
> >> >> > Bootstrap+test in progress on aarch64-linux-gnu.
> >> >>
> >> >> Like I say, please wait for the tests to complete before sending an RFA.
> >> >> It saves a review cycle if the tests don't in fact pass.
> >> > Right, sorry, will post patches after completion of testing henceforth.
> >> >>
> >> >> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> >> >> > index 29dbacfa917..e611a7cca25 100644
> >> >> > --- a/gcc/config/aarch64/aarch64.cc
> >> >> > +++ b/gcc/config/aarch64/aarch64.cc
> >> >> > @@ -22332,6 +22332,43 @@ aarch64_unzip_vector_init (machine_mode mode, rtx vals, bool even_p)
> >> >> >    return gen_rtx_PARALLEL (new_mode, vec);
> >> >> >  }
> >> >> >
> >> >> > +/* Return true if INSN is a scalar move.  */
> >> >> > +
> >> >> > +static bool
> >> >> > +scalar_move_insn_p (const rtx_insn *insn)
> >> >> > +{
> >> >> > +  rtx set = single_set (insn);
> >> >> > +  if (!set)
> >> >> > +    return false;
> >> >> > +  rtx src = SET_SRC (set);
> >> >> > +  rtx dest = SET_DEST (set);
> >> >> > +  return is_a<scalar_mode>(GET_MODE (dest))
> >> >> > +    && aarch64_mov_operand_p (src, GET_MODE (src));
> >> >>
> >> >> Formatting:
> >> >>
> >> >>   return (is_a<scalar_mode>(GET_MODE (dest))
> >> >>           && aarch64_mov_operand_p (src, GET_MODE (src)));
> >> >>
> >> >> OK with that change if the tests pass, thanks.
> >> > Unfortunately, the patch regressed vec-init-21.c:
> >> >
> >> > int8x16_t f_s8(int8_t x, int8_t y)
> >> > {
> >> >   return (int8x16_t) { x, y, 1, 2, 3, 4, 5, 6,
> >> >                        7, 8, 9, 10, 11, 12, 13, 14 };
> >> > }
> >> >
> >> > -O3 code-gen trunk:
> >> > f_s8:
> >> >         adrp    x2, .LC0
> >> >         ldr     q0, [x2, #:lo12:.LC0]
> >> >         ins     v0.b[0], w0
> >> >         ins     v0.b[1], w1
> >> >         ret
> >> >
> >> > -O3 code-gen patch:
> >> > f_s8:
> >> >         adrp    x2, .LC0
> >> >         ldr     d31, [x2, #:lo12:.LC0]
> >> >         adrp    x2, .LC1
> >> >         ldr     d0, [x2, #:lo12:.LC1]
> >> >         ins     v31.b[0], w0
> >> >         ins     v0.b[0], w1
> >> >         zip1    v0.16b, v31.16b, v0.16b
> >> >         ret
> >> >
> >> > With trunk, it chooses the fallback sequence because both the fallback
> >> > and zip1 sequences had cost = 20; however, with the patch applied,
> >> > we end up with zip1 sequence cost = 24 and fallback sequence
> >> > cost = 28.
> >> >
> >> > This happens because of using insn_cost instead of
> >> > set_rtx_cost for the following expression:
> >> > (set (reg:QI 100)
> >> >     (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0))
> >> > set_rtx_cost returns 0 for the above expression but insn_cost returns 4.
> >>
> >> Yeah, I was wondering why you'd dropped the set_rtx_cost thing,
> >> but decided not to question it since using insn_cost seemed
> >> reasonable if it worked.
> > The attached patch uses set_rtx_cost for single_set and insn_cost
> > otherwise for non-debug insns, similar to seq_cost.
>
> FWIW, I think with the aarch64_mov_operand fix, the old way of using
> insn_cost for everything would have worked too.  But either way is fine.
>
> >> > This expression template appears twice in the fallback sequence, which raises
> >> > the cost to 28 from 20, while it appears once in each half of the zip1 sequence,
> >> > which raises the cost to 24 from 20, and so it now prefers the zip1 sequence
> >> > instead.
> >> >
> >> > I assumed this expression would be ignored because it looks like a
> >> > scalar move, but that doesn't seem to be the case ?
> >> > aarch64_classify_symbolic_expression returns
> >> > SYMBOL_FORCE_TO_MEM for (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0)
> >> > and thus aarch64_mov_operand_p returns false.
> >>
> >> Ah, I guess it should be aarch64_mov_operand instead.  Confusing that
> >> they're so different...
> > Thanks, using aarch64_mov_operand worked.
> >>
> >> > Another issue with the zip1 sequence above is using the same register x2
> >> > for loading the other half of the constant in:
> >> > adrp    x2, .LC1
> >> >
> >> > I guess this will create an output dependency from adrp x2, .LC0 ->
> >> > adrp x2, .LC1 and an anti-dependency from ldr d31, [x2, #:lo12:.LC0] ->
> >> > adrp x2, .LC1, essentially forcing almost the entire sequence (except the ins
> >> > instructions) to execute sequentially ?
> >>
> >> I'd expect modern cores to handle that via renaming.
> > Ah right, thanks for the clarification.
> >
> > For some reason, it seems git diff is not formatting the patch correctly :/
> > Or perhaps I am doing something wrongly.
>
> No, I think it's fine.  It's just tabs vs. spaces.  A leading
> "+" followed by a tab is still only indented 8 columns, whereas
> "+" followed by 6 spaces is indented 7 columns.  So indentation
> can look a bit weird in the diff.
>
> I was accounting for that though. :)
>
> > For eg, it shows:
> > +  return is_a<scalar_mode>(GET_MODE (dest))
> > +    && aarch64_mov_operand (src, GET_MODE (src));
> > but after applying the patch, it's formatted correctly with
> > "&& aarch64..." right below is_a<scalar_mode>, both on column 10.
>
> Yeah, the indentation itself was OK.  But there's an “emacs rule”
> that says that parens should be used when splitting an expression
> over multiple lines like this.  So:
>
> -------
> Formatting:
>
>   return (is_a<scalar_mode>(GET_MODE (dest))
>           && aarch64_mov_operand_p (src, GET_MODE (src)));
> -------
>
> was about adding the parens.
> > +  for (; seq; seq = NEXT_INSN (seq))
> > +    if (NONDEBUG_INSN_P (seq)
> > +        && !scalar_move_insn_p (seq))
> > +      {
> > +        if (rtx set = single_set (seq))
> > +          cost += set_rtx_cost (set, speed);
> > +        else
> > +          {
> > +            int this_cost = insn_cost (CONST_CAST_RTX_INSN (seq), speed);
> > +            if (this_cost > 0)
> > +              cost += this_cost;
> > +            else
> > +              cost++;
> > +          }
> > +      }
>
> I think it'd be better to do the single_set first, and pass the set
> to scalar_move_insn_p.  I.e.:
>
>   for (; seq; seq = NEXT_INSN (seq))
>     if (NONDEBUG_INSN_P (seq))
>       {
>         if (rtx set = single_set (seq))
>           {
>             if (!scalar_move_insn_p (set))
>               cost += set_rtx_cost (set, speed);
>           }
>         else
>           {
>             int this_cost = insn_cost (CONST_CAST_RTX_INSN (seq), speed);
>             if (this_cost > 0)
>               cost += this_cost;
>             else
>               cost++;
>           }
>       }
>
> Then scalar_move_insn_p can just be the last three statements,
> adjusted as follows:
>
>   rtx src = SET_SRC (set);
>   rtx dest = SET_DEST (set);
>   return (is_a<scalar_mode> (GET_MODE (dest))
>           && aarch64_mov_operand (src, GET_MODE (dest)));
>
> Note the space after ">", and that the mode passed to aarch64_mov_operand
> is the destination mode (since the source mode might be VOIDmode).
>
> OK with that change, thanks.

Hi Richard,
Thanks for the suggestions, and sorry for being a bit daft in my previous replies.
Does the attached patch look OK ?
Bootstrapped+tested on aarch64-linux-gnu.
Thanks,
Prathamesh
>
> Thanks,
> Richard
[aarch64] Ignore cost of scalar moves for seq in vector initialization.

gcc/ChangeLog:
	* config/aarch64/aarch64.cc (scalar_move_insn_p): New function.
	(seq_cost_ignoring_scalar_moves): Likewise.
	(aarch64_expand_vector_init): Call seq_cost_ignoring_scalar_moves.

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index d6fc94015fa..db7ca4c28c3 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -22332,6 +22332,46 @@ aarch64_unzip_vector_init (machine_mode mode, rtx vals, bool even_p)
   return gen_rtx_PARALLEL (new_mode, vec);
 }
 
+/* Return true if SET is a scalar move.  */
+
+static bool
+scalar_move_insn_p (rtx set)
+{
+  rtx src = SET_SRC (set);
+  rtx dest = SET_DEST (set);
+  return (is_a<scalar_mode> (GET_MODE (dest))
+	  && aarch64_mov_operand (src, GET_MODE (dest)));
+}
+
+/* Similar to seq_cost, but ignore cost for scalar moves.  This function
+   is called from aarch64_expand_vector_init.  */
+
+static unsigned
+seq_cost_ignoring_scalar_moves (const rtx_insn *seq, bool speed)
+{
+  unsigned cost = 0;
+
+  for (; seq; seq = NEXT_INSN (seq))
+    if (NONDEBUG_INSN_P (seq))
+      {
+	if (rtx set = single_set (seq))
+	  {
+	    if (!scalar_move_insn_p (set))
+	      cost += set_rtx_cost (set, speed);
+	  }
+	else
+	  {
+	    int this_cost = insn_cost (CONST_CAST_RTX_INSN (seq), speed);
+	    if (this_cost > 0)
+	      cost += this_cost;
+	    else
+	      cost++;
+	  }
+      }
+
+  return cost;
+}
+
 /* Expand a vector initialization sequence, such that TARGET is
    initialized to contain VALS.
 */

@@ -22367,7 +22407,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
       halves[i] = gen_rtx_SUBREG (mode, tmp_reg, 0);
       rtx_insn *rec_seq = get_insns ();
       end_sequence ();
-      costs[i] = seq_cost (rec_seq, !optimize_size);
+      costs[i] = seq_cost_ignoring_scalar_moves (rec_seq, !optimize_size);
       emit_insn (rec_seq);
     }
 
@@ -22384,7 +22424,8 @@ aarch64_expand_vector_init (rtx target, rtx vals)
   start_sequence ();
   aarch64_expand_vector_init_fallback (target, vals);
   rtx_insn *fallback_seq = get_insns ();
-  unsigned fallback_seq_cost = seq_cost (fallback_seq, !optimize_size);
+  unsigned fallback_seq_cost
+    = seq_cost_ignoring_scalar_moves (fallback_seq, !optimize_size);
   end_sequence ();
 
   emit_insn (seq_total_cost < fallback_seq_cost ? seq : fallback_seq);