+ /* In order to find out if the loop is of type A or B above look for the
+ loop counter: it will either be incrementing by one per iteration or
+ it will be decrementing by num_of_lanes. We can find the loop counter
+ in the condition at the end of the loop. */
+ rtx_insn *loop_cond = prev_nonnote_nondebug_insn_bb (BB_END (body));
+ gcc_assert (cc_register (XEXP (PATTERN (loop_cond), 0), VOIDmode)
+ && GET_CODE (XEXP (PATTERN (loop_cond), 1)) == COMPARE);
Not sure this should be an assert. If we do encounter a differently
formed loop, we should bail out of DLSTPing for now but we shouldn't ICE.
+ /* The loop latch has to be empty. When compiling all the known MVE
LoLs in
+ user applications, none of those with incrementing counters had
any real
+ insns in the loop latch. As such, this function has only been
tested with
+ an empty latch and may misbehave or ICE if we somehow get here with an
+ increment in the latch, so, for sanity, error out early. */
+ rtx_insn *dec_insn = BB_END (body->loop_father->latch);
+ if (NONDEBUG_INSN_P (dec_insn))
+ gcc_unreachable ();
Similarly here I'd return false rather than gcc_unreachable ();
+ /* Find where both of those are modified in the loop body bb. */
+ rtx condcount_reg_set = PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
+ (body, REGNO (condcount))));
Put = on newline, breaks it down nicer.
+ counter_orig_set = XEXP (PATTERN
+ (DF_REF_INSN
+ (DF_REF_NEXT_REG
+ (DF_REG_DEF_CHAIN
+ (REGNO
+ (XEXP (condcount_reg_set, 0)))))),
1);
This makes me a bit nervous, can we be certain that the PATTERN of the
next insn that sets it is indeed a set. Heck can we even be sure
DF_REG_DEF_CHAIN returns a non-null, I can't imagine why not but maybe
there are some constructs it can't follow-up on? Might just be worth
checking these steps and bailing out.
+ /* When we find the vctp instruction: This may be followed by
+ a zero-extend insn to SImode. If it is, then save the
+ zero-extended REG into vctp_vpr_generated. If there is no
+ zero-extend, then store the raw output of the vctp.
+ For any VPT-predicated instructions we need to ensure that
+ the VPR they use is the same as the one given here and
+ they often consume the output of a subreg of the SImode
+ zero-extended VPR-reg. As a result, comparing against the
+ output of the zero-extend is more likely to succeed.
+ This code also guarantees to us that the vctp comes before
+ any instructions that use the VPR within the loop, for the
+ dlstp/letp transform to succeed. */
Wrong comment indent after first line.
+ rtx_insn *vctp_insn = arm_mve_get_loop_vctp (body);
+ if (!vctp_insn || !arm_mve_loop_valid_for_dlstp (body))
+ return GEN_INT (1);
arm_mve_loop_valid_for_dlstp already calls arm_mve_get_loop_vctp, maybe
have 'arm_mve_loop_valid_for_dlstp' return vctp_insn or null to
determine success or failure, avoids looping through the BB again.
For the same reason I'd also pass vctp_insn down to
'arm_mve_check_df_chain_back_for_implic_predic'.
+ if (GET_CODE (SET_SRC (single_set (next_use1))) == ZERO_EXTEND)
+ {
+ rtx_insn *next_use2 = NULL;
Are we sure single_set can never return 0 here? Maybe worth an extra
check and bail out if it does?
+ /* If the insn pattern requires the use of the VPR value from the
+ vctp as an input parameter. */
s/an an input parameter./as an input parameter for predication./
+ /* None of registers USE-d by the instruction need can be the VPR
+ vctp_vpr_generated. This blocks the optimisation if there any
+ instructions that use the optimised-out VPR value in any way
+ other than as a VPT block predicate. */
Reword this slightly to be less complex:
This instruction USE-s the vctp_vpr_generated other than for
predication, this blocks the transformation as we are not allowed to
optimise the VPR value away.
Will continue reviewing next week :)
On 15/06/2023 12:47, Stamatis Markianos-Wright via Gcc-patches wrote:
Hi all,
This is the 2/2 patch that contains the functional changes needed
for MVE Tail Predicated Low Overhead Loops. See my previous email
for a general introduction of MVE LOLs.
This support is added through the already existing loop-doloop
mechanisms that are used for non-MVE dls/le looping.
Mid-end changes are:
1) Relax the loop-doloop mechanism in the mid-end to allow for
decrement numbers other that -1 and for `count` to be an
rtx containing a simple REG (which in this case will contain
the number of elements to be processed), rather
than an expression for calculating the number of iterations.
2) Added a new df utility function: `df_bb_regno_only_def_find` that
will return the DEF of a REG only if it is DEF-ed once within the
basic block.
And many things in the backend to implement the above optimisation:
3) Implement the `arm_predict_doloop_p` target hook to instruct the
mid-end about Low Overhead Loops (MVE or not), as well as
`arm_loop_unroll_adjust` which will prevent unrolling of any loops
that are valid for becoming MVE Tail_Predicated Low Overhead Loops
(unrolling can transform a loop in ways that invalidate the dlstp/
letp tranformation logic and the benefit of the dlstp/letp loop
would be considerably higher than that of unrolling)
4) Appropriate changes to the define_expand of doloop_end, new
patterns for dlstp and letp, new iterators, unspecs, etc.
5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
* `arm_mve_dlstp_check_dec_counter`
* `arm_mve_dlstp_check_inc_counter`
* `arm_mve_check_reg_origin_is_num_elems`
* `arm_mve_check_df_chain_back_for_implic_predic`
* `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
This all, in smoe way or another, are running checks on the loop
structure in order to determine if the loop is valid for dlstp/letp
transformation.
6) `arm_attempt_dlstp_transform`: (called from the define_expand of
doloop_end) this function re-checks for the loop's suitability for
dlstp/letp transformation and then implements it, if possible.
7) Various utility functions:
*`arm_mve_get_vctp_lanes` to map
from vctp unspecs to number of lanes, and
`arm_get_required_vpr_reg`
to check an insn to see if it requires the VPR or not.
* `arm_mve_get_loop_vctp`
* `arm_mve_get_vctp_lanes`
* `arm_emit_mve_unpredicated_insn_to_seq`
* `arm_get_required_vpr_reg`
* `arm_get_required_vpr_reg_param`
* `arm_get_required_vpr_reg_ret_val`
* `arm_mve_vec_insn_is_predicated_with_this_predicate`
* `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
No regressions on arm-none-eabi with various targets and on
aarch64-none-elf. Thoughts on getting this into trunk?
Thank you,
Stam Markianos-Wright
gcc/ChangeLog:
* config/arm/arm-protos.h (arm_target_insn_ok_for_lob):
Rename to...
(arm_target_bb_ok_for_lob): ...this
(arm_attempt_dlstp_transform): New.
* config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
(TARGET_PREDICT_DOLOOP_P): New.
(arm_block_set_vect):
(arm_target_insn_ok_for_lob): Rename from
arm_target_insn_ok_for_lob.
(arm_target_bb_ok_for_lob): New.
(arm_mve_get_vctp_lanes): New.
(arm_get_required_vpr_reg): New.
(arm_get_required_vpr_reg_param): New.
(arm_get_required_vpr_reg_ret_val): New.
(arm_mve_get_loop_vctp): New.
(arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
(arm_mve_vec_insn_is_predicated_with_this_predicate): New.
(arm_mve_check_df_chain_back_for_implic_predic): New.
(arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
(arm_mve_check_reg_origin_is_num_elems): New.
(arm_mve_dlstp_check_inc_counter): New.
(arm_mve_dlstp_check_dec_counter): New.
(arm_mve_loop_valid_for_dlstp): New.
(arm_predict_doloop_p): New.
(arm_loop_unroll_adjust): New.
(arm_emit_mve_unpredicated_insn_to_seq): New.
(arm_attempt_dlstp_transform): New.
* config/arm/iterators.md (DLSTP): New.
(mode1): Add DLSTP mappings.
* config/arm/mve.md (*predicated_doloop_end_internal): New.
(dlstp<mode1>_insn): New.
* config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
* config/arm/unspecs.md: New unspecs.
* df-core.cc (df_bb_regno_only_def_find): New.
* df.h (df_bb_regno_only_def_find): New.
* loop-doloop.cc (doloop_condition_get): Relax conditions.
(doloop_optimize): Add support for elementwise LoLs.
gcc/testsuite/ChangeLog:
* gcc.target/arm/lob.h: Update framework.
* gcc.target/arm/lob1.c: Likewise.
* gcc.target/arm/lob6.c: Likewise.
* gcc.target/arm/mve/dlstp-compile-asm.c: New test.
* gcc.target/arm/mve/dlstp-int16x8.c: New test.
* gcc.target/arm/mve/dlstp-int32x4.c: New test.
* gcc.target/arm/mve/dlstp-int64x2.c: New test.
* gcc.target/arm/mve/dlstp-int8x16.c: New test.
* gcc.target/arm/mve/dlstp-invalid-asm.c: New test.