On Fri, Nov 9, 2018 at 9:23 AM Sudakshina Das <sudi....@arm.com> wrote:
>
> Hi
>
> I am posting this patch on behalf of Carey (cc'ed). I also have some
> review comments that I will make as a reply to this later.
>
>
> This implements a new AArch64 specific back-end pass that helps optimize
> branch-dense code, which can be a bottleneck for performance on some Arm
> cores. This is achieved by padding out the branch-dense sections of the
> instruction stream with nops.
>
> This has proven to show up to a 2.61%~ improvement on the Cortex A-72
> (SPEC CPU 2006: sjeng).
>
> The implementation includes the addition of a new RTX instruction class
> FILLER_INSN, which has been white listed to allow placement of NOPs
> outside of a basic block. This is to allow padding after unconditional
> branches. This is favorable so that any performance gained from
> diluting branches is not paid straight back via excessive eating of nops.
>
> It was deemed that a new RTX class was less invasive than modifying
> behavior in regards to standard UNSPEC nops.

Maybe you should split this up into two patches, one of the
FILLER_INSN part and one for the aarch64 parts for easier review for
the maintainers.

- extra_objs="aarch64-builtins.o aarch-common.o
cortex-a57-fma-steering.o aarch64-speculation.o
falkor-tag-collision-avoidance.o"
+ extra_objs="aarch64-builtins.o aarch-common.o
cortex-a57-fma-steering.o aarch64-speculation.o
falkor-tag-collision-avoidance.o aarch64-branch-dilution.o"
Also this seems like the line is already long but that does not mean
it should get even longer; maybe split it up like:
extra_objs="aarch64-builtins.o aarch-common.o cortex-a57-fma-steering.o"
extra_objs="${extra_objs} aarch64-speculation.o
falkor-tag-collision-avoidance.o aarch64-branch-dilution.o"

I think you should be using some anonymous namespaces for insn_info,
insn_granule and maybe a few more things too.  Since those are all
local to the pass only (someone could make the same mistake as you and
use those names and you run into ODR violations then).

You might want to use the new dump_printf_loc functions instead of
fprintf where you are printing into dump_file.

+unsigned MAX_BRANCH = 0;
+unsigned GRANULE_SIZE = 0;
Most likely these should not be captilized at all and definitely
either static or in anonymous namespaces.

+inline bool
+is_branch (rtx_insn *insn)

Likewise about static or in an anonymouse namespace.
Maybe it could be rewritten to be easier to understand:
{
  if (insn == NULL)
  return false;
return JUMP_P (insn) || CALL_P (insn) || ANY_RETURN_P (insn);
}

That is swap the order there.

+  /* Pointers to the first/last instructions in granule.  */
+  insn_info *m_first = NULL;
+  insn_info *m_last = NULL;

Even though C++14 is the default compiler in newer versions, we
support C++98 to compile GCC with.  So please fix that so we don't
depend on C++11 features.  Simple way is to test native bootstrapping
with GCC 4.8 on CentOS 7.

+  if (insn->is_unconditional)
+    {
+      value += 2;
+    }

I noticed in some places you use {} around one statement and others
not.  The style should without except when nested and it confusing
which else belongs to it.


+void
+insn_granule::update_indexes ()
This is missing a comment before the function.  Also this seems more
like fixup indexes rather than update.


+  if (MAX_BRANCH < 1)
+    error ("branch dilution: max branch must be greater than zero");
+  else if (MAX_BRANCH >= GRANULE_SIZE)
+    {
+      error ("branch dilution: max branches (%d) must be \
+       less than granule size (%d)", MAX_BRANCH, GRANULE_SIZE);
+    }

You should almost never use error in backend (there are a few
exceptions to that rule).  Either use sorry or internal_error.  Also
the way you wrapped the last error message is incorrect as the tabs
will show up in the error message.

Thanks,
Andrew Pinski

>
> ## Command Line Options
>
> Three new target-specific options are provided:
> - mbranch-dilution
> - mbranch-dilution-granularity={num}
> - mbranch-dilution-max-branches={num}
>
> A number of cores known to be able to benefit from this pass have been
> given default tuning values for their granularity and max-branches.
> Each affected core has a very specific granule size and associated
> max-branch limit. This is a microarchitecture specific optimization.
> Typical usage should be -mdilute-branches with a specificed -mcpu. Cores
> with a granularity tuned to 0 will be ignored. Options are provided for
> experimentation.
>
> ## Algorithm and Heuristic
>
> The pass takes a very simple 'sliding window' approach to the problem.
> We crawl through each instruction (starting at the first branch) and
> keep track of the number of branches within the current "granule" (or
> window). When this exceeds the max-branch value, the pass will dilute
> the current granule, inserting nops to push out some of the branches.
> The heuristic will favour unconditonal branches (for performance
> reasons), or branches that are between two other branches (in order to
> decrease the likelihood of another dilution call being needed).
>
> Each branch type required a different method for nop insertion due to
> RTL/basic_block restrictions:
>
> - Returning calls do not end a basic block so can be handled by emitting
> a generic nop.
> - Unconditional branches must be the end of a basic block, and nops
> cannot be outside of a basic block.
>    Thus the need for FILLER_INSN, which allows placement outside of a
> basic block - and translates to a nop.
> - For most conditional branches we've taken a simple approach and only
> handle the fallthru edge for simplicity,
>    which we do by inserting a "nop block" of nops on the fallthru edge,
> mapping that back to the original destination block.
> - asm gotos and pcsets are going to be tricky to analyse from a dilution
> perspective so are ignored at present.
>
>
> ## Changelog
>
> gcc/testsuite/ChangeLog:
>
> 2018-11-09  Carey Williams  <carey.willi...@arm.com>
>
>         * gcc.target/aarch64/branch-dilution-off.c: New test.
>         * gcc.target/aarch64/branch-dilution-on.c: New test.
>
>
> gcc/ChangeLog:
>
> 2018-11-09  Carey Williams  <carey.willi...@arm.com>
>
>         * cfgbuild.c (inside_basic_block_p): Add FILLER_INSN case.
>         * cfgrtl.c (rtl_verify_bb_layout): Whitelist FILLER_INSN outside
>         basic blocks.
>         * config.gcc (extra_objs): Add aarch64-branch-dilution.o.
>         * config/aarch64/aarch64-branch-dilution.c: New file.
>         * config/aarch64/aarch64-passes.def (branch-dilution): Register
>         pass.
>         * config/aarch64/aarch64-protos.h (struct tune_params): Declare
>         tuning parameters bdilution_gsize and bdilution_maxb.
>         (make_pass_branch_dilution): New declaration.
>         * config/aarch64/aarch64.c (generic_tunings,cortexa35_tunings,
>         cortexa53_tunings,cortexa57_tunings,cortexa72_tunings,
>         cortexa73_tunings,exynosm1_tunings,thunderxt88_tunings,
>         thunderx_tunings,tsv110_tunings,xgene1_tunings,
>         qdf24xx_tunings,saphira_tunings,thunderx2t99_tunings):
>         Provide default tunings for bdilution_gsize and bdilution_maxb.
>         * config/aarch64/aarch64.md (filler_insn): Define new insn.
>         * config/aarch64/aarch64.opt (mbranch-dilution,
>         mbranch-dilution-granularity,
>         mbranch-dilution-max-branches): Define new branch dilution
>         options.
>         * config/aarch64/t-aarch64 (aarch64-branch-dilution.c): New rule
>         for aarch64-branch-dilution.c.
>         * coretypes.h (rtx_filler_insn): New rtx class.
>         * doc/invoke.texi (mbranch-dilution,
>         mbranch-dilution-granularity,
>         mbranch-dilution-max-branches): Document branch dilution
>         options.
>         * emit-rtl.c (emit_filler_after): New emit function.
>         * rtl.def (FILLER_INSN): New RTL EXPR of type RTX_INSN.
>         * rtl.h (class GTY): New class for rtx_filler_insn.
>         (is_a_helper ::test): New test helper for rtx_filler_insn.
>         (macro FILLER_INSN_P(X)): New predicate.
>         * target-insns.def (filler_insn): Add target insn def.
>
> ### Testing
> - Successful compilation of 3 stage bootstrap with the pass forced on
> (for stage 2, 3)
> - No additional compilation failures (SPEC CPU 2006 and SPEC CPU 2017)
> - No 'make check' regressions
>
> Is this ok for trunk?
>
> Thanks
> Sudi

Reply via email to