Hi Andrew,

thanks for reviewing, I'll work on your comments. For now I'm just
replying to the high-level questions.

Andrew Pinski <pins...@gmail.com> writes:

> On Wed, Jul 22, 2020 at 3:10 AM Andrea Corallo <andrea.cora...@arm.com> wrote:
>>
>> Hi all,
>>
>> this second patch implements the AArch64 specific back-end pass
>> 'branch-dilution' controllable by the followings command line options:
>>
>> -mbranch-dilution
>>
>> --param=aarch64-branch-dilution-granularity={num}
>>
>> --param=aarch64-branch-dilution-max-branches={num}
>>
>> Some cores known to be able to benefit from this pass have been given
>> default tuning values for their granularity and max-branches. Each
>> affected core has a very specific granule size and associated max-branch
>> limit. This is a microarchitecture specific optimization. Typical
>> usage should be -mbranch-dilution with a specified -mcpu. Cores with a
>> granularity tuned to 0 will be ignored. Options are provided for
>> experimentation.
>
> Can you give a simple example of what this patch does?

Sure, this pass simply moves a sliding window over the insns, trying to
make sure that we never have more than 'max_branch' branches for every
'granule_size' insns. If too many branches are detected, nops are added
where they are considered least harmful to correct that.

There are obviously many scenarios where the compiler can generate
branch-dense pieces of code, but say we have the equivalent of:

====
.L389:
        bl      foo
        b       .L43
.L388:
        bl      foo
        b       .L42
.L387:
        bl      foo
        b       .L41
.L386:
        bl      foo
        b       .L40
====

Assuming granule size 4 and max branches 2, this will be transformed
into the equivalent of:

====
.L389:
        bl      foo
        b       .L43
        nop
        nop
.L388:
        bl      foo
        b       .L42
        nop
        nop
.L387:
        bl      foo
        b       .L41
        nop
        nop
.L386:
        bl      foo
        b       .L40
        nop
        nop
====

> Also your testcases seem too sensitive to other optimizations which
> could happen. E.g. the call to "branch (i)" could be pulled out of
> the switch statement. Or even the "*i += N;" could be moved to one
> Basic block and the switch becomes just one if statement.
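
(Going back to the sliding-window description above: purely as an
illustration, here is a toy Python model of the idea. It is not the code
in the patch. The names dilute, is_branch and BRANCH_MNEMONICS are made
up for this sketch, labels are left out, only "b" and "bl" are counted as
branches, and nops are simply placed right before the offending branch,
whereas the real pass picks the positions it considers least harmful.)

```python
# Toy model of the branch-dilution idea; NOT the GCC pass itself.
# Assumption: only these mnemonics count as branches in this sketch.
BRANCH_MNEMONICS = {"b", "bl"}


def is_branch(insn):
    """Return True if the instruction's mnemonic is counted as a branch."""
    return insn.split()[0] in BRANCH_MNEMONICS


def dilute(insns, granule_size, max_branches):
    """Pad with nops so that no run of granule_size consecutive insns
    contains more than max_branches branches."""
    if granule_size == 0:
        # Mirrors "granularity tuned to 0 will be ignored".
        return list(insns)
    if max_branches < 1:
        raise ValueError("max_branches must be at least 1")

    out = []
    for insn in insns:
        if is_branch(insn):
            # Look at the granule_size - 1 insns already emitted; together
            # with the new branch they form one full window.
            while True:
                start = max(0, len(out) - (granule_size - 1))
                if sum(is_branch(i) for i in out[start:]) < max_branches:
                    break
                out.append("nop")  # push older branches out of the window
        out.append(insn)
    return out


if __name__ == "__main__":
    code = ["bl foo", "b .L43", "bl foo", "b .L42"]
    for insn in dilute(code, granule_size=4, max_branches=2):
        print(insn)
```

With granule size 4 and max branches 2, this places two nops between
each bl/b pair, as in the example above. Unlike that example, it adds no
nops after the last branch, since it only looks at the insns it is given.
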
>
>> Observed performance improvements on Neoverse N1 SPEC CPU 2006 where
>> up to ~+3% (xalancbmk) and ~+1.5% (sjeng). Average code size increase
>> for all the testsuite proved to be ~0.4%.
>
> Also does this improve any non-SPEC benchmarks or has it only been
> benchmarked with SPEC?

So far I have only tried it on SPEC 2006. The transformation is not
benchmark-specific though, so other code may benefit from it.

Thanks

  Andrea