https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vmakarov at gcc dot gnu.org

--- Comment #20 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to rguent...@suse.de from comment #11)
> On Fri, 17 May 2019, marxin at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440
> > 
> > --- Comment #10 from Martin Liška <marxin at gcc dot gnu.org> ---
> > > So the only significant offender is module_configure.fppized.f90 file. Let
> > > me profile it.
> > 
> > Time profile before/after:
> > 
> > ╔══════════════════════════╤════════╤════════╤═════════╗
> > ║ PASS                     │ Before │ After  │ Change  ║
> > ╠══════════════════════════╪════════╪════════╪═════════╣
> > ║ backwards jump threading │ 6.29   │ 6.16   │ 97.93%  ║
> > ║ integrated RA            │ 6.76   │ 6.41   │ 94.82%  ║
> > ║ tree SSA incremental     │ 9.01   │ 11.16  │ 123.86% ║
> > ║ LRA create live ranges   │ 15.68  │ 40.02  │ 255.23% ║
> > ║ PRE                      │ 23.24  │ 32.32  │ 139.07% ║
> > ║ alias stmt walking       │ 27.69  │ 28.75  │ 103.83% ║
> > ║ phase opt and generate   │ 124.13 │ 163.95 │ 132.08% ║
> > ║ TOTAL                    │ 125.39 │ 165.17 │ 131.73% ║
> > ╚══════════════════════════╧════════╧════════╧═════════╝
> > 
> > Richi, do you want a perf report or do you come up with a patch that will
> > introduce the aforementioned params?
> 
> Can you share -fopt-info-loop differences?  From the above I would
> guess we split a lot of loops, meaning the memcpy/memmove/memset
> calls are in the "middle" and we have to split loops (how many
> calls are detected here?).  If that's true another way would be
> to only allow calls at head or tail position, thus a single
> non-builtin partition.

Some analysis, focusing on LRA live-range computation, shows that unpatched we have

lra live on 53 BBs for wrf_alt_nml_obsolete
lra live on 5 BBs for set_config_as_buffer
lra live on 5 BBs for get_config_as_buffer
lra live on 3231 BBs for initial_config
lra live on 3231 BBs for initial_config

while patched

lra live on 53 BBs for wrf_alt_nml_obsolete
lra live on 5 BBs for set_config_as_buffer
lra live on 5 BBs for get_config_as_buffer
lra live on 465 BBs for initial_config
lra live on 465 BBs for initial_config

so it's the initial_config function.  We need 8 DF worklist iterations
in both cases, but evidently the amount of local work is larger,
or the local work isn't linear in the size of the BBs.  The "work"
it does to avoid updating hard registers by ANDing with
~all_hard_regs_bitmap seems somewhat pointless, unless the functions
do not handle those correctly.  But that's micro-optimizing, as would be
adding a bitmap_ior_and_compl_and_compl function to avoid the temporary
bitmap in live_trans_fun.
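A minimal sketch of such a fused operation, using plain uint64_t word
arrays instead of GCC's bitmap type (the name and signature merely
mirror the existing bitmap_ior_and_compl convention; this is not the
proposed GCC implementation):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Illustrative sketch only: compute DST = A | (B & ~C & ~D) in one
   pass over the words, avoiding the temporary bitmap that the
   two-step TMP = B & ~C & ~D; DST = A | TMP would need.
   Returns true if DST changed.  */
static bool
bitmap_ior_and_compl_and_compl (uint64_t *dst, const uint64_t *a,
                                const uint64_t *b, const uint64_t *c,
                                const uint64_t *d, size_t nwords)
{
  bool changed = false;
  for (size_t i = 0; i < nwords; i++)
    {
      uint64_t v = a[i] | (b[i] & ~c[i] & ~d[i]);
      if (v != dst[i])
        {
          dst[i] = v;
          changed = true;
        }
    }
  return changed;
}
```

Besides dropping the temporary, the fused loop touches each word once,
which matters when the transfer function runs on every DF iteration.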

perf tells us most time is spent in process_bb_lives, though, not in
the dataflow problem, and there in ix86_hard_regno_call_part_clobbered
(the function has a _lot_ of calls...).
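One micro-optimization sketch for that hot spot: assuming the hook is,
for a fixed target, a pure function of (regno, mode), its results could
be memoized so a function with thousands of calls pays the cost once
per (regno, mode) pair.  Everything below is a placeholder (the
predicate, the table sizes), not the i386 backend:

```c
#include <stdbool.h>
#include <string.h>

#define NUM_REGS  76   /* illustrative sizes, not i386.h values */
#define NUM_MODES 16

/* Stand-in for the expensive target hook.  */
static bool
expensive_part_clobbered_p (unsigned regno, unsigned mode)
{
  return regno % 3 == 0 && mode % 2 == 1;  /* placeholder predicate */
}

/* Memoization table: -1 = not yet computed, else 0/1.  */
static signed char cache[NUM_REGS][NUM_MODES];

static void
init_cache (void)
{
  memset (cache, -1, sizeof cache);
}

/* Look up the cached answer, computing it on first use.  */
static bool
cached_part_clobbered_p (unsigned regno, unsigned mode)
{
  if (cache[regno][mode] < 0)
    cache[regno][mode] = expensive_part_clobbered_p (regno, mode);
  return cache[regno][mode];
}
```

The cache would need invalidating whenever target options change; for
a single process_bb_lives walk that's a non-issue.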

Also, without pattern detection the lra_simple_p heuristic kicks in,
since we have a lot more BBs:

  /* If there are too many pseudos and/or basic blocks (e.g. 10K
     pseudos and 10K blocks or 100K pseudos and 1K blocks), we will
     use simplified and faster algorithms in LRA.  */
  lra_simple_p
    = (ira_use_lra_p
       && max_reg_num () >= (1 << 26) / last_basic_block_for_fn (cfun));

The code is auto-generated and large (I have a single source file using
no modules now but still too large and similar to SPEC to attach here),
so I wouldn't worry too much here.  The above magic constant should be
a --param though.
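The crossover implied by that heuristic is simple arithmetic: it trips
once pseudos * blocks reaches 2^26 (~67M), so the per-function pseudo
budget shrinks as the BB count grows.  A sketch (the function name is
made up; a --param would replace the 1 << 26 constant):

```c
/* Pseudo-register count at which the quoted lra_simple_p heuristic
   would switch to the simplified LRA algorithms, for a function with
   N_BLOCKS basic blocks.  */
static int
lra_simple_pseudo_threshold (int n_blocks)
{
  return (1 << 26) / n_blocks;
}
```

With the BB counts above, unpatched initial_config (3231 BBs) flips to
the simplified algorithms past 20770 pseudos, while the patched 465-BB
version tolerates 144320, so the distribution change also moves us
across this threshold.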
