Hi,

On Fri, Apr 19, 2013 at 09:27:28AM -0600, Jeff Law wrote:
> On 04/18/2013 05:08 PM, Martin Jambor wrote:
> >On Fri, Apr 19, 2013 at 12:37:58AM +0200, Steven Bosscher wrote:
> >>On Fri, Apr 19, 2013 at 12:09 AM, Martin Jambor wrote:
> >>>I also have not tried scheduling the hard register copy propagation
> >>>pass twice and measuring the impact on compile times.  Any suggestion
> >>>what might be a good testcase for that?
> >>
> >>I think a better question is when this would be useful in the first
> >>place, and why. In other words: If you propagate hardregs before
> >>shrink wrapping, what could be a source of new opportunities after
> >>shrink wrapping?
> >
> >Yes, we also did that and neither I nor Honza could think of any
> >potential problems there.  And of course, I'd also measure how many
> >statements the second run of the pass changed.  I'll probably do that
> >tomorrow anyway.
> I'd be very curious to see those numbers.  While I tend to think the
> opportunities missed by just running it early will be in the noise
> and nothing we can or should do anything about given the
> compile-time cost of running it twice.  However, experience has
> shown it's worth doing the investigative work to be sure.
> 

Here they are.  First, I simply looked at how many instructions would
be changed by a second run of the pass in its current position during
a C and C++ bootstrap (the placement is sketched after the table):

    |                                     | Insns changed |      % |
    |-------------------------------------+---------------+--------|
    | Trunk - only pass in original place |        172608 | 100.00 |
    | First pass before pro/epilogue      |        170322 |  98.68 |
    | Second pass in the original place   |          8778 |   5.09 |
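
For reference, a minimal sketch of how the two runs were placed in
the post-reload part of init_optimization_passes() in gcc/passes.c
(only a sketch, the surrounding context may of course look a bit
different on your tree; the pass manager creates a separate instance
when a pass is listed more than once):

    NEXT_PASS (pass_branch_target_load_optimize1);
    NEXT_PASS (pass_cprop_hardreg);  /* first run, before pro/epilogue */
    NEXT_PASS (pass_thread_prologue_and_epilogue);
    /* ... the passes in between stay as they are ... */
    NEXT_PASS (pass_regrename);
    NEXT_PASS (pass_cprop_hardreg);  /* second run, the original place */
    NEXT_PASS (pass_fast_rtl_dce);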

5% was worth investigating more.  The 20 source files with the
highest numbers of instructions affected by the second run were:

      939 mine/src/libgcc/config/libbid/bid_binarydecimal.c
      909 mine/src/libgcc/config/libbid/bid128_div.c
      813 mine/src/libgcc/config/libbid/bid64_div.c
      744 mine/src/libgcc/config/libbid/bid128_compare.c
      615 mine/src/libgcc/config/libbid/bid128_to_int32.c
      480 mine/src/libgcc/config/libbid/bid128_to_int64.c
      450 mine/src/libgcc/config/libbid/bid128_to_uint32.c
      408 mine/src/libgcc/config/libbid/bid128_fma.c
      354 mine/src/libgcc/config/libbid/bid128_to_uint64.c
      327 mine/src/libgcc/config/libbid/bid128_add.c
      246 mine/src/libgcc/libgcc2.c
      141 mine/src/libgcc/config/libbid/bid_round.c
      129 mine/src/libgcc/config/libbid/bid64_mul.c
      117 mine/src/libgcc/config/libbid/bid64_to_int64.c
       96 mine/src/libsanitizer/tsan/tsan_interceptors.cc
       96 mine/src/libgcc/config/libbid/bid64_compare.c
       87 mine/src/libgcc/config/libbid/bid128_noncomp.c
       84 mine/src/libgcc/config/libbid/bid64_to_bid128.c
       81 mine/src/libgcc/config/libbid/bid64_to_uint64.c
       63 mine/src/libgcc/config/libbid/bid64_to_int32.c

I have manually examined some of the late opportunities for
propagation in mine/src/libgcc/config/libbid/bid_binarydecimal.c and
the majority of them were a result of peephole2.

Still, the list of files showed that the numbers are probably skewed
quite a bit by the config sources of libraries which may have been
built many times over (I do not know exactly how many, but for
example I had multilib enabled, which changes things a lot).

So next I measured only the number of instructions changed during
make stage2-bubble with multilib disabled.  In order to find out
where the new opportunities come from, I scheduled an extra
pass_cprop_hardreg after every pass between
pass_branch_target_load_optimize1 and pass_fast_rtl_dce and counted
how many instructions each of them modified, relative to just having
the pass where it is now (the setup is sketched after the table):

    |                                          | Insns changed |      % |
    |------------------------------------------+---------------+--------|
    | Trunk - only pass in original place      |         76225 | 100.00 |
    |------------------------------------------+---------------+--------|
    | Before pro/epilogue                      |         77906 | 102.21 |
    | After pro/epilogue                       |           267 |   0.35 |
    | After pass_rtl_dse2                      |             0 |   0.00 |
    | After pass_stack_adjustments             |             0 |   0.00 |
    | After pass_jump2                         |           372 |   0.49 |
    | After pass_peephole2                     |           119 |   0.16 |
    | After pass_if_after_reload               |            37 |   0.05 |
    | After pass_regrename - original position |             0 |   0.00 |
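
The instrumented pass list for this measurement looked roughly like
the following (again only a sketch of the relevant part of
init_optimization_passes() in gcc/passes.c, with an extra instance of
the pass after each step, each counted separately):

    NEXT_PASS (pass_branch_target_load_optimize1);
    NEXT_PASS (pass_cprop_hardreg);   /* before pro/epilogue */
    NEXT_PASS (pass_thread_prologue_and_epilogue);
    NEXT_PASS (pass_cprop_hardreg);   /* after pro/epilogue */
    NEXT_PASS (pass_rtl_dse2);
    NEXT_PASS (pass_cprop_hardreg);
    NEXT_PASS (pass_stack_adjustments);
    NEXT_PASS (pass_cprop_hardreg);
    NEXT_PASS (pass_jump2);
    NEXT_PASS (pass_cprop_hardreg);
    NEXT_PASS (pass_peephole2);
    NEXT_PASS (pass_cprop_hardreg);
    NEXT_PASS (pass_if_after_reload);
    NEXT_PASS (pass_cprop_hardreg);
    NEXT_PASS (pass_regrename);
    NEXT_PASS (pass_cprop_hardreg);   /* the original position */
    NEXT_PASS (pass_fast_rtl_dce);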

These numbers look much better.  The 12 source files with the most
instructions changed were now:

      116 src/libgcc/libgcc2.c
       64 src/libsanitizer/tsan/tsan_interceptors.cc
       36 src/libsanitizer/tsan/tsan_fd.cc
       31 src/gcc/cp/parser.c
       20 cp-demangle.c
       19 src/libiberty/cp-demangle.c
       12 gtype-desc.c
       12 src/libgcc/unwind-dw2.c
       11 src/gcc/config/i386/i386.c
       10 src/gcc/gengtype.c
       10 src/gcc/dwarf2out.c
        9 src/gcc/fold-const.c

I'm not sure what the conclusion is.  Probably that there are cases
where doing propagation late can be a good thing but these do not
occur that often.  And that more measurements should probably be done.
Anyway, I'll look into the alternatives (see below) before pushing
this further.

By the way, scheduling pass_cprop_hardreg after pass_jump2 or
pass_peephole2 (or both) and doing a full bootstrap leads to bootstrap
miscompares.  I have not examined why.



> >>
> >>
> >>But wouldn't it be better to avoid these argument-register pseudos
> >>being assigned to callee-saved registers? Perhaps splitting the live
> >>range of the pseudos before the first call on each path will do the
> >>trick, and let IRA pick the right registers for you instead.
> Isn't one of the difficulties here that the pseudo might correspond
> to an argument that wasn't passed in a register?  Thus you need
> alias analysis to know if it's valid to sink the load?
> 
> At least that's one of the issues I recall when I looked at this a
> couple years ago.
> 
> If we constrain ourselves to just sinking argreg->pseudo copies then
> we can obviously avoid that problem.
> 
> Rather than necessarily looking at this as a range splitting
> problem, can it be looked as a sinking problem?  Ultimately what we
> want is to sink those annoying arg->pseudo setups.  It seems like
> it'd be a fairly simple dataflow problem to determine those points.
> 
> >
> >First, where can I have a look how a live range is split?  ;-)
> There's been several implementations through the years; none that
> I'd say is suitable for reuse.

I have looked at the patch Vlad suggested (most things are new to me
in RTL land, so almost everything takes me ages) and I'm certainly
willing to try to mimic some of it in order to (hopefully) get the
same effect that propagation and the shrink-wrapping preparation
moves achieve.  True, this is not enough to deal with parameters
loaded from the stack, but unlike inserting the loads at the latest
possible point, it could also work when the parameters are used on
the fast path as well, which is often the case.  In fact, propagation
helps exactly because they are used in the entry BB.  Hopefully the
parameters will end up in caller-saved registers on the fast path and
will be flipped over to the problematic callee-saved ones only on
(slow) paths going through calls.
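
To illustrate the kind of function I have in mind (a made-up example,
the names are of course hypothetical): the parameter arrives in a
caller-saved argument register and is used on the call-free fast
path, so ideally only the slow path should pay for moving it into a
callee-saved register:

    struct node { int key; int val; int fallback; };
    extern void report_mismatch (int key);

    int
    get_val (struct node *p, int key)
    {
      if (p->key == key)
        return p->val;      /* fast path: no call, no prologue needed */

      /* Slow path: 'p' is live across the call, so it wants to be in
         a callee-saved register here -- but ideally only here.  */
      report_mismatch (key);
      return p->fallback;
    }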

Of course, the two approaches are not mutually exclusive and load
sinking might help too.

Thanks a lot for all the suggestions,

Martin
