https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115144

--- Comment #1 from Hans-Peter Nilsson <hp at gcc dot gnu.org> ---
I also ran a round compiled with -fno-ivopts -fno-delayed-branch: the latter
because it's somewhat non-linear in finding delay-slot-filling opportunities
(lack of "luck" causing improvements to negate) and the former because it was
mentioned in the commit as similarly messing things up.

That "fixed" all of the performance drop for random_bitstring, but still left
an almost-as-large performance drop in main in
gcc.c-torture/execute/arith-rand-ll.c. IOW, the net performance drop is 1.25%:

r15-0517:
Basic clock cycles, total @: 13662157

r15-0518:
Basic clock cycles, total @: 13832953
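
The 1.25% figure follows directly from the two cycle counts above. A minimal
C sketch of the arithmetic (the function name slowdown_percent is made up for
illustration and is not part of any test):

```c
/* Relative slowdown, in percent, between two cycle counts.
   Hypothetical helper, used only to illustrate the arithmetic. */
double slowdown_percent(unsigned long before, unsigned long after)
{
    return 100.0 * ((double)after - (double)before) / (double)before;
}
```

Calling slowdown_percent(13662157UL, 13832953UL) gives roughly 1.25, i.e. a
difference of 170796 cycles relative to the r15-0517 total.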

The focus of this bug was on the subset of arith-rand-ll.c that is in
gcc.target/cris/pr93372-47.c (i.e. no main function), so if I keep that, the
gist of this PR should instead shift to something like 50% "r15-518 doesn't
play nice with ivopts" but I guess that's already known.

So if anyone's interested in improving r15-518 (but not in ivopts interaction),
I'd suggest that'd be in what happens in the main function for
gcc.c-torture/execute/arith-rand-ll.c.

Having said that, I did compile gcc.target/cris/pr93372-47.c adding -fno-ivopts
-fdump-tree-optimized and it shows that the tot_bits computation ("tot_bits_13
= tot_bits_8 + n_bits_12;") is moved later, right before it's used in a
conditional, which makes me think the delayed-branch scheduling has less
"material" to fill the first delay slots.
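
To illustrate the placement difference described above, here is a simplified
C sketch. It is not the testsuite source: next_rand, bitstring_early,
bitstring_late and the limit 70 are hypothetical stand-ins. Both functions
return the same value; only the position of the tot_bits update differs. An
optimizing compiler may well turn both into identical code, so this shows the
shape of the change seen in the dump, not a way to reproduce it.

```c
/* Hypothetical stand-in for the random source; not the testsuite code. */
static unsigned long long next_rand(unsigned long long *seed)
{
    *seed = *seed * 1103515245ULL + 12345ULL;
    return *seed >> 8;
}

/* tot_bits is updated as soon as n_bits is known, so the addition is
   available as independent work ahead of the first branch. */
unsigned long long bitstring_early(unsigned long long seed)
{
    unsigned long long x = 0;
    int tot_bits = 0;

    for (;;) {
        unsigned long long ran = next_rand(&seed);
        int n_bits = (int)((ran >> 1) % 16);

        tot_bits += n_bits;            /* early placement */
        if (n_bits == 0)
            return x;
        x <<= n_bits;
        if (ran & 1)
            x |= (1ULL << n_bits) - 1;
        if (tot_bits > 70)
            return x;
    }
}

/* Same computation, but tot_bits is updated right before the conditional
   that uses it, mirroring the placement seen in the optimized dump. */
unsigned long long bitstring_late(unsigned long long seed)
{
    unsigned long long x = 0;
    int tot_bits = 0;

    for (;;) {
        unsigned long long ran = next_rand(&seed);
        int n_bits = (int)((ran >> 1) % 16);

        if (n_bits == 0)
            return x;
        x <<= n_bits;
        if (ran & 1)
            x |= (1ULL << n_bits) - 1;
        tot_bits += n_bits;            /* late placement */
        if (tot_bits > 70)
            return x;
    }
}
```

The late variant is valid because tot_bits is never read on the n_bits == 0
exit path, which is presumably why the move is legal at the tree level.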

I also compiled gcc.c-torture/execute/arith-rand-ll.c with -fno-ivopts
-fdump-tree-optimized (plus the usual -O2 -march=v10) and will attach the
tree-dump files.  They show the pr93372-47.c change *and* that several
division operations are moved forward.  This separates them from the modulus
operations on the same values, so I guess that on targets where computing
these values together is a win (not CRIS), we'll see a performance loss.
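
As a sketch of why keeping the two operations together can matter: when the
quotient and remainder of the same operands are computed next to each other,
a target with a combined divide/modulus operation (or library routine) can
produce both results in one step. The struct and function names below are
hypothetical; lldiv is the standard C99 function from <stdlib.h>. Whether the
pairing is actually exploited is target-dependent and an assumption here.

```c
#include <stdlib.h>

/* Hypothetical result type for the illustration. */
struct quot_rem {
    long long quot;
    long long rem;
};

/* Quotient and remainder of the same operands, computed adjacently. */
struct quot_rem divmod_adjacent(long long a, long long b)
{
    struct quot_rem r;

    r.quot = a / b;
    r.rem = a % b;
    return r;
}

/* The same pairing expressed through the standard library. */
struct quot_rem divmod_lldiv(long long a, long long b)
{
    lldiv_t d = lldiv(a, b);
    struct quot_rem r;

    r.quot = d.quot;
    r.rem = d.rem;
    return r;
}
```

Both functions return the same values for any nonzero b (excluding the
overflowing LLONG_MIN / -1 case); if the division is hoisted far ahead of the
modulus, the pairing in the first form is what gets lost.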
