https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115144
--- Comment #1 from Hans-Peter Nilsson <hp at gcc dot gnu.org> ---

I also ran a round compiled with -fno-ivopts -fno-delayed-branch. I added the latter because delay-slot filling is somewhat non-linear in finding opportunities (a lack of "luck" can negate improvements elsewhere), and the former because the commit mentioned it as similarly messing things up.

That "fixed" all of the performance drop for random_bitstring, but still left an almost-as-large performance drop in main in gcc.c-torture/execute/arith-rand-ll.c. IOW, the net performance drop is 1.25%:

r15-0517: Basic clock cycles, total @: 13662157
r15-0518: Basic clock cycles, total @: 13832953

The focus of this bug was on the subset of arith-rand-ll.c that is in gcc.target/cris/pr93372-47.c (i.e. no main function). If I keep that focus, the gist of this PR should instead shift to something like 50% "r15-518 doesn't play nice with ivopts", but I guess that's already known. So if anyone's interested in improving r15-518 (but not in the ivopts interaction), I'd suggest looking at what happens in the main function of gcc.c-torture/execute/arith-rand-ll.c.

Having said that, I did compile gcc.target/cris/pr93372-47.c adding -fno-ivopts -fdump-tree-optimized, and it shows that the tot_bits computation ("tot_bits_13 = tot_bits_8 + n_bits_12;") is moved later, right before its use in a conditional. That makes me think the delayed-branch scheduling has less "material" to fill the first delay slots with.

I also compiled gcc.c-torture/execute/arith-rand-ll.c with -fno-ivopts -fdump-tree-optimized (plus the usual -O2 -march=v10) and will attach the tree-dump files. They show the pr93372-47.c change *and* that several division operations are moved forward. This separates them from the modulus operations on the same values, so I guess that on targets where computing these values together is a win (not CRIS), we'll see a performance loss.
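To make the division/modulus point concrete, here is a minimal C sketch. It is not code from arith-rand-ll.c, and the function names divmod_separate, divmod_combined and check_same are hypothetical. It contrasts a quotient and a remainder written as two separate operators on the same operands with the standard library's lldiv, which returns both from one call. On targets with a combined divide/remainder instruction or library routine, the compiler can sometimes pair the two operators into a single operation. That pairing is presumably what gets harder once the division is moved away from its modulus.

```c
#include <assert.h>
#include <stdlib.h>   /* lldiv, lldiv_t */

/* Hypothetical helper: quotient and remainder of the same operands,
   written as two separate operators (the shape seen in the tree dumps). */
void divmod_separate(long long a, long long b, long long *q, long long *r)
{
  *q = a / b;   /* division */
  *r = a % b;   /* modulus on the same values */
}

/* Hypothetical helper: the same values obtained together from one
   standard-library call. */
void divmod_combined(long long a, long long b, long long *q, long long *r)
{
  lldiv_t d = lldiv(a, b);
  *q = d.quot;
  *r = d.rem;
}

/* Checks that both forms agree; requires b != 0 and not
   (a == LLONG_MIN && b == -1), which would overflow. */
void check_same(long long a, long long b)
{
  long long q1, r1, q2, r2;
  divmod_separate(a, b, &q1, &r1);
  divmod_combined(a, b, &q2, &r2);
  assert(q1 == q2 && r1 == r2);
}
```

Both forms truncate toward zero (C99), so for a = -7 and b = 2 each yields quotient -3 and remainder -1.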