Hi Jeff,
On 10/14/22 09:54, Jeff Law via Gcc wrote:
...
.L2:
xor a4,a4,a5
andi a4,a4,1
srli a3,a0,2
srli a5,a5,1
beq a4,zero,.L3
li a4,-24576 # 0xFFFF_A000
addi a4,a4,1 # 0xFFFF_A001
xor a5,a5,a4
zext.h a5,a5
.L3:
xor a3,a3,a5
andi a3,a3,1
srli a4,a0,3
srli a5,a5,1
beq a3,zero,.L4
li a3,-24576 # 0xFFFF_A000
addi a3,a3,1 # 0xFFFF_A001
...
...
I see that with small tests, cse1 is able to substitute the redundant
constant reg with the equivalent old reg.
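For instance, with a tiny straight-line variant along these lines (a
hand-written sketch, not the actual reduced test), there is no control flow
between the two uses and cse1 can substitute the register materialized for
the first use into the second:

/* Hypothetical small test: both uses of 0xA001 are in straight-line code,
   so cse1 can reuse the register loaded for the first use.  */
unsigned short two_uses(unsigned short a, unsigned short b)
{
    a ^= 0xA001;
    b ^= 0xA001;
    return (unsigned short)(a ^ b);
}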
I find it easier to reason about this stuff with a graphical CFG, so a
bit of ascii art...
2
/ \
3 ---> 4
/ \
5 ---> 6
Yeah, a picture is worth a thousand words :-)
Where BB4 corresponds to .L2 and BB6 corresponds to .L3. Evaluation of
the constants occurs in BB3 and BB5.
And evaluation here means use of the constant (vs. its definition)?
CSE isn't going to catch this. The best way to think about CSE's
capabilities is that it can work on extended basic blocks. An
extended basic block can have jumps out, but not jumps in. There are 3
EBBs in this code: (2,3), (4,5) and (6). So BB4 is in a different EBB
than BB3, and the evaluation in BB3 can't be used by CSE in the EBB
containing BB4 and BB5.
Thanks for the detailed explanation.
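If I've followed the EBB argument, one unrolled step maps onto the picture
roughly like this (my own approximation of the code involved, not the exact
source):

/* Approximate shape of one unrolled step (an assumption on my part):
     BB2: compute the test bit and the shift, then branch
     BB3: materialize 0xA001 (li/addi) and xor, reached only on one edge
     BB4: the join point with two incoming edges, where the next step's
          test begins; a new EBB starts here, so CSE working on the EBB
          (BB4, BB5) never sees the 0xA001 already loaded in BB3.  */
unsigned short step(unsigned short crc, unsigned short bit)
{
    unsigned int lsb = (crc ^ bit) & 1;   /* BB2 */
    crc = (unsigned short)(crc >> 1);     /* BB2: shift happens on both paths */
    if (lsb)
        crc ^= 0xA001;                    /* BB3: constant evaluation */
    return crc;                           /* BB4: join */
}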
PRE/GCSE is better suited for this scenario, but it has a critical
constraint. In particular our PRE formulation is never allowed to put
an evaluation of an expression on a path that didn't have one before. So
while there is clearly a redundancy on the path 2->3->4->5 (BB3 and BB5),
there is nowhere we could put an evaluation that would reduce the number
of evaluations on that path without introducing an evaluation on paths
that didn't have one. So consider 2->4->6. On that path there are zero
evaluations, so we can't place an eval in BB2 because that would introduce
evaluations on 2->4->6, which previously had none.
OK. How does PRE calculate all the possible paths to consider, say your
examples 2-3-4-5 and 2-4-6? Are those just indicative, or are they actually
the paths PRE computes for this case? Would there be more?
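For my own reference, enumerating the simple paths through the CFG above
(BB3 and BB5 being the blocks that evaluate the constant):

    2->3->4->5->6 : 2 evaluations (the redundancy)
    2->3->4->6    : 1 evaluation
    2->4->5->6    : 1 evaluation
    2->4->6       : 0 evaluations

which, if I've understood the constraint, is why neither BB2 nor BB4 is a
legal placement point: both also lie on 2->4->6, which currently has none.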
There isn't a great place in GCC to handle this right now. If the
constraints were relaxed in PRE, then we'd have a chance, but getting
the cost model right is going to be tough.
It would have been better (for this specific case) if loop unrolling were
not done so early. The tree pass cunroll is flattening the loop out and
leaving it to the rest of the tree/RTL passes to pick up the pieces and
remove any redundancies, if at all. It obviously needs to happen early if we
are injecting 7x more instructions, but that seems like a lot to unravel.
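For context, the loop being flattened is roughly of this shape (my own
reduction; the 0xA001 constant looks like a reflected CRC-16 polynomial, but
that's an assumption, not necessarily the exact testcase):

/* Assumed shape of the source loop (hypothetical reduction).  cunroll peels
   all 8 iterations, so the conditional xor with 0xA001, and with it the
   li/addi pair that materializes the constant, is replicated in every
   unrolled copy of the body.  */
unsigned short crc16_update(unsigned short crc, unsigned char data)
{
    for (int i = 0; i < 8; i++) {
        if ((crc ^ (data >> i)) & 1)
            crc = (unsigned short)((crc >> 1) ^ 0xA001);
        else
            crc = (unsigned short)(crc >> 1);
    }
    return crc;
}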
FWIW -fno-unroll-loops only seems to have an effect at -O2; at -O3 the loop
is always unrolled. Is that expected?
If this seems worthwhile and you have ideas on how to do this better, I'd
be happy to work on it with some guidance.
Thx,
-Vineet