https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77499
Bug ID: 77499 Summary: Regression after code-hoisting, due to combine pass failing to evaluate known value range Product: gcc Version: 7.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: avieira at gcc dot gnu.org Target Milestone: --- Hello, We are seeing a regression for arm-none-eabi on a Cortex-M7. This regression was observed after Biener's and Bosscher's GIMPLE code-hoisting patch (https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00360.html). The example below will illustrate the regression: $ cat t.c unsigned short foo (unsigned short x, int c, int d, int e) { unsigned short y; while (c > d) { if (c % 3) { x = (x >> 1) ^ 0xB121U; } else x = x >> 1; c-= e; } return x; } Comparing: $ arm-none-eabi-gcc -mcpu=cortex-m7 -mthumb -O2 -S t.c vs $ arm-none-eabi-gcc -mcpu=cortex-m7 -mthumb -O2 -S t.c -fno-code-hoisting Will illustrate that the code-hoisting version has an extra zero_extension of HImode to SImode. After some digging I found out that during the combine phase, the no-code-hoisting version is able to recognize that the 'last_set_nonzero_bits' are 0x7fff whereas for the code-hoisted version it seems to have lost this knowledge. Looking at the graph dump for the no-code-hoisting t.c.246r.ud_dce.dot I see the following insns: 23: r125:SI=r113:SI 0>>0x1 24: r111:SI=zero_extend(r125:SI#0) 27: r128:SI=r111:SI^r131:SI 28: r113:SI=zero_extend(r128:SI#0) These are all in the same basic block. For the code-hoisting version we have: BB A: ... 12: r116:SI=r112:SI 0>>0x1 13: r112:SI=zero_extend(r116:SI#0) ... BB B: 27: r127:SI=r112:SI^r129:SI 28: r112:SI=zero_extend(r127:SI#0) Now from what I have observed by debugging the combine pass is that combine will first combine instructions 23 and 24. Here combine is able to optimize away the zero_extend, because in 'reg_nonzero_bits_for_combine' the reg_stat[113] has its 'last_set_value' to 'r0' (the unsigned short argument) and its corresponding 'last_set_nonzero_bits' to 0xffffffff0000ffff. Which means the zero_extend is pointless. The code-hoisting version also combines 12 and 13, optimizing away the zero_extend. However, the next set of instructions is where things get tricky. In the no-code-hoisting case it will end up combining all 4 instructions one by one from the top down and it will end up figuring out that the last zero_extend is also not required. For the code-hoisting case, when it tries to combine 28 with 27 (remember they are not in the basic block as 13 and 14, so it will never try to combine all 4), it will eventually try to evaluate the nonzero bits based on r112 and see that the last_set_value for r112 is: (lshiftrt:SI (clobber:SI (const_int 0 [0])) (const_int 1 [0x1])) The last_set_nonzero_bits will be 0x7fffffff, instead of the expected 0x00007fff. This looks like the result of the code in 'record_value_for_reg' in combine.c that sits bellow the comment: /* The value being assigned might refer to X (like in "x++;"). In that case, we must replace it with (clobber (const_int 0)) to prevent infinite loops. */ Given that 12 and 13 were combined into: r112:SI=r112:SI 0>>0x1 This seems to be invalidating the last_set_value and thus leaving us with a weaker 'last_set_nonzero_bits' which isn't enough to optimize away the zero_extend. Any clue on how to "fix" this? Cheers, Andre PS: I am not sure I completely understand the way the last_set_value stuff works for pseudo's in combine, but it looks to me like each instruction is visited in a top down order per basic block again in a top-down order. And each pseudo will have its 'last_set_value' according to the last time that register was seen being set, without any regards to loop or proper dataflow analysis. Can anyone explain to me how this doesnt go horribly wrong?