https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91569
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2020-01-29
          Component|rtl-optimization            |target
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to cubitect from comment #0)
> I wasn't entirely sure where to post this, but I have a very simple test
> problem that shows some missed optimisation potential. The task is to cast
> an integer to a long and replace the second-lowest byte of the result with
> a constant (4). Below are three ways to achieve this:
>
> long opt_test1(int num)      // opt_test1:
> {                            //   movslq %edi, %rax
>     union {                  //   movb   $4, %ah
>         long q;              //   ret
>         struct { char l,h; };
>     } a;
>     a.q = num;
>     a.h = 4;
>     return a.q;
> }
>
> The union here is modelled after the structure of an r?x register, which
> contains the low- and high-byte registers ?l and ?h. The cast and the
> second-byte assignment can each be done in a single instruction. The
> optimiser understands this and emits the optimal sequence.
>
> long opt_test2(int num)      // opt_test2:
> {                            //   movl %edi, %eax
>     long a = num;            //   xor  %ah, %ah
>     a &= (-1UL ^ 0xff00);    //   orb  $4, %ah
>     a |= (4 << 8);           //   cltq
>     return a;                //   ret
> }
>
> This solution, based on a bitwise AND and OR, is interesting. The optimiser
> recognises that I am interested in the second byte and makes use of the
> 'ah' register, but why is there a XOR and an OR rather than a single,
> equivalent MOV? Similarly, the (MOV + CLTQ) pair could be replaced outright
> with MOVSLQ. Notably, some older versions (such as "gcc-4.8.5 -O3") give
> results that correspond more closely to the C code:
>
>     andl    $-65281, %edi
>     orl     $1024, %edi
>     movslq  %edi, %rax
>     ret
>
> which is actually better than the output of gcc-9.2.

This one looks like a target issue to me.  We end up using

(insn 20 18 21 2 (parallel [
            (set (zero_extract:SI (reg:SI 0 ax [89])
                    (const_int 8 [0x8])
                    (const_int 8 [0x8]))
                (subreg:SI (xor:QI (subreg:QI (zero_extract:SI (reg:SI 0 ax [89])
                                (const_int 8 [0x8])
                                (const_int 8 [0x8])) 0)
                        (subreg:QI (zero_extract:SI (reg:SI 0 ax [89])
                                (const_int 8 [0x8])
                                (const_int 8 [0x8])) 0)) 0))
            (clobber (reg:CC 17 flags))
        ]) "t2.c":5:7 520 {*xorqi_ext_2}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))

via split2.

> long opt_test3(int num)      // opt_test3:
> {                            //   movslq %edi, %rdi
>     long a = num;            //   movq   %rdi, -8(%rsp)
>     ((char*)&a)[1] = 4;      //   movb   $4, -7(%rsp)
>     return a;                //   movq   -8(%rsp), %rax
> }                            //   ret

Here we could (and currently fail to) rewrite the store into a
BIT_INSERT_EXPR in update_address_taken.  The stack approach suffers from
store-to-load-forwarding (STLF) stalls on modern CPUs.  That part could be
split out into a separate bug (it would be tree-optimization).  Keeping this
issue for the weird splitter.
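
For concreteness, here is a sketch (not an actual GCC dump) of the GIMPLE
that the suggested update_address_taken rewrite might produce for opt_test3.
BIT_INSERT_EXPR takes <container, replacement value, bit position>, with the
number of replaced bits given by the type of the replacement value; the SSA
names and dump syntax below are only an approximation:

  long opt_test3 (int num)
  {
    long a;

    a_2 = (long) num_1(D);                ;; the sign-extending cast
    a_3 = BIT_INSERT_EXPR <a_2, 4, 8>;    ;; insert 8-bit 4 at bits 8..15
    return a_3;
  }

With the store expressed this way the variable no longer has to live in
memory, so the stack round-trip (and the associated STLF stall) goes away,
and the backend should then be free to pick the same movslq + movb sequence
it already emits for opt_test1.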