https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100922
Bug ID: 100922
Summary: CSE leads to fully redundant (back to back) zero-extending loads of the same thing in a loop, or a register copy
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---

Created attachment 50948
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50948&action=edit
redundant_zero_extend.c

It's rarely a good idea to load the same thing twice; it's generally better to copy a register, or to read the same register twice when a copy isn't needed. So the following asm should never happen, but it does with current trunk, and similar with GCC as old as 4.5:

        movzbl  (%rax), %edx
        movzbl  (%rax), %ecx    # no branch target between these instructions

or

        ldrb    w4, [x2]
        ldrb    w3, [x2], 1     # post-indexed *x2++

(This happens at -O3. With -O2 we get a redundant register copy instead, so either way a wasted instruction. There are also other differences earlier in the function with -O2 vs. -O3.)

https://godbolt.org/z/jT7WaWeK8 - minimal test case. x86-64 and AArch64 trunk show basically identical code structure.
x86-64 gcc (Compiler-Explorer-Build) 12.0.0 20210603 and aarch64-unknown-linux-gnu-gcc (GCC) 12.0.0 20210524

void remove_chars_inplace(char *str, const unsigned char keep_lut[256])
{
    while(keep_lut[(unsigned char)*str]){   // can be an if() and still repro
        str++;                              // keep_lut[0] is false
    }

    char *out = str;
    unsigned char c;            /* must be unsigned char for correctness. */
    do {
        c = *str++;
        unsigned char inc = keep_lut[c];    // unsigned long doesn't help
        *out = c;
        out += inc;             // inc=0 or 1 to let next char overwrite or not
    } while(c);
}

x86-64 asm:

remove_chars_inplace:
        jmp     .L8
.L3:                            # top of search loop for first char to remove
        addq    $1, %rdi
.L8:                            # loop entry point
        movzbl  (%rdi), %eax
        cmpb    $0, (%rsi,%rax) # un-laminates and doesn't macro-fuse ...
        jne     .L3

        cmpb    $0, (%rdi)      # 2nd loop body can be skipped if *str == 0
                                # should be test %al,%al - this char was already loaded.
        leaq    1(%rdi), %rax   # even -march=znver2 fails to move this earlier or later
                                # to allow cmp/je fusion. (Intel won't macro-fuse cmp imm,mem / jcc)
        je      .L1
.L5:                            # TOP OF 2ND LOOP
        movzbl  (%rax), %edx
        movzbl  (%rax), %ecx    # redundant load of *str
        addq    $1, %rax
        movzbl  (%rsi,%rdx), %edx   # inc = lut[c]
        movb    %cl, (%rdi)
        addq    %rdx, %rdi      # out += inc
        testb   %cl, %cl
        jne     .L5             # }while(c != 0)
.L1:
        ret

IDK if it's interesting or not that the cmpb $0, (%rdi) is also a redundant load: the first loop left *str, i.e. (%rdi), in EAX. Putting the LEA between cmp and je (even with -march=znver2) is a separate missed optimization (unless that's working around Intel's JCC erratum).

With only -O2 instead of -O3, we get better asm in that part: it takes advantage of having the char in AL, and jumps into the middle of the next loop after xor-zeroing the `inc` variable.

Replacing c = *str++; with c = *str; str++; results in a wasted register copy with trunk, instead of a 2nd load (on x86-64 and arm64). That's still a missed optimization, but less bad. GCC7 and earlier still do an extra load with either way of writing it.

Removing the first loop, or making its loop condition something like *str && keep_lut[*str], removes the problem entirely; the CSE possibility is gone. (Same even if we use lut[*(unsigned char*)str], i.e. type-punning the pointer to unsigned char instead of casting the signed char value to unsigned char - on x86 where char is signed, but not on arm64 where char is unsigned.)

---

I didn't find any clear duplicates; the following are barely worth mentioning:
* pr94442 looks like extra spilling, not just redundant loading.
* pr97366 is due to vectors of different types, probably.
* pr64319 needs runtime aliasing detection to avoid, unlike this.
The AArch64 version of this does seem to demonstrate pr71942 (a useless and x4, x2, 255 on an LDRB result) when you get it to copy a register instead of doing a 2nd load.