https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100922
Bug ID: 100922
Summary: CSE leads to fully redundant (back to back) zero-extending loads of the same thing in a loop, or a register copy
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---

Created attachment 50948
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50948&action=edit
redundant_zero_extend.c

It's rarely a good idea to load the same thing twice; it's generally better to copy a register, or to read the same register twice when a copy isn't needed. So the following asm should never happen, but it does with current trunk, and similar with GCC as old as 4.5:

        movzbl  (%rax), %edx
        movzbl  (%rax), %ecx    # no branch target between these instructions

or

        ldrb    w4, [x2]
        ldrb    w3, [x2], 1     # post-indexed *x2++

(This happens at -O3. With -O2 we get a redundant register copy instead, so either way a wasted instruction. There are also other differences earlier in the function with -O2 vs. -O3.)

https://godbolt.org/z/jT7WaWeK8 - minimal test case. x86-64 and AArch64 trunk show basically identical code structure.
x86-64 gcc (Compiler-Explorer-Build) 12.0.0 20210603 and aarch64-unknown-linux-gnu-gcc (GCC) 12.0.0 20210524

void remove_chars_inplace(char *str, const unsigned char keep_lut[256])
{
    while(keep_lut[(unsigned char)*str]){   // can be an if() and still repro
        str++;                              // keep_lut[0] is false
    }

    char *out = str;
    unsigned char c;            /* must be unsigned char for correctness. */
    do {
        c = *str++;
        unsigned char inc = keep_lut[c];    // unsigned long doesn't help
        *out = c;
        out += inc;             // inc=0 or 1 to let next char overwrite or not
    } while(c);
}

x86-64 asm:

remove_chars_inplace:
        jmp     .L8
.L3:                            # top of search loop for first char to remove
        addq    $1, %rdi
.L8:                            # loop entry point
        movzbl  (%rdi), %eax
        cmpb    $0, (%rsi,%rax) # un-laminates and doesn't macro-fuse ...
        jne     .L3

        cmpb    $0, (%rdi)      # 2nd loop body can be skipped if *str == 0
                                # should be test %al,%al - this char was already loaded.
        leaq    1(%rdi), %rax   # even -march=znver2 fails to move this earlier or later
                                # to allow cmp/je fusion. (Intel won't macro-fuse cmp imm,mem / jcc)
        je      .L1
.L5:                            # TOP OF 2ND LOOP
        movzbl  (%rax), %edx
        movzbl  (%rax), %ecx    # redundant load of *str
        addq    $1, %rax
        movzbl  (%rsi,%rdx), %edx   # inc = lut[c]
        movb    %cl, (%rdi)
        addq    %rdx, %rdi      # out += inc
        testb   %cl, %cl
        jne     .L5             # }while(c != 0)
.L1:
        ret

IDK if it's interesting or not that the cmpb $0, (%rdi) is also a redundant load: the first loop left *str, i.e. (%rdi), in EAX. Putting the LEA between cmp and je (even with -march=znver2) is a separate missed optimization (unless that's working around Intel's JCC erratum).

With only -O2 instead of -O3, we get better asm in that part: it takes advantage of having the char in AL, and jumps into the middle of the next loop after xor-zeroing the `inc` variable.

Replacing c = *str++; with c = *str; str++; results in a wasted register copy with trunk, instead of a 2nd load (on x86-64 and arm64). That's still a missed optimization, but less bad. GCC7 and earlier still do an extra load with either way of writing it.

Removing the first loop, or making its loop condition something like *str && keep_lut[*str], removes the problem entirely; the CSE possibility is gone. (Same even if we use lut[*(unsigned char*)str], i.e. type-punning the pointer to unsigned char instead of casting the signed char value to unsigned char - on x86 where char is signed, but not on arm64 where char is unsigned.)

---

I didn't find any clear duplicates; the following are barely worth mentioning:
* pr94442 looks like extra spilling, not just redundant loading.
* pr97366 is due to vectors of different types, probably.
* pr64319 needs runtime aliasing detection to avoid, unlike this.
The AArch64 version of this does seem to demonstrate pr71942 (a useless and x4, x2, 255 on an LDRB result) when you get it to copy a register instead of doing a 2nd load.