https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95790

            Bug ID: 95790
           Summary: Incorrect static target dispatch
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: ipa
          Assignee: unassigned at gcc dot gnu.org
          Reporter: yyc1992 at gmail dot com
                CC: marxin at gcc dot gnu.org
  Target Milestone: ---

The indirection elimination code currently only checks for a match of the
target for the specific version, but doesn't check whether all the targets match.

The following is modified from
https://github.com/gcc-mirror/gcc/commit/b8ce8129a560f64f8b2855c4a3812b7c3c0ebf3f#diff-e2d535917af8555baad2e9c8749e96a5

```
__attribute__ ((target ("default")))
static unsigned foo(const char *buf, unsigned size) {
  return 1;
}

__attribute__ ((target ("avx")))
static unsigned foo(const char *buf, unsigned size) {
  return 2;
}

__attribute__ ((target ("avx512f")))
static unsigned foo(const char *buf, unsigned size) {
  return 3;
}

__attribute__ ((target ("default")))
unsigned bar() {
  char buf[4096];
  unsigned acc = 0;
  for (int i = 0; i < sizeof(buf); i++) {
    acc += foo(&buf[i], 1);
  }
  return acc;
}

__attribute__ ((target ("avx")))
unsigned bar() {
  char buf[4096];
  unsigned acc = 0;
  for (int i = 0; i < sizeof(buf); i++) {
    acc += foo(&buf[i], 1);
  }
  return acc;
}
```

With the optimization disabled (which is possible by adding a flatten attribute
to the functions, triggering PR95780 and PR95778), a resolver function is
automatically generated for foo, like this:

```
        .text
.LHOTB0:
        .p2align 4
        .type   _ZL3fooPKcj.resolver, @function
_ZL3fooPKcj.resolver:
        subq    $8, %rsp
        call    __cpu_indicator_init@PLT
        movq    __cpu_model@GOTPCREL(%rip), %rax
        movl    12(%rax), %eax
        testb   $-128, %ah
        je      .L8
        leaq    _ZL3fooPKcj.avx512f(%rip), %rax
.L7:
        addq    $8, %rsp
        ret
        .section        .text.unlikely
        .type   _ZL3fooPKcj.resolver.cold, @function
_ZL3fooPKcj.resolver.cold:
.L8:
        testb   $2, %ah
        leaq    _ZL3fooPKcj.avx(%rip), %rdx
        leaq    _ZL3fooPKcj(%rip), %rax
        cmovne  %rdx, %rax
        jmp     .L7
        .text
        .size   _ZL3fooPKcj.resolver, .-_ZL3fooPKcj.resolver
        .section        .text.unlikely
        .size   _ZL3fooPKcj.resolver.cold, .-_ZL3fooPKcj.resolver.cold
.LCOLDE0:
        .text
.LHOTE0:
        .type   _Z11_ZL3fooPKcjPKcj, @gnu_indirect_function
        .set    _Z11_ZL3fooPKcjPKcj,_ZL3fooPKcj.resolver
```

and the calls from bar go through the PLT. This is the correct behavior
(albeit sub-optimal, since the default version of bar could call the default
version of foo directly) and allows the avx512f version of foo to be called
from the avx version of bar on a processor that supports avx512f.
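
For readers who don't want to trace the assembly, the selection order the
resolver implements can be modelled in portable C++. This is only a sketch of
the listing above, not GCC's code: the names FooVersion, resolve_foo, kAvx512fBit
and kAvxBit are hypothetical, and the bit positions are simply read off the two
testb instructions (bit 15 and bit 9 of the loaded feature word).

```cpp
#include <cstdint>

// Hypothetical stand-ins for the three versions of foo in the test case;
// the values mirror what each version returns (1, 2, 3).
enum class FooVersion : unsigned { Default = 1, Avx = 2, Avx512f = 3 };

// Masks read off the listing: `testb $-128, %ah` tests bit 15 and
// `testb $2, %ah` tests bit 9 of the word loaded from __cpu_model.
const std::uint32_t kAvx512fBit = 1u << 15;
const std::uint32_t kAvxBit = 1u << 9;

// Model of the resolver: the avx512f version is checked first, then avx,
// and the default version is the fallback.
inline FooVersion resolve_foo(std::uint32_t features) {
  if (features & kAvx512fBit) return FooVersion::Avx512f;
  if (features & kAvxBit) return FooVersion::Avx;
  return FooVersion::Default;
}
```

The point is that the choice depends only on the processor's features, not on
which version of bar is making the call.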

With the optimization enabled, however, the calls to foo are inlined into bar
and the avx512f version is never used.
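
The difference between the two checks can be sketched outside the compiler.
This is an illustration under an assumption, not GCC's implementation: it
takes "all the targets are matching" to mean that the caller and the callee
declare the same set of target versions. The names TargetSet,
has_matching_version and all_targets_match are hypothetical.

```cpp
#include <set>
#include <string>

// Hypothetical model: the set of target strings a multiversioned function
// is declared with, e.g. {"default", "avx"}.
using TargetSet = std::set<std::string>;

// The check as described in this report: the callee merely has a version
// for the same target as the calling version.
inline bool has_matching_version(const std::string& caller_target,
                                 const TargetSet& callee_targets) {
  return callee_targets.count(caller_target) != 0;
}

// Stricter check, under the assumption stated above: a direct call is only
// allowed when caller and callee are versioned for exactly the same targets.
inline bool all_targets_match(const TargetSet& caller_targets,
                              const TargetSet& callee_targets) {
  return caller_targets == callee_targets;
}
```

For the test case, bar has {"default", "avx"} and foo has {"default", "avx",
"avx512f"}: the first check accepts a direct call from the avx version of bar,
while the second rejects it because foo has an avx512f version that bar lacks.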

This is somewhat of a regression caused by
b8ce8129a560f64f8b2855c4a3812b7c3c0ebf3f.

It'll also affect my fix for PR95780 and PR95778.
https://gcc.gnu.org/pipermail/gcc-patches/2020-June/548631.html
