https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95790
Bug ID: 95790
Summary: Incorrect static target dispatch
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: ipa
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
CC: marxin at gcc dot gnu.org
Target Milestone: ---

The indirection elimination code currently only checks whether the target of the specific version matches; it doesn't check whether all the targets match.

Modified from https://github.com/gcc-mirror/gcc/commit/b8ce8129a560f64f8b2855c4a3812b7c3c0ebf3f#diff-e2d535917af8555baad2e9c8749e96a5

```
__attribute__ ((target ("default")))
static unsigned foo(const char *buf, unsigned size) {
  return 1;
}

__attribute__ ((target ("avx")))
static unsigned foo(const char *buf, unsigned size) {
  return 2;
}

__attribute__ ((target ("avx512f")))
static unsigned foo(const char *buf, unsigned size) {
  return 3;
}

__attribute__ ((target ("default")))
unsigned bar() {
  char buf[4096];
  unsigned acc = 0;
  for (int i = 0; i < sizeof(buf); i++) {
    acc += foo(&buf[i], 1);
  }
  return acc;
}

__attribute__ ((target ("avx")))
unsigned bar() {
  char buf[4096];
  unsigned acc = 0;
  for (int i = 0; i < sizeof(buf); i++) {
    acc += foo(&buf[i], 1);
  }
  return acc;
}
```

With the optimization disabled, which is possible by adding a flatten attribute to the functions and triggering PR95780 and PR95778, a resolver function is automatically generated for foo, like

```
        .text
.LHOTB0:
        .p2align 4
        .type   _ZL3fooPKcj.resolver, @function
_ZL3fooPKcj.resolver:
        subq    $8, %rsp
        call    __cpu_indicator_init@PLT
        movq    __cpu_model@GOTPCREL(%rip), %rax
        movl    12(%rax), %eax
        testb   $-128, %ah
        je      .L8
        leaq    _ZL3fooPKcj.avx512f(%rip), %rax
.L7:
        addq    $8, %rsp
        ret
        .section        .text.unlikely
        .type   _ZL3fooPKcj.resolver.cold, @function
_ZL3fooPKcj.resolver.cold:
.L8:
        testb   $2, %ah
        leaq    _ZL3fooPKcj.avx(%rip), %rdx
        leaq    _ZL3fooPKcj(%rip), %rax
        cmovne  %rdx, %rax
        jmp     .L7
        .text
        .size   _ZL3fooPKcj.resolver, .-_ZL3fooPKcj.resolver
        .section        .text.unlikely
        .size   _ZL3fooPKcj.resolver.cold, .-_ZL3fooPKcj.resolver.cold
.LCOLDE0:
        .text
.LHOTE0:
        .type   _Z11_ZL3fooPKcjPKcj, @gnu_indirect_function
        .set    _Z11_ZL3fooPKcjPKcj,_ZL3fooPKcj.resolver
```

and the calls from bar go through the PLT. This is the correct behavior (albeit sub-optimal, since the default version of bar could call the default version of foo directly) and allows the avx512f version of foo to be called on the correct processor from the avx version of bar.

With the optimization enabled, however, the calls to foo are inlined into bar and the avx512f version is never used.

This is somewhat of a regression caused by b8ce8129a560f64f8b2855c4a3812b7c3c0ebf3f. It'll also affect my fix for PR95780 and PR95778.

https://gcc.gnu.org/pipermail/gcc-patches/2020-June/548631.html
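
For illustration, the priority order that the resolver quoted above appears to implement (avx512f first, then avx, then default) can be sketched in plain C++. This is only a model of the dispatch decision, not GCC's implementation: `select_foo`, `foo_default`, `foo_avx`, `foo_avx512f` and `check_dispatch_order` are hypothetical names, and the two booleans stand in for the CPU feature bits the resolver reads from `__cpu_model`.

```cpp
#include <cassert>

// Hypothetical stand-ins for the three versions of foo in the test case.
static unsigned foo_default(const char *, unsigned) { return 1; }
static unsigned foo_avx(const char *, unsigned) { return 2; }
static unsigned foo_avx512f(const char *, unsigned) { return 3; }

using foo_fn = unsigned (*)(const char *, unsigned);

// Model of the resolver's decision: the highest-priority version that the
// CPU supports wins. The flags are assumptions standing in for the feature
// bits tested in the generated assembly.
static foo_fn select_foo(bool has_avx, bool has_avx512f)
{
    if (has_avx512f)
        return foo_avx512f;
    if (has_avx)
        return foo_avx;
    return foo_default;
}

// A caller compiled for "avx" only knows that has_avx is true. That alone
// does not determine the callee: the answer still depends on has_avx512f.
static void check_dispatch_order()
{
    assert(select_foo(false, false) == foo_default);
    assert(select_foo(true, false) == foo_avx);
    assert(select_foo(true, true) == foo_avx512f);
}
```

Under this model, binding the call in the avx version of bar statically to the avx version of foo is only valid if no higher-priority version of foo exists, which is the check this report says is missing.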