[Bug rtl-optimization/82803] Wildly excessive calls to __tls_get_addr with optimizations enabled.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82803

--- Comment #10 from Yann Droneaud ---
Some more snippets, generated with creduce:

--8<--
void a(long *);
int b(void);
void c(void);
long d(void)
{
  static __thread long e;
  a(&e);
  if (b())
    c();
  return e;
}
--8<--
int a(void);
void b(long *);
void c(void)
{
  static __thread long d;
  if (d || a())
    d = 0;
  b(&d);
}
--8<--
void a(int *);
int b(int *);
int c(void)
{
  static __thread int d;
  a(&d);
  while (b(&d));
  return 0;
}
--8<--
void b(int *);
int c(int *);
int d(void)
{
  static __thread int a;
  if (a)
    while (c(&a));
  b(&a);
  return 0;
}
--8<--
int b(void);
void c(void);
void d(int *);
int e(void)
{
  static __thread int a[2];
  d(a);
  do {
    d(a);
    if (b())
      c();
  } while (a[0] && a[1]);
  d(a);
  return 0;
}
--8<--

Each of these makes gcc emit 2 or more calls to __tls_get_addr().

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82803

--- Comment #9 from Yann Droneaud ---
This issue is also reported as bug #81501.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82803

Yann Droneaud changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |yann at droneaud dot fr

--- Comment #8 from Yann Droneaud ---
Created attachment 46903
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46903&action=edit
An artificial test case for gcc to emit 17 calls to __tls_get_addr()

Using Thread Local Storage (TLS) is a pain: the issue reported here still
applies to the latest GCC.

I have code such as

  static struct state *state(void) __attribute__((pure));

  static struct state *state(void)
  {
    static __thread struct state s;
    return &s;
  }

  int do_something(void)
  {
    struct state * const s = state();
    int res;
    /* do something */
    return res;
  }

Once compiled, the code for my real function contains 6 calls to
__tls_get_addr(), which is far more than expected and far more than
necessary.

Clang compiles the same code and emits a single call to __tls_get_addr().

Both on Linux amd64, -O3 -fPIC.

The attached testcase is an example designed to trigger 17 calls to
__tls_get_addr(). As you will see, there is about one call per
conditional + function call pair.

Once again, clang is able to emit code with a single call to
__tls_get_addr().

You can check for yourself: https://godbolt.org/z/QVGjka

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82803

--- Comment #7 from Alexander Monakov ---
Sorry, should have double-checked when commenting. I see RTL LIM simply
considers all calls non-invariant in check_maybe_invariant.

I wonder if it would make sense to represent TLS ABI calls as unspecs up to
some point like split_all_insns so they get cleaned up automatically like on
i386?

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82803

Eric Botcazou changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ebotcazou at gcc dot gnu.org

--- Comment #6 from Eric Botcazou ---
> The main reason is __tls_get_addr emitted as a normal call on RTL (for each
> GIMPLE access to the variable), but unless I'm missing something RTL doesn't
> have a notion of pure/const calls, so RTL loop invariant motion and CSE
> cannot clean up the redundant calls.

/* 1 if RTX is a call to a const function.  Built from ECF_CONST and
   TREE_READONLY.  */
#define RTL_CONST_CALL_P(RTX) \
  (RTL_FLAG_CHECK1 ("RTL_CONST_CALL_P", (RTX), CALL_INSN)->unchanging)

/* 1 if RTX is a call to a pure function.  Built from ECF_PURE and
   DECL_PURE_P.  */
#define RTL_PURE_CALL_P(RTX) \
  (RTL_FLAG_CHECK1 ("RTL_PURE_CALL_P", (RTX), CALL_INSN)->return_val)

/* 1 if RTX is a call to a const or pure function.  */
#define RTL_CONST_OR_PURE_CALL_P(RTX) \
  (RTL_CONST_CALL_P (RTX) || RTL_PURE_CALL_P (RTX))

/* 1 if RTX is a call to a looping const or pure function.  Built from
   ECF_LOOPING_CONST_OR_PURE and DECL_LOOPING_CONST_OR_PURE_P.  */
#define RTL_LOOPING_CONST_OR_PURE_CALL_P(RTX) \
  (RTL_FLAG_CHECK1 ("CONST_OR_PURE_CALL_P", (RTX), CALL_INSN)->call)

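For readers less familiar with these flags, a minimal source-level sketch may help. As far as I understand, the GCC function attributes `const` and `pure` are the usual way a declaration ends up marked ECF_CONST or ECF_PURE, which the macros quoted above then expose on the call insn. The function names below (`square`, `count_nonzero`) are hypothetical and do not come from the bug report.

```c
#include <stddef.h>

/* Hypothetical example: the result depends only on the argument and no
   memory is read, so the function can be declared const.  */
__attribute__((const)) int square(int v)
{
  return v * v;
}

/* Hypothetical example: reads memory through A but has no side effects,
   so it can be declared pure, but not const.  */
__attribute__((pure)) size_t count_nonzero(const int *a, size_t n)
{
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    if (a[i] != 0)
      count++;
  return count;
}
```

Repeated const calls with identical arguments can be merged or hoisted freely. Repeated pure calls can be merged only as long as no intervening store could change the memory they read.
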
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82803

Alexander Monakov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov ---
The main reason is that __tls_get_addr is emitted as a normal call on RTL
(once for each GIMPLE access to the variable), but unless I'm missing
something RTL doesn't have a notion of pure/const calls, so RTL loop
invariant motion and CSE cannot clean up the redundant calls.

On i386, ___tls_get_addr is modeled as an unspec instead, which is
successfully moved out of the loop during RTL invariant motion.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82803

nsz at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nsz at gcc dot gnu.org

--- Comment #4 from nsz at gcc dot gnu.org ---
i ran into the same issue:

static __thread int x;
static int *volatile p;

void f(int c)
{
  while (c--)
    p = &x;
}

with -xc -O2 -fPIC compiles to

        pushq   %rbx
        leal    -1(%rdi), %ebx
.L10:
        leaq    x@tlsld(%rip), %rdi
        call    __tls_get_addr@PLT
        subl    $1, %ebx
        addq    $x@dtpoff, %rax
        movq    %rax, p(%rip)
        cmpl    $-1, %ebx
        jne     .L10
        popq    %rbx
        ret

note that with -funroll-loops the loop is

.L46:
        leaq    x@tlsld(%rip), %rdi
        call    __tls_get_addr@PLT
        subl    $8, %ebx
        addq    $x@dtpoff, %rax
        movq    %rax, p(%rip)
        movq    %rax, p(%rip)
        movq    %rax, p(%rip)
        movq    %rax, p(%rip)
        movq    %rax, p(%rip)
        movq    %rax, p(%rip)
        movq    %rax, p(%rip)
        movq    %rax, p(%rip)
        cmpl    $-1, %ebx
        jne     .L46

so the loop unroller knows it only needs to compute the address once,
but gcc fails to hoist it out of the loop.

if i use a simple global, then the GOT access is hoisted,
if i use an __attribute__((const)) function call then that is hoisted,
only tls address computation is broken.

the issue is not present with -m32 (i386 code gen), but it is present
on e.g. aarch64 and powerpc64 and with tlsdesc -mtls-dialect=gnu2
(then it's the tlsdesc call that's in the loop instead of the
__tls_get_addr call).

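The three cases described in this comment can be put side by side in one file. This is a sketch for comparison only: `g`, `addr_of_g`, `f_global`, `f_const` and `f_tls` are hypothetical names, and only `x`, `p` and the TLS loop come from the comment. Compiling it with `gcc -O2 -fPIC -S` and reading each loop body should show which address computations get hoisted; the outcome may differ by target and GCC version.

```c
int g;                          /* plain global; with -fPIC typically reached via the GOT */
static __thread int x;          /* TLS variable from the test case in the comment */
static int *volatile p;         /* volatile store keeps every loop iteration alive */

/* Hypothetical const helper; noinline so the call stays visible in the loop.  */
__attribute__((const, noinline)) static int *addr_of_g(void)
{
  return &g;
}

/* Loop storing the address of a plain global.  */
void f_global(int c)
{
  while (c--)
    p = &g;
}

/* Loop storing the result of a const function call.  */
void f_const(int c)
{
  while (c--)
    p = addr_of_g();
}

/* Loop storing the address of a TLS variable (the case reported here).  */
void f_tls(int c)
{
  while (c--)
    p = &x;
}
```

All three loops compute a loop-invariant address, so any difference in the generated code comes from how the address computation is represented, not from the source-level semantics.
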
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82803

--- Comment #3 from Alex Mohr ---
FWIW a loop is not required. This generates 4 calls to __tls_get_addr:

static thread_local int x;
int g();
int f()
{
  int *px = &x;
  if (g())
    *px += g();
  if (g())
    *px += g();
  if (g())
    *px += g();
  return *px;
}

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82803

--- Comment #2 from Andrew Pinski ---
Related to PR 81501.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82803

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Target|                            |x86_64-linux-gnu,
                   |                            |aarch64-linux-gnu
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-11-02
          Component|c++                         |rtl-optimization
     Ever confirmed|0                           |1

--- Comment #1 from Andrew Pinski ---
Hmm, I thought we were able to pull the address formation out of the loop,
but for some reason we are not.