[Bug rtl-optimization/82803] Wildly excessive calls to __tls_get_addr with optimizations enabled.

2019-09-23 Thread yann at droneaud dot fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82803

--- Comment #10 from Yann Droneaud  ---
Some more snippets, generated with creduce:

--8<
  void a(long *);
  int b(void);
  void c(void);
  long d(void) {
    static __thread long e;
    a(&e);
    if (b())
      c();
    return e;
  }
--8<
  int a(void);
  void b(long *);
  void c(void) {
    static __thread long d;
    if (d || a())
      d = 0;
    b(&d);
  }
--8<
  void a(int *);
  int b(int *);
  int c(void) {
    static __thread int d;
    a(&d);
    while (b(&d));
    return 0;
  }
--8<
  void b(int *);
  int c(int *);
  int d(void) {
    static __thread int a;
    if (a)
      while (c(&a));
    b(&a);
    return 0;
  }
--8<
  int b(void);
  void c(void);
  void d(int *);
  int e(void) {
    static __thread int a[2];
    d(a);
    do {
      d(a);
      if (b())
        c();
    } while (a[0] && a[1]);
    d(a);
    return 0;
  }
--8<

These make gcc emit 2 or more calls to __tls_get_addr().

2019-09-20 Thread yann at droneaud dot fr

--- Comment #9 from Yann Droneaud  ---
This issue is also reported as bug #81501

2019-09-20 Thread yann at droneaud dot fr

Yann Droneaud  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |yann at droneaud dot fr

--- Comment #8 from Yann Droneaud  ---
Created attachment 46903
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46903=edit
An artificial test case for gcc to emit 17 calls to __tls_get_addr()

Using Thread Local Storage (TLS) is a pain: the issue reported here still
applies to the latest GCC.

I have code such as

  static struct state *state(void) __attribute__((pure));
  static struct state *state(void)
  {
      static __thread struct state s;

      return &s;
  }

  int do_something(void)
  {
      struct state * const s = state();
      int res;

      /* do something */

      return res;
  }

Once compiled, the code for my real function contains 6 calls to
__tls_get_addr(), which is far more than expected and far more than necessary.
Clang compiles the same code and emits a single call to __tls_get_addr(). Both
were built on Linux amd64 with -O3 -fPIC.

The attached testcase is an example which is designed to trigger 17 calls to
__tls_get_addr(). As you will see, there's about one per conditional + function
call pair.

Once again, clang is able to emit code with a single call to __tls_get_addr().

You can check for yourself: https://godbolt.org/z/QVGjka
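
For readers without access to the attachment, the following is a minimal
sketch of the pattern described above, not the attached test case itself. It
assumes a hypothetical struct state holding a single counter and two
hypothetical helpers, check() and step(); in a real test those helpers would
live in another translation unit so the compiler cannot see through them. Each
"if (check(...)) step(s);" line is one conditional + function call pair.

```c
/* Sketch only: struct state's layout, check(), step() and run() are
   hypothetical names, not taken from the attachment. */
struct state {
    int calls;
};

static struct state *state(void) __attribute__((pure));
static struct state *state(void)
{
    static __thread struct state s;   /* one instance per thread */

    return &s;
}

/* noinline so that each use below remains a real function call. */
__attribute__((noinline)) int check(int v)
{
    return v & 1;
}

__attribute__((noinline)) void step(struct state *s)
{
    s->calls++;
}

int run(int v)
{
    /* The TLS address is taken once in the source... */
    struct state *const s = state();

    /* ...and reused after every conditional + function call pair. */
    if (check(v))
        step(s);
    if (check(v >> 1))
        step(s);
    if (check(v >> 2))
        step(s);

    return s->calls;
}
```

Compiling this with -O3 -fPIC and searching the assembly for __tls_get_addr
shows how many calls a given compiler emits. No count is claimed here, since
it depends on the compiler and its version.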

2018-10-13 Thread amonakov at gcc dot gnu.org

--- Comment #7 from Alexander Monakov  ---
Sorry, should have double-checked when commenting. I see RTL LIM simply
considers all calls non-invariant in check_maybe_invariant.

I wonder if it would make sense to represent tls abi calls as unspecs up to
some point like split_all_insns so they get cleaned up automatically like on
i386?

2018-10-13 Thread ebotcazou at gcc dot gnu.org

Eric Botcazou  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ebotcazou at gcc dot gnu.org

--- Comment #6 from Eric Botcazou  ---
> The main reason is that __tls_get_addr is emitted as a normal call on RTL
> (for each GIMPLE access to the variable), but unless I'm missing something
> RTL doesn't have a notion of pure/const calls, so RTL loop invariant motion
> and CSE cannot clean up the redundant calls.

/* 1 if RTX is a call to a const function.  Built from ECF_CONST and
   TREE_READONLY.  */
#define RTL_CONST_CALL_P(RTX)   \
  (RTL_FLAG_CHECK1 ("RTL_CONST_CALL_P", (RTX), CALL_INSN)->unchanging)

/* 1 if RTX is a call to a pure function.  Built from ECF_PURE and
   DECL_PURE_P.  */
#define RTL_PURE_CALL_P(RTX)\
  (RTL_FLAG_CHECK1 ("RTL_PURE_CALL_P", (RTX), CALL_INSN)->return_val)

/* 1 if RTX is a call to a const or pure function.  */
#define RTL_CONST_OR_PURE_CALL_P(RTX) \
  (RTL_CONST_CALL_P (RTX) || RTL_PURE_CALL_P (RTX))

/* 1 if RTX is a call to a looping const or pure function.  Built from
   ECF_LOOPING_CONST_OR_PURE and DECL_LOOPING_CONST_OR_PURE_P.  */
#define RTL_LOOPING_CONST_OR_PURE_CALL_P(RTX)   \
  (RTL_FLAG_CHECK1 ("CONST_OR_PURE_CALL_P", (RTX), CALL_INSN)->call)

2018-10-12 Thread amonakov at gcc dot gnu.org

Alexander Monakov  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov  ---
The main reason is that __tls_get_addr is emitted as a normal call on RTL (for
each GIMPLE access to the variable), but unless I'm missing something RTL
doesn't have a notion of pure/const calls, so RTL loop invariant motion and CSE
cannot clean up the redundant calls.

On i386, ___tls_get_addr is modeled as an unspec instead, which is successfully
moved out of the loop during RTL invariant motion.

2018-10-12 Thread nsz at gcc dot gnu.org

nsz at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nsz at gcc dot gnu.org

--- Comment #4 from nsz at gcc dot gnu.org ---
i run into the same issue:

static __thread int x;
static int *volatile p;
void f(int c)
{
    while (c--)
        p = &x;
}

with -xc -O2 -fPIC compiles to

  pushq %rbx
  leal -1(%rdi), %ebx
.L10:
  leaq x@tlsld(%rip), %rdi
  call __tls_get_addr@PLT
  subl $1, %ebx
  addq $x@dtpoff, %rax
  movq %rax, p(%rip)
  cmpl $-1, %ebx
  jne .L10
  popq %rbx
  ret

note that with -funroll-loops the loop is

.L46:
  leaq x@tlsld(%rip), %rdi
  call __tls_get_addr@PLT
  subl $8, %ebx
  addq $x@dtpoff, %rax
  movq %rax, p(%rip)
  movq %rax, p(%rip)
  movq %rax, p(%rip)
  movq %rax, p(%rip)
  movq %rax, p(%rip)
  movq %rax, p(%rip)
  movq %rax, p(%rip)
  movq %rax, p(%rip)
  cmpl $-1, %ebx
  jne .L46

so the loop unroller knows it only needs to compute the address once, but gcc
fails to hoist it out of the loop.

if i use a simple global, then the GOT access is hoisted, if i use an
__attribute__((const)) function call then that is hoisted, only tls address
computation is broken.

the issue is not present with -m32 (i386 code gen), but it is present on e.g.
aarch64 and powerpc64 and with tlsdesc -mtls-dialect=gnu2 (then it's the
tlsdesc call that's in the loop instead of __tls_get_addr call).
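
the three cases compared above can be put side by side in one file. this is a
sketch, not the original test: g, global_addr, f_global and f_const are
hypothetical names added for illustration, and f_tls is the function f from
the test case above, renamed. whether a given address computation stays in the
loop depends on the gcc version, target and flags, so no result is claimed
here.

```c
/* Sketch for comparing address hoisting. Only x and p come from the
   original test case; every other name is hypothetical. */
static __thread int x;      /* thread-local variable from the test case */
static int g;               /* plain global, for comparison */
static int *volatile p;     /* volatile, so every store is kept */

/* A const function: the result depends only on its (empty) argument
   list, so the compiler may call it once and reuse the result. */
__attribute__((const, noinline))
static int *global_addr(void)
{
    return &g;
}

/* TLS case: the address computation reported as staying in the loop. */
void f_tls(int c)
{
    while (c--)
        p = &x;
}

/* Plain global case: reported as hoisted out of the loop. */
void f_global(int c)
{
    while (c--)
        p = &g;
}

/* Const function call case: reported as hoisted out of the loop. */
void f_const(int c)
{
    while (c--)
        p = global_addr();
}
```

compiling with -O2 -fPIC -S and comparing the three loop bodies shows which
address computations remain inside the loop.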

2017-11-02 Thread amohr at amohr dot org

--- Comment #3 from Alex Mohr  ---
FWIW a loop is not required.  This generates 4 calls to __tls_get_addr:

static thread_local int x;
int g();
int f() {
  int *px = &x;
  if (g()) *px += g();
  if (g()) *px += g();
  if (g()) *px += g();
  return *px;
}

2017-11-02 Thread pinskia at gcc dot gnu.org

--- Comment #2 from Andrew Pinski  ---
Related to PR 81501.

2017-11-02 Thread pinskia at gcc dot gnu.org

Andrew Pinski  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Target|                            |x86_64-linux-gnu,
                   |                            |aarch64-linux-gnu
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-11-02
          Component|c++                         |rtl-optimization
     Ever confirmed|0                           |1

--- Comment #1 from Andrew Pinski  ---
Hmm, I thought we were able to pull the address formation out of the loop but
for some reason we are not.