Public bug reported:

The bug [0] was introduced in Glibc 2.23 [1] and fixed in Glibc 2.25
[2]. All Ubuntu versions starting from 16.04 are affected because they
ship either Glibc 2.23 or 2.24. The bug causes a serious (2x-4x)
performance degradation of the math functions (pow, exp/exp2/exp10,
log/log2/log10, sin/cos/sincos/tan, asin/acos/atan/atan2,
sinh/cosh/tanh, asinh/acosh/atanh) provided by libm. It can be
reproduced on any AVX-capable x86-64 machine.

This bug is all about the AVX-SSE transition penalty [3]. The 256-bit
YMM registers used by AVX-256 instructions extend the 128-bit XMM
registers used by SSE (XMM0 is the low half of YMM0, and so on). Every
time the CPU executes an SSE instruction after an AVX-256 instruction,
it has to store the upper halves of the YMM registers to an internal
buffer and then restore them when execution returns to AVX
instructions. The store/restore is required because old-fashioned SSE
knows nothing about the upper halves of its registers and may clobber
them, and it is time consuming (several tens of clock cycles each
time).

To deal with this issue, Intel introduced AVX-128 instructions, which
operate on the same 128-bit XMM registers as SSE but take the upper
halves of the YMM registers into account, so no store/restore is
required. Practically speaking, AVX-128 instructions are a smarter
form of SSE instructions that can be mixed with full-width AVX-256
instructions without any penalty, and Intel recommends using them
instead of SSE instructions wherever possible. To sum up: it is okay
to mix SSE with AVX-128 and AVX-128 with AVX-256, but not SSE with
AVX-256. Mixing AVX-128 with AVX-256 is allowed because both kinds of
instructions are aware of the 256-bit YMM registers. Mixing SSE with
AVX-128 is okay because the CPU can guarantee that the upper halves of
the YMM registers contain no meaningful data (there is no way to put
anything there without AVX-256 instructions) and can skip the
store/restore (there is no point in preserving random garbage in the
upper halves). Mixing SSE with AVX-256 incurs the transition penalty
described above. The scalar floating-point instructions used by the
routines mentioned above are implemented as a subset of the SSE and
AVX-128 instruction sets; they operate on only a small fraction of a
128-bit register but are still SSE/AVX-128 instructions, so they
suffer from the SSE/AVX transition penalty as well.
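
As a rough illustration (this snippet is not part of the original
report; the file layout and function names are made up), compiling the
code below with gcc -O2 -S and inspecting the generated assembly should
show the three encodings side by side: a legacy SSE addpd in add_sse(),
a VEX-encoded AVX-128 vaddpd on XMM registers in add_avx128(), and a
full-width AVX-256 vaddpd on YMM registers in add_avx256().

#include <immintrin.h>

/* Default x86-64 code generation: legacy SSE encoding (addpd). */
__m128d add_sse(__m128d a, __m128d b) {
  return _mm_add_pd(a, b);
}

/* Same 128-bit operation, but compiled with AVX enabled for this
   function only: VEX-encoded AVX-128 (vaddpd on XMM registers), which
   can be mixed with AVX-256 without any penalty. */
__attribute__((target("avx")))
__m128d add_avx128(__m128d a, __m128d b) {
  return _mm_add_pd(a, b);
}

/* Full-width AVX-256 (vaddpd on YMM registers): touches the upper
   halves of the YMM registers. */
__attribute__((target("avx")))
__m256d add_avx256(__m256d a, __m256d b) {
  return _mm256_add_pd(a, b);
}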

Glibc inadvertently triggers a chain of AVX/SSE transition penalties
due to inappropriate use of AVX-256 instructions inside the
_dl_runtime_resolve() procedure. By using AVX-256 instructions to
push/pop the YMM registers [4], Glibc makes the CPU think that the
upper halves of the YMM registers contain meaningful data that has to
be preserved while SSE instructions execute. With this 'dirty' flag
set, every switch between SSE and AVX instructions (AVX-128 or
AVX-256) leads to the time-consuming store/restore procedure. The
'dirty' flag never gets cleared for the rest of the program's
execution, which leads to a serious overall slowdown. The fixed
implementation [2] of _dl_runtime_resolve() avoids AVX-256
instructions where possible.
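
For completeness, here is a minimal sketch of a user-side workaround
based on the same mechanism (this code is not from the original report
and the warmup trick is an assumption on our part): the VZEROUPPER
instruction clears the upper halves of all YMM registers and returns
the CPU to the 'clean' state, so executing it after the lazy resolver
has already run, and before the hot loop, should avoid the penalty, as
long as no further lazy symbol resolution happens inside the loop. It
obviously requires an AVX-capable CPU, which is the context of this
bug anyway.

#include <math.h>
#include <stdio.h>

int main () {
  volatile double one = 1.0;
  volatile double warmup = exp(one);   /* first call to exp(): the lazy
                                          resolver runs here and dirties
                                          the upper halves of YMM */
  (void)warmup;
  __asm__ __volatile__ ("vzeroupper"); /* back to the 'clean' state */

  double a, b;
  for (a = b = 0.0; b < 2.0; b += 0.00000005) a += exp(b);
  printf("%f\n", a);
  return 0;
}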

The buggy _dl_runtime_resolve() is called every time the dynamic
linker resolves a symbol (any symbol, not just the ones mentioned
above). A single call to _dl_runtime_resolve() is enough to touch the
upper halves of the YMM registers and provoke the AVX/SSE transition
penalty from then on. It is safe to say that every dynamically linked
application calls _dl_runtime_resolve() at least once, which means all
of them may experience the slowdown. The degradation shows up whenever
such an application mixes AVX and SSE instructions (switches from AVX
to SSE or back).

There are two groups of math routines provided by libm:
(a) those that have an AVX-optimized version (exp, sin/cos, tan, atan,
log and others)
(b) those that don't have an AVX-optimized version and rely on the
general-purpose SSE implementation (pow, exp2/exp10, asin/acos,
sinh/cosh/tanh, asinh/acosh/atanh and others)

For the former group, the slowdown occurs when the routines are called
from SSE code (i.e. from an application compiled with -mno-avx),
because an SSE -> AVX transition takes place. For the latter group,
the slowdown occurs when the routines are called from AVX code (i.e.
from an application compiled with -mavx), because an AVX -> SSE
transition takes place. Both situations are realistic: gcc generates
SSE code when targeting plain x86-64, and gcc -march=native generates
AVX-optimized code on AVX-capable machines.

============================================================================

Let's take one routine from group (a) and try to reproduce the
slowdown.

#include <math.h>
#include <stdio.h>

int main () {
  double a, b;
  for (a = b = 0.0; b < 2.0; b += 0.00000005) a += exp(b);
  printf("%f\n", a);
  return 0;
}

$ gcc -O3 -march=x86-64 -o exp exp.c -lm

$ time ./exp
<..> 2.801s <..>

$ time LD_BIND_NOW=1 ./exp
<..> 0.660s <..>

You can see that the application runs about 4x faster when
_dl_runtime_resolve() is not called (LD_BIND_NOW=1 makes the dynamic
linker resolve all symbols at startup, so the lazy-binding trampoline
is never used). That's how serious the impact of AVX/SSE transitions
can be.

============================================================================

Let's take one routine from group (b) and try to reproduce the
slowdown.

#include <math.h>
#include <stdio.h>

int main () {
  double a, b;
  for (a = b = 0.0; b < 2.0; b += 0.00000005) a += pow(M_PI, b);
  printf("%f\n", a);
  return 0;
}

# note that the -mavx option is passed
$ gcc -O3 -march=x86-64 -mavx -o pow pow.c -lm

$ time ./pow
<..> 4.157s <..>

$ time LD_BIND_NOW=1 ./pow
<..> 2.123s <..>

You can see that the application runs about 2x faster when
_dl_runtime_resolve() is not called.

============================================================================

[!] It's important to mention that the scope of this bug may be even
wider. After a call to the buggy _dl_runtime_resolve(), any transition
between AVX-128 and SSE (otherwise perfectly legitimate) suffers from
the performance penalty. Any application that mixes AVX-128 floating-
point code with SSE floating-point code (e.g. by using an external
SSE-only library) will experience a serious slowdown.
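
To make this scenario concrete, here is a small synthetic sketch
(ours, not from the report; the function names are made up).
f_avx128() is forced to the VEX/AVX-128 encoding through a GCC target
attribute while f_sse() keeps the default legacy SSE encoding, so the
loop alternates between the two on every iteration. This is exactly
the kind of otherwise legitimate mixing that becomes expensive once
the upper-YMM state has been marked dirty; the actual cost depends on
the CPU generation.

#include <stdio.h>

/* Compiled with AVX enabled for this function only: the scalar math is
   emitted with the VEX encoding (e.g. vmulsd/vaddsd). */
__attribute__((target("avx"), noinline))
static double f_avx128(double x) { return x * 1.000000001 + 0.5; }

/* Default x86-64 code generation: legacy SSE (mulsd/addsd). */
__attribute__((noinline))
static double f_sse(double x) { return x * 0.999999999 + 0.5; }

int main () {
  double a = 1.0;
  long i;
  for (i = 0; i < 100000000; i++)
    a = f_sse(f_avx128(a));  /* AVX-128 <-> SSE transition on every call */
  printf("%f\n", a);
  return 0;
}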

[0] https://sourceware.org/bugzilla/show_bug.cgi?id=20495
[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e
[2] https://sourceware.org/git/?p=glibc.git;a=commit;h=fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604
[3] https://software.intel.com/en-us/articles/intel-avx-state-transitions-migrating-sse-code-to-avx
[4] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/dl-trampoline.h;h=d6c7f989b5e74442cacd75963efdc6785ac6549d;hb=fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604#l182

** Affects: glibc (Ubuntu)
     Importance: Undecided
         Status: New

Title:
  Serious performance degradation of math functions in 16.04/16.10/17.04
  due to known Glibc bug
