[Bug c++/89557] New: [7/8 regression] 4*movq to 2*movaps IPC performance regression on znver1

2019-03-02 Thread 0xe2.0x9a.0x9b at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89557

            Bug ID: 89557
           Summary: [7/8 regression] 4*movq to 2*movaps IPC performance
                    regression on znver1
           Product: gcc
           Version: 8.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: 0xe2.0x9a.0x9b at gmail dot com
  Target Milestone: ---

Approximate C++ source code:

  struct __attribute__((aligned(16))) A {
    union {
      struct {
        uint64_t a;
        double b;
      };
      uint64_t data[2];
    };
  };

  A a;
  a.a = 2;
  a.b = x*y;
  return a;

CPU: AMD Ryzen 5 1600 Six-Core Processor

GCC 7.4.0 generates (no -march/mtune):

  movq $2, 0x80(%rsp)
  movsd %xmm0, 0x88(%rsp)
  mov 0x80(%rsp), %rax
  mov 0x88(%rsp), %rdx
  mov %rax, 0x30(%rsp)
  mov %rdx, 0x38(%rsp)

GCC 7.4.0 generates (no -march, -mtune=native):

  movq $2, 0x80(%rsp)
  movsd %xmm0, 0x88(%rsp)
  movaps 0x80(%rsp), %xmm6
  movaps %xmm6, 0x30(%rsp)

GCC 8.2.0 generates (no -march/mtune):

  movq $2, 0x80(%rsp)
  movsd %xmm0, 0x88(%rsp)
  movdqa 0x80(%rsp), %xmm6
  movaps %xmm6, 0x30(%rsp)

GCC 8.2.0 generates (no -march, -mtune=native):

  movq $2, 0x80(%rsp)
  movsd %xmm0, 0x88(%rsp)
  movaps 0x80(%rsp), %xmm6
  movaps %xmm6, 0x30(%rsp)

IPC of an executable which uses the above code (perf stat):

  GCC 7.4.0 (no -march/mtune):
        617.233116  task-clock (msec)  #  0.997 CPUs utilized
     4,139,124,553  instructions       #  1.94  insn per cycle

  GCC 7.4.0 (no -march, -mtune=native):
       1106.252920  task-clock (msec)  #  1.000 CPUs utilized
     3,995,268,509  instructions       #  1.02  insn per cycle

  GCC 8.2.0 (no -march/mtune):
       1096.852485  task-clock (msec)  #  1.000 CPUs utilized
     3,790,839,401  instructions       #  0.97  insn per cycle

  GCC 8.2.0 (no -march, -mtune=native):
       1105.693441  task-clock (msec)  #  1.000 CPUs utilized
     4,041,957,928  instructions       #  1.04  insn per cycle

Summary: Using 2*movaps instead of 4*movq severely lowers IPC on znver1 CPUs

[Bug c++/89557] [7/8 regression] 4*movq to 2*movaps IPC performance regression on znver1 with -Og

2019-03-02 Thread 0xe2.0x9a.0x9b at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89557

Jan Ziak (http://atom-symbol.net) <0xe2.0x9a.0x9b at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[7/8 regression] 4*movq to  |[7/8 regression] 4*movq to
                   |2*movaps IPC performance    |2*movaps IPC performance
                   |regression on znver1        |regression on znver1 with
                   |                            |-Og

--- Comment #1 from Jan Ziak (http://atom-symbol.net) <0xe2.0x9a.0x9b at gmail 
dot com> ---
Forgot to mention that this happens with -Og optimization level.

[Bug c++/89557] [7/8 regression] 4*movq to 2*movaps IPC performance regression on znver1 with -Og

2019-03-02 Thread 0xe2.0x9a.0x9b at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89557

--- Comment #3 from Jan Ziak (http://atom-symbol.net) <0xe2.0x9a.0x9b at gmail 
dot com> ---
(In reply to Jakub Jelinek from comment #2)
> -Og is not meant to generate code with good performance, but code which is
> easy to debug, so benchmarking something with -Og makes no sense.

I agree with the first part of your sentence. On the other hand, -Og is in my
opinion the best of the -O? options for C/C++ developers to use during the
development cycle, and I believe -Og was originally intended to be used by
developers, so we should care about its performance because of its presumably
non-negligible user base.

> That said, if znver1 has slow movaps and it is confirmed on something other
> than a microbenchmark, then we should adjust tuning to avoid using it for
> memory copying.

I think a carefully selected use of znver1 16-byte movaps isn't slow, but it is
slow at least in the case when it is preceded by two 8-byte stores or more
generally by any partial store to the 16 bytes in memory.

A small piece of code makes it possible to clearly demonstrate the cause of
the problem and to suggest an optimization rule for the compiler to follow. The
IPC data I measured are from a larger application; "perf record" directed me to
this seemingly short code fragment because it is a performance issue in the
larger app.

The 16-byte struct is fundamental to the application and I cannot avoid using
it at this point in time, although I can remove the 16-byte alignment attribute
which causes movaps to be generated.

In general, imposing a 16-byte alignment on any C/C++ data structure with size
>= 16 bytes shouldn't slow down any program by a factor of 2. It can be
expected to increase or decrease performance by say a factor of 1.1 depending
on workload. A factor of 2 slowdown is unexpected.

It would be interesting to see what would happen to performance if all data
structures in C/C++ code with size >= 16 bytes were annotated to be aligned to
16 bytes. I don't have performance measurements of such general use of the
aligned(16) attribute.

Thank you for your reply.

[Bug c++/89557] [7/8 regression] 4*movq to 2*movaps IPC performance regression on znver1 with -Og

2019-03-03 Thread 0xe2.0x9a.0x9b at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89557

--- Comment #4 from Jan Ziak (http://atom-symbol.net) <0xe2.0x9a.0x9b at gmail 
dot com> ---
Without the aligned(16) attribute, the alignment of the struct in my code is 8
bytes and the struct size remains 16 bytes:

GCC 8.2.0 generates (-Og, no -march/mtune):

  movq $2, 0x80(%rsp)
  movsd %xmm0, 0x88(%rsp)
  movdqa 0x80(%rsp), %xmm6
  movups %xmm6, 0x30(%rsp)

The movups used here has approximately the same performance as movaps on
znver1.

[Bug target/89557] [7/8/9 regression] 4*movq to 2*movaps IPC performance regression on znver1 with -Og

2019-03-05 Thread 0xe2.0x9a.0x9b at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89557

--- Comment #6 from Jan Ziak (http://atom-symbol.net) <0xe2.0x9a.0x9b at gmail 
dot com> ---
Created attachment 45897
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45897&action=edit
a.cc: compilable testcase

[Bug target/89557] [7/8/9 regression] 4*movq to 2*movaps IPC performance regression on znver1 with -Og

2019-03-05 Thread 0xe2.0x9a.0x9b at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89557

--- Comment #7 from Jan Ziak (http://atom-symbol.net) <0xe2.0x9a.0x9b at gmail 
dot com> ---
Created attachment 45898
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45898&action=edit
Makefile

[Bug target/89557] [7/8/9 regression] 4*movq to 2*movaps IPC performance regression on znver1 with -Og

2019-03-05 Thread 0xe2.0x9a.0x9b at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89557

--- Comment #8 from Jan Ziak (http://atom-symbol.net) <0xe2.0x9a.0x9b at gmail 
dot com> ---
Testcase (a.cc) benchmark results. See attached Makefile for further
information about compiler options.

Machine 1: Ryzen 5 1600 Six-Core Processor:

  a0-7.4: 0.753795user
  ag-7.4: 0.313097user
  a1-7.4: 0.281629user

  a0-8.3: 0.739894user
  ag-8.3: 0.954584user  <-- performance issue in respect to ag-7.4
  a1-8.3: 0.281554user
  a3-8.3: 0.224067user

  ag-7.4n: 1.032364user  <-- performance issue in respect to ag-7.4
  ag-8.3n: 1.007429user  <-- performance issue in respect to ag-7.4

Machine 2: Intel(R) Xeon(R) CPU E5-2676 v3:

  a0-7.4: 1.02user
  ag-7.4: 0.37user
  a1-7.4: 0.34user

  a0-8.3: 1.01user
  ag-8.3: 0.95user  <-- performance issue in respect to a1-7.4
  a1-8.3: 0.34user
  a3-8.3: 0.27user

  ag-7.4n (-march=znver1): 1.05user  <-- performance issue in respect to ag-7.4
  ag-8.3n (-march=znver1): 0.99user  <-- performance issue in respect to ag-7.4

Machine 3: Intel(R) Celeron(R) CPU N2930:

  a0-7.4: 2.223435user
  ag-7.4: 1.017597user
  a1-7.4: 0.741288user

  a0-8.3: 2.224145user
  ag-8.3: 1.620879user  <-- performance issue in respect to ag-7.4
  a1-8.3: 1.014488user  <-- performance regression in respect to a1-7.4
  a3-8.3: 0.885718user  <-- performance regression in respect to a1-7.4

  ag-7.4n (-march=znver1): n/a
  ag-8.3n (-march=znver1): n/a

[Bug other/95971] New: [10 regression] Optimizer converts a false boolean value into a true boolean value

2020-06-29 Thread 0xe2.0x9a.0x9b at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95971

            Bug ID: 95971
           Summary: [10 regression] Optimizer converts a false boolean
                    value into a true boolean value
           Product: gcc
           Version: 10.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
          Assignee: unassigned at gcc dot gnu.org
          Reporter: 0xe2.0x9a.0x9b at gmail dot com
  Target Milestone: ---

Hello. I have found an optimization issue that is triggered by the -O2
optimization option in GCC 10.1.0.

The source code (see below) contains an infinite while(cond){} loop. The loop
condition is expected to always evaluate to true. The optimizer incorrectly
derives that the loop condition evaluates to false and removes the loop. It is
possible that the issue is related to optimizations of the delete operator in
C++.

Reproducibility:

  g++ 10.1.0 -O0: not reproducible
  g++ 10.1.0 -O1: not reproducible
  g++ 10.1.0 -O2: REPRODUCIBLE
  g++ 10.1.0 -O3: not reproducible
  g++ 9.3.0  -O2: not reproducible
  clang++ 10 -O2: not reproducible

Full source code:

$ cat a.cc
void xbool(bool value);

struct A {
    char *a = (char*)1;
    ~A() { delete a; }
    bool isZero() { return a == (void*)0; }
};

int main() {
    A a;
    xbool(a.isZero());
    while(!a.isZero());
    xbool(a.isZero()); // This line isn't required to trigger the issue
    return 0;
}

$ cat b.cc
void xbool(bool value) {}

$ cat Makefile 
test:
	g++ -c -O2 a.cc
	g++ -c b.cc
	g++ -o a a.o b.o
	time ./a

Dump of assembler code for function main:

   push   %rbp
   xor    %edi,%edi          // %rdi := false
   sub    $0x10,%rsp
   movq   $0x1,0x8(%rsp)
   callq  xbool(bool)
   mov    $0x1,%edi          // %rdi := true
   callq  xbool(bool)
   lea    0x8(%rsp),%rdi
   callq  A::~A()
   add    $0x10,%rsp
   xor    %eax,%eax
   pop    %rbp
   retq
   mov    %rax,%rbp
   jmpq   main.cold

In the assembler code: The compiler correctly passes zero (false) in the 1st
call to function xbool(bool), then incorrectly passes one (true) in the 2nd
call to function xbool(bool).

The source code initializes A::a to (char*)1 in order to keep the code as small
as possible to trigger the issue. A::a could have been initialized to a valid
delete-able heap address, but this would unnecessarily enlarge the source code.

The GCC version string on my machine is "g++ (Gentoo 10.1.0-r1 p2) 10.1.0".


Please confirm the reproducibility of this issue.

[Bug other/95971] [10 regression] Optimizer converts a false boolean value into a true boolean value

2020-06-29 Thread 0xe2.0x9a.0x9b at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95971

--- Comment #2 from Jan Ziak (http://atom-symbol.net) <0xe2.0x9a.0x9b at gmail 
dot com> ---
Created attachment 48805
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48805&action=edit
b.cc

[Bug other/95971] [10 regression] Optimizer converts a false boolean value into a true boolean value

2020-06-29 Thread 0xe2.0x9a.0x9b at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95971

--- Comment #1 from Jan Ziak (http://atom-symbol.net) <0xe2.0x9a.0x9b at gmail 
dot com> ---
Created attachment 48804
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48804&action=edit
a.cc

[Bug other/95971] [10 regression] Optimizer converts a false boolean value into a true boolean value

2020-06-29 Thread 0xe2.0x9a.0x9b at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95971

--- Comment #3 from Jan Ziak (http://atom-symbol.net) <0xe2.0x9a.0x9b at gmail 
dot com> ---
Created attachment 48806
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48806&action=edit
Makefile

[Bug other/95971] [10 regression] Optimizer converts a false boolean value into a true boolean value

2020-06-29 Thread 0xe2.0x9a.0x9b at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95971

Jan Ziak (http://atom-symbol.net) <0xe2.0x9a.0x9b at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #48804|0                           |1
        is obsolete|                            |

--- Comment #5 from Jan Ziak (http://atom-symbol.net) <0xe2.0x9a.0x9b at gmail 
dot com> ---
Created attachment 48808
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48808&action=edit
a.cc

Initialize A::a to a valid heap pointer, instead of initializing it to
(char*)1.

[Bug other/95971] [10 regression] Optimizer converts a false boolean value into a true boolean value

2020-06-29 Thread 0xe2.0x9a.0x9b at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95971

--- Comment #7 from Jan Ziak (http://atom-symbol.net) <0xe2.0x9a.0x9b at gmail 
dot com> ---
(In reply to Martin Liška from comment #6)
> All right, so it's caused by cdde1:
> 
> Assume loop 1 to be finite: it has an exit and -ffinite-loops is on.
> 
>   -ffinite-loops
>       Assume that a loop with an exit will eventually take the exit and
> not loop indefinitely.  This allows the compiler to remove loops that
> otherwise have no side-effects, not considering eventual endless looping as
> such.
> 
>       This option is enabled by default at -O2 for C++ with -std=c++11
> or higher.

Thank you for the explanation.

Your mindset is forcing me to stop using g++ over time because of reliability
concerns during application development.

Sincerely
Jan

[Bug other/95971] [10 regression] Optimizer converts a false boolean value into a true boolean value

2020-06-29 Thread 0xe2.0x9a.0x9b at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95971

--- Comment #9 from Jan Ziak (http://atom-symbol.net) <0xe2.0x9a.0x9b at gmail 
dot com> ---
(In reply to Martin Liška from comment #8)
> Or you can use -fno-finite-loops option.

I am sorry, but I cannot trust this compiler not to force me to spend several
hours again just to learn that -O2 is semantically different from -O1 and
-O3.

The meaning of "semantically equivalent" in my mind is different from the
meaning of "semantically equivalent" in your mind. Infinite loopiness is in my
opinion semantically significant, so the compiler should have printed a warning
that would inform me about the fact that the compiler is changing the semantics
of the code in question.

With -O3, the assembly code is:

Dump of assembler code for function main:
   <+0>:     sub    $0x8,%rsp
   <+4>:     xor    %edi,%edi
   <+6>:     callq  xbool(bool)
   <+11>:    jmp    main+11

"11: jmp 11" is a prime example of what -ffinite-loops is supposed to prevent
from being generated.

This assumes that -O3 actually does include -ffinite-loops, which I am unable
to verify because "g++ --help=optimizers -Q" doesn't accept the -std=gnu++11
option.

[Bug other/95971] [10 regression] Optimizer converts a false boolean value into a true boolean value

2020-06-29 Thread 0xe2.0x9a.0x9b at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95971

--- Comment #10 from Jan Ziak (http://atom-symbol.net) <0xe2.0x9a.0x9b at gmail 
dot com> ---
I hope you do realize that the code I posted previously is equivalent, or very
close to being equivalent, to the following code:

  struct President {
    const bool dead = false;
    bool isDead() { return dead; }
  } president;

  while(!president.isDead());
  if(president.isDead()) {
    launch_retaliation_nukes();
  }

With -ffinite-loops enabled, the nukes are going to be launched because the
only way that the while-loop can terminate is for President::dead to be true
and thus the "const bool dead" can be assumed to be true when execution reaches
the if-statement after skipping the deleted infinite while loop.

[Bug other/95971] [10 regression] Optimizer converts a false boolean value into a true boolean value

2020-06-29 Thread 0xe2.0x9a.0x9b at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95971

--- Comment #12 from Jan Ziak (http://atom-symbol.net) <0xe2.0x9a.0x9b at gmail 
dot com> ---
(In reply to Marc Glisse from comment #11)
>   while(!a.isZero());
> 
> that doesn't look like something you would find in real code. Are you
> waiting for a different thread to modify a? Then you should use an atomic
> operation. Are you waiting for the hardware to change something? Use
> volatile. Do you really want an infinite loop? Spell it out
> if(!a.isZero())for(;;);

The code I sent is a downsized version of a larger code, which means that the
posted code isn't the real code.

[Bug target/89557] [7/8/9/10 regression] 4*movq to 2*movaps IPC performance regression on znver1 with -Og

2019-09-17 Thread 0xe2.0x9a.0x9b at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89557

--- Comment #11 from Jan Ziak (http://atom-symbol.net) <0xe2.0x9a.0x9b at gmail 
dot com> ---
(In reply to Eric Gallager from comment #10)
> 
> /usr/bin/time ./a0-7.4 |& egrep -o [0-9]+.*user
> 1.48 real 1.26 user
> /usr/bin/time ./ag-7.4 |& egrep -o [0-9]+.*user
> 0.61 real 0.59 user
> /usr/bin/time ./a1-7.4 |& egrep -o [0-9]+.*user
> 0.57 real 0.55 user
> 
> /usr/bin/time ./a0-8.3 |& egrep -o [0-9]+.*user
> 1.27 real 1.21 user
> /usr/bin/time ./ag-8.3 |& egrep -o [0-9]+.*user
> 0.60 real 0.59 user
> /usr/bin/time ./a1-8.3 |& egrep -o [0-9]+.*user
> 0.60 real 0.59 user
> /usr/bin/time ./a3-8.3 |& egrep -o [0-9]+.*user
> 0.45 real 0.43 user
> 
> /usr/bin/time ./ag-7.4n |& egrep -o [0-9]+.*user
> 0.60 real 0.59 user
> /usr/bin/time ./ag-8.3n |& egrep -o [0-9]+.*user
> 0.61 real 0.59 user
> 
> So, uh, I'm not sure if that's a confirmation, but it's an extra data point.

Interesting. Your measurement is showing that there is no performance
regression on your machine when going from ag-7.4 to ag-8.3.

Some questions:

- What CPU was used to obtain your results?

- If you run "perf record ./ag-8.3; perf report", which instructions do you see
highlighted when you enter the disassembly of function "mul"? On Ryzen 3700X, I
see:

   3.57%  movdqu 0x70(%rsp), %xmm4
  69.25%  movups %xmm4, 0x30(%rsp)
   9.32%  jmpq 11bb

Thanks.



Sidenote: The mirror http://gcc.fyxm.net at https://gcc.gnu.org/mirrors.html is
invalid.

[Bug target/89557] [7/8/9/10 regression] 4*movq to 2*movaps IPC performance regression on znver1 with -Og

2019-06-14 Thread 0xe2.0x9a.0x9b at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89557

--- Comment #9 from Jan Ziak (http://atom-symbol.net) <0xe2.0x9a.0x9b at gmail 
dot com> ---
(In reply to Richard Biener from comment #5)
> Please provide a compilable testcase.

Done some time ago. Please change the status of this bug from WAITING to some
other status.