[Bug target/108322] Using __restrict parameter with -ftree-vectorize (default with -O2) results in massive code bloat

2023-01-10 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322

--- Comment #5 from Alexander Monakov  ---
(In reply to Richard Biener from comment #4)
> 
> For the case at hand loading two vectors from the destination and then
> punpck{h,l}bw and storing them again might be the most efficient thing
> to do here.

I think such read-modify-write on the destination introduces a data race for
bytes that are not accessed in the original program, so that would be okay only
under -fallow-store-data-races?
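
To make the concern concrete, here is a minimal sketch. It is not the testcase from the PR: the stride-2 store and both function names are assumptions chosen for illustration. The first function writes only the even bytes of the destination. The second gives the same single-threaded result, but it loads and stores back the odd bytes as well, which is the shape a vector load/interleave/store sequence would take.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stride-2 store: only the even-indexed bytes of dst are
   written, so the abstract machine never touches dst[1], dst[3], ...  */
void store_even_bytes(uint8_t *restrict dst, const uint8_t *restrict src,
                      size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[2 * i] = src[i];
}

/* Same single-threaded result, written as a read-modify-write of each
   2-byte group: the odd byte is loaded and stored back unchanged.  If
   another thread wrote dst[2*i + 1] between the load and the store, that
   write would be lost.  */
void store_even_bytes_rmw(uint8_t *restrict dst, const uint8_t *restrict src,
                          size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t pair[2] = { dst[2 * i], dst[2 * i + 1] }; /* load both bytes */
        pair[0] = src[i];                                 /* merge new data */
        dst[2 * i] = pair[0];                             /* store both back */
        dst[2 * i + 1] = pair[1];
    }
}
```

In a single thread the two are indistinguishable. The difference only shows when another thread owns the odd bytes, which is why the second form is normally acceptable only when store data races are explicitly allowed.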

[Bug target/108322] Using __restrict parameter with -ftree-vectorize (default with -O2) results in massive code bloat

2023-01-06 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
With '-fdisable-tree-forwprop4 -msse4.1' you see what the vectorizer perhaps
wanted to achieve.

[Bug rtl-optimization/108318] Floating point calculation moved out of loop despite fesetround

2023-01-06 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108318

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
Please see documentation for the -frounding-math option, but even with that
option added, your testcase still has the faux-invariant moved by RTL PRE
(-fno-gcse).

Interestingly, if your testcase is modified to compute the sum before the call:

#include <fenv.h>
void
foo (double res[4], double a, double b, double x[])
{
  a = x[0];
  b = x[1];
  static const int rm[4]
  = { FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD };
  for (int i = 0; i < 4; ++i)
    {
      double t = a + b;
      fesetround (rm[i]);
      res[i] = t;
    }
  fesetround (FE_TONEAREST); // restore default
}

Then it demonstrates how a few *other* optimizations also perform unwanted
motion:

* SSA PRE (-fno-tree-pre)
* TER (-fno-tree-ter)
* RTL LIM (-fno-move-loop-invariants)
* and finally the register allocator (unavoidable)

[Bug target/108315] New: -mcpu=power10 changes ABI

2023-01-06 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108315

Bug ID: 108315
   Summary: -mcpu=power10 changes ABI
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: ABI, wrong-code
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---
Target: powerpc64le-*-*

Created attachment 54202
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54202&action=edit
testcase

At least the documentation should mention that if intentional.

In the attached example, the function bar is compiled to

bar:
.localentry bar,1
mtctr 3
mr 12,3
bctr
.long 0
.byte 0,0,0,0,0,0,0,0

i.e. it does not preserve r2 (it's compiled with -mcpu=power10). If the caller
is not compiled with -mcpu=power10, it needs r2 preserved (bar has a
localentry, so the nop in the caller stays a nop after linking).

I verified the testcase misbehaves on Compile Farm's gcc135: as it does not use
any power10-specific instructions, it's runnable there.

[Bug middle-end/108256] New: Missing integer overflow instrumentation when assignment LHS is narrow

2022-12-31 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108256

Bug ID: 108256
   Summary: Missing integer overflow instrumentation when
assignment LHS is narrow
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---

For

unsigned short f(unsigned short x, unsigned short y)
{
return x * y;
}

unsigned short g(unsigned short x, unsigned short y)
{
int r = x * y;
return r;
}

gcc -O2 -fsanitize=undefined emits instrumentation only for 'g', although both
are equivalent. When 'int r' is changed to 'unsigned short r', 'g' is also not
instrumented.

PR 107912 shows a slightly more complicated variant of this. Affects both C and
C++.
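
For context on why instrumentation is expected at all: with a 16-bit unsigned short and a 32-bit int (an assumption that holds on common targets), both operands are promoted to int before the multiplication, so 65535 * 65535 overflows in 'f' and 'g' alike. A minimal sketch of a variant that avoids the signed overflow follows; the function name is made up for illustration.

```c
/* Converting one operand to unsigned int before multiplying makes the
   arithmetic wrap modulo 2^32 instead of overflowing a signed int.
   Assumes 16-bit unsigned short and 32-bit int.  */
unsigned short mul_u16_wrapping(unsigned short x, unsigned short y)
{
    unsigned int r = (unsigned int)x * y;  /* unsigned arithmetic: no UB */
    return (unsigned short)r;              /* keep the low 16 bits */
}
```

Under those width assumptions, mul_u16_wrapping(65535, 65535) returns 1, since 65535 * 65535 is 0xFFFE0001.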

[Bug target/108229] [13 Regression] unprofitable STV transform since r13-4873-g0b2c1369d035e928

2022-12-28 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108229

--- Comment #3 from Alexander Monakov  ---
Thank you! I considered this unprofitable for these reasons:

1. As you said, the code grows in size, but the speed benefit is not clear.

2. The transform converts load+add operations in a loop, and their final uses
outside of the loop. How does the costing work in this case, i.e. how are
changes for the more frequently executed instructions weighted against
changes for the instructions that will be executed once?

3. The scalar 'add reg, mem' instruction results in one micro-fused uop that is
handled as one uop during renaming (one of the narrowest points in the
pipeline). It is then issued on two execution units (for the load and for the
add).

4. On AMD, there are separate fp/simd pipes, so when the code is already
simd-heavy as in this example, STV offloads instructions from the integer pipes
to the possibly already-busy simd/fp pipes.

That said, the transformed portion is small relative to the inner loop of the
example, so benchmarking yesterday's trunk with/without -mno-stv on Zen 2, I
get:

27.26 bytes/cycle, 3.07 instructions/cycle

vs.

26.01 bytes/cycle, 2.97 instructions/cycle

So it's not the end of the world for this particular example, but I wanted to
raise the issue in case there's a costing problem in STV that needs correcting.

[Bug target/108229] New: [13 Regression] unprofitable STV transform

2022-12-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108229

Bug ID: 108229
   Summary: [13 Regression] unprofitable STV transform
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---
Target: x86_64-*-*

In the following example, STV is making a very unprofitable transformation on
trunk, but not on gcc-12:

#include <stdint.h>
#include <stddef.h>

struct b {
    struct b *next;
    uint64_t data[511];
};

typedef uint64_t u64v2 __attribute__((vector_size(16)));
static inline
void vsum(u64v2 s[], uint64_t *x, size_t n)
{
    typedef u64v2 u64v2_u __attribute__((may_alias));
    u64v2_u *vx = (void *)x;
    for (; n; vx += 4, n -= 8) {
        s[0] += vx[0];
        s[1] += vx[1];
        s[2] += vx[2];
        s[3] += vx[3];
    }
}

uint64_t sum(struct b *b)
{
    uint64_t s = 0;
    u64v2 vs[4] = { 0 };
    do {
        vsum(vs, b->data + 7, 511-7);
#pragma GCC unroll(7)
        for (int i = 0; i < 7; i++)
            s += b->data[i];
    } while ((b = b->next));
    vs[0] += vs[1] + vs[2] + vs[3];
    return s + vs[0][0] + vs[0][1];
}

gcc -O2 -mavx (-mavx is not necessary, plain -O2 also triggers it):

sum:
vpxor   xmm2, xmm2, xmm2
vmovdqa xmm1, xmm2
vmovdqa xmm3, xmm2
vmovdqa xmm0, xmm2
vmovdqa xmm5, xmm2
.L3:
lea rax, [rdi+64]
lea rdx, [rdi+4096]
.L2:
vpaddq  xmm0, xmm0, XMMWORD PTR [rax]
vpaddq  xmm3, xmm3, XMMWORD PTR [rax+16]
add rax, 64
vpaddq  xmm1, xmm1, XMMWORD PTR [rax-32]
vpaddq  xmm2, xmm2, XMMWORD PTR [rax-16]
cmp rdx, rax
jne .L2
vmovq   xmm6, QWORD PTR [rdi+16]
vmovq   xmm4, QWORD PTR [rdi+8]
vpaddq  xmm4, xmm4, xmm6
vpaddq  xmm4, xmm4, xmm5
vmovq   xmm5, QWORD PTR [rdi+24]
vpaddq  xmm4, xmm4, xmm5
vmovq   xmm5, QWORD PTR [rdi+32]
vpaddq  xmm4, xmm4, xmm5
vmovq   xmm5, QWORD PTR [rdi+40]
vpaddq  xmm4, xmm4, xmm5
vmovq   xmm5, QWORD PTR [rdi+48]
vpaddq  xmm4, xmm4, xmm5
vmovq   xmm5, QWORD PTR [rdi+56]
mov rdi, QWORD PTR [rdi]
vpaddq  xmm5, xmm4, xmm5
test    rdi, rdi
jne .L3
vpaddq  xmm1, xmm1, xmm2
vpaddq  xmm0, xmm0, xmm3
vpaddq  xmm0, xmm0, xmm1
vmovdqa xmm1, xmm0
vpsrldq xmm0, xmm0, 8
vpaddq  xmm0, xmm1, xmm0
vpaddq  xmm0, xmm0, xmm5
vmovq   rax, xmm0
ret

compare with gcc -O2 -mavx -mno-stv:

sum:
vpxor   xmm2, xmm2, xmm2
xor edx, edx
vmovdqa xmm1, xmm2
vmovdqa xmm3, xmm2
vmovdqa xmm0, xmm2
.L3:
lea rax, [rdi+64]
lea rcx, [rdi+4096]
.L2:
vpaddq  xmm0, xmm0, XMMWORD PTR [rax]
vpaddq  xmm3, xmm3, XMMWORD PTR [rax+16]
add rax, 64
vpaddq  xmm1, xmm1, XMMWORD PTR [rax-32]
vpaddq  xmm2, xmm2, XMMWORD PTR [rax-16]
cmp rcx, rax
jne .L2
mov rax, QWORD PTR [rdi+16]
add rax, QWORD PTR [rdi+8]
add rdx, rax
add rdx, QWORD PTR [rdi+24]
add rdx, QWORD PTR [rdi+32]
add rdx, QWORD PTR [rdi+40]
add rdx, QWORD PTR [rdi+48]
add rdx, QWORD PTR [rdi+56]
mov rdi, QWORD PTR [rdi]
test    rdi, rdi
jne .L3
vpaddq  xmm0, xmm0, xmm3
vpaddq  xmm1, xmm1, xmm2
vpaddq  xmm0, xmm0, xmm1
vmovq   rcx, xmm0
vpextrq rax, xmm0, 1
add rax, rcx
add rax, rdx
ret

[Bug middle-end/108209] goof in genmatch.cc:commutative_op

2022-12-23 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108209

--- Comment #1 from Alexander Monakov  ---
Keeping notes as I go...

Duplicated checks for 'op0' in lower_for are duplicated.

[Bug middle-end/108209] New: goof in genmatch.cc:commutative_op

2022-12-23 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108209

Bug ID: 108209
   Summary: goof in genmatch.cc:commutative_op
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---

It pretends that define_operator_list is commutative when its first member is
NOT commutative:

  if (user_id *uid = dyn_cast<user_id *> (id))
    {
      int res = commutative_op (uid->substitutes[0]);
      if (res < 0)
        return 0;
      for (unsigned i = 1; i < uid->substitutes.length (); ++i)
        if (res != commutative_op (uid->substitutes[i]))
          return -1;
      return res;
    }

The first 'return 0' should be 'return -1' instead.
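
The intended logic can be sketched outside genmatch as a standalone function. This is only an illustration: the flat array of per-member results and the function name are hypothetical, and it assumes (as the snippet suggests) that a negative result means "not commutative". The empty-list case is an addition of this sketch, not something the original code handles.

```c
#include <stddef.h>

/* member_result[i] stands in for commutative_op (uid->substitutes[i]):
   negative means "not commutative", anything else is the answer for a
   commutative member.  */
int list_commutative_op(const int *member_result, size_t n)
{
    if (n == 0)
        return -1;          /* sketch-only guard for an empty list */
    int res = member_result[0];
    if (res < 0)
        return -1;          /* the reported code returns 0 here */
    for (size_t i = 1; i < n; ++i)
        if (member_result[i] != res)
            return -1;      /* members disagree: not commutative */
    return res;
}
```

With this shape, a list whose first member is not commutative yields -1 regardless of what the remaining members report.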

[Bug rtl-optimization/108117] Wrong instruction scheduling on value coming from abnormal SSA

2022-12-22 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117

--- Comment #16 from Alexander Monakov  ---
Draft patch for the sched1 issue:
https://inbox.sourceware.org/gcc-patches/cf62c3ec-0a9e-275e-5efa-2689ff1f0...@ispras.ru/T/#m95238afa0f92daa0ba7f8651741089e7cfc03481

[Bug middle-end/108140] ICE expanding __rbit

2022-12-16 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108140

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org
   Keywords||ice-on-valid-code
  Component|c   |middle-end
 Target||aarch64-*-*
Summary|tzcnt gives different   |ICE expanding __rbit
   |result in debug vs release  |

--- Comment #4 from Alexander Monakov  ---
When comment #0 says "this crashes at -O2", it means ICE in expand for the
'__rbit' intrinsic on this testcase, which is reproducible on 12.2 and trunk:

#include<stdio.h>
#include<arm_acle.h>
int main(int argc, char *argv[])
{
    unsigned long long input = argc-1;
    unsigned long long v = __clz(__rbit(input));
    printf("%d %d\n", argc, v >= 64 ? 123 : 456);
}

I've edited the bug title to reflect this.

[Bug rtl-optimization/57067] Missing control flow edges for setjmp/longjmp

2022-12-15 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57067

--- Comment #9 from Alexander Monakov  ---
*** Bug 108117 has been marked as a duplicate of this bug. ***

[Bug rtl-optimization/108117] Wrong instruction scheduling on value coming from abnormal SSA

2022-12-15 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117

Alexander Monakov  changed:

   What|Removed |Added

 Resolution|FIXED   |DUPLICATE

--- Comment #15 from Alexander Monakov  ---
Sorry, didn't mean to remove the duplicate info. I could swear I didn't touch
the dropdown, not sure what happened.

*** This bug has been marked as a duplicate of bug 57067 ***

[Bug rtl-optimization/108117] Wrong instruction scheduling on value coming from abnormal SSA

2022-12-15 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117

Alexander Monakov  changed:

   What|Removed |Added

 Resolution|DUPLICATE   |FIXED

--- Comment #14 from Alexander Monakov  ---
(In reply to Andrew Pinski from comment #13)
> 
> The lifetime of the pseudo was already across the call ...

Hm, I disagree: 'vb = 1' is a killing definition. Therefore the 'vb = 0'
initialization is dead at the point of the call.

[Bug rtl-optimization/108117] Wrong instruction scheduling on value coming from abnormal SSA

2022-12-15 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117

--- Comment #12 from Alexander Monakov  ---
Shouldn't there be another bug for the sched1 issue specifically? In absence of
abnormal control flow, extending lifetimes of pseudos across calls is still
likely to be a pessimization.

[Bug tree-optimization/108129] New: nop_atomic_bit_test_and_p is too bloated

2022-12-15 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108129

Bug ID: 108129
   Summary: nop_atomic_bit_test_and_p is too bloated
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---

match.pd has multi-pattern matcher 'nop_atomic_bit_test_and_p'.

It expands to ~38 KLOC in gimple-match.cc and ~350 KB in the compiled binary.

There has to be a better way than repeatedly emitting the match pattern for
each member of {ATOMIC,SYNC}_FETCH_{AND,OR_XOR}_N :)

[Bug rtl-optimization/108117] Wrong instruction scheduling on value coming from abnormal SSA

2022-12-15 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117

--- Comment #9 from Alexander Monakov  ---
(In reply to Feng Xue from comment #8)

> In another angle, because gcc already model control flow and SSA web for
> setjmp/longjmp, explicit volatile specification is not really needed.

That covers GIMPLE, but after transitioning to RTL, setjmp is not properly
modeled anymore (like in old versions of GCC before Tree-SSA). Many RTL passes
simply refuse touching the function if it has a setjmp call, but as your
example demonstrated, scheduling still can make a surprising transform.

[Bug rtl-optimization/108117] Wrong instruction scheduling on value coming from abnormal SSA

2022-12-15 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117

Alexander Monakov  changed:

   What|Removed |Added

 Status|RESOLVED|UNCONFIRMED
 Resolution|INVALID |---

--- Comment #5 from Alexander Monakov  ---
On further thought, this is really an invalid transform, because the value
becomes "clobbered" only if it was changed between setjmp and longjmp:

(C11 7.13.2.1 "The longjmp function")
> All accessible objects have values, and all other components of the abstract
> machine have state, as of the time the longjmp function was called, except
> that the values of objects of automatic storage duration that are local to
> the function containing the invocation of the corresponding setjmp macro
> that do not have volatile-qualified type and have been changed between the
> setjmp invocation and longjmp call are indeterminate.

In the testcase, the assignment 'vb = 1' did not happen in the abstract
machine.

Moving back to UNCONFIRMED, both because the transform is invalid, and because
lifting assignments to pseudos across calls in sched1 seems useless if not
harmful to performance and code size.

(that said, the -Wclobbered diagnostic still points to a potential issue, so it
shouldn't be ignored)
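
For contrast, here is a minimal sketch of the case the quoted paragraph does make indeterminate, and the portable way around it. It is not the testcase from this PR; the function names are made up. The local is modified between the setjmp invocation and the longjmp call, so it is declared volatile to keep its value well defined afterwards.

```c
#include <setjmp.h>

static jmp_buf env;

/* Helper that transfers control back to the setjmp in the caller.  */
static void jump_back(void)
{
    longjmp(env, 1);
}

int value_after_longjmp(void)
{
    volatile int vb = 0;   /* volatile: stays determinate across longjmp */
    if (setjmp(env) == 0) {
        vb = 1;            /* changed between setjmp and longjmp */
        jump_back();       /* does not return */
    }
    return vb;             /* reached after longjmp; reads the stored 1 */
}
```

Without the volatile qualifier, reading 'vb' after the longjmp would be reading an indeterminate value. The testcase in this PR is different: there the assignment never happens in the abstract machine, so the standard gives the compiler no such license.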

[Bug rtl-optimization/108117] Wrong instruction scheduling on value coming from abnormal SSA

2022-12-14 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov  ---
-Wclobbered properly warns here (and it's part of -Wextra).

With explicit -fschedule-insns, reproducible on x86 as well.

The reason for the issue is quite surprising though: I did not expect pre-RA
scheduling to lift assignments to pseudos across calls, because it just
increases register pressure at the point of the call for little or no gain.

[Bug tree-optimization/108076] [10/11/12/13 Regression] GCC with -O3 produces code which fails to link

2022-12-12 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108076

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org
Summary|GCC with -O3 produces code  |[10/11/12/13 Regression]
   |which fails to link |GCC with -O3 produces code
   ||which fails to link
  Known to work||8.5.0
  Component|c   |tree-optimization
   Keywords||link-failure

--- Comment #2 from Alexander Monakov  ---
GIMPLE if-conversion seems to delete BBs with address-taken labels; works with
-fno-tree-loop-if-convert

[Bug tree-optimization/108008] [12 Regression] wrong code with -O3 and posix_memalign

2022-12-11 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108008

--- Comment #10 from Alexander Monakov  ---
Looks similar to PR 107323, but needs explicit -ftree-loop-distribution to
trigger.

[Bug tree-optimization/108008] [12 Regression] wrong code with -O3 and posix_memalign

2022-12-08 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108008

--- Comment #9 from Alexander Monakov  ---
I think this is tree-ldist placing memset(sameZ, 0, zPlaneCount) after the
loop, overwriting conditional 'sameZ[i] = true' assignments that happen in the
loop.

For the smaller testcase from comment #6, -O2 -ftree-loop-distribution is
enough, namely:

works:

gcc-12 -O2 -ftree-loop-distribution -fno-tree-vectorize
-fno-tree-loop-distribute-patterns

breaks:

gcc-12 -O2 -ftree-loop-distribution -fno-tree-vectorize

[Bug target/87832] AMD pipeline models are very costly size-wise

2022-12-07 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #11 from Alexander Monakov  ---
Factoring out Lujiazui divider shrinks its tables by almost 20x:

3 r lujiazui_decoder_min_issue_delay
20 r lujiazui_decoder_transitions
32 r lujiazui_agu_min_issue_delay
126 r lujiazui_agu_transitions
304 r lujiazui_div_base
352 r lujiazui_div_check
352 r lujiazui_div_transitions
1152 r lujiazui_core_min_issue_delay
1592 r lujiazui_agu_translate
1592 r lujiazui_core_translate
1592 r lujiazui_decoder_translate
1592 r lujiazui_div_translate
3952 r lujiazui_div_min_issue_delay
9216 r lujiazui_core_transitions

[Bug c++/108008] Compiler mis-optimization with posix_memalign

2022-12-07 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108008

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
-fno-split-loops "cures" it (of course it might just be an enabling transform
for an incorrect optimization later on)

Bisecting trunk for which commit fixes/hides it may be useful.

[Bug c/107971] linking an assembler object creates an executable stack

2022-12-05 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107971

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov  ---
The warning is new in binutils-2.39 (the latest release at this time), perhaps
your linker is older.

[Bug tree-optimization/107879] [13 Regression] ffmpeg-4 test suite fails on FPU arithmetics

2022-12-05 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107879

--- Comment #10 from Alexander Monakov  ---
If anyone is confused like I was, the commit actually includes a testcase, but
the addition is not mentioned in the Changelog. I was sure the server-side
receive hook was supposed to reject such incomplete Changelog, though?

[Bug middle-end/107905] 2x slowdown versus CLANG and ICL

2022-11-30 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107905

--- Comment #6 from Alexander Monakov  ---
Let me add that Clang supports GCC's -fprofile-{generate,use} flags for
compatibility as well.

[Bug middle-end/107905] 2x slowdown versus CLANG and ICL

2022-11-30 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107905

--- Comment #5 from Alexander Monakov  ---
Not sure what you don't like about the inputs, they appear quite reasonable.
Perhaps GCC's estimation of bb frequencies is off (with profile feedback we
achieve good performance).

Georgi: you'll likely see better results with profile-guided optimization. You
can first compile the benchmark with -O2 -fprofile-generate, run the output (it
will generate *.gcda files), then compile again with -O2 -fprofile-use. For
Clang the options are spelled -fprofile-instr-generate and -fprofile-instr-use,
respectively.

[Bug driver/107787] -Werror=array-bounds=X does not work as expected

2022-11-30 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107787

Alexander Monakov  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 CC||amonakov at gcc dot gnu.org
 Resolution|--- |FIXED

--- Comment #3 from Alexander Monakov  ---
Fixed for gcc-13.

[Bug middle-end/107905] 2x slowdown versus CLANG and ICL

2022-11-29 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107905

Alexander Monakov  changed:

   What|Removed |Added

   Keywords|ra  |
 CC||amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
LLVM does a better job at code layout, and massively wins on the amount of
executed branches (in particular unconditional jumps). With -fdisable-rtl-bbro
gcc achieves a similar performance.

[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX

2022-11-28 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #26 from Alexander Monakov  ---
Sure, the right course of action seems to be to simply document that atomic
types and built-ins are meant to be used on "common" (writeback) memory, and no
guarantees can be given otherwise, because it would involve platform specifics
(relaxed ordering of WC writes as you say; tearing by PCI bridges and device
interfaces seems like another possible caveat).

[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX

2022-11-28 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #24 from Alexander Monakov  ---
(In reply to Peter Cordes from comment #23)
> But at least on Linux, I don't think there's a way for user-space to even
> ask for a page of WT or WP memory (or UC or WC).  Only WB memory is easily
> available without hacking the kernel.  As far as I know, this is true on
> other existing OSes.

I think it's possible to get UC/WC mappings via a graphics/compute API (e.g.
OpenGL, Vulkan, OpenCL, CUDA) on any OS if you get a mapping to device memory
(and then CPU vendor cannot guarantee that 128b access won't tear because it
might depend on downstream devices).

[Bug rtl-optimization/107772] function prologue generated even though it's only needed in an unlikely path

2022-11-28 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107772

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov  ---
You'll get better results from outlining a rare path manually: the
prologue/epilogue won't be re-executed for each invocation of 'g':

int g(int);

__attribute__((noinline,cold))
static void f_slowpath(int* b, int* e)
{
switch (0)
do {
if (*b != 0)
default: *b = g(*b);
} while (++b != e);
}

void f(int* b, int* e)
{
for (; b != e; b++)
if (*b != 0) {
f_slowpath(b, e);
return;
}
}

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #21 from Alexander Monakov  ---
(In reply to Michael_S from comment #19)
> > Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
> > 'unlaminated' (turned to 2 uops before renaming), so selecting independent
> > IVs for the two arrays actually helps on this testcase.
> 
> Both 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' and 'vfnmadd231pd 32(%rdx),
> %ymm3, %ymm0' would be turned into 2 uops.

The difference is at which point in the pipeline. The latter goes through
renaming as one fused uop.

> Misuse of load+op is far bigger problem in this particular test case than
> sub-optimal loop overhead. Assuming execution on Intel Skylake, it turns
> loop that can potentially run at 3 clocks per iteration into loop of 4+
> clocks per iteration.

Sorry, which assembler output does this refer to?

> But I consider it a separate issue. I reported similar issue in 97127, but
> here it is more serious. It looks to me that the issue is not soluble within
> existing gcc optimization framework. The only chance is if you accept my old
> and simple advice - within inner loops pretend that AVX is RISC, i.e.
> generate code as if load-op form of AVX instructions weren't existing.

In bug 97127 the best explanation we have so far is we don't optimally handle
the case where non-memory inputs of an fma are reused, so we can't combine a
load with an fma without causing an extra register copy (PR 97127 comment 16
demonstrates what I mean). I cannot imagine such trouble arising with more
common commutative operations like mul/add, especially with non-destructive VEX
encoding. If you hit such examples, I would suggest reporting them as well,
because their root cause might be different.

In general load-op combining should be very helpful on x86, because it reduces
the number of uops flowing through the renaming stage, which is one of the
narrowest points in the pipeline.

[Bug middle-end/107879] [13 Regression] ffmpeg-4 test suite fails on FPU arithmetics

2022-11-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107879

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
Yes, thanks for the report. OK with -fno-tree-dominator-opts.

The dom2/dom3 passes duplicate most of the computations in build_filter for the
'x == 0' branch, but the phi node in the resulting basic block 5 incorrectly
receives 0.0 (from bb 6) as the value of 'ffm' computed on the duplicated path
(should be 1.0).

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #18 from Alexander Monakov  ---
The apparent 'bias' is introduced by instruction scheduling: haifa-sched lifts
a +64 increment over memory accesses, transforming +0 and +32 displacements to
-64 and -32. Sometimes this helps a little bit even on modern x86 CPUs.

Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
'unlaminated' (turned to 2 uops before renaming), so selecting independent IVs
for the two arrays actually helps on this testcase.

[Bug tree-optimization/107647] [12/13 Regression] GCC 12.2.0 may produce FMAs even with -ffp-contract=off

2022-11-17 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107647

--- Comment #15 from Alexander Monakov  ---
I'm confused about the first hunk in the attached patch:

--- a/gcc/tree-vect-slp-patterns.cc
+++ b/gcc/tree-vect-slp-patterns.cc
@@ -1035,8 +1035,10 @@ complex_mul_pattern::matches (complex_operation_t op,
   auto_vec left_op, right_op;
   slp_tree add0 = NULL;

-  /* Check if we may be a multiply add.  */
+  /* Check if we may be a multiply add.  It's only valid to form FMAs
+ with -ffp-contract=fast.  */
   if (!mul0
+  && flag_fp_contract_mode != FP_CONTRACT_FAST
   && vect_match_expression_p (l0node[0], PLUS_EXPR))
 {
   auto vals = SLP_TREE_CHILDREN (l0node[0]);


Shouldn't it be ' == FP_CONTRACT_FAST' rather than '!='? It seems we are
checking that a match is found and contracting across statement boundaries is
allowed.

[Bug middle-end/107719] 14% regression on TSVC s3113 on znve4 compared to GCC 7.5

2022-11-16 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107719

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
As you say, the inner loop is the same, and it iterates 32000 times. Most
likely it crosses an instruction fetch boundary differently, try
-falign-loops=32.

[Bug target/87832] AMD pipeline models are very costly size-wise

2022-11-16 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #10 from Alexander Monakov  ---
(In reply to Jan Hubicka from comment #9)
> Actually for older cores I think the manufacturers do not care much.  I
> still have a working Bulldozer machine and I can do some testing.
> I think in Bulldozer case I was basing the latency throughput on data in
> Agner Fog's manuals.

Ahhh, how could I forget that his manuals have data for those cores too. Thanks
for the reminder! This solves the conundrum nicely:

AMD Jaguar ('btver2' in GCC): int/fp division is not pipelined, separate int/fp
dividers;

AMD Bulldozer, Steamroller ('bdver1', 'bdver3'): int division is not pipelined
(one divider), fp division is slightly pipelined (two independent dividers);

Zhaoxin Lujiazui appears to use the same divider as VIA Nano 3000, which is not
pipelined.

So it's already enough to produce a decent patch.

> How do you test it?

For AMD Zen patches I was using measurements by Andreas Abel (
https://uops.info/table_overview.html ) and running a few experiments myself by
coding loops in NASM and timing them with 'perf stat' on a Zen 2 CPU.

[Bug tree-optimization/107715] TSVC s161 for double runs at zen4 30 times slower when vectorization is enabled

2022-11-16 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107715

--- Comment #3 from Alexander Monakov  ---
There's a forward dependency over 'c' (read of c[i] vs. write of c[i+1] with
'i' iterating forward), and the vectorized variant takes the hit on each
iteration. How is a slowdown even surprising?

For the non-vectorized variant you have at most 50% iterations waiting on the
previous, when 'b' has positive and negative elements in alternation, but the
generator doesn't elicit this worst case.
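
The dependency pattern can be sketched as follows. The loop is modelled on the TSVC s161 kernel from memory, so treat its exact form as an assumption and check it against the benchmark source; the function name and signature are made up for illustration.

```c
#include <stddef.h>

/* When b[i] is negative, iteration i writes c[i+1]; when b[i+1] is
   non-negative, iteration i+1 reads that same element.  That is the
   forward dependency over 'c'.  */
void s161_like(double *a, const double *b, double *c,
               const double *d, const double *e, size_t n)
{
    for (size_t i = 0; i + 1 < n; ++i) {
        if (b[i] < 0.0)
            c[i + 1] = a[i] + d[i] * d[i];  /* store to c[i+1] */
        else
            a[i] = c[i] + d[i] * e[i];      /* load from c[i] */
    }
}
```

In scalar code the load waits on the previous store only when a negative b[i] is followed by a non-negative b[i+1], which is where the "at most 50%" bound comes from.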

[Bug target/87832] AMD pipeline models are very costly size-wise

2022-11-16 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #8 from Alexander Monakov  ---
(In reply to Jan Hubicka from comment #7)
> > 53730 r btver2_fp_min_issue_delay
> > 53760 r znver1_fp_transitions
> > 93960 r bdver3_fp_transitions
> > 106102 r lujiazui_core_check
> > 106102 r lujiazui_core_transitions
> > 196123 r lujiazui_core_min_issue_delay
> > 
> > What shall we do with similar blowups in lujiazui and b[dt]ver[123] models?
> Yes, I think that makes sense...

Do you mean we should fix modeling of divisions there as well? I don't have
latency/throughput measurements for those CPUs, nor access so I can run
experiments myself, unfortunately.

I guess you mean just making a patch to model division units separately,
leaving latency/throughput as in the current incorrect models, and leaving it
to manufacturers to correct them? Alternatively, for AMD Bobcat and Bulldozer
we might be able to crowd-source it eventually.

[Bug target/87832] AMD pipeline models are very costly size-wise

2022-11-16 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #6 from Alexander Monakov  ---
With these patches on trunk, current situation is:

nm -CS -t d --defined-only gcc/insn-automata.o | sed 's/^[0-9]* 0*//' | sort -n | tail -40
2496 r slm_base
2527 r bdver3_load_min_issue_delay
2746 r glm_base
3892 r bdver1_fp_base
 r bdver1_ieu_min_issue_delay
4492 r geode_base
4608 r bdver3_ieu_transitions
6402 r bdver1_load_transitions
6720 r znver1_fp_min_issue_delay
7862 r athlon_fp_check
7862 r athlon_fp_transitions
9122 r lujiazui_core_base
9997 t internal_insn_latency(int, int, rtx_insn*, rtx_insn*)
10108 r bdver3_load_transitions
10498 r geode_check
10498 r geode_transitions
11632 r print_reservation(_IO_FILE*, rtx_insn*)::reservation_names
12575 r athlon_fp_min_issue_delay
12742 r btver2_fp_check
12742 r btver2_fp_transitions
13896 r slm_check
13896 r slm_transitions
17149 t internal_min_issue_delay(int, DFA_chip*)
17349 t internal_state_transition(int, DFA_chip*)
17776 r bdver1_ieu_transitions
20068 r bdver1_fp_check
20068 r bdver1_fp_transitions
26208 r slm_min_issue_delay
27244 r bdver1_fp_min_issue_delay
28518 r glm_check
28518 r glm_transitions
33690 r geode_min_issue_delay
46980 r bdver3_fp_min_issue_delay
49428 r glm_min_issue_delay
53730 r btver2_fp_min_issue_delay
53760 r znver1_fp_transitions
93960 r bdver3_fp_transitions
106102 r lujiazui_core_check
106102 r lujiazui_core_transitions
196123 r lujiazui_core_min_issue_delay

What shall we do with similar blowups in lujiazui and b[dt]ver[123] models?

[Bug target/107676] Nonsensical docs for -mrelax-cmpxchg-loop

2022-11-16 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107676

Alexander Monakov  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 CC||amonakov at gcc dot gnu.org
 Resolution|--- |FIXED

--- Comment #8 from Alexander Monakov  ---
Fixed.

[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX

2022-11-14 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #15 from Alexander Monakov  ---
Ah, there will be an mfence after the vmovdqa when necessary for an atomic
store, thanks (I missed that because the testcase doesn't scan for mfence).

[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX

2022-11-14 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #13 from Alexander Monakov  ---
Jakub, sorry if I misunderstood the patches from a brief glance, but what
ordering guarantees are you assuming for AVX accesses? It should not be
SEQ_CST. I think what the Intel manual is saying is that such accesses will not
tear, but reordering is the same as pre-existing x86 TSO rules (a load can
finish before an earlier store is globally visible).

[Bug tree-optimization/107647] [12/13 Regression] GCC 12.2.0 may produce FMAs even with -ffp-contract=off

2022-11-11 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107647

--- Comment #6 from Alexander Monakov  ---
Sure, but I was talking specifically about the pattern matching introduced by
that commit.

[Bug tree-optimization/107647] [12/13 Regression] GCC 12.2.0 may produce FMAs even with -ffp-contract=off

2022-11-11 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107647

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
Nice catch, thanks for the report. This is due to g:7d810646d421

The documentation should clarify that patterns correspond to basic fma
instructions (without intermediate rounding), and SLP pattern matching should
check flag_fp_contract_mode != FP_CONTRACT_OFF.

[Bug other/107621] spinx generated documents has too much white space on the top

2022-11-10 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107621

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov  ---
The unnecessary empty space appears due to some subresources not loading as a
result of a Content-Security-Policy issue:

https://inbox.sourceware.org/gcc-patches/5ea2ef7e-4b89-272f-c8e1-f3874c9fa...@pfeifer.com/T/#m5acd422ef000b9758206cb186fe62d6244b8cd47

[Bug tree-optimization/107505] [13 Regression] ICE: verify_flow_info failed (error: returns_twice call is not first in basic block 2)

2022-11-07 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107505

Alexander Monakov  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Alexander Monakov  ---
(In reply to Richard Biener from comment #2)
> That looks about correct - patch is OK if testing succeeds.

Thanks, fixed.

[Bug target/87832] AMD pipeline models are very costly size-wise

2022-11-07 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #3 from Alexander Monakov  ---
Followup patches have been posted at
https://inbox.sourceware.org/gcc-patches/20221101162637.14238-1-amona...@ispras.ru/

[Bug tree-optimization/107505] [13 Regression] ICE: verify_flow_info failed (error: returns_twice call is not first in basic block 2)

2022-11-02 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107505

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
Thanks. This is tree-ssa-sink relocating the call after 'zero' is discovered to
be const, so I think the fix may be as simple as

diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
index 921305201..631fc88c3 100644
--- a/gcc/tree-ssa-sink.cc
+++ b/gcc/tree-ssa-sink.cc
@@ -266,11 +266,11 @@ statement_sink_location (gimple *stmt, basic_block frombb,
   /* We only can sink assignments and non-looping const/pure calls.  */
   int cf;
   if (!is_gimple_assign (stmt)
   && (!is_gimple_call (stmt)
  || !((cf = gimple_call_flags (stmt)) & (ECF_CONST|ECF_PURE))
- || (cf & ECF_LOOPING_CONST_OR_PURE)))
+ || (cf & (ECF_LOOPING_CONST_OR_PURE|ECF_RETURNS_TWICE))))
 return false;

   /* We only can sink stmts with a single definition.  */
   def_p = single_ssa_def_operand (stmt, SSA_OP_ALL_DEFS);
   if (def_p == NULL_DEF_OPERAND_P)

[Bug other/107353] frontends sometimes select wrong (too strong) TLS access model

2022-10-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107353

Alexander Monakov  changed:

   What|Removed |Added

Summary|[13 regression] Numerous|frontends sometimes select
   |ICEs after  |wrong (too strong) TLS
   |r13-3416-g1d561e1851c466|access model

--- Comment #15 from Alexander Monakov  ---
C FE issue was broken out as PR 107419 and Fortran FE issue as PR 107421, which
now "block" this PR together with PR 107393 for the earlier C++ testcase. The
offending assert is gone, so retitling (not a regression anymore).

[Bug fortran/107421] New: problematic interaction of 'common' and 'threadprivate'

2022-10-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107421

Bug ID: 107421
   Summary: problematic interaction of 'common' and
'threadprivate'
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: openmp
  Severity: normal
  Priority: P3
 Component: fortran
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
CC: amonakov at gcc dot gnu.org, asolokha at gmx dot com,
bergner at gcc dot gnu.org, iains at gcc dot gnu.org,
law at gcc dot gnu.org, marxin at gcc dot gnu.org,
segher at gcc dot gnu.org, seurer at gcc dot gnu.org,
unassigned at gcc dot gnu.org
Blocks: 107353
  Target Milestone: ---

+++ This bug was initially created as a clone of Bug #107353 +++

integer :: i

common /c/ i

!$omp threadprivate (/c/)

i = 0

end

f951 -fopenmp invokes decl_default_tls_model before assigning DECL_COMMON in
fortran/trans-common.cc:build_common_decl. This causes 'c' to have local-exec
model rather than initial-exec, breaking internal verification that was
weakened to solve PR 107353.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107353
[Bug 107353] [13 regression] Numerous ICEs after r13-3416-g1d561e1851c466

[Bug c/107419] New: attributes are ignored when selecting TLS model

2022-10-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107419

Bug ID: 107419
   Summary: attributes are ignored when selecting TLS model
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
CC: amonakov at gcc dot gnu.org, asolokha at gmx dot com,
bergner at gcc dot gnu.org, iains at gcc dot gnu.org,
law at gcc dot gnu.org, marxin at gcc dot gnu.org,
segher at gcc dot gnu.org, seurer at gcc dot gnu.org,
unassigned at gcc dot gnu.org
Blocks: 107353
  Target Milestone: ---

+++ This bug was initially created as a clone of Bug #107353 +++

__attribute__((common))
__thread int i;

int *f()
{
    return &i;
}

C frontend invokes decl_default_tls_model before processing attributes,
assigning local-exec model as if the 'common' attribute was not present.
Recomputing it later would select initial-exec model, breaking internal
verification that was weakened to solve PR 107353.
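
Separately from how the front end computes the default, GCC's 'tls_model' variable attribute lets the access model be stated explicitly for a given '__thread' variable. The following is a minimal sketch with hypothetical names ('counter', 'counter_address'); it only shows the attribute syntax and does not reproduce the bug.

```c
/* Hypothetical example: pin the TLS access model with GCC's
   tls_model attribute instead of relying on the computed default.  */
__attribute__((tls_model("initial-exec")))
__thread int counter;

/* Returns the address of the calling thread's copy of 'counter'.  */
int *counter_address(void)
{
    return &counter;
}
```

Comparing the code generated for 'counter_address' with and without the attribute (for example under -fpie) is one way to see which model the compiler selected by default.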


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107353
[Bug 107353] [13 regression] Numerous ICEs after r13-3416-g1d561e1851c466

[Bug other/107353] [13 regression] Numerous ICEs after r13-3416-g1d561e1851c466

2022-10-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107353

--- Comment #13 from Alexander Monakov  ---
As for the Fortran testcases, the issue is again caused by the front-end
invoking decl_default_tls_model before assigning DECL_COMMON, this time in
fortran/trans-common.cc:build_common_decl.

So I guess I can be happy that the assert uncovered issues in three front-ends,
and adjust the code to avoid downgrading TLS model instead of asserting:

diff --git a/gcc/ipa-visibility.cc b/gcc/ipa-visibility.cc
index 3ed2b7cf6..bb86005e5 100644
--- a/gcc/ipa-visibility.cc
+++ b/gcc/ipa-visibility.cc
@@ -886,8 +886,8 @@ function_and_variable_visibility (bool whole_program)
  && vnode->ref_list.referring.length ())
{
  enum tls_model new_model = decl_default_tls_model (decl);
- gcc_checking_assert (new_model >= decl_tls_model (decl));
- set_decl_tls_model (decl, new_model);
+ if (new_model >= decl_tls_model (decl))
+   set_decl_tls_model (decl, new_model);
}
}
 }

[Bug other/107353] [13 regression] Numerous ICEs after r13-3416-g1d561e1851c466

2022-10-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107353

--- Comment #12 from Alexander Monakov  ---
ICE on the emutls-3.c testcase isn't related to emutls. Rather, the frontend
invokes decl_default_tls_model before attributes are processed, so the first
time around we miss the 'common' attribute when deciding the TLS access model.

The following cut-down testcase fails on x86 as well with -m32 -fpie:

__attribute__((common))
__thread int i;

int *f()
{
    return &i;
}

Before the offending commit GCC compiled 'f' as if the attribute was ignored.
(on ELF targets combining TLS and COMMON is problematic if not undefined)

[Bug other/107353] [13 regression] Numerous ICEs after r13-3416-g1d561e1851c466

2022-10-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107353

--- Comment #11 from Alexander Monakov  ---
I've broken out the C++ issue from comment #10 as PR 107393, thanks for the
testcase. It's a separate issue from emutls and Fortran ICEs on other targets.

[Bug c++/107393] New: Wrong TLS model for specialized template

2022-10-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107393

Bug ID: 107393
   Summary: Wrong TLS model for specialized template
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
CC: amonakov at gcc dot gnu.org, asolokha at gmx dot com,
bergner at gcc dot gnu.org, iains at gcc dot gnu.org,
law at gcc dot gnu.org, marxin at gcc dot gnu.org,
segher at gcc dot gnu.org, seurer at gcc dot gnu.org,
unassigned at gcc dot gnu.org
Blocks: 107353
  Target Milestone: ---

+++ This bug was initially created as a clone of Bug #107353 +++

template<class T>
struct S {
    static __thread int i;
};

template<class T>
__thread int S<T>::i;

extern template
__thread int S<void>::i;

int &vi()
{
    return S<void>::i;
}

int &ci()
{
    return S<char>::i;
}

Current trunk ICEs due to a new verification in ipa-visibility; before that,
gcc -O2 used to emit:


_Z2viv:
        movq    %fs:0, %rax
        addq    $_ZN1SIvE1iE@tpoff, %rax
        ret
_Z2civ:
        movq    %fs:0, %rax
        addq    $_ZN1SIcE1iE@tpoff, %rax
        ret
_ZN1SIcE1iE:
        .zero   4

which incorrectly uses local-exec model to retrieve S<void>::i, which is extern
(and thus could reside in a shared library at link time, not the executable
being linked).

Clang correctly uses initial-exec for S<void>::i and local-exec for S<char>::i.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107353
[Bug 107353] [13 regression] Numerous ICEs after r13-3416-g1d561e1851c466

[Bug other/107353] [13 regression] Numerous ICEs after r13-3416-g1d561e1851c466

2022-10-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107353

--- Comment #9 from Alexander Monakov  ---
Actually, latest results from H.J. Lu's periodic x86_64 tester don't exhibit
such issues either:
https://inbox.sourceware.org/gcc-testresults/20221025065901.6dc0062...@gnu-34.sc.intel.com/T/#u

[Bug other/107353] [13 regression] Numerous ICEs after r13-3416-g1d561e1851c466

2022-10-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107353

--- Comment #8 from Alexander Monakov  ---
(In reply to Arseny Solokha from comment #7)
> I have it on x86_64-pc-linux-gnu…

Thanks for the info (I assume you don't have any special configure arguments),
but that's surprising, I ran bootstrap+regtest before committing the patch, and
did not see such issues. I'll recheck with today's trunk.

[Bug other/107353] [13 regression] Numerous ICEs after r13-3416-g1d561e1851c466

2022-10-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107353

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #6 from Alexander Monakov  ---
I can start investigating the root cause later today. In the meantime, please
supply the usual reproduction info if possible (configure arguments and
preprocessed source where applicable).

Presumably powerpc64le doesn't use emutls, so there might be two issues.

FWIW, I don't understand why I was not Cc'ed on this bug, especially if adding
the main author turned out to be a problem. The commit message gives my email
twice, as a co-author and as the committer, and it's conveniently hyperlinked
from comment 0.

[Bug target/87832] AMD pipeline models are very costly size-wise

2022-10-24 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #1 from Alexander Monakov  ---
Suggested partial fix for the integer-pipe side of the blowup:
https://inbox.sourceware.org/gcc-patches/4549f27b-238a-7d77-f72b-cc77df8ae...@ispras.ru/

[Bug middle-end/102380] [meta-bug] visibility (fvisibility=* and attributes) issues

2022-10-20 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102380
Bug 102380 depends on bug 99619, which changed state.

Bug 99619 Summary: fails to infer local-dynamic TLS model from hidden visibility
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99619

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

[Bug middle-end/99619] fails to infer local-dynamic TLS model from hidden visibility

2022-10-20 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99619

Alexander Monakov  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #2 from Alexander Monakov  ---
Fixed for gcc-13.

[Bug target/107250] Load unnecessarily happens before malloc

2022-10-14 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107250

--- Comment #3 from Alexander Monakov  ---
Well, obviously because in one function both 'f' and 'tmp' are live across the
call, and in the other function only 'f' is live across the call. The
difference is literally pushing one register vs. two registers, plus extra 8
bytes to preserve 16-byte ABI alignment.
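
A rough sketch of the two shapes being compared. This is not the testcase from the bug; 'struct box' and both function names are hypothetical, and whether the compiler keeps the source order of the load is up to the optimizer.

```c
#include <stdlib.h>

struct box {
    int value;
    int *copy;
};

/* Load first: both 'f' and 'tmp' are live across the malloc call, so
   two values have to be preserved around it.  */
int load_before_call(struct box *f)
{
    int tmp = f->value;
    f->copy = malloc(sizeof *f->copy);
    if (!f->copy)
        return -1;
    *f->copy = tmp;
    return 0;
}

/* Load after: only 'f' is live across the call.  */
int load_after_call(struct box *f)
{
    f->copy = malloc(sizeof *f->copy);
    if (!f->copy)
        return -1;
    *f->copy = f->value;
    return 0;
}
```

Both functions compute the same result; the difference is only in how many values have to be kept in call-saved registers or stack slots around the call.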

[Bug tree-optimization/107250] Load unnecessarily happens before malloc

2022-10-13 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107250

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
On the other hand, dispatching the load before malloc is useful if you expect
it to miss in the caches. If you wrote the code with that in mind, and the
compiler moved the load anyway, a manual workaround to *that* would be more
invasive.

[Bug middle-end/107115] Wrong codegen from TBAA under stores that change effective type?

2022-10-07 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107115

--- Comment #12 from Alexander Monakov  ---
For reference, the previous whacked mole appears to be PR 106187 (where
mems_same_for_tbaa_p comes from).

[Bug middle-end/107115] Wrong codegen from TBAA under stores that change effective type?

2022-10-06 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107115

--- Comment #8 from Alexander Monakov  ---
Just optimizing out the redundant store seems difficult because on some targets
scheduling is invoked from reorg (and it relies on alias sets).

We need a solution that works for combine too — is it possible to invent a
representation for a no-op in-place MEM "move" that only changes its alias set?

[Bug middle-end/107115] Wrong codegen from TBAA under stores that change effective type?

2022-10-06 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107115

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org,
   ||jakub at gcc dot gnu.org

--- Comment #6 from Alexander Monakov  ---
Cc'ing Jakub for the problematic i386.md peephole that loses alias set info.

As Andrew mentioned in comment #2, the next stop is combine, which sees

(set (mem:DI (plus:DI (mult:DI (sign_extend:DI (reg:SI 101))
(const_int 8 [0x8]))
(reg/v/f:DI 90 [ p4 ])) [1 MEM[(long int *)_11]+0 S8 A64])
(mem:DI (plus:DI (mult:DI (sign_extend:DI (reg:SI 101))
(const_int 8 [0x8]))
(reg/v/f:DI 90 [ p4 ])) [2 *_11+0 S8 A64]))

as a no-op move and removes it (but note differing alias sets in the MEMs).

And with -fdisable-rtl-combine it is then broken by peephole2 of all things:

;; Attempt to optimize away memory stores of values the memory already
;; has.  See PR79593.
(define_peephole2
  [(set (match_operand 0 "register_operand")
(match_operand 1 "memory_operand"))
   (set (match_operand 2 "memory_operand") (match_dup 0))]
  "!MEM_VOLATILE_P (operands[1])
   && !MEM_VOLATILE_P (operands[2])
   && rtx_equal_p (operands[1], operands[2])
   && !reg_overlap_mentioned_p (operands[0], operands[2])"
  [(set (match_dup 0) (match_dup 1))])

[Bug tree-optimization/107107] [10/11/12/13 Regression] Wrong codegen from TBAA when stores to distinct same-mode types are collapsed?

2022-10-01 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107107

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #6 from Alexander Monakov  ---
(In reply to Andrew Pinski from comment #5)
> (In reply to Rich Felker from comment #1)
> > There's also a potentially related test case at
> > https://godbolt.org/z/jfv1Ge6v4 - I'm not yet clear on whether it's likely
> > to have the same root cause.
> 
> This might be a different issue I think.

Yeah, that's sched2 reordering the accesses (probably cselib is confused).
Needs a separate report.

[Bug tree-optimization/107099] New: uncprop a bit

2022-09-30 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107099

Bug ID: 107099
   Summary: uncprop a bit
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---

For the following testcase

#include <immintrin.h>

__attribute__((target("avx")))
int f(__m128i a[], long n)
{
    for (long i = 0; i < n; i++)
        if (!_mm_testz_si128(a[i], a[i]))
            return 0;
    return 1;
}

gcc -O2 generates

f:
        test    rsi, rsi
        jle     .L4
        xor     eax, eax
        jmp     .L3
.L10:
        add     rax, 1
        cmp     rsi, rax
        je      .L4
.L3:
        mov     rdx, rax
        sal     rdx, 4
        vmovdqa xmm0, XMMWORD PTR [rdi+rdx]
        xor     edx, edx
        vptest  xmm0, xmm0
        sete    dl
        je      .L10
        mov     eax, edx
        ret
.L4:
        mov     edx, 1
        mov     eax, edx
        ret

Note the redundant assignments to edx in the loop and compare with gcc -O2
-fdisable-tree-uncprop1

Also note that generally uncprop adds a data dependency where only a control
dependency existed, hurting speculative execution (hence more appropriate for
-Os than -O2).
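
A source-level picture of the transformation, as far as it can be inferred from the asm above. This is a simplified scalar stand-in for the testcase; the function names and the plain 'int' array are assumptions made here.

```c
/* As written: the early exit returns the constant 0.  */
int all_zero(const int *a, long n)
{
    for (long i = 0; i < n; i++) {
        int is_zero = (a[i] == 0);
        if (!is_zero)
            return 0;
    }
    return 1;
}

/* Conceptually after uncprop: on the early-exit path 'is_zero' is known
   to equal 0, so the constant is replaced by the variable.  The return
   value now depends on the comparison result (a data dependency) and
   not only on the branch outcome (a control dependency).  */
int all_zero_uncprop(const int *a, long n)
{
    for (long i = 0; i < n; i++) {
        int is_zero = (a[i] == 0);
        if (!is_zero)
            return is_zero;
    }
    return 1;
}
```

The two functions always return the same value; they differ only in how the 0 on the early-exit path is produced, which corresponds to the 'sete dl' / 'mov eax, edx' pair in the generated code.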

[Bug target/106902] [11/12/13 Regression] Program compiled with -O3 -mfma produces different result

2022-09-30 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106902

--- Comment #19 from Alexander Monakov  ---
(In reply to rguent...@suse.de from comment #18)
> True - but does that catch the cases people are interested and are
> allowed by the FP contraction rules?  I'm thinking of
> 
>  x = a*b + c*d + e + f;
> 
> with -fassociative-math we can form two FMAs here?

Yes; it might be reasonable to limit the match.pd rule to
-fno-associative-math, leaving mul/adds as-is for tree-ssa-math-opts to
recombine otherwise.

>  Of course with
> strict IEEE compliance but allowed FP contraction we can only
> do FMA (a, b, c*d) + e + f, right?

I think so.

>  Does that mean -ffp-contract=on
> only makes sense in absence of any other -ffast-math flags?

Well, the proposal was to make -ffp-contract=fast an '-ffast-math' flag, not
=on. I don't want to judge if '-ffp-contract=on -ffast-math' combination is
reasonable or not, because -ffast-math by itself is quite nonsensical already.

[Bug target/106902] [11/12/13 Regression] Program compiled with -O3 -mfma produces different result

2022-09-29 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106902

--- Comment #17 from Alexander Monakov  ---
(In reply to Richard Biener from comment #16)
> I do think that since the only way to
> preserve expression boundaries is by PAREN_EXPR

Yes, but...

>  that the middle-end
> shouldn't care about FAST vs. ON (well, it cannot), but the language
> frontends need to ensure to emit PAREN_EXPRs for =ON and omit them for
> =FAST.

this will also prevent reassociation across statements. Doing FMA
contraction in the frontends via a match.pd rule doesn't have this drawback.

[Bug target/106902] [11/12/13 Regression] Program compiled with -O3 -mfma produces different result

2022-09-27 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106902

--- Comment #15 from Alexander Monakov  ---
(In reply to Richard Biener from comment #14)
> I can't
> seem to reproduce any vectorization for your smaller example though.

My small C samples omit some detail as they were meant to illustrate what
happened in the IR. Is that a problem?

By the way, I noticed that tree-ssa-math-opts incorrectly handles
-ffp-contract:

  if (FLOAT_TYPE_P (type)
  && flag_fp_contract_mode == FP_CONTRACT_OFF)
return false;

It should be 'flag_fp_contract_mode != FP_CONTRACT_FAST' instead (the pass
doesn't have any idea about expression boundaries). It dates back to
g:1694907238eb

[Bug lto/107014] flatten+lto fails the kernel build

2022-09-23 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107014

--- Comment #7 from Alexander Monakov  ---
I wanted to understand what gets exposed in LTO mode that causes a blowup.

I'd say flatten is not appropriate for this function (I don't think you want to
force inlining of memset or _find_next_bit?), so might be better to go back to
the original issue and solve the problem in a more focused way (e.g.
force-inlining the function which needs to access __initdata if you really need
the verification that triggers otherwise).

[Bug lto/107014] flatten+lto fails the kernel build

2022-09-23 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107014

--- Comment #5 from Alexander Monakov  ---
(In reply to Jiri Slaby from comment #4)
> > I am surprised that "flatten" blows up on this function. Is that with any
> > config, or again some specific settings like gcov? Is there an existing lkml
> > thread about this?
> 
> Yes, linked in the commit log:
> https://lore.kernel.org/all/
> cak8p3a2zwfnexksm8k_suhhwkor17jfo3xaplxjzfpqx0eu...@mail.gmail.com/

I mean now, about compile time blowup with LTO.

[Bug lto/107014] flatten+lto fails the kernel build

2022-09-23 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107014

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
It was added to force inlining of small helpers that outgrow limits when
building with gcov profiling:

https://github.com/torvalds/linux/commit/258e0815e2b1706e87c0d874211097aa8a7aa52f

(lack of inlining triggered a sanity check, as explained in the commit)


I am surprised that "flatten" blows up on this function. Is that with any
config, or again some specific settings like gcov? Is there an existing lkml
thread about this?

[Bug target/106902] [11/12/13 Regression] Program compiled with -O3 -mfma produces different result

2022-09-19 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106902

--- Comment #13 from Alexander Monakov  ---
(In reply to Richard Biener from comment #12)
> > Isn't it easy now to implement -ffp-contract=on by a GENERIC-only match.pd
> > rule?
> 
> You mean in the frontend only for -ffp-contract=on?

Yes. 

> Maybe, I suppose FE
> specific folding would also work in that case.  One would also need to read
> the fine prints in the language standards again as to whether FP contraction
> allows to form FMA for
> 
>  double tem = a * b;
>  double res = tem + c;
> 
> or across inlined function call boundaries which we'll happily do.

In C contraction is allowed only within an expression (hence a difference
between -ffp-contract=fast vs. -ffp-contract=on).

The original testcase was in C++, I think C++ does not specify it, but
hopefully we'd aim to implement the same semantics as for C.

> Of course for the testcase at hand it's all in
> a single statement and no parens specify association (in case parens also
> matter here, like in Fortran).  The fortran frontend adds PAREN_EXPRs
> as association barriers which also would prevent FMAs to be formed.

Please note that in this testcase GCC is breaking language semantics by
computing the same value in two different ways, and then using different
computed values in dependent computations. This could not have happened in the
abstract machine (there's a singular assignment in the original program, which
is then used in subsequent iterations of the loop).

[Bug target/106902] [11/12/13 Regression] Program compiled with -O3 -mfma produces different result

2022-09-19 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106902

--- Comment #11 from Alexander Monakov  ---
Can we move -ffp-contract=fast under the -ffast-math umbrella and default to
-ffp-contract=on/off?

Isn't it easy now to implement -ffp-contract=on by a GENERIC-only match.pd
rule?

[Bug target/106952] Missed optimization: x < y ? x : y not lowered to minss

2022-09-15 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106952

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
Note, your 'max' function is the same as 'min' (the issue remains with that
corrected).

[Bug target/106902] [11/12/13 Regression] Program compiled with -O3 -mfma produces different result

2022-09-15 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106902

--- Comment #7 from Alexander Monakov  ---
Lawrence, thank you for the nice work reducing the testcase. For RawTherapee
the recommended course of action would be to compile everything with
-ffp-contract=off, then manually reintroduce use of fma in
performance-sensitive places by testing the FP_FAST_FMA macro to know if
hardware fma is available. This way you'll know that all systems without fma
get the same results, and all systems with fma also get the same results (but
different from the former).

For example, my function 'f1' could be adapted like this:

void f1(void)
{
double x1 = 0, x2 = 0, x3 = 0;

for (int i = 0; i < 99; ) {
double t;
#ifdef FP_FAST_FMA
t = fma(x1, b1, fma(x2, b2, fma(x3, b3, B * one)));
#else
t = B * one + x1 * b1 + x2 * b2 + x3 * b3;
#endif
printf("%d %g\t%a\n", i++, t, t);

x3 = x2, x2 = x1, x1 = t;
}
}

[Bug target/106902] [11/12/13 Regression] Program compiled with -O3 -mfma produces different result

2022-09-14 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106902

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #6 from Alexander Monakov  ---
This is a lovely showcase of how optimizations cooperatively produce something
unexpected.

TL;DR: SLP introduces redundant computations and then fma formation contracts
some (but not all) of those, dramatically reducing numerical stability. In
principle that's similar to incorrectly "optimizing"

double f(double x)
{
  double y = x * x;
  return y - y;
}

(which is guaranteed to return either NaN or 0) to

double f(double x)
{
  return fma(x, x, -(x * x));
}

which returns the round-off tail of x * x (or NaN). I think there's already
another bug with a similar root cause.

In this bug, we begin with (note, all following examples are supposed to be
compiled without fma contraction, i.e. -O0, plain -O2, or -O2 -ffp-contract=off
if your target has fma):

#include <math.h>
#include <stdio.h>

double one = 1;

double b1 = 0x1.70e906b54fe4fp+1;
double b2 = -0x1.62adb4752c14ep+1;
double b3 = 0x1.c7001a6f3bd8p-1;
double B = 0x1.29c9034e7cp-13;

void f1(void)
{
double x1 = 0, x2 = 0, x3 = 0;

for (int i = 0; i < 99; ) {
double t = B * one + x1 * b1 + x2 * b2 + x3 * b3;
printf("%d %g\t%a\n", i++, t, t);

x3 = x2, x2 = x1, x1 = t;
}
}

predcom unrolls by 3 to get rid of moves:

void f2(void)
{
    double x1 = 0, x2 = 0, x3 = 0;

    for (int i = 0; i < 99; ) {
        x3 = B * one + x1 * b1 + x2 * b2 + x3 * b3;
        printf("%d %g\t%a\n", i++, x3, x3);

        x2 = B * one + x3 * b1 + x1 * b2 + x2 * b3;
        printf("%d %g\t%a\n", i++, x2, x2);

        x1 = B * one + x2 * b1 + x3 * b2 + x1 * b3;
        printf("%d %g\t%a\n", i++, x1, x1);
    }
}

SLP introduces some redundant vector computations:

typedef double f64v2 __attribute__((vector_size(16)));

void f3(void)
{
    double x1 = 0, x2 = 0, x3 = 0;

    f64v2 x32 = { 0 }, x21 = { 0 };

    for (int i = 0; i < 99; ) {
        x3 = B * one + x21[1] * b1 + x2 * b2 + x3 * b3;

        f64v2 x13b1 = { x21[1] * b1, x3 * b1 };

        x32 = B * one + x13b1 + x21 * b2 + x32 * b3;

        x2 = B * one + x3 * b1 + x1 * b2 + x2 * b3;

        f64v2 x13b2 = { b2 * x1, b2 * x32[0] };

        x21 = B * one + x32 * b1 + x13b2 + x21 * b3;

        x1 = B * one + x2 * b1 + x32[0] * b2 + x1 * b3;

        printf("%d %g\t%a\n", i++, x32[0], x32[0]);
        printf("%d %g\t%a\n", i++, x32[1], x32[1]);
        printf("%d %g\t%a\n", i++, x21[1], x21[1]);
    }
}

Note that this is still bit-identical to the initial function. But then
tree-ssa-math-opts "randomly" forms some FMAs:

f64v2 vfma(f64v2 x, f64v2 y, f64v2 z)
{
    return (f64v2){ fma(x[0], y[0], z[0]), fma(x[1], y[1], z[1]) };
}

void f4(void)
{
    f64v2 vone = { one, one }, vB = { B, B };
    f64v2 vb1 = { b1, b1 }, vb2 = { b2, b2 }, vb3 = { b3, b3 };

    double x1 = 0, x2 = 0, x3 = 0;

    f64v2 x32 = { 0 }, x21 = { 0 };

    for (int i = 0; i < 99; ) {
        x3 = fma(b3, x3, fma(b2, x2, fma(B, one, x21[1] * b1)));

        f64v2 x13b1 = { x21[1] * b1, x3 * b1 };

        x32 = vfma(vb3, x32, vfma(vb2, x21, vfma(vB, vone, x13b1)));

        x2 = fma(b3, x2, b2 * x1 + fma(B, one, x3 * b1));

        f64v2 x13b2 = { b2 * x1, b2 * x32[0] };

        x21 = vfma(vb3, x21, x13b2 + vfma(vB, vone, x32 * vb1));

        x1 = fma(b3, x1, b2 * x32[0] + fma(B, one, b1 * x2));

        printf("%d %g\t%a\n", i++, x32[0], x32[0]);
        printf("%d %g\t%a\n", i++, x32[1], x32[1]);
        printf("%d %g\t%a\n", i++, x21[1], x21[1]);
    }
}

and here some of the redundantly computed values are computed differently
depending on where rounding after multiplication was omitted. Somehow this is
enough to make the computation explode numerically.

[Bug lto/91299] [10/11/12/13 Regression] LTO inlines a weak definition in presence of a non-weak definition from an ELF file

2022-09-06 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91299

Alexander Monakov  changed:

   What|Removed |Added

   Keywords||wrong-code
Summary|LTO inlines a weak  |[10/11/12/13 Regression]
   |definition in presence of a |LTO inlines a weak
   |non-weak definition from an |definition in presence of a
   |ELF file|non-weak definition from an
   ||ELF file

--- Comment #14 from Alexander Monakov  ---
gcc-4.9 used to get this right, so let's play the regression card? This should
not be in WAITING.

[Bug target/106834] GCC creates R_X86_64_GOTOFF64 for 4-bytes immediate

2022-09-05 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106834

--- Comment #10 from Alexander Monakov  ---
Okay, so this should have been reported against Binutils, but since we are
having the conversation here: the current behavior is not good, gas is silently
selecting a different relocation kind for no clear reason. Why is it not a
warning or an error? Note that if you assemble such GOT reference via NASM:

extern _GLOBAL_OFFSET_TABLE_
default rel
f:
mov rax, [_GLOBAL_OFFSET_TABLE_ wrt ..gotpc]
ret

then t.o has

 :
   0:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # 7
                        3: R_X86_64_GOTPCREL    _GLOBAL_OFFSET_TABLE_-0x4
   7:   c3                      ret

and ld -shared --no-relax -o t.so t.o does not reject it and t.so has

1000 :
1000:   48 8b 05 f1 1f 00 00    mov    0x1ff1(%rip),%rax        # 2ff8 <_DYNAMIC+0xe0>
1007:   c3                      ret

and without --no-relax:

1000 :
1000:   48 8d 05 f9 1f 00 00    lea    0x1ff9(%rip),%rax        # 3000 <_GLOBAL_OFFSET_TABLE_>
1007:   c3                      ret

So I don't see the reason why it's special-cased in gas.

[Bug target/106453] Redundant zero extension after crc32q

2022-09-05 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106453

Alexander Monakov  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Alexander Monakov  ---
Fixed for gcc-13.

[Bug c++/106834] GCC creates R_X86_64_GOTOFF64 for 4-bytes immediate

2022-09-05 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106834

--- Comment #8 from Alexander Monakov  ---
Right, sorry, due to presence of 'main' I overlooked -fPIC in comment #0, and
then after my prompt it got dropped in comment #3.

If you modify the testcase as follows and compile it with -fPIC, it's evident
that GCC is treating both external symbols the same, but gas does not. Similar
to PR 106835, it seems Binutils is special-casing by symbol name. But here the
situation is worse, because GCC output is mentioning the intended relocation
kind:

        movq    _GLOBAL_OFFSET_TABLE_@GOTPCREL(%rip), %rax

so silently using R_X86_64_GOTOFF64 instead doesn't look right.

#include <stdio.h>

extern char _GLOBAL_OFFSET_TABLE_[];
extern char xGLOBAL_OFFSET_TABLE_[];

int main() {
  printf("%lx", (unsigned long)_GLOBAL_OFFSET_TABLE_);
  printf("%lx", (unsigned long)xGLOBAL_OFFSET_TABLE_);
}

[Bug c++/106834] GCC creates R_X86_64_GOTOFF64 for 4-bytes immediate

2022-09-05 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106834

--- Comment #6 from Alexander Monakov  ---
(In reply to Martin Liška from comment #5)
> Do you mean gas or ld?

gas

> How did you get this output, please (from foo.o or final executable)?

From foo.o like in comment #0.

[Bug c/106835] [i386] Taking an address of _GLOBAL_OFFSET_TABLE_ produces a wrong value

2022-09-05 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106835

--- Comment #3 from Alexander Monakov  ---
It would be unfortunate if that makes it difficult or even impossible to make a
R_386_32 relocation for the address of GOT in hand-written assembly.

In any case, it seems GCC is not making the rules here, so this should be
reported against Binutils so they can clarify the situation?

[Bug c++/106834] GCC creates R_X86_64_GOTOFF64 for 4-bytes immediate

2022-09-05 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106834

Alexander Monakov  changed:

   What|Removed |Added

 CC||hjl.tools at gmail dot com

--- Comment #4 from Alexander Monakov  ---
Probably a Binutils bug then, with binutils-2.37 I get the correct

   4:   48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # b
                        7: R_X86_64_GOTPC32     _GLOBAL_OFFSET_TABLE_-0x4

Can you please report it against binutils at https://sourceware.org/bugzilla/
and mention the link here?

[Bug c/106835] [i386] Taking an address of _GLOBAL_OFFSET_TABLE_ produces a wrong value

2022-09-05 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106835

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
Surely this is a Binutils (assembler) bug? gcc emits

ptr:
.long   _GLOBAL_OFFSET_TABLE_

[Bug c++/106834] GCC creates R_X86_64_GOTOFF64 for 4-bytes immediate

2022-09-05 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106834

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov  ---
Can you show how gcc -S output looks for you on this testcase? For me the
problematic instruction is just

        movl    $_GLOBAL_OFFSET_TABLE_, %eax

or

        leaq    _GLOBAL_OFFSET_TABLE_(%rip), %rax

with -fpie, so it's the assembler who chooses the relocation type (which would
make that a Binutils bug).

[Bug middle-end/106804] Poor codegen for selecting and incrementing value behind a reference

2022-09-02 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106804

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #8 from Alexander Monakov  ---
(In reply to Richard Biener from comment #7)
> In fact I'd say the reverse transformation is more profitable?

In the end it depends on the context. It's a trade-off between a conditional
branch and extra data dependencies feeding into the address of a store. If a
branch is perfectly predictable, it's preferable. Otherwise, if there's no
memory dependency via the store, you don't care about delaying it, making the
branchless version preferable if that reduces pipeline flushes. If there is a
dependency, it comes down to how often the branch mispredicts, I guess.
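As a rough illustration of the two shapes being compared (a hypothetical sketch, not the testcase from this PR; the function names are made up), the branchy form picks the store target with a conditional jump, while the select form first computes the address and then performs a single read-modify-write through it. The compiler is of course free to turn either one into the other.

```c
#include <stdbool.h>

/* Branchy form: the condition decides which increment executes.
   Cheap when the branch predicts well; a mispredict costs a flush.  */
void inc_branchy(int *a, int *b, bool c)
{
    if (c)
        ++*a;
    else
        ++*b;
}

/* Select form: the condition feeds into the address of the store.
   No branch to mispredict, but the load and store now wait for the
   address selection, which matters if later code depends on them.  */
void inc_select(int *a, int *b, bool c)
{
    int *p = c ? a : b;
    ++*p;
}
```

Both functions have the same observable behavior; which one is faster depends on branch predictability and on whether later loads depend on the store, as described above.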

  
 /\
| People who tinker with compilers |
| need __builtin_branchless_select |
 \/
  
 \
  \
   \
.--.
   |o_o |
   | ~  |
  //   \ \
 (| | )
/'\_   _/`\
\___)=(___/

[Bug tree-optimization/106781] [13 Regression] ICE: verify_flow_info failed (error: returns_twice call is not first in basic block 2) since r13-1754-g7a158a5776f5ca95

2022-08-31 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106781

--- Comment #5 from Alexander Monakov  ---
GCC discovers that 'bar' is noreturn, tries to remove its LHS but unfortunately
cgraph.cc:cgraph_edge::redirect_call_stmt_to_callee wants to emit an assignment
of SSA default-def to the LHS. fixup_noreturn_call seems to handle that in a
smarter way.

Is it possible to simply let fixup_noreturn_call do its thing?

diff --git a/gcc/cgraph.cc b/gcc/cgraph.cc
index 8d6ed38ef..6597de669 100644
--- a/gcc/cgraph.cc
+++ b/gcc/cgraph.cc
@@ -1567,7 +1567,7 @@ cgraph_edge::redirect_call_stmt_to_callee (cgraph_edge *e)

   /* If the call becomes noreturn, remove the LHS if possible.  */
   tree lhs = gimple_call_lhs (new_stmt);
-  if (lhs
+  if (0 && lhs
   && gimple_call_noreturn_p (new_stmt)
   && (VOID_TYPE_P (TREE_TYPE (gimple_call_fntype (new_stmt)))
  || should_remove_lhs_p (lhs)))

[Bug tree-optimization/106781] [13 Regression] ICE: verify_flow_info failed (error: returns_twice call is not first in basic block 2) since r13-1754-g7a158a5776f5ca95

2022-08-31 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106781

--- Comment #4 from Alexander Monakov  ---
(In reply to Martin Liška from comment #3)
> > Also ICEs in ipa-modref when 'noclone' added to 'noinline', a 12/13
> > regression (different cause, needs a separate PR).
> 
> Can't reproduce Alexander, please attach a testcase.

Ah, it ICEs when emitting a dump, so -fdump-tree-modref2 is needed in addition
to -O2, I've filed that as PR 106783.

[Bug ipa/106783] New: [12/13 Regression] ICE in ipa-modref.cc:analyze_function

2022-08-31 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106783

Bug ID: 106783
   Summary: [12/13 Regression] ICE in
ipa-modref.cc:analyze_function
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
CC: amonakov at gcc dot gnu.org, asolokha at gmx dot com,
marxin at gcc dot gnu.org, unassigned at gcc dot gnu.org
  Target Milestone: ---

+++ This bug was initially created as a clone of Bug #106781 +++

ICEs when emitting a tree dump with -O2 -fdump-tree-modref2

int n;

__attribute__ ((noinline,noclone,returns_twice)) static int
bar (int)
{
  n /= 0;

  return n;
}

int
foo (int x)
{
  return bar (x);
}

t.c: In function ‘foo’:
t.c:12:1: internal compiler error: in analyze_function, at ipa-modref.cc:3286
   12 | foo (int x)
  | ^~~
0x10e548e analyze_function
gcc/ipa-modref.cc:3286
0x10e83b5 execute
gcc/ipa-modref.cc:4186


Note that -fdump-tree-modref2 is needed. It reaches a gcc_unreachable(); I'd
suggest moving the verification outside of dumping if possible, so that whether
the compiler ICEs does not depend on whether dumping is requested.

[Bug tree-optimization/106781] [13 Regression] ICE: verify_flow_info failed (error: returns_twice call is not first in basic block 2)

2022-08-31 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106781

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
Thanks.

Also ICEs in ipa-modref when 'noclone' added to 'noinline', a 12/13 regression
(different cause, needs a separate PR).

[Bug middle-end/106688] New: leaving SSA emits assignment into the inner loop

2022-08-19 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106688

Bug ID: 106688
   Summary: leaving SSA emits assignment into the inner loop
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---

For the following testcase, gcc -O2

unsigned foo(const unsigned char *buf, long size);
unsigned bar(const unsigned char *buf, long size)
{
    typedef char  i8v8  __attribute__((vector_size(8)));
    typedef short i16v8 __attribute__((vector_size(16)));
    long chunk_sz = 15*16;
    for (; size >= chunk_sz; size -= chunk_sz) {
        i16v8 vs1 = { 0 };
        const unsigned char *end = buf + chunk_sz;
        for (; buf != end; buf += 16) {
            i16v8 b;
            asm("pmovzxbw %1, %0" : "=x"(b) : "m"(*(i8v8*)buf));
            vs1 += b;
            asm("pmovzxbw %1, %0" : "=x"(b) : "m"(*(i8v8*)(buf+8)));
            vs1 += b;
        }
        asm("" :: "x"(vs1));
    }
    return foo(buf, size);
}

(asms needed due to PR 31667)

generates

bar:
cmp rsi, 239
jle .L2
lea rdx, [rdi+240]
.L4:
lea rax, [rdx-240]
pxor    xmm0, xmm0
.L3:
pmovzxbw QWORD PTR [rax], xmm1
add rax, 16
paddw   xmm0, xmm1

mov rdi, rdx ; <<< ehhh

pmovzxbw QWORD PTR [rax-8], xmm1
paddw   xmm0, xmm1
cmp rax, rdx
jne .L3
sub rsi, 240
add rdx, 240
cmp rsi, 239
jg  .L4
.L2:
jmp foo

It looks as if going out of SSA places in the loop a register copy
corresponding to a phi node which is outside of the loop. Strangely, RTL
optimizations do not clean it up either.

[Bug rtl-optimization/106553] pre-register allocation scheduler is now RMW aware

2022-08-08 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106553

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
Are you sure the testcase is correctly reduced, i.e. does it show the same
performance degradation? Latency-wise the scheduler is making the correct
decision here: we really want to schedule second-to-last FMA

  y = v_fma_f32 (y, r2, x);

earlier than its predecessor

  r = v_fma_f32 (y, r2, z);

because we need to compute y*r2 before the last FMA.

[Bug middle-end/106470] Subscribed access to __m256i casted to (uint16_t *) produces garbage or a warning

2022-07-29 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106470

--- Comment #8 from Alexander Monakov  ---
But that's the point of many warnings, isn't it? To help the user understand
what's wrong when the code is bad? And bogus warnings just confuse more.

[Bug middle-end/106470] Subscribed access to __m256i casted to (uint16_t *) produces garbage or a warning

2022-07-29 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106470

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #6 from Alexander Monakov  ---
Andrew, surely the bogus -Wuninitialized warning is a GCC bug here?
