[Bug tree-optimization/68906] [6 Regression] ICE at -O3 on x86_64-linux-gnu: verify_ssa failed

2015-12-15 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68906

--- Comment #3 from Yuri Rumyantsev  ---
I've prepared a simple fix which cures the ICE. I will send it for review tomorrow.

2015-12-15 12:50 GMT+03:00 jakub at gcc dot gnu.org :
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68906
>
> Jakub Jelinek  changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |jakub at gcc dot gnu.org
>
> --- Comment #2 from Jakub Jelinek  ---
> This doesn't look to me like a mere omission to invalidate debug stmts after
> some stmt move that (correctly) has not considered debug stmts when determining
> if they should be moved or not, but it looks to me like wrong-code
> transformation.
> Before unswitch, if c is non-zero, we have endless loop, but during unswitching
> it is wrongly changed to branch to the bb that returns instead.
> Say if you compile with -O3 (no -g):
> int a;
> volatile int b;
> short c, d;
> int
> fn1 ()
> {
>   int e;
>   for (;;)
>     {
>       a = 3;
>       if (c)
>         continue;
>       e = 0;
>       for (; e > -30; e--)
>         if (b)
>           {
>             int f = e;
>             return d;
>           }
>     }
> }
>
> int
> main ()
> {
>   c = 1;
>   asm volatile ("" : : "m" (c) : "memory");
>   fn1 ();
>   __builtin_abort ();
> }
>
> then before the change this would just hang (expected), now it aborts instead.
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.

[Bug rtl-optimization/68920] [6 Regression] Undesirable if-conversion for a rarely taken branch

2015-12-17 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68920

--- Comment #4 from Yuri Rumyantsev  ---
You are quite right - the cost model is very poor. We did a simple experiment
and set the branch cost to 1, but noticed performance regressions on other
benchmarks. When we set it to 2 we did not see any difference, likely because
branch deletion is preferred for equal costs. Is there any tuning option in
the if-converter to revert this decision? Secondly, we must enhance the cost
model by adding the cost of a conditional move for all targets, but this is
for GCC 7.

[Bug tree-optimization/68894] New: Recognition min/max pattern with multiple arguments.

2015-12-14 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68894

            Bug ID: 68894
           Summary: Recognition min/max pattern with multiple arguments.
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

Analyzing one important benchmark (rgb to cmyk conversion), we found that the
MIN pattern is not recognized for more than 2 arguments. I attached a simple
reproducer which exhibits the issue - explicit use of multiple

[Bug tree-optimization/68894] Recognition min/max pattern with multiple arguments.

2015-12-14 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68894

--- Comment #1 from Yuri Rumyantsev  ---
Created attachment 37026
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37026&action=edit
test-case to reproduce

It is sufficient to compile it with the -O3 option to see the difference in
the produced assembly.
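
The attachment itself is not reproduced in this archive. As a rough
illustration only, a three-operand MIN can be spelled either as nested
two-operand MINs or as a chain of explicit compares. The function names
min2, min3 and min3_explicit below are hypothetical and are not taken from
the attached reproducer:

```c
#include <assert.h>

/* Two-operand MIN written as a conditional expression.  */
int
min2 (int a, int b)
{
  return a < b ? a : b;
}

/* Three-operand MIN spelled as nested two-operand MINs.  The names
   min2/min3 are illustrative; they do not come from the attachment.  */
int
min3 (int a, int b, int c)
{
  return min2 (a, min2 (b, c));
}

/* The same three-operand MIN spelled with explicit compares.  */
int
min3_explicit (int a, int b, int c)
{
  int m = a;
  if (b < m)
    m = b;
  if (c < m)
    m = c;
  /* Both spellings must agree for every input.  */
  assert (m == min3 (a, b, c));
  return m;
}
```

Comparing the -O3 assembly of min3 and min3_explicit is one way to see
whether both spellings end up as the same MIN sequence.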

[Bug rtl-optimization/68898] ICE if rtl if-conversion is off.

2015-12-14 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68898

--- Comment #2 from Yuri Rumyantsev  ---
Forgot to add stack trace:

Error: dominator of 6 status unknown
t2.f:41:0: internal compiler error: Segmentation fault
0xb4e62f crash_signal
        /export/users/gnutester/stability/svn/trunk/gcc/toplev.c:334
0x376583567f ???
        /home/glibctest/rpmbuild/BUILD/glibc-2.17-c758a686/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x7e7d0d verify_dominators(cdi_direction)
        /export/users/gnutester/stability/svn/trunk/gcc/dominance.c:1033
0x7e7fa7 checking_verify_dominators
        /export/users/gnutester/stability/svn/trunk/gcc/dominance.h:71
0x7e7fa7 calculate_dominance_info(cdi_direction)
        /export/users/gnutester/stability/svn/trunk/gcc/dominance.c:664
0x9a178e ira
        /export/users/gnutester/stability/svn/trunk/gcc/ira.c:5155
0x9a178e execute
        /export/users/gnutester/stability/svn/trunk/gcc/ira.c:5511

[Bug rtl-optimization/68898] ICE if rtl if-conversion is off.

2015-12-14 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68898

--- Comment #1 from Yuri Rumyantsev  ---
Created attachment 37028
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37028&action=edit
test-case to reproduce

It needs to be compiled with -O2 -m32 -ffast-math options to reproduce. Note
that both 32-bit mode and the -ffast-math flag are essential.

[Bug rtl-optimization/68898] New: ICE if rtl if-conversion is off.

2015-12-14 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68898

            Bug ID: 68898
           Summary: ICE if rtl if-conversion is off.
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

I tried to play with the if-conversion flag and got an ICE on all benchspec2
from the spec2000 suite. I attach a simple Fortran reproducer. Note that the
"-fno-if-conversion2" option does not lead to CF.

[Bug tree-optimization/68522] [6 Regression] SPEC CPU2006 435.gromacs miscomparison

2015-12-31 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68522

Yuri Rumyantsev  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ysrumyan at gmail dot com

--- Comment #5 from Yuri Rumyantsev  ---
I did a deeper investigation of the 435.gromacs miscomparison and found that:
1. It is caused by precision loss, i.e. this is not a bug in the split-paths
phase.
2. It is caused by fmadd-sub instructions only (reproduced on avx2 with
fma support), i.e. with the -fno-fma option the benchmark passes.
3. I found the first guilty routine for which split-paths leads to the
miscomparison: fsettle, which is an ordinary fp routine with a big exit bb
that gets replicated. I assume that a restriction on the size of the exit bb
to be duplicated must be introduced to avoid useless code size growth. So you
can close it after adding the corresponding parameter limit.

[Bug rtl-optimization/67145] [6 Regression] associativity from pseudo-reg ordering

2015-12-23 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67145

--- Comment #3 from Yuri Rumyantsev  ---
Created attachment 37120
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37120&action=edit
non-tested patch

[Bug rtl-optimization/67145] [6 Regression] associativity from pseudo-reg ordering

2015-12-23 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67145

--- Comment #4 from Yuri Rumyantsev  ---
I attached a simple non-tested patch which restores performance on x86. This
change is not perfect, but using it I noticed a 2%-6% speed-up on the 32-bit
x86 platform. The idea of the patch is very simple - we do not bail out if
nothing changed but re-materialize all PLUS rtx-instructions with a register
operand. This is important since the order of the operands in ops is
different, i.e. if we have x + y + z on function entry, ops is {x,z,y} if
REG(x) < REG(z) < REG(y).
[Bug rtl-optimization/69052] New: [6 Regression] Performance regression after r229402.

2015-12-25 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052

            Bug ID: 69052
           Summary: [6 Regression] Performance regression after r229402.
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

The loop_invariant phase has an additional function, inv_can_prop_to_addr_use,
which tries to determine whether forward propagation of a cheap address is
possible by calling verify_changes, which is very poor in comparison with the
combine phase.
For example, for the attached test-case it tries
(gdb) call debug_rtx(def_insn)
(insn 69 67 70 9 (set (reg/f:SI 149)
(plus:SI (reg:SI 87)
(const:SI (unspec:SI [
(symbol_ref:SI ("ind") [flags 0x2] )
] UNSPEC_GOTOFF t1.c:40 212 {*leasi}
 (expr_list:REG_DEAD (reg:SI 87)
(nil)))
(gdb) call debug_rtx(use_insn)
(insn 70 69 71 9 (set (reg:SI 150)
(mem/u:SI (plus:SI (mult:SI (reg/v:SI 90 [ k ])
(const_int 4 [0x4]))
(reg/f:SI 149)) [1 ind S4 A32])) t1.c:40 86 {*movsi_internal}
 (expr_list:REG_DEAD (reg/f:SI 149)
(nil)))
and determines that propagation is not possible:
(gdb) p ok
$1 = false
but combine can do such substitution.

This leads to undesired code motion and performance loss:
for stmt out[ind[k]] = result
before r229402
        movl    ind@GOTOFF(%ebx,%esi,4), %eax
        movl    12(%esp), %edi
        movl    %ebp, (%edi,%eax,4)
after r229402
        movl    28(%esp), %eax
        movl    24(%esp), %ebx
        movl    (%eax,%esi,4), %eax
        movl    %edi, (%ebx,%eax,4)

A redundant fill has been generated by LRA.

Since emulating the combine phase is not so simple, I assume that an
additional hook should be added to turn off this transformation for x86 in
PIE mode.

[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.

2015-12-25 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052

--- Comment #1 from Yuri Rumyantsev  ---
Created attachment 37133
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37133&action=edit
test-case to reproduce

It should be compiled with -O2 -m32 options to reproduce.

[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86

2015-11-24 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

--- Comment #11 from Yuri Rumyantsev  ---
In fact, the problem is quite different, although it is caused by the
non-profitable pattern matching ~X CMP ~Y -> Y CMP X. In general this pattern
may be helpful if we can delete the not operation, e.g.
  x1 = ~x;
  y1 = ~y;
  if (x1 CMP y1) ...
and there are no other uses of x1 and y1, i.e. x1 and y1 have a single use.
But if this is not true we will increase register pressure, since we cannot
use the same register for x,x1 and y,y1.
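
As a small sanity check of the rewrite itself (not of its profitability),
the sketch below takes CMP to be '<' on int, which is an assumption made
only for this example, and compares both forms over a range of values. The
function names are hypothetical:

```c
#include <assert.h>
#include <limits.h>

/* Compare the complements, as written in the source program.  */
int
cmp_not (int x, int y)
{
  int x1 = ~x;
  int y1 = ~y;
  return x1 < y1;
}

/* Form after the ~X CMP ~Y -> Y CMP X rewrite, with CMP being '<'.  */
int
cmp_swapped (int x, int y)
{
  return y < x;
}

/* Return 1 if both forms agree on every pair (x, y) in [lo, hi].
   HI must be below INT_MAX so that the loop counters do not overflow.  */
int
forms_agree (int lo, int hi)
{
  assert (lo <= hi && hi < INT_MAX);
  for (int x = lo; x <= hi; x++)
    for (int y = lo; y <= hi; y++)
      if (cmp_not (x, y) != cmp_swapped (x, y))
        return 0;
  return 1;
}
```

The two forms agree because ~x equals -x - 1 for int, so the order of the
complements is the reverse of the order of the original values. The
register-pressure concern is separate: when x1 and y1 have other uses, the
rewritten compare keeps x and y live in addition to x1 and y1.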

Richard proposed to use the same simplification for min/max operations, but
in the original test-case the nested min/max operation (min(x,min(y,z))) or
multi-operand min/max (min(x,y,z)) is not recognized by gcc (note that icc
does such a transformation), so this won't help since we have the same
register pressure issue:
c = ~r; 
m = ~g;
y = ~b;
k = min(c, m, y);
*out++ = c - k;
*out++ = m - k;
*out++ = y - k;
*out++ = k;
and we can see that the value of 'c' is used in the min computation and in
the resulting store, so if we use the r CMP g comparison we will increase the
live range of the r, g, b variables and additional registers will be required
for them (till the comparison).
Note also that there exists another issue with path-splitting (aka tail
duplication), which duplicates the loop back edge and in fact moves the tail
block into the hammock. This transformation does not look useful (at least at
the given stage of design) but this is another topic for discussion.

I'd like to propose introducing a new predicate for pattern matching which
tells us how many uses the left-hand side of ~x has.

[Bug middle-end/68542] [6 Regression] 10% 481.wrf performance regression

2015-11-26 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68542

--- Comment #3 from Yuri Rumyantsev  ---
I enhanced the patch that moves masked stores under a zero-mask guard - it now
moves all possible producers of the stored value, and the performance
degradation disappeared. The patch will be redesigned and sent for review
next week.

[Bug rtl-optimization/68435] [6 Regression] Missed if-conversion optimization

2015-11-20 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68435

--- Comment #6 from Yuri Rumyantsev  ---
It turned out that a fresh gcc performs tail duplication (aka path splitting),
which prevents if-conversion. So I am posting a dump for the 20150929 compiler
which reproduces the issue.

[Bug rtl-optimization/68435] [6 Regression] Missed if-conversion optimization

2015-11-20 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68435

--- Comment #7 from Yuri Rumyantsev  ---
Created attachment 36780
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36780&action=edit
rtl-ce1 dump file

The dump is for the 20150929 compiler.

[Bug rtl-optimization/68435] [6 Regression] Missed if-conversion optimization

2015-11-19 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68435

--- Comment #4 from Yuri Rumyantsev  ---
Created attachment 36774
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36774&action=edit
tar file

tar file contains good and bad ce1-rtl dumps showing the problem

[Bug rtl-optimization/68435] [6 Regression] Missed if-conversion optimization

2015-11-19 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68435

--- Comment #2 from Yuri Rumyantsev  ---
I will post 2 rtl dumps for the ce1 phase produced with -O2 -m32 options on
ix86. You can see that file t21.c.203r.ce1, produced by the 20110927 compiler,
contains
3 possible IF blocks searched.
1 IF blocks converted.
2 true changes made.
but file t21.c.209r.ce1, produced by the 20151119 compiler, does not:
1 possible IF blocks searched.
0 IF blocks converted.
0 true changes made.

[Bug tree-optimization/70729] Loop marked with omp simd pragma is not vectorized

2016-06-08 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729

--- Comment #23 from Yuri Rumyantsev  ---
OK. I will try to prepare the second part of the patch.
A few comments about the vect-simd-clone-5.c test failure.
1. This loop is marked with safelen=MAX_INT.
2. It contains the following stmts:
D.3301 = foo.simdclone.1 (vect_vec_iv_.25_12, 123, _17);
# VUSE <.MEM_39>
_22 = MEM[(vector(2) long long int[2] *)];
# VUSE <.MEM_39>
_23 = MEM[(vector(2) long long int[2] *) + 16B];
# .MEM_40 = VDEF <.MEM_39>
D.3301 ={v} {CLOBBER};
vect__3.28_24 = VEC_PACK_TRUNC_EXPR <_22, _23>;
and function ref_indep_loop_p_1 checks that references
MEM[(vector(2) long long int[2] *)]
and
MEM[(vector(2) long long int[2] *) + 16B]
are independent.
We can avoid such bad behavior of the safelen check in one of two ways: (1) add
a restriction that the loop does not contain non-analyzed references; (2) add
an additional check that the reference does not have operands defined inside
the loop (D.3301 in our case).

Which approach do you prefer?

[Bug rtl-optimization/71453] Spills to vector registers are sub-optimal.

2016-06-08 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71453

--- Comment #2 from Yuri Rumyantsev  ---
Forgot to mention that the number of instructions is about 10% higher: 632 vs
702 for spills into vector registers.

[Bug rtl-optimization/71453] New: Spills to vector registers are sub-optimal.

2016-06-08 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71453

            Bug ID: 71453
           Summary: Spills to vector registers are sub-optimal.
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

We noticed a significant performance regression on one important benchmark
after r235523.
Note that the fix itself is not responsible for it. The problem is related to
spill/fill to/from vector registers (aka xmm registers). For example, for the
attached test-case we can see a number of redundant "vector register spills"
and movements between them:
vmovd   %ecx, %xmm5
vmovd   %xmm5, %ecx
vmovd   %xmm5, 40(%esp) !! It will be more profitable to save %ecx on stack.
vmovdqa %xmm3, %xmm5 !! this is completely redundant.
...

There is also another issue with spills to vector registers - we must estimate
the profitability of such a spill in comparison with a spill to the stack. For
example, such a spill may not be profitable if a fill to a register is not
required:
movl    %eax, 44(%esp)  !! spill
...
andl    44(%esp), %eax !! fill is not required.
[Bug rtl-optimization/71453] Spills to vector registers are sub-optimal.

2016-06-08 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71453

--- Comment #1 from Yuri Rumyantsev  ---
Created attachment 38659
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38659&action=edit
test-case to reproduce

Must be compiled with -O2 -march=core-avx2 -m32 options.

[Bug tree-optimization/71437] [7 regression] Performance regression after r235817

2016-06-06 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71437

--- Comment #1 from Yuri Rumyantsev  ---
Created attachment 38652
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38652&action=edit
test-case to reproduce

Needs to be compiled with -O3 -m32 -ffast-math on x86-64.

[Bug tree-optimization/71437] New: [7 regression] Performance regression after r235817

2016-06-06 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71437

            Bug ID: 71437
           Summary: [7 regression] Performance regression after r235817
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

We noticed a ~10% slowdown on one important benchmark used for Silvermont
testing. I can reproduce this performance gap using the attached test-case on
SandyBridge:

before r235817

time ./good.exe
W[100]=10

real    0m0.761s

r235817
W[100]=10

real    0m0.863s

There exists another optimization opportunity, which can be illustrated by the
following test fragment:

if( i == ( I - 1 ) )
  L = pL[i] ;
LD = (float)( L - pL[i] ) /
     (float)( pL[i + 1] - pL[i] ) ;

It is clear that the LD value is 0 if L == pL[i], i.e. we can move the second
statement inside the hammock and perform simplification.
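
A minimal sketch of the two shapes follows. It assumes int data and wraps
the fragment in hypothetical functions (ld_original, ld_transformed); the
real types and surrounding code are in the attachment and are not shown
here. The shortcut is only valid when pL[i + 1] != pL[i]; otherwise the
original computes 0/0. It also ignores the sign of zero, which matches the
-ffast-math setting named for the test-case:

```c
#include <assert.h>

/* Original shape: the division is executed on both paths.  */
float
ld_original (const int *pL, int i, int I, int L)
{
  if (i == (I - 1))
    L = pL[i];
  return (float) (L - pL[i]) / (float) (pL[i + 1] - pL[i]);
}

/* Transformed shape: on the path where L was just set to pL[i] the
   numerator is 0, so the result is known to be 0 and the division is
   skipped.  Assumes a non-zero denominator.  */
float
ld_transformed (const int *pL, int i, int I, int L)
{
  assert (pL[i + 1] != pL[i]);
  if (i == (I - 1))
    return 0.0f;
  return (float) (L - pL[i]) / (float) (pL[i + 1] - pL[i]);
}
```

Both functions read pL[i + 1], so the array must have at least i + 2
elements.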

[Bug tree-optimization/70729] Loop marked with omp simd pragma is not vectorized

2016-06-07 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729

--- Comment #21 from Yuri Rumyantsev  ---
Richard!

Are you planning to prepare the second part of the patch (zeroing safelen and
testing it in loop invariant motion phase as you proposed)?

Thanks.

[Bug tree-optimization/70729] Loop marked with omp simd pragma is not vectorized

2016-06-10 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729

--- Comment #25 from Yuri Rumyantsev  ---
Richard!

I prepared the second part of the patch and checked that it does not
produce any new failures.
What is your opinion - could I send it to the GCC community for review?

ChangeLog:
2016-06-10  Yuri Rumyantsev  <ysrum...@gmail.com>

PR tree-optimization/70729
* tree-ssa-loop-im.c (gather_mem_refs_stmt): Mark loop as having
unanalyzed memory references.
(ref_indep_loop_p_1): Consider memory reference as independent in
loops having positive safelen value and not having unanalyzed memory
references.
(tree_ssa_lim_finalize): Clear-up aux field of loops.
* tree-vect-loop.c (vect_transform_loop): Clear-up safelen value since
it may be not valid after vectorization.

gcc/testsuite/ChangeLog
* g++.dg/vect/pr70729.cc: New test.

2016-06-09 9:42 GMT+03:00 rguenther at suse dot de <gcc-bugzi...@gcc.gnu.org>:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729
>
> --- Comment #24 from rguenther at suse dot de  ---
> I think that we cannot use safelen() to disregard dependences
> against "non-analyzed" references.  This is because of exactly
> such case.  In future we might want to make less references
> "non-analyzed" and use the general alias oracle on them
> (the LIM dependence analysis predates that).
>
> So - simply put the safelen() check after the check for non-analyzed
> reference in the disambiguator.
>
> --
> You are receiving this mail because:
> You reported the bug.

[Bug rtl-optimization/71275] [7 regression] Performance drop after r235660 on x86-64 in 32-bit mode.

2016-05-25 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71275

--- Comment #1 from Yuri Rumyantsev  ---
Created attachment 38564
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38564&action=edit
test-case to reproduce

Must be compiled with -O2 -m32 -march=slm options.

[Bug tree-optimization/71347] [7 regression] Performance drop after r235513 on x86-64 in 32-bit mode.

2016-05-30 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71347

--- Comment #1 from Yuri Rumyantsev  ---
Created attachment 38600
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38600&action=edit
test-case to reproduce

Needs to be compiled with -O2 -m32 -march=slm -ffast-math options on x86-64.

[Bug tree-optimization/71347] New: [7 regression] Performance drop after r235513 on x86-64 in 32-bit mode.

2016-05-30 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71347

            Bug ID: 71347
           Summary: [7 regression] Performance drop after r235513 on
                    x86-64 in 32-bit mode.
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

We noticed a significant regression (more than 10%) after this revision, which
can be illustrated with the following simple test-case (attached) - one
additional instruction in the innermost loop (scalar replacement is not
recognized):

  before r235513                     r235513
.L6:
        movsd   X.1861+8, %xmm2      movsd   -8(%eax), %xmm2
        addl    $8, %eax             movsd   X.1861+8, %xmm1
.L3:
        mulsd   %xmm2, %xmm0         mulsd   %xmm1, %xmm2
        cmpl    $X.1861+64, %eax     addl    $8, %eax
        movsd   %xmm0, (%eax)        movsd   %xmm2, -8(%eax)
        jne     .L6                  cmpl    $X.1861+72, %eax
                                     jne     .L6

[Bug tree-optimization/69297] [6 Regression] Performance regression after r230020

2016-01-15 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69297

--- Comment #1 from Yuri Rumyantsev  ---
Created attachment 37356
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37356&action=edit
test-case to reproduce

To reproduce, compile with -Ofast -march=core-avx2 options.

[Bug tree-optimization/69297] New: [6 Regression] Performance regression after r230020

2016-01-15 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69297

            Bug ID: 69297
           Summary: [6 Regression] Performance regression after r230020
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

This regression was found on spec2006/464.h264ref. The problem is related to
SLP vectorization of BBs and is caused by the wrong calculation of the scalar
cost, e.g. for the attached test-case:
  Cost model analysis:
  Vector inside of basic block cost: 188
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 512

although the basic block contains only 96 statements.
I found out that vect_bb_slp_scalar_cost takes the same stmt into account
several times, which results in non-profitable SLP vectorization.
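
The effect of counting one statement several times can be shown with a toy
model. This is not GCC code: struct toy_node and both functions are
hypothetical, and the cost of each statement is simply taken as 1. The
sketch only illustrates how a walk over a tree whose nodes share statements
inflates the scalar cost unless visited statements are tracked:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a tree of statement groups:
   each node lists scalar stmt ids and points to child nodes.  */
struct toy_node
{
  const int *stmts;                     /* ids of scalar stmts */
  size_t n_stmts;
  const struct toy_node *const *kids;   /* child nodes, may be NULL */
  size_t n_kids;
};

/* Naive walk: every occurrence of a stmt is charged, so a stmt that
   is reachable through several nodes is counted several times.  */
unsigned
naive_scalar_cost (const struct toy_node *node)
{
  assert (node != NULL);
  unsigned cost = (unsigned) node->n_stmts;
  for (size_t i = 0; i < node->n_kids; i++)
    cost += naive_scalar_cost (node->kids[i]);
  return cost;
}

/* Walk that charges each stmt id once.  VISITED is a caller-provided
   array indexed by stmt id, initially all false.  */
unsigned
dedup_scalar_cost (const struct toy_node *node, bool *visited)
{
  assert (node != NULL && visited != NULL);
  unsigned cost = 0;
  for (size_t i = 0; i < node->n_stmts; i++)
    if (!visited[node->stmts[i]])
      {
        visited[node->stmts[i]] = true;
        cost++;
      }
  for (size_t i = 0; i < node->n_kids; i++)
    cost += dedup_scalar_cost (node->kids[i], visited);
  return cost;
}
```

For a root with one statement and two children that both refer to the same
two statements, the naive walk reports 5 while the deduplicating walk
reports 3.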

[Bug rtl-optimization/67145] [6 Regression] associativity from pseudo-reg ordering

2016-01-13 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67145

--- Comment #6 from Yuri Rumyantsev  ---
We checked that proposed patch does not introduce new performance regression
and I will prepare it for review after bootstrapping and regression testing
completion, likely tomorrow.

[Bug rtl-optimization/69274] New: [6 Regression] Performance regression after r231814 on x86 Haswell.

2016-01-14 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69274

            Bug ID: 69274
           Summary: [6 Regression] Performance regression after r231814
                    on x86 Haswell.
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

After this simple fix we got a huge regression (> 16%) for spec2006/435.gromacs
on Haswell with "-O2 -ffast-math" options. Preliminary investigation has shown
that the hottest loop in the benchmark (fsettle) became 10 instructions shorter
(less spill/fill) but performance regressed significantly. Note that adding
the first scheduler by "-fschedule-insns --param sched-pressure-algorithm=2
-fsched-pressure" gave us a +24% speed-up (but only for this particular
benchmark).

[Bug tree-optimization/69297] [6 Regression] Performance regression after r230020

2016-01-18 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69297

--- Comment #4 from Yuri Rumyantsev  ---
Yes, this loop was added to prevent removal by the dce phase.

Thanks.
Yuri.

2016-01-18 13:33 GMT+03:00 rguenth at gcc dot gnu.org :
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69297
>
> --- Comment #3 from Richard Biener  ---
> With a fix:
>
> t.c:76:10: note: Cost model analysis:
>   Vector inside of basic block cost: 376
>   Vector prologue cost: 0
>   Vector epilogue cost: 0
>   Scalar cost of basic block: 96
> t.c:76:10: note: not vectorized: vectorization is not profitable.
>
> Note the reduction loop is still vectorized:
>
> t.c:74:5: note: Cost model analysis:
>   Vector inside of loop cost: 3
>   Vector prologue cost: 1
>   Vector epilogue cost: 7
>   Scalar iteration cost: 3
>   Scalar outside cost: 0
>   Vector outside cost: 8
>   prologue iterations: 0
>   epilogue iterations: 0
>   Calculated minimum iters for profitability: 4
>
> but likely this isn't profitable either?

[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.

2016-02-09 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052

--- Comment #13 from Yuri Rumyantsev  ---
I checked that performance is back for the whole benchmark. Thanks a lot.

Yuri.

2016-02-09 14:17 GMT+03:00 amker at gcc dot gnu.org :
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052
>
> --- Comment #12 from amker at gcc dot gnu.org ---
> Patch sent for review at
> https://gcc.gnu.org/ml/gcc-patches/2016-02/msg00612.html
> It works for the reduced test case, could you please help me to check if it
> works for your original case?
> Thanks,
> bin

[Bug tree-optimization/69652] [6 Regression] [ICE] verify_ssa fail w/ -O2 -ffast-math -ftree-vectorize

2016-02-04 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69652

--- Comment #4 from Yuri Rumyantsev  ---
Jakub,

Thanks a lot for your detailed comments!

I've just sent a patch for review to gcc-patches. Could you please
take a look at it?

Best regards.
Yuri.

2016-02-03 20:22 GMT+03:00 jakub at gcc dot gnu.org :
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69652
>
> Jakub Jelinek  changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|jakub at redhat dot com      |
>
> --- Comment #3 from Jakub Jelinek  ---
> Clearly a bug in optimize_mask_stores.
> At the start of that function we have:
> ...
> mask__46.14_129 = vect__14.9_121 != vect__21.12_127;
> _46 = _14 != _21;
> mask__ifc__47.15_130 = mask__46.14_129;
> _ifc__47 = _46;
> MASK_STORE (vectp.16_132, 8B, mask__ifc__47.15_130, vect__22.13_128);
> vect__24.20_140 = MEM[(double *)vectp.18_138];
> _24 = *_13;
> vect__25.21_141 = vect__21.12_127 + vect__24.20_140;
> _25 = _21 + _24;
> MASK_STORE (vectp.22_145, 8B, mask__ifc__47.15_130, vect__25.21_141);
> k_27 = k_28 + 1;
> ...
> Now, the MASK_STORE calls are processed from last to first, which is fine, we
> first move the second MASK_STORE and the vector stmts that feed it:
> vect__24.20_140 = MEM[(double *)vectp.18_138];
> vect__25.21_141 = vect__21.12_127 + vect__24.20_140;
> MASK_STORE (vectp.22_145, 8B, mask__ifc__47.15_130, vect__25.21_141);
> but then continue trying to move the first MASK_STORE into the same
> conditional block, because it has the same mask.  In this case it is wrong,
> because there is
> the scalar load in between (_24 = *_13) that just waits for DCE, but generally
> there could be arbitrary code.
> /* Put other masked stores with the same mask to STORE_BB.  */
> if (worklist.is_empty ()
> || gimple_call_arg (worklist.last (), 2) != mask
> || worklist.last () != stmt1)
>   break;
> has a simplistic check (doesn't consider other MASK_STORE unless the walking
> walked up to that stmt), but of course it doesn't work too well if some scalar
> stmts were skipped.
>
> I see various issues in that function:
> 1) wrong formatting:
>   gsi_to = gsi_start_bb (store_bb);
>   if (dump_enabled_p ())
> {
>   dump_printf_loc (MSG_NOTE, vect_location,
>"Move stmt to created bb\n");
>   dump_gimple_stmt (MSG_NOTE, TDF_SLIM, last, 0);
> }
> /* Move all stored value producers if possible.  */
> while (!gsi_end_p (gsi))
>   {
> The Move all stored value and everything below up to corresponding closing }
> should be moved two columns to the left
> 2) IMHO stmt1 should be set to NULL before that while (!gsi_end_p (gsi)),
> as the function is prepared to handle multiple bbs
> 3) next to gimple_vdef non-NULL break IMHO should be also
> gimple_has_volatile_ops -> break check, just for safety, we don't want to
> mishandle say volatile reads etc.
> 4) you have to skip over debug stmts if there are any, otherwise we have a
> -fcompare-debug issue
> 5) IMHO you should give up also for !is_gimple_assign, say trying to move an
> elemental function call into the conditional is just wrong
> 6) the
> /* Skip scalar statements.  */
> if (!VECTOR_TYPE_P (TREE_TYPE (lhs)))
>   continue;
> should be reconsidered.  IMHO if you have scalar stmts that feed just the stmts
> in the STORE_BB, there is no reason not to move them too, if you have scalar
> stmts that feed other stmts too, IMHO you should give up on them if they have a
> vuse; what code did you have in mind when adding the !VECTOR_TYPE_P check?
> 7) FOR_EACH_IMM_USE_FAST loop should ignore debug stmts, at least for decisions
> when to stop in some stmt; but the debug stmts if there are any need to be reset
> if we move the def stmt that they are using, otherwise we risk -fcompare-debug
> issues
> 8) the worklist.last () != stmt1 check needs to be -fcompare-debug friendly too,
> so if there are debug stmts in between the last moved stmt and the previous
> MASK_STORE, we need to handle it as if there aren't any debug stmts in between

[Bug tree-optimization/69783] New: [6 Regression] Loop is not vectorized after r233212

2016-02-12 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69783

            Bug ID: 69783
           Summary: [6 Regression] Loop is not vectorized after r233212
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

After the changes in vect_prune_runtime_alias_test_list() the number of merged
ranges decreased significantly:

  Before fix
improved number of alias checks from 50 to 3
  After fix
improved number of alias checks from 50 to 22
and the loop is no longer vectorized since
number of versioning for alias run-time tests exceeds 10
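
To make the numbers above easier to read, here is a minimal sketch of the general idea of merging address ranges so that fewer run-time checks are needed. It is not the code of vect_prune_runtime_alias_test_list(); the struct range type and the merge_ranges function are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical half-open byte range [start, end) touched by one reference.  */
struct range { long start, end; };

/* qsort comparator: order ranges by their start address.  */
static int cmp_start (const void *a, const void *b)
{
  const struct range *x = a, *y = b;
  return (x->start > y->start) - (x->start < y->start);
}

/* Sort the ranges and merge those that overlap or touch.
   Returns the number of ranges left at the front of R.  */
size_t merge_ranges (struct range *r, size_t n)
{
  if (n == 0)
    return 0;
  qsort (r, n, sizeof *r, cmp_start);
  size_t out = 0;
  for (size_t i = 1; i < n; i++)
    {
      if (r[i].start <= r[out].end)
        {
          /* Adjacent or overlapping: extend the current merged range.  */
          if (r[i].end > r[out].end)
            r[out].end = r[i].end;
        }
      else
        r[++out] = r[i];
    }
  return out + 1;
}
```

The fewer ranges survive the merge, the fewer pairwise checks have to be emitted, which is why a drop from "50 to 3" to "50 to 22" pushes the loop over the limit of 10 quoted in the message.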

[Bug tree-optimization/69783] [6 Regression] Loop is not vectorized after r233212

2016-02-12 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69783

--- Comment #1 from Yuri Rumyantsev  ---
Created attachment 37671
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37671&action=edit
test-case to reproduce

It needs to be compiled with -Ofast -funroll-loops on x86-64

[Bug tree-optimization/69652] [6 Regression] [ICE] verify_ssa fail w/ -O2 -ffast-math -ftree-vectorize

2016-02-05 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69652

--- Comment #5 from Yuri Rumyantsev  ---
Jakub,

I'd like to clarify one of your remarks:

5) IMHO you should give up also for !is_gimple_assign, say trying to move an
elemental function call into the conditional is just wrong

What is wrong with moving calls? Note that masked stores and loads are
also represented as calls. I assume that, most likely, simd clone function
calls must not be moved.

Thanks in advance.
Yuri.

P.S. It means that my patch is not correct and should be fixed.

2016-02-04 17:48 GMT+03:00 Yuri Rumyantsev :
> Jakub,
>
> Thanks a lot for your detailed comments!
>
> I've just sent a patch for review to gcc-patches. Could you please
> take a look at it?
>
> Best regards.
> Yuri.
>
> 2016-02-03 20:22 GMT+03:00 jakub at gcc dot gnu.org 
> :
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69652
>>
>> Jakub Jelinek  changed:
>>
>>            What    |Removed                     |Added
>> ----------------------------------------------------------------------------
>>                  CC|jakub at redhat dot com     |
>>
>> --- Comment #3 from Jakub Jelinek  ---
>> Clearly a bug in optimize_mask_stores.
>> At the start of that function we have:
>> ...
>> mask__46.14_129 = vect__14.9_121 != vect__21.12_127;
>> _46 = _14 != _21;
>> mask__ifc__47.15_130 = mask__46.14_129;
>> _ifc__47 = _46;
>> MASK_STORE (vectp.16_132, 8B, mask__ifc__47.15_130, vect__22.13_128);
>> vect__24.20_140 = MEM[(double *)vectp.18_138];
>> _24 = *_13;
>> vect__25.21_141 = vect__21.12_127 + vect__24.20_140;
>> _25 = _21 + _24;
>> MASK_STORE (vectp.22_145, 8B, mask__ifc__47.15_130, vect__25.21_141);
>> k_27 = k_28 + 1;
>> ...
>> Now, the MASK_STORE calls are processed from last to first, which is fine, we
>> first move the second MASK_STORE and the vector stmts that feed it:
>> vect__24.20_140 = MEM[(double *)vectp.18_138];
>> vect__25.21_141 = vect__21.12_127 + vect__24.20_140;
>> MASK_STORE (vectp.22_145, 8B, mask__ifc__47.15_130, vect__25.21_141);
>> but then continue trying to move the second MASK_STORE into the same
>> conditional block, because it has the same mask.  In this case it is wrong,
>> because there is
>> the scalar load in between (_24 = *_13) that just waits for DCE, but 
>> generally
>> there could be arbitrary code.
>> /* Put other masked stores with the same mask to STORE_BB.  */
>> if (worklist.is_empty ()
>> || gimple_call_arg (worklist.last (), 2) != mask
>> || worklist.last () != stmt1)
>>   break;
>> has a simplistic check (doesn't consider other MASK_STORE unless the walking
>> walked up to that stmt), but of course it doesn't work too well if some 
>> scalar
>> stmts were skipped.
>>
>> I see various issues in that function:
>> 1) wrong formatting:
>>   gsi_to = gsi_start_bb (store_bb);
>>   if (dump_enabled_p ())
>> {
>>   dump_printf_loc (MSG_NOTE, vect_location,
>>"Move stmt to created bb\n");
>>   dump_gimple_stmt (MSG_NOTE, TDF_SLIM, last, 0);
>> }
>> /* Move all stored value producers if possible.  */
>> while (!gsi_end_p (gsi))
>>   {
>> The Move all stored value and everything below up to corresponding closing }
>> should be moved two columns to the left
>> 2) IMHO stmt1 should be set to NULL before that while (!gsi_end_p (gsi)),
>> as the function is prepared to handle multiple bbs
>> 3) next to gimple_vdef non-NULL break IMHO should be also
>> gimple_has_volatile_ops -> break check, just for safety, we don't want to
>> mishandle say volatile reads etc.
>> 4) you have to skip over debug stmts if there are any, otherwise we have a
>> -fcompare-debug issue
>> 5) IMHO you should give up also for !is_gimple_assign, say trying to move an
>> elemental function call into the conditional is just wrong
>> 6) the
>> /* Skip scalar statements.  */
>> if (!VECTOR_TYPE_P (TREE_TYPE (lhs)))
>>   continue;
>> should be reconsidered.  IMHO if you have scalar stmts that feed just the stmts
>> in the STORE_BB, there is no reason not to move them too, if you have scalar
>> stmts that feed other stmts too, IMHO you should give up on them if they have a
>> vuse; what code did you have in mind when adding the !VECTOR_TYPE_P check?
>> 7) FOR_EACH_IMM_USE_FAST loop should ignore debug stmts, at least for decisions
>> when to stop in some stmt; but the debug stmts, if there are any, need to be reset
>> if we move the def stmt that they are using, otherwise we risk -fcompare-debug
>> issues
>> 8) the worklist.last () != stmt1 check needs to be -fcompare-debug friendly too,
>> so if there are debug stmts in between the last moved stmt and the previous
>> MASK_STORE, we need to handle it as if there aren't any debug stmts in between
>>

[Bug rtl-optimization/69633] [6 Regression] Redundant move is generated after r228097

2016-02-02 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69633

--- Comment #1 from Yuri Rumyantsev  ---
Created attachment 37559
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37559&action=edit
test-case to reproduce

It needs to be compiled with -O2 -m32 -pie -fPIE.
I assume that -march=slm is not needed.

[Bug rtl-optimization/69633] New: [6 Regression] Redundant move is generated after r228097

2016-02-02 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69633

Bug ID: 69633
   Summary: [6 Regression] Redundant move is generated after
r228097
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

Sorry that we noticed this regression only now and not in September.
After Makarov's fix for 61578 (and the s390 regression) we noticed that for the
attached simple test-case, extracted from a real benchmark, one more redundant
move instruction is generated (up to the 20160202 compiler build):

before fix (postreload dump)
   86: NOTE_INSN_BASIC_BLOCK 4
   40: dx:QI=[si:SI]
   41: ax:QI=[si:SI+0x1]
   42: {si:SI=si:SI+0x3;clobber flags:CC;}
   43: dx:SI=zero_extend(dx:QI)
   44: ax:SI=zero_extend(ax:QI)
   45: cx:SI=zero_extend([si:SI-0x1])
   46: {di:SI=dx:SI*0x4c8b;clobber flags:CC;}
   47: {bx:SI=ax:SI*0x9646;clobber flags:CC;}
   48: {bx:SI=bx:SI+di:SI;clobber flags:CC;}
   49: {di:SI=cx:SI*0x1d2f;clobber flags:CC;}
   50: NOTE_INSN_DELETED
   51: bx:SI=bx:SI+di:SI+0x8000
   52: {bx:SI=bx:SI>>0x10;clobber flags:CC;}
   53: [bp:SI]=bx:QI
   96: bx:SI=dx:SI
   55: {bx:SI=bx:SI<<0xf;clobber flags:CC;}
   57: {bx:SI=bx:SI-dx:SI;clobber flags:CC;}

after fix
   86: NOTE_INSN_BASIC_BLOCK 4
   40: dx:QI=[si:SI]
   41: ax:QI=[si:SI+0x1]
   42: {si:SI=si:SI+0x3;clobber flags:CC;}
   43: dx:SI=zero_extend(dx:QI)
   44: ax:SI=zero_extend(ax:QI)
   45: cx:SI=zero_extend([si:SI-0x1])
   46: {di:SI=dx:SI*0x4c8b;clobber flags:CC;}
   47: {bx:SI=ax:SI*0x9646;clobber flags:CC;}
   48: {bx:SI=bx:SI+di:SI;clobber flags:CC;}
   49: {di:SI=cx:SI*0x1d2f;clobber flags:CC;}
   50: NOTE_INSN_DELETED
   51: bx:SI=bx:SI+di:SI+0x8000
   52: {bx:SI=bx:SI>>0x10;clobber flags:CC;}
   53: [bp:SI]=bx:QI
   96: bx:SI=dx:SI
   55: {bx:SI=bx:SI<<0xf;clobber flags:CC;}
   98: di:SI=bx:SI   !! redundant move
   57: {di:SI=di:SI-dx:SI;clobber flags:CC;}

As a result, we got a >3% slowdown on Silvermont in PIE 32-bit mode.

[Bug tree-optimization/69652] [6 Regression] [ICE] verify_ssa fail w/ -O2 -ffast-math -ftree-vectorize

2016-02-03 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69652

--- Comment #2 from Yuri Rumyantsev  ---
This is my fault - I forgot to fix the vuse for scalar statements which are
crossed by masked stores during code motion. The fix is being tested and will be
sent for review tomorrow.
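
For readers without the attachment, a loop of roughly the following shape gives rise to two conditional stores guarded by the same condition, with a load in between. This is only a sketch and not the PR's test-case; the function name cond_store and its parameters are made up for illustration.

```c
/* Hypothetical loop: both stores are guarded by the same condition, so
   after if-conversion and vectorization they would share one mask.  */
void cond_store (double *restrict p, double *restrict q,
                 const double *a, const double *b, int n)
{
  for (int k = 0; k < n; k++)
    if (a[k] != b[k])
      {
        p[k] = a[k] + 1.0;    /* first conditional store */
        q[k] = b[k] + q[k];   /* load of q[k] feeds the second store */
      }
}
```

The point of the discussion in this PR is that statements sitting between two such stores (here the load of q[k]) have to be accounted for when both stores are sunk into one conditional block.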

[Bug rtl-optimization/69942] gcc.dg/ifcvt-5.c FAILs

2016-02-29 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69942

--- Comment #2 from Yuri Rumyantsev  ---
I attached a patch which resolves the failure.

[Bug rtl-optimization/69942] gcc.dg/ifcvt-5.c FAILs

2016-02-29 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69942

--- Comment #3 from Yuri Rumyantsev  ---
Created attachment 37822
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37822&action=edit
proposed patch

Patch to resolve the ifcvt-5.c failure.

[Bug rtl-optimization/69942] gcc.dg/ifcvt-5.c FAILs

2016-02-26 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69942

--- Comment #1 from Yuri Rumyantsev  ---
The cause of the issue is that before the ce1 phase a pde (or pre) transformation
has been done to remove the partially redundant moves to variables i and j, i.e.
the code
  int i = x;
  int j = y;
  if (x > y)
    {
      i = a;
      j = i;
    }
has been transformed to
  int i, j;
  if (x > y)
    {
      i = a;
      j = i;
    }
  else
    {
      i = x;
      j = y;
    }
and the ifcvt phase speculatively moves the else-part before the if-part, i.e.
restores the original code. This transformation is counted as a real change and
the test fails. I assume that the test must also accept '6 basic blocks,' in
order to pass.
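
The two forms discussed above can be written as complete functions to check that they compute the same thing. This is a sketch only: the function names and the `return i * j;` tail are assumptions added to make the fragments self-contained, they are not taken from the message.

```c
/* Form with unconditional initial assignments (before the transformation).  */
int foo_original (int x, int y, int a)
{
  int i = x;
  int j = y;
  if (x > y)
    {
      i = a;
      j = i;
    }
  return i * j;
}

/* Form after the partially redundant moves were sunk into an else branch.  */
int foo_sunk (int x, int y, int a)
{
  int i, j;
  if (x > y)
    {
      i = a;
      j = i;
    }
  else
    {
      i = x;
      j = y;
    }
  return i * j;
}
```

Since both functions are equivalent, if-conversion hoisting the else-part back in front of the condition merely undoes the earlier transformation.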

[Bug tree-optimization/69467] New: [6 Regression] Pattern X * C1 CMP 0 to X CMP 0 causes performance drop on 32-bit x86.

2016-01-25 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69467

Bug ID: 69467
   Summary: [6 Regression] Pattern X * C1 CMP 0 to X CMP 0 causes
performance drop on 32-bit x86.
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

This is caused by the same revision as 67438
 http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=225248

The issue can be reproduced with the attached test-case.
After this transformation is applied to the loop upper bound:
for ( count = ((*(ptr)) & 0xf) * 2; count > 0; count--, addr++ )
two redundant instructions are generated:
  after                          before
movl    48(%esp), %ebx         movl    48(%esp), %ecx
movzbl  (%ebx), %eax           movzbl  (%ecx), %edx
andl    $15, %eax              andl    $15, %edx
movzbl  %al, %ecx              addl    %edx, %edx
addl    %ecx, %ecx
testb   %al, %al
je      .L12                   je      .L12

This can be significant if the loop has a low trip count.
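
A small sketch of what the pattern does to the loop entry test may help here. The helper names below are hypothetical; only the expression `(v & 0xf) * 2` comes from the report.

```c
/* Entry test as written in the source: the product is compared with zero.  */
int enters_loop_mul (unsigned char v)
{
  int count = (v & 0xf) * 2;
  return count > 0;
}

/* Entry test after X * C1 CMP 0 is rewritten to X CMP 0 (C1 = 2 > 0).  */
int enters_loop_simplified (unsigned char v)
{
  return (v & 0xf) > 0;
}

/* The product is still needed as the initial value of the counter.  */
int trip_count (unsigned char v)
{
  int n = 0;
  for (int count = (v & 0xf) * 2; count > 0; count--)
    n++;
  return n;
}
```

Both entry tests agree for every input, but because the product has another use as the loop counter, the rewritten comparison keeps a second value live instead of reusing the flags of the multiply (an addl in the listing above), which appears to be where the extra instructions come from.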

[Bug tree-optimization/69467] [6 Regression] Pattern X * C1 CMP 0 to X CMP 0 causes performance drop on 32-bit x86.

2016-01-25 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69467

--- Comment #1 from Yuri Rumyantsev  ---
Created attachment 37462
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37462&action=edit
test-case to reproduce

It needs to be compiled with -m32 and either -O2 or -O3 -funroll-loops.
The assembly cited in the description was produced with -O3 -funroll-loops.

[Bug tree-optimization/69467] [6 Regression] Pattern X * C1 CMP 0 to X CMP 0 causes performance drop on 32-bit x86.

2016-01-25 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69467

--- Comment #3 from Yuri Rumyantsev  ---
Richard,

I checked that performance is back with your patch.

Thanks.

2016-01-25 17:50 GMT+03:00 rguenth at gcc dot gnu.org
:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69467
>
> Richard Biener  changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|UNCONFIRMED                 |ASSIGNED
>    Last reconfirmed|                            |2016-01-25
>            Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org
>    Target Milestone|---                         |6.0
>      Ever confirmed|0                           |1
>
> --- Comment #2 from Richard Biener  ---
> To restore the state before the move from fold to match.pd we'd need to mark
> any such pattern involving compares as the outermost expr (and thus match
> on GIMPLE_CONDs) with an explicit && single_use () check.  Fix for this one:
>
> Index: gcc/match.pd
> ===
> --- gcc/match.pd(revision 232792)
> +++ gcc/match.pd(working copy)
> @@ -1821,12 +1821,13 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  (for cmp (simple_comparison)
>   scmp (swapped_simple_comparison)
>   (simplify
> -  (cmp (mult @0 INTEGER_CST@1) integer_zerop@2)
> +  (cmp (mult@3 @0 INTEGER_CST@1) integer_zerop@2)
>/* Handle unfolded multiplication by zero.  */
>(if (integer_zerop (@1))
> (cmp @1 @2)
> (if (ANY_INTEGRAL_TYPE_P (TREE_TYPE (@0))
> -   && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (@0)))
> +   && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (@0))
> +   && single_use (@3))
>  /* If @1 is negative we swap the sense of the comparison.  */
>  (if (tree_int_cst_sgn (@1) < 0)
>   (scmp @0 @2)
>

[Bug rtl-optimization/69633] [6 Regression] Redundant move is generated after r228097

2016-03-09 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69633

--- Comment #3 from Yuri Rumyantsev  ---
Sorry for the confusion. The bug should be closed as a user mistake.

2016-03-07 19:18 GMT+03:00 bernds at gcc dot gnu.org
:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69633
>
> Bernd Schmidt  changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |bernds at gcc dot gnu.org
>
> --- Comment #2 from Bernd Schmidt  ---
> Doesn't seem to happen over here. Can you still reproduce this with trunk?
> Please post exact arguments to cc1 if it does.
>

[Bug tree-optimization/66142] Loop is not vectorized because not sufficient support for GOMP_SIMD_LANE

2016-03-11 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66142

--- Comment #26 from Yuri Rumyantsev  ---
If we convert the structure copies into copies of the individual fields, the
test is vectorized and all mentions of GOMP_SIMD_LANE are deleted. But if we
slightly modify the test by introducing a new function vdot and inserting a call
to it:
   b = r.x * ray->dir.x + r.y * ray->dir.y;
 |
 v
   b = vdot (r, ray->dir);
the test is not vectorized:
test2.cpp:70:9: note: not vectorized: not suitable for scatter store
D.6062[_9].org.x = 1.0e+0;

test2.cpp is attached.

[Bug tree-optimization/66142] Loop is not vectorized because not sufficient support for GOMP_SIMD_LANE

2016-03-11 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66142

--- Comment #27 from Yuri Rumyantsev  ---
Created attachment 37940
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37940&action=edit
test-case to reproduce

It needs to be compiled with the -Ofast -mavx2 -fopenmp options.

[Bug target/70482] Opimization opportunity to vectorize basic block for -mavx target.

2016-04-01 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70482

--- Comment #2 from Yuri Rumyantsev  ---
Richard,
The problem is in pattern matching:

  /* Pattern detected.  */
  if (dump_enabled_p ())
    dump_printf_loc (MSG_NOTE, vect_location,
                     "vect_recog_widen_mult_pattern: detected:\n");

  /* Check target support  */
  vectype = get_vectype_for_scalar_type (half_type0);
  vecitype = get_vectype_for_scalar_type (itype);
  if (!vectype
      || !vecitype
      || !supportable_widening_operation (WIDEN_MULT_EXPR, last_stmt,
                                          vecitype, vectype,
                                          &dummy_code, &dummy_code,
                                          &dummy_int, &dummy_vec))
    return NULL;

We found the pattern, but it is not supported for the 256-bit vectype, so we
need to try the 128-bit one.

[Bug tree-optimization/70482] New: Opimization opportunity to vectorize basic block for -mavx target.

2016-03-31 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70482

Bug ID: 70482
   Summary: Opimization opportunity to vectorize basic block for
-mavx target.
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

If we compile bb-slp-pattern-1.c from the gcc.dg/vect suite with -mavx, pattern
vectorization does not happen, since AVX has very poor support for 256-bit
integer arithmetic. In particular, the widen-mult pattern is recognized but it
is not supported for 256-bit vectors.
The test fails for a native compiler build on an AVX machine. The simplest
solution is to use the same scheme as for loop vectorization and decrease the
vector size from 256-bit to 128-bit.
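
For illustration, straight-line code of the following kind contains the widening multiplication the report refers to: two 16-bit operands whose product is kept in 32 bits. This is a hypothetical sketch, not the contents of bb-slp-pattern-1.c, and widen_mult4 is a made-up name.

```c
/* Hypothetical basic block: each product of two 16-bit values is stored
   as a 32-bit result, the shape a widening multiply is meant to cover.  */
void widen_mult4 (const short *a, const short *b, int *out)
{
  out[0] = (int) a[0] * b[0];
  out[1] = (int) a[1] * b[1];
  out[2] = (int) a[2] * b[2];
  out[3] = (int) a[3] * b[3];
}
```

Four 32-bit results fit into one 128-bit vector, which is consistent with the suggestion to retry with the smaller vector size when the 256-bit variant is not supported.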

[Bug rtl-optimization/70873] New: [GCC7 Regressio] 20% performance regression at 482.sphinx3 after r235442 with -O2 -m32 on Haswell.

2016-04-29 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70873

Bug ID: 70873
   Summary: [GCC7 Regressio] 20% performance regression at
482.sphinx3 after r235442 with -O2 -m32 on Haswell.
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

This degradation is caused by a known issue with partial register dependency:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57954
and can be reproduced with the attached simple test-case:
  before fix
vxorpd     %xmm4, %xmm4, %xmm4
vcvtss2sd  (%esi,%eax,4), %xmm4, %xmm4
  after fix
vxorpd     %xmm6, %xmm6, %xmm6
vcvtss2sd  (%esi,%eax,4), %xmm6, %xmm7
I assume that register renaming must not split such a register live range but
should simply treat it as one.

[Bug rtl-optimization/70873] [GCC7 Regressio] 20% performance regression at 482.sphinx3 after r235442 with -O2 -m32 on Haswell.

2016-04-29 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70873

--- Comment #1 from Yuri Rumyantsev  ---
Created attachment 38375
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38375&action=edit
test-case to reproduce

Must be compiled with -O2 -mavx2 -m32 options.

[Bug tree-optimization/70849] Loop can be vectorized through gathers on AVX2 platforms.

2016-04-28 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70849

--- Comment #1 from Yuri Rumyantsev  ---
Created attachment 38365
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38365&action=edit
test-case to reproduce

Must be compiled with -O3 -mavx2 options

[Bug tree-optimization/70849] New: Loop can be vectorized through gathers on AVX2 platforms.

2016-04-28 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70849

Bug ID: 70849
   Summary: Loop can be vectorized through gathers on AVX2
platforms.
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

A simple test, which will be attached, is not vectorized because it is considered unprofitable:
test.c:11:5: note: cost model: the vector iteration cost = 2061 divided by the
scalar iteration cost = 9 is greater or equal to the vectorization factor = 8.
test.c:11:5: note: not vectorized: vectorization not profitable.
test.c:11:5: note: not vectorized: vector version will never be profitable.

but it can be vectorized, as icc does, using gathers:
   LOOP BEGIN at test.c(11,5)
      remark #15388: vectorization support: reference c1[j] has aligned access   [ test.c(12,7) ]
      remark #15388: vectorization support: reference c2[j] has aligned access   [ test.c(13,7) ]
      remark #15388: vectorization support: reference c1[j] has aligned access   [ test.c(12,7) ]
      remark #15388: vectorization support: reference c2[j] has aligned access   [ test.c(13,7) ]
      remark #15415: vectorization support: gather was generated for the variable <f[j+base]>, strided by 256   [ test.c(12,16) ]
      remark #15415: vectorization support: gather was generated for the variable <f[j+base+1]>, strided by 256   [ test.c(13,16) ]
      remark #15415: vectorization support: gather was generated for the variable <f[j+base]>, strided by 256   [ test.c(12,16) ]
      remark #15415: vectorization support: gather was generated for the variable <f[j+base+1]>, strided by 256   [ test.c(13,16) ]
  remark #15305: vectorization support: vector length 8
  remark #15300: LOOP WAS VECTORIZED
  remark #15449: unmasked aligned unit stride stores: 4 
  remark #15460: masked strided loads: 4 
  remark #15475: --- begin vector loop cost summary ---
  remark #15476: scalar loop cost: 18 
  remark #15477: vector loop cost: 12.000 
  remark #15478: estimated potential speedup: 1.500 
  remark #15488: --- end vector loop cost summary ---
   LOOP END
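
The attachment itself is not reproduced in this message. Judging only from the icc remarks (unit-stride stores to c1[j] and c2[j], loads from f strided by 256), the loop might look roughly like the sketch below; the function name, the parameter list and the exact index expression are guesses.

```c
#define STRIDE 256  /* stride taken from the icc remarks; the rest is assumed */

/* Hypothetical loop: contiguous stores, loads from f with a large stride.  */
void strided_load (const float *f, float *c1, float *c2, int base, int n)
{
  for (int j = 0; j < n; j++)
    {
      c1[j] = f[j * STRIDE + base];
      c2[j] = f[j * STRIDE + base + 1];
    }
}
```

With AVX2 gathers the eight strided elements of one vector iteration can be loaded by a single instruction, which is what icc reports doing.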

[Bug tree-optimization/70729] Loop marked with omp simd pragma is not vectorized

2016-04-28 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729

--- Comment #11 from Yuri Rumyantsev  ---
Richard,

I slightly modified the patch you proposed:
1. The loop->safelen check is applied only if lim is invoked before loop
vectorization, since its value could be incorrect (I simply added a bool param
to it).
2. The check is not applied if the loop contains unanalyzed memory references
(e.g. calls, clobbers etc.).
With these changes all regressions related to omp simd support disappeared
and the following failures remain (because the order of transformations changed):

FAIL: gcc.dg/autopar/outer-6.c scan-tree-dump-times parloops2 "parallelizing inner loop" 0
FAIL: gcc.dg/pr41783.c scan-tree-dump pre "pretmp[^\\n]* = a_global_var;"
FAIL: gcc.dg/tree-ssa/loadpre10.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/loadpre23.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/loadpre24.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/loadpre25.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/loadpre4.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/loadpre8.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/ssa-pre-16.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/ssa-pre-18.c scan-tree-dump pre "Replaced foo \\(f.y\\)"
FAIL: gcc.dg/tree-ssa/ssa-pre-20.c scan-tree-dump pre "New PHIs: 2"
FAIL: gcc.dg/tree-ssa/ssa-pre-3.c scan-tree-dump-times pre "Eliminated: 2" 1
FAIL: gfortran.dg/pr42108.f90   -O   scan-tree-dump pre "in all uses of countm1[^\n]* / "
FAIL: gfortran.dg/vect/fast-math-vect-8.f90   -O   scan-tree-dump-times vect "vectorized 1 loops" 1

What is your opinion?

[Bug tree-optimization/70729] Loop marked with omp simd pragma is not vectorized

2016-04-28 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729

--- Comment #12 from Yuri Rumyantsev  ---
Created attachment 38367
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38367&action=edit
modified patch

[Bug debug/70935] [6/7 Regression] ICE: verify_ssa failed (error: definition in block 9 does not dominate use in block 12) w/ -O3 -g

2016-05-04 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70935

--- Comment #3 from Yuri Rumyantsev  ---
Jakub,

Here is a simple fix: do not consider edges whose destination is the loop latch
block, i.e. when the loop is endless:
diff --git a/gcc/tree-ssa-loop-unswitch.c b/gcc/tree-ssa-loop-unswitch.c
index dd6fd01..7de5fba 100644
--- a/gcc/tree-ssa-loop-unswitch.c
+++ b/gcc/tree-ssa-loop-unswitch.c
@@ -532,6 +532,12 @@ find_loop_guard (struct loop *loop)
 guard_edge->src->index, guard_edge->dest->index);
   return NULL;
 }
+  if (guard_edge->dest == loop->latch)
+{
+  if (dump_file && (dump_flags & TDF_DETAILS))
+   fprintf(dump_file,"Guard edge destination is loop latch!\n");
+  return NULL;
+}

   if (dump_file && (dump_flags & TDF_DETAILS))
 fprintf (dump_file,

Is it OK with you?

[Bug tree-optimization/70729] Loop marked with omp simd pragma is not vectorized

2016-04-19 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729

--- Comment #6 from Yuri Rumyantsev  ---
Richard,

I made the change you proposed, but it still does not help since we have a
loop-carried dependency through this_4(D)->S_n:

  :
  _5 = this_4(D)->S_n;
...
  :
  pretmp_54 = this_4(D)->C2;
  pretmp_57 = this_4(D)->C1;
  pretmp_60 = MEM[(int * *)this_4(D) + 56B];
  _20 = this_4(D)->S_n;
  :   Loop header
  # i_33 = PHI <0(4), i_28(6)>
  # prephitmp_56 = PHI <_5(4), _20(6)> Recurrent phi.
...
test.cpp:66:25: note: vect_is_simple_use: operand prephitmp_56
test.cpp:66:25: note: def_stmt: prephitmp_56 = PHI <_5(4), _20(6)>
test.cpp:66:25: note: type of def: unknown
test.cpp:66:25: note: Unsupported pattern.
test.cpp:66:25: note: not vectorized: unsupported use in stmt.

[Bug tree-optimization/70729] Loop marked with omp simd pragma is not vectorized

2016-04-19 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729

--- Comment #1 from Yuri Rumyantsev  ---
Created attachment 38309
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38309&action=edit
test-case to reproduce

Must be compiled with -Ofast -mavx2 -fopenmp options on x86 machine.

[Bug tree-optimization/70729] New: Loop marked with omp simd pragma is not vectorized

2016-04-19 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729

Bug ID: 70729
   Summary: Loop marked with omp simd pragma is not vectorized
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

While analyzing the performance of one important benchmark we found that one of
the hot loops is not vectorized, since a loop-invariant load of a class member
has not been hoisted out of the loop although the loop was marked with the omp
simd pragma.
A test-case to reproduce is attached.
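
The situation can be sketched in plain C as follows. This is not the attached test-case: struct S, kernel and kernel_hoisted are hypothetical, and the field names are borrowed from the dumps in this PR only for readability.

```c
/* Hypothetical stand-in for the class; all members are reached via s.  */
struct S { int S_n; int *C1; int *C2; int *out; };

/* s->S_n is loop invariant, but the store through s->out[i] could alias
   it as far as the compiler can tell, so the load may stay in the loop.  */
void kernel (struct S *s, int n)
{
#pragma omp simd
  for (int i = 0; i < n; i++)
    s->out[i] = s->C1[i] + s->C2[i] * s->S_n;
}

/* The same loop with the invariant loads hoisted by hand.  */
void kernel_hoisted (struct S *s, int n)
{
  const int sn = s->S_n;
  int *c1 = s->C1, *c2 = s->C2, *out = s->out;
#pragma omp simd
  for (int i = 0; i < n; i++)
    out[i] = c1[i] + c2[i] * sn;
}
```

Built without -fopenmp the pragma is ignored and both functions still give the same result; the request in this PR is for the compiler to perform the hoisting itself when the pragma promises that iterations are independent.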

[Bug rtl-optimization/71275] New: [7 regression] Performance drop after r235660 on x86-64 in 32-bit mode.

2016-05-25 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71275

Bug ID: 71275
   Summary: [7 regression] Performance drop after r235660 on
x86-64 in 32-bit mode.
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

The regression can be seen with the attached test-case. In the tail block of the
innermost loop a redundant fill was added:

before r235660                  r235660

.L3:
addl    $1, %esi                addl    $1, %esi
addl    %eax, %ebx              addl    %eax, %ebx
movw    %bp, (%edi,%ecx)        movl    44(%esp), %edx
movswl  %si, %ebp               movswl  %si, %eax
cmpl    (%esp), %ebp            cmpl    %edi, %eax
jl      .L6                     movw    %bp, (%edx,%ecx)
                                jl      .L6

As a result we got up to a 14% slowdown on one important benchmark.
It is clearly not profitable to keep the value of the loop upper bound in a
register instead of the address base.

[Bug tree-optimization/72739] New: [7 Regression] FAIL: gcc.dg/vect/vect-mask-store-move-1.c after r238301

2016-07-28 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72739

Bug ID: 72739
   Summary: [7 Regression] FAIL:
gcc.dg/vect/vect-mask-store-move-1.c after r238301
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

We noticed that after this revision the test fails:
FAIL: gcc.dg/vect/vect-mask-store-move-1.c scan-tree-dump-times vect "Move stmt to created bb" 4
FAIL: gcc.dg/vect/vect-mask-store-move-1.c -flto -ffat-lto-objects  scan-tree-dump-times vect "Move stmt to created bb" 4

The problem is caused by complete deletion of the vectorized loop, which requires
a run-time alias check. Note that GCC 6 does not have this issue.

[Bug rtl-optimization/71956] [7 Regression] 176.gcc fails on 32 bits when compiled with -march=core-avx2

2016-08-10 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71956

Yuri Rumyantsev  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ysrumyan at gmail dot com

--- Comment #2 from Yuri Rumyantsev  ---
Jakub,

I removed both of your revisions in cse.c (c1) but it did not help - 176.gcc
still gets RF on avx2 but not on avx. I assume that masked stores are
responsible for it since we have them in the binaries:

.L2437:
        vmovd           %ecx, %xmm1
        vpxor           %xmm5, %xmm5, %xmm5
        addl            -40(%ebp), %eax
        movl            -28(%ebp), %edx
        vpbroadcastd    -36(%ebp), %ymm4
        vpaddd          .LC1, %ymm4, %ymm2
        vpbroadcastd    %xmm1, %ymm1
        leal            (%edx,%eax,4), %eax
        vpsrlvd         %ymm2, %ymm1, %ymm2
        vpaddd          %ymm7, %ymm4, %ymm3
        vpand           %ymm6, %ymm2, %ymm2
        vpcmpeqd        %ymm5, %ymm2, %ymm2
        vpcmpeqd        %ymm5, %ymm2, %ymm2
        vptest          %ymm2, %ymm2
        je              .L2446
        vpmaskmovd      %ymm0, %ymm2, (%eax)
.L2446:
        vpsrlvd         %ymm3, %ymm1, %ymm2
        vpxor           %xmm3, %xmm3, %xmm3
        leal            32(%eax), %edx
        vpaddd          .LC3, %ymm4, %ymm4
        vpand           %ymm6, %ymm2, %ymm2
        vpcmpeqd        %ymm3, %ymm2, %ymm2
        vpcmpeqd        %ymm3, %ymm2, %ymm2
        vptest          %ymm2, %ymm2
        je              .L2447
        vpmaskmovd      %ymm0, %ymm2, (%edx)

Will try to determine the correct revision responsible for it.

[Bug rtl-optimization/70467] Useless "and [esp],-1" emitted on AND with uint64_t variable

2016-08-11 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70467

Yuri Rumyantsev  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ysrumyan at gmail dot com

--- Comment #13 from Yuri Rumyantsev  ---
The fix in r235764 introduced the regression described in PR71956.

[Bug c/72794] New: [7 regression'] CF on spec2000/176.gcc after r238862.

2016-08-03 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72794

Bug ID: 72794
   Summary: [7 regression'] CF on spec2000/176.gcc after r238862.
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

We noticed that after this commit the benchmark fails to build with the message:
/tmp/cchqWD0Q.ltrans0.ltrans.o: In function `yylex':
:(.text+0x566e): undefined reference to `is_reserved_word'
/tmp/cchqWD0Q.ltrans8.ltrans.o: In function `compile_file':
:(.text+0xb1fe): undefined reference to `is_reserved_word'
:(.text+0xb22b): undefined reference to `is_reserved_word'
:(.text+0xb248): undefined reference to `is_reserved_word'
:(.text+0xb265): undefined reference to `is_reserved_word'

i.e. the function is_reserved_word with the "inline" attribute was deleted but
its calls were not inlined. To reproduce, the benchmark must be compiled with the
-Ofast -funroll-loops -flto -static  -DSPEC_CPU2000_LP64 options.
I did not try to prepare a test-case to reproduce, since I assume that the
spec2000 suite is available.

[Bug c/72794] [7 regression'] CF on spec2000/176.gcc after r238862.

2016-08-03 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72794

--- Comment #2 from Yuri Rumyantsev  ---
Yes, this option cures CF. Does it mean that we must compile spec2000
with this flag?

2016-08-03 19:08 GMT+03:00 pinskia at gcc dot gnu.org
:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72794
>
> --- Comment #1 from Andrew Pinski  ---
> Can you try with -std=gnu90 and see if that fixes the issue.
>

[Bug c/72794] [7 regression] CF on spec2000/176.gcc after r238862.

2016-08-04 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72794

--- Comment #6 from Yuri Rumyantsev  ---
Thanks for clarification.
This bug can be closed as user misunderstanding.
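
For reference, the C99 rule behind this clarification can be shown in a few lines. The sketch below is not code from 176.gcc; is_reserved_word_demo and uses_it are made-up names.

```c
/* C99 inline definition: on its own it does not provide an out-of-line
   copy, so a call that is not inlined would be an undefined reference.  */
inline int is_reserved_word_demo (int c)
{
  return c == 'i';
}

/* A file-scope declaration with extern turns the definition above into an
   external definition, so this translation unit emits the function.  */
extern inline int is_reserved_word_demo (int c);

/* Ordinary caller; it links whether or not the call gets inlined.  */
int uses_it (int c)
{
  return is_reserved_word_demo (c);
}
```

Without the extern declaration a C99 compiler is free to leave the call as a reference to an external symbol that no translation unit provides, which matches the link errors in the original report; with -std=gnu90 a plain inline definition is emitted out of line, so the problem does not show up.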

2016-08-04 14:08 GMT+03:00 rguenth at gcc dot gnu.org
:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72794
>
> --- Comment #5 from Richard Biener  ---
> No, it's not a bug in the LTO phase - C99 inline simply does _not_ emit an
> out-of-line copy.  You have to add an extern declaration to force that.

[Bug c/72794] [7 regression] CF on spec2000/176.gcc after r238862.

2016-08-04 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72794

--- Comment #4 from Yuri Rumyantsev  ---
I assume that there is still an issue in the LTO part of the compiler - even if
we ignore the "inline" attribute, we (LTO) must not delete such functions
from binaries. So this bug must be forwarded to the LTO phase.

2016-08-03 19:43 GMT+03:00 pinskia at gcc dot gnu.org:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72794
>
> Andrew Pinski  changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|UNCONFIRMED                 |RESOLVED
>          Resolution|---                         |INVALID
>
> --- Comment #3 from Andrew Pinski  ---
> (In reply to Yuri Rumyantsev from comment #2)
>> Yes, this option cures CF. Does it mean that we must compile spec2000
>> with this flag?
>
> Yes and it should be considered a portability flag.
>
> Basically GNU90 and ISO C99 inline behave slightly differently, which is why
> you are seeing this.
>
> --
> You are receiving this mail because:
> You reported the bug.
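
The inline semantics discussed in this thread can be sketched in a single translation unit. This is only a minimal illustration, not code from 176.gcc: the function name and body below are hypothetical. Under ISO C99 rules a definition marked only `inline` is an inline definition and provides no external symbol, so a call that is not inlined stays unresolved at link time. Adding a file-scope declaration with `extern` makes the same translation unit emit the out-of-line copy.

```c
#include <string.h>

/* Hypothetical stand-in for a function such as is_reserved_word.
   With C99 semantics this alone is an "inline definition":
   no out-of-line copy is emitted for callers to link against.  */
inline int
is_keyword_demo (const char *s)
{
  return strcmp (s, "if") == 0 || strcmp (s, "else") == 0;
}

/* This extern declaration turns the definition above into an external
   definition, so the symbol exists even when calls are not inlined.
   Under -std=gnu90 the plain "inline" definition would already have
   been emitted out of line.  */
extern inline int is_keyword_demo (const char *s);
```

Without the `extern` declaration, building a caller of `is_keyword_demo` at `-O0` in a C99 or later mode would be expected to end in an undefined reference similar to the one reported for `is_reserved_word`.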

[Bug rtl-optimization/71956] [7 Regression] 176.gcc fails on 32 bits when compiled with -march=core-avx2

2016-08-12 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71956

--- Comment #4 from Yuri Rumyantsev  ---
This should read: "the problem file is 176.gcc/src/sched.c, the problem function
is sched_analyze_insn".

[Bug rtl-optimization/71956] [7 Regression] 176.gcc fails on 32 bits when compiled with -march=core-avx2

2016-08-12 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71956

--- Comment #3 from Yuri Rumyantsev  ---
It turned out that after r235653 (with a minor int->bool type change) 176.gcc
started to fail at run time (RF). If we turn off the VRP phase, the benchmark
passes. The problem file is sched.c. Note that AVX2 is essential for reproducing.
I am trying to understand what the issue is.

[Bug tree-optimization/71077] [7 Regression] gcc -lto raises ICE

2016-08-12 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71077

Yuri Rumyantsev  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ysrumyan at gmail dot com

--- Comment #5 from Yuri Rumyantsev  ---
We found out that after r235653, with its minor int->bool type change, 176.gcc
still has an RF on an HSW machine in 32-bit mode at optimization level 3. If we
turn off the VRP phase with the -fno-tree-vrp option, the benchmark passes. We
need to understand why this simplification affects it.

[Bug testsuite/72850] [7 Regression] FAIL: gcc.dg/tree-ssa/pr69270-3.c scan-tree-dump-times uncprop1 ", 1" 4

2016-08-10 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72850

Yuri Rumyantsev  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ysrumyan at gmail dot com

--- Comment #3 from Yuri Rumyantsev  ---
We also noticed a huge regression on the coremark-pro/core benchmark after this
revision. I attach a test-case to reproduce.

[Bug testsuite/72850] [7 Regression] FAIL: gcc.dg/tree-ssa/pr69270-3.c scan-tree-dump-times uncprop1 ", 1" 4

2016-08-10 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72850

--- Comment #4 from Yuri Rumyantsev  ---
Created attachment 39093
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39093&action=edit
test-case to reproduce

It is sufficient to compile with the -Ofast option on an x86 machine.

[Bug middle-end/71734] [7 Regression] FAIL: libgomp.fortran/simd4.f90 -O3 -g execution test

2016-07-19 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71734

--- Comment #7 from Yuri Rumyantsev  ---
H.J.

I've just checked this test with my local fixed compiler and got:
Running /users/ysrumyan/workspaces/71261/gcc/testsuite/g++.dg/vect/vect.exp ...
PASS: g++.dg/vect/pr70729.cc  -std=c++11  scan-tree-dump vect "LOOP VECTORIZED"
PASS: g++.dg/vect/pr70729.cc  -std=c++11 (test for excess errors)
PASS: g++.dg/vect/pr70729.cc  -std=c++14  scan-tree-dump vect "LOOP VECTORIZED"
PASS: g++.dg/vect/pr70729.cc  -std=c++14 (test for excess errors)
PASS: g++.dg/vect/pr70729.cc  -std=c++98  scan-tree-dump vect "LOOP VECTORIZED"
PASS: g++.dg/vect/pr70729.cc  -std=c++98 (test for excess errors)

So it looks like this is not my fault.

2016-07-18 21:38 GMT+03:00 seurer at linux dot vnet.ibm.com:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71734
>
> Bill Seurer  changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |seurer at linux dot vnet.ibm.com
>
> --- Comment #6 from Bill Seurer  ---
> Looks like the simd3/4 tests now work with this patch but
> g++.dg/vect/pr70729.cc now fails:
>
> FAIL: g++.dg/vect/pr70729.cc  -std=c++98 (test for excess errors)
> FAIL: g++.dg/vect/pr70729.cc  -std=c++11 (test for excess errors)
> FAIL: g++.dg/vect/pr70729.cc  -std=c++14 (test for excess errors)
>
> In the log I see
>
> /tmp/cc3mxFhd.s: Assembler messages:
> /tmp/cc3mxFhd.s:29: Error: unrecognized opcode: `xsxexpdp'
> compiler exited with status 1
>
> and also
>
> /home/seurer/gcc/gcc-test/gcc/testsuite/g++.dg/vect/pr70729.cc:7:10: fatal
> error: xmmintrin.h: No such file or directory
> compilation terminated.
> compiler exited with status 1
>
>
> Maybe some of the options you removed weren't really redundant?
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.

[Bug tree-optimization/56688] Fortran save statement prevents loop vectorization.

2016-07-20 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56688

--- Comment #7 from Yuri Rumyantsev  ---
I checked that the GCC 7 compiler still does not vectorize loops in the thin6d
function, which is the single hottest function in the 200.sixtrack benchmark.

[Bug rtl-optimization/65698] Non-optimal code for simple compare function for x86 32-bit target

2016-07-20 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65698

--- Comment #3 from Yuri Rumyantsev  ---
I see that this bug has not been considered for a while.
Here is my additional comment.
First of all, this test was extracted from the bzip2 benchmark, mainGTU function.
The problem is that (1) the tree optimizer collects a CSE for i1 * 2 and i2 * 2;
(2) the forward propagation pass does not substitute it back into the address
computation, since use_killed_between is very simplified: it handles only a
simple basic block or semi-hammock:
  /* Finally, if DEF_BB is the sole predecessor of TARGET_BB.  */
  if (single_pred_p (target_bb)
  && single_pred (target_bb) == def_bb)
This function must be enhanced to handle an arbitrary CFG.

Note that this deficiency increases register pressure by 2, so we have more
spills/fills for the x86 32-bit target.

[Bug tree-optimization/70729] Loop marked with omp simd pragma is not vectorized

2016-07-04 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729

--- Comment #36 from Yuri Rumyantsev  ---
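#c33 testcase was not tested since I have some doubts about it. Note
that the original problem was
#pragma omp simd
  for (int i=0; i<...; i++) {
    float w1 = C2[S_n + i] * w;
    v1.v_i[i] += (int)w1;
    C1[S_n + i] += w1;
  }

and we must hoist S_n out of loop to vectorize it.

2016-07-04 19:40 GMT+03:00 jakub at gcc dot gnu.org: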
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729
>
> --- Comment #35 from Jakub Jelinek  ---
> Doesn't it still miscompile the #c33 testcase?
> Say with __attribute__((noinline, noclone)) on baz and
> int v[2048];
>
> int
> main ()
> {
>   v[1023] = 5;
>   baz (v, v + 1023, v + 1024, v + 1023);
>   int i;
>   for (i = 0; i < 1024; i++)
> if (v[i] != 5 * 6 || v[1024 + i] != (i == 1023 ? 5 * 6 : 5) * 9)
>   __builtin_abort ();
>   return 0;
> }
> (untested)?

[Bug tree-optimization/70729] Loop marked with omp simd pragma is not vectorized

2016-07-04 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729

--- Comment #34 from Yuri Rumyantsev  ---
Thanks a lot, Jakub, for your detailed comments.
I have a simple fix which cures the failures from 71734. The fix simply
checks that the ref in question belongs to the simd loop:
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index ee04826..c710bbe 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -2128,7 +2128,7 @@ ref_indep_loop_p_1 (struct loop *loop, im_mem_ref *ref,
   if (bitmap_bit_p (refs_to_check, UNANALYZABLE_MEM_ID))
 return false;

-  if (loop->safelen > 0)
+  if (loop->safelen > 1 && bitmap_bit_p (refs_to_check, ref->id))
 {
   if (dump_file && (dump_flags & TDF_DETAILS))
{

and I checked that simd3.f90 and simd4.f90 from libgomp.fortran passed with it.


2016-07-04 18:30 GMT+03:00 jakub at gcc dot gnu.org:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729
>
> --- Comment #33 from Jakub Jelinek  ---
> In any case, loop->safelen > 0 test looks also wrong, if there are guarantees
> about single iteration only (safelen(1)), then there is nothing useful at all.
> So it must be loop->safelen >= 2.
>
> For foo in #c29, the q[0] load in foo can be hoisted before the loop.
> More complicated is e.g.:
> void baz (int *p, int *q, int *r, int *s)
> {
>   #pragma omp simd
>   for (int i = 0; i < 1024; i++)
> {
>   p[i] += q[0] * 6;
>   r[i] += s[0] * 9;
> }
> }
> Here IMNSHO only q[0] * 6 can be hoisted before the loop, while it can alias
> p[1023] (or for x < 1023 p[x] if p[x] is initially 0), p[1023] could validly
> alias s[0] and thus s[0] * 9 must not be hoisted.

[Bug tree-optimization/70729] Loop marked with omp simd pragma is not vectorized

2016-07-04 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729

--- Comment #37 from Yuri Rumyantsev  ---
Jakub,

I assume that your #c33 test-case is not correct, i.e. it cannot be
marked with pragma omp simd. For example, even if we turn off the LIM
phase it still aborts:
my_g++ -O3  -m64 t33.cpp -o t33.exe.1 -fopenmp -fno-tree-loop-im
/users/ysrumyan/70729$ ./t33.exe.1
Aborted (core dumped)


2016-07-04 19:47 GMT+03:00 Yuri Rumyantsev :
> #c33 testcase was not tested since I have some doubts about it. Note
> that original problem was
> #pragma omp simd
>   for (int i=0; i<...; i++) {
>   float w1 = C2[S_n + i] * w;
>   v1.v_i[i] += (int)w1;
>   C1[S_n + i] += w1;
> }
>
> and we must hoist S_n out of loop to vectorize it.
>
> 2016-07-04 19:40 GMT+03:00 jakub at gcc dot gnu.org:
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729
>>
>> --- Comment #35 from Jakub Jelinek  ---
>> Doesn't it still miscompile the #c33 testcase?
>> Say with __attribute__((noinline, noclone)) on baz and
>> int v[2048];
>>
>> int
>> main ()
>> {
>>   v[1023] = 5;
>>   baz (v, v + 1023, v + 1024, v + 1023);
>>   int i;
>>   for (i = 0; i < 1024; i++)
>> if (v[i] != 5 * 6 || v[1024 + i] != (i == 1023 ? 5 * 6 : 5) * 9)
>>   __builtin_abort ();
>>   return 0;
>> }
>> (untested)?
>>
>> --
>> You are receiving this mail because:
>> You reported the bug.
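
To see what plain sequential C semantics give for the test quoted above, the loop can be run without the pragma. The following is a minimal sketch that only restates the quoted `baz` and `main` in scalar form; the names `baz_scalar`, `v_demo` and `count_mismatches_demo` are hypothetical. Because `q[0]` aliases `p[1023]` in this call, the last iteration stores 5 + 5 * 6 = 35 into `v[1023]`, so the check as quoted reports a mismatch even when no vectorization or LIM is involved, which is consistent with the abort seen with -fno-tree-loop-im.

```c
/* Scalar version of baz from comment #33, without the simd pragma.  */
static void
baz_scalar (int *p, int *q, int *r, int *s)
{
  for (int i = 0; i < 1024; i++)
    {
      p[i] += q[0] * 6;
      r[i] += s[0] * 9;
    }
}

static int v_demo[2048];

/* Runs the call from comment #35 sequentially and returns how many
   indices fail the check used in that comment.  */
int
count_mismatches_demo (void)
{
  for (int i = 0; i < 2048; i++)
    v_demo[i] = 0;
  v_demo[1023] = 5;
  baz_scalar (v_demo, v_demo + 1023, v_demo + 1024, v_demo + 1023);

  int bad = 0;
  for (int i = 0; i < 1024; i++)
    if (v_demo[i] != 5 * 6
        || v_demo[1024 + i] != (i == 1023 ? 5 * 6 : 5) * 9)
      bad++;
  return bad;
}
```

Only index 1023 fails the check in this sequential run: `v[1023]` ends as 35 and `v[2047]` as 35 * 9 = 315, while the check expects 30 and 270.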

[Bug tree-optimization/56688] static/saved variables prevent loop vectorization.

2016-07-22 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56688

--- Comment #8 from Yuri Rumyantsev  ---
I checked that if we comment out the 'save' statement in thin6d.f, all loops
are vectorized:
grep -c 'LOOP VECTORIZED' thin6d.f.149t.vect
32

[Bug rtl-optimization/71956] [7 Regression] 176.gcc fails on 32 bits when compiled with -march=core-avx2

2016-09-02 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71956

--- Comment #5 from Yuri Rumyantsev  ---
This bug is fixed by
Author: ppalka
Date: Sat Aug 27 22:00:17 2016
New Revision: 239798

URL: https://gcc.gnu.org/viewcvs?rev=239798&root=gcc&view=rev
Log:
Fix folding of VECTOR_CST comparisons

gcc/ChangeLog:

PR tree-optimization/71077
PR tree-optimization/68542
* fold-const.c (fold_relational_const): Fix folding of
VECTOR_CST comparisons that have a scalar boolean result type.
(selftest::test_vector_folding): New static function.
(selftest::fold_const_c_tests): Call it.

gcc/testsuite/ChangeLog:

PR tree-optimization/71077
* gcc.target/i386/pr71077.c: New test.


Added:
trunk/gcc/testsuite/gcc.target/i386/pr71077.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/fold-const.c
trunk/gcc/testsuite/ChangeLog

So this bug must be closed.

[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid

2016-09-06 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

--- Comment #1 from Yuri Rumyantsev  ---
Created attachment 39574
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39574&action=edit
test-case to reproduce

Need to compile with -O2 -ffast-math to reproduce.

[Bug tree-optimization/77498] New: [7 regression] Performance drop after r239414 on spec2000/172mgrid

2016-09-06 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

Bug ID: 77498
   Summary: [7 regression] Performance drop after r239414 on
spec2000/172mgrid
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

We noticed a significant regression after
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=239414
I attached a simple routine to reproduce. We can see that register pressure after
PRE is 2x higher with this patch. The regression is worse in 32-bit mode.

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2016-09-01 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #1 from Yuri Rumyantsev  ---
Created attachment 39535
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39535&action=edit
test-case to reproduce

It is sufficient to compile it with -Ofast option.

[Bug tree-optimization/77445] New: [7 Regression] Performance drop after r239219 on coremark test

2016-09-01 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

Bug ID: 77445
   Summary: [7 Regression] Performance drop after r239219 on
coremark test
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

We noticed a huge (32%) performance drop on coremark-pro/core (the former
coremark benchmark) after
http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=239219

The problematic part is
   if (optimize_edge_for_speed_p (taken_edge))
which does not look correct since we have a lot of missed opportunities for
jump threading optimization like:

test.c.111t.thread2:FSM jump-thread path not considered: duplication of 4 insns
is needed and optimizing for size.
test.c.111t.thread2:FSM jump-thread path not considered: duplication of 4 insns
is needed and optimizing for size.
test.c.111t.thread2:FSM jump-thread path not considered: duplication of 4 insns
is needed and optimizing for size.
test.c.111t.thread2:FSM jump-thread path not considered: duplication of 4 insns
is needed and optimizing for size.
test.c.111t.thread2:FSM jump-thread path not considered: duplication of 4 insns
is needed and optimizing for size.
test.c.111t.thread2:FSM jump-thread path not considered: duplication of 4 insns
is needed and optimizing for size.
test.c.111t.thread2:FSM jump-thread path not considered: duplication of 5 insns
is needed and optimizing for size.
test.c.111t.thread2:FSM jump-thread path not considered: duplication of 5 insns
is needed and optimizing for size.
test.c.111t.thread2:FSM jump-thread path not considered: duplication of 5 insns
is needed and optimizing for size.
test.c.167t.thread3:FSM jump-thread path not considered: duplication of 5 insns
is needed and optimizing for size.
test.c.167t.thread3:FSM jump-thread path not considered: duplication of 5 insns
is needed and optimizing for size.
test.c.167t.thread3:FSM jump-thread path not considered: duplication of 5 insns
is needed and optimizing for size.
test.c.167t.thread3:FSM jump-thread path not considered: duplication of 5 insns
is needed and optimizing for size.
test.c.167t.thread3:FSM jump-thread path not considered: duplication of 5 insns
is needed and optimizing for size.
test.c.167t.thread3:FSM jump-thread path not considered: duplication of 5 insns
is needed and optimizing for size.
test.c.167t.thread3:FSM jump-thread path not considered: duplication of 4 insns
is needed and optimizing for size.
test.c.167t.thread3:FSM jump-thread path not considered: duplication of 4 insns
is needed and optimizing for size.
test.c.167t.thread3:FSM jump-thread path not considered: duplication of 4 insns
is needed and optimizing for size.
test.c.170t.thread4:FSM jump-thread path not considered: duplication of 5 insns
is needed and optimizing for size.
test.c.170t.thread4:FSM jump-thread path not considered: duplication of 5 insns
is needed and optimizing for size.
test.c.170t.thread4:FSM jump-thread path not considered: duplication of 5 insns
is needed and optimizing for size.
test.c.170t.thread4:FSM jump-thread path not considered: duplication of 5 insns
is needed and optimizing for size.
test.c.170t.thread4:FSM jump-thread path not considered: duplication of 5 insns
is needed and optimizing for size.
test.c.170t.thread4:FSM jump-thread path not considered: duplication of 5 insns
is needed and optimizing for size.
test.c.170t.thread4:FSM jump-thread path not considered: duplication of 4 insns
is needed and optimizing for size.
test.c.170t.thread4:FSM jump-thread path not considered: duplication of 4 insns
is needed and optimizing for size.
test.c.170t.thread4:FSM jump-thread path not considered: duplication of 4 insns
is needed and optimizing for size.

If we change it to
  if (!optimize_function_for_size_p (cfun))
performance is restored.
I attach the test-case to reproduce the issue.

[Bug tree-optimization/71077] [7 Regression] gcc -lto raises ICE

2016-08-18 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71077

--- Comment #7 from Yuri Rumyantsev  ---
I checked that the proposed patch fixes the RF for 176.gcc.

Please, go ahead and commit your patch to trunk.
Thanks.
Yuri.

2016-08-12 20:14 GMT+03:00 patrick at parcs dot ath.cx
<gcc-bugzi...@gcc.gnu.org>:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71077
>
> --- Comment #6 from patrick at parcs dot ath.cx ---
> On Fri, 12 Aug 2016, ysrumyan at gmail dot com wrote:
>
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71077
>>
>> Yuri Rumyantsev  changed:
>>
>>            What    |Removed                     |Added
>> ----------------------------------------------------------------------------
>>                  CC|                            |ysrumyan at gmail dot com
>>
>> --- Comment #5 from Yuri Rumyantsev  ---
>> We found out that after r235653 with minor change of int->bool type 176.gcc
>> still RF on HSW machine in 32-bit if opt level equal 3. If we turn off VRP
>> phase by -fno-tree-vrp option benchmark is passed. Need to understand why
>> this simplification affects on it.
>
> My only guess is that the combining step still doesn't handle
> VECTOR_CSTs correctly. Could you please check if this patch fixes the
> runtime failure?
>
> diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c
> index 170e456..0db7bda 100644
> --- a/gcc/tree-ssa-threadedge.c
> +++ b/gcc/tree-ssa-threadedge.c
> @@ -577,6 +577,7 @@ simplify_control_stmt_condition_1 (edge e,
>    if (handle_dominating_asserts
>        && (cond_code == EQ_EXPR || cond_code == NE_EXPR)
>        && TREE_CODE (op0) == SSA_NAME
> +      && INTEGRAL_TYPE_P (TREE_TYPE (op0))
>        && integer_zerop (op1))
>      {
>        gimple *def_stmt = SSA_NAME_DEF_STMT (op0);

[Bug target/77344] Internal Compiler Error with arch knl

2016-08-26 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77344

Yuri Rumyantsev  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ysrumyan at gmail dot com

--- Comment #3 from Yuri Rumyantsev  ---
I checked that this bug was fixed on the GCC 6 branch some time ago, and a fresh
version of it compiles this file successfully:
GNU Fortran2008 (Revision=239431/svn-rev:239431/) version 6.1.1 20160812
(x86_64-pc-linux-gnu)
compiled by GNU C version 6.1.1 20160812

It looks like you need to get the next release of the GCC 6 branch compiler.
Note that I can reproduce the ICE with the earlier GCC 6 branch compiler:
compiled by GNU C version 6.1.1 20160617.

[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target

2016-10-26 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #1 from Yuri Rumyantsev  ---
Created attachment 39892
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39892&action=edit
test-case to reproduce

Must be compiled with "-Ofast -funroll-loops -march=knl" options.

[Bug rtl-optimization/78116] New: [7 regression] Performance drop after r241173 on avx512 target

2016-10-26 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

Bug ID: 78116
   Summary: [7 regression] Performance drop after r241173 on
avx512 target
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

I attached a simple test-case to reproduce the issue.
Before this revision the loop marked with label .L27 had 25 instructions, but
after it additional fills were added and the loop has 8 more instructions.
As a result we got a > 6% performance drop on an important benchmark.

[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target

2016-10-27 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #3 from Yuri Rumyantsev  ---
Created attachment 39910
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39910&action=edit
another test-case

Must be compiled with "-Ofast -fopenmp -funroll-loops -march=knl"

[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target

2016-10-27 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #2 from Yuri Rumyantsev  ---
We also found a performance drop on another important benchmark with the same
symptoms after r241170, namely the loop marked with .L18 has 12 more fills from
the stack. The test-case will be attached.

[Bug ipa/78268] [7 Regression] internal compiler error: Segmentation fault

2016-11-09 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78268

Yuri Rumyantsev  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ysrumyan at gmail dot com

--- Comment #5 from Yuri Rumyantsev  ---
We just got an ICE for 471.omnetpp on x86; the guilty revision is r241990.

[Bug tree-optimization/78348] [7 REGRESSION] 15% performance drop for coremark-pro/nnet-test after r242038

2016-11-14 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78348

--- Comment #1 from Yuri Rumyantsev  ---
Created attachment 40036
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40036&action=edit
test-case to reproduce

Must be compiled with -O3 option to reproduce.

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2016-11-14 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #4 from Yuri Rumyantsev  ---
Ping.
Do you have any progress on this?

Thanks.

[Bug tree-optimization/78348] New: [7 REGRESSION] 15% performance drop for coremark-pro/nnet-test after r242038

2016-11-14 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78348

Bug ID: 78348
   Summary: [7 REGRESSION] 15% performance drop for
coremark-pro/nnet-test after r242038
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

We noticed a huge (>15%) performance drop after the fix in the loop distribution
phase. Before the fix, distribution was not performed since the loop contains an
anti (write after read) dependence. Now distribution is performed and memmove &
memset built-ins are generated. We don't have a fast implementation of memmove on
Haswell, which leads to the performance regression. Note, however, that the
dependence analysis is very poor and does not detect a simple copy of one struct
field to another. I attached a simple test-case to reproduce this issue.
Note also that the fix to pg_add_dependence_edges is correct and must not be
removed.

[Bug tree-optimization/78496] New: Missed opportunities for jump threading

2016-11-23 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78496

Bug ID: 78496
   Summary: Missed opportunities for jump threading
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ysrumyan at gmail dot com
  Target Milestone: ---

Created attachment 40131
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40131&action=edit
test-case to reproduce, compile with -O3 option.

We noticed a huge performance drop on one important benchmark, caused by
hoisting and collecting of comparisons that participate in conditional branches.
Here are the comments provided by Richard on it:

Note this is a general issue with PRE which tends to
see partial redundancies when it can compute an expression to a
constant on one edge.  There is nothing wrong with that but the
particular example shows the lack of a cost model with respect
to register pressure (same applies to other GIMPLE optimization passes).

In this case we have a lot of expression anticipated from the same
blocks where on one incoming edge their value is constant.  Profitability
here really depends on the "distance" of the to be inserted PHI and
its use I guess.

We're missing quite some jump-threading here as well:

  :
  # x1_197 = PHI <x1_261(15), x1_435(123), x1_435(105)>
  # _407 = PHI <_16(15), _16(123), 0(105)>
  # aa1_410 = PHI <aa1_185(15), aa1_185(123), aa1_216(105)>
  # d1_413 = PHI <d1_191(15), d1_191(123), d1_432(105)>
  # w1_416 = PHI <w1_260(15), w1_260(123), 0(105)>
  # v1_377 = PHI <v1_558(15), v1_558(123), 0(105)>
  # oo1_371 = PHI <oo1_567(15), oo1_567(123), oo1_194(105)>
  # ss1_376 = PHI <ss1_576(15), ss1_576(123), ss1_192(105)>
  # r1_609 = PHI <r1_585(15), r1_585(123), r1_190(105)>
  # _612 = PHI <_596(15), _596(123), _188(105)>
  # out_ind_lsm.82_322 = PHI <out_ind_lsm.82_321(15),
out_ind_lsm.82_321(123), out_ind_lsm.82_532(105)>
  _549 = w1_416 <= 899;
  _548 = _407 > 839;
  _541 = _548 & _549;
  if (_541 != 0)
goto ;
  else
goto ;

here 105 -> 16 -> 124 (forwarder) -> 18 which would eventually
make PRE behave somewhat saner (avoiding the far distances).

The case appears with phicprop1 (or rather DOM, itself missing
a followup transform with respect to folding a degenerate constant
PHI plus the followup secondary threading opportunities).  The
backwards threader doesn't exploit the above opportunity though.
Our forward threaders (like DOM) do.  Unfortunately it requires
quite a few iterations to get all opportunities exploited...
(inserting 9 DOM/phi-only-cprop pass pairs "helps")

I suggest to open a bugreport for this.  Jeff may want to look at
the threading issue (I believe the backward threader _does_ iterate).

I attach a test-case to reproduce the issue.
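
The threading opportunity in the dump above can be checked by hand. On the edge coming from block 105 both _407 and w1_416 are the constant 0, so _548 is false and _541 folds to 0, which means the branch outcome is already known on that path. A minimal C sketch of the same condition, assuming integer operands (the function and parameter names are hypothetical):

```c
/* Mirrors the GIMPLE condition from the dump:
     _549 = w1_416 <= 899;
     _548 = _407 > 839;
     _541 = _548 & _549;  */
int
branch_cond_demo (int t407, int w1)
{
  int c549 = w1 <= 899;
  int c548 = t407 > 839;
  return c548 & c549;
}
```

With the values carried by the edge from block 105, `branch_cond_demo (0, 0)` is 0, so the "else" successor could be reached directly from that predecessor without evaluating the comparisons.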

[Bug tree-optimization/78348] [7 REGRESSION] 15% performance drop for coremark-pro/nnet-test after r242038

2016-11-15 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78348

--- Comment #5 from Yuri Rumyantsev  ---
Yes, I think so.

2016-11-15 14:49 GMT+03:00 rguenth at gcc dot gnu.org:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78348
>
> Richard Biener  changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|UNCONFIRMED                 |NEW
>    Last reconfirmed|                            |2016-11-15
>      Ever confirmed|0                           |1
>
> --- Comment #4 from Richard Biener  ---
>> The issue is that memcpy must be produced instead of memmove which does
>> not have optimized version for avx2 x86 and simply uses byte copy.
>
> I'd expected a if (! overlap) memcpy () else byte-copy at least.
>
> Note the loop distribution code doesn't try to be clever in choosing memcpy
> over memmove (using dependence analysis).  So improving loop distribution
> (adding a PKIND_MEMMOVE and conservatively using that from dependence
> analysis) is a possibility as well.  But we have
>
> (compute_affine_dependence
>   stmt_a: _2 = par.0_1->x2[i_19][j_20];
>   stmt_b: par.0_1->x1[i_19][j_20] = _2;
> (analyze_overlapping_iterations
>   (chrec_a = {0, +, 1}_2)
>   (chrec_b = {0, +, 1}_2)
>   (overlap_iterations_a = [0])
>   (overlap_iterations_b = [0]))
> (analyze_overlapping_iterations
>   (chrec_a = i_19)
>   (chrec_b = i_19)
>   (overlap_iterations_a = [0])
>   (overlap_iterations_b = [0]))
> (analyze_overlapping_iterations
>   (chrec_a = 33280)
>   (chrec_b = 12800)
> (analyze_ziv_subscript
> )
>   (overlap_iterations_a = no dependence)
>   (overlap_iterations_b = no dependence))
> ) -> no dependence
>
> so I think we could use memcpy for all no dependence cases?
>
> --
> You are receiving this mail because:
> You reported the bug.
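
The pattern in the dependence dump above, copying one array field of a struct into another field of the same object, can be sketched in C as follows. This is a minimal illustration and not the attached test-case: the struct layout, the dimensions, and all names ending in `_demo` are hypothetical. Since the two fields are distinct members they cannot overlap, which is the "no dependence" case where memcpy would be sufficient.

```c
#include <string.h>

#define ROWS_DEMO 8    /* hypothetical dimensions */
#define COLS_DEMO 16

struct par_demo
{
  float x1[ROWS_DEMO][COLS_DEMO];
  float x2[ROWS_DEMO][COLS_DEMO];
};

/* Element-wise copy of one field into another, the shape of
   stmt_a/stmt_b in the dump.  */
void
copy_loop_demo (struct par_demo *par)
{
  for (int i = 0; i < ROWS_DEMO; i++)
    for (int j = 0; j < COLS_DEMO; j++)
      par->x1[i][j] = par->x2[i][j];
}

/* x1 and x2 are distinct members of the same object, so source and
   destination never overlap and memcpy is valid here; memmove is
   only required when the regions may overlap.  */
void
copy_memcpy_demo (struct par_demo *par)
{
  memcpy (par->x1, par->x2, sizeof par->x1);
}
```

Both functions leave `x1` with the same contents; the difference is only in which library routine the compiler or programmer is allowed to pick.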
