[Bug tree-optimization/116785] [15 Regression] RAJAPerf REDUCE_SUM regresses with r15-792-gf0a02467bbc35a

2024-09-30 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116785

--- Comment #14 from kugan at gcc dot gnu.org ---
(In reply to Richard Biener from comment #13)
> Did it help?

Thanks for the quick fix. This commit recovers most of the lost performance.
Please note that current trunk seems to be broken for unrelated reasons, so I
tried this patch on an earlier working version, where it brought back the
performance.

[Bug tree-optimization/116785] [15 Regression] RAJAPerf REDUCE_SUM regresses with r15-792-gf0a02467bbc35a

2024-09-24 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116785

--- Comment #10 from kugan at gcc dot gnu.org ---
Created attachment 59186
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59186&action=edit
reduced test (second attempt)

Sorry about the test case. Here is another attempt at reducing.

[Bug tree-optimization/116785] RAJAPerf REDUCE_SUM regresses with commit g:f0a02467bbc35a478eb82f5a8a7e8870827b51fc

2024-09-19 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116785

--- Comment #2 from kugan at gcc dot gnu.org ---
Created attachment 59155
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59155&action=edit
creduce  reduced file

[Bug tree-optimization/116785] RAJAPerf REDUCE_SUM regresses with commit f0a02467bbc35a478eb82f5a8a7e8870827b51fc

2024-09-19 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116785

--- Comment #1 from kugan at gcc dot gnu.org ---
Created attachment 59154
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59154&action=edit
preprocessed file

[Bug tree-optimization/116785] New: RAJAPerf REDUCE_SUM regresses with commit f0a02467bbc35a478eb82f5a8a7e8870827b51fc

2024-09-19 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116785

Bug ID: 116785
   Summary: RAJAPerf REDUCE_SUM regresses with commit
f0a02467bbc35a478eb82f5a8a7e8870827b51fc
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kugan at gcc dot gnu.org
  Target Milestone: ---

Some of the loops in RAJAPerf are no longer vectorized with this change. This
results in a ~64% regression for this and some other kernels. The regression
can also be observed against GCC 11 (the only older version I tried).

g++ -Ofast -S CONVECTION3DPA-Seq.cpp.ii  -fopt-info-vec -fpermissive

shows:
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:80:31:
optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:80:31:
optimized:  loop versioned for vectorization because of possible aliasing
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:47:31:
optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:47:31:
optimized:  loop versioned for vectorization because of possible aliasing
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/tpl/RAJA/include/RAJA/policy/loop/launch.hpp:101:23:
optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/tpl/RAJA/include/RAJA/policy/loop/launch.hpp:101:23:
optimized:  loop versioned for vectorization because of possible aliasing
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/tpl/RAJA/include/RAJA/policy/loop/launch.hpp:101:23:
optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/tpl/RAJA/include/RAJA/policy/loop/launch.hpp:101:23:
optimized:  loop versioned for vectorization because of possible aliasing

With the patch reverted:
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:100:29:
optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:89:29:
optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:80:31:
optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:67:29:
optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:58:31:
optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:47:31:
optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/tpl/RAJA/include/RAJA/policy/loop/launch.hpp:101:23:
optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/tpl/RAJA/include/RAJA/policy/loop/launch.hpp:101:23:
optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/tpl/RAJA/include/RAJA/policy/loop/launch.hpp:101:23:
optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/tpl/RAJA/include/RAJA/policy/loop/launch.hpp:101:23:
optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/tpl/RAJA/include/RAJA/policy/loop/launch.hpp:101:23:
optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/tpl/RAJA/include/RAJA/policy/loop/launch.hpp:101:23:
optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:162:41:
optimized: basic block part vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:40:29:
optimized: basic block part vectorized using 16 byte vectors




g++ -v  
Using built-in specs.
COLLECT_GCC=/local/home/kvivekananda/install/bin/g++
COLLECT_LTO_WRAPPER=/local/home/kvivekananda/install/libexec/gcc/aarch64-unknown-linux-gnu/15.0.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: ../gcc/configure --disable-bootstrap --enable-multiarch=yes
--enable-languages=c,c++,fortran,lto --prefix=/local/home/kvivekananda/install
: (reconfigured) ../gcc/configure --disable-bootstrap --enable-multiarch=yes
--prefix=/local/home/kvivekananda/install --enable-languages=c,c++,fortran,lto
--no-create --no-recursion
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 15.0.0 20240917

[Bug target/115258] [14 Regression] register swaps for vector perm in some cases after r14-6290

2024-09-17 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258

kugan at gcc dot gnu.org changed:

   What|Removed |Added

 CC||kugan at gcc dot gnu.org

--- Comment #6 from kugan at gcc dot gnu.org ---
(In reply to GCC Commits from comment #3)
> The trunk branch has been updated by Richard Sandiford
> :
> 
> https://gcc.gnu.org/g:39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec
> 
> commit r15-906-g39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec
> Author: Richard Sandiford 
> Date:   Wed May 29 16:43:33 2024 +0100
> 
> aarch64: Split aarch64_combinev16qi before RA [PR115258]
> 
> Two-vector TBL instructions are fed by an aarch64_combinev16qi, whose
> purpose is to put the two input data vectors into consecutive registers.
> This aarch64_combinev16qi was then split after reload into individual
> moves (from the first input to the first half of the output, and from
> the second input to the second half of the output).
> 
> In the worst case, the RA might allocate things so that the destination
> of the aarch64_combinev16qi is the second input followed by the first
> input.  In that case, the split form of aarch64_combinev16qi uses three
> eors to swap the registers around.
> 
> This PR is about a test where this worst case occurred.  And given the
> insn description, that allocation doesn't seem unreasonable.
> 
> early-ra should (hopefully) mean that we're now better at allocating
> subregs of vector registers.  The upcoming RA subreg patches should
> improve things further.  The best fix for the PR therefore seems
> to be to split the combination before RA, so that the RA can see
> the underlying moves.
> 
> Perhaps it even makes sense to do this at expand time, avoiding the need
> for aarch64_combinev16qi entirely.  That deserves more experimentation
> though.
> 
> gcc/
> PR target/115258
> * config/aarch64/aarch64-simd.md (aarch64_combinev16qi): Allow
> the split before reload.
> * config/aarch64/aarch64.cc (aarch64_split_combinev16qi):
> Generalize
> into a form that handles pseudo registers.
> 
> gcc/testsuite/
> PR target/115258
> * gcc.target/aarch64/pr115258.c: New test.

This is causing a performance regression in some TSVC kernels and other code.
Here is an example:
https://godbolt.org/z/r91nYEEsP

We now get:
.L3:
add x3, x26, x0
add x2, x25, x0
add x3, x3, 65536
add x2, x2, 65536
sub x0, x0, #16
ldr q31, [x3, 62448]
mov v28.16b, v31.16b
mov v29.16b, v31.16b
tbl v31.16b, {v28.16b - v29.16b}, v30.16b
fadd v31.4s, v31.4s, v25.4s
mov v26.16b, v31.16b
mov v27.16b, v31.16b
tbl v31.16b, {v26.16b - v27.16b}, v30.16b
str q31, [x2, 62448]
cmp x0, x27
bne .L3

[Bug middle-end/116626] ICE while VLA vectorisation

2024-09-05 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116626

--- Comment #1 from kugan at gcc dot gnu.org ---
This looks like a duplicate of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116569

[Bug middle-end/116626] New: ICE while VLA vectorisation

2024-09-05 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116626

Bug ID: 116626
   Summary: ICE while  VLA vectorisation
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kugan at gcc dot gnu.org
  Target Milestone: ---

Created attachment 59057
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59057&action=edit
testcase

For partially reduced code, I am seeing:
t.cpp:350:12: internal compiler error: in to_constant, at poly-int.h:592
  350 |   void _M_run() { _M_func(); }
  |^~
0x38da7fb internal_error(char const*, ...)
../../gcc/gcc/diagnostic-global-context.cc:492
0x38b1e5f fancy_abort(char const*, int, char const*)
../../gcc/gcc/diagnostic.cc:1658
0xf747cb poly_int<2u, unsigned long>::to_constant() const
../../gcc/gcc/poly-int.h:592
0x20bc547 nunits_for_known_piecewise_op
../../gcc/gcc/tree-vect-generic.cc:96
0x20bcff3 expand_vector_piecewise
../../gcc/gcc/tree-vect-generic.cc:290
0x20c1cc7 expand_vector_operation
../../gcc/gcc/tree-vect-generic.cc:1257
0x20c7f37 expand_vector_operations_1
../../gcc/gcc/tree-vect-generic.cc:2366
0x20c8117 expand_vector_operations
../../gcc/gcc/tree-vect-generic.cc:2400
0x20c8417 execute
../../gcc/gcc/tree-vect-generic.cc:2497


compile command:  -O3 -mcpu=neoverse-v2 t.cpp  -std=gnu++20 

g++ -v
Using built-in specs.
COLLECT_GCC=/proj/grco/gcc/Linux_aarch64/upstream-main/latest/bin/g++
COLLECT_LTO_WRAPPER=/proj/grco/gcc/Linux_aarch64/upstream-main/20240905192606-a2e28b10/bin/../libexec/gcc/aarch64-unknown-linux-gnu/15.0.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: /var/jenkins/workspace/GCC_oss-main/configure
--enable-multiarch=yes --enable-languages=c,c++,fortran,lto
--prefix=/proj/grco/gcc/Linux_aarch64/oss-main/20240905192606-a2e28b10
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 15.0.0 20240906 (experimental) (GCC)

[Bug middle-end/116562] New: wrong cost of gather load preventing loop from vectored

2024-09-01 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116562

Bug ID: 116562
   Summary: wrong cost of gather load preventing loop from
vectored
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kugan at gcc dot gnu.org
  Target Milestone: ---

typedef int real_t;
extern __attribute__((aligned(64))) real_t a[32000],b[32000],c[32000],d[32000];

void s4117()
{
  for (int i = 0; i < 32000; i++) {
  a[i] = b[i] + c[i/2] * d[i];
  }
}
is not vectorized for AdvSIMD because of an incorrect cost calculation.

Compiler option used: cc1plus -Ofast -fdump-tree-vect-all -mcpu=neoverse-v2
--param=aarch64-autovec-preference=1

tt.c:6:21: note:  Cost model analysis:
  Vector inside of loop cost: 64
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar iteration cost: 15
  Scalar outside cost: 0
  Vector outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
tt.c:6:21: missed:  cost model: the vector iteration cost = 64 divided by the
scalar iteration cost = 15 is greater or equal to the vectorization factor = 4.
tt.c:6:21: missed:  not vectorized: vectorization not profitable.
tt.c:6:21: missed:  not vectorized: vector version will never be profitable.
tt.c:6:21: missed:  Loop costings may not be worthwhile.
tt.c:6:21: note:  * Analysis failed with vector mode V4SI
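For reference, the rejection comes down to one comparison: a vector iteration replaces VF scalar iterations, so here it has to cost less than 15 × 4 = 60, and 64 does not. A minimal C sketch of that check (vector_iteration_pays_off is a hypothetical name, not GCC's code, and it ignores prologue/epilogue costs, which are all 0 in this dump):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the check behind "the vector iteration cost ...
   divided by the scalar iteration cost ... is greater or equal to the
   vectorization factor": one vector iteration does the work of VF scalar
   iterations, so it must cost strictly less than VF of them.  */
static bool vector_iteration_pays_off(int vec_inside_cost,
                                      int scalar_iter_cost, int vf)
{
    assert(vf > 0 && scalar_iter_cost > 0);
    return vec_inside_cost < scalar_iter_cost * vf;
}
```

With these numbers the vector body would need to cost 59 or less to be accepted.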



We cost c[i/2] as four scalar loads plus one vector construct (on top of the
offset extractions). I think we should special-case this sort of gather load,
which is cheaper in practice?

  if (costing_p)
    {
      /* For emulated gathers N offset vector element
         offset add is consumed by the load).  */
      inside_cost = record_stmt_cost (cost_vec, const_nunits,
                                      vec_to_scalar, stmt_info,
                                      0, vect_body);
      /* N scalar loads plus gathering them into a
         vector.  */
      inside_cost
        = record_stmt_cost (cost_vec, const_nunits, scalar_load,
                            stmt_info, 0, vect_body);
      inside_cost
        = record_stmt_cost (cost_vec, 1, vec_construct,
                            stmt_info, 0, vect_body);
      continue;
    }
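The access c[i/2] is much more regular than a general gather. For a vector iteration starting at an even i, the four lanes read c[i/2], c[i/2], c[i/2+1], c[i/2+1], i.e. only two consecutive elements, each twice. In principle, a short contiguous load plus a lane-duplicating permute could therefore replace the four scalar loads and the construct. A small C sketch that just checks this index pattern, assuming VF = 4 (the helper names are hypothetical):

```c
#include <assert.h>

#define VF 4  /* assumed: four 32-bit lanes per 128-bit AdvSIMD vector */

/* For the vector iteration starting at scalar index i, record which
   element of c[] each lane needs for the access c[i/2].  */
static void c_half_lane_indices(int i, int idx[VF])
{
    assert(i >= 0 && i % VF == 0);  /* vector iterations start at multiples of VF */
    for (int lane = 0; lane < VF; lane++)
        idx[lane] = (i + lane) / 2;
}

/* Count the distinct elements of c[] one vector iteration touches.
   The indices are non-decreasing, so counting changes is enough.  */
static int distinct_c_elements(const int idx[VF])
{
    int n = 1;
    for (int lane = 1; lane < VF; lane++)
        if (idx[lane] != idx[lane - 1])
            n++;
    return n;
}
```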

[Bug tree-optimization/116528] New: Not vectoring TSVC s318 loop

2024-08-28 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116528

Bug ID: 116528
   Summary: Not vectoring TSVC s318 loop
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kugan at gcc dot gnu.org
  Target Milestone: ---

See:

typedef float real_t;
extern __attribute__((aligned(64))) real_t a[32000];


real_t not_working(struct args_t *func_args) {
  int k, index;
  int inc = 1;
  real_t max, chksum;
  k = 0;
  index = 0;
  max = (a[0]);
  k += inc;
  for (int i = 1; i < 32000; i++) {
    if (a[k] > max) {
      index = i;
      max = (a[k]);
    }
    k += inc;
  }
  return max + index + 1;
}

Also in https://godbolt.org/z/ra4h6ndKz

ifcvt dump is:

   [local count: 1063004408]:
  # k_16 = PHI 
  # index_17 = PHI 
  # max_19 = PHI <_29(7), max_10(15)>
  # ivtmp_15 = PHI 
  _1 = a[k_16];
  _13 = _1 > max_19;
  index_5 = _13 ? k_16 : index_17;
  _29 = MAX_EXPR <_1, max_19>;
  k_12 = k_16 + 1;
  ivtmp_14 = ivtmp_15 - 1;
  if (ivtmp_14 != 0)
goto ; [98.99%]
  else
goto ; [1.01%]

   [local count: 1052266995]:
  goto ; [100.00%]

   [local count: 10737416]:

Here the PHI max_19 = PHI <_29(7), max_10(15)> has two uses:
  _29 = MAX_EXPR <_1, max_19>;
and
  _13 = _1 > max_19;

As a result, vect_is_simple_reduction does not accept it as a simple
reduction. How can we support this?
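One way to see that the loop is still vectorizable: keep a separate (max, index) pair per lane, updated with the same strict '>' so each lane remembers its earliest maximum, and combine the lanes after the loop by taking the overall maximum and then the smallest index holding it. The C sketch below emulates four lanes with plain arrays. LANES, the function names and the n parameter are assumptions for illustration, NaNs are not handled, and it says nothing about how GCC would represent this internally.

```c
#include <assert.h>

#define LANES 4  /* assumed vector width: 4 x float */

/* Scalar reference: the loop from the report, with k == i.  */
static float argmax_scalar(const float *a, int n, int *out_index)
{
    assert(n >= 1);
    float max = a[0];
    int index = 0;
    for (int i = 1; i < n; i++) {
        if (a[i] > max) {
            index = i;
            max = a[i];
        }
    }
    *out_index = index;
    return max;
}

/* Lane-wise emulation: each lane tracks its own (max, index) pair.  */
static float argmax_lanes(const float *a, int n, int *out_index)
{
    assert(n >= 1);
    float vmax[LANES];
    int vidx[LANES];
    for (int l = 0; l < LANES; l++) {
        vmax[l] = a[0];
        vidx[l] = 0;
    }
    for (int i = 1; i < n; i++) {
        int l = (i - 1) % LANES;   /* lane that handles element i */
        if (a[i] > vmax[l]) {      /* strict '>' keeps the earliest index in the lane */
            vmax[l] = a[i];
            vidx[l] = i;
        }
    }
    /* Epilogue: overall maximum, then the smallest index among lanes holding it.  */
    float max = vmax[0];
    for (int l = 1; l < LANES; l++)
        if (vmax[l] > max)
            max = vmax[l];
    int index = n;
    for (int l = 0; l < LANES; l++)
        if (vmax[l] == max && vidx[l] < index)
            index = vidx[l];
    *out_index = index;
    return max;
}
```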

[Bug tree-optimization/116338] GCC is not vectoring TSVC s255 while clang can

2024-08-20 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116338

--- Comment #5 from kugan at gcc dot gnu.org ---
(In reply to Richard Biener from comment #4)
> You can try to see whether adding a SSA copy would make this supported, it
> seems not allowing a PHI is simply a missed feature.

We now fail in
 /* If this isn't a nested cycle or if the nested cycle reduction value
 is used ouside of the inner loop we cannot handle uses of the reduction
 value.  */
  if (nlatch_def_loop_uses > 1 || nphi_def_loop_uses > 1)

Even if I comment this check out, I see:
t1.c:16:25: note:   worklist: examine stmt: _22 = x_18 + y_19;
t1.c:16:25: note:   vect_is_simple_use: operand x_18 = PHI <_1(5), x_10(2)>,
type of def: unknown
t1.c:16:25: missed:   Unsupported pattern.
t1.c:10:6: missed:   not vectorized: unsupported use in stmt.
t1.c:16:25: missed:  unexpected pattern.
t1.c:16:25: note:  * Analysis failed with vector mode V4SF

Do we need to somehow mark both PHI stmts as part of the first-order
reduction?


   [local count: 1063004408]:
  # x_18 = PHI <_1(5), x_10(2)>
  # y_19 = PHI 
  # i_20 = PHI 
  # ivtmp_17 = PHI 
  _1 = b[i_20];
  _22 = x_18 + y_19;
  _3 = _1 + _22;
  _4 = _3 * 3.3304291534423828125e-1;
  a[i_20] = _4;
  i_13 = i_20 + 1;
  ivtmp_16 = ivtmp_17 - 1;
  if (ivtmp_16 != 0)
goto ; [98.99%]
  else
goto ; [1.01%]

   [local count: 1052266995]:
  goto ; [100.00%]

[Bug tree-optimization/116338] GCC is not vectoring TSVC s255 while clang can

2024-08-20 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116338

--- Comment #3 from kugan at gcc dot gnu.org ---
(In reply to Richard Biener from comment #2)
> The issue is the recurrence
> 
>[local count: 10737416]:
>   x_10 = b[31999];
>   y_11 = b[31998];
> 
>[local count: 1063004408]:
>   # x_18 = PHI <_1(5), x_10(2)>
>   # y_19 = PHI 
>   _1 = b[i_20];
> ..
> 
>[local count: 1052266995]:
>   goto ; [100.00%]
> 
> we handle some cases via vect_phi_first_order_recurrence_p, somebody needs
> to dig in why this one isn't (or can't be) handled with that mechanism.

  /* Ensure the loop latch definition is from within the loop.  */
  edge latch = loop_latch_edge (loop);
  tree ldef = PHI_ARG_DEF_FROM_EDGE (phi, latch);
  if (TREE_CODE (ldef) != SSA_NAME
  || SSA_NAME_IS_DEFAULT_DEF (ldef)
  || is_a  (SSA_NAME_DEF_STMT (ldef))
  || !flow_bb_inside_loop_p (loop, gimple_bb (SSA_NAME_DEF_STMT (ldef
return false;

(gdb) p debug_tree (ldef)
 
unit-size 
align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
0xf7a8b2a0 precision:32
pointer_to_this >
visited var 
def_stmt x_18 = PHI <_1(5), x_10(2)>
version:18>
$1 = void


That is, in this case the PHI argument along the loop latch edge (x_18) is
itself defined by a PHI stmt.

[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis

2024-08-13 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635

--- Comment #20 from kugan at gcc dot gnu.org ---
(In reply to Richard Sandiford from comment #19)
> (In reply to Richard Biener from comment #14)
> > Usually targets do have a limit on the actual length but I see
> > constant_upper_bound_with_limit doesn't query such.  But it would
> > be a more appropriate way to say there might be an actual target limit here?
> The discussion has moved on, but FWIW: this was a deliberate choice.
> The thinking at the time was that VLA code should be truly “agnostic”
> and not hard-code an upper limit.  Hard-coding a limit would be hard-coding
> an assumption that the architectural maximum would never increase in future.
> 
> (The main counterargument was that any uses of the .B form of TBL would
> break down for >256-byte vectors.  We hardly use such TBLs for autovec
> though, and could easily choose not to use them at all.)
> 
> That decision is 8 or 9 years old at this point, so it might seem overly
> dogmatic now.  Even so, I think we should have a strong reason to change
> tack.
> It shouldn't just be about trying to avoid poly_ints :)

Thanks. I have posted an RFC at
https://gcc.gnu.org/pipermail/gcc-patches/2024-August/659373.html

In addition to making loop->safelen a poly_int, I also changed apply_safelen
to:
+ unsigned int safelen;
+ if (loop->safelen.is_constant ())
+   safelen = loop->safelen.coeffs[0];
+ else
+   safelen = INT_MAX;

That is, in essence, safelen would be INT_MAX in these cases.
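To make the intended semantics concrete, here is a self-contained C model of that snippet. struct poly2 is a hypothetical stand-in for a two-coefficient poly_int (value = c0 + c1 * X), not GCC's class:

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for a two-coefficient poly_int:
   value = c0 + c1 * X, where X >= 0 is only known at run time.  */
struct poly2 { unsigned long c0, c1; };

static int poly2_is_constant(struct poly2 p)
{
    return p.c1 == 0;
}

/* Mirrors the snippet above: a compile-time-constant safelen is used
   as-is, while a length-dependent one is treated as INT_MAX ("no limit").  */
static unsigned int effective_safelen(struct poly2 safelen)
{
    if (poly2_is_constant(safelen)) {
        assert(safelen.c0 <= (unsigned long) INT_MAX);
        return (unsigned int) safelen.c0;
    }
    return INT_MAX;
}
```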

[Bug tree-optimization/116338] New: GCC is not vectoring TSVC s255 while clang can

2024-08-11 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116338

Bug ID: 116338
   Summary: GCC is not vectoring TSVC s255 while clang can
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kugan at gcc dot gnu.org
  Target Milestone: ---

reduced test case:

typedef float real_t;
extern __attribute__((aligned(64))) real_t a[32000], b[32000];

void s255()
{
    real_t x, y;
    x = b[32000 - 1];
    y = b[32000 - 2];
    for (int i = 0; i < 32000; i++) {
        a[i] = (b[i] + x + y) * (real_t).333;
        y = x;
        x = b[i];
    }
}

gcc is not able to vectorize the loop whereas clang can. See
https://godbolt.org/z/64Kxaahqr
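For what it is worth, x and y only carry b[i-1] and b[i-2] once the first two iterations are done, so the loop can be rewritten without loop-carried scalars by peeling those two iterations and reading shifted elements of b. A C sketch of that equivalence (the n parameter and function names are additions for testing; this is not a claim about what clang does internally):

```c
#include <assert.h>

/* The loop from the report, parameterised on n.  */
static void s255_recurrence(float *a, const float *b, int n)
{
    assert(n >= 2);
    float x = b[n - 1], y = b[n - 2];
    for (int i = 0; i < n; i++) {
        a[i] = (b[i] + x + y) * (float).333;
        y = x;
        x = b[i];
    }
}

/* Same computation with the recurrences replaced by shifted loads:
   for i >= 2, x == b[i-1] and y == b[i-2].  The first two iterations
   are peeled because they still see the initial values.  */
static void s255_shifted(float *a, const float *b, int n)
{
    assert(n >= 2);
    a[0] = (b[0] + b[n - 1] + b[n - 2]) * (float).333;
    a[1] = (b[1] + b[0] + b[n - 1]) * (float).333;
    for (int i = 2; i < n; i++)   /* no loop-carried scalars left */
        a[i] = (b[i] + b[i - 1] + b[i - 2]) * (float).333;
}
```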

gcc -v
Using built-in specs.
COLLECT_GCC=/home/kvivekananda/install/bin/gcc
COLLECT_LTO_WRAPPER=/home/kvivekananda/install/libexec/gcc/aarch64-unknown-linux-gnu/15.0.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: ../gcc_base/configure --prefix=/home/kvivekananda/install/
--enable-languages=c,c++,fortran,lto,objc
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 15.0.0 20240618 (experimental) (GCC)

[Bug middle-end/116337] New: Reverse iterated loops has redundant code compared to clang

2024-08-11 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116337

Bug ID: 116337
   Summary: Reverse iterated loops has redundant code compared to
clang
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kugan at gcc dot gnu.org
  Target Milestone: ---

For:

extern __attribute__((aligned(64))) int a[32000],b[32000];

void s1112(void)
{
for (int i = 32000 - 1; i >= 0; i--) {
a[i] = b[i] + 1;
}
}

For the loop, with -O3 -mcpu=neoverse-v2 --param=aarch64-autovec-preference=2,
gcc generates

.L3:
ld1w z31.s, p7/z, [x6, x0, lsl 2]
add w1, w1, w3
rev z31.s, z31.s
add z31.s, z31.s, #1
rev z31.s, z31.s
st1w z31.s, p7, [x2, x0, lsl 2]
sub x0, x0, x5
cmp w1, w4
bls .L3

clang generates with -O3 -mcpu=neoverse-v2 -fno-unroll-loops:
.LBB0_1:
ld1w { z0.s }, p0/z, [x14, x11, lsl #2]
add z0.s, z0.s, #1
st1w { z0.s }, p0, [x13, x11, lsl #2]
decw x11
cmn x12, x11
b.ne .LBB0_1


This seems to come from the VMAT_CONTIGUOUS_REVERSE memory_access_type and the
resulting VEC_PERM_EXPRs.

   [local count: 1063004408]:
  # i_10 = PHI 
  # ivtmp_9 = PHI 
  # vectp_b.4_8 = PHI  [(void *)&b +
127984B](2)>
  # vectp_a.9_19 = PHI  [(void *)&a +
127984B](2)>
  # ivtmp_23 = PHI 
  vect__1.6_14 = MEM  [(int *)vectp_b.4_8];
  vect__1.7_15 = VEC_PERM_EXPR ;
  _1 = b[i_10];
  vect__2.8_17 = vect__1.7_15 + { 1, 1, 1, 1 };
  _2 = _1 + 1;
  vect__2.11_21 = VEC_PERM_EXPR ;
  MEM  [(int *)vectp_a.9_19] = vect__2.11_21;
  i_7 = i_10 + -1;
  ivtmp_4 = ivtmp_9 - 1;
  vectp_b.4_13 = vectp_b.4_8 + 18446744073709551600;
  vectp_a.9_20 = vectp_a.9_19 + 18446744073709551600;
  ivtmp_24 = ivtmp_23 + 1;
  if (ivtmp_24 < 8000)
goto ; [98.99%]
  else
goto ; [1.01%]

   [local count: 1052266995]:
  goto ; [100.00%]
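Since the addition is purely lane-wise, reversing the lanes before it and reversing them again after it cancel out, which is consistent with the clang loop containing no rev at all. A scalar C model of one vector, assuming 4 lanes to match the V4SI types in the dump (the helper names are hypothetical):

```c
#include <assert.h>

#define NLANES 4  /* assumed: matches the V4SI vectors in the dump */

static_assert(NLANES > 0, "need at least one lane");

/* Reverse the lane order, like the reversing VEC_PERM_EXPRs.  */
static void reverse_lanes(int v[NLANES])
{
    for (int l = 0; l < NLANES / 2; l++) {
        int t = v[l];
        v[l] = v[NLANES - 1 - l];
        v[NLANES - 1 - l] = t;
    }
}

/* A lane-wise operation: every lane is updated independently.  */
static void add_one(int v[NLANES])
{
    for (int l = 0; l < NLANES; l++)
        v[l] += 1;
}

/* What the vectorized body does per vector: reverse after the load,
   add, then reverse again before the store.  */
static void reverse_add_reverse(int v[NLANES])
{
    reverse_lanes(v);
    add_one(v);
    reverse_lanes(v);
}
```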


gcc -v
Using built-in specs.
COLLECT_GCC=/home/kvivekananda/install/bin/gcc
COLLECT_LTO_WRAPPER=/home/kvivekananda/install/libexec/gcc/aarch64-unknown-linux-gnu/15.0.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: ../gcc_base/configure --prefix=/home/kvivekananda/install/
--enable-languages=c,c++,fortran,lto,objc
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 15.0.0 20240618 (experimental) (GCC)

[Bug tree-optimization/115450] [15 Regression] cpu2017 502.gcc runtime miscompute on aarch64 with SVE since r15-1006-gd93353e6423eca

2024-06-16 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115450

--- Comment #2 from kugan at gcc dot gnu.org ---
(In reply to Andrew Pinski from comment #1)
> >[r15-1006-gd93353e6423eca] Do single-lane SLP discovery for reductions
> 
> 
> Interesting because PR 115256 bisect it to an earlier patch.
I believe this is a new issue.

[Bug tree-optimization/115450] New: cpu2017 502.gcc runtime miscompute

2024-06-11 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115450

Bug ID: 115450
   Summary: cpu2017 502.gcc runtime miscompute
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kugan at gcc dot gnu.org
  Target Milestone: ---

502.gcc is miscompiled on aarch64 with -O3 -Wl,-z,muldefs -lm
-fallow-argument-mismatch -fpermissive -fstack-arrays -flto
-Wl,--sort-section=name -fno-strict-aliasing -fgnu89-inline -march=native
-mcpu=neoverse-v2 -msve-vector-bits=128

gcc -v
Using built-in specs.
COLLECT_GCC=/home/kvivekananda/install_base/bin/gcc
COLLECT_LTO_WRAPPER=/home/kvivekananda/install_base/libexec/gcc/aarch64-unknown-linux-gnu/15.0.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: ../gcc_base/configure --enable-multiarch=yes
--enable-languages=c,c++,fortran,lto --disable-bootstrap
--prefix=/home/kvivekananda/install_base : (reconfigured) ../gcc_base/configure
--enable-multiarch=yes --disable-bootstrap
--prefix=/home/kvivekananda/install_base --enable-languages=c,c++,fortran,lto
--no-create --no-recursion
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 15.0.0 20240607 (experimental) (GCC) 


Bisect points to:
[d93353e6423ecaaae9fa47d0935caafd9abfe4de] Do single-lane SLP discovery for
reductions

[Bug tree-optimization/115383] [15 Regression] ICE with TCVC_2 build since r15-1053-g28edeb1409a7b8

2024-06-07 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115383

--- Comment #6 from kugan at gcc dot gnu.org ---
(In reply to kugan from comment #5)
> (In reply to Richard Biener from comment #4)
> > Created attachment 58378 [details]
> > patch
> > 
> > I'm testing this, but I do not have hardware to test correctness (and qemu
> > not set up).
> 
> Thanks. I will test this on aarch64.

Bootstrap and regression testing pass. TSVC_2 also builds without any issues.

[Bug tree-optimization/115383] [15 Regression] ICE with TCVC_2 build since r15-1053-g28edeb1409a7b8

2024-06-07 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115383

--- Comment #5 from kugan at gcc dot gnu.org ---
(In reply to Richard Biener from comment #4)
> Created attachment 58378 [details]
> patch
> 
> I'm testing this, but I do not have hardware to test correctness (and qemu
> not set up).

Thanks. I will test this on aarch64.

[Bug tree-optimization/115383] New: ICE with TCVC_2 build

2024-06-07 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115383

Bug ID: 115383
   Summary: ICE with TCVC_2 build
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kugan at gcc dot gnu.org
  Target Milestone: ---

The patch "[PATCH 1/4] Relax COND_EXPR reduction vectorization SLP restriction"
seems to cause an ICE while building TSVC_2.

Reduced test:

cat tsvc_vec.i
void dummy();
void s331() {
  int j;
  for (int i; i; i++)
if ((float)i < 00.)
  j = i;
  dummy(j);
}


gcc options used:
gcc  -std=c99 -O3 -march=native -flto -Wl,--sort-section=name -mcpu=neoverse-v2
-msve-vector-bits=128
gcc -v:
Using built-in specs.
COLLECT_GCC=/proj/grco/gcc/Linux_aarch64/upstream-main/latest/bin/gcc
COLLECT_LTO_WRAPPER=/proj/grco/gcc/Linux_aarch64/upstream-main/20240606024711346f33e2/libexec/gcc/aarch64-unknown-linux-gnu/15.0.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: /var/jenkins/workspace/GCC_Nightly/configure
--enable-multiarch=yes --enable-languages=c,c++,fortran,lto
--prefix=/proj/grco/gcc/Linux_aarch64/upstream-main/20240606024711346f33e2
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 15.0.0 20240606 (experimental) (GCC)

[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis

2024-04-15 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635

--- Comment #18 from kugan at gcc dot gnu.org ---
Also, can we set safelen to INT_MAX when no explicit safelen clause is
specified in OpenMP? Something like:

--- a/gcc/omp-low.cc
+++ b/gcc/omp-low.cc
@@ -6975,14 +6975,11 @@ lower_rec_input_clauses (tree clauses, gimple_seq *ilist, gimple_seq *dlist,
 {
   tree c = omp_find_clause (gimple_omp_for_clauses (ctx->stmt),
OMP_CLAUSE_SAFELEN);
-  poly_uint64 safe_len;
-  if (c == NULL_TREE
- || (poly_int_tree_p (OMP_CLAUSE_SAFELEN_EXPR (c), &safe_len)
- && maybe_gt (safe_len, sctx.max_vf)))
+  if (c == NULL_TREE)
{
  c = build_omp_clause (UNKNOWN_LOCATION, OMP_CLAUSE_SAFELEN);
  OMP_CLAUSE_SAFELEN_EXPR (c) = build_int_cst (integer_type_node,
-  sctx.max_vf);
+  INT_MAX);
  OMP_CLAUSE_CHAIN (c) = gimple_omp_for_clauses (ctx->stmt);
  gimple_omp_for_set_clauses (ctx->stmt, c);
}

[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis

2024-04-15 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635

--- Comment #12 from kugan at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #11)
> (In reply to kugan from comment #9)
> > Looking at the options, looks to me that making loop->safelen a poly_in is
> > the way to go. (In reply to Jakub Jelinek from comment #4)
> > > The OpenMP safelen clause argument is a scalar integer, so using poly_int
> > > for something that must be an int doesn't make sense.
> > > Though, the above testcase actually doesn't use safelen clause, so safelen
> > > is there effectively infinity.
> > Thanks. I was looking at this to see if there is a way to handle this
> > differently. Looks to me that making loop->safelen a poly_int is the way to
> > handle at least the case when omp safelen clause is not provided.
> 
> Why?
> Then it just is INT_MAX value, which is a magic value that says that it is
> infinity.
> No need to say it is a poly_int infinity.

For this test case, omp_max_vf gets [16, 16] from the backend, which then
becomes 16. If we keep it as a poly_int, would it pass the
maybe_lt (max_vf, min_vf) check after applying safelen?
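To illustrate why the truncation matters, here is a small C model of maybe_lt on two-coefficient values c0 + c1 * X with X >= 0. struct poly2 and the helpers are hypothetical, X is taken as VL/128 - 1 (the AArch64 SVE convention, as I understand it), and min_vf = [4, 4] assumes 32-bit elements:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a two-coefficient poly_int: c0 + c1 * X,
   where X >= 0 is only known at run time.  */
struct poly2 { unsigned long c0, c1; };

/* Value for a concrete SVE vector length, taking X = VL/128 - 1.  */
static unsigned long poly2_eval(struct poly2 p, unsigned vl_bits)
{
    assert(vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0);
    return p.c0 + p.c1 * (vl_bits / 128 - 1);
}

/* True if a < b for at least one X >= 0.  Because X is non-negative,
   that happens exactly when some coefficient of a is smaller than the
   corresponding coefficient of b.  */
static bool poly2_maybe_lt(struct poly2 a, struct poly2 b)
{
    return a.c0 < b.c0 || a.c1 < b.c1;
}
```

Under these assumptions the truncated 16 may be smaller than [4, 4] (for vector lengths above 512 bits), whereas [16, 16] never is.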

[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis

2024-04-15 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635

--- Comment #10 from kugan at gcc dot gnu.org ---
Created attachment 57946
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57946&action=edit
patch

patch to make loop->safelen a poly_int

[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis

2024-04-15 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635

--- Comment #9 from kugan at gcc dot gnu.org ---
Looking at the options, it looks to me that making loop->safelen a poly_int is
the way to go.

(In reply to Jakub Jelinek from comment #4)
> The OpenMP safelen clause argument is a scalar integer, so using poly_int
> for something that must be an int doesn't make sense.
> Though, the above testcase actually doesn't use safelen clause, so safelen
> is there effectively infinity.
Thanks. I was looking at this to see if there is a way to handle it
differently. It looks to me that making loop->safelen a poly_int is the way to
handle at least the case where the OpenMP safelen clause is not provided. I am
interested in looking into this. Any suggestions? Here is a completely untested
diff that makes loop->safelen a poly_int.

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2024-04-09 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 114653, which changed state.

Bug 114653 Summary: Not vectorizing the loop with openmp reduction.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114653

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |DUPLICATE

[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis

2024-04-09 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635

kugan at gcc dot gnu.org changed:

   What|Removed |Added

 CC||kugan at gcc dot gnu.org

--- Comment #8 from kugan at gcc dot gnu.org ---
*** Bug 114653 has been marked as a duplicate of this bug. ***

[Bug middle-end/114653] Not vectorizing the loop with openmp reduction.

2024-04-09 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114653

kugan at gcc dot gnu.org changed:

   What|Removed |Added

 Resolution|--- |DUPLICATE
 Status|UNCONFIRMED |RESOLVED

--- Comment #6 from kugan at gcc dot gnu.org ---
Duplicate

*** This bug has been marked as a duplicate of bug 114635 ***

[Bug middle-end/114653] Not vectorizing the loop with openmp reduction.

2024-04-09 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114653

--- Comment #5 from kugan at gcc dot gnu.org ---
For the DDR between:
 ref_a:
_57 = D.4803[_20];
  ref_b:
D.4803[_20] = _ifc__174;

we get DDR_ARE_DEPENDENT (ddr) == chrec_dont_know, hence apply_safelen () is
used.

[Bug middle-end/114653] Not vectorizing the loop with openmp reduction.

2024-04-09 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114653

--- Comment #4 from kugan at gcc dot gnu.org ---
This particular loop has loop->safelen set to 16. Does this mean it can never
be vectorized with VLA modes?

[Bug middle-end/114653] Not vectorizing the loop with openmp reduction.

2024-04-09 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114653

--- Comment #3 from kugan at gcc dot gnu.org ---
For SVE mode in vect_analyze_loop_2, we have

(gdb) p min_vf
$15 = {coeffs = {4, 4}}
(gdb) p max_vf
$16 = 16

Thus maybe_lt (max_vf, min_vf) is true (min_vf = 4 + 4x can exceed 16 for a
large enough runtime vector length), which results in "bad data dependence".

[Bug middle-end/114653] Not vectorizing the loop with openmp reduction.

2024-04-09 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114653

--- Comment #2 from kugan at gcc dot gnu.org ---
Thanks. I see the following in the log:
test.cpp:33:53: missed:   not vectorized: relevant stmt not supported: _54 =
.MASK_LOAD (_53, 32B, _171);
test.cpp:22:19: missed:  bad operation or unsupported loop bound.
test.cpp:22:19: note:  * Analysis  failed with vector mode V4SF


test.cpp:22:19: note:   === vect_analyze_data_ref_dependences ===
test.cpp:22:19: missed:  bad data dependence.
test.cpp:22:19: note:  * Analysis  failed with vector mode VNx16QI

test.cpp:33:53: missed:   not vectorized: relevant stmt not supported: _54 =
.MASK_LOAD (_53, 32B, _171);
test.cpp:22:19: missed:  bad operation or unsupported loop bound.
test.cpp:22:19: note:  * Analysis  failed with vector mode V8QI

test.cpp:22:19: note:   === vect_analyze_data_ref_dependences ===
test.cpp:22:19: missed:  bad data dependence.
test.cpp:22:19: note:  * Analysis  failed with vector mode VNx8QI

test.cpp:33:53: missed:   not vectorized: relevant stmt not supported: _54 =
.MASK_LOAD (_53, 32B, _171);
test.cpp:22:19: missed:  bad operation or unsupported loop bound.
test.cpp:22:19: note:  * Analysis  failed with vector mode V4HI

test.cpp:22:19: note:   === vect_analyze_data_ref_dependences ===
test.cpp:22:19: missed:  bad data dependence.
test.cpp:22:19: note:  * Analysis  failed with vector mode VNx4QI

test.cpp:33:53: missed:   not vectorized: relevant stmt not supported: _54 =
.MASK_LOAD (_53, 32B, _171);
test.cpp:22:19: missed:  bad operation or unsupported loop bound.
test.cpp:22:19: note:  * Analysis  failed with vector mode V2SI

test.cpp:22:19: note:   worklist: examine stmt: _57 = D.4803[_20];
test.cpp:22:19: note:   === vect_analyze_data_ref_dependences ===
test.cpp:22:19: missed:  bad data dependence.
test.cpp:22:19: note:  * Analysis  failed with vector mode VNx2QI
test.cpp:22:19: missed: couldn't vectorize loop
test.cpp:22:19: missed: bad data dependence.

[Bug middle-end/114653] New: Not vectoring the loop with openmp reduction.

2024-04-09 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114653

Bug ID: 114653
   Summary: Not vectoring the loop with openmp reduction.
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kugan at gcc dot gnu.org
  Target Milestone: ---

Created attachment 57910
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57910&action=edit
testcase

The main loop in the attached testcase is not vectorized with -fopenmp, but it
is vectorized with -fopenmp-simd.

With -fopenmp, the reduction variables lax, lay and laz are assigned to an
array, and the data reference analysis for this array seems to fail. See:

offset from base address: (ssizetype) ((sizetype) _20 * 4)
constant offset from base address: 0
step: 0
base alignment: 16
base misalignment: 0
offset alignment: 4
step alignment: 128
base_object: D.4806[_20]
Creating dr for D.4808[_20]
analyze_innermost: Applying pattern match.pd:219, generic-match-1.cc:3190
test.cpp:37:9: missed:  failed: evolution of offset is not affine.


command used: 
 test.cpp -Ofast -fopenmp -mcpu=neoverse-v2


gcc -v:
Using built-in specs.
COLLECT_GCC=/home/kvivekananda/install/bin/gcc
COLLECT_LTO_WRAPPER=/home/kvivekananda/install/libexec/gcc/aarch64-unknown-linux-gnu/14.0.1/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: ../gcc/configure --enable-multiarch=yes
--enable-languages=c,c++,fortran,lto --disable-bootstrap
--prefix=/home/kvivekananda/install
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 14.0.1 20240314 (experimental) (GCC)
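The attached testcase is not reproduced here. As a rough illustration only, below is a hypothetical kernel with the same kind of three-accumulator OpenMP reduction (lax, lay, laz). The function name and signature are invented, and this reduced form has not been checked to reproduce the missed vectorization.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: three float accumulators reduced over n
   elements.  With -fopenmp the pragma creates a parallel region whose
   per-thread loops are simd loops; with -fopenmp-simd only the simd
   part is honoured; without either flag the pragma is ignored and the
   loop runs serially.  */
void accumulate (const float *ax, const float *ay, const float *az,
                 size_t n, float out[3])
{
  assert (out != NULL);
  float lax = 0.0f, lay = 0.0f, laz = 0.0f;

#pragma omp parallel for simd reduction(+ : lax, lay, laz)
  for (size_t i = 0; i < n; i++)
    {
      lax += ax[i];
      lay += ay[i];
      laz += az[i];
    }

  out[0] = lax;
  out[1] = lay;
  out[2] = laz;
}
```

Compiling it once with -Ofast -fopenmp and once with -Ofast -fopenmp-simd, and comparing the -fopt-info-vec-missed output, is one way to check whether it shows the same behaviour.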

[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e

2024-03-09 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683

--- Comment #5 from kugan at gcc dot gnu.org ---
Both -O3 -fno-tree-vectorize and -O3 -fno-tree-vrp work. I looked at the evrp
dump and it is not doing anything suspicious. It looks like the range_info
usage in the vectoriser is causing the problem.

[Bug libgomp/113698] GNU OpenMP with OMP_PROC_BIND alters thread affinity in a way that negatively affects performance

2024-02-09 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113698

--- Comment #4 from kugan at gcc dot gnu.org ---
Thanks for looking into this. The main reason we were seeing the performance
issue turned out to be a glibc malloc issue; see
https://sourceware.org/bugzilla/show_bug.cgi?id=30945

[Bug libgomp/113698] New: GNU OpenMP with OMP_PROC_BIND alters thread affinity in a way that negatively affects performance

2024-01-31 Thread kugan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113698

Bug ID: 113698
   Summary: GNU OpenMP with OMP_PROC_BIND alters thread affinity
in a way that negatively affects performance
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libgomp
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kugan at gcc dot gnu.org
CC: jakub at gcc dot gnu.org
  Target Milestone: ---

Created attachment 57275
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57275&action=edit
testcase

When OMP_PROC_BIND=true, it seems libgomp sets the affinity even before main()
starts. In particular, the main thread gets affinity 0x1 (i.e. it is pinned to
the first core). For the attached testcase, I get:

$ OMP_NUM_THREADS=72 ./a.out
[main thread affinity right after main()]. tid:ae511020
aff:...
duration: 402.949 msec

$ OMP_PROC_BIND=true OMP_NUM_THREADS=72 ./a.out
[main thread affinity right after main()]. tid:fffdded50020
aff:...0001
duration: 7879.59 msec

$ OMP_PROC_BIND=true OMP_NUM_THREADS=72 ./a.out
[main thread affinity right after main()]. tid:ae54c020
aff:...0001
duration: 311219 msec

Compiler options used:
gcc -O0 -fopenmp repro.c

gcc -v:


Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/11/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
11.3.0-1ubuntu1~22.04' --with-bugurl=file:///usr/share/doc/gcc-11/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-11
--program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-libquadmath --disable-libquadmath-support --enable-plugin
--enable-default-pie --with-system-zlib --enable-libphobos-checking=release
--with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch
--enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release
--build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
--with-build-config=bootstrap-lto-lean --enable-link-serialization=2
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04)
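The attached testcase prints the main thread's affinity right after main() starts. Below is a minimal sketch of that kind of check, assuming Linux and glibc (sched_getaffinity and the CPU_* macros are glibc extensions that need _GNU_SOURCE). The helper name is hypothetical and is not taken from the attachment.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Hypothetical helper: returns how many CPUs the calling thread is
   allowed to run on, or -1 on error.  Calling it at the very start of
   main() shows whether the affinity mask was already narrowed before
   user code ran (the "aff:...0001" case above is a single CPU).  */
int current_thread_cpu_count (void)
{
  cpu_set_t set;
  CPU_ZERO (&set);

  /* For sched_getaffinity, a pid of 0 means the calling thread.  */
  if (sched_getaffinity (0, sizeof set, &set) != 0)
    return -1;

  /* A successful call never reports an empty mask.  */
  assert (CPU_COUNT (&set) > 0);
  return CPU_COUNT (&set);
}
```

Building with -fopenmp and running with and without OMP_PROC_BIND=true would show whether the count drops to 1 before any parallel region is entered.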

[Bug driver/47785] GCC with -flto does not pass -Wa options to the assembler

2019-10-21 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47785

kugan at gcc dot gnu.org changed:

   What|Removed |Added

 CC||kugan at gcc dot gnu.org

--- Comment #14 from kugan at gcc dot gnu.org ---
A patch for this is posted at
https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01471.html

[Bug ipa/91468] Suspicious codes in ipa-prop.c and ipa-cp.c

2019-08-26 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91468

kugan at gcc dot gnu.org changed:

   What|Removed |Added

 CC||kugan at gcc dot gnu.org

--- Comment #2 from kugan at gcc dot gnu.org ---
(In reply to Martin Jambor from comment #1)
> (In reply to Feng Xue from comment #0)

> > 
> > In function update_jump_functions_after_inlining(),
> > 
> >   if (dst->type == IPA_JF_ANCESTOR)
> > {
> >   ..
> > 
> >   if (src->type == IPA_JF_PASS_THROUGH
> >   && src->value.pass_through.operation == NOP_EXPR)
> > {
> >..
> > }
> >   else if (src->type == IPA_JF_PASS_THROUGH
> >&& TREE_CODE_CLASS (src->value.pass_through.operation) == 
> > tcc_unary)
> > {
> >   dst->value.ancestor.formal_id = src->value.pass_through.formal_id;
> >   dst->value.ancestor.agg_preserved = false;
> > }
> >   ..   
> > }
> > 
> > If we suppose pass_through operation is "negate_expr" (while it is not a
> > reasonable operation on pointer type), the code might be incorrect. It's
> > better to specify expected unary operations here.
> 
> Kugan, you added this in 2016 and unfortunately I think it is wrong.
> Are there any unary operations we could possibly want to handle?
> In any event, the information that there was an arithmetic function in
> the path of the parameter would be completely lost if the code ever
> executed.  (Which I don't think it ever does, I think it would take
> crazy code that employs LTO to pass an integer to a pointer parameter
> to trigger).
> 
> So I plan to remove the whole if.
> 

Yes, I think this is a mistake and should go. Thanks for doing that.

[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2

2019-06-17 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834

--- Comment #21 from kugan at gcc dot gnu.org ---
(In reply to Christophe Lyon from comment #20)
> Hi Kugan,
> 
> The new test fails with -mabi=ilp32:
> FAIL: gcc.target/aarch64/pr88834.c scan-assembler-times \\tld2w\\t{z[0-9]+.s
> - z[0-9]+.s}, p[0-7]/z, \\[x[0-9]+, x[0-9]+, lsl 2\\]\\n 2
> FAIL: gcc.target/aarch64/pr88834.c scan-assembler-times \\tst2w\\t{z[0-9]+.s
> - z[0-9]+.s}, p[0-7], \\[x[0-9]+, x[0-9]+, lsl 2\\]\\n 1

Thanks Christophe. In the back-end, when we use ILP32, we don't accept SImode
operations like:

(plus:SI (mult:SI (reg:SI 91)
(const_int 4 [0x4]))
(reg:SI 90))

whereas we would accept the Pmode form. My question is: should we care about
ILP32 for SVE? If so, we need to fix this. Otherwise, we can run the test only
for LP64.

[Bug target/88838] [SVE] Use 32-bit WHILELO in LP64 mode

2019-06-12 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88838

--- Comment #6 from kugan at gcc dot gnu.org ---
Author: kugan
Date: Thu Jun 13 03:34:28 2019
New Revision: 272233

URL: https://gcc.gnu.org/viewcvs?rev=272233&root=gcc&view=rev
Log:

gcc/ChangeLog:

2019-06-13  Kugan Vivekanandarajah  

PR target/88838
* tree-vect-loop-manip.c (vect_set_loop_masks_directly): If the
compare_type is not with Pmode size, we will create an IV with
Pmode size with truncated use (i.e. converted to the correct type).
* tree-vect-loop.c (vect_verify_full_masking): Find IV type.
(vect_iv_limit_for_full_masking): New. Factored out of
vect_set_loop_condition_masked.
* tree-vectorizer.h (LOOP_VINFO_MASK_IV_TYPE): New.
(vect_iv_limit_for_full_masking): Declare.

gcc/testsuite/ChangeLog:

2019-06-13  Kugan Vivekanandarajah  

PR target/88838
* gcc.target/aarch64/pr88838.c: New test.
* gcc.target/aarch64/sve/while_1.c: Adjust.

Added:
trunk/gcc/testsuite/gcc.target/aarch64/pr88838.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.target/aarch64/sve/while_1.c
trunk/gcc/tree-vect-loop-manip.c
trunk/gcc/tree-vect-loop.c
trunk/gcc/tree-vectorizer.h

[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2

2019-06-12 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834

--- Comment #19 from kugan at gcc dot gnu.org ---
Author: kugan
Date: Thu Jun 13 03:18:54 2019
New Revision: 272232

URL: https://gcc.gnu.org/viewcvs?rev=272232&root=gcc&view=rev
Log:

gcc/ChangeLog:

2019-06-13  Kugan Vivekanandarajah  

PR target/88834
* tree-ssa-loop-ivopts.c (get_mem_type_for_internal_fn): Handle
IFN_MASK_LOAD_LANES and IFN_MASK_STORE_LANES.
(get_alias_ptr_type_for_ptr_address): Likewise.
(add_iv_candidate_for_use): Add scaled index candidate if useful.
* tree-ssa-address.c (preferred_mem_scale_factor): New.
* config/aarch64/aarch64.c (aarch64_classify_address): Relax
allow_reg_index_p.

gcc/testsuite/ChangeLog:

2019-06-13  Kugan Vivekanandarajah  

PR target/88834
* gcc.target/aarch64/pr88834.c: New test.
* gcc.target/aarch64/sve/struct_vect_1.c: Adjust.
* gcc.target/aarch64/sve/struct_vect_14.c: Likewise.
* gcc.target/aarch64/sve/struct_vect_15.c: Likewise.
* gcc.target/aarch64/sve/struct_vect_16.c: Likewise.
* gcc.target/aarch64/sve/struct_vect_17.c: Likewise.
* gcc.target/aarch64/sve/struct_vect_7.c: Likewise.


Added:
trunk/gcc/testsuite/gcc.target/aarch64/pr88834.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/aarch64/aarch64.c
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_1.c
trunk/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_14.c
trunk/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_15.c
trunk/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_16.c
trunk/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_17.c
trunk/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_7.c
trunk/gcc/tree-ssa-address.c
trunk/gcc/tree-ssa-address.h
trunk/gcc/tree-ssa-loop-ivopts.c

[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2

2019-04-09 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834

--- Comment #17 from kugan at gcc dot gnu.org ---
(In reply to Wilco from comment #16)
> (In reply to kugan from comment #15)
> > (In reply to Wilco from comment #11)
> > > There is also something odd with the way the loop iterates, this doesn't
> > > look right:
> > > 
> > > whilelo p0.s, x3, x4
> > > incw    x3
> > > ptest   p1, p0.b
> > > bne .L3
> > 
> > I am not sure I understand this. I tried with qemu using an execution
> > testcase and It seems to work.
> > 
> > whilelo p0.s, x4, x5
> > incw    x4
> > ptest   p1, p0.b
> > bne .L3
> > In my case I have the above (register allocation difference only) incw is
> > correct considering two vector word registers? Am I missing something here?
> 
> I'm talking about the completely redundant ptest, where does that come from?

That is tracked in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88836

[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2

2019-04-08 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834

--- Comment #15 from kugan at gcc dot gnu.org ---
(In reply to Wilco from comment #11)
> There is also something odd with the way the loop iterates, this doesn't
> look right:
> 
> whilelo p0.s, x3, x4
> incw    x3
> ptest   p1, p0.b
> bne .L3

I am not sure I understand this. I tried with qemu using an execution testcase,
and it seems to work.

whilelo p0.s, x4, x5
incw    x4
ptest   p1, p0.b
bne .L3

In my case I get the above (the only difference is register allocation). Is
incw correct here, considering two vector word registers? Am I missing
something here?

[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2

2019-04-08 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834

--- Comment #14 from kugan at gcc dot gnu.org ---
Created attachment 46104
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46104&action=edit
testcase

[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2

2019-04-08 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834

kugan at gcc dot gnu.org changed:

   What|Removed |Added

  Attachment #46040|0   |1
is obsolete||

--- Comment #13 from kugan at gcc dot gnu.org ---
Created attachment 46103
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46103&action=edit
ivopt changes alone

[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2

2019-04-08 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834

--- Comment #12 from kugan at gcc dot gnu.org ---
(In reply to rsand...@gcc.gnu.org from comment #10)
> (In reply to kugan from comment #9)
> > Created attachment 46040 [details]
> > patch
> 
> Wasn't sure whether this patch was WIP or the final version
> for review, but we need to do something more generic than
> dividing by 4.  I think the test will still fail with "int"
> changed to "short" for example.
> 
> I also don't think the new candidate should be tied to the
> mask/load store functions.  Maybe one approach would be to
> check when adding a zero-based candidate for a use in:
> 
>   /* Record common candidate with initial value zero.  */
>   basetype = TREE_TYPE (iv->base);
>   if (POINTER_TYPE_P (basetype))
> basetype = sizetype;
>   record_common_cand (data, build_int_cst (basetype, 0), iv->step, use);
> 
> whether the use actually benefits from this unscaled iv.
> If the use is USE_REF_ADDRESS, we could compare the cost
> of an address with an unscaled index with the cost of an address
> with a scaled index.  I think the natural scale value to try
> would be GET_MODE_INNER (TYPE_MODE (mem_type)).

Thanks for the comments. I agree this is the right place. But I am not sure
whether checking the cost at this point is what IV-opt generally does. In
general, IV-opt adds candidates that may be helpful and later decides the
optimal set.

If we are to use get_computation_cost to see the costs, we have to create an
iv_cand and then discard it. Since we are adding only one candidate, and only
for SVE-like targets, I think that is OK. If you still prefer to check the
cost, I will change that.

Attached are the patch (only the ivopt changes) and the testcase.

[Bug rtl-optimization/89862] LTO bootstrap fails for ARM

2019-03-29 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89862

--- Comment #4 from kugan at gcc dot gnu.org ---
Author: kugan
Date: Sat Mar 30 04:28:51 2019
New Revision: 270031

URL: https://gcc.gnu.org/viewcvs?rev=270031&root=gcc&view=rev
Log:

2019-03-29  Kugan Vivekanandarajah  

Backport from mainline
2019-03-29  Kugan Vivekanandarajah  
Eric Botcazou  

PR rtl-optimization/89862
* rtl.h (word_register_operation_p): Exclude CONST_INT from operations
that operates on the full registers for WORD_REGISTER_OPERATIONS
architectures.


Modified:
branches/gcc-8-branch/gcc/ChangeLog
branches/gcc-8-branch/gcc/rtl.h

[Bug rtl-optimization/89862] LTO bootstrap fails for ARM

2019-03-29 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89862

--- Comment #3 from kugan at gcc dot gnu.org ---
Author: kugan
Date: Sat Mar 30 04:24:22 2019
New Revision: 270030

URL: https://gcc.gnu.org/viewcvs?rev=270030&root=gcc&view=rev
Log:

2019-03-29  Kugan Vivekanandarajah  
Eric Botcazou  

PR rtl-optimization/89862
* rtl.h (word_register_operation_p): Exclude CONST_INT from operations
that operates on the full registers for WORD_REGISTER_OPERATIONS
architectures.


Modified:
trunk/gcc/ChangeLog
trunk/gcc/rtl.h

[Bug rtl-optimization/89862] LTO bootstrap fails for ARM

2019-03-28 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89862

--- Comment #2 from kugan at gcc dot gnu.org ---
(In reply to Eric Botcazou from comment #1)
> Can you try this instead?
> 
> Index: rtl.h
> ===
> --- rtl.h   (revision 269886)
> +++ rtl.h   (working copy)
> @@ -4401,6 +4401,7 @@ word_register_operation_p (const_rtx x)
>  {
>switch (GET_CODE (x))
>  {
> +case CONST_INT:
>  case ROTATE:
>  case ROTATERT:
>  case SIGN_EXTRACT:
Thanks for looking into it. Disallowing all the CONST_INT works for me. I have
verified that lto-bootstrap works with the above changes. I will test for
regression and post it to gcc-patches.

[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2

2019-03-27 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834

kugan at gcc dot gnu.org changed:

   What|Removed |Added

  Attachment #45686|0   |1
is obsolete||

--- Comment #9 from kugan at gcc dot gnu.org ---
Created attachment 46040
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46040&action=edit
patch

[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2

2019-03-27 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834

--- Comment #8 from kugan at gcc dot gnu.org ---
(In reply to rsand...@gcc.gnu.org from comment #7)
> Thanks for looking at this.
> 
> (In reply to kugan from comment #6)
> > cmp w3, 0
> > ble .L1
> > sub w3, w3, #1
> > mov x4, 0
> > cntw    x5
> > ptrue   p1.s, all
> > lsr w3, w3, 1
> > add w3, w3, 1
> > whilelo p0.s, xzr, x3
> > .p2align 3,,7
> > .L3:
> > ld2w    {z4.s - z5.s}, p0/z, [x1, x4, lsl 2]
> > ld2w    {z2.s - z3.s}, p0/z, [x2, x4, lsl 2]
> > add z0.s, z4.s, z2.s
> > sub z1.s, z5.s, z3.s
> > st2w    {z0.s - z1.s}, p0, [x0, x4, lsl 2]
> > whilelo p0.s, x5, x3
> > incb    x4, all, mul #2
> > incw    x5
> > ptest   p1, p0.b
> > bne .L3
> > .L1:
> > ret
> > .cfi_endproc
> 
> This doesn't look right.  x4 is an index, so it should be
> incremented by the number of words in two vectors, rather than
> the number of bytes in two vectors.

Thanks for the comments. I fixed it; with the attached patch, it generates:

f:
.LFB0:
.cfi_startproc
cmp w3, 0
ble .L1
sub w5, w3, #1
cntw    x4
mov x3, 0
ptrue   p1.s, all
lsr w5, w5, 1
add w5, w5, 1
whilelo p0.s, xzr, x5
.p2align 3,,7
.L3:
ld2w    {z4.s - z5.s}, p0/z, [x1, x3, lsl 2]
ld2w    {z2.s - z3.s}, p0/z, [x2, x3, lsl 2]
add z0.s, z4.s, z2.s
sub z1.s, z5.s, z3.s
st2w    {z0.s - z1.s}, p0, [x0, x3, lsl 2]
whilelo p0.s, x4, x5
inch    x3
incw    x4
ptest   p1, p0.b
bne .L3
.L1:
ret
.cfi_endproc
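The difference between the two versions is plain element-count arithmetic. Below is a small C sketch of it for a hypothetical vector length vl_bits; the helper names are invented, and only the fact that CNTB/CNTH/CNTW count bytes/halfwords/words per vector comes from the SVE architecture. Because the index in [xN, x3, lsl 2] counts 32-bit words and each iteration consumes two vectors of words, the step must be 2 * CNTW, which equals CNTH.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers mirroring what SVE CNTB, CNTH and CNTW return
   for a vector length of vl_bits (a multiple of 128, at most 2048).  */
static uint64_t bytes_per_vec (unsigned vl_bits) { return vl_bits / 8; }
static uint64_t halfwords_per_vec (unsigned vl_bits) { return vl_bits / 16; }
static uint64_t words_per_vec (unsigned vl_bits) { return vl_bits / 32; }

/* Correct step for a word index consumed two vectors at a time:
   2 * CNTW words, which is exactly CNTH, i.e. what "inch x3" adds.  */
uint64_t correct_index_step (unsigned vl_bits)
{
  assert (vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0);
  uint64_t step = 2 * words_per_vec (vl_bits);
  assert (step == halfwords_per_vec (vl_bits));
  return step;
}

/* What "incb x4, all, mul #2" adds: 2 * CNTB, a byte count that is
   four times too large once the index is scaled by lsl 2.  */
uint64_t incb_mul2_step (unsigned vl_bits)
{
  return 2 * bytes_per_vec (vl_bits);
}
```

For a 256-bit vector length, correct_index_step gives 16 while incb_mul2_step gives 64.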

[Bug rtl-optimization/89862] New: LTO bootstrap fails for ARM

2019-03-27 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89862

Bug ID: 89862
   Summary: LTO bootstrap fails for ARM
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kugan at gcc dot gnu.org
  Target Milestone: ---

Created attachment 46039
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46039&action=edit
patch

With the commit:
commit 67c18bce7054934528ff5930cca283b4ac967dca
Author: ebotcazou 
Date:   Wed Jan 31 10:03:06 2018

PR rtl-optimization/84071
* combine.c (record_dead_and_set_regs_1): Record the source unmodified
for a paradoxical SUBREG on a WORD_REGISTER_OPERATIONS target.

LTO bootstrap fails for arm (possibly for other WORD_REGISTER_OPERATIONS
targets).

There is an internal compiler error: in operator+=, at profile-count.h:792. It
looks like the profile_count is set incorrectly.

Commit 67c18bce7054934528ff5930cca283b4ac967dca skips generating gen_lowpart
for sets such as

(set (subreg:SI (reg:QI 1434) 0)
     (const_int 224 [0xe0]))

This seems to be the reason for the error.

The attached patch fixes this. Does this look reasonable?

[Bug target/88838] [SVE] Use 32-bit WHILELO in LP64 mode

2019-03-20 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88838

--- Comment #5 from kugan at gcc dot gnu.org ---
Created attachment 46000
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46000&action=edit
RFC patch

Attached is an RFC patch that fixes this, for review.

[Bug target/88836] [SVE] Redundant PTEST in loop test

2019-02-21 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88836

--- Comment #2 from kugan at gcc dot gnu.org ---
Created attachment 45795
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45795&action=edit
RFC patch

AFAIK, we need to:
1. Change the whilelo pattern in the backend.
2. Change RTL CSE:
- Add support for VEC_DUPLICATE.
- When handling a PARALLEL rtx, we may kill the CSE entry defined in the first
set, so that it doesn't reach.

The attached patch fixes this. With the patch, I now have:
.LFB0:
.cfi_startproc
cmp w3, 0
ble .L1
sub w4, w3, #1
cntw    x3
lsr w4, w4, 1
add w4, w4, 1
whilelo p0.s, xzr, x4
.p2align 3,,7
.L3:
ld2w    {z4.s - z5.s}, p0/z, [x1]
ld2w    {z2.s - z3.s}, p0/z, [x2]
add z0.s, z4.s, z2.s
sub z1.s, z5.s, z3.s
st2w    {z0.s - z1.s}, p0, [x0]
incb    x1, all, mul #2
whilelo p0.s, x3, x4
incb    x0, all, mul #2
incw    x3
incb    x2, all, mul #2
bne .L3
.L1:
ret
.cfi_endproc

[Bug target/88838] [SVE] Use 32-bit WHILELO in LP64 mode

2019-02-21 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88838

--- Comment #4 from kugan at gcc dot gnu.org ---
(In reply to kugan from comment #3)
> Created attachment 45794 [details]
> RFC patch

Oops, wrong place. It should be for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88836

[Bug target/88838] [SVE] Use 32-bit WHILELO in LP64 mode

2019-02-21 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88838

--- Comment #3 from kugan at gcc dot gnu.org ---
Created attachment 45794
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45794&action=edit
RFC patch

[Bug target/88838] [SVE] Use 32-bit WHILELO in LP64 mode

2019-02-21 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88838

kugan at gcc dot gnu.org changed:

   What|Removed |Added

 CC||kugan at gcc dot gnu.org

--- Comment #2 from kugan at gcc dot gnu.org ---
AFAIK, we need to:
1. Change the whilelo pattern in the backend.
2. Change RTL CSE:
- Add support for VEC_DUPLICATE.
- When handling a PARALLEL rtx, we may kill the CSE entry defined in the first
set, so that it doesn't reach.

The attached patch fixes this. With the patch, I now have:
.LFB0:
.cfi_startproc
cmp w3, 0
ble .L1
sub w4, w3, #1
cntw    x3
lsr w4, w4, 1
add w4, w4, 1
whilelo p0.s, xzr, x4
.p2align 3,,7
.L3:
ld2w    {z4.s - z5.s}, p0/z, [x1]
ld2w    {z2.s - z3.s}, p0/z, [x2]
add z0.s, z4.s, z2.s
sub z1.s, z5.s, z3.s
st2w    {z0.s - z1.s}, p0, [x0]
incb    x1, all, mul #2
whilelo p0.s, x3, x4
incb    x0, all, mul #2
incw    x3
incb    x2, all, mul #2
bne .L3
.L1:
ret
.cfi_endproc

[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2

2019-02-12 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834

--- Comment #6 from kugan at gcc dot gnu.org ---

> 
> Note the difference in mode for aarch64_classify_address. Not sure if this
> is because of the way my patch changes ivopt.

Yes, it was my mistake in the iv-use. With the attached patch, I now get:
cmp w3, 0
ble .L1
sub w3, w3, #1
mov x4, 0
cntw    x5
ptrue   p1.s, all
lsr w3, w3, 1
add w3, w3, 1
whilelo p0.s, xzr, x3
.p2align 3,,7
.L3:
ld2w    {z4.s - z5.s}, p0/z, [x1, x4, lsl 2]
ld2w    {z2.s - z3.s}, p0/z, [x2, x4, lsl 2]
add z0.s, z4.s, z2.s
sub z1.s, z5.s, z3.s
st2w    {z0.s - z1.s}, p0, [x0, x4, lsl 2]
whilelo p0.s, x5, x3
incb    x4, all, mul #2
incw    x5
ptest   p1, p0.b
bne .L3
.L1:
ret
.cfi_endproc

I will post the patch for review after stage 1 opens. In the meantime, any
review is appreciated, especially the part where the iv-use is set up and
get_alias_ptr_type_for_ptr_address.

[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2

2019-02-12 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834

kugan at gcc dot gnu.org changed:

   What|Removed |Added

  Attachment #45661|0   |1
is obsolete||

--- Comment #5 from kugan at gcc dot gnu.org ---
Created attachment 45686
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45686&action=edit
ivopt patch v2

[Bug tree-optimization/89296] New: tree copy-header masking uninitialized warning

2019-02-11 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89296

Bug ID: 89296
   Summary: tree copy-header masking uninitialized warning
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kugan at gcc dot gnu.org
  Target Milestone: ---

extern int get_a_value (void);
extern int printk (const char *fmt, ...);

void test_func(void) {
  int loop;  // uninitialized and "garbage"
  while (!loop) {
   loop = get_a_value();  // <- must be for this test
   printk("...");
  }
}

From Linaro bug report https://bugs.linaro.org/show_bug.cgi?id=4134.
With -fno-tree-ch, we get the expected warning.

diff --git a/gcc/tree-ssa-loop-ch.c b/gcc/tree-ssa-loop-ch.c
index c876d62..d405d00 100644
--- a/gcc/tree-ssa-loop-ch.c
+++ b/gcc/tree-ssa-loop-ch.c
@@ -393,7 +393,7 @@ ch_base::copy_headers (function *fun)
{
  gimple *stmt = gsi_stmt (bsi);
  if (gimple_code (stmt) == GIMPLE_COND)
-   gimple_set_no_warning (stmt, true);
+   ;//gimple_set_no_warning (stmt, true);
  else if (is_gimple_assign (stmt))
{
  enum tree_code rhs_code = gimple_assign_rhs_code (stmt);

With this change, we also get the expected warning. Looking into it.

[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2

2019-02-11 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834

--- Comment #4 from kugan at gcc dot gnu.org ---
Created attachment 45661
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45661&action=edit
ivopt patch v1

[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2

2019-02-11 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834

--- Comment #3 from kugan at gcc dot gnu.org ---
I added an iv-use for MASKED_LOAD_LANE, and the result is:
cmp w3, 0
ble .L1
sub w5, w3, #1
mov x4, 0
lsr w5, w5, 1
add w5, w5, 1
whilelo p0.s, xzr, x5
.p2align 3,,7
.L3:
lsl x3, x4, 3
incw    x4
add x7, x1, x3
add x6, x2, x3
ld2w    {z4.s - z5.s}, p0/z, [x7]
ld2w    {z2.s - z3.s}, p0/z, [x6]
add x3, x0, x3
add z0.s, z4.s, z2.s
sub z1.s, z5.s, z3.s
st2w    {z0.s - z1.s}, p0, [x3]
whilelo p0.s, x4, x5
bne .L3
.L1:
ret

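The base plus scaled index addressing mode is not used. This is because the
address is classified differently when aarch64_classify_address is called from
ivopts and from cfgexpand.

When called from ivopts: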
Breakpoint 4, aarch64_classify_address (info=0x7fffcba0, x=0x76c44f30,
mode=E_DImode, strict_p=false, type=ADDR_QUERY_M)
at
/home/kugan/work/abe/snapshots/gcc.git~origin~aarch64~sve-acle-branch/gcc/config/aarch64/aarch64.c:5689
5689    {
(gdb) p debug_rtx (x)
(plus:DI (mult:DI (reg:DI 91)
(const_int 8 [0x8]))
(reg:DI 90))

the address is accepted.

When called from cfgexpand:
Breakpoint 5, aarch64_classify_address (info=0x7fffcca0, x=0x76c5b840,
mode=E_VNx8SImode, strict_p=false, type=ADDR_QUERY_M)
at
/home/kugan/work/abe/snapshots/gcc.git~origin~aarch64~sve-acle-branch/gcc/config/aarch64/aarch64.c:5689
5689    {
(gdb) p debug_rtx (x)
(plus:DI (mult:DI (reg:DI 92 [ ivtmp_28 ])
(const_int 8 [0x8]))
(reg/v/f:DI 110 [ y ]))


This is not accepted because aarch64_classify_index (info, op1, mode,
strict_p) fails (as it should).

Note the difference in the mode passed to aarch64_classify_address: E_DImode
when called from ivopts, E_VNx8SImode when called from cfgexpand. I am not sure
whether this is because of the way my patch changes ivopts.
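As I understand the SVE encoding, the scalar-plus-scalar form of LD2W/ST2W scales the index register by the element size (LSL #2 for .s elements). A pair counter scaled by 8, as in the RTL above, therefore cannot be folded into the address, which would be consistent with aarch64_classify_index rejecting it for E_VNx8SImode while accepting the same RTL for E_DImode. The same byte offset is reachable with a scale of 4 if the induction variable counts elements instead of pairs. A trivial check of that arithmetic, with illustrative helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Offset as computed in the loop above: pair index scaled by 8
   (two 4-byte ints per pair), i.e. "lsl x3, x4, 3". */
static uint64_t offset_pair_scaled(uint64_t pair) {
    return pair << 3;
}

/* The same offset written as an element index scaled by the element
   size, the form an "[xN, xM, lsl #2]" address would need. */
static uint64_t offset_elem_scaled(uint64_t pair) {
    uint64_t elem = 2 * pair;  /* element index of the pair's first int */
    return elem << 2;
}
```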

[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2

2019-02-03 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834

kugan at gcc dot gnu.org changed:

   What|Removed |Added

 CC||kugan at gcc dot gnu.org

--- Comment #2 from kugan at gcc dot gnu.org ---
I'll assign it to myself unless it is being looked at by someone else.

[Bug sanitizer/88333] [9 Regression] ice in asan_emit_stack_protection, at asan.c:1574

2018-12-06 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88333

kugan at gcc dot gnu.org changed:

   What|Removed |Added

 CC||kugan at gcc dot gnu.org

--- Comment #7 from kugan at gcc dot gnu.org ---
*** Bug 88350 has been marked as a duplicate of this bug. ***

[Bug sanitizer/88350] Linux kernel build ICE with allyesconfig for aarch64

2018-12-06 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88350

kugan at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #3 from kugan at gcc dot gnu.org ---
Duplicate

*** This bug has been marked as a duplicate of bug 88333 ***

[Bug sanitizer/88350] Linux kernel build ICE with allyesconfig for aarch64

2018-12-06 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88350

kugan at gcc dot gnu.org changed:

   What|Removed |Added

  Alias|PR88333 |

--- Comment #2 from kugan at gcc dot gnu.org ---
Dup of PR88333 and fixed.

[Bug sanitizer/88350] New: Linux kernel build ICE with allyesconfig for aarch64

2018-12-04 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88350

Bug ID: 88350
   Summary: Linux kernel build ICE with allyesconfig for aarch64
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: sanitizer
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kugan at gcc dot gnu.org
CC: dodji at gcc dot gnu.org, dvyukov at gcc dot gnu.org,
jakub at gcc dot gnu.org, kcc at gcc dot gnu.org, marxin at 
gcc dot gnu.org
  Target Milestone: ---

When the Linux kernel is built (allyesconfig) with trunk, the build fails with an ICE:


++ make
CC=/home/tcwg-buildslave/workspace/tcwg_kernel-bisect-gnu_0/bin/aarch64-cc
ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- HOSTCC=gcc -j32 -s -k
:1335:2: warning: #warning syscall rseq not implemented [-Wcpp]
*** WARNING *** there are active plugins, do not report this as a bug unless
you can reproduce it without enabling any plugins.
Event| Plugins
PLUGIN_FINISH_TYPE   | randomize_layout_plugin structleak_plugin
PLUGIN_FINISH_DECL   | randomize_layout_plugin
PLUGIN_ATTRIBUTES| randomize_layout_plugin
latent_entropy_plugin structleak_plugin
PLUGIN_START_UNIT| latent_entropy_plugin
PLUGIN_ALL_IPA_PASSES_START  | randomize_layout_plugin
during RTL pass: expand
arch/arm64/mm/flush.c: In function '__sync_icache_dcache':
arch/arm64/mm/flush.c:61:6: internal compiler error: in
asan_emit_stack_protection, at asan.c:1574
   61 | void __sync_icache_dcache(pte_t pte)
  |  ^~~~


Full build Log can be found in:
https://ci.linaro.org/job/tcwg_kernel-bisect-gnu-master-aarch64-stable-allyesconfig/11/artifact/artifacts/build-1d89613e77d7db420b13ce3ad8b98f07aaf474e8/console.log


The commit that seems to trigger this is:
Author: marxin 
Date:   Fri Nov 30 14:25:15 2018 +

Make red zone size more flexible for stack variables (PR sanitizer/81715).

2018-11-30  Martin Liska  

PR sanitizer/81715
* asan.c (asan_shadow_cst): Remove, partially transform
into flush_redzone_payload.
(RZ_BUFFER_SIZE): New.
(struct asan_redzone_buffer): New.
(asan_redzone_buffer::emit_redzone_byte): Likewise.
(asan_redzone_buffer::flush_redzone_payload): Likewise.
(asan_redzone_buffer::flush_if_full): Likewise.
(asan_emit_stack_protection): Use asan_redzone_buffer class
that is responsible for proper aligned stores and flushing
of shadow memory payload.
* asan.h (ASAN_MIN_RED_ZONE_SIZE): New.
(asan_var_and_redzone_size): Likewise.
* cfgexpand.c (expand_stack_vars): Use smaller alignment
(ASAN_MIN_RED_ZONE_SIZE) in order to make shadow memory
for automatic variables more compact.
2018-11-30  Martin Liska  

PR sanitizer/81715
* c-c++-common/asan/asan-stack-small.c: New test.


git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@24
138bc75d-0d04-0410-961f-82ee72b054a4

[Bug rtl-optimization/88212] New: IRA Register Coalescing not working for the testcase

2018-11-26 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88212

Bug ID: 88212
   Summary: IRA Register Coalescing not working for the testcase
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kugan at gcc dot gnu.org
  Target Milestone: ---

When compiling the following on aarch64 with -O2:
#include 
void g(int32_t *p, int32x2x2_t val, int x)
{
 vst2_lane_s32(p,val,0);
}

generates:
.cfi_startproc
mov v2.8b, v0.8b
mov v3.8b, v1.8b
st2 {v2.s - v3.s}[0], [x0]
ret

clang produces:
st2 { v0.s, v1.s }[0], [x0]
ret

Essentially the problem is that access to part-registers doesn't get
coalesced, so IRA generates moves which aren't actually required.
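vst2_lane_s32 (p, val, 0) stores lane 0 of val.val[0] and lane 0 of val.val[1] to p[0] and p[1]. Because the int32x2x2_t argument already arrives in v0 and v1, a single ST2 from those registers suffices, as clang's output shows, and the copies into v2/v3 add nothing. A plain-C model of the store semantics (hypothetical names; the struct only mirrors int32x2x2_t and is not the arm_neon.h implementation):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for int32x2x2_t: two 2-lane vectors. */
typedef struct { int32_t val[2][2]; } i32x2x2_model;

/* Models vst2_lane_s32: take the same lane from each of the two vectors
   and store them to consecutive memory. */
static void vst2_lane_model(int32_t *p, i32x2x2_model v, int lane) {
    p[0] = v.val[0][lane];  /* lane from the first vector  */
    p[1] = v.val[1][lane];  /* lane from the second vector */
}
```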

[Bug target/86677] popcount builtin detection is breaking some kernel build

2018-11-12 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86677

--- Comment #13 from kugan at gcc dot gnu.org ---
Author: kugan
Date: Mon Nov 12 23:43:56 2018
New Revision: 266039

URL: https://gcc.gnu.org/viewcvs?rev=266039&root=gcc&view=rev
Log:
gcc/ChangeLog:

2018-11-13  Kugan Vivekanandarajah  

PR middle-end/86677
PR middle-end/87528
* tree-scalar-evolution.c (expression_expensive_p): Make BUILTIN
POPCOUNT as expensive when backend does not define it.

gcc/testsuite/ChangeLog:

2018-11-13  Kugan Vivekanandarajah  

PR middle-end/86677
PR middle-end/87528
* g++.dg/tree-ssa/pr86544.C: Run only for target supporting popcount
pattern.
* gcc.dg/tree-ssa/popcount.c: Likewise.
* gcc.dg/tree-ssa/popcount2.c: Likewise.
* gcc.dg/tree-ssa/popcount3.c: Likewise.
* gcc.target/aarch64/popcount4.c: New test.
* lib/target-supports.exp (check_effective_target_popcountl): New.


Added:
trunk/gcc/testsuite/gcc.target/aarch64/popcount4.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/g++.dg/tree-ssa/pr86544.C
trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount2.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount3.c
trunk/gcc/testsuite/lib/target-supports.exp
trunk/gcc/tree-scalar-evolution.c

[Bug middle-end/87528] Popcount changes caused 531.deepsjeng_r run-time regression on Skylake

2018-11-12 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528

--- Comment #7 from kugan at gcc dot gnu.org ---
Author: kugan
Date: Mon Nov 12 23:43:56 2018
New Revision: 266039

URL: https://gcc.gnu.org/viewcvs?rev=266039&root=gcc&view=rev
Log:
gcc/ChangeLog:

2018-11-13  Kugan Vivekanandarajah  

PR middle-end/86677
PR middle-end/87528
* tree-scalar-evolution.c (expression_expensive_p): Make BUILTIN
POPCOUNT as expensive when backend does not define it.

gcc/testsuite/ChangeLog:

2018-11-13  Kugan Vivekanandarajah  

PR middle-end/86677
PR middle-end/87528
* g++.dg/tree-ssa/pr86544.C: Run only for target supporting popcount
pattern.
* gcc.dg/tree-ssa/popcount.c: Likewise.
* gcc.dg/tree-ssa/popcount2.c: Likewise.
* gcc.dg/tree-ssa/popcount3.c: Likewise.
* gcc.target/aarch64/popcount4.c: New test.
* lib/target-supports.exp (check_effective_target_popcountl): New.


Added:
trunk/gcc/testsuite/gcc.target/aarch64/popcount4.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/g++.dg/tree-ssa/pr86544.C
trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount2.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount3.c
trunk/gcc/testsuite/lib/target-supports.exp
trunk/gcc/tree-scalar-evolution.c

[Bug c++/87469] [9 Regression] ice in record_estimate, at tree-ssa-loop-niter.c:3271

2018-10-29 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87469

--- Comment #5 from kugan at gcc dot gnu.org ---
Author: kugan
Date: Mon Oct 29 22:02:45 2018
New Revision: 265605

URL: https://gcc.gnu.org/viewcvs?rev=265605&root=gcc&view=rev
Log:
gcc/testsuite/ChangeLog:

2018-10-29  Kugan Vivekanandarajah  

PR middle-end/87469
* g++.dg/pr87469.C: New test.

gcc/ChangeLog:

2018-10-29  Kugan Vivekanandarajah  

PR middle-end/87469
* tree-ssa-loop-niter.c (number_of_iterations_popcount): Fix niter
max value.



Added:
trunk/gcc/testsuite/g++.dg/pr87469.C
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-ssa-loop-niter.c

[Bug c++/87469] [9 Regression] ice in record_estimate, at tree-ssa-loop-niter.c:3271

2018-10-17 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87469

--- Comment #4 from kugan at gcc dot gnu.org ---
In the loop here, the value defined in the loop (e) is used outside the loop,
hence this should not be detected as popcount (AFAIK). I will have a look at
fixing this.

[Bug target/87253] New: Python test_ctypes fails when built with gcc 8.2

2018-09-08 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87253

Bug ID: 87253
   Summary: Python test_ctypes fails when built with gcc 8.2
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kugan at gcc dot gnu.org
  Target Milestone: ---

Python-2.7.15

Steps to reproduce the error, in the Python source directory:
./configure
make
./python Lib/test/regrtest.py -v test_ctypes

==
FAIL: test_struct_by_value (ctypes.test.test_win32.Structures)
--
Traceback (most recent call last):
  File
"/home/kugan.vivekanandarajah/Python-2.7.15/Lib/ctypes/test/test_win32.py",
line 113, in test_struct_by_value
self.assertEqual(ret.left, left.value)
AssertionError: -200 != 10



gdb ./python
b ReturnRect
r Lib/test/regrtest.py -v test_ctypes

(gdb) p cp
$9 = {x = 15, y = 25}
(gdb) p fp
$10 = {x = 548534164448, y = 9890688}

cp and fp should be identical (both are pt at the call site), as can be seen below:

vi /home/kugan.vivekanandarajah/Python-2.7.15/Lib/ctypes/test/test_win32.py
+112

pt = POINT(15, 25)
...
ReturnRect = dll.ReturnRect
ReturnRect.argtypes = [c_int, RECT, POINTER(RECT), POINT, RECT,
  POINTER(RECT), POINT, RECT]


ret = ReturnRect(i, rect, pointer(rect), pt, rect,
 byref(rect), pt, rect)


gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/home/kugan.vivekanandarajah/install/usr/local/bin/../libexec/gcc/aarch64-unknown-linux-gnu/8.2.1/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: ../gcc/configure --disable-bootstrap
Thread model: posix
gcc version 8.2.1 20180907 (GCC)
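To separate a problem with ordinary C calls from one on the libffi path that ctypes uses, a direct-C analogue of the call can be built with the same compiler. This is a hypothetical sketch: POINT and RECT are assumed to hold long fields (consistent with the 64-bit values gdb prints for fp), and return_rect only mirrors the argument list of ReturnRect, not its real body.

```c
#include <assert.h>

/* Hypothetical stand-ins for the ctypes structures (long fields assumed). */
typedef struct { long x, y; } POINT;                     /* 16 bytes */
typedef struct { long left, top, right, bottom; } RECT;  /* 32 bytes */

/* Same argument list as ReturnRect in test_win32.py; illustrative body.
   noipa (GCC 8 and later) prevents inlining and interprocedural
   specialization, so the by-value structs really go through the
   calling convention. */
__attribute__((noipa))
RECT return_rect(int i, RECT ar, RECT *br, POINT cp, RECT dr,
                 RECT *er, POINT fp, RECT gr) {
    RECT bad = {-1, -1, -1, -1};
    (void)i; (void)br; (void)dr; (void)er; (void)gr;
    /* cp and fp are both 'pt' at the call site, so they must match. */
    if (cp.x != fp.x || cp.y != fp.y)
        return bad;
    return ar;
}
```

If this passes when built with the same gcc 8.2, that would point toward the libffi call path rather than plain C argument passing.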

[Bug target/86677] popcount builtin detection is breaking some kernel build

2018-07-26 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86677

--- Comment #2 from kugan at gcc dot gnu.org ---
(In reply to Richard Biener from comment #1)
> The kernel simply has to provide __popcount{s,d}i2 like it provides other
> libgcc functions if it chooses to not link against libgcc.

Yes, I created this bug just so that I can point the kernel people to it. I
will raise it with the kernel people internally and see what I can do. Thanks.
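A build that does not link libgcc has to supply such a helper itself. The sketch below shows one branch-free way to compute a 64-bit population count with integer operations only, so it does not rely on the SIMD CNT instruction that -mgeneral-regs-only rules out on aarch64. The function name is hypothetical; this is not the kernel's actual implementation, and the real libgcc entry point is __popcountdi2.

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free SWAR population count over 64 bits (hypothetical name). */
int swar_popcount64(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);            /* 2-bit sums */
    x = (x & 0x3333333333333333ULL)
        + ((x >> 2) & 0x3333333333333333ULL);               /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;             /* 8-bit sums */
    return (int)((x * 0x0101010101010101ULL) >> 56);        /* add bytes  */
}
```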

[Bug target/86677] New: popcount builtin detection is breaking some kernel build

2018-07-25 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86677

Bug ID: 86677
   Summary: popcount builtin detection is breaking some kernel
build
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kugan at gcc dot gnu.org
  Target Milestone: ---

Popcount builtin detection breaks the Linux kernel build for arm/aarch64 (and
possibly other targets whose backends do not provide the appropriate popcount
patterns): the detected popcount is emitted as a libgcc call that the kernel
does not provide.

On aarch64 this happens because the kernel is built with -mgeneral-regs-only.

Also discussed in:
https://gcc.gnu.org/ml/gcc-patches/2018-07/msg00489.html

[Bug tree-optimization/86544] Popcount detection generates different code on C and C++

2018-07-18 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86544

--- Comment #4 from kugan at gcc dot gnu.org ---
Author: kugan
Date: Wed Jul 18 22:11:24 2018
New Revision: 262864

URL: https://gcc.gnu.org/viewcvs?rev=262864&root=gcc&view=rev
Log:
gcc/ChangeLog:

2018-07-18  Kugan Vivekanandarajah  

PR middle-end/86544
* tree-ssa-phiopt.c (cond_removal_in_popcount_pattern): Handle
comparision with EQ_EXPR
in last stmt.

gcc/testsuite/ChangeLog:

2018-07-18  Kugan Vivekanandarajah  

PR middle-end/86544
* g++.dg/tree-ssa/pr86544.C: New test.


Added:
trunk/gcc/testsuite/g++.dg/tree-ssa/pr86544.C
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-ssa-phiopt.c

[Bug tree-optimization/86544] Popcount detection generates different code on C and C++

2018-07-17 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86544

--- Comment #2 from kugan at gcc dot gnu.org ---
Patch posted at https://gcc.gnu.org/ml/gcc-patches/2018-07/msg00975.html

[Bug tree-optimization/86544] Popcount detection generates different code on C and C++

2018-07-17 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86544

--- Comment #1 from kugan at gcc dot gnu.org ---
(In reply to ktkachov from comment #0)
> Great to see that GCC now detects the popcount loop in PR 82479!
> I am seeing some curious differences between gcc and g++ though.
> int
> pc (unsigned long long b)
> {
> int c = 0;
> 
> while (b) {
> b &= b - 1;
> c++;
> }
> 
> return c;
> }
> 
> If compiled with gcc -O3 on aarch64 this gives:
> pc:
> fmov d0, x0
> cnt v0.8b, v0.8b
> addv b0, v0.8b
> umov w0, v0.b[0]
> ret
> 
> whereas if compiled with g++ -O3 it gives:
> _Z2pcy:
> .LFB0:
> .cfi_startproc
> fmov d0, x0
> cmp x0, 0
> cnt v0.8b, v0.8b
> addv b0, v0.8b
> umov w0, v0.b[0]
> and x0, x0, 255
> csel w0, w0, wzr, ne
> ret
> 
> which is suboptimal. It seems that phiopt3 manages to optimise the C version
> better. The GIMPLE dumps just before the phiopt pass are:
> For the C (good version):
> 
>   int c;
>   int _7;
> 
>[local count: 118111601]:
>   if (b_4(D) != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
> 
>[local count: 105119324]:
>   _7 = __builtin_popcountl (b_4(D));
> 
>[local count: 118111601]:
>   # c_12 = PHI <0(2), _7(3)>
>   return c_12;
> 
> 
> For the C++ (bad version):
> 
>   int c;
>   int _7;
> 
>[local count: 118111601]:
>   if (b_4(D) == 0)
> goto ; [11.00%]
>   else
> goto ; [89.00%]
> 
>[local count: 105119324]:
>   _7 = __builtin_popcountl (b_4(D));
> 
>[local count: 118111601]:
>   # c_12 = PHI <0(2), _7(3)>
>   return c_12;
> 
> As you can see the order of the gotos and the jump conditions is inverted.
> 
> It seems to me that the two are equivalent and GCC could be doing a better
> job of optimising.
> 
> Can we improve phiopt to handle this more effectively?

Thanks for the test case. I will look at it.
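The two GIMPLE forms are equivalent because popcount(0) is 0: the guard around the builtin call is redundant whichever way the condition is written, so both should fold to an unconditional popcount. The fix later committed for this PR taught cond_removal_in_popcount_pattern in tree-ssa-phiopt.c to handle the EQ_EXPR form. A small C sketch of the three shapes (function names are illustrative):

```c
#include <assert.h>

/* Shape of the C GIMPLE: guard with b != 0. */
static int guarded_ne(unsigned long long b) {
    return b != 0 ? __builtin_popcountll(b) : 0;
}

/* Shape of the C++ GIMPLE: the same guard with the condition inverted. */
static int guarded_eq(unsigned long long b) {
    return b == 0 ? 0 : __builtin_popcountll(b);
}

/* What both should reduce to: popcount(0) == 0, so no guard is needed. */
static int unguarded(unsigned long long b) {
    return __builtin_popcountll(b);
}
```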

[Bug tree-optimization/86489] ICE in gimple_phi_arg starting with r261682 when building 531.deepsjeng_r with FDO + LTO

2018-07-12 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86489

--- Comment #7 from kugan at gcc dot gnu.org ---
Author: kugan
Date: Fri Jul 13 05:25:47 2018
New Revision: 262622

URL: https://gcc.gnu.org/viewcvs?rev=262622&root=gcc&view=rev
Log:
gcc/ChangeLog:

2018-07-13  Kugan Vivekanandarajah  
Richard Biener  

PR middle-end/86489
* tree-ssa-loop-niter.c (number_of_iterations_popcount): Check
that the loop latch destination is where phi is defined.

gcc/testsuite/ChangeLog:

2018-07-13  Kugan Vivekanandarajah  

PR middle-end/86489
* gcc.dg/pr86489.c: New test.


Added:
trunk/gcc/testsuite/gcc.dg/pr86489.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-ssa-loop-niter.c

[Bug tree-optimization/86489] ICE in gimple_phi_arg starting with r261682 when building 531.deepsjeng_r with FDO + LTO

2018-07-12 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86489

--- Comment #3 from kugan at gcc dot gnu.org ---
(In reply to Richard Biener from comment #2)
>   gimple *phi = SSA_NAME_DEF_STMT (b_11);
>   if (gimple_code (phi) != GIMPLE_PHI
>   || (gimple_assign_lhs (and_stmt)
>   != gimple_phi_arg_def (phi, loop_latch_edge (loop)->dest_idx)))
> return false;
> 
> this may fail if the PHI in question is not the correct one in which case
> it may not have the argument at the latch dest_idx.  Try first verifying
> that the loop latch destination is indeed gimple_bb (phi).

yes, thanks for spotting. I am testing the following patch:

diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index f6fa2f7..fbdf838 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -2555,6 +2555,7 @@ number_of_iterations_popcount (loop_p loop, edge exit,
... = PHI .  */
   gimple *phi = SSA_NAME_DEF_STMT (b_11);
   if (gimple_code (phi) != GIMPLE_PHI
+  || (gimple_bb (phi) != loop_latch_edge (loop)->dest)
   || (gimple_assign_lhs (and_stmt)
  != gimple_phi_arg_def (phi, loop_latch_edge (loop)->dest_idx)))
 return false;

Is checking that there is an argument at the latch dest_idx (against the
argument count of the PHI) still necessary?

[Bug tree-optimization/86489] ICE in gimple_phi_arg starting with r261682 when building 531.deepsjeng_r with FDO + LTO

2018-07-11 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86489

--- Comment #1 from kugan at gcc dot gnu.org ---
Sorry about the breakage. I am trying to reproduce it on x86-64. Please let me
know if you have a testcase.

[Bug middle-end/82479] missing popcount builtin detection

2018-06-16 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82479

--- Comment #13 from kugan at gcc dot gnu.org ---
Author: kugan
Date: Sat Jun 16 21:39:31 2018
New Revision: 261682

URL: https://gcc.gnu.org/viewcvs?rev=261682&root=gcc&view=rev
Log:
gcc/ChangeLog:

2018-06-16  Kugan Vivekanandarajah  

PR middle-end/82479
* ipa-fnsummary.c (will_be_nonconstant_expr_predicate): Handle
CALL_EXPR.
* tree-scalar-evolution.c (interpret_expr): Likewise.
(expression_expensive_p): Likewise.
* tree-ssa-loop-ivopts.c (contains_abnormal_ssa_name_p): Likewise.
* tree-ssa-loop-niter.c (number_of_iterations_popcount): New.
(number_of_iterations_exit_assumptions): Use
number_of_iterations_popcount.
(ssa_defined_by_minus_one_stmt_p): New.

gcc/testsuite/ChangeLog:

2018-06-16  Kugan Vivekanandarajah  

PR middle-end/82479
* gcc.dg/tree-ssa/popcount.c: New test.
* gcc.dg/tree-ssa/popcount2.c: New test.


Added:
trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount2.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/ipa-fnsummary.c
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-scalar-evolution.c
trunk/gcc/tree-ssa-loop-ivopts.c
trunk/gcc/tree-ssa-loop-niter.c

[Bug tree-optimization/64946] [AArch64] gcc.target/aarch64/vect-abs-compile.c - "abs" vectorization fails for char/short types

2018-06-16 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64946

--- Comment #24 from kugan at gcc dot gnu.org ---
Author: kugan
Date: Sat Jun 16 21:34:29 2018
New Revision: 261681

URL: https://gcc.gnu.org/viewcvs?rev=261681&root=gcc&view=rev
Log:
gcc/ChangeLog:

2018-06-16  Kugan Vivekanandarajah  

PR middle-end/64946
* cfgexpand.c (expand_debug_expr): Handle ABSU_EXPR.
* config/i386/i386.c (ix86_add_stmt_cost): Likewise.
* dojump.c (do_jump): Likewise.
* expr.c (expand_expr_real_2): Check operand type's sign.
* fold-const.c (const_unop): Handle ABSU_EXPR.
(fold_abs_const): Likewise.
* gimple-pretty-print.c (dump_unary_rhs): Likewise.
* gimple-ssa-backprop.c (backprop::process_assign_use): Likewise.
(strip_sign_op_1): Likewise.
* match.pd: Add new pattern to generate ABSU_EXPR.
* optabs-tree.c (optab_for_tree_code): Handle ABSU_EXPR.
* tree-cfg.c (verify_gimple_assign_unary): Likewise.
* tree-eh.c (operation_could_trap_helper_p): Likewise.
* tree-inline.c (estimate_operator_cost): Likewise.
* tree-pretty-print.c (dump_generic_node): Likewise.
* tree-vect-patterns.c (vect_recog_sad_pattern): Likewise.
* tree.def (ABSU_EXPR): New.

gcc/c-family/ChangeLog:

2018-06-16  Kugan Vivekanandarajah  

* c-common.c (c_common_truthvalue_conversion): Handle ABSU_EXPR.

gcc/c/ChangeLog:

2018-06-16  Kugan Vivekanandarajah  

* c-typeck.c (build_unary_op): Handle ABSU_EXPR.
* gimple-parser.c (c_parser_gimple_statement): Likewise.
(c_parser_gimple_unary_expression): Likewise.

gcc/cp/ChangeLog:

2018-06-16  Kugan Vivekanandarajah  

* constexpr.c (potential_constant_expression_1): Handle ABSU_EXPR.
* cp-gimplify.c (cp_fold): Likewise.

gcc/testsuite/ChangeLog:

2018-06-16  Kugan Vivekanandarajah  

PR middle-end/64946
* gcc.dg/absu.c: New test.
* gcc.dg/gimplefe-29.c: New test.
* gcc.target/aarch64/pr64946.c: New test.


Added:
trunk/gcc/testsuite/gcc.dg/absu.c
trunk/gcc/testsuite/gcc.dg/gimplefe-29.c
trunk/gcc/testsuite/gcc.target/aarch64/pr64946.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/c-family/ChangeLog
trunk/gcc/c-family/c-common.c
trunk/gcc/c/ChangeLog
trunk/gcc/c/c-typeck.c
trunk/gcc/c/gimple-parser.c
trunk/gcc/cfgexpand.c
trunk/gcc/config/i386/i386.c
trunk/gcc/cp/ChangeLog
trunk/gcc/cp/constexpr.c
trunk/gcc/cp/cp-gimplify.c
trunk/gcc/dojump.c
trunk/gcc/expr.c
trunk/gcc/fold-const.c
trunk/gcc/gimple-pretty-print.c
trunk/gcc/gimple-ssa-backprop.c
trunk/gcc/match.pd
trunk/gcc/optabs-tree.c
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-cfg.c
trunk/gcc/tree-eh.c
trunk/gcc/tree-inline.c
trunk/gcc/tree-pretty-print.c
trunk/gcc/tree-vect-patterns.c
trunk/gcc/tree.def

[Bug fortran/78387] OpenMP segfault/stack size exceeded writing to internal file

2017-10-15 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78387

kugan at gcc dot gnu.org changed:

   What|Removed |Added

 CC||kugan at gcc dot gnu.org

--- Comment #17 from kugan at gcc dot gnu.org ---
*** Bug 82555 has been marked as a duplicate of this bug. ***

[Bug libfortran/82555] SPECcpu201 Wrf_s deadlock

2017-10-15 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82555

kugan at gcc dot gnu.org changed:

   What|Removed |Added

 Status|WAITING |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #6 from kugan at gcc dot gnu.org ---


*** This bug has been marked as a duplicate of bug 78387 ***

[Bug libfortran/82555] SPECcpu201 Wrf_s deadlock

2017-10-14 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82555

--- Comment #5 from kugan at gcc dot gnu.org ---
(In reply to Andrew Pinski from comment #4)
> Actually PR 78387 seems exactly this issue.  Please test with a newer
> version of gfortran.

Thanks Andrew. Looks like this is the issue. So far, current trunk is
continuing without error.

[Bug libgomp/82555] SPECcpu201 Wrf_s deadlock

2017-10-14 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82555

--- Comment #1 from kugan at gcc dot gnu.org ---
My gcc is slightly old. 
gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/home/kugan.vivekanandarajah/install/test/usr/local/bin/../libexec/gcc/aarch64-unknown-linux-gnu/8.0.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: ../gcc-exp2/configure : (reconfigured) ../gcc-exp2/configure
--enable-languages=c,c++,fortran,lto,objc --no-create --no-recursion
Thread model: posix
gcc version 8.0.0 20170822 (experimental) (GCC)

I will try with the latest version.

[Bug libgomp/82555] New: SPECcpu201 Wrf_s deadlock

2017-10-14 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82555

Bug ID: 82555
   Summary: SPECcpu201 Wrf_s deadlock
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libgomp
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kugan at gcc dot gnu.org
CC: jakub at gcc dot gnu.org
  Target Milestone: ---

Wrf_s hangs (deadlocks) when run on 48 threads (cores). It doesn't always
happen; I have to run with --iterations=111 and it eventually happens,
sometimes in the 2nd iteration and sometimes much later.

I attached the process to gdb and the back trace is:
(gdb) bt
#0  0x01019924 in __lll_lock_wait (futex=futex@entry=0x2c3b1e0
<_gfortrani_unit_lock>, private=0) at lowlevellock.c:43
#1  0x01012cbc in __pthread_mutex_lock (mutex=0x2c3b1e0
<_gfortrani_unit_lock>) at pthread_mutex_lock.c:80
#2  0x00fd20ac in __gthread_mutex_lock (__mutex=0x2c3b1e0
<_gfortrani_unit_lock>) at ../libgcc/gthr-default.h:748
#3  _gfortrani_close_units () at ../../../gcc-exp2/libgfortran/io/unit.c:835
#4  0x0103950c in __libc_csu_fini ()
#5  0x0103f068 in __run_exit_handlers ()
#6  0x0103f0b0 in exit ()
#7  0x00fc6e60 in _gfortrani_exit_error (status=1, status@entry=3) at
../../../gcc-exp2/libgfortran/runtime/error.c:196
#8  0x00fc7314 in _gfortrani_internal_error
(cmp=cmp@entry=0xcdf23d00, 
message=message@entry=0x11548a8 "stash_internal_unit(): Stack Size
Exceeded") at ../../../gcc-exp2/libgfortran/runtime/error.c:422
#9  0x00fd1a84 in _gfortrani_stash_internal_unit (dtp=0xcdf23d00)
at ../../../gcc-exp2/libgfortran/io/unit.c:549
#10 0x00fd0f6c in _gfortran_st_write_done (dtp=0xcdf23d00) at
../../../gcc-exp2/libgfortran/io/transfer.c:4168
#11 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
[frames #12 to #45 are identical to #11: 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()]
#46 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()

I am running this on AArch64, but I don't think this is an AArch64-specific
issue. Is anyone else seeing this?

[Bug middle-end/82479] missing popcount builtin detection

2017-10-08 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82479

--- Comment #4 from kugan at gcc dot gnu.org ---
(In reply to Andrew Pinski from comment #2)
> Confirmed. How useful this optimization is questionable.

This code is part of spec2017/deepsjeng, so there is some gain if we can detect it.

> 
> Gcc has __builtin_popcount which can be used.

I agree.

[Bug middle-end/82479] missing popcount builtin detection

2017-10-08 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82479

--- Comment #1 from kugan at gcc dot gnu.org ---
gcc trunk generates:
PopCount:
mov w2, 0
cbz x0, .L1
.p2align 3
.L3:
sub x1, x0, #1
add w2, w2, 1
ands x0, x0, x1
bne .L3
.L1:
mov w0, w2
ret

[Bug middle-end/82479] New: missing popcount builtin detection

2017-10-08 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82479

Bug ID: 82479
   Summary: missing popcount builtin detection
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kugan at gcc dot gnu.org
  Target Milestone: ---

gcc does not detect the popcount idiom and replace it with the popcount builtin.
As a result, gcc generates poor code for

int PopCount (long b) {
int c = 0;

while (b) {
b &= b - 1;
c++;
}
return c;
}

clang seems to do that and generates (for aarch64):

_Z8PopCounty:
fmov d0, x0
cnt  v0.8b, v0.8b
uaddlv  h0, v0.8b
fmov w0, s0
ret
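The loop is the clear-lowest-set-bit idiom: each iteration of b &= b - 1 clears one set bit, so the trip count equals the number of set bits. That property is what makes replacing the loop with a popcount valid. A sketch checking the loop against GCC's __builtin_popcountl; it uses unsigned long instead of the report's long, an assumption made so that b - 1 cannot overflow when b is LONG_MIN:

```c
#include <assert.h>

/* The loop from the report, on an unsigned type: one iteration per set bit. */
int popcount_loop(unsigned long b) {
    int c = 0;
    while (b) {
        b &= b - 1;  /* clear the lowest set bit */
        c++;
    }
    return c;
}
```

Whether the recognized popcount then becomes a single instruction or a library call depends on what the target provides.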

[Bug tree-optimization/81558] Loop not vectorized

2017-07-26 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81558

--- Comment #2 from kugan at gcc dot gnu.org ---

> Does LLVM do a runtime alias check here?  For foo1 GCC adds a runtime alias
> check
> (BB vectorization cannot version for aliasing).

Yes. LLVM does not seem to unroll the inner loop. As you said, disabling
cunrolli makes it work; the cunroll pass would still unroll after loop
vectorisation. Can anything be done with the heuristics for this case? Thanks.

[Bug middle-end/81558] New: Loop not vectorized

2017-07-25 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81558

Bug ID: 81558
   Summary: Loop not vectorized
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kugan at gcc dot gnu.org
  Target Milestone: ---

For the testcase:

struct I
{
  int opix_x;
  int opix_y;
};

//#define R 
#define R __restrict__
extern struct I * R img;
extern unsigned short ** R imgY_org;
extern unsigned short orig_blocks[256];

void foo1 (int n)
{
  int x = 1, y = 1;
  unsigned short *orgptr=orig_blocks;
  // Vectorized
  for (y = 0; y < img->opix_y; y++)
for (x = 0; x < img->opix_x; x++)
  *orgptr++ = imgY_org [y][x];
}

void foo2 (int n)
{
  int x = 1, y = 1;
  unsigned short *orgptr=orig_blocks;
  // Not vectorized
  for (y = img->opix_y; y < img->opix_y+16; y++)
for (x = img->opix_x; x < img->opix_x+16; x++)
  *orgptr++ = imgY_org [y][x];
}

Loop in foo2 is not vectorized.

In the *.156t.vect, I see:
Creating dr for *_40
analyze_innermost: failed: evolution of base is not affine.
base_address: 
offset from base address: 
constant offset from base address: 
step: 
aligned to: 
base_object: *_40


LLVM seems to be able to vectorize this.

[Bug tree-optimization/80612] [7/8 Regression] ICE in get_range_info, at tree-ssanames.c:375

2017-05-03 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80612

--- Comment #5 from kugan at gcc dot gnu.org ---
(In reply to Marek Polacek from comment #4)
> This should fix it:
> 
> --- a/gcc/calls.c
> +++ b/gcc/calls.c
> @@ -1270,7 +1270,7 @@ get_size_range (tree exp, tree range[2])
>  
>    wide_int min, max;
>    enum value_range_type range_type
> -    = (TREE_CODE (exp) == SSA_NAME
> +    = ((TREE_CODE (exp) == SSA_NAME && INTEGRAL_TYPE_P (TREE_TYPE (exp)))
>         ? get_range_info (exp, &min, &max) : VR_VARYING);
>  
>    if (range_type == VR_VARYING)

I looked at the other uses of get_range_info too. gcc/gimple-ssa-warn-alloca.c
calls it without the INTEGRAL_TYPE_P check, but I think that is intentional.

[Bug lto/78140] [7 Regression] libxul -flto uses 1GB more memory than gcc-6

2017-01-22 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78140

kugan at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kugan at gcc dot gnu.org

--- Comment #26 from kugan at gcc dot gnu.org ---
(In reply to Richard Biener from comment #20)
> Look at tree-ssanames.c:range_info_def for "tricks" (make them variable
> size):
> 
> /* Value range information for SSA_NAMEs representing non-pointer variables.
> */
> 
> struct GTY ((variable_size)) range_info_def {
>   /* Minimum, maximum and nonzero bits.  */
>   TRAILING_WIDE_INT_ACCESSOR (min, ints, 0)
>   TRAILING_WIDE_INT_ACCESSOR (max, ints, 1)
>   TRAILING_WIDE_INT_ACCESSOR (nonzero_bits, ints, 2)
>   trailing_wide_ints <3> ints;
> };

I am working on a patch to change ipa vrp based on the above.
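
The "variable size" trick quoted above stores the min, max and nonzero_bits values inline after a small header, sized for the type's precision rather than for the widest possible integer. Below is a minimal C analogue using a flexible array member. `struct range_rec`, `range_rec_alloc` and `range_rec_elt` are hypothetical names, and the layout is only loosely modelled on the idea behind trailing_wide_ints, not GCC's actual implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical variable-size range record: three values (min, max,
   nonzero_bits), each `words` 64-bit words long, stored inline.  */
struct range_rec
{
  unsigned char words;   /* 64-bit words per value.  */
  uint64_t data[];       /* 3 * words entries.  */
};

/* Number of 64-bit words needed for an integer of PRECISION bits.  */
static size_t words_for_precision (unsigned precision)
{
  return (precision + 63) / 64;
}

/* Allocate a zeroed record sized for PRECISION; NULL on failure.  */
struct range_rec *range_rec_alloc (unsigned precision)
{
  size_t w = words_for_precision (precision);
  struct range_rec *r = calloc (1, sizeof *r + 3 * w * sizeof (uint64_t));
  if (r)
    r->words = (unsigned char) w;
  return r;
}

/* WHICH: 0 = min, 1 = max, 2 = nonzero_bits.  */
uint64_t *range_rec_elt (struct range_rec *r, int which)
{
  return r->data + (size_t) which * r->words;
}
```

A 32-bit type then pays for three words of payload instead of three maximum-width values, which is where the memory saving comes from.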

[Bug tree-optimization/78721] [7 Regression] ICE on valid code at -O2 and -O3 on x86_64-linux-gnu: in set_value_range, at tree-vrp.c:371

2016-12-09 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78721

--- Comment #4 from kugan at gcc dot gnu.org ---
Author: kugan
Date: Fri Dec  9 19:47:10 2016
New Revision: 243501

URL: https://gcc.gnu.org/viewcvs?rev=243501&root=gcc&view=rev
Log:
gcc/testsuite/ChangeLog:

2016-12-09  Kugan Vivekanandarajah  

	PR ipa/78721
	* gcc.dg/pr78721.c: New test.

gcc/ChangeLog:

2016-12-09  Kugan Vivekanandarajah  

	PR ipa/78721
	* ipa-cp.c (propagate_vr_accross_jump_function): drop_tree_overflow
	after fold_convert.


Added:
    trunk/gcc/testsuite/gcc.dg/pr78721.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-cp.c
    trunk/gcc/testsuite/ChangeLog

[Bug tree-optimization/78721] [7 Regression] ICE on valid code at -O2 and -O3 on x86_64-linux-gnu: in set_value_range, at tree-vrp.c:371

2016-12-08 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78721

kugan at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kugan at gcc dot gnu.org

--- Comment #3 from kugan at gcc dot gnu.org ---
Created attachment 40280
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40280&action=edit
untested patch

[Bug tree-optimization/77862] [7 Regression] ice in add_equivalence

2016-12-07 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77862

kugan at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #8 from kugan at gcc dot gnu.org ---
Fixed in trunk.

[Bug tree-optimization/72835] [7 Regression] Incorrect arithmetic optimization involving bitfield arguments

2016-11-21 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72835

kugan at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #6 from kugan at gcc dot gnu.org ---
Fixed in trunk.

[Bug tree-optimization/71408] [7 Regression] wrong code at -Os and above on x86_64-linux-gnu

2016-11-21 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71408

kugan at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #7 from kugan at gcc dot gnu.org ---
Fixed in trunk.

[Bug tree-optimization/40921] missed optimization: x + (-y * z * z) => x - y * z * z

2016-11-21 Thread kugan at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40921

kugan at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |kugan at gcc dot gnu.org
         Resolution|---                         |FIXED

--- Comment #6 from kugan at gcc dot gnu.org ---
Fixed in trunk.
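
To make the bug title concrete, below are the two source forms it names. `before` and `after` are illustrative names only. Note that `-y * z * z` parses as `((-y) * z) * z`. IEEE negation is exact, and under the default round-to-nearest mode multiplication rounds symmetrically, so `(-a) * b` equals `-(a * b)`. Subtraction is defined as adding the negation, so the two functions return the same value even for doubles.

```c
/* Source form from the bug title: negate, multiply, then add.  */
double before (double x, double y, double z)
{
  return x + (-y * z * z);
}

/* Desired form: a single subtraction, with no separate negation.  */
double after (double x, double y, double z)
{
  return x - y * z * z;
}
```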
