[Bug tree-optimization/114322] New: [14 Regression] SCEV analysis failed for bases like A[(i+x)*stride] since r14-9193-ga0b1798042d033

2024-03-13 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114322

Bug ID: 114322
   Summary: [14 Regression] SCEV analysis failed for bases like
A[(i+x)*stride] since r14-9193-ga0b1798042d033
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

Compile the following case with: gcc simp.c -Ofast -mcpu=neoverse-n1 -S \
 -fdump-tree-ifcvt -fdump-tree-vect-details-scev

int
foo (short *A, int x, int stride)
{
  int sum = 0;

  if (stride > 1)
    {
      #pragma GCC unroll 1
      for (int i = 0; i < 1024; ++i)
        sum += A[(i + x) * stride];
    }

  return sum;
}

The gimple in the loop is:

  :
  # sum_19 = PHI 
  # i_20 = PHI 
  # ivtmp_37 = PHI 
  _1 = x_12(D) + i_20;
  _2 = _1 * stride_11(D);
  _3 = (long unsigned int) _2;
  _4 = _3 * 2;
  _5 = A_13(D) + _4;
  _6 = *_5;
  _7 = (int) _6;
  sum_15 = _7 + sum_19;


Before the commit (i.e. the fix for PR114074), the loop can be vectorized:

Creating dr for *_5
analyze_innermost: (analyze_scalar_evolution 
  (loop_nb = 1)
  (scalar = _5)
(get_scalar_evolution 
  (scalar = _5)
  (scalar_evolution = {A_13(D) + (long unsigned int) (stride_11(D) * x_12(D)) *
2, +, (long unsigned int) stride_11(D) * 2}_1))
)
success.
(analyze_scalar_evolution 
  (loop_nb = 1)
  (scalar = _5)
(get_scalar_evolution 
  (scalar = _5)
  (scalar_evolution = {A_13(D) + (long unsigned int) (stride_11(D) * x_12(D)) *
2, +, (long unsigned int) stride_11(D) * 2}_1))
)
(instantiate_scev 
  (instantiate_below = 5 -> 3)
  (evolution_loop = 1)
  (chrec = {A_13(D) + (long unsigned int) (stride_11(D) * x_12(D)) * 2, +,
(long unsigned int) stride_11(D) * 2}_1)
  (res = {A_13(D) + (long unsigned int) (stride_11(D) * x_12(D)) * 2, +, (long
unsigned int) stride_11(D) * 2}_1))
base_address: A_13(D) + (sizetype) (stride_11(D) * x_12(D)) * 2
offset from base address: 0
constant offset from base address: 0
step: (ssizetype) ((long unsigned int) stride_11(D) * 2)
base alignment: 2
base misalignment: 0
offset alignment: 128
step alignment: 2
base_object: *A_13(D) + (sizetype) (stride_11(D) * x_12(D)) * 2
Access function 0: {0B, +, (long unsigned int) stride_11(D) * 2}_1


After the commit, loop vectorization fails due to a SCEV failure for *_5:

Creating dr for *_5
analyze_innermost: (analyze_scalar_evolution 
  (loop_nb = 1)
  (scalar = _5)
(get_scalar_evolution 
  (scalar = _5)
  (scalar_evolution = _5))
)
(analyze_scalar_evolution 
  (loop_nb = 1)
  (scalar = _5)
(get_scalar_evolution 
  (scalar = _5)
  (scalar_evolution = _5))
)
simp.c:11:10: missed:  failed: evolution of base is not affine.
..
  (res = scev_not_known))


To my understanding, '(i + x) * stride' is a signed integer calculation, in which
overflow is undefined behavior, so the case should still be vectorizable.
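
For reference, a hypothetical manually-distributed rewrite (illustration only, not
part of the testcase) makes the affine evolution explicit; because signed overflow
of '(i + x) * stride' is undefined, the two forms should be equivalent:

/* Hypothetical rewrite for illustration: i * stride advances by the
   loop-invariant step "stride", so the address evolution is the affine
   chrec {A + x*stride*2, +, stride*2}.  */
int
foo_distributed (short *A, int x, int stride)
{
  int sum = 0;

  if (stride > 1)
    {
      #pragma GCC unroll 1
      for (int i = 0; i < 1024; ++i)
        sum += A[i * stride + x * stride];
    }

  return sum;
}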

[Bug testsuite/113446] [14 Regression] gcc.dg/tree-ssa/scev-16.c FAILs

2024-01-18 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113446

--- Comment #6 from Hao Liu  ---
Hi Jakub,

That's great. Thanks for the fix.

[Bug target/110625] [14 Regression][AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large

2023-12-30 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #26 from Hao Liu  ---
(In reply to Tamar Christina from comment #25)
> Is still pretty inefficient due to all the extends.  If we generate better
> code here this may tip the scale back to vector.  But for now, the patch
> should fix the regression.

That's great. Thanks a lot!

[Bug target/113089] New: [14 Regression][aarch64] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252 since r14-6605-gc0911c6b357ba9

2023-12-19 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113089

Bug ID: 113089
   Summary: [14 Regression][aarch64] ICE in
process_uses_of_deleted_def, at rtl-ssa/changes.cc:252
since r14-6605-gc0911c6b357ba9
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

SPEC2017 525.x264 build failure. Options are: -O3 -mcpu=neoverse-n1
-funroll-loops -flto=32 --param early-inlining-insns=96  --param
max-inline-insns-auto=64  --param inline-unit-growth=96

The failure happens while doing LTO optimization:

gcc -std=c99 ... -o ldecod_r

during RTL pass: ldp_fusion
ldecod_src/intra_chroma_pred.c: In function 'intrapred_chroma':
ldecod_src/intra_chroma_pred.c:420:1: internal compiler error: in
process_uses_of_deleted_def, at rtl-ssa/changes.cc:252
  420 | }
  | ^
0x1ccbbab
rtl_ssa::function_info::process_uses_of_deleted_def(rtl_ssa::set_info*)
../../gcc/gcc/rtl-ssa/changes.cc:252
0x1cce34f
rtl_ssa::function_info::change_insns(array_slice)
../../gcc/gcc/rtl-ssa/changes.cc:799
0x1371843 ldp_bb_info::fuse_pair(bool, unsigned int, int, rtl_ssa::insn_info*,
rtl_ssa::insn_info*, base_cand&, rtl_ssa::insn_range_info const&)
../../gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:1520
0x1374663 ldp_bb_info::try_fuse_pair(bool, unsigned int, rtl_ssa::insn_info*,
rtl_ssa::insn_info*)
../../gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:2217
0x1374a8f ldp_bb_info::merge_pairs(std::__cxx11::list >&, std::__cxx11::list >&, bool, unsigned int)
../../gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:2306
0x1377bfb ldp_bb_info::transform_for_base(int, access_group&)
../../gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:2339
0x1377bfb void
ldp_bb_info::traverse_base_map,
int_hash >, access_group,
simple_hashmap_traits,
int_hash > >, access_group> >
>(ordered_hash_map, int_hash >, access_group,
simple_hashmap_traits,
int_hash > >, access_group> >&)
../../gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:2398
0x136e29b ldp_bb_info::transform()
../../gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:2406
0x136e29b ldp_fusion_bb(rtl_ssa::bb_info*)
../../gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:2634
0x136ee93 ldp_fusion()
../../gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:2643
0x136eefb execute
../../gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:2693

[Bug tree-optimization/112774] New: Vectorize the loop by inferring nonwrapping information from arrays

2023-11-30 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112774

Bug ID: 112774
   Summary: Vectorize the loop by inferring nonwrapping
information from arrays
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

This case is extracted from another benchmark. It is simpler than the case in
PR101450, as it has additional boundary information from the array:

int A[1024 * 2];

int foo (unsigned offset, unsigned N) 
{
  int sum = 0;

  for (unsigned i = 0; i < N; i++)
    sum += A[i + offset];

  return sum;
}

The Gimple before the vectorization pass is:

 [local count: 955630224]:
# sum_12 = PHI 
# i_14 = PHI 
_1 = offset_8(D) + i_14;
_2 = A[_1];
sum_9 = _2 + sum_12;
i_10 = i_14 + 1;

GCC fails to vectorize it as the chrec "{offset_8, +, 1}_1" may overflow/wrap. I
summarized more details in this email:
https://gcc.gnu.org/pipermail/gcc/2023-November/242854.html

Actually, GCC already knows it won't wrap, by inferring the range from the array
(in estimate_numbers_of_iterations -> infer_loop_bounds_from_undefined ->
infer_loop_bounds_from_array):

Induction variable (unsigned int) offset_8(D) + 1 * iteration does not wrap
in statement _2 = A[_1];
 in loop 1.
Statement _2 = A[_1];
 is executed at most 2047 (bounded by 2047) + 1 times in loop 1.

We can re-use this information to vectorize this case. I already have a simple
patch to achieve this and will send it out later (after doing more tests).
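
For illustration only, a hypothetical manually-widened variant (not the proposed
patch) already avoids the wrapping concern: the array bound guarantees that
"i + offset" stays below 2048 for any in-bounds execution, so indexing with a
64-bit value is equivalent:

int A[1024 * 2];

/* Hypothetical rewrite: with a 64-bit index the chrec cannot wrap, which is
   exactly what the array bound already guarantees for the original 32-bit
   "i + offset".  */
int foo_widened (unsigned offset, unsigned N)
{
  int sum = 0;

  for (unsigned i = 0; i < N; i++)
    sum += A[(unsigned long) i + offset];

  return sum;
}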

[Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large

2023-08-01 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #19 from Hao Liu  ---
> Hi, here's the reduced case

Hi Tamar, thanks for the case.  I've modified it to reproduce the ICE without
LTO and have updated the patch.

[Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large

2023-08-01 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #17 from Hao Liu  ---
> Thanks! I can reduce a testcase for you if you want :)

That will be very helpful. Thanks.

[Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large

2023-08-01 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #15 from Hao Liu  ---
Ah, I see.

I've sent out a quick fix patch for code review.  I'll investigate more about
this and find out the root cause.

[Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large

2023-07-30 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #11 from Hao Liu  ---
Hi Richard,

That's great! Glad to hear the status. Waiting for the patches to be ready and
upstreamed to trunk.

[Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large

2023-07-19 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #8 from Hao Liu  ---
Thanks for the explanation. I understand the root cause now, and that's reasonable.

So, do you have a plan to fix this (i.e. to separate the FP and integer types)?

I want to enable the new costs for Ampere1, which is similar to N2's
issue-info.  If this problem won't be fixed in the near future, I think a
workaround is probably to adjust the general_ops in the issue_info, e.g. set
the general_ops of both scalar and vector to 3 instead of the current values
of 4 and 2.

[Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large

2023-07-18 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #6 from Hao Liu  ---
Thanks for the confirmation about the reduction latency.  I'll create a simple
patch to fix this.

> Discounting the loads, we do have 15 general operations.

That's true, and there are indeed 8 general operations in the scalar loop.  As
count_ops() is accurate, it seems the cost of the vector body may be too large
(Vector inside of loop cost: 51):

*k_48 4 times vec_perm costs 12 in body
*k_48 1 times unaligned_load (misalign -1) costs 4 in body
_5->m1 1 times vec_perm costs 3 in body
_5->m4 1 times unaligned_load (misalign -1) costs 4 in body
(int) _24 2 times vec_promote_demote costs 4 in body
(double) _25 4 times vec_promote_demote costs 8 in body
_2 * _26 4 times vector_stmt costs 8 in body

If it were small enough, SLP would still be profitable even after the vect-body
cost is increased according to the issue-info.  I'm not quite familiar with this
part, and it may affect all aarch64 targets, so it's hard for me to fix.  It
would be great if you could look at how to fix this.

[Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large

2023-07-14 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #3 from Hao Liu  ---
Sorry, it seems this case cannot be fixed by only adjusting the calculation of
the "reduction latency".  Even if it becomes smaller, the case still cannot be
vectorized, as the "general operations" count is still too large:

Original vector body cost = 51
Scalar issue estimate:
  ...
  general operations = 8
  reduction latency = 2
  estimated min cycles per iteration = 2.00
  estimated cycles per vector iteration (for VF 2) = 4.00
Vector issue estimate:
  ...
  general operations = 15   <-- Too large
  reduction latency = 2 <-- from 8 to 2
  estimated min cycles per iteration = 7.50
Increasing body cost to 96 because scalar code would issue more quickly
...
missed:  cost model: the vector iteration cost = 96 divided by the scalar
iteration cost = 44 is greater or equal to the vectorization factor = 2.
missed:  not vectorized: vectorization not profitable.

[Bug target/110649] [14 Regression] 25% sphinx3 spec2006 regression on Ice Lake and zen between g:acaa441a98bebc52 (2023-07-06 11:36) and g:55900189ab517906 (2023-07-07 00:23)

2023-07-14 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110649

--- Comment #2 from Hao Liu  ---
Hi, I bisected the following 3 commits (sequential):

  [v3] 3a61ca1b925 - Improve profile updates after loop-ch and cunroll
(2023-07-06) 
  [v2] d4c2e34deef - Improve scale_loop_profile (2023-07-06) 
  [v1] 224fd59b2dc - Vect: use a small step to calculate induction for the
unrolled loop (PR tree-optimization/110449) (2023-07-06) 

The time in seconds of the 1-copy run of 482.sphinx3 on zen2:
  v3: 261s
  v2: 231s
  v1: 231s

So the regression should be caused by 3a61ca1b925, i.e.
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=3a61ca1b9256535e1bfb19b2d46cde21f3908a5d

[Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large

2023-07-11 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #2 from Hao Liu  ---
To my understanding, the "reduction latency" is the least number of cycles needed
to do the reduction calculation for one iteration of the loop.  It is calculated
from the extra instruction issue-info of the new cost models in the AArch64
backend.

Usually, the reduction latency of the vectorized loop should be smaller than that
of the scalar loop.  If the latency of the vectorized loop is larger, the cost
model assumes vectorization may not be beneficial, so it increases the vect-body
cost by the scale of vect_reduct_latency/scalar_reduct_latency, as in the above
case.

For the above case, it thinks the scalar loop needs 4 cycles (2*VF=4) to
calculate "results.m += rhs", while the vectorized loop needs 8 cycles
(2*count=8).  As a result, the vect-body cost is doubled from the original value
of 51 to 102.  That does not seem true for the vectorized loop, which should only
need 2 cycles to calculate the SIMD version of "results.m += rhs".
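
For reference, the numbers in the dump appear to scale as

  new body cost = 51 * (8.00 / 4.00) = 102

i.e. the original body cost times the ratio of the vector loop's estimated cycles
per iteration (8.00) to the scalar loop's estimated cycles per vector iteration
(4.00).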

[Bug target/110625] New: [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large

2023-07-11 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

Bug ID: 110625
   Summary: [AArch64] Vect: SLP fails to vectorize a loop as the
reduction_latency calculated by new costs is too large
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

This problem causes a performance regression in SPEC2017 538.imagick.  For the
following simple case (modified from pr96208):

typedef struct {
    unsigned short m1, m2, m3, m4;
} the_struct_t;
typedef struct {
    double m1, m2, m3, m4, m5;
} the_struct2_t;

double bar1 (the_struct2_t*);

double foo (double* k, unsigned int n, the_struct_t* the_struct) {
    unsigned int u;
    the_struct2_t result;
    for (u=0; u < n; u++, k--) {
        result.m1 += (*k)*the_struct[u].m1;
        result.m2 += (*k)*the_struct[u].m2;
        result.m3 += (*k)*the_struct[u].m3;
        result.m4 += (*k)*the_struct[u].m4;
    }
    return bar1 (&result);
}


Compile it with "-Ofast -S -mcpu=neoverse-n2 -fdump-tree-vect-details
-fno-tree-slp-vectorize". SLP fails to vectorize the loop as the vector body
cost is increased due to the too large "reduction latency".  See the dump of
vect pass:

Original vector body cost = 51
Scalar issue estimate:
  ...
  reduction latency = 2
  estimated min cycles per iteration = 2.00
  estimated cycles per vector iteration (for VF 2) = 4.00
Vector issue estimate:
  ...
  reduction latency = 8  <-- Too large
  estimated min cycles per iteration = 8.00
Increasing body cost to 102 because scalar code would issue more quickly
Cost model analysis: 
Vector inside of loop cost: 102
...
Scalar iteration cost: 44
...
missed:  cost model: the vector iteration cost = 102 divided by the scalar
iteration cost = 44 is greater or equal to the vectorization factor = 2.
missed:  not vectorized: vectorization not profitable.


SLP will succeed with "-mcpu=neoverse-n1", as N1 doesn't use the new vector
costs and the vector body cost is not increased. The "reduction latency" is
calculated in aarch64.cc count_ops():
  /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
 that's not yet the case.  */
  ops->reduction_latency = MAX (ops->reduction_latency, base * count);

For this case, the "base" is 2 and the "count" is 4.  To my understanding, the
"count" of SLP means the number of scalar stmts (i.e. results.m1 +=, ...) in a
permutation group to be merged into a vector stmt.  It seems unreasonable to
multiply the cost by "count" (maybe it doesn't consider the SLP situation).
So, I'm thinking of calculating it differently for the SLP situation, e.g.

  unsigned int latency = PURE_SLP_STMT(stmt_info) ? base : base * count;
  ops->reduction_latency = MAX (ops->reduction_latency, latency);

Is this reasonable?
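
With the numbers from this case (base = 2, count = 4), the current formula gives
a reduction latency of 2 * 4 = 8, while the proposed PURE_SLP_STMT variant would
give 2, matching the scalar loop's reduction latency of 2 in the dump above.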

[Bug tree-optimization/110474] Vect: the epilog vect loop should have small VF if the loop is unrolled during vectorization

2023-07-05 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110474

Hao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #3 from Hao Liu  ---
It would be better to have a suggested_epilog_"unroll" factor or to support
multiple epilogues, but that needs a lot of work.  Let's go with the simple patch
first.

[Bug tree-optimization/110531] Vect: slp_done_for_suggested_uf is not initialized in tree-vect-loop.cc

2023-07-04 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110531

Hao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #12 from Hao Liu  ---
OK. Now I get your point that a useless initialization may introduce extra cost.

[Bug tree-optimization/110531] Vect: slp_done_for_suggested_uf is not initialized in tree-vect-loop.cc

2023-07-04 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110531

--- Comment #10 from Hao Liu  ---
> foo is just an example for not getting inlined, the point here is extra cost 
> paid.

My point is that the case is different from the original case in
tree-vect-loop.cc.  For example, change the case as follows:

__attribute__((noipa)) int foo(int *a) { return *a == 1 ? 1 : 0; }

That's similar to the original problem (the value of "a" is undefined).

I don't mean that "a" must be initialized in test().  We can also initialize "a"
in foo, but we should not use "a" before initialization. E.g.

__attribute__((noipa)) int foo(int *a) {
  *a = 1;
  ...
  if (*a)
}

The above case has no problem.

[Bug tree-optimization/110531] Vect: slp_done_for_suggested_uf is not initialized in tree-vect-loop.cc

2023-07-04 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110531

--- Comment #7 from Hao Liu  ---
> int foo() {
>   bool a = true;
>   bool b;
>   if (a || b)
> return 1;
>   b = true;
>   return 0;
> }
> 
> still has the warning, it looks something can be improved (guess we prefer 
> not to emit warning).

Your case is wrong; you should initialize "b", and then there will be no warning.


> __attribute__((noipa)) int foo(int *a) { *a = 1; return 1;}
> 
> int test(){
> #ifdef AINIT
>   int a = 0;
> #else
>   int a;
> #endif
>   int b = foo(&a);
>   return b;
> }

This case doesn't have a problem. If "foo" used "a" directly (before writing to
it), the result would be undefined behavior, which causes both correctness and
performance issues.

[Bug tree-optimization/110531] Vect: slp_done_for_suggested_uf is not initialized in tree-vect-loop.cc

2023-07-03 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110531

--- Comment #5 from Hao Liu  ---
BTW, there is probably no warning because the original code is too complicated
and not inlined.
Compile this simple case with "g++ -O3 -S -Wall hello.c":
int foo(bool a) {
  bool b;
  if (a || b)
    return 1;
  b = true;
  return 0;
}

GCC reports a warning:
hello.c: In function ‘int foo(bool)’:
hello.c:4:9: warning: ‘b’ is used uninitialized [-Wuninitialized]
4 |   if (a || b)
  |   ~~^~~~

[Bug tree-optimization/110531] Vect: slp_done_for_suggested_uf is not initialized in tree-vect-loop.cc

2023-07-03 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110531

--- Comment #4 from Hao Liu  ---
> IMHO, the initialization with false is unnecessary and very likely it isn't 
> able to get optimized, it seems worse from this point of view.

Sorry. I don't think so. See more at
https://www.oreilly.com/library/view/c-coding-standards/0321113586/ch20.html:

Start with a clean slate: Uninitialized variables are a common source of bugs
in C and C++ programs. There are few reasons to ever leave a variable
uninitialized. None is serious enough to justify the hazard of undefined
behavior.

[Bug tree-optimization/110531] Vect: slp_done_for_suggested_uf is not initialized in tree-vect-loop.cc

2023-07-03 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110531

--- Comment #2 from Hao Liu  ---
> Is the warning from some static analyzer?

No. I just thought it might be a bug while looking at the code.

> slp should be true always (always do analyze slp), it doesn't care what's in 
> slp_done_for_suggested_uf.

Oh, I see. This is not a real bug.


IMHO, it would be better to initialize it to "false", which would make the code
much easier to understand.

[Bug tree-optimization/110531] New: Vect: slp_done_for_suggested_uf is not initialized in tree-vect-loop.cc

2023-07-03 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110531

Bug ID: 110531
   Summary: Vect: slp_done_for_suggested_uf is not initialized in
tree-vect-loop.cc
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

This seems to be an obvious bug in tree-vect-loop.cc:

(1) This var is declared (but not initialized) and used in function
vect_analyze_loop_1:

  bool slp_done_for_suggested_uf;   <--- Warning, this is not initialized

  /* Run the main analysis.  */
  opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal,
&suggested_unroll_factor,
slp_done_for_suggested_uf);

(2) It is used before being set in function vect_analyze_loop_2:
static opt_result
vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal,
 unsigned *suggested_unroll_factor,
 bool& slp_done_for_suggested_uf)
  ...
  bool slp = !applying_suggested_uf || slp_done_for_suggested_uf;  <--- used
without being initialized
  ...
  slp_done_for_suggested_uf = slp;


I don't know the detailed logic and wonder whether it should be initialized to
"true" or "false" (probably it should be "false").

[Bug tree-optimization/110474] New: Vect: the epilog vect loop should have small VF if the loop is unrolled during vectorization

2023-06-28 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110474

Bug ID: 110474
   Summary: Vect: the epilog vect loop should have small VF if the
loop is unrolled during vectorization
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

Hi, I'm trying to tune loop unrolling during vectorization (see
tree-vect-loop.cc, suggested_unroll_factor). I find that the unrolling may hurt
performance, as unrolling also increases the VF (vectorization factor) of the
epilog vect loop.

For example:
int foo(short *A, char *B, int N) {
    int sum = 0;
    for (int i = 0; i < N; ++i) {
        sum += A[i] * B[i];
    }
    return sum;
}


Compile it with "-O3 -mtune=neoverse-n2 -mcpu=neoverse-n1 --param
aarch64-vect-unroll-limit=2" (I'm using -mcpu n1 as I want to try a target
without SVE). GCC vectorization pass unrolls the loop by 2 and generates code
as following:

if N >= 32:
    main vect loop ...

if N >= 16:   # This may hurt performance if N is small (e.g. 8)
    epilog vect loop ...

epilog scalar code ...


If the loop is not unrolled (i.e. with "--param aarch64-vect-unroll-limit=1"),
GCC generates code like the following:

if N >= 16:
    main vect loop ...

if N >= 8:
    epilog vect loop ...

epilog scalar code ...


The runtime check is based on the VF of epilog vectorization. There is code in
tree-vect-loop.cc (line 2990) to choose epilog vect VF:
  /* If we're vectorizing an epilogue loop, the vectorized loop either needs
 to be able to handle fewer than VF scalars, or needs to have a lower VF
 than the main loop.  */
  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
  && !LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
  && maybe_ge (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
   LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo)))
return opt_result::failure_at (vect_location,
   "Vectorization factor too high for"
   " epilogue loop.\n");

But it doesn't consider the suggested_unroll_factor. So I'm thinking about
adding the following code to unscale orig_loop_vinfo's VF by the unroll
factor:
  unscaled_orig_vf = exact_div (LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo),
orig_loop_vinfo->suggested_unroll_factor);

Is this reasonable?
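
Concretely, the check above might then compare against the unscaled VF (a sketch
of the idea only, not a tested patch):

  poly_uint64 unscaled_orig_vf
    = exact_div (LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo),
                 orig_loop_vinfo->suggested_unroll_factor);

  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
      && !LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
      && maybe_ge (LOOP_VINFO_VECT_FACTOR (loop_vinfo), unscaled_orig_vf))
    return opt_result::failure_at (vect_location,
                                   "Vectorization factor too high for"
                                   " epilogue loop.\n");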

[Bug tree-optimization/110449] Vect: use a small step to calculate the loop induction if the loop is unrolled during loop vectorization

2023-06-28 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110449

--- Comment #2 from Hao Liu  ---
That looks better than the currently generated code (it saves one "MOV"
instruction). Yes, it has the loop-carried dependency advantage. But it still
uses one more register for "8*step" (There may be a register pressure problem
for complicated code, not for this simple case). 

There is also a floating point precision concern. PR84201 discusses the same
problem for x86: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84201. A larger
step makes the floating point result deviate more from the original scalar
result. E.g. the SPEC2017 fp benchmark 549.fotonik may produce a VE (Validation
Error) after unrolling a loop of doubles:
   319    do ifreq = 1, tmppower%nofreq    <-- HERE
   320      frequency(ifreq,ipower) = freq
   321      freq = freq + freqstep
   322    end do

It uses 4*step for the unrolled vectorization version rather than 2*step for the
non-unrolled version. The SPEC fp check compares the "relative tolerance" of the
fp results, and the deviation is higher than the allowed standard (i.e. the
compare command line option "--reltol 1e-10").

[Bug tree-optimization/110449] New: Vect: use a small step to calculate the loop induction if the loop is unrolled during loop vectorization

2023-06-28 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110449

Bug ID: 110449
   Summary: Vect: use a small step to calculate the loop induction
if the loop is unrolled during loop vectorization
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

This is inspired by clang. Compile the following case with "-mcpu=neoverse-n2
-O3":

void foo(int *arr, int val, int step) {
  for (int i = 0; i < 1024; i++) {
    arr[i] = val;
    val += step;
  }
}

It will be unrolled by 2 during vectorization. GCC generates code:
fmov    s29, w2 # step
shl v27.2s, v29.2s, 3   # 8*step
shl v28.2s, v29.2s, 2   # 4*step
...
.L2:
mov v30.16b, v31.16b
add v31.4s, v31.4s, v27.4s  # += 8*step
add v29.4s, v30.4s, v28.4s  # += 4*step
stp q30, q29, [x0]
add x0, x0, 32
cmp x1, x0
bne .L2

The v27 (i.e. "8*step") is actually not necessary. We can use v29 + v28 (i.e.
"+ 4*step") and generate simpler code:
fmov    s29, w2 # step
shl v28.2s, v29.2s, 2   # 4*step
...
.L2:
add v29.4s, v30.4s, v28.4s  # += 4*step
stp q30, q29, [x0]
add x0, x0, 32
add v30.4s, v29.4s, v28.4s  # += 4*step
cmp x1, x0
bne .L2

This has two benefits:
(1) It saves one vector register and one "mov" instruction.
(2) For floating point, the result computed with the small step should be closer
to the original scalar result than with the large step. I.e. "A + 4*step + ... +
4*step" should be closer to "A + step + ... + step" than "A + 8*step + ... +
8*step".

Do you think this is reasonable?

I have a simple patch that enhances "vectorizable_induction()" in
tree-vect-loop.cc to achieve this. I will send out the patch for code review
later.
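
Conceptually, per vectorized iteration (VF = 4, unrolled by 2), the two update
schemes look like this (pseudo-C over whole vectors, for illustration only, not
the generated code):

  /* current: needs both step4 = {4*step,...} and step8 = {8*step,...}  */
  first  = iv;                 /* mov              */
  second = iv + step4;         /* add with 4*step  */
  iv     = iv + step8;         /* add with 8*step  */

  /* proposed: only step4 is needed, each half chains off the previous  */
  first  = iv;
  second = first + step4;      /* add with 4*step  */
  iv     = second + step4;     /* add with 4*step  */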

[Bug tree-optimization/98598] New: Missed opportunity to optimize dependent loads in loops

2021-01-08 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

Bug ID: 98598
   Summary: Missed opportunity to optimize dependent loads in
loops
   Product: gcc
   Version: tree-ssa
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

As we know, dependent loads are not cache friendly. Especially in nested loops,
dependent loads such as pa->pb->pc->val may be repeated many times. For example:

typedef struct C { int val; } C;
typedef struct B { C *pc; } B;
typedef struct A { B *pb; } A;

int foo (int n, int m, A *pa) {
  int sum;

  for (int i = 0; i < n; i++) {
for (int j = 0; j < m; j++) {
  sum += pa[j].pb->pc->val;  // each value is repeatedly loaded "n" times
  // ...
}
// ...
  }

  return sum;
}

Such an access pattern can be found in real applications and benchmarks, and it
can be critical to performance.

Can we cache the loaded values and avoid the repeated dependent loads? E.g.
transform the above case into the following (assuming there is no aliasing issue
or other clobber, and "n" is big enough):

int foo (int n, int m, A *pa) {
  int *cache = (int *) malloc(m * sizeof(int));
  for (int j = 0; j < m; j++) {
cache[j] = pa[j].pb->pc->val;
  }

  int sum;

  for (int i = 0; i < n; i++) {
for (int j = 0; j < m; j++) {
  sum += cache[j];   // pa[j].pb->pc->val;
  // ...
}
// ...
  }

  free(cache);
  return sum;
}

This should improve performance a lot.

[Bug bootstrap/98318] libcody breaks DragonFly bootstrap

2020-12-22 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98318

--- Comment #8 from Hao Liu  ---
Hi Nathan,

The problem is related to using another make binary, which is version 4.2.0 and
built by ourselves. Maybe there is a strange bug in it.

Anyway, after using the system-installed make (which is 4.2.1 and under
/usr/bin/), the problem is solved.

Thanks for your help!

[Bug bootstrap/98318] libcody breaks DragonFly bootstrap

2020-12-22 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98318

--- Comment #7 from Hao Liu  ---
I found that:
  1. "make -j1" can pass, but "make -j8" always fails. It seems something wrong
with parallel build
  2. When "make -j8" failed, if I try "make -j8" again, it can pass.

> What happens if you cd into the libcody obj directory and try a 'make' there? 
>  (after you've hit the failure).
I tried "make -j8" and libcody.a can be built successfully. This is why the 2nd
"make -j8" try can pass.

> What does that dir's config.log look like?
The config.log looks OK, as "make -j1" always works well. I compared the
config.log of Ubuntu vs CentOS; they are similar. The tail lines of config.log:
---
/* confdefs.h */
#define PACKAGE_NAME "codylib"
#define PACKAGE_TARNAME "codylib"
#define PACKAGE_VERSION "0.0"
#define PACKAGE_STRING "codylib 0.0"
#define PACKAGE_BUGREPORT "github.com/urnathan/libcody"
#define PACKAGE_URL ""
#define BUGURL "github.com/urnathan/libcody"
#define NMS_CHECKING 0

configure: exit 0
---

> The toplevel make knows that libcody must be built before gcc.
So the problem seems to be why libcody.a is not built. The build log of "make
-j8" on CentOS is strange, as it enters build/libcody/ and then leaves the dir
without doing anything. The log is as follows:
---
$ grep "libcody" out-j8.log
checking for memchr... mkdir -p -- ./libcody
checking for unistd.h... Configuring in ./libcody
checking bugurl... github.com/urnathan/libcody
checking for strtol... make[2]: Entering directory '.../build/libcody'
checking whether gcc hidden aliases work... make[2]: Leaving directory
'.../build/libcody'
make[2]: *** No rule to make target '../libcody/libcody.a', needed by
'cc1-checksum.c'.  Stop.
---

It seems nothing happened after entering build/libcody; no building is
triggered in build/libcody (if it were triggered, it should succeed, just as
manually running "make -j8" in build/libcody does). The log of the successful
job is:
---
$ grep "libcody" out-j1.log
mkdir -p -- ./libcody
Configuring in ./libcody
checking bugurl... github.com/urnathan/libcody
make[2]: Entering directory '/home/ec2-user/gcc_tmp/build/libcody'
g++ -g -O2 -fno-enforce-eh-specs -fno-stack-protector -fno-threadsafe-statics
-fno-exceptions -fno-rtti -fdebug-prefix-map=../../gcc/libcody/= -W -Wall
-include config.h -I../../gcc/libcody \
  -MMD -MP -MF buffer.d -c -o buffer.o ../../gcc/libcody/buffer.cc
...
ar -cr libcody.a buffer.o client.o fatal.o netclient.o netserver.o resolver.o
packet.o server.o
ranlib libcody.a
make[2]: Leaving directory '.../build/libcody'
---


Nathan, do you have any idea why libcody.a is not built with "make -j8"?  It
seems configure is OK, but something is wrong with the parallel build. Other
libraries (e.g. gmp, libdecnumber) don't have such a problem. Thanks very much.

[Bug bootstrap/98318] libcody breaks DragonFly bootstrap

2020-12-21 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98318

--- Comment #5 from Hao Liu  ---
Hi Nathan,

We can still reproduce this problem on CentOS 7 (x86) and CentOS 8.2 (AArch64).
We used the latest GCC version as of yesterday: 108beb75da

The configure and build commands are (Bash is used):
$ ../gcc/configure --disable-bootstrap --disable-multilib
--enable-checking=release
$ make -j32
...
make[2]: *** No rule to make target '../libcody/libcody.a', needed by
'cc1-checksum.c'.  Stop.
make[2]: *** Waiting for unfinished jobs

Do you have any idea about how to fix this?

[Bug bootstrap/98318] libcody breaks DragonFly bootstrap

2020-12-21 Thread hliu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98318

Hao Liu  changed:

   What|Removed |Added

 CC||hliu at amperecomputing dot com

--- Comment #3 from Hao Liu  ---
We can reproduce the failure on CentOS, but it passes on Ubuntu.

This failure is related to following files and code:

---
1. gcc/Makefile.in

CODYLIB = ../libcody/libcody.a
BACKEND = libbackend.a main.o libcommon-target.a libcommon.a \
$(CPPLIB) $(CODYLIB) $(LIBDECNUMBER)


2. gcc/gcc/c/Make-lang.in

cc1-checksum.c : build/genchecksum$(build_exeext) checksum-options \
$(C_OBJS) $(BACKEND) $(LIBDEPS) 
---

There seems to be a dependency problem, as "libcody.a" must be ready before
building cc1-checksum.c. But I don't know how to fix it, as I'm not familiar
with Makefiles :(