[Bug tree-optimization/114322] New: [14 Regression] SCEV analysis failed for bases like A[(i+x)*stride] since r14-9193-ga0b1798042d033
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114322

            Bug ID: 114322
           Summary: [14 Regression] SCEV analysis failed for bases like
                    A[(i+x)*stride] since r14-9193-ga0b1798042d033
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

Compile the following case with:

  gcc simp.c -Ofast -mcpu=neoverse-n1 -S \
      -fdump-tree-ifcvt -fdump-tree-vect-details-scev

  int foo (short *A, int x, int stride)
  {
    int sum = 0;
    if (stride > 1)
      {
  #pragma GCC unroll 1
        for (int i = 0; i < 1024; ++i)
          sum += A[(i + x) * stride];
      }
    return sum;
  }

The gimple in the loop is:

  :
  # sum_19 = PHI
  # i_20 = PHI
  # ivtmp_37 = PHI
  _1 = x_12(D) + i_20;
  _2 = _1 * stride_11(D);
  _3 = (long unsigned int) _2;
  _4 = _3 * 2;
  _5 = A_13(D) + _4;
  _6 = *_5;
  _7 = (int) _6;
  sum_15 = _7 + sum_19;

Before the commit (i.e., from the pr114074 bug fix), it can be vectorized:

  Creating dr for *_5
  analyze_innermost: (analyze_scalar_evolution
    (loop_nb = 1)
    (scalar = _5)
    (get_scalar_evolution
      (scalar = _5)
      (scalar_evolution = {A_13(D) + (long unsigned int) (stride_11(D) *
          x_12(D)) * 2, +, (long unsigned int) stride_11(D) * 2}_1))
  )
  success.
  (analyze_scalar_evolution
    (loop_nb = 1)
    (scalar = _5)
    (get_scalar_evolution
      (scalar = _5)
      (scalar_evolution = {A_13(D) + (long unsigned int) (stride_11(D) *
          x_12(D)) * 2, +, (long unsigned int) stride_11(D) * 2}_1))
  )
  (instantiate_scev
    (instantiate_below = 5 -> 3)
    (evolution_loop = 1)
    (chrec = {A_13(D) + (long unsigned int) (stride_11(D) * x_12(D)) * 2, +,
        (long unsigned int) stride_11(D) * 2}_1)
    (res = {A_13(D) + (long unsigned int) (stride_11(D) * x_12(D)) * 2, +,
        (long unsigned int) stride_11(D) * 2}_1))
  base_address: A_13(D) + (sizetype) (stride_11(D) * x_12(D)) * 2
  offset from base address: 0
  constant offset from base address: 0
  step: (ssizetype) ((long unsigned int) stride_11(D) * 2)
  base alignment: 2
  base misalignment: 0
  offset alignment: 128
  step alignment: 2
  base_object: *A_13(D) + (sizetype) (stride_11(D) * x_12(D)) * 2
  Access function 0: {0B, +, (long unsigned int) stride_11(D) * 2}_1

After the commit, loop vectorization fails due to a SCEV failure with *_5:

  Creating dr for *_5
  analyze_innermost: (analyze_scalar_evolution
    (loop_nb = 1)
    (scalar = _5)
    (get_scalar_evolution
      (scalar = _5)
      (scalar_evolution = _5))
  )
  (analyze_scalar_evolution
    (loop_nb = 1)
    (scalar = _5)
    (get_scalar_evolution
      (scalar = _5)
      (scalar_evolution = _5))
  )
  simp.c:11:10: missed: failed: evolution of base is not affine.
  ..
  (res = scev_not_known))

To my understanding, '(i + x) * stride' is a signed integer calculation, in
which overflow is undefined behavior, so the case should be vectorized.
[Bug testsuite/113446] [14 Regression] gcc.dg/tree-ssa/scev-16.c FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113446 --- Comment #6 from Hao Liu --- Hi Jakub, That's great. Thanks for the fix.
[Bug target/110625] [14 Regression][AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #26 from Hao Liu ---
(In reply to Tamar Christina from comment #25)
> Is still pretty inefficient due to all the extends. If we generate better
> code here this may tip the scale back to vector. But for now, the patch
> should fix the regression.

That's great. Thanks a lot!
[Bug target/113089] New: [14 Regression][aarch64] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252 since r14-6605-gc0911c6b357ba9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113089

            Bug ID: 113089
           Summary: [14 Regression][aarch64] ICE in
                    process_uses_of_deleted_def, at rtl-ssa/changes.cc:252
                    since r14-6605-gc0911c6b357ba9
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

SPEC2017 525.x264 build failure. Options are:

  -O3 -mcpu=neoverse-n1 -funroll-loops -flto=32
  --param early-inlining-insns=96 --param max-inline-insns-auto=64
  --param inline-unit-growth=96

The failure happens while doing LTO optimization:

  gcc -std=c99 ... -o ldecod_r
  during RTL pass: ldp_fusion
  ldecod_src/intra_chroma_pred.c: In function 'intrapred_chroma':
  ldecod_src/intra_chroma_pred.c:420:1: internal compiler error: in
  process_uses_of_deleted_def, at rtl-ssa/changes.cc:252
    420 | }
        | ^
  0x1ccbbab rtl_ssa::function_info::process_uses_of_deleted_def(rtl_ssa::set_info*)
          ../../gcc/gcc/rtl-ssa/changes.cc:252
  0x1cce34f rtl_ssa::function_info::change_insns(array_slice)
          ../../gcc/gcc/rtl-ssa/changes.cc:799
  0x1371843 ldp_bb_info::fuse_pair(bool, unsigned int, int,
  rtl_ssa::insn_info*, rtl_ssa::insn_info*, base_cand&,
  rtl_ssa::insn_range_info const&)
          ../../gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:1520
  0x1374663 ldp_bb_info::try_fuse_pair(bool, unsigned int,
  rtl_ssa::insn_info*, rtl_ssa::insn_info*)
          ../../gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:2217
  0x1374a8f ldp_bb_info::merge_pairs(std::__cxx11::list >&,
  std::__cxx11::list >&, bool, unsigned int)
          ../../gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:2306
  0x1377bfb ldp_bb_info::transform_for_base(int, access_group&)
          ../../gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:2339
  0x1377bfb void ldp_bb_info::traverse_base_map, int_hash >, access_group,
  simple_hashmap_traits, int_hash > >, access_group> >
  >(ordered_hash_map, int_hash >, access_group, simple_hashmap_traits,
  int_hash > >, access_group> >&)
          ../../gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:2398
  0x136e29b ldp_bb_info::transform()
          ../../gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:2406
  0x136e29b ldp_fusion_bb(rtl_ssa::bb_info*)
          ../../gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:2634
  0x136ee93 ldp_fusion()
          ../../gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:2643
  0x136eefb execute
          ../../gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:2693
[Bug tree-optimization/112774] New: Vectorize the loop by inferring nonwrapping information from arrays
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112774

            Bug ID: 112774
           Summary: Vectorize the loop by inferring nonwrapping information
                    from arrays
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

This case is extracted from another benchmark and is simpler than the case in
PR101450, as it has the additional boundary information from the array:

  int A[1024 * 2];

  int foo (unsigned offset, unsigned N)
  {
    int sum = 0;
    for (unsigned i = 0; i < N; i++)
      sum += A[i + offset];
    return sum;
  }

The Gimple before the vectorization pass is:

  [local count: 955630224]:
  # sum_12 = PHI
  # i_14 = PHI
  _1 = offset_8(D) + i_14;
  _2 = A[_1];
  sum_9 = _2 + sum_12;
  i_10 = i_14 + 1;

GCC failed to vectorize it as the chrec "{offset_8, +, 1}_1" may
overflow/wrap. I summarized more details in the email:
https://gcc.gnu.org/pipermail/gcc/2023-November/242854.html

Actually, GCC already knows it won't wrap, by inferring the range from the
array (in estimate_numbers_of_iterations -> infer_loop_bounds_from_undefined
-> infer_loop_bounds_from_array):

  Induction variable (unsigned int) offset_8(D) + 1 * iteration does not wrap
  in statement _2 = A[_1]; in loop 1.
  Statement _2 = A[_1]; is executed at most 2047 (bounded by 2047) + 1 times
  in loop 1.

We can re-use this information to vectorize this case. I already have a
simple patch to achieve this, and will send it out later (after doing more
tests).
[Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #19 from Hao Liu ---
> Hi, here's the reduced case

Hi Tamar, thanks for the case. I've modified it to reproduce the ICE without
LTO and have updated the patch.
[Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #17 from Hao Liu ---
> Thanks! I can reduce a testcase for you if you want :)

That will be very helpful. Thanks.
[Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625 --- Comment #15 from Hao Liu --- Ah, I see. I've sent out a quick fix patch for code review. I'll investigate more about this and find out the root cause.
[Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625 --- Comment #11 from Hao Liu --- Hi Richard, That's great! Glad to hear the status. Waiting for the patches to be ready and upstreamed to trunk.
[Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #8 from Hao Liu ---
Thanks for the explanation. Understood the root cause, and that's reasonable.

So, do you have a plan to fix this (i.e. to separate the FP and integer
types)? I want to enable the new costs for Ampere1, which is similar to N2's
issue-info. If this problem won't be fixed in the near future, I think a
workaround is probably to adjust the general_ops in the issue_info, e.g. set
the general_ops of both scalar and vector to 3 instead of the current values
of "4 and 2".
[Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #6 from Hao Liu ---
Thanks for the confirmation about the reduction latency. I'll create a simple
patch to fix this.

> Discounting the loads, we do have 15 general operations.

That's true, and there are indeed 8 general operations for the scalar loop.
As count_ops() is accurate, it seems the vector body cost may be too large
(Vector inside of loop cost: 51):

  *k_48 4 times vec_perm costs 12 in body
  *k_48 1 times unaligned_load (misalign -1) costs 4 in body
  _5->m1 1 times vec_perm costs 3 in body
  _5->m4 1 times unaligned_load (misalign -1) costs 4 in body
  (int) _24 2 times vec_promote_demote costs 4 in body
  (double) _25 4 times vec_promote_demote costs 8 in body
  _2 * _26 4 times vector_stmt costs 8 in body

If it were small enough, SLP would still be profitable even after the
vect-body cost is increased according to the issue-info.

I'm not quite familiar with this part and it may affect all aarch64 targets,
so it's hard for me to fix. It would be great if you could look at how to fix
this.
[Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #3 from Hao Liu ---
Sorry, it seems this case cannot be fixed by only adjusting the calculation
of "reduction latency". Even if it becomes smaller, the case still cannot be
vectorized, as the "general operations" count is still too large:

  Original vector body cost = 51
  Scalar issue estimate:
    ...
    general operations = 8
    reduction latency = 2
    estimated min cycles per iteration = 2.00
    estimated cycles per vector iteration (for VF 2) = 4.00
  Vector issue estimate:
    ...
    general operations = 15   <-- Too large
    reduction latency = 2     <-- from 8 to 2
    estimated min cycles per iteration = 7.50
  Increasing body cost to 96 because scalar code would issue more quickly
  ...
  missed: cost model: the vector iteration cost = 96 divided by the scalar
  iteration cost = 44 is greater or equal to the vectorization factor = 2.
  missed: not vectorized: vectorization not profitable.
[Bug target/110649] [14 Regression] 25% sphinx3 spec2006 regression on Ice Lake and zen between g:acaa441a98bebc52 (2023-07-06 11:36) and g:55900189ab517906 (2023-07-07 00:23)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110649

--- Comment #2 from Hao Liu ---
Hi, I bisected the following 3 commits (sequential):

  [v3] 3a61ca1b925 - Improve profile updates after loop-ch and cunroll (2023-07-06)
  [v2] d4c2e34deef - Improve scale_loop_profile (2023-07-06)
  [v1] 224fd59b2dc - Vect: use a small step to calculate induction for the
       unrolled loop (PR tree-optimization/110449) (2023-07-06)

The 1-copy run time in seconds of 482.sphinx3 on zen2:

  v3: 261s
  v2: 231s
  v1: 231s

So the regression should be caused by 3a61ca1b925, i.e.
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=3a61ca1b9256535e1bfb19b2d46cde21f3908a5d
[Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #2 from Hao Liu ---
To my understanding, the "reduction latency" is the least number of cycles
needed to do the reduction calculation for one iteration of the loop. It is
calculated from the extra instruction issue-info of the new cost models in
the AArch64 backend.

Usually, the reduction latency of the vectorized loop should be smaller than
that of the scalar loop. If the latency of the vectorized loop is larger, the
cost model thinks vectorization may not be beneficial, so it increases the
vect-body costs by the scale of vect_reduct_latency/scalar_reduct_latency in
the above case.

For the above case, it thinks the scalar loop needs 4 cycles (2*VF=4) to
calculate "results.m += rhs", while the vectorized loop needs 8 cycles
(2*count=8). As a result, the vect-body costs are doubled from the original
value of 51 to 102. That seems not true for the vectorized loop, which should
only need 2 cycles to calculate the SIMD version of "results.m += rhs".
[Bug target/110625] New: [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

            Bug ID: 110625
           Summary: [AArch64] Vect: SLP fails to vectorize a loop as the
                    reduction_latency calculated by new costs is too large
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

This problem causes a performance regression in SPEC2017 538.imagick. For the
following simple case (modified from pr96208):

  typedef struct {
      unsigned short m1, m2, m3, m4;
  } the_struct_t;
  typedef struct {
      double m1, m2, m3, m4, m5;
  } the_struct2_t;

  double bar1 (the_struct2_t*);

  double foo (double* k, unsigned int n, the_struct_t* the_struct)
  {
    unsigned int u;
    the_struct2_t result;
    for (u=0; u < n; u++, k--) {
      result.m1 += (*k)*the_struct[u].m1;
      result.m2 += (*k)*the_struct[u].m2;
      result.m3 += (*k)*the_struct[u].m3;
      result.m4 += (*k)*the_struct[u].m4;
    }
    return bar1 (&result);
  }

Compile it with "-Ofast -S -mcpu=neoverse-n2 -fdump-tree-vect-details
-fno-tree-slp-vectorize". SLP fails to vectorize the loop as the vector body
cost is increased due to the too large "reduction latency". See the dump of
the vect pass:

  Original vector body cost = 51
  Scalar issue estimate:
    ...
    reduction latency = 2
    estimated min cycles per iteration = 2.00
    estimated cycles per vector iteration (for VF 2) = 4.00
  Vector issue estimate:
    ...
    reduction latency = 8   <-- Too large
    estimated min cycles per iteration = 8.00
  Increasing body cost to 102 because scalar code would issue more quickly
  Cost model analysis:
    Vector inside of loop cost: 102
    ...
    Scalar iteration cost: 44
    ...
  missed: cost model: the vector iteration cost = 102 divided by the scalar
  iteration cost = 44 is greater or equal to the vectorization factor = 2.
  missed: not vectorized: vectorization not profitable.

SLP will succeed with "-mcpu=neoverse-n1", as N1 doesn't use the new vector
costs and the vector body cost is not increased.
The "reduction latency" is calculated in aarch64.cc count_ops():

  /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
     that's not yet the case.  */
  ops->reduction_latency = MAX (ops->reduction_latency, base * count);

For this case, the "base" is 2 and the "count" is 4. To my understanding, the
"count" of SLP means the number of scalar stmts (i.e. results.m1 +=, ...) in
a permutation group to be merged into a vector stmt. It seems not reasonable
to multiply the cost by "count" (maybe it doesn't consider the SLP
situation). So, I'm thinking to calculate it differently for the SLP
situation, e.g.

  unsigned int latency = PURE_SLP_STMT (stmt_info) ? base : base * count;
  ops->reduction_latency = MAX (ops->reduction_latency, latency);

Is this reasonable?
[Bug tree-optimization/110474] Vect: the epilog vect loop should have small VF if the loop is unrolled during vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110474

Hao Liu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #3 from Hao Liu ---
It would be better to have a suggested_epilog_"unroll" factor or to support
multiple epilogues, but that needs a lot of work. Let's go with the simple
patch first.
[Bug tree-optimization/110531] Vect: slp_done_for_suggested_uf is not initialized in tree-vect-loop.cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110531

Hao Liu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #12 from Hao Liu ---
OK. Now I get your point that a useless initialization may introduce extra
cost.
[Bug tree-optimization/110531] Vect: slp_done_for_suggested_uf is not initialized in tree-vect-loop.cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110531

--- Comment #10 from Hao Liu ---
> foo is just an example for not getting inlined, the point here is extra cost
> paid.

My point is that the case is different from the original case in
tree-vect-loop.cc. For example, change the case as follows:

  __attribute__((noipa)) int foo(int *a) { return *a == 1 ? 1 : 0; }

That is similar to the original problem (the value of "a" is undefined). I
don't mean that "a" must be initialized in test(). We can also initialize "a"
in foo, but we should not use "a" before initialization. E.g.

  __attribute__((noipa)) int foo(int *a) {
    *a = 1;
    ...
    if (*a)
    ...
  }

The above case has no problem.
[Bug tree-optimization/110531] Vect: slp_done_for_suggested_uf is not initialized in tree-vect-loop.cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110531

--- Comment #7 from Hao Liu ---
> int foo() {
>   bool a = true;
>   bool b;
>   if (a || b)
>     return 1;
>   b = true;
>   return 0;
> }
>
> still has the warning, it looks something can be improved (guess we prefer
> not to emit warning).

Your case is wrong: you should initialize "b", and then there will be no
warning.

> __attribute__((noipa)) int foo(int *a) { *a = 1; return 1;}
>
> int test(){
> #ifdef AINIT
>   int a = 0;
> #else
>   int a;
> #endif
>   int b = foo(&a);
>   return b;
> }

This case doesn't have a problem. If "foo" used "a" directly, the result
would be undefined behavior, which causes both correctness and performance
issues.
[Bug tree-optimization/110531] Vect: slp_done_for_suggested_uf is not initialized in tree-vect-loop.cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110531

--- Comment #5 from Hao Liu ---
BTW, the reason there is no warning is probably that the original code is too
complicated and not inlined. Compile the simple case with "g++ -O3 -S -Wall
hello.c":

  int foo(bool a) {
    bool b;
    if (a || b)
      return 1;
    b = true;
    return 0;
  }

gcc reports a warning:

  hello.c: In function ‘int foo(bool)’:
  hello.c:4:9: warning: ‘b’ is used uninitialized [-Wuninitialized]
      4 |     if (a || b)
        |         ~~^~~~
[Bug tree-optimization/110531] Vect: slp_done_for_suggested_uf is not initialized in tree-vect-loop.cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110531

--- Comment #4 from Hao Liu ---
> IMHO, the initialization with false is unnecessary and very likely it isn't
> able to get optimized, it seems worse from this point of view.

Sorry, I don't think so. See more at
https://www.oreilly.com/library/view/c-coding-standards/0321113586/ch20.html:

  Start with a clean slate: Uninitialized variables are a common source of
  bugs in C and C++ programs. There are few reasons to ever leave a variable
  uninitialized. None is serious enough to justify the hazard of undefined
  behavior.
[Bug tree-optimization/110531] Vect: slp_done_for_suggested_uf is not initialized in tree-vect-loop.cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110531

--- Comment #2 from Hao Liu ---
> Is the warning from some static analyzer?

No. I just found what may be a bug while looking at the code.

> slp should be true always (always do analyze slp), it doesn't care what's in
> slp_done_for_suggested_uf.

Oh, I see. This is not a real bug. IMHO, it would be better to initialize it
to "false", which should make the code much easier to understand.
[Bug tree-optimization/110531] New: Vect: slp_done_for_suggested_uf is not initialized in tree-vect-loop.cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110531

            Bug ID: 110531
           Summary: Vect: slp_done_for_suggested_uf is not initialized in
                    tree-vect-loop.cc
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

This seems an obvious bug in tree-vect-loop.cc:

(1) This variable is declared (but not initialized) and used in function
vect_analyze_loop_1:

  bool slp_done_for_suggested_uf;   <-- Warning, this is not initialized

  /* Run the main analysis.  */
  opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal,
                                        &suggested_unroll_factor,
                                        slp_done_for_suggested_uf);

(2) It is used before being set in function vect_analyze_loop_2:

  static opt_result
  vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal,
                       unsigned *suggested_unroll_factor,
                       bool& slp_done_for_suggested_uf)
  ...
    bool slp = !applying_suggested_uf || slp_done_for_suggested_uf;  <-- used before being initialized
  ...
    slp_done_for_suggested_uf = slp;

I don't know the detailed logic and wonder if it should be initialized to
"true" or "false" (probably it should be "false").
[Bug tree-optimization/110474] New: Vect: the epilog vect loop should have small VF if the loop is unrolled during vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110474

            Bug ID: 110474
           Summary: Vect: the epilog vect loop should have small VF if the
                    loop is unrolled during vectorization
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

Hi,

I'm trying to tune loop unrolling during vectorization (see more:
tree-vect-loop.cc suggested_unroll_factor). I find the unrolling may hurt
performance, as unrolling also increases the VF (vector factor) of the epilog
vect loop. For example:

  int foo(short *A, char *B, int N) {
      int sum = 0;
      for (int i = 0; i < N; ++i) {
          sum += A[i] * B[i];
      }
      return sum;
  }

Compile it with "-O3 -mtune=neoverse-n2 -mcpu=neoverse-n1 --param
aarch64-vect-unroll-limit=2" (I'm using -mcpu n1 as I want to try a target
without SVE). The GCC vectorization pass unrolls the loop by 2 and generates
code as follows:

  if N >= 32:
      main vect loop
      ...
  if N >= 16:   # This may hurt performance if N is small (e.g. 8)
      epilog vect loop
      ...
  epilog scalar code
  ...

If the loop is not unrolled (i.e. with "--param aarch64-vect-unroll-limit=1"),
GCC generates code as follows:

  if N >= 16:
      main vect loop
      ...
  if N >= 8:
      epilog vect loop
      ...
  epilog scalar code
  ...

The runtime check is based on the VF of the epilog vectorization. There is
code in tree-vect-loop.cc (line 2990) to choose the epilog vect VF:

  /* If we're vectorizing an epilogue loop, the vectorized loop either needs
     to be able to handle fewer than VF scalars, or needs to have a lower VF
     than the main loop.  */
  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
      && !LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
      && maybe_ge (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
                   LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo)))
    return opt_result::failure_at (vect_location,
                                   "Vectorization factor too high for"
                                   " epilogue loop.\n");

But it doesn't consider the suggested_unroll_factor.
So I'm thinking about adding the following code to unscale the
orig_loop_vinfo's VF by the unroll factor:

  unscaled_orig_vf = exact_div (LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo),
                                orig_loop_vinfo->suggested_unroll_factor);

Is this reasonable?
[Bug tree-optimization/110449] Vect: use a small step to calculate the loop induction if the loop is unrolled during loop vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110449

--- Comment #2 from Hao Liu ---
That looks better than the currently generated code (it saves one "MOV"
instruction). Yes, it has the loop-carried dependency advantage. But it still
uses one more register for "8*step" (there may be a register pressure problem
for complicated code, though not for this simple case).

There is still a floating point precision problem. PR84201 discusses the same
problem for X86: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84201. The
larger step makes the floating point result deviate more from the original
scalar result. E.g. the SPEC2017 fp benchmark 549.fotonik may report a VE
(Validation Error) after unrolling a loop of doubles:

  319  do ifreq = 1, tmppower%nofreq    <-- HERE
  320    frequency(ifreq,ipower) = freq
  321    freq = freq + freqstep
  322  end do

It uses 4*step for the unrolled vectorization version rather than the 2*step
of the non-unrolled vectorization version. The SPEC fp result check compares
the "relative tolerance" of the fp results, and it exceeds the current
standard (i.e. the compare command line option of "--reltol 1e-10").
[Bug tree-optimization/110449] New: Vect: use a small step to calculate the loop induction if the loop is unrolled during loop vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110449

            Bug ID: 110449
           Summary: Vect: use a small step to calculate the loop induction
                    if the loop is unrolled during loop vectorization
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

This is inspired by clang. Compile the following case with "-mcpu=neoverse-n2
-O3":

  void foo(int *arr, int val, int step) {
      for (int i = 0; i < 1024; i++) {
          arr[i] = val;
          val += step;
      }
  }

It will be unrolled by 2 during vectorization. GCC generates code:

  fmov    s29, w2              # step
  shl     v27.2s, v29.2s, 3    # 8*step
  shl     v28.2s, v29.2s, 2    # 4*step
  ...
  .L2:
  mov     v30.16b, v31.16b
  add     v31.4s, v31.4s, v27.4s    # += 8*step
  add     v29.4s, v30.4s, v28.4s    # += 4*step
  stp     q30, q29, [x0]
  add     x0, x0, 32
  cmp     x1, x0
  bne     .L2

The v27 (i.e. "8*step") is actually not necessary. We can use v29 + v28
(i.e. "+ 4*step") and generate simpler code:

  fmov    s29, w2              # step
  shl     v28.2s, v29.2s, 2    # 4*step
  ...
  .L2:
  add     v29.4s, v30.4s, v28.4s    # += 4*step
  stp     q30, q29, [x0]
  add     x0, x0, 32
  add     v30.4s, v29.4s, v28.4s    # += 4*step
  cmp     x1, x0
  bne     .L2

This has two benefits:
(1) It saves one vector register and one "mov" instruction.
(2) For floating point, the result of the small step should be closer to the
    original scalar result than that of the large step, i.e. "A + 4*step +
    ... + 4*step" should be closer to "A + step + ... + step" than "A +
    8*step + ... 8*step".

Do you think this is reasonable? I have a simple patch that enhances
vectorizable_induction() in tree-vect-loop.cc to achieve this. I will send
out the patch for code review later.
[Bug tree-optimization/98598] New: Missed opportunity to optimize dependent loads in loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

            Bug ID: 98598
           Summary: Missed opportunity to optimize dependent loads in loops
           Product: gcc
           Version: tree-ssa
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

As we know, dependent loads are not friendly to the cache. Especially in
nested loops, dependent loads such as pa->pb->pc->val may be repeated many
times. For example:

  typedef struct C { int val; } C;
  typedef struct B { C *pc; } B;
  typedef struct A { B *pb; } A;

  int foo (int n, int m, A *pa) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < m; j++) {
        sum += pa[j].pb->pc->val;  // each value is repeatedly loaded "n" times
        // ...
      }
      // ...
    }
    return sum;
  }

Such an access pattern can be found in real applications and benchmarks, and
it can be critical to performance. Can we cache the loaded values and avoid
the repeated dependent loads? E.g. transform the above case into the
following (suppose there is no alias issue or other clobber, and "n" is big
enough):

  int foo (int n, int m, A *pa) {
    int *cache = (int *) malloc(m * sizeof(int));
    for (int j = 0; j < m; j++) {
      cache[j] = pa[j].pb->pc->val;
    }
    int sum = 0;
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < m; j++) {
        sum += cache[j];  // pa[j].pb->pc->val;
        // ...
      }
      // ...
    }
    free(cache);
    return sum;
  }

This should improve performance a lot.
[Bug bootstrap/98318] libcody breaks DragonFly bootstrap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98318

--- Comment #8 from Hao Liu ---
Hi Nathan,

The problem is related to using another make binary, which is 4.2.0 and built
by ourselves. Maybe there is a strange bug. Anyway, after using the system
installed make (which is 4.2.1 and under /usr/bin/), the problem is solved.

Thanks for your help!
[Bug bootstrap/98318] libcody breaks DragonFly bootstrap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98318

--- Comment #7 from Hao Liu ---
I found that:
1. "make -j1" can pass, but "make -j8" always fails. Something seems wrong
   with the parallel build.
2. When "make -j8" fails, if I try "make -j8" again, it can pass.

> What happens if you cd into the libcody obj directory and try a 'make' there?
> (after you've hit the failure).

I tried "make -j8" and libcody.a can be built successfully. This is why the
2nd "make -j8" try can pass.

> What does that dir's config.log look like?

The config.log looks OK, as "make -j1" always works well. I compared the
config.log of Ubuntu vs CentOS; they are similar. The tail lines of
config.log:

---
/* confdefs.h */
#define PACKAGE_NAME "codylib"
#define PACKAGE_TARNAME "codylib"
#define PACKAGE_VERSION "0.0"
#define PACKAGE_STRING "codylib 0.0"
#define PACKAGE_BUGREPORT "github.com/urnathan/libcody"
#define PACKAGE_URL ""
#define BUGURL "github.com/urnathan/libcody"
#define NMS_CHECKING 0

configure: exit 0
---

> The toplevel make knows that libcody must be built before gcc.

So the problem seems to be why libcody.a is not built. The build log of "make
-j8" on CentOS is strange, as it enters build/libcody/ and then leaves the
dir without doing anything. The log is as follows:

---
$ grep "libcody" out-j8.log
checking for memchr... mkdir -p -- ./libcody
checking for unistd.h... Configuring in ./libcody
checking bugurl... github.com/urnathan/libcody
checking for strtol... make[2]: Entering directory '.../build/libcody'
checking whether gcc hidden aliases work... make[2]: Leaving directory
'.../build/libcody'
make[2]: *** No rule to make target '../libcody/libcody.a', needed by
'cc1-checksum.c'.  Stop.
---

It seems nothing happened after entering build/libcody: no build is triggered
in build/libcody (if it were triggered, it should succeed, just as when
manually running "make -j8" in build/libcody).
The log of a successful job is:

---
$ grep "libcody" out-j1.log
mkdir -p -- ./libcody
Configuring in ./libcody
checking bugurl... github.com/urnathan/libcody
make[2]: Entering directory '/home/ec2-user/gcc_tmp/build/libcody'
g++ -g -O2 -fno-enforce-eh-specs -fno-stack-protector
-fno-threadsafe-statics -fno-exceptions -fno-rtti
-fdebug-prefix-map=../../gcc/libcody/= -W -Wall -include config.h
-I../../gcc/libcody \
  -MMD -MP -MF buffer.d -c -o buffer.o ../../gcc/libcody/buffer.cc
...
ar -cr libcody.a buffer.o client.o fatal.o netclient.o netserver.o
resolver.o packet.o server.o
ranlib libcody.a
make[2]: Leaving directory '.../build/libcody'
---

Nathan, do you have any idea why libcody.a is not built with "make -j8"? It
seems configure is OK, but something is wrong with the parallel build. Other
libraries (e.g. gmp, libdecnumber) don't have such a problem.

Thanks very much.
[Bug bootstrap/98318] libcody breaks DragonFly bootstrap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98318

--- Comment #5 from Hao Liu ---
Hi Nathan,

We can still reproduce this problem on CentOS 7 (X86) and CentOS 8.2
(AArch64), with yesterday's latest GCC revision 108beb75da.

The configure and build commands are (Bash is used):

  $ ../gcc/configure --disable-bootstrap --disable-multilib --enable-checking=release
  $ make -j32
  ...
  make[2]: *** No rule to make target '../libcody/libcody.a', needed by
  'cc1-checksum.c'.  Stop.
  make[2]: *** Waiting for unfinished jobs

Do you have any idea about how to fix this?
[Bug bootstrap/98318] libcody breaks DragonFly bootstrap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98318

Hao Liu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hliu at amperecomputing dot com

--- Comment #3 from Hao Liu ---
We can reproduce the failure on CentOS, but Ubuntu passes. The failure is
related to the following files and code:

---
1. gcc/Makefile.in

  CODYLIB = ../libcody/libcody.a
  BACKEND = libbackend.a main.o libcommon-target.a libcommon.a \
    $(CPPLIB) $(CODYLIB) $(LIBDECNUMBER)

2. gcc/gcc/c/Make-lang.in

  cc1-checksum.c : build/genchecksum$(build_exeext) checksum-options \
    $(C_OBJS) $(BACKEND) $(LIBDEPS)
---

There seems to be a dependency problem, as "libcody.a" must be ready before
building cc1-checksum.c. But I don't know how to fix it, as I'm not familiar
with Makefiles :(