[Bug tree-optimization/116785] [15 Regression] RAJAPerf REDUCE_SUM regresses with r15-792-gf0a02467bbc35a
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116785 --- Comment #14 from kugan at gcc dot gnu.org --- (In reply to Richard Biener from comment #13) > Did it help? Thanks for the quick fix. This commit recovers most of the regression. Please note that the current trunk seems to be broken for unrelated reasons, so I tried this patch on an earlier working revision, and that brought back the performance.
[Bug tree-optimization/116785] [15 Regression] RAJAPerf REDUCE_SUM regresses with r15-792-gf0a02467bbc35a
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116785 --- Comment #10 from kugan at gcc dot gnu.org --- Created attachment 59186 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59186&action=edit reduced test (second attempt) Sorry about the test case. Here is another attempt at reducing.
[Bug tree-optimization/116785] RAJAPerf REDUCE_SUM regresses with commit g:f0a02467bbc35a478eb82f5a8a7e8870827b51fc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116785 --- Comment #2 from kugan at gcc dot gnu.org --- Created attachment 59155 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59155&action=edit creduce reduced file
[Bug tree-optimization/116785] RAJAPerf REDUCE_SUM regresses with commit f0a02467bbc35a478eb82f5a8a7e8870827b51fc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116785 --- Comment #1 from kugan at gcc dot gnu.org --- Created attachment 59154 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59154&action=edit preprocessed file
[Bug tree-optimization/116785] New: RAJAPerf REDUCE_SUM regresses with commit f0a02467bbc35a478eb82f5a8a7e8870827b51fc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116785

Bug ID: 116785
Summary: RAJAPerf REDUCE_SUM regresses with commit f0a02467bbc35a478eb82f5a8a7e8870827b51fc
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: kugan at gcc dot gnu.org
Target Milestone: ---

Some of the loops in RAJAPerf are not vectorized with the change. This results in a ~64% regression for this and some other kernels. This regression can also be observed against gcc 11 (I tried only this version).

g++ -Ofast -S CONVECTION3DPA-Seq.cpp.ii -fopt-info-vec -fpermissive shows:

/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:80:31: optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:80:31: optimized: loop versioned for vectorization because of possible aliasing
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:47:31: optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:47:31: optimized: loop versioned for vectorization because of possible aliasing
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/tpl/RAJA/include/RAJA/policy/loop/launch.hpp:101:23: optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/tpl/RAJA/include/RAJA/policy/loop/launch.hpp:101:23: optimized: loop versioned for vectorization because of possible aliasing
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/tpl/RAJA/include/RAJA/policy/loop/launch.hpp:101:23: optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/tpl/RAJA/include/RAJA/policy/loop/launch.hpp:101:23: optimized: loop versioned for vectorization because of possible aliasing

With the patch reverted:

/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:100:29: optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:89:29: optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:80:31: optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:67:29: optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:58:31: optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:47:31: optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/tpl/RAJA/include/RAJA/policy/loop/launch.hpp:101:23: optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/tpl/RAJA/include/RAJA/policy/loop/launch.hpp:101:23: optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/tpl/RAJA/include/RAJA/policy/loop/launch.hpp:101:23: optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/tpl/RAJA/include/RAJA/policy/loop/launch.hpp:101:23: optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/tpl/RAJA/include/RAJA/policy/loop/launch.hpp:101:23: optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/tpl/RAJA/include/RAJA/policy/loop/launch.hpp:101:23: optimized: loop vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:162:41: optimized: basic block part vectorized using 16 byte vectors
/proj/ta/tests/OpenMP_4.5_Test_Suite/benchmarks/RAJAPerf/src/apps/CONVECTION3DPA-Seq.cpp:40:29: optimized: basic block part vectorized using 16 byte vectors

g++ -v
Using built-in specs.
COLLECT_GCC=/local/home/kvivekananda/install/bin/g++
COLLECT_LTO_WRAPPER=/local/home/kvivekananda/install/libexec/gcc/aarch64-unknown-linux-gnu/15.0.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: ../gcc/configure --disable-bootstrap --enable-multiarch=yes --enable-languages=c,c++,fortran,lto --prefix=/local/home/kvivekananda/install : (reconfigured) ../gcc/configure --disable-bootstrap --enable-multiarch=yes --prefix=/local/home/kvivekananda/install --enable-languages=c,c++,fortran,lto --no-create --no-recursion
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 15.0.0 20240917
[Bug target/115258] [14 Regression] register swaps for vector perm in some cases after r14-6290
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258

kugan at gcc dot gnu.org changed:

What |Removed |Added
CC | |kugan at gcc dot gnu.org

--- Comment #6 from kugan at gcc dot gnu.org ---
(In reply to GCC Commits from comment #3)
> The trunk branch has been updated by Richard Sandiford:
>
> https://gcc.gnu.org/g:39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec
>
> commit r15-906-g39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec
> Author: Richard Sandiford
> Date: Wed May 29 16:43:33 2024 +0100
>
> aarch64: Split aarch64_combinev16qi before RA [PR115258]
>
> Two-vector TBL instructions are fed by an aarch64_combinev16qi, whose
> purpose is to put the two input data vectors into consecutive registers.
> This aarch64_combinev16qi was then split after reload into individual
> moves (from the first input to the first half of the output, and from
> the second input to the second half of the output).
>
> In the worst case, the RA might allocate things so that the destination
> of the aarch64_combinev16qi is the second input followed by the first
> input. In that case, the split form of aarch64_combinev16qi uses three
> eors to swap the registers around.
>
> This PR is about a test where this worst case occurred. And given the
> insn description, that allocation doesn't seem unreasonable.
>
> early-ra should (hopefully) mean that we're now better at allocating
> subregs of vector registers. The upcoming RA subreg patches should
> improve things further. The best fix for the PR therefore seems
> to be to split the combination before RA, so that the RA can see
> the underlying moves.
>
> Perhaps it even makes sense to do this at expand time, avoiding the need
> for aarch64_combinev16qi entirely. That deserves more experimentation
> though.
>
> gcc/
> PR target/115258
> * config/aarch64/aarch64-simd.md (aarch64_combinev16qi): Allow
> the split before reload.
> * config/aarch64/aarch64.cc (aarch64_split_combinev16qi):
> Generalize
> into a form that handles pseudo registers.
>
> gcc/testsuite/
> PR target/115258
> * gcc.target/aarch64/pr115258.c: New test.

This is causing a performance regression in some TSVC kernels and others. Here is an example: https://godbolt.org/z/r91nYEEsP

We now get:

.L3:
        add     x3, x26, x0
        add     x2, x25, x0
        add     x3, x3, 65536
        add     x2, x2, 65536
        sub     x0, x0, #16
        ldr     q31, [x3, 62448]
        mov     v28.16b, v31.16b
        mov     v29.16b, v31.16b
        tbl     v31.16b, {v28.16b - v29.16b}, v30.16b
        fadd    v31.4s, v31.4s, v25.4s
        mov     v26.16b, v31.16b
        mov     v27.16b, v31.16b
        tbl     v31.16b, {v26.16b - v27.16b}, v30.16b
        str     q31, [x2, 62448]
        cmp     x0, x27
        bne     .L3
[Bug middle-end/116626] ICE while VLA vectorisation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116626 --- Comment #1 from kugan at gcc dot gnu.org --- Looks like a duplicate of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116569
[Bug middle-end/116626] New: ICE while VLA vectorisation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116626

Bug ID: 116626
Summary: ICE while VLA vectorisation
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: kugan at gcc dot gnu.org
Target Milestone: ---

Created attachment 59057 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59057&action=edit
testcase

For a partially reduced code, I am seeing:

t.cpp:350:12: internal compiler error: in to_constant, at poly-int.h:592
  350 | void _M_run() { _M_func(); }
      |^~
0x38da7fb internal_error(char const*, ...)
        ../../gcc/gcc/diagnostic-global-context.cc:492
0x38b1e5f fancy_abort(char const*, int, char const*)
        ../../gcc/gcc/diagnostic.cc:1658
0xf747cb poly_int<2u, unsigned long>::to_constant() const
        ../../gcc/gcc/poly-int.h:592
0x20bc547 nunits_for_known_piecewise_op
        ../../gcc/gcc/tree-vect-generic.cc:96
0x20bcff3 expand_vector_piecewise
        ../../gcc/gcc/tree-vect-generic.cc:290
0x20c1cc7 expand_vector_operation
        ../../gcc/gcc/tree-vect-generic.cc:1257
0x20c7f37 expand_vector_operations_1
        ../../gcc/gcc/tree-vect-generic.cc:2366
0x20c8117 expand_vector_operations
        ../../gcc/gcc/tree-vect-generic.cc:2400
0x20c8417 execute
        ../../gcc/gcc/tree-vect-generic.cc:2497

compile command: -O3 -mcpu=neoverse-v2 t.cpp -std=gnu++20

g++ -v
Using built-in specs.
COLLECT_GCC=/proj/grco/gcc/Linux_aarch64/upstream-main/latest/bin/g++
COLLECT_LTO_WRAPPER=/proj/grco/gcc/Linux_aarch64/upstream-main/20240905192606-a2e28b10/bin/../libexec/gcc/aarch64-unknown-linux-gnu/15.0.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: /var/jenkins/workspace/GCC_oss-main/configure --enable-multiarch=yes --enable-languages=c,c++,fortran,lto --prefix=/proj/grco/gcc/Linux_aarch64/oss-main/20240905192606-a2e28b10
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 15.0.0 20240906 (experimental) (GCC)
[Bug middle-end/116562] New: wrong cost of gather load preventing loop from vectored
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116562

Bug ID: 116562
Summary: wrong cost of gather load preventing loop from vectored
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: kugan at gcc dot gnu.org
Target Milestone: ---

typedef int real_t;
extern __attribute__((aligned(64))) real_t a[32000],b[32000],c[32000],d[32000];

void s4117()
{
  for (int i = 0; i < 32000; i++) {
    a[i] = b[i] + c[i/2] * d[i];
  }
}

is not vectorized for AdvSIMD due to a wrong cost calculation.

Compiler options used: cc1plus -Ofast -fdump-tree-vect-all -mcpu=neoverse-v2 --param=aarch64-autovec-preference=1

tt.c:6:21: note: Cost model analysis:
  Vector inside of loop cost: 64
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar iteration cost: 15
  Scalar outside cost: 0
  Vector outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
tt.c:6:21: missed: cost model: the vector iteration cost = 64 divided by the scalar iteration cost = 15 is greater or equal to the vectorization factor = 4.
tt.c:6:21: missed: not vectorized: vectorization not profitable.
tt.c:6:21: missed: not vectorized: vector version will never be profitable.
tt.c:6:21: missed: Loop costings may not be worthwhile.
tt.c:6:21: note: * Analysis failed with vector mode V4SI

We cost this c[i/2] as having the cost of 4 loads and one construct. I think we should special-case this sort of gather load, which has a lower cost in practice?

  if (costing_p)
    {
      /* For emulated gathers N offset vector element
         offset add is consumed by the load).  */
      inside_cost = record_stmt_cost (cost_vec, const_nunits,
                                      vec_to_scalar, stmt_info,
                                      0, vect_body);
      /* N scalar loads plus gathering them into a
         vector.  */
      inside_cost
        = record_stmt_cost (cost_vec, const_nunits, scalar_load,
                            stmt_info, 0, vect_body);
      inside_cost
        = record_stmt_cost (cost_vec, 1, vec_construct,
                            stmt_info, 0, vect_body);
      continue;
    }
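The profitability test quoted in the dump can be sketched as a small check. This is only a model of the inequality stated in the "missed" message, written in multiplied form to avoid integer division; the function name and parameters are hypothetical and this is not the vectorizer's own code.

```c
#include <stdbool.h>

/* Hypothetical model of the dump message: the vector loop is reported as
   "never profitable" when one vector iteration costs at least as much as
   vf scalar iterations. */
bool vector_never_profitable(int vec_inside_cost, int scalar_iter_cost, int vf)
{
    return scalar_iter_cost * vf <= vec_inside_cost;
}
```

With the numbers from the dump (64, 15, 4) the check holds because 15 * 4 = 60 is not greater than 64. Under this model, an inside-of-loop cost of 59 or lower would no longer trip the check, which is why the cost assigned to the c[i/2] load matters here.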
[Bug tree-optimization/116528] New: Not vectoring TSVC s318 loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116528

Bug ID: 116528
Summary: Not vectoring TSVC s318 loop
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: kugan at gcc dot gnu.org
Target Milestone: ---

See:

typedef float real_t;
extern __attribute__((aligned(64))) real_t a[32000];

real_t not_woring(struct args_t *func_args)
{
  int k, index;
  int inc = 1;
  real_t max, chksum;
  k = 0;
  index = 0;
  max = (a[0]);
  k += inc;
  for (int i = 1; i < 32000; i++) {
    if (a[k] > max) {
      index = i;
      max = (a[k]);
    }
    k += inc;
  }
  return max + index + 1;
}

Also in https://godbolt.org/z/ra4h6ndKz

The ifcvt dump is:

[local count: 1063004408]:
# k_16 = PHI
# index_17 = PHI
# max_19 = PHI <_29(7), max_10(15)>
# ivtmp_15 = PHI
_1 = a[k_16];
_13 = _1 > max_19;
index_5 = _13 ? k_16 : index_17;
_29 = MAX_EXPR <_1, max_19>;
k_12 = k_16 + 1;
ivtmp_14 = ivtmp_15 - 1;
if (ivtmp_14 != 0) goto ; [98.99%] else goto ; [1.01%]

[local count: 1052266995]:
goto ; [100.00%]

[local count: 10737416]:

Here the PHI

# max_19 = PHI <_29(7), max_10(15)>

has two uses:

_29 = MAX_EXPR <_1, max_19>;

and

_13 = _1 > max_19;

As a result, this is not accepted by vect_is_simple_reduction. How can we support this?
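To make the coupling between the two reductions concrete, here is a minimal sketch that computes the same max and index in two ways: the fused form from the report, and a hand-split form where the max reduction and the index search are separate loops. Both function names are hypothetical, the sketch assumes n >= 1 and no NaN values in the array, and it is not a proposal for how the vectorizer should handle the loop.

```c
/* Fused form, as in the report: max and index are updated together. */
float max_index_fused(const float *a, int n, int *index_out)
{
    float max = a[0];
    int index = 0;
    for (int i = 1; i < n; i++) {
        if (a[i] > max) {
            index = i;
            max = a[i];
        }
    }
    *index_out = index;
    return max;
}

/* Split form (assumption: no NaNs): a plain MAX reduction followed by a
   MIN reduction over the positions that hold the maximum. Because the
   fused loop uses a strict '>', it records the first position of the
   maximum, which is what the second loop recovers. */
float max_index_split(const float *a, int n, int *index_out)
{
    float max = a[0];
    for (int i = 1; i < n; i++)
        max = a[i] > max ? a[i] : max;

    int index = n;
    for (int i = 0; i < n; i++)
        if (a[i] == max && i < index)
            index = i;

    *index_out = index;
    return max;
}
```

In the split form each loop carries a single reduction with a single use of its PHI, at the price of a second pass over the array.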
[Bug tree-optimization/116338] GCC is not vectoring TSVC s255 while clang can
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116338 --- Comment #5 from kugan at gcc dot gnu.org ---
(In reply to Richard Biener from comment #4)
> You can try to see whether adding a SSA copy would make this supported, it
> seems not allowing a PHI is simply a missed feature.

We now fail in

  /* If this isn't a nested cycle or if the nested cycle reduction value
     is used outside of the inner loop we cannot handle uses of the reduction
     value.  */
  if (nlatch_def_loop_uses > 1 || nphi_def_loop_uses > 1)

Even if I comment this out, I see:

t1.c:16:25: note: worklist: examine stmt: _22 = x_18 + y_19;
t1.c:16:25: note: vect_is_simple_use: operand x_18 = PHI <_1(5), x_10(2)>, type of def: unknown
t1.c:16:25: missed: Unsupported pattern.
t1.c:10:6: missed: not vectorized: unsupported use in stmt.
t1.c:16:25: missed: unexpected pattern.
t1.c:16:25: note: * Analysis failed with vector mode V4SF

Do we need to somehow mark both PHI statements as part of the first-order recurrence?

[local count: 1063004408]:
# x_18 = PHI <_1(5), x_10(2)>
# y_19 = PHI
# i_20 = PHI
# ivtmp_17 = PHI
_1 = b[i_20];
_22 = x_18 + y_19;
_3 = _1 + _22;
_4 = _3 * 3.3304291534423828125e-1;
a[i_20] = _4;
i_13 = i_20 + 1;
ivtmp_16 = ivtmp_17 - 1;
if (ivtmp_16 != 0) goto ; [98.99%] else goto ; [1.01%]

[local count: 1052266995]:
goto ; [100.00%]
[Bug tree-optimization/116338] GCC is not vectoring TSVC s255 while clang can
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116338 --- Comment #3 from kugan at gcc dot gnu.org ---
(In reply to Richard Biener from comment #2)
> The issue is the recurrence
>
> [local count: 10737416]:
> x_10 = b[31999];
> y_11 = b[31998];
>
> [local count: 1063004408]:
> # x_18 = PHI <_1(5), x_10(2)>
> # y_19 = PHI
> _1 = b[i_20];
> ..
>
> [local count: 1052266995]:
> goto ; [100.00%]
>
> we handle some cases via vect_phi_first_order_recurrence_p, somebody needs
> to dig in why this one isn't (or can't be) handled with that mechanism.

  /* Ensure the loop latch definition is from within the loop.  */
  edge latch = loop_latch_edge (loop);
  tree ldef = PHI_ARG_DEF_FROM_EDGE (phi, latch);
  if (TREE_CODE (ldef) != SSA_NAME
      || SSA_NAME_IS_DEFAULT_DEF (ldef)
      || is_a <gphi *> (SSA_NAME_DEF_STMT (ldef))
      || !flow_bb_inside_loop_p (loop, gimple_bb (SSA_NAME_DEF_STMT (ldef))))
    return false;

(gdb) p debug_tree (ldef)
unit-size align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0xf7a8b2a0 precision:32 pointer_to_this >
visited var def_stmt x_18 = PHI <_1(5), x_10(2)>
version:18>
$1 = void

That is, the PHI argument defined along the loop latch is itself defined by a PHI statement in this case, so the check above returns false.
[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635 --- Comment #20 from kugan at gcc dot gnu.org ---
(In reply to Richard Sandiford from comment #19)
> (In reply to Richard Biener from comment #14)
> > Usually targets do have a limit on the actual length but I see
> > constant_upper_bound_with_limit doesn't query such. But it would
> > be a more appropriate way to say there might be an actual target limit here?
> The discussion has moved on, but FWIW: this was a deliberate choice.
> The thinking at the time was that VLA code should be truly “agnostic”
> and not hard-code an upper limit. Hard-coding a limit would be hard-coding
> an assumption that the architectural maximum would never increase in future.
>
> (The main counterargument was that any uses of the .B form of TBL would
> break down for >256-byte vectors. We hardly use such TBLs for autovec
> though, and could easily choose not to use them at all.)
>
> That decision is 8 or 9 years old at this point, so it might seem overly
> dogmatic now. Even so, I think we should have a strong reason to change
> tack.
> It shouldn't just be about trying to avoid poly_ints :)

Thanks. I have posted an RFC at https://gcc.gnu.org/pipermail/gcc-patches/2024-August/659373.html

In addition to making loop->safelen a POLY_INT, I also change apply_safelen with:

+ unsigned int safelen;
+ if (loop->safelen.is_constant ())
+   safelen = loop->safelen.coeffs[0];
+ else
+   safelen = INT_MAX;

That is, in essence safelen would be INT_MAX in the non-constant cases.
[Bug tree-optimization/116338] New: GCC is not vectoring TSVC s255 while clang can
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116338

Bug ID: 116338
Summary: GCC is not vectoring TSVC s255 while clang can
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: kugan at gcc dot gnu.org
Target Milestone: ---

Reduced test case:

typedef float real_t;
extern __attribute__((aligned(64))) real_t a[32000], b[32000];

void s255()
{
  real_t x, y;
  x = b[32000 -1];
  y = b[32000 -2];
  for (int i = 0; i < 32000; i++) {
    a[i] = (b[i] + x + y) * (real_t).333;
    y = x;
    x = b[i];
  }
}

gcc is not able to vectorize the loop whereas clang can. See https://godbolt.org/z/64Kxaahqr

gcc -v
Using built-in specs.
COLLECT_GCC=/home/kvivekananda/install/bin/gcc
COLLECT_LTO_WRAPPER=/home/kvivekananda/install/libexec/gcc/aarch64-unknown-linux-gnu/15.0.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: ../gcc_base/configure --prefix=/home/kvivekananda/install/ --enable-languages=c,c++,fortran,lto,objc
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 15.0.0 20240618 (experimental) (GCC)
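The recurrence in s255 can be made explicit by hand: x always holds b[i-1] and y holds b[i-2], except in the first two iterations where they wrap around to the end of the array. The sketch below shows the original form next to a hand-peeled form without the carried scalars. The function names and the length parameter n are illustrative assumptions (the report uses fixed 32000-element globals), n must be at least 2, and a and b must not overlap. It does not describe what either compiler does internally.

```c
/* Original form: x and y carry b[i-1] and b[i-2] across iterations. */
void s255_recurrence(float *a, const float *b, int n)
{
    float x = b[n - 1];
    float y = b[n - 2];
    for (int i = 0; i < n; i++) {
        a[i] = (b[i] + x + y) * 0.333f;
        y = x;
        x = b[i];
    }
}

/* Hand-peeled form: the first two iterations use the wrapped-around
   values, the rest read b[i-1] and b[i-2] directly. */
void s255_peeled(float *a, const float *b, int n)
{
    a[0] = (b[0] + b[n - 1] + b[n - 2]) * 0.333f;
    a[1] = (b[1] + b[0] + b[n - 1]) * 0.333f;
    for (int i = 2; i < n; i++)
        a[i] = (b[i] + b[i - 1] + b[i - 2]) * 0.333f;
}
```

The additions are kept in the same order in both forms, so the two functions are expected to produce the same values for the same input.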
[Bug middle-end/116337] New: Reverse iterated loops has redundant code compared to clang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116337

Bug ID: 116337
Summary: Reverse iterated loops has redundant code compared to clang
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: kugan at gcc dot gnu.org
Target Milestone: ---

For:

extern __attribute__((aligned(64))) int a[32000],b[32000];

void s1112(void)
{
  for (int i = 32000 - 1; i >= 0; i--) {
    a[i] = b[i] + 1;
  }
}

For the loop, with -O3 -mcpu=neoverse-v2 --param=aarch64-autovec-preference=2, gcc generates:

.L3:
        ld1w    z31.s, p7/z, [x6, x0, lsl 2]
        add     w1, w1, w3
        rev     z31.s, z31.s
        add     z31.s, z31.s, #1
        rev     z31.s, z31.s
        st1w    z31.s, p7, [x2, x0, lsl 2]
        sub     x0, x0, x5
        cmp     w1, w4
        bls     .L3

clang generates with -O3 -mcpu=neoverse-v2 -fno-unroll-loops:

.LBB0_1:
        ld1w    { z0.s }, p0/z, [x14, x11, lsl #2]
        add     z0.s, z0.s, #1
        st1w    { z0.s }, p0, [x13, x11, lsl #2]
        decw    x11
        cmn     x12, x11
        b.ne    .LBB0_1

This seems to come from the memory_access_type of VMAT_CONTIGUOUS_REVERSE and the resulting VEC_PERM_EXPRs.

[local count: 1063004408]:
# i_10 = PHI
# ivtmp_9 = PHI
# vectp_b.4_8 = PHI [(void *)&b + 127984B](2)>
# vectp_a.9_19 = PHI [(void *)&a + 127984B](2)>
# ivtmp_23 = PHI
vect__1.6_14 = MEM [(int *)vectp_b.4_8];
vect__1.7_15 = VEC_PERM_EXPR ;
_1 = b[i_10];
vect__2.8_17 = vect__1.7_15 + { 1, 1, 1, 1 };
_2 = _1 + 1;
vect__2.11_21 = VEC_PERM_EXPR ;
MEM [(int *)vectp_a.9_19] = vect__2.11_21;
i_7 = i_10 + -1;
ivtmp_4 = ivtmp_9 - 1;
vectp_b.4_13 = vectp_b.4_8 + 18446744073709551600;
vectp_a.9_20 = vectp_a.9_19 + 18446744073709551600;
ivtmp_24 = ivtmp_23 + 1;
if (ivtmp_24 < 8000) goto ; [98.99%] else goto ; [1.01%]

[local count: 1052266995]:
goto ; [100.00%]

gcc -v
Using built-in specs.
COLLECT_GCC=/home/kvivekananda/install/bin/gcc
COLLECT_LTO_WRAPPER=/home/kvivekananda/install/libexec/gcc/aarch64-unknown-linux-gnu/15.0.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: ../gcc_base/configure --prefix=/home/kvivekananda/install/ --enable-languages=c,c++,fortran,lto,objc
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 15.0.0 20240618 (experimental) (GCC)
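Why the two reversals are redundant can be seen at the source level: each iteration only touches element i of both arrays, so the iteration order does not affect the result. A minimal sketch, assuming a and b do not overlap; the function names and the length parameter n are illustrative and not part of the report.

```c
/* Reverse iteration, as in s1112. */
void add_one_reverse(int *a, const int *b, int n)
{
    for (int i = n - 1; i >= 0; i--)
        a[i] = b[i] + 1;
}

/* Forward iteration over the same elements. With no dependence between
   iterations this stores exactly the same values. */
void add_one_forward(int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + 1;
}
```

In vector terms, reversing the loaded vector, applying an element-wise add, and reversing the result again gives the same lanes as doing the add on the unreversed vector, which matches the shorter clang loop.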
[Bug tree-optimization/115450] [15 Regression] cpu2017 502.gcc runtime miscompute on aarch64 with SVE since r15-1006-gd93353e6423eca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115450 --- Comment #2 from kugan at gcc dot gnu.org --- (In reply to Andrew Pinski from comment #1) > >[r15-1006-gd93353e6423eca] Do single-lane SLP discovery for reductions > > > Interesting because PR 115256 bisect it to an earlier patch. I believe this is a new issue.
[Bug tree-optimization/115450] New: cpu2017 502.gcc runtime miscompute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115450

Bug ID: 115450
Summary: cpu2017 502.gcc runtime miscompute
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: kugan at gcc dot gnu.org
Target Milestone: ---

502.gcc is miscompiled for aarch64 with -O3 -Wl,-z,muldefs -lm -fallow-argument-mismatch -fpermissive -fstack-arrays -flto -Wl,--sort-section=name -fno-strict-aliasing -fgnu89-inline -march=native -mcpu=neoverse-v2 -msve-vector-bits=128

gcc -v
Using built-in specs.
COLLECT_GCC=/home/kvivekananda/install_base/bin/gcc
COLLECT_LTO_WRAPPER=/home/kvivekananda/install_base/libexec/gcc/aarch64-unknown-linux-gnu/15.0.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: ../gcc_base/configure --enable-multiarch=yes --enable-languages=c,c++,fortran,lto --disable-bootstrap --prefix=/home/kvivekananda/install_base : (reconfigured) ../gcc_base/configure --enable-multiarch=yes --disable-bootstrap --prefix=/home/kvivekananda/install_base --enable-languages=c,c++,fortran,lto --no-create --no-recursion
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 15.0.0 20240607 (experimental) (GCC)

Bisect points to: [d93353e6423ecaaae9fa47d0935caafd9abfe4de] Do single-lane SLP discovery for reductions
[Bug tree-optimization/115383] [15 Regression] ICE with TCVC_2 build since r15-1053-g28edeb1409a7b8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115383 --- Comment #6 from kugan at gcc dot gnu.org --- (In reply to kugan from comment #5) > (In reply to Richard Biener from comment #4) > > Created attachment 58378 [details] > > patch > > > > I'm testing this, but I do not have hardware to test correctness (and qemu > > not set up). > > Thanks. I will test this on aarch64. bootstrap and regression test passes. TSVC_2 also builds without any issues.
[Bug tree-optimization/115383] [15 Regression] ICE with TCVC_2 build since r15-1053-g28edeb1409a7b8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115383 --- Comment #5 from kugan at gcc dot gnu.org --- (In reply to Richard Biener from comment #4) > Created attachment 58378 [details] > patch > > I'm testing this, but I do not have hardware to test correctness (and qemu > not set up). Thanks. I will test this on aarch64.
[Bug tree-optimization/115383] New: ICE with TCVC_2 build
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115383

Bug ID: 115383
Summary: ICE with TCVC_2 build
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: kugan at gcc dot gnu.org
Target Milestone: ---

The patch "[PATCH 1/4] Relax COND_EXPR reduction vectorization SLP restriction" seems to cause an ICE while building TSVC_2.

Reduced test:

cat tsvc_vec.i
void dummy();
void s331() {
  int j;
  for (int i; i; i++)
    if ((float)i < 00.)
      j = i;
  dummy(j);
}

gcc options used: gcc -std=c99 -O3 -march=native -flto -Wl,--sort-section=name -mcpu=neoverse-v2 -msve-vector-bits=128

gcc -v:
Using built-in specs.
COLLECT_GCC=/proj/grco/gcc/Linux_aarch64/upstream-main/latest/bin/gcc
COLLECT_LTO_WRAPPER=/proj/grco/gcc/Linux_aarch64/upstream-main/20240606024711346f33e2/libexec/gcc/aarch64-unknown-linux-gnu/15.0.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: /var/jenkins/workspace/GCC_Nightly/configure --enable-multiarch=yes --enable-languages=c,c++,fortran,lto --prefix=/proj/grco/gcc/Linux_aarch64/upstream-main/20240606024711346f33e2
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 15.0.0 20240606 (experimental) (GCC)
[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635 --- Comment #18 from kugan at gcc dot gnu.org ---
Also, can we set INT_MAX when there is no explicit safelen specified in OMP? Something like:

--- a/gcc/omp-low.cc
+++ b/gcc/omp-low.cc
@@ -6975,14 +6975,11 @@ lower_rec_input_clauses (tree clauses, gimple_seq *ilist, gimple_seq *dlist,
     {
       tree c = omp_find_clause (gimple_omp_for_clauses (ctx->stmt),
                                 OMP_CLAUSE_SAFELEN);
-      poly_uint64 safe_len;
-      if (c == NULL_TREE
-          || (poly_int_tree_p (OMP_CLAUSE_SAFELEN_EXPR (c), &safe_len)
-              && maybe_gt (safe_len, sctx.max_vf)))
+      if (c == NULL_TREE)
         {
           c = build_omp_clause (UNKNOWN_LOCATION, OMP_CLAUSE_SAFELEN);
           OMP_CLAUSE_SAFELEN_EXPR (c) = build_int_cst (integer_type_node,
-                                                       sctx.max_vf);
+                                                       INT_MAX);
           OMP_CLAUSE_CHAIN (c) = gimple_omp_for_clauses (ctx->stmt);
           gimple_omp_for_set_clauses (ctx->stmt, c);
         }
[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635 --- Comment #12 from kugan at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #11)
> (In reply to kugan from comment #9)
> > Looking at the options, looks to me that making loop->safelen a poly_in is
> > the way to go. (In reply to Jakub Jelinek from comment #4)
> > > The OpenMP safelen clause argument is a scalar integer, so using poly_int
> > > for something that must be an int doesn't make sense.
> > > Though, the above testcase actually doesn't use safelen clause, so safelen
> > > is there effectively infinity.
> > Thanks. I was looking at this to see if there is a way to handle this
> > differently. Looks to me that making loop->safelen a poly_int is the way to
> > handle at least the case when omp safelen clause is not provided.
>
> Why?
> Then it just is INT_MAX value, which is a magic value that says that it is
> infinity.
> No need to say it is a poly_int infinity.

For this test case, omp_max_vf gets [16, 16] from the backend. This then becomes 16. If we kept it as a poly_int, would it pass the maybe_lt (max_vf, min_vf) check after applying safelen?
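The difference between a constant max_vf of 16 and a poly_int [16, 16] in that comparison can be sketched with a toy model. The struct and function below are hypothetical stand-ins, not GCC's poly_int API: a value is modelled as c0 + c1 * x for an unknown x >= 0, and the "maybe less than" test asks whether a < b can hold for some x.

```c
#include <stdbool.h>

/* Hypothetical two-coefficient model: value = c0 + c1 * x, with x >= 0
   unknown at compile time (for example, extra vector length on SVE). */
struct poly2 {
    unsigned long c0;
    unsigned long c1;
};

/* True if a < b is possible for some x >= 0: either it already holds at
   x == 0, or b grows faster than a and overtakes it for large x. */
bool poly2_maybe_lt(struct poly2 a, struct poly2 b)
{
    return a.c0 < b.c0 || a.c1 < b.c1;
}
```

In this model, with min_vf = 4 + 4x, a constant max_vf of 16 (coefficients 16 and 0) may be less than min_vf once x exceeds 3, so the test is true. A max_vf of 16 + 16x can never be less than 4 + 4x, so the test is false.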
[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635 --- Comment #10 from kugan at gcc dot gnu.org --- Created attachment 57946 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57946&action=edit patch patch to make loop->safelen a poly_int
[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635 --- Comment #9 from kugan at gcc dot gnu.org ---
Looking at the options, it looks to me that making loop->safelen a poly_int is the way to go.

(In reply to Jakub Jelinek from comment #4)
> The OpenMP safelen clause argument is a scalar integer, so using poly_int
> for something that must be an int doesn't make sense.
> Though, the above testcase actually doesn't use safelen clause, so safelen
> is there effectively infinity.

Thanks. I was looking at this to see if there is a way to handle this differently. It looks to me that making loop->safelen a poly_int is the way to handle at least the case when the omp safelen clause is not provided. I am interested in looking into this. Any suggestions?

Here is a completely untested diff that makes loop->safelen a poly_int.
[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 114653, which changed state.

Bug 114653 Summary: Not vectorizing the loop with openmp reduction.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114653

What |Removed |Added
Status |UNCONFIRMED |RESOLVED
Resolution |--- |DUPLICATE
[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635

kugan at gcc dot gnu.org changed:

What |Removed |Added
CC | |kugan at gcc dot gnu.org

--- Comment #8 from kugan at gcc dot gnu.org ---
*** Bug 114653 has been marked as a duplicate of this bug. ***
[Bug middle-end/114653] Not vectorizing the loop with openmp reduction.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114653

kugan at gcc dot gnu.org changed:

What |Removed |Added
Resolution |--- |DUPLICATE
Status |UNCONFIRMED |RESOLVED

--- Comment #6 from kugan at gcc dot gnu.org ---
Duplicate

*** This bug has been marked as a duplicate of bug 114635 ***
[Bug middle-end/114653] Not vectorizing the loop with openmp reduction.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114653 --- Comment #5 from kugan at gcc dot gnu.org ---
For the ddr of:

ref_a: _57 = D.4803[_20];
ref_b: D.4803[_20] = _ifc__174;

we get DDR_ARE_DEPENDENT (ddr) == chrec_dont_know. Hence apply_safelen () is used.
[Bug middle-end/114653] Not vectorizing the loop with openmp reduction.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114653 --- Comment #4 from kugan at gcc dot gnu.org --- This particular loop has loop->safelen set to 16. Does this mean this can never be loop vectorized for VLA?
[Bug middle-end/114653] Not vectorizing the loop with openmp reduction.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114653 --- Comment #3 from kugan at gcc dot gnu.org ---
For SVE mode in vect_analyze_loop_2, we have:

(gdb) p min_vf
$15 = {coeffs = {4, 4}}
(gdb) p max_vf
$16 = 16

Thus maybe_lt (max_vf, min_vf) is true, since the constant 16 can be less than 4 + 4x. This results in the "bad data dependence" failure.
[Bug middle-end/114653] Not vectorizing the loop with openmp reduction.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114653 --- Comment #2 from kugan at gcc dot gnu.org ---
Thanks. I see the following in the log:

test.cpp:33:53: missed: not vectorized: relevant stmt not supported: _54 = .MASK_LOAD (_53, 32B, _171);
test.cpp:22:19: missed: bad operation or unsupported loop bound.
test.cpp:22:19: note: * Analysis failed with vector mode V4SF
test.cpp:22:19: note: === vect_analyze_data_ref_dependences ===
test.cpp:22:19: missed: bad data dependence.
test.cpp:22:19: note: * Analysis failed with vector mode VNx16QI
test.cpp:33:53: missed: not vectorized: relevant stmt not supported: _54 = .MASK_LOAD (_53, 32B, _171);
test.cpp:22:19: missed: bad operation or unsupported loop bound.
test.cpp:22:19: note: * Analysis failed with vector mode V8QI
test.cpp:22:19: note: === vect_analyze_data_ref_dependences ===
test.cpp:22:19: missed: bad data dependence.
test.cpp:22:19: note: * Analysis failed with vector mode VNx8QI
test.cpp:33:53: missed: not vectorized: relevant stmt not supported: _54 = .MASK_LOAD (_53, 32B, _171);
test.cpp:22:19: missed: bad operation or unsupported loop bound.
test.cpp:22:19: note: * Analysis failed with vector mode V4HI
test.cpp:22:19: note: === vect_analyze_data_ref_dependences ===
test.cpp:22:19: missed: bad data dependence.
test.cpp:22:19: note: * Analysis failed with vector mode VNx4QI
test.cpp:33:53: missed: not vectorized: relevant stmt not supported: _54 = .MASK_LOAD (_53, 32B, _171);
test.cpp:22:19: missed: bad operation or unsupported loop bound.
test.cpp:22:19: note: * Analysis failed with vector mode V2SI
test.cpp:22:19: note: worklist: examine stmt: _57 = D.4803[_20];
test.cpp:22:19: note: === vect_analyze_data_ref_dependences ===
test.cpp:22:19: missed: bad data dependence.
test.cpp:22:19: note: * Analysis failed with vector mode VNx2QI
test.cpp:22:19: missed: couldn't vectorize loop
test.cpp:22:19: missed: bad data dependence.
[Bug middle-end/114653] New: Not vectoring the loop with openmp reduction.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114653
Bug ID: 114653
Summary: Not vectoring the loop with openmp reduction.
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: kugan at gcc dot gnu.org
Target Milestone: ---
Created attachment 57910 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57910&action=edit testcase
The main loop in the attached test case is not vectorized with -fopenmp. It gets vectorized with -fopenmp-simd.
In the case of -fopenmp, the reduction variables lax, lay and laz get assigned to an array. The data reference analysis for this seems to fail. See:
offset from base address: (ssizetype) ((sizetype) _20 * 4)
constant offset from base address: 0
step: 0
base alignment: 16
base misalignment: 0
offset alignment: 4
step alignment: 128
base_object: D.4806[_20]
Creating dr for D.4808[_20]
analyze_innermost: Applying pattern match.pd:219, generic-match-1.cc:3190
test.cpp:37:9: missed: failed: evolution of offset is not affine.
command used:
test.cpp -Ofast -fopenmp -mcpu=neoverse-v2
gcc -v:
Using built-in specs.
COLLECT_GCC=/home/kvivekananda/install/bin/gcc
COLLECT_LTO_WRAPPER=/home/kvivekananda/install/libexec/gcc/aarch64-unknown-linux-gnu/14.0.1/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: ../gcc/configure --enable-multiarch=yes --enable-languages=c,c++,fortran,lto --disable-bootstrap --prefix=/home/kvivekananda/install
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 14.0.1 20240314 (experimental) (GCC)
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 --- Comment #5 from kugan at gcc dot gnu.org --- -O3 -fno-tree-vectorize and -O3 -fno-tree-vrp both work. I looked at the evrp dump and it is not doing anything suspicious. It looks like the range_info usage in the vectoriser is causing the problem.
[Bug libgomp/113698] GNU OpenMP with OMP_PROC_BIND alters thread affinity in a way that negatively affects performance
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113698 --- Comment #4 from kugan at gcc dot gnu.org --- Thanks for looking into this. The main reason we were seeing the performance issue turned out to be a glibc malloc issue: https://sourceware.org/bugzilla/show_bug.cgi?id=30945
[Bug libgomp/113698] New: GNU OpenMP with OMP_PROC_BIND alters thread affinity in a way that negatively affects performance
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113698
Bug ID: 113698
Summary: GNU OpenMP with OMP_PROC_BIND alters thread affinity in a way that negatively affects performance
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libgomp
Assignee: unassigned at gcc dot gnu.org
Reporter: kugan at gcc dot gnu.org
CC: jakub at gcc dot gnu.org
Target Milestone: ---
Created attachment 57275 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57275&action=edit testcase
When OMP_PROC_BIND=true, it seems gomp sets the affinity even before main() starts. In particular, the main thread gets affinity 0x1 (i.e. pinned to the first core). For the attached testcase, I get:
$ OMP_NUM_THREADS=72 ./a.out
[main thread affinity right after main()]. tid:ae511020 aff:...
duration: 402.949 msec
$ OMP_PROC_BIND=true OMP_NUM_THREADS=72 ./a.out
[main thread affinity right after main()]. tid:fffdded50020 aff:...0001
duration: 7879.59 msec
$ OMP_PROC_BIND=true OMP_NUM_THREADS=72 ./a.out
[main thread affinity right after main()]. tid:ae54c020 aff:...0001
duration: 311219 msec
Compiler options used: gcc -O0 -fopenmp repro.c
gcc -v:
Using built-in specs.
COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/11/lto-wrapper Target: aarch64-linux-gnu Configured with: ../src/configure -v --with-pkgversion='Ubuntu 11.3.0-1ubuntu1~22.04' --with-bugurl=file:///usr/share/doc/gcc-11/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-11 --program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release --build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-serialization=2 Thread model: posix Supported LTO compression algorithms: zlib zstd gcc version 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04)
[Bug driver/47785] GCC with -flto does not pass -Wa options to the assembler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47785 kugan at gcc dot gnu.org changed: What|Removed |Added CC||kugan at gcc dot gnu.org --- Comment #14 from kugan at gcc dot gnu.org --- A patch for this is posted at https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01471.html
[Bug ipa/91468] Suspicious codes in ipa-prop.c and ipa-cp.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91468 kugan at gcc dot gnu.org changed: What|Removed |Added CC||kugan at gcc dot gnu.org --- Comment #2 from kugan at gcc dot gnu.org ---
(In reply to Martin Jambor from comment #1)
> (In reply to Feng Xue from comment #0)
> >
> > In function update_jump_functions_after_inlining(),
> >
> >   if (dst->type == IPA_JF_ANCESTOR)
> >     {
> >       ..
> >
> >       if (src->type == IPA_JF_PASS_THROUGH
> >           && src->value.pass_through.operation == NOP_EXPR)
> >         {
> >           ..
> >         }
> >       else if (src->type == IPA_JF_PASS_THROUGH
> >                && TREE_CODE_CLASS (src->value.pass_through.operation) == tcc_unary)
> >         {
> >           dst->value.ancestor.formal_id = src->value.pass_through.formal_id;
> >           dst->value.ancestor.agg_preserved = false;
> >         }
> >       ..
> >     }
> >
> > If we suppose pass_through operation is "negate_expr" (while it is not a
> > reasonable operation on pointer type), the code might be incorrect. It's
> > better to specify expected unary operations here.
>
> Kugan, you added this in 2016 and unfortunately I think it is wrong.
> Are there any unary operations we could possibly want to handle?
> In any event, the information that there was an arithmetic function in
> the path of the parameter would be completely lost if the code ever
> executed. (Which I don't think it ever does, I think it would take
> crazy code that employs LTO to pass an integer to a pointer parameter
> to trigger).
>
> So I plan to remove the whole if.
>
Yes, I think this is a mistake and should go. Thanks for doing that.
[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834 --- Comment #21 from kugan at gcc dot gnu.org ---
(In reply to Christophe Lyon from comment #20)
> Hi Kugan,
>
> The new test fails with -mabi=ilp32:
> FAIL: gcc.target/aarch64/pr88834.c scan-assembler-times \\tld2w\\t{z[0-9]+.s
> - z[0-9]+.s}, p[0-7]/z, \\[x[0-9]+, x[0-9]+, lsl 2\\]\\n 2
> FAIL: gcc.target/aarch64/pr88834.c scan-assembler-times \\tst2w\\t{z[0-9]+.s
> - z[0-9]+.s}, p[0-7], \\[x[0-9]+, x[0-9]+, lsl 2\\]\\n 1
Thanks Christophe.
In the back-end, when we use ILP32, we don't accept SImode addresses such as:
(plus:SI (mult:SI (reg:SI 91) (const_int 4 [0x4])) (reg:SI 90))
whereas we would accept the same address in Pmode. My question is, should we care about ILP32 for SVE? If so, we need to fix this. Otherwise, we can run the test for LP64 only.
[Bug target/88838] [SVE] Use 32-bit WHILELO in LP64 mode
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88838 --- Comment #6 from kugan at gcc dot gnu.org --- Author: kugan Date: Thu Jun 13 03:34:28 2019 New Revision: 272233 URL: https://gcc.gnu.org/viewcvs?rev=272233&root=gcc&view=rev Log: gcc/ChangeLog: 2019-06-13 Kugan Vivekanandarajah PR target/88838 * tree-vect-loop-manip.c (vect_set_loop_masks_directly): If the compare_type is not with Pmode size, we will create an IV with Pmode size with truncated use (i.e. converted to the correct type). * tree-vect-loop.c (vect_verify_full_masking): Find IV type. (vect_iv_limit_for_full_masking): New. Factored out of vect_set_loop_condition_masked. * tree-vectorizer.h (LOOP_VINFO_MASK_IV_TYPE): New. (vect_iv_limit_for_full_masking): Declare. gcc/testsuite/ChangeLog: 2019-06-13 Kugan Vivekanandarajah PR target/88838 * gcc.target/aarch64/pr88838.c: New test. * gcc.target/aarch64/sve/while_1.c: Adjust. Added: trunk/gcc/testsuite/gcc.target/aarch64/pr88838.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.target/aarch64/sve/while_1.c trunk/gcc/tree-vect-loop-manip.c trunk/gcc/tree-vect-loop.c trunk/gcc/tree-vectorizer.h
[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834 --- Comment #19 from kugan at gcc dot gnu.org --- Author: kugan Date: Thu Jun 13 03:18:54 2019 New Revision: 272232 URL: https://gcc.gnu.org/viewcvs?rev=272232&root=gcc&view=rev Log: gcc/ChangeLog: 2019-06-13 Kugan Vivekanandarajah PR target/88834 * tree-ssa-loop-ivopts.c (get_mem_type_for_internal_fn): Handle IFN_MASK_LOAD_LANES and IFN_MASK_STORE_LANES. (get_alias_ptr_type_for_ptr_address): Likewise. (add_iv_candidate_for_use): Add scaled index candidate if useful. * tree-ssa-address.c (preferred_mem_scale_factor): New. * config/aarch64/aarch64.c (aarch64_classify_address): Relax allow_reg_index_p. gcc/testsuite/ChangeLog: 2019-06-13 Kugan Vivekanandarajah PR target/88834 * gcc.target/aarch64/pr88834.c: New test. * gcc.target/aarch64/sve/struct_vect_1.c: Adjust. * gcc.target/aarch64/sve/struct_vect_14.c: Likewise. * gcc.target/aarch64/sve/struct_vect_15.c: Likewise. * gcc.target/aarch64/sve/struct_vect_16.c: Likewise. * gcc.target/aarch64/sve/struct_vect_17.c: Likewise. * gcc.target/aarch64/sve/struct_vect_7.c: Likewise. Added: trunk/gcc/testsuite/gcc.target/aarch64/pr88834.c Modified: trunk/gcc/ChangeLog trunk/gcc/config/aarch64/aarch64.c trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_1.c trunk/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_14.c trunk/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_15.c trunk/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_16.c trunk/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_17.c trunk/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_7.c trunk/gcc/tree-ssa-address.c trunk/gcc/tree-ssa-address.h trunk/gcc/tree-ssa-loop-ivopts.c
[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834 --- Comment #17 from kugan at gcc dot gnu.org ---
(In reply to Wilco from comment #16)
> (In reply to kugan from comment #15)
> > (In reply to Wilco from comment #11)
> > > There is also something odd with the way the loop iterates, this doesn't
> > > look right:
> > >
> > > whilelo p0.s, x3, x4
> > > incw x3
> > > ptest p1, p0.b
> > > bne .L3
> >
> > I am not sure I understand this. I tried with qemu using an execution
> > testcase and It seems to work.
> >
> > whilelo p0.s, x4, x5
> > incw x4
> > ptest p1, p0.b
> > bne .L3
> > In my case I have the above (register allocation difference only) incw is
> > correct considering two vector word registers? Am I missing something here?
>
> I'm talking about the completely redundant ptest, where does that come from?
It is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88836
[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834 --- Comment #15 from kugan at gcc dot gnu.org ---
(In reply to Wilco from comment #11)
> There is also something odd with the way the loop iterates, this doesn't
> look right:
>
> whilelo p0.s, x3, x4
> incw x3
> ptest p1, p0.b
> bne .L3
I am not sure I understand this. I tried with qemu using an execution testcase and it seems to work.
whilelo p0.s, x4, x5
incw x4
ptest p1, p0.b
bne .L3
In my case I have the above (register allocation difference only). Isn't incw correct, considering two vector word registers? Am I missing something here?
[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834 --- Comment #14 from kugan at gcc dot gnu.org --- Created attachment 46104 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46104&action=edit testcase
[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834 kugan at gcc dot gnu.org changed: What|Removed |Added Attachment #46040|0 |1 is obsolete|| --- Comment #13 from kugan at gcc dot gnu.org --- Created attachment 46103 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46103&action=edit ivopt changes alone
[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834 --- Comment #12 from kugan at gcc dot gnu.org ---
(In reply to rsand...@gcc.gnu.org from comment #10)
> (In reply to kugan from comment #9)
> > Created attachment 46040 [details]
> > patch
>
> Wasn't sure whether this patch was WIP or the final version
> for review, but we need to do something more generic than
> dividing by 4. I think the test will still fail with "int"
> changed to "short" for example.
>
> I also don't think the new candidate should be tied to the
> mask/load store functions. Maybe one approach would be to
> check when adding a zero-based candidate for a use in:
>
>   /* Record common candidate with initial value zero. */
>   basetype = TREE_TYPE (iv->base);
>   if (POINTER_TYPE_P (basetype))
>     basetype = sizetype;
>   record_common_cand (data, build_int_cst (basetype, 0), iv->step, use);
>
> whether the use actually benefits from this unscaled iv.
> If the use is USE_REF_ADDRESS, we could compare the cost
> of an address with an unscaled index with the cost of an address
> with a scaled index. I think the natural scale value to try
> would be GET_MODE_INNER (TYPE_MODE (mem_type)).
Thanks for the comments. I agree this is the right place. But I am not sure that checking the cost at this point is what IV-opt generally does. In general, IV-opt adds candidates that may be helpful and later decides on the optimal set. If we are to use get_computation_cost to see the costs, we have to create an iv_cand and then discard it. Since we are adding only one candidate, and only for SVE-like targets, I am thinking that it is OK. If you still prefer to check the cost, I will change that.
Attached are the patch (only the ivopt changes) and the testcase.
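The difference between an unscaled and a scaled index in this exchange can be pictured in plain C. The following is only an illustrative sketch with hypothetical function names, not GCC code: the first loop carries a byte offset as its induction variable, the second carries an element index that the address computation multiplies by the element size (the "natural scale" for int data being sizeof(int)).

```c
#include <stddef.h>

/* Unscaled index: the induction variable is a byte offset, so the
   address is simply base + off and the step is the element size. */
long sum_with_byte_offset(const int *base, size_t n)
{
    long sum = 0;
    for (size_t off = 0; off < n * sizeof(int); off += sizeof(int))
        sum += *(const int *)((const char *)base + off);
    return sum;
}

/* Scaled index: the induction variable counts elements, so the
   address is base + i * sizeof(int) and the step is 1. */
long sum_with_element_index(const int *base, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += base[i];
    return sum;
}
```

Both loops touch the same addresses. Which form is cheaper depends on whether the target's load instruction can fold the multiplication into its addressing mode, which is what the cost comparison suggested in the quoted review would decide.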
[Bug rtl-optimization/89862] LTO bootstrap fails for ARM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89862 --- Comment #4 from kugan at gcc dot gnu.org --- Author: kugan Date: Sat Mar 30 04:28:51 2019 New Revision: 270031 URL: https://gcc.gnu.org/viewcvs?rev=270031&root=gcc&view=rev Log: 2019-03-29 Kugan Vivekanandarajah Backport from mainline 2019-03-29 Kugan Vivekanandarajah Eric Botcazou PR rtl-optimization/89862 * rtl.h (word_register_operation_p): Exclude CONST_INT from operations that operates on the full registers for WORD_REGISTER_OPERATIONS architectures. Modified: branches/gcc-8-branch/gcc/ChangeLog branches/gcc-8-branch/gcc/rtl.h
[Bug rtl-optimization/89862] LTO bootstrap fails for ARM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89862 --- Comment #3 from kugan at gcc dot gnu.org --- Author: kugan Date: Sat Mar 30 04:24:22 2019 New Revision: 270030 URL: https://gcc.gnu.org/viewcvs?rev=270030&root=gcc&view=rev Log: 2019-03-29 Kugan Vivekanandarajah Eric Botcazou PR rtl-optimization/89862 * rtl.h (word_register_operation_p): Exclude CONST_INT from operations that operates on the full registers for WORD_REGISTER_OPERATIONS architectures. Modified: trunk/gcc/ChangeLog trunk/gcc/rtl.h
[Bug rtl-optimization/89862] LTO bootstrap fails for ARM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89862 --- Comment #2 from kugan at gcc dot gnu.org ---
(In reply to Eric Botcazou from comment #1)
> Can you try this instead?
>
> Index: rtl.h
> ===
> --- rtl.h (revision 269886)
> +++ rtl.h (working copy)
> @@ -4401,6 +4401,7 @@ word_register_operation_p (const_rtx x)
>  {
>    switch (GET_CODE (x))
>      {
> +    case CONST_INT:
>      case ROTATE:
>      case ROTATERT:
>      case SIGN_EXTRACT:
Thanks for looking into it. Disallowing all CONST_INTs works for me. I have verified that lto-bootstrap works with the above change. I will test for regressions and post it to gcc-patches.
[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834 kugan at gcc dot gnu.org changed: What|Removed |Added Attachment #45686|0 |1 is obsolete|| --- Comment #9 from kugan at gcc dot gnu.org --- Created attachment 46040 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46040&action=edit patch
[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834 --- Comment #8 from kugan at gcc dot gnu.org ---
(In reply to rsand...@gcc.gnu.org from comment #7)
> Thanks for looking at this.
>
> (In reply to kugan from comment #6)
> >         cmp     w3, 0
> >         ble     .L1
> >         sub     w3, w3, #1
> >         mov     x4, 0
> >         cntw    x5
> >         ptrue   p1.s, all
> >         lsr     w3, w3, 1
> >         add     w3, w3, 1
> >         whilelo p0.s, xzr, x3
> >         .p2align 3,,7
> > .L3:
> >         ld2w    {z4.s - z5.s}, p0/z, [x1, x4, lsl 2]
> >         ld2w    {z2.s - z3.s}, p0/z, [x2, x4, lsl 2]
> >         add     z0.s, z4.s, z2.s
> >         sub     z1.s, z5.s, z3.s
> >         st2w    {z0.s - z1.s}, p0, [x0, x4, lsl 2]
> >         whilelo p0.s, x5, x3
> >         incb    x4, all, mul #2
> >         incw    x5
> >         ptest   p1, p0.b
> >         bne     .L3
> > .L1:
> >         ret
> >         .cfi_endproc
>
> This doesn't look right. x4 is an index, so it should be
> incremented by the number of words in two vectors, rather than
> the number of bytes in two vectors.
Thanks for the comments. Fixed it. With the attached patch it generates:
f:
.LFB0:
        .cfi_startproc
        cmp     w3, 0
        ble     .L1
        sub     w5, w3, #1
        cntw    x4
        mov     x3, 0
        ptrue   p1.s, all
        lsr     w5, w5, 1
        add     w5, w5, 1
        whilelo p0.s, xzr, x5
        .p2align 3,,7
.L3:
        ld2w    {z4.s - z5.s}, p0/z, [x1, x3, lsl 2]
        ld2w    {z2.s - z3.s}, p0/z, [x2, x3, lsl 2]
        add     z0.s, z4.s, z2.s
        sub     z1.s, z5.s, z3.s
        st2w    {z0.s - z1.s}, p0, [x0, x3, lsl 2]
        whilelo p0.s, x4, x5
        inch    x3
        incw    x4
        ptest   p1, p0.b
        bne     .L3
.L1:
        ret
        .cfi_endproc
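The change from an incb with "mul #2" to an inch in the loop above comes down to simple arithmetic on the vector length. A small sketch, assuming an SVE vector of vl_bytes bytes (a multiple of 16 on real hardware); the helper names are hypothetical and only model how much each instruction adds to a scalar register.

```c
/* Amount added by "incb xN, all, mul #2": bytes in two vectors. */
unsigned sve_incb_mul2(unsigned vl_bytes)
{
    return 2 * vl_bytes;
}

/* Amount added by "inch xN": 16-bit elements in one vector. */
unsigned sve_inch(unsigned vl_bytes)
{
    return vl_bytes / 2;
}

/* Amount added by "incw xN": 32-bit elements in one vector. */
unsigned sve_incw(unsigned vl_bytes)
{
    return vl_bytes / 4;
}

/* What an element index into int data must advance by when each
   iteration consumes two vectors' worth of 32-bit words. */
unsigned words_in_two_vectors(unsigned vl_bytes)
{
    return 2 * sve_incw(vl_bytes);
}
```

For a 128-bit vector (vl_bytes = 16), the index has to advance by 8 words per iteration. inch adds exactly 8, whereas incb with mul #2 adds 32, which is four times too much once the address is formed with "lsl 2".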
[Bug rtl-optimization/89862] New: LTO bootstrap fails for ARM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89862
Bug ID: 89862
Summary: LTO bootstrap fails for ARM
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: kugan at gcc dot gnu.org
Target Milestone: ---
Created attachment 46039 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46039&action=edit patch
With the commit:
commit 67c18bce7054934528ff5930cca283b4ac967dca
Author: ebotcazou
Date: Wed Jan 31 10:03:06 2018 +
PR rtl-optimization/84071
* combine.c (record_dead_and_set_regs_1): Record the source unmodified for a paradoxical SUBREG on a WORD_REGISTER_OPERATIONS target.
LTO bootstrap fails for arm (and possibly for other WORD_REGISTER_OPERATIONS targets) with an internal compiler error: in operator+=, at profile-count.h:792. It looks like the profile_count is set incorrectly.
Commit 67c18bce7054934528ff5930cca283b4ac967dca skips generating gen_lowpart for
(set (subreg:SI (reg:QI 1434) 0) (const_int 224 [0xe0]))
and the like. This seems to be the reason for the error. The attached patch fixes this. Does this look reasonable?
[Bug target/88838] [SVE] Use 32-bit WHILELO in LP64 mode
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88838 --- Comment #5 from kugan at gcc dot gnu.org --- Created attachment 46000 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46000&action=edit RFC patch RFC patch fixes this for review.
[Bug target/88836] [SVE] Redundant PTEST in loop test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88836 --- Comment #2 from kugan at gcc dot gnu.org --- Created attachment 45795 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45795&action=edit RFC patch
AFAIK, we need to:
1. Change the whilelo pattern in the backend
2. Change RTL CSE
   - Add support for VEC_DUPLICATE
   - When handling a PARALLEL rtx, we may kill the CSE defined in the first set so that it doesn't reach
The attached patch fixes this. With the patch I now have:
.LFB0:
        .cfi_startproc
        cmp     w3, 0
        ble     .L1
        sub     w4, w3, #1
        cntw    x3
        lsr     w4, w4, 1
        add     w4, w4, 1
        whilelo p0.s, xzr, x4
        .p2align 3,,7
.L3:
        ld2w    {z4.s - z5.s}, p0/z, [x1]
        ld2w    {z2.s - z3.s}, p0/z, [x2]
        add     z0.s, z4.s, z2.s
        sub     z1.s, z5.s, z3.s
        st2w    {z0.s - z1.s}, p0, [x0]
        incb    x1, all, mul #2
        whilelo p0.s, x3, x4
        incb    x0, all, mul #2
        incw    x3
        incb    x2, all, mul #2
        bne     .L3
.L1:
        ret
        .cfi_endproc
[Bug target/88838] [SVE] Use 32-bit WHILELO in LP64 mode
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88838 --- Comment #4 from kugan at gcc dot gnu.org ---
(In reply to kugan from comment #3)
> Created attachment 45794 [details]
> RFC patch
Sorry, wrong place; it should be for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88836
[Bug target/88838] [SVE] Use 32-bit WHILELO in LP64 mode
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88838 --- Comment #3 from kugan at gcc dot gnu.org --- Created attachment 45794 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45794&action=edit RFC patch
[Bug target/88838] [SVE] Use 32-bit WHILELO in LP64 mode
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88838 kugan at gcc dot gnu.org changed: What|Removed |Added CC||kugan at gcc dot gnu.org --- Comment #2 from kugan at gcc dot gnu.org ---
AFAIK, we need to:
1. Change the whilelo pattern in the backend
2. Change RTL CSE
   - Add support for VEC_DUPLICATE
   - When handling a PARALLEL rtx, we may kill the CSE defined in the first set so that it doesn't reach
The attached patch fixes this. With the patch I now have:
.LFB0:
        .cfi_startproc
        cmp     w3, 0
        ble     .L1
        sub     w4, w3, #1
        cntw    x3
        lsr     w4, w4, 1
        add     w4, w4, 1
        whilelo p0.s, xzr, x4
        .p2align 3,,7
.L3:
        ld2w    {z4.s - z5.s}, p0/z, [x1]
        ld2w    {z2.s - z3.s}, p0/z, [x2]
        add     z0.s, z4.s, z2.s
        sub     z1.s, z5.s, z3.s
        st2w    {z0.s - z1.s}, p0, [x0]
        incb    x1, all, mul #2
        whilelo p0.s, x3, x4
        incb    x0, all, mul #2
        incw    x3
        incb    x2, all, mul #2
        bne     .L3
.L1:
        ret
        .cfi_endproc
[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834 --- Comment #6 from kugan at gcc dot gnu.org ---
> Note the difference in mode for aarch64_classify_address. Not sure if this
> is because of the way my patch changes ivopt.
Yes, it was my mistake in iv-use. With the attached patch, I now get:
        cmp     w3, 0
        ble     .L1
        sub     w3, w3, #1
        mov     x4, 0
        cntw    x5
        ptrue   p1.s, all
        lsr     w3, w3, 1
        add     w3, w3, 1
        whilelo p0.s, xzr, x3
        .p2align 3,,7
.L3:
        ld2w    {z4.s - z5.s}, p0/z, [x1, x4, lsl 2]
        ld2w    {z2.s - z3.s}, p0/z, [x2, x4, lsl 2]
        add     z0.s, z4.s, z2.s
        sub     z1.s, z5.s, z3.s
        st2w    {z0.s - z1.s}, p0, [x0, x4, lsl 2]
        whilelo p0.s, x5, x3
        incb    x4, all, mul #2
        incw    x5
        ptest   p1, p0.b
        bne     .L3
.L1:
        ret
        .cfi_endproc
I will post the patch for review after stage-1 opens. In the meantime any review is appreciated, especially of the part where iv-use is set up and of get_alias_ptr_type_for_ptr_address.
[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834 kugan at gcc dot gnu.org changed: What|Removed |Added Attachment #45661|0 |1 is obsolete|| --- Comment #5 from kugan at gcc dot gnu.org --- Created attachment 45686 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45686&action=edit ivopt patch v2
[Bug tree-optimization/89296] New: tree copy-header masking uninitialized warning
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89296
Bug ID: 89296
Summary: tree copy-header masking uninitialized warning
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: kugan at gcc dot gnu.org
Target Milestone: ---
void test_func(void)
{
    int loop; // uninitialized and "garbage"
    while (!loop) {
        loop = get_a_value(); // <- must be for this test
        printk("...");
    }
}
from Linaro bug report https://bugs.linaro.org/show_bug.cgi?id=4134
-fno-tree-ch gets the required warning. The following change:
diff --git a/gcc/tree-ssa-loop-ch.c b/gcc/tree-ssa-loop-ch.c
index c876d62..d405d00 100644
--- a/gcc/tree-ssa-loop-ch.c
+++ b/gcc/tree-ssa-loop-ch.c
@@ -393,7 +393,7 @@ ch_base::copy_headers (function *fun)
 {
   gimple *stmt = gsi_stmt (bsi);
   if (gimple_code (stmt) == GIMPLE_COND)
-    gimple_set_no_warning (stmt, true);
+    ;//gimple_set_no_warning (stmt, true);
   else if (is_gimple_assign (stmt))
     {
       enum tree_code rhs_code = gimple_assign_rhs_code (stmt);
also gets the required warning. Looking into it.
[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834 --- Comment #4 from kugan at gcc dot gnu.org --- Created attachment 45661 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45661&action=edit ivopt patch v1
[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834 --- Comment #3 from kugan at gcc dot gnu.org ---
I added iv-use for MASKED_LOAD_LANE and the result is:
        cmp     w3, 0
        ble     .L1
        sub     w5, w3, #1
        mov     x4, 0
        lsr     w5, w5, 1
        add     w5, w5, 1
        whilelo p0.s, xzr, x5
        .p2align 3,,7
.L3:
        lsl     x3, x4, 3
        incw    x4
        add     x7, x1, x3
        add     x6, x2, x3
        ld2w    {z4.s - z5.s}, p0/z, [x7]
        ld2w    {z2.s - z3.s}, p0/z, [x6]
        add     x3, x0, x3
        add     z0.s, z4.s, z2.s
        sub     z1.s, z5.s, z3.s
        st2w    {z0.s - z1.s}, p0, [x3]
        whilelo p0.s, x4, x5
        bne     .L3
.L1:
        ret
There is still no base plus scaled index addressing mode. The reason is the following. When called from ivopt:
Breakpoint 4, aarch64_classify_address (info=0x7fffcba0, x=0x76c44f30, mode=E_DImode, strict_p=false, type=ADDR_QUERY_M) at /home/kugan/work/abe/snapshots/gcc.git~origin~aarch64~sve-acle-branch/gcc/config/aarch64/aarch64.c:5689
5689 {
(gdb) p debug_rtx (x)
(plus:DI (mult:DI (reg:DI 91) (const_int 8 [0x8])) (reg:DI 90))
it accepts the address. When in cfgexpand:
Breakpoint 5, aarch64_classify_address (info=0x7fffcca0, x=0x76c5b840, mode=E_VNx8SImode, strict_p=false, type=ADDR_QUERY_M) at /home/kugan/work/abe/snapshots/gcc.git~origin~aarch64~sve-acle-branch/gcc/config/aarch64/aarch64.c:5689
5689 {
(gdb) p debug_rtx (x)
(plus:DI (mult:DI (reg:DI 92 [ ivtmp_28 ]) (const_int 8 [0x8])) (reg/v/f:DI 110 [ y ]))
This is not accepted because aarch64_classify_index (info, op1, mode, strict_p) fails (as it should). Note the difference in mode for aarch64_classify_address. Not sure if this is because of the way my patch changes ivopt.
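For readers without the attachment, the assembly in this thread (a pair of ld2w loads, an add, a sub and an st2w store) is consistent with a loop of the following shape. This is a reconstruction from the generated code, offered as an assumption; the testcase actually attached to the bug may differ in details.

```c
/* Assumed shape of the kernel: even elements are added, odd elements
   are subtracted, so each iteration handles an interleaved pair.
   n is expected to be even; an odd n would access one element past
   the end of the arrays. */
void f(int *restrict x, int *restrict y, int *restrict z, int n)
{
    for (int i = 0; i < n; i += 2) {
        x[i] = y[i] + z[i];
        x[i + 1] = y[i + 1] - z[i + 1];
    }
}
```

The pairwise access pattern is what lets the vectorizer use the two-register structure loads and stores (LD2W and ST2W), and it is also why the index must advance by two vectors' worth of words per iteration.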
[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834 kugan at gcc dot gnu.org changed: What|Removed |Added CC||kugan at gcc dot gnu.org --- Comment #2 from kugan at gcc dot gnu.org --- I'll assign it to myself unless it is being looked at by someone else.
[Bug sanitizer/88333] [9 Regression] ice in asan_emit_stack_protection, at asan.c:1574
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88333 kugan at gcc dot gnu.org changed: What|Removed |Added CC||kugan at gcc dot gnu.org --- Comment #7 from kugan at gcc dot gnu.org --- *** Bug 88350 has been marked as a duplicate of this bug. ***
[Bug sanitizer/88350] Linux kernel build ICE with allyesconfig for aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88350 kugan at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |DUPLICATE --- Comment #3 from kugan at gcc dot gnu.org --- Duplicate *** This bug has been marked as a duplicate of bug 88333 ***
[Bug sanitizer/88350] Linux kernel build ICE with allyesconfig for aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88350 kugan at gcc dot gnu.org changed: What|Removed |Added Alias|PR88333 | --- Comment #2 from kugan at gcc dot gnu.org --- Dup of PR88333 and fixed.
[Bug sanitizer/88350] New: Linux kernel build ICE with allyesconfig for aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88350 Bug ID: 88350 Summary: Linux kernel build ICE with allyesconfig for aarch64 Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: sanitizer Assignee: unassigned at gcc dot gnu.org Reporter: kugan at gcc dot gnu.org CC: dodji at gcc dot gnu.org, dvyukov at gcc dot gnu.org, jakub at gcc dot gnu.org, kcc at gcc dot gnu.org, marxin at gcc dot gnu.org Target Milestone: --- When Linux kernel is built (allyesconfig) with trunk, ++ make CC=/home/tcwg-buildslave/workspace/tcwg_kernel-bisect-gnu_0/bin/aarch64-cc ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- HOSTCC=gcc -j32 -s -k :1335:2: warning: #warning syscall rseq not implemented [-Wcpp] *** WARNING *** there are active plugins, do not report this as a bug unless you can reproduce it without enabling any plugins. Event| Plugins PLUGIN_FINISH_TYPE | randomize_layout_plugin structleak_plugin PLUGIN_FINISH_DECL | randomize_layout_plugin PLUGIN_ATTRIBUTES| randomize_layout_plugin latent_entropy_plugin structleak_plugin PLUGIN_START_UNIT| latent_entropy_plugin PLUGIN_ALL_IPA_PASSES_START | randomize_layout_plugin during RTL pass: expand arch/arm64/mm/flush.c: In function '__sync_icache_dcache': arch/arm64/mm/flush.c:61:6: internal compiler error: in asan_emit_stack_protection, at asan.c:1574 61 | void __sync_icache_dcache(pte_t pte) | ^~~~ Full build Log can be found in: https://ci.linaro.org/job/tcwg_kernel-bisect-gnu-master-aarch64-stable-allyesconfig/11/artifact/artifacts/build-1d89613e77d7db420b13ce3ad8b98f07aaf474e8/console.log Commit that seem to trigger this is: Author: marxin Date: Fri Nov 30 14:25:15 2018 + Make red zone size more flexible for stack variables (PR sanitizer/81715). 2018-11-30 Martin Liska PR sanitizer/81715 * asan.c (asan_shadow_cst): Remove, partially transform into flush_redzone_payload. (RZ_BUFFER_SIZE): New. (struct asan_redzone_buffer): New. (asan_redzone_buffer::emit_redzone_byte): Likewise. 
(asan_redzone_buffer::flush_redzone_payload): Likewise. (asan_redzone_buffer::flush_if_full): Likewise. (asan_emit_stack_protection): Use asan_redzone_buffer class that is responsible for proper aligned stores and flushing of shadow memory payload. * asan.h (ASAN_MIN_RED_ZONE_SIZE): New. (asan_var_and_redzone_size): Likewise. * cfgexpand.c (expand_stack_vars): Use smaller alignment (ASAN_MIN_RED_ZONE_SIZE) in order to make shadow memory for automatic variables more compact. 2018-11-30 Martin Liska PR sanitizer/81715 * c-c++-common/asan/asan-stack-small.c: New test. git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@24 138bc75d-0d04-0410-961f-82ee72b054a4
[Bug rtl-optimization/88212] New: IRA Register Coalescing not working for the testcase
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88212
Bug ID: 88212
Summary: IRA Register Coalescing not working for the testcase
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: kugan at gcc dot gnu.org
Target Milestone: ---
Compiling the following on aarch64 with -O2:
#include
void g(int32_t *p, int32x2x2_t val, int x)
{
  vst2_lane_s32(p,val,0);
}
generates:
        .cfi_startproc
        mov     v2.8b, v0.8b
        mov     v3.8b, v1.8b
        st2     {v2.s - v3.s}[0], [x0]
        ret
clang produces:
        st2     { v0.s, v1.s }[0], [x0]
        ret
Essentially the problem is that accesses to part-registers don't get coalesced, so IRA generates moves which aren't actually required.
[Bug target/86677] popcount builtin detection is breaking some kernel build
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86677 --- Comment #13 from kugan at gcc dot gnu.org --- Author: kugan Date: Mon Nov 12 23:43:56 2018 New Revision: 266039 URL: https://gcc.gnu.org/viewcvs?rev=266039&root=gcc&view=rev Log: gcc/ChangeLog: 2018-11-13 Kugan Vivekanandarajah PR middle-end/86677 PR middle-end/87528 * tree-scalar-evolution.c (expression_expensive_p): Make BUILTIN POPCOUNT as expensive when backend does not define it. gcc/testsuite/ChangeLog: 2018-11-13 Kugan Vivekanandarajah PR middle-end/86677 PR middle-end/87528 * g++.dg/tree-ssa/pr86544.C: Run only for target supporting popcount pattern. * gcc.dg/tree-ssa/popcount.c: Likewise. * gcc.dg/tree-ssa/popcount2.c: Likewise. * gcc.dg/tree-ssa/popcount3.c: Likewise. * gcc.target/aarch64/popcount4.c: New test. * lib/target-supports.exp (check_effective_target_popcountl): New. Added: trunk/gcc/testsuite/gcc.target/aarch64/popcount4.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/g++.dg/tree-ssa/pr86544.C trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount.c trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount2.c trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount3.c trunk/gcc/testsuite/lib/target-supports.exp trunk/gcc/tree-scalar-evolution.c
[Bug middle-end/87528] Popcount changes caused 531.deepsjeng_r run-time regression on Skylake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528 --- Comment #7 from kugan at gcc dot gnu.org --- Author: kugan Date: Mon Nov 12 23:43:56 2018 New Revision: 266039 URL: https://gcc.gnu.org/viewcvs?rev=266039&root=gcc&view=rev Log: gcc/ChangeLog: 2018-11-13 Kugan Vivekanandarajah PR middle-end/86677 PR middle-end/87528 * tree-scalar-evolution.c (expression_expensive_p): Make BUILTIN POPCOUNT as expensive when backend does not define it. gcc/testsuite/ChangeLog: 2018-11-13 Kugan Vivekanandarajah PR middle-end/86677 PR middle-end/87528 * g++.dg/tree-ssa/pr86544.C: Run only for target supporting popcount pattern. * gcc.dg/tree-ssa/popcount.c: Likewise. * gcc.dg/tree-ssa/popcount2.c: Likewise. * gcc.dg/tree-ssa/popcount3.c: Likewise. * gcc.target/aarch64/popcount4.c: New test. * lib/target-supports.exp (check_effective_target_popcountl): New. Added: trunk/gcc/testsuite/gcc.target/aarch64/popcount4.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/g++.dg/tree-ssa/pr86544.C trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount.c trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount2.c trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount3.c trunk/gcc/testsuite/lib/target-supports.exp trunk/gcc/tree-scalar-evolution.c
[Bug c++/87469] [9 Regression] ice in record_estimate, at tree-ssa-loop-niter.c:3271
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87469 --- Comment #5 from kugan at gcc dot gnu.org --- Author: kugan Date: Mon Oct 29 22:02:45 2018 New Revision: 265605 URL: https://gcc.gnu.org/viewcvs?rev=265605&root=gcc&view=rev Log: gcc/testsuite/ChangeLog: 2018-10-29 Kugan Vivekanandarajah PR middle-end/87469 * g++.dg/pr87469.C: New test. gcc/ChangeLog: 2018-10-29 Kugan Vivekanandarajah PR middle-end/87469 * tree-ssa-loop-niter.c (number_of_iterations_popcount): Fix niter max value. Added: trunk/gcc/testsuite/g++.dg/pr87469.C Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-ssa-loop-niter.c
[Bug c++/87469] [9 Regression] ice in record_estimate, at tree-ssa-loop-niter.c:3271
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87469 --- Comment #4 from kugan at gcc dot gnu.org --- In the loop here, the value defined in the loop (e) is used outside the loop, hence this should not be detected as popcount (AFAIK). I will have a look at fixing this.
[Bug target/87253] New: Python test_ctypes fails when built with gcc 8.2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87253 Bug ID: 87253 Summary: Python test_ctypes fails when built with gcc 8.2 Product: gcc Version: 8.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: kugan at gcc dot gnu.org Target Milestone: --- Python-2.7.15 Steps to reproduce error In Python src directory: ./configure make ./python Lib/test/regrtest.py -v test_ctypes == FAIL: test_struct_by_value (ctypes.test.test_win32.Structures) -- Traceback (most recent call last): File "/home/kugan.vivekanandarajah/Python-2.7.15/Lib/ctypes/test/test_win32.py", line 113, in test_struct_by_value self.assertEqual(ret.left, left.value) AssertionError: -200 != 10 gdb ./python b ReturnRect r Lib/test/regrtest.py -v test_ctypesQuit (gdb) p cp $9 = {x = 15, y = 25} (gdb) p fp $10 = {x = 548534164448, y = 9890688} cp and fp are the same as can be seen from below: vi /home/kugan.vivekanandarajah/Python-2.7.15/Lib/ctypes/test/test_win32.py +112 pt = POINT(15, 25) ... ReturnRect = dll.ReturnRect ReturnRect.argtypes = [c_int, RECT, POINTER(RECT), POINT, RECT, POINTER(RECT), POINT, RECT] ret = ReturnRect(i, rect, pointer(rect), pt, rect, byref(rect), pt, rect) gcc -v Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/home/kugan.vivekanandarajah/install/usr/local/bin/../libexec/gcc/aarch64-unknown-linux-gnu/8.2.1/lto-wrapper Target: aarch64-unknown-linux-gnu Configured with: ../gcc/configure --disable-bootstrap Thread model: posix gcc version 8.2.1 20180907 (GCC)
[Bug target/86677] popcount builtin detection is breaking some kernel build
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86677 --- Comment #2 from kugan at gcc dot gnu.org --- (In reply to Richard Biener from comment #1) > The kernel simply has to provide __popcount{s,d}i2 like it provides other > libgcc functions if it chooses to not link against libgcc. Yes, I created this bug just so that I can point the kernel people to it. I will raise it with the kernel people internally and see what I can do. Thanks.
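As a rough illustration of the suggestion quoted above, here is a minimal sketch of a software popcount that a freestanding project could use as the body of such a helper. It is not code from libgcc or the kernel. The names soft_popcountdi2 and soft_popcountsi2 are hypothetical (chosen so they do not clash with the real libgcc symbols), and the argument types are an assumption about what a 64-bit and a 32-bit variant would take.

```c
/* Hypothetical stand-in for a 64-bit popcount helper.  Uses the
   classic SWAR reduction: sum bits in pairs, then nibbles, then
   bytes, and finally add all bytes with one multiply.  */
int
soft_popcountdi2 (unsigned long long x)
{
  x = x - ((x >> 1) & 0x5555555555555555ULL);
  x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
  x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
  /* The top byte of the product holds the sum of all eight bytes.  */
  return (int) ((x * 0x0101010101010101ULL) >> 56);
}

/* Hypothetical 32-bit variant, implemented on top of the 64-bit one.  */
int
soft_popcountsi2 (unsigned int x)
{
  return soft_popcountdi2 ((unsigned long long) x);
}
```

For example, soft_popcountdi2 (0xF0) counts the four set bits in the low byte. Whether a project exports such a function under the libgcc names is a separate decision that this sketch does not address.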
[Bug target/86677] New: popcount builtin detection is breaking some kernel build
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86677 Bug ID: 86677 Summary: popcount builtin detection is breaking some kernel build Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: kugan at gcc dot gnu.org Target Milestone: --- Popcount builtin detection breaks the Linux kernel build for arm/aarch64 (and possibly other targets) when the backend does not provide the appropriate popcount patterns. For aarch64 this happens because the kernel is built with -mgeneral-regs-only. Also discussed in: https://gcc.gnu.org/ml/gcc-patches/2018-07/msg00489.html
[Bug tree-optimization/86544] Popcount detection generates different code on C and C++
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86544 --- Comment #4 from kugan at gcc dot gnu.org --- Author: kugan Date: Wed Jul 18 22:11:24 2018 New Revision: 262864 URL: https://gcc.gnu.org/viewcvs?rev=262864&root=gcc&view=rev Log: gcc/ChangeLog: 2018-07-18 Kugan Vivekanandarajah PR middle-end/86544 * tree-ssa-phiopt.c (cond_removal_in_popcount_pattern): Handle comparision with EQ_EXPR in last stmt. gcc/testsuite/ChangeLog: 2018-07-18 Kugan Vivekanandarajah PR middle-end/86544 * g++.dg/tree-ssa/pr86544.C: New test. Added: trunk/gcc/testsuite/g++.dg/tree-ssa/pr86544.C Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-ssa-phiopt.c
[Bug tree-optimization/86544] Popcount detection generates different code on C and C++
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86544 --- Comment #2 from kugan at gcc dot gnu.org --- Patch posted at https://gcc.gnu.org/ml/gcc-patches/2018-07/msg00975.html
[Bug tree-optimization/86544] Popcount detection generates different code on C and C++
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86544 --- Comment #1 from kugan at gcc dot gnu.org --- (In reply to ktkachov from comment #0) > Great to see that GCC now detects the popcount loop in PR 82479! > I am seeing some curious differences between gcc and g++ though. > int > pc (unsigned long long b) > { > int c = 0; > > while (b) { > b &= b - 1; > c++; > } > > return c; > } > > If compiled with gcc -O3 on aarch64 this gives: > pc: > fmov d0, x0 > cnt v0.8b, v0.8b > addv b0, v0.8b > umov w0, v0.b[0] > ret > > whereas if compiled with g++ -O3 it gives: > _Z2pcy: > .LFB0: > .cfi_startproc > fmov d0, x0 > cmp x0, 0 > cnt v0.8b, v0.8b > addv b0, v0.8b > umov w0, v0.b[0] > and x0, x0, 255 > csel w0, w0, wzr, ne > ret > > which is suboptimal. It seems that phiopt3 manages to optimise the C version > better. The GIMPLE dumps just before the phiopt pass are: > For the C (good version): > > int c; > int _7; > >[local count: 118111601]: > if (b_4(D) != 0) > goto ; [89.00%] > else > goto ; [11.00%] > >[local count: 105119324]: > _7 = __builtin_popcountl (b_4(D)); > >[local count: 118111601]: > # c_12 = PHI <0(2), _7(3)> > return c_12; > > > For the C++ (bad version): > > int c; > int _7; > >[local count: 118111601]: > if (b_4(D) == 0) > goto ; [11.00%] > else > goto ; [89.00%] > >[local count: 105119324]: > _7 = __builtin_popcountl (b_4(D)); > >[local count: 118111601]: > # c_12 = PHI <0(2), _7(3)> > return c_12; > > As you can see the order of the gotos and the jump conditions is inverted. > > It seems to me that the two are equivalent and GCC could be doing a better > job of optimising. > > Can we improve phiopt to handle this more effectively? Thanks for the test case. I will look at it.
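The two GIMPLE shapes quoted above can be written out in C to show why the guard is redundant in either form. This is only a sketch for illustration: the function names are hypothetical, and it uses __builtin_popcountll (the unsigned long long variant) rather than the __builtin_popcountl that appears in the dump. Since the popcount of zero is zero, both guarded forms should return the same value as the plain builtin call.

```c
/* Shape of the C dump: test for non-zero, then call the builtin.  */
int
pc_guarded_ne (unsigned long long b)
{
  return b != 0 ? __builtin_popcountll (b) : 0;
}

/* Shape of the C++ dump: same logic with the condition inverted
   (test for zero first).  */
int
pc_guarded_eq (unsigned long long b)
{
  if (b == 0)
    return 0;
  return __builtin_popcountll (b);
}

/* What both forms can be simplified to once the guard is removed.  */
int
pc_plain (unsigned long long b)
{
  return __builtin_popcountll (b);
}
```

Comparing the assembly of these three functions at -O3 is one way to check whether a given compiler version removes the guard for both the != and == forms.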
[Bug tree-optimization/86489] ICE in gimple_phi_arg starting with r261682 when building 531.deepsjeng_r with FDO + LTO
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86489 --- Comment #7 from kugan at gcc dot gnu.org --- Author: kugan Date: Fri Jul 13 05:25:47 2018 New Revision: 262622 URL: https://gcc.gnu.org/viewcvs?rev=262622&root=gcc&view=rev Log: gcc/ChangeLog: 2018-07-13 Kugan Vivekanandarajah Richard Biener PR middle-end/86489 * tree-ssa-loop-niter.c (number_of_iterations_popcount): Check that the loop latch destination where phi is defined. gcc/testsuite/ChangeLog: 2018-07-13 Kugan Vivekanandarajah PR middle-end/86489 * gcc.dg/pr86489.c: New test. Added: trunk/gcc/testsuite/gcc.dg/pr86489.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-ssa-loop-niter.c
[Bug tree-optimization/86489] ICE in gimple_phi_arg starting with r261682 when building 531.deepsjeng_r with FDO + LTO
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86489 --- Comment #3 from kugan at gcc dot gnu.org --- (In reply to Richard Biener from comment #2) > gimple *phi = SSA_NAME_DEF_STMT (b_11); > if (gimple_code (phi) != GIMPLE_PHI > || (gimple_assign_lhs (and_stmt) > != gimple_phi_arg_def (phi, loop_latch_edge (loop)->dest_idx))) > return false; > > this may fail if the PHI in question is not the correct one in which case > it may not have the argument at the latch dest_idx. Try first verifying > that the loop latch destination is indeed gimple_bb (phi). Yes, thanks for spotting this. I am testing the following patch: diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c index f6fa2f7..fbdf838 100644 --- a/gcc/tree-ssa-loop-niter.c +++ b/gcc/tree-ssa-loop-niter.c @@ -2555,6 +2555,7 @@ number_of_iterations_popcount (loop_p loop, edge exit, ... = PHI . */ gimple *phi = SSA_NAME_DEF_STMT (b_11); if (gimple_code (phi) != GIMPLE_PHI + || (gimple_bb (phi) != loop_latch_edge (loop)->dest) || (gimple_assign_lhs (and_stmt) != gimple_phi_arg_def (phi, loop_latch_edge (loop)->dest_idx))) return false; Is checking that there is an argument at the latch dest_idx (argument count of PHI) still necessary?
[Bug tree-optimization/86489] ICE in gimple_phi_arg starting with r261682 when building 531.deepsjeng_r with FDO + LTO
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86489 --- Comment #1 from kugan at gcc dot gnu.org --- Sorry about the breakage, I am trying to reproduce it on x86-64. Please let me know if you have a testcase.
[Bug middle-end/82479] missing popcount builtin detection
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82479 --- Comment #13 from kugan at gcc dot gnu.org --- Author: kugan Date: Sat Jun 16 21:39:31 2018 New Revision: 261682 URL: https://gcc.gnu.org/viewcvs?rev=261682&root=gcc&view=rev Log: gcc/ChangeLog: 2018-06-16 Kugan Vivekanandarajah PR middle-end/82479 * ipa-fnsummary.c (will_be_nonconstant_expr_predicate): Handle CALL_EXPR. * tree-scalar-evolution.c (interpret_expr): Likewise. (expression_expensive_p): Likewise. * tree-ssa-loop-ivopts.c (contains_abnormal_ssa_name_p): Likewise. * tree-ssa-loop-niter.c (number_of_iterations_popcount): New. (number_of_iterations_exit_assumptions): Use number_of_iterations_popcount. (ssa_defined_by_minus_one_stmt_p): New. gcc/testsuite/ChangeLog: 2018-06-16 Kugan Vivekanandarajah PR middle-end/82479 * gcc.dg/tree-ssa/popcount.c: New test. * gcc.dg/tree-ssa/popcount2.c: New test. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount.c trunk/gcc/testsuite/gcc.dg/tree-ssa/popcount2.c Modified: trunk/gcc/ChangeLog trunk/gcc/ipa-fnsummary.c trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-scalar-evolution.c trunk/gcc/tree-ssa-loop-ivopts.c trunk/gcc/tree-ssa-loop-niter.c
[Bug tree-optimization/64946] [AArch64] gcc.target/aarch64/vect-abs-compile.c - "abs" vectorization fails for char/short types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64946 --- Comment #24 from kugan at gcc dot gnu.org --- Author: kugan Date: Sat Jun 16 21:34:29 2018 New Revision: 261681 URL: https://gcc.gnu.org/viewcvs?rev=261681&root=gcc&view=rev Log: gcc/ChangeLog: 2018-06-16 Kugan Vivekanandarajah PR middle-end/64946 * cfgexpand.c (expand_debug_expr): Hande ABSU_EXPR. * config/i386/i386.c (ix86_add_stmt_cost): Likewise. * dojump.c (do_jump): Likewise. * expr.c (expand_expr_real_2): Check operand type's sign. * fold-const.c (const_unop): Handle ABSU_EXPR. (fold_abs_const): Likewise. * gimple-pretty-print.c (dump_unary_rhs): Likewise. * gimple-ssa-backprop.c (backprop::process_assign_use): Likesie. (strip_sign_op_1): Likesise. * match.pd: Add new pattern to generate ABSU_EXPR. * optabs-tree.c (optab_for_tree_code): Handle ABSU_EXPR. * tree-cfg.c (verify_gimple_assign_unary): Likewise. * tree-eh.c (operation_could_trap_helper_p): Likewise. * tree-inline.c (estimate_operator_cost): Likewise. * tree-pretty-print.c (dump_generic_node): Likewise. * tree-vect-patterns.c (vect_recog_sad_pattern): Likewise. * tree.def (ABSU_EXPR): New. gcc/c-family/ChangeLog: 2018-06-16 Kugan Vivekanandarajah * c-common.c (c_common_truthvalue_conversion): Handle ABSU_EXPR. gcc/c/ChangeLog: 2018-06-16 Kugan Vivekanandarajah * c-typeck.c (build_unary_op): Handle ABSU_EXPR; * gimple-parser.c (c_parser_gimple_statement): Likewise. (c_parser_gimple_unary_expression): Likewise. gcc/cp/ChangeLog: 2018-06-16 Kugan Vivekanandarajah * constexpr.c (potential_constant_expression_1): Handle ABSU_EXPR. * cp-gimplify.c (cp_fold): Likewise. gcc/testsuite/ChangeLog: 2018-06-16 Kugan Vivekanandarajah PR middle-end/64946 * gcc.dg/absu.c: New test. * gcc.dg/gimplefe-29.c: New test. * gcc.target/aarch64/pr64946.c: New test. 
Added: trunk/gcc/testsuite/gcc.dg/absu.c trunk/gcc/testsuite/gcc.dg/gimplefe-29.c trunk/gcc/testsuite/gcc.target/aarch64/pr64946.c Modified: trunk/gcc/ChangeLog trunk/gcc/c-family/ChangeLog trunk/gcc/c-family/c-common.c trunk/gcc/c/ChangeLog trunk/gcc/c/c-typeck.c trunk/gcc/c/gimple-parser.c trunk/gcc/cfgexpand.c trunk/gcc/config/i386/i386.c trunk/gcc/cp/ChangeLog trunk/gcc/cp/constexpr.c trunk/gcc/cp/cp-gimplify.c trunk/gcc/dojump.c trunk/gcc/expr.c trunk/gcc/fold-const.c trunk/gcc/gimple-pretty-print.c trunk/gcc/gimple-ssa-backprop.c trunk/gcc/match.pd trunk/gcc/optabs-tree.c trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-cfg.c trunk/gcc/tree-eh.c trunk/gcc/tree-inline.c trunk/gcc/tree-pretty-print.c trunk/gcc/tree-vect-patterns.c trunk/gcc/tree.def
[Bug fortran/78387] OpenMP segfault/stack size exceeded writing to internal file
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78387 kugan at gcc dot gnu.org changed:
           What    |Removed                     |Added
                 CC|                            |kugan at gcc dot gnu.org
--- Comment #17 from kugan at gcc dot gnu.org --- *** Bug 82555 has been marked as a duplicate of this bug. ***
[Bug libfortran/82555] SPECcpu201 Wrf_s deadlock
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82555 kugan at gcc dot gnu.org changed:
           What    |Removed                     |Added
             Status|WAITING                     |RESOLVED
         Resolution|---                         |DUPLICATE
--- Comment #6 from kugan at gcc dot gnu.org --- *** This bug has been marked as a duplicate of bug 78387 ***
[Bug libfortran/82555] SPECcpu201 Wrf_s deadlock
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82555 --- Comment #5 from kugan at gcc dot gnu.org --- (In reply to Andrew Pinski from comment #4) > Actually PR 78387 seems exactly this issue. Please test with a newer > version of gfortran. Thanks Andrew. Looks like this is the issue. So far, current trunk is continuing without error.
[Bug libgomp/82555] SPECcpu201 Wrf_s deadlock
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82555 --- Comment #1 from kugan at gcc dot gnu.org --- My gcc is slightly old. gcc -v Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/home/kugan.vivekanandarajah/install/test/usr/local/bin/../libexec/gcc/aarch64-unknown-linux-gnu/8.0.0/lto-wrapper Target: aarch64-unknown-linux-gnu Configured with: ../gcc-exp2/configure : (reconfigured) ../gcc-exp2/configure --enable-languages=c,c++,fortran,lto,objc --no-create --no-recursion Thread model: posix gcc version 8.0.0 20170822 (experimental) (GCC) I will try with the latest version.
[Bug libgomp/82555] New: SPECcpu201 Wrf_s deadlock
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82555 Bug ID: 82555 Summary: SPECcpu201 Wrf_s deadlock Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: libgomp Assignee: unassigned at gcc dot gnu.org Reporter: kugan at gcc dot gnu.org CC: jakub at gcc dot gnu.org Target Milestone: --- Wrf_s hangs or deadlocks when run on 48 threads (cores). It doesn't always happen; I have to run with --iterations=111 and it eventually happens, sometimes in the 2nd iteration and sometimes much later. I attached gdb to the process and the backtrace is:
(gdb) bt
#0 0x01019924 in __lll_lock_wait (futex=futex@entry=0x2c3b1e0 <_gfortrani_unit_lock>, private=0) at lowlevellock.c:43
#1 0x01012cbc in __pthread_mutex_lock (mutex=0x2c3b1e0 <_gfortrani_unit_lock>) at pthread_mutex_lock.c:80
#2 0x00fd20ac in __gthread_mutex_lock (__mutex=0x2c3b1e0 <_gfortrani_unit_lock>) at ../libgcc/gthr-default.h:748
#3 _gfortrani_close_units () at ../../../gcc-exp2/libgfortran/io/unit.c:835
#4 0x0103950c in __libc_csu_fini ()
#5 0x0103f068 in __run_exit_handlers ()
#6 0x0103f0b0 in exit ()
#7 0x00fc6e60 in _gfortrani_exit_error (status=1, status@entry=3) at ../../../gcc-exp2/libgfortran/runtime/error.c:196
#8 0x00fc7314 in _gfortrani_internal_error (cmp=cmp@entry=0xcdf23d00, message=message@entry=0x11548a8 "stash_internal_unit(): Stack Size Exceeded") at ../../../gcc-exp2/libgfortran/runtime/error.c:422
#9 0x00fd1a84 in _gfortrani_stash_internal_unit (dtp=0xcdf23d00) at ../../../gcc-exp2/libgfortran/io/unit.c:549
#10 0x00fd0f6c in _gfortran_st_write_done (dtp=0xcdf23d00) at ../../../gcc-exp2/libgfortran/io/transfer.c:4168
#11 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#12 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#13 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#14 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#15 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#16 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#17 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#18 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#19 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#20 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#21 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#22 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#23 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#24 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#25 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#26 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#27 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#28 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#29 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#30 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#31 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#32 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#33 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#34 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#35 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#36 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#37 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#38 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#39 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#40 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#41 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#42 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#43 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#44 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#45 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
#46 0x00db933c in __module_ra_rrtm_MOD_rrtmlwrad ()
I am running this on AArch64 but I don't think this is an AArch64-specific issue. Is anyone else seeing this?
[Bug middle-end/82479] missing popcount builtin detection
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82479 --- Comment #4 from kugan at gcc dot gnu.org --- (In reply to Andrew Pinski from comment #2) > Confirmed. How useful this optimization is questionable. This code is part of spec2017/deepsjeng. There is some gain if we can detect it. > > Gcc has __builtin_popcount which can be used. I agree.
[Bug middle-end/82479] missing popcount builtin detection
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82479 --- Comment #1 from kugan at gcc dot gnu.org --- gcc trunk generates: PopCount: mov w2, 0 cbz x0, .L1 .p2align 3 .L3: sub x1, x0, #1 add w2, w2, 1 ands x0, x0, x1 bne .L3 .L1: mov w0, w2 ret
[Bug middle-end/82479] New: missing popcount builtin detection
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82479 Bug ID: 82479 Summary: missing popcount builtin detection Product: gcc Version: unknown Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: kugan at gcc dot gnu.org Target Milestone: --- gcc does not have support for detecting loops that compute a popcount. As a result, gcc generates bad code for int PopCount (long b) { int c = 0; while (b) { b &= b - 1; c++; } return c; } clang seems to detect this and generates (for aarch64): _Z8PopCounty: fmov d0, x0 cnt v0.8b, v0.8b uaddlv h0, v0.8b fmov w0, s0 ret
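For readers who want to try the pattern from this report, the following is a minimal sketch that places the loop next to the builtin it is expected to be recognised as. The function names are hypothetical, and the sketch assumes an unsigned long long argument (the report itself uses long) so that b - 1 is well defined for every input.

```c
/* Loop from the report: each iteration clears the lowest set bit,
   so the loop body runs once per set bit.  */
int
popcount_loop (unsigned long long b)
{
  int c = 0;
  while (b)
    {
      b &= b - 1;
      c++;
    }
  return c;
}

/* The builtin that computes the same value directly.  */
int
popcount_builtin (unsigned long long b)
{
  return __builtin_popcountll (b);
}
```

Compiling both functions with -O2 -S and comparing the output is one way to see whether the loop has been turned into a popcount sequence on a given target.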
[Bug tree-optimization/81558] Loop not vectorized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81558 --- Comment #2 from kugan at gcc dot gnu.org --- > Does LLVM do a runtime alias check here? For foo1 GCC adds a runtime alias > check > (BB vectorization cannot version for aliasing). Yes. LLVM does not seem to be unrolling the inner loop. As you said, when disabling cunrolli it works. The cunroll pass will unroll after loop vectorisation. Can anything be done with the heuristics for this case? Thanks.
[Bug middle-end/81558] New: Loop not vectorized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81558 Bug ID: 81558 Summary: Loop not vectorized Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: kugan at gcc dot gnu.org Target Milestone: --- For the testcase: struct I { int opix_x; int opix_y; }; //#define R #define R __restrict__ extern struct I * R img; extern unsigned short ** R imgY_org; extern unsigned short orig_blocks[256]; void foo1 (int n) { int x = 1, y = 1; unsigned short *orgptr=orig_blocks; // Vectorized for (y = 0; y < img->opix_y; y++) for (x = 0; x < img->opix_x; x++) *orgptr++ = imgY_org [y][x]; } void foo2 (int n) { int x = 1, y = 1; unsigned short *orgptr=orig_blocks; // Not vectorized for (y = img->opix_y; y < img->opix_y+16; y++) for (x = img->opix_x; x < img->opix_x+16; x++) *orgptr++ = imgY_org [y][x]; } Loop in foo2 is not vectorized. In the *.156t.vect, I see: Creating dr for *_40 analyze_innermost: failed: evolution of base is not affine. base_address: offset from base address: constant offset from base address: step: aligned to: base_object: *_40 LLVM seems to be able to vectorize this.
[Bug tree-optimization/80612] [7/8 Regression] ICE in get_range_info, at tree-ssanames.c:375
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80612 --- Comment #5 from kugan at gcc dot gnu.org --- (In reply to Marek Polacek from comment #4) > This should fix it: > > --- a/gcc/calls.c > +++ b/gcc/calls.c > @@ -1270,7 +1270,7 @@ get_size_range (tree exp, tree range[2]) > >wide_int min, max; >enum value_range_type range_type > -= (TREE_CODE (exp) == SSA_NAME > += ((TREE_CODE (exp) == SSA_NAME && INTEGRAL_TYPE_P (TREE_TYPE (exp))) > ? get_range_info (exp, &min, &max) : VR_VARYING); > >if (range_type == VR_VARYING) Looked at the other uses of get_range_info too. There are uses of this in gcc/gimple-ssa-warn-alloca.c without the check for INTEGRAL_TYPE_P but I think it is intentional.
[Bug lto/78140] [7 Regression] libxul -flto uses 1GB more memory than gcc-6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78140 kugan at gcc dot gnu.org changed: What|Removed |Added CC||kugan at gcc dot gnu.org --- Comment #26 from kugan at gcc dot gnu.org --- (In reply to Richard Biener from comment #20) > Look at tree-ssanames.c:range_info_def for "tricks" (make them variable > size): > > /* Value range information for SSA_NAMEs representing non-pointer variables. > */ > > struct GTY ((variable_size)) range_info_def { > /* Minimum, maximum and nonzero bits. */ > TRAILING_WIDE_INT_ACCESSOR (min, ints, 0) > TRAILING_WIDE_INT_ACCESSOR (max, ints, 1) > TRAILING_WIDE_INT_ACCESSOR (nonzero_bits, ints, 2) > trailing_wide_ints <3> ints; > }; I am working on a patch to change ipa vrp based on the above.
[Bug tree-optimization/78721] [7 Regression] ICE on valid code at -O2 and -O3 on x86_64-linux-gnu: in set_value_range, at tree-vrp.c:371
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78721 --- Comment #4 from kugan at gcc dot gnu.org --- Author: kugan Date: Fri Dec 9 19:47:10 2016 New Revision: 243501 URL: https://gcc.gnu.org/viewcvs?rev=243501&root=gcc&view=rev Log: gcc/testsuite/ChangeLog: 2016-12-09 Kugan Vivekanandarajah PR ipa/78721 * gcc.dg/pr78721.c: New test. gcc/ChangeLog: 2016-12-09 Kugan Vivekanandarajah PR ipa/78721 * ipa-cp.c (propagate_vr_accross_jump_function): drop_tree_overflow after fold_convert. Added: trunk/gcc/testsuite/gcc.dg/pr78721.c Modified: trunk/gcc/ChangeLog trunk/gcc/ipa-cp.c trunk/gcc/testsuite/ChangeLog
[Bug tree-optimization/78721] [7 Regression] ICE on valid code at -O2 and -O3 on x86_64-linux-gnu: in set_value_range, at tree-vrp.c:371
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78721 kugan at gcc dot gnu.org changed:
           What    |Removed                     |Added
                 CC|                            |kugan at gcc dot gnu.org
--- Comment #3 from kugan at gcc dot gnu.org --- Created attachment 40280 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40280&action=edit untested patch
[Bug tree-optimization/77862] [7 Regression] ice in add_equivalence
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77862 kugan at gcc dot gnu.org changed:
           What    |Removed                     |Added
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED
--- Comment #8 from kugan at gcc dot gnu.org --- Fixed in trunk.
[Bug tree-optimization/72835] [7 Regression] Incorrect arithmetic optimization involving bitfield arguments
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72835 kugan at gcc dot gnu.org changed:
           What    |Removed                     |Added
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED
--- Comment #6 from kugan at gcc dot gnu.org --- Fixed in trunk.
[Bug tree-optimization/71408] [7 Regression] wrong code at -Os and above on x86_64-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71408 kugan at gcc dot gnu.org changed:
           What    |Removed                     |Added
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED
--- Comment #7 from kugan at gcc dot gnu.org --- Fixed in trunk.
[Bug tree-optimization/40921] missed optimization: x + (-y * z * z) => x - y * z * z
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40921 kugan at gcc dot gnu.org changed:
           What    |Removed                     |Added
             Status|NEW                         |RESOLVED
                 CC|                            |kugan at gcc dot gnu.org
         Resolution|---                         |FIXED
--- Comment #6 from kugan at gcc dot gnu.org --- Fixed in trunk.