David Malcolm <dmalc...@redhat.com> writes: > On Mon, 2018-06-25 at 11:10 +0200, Richard Biener wrote: >> On Fri, 22 Jun 2018, David Malcolm wrote: >> >> > NightStrike and I were chatting on IRC last week about >> > issues with trying to vectorize the following code: >> > >> > #include <vector> >> > std::size_t f(std::vector<std::vector<float>> const & v) { >> > std::size_t ret = 0; >> > for (auto const & w: v) >> > ret += w.size(); >> > return ret; >> > } >> > >> > icc could vectorize it, but gcc couldn't, but neither of us could >> > immediately figure out what the problem was. >> > >> > Using -fopt-info leads to a wall of text. >> > >> > I tried using my patch here: >> > >> > "[PATCH] v3 of optinfo, remarks and optimization records" >> > https://gcc.gnu.org/ml/gcc-patches/2018-06/msg01267.html >> > >> > It improved things somewhat, by showing: >> > (a) the nesting structure via indentation, and >> > (b) the GCC line at which each message is emitted (by using the >> > "remark" output) >> > >> > but it's still a wall of text: >> > >> > https://dmalcolm.fedorapeople.org/gcc/2018-06-18/test.cc.remarks. >> > html >> > https://dmalcolm.fedorapeople.org/gcc/2018-06-18/test.cc.d/..%7C. >> > .%7Csrc%7Ctest.cc.html#line-4 >> > >> > It doesn't yet provide a simple high-level message to a >> > tech-savvy user on what they need to do to get GCC to >> > vectorize their loop. >> >> Yeah, in particular the vectorizer is way too noisy in its low-level >> functions. IIRC -fopt-info-vec-missed is "somewhat" better: >> >> t.C:4:26: note: step unknown. >> t.C:4:26: note: vector alignment may not be reachable >> t.C:4:26: note: not ssa-name. >> t.C:4:26: note: use not simple. >> t.C:4:26: note: not ssa-name. >> t.C:4:26: note: use not simple. >> t.C:4:26: note: no array mode for V2DI[3] >> t.C:4:26: note: Data access with gaps requires scalar epilogue loop >> t.C:4:26: note: can't use a fully-masked loop because the target >> doesn't >> have the appropriate masked load or store. >> t.C:4:26: note: not ssa-name. >> t.C:4:26: note: use not simple. >> t.C:4:26: note: not ssa-name. >> t.C:4:26: note: use not simple. >> t.C:4:26: note: no array mode for V2DI[3] >> t.C:4:26: note: Data access with gaps requires scalar epilogue loop >> t.C:4:26: note: op not supported by target. >> t.C:4:26: note: not vectorized: relevant stmt not supported: _15 = >> _14 >> /[ex] 4; >> t.C:4:26: note: bad operation or unsupported loop bound. >> t.C:4:26: note: not vectorized: no grouped stores in basic block. >> t.C:4:26: note: not vectorized: no grouped stores in basic block. >> t.C:6:12: note: not vectorized: not enough data-refs in basic block. >> >> >> > The pertinent dump messages are: >> > >> > test.cc:4:23: remark: === try_vectorize_loop_1 === >> > [../../src/gcc/tree-vectorizer.c:674:try_vectorize_loop_1] >> > cc1plus: remark: >> > Analyzing loop at test.cc:4 >> > [../../src/gcc/dumpfile.c:735:ensure_pending_optinfo] >> > test.cc:4:23: remark: === analyze_loop_nest === >> > [../../src/gcc/tree-vect-loop.c:2299:vect_analyze_loop] >> > [...snip...] >> > test.cc:4:23: remark: === vect_analyze_loop_operations === >> > [../../src/gcc/tree-vect-loop.c:1520:vect_analyze_loop_operations] >> > [...snip...] >> > test.cc:4:23: remark: ==> examining statement: ‘_15 = _14 /[ex] >> > 4;’ [../../src/gcc/tree-vect-stmts.c:9382:vect_analyze_stmt] >> > test.cc:4:23: remark: vect_is_simple_use: operand ‘_14’ >> > [../../src/gcc/tree-vect-stmts.c:10064:vect_is_simple_use] >> > test.cc:4:23: remark: def_stmt: ‘_14 = _8 - _7;’ >> > [../../src/gcc/tree-vect-stmts.c:10098:vect_is_simple_use] >> > test.cc:4:23: remark: type of def: internal [../../src/gcc/tree- >> > vect-stmts.c:10112:vect_is_simple_use] >> > test.cc:4:23: remark: vect_is_simple_use: operand ‘4’ >> > [../../src/gcc/tree-vect-stmts.c:10064:vect_is_simple_use] >> > test.cc:4:23: remark: op not supported by target. >> > [../../src/gcc/tree-vect-stmts.c:5932:vectorizable_operation] >> > test.cc:4:23: remark: not vectorized: relevant stmt not >> > supported: ‘_15 = _14 /[ex] 4;’ [../../src/gcc/tree-vect- >> > stmts.c:9565:vect_analyze_stmt] >> > test.cc:4:23: remark: bad operation or unsupported loop bound. >> > [../../src/gcc/tree-vect-loop.c:2043:vect_analyze_loop_2] >> > cc1plus: remark: vectorized 0 loops in function. >> > [../../src/gcc/tree-vectorizer.c:904:vectorize_loops] >> > >> > In particular, that complaint from >> > [../../src/gcc/tree-vect-stmts.c:9565:vect_analyze_stmt] >> > is coming from: >> > >> > if (!ok) >> > { >> > if (dump_enabled_p ()) >> > { >> > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, >> > "not vectorized: relevant stmt not "); >> > dump_printf (MSG_MISSED_OPTIMIZATION, "supported: "); >> > dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM, >> > stmt, 0); >> > } >> > >> > return false; >> > } >> > >> > This got me thinking: the user presumably wants to know several >> > things: >> > >> > * the location of the loop that can't be vectorized (vect_location >> > captures this) >> > * location of the problematic statement >> > * why it's problematic >> > * the problematic statement itself. >> > >> > The following is an experiment at capturing that information, by >> > recording an "opt_problem" instance describing what the >> > optimization >> > problem is, created deep in the callstack when it occurs, for use >> > later on back at the top of the vectorization callstack. >> >> Nice idea. Of course the issue is then for which issues to >> exactly create those. Like all of the MSG_MISSED_OPTIMIZATION >> dumpings? >> >> I guess the vectorizer needs some axing of useless messages >> and/or we need a MSG_DEBUG to have an additional level below >> MSG_NOTE. >> >> > This extra work is only done if dump_enabled_p. >> > >> > It feels vaguely analogous to an exception object (in terms of >> > packaging up a problem that occurs deep in the stack for reporting >> > back at a higher level). >> > >> > With this patch, we emit: >> > >> > ../../src/test.cc: In function ‘std::size_t f(const >> > std::vector<std::vector<float> >&)’: >> > ../../src/test.cc:4:23: remark: couldn't vectorize loop >> > for (auto const & w: v) >> > ^ >> > In file included from ../x86_64-pc-linux-gnu/libstdc++- >> > v3/include/vector:64, >> > from ../../src/test.cc:1: >> > ../x86_64-pc-linux-gnu/libstdc++- >> > v3/include/bits/stl_vector.h:806:50: >> > note: relevant stmt not supported: ‘_15 = _14 /[ex] 4;’ >> > { return size_type(this->_M_impl._M_finish - this- >> > >_M_impl._M_start); } >> > ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~ >> > ~~~~~~~ >> > >> > which reports both the location of the loop and the statement >> > that's >> > problematic (if I'm reading it right, the pointer arithmetic leads >> > to a >> > division by 4, and presumably we're not able to handle that). >> >> Quite likely because we only handle TRUNC_DIV_EXPR and not >> EXACT_DIV_EXPR >> which we can handle the same semantically (EXACT_DIV_EXPR just gives >> us stronger guarantees). > > [...snip...] > > I've been experimenting with the idea; here's an updated version of > the patch (on top of the optinfo patch kit, as it happens). > > I tried adding more uses of opt_problem, but ran into the issue that it's > hard to find "by hand" the places deep in the callstack where things fail; > I spent a chunk of time stepping through failures, trying to figure out > the best place to capture the opt_problem. > > It seemed like something that we could track in C++'s type system. > > The following kit solves this by introducing a new class opt_result, which > looks a lot like a bool: it has the same representation in memory. > > For instance, at the deepest point of the callstack where the failure > happens, rather than: > > if (!check_something ()) > { > if (dump_enabled_p ()) > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > "foo is unsupported.\n"); > return false; > } > [...lots more checks...] > > // All checks passed: > return true; > > we can capture the cause of the failure via: > > if (!check_something ()) > { > if (dump_enabled_p ()) > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > "foo is unsupported.\n"); > return opt_result::failure ("foo is unsupported", > stmt); > } > [...lots more checks...] > > // All checks passed: > return opt_result::success (); > > which effectively returns true or false, whilst recording any problem. > > opt_result::success and opt_result::failure return opt_result values > which "looks like" true/false respectively, via operator bool(). > If dump_enabled_p, then opt_result::failure also creates an opt_problem *, > capturing the pertinent data (here, "foo is unsupported" and "stmt"). > If dumps are disabled, then opt_problem instances aren't > created, and it's equivalent to just returning a bool (false for failure).
Since the creation of the opt_problem depends on dump_enabled_p, would it make sense for the dump_printf_loc to happen automatically on opt_result::failure, rather than have both? E.g.: if (!check_something ()) // Also calls dump_printf_loc if appropriate. return opt_result::failure ("foo is unsupported", stmt); I think the aim should be for opt_problem to capture any information that's useful at the user level, so doing the dump_printf_loc automatically would help to keep the MISSED_OPTIMIZATIONS messages clean of information that's too low-level. > The opt_problem can be propagated via opt_result values back up > the call stack to where it makes most sense to the user. > For instance, rather than: > > bool ok = try_something_that_might_fail (); > if (!ok) > { > if (dump_enabled_p ()) > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > "some message.\n"); > return false; > } > > we can replace the bool with an opt_result, so if dump_enabled_p, we > assume that if try_something_that_might_fail, an opt_problem * will be > created, and we can propagate it up the call chain: > > opt_result ok = try_something_that_might_fail (); > if (!ok) > { > if (dump_enabled_p ()) > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > "some message.\n"); > return ok; // propagating the opt_result > } > > There's currently a way to implicitly construct an opt_result from a > bool, but this is scaffolding: commenting it out allows the compiler to > tell us where we need to capture failure information. I guess this is bike-shedding, but personally I'd prefer an explicit test for success rather than operator bool, so that: opt_result foo = ...; bool bar = foo; is ill-formed. The danger otherwise might be that we drop a useful opt_problem and replace it with something more generic. E.g. the opt_result form of: if (!ok) { if (dump_enabled_p ()) { dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, "not vectorized: relevant stmt not "); dump_printf (MSG_MISSED_OPTIMIZATION, "supported: "); dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM, stmt, 0); } return false; } in vect_analyze_stmt could silently drop the information provided by the subroutine if we forgot to change "ok" from "bool" to "opt_result". Same for opt_wrapper<T>, but defining operator* and operator-> would be useful. LGTM otherwise (but obviously it's Richard B's call). Thanks, Richard > As well as opt_result for "bool", there's a template for wrapping pointers > where non-NULL is "success" and NULL "failure", so that: > > loop_vinfo = vect_analyze_loop_form (loop, shared); > if (!loop_vinfo) > { > if (dump_enabled_p ()) > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > "bad loop form.\n"); > return NULL; > } > > can simply become: > > loop_vinfo = vect_analyze_loop_form (loop, shared); > if (!loop_vinfo) > { > if (dump_enabled_p ()) > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > "bad loop form.\n"); > return loop_vinfo; > } > > where the "loop_vinfo" is now an opt_wrapper<loop_vec_info> and can > capture exactly what went wrong inside vect_analyze_loop_form. > > In all cases, the classes act as if the opt_problem were one of its > fields, but the opt_problem is actually stored in a global, so that when > compiled, an opt_wrapper<T> is effectively just a T, so that we're > still just passing e.g. a bool around; the opt_wrapper<T> classes > simply provide type-checking and an API to ensure that we provide > error-messages deep in the callstack at the places where problems > occur, and that we propagate them. This also avoids having > to manage the ownership of the opt_problem instances. > > Using opt_result and opt_wrapper<T> documents the intent of the code > for the places where we represent success values, and allows the C++ type > system to track where the deepest points in the callstack are where we > need to emit the failure messages from. > > Some examples, showing how it immediately leads to more meaningful > diagnostics from the vectorizer: > > gcc.dg/vect/vect-widen-mult-1.c: In function ‘f’: > gcc.dg/vect/vect-widen-mult-1.c:16:3: remark: loop vectorized using 16 byte > vectors > for (__INTPTR_TYPE__ i = 0; i < N; ++i) > ^~~ > gcc.dg/vect/vect-widen-mult-1.c: In function ‘main’: > gcc.dg/vect/vect-widen-mult-1.c:42:10: remark: couldn't vectorize loop > if (a[i] != (SIGNEDNESS_1 short) ((BASE + i * 5) > ~^~~ > gcc.dg/vect/vect-widen-mult-1.c:42:10: note: control flow in loop > [../../src/gcc/tree-vect-loop.c:1200:vect_analyze_loop_form_1] > gcc.dg/vect/vect-widen-mult-1.c:34:3: remark: couldn't vectorize loop > for (int i = 0; i < N; ++i) > ^~~ > gcc.dg/vect/vect-widen-mult-1.c:38:7: note: statement clobbers memory: > ‘__asm__ __volatile__("" : : : "memory");’ > [../../src/gcc/tree-data-ref.c:5086:find_data_references_in_stmt] > asm volatile ("" ::: "memory"); > ^~~ > > by showing the location of the loop that can't be vectorized, > the problematic statement within the loop, and describing what the > problem is. > > Note how it also captures the location of where in GCC the problem was > encountered. Ultimately this would also show hotness information for the > loop in the remark. > > I'm not sure exactly what an opt_problem should capture. There are a few > places in the kit where it looks the "failure" calls might benefit from > being a formatted print API, with format codes for various middle-end > entities (e.g. gimple, tree, --param values, etc). It's also not clear > to me how this would interact with the optinfo work (e.g. for capturing > things in optimization records), or what the API should look like (though > I like the one in the patch). > > (Not yet bootstrapped or regrtested, and has plenty of FIXMEs: I'm posting > this for comment/discussion). > > Thoughts? > Does the basic idea look useful and sane? (I think so) > > Thanks > Dave > > [...snip...] > > David Malcolm (5): > Add opt-problem.h > Use opt-problem.h in try_vectorize_loop_1 > Add some test coverage > Use opt_result throughout vectorizer > Add opt-problem.cc > > gcc/Makefile.in | 1 + > gcc/opt-problem.cc | 96 ++++++++ > gcc/opt-problem.h | 326 > +++++++++++++++++++++++++ > gcc/testsuite/gcc.dg/vect/vect-alias-check-4.c | 11 +- > gcc/testsuite/gcc.dg/vect/vect-remarks-1.c | 18 ++ > gcc/tree-data-ref.c | 33 +-- > gcc/tree-data-ref.h | 10 +- > gcc/tree-vect-data-refs.c | 189 ++++++++------ > gcc/tree-vect-loop.c | 212 +++++++++------- > gcc/tree-vect-slp.c | 4 +- > gcc/tree-vect-stmts.c | 130 ++++++---- > gcc/tree-vectorizer.c | 34 ++- > gcc/tree-vectorizer.h | 41 ++-- > 13 files changed, 842 insertions(+), 263 deletions(-) > create mode 100644 gcc/opt-problem.cc > create mode 100644 gcc/opt-problem.h > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-remarks-1.c