Re: [PATCH 0/5] [RFC v2] Higher-level reporting of vectorization problems

Richard Sandiford Wed, 11 Jul 2018 08:56:49 -0700

David Malcolm <dmalc...@redhat.com> writes:
> On Mon, 2018-06-25 at 11:10 +0200, Richard Biener wrote:
>> On Fri, 22 Jun 2018, David Malcolm wrote:
>> 
>> > NightStrike and I were chatting on IRC last week about
>> > issues with trying to vectorize the following code:
>> > 
>> > #include <vector>
>> > std::size_t f(std::vector<std::vector<float>> const & v) {
>> >    std::size_t ret = 0;
>> >    for (auto const & w: v)
>> >            ret += w.size();
>> >    return ret;
>> > }
>> > 
>> > icc could vectorize it, but gcc couldn't, but neither of us could
>> > immediately figure out what the problem was.
>> > 
>> > Using -fopt-info leads to a wall of text.
>> > 
>> > I tried using my patch here:
>> > 
>> >  "[PATCH] v3 of optinfo, remarks and optimization records"
>> >   https://gcc.gnu.org/ml/gcc-patches/2018-06/msg01267.html
>> > 
>> > It improved things somewhat, by showing:
>> > (a) the nesting structure via indentation, and
>> > (b) the GCC line at which each message is emitted (by using the
>> >     "remark" output)
>> > 
>> > but it's still a wall of text:
>> > 
>> >   https://dmalcolm.fedorapeople.org/gcc/2018-06-18/test.cc.remarks.
>> > html
>> >   https://dmalcolm.fedorapeople.org/gcc/2018-06-18/test.cc.d/..%7C.
>> > .%7Csrc%7Ctest.cc.html#line-4
>> > 
>> > It doesn't yet provide a simple high-level message to a
>> > tech-savvy user on what they need to do to get GCC to
>> > vectorize their loop.
>> 
>> Yeah, in particular the vectorizer is way too noisy in its low-level
>> functions.  IIRC -fopt-info-vec-missed is "somewhat" better:
>> 
>> t.C:4:26: note: step unknown.
>> t.C:4:26: note: vector alignment may not be reachable
>> t.C:4:26: note: not ssa-name.
>> t.C:4:26: note: use not simple.
>> t.C:4:26: note: not ssa-name.
>> t.C:4:26: note: use not simple.
>> t.C:4:26: note: no array mode for V2DI[3]
>> t.C:4:26: note: Data access with gaps requires scalar epilogue loop
>> t.C:4:26: note: can't use a fully-masked loop because the target
>> doesn't 
>> have the appropriate masked load or store.
>> t.C:4:26: note: not ssa-name.
>> t.C:4:26: note: use not simple.
>> t.C:4:26: note: not ssa-name.
>> t.C:4:26: note: use not simple.
>> t.C:4:26: note: no array mode for V2DI[3]
>> t.C:4:26: note: Data access with gaps requires scalar epilogue loop
>> t.C:4:26: note: op not supported by target.
>> t.C:4:26: note: not vectorized: relevant stmt not supported: _15 =
>> _14 
>> /[ex] 4;
>> t.C:4:26: note: bad operation or unsupported loop bound.
>> t.C:4:26: note: not vectorized: no grouped stores in basic block.
>> t.C:4:26: note: not vectorized: no grouped stores in basic block.
>> t.C:6:12: note: not vectorized: not enough data-refs in basic block.
>> 
>> 
>> > The pertinent dump messages are:
>> > 
>> > test.cc:4:23: remark: === try_vectorize_loop_1 ===
>> > [../../src/gcc/tree-vectorizer.c:674:try_vectorize_loop_1]
>> > cc1plus: remark:
>> > Analyzing loop at test.cc:4
>> > [../../src/gcc/dumpfile.c:735:ensure_pending_optinfo]
>> > test.cc:4:23: remark:  === analyze_loop_nest ===
>> > [../../src/gcc/tree-vect-loop.c:2299:vect_analyze_loop]
>> > [...snip...]
>> > test.cc:4:23: remark:   === vect_analyze_loop_operations ===
>> > [../../src/gcc/tree-vect-loop.c:1520:vect_analyze_loop_operations]
>> > [...snip...]
>> > test.cc:4:23: remark:    ==> examining statement: ‘_15 = _14 /[ex]
>> > 4;’ [../../src/gcc/tree-vect-stmts.c:9382:vect_analyze_stmt]
>> > test.cc:4:23: remark:    vect_is_simple_use: operand ‘_14’
>> > [../../src/gcc/tree-vect-stmts.c:10064:vect_is_simple_use]
>> > test.cc:4:23: remark:    def_stmt: ‘_14 = _8 - _7;’
>> > [../../src/gcc/tree-vect-stmts.c:10098:vect_is_simple_use]
>> > test.cc:4:23: remark:    type of def: internal [../../src/gcc/tree-
>> > vect-stmts.c:10112:vect_is_simple_use]
>> > test.cc:4:23: remark:    vect_is_simple_use: operand ‘4’
>> > [../../src/gcc/tree-vect-stmts.c:10064:vect_is_simple_use]
>> > test.cc:4:23: remark:    op not supported by target.
>> > [../../src/gcc/tree-vect-stmts.c:5932:vectorizable_operation]
>> > test.cc:4:23: remark:    not vectorized: relevant stmt not
>> > supported: ‘_15 = _14 /[ex] 4;’ [../../src/gcc/tree-vect-
>> > stmts.c:9565:vect_analyze_stmt]
>> > test.cc:4:23: remark:   bad operation or unsupported loop bound.
>> > [../../src/gcc/tree-vect-loop.c:2043:vect_analyze_loop_2]
>> > cc1plus: remark: vectorized 0 loops in function.
>> > [../../src/gcc/tree-vectorizer.c:904:vectorize_loops]
>> > 
>> > In particular, that complaint from
>> >   [../../src/gcc/tree-vect-stmts.c:9565:vect_analyze_stmt]
>> > is coming from:
>> > 
>> >   if (!ok)
>> >     {
>> >       if (dump_enabled_p ())
>> >         {
>> >           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> >                            "not vectorized: relevant stmt not ");
>> >           dump_printf (MSG_MISSED_OPTIMIZATION, "supported: ");
>> >           dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM,
>> > stmt, 0);
>> >         }
>> > 
>> >       return false;
>> >     }
>> > 
>> > This got me thinking: the user presumably wants to know several
>> > things:
>> > 
>> > * the location of the loop that can't be vectorized (vect_location
>> >   captures this)
>> > * location of the problematic statement
>> > * why it's problematic
>> > * the problematic statement itself.
>> > 
>> > The following is an experiment at capturing that information, by
>> > recording an "opt_problem" instance describing what the
>> > optimization
>> > problem is, created deep in the callstack when it occurs, for use
>> > later on back at the top of the vectorization callstack.
>> 
>> Nice idea.  Of course the issue is then for which issues to
>> exactly create those.  Like all of the MSG_MISSED_OPTIMIZATION
>> dumpings?
>> 
>> I guess the vectorizer needs some axing of useless messages
>> and/or we need a MSG_DEBUG to have an additional level below
>> MSG_NOTE.
>> 
>> > This extra work is only done if dump_enabled_p.
>> > 
>> > It feels vaguely analogous to an exception object (in terms of
>> > packaging up a problem that occurs deep in the stack for reporting
>> > back at a higher level).
>> > 
>> > With this patch, we emit:
>> > 
>> > ../../src/test.cc: In function ‘std::size_t f(const
>> > std::vector<std::vector<float> >&)’:
>> > ../../src/test.cc:4:23: remark: couldn't vectorize loop
>> >   for (auto const & w: v)
>> >                        ^
>> > In file included from ../x86_64-pc-linux-gnu/libstdc++-
>> > v3/include/vector:64,
>> >                  from ../../src/test.cc:1:
>> > ../x86_64-pc-linux-gnu/libstdc++-
>> > v3/include/bits/stl_vector.h:806:50:
>> > note: relevant stmt not supported: ‘_15 = _14 /[ex] 4;’
>> >        { return size_type(this->_M_impl._M_finish - this-
>> > >_M_impl._M_start); }
>> >                           ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
>> > ~~~~~~~
>> > 
>> > which reports both the location of the loop and the statement
>> > that's
>> > problematic (if I'm reading it right, the pointer arithmetic leads
>> > to a
>> > division by 4, and presumably we're not able to handle that).
>> 
>> Quite likely because we only handle TRUNC_DIV_EXPR and not
>> EXACT_DIV_EXPR
>> which we can handle the same semantically (EXACT_DIV_EXPR just gives
>> us stronger guarantees).
>
> [...snip...]
>
> I've been experimenting with the idea; here's an updated version of
> the patch (on top of the optinfo patch kit, as it happens).
>
> I tried adding more uses of opt_problem, but ran into the issue that it's
> hard to find "by hand" the places deep in the callstack where things fail;
> I spent a chunk of time stepping through failures, trying to figure out
> the best place to capture the opt_problem.
>
> It seemed like something that we could track in C++'s type system.
>
> The following kit solves this by introducing a new class opt_result, which
> looks a lot like a bool: it has the same representation in memory.
>
> For instance, at the deepest point of the callstack where the failure
> happens, rather than:
>
>      if (!check_something ())
>        {
>          if (dump_enabled_p ())
>            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                             "foo is unsupported.\n");
>          return false;
>        }
>      [...lots more checks...]
>
>      // All checks passed:
>      return true;
>
> we can capture the cause of the failure via:
>
>      if (!check_something ())
>        {
>          if (dump_enabled_p ())
>            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                             "foo is unsupported.\n");
>          return opt_result::failure ("foo is unsupported",
>                                      stmt);
>        }
>      [...lots more checks...]
>
>      // All checks passed:
>      return opt_result::success ();
>
> which effectively returns true or false, whilst recording any problem.
>
> opt_result::success and opt_result::failure return opt_result values
> which "looks like" true/false respectively, via operator bool().
> If dump_enabled_p, then opt_result::failure also creates an opt_problem *,
> capturing the pertinent data (here, "foo is unsupported" and "stmt").
> If dumps are disabled, then opt_problem instances aren't
> created, and it's equivalent to just returning a bool (false for failure).


Since the creation of the opt_problem depends on dump_enabled_p, would
it make sense for the dump_printf_loc to happen automatically on
opt_result::failure, rather than have both?  E.g.:

      if (!check_something ())
        // Also calls dump_printf_loc if appropriate.
        return opt_result::failure ("foo is unsupported", stmt);

I think the aim should be for opt_problem to capture any information
that's useful at the user level, so doing the dump_printf_loc
automatically would help to keep the MISSED_OPTIMIZATIONS messages
clean of information that's too low-level.

> The opt_problem can be propagated via opt_result values back up
> the call stack to where it makes most sense to the user.
> For instance, rather than:
>
>      bool ok = try_something_that_might_fail ();
>      if (!ok)
>        {
>          if (dump_enabled_p ())
>            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                             "some message.\n");
>          return false;
>        }
>
> we can replace the bool with an opt_result, so if dump_enabled_p, we
> assume that if try_something_that_might_fail, an opt_problem * will be
> created, and we can propagate it up the call chain:
>
>      opt_result ok = try_something_that_might_fail ();
>      if (!ok)
>        {
>          if (dump_enabled_p ())
>            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                             "some message.\n");
>          return ok; // propagating the opt_result
>        }
>
> There's currently a way to implicitly construct an opt_result from a
> bool, but this is scaffolding: commenting it out allows the compiler to
> tell us where we need to capture failure information.

I guess this is bike-shedding, but personally I'd prefer an explicit
test for success rather than operator bool, so that:

   opt_result foo = ...;
   bool bar = foo;

is ill-formed.  The danger otherwise might be that we drop a useful
opt_problem and replace it with something more generic.  E.g. the
opt_result form of:

  if (!ok)
    {
      if (dump_enabled_p ())
        {
          dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                           "not vectorized: relevant stmt not ");
          dump_printf (MSG_MISSED_OPTIMIZATION, "supported: ");
          dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM, stmt, 0);
        }

      return false;
    }

in vect_analyze_stmt could silently drop the information provided by
the subroutine if we forgot to change "ok" from "bool" to "opt_result".

Same for opt_wrapper<T>, but defining operator* and operator-> would
be useful.

LGTM otherwise (but obviously it's Richard B's call).

Thanks,
Richard

> As well as opt_result for "bool", there's a template for wrapping pointers
> where non-NULL is "success" and NULL "failure", so that:
>
>       loop_vinfo = vect_analyze_loop_form (loop, shared);
>       if (!loop_vinfo)
>       {
>         if (dump_enabled_p ())
>           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                            "bad loop form.\n");
>         return NULL;
>       }
>
> can simply become:
>
>       loop_vinfo = vect_analyze_loop_form (loop, shared);
>       if (!loop_vinfo)
>       {
>         if (dump_enabled_p ())
>           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                            "bad loop form.\n");
>         return loop_vinfo;
>       }
>
> where the "loop_vinfo" is now an opt_wrapper<loop_vec_info> and can
> capture exactly what went wrong inside vect_analyze_loop_form.
>
> In all cases, the classes act as if the opt_problem were one of its
> fields, but the opt_problem is actually stored in a global, so that when
> compiled, an opt_wrapper<T> is effectively just a T, so that we're
> still just passing e.g. a bool around; the opt_wrapper<T> classes
> simply provide type-checking and an API to ensure that we provide
> error-messages deep in the callstack at the places where problems
> occur, and that we propagate them.  This also avoids having
> to manage the ownership of the opt_problem instances.
>
> Using opt_result and opt_wrapper<T> documents the intent of the code
> for the places where we represent success values, and allows the C++ type
> system to track where the deepest points in the callstack are where we
> need to emit the failure messages from.
>
> Some examples, showing how it immediately leads to more meaningful
> diagnostics from the vectorizer:
>
> gcc.dg/vect/vect-widen-mult-1.c: In function ‘f’:
> gcc.dg/vect/vect-widen-mult-1.c:16:3: remark: loop vectorized using 16 byte 
> vectors
>    for (__INTPTR_TYPE__ i = 0; i < N; ++i)
>    ^~~
> gcc.dg/vect/vect-widen-mult-1.c: In function ‘main’:
> gcc.dg/vect/vect-widen-mult-1.c:42:10: remark: couldn't vectorize loop
>      if (a[i] != (SIGNEDNESS_1 short) ((BASE + i * 5)
>          ~^~~
> gcc.dg/vect/vect-widen-mult-1.c:42:10: note: control flow in loop 
> [../../src/gcc/tree-vect-loop.c:1200:vect_analyze_loop_form_1]
> gcc.dg/vect/vect-widen-mult-1.c:34:3: remark: couldn't vectorize loop
>    for (int i = 0; i < N; ++i)
>    ^~~
> gcc.dg/vect/vect-widen-mult-1.c:38:7: note: statement clobbers memory: 
> ‘__asm__ __volatile__("" :  :  : "memory");’ 
> [../../src/gcc/tree-data-ref.c:5086:find_data_references_in_stmt]
>        asm volatile ("" ::: "memory");
>        ^~~
>
> by showing the location of the loop that can't be vectorized,
> the problematic statement within the loop, and describing what the
> problem is.
>
> Note how it also captures the location of where in GCC the problem was
> encountered.  Ultimately this would also show hotness information for the
> loop in the remark.
>
> I'm not sure exactly what an opt_problem should capture.  There are a few
> places in the kit where it looks the "failure" calls might benefit from
> being a formatted print API, with format codes for various middle-end
> entities (e.g. gimple, tree, --param values, etc).  It's also not clear
> to me how this would interact with the optinfo work (e.g. for capturing
> things in optimization records), or what the API should look like (though
> I like the one in the patch).
>
> (Not yet bootstrapped or regrtested, and has plenty of FIXMEs: I'm posting
> this for comment/discussion).
>
> Thoughts?
> Does the basic idea look useful and sane? (I think so)
>
> Thanks
> Dave
>
> [...snip...]
>
> David Malcolm (5):
>   Add opt-problem.h
>   Use opt-problem.h in try_vectorize_loop_1
>   Add some test coverage
>   Use opt_result throughout vectorizer
>   Add opt-problem.cc
>
>  gcc/Makefile.in                                |   1 +
>  gcc/opt-problem.cc                             |  96 ++++++++
>  gcc/opt-problem.h                              | 326 
> +++++++++++++++++++++++++
>  gcc/testsuite/gcc.dg/vect/vect-alias-check-4.c |  11 +-
>  gcc/testsuite/gcc.dg/vect/vect-remarks-1.c     |  18 ++
>  gcc/tree-data-ref.c                            |  33 +--
>  gcc/tree-data-ref.h                            |  10 +-
>  gcc/tree-vect-data-refs.c                      | 189 ++++++++------
>  gcc/tree-vect-loop.c                           | 212 +++++++++-------
>  gcc/tree-vect-slp.c                            |   4 +-
>  gcc/tree-vect-stmts.c                          | 130 ++++++----
>  gcc/tree-vectorizer.c                          |  34 ++-
>  gcc/tree-vectorizer.h                          |  41 ++--
>  13 files changed, 842 insertions(+), 263 deletions(-)
>  create mode 100644 gcc/opt-problem.cc
>  create mode 100644 gcc/opt-problem.h
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-remarks-1.c

Re: [PATCH 0/5] [RFC v2] Higher-level reporting of vectorization problems

Reply via email to