Re: Prototype implementation: Improving effectiveness and generality of auto-vectorization

2016-01-08 Thread Alan Lawrence

On Tue, Oct 27, 2015 at 2:39 PM, Richard Biener wrote:

On Mon, Oct 26, 2015 at 6:59 AM, sameera  wrote:


Richard, we have defined the input language for convenience in the prototype
implementation. However, we will be using GIMPLE as our IR. As per the grammar
of our tree, a p-tree denotes the permute order associated with a statement,
whereas a c-tree is the GIMPLE instruction which performs the compute
operation. I tried looking at the structures used in SLP; however, they cannot
be used as they are. The main difference between the current SLP implementation
in GCC and our representation is that the permute order in SLP is part of
the tree node, whereas in our representation the permute order is
represented as an independent tree node. Hence, I have created a new tree
structure for our pass, which creates p-tree nodes for the permute order
and c-tree nodes which point to the appropriate GIMPLE statements.


Yes, that's the whole purpose - get the vectorizer (and SLP) a better
data structure
which is explicit about permutes.



I somewhat miss an idea on how to deal with irregular SLP cases - that is,
does the AST model each (current) SLP node as a single statement or
does it model individual vector elements (so you can insert no-op
compensation
code to make the operations match).  Consider

  a[i] = b[i] * 3;
  a[i+1] = b[i+1];

which can be vectorized with SLP if you realize you can multiply b[i+1] by
1.
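Richard's example can be sketched in scalar C (the function names are mine, not from any GCC testcase). The point is that rewriting the second statement as a multiply-by-1 makes the two statements isomorphic, so SLP could pack them into a single vector multiply by the constant vector {3, 1}:

```c
#include <assert.h>

/* Irregular pair: only one of the two statements multiplies. */
void store_pair(int *a, const int *b, int i)
{
    a[i]     = b[i] * 3;
    a[i + 1] = b[i + 1];
}

/* No-op compensation: multiply the second lane by 1, so both lanes
   perform the same operation and SLP can use one vector multiply. */
void store_pair_compensated(int *a, const int *b, int i)
{
    a[i]     = b[i]     * 3;
    a[i + 1] = b[i + 1] * 1;
}
```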



The pass structure we have designed for our optimization is as follows:
- New pass target-aware-loop-vect with the following subpasses:
  - Loop structure identification
  - Restricted loop construction
  - Loop analysis (6 phases of the target-aware reordering-instruction
selection algorithm)
  - Loop transformation (cost-aware GIMPLE code generation)
We might need some optimizations to be performed on the loop body before we
start our reordering-instruction selection, so that the input program can be
transformed to fit into our restricted loop structure. These transformations
can be performed by the restricted-loop-construction subpass; e.g.
multi-dimensional arrays, nested loops, reduction chains, etc. can be
supported by transforming the input loop. The case that you have mentioned
can be transformed at this level.



As of implementing the whole thing in GCC I recently spent some more time
thinking and came to the conclusion that the path to the fastest
improvements
in GCC code-gen would be to build the new AST after the current analysis
phase finished, do the optimization and drive code-generation off the new
AST.
Thus keep things like data-ref and dependence analysis as is, as well as
group analysis and SLP tree construction.  Build the AST from the
vectorizer
"IL" at the point vect_transform_loop / vect_slp_transform_bb is called.



The current pass structure we have designed does exactly that. The only
difference is that we plan to use the transform phase as a co-process of our
transform pass, rather than reconstruct an AST from our representation to pass
to vect_transform_loop.



More and more of the rest of the vectorizer code can then be "slowly"
moved
to work on the AST and AST construction be moved earlier.


So I think an incremental approach is much more likely to get us some benefit, 
and I'm wondering how much of the benefit here stems from which bit, and how 
many of these changes can be broken off.


For example, making permutes into explicit operations on the SLP tree, and 
adding a pass that tries to push them up/down the tree into each other, might be 
done on the existing codebase, and would be a significant win in itself 
(bringing some/much(?) of improvements in interleaving code of your proposal).


I'm not really sure how much of the benefit stems from what, though - e.g. does 
reducing vector length down to fit the target machine only after the other 
steps, bring any benefits besides raising abstraction? (Which certainly can be a 
benefit!). AFAICT this could be done independently of other changes?


Another limitation of the present SLP structure is that it does not understand 
sharing (until later CSE, hence costing wrongly); it would be good to fix this 
(making the SLP tree into a DAG, and potentially reusing subtrees by adding 
permutes). This is a step we'd like to take anyway, but it might tie in with 
your "Bottom-up commoning of p-node subtrees", and would help with (BB as well 
as loop) cost calculations.


Are there any really fundamental differences that we cannot overcome in steps? ( 
/ are there any steps we shouldn't take in isolation, because we'd have to 
'throw them away' later?)


Alan


RS6000/PowerPC -mpaired

2016-01-05 Thread Alan Lawrence
Can anyone give me any pointers on how to use the -mpaired option ("PowerPC 
750CL paired-single instructions") ? I'm trying to get it to use the 
reduc_s{min,max,plus}_v2sf patterns in gcc/config/rs6000/paired.md, so far I 
haven't managed to find a configuration that supports loading a vector of floats 
but that doesn't also have superior altivec/v4sf support...


Thanks, Alan


Tree CONSTRUCTORs and Ada array indices

2015-10-14 Thread Alan Lawrence

Hi,

I'm having fun with arrays in Ada ;) and wondering if anyone can tell me what's 
right here.


Ada ACATS c64106a contains code with an array type whose indices run from -1 to 
1. (That is, three elements.)


In GCC this gets represented as an ARRAY_TYPE, whose TYPE_DOMAIN is a 32-bit 
signed integer type; the TYPE_MAX_VALUE of this is 1 (as a size_int), the 
TYPE_MIN_VALUE of this is -1 as a size_int, i.e. 4294967295. I believe using 
(unsigned 32-bit) size_int for min/max is correct (?) despite the domain being a 
signed type.


An array of this type is then initialized with a CONSTRUCTOR. The CONSTRUCTOR 
has three elements, with INTEGER_CST indices, being in order 2^32-1, 0, 1 (all 
of sizetype).


I substitute the CONSTRUCTOR into an ARRAY_REF [4294967295]{lb: 4294967295 
sz: 2}, i.e. that picks the element with index (sizetype)2^32-1.


fold() in fold-const.c then fails to fold, because it does binary search for 
2^32-1 through the constructor elements, and first checks against the middle 
CONSTRUCTOR_ELT, which has index 0.


So, where's the bug here? Should all these indices have sizetype? Should 
fold-const.c's binary search respect the TYPE_DOMAIN of the type of the 
CONSTRUCTOR being searched? (Lots of build_int_cst to reinterpret the 
CONSTRUCTOR_ELT indices, and/or the index being searched for, in said TYPE_DOMAIN ?)
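For what it's worth, the failure mode can be reproduced outside GCC with an ordinary binary search (a hypothetical sketch, not fold-const.c's actual code): binary search only finds a key if the keys are sorted under the comparison it uses, and in constructor order (2^32-1, 0, 1) with unsigned compare they are not — the middle probe (key 0) sends the search the wrong way, so 2^32-1 is never found.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard binary search over unsigned keys; returns the index of
   `want`, or -1 if the probes never land on it. */
ptrdiff_t bsearch_idx(const uint32_t *keys, size_t n, uint32_t want)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (keys[mid] == want)
            return (ptrdiff_t)mid;
        if (keys[mid] < want)
            lo = mid + 1;       /* probe went right */
        else
            hi = mid;           /* probe went left */
    }
    return -1;
}
```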


Advice appreciated!

Thanks, Alan



Re: Missing dependency in Ada build?

2015-10-12 Thread Alan Lawrence

On 12/10/15 10:58, Eric Botcazou wrote:

Yes, that fixes the dependency problem; the build waits for the xsinfo to
terminate. This at least makes it easy to Ctrl-C and try again. Are we aware
of any reason why *not* to make that change?


You may have noticed that I had made it before you posted this message. :-)


Ah, right. You give me far too much credit ;)


The hang in xsinfo appears to be miscompilation when the host compiler is
gnat-4.8 (on ARM), whereas gnat-4.6 succeeds. I'm going to try again with
later versions (4.9/5.0/etc.)...


Thanks.  This could be worth documenting somewhere then.


I can confirm that gnat-4.9 works OK (as well as gnat-4.6).

I wonder whether bugzilla is still the right place to document this, given 4.8 
is now end-of-lifed...do you know of anywhere appropriate?


Thanks,
Alan



Re: Missing dependency in Ada build?

2015-10-09 Thread Alan Lawrence

On 07/10/15 18:33, Eric Botcazou wrote:

So it looks like there are two problems here:
(1) xsinfo not terminating;
(2) a missing dependency, that the g++ steps should depend upon the step
producing sinfo.h (i.e. the mv that comes after problem (1))

I'm going to look at (1), but I'm hoping someone more familiar with the Ada
build system might be able to help with regards to (2)?


Try to replace $(GNAT1_ADA_OBJS) with $(GNAT1_OBJS) in the line:

# When building from scratch we don't have dependency files, the only thing
# we need to ensure is that the generated files are created first.
$(GNAT1_ADA_OBJS) $(GNATBIND_OBJS): | $(ada_generated_files)

in ada/gcc-interface/Make-lang.in.



Yes, that fixes the dependency problem; the build waits for the xsinfo to 
terminate. This at least makes it easy to Ctrl-C and try again. Are we aware of 
any reason why *not* to make that change?


The hang in xsinfo appears to be miscompilation when the host compiler is 
gnat-4.8 (on ARM), whereas gnat-4.6 succeeds. I'm going to try again with later 
versions (4.9/5.0/etc.)...


Thanks, Alan



Missing dependency in Ada build?

2015-10-07 Thread Alan Lawrence
I've been erratically seeing my Ada builds (bootstrap or even with 
--disable-bootstrap) fail on an arm-none-linux-gnueabihf system, with errors 
like this:


g++ -fno-PIE -c  -DIN_GCC_FRONTEND -g -O2 -DIN_GCC -fno-exceptions -fno-rtti 
-fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual 
 -Wmissing-format-attribute -Woverloaded-virtual -Wno-long-long 
-Wno-variadic-macros -Wno-overlength-strings -fno-common  -DHAVE_CONFIG_H -I. 
-Iada -I../../gcc/gcc -I../../gcc/gcc/ada -I../../gcc/gcc/../include 
-I../../gcc/gcc/../libcpp/include -I/home/alalaw01/boot_f800d24b/./gmp 
-I/home/alalaw01/gcc/gmp -I/home/alalaw01/boot_f800d24b/./mpfr 
-I/home/alalaw01/gcc/mpfr -I/home/alalaw01/gcc/mpc/src 
-I../../gcc/gcc/../libdecnumber -I../../gcc/gcc/../libdecnumber/dpd 
-I../libdecnumber -I../../gcc/gcc/../libbacktrace 
-I/home/alalaw01/boot_f800d24b/./isl/include -I/home/alalaw01/gcc/isl/include 
-o ada/trans.o -MT ada/trans.o -MMD -MP -MF ada/.deps/trans.TPo 
../../gcc/gcc/ada/gcc-interface/trans.c
../../gcc/gcc/ada/gcc-interface/trans.c:67:19: fatal error: sinfo.h: No such 
file or directory

 #include "sinfo.h"
   ^
compilation terminated.

(and similarly on a bunch of other files).

However, other times builds under "identical" conditions succeed. If I try a 
single-threaded build (make -j1), I find it sometimes hangs on this line:


(cd ada/bldtools/sinfo; gnatmake -q xsinfo ; ./xsinfo sinfo.h)

which produces output:
=
Check for field name consistency
 OK

Check for function consistency
 OK

Check for missing functions
 OK

Check for set procedure consistency
 OK

Check for missing set procedures
 OK

Check pragma Inlines are all for existing subprograms
 OK

Check no pragma Inlines were omitted
 OK

Check references in functions in body
 OK

Check for missing functions in body
 OK

Check Set procedures in body
 OK

Check for missing set procedures in body
 OK

All tests completed successfully, no errors detected
=
...but then sits forever at 100% of the CPU. Looking at 
build/gcc/ada/bldtools/sinfo/sinfo.h, the file has been partially written (but 
not flushed). In a successful build, that step terminates (no further terminal 
output), the file is flushed, and next comes

mv -f ada/bldtools/sinfo/sinfo.h ada/sinfo.h
from where the g++ steps pick it up.

So it looks like there are two problems here:
(1) xsinfo not terminating;
(2) a missing dependency, that the g++ steps should depend upon the step 
producing sinfo.h (i.e. the mv that comes after problem (1))


I'm going to look at (1), but I'm hoping someone more familiar with the Ada 
build system might be able to help with regards to (2)?


Thanks for any help,
Alan



Re: Results from SPEC2006 FP analysis done at Richard`s request {late July / early August}

2015-08-14 Thread Alan Lawrence

Abe wrote:

Dear all,

Overall, I think the WIP new if converter is holding up
relatively well, but there is obviously opportunity to do better,
at least if the numbers mean what they look like they mean,
i.e. the old converter`s code was fully OK and so is the new one`s.
By "fully OK" I mean e.g. no crashing bugs were introduced by the conversion.


In the following, all the integers over 1000 are loops-vectorized counts.



Interesting, thanks. For what kind of architecture are these - specifically: 
with/without masked/gathering loads/stores?


Cheers, Alan



Re: replacing multiplication by bit shifts, etc

2015-08-12 Thread Alan Lawrence

Hi!

All compilers can replace a multiplication operation by bit shifts and the x86 
LEA instruction (if the current target arch is x86).
Can I ask where in GCC this happens? I can't find it in the source code.


See gcc/expmed.c, search for "choose_mult_variant".
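As a sketch of the kind of replacement chosen there (my example, not code from expmed.c): a multiplication by 10 decomposes into two shifts and an add, which on x86 typically becomes an LEA plus a shift.

```c
#include <assert.h>

/* 10*x == 8*x + 2*x, so two shifts and an add replace the multiply. */
int mul10(int x)
{
    return (x << 3) + (x << 1);
}
```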

HTH
Alan



Re: making the new if-converter not mangle IR that is already vectorizer-friendly

2015-07-21 Thread Alan Lawrence

Abe wrote:


Ideally, I/we fix the above problem -- and the rest of the regressions in the 
new if converter --


Ok, what are these regressions? You say you've tested on x86_64 and only 
ifcvt-18.c fails (hence your xfail), so how does one find them?



In particular, I remember that "result = condition ? array1[index] : array2[maybe 
the same index, maybe not]"
is being converted too early IMO.  IOW, somewhere in GCC an array dereference 
is being considered
as either impure, too-expensive, or both.  "array[index]" in C [not in C++!: 
operator overloading]
AFAIK is always pure and is always low-cost whenever the index expression is 
pure and low-cost.


Well, any array access could introduce an extra fault (unless it's dominated by 
another equivalent (read/write) access)...?


--Alan



Re: making the new if-converter not mangle IR that is already vectorizer-friendly

2015-07-20 Thread Alan Lawrence

Abe wrote:



of course this says nothing about whether there is *some* other ISA that gets 
regressed!


After finishing fixing the known regressions, I intend/plan to reg-test for 
AArch64;
after that, I think I`m going to need some community help to reg-test for other 
ISAs.


OK, I'm confused. When you write "making the new if-converter not mangle 
IR"...does "the new if-converter" mean your scratchpad fix to PR46029, or is 
there some other new if-conversion phase that you are still working on and 
haven't posted yet? If the latter, does this replace the existing 
tree-if-conv.c, or is it an extra stage before that? I haven't yet understood 
what you mean about "vectorizer-friendly" IR being mangled; is the problem that 
your new phase transforms IR that can currently be if-converted by the existing 
phase, into IR that can't? (Example?) Then I might (only "might", sorry!) be 
able to help...


Cheers, Alan



Re: making the new if-converter not mangle IR that is already vectorizer-friendly

2015-07-08 Thread Alan Lawrence

Abe wrote:
[snip]

Predication of instructions can help to remove the burden of the conditional 
branch, but is not available on all relevant architectures.
In some architectures that are typically implemented in modern high-speed 
processors -- i.e. with high core frequency, caches, speculative execution, 
etc. --
there is not full support for predication [as there is, for example, in 32-bit ARM] but there _is_ 
support for "conditional move" [hereinafter "cmove"].
If/when the ISA in question supports cmove to/from main memory, perhaps it 
would be profitable to use two cmoves back-to-back with opposite conditions
and the same register [destination for load, source for store] to implement e.g. "temp = c ? 
X[x] : Y[y]" and/or "temp = C[c] ? X[x] : Y[y]" etc.
Even without cmove to/from main memory, two cmoves back-to-back with opposite 
conditions could still be used, e.g. [not for a real-world ISA]:
   load X[x] -> reg1
   load Y[y] -> reg2
   cmove   c  ? reg1 -> reg3
   cmove (!c) ? reg2 -> reg3

Or even better if the ISA can support something like:
   load X[x] -> reg1
   load Y[y] -> reg2
   cmove (c ? reg1 : reg2) -> reg3

However, this is a code-gen issue from my POV, and at the present time all the 
if-conversion work is occurring at the GIMPLE level.
If anybody reading this knows how I could make the if converter generate GIMPLE 
that leads to code-gen that is better for at
least one ISA
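For reference, even an ISA with neither predication nor cmove can get a branchless select; a common fallback (my sketch, unrelated to Abe's patch) builds an all-ones/all-zeros mask from the condition and blends with it:

```c
#include <assert.h>
#include <stdint.h>

/* (c ? a : b) without any branch or conditional-move instruction:
   mask is all-ones when c is nonzero, all-zeros otherwise. */
int32_t select_mask(int c, int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(c != 0);
    return (a & mask) | (b & ~mask);
}
```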


Ramana writes that your patch for PR46029 generates more 'csel's (i.e. the 
second form you write) for AArch64: 
https://gcc.gnu.org/ml/gcc-patches/2015-06/msg01721.html


of course this says nothing about whether there is *some* other ISA that gets 
regressed!


HTH
Alan



Re: making the new if-converter not mangle IR that is already vectorizer-friendly

2015-07-03 Thread Alan Lawrence

Abe wrote:


In other words, the problem about which I was concerned is not going to be triggered by 
e.g. "if (c)  x = ..."
which lacks an attached "else  x = ..." in a multithreaded program without 
enough locking just because 'x' is global/static.

The only remaining case to consider is if some code being compiled takes the address of 
something thread-local and then "gives"
that pointer to another thread.  Even for _that_ extreme case, Sebastian says 
that the gimplifier will detect this
"address has been taken" situation and do the right thing such that the new if 
converter also does the right thing.


Great :). I don't understand much/anything about how gcc deals with 
thread-locals, but everything before that, all sounds good...



[Alan wrote:]


Can you give an example?


The test cases in the GCC tree at "gcc.dg/vect/pr61194.c" and 
"gcc.dg/vect/vect-mask-load-1.c"
currently test as: the new if-converter is "converting" something that`s 
already vectorizer-friendly...

> [snip]

However, TTBOMK the vectorizer already "understands" that in cases where its 
input looks like:

   x = c ? y : z;

... and 'y' and 'z' are both pure [side-effect-free] -- including, but not limited to, 
they must be non-"volatile" --
it may vectorize a loop containing code like the preceding, ignoring for this 
particular instance the C mandate
that only one of {y, z} be evaluated...


My understanding, is that any decision as to whether one or both of y or z is 
evaluated (when 'evaluation' involves doing any work, e.g. a load), has already 
been encoded into the gimple/tree IR. Thus, if we are to only evaluate one of 
'y' or 'z' in your example, the IR will (prior to if-conversion), contain basic 
blocks and control flow, that means we jump around the one that's not evaluated.


This appears to be the case in pr61194.c: prior to if-conversion, the IR for the 
loop in barX is


 :
  # i_16 = PHI 
  # ivtmp_21 = PHI 
  _5 = x[i_16];
  _6 = _5 > 0.0;
  _7 = w[i_16];
  _8 = _7 < 0.0;
  _9 = _6 & _8;
  if (_9 != 0)
goto ;
  else
goto ;

  :
  iftmp.0_10 = z[i_16];
  goto ;

  :
  iftmp.0_11 = y[i_16];

  :
  # iftmp.0_2 = PHI 
  z[i_16] = iftmp.0_2;
  i_13 = i_16 + 1;
  ivtmp_20 = ivtmp_21 - 1;
  if (ivtmp_20 != 0)
goto ;
  else
goto ;

  :
  goto ;

which clearly contains (unvectorizable!) control flow. Without 
-ftree-loop-if-convert-stores, if-conversion leaves this alone, and 
vectorization fails (i.e. the vectorizer bails out because the loop has >2 basic 
blocks). With -ftree-loop-if-convert-stores, if-conversion produces


 :
  # i_16 = PHI 
  # ivtmp_21 = PHI 
  _5 = x[i_16];
  _6 = _5 > 0.0;
  _7 = w[i_16];
  _8 = _7 < 0.0;
  _9 = _6 & _8;
  iftmp.0_10 = z[i_16]; // <== here
  iftmp.0_11 = y[i_16]; // <== here
  iftmp.0_2 = _9 ? iftmp.0_10 : iftmp.0_11;
  z[i_16] = iftmp.0_2;
  i_13 = i_16 + 1;
  ivtmp_20 = ivtmp_21 - 1;
  if (ivtmp_20 != 0)
goto ;
  else
goto ;

  :
  goto ;

where I have commented the conditional loads that have become unconditional. 
(Hence, "-ftree-loop-if-convert-stores" looks misnamed - it affects how the 
if-conversion phase converts loads, too - please correct me if I misunderstand 
(Richard?) ?!) This contains no control flow, and so is vectorizable.
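In scalar C terms, the transformation amounts to something like the following (a hypothetical reconstruction of the pr61194.c loop shape, not its exact source):

```c
/* Conditional loads: each iteration reads only one of z[i], y[i]. */
void with_branches(float *z, const float *x, const float *y,
                   const float *w, int n)
{
    for (int i = 0; i < n; i++)
        if (x[i] > 0.0f && w[i] < 0.0f)
            z[i] = z[i];
        else
            z[i] = y[i];
}

/* After -ftree-loop-if-convert-stores: both loads made unconditional,
   then a flat select — no control flow left in the loop body. */
void if_converted(float *z, const float *x, const float *y,
                  const float *w, int n)
{
    for (int i = 0; i < n; i++) {
        float t = z[i];                 /* <== now unconditional */
        float u = y[i];                 /* <== now unconditional */
        z[i] = (x[i] > 0.0f && w[i] < 0.0f) ? t : u;
    }
}
```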


(This is all without your scratchpad patch, of course.) IOW this being 
vectorized, or not, relies upon the preceding if-conversion phase removing the 
control flow.


HTH
Alan



Re: making the new if-converter not mangle IR that is already vectorizer-friendly

2015-07-02 Thread Alan Lawrence

Abe wrote:




Hi, pleased to meet you :)




As some of you already know, at SARC we are working on a new "if converter" to 
help convert
simple "if"-based blocks of code that appear inside loops into an 
autovectorizer-friendly form
that closely resembles the C ternary operator ["c ? x : y"].  GCC already has 
such a converter,
but it is off by default, in part because it is unsafe: if enabled, it can 
cause certain code
to be transformed in such a way that it malfunctions even though the 
non-converted code worked
just fine with the same inputs.  The new converter, originally by my teammate 
Sebastian Pop,
is safer [almost-always safe *]; we are working on getting it into good-enough 
shape that the
always-safe transformations can be turned on by default whenever the 
autovectorizer is on.

* Always safe for stores, sometimes a little risky for loads:
   speculative loads might cause multithreaded programs with
   insufficient locking to fail due to writes by another thread
   being "lost"/"missed", even though the same program works OK
   "by luck" when compiled without if-conversion of loads.
   This risk comes mainly/only from what the relevant literature
   calls a "half hammock": an "if" with a "then" section but no
   "else" section [or effectively vice-versa, e.g. an empty "then"
   and a non-empty "else"].  In this case, e.g. "if (c)  X[x] = Y[y];"
   with no attached "else" section is risky to fully if-convert
   in the event of the code being compiled running multithreaded
   and not having been written with all the locking it really needs.
   Respectively, e.g. "if (c)  ; /* empty ''then'' */  else  X[x] = Y[y];".
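A scalar C sketch of the half-hammock hazard (hypothetical names; the "unsafe" variant is what a naive full if-conversion of the store would produce):

```c
#include <assert.h>

/* Original half hammock: X[x] is written only when c holds. */
void half_hammock(int c, int *X, const int *Y, int x, int y)
{
    if (c)
        X[x] = Y[y];
}

/* Naively if-converted: X[x] is ALWAYS stored.  Single-threaded this
   is equivalent, but if another thread writes X[x] between the load
   and the store here while !c, that write is silently lost. */
void half_hammock_unsafe(int c, int *X, const int *Y, int x, int y)
{
    int old = X[x];
    X[x] = c ? Y[y] : old;
}
```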


For the unenlightened, can you outline the problem with this code sequence? 
(i.e. the expected transformation that makes it unsafe!?) I would hope your 
scratchpad patch would turn this into something like


a1 = c ? &Y[y] : &scratch;
temp = *a1;
a2 = c ? &X[x] : &scratch;
*a2 = temp;

which seems OK to me - so is the scratchpad approach going away?
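Spelled out as runnable C (hypothetical; the real transformation operates on GIMPLE, and how the scratchpad is allocated is the patch's business, not a `static` local), the sketch above would behave like:

```c
#include <assert.h>

/* Both the load and the store are redirected to a private scratch
   slot when !c, so no user-visible memory is read or written. */
void convert_with_scratchpad(int c, int *X, const int *Y, int x, int y)
{
    static int scratch;
    const int *a1 = c ? &Y[y] : (const int *)&scratch;
    int temp = *a1;
    int *a2 = c ? &X[x] : &scratch;
    *a2 = temp;
}
```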

(The problem that things might be read in a different order *across* the 
elements of a vector, I can see, but that belongs in the domain of the 
vectorizer itself, not if-conversion, I would think?)



One of the reasons the new if converter has not yet been submitted
for incorporation into GCC`s trunk is that it still has some
performance regressions WRT the old converter, and most of those
are "true regressions", i.e. not just because the old converter
was less safe and the additional safety is what is causing the loss,
but rather because there is more work to do before the patch is ready.

As of this writing, the new if converter sometimes tries
to "convert" something that is already vectorizer-friendly,
and in doing so it renders that code now-NOT-vectorizer-friendly.


Can you give an example? My understanding was that the existing vectorizer 
bailed out pretty much straightaway if the number of basic blocks in the loop 
was not exactly 2 (for inner loops) or 5 (for outermost loops, i.e. containing 
exactly one inner loop)...that seems to rule out vectorization of *any* kind of 
conditional execution, that the if-converter might convert?



Thanks, Alan



Re: dom1 prevents vectorization via partial loop peeling?

2015-04-29 Thread Alan Lawrence

Richard Biener wrote:


Well.  In this case we hit

  /* If one of the loop header's edge is an exit edge then do not
 apply if-conversion.  */
  FOR_EACH_EDGE (e, ei, loop->header->succs)
if (loop_exit_edge_p (loop, e))
  return false;

which is simply because even after if-conversion we'll at least end up
with a non-empty latch block which is what the vectorizer doesn't support.
DOM rotated the loop into this non-canonical form.  Running loop header
copying again would probably undo this.


Indeed - header copying causes the loop to be if-converted, and then actually 
vectorized, too. (Well, I had to make do_while_loop_p a bit more aggressive in 
saying things weren't do-while loops.)


I merely moved pass_ch to immediately after pass_dominator, a few stages later; 
I have also tried cloning it, but do you foresee any problems with just moving 
it? (It passes bootstrap, I'm running the full testsuite atm).


Cheers,
Alan



Re: dom1 prevents vectorization via partial loop peeling?

2015-04-28 Thread Alan Lawrence

Ajit Kumar Agarwal wrote:


-Original Message-
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Richard 
Biener
Sent: Tuesday, April 28, 2015 4:12 PM
To: Jeff Law
Cc: Alan Lawrence; gcc@gcc.gnu.org
Subject: Re: dom1 prevents vectorization via partial loop peeling?

On Mon, Apr 27, 2015 at 7:06 PM, Jeff Law  wrote:

On 04/27/2015 10:12 AM, Alan Lawrence wrote:


After copyrename3, immediately prior to dom1, the loop body looks like:

   :

   :
   # i_11 = PHI 
   _5 = a[i_11];
   _6 = i_11 & _5;
   if (_6 != 0)
 goto ;
   else
 goto ;

   :

   :
   # m_2 = PHI <5(4), 4(3)>
   _7 = m_2 * _5;
   b[i_11] = _7;
   i_9 = i_11 + 1;
   if (i_9 != 32)
 goto ;
   else
 goto ;

   :
   goto ;

   :
   return;

dom1 then peeled part of the first loop iteration, producing:
Yup.  The jump threading code realized that if we traverse the edge 
2->3, then we'll always traverse 3->5.  The net effect is like peeling 
the first iteration because we copy bb3.  The original will be reached 
via 7->3 (it, loop iterating), the copy via 2->3' and 3' will have its 
conditional removed and will unconditionally transfer control to bb5.




This is a known problem, but we really don't have any good heuristics 
for when to do this vs when not to do it.


Ah, yes, I'd not realized this was connected to the jump-threading issue, but I 
see that now. As you say, the best heuristics are unclear, and I'm not keen on 
trying *too hard* to predict what later phases will/won't do or do/don't 
want...maybe if there are simple heuristics that work, but I would aim more at 
making later phases work with what(ever) they might get???


One (horrible) possibility that I will just throw out (and then duck), is to do 
something akin to tree-if-conversion's "gimple_build_call_internal 
(IFN_LOOP_VECTORIZED, " ...



In contrast, a slightly-different testcase:

>>> [snip]

I would have still expected it to thread 2->3, 3->6->4


Ok, I'll look into that.

(1) dom1 should really, in the second case, perform the same partial 
peeling that it does in the first testcase, if that is what it thinks 
is desirable. (Of course, we might want to fix that only later, as 
ATM that'd take us backwards).
Please a file a BZ.  It could be something simple, or we might be 
hitting one of Zdenek's heuristics around keeping overall loop structure.


Alternatively, maybe we don't want dom1 doing that sort of thing (?), 
but I'm inclined to think that if it's doing such optimizations, it's 
for a good reason ;) I guess there'll be other times where we 
*cannot* do partially peeling of later iterations...
It's an open question -- we never reached any kind of conclusion when 
it was last discussed with Zdenek.  I think the fundamental issue is 
we can't really predict when threading the loop is going to interfere 
with later optimizations or not.  The heuristics we have are marginal at best.


The one thought we've never explored was re-rolling that first 
iteration back into the loop in the vectorizer.


Yeah, there is that ;).

So besides trying to partially-peel the next N iterations, the other approach - 
that strikes me as sanest - is to finish (fully-)peeling off the first 
iteration, and then to vectorize from then on. In fact the ideal (I confess I 
have no knowledge of the GCC representation/infrastructure here) would probably 
be for the vectorizer (in vect_analyze_scalar_cycles) to look for a point in the 
loop, or rather a 'cut' across the loop, that avoids breaking any non-cyclic 
use-def chains, and to use that as the loop header. That analysis could be quite 
complex tho ;)...and I can see that having peeled the first 1/2 iteration, we 
may then end up having to peel the next (vectorization factor - 1/2) iterations 
too to restore alignment!


whereas with rerolling ;)...is there perhaps some reasonable way to keep markers 
around to make the rerolling approach more feasible???



Well.  In this case we hit



  >>/* If one of the loop header's edge is an exit edge then do not
>> apply if-conversion.  */
  >>FOR_EACH_EDGE (e, ei, loop->header->succs)
   >> if (loop_exit_edge_p (loop, e))
 >> return false;


which is simply because even after if-conversion we'll at least end up with a 
non-empty latch block which is what the vectorizer doesn't support.
DOM rotated the loop into this non-canonical form.  Running loop header copying 
again would probably undo this.


So I've just posted https://gcc.gnu.org/ml/gcc-patches/2015-04/msg01745.html 
which fixes this limitation of if-conversion. As I first wrote though, the 
vectorizer still fails, because the PHI nodes incoming to the loop header are 
neither reductions nor inductions.


I'll see if I can run loop header copying again, as you suggest.

dom1 prevents vectorization via partial loop peeling?

2015-04-27 Thread Alan Lawrence

Hi,

I've been experimenting with some small testcases with the aim of getting the 
vectorizer to work on more loops. I started by compiling at -O3 the following 
testcase:


#define N 32

int a[N];
int b[N];

int foo ()
{
  for (int i = 0; i < N ; i++)
{
  int m = (a[i] & i) ? 5 : 4;
  b[i] = a[i] * m;
}
}

After copyrename3, immediately prior to dom1, the loop body looks like:

  :

  :
  # i_11 = PHI 
  _5 = a[i_11];
  _6 = i_11 & _5;
  if (_6 != 0)
goto ;
  else
goto ;

  :

  :
  # m_2 = PHI <5(4), 4(3)>
  _7 = m_2 * _5;
  b[i_11] = _7;
  i_9 = i_11 + 1;
  if (i_9 != 32)
goto ;
  else
goto ;

  :
  goto ;

  :
  return;

dom1 then peeled part of the first loop iteration, producing:

  :
  goto ;

  :
  # i_11 = PHI 
  _5 = a[i_11];
  _6 = i_11 & _5;
  if (_6 != 0)
goto ;
  else
goto ;

  :

  :
  # m_14 = PHI <5(4), 4(3)>

  :
  # m_2 = PHI 
  # _15 = PHI <_5(5), _10(8)>
  # i_16 = PHI 
  _7 = m_2 * _15;
  b[i_16] = _7;
  i_9 = i_16 + 1;
  if (i_9 != 32)
goto ;
  else
goto ;

  :
  return;

  :
  # i_1 = PHI <0(2)>
  _10 = a[i_1];
  _3 = i_1 & _10;
  goto ;


which following ivcanon, had been simplified down to:

  :
  _10 = a[0];
  goto ;

  :
  _5 = a[i_9];
  _6 = _5 & i_9;
  if (_6 != 0)
goto ;
  else
goto ;

  :

  :
  # m_14 = PHI <5(3), 4(4)>

  :
  # m_2 = PHI 
  # _15 = PHI <_5(5), _10(2)>
  # i_16 = PHI 
  # ivtmp_13 = PHI 
  _7 = m_2 * _15;
  b[i_16] = _7;
  i_9 = i_16 + 1;
  ivtmp_3 = ivtmp_13 - 1;
  if (ivtmp_3 != 0)
goto ;
  else
goto ;

  :
  return;

tree-if-conversion wouldn't deal with such a loop, and so the control flow made 
the vectorizer bail out too; resulting in scalar code (on both x86_64 and 
AArch64). I am currently testing a patch to tree-if-conv.c that makes it work on 
such a CFG, producing:


  :
  _10 = a[0];
  goto ;

  :

  :
  # m_2 = PHI 
  # _15 = PHI <_5(3), _10(2)>
  # i_16 = PHI 
  # ivtmp_13 = PHI 
  _7 = m_2 * _15;
  b[i_16] = _7;
  i_9 = i_16 + 1;
  ivtmp_3 = ivtmp_13 - 1;
  _5 = a[i_9];
  _6 = _5 & i_9;
  m_14 = _6 != 0 ? 5 : 4;
  if (ivtmp_3 != 0)
goto ;
  else
goto ;

  :
  return;

However, this still fails to vectorize: the PHI nodes at the start of bb 4 
prevent this (in vect_analyze_scalar_cycles: they look a bit like reductions, 
but aren't).


In contrast, a slightly-different testcase:

#define N 32

int a[N];
int b[N];

int foo ()
{
  for (int i = 0; i < N ; i++)
{
  int cond = (a[i] & i) ? -1 : 0; // extra variable here
  int m = (cond) ? 5 : 4;
  b[i] = a[i] * m;
}
}

after copyrename3, just before dom1, is only slightly different:

  :

  :
  # i_15 = PHI 
  _6 = a[i_15];
  _7 = i_15 & _6;
  if (_7 != 0)
goto ;
  else
goto ;

  :
  # m_3 = PHI <4(6), 5(3)>
  _8 = m_3 * _6;
  b[i_15] = _8;
  i_10 = i_15 + 1;
  if (i_10 != 32)
goto ;
  else
goto ;

  :
  goto ;

  :
  return;

  :
  goto ;

with bb6 placed out-of-line at the end of the function, rather than bb4 falling
through from just above bb5. This layout prevents dom1 from doing the partial
peeling; it only changes the "goto bb7" into a "goto bb3":


  :

  :
  # i_15 = PHI 
  _6 = a[i_15];
  _7 = i_15 & _6;
  if (_7 != 0)
goto ;
  else
goto ;

  :
  # m_3 = PHI <4(6), 5(3)>
  _8 = m_3 * _6;
  b[i_15] = _8;
  i_10 = i_15 + 1;
  if (i_10 != 32)
goto ;
  else
goto ;

  :
  return;

  :
  goto ;

Existing tree_if_conversion deals with this just fine, feeding into the 
vectorizer:

  :

  :
  # i_15 = PHI 
  # ivtmp_12 = PHI 
  _6 = a[i_15];
  _7 = _6 & i_15;
  m_3 = _7 != 0 ? 5 : 4;
  _8 = m_3 * _6;
  b[i_15] = _8;
  i_10 = i_15 + 1;
  ivtmp_1 = ivtmp_12 - 1;
  if (ivtmp_1 != 0)
goto ;
  else
goto ;

  :
  goto ;

Which is vectorized OK (the only PHI is the induction variable i_15, a true 
scalar cycle).


So, besides the if conversion, I see two issues here:

(1) dom1 should really, in the second case, perform the same partial peeling 
that it does in the first testcase, if that is what it thinks is desirable. (Of 
course, we might want to fix that only later, as ATM that'd take us backwards).


(2) somehow, gcc should vectorize loops that have been partially peeled in that
way. The only way I can really see is to partially peel the same portion of the
first (vectorization factor - 1) loop iterations too, so that we can do a PHI of
a whole vector at a time. So I'm wondering if anyone can give me any pointers
here - am I barking up the right tree? Is it reasonable to persuade the existing
vectorizer loop-peeling code (e.g. for alignment) to do this for us too, or
would anyone recommend a different avenue?


Alternatively, maybe we don't want dom1 doing that sort of thing (?), but I'm
inclined to think that if it's doing such optimizations, it's for a good reason
;) I guess there'll be other times where we *cannot* do partial peeling of
later iterations...


Thoughts?

Thanks, Alan



Re: combine_simplify_rtx (doesn't) commute XOR and ASHIFTRT ???

2014-06-30 Thread Alan Lawrence

Thanks for the quick response - the possibility of different modes seems
plausible, as I'm not sure what would happen in that case. (Although I'm not
entirely sure how that case would arise - something may have changed since the
original patch.) I haven't had much luck in getting a working toolchain from
2004, so I've gone with the approach of preserving the check if result_mode !=
shift_mode, as per patch
https://gcc.gnu.org/ml/gcc-patches/2014-06/msg02426.html .

Thanks!
--Alan

Richard Kenner wrote:

and wondering if anyone can explain to me what's wrong with this
transformation. Having worked through all four cases of A and C1
positive and negative, it seems to me that the extra bits 'fed in' to
the most-significant end of the result are the same either way (i.e. the
XOR of the sign bits of A and C1).
I haven't heard from Kenner in a while, but you could always try to 
contact him directly and see if he can recall the issue.  It was a long 
time ago though...


I haven't gone anywhere ...

But indeed ten years is a long time and I don't have any recollection of
this at all. The above analysis seems right, but this isn't the sort of
thing I'd have done if there weren't some sort of bug.  Perhaps it's only
relevant if result_mode != shift_mode?






combine_simplify_rtx (doesn't) commute XOR and ASHIFTRT ???

2014-06-24 Thread Alan Lawrence

I'm looking at git commit ea1ac559 / svn r76965, January 2004 (archive:
https://gcc.gnu.org/ml/gcc-patches/2004-01/msg03406.html), which prevents

(ashiftrt (xor A C1) C2)

from being commuted to

(xor (ashiftrt A C2) (ashiftrt C1 C2))

and wondering if anyone can explain to me what's wrong with this transformation. 
Having worked through all four cases of A and C1 positive and negative, it seems 
to me that the extra bits 'fed in' to the most-significant end of the result are 
the same either way (i.e. the XOR of the sign bits of A and C1).


Re-enabling this transformation improves rtx simplification and codegen on
AArch64, for one.

(I don't have easy access to an Alpha/VMS machine on which to test Ada
s-crc32.adb.)

Thanks for any info, Alan