Re: Vectorizer Pragmas

2014-02-17 Thread Renato Golin
On 16 February 2014 23:44, Tim Prince n...@aol.com wrote:
 I don't think many people want to use both OpenMP 4 and older Intel
 directives together.

I'm having less and less incentive to use anything other than omp4,
cilk and whatever. I think we should be able to map all our internal
needs to those pragmas.

On the other hand, if you guys have any cross discussion with Intel
folks about it, I'd love to hear about it. Since our support for those
directives is a bit behind, it would be good not to duplicate efforts
in the long run.

Thanks!
--renato


Re: Vectorizer Pragmas

2014-02-17 Thread Tim Prince


On 2/17/2014 4:42 AM, Renato Golin wrote:

On 16 February 2014 23:44, Tim Prince n...@aol.com wrote:

I don't think many people want to use both OpenMP 4 and older Intel
directives together.

I'm having less and less incentive to use anything other than omp4,
cilk and whatever. I think we should be able to map all our internal
needs to those pragmas.

On the other hand, if you guys have any cross discussion with Intel
folks about it, I'd love to hear about it. Since our support for those
directives is a bit behind, it would be good not to duplicate efforts
in the long run.


I'm continuing discussions with former Intel colleagues.  If you are 
asking for insight into how Intel priorities vary over time, I don't 
expect much, unless the next beta compiler provides some inferences.  
They have talked about implementing all of OpenMP 4.0 except 
user-defined reduction this year.  That would imply more activity in 
that area than on cilkplus, although some fixes have come in the 
latter.  On the other hand, I had an issue on omp simd reduction(max: ) 
closed with the decision "will not be fixed".
I have an icc problem report in on fixing omp simd safelen so it is more 
like the standard and less like the obsolete pragma simd vectorlength.  
Also, I have some problem reports active attempting to get clarification 
of their omp target implementation.
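
For reference, the two constructs mentioned above look like this in
OpenMP 4.0 syntax; whether a given compiler accepts and vectorizes them
well is exactly what the problem reports are about, so treat this as an
illustrative sketch only:

  #include <float.h>

  /* Maximum of an array with an OpenMP 4.0 simd reduction.
     safelen(8) asserts that executing up to 8 consecutive
     iterations in lockstep is safe. */
  float vmax(const float *a, int n)
  {
      float m = -FLT_MAX;
      #pragma omp simd reduction(max:m) safelen(8)
      for (int i = 0; i < n; i++)
          if (a[i] > m)
              m = a[i];
      return m;
  }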


You may have noticed that omp parallel for simd in current Intel 
compilers can be used for combined thread and simd parallelism, 
including the case where the outer loop is parallelizable and 
vectorizable but the inner one is not.
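
A minimal sketch of that combined construct (the function and loop are
made up for illustration): the outer loop is both threaded and
vectorized, while the inner loop carries a recurrence and stays scalar
within each SIMD lane.

  /* Outer loop: parallelizable and vectorizable.
     Inner loop: loop-carried dependence on acc, so not vectorizable. */
  void smooth(float *restrict out, const float *restrict in,
              int rows, int cols)
  {
      #pragma omp parallel for simd
      for (int r = 0; r < rows; r++) {
          float acc = 0.0f;
          for (int c = 0; c < cols; c++)
              acc = 0.5f * acc + in[r * cols + c];
          out[r] = acc;
      }
  }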


--
Tim Prince



Re: Vectorizer Pragmas

2014-02-17 Thread Renato Golin
On 17 February 2014 14:47, Tim Prince n...@aol.com wrote:
 I'm continuing discussions with former Intel colleagues.  If you are asking
 for insight into how Intel priorities vary over time, I don't expect much,
 unless the next beta compiler provides some inferences.  They have talked
 about implementing all of OpenMP 4.0 except user defined reduction this
 year.  That would imply more activity in that area than on cilkplus,

I'm expecting this. Any proposal to support Cilk in LLVM would be
purely temporary and not endorsed in any way.


 although some fixes have come in the latter.  On the other hand I had an
 issue on omp simd reduction(max: ) closed with the decision "will not be
 fixed".

We still haven't got pragmas for induction/reduction logic, so I'm not
too worried about them.


 I have an icc problem report in on fixing omp simd safelen so it is more
 like the standard and less like the obsolete pragma simd vectorlength.

Our width metadata is slightly different in that it means "try to use
that length" rather than "it's safe to use that length"; this is why
I'm holding off on using safelen for the moment.


 Also, I have some problem reports active attempting to get clarification of
 their omp target implementation.

Same here... RTFM is not enough in this case. ;)


 You may have noticed that omp parallel for simd in current Intel compilers
 can be used for combined thread and simd parallelism, including the case
 where the outer loop is parallelizable and vectorizable but the inner one is
 not.

That's my fear of going with omp simd directly. I don't want to be
throwing threads all over the place when all I really want is vector
code.

For the time being, my proposal is to use the legacy pragmas:
vector/novector, unroll/nounroll and simd vectorlength, which map
nicely to the metadata we already have and don't incur OpenMP
overhead. Later on, if OpenMP ends up with simple non-threaded pragmas,
we should use those and deprecate the legacy ones.
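
To make that concrete, a hedged sketch of the legacy (Intel-style)
spellings I mean; exact accepted clause forms vary by compiler version,
so treat the syntax as approximate:

  void scale(float *restrict a, const float *restrict b, int n)
  {
      #pragma simd vectorlength(8)   /* request an 8-lane vector loop */
      #pragma unroll(4)              /* and a 4x unroll */
      for (int i = 0; i < n; i++)
          a[i] = 2.0f * b[i];
  }

  void gather(float *a, const int *idx, const float *b, int n)
  {
      #pragma novector               /* keep this loop scalar */
      for (int i = 0; i < n; i++)
          a[i] = b[idx[i]];
  }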

If GCC is trying to do the same thing regarding non-threaded-vector
code, I'd be glad to be involved in the discussion. Some LLVM folks
think this should be an OpenMP discussion; I personally think that
pushes the boundaries of an inherently threaded library extension a
bit too far.

cheers,
--renato


RE: Vectorizer Pragmas

2014-02-17 Thread Geva, Robert
The way Intel presents #pragma simd (to users, to the OpenMP committee, to the C 
and C++ committees, etc.) is that it is not a hint; it has a meaning.
The meaning is defined in terms of evaluation order.
Both C and C++ define an evaluation order for sequential programs. #pragma simd 
relaxes the sequential order into a partial order:
0. subsequent iterations of the loop are chunked together and execute in 
lockstep;
1. there is no change in the order of evaluation of expressions within an 
iteration;
2. if X and Y are expressions in the loop, and X(i) is the evaluation of X in 
iteration i, then for X sequenced before Y and iteration i evaluated before 
iteration j, X(i) is sequenced before Y(j).

A corollary is that the sequential order is always allowed, since it satisfies 
the partial order.
However, the partial order allows the compiler to group copies of the same 
expression next to each other, and then to combine the scalar instructions into 
a vector instruction.
There are other corollaries, such as: if multiple loop iterations write 
into an object defined outside of the loop, then it has to be undefined 
behavior, the vector moral equivalent of a data race. That is why 
induction variables and reductions are necessary exceptions to this 
rule and require explicit support.
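
As a concrete sketch (made-up loop, Intel/Cilk Plus #pragma simd
spelling): with a chunk width of 4, the partial order above permits a
schedule such as X(0) X(1) X(2) X(3) Y(0) Y(1) Y(2) Y(3) X(4) ...,
which is what lets the grouped copies of X become one vector
instruction. The accumulation below writes into an object defined
outside the loop, so it is only well defined because the reduction
clause names it explicitly.

  float dot(const float *a, const float *b, int n)
  {
      float s = 0.0f;
      #pragma simd vectorlength(4) reduction(+:s)
      for (int i = 0; i < n; i++) {
          float p = a[i] * b[i];   /* X */
          s += p;                  /* Y: cross-iteration dependence on s */
      }
      return s;
  }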

As far as correctness goes, by this definition the programmer has 
expressed that the loop is correct, and the compiler should not try to 
prove correctness.

On the performance heuristics side, the Intel compiler tries not to 
second-guess the user. There are users who work much harder than just 
adding a #pragma simd to unmodified sequential loops. Various changes 
may be necessary, and users who have worked hard to get their loops 
into good shape are unhappy if the compiler second-guesses them.

Robert.



Re: Vectorizer Pragmas

2014-02-16 Thread Tobias Burnus

Renato Golin wrote:

On 15 February 2014 19:26, Jakub Jelinek ja...@redhat.com wrote:

GCC supports #pragma GCC ivdep/#pragma simd/#pragma omp simd, the last one
can be used without rest of OpenMP by using -fopenmp-simd switch.


Does the simd/omp have control over the tree vectorizer? Or are they
just flags for the omp implementation?


As '#pragma omp simd' doesn't generate any threads and doesn't call the 
OpenMP run-time library (libgomp), I would claim that it only controls 
the tree vectorizer. (Hence, -fopenmp-simd was added as it permits this 
control without enabling thread parallelization or dependence on libgomp 
or libpthread.)
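
A minimal example of that mode (the build line is illustrative):

  /* e.g. gcc -O2 -fopenmp-simd saxpy.c
     The pragma only steers the loop vectorizer; no threads are
     created and nothing links against libgomp. */
  void saxpy(float *restrict y, const float *restrict x, float a, int n)
  {
      #pragma omp simd
      for (int i = 0; i < n; i++)
          y[i] += a * x[i];
  }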


Compiler vendors (and users) have different ideas whether the SIMD 
pragmas should give the compiler only a hint or completely override the 
compiler's heuristics. In case of the Intel compiler, the user rules; in 
case of GCC, it only influences the heuristics unless one passes 
explicitly -fsimd-cost-model=unlimited (cf. also -Wopenmp-simd).


[Remark regarding '#pragma simd': I believe that pragma is only active 
with -fcilkplus.]



I don't see why we would need more ways to do the same thing.


Me neither! That's what I'm trying to avoid.

Do you guys use those pragmas for everything related to the
vectorizer? I found that the Intel pragmas (not just simd and omp) are
pretty good fit to most of our needed functionality.

Does GCC use Intel pragmas to control the vectorizer? Would be good to
know how you guys did it, so that we can follow the same pattern.


As written by Jakub, only OpenMP's SIMD (requires -fopenmp or 
-fopenmp-simd), Cilk Plus's SIMD (-fcilkplus) and '#pragma GCC ivdep' 
(always enabled) are supported.
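
For completeness, '#pragma GCC ivdep' simply asserts that the loop has
no loop-carried dependences the compiler would otherwise have to
assume; a minimal sketch (here, the assertion is that x and y do not
overlap):

  void shift(float *x, const float *y, int n, int k)
  {
      #pragma GCC ivdep
      for (int i = 0; i < n; i++)
          x[i] = y[i + k];
  }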


As a user, I found Intel's pragmas interesting, but in the end regarded 
OpenMP's SIMD directives/pragmas as sufficient.



Can GCC vectorize lexical blocks as well? Or just loops?


According to http://gcc.gnu.org/projects/tree-ssa/vectorization.html, 
basic-block vectorization (SLP) support exists since 2009.
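
That is, straight-line code without any loop can be vectorized; a small
illustrative case of what the SLP vectorizer targets:

  /* Four independent scalar statements that can be merged into a
     single 4-wide vector add. */
  void add4(float *restrict d, const float *restrict a,
            const float *restrict b)
  {
      d[0] = a[0] + b[0];
      d[1] = a[1] + b[1];
      d[2] = a[2] + b[2];
      d[3] = a[3] + b[3];
  }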


Tobias


Re: Vectorizer Pragmas

2014-02-16 Thread Renato Golin
On 16 February 2014 17:23, Tobias Burnus bur...@net-b.de wrote:
 As '#pragma omp simd' doesn't generate any threads and doesn't call the
 OpenMP run-time library (libgomp), I would claim that it only controls the
 tree vectorizer. (Hence, -fopenmp-simd was added as it permits this control
 without enabling thread parallelization or dependence on libgomp or
 libpthread.)

Right, this is a bit confusing, but it should suffice for our purposes,
which are very similar to GCC's.


 Compiler vendors (and users) have different ideas whether the SIMD pragmas
 should give the compiler only a hint or completely override the compiler's
 heuristics. In case of the Intel compiler, the user rules; in case of GCC,
 it only influences the heuristics unless one passes explicitly
 -fsimd-cost-model=unlimited (cf. also -Wopenmp-simd).

We prefer to be on the safe side, too. We're adding a warning callback
mechanism to warn about possibly dangerous situations (debug messages
already do that), possibly with the same idea as -Wopenmp-simd. But
the intent is not to vectorize if we're sure it'll break things; only
when in doubt will we trust the pragmas/flags.

The flag -fsimd-cost-model=unlimited might be a bit too heavy on other
loops, and is the kind of thing that I'd rather have as a pragma or
not at all.


 As a user, I found Intel's pragmas interesting, but at the end regarded
 OpenMP's SIMD directives/pragmas as sufficient.

That was the kind of user experience that I was looking for, thanks!


 According to http://gcc.gnu.org/projects/tree-ssa/vectorization.html,
 basic-block vectorization (SLP) support exists since 2009.

Would it be desirable to use some pragmas to control lexical blocks,
too? I'm not sure omp/cilk pragmas apply to lexical blocks...

cheers,
--renato


Re: Vectorizer Pragmas

2014-02-16 Thread Tim Prince


On 2/16/2014 2:05 PM, Renato Golin wrote:

On 16 February 2014 17:23, Tobias Burnus bur...@net-b.de wrote:


Compiler vendors (and users) have different ideas whether the SIMD pragmas
should give the compiler only a hint or completely override the compiler's
heuristics. In case of the Intel compiler, the user rules; in case of GCC,
it only influences the heuristics unless one passes explicitly
-fsimd-cost-model=unlimited (cf. also -Wopenmp-simd).
Yes, Intel's idea for simd directives is to vectorize without applying 
either cost models or concern about exceptions.

I tried -fsimd-cost-model=unlimited on my tests; it made no difference.




As a user, I found Intel's pragmas interesting, but at the end regarded
OpenMP's SIMD directives/pragmas as sufficient.

That was the kind of user experience that I was looking for, thanks!


The alignment options for OpenMP 4 are limited, but OpenMP 4 also seems 
to prevent loop fusion, where alignment assertions may be more critical.
In addition, Intel uses the older directives, which some marketer 
decided should be called Cilk(tm) Plus even when used in Fortran, to 
control whether streaming stores may be chosen in some situations. I 
think gcc supports those only by explicit intrinsics.
I don't think many people want to use both OpenMP 4 and older Intel 
directives together.
Several of these directives are still in an embryonic stage in both 
Intel and gnu compilers.
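
For reference, the two flavours contrasted above (the OpenMP 4
alignment clause vs. the older Intel streaming-store directive) look
roughly like this; the clause spellings are taken from the respective
documentation of the era and should be treated as a sketch:

  void copy_omp(float *restrict dst, const float *restrict src, int n)
  {
      /* OpenMP 4.0: assert 32-byte alignment of the listed pointers. */
      #pragma omp simd aligned(dst, src : 32)
      for (int i = 0; i < n; i++)
          dst[i] = src[i];
  }

  void copy_legacy(float *restrict dst, const float *restrict src, int n)
  {
      /* Older Intel directive: request streaming (non-temporal) stores. */
      #pragma vector nontemporal
      for (int i = 0; i < n; i++)
          dst[i] = src[i];
  }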


--
Tim Prince



Vectorizer Pragmas

2014-02-15 Thread Renato Golin
Folks,

One of the things that we've been discussing for a while is
vectorization control pragmas: there are just too many options out
there, and none fits exactly what we're looking for (obviously).

Our initial semantics is to work on a specific loop / lexical block to
(a rough sketch follows the lists below):
 * turn vectorization on/off (even if -fvec is disabled)
 * specify the vector width (number of lanes)
 * specify the unroll factor (either to help with vectorization or to
use when vectorization is not profitable)

Later metadata could be added to:
 * determine memory safety at specific distances
 * determine vectorized functions to use for specific widths
 * etc
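
Purely as a strawman (the spelling below is invented for illustration;
it corresponds to option 1 below and does not exist in any compiler),
this is the kind of per-loop control we have in mind:

  void kernel(float *restrict a, const float *restrict b, int n)
  {
      /* Hypothetical syntax: force vectorization even if -fvec is off,
         ask for 8 lanes, and unroll the vector body twice. */
      #pragma vectorize enable width(8) unroll(2)
      for (int i = 0; i < n; i++)
          a[i] = b[i] * b[i];
  }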

The current discussion is floating around four solutions:

1. Local pragma (#pragma vectorize), which is losing badly on the
argument that it's yet-another pragma to do mostly the same thing many
others do.

2. Using OMP SIMD pragmas (#pragma simd, #pragma omp simd), which are
already standardised (OMP 4.0, I think), but don't cover all the
semantics we may want in the future; plus they're segregated and may
confuse users.

3. Using GCC-style optimize pragmas (#pragma Clang optimize), which
could be Clang-specific without polluting other compilers' namespaces.
The problem here is that we'd end up with duplicated flags with
closely-related-but-different semantics between #pragma GCC and
#pragma Clang variants.

4. Using C++11 annotations. This would be the cleanest way, but would
only be valid in C++11 mode and could very well end up being a
different way to express semantics identical to the pragmas, which are
valid in all C variants and Fortran.

I'm trying to avoid adding new semantics to old problems, but I'm also
trying to avoid spreading closely related semantics across a multitude
of pragmas, annotations and who knows what else.

Does GCC have anything similar? Do you guys have any ideas we could use?

I'm open to anything, even in favour of one of the defective
propositions above. I'd rather have something than nothing, but I'd
also rather have something that most people agree on.

cheers,
--renato


Re: Vectorizer Pragmas

2014-02-15 Thread Jakub Jelinek
On Sat, Feb 15, 2014 at 06:56:42PM +, Renato Golin wrote:
 1. Local pragma (#pragma vectorize), which is losing badly on the
 argument that it's yet-another pragma to do mostly the same thing many
 others do.
 
 2. Using OMP SIMD pragmas (#pragma simd, #pragma omp simd) which is
 already standardised (OMP 4.0 I think), but that doesn't cover all the
 semantics we may want in the future, plus it's segregated and may
 confuse the users.

GCC supports #pragma GCC ivdep/#pragma simd/#pragma omp simd, the last one
can be used without rest of OpenMP by using -fopenmp-simd switch.
I don't see why we would need more ways to do the same thing.

Jakub


Re: Vectorizer Pragmas

2014-02-15 Thread Renato Golin
On 15 February 2014 19:26, Jakub Jelinek ja...@redhat.com wrote:
 GCC supports #pragma GCC ivdep/#pragma simd/#pragma omp simd, the last one
 can be used without rest of OpenMP by using -fopenmp-simd switch.

Does the simd/omp have control over the tree vectorizer? Or are they
just flags for the omp implementation?


 I don't see why we would need more ways to do the same thing.

Me neither! That's what I'm trying to avoid.

Do you guys use those pragmas for everything related to the
vectorizer? I found that the Intel pragmas (not just simd and omp) are
a pretty good fit for most of the functionality we need.

Does GCC use Intel pragmas to control the vectorizer? Would be good to
know how you guys did it, so that we can follow the same pattern.

Can GCC vectorize lexical blocks as well? Or just loops?

If those pragmas can't be used in lexical blocks, would it be desirable
to extend that in GCC? The Intel guys are pretty happy implementing
simd, omp, etc. in LLVM, and I think if the lexical block problem is
common, they may even be open to extending the semantics.

cheers,
--renato


Re: Vectorizer Pragmas

2014-02-15 Thread Tim Prince


On 2/15/2014 3:36 PM, Renato Golin wrote:

On 15 February 2014 19:26, Jakub Jelinek ja...@redhat.com wrote:

GCC supports #pragma GCC ivdep/#pragma simd/#pragma omp simd, the last one
can be used without rest of OpenMP by using -fopenmp-simd switch.

Does the simd/omp have control over the tree vectorizer? Or are they
just flags for the omp implementation?



I don't see why we would need more ways to do the same thing.

Me neither! That's what I'm trying to avoid.

Do you guys use those pragmas for everything related to the
vectorizer? I found that the Intel pragmas (not just simd and omp) are
pretty good fit to most of our needed functionality.

Does GCC use Intel pragmas to control the vectorizer? Would be good to
know how you guys did it, so that we can follow the same pattern.

Can GCC vectorize lexical blocks as well? Or just loops?

IF those pragmas can't be used in lexical blocks, would it be desired
to extend that in GCC? The Intel guys are pretty happy implementing
simd, omp, etc. in LLVM, and I think if the lexical block problem is
common, they may even be open to extending the semantics?

cheers,
--renato
gcc ignores the Intel pragmas, other than the OpenMP 4.0 ones.  I think 
Jakub may have his hands full trying to implement the OpenMP 4 pragmas, 
plus GCC ivdep, and gfortran equivalents.  It's tough enough 
distinguishing between Intel's partial implementation of OpenMP 4 and 
the way it ought to be done.
In my experience, the (somewhat complicated) gcc --param options work 
sufficiently well for specification of unrolling.  In the same vein, I 
haven't seen any cases where gcc 4.9 is excessively aggressive in 
vectorization, so that a #pragma novector plus scalar unroll  is needed, 
as it is with Intel compilers.
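
As an illustrative sketch of what I mean (parameter names are from the
gcc manual; defaults and exact effects vary by version, and
-funroll-loops has to be enabled for them to matter):

  /* Illustrative build line:
       gcc -O2 -funroll-loops --param max-unroll-times=4 \
           --param max-unrolled-insns=200 axpy.c          */
  void axpy(float *restrict y, const float *restrict x, float a, int n)
  {
      for (int i = 0; i < n; i++)
          y[i] += a * x[i];
  }
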
I'm assuming that Intel involvement with llvm is aimed toward making it 
look like Intel's own compilers; before I retired, I heard a comment 
which indicated a realization that the idea of pushing llvm over gnu had 
been over-emphasized.  My experience with this is limited; my Intel 
Android phone broke before I got too involved with their llvm Android 
compiler, which had some bad effects on both gcc and Intel software 
usage for normal Windows purposes.
I've never seen a compiler where pragmas could be used to turn on 
auto-vectorization when compile options were set to disable it. The 
closest to that is the Intel(r) Cilk(tm) Plus where CEAN notation 
implies turning on many aggressive optimizations, such that full 
performance can be achieved without problematical -O3.  If your idea is 
to obtain selective effective auto-vectorization in source code which is 
sufficiently broken that -O2 -ftree-vectorize can't be considered or 
-fno-strict-aliasing has to be set, I'm not about to second such a motion.
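
For readers who haven't seen it, CEAN (the Cilk Plus array notation)
looks like the sketch below; the array-section syntax expresses the
data parallelism directly rather than relying on loop
auto-vectorization heuristics (it requires a compiler with Cilk Plus
support, e.g. icc or gcc with -fcilkplus):

  void saxpy_cean(float *restrict y, const float *restrict x,
                  float a, int n)
  {
      /* y[0:n] is the section of n elements starting at index 0. */
      y[0:n] = a * x[0:n] + y[0:n];
  }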


--
Tim Prince



Re: Vectorizer Pragmas

2014-02-15 Thread Renato Golin
On 15 February 2014 22:49, Tim Prince n...@aol.com wrote:
 In my experience, the (somewhat complicated) gcc --param options work
 sufficiently well for specification of unrolling.

There is precedent for --param in LLVM; we could go this way, too.
Though I can't see how it'd be applied to a specific function, loop
or lexical block.


 In the same vein, I haven't seen any cases where gcc 4.9 is excessively 
 aggressive in
 vectorization, so that a #pragma novector plus scalar unroll  is needed, as
 it is with Intel compilers.
 (...)
 If your idea is to obtain selective effective
 auto-vectorization in source code which is sufficiently broken that -O2
 -ftree-vectorize can't be considered or -fno-strict-aliasing has to be set,
 I'm not about to second such a motion.

Our main idea with this is to help people report missed vectorization
in their code, and to give them a way to achieve performance until
LLVM catches up.

Another case for this (and other pragmas controlling the optimization
level on a per-function basis) is to help debugging of specific
functions while leaving others untouched.

I would not condone the use of such pragmas in a persistent manner, nor
for any code that goes into production, nor to work around broken code
at higher optimization levels.

cheers,
--renato