Re: Vectorizer Pragmas
On 16 February 2014 23:44, Tim Prince n...@aol.com wrote:
> I don't think many people want to use both OpenMP 4 and older Intel
> directives together.

I have less and less incentive to use anything other than omp4, cilk and whatever else. I think we should be able to map all our internal needs to those pragmas. On the other hand, if you guys have any cross discussion with Intel folks about it, I'd love to hear. Since our support for those directives is a bit behind, it would be good not to duplicate the efforts in the long run.

Thanks!
--renato
Re: Vectorizer Pragmas
On 2/17/2014 4:42 AM, Renato Golin wrote:
> On 16 February 2014 23:44, Tim Prince n...@aol.com wrote:
>> I don't think many people want to use both OpenMP 4 and older Intel
>> directives together.
> I have less and less incentive to use anything other than omp4, cilk
> and whatever else. I think we should be able to map all our internal
> needs to those pragmas. On the other hand, if you guys have any cross
> discussion with Intel folks about it, I'd love to hear. Since our
> support for those directives is a bit behind, it would be good not to
> duplicate the efforts in the long run.

I'm continuing discussions with former Intel colleagues. If you are asking for insight into how Intel priorities vary over time, I don't expect much, unless the next beta compiler provides some inferences. They have talked about implementing all of OpenMP 4.0 except user-defined reductions this year. That would imply more activity in that area than on cilkplus, although some fixes have come into the latter. On the other hand, I had an issue on omp simd reduction(max: ) closed with the decision "will not be fixed". I have an icc problem report in on fixing omp simd safelen so that it is more like the standard and less like the obsolete pragma simd vectorlength. Also, I have some problem reports active attempting to get clarification of their omp target implementation. You may have noticed that omp parallel for simd in current Intel compilers can be used for combined thread and simd parallelism, including the case where the outer loop is parallelizable and vectorizable but the inner one is not.

-- Tim Prince
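Tim's combined case can be sketched in C as follows. The function name and array shapes are illustrative, not from the thread; without an OpenMP-enabled build the pragma is simply ignored, so the code computes the same result either way:

```c
#include <stddef.h>

#define M 8 /* inner trip count, illustrative */

/* Outer loop: iterations are independent, so "omp parallel for simd"
   may split them across threads and SIMD lanes. Inner loop: a serial
   recurrence in acc, not vectorizable on its own. */
void row_sums(size_t n, float b[][M], float *a)
{
    #pragma omp parallel for simd
    for (ptrdiff_t i = 0; i < (ptrdiff_t)n; ++i) {
        float acc = 0.0f;
        for (size_t j = 0; j < M; ++j)
            acc += b[i][j]; /* loop-carried dependence on acc */
        a[i] = acc;
    }
}
```

Build with something like "cc -fopenmp file.c" to activate the pragma; a plain "cc file.c" just warns about (or silently skips) the unknown pragma.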
Re: Vectorizer Pragmas
On 17 February 2014 14:47, Tim Prince n...@aol.com wrote:
> I'm continuing discussions with former Intel colleagues. If you are
> asking for insight into how Intel priorities vary over time, I don't
> expect much, unless the next beta compiler provides some inferences.
> They have talked about implementing all of OpenMP 4.0 except user
> defined reduction this year. That would imply more activity in that
> area than on cilkplus,

I'm expecting this. Any proposal to support Cilk in LLVM would be purely temporary and not endorsed in any way.

> although some fixes have come in the latter. On the other hand I had an
> issue on omp simd reduction(max: ) closed with the decision "will not
> be fixed".

We still haven't got pragmas for induction/reduction logic, so I'm not too worried about them.

> I have an icc problem report in on fixing omp simd safelen so it is
> more like the standard and less like the obsolete pragma simd
> vectorlength.

Our width metadata is slightly different, in that it means "try to use that length" rather than "it's safe to use that length"; this is why I'm holding off on using safelen for the moment.

> Also, I have some problem reports active attempting to get
> clarification of their omp target implementation.

Same here... RTFM is not enough in this case. ;)

> You may have noticed that omp parallel for simd in current Intel
> compilers can be used for combined thread and simd parallelism,
> including the case where the outer loop is parallelizable and
> vectorizable but the inner one is not.

That's my fear of going with omp simd directly. I don't want to be throwing threads all over the place when all I really want is vector code. For the time being, my proposal is to use the legacy pragmas: vector/novector, unroll/nounroll and simd vectorlength, which map nicely to the metadata we already have and don't incur OpenMP overhead. Later on, if OpenMP ends up with simple non-threaded pragmas, we should use those and deprecate the legacy ones.
If GCC is trying to do the same thing regarding non-threaded vector code, I'd be glad to be involved in the discussion. Some LLVM folks think this should be an OpenMP discussion; I personally think it's pushing the boundaries a bit too much on an inherently threaded library extension.

cheers,
--renato
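For illustration, the legacy set Renato proposes might be spelled like this on concrete loops. These are the Intel-compiler spellings, shown as a sketch only; compilers that don't recognize them ignore the pragmas, so the code behaves the same either way:

```c
#include <stddef.h>

/* Hint the compiler to vectorize with 4 lanes and unroll by 2
   (Intel legacy spellings; ignored where unsupported). */
void scale(float *restrict dst, const float *restrict src, size_t n)
{
    #pragma simd vectorlength(4)
    #pragma unroll(2)
    for (size_t i = 0; i < n; ++i)
        dst[i] = 2.0f * src[i];
}

/* Ask for this loop to stay scalar. */
void serial_copy(float *dst, const float *src, size_t n)
{
    #pragma novector
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}
```

These map naturally onto per-loop metadata because each pragma attaches to exactly one loop and carries no threading semantics.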
RE: Vectorizer Pragmas
The way Intel presents #pragma simd (to users, to the OpenMP committee, to the C and C++ committees, etc.) is that it is not a hint; it has a meaning. The meaning is defined in terms of evaluation order. Both C and C++ define an evaluation order for sequential programs. #pragma simd relaxes the sequential order into a partial order:

0. subsequent iterations of the loop are chunked together and execute in lockstep;
1. there is no change in the order of evaluation of expressions within an iteration;
2. if X and Y are expressions in the loop, and X(i) is the evaluation of X in iteration i, then for X sequenced before Y and iteration i evaluated before iteration j, X(i) is sequenced before Y(j).

A corollary is that the sequential order is always allowed, since it satisfies the partial order. However, the partial order allows the compiler to group copies of the same expression next to each other, and then to combine the scalar instructions into a vector instruction. There are other corollaries, such as: if multiple loop iterations write into an object defined outside of the loop, then it has to be undefined behavior, the vector moral equivalent of a data race. That is why induction variables and reductions are necessary exceptions to this rule and require explicit support. As far as correctness goes, by this definition the programmer has expressed that it is correct, and the compiler should not try to prove correctness. On the performance-heuristics side, the Intel compiler tries not to second-guess the user. There are users who work much harder than just adding a #pragma simd to unmodified sequential loops. There are various changes that may be necessary, and users who worked hard to get their loops into good shape are unhappy if the compiler does second-guess them.

Robert.
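Robert's "vector moral equivalent of a data race" and the required explicit exception for reductions can be shown in one small sketch (the function name is illustrative):

```c
#include <stddef.h>

/* Under #pragma simd, iterations execute in lockstep chunks, so a plain
   accumulation into "sum" from every iteration would be a vector race.
   The reduction clause is the explicit exception Robert describes:
   each lane keeps a private partial sum, combined at the end. */
float dot(const float *x, const float *y, size_t n)
{
    float sum = 0.0f;
    #pragma simd reduction(+:sum) /* without this clause: undefined behavior */
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

By rule 2 above, the sequential result is always a legal outcome, which is why the scalar build (where the pragma is ignored) computes the same value.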
Re: Vectorizer Pragmas
Renato Golin wrote:
> On 15 February 2014 19:26, Jakub Jelinek ja...@redhat.com wrote:
>> GCC supports #pragma GCC ivdep/#pragma simd/#pragma omp simd, the last
>> one can be used without rest of OpenMP by using -fopenmp-simd switch.
> Does the simd/omp have control over the tree vectorizer? Or are they
> just flags for the omp implementation?

As '#pragma omp simd' doesn't generate any threads and doesn't call the OpenMP run-time library (libgomp), I would claim that it only controls the tree vectorizer. (Hence, -fopenmp-simd was added, as it permits this control without enabling thread parallelization or a dependence on libgomp or libpthread.)

Compiler vendors (and users) have different ideas about whether the SIMD pragmas should give the compiler only a hint or completely override the compiler's heuristics. In the case of the Intel compiler, the user rules; in the case of GCC, it only influences the heuristics unless one explicitly passes -fsimd-cost-model=unlimited (cf. also -Wopenmp-simd). [Remark regarding '#pragma simd': I believe that pragma is only active with -fcilkplus.]

>> I don't see why we would need more ways to do the same thing.
> Me neither! That's what I'm trying to avoid. Do you guys use those
> pragmas for everything related to the vectorizer? I found that the
> Intel pragmas (not just simd and omp) are a pretty good fit for most of
> our needed functionality. Does GCC use Intel pragmas to control the
> vectorizer? Would be good to know how you guys did it, so that we can
> follow the same pattern.

As written by Jakub, only OpenMP's SIMD (requires -fopenmp or -fopenmp-simd), Cilk Plus's SIMD (-fcilkplus) and '#pragma GCC ivdep' (always enabled) are supported. As a user, I found Intel's pragmas interesting, but in the end regarded OpenMP's SIMD directives/pragmas as sufficient.

> Can GCC vectorize lexical blocks as well? Or just loops?

According to http://gcc.gnu.org/projects/tree-ssa/vectorization.html, basic-block vectorization (SLP) support has existed since 2009.

Tobias
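The safelen clause Tobias and Tim discuss is the "it's safe to use that length" form, as opposed to a "try that length" hint. A minimal sketch (the function and dependence distance are illustrative):

```c
#include <stddef.h>

/* '#pragma omp simd' spawns no threads and needs no libgomp; GCC honors
   it under -fopenmp or -fopenmp-simd. safelen(4) asserts that iterations
   up to 4 apart are independent, which matches this loop's known
   dependence distance of 4, so up to 4 lanes may run in lockstep. */
void shift_add(float *a, size_t n)
{
    #pragma omp simd safelen(4)
    for (size_t i = 4; i < n; ++i)
        a[i] = a[i - 4] + 1.0f;
}
```

A larger safelen here would be a correctness bug on the user's side, not something the compiler is expected to catch, which is precisely the "user rules" semantics in question.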
Re: Vectorizer Pragmas
On 16 February 2014 17:23, Tobias Burnus bur...@net-b.de wrote:
> As '#pragma omp simd' doesn't generate any threads and doesn't call the
> OpenMP run-time library (libgomp), I would claim that it only controls
> the tree vectorizer. (Hence, -fopenmp-simd was added as it permits this
> control without enabling thread parallelization or dependence on
> libgomp or libpthread.)

Right, this is a bit confusing, but should suffice for our purposes, which are very similar to GCC's.

> Compiler vendors (and users) have different ideas whether the SIMD
> pragmas should give the compiler only a hint or completely override the
> compiler's heuristics. In case of the Intel compiler, the user rules;
> in case of GCC, it only influences the heuristics unless one passes
> explicitly -fsimd-cost-model=unlimited (cf. also -Wopenmp-simd).

We prefer to be on the safe side, too. We're adding a warning callback mechanism to warn about possibly dangerous situations (debug messages already do that), possibly with the same idea as -Wopenmp-simd. But the intent is not to vectorize if we're sure it'll break things; only when in doubt will we trust the pragmas/flags. The flag -fsimd-cost-model=unlimited might be a bit too heavy on other loops, and is the kind of thing that I'd rather have as a pragma or not at all.

> As a user, I found Intel's pragmas interesting, but at the end regarded
> OpenMP's SIMD directives/pragmas as sufficient.

That was the kind of user experience that I was looking for, thanks!

> According to http://gcc.gnu.org/projects/tree-ssa/vectorization.html,
> basic-block vectorization (SLP) support exists since 2009.

Would it be desirable to use some pragmas to control lexical blocks, too? I'm not sure omp/cilk pragmas apply to lexical blocks...

cheers,
--renato
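For context on the lexical-block question: basic-block (SLP) vectorization acts on straight-line code and needs no pragma at all. A sketch of the kind of block it targets (in GCC this is enabled by -ftree-slp-vectorize, which is on at -O3 or with -ftree-vectorize):

```c
/* Four isomorphic statements on adjacent elements: an SLP vectorizer
   can pack these into a single 4-wide vector add. No loop, no pragma. */
void add4(float *restrict d, const float *restrict a, const float *restrict b)
{
    d[0] = a[0] + b[0];
    d[1] = a[1] + b[1];
    d[2] = a[2] + b[2];
    d[3] = a[3] + b[3];
}
```

This is why the existing omp/cilk pragmas, which attach to loops, don't obviously carry over to lexical blocks.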
Re: Vectorizer Pragmas
On 2/16/2014 2:05 PM, Renato Golin wrote:
> On 16 February 2014 17:23, Tobias Burnus bur...@net-b.de wrote:
>> Compiler vendors (and users) have different ideas whether the SIMD
>> pragmas should give the compiler only a hint or completely override
>> the compiler's heuristics. In case of the Intel compiler, the user
>> rules; in case of GCC, it only influences the heuristics unless one
>> passes explicitly -fsimd-cost-model=unlimited (cf. also -Wopenmp-simd).

Yes, Intel's idea for simd directives is to vectorize without applying either cost models or concern about exceptions. I tried -fsimd-cost-model=unlimited on my tests; it made no difference.

>> As a user, I found Intel's pragmas interesting, but at the end
>> regarded OpenMP's SIMD directives/pragmas as sufficient.
> That was the kind of user experience that I was looking for, thanks!

The alignment options for OpenMP 4 are limited, but OpenMP 4 also seems to prevent loop fusion, where alignment assertions may be more critical. In addition, Intel uses the older directives, which some marketer decided should be called Cilk(tm) Plus even when used in Fortran, to control whether streaming stores may be chosen in some situations. I think gcc supports those only by explicit intrinsics. I don't think many people want to use both OpenMP 4 and the older Intel directives together. Several of these directives are still at an embryonic stage in both the Intel and gnu compilers.

-- Tim Prince
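The "limited" OpenMP 4 alignment control Tim refers to is essentially the aligned clause on omp simd. A sketch (names illustrative; the clause is a promise by the caller, not something the compiler checks):

```c
#include <stddef.h>

/* The aligned clause asserts that x and y point to 32-byte-aligned
   storage, so aligned vector loads/stores may be emitted. Passing
   misaligned pointers to an OpenMP-enabled build is undefined behavior;
   without OpenMP the pragma is ignored. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    #pragma omp simd aligned(x, y : 32)
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Compared with Intel's richer set of alignment assertions and streaming-store directives, this per-loop clause is the whole story in OpenMP 4, which is the limitation being pointed out.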
Vectorizer Pragmas
Folks,

One of the things we've been discussing for a while, where there are just too many options out there and none fits exactly what we're looking for (obviously), is vectorization control pragmas. Our initial semantics is to work on a specific loop / lexical block to:

* turn vectorization on/off (even if -fvec is disabled)
* specify the vector width (number of lanes)
* specify the unroll factor (either to help with vectorization or to use when vectorization is not profitable)

Later, metadata could be added to:

* determine memory safety at specific distances
* determine vectorized functions to use for specific widths
* etc.

The current discussion is floating around four solutions:

1. A local pragma (#pragma vectorize), which is losing badly on the argument that it's yet another pragma to do mostly the same thing many others do.

2. Using OMP SIMD pragmas (#pragma simd, #pragma omp simd), which are already standardised (OMP 4.0, I think), but that doesn't cover all the semantics we may want in the future; plus it's segregated and may confuse users.

3. Using GCC-style optimize pragmas (#pragma Clang optimize), which could be Clang-specific without polluting other compilers' namespaces. The problem here is that we'd end up with duplicated flags with closely-related-but-different semantics between the #pragma GCC and #pragma Clang variants.

4. Using C++11 annotations. This would be the cleanest way, but would only be valid in C++11 mode, and could very well be just a different way to express semantics identical to the pragmas, which are valid in all C variants and Fortran.

I'm trying to avoid adding new semantics to old problems, but I'm also trying to avoid spreading closely related semantics across a multitude of pragmas, annotations and who knows what else. Does GCC have anything similar? Do you guys have any ideas we could use? I'm open to anything, even in favour of one of the defective propositions above.
I'd rather have something than nothing, but I'd also rather have something that most people agree on. cheers, --renato
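For illustration, option 1's local pragma could be spelled something like the following. The spellings `vectorize`, `vector_width` and `unroll` are hypothetical, not an implemented syntax; unknown pragmas are ignored, so the sketch compiles and runs as plain scalar code:

```c
#include <stddef.h>

/* Hypothetical local pragma (option 1, not implemented anywhere):
   force vectorization on this loop even if -fvec is off, request
   4 lanes, and unroll by a factor of 2. */
void scale_add(float *restrict out, const float *restrict in, size_t n)
{
    #pragma vectorize enable vector_width(4) unroll(2)
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] + 1.0f;
}
```

The three clauses correspond one-to-one to the three items of "initial semantics" listed above, which is what would make such a pragma map directly onto per-loop metadata.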
Re: Vectorizer Pragmas
On Sat, Feb 15, 2014 at 06:56:42PM +, Renato Golin wrote:
> 1. Local pragma (#pragma vectorize), which is losing badly on the
> argument that it's yet-another pragma to do mostly the same thing many
> others do.
> 2. Using OMP SIMD pragmas (#pragma simd, #pragma omp simd) which is
> already standardised (OMP 4.0 I think), but that doesn't cover all the
> semantics we may want in the future, plus it's segregated and may
> confuse the users.

GCC supports #pragma GCC ivdep / #pragma simd / #pragma omp simd; the last one can be used without the rest of OpenMP by using the -fopenmp-simd switch. I don't see why we would need more ways to do the same thing.

Jakub
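Of the three pragmas Jakub lists, '#pragma GCC ivdep' is the one that needs no flag at all. A sketch of its typical use (names illustrative):

```c
#include <stddef.h>

/* '#pragma GCC ivdep' tells GCC to assume the loop carries no
   dependences it cannot prove absent, here through the indirect idx[]
   accesses, without pulling in any OpenMP machinery. The programmer is
   responsible for the assertion being true. */
void gather_add(float *a, const float *b, const int *idx, size_t n)
{
    #pragma GCC ivdep
    for (size_t i = 0; i < n; ++i)
        a[i] += b[idx[i]];
}
```

This covers only the "ignore assumed dependences" piece of the design space; it says nothing about width or unrolling, which is why the thread keeps coming back to the other pragma families.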
Re: Vectorizer Pragmas
On 15 February 2014 19:26, Jakub Jelinek ja...@redhat.com wrote:
> GCC supports #pragma GCC ivdep/#pragma simd/#pragma omp simd, the last
> one can be used without rest of OpenMP by using -fopenmp-simd switch.

Does the simd/omp have control over the tree vectorizer? Or are they just flags for the omp implementation?

> I don't see why we would need more ways to do the same thing.

Me neither! That's what I'm trying to avoid. Do you guys use those pragmas for everything related to the vectorizer? I found that the Intel pragmas (not just simd and omp) are a pretty good fit for most of our needed functionality. Does GCC use Intel pragmas to control the vectorizer? It would be good to know how you guys did it, so that we can follow the same pattern. Can GCC vectorize lexical blocks as well? Or just loops? If those pragmas can't be used on lexical blocks, would it be desirable to extend that in GCC? The Intel guys are pretty happy implementing simd, omp, etc. in LLVM, and I think if the lexical-block problem is common, they may even be open to extending the semantics.

cheers,
--renato
Re: Vectorizer Pragmas
On 2/15/2014 3:36 PM, Renato Golin wrote:
> On 15 February 2014 19:26, Jakub Jelinek ja...@redhat.com wrote:
>> GCC supports #pragma GCC ivdep/#pragma simd/#pragma omp simd, the last
>> one can be used without rest of OpenMP by using -fopenmp-simd switch.
> Does the simd/omp have control over the tree vectorizer? Or are they
> just flags for the omp implementation?
>> I don't see why we would need more ways to do the same thing.
> Me neither! That's what I'm trying to avoid. Do you guys use those
> pragmas for everything related to the vectorizer? I found that the
> Intel pragmas (not just simd and omp) are a pretty good fit for most of
> our needed functionality. Does GCC use Intel pragmas to control the
> vectorizer? Would be good to know how you guys did it, so that we can
> follow the same pattern. Can GCC vectorize lexical blocks as well? Or
> just loops? If those pragmas can't be used on lexical blocks, would it
> be desirable to extend that in GCC? The Intel guys are pretty happy
> implementing simd, omp, etc. in LLVM, and I think if the lexical-block
> problem is common, they may even be open to extending the semantics.

gcc ignores the Intel pragmas, other than the OpenMP 4.0 ones. I think Jakub may have his hands full trying to implement the OpenMP 4 pragmas, plus GCC ivdep and the gfortran equivalents. It's tough enough distinguishing between Intel's partial implementation of OpenMP 4 and the way it ought to be done. In my experience, the (somewhat complicated) gcc --param options work sufficiently well for specifying unrolling. In the same vein, I haven't seen any cases where gcc 4.9 is excessively aggressive in vectorization, such that a #pragma novector plus scalar unroll would be needed, as it is with Intel compilers. I'm assuming that Intel's involvement with llvm is aimed toward making it look like Intel's own compilers; before I retired, I heard a comment which indicated a realization that the idea of pushing llvm over gnu had been over-emphasized.
My experience with this is limited; my Intel Android phone broke before I got too involved with their llvm Android compiler, which had some bad effects on both gcc and Intel software usage for normal Windows purposes. I've never seen a compiler where pragmas could be used to turn on auto-vectorization when the compile options were set to disable it. The closest to that is Intel(r) Cilk(tm) Plus, where CEAN notation implies turning on many aggressive optimizations, such that full performance can be achieved without the problematical -O3. If your idea is to obtain selective, effective auto-vectorization in source code which is sufficiently broken that -O2 -ftree-vectorize can't be considered or -fno-strict-aliasing has to be set, I'm not about to second such a motion.

-- Tim Prince
Re: Vectorizer Pragmas
On 15 February 2014 22:49, Tim Prince n...@aol.com wrote:
> In my experience, the (somewhat complicated) gcc --param options work
> sufficiently well for specification of unrolling.

There is precedent for --param in LLVM, so we could go this way, too. Though I can't see how it'd be applied to a specific function, loop or lexical block.

> In the same vein, I haven't seen any cases where gcc 4.9 is excessively
> aggressive in vectorization, so that a #pragma novector plus scalar
> unroll is needed, as it is with Intel compilers. (...) If your idea is
> to obtain selective effective auto-vectorization in source code which
> is sufficiently broken that -O2 -ftree-vectorize can't be considered or
> -fno-strict-aliasing has to be set, I'm not about to second such a
> motion.

Our main idea with this is to help people report missed vectorization in their code, and to give them a way to achieve performance while LLVM doesn't catch up. Another case for this (and for other pragmas controlling the optimization level on a per-function basis) is to help debugging of specific functions while leaving others untouched. I would not condone using such pragmas in a persistent manner, nor for any code that goes into production, nor to work around broken code at higher optimization levels.

cheers,
--renato