[Bug libgomp/58482] gomp4: user defined reduction produce wrong result
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58482 --- Comment #6 from Jakub Jelinek --- Author: jakub Date: Thu Sep 26 07:58:02 2013 New Revision: 202937 URL: http://gcc.gnu.org/viewcvs?rev=202937&root=gcc&view=rev Log: PR libgomp/58482 * c-omp.c (c_omp_split_clauses) : Copy also OMP_CLAUSE_REDUCTION_PLACEHOLDER. * testsuite/libgomp.c/simd-6.c: New test. * testsuite/libgomp.c++/simd-8.C: New test. Added: branches/gomp-4_0-branch/libgomp/testsuite/libgomp.c++/simd-8.C branches/gomp-4_0-branch/libgomp/testsuite/libgomp.c/simd-6.c Modified: branches/gomp-4_0-branch/gcc/c-family/ChangeLog.gomp branches/gomp-4_0-branch/gcc/c-family/c-omp.c branches/gomp-4_0-branch/libgomp/ChangeLog.gomp
[Bug libgomp/58482] gomp4: user defined reduction produce wrong result
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58482 --- Comment #5 from Jakub Jelinek --- There is no problem with having as many reductions as you need, if they are separate variables; the only case that will prevent vectorization is if you have a struct/class with multiple data members as reduction. I'll see if I can reproduce the ICE on Monday.
[Bug libgomp/58482] gomp4: user defined reduction produce wrong result
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58482 --- Comment #4 from vincenzo Innocente --- I see. I have several use cases in which the reduction requires the access to two variables (minloc for instance: the minimum and its location) btw tried omp parallel for simd got ICE c++ -std=c++11 ured_omp4.cpp -O -ftree-vectorizer-verbose=1 -fopenmp ured_omp4.cpp: In function ‘TwoInt sum(const int*, int)’: ured_omp4.cpp:38:63: internal compiler error: Segmentation fault: 11 #pragma omp parallel for simd aligned(q: 16) reduction(foo:s) ^ ured_omp4.cpp:38:63: internal compiler error: Abort trap: 6 c++: internal compiler error: Abort trap: 6 (program cc1plus) Abort trap: 6
[Bug libgomp/58482] gomp4: user defined reduction produce wrong result
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58482 --- Comment #2 from vincenzo Innocente --- Thanks Jakub for the clear answer. The reduction operator should be strictly commutative! and I now understand the meaning of omp declare reduction (I hope) so I modified it as you can see below results ok. but omp does not vectorize while good old -Ofast does… shall I prepare a simple example with corresponding assembler and submit? c++ -std=c++11 ured_omp4.cpp -O2 -ftree-vectorizer-verbose=4 -fopenmp Analyzing loop at ured_omp4.cpp:39 ured_omp4.cpp:39:22: note: = analyze_loop_nest = ured_omp4.cpp:39:22: note: === vect_analyze_loop_form === ured_omp4.cpp:39:22: note: === get_loop_niters === ured_omp4.cpp:39:22: note: ==> get_loop_niters:(unsigned int) NN_15(D) ured_omp4.cpp:39:22: note: Symbolic number of iterations is (unsigned int) NN_15(D) ured_omp4.cpp:39:22: note: === vect_analyze_data_refs === ured_omp4.cpp:39:22: note: got vectype for stmt: _20 = *_19; const vector(4) float ured_omp4.cpp:39:22: note: got vectype for stmt: _30 = MEM[(struct TwoInt *)&D.63343][_16].a; vector(4) float ured_omp4.cpp:39:22: note: not vectorized: not suitable for gather load _30 = MEM[(struct TwoInt *)&D.63343][_16].a; ured_omp4.cpp:39:22: note: bad data references. ured_omp4.cpp:35:8: note: vectorized 0 loops in function. ured_omp4.cpp:35:8: note: loop turned into non-loop; it never loops ured_omp4.cpp:35:8: note: loop turned into non-loop; it never loops I spare you the 4 pages of dump in case of -Ofast pb-d-128-141-131-94:vectorize innocent$ c++ -std=c++11 ured_omp4.cpp -O2 -ftree-vectorizer-verbose=1 -fopenmp; ./a.out ured_omp4.cpp:35:8: note: loop turned into non-loop; it never loops ured_omp4.cpp:35:8: note: loop turned into non-loop; it never loops 523776,-523776 523776,-523776 pb-d-128-141-131-94:vectorize innocent$ c++ -std=c++11 ured_omp4.cpp -Ofast -ftree-vectorizer-verbose=1; ./a.out ured_omp4.cpp:38:3: note: loop vectorized ured_omp4.cpp:38:3: note: loop peeled for vectorization to enhance alignment ured_omp4.cpp:38:3: note: loop with 3 iterations completely unrolled ured_omp4.cpp:35:8: note: loop with 6 iterations completely unrolled ured_omp4.cpp:46:13: note: loop vectorized ured_omp4.cpp:45:8: note: loop with 2 iterations completely unrolled ured_omp4.cpp:38:3: note: loop vectorized ured_omp4.cpp:63:3: note: loop vectorized ured_omp4.cpp:63:3: note: loop with 4 iterations completely unrolled 523776,-523776 523776,-523776 cat ured_omp4.cpp #include #define Type float struct TwoInt { Type a=0; Type b=0; #pragma omp declare simd TwoInt & operator+=(TwoInt rh) { a+=rh.a; b-=rh.b; } #pragma omp declare simd TwoInt & add(TwoInt rh) { a+=rh.a; b-=rh.b; return *this; } #pragma omp declare simd TwoInt & reduce(TwoInt rh) { a+=rh.a; b+=rh.b; return *this; } }; #pragma omp declare reduction (foo:struct TwoInt: omp_out.reduce(omp_in)) TwoInt sum(Type const * q, int NN) { TwoInt s; #pragma omp simd aligned(q: 16) reduction(foo:s) for (int i=0;i int main() { constexpr int NN=1024; Type q[NN]; Type a=0; for (auto & e: q) e=a++; auto s = sum(q,NN); std::cout << s.a << "," << s.b << std::endl; s = sum4(q,NN); std::cout << s.a << "," << s.b << std::endl; return 0; }
[Bug libgomp/58482] gomp4: user defined reduction produce wrong result
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58482 --- Comment #3 from Jakub Jelinek --- It is well known that we don't vectorize this, right now we only handle accesses to the SIMD lane privatized vars that cover the whole size of those vars, while in your testcase the access size is half the size of the var. The reason why -Ofast vectorizes it is likely that SRA manages to scalarize those, but scalarizer can't do anything easily with the magic arrays indexed by SIMD_LANE internal fn that we use to represent the privatized variables (and, I couldn't find a better representation yet for those). So, either the SRA pass would need to handle those (split the single array with { float; float } pairs into two arrays with just float type), or the vectorizer would need to do some ugly magic for selected cases (e.g. handle the pair case by having two vector vars and use always either odd or even entries from them). I'm hoping that most people will actually use scalars or single data member classes for reductions etc.
[Bug libgomp/58482] gomp4: user defined reduction produce wrong result
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58482 Jakub Jelinek changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |INVALID --- Comment #1 from Jakub Jelinek --- Yeah, the testcase is just wrong. The way OpenMP reductions work is that in the loop the reduction variable(s) are privatized, i.e. either each thread (for e.g. #pragma omp for) or each SIMD lane (for #pragma omp simd) gets its own copy of that var, either default constructed (C++98 wording; for UDRs with missing initializer clause), or initialized otherwise (see the standard for details), in the loop everything is done on the privatized variable and finally at the end the reduction operation is performed on the original variable, calling the combiner expression with omp_out being the original variable and omp_in being the privatized var, for each of the privatized variables. But your testcase obviously can't work properly in that case, because your reduction operation can't cope properly with being performed more than once. If you subtract the array element values from the counter, then if they are all positive, you'll get negative number, while if you subtract them from the privatized var, all those privatized vars will have negative counter, but then you subtract those negative numbers from the original var counter and get positive number. There is a reason why reduction (-:int_var) is actually implemented as addition rather than subtraction... In your testcase, you'd want a different operation to be used in the reduction combiner, one that would add things rather than subtract.