[Bug libgomp/58482] gomp4: user defined reduction produce wrong result

2013-09-26 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58482

--- Comment #6 from Jakub Jelinek  ---
Author: jakub
Date: Thu Sep 26 07:58:02 2013
New Revision: 202937

URL: http://gcc.gnu.org/viewcvs?rev=202937&root=gcc&view=rev
Log:
PR libgomp/58482
* c-omp.c (c_omp_split_clauses) : Copy
also OMP_CLAUSE_REDUCTION_PLACEHOLDER.

* testsuite/libgomp.c/simd-6.c: New test.
* testsuite/libgomp.c++/simd-8.C: New test.

Added:
branches/gomp-4_0-branch/libgomp/testsuite/libgomp.c++/simd-8.C
branches/gomp-4_0-branch/libgomp/testsuite/libgomp.c/simd-6.c
Modified:
branches/gomp-4_0-branch/gcc/c-family/ChangeLog.gomp
branches/gomp-4_0-branch/gcc/c-family/c-omp.c
branches/gomp-4_0-branch/libgomp/ChangeLog.gomp


[Bug libgomp/58482] gomp4: user defined reduction produce wrong result

2013-09-21 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58482

--- Comment #5 from Jakub Jelinek  ---
There is no problem with having as many reductions as you need, if they are
separate variables; the only case that will prevent vectorization is if you
have a struct/class with multiple data members as reduction.

I'll see if I can reproduce the ICE on Monday.


[Bug libgomp/58482] gomp4: user defined reduction produce wrong result

2013-09-21 Thread vincenzo.innocente at cern dot ch
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58482

--- Comment #4 from vincenzo Innocente  ---
I see.
I have several use cases in which the reduction requires the access to two
variables
(minloc for instance: the minimum and its location)


btw tried
omp parallel for simd
got ICE

 c++ -std=c++11 ured_omp4.cpp -O -ftree-vectorizer-verbose=1 -fopenmp
ured_omp4.cpp: In function ‘TwoInt sum(const int*, int)’:
ured_omp4.cpp:38:63: internal compiler error: Segmentation fault: 11
 #pragma omp parallel for simd  aligned(q: 16) reduction(foo:s)
   ^

ured_omp4.cpp:38:63: internal compiler error: Abort trap: 6
c++: internal compiler error: Abort trap: 6 (program cc1plus)
Abort trap: 6

[Bug libgomp/58482] gomp4: user defined reduction produce wrong result

2013-09-20 Thread vincenzo.innocente at cern dot ch
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58482

--- Comment #2 from vincenzo Innocente  ---
Thanks Jakub for the clear answer.
The reduction operator should be strictly commutative!
and I now understand the meaning of 
omp declare reduction (I hope)
so I modified it as you can see below
results ok.
but omp does not vectorize while good old -Ofast does…
shall I prepare a simple example with corresponding assembler and submit?

c++ -std=c++11  ured_omp4.cpp -O2 -ftree-vectorizer-verbose=4 -fopenmp

Analyzing loop at ured_omp4.cpp:39
ured_omp4.cpp:39:22: note: = analyze_loop_nest =
ured_omp4.cpp:39:22: note: === vect_analyze_loop_form ===
ured_omp4.cpp:39:22: note: === get_loop_niters ===
ured_omp4.cpp:39:22: note: ==> get_loop_niters:(unsigned int) NN_15(D)
ured_omp4.cpp:39:22: note: Symbolic number of iterations is (unsigned int)
NN_15(D)
ured_omp4.cpp:39:22: note: === vect_analyze_data_refs ===
ured_omp4.cpp:39:22: note: got vectype for stmt: _20 = *_19;
const vector(4) float
ured_omp4.cpp:39:22: note: got vectype for stmt: _30 = MEM[(struct TwoInt
*)&D.63343][_16].a;
vector(4) float
ured_omp4.cpp:39:22: note: not vectorized: not suitable for gather load _30 =
MEM[(struct TwoInt *)&D.63343][_16].a;

ured_omp4.cpp:39:22: note: bad data references.
ured_omp4.cpp:35:8: note: vectorized 0 loops in function.
ured_omp4.cpp:35:8: note: loop turned into non-loop; it never loops
ured_omp4.cpp:35:8: note: loop turned into non-loop; it never loops

I spare you the 4 pages of dump in case of -Ofast


pb-d-128-141-131-94:vectorize innocent$  c++ -std=c++11  ured_omp4.cpp -O2
-ftree-vectorizer-verbose=1 -fopenmp; ./a.out
ured_omp4.cpp:35:8: note: loop turned into non-loop; it never loops
ured_omp4.cpp:35:8: note: loop turned into non-loop; it never loops
523776,-523776
523776,-523776
pb-d-128-141-131-94:vectorize innocent$  c++ -std=c++11  ured_omp4.cpp -Ofast
-ftree-vectorizer-verbose=1; ./a.out
ured_omp4.cpp:38:3: note: loop vectorized
ured_omp4.cpp:38:3: note: loop peeled for vectorization to enhance alignment
ured_omp4.cpp:38:3: note: loop with 3 iterations completely unrolled
ured_omp4.cpp:35:8: note: loop with 6 iterations completely unrolled
ured_omp4.cpp:46:13: note: loop vectorized
ured_omp4.cpp:45:8: note: loop with 2 iterations completely unrolled
ured_omp4.cpp:38:3: note: loop vectorized
ured_omp4.cpp:63:3: note: loop vectorized
ured_omp4.cpp:63:3: note: loop with 4 iterations completely unrolled
523776,-523776
523776,-523776



cat ured_omp4.cpp 
#include
#define Type float

struct TwoInt {
  Type a=0;
  Type b=0;

#pragma omp declare simd
  TwoInt & operator+=(TwoInt rh) {
a+=rh.a;
b-=rh.b;
  }

#pragma omp declare simd
  TwoInt & add(TwoInt rh) {
a+=rh.a;
b-=rh.b;
return *this;
  }


#pragma omp declare simd
  TwoInt & reduce(TwoInt rh) {
a+=rh.a;
b+=rh.b;
return *this;
  }


};

#pragma omp declare reduction (foo:struct TwoInt: omp_out.reduce(omp_in))


TwoInt sum(Type const * q, int NN) {
  TwoInt s;
#pragma omp simd  aligned(q: 16) reduction(foo:s)
  for (int i=0;i
int main() {
  constexpr int NN=1024;
  Type q[NN];
  Type a=0;
  for (auto & e: q) e=a++;

  auto s = sum(q,NN);
  std::cout << s.a << "," << s.b << std::endl;
  s = sum4(q,NN);
  std::cout << s.a << "," << s.b << std::endl;

  return 0;
}

[Bug libgomp/58482] gomp4: user defined reduction produce wrong result

2013-09-20 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58482

--- Comment #3 from Jakub Jelinek  ---
It is well known that we don't vectorize this, right now we only handle
accesses to the SIMD lane privatized vars that cover the whole size of those
vars, while in your testcase the access size is half the size of the var.
The reason why -Ofast vectorizes it is likely that SRA manages to scalarize
those, but scalarizer can't do anything easily with the magic arrays indexed by
SIMD_LANE internal fn that we use to represent the privatized variables (and, I
couldn't find a better representation yet for those).
So, either the SRA pass would need to handle those (split the single array
with { float; float } pairs into two arrays with just float type), or the
vectorizer would need to do some ugly magic for selected cases (e.g. handle the
pair case by having two vector vars and use always either odd or even entries
from them).  I'm hoping that most people will actually use scalars or single
data member classes for reductions etc.


[Bug libgomp/58482] gomp4: user defined reduction produce wrong result

2013-09-20 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58482

Jakub Jelinek  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |INVALID

--- Comment #1 from Jakub Jelinek  ---
Yeah, the testcase is just wrong.  The way OpenMP reductions work is that
in the loop the reduction variable(s) are privatized, i.e. either each thread
(for e.g. #pragma omp for) or each SIMD lane (for #pragma omp simd) gets its
own copy of that var, either default constructed (C++98 wording; for UDRs with
missing initializer clause), or initialized otherwise (see the standard for
details), in the loop everything is done on the privatized variable and finally
at the end the reduction operation is performed on the original variable,
calling the combiner expression with omp_out being the original variable and
omp_in being the privatized var, for each of the privatized variables.
But your testcase obviously can't work properly in that case, because your
reduction operation can't cope properly with being performed more than once.
If you subtract the array element values from the counter, then if they are all
positive, you'll get negative number, while if you subtract them from the
privatized var, all those privatized vars will have negative counter, but
then you subtract those negative numbers from the original var counter and get
positive number.  There is a reason why reduction (-:int_var) is actually
implemented as addition rather than subtraction...
In your testcase, you'd want a different operation to be used in the reduction
combiner, one that would add things rather than subtract.