[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-26 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #26 from Thorsten Kurth --- Hello Jakub, thanks for the clarification. So a team maps to a CTA which is somewhat equivalent to a block in CUDA language, correct? And it is good to have some categorical equivalency between GPU and

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-26 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #25 from Jakub Jelinek --- In the GCC implementation of offloading to PTX, all HW threads in a warp (i.e. 32 of them) are a single OpenMP thread, and one needs to use a simd region (effectively SIMT) to get useful work done by all
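
For readers following the warp/simd point in comment #25, here is a minimal sketch of a loop that adds a simd level so that, on a PTX target, the lanes of a warp have work to share. The helper name `scale` and its parameters are hypothetical and do not come from the attached testcase. Built without -fopenmp, the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the bug report): scales an array in place.
// "teams distribute" splits the iterations across teams, "parallel for"
// across the OpenMP threads of each team, and "simd" across the lanes
// that together form one OpenMP thread.
void scale(double* data, std::size_t n, double factor) {
    assert(data != nullptr || n == 0);
    if (n == 0) {
        return;  // nothing to map or compute
    }
    #pragma omp target teams distribute parallel for simd map(tofrom: data[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= factor;
    }
}
```

Calling `scale(v, 3, 2.0)` on an array holding 1, 2, 3 leaves 2, 4, 6 in it, whether the region runs on a device, in host fallback, or without OpenMP at all.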

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-26 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #24 from Thorsten Kurth --- Hello Jakub, I know that the section you mean is racy and that getting the wrong number of threads is not right, but I put it in to see whether I get the correct numbers on a CPU (I am not working on a GPU

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-26 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 Jakub Jelinek changed:
What   | Removed | Added
Status | WAITING | ASSIGNED
--- Comment #23 from Jakub

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-25 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #22 from Thorsten Kurth --- Hello Jakub, that is stuff for Intel VTune. I have commented it out and added the NUM_TEAMS defines in the GNUmakefile. Please pull the latest changes. Best and thanks, Thorsten

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-25 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #21 from Jakub Jelinek --- It doesn't compile for me.
cmake -DENABLE_MPI=0 -DENABLE_OpenMP=1 ..
make -j16
I don't have ittnotify.h, I've tried to comment that out as well as the _itt* calls, but then run into:

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-25 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #20 from Thorsten Kurth --- To compile the code, edit the GNUmakefile to suit your needs (feel free to ask any questions). To run it, execute the generated binary, called something like main3d.XXX... and the XXX tell

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #19 from Thorsten Kurth --- Thank you very much. I am sorry that I do not have a simpler test case. The kernel which is executed is in the same directory as ABecLaplacian and called MG_3D_cpp.cpp. We have seen similar problems with

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #18 from Jakub Jelinek --- Ok, I'll grab your git code and will have a look tomorrow what's going on.

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #17 from Thorsten Kurth --- The result is correct, though; I verified that both codes generate the correct output.

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #16 from Thorsten Kurth --- FYI, the code is: https://github.com/zronaghi/BoxLib.git in branch cpp_kernels_openmp4dot5 and then in Src/LinearSolvers/C_CellMG the file ABecLaplacian.cpp. For example, lines 542 and 543 can be

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #15 from Thorsten Kurth --- The code I care about definitely has optimization enabled. For the Fortran stuff it does (for example): ftn -g -O3 -ffree-line-length-none -fno-range-check -fno-second-underscore -Jo/3d.gnu.MPI.OMP.EXE

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #14 from Jakub Jelinek --- (In reply to Thorsten Kurth from comment #13) > the compiler options are just -fopenmp. I am sure it does not have > anything to do with vectorization as I compare the code runtime with and without > the

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #13 from Thorsten Kurth --- Hello Jakub, the compiler options are just -fopenmp. I am sure it does not have anything to do with vectorization as I compare the code runtime with and without the target directives and thus

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #12 from Jakub Jelinek --- (In reply to Thorsten Kurth from comment #11) > yes, you are right. I thought that map(tofrom:) is the default mapping > but I might be wrong. In any case, teams is always 1. So this code is Variables

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #11 from Thorsten Kurth --- Hello Jakub, yes, you are right. I thought that map(tofrom:) is the default mapping but I might be wrong. In any case, teams is always 1. So this code is basically just data streaming so there is no

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #10 from Jakub Jelinek --- (In reply to Thorsten Kurth from comment #7) > Hello Jakub, > > thanks for your comment but I think the parallel for is not racy. Every > thread is working on a block of i-indices so that is fine. The

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #9 from Thorsten Kurth --- Sorry, in the second run I set the number of threads to 12. I think the code works as expected.

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #8 from Thorsten Kurth --- Here is the output of the get_num_threads section:
[tkurth@cori02 omp_3_vs_45_test]$ export OMP_NUM_THREADS=32
[tkurth@cori02 omp_3_vs_45_test]$ ./nested_test_omp_4dot5.x
We got 1 teams and 32 threads.
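
A team/thread count like the one printed above can be collected so that each result has exactly one writer. The following is only a sketch under that assumption and is not the code from the attached testcase: `LaunchShape` and `query_launch_shape` are hypothetical names, and the `_OPENMP` guard makes the function report 1 team and 1 thread when built without -fopenmp.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical result type: how many teams and threads the runtime chose.
struct LaunchShape {
    int teams;
    int threads;
};

// Hypothetical helper: queries the counts inside a target teams region.
// Only team 0 records the team count, and only thread 0 of that team
// records the thread count, so no two threads write the same variable.
LaunchShape query_launch_shape() {
    int n_teams = 1;
    int n_threads = 1;
#ifdef _OPENMP
    #pragma omp target teams map(tofrom: n_teams, n_threads)
    {
        if (omp_get_team_num() == 0) {
            n_teams = omp_get_num_teams();
            #pragma omp parallel
            {
                if (omp_get_thread_num() == 0) {
                    n_threads = omp_get_num_threads();
                }
            }
        }
    }
#endif
    assert(n_teams >= 1 && n_threads >= 1);
    return {n_teams, n_threads};
}
```

The explicit `map(tofrom: ...)` is there because both counters are scalars that must be copied back to the host after the region.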

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #7 from Thorsten Kurth --- Hello Jakub, thanks for your comment but I think the parallel for is not racy. Every thread is working on a block of i-indices so that is fine. The dotprod kernel is actually a kernel from the OpenMP
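
For reference, a dot product of the kind mentioned here is usually written with a reduction clause so that the accumulation into the shared scalar is not a data race. This is a minimal sketch with a hypothetical signature; it is neither the attached testcase nor the kernel from the OpenMP examples document. In OpenMP 4.5 a scalar referenced in a target region without an explicit map clause is firstprivate by default, which is why `sum` is mapped `tofrom` explicitly.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical dot-product helper. The reduction clause gives every
// thread a private partial sum and combines them at the end, so the
// iterations may run in any order across teams and threads.
double dotprod(const double* a, const double* b, std::size_t n) {
    assert((a != nullptr && b != nullptr) || n == 0);
    double sum = 0.0;
    if (n == 0) {
        return sum;  // nothing to map or compute
    }
    #pragma omp target teams distribute parallel for reduction(+:sum) \
        map(to: a[0:n], b[0:n]) map(tofrom: sum)
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

For the inputs 1, 2, 3 and 4, 5, 6 the function returns 32; the partial products are small integers, so the result does not depend on the order in which they are combined.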

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #6 from Jakub Jelinek --- movq/pushq etc. aren't that expensive, if it affects performance it must be something in the inner loops. A compiler switch that ignores omp target, teams and distribute would basically create a new OpenMP

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #5 from Thorsten Kurth --- To clarify the problem: I think that the additional movq, pushq and other instructions generated when using the target directive can cause a big performance hit. I understand that these instructions

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #4 from Thorsten Kurth --- Created attachment 41415 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41415&action=edit Testcase This is the test case. The files ending in .as contain the assembly code with and without target region

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #3 from Thorsten Kurth --- Created attachment 41414 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41414&action=edit OpenMP 4.5 Testcase This is the source code

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 Richard Biener changed:
What     | Removed | Added
Keywords |         | missed-optimization, openmp

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #2 from Jakub Jelinek --- Also, even for host fallback there is a separate set of ICVs and many other properties, the target region can't be just ignored for many reasons even if there is no data sharing. Of course, if you provide

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 Jakub Jelinek changed:
What | Removed | Added
CC   |         | jakub at gcc dot gnu.org
--- Comment #1