https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859
--- Comment #26 from Thorsten Kurth ---
Hello Jakub,
thanks for the clarification. So a team maps to a CTA, which is roughly
equivalent to a block in CUDA terminology, correct? And it is good to have some
categorical equivalency between GPU and
--- Comment #25 from Jakub Jelinek ---
In the GCC implementation of offloading to PTX, all HW threads in a warp (i.e.
32 of them) are a single OpenMP thread, and one needs to use a simd region
(effectively SIMT) to get useful work done by all
--- Comment #24 from Thorsten Kurth ---
Hello Jakub,
I know that the section you mean is racy and that the number of threads it
reports is not right, but I put it in in order to see whether I get the correct
numbers on a CPU (I am not working on a GPU
Jakub Jelinek changed:
What|Removed |Added
Status|WAITING |ASSIGNED
--- Comment #23 from Jakub
--- Comment #22 from Thorsten Kurth ---
Hello Jakub,
that is stuff for Intel vTune. I have commented it out and added the NUM_TEAMS
defines in the GNUmakefile. Please pull the latest changes.
Best and thanks
Thorsten
--- Comment #21 from Jakub Jelinek ---
It doesn't compile for me.
cmake -DENABLE_MPI=0 -DENABLE_OpenMP=1 ..
make -j16
I don't have ittnotify.h, I've tried to comment that out as well as the _itt*
calls, but then run into:
--- Comment #20 from Thorsten Kurth ---
To compile the code, edit the GNUmakefile to suit your needs (feel free to ask
any questions). To run it, execute the generated binary, called something like
main3d.XXX...
and the XXX tell
--- Comment #19 from Thorsten Kurth ---
Thank you very much. I am sorry that I do not have a simpler test case. The
kernel which is executed is in the same directory as ABecLaplacian and is called
MG_3D_cpp.cpp.
We have seen similar problems with
--- Comment #18 from Jakub Jelinek ---
Ok, I'll grab your git code and will have a look tomorrow what's going on.
--- Comment #17 from Thorsten Kurth ---
The result, though, is correct: I verified that both codes generate the correct
output.
--- Comment #16 from Thorsten Kurth ---
FYI, the code is:
https://github.com/zronaghi/BoxLib.git
in branch
cpp_kernels_openmp4dot5
and then in Src/LinearSolvers/C_CellMG
the file ABecLaplacian.cpp. For example, lines 542 and 543 can be
--- Comment #15 from Thorsten Kurth ---
The code I care about definitely has optimization enabled. For the fortran
stuff it does (for example):
ftn -g -O3 -ffree-line-length-none -fno-range-check -fno-second-underscore
-Jo/3d.gnu.MPI.OMP.EXE
--- Comment #14 from Jakub Jelinek ---
(In reply to Thorsten Kurth from comment #13)
> the compiler options are just -fopenmp. I am sure it does not have to do
> anything with vectorization as I compare the code runtime with and without
> the
--- Comment #13 from Thorsten Kurth ---
Hello Jakub,
the compiler options are just -fopenmp. I am sure it does not have to do
anything with vectorization as I compare the code runtime with and without the
target directives and thus
--- Comment #12 from Jakub Jelinek ---
(In reply to Thorsten Kurth from comment #11)
> yes, you are right. I thought that map(tofrom:) is the default mapping
> but I might be wrong. In any case, teams is always 1. So this code is
Variables
--- Comment #11 from Thorsten Kurth ---
Hello Jakub,
yes, you are right. I thought that map(tofrom:) is the default mapping but
I might be wrong. In any case, teams is always 1. So this code is basically
just data streaming so there is no
--- Comment #10 from Jakub Jelinek ---
(In reply to Thorsten Kurth from comment #7)
> Hello Jakub,
>
> thanks for your comment, but I think the parallel for is not racy. Every
> thread is working on a block of i-indices, so that is fine. The
--- Comment #9 from Thorsten Kurth ---
Sorry, in the second run I set the number of threads to 12. I think the code
works as expected.
--- Comment #8 from Thorsten Kurth ---
Here is the output of the get_num_threads section:
[tkurth@cori02 omp_3_vs_45_test]$ export OMP_NUM_THREADS=32
[tkurth@cori02 omp_3_vs_45_test]$ ./nested_test_omp_4dot5.x
We got 1 teams and 32 threads.
--- Comment #7 from Thorsten Kurth ---
Hello Jakub,
thanks for your comment, but I think the parallel for is not racy. Every thread
is working on a block of i-indices, so that is fine. The dotprod kernel is actually
a kernel from the OpenMP
--- Comment #6 from Jakub Jelinek ---
movq/pushq etc. aren't that expensive; if it affects performance, it must be
something in the inner loops. A compiler switch that ignores omp target, teams,
and distribute would basically create a new OpenMP
--- Comment #5 from Thorsten Kurth ---
To clarify the problem:
I think that the additional movq, pushq, and other instructions generated when
using the target directive can cause a big performance hit. I understand
that these instructions
--- Comment #4 from Thorsten Kurth ---
Created attachment 41415
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41415&action=edit
Testcase
This is the test case. The files ending on .as contain the assembly code with
and without target region
--- Comment #3 from Thorsten Kurth ---
Created attachment 41414
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41414&action=edit
OpenMP 4.5 Testcase
This is the source code
Richard Biener changed:
What|Removed |Added
Keywords||missed-optimization, openmp
--- Comment #2 from Jakub Jelinek ---
Also, even for host fallback there is a separate set of ICVs and many other
properties; the target region can't just be ignored, for many reasons, even if
there is no data sharing.
Of course, if you provide
Jakub Jelinek changed:
What|Removed |Added
CC||jakub at gcc dot gnu.org
--- Comment #1