Re: [Beignet] GROMACS on beignet

Pan, Xiuli Mon, 04 Apr 2016 20:10:12 -0700

Hi Szilárd,


1.      It seems size is fine.

2.      The test seems ok. Could you provide your environment for the HSW 
machine with problem, like kernel versions, drm versions etc.

3.      For clEnqueueWriteBuffer, clEnqueuereadBuffer they are actually I/O 
control to get map so total blocking on CPU, but for clEnqueueNDRangeKernel you 
can enqueue it by pass an event to it then it will not have a flush.


I will try GROMACS on our platforms to see if bugs can be reproduced and we can 
try to root case the bugs.

Thanks
Xiuli

From: Szilárd Páll [mailto:sin.pec...@gmail.com]
Sent: Friday, April 1, 2016 6:46 PM
To: Pan, Xiuli <xiuli....@intel.com>
Cc: beignet@lists.freedesktop.org
Subject: Re: [Beignet] GROMACS on beignet

Hi Xiuli,

Apologies if I were not clear enough with my questions.

1. By saying that there is only 64K local memory, I assume you meant to hint 
that this is a scarce resource. We use only ~2KB local memory per work group 
required mostly for prefetching and reduction across work-group. By disabling 
pre-fetching we could get this down to a minimum of about 800 bytes. However, 
do you expect that to help in any way? Is the hardware capable of keeping in 
flight >30-32 waves of 64 threads?

Also, I think I'm lacking some detailed knowledge as I do not see how is this 
related to the drm_intel_gem_bo_context_exec() issue.

2. As mentioned above, I have local work size = 64 and rely on splitting the 
work over the global grid (so small workloads will have 100s, large ones 10000s 
larger global work size).

3. Do you mean that I should *not* expect concurrency between CPU and GPU to be 
possible with beignet and clEnqueueWriteBuffer, clEnqueuereadBuffer, and 
clEnqueueNDRangeKernel will exhibit blocking behavior?


I have not had time to file a bugzilla yet with reproduction details, sorry 
about that.
The GROMACS source you'll need is in the master branch, plus the 
https://gerrit.gromacs.org/#/c/5752/2 change under review fixes some execution 
width assumptions. Other than that you'll need a small patch to enable Intel 
iGPUs (e.g found here https://bugs.freedesktop.org/show_bug.cgi?id=94265 which 
is BTW another bug on IVB). You'll also run into the include path issue I 
mentioned before for which you'll need to activate the Apple workaround, here's 
a patch:

diff --git a/src/gromacs/gpu_utils/gpu_utils_ocl.cpp 
b/src/gromacs/gpu_utils/gpu_utils_ocl.cpp
index 2084d8c..8928582 100644
--- a/src/gromacs/gpu_utils/gpu_utils_ocl.cpp
+++ b/src/gromacs/gpu_utils/gpu_utils_ocl.cpp
@@ -131,6 +131,8 @@ static int is_gmx_supported_gpu_id(struct gmx_device_info_t 
*ocl_gpu_device)
             return egpuCompatible;
         case OCL_VENDOR_AMD:
             return runningOnCompatibleOSForAmd() ? egpuCompatible : 
egpuIncompatible;
+        case OCL_VENDOR_INTEL:
+            return egpuCompatible;
         default:
             return egpuIncompatible;
     }
diff --git a/src/gromacs/gpu_utils/ocl_compiler.cpp 
b/src/gromacs/gpu_utils/ocl_compiler.cpp
index 6a4772a..9aa3c1e 100644
--- a/src/gromacs/gpu_utils/ocl_compiler.cpp
+++ b/src/gromacs/gpu_utils/ocl_compiler.cpp
@@ -747,7 +747,8 @@ ocl_get_build_options_string(cl_context           context,
          * OpenCL implementations are happy with. Since the standard still says
          * it should be quoted, we handle Apple as a special case.
          */
-#ifdef __APPLE__
+//#ifdef __APPLE__
+#if 1
         std::string unescaped_ocl_root_path = get_ocl_root_path();
         std::string ocl_root_path;


Additionally, here's an input file you'll need to be able to start the program:
https://www.dropbox.com/s/hm5t90iwo3xw5ws/water-48k-frozen.tpr?dl=0
which you can do with the following command:
/PATH/gmx mdrun -s water-48k-frozen

Let me know if something is unclear.

Thanks for the help!

Cheers,

--
Szilárd

On Thu, Mar 31, 2016 at 5:34 AM, Pan, Xiuli 
<xiuli....@intel.com<mailto:xiuli....@intel.com>> wrote:
Hi Szilárd,

Since you have some questions and I could not reproduce them here I just make 
some response that I think may related to this problem:

1.      We only have 64K Share local memory for all of the work groups

2.      The drm_intel_gem_bo_context_exec() failed have a lot of reasons, could 
give us the detail about your test about the execution wide?

3.      As far as I know most  enqueue in beignet default to be blocking(some 
related to GPU is not blocking) , you can see that api clFlush  is actually an 
empty function.

Also I am trying to reproduce your bug here and I am setting up GROMACS. Is 
there anything I should know to run it with beignet?

Thanks
Xiuli

From: Beignet 
[mailto:beignet-boun...@lists.freedesktop.org<mailto:beignet-boun...@lists.freedesktop.org>]
 On Behalf Of Szilárd Páll
Sent: Thursday, March 31, 2016 3:14 AM
To: beignet@lists.freedesktop.org<mailto:beignet@lists.freedesktop.org>
Subject: Re: [Beignet] GROMACS on beignet

Hello again,

I have been trying to verify whether there may be assumptions >=32-wide 
execution hiding in the kernels (in particular in code that's using local 
memory for prefetching or reduction) and tried dropping in mem fences to test a 
few things, but at several points I managed to trigger the aforementioned error:
drm_intel_gem_bo_context_exec() failed: Input/output error

Is this a known issues? There have been reports of it, but perhaps it is just 
the manifestation of multiple possible issues?

Secondly, I do not see the reason why I get blocking behavior of all enqueue 
operations (and I don't get this on NVIDIA or AMD). Are there any peculiarities 
I should be aware of?

Cheers,

--
Szilárd

On Mon, Mar 28, 2016 at 1:49 AM, Szilárd Páll 
<sin.pec...@gmail.com<mailto:sin.pec...@gmail.com>> wrote:
Hi Xiuli,

Thanks for the quick reply!

On Fri, Mar 25, 2016 at 4:06 AM, Pan, Xiuli 
<xiuli....@intel.com<mailto:xiuli....@intel.com>> wrote:
Hi Szilárd,

What do you mean about quoted includes?

I mean -I"/path/to/headers" does not work, but  -I/path/to/headers does.

If you mean the include in kernels, I think we may have some problem with that. 
The *.cl we used for clang actually was a copied tmp version stored not in 
where is used to be. So I think if you just put what need to be included in the 
old place, clang could not find it. You could try a workaround to pass “-I 
where/your/header/is”  as a build option to clBuildProgram.

Then if you have some double types used on Haswell it may have some problem. 
The hardware for HSW does not support double very well as we have refined our 
double support to hardware then, so HSW may have some issues with double type. 
If it is not the problem with double float, you can send your kernel as an 
attachment or report a bug on our Bugzilla(https://bugs.freedesktop.org) and we 
will tried to fix it.

No double precision in the kernels.

For now I'll post here, I feel like a bug report may be an overkill - 
especially as I can't provide a full repro case that does not involve building 
the entire application.

I've attached a minimum set of source files that's needed to compile. We have 
pretty heavy preprocessor use that generates kernels for the different inputs / 
outputs / computation combinations, so one particular flavor that's known to 
produce incorrect results is generated compiling 
nbnxn_ocl_kernels.cl<http://nbnxn_ocl_kernels.cl> with the following flags:

-D_WARPLESS_SOURCE_ -DGMX_OCL_FASTGEN -DEL_RF -DEELNAME=_ElecRF -DLJ_COMB_GEOM 
-DVDWNAME=_VdwLJCombGeom -DCENTRAL=22 -DNBNXN_GPU_NCLUSTER_PER_SUPERCLUSTER=8 
-DNBNXN_GPU_CLUSTER_SIZE=8 -DNBNXN_GPU_JGROUP_SIZE=4 
-DNBNXN_AVOID_SING_R2_INC=1.0e-12f


Additionally I had a closer look and so far I have observed three issues 
(additional to the minor include issue mentione before):

1. If I do a manual prefetch into local memory followed by a mem fence 
(seenbnxn_ocl_kernel_nowarp.clh line 339), I get the following error:
drm_intel_gem_bo_context_exec() failed: Input/output error
The next kernel call then fails with CL_OUT_OF_RESOURCES.
Without the manual prefetch it works better, but...

2. The results produced by the kernel are still somewhat off. It could be that 
I missed a subtle detail and the kernels still do not conform to the hardware's 
execution model. I'm very familar with Intel's hardware and these kernels were 
originally designed for 32/64 wide execution.

3. All task enqueue calls seem to be blocking.


Thanks & Cheers,
--
Szilárd


Thanks
Xiuli

From: Beignet 
[mailto:beignet-boun...@lists.freedesktop.org<mailto:beignet-boun...@lists.freedesktop.org>]
 On Behalf Of Szilárd Páll
Sent: Friday, March 25, 2016 7:16 AM
To: beignet@lists.freedesktop.org<mailto:beignet@lists.freedesktop.org>
Subject: [Beignet] GROMACS on beignet

Hi,

I am a developer of the GROMACS (www.gromacs.org<http://www.gromacs.org>) 
molecular dynamics simulation package. We have OpenCL offload for some of the 
compute-intensive kernels which that works very well on AMD. I wanted to assess 
how feasible is to use an Intel iGPU in GROMACS and after jumping through some 
hoops I got a 4.2 kernel and beignet master installed.

Then I ran into the first minor issue: it seems that beignet does not accept 
quoted includes although AFAIK the double-quoted include paths should be 
accepted, but that did not work. No big deal, it doesn't work with Apple's 
OpenCL either, but I thought I'd ask.

However, the bigger issue is that running on Haswell (HD 4600, I think) the 
kernel produces results that are very off (while the very same source gives 
correct results on other platforms). I've not much time to dig deeper, but I 
thought I'd drop a mail maybe somebody is interested in helping out with tips 
or even tracking down where the issue is.

Suggestions would be welcome!

Cheers,
--
Szilárd

_______________________________________________
Beignet mailing list
Beignet@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/beignet

Re: [Beignet] GROMACS on beignet

Reply via email to