[PyOpenCL] Hints for further debugging?

Michael Wibral Tue, 17 Dec 2019 02:43:52 -0800

Hi,

I am a user/developer of the IDTxl toolbox (https://github.com/pwollstadt/IDTxl/).

We have the following issue and I am looking for pointers on how to further debug my problem.
We have an OpenCL kernel that computes neighbour distances between all points in a set and also looks for neighbours in a certain range.
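
To make the computation concrete, here is a minimal pure-Python sketch of a brute-force neighbour search. It is not the IDTxl kernel: the function name, the use of the maximum norm as the distance, and the strict "less than radius" test are all assumptions made for illustration. A reference like this could be used to check GPU results on a small point set.

```python
def brute_force_neighbours(points, radius):
    """Return (nearest, counts) for a list of equal-length coordinate tuples.

    nearest[i] is the distance from point i to its closest other point.
    counts[i] is the number of other points closer than `radius`.
    The maximum norm is an assumption, not taken from the original kernel.
    """
    nearest = []
    counts = []
    for i, p in enumerate(points):
        best = float("inf")
        in_range = 0
        for j, q in enumerate(points):
            if i == j:
                continue  # a point is not its own neighbour
            dist = max(abs(a - b) for a, b in zip(p, q))
            if dist < best:
                best = dist
            if dist < radius:
                in_range += 1
        nearest.append(best)
        counts.append(in_range)
    return nearest, counts
```

This runs in O(n^2) time, so it is only practical for small test sets, far below the 3670018 points used in the replication command.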

This code used to run on our older AMD and NVIDIA cards (Hawaii and Lexa models, GTX 1080), but we encountered errors on newer models.

The situation now is:
The code runs on CPU via POCL.
The code runs on Hawaii and Lexa XT chips using the AMD fglrx and ROCm drivers.
The code fails on AMD's Vega chips using the ROCm driver. More specifically, the kernel starts and runs, and then (judging by the elapsed time, measured with the Linux time command) it fails either in the very last computation or when trying to return to the host. The error I get on the Vega GPUs is:
(AMD) Memory access fault by GPU node-1 (Agent handle: 0x562731f06a00) on address 0xa06200000. Reason: Page not present or supervisor privilege.

On NVIDIA GPUs we don't use subbuffer alignment (which seems to be connected to the problem), as it is not required there. If we do use it, we get this error before the computation starts:
(NVIDIA) clEnqueueReadBuffer failed: OUT_OF_RESOURCES

From the pattern of errors I would tentatively conclude that:
(a) The OpenCL kernel itself is OK as it runs without problems in POCL.
(b) The error is related to the use of subbuffers or to the padding we use for subbuffer alignment, but it does not seem to matter for all architectures (which is weird).
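
For readers unfamiliar with the padding in (b): OpenCL requires the origin of a subbuffer to be a multiple of the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN, which is reported in bits. The sketch below shows one way such padding can be computed. The function names and the layout (equal-sized chunks, each starting on an aligned offset) are hypothetical and not necessarily what IDTxl does.

```python
def padded_chunk_bytes(chunk_bytes, align_bits):
    """Round chunk_bytes up to the next multiple of the alignment.

    align_bits is the value of CL_DEVICE_MEM_BASE_ADDR_ALIGN, which the
    OpenCL specification reports in bits, so it is converted to bytes first.
    """
    align_bytes = align_bits // 8
    return -(-chunk_bytes // align_bytes) * align_bytes  # ceiling division


def subbuffer_regions(chunk_bytes, n_chunks, align_bits):
    """Return ([(origin, size), ...], total_bytes) for n_chunks subbuffers.

    Each origin is aligned. total_bytes is the size the parent buffer
    needs so that every padded chunk fits inside it (hypothetical layout).
    """
    stride = padded_chunk_bytes(chunk_bytes, align_bits)
    regions = [(i * stride, chunk_bytes) for i in range(n_chunks)]
    return regions, stride * n_chunks
```

With this layout, one thing that may be worth checking is that the parent buffer is allocated with the padded total rather than the unpadded one, and that the kernel never indexes into the padding as if it held data. Whether that is related to the fault above is not established.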

I am wondering whether this is an OpenCL 1.2 versus 2.0 issue (where 2.0 fails for us)?
Can I enforce a certain OpenCL version to be used by PyOpenCL?
Are there known issues or tricks when using OpenCL 1.2 and 2.0 devices in the same system?
Any other ideas on how to get more hints?
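
As a starting point for the version question, the following sketch lists what each device in a system reports. It assumes PyOpenCL is installed. The helper names are hypothetical; the attributes used (version, opencl_c_version, mem_base_addr_align) are standard device info queries exposed by PyOpenCL.

```python
import re


def parse_cl_version(version_string):
    """Extract (major, minor) from strings such as 'OpenCL 1.2 ...'
    or 'OpenCL C 2.0 ...', the formats the OpenCL specification prescribes."""
    match = re.match(r"OpenCL (?:C )?(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError("unexpected version string: %r" % version_string)
    return int(match.group(1)), int(match.group(2))


def list_device_versions():
    """Return one tuple per device: (platform name, device name,
    device version, OpenCL C version, base address alignment in bits)."""
    import pyopencl as cl  # imported here so the parser works without it

    rows = []
    for platform in cl.get_platforms():
        for device in platform.get_devices():
            rows.append((
                platform.name,
                device.name,
                parse_cl_version(device.version),
                parse_cl_version(device.opencl_c_version),
                device.mem_base_addr_align,
            ))
    return rows
```

Printing the rows from list_device_versions() for the working and the failing machines would show whether the failures line up with the reported OpenCL version or with a different alignment value.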

If someone has too much time on their hands, here is how to replicate the problem:

Our systems run Ubuntu 16.04 and 18.04.

1. clone https://github.com/pwollstadt/IDTxl
2. install the code as per the installation instructions from the wiki on github
3. switch to the branch fix_gpu_bug
4. cd to ..../IDTxl/dev/search_GPU/deliverable2_1
5. run $> time python test_opencl_search.py --gpuid 0 -p 3670018 -d 2 -c 2 --padding
(On AMD cards that exhibit the bug it will take some time until the crash)

Thanks for your help,

Michael

