We have a persistent problem when multithreading with PyCUDA. I have a thread pool with one thread per GPU; each thread creates its own context for its assigned device ID and then waits for jobs on a shared Queue object. The main thread processes requests and adds CUDA-related jobs to the Queue. This works well enough and utilizes all available GPUs, but when we issue many relatively fast CUDA calls we frequently hit a locking issue in which one computation hangs indefinitely.

When the contexts are created with the pycuda.driver.ctx_flags.SCHED_BLOCKING_SYNC flag and I attach to a hung process, I find it waiting on a semaphore in cuCtxSynchronize in libcuda.so. When the contexts are created without SCHED_BLOCKING_SYNC, it is still stuck in cuCtxSynchronize, but in a spin loop waiting for results.
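
For reference, the setup described above looks roughly like the sketch below. It is a simplified illustration, not our actual code: the names gpu_worker, run_pool and make_pycuda_context are hypothetical, jobs are assumed to be plain callables that take the device ID, and a None entry on the Queue is assumed as the shutdown signal. The context factory is passed in as a parameter so the pool logic can be exercised without a GPU. The sketch only shows the structure; it is not expected to reproduce the hang by itself.

```python
import queue
import threading


def make_pycuda_context(device_id):
    """Create a PyCUDA context for one device (needs pycuda and a GPU)."""
    import pycuda.driver as cuda  # imported lazily so the module loads without CUDA

    cuda.init()
    # make_context() also makes the new context current on the calling thread.
    return cuda.Device(device_id).make_context(cuda.ctx_flags.SCHED_BLOCKING_SYNC)


def gpu_worker(device_id, jobs, results, make_context=make_pycuda_context):
    """Own one context for the lifetime of the thread and run jobs from the queue."""
    ctx = make_context(device_id)
    try:
        while True:
            job = jobs.get()
            if job is None:  # assumed shutdown sentinel
                break
            try:
                results.put((device_id, job(device_id)))
            except Exception as exc:  # report the failure instead of killing the worker
                results.put((device_id, exc))
    finally:
        ctx.pop()  # release the context on the thread that created it


def run_pool(device_ids, job_list, make_context=make_pycuda_context):
    """Start one worker thread per device, feed it jobs, and collect the results."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=gpu_worker, args=(d, jobs, results, make_context))
        for d in device_ids
    ]
    for t in threads:
        t.start()
    for job in job_list:
        jobs.put(job)
    for _ in threads:  # one sentinel per worker
        jobs.put(None)
    for t in threads:
        t.join()
    return [results.get() for _ in job_list]
```

With real hardware this would be called as run_pool([0, 1], job_list), where each job does its PyCUDA work and synchronizes inside the worker thread that owns the context.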
I have an alternative version with the same code that bypasses PyCUDA and instead calls directly, via ctypes, into an nvcc-compiled shared library. That library uses cudaSetDevice and cudaDeviceSynchronize rather than the cuCtx* functions, and it does not show these locking issues.

Has anyone run into this kind of issue before? Also, does PyCUDA support (or plan to support in a future release) using the cudaDevice* functions instead of explicit context management?

David