I'm trying to run jobs on several GPUs at the same time using multiple threads, each with its own context. Sometimes this works flawlessly, but ~75% of the time I get a cuModuleLoadDataEx error telling me the context has been destroyed. What's frustrating is that nothing changes between failed and successful runs of the code. From what I can tell it's down to luck whether or not the error comes up:
```
~/anaconda3/lib/python3.6/site-packages/pycuda/compiler.py in __init__(self, source, nvcc, options, keep, no_extern_c, arch, code, cache_dir, include_dirs)
    292
    293         from pycuda.driver import module_from_buffer
--> 294         self.module = module_from_buffer(cubin)
    295
    296         self._bind_module()

LogicError: cuModuleLoadDataEx failed: context is destroyed
```

I start by making the contexts:

```python
from pycuda import driver as cuda

cuda.init()
contexts = []
for i in range(cuda.Device.count()):
    c = cuda.Device(i).make_context()
    c.pop()
    contexts.append(c)
```

...and setting up a function to use each context:

```python
import numpy as np
from pycuda import gpuarray

def do_work(ctx):
    with Acquire(ctx):
        a = gpuarray.to_gpu(np.random.rand(100, 400, 400))
        b = gpuarray.to_gpu(np.random.rand(100, 400, 400))
        for _ in range(10):
            c = (a + b) / 2
        out = c.get()
        return out
```

where `Acquire` is a context manager that handles pushing and popping:

```python
class Acquire:
    def __init__(self, context):
        self.ctx = context

    def __enter__(self):
        self.ctx.push()
        return self.ctx

    def __exit__(self, type, value, traceback):
        self.ctx.pop()
```

Then I run the code in parallel using a pool of threaded workers via joblib:

```python
from joblib import Parallel, delayed

pool = Parallel(n_jobs=len(contexts), verbose=8, prefer='threads')
with pool:
    # Pass 1
    sum(pool(delayed(do_work)(ctx) for ctx in contexts))
    # Pass 2
    sum(pool(delayed(do_work)(ctx) for ctx in contexts))
```

Note that I do several "passes" of work with the same thread pool (I'll need to do 50 or so in my real application). It seems like the crash always happens somewhere in the second pass, or not at all. Any ideas about how to keep my contexts from getting destroyed?

*System info*
- Ubuntu 16.04 (Amazon Deep Learning AMI)
- CUDA driver version 396.44
- 4x V100 GPUs
- Python 3.6
- pycuda version 2018.1.1
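One restructuring I've been considering, in case the problem is pushing/popping the same context from whichever pool thread happens to pick up the job: pin each context to a dedicated worker thread that is the only thread ever allowed to touch it, and feed that thread jobs through a queue. Here is a GPU-free sketch of that ownership pattern; `FakeContext`, `gpu_worker`, and this version of `do_work` are purely illustrative stand-ins, not PyCUDA API:

```python
import threading
import queue

class FakeContext:
    """Stand-in for a pycuda context -- illustrative only."""
    def __init__(self, device_id):
        self.device_id = device_id
        self.owner = None  # thread that is allowed to use this context

def gpu_worker(ctx, jobs, results):
    # The context is only ever touched from this thread, so there is
    # no cross-thread push/pop ordering to get wrong.
    ctx.owner = threading.current_thread()
    while True:
        job = jobs.get()
        if job is None:  # sentinel: shut this worker down
            break
        fn, args = job
        results.put(fn(ctx, *args))

def do_work(ctx, x):
    # Placeholder for the real GPU computation; checks thread affinity.
    assert ctx.owner is threading.current_thread()
    return x * 2

contexts = [FakeContext(i) for i in range(4)]
jobs = [queue.Queue() for _ in contexts]
results = queue.Queue()
threads = [threading.Thread(target=gpu_worker, args=(c, q, results))
           for c, q in zip(contexts, jobs)]
for t in threads:
    t.start()

# Two "passes" of work, mirroring the joblib version above
for _ in range(2):
    for i, q in enumerate(jobs):
        q.put((do_work, (i,)))

for q in jobs:
    q.put(None)
for t in threads:
    t.join()

totals = sum(results.get() for _ in range(8))
print(totals)  # 0+2+4+6 twice -> 24
```

In the real version, `gpu_worker` would call `make_context()` itself on startup and `ctx.pop()`/`ctx.detach()` on shutdown, so each context's whole lifetime stays on one thread. Would that be expected to avoid the "context is destroyed" failure, or is there a way to make the shared-pool approach safe?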
_______________________________________________ PyCUDA mailing list PyCUDA@tiker.net https://lists.tiker.net/listinfo/pycuda