I have hit a wall moving some existing pycuda code to a distributed-memory cluster and am hoping someone cleverer than I can suggest a workaround.
In a nutshell, I have found that get_nvcc_version doesn't behave the way the pyCUDA compiler module expects under the MPI flavours I use, which makes behind-the-scenes JIT compilation inside pyCUDA fail whenever the MPI processes don't share a common /tmp filesystem. The simplest repro case is this snippet:

----------
import sys

from pycuda import compiler
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
sys.stdout.write("[%d] %s\n" % (rank, compiler.get_nvcc_version("nvcc")))
----------

which produces this when run on a single node:

$ mpiexec -n 4 python ./pycudachk.py
[2] None
[3] None
[1] None
[0] nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2010 NVIDIA Corporation
Built on Wed_Nov__3_16:16:57_PDT_2010
Cuda compilation tools, release 3.2, V0.2.1221

I presume this is because the pytools stdout/stderr capture routines don't capture anything on MPI processes whose stdout/stderr are being managed by the MPI runtime.

If all the MPI processes share a common /tmp filesystem, this doesn't seem to matter: the pycuda compiler cache is visible to every process and it all just works. But if they don't (and the cluster I am now using has a local /tmp on every node), the compiler module has to JIT compile locally on each node, and that fails because get_nvcc_version returns None, which in turn makes the md5 hashing calls fail. Something like this:

[avidday@n0005 fim]$ mpiexec -n 2 -hosts n0008,n0005 ./fimMPItest.py
[1] mpi dims = (2,1), coords = (1,0)
[0] mpi dims = (2,1), coords = (0,0)
[1]{n0005} CUDA driver GeForce GTX 275, 1791Mb ram, using fim_sm13.cubin
[0]{n0008} CUDA driver GeForce GTX 275, 1791Mb ram, using fim_sm13.cubin
Traceback (most recent call last):
  File "./fimMPItest.py", line 20, in ?
    phi,its = dotest()
  File "./fimMPItest.py", line 16, in dotest
    return fim.fimMPIScatter2(gs,a,1.,h,maxiters=1000,tol=1e-6,CC=fimcuda.fimCC)
  File "/scratch/fim/fim.py", line 464, in fimMPIScatter2
    its = mpiCC.Iterate(f, h, tol=tol, maxiters=maxiters)
  File "/scratch/fim/fimcuda.py", line 187, in Iterate
    self.active_.fill(np.int32(0))
  File "/usr/lib64/python2.4/site-packages/pycuda/gpuarray.py", line 336, in fill
  File "<string>", line 1, in <lambda>
  File "/usr/lib64/python2.4/site-packages/pycuda/tools.py", line 485, in context_dependent_memoize
  File "/usr/lib64/python2.4/site-packages/pycuda/elementwise.py", line 384, in get_fill_kernel
  File "/usr/lib64/python2.4/site-packages/pycuda/elementwise.py", line 98, in get_elwise_kernel
  File "/usr/lib64/python2.4/site-packages/pycuda/elementwise.py", line 85, in get_elwise_kernel_and_types
  File "/usr/lib64/python2.4/site-packages/pycuda/elementwise.py", line 74, in get_elwise_module
  File "/usr/lib64/python2.4/site-packages/pycuda/compiler.py", line 238, in __init__
  File "/usr/lib64/python2.4/site-packages/pycuda/compiler.py", line 228, in compile
  File "/usr/lib64/python2.4/site-packages/pycuda/compiler.py", line 47, in compile_plain
TypeError: update() argument 1 must be string or read-only buffer, not None

which I interpret as meaning that a behind-the-scenes compile to satisfy a gpuarray fill() call is failing. Running on only a single node works fine.

If I am reading things correctly, it is this checksum code in compile_plain (around line 41 of compiler.py) that fails because of what get_nvcc_version returns:

----------
if cache_dir:
    checksum = _new_md5()

    checksum.update(source)
    for option in options:
        checksum.update(option)
    checksum.update(get_nvcc_version(nvcc))
----------
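As a sanity check on that reading, feeding None to the same md5 object appears to reproduce the exact error from the traceback (an illustration only; _new_md5 is the helper called in the excerpt above, and a plain hashlib md5 object behaves the same way):

----------
# Minimal illustration (not pycuda code): hashing a string is fine, but
# hashing the None that get_nvcc_version returns on the silent ranks raises
# the same TypeError that ends the traceback above.
from pycuda.compiler import _new_md5   # same helper as in the excerpt above

checksum = _new_md5()
checksum.update("some kernel source")  # ok: a string hashes fine
checksum.update(None)                  # TypeError: update() argument 1 must be
                                       #   string or read-only buffer, not None
----------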
The question then becomes how to fix it?

It is important to note that nvcc is available to every process and works, so I assume the fork itself is fine (my reading of the code is that an OSError exception would be raised otherwise). So I am guessing the problem is only that get_nvcc_version can return None even when the fork worked. Would it be too hackish to have get_nvcc_version return something generic like "nvcc unknown version" in the case where the fork worked but the captured output is not available, so the checksum code still gets a string it can hash?

Suggestions or corrections to my analysis are greatly welcomed (I know almost nothing about python, pyCUDA or MPI, so be gentle if I am completely off target here.....)
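PS: in case it helps clarify what I mean, here is the kind of thing I could do as a workaround in my own code rather than a change to pycuda (a sketch only; the wrapper names are mine, and I am assuming compile_plain picks up get_nvcc_version through the pycuda.compiler module namespace):

----------
# Hypothetical user-side workaround, not pycuda code: wrap the existing
# get_nvcc_version so ranks whose captured output was swallowed by the MPI
# runtime still hand compile_plain a hashable string.
from pycuda import compiler

_real_get_nvcc_version = compiler.get_nvcc_version

def _safe_get_nvcc_version(nvcc):
    version = _real_get_nvcc_version(nvcc)
    # None means the fork worked but nothing was captured; substitute a
    # generic placeholder so the md5 update does not choke on it.
    if version is None:
        return "nvcc unknown version"
    return version

# Install the wrapper before any behind-the-scenes JIT compilation happens.
compiler.get_nvcc_version = _safe_get_nvcc_version
----------

Doing the same fallback inside get_nvcc_version itself would obviously be cleaner, which is really what I am asking about.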