I have hit a wall moving some existing pycuda code to a
distributed-memory cluster and am hoping someone cleverer than I can
suggest a workaround.

In a nutshell, I have found that under certain circumstances
get_nvcc_version doesn't behave the way the pycuda.compiler module
expects with the MPI flavours I use: it returns None on some ranks.
That makes the behind-the-scenes JIT compilation inside pyCUDA fail
if the MPI processes don't all share a common /tmp filesystem. The
simplest repro case is this snippet:


import sys
from pycuda import compiler

from mpi4py import MPI
rank = MPI.COMM_WORLD.Get_rank()

sys.stdout.write("[%d] %s\n" %(rank, compiler.get_nvcc_version("nvcc")))

which produces this when run on a single node:

$ mpiexec -n 4  python ./pycudachk.py
[2] None
[3] None
[1] None
[0] nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2010 NVIDIA Corporation
Built on Wed_Nov__3_16:16:57_PDT_2010
Cuda compilation tools, release 3.2, V0.2.1221

I presume this is because the pytools stdout/stderr capture routines
don't get anything on MPI processes where stdout/stderr are being
managed by the MPI runtime. If all the MPI processes share a common
/tmp filesystem, it doesn't seem to matter because the pycuda compiler
cache is visible to each process and it all just works. But if they
don't (and the cluster I am now using has local /tmp on every node),
the compiler module will need to JIT compile stuff locally on each
node, and it winds up failing because get_nvcc_version returns None,
which in turn makes the md5 hashing calls fail. Something like this:

[avidday@n0005 fim]$ mpiexec -n 2 -hosts n0008,n0005 ./fimMPItest.py
[1] mpi dims = (2,1), coords = (1,0)
[0] mpi dims = (2,1), coords = (0,0)
[1]{n0005} CUDA driver GeForce GTX 275, 1791Mb ram, using fim_sm13.cubin
[0]{n0008} CUDA driver GeForce GTX 275, 1791Mb ram, using fim_sm13.cubin
Traceback (most recent call last):
  File "./fimMPItest.py", line 20, in ?
    phi,its = dotest()
  File "./fimMPItest.py", line 16, in dotest
    return fim.fimMPIScatter2(gs,a,1.,h,maxiters=1000,tol=1e-6,CC=fimcuda.fimCC)
  File "/scratch/fim/fim.py", line 464, in fimMPIScatter2
    its = mpiCC.Iterate(f, h, tol=tol, maxiters=maxiters)
  File "/scratch/fim/fimcuda.py", line 187, in Iterate
  File "/usr/lib64/python2.4/site-packages/pycuda/gpuarray.py", line
336, in fill
  File "<string>", line 1, in <lambda>
  File "/usr/lib64/python2.4/site-packages/pycuda/tools.py", line 485,
in context_dependent_memoize
  File "/usr/lib64/python2.4/site-packages/pycuda/elementwise.py",
line 384, in get_fill_kernel
  File "/usr/lib64/python2.4/site-packages/pycuda/elementwise.py",
line 98, in get_elwise_kernel
  File "/usr/lib64/python2.4/site-packages/pycuda/elementwise.py",
line 85, in get_elwise_kernel_and_types
  File "/usr/lib64/python2.4/site-packages/pycuda/elementwise.py",
line 74, in get_elwise_module
  File "/usr/lib64/python2.4/site-packages/pycuda/compiler.py", line
238, in __init__
  File "/usr/lib64/python2.4/site-packages/pycuda/compiler.py", line
228, in compile
  File "/usr/lib64/python2.4/site-packages/pycuda/compiler.py", line
47, in compile_plain
TypeError: update() argument 1 must be string or read-only buffer, not None

which I interpret as meaning that an internal compile to satisfy a
gpuarray fill() call is failing. Running on only a single node works
fine. If I am reading things correctly, it looks like this checksum
code in compile_plain fails because get_nvcc_version returns None:

     41     if cache_dir:
     42         checksum = _new_md5()
     44         checksum.update(source)
     45         for option in options:
     46             checksum.update(option)
     47         checksum.update(get_nvcc_version(nvcc))
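
The failure mode in that listing can be reproduced without pyCUDA or
MPI. The sketch below is only an illustration: cache_checksum is a
hypothetical stand-in for the checksum lines above, not the actual
pyCUDA function, and it is written for a current Python with hashlib
rather than the Python 2.4 in the traceback, so the exact wording of
the TypeError will differ.

```python
import hashlib


def cache_checksum(source, options, nvcc_version):
    """Hypothetical stand-in for the checksum lines in compile_plain."""
    checksum = hashlib.md5()
    checksum.update(source)
    for option in options:
        checksum.update(option)
    # update() only accepts bytes-like data, so None raises TypeError here.
    checksum.update(nvcc_version)
    return checksum.hexdigest()


try:
    cache_checksum(b"__global__ void k() {}", [b"-arch=sm_13"], None)
except TypeError as exc:
    error = exc
else:
    error = None

print("hashing with a None version raised:", type(error).__name__)
```

Passing any bytes value in place of None lets the same call complete,
which is why the version string alone is enough to break the hashing.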

The question then becomes how to fix it. It is important to note that
nvcc is available to all processes and works, so I assume that the
fork itself is fine (my reading of the code says that an OSError
exception would be raised otherwise). So I am guessing the problem is
only that get_nvcc_version can return None even when the fork worked.
Would it be too hackish to have get_nvcc_version return something
generic like "nvcc unknown version", or anything else that would still
hash OK, in the case where the fork worked but the captured output
from the fork is not available?
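
To make the idea concrete, here is a minimal sketch of the fallback I
have in mind. It is not a patch against pyCUDA: version_or_fallback
and broken_get_version are hypothetical names made up for this
example, and the real get_nvcc_version would be passed in as
get_version.

```python
import hashlib

# Placeholder used when the fork worked but no output was captured.
FALLBACK_VERSION = "nvcc unknown version"


def version_or_fallback(get_version, nvcc="nvcc"):
    """Return the captured version string, or a fixed placeholder if None."""
    version = get_version(nvcc)
    if version is None:
        return FALLBACK_VERSION
    return version


def broken_get_version(nvcc):
    """Hypothetical stand-in for a version query that captured nothing."""
    return None


version = version_or_fallback(broken_get_version)
checksum = hashlib.md5()
checksum.update(version.encode("utf-8"))
digest = checksum.hexdigest()
print(version, digest)
```

The obvious cost is that the cache key no longer distinguishes between
compiler versions on the affected processes, so a cached binary built
by one nvcc release could be picked up after an upgrade.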

Suggestions or corrections to my analysis are greatly welcomed (I know
almost nothing about python, pyCUDA or MPI so be gentle if I am
completely off target here.....)

