There is indeed also a problem with MPI + CUDA.

This problem, however, is deeper, since it happens with MVAPICH2 1.9, Open MPI 1.6.5/1.8.1/1.8.2rc4, and CUDA 5.5.22/6.0.37. From my tests, everything works fine with MPI + CUDA on a single node, but as soon as I go to MPI + CUDA across nodes, I get a segfault. I suspect something either with the OFED stack (we use the Linux OFED/RDMA packages, not the Mellanox stack) or with the NVIDIA drivers (we are a couple of minor versions behind). My next step is to try upgrading those.
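
For concreteness, the code path being exercised is essentially the sketch below. This is a simplified stand-in, not the exact test I ran, and it assumes an Open MPI build configured with --with-cuda so that device pointers can be passed directly to MPI calls; the buffer size and names are arbitrary.

/* cuda_mpi_ping.c -- simplified stand-in for the real test.
 * Rank 0 sends a GPU (device) buffer to rank 1, relying on a
 * CUDA-aware MPI build to handle the device pointer. In my tests
 * this kind of transfer works intra-node but segfaults across nodes. */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                      /* arbitrary message size */
    int *dbuf = NULL;
    cudaMalloc((void **)&dbuf, n * sizeof(int));
    cudaMemset(dbuf, 0, n * sizeof(int));

    if (rank == 0) {
        /* Device pointer passed straight to MPI_Send. */
        MPI_Send(dbuf, n, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(dbuf, n, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d: received device buffer\n", rank);
    }

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}

I build it with mpicc, linking the CUDA runtime (something like mpicc cuda_mpi_ping.c -o cuda_mpi_ping -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcudart, with $CUDA_HOME standing in for wherever the CUDA module lives), and run it with mpirun -np 2, once with both ranks on one node and once with a hostfile spanning two nodes.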

I do not think this problem is related to not being able to run ring_c on the head node, however, because ring_c runs fine with 1.6.5 and does not involve CUDA.

Maxime

On 2014-08-16 06:22, Jeff Squyres (jsquyres) wrote:
Just out of curiosity, I saw that one of the segv stack traces involved the CUDA stack.

Can you try a build without CUDA and see if that resolves the problem?



On Aug 15, 2014, at 6:47 PM, Maxime Boissonneault 
<maxime.boissonnea...@calculquebec.ca> wrote:

Hi Jeff,

On 2014-08-15 17:50, Jeff Squyres (jsquyres) wrote:
On Aug 15, 2014, at 5:39 PM, Maxime Boissonneault 
<maxime.boissonnea...@calculquebec.ca> wrote:

Correct.

Could it be because Torque (pbs_mom) is not running on the head node and mpiexec attempts to contact it?
Not for Open MPI's mpiexec, no.

Open MPI's mpiexec (mpirun -- they're the same to us) will only try to use TM 
stuff (i.e., Torque stuff) if it sees the environment variable markers 
indicating that it's inside a Torque job.  If not, it just uses rsh/ssh (or 
localhost launch in your case, since you didn't specify any hosts).
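
Conceptually, the check amounts to something like the sketch below. This is a simplification, not the real plm/tm component code, but Torque does export markers such as PBS_ENVIRONMENT and PBS_JOBID inside a job and leaves them unset outside of one.

/* Simplified sketch only -- not Open MPI's actual launcher code. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *pbs_env   = getenv("PBS_ENVIRONMENT");  /* e.g. "PBS_BATCH" */
    const char *pbs_jobid = getenv("PBS_JOBID");

    if (pbs_env != NULL && pbs_jobid != NULL)
        printf("Torque job %s detected: TM launch would be considered\n", pbs_jobid);
    else
        printf("no Torque markers: rsh/ssh or localhost launch\n");

    return 0;
}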

If you are unable to run even "mpirun -np 4 hostname" (i.e., the non-MPI 
"hostname" command from Linux), then something is seriously borked with your Open MPI 
installation.
mpirun -np 4 hostname works fine:
[mboisson@helios-login1 ~]$ which mpirun
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/bin/mpirun
[mboisson@helios-login1 examples]$ mpirun -np 4 hostname; echo $?
helios-login1
helios-login1
helios-login1
helios-login1
0

Try running with:

     mpirun -np 4 --mca plm_base_verbose 10 hostname

This should show the steps OMPI is trying to take to launch the 4 copies of 
"hostname" and potentially give some insight into where it's hanging.

Also, just to make sure, you have ensured that you're compiling everything with 
a single compiler toolchain, and the support libraries from that specific 
compiler toolchain are available on any server on which you're running (to 
include the head node and compute nodes), right?
Yes. Everything has been compiled with GCC 4.8 (I also tried GCC 4.6 with the same results). Almost all of the software (that is, compilers, toolchain, etc.) is installed on Lustre from source and is the same on both the login (head) node and the compute nodes.

The few differences between the head node and the compute nodes:
1) The compute nodes run from RAMFS - the login node is installed on disk.
2) The compute nodes and the login node have different hardware configurations (the compute nodes have GPUs, the head node does not).
3) The login node has MORE CentOS 6 packages than the compute nodes (such as the -devel packages, some fonts/X11 libraries, etc.), but all of the packages that are on the compute nodes are also on the login node.

And you've verified that PATH and LD_LIBRARY_PATH are pointing to the right places -- i.e., to the Open MPI installation that you expect them to point to. E.g., if you "ldd ring_c", it shows the libmpi.so that you expect. And "which mpiexec" shows the mpirun that you expect. Etc.
As per the content of "env.out" in the archive, yes. They point to the OMPI 1.8.2rc4 installation directories, on Lustre, and are shared between the head node and the compute nodes.


Maxime



--
---------------------------------
Maxime Boissonneault
Computational analyst - Calcul Québec, Université Laval
Ph.D. in Physics
