Solibakke:
I have not reproduced the issue, but I think I have an idea of what is 
happening.  What type of interconnect are you running over in this cluster?
Note that in the Open MPI 1.7.3 series, CUDA-aware support is only available 
within a node and between nodes using the verbs interface over InfiniBand.
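
One quick way to check which interconnect (BTL) components your install actually 
built, and whether any CUDA-related components show up at all, is something like:

  # rough sanity check of the installed build, not a definitive diagnostic
  ompi_info | grep -i btl
  ompi_info | grep -i cuda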

Rolf

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Thursday, November 07, 2013 10:00 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] MPIRUN error message after ./configure and sudo make 
all install...

FWIW: I can never recall seeing someone use --enable-mca-dso...though I don't 
know if that is the source of the problem.
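
If that option is what is tripping things up, one thing that might be worth 
trying (a guess on my part, not verified) is a build with --disable-dlopen, 
which links the components into the Open MPI libraries instead of opening them 
at run time:

  # untested guess: build the components into the libraries instead of as DSOs
  ./configure --with-cuda --with-hwloc --disable-dlopen --enable-shared --enable-vt \
      --with-threads=posix --enable-mpi-thread-multiple --prefix=/usr/local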

On Nov 7, 2013, at 6:00 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:


Hello Solibakke:
Let me try to reproduce this with your configure options.

Rolf

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Solibakke Per 
Bjarte
Sent: Thursday, November 07, 2013 8:40 AM
To: de...@open-mpi.org
Subject: [OMPI devel] MPIRUN error message after ./configure and sudo make all 
install...

Hello
System: CUDA 5.5 and Open MPI 1.7.3, with a Quadro K5000 (8 x 192 = 1536 CUDA cores).

./configure --with-cuda --with-hwloc --enable-dlopen --enable-mca-dso 
--enable-shared --enable-vt --with-threads=posix --enable-mpi-thread-multiple 
--prefix=/usr/local

The installation itself works fine: ./configure, make, and make install complete without errors.

Error message when running mpirun -hostfile.... ./snp_mpi:

/home/solibakk/econometrics/snp_applik/npmarkets/elreprorun/snp_mpi: symbol 
lookup error: /usr/local/lib/openmpi/mca_pml_ob1.so: undefined symbol: 
progress_one_cuda_htod_event
[the same "undefined symbol: progress_one_cuda_htod_event" message is repeated 24 times in total]
--------------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 18385 on
node PBS-GPU1 exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

You can avoid this message by specifying -quiet on the mpirun command line.


Do you have any suggestions for configure options or mpirun options?

The option --enable-mca-no-build=pml-bfo removes the message; however, I then 
cannot reach any of my GPUs, only the CPUs.
I assume --enable-mca-dso must be in effect in the configure step.
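
(If it helps, I can also check what the installed build itself reports, for example:

  # rough check of the installed build; I am not sure how conclusive this is
  ompi_info | grep -i cuda

to see whether any CUDA support shows up at all.)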

Any suggestions for getting CUDA (GPU) support working for massively parallel runs?

Regards
PBSolibakke