Hi,

When running Open MPI, my system freezes while initializing MPI (function MPI_Init). This happens only when I try to run the process on multiple nodes in my cluster. Running multiple instances of the test code locally (e.g. ./mpirun -np 2 greetings) is successful.

- rsh runs well and is configured for full access (e.g. rsh "192.168.1.103 date" is successful, and so are "rsh AFRLMPPBM2 date" and "rsh AFRLMPPBM2.MPPdomain.com"). Security is not an issue on this system.

- uname -n and hostname return a valid hostname

- The test code (attached to this email) is run (and fails) as: ./mpirun --hostfile /root/hostfile -np 2 greetings. The hostfile lists the local node (first entry: AFRLMPPBM1) and the remote node (second entry: AFRLMPPBM2). This file is also attached to this email.

- The environment variables seem to be properly set (see the attached env.log file). Local MPI programs (e.g. ./mpirun -np 2 greetings) run well.

- .profile has the path information for both the executables and the libraries

- orted runs on the remote node; however, it does not print anything to the console. The only output on the remote node is:

pam_rhosts_auth[235]: user root has a `+' user entry
pam_rhosts_auth[235]: allowed to r...@afrlmppbm1.mppdomain.com as root
PAM_unix[235]: (rsh) session opened for user root by (uid=0)
in.rshd[236]: r...@afrlmppbm1.mppdomain.com as root: cmd='( ! [ -e ./.profile ] || . ./.profile; orted --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename AFRLMPPBM2.MPPdomain.com --universe root@AFRLMPPBM1:default-universe-304 --nsreplica "0.0.0;tcp://192.168.1.102:32824" --gprreplica "0.0.0;tcp://192.168.1.102:32824" --mpi-call-yield 0 )'
PAM_unix[235]: (rsh) session closed for user root

Then the remote node returns to the command prompt. However, orted stays in the background. The local process is frozen and only prints "Calling init", which is printed just before the call to MPI_Init (see greetings.c).
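The orted command line logged above carries a contact address in --nsreplica and --gprreplica ("0.0.0;tcp://192.168.1.102:32824"). Assuming orted has to open a TCP connection back to that address (an assumption about the launch sequence, not something the log confirms), reachability from the remote node could be probed roughly as sketched below. The function names are made up for illustration, and the probe assumes a bash built with /dev/tcp support and the coreutils timeout command.

```shell
#!/bin/bash
# Hypothetical helpers (names are illustrative, not part of Open MPI).

# Extract the host from a replica URI such as "0.0.0;tcp://192.168.1.102:32824".
replica_host() {
    local uri="${1#*tcp://}"   # drop everything up to and including "tcp://"
    echo "${uri%%:*}"          # keep what precedes the first ":"
}

# Extract the port from the same kind of URI.
replica_port() {
    echo "${1##*:}"            # keep what follows the last ":"
}

# Return 0 if a TCP connection to host:port opens within 5 seconds.
# Assumes bash's /dev/tcp redirection and timeout(1) are available.
can_connect() {
    timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}
```

The port changes on every launch, so the check only makes sense on AFRLMPPBM2 while the frozen mpirun is still alive, using the URI from that run: uri='0.0.0;tcp://192.168.1.102:32824'; can_connect "$(replica_host "$uri")" "$(replica_port "$uri")" && echo reachable || echo unreachable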

I believe MPI_COMM_WORLD cannot be correctly initialized. However, I can't see which part of my configuration is wrong.
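The uname -n and hostname checks only show that a name is set, not what it resolves to. One further configuration detail that could be inspected is whether each node name maps to a 192.168.1.x address rather than to loopback. Whether this has any bearing on the freeze is not established; the sketch below only makes the mapping visible. The function names are hypothetical, and getent is assumed to be available, as on most glibc-based Linux systems.

```shell
# Hypothetical helpers (names are illustrative).

# Return 0 if the given address is a loopback address (127.x.x.x or ::1).
is_loopback_addr() {
    case "$1" in
        127.*|::1) return 0 ;;
        *)         return 1 ;;
    esac
}

# Print how a node name resolves and flag loopback results.
# Assumes getent(1) is installed.
check_node_name() {
    local addr
    # Take the address column of the first line returned by the resolver.
    addr="$(getent hosts "$1" | awk 'NR==1 {print $1}')"
    if [ -z "$addr" ]; then
        echo "$1: does not resolve"
    elif is_loopback_addr "$addr"; then
        echo "$1: resolves to loopback ($addr)"
    else
        echo "$1: resolves to $addr"
    fi
}
```

Running check_node_name AFRLMPPBM1 and check_node_name AFRLMPPBM2 on both nodes would show whether the two machines agree on each other's addresses.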

Any help is greatly appreciated.

Thank you,

Jorge

Attachment: logs.tar.gz
Description: GNU Zip compressed data
