Hi Jody

Jody Klymak wrote:

On Aug 11, 2009, at 17:35, Gus Correa wrote:

You can check this, say, by logging in to each node and doing /usr/local/openmpi/bin/ompi_info and comparing the output.


Yep, they are all the same 1.3.3, SVN r21666, July 14th 2009.


Did you wipe out the old directories before reinstalling?
I have had bad surprises from just running make again.
It is safer to clean up, then run configure, make, and make install
all over again.
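
Something along these lines should do it (just a sketch; I am guessing
that your prefix is /usr/local/openmpi, adjust as needed):

   # sketch only; assuming the prefix is /usr/local/openmpi
   sudo rm -rf /usr/local/openmpi           # remove the old installation
   cd openmpi-1.3.3                         # the unpacked source tree
   make distclean                           # throw away the old build configuration
   ./configure --prefix=/usr/local/openmpi 2>&1 | tee configure.log
   make 2>&1 | tee make.log
   sudo make install 2>&1 | tee install.log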

I prefer to install on an NFS-mounted directory,
and set the user environment (PATH, MANPATH, LD_LIBRARY_PATH, etc.)
to search that directory before the standard ones (such as /usr/local).
This ensures consistency across all nodes with a single installation;
there is no need to install on every node.
For clusters with a modest number of nodes this scales fine.
On different clusters I have used names such as /home/software,
/share/apps (Rocks cluster), etc,
for the main NFS mounted directory that
holds MPI and other applications,
and lives on the head node or on a storage node.
A lot of people do this.
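
As a rough sketch (the /home/software path below is only an example):

   # build once on the head node, installing into the NFS-exported area
   ./configure --prefix=/home/software/openmpi-1.3.3
   make
   make install

   # then, in each user's .bashrc (or in a file under /etc/profile.d):
   export PATH=/home/software/openmpi-1.3.3/bin:$PATH
   export MANPATH=/home/software/openmpi-1.3.3/share/man:$MANPATH
   export LD_LIBRARY_PATH=/home/software/openmpi-1.3.3/lib:$LD_LIBRARY_PATH
   # (on Mac OS X the dynamic linker uses DYLD_LIBRARY_PATH instead)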

Another thing to look at is what is in your .bashrc/.tcshrc file:
make sure it doesn't contain anything that points to a different OpenMPI,
modifies the PATH by mistake, etc.
I don't know about Mac OS X, but on Linux the files in /etc/profile.d
often also set the user environment, and if they're wrong,
they can produce funny results.
Do you have any MPI-related files there?
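
A quick way to check what each node actually picks up is to run
something like this on every node:

   which mpirun ompi_info     # should point at the intended installation
   ompi_info | head -5        # shows the version that the PATH resolves to
   echo $PATH
   echo $LD_LIBRARY_PATH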

What about passwords? ssh from server to node is passwordless, but do the nodes need to be passwordless as well? i.e. is xserve01 trying to ssh to xserve02?


I would say so.
At least that is what we have on three Linux clusters:
passwordless ssh across any pair of nodes.
However, I would guess that if this were not working,
other MPI versions wouldn't work either.

In any case:

Have you tried to ssh from node to node on all possible pairs?

Do you have the public RSA key for all nodes in /etc/ssh/ssh_known_hosts2, on all nodes?
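
A crude way to test all pairs, assuming your nodes are called
xserve01 through xserve04 (adjust the names), would be something like:

   for src in xserve01 xserve02 xserve03 xserve04; do
     for dst in xserve01 xserve02 xserve03 xserve04; do
       ssh $src ssh -o BatchMode=yes $dst hostname \
         || echo "FAILED: $src -> $dst"
     done
   done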

Anyway, not sure what else I can do to debug this. I'm considering rolling back to 1.1.5 and living without a queue manager...


How could you roll back to 1.1.5,
now that you overwrote the directories?

Hang in there!
The problem can be sorted out.

Launching jobs with Torque is much better than
using bare-bones mpirun.
You can queue up a sequence of MITgcm runs,
say one year each, with each job pending on the successful
completion of the previous one, and just watch the results come out.
This and other features of resource managers
are very convenient, and you don't want to miss them.
If there is more than one user, a resource manager is a must.
And you don't want to fall behind on the OpenMPI versions
and improvements either.
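
For example, with Torque you can chain jobs through the depend
attribute; a rough sketch (the script names are just placeholders):

   JOB1=`qsub run_year1.sub`
   JOB2=`qsub -W depend=afterok:$JOB1 run_year2.sub`
   qsub -W depend=afterok:$JOB2 run_year3.sub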

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------



Thanks,  Jody

--
Jody Klymak
http://web.uvic.ca/~jklymak/




_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
