Re: [OMPI users] Open-MPI-1.3.2 compatibility with old torque?

Ralph Castain Wed, 22 Jul 2009 20:00:29 -0400

mpirun --display-allocation --display-map

Run a batch job that just prints out $PBS_NODEFILE. I'll bet that it isn't what we are expecting, and that the problem comes from it.

In a Torque environment, we read that file to get the list of nodes and #slots/node that are allocated to your job. We then filter that through any hostfile you provide. So all the nodes have to be in the $PBS_NODEFILE, which has to be in the expected format.
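Something along these lines, dropped into the batch script, would show the file and a per-node slot count. It is only a rough sketch: it assumes the usual Torque layout of one hostname per line, repeated once per slot, and summarize_nodefile is a made-up helper name, not part of Torque or Open MPI.

```sh
#!/bin/sh
# summarize_nodefile is a hypothetical helper name, not a Torque or Open MPI command.
# Assumes a nodefile with one hostname per line, repeated once per slot.
summarize_nodefile() {
    # Print "<hostname> <slots>" for each distinct host, sorted by hostname.
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Inside a batch job, Torque sets PBS_NODEFILE; outside one this block is skipped.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "PBS_NODEFILE=$PBS_NODEFILE"
    cat "$PBS_NODEFILE"
    summarize_nodefile "$PBS_NODEFILE"
fi
```

If the summary does not list every node you requested, or the file looks nothing like a plain list of hostnames, that would point at the nodefile rather than the network.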

I'm a little suspicious, though, because of your reported error. It sounds like we are indeed trying to launch a daemon on a known node. I can only surmise a couple of possible reasons for the failure:

1. this is a node that is not allocated for your use. Was node0006 in your allocation? If not, then the launch would fail. This would indicate we are not parsing the nodefile correctly.

2. if the node is in your allocation, then I would wonder if you have a TCP connection between that node and the one where mpirun exists. Is there a firewall in the way? Or something that would preclude a connection? Frankly, I doubt this possibility because it works when run manually.
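To check the first possibility from inside the job, a sketch like the following would do. It assumes the nodefile lists the node under exactly the name shown in the error message (a fully qualified name in the file would not match), and host_in_nodefile is a made-up helper name.

```sh
#!/bin/sh
# host_in_nodefile is a hypothetical helper name, not a Torque or Open MPI command.
# Succeeds (exit status 0) only if the host appears as a whole line in the nodefile.
host_in_nodefile() {
    # -F: fixed string, -x: match the entire line, -q: quiet, status only
    grep -Fxq -e "$1" "$2"
}

# Inside a batch job, Torque sets PBS_NODEFILE; outside one this block is skipped.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    if host_in_nodefile node0006 "$PBS_NODEFILE"; then
        echo "node0006 is in the allocation"
    else
        echo "node0006 is NOT in the allocation"
    fi
fi
```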


My money is on option #1. :-)

If it is #1 and you send me a copy of a sample $PBS_NODEFILE on your system, I can create a way to parse it so we can provide support for that older version.


Ralph


On Jul 21, 2009, at 4:44 PM, Song, Kai Song wrote:

Hi Ralph,

Thanks a lot for the fast response.

Could you give me more instructions on which command I should put "--display-allocation" and "--display-map" with? mpirun? ./configure?...

Also, we have tested that in our PBS script, if we put node=1, the helloworld works. But when I put node=2 or more, it will hang until timeout. And the error message will be something like:

node0006 - daemon did not report back when launched

However, if we don't go through the scheduler and run MPI manually, everything works fine too:

/home/software/ompi/1.3.2-pgi/bin/mpirun -machinefile ./nodes -np 16 ./a.out

What do you think the problem would be? It's not a network issue, because manually running MPI works. That is why we are questioning Torque compatibility.


Thanks again,

Kai

--------------------
Kai Song
<ks...@lbl.gov> 1.510.486.4894
High Performance Computing Services (HPCS) Intern
Lawrence Berkeley National Laboratory - http://scs.lbl.gov


----- Original Message -----
From: Ralph Castain <r...@open-mpi.org>
Date: Tuesday, July 21, 2009 12:12 pm

Subject: Re: [OMPI users] Open-MPI-1.3.2 compatibility with old torque?

To: Open MPI Users <us...@open-mpi.org>

I'm afraid I have no idea - I've never seen a Torque version that old, however, so it is quite possible that we don't work with it. It also looks like it may have been modified (given the p2-aspen3 on the end), so I have no idea how the system would behave.

First thing you could do is verify that the allocation is being read correctly. Add a --display-allocation to the cmd line and see what we think Torque gave us. Then add --display-map to see where it plans to place the processes.

If all that looks okay, and if you allow ssh, then try -mca plm rsh on the cmd line and see if that works.

HTH
Ralph


On Tue, Jul 21, 2009 at 12:57 PM, Song, Kai Song <ks...@lbl.gov> wrote:

Hi All,

I am building open-mpi-1.3.2 on centos-3.4, with torque-1.1.0p2-aspen3 and myrinet. I compiled it just fine with this configuration:

./configure --prefix=/home/software/ompi/1.3.2-pgi --with-gm=/usr/local/ --with-gm-libdir=/usr/local/lib64/ --enable-static --disable-shared --with-tm=/usr/ --without-threads CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77 LDFLAGS=-L/usr/lib64/torque/


However, when I submit jobs for 2 or more nodes through the torque scheduler, the jobs just hang. They show the RUN state, but there is no communication between the nodes, and then the jobs die with a timeout.

We have confirmed that the myrinet is working because our lam-mpi-7.1 works just fine. We are having a really hard time determining the cause of this problem. So, we suspect it's because our torque is too old.

What is the lowest version of torque required for open-mpi-1.3.2? The README file didn't specify this detail. Does anyone know more about it?


Thanks in advance,

Kai
