Was your job compiled with mpich-g2?

On 10/9/07, Narisu (那日苏) <[EMAIL PROTECTED]> wrote:
>
> Hi, all,
>
> I have a cluster with 1 head node and 3 slave nodes; their hostnames
> are:
>
> master: m01.c01
> slaves: s01.c01 s02.c01 s03.c01
>
> I want to build a small grid, so I installed mpich-g2, globus, and torque
> on my cluster, and the slaves share the mpich-g2 installation on the head
> node. I successfully ran the "hello world" example shipped with the mpich
> package, which shows that my installation of globus is OK. Then I
> interfaced torque with globus and submitted the "hello world" job above
> with an RSL file like this:
>
> +
> ( &(resourceManagerContact="m01.c01/jobmanager-pbs")
>   (count=2)
>   (label="subjob 0")
>   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
>                (LD_LIBRARY_PATH /usr/local/gt-4.0.1/lib/))
>   (directory=/home/gt)
>   (executable=/home/gt/hello/hello)
> )
>
> It worked very well, and the output is:
>
> [EMAIL PROTECTED] hello]$ globusrun -w -f hello.rsl
> hello, world
> hello, world
>
> Then I set $mpirun in pbs.pm to $MPICH-G2_HOME/bin/mpirun and
> submitted an mpich-g2 job: the classic "cpi" program from the mpich
> package. But it failed.
> This is the RSL file:
>
> +
> ( &(resourceManagerContact="m01.c01/jobmanager-pbs")
>   (count=4)
>   (jobtype=mpi)
>   (label="subjob 0")
>   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
>                (LD_LIBRARY_PATH /usr/local/gt-4.0.1/lib/))
>   (directory="/home/gt/examples")
>   (executable="/home/gt/examples/cpi")
> )
>
> The output is:
>
> [EMAIL PROTECTED] examples]$ ./mpirun -globusrsl cpi.rsl
> Submission of subjob (label = "subjob 0") failed because
> authentication with the remote server failed (error code 57)
> Submission of subjob (label = "subjob 1") failed because the
> connection to the server failed (check host and port) (error code 62)
> Submission of subjob (label = "subjob 2") failed because the
> connection to the server failed (check host and port) (error code 62)
> Submission of subjob (label = "subjob 3") failed because the
> connection to the server failed (check host and port) (error code 62)
>
> So I googled it, and someone said that I had to remove the line
> "(jobtype=mpi)" if I don't use vendor MPI. I did so and the errors were
> gone, but it seems that all the processes ran on the head node while
> none ran on the slaves:
>
> [EMAIL PROTECTED] examples]$ ./mpirun -globusrsl cpi.rsl
> Process 3 on m01.c01
> Process 2 on m01.c01
> Process 1 on m01.c01
> pi is approximately 3.1416009869231249, Error is 0.0000083333333318
> wall clock time = 0.083140
> Process 0 on m01.c01
>
> Could anyone tell me what's wrong? Thanks in advance!
>
> Best regards,
> Narisu,
> Beihang University,
> Beijing,
> China.
> Email: [EMAIL PROTECTED]
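As a quick way to check placement while you debug this, you can parse the
"Process N on HOST" lines that cpi prints and count processes per host; if
the launch is really spread across the slaves, hosts other than m01.c01
should appear. A small sketch (the sample output below is hard-coded from
the run quoted above, and `hosts_per_rank` is just a helper name I made up):

```python
from collections import Counter

# Sample cpi output copied from the quoted message; with a working
# multi-node MPICH-G2 launch, s01/s02/s03.c01 should also show up here.
cpi_output = """\
Process 3 on m01.c01
Process 2 on m01.c01
Process 1 on m01.c01
pi is approximately 3.1416009869231249, Error is 0.0000083333333318
wall clock time = 0.083140
Process 0 on m01.c01
"""

def hosts_per_rank(output):
    """Map MPI rank -> hostname from cpi's 'Process N on HOST' lines."""
    placement = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "Process" and parts[2] == "on":
            placement[int(parts[1])] = parts[3]
    return placement

placement = hosts_per_rank(cpi_output)
print(placement)                    # which host each rank landed on
print(Counter(placement.values()))  # all 4 ranks on m01.c01 => no spread
```

In the quoted run this reports all four ranks on m01.c01, which matches the
symptom: without (jobtype=mpi), the jobmanager simply started `count` copies
of the binary rather than a coordinated MPI launch.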
--
Best Regards,
S.Mehdi Sheikhalishahi,
Web: http://www.cse.shirazu.ac.ir/~alishahi/
Bye.
