Yes, I installed mpich-g2 in /usr/local/mpich-g2 and I compiled my job with /usr/local/mpich-g2/bin/mpicc.

Was your job compiled with mpich-g2?


On 10/9/07, 那日苏 <[EMAIL PROTECTED]> wrote:
Hi, All,

I have a cluster with 1 head node and 3 slave nodes, and their hostnames are:

master:     m01.c01
slaves:     s01.c01     s02.c01     s03.c01

So I want to build a small grid. I installed mpich-g2, globus, and torque on my cluster, and the slaves share the
mpich-g2 installation on the head node. I ran the "hello world" example that ships with the mpich package, which proves my installation of globus is OK. Then I interfaced torque with globus and submitted the "hello world" job with an RSL file like this:
+
( &(resourceManagerContact="m01.c01/jobmanager-pbs")
   (count=2)
   (label="subjob 0")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
       (LD_LIBRARY_PATH /usr/local/gt-4.0.1/lib/))
   (directory=/home/gt)
   (executable=/home/gt/hello/hello)
)
It worked very well and the output is:
[[EMAIL PROTECTED] hello]$ globusrun -w -f hello.rsl
hello, world
hello, world
Then I set $mpirun in pbs.pm to $MPICH-G2_HOME/bin/mpirun and submitted an mpich-g2 job: the classic "cpi" program that ships with the mpich package.
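For reference, the edit in pbs.pm was roughly the following (a sketch only; the exact file path and variable layout depend on your Globus version, and the mpirun path assumes mpich-g2 is installed under /usr/local/mpich-g2):

```
# In $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/pbs.pm (path assumed)
# Point the PBS job manager at the MPICH-G2 mpirun instead of the system default:
my $mpirun = "/usr/local/mpich-g2/bin/mpirun";
```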
But it failed. This is the RSL file:
+
( &(resourceManagerContact="m01.c01/jobmanager-pbs")
   (count=4)
   (jobtype=mpi)
   (label="subjob 0")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
                (LD_LIBRARY_PATH /usr/local/gt-4.0.1/lib/))
   (directory="/home/gt/examples")
   (executable="/home/gt/examples/cpi")
)
The output is:
[[EMAIL PROTECTED] examples]$ ./mpirun -globusrsl cpi.rsl
    Submission of subjob (label = "subjob 0") failed because authentication with the remote server failed (error code 57)
    Submission of subjob (label = "subjob 1") failed because the connection to the server failed (check host and port) (error code 62)
    Submission of subjob (label = "subjob 2") failed because the connection to the server failed (check host and port) (error code 62)
    Submission of subjob (label = "subjob 3") failed because the connection to the server failed (check host and port) (error code 62)
So I googled it, and someone said that I have to remove the line "(jobtype=mpi)" if I don't use vendor MPI. I did, and the errors were gone, but it seems all the processes ran on the head node while none ran on the slaves:
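With that line removed, the RSL is identical to the one above except that (jobtype=mpi) is gone:

```
+
( &(resourceManagerContact="m01.c01/jobmanager-pbs")
   (count=4)
   (label="subjob 0")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
                (LD_LIBRARY_PATH /usr/local/gt-4.0.1/lib/))
   (directory="/home/gt/examples")
   (executable="/home/gt/examples/cpi")
)
```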
[gt@m01.c01 examples]$ ./mpirun -globusrsl cpi.rsl
Process 3 on m01.c01
Process 2 on m01.c01
Process 1 on m01.c01
pi is approximately 3.1416009869231249, Error is 0.0000083333333318
wall clock time = 0.083140
Process 0 on m01.c01
Could anyone tell me what's wrong here? Thanks in advance!

Best Regards,
Narisu,
Beihang University,
Beijing,
China.
Email:[EMAIL PROTECTED]




--
Best Regards,
S.Mehdi Sheikhalishahi,
Web: http://www.cse.shirazu.ac.ir/~alishahi/
Bye.





