Re: [OMPI users] SGE tight integration and ?tm? protocol for start

Reuti Sat, 11 Oct 2008 18:48:15 -0400

On 12.10.2008 at 00:21, Sean Davis wrote:

<snip>


Thanks, Pak.  There is only one queue on the SGE system.  Of course,
there are queue instances for each machine, which is the usual for
SGE.

I'll give the -masterq a look.  And the messages files for the
involved machines are devoid of anything useful; in fact, there is no
mention of these jobs, in general.

Hi,

to see more, you can set "loglevel log_info" in the scheduler configuration.

Do you have more than one network card installed, and did you give them the same name?

Your defined "tmpdir" is local on each machine?
Do you redefine $TMPDIR in your .bashrc or anything else therein?
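
One way to answer the $TMPDIR question is to search the shell startup files for assignments. The following is only a minimal sketch: find_tmpdir_overrides is a hypothetical helper name (not an SGE or Open MPI tool), and the list of startup files is an assumption that may need adjusting for your setup.

```shell
#!/bin/bash
# Sketch: print every line in the given files that assigns TMPDIR.
# find_tmpdir_overrides is a hypothetical helper, not part of SGE or Open MPI.
find_tmpdir_overrides() {
    for f in "$@"; do
        # Skip files that do not exist or cannot be read.
        [ -r "$f" ] || continue
        # Match "TMPDIR=..." with optional leading whitespace and "export".
        # Passing /dev/null as a second file makes grep print the file name.
        grep -nE '^[[:space:]]*(export[[:space:]]+)?TMPDIR=' "$f" /dev/null || true
    done
    return 0
}

# Assumed set of startup files; extend as needed.
find_tmpdir_overrides "$HOME/.bashrc" "$HOME/.bash_profile" "$HOME/.profile"
```

If this prints nothing, none of the listed files sets TMPDIR directly; a file sourced from them could still do so.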

-- Reuti

Sean
Date: Sat, 11 Oct 2008 07:56:02 -0400
From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] SGE tight integration and ?tm? protocol for start
To: Open MPI Users <us...@open-mpi.org>
Message-ID: <3e62159b-14b9-4d44-96f6-0345079bc...@cisco.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
I don't know much/anything about SGE (I'll leave that to the Sun folks on
this list to reply), but I can tell you about the tm plugins: tm is the
protocol used by the PBS/Torque family of launchers. It looks like your
Open MPI was built with TM support, but when you launch, it's likely unable
to find the support libraries that it needs to load those plugins.
This is probably fine in your case, since you want to use SGE, not TM.
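
To see which of these plugins a given Open MPI installation reports, the ompi_info output can be filtered for the relevant components. This is only a sketch: filter_launch_components is a hypothetical helper, and the pattern assumes that ompi_info prints lines of the form "MCA ras: gridengine (...)", as Open MPI 1.2.x does; other versions may name the frameworks differently.

```shell
#!/bin/bash
# Sketch: show the gridengine and tm components reported by ompi_info.
# filter_launch_components is a hypothetical helper, not part of Open MPI.
filter_launch_components() {
    # Keep only "MCA ras:" / "MCA pls:" lines that name gridengine or tm.
    # The line format is an assumption based on Open MPI 1.2.x output.
    grep -E 'MCA (ras|pls): (gridengine|tm)( |$)' || true
}

if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | filter_launch_components
else
    echo "ompi_info not found in PATH" >&2
fi
```

If gridengine lines appear for both ras and pls, the build includes SGE support, and the "unable to open ... tm" warnings can most likely be ignored when running under SGE.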
On Oct 9, 2008, at 4:40 PM, Sean Davis wrote:
I am relatively new to OpenMPI and Sun Grid Engine parallel
integration. I have a small cluster that is running SGE 6.2 on linux
machines all using Intel Xeon processors.  I have installed OpenMPI
1.2.7 from source using the --with-sge switch.  Now, I am trying to
troubleshoot some problems I am having. I have created a simple job
script:

The job script looks like:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
mpirun --mca pls_gridengine_verbose 1 -np $NSLOTS hostname

And the output on the error stream:
more junksub.sh.e3574
[shakespeare:05720] mca: base: component_find: unable to open ras tm:
file not found (ignored)
[shakespeare:05720] mca: base: component_find: unable to open pls tm:
file not found (ignored)
Starting server daemon at host "shakespeare.nci.nih.gov"
Starting server daemon at host "octopus.nci.nih.gov"
Server daemon successfully started with task id "1.shakespeare"
[shakespeare:05733] mca: base: component_find: unable to open ras tm:
file not found (ignored)
[shakespeare:05733] mca: base: component_find: unable to open pls tm:
file not found (ignored)
error: executing task of job 3576 failed: failed sending task to
ex...@octopus.nci.nih.gov: can't find connection
[shakespeare:05720] ERROR: A daemon on node octopus.nci.nih.gov failed
to start as expected.
[shakespeare:05720] ERROR: There may be more information available from
[shakespeare:05720] ERROR: the 'qstat -t' command on the Grid Engine
tasks.
[shakespeare:05720] ERROR: If the problem persists, please restart the
[shakespeare:05720] ERROR: Grid Engine PE job
[shakespeare:05720] ERROR: The daemon exited unexpectedly with status 1.
However, there is no output in any output stream.
And if I log into shakespeare and qrsh -q all.q@octopus, I immediately
get a slot, so there isn't a "direct" problem with connecting.

As I got a hint from folks on the SGE mailing list, it appears that
qrsh is not being used for job submission. Any suggestions as to why
this might be the case (or if it is the case)?

Thanks,
Sean
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
