On 01.12.2009 at 10:32, Ondrej Glembek wrote:

Just to add more info:

Reuti wrote:
On 30.11.2009 at 20:07, Ondrej Glembek wrote:

But I think the real problem is that Open MPI assumes you are outside of SGE and so uses a different startup. Are you resetting any of SGE's
environment variables in your custom starter method (like $JOB_ID)?

Also, one of the reasons that makes me think Open MPI knows it is
inside of SGE is the output of mpiexec (below).

The first four lines show that starter.sh is called from mpiexec, having
trouble with the (...) command...

The last four lines show that mpiexec knows the machines it is supposed
to run on...

Thanx



/usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found
/usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found
/usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found
/usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found

You are right. So the question remains: why is Open MPI building such a line at all?

As you found the place in the source, it's done only for certain shells, and I would assume only in the case of an rsh/ssh startup. When you put a `sleep 60` in your starter script, it will of course delay the start of the program, but once it gets to mpiexec you should see some "qrsh -inherit ..." processes on the master node of the parallel job. Are these present?
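A quick way to run that check is to pause the starter before the exec so the built command line survives long enough to inspect. This is only a sketch; the `STARTER_DEBUG_SLEEP` override is an assumption added for convenience, not something from the thread:

```shell
#!/bin/sh
# debug variant of starter.sh: pause before exec so the command line
# built by Open MPI can be inspected with `ps -e f` from another shell
sleep "${STARTER_DEBUG_SLEEP:-60}"   # default 60 s, as suggested above
exec "$@"
```

While it sleeps, run `ps -e f | grep 'qrsh -inherit'` on the master node of the parallel job.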

-- Reuti


--------------------------------------------------------------------------
A daemon (pid 30616) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
blade57.fit.vutbr.cz - daemon did not report back when launched
blade39.fit.vutbr.cz - daemon did not report back when launched
blade41.fit.vutbr.cz - daemon did not report back when launched
blade61.fit.vutbr.cz - daemon did not report back when launched





-- Reuti



Thanx
Ondrej


Reuti wrote:
On 30.11.2009 at 18:46, Ondrej Glembek wrote:
Hi, thanx for the reply...

I tried to dump the $@ before calling the exec and here it is:


( test ! -r ./.profile || . ./.profile;
PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;
/homes/kazi/glembek/share/openmpi-1.3.3-64/bin/orted -mca ess env
-mca orte_ess_jobid 3870359552 -mca orte_ess_vpid 1 -mca
orte_ess_num_procs 2 --hnp-uri
"3870359552.0;tcp://147.229.8.134:53727" --mca
pls_gridengine_verbose 1 --output-filename mpi.log )


It looks like the line gets constructed in
orte/mca/plm/rsh/plm_rsh_module.c and depends on the shell...

Still I wonder why mpiexec calls starter.sh... I thought the
starter was supposed to call the script which wraps the call to
mpiexec...
Correct. This will happen for the master node of this job, i.e. where
the jobscript is executed. But it will also be used for the qrsh
-inherit calls. I wonder about one thing: I see only a call to
"orted" and not the above sub-shell on my machines. Did you compile
Open MPI with --with-sge?

The original call above would be "ssh node_xy ( test ! ....)", which
seems to work for ssh and rsh.

Just one note: with the starter script you will lose the PATH and
LD_LIBRARY_PATH that were set, as a new shell is created. It might be
necessary to set them again in your starter method.
-- Reuti
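A minimal sketch of a starter method that sets the paths again, as suggested above. The install prefix is the one from the mpiexec dump earlier in the thread; adjust it for your site:

```shell
#!/bin/sh
# starter method that restores PATH/LD_LIBRARY_PATH for the fresh shell
# prefix taken from the mpiexec dump above -- adjust for your install
OMPI_PREFIX=/homes/kazi/glembek/share/openmpi-1.3.3-64
PATH=$OMPI_PREFIX/bin:$PATH; export PATH
LD_LIBRARY_PATH=$OMPI_PREFIX/lib:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH
exec "$@"
```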

Am I not right???
Ondrej


Reuti wrote:
Hi,
On 30.11.2009 at 16:33, Ondrej Glembek wrote:
we are using a custom starter method in our SGE to launch our jobs...
It looks something like this:

#!/bin/sh

# ... we do whole bunch of stuff here

# start the job in this shell
exec "$@"
The "$@" should be replaced by the path to the jobscript (qsub) or
the command (qrsh) plus the given options.

For the tasks spread to other nodes I get as argument: "orted -mca
ess env -mca orte_ess_jobid ...". Also no . ./.profile.

So I wonder where the . ./.profile is coming from. Can you put a
`sleep 60` or similar before the `exec ...` and grep the built line
from `ps -e f` before it crashes?
-- Reuti
The trouble is that mpiexec passes a command which looks like this:

( . ./.profile ..... )

which, however, is not a valid exec argument...

Is there any way to tell mpiexec to run it in a separate script???
Any idea how to solve this???
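One possible way around this is to let the starter hand a compound command to a shell instead of exec'ing it directly. This is only a sketch, not something from the thread; the leading-parenthesis test is an assumption about how the built line arrives at the starter:

```shell
#!/bin/sh
# if the command starts with "(" it is a compound shell expression,
# which exec cannot run directly -- hand the whole line to a shell
case "$1" in
  \(*) exec /bin/sh -c "$*" ;;
  *)   exec "$@" ;;
esac
```

Note that joining the arguments with "$*" loses the original quoting, so this only works when the constructed line contains no arguments with embedded spaces.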

Thanx
Ondrej Glembek

--

  Ondrej Glembek, PhD student  E-mail: glem...@fit.vutbr.cz
  UPGM FIT VUT Brno, L226      Web:    http://www.fit.vutbr.cz/~glembek
  Bozetechova 2, 612 66        Phone:  +420 54114-1292
  Brno, Czech Republic         Fax:    +420 54114-1290

  ICQ: 93233896
  GPG: C050 A6DC 7291 6776 9B69 BB11 C033 D756 6F33 DE3C
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

