The problem occurs in the following situation. In the rsh PLS, the number of daemons that have to be spawned is set to zero (as mpirun now acts as a daemon). Therefore, the rsh PLS doesn't do anything except send the launch order to the daemons; the remainder of the work is done in the ODLS. However, since we do not spawn new daemons, the new application inherits exactly the same environment as the mpirun application. PATH and LD_LIBRARY_PATH are not set, except when the user passes -x LD_LIBRARY_PATH.

voyager$ mpirun --prefix /Users/bosilca/opt/mpi -np 1 -host voyager printenv | grep LD_
<**there is no output here**>
voyager$ mpirun -x LD_LIBRARY_PATH=/toto --prefix /Users/bosilca/opt/mpi -np 1 -host voyager printenv | grep LD_
LD_LIBRARY_PATH=/Users/bosilca/opt/mpi/lib:/toto

However, using localhost seems to make the problem vanish, as the rsh PLS will always be used.

voyager$ mpirun -x LD_LIBRARY_PATH=/toto --prefix /Users/bosilca/opt/mpi -np 1 -host localhost printenv | grep LD_
LD_LIBRARY_PATH=/Users/bosilca/opt/mpi/lib:/toto
voyager$ mpirun --prefix /Users/bosilca/opt/mpi -np 1 -host localhost printenv | grep LD_
LD_LIBRARY_PATH=/Users/bosilca/opt/mpi/lib

Digging a little deeper shows that the problem comes from the fact that rmaps_node->nodename != orte_system_info.nodename when anything other than localhost is provided.
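
As an illustration of why that comparison matters, here is a minimal, self-contained sketch (plain C, not the actual ORTE code) of a local-vs-remote decision made purely by string equality between the two nodenames. Nothing in it resolves the names, so two different spellings of the same machine can still compare as remote; the argument names are only stand-ins for the fields mentioned above.

    /* Hypothetical illustration only -- not the Open MPI implementation.
     * It mimics a nodename check done by exact string comparison, as the
     * rmaps_node->nodename != orte_system_info.nodename observation suggests. */
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <mapped_nodename> <system_nodename>\n", argv[0]);
            return 1;
        }
        const char *mapped  = argv[1];   /* stands in for rmaps_node->nodename */
        const char *sysname = argv[2];   /* stands in for orte_system_info.nodename */

        /* A pure string comparison: no DNS lookup, no interface query. */
        if (0 == strcmp(mapped, sysname)) {
            printf("'%s' would be treated as a LOCAL node\n", mapped);
        } else {
            printf("'%s' would be treated as a REMOTE node\n", mapped);
        }
        return 0;
    }

For example, running it as "./nodecheck elfit1 localhost" prints REMOTE even though both names refer to the same machine.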

  george.


On Jul 19, 2007, at 10:35 AM, George Bosilca wrote:

It wasn't a bug. There is a bunch of code there just to make sure PATH and LD_LIBRARY_PATH are set correctly.

Yesterday we discovered that even if you force the --prefix in a similar execution environment, the LD_LIBRARY_PATH doesn't get set. However, using localhost always solves the problem.

   george.

On Jul 19, 2007, at 10:18 AM, Gleb Natapov wrote:

On Thu, Jul 19, 2007 at 08:07:51AM -0600, Ralph H Castain wrote:
Interesting. Apparently, it is getting a NULL back when it tries to access the LD_LIBRARY_PATH in your environment. Here is the code involved:

     newenv = opal_os_path( false, prefix_dir, lib_base, NULL );
     oldenv = getenv("LD_LIBRARY_PATH");
     if (NULL != oldenv) {
          char* temp;
          asprintf(&temp, "%s:%s", newenv, oldenv);
          free(newenv);
          newenv = temp;
     }
     opal_setenv("LD_LIBRARY_PATH", newenv, true, &env);
     if (mca_pls_rsh_component.debug) {
          opal_output(0, "pls:rsh: reset LD_LIBRARY_PATH: %s", newenv);
     }
     free(newenv);

So you can see that the only way we can get your debugging output is for the LD_LIBRARY_PATH in your starting environment to be NULL. Note that this comes after we fork, so we are talking about the child process - not sure that matters, but may as well point it out.
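
For anyone who wants to check that logic outside of Open MPI, the following is a self-contained sketch of the same prepend-or-set pattern using plain setenv() instead of opal_setenv(); the prefix path in it is just an example value, not taken from anyone's installation. It shows that a NULL getenv() result is harmless by itself: the prefix lib directory alone still gets exported.

    /* Stand-alone sketch of the prepend-or-set pattern quoted above, using
     * only the standard C library.  The prefix path is an example value. */
    #define _GNU_SOURCE            /* for asprintf() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        /* Stands in for opal_os_path(false, prefix_dir, lib_base, NULL). */
        char *newenv = strdup("/opt/openmpi/lib");
        char *oldenv = getenv("LD_LIBRARY_PATH");

        if (NULL != oldenv) {
            char *temp;
            if (asprintf(&temp, "%s:%s", newenv, oldenv) < 0) {
                free(newenv);
                return 1;
            }
            free(newenv);
            newenv = temp;       /* prefix lib dir goes in front */
        }
        setenv("LD_LIBRARY_PATH", newenv, 1);   /* overwrite, like the 'true' flag */
        printf("LD_LIBRARY_PATH: %s\n", getenv("LD_LIBRARY_PATH"));
        free(newenv);
        return 0;
    }

As George explains at the top of this thread, the real problem is not this snippet but the fact that this rsh PLS path is skipped when no new daemons need to be launched, so local processes simply inherit mpirun's environment.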

So the question is: why do you not have LD_LIBRARY_PATH set in your environment when you provide a different hostname?

Right, I don't have LD_LIBRARY_PATH set in my environment, but I expect mpirun to provide a working environment for all ranks, not just remote ones. This is how it worked before. Perhaps that was a bug, but it was a useful bug :)



On 7/19/07 7:45 AM, "Gleb Natapov" <gl...@voltaire.com> wrote:

On Wed, Jul 18, 2007 at 09:08:38PM +0300, Gleb Natapov wrote:
On Wed, Jul 18, 2007 at 09:08:47AM -0600, Ralph H Castain wrote:
But this will lock up:

pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host pn1180961 printenv | grep LD

The reason is that the hostname in this last command doesn't match the hostname I get when I query my interfaces, so mpirun thinks it must be a remote host - and so we stick in ssh until that times out. Which could be quick on your machine, but takes a while for me.

This is not my case. mpirun resolves the hostname and runs env, but LD_LIBRARY_PATH is not there. If I use the full name, like this:
# /home/glebn/openmpi/bin/mpirun -np 1 -H elfit1.voltaire.com env | grep LD_LIBRARY_PATH
LD_LIBRARY_PATH=/home/glebn/openmpi/lib

everything is OK.

More info. If I provide mpirun with the hostname as returned by the "hostname" command, LD_LIBRARY_PATH is not set:
# /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` env | grep LD
OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests

If I provide any other name that resolves to the same IP, then LD_LIBRARY_PATH is set:
# /home/glebn/openmpi/bin/mpirun -np 1 -H localhost env | grep LD
OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests
LD_LIBRARY_PATH=/home/glebn/openmpi/lib
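
The "resolves to the same IP" observation is easy to check directly. Below is a small hypothetical helper (not part of Open MPI) that resolves two names with getaddrinfo() and reports whether they map to the same IPv4 address.

    /* Hypothetical helper, not part of Open MPI: report whether two
     * hostnames resolve to the same IPv4 address. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netdb.h>
    #include <netinet/in.h>

    static int resolve(const char *name, struct in_addr *out)
    {
        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_INET;             /* IPv4 only, for simplicity */
        if (0 != getaddrinfo(name, NULL, &hints, &res)) {
            return -1;
        }
        *out = ((struct sockaddr_in *) res->ai_addr)->sin_addr;
        freeaddrinfo(res);
        return 0;
    }

    int main(int argc, char **argv)
    {
        struct in_addr a, b;
        if (argc != 3 || 0 != resolve(argv[1], &a) || 0 != resolve(argv[2], &b)) {
            fprintf(stderr, "usage: %s <name1> <name2>\n", argv[0]);
            return 1;
        }
        printf("%s and %s %s to the same IPv4 address\n", argv[1], argv[2],
               a.s_addr == b.s_addr ? "resolve" : "do NOT resolve");
        return 0;
    }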

Here is debug output of a "bad" run:
/home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` -mca pls_rsh_debug 1 echo
[elfit1:14730] pls:rsh: launching job 1
[elfit1:14730] pls:rsh: no new daemons to launch

Here is a good one:
/home/glebn/openmpi/bin/mpirun -np 1 -H localhost -mca pls_rsh_debug 1 echo
[elfit1:14752] pls:rsh: launching job 1
[elfit1:14752] pls:rsh: local csh: 0, local sh: 1
[elfit1:14752] pls:rsh: assuming same remote shell as local shell
[elfit1:14752] pls:rsh: remote csh: 0, remote sh: 1
[elfit1:14752] pls:rsh: final template argv:
[elfit1:14752] pls:rsh:     /usr/bin/ssh <template> orted --name <template> --num_procs 1 --vpid_start 0 --nodename <template> --universe root@elfit1:default-universe-14752 --nsreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca mca_base_param_file_path /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd
[elfit1:14752] pls:rsh: launching on node localhost
[elfit1:14752] pls:rsh: localhost is a LOCAL node
[elfit1:14752] pls:rsh: reset PATH: /home/glebn/openmpi/bin:/home/USERS/lenny/MPI/mpi/bin:/opt/vltmpi/OPENIB/mpi/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
[elfit1:14752] pls:rsh: reset LD_LIBRARY_PATH: /home/glebn/openmpi/lib
[elfit1:14752] pls:rsh: changing to directory /root
[elfit1:14752] pls:rsh: executing: (/home/glebn/openmpi/bin/orted) [orted --name 0.0.1 --num_procs 1 --vpid_start 0 --nodename localhost --universe root@elfit1:default-universe-14752 --nsreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca mca_base_param_file_path /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd --set-sid]

--
Gleb.