But it *does* provide an LD_LIBRARY_PATH pointing to your openmpi
installation - it says so right here in your debug output:

>>> [elfit1:14752] pls:rsh: reset LD_LIBRARY_PATH: /home/glebn/openmpi/lib

I suspect that the problem isn't in the launcher, but rather in the iof
(I/O forwarding) again. Why don't we wait until those fixes come into the
trunk before chasing our tails any further?


On 7/19/07 8:18 AM, "Gleb Natapov" <gl...@voltaire.com> wrote:

> On Thu, Jul 19, 2007 at 08:07:51AM -0600, Ralph H Castain wrote:
>> Interesting. Apparently, it is getting a NULL back when it tries to access
>> the LD_LIBRARY_PATH in your environment. Here is the code involved:
>> 
>>      /* build "<prefix>/<lib>" as the new entry */
>>      newenv = opal_os_path( false, prefix_dir, lib_base, NULL );
>>      oldenv = getenv("LD_LIBRARY_PATH");
>>      if (NULL != oldenv) {
>>           /* prepend the new entry to the existing value */
>>           char* temp;
>>           asprintf(&temp, "%s:%s", newenv, oldenv);
>>           free(newenv);
>>           newenv = temp;
>>      }
>>      opal_setenv("LD_LIBRARY_PATH", newenv, true, &env);
>>      if (mca_pls_rsh_component.debug) {
>>           opal_output(0, "pls:rsh: reset LD_LIBRARY_PATH: %s", newenv);
>>      }
>>      free(newenv);
>> 
>> So you can see that the only way we can get your debugging output is for the
>> LD_LIBRARY_PATH in your starting environment to be NULL. Note that this code
>> runs after we fork, so we are talking about the child process - not sure
>> that matters, but I may as well point it out.
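
For reference, here is a minimal standalone sketch of the same prepend logic,
using plain libc calls (getenv/asprintf/setenv) in place of the
opal_os_path/opal_setenv helpers. prepend_lib_path is a hypothetical name,
and the prefix is the one from the debug output:

    #define _GNU_SOURCE            /* asprintf() is a GNU extension */
    #include <stdio.h>
    #include <stdlib.h>

    /* Prepend "<prefix>/lib" to LD_LIBRARY_PATH.  As in the pls:rsh code
     * above, if the variable is unset, getenv() returns NULL and the new
     * value is just the prefix's lib directory - which is what the debug
     * line from the "good" run shows. */
    static void prepend_lib_path(const char *prefix_dir)
    {
        char *newenv;
        char *oldenv = getenv("LD_LIBRARY_PATH");

        if (NULL != oldenv) {
            asprintf(&newenv, "%s/lib:%s", prefix_dir, oldenv);
        } else {
            asprintf(&newenv, "%s/lib", prefix_dir);
        }
        setenv("LD_LIBRARY_PATH", newenv, 1);   /* 1 = overwrite */
        printf("reset LD_LIBRARY_PATH: %s\n", newenv);
        free(newenv);
    }

    int main(void)
    {
        prepend_lib_path("/home/glebn/openmpi");
        return 0;
    }
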
>> 
>> So the question is: why do you not have LD_LIBRARY_PATH set in your
>> environment when you provide a different hostname?
> Right, I don't have LD_LIBRARY_PATH set in my environment, but I expect
> mpirun to provide a working environment for all ranks, not just remote
> ones. This is how it worked before. Perhaps that was a bug, but it was a
> useful bug :)
> 
>> 
>> 
>> On 7/19/07 7:45 AM, "Gleb Natapov" <gl...@voltaire.com> wrote:
>> 
>>> On Wed, Jul 18, 2007 at 09:08:38PM +0300, Gleb Natapov wrote:
>>>> On Wed, Jul 18, 2007 at 09:08:47AM -0600, Ralph H Castain wrote:
>>>>> But this will lock up:
>>>>> 
>>>>> pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host pn1180961 printenv | grep LD
>>>>> 
>>>>> The reason is that the hostname in this last command doesn't match the
>>>>> hostname I get when I query my interfaces, so mpirun thinks it must be a
>>>>> remote host - and so we sit in ssh until it times out. That could be
>>>>> quick on your machine, but it takes a while for me.
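
As an aside, the "must be a remote host" decision described above amounts to
checking whether the given name maps to the local machine. Below is a rough
standalone illustration of one way to do that - resolving the name and
comparing the result against the local interface addresses. is_local_host is
a hypothetical helper, not the actual Open MPI check, which may well compare
names rather than addresses:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netdb.h>
    #include <ifaddrs.h>
    #include <netinet/in.h>

    /* Return 1 if 'name' resolves to an address owned by one of the
     * local interfaces, 0 otherwise. */
    static int is_local_host(const char *name)
    {
        struct addrinfo hints = { 0 }, *res, *r;
        struct ifaddrs *ifs, *i;
        int local = 0;

        hints.ai_family = AF_INET;              /* IPv4 only, for brevity */
        if (0 != getaddrinfo(name, NULL, &hints, &res)) return 0;
        if (0 != getifaddrs(&ifs)) { freeaddrinfo(res); return 0; }

        for (r = res; NULL != r && !local; r = r->ai_next) {
            for (i = ifs; NULL != i; i = i->ifa_next) {
                if (NULL == i->ifa_addr || AF_INET != i->ifa_addr->sa_family)
                    continue;
                if (((struct sockaddr_in *)i->ifa_addr)->sin_addr.s_addr ==
                    ((struct sockaddr_in *)r->ai_addr)->sin_addr.s_addr) {
                    local = 1;
                    break;
                }
            }
        }
        freeifaddrs(ifs);
        freeaddrinfo(res);
        return local;
    }

    int main(int argc, char **argv)
    {
        const char *name = (argc > 1) ? argv[1] : "localhost";
        printf("%s is %s\n", name, is_local_host(name) ? "LOCAL" : "remote");
        return 0;
    }
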
>>>>> 
>>>> This is not my case. mpirun resolves the hostname and runs env, but
>>>> LD_LIBRARY_PATH is not there. If I use the full name, like this:
>>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H elfit1.voltaire.com env | grep LD_LIBRARY_PATH
>>>> LD_LIBRARY_PATH=/home/glebn/openmpi/lib
>>>> 
>>>> everything is OK.
>>>> 
>>> More info: if I give mpirun the hostname as returned by the "hostname"
>>> command, LD_LIBRARY_PATH is not set:
>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname`  env | grep LD
>>> OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests
>>> 
>>> If I provide any other name that resolves to the same IP, then
>>> LD_LIBRARY_PATH is set:
>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H localhost  env | grep LD
>>> OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests
>>> LD_LIBRARY_PATH=/home/glebn/openmpi/lib
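
Gleb's premise here - that both names resolve to the same IP - can be checked
directly. The sketch below (again hypothetical, not part of the code under
discussion) resolves two names to their first IPv4 address and compares them;
elfit1 is the node name from the logs:

    #include <stdio.h>
    #include <string.h>
    #include <netdb.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    /* Resolve 'name' to its first IPv4 address as dotted-quad text.
     * Returns 0 on success, -1 on lookup failure. */
    static int first_ipv4(const char *name, char *buf, size_t len)
    {
        struct addrinfo hints = { 0 }, *res;

        hints.ai_family = AF_INET;
        if (0 != getaddrinfo(name, NULL, &hints, &res)) return -1;
        inet_ntop(AF_INET,
                  &((struct sockaddr_in *)res->ai_addr)->sin_addr,
                  buf, len);
        freeaddrinfo(res);
        return 0;
    }

    int main(void)
    {
        char a[INET_ADDRSTRLEN], b[INET_ADDRSTRLEN];

        if (0 == first_ipv4("elfit1", a, sizeof a) &&
            0 == first_ipv4("localhost", b, sizeof b)) {
            printf("elfit1    -> %s\n", a);
            printf("localhost -> %s\n", b);
            printf("same IP?     %s\n", 0 == strcmp(a, b) ? "yes" : "no");
        }
        return 0;
    }
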
>>> 
>>> Here is the debug output of the "bad" run:
>>> /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` -mca pls_rsh_debug 1 echo
>>> [elfit1:14730] pls:rsh: launching job 1
>>> [elfit1:14730] pls:rsh: no new daemons to launch
>>> 
>>> Here is the good one:
>>> /home/glebn/openmpi/bin/mpirun -np 1 -H localhost -mca pls_rsh_debug 1 echo
>>> [elfit1:14752] pls:rsh: launching job 1
>>> [elfit1:14752] pls:rsh: local csh: 0, local sh: 1
>>> [elfit1:14752] pls:rsh: assuming same remote shell as local shell
>>> [elfit1:14752] pls:rsh: remote csh: 0, remote sh: 1
>>> [elfit1:14752] pls:rsh: final template argv:
>>> [elfit1:14752] pls:rsh:     /usr/bin/ssh <template> orted --name <template> --num_procs 1 --vpid_start 0 --nodename <template> --universe root@elfit1:default-universe-14752 --nsreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca mca_base_param_file_path /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd
>>> [elfit1:14752] pls:rsh: launching on node localhost
>>> [elfit1:14752] pls:rsh: localhost is a LOCAL node
>>> [elfit1:14752] pls:rsh: reset PATH: /home/glebn/openmpi/bin:/home/USERS/lenny/MPI/mpi/bin:/opt/vltmpi/OPENIB/mpi/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
>>> [elfit1:14752] pls:rsh: reset LD_LIBRARY_PATH: /home/glebn/openmpi/lib
>>> [elfit1:14752] pls:rsh: changing to directory /root
>>> [elfit1:14752] pls:rsh: executing: (/home/glebn/openmpi/bin/orted) [orted --name 0.0.1 --num_procs 1 --vpid_start 0 --nodename localhost --universe root@elfit1:default-universe-14752 --nsreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca mca_base_param_file_path /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd --set-sid]
>>> 
> --
> Gleb.
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

