Thank you. I was able to make everything work by using orte_launch_agent and bash's $@ to pass the necessary parameters to orted within my shell script.
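For anyone finding this thread later, here is a minimal sketch of the kind of wrapper script described above. The script actually used was not posted, so everything here is an assumption for illustration: the install prefix /opt/openmpi, the variable names OMPI_PREFIX and ORTED_BIN, and the function names are hypothetical and not part of Open MPI. The only pieces taken from the thread are the idea of extending PATH and LD_LIBRARY_PATH and then handing bash's "$@" to orted.

```bash
#!/bin/bash
# Sketch of a launch-agent wrapper: set up the environment, then hand all
# arguments received from the ssh launcher on to orted unchanged.

# Assumed install prefix (hypothetical); override it from the environment.
OMPI_PREFIX="${OMPI_PREFIX:-/opt/openmpi}"
# Command to hand off to. "orted" in real use; overridable for dry runs.
ORTED_BIN="${ORTED_BIN:-orted}"

setup_env() {
    # Prepend the install's bin and lib directories. The ${VAR:+...} form
    # avoids leaving a trailing ":" when the variable was empty or unset.
    PATH="${OMPI_PREFIX}/bin${PATH:+:${PATH}}"
    LD_LIBRARY_PATH="${OMPI_PREFIX}/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
    export PATH LD_LIBRARY_PATH
}

launch_agent() {
    setup_env
    # "$@" (quoted) preserves each argument exactly as it was received.
    exec "$ORTED_BIN" "$@"
}

# The launcher passes orted its parameters as arguments, so hand off only
# when some were given; sourcing or dry-running this file is then harmless.
if [ "$#" -gt 0 ]; then
    launch_agent "$@"
fi
```

To use something like this, orte_launch_agent would be set to the script's path, either in mca-params.conf or with -mca orte_launch_agent on the command line, so that the script is run in place of orted on the remote host. Using exec means the wrapper shell is replaced by orted rather than lingering as a parent process.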
I needed to add additional paths to my LD_LIBRARY_PATH/PATH variables for other necessary libraries, which is why I was pushing on the orte_launch_agent solution.

Is there a document that covers the design of Open MPI a bit? It looks pretty interesting, and there are quite a few acronyms that I had trouble finding on the internet (e.g. "ess").

On Wed, Sep 24, 2008 at 3:40 PM, Ralph Castain <r...@lanl.gov> wrote:
> Yes - you don't want to use orte_launch_agent at all for that purpose. What
> you need to set is an info_key in your comm_spawn command for "ompi_prefix",
> with the value set to the install path. The ssh launcher will assemble the
> launch cmd using that info.
>
> Ralph
>
> On Sep 24, 2008, at 1:28 PM, Will Portnoy wrote:
>
> Yes, your first sentence is correct. I intend to use the unmodified
> orted, but I need to set up the unix environment after the ssh has
> completed but before orted is executed.
>
> In particular, one of the more important tasks for me to do after ssh
> connects is to set LD_LIBRARY_PATH and PATH to include the paths of
> the openmpi's install lib and bin directories, respectively.
> Otherwise, orted will not be on the PATH, and its dependent libraries
> will not be in LD_LIBRARY_PATH.
>
> Is there a recommended method to set LD_LIBRARY_PATH and PATH when ssh
> is used to connect to other hosts when running an mpi job?
>
> thank you,
>
> Will
>
> On Wed, Sep 24, 2008 at 2:36 PM, Ralph Castain <r...@lanl.gov> wrote:
> > So this is a singleton comm_spawn scenario, that requires you specify a
> > launch_agent to execute? Just trying to ensure I understand.
> >
> > First, let me ensure we have a common understanding of what
> > orte_launch_agent does. Basically, that param stipulates the command to be
> > used in place of "orted" - it doesn't substitute for "ssh". So if you set
> > -mca orte_launch_agent foo, what will happen is: "ssh nodename foo" instead
> > of "ssh nodename orted".
> >
> > The intent was to provide a way to do things like run valgrind on the orted
> > itself. So you could do -mca orte_launch_agent "valgrind orted", and we
> > would dutifully run "ssh nodename valgrind orted".
> >
> > Or if you wanted to write your own orted (e.g., bar-orted), you could
> > substitute it for our "orted".
> >
> > Or if you wanted to set mca params solely to be seen on the backend
> > nodes/procs, you could set -mca orte_launch_agent "orted -mca foo bar", and
> > we would launch "ssh nodename orted -mca foo bar". This allows us to set mca
> > params without having mpirun see them - helps us to look at debug output,
> > for example, from only the backend procs.
> >
> > If what you need to do is set something in the environment for the orted,
> > there are certain cmd line options that will do that for you -
> > orte_launch_agent may or may not be a good method.
> >
> > Perhaps it would help if you could tell me exactly what you wanted to have
> > orte_launch_agent actually do?
> >
> > Thanks
> > Ralph
> >
> > On Sep 24, 2008, at 12:22 PM, Will Portnoy wrote:
> >
> > Sorry for the miscommunication: The processes are started by my
> > program with MPI_Comm_spawn, so there was no mpirun involved.
> >
> > If you can suggest a test program I can use with mpirun to validate my
> > openmpi environment and install, that would probably produce the
> > output you would like to see.
> >
> > But I'm not sure that will make it clear how the file pointed to by
> > "orte_launch_agent" in "mca-params.conf" should be written to set up an
> > environment and start orted.
> >
> > Will
> >
> > On Wed, Sep 24, 2008 at 2:17 PM, Ralph Castain <r...@lanl.gov> wrote:
> >
> > Afraid I am confused. This was the entire output from the job?? If so, then
> > that means mpirun itself wasn't able to find a launch environment it could
> > use, so you never got to the point of actually launching an orted.
> >
> > Do you have ssh in your path?
> > My best immediate guess is that you don't, and
> > that mpirun therefore doesn't see anything it can use to launch a job. We
> > have discussed internally that we need to improve that error message - could
> > be this is another case emphasizing that point.
> >
> > 1.3 is fine to use - still patching some bugs, but nothing that should
> > impact this issue.
> >
> > Ralph
> >
> > On Sep 24, 2008, at 12:11 PM, Will Portnoy wrote:
> >
> > That was the output with plm_base_verbose set to 99 - it's the same
> > output with 1.
> >
> > Yes, I'd like to use ssh.
> >
> > orted wasn't starting properly with orte_launch_agent (which was
> > needed because my environment on the target machine wasn't set up), so
> > that's why I thought I would try it directly on the command line on
> > localhost. I thought this was a simpler case: to verify that orted
> > could find all of its necessary components without the complexity of
> > everything else I'm doing.
> >
> > If I needed to use orte_launch_agent, how should I pass the necessary
> > parameters to start orted after I set up my environment?
> >
> > Am I better off using trunk over 1.3?
> >
> > thank you,
> >
> > Will
> >
> > On Wed, Sep 24, 2008 at 2:01 PM, Ralph Castain <r...@lanl.gov> wrote:
> >
> > Could you rerun that with -mca plm_base_verbose 1? What environment are you
> > in - I assume rsh/ssh?
> >
> > I would like to see the cmd line being used to launch the orted. What this
> > indicates is that we are not getting the cmd line correct. Could just be
> > that some patch in the trunk didn't get completely applied to the 1.3
> > branch.
> >
> > BTW: you probably can't run orted directly off of the cmd line. It likely
> > needs some cmd line params to get critical info.
> >
> > Ralph
> >
> > On Sep 24, 2008, at 9:47 AM, Will Portnoy wrote:
> >
> > I'm trying to use MPI_Comm_spawn with MPI_Info's host key to spawn
> > processes from a process not started with mpirun.
> > This works with the host key set to the localhost's hostname, but it
> > does not work when I use other hosts.
> >
> > I'm using version 1.3a1r19602. I need to use orte_launch_agent to set
> > up my environment a bit before orted is started, but it fails with
> > errors listed below.
> >
> > When I try to run orted directly on the command line with some of the
> > verbosity flags turned to "11", I receive the same messages.
> >
> > Does anybody have any suggestions?
> >
> > thank you,
> >
> > Will
> >
> > [fqdn:24761] mca: base: components_open: Looking for ess components
> > [fqdn:24761] mca: base: components_open: opening ess components
> > [fqdn:24761] mca: base: components_open: found loaded component env
> > [fqdn:24761] mca: base: components_open: component env has no register function
> > [fqdn:24761] mca: base: components_open: component env open function successful
> > [fqdn:24761] mca: base: components_open: found loaded component hnp
> > [fqdn:24761] mca: base: components_open: component hnp has no register function
> > [fqdn:24761] mca: base: components_open: component hnp open function successful
> > [fqdn:24761] mca: base: components_open: found loaded component singleton
> > [fqdn:24761] mca: base: components_open: component singleton has no register function
> > [fqdn:24761] mca: base: components_open: component singleton open function successful
> > [fqdn:24761] mca: base: components_open: found loaded component slurm
> > [fqdn:24761] mca: base: components_open: component slurm has no register function
> > [fqdn:24761] mca: base: components_open: component slurm open function successful
> > [fqdn:24761] mca: base: components_open: found loaded component tool
> > [fqdn:24761] mca: base: components_open: component tool has no register function
> > [fqdn:24761] mca: base: components_open: component tool open function successful
> > [fqdn:24761] mca:base:select: Auto-selecting ess components
> > [fqdn:24761] mca:base:select:( ess) Querying component [env]
> > [fqdn:24761] mca:base:select:( ess) Skipping component [env]. Query failed to return a module
> > [fqdn:24761] mca:base:select:( ess) Querying component [hnp]
> > [fqdn:24761] mca:base:select:( ess) Skipping component [hnp]. Query failed to return a module
> > [fqdn:24761] mca:base:select:( ess) Querying component [singleton]
> > [fqdn:24761] mca:base:select:( ess) Skipping component [singleton]. Query failed to return a module
> > [fqdn:24761] mca:base:select:( ess) Querying component [slurm]
> > [fqdn:24761] mca:base:select:( ess) Skipping component [slurm]. Query failed to return a module
> > [fqdn:24761] mca:base:select:( ess) Querying component [tool]
> > [fqdn:24761] mca:base:select:( ess) Skipping component [tool]. Query failed to return a module
> > [fqdn:24761] mca:base:select:( ess) No component selected!
> > [fqdn:24761] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 125
> > --------------------------------------------------------------------------
> > It looks like orte_init failed for some reason; your parallel process is
> > likely to abort. There are many reasons that a parallel process can
> > fail during orte_init; some of which are due to configuration or
> > environment problems.
> > This failure appears to be an internal failure;
> > here's some additional information (which may only be relevant to an
> > Open MPI developer):
> >
> > orte_ess_base_select failed
> > --> Returned value Not found (-13) instead of ORTE_SUCCESS
> > --------------------------------------------------------------------------
> > [fqdn:24761] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orted/orted_main.c at line 315
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users