Thank you.  I was able to make everything work by using
orte_launch_agent and bash's $@ to pass the necessary parameters to
orted within my shell script.

I needed to add additional paths to my LD_LIBRARY_PATH/PATH variables
for other necessary libraries, which is why I was pushing on the
orte_launch_agent solution.
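
A minimal sketch of that kind of wrapper script is below. The variable
names OMPI_PREFIX and EXTRA_LIB_DIR and both default paths are placeholders
assumed for illustration, not values from this thread. The idea is only to
extend PATH and LD_LIBRARY_PATH and then hand every argument the launcher
supplied on to orted via "$@".

```shell
#!/bin/bash
# Hypothetical launch-agent wrapper: prepare the environment, then run orted.
# OMPI_PREFIX and EXTRA_LIB_DIR are assumed locations; adjust for your site.
OMPI_PREFIX="${OMPI_PREFIX:-/opt/openmpi}"
EXTRA_LIB_DIR="${EXTRA_LIB_DIR:-/opt/mylibs/lib}"

# Put Open MPI's bin directory first so that orted is found on PATH.
PATH="$OMPI_PREFIX/bin${PATH:+:$PATH}"

# Make orted's libraries and the additional libraries visible to the loader.
LD_LIBRARY_PATH="$OMPI_PREFIX/lib:$EXTRA_LIB_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export PATH LD_LIBRARY_PATH

# The parameters meant for orted arrive here as positional arguments;
# "$@" forwards them unchanged, preserving any quoting.
# With no arguments there is nothing to launch, so the script does nothing more.
if [ "$#" -gt 0 ]; then
    exec orted "$@"
fi
```

To use something like this, orte_launch_agent would name the script, for
example with a line such as "orte_launch_agent = /path/to/wrapper.sh" in
mca-params.conf (the path is hypothetical).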

Is there a document that covers the design of Open MPI a bit?  It looks
pretty interesting, and there are quite a few acronyms that I had
trouble finding on the internet (e.g. "ess").

On Wed, Sep 24, 2008 at 3:40 PM, Ralph Castain <r...@lanl.gov> wrote:
> Yes - you don't want to use orte_launch_agent at all for that purpose. What
> you need to set is an info_key in your comm_spawn command for "ompi_prefix",
> with the value set to the install path. The ssh launcher will assemble the
> launch cmd using that info.
> Ralph
>
>
> On Sep 24, 2008, at 1:28 PM, Will Portnoy wrote:
>
> Yes, your first sentence is correct.  I intend to use the unmodified
> orted, but I need to set up the unix environment after the ssh has
> completed but before orted is executed.
>
> In particular, one of the more important tasks for me to do after ssh
> connects is to set LD_LIBRARY_PATH and PATH to include the paths of
> the openmpi's install lib and bin directories, respectively.
> Otherwise, orted will not be on the PATH, and its dependent libraries
> will not be in LD_LIBRARY_PATH.
>
> Is there a recommended method to set LD_LIBRARY_PATH and PATH when ssh
> is used to connect to other hosts when running an mpi job?
>
> thank you,
>
> Will
>
> On Wed, Sep 24, 2008 at 2:36 PM, Ralph Castain <r...@lanl.gov> wrote:
>
> So this is a singleton comm_spawn scenario, that requires you specify a
> launch_agent to execute? Just trying to ensure I understand.
>
> First, let me ensure we have a common understanding of what
> orte_launch_agent does. Basically, that param stipulates the command to be
> used in place of "orted" - it doesn't substitute for "ssh". So if you set
> -mca orte_launch_agent foo, what will happen is: "ssh nodename foo" instead
> of "ssh nodename orted".
>
> The intent was to provide a way to do things like run valgrind on the orted
> itself. So you could do -mca orte_launch_agent "valgrind orted", and we
> would dutifully run "ssh nodename valgrind orted".
>
> Or if you wanted to write your own orted (e.g., bar-orted), you could
> substitute it for our "orted".
>
> Or if you wanted to set mca params solely to be seen on the backend
> nodes/procs, you could set -mca orte_launch_agent "orted -mca foo bar", and
> we would launch "ssh nodename orted -mca foo bar". This allows us to set mca
> params without having mpirun see them - helps us to look at debug output,
> for example, from only the backend procs.
>
> If what you need to do is set something in the environment for the orted,
> there are certain cmd line options that will do that for you -
> orte_launch_agent may or may not be a good method.
>
> Perhaps it would help if you could tell me exactly what you wanted to have
> orte_launch_agent actually do?
>
> Thanks
> Ralph
>
> On Sep 24, 2008, at 12:22 PM, Will Portnoy wrote:
>
> Sorry for the miscommunication: The processes are started by my
> program with MPI_Comm_spawn, so there was no mpirun involved.
>
> If you can suggest a test program I can use with mpirun to validate my
> openmpi environment and install, that would probably produce the
> output you would like to see.
>
> But I'm not sure that will make it clear how the file pointed to by
> "orte_launch_agent" in "mca-params.conf" should be written to set up an
> environment and start orted.
>
> Will
>
> On Wed, Sep 24, 2008 at 2:17 PM, Ralph Castain <r...@lanl.gov> wrote:
>
> Afraid I am confused. This was the entire output from the job?? If so, then
> that means mpirun itself wasn't able to find a launch environment it could
> use, so you never got to the point of actually launching an orted.
>
> Do you have ssh in your path? My best immediate guess is that you don't, and
> that mpirun therefore doesn't see anything it can use to launch a job. We
> have discussed internally that we need to improve that error message - could
> be this is another case emphasizing that point.
>
> 1.3 is fine to use - still patching some bugs, but nothing that should
> impact this issue.
>
> Ralph
>
> On Sep 24, 2008, at 12:11 PM, Will Portnoy wrote:
>
> That was the output with plm_base_verbose set to 99 - it's the same
> output with 1.
>
> Yes, I'd like to use ssh.
>
> orted wasn't starting properly with orte_launch_agent (which was
> needed because my environment on the target machine wasn't set up), so
> that's why I thought I would try it directly on the command line on
> localhost.  I thought this was a simpler case: to verify that orted
> could find all of its necessary components without the complexity of
> everything else I'm doing.
>
> If I needed to use orte_launch_agent, how should I pass the necessary
> parameters to start orted after I set up my environment?
>
> Am I better off using trunk over 1.3?
>
> thank you,
>
> Will
>
> On Wed, Sep 24, 2008 at 2:01 PM, Ralph Castain <r...@lanl.gov> wrote:
>
> Could you rerun that with -mca plm_base_verbose 1? What environment are you
> in - I assume rsh/ssh?
>
> I would like to see the cmd line being used to launch the orted. What this
> indicates is that we are not getting the cmd line correct. Could just be
> that some patch in the trunk didn't get completely applied to the 1.3
> branch.
>
> BTW: you probably can't run orted directly off of the cmd line. It likely
> needs some cmd line params to get critical info.
>
> Ralph
>
> On Sep 24, 2008, at 9:47 AM, Will Portnoy wrote:
>
> I'm trying to use MPI_Comm_spawn with MPI_Info's host key to spawn
> processes from a process not started with mpirun.  This works with the
> host key set to the localhost's hostname, but it does not work when I
> use other hosts.
>
> I'm using version 1.3a1r19602.  I need to use orte_launch_agent to set
> up my environment a bit before orted is started, but it fails with
> errors listed below.
>
> When I try to run orted directly on the command line with some of the
> verbosity flags turned to "11", I receive the same messages.
>
> Does anybody have any suggestions?
>
> thank you,
>
> Will
>
>
> [fqdn:24761] mca: base: components_open: Looking for ess components
> [fqdn:24761] mca: base: components_open: opening ess components
> [fqdn:24761] mca: base: components_open: found loaded component env
> [fqdn:24761] mca: base: components_open: component env has no register function
> [fqdn:24761] mca: base: components_open: component env open function successful
> [fqdn:24761] mca: base: components_open: found loaded component hnp
> [fqdn:24761] mca: base: components_open: component hnp has no register function
> [fqdn:24761] mca: base: components_open: component hnp open function successful
> [fqdn:24761] mca: base: components_open: found loaded component singleton
> [fqdn:24761] mca: base: components_open: component singleton has no register function
> [fqdn:24761] mca: base: components_open: component singleton open function successful
> [fqdn:24761] mca: base: components_open: found loaded component slurm
> [fqdn:24761] mca: base: components_open: component slurm has no register function
> [fqdn:24761] mca: base: components_open: component slurm open function successful
> [fqdn:24761] mca: base: components_open: found loaded component tool
> [fqdn:24761] mca: base: components_open: component tool has no register function
> [fqdn:24761] mca: base: components_open: component tool open function successful
> [fqdn:24761] mca:base:select: Auto-selecting ess components
> [fqdn:24761] mca:base:select:(  ess) Querying component [env]
> [fqdn:24761] mca:base:select:(  ess) Skipping component [env]. Query failed to return a module
> [fqdn:24761] mca:base:select:(  ess) Querying component [hnp]
> [fqdn:24761] mca:base:select:(  ess) Skipping component [hnp]. Query failed to return a module
> [fqdn:24761] mca:base:select:(  ess) Querying component [singleton]
> [fqdn:24761] mca:base:select:(  ess) Skipping component [singleton]. Query failed to return a module
> [fqdn:24761] mca:base:select:(  ess) Querying component [slurm]
> [fqdn:24761] mca:base:select:(  ess) Skipping component [slurm]. Query failed to return a module
> [fqdn:24761] mca:base:select:(  ess) Querying component [tool]
> [fqdn:24761] mca:base:select:(  ess) Skipping component [tool]. Query failed to return a module
> [fqdn:24761] mca:base:select:(  ess) No component selected!
> [fqdn:24761] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 125
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_ess_base_select failed
> --> Returned value Not found (-13) instead of ORTE_SUCCESS
>
> --------------------------------------------------------------------------
> [fqdn:24761] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orted/orted_main.c at line 315
>
> _______________________________________________
>
> users mailing list
>
> us...@open-mpi.org
>
> http://www.open-mpi.org/mailman/listinfo.cgi/users
