I would like to mpirun across nodes that do not share a filesystem and
might have the executable in different directories. For example, node0
has the executable at /tmp/job42/mpitest and node1 has it at
/tmp/job100/mpitest.

Suppose I have an ssh wrapper script (set as the orte/plm_rsh_agent**)
that cds to the directory containing the executable on each worker node
before launching orted. Is there a way to tell the worker nodes' orted
processes to run the executable from the current working directory
rather than from the absolute path that (I presume) the head node
process advertises? I've tried setting orte_remote_tmpdir_base
separately for each worker orted process, but then I get an error about
having both global_tmpdir and remote_tmpdir set. If I then set
local_tmpdir to match the head node, I'm back at square one.
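
To make the wrapper idea concrete, here is a minimal Python sketch of
the command rewriting I have in mind. The host-to-directory table and
both function names are hypothetical, made up for illustration. It also
assumes the agent is invoked with the hostname followed by the orted
command line, and it joins those arguments with spaces the way ssh
itself does before handing them to the remote shell.

```python
import shlex

# Hypothetical per-host sandbox directories, taken from the example above.
# In practice these would have to be discovered at run time.
SANDBOX_DIRS = {"node0": "/tmp/job42", "node1": "/tmp/job100"}


def build_remote_command(host, remote_argv, sandbox_dirs=SANDBOX_DIRS):
    """Return the command string the remote shell should run on `host`.

    The original command is left untouched apart from being prefixed
    with a `cd` into that host's sandbox directory.
    """
    workdir = sandbox_dirs[host]
    # Quote only the directory; the rest is joined with spaces, which is
    # what ssh does with its trailing arguments anyway.
    return "cd " + shlex.quote(workdir) + " && " + " ".join(remote_argv)


def build_ssh_argv(host, remote_argv, sandbox_dirs=SANDBOX_DIRS):
    """Return the argv for the real ssh call made by the wrapper."""
    return ["ssh", host, build_remote_command(host, remote_argv, sandbox_dirs)]
```

With this, orted starts in the right directory on every node, but it
still receives the head node's absolute path to the executable, which
is the part I don't know how to override.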

I know this sounds fairly convoluted, but I'm updating helper scripts
for HTCondor so that its parallel universe can work with newer MPI
versions (dealing with similar headaches trying to get hydra to
cooperate). The default behavior is for condor to place each "job"
(i.e. sshd+orted process) in a sandbox, and we cannot know the name of
the sandbox directories ahead of time or assume that they will have
the same name across nodes. The easiest way to deal with this is to
assume the executable lies on a shared fs, but the fewer assumptions
from our POV the better. (Even better would be if someone /really/
wanted to build in condor support, as has been done for other
launchers; that's beyond me right now.)

**Also, which parameter is the correct one for setting the rsh agent?
ompi_info (and mpirun) say orte_rsh_agent is deprecated, but the online
docs seem to suggest that plm_rsh_agent is the deprecated one. I'm using
version 1.8.1.

Thanks for any insight you can provide.

Jason Patton
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
