Well, it appears we are already forwarding all environment variables (via qrsh's -V flag), which should include PATH. Here is the qrsh command line we use:
    qrsh -inherit -nostdin -V

So would you please try the following patch:

diff --git a/orte/mca/plm/rsh/plm_rsh_component.c b/orte/mca/plm/rsh/plm_rsh_component.c
index 0183bcc..1cc5aa4 100644
--- a/orte/mca/plm/rsh/plm_rsh_component.c
+++ b/orte/mca/plm/rsh/plm_rsh_component.c
@@ -288,8 +288,6 @@ static int rsh_component_query(mca_base_module_t **module, int *priority)
     }
     mca_plm_rsh_component.agent = tmp;
     mca_plm_rsh_component.using_qrsh = true;
-    /* no tree spawn allowed under qrsh */
-    mca_plm_rsh_component.no_tree_spawn = true;
     goto success;
 } else if (!mca_plm_rsh_component.disable_llspawn &&
            NULL != getenv("LOADL_STEP_ID")) {

> On Jan 19, 2017, at 5:29 PM, r...@open-mpi.org wrote:
>
> I'll create a patch that you can try - if it works okay, we can commit it
>
>> On Jan 18, 2017, at 3:29 AM, William Hay <w....@ucl.ac.uk> wrote:
>>
>> On Tue, Jan 17, 2017 at 09:56:54AM -0800, r...@open-mpi.org wrote:
>>> As I recall, the problem was that qrsh isn't available on the backend
>>> compute nodes, and so we can't use a tree for launch. If that isn't
>>> true, then we can certainly adjust it.
>>>
>> qrsh should be available on all nodes of a SoGE cluster but, depending on
>> how things are set up, may not be findable (i.e. not in the PATH) when you
>> qrsh -inherit into a node. A workaround would be to start backend processes
>> with qrsh -inherit -v PATH, which will copy the PATH from the master node
>> to the slave node process, or otherwise pass the location of qrsh from one
>> node to another. That of course assumes that qrsh is in the same location
>> on all nodes.
>>
>> I've tested that it is possible to qrsh from the head node of a job to a
>> slave node and then on to another slave node by this method.
>>
>> William
>>
>>>> On Jan 17, 2017, at 9:37 AM, Mark Dixon <m.c.di...@leeds.ac.uk> wrote:
>>>>
>>>> Hi,
>>>>
>>>> While commissioning a new cluster, I wanted to run HPL across the whole
>>>> thing using openmpi 2.0.1.
>>>>
>>>> I couldn't get it to start on more than 129 hosts under Son of Gridengine
>>>> (128 remote plus the localhost running the mpirun command). openmpi would
>>>> sit there, waiting for all the orteds to check in; however, there were
>>>> "only" a maximum of 128 qrsh processes, therefore a maximum of 128
>>>> orteds, and therefore a very long wait.
>>>>
>>>> Increasing plm_rsh_num_concurrent beyond the default of 128 gets the job
>>>> to launch.
>>>>
>>>> Is this intentional, please?
>>>>
>>>> Doesn't openmpi use a tree-like startup sometimes - any particular reason
>>>> it's not using it here?
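P.S. To try it out: apply the patch to an Open MPI source tree and rebuild. A minimal sketch, assuming a release tarball source tree and an install prefix of $HOME/ompi-test (the patch file name and the prefix are placeholders; a git checkout would also need ./autogen.pl before configure):

# Save the diff above as qrsh-tree-spawn.patch, then apply and rebuild:
patch -p1 < qrsh-tree-spawn.patch
./configure --prefix=$HOME/ompi-test && make -j install

# Re-run with PLM verbosity to watch the daemons launch and check in:
mpirun --mca plm_base_verbose 5 hostname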
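For reference, William's PATH-forwarding test can be reproduced by hand from the head node of a running job. A rough sketch, assuming SoGE and using hypothetical slave host names node1 and node2 (both must belong to the job's allocation):

# Hop to node1, forwarding PATH (-v PATH) so qrsh is findable there,
# then hop onward to node2 the same way:
qrsh -inherit -nostdin -v PATH node1 qrsh -inherit -nostdin -v PATH node2 hostname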