Hi,
It works for me :)
Thanks!
Mark
On Fri, 20 Jan 2017, r...@open-mpi.org wrote:
Well, it appears we are already forwarding all envars, which should include
PATH. Here is the qrsh command line we use:
"qrsh --inherit --nostdin -V"
So would you please try the following patch:
diff --git a/orte/mca/plm/rsh/plm_rsh_component.c
b/orte/mca/plm/rsh/plm_rsh_component.c
index 0183bcc..1cc5aa4 100644
--- a/orte/mca/plm/rsh/plm_rsh_component.c
+++ b/orte/mca/plm/rsh/plm_rsh_component.c
@@ -288,8 +288,6 @@ static int rsh_component_query(mca_base_module_t **module,
int *priority)
}
mca_plm_rsh_component.agent = tmp;
mca_plm_rsh_component.using_qrsh = true;
- /* no tree spawn allowed under qrsh */
- mca_plm_rsh_component.no_tree_spawn = true;
goto success;
} else if (!mca_plm_rsh_component.disable_llspawn &&
NULL != getenv("LOADL_STEP_ID")) {
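After rebuilding with the patch applied, one way to check whether tree spawn is actually being used under qrsh is to turn up the launcher's verbosity. This is only a sketch; the process count and executable are placeholders for a real SGE job.

```shell
# Hypothetical smoke test inside an SGE parallel-environment job.
# plm_base_verbose makes the rsh/qrsh launch decisions visible in
# the output, including whether tree spawn is attempted.
mpirun --mca plm_base_verbose 5 -np 256 hostname
```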
On Jan 19, 2017, at 5:29 PM, r...@open-mpi.org wrote:
I’ll create a patch that you can try - if it works okay, we can commit it
On Jan 18, 2017, at 3:29 AM, William Hay <w....@ucl.ac.uk> wrote:
On Tue, Jan 17, 2017 at 09:56:54AM -0800, r...@open-mpi.org wrote:
As I recall, the problem was that qrsh isn't available on the backend compute
nodes, and so we can't use a tree for launch. If that isn't true, then we
can certainly adjust it.
qrsh should be available on all nodes of a SoGE cluster but, depending on how things are set up, may not be
findable (i.e. not in the PATH) when you qrsh -inherit into a node. A workaround would be to start backend
processes with qrsh -inherit -v PATH, which copies the PATH from the master node to the slave node
process, or to otherwise pass the location of qrsh from one node to another. That of course assumes that
qrsh is in the same location on all nodes.
I've tested that it is possible to qrsh from the head node of a job to a slave
node and then on to another slave node by this method.
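The hop described above might look something like the following. The node name and command are placeholders, and the qrsh location is illustrative; only the flags come from the discussion.

```shell
# Sketch of the PATH-forwarding workaround: -v PATH exports the
# caller's PATH to the remote side, so the backend process can
# find qrsh for the next hop of a tree launch.  <slave-node> and
# <command> are placeholders for a real job's host and binary.
qrsh -inherit -nostdin -v PATH <slave-node> <command>
```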
William
On Jan 17, 2017, at 9:37 AM, Mark Dixon <m.c.di...@leeds.ac.uk> wrote:
Hi,
While commissioning a new cluster, I wanted to run HPL across the whole thing
using openmpi 2.0.1.
I couldn't get it to start on more than 129 hosts under Son of Gridengine (128 remote
plus the localhost running the mpirun command). openmpi would sit there, waiting for all
the orteds to check in; however, there were "only" a maximum of 128 qrsh
processes, and therefore a maximum of 128 orteds, so it would wait a very long time.
Increasing plm_rsh_num_concurrent beyond the default of 128 gets the job to
launch.
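For reference, the workaround looks like this; 256 and the HPL invocation are just examples, not values from the discussion.

```shell
# plm_rsh_num_concurrent caps how many qrsh/ssh sessions mpirun
# keeps open at once (default 128).  Raising it lets more than
# 128 orteds be launched directly when tree spawn is disabled.
mpirun --mca plm_rsh_num_concurrent 256 -np 4096 ./xhpl
```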
Is this intentional, please?
Doesn't openmpi use a tree-like startup sometimes - any particular reason it's
not using it here?
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
--
-------------------------------------------------------------------
Mark Dixon Email : m.c.di...@leeds.ac.uk
Advanced Research Computing (ARC) Tel (int): 35429
IT Services building Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-------------------------------------------------------------------