Hi,

It works for me :)

Thanks!

Mark

On Fri, 20 Jan 2017, r...@open-mpi.org wrote:

Well, it appears we are already forwarding all envars, which should include 
PATH. Here is the qrsh command line we use:

"qrsh -inherit -nostdin -V"

So would you please try the following patch:

diff --git a/orte/mca/plm/rsh/plm_rsh_component.c b/orte/mca/plm/rsh/plm_rsh_component.c
index 0183bcc..1cc5aa4 100644
--- a/orte/mca/plm/rsh/plm_rsh_component.c
+++ b/orte/mca/plm/rsh/plm_rsh_component.c
@@ -288,8 +288,6 @@ static int rsh_component_query(mca_base_module_t **module, int *priority)
            }
            mca_plm_rsh_component.agent = tmp;
            mca_plm_rsh_component.using_qrsh = true;
-            /* no tree spawn allowed under qrsh */
-            mca_plm_rsh_component.no_tree_spawn = true;
            goto success;
        } else if (!mca_plm_rsh_component.disable_llspawn &&
                   NULL != getenv("LOADL_STEP_ID")) {


On Jan 19, 2017, at 5:29 PM, r...@open-mpi.org wrote:

I’ll create a patch that you can try - if it works okay, we can commit it

On Jan 18, 2017, at 3:29 AM, William Hay <w....@ucl.ac.uk> wrote:

On Tue, Jan 17, 2017 at 09:56:54AM -0800, r...@open-mpi.org wrote:
As I recall, the problem was that qrsh isn't available on the backend compute 
nodes, and so we can't use a tree for launch. If that isn't true, then we 
can certainly adjust it.

qrsh should be available on all nodes of a SoGE cluster but, depending on how 
things are set up, may not be findable (i.e. not in the PATH) when you 
qrsh -inherit into a node. A workaround would be to start backend processes 
with qrsh -inherit -v PATH, which copies the PATH from the master node to the 
slave-node process, or otherwise to pass the location of qrsh from one node to 
another. That of course assumes that qrsh is in the same location on all nodes.
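As a sketch of the workaround above (this helper and its names are mine for illustration, not Open MPI code), the launch command line with and without explicit PATH forwarding might be assembled like this:

```python
# Illustrative sketch only: assembling the qrsh launch argv described
# above. The -v PATH flag copies PATH from the calling (master-node)
# environment to the slave-node process, so qrsh can be found there
# even if the slave's default PATH lacks it. Hypothetical helper,
# not Open MPI's actual implementation.

def build_qrsh_argv(host, command, forward_path=False):
    argv = ["qrsh", "-inherit", "-nostdin"]
    if forward_path:
        # Forward the master node's PATH to the slave-node process.
        argv += ["-v", "PATH"]
    return argv + [host, command]

print(build_qrsh_argv("node002", "orted", forward_path=True))
# → ['qrsh', '-inherit', '-nostdin', '-v', 'PATH', 'node002', 'orted']
```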

I've tested that it is possible to qrsh from the head node of a job to a 
slave node, and then on to another slave node, by this method.

William


On Jan 17, 2017, at 9:37 AM, Mark Dixon <m.c.di...@leeds.ac.uk> wrote:

Hi,

While commissioning a new cluster, I wanted to run HPL across the whole thing 
using openmpi 2.0.1.

I couldn't get it to start on more than 129 hosts under Son of Grid Engine (128 
remote plus the localhost running the mpirun command). openmpi would sit there, 
waiting for all the orteds to check in; however, there were at most 128 qrsh 
processes, therefore at most 128 orteds, therefore waiting a loooong time.

Increasing plm_rsh_num_concurrent beyond the default of 128 gets the job to 
launch.
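As a back-of-the-envelope model of the stall (my own toy sketch, not Open MPI's scheduler code): with tree spawn disabled under qrsh, mpirun alone performs every launch, and since each qrsh -inherit process stays alive for the life of its orted, plm_rsh_num_concurrent becomes a hard ceiling rather than a batch size:

```python
# Toy model of the launch bottleneck described above (hypothetical,
# not Open MPI source). Without tree spawn, mpirun itself runs all
# the qrsh launches; assuming each qrsh persists while its orted
# runs, at most `num_concurrent` daemons can ever be up at once.

def concurrent_launches(num_remote_hosts, num_concurrent, tree_spawn):
    if tree_spawn:
        # Already-launched orteds relay further launches, so one
        # process's concurrency cap no longer limits the whole job.
        return num_remote_hosts
    return min(num_remote_hosts, num_concurrent)

# 129 remote hosts, default cap of 128, qrsh disabling tree spawn:
print(concurrent_launches(129, 128, tree_spawn=False))  # → 128
```

With only 128 of 129 orteds able to run, mpirun waits forever for the last check-in, which matches the behaviour reported above; raising plm_rsh_num_concurrent (or re-enabling tree spawn, as the patch does) lifts the ceiling.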

Is this intentional, please?

Doesn't openmpi use a tree-like startup sometimes - any particular reason it's 
not using it here?
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


--
-------------------------------------------------------------------
Mark Dixon                         Email    : m.c.di...@leeds.ac.uk
Advanced Research Computing (ARC)  Tel (int): 35429
IT Services building               Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-------------------------------------------------------------------
