Hi,

It works for me :)

Thanks!

Mark

On Fri, 20 Jan 2017, r...@open-mpi.org wrote:

Well, it appears we are already forwarding all envars, which should include 
PATH. Here is the qrsh command line we use:

"qrsh -inherit -nostdin -V"

So would you please try the following patch:

diff --git a/orte/mca/plm/rsh/plm_rsh_component.c b/orte/mca/plm/rsh/plm_rsh_component.c
index 0183bcc..1cc5aa4 100644
--- a/orte/mca/plm/rsh/plm_rsh_component.c
+++ b/orte/mca/plm/rsh/plm_rsh_component.c
@@ -288,8 +288,6 @@ static int rsh_component_query(mca_base_module_t **module, int *priority)
            }
            mca_plm_rsh_component.agent = tmp;
            mca_plm_rsh_component.using_qrsh = true;
-            /* no tree spawn allowed under qrsh */
-            mca_plm_rsh_component.no_tree_spawn = true;
            goto success;
        } else if (!mca_plm_rsh_component.disable_llspawn &&
                   NULL != getenv("LOADL_STEP_ID")) {


On Jan 19, 2017, at 5:29 PM, r...@open-mpi.org wrote:

I’ll create a patch that you can try - if it works okay, we can commit it

On Jan 18, 2017, at 3:29 AM, William Hay <w....@ucl.ac.uk> wrote:

On Tue, Jan 17, 2017 at 09:56:54AM -0800, r...@open-mpi.org wrote:
As I recall, the problem was that qrsh isn't available on the backend compute 
nodes, and so we can't use a tree for launch. If that isn't true, then we 
can certainly adjust it.

qrsh should be available on all nodes of a SoGE cluster but, depending on how 
things are set up, may not be findable (i.e. not in the PATH) when you 
qrsh -inherit into a node. A workaround would be to start backend processes 
with qrsh -inherit -v PATH, which copies the PATH from the master node to the 
slave-node process, or otherwise to pass the location of qrsh from one node to 
another. That of course assumes that qrsh is in the same location on all nodes.
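As a sketch of the workaround above (this helper and its names are mine for illustration, not Open MPI code), the launch command line with and without explicit PATH forwarding might be assembled like this:

```python
# Illustrative sketch only: assembling the qrsh launch argv described
# above. The -v PATH flag copies PATH from the calling (master-node)
# environment to the slave-node process, so qrsh can be found there
# even if the slave's default PATH lacks it. Hypothetical helper,
# not Open MPI's actual implementation.

def build_qrsh_argv(host, command, forward_path=False):
    argv = ["qrsh", "-inherit", "-nostdin"]
    if forward_path:
        # Forward the master node's PATH to the slave-node process.
        argv += ["-v", "PATH"]
    return argv + [host, command]

print(build_qrsh_argv("node002", "orted", forward_path=True))
# → ['qrsh', '-inherit', '-nostdin', '-v', 'PATH', 'node002', 'orted']
```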

I've tested that it is possible to qrsh from the head node of a job to a 
slave node, and then on to another slave node, by this method.

William


On Jan 17, 2017, at 9:37 AM, Mark Dixon <m.c.di...@leeds.ac.uk> wrote:

Hi,

While commissioning a new cluster, I wanted to run HPL across the whole thing 
using openmpi 2.0.1.

I couldn't get it to start on more than 129 hosts under Son of Grid Engine (128 
remote plus the localhost running the mpirun command). openmpi would sit there, 
waiting for all the orteds to check in; however, there were at most 128 qrsh 
processes, therefore at most 128 orteds, therefore waiting a loooong time.

Increasing plm_rsh_num_concurrent beyond the default of 128 gets the job to 
launch.
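As a back-of-the-envelope model of the stall (my own toy sketch, not Open MPI's scheduler code): with tree spawn disabled under qrsh, mpirun alone performs every launch, and since each qrsh -inherit process stays alive for the life of its orted, plm_rsh_num_concurrent becomes a hard ceiling rather than a batch size:

```python
# Toy model of the launch bottleneck described above (hypothetical,
# not Open MPI source). Without tree spawn, mpirun itself runs all
# the qrsh launches; assuming each qrsh persists while its orted
# runs, at most `num_concurrent` daemons can ever be up at once.

def concurrent_launches(num_remote_hosts, num_concurrent, tree_spawn):
    if tree_spawn:
        # Already-launched orteds relay further launches, so one
        # process's concurrency cap no longer limits the whole job.
        return num_remote_hosts
    return min(num_remote_hosts, num_concurrent)

# 129 remote hosts, default cap of 128, qrsh disabling tree spawn:
print(concurrent_launches(129, 128, tree_spawn=False))  # → 128
```

With only 128 of 129 orteds able to run, mpirun waits forever for the last check-in, which matches the behaviour reported above; raising plm_rsh_num_concurrent (or re-enabling tree spawn, as the patch does) lifts the ceiling.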

Is this intentional, please?

Doesn't openmpi use a tree-like startup sometimes - any particular reason it's 
not using it here?
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


--
-------------------------------------------------------------------
Mark Dixon                         Email    : m.c.di...@leeds.ac.uk
Advanced Research Computing (ARC)  Tel (int): 35429
IT Services building               Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-------------------------------------------------------------------
