Well, it appears we are already forwarding all environment variables (via qrsh's -V flag), which should include PATH. Here is the qrsh command line we use:
    qrsh -inherit -nostdin -V

So would you please try the following patch:

diff --git a/orte/mca/plm/rsh/plm_rsh_component.c b/orte/mca/plm/rsh/plm_rsh_component.c
index 0183bcc..1cc5aa4 100644
--- a/orte/mca/plm/rsh/plm_rsh_component.c
+++ b/orte/mca/plm/rsh/plm_rsh_component.c
@@ -288,8 +288,6 @@ static int rsh_component_query(mca_base_module_t **module, int *priority)
     }
     mca_plm_rsh_component.agent = tmp;
     mca_plm_rsh_component.using_qrsh = true;
-    /* no tree spawn allowed under qrsh */
-    mca_plm_rsh_component.no_tree_spawn = true;
     goto success;
 } else if (!mca_plm_rsh_component.disable_llspawn &&
            NULL != getenv("LOADL_STEP_ID")) {

> On Jan 19, 2017, at 5:29 PM, r...@open-mpi.org wrote:
>
> I'll create a patch that you can try - if it works okay, we can commit it
>
>> On Jan 18, 2017, at 3:29 AM, William Hay <w....@ucl.ac.uk> wrote:
>>
>> On Tue, Jan 17, 2017 at 09:56:54AM -0800, r...@open-mpi.org wrote:
>>> As I recall, the problem was that qrsh isn't available on the backend
>>> compute nodes, and so we can't use a tree for launch. If that isn't
>>> true, then we can certainly adjust it.
>>>
>> qrsh should be available on all nodes of a SoGE cluster but, depending on
>> how things are set up, may not be findable (i.e. not in the PATH) when you
>> qrsh -inherit into a node. A workaround would be to start backend processes
>> with qrsh -inherit -v PATH, which will copy the PATH from the master node
>> to the slave node process, or otherwise pass the location of qrsh from one
>> node to another. That of course assumes that qrsh is in the same location
>> on all nodes.
>>
>> I've tested that it is possible to qrsh from the head node of a job to a
>> slave node and then on to another slave node by this method.
>>
>> William
>>
>>>> On Jan 17, 2017, at 9:37 AM, Mark Dixon <m.c.di...@leeds.ac.uk> wrote:
>>>>
>>>> Hi,
>>>>
>>>> While commissioning a new cluster, I wanted to run HPL across the whole
>>>> thing using openmpi 2.0.1.
>>>>
>>>> I couldn't get it to start on more than 129 hosts under Son of Gridengine
>>>> (128 remote plus the localhost running the mpirun command). openmpi would
>>>> sit there, waiting for all the orteds to check in; however, there were
>>>> "only" a maximum of 128 qrsh processes, therefore a maximum of 128
>>>> orteds, and therefore a very long wait.
>>>>
>>>> Increasing plm_rsh_num_concurrent beyond the default of 128 gets the job
>>>> to launch.
>>>>
>>>> Is this intentional, please?
>>>>
>>>> Doesn't openmpi use a tree-like startup sometimes - any particular reason
>>>> it's not using it here?
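P.S. To try it out: apply the patch to an Open MPI source tree and rebuild. A minimal sketch, assuming a release tarball source tree and an install prefix of $HOME/ompi-test (the patch file name and the prefix are placeholders; a git checkout would also need ./autogen.pl before configure):

# Save the diff above as qrsh-tree-spawn.patch, then apply and rebuild:
patch -p1 < qrsh-tree-spawn.patch
./configure --prefix=$HOME/ompi-test && make -j install

# Re-run with PLM verbosity to watch the daemons launch and check in:
mpirun --mca plm_base_verbose 5 hostname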
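For reference, William's PATH-forwarding test can be reproduced by hand from the head node of a running job. A rough sketch, assuming SoGE and using hypothetical slave host names node1 and node2 (both must belong to the job's allocation):

# Hop to node1, forwarding PATH (-v PATH) so qrsh is findable there,
# then hop onward to node2 the same way:
qrsh -inherit -nostdin -v PATH node1 qrsh -inherit -nostdin -v PATH node2 hostname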