This is a follow-up to https://mail-archive.com/users@lists.open-mpi.org/msg30055.html

Thanks, Matias, for the detailed explanation.


Currently, PSM2_DEVICES is overwritten, so I do not think setting it before invoking mpirun will help.


Also, in this specific case:

- the user is running within a SLURM allocation with 2 nodes

- the user specified a hostfile with 2 distinct nodes


My first impression is that mtl/psm2 could/should handle this properly (only one of these two conditions needs to be met) and *not* set

export PSM2_DEVICES="self,shm"


The patch below:
- does not overwrite PSM2_DEVICES
- does not set PSM2_DEVICES when num_max_procs > num_total_procs

This is suboptimal, but I could not find a way to get the number of orteds. IIRC, MPI_Comm_spawn can have an orted dynamically spawned by passing a host in the MPI_Info. If this host is not part of the hostfile (nor the RM allocation?), then PSM2_DEVICES must be set manually by the user (see the example below).
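
For example (just a sketch of that manual workaround, using the device list Matias mentions below; mpirun's -x option exports the variable to the remote ranks, and ./manager 1 is the reproducer from this thread):

export PSM2_DEVICES="self,shm,hfi"
mpirun -x PSM2_DEVICES -np 1 --hostfile my_hosts ./manager 1

With the patch below applied, the MTL would then leave that value untouched.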


Ralph,

Is there a way to get the number of orteds?
- If I mpirun -np 1 --host n0,n1 ..., orte_process_info.num_nodes is 1 (I wish I could get 2)
- If running in singleton mode, orte_process_info.num_max_procs is 0 (is this a bug or a feature?)

Cheers,

Gilles


diff --git a/ompi/mca/mtl/psm2/mtl_psm2_component.c b/ompi/mca/mtl/psm2/mtl_psm2_component.c
index 26bccd2..52b906b 100644
--- a/ompi/mca/mtl/psm2/mtl_psm2_component.c
+++ b/ompi/mca/mtl/psm2/mtl_psm2_component.c
@@ -14,6 +14,8 @@
  * Copyright (c) 2012-2015 Los Alamos National Security, LLC.
  *                         All rights reserved.
  * Copyright (c) 2013-2016 Intel, Inc. All rights reserved
+ * Copyright (c) 2016      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -170,6 +172,13 @@ get_num_total_procs(int *out_ntp)
 }

 static int
+get_num_max_procs(int *out_nmp)
+{
+  *out_nmp = (int)ompi_process_info.max_procs;
+  return OMPI_SUCCESS;
+}
+
+static int
 get_num_local_procs(int *out_nlp)
 {
     /* num_local_peers does not include us in
@@ -201,7 +210,7 @@ ompi_mtl_psm2_component_init(bool enable_progress_threads,
     int        verno_major = PSM2_VERNO_MAJOR;
     int verno_minor = PSM2_VERNO_MINOR;
     int local_rank = -1, num_local_procs = 0;
-    int num_total_procs = 0;
+    int num_total_procs = 0, num_max_procs = 0;

     /* Compute the total number of processes on this host and our local rank
      * on that node. We need to provide PSM2 with these values so it can
@@ -221,6 +230,11 @@ ompi_mtl_psm2_component_init(bool enable_progress_threads,
                     "Cannot continue.\n");
         return NULL;
     }
+    if (OMPI_SUCCESS != get_num_max_procs(&num_max_procs)) {
+        opal_output(0, "Cannot determine max number of processes. "
+                    "Cannot continue.\n");
+        return NULL;
+    }

     err = psm2_error_register_handler(NULL /* no ep */,
                                     PSM2_ERRHANDLER_NOP);
@@ -230,8 +244,10 @@ ompi_mtl_psm2_component_init(bool enable_progress_threads,
        return NULL;
     }

-    if (num_local_procs == num_total_procs) {
-      setenv("PSM2_DEVICES", "self,shm", 0);
+    if ((num_local_procs == num_total_procs) && (num_max_procs <= num_total_procs)) {
+        if (NULL == getenv("PSM2_DEVICES")) {
+            setenv("PSM2_DEVICES", "self,shm", 0);
+        }
     }

     err = psm2_init(&verno_major, &verno_minor);





On 9/30/2016 12:38 AM, Cabral, Matias A wrote:

Hi Gilles et al.,

You are right, ptl.c is in the PSM2 code. As Ralph mentions, dynamic process support was/is not working in OMPI when using PSM2 because of an issue related to the transport keys. This was fixed in PR #1602 (https://github.com/open-mpi/ompi/pull/1602) and should be included in v2.0.2. HOWEVER, this is not the error Juraj is seeing. The root of the assertion is that the PSM/PSM2 MTLs check where the "original" processes are running and, if they detect that all are local to the node, they ONLY initialize the shared memory device (PSM2_DEVICES="self,shm"). This is to avoid "reserving" HW resources in the HFI card that would not be used unless you later spawn ranks on other nodes. Therefore, to allow dynamic processes to be spawned on other nodes, you need to tell PSM2 to instruct the HW to initialize all the devices by making the environment variable PSM2_DEVICES="self,shm,hfi" available before running the job.
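
For example, one way to make it available to every rank (a sketch using the reproducer command from this thread; -x is mpirun's option for exporting environment variables) is:

mpirun -x PSM2_DEVICES=self,shm,hfi -np 1 -npernode 1 --hostfile my_hosts ./manager 1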

Note that while setting PSM2_DEVICES (*) will solve the assertion below, you will most likely still see the transport key issue if PR #1602 is not included.

Thanks,

_MAC

(*)

PSM2_DEVICES -> Omni-Path

PSM_DEVICES  -> TrueScale

*From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of* r...@open-mpi.org
*Sent:* Thursday, September 29, 2016 7:12 AM
*To:* Open MPI Users <us...@lists.open-mpi.org>
*Subject:* Re: [OMPI users] MPI_Comm_spawn

Ah, that may be why it wouldn't show up in the OMPI code base itself. If that is the case here, then no: OMPI v2.0.1 does not support comm_spawn for PSM. It is fixed in the upcoming v2.0.2.

    On Sep 29, 2016, at 6:58 AM, Gilles Gouaillardet
    <gilles.gouaillar...@gmail.com> wrote:

    Ralph,

    My guess is that ptl.c comes from PSM lib ...

    Cheers,

    Gilles

    On Thursday, September 29, 2016, r...@open-mpi.org wrote:

        Spawn definitely does not work with srun. I don’t recognize
        the name of the file that segfaulted - what is “ptl.c”? Is
        that in your manager program?

            On Sep 29, 2016, at 6:06 AM, Gilles Gouaillardet
            <gilles.gouaillar...@gmail.com> wrote:

            Hi,

            I do not expect spawn to work with direct launch (e.g. srun).

            Do you have PSM (e.g. InfiniPath) hardware? That could be
            linked to the failure.

            Can you please try

            mpirun --mca pml ob1 --mca btl tcp,sm,self -np 1
            --hostfile my_hosts ./manager 1

            and see if it helps?

            Note: if you have the possibility, I suggest you first try
            that without SLURM, and then within a SLURM job.

            Cheers,

            Gilles

            On Thursday, September 29, 2016, juraj2...@gmail.com wrote:

                Hello,

                I am using MPI_Comm_spawn to dynamically create new
                processes from a single manager process. Everything
                works fine when all the processes are running on the
                same node, but imposing a restriction to run only a
                single process per node does not work. Below are the
                errors produced during a multinode interactive session
                and a multinode sbatch job.

                The system I am using is: Linux version
                3.10.0-229.el7.x86_64 (buil...@kbuilder.dev.centos.org)
                (gcc version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC))

                I am using Open MPI 2.0.1

                Slurm is version 15.08.9

                What is preventing my jobs from spawning on multiple
                nodes? Does Slurm require some additional configuration
                to allow it? Is it an issue on the MPI side; does it
                need to be compiled with some special flag (I have
                compiled it with --enable-mpi-fortran=all --with-pmi)?

                The code I am launching is here:
                https://github.com/goghino/dynamicMPI

                The manager tries to launch one new process
                (./manager 1). Below is the error produced by requesting
                each process to be located on a different node
                (interactive session):

                $ salloc -N 2

                $ cat my_hosts

                icsnode37

                icsnode38

                $ mpirun -np 1 -npernode 1 --hostfile my_hosts ./manager 1

                [manager]I'm running MPI 3.1

                [manager]Runing on node icsnode37

                icsnode37.12614Assertion failure at ptl.c:183: epaddr == ((void *)0)

                icsnode38.32443Assertion failure at ptl.c:183: epaddr == ((void *)0)

                [icsnode37:12614] *** Process received signal ***

                [icsnode37:12614] Signal: Aborted (6)

                [icsnode37:12614] Signal code:  (-6)

                [icsnode38:32443] *** Process received signal ***

                [icsnode38:32443] Signal: Aborted (6)

                [icsnode38:32443] Signal code:  (-6)

                The same example as above via sbatch job submission:

                $ cat job.sbatch

                #!/bin/bash

                #SBATCH --nodes=2

                #SBATCH --ntasks-per-node=1

                module load openmpi/2.0.1

                srun -n 1 -N 1 ./manager 1

                $ cat output.o

                [manager]I'm running MPI 3.1

                [manager]Runing on node icsnode39

                srun: Job step aborted: Waiting up to 32 seconds for
                job step to finish.

                [icsnode39:9692] *** An error occurred in MPI_Comm_spawn

                [icsnode39:9692] *** reported by process [1007812608,0]

                [icsnode39:9692] *** on communicator MPI_COMM_SELF

                [icsnode39:9692] *** MPI_ERR_SPAWN: could not spawn
                processes

                [icsnode39:9692] *** MPI_ERRORS_ARE_FATAL (processes
                in this communicator will now abort,

                [icsnode39:9692] ***    and potentially your MPI job)

                In: PMI_Abort(50, N/A)

                slurmstepd: *** STEP 15378.0 ON icsnode39 CANCELLED AT
                2016-09-26T16:48:20 ***

                srun: error: icsnode39: task 0: Exited with exit code 50

                Thanks for any feedback!

                Best regards,

                Juraj
