Hi,

yesterday I installed openmpi-v1.10.2-176-g9d45e07 on my "SUSE Linux
Enterprise Server 12 (x86_64)" with Sun C 5.13 and gcc-5.3.0.
Unfortunately I have a problem with one of my spawn programs.

loki spawn 129 ompi_info | grep -e "OPAL repo revision" -e "C compiler absolute"
      OPAL repo revision: v1.10.2-176-g9d45e07
     C compiler absolute: /opt/solstudio12.4/bin/cc
loki spawn 130 mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master

Parent process 0 running on loki
  I create 4 slave processes

Parent process 0: tasks in MPI_COMM_WORLD:                    1
                  tasks in COMM_CHILD_PROCESSES local group:  1
                  tasks in COMM_CHILD_PROCESSES remote group: 4

Slave process 1 of 4 running on loki
Slave process 2 of 4 running on loki
Slave process 3 of 4 running on loki
Slave process 0 of 4 running on loki
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 2: argv[0]: spawn_slave
spawn_slave 3: argv[0]: spawn_slave
loki spawn 131 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master

Parent process 0 running on loki
  I create 4 slave processes

[loki:02080] *** Process received signal ***
[loki:02080] Signal: Segmentation fault (11)
[loki:02080] Signal code: Address not mapped (1)
[loki:02080] Failing at address: (nil)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[loki:2073] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[loki:2079] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[loki:02080] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f485c593870]
[loki:02080] [ 1] /usr/local/openmpi-1.10.3_64_cc/lib64/libmpi.so.12(+0x16d4df)[0x7f485c90e4df]
[loki:02080] [ 2] /usr/local/openmpi-1.10.3_64_cc/lib64/libmpi.so.12(ompi_group_increment_proc_count+0x35)[0x7f485c90eee5]
[loki:02080] [ 3] /usr/local/openmpi-1.10.3_64_cc/lib64/libmpi.so.12(ompi_comm_init+0x2fc)[0x7f485c8be9fc]
[loki:02080] [ 4] /usr/local/openmpi-1.10.3_64_cc/lib64/libmpi.so.12(ompi_mpi_init+0xd12)[0x7f485c962942]
[loki:02080] [ 5] /usr/local/openmpi-1.10.3_64_cc/lib64/libmpi.so.12(PMPI_Init+0x1f2)[0x7f485cda7332]
[loki:02080] [ 6] spawn_slave[0x400a89]
[loki:02080] [ 7] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f485c1fdb05]
[loki:02080] [ 8] spawn_slave[0x400952]
[loki:02080] *** End of error message ***
-------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[38824,2],0]
  Exit code:    1
--------------------------------------------------------------------------
loki spawn 132



Everything works fine with spawn_multiple_master.

loki spawn 134 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_multiple_master

Parent process 0 running on loki
  I create 3 slave processes.

Parent process 0: tasks in MPI_COMM_WORLD:                    1
                  tasks in COMM_CHILD_PROCESSES local group:  1
                  tasks in COMM_CHILD_PROCESSES remote group: 2

Slave process 0 of 2 running on loki
...



I have a similar error with openmpi-v2.x-dev-1404-g74d8ea0. My other
spawn programs work more or less as expected, although spawn_intra_comm
doesn't return, so I have to interrupt it with <Ctrl-c>.

loki spawn 124 ompi_info | grep -e "OPAL repo revision" -e "C compiler absolute"
      OPAL repo revision: v2.x-dev-1404-g74d8ea0
     C compiler absolute: /opt/solstudio12.4/bin/cc
loki spawn 125 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master

Parent process 0 running on loki
  I create 4 slave processes

[loki:03931] OPAL ERROR: Timeout in file ../../../../openmpi-v2.x-dev-1404-g74d8ea0/opal/mca/pmix/base/pmix_base_fns.c at line 190
[loki:3931] *** An error occurred in MPI_Comm_spawn
[loki:3931] *** reported by process [2431254529,0]
[loki:3931] *** on communicator MPI_COMM_WORLD
[loki:3931] *** MPI_ERR_UNKNOWN: unknown error
[loki:3931] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[loki:3931] ***    and potentially your MPI job)
loki spawn 126


I would be grateful if somebody could fix the problem. Thank you very
much in advance for any help.


Kind regards

Siegmar
/* The program demonstrates how to spawn some dynamic MPI processes.
 * This version uses one master process which creates some slave
 * processes.
 *
 * A process or a group of processes can create another group of
 * processes with "MPI_Comm_spawn ()" or "MPI_Comm_spawn_multiple ()".
 * In general it is best (better performance) to start all processes
 * statically with "mpiexec" via the command line. If you want to use
 * dynamic processes you will normally have one master process which
 * starts a lot of slave processes. In some cases it may be useful to
 * enlarge a group of processes, e.g., if the MPI universe provides
 * more virtual CPUs than the current number of processes and the
 * program may benefit from additional processes. You will use
 * "MPI_Comm_spawn_multiple ()" if you must start different
 * programs or if you want to start the same program with different
 * parameters.
 *
 * There are some reasons to prefer "MPI_Comm_spawn_multiple ()"
 * instead of calling "MPI_Comm_spawn ()" multiple times. If you
 * spawn new (child) processes they start up like any MPI application,
 * i.e., they call "MPI_Init ()" and can use the communicator
 * MPI_COMM_WORLD afterwards. This communicator contains only the
 * child processes which have been created with the same call of
 * "MPI_Comm_spawn ()"; it is distinct from the MPI_COMM_WORLD
 * of the parent processes and of processes created in other calls
 * of "MPI_Comm_spawn ()". The natural communication mechanism between
 * the groups of parent and child processes is via an
 * inter-communicator which will be returned from the above
 * MPI functions to spawn new processes. The local group of the
 * inter-communicator contains the parent processes and the remote
 * group contains the child processes. The child processes can get
 * the same inter-communicator by calling "MPI_Comm_get_parent ()".
 * Now it is obvious that calling "MPI_Comm_spawn ()" multiple
 * times will create many sets of children with different
 * communicators MPI_COMM_WORLD whereas "MPI_Comm_spawn_multiple ()"
 * creates child processes with a single MPI_COMM_WORLD. Furthermore,
 * spawning several processes in one call may be faster than spawning
 * them sequentially and perhaps even the communication between
 * processes spawned at the same time may be faster than communication
 * between sequentially spawned processes.
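 *
 * A minimal sketch of one call that starts two different programs
 * at the same time (the names "prog_a" and "prog_b" are
 * placeholders, not programs from this example collection):
 *
 *   char     *cmds[2]   = {"prog_a", "prog_b"};
 *   int       nprocs[2] = {2, 3};
 *   MPI_Info  infos[2]  = {MPI_INFO_NULL, MPI_INFO_NULL};
 *   MPI_Comm  intercomm;
 *
 *   MPI_Comm_spawn_multiple (2, cmds, MPI_ARGVS_NULL, nprocs,
 *                            infos, 0, MPI_COMM_WORLD,
 *                            &intercomm, MPI_ERRCODES_IGNORE);
 *
 * All five child processes share a single MPI_COMM_WORLD and form
 * the remote group of "intercomm".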
 *
 * For collective operations it is sometimes easier if all processes
 * belong to the same intra-communicator. You can use the function
 * "MPI_Intercomm_merge ()" to merge the local and remote group of
 * an inter-communicator into an intra-communicator.
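 *
 * A minimal sketch of the merge on the parent side, assuming
 * "intercomm" is the inter-communicator returned by
 * "MPI_Comm_spawn ()" (the program "spawn_intra_comm.c" below
 * shows the complete pattern for both sides):
 *
 *   MPI_Comm intracomm;
 *
 *   MPI_Intercomm_merge (intercomm, 0, &intracomm);
 *
 * The parents pass "high == 0" and therefore get the lower ranks
 * in the merged intra-communicator; the spawned children pass
 * "high == 1" on their side of the merge.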
 * 
 *
 * Compiling:
 *   Store executable(s) into local directory.
 *     mpicc -o <program name> <source code file name>
 *
 *   Store executable(s) into predefined directories.
 *     make
 *
 *   Make program(s) automatically on all specified hosts. You must
 *   edit the file "make_compile" and specify your host names before
 *   you execute it.
 *     make_compile
 *
 * Running:
 *   LAM-MPI:
 *     mpiexec -boot -np <number of processes> <program name>
 *     or
 *     mpiexec -boot \
 *	 -host <hostname> -np <number of processes> <program name> : \
 *	 -host <hostname> -np <number of processes> <program name>
 *     or
 *     mpiexec -boot [-v] -configfile <application file>
 *     or
 *     lamboot [-v] [<host file>]
 *       mpiexec -np <number of processes> <program name>
 *	 or
 *	 mpiexec [-v] -configfile <application file>
 *     lamhalt
 *
 *   OpenMPI:
 *     "host1", "host2", and so on can all have the same name,
 *     if you want to start a virtual computer with some virtual
 *     cpu's on the local host. The name "localhost" is allowed
 *     as well.
 *
 *     mpiexec -np <number of processes> <program name>
 *     or
 *     mpiexec --host <host1,host2,...> \
 *	 -np <number of processes> <program name>
 *     or
 *     mpiexec -hostfile <hostfile name> \
 *	 -np <number of processes> <program name>
 *     or
 *     mpiexec -app <application file>
 *
 * Cleaning:
 *   local computer:
 *     rm <program name>
 *     or
 *     make clean_all
 *   on all specified computers (you must edit the file "make_clean_all"
 *   and specify your host names before you execute it).
 *     make_clean_all
 *
 *
 * File: spawn_master.c			Author: S. Gross
 * Date: 28.09.2013
 *
 */

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define NUM_SLAVES	4		/* create NUM_SLAVES processes	*/
#define SLAVE_PROG	"spawn_slave"	/* slave program name		*/


int main (int argc, char *argv[])
{
  MPI_Comm COMM_CHILD_PROCESSES;	/* inter-communicator		*/
  int	   ntasks_world,		/* # of tasks in MPI_COMM_WORLD	*/
	   ntasks_local,		/* COMM_CHILD_PROCESSES local	*/
	   ntasks_remote,		/* COMM_CHILD_PROCESSES remote	*/
	   mytid,			/* my task id			*/
	   namelen;			/* length of processor name	*/
  char	   processor_name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init (&argc, &argv);
  MPI_Comm_rank (MPI_COMM_WORLD, &mytid);
  MPI_Comm_size (MPI_COMM_WORLD, &ntasks_world);
  /* check that only the master process is running in MPI_COMM_WORLD.   */
  if (ntasks_world > 1)
  {
    if (mytid == 0)
    {
      fprintf (stderr, "\n\nError: Too many processes (only one "
	       "process allowed).\n"
	       "Usage:\n"
	       "  mpiexec %s\n\n",
	       argv[0]);
    }
    MPI_Finalize ();
    exit (EXIT_SUCCESS);
  }
  MPI_Get_processor_name (processor_name, &namelen);
  printf ("\nParent process %d running on %s\n"
	  "  I create %d slave processes\n\n",
	  mytid,  processor_name, NUM_SLAVES);
  MPI_Comm_spawn (SLAVE_PROG, MPI_ARGV_NULL, NUM_SLAVES,
		  MPI_INFO_NULL, 0, MPI_COMM_WORLD,
		  &COMM_CHILD_PROCESSES, MPI_ERRCODES_IGNORE);
  MPI_Comm_size	(COMM_CHILD_PROCESSES, &ntasks_local);
  MPI_Comm_remote_size (COMM_CHILD_PROCESSES, &ntasks_remote);
  printf ("Parent process %d: "
	  "tasks in MPI_COMM_WORLD:                    %d\n"
	  "                  tasks in COMM_CHILD_PROCESSES local "
	  "group:  %d\n"
	  "                  tasks in COMM_CHILD_PROCESSES remote "
	  "group: %d\n\n",
	  mytid, ntasks_world, ntasks_local, ntasks_remote);
  MPI_Comm_free (&COMM_CHILD_PROCESSES);
  MPI_Finalize ();
  return EXIT_SUCCESS;
}
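
For reference, here is a minimal sketch of a slave program that would
produce the output shown above. The real spawn_slave.c is not included
in this message, so the details are an assumption.

/* Minimal sketch of the slave program, reconstructed from the output
 * above. Details may differ from the real spawn_slave.c.
 *
 * File: spawn_slave.c			(hypothetical sketch)
 */

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
  int  ntasks,				/* # of tasks in MPI_COMM_WORLD	*/
       mytid,				/* my task id			*/
       namelen,				/* length of processor name	*/
       i;
  char processor_name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init (&argc, &argv);
  MPI_Comm_rank (MPI_COMM_WORLD, &mytid);
  MPI_Comm_size (MPI_COMM_WORLD, &ntasks);
  MPI_Get_processor_name (processor_name, &namelen);
  printf ("Slave process %d of %d running on %s\n",
          mytid, ntasks, processor_name);
  for (i = 0; i < argc; ++i)
  {
    printf ("spawn_slave %d: argv[%d]: %s\n", mytid, i, argv[i]);
  }
  MPI_Finalize ();
  return EXIT_SUCCESS;
}
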
/* The program demonstrates how to spawn some dynamic MPI processes.
 * This version uses some parent processes which create some slave
 * processes. The slave processes use the same program as the parent
 * processes (determined via argv[0]). All processes will be merged
 * into one intra-communicator.
 *
 * A process or a group of processes can create another group of
 * processes with "MPI_Comm_spawn ()" or "MPI_Comm_spawn_multiple ()".
 * In general it is best (better performance) to start all processes
 * statically with "mpiexec" via the command line. If you want to use
 * dynamic processes you will normally have one master process which
 * starts a lot of slave processes. In some cases it may be useful to
 * enlarge a group of processes, e.g., if the MPI universe provides
 * more virtual CPUs than the current number of processes and the
 * program may benefit from additional processes. You will use
 * "MPI_Comm_spawn_multiple ()" if you must start different
 * programs or if you want to start the same program with different
 * parameters.
 *
 * There are some reasons to prefer "MPI_Comm_spawn_multiple ()"
 * instead of calling "MPI_Comm_spawn ()" multiple times. If you
 * spawn new (child) processes they start up like any MPI application,
 * i.e., they call "MPI_Init ()" and can use the communicator
 * MPI_COMM_WORLD afterwards. This communicator contains only the
 * child processes which have been created with the same call of
 * "MPI_Comm_spawn ()"; it is distinct from the MPI_COMM_WORLD
 * of the parent processes and of processes created in other calls
 * of "MPI_Comm_spawn ()". The natural communication mechanism between
 * the groups of parent and child processes is via an
 * inter-communicator which will be returned from the above
 * MPI functions to spawn new processes. The local group of the
 * inter-communicator contains the parent processes and the remote
 * group contains the child processes. The child processes can get
 * the same inter-communicator by calling "MPI_Comm_get_parent ()".
 * Now it is obvious that calling "MPI_Comm_spawn ()" multiple
 * times will create many sets of children with different
 * communicators MPI_COMM_WORLD whereas "MPI_Comm_spawn_multiple ()"
 * creates child processes with a single MPI_COMM_WORLD. Furthermore,
 * spawning several processes in one call may be faster than spawning
 * them sequentially and perhaps even the communication between
 * processes spawned at the same time may be faster than communication
 * between sequentially spawned processes.
 *
 * For collective operations it is sometimes easier if all processes
 * belong to the same intra-communicator. You can use the function
 * "MPI_Intercomm_merge ()" to merge the local and remote group of
 * an inter-communicator into an intra-communicator.
 * 
 *
 * Compiling:
 *   Store executable(s) into local directory.
 *     mpicc -o <program name> <source code file name>
 *
 *   Store executable(s) into predefined directories.
 *     make
 *
 *   Make program(s) automatically on all specified hosts. You must
 *   edit the file "make_compile" and specify your host names before
 *   you execute it.
 *     make_compile
 *
 * Running:
 *   LAM-MPI:
 *     mpiexec -boot -np <number of processes> <program name>
 *     or
 *     mpiexec -boot \
 *	 -host <hostname> -np <number of processes> <program name> : \
 *	 -host <hostname> -np <number of processes> <program name>
 *     or
 *     mpiexec -boot [-v] -configfile <application file>
 *     or
 *     lamboot [-v] [<host file>]
 *       mpiexec -np <number of processes> <program name>
 *	 or
 *	 mpiexec [-v] -configfile <application file>
 *     lamhalt
 *
 *   OpenMPI:
 *     "host1", "host2", and so on can all have the same name,
 *     if you want to start a virtual computer with some virtual
 *     cpu's on the local host. The name "localhost" is allowed
 *     as well.
 *
 *     mpiexec -np <number of processes> <program name>
 *     or
 *     mpiexec --host <host1,host2,...> \
 *	 -np <number of processes> <program name>
 *     or
 *     mpiexec -hostfile <hostfile name> \
 *	 -np <number of processes> <program name>
 *     or
 *     mpiexec -app <application file>
 *
 * Cleaning:
 *   local computer:
 *     rm <program name>
 *     or
 *     make clean_all
 *   on all specified computers (you must edit the file "make_clean_all"
 *   and specify your host names before you execute it).
 *     make_clean_all
 *
 *
 * File: spawn_intra_comm.c			Author: S. Gross
 * Date: 30.08.2012
 *
 */

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define NUM_SLAVES	2		/* create NUM_SLAVES processes	*/


int main (int argc, char *argv[])
{
  MPI_Comm COMM_ALL_PROCESSES,		/* intra-communicator		*/
	   COMM_CHILD_PROCESSES,	/* inter-communicator		*/
	   COMM_PARENT_PROCESSES;	/* inter-communicator		*/
  int	   ntasks_world,		/* # of tasks in MPI_COMM_WORLD	*/
	   ntasks_local,		/* COMM_CHILD_PROCESSES local	*/
	   ntasks_remote,		/* COMM_CHILD_PROCESSES remote	*/
	   ntasks_all,			/* tasks in COMM_ALL_PROCESSES	*/
	   mytid_world,			/* my task id in MPI_COMM_WORLD	*/
	   mytid_all,			/* id in COMM_ALL_PROCESSES	*/
	   namelen;			/* length of processor name	*/
  char	   processor_name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init (&argc, &argv);
  MPI_Comm_rank (MPI_COMM_WORLD, &mytid_world);
  /* First we must decide whether this program is executed by a parent
   * or a child process, because only a parent is allowed to spawn
   * child processes (otherwise the child process with rank 0 would
   * spawn child processes itself, and so on). "MPI_Comm_get_parent ()"
   * returns the parent inter-communicator for a spawned MPI rank and
   * MPI_COMM_NULL if the process wasn't spawned, i.e., it was started
   * statically via "mpiexec" on the command line.
   */
  MPI_Comm_get_parent (&COMM_PARENT_PROCESSES);
  if (COMM_PARENT_PROCESSES == MPI_COMM_NULL)
  {
    /* All parent processes must call "MPI_Comm_spawn ()" but only
     * the root process (in our case the process with rank 0) will
     * spawn child processes. All other processes of the
     * intra-communicator (in our case MPI_COMM_WORLD) will ignore
     * the values of all arguments before the "root" parameter.
     */
    if (mytid_world == 0)
    {
      printf ("Parent process 0: I create %d slave processes\n",
	      NUM_SLAVES);
    }
    MPI_Comm_spawn (argv[0], MPI_ARGV_NULL, NUM_SLAVES,
		    MPI_INFO_NULL, 0, MPI_COMM_WORLD,
		    &COMM_CHILD_PROCESSES, MPI_ERRCODES_IGNORE);
  }
  /* Merge all processes into one intra-communicator. The "high" flag
   * determines the order of the processes in the intra-communicator.
   * If parent and child processes use the same flag, the order may
   * be arbitrary; otherwise the processes with "high == 0" will have
   * a lower rank than the processes with "high == 1".
   */
  if (COMM_PARENT_PROCESSES == MPI_COMM_NULL)
  {
    /* parent processes							*/
    MPI_Intercomm_merge (COMM_CHILD_PROCESSES, 0, &COMM_ALL_PROCESSES);
  }
  else
  {
    /* spawned child processes						*/
    MPI_Intercomm_merge (COMM_PARENT_PROCESSES, 1, &COMM_ALL_PROCESSES);
  }
  MPI_Comm_size	(MPI_COMM_WORLD, &ntasks_world);
  MPI_Comm_size (COMM_ALL_PROCESSES, &ntasks_all);
  MPI_Comm_rank (COMM_ALL_PROCESSES, &mytid_all);
  MPI_Get_processor_name (processor_name, &namelen);
  /* With the following printf statement every process executing this
   * code will print some lines on the display. The lines may get
   * mixed up because the display is a shared resource. In general
   * only one process (usually the process with rank 0) will print
   * on the display and all other processes will send their messages
   * to this process. Nevertheless, for debugging purposes (or to
   * demonstrate that it is possible) it may be useful if every
   * process prints its own messages.
   */
  if (COMM_PARENT_PROCESSES == MPI_COMM_NULL)
  {
    MPI_Comm_size	 (COMM_CHILD_PROCESSES, &ntasks_local);
    MPI_Comm_remote_size (COMM_CHILD_PROCESSES, &ntasks_remote);
    printf ("\nParent process %d running on %s\n"
	    "    MPI_COMM_WORLD ntasks:              %d\n"
	    "    COMM_CHILD_PROCESSES ntasks_local:  %d\n"
	    "    COMM_CHILD_PROCESSES ntasks_remote: %d\n"
	    "    COMM_ALL_PROCESSES ntasks:          %d\n"
	    "    mytid in COMM_ALL_PROCESSES:        %d\n",
	    mytid_world, processor_name, ntasks_world, ntasks_local,
	    ntasks_remote, ntasks_all, mytid_all);
  }
  else
  {
    printf ("\nChild process %d running on %s\n"
	    "    MPI_COMM_WORLD ntasks:              %d\n"
	    "    COMM_ALL_PROCESSES ntasks:          %d\n"
	    "    mytid in COMM_ALL_PROCESSES:        %d\n",
	    mytid_world, processor_name, ntasks_world, ntasks_all,
	    mytid_all);
  }
  MPI_Finalize ();
  return EXIT_SUCCESS;
}
