Hi,

In v1.5, when mpirun is called with both the "-bind-to-core" and
"-npersocket" options, and the npersocket value leads to fewer procs
than sockets being allocated on one node, the orted on that node
crashes with a divide-by-zero exception.

Testing environment:
openmpi v1.5
2 nodes with four 8-core sockets each
mpirun -n 10 -bind-to-core -npersocket 2

I was expecting to get:
   . ranks 0-1 : node 0 - socket 0
   . ranks 2-3 : node 0 - socket 1
   . ranks 4-5 : node 0 - socket 2
   . ranks 6-7 : node 0 - socket 3
   . ranks 8-9 : node 1 - socket 0

Instead, everything worked fine on node 0, and the orted on node 1
crashed with the following stack:

[derbeyn@berlin18 ~]$ mpirun --host berlin18,berlin26 -n 10 -bind-to-core -npersocket 2 sleep 900
[berlin26:21531] *** Process received signal ***
[berlin26:21531] Signal: Floating point exception (8)
[berlin26:21531] Signal code: Integer divide-by-zero (1)
[berlin26:21531] Failing at address: 0x7fed13731d63
[berlin26:21531] [ 0] /lib64/libpthread.so.0(+0xf490) [0x7fed15327490]
[berlin26:21531] [ 1] /home_nfs/derbeyn/DISTS/openmpi-v1.5/lib/openmpi/mca_odls_default.so(+0x2d63) [0x7fed13731d63]
[berlin26:21531] [ 2] /home_nfs/derbeyn/DISTS/openmpi-v1.5/lib/libopen-rte.so.3(orte_odls_base_default_launch_local+0xaf3) [0x7fed15e1fe73]
[berlin26:21531] [ 3] /home_nfs/derbeyn/DISTS/openmpi-v1.5/lib/openmpi/mca_odls_default.so(+0x1d10) [0x7fed13730d10]
[berlin26:21531] [ 4] /home_nfs/derbeyn/DISTS/openmpi-v1.5/lib/libopen-rte.so.3(+0x3804d) [0x7fed15e1004d]
[berlin26:21531] [ 5] /home_nfs/derbeyn/DISTS/openmpi-v1.5/lib/libopen-rte.so.3(orte_daemon_cmd_processor+0x4aa) [0x7fed15e1209a]
[berlin26:21531] [ 6] /home_nfs/derbeyn/DISTS/openmpi-v1.5/lib/libopen-rte.so.3(+0x74ee8) [0x7fed15e4cee8]
[berlin26:21531] [ 7] /home_nfs/derbeyn/DISTS/openmpi-v1.5/lib/libopen-rte.so.3(orte_daemon+0x8d8) [0x7fed15e0f268]
[berlin26:21531] [ 8] /home_nfs/derbeyn/DISTS/openmpi-v1.5/bin/orted() [0x4008c6]
[berlin26:21531] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd) [0x7fed14fa7c9d]
[berlin26:21531] [10] /home_nfs/derbeyn/DISTS/openmpi-v1.5/bin/orted() [0x400799]
[berlin26:21531] *** End of error message ***

The reason for this issue is that the npersocket value is taken into
account during the very first phase of mpirun (rmaps/load_balance) to
claim the slots on each node (a sketch of this arithmetic follows the
list below): npersocket() (in rmaps/load_balance/rmaps_lb.c) claims
   . 8 slots on node 0 (4 sockets * 2 per socket)
   . 2 slots on node 1 (10 total ranks - 8 already claimed)
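
For illustration only, here is a minimal standalone sketch of that
slot-claiming arithmetic. The variable names (total_ranks, npersocket,
sockets_per_node, ...) are mine, not the actual rmaps_lb.c identifiers;
it just reproduces the numbers above:

/* Sketch of the per-node slot claim done by the load_balance mapper.
 * Illustrative variable names, not the real rmaps_lb.c code. */
#include <stdio.h>

int main(void)
{
    int total_ranks      = 10;  /* mpirun -n 10           */
    int npersocket       = 2;   /* -npersocket 2          */
    int sockets_per_node = 4;   /* 4 sockets on each node */
    int remaining        = total_ranks;

    for (int node = 0; remaining > 0; node++) {
        /* a node can take at most npersocket * sockets_per_node ranks */
        int claimed = npersocket * sockets_per_node;
        if (claimed > remaining) {
            claimed = remaining;
        }
        printf("node %d: %d slots claimed\n", node, claimed);
        remaining -= claimed;
    }
    return 0;
}

Running it prints 8 slots for node 0 and 2 slots for node 1, which is
exactly what the mapper ends up with in this run.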

But when we reach odls_default_fork_local_proc() (in
odls/default/odls_default_module.c), npersocket is actually recomputed.
Everything works fine on node 0, but on node 1 we have:
   . jobdat->policy has both ORTE_BIND_TO_CORE and ORTE_MAPPING_NPERXXX
   . npersocket is recomputed as follows:
     npersocket = jobdat->num_local_procs/orte_odls_globals.num_sockets
                = 2 / 4 = 0
   . later on, when the starting point is computed:
     logical_cpu = (lrank % npersocket) * jobdat->cpus_per_rank;
     we get the divide-by-zero exception (this arithmetic is isolated in
     the sketch below).
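
For the record, the faulty arithmetic can be reproduced in isolation
with the values from node 1 (this is not the actual odls code, just the
two offending expressions with their inputs hard-coded):

/* Isolated reproduction of the node 1 failure:
 * num_local_procs = 2 (ranks 8-9), num_sockets = 4, cpus_per_rank = 1. */
#include <stdio.h>

int main(void)
{
    int num_local_procs = 2;
    int num_sockets     = 4;
    int cpus_per_rank   = 1;
    int lrank           = 0;   /* local rank of the child being forked */

    int npersocket = num_local_procs / num_sockets;        /* 2 / 4 = 0 */

    /* modulo by zero: raises SIGFPE ("Integer divide-by-zero") on x86 */
    int logical_cpu = (lrank % npersocket) * cpus_per_rank;

    printf("logical_cpu = %d\n", logical_cpu);   /* never reached */
    return 0;
}

Compiled and run on x86_64/Linux, this should die with the same
"Floating point exception" signal as the orted above.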

In my opinion, the problem comes from the fact that we recompute
npersocket on the local nodes instead of storing it in the jobdat
structure (as is done today for the policy, the cpus_per_rank, the
stride, ...).
Recomputing this value leads either to the crash I got, or to silently
wrong mappings: if 4 slots had been claimed on node 1, the result would
have been 1 rank per socket (since the nodes have 4 sockets) instead of
2 ranks on each of the first 2 sockets (see the comparison sketched
below).
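
To make that second failure mode concrete, here is a quick comparison
of the socket each local rank would land on with the recomputed value
versus the requested one. It assumes the socket index is simply
lrank / npersocket, which is a simplified reading of the balancing
logic, not the exact odls code:

/* Hypothetical node with 4 local procs and 4 sockets: compare the socket
 * chosen per local rank when npersocket is recomputed (4/4 = 1) vs when
 * the requested value (2) is used. Simplified socket selection. */
#include <stdio.h>

int main(void)
{
    int num_local_procs = 4;
    int num_sockets     = 4;

    int npersocket_recomputed = num_local_procs / num_sockets; /* 4 / 4 = 1 */
    int npersocket_requested  = 2;                             /* -npersocket 2 */

    for (int lrank = 0; lrank < num_local_procs; lrank++) {
        printf("lrank %d: recomputed -> socket %d, requested -> socket %d\n",
               lrank,
               lrank / npersocket_recomputed,   /* 1 rank per socket  */
               lrank / npersocket_requested);   /* 2 ranks per socket */
    }
    return 0;
}

With the recomputed value the 4 ranks would be spread over sockets 0-3;
with the requested value they end up 2 per socket on sockets 0 and 1.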

The attached patch is a proposed fix implementing my suggestion: the
npersocket value is packed into the jobdat on the HNP side and used
as-is on the local nodes.

The patch applies to v1.5. Waiting for your comments...

Regards,
Nadia

-- 
Nadia Derbey
npersocket should not be recomputed in odls_default_fork_local_proc: a divide-by-zero crash might occur in some particular cases

diff -r ce3749a94a9e orte/mca/odls/base/odls_base_default_fns.c
--- a/orte/mca/odls/base/odls_base_default_fns.c	Fri Nov 04 13:31:18 2011 +0100
+++ b/orte/mca/odls/base/odls_base_default_fns.c	Fri Nov 04 13:55:00 2011 +0100
@@ -352,6 +352,12 @@ int orte_odls_base_default_get_add_procs
         return rc;
     }

+    /* pack the npersocket for this job */
+    if (ORTE_SUCCESS != (rc = opal_dss.pack(data, &map->npersocket, 1, OPAL_INT32))) {
+        ORTE_ERROR_LOG(rc);
+        return rc;
+    }
+
     /* pack the cpus_per_rank for this job */
     if (ORTE_SUCCESS != (rc = opal_dss.pack(data, &map->cpus_per_rank, 1, OPAL_INT16))) {
         ORTE_ERROR_LOG(rc);
@@ -809,6 +815,12 @@ int orte_odls_base_default_construct_chi
         ORTE_ERROR_LOG(rc);
         goto REPORT_ERROR;
     }
+    /* unpack the npersocket for the job */
+    cnt=1;
+    if (ORTE_SUCCESS != (rc = opal_dss.unpack(data, &jobdat->npersocket, &cnt, OPAL_INT32))) {
+        ORTE_ERROR_LOG(rc);
+        goto REPORT_ERROR;
+    }
     /* unpack the cpus/rank for the job */
     cnt=1;
     if (ORTE_SUCCESS != (rc = opal_dss.unpack(data, &jobdat->cpus_per_rank, &cnt, OPAL_INT16))) {
diff -r ce3749a94a9e orte/mca/odls/default/odls_default_module.c
--- a/orte/mca/odls/default/odls_default_module.c	Fri Nov 04 13:31:18 2011 +0100
+++ b/orte/mca/odls/default/odls_default_module.c	Fri Nov 04 13:55:00 2011 +0100
@@ -383,7 +383,7 @@ static int odls_default_fork_local_proc(
                 OPAL_PAFFINITY_CPU_ZERO(mask);
                 if (ORTE_MAPPING_NPERXXX & jobdat->policy) {
                     /* we need to balance the children from this job across the available sockets */
-                    npersocket = jobdat->num_local_procs / orte_odls_globals.num_sockets;
+                    npersocket = jobdat->npersocket;
                     /* determine the socket to use based on those available */
                     if (npersocket < 2) {
                         /* if we only have 1/sock, or we have less procs than sockets,
@@ -578,7 +578,7 @@ static int odls_default_fork_local_proc(
                 }
                 if (ORTE_MAPPING_NPERXXX & jobdat->policy) {
                     /* we need to balance the children from this job across the available sockets */
-                    npersocket = jobdat->num_local_procs / orte_odls_globals.num_sockets;
+                    npersocket = jobdat->npersocket;
                     /* determine the socket to use based on those available */
                     if (npersocket < 2) {
                         /* if we only have 1/sock, or we have less procs than sockets,
diff -r ce3749a94a9e orte/mca/odls/odls_types.h
--- a/orte/mca/odls/odls_types.h	Fri Nov 04 13:31:18 2011 +0100
+++ b/orte/mca/odls/odls_types.h	Fri Nov 04 13:55:00 2011 +0100
@@ -116,6 +116,7 @@ typedef struct orte_odls_job_t {
     orte_app_context_t      **apps;                 /* app_contexts for this job */
     int32_t                 num_apps;               /* number of app_contexts */
     orte_mapping_policy_t   policy;                 /* mapping policy */
+    int32_t                 npersocket;             /* number of ranks/socket */
     int16_t                 cpus_per_rank;          /* number of cpus/rank */
     int16_t                 stride;                 /* step size between cores of multi-core/rank procs */
     orte_job_controls_t     controls;               /* control flags for job */
