Hi all,

When slurm is configured with the following parameters
   TaskPlugin=task/affinity
   TaskPluginParam=Cpusets
srun binds the processes by placing them into different
cpusets, each containing a single core.

For example, "srun -N 2 -n 4" will create 2 cpusets on each of the two
allocated nodes and place one rank in each of them, so every rank is
constrained to a single core.

The issue in that case is in the macro OPAL_PAFFINITY_PROCESS_IS_BOUND (in
opal/mca/paffinity/paffinity.h):
  . opal_paffinity_base_get_processor_info() fills in num_processors with 1
(this is the size of each cpu_set)
  . num_bound is set to 1 too
and since the macro tests num_bound < num_processors, this implies
*bound=false

So the binding is done correctly by slurm but is not detected by Open MPI.

To support the cpuset binding done by slurm, I propose the following patch:

hg diff  opal/mca/paffinity/paffinity.h
diff -r 4d8c8a39b06f opal/mca/paffinity/paffinity.h
--- a/opal/mca/paffinity/paffinity.h    Thu Apr 21 17:38:00 2011 +0200
+++ b/opal/mca/paffinity/paffinity.h    Tue Jul 12 15:44:59 2011 +0200
@@ -218,7 +218,8 @@
                     num_bound++;                                    \
                 }                                                   \
             }                                                       \
-            if (0 < num_bound && num_bound < num_processors) {      \
+            if (0 < num_bound && ((num_processors == 1) ||          \
+                                  (num_bound < num_processors))) {  \
                 *(bound) = true;                                    \
             }                                                       \
         }                                                           \
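
To make the effect of the change easier to check, here is a minimal sketch that isolates only the final comparison of the macro. The helper names is_bound_original() and is_bound_patched() are hypothetical and do not exist in Open MPI; the sketch assumes num_bound and num_processors have already been computed as described above, and it does not model the loop that counts the bound processors.

```c
#include <stdbool.h>

/* Hypothetical helper: mirrors the condition before the patch.
 * A process counts as bound only if it is bound to a strict,
 * non-empty subset of the processors it can see. */
static bool is_bound_original(int num_bound, int num_processors)
{
    return 0 < num_bound && num_bound < num_processors;
}

/* Hypothetical helper: mirrors the condition after the patch.
 * A single visible processor (e.g. a singleton cpuset) is also
 * treated as bound. */
static bool is_bound_patched(int num_bound, int num_processors)
{
    return 0 < num_bound &&
           (num_processors == 1 || num_bound < num_processors);
}
```

With num_bound = 1 and num_processors = 1 (the singleton cpuset case), the original condition evaluates to false and the patched one to true. For num_processors > 1 the two conditions agree, so an unconstrained process bound to all of its processors is still reported as not bound.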

