Sorry for the incredibly late reply. Hopefully, you have already managed to 
find the answer.

I'm not sure what your comm_spawn command looks like, but it appears you 
specified the host in it using the "dash_host" info-key, yes? The problem is 
that this is interpreted the same way as the "-host n001.cluster.com" option on 
an mpiexec cmd line - which means that it 
only allocates _one_ slot to the request. If you are asking to spawn two procs, 
then you don't have adequate resources. One way to check is to only spawn one 
proc with your comm_spawn request and see if that works.
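For example, if your spawn call looks anything like the following (a guess on my 
part - substitute your actual executable name and info key), dropping maxprocs 
from 2 to 1 is a quick test:

    #include <mpi.h>

    /* Guessed reconstruction of the failing call - a bare hostname in the
       "host" info key only buys you one slot on that node. */
    void spawn_one_worker_as_a_test(void)
    {
        MPI_Info info;
        MPI_Comm worker_comm;

        MPI_Info_create(&info);
        MPI_Info_set(info, "host", "n001.cluster.com");  /* bare name => 1 slot */

        /* maxprocs = 1 should succeed; maxprocs = 2 will fail for lack of slots */
        MPI_Comm_spawn("./MyWorker", MPI_ARGV_NULL, 1, info, 0,
                       MPI_COMM_SELF, &worker_comm, MPI_ERRCODES_IGNORE);
        MPI_Info_free(&info);
    }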

If you want to specify the host, then you need to append the number of slots to 
allocate on that host - e.g., "n001.cluster.com:2". 
Of course, you cannot allocate more than the system provided minus the number 
currently in use. There are additional modifiers you can pass to handle 
variable numbers of slots.
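Applied to your example, that would look something like this (again just a 
sketch, assuming the "host" info key, "./MyWorker", and two workers per 
manager):

    #include <mpi.h>

    void spawn_two_workers(void)
    {
        MPI_Info info;
        MPI_Comm worker_comm;

        MPI_Info_create(&info);
        /* the ":2" requests two slots on that node instead of the default one */
        MPI_Info_set(info, "host", "n001.cluster.com:2");

        MPI_Comm_spawn("./MyWorker", MPI_ARGV_NULL, 2, info, 0,
                       MPI_COMM_SELF, &worker_comm, MPI_ERRCODES_IGNORE);
        MPI_Info_free(&info);
    }

If each manager should spawn onto whatever node it happens to be running on, you 
can build that string at runtime from gethostname() plus the ":2" suffix instead 
of hard-coding it.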

HTH
Ralph


On Oct 25, 2019, at 5:30 AM, Mccall, Kurt E. (MSFC-EV41) via users 
<users@lists.open-mpi.org> wrote:

I am trying to launch a number of manager processes, one per node, and then have
each of those managers spawn, on its own node, a number of workers. For this
example, I have 2 managers and 2 workers per manager. I'm following the
instructions at this link
https://stackoverflow.com/questions/47743425/controlling-node-mapping-of-mpi-comm-spawn
to force one manager process per node.

Here is my PBS/Torque qsub command:
 $ qsub -V -j oe -e ./stdio -o ./stdio -f -X -N MyManagerJob -l nodes=2:ppn=3  
MyManager.bash
I expect "-l nodes=2:ppn=3" to reserve 2 nodes with 3 slots on each (one slot
for the manager and the other two for the separately spawned workers). The
first argument is a lower-case L, not a one.

Here is my mpiexec command within the MyManager.bash script:
 mpiexec --enable-recovery --display-map --display-allocation --mca 
mpi_param_check 1 --v --x DISPLAY --np 2  --map-by ppr:1:node  MyManager.exe
I expect "--map-by ppr:1:node" to cause Open MPI to launch exactly one manager
on each node.

When the first worker is spawned via MPI_Comm_spawn(), Open MPI reports:
 ======================   ALLOCATED NODES   ======================
        n002: flags=0x11 slots=3 max_slots=0 slots_inuse=3 state=UP
        n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
=================================================================
--------------------------------------------------------------------------
There are no allocated resources for the application:
  ./MyWorker
that match the requested mapping:
  -host: n001.cluster.com
 Verify that you have mapped the allocated resources properly for the
indicated specification.
--------------------------------------------------------------------------
[n001:14883] *** An error occurred in MPI_Comm_spawn
[n001:14883] *** reported by process [1897594881,1]
[n001:14883] *** on communicator MPI_COMM_SELF
[n001:14883] *** MPI_ERR_SPAWN: could not spawn processes
In the banner above, it clearly states that node n001 has 3 slots reserved
and only one slot in use at the time of the spawn. Not sure why it reports
that there are no resources for it.
I've tried compiling Open MPI 4.0 both with and without Torque support, and
I've tried using an explicit host file (or not), but the error is unchanged.
Any ideas?
 My cluster is running CentOS 7.4 and I am using the Portland Group C++ 
compiler.
