A couple of things you can try:

* add --oversubscribe to your mpirun command line so it doesn't care how many
slots there are

* modify your MPI_Info to be "host", "node0:22" so it thinks there are more
slots available (see the sketch just after this list)
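
For concreteness, here is a minimal C sketch of the second option, reusing
the names from the code quoted below (andmsg, hostid, commc, commd, and
sperr are yours; the 22 is whatever slot count you want to advertise):

    MPI_Info mpinfo;
    MPI_Info_create(&mpinfo);
    /* advertise 22 slots on node0 instead of just naming the host */
    MPI_Info_set(mpinfo, "host", "node0:22");
    MPI_Comm_spawn("andmsg", argv, 1, mpinfo,
                   hostid, commc, &commd, &sperr);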

It's possible that the "host" info processing has a bug in it, but this will
tell us a little more and hopefully get you running. If you want to bind your
processes to cores, then add "--bind-to core" to the command line.
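
Putting those together with your hostfile from below, the command would look
something like this (your -n counts unchanged):

/usr/lib64/openmpi-1.10/bin/mpirun --oversubscribe --bind-to core \
  --hostfile nodes120 -n 1 Rank0Pgm : -n 116 RanknPgm < InputFile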



> On Oct 6, 2017, at 1:35 PM, George Reeke <re...@mail.rockefeller.edu> wrote:
> 
> Dear colleagues,
> I need some help controlling where a process spawned with
> MPI_Comm_spawn goes.  I am running openmpi-1.10 under CentOS 6.7.
> My application is written in C and I am running on a RedBarn
> system with a master node (hardware box) that connects to the
> outside world and two other nodes connected to it via Ethernet and
> InfiniBand.  There are two executable files, one (I'll call it
> "Rank0Pgm") that expects to be rank 0 and does all the I/O and
> the other ("RanknPgm") that only communicates via MPI messages.
> There are two MPI_Comm_spawns that run just after MPI_Init and
> an initial broadcast that shares some setup info, like this:
> MPI_Comm_spawn("andmsg", argv, 1, MPI_INFO_NULL,
>   hostid, commc, &commd, &sperr);
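>   /* arguments: command, argv, maxprocs=1, info, root rank,
>      parent communicator, new intercommunicator, error codes */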
> where "andmsg" is a program that needs to communicate with the
> internet and with all the other processes via a new communicator
> that will be called commd (and another name for the other one).
>   When I run this program with no hostfile and an mpirun line
> something like this on a node with 32 cores:
> /usr/lib64/openmpi-1.10/bin/mpirun -n 1 Rank0Pgm : -n 28 RanknPgm \
>   < InputFile
> everything works fine.  I assume the spawns use 2 of the 3 available
> cores that I did not ask the program to use.
> 
> Now I want to run on the full network, so I make a hostfile like this
> (call it "nodes120"):
> node0 slots=22 max-slots=22
> n0003 slots=40 max-slots=40
> n0004 slots=56 max-slots=56
> where node0 has 24 cores and I am trying to leave room for my two
> spawned processes.  The spawned processes have to be able to contact
> the internet, so I make an MPI_INFO with MPI_Info_create and
> MPI_Info_set(mpinfo, "host", "node0")
> and change the MPI_INFO_NULL in the spawn calls to point to this
> new MPI_Info.  (If I leave the MPI_INFO_NULL I get a different
> error that is probably not of interest here.)
> 
> Now I run the mpirun like above except now with
> "--hostfile nodes120" and "-n 116" after the colon.  Now I get this
> error:
> 
> "There are not enough slots available in the system to satisfy the 1
> slots that were requested by the application:
>  andmsg
> Either request fewer slots for your application, or make more slots
> available for use."
> 
> I get the same error with "max-slots=24" on the first line of the
> hosts file.
> 
> Sorry for the length of all that.  Request for help:  How do I set
> things up to run my rank 0 program and enough copies of RanknPgm to fill
> all but some number of cores on the master hardware node, and all the
> other rank n programs on the other hardware "nodes" (boxes of CPUs)?
> [My application will do best with the default "by slot" scheduling.]
> 
> Suggestions much appreciated.  I am quite convinced my code is OK
> in that it runs as shown above on one hardware box.  It also runs
> on my laptop with 4 cores and "-n 3 RanknPgm", so I guess I don't
> even really need to reserve cores for the two spawned processes.
> I thought of using old-fashioned 'fork' but I really want the
> extra communicators to keep asynchronous messages separated.
> The documentation says overloading is OK by default, so maybe
> something else is wrong here.
> 
> George Reeke
> 
