IIRC, the correct syntax is: prun -host +e ...

This tells PRRTE that you want empty nodes for this application. You can even
specify how many empty nodes you want: prun -host +e:2 ...
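An untested sketch against your Scenario B below (this assumes the DVM is
already up via "prte --daemonize" on the 2-node allocation, and reuses your
./mpitest example):

    prun --map-by ppr:64:node -n 1 ./mpitest &
    prun -host +e --map-by ppr:64:node -n 1 ./mpitest &

With -host +e on the second prun, the mapper should only consider nodes that
have no procs from any other job, so its rank 0 should land on nid03835
instead of sharing nid03834.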
I haven't tested that in a bit, so please let us know whether it works so we
can fix it if necessary.

As for the queue - we do plan to add a queue to PRRTE in the first quarter of
next year. I wasn't really thinking of a true scheduler - just a FIFO queue
for now.

> On Nov 14, 2020, at 11:52 AM, Alexei Colin via users <users@lists.open-mpi.org> wrote:
>
> Hi, in the context of the PRRTE Distributed Virtual Machine, is there a way
> to tell the task mapper inside prun not to share a node across separate
> prun jobs?
>
> For example, inside a resource allocation from Cobalt/ALPS of 2 nodes with
> 64 cores each:
>
> prte --daemonize
> prun ... &
> ...
> prun ... &
> pterm
>
> Scenario A:
>
> $ prun --map-by ppr:64:node -n 64 ./mpitest &
> $ prun --map-by ppr:64:node -n 64 ./mpitest &
>
> MPI World size = 64 processes
> Hello World from rank 0 running on nid03834 (hostname nid03834)!
> ...
> Hello World from rank 63 running on nid03834 (hostname nid03834)!
>
> MPI World size = 64 processes
> Hello World from rank 0 running on nid03835 (hostname nid03835)!
> ...
> Hello World from rank 63 running on nid03835 (hostname nid03835)!
>
> Scenario B:
>
> $ prun --map-by ppr:64:node -n 1 ./mpitest &
> $ prun --map-by ppr:64:node -n 1 ./mpitest &
>
> MPI World size = 1 processes
> Hello World from rank 0 running on nid03834 (hostname nid03834)!
>
> MPI World size = 1 processes
> Hello World from rank 0 running on nid03834 (hostname nid03834)!
>
> The question is: in Scenario B, how do I tell prun that node nid03834
> should not be used for the second prun job, because this node is already
> (partially) occupied by a different prun job?
>
> Scenario A implies that the DVM already tracks occupancy, so the question
> is just how to tell the mapper to treat a free core on a free node
> differently from a free core on a partially occupied node. The --map-by
> :NOOVERSUBSCRIBE modifier does not look like the answer, since there is no
> oversubscription of cores, right? It would need something like --map-by
> :exclusive:node. If that is not supported, how hard would it be for me to
> patch it?
>
> A potential workaround I can think of is to fill the unoccupied cores on
> partially occupied nodes with dummy jobs, with --host pointing to the
> partially occupied nodes and an -n count matching the number of unoccupied
> cores, but is this even doable? It also requires dumping the mapping from
> each prun, which I am unable to achieve with --map-by :DISPLAY (it works
> with mpirun but not with prun).
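(Inline note on the backfill idea above: it should be doable, though I
haven't tried it here. A rough, untested sketch, using your nid03834 with
its 63 unoccupied cores and "sleep" as a stand-in dummy job:

    prun --host nid03834 -n 63 sleep 3600 &

If -host +e works for you, though, the backfill shouldn't be needed.)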
> Or, run a Flux instance [1] instead of the PRRTE DVM on the resource
> allocation, which seems similar but features a scheduler with a queue (a
> feature proposed for the PRRTE DVM on the list earlier [2]). I am guessing
> that Flux has the flexibility to do this exclusive node mapping, but I am
> not sure.
>
> The DVM is proving to be very useful for dealing with restrictions on the
> minimum node count per job on some HPC clusters, by batching many small
> jobs into one job. A queue would be even more useful, but even without a
> queue it is still useful for batching sets of jobs which are known to fit
> on an allocation in parallel (i.e. without having to wait at all).
>
> [1] https://flux-framework.readthedocs.io/en/latest/quickstart.html
> [2] https://www.mail-archive.com/users@lists.open-mpi.org/msg30692.html
>
> OpenMPI: commit 7a922c8774b184ecb3aa1cd06720390bd9200b50
>          Fri Nov 6 08:48:29 2020 -0800
> PRRTE: commit 37dd45c4d9fe973df1000f1a1421c2718fd80050
>        Fri Nov 6 12:45:38 2020 -0600
>
> Thank you.