IIRC, the correct syntax is:

prun -host +e ...

This tells PRRTE that you want empty nodes for this application. You can even 
specify how many empty nodes you want:

prun -host +e:2 ...

I haven't tested that in a bit, so please let us know if it works or not so we 
can fix it if necessary.

As for the queue: we do plan to add one to PRRTE in the first quarter of next
year. We weren't really thinking of a true scheduler - just a FIFO queue for now.


> On Nov 14, 2020, at 11:52 AM, Alexei Colin via users 
> <users@lists.open-mpi.org> wrote:
> 
> Hi, in context of the PRRTE Distributed Virtual Machine, is there a way
> to tell the task mapper inside prun to not share a node across separate
> prun jobs?
> 
> For example, inside a resource allocation from Cobalt/ALPS: 2 nodes with
> 64 cores each:
> 
> prte --daemonize
> prun ... &
> ...
> prun ... &
> pterm
> 
> Scenario A:
> 
> $ prun --map-by ppr:64:node -n 64 ./mpitest &
> $ prun --map-by ppr:64:node -n 64 ./mpitest &
> 
>       MPI World size = 64 processes
>       Hello World from rank 0 running on nid03834 (hostname nid03834)!
>       ...
>       Hello World from rank 63 running on nid03834 (hostname nid03834)!
> 
>       MPI World size = 64 processes
>       Hello World from rank 0 running on nid03835 (hostname nid03835)!
>       ...
>       Hello World from rank 63 running on nid03835 (hostname nid03835)!
> 
> Scenario B:
> 
> $ prun --map-by ppr:64:node -n 1 ./mpitest &
> $ prun --map-by ppr:64:node -n 1 ./mpitest &
> 
>       MPI World size = 1 processes
>       Hello World from rank 0 running on nid03834 (hostname nid03834)!
> 
>       MPI World size = 1 processes
>       Hello World from rank 0 running on nid03834 (hostname nid03834)!
> 
> The question is: in Scenario B, how to tell prun that node nid03834
> should not be used for the second prun job, because that node is already
> (partially) occupied by a different prun instance?
> 
> Scenario A implies that the DVM already tracks occupancy, so the
> question is just how to tell the mapper to treat a free core on a free
> node differently from a free core on a partially occupied node. The
> --map-by :NOOVERSUBSCRIBE option does not look like the answer, since
> there is no oversubscription of cores, right? Would it need something
> like --map-by :exclusive:node? If that is not supported, how hard would
> it be for me to patch it in?
> 
> A potential workaround I can think of is to fill the unoccupied cores on
> partially occupied nodes with dummy jobs, with --host pointing to the
> partially occupied nodes and a -n count matching the number of
> unoccupied cores - but is this even doable? It also requires dumping the
> mapping from each prun, which I am unable to achieve with --map-by
> :DISPLAY (it works with mpirun but not with prun).
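> 
> Concretely, for Scenario B that filler might look something like this
> (untested; ./dummy stands in for any long-running no-op):
> 
> $ prun --host nid03834 -n 63 ./dummy &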
> 
> Or, run a Flux instance [1] instead of the PRRTE DVM on the resource
> allocation, which seems similar but features a scheduler with a queue (a
> feature proposed for the PRRTE DVM on this list earlier [2]). I am
> guessing that Flux has the flexibility to do this exclusive node
> mapping, but I am not sure.
> 
> The DVM is proving to be very useful for dealing with restrictions on
> the minimum node count per job on some HPC clusters, by batching many
> small jobs into one job. A queue would be even more useful, but even
> without one the DVM is still useful for batching sets of jobs that are
> known to fit on an allocation in parallel (i.e. without having to wait
> at all).
> 
> [1] https://flux-framework.readthedocs.io/en/latest/quickstart.html
> [2] https://www.mail-archive.com/users@lists.open-mpi.org/msg30692.html
> 
> OpenMPI: commit 7a922c8774b184ecb3aa1cd06720390bd9200b50
> Fri Nov 6 08:48:29 2020 -0800
> PRRTE: commit 37dd45c4d9fe973df1000f1a1421c2718fd80050
> Fri Nov 6 12:45:38 2020 -0600
> 
> Thank you.

