Hi, in the context of the PRRTE Distributed Virtual Machine (DVM), is there
a way to tell the task mapper inside prun not to share a node across
separate prun jobs?

For example, inside a resource allocation from Cobalt/ALPS: 2 nodes with
64 cores each:

prte --daemonize
prun ... &
...
prun ... &
pterm

Scenario A:

$ prun --map-by ppr:64:node -n 64 ./mpitest &
$ prun --map-by ppr:64:node -n 64 ./mpitest &

        MPI World size = 64 processes
        Hello World from rank 0 running on nid03834 (hostname nid03834)!
        ...
        Hello World from rank 63 running on nid03834 (hostname nid03834)!

        MPI World size = 64 processes
        Hello World from rank 0 running on nid03835 (hostname nid03835)!
        ...
        Hello World from rank 63 running on nid03835 (hostname nid03835)!

Scenario B:

$ prun --map-by ppr:64:node -n 1 ./mpitest &
$ prun --map-by ppr:64:node -n 1 ./mpitest &

        MPI World size = 1 processes
        Hello World from rank 0 running on nid03834 (hostname nid03834)!

        MPI World size = 1 processes
        Hello World from rank 0 running on nid03834 (hostname nid03834)!

The question is: in Scenario B, how do I tell prun that node nid03834
should not be used for the second prun job, because this node is already
(partially) occupied by a different prun job?

Scenario A implies that the DVM already tracks occupancy, so the question
is just how to tell the mapper to treat a free core on a free node
differently from a free core on a partially occupied node. The --map-by
:NOOVERSUBSCRIBE modifier does not look like the answer, since there is no
oversubscription of cores here, right? It seems I would need something like
--map-by :exclusive:node. If that is not supported, how hard would it be
for me to patch it in?
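
For illustration, the behaviour I am after could presumably be emulated by
pinning each prun to a distinct node by hand (assuming --host behaves the
same under prun as it does under mpirun), but that gives up exactly the
automatic placement that makes the DVM convenient:

$ prun --host nid03834 -n 1 ./mpitest &
$ prun --host nid03835 -n 1 ./mpitest &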

One potential workaround I can think of is to fill the unoccupied cores on
partially occupied nodes with dummy jobs, using --host pointing at those
nodes and a -n count matching the number of unoccupied cores (a sketch
follows below). But is this even doable? It also requires dumping the
mapping from each prun, which I have not been able to achieve with --map-by
:DISPLAY (it works with mpirun but not with prun).
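
A minimal sketch of that dummy-job idea, assuming the partially occupied
node and its free-core count are known by some other means, and assuming
the DVM charges the --host job against that node's slots (the sleep job is
just hypothetical filler):

$ prun --map-by ppr:64:node -n 1 ./mpitest &    # occupies 1 core on nid03834
$ prun --host nid03834 -n 63 sleep 1000000 &    # pad the remaining 63 cores
$ prun --map-by ppr:64:node -n 1 ./mpitest &    # should now land on nid03835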

Alternatively, I could run a Flux instance [1] instead of the PRRTE DVM on
the resource allocation. Flux seems similar, but features a scheduler with
a queue (a feature proposed for the PRRTE DVM on this list earlier [2]). I
am guessing that Flux has the flexibility to do this exclusive node
mapping, but I am not sure.
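
Very roughly, I imagine that would look something like the following (the
exact flux-core commands, and whether this actually gives node-exclusive
placement, are my assumptions and not something I have tested):

$ flux start ./run_jobs.sh    # Flux instance spanning the allocation

where run_jobs.sh contains, e.g.:

flux mini submit -N 1 -n 64 ./mpitest
flux mini submit -N 1 -n 1 ./mpitest
flux queue drain              # block until the queue is empty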

The DVM is proving to be very useful for dealing with the minimum node
count per job imposed on some HPC clusters, by batching many small jobs
into one larger job. A queue would be even more useful, but even without
one the DVM is still handy for running sets of jobs in parallel when they
are known to fit on the allocation simultaneously (i.e. without any of
them having to wait); a sketch of that pattern follows below.
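
For context, that pattern is roughly the following (a sketch; here the two
64-rank jobs are known to fit side by side on the 2-node allocation):

prte --daemonize
prun --map-by ppr:64:node -n 64 ./mpitest > job1.out &
prun --map-by ppr:64:node -n 64 ./mpitest > job2.out &
wait
pterm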

[1] https://flux-framework.readthedocs.io/en/latest/quickstart.html
[2] https://www.mail-archive.com/users@lists.open-mpi.org/msg30692.html

OpenMPI: commit 7a922c8774b184ecb3aa1cd06720390bd9200b50
Fri Nov 6 08:48:29 2020 -0800
PRRTE: commit 37dd45c4d9fe973df1000f1a1421c2718fd80050
Fri Nov 6 12:45:38 2020 -0600

Thank you.
