[OMPI users] PRRTE DVM: how to tell prun to not share nodes among prun jobs?

2020-11-14 Thread Alexei Colin via users
Hi, in context of the PRRTE Distributed Virtual Machine, is there a way
to tell the task mapper inside prun to not share a node across separate
prun jobs?

For example, inside a resource allocation from Cobalt/ALPS: 2 nodes with
64 cores each:

prte --daemonize
prun ... &
...
prun ... &
pterm

Scenario A:

$ prun --map-by ppr:64:node -n 64 ./mpitest &
$ prun --map-by ppr:64:node -n 64 ./mpitest &

MPI World size = 64 processes
Hello World from rank 0 running on nid03834 (hostname nid03834)!
...
Hello World from rank 63 running on nid03834 (hostname nid03834)!

MPI World size = 64 processes
Hello World from rank 0 running on nid03835 (hostname nid03835)!
...
Hello World from rank 63 running on nid03835 (hostname nid03835)!

Scenario B:

$ prun --map-by ppr:64:node -n 1 ./mpitest &
$ prun --map-by ppr:64:node -n 1 ./mpitest &

MPI World size = 1 processes
Hello World from rank 0 running on nid03834 (hostname nid03834)!

MPI World size = 1 processes
Hello World from rank 0 running on nid03834 (hostname nid03834)!

The question is: in Scenario B, how do I tell prun that node nid03834
should not be used for the second prun job, because that node is already
(partially) occupied by a different prun job?

Scenario A implies that the DVM already tracks occupancy, so the
question is just how to tell the mapper to treat a free core on a free
node differently from a free core on a partially occupied node. The
--map-by :NOOVERSUBSCRIBE modifier does not look like the answer, since
no cores are being oversubscribed, right? Would I need something like
--map-by :exclusive:node? If that is not supported, how hard would it be
for me to patch in?

One potential workaround I can think of is to fill the unoccupied cores
on partially occupied nodes with dummy jobs, using --host pointed at the
partially occupied nodes and a -n count matching the number of
unoccupied cores. But is this even doable? It also requires dumping the
mapping from each prun, which I am unable to achieve with --map-by
:DISPLAY (it works with mpirun but not with prun).
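
For concreteness, a rough and untested sketch of that filler-job idea
(the node name nid03834, the 32/32 core split, and sleep as the dummy
payload are all illustrative assumptions):

# real job happens to land on nid03834 and occupies 32 of its 64 cores
$ prun --map-by ppr:32:node -n 32 ./mpitest &
# filler job pinned to the same node to claim the remaining 32 cores
$ prun --host nid03834 -n 32 sleep 3600 &

The obvious weakness is the hard-coded node name: it presupposes knowing
where the first prun landed, which is exactly the mapping information I
cannot currently get out of prun.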

Or, run a Flux instance [1] instead of the PRRTE DVM on the resource
allocation; Flux seems similar but features a scheduler with a queue (a
feature proposed for the PRRTE DVM on this list earlier [2]). I am
guessing that Flux has the flexibility to do this exclusive node
mapping, but I am not sure.

The DVM is proving to be very useful for dealing with the minimum
node-count-per-job restrictions on some HPC clusters, by batching many
small jobs into one resource-manager job. A queue would be even more
useful, but even without one the DVM is still useful for batching sets
of jobs that are known to fit into the allocation simultaneously (i.e.
without any of them having to wait).
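
For illustration, a minimal sketch of that batching pattern, assuming a
2-node allocation and two made-up applications app_a and app_b:

# inside one Cobalt/ALPS allocation spanning two 64-core nodes
prte --daemonize                          # start the DVM across the allocation
prun --map-by ppr:64:node -n 64 ./app_a & # subjob 1, fills one node
prun --map-by ppr:64:node -n 64 ./app_b & # subjob 2, fills the other node
wait                                      # block until both subjobs finish
pterm                                     # tear down the DVM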

[1] https://flux-framework.readthedocs.io/en/latest/quickstart.html
[2] https://www.mail-archive.com/users@lists.open-mpi.org/msg30692.html

OpenMPI: commit 7a922c8774b184ecb3aa1cd06720390bd9200b50
Fri Nov 6 08:48:29 2020 -0800
PRRTE: commit 37dd45c4d9fe973df1000f1a1421c2718fd80050
Fri Nov 6 12:45:38 2020 -0600

Thank you.


Re: [OMPI users] PRRTE DVM: how to tell prun to not share nodes among prun jobs?

2020-11-14 Thread Alexei Colin via users
On Sat, Nov 14, 2020 at 08:07:47PM +, Ralph Castain via users wrote:
> IIRC, the correct syntax is:
> 
> prun -host +e ...
> 
> This tells PRRTE that you want empty nodes for this application. You can even 
> specify how many empty nodes you want:
> 
> prun -host +e:2 ...
> 
> I haven't tested that in a bit, so please let us know if it works or not so 
> we can fix it if necessary.

Works! Thank you.

$ prun --map-by ppr:64:node --host +e -n 1  ./mpitest &
$ prun --map-by ppr:64:node --host +e -n 1  ./mpitest &

MPI World size = 1 processes
Hello World from rank 0 running on nid03835 (hostname nid03835)!

MPI World size = 1 processes
Hello World from rank 0 running on nid03834 (hostname nid03834)!

Should I PR a patch to the prun manpage to change this:

   -H, -host, --host <list>
 List of hosts on which to invoke processes.

to something like this?:

   -H, -host, --host <list>
 List of hosts on which to invoke processes. Pass
 +e to allocate only onto empty nodes (i.e. nodes none of
 whose cores have been allocated to other prun jobs), or
 +e:N to request at least N empty nodes.


[OMPI users] PRRTE DVM: how to specify rankfile per prun invocation?

2020-12-15 Thread Alexei Colin via users
Hi, is there a way to allocate more resources to rank 0 than to
any of the other ranks in the context of the PRRTE DVM?

With the mpirun (aka prte) launcher, I can successfully accomplish this
using a rankfile:

rank 0=+n0 slot=0
rank 1=+n1 slot=0
rank 2=+n1 slot=1

mpirun -n 3 --bind-to none --map-by ppr:2:node:DISPLAY \
--mca prte_rankfile arankfile ./mpitest

========================   JOB MAP   ========================
Data for JOB [13205,1] offset 0 Total slots allocated 256
Mapping policy: PPR:NO_USE_LOCAL,NOOVERSUBSCRIBE  Ranking policy: SLOT  Binding policy: NONE
Cpu set: N/A  PPR: 2:node  Cpus-per-rank: N/A  Cpu Type: CORE

Data for node: nid03828  Num slots: 64  Max slots: 0  Num procs: 1
    Process jobid: [13205,1] App: 0 Process rank: 0 Bound: package[0][core:0]

Data for node: nid03829  Num slots: 64  Max slots: 0  Num procs: 2
    Process jobid: [13205,1] App: 0 Process rank: 1 Bound: package[0][core:0]
    Process jobid: [13205,1] App: 0 Process rank: 2 Bound: package[0][core:1]

==============================================================

But I cannot achieve this with an explicit prte; prun; pterm sequence.
It looks like the rankfile is associated with the DVM as opposed to each
prun instance. I can do this:

prte --mca prte_rankfile arankfile
prun ...
pterm

But that is not useful for running multiple unrelated prun jobs in the
same DVM that each have a different rank count and ranks-per-node
(ppr:N:node) count, and therefore need their own mapping policy in their
own rankfiles. (Multiple pruns in the same DVM are needed to pack
multiple subjobs into one resource-manager job, in which one DVM spans
the full allocation.)
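
To make the request concrete, the usage I am after would look roughly
like this (purely hypothetical, since prun does not currently honor a
rankfile; the job names and per-job rankfile names are made up):

prte --daemonize
prun -n 3 --map-by ppr:2:node --mca prte_rankfile rankfile_a ./job_a &
prun -n 8 --map-by ppr:4:node --mca prte_rankfile rankfile_b ./job_b &
wait
pterm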

The user-specified rankfile is applied to the prte_rankfile global
variable by the rmaps rank_file component, but that component is not
loaded by prun (only by prte, i.e. the DVM owner). Also,
prte_ras_base_allocate processes the prte_rankfile global, but it is not
called by prun. Would a patch/hack that somehow makes these components
load for prun be doable for me to put together? The mapping is happening
per prun instance, correct? So is it just a matter of loading the
rankfile, or are there deeper architectural obstacles?


Separate questions:

2. The prun man page mentions --rankfile and contains a section about
rankfiles, but the argument is not accepted:

prun -n 3 --rankfile arankfile ./mpitest
prun: Error: unknown option "--rankfile"

Meanwhile, the manpage for prte (aka mpirun) does not mention rankfiles
at all, and in particular does not mention the only way I found to
specify one: prte/mpirun --mca prte_rankfile arankfile.

Would you want a PR that moves the rankfile section from the prun
manpage to the prte manpage and mentions the MCA parameter as the way
to specify a rankfile?

P.S. Btw, --pmca and --gpmca, which appear in the prun manpage, are also not accepted.


3. How does one provide a "default slot_list" so that the rankfile does
not have to enumerate every rank? (This is exactly the question asked
here [1].)

For example, omitting the line for rank 2 results in this error:

rank 0=+n0 slot=0
rank 1=+n1 slot=0

mpirun -n 3 --bind-to none --map-by ppr:2:node:DISPLAY \
--mca prte_rankfile arankfile ./mpitest

A rank is missing its location specification:

Rank:2
Rank file:   arankfile

All processes must have their location specified in the rank file.
Either add an entry to the file, or provide a default slot_list to
use for any unspecified ranks.


4. Is there a way to use a rankfile but not bind to cores? Omitting
'slot' from the lines in the rankfile is rejected:

rank 0=+n0
rank 1=+n1
rank 2=+n1

Binding is orthogonal to mapping, correct? Would support for rankfiles
without 'slot' be something I could quickly patch in?

Rationale: binding causes the following error with --map-by ppr:N:node:

mpirun -n 3 --map-by ppr:2:node:DISPLAY \
--mca prte_rankfile arankfile ./mpitest

The request to bind processes could not be completed due to
an internal error - the locale of the following process was
not set by the mapper code:

  Process:  [[33295,1],0]

Please contact the OMPI developers for assistance. Meantime,
you will still be able to run your application without binding
by specifying "--bind-to none" on your command line.

Adding '--bind-to none' eliminates the error, but the JOB MAP reports
that processes are bound, which is correct w.r.t. the rankfile but
contradicts --bind-to none:

Mapping policy: PPR:NO_USE_LOCAL,NOOVERSUBSCRIBE  Ranking policy: SLOT  Binding policy: NONE
Cpu set: N/A  PPR: 2:node  Cpus-per-rank: N/A  Cpu Type: CORE

Data for node: nid03828  Num slots: 64  Max slots: 0  Num procs: 1
    Process jobid: [23033,1] App: 0 Process rank: 0 Bound: package[0][core:0]

Data for node: nid03829  Num slots: 64  Max slots: 0  Num procs: 2
    Process jobid: [23033,1] App: 0 Process rank: 1

Re: [OMPI users] PRRTE DVM: how to specify rankfile per prun invocation?

2021-01-13 Thread Alexei Colin via users
On Mon, Jan 11, 2021 at 05:25:56PM +, Josh Hursey via users wrote:
> Thank you for the bug report. I filed a bug against PRRTE so this doesn't get 
> lost that you can follow below:
>   https://github.com/openpmix/prrte/issues/720
> 
> Making rankfile a per-job instead of a per-DVM option might require some 
> internal plumbing work. So I'm not sure how quickly this will be resolved, 
> but you can follow the status on that issue.

Thanks. I achieved the asymmetric node allocation among ranks by using
hostfiles instead of a rankfile (with a tiny two-line patch to prrte).
I planned to post the full solution in a reply here (and submit a PR);
I just have not gotten to it yet, so please expect it soon.
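
In the meantime, a rough and untested sketch of the general idea,
assuming prun accepts a per-invocation --hostfile with the usual slots=
syntax (node names reuse the earlier example; the two-line prrte patch
itself is not shown here):

$ cat myhostfile
nid03828 slots=1
nid03829 slots=2

$ prun -n 3 --hostfile myhostfile --map-by slot ./mpitest &

With only one slot advertised on nid03828, rank 0 gets that node to
itself, while ranks 1 and 2 land on nid03829, matching the layout from
the rankfile example above.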