Thank you for the bug report. I filed an issue against PRRTE so this doesn't get
lost; you can follow it here:
  https://github.com/openpmix/prrte/issues/720

Making the rankfile a per-job rather than a per-DVM option may require some
internal plumbing work, so I'm not sure how quickly this will be resolved, but
you can follow the status on that issue.


On Tue, Dec 15, 2020 at 8:40 PM Alexei Colin via users 
<users@lists.open-mpi.org> wrote:
Hi, is there a way to allocate more resources to rank 0 than to
any of the other ranks in the context of PRRTE DVM?

With mpirun (aka. prte) launcher, I can successfully accomplish this
using a rankfile:

    rank 0=+n0 slot=0
    rank 1=+n1 slot=0
    rank 2=+n1 slot=1

    mpirun -n 3 --bind-to none --map-by ppr:2:node:DISPLAY \
        --mca prte_rankfile arankfile ./mpitest

    ========================   JOB MAP   ========================
    Data for JOB [13205,1] offset 0 Total slots allocated 256
        Mapping policy: PPR:NO_USE_LOCAL,NOOVERSUBSCRIBE
        Ranking policy: SLOT  Binding policy: NONE
        Cpu set: N/A  PPR: 2:node  Cpus-per-rank: N/A  Cpu Type: CORE


        Data for node: nid03828     Num slots: 64   Max slots: 0 Num procs: 1
            Process jobid: [13205,1] App: 0 Process rank: 0
            Bound: package[0][core:0]

        Data for node: nid03829     Num slots: 64   Max slots: 0    Num procs: 2
            Process jobid: [13205,1] App: 0 Process rank: 1 Bound: package[0][core:0]
            Process jobid: [13205,1] App: 0 Process rank: 2 Bound: package[0][core:1]

    =============================================================
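
As a quick cross-check of the placement, the same options can launch plain
hostname in place of mpitest (just a sketch; hostname only confirms which node
each process lands on, not its slot):

    # expect the first node's name printed once and the second node's twice
    mpirun -n 3 --bind-to none --map-by ppr:2:node \
        --mca prte_rankfile arankfile hostname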

But I cannot achieve this with explicit prte; prun; pterm. It looks like the
rankfile is associated with the DVM rather than with each prun instance. I can
do this:

    prte --mca prte_rankfile arankfile
    prun ...
    pterm

But that is not useful for running multiple unrelated prun jobs in the same
DVM, each with a different rank count and ranks-per-node (ppr:N:node) count,
and thus each needing its own mapping policy in its own rankfile. (Multiple
pruns in the same DVM are needed to pack multiple subjobs into one resource
manager job, in which a single DVM spans the full allocation.)
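
Conceptually, what I am after is something like the following (purely a
hypothetical sketch of the desired behavior: prun does not accept a per-job
rankfile today, the --daemonize flag is assumed, and the file names are just
illustrative):

    prte --daemonize                                     # one DVM spanning the full allocation (assumed flag)
    prun -n 3  --mca prte_rankfile rankfile_a ./mpitest  # subjob A with its own mapping (not accepted today)
    prun -n 16 --mca prte_rankfile rankfile_b ./mpitest  # subjob B with a different mapping (not accepted today)
    pterm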

The user-specified rankfile is applied to the prte_rankfile global variable by
the rmaps rank_file component, but that component is not loaded by prun (only
by prte, i.e. the DVM owner). Also, prte_ras_base_allocate processes the
prte_rankfile global, but it is not called by prun. Would a patch/hack that
somehow makes these components load for prun be something I could put together
myself? The mapping happens per prun instance, correct? So is it just a matter
of loading the rank file, or are there deeper architectural obstacles?


Separate questions:

2. The prun manpage mentions --rankfile and contains a section about
rankfiles, but the argument is not accepted:

    prun -n 3 --rankfile arankfile ./mpitest
    prun: Error: unknown option "--rankfile"

Meanwhile, the manpage for prte (aka mpirun) does not mention rankfiles at
all, nor the only way I found to specify one: prte/mpirun --mca prte_rankfile
arankfile.

Would you like a PR that moves the rankfile section from the prun manpage to
the prte manpage and mentions the MCA parameter as the way to specify a
rankfile?

P.S. Btw, --pmca and --gpmca, which also appear in the prun manpage, are not
accepted either.


3. How can I provide a "default slot_list" so that the rankfile does not have
to enumerate every rank? (This is exactly the question asked in [1].)

For example, omitting the line for rank 2 results in this error:

    rank 0=+n0 slot=0
    rank 1=+n1 slot=0

    mpirun -n 3 --bind-to none --map-by ppr:2:node:DISPLAY \
        --mca prte_rankfile arankfile ./mpitest

    A rank is missing its location specification:

    Rank:        2
    Rank file:   arankfile

    All processes must have their location specified in the rank file.
    Either add an entry to the file, or provide a default slot_list to
    use for any unspecified ranks.
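
The obvious workaround is to generate an explicit line for every rank; here is
a rough sketch matching the layout above (node names and the two-slots-per-node
assumption are taken from the earlier example):

    # rank 0 alone on the first node, ranks 1..2 on the next node
    echo "rank 0=+n0 slot=0" > arankfile
    for r in 1 2; do
        echo "rank $r=+n1 slot=$((r - 1))" >> arankfile
    done

But a default slot_list would scale better than spelling out every rank.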


4. Is there a way to use a rankfile but not bind to cores? Omitting
'slot' from the rankfile lines is rejected:

    rank 0=+n0
    rank 1=+n1
    rank 2=+n1

Binding is orthogonal to mapping, correct? Would supporting rankfiles
without 'slot' be something I could quickly patch in?

Rationale: binding causes the following error with --map-by ppr:N:node:

    mpirun -n 3 --map-by ppr:2:node:DISPLAY \
        --mca prte_rankfile arankfile ./mpitest

    The request to bind processes could not be completed due to
    an internal error - the locale of the following process was
    not set by the mapper code:

      Process:  [[33295,1],0]

    Please contact the OMPI developers for assistance. Meantime,
    you will still be able to run your application without binding
    by specifying "--bind-to none" on your command line.

Adding '--bind-to none' eliminates the error, but the JOB MAP reports that
processes are bound, which is correct with respect to the rankfile but
contradicts --bind-to none:

    Mapping policy: PPR:NO_USE_LOCAL,NOOVERSUBSCRIBE
    Ranking policy: SLOT Binding policy: NONE
    Cpu set: N/A  PPR: 2:node  Cpus-per-rank: N/A  Cpu Type: CORE

    Data for node: nid03828     Num slots: 64   Max slots: 0 Num procs: 1
            Process jobid: [23033,1] App: 0 Process rank: 0 Bound: package[0][core:0]

    Data for node: nid03829     Num slots: 64   Max slots: 0    Num procs: 2
            Process jobid: [23033,1] App: 0 Process rank: 1 Bound: package[0][core:0]
            Process jobid: [23033,1] App: 0 Process rank: 2 Bound: package[0][core:1]

If I don't pass a rankfile, I do get no binding:

    mpirun --bind-to none -n 3 --map-by ppr:2:node:DISPLAY ./mpitest
    ...
    Mapping policy: BYNODE:NO_USE_LOCAL,NOOVERSUBSCRIBE
    Ranking policy: SLOT Binding policy: NONE
    ...
    Process jobid: [11399,1] App: 0 Process rank: 0 Bound: N/A


Is there any reason for not supporting this unbound configuration when a
rankfile is specified?


[1] 
https://stackoverflow.com/questions/32333785/how-to-provide-a-default-slot-list-in-openmpi-rankfile


-- 
Josh Hursey
IBM Spectrum MPI Developer
