Re: [OMPI users] [ORTE] Connecting back to parent - Forcing tcp port

2021-01-11 Thread Vincent via users

On 07/01/2021 19:51, Josh Hursey via users wrote:
I posted a fix for the static ports issue (currently on the v4.1.x 
branch):

https://github.com/open-mpi/ompi/pull/8339

If you have time do you want to give it a try and confirm that it 
fixes your issue?



Hello Josh

Definitely yes! It does not crash anymore, and I can see through 
ss/netstat that the orted process is connecting to the port I specified. 
Good work. Thank you.


I wish you a happy 2021.

Regards

Vincent.




Thanks,
Josh


On Tue, Dec 22, 2020 at 2:44 AM Vincent wrote:


On 18/12/2020 23:04, Josh Hursey wrote:

Vincent,

Thanks for the details on the bug. Indeed this is a case that
seems to have been a problem for a little while now when you
use static ports with ORTE (-mca oob_tcp_static_ipv4_ports
option). It must have crept in when we refactored the internal
regular expression mechanism for the v4 branches (and now that I
look maybe as far back as v3.1). I just hit this same issue in
the past day or so working with a different user.

Though I do not have a suggestion for a workaround at this time
(sorry), I did file a GitHub issue and am looking into it.
With the holiday I don't know when I will have a fix, but you can
watch the ticket for updates.
https://github.com/open-mpi/ompi/issues/8304

In the meantime, you could try the v3.0 series release (which
predates this change) or the current Open MPI master branch
(which approaches this a little differently). The same command
line should work in both. Both can be downloaded from the links
below:
https://www.open-mpi.org/software/ompi/v3.0/
https://www.open-mpi.org/nightly/master/

Hello Josh

Thank you for considering the problem. I will certainly keep
watching the ticket. However, there is nothing really urgent (to
me anyway).



Regarding your command line, it looks pretty good:
  orterun --launch-agent /home/boubliki/openmpi/bin/orted -mca
btl tcp --mca btl_tcp_port_min_v4 6706 --mca
btl_tcp_port_range_v4 10 --mca oob_tcp_static_ipv4_ports 6705
-host node2:1 -np 1 /path/to/some/program arg1 .. argn

I would suggest, while you are debugging this, that you use a
program like /bin/hostname instead of a real MPI program. If
/bin/hostname launches properly then move on to an MPI program.
That will assure you that the runtime wired up correctly
(oob/tcp), and then we can focus on the MPI side of the
communication (btl/tcp). You will want to change "-mca btl tcp"
to at least "-mca btl tcp,self" (or better "-mca btl
tcp,vader,self" if you want shared memory). 'self' is the
loopback interface in Open MPI.

Yes, this is actually what I did. I just wanted to keep the report
generic, without too much flourish.
But it is a good reminder for new users, helping them
understand the real purpose of each layer in an MPI implementation.
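
(Editorial note for readers of the archive: once /bin/hostname launches
cleanly, a minimal MPI program along these lines can confirm the btl side
as well. This is only an illustrative sketch, not part of the original
exchange; the file name is hypothetical.)

    /* hello_host.c -- minimal MPI sanity check (illustrative only).
       Build: mpicc hello_host.c -o hello_host
       Run it with the same orterun/mpirun options used for /bin/hostname. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &len);

        /* One line per rank exercises both the runtime wire-up (oob/tcp)
           and the MPI transport selected with -mca btl (e.g. tcp,vader,self). */
        printf("rank %d of %d on %s\n", rank, size, name);

        MPI_Finalize();
        return 0;
    }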



Is there a reason that you are specifying the --launch-agent to
the orted? Is it installed in a different path on the remote
nodes? If Open MPI is installed in the same location on all nodes
then you shouldn't need that.

I recompiled the sources, activating
--enable-orterun-prefix-by-default when running ./configure. Of
course, it helps :)

Again, thank you.

Kind regards

Vincent.




Thanks,
Josh




--
Josh Hursey
IBM Spectrum MPI Developer




Re: [OMPI users] Timeout in MPI_Bcast/MPI_Barrier?

2021-01-11 Thread Daniel Torres via users

Hi.

Thanks for responding. I have taken the most important parts from my 
code and I created a test that reproduces the behavior I described 
previously.


I attach to this e-mail the compressed file "test.tar.gz". Inside it, 
you can find:


1.- The .c source code "test.c", which I compiled with "mpicc -g -O3 
test.c -o test -lm". The main work is performed in the function 
"work_on_grid", starting at line 162.
2.- Four execution examples on two different machines (my own and a 
cluster machine), which I executed with "mpiexec -np 16 --machinefile 
hostfile --map-by node --mca btl tcp,vader,self --mca btl_base_verbose 
100 ./test 4096 4096", varying the last two arguments with 4096, 8192 
and 16384 (a matrix size). The error appears with the bigger sizes (8192 
on my machine, 16384 on the cluster).

3.- The "ompi_info -a" output from the two machines.
4.- The hostfile.

The duration of the delay is just a few seconds, about 3 ~ 4.

Essentially, the first error message I get from a waiting process is 
"74: MPI_ERR_PROC_FAILED: Process Failure".


Hope this information can help.

Thanks a lot for your time.

On 08/01/21 at 18:40, George Bosilca via users wrote:

Daniel,

There are no timeouts in OMPI with the exception of the initial 
connection over TCP, where we use the socket timeout to prevent 
deadlocks. As you already did quite a few communicator duplications 
and other collective communications before you see the timeout, we 
need more info about this. As Gilles indicated, having the complete 
output might help. What is the duration of the delay for the waiting 
process ? Also, can you post a replicator of this issue ?


  George.


On Fri, Jan 8, 2021 at 9:03 AM Gilles Gouaillardet via users 
<users@lists.open-mpi.org> wrote:


Daniel,

Can you please post the full error message and share a reproducer for
this issue?

Cheers,

Gilles

On Fri, Jan 8, 2021 at 10:25 PM Daniel Torres via users
<users@lists.open-mpi.org> wrote:
>
> Hi all.
>
> Actually I'm implementing an algorithm that creates a process
grid and divides it into row and column communicators as follows:
>
>              col_comm0    col_comm1    col_comm2 col_comm3
> row_comm0    P0           P1           P2        P3
> row_comm1    P4           P5           P6        P7
> row_comm2    P8           P9           P10       P11
> row_comm3    P12          P13          P14       P15
>
> Then, every process works on its own column communicator and
broadcast data on row communicators.
> While column operations are being executed, processes not
included in the current column communicator just wait for results.
>
> At some moment, a column communicator could be split to create a
temporary communicator and allow only the right processes to work on it.
>
> At the end of a step, a call to MPI_Barrier (on a duplicate of
MPI_COMM_WORLD) is executed to sync all processes and avoid bad
results.
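
(Editorial note for readers of the archive: the layout described above
boils down to two MPI_Comm_split calls plus a duplicated world
communicator. A minimal sketch, assuming the 4x4 grid shown earlier; this
is not the attached test.c, and the file name is hypothetical.)

    /* grid_sketch.c -- illustrative sketch of the described 4x4 grid. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Comm row_comm, col_comm, world_dup;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* 4x4 grid: row index = rank / 4, column index = rank % 4. */
        MPI_Comm_split(MPI_COMM_WORLD, rank / 4, rank, &row_comm);
        MPI_Comm_split(MPI_COMM_WORLD, rank % 4, rank, &col_comm);
        MPI_Comm_dup(MPI_COMM_WORLD, &world_dup);

        /* ... work on col_comm, then MPI_Bcast(..., row_comm) ... */

        MPI_Barrier(world_dup);   /* end-of-step synchronization */

        MPI_Comm_free(&row_comm);
        MPI_Comm_free(&col_comm);
        MPI_Comm_free(&world_dup);
        MPI_Finalize();
        return 0;
    }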
>
> With a small amount of data (a small matrix) the MPI_Barrier
call syncs correctly on the communicator that includes all
processes and processing ends fine.
> But when the amount of data (a big matrix) increases,
operations on column communicators take more time to finish, and
hence the waiting time also increases for the waiting processes.
>
> After some time, the waiting processes return an error when they
have not received the broadcast (MPI_Bcast) on row communicators
or when they have finished their work at the sync point
(MPI_Barrier). But when the operations on the current column
communicator end, the still-active processes try to broadcast on
row communicators, and they fail because the waiting processes have
returned an error. So all processes fail at different moments in time.
>
> So my problem is that waiting processes "believe" that the
current operations have failed (but they have not finished yet!)
and they fail too.
>
> So I have a question about MPI_Bcast/MPI_Barrier:
>
> Is there a way to increase the timeout a process can wait for a
broadcast or barrier to be completed?
>
> Here is my machine and OpenMPI info:
> - OpenMPI version: Open MPI 4.1.0u1a1
> - OS: Linux Daniel 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15
10:57:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
>
> Thanks in advance for reading my description/question.
>
> Best regards.
>
> --
> Daniel Torres
> LIPN - Université Sorbonne Paris Nord


--
Daniel Torres
LIPN - Université Sorbonne Paris Nord



test.tar.gz
Description: application/gzip


Re: [OMPI users] PRRTE DVM: how to specify rankfile per prun invocation?

2021-01-11 Thread Josh Hursey via users
Thank you for the bug report. I filed an issue against PRRTE so this doesn't get 
lost; you can follow it below:
  https://github.com/openpmix/prrte/issues/720

Making the rankfile a per-job option instead of a per-DVM option might require some 
internal plumbing work, so I'm not sure how quickly this will be resolved, but 
you can follow the status on that issue.


On Tue, Dec 15, 2020 at 8:40 PM Alexei Colin via users 
<users@lists.open-mpi.org> wrote:
Hi, is there a way to allocate more resources to rank 0 than to
any of the other ranks in the context of PRRTE DVM?

With mpirun (aka. prte) launcher, I can successfully accomplish this
using a rankfile:

    rank 0=+n0 slot=0
    rank 1=+n1 slot=0
    rank 2=+n1 slot=1

    mpirun -n 3 --bind-to none --map-by ppr:2:node:DISPLAY \
        --mca prte_rankfile arankfile ./mpitest

    ========================   JOB MAP   ========================
    Data for JOB [13205,1] offset 0 Total slots allocated 256
        Mapping policy: PPR:NO_USE_LOCAL,NOOVERSUBSCRIBE  Ranking policy:
        SLOT Binding policy: NONE
        Cpu set: N/A  PPR: 2:node  Cpus-per-rank: N/A  Cpu Type: CORE


        Data for node: nid03828     Num slots: 64   Max slots: 0 Num procs: 1
            Process jobid: [13205,1] App: 0 Process rank: 0
            Bound: package[0][core:0]

        Data for node: nid03829     Num slots: 64   Max slots: 0    Num procs: 2
            Process jobid: [13205,1] App: 0 Process rank: 1 Bound: 
package[0][core:0]
            Process jobid: [13205,1] App: 0 Process rank: 2 Bound: 
package[0][core:1]

    ==============================================================

But, I cannot achieve this with explicit prte; prun; pterm.  It looks
like rankfile is associated with the DVM as opposed to each prun
instance. I can do this: 

    prte --mca prte_rankfile arankfile
    prun ...
    pterm

But it is not useful for running multiple unrelated prun jobs in the same
DVM that each have a different rank count and ranks-per-node
(ppr:N:node) count, and so need their own mapping policy in their own
rankfiles. (Multiple pruns in the same DVM are needed to pack multiple
subjobs into one resource manager job, in which one DVM spans the full
allocation.)

The user-specified rankfile is applied to the prte_rankfile global var
by the rmaps rank_file component, but that component is not loaded by prun
(only by prte, i.e. the DVM owner). Also, prte_ras_base_allocate
processes the prte_rankfile global, but it is not called by prun. Would a
patch/hack to make these components load for prun be something I could
put together myself? The mapping happens per prun instance,
correct? So is it just a matter of loading the rank file, or are there
deeper architectural obstacles?


Separate questions:

2. The prun man page mentions --rankfile and contains a section about
rankfiles, but the argument is not accepted:

    prun -n 3 --rankfile arankfile ./mpitest
    prun: Error: unknown option "--rankfile"

But the manpage for prte (aka. mpirun) does not mention rankfiles, and
does not mention the only way I found to specify a rankfile: prte/mpirun
--mca prte_rankfile arankfile.

Would you like a PR moving the rankfile section from the prun manpage to
the prte manpage and mentioning the MCA parameter as the means to
specify a rankfile?

P.S. Btw, --pmca and --gpmca in prun manpage are also not accepted.


3. How can one provide a "default slot_list" so that the rankfile does
not have to enumerate every rank? (exactly the question asked here [1])

For example, omitting the line for rank 2 results in this error:

    rank 0=+n0 slot=0
    rank 1=+n1 slot=0

    mpirun -n 3 --bind-to none --map-by ppr:2:node:DISPLAY \
        --mca prte_rankfile arankfile ./mpitest

    A rank is missing its location specification:

    Rank:        2
    Rank file:   arankfile

    All processes must have their location specified in the rank file.
    Either add an entry to the file, or provide a default slot_list to
    use for any unspecified ranks.


4. Is there a way to use a rankfile but not bind to cores? Omitting
'slot' from the lines in the rankfile is rejected:

    rank 0=+n0
    rank 1=+n1
    rank 2=+n1

Binding is orthogonal to mapping, correct? Would supporting rankfiles
without 'slot' be something I could quickly patch in?

Rationale: binding causes the following error with --map-by ppr:N:node:

    mpirun -n 3 --map-by ppr:2:node:DISPLAY \
        --mca prte_rankfile arankfile ./mpitest

    The request to bind processes could not be completed due to
    an internal error - the locale of the following process was
    not set by the mapper code:

      Process:  [[33295,1],0]

    Please contact the OMPI developers for assistance. Meantime,
    you will still be able to run your application without binding
    by specifying "--bind-to none" on your command line.

Adding '--bind-to none' eliminates the error, but the JOB MAP reports
that processes are bound, which is correct w.r.t. the rankfile but
contradictory to --bind-to none:

    Mappi

Re: [OMPI users] Timeout in MPI_Bcast/MPI_Barrier?

2021-01-11 Thread Gilles Gouaillardet via users
Daniel,

the test works in my environment (1 node, 32 GB memory) with all the
mentioned parameters.

Did you check the memory usage on your nodes and make sure the OOM
killer did not kill any process?

Cheers,

Gilles

On Tue, Jan 12, 2021 at 1:48 AM Daniel Torres via users
 wrote:
>
> Hi.
>
> Thanks for responding. I have taken the most important parts from my code and 
> I created a test that reproduces the behavior I described previously.
>
> I attach to this e-mail the compressed file "test.tar.gz". Inside him, you 
> can find:
>
> 1.- The .c source code "test.c", which I compiled with "mpicc -g -O3 test.c 
> -o test -lm". The main work is performed on the function "work_on_grid", 
> starting at line 162.
> 2.- Four execution examples in two different machines (my own and a cluster 
> machine), which I executed with "mpiexec -np 16 --machinefile hostfile 
> --map-by node --mca btl tcp,vader,self --mca btl_base_verbose 100 ./test 4096 
> 4096", varying the last two arguments with 4096, 8192 and 16384 (a matrix 
> size). The error appears with bigger numbers (8192 in my machine, 16384 in 
> the cluster)
> 3.- The "ompi_info -a" output from the two machines.
> 4.- The hostfile.
>
> The duration of the delay is just a few seconds, about 3 ~ 4.
>
> Essentially, the first error message I get from a waiting process is "74: 
> MPI_ERR_PROC_FAILED: Process Failure".
>
> Hope this information can help.
>
> Thanks a lot for your time.
>
> On 08/01/21 at 18:40, George Bosilca via users wrote:
>
> Daniel,
>
> There are no timeouts in OMPI with the exception of the initial connection 
> over TCP, where we use the socket timeout to prevent deadlocks. As you 
> already did quite a few communicator duplications and other collective 
> communications before you see the timeout, we need more info about this. As 
> Gilles indicated, having the complete output might help. What is the duration 
> of the delay for the waiting process ? Also, can you post a replicator of 
> this issue ?
>
>   George.
>
>
> On Fri, Jan 8, 2021 at 9:03 AM Gilles Gouaillardet via users 
>  wrote:
>>
>> Daniel,
>>
>> Can you please post the full error message and share a reproducer for
>> this issue?
>>
>> Cheers,
>>
>> Gilles
>>
>> On Fri, Jan 8, 2021 at 10:25 PM Daniel Torres via users
>>  wrote:
>> >
>> > Hi all.
>> >
>> > Actually I'm implementing an algorithm that creates a process grid and 
>> > divides it into row and column communicators as follows:
>> >
>> >              col_comm0    col_comm1    col_comm2    col_comm3
>> > row_comm0    P0           P1           P2           P3
>> > row_comm1    P4           P5           P6           P7
>> > row_comm2    P8           P9           P10          P11
>> > row_comm3    P12          P13          P14          P15
>> >
>> > Then, every process works on its own column communicator and broadcast 
>> > data on row communicators.
>> > While column operations are being executed, processes not included in the 
>> > current column communicator just wait for results.
>> >
>> > In a moment, a column communicator could be splitted to create a temp 
>> > communicator and allow only the right processes to work on it.
>> >
>> > At the end of a step, a call to MPI_Barrier (on a duplicate of 
>> > MPI_COMM_WORLD) is executed to sync all processes and avoid bad results.
>> >
>> > With a small amount of data (a small matrix) the MPI_Barrier call syncs 
>> > correctly on the communicator that includes all processes and processing 
>> > ends fine.
>> > But when the amount of data (a big matrix) is incremented, operations on 
>> > column communicators take more time to finish and hence waiting time also 
>> > increments for waiting processes.
>> >
>> > After a few time, waiting processes return an error when they have not 
>> > received the broadcast (MPI_Bcast) on row communicators or when they have 
>> > finished their work at the sync point (MPI_Barrier). But when the 
>> > operations on the current column communicator end, the still active 
>> > processes try to broadcast on row communicators and they fail because the 
>> > waiting processes have returned an error. So all processes fail in 
>> > different moment in time.
>> >
>> > So my problem is that waiting processes "believe" that the current 
>> > operations have failed (but they have not finished yet!) and they fail too.
>> >
>> > So I have a question about MPI_Bcast/MPI_Barrier:
>> >
>> > Is there a way to increment the timeout a process can wait for a broadcast 
>> > or barrier to be completed?
>> >
>> > Here is my machine and OpenMPI info:
>> > - OpenMPI version: Open MPI 4.1.0u1a1
>> > - OS: Linux Daniel 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 
>> > 2020 x86_64 x86_64 x86_64 GNU/Linux
>> >
>> > Thanks in advance for reading my description/question.
>> >
>> > Best regards.
>> >
>> > --
>> > Daniel Torres
>> > LIPN - Université Sorbonne Paris Nord
>
> --
> Daniel Torres
> LIPN - Université Sorbonne Paris Nord


Re: [OMPI users] Timeout in MPI_Bcast/MPI_Barrier?

2021-01-11 Thread George Bosilca via users
MPI_ERR_PROC_FAILED is not yet a valid error in MPI. It is coming from
ULFM, an extension to MPI that is not yet in the OMPI master.

Daniel, what version of Open MPI are you using? Are you sure you are not
mixing multiple versions due to PATH/LD_LIBRARY_PATH?

George.
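
(Editorial note for readers of the archive: one way to confirm which Open
MPI installation each rank actually links at run time is
MPI_Get_library_version, which returns the full library identification
string. A minimal sketch, not part of the original exchange; the file
name is hypothetical.)

    /* libver.c -- sketch: print the MPI library each rank is really using,
       helpful when PATH/LD_LIBRARY_PATH might mix installations. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        char version[MPI_MAX_LIBRARY_VERSION_STRING];
        int len, rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_library_version(version, &len);
        if (rank == 0)
            printf("%s\n", version);  /* e.g. "Open MPI v4.1.x, ..." */
        MPI_Finalize();
        return 0;
    }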


On Mon, Jan 11, 2021 at 21:31 Gilles Gouaillardet via users <
users@lists.open-mpi.org> wrote:

> Daniel,
>
> the test works in my environment (1 node, 32 GB memory) with all the
> mentioned parameters.
>
> Did you check the memory usage on your nodes and made sure the oom
> killer did not shoot any process?
>
> Cheers,
>
> Gilles
>
> On Tue, Jan 12, 2021 at 1:48 AM Daniel Torres via users
>  wrote:
> >
> > Hi.
> >
> > Thanks for responding. I have taken the most important parts from my
> code and I created a test that reproduces the behavior I described
> previously.
> >
> > I attach to this e-mail the compressed file "test.tar.gz". Inside him,
> you can find:
> >
> > 1.- The .c source code "test.c", which I compiled with "mpicc -g -O3
> test.c -o test -lm". The main work is performed on the function
> "work_on_grid", starting at line 162.
> > 2.- Four execution examples in two different machines (my own and a
> cluster machine), which I executed with "mpiexec -np 16 --machinefile
> hostfile --map-by node --mca btl tcp,vader,self --mca btl_base_verbose 100
> ./test 4096 4096", varying the last two arguments with 4096, 8192 and 16384
> (a matrix size). The error appears with bigger numbers (8192 in my machine,
> 16384 in the cluster)
> > 3.- The "ompi_info -a" output from the two machines.
> > 4.- The hostfile.
> >
> > The duration of the delay is just a few seconds, about 3 ~ 4.
> >
> > Essentially, the first error message I get from a waiting process is
> "74: MPI_ERR_PROC_FAILED: Process Failure".
> >
> > Hope this information can help.
> >
> > Thanks a lot for your time.
> >
> > On 08/01/21 at 18:40, George Bosilca via users wrote:
> >
> > Daniel,
> >
> > There are no timeouts in OMPI with the exception of the initial
> connection over TCP, where we use the socket timeout to prevent deadlocks.
> As you already did quite a few communicator duplications and other
> collective communications before you see the timeout, we need more info
> about this. As Gilles indicated, having the complete output might help.
> What is the duration of the delay for the waiting process ? Also, can you
> post a replicator of this issue ?
> >
> >   George.
> >
> >
> > On Fri, Jan 8, 2021 at 9:03 AM Gilles Gouaillardet via users <
> users@lists.open-mpi.org> wrote:
> >>
> >> Daniel,
> >>
> >> Can you please post the full error message and share a reproducer for
> >> this issue?
> >>
> >> Cheers,
> >>
> >> Gilles
> >>
> >> On Fri, Jan 8, 2021 at 10:25 PM Daniel Torres via users
> >>  wrote:
> >> >
> >> > Hi all.
> >> >
> >> > Actually I'm implementing an algorithm that creates a process grid
> and divides it into row and column communicators as follows:
> >> >
> >> >              col_comm0    col_comm1    col_comm2    col_comm3
> >> > row_comm0    P0           P1           P2           P3
> >> > row_comm1    P4           P5           P6           P7
> >> > row_comm2    P8           P9           P10          P11
> >> > row_comm3    P12          P13          P14          P15
> >> >
> >> > Then, every process works on its own column communicator and
> broadcast data on row communicators.
> >> > While column operations are being executed, processes not included in
> the current column communicator just wait for results.
> >> >
> >> > In a moment, a column communicator could be splitted to create a temp
> communicator and allow only the right processes to work on it.
> >> >
> >> > At the end of a step, a call to MPI_Barrier (on a duplicate of
> MPI_COMM_WORLD) is executed to sync all processes and avoid bad results.
> >> >
> >> > With a small amount of data (a small matrix) the MPI_Barrier call
> syncs correctly on the communicator that includes all processes and
> processing ends fine.
> >> > But when the amount of data (a big matrix) is incremented, operations
> on column communicators take more time to finish and hence waiting time
> also increments for waiting processes.
> >> >
> >> > After a few time, waiting processes return an error when they have
> not received the broadcast (MPI_Bcast) on row communicators or when they
> have finished their work at the sync point (MPI_Barrier). But when the
> operations on the current column communicator end, the still active
> processes try to broadcast on row communicators and they fail because the
> waiting processes have returned an error. So all processes fail in
> different moment in time.
> >> >
> >> > So my problem is that waiting processes "believe" that the current
> operations have failed (but they have not finished yet!) and they fail too.
> >> >
> >> > So I have a question about MPI_Bcast/MPI_Barrier:
> >> >
> >> > Is there a way to increment the timeout a process can wait for a
> broadcast or barrier to be comp