Re: [OMPI users] Trouble compiling OpenMPI with Infiniband support

2022-03-01 Thread Joshua Ladd via users
These are very, very old versions of UCX and HCOLL installed in your
environment. Also, MXM was deprecated years ago in favor of UCX. What
version of MOFED is installed (run ofed_info -s)? What HCA generation is
present (run ibstat)?
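
For reference, those checks (plus a couple of related ones, assuming
ucx_info and ompi_info are on the PATH) would look roughly like:

  ofed_info -s             # MOFED release string
  ibstat                   # HCA model, firmware, port state and rate
  ucx_info -v              # version of the UCX library being picked up
  ompi_info | grep -i ucx  # whether this Open MPI build has UCX support at all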

Josh

On Tue, Mar 1, 2022 at 6:42 AM Angel de Vicente via users <
users@lists.open-mpi.org> wrote:

> Hello,
>
> John Hearns via users  writes:
>
> > Stupid answer from me. If latency/bandwidth numbers are bad then check
> > that you are really running over the interface that you think you
> > should be. You could be falling back to running over Ethernet.
>
> I'm quite out of my depth here, so all answers are helpful, as I might have
> skipped something very obvious.
>
> In order to try and avoid the possibility of falling back to running
> over Ethernet, I submitted the job with:
>
> mpirun -n 2 --mca btl ^tcp osu_latency
>
> which gives me the following error:
>
> ,----
> | At least one pair of MPI processes are unable to reach each other for
> | MPI communications.  This means that no Open MPI device has indicated
> | that it can be used to communicate between these processes.  This is
> | an error; Open MPI requires that all MPI processes be able to reach
> | each other.  This error can sometimes be the result of forgetting to
> | specify the "self" BTL.
> |
> |   Process 1 ([[37380,1],1]) is on host: s01r1b20
> |   Process 2 ([[37380,1],0]) is on host: s01r1b19
> |   BTLs attempted: self
> |
> | Your MPI job is now going to abort; sorry.
> `----
>
> This is certainly not happening when I use the "native" OpenMPI,
> etc. provided in the cluster. I have not knowingly specified anywhere
> not to support "self", so I have no clue what might be going on, as I
> assumed that "self" was always built for OpenMPI.
>
> Any hints on what (and where) I should look for?
>
> Many thanks,
> --
> Ángel de Vicente
>
> Tel.: +34 922 605 747
> Web.: http://research.iac.es/proyecto/polmag/
>
>


Re: [OMPI users] Trouble compiling OpenMPI with Infiniband support

2022-03-01 Thread Angel de Vicente via users
Hello,

John Hearns via users  writes:

> Stupid answer from me. If latency/bandwidth numbers are bad then check
> that you are really running over the interface that you think you
> should be. You could be falling back to running over Ethernet.

I'm quite out of my depth here, so all answers are helpful, as I might have
skipped something very obvious.

In order to try and avoid the possibility of falling back to running
over Ethernet, I submitted the job with:

mpirun -n 2 --mca btl ^tcp osu_latency

which gives me the following error:

,----
| At least one pair of MPI processes are unable to reach each other for
| MPI communications.  This means that no Open MPI device has indicated
| that it can be used to communicate between these processes.  This is
| an error; Open MPI requires that all MPI processes be able to reach
| each other.  This error can sometimes be the result of forgetting to
| specify the "self" BTL.
| 
|   Process 1 ([[37380,1],1]) is on host: s01r1b20
|   Process 2 ([[37380,1],0]) is on host: s01r1b19
|   BTLs attempted: self
| 
| Your MPI job is now going to abort; sorry.
`----

This is certainly not happening when I use the "native" OpenMPI,
etc. provided in the cluster. I have not knowingly specified anywhere
not to support "self", so I have no clue what might be going on, as I
assumed that "self" was always built for OpenMPI.
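
One thing that might be worth checking (assuming the ompi_info that matches
this mpirun is first on the PATH) is which components were actually built,
roughly:

  ompi_info | grep btl   # expect self, vader, tcp; openib/ucx entries are the interesting ones
  ompi_info | grep pml   # a "ucx" entry means InfiniBand can be used through UCX

If no openib or ucx entries show up, that would explain why only the "self"
BTL was attempted.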

Any hints on what (and where) I should look for?

Many thanks,
-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/


Re: [OMPI users] Trouble compiling OpenMPI with Infiniband support

2022-03-01 Thread John Hearns via users
Stupid answer from me. If latency/bandwidth numbers are bad then check that
you are really running over the interface that you think you should be. You
could be falling back to running over Ethernet.
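
A rough way to confirm which transport is actually selected (assuming a
UCX-enabled build; osu_latency and the device names are just placeholders)
is to turn up the verbosity:

  mpirun -n 2 --mca pml_base_verbose 10 --mca btl_base_verbose 10 osu_latency
  mpirun -n 2 -x UCX_LOG_LEVEL=info osu_latency   # if the ucx PML is in use
  ucx_info -d                                     # lists the devices/transports UCX can see

The logs should show whether traffic goes over the HCA or falls back to tcp.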

On Mon, 28 Feb 2022 at 20:10, Angel de Vicente via users <
users@lists.open-mpi.org> wrote:

> Hello,
>
> "Jeff Squyres (jsquyres)"  writes:
>
> > I'd recommend against using Open MPI v3.1.0 -- it's quite old.  If you
> > have to use Open MPI v3.1.x, I'd at least suggest using v3.1.6, which
> > has all the rolled-up bug fixes on the v3.1.x series.
> >
> > That being said, Open MPI v4.1.2 is the most current.  Open MPI v4.1.2
> does
> > restrict which versions of UCX it uses because there are bugs in the
> older
> > versions of UCX.  I am not intimately familiar with UCX -- you'll need
> to ask
> > Nvidia for support there -- but I was under the impression that it's
> just a
> > user-level library, and you could certainly install your own copy of UCX
> to use
> > with your compilation of Open MPI.  I.e., you're not restricted to
> whatever UCX
> > is installed in the cluster system-default locations.
>
> I did follow your advice, so I compiled my own version of UCX (1.11.2)
> and OpenMPI v4.1.1, but for some reason the latency / bandwidth numbers
> are really bad compared to the previous ones, so something is wrong, but
> not sure how to debug it.
>
> > I don't know why you're getting MXM-specific error messages; those don't
> appear
> > to be coming from Open MPI (especially since you configured Open MPI with
> > --without-mxm).  If you can upgrade to Open MPI v4.1.2 and the latest
> UCX, see
> > if you are still getting those MXM error messages.
>
> In this latest attempt, yes, the MXM error messages are still there.
>
> Cheers,
> --
> Ángel de Vicente
>
> Tel.: +34 922 605 747
> Web.: http://research.iac.es/proyecto/polmag/
>
>
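
A minimal sketch of the kind of build discussed in the quoted exchange above
(a private UCX, then an Open MPI pointed at it); the prefixes, versions and
-j level are only illustrative, and the key configure options are --with-ucx
and --without-verbs:

  # UCX, from a release tarball
  cd ucx-1.11.2
  ./contrib/configure-release --prefix=$HOME/sw/ucx-1.11.2
  make -j8 && make install

  # Open MPI, built against that UCX
  cd ../openmpi-4.1.2
  ./configure --prefix=$HOME/sw/openmpi-4.1.2 \
              --with-ucx=$HOME/sw/ucx-1.11.2 --without-verbs
  make -j8 && make install

After installing, the matching mpirun and libraries have to come first in
PATH and LD_LIBRARY_PATH before re-running osu_latency; otherwise the
cluster's system-default Open MPI can silently be picked up instead.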