Re: [OMPI users] Trouble with Mellanox's hcoll component and MPI_THREAD_MULTIPLE support?

2020-02-03 Thread George Bosilca via users
If I'm not mistaken, hcoll is playing with opal_progress in a way that
conflicts with the blessed usage of progress in OMPI and prevents other
components from advancing and completing their requests in a timely manner.
The impact is minimal for sequential applications using only blocking calls,
but it jeopardizes performance when multiple types of communications are
executing simultaneously or when multiple threads are active.

The solution might be very simple: hcoll is a module providing support for
collective communications, so as long as you don't use collectives, or the
tuned module provides collective performance similar to hcoll's on your
cluster, just go ahead and disable hcoll. You can also reach out to the
Mellanox folks and ask them to fix hcoll's usage of opal_progress.
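
For example, assuming the usual MCA parameter mechanisms (the "-np 51
./test_t" part is just the test case from your message):

  # disable hcoll for a single run
  mpirun --mca coll_hcoll_enable 0 -np 51 ./test_t

  # or exclude the component from the coll framework entirely
  mpirun --mca coll ^hcoll -np 51 ./test_t

  # or make it the default for your user
  echo "coll_hcoll_enable = 0" >> $HOME/.openmpi/mca-params.conf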

  George.


On Mon, Feb 3, 2020 at 11:09 AM Angel de Vicente via users <
users@lists.open-mpi.org> wrote:

> Hi,
>
> in one of our codes, we want to create a log of events that happen in
> the MPI processes, where the number of these events and their timing is
> unpredictable.
>
> So I implemented a simple test code, where process 0 creates a thread
> that just busy-waits for messages from any process and writes them to
> stdout/stderr/a log file upon receiving them. The test code is at
> https://github.com/angel-devicente/thread_io and the same idea went
> into our "real" code.
>
> As far as I could see, this behaves very nicely: there are no deadlocks,
> no lost messages, and the performance penalty is minimal when considering
> the real application this is intended for.
>
> But then I found that on a local cluster the performance was very bad
> with the locally installed OpenMPI compared to my own OpenMPI
> installation (same gcc and OpenMPI versions): ~5min 50s versus ~5s for
> one test. Checking the OpenMPI configuration details, I found that the
> locally installed OpenMPI was configured to use the Mellanox IB driver,
> and in particular the hcoll component was somehow killing performance:
>
> running with
>
> mpirun  --mca coll_hcoll_enable 0 -np 51 ./test_t
>
> took ~5s, while enabling coll_hcoll killed performance, as stated above
> (when run on a single node the performance also drops, but only by about
> a factor of 2).
>
> Has anyone seen anything like this? Perhaps a newer Mellanox driver
> would solve the problem?
>
> We were planning on making our code public, but before we do so, I want
> to understand under which conditions we could have this problem with the
> "Threaded I/O" approach and if possible how to get rid of it completely.
>
> Any help/pointers appreciated.
> --
> Ángel de Vicente
>
> Tel.: +34 922 605 747
> Web.: http://research.iac.es/proyecto/polmag/
>
>
>


Re: [OMPI users] OpenFabrics

2020-02-03 Thread Jeff Squyres (jsquyres) via users
> On Feb 3, 2020, at 12:35 PM, Bennet Fauber  wrote:
> 
> This is what CentOS installed.
> 
> $ yum list installed hwloc\*
> Loaded plugins: langpacks
> Installed Packages
> hwloc.x86_64         1.11.8-4.el7   @os
> hwloc-devel.x86_64   1.11.8-4.el7   @os
> hwloc-libs.x86_64    1.11.8-4.el7   @os

I believe that those versions of hwloc are sufficient.

> I will ask my coworker to install a test version.  What can I do by
> way of flags or environment variables to get the best output to
> report?  I believe that `srun` is preferred as the process starter on
> Slurm clusters, but I think `mpirun`/`orterun` has better debugging
> capabilities?

It depends on what is wrong.  ;-)

You mentioned that "something was awry" with the `--with-hwloc=external` 
installation...
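
As a starting point, something like the following usually produces useful
output to include in a report (a rough sketch; ./your_app is just a
placeholder for your test binary):

  ompi_info | grep -i hwloc
  mpirun --mca btl_base_verbose 100 -np 2 ./your_app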

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] OpenFabrics

2020-02-03 Thread Bennet Fauber via users
This is what CentOS installed.

$ yum list installed hwloc\*
Loaded plugins: langpacks
Installed Packages
hwloc.x86_64         1.11.8-4.el7   @os
hwloc-devel.x86_64   1.11.8-4.el7   @os
hwloc-libs.x86_64    1.11.8-4.el7   @os

I will ask my coworker to install a test version.  What can I do by
way of flags or environment variables to get the best output to
report?  I believe that `srun` is preferred as the process starter on
Slurm clusters, but I think `mpirun`/`orterun` has better debugging
capabilities?

Thanks,

-- bennet


On Mon, Feb 3, 2020 at 12:02 PM Jeff Squyres (jsquyres)
 wrote:
>
> On Feb 3, 2020, at 10:03 AM, Bennet Fauber  wrote:
> >
> > Ah, ha!
> >
> > Yes, that seems to be it.  Thanks.
>
> Ok, good.  I understand that UCX is the "preferred" mechanism for IB these 
> days.
>
> > If I might ask, on a configure-related note: if we have these
> > installed with the CentOS 7.6 we are running
> >
> > $ yum list installed libevent\*
> > Loaded plugins: langpacks
> > Installed Packages
> > libevent.x86_64         2.0.21-4.el7   @anaconda
> > libevent-devel.x86_64   2.0.21-4.el7   @os
> >
> > should we be able to use this?
> >
> >./configure ... --with-libevent=external --with-hwloc=external
> >
> > My coworker reported that something was awry using that, and he's put 
> > instead
> >
> >./configure ... --with-libevent=external --with-hwloc=/usr
> >
> > I believe that the problem was that if we did not specify /usr, then
> > srun and mpirun were unable to find the interfaces.  But I also recall
> > from an earlier thread that is very much not recommended.
>
> I don't know offhand if the hwloc and libevent bundled in CentOS 7 are
> sufficient.  They probably are, but I don't know that for a fact.  I'd be
> curious to know what the problem was if --with-hwloc=external didn't work
> (assuming that the CentOS 7-bundled hwloc was the only one found in your
> PATH/LD_LIBRARY_PATH/compiler include+linker paths/etc.).
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>


Re: [OMPI users] OpenFabrics

2020-02-03 Thread Jeff Squyres (jsquyres) via users
On Feb 3, 2020, at 10:03 AM, Bennet Fauber  wrote:
> 
> Ah, ha!
> 
> Yes, that seems to be it.  Thanks.

Ok, good.  I understand that UCX is the "preferred" mechanism for IB these days.
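
(Typically that means configuring Open MPI with --with-ucx and letting the
UCX PML be selected at run time, or forcing it with something roughly like
"mpirun --mca pml ucx ./your_app" -- treat that as a sketch, with ./your_app
standing in for your binary.)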

> If I might ask, on a configure-related note: if we have these installed
> with the CentOS 7.6 we are running
> 
> $ yum list installed libevent\*
> Loaded plugins: langpacks
> Installed Packages
> libevent.x86_64         2.0.21-4.el7   @anaconda
> libevent-devel.x86_64   2.0.21-4.el7   @os
> 
> should we be able to use this?
> 
>./configure ... --with-libevent=external --with-hwloc=external
> 
> My coworker reported that something was awry using that, and he's put instead
> 
>./configure ... --with-libevent=external --with-hwloc=/usr
> 
> I believe that the problem was that if we did not specify /usr, then
> srun and mpirun were unable to find the interfaces.  But I also recall
> from an earlier thread that is very much not recommended.

I don't know offhand if the hwloc and libevent bundled in CentOS 7 are
sufficient.  They probably are, but I don't know that for a fact.  I'd be
curious to know what the problem was if --with-hwloc=external didn't work
(assuming that the CentOS 7-bundled hwloc was the only one found in your
PATH/LD_LIBRARY_PATH/compiler include+linker paths/etc.).
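
A quick way to sanity-check which hwloc is actually being picked up (a rough
sketch; exact output will vary by system):

  ldconfig -p | grep libhwloc
  hwloc-info --version
  ompi_info | grep -i hwloc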

-- 
Jeff Squyres
jsquy...@cisco.com



[OMPI users] Trouble with Mellanox's hcoll component and MPI_THREAD_MULTIPLE support?

2020-02-03 Thread Angel de Vicente via users
Hi,

in one of our codes, we want to create a log of events that happen in
the MPI processes, where the number of these events and their timing is
unpredictable.

So I implemented a simple test code, where process 0 creates a thread that
just busy-waits for messages from any process and writes them to
stdout/stderr/a log file upon receiving them. The test code is at
https://github.com/angel-devicente/thread_io and the same idea went into
our "real" code.
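
In outline, the approach is something like this (a minimal sketch of the
idea, not the actual test code; error handling is omitted and the tags and
message format are made up):

#include <mpi.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define LOG_TAG  1
#define STOP_TAG 2

static int world_size;

/* Logger thread (rank 0 only): busy-waits for log messages from any rank
 * and writes them out.  It exits after every rank has sent a zero-length
 * STOP_TAG message; per-source message ordering guarantees that a rank's
 * STOP arrives after all of that rank's log messages. */
static void *logger(void *arg)
{
    char buf[256];
    MPI_Status st;
    int done = 0, flag;

    while (done < world_size) {
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &st);
        if (!flag)
            continue;                       /* busy-wait */
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, st.MPI_SOURCE, st.MPI_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (st.MPI_TAG == STOP_TAG)
            done++;
        else
            fprintf(stderr, "[rank %d] %s\n", st.MPI_SOURCE, buf);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank;
    pthread_t tid;
    char msg[256];

    /* MPI_THREAD_MULTIPLE: rank 0's main thread and its logger thread
     * make MPI calls concurrently. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    if (rank == 0)
        pthread_create(&tid, NULL, logger, NULL);

    /* Every rank (0 included) logs by sending to rank 0 ... */
    snprintf(msg, sizeof(msg), "hello from rank %d", rank);
    MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 0, LOG_TAG, MPI_COMM_WORLD);

    /* ... and then tells the logger it is done. */
    MPI_Send(msg, 0, MPI_CHAR, 0, STOP_TAG, MPI_COMM_WORLD);

    if (rank == 0)
        pthread_join(tid, NULL);

    MPI_Finalize();
    return 0;
}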

As far as I could see, this behaves very nicely: there are no deadlocks,
no lost messages, and the performance penalty is minimal when considering
the real application this is intended for.

But then I found that on a local cluster the performance was very bad with
the locally installed OpenMPI compared to my own OpenMPI installation (same
gcc and OpenMPI versions): ~5min 50s versus ~5s for one test. Checking the
OpenMPI configuration details, I found that the locally installed OpenMPI
was configured to use the Mellanox IB driver, and in particular the hcoll
component was somehow killing performance:

running with

mpirun  --mca coll_hcoll_enable 0 -np 51 ./test_t

took ~5s, while enabling coll_hcoll killed performance, as stated above
(when run on a single node the performance also drops, but only by about a
factor of 2).

Has anyone seen anything like this? Perhaps a newer Mellanox driver
would solve the problem?

We were planning on making our code public, but before we do so, I want
to understand under which conditions we could have this problem with the
"Threaded I/O" approach and if possible how to get rid of it completely.

Any help/pointers appreciated.
-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/



Re: [OMPI users] OpenFabrics

2020-02-03 Thread Bennet Fauber via users
Ah, ha!

Yes, that seems to be it.  Thanks.

If I might ask, on a configure-related note: if we have these installed
with the CentOS 7.6 we are running

$ yum list installed libevent\*
Loaded plugins: langpacks
Installed Packages
libevent.x86_64         2.0.21-4.el7   @anaconda
libevent-devel.x86_64   2.0.21-4.el7   @os

should we be able to use this?

./configure ... --with-libevent=external --with-hwloc=external

My coworker reported that something was awry using that, and he's put instead

./configure ... --with-libevent=external --with-hwloc=/usr

I believe that the problem was that if we did not specify /usr, then
srun and mpirun were unable to find the interfaces.  But I also recall
from an earlier thread that is very much not recommended.
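
For reference, the sort of configure line we are trying to converge on looks
roughly like this (a sketch only; the prefix and the UCX/PMIx paths are
placeholders for our local installs):

    ./configure --prefix=/opt/openmpi/4.0.2 \
        --with-libevent=external --with-hwloc=external \
        --with-ucx=/usr --with-pmix=/usr --with-slurm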

We are still struggling with new IB hardware, a new scheduler (Slurm),
PMIx, and OpenMPI, so I am still a bit muddled about how all the moving
pieces fit together.


On Sun, Feb 2, 2020 at 4:16 PM Jeff Squyres (jsquyres)
 wrote:
>
> Bennet --
>
> Just curious: is there a reason you're not using UCX?
>
>
> > On Feb 2, 2020, at 4:06 PM, Bennet Fauber via users 
> >  wrote:
> >
> > We get these warnings/errors from OpenMPI versions 3.1.4 and 4.0.2
> >
> > --
> > WARNING: No preset parameters were found for the device that Open MPI
> > detected:
> >
> >  Local host:            gl3080
> >  Device name:           mlx5_0
> >  Device vendor ID:      0x02c9
> >  Device vendor part ID: 4123
> >
> > Default device parameters will be used, which may result in lower
> > performance.  You can edit any of the files specified by the
> > btl_openib_device_param_files MCA parameter to set values for your
> > device.
> >
> > NOTE: You can turn off this warning by setting the MCA parameter
> >  btl_openib_warn_no_device_params_found to 0.
> > --
> >
> > --
> > WARNING: There was an error initializing an OpenFabrics device.
> >
> >  Local host:   gl3080
> >  Local device: mlx5_0
> > --
> >
> > Does anyone know how I can find the parameters that should be set in
> > $PREFIX/etc/btl_openib_device_param.conf or other OpenMPI
> > configuration files so that those warnings do not occur?
> >
> > How might I find the cause of the initialization error?
> >
> > Sorry for the ignorance behind this question.
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>