If I'm not mistaken, hcoll is playing with opal_progress in a way that
conflicts with the blessed usage of progress in OMPI and prevents other
components from advancing and completing their requests in a timely manner.
The impact is minimal for sequential applications using only blocking calls,
but it jeopardizes performance when multiple types of communications are
executing simultaneously or when multiple threads are active.

The solution might be very simple: hcoll is a module providing support for
collective communications, so as long as you don't use collectives, or the
tuned module provides collective performance similar to hcoll's on your
cluster, just go ahead and disable hcoll. You can also reach out to the
Mellanox folks and ask them to fix hcoll's usage of opal_progress.
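
Something along these lines should do it (these are the standard MCA
mechanisms, nothing specific to your cluster, so adjust as needed):

  mpirun --mca coll_hcoll_enable 0 ...     (what you already did)
  mpirun --mca coll ^hcoll ...             (exclude the hcoll coll component)
  export OMPI_MCA_coll_hcoll_enable=0      (same thing via the environment)

or put "coll_hcoll_enable = 0" in $HOME/.openmpi/mca-params.conf.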

  George.


On Mon, Feb 3, 2020 at 11:09 AM Angel de Vicente via users <
users@lists.open-mpi.org> wrote:

> Hi,
>
> in one of our codes, we want to create a log of events that happen in
> the MPI processes, where the number of these events and their timing are
> unpredictable.
>
> So I implemented a simple test code, where process 0 creates a thread
> that just busy-waits for messages from any process and writes them to
> stdout/stderr/a log file upon receiving them. The test code is at
> https://github.com/angel-devicente/thread_io and the same idea went into
> our "real" code.
>
> As far as I could see, this behaves very nicely: there are no deadlocks,
> no lost messages, and the performance penalty is minimal for the real
> application this is intended for.
>
> But then I found that on a local cluster the performance was very bad
> with the locally installed OpenMPI compared to my own OpenMPI installation
> (same gcc and OpenMPI versions): ~5min 50s vs. ~5s for some test.
> Checking the OpenMPI configuration details, I found that the locally
> installed OpenMPI was configured to use the Mellanox IB driver, and in
> particular that the hcoll component was somehow killing performance:
>
> running with
>
> mpirun  --mca coll_hcoll_enable 0 -np 51 ./test_t
>
> was taking ~5s, while enabling coll_hcoll killed performance, as stated
> above (when run on a single node the performance also goes down, but only
> by about a factor of 2).
>
> Has anyone seen anything like this? Perhaps a newer Mellanox driver
> would solve the problem?
>
> We were planning on making our code public, but before we do so, I want
> to understand under which conditions we could run into this problem with
> the "Threaded I/O" approach and, if possible, how to get rid of it
> completely.
>
> Any help/pointers appreciated.
> --
> Ángel de Vicente
>
> Tel.: +34 922 605 747
> Web.: http://research.iac.es/proyecto/polmag/
>
