Thanks Nathan, will keep you informed.

Regards

On 12/05/2017 11:32 PM, Nathan Hjelm wrote:
Should be fixed by PR #4569 (https://github.com/open-mpi/ompi/pull/4569). 
Please test it and let me know.

-Nathan

On Dec 1, 2017, at 7:37 AM, DERBEY, NADIA 
<nadia.der...@atos.net> wrote:

Hi,

Our validation team detected a hang when running the osu_bibw
micro-benchmark from the OMB 5.3 suite on Open MPI 2.0.2 (note that the
same hang appears with Open MPI 3.0).
This hang occurs when calling osu_bibw on a single node (vader btl) with
the options "-x 100 -i 1000".
The -x option changes the warmup loop size.
The -i option changes the measured loop size.

For each exchanged message size, osu_bibw loops over the following
sequence on both ranks (a rough sketch follows the list):
   . posts 64 non-blocking sends
   . posts 64 non-blocking receives
   . waits for all the send requests to complete
   . waits for all the receive requests to complete
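
For reference, here is a rough sketch of that inner pattern (not the
actual OMB source; the window size of 64 and the buffer handling are
simplified assumptions):

    #include <mpi.h>

    /* sketch of one osu_bibw iteration on each rank (window size 64) */
    #define WINDOW_SIZE 64

    static void bibw_iteration(char *sbuf, char *rbuf, size_t size, int peer)
    {
        MPI_Request sreq[WINDOW_SIZE], rreq[WINDOW_SIZE];

        /* post 64 non-blocking sends */
        for (int j = 0; j < WINDOW_SIZE; j++) {
            MPI_Isend(sbuf, (int) size, MPI_CHAR, peer, 100, MPI_COMM_WORLD, &sreq[j]);
        }
        /* post 64 non-blocking receives */
        for (int j = 0; j < WINDOW_SIZE; j++) {
            MPI_Irecv(rbuf, (int) size, MPI_CHAR, peer, 100, MPI_COMM_WORLD, &rreq[j]);
        }
        /* wait for all the sends, then all the receives, to complete */
        MPI_Waitall(WINDOW_SIZE, sreq, MPI_STATUSES_IGNORE);
        MPI_Waitall(WINDOW_SIZE, rreq, MPI_STATUSES_IGNORE);
    }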

The loop size is the sum of
   . options.skip (the warm-up phase, which can be changed with the -x option)
   . options.loop (the measured loop, which can be changed with the -i
option).

The default values are the following:

+==============+======+======+
| message size | skip | loop |
|==============+======+======|
|    <= 8K     |   10 |  100 |
|    >  8K     |    2 |   20 |
+==============+======+======+

As said above, the test hangs when moving to more aggressive loop
values: 100 for skip and 1000 for loop.

mca_btl_vader_frag_alloc() calls opal_free_list_get() to get a fragment
from the appropriate free list.
If there are no free fragments left, opal_free_list_get() calls
opal_free_list_grow(), which in turn calls mca_btl_vader_frag_init()
(the initialization routine for the vader btl fragments).
This routine checks whether there is enough space left in the mapped memory
segment for the requested fragment size (current offset + fragment size
should be <= segment size), and it makes opal_free_list_grow() fail if the
shared memory segment is exhausted.
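
In pseudo-form, that space check boils down to something like the
following (a simplified sketch based on the description above, not the
exact vader source; the field names are assumptions):

    /* sketch of the space check done when initializing a new vader fragment;
     * the component field names are assumptions, not the exact vader source */
    static int frag_init_sketch(size_t frag_size)
    {
        size_t offset       = mca_btl_vader_component.segment_offset; /* assumed field */
        size_t segment_size = mca_btl_vader_component.segment_size;   /* assumed field */

        if (offset + frag_size > segment_size) {
            /* shared memory segment exhausted: make opal_free_list_grow() fail
             * instead of handing out a new fragment */
            return OPAL_ERR_OUT_OF_RESOURCE;
        }

        mca_btl_vader_component.segment_offset = offset + frag_size;
        return OPAL_SUCCESS;
    }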

As soon as we begin exhausting memory, the two ranks become
desynchronized and the test rapidly hangs. To avoid this hang, I found
two possible solutions:

1) Change the vader btl segment size: I set it to 4GB. To be able to do
this, I had to change the type parameter in the parameter registration
to MCA_BASE_VAR_TYPE_SIZE_T (a rough sketch follows).
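
For illustration only, the registration would change along these lines
(a sketch; the exact call site and field names in the vader component
may differ):

    /* sketch: register btl_vader_segment_size as a size_t MCA variable so
     * that values of 4GB and more can be expressed (call site is assumed) */
    mca_btl_vader_component.segment_size = (size_t) 4 * 1024 * 1024 * 1024;
    (void) mca_base_component_var_register(&mca_btl_vader_component.super.btl_version,
                                           "segment_size",
                                           "Size of the shared memory backing segment",
                                           MCA_BASE_VAR_TYPE_SIZE_T, /* was an int-sized type */
                                           NULL, 0, 0,
                                           OPAL_INFO_LVL_3,
                                           MCA_BASE_VAR_SCOPE_READONLY,
                                           &mca_btl_vader_component.segment_size);

With the type widened, the larger segment can then be requested at run
time, e.g. "mpirun --mca btl_vader_segment_size 4294967296 ...".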

2) Replace the call to opal_free_list_get() with a call to
opal_free_list_wait() in mca_btl_vader_frag_alloc(). This also makes the
micro-benchmark run to completion (sketched below).
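
The change in mca_btl_vader_frag_alloc() is essentially the following
(a sketch with the surrounding code abbreviated):

    /* sketch of the change in mca_btl_vader_frag_alloc(): opal_free_list_wait()
     * blocks (while progressing) until a fragment becomes available, instead of
     * returning NULL when the free list cannot grow */
    static inline mca_btl_vader_frag_t *frag_alloc_sketch(opal_free_list_t *list)
    {
        /* before: may return NULL once the shared memory segment is exhausted */
        /* opal_free_list_item_t *item = opal_free_list_get (list); */

        /* after: block until another fragment is returned to the free list */
        opal_free_list_item_t *item = opal_free_list_wait (list);

        return (mca_btl_vader_frag_t *) item;
    }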

So my question is: which would be the best approach (#1 or #2)? And the
underlying question is: what is the reason for favoring
opal_free_list_get() over opal_free_list_wait()?

Thanks

--
Nadia Derbey - B1-387
HPC R&D - MPI
Tel: +33 4 76 29 77 62
nadia.der...@atos.net
1 Rue de Provence BP 208
38130 Echirolles Cedex, France
www.atos.com

--
Nadia Derbey - B1-387
HPC R&D - MPI
Tel: +33 4 76 29 77 62
nadia.der...@atos.net
1 Rue de Provence BP 208
38130 Echirolles Cedex, France
www.atos.com
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
