Re: [OMPI users] MPI_type_free question

2020-12-07 Thread Patrick Bégou via users
Hi,

I've written a small piece of code to show the problem. It is based on my
application, but in 2D and using integer arrays for testing.
The figure below shows the max RSS size of the rank 0 process over 2
iterations on 8 and 16 cores, with the openib and tcp drivers.
The more processes I have, the larger the memory leak.  I use the same
binaries for the 4 runs and OpenMPI 3.1 (same behavior with 4.0.5).
The code is attached. I'll try to check type deallocation as soon
as possible.
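
Roughly, the pattern the test exercises is the following; this is only a
sketch in C for illustration (the attached test, like the application, is
Fortran), and the sizes and iteration count are arbitrary:

/* Sketch only: create/commit/free a 2D subarray type every iteration, as
 * the real (Fortran, MPI_Alltoallw-based) code does, and watch whether
 * RSS keeps growing even though MPI_Type_free() is called. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int sizes[2]    = {1024, 1024};   /* global 2D array (arbitrary)  */
    int subsizes[2] = {512, 512};     /* local block (arbitrary)      */
    int starts[2]   = {0, 0};

    MPI_Init(&argc, &argv);
    for (int iter = 0; iter < 1000; iter++) {
        MPI_Datatype sub;
        MPI_Type_create_subarray(2, sizes, subsizes, starts,
                                 MPI_ORDER_FORTRAN, MPI_INT, &sub);
        MPI_Type_commit(&sub);
        /* ...the real code builds arrays of such types and passes them
         * to MPI_Alltoallw here... */
        MPI_Type_free(&sub);          /* the leak shows up despite this */
    }
    MPI_Finalize();
    return 0;
}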

Patrick




On 04/12/2020 at 01:34, Gilles Gouaillardet via users wrote:
> Patrick,
>
>
> based on George's idea, a simpler check is to retrieve the Fortran
> index via the (standard) MPI_Type_c2f() function
>
> after you create a derived datatype.
>
>
> If the index keeps growing forever even after you MPI_Type_free(),
> then this clearly indicates a leak.
>
> Unfortunately, this simple test cannot be used to definitively rule out
> any memory leak.
>
>
> Note you can also
>
> mpirun --mca pml ob1 --mca btl tcp,self ...
>
> in order to force communications over TCP/IP and hence rule out any
> memory leak that could be triggered by your fast interconnect.
>
>
>
> In any case, a reproducer will greatly help us debug this issue.
>
>
> Cheers,
>
>
> Gilles
>
>
>
> On 12/4/2020 7:20 AM, George Bosilca via users wrote:
>> Patrick,
>>
>> I'm afraid there is no simple way to check this. The main reason
>> being that OMPI uses handles for MPI objects, and these handles are
>> not tracked by the library; they are supposed to be provided by the
>> user for each call. In your case, as you already called MPI_Type_free
>> on the datatype, you cannot produce a valid handle.
>>
>> There might be a trick. If the datatype is manipulated with any
>> Fortran MPI functions, then we convert the handle (which in fact is a
>> pointer) to an index into a pointer array structure. Thus, the index
>> will remain used, and can therefore be used to convert back into a
>> valid datatype pointer, until OMPI completely releases the datatype.
>> Look into the ompi_datatype_f_to_c_table table to see the datatypes
>> that exist and get their pointers, and then use these pointers as
>> arguments to ompi_datatype_dump() to see if any of these existing
>> datatypes are the ones you define.
>>
>> George.
>>
>>
>>
>>
>> On Thu, Dec 3, 2020 at 4:44 PM Patrick Bégou via users
>> <users@lists.open-mpi.org> wrote:
>>
>>     Hi,
>>
>>     I'm trying to solve a memory leak that appeared with my new
>>     implementation of communications based on MPI_Alltoallw and
>>     MPI_Type_create_subarray calls.  Arrays of subarray types are
>>     created/destroyed at each time step and used for communications.
>>
>>     On my laptop the code runs fine (running for 15000 temporal
>>     iterations on 32 processes with oversubscription), but on our
>>     cluster the memory used by the code increases until the OOM killer
>>     stops the job. On the cluster we use IB QDR for communications.
>>
>>     Same Gcc/Gfortran 7.3 (built from sources), same sources of
>>     OpenMPI (3.1 or 4.0.5 tested), same sources of the Fortran code on
>>     the laptop and on the cluster.
>>
>>     Using Gcc/Gfortran 4.8 and OpenMPI 1.7.3 on the cluster does not
>>     show the problem (resident memory does not increase and we ran
>>     10 temporal iterations).
>>
>>     The MPI_Type_free manual says that it "Marks the datatype object
>>     associated with datatype for deallocation". But how can I check
>>     that the deallocation is really done?
>>
>>     Thanks for any suggestions.
>>
>>     Patrick
>>



test_layout_array.tgz
Description: application/compressed-tar


Re: [OMPI users] MPI_type_free question

2020-12-07 Thread Patrick Bégou via users
Hi George,

I've implemented a call to MPI_Type_f2c using Fortran C_BINDING and it
works. Data types are always reported as deallocated (I've checked the
reverse by commenting out the calls to MPI_Type_free(...) to be sure that
it reports "Not deallocated" in my code in that case).
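
In C the check amounts to something like this (a sketch of the idea only;
my actual check is done from Fortran through C_BINDING):

/* Remember the Fortran index of a committed datatype, free it, then
 * convert the saved index back and compare against MPI_DATATYPE_NULL. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int sizes[2] = {64, 64}, subsizes[2] = {32, 32}, starts[2] = {0, 0};
    MPI_Datatype sub;
    MPI_Fint idx;

    MPI_Init(&argc, &argv);
    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_FORTRAN, MPI_INT, &sub);
    MPI_Type_commit(&sub);

    idx = MPI_Type_c2f(sub);     /* save the Fortran index (Gilles' tip) */
    MPI_Type_free(&sub);         /* sub itself is now MPI_DATATYPE_NULL  */

    /* If the slot was really released, converting the saved index back
     * gives MPI_DATATYPE_NULL; otherwise the datatype is still alive. */
    if (MPI_Type_f2c(idx) == MPI_DATATYPE_NULL)
        printf("index %d: deallocated\n", (int)idx);
    else
        printf("index %d: NOT deallocated\n", (int)idx);

    MPI_Finalize();
    return 0;
}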

Then I ran the code with the tcp and openib drivers, but keeping the
deallocation commented out, to see how the memory consumption evolves:

The global slope of the curves is quite similar with tcp and openib over
1000 iterations, even if they look different. So it really looks like a
subarray type deallocation problem, but deeper in the code I think.
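
Following George's remark below that only the MPI API is needed, another
way is to sweep a range of Fortran indices and list which ones still map
to live datatypes (again only a sketch; the upper bound is arbitrary, and
the predefined datatypes legitimately occupy the low indices):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Datatype t;
    MPI_Type_contiguous(16, MPI_INT, &t);
    MPI_Type_commit(&t);
    MPI_Type_free(&t);            /* t should be fully released here */

    /* Anything above the predefined types that still maps to a non-NULL
     * handle is a user datatype that was not really released. */
    for (MPI_Fint i = 0; i < 4096; i++) {
        if (MPI_Type_f2c(i) != MPI_DATATYPE_NULL)
            printf("f2c index %d still maps to a live datatype\n", (int)i);
    }

    MPI_Finalize();
    return 0;
}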

Patrick




On 04/12/2020 at 19:20, George Bosilca wrote:
> On Fri, Dec 4, 2020 at 2:33 AM Patrick Bégou via users
> <users@lists.open-mpi.org> wrote:
>
> Hi George and Gilles,
>
> Thanks George for your suggestion. Is it applicable to OpenMPI
> versions 4.0.5 and 3.1? I will have a look today at these tables. Maybe
> I'll write a small piece of code just creating and freeing
> subarray datatypes.
>
>
> Patrick,
>
> Using Gilles' suggestion to go through the type_f2c function when
> listing the datatypes should give you a portable datatype iterator
> across all versions of OMPI. The call to dump a datatype's content,
> ompi_datatype_dump, has been there for a very long time, so the
> combination of the two should work everywhere.
>
> Thinking a little more about this, you don't necessarily have to dump
> the content of the datatype, you only need to check if they are
> different from MPI_DATATYPE_NULL. Thus, you can have a solution using
> only the MPI API.
>
>   George.
>  
>
>
> Thanks Gilles for suggesting disabling the interconnect. It is a
> good fast test, and yes, *with "mpirun --mca pml ob1 --mca btl
> tcp,self" I have no memory leak*. So this explains the difference
> between my laptop and the cluster.
> Is the implementation of type management so different from 1.7.3?
>
> A PhD student tells me he also has some trouble with this code on
> an Omni-Path based cluster. I will have to investigate that too, but
> I'm not sure it is the same problem.
>
> Patrick
>
> On 04/12/2020 at 01:34, Gilles Gouaillardet via users wrote:
>> Patrick,
>>
>>
>> based on George's idea, a simpler check is to retrieve the
>> Fortran index via the (standard) MPI_Type_c2f() function
>>
>> after you create a derived datatype.
>>
>>
>> If the index keeps growing forever even after you
>> MPI_Type_free(), then this clearly indicates a leak.
>>
>> Unfortunately, this simple test cannot be used to definitively rule
>> out any memory leak.
>>
>>
>> Note you can also
>>
>> mpirun --mca pml ob1 --mca btl tcp,self ...
>>
>> in order to force communications over TCP/IP and hence rule out
>> any memory leak that could be triggered by your fast interconnect.
>>
>>
>>
>> In any case, a reproducer will greatly help us debug this issue.
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>>
>>
>> On 12/4/2020 7:20 AM, George Bosilca via users wrote:
>>> Patrick,
>>>
>>> I'm afraid there is no simple way to check this. The main reason
>>> being that OMPI uses handles for MPI objects, and these handles
>>> are not tracked by the library; they are supposed to be provided
>>> by the user for each call. In your case, as you already called
>>> MPI_Type_free on the datatype, you cannot produce a valid handle.
>>>
>>> There might be a trick. If the datatype is manipulated with any
>>> Fortran MPI functions, then we convert the handle (which in fact
>>> is a pointer) to an index into a pointer array structure. Thus,
>>> the index will remain used, and can therefore be used to convert
>>> back into a valid datatype pointer, until OMPI completely
>>> releases the datatype. Look into the ompi_datatype_f_to_c_table
>>> table to see the datatypes that exist and get their pointers,
>>> and then use these pointers as arguments to ompi_datatype_dump()
>>> to see if any of these existing datatypes are the ones you define.
>>>
>>> George.
>>>
>>>
>>>
>>>
>>> On Thu, Dec 3, 2020 at 4:44 PM Patrick Bégou via users
>>> <users@lists.open-mpi.org> wrote:
>>>
>>>     Hi,
>>>
>>>     I'm trying to solve a memory leak that appeared with my new
>>>     implementation of communications based on MPI_Alltoallw and
>>>     MPI_Type_create_subarray calls.  Arrays of subarray types are
>>>     created/destroyed at each time step and used for communications.
>>>
>>>     On my laptop the code runs fine (running for 15000 temporal
>>>     iterations on 32 processes with oversubscription), but on our
>>>     cluster the memory used by the code increases until the OOM killer
>>>     stops the job. On the cluster we use IB QDR for communications.
>>>
>>>     Sam

[OMPI users] RMA breakage

2020-12-07 Thread Dave Love via users
After seeing several RMA failures even with the change needed to get
4.0.5 through IMB, I looked for simple tests.  So I built the mpich
3.4b1 tests -- or the ones that would build, and I haven't checked why
some fail to build -- and ran the rma set.

Three out of 180 passed.  Many (most?) aborted in ucx, like I saw with
production code, with a backtrace like below; others at least reported
an MPI error.  This was on two nodes of a ppc64le RHEL7 IB system with
4.0.5, ucx 1.9, and MCA parameters from the ucx FAQ (though I got the
same result without those parameters).  I haven't tried to reproduce it
on x86_64, but it seems unlikely to be CPU-specific.

Is there anything we can do to run RMA without just moving to mpich?  Do
releases actually get tested on run-of-the-mill IB+Lustre systems?

+ mpirun -n 2 winname
[gpu005:50906:0:50906]  ucp_worker.c:183  Fatal: failed to set active message handler id 1: Invalid parameter
 backtrace (tid:  50906) 
 0 0x0005453c ucs_debug_print_backtrace()  .../src/ucs/debug/debug.c:656
 1 0x00028218 ucp_worker_set_am_handlers()  .../src/ucp/core/ucp_worker.c:182
 2 0x00029ae0 ucp_worker_iface_deactivate()  .../src/ucp/core/ucp_worker.c:816
 3 0x00029ae0 ucp_worker_iface_check_events()  .../src/ucp/core/ucp_worker.c:766
 4 0x00029ae0 ucp_worker_iface_deactivate()  .../src/ucp/core/ucp_worker.c:819
 5 0x00029ae0 ucp_worker_iface_unprogress_ep()  .../src/ucp/core/ucp_worker.c:841
 6 0x000582a8 ucp_wireup_ep_t_cleanup()  .../src/ucp/wireup/wireup_ep.c:381
 7 0x00068124 ucs_class_call_cleanup_chain()  .../src/ucs/type/class.c:56
 8 0x00057420 ucp_wireup_ep_t_delete()  .../src/ucp/wireup/wireup_ep.c:28
 9 0x00013de8 uct_ep_destroy()  .../src/uct/base/uct_iface.c:546
10 0x000252f4 ucp_proxy_ep_replace()  .../src/ucp/core/ucp_proxy_ep.c:236
11 0x00057b88 ucp_wireup_ep_progress()  .../src/ucp/wireup/wireup_ep.c:89
12 0x00049820 ucs_callbackq_slow_proxy()  .../src/ucs/datastruct/callbackq.c:400
13 0x0002ca04 ucs_callbackq_dispatch()  .../src/ucs/datastruct/callbackq.h:211
14 0x0002ca04 uct_worker_progress()  .../src/uct/api/uct.h:2346
15 0x0002ca04 ucp_worker_progress()  .../src/ucp/core/ucp_worker.c:2040
16 0xc144 progress_callback()  osc_ucx_component.c:0
17 0x000374ac opal_progress()  ???:0
18 0x0006cc74 ompi_request_default_wait()  ???:0
19 0x000e6fcc ompi_coll_base_sendrecv_actual()  ???:0
20 0x000e5530 ompi_coll_base_allgather_intra_two_procs()  ???:0
21 0x6c44 ompi_coll_tuned_allgather_intra_dec_fixed()  ???:0
22 0xdc20 component_select()  osc_ucx_component.c:0
23 0x00115b90 ompi_osc_base_select()  ???:0
24 0x00075264 ompi_win_create()  ???:0
25 0x000cb4e8 PMPI_Win_create()  ???:0
26 0x10006ecc MTestGetWin()  .../mpich-3.4b1/test/mpi/util/mtest.c:1173
27 0x10002e40 main()  .../mpich-3.4b1/test/mpi/rma/winname.c:25
28 0x00025200 generic_start_main.isra.0()  libc-start.c:0
29 0x000253f4 __libc_start_main()  ???:0

followed by the abort backtrace
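
For reference, frames 24-27 show it dies inside plain MPI_Win_create; a
minimal standalone check of that path (just a sketch along the lines of
the winname test, not its actual source) is:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int buf[16];
    MPI_Win win;
    char name[MPI_MAX_OBJECT_NAME];
    int len;

    MPI_Init(&argc, &argv);

    /* Window creation alone is enough to reach the osc/UCX selection
     * path that aborts in the trace above. */
    MPI_Win_create(buf, sizeof(buf), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_set_name(win, "testwin");
    MPI_Win_get_name(win, name, &len);
    printf("window name: %s\n", name);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}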


Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

2020-12-07 Thread Dave Love via users
Ralph Castain via users  writes:

> Just a point to consider. OMPI does _not_ want to get in the mode of
> modifying imported software packages. That is a black hole of effort we
> simply cannot afford.

It's already done that, even in flatten.c.  Otherwise updating to the
current version would be trivial.  I'll eventually make suggestions for
some changes in MPICH for standalone builds if I can verify that they
don't break things outside of OMPI.

Meanwhile we don't have a recent version that will even pass the tests
recommended here, and we've long been asking about MPI-IO on Lustre.  We
should probably move to some sort of MPICH for MPI-IO on what is probably
the most common parallel filesystem, as well as for RMA on the most
common fabric.

> The correct thing to do would be to flag Rob Latham on that PR and ask
> that he upstream the fix into ROMIO so we can absorb it. We shouldn't
> be committing such things directly into OMPI itself.

It's already fixed differently in mpich, but the simple patch is useful
if there's nothing else broken.  I approve of sending fixes to MPICH,
but that will only do any good if OMPI's version gets updated from
there, which doesn't seem to happen.

> It's called "working with the community" as opposed to taking a
> point-solution approach :-)

The community has already done work to fix this properly.  It's a pity
that work will be wasted.  This bit of the community is grateful for the
patch, which is reasonable to carry in packaging for now, unlike a whole
new ROMIO.


Re: [OMPI users] [EXTERNAL] RMA breakage

2020-12-07 Thread Pritchard Jr., Howard via users
Hello Dave,

There's an issue opened about this -

https://github.com/open-mpi/ompi/issues/8252

However, I'm not observing failures with IMB RMA on an IB/aarch64 system
with UCX 1.9.0, using OMPI 4.0.x at 6ea9d98.
This cluster is running RHEL 7.6 and MLNX_OFED_LINUX-4.5-1.0.1.0.

Howard

On 12/7/20, 7:21 AM, "users on behalf of Dave Love via users" wrote:

After seeing several RMA failures even with the change needed to get
4.0.5 through IMB, I looked for simple tests.  So I built the mpich
3.4b1 tests -- or the ones that would build, and I haven't checked why
some fail to build -- and ran the rma set.

Three out of 180 passed.  Many (most?) aborted in ucx, like I saw with
production code, with a backtrace like below; others at least reported
an MPI error.  This was on two nodes of a ppc64le RHEL7 IB system with
4.0.5, ucx 1.9, and MCA parameters from the ucx FAQ (though I got the
same result without those parameters).  I haven't tried to reproduce it
on x86_64, but it seems unlikely to be CPU-specific.

Is there anything we can do to run RMA without just moving to mpich?  Do
releases actually get tested on run-of-the-mill IB+Lustre systems?

+ mpirun -n 2 winname
[gpu005:50906:0:50906]  ucp_worker.c:183  Fatal: failed to set active message handler id 1: Invalid parameter
 backtrace (tid:  50906) 
 0 0x0005453c ucs_debug_print_backtrace()  .../src/ucs/debug/debug.c:656
 1 0x00028218 ucp_worker_set_am_handlers()  .../src/ucp/core/ucp_worker.c:182
 2 0x00029ae0 ucp_worker_iface_deactivate()  .../src/ucp/core/ucp_worker.c:816
 3 0x00029ae0 ucp_worker_iface_check_events()  .../src/ucp/core/ucp_worker.c:766
 4 0x00029ae0 ucp_worker_iface_deactivate()  .../src/ucp/core/ucp_worker.c:819
 5 0x00029ae0 ucp_worker_iface_unprogress_ep()  .../src/ucp/core/ucp_worker.c:841
 6 0x000582a8 ucp_wireup_ep_t_cleanup()  .../src/ucp/wireup/wireup_ep.c:381
 7 0x00068124 ucs_class_call_cleanup_chain()  .../src/ucs/type/class.c:56
 8 0x00057420 ucp_wireup_ep_t_delete()  .../src/ucp/wireup/wireup_ep.c:28
 9 0x00013de8 uct_ep_destroy()  .../src/uct/base/uct_iface.c:546
10 0x000252f4 ucp_proxy_ep_replace()  .../src/ucp/core/ucp_proxy_ep.c:236
11 0x00057b88 ucp_wireup_ep_progress()  .../src/ucp/wireup/wireup_ep.c:89
12 0x00049820 ucs_callbackq_slow_proxy()  .../src/ucs/datastruct/callbackq.c:400
13 0x0002ca04 ucs_callbackq_dispatch()  .../src/ucs/datastruct/callbackq.h:211
14 0x0002ca04 uct_worker_progress()  .../src/uct/api/uct.h:2346
15 0x0002ca04 ucp_worker_progress()  .../src/ucp/core/ucp_worker.c:2040
16 0xc144 progress_callback()  osc_ucx_component.c:0
17 0x000374ac opal_progress()  ???:0
18 0x0006cc74 ompi_request_default_wait()  ???:0
19 0x000e6fcc ompi_coll_base_sendrecv_actual()  ???:0
20 0x000e5530 ompi_coll_base_allgather_intra_two_procs()  ???:0
21 0x6c44 ompi_coll_tuned_allgather_intra_dec_fixed()  ???:0
22 0xdc20 component_select()  osc_ucx_component.c:0
23 0x00115b90 ompi_osc_base_select()  ???:0
24 0x00075264 ompi_win_create()  ???:0
25 0x000cb4e8 PMPI_Win_create()  ???:0
26 0x10006ecc MTestGetWin()  .../mpich-3.4b1/test/mpi/util/mtest.c:1173
27 0x10002e40 main()  .../mpich-3.4b1/test/mpi/rma/winname.c:25
28 0x00025200 generic_start_main.isra.0()  libc-start.c:0
29 0x000253f4 __libc_start_main()  ???:0

followed by the abort backtrace