Hi George,

I've implemented a call to MPI_Type_f2c using Fortran ISO_C_BINDING and it works. Datatypes are always reported as deallocated. (I checked the reverse case by commenting out the calls to MPI_Type_free(...), to be sure that my code reports "Not deallocated" in that case.)

I then ran the code with the tcp and openib drivers, keeping the deallocation commented out, to see how memory consumption evolves. Over 1000 iterations the overall slopes of the curves are quite similar with tcp and openib, even if the curves look different. So it really looks like a subarray type deallocation problem, but deeper in the code, I think.

Patrick

On 04/12/2020 at 19:20, George Bosilca wrote:
> On Fri, Dec 4, 2020 at 2:33 AM Patrick Bégou via users
> <users@lists.open-mpi.org> wrote:
>
>     Hi George and Gilles,
>
>     Thanks George for your suggestion. Is it valid for the 4.0.5 and 3.1
>     OpenMPI versions? I will have a look at these tables today, maybe
>     by writing a small piece of code that just creates and frees
>     subarray datatypes.
>
>
> Patrick,
>
> If you use Gilles' suggestion to go through the type_f2c function when
> listing the datatypes, it should give you a portable datatype iterator
> across all versions of OMPI. The call to dump a datatype's content,
> ompi_datatype_dump, has been there for a very long time, so the
> combination of the two should work everywhere.
>
> Thinking a little more about this, you don't necessarily have to dump
> the content of the datatypes; you only need to check whether they are
> different from MPI_DATATYPE_NULL. Thus, you can have a solution using
> only the MPI API.
>
> George.
>
>
>     Thanks Gilles for suggesting disabling the interconnect. It is a
>     good, fast test and yes, *with "mpirun --mca pml ob1 --mca btl
>     tcp,self" I have no memory leak*. So this explains the differences
>     between my laptop and the cluster.
>     Is the implementation of type management so different from 1.7.3?
>
>     A PhD student tells me he also has some trouble with this code on
>     an Omni-Path based cluster. I will have to investigate that too,
>     but I am not sure it is the same problem.
>
>     Patrick
>
>     On 04/12/2020 at 01:34, Gilles Gouaillardet via users wrote:
>> Patrick,
>>
>> Based on George's idea, a simpler check is to retrieve the
>> Fortran index via the (standard) MPI_Type_c2f() function
>> after you create a derived datatype.
>>
>> If the index keeps growing forever even after you
>> MPI_Type_free(), then this clearly indicates a leak.
>>
>> Unfortunately, this simple test cannot be used to definitively rule
>> out any memory leak.
>>
>> Note you can also
>>
>> mpirun --mca pml ob1 --mca btl tcp,self ...
>>
>> in order to force communications over TCP/IP and hence rule out
>> any memory leak that could be triggered by your fast interconnect.
>>
>> In any case, a reproducer will greatly help us debug this issue.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 12/4/2020 7:20 AM, George Bosilca via users wrote:
>>> Patrick,
>>>
>>> I'm afraid there is no simple way to check this. The main reason
>>> is that OMPI uses handles for MPI objects, and these handles
>>> are not tracked by the library; they are supposed to be provided
>>> by the user for each call. In your case, as you have already called
>>> MPI_Type_free on the datatype, you cannot produce a valid handle.
>>>
>>> There might be a trick. If the datatype is manipulated with any
>>> Fortran MPI functions, then we convert the handle (which in fact
>>> is a pointer) to an index into a pointer array structure. Thus,
>>> the index will remain in use, and can therefore be converted
>>> back into a valid datatype pointer, until OMPI completely
>>> releases the datatype. Look into the ompi_datatype_f_to_c_table
>>> table to see the datatypes that exist and get their pointers,
>>> and then use these pointers as arguments to ompi_datatype_dump()
>>> to see if any of these existing datatypes are the ones you defined.
>>>
>>> George.
>>>
>>>
>>> On Thu, Dec 3, 2020 at 4:44 PM Patrick Bégou via users
>>> <users@lists.open-mpi.org> wrote:
>>>
>>>     Hi,
>>>
>>>     I'm trying to solve a memory leak that has appeared since my new
>>>     implementation of communications based on MPI_Alltoallw and
>>>     MPI_Type_create_subarray calls. Arrays of subarray types are
>>>     created/destroyed at each time step and used for communications.
>>>
>>>     On my laptop the code runs fine (running for 15000 temporal
>>>     iterations on 32 processes with oversubscription), but on our
>>>     cluster the memory used by the code increases until the OOM
>>>     killer stops the job. On the cluster we use IB QDR for
>>>     communications.
>>>
>>>     Same Gcc/Gfortran 7.3 (built from sources), same sources of
>>>     OpenMPI (3.1 or 4.0.5 tested), same sources of the Fortran
>>>     code on the laptop and on the cluster.
>>>
>>>     Using Gcc/Gfortran 4.8 and OpenMPI 1.7.3 on the cluster does not
>>>     show the problem (resident memory does not increase and we ran
>>>     100000 temporal iterations).
>>>
>>>     The MPI_Type_free manual says that it "/Marks the datatype object
>>>     associated with datatype for deallocation/". But how can I
>>>     check that the deallocation is really done?
>>>
>>>     Thanks for any suggestions.
>>>
>>>     Patrick
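
The index check that Gilles describes can be reduced to a small helper. The sketch below is not taken from the thread: the function name index_keeps_growing and the first-half/second-half comparison are assumptions made for illustration. It expects one recorded Fortran handle index per time step (for example the value returned by MPI_Type_c2f() right after a subarray type is created, stored as an int) and reports whether the index still reaches new maxima in the second half of the run. As Gilles points out, a negative answer does not rule out a leak elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of Open MPI or of the code discussed above).
 *
 * samples[i] is assumed to hold the Fortran handle index recorded at time
 * step i, e.g. (int) MPI_Type_c2f(type) taken right after the type is
 * created. Storing MPI_Fint values in an int is an assumption here.
 *
 * Returns 1 if some sample in the second half of the run is larger than
 * every sample in the first half (the index is still growing), else 0.
 */
int index_keeps_growing(const int *samples, size_t n)
{
    size_t half = n / 2;
    size_t i;
    int max_first;

    assert(samples != NULL && n >= 2);

    /* Highest index seen during the first half of the run. */
    max_first = samples[0];
    for (i = 1; i < half; ++i) {
        if (samples[i] > max_first) {
            max_first = samples[i];
        }
    }

    /* A larger index later on suggests freed slots are not being reused. */
    for (i = half; i < n; ++i) {
        if (samples[i] > max_first) {
            return 1;
        }
    }
    return 0;
}
```

With one sample per time step, a run in which MPI_Type_free() really releases the handles would be expected to return 0 once the first few steps have passed, while an index that rises at every step returns 1.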