Re: [OMPI users] Double free or corruption problem updated result

2017-06-17 Thread ashwin .D
Hello Gilles,

First of all, I am extremely grateful for this communication from you on a
weekend, and only a few hours after I posted my email. Well, I am not sure I
can go on posting log files, as you rightly point out that MPI is not the
source of the problem. Still, I have enclosed the valgrind log files as you
requested. I have downloaded the MPICH packages as you suggested and I am
going to install them shortly. But before I do that, I think I have a clue
about the source of my problem (double free or corruption) and I would really
appreciate your advice.


As I mentioned before, COSMO has been compiled with mpif90 for the parallel
(shared-memory) runs and with gfortran for sequential runs. But it depends on
a lot of external third-party software such as zlib, libcurl, hdf5, netcdf
and netcdf-fortran. When I looked at the config.log of those packages, all of
them had been compiled with gfortran and gcc (and in some cases g++) with the
enable-shared option. So my question is: could that be a source of the
"mismatch"?

In other words, I would have to recompile all those packages with mpif90 and
mpicc and then try another test. At the very least there should be no mixing
of gcc/gfortran-compiled code with mpif90-compiled code. Comments?
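A minimal sketch, not from the thread, of one way to check for mixed runtimes
before recompiling everything - assuming the COSMO executable is called
lmparbin_all and the third-party libraries were installed under
/usr/local/lib (both are assumptions):

  # Which MPI and Fortran runtime libraries does the executable pull in?
  ldd ./lmparbin_all | grep -E 'libmpi|libgfortran'

  # Same check for the third-party libraries COSMO links against
  # (adjust the paths to wherever netcdf/hdf5 actually live).
  for lib in /usr/local/lib/libnetcdff.so /usr/local/lib/libnetcdf.so \
             /usr/local/lib/libhdf5.so; do
      echo "== $lib"; ldd "$lib" | grep -E 'libmpi|libgfortran'
  done

If two different libgfortran versions (or a stray libmpi) show up, that would
point at the kind of runtime mix-up suspected above. Note that Open MPI's
mpif90/mpicc are thin wrappers around gfortran/gcc, so rebuilding serial
libraries with them mainly changes the flags, not the compiler itself.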


Best regards,
Ashwin.

>Ashwin,

>did you try to run your app with an MPICH-based library (mvapich,
>IntelMPI or even stock mpich) ?
>or did you try with Open MPI v1.10 ?
>the stacktrace does not indicate the double free occurs in MPI...

>it seems you ran valgrind against a shell and not your binary.
>assuming your mpirun command is
>mpirun lmparbin_all
>I suggest you try again with
>mpirun --tag-output valgrind lmparbin_all
>that will generate one valgrind log per task, but these logs are prefixed,
>so it should be easier to figure out what is going wrong

>Cheers,

>Gilles
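A hedged expansion of Gilles's suggestion (not part of his message), assuming
four ranks and that Open MPI exports OMPI_COMM_WORLD_RANK to each process,
which recent versions do; valgrind's --log-file accepts %q{VAR} to embed an
environment variable, so each rank gets its own log file:

  mpirun --tag-output -np 4 \
      valgrind --leak-check=full --track-origins=yes \
               --log-file=valgrind.rank-%q{OMPI_COMM_WORLD_RANK}.log \
               ./lmparbin_all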


On Sun, Jun 18, 2017 at 8:11 AM, ashwin .D  wrote:

> There is a sequential version of the same program COSMO (no reference to
> MPI) that I can run without any problems. Of course it takes a lot longer
> to complete. Now I also ran valgrind (not sure whether that is useful or
> not) and I have enclosed the logs.
>
> On Sat, Jun 17, 2017 at 7:20 PM, ashwin .D  wrote:
>
>> Hello Gilles,
>>I am enclosing all the information you requested.
>>
>> 1) As an attachment I enclose the log file.
>> 2) I did rebuild OpenMPI 2.1.1 with the --enable-debug feature and I
>> reinstalled it in /usr/lib/local.
>> I ran all the examples in the examples directory. All passed except
>> oshmem_strided_puts where I got this message
>>
>> [[48654,1],0][pshmem_iput.c:70:pshmem_short_iput] Target PE #1 is not in
>> valid range
>> --
>> SHMEM_ABORT was invoked on rank 0 (pid 13409, host=a-Vostro-3800) with
>> errorcode -1.
>> --
>>
>>
>> 3) I deleted all old OpenMPI versions under /usr/local/lib.
>> 4) I am using the COSMO weather model - http://www.cosmo-model.org/ to
>> run simulations
>> The support staff claim they have seen no errors with a similar setup.
>> They use
>>
>> 1) gfortran 4.8.5
>> 2) OpenMPI 1.10.1
>>
>> The only difference is I use OpenMPI 2.1.1.
>>
>> 5) I did try this option as well: mpirun --mca btl tcp,self -np 4 cosmo,
>> and I got the same error as in the mpi_logs file.
>>
>> 6) Regarding compiler and linking options on Ubuntu 16.04
>>
>> mpif90 --showme:compile and --showme:link give me the options for
>> compiling and linking.
>>
>> Here are the options from my makefile
>>
>> -pthread -lmpi_usempi -lmpi_mpifh -lmpi for linking
>>
>> 7) I have a 64 bit OS.
>>
>> Well, I think I have responded to all of your questions. In case I have
>> not, please let me know and I will respond ASAP. The only thing I have not
>> done is look at /usr/local/include. I saw some old OpenMPI files there. If
>> those need to be deleted I will do so after I hear from you.
>>
>> Best regards,
>> Ashwin.
>>
>>
>


logs
Description: Binary data

Re: [OMPI users] Double free or corruption problem updated result

2017-06-17 Thread ashwin .D
There is a sequential version of the same program COSMO (no reference to
MPI) that I can run without any problems. Of course it takes a lot longer
to complete. Now I also ran valgrind (not sure whether that is useful or
not) and I have enclosed the logs.

On Sat, Jun 17, 2017 at 7:20 PM, ashwin .D  wrote:

> Hello Gilles,
>I am enclosing all the information you requested.
>
> 1) As an attachment I enclose the log file.
> 2) I did rebuild OpenMPI 2.1.1 with the --enable-debug feature and I
> reinstalled it in /usr/lib/local.
> I ran all the examples in the examples directory. All passed except
> oshmem_strided_puts where I got this message
>
> [[48654,1],0][pshmem_iput.c:70:pshmem_short_iput] Target PE #1 is not in
> valid range
> --
> SHMEM_ABORT was invoked on rank 0 (pid 13409, host=a-Vostro-3800) with
> errorcode -1.
> --
>
>
> 3) I deleted all old OpenMPI versions under /usr/local/lib.
> 4) I am using the COSMO weather model - http://www.cosmo-model.org/ to
> run simulations
> The support staff claim they have seen no errors with a similar setup.
> They use
>
> 1) gfortran 4.8.5
> 2) OpenMPI 1.10.1
>
> The only difference is I use OpenMPI 2.1.1.
>
> 5) I did try this option as well: mpirun --mca btl tcp,self -np 4 cosmo,
> and I got the same error as in the mpi_logs file.
>
> 6) Regarding compiler and linking options on Ubuntu 16.04
>
> mpif90 --showme:compile and --showme:link give me the options for
> compiling and linking.
>
> Here are the options from my makefile
>
> -pthread -lmpi_usempi -lmpi_mpifh -lmpi for linking
>
> 7) I have a 64 bit OS.
>
> Well, I think I have responded to all of your questions. In case I have
> not, please let me know and I will respond ASAP. The only thing I have not
> done is look at /usr/local/include. I saw some old OpenMPI files there. If
> those need to be deleted I will do so after I hear from you.
>
> Best regards,
> Ashwin.
>
>


logs
Description: Binary data

[OMPI users] Double free or corruption problem updated result

2017-06-17 Thread ashwin .D
Hello Gilles,
   I am enclosing all the information you requested.

1) As an attachment I enclose the log file.
2) I did rebuild OpenMPI 2.1.1 with the --enable-debug feature and I
reinstalled it in /usr/lib/local.
I ran all the examples in the examples directory. All passed except
oshmem_strided_puts where I got this message

[[48654,1],0][pshmem_iput.c:70:pshmem_short_iput] Target PE #1 is not in
valid range
--
SHMEM_ABORT was invoked on rank 0 (pid 13409, host=a-Vostro-3800) with
errorcode -1.
--


3) I deleted all old OpenMPI versions under /usr/local/lib.
4) I am using the COSMO weather model - http://www.cosmo-model.org/ - to run
simulations. The support staff claim they have seen no errors with a similar
setup. They use:

1) gfortran 4.8.5
2) OpenMPI 1.10.1

The only difference is I use OpenMPI 2.1.1.

5) I did try this option as well: mpirun --mca btl tcp,self -np 4 cosmo, and
I got the same error as in the mpi_logs file (see the sketch after this
list).

6) Regarding compiler and linking options on Ubuntu 16.04

mpif90 --showme:compile and --showme:link give me the options for compiling
and linking.

Here are the options from my makefile

-pthread -lmpi_usempi -lmpi_mpifh -lmpi for linking

7) I have a 64-bit OS.
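Not part of the original message - a small sketch tied to points 2), 5) and
6) above, under the assumption that Open MPI 2.1.1 is the only MPI installed
and that the shared-memory BTL ("vader" in the 2.x series) is wanted
alongside tcp:

  # What mpif90 would actually use, to compare against the makefile's
  # hand-written flags (-pthread -lmpi_usempi -lmpi_mpifh -lmpi).
  mpif90 --showme:compile
  mpif90 --showme:link

  # Run with an explicit BTL list; "self" is always required.
  mpirun --mca btl self,vader,tcp -np 4 ./cosmo

  # The oshmem_strided_puts failure in point 2) looks as if the example was
  # launched with a single PE (the error says it targets PE #1); this is only
  # a guess, but re-running it with at least two processes should tell.
  mpirun -np 2 ./oshmem_strided_puts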

Well, I think I have responded to all of your questions. In case I have not,
please let me know and I will respond ASAP. The only thing I have not done is
look at /usr/local/include. I saw some old OpenMPI files there. If those need
to be deleted I will do so after I hear from you.

Best regards,
Ashwin.


mpi_logs
Description: Binary data

Re: [OMPI users] Double free or corruption with OpenMPI 2.0

2017-06-15 Thread ashwin .D
Hello Jeff and Gilles,
         I just logged in to see the archives, and this message from Gilles -
https://www.mail-archive.com/users@lists.open-mpi.org//msg31219.html - and
this message from Jeff -
https://www.mail-archive.com/users@lists.open-mpi.org//msg31217.html - are
very useful. Please give me a couple of days to implement some of the ideas
that you both have suggested and allow me to get back to you.

Best regards,
Ashwin

On Wed, Jun 14, 2017 at 4:01 PM, ashwin .D  wrote:

> Hello,
>   I found a thread with Intel MPI(although I am using gfortran
> 4.8.5 and OpenMPI 2.1.1) - https://software.intel.com/en-
> us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/564266 but
> the error the OP gets is the same as mine
>
> *** glibc detected *** ./a.out: double free or corruption (!prev):
> 0x7fc6dc80 ***
> ======= Backtrace: =========
> /lib64/libc.so.6[0x3411e75e66]
> /lib64/libc.so.6[0x3411e789b3]
>
> So the explanation given in that post is this -
> "From their examination our Development team concluded the underlying
> problem with openmpi 1.8.6 resulted from mixing out-of-date/incompatible
> Fortran RTLs. In short, there were older static Fortran RTL bodies
> incorporated in the openmpi library that when mixed with newer Fortran RTL
> led to the failure. They found the issue is resolved in the newer
> openmpi-1.10.1rc2 and recommend resolving requires using a newer openmpi
> release with our 15.0 (or newer) release." Could this be possible with my
> version as well ?
>
>
> I am willing to debug this provided I am given some clue on how to
> approach my problem. At the moment I am unable to proceed further, and the
> only thing I can add is that I ran tests with the sequential form of my
> application and it is much slower, although I am using shared memory and
> all the cores are on the same machine.
>
> Best regards,
> Ashwin.
>
>
>
>
>
> On Tue, Jun 13, 2017 at 5:52 PM, ashwin .D  wrote:
>
>> Also when I try to build and run a make check I get these errors - Am I
>> clear to proceed or is my installation broken ? This is on Ubuntu 16.04
>> LTS.
>>
>> ==
>>Open MPI 2.1.1: test/datatype/test-suite.log
>> ==
>>
>> # TOTAL: 9
>> # PASS:  8
>> # SKIP:  0
>> # XFAIL: 0
>> # FAIL:  1
>> # XPASS: 0
>> # ERROR: 0
>>
>> .. contents:: :depth: 2
>>
>> FAIL: external32
>> 
>>
>> /home/t/openmpi-2.1.1/test/datatype/.libs/lt-external32: symbol lookup
>> error: /home/openmpi-2.1.1/test/datatype/.libs/lt-external32: undefined
>> symbol: ompi_datatype_pack_external_size
>> FAIL external32 (exit status:
>>
>> On Tue, Jun 13, 2017 at 5:24 PM, ashwin .D  wrote:
>>
>>> Hello,
>>>   I am using OpenMPI 2.0.0 with a computational fluid dynamics
>>> software and I am encountering a series of errors when running this with
>>> mpirun. This is my lscpu output
>>>
>>> CPU(s):                4
>>> On-line CPU(s) list:   0-3
>>> Thread(s) per core:    2
>>> Core(s) per socket:    2
>>> Socket(s):             1
>>>
>>> and I am running OpenMPI's mpirun in the following way
>>>
>>> mpirun -np 4 cfd_software
>>>
>>> and I get double free or corruption every single time.
>>>
>>> I have two questions -
>>>
>>> 1) I am unable to capture the standard error that mpirun throws in a file
>>>
>>> How can I go about capturing the standard error of mpirun ?
>>>
>>> 2) Has this error, i.e. double free or corruption, been reported by
>>> others? Is there a bug fix available?
>>>
>>> Regards,
>>>
>>> Ashwin.
>>>
>>>
>>
>

Re: [OMPI users] Double free or corruption with OpenMPI 2.0

2017-06-14 Thread ashwin .D
Hello,
  I found a thread with Intel MPI (although I am using gfortran
4.8.5 and OpenMPI 2.1.1) -
https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/564266
- but the error the OP gets is the same as mine:

*** glibc detected *** ./a.out: double free or corruption (!prev):
0x7fc6dc80 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3411e75e66]
/lib64/libc.so.6[0x3411e789b3]

So the explanation given in that post is this -
"From their examination our Development team concluded the underlying
problem with openmpi 1.8.6 resulted from mixing out-of-date/incompatible
Fortran RTLs. In short, there were older static Fortran RTL bodies
incorporated in the openmpi library that when mixed with newer Fortran RTL
led to the failure. They found the issue is resolved in the newer
openmpi-1.10.1rc2 and recommend resolving requires using a newer openmpi
release with our 15.0 (or newer) release." Could this be possible with my
version as well ?


I am willing to debug this provided I am given some clue on how to approach
my problem. At the moment I am unable to proceed further, and the only thing
I can add is that I ran tests with the sequential form of my application and
it is much slower, although I am using shared memory and all the cores are
on the same machine.
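Not something from the thread - a rough way to check for the Fortran-runtime
mix described in the Intel post, assuming Open MPI 2.1.1 was installed under
/usr/local, the application binary is called cosmo, and libgfortran lives in
the usual Ubuntu 16.04 location (all assumptions):

  # Which shared libraries do the Open MPI Fortran bindings and the
  # application declare as dependencies?
  readelf -d /usr/local/lib/libmpi_mpifh.so | grep NEEDED
  readelf -d ./cosmo | grep NEEDED

  # Which libgfortran symbol versions does the installed runtime provide?
  objdump -T /usr/lib/x86_64-linux-gnu/libgfortran.so.3 \
      | grep -o 'GFORTRAN_[0-9.]*' | sort -u

If the bindings and the application end up pulling in different libgfortran
major versions, that would match the mixed-RTL scenario quoted above.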

Best regards,
Ashwin.





On Tue, Jun 13, 2017 at 5:52 PM, ashwin .D  wrote:

> Also when I try to build and run a make check I get these errors - Am I
> clear to proceed or is my installation broken ? This is on Ubuntu 16.04
> LTS.
>
> ==
>Open MPI 2.1.1: test/datatype/test-suite.log
> ==
>
> # TOTAL: 9
> # PASS:  8
> # SKIP:  0
> # XFAIL: 0
> # FAIL:  1
> # XPASS: 0
> # ERROR: 0
>
> .. contents:: :depth: 2
>
> FAIL: external32
> 
>
> /home/t/openmpi-2.1.1/test/datatype/.libs/lt-external32: symbol lookup
> error: /home/openmpi-2.1.1/test/datatype/.libs/lt-external32: undefined
> symbol: ompi_datatype_pack_external_size
> FAIL external32 (exit status:
>
> On Tue, Jun 13, 2017 at 5:24 PM, ashwin .D  wrote:
>
>> Hello,
>>   I am using OpenMPI 2.0.0 with a computational fluid dynamics
>> software and I am encountering a series of errors when running this with
>> mpirun. This is my lscpu output
>>
>> CPU(s):                4
>> On-line CPU(s) list:   0-3
>> Thread(s) per core:    2
>> Core(s) per socket:    2
>> Socket(s):             1
>>
>> and I am running OpenMPI's mpirun in the following way
>>
>> mpirun -np 4 cfd_software
>>
>> and I get double free or corruption every single time.
>>
>> I have two questions -
>>
>> 1) I am unable to capture the standard error that mpirun throws in a file
>>
>> How can I go about capturing the standard error of mpirun ?
>>
>> 2) Has this error, i.e. double free or corruption, been reported by
>> others? Is there a bug fix available?
>>
>> Regards,
>>
>> Ashwin.
>>
>>
>

Re: [OMPI users] Double free or corruption with OpenMPI 2.0

2017-06-13 Thread ashwin .D
Also, when I try to build and run make check I get these errors. Am I clear
to proceed, or is my installation broken? This is on Ubuntu 16.04 LTS.

==
   Open MPI 2.1.1: test/datatype/test-suite.log
==

# TOTAL: 9
# PASS:  8
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: external32


/home/t/openmpi-2.1.1/test/datatype/.libs/lt-external32: symbol lookup
error: /home/openmpi-2.1.1/test/datatype/.libs/lt-external32: undefined
symbol: ompi_datatype_pack_external_size
FAIL external32 (exit status:
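Not from the original message - the undefined ompi_datatype_pack_external_size
symbol often means the test resolved against a libmpi from an older install,
so one possible check (paths taken from the log lines above, and therefore
assumptions) is:

  # Which libmpi does the failing test actually load, and does that library
  # export the missing symbol?
  cd /home/t/openmpi-2.1.1/test/datatype
  ldd .libs/lt-external32 | grep libmpi
  nm -D "$(ldd .libs/lt-external32 | awk '/libmpi\.so/ {print $3}')" \
      | grep ompi_datatype_pack_external_size || echo "symbol not found"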

On Tue, Jun 13, 2017 at 5:24 PM, ashwin .D  wrote:

> Hello,
>   I am using OpenMPI 2.0.0 with a computational fluid dynamics
> software and I am encountering a series of errors when running this with
> mpirun. This is my lscpu output
>
> CPU(s):                4
> On-line CPU(s) list:   0-3
> Thread(s) per core:    2
> Core(s) per socket:    2
> Socket(s):             1
>
> and I am running OpenMPI's mpirun in the following way
>
> mpirun -np 4 cfd_software
>
> and I get double free or corruption every single time.
>
> I have two questions -
>
> 1) I am unable to capture the standard error that mpirun throws in a file
>
> How can I go about capturing the standard error of mpirun ?
>
> 2) Has this error, i.e. double free or corruption, been reported by others?
> Is there a bug fix available?
>
> Regards,
>
> Ashwin.
>
>

[OMPI users] Double free or corruption with OpenMPI 2.0

2017-06-13 Thread ashwin .D
Hello,
  I am using OpenMPI 2.0.0 with a computational fluid dynamics
software and I am encountering a series of errors when running this with
mpirun. This is my lscpu output

CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1

and I am running OpenMPI's mpirun in the following way:

mpirun -np 4 cfd_software

and I get double free or corruption every single time.

I have two questions -

1) I am unable to capture the standard error that mpirun throws in a file.
How can I go about capturing the standard error of mpirun?

2) Has this error, i.e. double free or corruption, been reported by others?
Is there a bug fix available?
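Not part of the original message - two ways question 1) could be addressed,
sketched under the assumption of a bash shell; the exact per-rank file naming
of --output-filename varies between Open MPI versions:

  # Plain shell redirection: stdout and stderr of mpirun and all ranks go
  # into one file.
  mpirun -np 4 ./cfd_software > run.log 2>&1

  # Or let mpirun write one output file per rank.
  mpirun -np 4 --output-filename run.log ./cfd_software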

Regards,

Ashwin.