[OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak

2015-06-30 Thread Steven Eliuk
Hi All,

Looks like we have found a large memory leak,

Very difficult to share code on this but here are some details,

1.8.5 w/ CUDA 7.0 — no memory leak
1.8.5 w/ CUDA 6.5 — no memory leak
1.8.6 w/ CUDA 7.0 — large memory leak
MVAPICH2 2.1 GDR — no issue with either CUDA version.

We have a relatively basic program that reproduces the leak, and we have even
narrowed it down to a single machine w/ multiple GPUs and only two slave processes.
It looks like something in the IPC path within a single node.

We don’t have many free cycles at the moment, but let us know if we can help w/
something basic.

Here's our configure command for 1.8.5:

./configure FC=gfortran --without-mx --with-openib=/usr 
--with-openib-libdir=/usr/lib64/ --enable-openib-rdmacm --without-psm 
--with-cuda=/cm/shared/apps/cuda70/toolkit/current 
--prefix=/cm/shared/OpenMPI_1_8_5_CUDA70
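As a side note, whether a build from a configure line like this actually has CUDA support compiled in can be verified with the check documented by Open MPI (the `true` line below is what a CUDA-aware build is expected to print, shown as an assumption rather than output captured from this cluster):

```shell
# Ask ompi_info for the CUDA support flag baked into the build; skip
# gracefully if ompi_info is not on PATH.
if command -v ompi_info >/dev/null; then
    ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
    # A CUDA-aware build reports:
    #   mca:mpi:base:param:mpi_built_with_cuda_support:value:true
fi
```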

Kindest Regards,
—
Steven Eliuk, Ph.D. Comp Sci,
Project Lead,
Computing Science Innovation Center,
SRA - SV,
Samsung Electronics,
665 Clyde Avenue,
Mountain View, CA 94043,
Work: +1 650-623-2986,
Cell: +1 408-819-4407.



Re: [OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak

2015-06-30 Thread Rolf vandeVaart
Hi Steven,
Thanks for the report.  Very little has changed between 1.8.5 and 1.8.6 within 
the CUDA-aware specific code so I am perplexed.  Also interesting that you do 
not see the issue with 1.8.5 and CUDA 7.0.
You mentioned that it is hard to share the code on this but maybe you could 
share how you observed the behavior.  Does the code need to run for a while to 
see this?
Any suggestions on how I could reproduce this?

Thanks,
Rolf


From: Steven Eliuk [mailto:s.el...@samsung.com]
Sent: Tuesday, June 30, 2015 6:05 PM
To: Rolf vandeVaart
Cc: Open MPI Users
Subject: 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak



---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak

2015-06-30 Thread Steven Eliuk
It ramps up over time; we had a bunch of locked-up nodes over the weekend and have
traced it back to this.

Let me see if I can share more details,

I will review with everyone tomorrow and get back to you,






Re: [OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak

2015-07-01 Thread Stefan Paquay
Hi all,

Hopefully this mail gets posted in the right thread...

I have noticed what I assume is the same leak using Open MPI 1.8.6 with LAMMPS, a
molecular dynamics program, without any use of CUDA. I am not that familiar
with LAMMPS's internal memory management, but the leak does not
appear to be CUDA-related.

The symptoms are the same:
Open MPI 1.8.5: everything is fine
Open MPI 1.8.6: same setup, pretty large leak

Unfortunately, I have no idea how to isolate the bug, but to reproduce it:
1. clone LAMMPS (git clone git://git.lammps.org/lammps-ro.git lammps)
2. cd src/ and compile with Open MPI 1.8.6
3. run the example in lammps/examples/melt

I would like to help find this bug but I am not sure what would help.
LAMMPS itself is pretty big so I can imagine you might not want to go
through all of the code...


Re: [OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak

2015-07-01 Thread Rolf vandeVaart
Hi Stefan (and Steven who reported this earlier with CUDA-aware program)



I have managed to observe the leak when running LAMMPS as well.  Note that
this has nothing to do with the CUDA-aware features.  I am going to move this
discussion to the Open MPI developers’ list to dig deeper into this issue.
Thanks for reporting.



Rolf



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Stefan Paquay
Sent: Wednesday, July 01, 2015 11:43 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak





Re: [OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak

2015-07-06 Thread Rolf vandeVaart
Just an FYI: this issue has been found and fixed, and the fix will be available in
the next release.
https://github.com/open-mpi/ompi-release/pull/357

Rolf
