Re: [OMPI users] growing memory use from MPI application

2019-06-21 Thread Noam Bernstein via users
> On Jun 21, 2019, at 9:57 PM, Carlson, Timothy S wrote: > Switch back to stock OFED? Well, the CentOS-included OFED has a memory leak (at least when using UCX). I haven't tried OFED's stack yet. > Make sure all your cards are patched to the latest firmware. That's a good idea. I'
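For reference, a quick way to check what firmware a Mellanox HCA is currently running before deciding whether an update is needed; these are generic commands, not from the thread, and the device name mlx5_0 is only an example:
  ibstat mlx5_0 | grep -i firmware       # reports the running firmware version
  ibv_devinfo -d mlx5_0 | grep fw_ver    # same information via libibverbs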

Re: [OMPI users] growing memory use from MPI application

2019-06-21 Thread Noam Bernstein via users
Perhaps I spoke too soon. Now, with the Mellanox OFED stack, we occasionally get the following failure on exit: [compute-4-20:68008:0:68008] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10) 0 0x0002a3c5 opal_free_list_destruct() opal_free_list.c:0 1 0x

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 1:38 PM, Nathan Hjelm via users wrote: > THAT is a good idea. When using Omnipath we see an issue with stale files in /dev/shm if the application exits abnormally. I don't know if UCX uses that space as well. No stale shm files. echo 3 > /proc/sys/vm/drop_caches

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Jeff Squyres (jsquyres) via users
On Jun 20, 2019, at 1:34 PM, Noam Bernstein wrote: > Aha - using Mellanox’s OFED packaging seems to have essentially (if not 100%) fixed the issue. There still appears to be some small leak, but it’s of order 1 GB, not 10s of GB, and it doesn’t grow continuously. And on later runs of the

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Nathan Hjelm via users
THAT is a good idea. When using Omnipath we see an issue with stale files in /dev/shm if the application exits abnormally. I don't know if UCX uses that space as well. -Nathan On June 20, 2019 at 11:05 AM, Joseph Schuchart via users wrote: Noam, Another idea: check for stale files in /de

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 10:42 AM, Noam Bernstein via users wrote: > I haven’t yet tried the latest OFED or Mellanox low-level stuff. That’s next on my list, but slightly more involved to do, so I’ve been avoiding it. Aha - using Mellanox’s OFED packaging seems to have essentially (if not 10

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Joseph Schuchart via users
Noam, Another idea: check for stale files in /dev/shm/ (or a subdirectory that looks like it belongs to UCX/OpenMPI) and SysV shared memory using `ipcs -m`. Joseph On 6/20/19 3:31 PM, Noam Bernstein via users wrote: On Jun 20, 2019, at 4:44 AM, Charles A Taylor w
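A minimal way to run both of those checks from the shell; the grep pattern is only a guess at typical UCX/Open MPI/vader segment names, and <shmid> is a placeholder:
  ls -l /dev/shm | grep -Ei 'ucx|mpi|vader'   # stale POSIX shm files left behind by a crashed job
  ipcs -m                                     # list SysV shared-memory segments and their owners
  ipcrm -m <shmid>                            # remove a stale segment by id (only if no process is attached)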

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread John Hearns via users
Errr.. you have dropped caches? echo 3 > /proc/sys/vm/drop_caches On Thu, 20 Jun 2019 at 15:59, Yann Jobic via users wrote: > Hi, > On 6/20/2019 at 3:31 PM, Noam Bernstein via users wrote: >> On Jun 20, 2019, at 4:44 AM, Charles A Taylor wrote:

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Yann Jobic via users
Hi, On 6/20/2019 at 3:31 PM, Noam Bernstein via users wrote: On Jun 20, 2019, at 4:44 AM, Charles A Taylor wrote: This looks a lot like a problem I had with OpenMPI 3.1.2. I thought the fix landed in 4.0.0 but you might want to check the code to be sure ther

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 9:40 AM, Jeff Squyres (jsquyres) wrote: > On Jun 20, 2019, at 9:31 AM, Noam Bernstein via users wrote: >> One thing that I’m wondering if anyone familiar with the internals can explain is how you get a memory leak that isn’t freed when the program ends?

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread John Hearns via users
The kernel using memory is why I suggested running slabtop, to see the kernel slab allocations. Clearly I was barking up the wrong tree there... On Thu, 20 Jun 2019 at 14:41, Jeff Squyres (jsquyres) via users <users@lists.open-mpi.org> wrote: > On Jun 20, 2019, at 9:31 AM, Noam Bernstein via users

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Jeff Squyres (jsquyres) via users
On Jun 20, 2019, at 9:31 AM, Noam Bernstein via users wrote: > One thing that I’m wondering if anyone familiar with the internals can explain is how you get a memory leak that isn’t freed when the program ends? Doesn’t that suggest that it’s something lower level, like maybe a kernel

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 4:44 AM, Charles A Taylor wrote: > This looks a lot like a problem I had with OpenMPI 3.1.2. I thought the fix landed in 4.0.0 but you might want to check the code to be sure there wasn’t a regression in 4.1.x. Most of our codes are still running 3.1.2 so

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Charles A Taylor via users
This looks a lot like a problem I had with OpenMPI 3.1.2. I thought the fix landed in 4.0.0 but you might want to check the code to be sure there wasn’t a regression in 4.1.x. Most of our codes are still running 3.1.2 so I haven’t built anything beyond 4.0.0, which definitely included the f

Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread Noam Bernstein via users
> On Jun 19, 2019, at 5:05 PM, Joshua Ladd wrote: > Hi, Noam > Can you try your original command line with the following addition: > mpirun --mca pml ucx --mca btl ^vader,tcp,openib -mca osc ucx > I think we're seeing some conflict between UCX PML and UCT OSC. I did this, although me

Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread Joshua Ladd via users
Hi, Noam Can you try your original command line with the following addition: mpirun --mca pml ucx --mca btl ^vader,tcp,openib -mca osc ucx I think we're seeing some conflict between UCX PML and UCT OSC. Josh On Wed, Jun 19, 2019 at 4:36 PM Noam Bernstein via users <users@lists.open-mpi.org>

Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread Noam Bernstein via users
> On Jun 19, 2019, at 2:44 PM, George Bosilca wrote: > To completely disable UCX you need to disable the UCX MTL and not only the BTL. I would use "--mca pml ob1 --mca btl ^ucx --mca btl_openib_allow_ib 1". Thanks for the pointer. Disabling ucx this way _does_ seem to fix the memory issue.

Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread George Bosilca via users
To completely disable UCX you need to disable the UCX MTL and not only the BTL. I would use "--mca pml ob1 --mca btl ^ucx --mca btl_openib_allow_ib 1". As you have a gdb session on the processes you can try to break on some of the memory allocation functions (malloc, realloc, calloc). George.
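A minimal sketch of that gdb approach, assuming you attach to one already-running rank; the PID is a placeholder, and the breakpoints fire so often that in practice you would usually add conditions or a hit count:
  gdb -p <pid>
  (gdb) break malloc
  (gdb) break calloc
  (gdb) break realloc
  (gdb) continue
  (gdb) bt          # at each stop, the backtrace shows who is doing the allocating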

Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread Noam Bernstein via users
I tried to disable ucx (successfully, I think - I replaced the "--mca btl ucx --mca btl ^vader,tcp,openib" with "--mca btl_openib_allow_ib 1", and attaching gdb to a running process shows no ucx-related routines active). It still has the same fast-growing (1 GB/s) memory usage problem.

Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread Noam Bernstein via users
> On Jun 19, 2019, at 2:00 PM, John Hearns via users wrote: > Noam, it may be a stupid question. Could you try running slabtop as the program executes? The top SIZE usage is this line: OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME 5937540 5937540 100%

Re: [OMPI users] growing memory use from MPI application

2019-06-19 Thread John Hearns via users
Noam, it may be a stupid question. Could you try running slabtop as the program executes? Also, 'watch cat /proc/meminfo' is a good diagnostic On Wed, 19 Jun 2019 at 18:32, Noam Bernstein via users <users@lists.open-mpi.org> wrote: > Hi - we’re having a weird problem with OpenMPI on ou

[OMPI users] growing memory use from MPI application

2019-06-19 Thread Noam Bernstein via users
Hi - we’re having a weird problem with OpenMPI on our newish InfiniBand EDR (mlx5) nodes. We're running CentOS 7.6, with all the InfiniBand and UCX libraries as provided by CentOS, i.e. ucx-1.4.0-1.el7.x86_64 libibverbs-utils-17.2-3.el7.x86_64 libibverbs-17.2-3.el7.x86_64 libibumad-17.2-3.el7.x8
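For context, a couple of quick ways to confirm which UCX and verbs packages a node is actually running; these are generic commands, not from the thread, and the grep pattern simply matches the RPM names listed above:
  rpm -qa | grep -Ei 'ucx|libibverbs|libibumad'   # installed CentOS package versions
  ucx_info -v                                     # UCX version and build configuration, from the UCX tools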