Thanks, Gus, for your help. We have been upgrading OFED and OMPI for the last few days, so I don't have access to nodes running the outdated OFED at the moment. The updated nodes should be ready to test today.
I remember checking limits.conf and setting it to unlimited, but the warning kept showing up. We use Grid Engine, and I set the memory limit to unlimited. However, I don't think the scheduler has anything to do with the problem, since I ran an MPI job directly and the same warning appeared.

Adding these parameters yielded an error for the option 'log_mtts_per_seg'. I can't recall the error exactly, but it was something like "option not recognized" or "not supported". And setting 'log_num_mtt', as mentioned before, causes the ib0 interface to fail.

I'll report back what happens on the updated versions.

Waleed Lotfy
Bibliotheca Alexandrina
________________________________________
From: users [users-boun...@open-mpi.org] on behalf of Gus Correa [g...@ldeo.columbia.edu]
Sent: Tuesday, December 30, 2014 8:01 PM
To: Open MPI Users
Subject: Re: [OMPI users] Icreasing OFED registerable memory

Hi Waleed

Even before any OFED upgrades, you could try the items in the list below.
I have OMPI 1.6.5 and 1.8.3 working with an older OFED version, with those settings.
That is not really OMPI's fault, but InfiniBand/OFED's.

1) Make sure your locked memory is set to unlimited in /etc/security/limits.conf
For instance:

* soft memlock unlimited
* hard memlock unlimited

2) If you are using a queue system, make sure it sets the locked memory to unlimited, so that all child processes (including your mpiexec and MPI executable) will get it.
For instance, in Torque, in /etc/init.d/pbs_mom or in /etc/sysconfig/pbs_mom:

# locked memory
ulimit -l unlimited

3) Add the parameters below to /etc/modprobe.d/mlx4_core.conf

options mlx4_core log_num_mtt=22 log_mtts_per_seg=1

Do this with care, as the settings vary according to the physical RAM.
In addition, the parameters seem to have been deprecated in 3.X kernels, which makes this tricky.
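
As a rough sketch of how the two parameters in item 3 translate into registerable memory: the Open MPI FAQ linked in this message gives the relation max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE and suggests allowing roughly twice the physical RAM to be registered. The snippet below only does that arithmetic. The function names, the 4 KiB page size, and the factor of 2 are assumptions for illustration, not part of any OFED or Open MPI tool, and the snippet says nothing about which values a given OFED or kernel version will accept.

```python
# Sketch only: arithmetic from the Open MPI FAQ formula, with hypothetical helper names.
GIB = 1 << 30


def max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Registerable memory implied by the mlx4_core parameters.

    Assumes: max_reg_mem = 2^log_num_mtt * 2^log_mtts_per_seg * page_size.
    """
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


def suggest_log_num_mtt(ram_bytes, log_mtts_per_seg, page_size=4096, factor=2):
    """Smallest log_num_mtt that allows registering at least factor * RAM."""
    target = factor * ram_bytes
    log_num_mtt = 0
    while max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size) < target:
        log_num_mtt += 1
    return log_num_mtt


if __name__ == "__main__":
    # The values from item 3 above, on a node with 16 GiB of RAM.
    print(max_registerable_bytes(22, 1) // GIB, "GiB registerable")
    print("suggested log_num_mtt:", suggest_log_num_mtt(16 * GIB, 1))
```

Under these assumptions, log_num_mtt=22 with log_mtts_per_seg=1 corresponds to 32 GiB, which is twice the RAM of a 16 GB node.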
See these FAQs:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-user
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more
http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem

***

Having said that, a question remains unanswered:
Why is InfiniBand such a nightmare?

***

I hope this helps,
Gus Correa

On 12/30/2014 09:16 AM, Waleed Lotfy wrote:
> Thanks, Devendar, for your response.
>
> I'll test it on a new installation with OFED 2.3.2 and OMPI v1.6.5. If it
> doesn't work, I'll give 1.8.4 a try.
>
> Thank you for your help and I'll get back to you with hopefully good results.
>
> Waleed Lotfy
> Bibliotheca Alexandrina
> ________________________________
> From: users [users-boun...@open-mpi.org] on behalf of Deva
> [devendar.bure...@gmail.com]
> Sent: Monday, December 29, 2014 8:29 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Icreasing OFED registerable memory
>
> Hi Waleed,
>
> It is highly recommended to upgrade to the latest OFED. Meanwhile, can you try
> the latest OMPI release (v1.8.4), where this warning is ignored on older OFEDs?
>
> -Devendar
>
> On Sun, Dec 28, 2014 at 6:03 AM, Waleed Lotfy
> <waleed.lo...@bibalex.org<mailto:waleed.lo...@bibalex.org>> wrote:
> I have a bunch of 8 GB memory nodes in a cluster that were recently
> upgraded to 16 GB. When I run any jobs I get the following warning:
> --------------------------------------------------------------------------
> WARNING: It appears that your OpenFabrics subsystem is configured to only
> allow registering part of your physical memory. This can cause MPI jobs to
> run with erratic performance, hang, and/or crash.
>
> This may be caused by your OpenFabrics vendor limiting the amount of
> physical memory that can be registered.
> You should investigate the relevant Linux kernel module parameters that
> control how much physical memory can be registered, and increase them to
> allow registering all physical memory on your machine.
>
> See this Open MPI FAQ item for more information on these Linux kernel module
> parameters:
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
> Local host: comp022.local
> Registerable memory: 8192 MiB
> Total memory: 16036 MiB
>
> Your MPI job will continue, but may be behave poorly and/or hang.
> --------------------------------------------------------------------------
>
> Searching for a fix to this issue, I found that I have to set
> log_num_mtt within the kernel module, so I added this line to
> modprobe.conf:
>
> options mlx4_core log_num_mtt=21
>
> But then the ib0 interface fails to start, showing this error:
> ib_ipoib device ib0 does not seem to be present, delaying
> initialization.
>
> Reducing the value of log_num_mtt to 20 allows ib0 to start, but the
> warning about 8 GB of registerable memory still appears.
>
> I am using OFED 1.3.1. I know it is pretty old, and we are planning to
> upgrade soon.
>
> Output on all nodes for 'ompi_info -v ompi full --parsable':
>
> ompi:version:full:1.2.7
> ompi:version:svn:r19401
> orte:version:full:1.2.7
> orte:version:svn:r19401
> opal:version:full:1.2.7
> opal:version:svn:r19401
>
> Any help would be appreciated.
>
> Waleed Lotfy
> Bibliotheca Alexandrina
> _______________________________________________
> users mailing list
> us...@open-mpi.org<mailto:us...@open-mpi.org>
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/12/26076.php
>
> --
>
> -Devendar
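
One way to check the locked-memory limit discussed in this thread (limits.conf and the scheduler's ulimit) is to read it from inside the process that actually runs, for example from a job launched through the scheduler and again from one started directly. The sketch below uses Python's standard resource module. The helper names are hypothetical, and it only reports the limit; it does not change any setting or explain the registerable-memory warning by itself.

```python
# Sketch only: report the locked-memory (memlock) limit seen by this process.
import resource


def memlock_limits():
    """Return the (soft, hard) RLIMIT_MEMLOCK values, in bytes."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


def describe(limit):
    """Render a limit the way 'ulimit -l' does: 'unlimited' or a size in KiB."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit // 1024} KiB"


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print("soft memlock:", describe(soft))
    print("hard memlock:", describe(hard))
```

If the values differ between a scheduled job and a direct run, the scheduler daemon is probably not passing on the unlimited setting from item 2 of Gus's list.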