Hi,

This should be fixed in OMPI 1.8.3. Is it possible for you to give 1.8.3 a shot?
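For reference, the limit the warning is based on follows the formula from the Open MPI FAQ page linked below. A minimal sketch with the parameter values you reported (page_size=4096 is an assumption for x86_64):

```shell
#!/bin/sh
# Registered-memory estimate from the Open MPI FAQ formula:
#   max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * page_size
# log_num_mtt and log_mtts_per_seg are the values reported for pax52;
# page_size=4096 is an assumed default, check with `getconf PAGE_SIZE`.
log_num_mtt=0          # /sys/module/mlx4_core/parameters/log_num_mtt
log_mtts_per_seg=3     # /sys/module/mlx4_core/parameters/log_mtts_per_seg
page_size=4096

max_reg=$(( (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size ))
echo "FAQ formula yields: ${max_reg} bytes"   # 32768 bytes, i.e. only 32 KiB
```

A result this small suggests that on kernels which scale the MTT table with system RAM, these module parameters no longer reflect the real registration limit, so any estimate derived from them is unreliable.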
Best,
Josh

On Mon, Dec 8, 2014 at 8:43 AM, Götz Waschk <goetz.was...@gmail.com> wrote:

> Dear Open-MPI experts,
>
> I have updated my little cluster from Scientific Linux 6.5 to 6.6, which
> included extensive changes in the InfiniBand drivers and a newer
> Open MPI version (1.8.1). Now I'm getting this message on all nodes
> with more than 32 GB of RAM:
>
> WARNING: It appears that your OpenFabrics subsystem is configured to only
> allow registering part of your physical memory. This can cause MPI jobs to
> run with erratic performance, hang, and/or crash.
>
> This may be caused by your OpenFabrics vendor limiting the amount of
> physical memory that can be registered. You should investigate the
> relevant Linux kernel module parameters that control how much physical
> memory can be registered, and increase them to allow registering all
> physical memory on your machine.
>
> See this Open MPI FAQ item for more information on these Linux kernel
> module parameters:
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
> Local host:           pax98
> Registerable memory:  32768 MiB
> Total memory:         49106 MiB
>
> Your MPI job will continue, but may behave poorly and/or hang.
>
> The issue is similar to the one described in a previous thread about
> Ubuntu nodes:
> http://www.open-mpi.org/community/lists/users/2014/08/25090.php
> The InfiniBand driver is different, though: the parameters log_num_mtt
> and log_mtts_per_seg both still exist, but they cannot be changed and
> have the same values on all configurations:
>
> [pax52] /root # cat /sys/module/mlx4_core/parameters/log_num_mtt
> 0
> [pax52] /root # cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
> 3
>
> The kernel changelog says that Red Hat has included this commit:
>   mlx4: Scale size of MTT table with system RAM (Doug Ledford)
> so the buffers should scale automatically and everything should be fine.
> However, as far as I can see, the wrong value calculated by
> calculate_max_reg() is still used in the code, so I think I cannot
> simply ignore the warning. A user has also reported a problem with a
> job, but I cannot confirm that this is the cause.
>
> My workaround was to simply load the mlx5_core kernel module, as this
> is what calculate_max_reg() uses to detect OFED 2.0.
>
> Regards, Götz Waschk
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/12/25923.php