Hi, I tested 1.8.4rc1 and it wasn't fixed there either. I can try again, though; maybe I made an error.
Regards, Götz Waschk

On Mon, Dec 8, 2014 at 3:17 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
> Hi,
>
> This should be fixed in OMPI 1.8.3. Is it possible for you to give 1.8.3 a shot?
>
> Best,
>
> Josh
>
> On Mon, Dec 8, 2014 at 8:43 AM, Götz Waschk <goetz.was...@gmail.com> wrote:
>>
>> Dear Open-MPI experts,
>>
>> I have updated my little cluster from Scientific Linux 6.5 to 6.6. This
>> included extensive changes to the InfiniBand drivers and a newer Open MPI
>> version (1.8.1). Now I'm getting this message on all nodes with more than
>> 32 GB of RAM:
>>
>> WARNING: It appears that your OpenFabrics subsystem is configured to only
>> allow registering part of your physical memory. This can cause MPI jobs to
>> run with erratic performance, hang, and/or crash.
>>
>> This may be caused by your OpenFabrics vendor limiting the amount of
>> physical memory that can be registered. You should investigate the
>> relevant Linux kernel module parameters that control how much physical
>> memory can be registered, and increase them to allow registering all
>> physical memory on your machine.
>>
>> See this Open MPI FAQ item for more information on these Linux kernel
>> module parameters:
>>
>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>
>> Local host:          pax98
>> Registerable memory: 32768 MiB
>> Total memory:        49106 MiB
>>
>> Your MPI job will continue, but may behave poorly and/or hang.
>>
>> The issue is similar to the one described in a previous thread about
>> Ubuntu nodes:
>> http://www.open-mpi.org/community/lists/users/2014/08/25090.php
>> But the InfiniBand driver is different: the parameters log_num_mtt and
>> log_mtts_per_seg both still exist, but they cannot be changed and have
>> the same values on all configurations:
>>
>> [pax52] /root # cat /sys/module/mlx4_core/parameters/log_num_mtt
>> 0
>> [pax52] /root # cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
>> 3
>>
>> The kernel changelog says that Red Hat has included this commit:
>> mlx4: Scale size of MTT table with system RAM (Doug Ledford)
>> so everything should be fine, since the buffers scale automatically.
>> However, as far as I can see, the wrong value calculated by
>> calculate_max_reg() is used in the code, so I don't think I can simply
>> ignore the warning. A user has also reported a problem with a job, but I
>> cannot confirm that this is the cause.
>>
>> My workaround was to simply load the mlx5_core kernel module, as its
>> presence is used by calculate_max_reg() to detect OFED 2.0.
>>
>> Regards, Götz Waschk
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2014/12/25923.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/12/25924.php

-- 
AL I:40: Do what thou wilt shall be the whole of the Law.
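[Editor's note: the FAQ item linked in the warning describes the estimate that the openib BTL derives from the two mlx4_core parameters shown above: registerable memory ≈ 2^log_num_mtt × 2^log_mtts_per_seg × page_size. A minimal sketch of that arithmetic — not Open MPI's actual source; the function name and the 4 KiB page size are illustrative assumptions:]

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


def estimated_max_reg_mib(log_num_mtt: int, log_mtts_per_seg: int,
                          page_size: int = PAGE_SIZE) -> int:
    """Registerable-memory estimate from the old mlx4 formula
    (2^log_num_mtt * 2^log_mtts_per_seg * page_size), in MiB."""
    return ((1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size) // (1 << 20)


# Under this formula, the 32768 MiB reported in the warning corresponds
# to e.g. log_num_mtt=20 with log_mtts_per_seg=3:
print(estimated_max_reg_mib(20, 3))  # 32768

# With the values read from /sys above (log_num_mtt=0, log_mtts_per_seg=3)
# the formula yields only 8 pages -- effectively nothing -- even though a
# kernel that scales the MTT table with RAM allows far more, which is why
# the estimate and the warning can be misleading on such kernels:
print(estimated_max_reg_mib(0, 3))  # 0
```

[This illustrates why a sysfs value of 0 ("auto") defeats the old parameter-based estimate rather than indicating a real 32 KiB limit.]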