Hi, I tested 1.8.4rc1 and it wasn't fixed there either. I can try again, though; maybe I made an error.
Regards, Götz Waschk

On Mon, Dec 8, 2014 at 3:17 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
> Hi,
>
> This should be fixed in OMPI 1.8.3. Is it possible for you to give 1.8.3 a shot?
>
> Best,
>
> Josh
>
> On Mon, Dec 8, 2014 at 8:43 AM, Götz Waschk <goetz.was...@gmail.com> wrote:
>>
>> Dear Open-MPI experts,
>>
>> I have updated my little cluster from Scientific Linux 6.5 to 6.6. This
>> included extensive changes to the InfiniBand drivers and a newer Open MPI
>> version (1.8.1). Now I'm getting this message on all nodes with more than
>> 32 GB of RAM:
>>
>> WARNING: It appears that your OpenFabrics subsystem is configured to only
>> allow registering part of your physical memory. This can cause MPI jobs to
>> run with erratic performance, hang, and/or crash.
>>
>> This may be caused by your OpenFabrics vendor limiting the amount of
>> physical memory that can be registered. You should investigate the
>> relevant Linux kernel module parameters that control how much physical
>> memory can be registered, and increase them to allow registering all
>> physical memory on your machine.
>>
>> See this Open MPI FAQ item for more information on these Linux kernel
>> module parameters:
>>
>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>
>> Local host:          pax98
>> Registerable memory: 32768 MiB
>> Total memory:        49106 MiB
>>
>> Your MPI job will continue, but may behave poorly and/or hang.
>>
>> The issue is similar to the one described in a previous thread about
>> Ubuntu nodes:
>> http://www.open-mpi.org/community/lists/users/2014/08/25090.php
>> But the InfiniBand driver is different: the parameters log_num_mtt and
>> log_mtts_per_seg both still exist, but they cannot be changed and have
>> the same values on all configurations:
>>
>> [pax52] /root # cat /sys/module/mlx4_core/parameters/log_num_mtt
>> 0
>> [pax52] /root # cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
>> 3
>>
>> The kernel changelog says that Red Hat has included this commit:
>> mlx4: Scale size of MTT table with system RAM (Doug Ledford)
>> so everything should be fine, since the buffers scale automatically.
>> However, as far as I can see, the wrong value calculated by
>> calculate_max_reg() is used in the code, so I don't think I can simply
>> ignore the warning. A user has also reported a problem with a job, but I
>> cannot confirm that this is the cause.
>>
>> My workaround was to simply load the mlx5_core kernel module, as its
>> presence is used by calculate_max_reg() to detect OFED 2.0.
>>
>> Regards, Götz Waschk
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2014/12/25923.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/12/25924.php

-- 
AL I:40: Do what thou wilt shall be the whole of the Law.
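[Editor's note: the FAQ item linked in the warning describes the estimate that the openib BTL derives from the two mlx4_core parameters shown above: registerable memory ≈ 2^log_num_mtt × 2^log_mtts_per_seg × page_size. A minimal sketch of that arithmetic — not Open MPI's actual source; the function name and the 4 KiB page size are illustrative assumptions:]

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


def estimated_max_reg_mib(log_num_mtt: int, log_mtts_per_seg: int,
                          page_size: int = PAGE_SIZE) -> int:
    """Registerable-memory estimate from the old mlx4 formula
    (2^log_num_mtt * 2^log_mtts_per_seg * page_size), in MiB."""
    return ((1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size) // (1 << 20)


# Under this formula, the 32768 MiB reported in the warning corresponds
# to e.g. log_num_mtt=20 with log_mtts_per_seg=3:
print(estimated_max_reg_mib(20, 3))  # 32768

# With the values read from /sys above (log_num_mtt=0, log_mtts_per_seg=3)
# the formula yields only 8 pages -- effectively nothing -- even though a
# kernel that scales the MTT table with RAM allows far more, which is why
# the estimate and the warning can be misleading on such kernels:
print(estimated_max_reg_mib(0, 3))  # 0
```

[This illustrates why a sysfs value of 0 ("auto") defeats the old parameter-based estimate rather than indicating a real 32 KiB limit.]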