Hi,

This should be fixed in OMPI 1.8.3. Is it possible for you to give 1.8.3 a
shot?

Best,

Josh

On Mon, Dec 8, 2014 at 8:43 AM, Götz Waschk <goetz.was...@gmail.com> wrote:

> Dear Open-MPI experts,
>
> I have updated my little cluster from Scientific Linux 6.5 to 6.6.
> The update included extensive changes to the InfiniBand drivers and a
> newer Open MPI version (1.8.1). Now I'm getting this message on all
> nodes with more than 32 GB of RAM:
>
>
> WARNING: It appears that your OpenFabrics subsystem is configured to only
> allow registering part of your physical memory.  This can cause MPI jobs to
> run with erratic performance, hang, and/or crash.
>
> This may be caused by your OpenFabrics vendor limiting the amount of
> physical memory that can be registered.  You should investigate the
> relevant Linux kernel module parameters that control how much physical
> memory can be registered, and increase them to allow registering all
> physical memory on your machine.
>
> See this Open MPI FAQ item for more information on these Linux kernel
> module parameters:
>
>     http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
>   Local host:              pax98
>   Registerable memory:     32768 MiB
>   Total memory:            49106 MiB
>
> Your MPI job will continue, but may be behave poorly and/or hang.
>
>
> The issue is similar to the one described in a previous thread about
> Ubuntu nodes:
> http://www.open-mpi.org/community/lists/users/2014/08/25090.php
> However, the InfiniBand driver is different here. The parameters
> log_num_mtt and log_mtts_per_seg both still exist, but they cannot be
> changed and have the same values on all configurations:
> [pax52] /root # cat /sys/module/mlx4_core/parameters/log_num_mtt
> 0
> [pax52] /root # cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
> 3
>
> The kernel changelog says that Red Hat has included this commit:
> mlx4: Scale size of MTT table with system RAM (Doug Ledford)
> so everything should be fine, since the buffers scale automatically.
> However, as far as I can see, the code still uses the wrong value
> calculated by calculate_max_reg(), so I don't think I can simply
> ignore the warning. In addition, a user has reported a problem with a
> job, but I cannot confirm that this is the cause.
>
> My workaround was to simply load the mlx5_core kernel module, because
> calculate_max_reg() checks for it to detect OFED 2.0.
>
> Regards, Götz Waschk
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/12/25923.php
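
The "Registerable memory" figure in the quoted warning can be checked by hand. Below is a minimal sketch of that arithmetic, using the formula given in the Open MPI FAQ item linked in the warning (2^log_num_mtt * 2^log_mtts_per_seg * page size). The function names and the 4096-byte page size are assumptions made for illustration. Taken literally, the reported values (0 and 3) would allow only 32 KiB. The 32768 MiB in the warning matches the formula if log_num_mtt is taken to be 20, which is a guess at a fallback value and is not confirmed anywhere in this thread.

```python
from pathlib import Path

# Location of the mlx4_core parameters shown in the quoted message.
PARAM_DIR = Path("/sys/module/mlx4_core/parameters")


def read_log_param(name, default, param_dir=PARAM_DIR):
    """Read an integer module parameter; return `default` if it is unreadable."""
    try:
        return int((param_dir / name).read_text().strip())
    except (OSError, ValueError):
        return default


def max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """FAQ formula: 2^log_num_mtt * 2^log_mtts_per_seg * page size.

    page_size=4096 is an assumption; check `getconf PAGE_SIZE` on the node.
    """
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


if __name__ == "__main__":
    mib = 1024 * 1024

    # Values exactly as reported by sysfs in the quoted message.
    print(max_registerable_bytes(0, 3), "bytes")

    # ASSUMPTION: log_num_mtt falls back to 20 when sysfs reports 0.
    print(max_registerable_bytes(20, 3) // mib, "MiB")

    # Values on the local machine, if the mlx4_core module is loaded.
    local = max_registerable_bytes(
        read_log_param("log_num_mtt", 0),
        read_log_param("log_mtts_per_seg", 0),
    )
    print(local, "bytes (local sysfs values, taken literally)")
```

If the fallback assumption holds, the warning would appear on any node with more than 32 GiB of RAM regardless of what the driver actually allows, which would fit the symptom described in the quoted message.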
