Folks,

FWIW, I observe a similar behaviour on my system.

IMHO, the root cause is that OFED was upgraded from a (quite) old
version to the latest 3.12 version.

Here is the relevant part of the code (btl_openib.c from master):


static uint64_t calculate_max_reg (void)
{
    if (0 == stat("/sys/module/mlx4_core/parameters/log_num_mtt", &statinfo)) {
        /* mlx4 path (log_num_mtt exists) -- elided here */
    } else if (0 == stat("/sys/module/ib_mthca/parameters/num_mtt", &statinfo)) {
        /* mthca path: derive the limit from the ib_mthca module parameters */
        mtts_per_seg = 1 << read_module_param("/sys/module/ib_mthca/parameters/log_mtts_per_seg", 1);
        num_mtt = read_module_param("/sys/module/ib_mthca/parameters/num_mtt", 1 << 20);
        reserved_mtt = read_module_param("/sys/module/ib_mthca/parameters/fmr_reserved_mtts", 0);

        max_reg = (num_mtt - reserved_mtt) * opal_getpagesize () * mtts_per_seg;
    } else if ((0 == stat("/sys/module/mlx5_core", &statinfo)) ||
               (0 == stat("/sys/module/mlx4_core/parameters", &statinfo)) ||
               (0 == stat("/sys/module/ib_mthca/parameters", &statinfo))) {
        /* mlx5 means that we have ofed 2.0 and it can always register
           2xmem_total for any mlx hca */
        max_reg = 2 * mem_total;
    } else {
        /* fallback -- elided here */
    }

    /* Print a warning if we can't register more than 75% of physical
       memory.  Abort if the abort_not_enough_reg_mem MCA param was
       set. */
    if (max_reg < mem_total * 3 / 4) {
        /* warning / abort handling -- elided here */
    }

    return (max_reg * 7) >> 3;
}

With OFED 3.12, the /sys/module/mlx4_core/parameters/log_num_mtt pseudo-file
does *not* exist any more. /sys/module/ib_mthca/parameters/num_mtt does
exist, so the second path is taken and mtts_per_seg is read from
/sys/module/ib_mthca/parameters/log_mtts_per_seg.
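
As a back-of-the-envelope check (using the fallback defaults from the code
above, num_mtt = 1 << 20 and fmr_reserved_mtts = 0, plus an assumed 4 KiB
page size; I did not verify these on the reporter's nodes), this path gives
max_reg = 2^20 * 4096 * 2^3 = 32 GiB. If the same defaults apply, that would
match the 32768 MiB reported with 1.8.1 below, and it is less than 75% of
the 49106 MiB total, which is exactly what triggers the warning.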

I noted that log_mtts_per_seg is also a parameter of mlx4_core:
/sys/module/mlx4_core/parameters/log_mtts_per_seg

The value is 3 in ib_mthca (and leads to the warning) but 5 in mlx4_core
(big enough, and it would not lead to a warning if that value were read
instead).
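
To illustrate how much this single parameter changes the result, here is a
minimal standalone sketch (not the actual Open MPI code; num_mtt, the
reserved MTTs and the page size are hard-coded to the assumed defaults above
instead of being read from /sys):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* assumed defaults, for illustration only */
    uint64_t num_mtt = 1ULL << 20;      /* fallback used by calculate_max_reg() */
    uint64_t reserved_mtt = 0;          /* assume no reserved FMR MTTs */
    uint64_t page_size = 4096;          /* assume 4 KiB pages */

    /* same formula as calculate_max_reg(), with the two observed values */
    uint64_t max_reg_mthca = (num_mtt - reserved_mtt) * page_size * (1ULL << 3); /* log_mtts_per_seg = 3 */
    uint64_t max_reg_mlx4  = (num_mtt - reserved_mtt) * page_size * (1ULL << 5); /* log_mtts_per_seg = 5 */

    printf("log_mtts_per_seg = 3 (ib_mthca):  %llu MiB\n",
           (unsigned long long) (max_reg_mthca >> 20));   /* 32768  */
    printf("log_mtts_per_seg = 5 (mlx4_core): %llu MiB\n",
           (unsigned long long) (max_reg_mlx4 >> 20));    /* 131072 */
    return 0;
}

Only the first value falls below the 75% threshold used by the warning, so
reading log_mtts_per_seg from ib_mthca rather than mlx4_core could make the
difference between warning and no warning.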


I have not had time to read the latest OFED documentation, so I cannot answer:
- should log_mtts_per_seg be read from mlx4_core instead?
- is the warning a false positive?


My only point is that this warning *might* be a false positive, and the root
cause *might* be that the calculate_max_reg() logic is wrong with the latest
OFED stack.

Could the Mellanox folks comment on this?

Cheers,

Gilles




On 2014/12/09 3:18, Götz Waschk wrote:
> Hi,
>
> here's another test with openmpi 1.8.3. With 1.8.1, 32GB was detected, now
> it is just 16:
> % mpirun -np 2 /usr/lib64/openmpi-intel/bin/mpitests-osu_get_bw
> --------------------------------------------------------------------------
> WARNING: It appears that your OpenFabrics subsystem is configured to only
> allow registering part of your physical memory.  This can cause MPI jobs to
> run with erratic performance, hang, and/or crash.
>
> This may be caused by your OpenFabrics vendor limiting the amount of
> physical memory that can be registered.  You should investigate the
> relevant Linux kernel module parameters that control how much physical
> memory can be registered, and increase them to allow registering all
> physical memory on your machine.
>
> See this Open MPI FAQ item for more information on these Linux kernel module
> parameters:
>
>     http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
>   Local host:              pax95
>   Registerable memory:     16384 MiB
>   Total memory:            49106 MiB
>
> Your MPI job will continue, but may be behave poorly and/or hang.
> --------------------------------------------------------------------------
> # OSU MPI_Get Bandwidth Test v4.3
> # Window creation: MPI_Win_allocate
> # Synchronization: MPI_Win_flush
> # Size      Bandwidth (MB/s)
> 1                      28.56
> 2                      58.74
>
>
> So it wasn't fixed for RHEL 6.6.
>
> Regards, Götz
>
> On Mon, Dec 8, 2014 at 4:00 PM, Götz Waschk <goetz.was...@gmail.com> wrote:
>
>> Hi,
>>
>> I had tested 1.8.4rc1 and it wasn't fixed. I can try again though,
>> maybe I had made an error.
>>
>> Regards, Götz Waschk
>>
>> On Mon, Dec 8, 2014 at 3:17 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>> Hi,
>>>
>>> This should be fixed in OMPI 1.8.3. Is it possible for you to give 1.8.3
>> a
>>> shot?
>>>
>>> Best,
>>>
>>> Josh
>>>
>>> On Mon, Dec 8, 2014 at 8:43 AM, Götz Waschk <goetz.was...@gmail.com>
>> wrote:
>>>> Dear Open-MPI experts,
>>>>
>>>> I have updated my little cluster from Scientific Linux 6.5 to 6.6,
>>>> this included extensive changes in the Infiniband drivers and a newer
>>>> openmpi version (1.8.1). Now I'm getting this message on all nodes
>>>> with more than 32 GB of RAM:
>>>>
>>>>
>>>> WARNING: It appears that your OpenFabrics subsystem is configured to
>> only
>>>> allow registering part of your physical memory.  This can cause MPI jobs
>>>> to
>>>> run with erratic performance, hang, and/or crash.
>>>>
>>>> This may be caused by your OpenFabrics vendor limiting the amount of
>>>> physical memory that can be registered.  You should investigate the
>>>> relevant Linux kernel module parameters that control how much physical
>>>> memory can be registered, and increase them to allow registering all
>>>> physical memory on your machine.
>>>>
>>>> See this Open MPI FAQ item for more information on these Linux kernel
>>>> module
>>>> parameters:
>>>>
>>>>     http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>>>
>>>>   Local host:              pax98
>>>>   Registerable memory:     32768 MiB
>>>>   Total memory:            49106 MiB
>>>>
>>>> Your MPI job will continue, but may be behave poorly and/or hang.
>>>>
>>>>
>>>> The issue is similar to the one described in a previous thread about
>>>> Ubuntu nodes:
>>>> http://www.open-mpi.org/community/lists/users/2014/08/25090.php
>>>> But the Infiniband driver is different, the values log_num_mtt and
>>>> log_mtts_per_seg both still exist, but they cannot be changed and have
>>>> on all configurations the same values:
>>>> [pax52] /root # cat /sys/module/mlx4_core/parameters/log_num_mtt
>>>> 0
>>>> [pax52] /root # cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
>>>> 3
>>>>
>>>> The kernel changelog says that Red Hat has included this commit:
>>>> mlx4: Scale size of MTT table with system RAM (Doug Ledford)
>>>> so it should be all fine, the buffers scale automatically, however, as
>>>> far as I can see, the wrong value calculated by calculate_max_reg() is
>>>> used in the code, so I think I cannot simply ignore the warning. Also,
>>>> a user has reported a problem with a job, I cannot confirm that this
>>>> is the cause.
>>>>
>>>> My workaround was to simply load the mlx5_core kernel module, as this
>>>> is used by calculate_max_reg() to detect OFED 2.0.
>>>>
>>>> Regards, Götz Waschk
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/users/2014/12/25923.php
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2014/12/25924.php
>>
>>
>> --
>> AL I:40: Do what thou wilt shall be the whole of the Law.
>>
>
>
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/12/25929.php
