Thanks, Gilles

We're back to looking at this (yet again). It's a false positive, yes;
however, it's not completely benign. The max_reg that is calculated is much
smaller than it should be: in OFED 3.12, max_reg should be 2 * TOTAL_RAM. We
should have a fix for 1.8.4.
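
A minimal sketch of the kind of change this implies (hypothetical; the actual
1.8.4 fix may well differ, and path_exists()/sketch_max_reg() are made-up
names, with mem_total standing in for however btl_openib.c obtains total RAM):

#include <stdint.h>
#include <sys/stat.h>

static int path_exists (const char *path)
{
    struct stat st;
    return 0 == stat(path, &st);
}

static uint64_t sketch_max_reg (uint64_t mem_total)
{
    uint64_t max_reg;

    if (path_exists("/sys/module/mlx4_core/parameters") &&
        !path_exists("/sys/module/mlx4_core/parameters/log_num_mtt")) {
        /* mlx4 driver loaded but log_num_mtt gone: OFED 3.12 style,
           where the MTT table scales with system RAM */
        max_reg = 2 * mem_total;
    } else {
        /* otherwise keep whatever the existing parameter-based logic
           computes; mem_total is only a placeholder here */
        max_reg = mem_total;
    }

    /* calculate_max_reg() returns 7/8 of the computed limit */
    return (max_reg * 7) >> 3;
}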

Josh

On Mon, Dec 8, 2014 at 9:25 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:

>  Folks,
>
> FWIW, I observe a similar behaviour on my system.
>
> IMHO, the root cause is that OFED has been upgraded from a (quite) old
> version to the latest 3.12 version.
>
> Here is the relevant part of the code (btl_openib.c from master):
>
>
> static uint64_t calculate_max_reg (void)
> {
>     if (0 == stat("/sys/module/mlx4_core/parameters/log_num_mtt", &statinfo)) {
>         /* mlx4 log_num_mtt path (body elided in this excerpt) */
>     } else if (0 == stat("/sys/module/ib_mthca/parameters/num_mtt", &statinfo)) {
>         mtts_per_seg = 1 << read_module_param("/sys/module/ib_mthca/parameters/log_mtts_per_seg", 1);
>         num_mtt = read_module_param("/sys/module/ib_mthca/parameters/num_mtt", 1 << 20);
>         reserved_mtt = read_module_param("/sys/module/ib_mthca/parameters/fmr_reserved_mtts", 0);
>
>         max_reg = (num_mtt - reserved_mtt) * opal_getpagesize () * mtts_per_seg;
>     } else if ((0 == stat("/sys/module/mlx5_core", &statinfo)) ||
>                (0 == stat("/sys/module/mlx4_core/parameters", &statinfo)) ||
>                (0 == stat("/sys/module/ib_mthca/parameters", &statinfo))) {
>         /* mlx5 means that we have ofed 2.0 and it can always register
>            2xmem_total for any mlx hca */
>         max_reg = 2 * mem_total;
>     } else {
>         /* fallback path (body elided in this excerpt) */
>     }
>
>     /* Print a warning if we can't register more than 75% of physical
>        memory.  Abort if the abort_not_enough_reg_mem MCA param was
>        set. */
>     if (max_reg < mem_total * 3 / 4) {
>         /* warning / abort (body elided in this excerpt) */
>     }
>     return (max_reg * 7) >> 3;
> }
>
> With OFED 3.12, the /sys/module/mlx4_core/parameters/log_num_mtt pseudo
> file does *not* exist any more.
> /sys/module/ib_mthca/parameters/num_mtt does exist, so the second path is
> taken and mtts_per_seg is read from
> /sys/module/ib_mthca/parameters/log_mtts_per_seg.
>
> I noted that log_mtts_per_seg is also a parameter of mlx4_core:
> /sys/module/mlx4_core/parameters/log_mtts_per_seg
>
> The value is 3 in ib_mthca (and leads to a warning) but 5 in mlx4_core
> (big enough, and does not lead to a warning if this value is read).
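>
> (Editorial sketch, not actual Open MPI code: a hypothetical helper showing
> what "read log_mtts_per_seg from mlx4_core instead" could look like. It
> reuses read_module_param() and the stat()/statinfo pattern from the
> excerpt above; whether this is the right behaviour is exactly the open
> question here.)
>
> static uint64_t read_mtts_per_seg (void)
> {
>     struct stat statinfo;
>
>     /* prefer the mlx4_core value (5 on this system) over the ib_mthca
>        value (3); fall back to ib_mthca when the mlx4_core parameter is
>        absent */
>     if (0 == stat("/sys/module/mlx4_core/parameters/log_mtts_per_seg", &statinfo)) {
>         return 1 << read_module_param("/sys/module/mlx4_core/parameters/log_mtts_per_seg", 1);
>     }
>     return 1 << read_module_param("/sys/module/ib_mthca/parameters/log_mtts_per_seg", 1);
> }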
>
>
> I had no time to read the latest OFED doc, so I cannot answer:
> - should log_mtts_per_seg be read from mlx4_core instead?
> - is the warning a false positive?
>
>
> My only point is that this warning *might* be a false positive, and that
> the root cause *might* be that the calculate_max_reg() logic is wrong
> with the latest OFED stack.
>
> Could the Mellanox folks comment on this?
>
> Cheers,
>
> Gilles
>
>
>
>
>
> On 2014/12/09 3:18, Götz Waschk wrote:
>
> Hi,
>
> Here's another test with Open MPI 1.8.3. With 1.8.1, 32 GB was detected;
> now it is just 16:
> % mpirun -np 2 /usr/lib64/openmpi-intel/bin/mpitests-osu_get_bw
> --------------------------------------------------------------------------
> WARNING: It appears that your OpenFabrics subsystem is configured to only
> allow registering part of your physical memory.  This can cause MPI jobs to
> run with erratic performance, hang, and/or crash.
>
> This may be caused by your OpenFabrics vendor limiting the amount of
> physical memory that can be registered.  You should investigate the
> relevant Linux kernel module parameters that control how much physical
> memory can be registered, and increase them to allow registering all
> physical memory on your machine.
>
> See this Open MPI FAQ item for more information on these Linux kernel module
> parameters:
>
>     http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
>   Local host:              pax95
>   Registerable memory:     16384 MiB
>   Total memory:            49106 MiB
>
> Your MPI job will continue, but may be behave poorly and/or hang.
> --------------------------------------------------------------------------
> # OSU MPI_Get Bandwidth Test v4.3
> # Window creation: MPI_Win_allocate
> # Synchronization: MPI_Win_flush
> # Size      Bandwidth (MB/s)
> 1                      28.56
> 2                      58.74
>
>
> So it wasn't fixed for RHEL 6.6.
>
> Regards, Götz
>
> On Mon, Dec 8, 2014 at 4:00 PM, Götz Waschk <goetz.was...@gmail.com> wrote:
>
>
>  Hi,
>
> I had tested 1.8.4rc1 and it wasn't fixed. I can try again though,
> maybe I had made an error.
>
> Regards, Götz Waschk
>
> On Mon, Dec 8, 2014 at 3:17 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>
>  Hi,
>
> This should be fixed in OMPI 1.8.3. Is it possible for you to give 1.8.3 a
> shot?
>
> Best,
>
> Josh
>
> On Mon, Dec 8, 2014 at 8:43 AM, Götz Waschk <goetz.was...@gmail.com> wrote:
>
>  Dear Open-MPI experts,
>
> I have updated my little cluster from Scientific Linux 6.5 to 6.6;
> this included extensive changes in the InfiniBand drivers and a newer
> Open MPI version (1.8.1). Now I'm getting this message on all nodes
> with more than 32 GB of RAM:
>
>
> WARNING: It appears that your OpenFabrics subsystem is configured to only
> allow registering part of your physical memory.  This can cause MPI jobs to
> run with erratic performance, hang, and/or crash.
>
> This may be caused by your OpenFabrics vendor limiting the amount of
> physical memory that can be registered.  You should investigate the
> relevant Linux kernel module parameters that control how much physical
> memory can be registered, and increase them to allow registering all
> physical memory on your machine.
>
> See this Open MPI FAQ item for more information on these Linux kernel module
> parameters:
>
>     http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
>   Local host:              pax98
>   Registerable memory:     32768 MiB
>   Total memory:            49106 MiB
>
> Your MPI job will continue, but may be behave poorly and/or hang.
>
>
> The issue is similar to the one described in a previous thread about
> Ubuntu nodes: http://www.open-mpi.org/community/lists/users/2014/08/25090.php
> But the InfiniBand driver is different: the values log_num_mtt and
> log_mtts_per_seg both still exist, but they cannot be changed and have
> the same values on all configurations:
> [pax52] /root # cat /sys/module/mlx4_core/parameters/log_num_mtt
> 0
> [pax52] /root # cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
> 3
>
> The kernel changelog says that Red Hat has included this commit:
> mlx4: Scale size of MTT table with system RAM (Doug Ledford)
> so it should all be fine and the buffers should scale automatically.
> However, as far as I can see, the wrong value calculated by
> calculate_max_reg() is used in the code, so I think I cannot simply
> ignore the warning. Also, a user has reported a problem with a job, but
> I cannot confirm that this is the cause.
>
> My workaround was to simply load the mlx5_core kernel module, as this
> is used by calculate_max_reg() to detect OFED 2.0.
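>
> (Illustrative note: per the calculate_max_reg() excerpt quoted earlier in
> this thread, the mere existence of /sys/module/mlx5_core satisfies one of
> the stat() checks and selects the "max_reg = 2 * mem_total" branch;
> whether that branch is reached first depends on the check order in the
> installed Open MPI version. A tiny standalone check, assuming nothing
> beyond libc:)
>
> #include <stdio.h>
> #include <sys/stat.h>
>
> int main (void)
> {
>     struct stat st;
>
>     if (0 == stat("/sys/module/mlx5_core", &st)) {
>         puts("mlx5_core loaded: the 2 x RAM branch can match");
>     } else {
>         puts("mlx5_core not loaded: the parameter-based limits apply");
>     }
>     return 0;
> }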
>
> Regards, Götz Waschk
>
>
>
> --
> AL I:40: Do what thou wilt shall be the whole of the Law.
>
>
>
>
>
>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16454.php
>
