Thanks, Gilles. We're back to looking at this (yet again). It's a false positive, yes; however, it's not completely benign: the max_reg that's calculated is much smaller than it should be. With OFED 3.12, max_reg should be 2*TOTAL_RAM. We should have a fix for 1.8.4.
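The idea is that when mlx4_core is loaded but no longer exposes log_num_mtt (the OFED 3.12 layout), the mlx4 case should fall into the same 2 * TOTAL_RAM bucket the code already uses for mlx5. Here is a rough standalone sketch of that logic (hypothetical, not the actual 1.8.4 patch; the path_exists() helper and the sysconf() total-RAM read are just for the sketch, the real code reads /proc/meminfo):

/* Hypothetical sketch only; not the actual 1.8.4 patch.  It illustrates the
 * intended behaviour: if mlx4_core is present but log_num_mtt is gone (the
 * OFED 3.12 layout), treat the HCA like mlx5 and report 2 x total RAM as
 * registerable, keeping the 7/8 safety factor from calculate_max_reg(). */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int path_exists(const char *path)
{
    struct stat st;
    return 0 == stat(path, &st);
}

int main(void)
{
    uint64_t mem_total = (uint64_t) sysconf(_SC_PHYS_PAGES) *
                         (uint64_t) sysconf(_SC_PAGESIZE);
    uint64_t max_reg = 0;

    if (path_exists("/sys/module/mlx4_core/parameters/log_num_mtt")) {
        /* pre-3.12 mlx4 stack: keep the existing log_num_mtt arithmetic */
        printf("old mlx4 path, existing arithmetic applies\n");
    } else if (path_exists("/sys/module/mlx4_core") ||
               path_exists("/sys/module/mlx5_core")) {
        /* mlx4 without log_num_mtt (OFED >= 3.12) or any mlx5 stack:
         * the MTT table scales with RAM, so 2 x mem_total is registerable */
        max_reg = 2 * mem_total;
        printf("scaled-MTT path: registerable ~ %" PRIu64 " MiB\n",
               ((max_reg * 7) >> 3) >> 20);
    } else if (path_exists("/sys/module/ib_mthca/parameters/num_mtt")) {
        /* mthca hardware keeps the old num_mtt accounting */
        printf("ib_mthca path, existing arithmetic applies\n");
    } else {
        printf("no known IB module found\n");
    }
    return 0;
}

Run on a node with the 3.12 stack, this should report roughly 2 x RAM (times 7/8) rather than the 16384 MiB figure quoted below.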
Josh

On Mon, Dec 8, 2014 at 9:25 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:

> Folks,
>
> FWIW, I observe a similar behaviour on my system.
>
> IMHO, the root cause is that OFED has been upgraded from a (quite) older
> version to the latest 3.12 version.
>
> Here is the relevant part of the code (btl_openib.c from the master):
>
> static uint64_t calculate_max_reg (void)
> {
>     if (0 == stat("/sys/module/mlx4_core/parameters/log_num_mtt", &statinfo)) {
>     } else if (0 == stat("/sys/module/ib_mthca/parameters/num_mtt", &statinfo)) {
>         mtts_per_seg = 1 << read_module_param("/sys/module/ib_mthca/parameters/log_mtts_per_seg", 1);
>         num_mtt = read_module_param("/sys/module/ib_mthca/parameters/num_mtt", 1 << 20);
>         reserved_mtt = read_module_param("/sys/module/ib_mthca/parameters/fmr_reserved_mtts", 0);
>
>         max_reg = (num_mtt - reserved_mtt) * opal_getpagesize () * mtts_per_seg;
>     } else if ((0 == stat("/sys/module/mlx5_core", &statinfo)) ||
>                (0 == stat("/sys/module/mlx4_core/parameters", &statinfo)) ||
>                (0 == stat("/sys/module/ib_mthca/parameters", &statinfo))) {
>         /* mlx5 means that we have ofed 2.0 and it can always register
>            2xmem_total for any mlx hca */
>         max_reg = 2 * mem_total;
>     } else {
>     }
>
>     /* Print a warning if we can't register more than 75% of physical
>        memory. Abort if the abort_not_enough_reg_mem MCA param was set. */
>     if (max_reg < mem_total * 3 / 4) {
>     }
>
>     return (max_reg * 7) >> 3;
> }
>
> With OFED 3.12, the /sys/module/mlx4_core/parameters/log_num_mtt pseudo
> file does *not* exist any more. /sys/module/ib_mthca/parameters/num_mtt
> exists, so the second path is taken and mtts_per_seg is read from
> /sys/module/ib_mthca/parameters/log_mtts_per_seg.
>
> I noted that log_mtts_per_seg is also a parameter of mlx4_core:
> /sys/module/mlx4_core/parameters/log_mtts_per_seg
>
> The value is 3 in ib_mthca (and leads to a warning) but 5 in mlx4_core
> (big enough, and does not lead to a warning if this value is read).
>
> I had no time to read the latest OFED doc, so I cannot answer:
> - should log_mtts_per_seg be read from mlx4_core instead?
> - is the warning a false positive?
>
> My only point is that this warning *might* be a false positive, and the
> root cause *might* be that the calculate_max_reg() logic is wrong with
> the latest OFED stack.
>
> Could the Mellanox folks comment on this?
>
> Cheers,
>
> Gilles
>
> On 2014/12/09 3:18, Götz Waschk wrote:
> > Hi,
> >
> > here's another test with Open MPI 1.8.3. With 1.8.1, 32 GB was
> > detected; now it is just 16:
> >
> > % mpirun -np 2 /usr/lib64/openmpi-intel/bin/mpitests-osu_get_bw
> > --------------------------------------------------------------------------
> > WARNING: It appears that your OpenFabrics subsystem is configured to only
> > allow registering part of your physical memory. This can cause MPI jobs to
> > run with erratic performance, hang, and/or crash.
> >
> > This may be caused by your OpenFabrics vendor limiting the amount of
> > physical memory that can be registered. You should investigate the
> > relevant Linux kernel module parameters that control how much physical
> > memory can be registered, and increase them to allow registering all
> > physical memory on your machine.
> >
> > See this Open MPI FAQ item for more information on these Linux kernel
> > module parameters:
> >
> >     http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
> >
> >   Local host:            pax95
> >   Registerable memory:   16384 MiB
> >   Total memory:          49106 MiB
> >
> > Your MPI job will continue, but may be behave poorly and/or hang.
> > --------------------------------------------------------------------------
> > # OSU MPI_Get Bandwidth Test v4.3
> > # Window creation: MPI_Win_allocate
> > # Synchronization: MPI_Win_flush
> > # Size      Bandwidth (MB/s)
> > 1                      28.56
> > 2                      58.74
> >
> > So it wasn't fixed for RHEL 6.6.
> >
> > Regards, Götz
> >
> > On Mon, Dec 8, 2014 at 4:00 PM, Götz Waschk <goetz.was...@gmail.com> wrote:
> >
> > Hi,
> >
> > I had tested 1.8.4rc1 and it wasn't fixed. I can try again, though;
> > maybe I had made an error.
> >
> > Regards, Götz Waschk
> >
> > On Mon, Dec 8, 2014 at 3:17 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
> >
> > Hi,
> >
> > This should be fixed in OMPI 1.8.3. Is it possible for you to give
> > 1.8.3 a shot?
> >
> > Best,
> >
> > Josh
> >
> > On Mon, Dec 8, 2014 at 8:43 AM, Götz Waschk <goetz.was...@gmail.com> wrote:
> >
> > Dear Open-MPI experts,
> >
> > I have updated my little cluster from Scientific Linux 6.5 to 6.6.
> > This included extensive changes in the Infiniband drivers and a newer
> > Open MPI version (1.8.1). Now I'm getting this message on all nodes
> > with more than 32 GB of RAM:
> >
> > WARNING: It appears that your OpenFabrics subsystem is configured to only
> > allow registering part of your physical memory. This can cause MPI jobs to
> > run with erratic performance, hang, and/or crash.
> >
> > This may be caused by your OpenFabrics vendor limiting the amount of
> > physical memory that can be registered. You should investigate the
> > relevant Linux kernel module parameters that control how much physical
> > memory can be registered, and increase them to allow registering all
> > physical memory on your machine.
> >
> > See this Open MPI FAQ item for more information on these Linux kernel
> > module parameters:
> >
> >     http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
> >
> >   Local host:            pax98
> >   Registerable memory:   32768 MiB
> >   Total memory:          49106 MiB
> >
> > Your MPI job will continue, but may be behave poorly and/or hang.
> >
> > The issue is similar to the one described in a previous thread about
> > Ubuntu nodes:
> > http://www.open-mpi.org/community/lists/users/2014/08/25090.php
> > But the Infiniband driver is different: the values log_num_mtt and
> > log_mtts_per_seg both still exist, but they cannot be changed and have
> > the same values on all configurations:
> >
> > [pax52] /root # cat /sys/module/mlx4_core/parameters/log_num_mtt
> > 0
> > [pax52] /root # cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
> > 3
> >
> > The kernel changelog says that Red Hat has included this commit:
> > "mlx4: Scale size of MTT table with system RAM" (Doug Ledford), so the
> > buffers should scale automatically and everything should be fine.
> > However, as far as I can see, the wrong value calculated by
> > calculate_max_reg() is used in the code, so I think I cannot simply
> > ignore the warning. Also, a user has reported a problem with a job,
> > though I cannot confirm that this is the cause.
> >
> > My workaround was to simply load the mlx5_core kernel module, as this
> > is used by calculate_max_reg() to detect OFED 2.0.
> >
> > Regards, Götz Waschk
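For anyone hitting this, the branch order of the calculate_max_reg() quoted above can be replayed with a small standalone program to see which case a given node actually falls into (a diagnostic sketch only, not Open MPI code; path_exists() is just a local helper, and the order follows the quoted master code, so older 1.8.x releases may differ):

/* Diagnostic sketch only (not Open MPI code): replays the stat() checks of
 * the quoted calculate_max_reg(), in the same order, and reports which
 * branch the local node would hit. */
#include <stdio.h>
#include <sys/stat.h>

static int path_exists(const char *path)
{
    struct stat st;
    return 0 == stat(path, &st);
}

int main(void)
{
    if (path_exists("/sys/module/mlx4_core/parameters/log_num_mtt")) {
        puts("branch 1: mlx4 log_num_mtt / log_mtts_per_seg arithmetic");
    } else if (path_exists("/sys/module/ib_mthca/parameters/num_mtt")) {
        puts("branch 2: ib_mthca num_mtt arithmetic (the path Gilles describes above)");
    } else if (path_exists("/sys/module/mlx5_core") ||
               path_exists("/sys/module/mlx4_core/parameters") ||
               path_exists("/sys/module/ib_mthca/parameters")) {
        puts("branch 3: OFED 2.0 style, max_reg = 2 * mem_total");
    } else {
        puts("no branch matched: max_reg stays 0");
    }
    return 0;
}

Whichever path it prints is the one that version of the code would take, which also shows whether loading mlx5_core changes anything on a given system or whether an earlier branch still wins.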