[OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6
Dear Open MPI developers,

This might be a bit off topic, but when using the SLURM scheduler (with cpuset support) on Ubuntu 14.04 (Open MPI 1.6), hwloc sometimes gives an "out-of-order topology discovery" error. According to issue #103 on GitHub (https://github.com/open-mpi/hwloc/issues/103) this error was discussed before, and it was possible to sort it out in "insert_object_by_parent". Is that fix still being considered? If not, which (top-level) hwloc API call should we look for in the SLURM sources to start debugging?

Any help will be most welcome.

Kind regards,

Pim Schellart
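For reference, this is roughly what I have been trying so far; the SLURM source path and the hwloc function name are just my guesses as to where to look, not anything confirmed:

  # Compare the topology hwloc sees outside and inside a SLURM cpuset/cgroup
  lstopo                      # on the node, outside SLURM
  srun -N1 -n1 lstopo         # inside an allocation, with cpuset confinement applied

  # Grep the SLURM sources for the hwloc entry points it uses
  # (hwloc_topology_load is my guess at the relevant top-level call)
  grep -rn "hwloc_topology_" slurm-14.11.0/src/ | less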
Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6
Hello,

The GitHub issue you're referring to was closed 18 months ago. The warning (it's not an error) is only supposed to appear if you're importing into a recent hwloc an XML that was exported from an old hwloc. I don't see how that could happen when using Open MPI, since the hwloc versions on both sides are the same. Make sure you're not confusing it with another error described here:
http://www.open-mpi.org/projects/hwloc/doc/v1.10.0/a00028.php#faq_os_error

Otherwise, please report the exact Open MPI and/or hwloc versions, as well as the XML lstopo output on the nodes that raise the warning (lstopo foo.xml). Send these to the hwloc mailing lists, such as hwloc-us...@open-mpi.org or hwloc-de...@open-mpi.org.

Thanks,
Brice

Le 07/12/2014 13:29, Pim Schellart a écrit :
> Dear OpenMPI developers,
>
> this might be a bit off topic but when using the SLURM scheduler (with
> cpuset support) on Ubuntu 14.04 (openmpi 1.6) hwloc sometimes gives a
> "out-of-order topology discovery" error. According to issue #103 on github
> (https://github.com/open-mpi/hwloc/issues/103) this error was discussed
> before and it was possible to sort it out in "insert_object_by_parent", is
> this still considered? If not, what (top level) hwloc API call should we
> look for in the SLURM sources to start debugging? Any help will be most
> welcome.
>
> Kind regards,
>
> Pim Schellart
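In practice, gathering what I'm asking for would look something like this (the version flags are from memory, so double-check them on your install):

  lstopo --version       # hwloc version on the node
  mpirun --version       # Open MPI version
  lstopo foo.xml         # XML export of the node topology; attach foo.xml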
[OMPI devel] (no subject)
Watcha,

I have recently come to install the PISM package on top of PETSc, which in turn is built against Open MPI 1.8.1 on our Science Faculty HPC Facility, which has SGI C2112 compute nodes with 64 GB RAM running on top of CentOS 6.

In testing out the PETSc deployment and when running PISM itself, I am seeing the "... OpenFabrics subsystem is configured to only allow registering part of your physical memory ..." message, telling me:

  Registerable memory: 32768 MiB
  Total memory:        524285 MiB

Oh yeah, that's the 512 GB big-memory node, not a 64 GB compute node, which says:

  Registerable memory: 32768 MiB
  Total memory:        65534 MiB

but still suggests a default that allows the use of 32 GB.

So, having followed my nose to the Open MPI FAQ sections and the Mellanox community page,

  http://community.mellanox.com/docs/DOC-1120

which suggests the defaults for the two parameters in need of a tweak are

  log_num_mtt        20
  log_mtts_per_seg    0

I came to try and tweak those Mellanox driver parameters. What I see on my compute nodes is

  # cat /sys/module/mlx4_core/parameters/log_num_mtt
  0
  # cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
  3
  #

so something that doesn't match the defaults the Mellanox page suggests I should be seeing.

Furthermore, having "done the math" and realised that I probably want

  log_num_mtt        22
  log_mtts_per_seg    3

to allow Open MPI to use double the memory (128 GB - because giving it 1 TB on the big-memory node seems excessive!), when I come to alter those values, I can't seem to. Trying to add a module load option

  options mlx4_core log_num_mtt=22

by modifying the file /etc/modprobe.d/mlx4.conf never sees that value honoured after a full node reboot.

It also appears that the /sys/module/mlx4_core/parameters/ entries are nearly all read-only, including the ones it's suggested that I tweak, viz:

  # echo 22 > /sys/module/mlx4_core/parameters/log_num_mtt
  -bash: /sys/module/mlx4_core/parameters/log_num_mtt: Permission denied
  # ls -l /sys/module/mlx4_core/parameters/log_num_mtt
  -r--r--r--. 1 root root 4096 Dec  5 13:08 /sys/module/mlx4_core/parameters/log_num_mtt

so I'm getting the impression that the Mellanox driver doesn't really want the defaults altered?

OK, so if I can't tell my nodes to allow Open MPI to use any more than 32 GB, how do I turn off the Open MPI message that is telling me about it?

Kevin M. Buckley

eScience Consultant
School of Engineering and Computer Science
Victoria University of Wellington
New Zealand
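For what it's worth, the "math" I did was along these lines, assuming 4 KiB pages and that the registerable limit works out as PAGE_SIZE * 2^log_num_mtt * 2^log_mtts_per_seg (my reading of the Mellanox page, not anything official):

  page_size=$(getconf PAGE_SIZE)    # 4096 on these nodes
  log_num_mtt=22
  log_mtts_per_seg=3
  # maximum registerable memory, in MiB
  echo $(( page_size * (1 << log_num_mtt) * (1 << log_mtts_per_seg) / 1024 / 1024 ))
  # -> 131072 MiB, i.e. the 128 GB I'm after; plugging in 20 and 3 instead
  #    gives the 32768 MiB limit the warning is reporting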
Re: [OMPI devel] (no subject)
Apologies for the lack of a subject line: I cut and pasted the body before the subject!

It should have been:

  Removing the "registering part of your physical memory" warning message

Dunno if anyone can fix that in the mailing list?
Re: [OMPI devel] openmpi and XRC API from ofed-3.12
Hi Piotr,

This is quite an old thread now, but I did not see any support for XRC with OFED 3.12 yet (neither in trunk nor in v1.8).

My understanding is that Bull already did something similar for the v1.6 series, so let me put this the other way around: does Bull have any plan to contribute this work? (For example, publish a patch for the v1.6 series, or submit pull request(s) for master and the v1.8 branch.)

Cheers,

Gilles

On 2014/04/23 21:58, Piotr Lesnicki wrote:
> Hi,
>
> In OFED-3.12 the API for XRC has changed. I did not find
> corresponding changes in Open MPI: for example the function
> 'ibv_create_xrc_rcv_qp()' queried in the openmpi configure script no
> longer exists in ofed-3.12-rc1.
>
> Are there any plans to support the new XRC API?
>
> --
> Piotr
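For anyone wanting to check which XRC flavour their installed libibverbs/OFED exposes, a quick grep along these lines should do it; the header path and the location of the Open MPI configure test are assumptions based on common installs, so adjust to your tree:

  # Old-style XRC API (pre-OFED-3.12): present if this symbol is declared
  grep -n "ibv_create_xrc_rcv_qp" /usr/include/infiniband/verbs.h

  # New-style XRC API (XRC domains and XRC QP types)
  grep -nE "ibv_open_xrcd|IBV_QPT_XRC_SEND|IBV_QPT_XRC_RECV" /usr/include/infiniband/verbs.h

  # Where Open MPI's configure probes for the old symbol
  # (exact path may differ between branches)
  grep -rn "ibv_create_xrc_rcv_qp" config/ ompi/config/ 2>/dev/null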