It likely depends on how SLURM allocates the cpuset/cgroup inside the nodes. The XML warning is related to these restrictions inside the node. Anyway, my feeling is that there's an old OMPI or an old hwloc somewhere.
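If you want to see exactly what restriction the job ends up with on each node, something along these lines in the batch script should show it. This is only a sketch: it assumes the hwloc command-line utilities are installed on the compute nodes (lstopo already is, per the script below), and the exact /proc fields and cgroup paths depend on the distro and on how SLURM's cgroup plugin is configured.

#! /bin/bash
#SBATCH -N 2 -n 4
# show which CPUs and which cgroup each rank is confined to
/usr/bin/mpiexec sh -c 'hostname; grep Cpus_allowed_list /proc/self/status; cat /proc/self/cgroup'
# what hwloc itself reports as the current binding
/usr/bin/mpiexec hwloc-bind --get

Comparing that output between a "working" and a "broken" run should tell whether the cpuset actually differs between the two cases.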
How do we check after install whether OMPI uses the embedded or the system-wide hwloc? (A rough way to check is sketched at the end of this mail.)

Brice

On 08/12/2014 17:07, Pim Schellart wrote:
> Dear Ralph,
>
> the nodes are called coma## and as you can see in the logs the nodes of the
> broken example are the same as the nodes of the working one, so that doesn’t
> seem to be the cause. Unless (very likely) I’m missing something. Anything
> else I can check?
>
> Regards,
>
> Pim
>
>> On 08 Dec 2014, at 17:03, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> As Brice said, OMPI has its own embedded version of hwloc that we use, so
>> there is no Slurm interaction to be considered. The most likely cause is
>> that one or more of your nodes is picking up a different version of OMPI.
>> So things “work” if you happen to get nodes where all the versions match,
>> and “fail” when you get a combination that includes a different version.
>>
>> Is there some way you can narrow down your search to find the node(s) that
>> are picking up the different version?
>>
>>
>>> On Dec 8, 2014, at 7:48 AM, Pim Schellart <p.schell...@gmail.com> wrote:
>>>
>>> Dear Brice,
>>>
>>> I am not sure why this is happening since all code seems to be using the
>>> same hwloc library version (1.8) but it does :) An MPI program is started
>>> through SLURM on two nodes with four CPU cores total (divided over the
>>> nodes) using the following script:
>>>
>>> #! /bin/bash
>>> #SBATCH -N 2 -n 4
>>> /usr/bin/mpiexec /usr/bin/lstopo --version
>>> /usr/bin/mpiexec /usr/bin/lstopo --of xml
>>> /usr/bin/mpiexec /path/to/my_mpi_code
>>>
>>> When this is submitted multiple times it gives “out-of-order” warnings in
>>> about 9/10 cases but works without warnings in 1/10 cases. I attached the
>>> output (with xml) for both the working and the broken case. Note that the
>>> xml is of course printed (differently) multiple times for each task/core.
>>> As always, any help would be appreciated.
>>>
>>> Regards,
>>>
>>> Pim Schellart
>>>
>>> P.S. $ mpirun --version
>>> mpirun (Open MPI) 1.6.5
>>>
>>> <broken.log><working.log>
>>>
>>>> On 07 Dec 2014, at 13:50, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>>
>>>> Hello
>>>> The github issue you're referring to was closed 18 months ago. The
>>>> warning (it's not an error) is only supposed to appear if you're
>>>> importing into a recent hwloc an XML that was exported from an old
>>>> hwloc. I don't see how that could happen when using Open MPI since the
>>>> hwloc versions on both sides are the same.
>>>> Make sure you're not confusing it with another error described here:
>>>> http://www.open-mpi.org/projects/hwloc/doc/v1.10.0/a00028.php#faq_os_error
>>>> Otherwise please report the exact Open MPI and/or hwloc versions as well
>>>> as the XML lstopo output on the nodes that raise the warning
>>>> (lstopo foo.xml). Send these to hwloc mailing lists such as
>>>> hwloc-us...@open-mpi.org or hwloc-de...@open-mpi.org
>>>> Thanks
>>>> Brice
>>>>
>>>>
>>>> On 07/12/2014 13:29, Pim Schellart wrote:
>>>>> Dear OpenMPI developers,
>>>>>
>>>>> this might be a bit off topic, but when using the SLURM scheduler (with
>>>>> cpuset support) on Ubuntu 14.04 (openmpi 1.6) hwloc sometimes gives an
>>>>> “out-of-order topology discovery” error. According to issue #103 on
>>>>> github (https://github.com/open-mpi/hwloc/issues/103) this error was
>>>>> discussed before and it was possible to sort it out in
>>>>> “insert_object_by_parent”; is this still considered?
>>>>> If not, what (top level) hwloc API call should we look for in the SLURM
>>>>> sources to start debugging? Any help will be most welcome.
>>>>>
>>>>> Kind regards,
>>>>>
>>>>> Pim Schellart
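P.S. Regarding my question above about the embedded versus the system-wide hwloc: a rough way to check, and to spot a node that picks up a different Open MPI install, could be something like the commands below. Treat this as a sketch: the grep patterns are guesses (the hwloc-related component names differ between OMPI versions), and ldd only tells you something if the build was configured against an external hwloc.

# does ompi_info list an hwloc-related component, and which one?
ompi_info | grep -i hwloc
# if OMPI links against the system-wide library, a libhwloc dependency should show up here
ldd $(which mpirun) | grep -i hwloc
# run once per node from the batch script to spot a node with a different install
/usr/bin/mpiexec --pernode sh -c 'hostname; ompi_info | grep "Open MPI:"; /usr/bin/lstopo --version'

If only the embedded copy is used, ldd should report no libhwloc at all; it appears only when OMPI was built against an external hwloc.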