Pim: is this an OMPI you built, or one you were given somehow? If you built it, how did you configure it?
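
For what it's worth, one quick way to answer that on an installed copy is ompi_info. This is only a rough sketch, assuming the ompi_info that ships with the Open MPI actually being used is first in PATH; the "Configure command line" field only appears on builds/releases that record it, so the grep may come up empty:

    # Which Open MPI is being picked up, and how was it built?
    which ompi_info mpirun
    ompi_info --version                              # release number
    ompi_info --all | grep -i "configure command"    # configure line, if this build recorded it
    ompi_info | grep -i hwloc                        # hwloc component info, on releases that expose it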
> On Dec 8, 2014, at 8:12 AM, Brice Goglin <[email protected]> wrote:
>
> It likely depends on how SLURM allocates the cpuset/cgroup inside the
> nodes. The XML warning is related to these restrictions inside the node.
> Anyway, my feeling is that there's an old OMPI or an old hwloc somewhere.
>
> How do we check after install whether OMPI uses the embedded or the
> system-wide hwloc?
>
> Brice
>
> On 08/12/2014 17:07, Pim Schellart wrote:
>> Dear Ralph,
>>
>> the nodes are called coma## and, as you can see in the logs, the nodes of the
>> broken example are the same as the nodes of the working one, so that doesn’t
>> seem to be the cause. Unless (very likely) I’m missing something. Anything
>> else I can check?
>>
>> Regards,
>>
>> Pim
>>
>>> On 08 Dec 2014, at 17:03, Ralph Castain <[email protected]> wrote:
>>>
>>> As Brice said, OMPI has its own embedded version of hwloc that we use, so
>>> there is no Slurm interaction to be considered. The most likely cause is
>>> that one or more of your nodes is picking up a different version of OMPI.
>>> So things “work” if you happen to get nodes where all the versions match,
>>> and “fail” when you get a combination that includes a different version.
>>>
>>> Is there some way you can narrow down your search to find the node(s) that
>>> are picking up the different version?
>>>
>>>> On Dec 8, 2014, at 7:48 AM, Pim Schellart <[email protected]> wrote:
>>>>
>>>> Dear Brice,
>>>>
>>>> I am not sure why this is happening, since all code seems to be using the
>>>> same hwloc library version (1.8), but it does :) An MPI program is started
>>>> through SLURM on two nodes with four CPU cores total (divided over the
>>>> nodes) using the following script:
>>>>
>>>> #! /bin/bash
>>>> #SBATCH -N 2 -n 4
>>>> /usr/bin/mpiexec /usr/bin/lstopo --version
>>>> /usr/bin/mpiexec /usr/bin/lstopo --of xml
>>>> /usr/bin/mpiexec /path/to/my_mpi_code
>>>>
>>>> When this is submitted multiple times it gives “out-of-order” warnings in
>>>> about 9/10 cases but works without warnings in 1/10 cases. I attached the
>>>> output (with XML) for both the working and `broken` case. Note that the
>>>> XML is of course printed (differently) multiple times for each task/core.
>>>> As always, any help would be appreciated.
>>>>
>>>> Regards,
>>>>
>>>> Pim Schellart
>>>>
>>>> P.S. $ mpirun --version
>>>> mpirun (Open MPI) 1.6.5
>>>>
>>>> <broken.log><working.log>
>>>>
>>>>> On 07 Dec 2014, at 13:50, Brice Goglin <[email protected]> wrote:
>>>>>
>>>>> Hello
>>>>> The github issue you're referring to was closed 18 months ago. The
>>>>> warning (it's not an error) is only supposed to appear if you're
>>>>> importing into a recent hwloc an XML that was exported from an old hwloc.
>>>>> I don't see how that could happen when using Open MPI since the hwloc
>>>>> versions on both sides are the same.
>>>>> Make sure you're not confusing it with another error described here:
>>>>> http://www.open-mpi.org/projects/hwloc/doc/v1.10.0/a00028.php#faq_os_error
>>>>> Otherwise please report the exact Open MPI and/or hwloc versions as well
>>>>> as the XML lstopo output on the nodes that raise the warning
>>>>> (lstopo foo.xml). Send these to hwloc mailing lists such as
>>>>> [email protected] or [email protected]
>>>>> Thanks
>>>>> Brice
>>>>>
>>>>> On 07/12/2014 13:29, Pim Schellart wrote:
>>>>>> Dear OpenMPI developers,
>>>>>>
>>>>>> this might be a bit off topic, but when using the SLURM scheduler (with
>>>>>> cpuset support) on Ubuntu 14.04 (openmpi 1.6) hwloc sometimes gives an
>>>>>> "out-of-order topology discovery” error. According to issue #103 on
>>>>>> github (https://github.com/open-mpi/hwloc/issues/103) this error was
>>>>>> discussed before and it was possible to sort it out in
>>>>>> “insert_object_by_parent”; is this still being considered? If not, what
>>>>>> (top level) hwloc API call should we look for in the SLURM sources to
>>>>>> start debugging? Any help will be most welcome.
>>>>>>
>>>>>> Kind regards,
>>>>>>
>>>>>> Pim Schellart
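
P.S. On the suggestion above to narrow down which node(s) pick up a different version, and on Brice's embedded-vs-system-wide hwloc question: below is a rough sketch of a SLURM job that reports this per node. It is not from the thread, just one way to do the check; it assumes srun, ldd and lstopo exist on every node, and it mirrors the -N/-n layout of Pim's script (adjust paths as needed).

    #! /bin/bash
    #SBATCH -N 2 -n 4
    # -l prefixes every output line with the task id, so lines can be
    # attributed to the node/task that produced them.
    srun -l bash -c '
        echo "host=$(hostname)";
        which mpirun lstopo;
        mpirun --version 2>&1 | head -n 1;
        lstopo --version;
        # If a system-wide libhwloc shows up here, this mpirun is linked
        # against it; if nothing matches, it is most likely the embedded copy.
        ldd "$(which mpirun)" | grep -i hwloc || echo "no external libhwloc (embedded copy?)"
    '

Comparing that output across tasks should show quickly whether any node resolves a different mpirun or a different hwloc.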
