It likely depends on how SLURM allocates the cpuset/cgroup inside the
nodes. The XML warning is related to these restrictions inside the node.
Anyway, my feeling is that there's an old OMPI or an old hwloc somewhere.
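
To see what restrictions a job step actually runs under, a minimal check
(assuming a Linux kernel that exposes the allowed CPUs in /proc/self/status;
the srun flags just mirror the job script quoted below) would be:

srun -N 2 -n 4 bash -c 'echo "$(hostname): $(grep Cpus_allowed_list /proc/self/status)"'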

How do we check after install whether OMPI uses the embedded or the
system-wide hwloc?
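
One way to check after the fact (the library path below is a guess for a
Debian/Ubuntu install and may need adjusting):

# An external hwloc shows up as a shared-library dependency of libmpi;
# the embedded copy is compiled in and leaves no such dependency.
ldd /usr/lib/libmpi.so.1 | grep -i hwloc
# ompi_info also lists which hwloc MCA component was built.
ompi_info | grep -i hwloc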

Brice




On 08/12/2014 17:07, Pim Schellart wrote:
> Dear Ralph,
>
> the nodes are called coma## and as you can see in the logs the nodes of the 
> broken example are the same as the nodes of the working one, so that doesn’t 
> seem to be the cause. Unless (very likely) I’m missing something. Anything 
> else I can check?
>
> Regards,
>
> Pim
>
>> On 08 Dec 2014, at 17:03, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> As Brice said, OMPI has its own embedded version of hwloc that we use, so 
>> there is no Slurm interaction to be considered. The most likely cause is 
>> that one or more of your nodes is picking up a different version of OMPI. So 
>> things “work” if you happen to get nodes where all the versions match, and 
>> “fail” when you get a combination that includes a different version.
>>
>> Is there some way you can narrow down your search to find the node(s) that 
>> are picking up the different version?
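
A minimal per-node check along those lines (assuming the allocation spans the
same nodes; the flags and paths are just examples matching the script below)
might be:

srun -N 2 --ntasks-per-node=1 bash -c 'echo "$(hostname): $(/usr/bin/mpirun --version 2>&1 | head -n 1) / $(/usr/bin/lstopo --version)"'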
>>
>>
>>> On Dec 8, 2014, at 7:48 AM, Pim Schellart <p.schell...@gmail.com> wrote:
>>>
>>> Dear Brice,
>>>
>>> I am not sure why this is happening, since all code seems to be using the 
>>> same hwloc library version (1.8), but it does :) An MPI program is started 
>>> through SLURM on two nodes with four CPU cores in total (divided over the 
>>> nodes) using the following script:
>>>
>>> #! /bin/bash
>>> #SBATCH -N 2 -n 4
>>> /usr/bin/mpiexec /usr/bin/lstopo --version
>>> /usr/bin/mpiexec /usr/bin/lstopo --of xml
>>> /usr/bin/mpiexec  /path/to/my_mpi_code
>>>
>>> When this is submitted multiple times, it gives “out-of-order” warnings in 
>>> about 9/10 of cases but works without warnings in the remaining 1/10. I 
>>> attached the output (with XML) for both the working and the `broken` case. 
>>> Note that the XML is of course printed (differently) multiple times for each 
>>> task/core. As always, any help would be appreciated.
>>>
>>> Regards,
>>>
>>> Pim Schellart
>>>
>>> P.S. $ mpirun --version
>>> mpirun (Open MPI) 1.6.5
>>>
>>> <broken.log><working.log>
>>>
>>>> On 07 Dec 2014, at 13:50, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>>
>>>> Hello
>>>> The GitHub issue you're referring to was closed 18 months ago. The
>>>> warning (it's not an error) is only supposed to appear if you're
>>>> importing into a recent hwloc an XML that was exported from an old hwloc. I
>>>> don't see how that could happen when using Open MPI, since the hwloc
>>>> versions on both sides are the same.
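
For reference, that scenario can be reproduced by hand by exporting the
topology with one lstopo and re-importing it with another (the file name is
arbitrary):

lstopo foo.xml            # export with the old hwloc
lstopo --input foo.xml -  # re-import with the new one; the warning, if any, would appear here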
>>>> Make sure you're not confusing this with another error described here:
>>>>
>>>> http://www.open-mpi.org/projects/hwloc/doc/v1.10.0/a00028.php#faq_os_error
>>>> Otherwise please report the exact Open MPI and/or hwloc versions as well
>>>> as the XML lstopo output on the nodes that raise the warning (lstopo
>>>> foo.xml). Send these to hwloc mailing lists such as
>>>> hwloc-us...@open-mpi.org or hwloc-de...@open-mpi.org
>>>> Thanks
>>>> Brice
>>>>
>>>>
>>>> On 07/12/2014 13:29, Pim Schellart wrote:
>>>>> Dear OpenMPI developers,
>>>>>
>>>>> this might be a bit off topic, but when using the SLURM scheduler (with 
>>>>> cpuset support) on Ubuntu 14.04 (Open MPI 1.6), hwloc sometimes gives an 
>>>>> “out-of-order topology discovery” error. According to issue #103 on 
>>>>> GitHub (https://github.com/open-mpi/hwloc/issues/103), this error was 
>>>>> discussed before, and it was possible to sort it out in 
>>>>> “insert_object_by_parent”. Is this still being considered? If not, what (top 
>>>>> level) hwloc API call should we look for in the SLURM sources to start 
>>>>> debugging? Any help will be most welcome.
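
As a rough starting point for that search (the source path is a placeholder),
grepping an unpacked SLURM tree for the top-level hwloc entry points would look
like:

grep -rn 'hwloc_topology_' slurm-source/src/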
>>>>>
>>>>> Kind regards,
>>>>>
>>>>> Pim Schellart
