Pim: is this an OMPI you built, or one you were given somehow? If you built it, 
how did you configure it?
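
For reference, a sketch of the two configure variants that matter here (paths are only illustrative; the --with-hwloc setting is the relevant part):

  ./configure --prefix=/opt/openmpi-1.6.5 --with-hwloc=internal    # embedded hwloc
  ./configure --prefix=/opt/openmpi-1.6.5 --with-hwloc=/usr        # system-wide hwloc

Knowing which form was used would tell us whether the embedded or an external hwloc is in play.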

> On Dec 8, 2014, at 8:12 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
> 
> It likely depends on how SLURM allocates the cpuset/cgroup inside the
> nodes. The XML warning is related to these restrictions inside the node.
> Anyway, my feeling is that there's an old OMPI or an old hwloc somewhere.
> 
> How do we check after install whether OMPI uses the embedded or the
> system-wide hwloc?
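> 
> (One quick check, as a sketch: if OMPI was built against an external
> hwloc, that library shows up as a shared-library dependency, e.g.
> 
>   ldd $(which mpiexec) | grep hwloc
> 
> while no libhwloc line would suggest the embedded copy is being used.)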
> 
> Brice
> 
> 
> 
> 
> On 08/12/2014 17:07, Pim Schellart wrote:
>> Dear Ralph,
>> 
>> the nodes are called coma##, and as you can see in the logs, the nodes of the 
>> broken example are the same as the nodes of the working one, so that doesn’t 
>> seem to be the cause. Unless (very likely) I’m missing something. Is there 
>> anything else I can check?
>> 
>> Regards,
>> 
>> Pim
>> 
>>> On 08 Dec 2014, at 17:03, Ralph Castain <r...@open-mpi.org> wrote:
>>> 
>>> As Brice said, OMPI has its own embedded version of hwloc that we use, so 
>>> there is no Slurm interaction to be considered. The most likely cause is 
>>> that one or more of your nodes is picking up a different version of OMPI. 
>>> So things “work” if you happen to get nodes where all the versions match, 
>>> and “fail” when you get a combination that includes a different version.
>>> 
>>> Is there some way you can narrow down your search to find the node(s) that 
>>> are picking up the different version?
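>>> 
>>> For example, a rough sketch reusing your batch setup (paths as in your
>>> script; srun is used here on purpose to bypass mpiexec):
>>> 
>>>   #! /bin/bash
>>>   #SBATCH -N 2 -n 4
>>>   srun bash -c 'echo "$(hostname): $(/usr/bin/mpirun --version 2>&1 | head -1), $(/usr/bin/lstopo --version)"'
>>> 
>>> Any node that reports a different Open MPI or hwloc version would be the
>>> one to look at.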
>>> 
>>> 
>>>> On Dec 8, 2014, at 7:48 AM, Pim Schellart <p.schell...@gmail.com> wrote:
>>>> 
>>>> Dear Brice,
>>>> 
>>>> I am not sure why this is happening, since all code seems to be using the 
>>>> same hwloc library version (1.8), but it does :) An MPI program is started 
>>>> through SLURM on two nodes, with four CPU cores total (divided over the 
>>>> nodes), using the following script:
>>>> 
>>>> #! /bin/bash
>>>> #SBATCH -N 2 -n 4
>>>> /usr/bin/mpiexec /usr/bin/lstopo --version
>>>> /usr/bin/mpiexec /usr/bin/lstopo --of xml
>>>> /usr/bin/mpiexec  /path/to/my_mpi_code
>>>> 
>>>> When this is submitted multiple times, it gives “out-of-order” warnings in 
>>>> about 9 out of 10 cases but works without warnings in the remaining 1 out of 
>>>> 10. I attached the output (with XML) for both the working and `broken` cases. 
>>>> Note that the XML is of course printed (differently) multiple times, once for 
>>>> each task/core. As always, any help would be appreciated.
>>>> 
>>>> Regards,
>>>> 
>>>> Pim Schellart
>>>> 
>>>> P.S. $ mpirun --version
>>>> mpirun (Open MPI) 1.6.5
>>>> 
>>>> <broken.log><working.log>
>>>> 
>>>>> On 07 Dec 2014, at 13:50, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>>> 
>>>>> Hello
>>>>> The GitHub issue you're referring to was closed 18 months ago. The
>>>>> warning (it's not an error) is only supposed to appear if you're
>>>>> importing into a recent hwloc an XML that was exported from an old hwloc. I
>>>>> don't see how that could happen when using Open MPI, since the hwloc
>>>>> versions on both sides are the same.
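>>>>> 
>>>>> (As a sketch, that situation can be reproduced outside MPI by exporting
>>>>> with an old lstopo and re-importing with a newer one, e.g.
>>>>> 
>>>>>   /path/to/old/lstopo --of xml > topo.xml    # hypothetical old install
>>>>>   lstopo --input topo.xml --of console
>>>>> 
>>>>> and checking whether the same warning appears.)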
>>>>> Make sure you're not confusing it with another error, described here:
>>>>> 
>>>>> http://www.open-mpi.org/projects/hwloc/doc/v1.10.0/a00028.php#faq_os_error
>>>>> Otherwise, please report the exact Open MPI and/or hwloc versions, as well
>>>>> as the XML lstopo output on the nodes that raise the warning (lstopo
>>>>> foo.xml). Send these to the hwloc mailing lists, such as
>>>>> hwloc-us...@open-mpi.org or hwloc-de...@open-mpi.org.
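>>>>> 
>>>>> (Roughly, on each node that shows the warning, something like
>>>>> 
>>>>>   mpirun --version
>>>>>   lstopo --version
>>>>>   lstopo foo.xml    # then attach foo.xml
>>>>> 
>>>>> should cover what is needed.)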
>>>>> Thanks
>>>>> Brice
>>>>> 
>>>>> 
>>>>> On 07/12/2014 13:29, Pim Schellart wrote:
>>>>>> Dear OpenMPI developers,
>>>>>> 
>>>>>> this might be a bit off topic, but when using the SLURM scheduler (with
>>>>>> cpuset support) on Ubuntu 14.04 (Open MPI 1.6), hwloc sometimes gives an
>>>>>> “out-of-order topology discovery” error. According to issue #103 on
>>>>>> GitHub (https://github.com/open-mpi/hwloc/issues/103), this error was
>>>>>> discussed before and it was possible to sort it out in
>>>>>> “insert_object_by_parent”; is this still being considered? If not, which
>>>>>> (top-level) hwloc API call should we look for in the SLURM sources to
>>>>>> start debugging? Any help will be most welcome.
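>>>>>> 
>>>>>> (For instance, would grepping the SLURM tree for the standard entry
>>>>>> points be a reasonable start? Something like
>>>>>> 
>>>>>>   grep -rn "hwloc_topology_init\|hwloc_topology_load" src/
>>>>>> 
>>>>>> where the choice of calls is only a guess on our side.)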
>>>>>> 
>>>>>> Kind regards,
>>>>>> 
>>>>>> Pim Schellart