Dear Ralph,

The nodes are named coma##, and as you can see in the logs, the nodes in the 
broken example are the same as those in the working one, so that doesn’t 
seem to be the cause. Unless (very likely) I’m missing something. Anything else 
I can check?
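
Would something like the following be a sensible way to check which hwloc and 
Open MPI each node actually picks up? This is just a rough sketch: it assumes 
mpiexec and lstopo live in /usr/bin on every node (as in my batch script), that 
ldd is on the default PATH, and /path/to/my_mpi_code is the same placeholder as 
before.

#! /bin/bash
#SBATCH -N 2 -n 4
# For every task, print the host name together with the lstopo version,
# the hwloc library lstopo links against, and the MPI library the
# application itself resolves to.
/usr/bin/mpiexec bash -c 'echo "$(hostname): $(/usr/bin/lstopo --version)"'
/usr/bin/mpiexec bash -c 'echo "$(hostname): $(ldd /usr/bin/lstopo | grep -i hwloc)"'
/usr/bin/mpiexec bash -c 'echo "$(hostname): $(ldd /path/to/my_mpi_code | grep -i libmpi)"'

If a broken run reports a different lstopo, libhwloc or libmpi on one of the 
nodes, that should point at the version mix you suspect.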

Regards,

Pim

> On 08 Dec 2014, at 17:03, Ralph Castain <r...@open-mpi.org> wrote:
> 
> As Brice said, OMPI has its own embedded version of hwloc that we use, so 
> there is no Slurm interaction to be considered. The most likely cause is that 
> one or more of your nodes is picking up a different version of OMPI. So 
> things “work” if you happen to get nodes where all the versions match, and 
> “fail” when you get a combination that includes a different version.
> 
> Is there some way you can narrow down your search to find the node(s) that 
> are picking up the different version?
> 
> 
>> On Dec 8, 2014, at 7:48 AM, Pim Schellart <p.schell...@gmail.com> wrote:
>> 
>> Dear Brice,
>> 
>> I am not sure why this is happening, since all of the code seems to be using 
>> the same hwloc library version (1.8), but it does :) An MPI program is 
>> started through SLURM on two nodes with four CPU cores in total (divided 
>> over the nodes) using the following script:
>> 
>> #! /bin/bash
>> # Request 2 nodes and 4 tasks in total from Slurm
>> #SBATCH -N 2 -n 4
>> # Print the lstopo version and the XML topology from every rank, then run the MPI code
>> /usr/bin/mpiexec /usr/bin/lstopo --version
>> /usr/bin/mpiexec /usr/bin/lstopo --of xml
>> /usr/bin/mpiexec /path/to/my_mpi_code
>> 
>> When this is submitted multiple times, it gives “out-of-order” warnings in 
>> about 9 out of 10 cases but runs without warnings in the remaining 1 out of 
>> 10. I attached the output (including the XML) for both the working and the 
>> broken case. Note that the XML is of course printed (differently) multiple 
>> times, once for each task/core. As always, any help would be appreciated.
>> 
>> Regards,
>> 
>> Pim Schellart
>> 
>> P.S. $ mpirun --version
>> mpirun (Open MPI) 1.6.5
>> 
>> <broken.log><working.log>
>> 
>>> On 07 Dec 2014, at 13:50, Brice Goglin <brice.gog...@inria.fr> wrote:
>>> 
>>> Hello
>>> The GitHub issue you're referring to was closed 18 months ago. The
>>> warning (it's not an error) is only supposed to appear when importing
>>> into a recent hwloc an XML that was exported from an old hwloc. I
>>> don't see how that could happen when using Open MPI, since the hwloc
>>> versions on both sides are the same.
>>> Make sure you're not confusing it with another error, described here:
>>> 
>>> http://www.open-mpi.org/projects/hwloc/doc/v1.10.0/a00028.php#faq_os_error
>>> Otherwise please report the exact Open MPI and/or hwloc versions as well
>>> as the XML lstopo output on the nodes that raise the warning (lstopo
>>> foo.xml). Send these to hwloc mailing lists such as
>>> hwloc-us...@open-mpi.org or hwloc-de...@open-mpi.org
>>> Thanks
>>> Brice
>>> 
>>> 
>>> On 07/12/2014 at 13:29, Pim Schellart wrote:
>>>> Dear OpenMPI developers,
>>>> 
>>>> This might be a bit off topic, but when using the SLURM scheduler (with 
>>>> cpuset support) on Ubuntu 14.04 (Open MPI 1.6), hwloc sometimes gives an 
>>>> “out-of-order topology discovery” error. According to issue #103 on GitHub 
>>>> (https://github.com/open-mpi/hwloc/issues/103), this error was discussed 
>>>> before and it was possible to sort it out in “insert_object_by_parent”; is 
>>>> this still being considered? If not, which (top-level) hwloc API call should 
>>>> we look for in the SLURM sources to start debugging? Any help would be most 
>>>> welcome.
>>>> 
>>>> Kind regards,
>>>> 
>>>> Pim Schellart
>>> 
>> 
> 
