Pim,

if you configure OpenMPI with --with-hwloc=external (or something like
--with-hwloc=/usr), it is very likely that OpenMPI will use the same
hwloc library (i.e. the "system" library) that SLURM uses.

/* I do not know how Ubuntu packages OpenMPI ... */


The default (i.e. no --with-hwloc parameter on the configure command
line) is to use the hwloc library that is embedded within OpenMPI.
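
A minimal sketch of the three options (the /usr prefix is illustrative):

    # link against whatever hwloc the system provides
    ./configure --with-hwloc=external

    # link against an hwloc installed under a specific prefix
    ./configure --with-hwloc=/usr

    # default: no --with-hwloc flag, the embedded hwloc copy is used
    ./configure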

Gilles

On 2014/12/09 17:34, Pim Schellart wrote:
> Ah, OK, so that is where the confusion came from: I did see hwloc in the 
> SLURM sources but couldn't immediately figure out where exactly it was used. 
> We will try compiling OpenMPI with the embedded hwloc. Are there any 
> particular flags I should set?
>
>> On 09 Dec 2014, at 09:30, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> There is no linkage between slurm and ompi when it comes to hwloc. If you 
>> directly launch your app using srun, then slurm will use its version of 
>> hwloc to do the binding. If you use mpirun to launch the app, then we’ll use 
>> our internal version to do it.
>>
>> The two are completely isolated from each other.
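>>
>> A quick sketch of the two launch paths (the binary name is illustrative):
>>
>>   # direct launch: SLURM binds the processes using its own hwloc
>>   srun -N 2 -n 4 ./my_mpi_app
>>
>>   # launch via mpirun: OMPI's internal hwloc does the binding
>>   mpirun -np 4 ./my_mpi_app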
>>
>>
>>> On Dec 9, 2014, at 12:25 AM, Pim Schellart <p.schell...@gmail.com> wrote:
>>>
>>> The version that “lstopo --version” reports is the same (1.8) on all nodes, 
>>> but we may indeed be hitting the second issue. We can try to compile a new 
>>> version of openmpi, but how do we ensure that the external programs (e.g. 
>>> SLURM) are using the same hwloc version as the one embedded in openmpi? Is 
>>> it enough to just compile hwloc 1.9 separately as well and link against 
>>> that? Also, if this is an issue, should we file a bug against hwloc or 
>>> openmpi on Ubuntu for mismatching versions?
>>>
>>>> On 09 Dec 2014, at 00:50, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> Hmmm…they probably linked that to the external, system hwloc version, so 
>>>> it sounds like one or more of your nodes has a different hwloc package on 
>>>> it.
>>>>
>>>> I couldn't leaf through your output well enough to see all the lstopo 
>>>> versions, but you might check to ensure they are the same.
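>>>>
>>>> One way to check, as a sketch (hostnames illustrative):
>>>>
>>>>   for node in coma01 coma02; do ssh $node lstopo --version; done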
>>>>
>>>> Looking at the code base, you may also be hitting a problem here. The 
>>>> OMPI 1.6 series was based on hwloc 1.3; the output you sent indicates you 
>>>> have hwloc 1.8, which is quite a big change. The OMPI 1.8 series is based 
>>>> on hwloc 1.9, so at least that is closer (though probably still a mismatch).
>>>>
>>>> Frankly, I'd just download and install an OMPI tarball myself and avoid 
>>>> these headaches. This kind of version mismatch is why we embed hwloc: it 
>>>> is a critical library for OMPI, and we had to ensure that the version 
>>>> matched our internal requirements.
>>>>
>>>>
>>>>> On Dec 8, 2014, at 8:50 AM, Pim Schellart <p.schell...@gmail.com> wrote:
>>>>>
>>>>> It is the default openmpi that comes with Ubuntu 14.04.
>>>>>
>>>>>> On 08 Dec 2014, at 17:17, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>
>>>>>> Pim: is this an OMPI you built, or one you were given somehow? If you 
>>>>>> built it, how did you configure it?
>>>>>>
>>>>>>> On Dec 8, 2014, at 8:12 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>>>>>
>>>>>>> It likely depends on how SLURM allocates the cpuset/cgroup inside the
>>>>>>> nodes. The XML warning is related to these restrictions inside the node.
>>>>>>> Anyway, my feeling is that there's an old OMPI or an old hwloc somewhere.
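>>>>>>>
>>>>>>> As a quick sketch, the restriction applied to a job step can be seen
>>>>>>> from inside it (Linux-generic; no SLURM-specific paths assumed):
>>>>>>>
>>>>>>>   # lists the CPUs the current process is allowed to run on
>>>>>>>   grep Cpus_allowed_list /proc/self/status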
>>>>>>>
>>>>>>> How do we check after install whether OMPI uses the embedded or the
>>>>>>> system-wide hwloc?
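>>>>>>>
>>>>>>> One rough check would be to see whether the installed library is
>>>>>>> dynamically linked against libhwloc (a sketch; the library path is
>>>>>>> illustrative). If libhwloc shows up, the external hwloc is in use;
>>>>>>> the embedded copy is compiled in and would not appear:
>>>>>>>
>>>>>>>   ldd /usr/lib/openmpi/lib/libmpi.so | grep hwloc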
>>>>>>>
>>>>>>> Brice
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 08/12/2014 17:07, Pim Schellart wrote:
>>>>>>>> Dear Ralph,
>>>>>>>>
>>>>>>>> the nodes are called coma## and, as you can see in the logs, the nodes 
>>>>>>>> of the broken example are the same as the nodes of the working one, so 
>>>>>>>> that doesn't seem to be the cause. Unless (very likely) I'm missing 
>>>>>>>> something. Is there anything else I can check?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Pim
>>>>>>>>
>>>>>>>>> On 08 Dec 2014, at 17:03, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>
>>>>>>>>> As Brice said, OMPI has its own embedded version of hwloc that we 
>>>>>>>>> use, so there is no Slurm interaction to be considered. The most 
>>>>>>>>> likely cause is that one or more of your nodes is picking up a 
>>>>>>>>> different version of OMPI. So things “work” if you happen to get 
>>>>>>>>> nodes where all the versions match, and “fail” when you get a 
>>>>>>>>> combination that includes a different version.
>>>>>>>>>
>>>>>>>>> Is there some way you can narrow down your search to find the node(s) 
>>>>>>>>> that are picking up the different version?
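>>>>>>>>>
>>>>>>>>> For example, something like this run under the same allocation would 
>>>>>>>>> print each node's lstopo version (a sketch, one task per node):
>>>>>>>>>
>>>>>>>>>   srun --ntasks-per-node=1 sh -c 'echo $(hostname): $(lstopo --version)'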
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Dec 8, 2014, at 7:48 AM, Pim Schellart <p.schell...@gmail.com> 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Dear Brice,
>>>>>>>>>>
>>>>>>>>>> I am not sure why this is happening, since all code seems to be using 
>>>>>>>>>> the same hwloc library version (1.8), but it does :) An MPI program 
>>>>>>>>>> is started through SLURM on two nodes with four CPU cores in total 
>>>>>>>>>> (divided over the nodes) using the following script:
>>>>>>>>>>
>>>>>>>>>> #! /bin/bash
>>>>>>>>>> #SBATCH -N 2 -n 4
>>>>>>>>>> # print the hwloc version and XML topology, then run the program
>>>>>>>>>> /usr/bin/mpiexec /usr/bin/lstopo --version
>>>>>>>>>> /usr/bin/mpiexec /usr/bin/lstopo --of xml
>>>>>>>>>> /usr/bin/mpiexec /path/to/my_mpi_code
>>>>>>>>>>
>>>>>>>>>> When this is submitted multiple times, it gives "out-of-order" 
>>>>>>>>>> warnings in about 9 out of 10 cases but works without warnings in 
>>>>>>>>>> the remaining cases. I have attached the output (with XML) for both 
>>>>>>>>>> the working and the "broken" case. Note that the XML is of course 
>>>>>>>>>> printed (differently) multiple times, once for each task/core. As 
>>>>>>>>>> always, any help would be appreciated.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Pim Schellart
>>>>>>>>>>
>>>>>>>>>> P.S. $ mpirun --version
>>>>>>>>>> mpirun (Open MPI) 1.6.5
>>>>>>>>>>
>>>>>>>>>> <broken.log><working.log>
>>>>>>>>>>
>>>>>>>>>>> On 07 Dec 2014, at 13:50, Brice Goglin <brice.gog...@inria.fr> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hello
>>>>>>>>>>> The GitHub issue you're referring to was closed 18 months ago. The
>>>>>>>>>>> warning (it's not an error) is only supposed to appear if you're
>>>>>>>>>>> importing into a recent hwloc an XML that was exported from an old
>>>>>>>>>>> hwloc. I don't see how that could happen when using Open MPI, since
>>>>>>>>>>> the hwloc versions on both sides are the same.
>>>>>>>>>>> Make sure you're not confusing it with another error, described here:
>>>>>>>>>>>
>>>>>>>>>>> http://www.open-mpi.org/projects/hwloc/doc/v1.10.0/a00028.php#faq_os_error
>>>>>>>>>>> Otherwise, please report the exact Open MPI and/or hwloc versions as
>>>>>>>>>>> well as the XML lstopo output on the nodes that raise the warning
>>>>>>>>>>> (lstopo foo.xml). Send these to the hwloc mailing lists, such as
>>>>>>>>>>> hwloc-us...@open-mpi.org or hwloc-de...@open-mpi.org.
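>>>>>>>>>>>
>>>>>>>>>>> A minimal sketch of gathering that information on one node (the
>>>>>>>>>>> filename is illustrative):
>>>>>>>>>>>
>>>>>>>>>>>   lstopo --version   # the exact hwloc version to report
>>>>>>>>>>>   lstopo foo.xml     # export that node's topology to XML
>>>>>>>>>>>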
>>>>>>>>>>> Thanks
>>>>>>>>>>> Brice
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 07/12/2014 13:29, Pim Schellart wrote:
>>>>>>>>>>>> Dear OpenMPI developers,
>>>>>>>>>>>>
>>>>>>>>>>>> this might be a bit off topic, but when using the SLURM scheduler 
>>>>>>>>>>>> (with cpuset support) on Ubuntu 14.04 (openmpi 1.6), hwloc 
>>>>>>>>>>>> sometimes gives an "out-of-order topology discovery" error. 
>>>>>>>>>>>> According to issue #103 on GitHub 
>>>>>>>>>>>> (https://github.com/open-mpi/hwloc/issues/103) this error was 
>>>>>>>>>>>> discussed before, and it was possible to sort it out in 
>>>>>>>>>>>> "insert_object_by_parent". Is this still being considered? If not, 
>>>>>>>>>>>> which (top-level) hwloc API call should we look for in the SLURM 
>>>>>>>>>>>> sources to start debugging? Any help will be most welcome.
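>>>>>>>>>>>>
>>>>>>>>>>>> For instance, a starting point could be grepping for the usual hwloc 
>>>>>>>>>>>> discovery entry points (our guess; the source path is illustrative):
>>>>>>>>>>>>
>>>>>>>>>>>>   grep -rEn 'hwloc_topology_(init|load)' slurm-source/src/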
>>>>>>>>>>>>
>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Pim Schellart