Thanks! Eugene from Oracle is looking into this; he sees some possible issues in the Solaris hwloc code and is still digging into it.
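In the meantime, a small stand-alone hwloc test can help narrow down whether the problem is in what hwloc reports on Solaris or in how the binding is applied. The sketch below is only an illustration (hwloc 1.x API, not the code path Open MPI itself uses): it lists the sockets hwloc detects, binds the calling process to socket 1 / core 0 (the same resource a rankfile slot list "1:0" asks for), and reads the binding back. If the cpuset that comes back is wider than the one requested, the Solaris cpubind support would be the place to look.

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_obj_t sock, core;
    hwloc_bitmap_t got;
    char want[256], have[256];
    int s, nsock;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Show what hwloc thinks the machine looks like. */
    nsock = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_SOCKET);
    for (s = 0; s < nsock; s++) {
        sock = hwloc_get_obj_by_type(topo, HWLOC_OBJ_SOCKET, s);
        hwloc_bitmap_snprintf(want, sizeof(want), sock->cpuset);
        printf("socket %d cpuset %s\n", s, want);
    }

    /* Socket 1, core 0 below it: what the rankfile slot list "1:0" requests. */
    sock = hwloc_get_obj_by_type(topo, HWLOC_OBJ_SOCKET, 1);
    core = sock ? hwloc_get_obj_inside_cpuset_by_type(topo, sock->cpuset,
                                                      HWLOC_OBJ_CORE, 0) : NULL;
    if (core == NULL) {
        fprintf(stderr, "socket 1 / core 0 not found\n");
        return 1;
    }
    hwloc_bitmap_snprintf(want, sizeof(want), core->cpuset);

    /* Apply the binding and read it back. */
    if (hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_PROCESS) != 0)
        perror("hwloc_set_cpubind");
    got = hwloc_bitmap_alloc();
    if (hwloc_get_cpubind(topo, got, HWLOC_CPUBIND_PROCESS) != 0)
        perror("hwloc_get_cpubind");
    hwloc_bitmap_snprintf(have, sizeof(have), got);

    printf("requested cpuset %s, got cpuset %s\n", want, have);

    hwloc_bitmap_free(got);
    hwloc_topology_destroy(topo);
    return 0;
}

Compile with something like "cc bind-test.c -lhwloc" (the file name is arbitrary; this assumes hwloc is installed where the compiler can find it).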
On Feb 6, 2013, at 4:46 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi
>
>> We've been talking about this offline. Can you send us an lstopo
>> output from your Solaris machine? Send us the text output and
>> the xml output, e.g.:
>>
>> lstopo > solaris.txt
>> lstopo solaris.xml
>
> I have installed hwloc-1.3.2 and hwloc-1.6.1 and get the following
> output (it's the same for both versions in the text file, but the
> xml files differ).
>
> sunpc1 bin 121 lstopo --version
> lstopo 1.3.2
> sunpc1 bin 122 lstopo
> Machine (8191MB)
>   NUMANode L#0 (P#1 4095MB) + Socket L#0
>     Core L#0 + PU L#0 (P#0)
>     Core L#1 + PU L#1 (P#1)
>   NUMANode L#1 (P#2 4096MB) + Socket L#1
>     Core L#2 + PU L#2 (P#2)
>     Core L#3 + PU L#3 (P#3)
>
> sunpc1 bin 123 cd ../../hwloc-1.6.1/bin/
> sunpc1 bin 124 lstopo --version
> lstopo 1.6.1
> sunpc1 bin 125 lstopo
> Machine (8191MB)
>   NUMANode L#0 (P#1 4095MB) + Socket L#0
>     Core L#0 + PU L#0 (P#0)
>     Core L#1 + PU L#1 (P#1)
>   NUMANode L#1 (P#2 4096MB) + Socket L#1
>     Core L#2 + PU L#2 (P#2)
>     Core L#3 + PU L#3 (P#3)
> sunpc1 bin 126
>
> I have attached the requested files.
>
> sunpc1 bin 144 lstopo --version
> lstopo 1.3.2
> sunpc1 bin 145 lstopo > /tmp/sunpc1-hwloc-1.3.2.txt
> sunpc1 bin 146 lstopo --of xml > /tmp/sunpc1-hwloc-1.3.2.xml
> sunpc1 bin 147 cd ../../hwloc-1.6.1/bin/
> sunpc1 bin 148 lstopo --version
> lstopo 1.6.1
> sunpc1 bin 149 lstopo > /tmp/sunpc1-hwloc-1.6.1.txt
> sunpc1 bin 150 lstopo --of xml > /tmp/sunpc1-hwloc-1.6.1.xml
>
> Thank you very much for your help in advance.
>
> Kind regards
>
> Siegmar
>
>> On Feb 5, 2013, at 12:30 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>
>>> Hi
>>>
>>> now I can use all our machines once more. I have a problem on
>>> Solaris 10 x86_64, because the mapping of processes doesn't
>>> correspond to the rankfile. I removed the output from "hostname"
>>> and wrapped long lines.
>>>
>>> tyr rankfiles 114 cat rf_ex_sunpc
>>> # mpiexec -report-bindings -rf rf_ex_sunpc hostname
>>>
>>> rank 0=sunpc0 slot=0:0-1,1:0-1
>>> rank 1=sunpc1 slot=0:0-1
>>> rank 2=sunpc1 slot=1:0
>>> rank 3=sunpc1 slot=1:1
>>>
>>> tyr rankfiles 115 mpiexec -report-bindings -rf rf_ex_sunpc hostname
>>> [sunpc0:17920] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>>> [sunpc1:11265] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
>>> [sunpc1:11265] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 1:0)
>>> [sunpc1:11265] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 1:1)
>>>
>>> Can I provide any information to solve this problem? My
>>> rankfile works as expected if I use only Linux machines.
>>>
>>> Kind regards
>>>
>>> Siegmar
>>>
>>>>> Hmmm....well, it certainly works for me:
>>>>>
>>>>> [rhc@odin ~/v1.6]$ cat rf
>>>>> rank 0=odin093 slot=0:0-1,1:0-1
>>>>> rank 1=odin094 slot=0:0-1
>>>>> rank 2=odin094 slot=1:0
>>>>> rank 3=odin094 slot=1:1
>>>>>
>>>>> [rhc@odin ~/v1.6]$ mpirun -n 4 -rf ./rf --report-bindings -mca opal_paffinity_alone 0 hostname
>>>>> [odin093.cs.indiana.edu:04617] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>>>>> odin093.cs.indiana.edu
>>>>> odin094.cs.indiana.edu
>>>>> [odin094.cs.indiana.edu:04426] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
>>>>> odin094.cs.indiana.edu
>>>>> [odin094.cs.indiana.edu:04426] MCW rank 2 bound to socket 1[core 0]: [. .][B .] (slot list 1:0)
>>>>> [odin094.cs.indiana.edu:04426] MCW rank 3 bound to socket 1[core 1]: [. .][. B] (slot list 1:1)
>>>>> odin094.cs.indiana.edu
>>>>
>>>> Interesting that it works on your machines.
>>>>
>>>>> I see one thing of concern to me in your output - your second node
>>>>> appears to be a Sun computer. Is it the same physical architecture?
>>>>> Is it also running Linux? Are you sure it is using the same version
>>>>> of OMPI, built for that environment and hardware?
>>>>
>>>> Both machines (in fact all four machines: sunpc0, sunpc1, linpc0, and
>>>> linpc1) use the same hardware. "linpc" uses openSUSE 12.1 and "sunpc"
>>>> Solaris 10 x86_64. All machines use the same version of Open MPI,
>>>> built for that environment. At the moment I can only use sunpc1 and
>>>> linpc1 ("my" developer machines). Next week I will have access to all
>>>> machines, so that I can test whether I get a different behaviour when
>>>> I use two machines with the same operating system (although mixed
>>>> operating systems weren't a problem in the past; only machines with
>>>> different endianness were). I will let you know my results.
>>>>
>>>> Kind regards
>>>>
>>>> Siegmar
>>>>
>>>>> On Jan 30, 2013, at 2:08 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> I applied your patch "rmaps.diff" to openmpi-1.6.4rc3r27923 and
>>>>>> it works for my previous rankfile.
>>>>>>
>>>>>>> #3493: Handle the case where rankfile provides the allocation
>>>>>>> -----------------------------------+----------------------------
>>>>>>>  Reporter: rhc                     | Owner: jsquyres
>>>>>>>  Type: changeset move request      | Status: new
>>>>>>>  Priority: critical                | Milestone: Open MPI 1.6.4
>>>>>>>  Version: trunk                    | Keywords:
>>>>>>> -----------------------------------+----------------------------
>>>>>>> Please apply the attached patch that corrects the rmaps function for
>>>>>>> obtaining the available nodes when rankfile is providing the allocation.
>>>>>>
>>>>>> tyr rankfiles 129 more rf_linpc1
>>>>>> # mpiexec -report-bindings -rf rf_linpc1 hostname
>>>>>> rank 0=linpc1 slot=0:0-1,1:0-1
>>>>>>
>>>>>> tyr rankfiles 130 mpiexec -report-bindings -rf rf_linpc1 hostname
>>>>>> [linpc1:31603] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>>>>>>
>>>>>> Unfortunately I don't get the expected result for the following
>>>>>> rankfile.
>>>>>>
>>>>>> tyr rankfiles 114 more rf_bsp
>>>>>> # mpiexec -report-bindings -rf rf_bsp hostname
>>>>>> rank 0=linpc1 slot=0:0-1,1:0-1
>>>>>> rank 1=sunpc1 slot=0:0-1
>>>>>> rank 2=sunpc1 slot=1:0
>>>>>> rank 3=sunpc1 slot=1:1
>>>>>>
>>>>>> I would expect that rank 0 gets all four cores from linpc1, rank 1
>>>>>> both cores of socket 0 from sunpc1, rank 2 core 0 of socket 1, and
>>>>>> rank 3 core 1 of socket 1 from sunpc1. Everything is fine for my
>>>>>> processes with ranks 0 and 1, but it's wrong for ranks 2 and 3,
>>>>>> because they both get all four cores of sunpc1. Is something wrong
>>>>>> with my rankfile or with your mapping of processes to cores? I have
>>>>>> removed the output from "hostname" and wrapped long lines.
>>>>>>
>>>>>> tyr rankfiles 115 mpiexec -report-bindings -rf rf_bsp hostname
>>>>>> [linpc1:31092] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>>>>>> [sunpc1:12916] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
>>>>>> [sunpc1:12916] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 1:0)
>>>>>> [sunpc1:12916] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 1:1)
>>>>>>
>>>>>> I get the following output if I add the options that you mentioned
>>>>>> in a previous email.
>>>>>>
>>>>>> tyr rankfiles 124 mpiexec -report-bindings -rf rf_bsp \
>>>>>>   -display-allocation -mca ras_base_verbose 5 hostname
>>>>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras) Querying component [cm]
>>>>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras) Skipping component [cm]. Query failed to return a module
>>>>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras) No component selected!
>>>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate nothing found in module - proceeding to hostfile
>>>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate parsing default hostfile /usr/local/openmpi-1.6.4_64_cc/etc/openmpi-default-hostfile
>>>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate nothing found in hostfiles or dash-host - checking for rankfile
>>>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:node_insert inserting 2 nodes
>>>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:node_insert node linpc1
>>>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:node_insert node sunpc1
>>>>>>
>>>>>> ======================   ALLOCATED NODES   ======================
>>>>>>
>>>>>>  Data for node: tyr.informatik.hs-fulda.de   Num slots: 0  Max slots: 0
>>>>>>  Data for node: linpc1                       Num slots: 1  Max slots: 0
>>>>>>  Data for node: sunpc1                       Num slots: 3  Max slots: 0
>>>>>>
>>>>>> =================================================================
>>>>>> [linpc1:31532] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>>>>>> [sunpc1:13136] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
>>>>>> [sunpc1:13136] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 1:0)
>>>>>> [sunpc1:13136] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 1:1)
>>>>>>
>>>>>> Thank you very much for any suggestions and any help in advance.
>>>>>>
>>>>>> Kind regards
>>>>>>
>>>>>> Siegmar
>

[Attached file /tmp/sunpc1-hwloc-1.3.2.txt]

> Machine (8191MB)
>   NUMANode L#0 (P#1 4095MB) + Socket L#0
>     Core L#0 + PU L#0 (P#0)
>     Core L#1 + PU L#1 (P#1)
>   NUMANode L#1 (P#2 4096MB) + Socket L#1
>     Core L#2 + PU L#2 (P#2)
>     Core L#3 + PU L#3 (P#3)

[Attached file /tmp/sunpc1-hwloc-1.3.2.xml]

> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE topology SYSTEM "hwloc.dtd">
> <topology>
>   <object type="Machine" os_level="-1" os_index="0" cpuset="0x0000000f" complete_cpuset="0x0000000f" online_cpuset="0x0000000f" allowed_cpuset="0x0000000f" nodeset="0x00000006" complete_nodeset="0x00000006" allowed_nodeset="0x00000006">
>     <info name="OSName" value="SunOS"/>
>     <info name="OSRelease" value="5.10"/>
>     <info name="OSVersion" value="Generic_147441-21"/>
>     <info name="HostName" value="sunpc1"/>
>     <info name="Architecture" value="i86pc"/>
>     <object type="NUMANode" os_level="-1" os_index="1" cpuset="0x00000003" complete_cpuset="0x00000003" online_cpuset="0x00000003" allowed_cpuset="0x00000003" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002" local_memory="4293435392">
>       <page_type size="4096" count="0"/>
>       <object type="Socket" os_level="-1" os_index="0" cpuset="0x00000003" complete_cpuset="0x00000003" online_cpuset="0x00000003" allowed_cpuset="0x00000003" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002">
>         <object type="Core" os_level="-1" os_index="0" cpuset="0x00000001" complete_cpuset="0x00000001" online_cpuset="0x00000001" allowed_cpuset="0x00000001" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002">
>           <object type="PU" os_level="-1" os_index="0" cpuset="0x00000001" complete_cpuset="0x00000001" online_cpuset="0x00000001" allowed_cpuset="0x00000001" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002"/>
>         </object>
>         <object type="Core" os_level="-1" os_index="1" cpuset="0x00000002" complete_cpuset="0x00000002" online_cpuset="0x00000002" allowed_cpuset="0x00000002" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002">
>           <object type="PU" os_level="-1" os_index="1" cpuset="0x00000002" complete_cpuset="0x00000002" online_cpuset="0x00000002" allowed_cpuset="0x00000002" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002"/>
>         </object>
>       </object>
>     </object>
>     <object type="NUMANode" os_level="-1" os_index="2" cpuset="0x0000000c" complete_cpuset="0x0000000c" online_cpuset="0x0000000c" allowed_cpuset="0x0000000c" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004" local_memory="4294967296">
>       <page_type size="4096" count="0"/>
>       <object type="Socket" os_level="-1" os_index="1" cpuset="0x0000000c" complete_cpuset="0x0000000c" online_cpuset="0x0000000c" allowed_cpuset="0x0000000c" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004">
>         <object type="Core" os_level="-1" os_index="2" cpuset="0x00000004" complete_cpuset="0x00000004" online_cpuset="0x00000004" allowed_cpuset="0x00000004" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004">
>           <object type="PU" os_level="-1" os_index="2" cpuset="0x00000004" complete_cpuset="0x00000004" online_cpuset="0x00000004" allowed_cpuset="0x00000004" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004"/>
>         </object>
>         <object type="Core" os_level="-1" os_index="3" cpuset="0x00000008" complete_cpuset="0x00000008" online_cpuset="0x00000008" allowed_cpuset="0x00000008" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004">
>           <object type="PU" os_level="-1" os_index="3" cpuset="0x00000008" complete_cpuset="0x00000008" online_cpuset="0x00000008" allowed_cpuset="0x00000008" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004"/>
>         </object>
>       </object>
>     </object>
>   </object>
> </topology>

[Attached file /tmp/sunpc1-hwloc-1.6.1.txt]

> Machine (8191MB)
>   NUMANode L#0 (P#1 4095MB) + Socket L#0
>     Core L#0 + PU L#0 (P#0)
>     Core L#1 + PU L#1 (P#1)
>   NUMANode L#1 (P#2 4096MB) + Socket L#1
>     Core L#2 + PU L#2 (P#2)
>     Core L#3 + PU L#3 (P#3)

[Attached file /tmp/sunpc1-hwloc-1.6.1.xml]

> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE topology SYSTEM "hwloc.dtd">
> <topology>
>   <object type="Machine" os_index="0" cpuset="0x0000000f" complete_cpuset="0x0000000f" online_cpuset="0x0000000f" allowed_cpuset="0x0000000f" nodeset="0x00000006" complete_nodeset="0x00000006" allowed_nodeset="0x00000006">
>     <info name="Backend" value="Solaris"/>
>     <info name="OSName" value="SunOS"/>
>     <info name="OSRelease" value="5.10"/>
>     <info name="OSVersion" value="Generic_147441-21"/>
>     <info name="HostName" value="sunpc1"/>
>     <info name="Architecture" value="i86pc"/>
>     <object type="NUMANode" os_index="1" cpuset="0x00000003" complete_cpuset="0x00000003" online_cpuset="0x00000003" allowed_cpuset="0x00000003" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002" local_memory="4293435392">
>       <page_type size="4096" count="0"/>
>       <object type="Socket" os_index="0" cpuset="0x00000003" complete_cpuset="0x00000003" online_cpuset="0x00000003" allowed_cpuset="0x00000003" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002">
>         <info name="CPUType" value=""/>
>         <info name="CPUModel" value="Dual Core AMD Opteron(tm) Processor 280"/>
>         <object type="Core" os_index="0" cpuset="0x00000001" complete_cpuset="0x00000001" online_cpuset="0x00000001" allowed_cpuset="0x00000001" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002">
>           <object type="PU" os_index="0" cpuset="0x00000001" complete_cpuset="0x00000001" online_cpuset="0x00000001" allowed_cpuset="0x00000001" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002"/>
>         </object>
>         <object type="Core" os_index="1" cpuset="0x00000002" complete_cpuset="0x00000002" online_cpuset="0x00000002" allowed_cpuset="0x00000002" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002">
>           <object type="PU" os_index="1" cpuset="0x00000002" complete_cpuset="0x00000002" online_cpuset="0x00000002" allowed_cpuset="0x00000002" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002"/>
>         </object>
>       </object>
>     </object>
>     <object type="NUMANode" os_index="2" cpuset="0x0000000c" complete_cpuset="0x0000000c" online_cpuset="0x0000000c" allowed_cpuset="0x0000000c" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004" local_memory="4294967296">
>       <page_type size="4096" count="0"/>
>       <object type="Socket" os_index="1" cpuset="0x0000000c" complete_cpuset="0x0000000c" online_cpuset="0x0000000c" allowed_cpuset="0x0000000c" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004">
>         <info name="CPUType" value=""/>
>         <info name="CPUModel" value="Dual Core AMD Opteron(tm) Processor 280"/>
>         <object type="Core" os_index="2" cpuset="0x00000004" complete_cpuset="0x00000004" online_cpuset="0x00000004" allowed_cpuset="0x00000004" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004">
>           <object type="PU" os_index="2" cpuset="0x00000004" complete_cpuset="0x00000004" online_cpuset="0x00000004" allowed_cpuset="0x00000004" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004"/>
>         </object>
>         <object type="Core" os_index="3" cpuset="0x00000008" complete_cpuset="0x00000008" online_cpuset="0x00000008" allowed_cpuset="0x00000008" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004">
>           <object type="PU" os_index="3" cpuset="0x00000008" complete_cpuset="0x00000008" online_cpuset="0x00000008" allowed_cpuset="0x00000008" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004"/>
>         </object>
>       </object>
>     </object>
>   </object>
> </topology>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/