Please try to keep the users list on the messages; it allows others to chime in.

You can see the topology by adding "-mca ess_base_verbose 5" to your command
line. The output will include other diagnostics as well, and you'll need to
configure your build with --enable-debug.
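
For example, here is a minimal sketch of both steps, reusing the linpc0 command
from your report. The configure options other than --enable-debug are yours to
fill in, and "topo.log" is just a placeholder file name I picked for
illustration.

```shell
# Step 1 (build time): the verbose output needs a debug build.
#   ./configure --enable-debug <your usual configure options>

# Step 2 (run time): add the verbosity parameter to one of the failing commands.
verbose_opts="-mca ess_base_verbose 5"
cmd="mpiexec $verbose_opts -np 2 -host linpc0 -report-bindings -map-by core -bind-to-core date"

# Print the command for review. To run it and keep the output
# (topo.log is a hypothetical file name):
#   eval "$cmd" 2>&1 | tee topo.log
echo "$cmd"
```

Capturing stderr as well as stdout is probably worthwhile, since the verbose
output is mixed in with the other diagnostics.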


On Sep 24, 2012, at 4:47 AM, Siegmar Gross 
<siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi,
> 
>> The 1.7 series has a completely different way of handling node
>> topology than was used in the 1.6 series. It provides some
>> enhanced features, but it does have some drawbacks in the case
>> where the topology info isn't correct. I fear you are running
>> into this problem (again).
>> 
>> All the commands you show here work fine for me on a Linux 
>> x86_64 box using 1.7r27361 on a Westmere 6-core single-socket
>> machine with hyperthreads enabled. I cannot replicate any of
>> the reported problems, so there isn't much I can do at this point.
>> 
>> As I've said before, the root problem here appears to be some
>> hwloc-related issue with your setup. Until that gets resolved
>> so we get correct topology info, I'm not sure what can be done
>> to resolve what you are seeing. I'll raise the question of
>> possibly providing some alternative support for setups like
>> yours that just can't get topology info, but that would
>> definitely be a long-term question.
> 
> Can we check if you get wrong topology info or which info you get
> at all? Can you tell me a file and location where I can print the
> values of relevant variables on my architecture? Perhaps that can
> help to determine what goes wrong. I would use the latest trunk
> tarball and can make the test a day later, because all changes on
> our "installation server" are mirrored in the night to our file
> server for all machines.
> 
> 
> Kind regards
> 
> Siegmar
> 
> 
> 
> 
>> On Sep 23, 2012, at 3:20 AM, Siegmar Gross
>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>> 
>>> Hi,
>>> 
>>> yesterday I installed openmpi-1.7a1r27358 and it has an improved
>>> error message compared to openmpi-1.6.2, but doesn't show process bindings
>>> and has some other problems as well.
>>> 
>>> 
>>> "sunpc0" and "linpc0" are equipped with two dual-core processors running
>>> Solaris 10 x86_64 and Linux x86_64 resp. "tyr" is a dual-processor machine
>>> running Solaris 10 Sparc.
>>> 
>>> tyr fd1026 105 mpiexec -np 2 -host sunpc0 -report-bindings \
>>> -map-by core -bind-to-core date
>>> Sun Sep 23 11:46:36 CEST 2012
>>> Sun Sep 23 11:46:36 CEST 2012
>>> 
>>> tyr fd1026 106 mpicc -showme
>>> cc -I/usr/local/openmpi-1.7_64_cc/include -mt -m64 
>>> -L/usr/local/openmpi-1.7_64_cc/lib64 -lmpi -lpicl -lm -lkstat -llgrp
>>> -lsocket -lnsl -lrt -lm
>>> 
>>> 
>>> openmpi-1.6.2 shows process bindings.
>>> 
>>> tyr fd1026 103 mpiexec -np 2 -host sunpc0 -report-bindings \
>>> -bycore -bind-to-core date
>>> Sun Sep 23 12:09:06 CEST 2012
>>> [sunpc0:13197] MCW rank 0 bound to socket 0[core 0]: [B .][. .]
>>> [sunpc0:13197] MCW rank 1 bound to socket 0[core 1]: [. B][. .]
>>> Sun Sep 23 12:09:06 CEST 2012
>>> 
>>> 
>>> tyr fd1026 104 mpicc -showme
>>> cc -I/usr/local/openmpi-1.6.2_64_cc/include -mt -m64
>>> -L/usr/local/openmpi-1.6.2_64_cc/lib64 -lmpi -lm -lkstat -llgrp
>>> -lsocket -lnsl -lrt -lm
>>> 
>>> 
>>> On my Linux machine I get a warning.
>>> 
>>> tyr fd1026 113 mpiexec -np 2 -host linpc0 -report-bindings \
>>> -map-by core -bind-to-core date
>>> --------------------------------------------------------------------------
>>> WARNING: a request was made to bind a process. While the system
>>> supports binding the process itself, at least one node does NOT
>>> support binding memory to the process location.
>>> 
>>> Node:  linpc0
>>> 
>>> This is a warning only; your job will continue, though performance may
>>> be degraded.
>>> --------------------------------------------------------------------------
>>> Sun Sep 23 11:56:04 CEST 2012
>>> Sun Sep 23 11:56:04 CEST 2012
>>> 
>>> 
>>> 
>>> Everything works fine with openmpi-1.6.2.
>>> 
>>> tyr fd1026 106 mpiexec -np 2 -host linpc0 -report-bindings \
>>> -bycore -bind-to-core date
>>> [linpc0:15808] MCW rank 0 bound to socket 0[core 0]: [B .][. .]
>>> [linpc0:15808] MCW rank 1 bound to socket 0[core 1]: [. B][. .]
>>> Sun Sep 23 12:11:47 CEST 2012
>>> Sun Sep 23 12:11:47 CEST 2012
>>> 
>>> 
>>> 
>>> 
>>> On my Solaris Sparc machine I get the following errors.
>>> 
>>> 
>>> tyr fd1026 121 mpiexec -np 2 -report-bindings -map-by core -bind-to-core date
>>> [tyr.informatik.hs-fulda.de:23773] [[32457,0],0] ORTE_ERROR_LOG: Value out of bounds in file ../../../../openmpi-1.7a1r27358/orte/mca/odls/base/odls_base_default_fns.c at line 847
>>> [tyr.informatik.hs-fulda.de:23773] [[32457,0],0] ORTE_ERROR_LOG: Value out of bounds in file ../../../../openmpi-1.7a1r27358/orte/mca/odls/base/odls_base_default_fns.c at line 1414
>>> [tyr.informatik.hs-fulda.de:23773] [[32457,0],0] ORTE_ERROR_LOG: Value out of bounds in file ../../../../openmpi-1.7a1r27358/orte/mca/odls/base/odls_base_default_fns.c at line 847
>>> [tyr.informatik.hs-fulda.de:23773] [[32457,0],0] ORTE_ERROR_LOG: Value out of bounds in file ../../../../openmpi-1.7a1r27358/orte/mca/odls/base/odls_base_default_fns.c at line 1414
>>> 
>>> 
>>> 
>>> tyr fd1026 122 mpiexec -np 2 -host tyr -report-bindings -map-by core -bind-to core date
>>> --------------------------------------------------------------------------
>>> All nodes which are allocated for this job are already filled.
>>> --------------------------------------------------------------------------
>>> 
>>> 
>>> Once more everything works fine with openmpi-1.6.2.
>>> 
>>> tyr fd1026 109 mpiexec -np 2 -report-bindings -bycore -bind-to-core date
>>> [tyr.informatik.hs-fulda.de:23869] MCW rank 0 bound to socket 0[core 0]: [B][.]
>>> [tyr.informatik.hs-fulda.de:23869] MCW rank 1 bound to socket 1[core 0]: [.][B]
>>> Sun Sep 23 12:14:09 CEST 2012
>>> Sun Sep 23 12:14:09 CEST 2012
>>> 
>>> tyr fd1026 110 mpiexec -np 2 -host tyr -report-bindings -bycore -bind-to-core date
>>> [tyr.informatik.hs-fulda.de:23877] MCW rank 0 bound to socket 0[core 0]: [B][.]
>>> [tyr.informatik.hs-fulda.de:23877] MCW rank 1 bound to socket 1[core 0]: [.][B]
>>> Sun Sep 23 12:16:05 CEST 2012
>>> Sun Sep 23 12:16:05 CEST 2012
>>> 
>>> 
>>> Kind regards
>>> 
>>> Siegmar
>>> 
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
> 

