Please try to keep the users list on the messages; that allows others to chime in.

You can see the topology by adding "-mca ess_base_verbose 5" to your command line. You'll get other output as well, and you'll need to pass --enable-debug to configure when you build.


On Sep 24, 2012, at 4:47 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi,
>
>> The 1.7 series has a completely different way of handling node
>> topology than was used in the 1.6 series. It provides some
>> enhanced features, but it does have some drawbacks in the case
>> where the topology info isn't correct. I fear you are running
>> into this problem (again).
>>
>> All the commands you show here work fine for me on a Linux
>> x86_64 box using 1.7r27361 on a Westmere 6-core single-socket
>> machine with hyperthreads enabled. I cannot replicate any of
>> the reported problems, so there isn't much I can do at this point.
>>
>> As I've said before, the root problem here appears to be some
>> hwloc-related issue with your setup. Until that gets resolved
>> so we get correct topology info, I'm not sure what can be done
>> to resolve what you are seeing. I'll raise the question of
>> possibly providing some alternative support for setups like
>> yours that just can't get topology info, but that would
>> definitely be a long-term question.
>
> Can we check if you get wrong topology info or which info you get
> at all? Can you tell me a file and location where I can print the
> values of relevant variables on my architecture? Perhaps that can
> help to determine what goes wrong. I would use the latest trunk
> tarball and can make the test a day later, because all changes on
> our "installation server" are mirrored in the night to our file
> server for all machines.
>
> Kind regards
>
> Siegmar
>
>
>> On Sep 23, 2012, at 3:20 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>
>>> Hi,
>>>
>>> yesterday I installed openmpi-1.7a1r27358 and it has an improved
>>> error message compared to openmpi-1.6.2, but doesn't show process bindings
>>> and has some other problems as well.
>>>
>>>
>>> "sunpc0" and "linpc0" are equipped with two dual-core processors running
>>> Solaris 10 x86_64 and Linux x86_64 resp. "tyr" is a dual-processor machine
>>> running Solaris 10 Sparc.
>>>
>>> tyr fd1026 105 mpiexec -np 2 -host sunpc0 -report-bindings \
>>> -map-by core -bind-to-core date
>>> Sun Sep 23 11:46:36 CEST 2012
>>> Sun Sep 23 11:46:36 CEST 2012
>>>
>>> tyr fd1026 106 mpicc -showme
>>> cc -I/usr/local/openmpi-1.7_64_cc/include -mt -m64
>>> -L/usr/local/openmpi-1.7_64_cc/lib64 -lmpi -lpicl -lm -lkstat -llgrp
>>> -lsocket -lnsl -lrt -lm
>>>
>>>
>>> openmpi-1.6.2 shows process bindings.
>>>
>>> tyr fd1026 103 mpiexec -np 2 -host sunpc0 -report-bindings \
>>> -bycore -bind-to-core date
>>> Sun Sep 23 12:09:06 CEST 2012
>>> [sunpc0:13197] MCW rank 0 bound to socket 0[core 0]: [B .][. .]
>>> [sunpc0:13197] MCW rank 1 bound to socket 0[core 1]: [. B][. .]
>>> Sun Sep 23 12:09:06 CEST 2012
>>>
>>>
>>> tyr fd1026 104 mpicc -showme
>>> cc -I/usr/local/openmpi-1.6.2_64_cc/include -mt -m64
>>> -L/usr/local/openmpi-1.6.2_64_cc/lib64 -lmpi -lm -lkstat -llgrp
>>> -lsocket -lnsl -lrt -lm
>>>
>>>
>>> On my Linux machine I get a warning.
>>>
>>> tyr fd1026 113 mpiexec -np 2 -host linpc0 -report-bindings \
>>> -map-by core -bind-to-core date
>>> --------------------------------------------------------------------------
>>> WARNING: a request was made to bind a process. While the system
>>> supports binding the process itself, at least one node does NOT
>>> support binding memory to the process location.
>>>
>>> Node: linpc0
>>>
>>> This is a warning only; your job will continue, though performance may
>>> be degraded.
>>> --------------------------------------------------------------------------
>>> Sun Sep 23 11:56:04 CEST 2012
>>> Sun Sep 23 11:56:04 CEST 2012
>>>
>>>
>>> Everything works fine with openmpi-1.6.2.
>>>
>>> tyr fd1026 106 mpiexec -np 2 -host linpc0 -report-bindings \
>>> -bycore -bind-to-core date
>>> [linpc0:15808] MCW rank 0 bound to socket 0[core 0]: [B .][. .]
>>> [linpc0:15808] MCW rank 1 bound to socket 0[core 1]: [. B][. .]
>>> Sun Sep 23 12:11:47 CEST 2012
>>> Sun Sep 23 12:11:47 CEST 2012
>>>
>>>
>>> On my Solaris Sparc machine I get the following errors.
>>>
>>>
>>> tyr fd1026 121 mpiexec -np 2 -report-bindings -map-by core -bind-to-core date
>>> [tyr.informatik.hs-fulda.de:23773] [[32457,0],0] ORTE_ERROR_LOG: Value out of bounds in file
>>> ../../../../openmpi-1.7a1r27358/orte/mca/odls/base/odls_base_default_fns.c at line 847
>>> [tyr.informatik.hs-fulda.de:23773] [[32457,0],0] ORTE_ERROR_LOG: Value out of bounds in file
>>> ../../../../openmpi-1.7a1r27358/orte/mca/odls/base/odls_base_default_fns.c at line 1414
>>> [tyr.informatik.hs-fulda.de:23773] [[32457,0],0] ORTE_ERROR_LOG: Value out of bounds in file
>>> ../../../../openmpi-1.7a1r27358/orte/mca/odls/base/odls_base_default_fns.c at line 847
>>> [tyr.informatik.hs-fulda.de:23773] [[32457,0],0] ORTE_ERROR_LOG: Value out of bounds in file
>>> ../../../../openmpi-1.7a1r27358/orte/mca/odls/base/odls_base_default_fns.c at line 1414
>>>
>>>
>>> tyr fd1026 122 mpiexec -np 2 -host tyr -report-bindings -map-by core -bind-to core date
>>> --------------------------------------------------------------------------
>>> All nodes which are allocated for this job are already filled.
>>> --------------------------------------------------------------------------
>>>
>>>
>>> Once more everything works fine with openmpi-1.6.2.
>>>
>>> tyr fd1026 109 mpiexec -np 2 -report-bindings -bycore -bind-to-core date
>>> [tyr.informatik.hs-fulda.de:23869] MCW rank 0 bound to socket 0[core 0]: [B][.]
>>> [tyr.informatik.hs-fulda.de:23869] MCW rank 1 bound to socket 1[core 0]: [.][B]
>>> Sun Sep 23 12:14:09 CEST 2012
>>> Sun Sep 23 12:14:09 CEST 2012
>>>
>>> tyr fd1026 110 mpiexec -np 2 -host tyr -report-bindings -bycore -bind-to-core date
>>> [tyr.informatik.hs-fulda.de:23877] MCW rank 0 bound to socket 0[core 0]: [B][.]
>>> [tyr.informatik.hs-fulda.de:23877] MCW rank 1 bound to socket 1[core 0]: [.][B]
>>> Sun Sep 23 12:16:05 CEST 2012
>>> Sun Sep 23 12:16:05 CEST 2012
>>>
>>>
>>> Kind regards
>>>
>>> Siegmar
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users