Dear Ralph,

the same problems occur without rankfiles.
tyr fd1026 102 which mpicc
/usr/local/openmpi-1.7.4_64_cc/bin/mpicc
tyr fd1026 103 mpiexec --report-bindings -np 2 \
  -host tyr,sunpc1 hostname
tyr fd1026 104 /opt/solstudio12.3/bin/sparcv9/dbx \
  /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9'
in your .dbxrc
Reading mpiexec
Reading ld.so.1
Reading libopen-rte.so.7.0.0
Reading libopen-pal.so.6.1.0
Reading libsendfile.so.1
Reading libpicl.so.1
Reading libkstat.so.1
Reading liblgrp.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libm.so.2
Reading libthread.so.1
Reading libc.so.1
Reading libdoor.so.1
Reading libaio.so.1
Reading libmd.so.1
(dbx) check -all
access checking - ON
memuse checking - ON
(dbx) run --report-bindings -np 2 -host tyr,sunpc1 hostname
Running: mpiexec --report-bindings -np 2 -host tyr,sunpc1 hostname
(process id 26792)
Reading rtcapihook.so
Reading libdl.so.1
Reading rtcaudit.so
Reading libmapmalloc.so.1
Reading libgen.so.1
Reading libc_psr.so.1
Reading rtcboot.so
Reading librtc.so
Reading libmd_psr.so.1
RTC: Enabling Error Checking...
RTC: Using UltraSparc trap mechanism
RTC: See `help rtc showmap' and `help rtc limitations' for details.
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 1 byte at address 0xffffffff7fffc85b
    which is 459 bytes above the current stack pointer
Variable is 'cwd'
t@1 (l@1) stopped in opal_getcwd at line 65 in file "opal_getcwd.c"
   65       if (0 != strcmp(pwd, cwd)) {
(dbx) quit
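For what it is worth, the flagged line seems to boil down to the following
pattern (a rough sketch as I read the dbx output, not the verbatim Open MPI
source). My assumption is that an optimized strcmp() scans the stack buffer
word by word and thereby touches bytes behind the terminating NUL that
getcwd() never wrote, which RTC then counts as a read from uninitialized
memory:

  char cwd[OPAL_PATH_MAX];      /* stack buffer, not zeroed                */
  char *pwd = getenv("PWD");    /* may name the same directory             */

  getcwd(cwd, sizeof(cwd));     /* initializes only up to the trailing NUL */

  if (0 != strcmp(pwd, cwd)) {  /* line 65: RTC flags a 1-byte read here   */
      /* prefer the value from getcwd() */
  }

If that is what happens, the report would be harmless, which would match
your remark below that dbx is getting lost.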
tyr fd1026 105 ssh sunpc1
...
sunpc1 fd1026 102 mpiexec --report-bindings -np 2 \
  -host tyr,sunpc1 hostname
sunpc1 fd1026 103 /opt/solstudio12.3/bin/amd64/dbx \
  /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9'
in your .dbxrc
Reading mpiexec
Reading ld.so.1
Reading libopen-rte.so.7.0.0
Reading libopen-pal.so.6.1.0
Reading libsendfile.so.1
Reading libpicl.so.1
Reading libkstat.so.1
Reading liblgrp.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libm.so.2
Reading libthread.so.1
Reading libc.so.1
Reading libdoor.so.1
Reading libaio.so.1
Reading libmd.so.1
(dbx) check -all
access checking - ON
memuse checking - ON
(dbx) run --report-bindings -np 2 -host tyr,sunpc1 hostname
Running: mpiexec --report-bindings -np 2 -host tyr,sunpc1 hostname
(process id 18806)
Reading rtcapihook.so
Reading libdl.so.1
Reading rtcaudit.so
Reading libmapmalloc.so.1
Reading libgen.so.1
Reading rtcboot.so
Reading librtc.so
RTC: Enabling Error Checking...
RTC: Running program...
Reading disasm.so
Read from uninitialized (rui) on thread 1:
Attempting to read 1 byte at address 0x436d57
    which is 15 bytes into a heap block of size 16 bytes at 0x436d48
This block was allocated from:
    [1] vasprintf() at 0xfffffd7fdc9b335a
    [2] asprintf() at 0xfffffd7fdc9b3452
    [3] opal_output_init() at line 184 in "output.c"
    [4] do_open() at line 548 in "output.c"
    [5] opal_output_open() at line 219 in "output.c"
    [6] opal_malloc_init() at line 68 in "malloc.c"
    [7] opal_init_util() at line 250 in "opal_init.c"
    [8] orterun() at line 658 in "orterun.c"
t@1 (l@1) stopped in do_open at line 638 in file "output.c"
  638           info[i].ldi_prefix = strdup(lds->lds_prefix);
(dbx) run --report-bindings -np 2 -host sunpc0,sunpc1 hostname
Running: mpiexec --report-bindings -np 2 -host sunpc0,sunpc1 hostname
(process id 18857)
RTC: Enabling Error Checking...
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 1 byte at address 0x436d57
    which is 15 bytes into a heap block of size 16 bytes at 0x436d48
This block was allocated from:
    [1] vasprintf() at 0xfffffd7fdc9b335a
    [2] asprintf() at 0xfffffd7fdc9b3452
    [3] opal_output_init() at line 184 in "output.c"
    [4] do_open() at line 548 in "output.c"
    [5] opal_output_open() at line 219 in "output.c"
    [6] opal_malloc_init() at line 68 in "malloc.c"
    [7] opal_init_util() at line 250 in "opal_init.c"
    [8] orterun() at line 658 in "orterun.c"
t@1 (l@1) stopped in do_open at line 638 in file "output.c"
  638           info[i].ldi_prefix = strdup(lds->lds_prefix);
(dbx) run --report-bindings -np 2 -host linpc1,sunpc1 hostname
Running: mpiexec --report-bindings -np 2 -host linpc1,sunpc1 hostname
(process id 18868)
RTC: Enabling Error Checking...
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 1 byte at address 0x436d57
    which is 15 bytes into a heap block of size 16 bytes at 0x436d48
This block was allocated from:
    [1] vasprintf() at 0xfffffd7fdc9b335a
    [2] asprintf() at 0xfffffd7fdc9b3452
    [3] opal_output_init() at line 184 in "output.c"
    [4] do_open() at line 548 in "output.c"
    [5] opal_output_open() at line 219 in "output.c"
    [6] opal_malloc_init() at line 68 in "malloc.c"
    [7] opal_init_util() at line 250 in "opal_init.c"
    [8] orterun() at line 658 in "orterun.c"
t@1 (l@1) stopped in do_open at line 638 in file "output.c"
  638           info[i].ldi_prefix = strdup(lds->lds_prefix);
(dbx) quit
sunpc1 fd1026 104 exit
logout
tyr fd1026 106
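The heap report looks like the same effect on an asprintf() allocation
instead of a stack buffer. If it helps, I think it reduces to this
standalone pattern (a minimal sketch; the variable names are mine, and the
word-by-word read inside strdup()/strlen() is again my assumption):

  #define _GNU_SOURCE   /* for asprintf() with glibc; Solaris libc has it */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  int main(void)
  {
      char *prefix = NULL;
      char *copy;

      /* asprintf() allocates strlen()+1 bytes; the allocator rounds the
       * request up to a 16-byte heap block whose trailing bytes are
       * never written. */
      if (asprintf(&prefix, "[%s]", "prefix") < 0) {
          return 1;
      }

      /* strdup() scans for the NUL; an optimized strlen() may read the
       * block in aligned words and touch the unwritten bytes behind the
       * terminator, which RTC reports as a rui although the copied
       * string is correct. */
      copy = strdup(prefix);

      puts(copy);
      free(copy);
      free(prefix);
      return 0;
  }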
Do you need anything else?


Kind regards

Siegmar


> Hard to know how to address all that, Siegmar, but I'll give it
> a shot. See below.
>
> On Jan 22, 2014, at 5:34 AM, Siegmar Gross
> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>
> > Hi,
> >
> > yesterday I installed openmpi-1.7.4rc2r30323 on our machines
> > ("Solaris 10 x86_64", "Solaris 10 Sparc", and "openSUSE Linux
> > 12.1 x86_64" with Sun C 5.12). My rankfile "rf_linpc_sunpc_tyr"
> > contains the following lines.
> >
> > rank 0=linpc0 slot=0:0-1;1:0-1
> > rank 1=linpc1 slot=0:0-1
> > rank 2=sunpc1 slot=1:0
> > rank 3=tyr slot=1:0
> >
> > I get no output when I run the following command.
> >
> > mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
> >
> > "dbx" reports the following problem.
> >
> > /opt/solstudio12.3/bin/sparcv9/dbx \
> >   /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
> > For information about new features see `help changes'
> > To remove this message, put `dbxenv suppress_startup_message
> > 7.9' in your .dbxrc
> > Reading mpiexec
> > Reading ld.so.1
> > ...
> > Reading libmd.so.1
> > (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
> > Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
> > (process id 22337)
> > Reading libc_psr.so.1
> > ...
> > Reading mca_dfs_test.so
> >
> > execution completed, exit code is 1
> > (dbx) check -all
> > access checking - ON
> > memuse checking - ON
> > (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
> > Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
> > (process id 22344)
> > Reading rtcapihook.so
> > ...
> > RTC: Running program...
> > Read from uninitialized (rui) on thread 1:
> > Attempting to read 1 byte at address 0xffffffff7fffbf8b
> >     which is 459 bytes above the current stack pointer
> > Variable is 'cwd'
> > t@1 (l@1) stopped in opal_getcwd at line 65 in file "opal_getcwd.c"
> >    65       if (0 != strcmp(pwd, cwd)) {
> > (dbx) quit
>
> This looks like a bogus issue to me. Are you able to run something
> *without* a rankfile? In other words, is it rankfile operation that
> is causing a problem, or are you unable to run anything on Sparc?
>
> > Rankfiles work "fine" on x86_64 architectures. Contents of my rankfile:
> >
> > rank 0=linpc1 slot=0:0-1;1:0-1
> > rank 1=sunpc1 slot=0:0-1
> > rank 2=sunpc1 slot=1:0
> > rank 3=sunpc1 slot=1:1
> >
> > mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname
> > [sunpc1:13489] MCW rank 1 bound to socket 0[core 0[hwt 0]],
> >   socket 0[core 1[hwt 0]]: [B/B][./.]
> > [sunpc1:13489] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
> > [sunpc1:13489] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
> > sunpc1
> > sunpc1
> > sunpc1
> > [linpc1:29997] MCW rank 0 is not bound (or bound to all available
> >   processors)
> > linpc1
> >
> > Unfortunately, "dbx" nevertheless reports a problem.
> >
> > /opt/solstudio12.3/bin/amd64/dbx \
> >   /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
> > For information about new features see `help changes'
> > To remove this message, put `dbxenv suppress_startup_message 7.9'
> > in your .dbxrc
> > Reading mpiexec
> > Reading ld.so.1
> > ...
> > Reading libmd.so.1
> > (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc hostname
> > Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname
> > (process id 18330)
> > Reading mca_shmem_mmap.so
> > ...
> > Reading mca_dfs_test.so
> > [sunpc1:18330] MCW rank 1 bound to socket 0[core 0[hwt 0]],
> >   socket 0[core 1[hwt 0]]: [B/B][./.]
> > [sunpc1:18330] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
> > [sunpc1:18330] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
> > sunpc1
> > sunpc1
> > sunpc1
> > [linpc1:30148] MCW rank 0 is not bound (or bound to all available
> >   processors)
> > linpc1
> >
> > execution completed, exit code is 0
> > (dbx) check -all
> > access checking - ON
> > memuse checking - ON
> > (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc hostname
> > Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname
> > (process id 18340)
> > Reading rtcapihook.so
> > ...
> > RTC: Running program...
> > Reading disasm.so
> > Read from uninitialized (rui) on thread 1:
> > Attempting to read 1 byte at address 0x436d57
> >     which is 15 bytes into a heap block of size 16 bytes at 0x436d48
> > This block was allocated from:
> >     [1] vasprintf() at 0xfffffd7fdc9b335a
> >     [2] asprintf() at 0xfffffd7fdc9b3452
> >     [3] opal_output_init() at line 184 in "output.c"
> >     [4] do_open() at line 548 in "output.c"
> >     [5] opal_output_open() at line 219 in "output.c"
> >     [6] opal_malloc_init() at line 68 in "malloc.c"
> >     [7] opal_init_util() at line 250 in "opal_init.c"
> >     [8] orterun() at line 658 in "orterun.c"
> >
> > t@1 (l@1) stopped in do_open at line 638 in file "output.c"
> >   638           info[i].ldi_prefix = strdup(lds->lds_prefix);
> > (dbx)
>
> Again, I think dbx is just getting lost
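Just to make sure that we mean the same thing: this is how I read the slot
syntax in the rankfiles quoted above (my understanding of the format; the
arrows are annotations, not rankfile syntax):

  rank <N>=<host> slot=<socket>:<cores>[;<socket>:<cores>]

  rank 0=linpc1 slot=0:0-1;1:0-1   <- socket 0, cores 0-1 and socket 1, cores 0-1
  rank 1=sunpc1 slot=0:0-1         <- socket 0, cores 0 and 1
  rank 2=sunpc1 slot=1:0           <- socket 1, core 0
  rank 3=sunpc1 slot=1:1           <- socket 1, core 1

That matches the --report-bindings output above.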
> > I can also manually bind to hwthreads on our Sun M4000 server (two
> > quad-core Sparc VII processors with two hwthreads per core).
> >
> > mpiexec --report-bindings -np 4 --bind-to hwthread hostname
> > [rs0.informatik.hs-fulda.de:09531] MCW rank 1 bound to
> >   socket 0[core 1[hwt 0]]: [../B./../..][../../../..]
> > [rs0.informatik.hs-fulda.de:09531] MCW rank 2 bound to
> >   socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
> > [rs0.informatik.hs-fulda.de:09531] MCW rank 3 bound to
> >   socket 1[core 5[hwt 0]]: [../../../..][../B./../..]
> > [rs0.informatik.hs-fulda.de:09531] MCW rank 0 bound to
> >   socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
> > rs0.informatik.hs-fulda.de
> > rs0.informatik.hs-fulda.de
> > rs0.informatik.hs-fulda.de
> > rs0.informatik.hs-fulda.de
> >
> > It doesn't work with cores. I know that it wasn't possible last
> > summer, and it seems that it is still not possible now.
> >
> > mpiexec --report-bindings -np 4 --bind-to core hostname
> > -----------------------------------------------------------------------
> > Open MPI tried to bind a new process, but something went wrong. The
> > process was killed without launching the target application. Your job
> > will now abort.
> >
> >   Local host:        rs0
> >   Application name:  /usr/local/bin/hostname
> >   Error message:     hwloc indicates cpu binding cannot be enforced
> >   Location:          ../../../../../openmpi-1.9a1r30290/orte/mca/odls/default/odls_default_module.c:500
> > -----------------------------------------------------------------------
> > 4 total processes failed to start
> >
> > Is it possible to specify hwthreads in a rankfile, so that I can
> > use a rankfile for these machines?
>
> Possible - yes. Will it happen in the immediate future - no, I'm afraid
> I'm swamped right now. However, I'll make a note of it for the future.
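Thank you. In the meantime, regarding the "hwloc indicates cpu binding
cannot be enforced" message above: a small probe like the following should
show what binding support hwloc itself reports on rs0 (a sketch against
the hwloc 1.x API; that hwloc_topology_get_support() is the information
Open MPI consults here is my assumption):

  #include <stdio.h>
  #include <hwloc.h>

  int main(void)
  {
      hwloc_topology_t topo;
      const struct hwloc_topology_support *support;

      hwloc_topology_init(&topo);
      hwloc_topology_load(topo);

      /* Each field is non-zero if hwloc believes the corresponding
       * binding operation works on this system. */
      support = hwloc_topology_get_support(topo);
      printf("set_thisproc_cpubind:   %d\n",
             (int) support->cpubind->set_thisproc_cpubind);
      printf("set_thisthread_cpubind: %d\n",
             (int) support->cpubind->set_thisthread_cpubind);

      hwloc_topology_destroy(topo);
      return 0;
  }

Something like "cc probe.c -lhwloc" should build it.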
> > I get the expected output if I use two M4000 servers, although the
> > above-mentioned error still exists.
> >
> > /opt/solstudio12.3/bin/sparcv9/dbx \
> >   /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
> > For information about new features see `help changes'
> > To remove this message, put `dbxenv suppress_startup_message 7.9'
> > in your .dbxrc
> > Reading mpiexec
> > Reading ld.so.1
> > ...
> > Reading libmd.so.1
> > (dbx) run --report-bindings --host rs0,rs1 -np 4 \
> >   --bind-to hwthread hostname
> > Running: mpiexec --report-bindings --host rs0,rs1 -np 4
> >   --bind-to hwthread hostname
> > (process id 9599)
> > Reading libc_psr.so.1
> > ...
> > Reading mca_dfs_test.so
> > [rs0.informatik.hs-fulda.de:09599] MCW rank 1 bound to
> >   socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
> > [rs0.informatik.hs-fulda.de:09599] MCW rank 0 bound to
> >   socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
> > rs0.informatik.hs-fulda.de
> > rs0.informatik.hs-fulda.de
> > rs1.informatik.hs-fulda.de
> > [rs1.informatik.hs-fulda.de:13398] MCW rank 2 bound to
> >   socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
> > [rs1.informatik.hs-fulda.de:13398] MCW rank 3 bound to
> >   socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
> > rs1.informatik.hs-fulda.de
> >
> > execution completed, exit code is 0
> > (dbx) check -all
> > access checking - ON
> > memuse checking - ON
> > (dbx) run --report-bindings --host rs0,rs1 -np 4 \
> >   --bind-to hwthread hostname
> > Running: mpiexec --report-bindings --host rs0,rs1 -np 4
> >   --bind-to hwthread hostname
> > (process id 9607)
> > Reading rtcapihook.so
> > ...
> > RTC: Running program...
> > Read from uninitialized (rui) on thread 1:
> > Attempting to read 1 byte at address 0xffffffff7fffc80b
> >     which is 459 bytes above the current stack pointer
> > Variable is 'cwd'
> > dbx: warning: can't find file
> >   ".../openmpi-1.7.4rc2r30323-SunOS.sparc.64_cc/opal/util/../../../
> >   openmpi-1.7.4rc2r30323/opal/util/opal_getcwd.c"
> > dbx: warning: see `help finding-files'
> > t@1 (l@1) stopped in opal_getcwd at line 65 in file "opal_getcwd.c"
> > (dbx)
> >
> > Our M4000 server has no access to the source code, so dbx couldn't
> > find the file. Nevertheless, it is the same error message as above.
> > Could someone solve this problem? Thank you very much for any help
> > in advance.
> >
> > Kind regards
> >
> > Siegmar
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>