Dear Ralph,

The same problems occur without rankfiles.

tyr fd1026 102 which mpicc
/usr/local/openmpi-1.7.4_64_cc/bin/mpicc

tyr fd1026 103 mpiexec --report-bindings -np 2 \
  -host tyr,sunpc1 hostname

tyr fd1026 104 /opt/solstudio12.3/bin/sparcv9/dbx \
  /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec 
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message
  7.9' in your .dbxrc
Reading mpiexec
Reading ld.so.1
Reading libopen-rte.so.7.0.0
Reading libopen-pal.so.6.1.0
Reading libsendfile.so.1
Reading libpicl.so.1
Reading libkstat.so.1
Reading liblgrp.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libm.so.2
Reading libthread.so.1
Reading libc.so.1
Reading libdoor.so.1
Reading libaio.so.1
Reading libmd.so.1
(dbx) check -all
access checking - ON
memuse checking - ON
(dbx) run --report-bindings -np 2 -host tyr,sunpc1 hostname
Running: mpiexec --report-bindings -np 2 -host tyr,sunpc1 hostname 
(process id 26792)
Reading rtcapihook.so
Reading libdl.so.1
Reading rtcaudit.so
Reading libmapmalloc.so.1
Reading libgen.so.1
Reading libc_psr.so.1
Reading rtcboot.so
Reading librtc.so
Reading libmd_psr.so.1
RTC: Enabling Error Checking...
RTC: Using UltraSparc trap mechanism
RTC: See `help rtc showmap' and `help rtc limitations' for details.
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 1 byte at address 0xffffffff7fffc85b
    which is 459 bytes above the current stack pointer
Variable is 'cwd'
t@1 (l@1) stopped in opal_getcwd at line 65 in file "opal_getcwd.c"
   65           if (0 != strcmp(pwd, cwd)) {
(dbx) quit
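
Line 65 in the report above compares pwd with cwd. The following is a
minimal shell sketch of the same comparison. It assumes, from the variable
names only and not from the source, that pwd holds the PWD environment
variable and cwd holds the result of getcwd(). The function name
pwd_matches_cwd is hypothetical.

```shell
#!/bin/sh
# Assumption: "pwd" in opal_getcwd.c is $PWD and "cwd" is the getcwd() result.
# pwd_matches_cwd is a hypothetical helper, not part of Open MPI.

# Returns 0 when $PWD is identical to the physical working directory.
pwd_matches_cwd() {
    [ "${PWD:-}" = "$(pwd -P)" ]
}

if pwd_matches_cwd; then
    echo "PWD and the physical working directory are identical"
else
    echo "PWD differs from the physical working directory"
fi
```

If the two differ (for example because of a symbolic link in the path),
the comparison on line 65 would presumably take the "not equal" branch.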


tyr fd1026 105 ssh sunpc1
...
sunpc1 fd1026 102 mpiexec --report-bindings -np 2 \
  -host tyr,sunpc1 hostname

sunpc1 fd1026 103 /opt/solstudio12.3/bin/amd64/dbx \
  /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message
  7.9' in your .dbxrc
Reading mpiexec
Reading ld.so.1
Reading libopen-rte.so.7.0.0
Reading libopen-pal.so.6.1.0
Reading libsendfile.so.1
Reading libpicl.so.1
Reading libkstat.so.1
Reading liblgrp.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libm.so.2
Reading libthread.so.1
Reading libc.so.1
Reading libdoor.so.1
Reading libaio.so.1
Reading libmd.so.1
(dbx) check -all
access checking - ON
memuse checking - ON
(dbx) run --report-bindings -np 2 -host tyr,sunpc1 hostname
Running: mpiexec --report-bindings -np 2 -host tyr,sunpc1 hostname 
(process id 18806)
Reading rtcapihook.so
Reading libdl.so.1
Reading rtcaudit.so
Reading libmapmalloc.so.1
Reading libgen.so.1
Reading rtcboot.so
Reading librtc.so
RTC: Enabling Error Checking...
RTC: Running program...
Reading disasm.so
Read from uninitialized (rui) on thread 1:
Attempting to read 1 byte at address 0x436d57
    which is 15 bytes into a heap block of size 16 bytes at 0x436d48
This block was allocated from:
        [1] vasprintf() at 0xfffffd7fdc9b335a 
        [2] asprintf() at 0xfffffd7fdc9b3452 
        [3] opal_output_init() at line 184 in "output.c"
        [4] do_open() at line 548 in "output.c"
        [5] opal_output_open() at line 219 in "output.c"
        [6] opal_malloc_init() at line 68 in "malloc.c"
        [7] opal_init_util() at line 250 in "opal_init.c"
        [8] orterun() at line 658 in "orterun.c"

t@1 (l@1) stopped in do_open at line 638 in file "output.c"
  638           info[i].ldi_prefix = strdup(lds->lds_prefix);
(dbx) run --report-bindings -np 2 -host sunpc0,sunpc1 hostname
Running: mpiexec --report-bindings -np 2 -host sunpc0,sunpc1 hostname 
(process id 18857)
RTC: Enabling Error Checking...
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 1 byte at address 0x436d57
    which is 15 bytes into a heap block of size 16 bytes at 0x436d48
This block was allocated from:
        [1] vasprintf() at 0xfffffd7fdc9b335a 
        [2] asprintf() at 0xfffffd7fdc9b3452 
        [3] opal_output_init() at line 184 in "output.c"
        [4] do_open() at line 548 in "output.c"
        [5] opal_output_open() at line 219 in "output.c"
        [6] opal_malloc_init() at line 68 in "malloc.c"
        [7] opal_init_util() at line 250 in "opal_init.c"
        [8] orterun() at line 658 in "orterun.c"

t@1 (l@1) stopped in do_open at line 638 in file "output.c"
  638           info[i].ldi_prefix = strdup(lds->lds_prefix);
(dbx) run --report-bindings -np 2 -host linpc1,sunpc1 hostname
Running: mpiexec --report-bindings -np 2 -host linpc1,sunpc1 hostname 
(process id 18868)
RTC: Enabling Error Checking...
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 1 byte at address 0x436d57
    which is 15 bytes into a heap block of size 16 bytes at 0x436d48
This block was allocated from:
        [1] vasprintf() at 0xfffffd7fdc9b335a 
        [2] asprintf() at 0xfffffd7fdc9b3452 
        [3] opal_output_init() at line 184 in "output.c"
        [4] do_open() at line 548 in "output.c"
        [5] opal_output_open() at line 219 in "output.c"
        [6] opal_malloc_init() at line 68 in "malloc.c"
        [7] opal_init_util() at line 250 in "opal_init.c"
        [8] orterun() at line 658 in "orterun.c"

t@1 (l@1) stopped in do_open at line 638 in file "output.c"
  638           info[i].ldi_prefix = strdup(lds->lds_prefix);
(dbx) quit
sunpc1 fd1026 104 exit
logout
tyr fd1026 106 
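
The runs on sunpc1 above differ only in the host list. The following is a
small sketch of a loop over the three host pairs, assuming a POSIX shell
and mpiexec in PATH. build_cmd and the RUN variable are hypothetical
helpers, not part of Open MPI. By default the loop only prints the
commands.

```shell
#!/bin/sh
# Dry-run helper for the host pairs used in the dbx sessions above.
# build_cmd and RUN are hypothetical names chosen for this sketch.

# Prints the mpiexec command line for a comma-separated host list.
build_cmd() {
    printf 'mpiexec --report-bindings -np 2 -host %s hostname\n' "$1"
}

for hosts in tyr,sunpc1 sunpc0,sunpc1 linpc1,sunpc1; do
    cmd=$(build_cmd "$hosts")
    if [ "${RUN:-0}" = "1" ]; then
        # Word splitting of $cmd is intended: it expands to the command
        # and its arguments.
        $cmd
    else
        echo "$cmd"
    fi
done
```

Setting RUN=1 in the environment executes the commands instead of
printing them.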


Do you need anything else?


Kind regards

Siegmar



> Hard to know how to address all that, Siegmar, but I'll give it
> a shot. See below.
> 
> On Jan 22, 2014, at 5:34 AM, Siegmar Gross 
> <siegmar.gr...@informatik.hs-fulda.de> wrote:
> 
> > Hi,
> > 
> > Yesterday I installed openmpi-1.7.4rc2r30323 on our machines
> > ("Solaris 10 x86_64", "Solaris 10 Sparc", and "openSUSE Linux
> > 12.1 x86_64" with Sun C 5.12). My rankfile "rf_linpc_sunpc_tyr"
> > contains the following lines.
> > 
> > rank 0=linpc0 slot=0:0-1;1:0-1
> > rank 1=linpc1 slot=0:0-1
> > rank 2=sunpc1 slot=1:0
> > rank 3=tyr slot=1:0
> > 
> > I get no output when I run the following command.
> > 
> > mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
> > 
> > "dbx" reports the following problem.
> > 
> > /opt/solstudio12.3/bin/sparcv9/dbx \
> >  /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
> > For information about new features see `help changes'
> > To remove this message, put `dbxenv suppress_startup_message
> >  7.9' in your .dbxrc
> > Reading mpiexec
> > Reading ld.so.1
> > ...
> > Reading libmd.so.1
> > (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
> > Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname 
> > (process id 22337)
> > Reading libc_psr.so.1
> > ...
> > Reading mca_dfs_test.so
> > 
> > execution completed, exit code is 1
> > (dbx) check -all
> > access checking - ON
> > memuse checking - ON
> > (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
> > Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname 
> > (process id 22344)
> > Reading rtcapihook.so
> > ...
> > RTC: Running program...
> > Read from uninitialized (rui) on thread 1:
> > Attempting to read 1 byte at address 0xffffffff7fffbf8b
> >    which is 459 bytes above the current stack pointer
> > Variable is 'cwd'
> > t@1 (l@1) stopped in opal_getcwd at line 65 in file "opal_getcwd.c"
> >   65           if (0 != strcmp(pwd, cwd)) {
> > (dbx) quit
> > 
> 
> This looks like a bogus issue to me. Are you able to run something
> *without* a rankfile? In other words, is it rankfile operation that
> is causing a problem, or are you unable to run anything on Sparc?
> 
> > 
> > 
> > 
> > Rankfiles work "fine" on x86_64 architectures. My rankfile contains
> > the following lines.
> > 
> > rank 0=linpc1 slot=0:0-1;1:0-1
> > rank 1=sunpc1 slot=0:0-1
> > rank 2=sunpc1 slot=1:0
> > rank 3=sunpc1 slot=1:1
> > 
> > 
> > mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname
> > [sunpc1:13489] MCW rank 1 bound to socket 0[core 0[hwt 0]],
> >  socket 0[core 1[hwt 0]]: [B/B][./.]
> > [sunpc1:13489] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
> > [sunpc1:13489] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
> > sunpc1
> > sunpc1
> > sunpc1
> > [linpc1:29997] MCW rank 0 is not bound (or bound to all available
> >  processors)
> > linpc1
> > 
> > 
> > Unfortunately, "dbx" nevertheless reports a problem.
> > 
> > /opt/solstudio12.3/bin/amd64/dbx \
> >  /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
> > For information about new features see `help changes'
> > To remove this message, put `dbxenv suppress_startup_message 7.9'
> >  in your .dbxrc
> > Reading mpiexec
> > Reading ld.so.1
> > ...
> > Reading libmd.so.1
> > (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc hostname
> > Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname 
> > (process id 18330)
> > Reading mca_shmem_mmap.so
> > ...
> > Reading mca_dfs_test.so
> > [sunpc1:18330] MCW rank 1 bound to socket 0[core 0[hwt 0]],
> >  socket 0[core 1[hwt 0]]: [B/B][./.]
> > [sunpc1:18330] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
> > [sunpc1:18330] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
> > sunpc1
> > sunpc1
> > sunpc1
> > [linpc1:30148] MCW rank 0 is not bound (or bound to all available
> >  processors)
> > linpc1
> > 
> > execution completed, exit code is 0
> > (dbx) check -all
> > access checking - ON
> > memuse checking - ON
> > (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc hostname
> > Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname 
> > (process id 18340)
> > Reading rtcapihook.so
> > ...
> > 
> > RTC: Running program...
> > Reading disasm.so
> > Read from uninitialized (rui) on thread 1:
> > Attempting to read 1 byte at address 0x436d57
> >    which is 15 bytes into a heap block of size 16 bytes at 0x436d48
> > This block was allocated from:
> >        [1] vasprintf() at 0xfffffd7fdc9b335a 
> >        [2] asprintf() at 0xfffffd7fdc9b3452 
> >        [3] opal_output_init() at line 184 in "output.c"
> >        [4] do_open() at line 548 in "output.c"
> >        [5] opal_output_open() at line 219 in "output.c"
> >        [6] opal_malloc_init() at line 68 in "malloc.c"
> >        [7] opal_init_util() at line 250 in "opal_init.c"
> >        [8] orterun() at line 658 in "orterun.c"
> > 
> > t@1 (l@1) stopped in do_open at line 638 in file "output.c"
> >  638           info[i].ldi_prefix = strdup(lds->lds_prefix);
> > (dbx) 
> > 
> > 
> 
> Again, I think dbx is just getting lost.
> 
> > 
> > 
> > 
> > I can also manually bind threads on our Sun M4000 server (two quad-core
> > Sparc VII processors with two hwthreads each).
> > 
> > mpiexec --report-bindings -np 4 --bind-to hwthread hostname
> > [rs0.informatik.hs-fulda.de:09531] MCW rank 1 bound to 
> >  socket 0[core 1[hwt 0]]: [../B./../..][../../../..]
> > [rs0.informatik.hs-fulda.de:09531] MCW rank 2 bound to 
> >  socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
> > [rs0.informatik.hs-fulda.de:09531] MCW rank 3 bound to 
> >  socket 1[core 5[hwt 0]]: [../../../..][../B./../..]
> > [rs0.informatik.hs-fulda.de:09531] MCW rank 0 bound to 
> >  socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
> > rs0.informatik.hs-fulda.de
> > rs0.informatik.hs-fulda.de
> > rs0.informatik.hs-fulda.de
> > rs0.informatik.hs-fulda.de
> > 
> > 
> > Binding to cores doesn't work. I know that it wasn't possible last
> > summer, and it seems that it is still not possible now.
> > 
> > mpiexec --report-bindings -np 4 --bind-to core hostname
> > -----------------------------------------------------------------------
> > Open MPI tried to bind a new process, but something went wrong.  The
> > process was killed without launching the target application.  Your job
> > will now abort.
> > 
> >  Local host:        rs0
> >  Application name:  /usr/local/bin/hostname
> >  Error message:     hwloc indicates cpu binding cannot be enforced
> >  Location:          
> > ../../../../../openmpi-1.9a1r30290/orte/mca/odls/default/odls_default_module.c:500
> > -----------------------------------------------------------------------
> > 4 total processes failed to start
> > 
> > 
> > 
> > Is it possible to specify hwthreads in a rankfile, so that I can
> > use a rankfile for these machines?
> 
> Possible - yes. Will it happen in the immediate future - no, I'm afraid
> I'm swamped right now. However, I'll make a note of it for the future.
> 
> > 
> > I get the expected output if I use two M4000 servers, although the
> > above-mentioned error still occurs.
> > 
> > 
> > /opt/solstudio12.3/bin/sparcv9/dbx \
> >  /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
> > For information about new features see `help changes'
> > To remove this message, put `dbxenv suppress_startup_message 7.9'
> >  in your .dbxrc
> > Reading mpiexec
> > Reading ld.so.1
> > ...
> > Reading libmd.so.1
> > (dbx) run --report-bindings --host rs0,rs1 -np 4 \
> >  --bind-to hwthread hostname
> > Running: mpiexec --report-bindings --host rs0,rs1 -np 4
> >  --bind-to hwthread hostname 
> > (process id 9599)
> > Reading libc_psr.so.1
> > ...
> > Reading mca_dfs_test.so
> > [rs0.informatik.hs-fulda.de:09599] MCW rank 1 bound to
> >  socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
> > [rs0.informatik.hs-fulda.de:09599] MCW rank 0 bound to
> >  socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
> > rs0.informatik.hs-fulda.de
> > rs0.informatik.hs-fulda.de
> > rs1.informatik.hs-fulda.de
> > [rs1.informatik.hs-fulda.de:13398] MCW rank 2 bound to
> >  socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
> > [rs1.informatik.hs-fulda.de:13398] MCW rank 3 bound to
> >  socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
> > rs1.informatik.hs-fulda.de
> > 
> > execution completed, exit code is 0
> > (dbx) check -all
> > access checking - ON
> > memuse checking - ON
> > (dbx) run --report-bindings --host rs0,rs1 -np 4 \
> >  --bind-to hwthread hostname
> > Running: mpiexec --report-bindings --host rs0,rs1 -np 4
> >  --bind-to hwthread hostname 
> > (process id 9607)
> > Reading rtcapihook.so
> > ...
> > RTC: Running program...
> > Read from uninitialized (rui) on thread 1:
> > Attempting to read 1 byte at address 0xffffffff7fffc80b
> >    which is 459 bytes above the current stack pointer
> > Variable is 'cwd'
> > dbx: warning: can't find file
> >  ".../openmpi-1.7.4rc2r30323-SunOS.sparc.64_cc/opal/util/../../../
> >  openmpi-1.7.4rc2r30323/opal/util/opal_getcwd.c"
> > dbx: warning: see `help finding-files'
> > t@1 (l@1) stopped in opal_getcwd at line 65 in file "opal_getcwd.c"
> > (dbx) 
> > 
> > 
> > Our M4000 server has no access to the source code, so dbx couldn't
> > find the file. Nevertheless, it is the same error message as above.
> > Could someone solve this problem? Thank you very much in advance for
> > any help.
> > 
> > 
> > Kind regards
> > 
> > Siegmar
> > 
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
