> On Apr 5, 2018, at 4:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> 
> I attach with gdb on the processes and do a "call mca_pml_ob1_dump(comm, 1)". 
> This allows the debugger to call our function and output internal 
> information about the library status.

OK - after a number of missteps, I rebuilt Open MPI with debugging enabled, 
reran the executable (not recompiled, just picking up the new library), 
and got the comm pointer by attaching to the process and looking at the stack 
trace:

#0  0x00002b8a7599c42b in ibv_poll_cq (cq=0xec66010, num_entries=256, 
wc=0x7ffdea76d680) at /usr/include/infiniband/verbs.h:1272
#1  0x00002b8a759a8194 in poll_device (device=0xebc5300, count=0) at 
btl_openib_component.c:3608
#2  0x00002b8a759a871f in progress_one_device (device=0xebc5300) at 
btl_openib_component.c:3741
#3  0x00002b8a759a87be in btl_openib_component_progress () at 
btl_openib_component.c:3765
#4  0x00002b8a64b9da42 in opal_progress () at runtime/opal_progress.c:222
#5  0x00002b8a76c2c199 in ompi_request_wait_completion (req=0xec22600) at 
../../../../ompi/request/request.h:392
#6  0x00002b8a76c2d642 in mca_pml_ob1_recv (addr=0x2b8a8a99bf20, count=5423600, 
datatype=0x2b8a64832b80, src=1, tag=200, comm=0xed932d0, status=0x385dd90) at 
pml_ob1_irecv.c:135
#7  0x00002b8a6454c857 in PMPI_Recv (buf=0x2b8a8a99bf20, count=5423600, 
type=0x2b8a64832b80, source=1, tag=200, comm=0xed932d0, status=0x385dd90) at 
precv.c:79
#8  0x00002b8a6428ca7c in ompi_recv_f (buf=0x2b8a8a99bf20 
"DB»\373\v{\277\204\333\336\306[B\205\277\030ҀҶ\250v\277\225\377qW\001\251w?\240\020\202&=)S\277\202+\214\067\224\345R?\272\221Co\236\206\217?",
 count=0x7ffdea770eb4, datatype=0x2d43bec, source=0x7ffdea770a38,
    tag=0x2d43bf0, comm=0x5d30a68, status=0x385dd90, ierr=0x7ffdea770a3c) at 
precv_f.c:85
#9  0x000000000042887b in m_recv_z (comm=..., node=-858993460, zvec=Cannot 
access memory at address 0x2d
) at mpi.F:680
#10 0x000000000123e0f1 in fileio::outwav (io=..., wdes=..., w=Cannot access 
memory at address 0x2d
) at fileio.F:952
#11 0x0000000002abfd8f in vamp () at main.F:4204
#12 0x00000000004139de in main ()
#13 0x0000003f0c81ed1d in __libc_start_main () from /lib64/libc.so.6
#14 0x00000000004138e9 in _start ()

The comm value in ompi_recv_f differs from the one in the lower-level 
functions (mca_pml_ob1_recv and PMPI_Recv), so I tried both. 
With the value from the lower-level functions I get nothing useful:
(gdb) call mca_pml_ob1_dump(0xed932d0, 1)
$1 = 0
and with the value from ompi_recv_f I get a segfault:
(gdb) call mca_pml_ob1_dump(0x5d30a68, 1)

Program received signal SIGSEGV, Segmentation fault.
0x00002b8a76c26d0d in mca_pml_ob1_dump (comm=0x5d30a68, verbose=1) at 
pml_ob1.c:577
577         opal_output(0, "Communicator %s [%p](%d) rank %d recv_seq %d 
num_procs %lu last_probed %lu\n",
The program being debugged was signaled while in a function called from GDB.
GDB remains in the frame where the signal was received.
To change this behavior use "set unwindonsignal on".
Evaluation of the expression containing the function
(mca_pml_ob1_dump) will be abandoned.
When the function is done executing, GDB will silently stop.

Should this have worked, or am I doing something wrong?

thanks,
Noam

