Hi,

I'm currently tracing a segfault in mpi_init which is triggered
at ompi/runtime/ompi_mpi_init.c:569:

    ret = MCA_PML_CALL(add_procs(procs, nprocs));
    free(procs);

In most cases, no segfault occurs and everything works fine,
but with some special combinations of machines, I can trigger
the bug.

If I choose a working configuration and increase the number
of IPv6 addresses on one of the machines, the segfault occurs.

It cannot be triggered by adding IPv4 addresses, and disabling
IPv6 completely solves the problem.

The debugger shows that free() internally calls mem2chunk.
In the working configuration the chunk size is 16 (bytes?);
in the failing one it is $BIGNUM, which causes the segfault
(free ends up touching memory that was never allocated).
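
To illustrate why a corrupted chunk size only shows up at free(),
here is a minimal sketch. It is not ptmalloc2's code: struct chunk_hdr,
chunk_size_of and size_seen_by_free are hypothetical names, the header
layout is simplified, and everything happens inside a local array, so
nothing is really corrupted. It only mimics the idea behind mem2chunk,
i.e. that the size is read from a header stored just before the user
pointer.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for an allocator chunk header (assumption: the
 * real ptmalloc2 header has more fields and keeps flag bits in size). */
struct chunk_hdr {
    size_t prev_size;
    size_t size;
};

/* Read the size field from the header located right before 'mem',
 * which is the same idea as stepping back with mem2chunk. */
size_t chunk_size_of(const unsigned char *mem)
{
    struct chunk_hdr h;
    memcpy(&h, mem - sizeof h, sizeof h);
    return h.size;
}

/* Lay out two adjacent 16-byte chunks A and B in a local arena, copy
 * copy_len bytes of 0xFF into A's user area, and return the size that
 * a later "free(B)" would read from B's header. */
size_t size_seen_by_free(size_t copy_len)
{
    unsigned char arena[128] = {0};
    unsigned char payload[64];
    const size_t hdr = sizeof(struct chunk_hdr);
    const size_t user = 16;
    unsigned char *mem_a = arena + hdr;
    unsigned char *mem_b = mem_a + user + hdr;
    struct chunk_hdr h = {0, user};

    memcpy(mem_a - hdr, &h, hdr);      /* header of chunk A */
    h.prev_size = user;
    memcpy(mem_b - hdr, &h, hdr);      /* header of chunk B */

    memset(payload, 0xFF, sizeof payload);
    if (copy_len > sizeof payload)
        copy_len = sizeof payload;     /* keep the copy inside the arena */
    memcpy(mem_a, payload, copy_len);  /* copy_len > 16 spills into B's header */

    return chunk_size_of(mem_b);
}
```

With copy_len = 16 chunk B still reports a size of 16. With
copy_len = 32 its size field consists of 0xFF bytes only, which is
similar in spirit to the $BIGNUM I see in the debugger, and the damage
is only noticed once B is freed.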

I think these long IPv6 addresses overwrite a buffer (or at
least some memory which is allocated inside OMPI's memory
pool, thus delaying the segfault).
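
As a sketch of how that could happen (an assumption on my side, I have
not found the offending code yet): a raw IPv6 address needs 16 bytes
where an IPv4 address needs 4, so any copy into a buffer sized for IPv4
would run past its end. addr_len and copy_addr_checked below are
hypothetical helpers, not OMPI functions; they only show the length
check such a copy would need.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Number of bytes a raw address of the given family occupies,
 * or 0 for families this sketch does not know about. */
size_t addr_len(int family)
{
    switch (family) {
    case AF_INET:
        return sizeof(struct in_addr);   /* 4 bytes */
    case AF_INET6:
        return sizeof(struct in6_addr);  /* 16 bytes */
    default:
        return 0;
    }
}

/* Copy a raw address into dst only if it fits.
 * Returns 0 on success, -1 if the family is unknown or dst is too small. */
int copy_addr_checked(void *dst, size_t dst_len, const void *src, int family)
{
    size_t n = addr_len(family);

    if (n == 0 || n > dst_len)
        return -1;
    memcpy(dst, src, n);
    return 0;
}
```

A destination sized as sizeof(struct in_addr) accepts an AF_INET
address but is rejected for AF_INET6; an unchecked memcpy in the same
situation would write 12 bytes beyond the buffer.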

There are two issues found by valgrind, but I wanted to
check the "normal" valgrind output first. With the nightly
snapshot 1.2b1r12555, I got the following "errors":

==8948== Conditional jump or move depends on uninitialised value(s)
==8948==    at 0x1B92884D: ompi_attr_create_predefined_callback (attribute_predefined.c:374)
==8948==    by 0x1BC869B8: orte_gpr_proxy_deliver_notify_msg (gpr_proxy_deliver_notify_msg.c:144)
==8948==    by 0x1B9FEDF7: mca_oob_xcast (oob_base_xcast.c:147)
==8948==    by 0x1B947E49: ompi_mpi_init (ompi_mpi_init.c:542)
==8948==    by 0x1B97D657: MPI_Init (pinit.c:71)
==8948==    by 0x8048846: main (in /home/racl/adi/ompi/trunk/test/vm/ring)

and

==8948== Syscall param writev(vector[...]) points to uninitialised byte(s)
==8948==    at 0x1BBCD5E8: (within /lib/tls/libc-2.3.2.so)
==8948==    by 0x1BD873C1: mca_btl_tcp_frag_send (btl_tcp_frag.c:104)
==8948==    by 0x1BD87133: mca_btl_tcp_endpoint_send_handler (btl_tcp_endpoint.c:689)
==8948==    by 0x1BA48AD3: opal_event_process_active (event.c:463)
==8948==    by 0x1BA48E11: opal_event_base_loop (event.c:600)
==8948==    by 0x1BA48BE3: opal_event_loop (event.c:514)
==8948==    by 0x1BA4211D: opal_progress (opal_progress.c:259)
==8948==    by 0x1BD59D24: opal_condition_wait (condition.h:81)
==8948==    by 0x1BD5AD00: mca_pml_ob1_send (pml_ob1_isend.c:128)
==8948==    by 0x1B985CD9: MPI_Send (psend.c:63)
==8948==    by 0x80488B6: main (in /home/racl/adi/ompi/trunk/test/vm/ring)
==8948==  Address 0x80FEECE is not stack'd, malloc'd or (recently) free'd


Should I worry about these two?

The segfault itself is probably related to this output:

==3324== Syscall param writev(vector[...]) points to uninitialised byte(s)
==3324==    at 0x1BBB45E8: (within /lib/tls/libc-2.3.2.so)
==3324==    by 0x1BC57191: mca_oob_tcp_msg_send_handler (oob_tcp_msg.c:234)
==3324==    by 0x1BC58658: mca_oob_tcp_peer_send (oob_tcp_peer.c:194)
==3324==    by 0x1BC5E873: mca_oob_tcp_send (oob_tcp_send.c:152)
==3324==    by 0x1B9FEC92: mca_oob_send_packed (oob_base_send.c:78)
==3324==    by 0x1BC6CE92: orte_gpr_proxy_exec_compound_cmd (gpr_proxy_compound_cmd.c:117)
==3324==    by 0x1B94503A: ompi_mpi_init (ompi_mpi_init.c:523)
==3324==    by 0x1B97AE7F: MPI_Init (pinit.c:71)
==3324==    by 0x8048846: main (in /home/racl/adi/ompi/trunk/test/vm/ring)
==3324==  Address 0x822BF11 is not stack'd, malloc'd or (recently) free'd

But I still have to look closer.

Is there a way to disable OMPI's ptmalloc2 and use the
system's malloc/free instead? That would hopefully make the
segfault occur right where the corruption happens (probably a
memcpy with a wrong size).

Or are there other ways to debug such an issue?

TIA

-- 
mail: a...@thur.de      http://adi.thur.de      PGP: v2-key via keyserver

Paradox ist, wenn einer vom Rotwein blau wird.
