I'll fix the case in attr_create_predefined_callback - we should initialize the rank variable first to be safe.
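A minimal sketch of the kind of fix described above, not the actual Open MPI code: `struct kv`, `find_rank`, and `RANK_UNSET` are hypothetical names chosen for illustration. The idea is that a variable assigned only when a lookup succeeds gets a defined sentinel value up front, so a later branch on it never reads indeterminate memory (which is what valgrind's "Conditional jump or move depends on uninitialised value(s)" reports).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a key/value pair handed to a callback. */
struct kv {
    const char *key;
    int value;
};

/* Hypothetical sentinel meaning "no rank was found". */
#define RANK_UNSET (-1)

/* Returns the value stored under "rank", or RANK_UNSET if no such key exists. */
int find_rank(const struct kv *items, size_t n)
{
    int rank = RANK_UNSET;  /* initialized first, so every path returns a defined value */

    assert(items != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (strcmp(items[i].key, "rank") == 0) {
            rank = items[i].value;
            break;
        }
    }
    return rank;
}
```

A caller can then test `find_rank(items, n) == RANK_UNSET` explicitly instead of branching on whatever happened to be on the stack.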
For your other question, run configure with "--without-memory-manager".

Ralph


On 11/12/06 10:52 AM, "Adrian Knoth" <a...@drcomp.erfurt.thur.de> wrote:

> Hi,
>
> I'm currently tracing a segfault in mpi_init which is caused
> by ompi/runtime/ompi_mpi_init.c:569
>
> ret = MCA_PML_CALL(add_procs(procs, nprocs));
> free(procs);
>
> In most cases, no segfault occurs and everything works fine,
> but with some special combinations of machines, I can trigger
> the bug.
>
> If I choose a working configuration and increase the number
> of IPv6 addresses on one of the machines, the segfault occurs.
>
> It cannot be triggered by adding IPv4 addresses, and disabling
> IPv6 completely solves the problem.
>
> The debugger shows that free internally calls mem2chunk.
> The working configuration has a chunksize of 16 (bytes?),
> the failing one has $BIGNUM, thus causing the segfault.
> (trying to free unallocated memory)
>
> I think these long IPv6 addresses overwrite a buffer (or at
> least some memory which is allocated inside OMPI's memory
> pool, thus delaying the segfault).
>
> There are two issues found by valgrind, but I wanted to
> check the "normal" valgrind output first.
> With the nightly snapshot 1.2b1r12555, I got the following "errors":
>
> ==8948== Conditional jump or move depends on uninitialised value(s)
> ==8948==    at 0x1B92884D: ompi_attr_create_predefined_callback (attribute_predefined.c:374)
> ==8948==    by 0x1BC869B8: orte_gpr_proxy_deliver_notify_msg (gpr_proxy_deliver_notify_msg.c:144)
> ==8948==    by 0x1B9FEDF7: mca_oob_xcast (oob_base_xcast.c:147)
> ==8948==    by 0x1B947E49: ompi_mpi_init (ompi_mpi_init.c:542)
> ==8948==    by 0x1B97D657: MPI_Init (pinit.c:71)
> ==8948==    by 0x8048846: main (in /home/racl/adi/ompi/trunk/test/vm/ring)
>
> and
>
> ==8948== Syscall param writev(vector[...]) points to uninitialised byte(s)
> ==8948==    at 0x1BBCD5E8: (within /lib/tls/libc-2.3.2.so)
> ==8948==    by 0x1BD873C1: mca_btl_tcp_frag_send (btl_tcp_frag.c:104)
> ==8948==    by 0x1BD87133: mca_btl_tcp_endpoint_send_handler (btl_tcp_endpoint.c:689)
> ==8948==    by 0x1BA48AD3: opal_event_process_active (event.c:463)
> ==8948==    by 0x1BA48E11: opal_event_base_loop (event.c:600)
> ==8948==    by 0x1BA48BE3: opal_event_loop (event.c:514)
> ==8948==    by 0x1BA4211D: opal_progress (opal_progress.c:259)
> ==8948==    by 0x1BD59D24: opal_condition_wait (condition.h:81)
> ==8948==    by 0x1BD5AD00: mca_pml_ob1_send (pml_ob1_isend.c:128)
> ==8948==    by 0x1B985CD9: MPI_Send (psend.c:63)
> ==8948==    by 0x80488B6: main (in /home/racl/adi/ompi/trunk/test/vm/ring)
> ==8948==  Address 0x80FEECE is not stack'd, malloc'd or (recently) free'd
>
>
> Should I worry about these two?
>
> The segfault itself is probably related to this output:
>
> ==3324== Syscall param writev(vector[...]) points to uninitialised byte(s)
> ==3324==    at 0x1BBB45E8: (within /lib/tls/libc-2.3.2.so)
> ==3324==    by 0x1BC57191: mca_oob_tcp_msg_send_handler (oob_tcp_msg.c:234)
> ==3324==    by 0x1BC58658: mca_oob_tcp_peer_send (oob_tcp_peer.c:194)
> ==3324==    by 0x1BC5E873: mca_oob_tcp_send (oob_tcp_send.c:152)
> ==3324==    by 0x1B9FEC92: mca_oob_send_packed (oob_base_send.c:78)
> ==3324==    by 0x1BC6CE92: orte_gpr_proxy_exec_compound_cmd (gpr_proxy_compound_cmd.c:117)
> ==3324==    by 0x1B94503A: ompi_mpi_init (ompi_mpi_init.c:523)
> ==3324==    by 0x1B97AE7F: MPI_Init (pinit.c:71)
> ==3324==    by 0x8048846: main (in /home/racl/adi/ompi/trunk/test/vm/ring)
> ==3324==  Address 0x822BF11 is not stack'd, malloc'd or (recently) free'd
>
> But I still have to look closer.
>
> Is there a way to disable OMPI's ptmalloc2 and use the
> system's free/malloc? (hopefully causing the segfault right where
> it is done, probably a memcpy with wrong size)
>
> Or are there other ways to debug such an issue?
>
> TIA
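
On the suspicion in the quoted message that a long IPv6 address overruns a buffer ("probably a memcpy with wrong size"): the sketch below is only an illustration of that failure class and one way to guard against it, not code from Open MPI. The helper `copy_addr_checked` and the two size macros are hypothetical names. The sizes themselves follow the usual POSIX values: a textual IPv4 address fits in 16 bytes (INET_ADDRSTRLEN), while a textual IPv6 address can need up to 46 (INET6_ADDRSTRLEN). A buffer sized for the former silently overflows with the latter unless the copy checks the length first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names; the values match the usual INET_ADDRSTRLEN / INET6_ADDRSTRLEN. */
#define ADDR4_TEXT_MAX 16
#define ADDR6_TEXT_MAX 46

/*
 * Copies the NUL-terminated string src into dst, which holds dst_len bytes.
 * Returns 0 on success. If src (including its terminator) does not fit,
 * nothing is copied past the buffer, dst is set to "" and -1 is returned.
 */
int copy_addr_checked(char *dst, size_t dst_len, const char *src)
{
    assert(dst != NULL && src != NULL);

    size_t need = strlen(src) + 1;  /* bytes required, terminator included */
    if (need > dst_len) {
        if (dst_len > 0) {
            dst[0] = '\0';
        }
        return -1;  /* caller sees the problem here, not later inside free() */
    }
    memcpy(dst, src, need);
    return 0;
}
```

With an unchecked memcpy, the same mistake would overwrite whatever follows the buffer on the heap, which is consistent with a crash that only surfaces later when free() reads damaged chunk metadata.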