come from the BTL headers where the fields do not have the same alignment inside. The original question was asked by Nysal Jan in an email with the subject "SEGV in EM64T <--> PPC64 communication" on Oct. 11, 2006. Unfortunately, we still have the same problem.
I'm forwarding that email. Further investigation showed that the same issue exists with a few other ob1 headers as well. A 64-bit build doesn't have this problem. I'm not sure if this might be the same issue that you are facing. You could test if the attached patch works for you (although this is not the right solution). Maybe using -malign-double for the build might also work, but I haven't tried that out.

******************************************************************
Hi Jeff,

I'm using the r12014M revision of the trunk. I'm getting a SEGV (backtrace included) when running the osu b/w benchmark on a heterogeneous set of 2 nodes (an EM64T & a PPC64). A 32-bit build, compiled with gcc, was used.

The problem was tracked down to a difference in the size of the mca_btl_tcp_hdr_t structure on these two architectures.

    struct mca_btl_tcp_hdr_t {
        mca_btl_base_header_t base; /* a uint8_t */
        uint8_t type;
        uint16_t count;
        uint64_t size;
    };

This structure has a size of 12 bytes on EM64T (no padding here) & 16 bytes on PPC64 (some padding is added before 'size').

http://docs.sun.com/app/docs/doc/816-5138/6mba6ua5t?a=view mentions that 'long long' has a 4-byte alignment on i386, which might explain why the structure is only 12 bytes on EM64T.

The failure happens in mca_btl_tcp_endpoint_recv_handler() when trying to invoke reg->cbfunc() and reg->cbfunc is NULL.

Assuming the receiver side is EM64T:
frag->iov[0].iov_len = sizeof(frag->hdr) (so assigned 12 bytes on EM64T), thus the readv() in mca_btl_tcp_frag_recv() reads 12 bytes into the first vector instead of 16, and from there on everything goes wrong.
******************************************************************
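To make the size mismatch described above easy to check on a given build, here is a minimal sketch in C. The names hdr_implicit, hdr_explicit, implicit_padding_bytes() and layouts_match() are hypothetical and not part of Open MPI; 'base' is modelled as a plain uint8_t, as the comment in the quoted struct suggests. It only illustrates the idea behind the explicit padding field and is not the actual fix.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof, size_t */
#include <stdint.h>

/* Hypothetical stand-in for mca_btl_tcp_hdr_t as quoted in the email.
 * The compiler decides how much padding goes before 'size'. */
struct hdr_implicit {
    uint8_t  base;   /* assumption: mca_btl_base_header_t is a uint8_t */
    uint8_t  type;
    uint16_t count;
    uint64_t size;   /* offset 4 if uint64_t is 4-byte aligned, else 8 */
};

/* Same fields with the padding spelled out, so 'size' sits at offset 8
 * whether uint64_t is aligned to 4 or to 8 bytes. */
struct hdr_explicit {
    uint8_t  base;
    uint8_t  type;
    uint16_t count;
    uint32_t padding;
    uint64_t size;
};

/* With explicit padding the layout no longer depends on that alignment. */
static_assert(offsetof(struct hdr_explicit, size) == 8,
              "size must start at byte 8");
static_assert(sizeof(struct hdr_explicit) == 16,
              "header must be 16 bytes");

/* Bytes the compiler silently inserted before 'size' in the implicit
 * layout: 0 where uint64_t is 4-byte aligned, 4 where it is 8-byte aligned. */
size_t implicit_padding_bytes(void)
{
    return offsetof(struct hdr_implicit, size)
         - (offsetof(struct hdr_implicit, count) + sizeof(uint16_t));
}

/* Nonzero if this build lays out both structs identically. */
int layouts_match(void)
{
    return sizeof(struct hdr_implicit) == sizeof(struct hdr_explicit)
        && offsetof(struct hdr_implicit, size)
           == offsetof(struct hdr_explicit, size);
}
```

Calling implicit_padding_bytes() on both sides of a heterogeneous pair should show which side adds padding; if the two results differ, sizeof() of the implicit struct differs too, and a receiver that sizes its first iovec with sizeof() will read the wrong number of header bytes.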
Index: ompi/mca/btl/tcp/btl_tcp_hdr.h
===================================================================
--- ompi/mca/btl/tcp/btl_tcp_hdr.h (revision 12316)
+++ ompi/mca/btl/tcp/btl_tcp_hdr.h (working copy)
@@ -42,6 +42,9 @@
     mca_btl_base_header_t base;
     uint8_t type;
     uint16_t count;
+#if OMPI_ENABLE_HETEROGENEOUS_SUPPORT
+    uint32_t padding;
+#endif
     uint64_t size;
 };
 typedef struct mca_btl_tcp_hdr_t mca_btl_tcp_hdr_t;
Index: ompi/mca/pml/ob1/pml_ob1_hdr.h
===================================================================
--- ompi/mca/pml/ob1/pml_ob1_hdr.h (revision 12316)
+++ ompi/mca/pml/ob1/pml_ob1_hdr.h (working copy)
@@ -130,6 +130,9 @@
  */
 struct mca_pml_ob1_frag_hdr_t {
     mca_pml_ob1_common_hdr_t hdr_common; /**< common attributes */
+#if OMPI_ENABLE_HETEROGENEOUS_SUPPORT
+    uint32_t padding;
+#endif
     uint64_t hdr_frag_offset; /**< offset into message */
     ompi_ptr_t hdr_src_req; /**< pointer to source request */
     ompi_ptr_t hdr_dst_req; /**< pointer to matched receive */
@@ -155,6 +158,9 @@

 struct mca_pml_ob1_ack_hdr_t {
     mca_pml_ob1_common_hdr_t hdr_common; /**< common attributes */
+#if OMPI_ENABLE_HETEROGENEOUS_SUPPORT
+    uint32_t padding;
+#endif
     ompi_ptr_t hdr_src_req; /**< source request */
     ompi_ptr_t hdr_dst_req; /**< matched receive request */
     uint64_t hdr_rdma_offset; /**< starting point rdma protocol */