come from the BTL headers where the fields do not have the same
alignment inside. The original question was asked by Nysal Jan in an
email with the subject "SEGV in EM64T <--> PPC64 communication" on
Oct. 11, 2006. Unfortunately, we still have the same problem.

I'm forwarding that email. Further investigation showed that the same issue
exists with a few other ob1 headers as well. A 64-bit build doesn't have
this problem. I'm not sure if this might be the same issue that you are
facing. You could test whether the attached patch works for you (although
this is not the right solution). Building with -malign-double might also
work, but I haven't tried that.

******************************************************************
Hi Jeff,
I'm using the r12014M revision of the trunk.
I'm getting a SEGV (backtrace included) when running the osu b/w benchmark
on a heterogeneous set of 2 nodes (an EM64T and a PPC64).
A 32-bit build, compiled with gcc, was used. The problem was tracked down to
a difference in the size of the mca_btl_tcp_hdr_t structure on these two
architectures.

struct mca_btl_tcp_hdr_t {
   mca_btl_base_header_t base;  /* a uint8_t */
   uint8_t  type;
   uint16_t count;
   uint64_t size;
};

This structure has a size of 12 bytes on EM64T (no padding here) and 16 bytes
on PPC64 (some padding is added before 'size').
http://docs.sun.com/app/docs/doc/816-5138/6mba6ua5t?a=view mentions that
'long long' has a 4-byte alignment on i386, which might explain why the
structure is only 12 bytes in the 32-bit build on EM64T.

The failure happens in mca_btl_tcp_endpoint_recv_handler() when trying to
invoke reg->cbfunc() and reg->cbfunc is NULL.
Assuming the receiver side is EM64T:
frag->iov[0].iov_len = sizeof(frag->hdr) (so assigned 12 bytes on EM64T)
thus the readv() in mca_btl_tcp_frag_recv() reads 12 bytes into the first
vector instead of 16 and from there on everything goes wrong.
******************************************************************
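
To make the size difference concrete, here is a minimal C sketch (not
Open MPI code). The struct names and print_hdr_layout() are hypothetical,
and 'base' is declared as a plain uint8_t on the assumption that the
comment in the quoted struct is accurate. It prints the layout of the
header with and without the explicit padding used in the patch below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for mca_btl_tcp_hdr_t as quoted above (assumption: the base
 * header is a single uint8_t). */
struct tcp_hdr_unpadded {
    uint8_t  base;
    uint8_t  type;
    uint16_t count;
    uint64_t size;     /* lands at offset 4 or 8, depending on the ABI */
};

/* Same fields plus the explicit 32-bit padding from the patch. */
struct tcp_hdr_padded {
    uint8_t  base;
    uint8_t  type;
    uint16_t count;
    uint32_t padding;  /* fills bytes 4..7 on every ABI */
    uint64_t size;     /* offset 8 whether uint64_t is 4- or 8-byte aligned */
};

/* Hypothetical helper: print size and the offset of 'size' for both. */
static void print_hdr_layout(void)
{
    printf("unpadded: sizeof=%zu offsetof(size)=%zu\n",
           sizeof(struct tcp_hdr_unpadded),
           offsetof(struct tcp_hdr_unpadded, size));
    printf("padded:   sizeof=%zu offsetof(size)=%zu\n",
           sizeof(struct tcp_hdr_padded),
           offsetof(struct tcp_hdr_padded, size));
}
```

Calling print_hdr_layout() from a 32-bit build on each node (for example
with gcc -m32 on the EM64T side) and comparing the output should show
whether the two sides disagree on the unpadded layout.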
Index: ompi/mca/btl/tcp/btl_tcp_hdr.h
===================================================================
--- ompi/mca/btl/tcp/btl_tcp_hdr.h      (revision 12316)
+++ ompi/mca/btl/tcp/btl_tcp_hdr.h      (working copy)
@@ -42,6 +42,9 @@
     mca_btl_base_header_t base;
     uint8_t  type;
     uint16_t count;
+#if OMPI_ENABLE_HETEROGENEOUS_SUPPORT
+    uint32_t padding;
+#endif
     uint64_t size;
 };
 typedef struct mca_btl_tcp_hdr_t mca_btl_tcp_hdr_t;
Index: ompi/mca/pml/ob1/pml_ob1_hdr.h
===================================================================
--- ompi/mca/pml/ob1/pml_ob1_hdr.h      (revision 12316)
+++ ompi/mca/pml/ob1/pml_ob1_hdr.h      (working copy)
@@ -130,6 +130,9 @@
  */
 struct mca_pml_ob1_frag_hdr_t {
     mca_pml_ob1_common_hdr_t hdr_common;     /**< common attributes */
+#if OMPI_ENABLE_HETEROGENEOUS_SUPPORT
+    uint32_t padding;
+#endif
     uint64_t hdr_frag_offset;                /**< offset into message */
     ompi_ptr_t hdr_src_req;                  /**< pointer to source request */
     ompi_ptr_t hdr_dst_req;                  /**< pointer to matched receive */
@@ -155,6 +158,9 @@

 struct mca_pml_ob1_ack_hdr_t {
     mca_pml_ob1_common_hdr_t hdr_common;      /**< common attributes */
+#if OMPI_ENABLE_HETEROGENEOUS_SUPPORT
+    uint32_t padding;
+#endif
     ompi_ptr_t hdr_src_req;                   /**< source request */
     ompi_ptr_t hdr_dst_req;                   /**< matched receive request */
     uint64_t hdr_rdma_offset;                 /**< starting point rdma protocol */
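
One way to catch this class of mismatch at build time, rather than as a
SEGV at run time, would be a compile-time check on the header layout. The
sketch below is not part of the patch. It assumes a C11 compiler (for
_Static_assert), uses a stand-in struct rather than the real
mca_btl_tcp_hdr_t, and TCP_WIRE_HDR_LEN is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: the header length both peers must agree on. */
#define TCP_WIRE_HDR_LEN 16

/* Stand-in for the patched TCP header (base assumed to be a uint8_t). */
struct tcp_wire_hdr {
    uint8_t  base;
    uint8_t  type;
    uint16_t count;
    uint32_t padding;
    uint64_t size;
};

/* The build fails on any platform where the compiler lays the header
 * out differently from the expected wire format. */
_Static_assert(sizeof(struct tcp_wire_hdr) == TCP_WIRE_HDR_LEN,
               "TCP header size differs from the wire format");
_Static_assert(offsetof(struct tcp_wire_hdr, size) == 8,
               "'size' is not at the expected offset");
```

A check like this only verifies the layout; byte order between the two
architectures still has to be handled separately.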
