Hello George:

The change on the shm side may look unnecessary at first, but it
handles a bus error on the sending side, not on the receiving side.

The change in mca_btl_sm_hdr_t is necessary because of the way the
pml and btl headers are stored in shared memory, and because in some
cases the pml header has a uint64_t in it. If mca_btl_sm_hdr_t is 12
bytes, the pml header that follows it does not start on a double-word
aligned boundary. When the pml header is a
mca_pml_ob1_rendezvous_hdr_t, we get a bus error while setting
hdr_msg_length. Here is one example, although it can happen in other
places as well. (Line numbers are close to what is in the trunk, give
or take a few lines.)

program terminated by signal BUS (invalid address alignment)
Current function is mca_pml_ob1_send_request_start_rndv (optimized)
  743 hdr->hdr_rndv.hdr_msg_length = sendreq->req_send.req_bytes_packed;
 (dbx) print &(hdr->hdr_rndv.hdr_msg_length)
&hdr->hdr_rndv.hdr_msg_length = 0xf4d1e81c
 (dbx) where
=>[1] mca_pml_ob1_send_request_start_rndv() (optimized),
          at 0xfd5f76b8 (line ~743) in "pml_ob1_sendreq.c"
  [2] mca_pml_ob1_send_request_start() (optimized),
          at 0xfd5d013c (line ~388) in "pml_ob1_sendreq.h"
  [3] mca_pml_ob1_send() (optimized),
          at 0xfd5d1544 (line ~117) in "pml_ob1_isend.c"
  [4] PMPI_Send(), at 0xfedd7204 (line ~65) in "psend.c"
  [5] main(0xffbfed40, 0xfffffff8, 0x2, 0x0, 0x7d1, 0x7d0), at 0x125bc
(dbx)


George Bosilca wrote:
Rolf,

If we memcpy instead of assigning the header in the OB1 PML, why do we need the padding in the frag header?

  Thanks,
    george.

On Jan 3, 2008, at 2:47 PM, Rolf vandeVaart wrote:


Greetings. We have seen some bus errors when compiling a user
application with certain compiler flags and running on a SPARC-based
server. The issue is that some structures are not word or double-word
aligned, which causes a bus error. I have tracked down two places
where I can make a minor change and everything seems to work fine.
However, I want to see if anyone has issues with these changes. The
two changes are shown below.

burl-ct-v440-0 206 =>svn diff
Index: ompi/mca/btl/sm/btl_sm_frag.h
===================================================================
--- ompi/mca/btl/sm/btl_sm_frag.h    (revision 17039)
+++ ompi/mca/btl/sm/btl_sm_frag.h    (working copy)
@@ -9,6 +9,7 @@
*                         University of Stuttgart.  All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
*                         All rights reserved.
+ * Copyright (c) 2008      Sun Microsystems, Inc.  All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -41,6 +42,10 @@
   struct mca_btl_sm_frag_t *frag;
   size_t len;
   mca_btl_base_tag_t tag;
+   /* Add a 4 byte pad to round out structure to 16 bytes for 32-bit
+    * and to 24 bytes for 64-bit.  Helps prevent bus errors for strict
+    * alignment cases like SPARC. */
+    char pad[4];
};
typedef struct mca_btl_sm_hdr_t mca_btl_sm_hdr_t;


Index: ompi/mca/pml/ob1/pml_ob1_recvfrag.h
===================================================================
--- ompi/mca/pml/ob1/pml_ob1_recvfrag.h    (revision 17039)
+++ ompi/mca/pml/ob1/pml_ob1_recvfrag.h    (working copy)
@@ -9,6 +9,7 @@
*                         University of Stuttgart.  All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
*                         All rights reserved.
+ * Copyright (c) 2008      Sun Microsystems, Inc.  All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -67,7 +68,8 @@
   unsigned char* _ptr = (unsigned char*)frag->addr;                   \
   /* init recv_frag */                                                \
   frag->btl = btl;                                                    \
-   frag->hdr = *(mca_pml_ob1_hdr_t*)hdr;                               \
+   memcpy(&frag->hdr, (void *)((mca_pml_ob1_hdr_t*)hdr),               \
+          sizeof(mca_pml_ob1_hdr_t));                                  \
   frag->num_segments = 1;                                             \
   _size = segs[0].seg_len;                                            \
   for( i = 1; i < cnt; i++ ) {                                        \
burl-ct-v440-0 207 =>


The ticket associated with this issue is
https://svn.open-mpi.org/trac/ompi/ticket/1148

Rolf
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--

=========================
rolf.vandeva...@sun.com
781-442-3043
=========================
