Re: [OMPI devel] patch for building gm btl
On Jan 3, 2008, at 5:13 PM, Patrick Geoffray wrote:
> One thing I quickly learned with MTT is that there are only 24 hours in a day :-)

Sweet. :-) Be sure to ask any questions you have about MTT on the MTT users list (mtt-us...@open-mpi.org).

-- 
Jeff Squyres
Cisco Systems
Re: [OMPI devel] patch for building gm btl
Paul,

Paul H. Hargrove wrote:
> We have only 2 nodes (GM-2.0.19) to run on and Jeff and I have yet to discuss what tests we will run, but it will probably be a very minimal set. Once we both have MTT set up and running GM tests, we should compare configs to avoid overlap (and thus increase coverage).

That would be great. I have only one 32-node 2G cluster I can use full-time for MTT testing for GM, MX, OpenMPI, MPICH{1,2}, HP-MPI, and many more. One thing I quickly learned with MTT is that there are only 24 hours in a day :-)

Patrick
Re: [OMPI devel] patch for building gm btl
Patrick,

Thanks for the info. Jeff and I are working (well, Jeff is working, anyway) to get MTT set up on my cluster to do MTT builds against both the GM-1.6.4 and GM-2.0.19 libs I have installed. While there is no current development at Myricom for GM, there are still folks with older hardware in the field who are using GM (and in my case will continue to do so until the hardware dies). We have only 2 nodes (GM-2.0.19) to run on and Jeff and I have yet to discuss what tests we will run, but it will probably be a very minimal set. Once we both have MTT set up and running GM tests, we should compare configs to avoid overlap (and thus increase coverage).

-Paul

Patrick Geoffray wrote:
> Hi Paul,
>
> Paul H. Hargrove wrote:
>> The fact that this has gone unfixed for 2 months suggests to me that nobody is building the GM BTL. So, how would I go about checking ...
>> a) ...if there exists any periodic build of the GM BTL via MTT?
>
> We are deploying MTT on all our clusters. Right now, we use our own MTT server, but we will report a subset of the tests to the OpenMPI server once everything is working.
>
>> c) ...which GM library versions such builds, if any, compile against
>
> There are no GM tests currently under our still-evolving MTT setup. Once we have a working setup, we will run a single Pallas test on 32 nodes with GM-2.1.28, two 2G NICs per node (single and dual port). There is no active development on GM, just kernel updates, so the GM version does not matter much.
>
> Patrick
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Paul H. Hargrove                 phhargr...@lbl.gov
Future Technologies Group
HPC Research Department          Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory   Fax: +1-510-486-6900
Re: [OMPI devel] patch for building gm btl
Hi Paul,

Paul H. Hargrove wrote:
> The fact that this has gone unfixed for 2 months suggests to me that nobody is building the GM BTL. So, how would I go about checking ...
> a) ...if there exists any periodic build of the GM BTL via MTT?

We are deploying MTT on all our clusters. Right now, we use our own MTT server, but we will report a subset of the tests to the OpenMPI server once everything is working.

> c) ...which GM library versions such builds, if any, compile against

There are no GM tests currently under our still-evolving MTT setup. Once we have a working setup, we will run a single Pallas test on 32 nodes with GM-2.1.28, two 2G NICs per node (single and dual port). There is no active development on GM, just kernel updates, so the GM version does not matter much.

Patrick
[OMPI devel] Fixing SPARC bus errors
Greetings. We have seen some bus errors when compiling a user application with certain compiler flags and running on a SPARC-based server. The issue is that some structures are not word- or double-word-aligned, causing a bus error. I have tracked down two places where I can make a minor change and everything seems to work fine. However, I want to see if anyone has issues with these changes. The two changes are shown below.

burl-ct-v440-0 206 => svn diff
Index: ompi/mca/btl/sm/btl_sm_frag.h
===================================================================
--- ompi/mca/btl/sm/btl_sm_frag.h       (revision 17039)
+++ ompi/mca/btl/sm/btl_sm_frag.h       (working copy)
@@ -9,6 +9,7 @@
  *                         University of Stuttgart.  All rights reserved.
  * Copyright (c) 2004-2005 The Regents of the University of California.
  *                         All rights reserved.
+ * Copyright (c) 2008      Sun Microsystems, Inc.  All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -41,6 +42,10 @@
     struct mca_btl_sm_frag_t *frag;
     size_t len;
     mca_btl_base_tag_t tag;
+    /* Add a 4 byte pad to round out structure to 16 bytes for 32-bit
+     * and to 24 bytes for 64-bit.  Helps prevent bus errors for strict
+     * alignment cases like SPARC. */
+    char pad[4];
 };
 typedef struct mca_btl_sm_hdr_t mca_btl_sm_hdr_t;

Index: ompi/mca/pml/ob1/pml_ob1_recvfrag.h
===================================================================
--- ompi/mca/pml/ob1/pml_ob1_recvfrag.h (revision 17039)
+++ ompi/mca/pml/ob1/pml_ob1_recvfrag.h (working copy)
@@ -9,6 +9,7 @@
  *                         University of Stuttgart.  All rights reserved.
  * Copyright (c) 2004-2005 The Regents of the University of California.
  *                         All rights reserved.
+ * Copyright (c) 2008      Sun Microsystems, Inc.  All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -67,7 +68,8 @@
     unsigned char* _ptr = (unsigned char*)frag->addr;            \
     /* init recv_frag */                                         \
     frag->btl = btl;                                             \
-    frag->hdr = *(mca_pml_ob1_hdr_t*)hdr;                        \
+    memcpy(&frag->hdr, (void *)((mca_pml_ob1_hdr_t*)hdr),        \
+           sizeof(mca_pml_ob1_hdr_t));                           \
     frag->num_segments = 1;                                      \
     _size = segs[0].seg_len;                                     \
     for( i = 1; i < cnt; i++ ) {                                 \
burl-ct-v440-0 207 =>

The ticket associated with this issue is https://svn.open-mpi.org/trac/ompi/ticket/1148

Rolf
Re: [OMPI devel] Common initialization code for IB.
On Jan 3, 2008, at 9:03 AM, Gleb Natapov wrote:
> In Paris we've talked about putting HCA discovery and initialization code outside of the openib BTL so other components that want to use IB will be able to share common code, data, and the registration cache. Other components I am thinking about are ofud and multicast collectives. I started to look at this and I have a couple of problems with this approach. Currently the openib BTL has if_include/if_exclude parameters to control which HCAs should be used. Should we make those parameters global and initialize only HCAs that are not excluded by those filters, or should we initialize all HCAs and have each component keep its own include/exclude filters?

Good question. I think the optimal solution would be to have one set of globals (common_of_if_include or somesuch?) with optional per-component overrides. E.g., tell all of OMPI to if_include mthca0, but then tell just the multicast collectives to if_include ipath1 (for whatever reason). This would allow fine-grained selection of which communication types use which devices.

To minimize the repetition of code, this could be effected by having a function in the common/of area that does all the work for the include/exclude behavior. You can simply call it with any of the MCA param values, such as: common_of_if_in/exclude, btl_openib_if_in/exclude, coll_of_if_in/exclude, ... and it can return a list of ports to use.

> Another problem is how the multicast collective knows that all processes in a communicator are reachable via the same network; do we have a mechanism in ompi to check this?

Good question. Perhaps the common_of stuff could hang some data off the ompi_proc_t that can be read by any of-like component (btl openib, coll of multicast, etc.)...? This could contain a subnet ID, or perhaps a reachable flag, or somesuch.

-- 
Jeff Squyres
Cisco Systems
[OMPI devel] Common initialization code for IB.
Hi,

In Paris we've talked about putting HCA discovery and initialization code outside of the openib BTL so other components that want to use IB will be able to share common code, data, and the registration cache. Other components I am thinking about are ofud and multicast collectives. I started to look at this and I have a couple of problems with this approach.

Currently the openib BTL has if_include/if_exclude parameters to control which HCAs should be used. Should we make those parameters global and initialize only HCAs that are not excluded by those filters, or should we initialize all HCAs and have each component keep its own include/exclude filters?

Another problem is how the multicast collective knows that all processes in a communicator are reachable via the same network; do we have a mechanism in ompi to check this?

-- 
Gleb.