Eugene,

Did you have a chance to make progress on issue #2? I'm wondering how we want to proceed from here.
Pavel (Pasha) Shamis
---
Computer Science Research Group
Computer Science and Math Division
Oak Ridge National Laboratory

On Aug 21, 2012, at 2:19 PM, Eugene Loh wrote:

On 8/21/2012 9:31 AM, Ralph Castain wrote:

Looks to me like you just need to add a couple of includes and correct a typo - yes?

Right. This part is under control.

The library issue sounds like something isn't right in the Makefile.am - perhaps the syntax has a typo there as well?

I don't know. This is the part where I could use help. I took a quick peek at some Makefile.am files. I can't see what the essential difference is between, say, coll/ml/Makefile.am and, say, coll/sm/Makefile.am (which behaves all right). Nor do I see why there would be a difference in coll/ml between one system (happens to be SPARC, though I don't know that's significant) and another.

On Aug 21, 2012, at 11:36 AM, Eugene Loh wrote:

r27078 (ML collective component) broke some Solaris OMPI builds.

1) In ompi/mca/coll/ml/coll_ml_lmngr.c

    199 #ifdef HAVE_POSIX_MEMALIGN
    200     if((errno = posix_memalign(&lmngr->base_addr,
    201                     lmngr->list_alignment,
    202                     lmngr->list_size * lmngr->list_block_size)) != 0) {
    203         ML_ERROR(("Failed to allocate memory: %s [%d]", errno, strerror(errno)));
    204         return OMPI_ERROR;
    205     }
    206 #else
    207     lmngr->base_addr =
    208         malloc(lmngr->list_size * lmngr->list_block_size + lmngr->list_alignment);
    209     if(NULL == lmngr->base_addr) {
    210         ML_ERROR(("Failed to allocate memory: %s [%d]", errno, strerror(errno)));
    211         return OMPI_ERROR;
    212     }
    213
    214     lmngr->base_addr = (void*)OPAL_ALIGN((uintptr_t)lmngr->base_addr,
    215                     lmngr->list_align, uintptr_t);
    216 #endif

The "#else" code path has multiple problems -- specifically at the statement on lines 214-215:

- OPAL_ALIGN needs to be defined (e.g., #include "opal/align.h")
- uintptr_t needs to be defined (e.g., #include "opal_stdint.h")
- list_align should be list_alignment

I could fix, but need help with...
2) http://www.open-mpi.org/mtt/index.php?do_redir=2089

Somehow, coll_ml is getting pulled into libmpi.so. E.g., this doesn't look right:

    % nm ompi/.libs/libmpi.so | grep mca_coll_ml
    [13161] | 2556704| 172|FUNC |LOCL |0 |11 |mca_coll_ml_alloc_op_prog_single_frag_dag
    [13171] | 2555488| 344|FUNC |LOCL |0 |11 |mca_coll_ml_buffer_recycling
    [13173] | 2555392| 92|FUNC |LOCL |0 |11 |mca_coll_ml_err
    [23992] | 0| 0|FUNC |GLOB |0 |UNDEF |mca_coll_ml_memsync_intra

The UNDEF is causing a problem, but I'm guessing all that mca_coll_ml_ stuff shouldn't be in there at all in the first place.

This is on one Solaris system, while another doesn't see the problem and builds fine.

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
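For reference on issue 1, the malloc fallback described in the "#else" path can be sketched in isolation roughly as follows. This is only an illustration under stated assumptions, not the actual fix committed to the tree: ALIGN_UP is a local stand-in for OPAL_ALIGN, struct lmngr is a hypothetical, trimmed-down type holding only the fields quoted in the thread, and alloc_base is an extra hypothetical field that is not in the quoted code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Local stand-in for OPAL_ALIGN (assumption): round x up to the next
 * multiple of a, where a must be a power of two. */
#define ALIGN_UP(x, a, t) (((x) + ((t)(a) - 1)) & ~((t)(a) - 1))

/* Hypothetical, trimmed-down list manager. Field names other than
 * alloc_base follow the snippet quoted in the thread. */
struct lmngr {
    void  *alloc_base;      /* hypothetical: raw pointer returned by malloc */
    void  *base_addr;       /* aligned start of the block list */
    size_t list_alignment;  /* required alignment, power of two */
    size_t list_size;       /* number of blocks */
    size_t list_block_size; /* size of each block in bytes */
};

/* Fallback for systems without posix_memalign: over-allocate by
 * list_alignment bytes, then round the pointer up.
 * Returns 0 on success, -1 on allocation failure. */
static int lmngr_alloc(struct lmngr *l)
{
    /* The mask trick in ALIGN_UP only works for power-of-two alignments. */
    assert(l->list_alignment != 0 &&
           (l->list_alignment & (l->list_alignment - 1)) == 0);

    l->alloc_base = malloc(l->list_size * l->list_block_size + l->list_alignment);
    if (NULL == l->alloc_base) {
        return -1;
    }

    /* Note list_alignment here, not list_align. */
    l->base_addr = (void *)ALIGN_UP((uintptr_t)l->alloc_base,
                                    l->list_alignment, uintptr_t);
    return 0;
}
```

The sketch keeps the raw malloc pointer in alloc_base so that it can later be handed to free(); the quoted code overwrites base_addr with the aligned value, so whether the original pointer is tracked elsewhere in the component is not something this thread shows.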