I can add x86-64/Linux/SS12.3 to the NOT-showing-the-problem list. -Paul
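
For anyone joining late, the pattern diagnosed in the quoted thread below can be boiled down to a few lines of C. This is only a sketch under stated assumptions: `component_memsync` and `buffer_recycle` are hypothetical stand-ins, not the actual Open MPI symbols, and everything is collapsed into one file so that it compiles on its own. In the real case the "header" part and the "defining" part live in different places.

```c
/* Hypothetical reduction of the pattern; names are illustrative only. */

/* ---- Part 1: what would live in a widely included header ---- */

/* Extern function: defined by exactly ONE component, so it is not
 * available to every library whose sources include this header. */
int component_memsync(int bank_index);

/* Header-level inline helper that calls the extern function.
 * If a compiler emits this body into every object that includes the
 * header (even objects that never call it), each of those objects ends
 * up with an undefined reference to component_memsync. */
static inline int buffer_recycle(int bank_index)
{
    return component_memsync(bank_index);
}

/* ---- Part 2: what lives only in the component that owns the symbol ---- */

/* Placeholder body so the sketch links; the real function does real work. */
int component_memsync(int bank_index)
{
    return bank_index + 1;
}
```

With a compiler that discards unused static inline functions, an object that never calls `buffer_recycle` carries no reference to `component_memsync`. The suspicion in the thread is that the Sun/Oracle compiler emits the body anyway, so every object that includes the header needs the symbol at link time.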
On Fri, Aug 24, 2012 at 6:47 PM, Eugene Loh <eugene....@oracle.com> wrote:

> Indeed. Sorry to jump late back into the melee. I did reproduce the
> problem on a second SPARC system, to answer Ralph's earlier question; I
> don't know how interesting that is given that it's very similar to the
> original system. And, to corroborate Paul's AMD observation, we have an
> x86/Solaris/Studio system that is *not* seeing the problem. Thanks to Paul
> for identifying the likely cause of the problem.
>
> On 8/24/2012 6:32 PM, Ralph Castain wrote:
>
> Thanks Paul!! That is very helpful - hopefully the ORNL folks can now fix
> the problem.
>
> On Aug 24, 2012, at 6:29 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> I *can* reproduce the problem on SPARC/Solaris-10 with the SS12.3
> compiler and an ALMOST vanilla configure:
> $ [path_to]configure \
>     --prefix=[blah] CC=cc CXX=CC F77=f77 FC=f90 \
>     CFLAGS="-m64" --with-wrapper-cflags="-m64" CXXFLAGS="-m64" --with-wrapper-cxxflags="-m64" \
>     FFLAGS="-m64" --with-wrapper-fflags="-m64" FCFLAGS="-m64" --with-wrapper-fcflags="-m64" \
>     CXXFLAGS="-m64 -library=stlport4"
>
> I did NOT manage to reproduce on AMD64/Solaris-11, which completed a
> build w/ VT disabled.
> Unfortunately I have neither SPARC/Solaris-11 nor
> AMD64/Solaris-10 readily available to disambiguate the key factor.
> Hopefully it is enough to know that the problem is reproducible w/o
> Oracle's massive configure commandline.
>
>
> The build isn't complete, but I can already see that the symbol has
> "leaked" into libmpi:
>
> $ grep -arl mca_coll_ml_memsync_intra BLD/
> BLD/ompi/mca/bcol/.libs/libmca_bcol.a
> BLD/ompi/mca/bcol/base/.libs/bcol_base_open.o
> BLD/ompi/.libs/libmpi.so.0.0.0
> BLD/ompi/.libs/libmpi.so
> BLD/ompi/.libs/libmpi.so.0
>
> It is referenced by mca_coll_ml_generic_collectives_launcher:
>
> $ nm BLD/ompi/.libs/libmpi.so.0.0.0 | grep -B1 mca_coll_ml_memsync_intra
> 00000000006a6088 t mca_coll_ml_generic_collectives_launcher
>                  U mca_coll_ml_memsync_intra
>
> This is coming from libmca_bcol.a:
> $ nm BLD/ompi/mca/bcol/.libs/libmca_bcol.a | grep -B1 mca_coll_ml_memsync_intra
> 0000000000005248 t mca_coll_ml_generic_collectives_launcher
>                  U mca_coll_ml_memsync_intra
>
>
> This appears to be via the following chain of calls within coll_ml.h:
>
> mca_coll_ml_generic_collectives_launcher
>   mca_coll_ml_task_completion_processing
>     coll_ml_fragment_completion_processing
>       mca_coll_ml_buffer_recycling
>         mca_coll_ml_memsync_intra
>
> All of which are marked as "static inline __opal_attribute_always_inline__".
>
> -Paul
>
>
> On Fri, Aug 24, 2012 at 4:55 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>> OK, I have a vanilla configure+make running on both SPARC/Solaris-10 and
>> AMD64/Solaris-11.
>> I am using the 12.3 Oracle compilers in both cases to match the original
>> report.
>> I'll post the results when they complete.
>>
>> In the meantime, I took a quick look at the code and have a pretty
>> reasonable guess as to the cause.
>> Looking at ompi/mca/coll/ml/coll_ml.h I see:
>>
>> 827 int mca_coll_ml_memsync_intra(mca_coll_ml_module_t *module, int bank_index);
>> [...]
>> 996 static inline __opal_attribute_always_inline__
>> 997 int mca_coll_ml_buffer_recycling(mca_coll_ml_collective_operation_progress_t *ml_request)
>> 998 {
>> [...]
>> 1023 rc = mca_coll_ml_memsync_intra(ml_module, ml_memblock->memsync_counter);
>> [...]
>> 1041 }
>>
>> Based on past experience w/ the Sun/Oracle compilers on another project
>> (See http://bugzilla.hcs.ufl.edu/cgi-bin/bugzilla3/show_bug.cgi?id=193 ),
>> I suspect that this static-inline-always function is being emitted by the
>> compiler in every object which includes this header even if they don't call
>> it. The call on line 1023 then results in the undefined reference
>> to mca_coll_ml_memsync_intra. Basically it is not safe for an inline
>> function in a header to call an extern function that isn't available to
>> every object that includes the header REGARDLESS of whether the object
>> invokes the inline function or not.
>>
>> -Paul
>>
>>
>>
>> On Fri, Aug 24, 2012 at 4:40 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Oracle uses an abysmally complicated configure line, but nearly all of
>>> it is irrelevant to the problem here. For this, I would suggest just doing
>>> a vanilla ./configure - if the component gets pulled into libmpi, then we
>>> know there is a problem.
>>>
>>> Thanks!
>>>
>>> Just FYI: here is their actual configure line, just in case you spot
>>> something problematic:
>>>
>>> CC=cc CXX=CC F77=f77 FC=f90 --with-openib --enable-openib-connectx-xrc --without-udapl
>>> --disable-openib-ibcm --enable-btl-openib-failover --without-dtrace --enable-heterogeneous
>>> --enable-cxx-exceptions --enable-shared --enable-orterun-prefix-by-default --with-sge
>>> --enable-mpi-f90 --with-mpi-f90-size=small --disable-peruse --disable-state
>>> --disable-mpi-thread-multiple --disable-debug --disable-mem-debug --disable-mem-profile
>>> CFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2 -xvector=lib -Qoption cg -xregs=no%appl -xdepend=yes -xbuiltin=%all -xO5"
>>> CXXFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2 -xvector=lib -Qoption cg -xregs=no%appl -xdepend=yes -xbuiltin=%all -xO5 -Bstatic -lCrun -lCstd -Bdynamic"
>>> FFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2 -xvector=lib -Qoption cg -xregs=no%appl -stackvar -xO5"
>>> FCFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2 -xvector=lib -Qoption cg -xregs=no%appl -stackvar -xO5"
>>> --prefix=/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/installs/JA08/install
>>> --mandir=${prefix}/man --bindir=${prefix}/bin --libdir=${prefix}/lib --includedir=${prefix}/include
>>> --with-tm=/ws/ompi-tools/orte/torque/current/shared-install32
>>> --enable-contrib-no-build=vt --with-package-string="Oracle Message Passing Toolkit "
>>> --with-ident-string="@(#)RELEASE VERSION 1.9openmpi-1.5.4-r1.9a1r27092"
>>>
>>>
>>> and the error he gets is:
>>>
>>> make[2]: Entering directory `/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/mpi-install/s3rI/src/openmpi-1.9a1r27092/ompi/tools/ompi_info'
>>> CCLD ompi_info
>>> Undefined                       first referenced
>>>  symbol                             in file
>>> mca_coll_ml_memsync_intra
>>>                                     ../../../ompi/.libs/libmpi.so
>>> ld: fatal: symbol referencing errors. No output written to .libs/ompi_info
>>> make[2]: *** [ompi_info] Error 2
>>> make[2]: Leaving directory `/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/mpi-install/s3rI/src/openmpi-1.9a1r27092/ompi/tools/ompi_info'
>>> make[1]: *** [install-recursive] Error 1
>>> make[1]: Leaving directory `/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/mpi-install/s3rI/src/openmpi-1.9a1r27092/ompi'
>>> make: *** [install-recursive] Error 1
>>>
>>>
>>> On Aug 24, 2012, at 4:30 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>
>>> I have access to a few different Solaris machines and can offer to build
>>> the trunk if somebody tells me what configure flags are desired.
>>>
>>> -Paul
>>>
>>> On Fri, Aug 24, 2012 at 8:54 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> Eugene - can you confirm that this is only happening on the one Solaris
>>>> system? In other words, is this a general issue or something specific to
>>>> that one machine?
>>>>
>>>> I'm wondering because if it is just the one machine, then it might be
>>>> something strange about how it is setup - perhaps the version of Solaris,
>>>> or it is configuring --enable-static, or...
>>>>
>>>> Just trying to assess how general a problem this might be, and thus if
>>>> this should be a blocker or not.
>>>>
>>>> On Aug 24, 2012, at 8:00 AM, Eugene Loh <eugene....@oracle.com> wrote:
>>>>
>>>> > On 08/24/12 09:54, Shamis, Pavel wrote:
>>>> >> Maybe there is a chance to get direct access to this system ?
>>>> > No.
>>>> >
>>>> > But I'm attaching compressed log files from configure/make.
>>>> >
>>>> >
>>>> <tarball-of-log-files.tar.bz2>_______________________________________________
>>>> > devel mailing list
>>>> > de...@open-mpi.org
>>>> > http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900