I can add x86-64/Linux/SS12.3 to the NOT-showing-the-problem list.

-Paul

On Fri, Aug 24, 2012 at 6:47 PM, Eugene Loh <eugene....@oracle.com> wrote:

> Indeed.  Sorry to jump late back into the melee.  I did reproduce the
> problem on a second SPARC system, to answer Ralph's earlier question;  I
> don't know how interesting that is given that it's very similar to the
> original system.  And, to corroborate Paul's AMD observation, we have an
> x86/Solaris/Studio system that is *not* seeing the problem.  Thanks to Paul
> for identifying the likely cause of the problem.
>
> On 8/24/2012 6:32 PM, Ralph Castain wrote:
>
> Thanks Paul!! That is very helpful - hopefully the ORNL folks can now fix
> the problem.
>
>  On Aug 24, 2012, at 6:29 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>  I *can* reproduce the problem on SPARC/Solaris-10 with the SS12.3
> compiler and an ALMOST vanilla configure:
>  $ [path_to]configure \
>        --prefix=[blah]  CC=cc CXX=CC F77=f77 FC=f90 \
>        CFLAGS="-m64"  --with-wrapper-cflags="-m64" \
>        CXXFLAGS="-m64"  --with-wrapper-cxxflags="-m64" \
>        FFLAGS="-m64"  --with-wrapper-fflags="-m64" \
>        FCFLAGS="-m64"  --with-wrapper-fcflags="-m64" \
>        CXXFLAGS="-m64 -library=stlport4"
>
>  I did NOT manage to reproduce on AMD64/Solaris-11, which completed a
> build w/ VT disabled.
> Unfortunately I have neither SPARC/Solaris-11 nor
> AMD64/Solaris-10 readily available to disambiguate the key factor.
> Hopefully it is enough to know that the problem is reproducible w/o
> Oracle's massive configure commandline.
>
>
>  The build isn't complete, but I can already see that the symbol has
> "leaked" into libmpi:
>
>  $ grep -arl mca_coll_ml_memsync_intra BLD/
> BLD/ompi/mca/bcol/.libs/libmca_bcol.a
> BLD/ompi/mca/bcol/base/.libs/bcol_base_open.o
> BLD/ompi/.libs/libmpi.so.0.0.0
> BLD/ompi/.libs/libmpi.so
> BLD/ompi/.libs/libmpi.so.0
>
>  It is referenced by mca_coll_ml_generic_collectives_launcher:
>
>  $ nm BLD/ompi/.libs/libmpi.so.0.0.0 | grep -B1 mca_coll_ml_memsync_intra
> 00000000006a6088 t mca_coll_ml_generic_collectives_launcher
>                  U mca_coll_ml_memsync_intra
>
>  This is coming from libmca_bcol.a:
>  $ nm BLD/ompi/mca/bcol/.libs/libmca_bcol.a | grep -B1 mca_coll_ml_memsync_intra
> 0000000000005248 t mca_coll_ml_generic_collectives_launcher
>                  U mca_coll_ml_memsync_intra
>
>
>  This appears to be via the following chain of calls within coll_ml.h:
>
>  mca_coll_ml_generic_collectives_launcher
>    mca_coll_ml_task_completion_processing
>       coll_ml_fragment_completion_processing
>          mca_coll_ml_buffer_recycling
>              mca_coll_ml_memsync_intra
>
>  All of which are marked as
> "static inline __opal_attribute_always_inline__".
>
>  -Paul
>
>
> On Fri, Aug 24, 2012 at 4:55 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>> OK, I have a vanilla configure+make running on both SPARC/Solaris-10 and
>> AMD64/Solaris-11.
>> I am using the 12.3 Oracle compilers in both cases to match the original
>> report.
>> I'll post the results when they complete.
>>
>>  In the meantime, I took a quick look at the code and have a pretty
>> reasonable guess as to the cause.
>> Looking at ompi/mca/coll/ml/coll_ml.h I see:
>>
>>     827  int mca_coll_ml_memsync_intra(mca_coll_ml_module_t *module, int bank_index);
>> [...]
>>     996  static inline __opal_attribute_always_inline__
>>     997          int mca_coll_ml_buffer_recycling(mca_coll_ml_collective_operation_progress_t *ml_request)
>>     998  {
>> [...]
>>    1023                  rc = mca_coll_ml_memsync_intra(ml_module, ml_memblock->memsync_counter);
>> [...]
>>    1041  }
>>
>>  Based on past experience w/ the Sun/Oracle compilers on another project
>> (See http://bugzilla.hcs.ufl.edu/cgi-bin/bugzilla3/show_bug.cgi?id=193 ),
>> I suspect that this static-inline-always function is being emitted by the
>> compiler in every object that includes this header, even if the object
>> does not call it.  The call on line 1023 then results in the undefined
>> reference to mca_coll_ml_memsync_intra.  Basically it is not safe for an
>> inline function in a header to call an extern function that isn't
>> available to every object that includes the header, REGARDLESS of whether
>> the object invokes the inline function or not.
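
A minimal C sketch of the pattern described above. Every name in it (ml_like_memsync, ml_like_recycle_direct, ml_like_recycle, ML_LIKE_WANT_DIRECT_CALL) is hypothetical and chosen for illustration; none of it is taken from the Open MPI sources, and it is not necessarily how the problem was or should be fixed there. It shows the risky shape (a header-level static inline function that calls an extern function directly) next to one possible workaround: have the inline helper call through a function pointer supplied by the caller, so that an out-of-line copy of the helper carries no direct reference to the extern symbol.

```c
/* Sketch only: hypothetical names, not the Open MPI sources. */
#include <stddef.h>

/* ---- what a shared header might contain ---- */

/* Extern function that only one component actually defines. */
int ml_like_memsync(int bank_index);

/* Risky shape: if a compiler emits an out-of-line copy of this function in
 * every object that includes the header, each such object gains an undefined
 * reference to ml_like_memsync, whether or not it calls the function.
 * The guard macro (an assumption of this sketch) keeps the function away
 * from objects that do not ask for it. */
#ifdef ML_LIKE_WANT_DIRECT_CALL
static inline int ml_like_recycle_direct(int bank_index)
{
    return ml_like_memsync(bank_index);
}
#endif

/* One possible workaround: take the callee as a parameter, so the inline
 * helper itself never names the extern symbol. */
typedef int (*ml_like_memsync_fn)(int bank_index);

static inline int ml_like_recycle(ml_like_memsync_fn memsync, int bank_index)
{
    if (NULL == memsync) {
        return -1;              /* no implementation available */
    }
    return memsync(bank_index);
}

/* ---- what the one component that owns the function might contain ---- */

int ml_like_memsync(int bank_index)
{
    return bank_index + 1;      /* stand-in body for the sketch */
}
```

An object that only includes the header and never passes ml_like_memsync to the helper then has nothing to resolve at link time. The cost is an indirect call, which may matter on a hot path.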
>>
>>  -Paul
>>
>>
>>
>> On Fri, Aug 24, 2012 at 4:40 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Oracle uses an abysmally complicated configure line, but nearly all of
>>> it is irrelevant to the problem here. For this, I would suggest just doing
>>> a vanilla ./configure - if the component gets pulled into libmpi, then we
>>> know there is a problem.
>>>
>>>  Thanks!
>>>
>>>  Just FYI: here is their actual configure line, just in case you spot
>>> something problematic:
>>>
>>>  CC=cc CXX=CC F77=f77 FC=f90
>>> --with-openib --enable-openib-connectx-xrc --without-udapl
>>> --disable-openib-ibcm --enable-btl-openib-failover --without-dtrace
>>> --enable-heterogeneous --enable-cxx-exceptions --enable-shared
>>> --enable-orterun-prefix-by-default --with-sge
>>> --enable-mpi-f90 --with-mpi-f90-size=small --disable-peruse --disable-state
>>> --disable-mpi-thread-multiple --disable-debug --disable-mem-debug
>>> --disable-mem-profile
>>> CFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2 -xvector=lib -Qoption cg -xregs=no%appl -xdepend=yes -xbuiltin=%all -xO5"
>>> CXXFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2 -xvector=lib -Qoption cg -xregs=no%appl -xdepend=yes -xbuiltin=%all -xO5 -Bstatic -lCrun -lCstd -Bdynamic"
>>> FFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2 -xvector=lib -Qoption cg -xregs=no%appl -stackvar -xO5"
>>> FCFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2 -xvector=lib -Qoption cg -xregs=no%appl -stackvar -xO5"
>>> --prefix=/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/installs/JA08/install
>>> --mandir=${prefix}/man  --bindir=${prefix}/bin  --libdir=${prefix}/lib
>>> --includedir=${prefix}/include
>>> --with-tm=/ws/ompi-tools/orte/torque/current/shared-install32
>>> --enable-contrib-no-build=vt
>>> --with-package-string="Oracle Message Passing Toolkit "
>>> --with-ident-string="@(#)RELEASE VERSION 1.9openmpi-1.5.4-r1.9a1r27092"
>>>
>>>
>>>  and the error he gets is:
>>>
>>>  make[2]: Entering directory `/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/mpi-install/s3rI/src/openmpi-1.9a1r27092/ompi/tools/ompi_info'
>>>   CCLD     ompi_info
>>> Undefined                   first referenced
>>>  symbol                         in file
>>> mca_coll_ml_memsync_intra           ../../../ompi/.libs/libmpi.so
>>> ld: fatal: symbol referencing errors. No output written to .libs/ompi_info
>>> make[2]: *** [ompi_info] Error 2
>>> make[2]: Leaving directory `/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/mpi-install/s3rI/src/openmpi-1.9a1r27092/ompi/tools/ompi_info'
>>> make[1]: *** [install-recursive] Error 1
>>> make[1]: Leaving directory `/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/mpi-install/s3rI/src/openmpi-1.9a1r27092/ompi'
>>> make: *** [install-recursive] Error 1
>>>
>>>
>>>  On Aug 24, 2012, at 4:30 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>
>>> I have access to a few different Solaris machines and can offer to build
>>> the trunk if somebody tells me what configure flags are desired.
>>>
>>>  -Paul
>>>
>>> On Fri, Aug 24, 2012 at 8:54 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> Eugene - can you confirm that this is only happening on the one Solaris
>>>> system? In other words, is this a general issue or something specific to
>>>> that one machine?
>>>>
>>>> I'm wondering because if it is just the one machine, then it might be
>>>> something strange about how it is setup - perhaps the version of Solaris,
>>>> or it is configuring --enable-static, or...
>>>>
>>>> Just trying to assess how general a problem this might be, and thus if
>>>> this should be a blocker or not.
>>>>
>>>> On Aug 24, 2012, at 8:00 AM, Eugene Loh <eugene....@oracle.com> wrote:
>>>>
>>>> > On 08/24/12 09:54, Shamis, Pavel wrote:
>>>> >> Maybe there is a chance to get direct access to this system ?
>>>> > No.
>>>> >
>>>> > But I'm attaching compressed log files from configure/make.
>>>> >
>>>> >
>>>> > <tarball-of-log-files.tar.bz2>
>>>> > _______________________________________________
>>>> > devel mailing list
>>>> > de...@open-mpi.org
>>>> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>  --
>>>  Paul H. Hargrove                          phhargr...@lbl.gov
>>> Future Technologies Group
>>> Computer and Data Sciences Department     Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>
>
>
>
>
>
>
>
>



-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
