On Sun, Dec 14, 2014 at 10:52 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> Solaris-10/SPARC and "--enable-static --disable-shared" appears broken for
> C++ apps (but OK for C).
> I will report in more details when I have more information.
>

First the good news:

The problem I was experiencing (with the Solaris Studio compilers) turned
out to be "pilot error".
I had added "-library=stlport4" to CXXFLAGS but neglected to add the same
in --with-wrapper-cxxflags.
Adding to both has always sort of bothered me, and this time it bit me.
Oddly, the problem didn't appear until I forced static libs.

Now the bad news:

By trying more variants on my Solaris platforms I was able to get TWO new
failure modes.
However, I have a fix for one.

1)
Still Solaris-10/SPARC and "--enable-static --disable-shared", but this
time with gcc-3.4.6.
With this configuration I get Bus Errors from "make check" that do not
occur without these configure options:

bash: line 5:  3141 Bus Error               (core dumped) ${dir}$tst
FAIL: position
bash: line 5:  3221 Bus Error               (core dumped) ${dir}$tst
FAIL: position_noncontig


Examining the core from the second failure:

t@1 (l@1) program terminated by signal BUS (invalid address alignment)
Current function is main
  208       opal_pack_debug     = 0;
(dbx) print &opal_pack_debug
&opal_pack_debug = 0x10092e169


The problem seems to be that the tests declare this (and others) as an int,
but the opal headers say bool (hence the 4-byte store to an odd address):

$ gegrep  -r '^extern .* opal_(pack|unpack|position)_debug' .
./test/datatype/position.c:extern int opal_unpack_debug;
./test/datatype/position.c:extern int opal_pack_debug;
./test/datatype/position.c:extern int opal_position_debug ;
./test/datatype/position_noncontig.c:extern int opal_unpack_debug;
./test/datatype/position_noncontig.c:extern int opal_pack_debug;
./test/datatype/position_noncontig.c:extern int opal_position_debug ;
./opal/datatype/opal_convertor_internal.h:extern bool opal_pack_debug;
./opal/datatype/opal_datatype_position.c:extern bool opal_position_debug;

The definition of opal_unpack_debug is well hidden, but it is also "bool".

Correcting "int" to "bool" for those 3 vars in the two tests resolved this
problem for me.
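
For illustration only (this is not code from the Open MPI tree), here is a
minimal C sketch of why the mismatch matters on a strict-alignment CPU such
as SPARC. The helper names are hypothetical, and the sketch assumes a C11
compiler and the usual ABI in which int is wider than bool. It checks whether
the address printed by dbx above is suitably aligned for each type:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Returns 1 if addr is a multiple of align (align must be a power of two). */
static int is_aligned(uint64_t addr, size_t align)
{
    return (addr & (uint64_t)(align - 1)) == 0;
}

/* Hypothetical helpers: may an int (or a bool) legally live at addr? */
int can_store_int_at(uint64_t addr)
{
    return is_aligned(addr, _Alignof(int));
}

int can_store_bool_at(uint64_t addr)
{
    return is_aligned(addr, _Alignof(bool));
}

/*
 * The corrected declarations in the tests would then match the library:
 *     extern bool opal_pack_debug;
 *     extern bool opal_unpack_debug;
 *     extern bool opal_position_debug;
 */
```

With the address from the core file, 0x10092e169, the int check fails because
the address is odd, while the bool check passes. That is consistent with the
"invalid address alignment" report from dbx.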



2)
Now on my Solaris-11/x86-64 system with both GigE and IPoIB interfaces.
I am seeing the following when using the Solaris Studio compilers (the GNU
compilers were fine):

$ mpirun -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 examples/ring_c
[pcp-j-20:16239] mca_oob_tcp_accept: accept() failed: Error 0 (0).
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    pcp-j-20
  Remote host:   172.18.0.120
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------


Notice the "Error 0 (0)" which means errno=0 and suggests that we've not
properly linked the thread-safe C libraries (recall that there is one
thread per interface and these hosts have two).
I see "-D_REENTRANT" in the output of "make".
However, the man pages suggest that one also needs "-mt=yes" in *both* the
compile and link steps (it defines _REENTRANT and links the proper libs).
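
As a point of comparison, a failed accept() should leave a nonzero errno
behind when it is read right after the call. The sketch below is not the
mca_oob_tcp code and does not reproduce the Solaris failure; the function
name is hypothetical. It only shows the expected behaviour on a POSIX system,
using a descriptor that is invalid on purpose:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * Calls accept() on a deliberately invalid descriptor and returns the errno
 * value saved immediately after the failure (0 if accept() unexpectedly
 * succeeds).
 */
int failed_accept_errno(void)
{
    errno = 0;
    int fd = accept(-1, NULL, NULL);  /* -1 is never a valid descriptor */
    int saved = errno;                /* save before any other libc call */
    return (fd < 0) ? saved : 0;
}
```

On a correctly built program this returns EBADF. A thread that sees a failed
call together with errno=0, as in the message above, may be reading a
different errno than the one the C library set for that thread.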

I hoped that I could resolve this failure by adding LDFLAGS=-mt=yes to the
configure command.
However, that didn't work.


-Paul


-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
