On Sun, Dec 14, 2014 at 10:52 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> Solaris-10/SPARC and "--enable-static --disable-shared" appears broken for
> C++ apps (but OK for C).
> I will report in more details when I have more information.
>

First the good news:
The problem I was experiencing (with the Solaris Studio compilers) turned out to be "pilot error". I had added "-library=stlport4" to CXXFLAGS but neglected to add the same in --with-wrapper-cxxflags. Having to add it in both places has always sort of bothered me, and this time it bit me. Oddly, the problem didn't appear until I forced static libs.

Now the bad news:
By trying more variants on my Solaris platforms I was able to get TWO new failure modes. However, I have a fix for one.

1) Still Solaris-10/SPARC and "--enable-static --disable-shared", but this time with gcc-3.4.6.

With this configuration I get Bus Errors from "make check" that do not occur without these configure options:

bash: line 5:  3141 Bus Error               (core dumped) ${dir}$tst
FAIL: position
bash: line 5:  3221 Bus Error               (core dumped) ${dir}$tst
FAIL: position_noncontig

Examining the core from the second failure:

t@1 (l@1) program terminated by signal BUS (invalid address alignment)
Current function is main
  208       opal_pack_debug = 0;
(dbx) print &opal_pack_debug
&opal_pack_debug = 0x10092e169

The problem seems to be that the tests declare this variable (and others) as an int, but the opal headers say bool:

$ gegrep -r '^extern .* opal_(pack|unpack|position)_debug' .
./test/datatype/position.c:extern int opal_unpack_debug;
./test/datatype/position.c:extern int opal_pack_debug;
./test/datatype/position.c:extern int opal_position_debug ;
./test/datatype/position_noncontig.c:extern int opal_unpack_debug;
./test/datatype/position_noncontig.c:extern int opal_pack_debug;
./test/datatype/position_noncontig.c:extern int opal_position_debug ;
./opal/datatype/opal_convertor_internal.h:extern bool opal_pack_debug;
./opal/datatype/opal_datatype_position.c:extern bool opal_position_debug;

The definition of opal_unpack_debug is well hidden, but it is also "bool".
Correcting "int" to "bool" for those 3 vars in the two tests resolved this problem for me.

2) Now on my Solaris-11/x86-64 system with both GigE and IPoIB interfaces.
I am seeing the following when using the Solaris Studio compilers (the GNU compilers were fine):

$ mpirun -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 examples/ring_c
[pcp-j-20:16239] mca_oob_tcp_accept: accept() failed: Error 0 (0).
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    pcp-j-20
  Remote host:   172.18.0.120
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------

Notice the "Error 0 (0)", which means errno=0 and suggests that we've not properly linked the thread-safe C libraries (recall that there is one thread per interface and these hosts have two).

I see "-D_REENTRANT" in the output of "make". However, the man pages suggest that one also needs "-mt=yes" in *both* the compile and link steps (it defines _REENTRANT and links the proper libs).

I hoped that I could resolve this failure by adding LDFLAGS=-mt=yes to the configure command. However, that didn't work.

-Paul

--
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
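
A footnote on item 1, in case it helps anyone reading the dbx output: the address reported for opal_pack_debug ends in ...169, i.e. it is odd. That is fine for a 1-byte bool, but a test that declares the same symbol as int will issue a wider store to that address, which SPARC rejects as misaligned. Below is a minimal sketch of that arithmetic only. The names demo_pack_debug and addr_is_aligned are hypothetical and not part of Open MPI, and the 4-byte int size/alignment is an assumption (typical, but not guaranteed by the C standard).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a flag that the library side defines as bool.
 * A test that declares it "extern int" would disagree on size/alignment. */
bool demo_pack_debug = false;

/* Returns nonzero if addr is a multiple of align (align must be nonzero). */
int addr_is_aligned(unsigned long long addr, size_t align)
{
    assert(align != 0);
    return (addr % align) == 0;
}
```

For example, addr_is_aligned(0x10092e169ULL, 1) is nonzero (fine for a 1-byte bool), while addr_is_aligned(0x10092e169ULL, 4) is zero, which matches the "invalid address alignment" that dbx reports for the int store.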