Paul,

i do not think -lpthread is passed automatically to LDFLAGS on Solaris,
so you might have to add it manually as well.

i never used --with-wrapper-cflags before, so i'd rather invite you to run
mpicc -show
to make sure the right flags are passed at the right place when the app
is built.

Cheers,

Gilles

On 2014/12/17 12:04, Paul Hargrove wrote:
> Gilles,
>
> I am running the build of 1.8.3 first.
> As you suggest, I will only try without -m64 if 1.8.3 runs with it.
>
> Regarding "-mt" my understanding from "man cc" is that it has a DUAL
> function:
> 1) Passes -D_REENTRANT to the preprocess stage (if any)
> 2) Passes "the right flags" to the linker stage (if any)
>
> NOTE that since Solaris has historically supported a "native" (non-POSIX)
> threads library, when linking for pthreads one must pass BOTH "-mt" and
> "-lpthread", and they must be in that order.
>
> I *think* I have already tried adding "-mt" to both CFLAGS and LDFLAGS, but
> am not 100% sure I've done so correctly. I believe I need to configure with
>
> CFLAGS="-m64 -mt" --with-wrapper-cflags="-m64 -mt" \
> LDFLAGS="-mt" --with-wrapper-ldflags="-mt"
>
> if I am to be sure that orterun and the app are both compiled and linked
> with "-mt".
> Is that right?
>
> -Paul
>
> On Tue, Dec 16, 2014 at 6:25 PM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>> Thanks Paul,
>>
>> if 1.8.3 with -m64 and the same compilers runs fine, then please do not
>> bother running 1.8.4rc4 without -m64.
>> /* i understand you are busy and i hardly believe -m64 is the root cause */
>>
>> a regression i can think of involves the flags we use for pthreads :
>> for bad reasons, we initially tested the following flags on solaris :
>> -pthread
>> -pthreads
>> -mt
>>
>> with solarisstudio 12.4, -mt was chosen
>>
>> 1.8.4rc4 has a bug (fixed in the v1.8 git): -D_REENTRANT is not
>> automatically added, so you have to do it manually.
>> i just figured out that -mt is unlikely to be added automatically either.
>> do we need this and where ?
>> CFLAGS ? (or is -D_REENTRANT enough ?)
>> LDFLAGS ?
>> (that might be solaris and/or solarisstudio (12.4) specific and
>> i simply ignore it)
>>
>> Bottom line, i do invite you to test 1.8.4rc4 again and with
>> CFLAGS="-mt"
>> or
>> CFLAGS="-mt -m64"
>> if you previously tested 1.8.3 with -m64
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/12/17 11:05, Paul Hargrove wrote:
>>
>> Gilles,
>>
>> First, please note that prior tests of 1.8.3 ran with no problems on these
>> hosts.
>> So, I *think* this problem is a regression.
>> However, I am not 100% certain that this *exact* configuration was tested.
>> So, I am RE-running a test of 1.8.3 now to be absolutely sure if this is a
>> regression.
>> I will report the outcome when I can.
>>
>> I have limited time to run the tests you are asking for. I will do my
>> best, but am concerned that I won't be responsive enough and may hold up
>> the release. I fully understand why you ask multiple questions in one
>> email to keep things moving.
>>
>> I am running mpirun on pcp-j-20 and "getent hosts pcp-j-20" run there yields
>>
>> $ getent hosts pcp-j-20
>> 127.0.0.1 pcp-j-20 pcp-j-20.local localhost loghost
>> 172.16.0.120 pcp-j-20 pcp-j-20.local localhost loghost
>>
>> In case it matters: there is an entry for 172.18.0.120 in /etc/hosts as
>> pcp-j-20-ib.
>>
>> I will run a test tonight to determine if the same issue is present without
>> "-m64".
>> I will report the outcome when I can.
>>
>> Yes, I can ping and ssh to "pcp-j-{19,20}" and "172.{16,18}.0.{119,120}".
>> I see the following if run on either pcp-j-19 or pcp-j-20:
>>
>> $ for x in {pcp-j-,172.{16,18}.0.1}{19,20}; do ssh $x echo OK connecting to $x; done
>> OK connecting to pcp-j-19
>> OK connecting to pcp-j-20
>> OK connecting to 172.16.0.119
>> OK connecting to 172.16.0.120
>> OK connecting to 172.18.0.119
>> OK connecting to 172.18.0.120
>>
>> I will report on the 1.8.3 and the non-m64 runs when they are done.
>> Meanwhile, if you have other things you want run let me know.
>>
>> -Paul
>>
>> On Tue, Dec 16, 2014 at 5:35 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@gmail.com> wrote:
>>
>> Thanks Paul,
>>
>> Are you invoking mpirun on pcp-j-20 ?
>> If yes, what does
>> getent hosts pcp-j-20
>> say ?
>>
>> BTW, did you try without -m64 ?
>>
>> Does the following work
>> ping/ssh 172.18.0.120
>>
>> Honestly, this output makes very little sense to me, so i am asking way
>> too much info hoping i can reproduce this issue or get a hint on what can
>> possibly go wrong.
>>
>> Cheers,
>>
>> Gilles
>>
>> Paul Hargrove <phhargr...@lbl.gov> wrote:
>> Gilles,
>>
>> I am running mpirun on a host that ALSO will run one of the application
>> processes.
>> Requested ifconfig and netstat outputs appear below.
>>
>> -Paul
>>
>> [phargrov@pcp-j-20 ~]$ ifconfig -a
>> lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
>>         inet 127.0.0.1 netmask ff000000
>> bge0: flags=1004843<UP,BROADCAST,RUNNING,MULTICAST,DHCP,IPv4> mtu 1500 index 2
>>         inet 172.16.0.120 netmask ffff0000 broadcast 172.16.255.255
>> pFFFF.ibp0: flags=1001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,FIXEDMTU> mtu 2044 index 3
>>         inet 172.18.0.120 netmask ffff0000 broadcast 172.18.255.255
>> lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
>>         inet6 ::1/128
>> bge0: flags=20002004841<UP,RUNNING,MULTICAST,DHCP,IPv6> mtu 1500 index 2
>>         inet6 fe80::250:45ff:fe5c:2b0/10
>> [phargrov@pcp-j-20 ~]$ netstat -nr
>>
>> Routing Table: IPv4
>>   Destination           Gateway           Flags  Ref     Use     Interface
>> -------------------- -------------------- ----- ----- ---------- ---------
>> default              172.16.254.1         UG        2     158463 bge0
>> 127.0.0.1            127.0.0.1            UH        5     398913 lo0
>> 172.16.0.0           172.16.0.120         U         4  135241319 bge0
>> 172.18.0.0           172.18.0.120         U         3         26 pFFFF.ibp0
>>
>> Routing Table: IPv6
>>   Destination/Mask            Gateway                   Flags Ref   Use    If
>> --------------------------- --------------------------- ----- --- ------- -----
>> ::1                         ::1                         UH      2       0 lo0
>> fe80::/10                   fe80::250:45ff:fe5c:2b0     U       2       0 bge0
>>
>> On Tue, Dec 16, 2014 at 2:55 AM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>
>> Paul,
>>
>> could you please send the output of
>> ifconfig -a
>> netstat -nr
>>
>> on the three hosts you are using
>> (i assume you are still invoking mpirun from one node, and tasks are
>> running on two other nodes)
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/12/16 16:00, Paul Hargrove wrote:
>>
>> Gilles,
>>
>> I looked again carefully and I am *NOT* finding -D_REENTRANT passed to most
>> compilations.
>> It appears to be used for building libevent and vt, but nothing else.
>> The output from configure contains
>>
>> checking if more special flags are required for pthreads... -D_REENTRANT
>>
>> only in the libevent and vt sub-configure portions.
>>
>> When configured for gcc on Solaris-11 I see the following in configure
>>
>> checking for C optimization flags... -m64 -D_REENTRANT -g -finline-functions -fno-strict-aliasing
>>
>> but with CC=cc the equivalent line is
>>
>> checking for C optimization flags... -m64 -g
>>
>> In both cases the "-m64" is from the CFLAGS I have passed to configure.
>>
>> However, when I use CFLAGS="-m64 -D_REENTRANT" the problem DOES NOT go away.
>> I see
>>
>> [pcp-j-20:24740] mca_oob_tcp_accept: accept() failed: Error 0 (11).
>> ------------------------------------------------------------
>> A process or daemon was unable to complete a TCP connection
>> to another process:
>> Local host: pcp-j-20
>> Remote host: 172.18.0.120
>> This is usually caused by a firewall on the remote host. Please
>> check that any firewall (e.g., iptables) has been disabled and
>> try again.
>> ------------------------------------------------------------
>>
>> which at least appears to have a non-zero errno.
>> A quick grep through /usr/include/sys/errno shows 11 is EAGAIN.
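
A minimal sketch of the errno-preservation pattern that the "Error 0"
reports in this thread point at. It assumes, as suspected further down for
the CLOSE_THE_SOCKET macro, that cleanup code runs between the failed
accept() and the error message and overwrites errno in between. The helper
names close_preserving_errno and demo_errno_preserved are hypothetical and
are not Open MPI functions.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper (not an Open MPI function): close a descriptor
 * without disturbing the errno left by an earlier failed call. */
static void close_preserving_errno(int fd)
{
    int saved = errno;      /* remember the error we want to report */
    (void)close(fd);        /* may overwrite errno, e.g. with EBADF */
    errno = saved;          /* restore it for the caller's message */
    assert(errno == saved);
}

/* Returns 1 when a plain close() clobbers errno and the wrapper does not. */
static int demo_errno_preserved(void)
{
    int clobbered, preserved;

    errno = EAGAIN;                  /* pretend accept() just failed */
    (void)close(-1);                 /* invalid fd: fails with EBADF */
    clobbered = (errno != EAGAIN);

    errno = EAGAIN;
    close_preserving_errno(-1);      /* same failure, errno restored */
    preserved = (errno == EAGAIN);

    return clobbered && preserved;
}
```

Calling demo_errno_preserved() from a small main() is enough to try this.
On Solaris with the Studio compilers the sketch additionally presumes the
file is built so that errno is per-thread (-mt or -D_REENTRANT), which is
the open question in this thread.
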
>>
>> With the oob.patch you provided the failed accept goes away, BUT the
>> connection still fails:
>>
>> ------------------------------------------------------------
>> A process or daemon was unable to complete a TCP connection
>> to another process:
>> Local host: pcp-j-20
>> Remote host: 172.18.0.120
>> This is usually caused by a firewall on the remote host. Please
>> check that any firewall (e.g., iptables) has been disabled and
>> try again.
>> ------------------------------------------------------------
>>
>> Use of "-mca oob_tcp_if_include bge0" to use a single interface did not fix
>> this.
>>
>> -Paul
>>
>> On Mon, Dec 15, 2014 at 7:18 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>> Gilles,
>>
>> I am NOT seeing the problem with gcc.
>> It is only occurring with the Studio compilers.
>>
>> As I've already reported, I have tried adding either "-mt" or "-mt=yes" to
>> both LDFLAGS and --with-wrapper-ldflags.
>>
>> The "cc" manpage (on the Solaris-10 system I can get to right now) says:
>>
>> -mt Compile and link for multithreaded code.
>>
>>     This option passes -D_REENTRANT to the preprocessor and
>>     passes -lthread in the correct order to ld.
>>
>>     The -mt option is required if the application or
>>     libraries are multithreaded.
>>
>>     To ensure proper library linking order, you must use
>>     this option, rather than -lthread, to link with libthread.
>>
>>     If you are using POSIX threads, you must link with the
>>     options -mt -lpthread. The -mt option is necessary
>>     because libC and libCrun need libthread for a
>>     multithreaded application.
>>
>>     If you compile and link in separate steps and you compile
>>     with -mt, you might get unexpected results. If you
>>     compile one translation unit with -mt, compile all
>>     units of the program with -mt.
>>
>> I cannot connect to my Solaris-11 system right now, but I recall the text
>> to be quite similar.
>>
>> -Paul
>>
>> On Mon, Dec 15, 2014 at 7:12 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>
>> Paul,
>>
>> did you manually set -mt ?
>>
>> if i remember correctly, solaris 11 (at least with gcc compilers) does not
>> need any flags
>> (except the -D_REENTRANT that is added automatically)
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/12/16 12:10, Paul Hargrove wrote:
>>
>> Gilles,
>>
>> I will try the patch when I can.
>> However, our network is undergoing maintenance right now, leaving
>> me unable to reach the necessary hosts.
>>
>> As for -D_REENTRANT, I had already reported having verified in the "make"
>> output that it had been added automatically.
>>
>> Additionally, the docs say that "-mt" *also* passes -D_REENTRANT to the
>> preprocessor.
>>
>> -Paul
>>
>> On Mon, Dec 15, 2014 at 6:07 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>
>> Paul,
>>
>> could you please make sure configure added "-D_REENTRANT" to the CFLAGS ?
>> /* otherwise, errno is a global variable instead of a per thread variable,
>> which can explain some weird behaviour. note this should have been
>> already fixed */
>>
>> assuming -D_REENTRANT is set, could you please give the attached patch a
>> try ?
>>
>> i suspect the CLOSE_THE_SOCKET macro resets errno, and hence the confusing
>> error message
>> e.g. failed: Error 0 (0)
>>
>> FWIW, master is also affected.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/12/16 10:47, Paul Hargrove wrote:
>>
>> I have tried with an oob_tcp_if_include setting so that there is now only 1
>> interface.
>> Even with just one interface and -mt=yes in both LDFLAGS and
>> wrapper-ldflags I am *still* getting messages like
>>
>> [pcp-j-20:11470] mca_oob_tcp_accept: accept() failed: Error 0 (0).
>> ------------------------------------------------------------
>> A process or daemon was unable to complete a TCP connection
>> to another process:
>> Local host: pcp-j-20
>> Remote host: 172.16.0.120
>> This is usually caused by a firewall on the remote host. Please
>> check that any firewall (e.g., iptables) has been disabled and
>> try again.
>> ------------------------------------------------------------
>>
>> I am getting less certain that my speculation about thread-safe libs is
>> correct.
>>
>> -Paul
>>
>> On Mon, Dec 15, 2014 at 1:24 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>> A little more reading finds that...
>>
>> Docs say that one needs "-mt" without the "=yes".
>> That will work for both old and new compilers, where "-mt=yes" chokes
>> older ones.
>>
>> Also, man pages say "-mt" must come before "-lpthread" in the link command.
>>
>> -Paul
>>
>> On Mon, Dec 15, 2014 at 12:52 PM, Paul Hargrove <phhargr...@lbl.gov>
>> wrote:
>>
>> On Mon, Dec 15, 2014 at 5:35 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> 7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the
>> multi-threaded C libraries, apparently need "-mt=yes" in both compile and
>> link. Need someone to investigate.
>>
>> The lack of multi-thread libraries is my SPECULATION.
>>
>> The fact that configuring with LDFLAGS=-mt=yes did not help may or may
>> not prove anything.
>> I didn't see them in "mpicc -show" and so maybe they needed to be in
>> wrapper-ldflags instead.
>> My time this week is quite limited, but I can "fire and forget" tests of
>> any tarballs you provide.
>>
>> -Paul
>>
>> --
>> Paul H. Hargrove                          phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department               Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/12/16607.php