Ralph,

You got it right. The latest nightly tarball should work out of the box (well, -m64 must be passed manually, but this is not related whatsoever to the issue discussed here).
Cheers,

Gilles

"Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:
>Paul --
>
>The __sun macro check is now in the OMPI 1.8 tree, and is in the latest
>nightly tarball.
>
>If I'm following this thread right -- and I might not be! -- I think Gilles is
>saying that now that the __sun check is in, it should fix this
>-mt/-D_REENTRANT/whatever problem.
>
>Can you confirm?
>
>
>On Dec 16, 2014, at 1:55 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>> Gilles,
>>
>> I am running mpirun on a host that ALSO will run one of the application
>> processes.
>> Requested ifconfig and netstat outputs appear below.
>>
>> -Paul
>>
>> [phargrov@pcp-j-20 ~]$ ifconfig -a
>> lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
>>         inet 127.0.0.1 netmask ff000000
>> bge0: flags=1004843<UP,BROADCAST,RUNNING,MULTICAST,DHCP,IPv4> mtu 1500 index 2
>>         inet 172.16.0.120 netmask ffff0000 broadcast 172.16.255.255
>> pFFFF.ibp0: flags=1001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,FIXEDMTU> mtu 2044 index 3
>>         inet 172.18.0.120 netmask ffff0000 broadcast 172.18.255.255
>> lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
>>         inet6 ::1/128
>> bge0: flags=20002004841<UP,RUNNING,MULTICAST,DHCP,IPv6> mtu 1500 index 2
>>         inet6 fe80::250:45ff:fe5c:2b0/10
>> [phargrov@pcp-j-20 ~]$ netstat -nr
>>
>> Routing Table: IPv4
>>   Destination           Gateway           Flags  Ref     Use     Interface
>> -------------------- -------------------- ----- ----- ---------- ---------
>> default              172.16.254.1         UG        2     158463 bge0
>> 127.0.0.1            127.0.0.1            UH        5     398913 lo0
>> 172.16.0.0           172.16.0.120         U         4  135241319 bge0
>> 172.18.0.0           172.18.0.120         U         3         26 pFFFF.ibp0
>>
>> Routing Table: IPv6
>>   Destination/Mask            Gateway                   Flags Ref   Use    If
>> --------------------------- --------------------------- ----- --- ------- -----
>> ::1                         ::1                         UH      2       0 lo0
>> fe80::/10                   fe80::250:45ff:fe5c:2b0     U       2       0 bge0
>>
>> On Tue, Dec 16, 2014 at 2:55 AM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>> Paul,
>>
>> could you please send the output of
>> ifconfig -a
>> netstat -nr
>>
>> on the three hosts you are using
>> (i assume you are still invoking mpirun from one node, and tasks are running
>> on two other nodes)
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On 2014/12/16 16:00, Paul Hargrove wrote:
>>> Gilles,
>>>
>>> I looked again carefully and I am *NOT* finding -D_REENTRANT passed to most
>>> compilations.
>>> It appears to be used for building libevent and vt, but nothing else.
>>> The output from configure contains
>>>
>>> checking if more special flags are required for pthreads... -D_REENTRANT
>>>
>>> only in the libevent and vt sub-configure portions.
>>>
>>> When configured for gcc on Solaris-11 I see the following in configure
>>>
>>> checking for C optimization flags... -m64 -D_REENTRANT -g -finline-functions -fno-strict-aliasing
>>>
>>> but with CC=cc the equivalent line is
>>>
>>> checking for C optimization flags... -m64 -g
>>>
>>> In both cases the "-m64" is from the CFLAGS I have passed to configure.
>>>
>>> However, when I use CFLAGS="-m64 -D_REENTRANT" the problem DOES NOT go away.
>>> I see
>>>
>>> [pcp-j-20:24740] mca_oob_tcp_accept: accept() failed: Error 0 (11).
>>> ------------------------------------------------------------
>>> A process or daemon was unable to complete a TCP connection
>>> to another process:
>>>   Local host:    pcp-j-20
>>>   Remote host:   172.18.0.120
>>> This is usually caused by a firewall on the remote host. Please
>>> check that any firewall (e.g., iptables) has been disabled and
>>> try again.
>>> ------------------------------------------------------------
>>>
>>> which at least appears to have a non-zero errno.
>>> A quick grep through /usr/include/sys/errno shows 11 is EAGAIN.
>>>
>>> With the oob.patch you provided the failed accept goes away, BUT the
>>> connection still fails:
>>>
>>> ------------------------------------------------------------
>>> A process or daemon was unable to complete a TCP connection
>>> to another process:
>>>   Local host:    pcp-j-20
>>>   Remote host:   172.18.0.120
>>> This is usually caused by a firewall on the remote host. Please
>>> check that any firewall (e.g., iptables) has been disabled and
>>> try again.
>>> ------------------------------------------------------------
>>>
>>>
>>> Use of "-mca oob_tcp_if_include bge0" to use a single interface did not fix
>>> this.
>>>
>>>
>>> -Paul
>>>
>>> On Mon, Dec 15, 2014 at 7:18 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>
>>>> Gilles,
>>>>
>>>> I am NOT seeing the problem with gcc.
>>>> It is only occurring with the Studio compilers.
>>>>
>>>> As I've already reported, I have tried adding either "-mt" or "-mt=yes" to
>>>> both LDFLAGS and --with-wrapper-ldflags.
>>>>
>>>> The "cc" manpage (on the Solaris-10 system I can get to right now) says:
>>>>
>>>>      -mt  Compile and link for multithreaded code.
>>>>
>>>>           This option passes -D_REENTRANT to the preprocessor and
>>>>           passes -lthread in the correct order to ld.
>>>>
>>>>           The -mt option is required if the application or
>>>>           libraries are multithreaded.
>>>>
>>>>           To ensure proper library linking order, you must use
>>>>           this option, rather than -lthread, to link with
>>>>           libthread.
>>>>
>>>>           If you are using POSIX threads, you must link with the
>>>>           options -mt -lpthread. The -mt option is necessary
>>>>           because libC and libCrun need libthread for a
>>>>           multithreaded application.
>>>>
>>>>           If you compile and link in separate steps and you
>>>>           compile with -mt, you might get unexpected results. If you
>>>>           compile one translation unit with -mt, compile all
>>>>           units of the program with -mt.
>>>>
>>>> I cannot connect to my Solaris-11 system right now, but I recall the text
>>>> to be quite similar.
>>>>
>>>> -Paul
>>>>
>>>> On Mon, Dec 15, 2014 at 7:12 PM, Gilles Gouaillardet
>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>
>>>>> Paul,
>>>>>
>>>>> did you manually set -mt ?
>>>>>
>>>>> if i remember correctly, solaris 11 (at least with gcc compilers) does not
>>>>> need any flags
>>>>> (except the -D_REENTRANT that is added automatically)
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>>
>>>>> On 2014/12/16 12:10, Paul Hargrove wrote:
>>>>>
>>>>> Gilles,
>>>>>
>>>>> I will try the patch when I can.
>>>>> However, our network is undergoing network maintenance right now, leaving
>>>>> me unable to reach the necessary hosts.
>>>>>
>>>>> As for -D_REENTRANT, I had already reported having verified in the "make"
>>>>> output that it had been added automatically.
>>>>>
>>>>> Additionally, the docs say that "-mt" *also* passes -D_REENTRANT to the
>>>>> preprocessor.
>>>>>
>>>>> -Paul
>>>>>
>>>>> On Mon, Dec 15, 2014 at 6:07 PM, Gilles Gouaillardet
>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>
>>>>>
>>>>> Paul,
>>>>>
>>>>> could you please make sure configure added "-D_REENTRANT" to the CFLAGS ?
>>>>> /* otherwise, errno is a global variable instead of a per thread variable,
>>>>> which can explain some weird behaviour. note this should have been already
>>>>> fixed */
>>>>>
>>>>> assuming -D_REENTRANT is set, could you please give the attached patch a
>>>>> try ?
>>>>>
>>>>> i suspect the CLOSE_THE_SOCKET macro resets errno, and hence the confusing
>>>>> error message
>>>>> e.g. failed: Error 0 (0)
>>>>>
>>>>> FWIW, master is also affected.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>>
>>>>> On 2014/12/16 10:47, Paul Hargrove wrote:
>>>>>
>>>>> I have tried with an oob_tcp_if_include setting so that there is now only 1
>>>>> interface.
>>>>> Even with just one interface and -mt=yes in both LDFLAGS and
>>>>> wrapper-ldflags I am *still* getting messages like
>>>>>
>>>>> [pcp-j-20:11470] mca_oob_tcp_accept: accept() failed: Error 0 (0).
>>>>> ------------------------------------------------------------
>>>>> A process or daemon was unable to complete a TCP connection
>>>>> to another process:
>>>>>   Local host:    pcp-j-20
>>>>>   Remote host:   172.16.0.120
>>>>> This is usually caused by a firewall on the remote host. Please
>>>>> check that any firewall (e.g., iptables) has been disabled and
>>>>> try again.
>>>>> ------------------------------------------------------------
>>>>>
>>>>>
>>>>> I am getting less certain that my speculation about thread-safe libs is
>>>>> correct.
>>>>>
>>>>> -Paul
>>>>>
>>>>> On Mon, Dec 15, 2014 at 1:24 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>>>
>>>>> A little more reading finds that...
>>>>>
>>>>> The docs say that one needs "-mt" without the "=yes".
>>>>> That will work for both old and new compilers, where "-mt=yes" chokes
>>>>> older ones.
>>>>>
>>>>> Also, man pages say "-mt" must come before "-lpthread" in the link
>>>>> command.
>>>>>
>>>>> -Paul
>>>>>
>>>>> On Mon, Dec 15, 2014 at 12:52 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>>>
>>>>>
>>>>> On Mon, Dec 15, 2014 at 5:35 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>> 7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the
>>>>> multi-threaded C libraries, apparently need "-mt=yes" in both compile and
>>>>> link. Need someone to investigate.
>>>>>
>>>>>
>>>>> The lack of multi-thread libraries is my SPECULATION.
>>>>>
>>>>> The fact that configuring with LDFLAGS=-mt=yes did not help may or may
>>>>> not prove anything.
>>>>> I didn't see them in "mpicc -show" and so maybe they needed to be in
>>>>> wrapper-ldflags instead.
>>>>> My time this week is quite limited, but I can "fire and forget" tests of
>>>>> any tarballs you provide.
>>>>>
>>>>> -Paul
>>>>>
>>>>> --
>>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>>> Computer Languages & Systems Software (CLaSS) Group
>>>>> Computer Science Department               Tel: +1-510-495-2352
>>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>> Link to this post:
>>>>> http://www.open-mpi.org/community/lists/devel/2014/12/16607.php
>
>
>--
>Jeff Squyres
>jsquy...@cisco.com
>For corporate legal information go to:
>http://www.cisco.com/web/about/doing_business/legal/cri/
>
>_______________________________________________
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post:
>http://www.open-mpi.org/community/lists/devel/2014/12/16660.php
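
As an aside on the "accept() failed: Error 0 (0)" message discussed in the thread: the suspicion there was that the CLOSE_THE_SOCKET macro resets errno before the failure is reported. A minimal sketch of the usual remedy in C follows: save errno before the cleanup call and restore it afterwards. The names close_preserving_errno() and report_failure() are hypothetical; they are not taken from the Open MPI sources or from oob.patch, and the sketch only illustrates the general pattern.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: close fd without letting close() disturb the errno
 * value left behind by an earlier failed call. */
int close_preserving_errno(int fd)
{
    int saved = errno;   /* remember the error from the earlier failure */
    int rc = close(fd);  /* may overwrite errno, e.g. with EBADF */
    errno = saved;       /* restore it so the caller reports the real cause */
    return rc;
}

/* Hypothetical stand-in for an error path such as a failed accept():
 * failure_errno plays the role of the value accept() stored in errno.
 * Returns the errno value that would end up in the error message. */
int report_failure(int fd, int failure_errno)
{
    assert(failure_errno != 0);  /* a real failure always has a nonzero code */
    errno = failure_errno;
    close_preserving_errno(fd);
    return errno;
}
```

With this pattern, calling report_failure(fd, EAGAIN) returns EAGAIN whether or not the close() itself succeeds, so a message built from errno after the cleanup would still show the original error code rather than whatever the cleanup left behind.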