Gilles,

I am running the build of 1.8.3 first.
As you suggest, I will only try without -m64 if 1.8.3 runs with it.

Regarding "-mt" my understanding from "man cc" is that it has a DUAL
function:
1) Passes -D_REENTRANT to the preprocess stage (if any)
2) Passes "the right flags" to the linker stage (if any)

NOTE that since Solaris has historically supported a "native" (non-POSIX)
threads library, when linking for pthreads one must pass BOTH "-mt" and
"-lpthread", and they must be in that order.

I *think* have already tried adding "-mt" to both CFLAGS and LDFLAGS, but
am not 100% sure I've done so correctly.  I believe I need to configure with

   CFLAGS="-m64 -mt" --with-wrapper-cflags="-m64 -mt" \
   LDFLAGS="-mt" --with-wrapper-ldflags="-mt"

if I am to be sure that orterun and the app are both compiled and linked
with "-mt".
Is that right?

-Paul

On Tue, Dec 16, 2014 at 6:25 PM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:
>
>  Thanks Paul,
>
> if 1.8.3 with -m64 and the same compilers runs fine, then please do not
> bother running 1.8.4rc4 without -m64.
> /* i understand you are busy and i hardly believe -m64 is the root cause */
>
> a regression i can think of involves the flags we use for pthreads :
> for bad reasons, we initially tested the following flags on solaris :
> -pthread
> -pthreads
> -mt
>
> with solarisstudio 12.4, -mt was chosen
>
> 1.8.4rc4 has a bug (fixed in the v1.8 git): -D_REENTRANT is not
> automatically added, so you have to do it manually.
> i just figured out that -mt is unlikely automatically.
> do we need this and where ?
> CFLAGS ? (or is -D_REENTRANT enough ?)
> LDFLAGS ? (that might be solaris and/or solarisstudio (12.4) specific and
> i simply ignore it)
>
> Bottom line, i do invite you to test 1.8.4rc4 again and with
> CFLAGS="-mt"
> or
> CFLAGS="-mt -m64"
> if you previously tested 1.8.3 with -m64
>
> Cheers,
>
> Gilles
>
>
>
> On 2014/12/17 11:05, Paul Hargrove wrote:
>
> Gilles,
>
> First, please note that prior tests of 1.8.3 ran with no problems on these
> hosts.
> So, I *think* this problem is a regression.
> However, I am not 100% certain that this *exact* configuration was tested.
> So, I am RE-running a test of 1.8.3 now to be absolutely sure if this is a
> regression.
> I will report the outcome when I can.
>
> I have limited time to run the tests you are asking for.  I will do my
> best, but am concerned that I won't be responsive enough and may hold up
> the release.  I fully understand why you ask multiple questions in one
> email to keep things moving.
>
> I am running mpirun on pcp-j-20 and "getent hosts pcp-j-20" run there yields
>
> $ getent hosts pcp-j-20
> 127.0.0.1       pcp-j-20 pcp-j-20.local localhost loghost
> 172.16.0.120    pcp-j-20 pcp-j-20.local localhost loghost
>
> In case it matters: there is an entry for 172.18.0.0.120 in /etc/hosts as
> pcp-j-20-ib.
>
> I will run a test tonight to determine if the same issue is present without
> "-m64".
> I will report the outcome when I can.
>
> Yes, I can ping and ssh to "pcp-j-{19,20}" and "172.{16,18}.0.{119,120}".
> I see the following if run on either pcp-j-19 or pcp-j-20:
>
> $ for x in {pcp-j-,172.{16,18}.0.1}{19,20}; do ssh $x echo OK connecting to
> $x; done
> OK connecting to pcp-j-19
> OK connecting to pcp-j-20
> OK connecting to 172.16.0.119
> OK connecting to 172.16.0.120
> OK connecting to 172.18.0.119
> OK connecting to 172.18.0.120
>
>
> I will report on the 1.8.3 and the non-m64 runs when they are done.
> Meanwhile, if you have other things you want run let me know.
>
> -Paul
>
> On Tue, Dec 16, 2014 at 5:35 PM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
>
>  Thanks Paul,
>
> Are you invoking mpirun on pcp-j-20 ?
> If yes, what does
> getent hosts pcp-j-20
> says ?
>
> BTW, did you try without -m64 ?
>
> Does the following work
> ping/ssh 172.18.0.120
>
> Honestly, this output makes very little sense to me, so i am asking way
> too much info hoping i can reproduce this issue or get a hint on what can
> possibly goes wrong.
>
> Cheers,
>
> Gilles
>
> Paul Hargrove <phhargr...@lbl.gov> <phhargr...@lbl.gov> wrote:
> Gilles,
>
> I am running mpirun on a host that ALSO will run one of the application
> processes.
> Requested ifconfig and netstat outputs appear below.
>
> -Paul
>
> [phargrov@pcp-j-20 ~]$ ifconfig -a
> lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232
> index 1
>         inet 127.0.0.1 netmask ff000000
> bge0: flags=1004843<UP,BROADCAST,RUNNING,MULTICAST,DHCP,IPv4> mtu 1500
> index 2
>         inet 172.16.0.120 netmask ffff0000 broadcast 172.16.255.255
> pFFFF.ibp0: flags=1001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,FIXEDMTU>
> mtu 2044 index 3
>         inet 172.18.0.120 netmask ffff0000 broadcast 172.18.255.255
> lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252
> index 1
>         inet6 ::1/128
> bge0: flags=20002004841<UP,RUNNING,MULTICAST,DHCP,IPv6> mtu 1500 index 2
>         inet6 fe80::250:45ff:fe5c:2b0/10
> [phargrov@pcp-j-20 ~]$ netstat -nr
>
> Routing Table: IPv4
>   Destination           Gateway           Flags  Ref     Use     Interface
> -------------------- -------------------- ----- ----- ---------- ---------
> default              172.16.254.1         UG        2     158463 bge0
> 127.0.0.1            127.0.0.1            UH        5     398913 lo0
> 172.16.0.0           172.16.0.120         U         4  135241319 bge0
> 172.18.0.0           172.18.0.120         U         3         26
> pFFFF.ibp0
>
> Routing Table: IPv6
>   Destination/Mask            Gateway                   Flags Ref   Use
>  If
> --------------------------- --------------------------- ----- --- -------
> -----
> ::1                         ::1                         UH      2       0
> lo0
> fe80::/10                   fe80::250:45ff:fe5c:2b0     U       2       0
> bge0
>
> On Tue, Dec 16, 2014 at 2:55 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@iferc.org> wrote:
>
>
>  Paul,
>
> could you please send the output of
> ifconfig -a
> netstat -nr
>
> on the three hosts you are using
> (i assume you are still invoking mpirun from one node, and tasks are
> running on two other nodes)
>
> Cheers,
>
> Gilles
>
>
> On 2014/12/16 16:00, Paul Hargrove wrote:
>
> Gilles,
>
> I looked again carefully and I am *NOT* finding -D_REENTRANT passed to most
> compilations.
> It appears to be used for building libevent and vt, but nothing else.
> The output from configure contains
>
> checking if more special flags are required for pthreads... -D_REENTRANT
>
> only in the libevent and vt sub-configure portions.
>
> When configured for gcc on Solaris-11 I see the following in configure
>
> checking for C optimization flags... -m64 -D_REENTRANT -g
> -finline-functions -fno-strict-aliasing
>
> but with CC=cc the equivalent line is
>
> checking for C optimization flags... -m64 -g
>
> In both cases the "-m64" is from the CFLAGS I have passed to configure.
>
> However, when I use CFLAGS="-m64 -D_REENTRANT" the problem DOES NOT go away.
> I see
>
> [pcp-j-20:24740] mca_oob_tcp_accept: accept() failed: Error 0 (11).
> ------------------------------------------------------------
> A process or daemon was unable to complete a TCP connection
> to another process:
>   Local host:    pcp-j-20
>   Remote host:   172.18.0.120
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
> ------------------------------------------------------------
>
> which is at least appears to have a non-zero errno.
> A quick grep through /usr/include/sys/errno shows 11 is EAGAIN.
>
> With the oob.patch you provided the failed accept goes away, BUT the
> connection still fails:
>
> ------------------------------------------------------------
> A process or daemon was unable to complete a TCP connection
> to another process:
>   Local host:    pcp-j-20
>   Remote host:   172.18.0.120
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
> ------------------------------
> ------------------------------
>
>
> Use of "-mca oob_tcp_if_include bge0" to use a single interface did not fix
> this.
>
>
> -Paul
>
> On Mon, Dec 15, 2014 at 7:18 PM, Paul Hargrove <phhargr...@lbl.gov> 
> <phhargr...@lbl.gov> <phhargr...@lbl.gov> <phhargr...@lbl.gov> wrote:
>
>  Gilles,
>
> I am NOT seeing the problem with gcc.
> It is only occurring with the Studio compilers.
>
> As I've already reported, I have tried adding either "-mt" or "-mt=yes" to
> both LDFLAGS and --with-wrapper-ldflags.
>
> The "cc" manpage (on the Solaris-10 system I can get to right now) says:
>
>      -mt  Compile and link for multithreaded code.
>
>           This option passes -D_REENTRANT to the preprocessor and
>           passes -lthread in the correct order to ld.
>
>           The -mt option is required if the application or
>           libraries are multithreaded.
>
>           To ensure proper library linking order, you must use
>           this option, rather than -lthread, to link with lib-
>           thread.
>
>           If you are using POSIX threads, you must link with the
>           options -mt -lpthread.  The -mt option is necessary
>           because libC and libCrun need libthread for a mul-
>           tithreaded application.
>
>           If you compile and link in separate steps and you com-
>           pile with -mt, you might get unexpected results. If you
>           compile one translation unit with -mt, compile all
>           units of the program with -mt.
>
> I cannot connect to my Solaris-11 system right now, but I recall the text
> to be quite similar.
>
> -Paul
>
> On Mon, Dec 15, 2014 at 7:12 PM, Gilles Gouaillardet 
> <gilles.gouaillar...@iferc.org> <gilles.gouaillar...@iferc.org> wrote:
>
>
>   Paul,
>
> did you manually set -mt ?
>
> if i remember correctly, solaris 11 (at least with gcc compilers) do not
> need any flags
> (except the -D_REENTRANT that is added automatically)
>
> Cheers,
>
> Gilles
>
>
> On 2014/12/16 12:10, Paul Hargrove wrote:
>
> Gilles,
>
> I will try the patch when I can.
> However, our network is undergoing network maintenance right now, leaving
> me unable to reach the necessary hosts.
>
> As for -D_REENTRANT, I had already reported having verified in the "make"
> output that it had been added automatically.
>
> Additionally, the docs say that "-mt" *also* passes -D_REENTRANT to the
> preprocessor.
>
> -Paul
>
> On Mon, Dec 15, 2014 at 6:07 PM, Gilles Gouaillardet 
> <gilles.gouaillar...@iferc.org> <gilles.gouaillar...@iferc.org> 
> <gilles.gouaillar...@iferc.org> <gilles.gouaillar...@iferc.org> wrote:
>
>
>  Paul,
>
> could you please make sure configure added  "-D_REENTRANT" to the CFLAGS ?
> /* otherwise, errno is a global variable instead of a per thread variable,
> which can
> explains some weird behaviour. note this should have been already fixed */
>
> assuming -D_REENTRANT is set, could you please give the attached patch a
> try ?
>
> i suspect the CLOSE_THE_SOCKET macro resets errno, and hence the confusing
> error message
> e.g. failed: Error 0 (0)
>
> FWIW, master is also affected.
>
> Cheers,
>
> Gilles
>
>
> On 2014/12/16 10:47, Paul Hargrove wrote:
>
> I have tried with a oob_tcp_if_include setting so that there is now only 1
> interface.
> Even with just one interface and -mt=yes in both LDFLAGS and
> wrapper-ldflags I *still* getting messages like
>
> [pcp-j-20:11470] mca_oob_tcp_accept: accept() failed: Error 0 (0).
> ------------------------------
> ------------------------------
> A process or daemon was unable to complete a TCP connection
> to another process:
>   Local host:    pcp-j-20
>   Remote host:   172.16.0.120
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
> ------------------------------
> ------------------------------
>
>
> I am getting less certain that my speculation about thread-safe libs is
> correct.
>
> -Paul
>
> On Mon, Dec 15, 2014 at 1:24 PM, Paul Hargrove <phhargr...@lbl.gov> 
> <phhargr...@lbl.gov> <phhargr...@lbl.gov> <phhargr...@lbl.gov> 
> <phhargr...@lbl.gov> <phhargr...@lbl.gov> <phhargr...@lbl.gov> 
> <phhargr...@lbl.gov> <phhargr...@lbl.gov> <phhargr...@lbl.gov> 
> <phhargr...@lbl.gov> <phhargr...@lbl.gov> <phhargr...@lbl.gov> 
> <phhargr...@lbl.gov> <phhargr...@lbl.gov> <phhargr...@lbl.gov> wrote:
>
>  A little more reading finds that...
>
> Docs says that one needs "-mt" without the "=yes".
> That will work for both old and new compilers, where "-mt=yes" chokes
> older ones.
>
> Also, man pages say "-mt" must come before "-lpthread" in the link command.
>
> -Paul
>
> On Mon, Dec 15, 2014 at 12:52 PM, Paul Hargrove <phhargr...@lbl.gov> 
> <phhargr...@lbl.gov> <phhargr...@lbl.gov> <phhargr...@lbl.gov> 
> <phhargr...@lbl.gov> <phhargr...@lbl.gov> <phhargr...@lbl.gov> 
> <phhargr...@lbl.gov> <phhargr...@lbl.gov> <phhargr...@lbl.gov> 
> <phhargr...@lbl.gov> <phhargr...@lbl.gov> <phhargr...@lbl.gov> 
> <phhargr...@lbl.gov> <phhargr...@lbl.gov> <phhargr...@lbl.gov>
> wrote:
>
>
> On Mon, Dec 15, 2014 at 5:35 AM, Ralph Castain <r...@open-mpi.org> 
> <r...@open-mpi.org> <r...@open-mpi.org> <r...@open-mpi.org> 
> <r...@open-mpi.org> <r...@open-mpi.org> <r...@open-mpi.org> 
> <r...@open-mpi.org> <r...@open-mpi.org> <r...@open-mpi.org> 
> <r...@open-mpi.org> <r...@open-mpi.org> <r...@open-mpi.org> 
> <r...@open-mpi.org> <r...@open-mpi.org> <r...@open-mpi.org> wrote:
>
>  7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the
> multi-threaded C libraries, apparently need "-mt=yes" in both compile and
> link. Need someone to investigate.
>
>
> The lack of multi-thread libraries is my SPECULATION.
>
> The fact that configuring with LDFLAGS=-mt=yes did not help may or may
> not prove anything.
> I didn't see them in "mpicc -show" and so maybe they needed to be in
> wrapper-ldflags instead.
> My time this week is quite limited, but I can "fire an forget" tests of
> any tarballs you provide.
>
> -Paul
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
>
>
>
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>
>
>
> _______________________________________________
> devel mailing listde...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16607.php
>
>
>
> _______________________________________________
> devel mailing listde...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> Link to this 
> post:http://www.open-mpi.org/community/lists/devel/2014/12/16608.php
>
>
>
> _______________________________________________
> devel mailing listde...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16610.php
>
>
>
> _______________________________________________
> devel mailing listde...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> Link to this 
> post:http://www.open-mpi.org/community/lists/devel/2014/12/16611.php
>
>  --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>
>
>
> _______________________________________________
> devel mailing listde...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16613.php
>
>
>
> _______________________________________________
> devel mailing listde...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this 
> post:http://www.open-mpi.org/community/lists/devel/2014/12/16615.php
>
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>
> _______________________________________________
> devel mailing listde...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this 
> post:http://www.open-mpi.org/community/lists/devel/2014/12/16619.php
>
>
>
> _______________________________________________
> devel mailing listde...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16620.php
>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16621.php
>


-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900

Reply via email to