Paul, could you please make sure configure added "-D_REENTRANT" to the CFLAGS? /* otherwise, errno is a global variable instead of a per-thread variable, which can explain some weird behaviour. Note this should already have been fixed */
Assuming -D_REENTRANT is set, could you please give the attached patch a try?
I suspect the CLOSE_THE_SOCKET macro resets errno, which would explain the confusing error message, e.g.
    failed: Error 0 (0)

FWIW, master is also affected.

Cheers,

Gilles

On 2014/12/16 10:47, Paul Hargrove wrote:
> I have tried with a oob_tcp_if_include setting so that there is now only 1
> interface.
> Even with just one interface and -mt=yes in both LDFLAGS and
> wrapper-ldflags I *still* getting messages like
>
> [pcp-j-20:11470] mca_oob_tcp_accept: accept() failed: Error 0 (0).
> ------------------------------------------------------------
> A process or daemon was unable to complete a TCP connection
> to another process:
>   Local host:    pcp-j-20
>   Remote host:   172.16.0.120
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
> ------------------------------------------------------------
>
>
> I am getting less certain that my speculation about thread-safe libs is
> correct.
>
> -Paul
>
> On Mon, Dec 15, 2014 at 1:24 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>> A little more reading finds that...
>>
>> Docs says that one needs "-mt" without the "=yes".
>> That will work for both old and new compilers, where "-mt=yes" chokes
>> older ones.
>>
>> Also, man pages say "-mt" must come before "-lpthread" in the link command.
>>
>> -Paul
>>
>> On Mon, Dec 15, 2014 at 12:52 PM, Paul Hargrove <phhargr...@lbl.gov>
>> wrote:
>>>
>>> On Mon, Dec 15, 2014 at 5:35 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> 7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the
>>>> multi-threaded C libraries, apparently need "-mt=yes" in both compile and
>>>> link. Need someone to investigate.
>>>
>>> The lack of multi-thread libraries is my SPECULATION.
>>>
>>> The fact that configuring with LDFLAGS=-mt=yes did not help may or may
>>> not prove anything.
>>> I didn't see them in "mpicc -show" and so maybe they needed to be in
>>> wrapper-ldflags instead.
>>> My time this week is quite limited, but I can "fire an forget" tests of
>>> any tarballs you provide.
>>>
>>> -Paul
>>>
>>> --
>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>> Computer Languages & Systems Software (CLaSS) Group
>>> Computer Science Department               Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>
>>
>> --
>> Paul H. Hargrove                          phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department               Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>
>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16607.php

diff --git a/orte/mca/oob/tcp/oob_tcp_listener.c b/orte/mca/oob/tcp/oob_tcp_listener.c
index b6d2ad8..87ff08d 100644
--- a/orte/mca/oob/tcp/oob_tcp_listener.c
+++ b/orte/mca/oob/tcp/oob_tcp_listener.c
@@ -14,6 +14,8 @@
  * Copyright (c) 2009-2014 Cisco Systems, Inc.  All rights reserved.
  * Copyright (c) 2011      Oak Ridge National Labs.  All rights reserved.
  * Copyright (c) 2013-2014 Intel, Inc.  All rights reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -729,7 +731,6 @@ static void* listen_thread(opal_object_t *obj)
                 if (pending_connection->fd < 0) {
                     if (opal_socket_errno != EAGAIN ||
                         opal_socket_errno != EWOULDBLOCK) {
-                        CLOSE_THE_SOCKET(pending_connection->fd);
                         if (EMFILE == opal_socket_errno) {
                             ORTE_ERROR_LOG(ORTE_ERR_SYS_LIMITS_SOCKETS);
                             orte_show_help("help-orterun.txt", "orterun:sys-limit-sockets", true);
@@ -737,6 +738,7 @@ static void* listen_thread(opal_object_t *obj)
                             opal_output(0, "mca_oob_tcp_accept: accept() failed: %s (%d).",
                                         strerror(opal_socket_errno), opal_socket_errno);
                         }
+                        CLOSE_THE_SOCKET(pending_connection->fd);
                         OBJ_RELEASE(pending_connection);
                         goto done;
                     }