Ralph,
the root cause is that
getsockopt(..., SOL_SOCKET, SO_RCVTIMEO, ...)
fails with errno ENOPROTOOPT on Solaris 11.2.
The attached patch is a proof of concept and works for me. In short:
/* if ENOPROTOOPT, do not try to set and restore SO_RCVTIMEO */
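As a minimal standalone sketch of that approach (illustrative only, not the attached patch itself): the helper names try_set_recv_timeout and restore_recv_timeout are hypothetical and do not exist in pmix. The sketch assumes a POSIX socket API and, like the patch, tolerates ENOPROTOOPT only on the initial getsockopt call; any other failure is reported as an error.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: save the current SO_RCVTIMEO of `sd` in *saved and
 * install a timeout of `seconds`.
 * Returns  1 if the timeout was installed (restore it when done),
 *          0 if getsockopt reports ENOPROTOOPT (option unsupported, so the
 *            caller proceeds without a timeout and must not restore),
 *         -1 on any other error. */
static int try_set_recv_timeout(int sd, long seconds, struct timeval *saved)
{
    socklen_t sz = sizeof(*saved);
    struct timeval tv;

    /* get the current timeout value so it can be restored later */
    if (0 != getsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, (void *)saved, &sz)) {
        return (ENOPROTOOPT == errno) ? 0 : -1;
    }

    /* set a timeout on the blocking recv so the caller does not hang */
    tv.tv_sec = seconds;
    tv.tv_usec = 0;
    if (0 != setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv))) {
        return -1;
    }
    return 1;
}

/* Hypothetical helper: put back the value saved by try_set_recv_timeout().
 * Only call this when try_set_recv_timeout() returned 1. */
static int restore_recv_timeout(int sd, const struct timeval *saved)
{
    return setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, saved, sizeof(*saved));
}
```

A caller would do the blocking recv between the two calls and skip the restore step whenever the first helper returned 0. Note that in that case the recv has no timeout at all, which is the same trade-off the proof of concept makes.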
Cheers,
Gilles
On 9/21/2015 2:16 PM, Paul Hargrove wrote:
Ralph,
Just as you say:
The first 64s pause was before the hwloc error message appeared.
The second was after the second server_setup_fork appears, and before
whatever line came after that.
I don't know if stdio buffering may be "distorting" the placement of
the pause relative to the lines of output.
However, prior to your patch the entire failed mpirun was around 1s.
No allocation.
No resource manager.
Just a single workstation.
-Paul
On Sun, Sep 20, 2015 at 9:32 PM, Ralph Castain <r...@open-mpi.org> wrote:
Just so this old fossilized brain gets this right: you are
saying there was a 64s pause before the hwloc error appeared, and
then another 64s pause after the second server_setup_fork message
appeared?
If that’s true, then I’m chasing the wrong problem - it sounds
like something is messed up in the mpirun startup. Did you have
more than one node in the allocation by chance? I’m wondering if
we are getting held up by something in the daemon launch/callback
area.
On Sep 20, 2015, at 4:08 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
Ralph,
Still failing with that patch, but with the addition of a fairly
long pause (64s) before the first error message appears, and
again after the second "server setup_fork" (64s again)
New output is attached.
-Paul
On Sun, Sep 20, 2015 at 2:15 PM, Ralph Castain <r...@open-mpi.org> wrote:
Argh - found a typo in the output line. Could you please try
the attached patch and do it again? This might fix it, but if
not it will provide me with some idea of the returned error.
Thanks
Ralph
On Sep 20, 2015, at 12:40 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
Yes, it is definitely at 10.
Another attempt is attached.
-Paul
On Sun, Sep 20, 2015 at 8:19 AM, Ralph Castain <r...@open-mpi.org> wrote:
Paul - can you please confirm that you gave mpirun a
level of 10 for the pmix_base_verbose param? This output
isn’t what I would have expected from that level - it
looks more like the verbosity was set to 5, and so the
error number isn’t printed.
Thanks
Ralph
On Sep 20, 2015, at 3:42 AM, Gilles Gouaillardet
<gilles.gouaillar...@gmail.com> wrote:
Paul,
I do not remember it like that ...
At that time, the issue in ompi was that the global
errno was used instead of the per-thread errno.
Though the man page says -mt should be used for
multithreaded apps, you tried -D_REENTRANT on all your
platforms, and it was enough to get the expected result.
I just wanted to check that the pmix1xx (sub)configure
correctly passes the -D_REENTRANT flag, and it does, so
this is very likely a new and unrelated error.
Cheers,
Gilles
On Sunday, September 20, 2015, Paul Hargrove <phhargr...@lbl.gov> wrote:
Gilles,
Yes every $CC invocation in opal/mca/pmix/pmix1xx
includes "-D_REENTRANT".
However, they don't include "-mt".
I believe we concluded (when we had problems
previously) that "-mt" was the proper flag (at
compile and link) for multi-threaded with the
Studio compilers.
-Paul
On Sat, Sep 19, 2015 at 11:29 PM, Gilles
Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
Paul,
Can you please double check pmix1xx is compiled
with -D_REENTRANT ?
We ran into similar issues in the past, and
they only occurred with Solaris
Cheers,
Gilles
On Sunday, September 20, 2015, Paul Hargrove
<phhargr...@lbl.gov> wrote:
Ralph,
The output from the requested run is attached.
-Paul
On Sat, Sep 19, 2015 at 9:46 PM, Ralph
Castain <r...@open-mpi.org> wrote:
Ah, okay - that makes more sense. I’ll
have to let Brice see if he can figure
out how to silence the hwloc error
message as I can’t find where it came
from. The other errors are real and are
the reason why the job was terminated.
The problem is that we are trying to
establish a communication between the
app and the daemon via unix domain
socket, and we failed to do so. The
error tells me that we were able to
create and connect to the socket, but
failed when the daemon tried to do a
blocking send to the app.
Can you rerun it with -mca
pmix_base_verbose 10? It will tell us
the value of the error number that was
returned
Thanks
Ralph
On Sep 19, 2015, at 9:37 PM, Paul
Hargrove <phhargr...@lbl.gov> wrote:
Ralph,
No it did not run.
The complete output (which I really
should have included in the first
place) is below.
-Paul
$ mpirun -mca btl sm,self -np 2
examples/ring_c'
Error opening /devices/pci@0,0:reg:
Permission denied
[pcp-d-3:26054] PMIX ERROR: ERROR in
file
/export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
at line 181
[pcp-d-3:26053] PMIX ERROR:
UNREACHABLE in file
/export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
at line 463
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some
reason; your parallel process is
likely to abort. There are many
reasons that a parallel process can
fail during MPI_INIT; some of which
are due to configuration or environment
problems. This failure appears to be
an internal failure; here's some
additional information (which may only
be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "(null)" (-43) instead of
"Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in
this communicator will now abort,
*** and potentially your MPI job)
[pcp-d-3:26054] Local abort before
MPI_INIT completed completed
successfully, but am not able to
aggregate error messages, and not able
to guarantee that all other processes
were killed!
-------------------------------------------------------
Primary job terminated normally, but
1 process returned
a non-zero exit code.. Per
user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more
processes exited with non-zero status,
thus causing
the job to be terminated. The first
process to do so was:
Process name: [[11371,1],0]
Exit code: 1
--------------------------------------------------------------------------
On Sat, Sep 19, 2015 at 8:50 PM, Ralph
Castain <r...@open-mpi.org> wrote:
Paul, can you clarify something
for me? The error in this case
indicates that the client wasn’t
able to reach the daemon - this
should have resulted in
termination of the job. Did the
job actually run?
On Sep 18, 2015, at 2:50 AM,
Ralph Castain <r...@open-mpi.org>
wrote:
I'm on travel right now, but it
should be an easy fix when I
return. Sorry for the annoyance
On Thu, Sep 17, 2015 at 11:13 PM,
Paul Hargrove <phhargr...@lbl.gov> wrote:
Any suggestion how I (as a
non-root user) can avoid
seeing this hwloc error
message on every run?
-Paul
On Thu, Sep 17, 2015 at 11:00 PM,
Gilles Gouaillardet <gil...@rist.or.jp> wrote:
Paul,
IIRC, the "Permission
denied" is coming from
hwloc that cannot collect
all the info it would like.
Cheers,
Gilles
On 9/18/2015 2:34 PM,
Paul Hargrove wrote:
Tried tonight's master
tarball on Solaris 11.2
on x86-64 with the
Studio Compilers
(default ILP32 output)
and saw the following
result
$ mpirun -mca btl
sm,self -np 2
examples/ring_c'
Error opening
/devices/pci@0,0:reg:
Permission denied
[pcp-d-4:00492] PMIX
ERROR: ERROR in file
/export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
at line 181
[pcp-d-4:00491] PMIX
ERROR: UNREACHABLE in
file
/export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
at line 463
I don't know if the
Permission denied error
is related to the
subsequent PMIX errors,
but any message that
says "UNREACHABLE" is
clearly worth reporting.
-Paul
--
Paul H. Hargrove phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: http://www.open-mpi.org/community/lists/devel/2015/09/18074.php
diff --git a/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c b/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
index 61f617a..fcd08de 100644
--- a/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
+++ b/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
@@ -809,6 +809,7 @@ static pmix_status_t recv_connect_ack(int sd)
     pmix_status_t rc;
     struct timeval tv, save;
     pmix_socklen_t sz;
+    bool sockopt = true;
 
     pmix_output_verbose(2, pmix_globals.debug_output,
                         "pmix: RECV CONNECT ACK FROM SERVER");
@@ -816,14 +817,20 @@ static pmix_status_t recv_connect_ack(int sd)
     /* get the current timeout value so we can reset to it */
     sz = sizeof(save);
     if (0 != getsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, (void*)&save, &sz)) {
-        return PMIX_ERR_UNREACH;
-    }
-
-    /* set a timeout on the blocking recv so we don't hang */
-    tv.tv_sec = 2;
-    tv.tv_usec = 0;
-    if (0 != setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv))) {
-        return PMIX_ERR_UNREACH;
+        if (ENOPROTOOPT == errno) {
+            sockopt = false;
+        } else {
+            return PMIX_ERR_UNREACH;
+        }
+    } else {
+        /* set a timeout on the blocking recv so we don't hang */
+        tv.tv_sec = 2;
+        tv.tv_usec = 0;
+        if (0 != setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv))) {
+            pmix_output_verbose(2, pmix_globals.debug_output,
+                                "pmix: recv_connect_ack could not setsockopt SO_RCVTIMEO");
+            return PMIX_ERR_UNREACH;
+        }
     }
 
     /* receive the status reply */
@@ -855,9 +862,11 @@ static pmix_status_t recv_connect_ack(int sd)
         return rc;
     }
 
-    /* return the socket to normal */
-    if (0 != setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &save, sz)) {
-        return PMIX_ERR_UNREACH;
+    if (sockopt) {
+        /* return the socket to normal */
+        if (0 != setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &save, sz)) {
+            return PMIX_ERR_UNREACH;
+        }
     }
 
     return PMIX_SUCCESS;