Eric,
I expect the PR will fix this bug.
The crash occurs after the unexpected process identifier error, and this
error should not happen in the first place. So at this stage, I would
not worry too much about that crash (to me, it is undefined behavior
anyway).
Cheers,
Gilles
On Friday, September 16, 2016, Eric Chamberland
<eric.chamberl...@giref.ulaval.ca> wrote:
Hi,
I know the pull request has not (yet) been merged, but here is a
somewhat "different" output from a single sequential test
(automatically) launched without mpirun last night:
[lorien:172229] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[lorien:172229] plm:base:set_hnp_name: initial bias 172229 nodename hash 1366255883
[lorien:172229] plm:base:set_hnp_name: final jobfam 39075
[lorien:172229] [[39075,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:172229] [[39075,0],0] plm:base:receive start comm
[lorien:172229] [[39075,0],0] plm:base:launch [39075,1] registered
[lorien:172229] [[39075,0],0] plm:base:launch job [39075,1] is not a dynamic spawn
[lorien:172218] [[41545,589],0] usock_peer_recv_connect_ack: received unexpected process identifier [[41545,0],0] from [[39075,0],0]
[lorien:172218] *** Process received signal ***
[lorien:172218] Signal: Segmentation fault (11)
[lorien:172218] Signal code: Invalid permissions (2)
[lorien:172218] Failing at address: 0x2d07e00
[lorien:172218] [ 0] [lorien:172229] [[39075,0],0] plm:base:receive stop comm
Unfortunately, I didn't get any core dump (???). Is the line
[lorien:172218] Signal code: Invalid permissions (2)
curious, or not?
as usual, here are the build logs:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_ompi_info_all.txt
Will PR #1376 prevent or fix this too?
Thanks again!
Eric
On 15/09/16 09:32 AM, Eric Chamberland wrote:
Hi Gilles,
On 15/09/16 03:38 AM, Gilles Gouaillardet wrote:
Eric,
a bug has been identified, and a patch is available at
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch
the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1
./a.out), so if applying a patch does not fit your test workflow,
it might be easier for you to update it and run mpirun -np 1 ./a.out
instead of ./a.out
basically, increasing verbosity runs some extra code, which includes
sprintf. So yes, it is possible to crash an app by increasing verbosity,
by running into a bug that is hidden under normal operation.
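(For illustration only: a minimal C sketch, not the actual Open MPI code, of how a bug can stay hidden until verbosity is raised. The sprintf below only runs at a high verbosity level, and deliberately overflows a buffer that is too small for some inputs.)

    #include <stdio.h>

    static int verbose_level = 0;              /* normally 0: the buggy path never runs */

    static void log_peer(const char *peer)
    {
        char buf[16];                          /* deliberately too small for long names */
        if (verbose_level >= 5) {
            sprintf(buf, "peer: %s", peer);    /* overflows for long peers -> may crash */
            fprintf(stderr, "%s\n", buf);
        }
    }

    int main(void)
    {
        verbose_level = 5;                     /* e.g. set via OMPI_MCA_plm_base_verbose=5 */
        log_peer("a-rather-long-peer-name-that-does-not-fit");
        return 0;
    }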
my intuition suggests this is quite unlikely ... if you can get a core
file and a backtrace, we will soon find out
Damn! I did get one, but it got erased last night when the automatic
process started again... (which erases all directories before
starting) :/
I would like to put core files in a user-specific directory, but it
seems that has to be a system-wide configuration... :/ I will work
around this by changing the "pwd" to a path outside the erased
directories... So as of tonight I should be able to retrieve core files
even after I relaunch the process.
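(As a side note, here is a minimal C sketch of the same trick done from inside the test program instead of the launch script. CORE_DIR is a hypothetical persistent path, and this assumes the hard limit allows unlimited core files.)

    #include <sys/resource.h>
    #include <unistd.h>

    #define CORE_DIR "/home/cmpbib/cores"      /* hypothetical directory that is never erased */

    static void keep_cores(void)
    {
        struct rlimit rl = { RLIM_INFINITY, RLIM_INFINITY };
        setrlimit(RLIMIT_CORE, &rl);           /* same effect as "ulimit -c unlimited" */
        if (chdir(CORE_DIR) != 0) {            /* default core files land in the cwd */
            /* fall back: keep the current directory */
        }
    }

    int main(void)
    {
        keep_cores();
        /* ... rest of the test ... */
        return 0;
    }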
Thanks for all the support!
Eric
Cheers,
Gilles
On 9/15/2016 2:58 AM, Eric Chamberland wrote:
Ok,
one test segfaulted, *but* I can't tell if it is the *same* bug because
this time there has been a segfault:
stderr:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
[lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 1366255883
[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
[lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:190552] [[53310,0],0] plm:base:receive start comm
*** Error in `orted': realloc(): invalid next size: 0x0000000001e58770 ***
...
...
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 573
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 163
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[lorien:190306] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
stdout:
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127)
      instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127)
      instead of "Success" (0)
--------------------------------------------------------------------------
openmpi content of $TMP:
/tmp/tmp.GoQXICeyJl> ls -la
total 1500
drwx------    3 cmpbib bib    250 Sep 14 13:34 .
drwxrwxrwt  356 root   root 61440 Sep 14 13:45 ..
...
drwx------ 1848 cmpbib bib  45056 Sep 14 13:34 openmpi-sessions-40031@lorien_0
srw-rw-r--    1 cmpbib bib      0 Sep 14 12:24 pmix-190552

cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find . -type f
./53310/contact.txt

cat 53310/contact.txt
3493724160.0;usock;tcp://132.203.7.36:54605
190552
egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr* | grep 53310
dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
(this is the faulty test)
full egrep:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt
config.log:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log
ompi_info:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt
Maybe it aborted (instead of giving the other message) while reporting
the error, because of export OMPI_MCA_plm_base_verbose=5?
Thanks,
Eric
On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
Eric,
do you mean you have a unique $TMP per a.out?
or a unique $TMP per "batch" of runs?
in the first case, my understanding is that conflicts cannot happen ...
once you hit the bug, can you please post the output of the failed
a.out, and run
egrep 'jobfam|stop'
on all your logs, so we might spot a conflict
Cheers,
Gilles
On Wednesday, September 14, 2016, Eric Chamberland
<eric.chamberl...@giref.ulaval.ca> wrote:
Lucky!
Since each run has a specific TMP, I still have it on disk.
for the faulty run, the TMP variable was:
TMP=/tmp/tmp.wOv5dkNaSI
and in $TMP I have:
openmpi-sessions-40031@lorien_0
and in this subdirectory I have a bunch of empty dirs:
cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |wc -l
1841
cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |more
total 68
drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
...
If I do:
lsof |grep "openmpi-sessions-40031"
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
      Output information may be incomplete.
lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
      Output information may be incomplete.
nothing...
What else may I check?
Eric
On 14/09/16 08:47 AM, Joshua Ladd wrote:
Hi, Eric
I **think** this might be related to the following:
https://github.com/pmix/master/pull/145
I'm wondering if you can look into the /tmp directory and see if you
have a bunch of stale usock files.
Best,
Josh
On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet
<gil...@rist.or.jp> wrote:
Eric,
can you please provide more information on how your tests are launched?
do you
mpirun -np 1 ./a.out
or do you simply
./a.out
do you use a batch manager? if yes, which one?
do you run one test per job? or multiple tests per job?
how are these tests launched?
does the test that crashes use MPI_Comm_spawn?
i am surprised by the process name [[9325,5754],0], which suggests
MPI_Comm_spawn was called 5753 times (!)
can you also run
hostname
on the 'lorien' host?
if you configured Open MPI with --enable-debug, can you
export OMPI_MCA_plm_base_verbose=5
then run one test and post the logs?
from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
produce job family 5576 (but you get 9325).
the discrepancy could be explained by the use of a batch manager and/or
a full hostname i am unaware of.
orte_plm_base_set_hnp_name() generates a 16-bit job family from the
(32-bit hash of the) hostname and the mpirun (32-bit?) pid.
so strictly speaking, it is possible that two jobs launched on the same
node are assigned the same 16-bit job family.
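(To make the point concrete, a toy C sketch with a made-up mixing function, not ORTE's actual hash, showing how two different pids on the same node can fold to the same 16-bit job family.)

    #include <stdint.h>
    #include <stdio.h>

    /* toy stand-in for the jobfam computation: fold hostname hash and pid into 16 bits */
    static uint16_t toy_jobfam(uint32_t hostname_hash, uint32_t pid)
    {
        return (uint16_t)((hostname_hash ^ pid) & 0xffff);
    }

    int main(void)
    {
        uint32_t hash  = 1366255883u;      /* the "nodename hash" printed for lorien */
        uint32_t pid_a = 142766u;
        uint32_t pid_b = pid_a + 0x10000u; /* differs only in bits above the low 16 */

        printf("pid %u -> jobfam %u\n", (unsigned)pid_a, (unsigned)toy_jobfam(hash, pid_a));
        printf("pid %u -> jobfam %u\n", (unsigned)pid_b, (unsigned)toy_jobfam(hash, pid_b)); /* same value */
        return 0;
    }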
the easiest way to detect this could be to
- edit orte/mca/plm/base/plm_base_jobid.c and replace
      OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
                           "plm:base:set_hnp_name: final jobfam %lu",
                           (unsigned long)jobfam));
  with
      OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
                           "plm:base:set_hnp_name: final jobfam %lu",
                           (unsigned long)jobfam));
- configure Open MPI with --enable-debug and rebuild
and then
export OMPI_MCA_plm_base_verbose=4
and run your tests.
when the problem occurs, you will be able to check which pids produced
the faulty jobfam, and that could hint at a conflict.
Cheers,
Gilles
On 9/14/2016 12:35 AM, Eric Chamberland wrote:
Hi,
It is the third time this has happened in the last 10 days.
While running nightly tests (~2200), we have one or two tests that fail
at the very beginning with this strange error:
[lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]
But I can't reproduce the problem right now... i.e., if I launch this
test alone "by hand", it is successful... the same test was successful
yesterday...
Is there some kind of "race condition" that can happen on the creation
of "tmp" files if many tests run together on the same node? (we are
oversubscribing even sequential runs...)
Here are the build logs:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
Thanks,
Eric
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel