On 14/09/16 01:36 AM, Gilles Gouaillardet wrote:
Eric,
can you please provide more information on how your tests are launched ?
Yes!
do you
mpirun -np 1 ./a.out
or do you simply
./a.out
For all sequential tests, we do ./a.out.
do you use a batch manager ? if yes, which one ?
No.
do you run one test per job ? or multiple tests per job ?
For this automated build run, up to 16 tests are launched concurrently.
how are these tests launched ?
For the sequential ones, the notable detail is that they are launched via a
Python Popen call, which runs "time", which in turn runs the executable
(a minimal sketch follows the command line below).
So the full command line is:
/usr/bin/time -v -o
/users/cmpbib/compilations/lorien/linux_dernier_ompi_leap/TV2016-09-14_03h03m15sEDT/opt/Test.Laplacien/Time.Laplacien3D.Dirichlet.mixte_tetra_prismetri.scalhier.txt
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.Laplacien.opt
mpi_v=2 verbose=True Beowulf=False outilMassif=False
outilPerfRecord=False verifValgrind=False outilPerfStat=False
outilCallgrind=False
RepertoireDestination=/users/cmpbib/compilations/lorien/linux_dernier_ompi_leap/TV2016-09-14_03h03m15sEDT/opt/Test.Laplacien
RepertoireTest=/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/TestValidation/Ressources/opt/Test.Laplacien
Prefixe=Laplacien3D.Dirichlet.mixte_tetra_prismetri.scalhier
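For reference, here is a minimal sketch (not our actual harness, with the
long paths abbreviated and the helper name made up) of that launch chain:
Popen starts /usr/bin/time -v, which in turn starts the test binary with
its key=value options.

import subprocess

def launch_sequential_test(time_log, binary, options):
    # Run one test under /usr/bin/time -v, writing the timing report to time_log.
    cmd = ["/usr/bin/time", "-v", "-o", time_log, binary] + options
    return subprocess.Popen(cmd)

# Up to 16 of these may be running concurrently on the same node.
proc = launch_sequential_test(
    "/users/cmpbib/.../Time.Laplacien3D.Dirichlet.mixte_tetra_prismetri.scalhier.txt",
    "/pmi/cmpbib/.../Test.Laplacien.opt",
    ["mpi_v=2", "verbose=True", "Beowulf=False"])
proc.wait()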
does the test that crashes use MPI_Comm_spawn ?
I am surprised by the process name [[9325,5754],0], which suggests
MPI_Comm_spawn was called 5753 times (!)
can you also run
hostname
on the 'lorien' host ?
[eric@lorien] Scripts (master $ u+1)> hostname
lorien
if you configure'd Open MPI with --enable-debug, can you
Yes.
export OMPI_MCA_plm_base_verbose=5
then run one test and post the logs ?
Hmmm, strange?
[lorien:93841] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh
path NULL
[lorien:93841] plm:base:set_hnp_name: initial bias 93841 nodename hash
1366255883
[lorien:93841] plm:base:set_hnp_name: final jobfam 22260
[lorien:93841] [[22260,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:93841] [[22260,0],0] plm:base:receive start comm
[lorien:93841] [[22260,0],0] plm:base:launch [22260,1] registered
[lorien:93841] [[22260,0],0] plm:base:launch job [22260,1] is not a
dynamic spawn
[lorien:93841] [[22260,0],0] plm:base:receive stop comm
From orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
produce job family 5576 (but you get 9325).
The discrepancy could be explained by the use of a batch manager and/or
a full hostname I am unaware of.
orte_plm_base_set_hnp_name() generates a 16-bit job family from the
(32-bit hash of the) hostname and the mpirun (32-bit?) pid.
So, strictly speaking, it is possible that two jobs launched on the same
node are assigned the same 16-bit job family (see the toy illustration below).
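To illustrate, here is a toy example in Python (this is NOT the exact ORTE
hashing formula, just a model of folding a 32-bit hostname hash and a pid
into 16 bits): with only 65536 possible job families per node, two
concurrent mpirun pids can end up with the same value.

import zlib

def toy_jobfam(hostname, pid):
    # fold a 32-bit hostname hash and the pid into a 16-bit job family
    hostname_hash = zlib.crc32(hostname.encode()) & 0xffffffff
    return (hostname_hash ^ pid) & 0xffff

seen = {}
for pid in range(2, 200000):  # scan a plausible pid range
    fam = toy_jobfam("lorien", pid)
    if fam in seen:
        print("pids %d and %d both map to jobfam %d" % (seen[fam], pid, fam))
        break
    seen[fam] = pid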
The easiest way to detect this could be to:
- edit orte/mca/plm/base/plm_base_jobid.c and replace
OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
"plm:base:set_hnp_name: final jobfam %lu",
(unsigned long)jobfam));
with
OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
"plm:base:set_hnp_name: final jobfam %lu",
(unsigned long)jobfam));
- configure Open MPI with --enable-debug and rebuild
- export OMPI_MCA_plm_base_verbose=4
- run your tests.
When the problem occurs, you will be able to check which pids produced
the faulty jobfam, and that could hint at a conflict.
Does this give the same output as export OMPI_MCA_plm_base_verbose=5
without the patch?
If so, because everything is automated, applying a patch is "harder" for me
than doing a simple export OMPI_MCA_plm_base_verbose=5, so maybe I could just
add OMPI_MCA_plm_base_verbose=5 to all tests (something like the sketch
below) and wait until it hangs?
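If that works, here is a sketch of what I have in mind (assuming our driver
keeps starting tests through subprocess.Popen; the helper name is made up):
inject the MCA variable into each test's environment instead of patching
Open MPI.

import os
import subprocess

def launch_with_plm_verbose(cmd, level=5):
    # copy the current environment and add the MCA verbosity variable
    env = dict(os.environ)
    env["OMPI_MCA_plm_base_verbose"] = str(level)
    return subprocess.Popen(cmd, env=env)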
Thanks!
Eric
Cheers,
Gilles
On 9/14/2016 12:35 AM, Eric Chamberland wrote:
Hi,
It is the third time this has happened in the last 10 days.
While running the nightly tests (~2200), we have one or two tests that
fail at the very beginning with this strange error:
[lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received
unexpected process identifier [[9325,0],0] from [[5590,0],0]
But I can't reproduce the problem right now... i.e., if I launch this
test alone "by hand", it succeeds... and the same test was successful
yesterday...
Is there some kind of "race condition" that can happen on the creation
of "tmp" files if many tests run together on the same node? (we are
oversubscribing even sequential runs...)
Here are the build logs:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
Thanks,
Eric
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel