Thanks Eric,

The goal of the patch is simply not to output info that is not needed by either orted or a.out (when you run ./a.out, an orted is forked under the hood), so the patch is really optional, though convenient.
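For illustration, here is a rough stand-alone sketch of the verbosity gating involved (the names are invented; this is not the actual OPAL_OUTPUT_VERBOSE / opal_output_verbose implementation): a message is emitted only when the configured MCA verbosity is at least the level the message is tagged with, so retagging the jobfam line from level 5 to level 4 lets it show up under OMPI_MCA_plm_base_verbose=4 without the rest of the level-5 chatter.

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

static int plm_verbose_level;   /* stand-in for the MCA-configured verbosity */

/* invented helper that mimics the gating idea of opal_output_verbose() */
static void toy_output_verbose(int level, const char *fmt, ...)
{
    if (level > plm_verbose_level) {
        return;   /* message level exceeds the configured verbosity: suppressed */
    }
    va_list ap;
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    fputc('\n', stderr);
    va_end(ap);
}

int main(void)
{
    const char *env = getenv("OMPI_MCA_plm_base_verbose");
    plm_verbose_level = env ? atoi(env) : 0;

    /* tagged at 5: visible only with verbose >= 5 (along with lots of other output) */
    toy_output_verbose(5, "plm:base:set_hnp_name: final jobfam %lu", 22260ul);
    /* retagged at 4 (as in the suggested edit below): visible with verbose=4, which stays quieter */
    toy_output_verbose(4, "plm:base:set_hnp_name: final jobfam %lu", 22260ul);
    return 0;
}

Run it with OMPI_MCA_plm_base_verbose=4 and then =5 in the environment to see which lines survive each threshold.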
Cheers,

Gilles

On Wednesday, September 14, 2016, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
>
> On 14/09/16 01:36 AM, Gilles Gouaillardet wrote:
>> Eric,
>>
>> can you please provide more information on how your tests are launched?
>>
> Yes!
>
>> do you
>>
>> mpirun -np 1 ./a.out
>>
>> or do you simply
>>
>> ./a.out
>>
> For all sequential tests, we do ./a.out.
>
>> do you use a batch manager? if yes, which one?
>
> No.
>
>> do you run one test per job? or multiple tests per job?
>
> On this automatic compilation, up to 16 tests are launched together.
>
>> how are these tests launched?
>
> For the sequential ones, the special thing is that they are launched via a Python Popen call, which launches "time", which launches the code.
>
> So the "full" command line is:
>
> /usr/bin/time -v -o /users/cmpbib/compilations/lorien/linux_dernier_ompi_leap/TV2016-09-14_03h03m15sEDT/opt/Test.Laplacien/Time.Laplacien3D.Dirichlet.mixte_tetra_prismetri.scalhier.txt /pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.Laplacien.opt mpi_v=2 verbose=True Beowulf=False outilMassif=False outilPerfRecord=False verifValgrind=False outilPerfStat=False outilCallgrind=False RepertoireDestination=/users/cmpbib/compilations/lorien/linux_dernier_ompi_leap/TV2016-09-14_03h03m15sEDT/opt/Test.Laplacien RepertoireTest=/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/TestValidation/Ressources/opt/Test.Laplacien Prefixe=Laplacien3D.Dirichlet.mixte_tetra_prismetri.scalhier
>
>> do the tests that crash use MPI_Comm_spawn?
>>
>> i am surprised by the process name [[9325,5754],0], which suggests MPI_Comm_spawn was called 5753 times (!)
>>
>> can you also run
>>
>> hostname
>>
>> on the 'lorien' host?
>>
> [eric@lorien] Scripts (master $ u+1)> hostname
> lorien
>
>> if you configure'd Open MPI with --enable-debug, can you
>
> Yes.
>
>> export OMPI_MCA_plm_base_verbose=5
>>
>> then run one test and post the logs?
>
> Hmmm, strange?
>
> [lorien:93841] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
> [lorien:93841] plm:base:set_hnp_name: initial bias 93841 nodename hash 1366255883
> [lorien:93841] plm:base:set_hnp_name: final jobfam 22260
> [lorien:93841] [[22260,0],0] plm:rsh_setup on agent ssh : rsh path NULL
> [lorien:93841] [[22260,0],0] plm:base:receive start comm
> [lorien:93841] [[22260,0],0] plm:base:launch [22260,1] registered
> [lorien:93841] [[22260,0],0] plm:base:launch job [22260,1] is not a dynamic spawn
> [lorien:93841] [[22260,0],0] plm:base:receive stop comm
>
>> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should produce job family 5576 (but you get 9325)
>>
>> the discrepancy could be explained by the use of a batch manager and/or a full hostname i am unaware of.
>>
>> orte_plm_base_set_hnp_name() generates a 16-bit job family from the (32-bit hash of the) hostname and the mpirun (32-bit?) pid.
>>
>> so strictly speaking, it is possible that two jobs launched on the same node are assigned the same 16-bit job family.
>>
>> the easiest way to detect this could be to
>>
>> - edit orte/mca/plm/base/plm_base_jobid.c
>>
>> and replace
>>
>> OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>                      "plm:base:set_hnp_name: final jobfam %lu",
>>                      (unsigned long)jobfam));
>>
>> with
>>
>> OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>                      "plm:base:set_hnp_name: final jobfam %lu",
>>                      (unsigned long)jobfam));
>>
>> configure Open MPI with --enable-debug and rebuild,
>>
>> and then
>>
>> export OMPI_MCA_plm_base_verbose=4
>>
>> and run your tests.
>>
>> when the problem occurs, you will be able to check which pids produced the faulty jobfam, and that could hint at a conflict.
>>
> Does this give the same output as with export OMPI_MCA_plm_base_verbose=5 without the patch?
>
> If so, because everything is automated, applying a patch is "harder" for me than doing a simple export OMPI_MCA_plm_base_verbose=5, so maybe I could just add OMPI_MCA_plm_base_verbose=5 to all the tests and wait until it hangs?
>
> Thanks!
>
> Eric
>
>> Cheers,
>>
>> Gilles
>>
>> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>> Hi,
>>>
>>> It is the third time this has happened in the last 10 days.
>>>
>>> While running nightly tests (~2200), we have one or two tests that fail at the very beginning with this strange error:
>>>
>>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]
>>>
>>> But I can't reproduce the problem right now... i.e. if I launch this test alone "by hand", it is successful... and the same test was successful yesterday...
>>>
>>> Is there some kind of "race condition" that can happen on the creation of "tmp" files if many tests run together on the same node? (we are oversubscribing even sequential runs...)
>>>
>>> Here are the build logs:
>>>
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>
>>> Thanks,
>>>
>>> Eric
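A side note on the 16-bit job family discussion quoted above: it is essentially a pigeonhole problem. Below is a minimal, hypothetical C sketch (an assumption for illustration only, not the actual orte_plm_base_set_hnp_name() algorithm) that folds a 32-bit hostname hash and a pid into 16 bits and shows two different pids on the same host landing on the same job family:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical fold, for illustration only; the real logic lives in
 * orte/mca/plm/base/plm_base_jobid.c and differs in detail. */
static uint16_t toy_jobfam(uint32_t hostname_hash, uint32_t pid)
{
    uint32_t mixed = hostname_hash ^ pid;      /* combine both inputs */
    return (uint16_t)(mixed ^ (mixed >> 16));  /* fold 32 bits down to 16 */
}

int main(void)
{
    /* same host (hash taken from the log above), two different pids */
    unsigned a = toy_jobfam(1366255883u, 93841u);
    unsigned b = toy_jobfam(1366255883u, 93841u ^ 0x10001u);
    printf("jobfam a = %u, jobfam b = %u, collide = %s\n",
           a, b, (a == b) ? "yes" : "no");
    return 0;
}

With only 65536 possible job families and up to 16 tests (each forking its own orted) running concurrently on the same node, an occasional collision is at least plausible, which would fit the mismatched identifiers reported by usock_peer_recv_connect_ack.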
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel