This has nothing to do with PMIx, Josh - the error is coming out of the usock OOB component.
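(For context on what that message means: the usock OOB component checks, on each incoming connection, that the process name carried in the connect ACK matches the peer it expects on that socket. Below is a minimal sketch of that kind of identity check; the struct layout, names, and message-slot ordering are invented for illustration and are NOT the actual orte/mca/oob/usock code.)

    /* Sketch only: an identity check of the kind that emits
     * "usock_peer_recv_connect_ack: received unexpected process identifier".
     * All types and names here are hypothetical, not Open MPI's code. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        unsigned jobfam; /* 16-bit job family, e.g. 9325 */
        unsigned job;    /* job id within the family */
        unsigned vpid;   /* rank within the job */
    } name_t;

    static bool ack_ok(const name_t *expected, const name_t *received)
    {
        if (expected->jobfam != received->jobfam ||
            expected->job    != received->job    ||
            expected->vpid   != received->vpid) {
            fprintf(stderr,
                    "usock_peer_recv_connect_ack: received unexpected "
                    "process identifier [[%u,%u],%u] from [[%u,%u],%u]\n",
                    expected->jobfam, expected->job, expected->vpid,
                    received->jobfam, received->job, received->vpid);
            return false;
        }
        return true;
    }

    int main(void)
    {
        /* two daemons from different job families, as in the report */
        name_t expected = { 9325, 0, 0 };
        name_t received = { 5590, 0, 0 };
        return ack_ok(&expected, &received) ? 0 : 1;
    }

If two unrelated jobs end up sharing a session directory, a process can connect to a socket that belongs to the other job, and a check like this is exactly where that surfaces.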
> On Sep 14, 2016, at 7:17 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>
> Eric,
>
> We are looking into the PMIx code path that sets up the jobid. The session
> directories are created based on the jobid. It might be the case that the
> jobids (generated with rand) happen to be the same for different jobs,
> resulting in multiple jobs sharing the same session directory, but we need
> to check. We will update.
>
> Josh
>
> On Wed, Sep 14, 2016 at 9:33 AM, Eric Chamberland
> <eric.chamberl...@giref.ulaval.ca> wrote:
> Lucky!
>
> Since each run has a specific TMP, I still have it on disc.
>
> For the faulty run, the TMP variable was:
>
>     TMP=/tmp/tmp.wOv5dkNaSI
>
> and in $TMP I have:
>
>     openmpi-sessions-40031@lorien_0
>
> and in this subdirectory I have a bunch of empty dirs:
>
> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | wc -l
> 1841
>
> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | more
> total 68
> drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
> drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
> drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
> drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
> drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
> drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
> drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
> drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
> ...
>
> If I do:
>
> lsof | grep "openmpi-sessions-40031"
> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>       Output information may be incomplete.
> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>       Output information may be incomplete.
>
> nothing...
>
> What else may I check?
>
> Eric
>
> On 14/09/16 08:47 AM, Joshua Ladd wrote:
> Hi, Eric
>
> I **think** this might be related to the following:
>
> https://github.com/pmix/master/pull/145
>
> I'm wondering if you can look into the /tmp directory and see if you
> have a bunch of stale usock files.
>
> Best,
>
> Josh
>
> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> Eric,
>
> Can you please provide more information on how your tests are launched?
>
> Do you
>
>     mpirun -np 1 ./a.out
>
> or do you simply
>
>     ./a.out
>
> Do you use a batch manager? If yes, which one?
> Do you run one test per job, or multiple tests per job?
> How are these tests launched?
>
> Does the test that crashes use MPI_Comm_spawn?
> I am surprised by the process name [[9325,5754],0], which suggests
> MPI_Comm_spawn was called 5753 times (!)
>
> Can you also run
>
>     hostname
>
> on the 'lorien' host?
>
> If you configured Open MPI with --enable-debug, can you
>
>     export OMPI_MCA_plm_base_verbose=5
>
> then run one test and post the logs?
>
> From orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
> produce job family 5576 (but you get 9325). The discrepancy could be
> explained by the use of a batch manager and/or a full hostname I am
> unaware of.
>
> orte_plm_base_set_hnp_name() generates a 16-bit job family from the
> (32-bit hash of the) hostname and the mpirun (32-bit?) pid. So,
> strictly speaking, it is possible for two jobs launched on the same
> node to be assigned the same 16-bit job family.
>
> The easiest way to detect this could be to:
>
> - edit orte/mca/plm/base/plm_base_jobid.c and replace
>
>     OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>                          "plm:base:set_hnp_name: final jobfam %lu",
>                          (unsigned long)jobfam));
>
>   with
>
>     OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>                          "plm:base:set_hnp_name: final jobfam %lu",
>                          (unsigned long)jobfam));
>
> - configure Open MPI with --enable-debug and rebuild
>
> - then
>
>     export OMPI_MCA_plm_base_verbose=4
>
>   and run your tests.
>
> When the problem occurs, you will be able to check which pids produced
> the faulty jobfam, and that could hint at a conflict.
>
> Cheers,
>
> Gilles
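(To make the collision Gilles describes concrete, here is a minimal, self-contained sketch of a jobfam derivation of that general shape. The real logic lives in orte/mca/plm/base/plm_base_jobid.c; the djb2 hash and the xor/mask folding below are stand-ins for illustration, not the actual implementation.)

    /* Sketch: fold a 32-bit hostname hash and a pid into a 16-bit job
     * family.  The hash and the mixing are illustrative stand-ins. */
    #include <stdint.h>
    #include <stdio.h>

    static uint32_t hash32(const char *s)
    {
        uint32_t h = 5381; /* toy djb2 string hash */
        while (*s != '\0') {
            h = h * 33u + (uint8_t)*s++;
        }
        return h;
    }

    static unsigned jobfam(const char *hostname, uint32_t pid)
    {
        return (hash32(hostname) ^ pid) & 0xFFFFu;
    }

    int main(void)
    {
        /* With this folding, the result depends only on the low 16 bits
         * of the pid: pid and pid + 65536 on the same host collide. */
        printf("%u\n", jobfam("lorien", 142766));
        printf("%u\n", jobfam("lorien", 142766 + 65536)); /* same value */
        return 0;
    }

The exact hash does not matter for the argument: any deterministic fold of (hostname, pid) into 16 bits has only 65536 possible values per host, so repeats between jobs are a question of when, not if.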
> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
> Hi,
>
> This is the third time this has happened in the last 10 days.
>
> While running nightly tests (~2200), we have one or two tests that fail
> at the very beginning with this strange error:
>
> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received
> unexpected process identifier [[9325,0],0] from [[5590,0],0]
>
> But I can't reproduce the problem right now... i.e., if I launch this
> test alone "by hand", it is successful... and the same test was
> successful yesterday...
>
> Is there some kind of "race condition" that can happen on the creation
> of "tmp" files if many tests run together on the same node? (We are
> oversubscribing even sequential runs...)
>
> Here are the build logs:
>
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>
> Thanks,
>
> Eric
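(Eric's race-condition question can be given a rough number. Assuming, as a simplification, that job families behave like uniformly random 16-bit values, the birthday bound gives the chance that at least two of n jobs on one host share a jobfam. A small check:)

    /* Birthday-bound estimate: probability that at least two of n jobs
     * on one host share a 16-bit job family, assuming jobfams are
     * uniform over 2^16 values (a simplifying assumption). */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double buckets = 65536.0;           /* 2^16 jobfams */
        const double runs[] = { 100, 500, 2200 }; /* ~2200 nightly tests */
        for (int i = 0; i < 3; i++) {
            double n = runs[i];
            /* P(collision) ~= 1 - exp(-n(n-1) / (2 * buckets)) */
            double p = 1.0 - exp(-n * (n - 1.0) / (2.0 * buckets));
            printf("n = %4.0f  P(repeated jobfam) ~= %.3f\n", n, p);
        }
        return 0;
    }

This prints roughly 0.07, 0.85, and 1.000: over a full night of ~2200 tests, a repeated job family on the same node is essentially guaranteed. Whether a repeat actually causes the failure then depends on whether the two colliding runs overlap in time and their session directories can clash.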
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel