This has nothing to do with PMIx, Josh - the error is coming out of the usock 
OOB component.


> On Sep 14, 2016, at 7:17 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
> 
> Eric,
> 
> We are looking into the PMIx code path that sets up the jobid. The session 
> directories are created based on the jobid. It might be the case that the 
> jobids (generated with rand) happen to be the same for different jobs 
> resulting in multiple jobs sharing the same session directory, but we need to 
> check. We will update.
> 
> Josh
> 
> On Wed, Sep 14, 2016 at 9:33 AM, Eric Chamberland 
> <eric.chamberl...@giref.ulaval.ca> wrote:
> Lucky!
> 
> Since each run has its own TMP, I still have it on disk.
> 
> For the faulty run, the TMP variable was:
> 
> TMP=/tmp/tmp.wOv5dkNaSI
> 
> and in $TMP I have:
> 
> openmpi-sessions-40031@lorien_0
> 
> and in this subdirectory I have a bunch of empty dirs:
> 
> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |wc -l
> 1841
> 
> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |more
> total 68
> drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
> drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
> drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
> drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
> drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
> drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
> drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
> drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
> ...
> 
> If I do:
> 
> lsof |grep "openmpi-sessions-40031"
> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>       Output information may be incomplete.
> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>       Output information may be incomplete.
> 
> nothing...
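
The manual inspection above (counting entries, looking for leftover sockets) could also be scripted. This is only a rough sketch: `summarize_session_dir` is a hypothetical helper, not an Open MPI tool, and it assumes that whatever is worth reporting sits at most one level below the session directory.

```python
import os
import stat

def summarize_session_dir(path):
    """List empty subdirectories of path, plus Unix-domain sockets found
    in path itself or one level below it."""
    empty_dirs, sockets = [], []
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            children = list(os.scandir(entry.path))
            if not children:
                empty_dirs.append(entry.name)
            for child in children:
                if stat.S_ISSOCK(child.stat(follow_symlinks=False).st_mode):
                    sockets.append(child.path)
        elif stat.S_ISSOCK(entry.stat(follow_symlinks=False).st_mode):
            sockets.append(entry.path)
    return sorted(empty_dirs), sorted(sockets)

if __name__ == "__main__":
    import sys
    empty, socks = summarize_session_dir(sys.argv[1])
    print(len(empty), "empty directories,", len(socks), "socket files")
```

It would be run with the session directory as its only argument, for example the openmpi-sessions-40031@lorien_0 directory under $TMP.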
> 
> What else may I check?
> 
> Eric
> 
> 
> On 14/09/16 08:47 AM, Joshua Ladd wrote:
> Hi, Eric
> 
> I **think** this might be related to the following:
> 
> https://github.com/pmix/master/pull/145
> 
> I'm wondering if you can look into the /tmp directory and see if you
> have a bunch of stale usock files.
> 
> Best,
> 
> Josh
> 
> 
> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> 
>     Eric,
> 
> 
>     can you please provide more information on how your tests are launched ?
> 
>     do you
> 
>     mpirun -np 1 ./a.out
> 
>     or do you simply
> 
>     ./a.out
> 
> 
>     do you use a batch manager ? if yes, which one ?
> 
>     do you run one test per job ? or multiple tests per job ?
> 
>     how are these tests launched ?
> 
> 
>     does the test that crashes use MPI_Comm_spawn ?
> 
>     i am surprised by the process name [[9325,5754],0], which suggests
>     MPI_Comm_spawn was called 5753 times (!)
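
For reference, the process name in the error message can be unpacked as sketched below. This is a minimal illustration, assuming the usual ORTE convention that a name prints as [[job family,local jobid],vpid] and that the 32-bit jobid holds the job family in its upper 16 bits. The helpers `parse_proc_name` and `pack_jobid` are hypothetical and not part of Open MPI.

```python
import re

# Hypothetical helpers for illustration; not part of Open MPI.
_NAME_RE = re.compile(r"\[\[(\d+),(\d+)\],(\d+)\]")

def parse_proc_name(text):
    """Return (job_family, local_jobid, vpid) from a name like [[9325,5754],0]."""
    m = _NAME_RE.search(text)
    if m is None:
        raise ValueError(f"no process name found in {text!r}")
    return tuple(int(g) for g in m.groups())

def pack_jobid(job_family, local_jobid):
    """Assumed layout: job family in the upper 16 bits, local jobid in the lower 16."""
    return ((job_family & 0xFFFF) << 16) | (local_jobid & 0xFFFF)

if __name__ == "__main__":
    line = "[lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:"
    fam, local, vpid = parse_proc_name(line)
    print(fam, local, vpid, hex(pack_jobid(fam, local)))
```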
> 
> 
>     can you also run
> 
>     hostname
> 
>     on the 'lorien' host ?
> 
>     if you configured Open MPI with --enable-debug, can you
> 
>     export OMPI_MCA_plm_base_verbose=5
> 
>     then run one test and post the logs ?
> 
> 
>     from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
>     produce job family 5576 (but you get 9325)
> 
>     the discrepancy could be explained by the use of a batch manager
>     and/or a full hostname i am unaware of.
> 
> 
>     orte_plm_base_set_hnp_name() generates a 16-bit job family from the
>     (32-bit hash of the) hostname and the mpirun (32-bit ?) pid.
> 
>     so strictly speaking, it is possible for two jobs launched on the same
>     node to be assigned the same 16-bit job family.
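
To get a feel for how likely such a clash is, here is a back-of-the-envelope sketch using the standard birthday-problem product. It assumes job families are spread uniformly and independently over the 65536 possible 16-bit values, which the real hash of hostname and pid need not satisfy. `collision_probability` is a hypothetical helper, not Open MPI code.

```python
def collision_probability(n_jobs, n_values=1 << 16):
    """Probability that at least two of n_jobs share a value,
    assuming uniform, independent draws from n_values possibilities."""
    if n_jobs > n_values:
        return 1.0  # pigeonhole: a repeat is certain
    p_unique = 1.0
    for i in range(n_jobs):
        p_unique *= (n_values - i) / n_values
    return 1.0 - p_unique

if __name__ == "__main__":
    for n in (2, 16, 64, 256):
        print(n, round(collision_probability(n), 4))
```

Under this model only jobs that are alive at the same time (or whose session directories were left behind) count toward n_jobs.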
> 
> 
>     the easiest way to detect this could be to
> 
>     - edit orte/mca/plm/base/plm_base_jobid.c
> 
>     and replace
> 
>         OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>                              "plm:base:set_hnp_name: final jobfam %lu",
>                              (unsigned long)jobfam));
> 
>     with
> 
>         OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>                              "plm:base:set_hnp_name: final jobfam %lu",
>                              (unsigned long)jobfam));
> 
>     configure Open MPI with --enable-debug and rebuild
> 
>     and then
> 
>     export OMPI_MCA_plm_base_verbose=4
> 
>     and run your tests.
> 
> 
>     when the problem occurs, you will be able to check which pids
>     produced the faulty jobfam, and that could hint at a conflict.
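
Once the verbose logs are collected, the check described above could be automated along these lines. This is a sketch only: it assumes each relevant log line carries a "[host:pid]" prefix followed somewhere by the "plm:base:set_hnp_name: final jobfam N" message, and `find_jobfam_conflicts` is a hypothetical helper. The sample lines in the demo are made up for illustration, not real output.

```python
import re
from collections import defaultdict

# Assumed line shape: "[host:pid] ... plm:base:set_hnp_name: final jobfam N".
_LINE_RE = re.compile(
    r"\[([^\]:\[]+):(\d+)\].*plm:base:set_hnp_name: final jobfam (\d+)"
)

def find_jobfam_conflicts(lines):
    """Map each job family reported by more than one (host, pid)
    to the sorted list of those (host, pid) pairs."""
    seen = defaultdict(set)
    for line in lines:
        m = _LINE_RE.search(line)
        if m:
            host, pid, jobfam = m.group(1), int(m.group(2)), int(m.group(3))
            seen[jobfam].add((host, pid))
    return {fam: sorted(owners) for fam, owners in seen.items() if len(owners) > 1}

if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "[lorien:1001] plm:base:set_hnp_name: final jobfam 9325",
        "[lorien:2002] plm:base:set_hnp_name: final jobfam 9325",
        "[lorien:3003] plm:base:set_hnp_name: final jobfam 5590",
    ]
    print(find_jobfam_conflicts(sample))
```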
> 
> 
>     Cheers,
> 
> 
>     Gilles
> 
> 
> 
>     On 9/14/2016 12:35 AM, Eric Chamberland wrote:
> 
>         Hi,
> 
>         It is the third time this has happened in the last 10 days.
> 
>         While running nightly tests (~2200), we have one or two tests
>         that fail at the very beginning with this strange error:
> 
>         [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:
>         received unexpected process identifier [[9325,0],0] from
>         [[5590,0],0]
> 
>         But I can't reproduce the problem right now... i.e., if I launch
>         this test alone "by hand", it is successful... the same test was
>         successful yesterday...
> 
>         Is there some kind of "race condition" that can happen on the
>         creation of "tmp" files if many tests run together on the same
>         node? (we are oversubscribing even sequential runs...)
> 
>         Here are the build logs:
> 
>         http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
> 
>         http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
> 
> 
>         Thanks,
> 
>         Eric
>         _______________________________________________
>         devel mailing list
>         devel@lists.open-mpi.org
>         https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> 
> 
