Eric,

We are looking into the PMIx code path that sets up the jobid. The session
directories are created based on the jobid. It might be the case that the
jobids (generated with rand) happen to be the same for different jobs,
resulting in multiple jobs sharing the same session directory, but we need
to check. We will update.
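
For a sense of scale: Eric's nightly suite runs ~2200 tests, and (per Gilles,
below) the job family is only 16 bits. Assuming each job draws a roughly
uniform, independent value (an assumption of this sketch, not something
confirmed in the code yet), the birthday bound makes a nightly collision
essentially certain:

```python
# Birthday-bound sketch: probability that at least two of n jobs draw the
# same 16-bit job family, assuming independent uniform draws (an assumption
# for illustration, not verified against the actual jobid code).
def collision_probability(n_jobs: int, space: int = 2**16) -> float:
    p_distinct = 1.0
    for i in range(n_jobs):
        p_distinct *= (space - i) / space
    return 1.0 - p_distinct

# ~2200 nightly tests, as in Eric's setup: a collision is all but guaranteed.
print(collision_probability(2200))
# With only a couple of jobs per night it would be rare:
print(collision_probability(2))
```

So if the jobids really are independent draws from a 16-bit space, seeing this
a few times in 10 days is exactly what one would expect.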

Josh

On Wed, Sep 14, 2016 at 9:33 AM, Eric Chamberland <
eric.chamberl...@giref.ulaval.ca> wrote:

> Lucky!
>
> Since each run has a specific TMP, I still have it on disk.
>
> for the faulty run, the TMP variable was:
>
> TMP=/tmp/tmp.wOv5dkNaSI
>
> and in $TMP I have:
>
> openmpi-sessions-40031@lorien_0
>
> and in this subdirectory I have a bunch of empty dirs:
>
> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la
> |wc -l
> 1841
>
> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la
> |more
> total 68
> drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
> drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
> drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
> drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
> drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
> drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
> drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
> drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
> ...
>
> If I do:
>
> lsof |grep "openmpi-sessions-40031"
> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>       Output information may be incomplete.
> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>       Output information may be incomplete.
>
> nothing...
>
> What else may I check?
>
> Eric
>
>
> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>
>> Hi, Eric
>>
>> I **think** this might be related to the following:
>>
>> https://github.com/pmix/master/pull/145
>>
>> I'm wondering if you can look into the /tmp directory and see if you
>> have a bunch of stale usock files.
>>
>> Best,
>>
>> Josh
>>
>>
>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet
>> <gil...@rist.or.jp> wrote:
>>
>>     Eric,
>>
>>
>>     can you please provide more information on how your tests are
>> launched?
>>
>>     do you
>>
>>     mpirun -np 1 ./a.out
>>
>>     or do you simply
>>
>>     ./a.out
>>
>>
>>     do you use a batch manager? if yes, which one?
>>
>>     do you run one test per job, or multiple tests per job?
>>
>>     how are these tests launched ?
>>
>>
>>     does the test that crashes use MPI_Comm_spawn?
>>
>>     I am surprised by the process name [[9325,5754],0], which suggests
>>     that MPI_Comm_spawn was called 5753 times (!)
>>
>>
>>     can you also run
>>
>>     hostname
>>
>>     on the 'lorien' host?
>>
>>     if you configured Open MPI with --enable-debug, can you
>>
>>     export OMPI_MCA_plm_base_verbose=5
>>
>>     then run one test and post the logs?
>>
>>
>>     from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
>>     produce job family 5576 (but you get 9325)
>>
>>     the discrepancy could be explained by the use of a batch manager
>>     and/or a full hostname I am unaware of.
>>
>>
>>     orte_plm_base_set_hnp_name() generates a 16-bit job family from the
>>     (32-bit hash of the) hostname and the mpirun (32-bit?) pid.
>>
>>     so strictly speaking, it is possible that two jobs launched on the
>>     same node are assigned the same 16-bit job family.
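
The same-node collision described here can be illustrated with a toy fold.
This is not the actual orte_plm_base_set_hnp_name() algorithm; the CRC32
stand-in hash and the XOR fold below are assumptions for illustration only:

```python
# Illustrative only: NOT the actual orte_plm_base_set_hnp_name() algorithm.
# It just shows how folding a 32-bit hostname hash and a pid down to 16 bits
# makes same-node collisions possible.
import zlib

def toy_jobfam(hostname: str, pid: int) -> int:
    h = zlib.crc32(hostname.encode())  # stand-in 32-bit hostname hash
    return (h ^ pid) & 0xFFFF          # fold to 16 bits

# Two different pids on the same host collide whenever their low 16 bits
# agree under such a fold:
a = toy_jobfam("lorien", 142766)
b = toy_jobfam("lorien", 142766 + (1 << 16))  # pid differing by 65536
print(a == b)  # True: same 16-bit job family
```

Under any fold of this shape, pids whose low bits agree (or a recycled pid
after wraparound) would land in the same job family on the same host.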
>>
>>
>>     the easiest way to detect this could be to
>>
>>     - edit orte/mca/plm/base/plm_base_jobid.c
>>
>>     and replace
>>
>>         OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>                              "plm:base:set_hnp_name: final jobfam %lu",
>>                              (unsigned long)jobfam));
>>
>>     with
>>
>>         OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>                              "plm:base:set_hnp_name: final jobfam %lu",
>>                              (unsigned long)jobfam));
>>
>>     configure Open MPI with --enable-debug and rebuild
>>
>>     and then
>>
>>     export OMPI_MCA_plm_base_verbose=4
>>
>>     and run your tests.
>>
>>
>>     when the problem occurs, you will be able to check which pids
>>     produced the faulty jobfam, and that could hint at a conflict.
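
Once the logs are collected, duplicate job families can be sifted out with a
short script. A minimal sketch, assuming each verbose line carries the usual
[host:pid] prefix (the exact log-line format here is an assumption):

```python
# Group pids by job family from collected verbose logs; a job family mapped
# to more than one pid reveals the collision. Assumed line format, e.g.:
#   [lorien:142766] plm:base:set_hnp_name: final jobfam 5576
import re
from collections import defaultdict

def find_jobfam_collisions(lines):
    by_fam = defaultdict(set)
    pat = re.compile(r"\[([^:\]]+):(\d+)\].*final jobfam (\d+)")
    for line in lines:
        m = pat.search(line)
        if m:
            host, pid, fam = m.group(1), int(m.group(2)), int(m.group(3))
            by_fam[fam].add((host, pid))
    # keep only families claimed by more than one process
    return {fam: sorted(p) for fam, p in by_fam.items() if len(p) > 1}

logs = [
    "[lorien:142766] plm:base:set_hnp_name: final jobfam 5576",
    "[lorien:208302] plm:base:set_hnp_name: final jobfam 5576",
    "[lorien:150001] plm:base:set_hnp_name: final jobfam 9325",
]
print(find_jobfam_collisions(logs))
```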
>>
>>
>>     Cheers,
>>
>>
>>     Gilles
>>
>>
>>
>>     On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>
>>         Hi,
>>
>>         This is the third time this has happened in the last 10 days.
>>
>>         While running nightly tests (~2200), we have one or two tests
>>         that fail at the very beginning with this strange error:
>>
>>         [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:
>>         received unexpected process identifier [[9325,0],0] from
>>         [[5590,0],0]
>>
>>         But I can't reproduce the problem right now... i.e., if I launch
>>         this test alone "by hand", it is successful... the same test was
>>         successful yesterday...
>>
>>         Is there some kind of "race condition" that can happen on the
>>         creation of "tmp" files if many tests run together on the same
>>         node? (we are oversubscribing even sequential runs...)
>>
>>         Here are the build logs:
>>
>>         http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>
>>         http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>
>>
>>         Thanks,
>>
>>         Eric
>>         _______________________________________________
>>         devel mailing list
>>         devel@lists.open-mpi.org
>>         https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
