Actually, you just use the envar that was previously cited in a different email thread:
if (NULL != getenv(OPAL_MCA_PREFIX"orte_launch")) {
    /* you were launched by mpirun */
} else {
    /* you were direct launched */
}
This is available from the time of the first instruction, so no worries as to when you look.
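
For reference, here is a minimal standalone sketch of the same check from
application code; it assumes OPAL_MCA_PREFIX expands to the usual
"OMPI_MCA_" prefix:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* assumption: OPAL_MCA_PREFIX is "OMPI_MCA_", so the envar the
     * application actually sees is OMPI_MCA_orte_launch */
    if (NULL != getenv("OMPI_MCA_orte_launch")) {
        printf("launched by mpirun/orted\n");
    } else {
        printf("direct launched (or singleton)\n");
    }
    return 0;
}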
> On Sep 15, 2016, at 7:50 AM, Pritchard Jr., Howard <[email protected]> wrote:
>
> HI Gilles,
>
> At what point in the job launch do you need to determine whether
> or not the job was direct launched?
>
> Howard
>
> --
> Howard Pritchard
>
> HPC-DES
> Los Alamos National Laboratory
>
>
>
>
>
> On 9/15/16, 7:38 AM, "devel on behalf of Gilles Gouaillardet"
> <[email protected] on behalf of
> [email protected]> wrote:
>
>> Ralph,
>>
>> that looks good to me.
>>
>> can you please remind me how to test if an app was launched by
>> mpirun/orted or direct launched by the RM ?
>>
>> right now, which direct launch methods are supported ?
>> i am aware of srun (SLURM) and aprun (CRAY), are there any others ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On Thu, Sep 15, 2016 at 7:10 PM, [email protected] <[email protected]>
>> wrote:
>>>
>>> On Sep 15, 2016, at 12:51 AM, Gilles Gouaillardet <[email protected]>
>>> wrote:
>>>
>>> Ralph,
>>>
>>>
>>> my reply is in the text
>>>
>>>
>>> On 9/15/2016 11:11 AM, [email protected] wrote:
>>>
>>> If we are going to make a change, then let’s do it only once. Since we
>>> introduced PMIx and the concept of the string namespace, the plan has
>>> been
>>> to switch away from a numerical jobid and to the namespace. This
>>> eliminates
>>> the issue of the hash altogether. If we are going to make a disruptive
>>> change, then let’s do that one. Either way, this isn’t something that
>>> could
>>> go into the 2.x series. It is far too invasive, and would have to be
>>> delayed
>>> until a 3.x at the earliest.
>>>
>>> got it !
>>>
>>> Note that I am not yet convinced that is the issue here. We’ve had this
>>> hash
>>> for 12 years, and this is the first time someone has claimed to see a
>>> problem. That makes me very suspicious that the root cause isn’t what
>>> you
>>> are pursuing. This is only being reported for _singletons_, and that is
>>> a
>>> very unique code path. The only reason for launching the orted is to
>>> support
>>> PMIx operations such as notification and comm_spawn. If those aren’t
>>> being
>>> used, then we could use the “isolated” mode where the usock OOB isn’t
>>> even
>>> activated, thus eliminating the problem. This would be a much smaller
>>> “fix”
>>> and could potentially fit into 2.x
>>>
>>> a bug has been identified and fixed, let's wait and see how things go
>>>
>>> how can i use the isolated mode ?
>>> shall i simply
>>> export OMPI_MCA_pmix=isolated
>>> export OMPI_MCA_plm=isolated
>>> ?
>>>
>>> out of curiosity, does "isolated" mean we would not even need to fork
>>> the HNP ?
>>>
>>>
>>> Yes - that’s the idea. Simplify and make things faster. All you have to
>>> do
>>> is set OMPI_MCA_ess_singleton_isolated=1 on master, and I believe that
>>> code
>>> is in 2.x as well
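>>>
>>> As a sketch only (assuming, as usual, that MCA parameters are read from
>>> the environment during MPI_Init), a singleton could also request
>>> isolated mode programmatically:
>>>
>>> #include <mpi.h>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     /* must be set before MPI_Init, which is when MCA params are read;
>>>      * equivalent to "export OMPI_MCA_ess_singleton_isolated=1" */
>>>     setenv("OMPI_MCA_ess_singleton_isolated", "1", 1);
>>>
>>>     MPI_Init(&argc, &argv);
>>>     printf("singleton init complete\n");
>>>     MPI_Finalize();
>>>     return 0;
>>> }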
>>>
>>>
>>>
>>> FWIW: every organization I’ve worked with has an epilog script that
>>> blows
>>> away temp dirs. It isn’t the RM-based environment that is of concern -
>>> it’s
>>> the non-RM one where epilog scripts don’t exist that is the problem.
>>>
>>> well, i was looking at this the other way around.
>>> if mpirun/orted creates the session directory with mkstemp(), then
>>> there is no longer any need to do cleanup
>>> (as long as you do not run out of disk space).
>>> but with direct run, there is always a small risk that a previous
>>> session directory is reused, hence the requirement for an epilogue.
>>> also, if the RM is configured to run one job at a time on a given node,
>>> the epilog can be quite trivial.
>>> but if several jobs can run on a given node at the same time, the
>>> epilog becomes less trivial.
>>>
>>>
>>> Yeah, this session directory thing has always been problematic. We’ve
>>> had
>>> litter problems since day one, and tried multiple solutions over the
>>> years.
>>> Obviously, none of those has proven fully successful :-(
>>>
>>> Artem came up with a good solution using PMIx that allows the RM to
>>> control
>>> the session directory location for both direct launch and mpirun launch,
>>> thus ensuring the RM can cleanup the correct place upon session
>>> termination.
>>> As we get better adoption of that method out there, then the RM-based
>>> solution (even for multiple jobs sharing a node) should be resolved.
>>>
>>> This leaves the non-RM (i.e., ssh-based launch using mpirun) problem.
>>> Your
>>> proposal would resolve that one as (a) we always have orted’s in that
>>> scenario, and (b) the orted’s pass the session directory down to the
>>> apps.
>>> So maybe the right approach is to use mkstemp() in the scenario where
>>> we are
>>> launched via orted and the RM has not specified a session directory.
>>>
>>> I’m not sure we can resolve the direct launch without PMIx problem - I
>>> think
>>> that’s best left as another incentive for RMs to get on-board the PMIx
>>> bus.
>>>
>>> Make sense?
>>>
>>>
>>>
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> On Sep 14, 2016, at 6:05 PM, Gilles Gouaillardet <[email protected]>
>>> wrote:
>>>
>>> Ralph,
>>>
>>>
>>> On 9/15/2016 12:11 AM, [email protected] wrote:
>>>
>>> Many things are possible, given infinite time :-)
>>>
>>> i could not agree more :-D
>>>
>>> The issue with this notion lies in direct launch scenarios - i.e., when
>>> procs are launched directly by the RM and not via mpirun. In this case,
>>> there is nobody who can give us the session directory (well, until PMIx
>>> becomes universal), and so the apps must be able to generate a name that
>>> they all can know. Otherwise, we lose shared memory support because they
>>> can’t rendezvous.
>>>
>>> thanks for the explanation,
>>> now let me rephrase that
>>> "a MPI task must be able to rebuild the path to the session directory,
>>> based
>>> on the information it has when launched.
>>> if mpirun is used, we have several options to pass this option to the
>>> MPI
>>> tasks.
>>> in case of direct run, this info is unlikely (PMIx is not universal
>>> (yet))
>>> passed by the batch manager, so we have to use what is available"
>>>
>>> my concern is that, to keep things simple, the session directory is
>>> based on the Open MPI jobid, and since the stepid is zero most of the
>>> time, the jobid really means the job family, which is stored in 16 bits.
>>>
>>> in the case of mpirun, the jobfam is a 16-bit hash of the hostname (a
>>> reasonably sized string) and the mpirun pid (32 bits on Linux).
>>> if several mpiruns are invoked on the same host at a given time, there
>>> is a risk that two distinct jobs are assigned the same jobfam (since we
>>> hash from 32 bits down to 16 bits).
>>> also, there is a risk the session directory already exists from a
>>> previous job, with some/all files and unix sockets from that previous
>>> job, leading to undefined behavior
>>> (an early crash if we are lucky, odd things otherwise).
>>>
>>> in the case of a direct run, i guess the jobfam is a 16-bit hash of the
>>> jobid passed by the RM, and once again, there is a risk of a conflict
>>> and/or the re-use of a previous session directory.
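>>>
>>> to give a rough idea of the collision risk, here is a toy sketch (not
>>> the actual hash in plm_base_jobid.c) that folds a hostname and a pid
>>> into 16 bits and counts how many pids in the default Linux pid range
>>> land on a jobfam value that is already taken:
>>>
>>> #include <stdio.h>
>>> #include <stdint.h>
>>>
>>> /* toy hash, for illustration only */
>>> static uint16_t toy_jobfam(const char *host, uint32_t pid)
>>> {
>>>     uint32_t h = 5381;
>>>     for (const char *p = host; *p != '\0'; p++) {
>>>         h = h * 33 + (unsigned char)*p;   /* djb2-style string hash */
>>>     }
>>>     h ^= pid;                             /* mix in the launcher pid */
>>>     return (uint16_t)(h ^ (h >> 16));     /* fold 32 bits down to 16 */
>>> }
>>>
>>> int main(void)
>>> {
>>>     static unsigned hits[65536];
>>>     unsigned collisions = 0;
>>>
>>>     for (uint32_t pid = 300; pid < 32768; pid++) {
>>>         if (hits[toy_jobfam("lorien", pid)]++ > 0) {
>>>             collisions++;   /* this pid reuses an already-taken jobfam */
>>>         }
>>>     }
>>>     printf("%u of %u pids collide with an earlier one\n",
>>>            collisions, 32768u - 300u);
>>>     return 0;
>>> }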
>>>
>>> to me, the issue here is that we are using the Open MPI jobfam to build
>>> the session directory path.
>>> instead, what if we:
>>> 1) with mpirun, use a session directory created by mkstemp(), and pass
>>> it to the MPI tasks via the environment, or retrieve it from
>>> orted/mpirun right after the communication has been established.
>>> 2) for direct run, use a session directory based on the full jobid
>>> (which might be a string or a number) as passed by the RM.
>>>
>>> in case of 1), there is no longer any risk of a hash conflict or of
>>> re-using a previous session directory.
>>> in case of 2), there is no longer any risk of a hash conflict, but
>>> there is still a risk of re-using a session directory from a previous
>>> (e.g. terminated) job.
>>> that being said, once we document how the session directory is built
>>> from the jobid, sysadmins will be able to write an RM epilog that
>>> removes the session directory.
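>>>
>>> here is a minimal sketch of option 1), assuming mpirun uses mkdtemp()
>>> (the directory-creating variant of mkstemp()) and advertises the result
>>> through a hypothetical OMPI_SESSION_DIR environment variable:
>>>
>>> #define _DEFAULT_SOURCE          /* for mkdtemp()/setenv() on glibc */
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>>
>>> int main(void)
>>> {
>>>     char tmpl[] = "/tmp/ompi-session-XXXXXX";
>>>
>>>     /* mkdtemp() replaces the XXXXXX with a unique suffix and creates
>>>      * the directory with mode 0700, so two launchers cannot collide */
>>>     if (NULL == mkdtemp(tmpl)) {
>>>         perror("mkdtemp");
>>>         return 1;
>>>     }
>>>
>>>     /* mpirun would export the path so the orteds/apps inherit it
>>>      * (OMPI_SESSION_DIR is a hypothetical name, for illustration) */
>>>     setenv("OMPI_SESSION_DIR", tmpl, 1);
>>>     printf("session directory: %s\n", tmpl);
>>>     return 0;
>>> }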
>>>
>>> does that make sense ?
>>>
>>>
>>> However, that doesn’t seem to be the root problem here. I suspect there
>>> is a
>>> bug in the code that spawns the orted from the singleton, and
>>> subsequently
>>> parses the returned connection info. If you look at the error, you’ll
>>> see
>>> that both jobids have “zero” for their local jobid. This means that
>>> the two
>>> procs attempting to communicate both think they are daemons, which is
>>> impossible in this scenario.
>>>
>>> So something garbled the string that the orted returns on startup to the
>>> singleton, and/or the singleton is parsing it incorrectly. IIRC, the
>>> singleton gets its name from that string, and so I expect it is getting
>>> the
>>> wrong name - and hence the error.
>>>
>>> i will investigate that.
>>>
>>> As you may recall, you made a change a little while back where we
>>> modified
>>> the code in ess/singleton to be a little less strict in its checking of
>>> that
>>> returned string. I wonder if that is biting us here? It wouldn’t fix the
>>> problem, but might generate a different error at a more obvious place.
>>>
>>> do you mean
>>>
>>> https://github.com/open-mpi/ompi/commit/93e73841f9ec3739dec209365979e2badec0740f ?
>>> this has not been backported to v2.x, and the issue was reported on v2.x
>>>
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> On Sep 14, 2016, at 8:00 AM, Gilles Gouaillardet
>>> <[email protected]> wrote:
>>>
>>> Ralph,
>>>
>>> is there any reason to use a session directory based on the jobid (or
>>> job
>>> family) ?
>>> I mean, could we use mkstemp to generate a unique directory, and then
>>> propagate the path via orted comm or the environment ?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Wednesday, September 14, 2016, [email protected] <[email protected]>
>>> wrote:
>>>>
>>>> This has nothing to do with PMIx, Josh - the error is coming out of the
>>>> usock OOB component.
>>>>
>>>>
>>>> On Sep 14, 2016, at 7:17 AM, Joshua Ladd <[email protected]> wrote:
>>>>
>>>> Eric,
>>>>
>>>> We are looking into the PMIx code path that sets up the jobid. The
>>>> session
>>>> directories are created based on the jobid. It might be the case that
>>>> the
>>>> jobids (generated with rand) happen to be the same for different jobs
>>>> resulting in multiple jobs sharing the same session directory, but we
>>>> need
>>>> to check. We will update.
>>>>
>>>> Josh
>>>>
>>>> On Wed, Sep 14, 2016 at 9:33 AM, Eric Chamberland
>>>> <[email protected]> wrote:
>>>>>
>>>>> Lucky!
>>>>>
>>>>> Since each run has a specific TMP, I still have it on disk.
>>>>>
>>>>> for the faulty run, the TMP variable was:
>>>>>
>>>>> TMP=/tmp/tmp.wOv5dkNaSI
>>>>>
>>>>> and into $TMP I have:
>>>>>
>>>>> openmpi-sessions-40031@lorien_0
>>>>>
>>>>> and in this subdirectory I have a bunch of empty dirs:
>>>>>
>>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | wc -l
>>>>> 1841
>>>>>
>>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | more
>>>>> total 68
>>>>> drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>>>>> drwx------ 3 cmpbib bib 231 Sep 13 03:50 ..
>>>>> drwx------ 2 cmpbib bib 6 Sep 13 02:10 10015
>>>>> drwx------ 2 cmpbib bib 6 Sep 13 03:05 10049
>>>>> drwx------ 2 cmpbib bib 6 Sep 13 03:15 10052
>>>>> drwx------ 2 cmpbib bib 6 Sep 13 02:22 10059
>>>>> drwx------ 2 cmpbib bib 6 Sep 13 02:22 10110
>>>>> drwx------ 2 cmpbib bib 6 Sep 13 02:41 10114
>>>>> ...
>>>>>
>>>>> If I do:
>>>>>
>>>>> lsof |grep "openmpi-sessions-40031"
>>>>> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system
>>>>> /run/user/1000/gvfs
>>>>> Output information may be incomplete.
>>>>> lsof: WARNING: can't stat() tracefs file system
>>>>> /sys/kernel/debug/tracing
>>>>> Output information may be incomplete.
>>>>>
>>>>> nothing...
>>>>>
>>>>> What else may I check?
>>>>>
>>>>> Eric
>>>>>
>>>>>
>>>>> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>>>>>>
>>>>>> Hi, Eric
>>>>>>
>>>>>> I **think** this might be related to the following:
>>>>>>
>>>>>> https://github.com/pmix/master/pull/145
>>>>>>
>>>>>> I'm wondering if you can look into the /tmp directory and see if you
>>>>>> have a bunch of stale usock files.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet
>>>>>> <[email protected]
>>>>>> <mailto:[email protected]>> wrote:
>>>>>>
>>>>>> Eric,
>>>>>>
>>>>>>
>>>>>> can you please provide more information on how your tests are
>>>>>> launched ?
>>>>>>
>>>>>> do you
>>>>>>
>>>>>> mpirun -np 1 ./a.out
>>>>>>
>>>>>> or do you simply
>>>>>>
>>>>>> ./a.out
>>>>>>
>>>>>>
>>>>>> do you use a batch manager ? if yes, which one ?
>>>>>>
>>>>>> do you run one test per job ? or multiple tests per job ?
>>>>>>
>>>>>> how are these tests launched ?
>>>>>>
>>>>>>
>>>>>> does the test that crashes use MPI_Comm_spawn ?
>>>>>>
>>>>>> i am surprised by the process name [[9325,5754],0], which suggests
>>>>>> MPI_Comm_spawn was called 5753 times (!)
>>>>>>
>>>>>>
>>>>>> can you also run
>>>>>>
>>>>>> hostname
>>>>>>
>>>>>> on the 'lorien' host ?
>>>>>>
>>>>>> if you configured Open MPI with --enable-debug, can you
>>>>>>
>>>>>> export OMPI_MCA_plm_base_verbose=5
>>>>>>
>>>>>> then run one test and post the logs ?
>>>>>>
>>>>>>
>>>>>> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
>>>>>> produce job family 5576 (but you get 9325)
>>>>>>
>>>>>> the discrepancy could be explained by the use of a batch manager
>>>>>> and/or a full hostname i am unaware of.
>>>>>>
>>>>>>
>>>>>> orte_plm_base_set_hnp_name() generates a 16-bit job family from
>>>>>> the (32-bit hash of the) hostname and the mpirun (32-bit ?) pid.
>>>>>>
>>>>>> so strictly speaking, it is possible that two jobs launched on the
>>>>>> same node are assigned the same 16-bit job family.
>>>>>>
>>>>>>
>>>>>> the easiest way to detect this could be to
>>>>>>
>>>>>> - edit orte/mca/plm/base/plm_base_jobid.c
>>>>>>
>>>>>> and replace
>>>>>>
>>>>>> OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>>>>>                      "plm:base:set_hnp_name: final jobfam %lu",
>>>>>>                      (unsigned long)jobfam));
>>>>>>
>>>>>> with
>>>>>>
>>>>>> OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>>>>>                      "plm:base:set_hnp_name: final jobfam %lu",
>>>>>>                      (unsigned long)jobfam));
>>>>>>
>>>>>> configure Open MPI with --enable-debug and rebuild
>>>>>>
>>>>>> and then
>>>>>>
>>>>>> export OMPI_MCA_plm_base_verbose=4
>>>>>>
>>>>>> and run your tests.
>>>>>>
>>>>>>
>>>>>> when the problem occurs, you will be able to check which pids
>>>>>> produced the faulty jobfam, and that could hint at a conflict.
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> It is the third time this has happened in the last 10 days.
>>>>>>
>>>>>> While running nightly tests (~2200), we have one or two tests
>>>>>> that fail at the very beginning with this strange error:
>>>>>>
>>>>>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:
>>>>>> received unexpected process identifier [[9325,0],0] from
>>>>>> [[5590,0],0]
>>>>>>
>>>>>> But I can't reproduce the problem right now... i.e.: if I launch
>>>>>> this test alone "by hand", it is successful... the same test was
>>>>>> successful yesterday...
>>>>>>
>>>>>> Is there some kind of "race condition" that can happen on the
>>>>>> creation of "tmp" files if many tests run together on the same
>>>>>> node? (we are oversubscribing even sequential runs...)
>>>>>>
>>>>>> Here are the build logs:
>>>>>>
>>>>>>
>>>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>>>>>
>>>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Eric
_______________________________________________
devel mailing list
[email protected]
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel