Actually, you just use the envar that was previously cited in a different email thread:
if (NULL != getenv(OPAL_MCA_PREFIX"orte_launch")) {
    /* you were launched by mpirun */
} else {
    /* you were direct launched */
}
This is available from the time of the first instruction, so no worries as to when you look.
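
For reference, here is a minimal standalone sketch of the same check from
application code; it assumes OPAL_MCA_PREFIX expands to the usual
"OMPI_MCA_" prefix:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* assumption: OPAL_MCA_PREFIX is "OMPI_MCA_", so the envar the
     * application actually sees is OMPI_MCA_orte_launch */
    if (NULL != getenv("OMPI_MCA_orte_launch")) {
        printf("launched by mpirun/orted\n");
    } else {
        printf("direct launched (or singleton)\n");
    }
    return 0;
}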
> On Sep 15, 2016, at 7:50 AM, Pritchard Jr., Howard <[email protected]> wrote:
>
> HI Gilles,
>
> At what point in the job launch do you need to determine whether
> or not the job was direct launched?
>
> Howard
>
> --
> Howard Pritchard
>
> HPC-DES
> Los Alamos National Laboratory
>
>
>
>
>
> On 9/15/16, 7:38 AM, "devel on behalf of Gilles Gouaillardet"
> <[email protected] on behalf of
> [email protected]> wrote:
>
>> Ralph,
>>
>> that looks good to me.
>>
>> can you please remind me how to test if an app was launched by
>> mpirun/orted or direct launched by the RM ?
>>
>> right now, which direct launch methods are supported ?
>> i am aware of srun (SLURM) and aprun (CRAY), are there any others ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On Thu, Sep 15, 2016 at 7:10 PM, [email protected] <[email protected]>
>> wrote:
>>>
>>> On Sep 15, 2016, at 12:51 AM, Gilles Gouaillardet <[email protected]>
>>> wrote:
>>>
>>> Ralph,
>>>
>>>
>>> my reply is in the text
>>>
>>>
>>> On 9/15/2016 11:11 AM, [email protected] wrote:
>>>
>>> If we are going to make a change, then let’s do it only once. Since we
>>> introduced PMIx and the concept of the string namespace, the plan has
>>> been
>>> to switch away from a numerical jobid and to the namespace. This
>>> eliminates
>>> the issue of the hash altogether. If we are going to make a disruptive
>>> change, then let’s do that one. Either way, this isn’t something that
>>> could
>>> go into the 2.x series. It is far too invasive, and would have to be
>>> delayed
>>> until a 3.x at the earliest.
>>>
>>> got it !
>>>
>>> Note that I am not yet convinced that is the issue here. We’ve had this
>>> hash
>>> for 12 years, and this is the first time someone has claimed to see a
>>> problem. That makes me very suspicious that the root cause isn’t what
>>> you
>>> are pursuing. This is only being reported for _singletons_, and that is
>>> a
>>> very unique code path. The only reason for launching the orted is to
>>> support
>>> PMIx operations such as notification and comm_spawn. If those aren’t
>>> being
>>> used, then we could use the “isolated” mode where the usock OOB isn’t
>>> even
>>> activated, thus eliminating the problem. This would be a much smaller
>>> “fix”
>>> and could potentially fit into 2.x
>>>
>>> a bug has been identified and fixed, let's wait and see how things go
>>>
>>> how can i use the isolated mode ?
>>> shall i simply
>>> export OMPI_MCA_pmix=isolated
>>> export OMPI_MCA_plm=isolated
>>> ?
>>>
>>> out of curiosity, does "isolated" mean we would not even need to fork
>>> the HNP ?
>>>
>>>
>>> Yes - that’s the idea. Simplify and make things faster. All you have to
>>> do
>>> is set OMPI_MCA_ess_singleton_isolated=1 on master, and I believe that
>>> code
>>> is in 2.x as well
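>>>
>>> As a sketch only (assuming, as usual, that MCA parameters are read from
>>> the environment during MPI_Init), a singleton could also request
>>> isolated mode programmatically:
>>>
>>> #include <mpi.h>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     /* must be set before MPI_Init, which is when MCA params are read;
>>>      * equivalent to "export OMPI_MCA_ess_singleton_isolated=1" */
>>>     setenv("OMPI_MCA_ess_singleton_isolated", "1", 1);
>>>
>>>     MPI_Init(&argc, &argv);
>>>     printf("singleton init complete\n");
>>>     MPI_Finalize();
>>>     return 0;
>>> }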
>>>
>>>
>>>
>>> FWIW: every organization I’ve worked with has an epilog script that
>>> blows
>>> away temp dirs. It isn’t the RM-based environment that is of concern -
>>> it’s
>>> the non-RM one where epilog scripts don’t exist that is the problem.
>>>
>>> well, i was looking at this the other way around.
>>> if mpirun/orted creates the session directory with mkstemp(), then
>>> there is no longer any need to do cleanup
>>> (as long as you do not run out of disk space).
>>> but with direct run, there is always a small risk that a previous
>>> session directory is reused, hence the requirement for an epilogue.
>>> also, if the RM is configured to run one job at a time on a given node,
>>> the epilog can be quite trivial.
>>> but if several jobs can run on a given node at the same time, the
>>> epilog becomes less trivial.
>>>
>>>
>>> Yeah, this session directory thing has always been problematic. We’ve
>>> had
>>> litter problems since day one, and tried multiple solutions over the
>>> years.
>>> Obviously, none of those has proven fully successful :-(
>>>
>>> Artem came up with a good solution using PMIx that allows the RM to
>>> control
>>> the session directory location for both direct launch and mpirun launch,
>>> thus ensuring the RM can cleanup the correct place upon session
>>> termination.
>>> As we get better adoption of that method out there, then the RM-based
>>> solution (even for multiple jobs sharing a node) should be resolved.
>>>
>>> This leaves the non-RM (i.e., ssh-based launch using mpirun) problem.
>>> Your
>>> proposal would resolve that one as (a) we always have orted’s in that
>>> scenario, and (b) the orted’s pass the session directory down to the
>>> apps.
>>> So maybe the right approach is to use mkstemp() in the scenario where
>>> we are
>>> launched via orted and the RM has not specified a session directory.
>>>
>>> I’m not sure we can resolve the direct launch without PMIx problem - I
>>> think
>>> that’s best left as another incentive for RMs to get on-board the PMIx
>>> bus.
>>>
>>> Make sense?
>>>
>>>
>>>
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> On Sep 14, 2016, at 6:05 PM, Gilles Gouaillardet <[email protected]>
>>> wrote:
>>>
>>> Ralph,
>>>
>>>
>>> On 9/15/2016 12:11 AM, [email protected] wrote:
>>>
>>> Many things are possible, given infinite time :-)
>>>
>>> i could not agree more :-D
>>>
>>> The issue with this notion lies in direct launch scenarios - i.e., when
>>> procs are launched directly by the RM and not via mpirun. In this case,
>>> there is nobody who can give us the session directory (well, until PMIx
>>> becomes universal), and so the apps must be able to generate a name that
>>> they all can know. Otherwise, we lose shared memory support because they
>>> can’t rendezvous.
>>>
>>> thanks for the explanation,
>>> now let me rephrase that
>>> "a MPI task must be able to rebuild the path to the session directory,
>>> based
>>> on the information it has when launched.
>>> if mpirun is used, we have several options to pass this option to the
>>> MPI
>>> tasks.
>>> in case of direct run, this info is unlikely (PMIx is not universal
>>> (yet))
>>> passed by the batch manager, so we have to use what is available"
>>>
>>> my concern is that, to keep things simple, the session directory is
>>> based on the Open MPI jobid, and since the stepid is zero most of the
>>> time, the jobid really means the job family, which is stored in 16 bits.
>>>
>>> in the case of mpirun, the jobfam is a 16-bit hash of the hostname (a
>>> reasonably sized string) and the mpirun pid (32 bits on Linux).
>>> if several mpiruns are invoked on the same host at a given time, there
>>> is a risk that two distinct jobs are assigned the same jobfam (since we
>>> hash from 32 bits down to 16 bits).
>>> also, there is a risk the session directory already exists from a
>>> previous job, with some/all files and unix sockets from that previous
>>> job, leading to undefined behavior
>>> (an early crash if we are lucky, odd things otherwise).
>>>
>>> in the case of a direct run, i guess the jobfam is a 16-bit hash of the
>>> jobid passed by the RM, and once again, there is a risk of a conflict
>>> and/or the re-use of a previous session directory.
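>>>
>>> to give a rough idea of the collision risk, here is a toy sketch (not
>>> the actual hash in plm_base_jobid.c) that folds a hostname and a pid
>>> into 16 bits and counts how many pids in the default Linux pid range
>>> land on a jobfam value that is already taken:
>>>
>>> #include <stdio.h>
>>> #include <stdint.h>
>>>
>>> /* toy hash, for illustration only */
>>> static uint16_t toy_jobfam(const char *host, uint32_t pid)
>>> {
>>>     uint32_t h = 5381;
>>>     for (const char *p = host; *p != '\0'; p++) {
>>>         h = h * 33 + (unsigned char)*p;   /* djb2-style string hash */
>>>     }
>>>     h ^= pid;                             /* mix in the launcher pid */
>>>     return (uint16_t)(h ^ (h >> 16));     /* fold 32 bits down to 16 */
>>> }
>>>
>>> int main(void)
>>> {
>>>     static unsigned hits[65536];
>>>     unsigned collisions = 0;
>>>
>>>     for (uint32_t pid = 300; pid < 32768; pid++) {
>>>         if (hits[toy_jobfam("lorien", pid)]++ > 0) {
>>>             collisions++;   /* this pid reuses an already-taken jobfam */
>>>         }
>>>     }
>>>     printf("%u of %u pids collide with an earlier one\n",
>>>            collisions, 32768u - 300u);
>>>     return 0;
>>> }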
>>>
>>> to me, the issue here is that we are using the Open MPI jobfam to build
>>> the session directory path.
>>> instead, what if we:
>>> 1) with mpirun, use a session directory created by mkstemp(), and pass
>>> it to the MPI tasks via the environment, or retrieve it from
>>> orted/mpirun right after the communication has been established.
>>> 2) for direct run, use a session directory based on the full jobid
>>> (which might be a string or a number) as passed by the RM.
>>>
>>> in case of 1), there is no longer any risk of a hash conflict or of
>>> re-using a previous session directory.
>>> in case of 2), there is no longer any risk of a hash conflict, but
>>> there is still a risk of re-using a session directory from a previous
>>> (e.g. terminated) job.
>>> that being said, once we document how the session directory is built
>>> from the jobid, sysadmins will be able to write an RM epilog that
>>> removes the session directory.
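>>>
>>> here is a minimal sketch of option 1), assuming mpirun uses mkdtemp()
>>> (the directory-creating variant of mkstemp()) and advertises the result
>>> through a hypothetical OMPI_SESSION_DIR environment variable:
>>>
>>> #define _DEFAULT_SOURCE          /* for mkdtemp()/setenv() on glibc */
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>>
>>> int main(void)
>>> {
>>>     char tmpl[] = "/tmp/ompi-session-XXXXXX";
>>>
>>>     /* mkdtemp() replaces the XXXXXX with a unique suffix and creates
>>>      * the directory with mode 0700, so two launchers cannot collide */
>>>     if (NULL == mkdtemp(tmpl)) {
>>>         perror("mkdtemp");
>>>         return 1;
>>>     }
>>>
>>>     /* mpirun would export the path so the orteds/apps inherit it
>>>      * (OMPI_SESSION_DIR is a hypothetical name, for illustration) */
>>>     setenv("OMPI_SESSION_DIR", tmpl, 1);
>>>     printf("session directory: %s\n", tmpl);
>>>     return 0;
>>> }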
>>>
>>> does that make sense ?
>>>
>>>
>>> However, that doesn’t seem to be the root problem here. I suspect there
>>> is a
>>> bug in the code that spawns the orted from the singleton, and
>>> subsequently
>>> parses the returned connection info. If you look at the error, you’ll
>>> see
>>> that both jobids have “zero” for their local jobid. This means that
>>> the two
>>> procs attempting to communicate both think they are daemons, which is
>>> impossible in this scenario.
>>>
>>> So something garbled the string that the orted returns on startup to the
>>> singleton, and/or the singleton is parsing it incorrectly. IIRC, the
>>> singleton gets its name from that string, and so I expect it is getting
>>> the
>>> wrong name - and hence the error.
>>>
>>> i will investigate that.
>>>
>>> As you may recall, you made a change a little while back where we
>>> modified
>>> the code in ess/singleton to be a little less strict in its checking of
>>> that
>>> returned string. I wonder if that is biting us here? It wouldn’t fix the
>>> problem, but might generate a different error at a more obvious place.
>>>
>>> do you mean
>>>
>>> https://github.com/open-mpi/ompi/commit/93e73841f9ec3739dec209365979e2badec0740f ?
>>> this has not been backported to v2.x, and the issue was reported on v2.x
>>>
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> On Sep 14, 2016, at 8:00 AM, Gilles Gouaillardet
>>> <[email protected]> wrote:
>>>
>>> Ralph,
>>>
>>> is there any reason to use a session directory based on the jobid (or
>>> job
>>> family) ?
>>> I mean, could we use mkstemp to generate a unique directory, and then
>>> propagate the path via orted comm or the environment ?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Wednesday, September 14, 2016, [email protected] <[email protected]>
>>> wrote:
>>>>
>>>> This has nothing to do with PMIx, Josh - the error is coming out of the
>>>> usock OOB component.
>>>>
>>>>
>>>> On Sep 14, 2016, at 7:17 AM, Joshua Ladd <[email protected]> wrote:
>>>>
>>>> Eric,
>>>>
>>>> We are looking into the PMIx code path that sets up the jobid. The
>>>> session
>>>> directories are created based on the jobid. It might be the case that
>>>> the
>>>> jobids (generated with rand) happen to be the same for different jobs
>>>> resulting in multiple jobs sharing the same session directory, but we
>>>> need
>>>> to check. We will update.
>>>>
>>>> Josh
>>>>
>>>> On Wed, Sep 14, 2016 at 9:33 AM, Eric Chamberland
>>>> <[email protected]> wrote:
>>>>>
>>>>> Lucky!
>>>>>
>>>>> Since each run has a specific TMP, I still have it on disk.
>>>>>
>>>>> for the faulty run, the TMP variable was:
>>>>>
>>>>> TMP=/tmp/tmp.wOv5dkNaSI
>>>>>
>>>>> and into $TMP I have:
>>>>>
>>>>> openmpi-sessions-40031@lorien_0
>>>>>
>>>>> and in this subdirectory I have a bunch of empty dirs:
>>>>>
>>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | wc -l
>>>>> 1841
>>>>>
>>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | more
>>>>> total 68
>>>>> drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>>>>> drwx------ 3 cmpbib bib 231 Sep 13 03:50 ..
>>>>> drwx------ 2 cmpbib bib 6 Sep 13 02:10 10015
>>>>> drwx------ 2 cmpbib bib 6 Sep 13 03:05 10049
>>>>> drwx------ 2 cmpbib bib 6 Sep 13 03:15 10052
>>>>> drwx------ 2 cmpbib bib 6 Sep 13 02:22 10059
>>>>> drwx------ 2 cmpbib bib 6 Sep 13 02:22 10110
>>>>> drwx------ 2 cmpbib bib 6 Sep 13 02:41 10114
>>>>> ...
>>>>>
>>>>> If I do:
>>>>>
>>>>> lsof |grep "openmpi-sessions-40031"
>>>>> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system
>>>>> /run/user/1000/gvfs
>>>>> Output information may be incomplete.
>>>>> lsof: WARNING: can't stat() tracefs file system
>>>>> /sys/kernel/debug/tracing
>>>>> Output information may be incomplete.
>>>>>
>>>>> nothing...
>>>>>
>>>>> What else may I check?
>>>>>
>>>>> Eric
>>>>>
>>>>>
>>>>> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>>>>>>
>>>>>> Hi, Eric
>>>>>>
>>>>>> I **think** this might be related to the following:
>>>>>>
>>>>>> https://github.com/pmix/master/pull/145
>>>>>>
>>>>>> I'm wondering if you can look into the /tmp directory and see if you
>>>>>> have a bunch of stale usock files.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet
>>>>>> <[email protected]
>>>>>> <mailto:[email protected]>> wrote:
>>>>>>
>>>>>> Eric,
>>>>>>
>>>>>>
>>>>>> can you please provide more information on how your tests are
>>>>>> launched ?
>>>>>>
>>>>>> do you
>>>>>>
>>>>>> mpirun -np 1 ./a.out
>>>>>>
>>>>>> or do you simply
>>>>>>
>>>>>> ./a.out
>>>>>>
>>>>>>
>>>>>> do you use a batch manager ? if yes, which one ?
>>>>>>
>>>>>> do you run one test per job ? or multiple tests per job ?
>>>>>>
>>>>>> how are these tests launched ?
>>>>>>
>>>>>>
>>>>>> does the test that crashes use MPI_Comm_spawn ?
>>>>>>
>>>>>> i am surprised by the process name [[9325,5754],0], which suggests
>>>>>> MPI_Comm_spawn was called 5753 times (!)
>>>>>>
>>>>>>
>>>>>> can you also run
>>>>>>
>>>>>> hostname
>>>>>>
>>>>>> on the 'lorien' host ?
>>>>>>
>>>>>> if you configured Open MPI with --enable-debug, can you
>>>>>>
>>>>>> export OMPI_MCA_plm_base_verbose=5
>>>>>>
>>>>>> then run one test and post the logs ?
>>>>>>
>>>>>>
>>>>>> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
>>>>>> produce job family 5576 (but you get 9325)
>>>>>>
>>>>>> the discrepancy could be explained by the use of a batch manager
>>>>>> and/or a full hostname i am unaware of.
>>>>>>
>>>>>>
>>>>>> orte_plm_base_set_hnp_name() generates a 16-bit job family from
>>>>>> the (32-bit hash of the) hostname and the mpirun (32-bit ?) pid.
>>>>>>
>>>>>> so strictly speaking, it is possible that two jobs launched on the
>>>>>> same node are assigned the same 16-bit job family.
>>>>>>
>>>>>>
>>>>>> the easiest way to detect this could be to
>>>>>>
>>>>>> - edit orte/mca/plm/base/plm_base_jobid.c
>>>>>>
>>>>>> and replace
>>>>>>
>>>>>> OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>>>>>                      "plm:base:set_hnp_name: final jobfam %lu",
>>>>>>                      (unsigned long)jobfam));
>>>>>>
>>>>>> with
>>>>>>
>>>>>> OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>>>>>                      "plm:base:set_hnp_name: final jobfam %lu",
>>>>>>                      (unsigned long)jobfam));
>>>>>>
>>>>>> configure Open MPI with --enable-debug and rebuild
>>>>>>
>>>>>> and then
>>>>>>
>>>>>> export OMPI_MCA_plm_base_verbose=4
>>>>>>
>>>>>> and run your tests.
>>>>>>
>>>>>>
>>>>>> when the problem occurs, you will be able to check which pids
>>>>>> produced the faulty jobfam, and that could hint at a conflict.
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> It is the third time this has happened in the last 10 days.
>>>>>>
>>>>>> While running nightly tests (~2200), we have one or two tests
>>>>>> that fail at the very beginning with this strange error:
>>>>>>
>>>>>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:
>>>>>> received unexpected process identifier [[9325,0],0] from
>>>>>> [[5590,0],0]
>>>>>>
>>>>>> But I can't reproduce the problem right now... i.e.: if I launch
>>>>>> this test alone "by hand", it is successful... the same test was
>>>>>> successful yesterday...
>>>>>>
>>>>>> Is there some kind of "race condition" that can happen on the
>>>>>> creation of "tmp" files if many tests run together on the same
>>>>>> node? (we are oversubscribing even sequential runs...)
>>>>>>
>>>>>> Here are the build logs:
>>>>>>
>>>>>>
>>>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>>>>>
>>>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Eric
_______________________________________________
devel mailing list
[email protected]
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel