Actually, you just use the envar that was previously cited on a different email thread:

    if (NULL != getenv(OPAL_MCA_PREFIX"orte_launch")) { /* you were launched by mpirun */ } else { /* you were direct launched */ }

This is available from the time of the first instruction, so no worries as to when you look.
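For illustration, here is a minimal, self-contained sketch of that check. It assumes the default "OMPI_MCA_" prefix that OPAL_MCA_PREFIX expands to, and the helper name is made up:

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch only: decide whether this process was started under mpirun/orted
     * or direct launched by the resource manager, using the environment
     * variable cited above.  "OMPI_MCA_orte_launch" assumes the default
     * OPAL_MCA_PREFIX of "OMPI_MCA_". */
    static int launched_by_mpirun(void)
    {
        return NULL != getenv("OMPI_MCA_orte_launch");
    }

    int main(void)
    {
        if (launched_by_mpirun()) {
            printf("launched by mpirun/orted\n");
        } else {
            printf("direct launched (e.g. srun, aprun)\n");
        }
        return 0;
    }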
> On Sep 15, 2016, at 7:50 AM, Pritchard Jr., Howard <howa...@lanl.gov> wrote:
>
> Hi Gilles,
>
> At what point in the job launch do you need to determine whether or not the job was direct launched?
>
> Howard
>
> --
> Howard Pritchard
>
> HPC-DES
> Los Alamos National Laboratory
>
> On 9/15/16, 7:38 AM, "devel on behalf of Gilles Gouaillardet" <devel-boun...@lists.open-mpi.org on behalf of gilles.gouaillar...@gmail.com> wrote:
>
>> Ralph,
>>
>> that looks good to me.
>>
>> can you please remind me how to test if an app was launched by mpirun/orted or direct launched by the RM?
>>
>> right now, which direct launch methods are supported?
>> I am aware of srun (SLURM) and aprun (Cray); are there any others?
>>
>> Cheers,
>>
>> Gilles
>>
>> On Thu, Sep 15, 2016 at 7:10 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:
>>>
>>> On Sep 15, 2016, at 12:51 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>
>>> Ralph,
>>>
>>> my reply is in the text
>>>
>>> On 9/15/2016 11:11 AM, r...@open-mpi.org wrote:
>>>
>>> If we are going to make a change, then let's do it only once. Since we introduced PMIx and the concept of the string namespace, the plan has been to switch away from a numerical jobid and to the namespace. This eliminates the issue of the hash altogether. If we are going to make a disruptive change, then let's do that one. Either way, this isn't something that could go into the 2.x series. It is far too invasive, and would have to be delayed until 3.x at the earliest.
>>>
>>> got it!
>>>
>>> Note that I am not yet convinced that is the issue here. We've had this hash for 12 years, and this is the first time someone has claimed to see a problem. That makes me very suspicious that the root cause isn't what you are pursuing. This is only being reported for _singletons_, and that is a very unique code path. The only reason for launching the orted is to support PMIx operations such as notification and comm_spawn. If those aren't being used, then we could use the "isolated" mode where the usock OOB isn't even activated, thus eliminating the problem. This would be a much smaller "fix" and could potentially fit into 2.x.
>>>
>>> a bug has been identified and fixed, let's wait and see how things go
>>>
>>> how can I use the isolated mode? shall I simply
>>>
>>>     export OMPI_MCA_pmix=isolated
>>>     export OMPI_MCA_plm=isolated
>>>
>>> ?
>>>
>>> out of curiosity, does "isolated" mean we would not even need to fork the HNP?
>>>
>>> Yes - that's the idea. Simplify and make things faster. All you have to do is set OMPI_MCA_ess_singleton_isolated=1 on master, and I believe that code is in 2.x as well.
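As a hedged illustration of the isolated mode mentioned just above: MCA parameters can be supplied through the environment, so a singleton test program could request it roughly as follows. This is a sketch only; it relies on the OMPI_MCA_ess_singleton_isolated parameter named above and, per the discussion, is only appropriate when comm_spawn and PMIx notification are not used.

    #include <stdlib.h>
    #include <mpi.h>

    /* Sketch: request the isolated singleton mode by exporting the MCA
     * parameter before MPI_Init.  Equivalent to running the program as
     *   OMPI_MCA_ess_singleton_isolated=1 ./a.out
     */
    int main(int argc, char **argv)
    {
        setenv("OMPI_MCA_ess_singleton_isolated", "1", 0);  /* 0 = do not override */
        MPI_Init(&argc, &argv);
        /* ... singleton work that never calls MPI_Comm_spawn ... */
        MPI_Finalize();
        return 0;
    }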
>>> FWIW: every organization I've worked with has an epilog script that blows away temp dirs. It isn't the RM-based environment that is of concern - it's the non-RM one, where epilog scripts don't exist, that is the problem.
>>>
>>> well, I was looking at this the other way around.
>>> if mpirun/orted creates the session directory with mkstemp(), then there is no more need to do any cleanup (as long as you do not run out of disk space).
>>> but with direct run, there is always a little risk that a previous session directory is reused, hence the requirement for an epilog.
>>> also, if the RM is configured to run one job at a time on a given node, the epilog can be quite trivial; but if several jobs can run on a given node at the same time, the epilog becomes less trivial.
>>>
>>> Yeah, this session directory thing has always been problematic. We've had litter problems since day one, and tried multiple solutions over the years. Obviously, none of those has proven fully successful :-(
>>>
>>> Artem came up with a good solution using PMIx that allows the RM to control the session directory location for both direct launch and mpirun launch, thus ensuring the RM can clean up the correct place upon session termination. As we get better adoption of that method out there, the RM-based solution (even for multiple jobs sharing a node) should be resolved.
>>>
>>> This leaves the non-RM (i.e., ssh-based launch using mpirun) problem. Your proposal would resolve that one as (a) we always have orteds in that scenario, and (b) the orteds pass the session directory down to the apps. So maybe the right approach is to use mkstemp() in the scenario where we are launched via orted and the RM has not specified a session directory.
>>>
>>> I'm not sure we can resolve the direct-launch-without-PMIx problem - I think that's best left as another incentive for RMs to get on board the PMIx bus.
>>>
>>> Make sense?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Sep 14, 2016, at 6:05 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>
>>> Ralph,
>>>
>>> On 9/15/2016 12:11 AM, r...@open-mpi.org wrote:
>>>
>>> Many things are possible, given infinite time :-)
>>>
>>> I could not agree more :-D
>>>
>>> The issue with this notion lies in direct launch scenarios - i.e., when procs are launched directly by the RM and not via mpirun. In this case, there is nobody who can give us the session directory (well, until PMIx becomes universal), and so the apps must be able to generate a name that they all can know. Otherwise, we lose shared memory support because they can't rendezvous.
>>>
>>> thanks for the explanation,
>>> now let me rephrase that:
>>> "an MPI task must be able to rebuild the path to the session directory based on the information it has when launched.
>>> if mpirun is used, we have several options to pass this information to the MPI tasks.
>>> in the case of direct run, this info is unlikely to be passed by the batch manager (PMIx is not universal (yet)), so we have to use what is available."
>>>
>>> my concern is that, to keep things simple, the session directory is based on the Open MPI jobid, and since the stepid is zero most of the time, jobid really means job family, which is stored on 16 bits.
>>>
>>> in the case of mpirun, jobfam is a 16-bit hash of the hostname (a reasonably sized string) and the mpirun pid (32 bits on Linux).
>>> if several mpirun instances are invoked on the same host at a given time, there is a risk that two distinct jobs are assigned the same jobfam (since we hash from 32 bits down to 16 bits).
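To make the truncation risk concrete, here is a toy sketch (not the real orte_plm_base_set_hnp_name() code) showing how two mpirun pids on the same host can fold to the same 16-bit jobfam once everything is truncated to 16 bits:

    #include <stdint.h>
    #include <stdio.h>

    /* Toy illustration only -- NOT the actual ORTE hash.  It just shows why
     * folding a hostname hash and a 32-bit pid down to 16 bits can give two
     * concurrent mpirun instances the same jobfam. */
    static uint16_t toy_jobfam(const char *host, uint32_t pid)
    {
        uint32_t h = 5381;                       /* djb2-style string hash */
        for (const char *p = host; *p; ++p) {
            h = h * 33u + (uint8_t)*p;
        }
        return (uint16_t)((h ^ pid) & 0xffffu);  /* truncate to 16 bits */
    }

    int main(void)
    {
        /* Two pids on the same host that differ only above bit 15 collide
         * once truncated. */
        printf("%u\n", (unsigned)toy_jobfam("lorien", 142766u));
        printf("%u\n", (unsigned)toy_jobfam("lorien", 142766u + 0x10000u));
        return 0;
    }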
>>> also, there is a risk that the session directory already exists from a previous job, with some/all files and unix sockets from that previous job, leading to undefined behavior (an early crash if we are lucky, odd things otherwise).
>>>
>>> in the case of direct run, I guess jobfam is a 16-bit hash of the jobid passed by the RM, and once again there is a risk of a conflict and/or the reuse of a previous session directory.
>>>
>>> to me, the issue here is that we are using the Open MPI jobfam to build the session directory path. instead, what if we:
>>> 1) when launched by mpirun, use a session directory created by mkstemp(), and pass it to the MPI tasks via the environment, or retrieve it from orted/mpirun right after the communication has been established.
>>> 2) for direct run, use a session directory based on the full jobid (which might be a string or a number) as passed by the RM.
>>>
>>> in case 1), there is no more risk of a hash conflict or of reusing a previous session directory.
>>> in case 2), there is no more risk of a hash conflict, but there is still a risk of reusing a session directory from a previous (e.g. terminated) job.
>>> that being said, once we document how the session directory is built from the jobid, sysadmins will be able to write an RM epilog that removes the session directory.
>>>
>>> does that make sense?
>>>
>>> However, that doesn't seem to be the root problem here. I suspect there is a bug in the code that spawns the orted from the singleton, and subsequently parses the returned connection info. If you look at the error, you'll see that both jobids have "zero" for their local jobid. This means that the two procs attempting to communicate both think they are daemons, which is impossible in this scenario.
>>>
>>> So something garbled the string that the orted returns on startup to the singleton, and/or the singleton is parsing it incorrectly. IIRC, the singleton gets its name from that string, and so I expect it is getting the wrong name - and hence the error.
>>>
>>> I will investigate that.
>>>
>>> As you may recall, you made a change a little while back where we modified the code in ess/singleton to be a little less strict in its checking of that returned string. I wonder if that is biting us here? It wouldn't fix the problem, but might generate a different error at a more obvious place.
>>>
>>> do you mean
>>> https://github.com/open-mpi/ompi/commit/93e73841f9ec3739dec209365979e2badec0740f
>>> ?
>>> this has not been backported to v2.x, and the issue was reported on v2.x.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Sep 14, 2016, at 8:00 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>
>>> Ralph,
>>>
>>> is there any reason to use a session directory based on the jobid (or job family)?
>>> I mean, could we use mkstemp to generate a unique directory, and then propagate the path via orted comm or the environment?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Wednesday, September 14, 2016, r...@open-mpi.org <r...@open-mpi.org> wrote:
>>>>
>>>> This has nothing to do with PMIx, Josh - the error is coming out of the usock OOB component.
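As a sketch of the unique-session-directory idea Gilles floats above: for a directory, the mkdtemp() sibling of mkstemp() is the call that applies. This is an illustration only, not ORTE code, and the OMPI_SESSION_DIR_SKETCH variable name is made up:

    #define _XOPEN_SOURCE 700
    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch: create a unique, race-free session directory and export its
     * path so that children (or later MPI tasks) could find it. */
    int main(void)
    {
        char template_path[] = "/tmp/openmpi-session-XXXXXX";
        char *dir = mkdtemp(template_path);   /* unique directory, mode 0700 */
        if (NULL == dir) {
            perror("mkdtemp");
            return 1;
        }
        setenv("OMPI_SESSION_DIR_SKETCH", dir, 1);  /* hypothetical variable */
        printf("session dir: %s\n", dir);
        return 0;
    }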
>>>> On Sep 14, 2016, at 7:17 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>>
>>>> Eric,
>>>>
>>>> We are looking into the PMIx code path that sets up the jobid. The session directories are created based on the jobid. It might be the case that the jobids (generated with rand) happen to be the same for different jobs, resulting in multiple jobs sharing the same session directory, but we need to check. We will update.
>>>>
>>>> Josh
>>>>
>>>> On Wed, Sep 14, 2016 at 9:33 AM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
>>>>>
>>>>> Lucky!
>>>>>
>>>>> Since each run has a specific TMP, I still have it on disk.
>>>>>
>>>>> For the faulty run, the TMP variable was:
>>>>>
>>>>> TMP=/tmp/tmp.wOv5dkNaSI
>>>>>
>>>>> and into $TMP I have:
>>>>>
>>>>> openmpi-sessions-40031@lorien_0
>>>>>
>>>>> and into this subdirectory I have a bunch of empty dirs:
>>>>>
>>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | wc -l
>>>>> 1841
>>>>>
>>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | more
>>>>> total 68
>>>>> drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>>>>> drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
>>>>> drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
>>>>> drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
>>>>> drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
>>>>> drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
>>>>> drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
>>>>> drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
>>>>> ...
>>>>>
>>>>> If I do:
>>>>>
>>>>> lsof | grep "openmpi-sessions-40031"
>>>>> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>>>>>       Output information may be incomplete.
>>>>> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>>>>>       Output information may be incomplete.
>>>>>
>>>>> nothing...
>>>>>
>>>>> What else may I check?
>>>>>
>>>>> Eric
>>>>>
>>>>> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>>>>>>
>>>>>> Hi, Eric
>>>>>>
>>>>>> I **think** this might be related to the following:
>>>>>>
>>>>>> https://github.com/pmix/master/pull/145
>>>>>>
>>>>>> I'm wondering if you can look into the /tmp directory and see if you have a bunch of stale usock files.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>>>>
>>>>>> Eric,
>>>>>>
>>>>>> can you please provide more information on how your tests are launched?
>>>>>>
>>>>>> do you
>>>>>>
>>>>>>     mpirun -np 1 ./a.out
>>>>>>
>>>>>> or do you simply
>>>>>>
>>>>>>     ./a.out
>>>>>>
>>>>>> do you use a batch manager? if yes, which one?
>>>>>> do you run one test per job, or multiple tests per job?
>>>>>> how are these tests launched?
>>>>>>
>>>>>> does the test that crashes use MPI_Comm_spawn?
>>>>>> I am surprised by the process name [[9325,5754],0], which suggests MPI_Comm_spawn was called 5753 times (!)
>>>>>>
>>>>>> can you also run
>>>>>>
>>>>>>     hostname
>>>>>>
>>>>>> on the 'lorien' host?
>>>>>>
>>>>>> if you configured Open MPI with --enable-debug, can you
>>>>>>
>>>>>>     export OMPI_MCA_plm_base_verbose=5
>>>>>>
>>>>>> then run one test and post the logs?
>>>>>>
>>>>>> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should produce job family 5576 (but you get 9325).
>>>>>> the discrepancy could be explained by the use of a batch manager and/or a full hostname I am unaware of.
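As an aside, a small C sketch of the kind of check Eric did above with ls, counting leftover per-job subdirectories under the session directory (the path is taken from his report; note that d_type may be DT_UNKNOWN on some filesystems):

    #define _DEFAULT_SOURCE
    #include <dirent.h>
    #include <stdio.h>

    /* Sketch: count subdirectories left behind under an openmpi-sessions
     * directory, similar to Eric's "ls -la | wc -l". */
    int main(void)
    {
        const char *path = "/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0";
        DIR *d = opendir(path);
        if (NULL == d) {
            perror("opendir");
            return 1;
        }
        int count = 0;
        struct dirent *e;
        while (NULL != (e = readdir(d))) {
            if (DT_DIR == e->d_type) {   /* stale per-jobid directories */
                ++count;
            }
        }
        closedir(d);
        printf("%d directories under %s\n", count, path);
        return 0;
    }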
>>>>>> orte_plm_base_set_hnp_name() generates a 16-bit job family from the (32-bit hash of the) hostname and the mpirun (32-bit?) pid.
>>>>>> so strictly speaking, it is possible that two jobs launched on the same node are assigned the same 16-bit job family.
>>>>>>
>>>>>> the easiest way to detect this could be to:
>>>>>> - edit orte/mca/plm/base/plm_base_jobid.c and replace
>>>>>>
>>>>>>     OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>>>>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>>>>>                          (unsigned long)jobfam));
>>>>>>
>>>>>>   with
>>>>>>
>>>>>>     OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>>>>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>>>>>                          (unsigned long)jobfam));
>>>>>>
>>>>>> - configure Open MPI with --enable-debug and rebuild
>>>>>> - then
>>>>>>
>>>>>>     export OMPI_MCA_plm_base_verbose=4
>>>>>>
>>>>>>   and run your tests.
>>>>>>
>>>>>> when the problem occurs, you will be able to check which pids produced the faulty jobfam, and that could hint at a conflict.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> It is the third time this has happened in the last 10 days.
>>>>>>
>>>>>> While running nightly tests (~2200), we have one or two tests that fail at the very beginning with this strange error:
>>>>>>
>>>>>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]
>>>>>>
>>>>>> But I can't reproduce the problem right now... i.e., if I launch this test alone "by hand", it is successful... the same test was successful yesterday...
>>>>>>
>>>>>> Is there some kind of "race condition" that can happen on the creation of "tmp" files if many tests run together on the same node? (we are oversubscribing even sequential runs...)
>>>>>>
>>>>>> Here are the build logs:
>>>>>>
>>>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Eric
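For reference, a hedged sketch of how names like [[9325,5754],0] in the error above decompose, assuming the usual ORTE layout: a 32-bit jobid whose upper 16 bits are the job family (the hash discussed here) and whose lower 16 bits are the local jobid, followed by the vpid. The macros below are illustrative, not the ORTE headers:

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative decoding of a printed name [[jobfam,local],vpid]. */
    #define JOB_FAMILY(jobid)   (((jobid) >> 16) & 0x0000ffffu)
    #define LOCAL_JOBID(jobid)  ((jobid) & 0x0000ffffu)

    int main(void)
    {
        uint32_t jobid = (9325u << 16) | 5754u;   /* the [[9325,5754],...] name */
        printf("job family = %u, local jobid = %u\n",
               (unsigned)JOB_FAMILY(jobid), (unsigned)LOCAL_JOBID(jobid));
        /* A local jobid of 0, as in [[9325,0],0] and [[5590,0],0], denotes the
         * daemon job, which is why two "daemon" names talking to each other
         * looks wrong in this scenario. */
        return 0;
    }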
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel