Hi Ralph,

Great, the semantics look exactly like what I need!

(To aid in debugging I added "--debug-devel" to orte-dvm.c, which was useful for 
detecting and getting past some initial bumps.)

The current status:

* I can submit applications and see their output on the orte-dvm console (I 
drive the tools roughly as in the sketch after this list)

* The following message is repeated endlessly on the orte-submit console:

[warn] opal_libevent2022_event_base_loop: reentrant invocation.  Only one 
event_base_loop can run on each event_base at once.

* orte-submit doesn't return, even though I see "[nid02819:20571] [[2120,0],0] 
dvm: job [2120,9] has completed" on the orte-dvm console.

* On the orte-dvm console I see the following when submitting (so also for 
"successful" runs):

[nid02434:00564] [[9021,0],0] Releasing job data for [INVALID]
[nid03388:26474] [[9021,0],2] ORTE_ERROR_LOG: Not found in file 
../../../../orte/mca/odls/base/odls_base_default_fns.c at line 433
[nid03534:31545] procdir: /tmp/openmpi-sessions-62758@nid03534_0/9021/1/0
[nid03534:31545] jobdir: /tmp/openmpi-sessions-62758@nid03534_0/9021/1
[nid03534:31545] top: openmpi-sessions-62758@nid03534_0
[nid03534:31545] tmp: /tmp
[nid03534:31545] sess_dir_finalize: proc session dir does not exist

* If I don't specify any "-np" to orte-submit, then I see the following on the 
orte-dvm console:

[nid02434:00564] [[9021,0],0] Releasing job data for [INVALID]
[nid03388:26474] [[9021,0],2] ORTE_ERROR_LOG: Not found in file 
../../../../orte/mca/odls/base/odls_base_default_fns.c at line 433
[nid03534:31544] [[9021,0],1] ORTE_ERROR_LOG: Not found in file 
../../../../orte/mca/odls/base/odls_base_default_fns.c at line 433

* It only seems to work for single-node runs (probably related to the previous 
point).
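
For reference, here is roughly how I'm driving the tools. The file name and 
application name below are placeholders, and I'm quoting the URI options from 
memory, so please treat "--report-uri" and "--hnp file:..." as my assumptions 
rather than the actual option names; "-np" and "-terminate" are the ones from 
your mail:

# in one shell: start the DVM (it stays up until terminated)
$ orte-dvm --report-uri dvm.uri

# in another shell: run jobs against the standing DVM
$ orte-submit --hnp file:dvm.uri -np 4 ./my_app
$ orte-submit --hnp file:dvm.uri -np 2 ./my_app &   # backgrounded jobs can overlap
$ orte-submit --hnp file:dvm.uri -terminate         # shut the DVM down when done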


Is this all expected behaviour given the current implementation?


Thanks!

Mark



> On 02 Feb 2015, at 4:21, Ralph Castain <r...@open-mpi.org> wrote:
> 
> I have pushed the changes to the OMPI master. It took a little bit more than 
> I had hoped due to the changes to the ORTE infrastructure, but hopefully this 
> will meet your needs. It consists of two new tools:
> 
> (a) orte-dvm - starts the virtual machine by launching a daemon on every node 
> of the allocation, as constrained by -host and/or -hostfile. Check the 
> options for outputting the URI as you’ll need that info for the other tool. 
> The DVM remains “up” until you issue the orte-submit -terminate command, or 
> hit the orte-dvm process with a sigterm.
> 
> (b) orte-submit - takes the place of mpirun. Basically just packages your app 
> and arguments and sends it to orte-dvm for execution. Requires the URI of 
> orte-dvm. The tool exits once the job has completed execution, though you can 
> run multiple jobs in parallel by backgrounding orte-submit or issuing 
> commands from separate shells.
> 
> I’ve added man pages for both tools, though they may not be complete. Also, I 
> don’t have all the mapping/ranking/binding options supported just yet as I 
> first wanted to see if this meets your basic needs before worrying about the 
> detail.
> 
> Let me know what you think
> Ralph
> 
> 
>> On Jan 21, 2015, at 4:07 PM, Mark Santcroos <mark.santcr...@rutgers.edu> 
>> wrote:
>> 
>> Hi Ralph,
>> 
>> All makes sense! Thanks a lot!
>> 
>> Looking forward to your modifications.
>> Please don't hesitate to throw things with rough edges at me!
>> 
>> Cheers,
>> 
>> Mark
>> 
>>> On 21 Jan 2015, at 23:21, Ralph Castain <r...@open-mpi.org> wrote:
>>> 
>>> Let me address your questions up here so you don’t have to scan thru the 
>>> entire note.
>>> 
>>> PMIx rationale: PMI has been around for a long time, primarily used inside 
>>> the MPI library implementations to perform wireup. It provided a link from 
>>> the MPI library to the local resource manager. However, as we move towards 
>>> exascale, two things became apparent:
>>> 
>>> 1. the current PMI implementations don’t scale adequately to get there. The 
>>> API created too many communications and assumed everything was a blocking 
>>> operation, thus preventing asynchronous progress
>>> 
>>> 2. there were increasing requests for application-level interactions to the 
>>> resource manager. People want ways to spawn jobs (and not just from within 
>>> MPI), request pre-location of data, control power, etc. Rather than having 
>>> every RM write its own interface (and thus make everyone’s code 
>>> non-portable), we at Intel decided to extend the existing PMI definitions 
>>> to support those functions. Thus, an application developer can directly 
>>> access PMIx functions to perform all those operations.
>>> 
>>> PMIx v1.0 is about to be released - it’ll be backward compatible with PMI-1 
>>> and PMI-2, plus add non-blocking operations and significantly reduce the 
>>> number of communications. PMIx 2.0 is slated for this summer and will 
>>> include the advanced controls capabilities.
>>> 
>>> ORCM is being developed because we needed a BSD-licensed, fully featured 
>>> resource manager. This will allow us to integrate the RM even more tightly 
>>> to the file system, networking, and other subsystems, thus achieving higher 
>>> launch performance and providing desired features such as QoS management. 
>>> PMIx is a part of that plan, but as you say, they each play their separate 
>>> roles in the overall stack.
>>> 
>>> 
>>> Persistent ORTE: there is a learning curve on ORTE, I fear. We do have some 
>>> videos on the web site that can help get you started, and I’ve given a 
>>> number of “classes” at Intel now for that purpose. It is still on my “to-do” 
>>> list to summarize those classes and post them on the web site.
>>> 
>>> For now, let me summarize how things work. At startup, mpirun reads the 
>>> allocation (usually from the environment, but it depends on the host RM) 
>>> and launches a daemon on each allocated node. Each daemon reads its local 
>>> hardware environment and “phones home” to let mpirun know it is alive. Once 
>>> all daemons have reported, mpirun maps the processes to the nodes and sends 
>>> that map to all the daemons in a scalable broadcast pattern.
>>> 
>>> Upon receipt of the launch message, each daemon parses it to identify which 
>>> procs it needs to locally spawn. Once spawned, each proc connects back to 
>>> its local daemon via a Unix domain socket for wireup support. As procs 
>>> complete, the daemon maintains bookkeeping and reports back to mpirun once 
>>> all procs are done. When all procs are reported complete (or one reports as 
>>> abnormally terminated), mpirun sends a “die” message to every daemon so it 
>>> will cleanly terminate.
>>> 
>>> What I will do is simply tell mpirun to not do that last step, but instead 
>>> to wait to receive a “terminate” cmd before ending the daemons. This will 
>>> allow you to reuse the existing DVM, making each independent job start a 
>>> great deal faster. You’ll need to either manually terminate the DVM, or the 
>>> RM will do so when the allocation expires.
>>> 
>>> HTH
>>> Ralph
>>> 
>>> 
>>>> On Jan 21, 2015, at 12:52 PM, Mark Santcroos <mark.santcr...@rutgers.edu> 
>>>> wrote:
>>>> 
>>>> Hi Ralph,
>>>> 
>>>>> On 21 Jan 2015, at 21:20, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> 
>>>>> Hi Mark
>>>>> 
>>>>>> On Jan 21, 2015, at 11:21 AM, Mark Santcroos 
>>>>>> <mark.santcr...@rutgers.edu> wrote:
>>>>>> 
>>>>>> Hi Ralph, all,
>>>>>> 
>>>>>> To give some background, I'm part of the RADICAL-Pilot [1] development 
>>>>>> team.
>>>>>> RADICAL-Pilot is a Pilot System, an implementation of the Pilot (job) 
>>>>>> concept, which in its most minimal form takes care of decoupling 
>>>>>> resource acquisition from workload management.
>>>>>> So instead of launching your real_science.exe through PBS, you submit a 
>>>>>> Pilot, which allows you to perform application-level scheduling.
>>>>>> The most obvious use-case is when you want to run many (relatively) 
>>>>>> small tasks; then you really don't want to go through the batch system 
>>>>>> every time. That is besides the fact that these machines are very bad 
>>>>>> at managing many tasks anyway.
>>>>> 
>>>>> Yeah, we sympathize.
>>>> 
>>>> That's always good :-)
>>>> 
>>>>> Of course, one obvious solution is to get an allocation and execute a 
>>>>> shell script that runs the tasks within that allocation - yes?
>>>> 
>>>> Not really. Most of our use-cases have dynamic runtime properties, which 
>>>> means that at t=0 the exact workload is not known.
>>>> 
>>>> In addition, I don't think such a script would allow me to work around the 
>>>> aprun bottleneck, as I'm not aware of a way to start MPI tasks that span 
>>>> multiple nodes from a Cray worker node.
>>>> 
>>>>>> I took a closer look at ORCM and it clearly overlaps with what I want 
>>>>>> to achieve.
>>>>> 
>>>>> Agreed. In ORCM, we allow a user to request a “session” that results in 
>>>>> allocation of resources. Each session is given an “orchestrator” - the 
>>>>> ORCM “shepherd” daemon - responsible for executing the individual tasks 
>>>>> across the assigned allocation, and a collection of “lamb” daemons (one 
>>>>> on each node of the allocation) that forms a distributed VM. The 
>>>>> orchestrator can execute the tasks very quickly since it doesn’t have to 
>>>>> go back to the scheduler, and we allow it to do so according to any 
>>>>> provided precedence requirement. Again, for simplicity, a shell script is 
>>>>> the default mechanism for submitting the individual tasks.
>>>> 
>>>> Yeah, a similar solution to a similar problem.
>>>> I noticed that exascale is also part of the motivation? How does this 
>>>> relate to the PMIx effort? A different part of the stack, I guess.
>>>> 
>>>>>> One thing I noticed is that parts of it run as root; why is that?
>>>>> 
>>>>> ORCM is a full resource manager, which means it has a scheduler 
>>>>> (rudimentary today) and boot-time daemons that must run as root so they 
>>>>> can fork/exec the session-level daemons (that run at the user level). The 
>>>>> orchestrator and its daemons all run at the user-level.
>>>> 
>>>> Ok. Our solution is user-space only, as one of our features is that we are 
>>>> able to run across different types of systems. Both approaches obviously 
>>>> come with trade-offs.
>>>> 
>>>>>>> We used to have a cmd line option in ORTE for what you propose - it 
>>>>>>> wouldn’t be too hard to restore. Is there some reason to do so?
>>>>>> 
>>>>>> Can you point me to something that I could look for in the repo history? 
>>>>>> Then I can see if it serves my purpose.
>>>>> 
>>>>> It would be back in the svn repo, I fear - it would take a while to hunt it 
>>>>> down. Basically, it just (a) started all the daemons to create a VM, and 
>>>>> (b) told mpirun to stick around as a persistent daemon. All subsequent 
>>>>> calls to mpirun would reference back to the persistent one, thus using it 
>>>>> to launch the jobs against the standing VM instead of starting a new one 
>>>>> every time.
>>>> 
>>>> *nod* That's actually what I tried to do this afternoon with 
>>>> "--ompi-server", but that was not meant to be.
>>>> 
>>>>> For ORCM, we just took that capability and expressed it as the “shepherd” 
>>>>> plus “lamb” daemon architecture described above.
>>>> 
>>>> ACK.
>>>> 
>>>>> If you don’t want to replace the base RM, then using ORTE to establish a 
>>>>> persistent VM is probably the way to go.
>>>> 
>>>> Indeed, that's what it sounds like. Plus, ORTE is generic enough that I 
>>>> can re-use it on other types of systems too.
>>>> 
>>>>> I can probably make it do that again fairly readily. We have a 
>>>>> developer’s meeting next week, which usually means I have some free time 
>>>>> (during evenings and topics I’m not involved with), so I can take a crack 
>>>>> at this then if that would be timely enough.
>>>> 
>>>> Happy to accept that offer. At this stage I'm not sure whether I would 
>>>> want a CLI or would be more interested in being able to do this 
>>>> programmatically, though.
>>>> Also more than willing to assist in any way I can.
>>>> 
>>>> I tried to see how it all works, but because of the modular nature of 
>>>> OMPI that was quite daunting. There is some learning curve, I guess :-)
>>>> So it seems that mpirun is persistent and opens up a listening port, and 
>>>> then some orteds get launched that phone home.
>>>> From there I got lost in the MCA maze. How do the tasks get onto the 
>>>> compute nodes and get started?
>>>> 
>>>> Thanks a lot again, I appreciate your help.
>>>> 
>>>> Cheers,
>>>> 
>>>> Mark