Hi Ralph,

All makes sense! Thanks a lot!

Looking forward to your modifications.
Please don't hesitate to throw things with rough edges at me!

Cheers,

Mark

> On 21 Jan 2015, at 23:21 , Ralph Castain <r...@open-mpi.org> wrote:
> 
> Let me address your questions up here so you don’t have to scan through the 
> entire note.
> 
> PMIx rationale: PMI has been around for a long time, primarily used inside 
> the MPI library implementations to perform wireup. It provides a link from 
> the MPI library to the local resource manager. However, as we move towards 
> exascale, two things have become apparent:
> 
> 1. the current PMI implementations don’t scale adequately to get there. The 
> API requires too many communications and assumes everything is a blocking 
> operation, thus preventing asynchronous progress
> 
> 2. there were increasing requests for application-level interactions with 
> the resource manager. People want ways to spawn jobs (and not just from 
> within MPI), request pre-location of data, control power, etc. Rather than 
> having every RM write its own interface (and thus make everyone’s code 
> non-portable), we at Intel decided to extend the existing PMI definitions 
> to support those functions. Thus, an application developer can directly 
> access PMIx functions to perform all those operations.
> 
> PMIx v1.0 is about to be released - it’ll be backward compatible with PMI-1 
> and PMI-2, plus add non-blocking operations and significantly reduce the 
> number of communications. PMIx 2.0 is slated for this summer and will 
> include the advanced control capabilities.
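> 
> For illustration, here is a rough sketch of what the non-blocking style 
> looks like from the application side. The signatures follow later public 
> PMIx headers, so the exact v1.0 calls may differ, and the key and endpoint 
> value are made up for the example:
> 
>   /* sketch only -- signatures taken from later PMIx releases */
>   #include <pmix.h>
> 
>   static volatile int fence_done = 0;
> 
>   /* invoked by the PMIx progress thread when the fence completes */
>   static void fence_cb(pmix_status_t status, void *cbdata)
>   {
>       (void)status; (void)cbdata;
>       fence_done = 1;
>   }
> 
>   int main(void)
>   {
>       pmix_proc_t me;
>       pmix_value_t val;
> 
>       if (PMIX_SUCCESS != PMIx_Init(&me, NULL, 0)) return 1;
> 
>       /* publish a key for our peers, then commit it locally */
>       val.type = PMIX_STRING;
>       val.data.string = "tcp://10.0.0.1:5000";  /* hypothetical endpoint */
>       PMIx_Put(PMIX_GLOBAL, "my.endpoint", &val);
>       PMIx_Commit();
> 
>       /* non-blocking fence: overlap wireup with local setup instead
>          of stalling in a blocking PMI barrier */
>       PMIx_Fence_nb(NULL, 0, NULL, 0, fence_cb, NULL);
>       while (!fence_done) {
>           /* ... do other useful initialization work ... */
>       }
> 
>       PMIx_Finalize(NULL, 0);
>       return 0;
>   }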
> 
> ORCM is being developed because we needed a BSD-licensed, fully featured 
> resource manager. This will allow us to integrate the RM even more tightly 
> with the file system, networking, and other subsystems, thus achieving 
> higher launch performance and providing desired features such as QoS 
> management. PMIx is a part of that plan, but as you say, they each play 
> their separate roles in the overall stack.
> 
> 
> Persistent ORTE: there is a learning curve on ORTE, I fear. We do have some 
> videos on the web site that can help get you started, and I’ve given a 
> number of “classes” at Intel now for that purpose. I still have it on my 
> “to-do” list to summarize those classes and post them on the web site.
> 
> For now, let me summarize how things work. At startup, mpirun reads the 
> allocation (usually from the environment, but it depends on the host RM) and 
> launches a daemon on each allocated node. Each daemon reads its local 
> hardware environment and “phones home” to let mpirun know it is alive. Once 
> all daemons have reported, mpirun maps the processes to the nodes and sends 
> that map to all the daemons in a scalable broadcast pattern.
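> 
> Purely to make that pattern concrete, here is a toy sketch of the 
> phone-home bookkeeping -- conceptual only, not actual ORTE code, with stub 
> functions standing in for the real mapping and broadcast steps:
> 
>   #include <stdio.h>
> 
>   #define NUM_NODES 4              /* size of the allocation */
>   static int num_reported = 0;     /* daemons that have phoned home */
> 
>   /* stub: map the job's procs onto the reported nodes */
>   static void map_procs_to_nodes(void)
>   {
>       printf("mapping procs across %d nodes\n", NUM_NODES);
>   }
> 
>   /* stub: scalable broadcast of the launch message to all daemons */
>   static void xcast_launch_msg(void)
>   {
>       printf("broadcasting launch message\n");
>   }
> 
>   /* called once per daemon "phone home" message */
>   static void daemon_phoned_home(int nodeid)
>   {
>       printf("daemon on node %d is alive\n", nodeid);
>       if (++num_reported == NUM_NODES) {
>           map_procs_to_nodes();
>           xcast_launch_msg();
>       }
>   }
> 
>   int main(void)
>   {
>       /* in reality these callbacks arrive asynchronously */
>       for (int n = 0; n < NUM_NODES; n++)
>           daemon_phoned_home(n);
>       return 0;
>   }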
> 
> Upon receipt of the launch message, each daemon parses it to identify which 
> procs it needs to locally spawn. Once spawned, each proc connects back to its 
> local daemon via a Unix domain socket for wireup support. As procs complete, 
> the daemon maintains bookkeeping and reports back to mpirun once all procs 
> are done. When all procs are reported complete (or one reports as abnormally 
> terminated), mpirun sends a “die” message to every daemon so it will cleanly 
> terminate.
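> 
> The local rendezvous is an ordinary Unix domain socket. A minimal sketch of 
> the proc side follows; the socket path here is hypothetical (in reality the 
> daemon publishes the rendezvous point), and the real exchange is more 
> involved:
> 
>   #include <stdio.h>
>   #include <string.h>
>   #include <unistd.h>
>   #include <sys/socket.h>
>   #include <sys/un.h>
> 
>   int main(void)
>   {
>       struct sockaddr_un addr;
>       int fd = socket(AF_UNIX, SOCK_STREAM, 0);
>       if (fd < 0) { perror("socket"); return 1; }
> 
>       memset(&addr, 0, sizeof(addr));
>       addr.sun_family = AF_UNIX;
>       /* hypothetical rendezvous path published by the local daemon */
>       strncpy(addr.sun_path, "/tmp/orted.sock", sizeof(addr.sun_path) - 1);
> 
>       if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
>           perror("connect");
>           return 1;
>       }
> 
>       /* from here the proc would exchange wireup (endpoint) info
>          with its daemon */
>       close(fd);
>       return 0;
>   }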
> 
> What I will do is simply tell mpirun to not do that last step, but instead to 
> wait to receive a “terminate” cmd before ending the daemons. This will allow 
> you to reuse the existing DVM, making each independent job start a great deal 
> faster. You’ll need to either manually terminate the DVM, or the RM will do 
> so when the allocation expires.
> 
> HTH
> Ralph
> 
> 
>> On Jan 21, 2015, at 12:52 PM, Mark Santcroos <mark.santcr...@rutgers.edu> 
>> wrote:
>> 
>> Hi Ralph,
>> 
>>> On 21 Jan 2015, at 21:20 , Ralph Castain <r...@open-mpi.org> wrote:
>>> 
>>> Hi Mark
>>> 
>>>> On Jan 21, 2015, at 11:21 AM, Mark Santcroos <mark.santcr...@rutgers.edu> 
>>>> wrote:
>>>> 
>>>> Hi Ralph, all,
>>>> 
>>>> To give some background, I'm part of the RADICAL-Pilot [1] development 
>>>> team.
>>>> RADICAL-Pilot is a Pilot System, an implementation of the Pilot (job) 
>>>> concept, which, in its most minimal form, takes care of decoupling 
>>>> resource acquisition from workload management.
>>>> So instead of launching your real_science.exe through PBS, you submit a 
>>>> Pilot, which allows you to perform application-level scheduling.
>>>> The most obvious use case is when you want to run many (relatively) 
>>>> small tasks; then you really don't want to go through the batch system 
>>>> every time. That is besides the fact that these machines are very bad 
>>>> at managing many tasks anyway.
>>> 
>>> Yeah, we sympathize.
>> 
>> That's always good :-)
>> 
>>> Of course, one obvious solution is to get an allocation and execute a shell 
>>> script that runs the tasks within that allocation - yes?
>> 
>> Not really. Most of our use-cases have dynamic runtime properties, which 
>> means that at t=0 the exact workload is not known.
>> 
>> In addition, I don't think such a script would allow me to work around the 
>> aprun bottleneck, as I'm not aware of a way to start MPI tasks that span 
>> multiple nodes from a Cray worker node.
>> 
>>>> I looked a bit more closely at ORCM, and it clearly overlaps with what I 
>>>> want to achieve.
>>> 
>>> Agreed. In ORCM, we allow a user to request a “session” that results in 
>>> allocation of resources. Each session is given an “orchestrator” - the ORCM 
>>> “shepherd” daemon - responsible for executing the individual tasks across 
>>> the assigned allocation, and a collection of “lamb” daemons (one on each 
>>> node of the allocation) that forms a distributed VM. The orchestrator can 
>>> execute the tasks very quickly since it doesn’t have to go back to the 
>>> scheduler, and we allow it to do so according to any provided precedence 
>>> requirement. Again, for simplicity, a shell script is the default mechanism 
>>> for submitting the individual tasks.
>> 
>> Yeah, similar solution to a similar problem.
>> I noticed that exascale is also part of the motivation? How does this 
>> relate to the PMIx effort? A different part of the stack, I guess.
>> 
>>>> One thing I noticed is that parts of it run as root; why is that?
>>> 
>>> ORCM is a full resource manager, which means it has a scheduler 
>>> (rudimentary today) and boot-time daemons that must run as root so they can 
>>> fork/exec the session-level daemons (that run at the user level). The 
>>> orchestrator and its daemons all run at the user-level.
>> 
>> Ok. Our solution is user-space only, as one of our features is that we are 
>> able to run across different types of systems. Both approaches come with 
>> trade-offs, obviously.
>> 
>>>>> We used to have a cmd line option in ORTE for what you propose - it 
>>>>> wouldn’t be too hard to restore. Is there some reason to do so?
>>>> 
>>>> Can you point me to something that I could look for in the repo history? 
>>>> Then I can see if it serves my purpose.
>>> 
>>> It would be back in the svn repo, I fear - it would take a while to hunt 
>>> it down. Basically, it just (a) started all the daemons to create a VM, and 
>>> (b) told mpirun to stick around as a persistent daemon. All subsequent 
>>> calls to mpirun would reference back to the persistent one, thus using it 
>>> to launch the jobs against the standing VM instead of starting a new one 
>>> every time.
>> 
>> *nod* That's what I tried to do this afternoon, actually, with the 
>> "--ompi-server" option, but that was not meant to be.
>> 
>>> For ORCM, we just took that capability and expressed it as the “shepherd” 
>>> plus “lamb” daemon architecture described above.
>> 
>> ACK.
>> 
>>> If you don’t want to replace the base RM, then using ORTE to establish a 
>>> persistent VM is probably the way to go.
>> 
>> Indeed, that's what it sounds like. Plus, ORTE is generic enough that I 
>> can reuse it on other types of systems too.
>> 
>>> I can probably make it do that again fairly readily. We have a developer’s 
>>> meeting next week, which usually means I have some free time (during 
>>> evenings and topics I’m not involved with), so I can take a crack at this 
>>> then if that would be timely enough.
>> 
>> Happy to accept that offer. At this stage I'm not sure if I would want a 
>> CLI or would be more interested in being able to do this programmatically, 
>> though. I'm also more than willing to assist in any way I can.
>> 
>> I tried to see how it all worked, but because of the modular nature of 
>> OMPI that was quite daunting. There is some learning curve, I guess :-)
>> So it seems that mpirun is persistent and opens up a listening port, and 
>> then some orteds get launched that phone home.
>> From there I got lost in the MCA maze. How do the tasks get onto the 
>> compute nodes and get started?
>> 
>> Thanks a lot again, I appreciate your help.
>> 
>> Cheers,
>> 
>> Mark