Hi Ralph,

All makes sense! Thanks a lot!
Looking forward to your modifications. Please don't hesitate to throw things with rough edges at me!

Cheers, Mark

> On 21 Jan 2015, at 23:21, Ralph Castain <r...@open-mpi.org> wrote:
>
> Let me address your questions up here so you don’t have to scan thru the entire note.
>
> PMIx rationale: PMI has been around for a long time, primarily used inside the MPI library implementations to perform wireup. It provided a link from the MPI library to the local resource manager. However, as we move towards exascale, two things became apparent:
>
> 1. The current PMI implementations don’t scale adequately to get there. The API created too many communications and assumed everything was a blocking operation, thus preventing asynchronous progress.
>
> 2. There were increasing requests for application-level interactions with the resource manager. People want ways to spawn jobs (and not just from within MPI), request pre-location of data, control power, etc. Rather than having every RM write its own interface (and thus make everyone’s code non-portable), we at Intel decided to extend the existing PMI definitions to support those functions. Thus, an application developer can directly access PMIx functions to perform all those operations.
>
> PMIx v1.0 is about to be released - it’ll be backward compatible with PMI-1 and PMI-2, plus add non-blocking operations and significantly reduce the number of communications. PMIx 2.0 is slated for this summer and will include the advanced control capabilities.
>
> ORCM is being developed because we needed a BSD-licensed, fully featured resource manager. This will allow us to integrate the RM even more tightly with the file system, networking, and other subsystems, thus achieving higher launch performance and providing desired features such as QoS management. PMIx is a part of that plan, but as you say, they each play their separate roles in the overall stack.
>
> Persistent ORTE: there is a learning curve on ORTE, I fear. We do have some videos on the web site that can help get you started, and I’ve given a number of “classes” at Intel now for that purpose. I still have it on my “to-do” list to summarize those classes and post them on the web site.
>
> For now, let me summarize how things work. At startup, mpirun reads the allocation (usually from the environment, but it depends on the host RM) and launches a daemon on each allocated node. Each daemon reads its local hardware environment and “phones home” to let mpirun know it is alive. Once all daemons have reported, mpirun maps the processes to the nodes and sends that map to all the daemons in a scalable broadcast pattern.
>
> Upon receipt of the launch message, each daemon parses it to identify which procs it needs to locally spawn. Once spawned, each proc connects back to its local daemon via a Unix domain socket for wireup support. As procs complete, the daemon maintains its bookkeeping and reports back to mpirun once all procs are done. When all procs are reported complete (or one reports as abnormally terminated), mpirun sends a “die” message to every daemon so everything cleanly terminates.
>
> What I will do is simply tell mpirun not to do that last step, but instead to wait to receive a “terminate” cmd before ending the daemons. This will allow you to reuse the existing DVM, making each independent job start a great deal faster. You’ll need to either manually terminate the DVM, or the RM will do so when the allocation expires.
>
> HTH
> Ralph
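To make the “wireup support” and non-blocking operations above concrete, here is a minimal sketch of the client side written against the v1-era PMIx API (signatures changed in later releases, so treat the names and arguments as illustrative rather than definitive; the "my-endpoint" key and address are made up for the example):

    /* Minimal sketch of PMIx-style wireup, assuming the v1-era client API.
     * Each proc publishes its connection info, commits, and fences; the
     * non-blocking fence is what permits asynchronous progress instead of
     * stalling inside the library. */
    #include <stdio.h>
    #include <pmix.h>

    static volatile int fence_done = 0;

    /* Invoked by the PMIx progress thread when the fence completes */
    static void fence_cb(pmix_status_t status, void *cbdata)
    {
        fence_done = 1;
    }

    int main(int argc, char **argv)
    {
        pmix_proc_t myproc;
        pmix_value_t val;

        if (PMIX_SUCCESS != PMIx_Init(&myproc)) {
            fprintf(stderr, "PMIx_Init failed\n");
            return 1;
        }

        /* Publish an endpoint for the other procs in the namespace */
        val.type = PMIX_STRING;
        val.data.string = "tcp://10.0.0.1:12345";
        PMIx_Put(PMIX_GLOBAL, "my-endpoint", &val);
        PMIx_Commit();

        /* Non-blocking fence across all procs in our namespace; real code
         * would do useful work here while the collective completes */
        PMIx_Fence_nb(NULL, 0, NULL, 0, fence_cb, NULL);
        while (!fence_done) {
            ; /* poll, or make application progress */
        }

        PMIx_Finalize();
        return 0;
    }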
>> On Jan 21, 2015, at 12:52 PM, Mark Santcroos <mark.santcr...@rutgers.edu> wrote:
>>
>> Hi Ralph,
>>
>>> On 21 Jan 2015, at 21:20, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> Hi Mark
>>>
>>>> On Jan 21, 2015, at 11:21 AM, Mark Santcroos <mark.santcr...@rutgers.edu> wrote:
>>>>
>>>> Hi Ralph, all,
>>>>
>>>> To give some background, I'm part of the RADICAL-Pilot [1] development team. RADICAL-Pilot is a Pilot System, an implementation of the Pilot (job) concept, which in its most minimal form takes care of decoupling resource acquisition from workload management. So instead of launching your real_science.exe through PBS, you submit a Pilot, which allows you to perform application-level scheduling. The most obvious use-case is when you want to run many (relatively) small tasks; then you really don't want to go through the batch system every time. That is besides the fact that these machines are very bad at managing many tasks anyway.
>>>
>>> Yeah, we sympathize.
>>
>> That's always good :-)
>>
>>> Of course, one obvious solution is to get an allocation and execute a shell script that runs the tasks within that allocation - yes?
>>
>> Not really. Most of our use-cases have dynamic runtime properties, which means that at t=0 the exact workload is not known.
>>
>> In addition, I don't think such a script would allow me to work around the aprun bottleneck, as I'm not aware of a way to start MPI tasks that span multiple nodes from a Cray worker node.
>>
>>>> I looked a bit closer at ORCM and it clearly overlaps with what I want to achieve.
>>>
>>> Agreed. In ORCM, we allow a user to request a “session” that results in an allocation of resources. Each session is given an “orchestrator” - the ORCM “shepherd” daemon - responsible for executing the individual tasks across the assigned allocation, and a collection of “lamb” daemons (one on each node of the allocation) that forms a distributed VM. The orchestrator can execute the tasks very quickly since it doesn’t have to go back to the scheduler, and we allow it to do so according to any provided precedence requirement. Again, for simplicity, a shell script is the default mechanism for submitting the individual tasks.
>>
>> Yeah, a similar solution to a similar problem.
>> I noticed that exascale is also part of the motivation? How does this relate to the PMIx effort? Different part of the stack, I guess.
>>
>>>> One thing I noticed is that parts of it run as root, why is that?
>>>
>>> ORCM is a full resource manager, which means it has a scheduler (rudimentary today) and boot-time daemons that must run as root so they can fork/exec the session-level daemons (which run at the user level). The orchestrator and its daemons all run at the user level.
>>
>> OK. Our solution is user-space only, as one of our features is that we are able to run across different types of systems. Both approaches come with a tradeoff, obviously.
>>
>>>>> We used to have a cmd line option in ORTE for what you propose - it wouldn’t be too hard to restore. Is there some reason to do so?
>>>>
>>>> Can you point me to something that I could look for in the repo history? Then I can see if it serves my purpose.
>>>
>>> It would be back in the svn repo, I fear - it would take a while to hunt it down. Basically, it just (a) started all the daemons to create a VM, and (b) told mpirun to stick around as a persistent daemon. All subsequent calls to mpirun would reference back to the persistent one, thus using it to launch the jobs against the standing VM instead of starting a new one every time.
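A side note on the application-level spawn capability Ralph mentions earlier (and the programmatic interface Mark wonders about below): in PMIx this surfaces as PMIx_Spawn. Here is a hedged sketch against the released PMIx headers - macro and field names shifted somewhat between versions, and "./real_science.exe" is simply the placeholder from the discussion:

    /* Sketch: submitting a job programmatically via PMIx_Spawn instead of
     * a CLI; the RM behind the call would be the persistent DVM discussed
     * in this thread. */
    #include <stdio.h>
    #include <string.h>
    #include <pmix.h>

    int main(void)
    {
        pmix_proc_t myproc;
        pmix_app_t *app;
        char nspace[PMIX_MAX_NSLEN + 1];

        if (PMIX_SUCCESS != PMIx_Init(&myproc)) {
            return 1;
        }

        /* Describe one executable and how many copies to run */
        PMIX_APP_CREATE(app, 1);
        app->cmd = strdup("./real_science.exe");
        app->maxprocs = 4;  /* four copies, placed by the RM/DVM */

        /* Ask the RM to launch it; the new job's namespace comes back */
        if (PMIX_SUCCESS == PMIx_Spawn(NULL, 0, app, 1, nspace)) {
            printf("spawned job %s\n", nspace);
        }

        PMIX_APP_FREE(app, 1);
        PMIx_Finalize();
        return 0;
    }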
>> *nod* That's what I tried to do this afternoon actually with the "--ompi-server" option, but that was not meant to be.
>>
>>> For ORCM, we just took that capability and expressed it as the “shepherd” plus “lamb” daemon architecture described above.
>>
>> ACK.
>>
>>> If you don’t want to replace the base RM, then using ORTE to establish a persistent VM is probably the way to go.
>>
>> Indeed, that's what it sounds like. Plus, ORTE is generic enough that I can re-use it on other types of systems too.
>>
>>> I can probably make it do that again fairly readily. We have a developer’s meeting next week, which usually means I have some free time (during evenings and topics I’m not involved with), so I can take a crack at this then if that would be timely enough.
>>
>> Happy to accept that offer. At this stage I'm not sure whether I would want a CLI or would be more interested in being able to do this programmatically, though. Also, I'm more than willing to assist in any way I can.
>>
>> I tried to see how it all worked, but because of the modular nature of OMPI that was quite daunting. There is some learning curve, I guess :-)
>> So it seems that mpirun is persistent and opens up a listening port, then some orteds get launched that phone home.
>> From there I got lost in the MCA maze. How do the tasks get onto the compute nodes and started?
>>
>> Thanks a lot again, I appreciate your help.
>>
>> Cheers,
>>
>> Mark
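As to Mark's closing question - how the tasks get onto the compute nodes and started - Ralph's summary at the top of the thread covers it: the launch message fans out to the daemons, and each daemon locally spawns the procs assigned to it. Stripped of the real machinery (ORTE's odls framework also handles environment setup, process binding, and the wireup socket), that local step is essentially a fork/exec, sketched here for illustration only:

    /* Greatly simplified sketch of a daemon locally spawning one proc, per
     * the launch flow described above; "./real_science.exe" is a placeholder. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        const char *cmd = "./real_science.exe";
        pid_t pid = fork();

        if (0 == pid) {
            /* Child: become the application proc */
            execl(cmd, cmd, (char *)NULL);
            perror("execl");  /* reached only if the exec failed */
            _exit(1);
        } else if (pid > 0) {
            /* Parent (the daemon): bookkeeping - wait and report status */
            int status;
            waitpid(pid, &status, 0);
            printf("proc %d exited with status %d\n", (int)pid,
                   WIFEXITED(status) ? WEXITSTATUS(status) : -1);
        } else {
            perror("fork");
            return 1;
        }
        return 0;
    }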