Hi Ralph,

> On 21 Jan 2015, at 21:20 , Ralph Castain <r...@open-mpi.org> wrote:
>
> Hi Mark
>
>> On Jan 21, 2015, at 11:21 AM, Mark Santcroos <mark.santcr...@rutgers.edu>
>> wrote:
>>
>> Hi Ralph, all,
>>
>> To give some background, I'm part of the RADICAL-Pilot [1] development team.
>> RADICAL-Pilot is a Pilot System, an implementation of the Pilot (job)
>> concept, which in its most minimal form takes care of decoupling resource
>> acquisition from workload management.
>> So instead of launching your real_science.exe through PBS, you submit a
>> Pilot, which allows you to perform application-level scheduling.
>> The most obvious use-case: if you want to run many (relatively) small tasks,
>> then you really don't want to go through the batch system every time. That
>> is besides the fact that these machines are very bad at managing many tasks
>> anyway.
>
> Yeah, we sympathize.
That's always good :-)

> Of course, one obvious solution is to get an allocation and execute a shell
> script that runs the tasks within that allocation - yes?

Not really. Most of our use-cases have dynamic runtime properties, which means
that at t=0 the exact workload is not known.
In addition, I don't think such a script would allow me to work around the
aprun bottleneck, as I'm not aware of a way to start MPI tasks that span
multiple nodes from a Cray worker node.

>> I looked a bit better at ORCM and it clearly overlaps with what I want to
>> achieve.
>
> Agreed. In ORCM, we allow a user to request a “session” that results in
> allocation of resources. Each session is given an “orchestrator” - the ORCM
> “shepherd” daemon - responsible for executing the individual tasks across the
> assigned allocation, and a collection of “lamb” daemons (one on each node of
> the allocation) that forms a distributed VM. The orchestrator can execute the
> tasks very quickly since it doesn’t have to go back to the scheduler, and we
> allow it to do so according to any provided precedence requirement. Again,
> for simplicity, a shell script is the default mechanism for submitting the
> individual tasks.

Yeah, a similar solution to a similar problem.
I noticed that Exascale is also part of the motivation?
How does this relate to the PMIx effort? A different part of the stack, I guess.

>> One thing I noticed is that parts of it run as root, why is that?
>
> ORCM is a full resource manager, which means it has a scheduler (rudimentary
> today) and boot-time daemons that must run as root so they can fork/exec the
> session-level daemons (that run at the user level). The orchestrator and its
> daemons all run at the user level.

Ok. Our solution is user-space only, as one of our features is that we are able
to run across different types of systems. Both approaches come with a tradeoff,
obviously.
>>> We used to have a cmd line option in ORTE for what you propose - it
>>> wouldn’t be too hard to restore. Is there some reason to do so?
>>
>> Can you point me to something that I could look for in the repo history,
>> then I can see if it serves my purpose.
>
> It would be back in the svn repo, I fear - would take awhile to hunt it down.
> Basically, it just (a) started all the daemons to create a VM, and (b) told
> mpirun to stick around as a persistent daemon. All subsequent calls to mpirun
> would reference back to the persistent one, thus using it to launch the jobs
> against the standing VM instead of starting a new one every time.

*nod* That's what I tried to do this afternoon, actually, with "--ompi-server",
but that was not meant to be.

> For ORCM, we just took that capability and expressed it as the “shepherd”
> plus “lamb” daemon architecture described above.

ACK.

> If you don’t want to replace the base RM, then using ORTE to establish a
> persistent VM is probably the way to go.

Indeed, that's what it sounds like. Plus, ORTE is generic enough that I can
re-use it on other types of systems too.

> I can probably make it do that again fairly readily. We have a developer’s
> meeting next week, which usually means I have some free time (during evenings
> and topics I’m not involved with), so I can take a crack at this then if that
> would be timely enough.

Happy to accept that offer. At this stage I'm not sure whether I would want a
CLI or would rather be able to do this programmatically, though. I'm also more
than willing to assist in any way I can.

I tried to see how it all worked, but because of the modular nature of OMPI
that was quite daunting. There is some learning curve, I guess :-)
So it seems that mpirun stays persistent and opens a listening port, then some
orteds get launched that phone home. From there I got lost in the MCA maze.
How do the tasks get onto the compute nodes and started?

Thanks a lot again, I appreciate your help.
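For readers following along: the persistent-VM workflow being discussed here
is, as far as I understand it, roughly what later surfaced in Open MPI as the
`orte-dvm` / `orte-submit` tool pair. The exact command names and flags below
are therefore an assumption based on those later releases, not something that
existed at the time of this thread; treat it as a sketch of the idea only:

```shell
# Sketch only: launch a standing DVM (one orted per allocated node) and
# write its contact URI to a file. Flag names are taken from later Open MPI
# releases and may differ per version.
orte-dvm --report-uri dvm.uri &

# Submit tasks against the standing VM. No new daemons are started and the
# batch system is not involved, so launch latency stays low even for many
# small tasks.
orte-submit --hnp file:dvm.uri -n 4 ./task_a
orte-submit --hnp file:dvm.uri -n 2 ./task_b
```

These commands require a running allocation, so they are shown here purely as
an illustration of the "persistent mpirun plus orteds" mechanism Ralph
describes above.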
Cheers,

Mark