Hi Ralph,

> On 21 Jan 2015, at 21:20 , Ralph Castain <r...@open-mpi.org> wrote:
> 
> Hi Mark
> 
>> On Jan 21, 2015, at 11:21 AM, Mark Santcroos <mark.santcr...@rutgers.edu> 
>> wrote:
>> 
>> Hi Ralph, all,
>> 
>> To give some background, I'm part of the RADICAL-Pilot [1] development team.
>> RADICAL-Pilot is a Pilot System, an implementation of the Pilot (job) 
>> concept, which in its most minimal form takes care of decoupling resource 
>> acquisition from workload management.
>> So instead of launching your real_science.exe through PBS, you submit a 
>> Pilot, which allows you to perform application-level scheduling.
>> The most obvious use-case is when you want to run many (relatively) small 
>> tasks; then you really don't want to go through the batch system every 
>> time. That is besides the fact that these machines are very bad at 
>> managing many tasks anyway.
> 
> Yeah, we sympathize.

That's always good :-)

> Of course, one obvious solution is to get an allocation and execute a shell 
> script that runs the tasks within that allocation - yes?

Not really. Most of our use-cases have dynamic runtime properties, which means 
that at t=0 the exact workload is not known.

In addition, I don't think such a script would allow me to work around the 
aprun bottleneck, as I'm not aware of a way to start MPI tasks that span 
multiple nodes from a Cray worker node.
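
To make that concrete, such a script inside the allocation would, as far as I 
can see, boil down to something like the sketch below (task names and sizes 
are made up; my understanding is that every aprun call still has to go through 
ALPS from the MOM node):

    # runs on the MOM node inside a single PBS allocation (illustrative only)
    for cfg in task_*.cfg; do
        aprun -n 32 ./real_science.exe "$cfg" &   # one ALPS launch per task
    done
    wait

So for many small tasks the apruns themselves become the serialization point, 
which is exactly what I'm trying to avoid.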

>> I looked a bit better at ORCM and it clearly overlaps with what I want to 
>> achieve.
> 
> Agreed. In ORCM, we allow a user to request a “session” that results in 
> allocation of resources. Each session is given an “orchestrator” - the ORCM 
> “shepherd” daemon - responsible for executing the individual tasks across the 
> assigned allocation, and a collection of “lamb” daemons (one on each node of 
> the allocation) that forms a distributed VM. The orchestrator can execute the 
> tasks very quickly since it doesn’t have to go back to the scheduler, and we 
> allow it to do so according to any provided precedence requirement. Again, 
> for simplicity, a shell script is the default mechanism for submitting the 
> individual tasks.

Yeah, a similar solution to a similar problem.
I noticed that exascale is also part of the motivation? How does this relate to 
the PMIx effort? Different part of the stack, I guess.

>> One thing I noticed is that parts of it runs as root, why is that?
> 
> ORCM is a full resource manager, which means it has a scheduler (rudimentary 
> today) and boot-time daemons that must run as root so they can fork/exec the 
> session-level daemons (that run at the user level). The orchestrator and its 
> daemons all run at the user-level.

Ok. Our solution is user-space only, as one of our features is that we are able 
to run across different types of systems. Both approaches come with a trade-off, 
obviously.

>>> We used to have a cmd line option in ORTE for what you propose - it 
>>> wouldn’t be too hard to restore. Is there some reason to do so?
>> 
>> Can you point me to something that I could look for in the repo history, 
>> then I can see if it serves my purpose.
> 
> It would be back in the svn repo, I fear - would take awhile to hunt it down. 
> Basically, it just (a) started all the daemons to create a VM, and (b) told 
> mpirun to stick around as a persistent daemon. All subsequent calls to mpirun 
> would reference back to the persistent one, thus using it to launch the jobs 
> against the standing VM instead of starting a new one every time.

*nod* That's what I tried to do this afternoon actually with the 
"--ompi-server" option, but that was not meant to be.

> For ORCM, we just took that capability and expressed it as the “shepherd” 
> plus “lamb” daemon architecture described above.

ACK.

> If you don’t want to replace the base RM, then using ORTE to establish a 
> persistent VM is probably the way to go.

Indeed, that's what it sounds like. Plus, ORTE is generic enough that I can 
re-use it on other types of systems too.
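
Just to sketch the workflow I would hope for (the command and option names 
below are purely hypothetical, only to illustrate the shape of it):

    # hypothetical commands, for illustration only
    # start a persistent VM once, spanning the allocation
    orte-vm --hostfile $PBS_NODEFILE --report-uri ./dvm.uri &

    # then fire many small jobs at it, without spinning up new daemons each time
    orte-run --dvm-uri file:./dvm.uri -np 64 ./task_a
    orte-run --dvm-uri file:./dvm.uri -np 8  ./task_b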

> I can probably make it do that again fairly readily. We have a developer’s 
> meeting next week, which usually means I have some free time (during evenings 
> and topics I’m not involved with), so I can take a crack at this then if that 
> would be timely enough.

Happy to accept that offer. At this stage I'm not sure whether I would want a 
CLI or would rather be able to do this programmatically, though.
I'm also more than willing to assist in any way I can.

I tried to see how it all works, but because of the modular nature of OMPI 
that was quite daunting. There is some learning curve, I guess :-)
So it seems that mpirun is persistent and opens up a listening port, and then 
some orteds get launched that phone home.
From there I got lost in the MCA maze. How do the tasks get onto the compute 
nodes and get started?
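
Is cranking up the verbosity on the launch-related frameworks the right way to 
follow that path? Something like the line below is what I had in mind 
(parameter names as I understand them, so please correct me if these are the 
wrong frameworks):

    # plm = daemon launch, odls = local spawn of the actual tasks (my reading)
    mpirun --mca plm_base_verbose 5 --mca odls_base_verbose 5 -np 4 ./hello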

Thanks a lot again, I appreciate your help.

Cheers,

Mark
