hi again, [i'm going to snip out the sections that seem resolved] [also, sorry about mutating the subject last time -- oops.]
This sounds fine - you'll find that the bproc pls does the exact same thing. In that case, we use #ifdefs since the APIs are actually different between the versions - we just create a wrapper inside the bproc pls code for the older version so that we can always call the same API. I'm not sure what the case will be in LSF - I believe the function calls are indeed different, so you might be able to use the same approach.
okay
> i'll probably just continue experimenting on my own for the moment (tracking
> any updates to the main trunk LSF support) to see if i can figure it out. any
> advice on the best way to get such support back into trunk, if and when it
> exists / is working?

The *best* way would be for you to sign a third-party agreement - see the web site for details and a copy. Barring that, the only option would be to submit the code through either Jeff or me. We greatly prefer the agreement method as it is (a) less burdensome on us and (b) gives you greater flexibility.
i'll talk to 'the man' -- it should be okay ... eventually, at least ...
I can't speak to the motivation behind MPI-2 - the others in the group can do a much better job of that. What I can say is that we started out with a design to support such modes of operation as dynamic farms, but the group has been moving away from it due to a combination of performance impacts, reliability, and (frankly) lack of interest from our user community. Our intent now is to cut the RTE back to the basics required to support the MPI standard, including MPI-2 - which arguably says nothing about dynamic resource allocation.
that's true -- dynamic processes can be useful even under a static allocation. in fact, in the short term for my particular application, i'll probably do just that -- the user picks an initial allocation, and then i just do the best i can. hopefully the allocations will be 'small enough' to get away without dynamic acquisition for a while (a year?). beyond that, i guess i'm just one of those guys that thinks it's a shame that MPI supplanted pvm so long ago in the first place. and yes, i already looked into modifying pvm instead, no thank you ... ;)
Not to say we won't support it - just indicating that such support will have lower priority and that the system will be designed primarily for other priorities. So dynamic resource allocation will have to be considered as an "exception case", with all the attendant implications.
fair enough. i'm still hoping it won't be too exceptional, really. on a related note, perhaps it is possible to 'join' running openMPI jobs (using nameservers or whatnot)? if so, then application level workarounds are also possible -- and can even be automated if the application just launches a whole new copy of itself via whatever top-level means was used to launch itself in the first place.
I think someone is feeding you a very extreme view of LSF. I have interacted for years with people working with LSF-based systems, and can count on the fingers of one hand the people who are operating the way you describe.
perhaps -- i'm trying to convince the guy it's worth taking a look at enhancing open-mpi/open-rte as opposed to continuing with his internal effort. maybe i'll get him to chime in directly on this issue -- however ...
*Can* you use LSF that way? Sure. Is that how most people use it? Not from what I have seen. Still, if that's a mode you want to support...have at it! ;-)
... that said, his library already has the needed workarounds for this usage model. still, the communication is much simpler -- TCP point to point only (which is 'enough' for me now, but i'm not sure about the future), and i'm a little worried about the maturity and scalability (both software-engineering-wise and performance-wise) of his effort.
Keep in mind, though, that Open MPI is driven by performance for large-scale multiprocessor computations. As I indicated earlier, the type of operation you are describing will have to be treated as an "exception case". Literally, this means you are welcome to try and make it work, but the fundamental operations of the system won't be designed to optimize that mode at the sacrifice of the primary objective.
again, fair enough. ;)
> duly noted. i don't pretend to be able to follow the current control flow at
> the moment. i think just running the debug version with all the printouts
> should help me a lot there. also, perhaps if i just make a rmgr_dyn_lsf, and
> don't use sds, then there might not be as many subsystems involved to
> complain. actually, i suspect the LSF specific part would be (very) small, so
> perhaps it could be rmgr_dynurm + a new component type like dynraspls to
> encapsulate the DRM specific part.

You have to use sds as this is the framework where the application process learns its name. That framework will be receiving more responsibilities in the revised implementation, so you'll unfortunately have to use it. Your best bet (IMHO) would be to create an lsf_farm component in the new PLM when we get the system revised.
sounds right -- some things will of course depend on when i need what working where. but if possible i'll try not to get in too deep before some of these design changes are in. any hints on a timeline?
> hmm, i'm thinking that if there was a way to directly tell open-rte to acquire
> more daemons non-blockingly, that would be enough.
> in the LSF case, i think one would bsub the daemons themselves (with arguments
> sufficient to phone-home, so no sds needed?), so (node acquisition == daemon
> startup).

You could - though this sounds pretty non-scalable to me.
hmm, in what way? my impression is that you need to go back to an LSF queue on every request for new resources from LSF. there might be some way to give a (variably) higher priority to running jobs, but it's still going to require a bsub()/lsb_submit() (or similar API) to get new resources. otherwise, it defeats the queuing / job control system. i think.
> these functions could be called heuristically by MPI-2 spawn type functions,
> or even manually by the application (in the short term). it should not affect
> the semantics of the MPI-2 calls themselves.

Your best bet would be to have your own component so that you could do whatever you wanted with the spawn API. You could play with an RMGR component for now, but your best bet is clearly going to be the new PLM.
sounds right. same potential timeline issues as above of course.
> the goal is that one could determine (at least with some confidence) if there
> were any free (and ready to spawn quickly without blocking) resources before
> issuing a spawn call. this might just mean examining the value of the MPI
> universe size (and that this value could change), or it might need some new
> interface, i dunno.

You know, the real issue here (I think) is being driven by your use of bsub - which I believe is a batch launch request. Why would you want to do that instead of just directly calling lsb_launch()? I suspect we can get the Platform folks to give us an API to request additional node allocations from inside our program, so why not just use the API to launch? Or are you going the batch route because we don't currently have an API and you want to support older LSF versions? Might be more pain than it's worth...
well, certainly part of the issue is the need (or at least strong preference) to support 6.2 -- but read on.

hmm, i'll need to review the APIs in more detail, but here is my current understanding: there appear to be some overlaps between the ls_* and lsb_* functions, but they seem basically compatible as far as i can tell. almost all the functions have a command line version as well, for example: lsb_submit()/bsub, lsb_getalloc()/none. lsb_getalloc() and lsb_launch()/blaunch are new with LSF 7.0, but appear to just be a different (simpler) interface to existing functionality in the LSB_* env vars and ls_rexec()/lsgrun -- although, as you say, perhaps Platform will hook or enhance them later. but, the key issue is that lsb_launch() just starts tasks -- it does not perform or interact with the queue or job control (much?). so, you can't use these functions to get an allocation in the first place, and you have to be careful not to use them as a way around the queuing system.

[ as a side note, ls_rexec()/lsgrun is the one i have heard admins do not like because it can break queuing/accounting, and might try to disable somehow. i don't really buy that, because it's not like you can disable it and have the system still work, since (as above) parallel job launching depends on it. i guess if you really don't care about parallel launching maybe you could. but, if used properly after a proper allocation, i don't think there should (or even can) be a problem. ]

so, lsb_submit()/bsub is a combination allocate/launch -- you specify the allocation size you want, and when it's all ready, it runs the 'job' (really the job launcher) only on one (randomly chosen) 'head' node from the allocation, with the env vars set so the launcher can use the ls_rexec()/lsgrun functions to start the rest of the job. there are of course various script wrappers you can use (mpijob, pvmjob, etc) instead of your 'real job'.
then, i think lsf *should* try to track what processes get started via the wrapper / head process so it knows they are part of the same job. i dunno if it really does that -- but, my guess is that at the least it assumes the allocation is in use until the original process ends. in any case, the wrapper / head process examines the environment vars and uses ls_rexec()/lsgrun or the like to actually run N copies of the 'real job' executable. in 7.0, it can conveniently use lsb_getalloc() and lsb_launch(), but that doesn't really change any semantics as far as i know. one could imagine that calling lsb_launch() instead of ls_rexec() might be preferable from a process tracking point of view, but i don't see why Platform couldn't hook ls_rexec() just as well as lsb_launch(). i really need to get a little more confidence on that issue, since it's what determines what actions will (or perhaps already do in practice) 'break' the queuing/reporting system.

there are some 'allocate only' functions as well, such as ls_placereq()/lsplace -- these can just return a host list / set the env vars without running anything at first. apparently, you need to run something 'soon' on the resultant hosts or the load balancer might get confused and reuse them. also, since this doesn't seem to go through the queues, it's probably not a viable set of functions to really use. a red herring, as far as i'm concerned.

there is also an lsb_runjob() that is similar to lsb_launch(), but for an already submitted job. so, if one were to lsb_submit() with an option set to never launch it automatically, and then one were to run lsb_runjob(), you can avoid the queue and/or force the use of certain hosts? i guess this is also not a good function to use, but at least the queuing system would be aware of any bad behavior (queue skipping via ls_placereq() to get extra hosts, for instance) in this case ...
there does *not* appear to be an option to lsb_submit() that allows a non-blocking programmatic callback when allocation is complete. if there was, it would need to deal with process tracking issues, or maybe just merge the old and new jobs somehow in that case. so to speak to the original point, it would indeed be nice to be able to do additional allocations (and then an lsb_launch()) with a simple programmatic interface for completeness, but i don't see one. however, lsb_submit() is pretty close -- it makes a 'new' job, but i think that's okay. the initial daemon that gets run on the 'head' (i.e. randomly chosen) node of the new job will run an lsb_launch() or similar to start up the remaining N-1 daemons as children -- thus hopefully keeping the queuing system and process tracking happy. or you could use some LSF option / wrapper script to tell it to run the same daemon on all N hosts for you, if some suitable option/wrapper exists anyway.

so, in summary: lsb_submit() does allocation + one (non-optional) launch on allocation completion. lsb_launch() (or similar) does only launching, should probably only be run from the single process started by lsb_submit(), and should only launch things on the allocation given by lsb_getalloc() (or the env vars).

Matt.