Re: [gridengine users] Increasing global utilisation, the marketing talk and the truth?

Reuti Thu, 16 Aug 2012 14:30:58 -0700

Hi,

On 16.08.2012 at 22:51, Jake Carroll wrote:


> I'm currently assessing different job scheduling technologies for a sizeable 
> compute/HPC project I'm working on.
> 
> One of the things various vendors seem to always throw out there as a "value 
> add" in their respective scheduler is their ability to "drive up utilisation" 
> of the HPC cluster environment with some kind of advanced scheduling 
> mechanisms. Pretty much all the big guys seem to bang on about this kind of 
> thing. Moab talk it up, Platform LSF talk about it and say it's something 
> quite special. I don't hear Altair/PBS Pro say much about it, nor do I hear 
> it really made reference to in the OGE/SGE circles however.
> 
> So – I guess what I'm after is some reality. Are there some kind of highly 
> engineered/premium bits of proprietary code in what companies/schedulers like 
> Moab and Platform LSF (IBM) offer that can't be achieved in the SGE/OGE 
> "free" products?
> 
> The general intention is that you are always running your HPC environment at 
> full tilt, such that you aren't left with compute nodes being 
> underutilised/if the HPC environment is idle or under low load, it gives the 
> users who do need it maximum ability to maximise their compute performance, 
> but if it's busy, it will scale back appropriately (almost dynamically) such 
> that SLAs are adhered to. 
> 
> I heard the words "Goal driven SLA sensitive workload scheduling". I thought 
> that sounded like some lovely marketing speak, but I will try not to be 
> cynical about it.

I have no insight into LSF's capabilities, but to me it looks like all of them 
schedule jobs according to some policy that decides "who is the next one", and 
even with backfilling they will always leave cores idle at some point: either 
because all memory is already used up (or no small jobs are left in the 
waiting queue) or because you have to reserve cores/memory for a later 
parallel/big job.
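
To illustrate why a reservation leaves cores idle even with backfilling, here 
is a minimal Python sketch. It is not how SGE, Moab or LSF implement it; the 
`Job` fields, the `backfill` function and the sample numbers are all 
hypothetical, and it assumes every job carries a trustworthy runtime estimate.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int
    runtime: int  # user-supplied runtime estimate, arbitrary time units


def backfill(free_cores, reservation_start, waiting, now=0):
    """Start waiting jobs that fit into the free cores and are expected
    to finish before the reserved big job is due to start."""
    started = []
    for job in waiting:  # assumed to be in priority order already
        fits = job.cores <= free_cores
        ends_in_time = now + job.runtime <= reservation_start
        if fits and ends_in_time:
            started.append(job.name)
            free_cores -= job.cores
    return started, free_cores


# Hypothetical situation: 6 cores are free, but a parallel job has
# reserved them starting at t=60.
waiting = [Job("a", 4, 30), Job("b", 2, 120), Job("c", 1, 50), Job("d", 8, 10)]
started, idle = backfill(free_cores=6, reservation_start=60, waiting=waiting)
print(started, idle)
```

In this example jobs "a" and "c" are backfilled, "b" would run past the 
reservation and "d" does not fit, so one core stays idle until the reserved 
job starts.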

I'm not aware that any of them uses some kind of linear optimization to handle 
a cutting-stock-style problem: I have a bundle of jobs whose runtimes and 
resource requirements I know. Task: rearrange them in such a way that all of 
them finish in the least overall amount of time. Such a scheduler would already 
have some real-time behavior, as it could forecast the latest time at which 
your job will end and guarantee it.
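
The task described above (minimizing the overall finish time of a known bundle 
of jobs) can be sketched with a simple greedy heuristic. Note that this is 
longest-processing-time-first, not the linear optimization mentioned above, and 
it makes strong assumptions: single-core jobs, identical slots, exactly known 
runtimes, and no memory constraints. The function name `lpt_schedule` and the 
job names are hypothetical.

```python
import heapq


def lpt_schedule(runtimes, slots):
    """Greedy longest-processing-time-first placement.
    Returns the per-slot plan and the forecast overall finish time."""
    # Heap of (time at which the slot becomes free, slot index).
    heap = [(0, s) for s in range(slots)]
    heapq.heapify(heap)
    plan = {s: [] for s in range(slots)}
    # Place the longest jobs first, each on the slot that frees up earliest.
    by_length = sorted(runtimes.items(), key=lambda kv: kv[1], reverse=True)
    for name, runtime in by_length:
        free_at, slot = heapq.heappop(heap)
        plan[slot].append(name)
        heapq.heappush(heap, (free_at + runtime, slot))
    makespan = max(t for t, _ in heap)
    return plan, makespan


runtimes = {"j1": 7, "j2": 5, "j3": 4, "j4": 3, "j5": 3}
plan, makespan = lpt_schedule(runtimes, slots=2)
print(plan, makespan)
```

With these numbers the heuristic forecasts that everything finishes by t=12, 
while splitting the jobs as 7+4 and 5+3+3 would finish at t=11. Closing that 
gap is exactly what an exact optimization would be needed for.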


> Thoughts? I'd like to know if people are doing similar things with SGE/OGE – 
> and whether or not truly dynamic load smoothing or some form of smarter 
> mechanisms of workload dispersion are being implemented elsewhere. I.e – is 
> this just a case of having a "load" complex written that can then somehow 
> dynamically adjust the number of jobs a user is allowed to schedule, and then 
> if others want to load up a lot of slots with jobs too, fairshare kicks in 
> and pauses other people's jobs contextually?

-- Reuti


> Thanks.
> 
> --JC
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users


