We never set any real memory restrictions because memory has never been
the bottleneck (our machines have 56 cores and 256GB RAM, and we only allow
up to 56 tasks to run on each). Most jobs take very little RAM, but
sometimes jobs take quite a lot. I know the jobs can grow even bigger, and
that's what the spare 10% is for.

Right now we've defined external rules (through Zabbix) to take a node out
of the cluster (i.e. disable its queues) once it reaches 90% memory usage and
put it back in when it's down to 70%, but I'm not a fan of this on-and-off
approach, plus I don't like it being external (more things to configure and
maintain, more "mystery" non-working things when we forget about that rule).

I suppose the load sensor approach could work, but can I have 2 thresholds, or
do I need to do some math magic to somehow consider both? I wouldn't want a
machine with low load but no free memory, or a machine with high load
but tons of free memory, to show as fine and dandy.
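
For what it's worth, as far as I can tell from queue_conf(5), load_thresholds
takes a comma-separated list of name=value pairs and the queue goes into alarm
as soon as any one of them is breached, so something like
"load_thresholds np_load_avg=1.75,mem_free=20G" may already cover both cases
without any math magic. The sketch below only models that "any one breached"
logic in Python; in_alarm, the threshold values and the relops are my own
illustrative assumptions (check qconf -sc for the real relops), not SGE code.

```python
import operator

# Relational operators as they appear in the complex configuration.
# Assumption: np_load_avg uses ">=" and mem_free uses "<=" (verify with qconf -sc).
RELOPS = {
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
    "==": operator.eq,
}


def in_alarm(load_values, thresholds):
    """Return the names of all thresholds that are breached.

    load_values: dict mapping load value name -> current value
    thresholds:  dict mapping load value name -> (relop, limit)
    A non-empty result means the queue instance would be in alarm state.
    """
    breached = []
    for name, (relop, limit) in thresholds.items():
        value = load_values.get(name)
        if value is None:
            continue  # this sketch simply ignores values the host does not report
        if RELOPS[relop](value, limit):
            breached.append(name)
    return breached


# Hypothetical thresholds: load per core of 1.75, and 20 (GB) of free memory.
THRESHOLDS = {"np_load_avg": (">=", 1.75), "mem_free": ("<=", 20.0)}
```

With these thresholds, a host with low load but only 5GB free is flagged
because of mem_free, and a host with 200GB free but a load of 3.0 per core is
flagged because of np_load_avg; only a host that is fine on both shows as fine.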


On Wed, Jun 1, 2016 at 5:37 PM, Simon Andrews <[email protected]>
wrote:

> The problem here will be knowing how much free memory you need from the
> jobs which are already running.
>
> In theory you could do this by adding free memory as a load_threshold (see
> man qconf).  This would then allow low memory to put the node into an alarm
> state (like you get if the load average is too high), and stop further jobs
> going to it.  You'd need to write a load sensor to get a suitable value for
> this from each node so you could decide when the memory state was too
> high.  The problem is that the current memory usage may not reflect the
> maximum usage that the jobs you already have running will achieve.
>  Limiting submissions based on current usage still, in our experience,
> caused the nodes to run out of memory regularly, causing jobs to get killed.
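
A load sensor of the kind described above could look roughly like the sketch
below. It follows the usual load sensor protocol (wait for a line on stdin,
exit on "quit", otherwise print a begin/end block of host:name:value lines).
The complex name mem_free_pct is hypothetical and would have to be added with
qconf -mc first; reading /proc/meminfo assumes Linux, and the hostname must
match the name SGE uses for the node. It does nothing about the caveat that
current usage may understate what running jobs will eventually need.

```python
#!/usr/bin/env python3
import socket
import sys


def parse_meminfo(text):
    """Parse the contents of /proc/meminfo into a dict of integer kB values."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def free_percent(info):
    """Percentage of memory available, falling back to MemFree on old kernels."""
    available = info.get("MemAvailable", info.get("MemFree", 0))
    return 100.0 * available / info["MemTotal"]


def format_report(host, name, value):
    """Format one load report block in the begin/end load sensor protocol."""
    return "begin\n{}:{}:{:.1f}\nend\n".format(host, name, value)


def main():
    host = socket.gethostname()
    while True:
        # execd sends a newline per load interval and "quit" on shutdown.
        line = sys.stdin.readline()
        if not line or line.strip() == "quit":
            break
        with open("/proc/meminfo") as handle:
            info = parse_meminfo(handle.read())
        # "mem_free_pct" is a hypothetical complex name, not a built-in one.
        sys.stdout.write(format_report(host, "mem_free_pct", free_percent(info)))
        sys.stdout.flush()

# When installing this as a load sensor, end the script with a call to main().
```

The threshold itself would then go into load_thresholds for the queues, e.g.
mem_free_pct=10, assuming the complex is defined with a "<=" relop.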
>
> In the end our solution was to have strict hard limits (h_vmem) on memory
> and to define h_vmem as a consumable complex.  To make life easier for our
> users though we used a job submission verifier to add a default allocation
> of 1GB to any job which didn't ask for any memory.  This covers all of the
> small jobs.  For larger jobs we simply tell people to ask for more than
> they need if they're only doing something once, or if they have a bunch of
> jobs to run then run one with too much memory allocated and then use qacct
> to look at the actual max usage so they know what they should ask for next
> time.  We had some teething troubles with this for a few weeks after it was
> introduced, but it's all been working smoothly for a long time now.
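
The "run one job with too much memory, then check qacct" step can be
semi-automated. The sketch below parses the text printed by qacct -j, takes
the largest maxvmem value and suggests an h_vmem request with some headroom.
The function names and the 20% headroom are my own assumptions, and it assumes
maxvmem is printed as a number with an optional K/M/G/T suffix.

```python
import math
import re

_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_size(text):
    """Convert a size such as '3.500G' or '512.000M' to bytes."""
    match = re.fullmatch(r"([0-9]*\.?[0-9]+)([KMGT]?)", text.strip().upper())
    if match is None:
        raise ValueError("unrecognised size: {!r}".format(text))
    return float(match.group(1)) * _UNITS[match.group(2)]


def max_vmem_bytes(qacct_output):
    """Return the largest maxvmem found in qacct -j output, or None."""
    values = []
    for line in qacct_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "maxvmem":
            values.append(parse_size(fields[1]))
    return max(values) if values else None


def suggest_h_vmem(qacct_output, headroom=1.2):
    """Suggest an h_vmem request in whole GB, peak usage times headroom."""
    peak = max_vmem_bytes(qacct_output)
    if peak is None:
        return None
    return "{}G".format(math.ceil(peak * headroom / 1024 ** 3))
```

The input would be the captured output of qacct -j for a finished job (or
array job, in which case the largest task wins), and the result can be passed
to qsub as -l h_vmem=... for the next run.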
>
> Simon.
>
> From: <[email protected]> on behalf of Ben Daniel Pere <
> [email protected]>
> Date: Wednesday, 1 June 2016 at 14:44
> To: "[email protected]" <[email protected]>
> Subject: [gridengine users] How to set a minimum free memory limit for
> any task submission on SGE?
>
> Hi all,
>
> I'm trying to stop SGE from submitting jobs to any node which has less
> than 10% memory free, regardless of queue, user or anything else. I just
> want to add a rule / complex / resource quota that will make SGE not submit
> tasks to machines with less than 10% memory free (or 20GB if the number
> must be absolute).
>
> We've got several queues, and each can run tasks with various memory
> demands. We're not asking users to say how much memory they need, and
> there's no way to start doing so: people will give wrong estimates and
> we'll end up with either dying tasks or an idle cluster due to exaggerated
> estimates. Memory use really depends on the size of the dataset being
> worked on, which often can't be predetermined.
>
> Is there a way to do it? Thanks!
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users