Hi,

> Am 24.02.2015 um 15:47 schrieb Simon Andrews <[email protected]>:
> 
> We've recently implemented a memory management system on our cluster which
> requires that users set h_vmem on their jobs, and also tracks the
> consumption of RAM on each compute node by setting h_vmem as a consumable
> resource so we don't overcommit any nodes.

Actually h_vmem sets two limits: a kernel limit (per process) and a sum computed 
by SGE across all processes belonging to the job. SGE will kill the job once 
that sum exceeds the limit.
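
To illustrate the per-process kernel limit, here is a minimal sketch. It assumes 
a Linux node and assumes the limit is enforced as an address-space rlimit 
(RLIMIT_AS); that mapping is my assumption, so check your SGE version. The 
helper name try_allocation and the sizes are made up for the example. The 
child process lowers its own soft limit and then tries to allocate a buffer:

```python
import subprocess
import sys

# Code run in a child process so the limit does not affect the caller.
CHILD = r"""
import resource
import sys

limit = int(sys.argv[1])
request = int(sys.argv[2])

# Lower only the soft limit and never exceed the existing hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
if hard != resource.RLIM_INFINITY:
    limit = min(limit, hard)
resource.setrlimit(resource.RLIMIT_AS, (limit, hard))

try:
    buf = bytearray(request)  # one contiguous allocation
    print("allocated")
except MemoryError:
    # The allocator refused the request because of the rlimit,
    # independent of how much memory is free on the node.
    print("MemoryError")
"""


def try_allocation(limit_bytes, request_bytes):
    """Run the child with an address-space limit and report the outcome."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, str(limit_bytes), str(request_bytes)],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    one_gib = 1 << 30
    print(try_allocation(one_gib, 2 * one_gib))  # request above the limit
    print(try_allocation(one_gib, 16 << 20))     # request well below it
```

A request above the limit fails inside the process itself, with an allocation 
error rather than a kill by SGE. This does not by itself tell which effect is 
at work in your jobs, it only shows how such a failure looks from the inside.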



> Despite this we're getting jobs which are dying due to not being able to
> allocate memory.  The nodes on which these failures happen still have
> plenty of free memory and the jobs are dying from internal malloc errors,
> rather than being killed due to the limit which was imposed by grid engine.
> I suspect that what is happening is that we're getting memory
> fragmentation, so that even though there is plenty of memory available,
> the programs aren't able to allocate a large enough contiguous block of
> memory and are therefore dying.

Yes. But the fragmentation seems to be created at the application level, not at 
the OS level by other jobs.


> Does this seem like a likely explanation?  If so, is there anything which
> can be done in the configuration of either the queues or the nodes to try
> to minimise the chances of these kinds of errors occurring?

AFAICT this is nothing SGE can handle. It has to be taken care of by the 
programmer of the application, by using the granted memory in a better way.

-- Reuti


>  Thanks
> 
> Simon.
> 
> The Babraham Institute, Babraham Research Campus, Cambridge CB22 3AT 
> Registered Charity No. 1053902.
> The information transmitted in this email is directed only to the addressee. 
> If you received this in error, please contact the sender and delete this 
> email from your system. The contents of this e-mail are the views of the 
> sender and do not necessarily represent the views of the Babraham Institute. 
> Full conditions at: www.babraham.ac.uk<http://www.babraham.ac.uk/terms>
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users


