On 13.11.2012 at 15:24, Txema Heredia Genestar wrote:

> we have a 300-core cluster with a ~150 TB shared directory (GPFS). Our users 
> run genomic analyses that use huge files and usually cannot fit into the 
> 500 GB internal HDD of the nodes. As you can imagine, sometimes things get 
> pretty intense and all the Nagios disk alarms start going off (the disk 
> "works", but we get 10+ s timeouts).
> 
> Knowing that I cannot trust our users to request any "disk_intensive" 
> parameter/flag, I was pondering setting a suspend_threshold in the queues, 
> watching the shared disk status (e.g. timing an `ls` to the shared disk) and 
> suspending jobs when the disk shows, say, a 3 s delay. This would be a 
> nice fix for our issue, but it has some problems: when there are both 
> "IO-intensive" and "normal" jobs and the suspend_threshold kicks in, SGE 
> will start suspending jobs without any particular criteria? (I don't know 
> this part)

Yes.


> , and lots of innocent "normal" jobs will be suspended across all the nodes 
> before the disk load stabilizes.

You would need another queue for the normal jobs without a suspend_threshold, 
but then you are where you started: users won't add a proper flag or request a 
proper queue.
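
If you do go the two-queue route, the threshold side would look roughly like 
this in the I/O queue (the load value `gpfs_lat` assumes a custom load sensor 
reporting milliseconds; the queue name and numbers are illustrative, see 
sge_queue_conf(5)):

```
# qconf -sq bigio.q (excerpt)
suspend_thresholds    gpfs_lat=3000
nsuspend              1
suspend_interval      00:05:00
```

Note that `nsuspend` limits how many jobs are suspended per 
`suspend_interval`, which at least slows down the cascade of suspensions you 
are worried about.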

And: suspending a "bigio" job helps only if another "bigio" job is running on 
the same node, as otherwise the same "bigio" load will simply resume a little 
bit later...
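
For the record, feeding such a latency value into SGE would be done with a 
load sensor. A minimal sketch (untested; the mount point and the complex name 
`gpfs_lat` are placeholders you would have to adapt):

```shell
#!/bin/sh
# Sketch of an SGE load sensor: times an `ls` of the shared directory
# and reports the latency in milliseconds as a custom load value.
MOUNT=${MOUNT:-/gpfs}   # set to your shared GPFS mount point

# Time a single `ls` of $MOUNT and print the elapsed time in ms.
measure_ms() {
  start=$(date +%s%N)
  ls "$MOUNT" > /dev/null 2>&1
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

# Load-sensor protocol: sge_execd writes a line to stdin for each
# measurement cycle and "quit" on shutdown; the sensor answers with a
# begin/.../end block of host:complex:value lines.
sensor_loop() {
  host=$(hostname)
  while read -r line; do
    [ "$line" = quit ] && return 0
    echo begin
    echo "$host:gpfs_lat:$(measure_ms)"
    echo end
  done
}

# Uncomment when installing as a real load sensor:
# sensor_loop
```

You would register the script via the `load_sensor` parameter in the host or 
global configuration (`qconf -mconf`), add `gpfs_lat` as a load complex with 
`qconf -mc`, and then reference it in `suspend_thresholds` as above.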

-- Reuti


> Does anyone have any idea/workaround to solve this? Or should I ignore/relax 
> all the disk alarms?
> 
> Thanks in advance,
> 
> Txema
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
> 

