Well, our main problem (for now) is not a single huge_io job, but lots (50+ maybe?) of big_io jobs running at once, scattered across the 26 nodes of our cluster.

My only hope is creating a cron job that reads the qstat info for all running jobs, parses each job's "io" value, saves them, compares them against the "io" values from the last cron run, and, if we are in trouble, suspends the jobs with the highest "now_io - old_io". But I don't expect it to hit the nail on the head very often...

By the way... what does that "io" value really measure?

Txema




On 13/11/12 18:05, Reuti wrote:
On 13.11.2012, at 15:24, Txema Heredia Genestar wrote:

we have a 300-core cluster with a ~150TB shared directory (GPFS). Our users run 
genomic analyses that use huge files and usually cannot fit on the 500GB internal HDD of the 
nodes. As you can imagine, sometimes things get pretty intense and all the Nagios disk 
alarms start going off (the disk "works", but we get 10+ second timeouts).

Knowing that I cannot trust our users to request any "disk_intensive" parameter/flag, I was 
pondering setting a suspend_threshold in the queues, watching the shared disk status (e.g. timing an ls on 
the shared disk) and starting to suspend jobs when the disk shows, say, a 3-second delay. This would be a nice fix 
for our issue, but it has some problems: when there are both "IO-intensive" and "normal" 
jobs and the suspend_threshold kicks in, will SGE start suspending jobs without any particular criterion? (I 
don't know this part.)
Yes.


, and lots of innocent "normal" jobs will be suspended across all the nodes 
before the disk load stabilizes.
You would need another queue for the normal jobs without a suspend_threshold, 
but then you are back where you started: users won't add a proper flag or request a 
proper queue.

And: suspending a "bigio" job only helps if there is another "bigio" job on the same 
node, as otherwise it will just generate the same "bigio" load a little bit later anyway...

-- Reuti


Does anyone have an idea/workaround to solve this? Or should I just ignore/relax 
all the disk alarms?

Thanks in advance,

Txema
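For reference, the suspend_threshold route from the quoted message needs a load value for the scheduler to watch, which means a custom load sensor. The sketch below follows the standard Grid Engine load-sensor protocol (read a line from stdin per reporting cycle, exit on "quit", print a begin/end block of `host:complex:value` lines); the complex name `gpfs_latency` and the mount point are made-up placeholders:

```python
#!/usr/bin/env python
# Sketch of a load sensor implementing the "time an ls" idea: report how
# long a directory listing of the shared filesystem takes. The complex
# name and SHARED_DIR are assumptions to adapt.
import socket
import subprocess
import sys
import time

SHARED_DIR = "/gpfs/shared"      # hypothetical GPFS mount point
COMPLEX = "gpfs_latency"         # hypothetical complex name

def measure(path):
    """Seconds taken to list the given directory."""
    t0 = time.time()
    subprocess.call(["ls", path],
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL)
    return time.time() - t0

def main():
    host = socket.gethostname()
    for line in sys.stdin:       # sge_execd requests one report per line
        if line.strip() == "quit":
            break
        print("begin")
        print("%s:%s:%.3f" % (host, COMPLEX, measure(SHARED_DIR)))
        print("end")
        sys.stdout.flush()       # execd reads the report immediately

if __name__ == "__main__":
    main()
```

The complex would also need to be declared (`qconf -mc`), the script registered as `load_sensor` in the host/global configuration (`qconf -mconf`), and the queue given something like `suspend_threshold gpfs_latency=3` — but as noted above, SGE picks the jobs to suspend without regard to who caused the load.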
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users

