On 13.11.2012, at 15:24, Txema Heredia Genestar wrote:

> We have a 300-core cluster with a ~150 TB shared directory (GPFS). Our users
> run genomic analyses that use huge files which usually cannot fit on the
> 500 GB internal HDD of the nodes. As you can imagine, sometimes things get
> pretty intense and all the Nagios disk alarms start going off (the disk
> "works", but we get 10+ sec timeouts).
>
> Knowing that I cannot trust our users to request any "disk_intensive"
> parameter/flag, I was pondering setting a suspend_threshold in the queues,
> watching the shared disk status (e.g. timing an `ls` on the shared disk) and
> suspending jobs when the disk has, say, a 3 sec delay. This would be a
> nice fix for our issue, but it has some problems: when there are both
> "IO-intensive" and "normal" jobs and the suspend_threshold kicks in, will SGE
> start suspending jobs without any particular criteria? (I don't know
> this part)
Yes.

> , and lots of innocent "normal" jobs will be suspended across all the nodes
> before the disk load stabilizes.

You would need another queue for the normal jobs without a suspend_threshold, but then you are back where you started: users won't add a proper flag or request a proper queue. And suspending a "bigio" job only helps if there is another "bigio" job on the same node, as otherwise you will just have the same "bigio" job again a little later...

-- Reuti

> Does anyone have any idea/workaround to solve this? Or should I ignore/relax
> all the disk alarms?
>
> Thanks in advance,
>
> Txema
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
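For what it's worth, the "timing an `ls` on the shared disk" idea maps naturally onto SGE's custom load sensor mechanism: execd starts the sensor script, writes a line to its stdin before each reporting interval ("quit" means terminate), and the script answers with a begin/end block of `host:complex:value` lines. A minimal sketch follows; the complex name `gpfs_lat` and the `GPFS_DIR` path are assumptions, and the complex would first have to be registered with `qconf -mc`. For brevity the sketch does a single measurement instead of the real stdin-driven loop:

```shell
#!/bin/sh
# Sketch of an SGE load sensor reporting shared-disk latency.
# A production sensor loops: read a line from stdin ("quit" => exit),
# then emit one begin/end report per interval. Here: one measurement.

GPFS_DIR=${GPFS_DIR:-/tmp}      # point this at the shared GPFS mount (assumption)
HOST=$(hostname)

start=$(date +%s%N)             # nanoseconds (GNU date)
ls "$GPFS_DIR" > /dev/null 2>&1 # the timed probe; exit status is irrelevant
end=$(date +%s%N)
lat=$(awk "BEGIN { printf \"%.3f\", ($end - $start) / 1e9 }")

# SGE load report format: begin / host:complex:value / end
echo "begin"
echo "$HOST:gpfs_lat:$lat"
echo "end"
```

With the sensor in place, the queue could then use `suspend_thresholds` on that complex (e.g. `gpfs_lat=3` via `qconf -mq`), and setting `nsuspend 1` together with a generous `suspend_interval` would at least slow the cascade of suspensions down to one job per interval, rather than hitting many "innocent" jobs at once.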
