Reuti,

I don't understand what you mean by "too late"... If you know for sure
the disk WILL cause problems, then of course it is easy. But the
problem is that the load sensor does not necessarily know what to check
and what will fail next, so you might need to check every disk, NFS
mount, network connection, software license, etc. to come up with a
"host_healthcheck".

In LSF, the admin can define an EXIT_RATE for each host and a
GLOBAL_EXIT_RATE for the whole cluster. In SGE the equivalent can only
be done in the starter_method, as it knows when jobs are started and
when they exit. A simple one would write timestamps to some sort of
/tmp area and do some math to come up with the rate; when the exit
rate exceeds the EXIT_RATE threshold, it closes the queue/host.
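
Something along these lines, as a rough and untested sketch (the
window, the threshold and the /tmp state file are placeholders, and it
assumes the job owner is allowed to run qmod at all; in practice you
would probably let a root cron job or an operator account do the
actual qmod -d):

#!/bin/sh
# Untested starter_method sketch: run the job, record its exit time,
# and close the queue instance if too many jobs exited within a short
# window. $QUEUE and $HOSTNAME are set by sge_execd in the job
# environment.

STATE=/tmp/sge_exit_times.$LOGNAME   # placeholder state file
WINDOW=60                            # seconds to look back
THRESHOLD=5                          # exits within $WINDOW = "too fast"

"$@"                                 # run the real job command line
RC=$?

NOW=`date +%s`
echo $NOW >> $STATE

# how many recorded exits fall inside the window?
RECENT=`awk -v now=$NOW -v w=$WINDOW 'now - $1 <= w' $STATE | wc -l`

if [ $RECENT -ge $THRESHOLD ]; then
    # assumes the job owner may disable the queue instance; otherwise
    # drop a flag file here and let a root cron job do the qmod
    qmod -d "$QUEUE@$HOSTNAME"
fi

exit $RC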

Rayson


On Thu, Mar 10, 2011 at 12:53 PM, Reuti <[email protected]> wrote:
>> The starter method should be really simple: just record the exit time
>> of the last few jobs, and calculate the rate of exit. If the rate is
>> too high, disable the host.
>>
>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/queue_conf.html
>>
>> "starter_method"
>
> Isn't it already too late when the "starter_method" is started? I mean, when
> no job information can be written (e.g. to the spool area), it will never get
> executed but the job is trashed anyway.
>
> -- Reuti
>
>
>> Rayson
>>
>>
>>
>> On Thu, Mar 10, 2011 at 11:47 AM, Reuti <[email protected]> wrote:
>>> Well, the feature to use Hawking radiation to allow the jobs to pop up
>>> on other nodes needs precise alignment of the installation - SCNR
>>>
>>> There is a demo script to check the size of e.g. /tmp here:
>>> http://arc.liv.ac.uk/SGE/howto/loadsensor.html. You can then use
>>> "load_thresholds tmpfree=1G" in the queue definition, so that the queue
>>> instance is set to alarm state in case the free space falls below that value.
>>>
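A stripped-down load sensor along the lines of that howto could look
like this (sketch only; the "tmpfree" complex has to exist first,
e.g. added with qconf -mc, and the df parsing is site specific):

#!/bin/sh
# Minimal load sensor sketch: report free space in /tmp as "tmpfree".
# Protocol: wait for a line on stdin, answer with begin/.../end, and
# exit when "quit" is read.

HOST=`hostname`

while read input; do
    if [ "$input" = "quit" ]; then
        exit 0
    fi
    # free space on /tmp in KiB (df -P for portable output)
    TMPFREE=`df -Pk /tmp | awk 'NR==2 {print $4}'`
    echo "begin"
    echo "$HOST:tmpfree:${TMPFREE}K"
    echo "end"
done
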
>>> A load sensor can also deliver a boolean value, hence checking locally
>>> something like "all disks fine" and using this as a "load_threshold" can
>>> also be a solution. How to check something is of course specific to your
>>> node setup.
>>>
>>> The last necessary piece would be to inform the admin: this could be done
>>> by the load sensor too, but as the node is known not to be in a proper
>>> state I wouldn't recommend this. Better might be a cron job on the qmaster
>>> machine checking `qstat -explain a -qs a -u foobar` *) to look for
>>> exceeded load thresholds.
>>>
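Something like this might do as a starting point for that cron job
(sketch; it assumes qstat prints nothing when no queue instance is in
alarm state, and the mail address is of course a placeholder):

#!/bin/sh
# Sketch of a cron job on the qmaster: mail the admin whenever a queue
# instance is in alarm state. "foobar" is an unknown user, so no jobs
# are listed, only the queue information.

ADMIN=ge-admin@example.com   # placeholder address

OUT=`qstat -explain a -qs a -u foobar 2>&1`

if [ -n "$OUT" ]; then
    echo "$OUT" | mail -s "SGE: queue instance(s) in alarm state" "$ADMIN"
fi
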
>>> -- Reuti
>>>
>>> *) There is no "show no jobs at all" switch for `qstat`, so using an unknown
>>> user like "foobar" helps. And OTOH there is no "load_threshold" in the
>>> exechost definition.
>>>
>>>
>>>> -Ed
>>>>
>>>>
>>>>
>>>> Hi,
>>>>
>>>> On 10.03.2011 at 16:50, Edward Lauzier wrote:
>>>>
>>>>> I'm looking for best practices and techniques to detect blackhole hosts 
>>>>> quickly
>>>>> and disable them.  ( Platform LSF has this already built in...)
>>>>>
>>>>> What I see is possible is:
>>>>>
>>>>> Using a cron job on a ge client node...
>>>>>
>>>>> -  tail -n 1000 -f <qmaster_messages_file> | egrep '<for_desired_string>'
>>>>> -  if detected, use qmod -d '<queue_instance>' to disable
>>>>> -  send email to ge_admin list
>>>>> -  possibly send email of failed jobs to user(s)
>>>>>
>>>>> Must be robust enough to time out properly when GE is down or too busy
>>>>> for qmod to respond, and/or when there are filesystem problems, etc...
>>>>>
>>>>> ( perl or php alarm and sig handlers for proc_open work well for 
>>>>> enforcing timeouts...)
>>>>>
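FWIW, that watcher could be sketched roughly like this (untested; the
messages path, the egrep pattern and the queue instance parsing are
left as the same placeholders as in the list above, and timeout(1)
from GNU coreutils stands in here for the perl/php alarm handling):

#!/bin/sh
# Untested sketch of the watcher loop described above.
MESSAGES=/path/to/qmaster/messages   # i.e. the <qmaster_messages_file>
ADMIN=ge-admin@example.com           # placeholder address

tail -n 1000 -f "$MESSAGES" | egrep --line-buffered '<for_desired_string>' |
while read line; do
    QI='<queue_instance>'            # derive from $line, site specific
    timeout 30 qmod -d "$QI"         # don't hang if qmaster is unresponsive
    echo "$line" | mail -s "disabled $QI after suspected blackhole" "$ADMIN"
done
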
>>>>> Any hints would be appreciated before I start on it...
>>>>>
>>>>> Won't take long to write the code, just looking for best practices and 
>>>>> maybe
>>>>> a setting I'm missing in the ge config...
>>>>
>>>> what is causing the blackhole? For example: if it's a full file system on
>>>> a node, you could detect it with a load sensor in SGE and define an alarm
>>>> threshold in the queue setup, so that no more jobs are scheduled to this
>>>> particular node.
>>>>
>>>> -- Reuti
>>>>

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users