Reuti,

1.
The exechost isn't the head node, is it? We've always referred to our SGE
clusters as having three types of nodes: submit nodes, compute nodes and
head nodes.

Compute ring is an OpenMPI term for the slots to which processes for the job
are dispatched, but I meant the compute nodes that actually take part in a job.

2.
I was thinking of looking at the start time and end time, taking the
difference and checking whether it is >= h_rt to determine if the job was
aborted for this reason or not, and then indicating that in the subject
header of the email.
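
A minimal sketch of that check in Python, assuming the start and end times are
available as epoch seconds (for example from accounting data) and that h_rt is
given either as plain seconds or as h:m:s. The function names are hypothetical,
and a value such as INFINITY is not handled.

```python
def parse_h_rt(value):
    """Convert an h_rt value given as seconds or h:m:s into seconds."""
    seconds = 0
    for part in value.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def hit_runtime_limit(start_time, end_time, h_rt):
    """Return True if the wallclock time reached the h_rt limit."""
    return (end_time - start_time) >= parse_h_rt(h_rt)


def mail_subject(job_id, start_time, end_time, h_rt):
    """Build a subject line that flags a probable h_rt abort (hypothetical wording)."""
    if hit_runtime_limit(start_time, end_time, h_rt):
        return f"Job {job_id} aborted: reached h_rt limit ({h_rt})"
    return f"Job {job_id} aborted"
```

A job that finishes on its own very close to the limit would also match, so
this is only a heuristic.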

3.
I was hoping there might be some sort of formula to determine an ideal h_rt
value. I'll have to kick this decision to management.

Personally, I don't know how users are supposed to predict in advance how long
their jobs will run, or how much memory their jobs will use when setting
h_vmem. I've seen the runtime for some jobs extend to several days.

Thank you for the script, Reuti, as well as all your help resolving my
difficulties with even posting to this email list.

-Bill L.

________________________________________
From: Reuti [[email protected]]
Sent: Thursday, September 24, 2015 1:43 PM
To: Lane, William
Cc: [email protected] List
Subject: Re: [gridengine users] Create short.q queue definition that limits the runtime of a job

Hi,

On 23.09.2015 at 21:18, Lane, William wrote:

> Reuti,
>
> 1.
> If more than one compute node takes part in the compute ring, how does one 
> determine
> which one is the exechost?

What do you mean by compute ring - a parallel job?

The exechost is the one where the job script is executed. Hence you can use 
$HOSTNAME in the jobscript to get its name (which also shows up in `qstat`).


> Or is the exechost always the node on which you submit a job?

There are only rare circumstances where a submit host is also an exechost (or
vice versa). Usually the jobscript is executed on one of the exechosts (which
may not be reachable by a login at all), while you submit on a machine that
you are logged into.

(There are exceptions in the case of a Cray, where the jobscript may indeed run
on a submit host, as you need `aprun` to push the real executable to the nodes
while the jobs put no load on the submit machine at all.)


> 2.
>> Not by default. You will have to use a mail wrapper which will scan the 
>> messages file of the exechost for an entry of this particular job and append 
>> it to the email. I can supply a snippet if you need.
>
> We would be interested in implementing the above. Is there any way to have an
> email differentiate between a job being aborted because it exceeds the h_rt
> constraint of a queue vs. other reasons?

After scanning the messages file you can decide to send the email or change the 
header as you like. See attached.
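
As a rough illustration of the scanning step (not the attached script), here is
a minimal Python sketch. It assumes each entry in the messages file is a single
line containing the text "job <id>", and that an h_rt kill can be recognised by
a marker string; both are assumptions to check against what your execd actually
logs. The function names are hypothetical.

```python
import re


def job_entries(lines, job_id):
    """Return the messages-file lines that mention the given job id."""
    # \b after the id keeps job 1234 from matching job 12345.
    pattern = re.compile(r"\bjob " + re.escape(str(job_id)) + r"\b")
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


def looks_like_runtime_kill(entries, marker="h_rt"):
    """Guess whether any entry reports a runtime-limit abort.

    The default marker is an assumption, not a documented message format.
    """
    return any(marker in entry for entry in entries)
```

The result of `looks_like_runtime_kill` could then decide whether the mail is
sent at all or which subject it gets, and `job_entries` gives the lines to
append to the body.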


> 3.
> Another issue is what kind of values for h_rt should be used?  I've had jobs 
> last for 8 hours as well as nearly 24. What kind of stats would be good to 
> look at to determine what values of h_rt should be used?

This you have to decide on your own. What counts as a short job depends on
the circumstances.
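
If it helps to have numbers for that decision, one possible starting point is
the distribution of past wallclock times. The sketch below is only an
illustration in Python: it assumes you already have a list of runtimes in
seconds (for instance collected from accounting records) and computes a
nearest-rank percentile. The function name and the choice of percentile are
hypothetical, not a recommendation.

```python
import math


def runtime_percentile(runtimes, pct):
    """Return the nearest-rank percentile (0-100) of past runtimes in seconds."""
    if not runtimes:
        raise ValueError("no runtimes given")
    ordered = sorted(runtimes)
    # Nearest-rank method: smallest value with at least pct percent at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Comparing, say, the 50th and 90th percentiles shows how many past jobs a given
h_rt for short.q would have covered.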

>
> 4.
> Should there be nodes dedicated to the short.q queue?

Depends on your overall setup and goal. Do you want to keep these nodes free
of other jobs, so that short jobs start instantly?

-- Reuti

[see attached file: mailer.sh]

