Reuti,

1. The exechost isn't the head node, is it? We've always referred to our SGE clusters as having three types of nodes: submit nodes, compute nodes and head nodes.
Compute ring is an OpenMPI term for the slots to which processes for the job are dispatched, but I meant the compute nodes that actually take part in a job.

2. I was thinking of looking at the start time and end time, taking the difference, and checking whether it is >= h_rt to determine whether the job was aborted for this reason. The email's subject header would then indicate that.

3. I was hoping there might be some sort of formula to determine an ideal h_rt value. I'll have to kick this decision to management. Personally, I don't know how users are supposed to predict in advance how long their jobs will run, nor how much memory their jobs will use (for h_vmem). I've seen the runtime for some jobs extend to several days.

Thank you for the script, Reuti, as well as all your help resolving my difficulties, even with posting to this email list.

-Bill L.

________________________________________
From: Reuti [[email protected]]
Sent: Thursday, September 24, 2015 1:43 PM
To: Lane, William
Cc: [email protected] List
Subject: Re: [gridengine users] Create short.q queue definition that limits the runtime of a job

Hi,

On 23.09.2015 at 21:18, Lane, William wrote:

> Reuti,
>
> 1.
> If more than one compute node takes part in the compute ring, how does one
> determine which one is the exechost?

What do you mean by compute ring - a parallel job? The exechost is the one where the job script is executed. Hence you can use $HOSTNAME in the jobscript to get its name (which also shows up in `qstat`).

> Or is the exechost always the node on which you submit a job?

There are only rare circumstances where a submit host is also an exechost (or vice versa). Usually the jobscript is executed on one of the exechosts (which may not be reachable by a login at all), while you submit on a machine that you have logged in to.
(There are exceptions in the case of a Cray, where the jobscript may indeed run on a submit host, as you need `aprun` to push the real executable to the nodes, while the jobs put no load on the submit machine at all.)

> 2.
>> Not by default. You will have to use a mail wrapper which will scan the
>> messages file of the exechost for an entry of this particular job and append
>> it to the email. I can supply a snippet if you need.
>
> We would be interested in implementing the above. Is there any way to have an
> email differentiate between a job being aborted because it exceeds the h_rt
> constraint of a queue vs. other reasons?

After scanning the messages file you can decide to send the email or change the header as you like. See attached.

> 3.
> Another issue is what kind of values for h_rt should be used? I've had jobs
> last for 8 hours as well as nearly 24. What kind of stats would be good to
> look at to determine what values of h_rt should be used?

This you have to decide on your own. What is judged as a short job depends on the circumstances.

> 4.
> Should there be nodes dedicated to the short.q queue?

Depends on your overall setup and goal. Do you want to have these nodes left free by other jobs, so that short jobs start instantly?

-- Reuti

[see attached file: mailer.sh]

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
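
The check Bill describes in his point 2 (take end time minus start time and see whether it is >= h_rt) could be sketched roughly as follows. This is only an illustration, not the attached mailer.sh: the function names `parse_h_rt`, `exceeded_h_rt` and `subject_for` are made up for this example, and it assumes h_rt is given either as plain seconds or as hours:minutes:seconds. Note that elapsed time alone is a heuristic; scanning the messages file of the exechost, as Reuti suggests, is the more direct way to learn why a job was killed.

```python
from datetime import datetime


def parse_h_rt(value: str) -> int:
    """Convert an h_rt string to seconds.

    Assumption: the value is either plain seconds ("3600")
    or hours:minutes:seconds ("1:00:00").
    """
    parts = [int(p) for p in value.split(":")]
    if len(parts) == 1:
        return parts[0]
    if len(parts) == 3:
        hours, minutes, seconds = parts
        return hours * 3600 + minutes * 60 + seconds
    raise ValueError(f"unsupported h_rt format: {value!r}")


def exceeded_h_rt(start: datetime, end: datetime, h_rt: str) -> bool:
    """Return True if the elapsed wall-clock time is >= the h_rt limit."""
    elapsed = (end - start).total_seconds()
    return elapsed >= parse_h_rt(h_rt)


def subject_for(job_id: int, start: datetime, end: datetime, h_rt: str) -> str:
    """Build a hypothetical email subject that flags a likely h_rt abort."""
    if exceeded_h_rt(start, end, h_rt):
        return f"Job {job_id} aborted: h_rt limit of {h_rt} reached"
    return f"Job {job_id} aborted"


if __name__ == "__main__":
    # Made-up times for illustration only.
    start = datetime(2015, 9, 24, 10, 0, 0)
    end = datetime(2015, 9, 24, 11, 0, 5)
    print(subject_for(42, start, end, "1:00:00"))
```

A job that is killed for some other reason right around the limit would be misclassified by this comparison, so it is probably best treated as a hint for the subject line rather than as proof.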
