On 11.02.2015 at 20:13, Michael Stauffer <[email protected]> wrote:
> 
> On Wed, Feb 11, 2015 at 2:02 PM, Reuti <[email protected]> wrote:
> Hi,
> 
> > On 11.02.2015 at 19:28, Michael Stauffer <[email protected]> wrote:
> >
> > Hi,
> >
> > Is there a way to easily query whether a job is idle or otherwise stuck even
> > though the queue state says it's running? I've seen some old jobs that are
> > listed as running in the queue, but upon investigation on their compute
> > node there is no CPU activity associated with the processes, and there are
> > no error messages in the output files.
> 
> You can check the CPU time used by looking at the "usage" line in the `qstat
> -j <job_id>` output.
> 
> Any logic that safely indicates whether a job is stuck in an infinite loop
> or still computing won't be easy to implement and will most likely depend on
> the particular application, for example on whether there are output or
> scratch files that can be checked too. But even then, a stuck job may keep
> writing the same output to them.
> 
> We even have jobs that (apparently) compute fine, where only manual
> investigation reveals that the computed values converge to a wrong state or
> oscillate between states and will never stop.
> 
> -- Reuti
> 
> Thanks Reuti. I can see how this would be difficult. I may use the 'usage' 
> line from qstat. I could check every N hours, writing the usage output for 
> each running job to a file, then check the current usage stats against the 
> previous run's file and look for lines that haven't changed at all. To be 
> safe I'd then just email the user to suggest they take a look.
> 
> This won't catch instances of jobs that are stuck in loops of course, but at 
> least it'll catch completely hung jobs.
> 
> How often are a job's stats updated? Looks like every 40 seconds?

As defined in "load_report_time" IIRC.

-- Reuti
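
The snapshot comparison Michael describes could be sketched roughly as below.
This is only an illustration under assumptions, not a tested tool: it assumes
the usage line in `qstat -j <job_id>` output looks like
"usage    1:    cpu=00:00:01, mem=0.00000 GBs, io=0.00000, ...", and the
function names and the state file name are hypothetical. The list of running
job ids has to be supplied by the caller.

```python
import json
import re
import subprocess
from pathlib import Path

# Assumed shape of a usage line in `qstat -j <job_id>` output, e.g.
#   usage    1:    cpu=00:00:01, mem=0.00000 GBs, io=0.00000, vmem=N/A
USAGE_RE = re.compile(r"^usage[ \t]+(\d+):[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def parse_usage(qstat_j_output):
    """Map task id -> raw usage string for one job's `qstat -j` output."""
    return {task: usage for task, usage in USAGE_RE.findall(qstat_j_output)}


def find_stalled(previous, current):
    """Return the keys whose usage string is identical in both snapshots."""
    return sorted(k for k, v in current.items() if previous.get(k) == v)


def snapshot(job_ids):
    """Collect usage for each job id; keys look like '<job_id>.<task>'."""
    snap = {}
    for job_id in job_ids:
        result = subprocess.run(
            ["qstat", "-j", str(job_id)],
            capture_output=True, text=True, check=False,
        )
        for task, usage in parse_usage(result.stdout).items():
            snap[f"{job_id}.{task}"] = usage
    return snap


def check(job_ids, state_file="usage_snapshot.json"):
    """Compare against the previous run's snapshot, then save the new one."""
    path = Path(state_file)
    previous = json.loads(path.read_text()) if path.exists() else {}
    current = snapshot(job_ids)
    path.write_text(json.dumps(current))
    return find_stalled(previous, current)
```

Run from cron every N hours, `check([...])` would return the jobs whose usage
line did not change since the previous run. The interval should be well above
the configured load_report_time, otherwise two snapshots may see the same
report. As noted in the thread, this only flags completely hung jobs, not jobs
spinning in a loop.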


> 
> -M
> 
>  
> > I can devise a script to do this, but if there's already something for this 
> > I'd just use that. Thanks.
> >
> > -M
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users

