On Wed, Feb 11, 2015 at 2:02 PM, Reuti <[email protected]> wrote:

> Hi,
>
> > Am 11.02.2015 um 19:28 schrieb Michael Stauffer <[email protected]>:
> >
> > Hi,
> >
> > Is there a way to easily query if a job is idle or otherwise stuck even
> > though the queue state says it's running? I've seen some old jobs that are
> > listed as running in the queue, but upon investigation on their compute
> > node there is no CPU activity associated with the processes, and there are
> > no error messages in the output files.
>
> You can check the CPU time used by looking at the "usage" line in the
> `qstat -j <job_id>` output.
>
> Any logic that reliably indicates whether a job is stuck in an infinite
> loop or still computing won't be easy to implement and will most likely
> depend on each particular application, and on whether there are any output
> or scratch files which can be checked too. But even then the same output
> may be written there repeatedly.
>
> We even have jobs which compute (apparently) fine, but only by manual
> investigation can one say that the computed values converge to a wrong
> state or are oscillating between states and won't ever stop.
>
> -- Reuti
>
Thanks Reuti. I can see how this would be difficult. I may use the 'usage'
line from qstat. I could check every N hours, writing the usage output for
each running job to a file, then check the current usage stats against the
previous run's file and look for lines that haven't changed at all. To be
safe I'd then just email the user to suggest they take a look.

This won't catch instances of jobs that are stuck in loops of course, but
at least it'll catch completely hung jobs.
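
A minimal sketch of that snapshot-and-compare idea in Python, not a tested
tool. It assumes the usage line in `qstat -j <job_id>` output looks roughly
like `usage    1:    cpu=00:12:34, mem=..., io=..., vmem=..., maxvmem=...`,
and it compares only the cpu field rather than the whole line. The function
names and the `usage_snapshot.json` file name are made up for illustration.

```python
import json
import re
import subprocess
from pathlib import Path

# Assumed shape of the line in `qstat -j <job_id>` output, e.g.
#   usage    1:    cpu=00:12:34, mem=1.5 GBs, io=0.1, vmem=200M, maxvmem=210M
USAGE_RE = re.compile(r"^usage\s+(\d+):\s+(.*)$", re.MULTILINE)
CPU_RE = re.compile(r"\bcpu=([\d:]+)")


def cpu_by_task(qstat_j_output):
    """Map task id -> cpu string for every usage line in the output."""
    result = {}
    for task, fields in USAGE_RE.findall(qstat_j_output):
        match = CPU_RE.search(fields)
        if match:
            result[task] = match.group(1)
    return result


def unchanged_jobs(previous, current):
    """Return job ids whose cpu usage is identical in both snapshots."""
    return sorted(job for job, usage in current.items()
                  if usage and previous.get(job) == usage)


def snapshot(job_ids):
    """Collect cpu usage per job id by calling qstat (needs Grid Engine)."""
    snap = {}
    for job_id in job_ids:
        out = subprocess.run(["qstat", "-j", str(job_id)],
                             capture_output=True, text=True).stdout
        snap[str(job_id)] = cpu_by_task(out)
    return snap


def check(job_ids, state_file="usage_snapshot.json"):
    """Compare against the previous run's file, then overwrite it."""
    path = Path(state_file)
    previous = json.loads(path.read_text()) if path.exists() else {}
    current = snapshot(job_ids)
    path.write_text(json.dumps(current))
    return unchanged_jobs(previous, current)
```

Run from cron every N hours, `check(job_ids)` would return the ids of jobs
whose cpu time has not moved since the previous run. The list of running job
ids has to come from somewhere else, for example by parsing `qstat -s r`.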

How often are a job's stats updated? Looks like every 40 seconds?

-M



> > I can devise a script to do this, but if there's already something for
> > this I'd just use that. Thanks.
> >
> > -M
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users
>
>
