What very probably happens is that a credential delegated to the
server has expired. In that case it is removed on the server side,
and jobs that still refer to such a (no longer existing) credential
fail with the error message you pasted.

How do you delegate the credential that is used by your jobs:
* Do you let globusrun-ws delegate for you?
* Do you delegate a credential yourself, e.g. using
  globus-credential-delegate, and refer to it in your job description
  or let globusrun-ws pick up the EPR of the manually delegated
  credential?
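To make the two delegation paths concrete, here is a sketch of both
(GT4 client tools; the job description file name job.xml is made up,
and exact option spellings may differ between GT4 releases):

```shell
# (a) Let globusrun-ws delegate at submit time:
#     -J delegates a job credential, -S a staging credential.
globusrun-ws -submit -J -S -f job.xml

# (b) Delegate once yourself, store the credential's EPR in a file,
#     and refer to that EPR from every subsequent submission:
globus-credential-delegate cred.epr
globusrun-ws -submit -Jf cred.epr -Sf cred.epr -f job.xml
```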

You can debug this, for example, as follows:
* Submit jobs that do not require a delegated credential and see if the
  same problem still occurs. From your description I'd say that those jobs
  will not fail.
* Delegate a credential that is valid for, say, 60h, using
  globus-credential-delegate, refer to that credential in your jobs
  (globusrun-ws options: -Jf, -Sf), and check if the jobs still fail
  after 24h.
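The second debugging step above could look roughly like this
(sketch; the container host gram.example.org is a placeholder, and
the option names assume a standard GT4 client installation):

```shell
# 1. Create a local proxy that stays valid long enough (60 hours).
grid-proxy-init -valid 60:00

# 2. Delegate a credential to the container and write its EPR to a file.
globus-credential-delegate -h gram.example.org -p 8443 cred.epr

# 3. Submit the job, pointing both the job credential and the staging
#    credential at the manually delegated EPR.
globusrun-ws -submit -Jf cred.epr -Sf cred.epr -f job.xml
```

If jobs referring to the long-lived credential survive past 24h while
jobs with the default delegation do not, that confirms the expiry
theory.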

Maybe worth noting: sometimes people delegate although they don't really
need to delegate, i.e. the job does not need a job credential and no
staging is performed.
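In that case the simplest fix is to not delegate at all, i.e. submit
without any of the delegation options (sketch, file name assumed):

```shell
# No -J/-S and no -Jf/-Sf: no credential is delegated, so there is
# nothing on the server side that can expire out from under the job.
globusrun-ws -submit -f job.xml
```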

-Martin

Yuriy wrote:
>  Hi,
>  
>  Some of the jobs submitted to torque via GRAM are killed after about
>  24 hours in the queue, all with the similar message in globus logs: 
>  
>  2009-07-10 11:32:16,052 INFO  exec.StateMachine 
> [RunQueueThread_5,logJobFailed:3250] Job 74bd3c60-6c17-11de-9a06-9ba1d1ebd14a 
> failed. Description: Couldn't obtain a delegated credential. Cause: 
> org.globus.exec.generated.FaultType: Couldn't obtain a delegated credential. 
> caused by [0: org.oasis.wsrf.faults.BaseFaultType: Error getting delegation 
> resource [Caused by: org.globus.wsrf.NoSuchResourceException]]
>  
>  torque reports exit status = 271 (exceeds resource limit or killed by
>  user), none of the "problematic" jobs seem to exceed any
>  limits. Moreover we had a lot of jobs that ran for longer than 24 hours
>  and completed successfully (sometimes users just re-submitted jobs
>  with the same description, using exactly the same tools, and they
>  completed without any problems). 
>  
>  All problematic jobs were submitted with globusrun-ws tool 
>  
>  Could anyone explain what is going on here? 
>  
>  
>  Currently we use globus version from VDT 1.10, started with VDT 1.6 
>  From looking in logs, we  had the same problem for over a year, but not
>  many people are affected and most just re-submit without
>  reporting. 
>  
>  Cheers,
>  Yuriy
>  

