What most likely happened is that a credential delegated to the server expired. In that case the server removes the credential, and jobs that still refer to the (no longer existing) credential fail with the error message you pasted.
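As a quick sanity check (assuming you delegate from a standard GSI proxy on the submit host), you can see how long your local proxy is still valid; a delegated credential cannot outlive the proxy it was derived from:

    # print the remaining lifetime of the local proxy, in seconds
    grid-proxy-info -timeleft

If that is much less than your jobs' expected queue time plus run time, this failure mode is exactly what you would see.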
How do you delegate the credential that is used by your jobs?

* Do you let globusrun-ws delegate for you?
* Do you delegate a credential manually, e.g. using
  globus-credential-delegate, and refer to it in your job description,
  or do you let globusrun-ws pick up the EPR of the manually delegated
  credential?

You can debug this e.g. like this:

* Submit jobs that do not require a delegated credential and see if the
  same problem still occurs. From your description I'd say that those
  jobs will not fail.
* Delegate a credential that is valid for, say, 60h, using
  globus-credential-delegate, and refer to that credential in your jobs
  (globusrun-ws options: -Jf, -Sf). Then check whether the jobs still
  fail after 24h. (See the command sketch below the quoted message.)

Maybe worth noting: sometimes people delegate although they don't
really need to, i.e. the job does not need a job credential and no
staging is performed.

-Martin

Yuriy wrote:
> Hi,
>
> Some of the jobs submitted to Torque via GRAM are killed after about
> 24 hours in the queue, all with a similar message in the Globus logs:
>
> 2009-07-10 11:32:16,052 INFO exec.StateMachine
> [RunQueueThread_5,logJobFailed:3250] Job 74bd3c60-6c17-11de-9a06-9ba1d1ebd14a
> failed. Description: Couldn't obtain a delegated credential. Cause:
> org.globus.exec.generated.FaultType: Couldn't obtain a delegated credential.
> caused by [0: org.oasis.wsrf.faults.BaseFaultType: Error getting delegation
> resource [Caused by: org.globus.wsrf.NoSuchResourceException]]
>
> Torque reports exit status = 271 (exceeds resource limit or killed by
> user), but none of the "problematic" jobs seem to exceed any limits.
> Moreover, we have had a lot of jobs that ran for longer than 24 hours
> and completed successfully (sometimes users just re-submitted jobs
> with the same description, using exactly the same tools, and they
> completed without any problems).
>
> All problematic jobs were submitted with the globusrun-ws tool.
>
> Could anyone explain what is going on here?
>
> Currently we use the Globus version from VDT 1.10; we started with VDT 1.6.
> From looking at the logs, we have had the same problem for over a year,
> but not many people are affected and most just re-submit without
> reporting it.
>
> Cheers,
> Yuriy
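P.S. To make the second debugging step above concrete, this is roughly
what it looks like on the command line. Hostname, port, and file names
are placeholders, and the option details of globus-credential-delegate
vary between GT versions, so check its -help output on your
installation:

    # create a local proxy valid for 60 hours
    grid-proxy-init -valid 60:00

    # delegate it to the delegation service on the GRAM host and
    # store the EPR of the delegated credential in a file
    globus-credential-delegate -h gram.example.org -p 8443 cred.epr

    # submit, pointing both the job credential (-Jf) and the staging
    # credential (-Sf) at the manually delegated credential
    globusrun-ws -submit -Jf cred.epr -Sf cred.epr -f job.xml

If a job submitted this way survives past the 24h mark, the lifetime of
the automatically delegated credential was your problem.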