Re: [gt-user] GT4 job management

Stuart Martin Fri, 20 Jun 2008 08:58:03 -0700

By default, the files are stored in directories under the home dir of the account that starts the container.

e.g. ~/.globus/persisted/...

So there is no issue with the WS GRAM service being able to access the persisted job state files. The only issue here is the file system (e.g. NFS), and the default can be changed to ensure that it is writing to local disk to avoid any file system issues.


-Stu

On Jun 20, 2008, at 4:22 AM, [EMAIL PROTECTED] wrote:

On Jun 19, 2008, at 6:40 PM, [EMAIL PROTECTED] wrote:

Hi,
Thanks for your reply.

I think there are some places where each job's state information is
stored. But where and how is that information stored and retrieved?


Yes.  At various times the job state information is written to and
read from flat files.  We're looking into alternatively using a
database to store the information.  Possibly coming sometime in the
4.2 series.


So where is that information stored in Globus Toolkit 4.0.7, and how
does Globus manage it to prevent the failures I mentioned before?

Thanks

Tonny

Hi Tonny,
GRAM is fault tolerant, meaning that when/if the container or service
host crashes, the job details are not lost.  When the GRAM4 service is
restarted, the processing/monitoring of the job resumes.  GRAM2
requires user/client intervention to restart the processing of the
job.

If the job included file stage-in directives and those had not
completed at the time of the crash, then GRAM will resume processing
the job from that job state and continue until the job has been fully
processed.
If the job had already been submitted to the local resource manager
(LRM), then GRAM will resume monitoring the job in the LRM and continue
processing the job to completion.  GRAM persists the LRM job id.  If
the crash included the LRM and the LRM is also fault tolerant and
resumes processing of the job, then the job will be completely
processed without requiring any client intervention.
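
As a rough illustration of the recovery behaviour described above, here is a minimal Python sketch of persisting job state to flat files and reloading unfinished jobs after a restart. It is not GRAM's actual implementation: the state names, the JSON file layout, and the functions persist, recover and advance are all hypothetical, chosen only to show the idea of writing state (including the LRM job id) to disk before acting on it.

```python
import json
import os
import tempfile

# Hypothetical job states, for illustration only (not GRAM's real state names).
STATES = ["StageIn", "Pending", "Active", "StageOut", "Done"]


def persist(state_dir, job):
    """Atomically write one job's state to its own flat file."""
    path = os.path.join(state_dir, job["id"] + ".json")
    fd, tmp = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(job, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old file so a crash never leaves a half-written state file.
    os.replace(tmp, path)


def recover(state_dir):
    """After a restart, reload every persisted job that has not finished."""
    jobs = []
    for name in sorted(os.listdir(state_dir)):
        if not name.endswith(".json"):
            continue  # skip leftover temporary files
        with open(os.path.join(state_dir, name)) as f:
            job = json.load(f)
        if job["state"] != "Done":
            jobs.append(job)
    return jobs


def advance(state_dir, job):
    """Move a job to the next state and persist it before doing further work."""
    job["state"] = STATES[STATES.index(job["state"]) + 1]
    persist(state_dir, job)
    return job
```

Because the LRM job id is stored in the same record, a restarted service in this sketch has what it needs to go back to the LRM and ask about the job rather than resubmitting it.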

A persistent connection between the GRAM client and service is not
maintained, so network failures between the client and service can be
overcome.

In GRAM4 (WS GRAM), an EPR is included in the reply to
createManagedJob.  This allows the client to contact the service when
desired to get the current job status, cancel the job, subscribe for
notifications, ...
If the createManagedJob call is received by the GRAM service, but the
reply (containing the EPR) is not received by the client (possibly due
to network failure), then GRAM4 provides the means to subsequently get
the EPR in order to control the previously submitted job.
Details about that are here:
http://www-unix.globus.org/toolkit/docs/4.0/execution/wsgram/user-index.html#s-wsgram-user-submissionid
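
The lost-reply case can be sketched in Python as follows, assuming the mechanism is a client-chosen submission ID that the service treats idempotently (that is my reading of the linked section, so check it against the docs). FakeJobService, create_job and submit_reliably are hypothetical stand-ins for illustration, not the WS GRAM API.

```python
import uuid


class FakeJobService:
    """Hypothetical in-memory stand-in for a job service (not WS GRAM itself)."""

    def __init__(self):
        self.jobs = {}  # submission id -> job handle

    def create_job(self, submission_id, description):
        # A repeated submission id returns the existing handle
        # instead of creating a second job.
        if submission_id not in self.jobs:
            self.jobs[submission_id] = "job-handle-%d" % (len(self.jobs) + 1)
        return self.jobs[submission_id]


def submit_reliably(service, description, drop_first_reply=False, max_attempts=3):
    """Submit with one client-generated id, retrying if the reply is lost."""
    submission_id = str(uuid.uuid4())
    for attempt in range(max_attempts):
        handle = service.create_job(submission_id, description)
        if drop_first_reply and attempt == 0:
            continue  # simulate the reply being lost on the network
        return handle
    raise RuntimeError("no reply after %d attempts" % max_attempts)
```

The point of the design is that the client generates the id before the first attempt, so a retry after a lost reply refers to the same job rather than starting a duplicate.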

Lemme know if you have any more questions on this.

Regards,
-Stu

On Jun 19, 2008, at 10:00 AM, [EMAIL PROTECTED] wrote:
Hi,

I don't quite understand how GT4 manages a submitted job when failures
happen, for example a lost connection with the client caused by a
temporary network failure, or lost contact with the LRM caused by
Globus being restarted during job execution.

Does anybody know about this?

Regards

Tonny
