We can’t depend on the queue status, as it differs from machine to machine and
none of the machines reports a canceled job as canceled (see examples below).
Since Airavata is managing the job and received the cancel request from the
user, Airavata should mark the job status as canceled, along with the task and
experiment status, on a successful attempt. If the job is canceled in the
queued state, there is no stdout/error; in the running state, stdout/error will
not contain any indication that the job was canceled. As we discussed, when we
successfully cancel the job, we should mark the job status as canceled and stop
monitoring the job. In the case of UltraScan, we don’t want to run output
handlers. Other gateways may need to retrieve some outputs, and that can be
handled with an API flag. As I understand it, the simple workflow steps are as
follows. Please add to this if I missed anything.
1. User calls job cancel with the intermediate outputs flag set to false.
2. Validator checks the current status.
   2.A. If the status is executing:
        1. It calls the job cancel function from the orchestrator.
        2. On success, we remove the job from the queue viewer or mark
           the status canceled.
        3. If the job status is canceled and the flag is false, we don’t
           call the output handler.
        4. If the intermediate flag is true, search for stdout/error.
   2.B. If the status is anything else, the API returns an exception that
        the operation is not allowed.
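
For illustration only, the steps above could be sketched roughly as follows in
Python. This is a minimal sketch, not Airavata code: the names Job, Status,
OperationNotAllowed, cancel_job and the resource_cancel callable are all
hypothetical, and raising an error when the cancel command itself fails is an
assumption that the list does not spell out.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    EXECUTING = "EXECUTING"
    CANCELED = "CANCELED"
    COMPLETED = "COMPLETED"


class OperationNotAllowed(Exception):
    """Raised when cancel is requested in a state that does not allow it."""


@dataclass
class Job:
    job_id: str
    status: Status
    monitored: bool = True
    handlers_run: list = field(default_factory=list)


def cancel_job(job, resource_cancel, intermediate_outputs=False):
    """Cancel a job; resource_cancel is a hypothetical stand-in for the
    orchestrator's cancel call and returns True on success."""
    # Any status other than executing is rejected.
    if job.status is not Status.EXECUTING:
        raise OperationNotAllowed(f"cannot cancel job in state {job.status.name}")
    # Ask the resource to cancel; on failure leave the state untouched (assumption).
    if not resource_cancel(job.job_id):
        raise RuntimeError(f"cancel command failed for job {job.job_id}")
    # On success, mark the job canceled and stop monitoring it.
    job.status = Status.CANCELED
    job.monitored = False
    # Only look for stdout/error when the caller asked for intermediate outputs.
    if intermediate_outputs:
        job.handlers_run.append("stdout/stderr")
    return job
```

Calling cancel_job(Job("2242884", Status.EXECUTING), lambda job_id: True)
leaves the job canceled, unmonitored, and with no handlers run.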
Thanks
Raminder
Trestles >>
[us3@trestles-login1 ~]$ qstat -u us3
trestles-fe1.local:
                                                                                  Req'd  Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID NDS   TSK    Memory Time      S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
2242884.trestles-fe1.l  us3         shared   A1613788797      --     2     64     --     00:30:00  Q --
[us3@trestles-login1 ~]$ qdel 2242884
[us3@trestles-login1 ~]$ qstat -u us3
trestles-fe1.local:
                                                                                  Req'd  Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID NDS   TSK    Memory Time      S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
2242884.trestles-fe1.l  us3         shared   A1613788797      0      2     64     --     00:30:00  R 00:00:05
[us3@trestles-login1 ~]$ qstat -u us3
trestles-fe1.local:
                                                                                  Req'd  Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID NDS   TSK    Memory Time      S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
2242884.trestles-fe1.l  us3         shared   A1613788797      10302  2     64     --     00:30:00  C --
Stampede >>
[email protected] ~ $ squeue -u us3
JOBID   PARTITION NAME     USER ST TIME NODES NODELIST(REASON)
3897023 normal    A8020068 us3  PD 0:00 2     (Priority)
[email protected] ~ $ scancel 3897023
[email protected] ~ $ squeue -u us3
JOBID   PARTITION NAME     USER ST TIME NODES NODELIST(REASON)
Lonestar >>
us3@lonestar ~ $ qstat
job-ID  prior   name       user state submit/start at     queue slots ja-task-ID
--------------------------------------------------------------------------------
2109621 0.00000 A619522656 us3  qw    08/13/2014 09:44:43       24
us3@lonestar ~ $ qdel 2109621
us3 has deleted job 2109621
us3@lonestar ~ $ qstat
us3@lonestar ~ $
Alamo >>
us3@alamo ~ $ qstat
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
193052.alamo 1967556229 us3 0 R default
us3@alamo ~ $ qdel 193052
us3@alamo ~ $ qstat
us3@alamo ~ $
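
The transcripts above suggest one machine-independent check: after the cancel
command, the job is either gone from the listing or shown in a finished state.
A rough sketch of that check in Python follows. The function name
job_left_queue and its parameters are hypothetical, and it assumes one
unwrapped row per job with the state in the second-to-last column (as in the
PBS-style qstat listings); other schedulers would need a different
state_index.

```python
def job_left_queue(queue_output, job_id, finished_states=("C",), state_index=-2):
    """Return True if job_id is absent from the listing or in a finished state.

    Assumes one row per job and a PBS-style layout where the state is the
    second-to-last column; both are assumptions, not scheduler guarantees.
    """
    for line in queue_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        # Job IDs may carry a host suffix, e.g. "2242884.trestles-fe1.l".
        if fields[0].split(".")[0] != job_id:
            continue
        # The job is still listed: it has left the queue only if finished.
        return fields[state_index] in finished_states
    # The job is not listed at all (Stampede, Lonestar, Alamo behaviour above).
    return True
```

With the Trestles rows above, the "Q" and "R" rows would count as still queued
and the "C" row as gone; an empty listing counts as gone for any job ID.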
On Aug 13, 2014, at 9:01 AM, Marlon Pierce <[email protected]> wrote:
> There is an advantage for task (or job) state to capture the information that
> really comes from the machine (completed, cancelled, failed, etc), and for
> experiment state to be set to canceled by Airavata. That is, there should be
> parts of Airavata that capture machine-specific state information about the
> job for logging/auditing purposes.
>
> * Airavata issues "cancel" command to job in "launched" or "executing" state.
>
> * Airavata confirms that the job has left the queue or is no longer
> executing. This could be machine-specific, but the main question is "has the
> job left the queue?" or "is the job no longer in executing state?" I don't
> think it is "if this is trestles, and since we issued a qdel command, is the
> job marked as completed; or if this is stampede, is the job now marked as
> failed?"
>
> * If the job cancel works, then Airavata marks this as canceled.
>
> * If cancel fails for some reason, don't change the Experiment state but
> throw an error.
>
>
> Marlon
>
> On 8/13/14, 2:57 AM, Lahiru Gunathilake wrote:
>> Hi All,
>>
>> I have a few concerns about experiment cancellation. When we want to cancel
>> an experiment we have to run a particular command on the computing
>> resource. Different computing resources show the job status of cancelled
>> jobs in different ways. Ex: trestles shows cancelled jobs as completed,
>> some other machines show them as cancelled, and some might show them as
>> failed.
>>
>> I think we should replicate this information in the JobDetails object as
>> the Job status and make sure the Experiment and Task statuses are set to
>> cancelled. The other approach is that when we cancel we explicitly mark all
>> the states in the experiment model (experiment, task, and job states) as
>> cancelled and manually handle the state we get from the computing
>> resource.
>>
>> My concern: should we really hide the information shown by the computing
>> resource from the Job status we store in the registry, or leave it as it
>> is and use the other statuses to represent cancelled experiments? If we
>> mark everything as cancelled there will be an inconsistency in the
>> JobStatus.
>>
>> WDYT ?
>>
>> Lahiru
>>
>
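
One way to address the concern quoted above about hiding resource information
would be to store both values side by side. The Python sketch below is
illustrative only: JobStatusRecord, its fields, and record_cancel are
hypothetical names, not the actual JobDetails model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobStatusRecord:
    job_id: str
    airavata_state: str            # state Airavata assigns, e.g. "CANCELED"
    resource_state: Optional[str]  # raw scheduler state, None if the job vanished


def record_cancel(job_id, resource_state=None):
    """Mark the job canceled while keeping whatever the resource reported."""
    return JobStatusRecord(job_id, "CANCELED", resource_state)
```

For the Trestles job this would keep "C" as the resource state next to the
canceled state; for the Stampede job the resource state would simply be empty.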