We can't depend on the queue status, because it differs from machine to
machine and none of the machines reports a job's queue status as canceled (see
the examples below). Since Airavata is managing the job and received the cancel
request from the user, Airavata should set the job status to canceled, along
with the task and experiment statuses, once the cancel attempt succeeds. If the
job is canceled while queued, there is no stdout/stderr; if it is canceled
while running, stdout/stderr will not contain anything indicating that the job
was canceled. As we discussed, when we successfully cancel the job, we should
mark the job status as canceled and stop monitoring the job. In the case of
UltraScan, we don't want to run the output handlers. Other gateways may need to
retrieve some outputs, and that can be handled with an API flag. As I
understand it, the simple workflow steps are as follows. Please add to this if
I missed anything.

1. User calls job cancel with the intermediate-outputs flag set to false.
2. The validator checks the current status.
        2.A. If the status is executing:
                1. It calls the job cancel function from the orchestrator.
                2. On success, we remove the job from the queue viewer or mark
                   the status canceled.
                3. If the job status is canceled and the flag is false, we
                   don't call the output handler.
                4. If the intermediate flag is true, search for stdout/error.
        2.B. For any other status, the API returns an exception saying that
             the operation is not allowed.
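
To make the steps above concrete, here is a minimal Python sketch of the flow.
It is only an illustration: cancel_job, cancel_on_resource, fetch_outputs,
Status and OperationNotAllowed are hypothetical names, not Airavata APIs, and
the job is modeled as a plain dict. Raising an error and leaving the state
untouched when the cancel fails is an assumption taken from Marlon's quoted
message below, not from the list itself.

```python
from enum import Enum


class Status(Enum):
    # Hypothetical status values, used for illustration only.
    QUEUED = "QUEUED"
    EXECUTING = "EXECUTING"
    CANCELED = "CANCELED"
    COMPLETED = "COMPLETED"


class OperationNotAllowed(Exception):
    """Raised when cancel is requested in a status that does not allow it."""


def cancel_job(job, cancel_on_resource, fetch_outputs, intermediate_outputs=False):
    """Sketch of steps 1, 2.A and 2.B.

    job                -- dict with "id", "state" and "monitored" keys (assumed shape)
    cancel_on_resource -- callable(job_id) -> bool, e.g. wraps qdel or scancel
    fetch_outputs      -- callable(job_id) -> dict holding stdout/stderr
    """
    # Step 2.B: any status other than executing is rejected.
    if job["state"] is not Status.EXECUTING:
        raise OperationNotAllowed(
            "cancel is not allowed in status " + job["state"].value
        )

    # Step 2.A.1: issue the cancel on the compute resource.
    if not cancel_on_resource(job["id"]):
        # Assumption: on failure the state is left unchanged and an error is raised.
        raise RuntimeError("cancel failed for job " + str(job["id"]))

    # Step 2.A.2: mark the job canceled and stop monitoring it.
    job["state"] = Status.CANCELED
    job["monitored"] = False

    # Steps 2.A.3 and 2.A.4: collect outputs only if the caller asked for them.
    if intermediate_outputs:
        job["outputs"] = fetch_outputs(job["id"])
    return job
```

With intermediate_outputs left at False (the UltraScan case) no output handler
is invoked; another gateway could pass True to have stdout/stderr collected.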
                                                                                
Thanks
Raminder

Trestles >> 
[us3@trestles-login1 ~]$ qstat -u us3

trestles-fe1.local:
                                                                                  Req'd    Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
2242884.trestles-fe1.l  us3         shared   A1613788797         --      2     64    --   00:30:00 Q       --
[us3@trestles-login1 ~]$ qdel 2242884
[us3@trestles-login1 ~]$ qstat -u us3

trestles-fe1.local:
                                                                                  Req'd    Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
2242884.trestles-fe1.l  us3         shared   A1613788797           0     2     64    --   00:30:00 R  00:00:05

[us3@trestles-login1 ~]$ qstat -u us3

trestles-fe1.local:
                                                                                  Req'd    Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
2242884.trestles-fe1.l  us3         shared   A1613788797       10302     2     64    --   00:30:00 C       --


Stampede >>
[email protected] ~ $ squeue -u us3
             JOBID   PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           3897023      normal A8020068      us3 PD       0:00      2 (Priority)
[email protected] ~ $ scancel 3897023
[email protected] ~ $ squeue -u us3
             JOBID   PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

Lonestar >>
us3@lonestar ~ $ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
2109621 0.00000 A619522656 us3          qw    08/13/2014 09:44:43                                   24
us3@lonestar ~ $ qdel 2109621
us3 has deleted job 2109621
us3@lonestar ~ $ qstat
us3@lonestar ~ $

Alamo >>
us3@alamo ~ $ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
193052.alamo              1967556229       us3                    0 R default
us3@alamo ~ $ qdel 193052
us3@alamo ~ $ qstat
us3@alamo ~ $
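
Going only by the four transcripts above, a machine-independent check for "has
the job left the queue?" could look like the sketch below. It assumes the
caller has already extracted the scheduler's state code for the job, or None
if the job is no longer listed; has_left_queue and TERMINAL_STATES are
hypothetical names. The only post-cancel state code seen above is "C"
(Trestles); on Stampede, Lonestar and Alamo the job simply disappears from the
listing.

```python
from typing import Optional

# Post-cancel state codes seen in the transcripts above. Only Trestles keeps
# the job listed (as "C"); other machines may need additional entries.
TERMINAL_STATES = {"C"}


def has_left_queue(reported_state: Optional[str]) -> bool:
    """Return True if the job is gone from the listing or in a terminal state.

    reported_state is the scheduler's state code for the job, or None when the
    job no longer appears in the qstat/squeue output.
    """
    return reported_state is None or reported_state in TERMINAL_STATES


# States taken from the transcripts above.
print(has_left_queue("C"))    # Trestles after qdel
print(has_left_queue(None))   # Stampede, Lonestar, Alamo after cancel
print(has_left_queue("PD"))   # Stampede before scancel
```

Note that "C" here only says the job has left the queue, not why; that it was
canceled is something Airavata knows because it issued the cancel.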


On Aug 13, 2014, at 9:01 AM, Marlon Pierce <[email protected]> wrote:

> There is an advantage for task (or job) state to capture the information that 
> really comes from the machine (completed, cancelled, failed, etc), and for 
> experiment state to be set to canceled by Airavata.  That is, there should be 
> parts of Airavata that capture machine-specific state information about the 
> job for logging/auditing purposes.
> 
> * Airavata issues "cancel" command to job in "launched" or "executing" state.
> 
> * Airavata confirms that the job has left the queue or is no longer 
> executing. This could be machine-specific, but the main question is "has the 
> job left the queue?" or "is the job no longer in executing state?"  I don't 
> think it is "if this is trestles, and since we issued a qdel command, is the 
> job marked as completed; or if this is stampede, is the job now marked as 
> failed?"
> 
> * If the job cancel works, then Airavata marks this as canceled.
> 
> * If cancel fails for some reason, don't change the Experiment state but 
> throw an error.
> 
> 
> Marlon
> 
> On 8/13/14, 2:57 AM, Lahiru Gunathilake wrote:
>> Hi All,
>> 
>> I have a few concerns about experiment cancellation. When we want to cancel
>> an experiment we have to run a particular command on the computing
>> resource. Different computing resources show the job status of
>> cancelled jobs in different ways. E.g., trestles shows
>> cancelled jobs as completed, some other machines show them as cancelled,
>> and some might show them as failed.
>> 
>> I think we should replicate this information in the JobDetails object as
>> the Job status and make sure the Experiment and Task statuses are set to
>> cancelled. The other approach is that when we cancel, we explicitly mark all
>> the states in the experiment model (experiment, task, and job states)
>> as cancelled and manually handle the state we get from the computing
>> resource.
>> 
>> My concern is: should we really hide the information shown by the computing
>> resource from the Job status we are storing in the registry? Or should we
>> leave it as it is and handle the other statuses to represent cancelled
>> experiments? If we make everything cancelled there will be inconsistency in
>> the JobStatus.
>> 
>> WDYT ?
>> 
>> Lahiru
>> 
> 
