Thanks Lahiru. I will give this a try and test for different cases.
Raminder

On Aug 19, 2014, at 5:42 AM, Lahiru Gunathilake <[email protected]> wrote:

> Hi All,
>
> I have committed the initial version of Experiment cancelling.
>
> Experiment cancel is an Airavata-API method which can be invoked by the
> Airavata client. The request reaches GFac Provider-level cancellation
> only if the job has already been submitted to the computing resource;
> otherwise it is handled by the orchestrator.
>
> If a cancel request arrives for an Experiment that is already completed,
> failed or cancelling, the cancel operation fails and an error is thrown
> to the client.
>
> If the job is marked cancelled successfully, experiment launch execution
> is stopped at the next plugin invocation (the launchExperiment operation
> runs in a separate thread). Ex: if GFac is running Handler1 when the
> cancel arrives, experiment launch execution is stopped before the next
> plugin invocation.
> Limitation: if the Input Handlers have 500 files to transfer (and file
> number 100 is currently transferring) and the user cancels the
> experiment during that step, the rest of the files will still be
> transferred, and the original execution is cancelled only before the
> next plugin. (If we want to download partial outputs we have to modify
> this logic.) Either the GFac framework handles cancel (that's what we
> have now), or the framework just tries to execute all the plugins and
> each plugin implementation listens for a cancellation of that particular
> execution and acts accordingly.
>
> If the job is already submitted and GFac is monitoring the job, it is
> cancelled by invoking the provider's cancel operation. Experiment
> statuses, Task statuses and Job statuses will be updated properly, and
> monitoring will be stopped for those jobs once the monitoring results
> report a terminating Job status.
>
> When there are multiple GFac instances, the original experiment launch
> request can go to GFac Node1 (a separate JVM), and the cancel request
> doesn't have to go to the same GFac node.
> The Orchestrator will handle this scenario, make the job cancel request
> succeed, and stop the experiment launch as explained above.
>
> During a GFac node failure there could be job launches and job cancel
> executions in progress in that instance. The Orchestrator will route
> both types of requests to an available GFac node and recover the
> executions.
>
> I have a known issue to fix: when I run the cancel operation, GFac-level
> authentication sometimes fails. I will try to find out what is
> happening. The problem comes up from time to time, and I am not sure
> whether it is related to the cancel feature or has something to do with
> trestles.
>
> Regards
> Lahiru
>
> On Mon, Aug 18, 2014 at 7:13 PM, Lahiru Gunathilake <[email protected]> wrote:
> Hi Marlon,
>
> I should be able to wrap up later today or early tomorrow.
>
> Regards
> Lahiru
>
> On Mon, Aug 18, 2014 at 7:01 PM, Marlon Pierce <[email protected]> wrote:
> How goes the implementation?
>
> Marlon
>
> On 8/13/14, 11:09 PM, Lahiru Gunathilake wrote:
> Thank you very much for all the inputs! I will take these into
> consideration.
>
> Regards
> Lahiru
>
> On Wed, Aug 13, 2014 at 10:31 PM, Miller, Mark <[email protected]> wrote:
>
> If I understand this correctly, I want to offer some input from our
> experience with CIPRES.
>
> Currently, if a CIPRES user wishes to cancel a job, they must delete the
> entire job, and therefore lose all ability to view the input and other
> files used.
>
> This is not an ideal solution.
>
> There is value to the user in being able to see partially completed
> results, or even the input files they used.
>
> So I would vote for making partial output of the job available as an
> option.
>
> Any additional information you can provide about status would be useful,
> especially for folks who are debugging failures.
>
> Just my 2c.
>
> Mark
>
> *From:* Eroma Abeysinghe [mailto:[email protected]]
> *Sent:* Wednesday, August 13, 2014 7:04 AM
> *To:* [email protected]
> *Subject:* Re: Experiment Cancellation
>
> My questions and thoughts on Experiment cancellation
> 1. What are we going to do with the output or partial output of the job
> at the time of cancelling? Are we going to discard it or make it
> available for the experiment? Are we keeping all the job information
> and messages on CANCELLED jobs, or discarding them as well?
>
> 2. Are we going to allow editing for CANCELLED or CANCELLING experiments?
> IMO we should not, because editing is only required if the experiment is
> going to be re-launched.
>
> 3. With the existing experiment and job states we need to decide which
> can be CANCELLED.
> Out of the Airavata Experiment states, cancellation should be allowed
> for:
> CREATED
> VALIDATED
> SCHEDULED
> LAUNCHED
> EXECUTING
> Cancellation should be communicated to resources if the job state is:
> SUBMITTED
> SETUP
> QUEUED
> ACTIVE
> HELD
>
> There is a SUSPENDED state in both experiment and job, but is this a
> currently active state?
>
> 4. Cloning will be available for CANCELLED and CANCELLING experiments.
>
> 5. In Experiment Summary we should display any errors that took place in
> the cancelling process.
>
> On Wed, Aug 13, 2014 at 9:01 AM, Marlon Pierce <[email protected]> wrote:
>
> There is an advantage for task (or job) state to capture the information
> that really comes from the machine (completed, cancelled, failed, etc),
> and for experiment state to be set to canceled by Airavata. That is,
> there should be parts of Airavata that capture machine-specific state
> information about the job for logging/auditing purposes.
>
> * Airavata issues "cancel" command to job in "launched" or "executing"
> state.
>
> * Airavata confirms that the job has left the queue or is no longer
> executing. This could be machine-specific, but the main question is "has
> the job left the queue?" or "is the job no longer in executing state?" I
> don't think it is "if this is trestles, and since we issued a qdel
> command, is the job marked as completed; or if this is stampede, is the
> job now marked as failed?"
>
> * If the job cancel works, Airavata marks this as canceled.
>
> * If cancel fails for some reason, don't change the Experiment state but
> throw an error.
>
> Marlon
>
> On 8/13/14, 2:57 AM, Lahiru Gunathilake wrote:
>
> Hi All,
>
> I have a few concerns about experiment cancellation. When we want to
> cancel an experiment we have to run a particular command on the
> computing resource. Different computing resources show the job status of
> cancelled jobs in different ways. Ex: trestles shows cancelled jobs as
> completed, some other machines show them as cancelled, and some might
> show them as failed.
>
> I think we should replicate this information in the JobDetails object as
> the Job status and make sure the Experiment and Task statuses are set to
> cancelled. The other approach is, when we cancel, to explicitly mark all
> the states in the experiment model (experiment, task and job states) as
> cancelled and manually handle the state we get from the computing
> resource.
>
> My concern is: should we really hide the information shown by the
> computing resource from the Job status we store in the registry? Or
> leave it as it is and use the other statuses to represent cancelled
> experiments? If we mark everything cancelled there will be inconsistency
> in the JobStatus.
>
> WDYT?
>
> Lahiru
>
> --
> Thank You,
> Best Regards,
> Eroma
>
> --
> System Analyst Programmer
> PTI Lab
> Indiana University
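
For readers following the thread, the behaviour described above (cancel is rejected for experiments that are already completed, failed or cancelling, and a running launch stops before the next plugin invocation) can be sketched roughly as below. This is only an illustration in Python, not Airavata code: the Experiment class, its cancel() and launch() methods and the handler tuples are hypothetical names. The state names are the ones mentioned in the thread.

```python
import threading
from enum import Enum


class ExperimentState(Enum):
    CREATED = "CREATED"
    VALIDATED = "VALIDATED"
    SCHEDULED = "SCHEDULED"
    LAUNCHED = "LAUNCHED"
    EXECUTING = "EXECUTING"
    CANCELLING = "CANCELLING"
    CANCELLED = "CANCELLED"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Experiment states from which cancellation is allowed (list taken from the thread).
CANCELLABLE = {
    ExperimentState.CREATED,
    ExperimentState.VALIDATED,
    ExperimentState.SCHEDULED,
    ExperimentState.LAUNCHED,
    ExperimentState.EXECUTING,
}


class Experiment:
    """Hypothetical stand-in for an experiment launch; not the Airavata model."""

    def __init__(self, handlers):
        self.state = ExperimentState.CREATED
        self.handlers = handlers              # list of (name, callable) pairs
        self.cancel_requested = threading.Event()
        self.ran = []                         # names of handlers that finished

    def cancel(self):
        # Completed, failed, cancelling or cancelled experiments reject the request.
        if self.state not in CANCELLABLE:
            raise RuntimeError(f"cannot cancel experiment in state {self.state.name}")
        self.state = ExperimentState.CANCELLING
        self.cancel_requested.set()

    def launch(self):
        if not self.cancel_requested.is_set():
            self.state = ExperimentState.EXECUTING
        for name, handler in self.handlers:
            # The flag is checked only between plugins, so a handler that is
            # already running (e.g. a long file transfer) runs to the end.
            if self.cancel_requested.is_set():
                self.state = ExperimentState.CANCELLED
                return
            handler(self)
            self.ran.append(name)
        if self.cancel_requested.is_set():
            self.state = ExperimentState.CANCELLED
        else:
            self.state = ExperimentState.COMPLETED


if __name__ == "__main__":
    # Simulate a cancel request arriving while Handler1 is running.
    exp = Experiment([
        ("Handler1", lambda e: e.cancel()),
        ("Handler2", lambda e: None),
    ])
    exp.launch()
    print(exp.ran)         # ['Handler1']
    print(exp.state.name)  # CANCELLED
```

Checking the flag only between handlers matches the limitation noted in the thread: to stop part-way through a transfer, each handler would have to poll the same flag itself.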
