[jira] [Commented] (AIRAVATA-2943) Re-queueing and node failures in HPC clusters need to be handled in gateway middleware as resubmitting failures
[ https://issues.apache.org/jira/browse/AIRAVATA-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782158#comment-16782158 ]

Dimuthu Upeksha commented on AIRAVATA-2943:
-------------------------------------------

Fixed in https://github.com/apache/airavata/commit/8b10120be4ce1d0720f214dc5e849d1dc862c595

> Re-queueing and node failures in HPC clusters need to be handled in gateway
> middleware as resubmitting failures
> ---------------------------------------------------------------------------
>
> Key: AIRAVATA-2943
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2943
> Project: Airavata
> Issue Type: Bug
> Components: helix implementation
> Affects Versions: 0.18
> Environment: https://staging.ultrascan.scigap.org slurm job ID 8560 in Jetstream
> Reporter: Eroma
> Assignee: Dimuthu Upeksha
> Priority: Major
> Fix For: 0.18
>
> Currently in clusters (PBS and SLURM), jobs are getting re-queued due to
> node failures. In such scenarios the jobs are executed after re-queueing,
> but on the gateway side the job is taken as FAILED at the initial NODE_FAIL.
> These types of failures need to be captured as retrying failures instead of
> being taken as an end result.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (AIRAVATA-2943) Re-queueing and node failures in HPC clusters need to be handled in gateway middleware as resubmitting failures
[ https://issues.apache.org/jira/browse/AIRAVATA-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dimuthu Upeksha closed AIRAVATA-2943.
-------------------------------------
Resolution: Fixed

> Re-queueing and node failures in HPC clusters need to be handled in gateway
> middleware as resubmitting failures
> ---------------------------------------------------------------------------
>
> Key: AIRAVATA-2943
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2943
> Project: Airavata
> Issue Type: Bug
> Components: helix implementation
> Affects Versions: 0.18
> Environment: https://staging.ultrascan.scigap.org slurm job ID 8560 in Jetstream
> Reporter: Eroma
> Assignee: Dimuthu Upeksha
> Priority: Major
> Fix For: 0.18
>
> Currently in clusters (PBS and SLURM), jobs are getting re-queued due to
> node failures. In such scenarios the jobs are executed after re-queueing,
> but on the gateway side the job is taken as FAILED at the initial NODE_FAIL.
> These types of failures need to be captured as retrying failures instead of
> being taken as an end result.
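The behaviour requested in AIRAVATA-2943 can be illustrated with a minimal sketch. It is not taken from the linked commit: the `GatewayJobState` enum, the `gateway_state` function, and the `requeued` flag are hypothetical names introduced here, and the only assumption about the scheduler is that it reports SLURM-style state codes such as PENDING, RUNNING, COMPLETED, FAILED, NODE_FAIL, and REQUEUED. The idea is that NODE_FAIL is treated as an end result only when the scheduler has not put the job back in the queue.

```python
from enum import Enum


class GatewayJobState(Enum):
    """Hypothetical gateway-side job states used only for this sketch."""
    QUEUED = "QUEUED"
    ACTIVE = "ACTIVE"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"


def gateway_state(scheduler_state: str, requeued: bool = False) -> GatewayJobState:
    """Map a SLURM-style scheduler state to a gateway-side state.

    `requeued` is an assumed flag: True when the scheduler has put the job
    back in the queue after the node failure.
    """
    if scheduler_state in ("PENDING", "REQUEUED"):
        return GatewayJobState.QUEUED
    if scheduler_state == "RUNNING":
        return GatewayJobState.ACTIVE
    if scheduler_state == "COMPLETED":
        return GatewayJobState.COMPLETE
    if scheduler_state == "NODE_FAIL":
        # A node failure is an end result only if the job was not re-queued.
        return GatewayJobState.QUEUED if requeued else GatewayJobState.FAILED
    if scheduler_state == "FAILED":
        return GatewayJobState.FAILED
    raise ValueError(f"unrecognised scheduler state: {scheduler_state}")
```

With this mapping, a job that hits NODE_FAIL and is re-queued stays QUEUED on the gateway side and can later move to ACTIVE and COMPLETE, rather than being recorded as FAILED at the first node failure.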
[jira] [Closed] (AIRAVATA-2963) Cannot login to testing gateway portal and also getting an error in create experiment.
[ https://issues.apache.org/jira/browse/AIRAVATA-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dimuthu Upeksha closed AIRAVATA-2963.
-------------------------------------
Resolution: Fixed

> Cannot login to testing gateway portal and also getting an error in create
> experiment.
> --------------------------------------------------------------------------
>
> Key: AIRAVATA-2963
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2963
> Project: Airavata
> Issue Type: Bug
> Components: PGA PHP Web Gateway
> Affects Versions: 0.18
> Environment: https://testing.seagrid.org
> Reporter: Eroma
> Assignee: Dimuthu Upeksha
> Priority: Major
> Fix For: 0.18
>
> # When the username and password are entered, exception [1] is raised.
> # When the exception page is refreshed, the user is on the home page, and
> clicking 'Create' under Experiment raises the second exception [2].
>
> [1] UserProfileServiceException
> Error while creating user profile. More info : Failed to update user profile
> in IAM service
>
> [2] ErrorException
> Invalid argument supplied for foreach() (View:
> /var/www/portals/seagrid/app/views/experiment/create.blade.php)
[jira] [Commented] (AIRAVATA-2973) Helix submitting two jobs; both at the same time for a single experiment
[ https://issues.apache.org/jira/browse/AIRAVATA-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782154#comment-16782154 ]

Dimuthu Upeksha commented on AIRAVATA-2973:
-------------------------------------------

Fixed in https://github.com/apache/airavata/commit/0f0a52afadcb9bc33439cfb6be4ceb062a01ebfa

> Helix submitting two jobs; both at the same time for a single experiment
> ------------------------------------------------------------------------
>
> Key: AIRAVATA-2973
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2973
> Project: Airavata
> Issue Type: Bug
> Components: helix implementation
> Affects Versions: 0.18
> Environment: https://testing.seagrid.org
> Reporter: Eroma
> Assignee: Dimuthu Upeksha
> Priority: Major
> Fix For: 0.18
>
> Launched an experiment, and the experiment has two jobs. Both jobs were
> created at the same time; they have the same CREATION time. When the
> experiment was cancelled, both got tagged as CANCELLED.
> exp ID: SLM002-AmberSander-Comet9_02a8cf12-75ad-4820-991f-d593ce832945
> The double job submission occurs randomly.
[jira] [Closed] (AIRAVATA-2973) Helix submitting two jobs; both at the same time for a single experiment
[ https://issues.apache.org/jira/browse/AIRAVATA-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dimuthu Upeksha closed AIRAVATA-2973.
-------------------------------------
Resolution: Fixed

> Helix submitting two jobs; both at the same time for a single experiment
> ------------------------------------------------------------------------
>
> Key: AIRAVATA-2973
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2973
> Project: Airavata
> Issue Type: Bug
> Components: helix implementation
> Affects Versions: 0.18
> Environment: https://testing.seagrid.org
> Reporter: Eroma
> Assignee: Dimuthu Upeksha
> Priority: Major
> Fix For: 0.18
>
> Launched an experiment, and the experiment has two jobs. Both jobs were
> created at the same time; they have the same CREATION time. When the
> experiment was cancelled, both got tagged as CANCELLED.
> exp ID: SLM002-AmberSander-Comet9_02a8cf12-75ad-4820-991f-d593ce832945
> The double job submission occurs randomly.
[jira] [Closed] (AIRAVATA-2974) Even COMPLETE jobs are tagged as CANCELED when the experiment is CANCELED
[ https://issues.apache.org/jira/browse/AIRAVATA-2974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dimuthu Upeksha closed AIRAVATA-2974.
-------------------------------------
Resolution: Fixed

> Even COMPLETE jobs are tagged as CANCELED when the experiment is CANCELED
> -------------------------------------------------------------------------
>
> Key: AIRAVATA-2974
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2974
> Project: Airavata
> Issue Type: Bug
> Components: helix implementation
> Affects Versions: 0.18
> Environment: https://testing.seagrid.org
> Reporter: Eroma
> Assignee: Dimuthu Upeksha
> Priority: Major
> Fix For: 0.18
>
> Cancelled an experiment where the job had already executed and was COMPLETE.
> When the exp status changed to CANCELED, so did the status of the job.
> Since the job was already COMPLETE and the SUs were used, its status should
> not have changed to CANCELED. It should have remained COMPLETE.
> exp ID: SLM002-AmberSander-Comet23_88570cbf-cdf3-4b73-aba7-0d2bf6a9a2d5
[jira] [Commented] (AIRAVATA-2974) Even COMPLETE jobs are tagged as CANCELED when the experiment is CANCELED
[ https://issues.apache.org/jira/browse/AIRAVATA-2974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782149#comment-16782149 ]

Dimuthu Upeksha commented on AIRAVATA-2974:
-------------------------------------------

Fixed in https://github.com/apache/airavata/commit/039f9a2cdb7f4c7bfad0aa846fe160d478e59644

> Even COMPLETE jobs are tagged as CANCELED when the experiment is CANCELED
> -------------------------------------------------------------------------
>
> Key: AIRAVATA-2974
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2974
> Project: Airavata
> Issue Type: Bug
> Components: helix implementation
> Affects Versions: 0.18
> Environment: https://testing.seagrid.org
> Reporter: Eroma
> Assignee: Dimuthu Upeksha
> Priority: Major
> Fix For: 0.18
>
> Cancelled an experiment where the job had already executed and was COMPLETE.
> When the exp status changed to CANCELED, so did the status of the job.
> Since the job was already COMPLETE and the SUs were used, its status should
> not have changed to CANCELED. It should have remained COMPLETE.
> exp ID: SLM002-AmberSander-Comet23_88570cbf-cdf3-4b73-aba7-0d2bf6a9a2d5
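The expectation in AIRAVATA-2974, that a job already in a final state keeps that state when its experiment is cancelled, could be sketched roughly as follows. This is only an illustration and does not reflect the linked commit: the function name `cancel_experiment_jobs`, the `TERMINAL_STATES` set, and the plain dict of job IDs to state strings are all assumptions made for the example.

```python
from typing import Dict

# Assumed set of final job states for this sketch; a job in one of these
# states has finished and should not be relabelled by a later cancellation.
TERMINAL_STATES = {"COMPLETE", "FAILED", "CANCELED"}


def cancel_experiment_jobs(job_states: Dict[str, str]) -> Dict[str, str]:
    """Return the job states after the experiment is cancelled.

    Jobs that already reached a terminal state keep it; every other job
    is marked CANCELED. The input mapping is left unmodified.
    """
    return {
        job_id: state if state in TERMINAL_STATES else "CANCELED"
        for job_id, state in job_states.items()
    }
```

Under these assumptions, cancelling an experiment with one COMPLETE job and one ACTIVE job leaves the first as COMPLETE and marks only the second as CANCELED, so the record of the SUs that were used stays accurate.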