[jira] [Commented] (AIRAVATA-2943) Re-queueing and node failures in HPC clusters need to be handled in gateway middleware as resubmitting failures

2019-03-01 Thread Dimuthu Upeksha (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRAVATA-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782158#comment-16782158
 ] 

Dimuthu Upeksha commented on AIRAVATA-2943:
---

Fixed in 
https://github.com/apache/airavata/commit/8b10120be4ce1d0720f214dc5e849d1dc862c595

> Re-queueing and node failures in HPC clusters need to be handled in gateway 
> middleware as resubmitting failures 
> 
>
> Key: AIRAVATA-2943
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2943
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://staging.ultrascan.scigap.org slurm job ID 8560 
> in Jetstream
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> Currently, in clusters (PBS and SLURM), jobs are re-queued due to node
> failures. In such scenarios the jobs are executed after re-queueing, but on
> the gateway side the job is marked as FAILED at the initial NODE_FAIL.
> These types of failures need to be captured as retrying failures instead of
> being treated as a final result.
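A minimal sketch of the requested behaviour follows. It assumes the gateway polls SLURM job states (as shown by `squeue`/`sacct`) and knows whether the job was submitted with requeue enabled. `Outcome`, `classify` and the `requeue_enabled` flag are hypothetical names used only for illustration and are not Airavata's API. The state sets are an assumed mapping, and PBS would need its own table.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical gateway-side view of a batch job's state."""
    ACTIVE = "active"
    RETRYING = "retrying"
    COMPLETE = "complete"
    FAILED = "failed"
    CANCELED = "canceled"


# Assumed mapping of SLURM base job states to gateway outcomes.
ACTIVE_STATES = {"PENDING", "CONFIGURING", "RUNNING", "COMPLETING", "SUSPENDED"}
# States in which the scheduler has already put the job back in the queue.
REQUEUE_STATES = {"REQUEUED", "REQUEUE_HOLD", "REQUEUE_FED"}
TERMINAL_STATES = {
    "COMPLETED": Outcome.COMPLETE,
    "FAILED": Outcome.FAILED,
    "TIMEOUT": Outcome.FAILED,
    "OUT_OF_MEMORY": Outcome.FAILED,
    "CANCELLED": Outcome.CANCELED,
}


def classify(slurm_state: str, requeue_enabled: bool) -> Outcome:
    """Map one polled SLURM state to an outcome without ending the job early on NODE_FAIL."""
    parts = slurm_state.split()
    if not parts:
        raise ValueError("empty SLURM state")
    # sacct may print e.g. "CANCELLED by 1001" or mark a truncated name with "+".
    base = parts[0].rstrip("+").upper()
    if base in ACTIVE_STATES:
        return Outcome.ACTIVE
    if base in REQUEUE_STATES:
        return Outcome.RETRYING
    if base == "NODE_FAIL":
        # A node failure is only final if the job cannot be requeued.
        return Outcome.RETRYING if requeue_enabled else Outcome.FAILED
    if base in TERMINAL_STATES:
        return TERMINAL_STATES[base]
    raise ValueError(f"unrecognised SLURM state: {slurm_state!r}")
```

A monitor built this way would keep watching a job that is RETRYING. It would report FAILED to the experiment only when a genuinely terminal state arrives.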



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRAVATA-2943) Re-queueing and node failures in HPC clusters need to be handled in gateway middleware as resubmitting failures

2019-03-01 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha closed AIRAVATA-2943.
-
Resolution: Fixed



[jira] [Closed] (AIRAVATA-2963) Cannot login to testing gateway portal and also getting an error in create experiment.

2019-03-01 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha closed AIRAVATA-2963.
-
Resolution: Fixed

> Cannot login to testing gateway portal and also getting an error in create 
> experiment.
> --
>
> Key: AIRAVATA-2963
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2963
> Project: Airavata
>  Issue Type: Bug
>  Components: PGA PHP Web Gateway
>Affects Versions: 0.18
> Environment: https://testing.seagrid.org
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> # When the username and password are entered, exception [1] is thrown.
> # When the exception page is refreshed, the user lands on the home page;
> clicking 'Create' under Experiment then throws the second exception [2].
> [1]UserProfileServiceException
> Error while creating user profile. More info : Failed to update user profile 
> in IAM service
>  
> [2]ErrorException
> Invalid argument supplied for foreach() (View: 
> /var/www/portals/seagrid/app/views/experiment/create.blade.php)





[jira] [Commented] (AIRAVATA-2973) Helix submitting two jobs; both at the same time for a single experiment

2019-03-01 Thread Dimuthu Upeksha (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRAVATA-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782154#comment-16782154
 ] 

Dimuthu Upeksha commented on AIRAVATA-2973:
---

Fixed in 
https://github.com/apache/airavata/commit/0f0a52afadcb9bc33439cfb6be4ceb062a01ebfa

> Helix submitting two jobs; both at the same time for a single experiment
> 
>
> Key: AIRAVATA-2973
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2973
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://testing.seagrid.org 
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> Launched an experiment, and the experiment has two jobs. Both jobs were created
> at the same time; they have the same CREATION time. When the experiment was
> cancelled, both were tagged as CANCELLED.
> exp ID: SLM002-AmberSander-Comet9_02a8cf12-75ad-4820-991f-d593ce832945
> The double job submission occurs randomly.
>  
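One common way to rule out this kind of double submission is to make the submit step idempotent per experiment and process. The sketch below assumes an in-memory registry protected by a lock. `SubmissionGuard`, `submit_once` and the key format are hypothetical. A distributed Helix deployment would instead need the record kept in shared, transactional storage. This is not necessarily how the commit referenced above fixes the issue.

```python
import threading
from typing import Callable, Dict


class SubmissionGuard:
    """Hypothetical guard allowing at most one batch job per submission key."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._job_ids: Dict[str, str] = {}

    def submit_once(self, key: str, submit: Callable[[], str]) -> str:
        """Run `submit` only for the first caller with this key; later callers get the same job id."""
        # Holding one lock across the submit call is coarse,
        # but it keeps the check and the record atomic.
        with self._lock:
            existing = self._job_ids.get(key)
            if existing is not None:
                return existing
            # If submit() raises, nothing is recorded, so a later caller may retry.
            job_id = submit()
            self._job_ids[key] = job_id
            return job_id
```

Concurrent callers for the same key then see a single job id, and the scheduler receives only one submission.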





[jira] [Closed] (AIRAVATA-2973) Helix submitting two jobs; both at the same time for a single experiment

2019-03-01 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha closed AIRAVATA-2973.
-
Resolution: Fixed



[jira] [Closed] (AIRAVATA-2974) Even COMPLETE jobs are tagged as CANCELED when the experiment is CANCELED

2019-03-01 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha closed AIRAVATA-2974.
-
Resolution: Fixed

> Even COMPLETE jobs are tagged as CANCELED when the experiment is CANCELED 
> --
>
> Key: AIRAVATA-2974
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2974
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://testing.seagrid.org
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> Cancelled an experiment whose job had already executed and was COMPLETE. When the
> experiment status changed to CANCELED, so did the status of the job.
> Since the job was already COMPLETE and the SUs had been used, its status should
> not have changed to CANCELED. It should have remained COMPLETE.
> exp ID: SLM002-AmberSander-Comet23_88570cbf-cdf3-4b73-aba7-0d2bf6a9a2d5
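The expected behaviour amounts to a guard. Cancelling an experiment should only cancel jobs that have not yet reached a terminal state. Below is a minimal sketch that assumes jobs carry a plain string status. `Job`, `cancel_experiment_jobs` and `TERMINAL_JOB_STATES` are hypothetical names. The state names are taken from this report, not from Airavata's status model.

```python
from dataclasses import dataclass
from typing import List

# Assumed set of job states that a cancel must never overwrite.
TERMINAL_JOB_STATES = frozenset({"COMPLETE", "FAILED", "CANCELED"})


@dataclass
class Job:
    """Hypothetical minimal job record."""
    job_id: str
    status: str


def cancel_experiment_jobs(jobs: List[Job]) -> List[str]:
    """Mark only still-active jobs as CANCELED and return the ids that changed."""
    changed: List[str] = []
    for job in jobs:
        if job.status in TERMINAL_JOB_STATES:
            # A finished job keeps its real outcome (and its SU usage) on record.
            continue
        job.status = "CANCELED"
        changed.append(job.job_id)
    return changed
```

With this guard the experiment itself can still move to CANCELED. Only the job records that already finished are protected.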





[jira] [Commented] (AIRAVATA-2974) Even COMPLETE jobs are tagged as CANCELED when the experiment is CANCELED

2019-03-01 Thread Dimuthu Upeksha (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRAVATA-2974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782149#comment-16782149
 ] 

Dimuthu Upeksha commented on AIRAVATA-2974:
---

Fixed in 
https://github.com/apache/airavata/commit/039f9a2cdb7f4c7bfad0aa846fe160d478e59644
