[jira] [Closed] (AIRAVATA-2943) Re-queueing and node failures in HPC clusters need to be handled in gateway middleware as resubmitting failures

Dimuthu Upeksha (JIRA) Fri, 01 Mar 2019 14:18:05 -0800


     [ https://issues.apache.org/jira/browse/AIRAVATA-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Dimuthu Upeksha closed AIRAVATA-2943.
-------------------------------------
    Resolution: Fixed

> Re-queueing and node failures in HPC clusters need to be handled in gateway 
> middleware as resubmitting failures 
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: AIRAVATA-2943
>                 URL: https://issues.apache.org/jira/browse/AIRAVATA-2943
>             Project: Airavata
>          Issue Type: Bug
>          Components: helix implementation
>    Affects Versions: 0.18
>         Environment: https://staging.ultrascan.scigap.org slurm job ID 8560 
> in Jetstream
>            Reporter: Eroma
>            Assignee: Dimuthu Upeksha
>            Priority: Major
>             Fix For: 0.18
>
>
> Currently, on PBS and SLURM clusters, jobs are sometimes re-queued after 
> node failures. In such scenarios the job is executed after re-queueing, 
> but on the gateway side it is recorded as a FAILED job at the initial 
> NODE_FAIL. 
> These types of failures need to be captured as retryable failures instead 
> of being treated as the end result.
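
The behaviour requested in the description can be sketched roughly as follows. This is only an illustration, not the fix that was committed for this issue: the gateway-side state names (QUEUED, ACTIVE, COMPLETE, FAILED) and both function names are hypothetical, and only the scheduler state strings (PENDING, RUNNING, COMPLETED, FAILED, NODE_FAIL, REQUEUED) are taken from SLURM. The idea is that NODE_FAIL and REQUEUED are mapped back to a queued state rather than to a terminal failure, so a later COMPLETED report still determines the final result.

```python
from enum import Enum


class GatewayJobState(Enum):
    """Hypothetical gateway-side job states, for illustration only."""
    QUEUED = "QUEUED"
    ACTIVE = "ACTIVE"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"


# Scheduler states that mean "the job will run again", not "the job is done".
RETRYABLE_STATES = {"NODE_FAIL", "REQUEUED"}

# Assumed mapping from scheduler states to the hypothetical gateway states.
SCHEDULER_TO_GATEWAY = {
    "PENDING": GatewayJobState.QUEUED,
    "RUNNING": GatewayJobState.ACTIVE,
    "COMPLETED": GatewayJobState.COMPLETE,
    "FAILED": GatewayJobState.FAILED,
}


def map_scheduler_state(scheduler_state):
    """Translate one scheduler state string into a gateway state."""
    state = scheduler_state.strip().upper()
    if state in RETRYABLE_STATES:
        # Treat node failure / requeue as "waiting again", not as a final result.
        return GatewayJobState.QUEUED
    if state not in SCHEDULER_TO_GATEWAY:
        raise ValueError("unknown scheduler state: " + scheduler_state)
    return SCHEDULER_TO_GATEWAY[state]


def final_gateway_state(scheduler_states):
    """Return the gateway state after replaying a sequence of scheduler states."""
    if not scheduler_states:
        raise ValueError("no scheduler states reported")
    # The last reported state decides the outcome; earlier retryable
    # failures do not stick.
    return map_scheduler_state(scheduler_states[-1])
```

With this mapping, a job that reports PENDING, RUNNING, NODE_FAIL, PENDING, RUNNING, COMPLETED ends up as COMPLETE on the gateway side instead of being frozen at FAILED when NODE_FAIL first appears.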



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
