[jira] [Commented] (AIRAVATA-2943) Re-queueing and node failures in HPC clusters need to be handled in gateway middleware as resubmitting failures

Dimuthu Upeksha (JIRA) Fri, 01 Mar 2019 14:18:06 -0800


    [ 
https://issues.apache.org/jira/browse/AIRAVATA-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782158#comment-16782158
 ]


Dimuthu Upeksha commented on AIRAVATA-2943:
-------------------------------------------

Fixed in 
https://github.com/apache/airavata/commit/8b10120be4ce1d0720f214dc5e849d1dc862c595

> Re-queueing and node failures in HPC clusters need to be handled in gateway 
> middleware as resubmitting failures 
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: AIRAVATA-2943
>                 URL: https://issues.apache.org/jira/browse/AIRAVATA-2943
>             Project: Airavata
>          Issue Type: Bug
>          Components: helix implementation
>    Affects Versions: 0.18
>         Environment: https://staging.ultrascan.scigap.org slurm job ID 8560 
> in Jetstream
>            Reporter: Eroma
>            Assignee: Dimuthu Upeksha
>            Priority: Major
>             Fix For: 0.18
>
>
> Currently, on PBS and SLURM clusters, jobs are sometimes re-queued after 
> node failures. In such cases the job is executed after re-queueing, but the 
> gateway marks it as FAILED at the initial NODE_FAIL. 
> These failures need to be captured as retryable failures instead of being 
> treated as the final result.
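
The behaviour requested above can be illustrated with a minimal sketch. This is
not the code from the linked commit; it only assumes that the gateway receives
scheduler state names such as NODE_FAIL or REQUEUED (as SLURM reports them) and
maps them to its own job states. The names GatewayJobState and
map_scheduler_state are hypothetical and chosen for illustration only.

```python
from enum import Enum


class GatewayJobState(Enum):
    """Hypothetical gateway-side job states, for illustration only."""
    QUEUED = "QUEUED"
    ACTIVE = "ACTIVE"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"


# Scheduler states after which the job may still be re-queued and run again.
# Assumption: the scheduler is configured to re-queue jobs on node failure.
RETRYABLE_STATES = {"NODE_FAIL", "REQUEUED"}

# Scheduler states treated here as a final failure.
TERMINAL_FAILURE_STATES = {"FAILED", "CANCELLED", "TIMEOUT"}


def map_scheduler_state(scheduler_state: str) -> GatewayJobState:
    """Map a scheduler state name to a gateway job state.

    Retryable states are reported as QUEUED rather than FAILED, so the
    gateway keeps monitoring the job instead of ending it early.
    """
    state = scheduler_state.strip().upper()
    if state in RETRYABLE_STATES or state == "PENDING":
        return GatewayJobState.QUEUED
    if state == "RUNNING":
        return GatewayJobState.ACTIVE
    if state == "COMPLETED":
        return GatewayJobState.COMPLETE
    if state in TERMINAL_FAILURE_STATES:
        return GatewayJobState.FAILED
    raise ValueError(f"Unrecognized scheduler state: {scheduler_state!r}")
```

With this mapping, a job that reports NODE_FAIL and is later re-queued stays in
a non-terminal state on the gateway side, and only reaches FAILED if the
scheduler reports a final failure.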



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
