[ https://issues.apache.org/jira/browse/AIRAVATA-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dimuthu Upeksha closed AIRAVATA-2943.
-------------------------------------
    Resolution: Fixed

> Re-queueing and node failures in HPC clusters need to be handled in gateway middleware as resubmitting failures
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: AIRAVATA-2943
>                 URL: https://issues.apache.org/jira/browse/AIRAVATA-2943
>             Project: Airavata
>          Issue Type: Bug
>          Components: helix implementation
>    Affects Versions: 0.18
>         Environment: https://staging.ultrascan.scigap.org slurm job ID 8560 in Jetstream
>            Reporter: Eroma
>            Assignee: Dimuthu Upeksha
>            Priority: Major
>             Fix For: 0.18
>
>
> Currently, on clusters (PBS and SLURM), jobs are sometimes re-queued due to
> node failures. In such scenarios the jobs are executed after re-queueing,
> but on the gateway side the job is treated as FAILED at the initial
> NODE_FAIL.
> These types of failures need to be captured as retryable failures instead of
> being treated as the final result.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
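
The behaviour requested in the issue description can be illustrated with a minimal sketch. It is not the Airavata fix; it only shows the idea of treating NODE_FAIL as a non-terminal, retryable state. The scheduler state names are SLURM job states; `GatewayStatus`, `map_scheduler_state`, and `final_status` are hypothetical names introduced for this example.

```python
from enum import Enum


class GatewayStatus(Enum):
    """Hypothetical gateway-side job statuses (not Airavata's actual enum)."""
    QUEUED = "QUEUED"
    ACTIVE = "ACTIVE"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"
    CANCELED = "CANCELED"


# SLURM states after which the scheduler may run the job again.
# Assumption: the cluster is configured to requeue jobs on node failure.
RETRYABLE_STATES = {"NODE_FAIL", "REQUEUED"}

SCHEDULER_TO_GATEWAY = {
    "PENDING": GatewayStatus.QUEUED,
    "RUNNING": GatewayStatus.ACTIVE,
    "COMPLETED": GatewayStatus.COMPLETE,
    "FAILED": GatewayStatus.FAILED,
    "TIMEOUT": GatewayStatus.FAILED,
    "CANCELLED": GatewayStatus.CANCELED,
}

TERMINAL = {GatewayStatus.COMPLETE, GatewayStatus.FAILED, GatewayStatus.CANCELED}


def map_scheduler_state(state: str) -> GatewayStatus:
    """Map a scheduler state to a gateway status.

    Retryable states are reported as QUEUED rather than FAILED, so the
    gateway keeps monitoring the job instead of ending it.
    """
    if state in RETRYABLE_STATES:
        return GatewayStatus.QUEUED
    return SCHEDULER_TO_GATEWAY[state]  # raises KeyError for unknown states


def final_status(states):
    """Return the gateway status after observing a sequence of scheduler states.

    Stops at the first terminal gateway status; returns None for no input.
    """
    status = None
    for state in states:
        status = map_scheduler_state(state)
        if status in TERMINAL:
            return status
    return status


observed = ["PENDING", "RUNNING", "NODE_FAIL", "PENDING", "RUNNING", "COMPLETED"]
print(final_status(observed).value)
```

With this mapping, the job in the sequence above ends as COMPLETE rather than FAILED at the first NODE_FAIL. A real implementation would likely also need a timeout or retry limit for jobs that hit NODE_FAIL and are never requeued.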