Eroma created AIRAVATA-3872:
-------------------------------

             Summary: Computing resource node failure and job re-queue handling
                 Key: AIRAVATA-3872
                 URL: https://issues.apache.org/jira/browse/AIRAVATA-3872
             Project: Airavata
          Issue Type: Improvement
          Components: helix implementation
         Environment: https://django.ultrascan.scigap.org/
            Reporter: Eroma
            Assignee: Dimuthu

This issue has been experienced from time to time, this time in the production Ultrascan gateway (https://django.ultrascan.scigap.org/). This gateway is connected to the production stack and the Django portal for admin operations.

When a node failure happens after a job has been submitted and queued, the failure is reported through an email notification and the job goes to the UNKNOWN state in the gateway. In the remote cluster, the job gets re-queued and completed, and email notifications are sent. Helix identifies UNKNOWN as a final job state and does not process emails sent afterwards.

Currently, when this happens, an operational task takes care of updating the job status and processing the email notifications that were sent.
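
To illustrate the behaviour being requested, here is a minimal sketch, not taken from the Airavata code base. It assumes job status emails are applied one at a time to a stored job state; the state names, `TERMINAL_STATES`, `apply_status_update`, and `replay` are all hypothetical. The only point it makes is that leaving UNKNOWN out of the set of final states lets the re-queue and completion emails still be processed.

```python
from enum import Enum


class JobState(Enum):
    # Hypothetical state names for illustration only.
    SUBMITTED = "SUBMITTED"
    QUEUED = "QUEUED"
    ACTIVE = "ACTIVE"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"
    CANCELED = "CANCELED"
    UNKNOWN = "UNKNOWN"


# UNKNOWN is deliberately not listed here: after a node failure the
# scheduler may re-queue the job, so later emails must still be applied.
TERMINAL_STATES = frozenset(
    {JobState.COMPLETE, JobState.FAILED, JobState.CANCELED}
)


def apply_status_update(current: JobState, reported: JobState) -> JobState:
    """Return the job state after processing one status notification."""
    if current in TERMINAL_STATES:
        # A genuinely final state ignores any later notification.
        return current
    return reported


def replay(updates):
    """Apply a sequence of reported states to a freshly submitted job."""
    state = JobState.SUBMITTED
    for reported in updates:
        state = apply_status_update(state, reported)
    return state
```

With this rule, the sequence described above (queued, node failure reported as UNKNOWN, re-queued, active, complete) ends in COMPLETE without an operational task, while a job that has already reached COMPLETE is not reopened by a late email.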