[ https://issues.apache.org/jira/browse/AIRAVATA-2826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16623942#comment-16623942 ]
Dimuthu Upeksha commented on AIRAVATA-2826:
-------------------------------------------

Added job submission retrying logic

> Helix participant server was stopped and started while experiments are launched and job submissions to Jetstream cluster failed
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AIRAVATA-2826
>                 URL: https://issues.apache.org/jira/browse/AIRAVATA-2826
>             Project: Airavata
>          Issue Type: Bug
>          Components: helix implementation
>    Affects Versions: 0.18
>         Environment: https://staging.seagrid.org/
>            Reporter: Eroma
>            Assignee: Dimuthu Upeksha
>            Priority: Major
>             Fix For: 0.18
>
>
> # Experiments started launching while helix participant stopped and started.
> # When the helix participant was started particularly jobs to Jetstream failed.
> # Job submission failed due to environment set up failed in jetstream with error [1]
>
> [1]
> org.apache.airavata.helix.impl.task.TaskOnFailException: Error Code : 658d46e9-b08b-46c0-9701-4bf5eeb23134, Task TASK_f4e3eccf-3e03-4d34-9cf0-7028efd09a40 failed due to Failed to setup environment of task TASK_f4e3eccf-3e03-4d34-9cf0-7028efd09a40, net.schmizz.sshj.connection.ConnectionException: [CONNECTION_LOST] Did not receive any keep-alive response for 25 seconds
>     at org.apache.airavata.helix.impl.task.AiravataTask.onFail(AiravataTask.java:102)
>     at org.apache.airavata.helix.impl.task.env.EnvSetupTask.onRun(EnvSetupTask.java:55)
>     at org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:311)
>     at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:90)
>     at org.apache.helix.task.TaskRunner.run(TaskRunner.java:71)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.airavata.agents.api.AgentException: net.schmizz.sshj.connection.ConnectionException: [CONNECTION_LOST] Did not receive any keep-alive response for 25 seconds
>     at org.apache.airavata.helix.adaptor.SSHJAgentAdaptor.createDirectory(SSHJAgentAdaptor.java:146)
>     at org.apache.airavata.helix.impl.task.env.EnvSetupTask.onRun(EnvSetupTask.java:51)
>     ... 10 more
> Caused by: net.schmizz.sshj.connection.ConnectionException: [CONNECTION_LOST] Did not receive any keep-alive response for 25 seconds
>     at net.schmizz.keepalive.KeepAliveRunner.checkMaxReached(KeepAliveRunner.java:64)
>     at net.schmizz.keepalive.KeepAliveRunner.doKeepAlive(KeepAliveRunner.java:56)
>     at net.schmizz.keepalive.KeepAlive.run(KeepAlive.java:63)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)