[jira] [Commented] (AIRAVATA-2743) Experiment in CANCELLED while job is still QUEUED or SUBMITTED and canceling at cluster side
[ https://issues.apache.org/jira/browse/AIRAVATA-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431625#comment-16431625 ]

Dimuthu Upeksha commented on AIRAVATA-2743:
---

Fixed in https://github.com/apache/airavata/commit/f912d39d37e85d0ac9b3a5c4a027714d17e208f2

> Experiment in CANCELLED while job is still QUEUED or SUBMITTED and canceling at cluster side
>
> Key: AIRAVATA-2743
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2743
> Project: Airavata
> Issue Type: Bug
> Components: helix implementation
> Affects Versions: 0.18
> Reporter: Eroma
> Assignee: Dimuthu Upeksha
> Priority: Major
> Fix For: 0.18
>
> # Submit an experiment
> # Cancel the experiment in PGA
> # Experiment status changes to CANCELING
> # Experiment status changes to CANCELLED while job is in either SUBMITTED or QUEUED.
> # Experiment status should change to CANCELLED only after the job status changes to an end status (CANCELLED, COMPLETED or FAILED).

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AIRAVATA-2743) Experiment in CANCELLED while job is still QUEUED or SUBMITTED and canceling at cluster side
Eroma created AIRAVATA-2743:
---

Summary: Experiment in CANCELLED while job is still QUEUED or SUBMITTED and canceling at cluster side
Key: AIRAVATA-2743
URL: https://issues.apache.org/jira/browse/AIRAVATA-2743
Project: Airavata
Issue Type: Bug
Components: helix implementation
Affects Versions: 0.18
Reporter: Eroma
Assignee: Dimuthu Upeksha
Fix For: 0.18

# Submit an experiment
# Cancel the experiment in PGA
# Experiment status changes to CANCELING
# Experiment status changes to CANCELLED while job is in either SUBMITTED or QUEUED.
# Experiment status should change to CANCELLED only after the job status changes to an end status (CANCELLED, COMPLETED or FAILED).
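The expected behaviour in AIRAVATA-2743 can be sketched as a small guard: an experiment in CANCELING moves to CANCELLED only once its job has reached an end status. This is a minimal illustration, not the fix from the Airavata commit. The class `CancelGuard`, its two enums, and the method `nextState` are hypothetical names; only the state names themselves come from the issue.

```java
import java.util.EnumSet;
import java.util.Set;

public class CancelGuard {
    // Job states named in the issue (hypothetical enum, not Airavata's own type).
    public enum JobState { SUBMITTED, QUEUED, CANCELLED, COMPLETED, FAILED }

    // Experiment states named in the issue (hypothetical enum).
    public enum ExperimentState { CANCELING, CANCELLED }

    // End statuses listed in the issue: CANCELLED, COMPLETED or FAILED.
    private static final Set<JobState> END_STATES =
            EnumSet.of(JobState.CANCELLED, JobState.COMPLETED, JobState.FAILED);

    public static boolean isEndState(JobState job) {
        return END_STATES.contains(job);
    }

    // Returns the experiment state that should follow, given the job's state.
    // A CANCELING experiment stays CANCELING while the job is SUBMITTED or QUEUED.
    public static ExperimentState nextState(ExperimentState current, JobState job) {
        if (current == ExperimentState.CANCELING && isEndState(job)) {
            return ExperimentState.CANCELLED;
        }
        return current;
    }

    public static void main(String[] args) {
        for (JobState job : JobState.values()) {
            System.out.println(job + " -> " + nextState(ExperimentState.CANCELING, job));
        }
    }
}
```

With this guard, the sequence reported in the issue (experiment CANCELLED while the job is still QUEUED) cannot occur, because the transition waits for the job status update from the cluster side.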
[jira] [Commented] (AIRAVATA-2710) How to assign owner of "everyone" group in Sharing Registry?
[ https://issues.apache.org/jira/browse/AIRAVATA-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431086#comment-16431086 ]

Marcus Christie commented on AIRAVATA-2710:
---

Thanks [~smarru], perhaps we can meet to discuss this. I'm also concerned about over-engineering this.

> How to assign owner of "everyone" group in Sharing Registry?
>
> Key: AIRAVATA-2710
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2710
> Project: Airavata
> Issue Type: Bug
> Reporter: Marcus Christie
> Assignee: Marcus Christie
> Priority: Major
>
> In AIRAVATA-2662 the "everyone" group is being added to the Sharing Registry. A UserGroup in the Sharing Registry must have an owner. This presents a problem: the "everyone" group cannot be created until there is a user who can be the owner, but createUser should add each user to the "everyone" group.
> For now the implementation of createUser creates the "everyone" group if it doesn't already exist and makes this user the owner of the group. That's less than ideal, since the first user of a domain ends up as the owner of the "everyone" group.
> Here are some possible alternatives:
> * create a dummy admin user for the domain that is made the owner of the everyone group
> * allow groups to not have an owner (make the OWNER_ID column nullable on USER_GROUP)
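To make the ownership problem concrete, here is a minimal in-memory sketch contrasting the current createUser behaviour described above with the first alternative (a dummy admin owner). It does not use the Sharing Registry API. `EveryoneGroupSketch`, `Group`, and both `createUser...` methods are hypothetical names, and whether the dummy admin should also be a member of the group is left open.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class EveryoneGroupSketch {
    public static final String EVERYONE = "everyone";

    // Hypothetical stand-in for a UserGroup row: one owner plus members.
    public static final class Group {
        public final String ownerId;
        public final Set<String> members = new LinkedHashSet<>();

        Group(String ownerId) {
            this.ownerId = ownerId;
        }
    }

    private final Map<String, Group> groups = new HashMap<>();

    // Behaviour described in the issue: the group is created lazily, so
    // whichever user is created first becomes its owner.
    public void createUserFirstUserOwns(String userId) {
        groups.computeIfAbsent(EVERYONE, k -> new Group(userId)).members.add(userId);
    }

    // First alternative from the issue: a dummy admin user owns the group,
    // so ownership no longer depends on user creation order.
    public void createUserWithDomainAdmin(String userId, String dummyAdminId) {
        groups.computeIfAbsent(EVERYONE, k -> new Group(dummyAdminId)).members.add(userId);
    }

    public Group everyone() {
        return groups.get(EVERYONE);
    }

    public static void main(String[] args) {
        EveryoneGroupSketch current = new EveryoneGroupSketch();
        current.createUserFirstUserOwns("alice");
        current.createUserFirstUserOwns("bob");
        System.out.println("owner (current): " + current.everyone().ownerId);

        EveryoneGroupSketch alternative = new EveryoneGroupSketch();
        alternative.createUserWithDomainAdmin("alice", "domain-admin");
        alternative.createUserWithDomainAdmin("bob", "domain-admin");
        System.out.println("owner (dummy admin): " + alternative.everyone().ownerId);
    }
}
```

The second alternative (a nullable OWNER_ID) would be a schema change instead, so it is not modelled here.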
[jira] [Commented] (AIRAVATA-2742) Helix Controller throws an Exception when the participant is killed
[ https://issues.apache.org/jira/browse/AIRAVATA-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430643#comment-16430643 ]

Dimuthu Upeksha commented on AIRAVATA-2742:
---

Tested this locally with both SIGKILL and SIGTERM but couldn't reproduce it. As a safety step, I'm updating the Helix core version from 0.6.7 to 0.8.0. I would still suggest extensively inspecting participant restarts and the consistency of workflow executions in future testing iterations. In particular, observe the Helix Controller log.

https://github.com/apache/airavata/commit/01e0e70605ea9937304458651335166e52c51d60

> Helix Controller throws an Exception when the participant is killed
>
> Key: AIRAVATA-2742
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2742
> Project: Airavata
> Issue Type: Bug
> Components: helix implementation
> Affects Versions: 0.18
> Reporter: Dimuthu Upeksha
> Assignee: Dimuthu Upeksha
> Priority: Major
>
> This was a sporadic issue and occurred only once in the test setup. There were 5 - 10 tasks running in the Participant and the Participant was externally killed by SIGTERM command (kill . Once the Participant was started again, it did not pick up the tasks that it was running at the time it was killed. Surprisingly, the respective workflows were still in IN_PROGRESS status. The Helix Controller log showed the following error for each Workflow. This seems like a bug in Helix and I posted the issue to the Helix mailing list (Subject : Sporadic issue when restarting a Participant).
>
> 2018-04-06 15:10:57,766 [Thread-3] ERROR o.a.h.c.s.BestPossibleStateCalcStage - Error computing assignment for resource Workflow_of_process_PROCESS_7f6c8a54-b50f-4bdb-aafd-59ce87276527-POST-b5e39e07-2d8e-4309-be5a-f5b6067f9a24_TASK_cc8039e5-f054-4dea-8c7f-07c98077b117. Skipping.
> java.lang.NullPointerException: Name is null
>     at java.lang.Enum.valueOf(Enum.java:236)
>     at org.apache.helix.task.TaskPartitionState.valueOf(TaskPartitionState.java:25)
>     at org.apache.helix.task.JobRebalancer.computeResourceMapping(JobRebalancer.java:272)
>     at org.apache.helix.task.JobRebalancer.computeBestPossiblePartitionState(JobRebalancer.java:140)
>     at org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:171)
>     at org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:66)
>     at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:48)
>     at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:295)
>     at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:595)
> 2018-04-06 15:11:00,385 [Thread-3] ERROR o.a.h.c.s.BestPossibleStateCalcStage - Error computing assignment for resource Workflow_of_process_PROCESS_2b69b499-c527-4c9d-8b2b-db17366f5f81-POST-c67607ae-9177-4a02-af8a-8b3751eea4ff_TASK_1ea6876d-f2ec-4139-a15d-0e64a80a3025. Skipping.
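The top two frames of the trace show `Enum.valueOf` being reached through `TaskPartitionState.valueOf` with a null name. The short sketch below reproduces only that Java-level behaviour: an enum's `valueOf` throws `NullPointerException` with the message "Name is null" when given null. It says nothing about why Helix had a null state name after the restart. `PartitionState` is a hypothetical stand-in for `TaskPartitionState`, and `parseOrNull` is one possible defensive pattern, not the Helix fix.

```java
public class NullEnumNameDemo {
    // Hypothetical stand-in for org.apache.helix.task.TaskPartitionState.
    public enum PartitionState { INIT, RUNNING, COMPLETED }

    // Mirrors the failing call: valueOf on a name that may be null.
    public static String describe(String storedName) {
        try {
            return "parsed " + PartitionState.valueOf(storedName);
        } catch (NullPointerException e) {
            return "NullPointerException: " + e.getMessage();
        }
    }

    // Defensive variant: treat a missing name as "no state recorded".
    public static PartitionState parseOrNull(String storedName) {
        return storedName == null ? null : PartitionState.valueOf(storedName);
    }

    public static void main(String[] args) {
        System.out.println(describe("RUNNING"));
        System.out.println(describe(null));
        System.out.println(parseOrNull(null));
    }
}
```

A name that is non-null but unknown would raise `IllegalArgumentException` instead, so the "Name is null" message points specifically at a missing value, not a malformed one.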
[jira] [Created] (AIRAVATA-2742) Helix Controller throws an Exception when the participant is killed
Dimuthu Upeksha created AIRAVATA-2742:
---

Summary: Helix Controller throws an Exception when the participant is killed
Key: AIRAVATA-2742
URL: https://issues.apache.org/jira/browse/AIRAVATA-2742
Project: Airavata
Issue Type: Bug
Components: helix implementation
Affects Versions: 0.18
Reporter: Dimuthu Upeksha

This was a sporadic issue and occurred only once in the test setup. There were 5 - 10 tasks running in the Participant and the Participant was externally killed by SIGTERM command (kill . Once the Participant was started again, it did not pick up the tasks that it was running at the time it was killed. Surprisingly, the respective workflows were still in IN_PROGRESS status. The Helix Controller log showed the following error for each Workflow. This seems like a bug in Helix and I posted the issue to the Helix mailing list (Subject : Sporadic issue when restarting a Participant).

2018-04-06 15:10:57,766 [Thread-3] ERROR o.a.h.c.s.BestPossibleStateCalcStage - Error computing assignment for resource Workflow_of_process_PROCESS_7f6c8a54-b50f-4bdb-aafd-59ce87276527-POST-b5e39e07-2d8e-4309-be5a-f5b6067f9a24_TASK_cc8039e5-f054-4dea-8c7f-07c98077b117. Skipping.
java.lang.NullPointerException: Name is null
    at java.lang.Enum.valueOf(Enum.java:236)
    at org.apache.helix.task.TaskPartitionState.valueOf(TaskPartitionState.java:25)
    at org.apache.helix.task.JobRebalancer.computeResourceMapping(JobRebalancer.java:272)
    at org.apache.helix.task.JobRebalancer.computeBestPossiblePartitionState(JobRebalancer.java:140)
    at org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:171)
    at org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:66)
    at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:48)
    at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:295)
    at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:595)
2018-04-06 15:11:00,385 [Thread-3] ERROR o.a.h.c.s.BestPossibleStateCalcStage - Error computing assignment for resource Workflow_of_process_PROCESS_2b69b499-c527-4c9d-8b2b-db17366f5f81-POST-c67607ae-9177-4a02-af8a-8b3751eea4ff_TASK_1ea6876d-f2ec-4139-a15d-0e64a80a3025. Skipping.