[jira] [Commented] (AIRAVATA-2743) Experiment in CANCELLED while job is still QUEUED or SUBMITTED and canceling at cluster side

2018-04-09 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431625#comment-16431625
 ] 

Dimuthu Upeksha commented on AIRAVATA-2743:
---

Fixed in 
https://github.com/apache/airavata/commit/f912d39d37e85d0ac9b3a5c4a027714d17e208f2

> Experiment in CANCELLED while job is still QUEUED or SUBMITTED and canceling 
> at cluster side
> 
>
> Key: AIRAVATA-2743
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2743
> Project: Airavata
> Issue Type: Bug
> Components: helix implementation
> Affects Versions: 0.18
> Reporter: Eroma
> Assignee: Dimuthu Upeksha
> Priority: Major
> Fix For: 0.18
>
>
> # Submit an experiment.
> # Cancel the experiment in PGA.
> # Experiment status changes to CANCELING.
> # Actual: experiment status changes to CANCELLED while the job is still in 
> either SUBMITTED or QUEUED.
> # Expected: experiment status should change to CANCELLED only after the job 
> status changes to an end status (CANCELLED, COMPLETED or FAILED).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (AIRAVATA-2743) Experiment in CANCELLED while job is still QUEUED or SUBMITTED and canceling at cluster side

2018-04-09 Thread Eroma (JIRA)
Eroma created AIRAVATA-2743:
---

Summary: Experiment in CANCELLED while job is still QUEUED or SUBMITTED and canceling at cluster side
Key: AIRAVATA-2743
URL: https://issues.apache.org/jira/browse/AIRAVATA-2743
Project: Airavata
Issue Type: Bug
Components: helix implementation
Affects Versions: 0.18
Reporter: Eroma
Assignee: Dimuthu Upeksha
Fix For: 0.18


# Submit an experiment.
# Cancel the experiment in PGA.
# Experiment status changes to CANCELING.
# Actual: experiment status changes to CANCELLED while the job is still in 
either SUBMITTED or QUEUED.
# Expected: experiment status should change to CANCELLED only after the job 
status changes to an end status (CANCELLED, COMPLETED or FAILED).
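
The expected behaviour in the steps above can be sketched as a small guard. 
This is only an illustration, not the change made in Airavata: the class, 
enum and method names are hypothetical, and the enums list only the statuses 
named in this issue.

```java
import java.util.EnumSet;
import java.util.Set;

/** Sketch: gate the experiment's CANCELLED transition on the job reaching an end status. */
final class CancellationGate {

    /** Job statuses named in the issue; a real status enum may contain more values. */
    enum JobStatus { SUBMITTED, QUEUED, CANCELLED, COMPLETED, FAILED }

    /** Experiment statuses named in the issue. */
    enum ExperimentStatus { CANCELING, CANCELLED }

    /** End statuses as listed in the issue description. */
    private static final Set<JobStatus> END_STATUSES =
            EnumSet.of(JobStatus.CANCELLED, JobStatus.COMPLETED, JobStatus.FAILED);

    static boolean isEndStatus(JobStatus status) {
        return END_STATUSES.contains(status);
    }

    /** Keeps the experiment in CANCELING until the job has reached an end status. */
    static ExperimentStatus statusAfterCancelRequest(JobStatus jobStatus) {
        return isEndStatus(jobStatus) ? ExperimentStatus.CANCELLED : ExperimentStatus.CANCELING;
    }

    public static void main(String[] args) {
        // Print the experiment status that each job status would allow.
        for (JobStatus jobStatus : JobStatus.values()) {
            System.out.println(jobStatus + " -> " + statusAfterCancelRequest(jobStatus));
        }
    }
}
```

With this guard, a job that is still SUBMITTED or QUEUED leaves the experiment 
in CANCELING, which is the behaviour the issue asks for.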





[jira] [Commented] (AIRAVATA-2710) How to assign owner of "everyone" group in Sharing Registry?

2018-04-09 Thread Marcus Christie (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431086#comment-16431086
 ] 

Marcus Christie commented on AIRAVATA-2710:
---

Thanks [~smarru]. Perhaps we can meet to discuss this; I'm also concerned about 
over-engineering it.

> How to assign owner of "everyone" group in Sharing Registry?
> 
>
> Key: AIRAVATA-2710
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2710
> Project: Airavata
> Issue Type: Bug
> Reporter: Marcus Christie
> Assignee: Marcus Christie
> Priority: Major
>
> In AIRAVATA-2662 the "everyone" group is being added to the Sharing Registry. 
> A UserGroup in the Sharing Registry must have an owner. This presents a 
> problem: the "everyone" group cannot be created until there is a user who can 
> be the owner, but createUser should add each user to the "everyone" group.
> For now, the implementation of createUser creates the "everyone" group if it 
> doesn't already exist and makes this user the owner of the group. That's 
> less than ideal, since the first user of a domain ends up as the owner of the 
> "everyone" group.
> Here are some possible alternatives:
> * create a dummy admin user for the domain that is made the owner of the 
> "everyone" group
> * allow groups to not have an owner (make the OWNER_ID column nullable on 
> USER_GROUP)
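
The ownership problem and the first alternative can be illustrated with a 
small in-memory sketch. It does not use the Sharing Registry API: the class, 
the Group record and createDomain are hypothetical names introduced only for 
this example.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** In-memory sketch of the "everyone" group ownership problem; not the Sharing Registry API. */
final class EveryoneGroupSketch {

    static final String EVERYONE = "everyone";

    /** Hypothetical group record: one owner id plus the member ids. */
    static final class Group {
        final String ownerId;
        final Set<String> members = new LinkedHashSet<>();

        Group(String ownerId) {
            this.ownerId = ownerId;
        }
    }

    private final Map<String, Group> groups = new HashMap<>();

    /** Behaviour described in the issue: whoever is created first becomes the owner. */
    void createUser(String userId) {
        Group everyone = groups.computeIfAbsent(EVERYONE, name -> new Group(userId));
        everyone.members.add(userId);
    }

    /** First alternative: a dummy admin user owns the group before any real user exists. */
    void createDomain(String dummyAdminId) {
        groups.putIfAbsent(EVERYONE, new Group(dummyAdminId));
    }

    /** Returns the owner of the "everyone" group, or null if the group does not exist yet. */
    String ownerOfEveryone() {
        Group everyone = groups.get(EVERYONE);
        return everyone == null ? null : everyone.ownerId;
    }

    /** Returns the members of the "everyone" group, or an empty set if it does not exist yet. */
    Set<String> membersOfEveryone() {
        Group everyone = groups.get(EVERYONE);
        return everyone == null ? new LinkedHashSet<>() : everyone.members;
    }
}
```

Calling createUser alone makes the first user the owner; calling the 
hypothetical createDomain first keeps ownership with the dummy admin while 
every user is still added as a member.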





[jira] [Commented] (AIRAVATA-2742) Helix Controller throws an Exception when the participant is killed

2018-04-09 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430643#comment-16430643
 ] 

Dimuthu Upeksha commented on AIRAVATA-2742:
---

Tested this locally with both SIGKILL and SIGTERM but couldn't reproduce it. 
As a precaution, I'm updating the Helix core version from 0.6.7 to 0.8.0. I 
would still suggest extensively inspecting participant restarts and the 
consistency of workflow executions in future testing iterations. In 
particular, observe the Helix Controller log.

https://github.com/apache/airavata/commit/01e0e70605ea9937304458651335166e52c51d60

> Helix Controller throws an Exception when the participant is killed
> ---
>
> Key: AIRAVATA-2742
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2742
> Project: Airavata
> Issue Type: Bug
> Components: helix implementation
> Affects Versions: 0.18
> Reporter: Dimuthu Upeksha
> Assignee: Dimuthu Upeksha
> Priority: Major
>
> This was a sporadic issue and occurred only once in the test setup. There 
> were 5 - 10 tasks running in the Participant, and the Participant was 
> externally killed by a SIGTERM command (kill . Once the Participant was 
> started again, it did not pick up the tasks that it was running at the time 
> it was killed. Surprisingly, the respective workflows were still in 
> IN_PROGRESS status. The Helix Controller log showed the following error for 
> each Workflow. This seems like a bug in Helix, and I posted the issue on the 
> Helix mailing list (Subject: Sporadic issue when restarting a Participant). 
>  
> 2018-04-06 15:10:57,766 [Thread-3] ERROR o.a.h.c.s.BestPossibleStateCalcStage - Error computing assignment for resource Workflow_of_process_PROCESS_7f6c8a54-b50f-4bdb-aafd-59ce87276527-POST-b5e39e07-2d8e-4309-be5a-f5b6067f9a24_TASK_cc8039e5-f054-4dea-8c7f-07c98077b117. Skipping.
> java.lang.NullPointerException: Name is null
>         at java.lang.Enum.valueOf(Enum.java:236)
>         at org.apache.helix.task.TaskPartitionState.valueOf(TaskPartitionState.java:25)
>         at org.apache.helix.task.JobRebalancer.computeResourceMapping(JobRebalancer.java:272)
>         at org.apache.helix.task.JobRebalancer.computeBestPossiblePartitionState(JobRebalancer.java:140)
>         at org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:171)
>         at org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:66)
>         at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:48)
>         at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:295)
>         at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:595)
> 2018-04-06 15:11:00,385 [Thread-3] ERROR o.a.h.c.s.BestPossibleStateCalcStage - Error computing assignment for resource Workflow_of_process_PROCESS_2b69b499-c527-4c9d-8b2b-db17366f5f81-POST-c67607ae-9177-4a02-af8a-8b3751eea4ff_TASK_1ea6876d-f2ec-4139-a15d-0e64a80a3025. Skipping.
>  





[jira] [Created] (AIRAVATA-2742) Helix Controller throws an Exception when the participant is killed

2018-04-09 Thread Dimuthu Upeksha (JIRA)
Dimuthu Upeksha created AIRAVATA-2742:
-

Summary: Helix Controller throws an Exception when the participant is killed
Key: AIRAVATA-2742
URL: https://issues.apache.org/jira/browse/AIRAVATA-2742
Project: Airavata
Issue Type: Bug
Components: helix implementation
Affects Versions: 0.18
Reporter: Dimuthu Upeksha


This was a sporadic issue and occurred only once in the test setup. There were 
5 - 10 tasks running in the Participant, and the Participant was externally 
killed by a SIGTERM command (kill . Once the Participant was started again, 
it did not pick up the tasks that it was running at the time it was killed. 
Surprisingly, the respective workflows were still in IN_PROGRESS status. The 
Helix Controller log showed the following error for each Workflow. This seems 
like a bug in Helix, and I posted the issue on the Helix mailing list 
(Subject: Sporadic issue when restarting a Participant). 

 
2018-04-06 15:10:57,766 [Thread-3] ERROR o.a.h.c.s.BestPossibleStateCalcStage - Error computing assignment for resource Workflow_of_process_PROCESS_7f6c8a54-b50f-4bdb-aafd-59ce87276527-POST-b5e39e07-2d8e-4309-be5a-f5b6067f9a24_TASK_cc8039e5-f054-4dea-8c7f-07c98077b117. Skipping.
java.lang.NullPointerException: Name is null
        at java.lang.Enum.valueOf(Enum.java:236)
        at org.apache.helix.task.TaskPartitionState.valueOf(TaskPartitionState.java:25)
        at org.apache.helix.task.JobRebalancer.computeResourceMapping(JobRebalancer.java:272)
        at org.apache.helix.task.JobRebalancer.computeBestPossiblePartitionState(JobRebalancer.java:140)
        at org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:171)
        at org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:66)
        at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:48)
        at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:295)
        at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:595)
2018-04-06 15:11:00,385 [Thread-3] ERROR o.a.h.c.s.BestPossibleStateCalcStage - Error computing assignment for resource Workflow_of_process_PROCESS_2b69b499-c527-4c9d-8b2b-db17366f5f81-POST-c67607ae-9177-4a02-af8a-8b3751eea4ff_TASK_1ea6876d-f2ec-4139-a15d-0e64a80a3025. Skipping.
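
The top frames of the trace show an enum valueOf call receiving a null name. 
The sketch below reproduces that exception message with a stand-in enum. The 
enum constants, class name and the defensive parse helper are assumptions made 
for illustration; they are not Helix code and say nothing about why Helix read 
a null state here.

```java
import java.util.Optional;

/** Sketch: how a null name passed to an enum's valueOf yields "Name is null". */
final class NullStateNameDemo {

    /** Stand-in enum; the constants are illustrative, not Helix's TaskPartitionState. */
    enum PartitionState { INIT, RUNNING, COMPLETED }

    /** Returns a description of the exception valueOf raises for this name, or null on success. */
    static String failureMessage(String name) {
        try {
            PartitionState.valueOf(name);
            return null;
        } catch (NullPointerException | IllegalArgumentException e) {
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    /**
     * Defensive variant: a missing name is reported as "no recorded state" instead of throwing.
     * Unknown non-null names still raise IllegalArgumentException.
     */
    static Optional<PartitionState> parse(String name) {
        if (name == null) {
            return Optional.empty();
        }
        return Optional.of(PartitionState.valueOf(name));
    }

    public static void main(String[] args) {
        System.out.println(failureMessage(null));
        System.out.println(parse(null).isPresent());
        System.out.println(parse("RUNNING").get());
    }
}
```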
 


