[jira] [Assigned] (AURORA-1426) thermos kill hangs when killing a Docker-containerized task

2017-08-09 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang reassigned AURORA-1426:
-

Assignee: (was: Kai Huang)

> thermos kill hangs when killing a Docker-containerized task
> ---
>
> Key: AURORA-1426
> URL: https://issues.apache.org/jira/browse/AURORA-1426
> Project: Aurora
>  Issue Type: Bug
>  Components: Docker, Executor
>Reporter: Kevin Sweeney
>
> When presented with a Docker-containerized task as an argument in the Vagrant 
> environment, it hangs:
> {noformat}
> + sudo thermos kill 
> 1438642104346-vagrant-test-http_example_docker-0-7ce56f48-3a55-413c-885e-8e4218afd582
> I0803 22:48:36.449739 15926 helper.py:276] Found existing runner, forcing 
> leadership forfeit.
> I0803 22:48:36.450725 15926 helper.py:279] Successfully killed leader.
> (hangs here)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (AURORA-216) allow aurora executor to be customized via the commandline

2017-08-09 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang reassigned AURORA-216:


Assignee: (was: Kai Huang)

> allow aurora executor to be customized via the commandline
> --
>
> Key: AURORA-216
> URL: https://issues.apache.org/jira/browse/AURORA-216
> Project: Aurora
>  Issue Type: Story
>  Components: Executor
>Reporter: brian wickman
>Priority: Minor
>
> Right now the AuroraExecutor takes runner_provider, sandbox_provider and 
> status_providers.  These need to be the following:
>   - runner_provider: TaskRunnerProvider (assigned_task -> TaskRunner)
>   - status_providers: list(StatusCheckerProvider) (assigned_task -> 
> StatusChecker)
>   - sandbox_provider: SandboxProvider (assigned_task -> SandboxInterface)
> These are generic enough that we should allow them to be specified on the 
> command line as entry points, for example:
> {noformat}
>   --runner_provider 
> apache.aurora.executor.thermos_runner:ThermosTaskRunnerProvider
>   --status_provider 
> apache.aurora.executor.common.health_checker:HealthCheckerProvider
>   --status_provider myorg.zookeeper:ZkAnnouncerProvider
>   --sandbox_provider myorg.docker:DockerSandboxProvider
> {noformat}
> Then have these loaded up using pkg_resources.EntryPoint.  These plugins can 
> either be linked into the .pex or injected onto the PYTHONPATH of the 
> executor.
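
For illustration, a minimal Python sketch of resolving such a flag value with 
pkg_resources.EntryPoint; load_provider is a hypothetical helper written for 
this example, not an existing Aurora function:

{noformat}
from pkg_resources import EntryPoint

def load_provider(spec):
    # spec is a "module:ClassName" string, e.g.
    # "apache.aurora.executor.common.health_checker:HealthCheckerProvider".
    # EntryPoint.parse expects "name = module:attr"; the name is arbitrary here.
    entry_point = EntryPoint.parse('provider = ' + spec)
    # resolve() imports the module and returns the named attribute; the module
    # must be linked into the .pex or present on the executor's PYTHONPATH.
    return entry_point.resolve()
{noformat}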



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (AURORA-1934) Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks

2017-08-09 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang resolved AURORA-1934.
---
Resolution: Fixed

> Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks
> ---
>
> Key: AURORA-1934
> URL: https://issues.apache.org/jira/browse/AURORA-1934
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Kai Huang
>Assignee: Kai Huang
>Priority: Minor
>
> Aurora Scheduler has a webhook module that watches all TaskStateChanges and 
> sends events to a configured endpoint. This floods the endpoint with noise if 
> we only care about certain types of TaskStateChange events (e.g. a task state 
> change from RUNNING -> LOST). We should allow the Aurora administrator to 
> provide a whitelist of task state change event types in their webhook 
> configuration, so that the webhook only sends those events to the 
> configured endpoint.
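
The scheduler's webhook module is Java; purely as an illustration of the 
whitelist idea, a Python sketch (the transition pairs and payload shape are 
assumptions for the example, not the actual webhook wire format):

{noformat}
import json
import urllib.request

# Hypothetical operator-provided whitelist of (old, new) state transitions.
WHITELIST = {('RUNNING', 'LOST'), ('RUNNING', 'FAILED')}

def on_task_state_change(old_state, new_state, payload, endpoint):
    if (old_state, new_state) not in WHITELIST:
        return  # drop transitions the operator did not ask for
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'})
    urllib.request.urlopen(request)
{noformat}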



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Reopened] (AURORA-1934) Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks

2017-08-09 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang reopened AURORA-1934:
---

> Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks
> ---
>
> Key: AURORA-1934
> URL: https://issues.apache.org/jira/browse/AURORA-1934
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Kai Huang
>Assignee: Kai Huang
>Priority: Minor
>
> Aurora Scheduler has a webhook module that watches all TaskStateChanges and 
> sends events to a configured endpoint. This floods the endpoint with noise if 
> we only care about certain types of TaskStateChange events (e.g. a task state 
> change from RUNNING -> LOST). We should allow the Aurora administrator to 
> provide a whitelist of task state change event types in their webhook 
> configuration, so that the webhook only sends those events to the 
> configured endpoint.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AURORA-1940) aurora job restart request should be retryable

2017-07-07 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1940:
--
Description: 
There was a recent change to the Aurora client to provide "at most once" 
instead of "at least once" retries for non-idempotent operations. See:
https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6

`aurora job restart` is a non-idempotent operation, so it is not retried. 
However, when a transport exception occurs, the operator has to babysit simple 
operations like aurora job restart. Compared to the requests that were causing 
problems (admin tasks, job creation, updates, etc.), restarts should in general 
be retried rather than erring on the side of caution.

  was:
There was a recent change to the Aurora client to provide "at most once" 
instead of "at least once" retries for non-idempotent operations. See:
https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6

`aurora job restart` is a non-idempotent operation, so it is not retried. 
When there is a transport exception, the operator has to babysit simple 
operations like aurora job restart. Compared to the requests that were causing 
problems (admin tasks, job creation, updates, etc.), restarts should in general 
be retried rather than erring on the side of caution.


> aurora job restart request should be retryable
> --
>
> Key: AURORA-1940
> URL: https://issues.apache.org/jira/browse/AURORA-1940
> Project: Aurora
>  Issue Type: Task
>Reporter: Kai Huang
>Assignee: Kai Huang
>Priority: Minor
>
> There was a recent change to the Aurora client to provide "at most once" 
> instead of "at least once" retries for non-idempotent operations. See:
> https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6
> `aurora job restart` is a non-idempotent operation, so it is not retried. 
> However, when a transport exception occurs, the operator has to babysit 
> simple operations like aurora job restart. Compared to the requests that were 
> causing problems (admin tasks, job creation, updates, etc.), restarts should 
> in general be retried rather than erring on the side of caution.
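
As a sketch of the requested behavior, a retry wrapper that retries only on 
transport-level failures; TransportError and do_restart are placeholders for 
this example, not real Aurora client names:

{noformat}
import time

class TransportError(Exception):
    """Placeholder for a transport-level failure (e.g. a dropped connection)."""

def retry_on_transport_error(call, attempts=3, backoff_secs=5):
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except TransportError:
            if attempt == attempts:
                raise  # out of retries; surface the failure to the operator
            time.sleep(backoff_secs)

# Usage: retry_on_transport_error(lambda: do_restart(job_key))
{noformat}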



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AURORA-1940) aurora job restart request should be retryable

2017-07-06 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1940:
--
Description: 
There was a recent change to the Aurora client to provide "at most once" 
instead of "at least once" retries for non-idempotent operations. See:
https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6

`aurora job restart` is a non-idempotent operation, so it is not retried. 
When there is a transport exception, the operator has to babysit simple 
operations like aurora job restart. Compared to the requests that were causing 
problems (admin tasks, job creation, updates, etc.), restarts should in general 
be retried rather than erring on the side of caution.

  was:
There was a recent change to the Aurora client to provide "at most once" 
instead of "at least once" retries for non-idempotent operations. See:
https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6

`aurora job restart` is a non-idempotent operation, so it is not retried. 
However, during a scheduler failover, the operator has to babysit simple 
operations like aurora job restart. Compared to the requests that were causing 
problems (admin tasks, job creation, updates, etc.), restarts should in general 
be retried rather than erring on the side of caution.


> aurora job restart request should be retryable
> --
>
> Key: AURORA-1940
> URL: https://issues.apache.org/jira/browse/AURORA-1940
> Project: Aurora
>  Issue Type: Task
>Reporter: Kai Huang
>Assignee: Kai Huang
>Priority: Minor
>
> There was a recent change to the Aurora client to provide "at most once" 
> instead of "at least once" retries for non-idempotent operations. See:
> https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6
> `aurora job restart` is a non-idempotent operation, so it is not retried. 
> When there is a transport exception, the operator has to babysit simple 
> operations like aurora job restart. Compared to the requests that were 
> causing problems (admin tasks, job creation, updates, etc.), restarts should 
> in general be retried rather than erring on the side of caution.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AURORA-1940) aurora job restart request should be retryable

2017-06-30 Thread Kai Huang (JIRA)
Kai Huang created AURORA-1940:
-

 Summary: aurora job restart request should be retryable
 Key: AURORA-1940
 URL: https://issues.apache.org/jira/browse/AURORA-1940
 Project: Aurora
  Issue Type: Task
Reporter: Kai Huang
Priority: Minor


There was a recent change to the Aurora client to provide "at most once" 
instead of "at least once" retries for non-idempotent operations. See:
https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6

`aurora job restart` is a non-idempotent operation, so it is not retried. 
However, during a scheduler failover, the operator has to babysit simple 
operations like aurora job restart. Compared to the requests that were causing 
problems (admin tasks, job creation, updates, etc.), restarts should in general 
be retried rather than erring on the side of caution.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (AURORA-1940) aurora job restart request should be retryable

2017-06-30 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang reassigned AURORA-1940:
-

Assignee: Kai Huang

> aurora job restart request should be retryable
> --
>
> Key: AURORA-1940
> URL: https://issues.apache.org/jira/browse/AURORA-1940
> Project: Aurora
>  Issue Type: Task
>Reporter: Kai Huang
>Assignee: Kai Huang
>Priority: Minor
>
> There was a recent change to the Aurora client to provide "at most once" 
> instead of "at least once" retries for non-idempotent operations. See:
> https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6
> `aurora job restart` is a non-idempotent operation, so it is not retried. 
> However, during a scheduler failover, the operator has to babysit simple 
> operations like aurora job restart. Compared to the requests that were 
> causing problems (admin tasks, job creation, updates, etc.), restarts should 
> in general be retried rather than erring on the side of caution.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (AURORA-1937) Add metrics for status updates before switching to V1 Mesos Driver implementation

2017-06-26 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063422#comment-16063422
 ] 

Kai Huang edited comment on AURORA-1937 at 6/26/17 5:38 PM:


Add counter for status_update and framework_message: 
https://reviews.apache.org/r/60350/



was (Author: kaih):
Add counter for status_update and framework_message:
https://reviews.apache.org/r/60350/


> Add metrics for status updates before switching to V1 Mesos Driver 
> implementation
> 
>
> Key: AURORA-1937
> URL: https://issues.apache.org/jira/browse/AURORA-1937
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Kai Huang
>Assignee: Kai Huang
>
> Zameer has created a new driver implementation around V1Mesos 
> (https://reviews.apache.org/r/57061). 
> The V1 Mesos code requires a Scheduler callback with a different API. To 
> maximize code reuse, event handling logic was extracted into a 
> [MesosCallbackHandler | 
> https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java#L286]
>  class. However, we do not have the metrics for handling task status updates 
> in this class.
> Metrics around task status updates are key performance indicators for the 
> scheduler. We need to add the metrics back in order to switch to the V1Mesos 
> driver in production.
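
The missing metrics live in the scheduler's Java code; purely as an 
illustration of the counter-plus-timing idea around a status-update handler, a 
Python sketch (the metric names are made up for the example):

{noformat}
import time
from collections import Counter

STATS = Counter()  # stand-in for the scheduler's stats provider

def handle_status_update(update, delegate):
    STATS['scheduler_status_update_events'] += 1  # illustrative counter name
    start = time.monotonic()
    try:
        delegate(update)  # the real handling logic
    finally:
        elapsed_nanos = int((time.monotonic() - start) * 1e9)
        STATS['scheduler_status_update_nanos_total'] += elapsed_nanos
{noformat}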



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1937) Add metrics for status updates before switching to V1 Mesos Driver implementation

2017-06-26 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063483#comment-16063483
 ] 

Kai Huang commented on AURORA-1937:
---

Add timing metrics for status_update: https://reviews.apache.org/r/60437/

> Add metrics for status updates before switching to V1 Mesos Driver 
> implementation
> 
>
> Key: AURORA-1937
> URL: https://issues.apache.org/jira/browse/AURORA-1937
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Kai Huang
>Assignee: Kai Huang
>
> Zameer has created a new driver implementation around V1Mesos 
> (https://reviews.apache.org/r/57061). 
> The V1 Mesos code requires a Scheduler callback with a different API. To 
> maximize code reuse, event handling logic was extracted into a 
> [MesosCallbackHandler | 
> https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java#L286]
>  class. However, we do not have the metrics for handling task status updates 
> in this class.
> Metrics around task status updates are key performance indicators for the 
> scheduler. We need to add the metrics back in order to switch to the V1Mesos 
> driver in production.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (AURORA-1937) Add metrics for status updates before switching to V1 Mesos Driver implementation

2017-06-21 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang reassigned AURORA-1937:
-

Assignee: Kai Huang

> Add metrics for status updates before switching to V1 Mesos Driver 
> implementation
> 
>
> Key: AURORA-1937
> URL: https://issues.apache.org/jira/browse/AURORA-1937
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Kai Huang
>Assignee: Kai Huang
>
> Zameer has created a new driver implementation around V1Mesos 
> (https://reviews.apache.org/r/57061). 
> The V1 Mesos code requires a Scheduler callback with a different API. To 
> maximize code reuse, event handling logic was extracted into a 
> [MesosCallbackHandler | 
> https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java#L286]
>  class. However, we do not have the metrics for handling task status updates 
> in this class.
> Metrics around task status updates are key performance indicators for the 
> scheduler. We need to add the metrics back in order to switch to the V1Mesos 
> driver in production.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AURORA-1937) Add metrics for status updates before switching to V1 Mesos Driver implementation

2017-06-21 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1937:
--
Summary: Add metrics for status updates before switching to V1 Mesos Driver 
implementation  (was: Add metrics for status updates after switching to V1 Mesos 
Driver implementation)

> Add metrics for status updates before switching to V1 Mesos Driver 
> implementation
> 
>
> Key: AURORA-1937
> URL: https://issues.apache.org/jira/browse/AURORA-1937
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Kai Huang
>
> Zameer has created a new driver implementation around V1Mesos 
> (https://reviews.apache.org/r/57061). 
> The V1 Mesos code requires a Scheduler callback with a different API. To 
> maximize code reuse, event handling logic was extracted into a 
> [MesosCallbackHandler | 
> https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java#L286]
>  class. However, we do not have the metrics for handling task status updates 
> in this class.
> Metrics around task status updates are key performance indicators for the 
> scheduler. We need to add the metrics back in order to switch to the V1Mesos 
> driver in production.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AURORA-1937) Add metrics for status updates after switching to V1 Mesos Driver implementation

2017-06-21 Thread Kai Huang (JIRA)
Kai Huang created AURORA-1937:
-

 Summary: Add metrics for status updates after switching to V1 
Mesos Driver implementation
 Key: AURORA-1937
 URL: https://issues.apache.org/jira/browse/AURORA-1937
 Project: Aurora
  Issue Type: Task
  Components: Scheduler
Reporter: Kai Huang


Zameer has created a new driver implementation around V1Mesos 
(https://reviews.apache.org/r/57061). 

The V1 Mesos code requires a Scheduler callback with a different API. To 
maximize code reuse, event handling logic was extracted into a 
[MesosCallbackHandler | 
https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java#L286]
 class. However, we do not have the metrics for handling task status updates in 
this class.

Metrics around task status updates are key performance indicators for the 
scheduler. We need to add the metrics back in order to switch to the V1Mesos 
driver in production.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1934) Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks

2017-06-13 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048550#comment-16048550
 ] 

Kai Huang commented on AURORA-1934:
---

https://reviews.apache.org/r/59940/

> Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks
> ---
>
> Key: AURORA-1934
> URL: https://issues.apache.org/jira/browse/AURORA-1934
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Kai Huang
>Assignee: Kai Huang
>Priority: Minor
>
> Aurora Scheduler has a webhook module that watches all TaskStateChanges and 
> sends events to a configured endpoint. This floods the endpoint with noise if 
> we only care about certain types of TaskStateChange events (e.g. a task state 
> change from RUNNING -> LOST). We should allow the Aurora administrator to 
> provide a whitelist of task state change event types in their webhook 
> configuration, so that the webhook only sends those events to the 
> configured endpoint.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AURORA-1934) Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks

2017-06-08 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1934:
--
Priority: Minor  (was: Major)

> Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks
> ---
>
> Key: AURORA-1934
> URL: https://issues.apache.org/jira/browse/AURORA-1934
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Kai Huang
>Assignee: Kai Huang
>Priority: Minor
>
> Aurora Scheduler has a webhook module that watches all TaskStateChanges and 
> sends events to a configured endpoint. This floods the endpoint with noise if 
> we only care about certain types of TaskStateChange events (e.g. a task state 
> change from RUNNING -> LOST). We should allow the Aurora administrator to 
> provide a whitelist of task state change event types in their webhook 
> configuration, so that the webhook only sends those events to the 
> configured endpoint.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1934) Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks

2017-06-07 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1934:
--
Description: Aurora Scheduler has a webhook module that watches all 
TaskStateChanges and sends events to a configured endpoint. This floods the 
endpoint with noise if we only care about certain types of TaskStateChange 
events (e.g. a task state change from RUNNING -> LOST). We should allow the 
Aurora administrator to provide a whitelist of task state change event types in 
their webhook configuration, so that the webhook only sends those events to the 
configured endpoint.  (was: Aurora Scheduler has a webhook module that watches 
all TaskStateChanges and sends events to a configured endpoint. This generates 
a lot of noise at the endpoint if we only care about certain types of 
TaskStateChange, like the transition from RUNNING -> LOST. We should allow the 
Aurora administrator to provide a whitelist of task state change event types in 
their webhook configuration, so that the webhook only sends those events to the 
configured endpoint.)

> Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks
> ---
>
> Key: AURORA-1934
> URL: https://issues.apache.org/jira/browse/AURORA-1934
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Kai Huang
>Assignee: Kai Huang
>
> Aurora Scheduler has a webhook module that watches all TaskStateChanges and 
> sends events to a configured endpoint. This floods the endpoint with noise if 
> we only care about certain types of TaskStateChange events (e.g. a task state 
> change from RUNNING -> LOST). We should allow the Aurora administrator to 
> provide a whitelist of task state change event types in their webhook 
> configuration, so that the webhook only sends those events to the 
> configured endpoint.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (AURORA-1934) Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks

2017-06-07 Thread Kai Huang (JIRA)
Kai Huang created AURORA-1934:
-

 Summary: Add a whitelist for TaskStateChange events in Aurora 
Scheduler WebHooks
 Key: AURORA-1934
 URL: https://issues.apache.org/jira/browse/AURORA-1934
 Project: Aurora
  Issue Type: Task
  Components: Scheduler
Reporter: Kai Huang
Assignee: Kai Huang


Aurora Scheduler has a webhook module that watches all TaskStateChanges and 
sends events to a configured endpoint. This generates a lot of noise at the 
endpoint if we only care about certain types of TaskStateChange, like the 
transition from RUNNING -> LOST. We should allow the Aurora administrator to 
provide a whitelist of task state change event types in their webhook 
configuration, so that the webhook only sends those events to the 
configured endpoint.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (AURORA-1929) Improve explicit task history pruning.

2017-06-02 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035170#comment-16035170
 ] 

Kai Huang commented on AURORA-1929:
---

https://reviews.apache.org/r/59699/

> Improve explicit task history pruning.
> --
>
> Key: AURORA-1929
> URL: https://issues.apache.org/jira/browse/AURORA-1929
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Kai Huang
>Assignee: Kai Huang
>Priority: Minor
>
> There are currently two types of task history pruning performed by Aurora:
> # The implicit task history pruning run by TaskHistoryPruner in the 
> background, which registers all inactive tasks upon terminal state change for 
> pruning.
> # The explicit task history pruning initiated by the `aurora_admin 
> prune_tasks` command, which prunes inactive tasks in the cluster.
> The prune_tasks endpoint seems to be very slow when the cluster has a large 
> number of inactive tasks. 
> For example, when we use $ aurora_admin prune_tasks for 135k running tasks 
> (1k jobs), it takes about 30 minutes to prune all tasks; the pruning speed 
> seems to max out at 3k tasks per minute.
> Currently, Aurora uses StreamManager to manage a single log stream append 
> transaction for task history pruning. Local storage ops can be added to the 
> transaction and then later committed as an atomic unit. However, the 
> StateManager removes tasks one by one in a for-loop 
> (https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376),
> and each RemoveTasks operation is coalesced with its previous operation, 
> which seems inefficient and unnecessary 
> (https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324).
> We need to batch all removeTasks operations and execute them all at once to 
> avoid the additional cost of coalescing. The fix will also benefit implicit 
> task history pruning, since it has a similar underlying implementation.
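
A minimal sketch of the proposed change, contrasting one storage op per task 
(each coalesced with its predecessor) with a single batched op; the transaction 
and op shapes below stand in for the scheduler's Java StreamTransaction and 
RemoveTasks types:

{noformat}
def prune_one_by_one(transaction, task_ids):
    # Current shape: one RemoveTasks op per task, so every op after the first
    # has to be coalesced with the previous one.
    for task_id in task_ids:
        transaction.add(('RemoveTasks', {task_id}))

def prune_batched(transaction, task_ids):
    # Proposed shape: a single RemoveTasks op carrying every id, leaving
    # nothing to coalesce.
    transaction.add(('RemoveTasks', set(task_ids)))
{noformat}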



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1929) Improve explicit task history pruning.

2017-05-31 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1929:
--
Description: 
There are currently two types of task history pruning performed by Aurora:
# The implicit task history pruning run by TaskHistoryPruner in the 
background, which registers all inactive tasks upon terminal state change for 
pruning.
# The explicit task history pruning initiated by the `aurora_admin prune_tasks` 
command, which prunes inactive tasks in the cluster.

The prune_tasks endpoint seems to be very slow when the cluster has a large 
number of inactive tasks. 

For example, when we use $ aurora_admin prune_tasks for 135k running tasks (1k 
jobs), it takes about 30 minutes to prune all tasks; the pruning speed seems 
to max out at 3k tasks per minute.

Currently, Aurora uses StreamManager to manage a single log stream append 
transaction for task history pruning. Local storage ops can be added to the 
transaction and then later committed as an atomic unit. However, the 
StateManager removes tasks one by one in a for-loop 
(https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376),
and each RemoveTasks operation is coalesced with its previous operation, which 
seems inefficient and unnecessary 
(https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324).

We need to batch all removeTasks operations and execute them all at once to 
avoid the additional cost of coalescing. The fix will also benefit implicit 
task history pruning, since it has a similar underlying implementation.

  was:
There are currently two types of task history pruning performed by Aurora:
# The implicit task history pruning run by TaskHistoryPruner in the 
background, which registers all inactive tasks upon terminal state change for 
pruning.
# The explicit task history pruning initiated by the `aurora_admin prune_tasks` 
command, which prunes inactive tasks in the cluster.

The prune_tasks endpoint seems to be very slow when the cluster has a large 
number of inactive tasks. 

For example, when we use $ aurora_admin prune_tasks for 135k running tasks (1k 
jobs), it takes about 30 minutes to prune all tasks; the pruning speed seems 
to max out at 3k tasks per minute.

Currently, Aurora uses StreamManager to manage a single log stream append 
transaction for task history pruning. Local storage ops (RemoveTasks) can be 
added to the transaction and then later committed as an atomic unit. However, 
the current implementation removes tasks one by one in a for-loop 
(https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376),
and coalesces each RemoveTasks operation with its previous operation, which 
seems inefficient and unnecessary 
(https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324).

We need to batch all removeTasks operations and execute them all at once to 
avoid the additional cost of coalescing. The fix will also benefit implicit 
task history pruning, since it has a similar underlying implementation.


> Improve explicit task history pruning.
> --
>
> Key: AURORA-1929
> URL: https://issues.apache.org/jira/browse/AURORA-1929
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Kai Huang
>Assignee: Kai Huang
>Priority: Minor
>
> There are currently two types of task history pruning performed by Aurora:
> # The implicit task history pruning run by TaskHistoryPruner in the 
> background, which registers all inactive tasks upon terminal state change for 
> pruning.
> # The explicit task history pruning initiated by the `aurora_admin 
> prune_tasks` command, which prunes inactive tasks in the cluster.
> The prune_tasks endpoint seems to be very slow when the cluster has a large 
> number of inactive tasks. 
> For example, when we use $ aurora_admin prune_tasks for 135k running tasks 
> (1k jobs), it takes about 30 minutes to prune all tasks; the pruning speed 
> seems to max out at 3k tasks per minute.
> Currently, Aurora uses StreamManager to manage a single log stream append 
> transaction for task history pruning. Local storage ops can be added to the 
> transaction and then later committed as an atomic unit. However, the 
> StateManager removes tasks one by one in a for-loop 
> (https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376),
> and each RemoveTasks operation is coalesced with its previous operation, 
> which seems inefficient and unnecessary 
> (https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324).
> We need to batch all removeTasks operations and execute them all at once to 
> avoid the additional cost of coalescing. The fix will also benefit implicit 
> task history pruning, since it has a similar underlying implementation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Updated] (AURORA-1929) Improve explicit task history pruning.

2017-05-31 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1929:
--
Component/s: Scheduler

> Improve explicit task history pruning.
> --
>
> Key: AURORA-1929
> URL: https://issues.apache.org/jira/browse/AURORA-1929
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Kai Huang
>Assignee: Kai Huang
>Priority: Minor
>
> There are currently two types of task history pruning performed by Aurora:
> # The implicit task history pruning run by TaskHistoryPruner in the 
> background, which registers all inactive tasks upon terminal state change for 
> pruning.
> # The explicit task history pruning initiated by the `aurora_admin 
> prune_tasks` command, which prunes inactive tasks in the cluster.
> The prune_tasks endpoint seems to be very slow when the cluster has a large 
> number of inactive tasks. 
> For example, when we use $ aurora_admin prune_tasks for 135k running tasks 
> (1k jobs), it takes about 30 minutes to prune all tasks; the pruning speed 
> seems to max out at 3k tasks per minute.
> Currently, Aurora uses StreamManager to manage a single log stream append 
> transaction for task history pruning. Local storage ops can be added to the 
> transaction and then later committed as an atomic unit. However, the 
> StateManager removes tasks one by one in a for-loop 
> (https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376),
> and each RemoveTasks operation is coalesced with its previous operation, 
> which seems inefficient and unnecessary 
> (https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324).
> We need to batch all removeTasks operations and execute them all at once to 
> avoid the additional cost of coalescing. The fix will also benefit implicit 
> task history pruning, since it has a similar underlying implementation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1929) Improve explicit task history pruning.

2017-05-31 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1929:
--
Description: 
There are currently two types of task history pruning performed by Aurora:
# The implicit task history pruning run by TaskHistoryPruner in the 
background, which registers all inactive tasks upon terminal state change for 
pruning.
# The explicit task history pruning initiated by the `aurora_admin prune_tasks` 
command, which prunes inactive tasks in the cluster.

The prune_tasks endpoint seems to be very slow when the cluster has a large 
number of inactive tasks. 

For example, when we use $ aurora_admin prune_tasks for 135k running tasks (1k 
jobs), it takes about 30 minutes to prune all tasks; the pruning speed seems 
to max out at 3k tasks per minute.

Currently, Aurora uses StreamManager to manage a single log stream append 
transaction for task history pruning. Local storage ops (RemoveTasks) can be 
added to the transaction and then later committed as an atomic unit. However, 
the current implementation removes tasks one by one in a for-loop 
(https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376),
and coalesces each RemoveTasks operation with its previous operation, which 
seems inefficient and unnecessary 
(https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324).

We need to batch all removeTasks operations and execute them all at once to 
avoid the additional cost of coalescing. The fix will also benefit implicit 
task history pruning, since it has a similar underlying implementation.

  was:
There are currently two types of task history pruning performed by Aurora:
# The implicit task history pruning run by TaskHistoryPruner in the 
background, which registers all inactive tasks upon terminal state change for 
pruning.
# The explicit task history pruning initiated by the `aurora_admin prune_tasks` 
command, which prunes inactive tasks in the cluster.

The prune_tasks endpoint seems to be very slow when the cluster has a large 
number of inactive tasks. For example, when we use $ aurora_admin prune_tasks 
for 135k running tasks (1k jobs), it takes about 30 minutes to prune all 
tasks; the pruning speed seems to max out at 3k tasks per minute.

Currently, Aurora uses StreamManager to manage a single log stream append 
transaction for task history pruning. Local storage ops (RemoveTasks) can be 
added to the transaction and then later committed as an atomic unit. 

However, the current implementation removes tasks one by one in a for-loop 
(https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376),
and coalesces each RemoveTasks operation with its previous operation, which 
seems inefficient and unnecessary 
(https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324).

We need to batch all removeTasks operations and execute them all at once to 
avoid the additional cost of coalescing. The fix will also benefit implicit 
task history pruning, since it has a similar underlying implementation.


> Improve explicit task history pruning.
> --
>
> Key: AURORA-1929
> URL: https://issues.apache.org/jira/browse/AURORA-1929
> Project: Aurora
>  Issue Type: Task
>Reporter: Kai Huang
>Assignee: Kai Huang
>Priority: Minor
>
> There are currently two types of task history pruning performed by Aurora:
> # The implicit task history pruning run by TaskHistoryPruner in the 
> background, which registers all inactive tasks upon terminal state change for 
> pruning.
> # The explicit task history pruning initiated by the `aurora_admin 
> prune_tasks` command, which prunes inactive tasks in the cluster.
> The prune_tasks endpoint seems to be very slow when the cluster has a large 
> number of inactive tasks. 
> For example, when we use $ aurora_admin prune_tasks for 135k running tasks 
> (1k jobs), it takes about 30 minutes to prune all tasks; the pruning speed 
> seems to max out at 3k tasks per minute.
> Currently, Aurora uses StreamManager to manage a single log stream append 
> transaction for task history pruning. Local storage ops (RemoveTasks) can be 
> added to the transaction and then later committed as an atomic unit. However, 
> the current implementation removes tasks one by one in a for-loop 
> (https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376),
> and coalesces each RemoveTasks operation with its previous operation, which 
> seems inefficient and unnecessary 
> (https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324).
> We need to batch all removeTasks operations and execute them all at once to 
> avoid the additional cost of coalescing. The fix will also benefit implicit 
> task history pruning, since it has a similar underlying implementation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Updated] (AURORA-1929) Improve explicit task history pruning.

2017-05-31 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1929:
--
Description: 
There are currently two types of task history pruning performed by Aurora:
# The implicit task history pruning run by TaskHistoryPruner in the 
background, which registers all inactive tasks upon terminal state change for 
pruning.
# The explicit task history pruning initiated by the `aurora_admin prune_tasks` 
command, which prunes inactive tasks in the cluster.

The prune_tasks endpoint seems to be very slow when the cluster has a large 
number of inactive tasks. For example, when we use $ aurora_admin prune_tasks 
for 135k running tasks (1k jobs), it takes about 30 minutes to prune all 
tasks; the pruning speed seems to max out at 3k tasks per minute.

Currently, Aurora uses StreamManager to manage a single log stream append 
transaction for task history pruning. Local storage ops (RemoveTasks) can be 
added to the transaction and then later committed as an atomic unit. 

However, the current implementation removes tasks one by one in a for-loop 
(https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376),
and coalesces each RemoveTasks operation with its previous operation, which 
seems inefficient and unnecessary 
(https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324).

We need to batch all removeTasks operations and execute them all at once to 
avoid the additional cost of coalescing. The fix will also benefit implicit 
task history pruning, since it has a similar underlying implementation.

  was:
There are currently two types of task history pruning performed by Aurora:
# 1) The implicit task history pruning run by TaskHistoryPruner in the 
background, which registers all inactive tasks upon terminal state change for 
pruning.
# 2) The explicit task history pruning initiated by the `aurora_admin 
prune_tasks` command, which prunes inactive tasks in the cluster.

The prune_tasks endpoint seems to be very slow when the cluster has a large 
number of inactive tasks. For example, when we use $ aurora_admin prune_tasks 
for 135k running tasks (1k jobs), it takes about 30 minutes to prune all 
tasks; the pruning speed seems to max out at 3k tasks per minute.

Currently, Aurora uses StreamManager to manage a single log stream append 
transaction for task history pruning. Local storage ops (RemoveTasks) can be 
added to the transaction and then later committed as an atomic unit. 

However, the current implementation removes tasks one by one in a for-loop 
(https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376),
and coalesces each RemoveTasks operation with its previous operation, which 
seems inefficient and unnecessary 
(https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324).

We need to batch all removeTasks operations and execute them all at once to 
avoid the additional cost of coalescing. The fix will also benefit implicit 
task history pruning, since it has a similar underlying implementation.


> Improve explicit task history pruning.
> --
>
> Key: AURORA-1929
> URL: https://issues.apache.org/jira/browse/AURORA-1929
> Project: Aurora
>  Issue Type: Task
>Reporter: Kai Huang
>Assignee: Kai Huang
>Priority: Minor
>
> There are currently two types of task history pruning performed by Aurora:
> # The implicit task history pruning run by TaskHistoryPruner in the 
> background, which registers all inactive tasks upon terminal state change for 
> pruning.
> # The explicit task history pruning initiated by the `aurora_admin 
> prune_tasks` command, which prunes inactive tasks in the cluster.
> The prune_tasks endpoint seems to be very slow when the cluster has a large 
> number of inactive tasks. For example, when we use $ aurora_admin prune_tasks 
> for 135k running tasks (1k jobs), it takes about 30 minutes to prune all 
> tasks; the pruning speed seems to max out at 3k tasks per minute.
> Currently, Aurora uses StreamManager to manage a single log stream append 
> transaction for task history pruning. Local storage ops (RemoveTasks) can be 
> added to the transaction and then later committed as an atomic unit. 
> However, the current implementation removes tasks one by one in a for-loop 
> (https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376),
> and coalesces each RemoveTasks operation with its previous operation, which 
> seems inefficient and unnecessary 
> (https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324).
> We need to batch all removeTasks operations and execute them all at once to 
> avoid the additional cost of coalescing. The fix will also benefit implicit 
> task history pruning, since it has a similar underlying implementation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Updated] (AURORA-1929) Improve explicit task history pruning.

2017-05-31 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1929:
--
Description: 
There are currently two types of task history pruning performed by Aurora:
# 1) The implicit task history pruning run by TaskHistoryPruner in the 
background, which registers all inactive tasks upon terminal state change for 
pruning.
# 2) The explicit task history pruning initiated by the `aurora_admin 
prune_tasks` command, which prunes inactive tasks in the cluster.

The prune_tasks endpoint seems to be very slow when the cluster has a large 
number of inactive tasks. For example, when we use $ aurora_admin prune_tasks 
for 135k running tasks (1k jobs), it takes about 30 minutes to prune all 
tasks; the pruning speed seems to max out at 3k tasks per minute.

Currently, Aurora uses StreamManager to manage a single log stream append 
transaction for task history pruning. Local storage ops (RemoveTasks) can be 
added to the transaction and then later committed as an atomic unit. 

However, the current implementation removes tasks one by one in a for-loop 
(https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376),
and coalesces each RemoveTasks operation with its previous operation, which 
seems inefficient and unnecessary 
(https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324).

We need to batch all removeTasks operations and execute them all at once to 
avoid the additional cost of coalescing. The fix will also benefit implicit 
task history pruning, since it has a similar underlying implementation.

  was:
There are currently two types of task history pruning performed by Aurora:
# The implicit task history pruning run by TaskHistoryPruner in the 
background, which registers all inactive tasks upon terminal state change for 
pruning.
# The explicit task history pruning initiated by the `aurora_admin prune_tasks` 
command, which prunes inactive tasks in the cluster.

The prune_tasks endpoint seems to be very slow when the cluster has a large 
number of inactive tasks. For example, when we use $ aurora_admin prune_tasks 
for 135k running tasks (1k jobs), it takes about 30 minutes to prune all 
tasks; the pruning speed seems to max out at 3k tasks per minute.

Currently, Aurora uses StreamManager to manage a single log stream append 
transaction for task history pruning. Local storage ops (RemoveTasks) can be 
added to the transaction and then later committed as an atomic unit. 

However, the current implementation removes tasks one by one in a for-loop 
(https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376),
and coalesces each RemoveTasks operation with its previous operation, which 
seems inefficient and unnecessary 
(https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324).

We need to batch all removeTasks operations and execute them all at once to 
avoid the additional cost of coalescing. The fix will also benefit implicit 
task history pruning, since it has a similar underlying implementation.


> Improve explicit task history pruning.
> --
>
> Key: AURORA-1929
> URL: https://issues.apache.org/jira/browse/AURORA-1929
> Project: Aurora
>  Issue Type: Task
>Reporter: Kai Huang
>Assignee: Kai Huang
>Priority: Minor
>
> There are currently two types of task history pruning performed by Aurora:
> # 1) The implicit task history pruning run by TaskHistoryPruner in the 
> background, which registers all inactive tasks upon terminal state change for 
> pruning.
> # 2) The explicit task history pruning initiated by the `aurora_admin 
> prune_tasks` command, which prunes inactive tasks in the cluster.
> The prune_tasks endpoint seems to be very slow when the cluster has a large 
> number of inactive tasks. For example, when we use $ aurora_admin prune_tasks 
> for 135k running tasks (1k jobs), it takes about 30 minutes to prune all 
> tasks; the pruning speed seems to max out at 3k tasks per minute.
> Currently, Aurora uses StreamManager to manage a single log stream append 
> transaction for task history pruning. Local storage ops (RemoveTasks) can be 
> added to the transaction and then later committed as an atomic unit. 
> However, the current implementation removes tasks one by one in a for-loop 
> (https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376),
> and coalesces each RemoveTasks operation with its previous operation, which 
> seems inefficient and unnecessary 
> (https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324).
> We need to batch all removeTasks operations and execute them all at once to 
> avoid the additional cost of coalescing. The fix will also benefit implicit 
> task history pruning, since it has a similar underlying implementation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Created] (AURORA-1929) Improve explicit task history pruning.

2017-05-31 Thread Kai Huang (JIRA)
Kai Huang created AURORA-1929:
-

 Summary: Improve explicit task history pruning.
 Key: AURORA-1929
 URL: https://issues.apache.org/jira/browse/AURORA-1929
 Project: Aurora
  Issue Type: Task
Reporter: Kai Huang
Assignee: Kai Huang
Priority: Minor


There are currently two types of task history pruning performed by Aurora:
# The implicit task history pruning run by TaskHistoryPruner in the 
background, which registers all inactive tasks upon terminal state change for 
pruning.
# The explicit task history pruning initiated by the `aurora_admin prune_tasks` 
command, which prunes inactive tasks in the cluster.

The prune_tasks endpoint seems to be very slow when the cluster has a large 
number of inactive tasks. For example, when we use $ aurora_admin prune_tasks 
for 135k running tasks (1k jobs), it takes about 30 minutes to prune all 
tasks; the pruning speed seems to max out at 3k tasks per minute.

Currently, Aurora uses StreamManager to manage a single log stream append 
transaction for task history pruning. Local storage ops (RemoveTasks) can be 
added to the transaction and then later committed as an atomic unit. 

However, the current implementation removes tasks one by one in a for-loop 
(https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376),
and coalesces each RemoveTasks operation with its previous operation, which 
seems inefficient and unnecessary 
(https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324).

We need to batch all removeTasks operations and execute them all at once to 
avoid the additional cost of coalescing. The fix will also benefit implicit 
task history pruning, since it has a similar underlying implementation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1837) Improve implicit task history pruning

2017-05-31 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1837:
--
Summary: Improve implicit task history pruning  (was: Improve task history 
pruning)

> Improve implicit task history pruning
> -
>
> Key: AURORA-1837
> URL: https://issues.apache.org/jira/browse/AURORA-1837
> Project: Aurora
>  Issue Type: Task
>Reporter: Reza Motamedi
>Assignee: Mehrdad Nurolahzade
>Priority: Minor
>  Labels: scheduler
>
> The current implementation of {{TaskHistoryPruner}} registers all inactive 
> tasks upon terminal _state_ change for pruning. 
> {{TaskHistoryPruner::registerInactiveTask()}} uses a delay executor to 
> schedule the pruning of _tasks_. However, we have noticed that most of the 
> pruning takes place after the scheduler recovers from a fail-over.
> Modify {{TaskHistoryPruner}} to a design similar to 
> {{JobUpdateHistoryPruner}} (see the sketch after this list):
> # Instead of registering delay executors upon terminal task state 
> transitions, have it wake up at preconfigured intervals, find all 
> terminal-state tasks that meet the pruning criteria, and delete them.
> # Make the initial task history pruning delay configurable so that it does 
> not hamper the scheduler upon start.
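
A rough Python sketch of that interval-based design; the storage accessors and 
pruning criteria are illustrative, not real Aurora APIs:

{noformat}
import time

def run_pruner(storage, initial_delay_secs, interval_secs, meets_criteria):
    # Configurable initial delay so pruning does not hamper scheduler start.
    time.sleep(initial_delay_secs)
    while True:
        # Wake up on a fixed interval instead of per-task delayed callbacks.
        doomed = [t for t in storage.fetch_terminal_tasks() if meets_criteria(t)]
        if doomed:
            storage.delete_tasks([t.task_id for t in doomed])
        time.sleep(interval_secs)
{noformat}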



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1879) /pendingTasks endpoint shows 500 HTTP Error when there are multiple pending tasks with the same key

2017-01-18 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1879:
--
Attachment: pending_tasks.png

> /pendingTasks endpoint shows 500 HTTP Error when there are multiple pending 
> tasks with the same key
> ---
>
> Key: AURORA-1879
> URL: https://issues.apache.org/jira/browse/AURORA-1879
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Kai Huang
> Attachments: pending_tasks.png
>
>
> When we have multiple TaskGroups that have the same key but different 
> TaskConfigs, the /pendingTasks endpoint gives a 500 HTTP error.
> This bug seems to be related to a recent commit ("Added the 'reason' to the 
> /pendingTasks endpoint": 
> https://github.com/apache/aurora/commit/8e07b04bbd4de23b8f492627da4a614d1e517cf1).
> Attached is a screenshot of the /pendingTasks endpoint.
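
If the 500 comes from a duplicate-key collision while the endpoint builds its 
response map (plausible given the symptom, though not confirmed in this 
ticket), a tolerant grouping would look like the Python sketch below; the 
field names are illustrative:

{noformat}
from collections import defaultdict

def group_pending(task_groups):
    # Groups may legitimately share a key with different TaskConfigs, so a
    # dict of lists is used where a strict unique-key map would raise.
    by_key = defaultdict(list)
    for group in task_groups:
        by_key[group['key']].append(group)
    return by_key
{noformat}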



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1879) /pendingTasks endpoint shows 500 HTTP Error when there are multiple pending tasks with the same key

2017-01-18 Thread Kai Huang (JIRA)
Kai Huang created AURORA-1879:
-

 Summary: /pendingTasks endpoint shows 500 HTTP Error when there 
are multiple pending tasks with the same key
 Key: AURORA-1879
 URL: https://issues.apache.org/jira/browse/AURORA-1879
 Project: Aurora
  Issue Type: Bug
Reporter: Kai Huang


When we have multiple TaskGroups that have the same key but different 
TaskConfigs, the /pendingTasks endpoint gives a 500 HTTP error.

This bug seems to be related to a recent commit ("Added the 'reason' to the 
/pendingTasks endpoint": 
https://github.com/apache/aurora/commit/8e07b04bbd4de23b8f492627da4a614d1e517cf1).

Attached is a screenshot of the /pendingTasks endpoint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1879) /pendingTasks endpoint shows 500 HTTP Error when there are multiple pending tasks with the same key

2017-01-18 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1879:
--
Component/s: Scheduler

> /pendingTasks endpoint shows 500 HTTP Error when there are multiple pending 
> tasks with the same key
> ---
>
> Key: AURORA-1879
> URL: https://issues.apache.org/jira/browse/AURORA-1879
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Kai Huang
>
> When we have multiple TaskGroups that have the same key but different 
> TaskConfigs, the /pendingTasks endpoint gives a 500 HTTP error.
> This bug seems to be related to a recent commit ("Added the 'reason' to the 
> /pendingTasks endpoint": 
> https://github.com/apache/aurora/commit/8e07b04bbd4de23b8f492627da4a614d1e517cf1).
> Attached is a screenshot of the /pendingTasks endpoint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)

2016-10-17 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15583543#comment-15583543
 ] 

Kai Huang commented on AURORA-1225:
---

Someone on the Aurora team at Twitter will continue working on it in the next 
few weeks, most likely pairing with me to fix it. 

I'll update this ticket once we make any progress on it.

> Modify executor state transition logic to rely on health checks (if enabled)
> 
>
> Key: AURORA-1225
> URL: https://issues.apache.org/jira/browse/AURORA-1225
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Maxim Khutornenko
>
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when the required number of successful health checks is reached.
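
A minimal sketch of that gating logic; the parameter names are illustrative 
rather than the executor's actual configuration keys:

{noformat}
def await_running(check_health, min_consecutive_successes, max_attempts):
    # Stay in STARTING until enough consecutive health checks pass.
    consecutive = 0
    for _ in range(max_attempts):
        if check_health():
            consecutive += 1
            if consecutive >= min_consecutive_successes:
                return 'RUNNING'
        else:
            consecutive = 0  # any failure resets the streak
    return 'FAILED'
{noformat}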



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)

2016-10-17 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15582921#comment-15582921
 ] 

Kai Huang commented on AURORA-1225:
---

This ticket is almost done. We need to add some more tests for the health 
checker.

> Modify executor state transition logic to rely on health checks (if enabled)
> 
>
> Key: AURORA-1225
> URL: https://issues.apache.org/jira/browse/AURORA-1225
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Maxim Khutornenko
>
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)

2016-10-17 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1225:
--
Assignee: (was: Kai Huang)

> Modify executor state transition logic to rely on health checks (if enabled)
> 
>
> Key: AURORA-1225
> URL: https://issues.apache.org/jira/browse/AURORA-1225
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Maxim Khutornenko
>
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.

2016-10-12 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15569618#comment-15569618
 ] 

Kai Huang commented on AURORA-1791:
---

The ticket to track is:  https://issues.apache.org/jira/browse/AURORA-1793

> Commit ca683 is not backwards compatible.
> -
>
> Key: AURORA-1791
> URL: https://issues.apache.org/jira/browse/AURORA-1791
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit 
> {quote}
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>   initial_interval_secs: 10
>   interval_secs: 5
>   max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to 
> initial_interval_secs and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1793) Revert Commit ca683 which is not backwards compatible

2016-10-12 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1793:
--
Assignee: Kai Huang

> Revert Commit ca683 which is not backwards compatible
> -
>
> Key: AURORA-1793
> URL: https://issues.apache.org/jira/browse/AURORA-1793
> Project: Aurora
>  Issue Type: Bug
>Reporter: Kai Huang
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit ca683cb9e27bae76424a687bc6c3af5a73c501b9 is not backwards 
> compatible. We decided to revert this commit.
> The change that directly causes problems is:
> {code}
> Modify executor state transition logic to rely on health checks (if enabled).
> commit ca683cb9e27bae76424a687bc6c3af5a73c501b9
> {code}
> There are two downstream commits that depend on the above commit:
> {code}
> Add min_consecutive_health_checks in HealthCheckConfig
> commit ed72b1bf662d1e29d2bb483b317c787630c26a9e
> {code}
> {code}
> Add support for receiving min_consecutive_successes in health checker
> commit e91130e49445c3933b6e27f5fde18c3a0e61b87a
> {code}
> We will drop all three of these commits and revert to the commit before 
> the problematic commit:
> {code}
> Running task ssh without an instance should pick a random instance
> commit 59b4d319b8bb5f48ec3880e36f39527f1498a31c
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1793) Revert Commit ca683 which is not backwards compatible

2016-10-12 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1793:
--
Description: 
The commit ca683cb9e27bae76424a687bc6c3af5a73c501b9 is not backwards 
compatible. We decided to revert this commit.

The change that directly causes problems is:
{code}
Modify executor state transition logic to rely on health checks (if enabled).
commit ca683cb9e27bae76424a687bc6c3af5a73c501b9
{code}

There are two downstream commits that depend on the above commit:
{code}
Add min_consecutive_health_checks in HealthCheckConfig
commit ed72b1bf662d1e29d2bb483b317c787630c26a9e
{code}
{code}
Add support for receiving min_consecutive_successes in health checker
commit e91130e49445c3933b6e27f5fde18c3a0e61b87a
{code}
We will drop all three of these commits and revert to the commit before 
the problematic commit:
{code}
Running task ssh without an instance should pick a random instance
commit 59b4d319b8bb5f48ec3880e36f39527f1498a31c
{code}

> Revert Commit ca683 which is not backwards compatible
> -
>
> Key: AURORA-1793
> URL: https://issues.apache.org/jira/browse/AURORA-1793
> Project: Aurora
>  Issue Type: Bug
>Reporter: Kai Huang
>Priority: Blocker
>
> The commit ca683cb9e27bae76424a687bc6c3af5a73c501b9 is not backwards 
> compatible. We decided to revert this commit.
> The change that directly causes problems is:
> {code}
> Modify executor state transition logic to rely on health checks (if enabled).
> commit ca683cb9e27bae76424a687bc6c3af5a73c501b9
> {code}
> There are two downstream commits that depend on the above commit:
> {code}
> Add min_consecutive_health_checks in HealthCheckConfig
> commit ed72b1bf662d1e29d2bb483b317c787630c26a9e
> {code}
> {code}
> Add support for receiving min_consecutive_successes in health checker
> commit e91130e49445c3933b6e27f5fde18c3a0e61b87a
> {code}
> We will drop all three of these commits and revert to the commit before 
> the problematic commit:
> {code}
> Running task ssh without an instance should pick a random instance
> commit 59b4d319b8bb5f48ec3880e36f39527f1498a31c
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1793) Revert

2016-10-12 Thread Kai Huang (JIRA)
Kai Huang created AURORA-1793:
-

 Summary: Revert 
 Key: AURORA-1793
 URL: https://issues.apache.org/jira/browse/AURORA-1793
 Project: Aurora
  Issue Type: Bug
Reporter: Kai Huang
Priority: Blocker






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1793) Revert Commit ca683 which is not backwards compatible

2016-10-12 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1793:
--
Summary: Revert Commit ca683 which is not backwards compatible  (was: 
Revert )

> Revert Commit ca683 which is not backwards compatible
> -
>
> Key: AURORA-1793
> URL: https://issues.apache.org/jira/browse/AURORA-1793
> Project: Aurora
>  Issue Type: Bug
>Reporter: Kai Huang
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.

2016-10-12 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15569600#comment-15569600
 ] 

Kai Huang commented on AURORA-1791:
---

We've decided to revert the commit. 

The change that directly causes problems is:

Modify executor state transition logic to rely on health checks (if enabled).
commit ca683cb9e27bae76424a687bc6c3af5a73c501b9

There are two downstream commits that depend on the above commit:

Add min_consecutive_health_checks in HealthCheckConfig
commit ed72b1bf662d1e29d2bb483b317c787630c26a9e

Add support for receiving min_consecutive_successes in health checker
commit e91130e49445c3933b6e27f5fde18c3a0e61b87a

We will drop all three of these commits and revert to the commit before 
the problematic commit:
Running task ssh without an instance should pick a random instance
commit 59b4d319b8bb5f48ec3880e36f39527f1498a31c

I will create a separate ticket for people to track the reversion.



> Commit ca683 is not backwards compatible.
> -
>
> Key: AURORA-1791
> URL: https://issues.apache.org/jira/browse/AURORA-1791
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit 
> {quote}
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>   initial_interval_secs: 10
>   interval_secs: 5
>   max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to 
> initial_interval_secs and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.

2016-10-12 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15569183#comment-15569183
 ] 

Kai Huang commented on AURORA-1791:
---

Thanks for the examples, [~davmclau]. It seems the example you provided above 
is similar to approach (a), which I implemented in the patch.
One thing worth noting is that this approach relies on the latest health 
check, even if it happens well after initial_interval_secs expires.
Consider the case where initial_interval_secs = 150 and interval_secs = 100: 
if the task becomes healthy at second 101, we won't be able to observe the 
healthy state until second 200. This is a grey area in the design doc which 
we need to discuss more. 

I agree with you that we should take a step back, since it seems we need more 
discussion on whether min_consecutive_successes should trigger a task state 
transition after initial_interval_secs expires. 
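
To make the grey area above concrete, here is a tiny sketch (a hypothetical 
helper, not Aurora code) of when a purely periodic checker first observes a 
task that became healthy between two checks:

{code}
import math

def first_observed_healthy(time_healthy_secs, interval_secs):
    # With a check every interval_secs, a task that turns healthy between
    # checks is not observed until the next check fires.
    return int(math.ceil(float(time_healthy_secs) / interval_secs) * interval_secs)

# initial_interval_secs = 150, interval_secs = 100, healthy at second 101:
print(first_observed_healthy(101, 100))  # 200, i.e. 50s past the initial interval
{code}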

> Commit ca683 is not backwards compatible.
> -
>
> Key: AURORA-1791
> URL: https://issues.apache.org/jira/browse/AURORA-1791
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit 
> {quote}
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>   initial_interval_secs: 10
>   interval_secs: 5
>   max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to 
> initial_interval_secs and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (AURORA-1791) Commit ca683 is not backwards compatible.

2016-10-11 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567768#comment-15567768
 ] 

Kai Huang edited comment on AURORA-1791 at 10/12/16 6:43 AM:
-

To sum up, the issue is caused by failing to reach min_consecutive_successes, 
not by exceeding max_consecutive_failures. 

In commit ca683, I keep updating the failure counter but ignore it until 
initial_interval_secs expires. This does not cause any problem, but it was 
not clear to people. I've changed it to: update the failure counter only 
after initial_interval_secs expires.

For the root cause of the issue, min_consecutive_successes, we have two 
options here:

(a) Do health checks periodically as defined. Even if initial_interval_secs 
expires and the minimum number of successes has not been reached (because 
periodic checks will miss some successes), we do not fail the health check 
right away. Instead, we rely on the latest health check to confirm that the 
task is already in a healthy state. 

(b) Do an additional health check whenever initial_interval_secs expires.

In my recent review request, I implemented (a). This is based on the 
assumption that if a task responds OK before initial_interval_secs expires, 
it will still respond OK on the next health check. However, it's possible the 
task fails to respond OK until we perform this additional health check. It's 
highly likely the instance will be healthy afterwards, but should we fail the 
health check according to the definition?


was (Author: kaih):
To sum up, the issue is caused by failing to reach min_consecutive_successes, 
not by exceeding max_consecutive_failures. 

In commit ca683, I keep updating the failure counter but only ignore it until 
initial_interval_secs expires. This does not cause any problem, but it was 
not clear to people. I've changed it to: update the failure counter only 
after initial_interval_secs expires.

For the root cause of the issue, min_consecutive_successes, we have two 
options here:

(a) Do health checks periodically as defined. Even if initial_interval_secs 
expires and the minimum number of successes has not been reached (because 
periodic checks will miss some successes), we do not fail the health check 
right away. Instead, we rely on the latest health check to confirm that the 
task is already in a healthy state. 

(b) Do an additional health check whenever initial_interval_secs expires.

In my recent review request, I implemented (a). This is based on the 
assumption that if a task responds OK before initial_interval_secs expires, 
it will still respond OK on the next health check. However, it's possible the 
task fails to respond OK until we perform this additional health check. It's 
highly likely the instance will be healthy afterwards, but should we fail the 
health check according to the definition?

> Commit ca683 is not backwards compatible.
> -
>
> Key: AURORA-1791
> URL: https://issues.apache.org/jira/browse/AURORA-1791
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit 
> {quote}
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>   initial_interval_secs: 10
>   interval_secs: 5
>   max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to 
> initial_interval_secs and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.

2016-10-11 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567803#comment-15567803
 ] 

Kai Huang commented on AURORA-1791:
---

One issue with implementing (b) is that the health checker thread might be 
sleeping when initial_interval_secs expires.

We would need an event-driven mechanism to notify the health checker to wake 
up and do a health check when initial_interval_secs expires. This seems to 
require a lot of refactoring.
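
For illustration, a minimal sketch (assumed names, not the actual health 
checker code) of such a mechanism: the checker thread waits on a 
threading.Event instead of plain sleeping, and a timer sets the event when 
initial_interval_secs expires, forcing an immediate check:

{code}
import threading

class EventDrivenChecker(threading.Thread):
    """Sketch only: a periodic checker that can be woken up for an extra check."""

    def __init__(self, check_fn, interval_secs, initial_interval_secs):
        super(EventDrivenChecker, self).__init__()
        self._check = check_fn
        self._interval = interval_secs
        self._wakeup = threading.Event()
        self._stopped = threading.Event()
        # Fires exactly when the grace period ends, waking the loop below.
        self._deadline = threading.Timer(initial_interval_secs, self._wakeup.set)

    def run(self):
        self._deadline.start()
        while not self._stopped.is_set():
            self._check()
            # Sleep up to interval_secs, but return early if the timer fires.
            self._wakeup.wait(self._interval)
            self._wakeup.clear()

    def stop(self):
        self._stopped.set()
        self._wakeup.set()
{code}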

> Commit ca683 is not backwards compatible.
> -
>
> Key: AURORA-1791
> URL: https://issues.apache.org/jira/browse/AURORA-1791
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit 
> {quote}
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>   initial_interval_secs: 10
>   interval_secs: 5
>   max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to 
> initial_interval_secs and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.

2016-10-11 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567768#comment-15567768
 ] 

Kai Huang commented on AURORA-1791:
---

To sum up, the issue is caused by failing to reach min_consecutive_successes, 
not by exceeding max_consecutive_failures. 

In commit ca683, I keep updating the failure counter but only ignore it until 
initial_interval_secs expires. This does not cause any problem, but it was 
not clear to people. I've changed it to: update the failure counter only 
after initial_interval_secs expires.

For the root cause of the issue, min_consecutive_successes, we have two 
options here:

(a) Do health checks periodically as defined. Even if initial_interval_secs 
expires and the minimum number of successes has not been reached (because 
periodic checks will miss some successes), we do not fail the health check 
right away. Instead, we rely on the latest health check to confirm that the 
task is already in a healthy state. 

(b) Do an additional health check whenever initial_interval_secs expires.

In my recent review request, I implemented (a). This is based on the 
assumption that if a task responds OK before initial_interval_secs expires, 
it will still respond OK on the next health check. However, it's possible the 
task fails to respond OK until we perform this additional health check. It's 
highly likely the instance will be healthy afterwards, but should we fail the 
health check according to the definition?
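
For illustration, a bare-bones sketch of the decision rule in option (a) 
(a hypothetical function, not the actual health_checker.py logic):

{code}
def healthy_under_option_a(now_secs, initial_interval_secs,
                           consecutive_successes, min_consecutive_successes,
                           latest_check_ok):
    """Option (a): after the initial interval, fall back to the latest check
    instead of failing immediately when too few successes were observed."""
    if now_secs < initial_interval_secs:
        return True  # grace period: do not fail the task yet
    if consecutive_successes >= min_consecutive_successes:
        return True  # enough consecutive successes observed
    # Periodic checks may have missed some successes; trust the latest result.
    return latest_check_ok
{code}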

> Commit ca683 is not backwards compatible.
> -
>
> Key: AURORA-1791
> URL: https://issues.apache.org/jira/browse/AURORA-1791
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit 
> {quote}
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>   initial_interval_secs: 10
>   interval_secs: 5
>   max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to 
> initial_interval_secs and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.

2016-10-11 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566721#comment-15566721
 ] 

Kai Huang commented on AURORA-1791:
---

In your case, it's likely your task becomes healthy at the 6th second. 
However, within the 10-second initial_interval_secs, we only performed one 
health check, at the 5th second. So by definition, we "failed" to observe at 
least one successful health check within the given period, so we moved the 
task state to FAILED. 

We can enforce a health check at the end of initial_interval_secs.

> Commit ca683 is not backwards compatible.
> -
>
> Key: AURORA-1791
> URL: https://issues.apache.org/jira/browse/AURORA-1791
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit 
> {quote}
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>   initial_interval_secs: 10
>   interval_secs: 5
>   max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to 
> initial_interval_secs and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.

2016-10-11 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566711#comment-15566711
 ] 

Kai Huang commented on AURORA-1791:
---

Thanks for pointing it out. The current implementation is: if we fail to 
observe enough successful health checks within initial_interval_secs (even if 
they happened), we move the task state to FAILED. This behavior is not 
backwards compatible and not safe. I'll revise the design.

> Commit ca683 is not backwards compatible.
> -
>
> Key: AURORA-1791
> URL: https://issues.apache.org/jira/browse/AURORA-1791
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit 
> {quote}
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>   initial_interval_secs: 10
>   interval_secs: 5
>   max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to 
> initial_interval_secs and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (AURORA-1224) Add a new "min_consecutive_health_checks" setting in .aurora config

2016-09-29 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang resolved AURORA-1224.
---
Resolution: Fixed

> Add a new "min_consecutive_health_checks" setting in .aurora config
> ---
>
> Key: AURORA-1224
> URL: https://issues.apache.org/jira/browse/AURORA-1224
> Project: Aurora
>  Issue Type: Task
>  Components: Client, Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> HealthCheckConfig should accept a new configuration value that will tell how 
> many positive consecutive health checks an instance requires to move from 
> STARTING to RUNNING.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)

2016-09-29 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang resolved AURORA-1225.
---
Resolution: Fixed

> Modify executor state transition logic to rely on health checks (if enabled)
> 
>
> Key: AURORA-1225
> URL: https://issues.apache.org/jira/browse/AURORA-1225
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1426) thermos kill hangs when killing a Docker-containerized task

2016-09-26 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1426:
--
Assignee: Kai Huang

> thermos kill hangs when killing a Docker-containerized task
> ---
>
> Key: AURORA-1426
> URL: https://issues.apache.org/jira/browse/AURORA-1426
> Project: Aurora
>  Issue Type: Bug
>  Components: Docker, Executor
>Reporter: Kevin Sweeney
>Assignee: Kai Huang
>
> When presented with a docker task as argument in the Vagrant environment it 
> hangs:
> {noformat}
> + sudo thermos kill 
> 1438642104346-vagrant-test-http_example_docker-0-7ce56f48-3a55-413c-885e-8e4218afd582
> I0803 22:48:36.449739 15926 helper.py:276] Found existing runner, forcing 
> leadership forfeit.
> I0803 22:48:36.450725 15926 helper.py:279] Successfully killed leader.
> (hangs here)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-216) allow aurora executor to be customized via the commandline

2016-09-26 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-216:
-
Assignee: Kai Huang

> allow aurora executor to be customized via the commandline
> --
>
> Key: AURORA-216
> URL: https://issues.apache.org/jira/browse/AURORA-216
> Project: Aurora
>  Issue Type: Story
>  Components: Executor
>Reporter: brian wickman
>Assignee: Kai Huang
>Priority: Minor
>
> Right now the AuroraExecutor takes runner_provider, sandbox_provider and 
> status_providers.  These need to be the following:
>   - runner_provider: TaskRunnerProvider (assigned_task -> TaskRunner)
>   - status_providers: list(StatusCheckerProvider) (assigned_task -> 
> StatusChecker)
>   - sandbox_provider: SandboxProvider (assigned_task -> SandboxInterface)
> These are generic enough that we should allow these to be specified on the 
> command line as entry points, for example, something like:
> {noformat}
>   --runner_provider 
> apache.aurora.executor.thermos_runner:ThermosTaskRunnerProvider
>   --status_provider 
> apache.aurora.executor.common.health_checker:HealthCheckerProvider
>   --status_provider myorg.zookeeper:ZkAnnouncerProvider
>   --sandbox_provider myorg.docker:DockerSandboxProvider
> {noformat}
> Then have these loaded up using pkg_resources.EntryPoint.  These plugins can 
> either be linked into the .pex or injected onto the PYTHONPATH of the 
> executor.
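
For reference, a minimal sketch of how an entry-point string in the shape of 
the flags above could be resolved (load_provider is a hypothetical helper; 
the real executor wiring may differ):

{code}
from pkg_resources import EntryPoint

def load_provider(spec):
    """spec has the 'module.path:ClassName' shape used by the flags above."""
    # EntryPoint.parse() expects 'name = module:attr'; resolve() imports the
    # module and returns the named attribute (older setuptools used load()).
    return EntryPoint.parse('provider = %s' % spec).resolve()

provider_cls = load_provider(
    'apache.aurora.executor.common.health_checker:HealthCheckerProvider')
{code}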



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1224) Add a new "min_consecutive_health_checks" setting in .aurora config

2016-09-20 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507123#comment-15507123
 ] 

Kai Huang commented on AURORA-1224:
---

I would propose making the following changes to the client:

1. Add min_consecutive_health_checks to the HealthCheckConfig struct. 
(Default = 1)

2. Add the following constraint:
   initial_interval_secs >= interval_secs * min_consecutive_health_checks

3. Remove the client constraint:
   watch_secs >= initial_interval_secs + (max_consecutive_failures * 
interval_secs)

4. Add a new constraint (OPTIONAL):
If watch_secs is 0, health checking should be enabled (either by providing 
a "health" port or a ShellHealthChecker). Otherwise the instance will always 
be marked as healthy. (See the sketch after this list.)
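
A hypothetical sketch of what constraints (2) and (4) could look like as 
client-side validation (names assumed, not the actual Aurora client code):

{code}
def validate_health_check_config(initial_interval_secs, interval_secs,
                                 min_consecutive_health_checks,
                                 watch_secs, health_check_enabled):
    # Constraint (2): the initial interval must be long enough to observe
    # the required number of consecutive health checks.
    if initial_interval_secs < interval_secs * min_consecutive_health_checks:
        raise ValueError('initial_interval_secs must be >= '
                         'interval_secs * min_consecutive_health_checks')
    # Constraint (4): watch_secs == 0 only makes sense with health checking
    # enabled; otherwise every instance is immediately considered healthy.
    if watch_secs == 0 and not health_check_enabled:
        raise ValueError('watch_secs == 0 requires health checking to be enabled')
{code}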





> Add a new "min_consecutive_health_checks" setting in .aurora config
> ---
>
> Key: AURORA-1224
> URL: https://issues.apache.org/jira/browse/AURORA-1224
> Project: Aurora
>  Issue Type: Task
>  Components: Client, Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> HealthCheckConfig should accept a new configuration value that will tell how 
> many positive consecutive health checks an instance requires to move from 
> STARTING to RUNNING.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1775) Updating pants to 1.1.0-rc7 breaks the make-pycharm-virtualenv script

2016-09-16 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1775:
--
Description: 
When I run the build-support/python/make-pycharm-virtualenv script locally, 
exceptions are thrown at: 
https://github.com/apache/aurora/blob/e67c6a732c00786bc74d63d4fb2b9f5f398c5435/build-support/python/make-pycharm-virtualenv#L24

"./pants dependencies --external-only src/test/python::" no longer seems to 
work for pants 1.1.0-rc7. As a result, the PyCharm project will be generated 
with an incomplete list of packages, which prevents developers from debugging 
Python code and running tests in the IDE. (pants test via the command line 
still works.)

I was wondering if someone could work on this and fix the issue? Thanks!

  was:
When I run the build-support/python/make-pycharm-virtualenv script locally, 
exceptions are thrown at: 
https://github.com/apache/aurora/blob/e67c6a732c00786bc74d63d4fb2b9f5f398c5435/build-support/python/make-pycharm-virtualenv#L24

"./pants dependencies --external-only src/test/python::" no longer seems to 
work for pants 1.1.0-rc7. As a result, the PyCharm project will be generated 
with an incomplete list of packages, which prevents developers from debugging 
Python code and running tests in the IDE.

I was wondering if someone could work on this and fix the issue? Thanks!


> Updating pants to 1.1.0-rc7 breaks the make-pycharm-virtualenv script
> -
>
> Key: AURORA-1775
> URL: https://issues.apache.org/jira/browse/AURORA-1775
> Project: Aurora
>  Issue Type: Bug
>Reporter: Kai Huang
>
> When I run the build-support/python/make-pycharm-virtualenv script locally, 
> exceptions are thrown at: 
> https://github.com/apache/aurora/blob/e67c6a732c00786bc74d63d4fb2b9f5f398c5435/build-support/python/make-pycharm-virtualenv#L24
> "./pants dependencies --external-only src/test/python::" no longer seems 
> to work for pants 1.1.0-rc7. As a result, the PyCharm project will be 
> generated with an incomplete list of packages, which prevents developers 
> from debugging Python code and running tests in the IDE. (pants test via 
> the command line still works.)
> I was wondering if someone could work on this and fix the issue? Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1775) Updating pants to 1.1.0-rc7 breaks the make-pycharm-virtualenv script

2016-09-16 Thread Kai Huang (JIRA)
Kai Huang created AURORA-1775:
-

 Summary: Updating pants to 1.1.0-rc7 breaks the 
make-pycharm-virtualenv script
 Key: AURORA-1775
 URL: https://issues.apache.org/jira/browse/AURORA-1775
 Project: Aurora
  Issue Type: Bug
Reporter: Kai Huang


When I run the build-support/python/make-pycharm-virtualenv script locally, 
exceptions are thrown at: 
https://github.com/apache/aurora/blob/e67c6a732c00786bc74d63d4fb2b9f5f398c5435/build-support/python/make-pycharm-virtualenv#L24

"./pants dependencies --external-only src/test/python::" no longer seems to 
work for pants 1.1.0-rc7. As a result, the PyCharm project will be generated 
with an incomplete list of packages, which prevents developers from debugging 
Python code and running tests in the IDE.

I was wondering if someone could work on this and fix the issue? Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)

2016-09-12 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1225:
--
Sprint: Twitter Aurora Q2'16 Sprint 21

> Modify executor state transition logic to rely on health checks (if enabled)
> 
>
> Key: AURORA-1225
> URL: https://issues.apache.org/jira/browse/AURORA-1225
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)

2016-09-08 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471115#comment-15471115
 ] 

Kai Huang edited comment on AURORA-1225 at 9/8/16 7:52 PM:
---

Currently, aurora_executor passes a callback to the status manager. When a 
StatusResult is returned, the callback function shuts down the status manager. 
We need to modify the callback function so that it won't shut down the status 
manager if the status in the StatusResult is TASK_RUNNING.
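
A rough sketch of the intended callback behavior (assumed shapes, not the 
actual aurora_executor code):

{code}
TERMINAL_STATES = frozenset(
    ['TASK_FAILED', 'TASK_FINISHED', 'TASK_KILLED', 'TASK_LOST'])

def on_status_result(status_result, send_update, shutdown_status_manager):
    send_update(status_result)  # always forward the status to the scheduler
    # Previously any StatusResult shut the status manager down; now
    # TASK_RUNNING keeps it alive so health checking can continue.
    if status_result.status in TERMINAL_STATES:
        shutdown_status_manager()
{code}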


was (Author: kaih):
Currently, aurora_executor sends the task status updates to the scheduler, 
whereas the health checker performs the health checks. When the required 
number of successful health checks is reached, the health checker needs to 
notify aurora_executor about the state change. 

To solve this problem, we can implement an event listener in the Aurora 
executor that listens for a "TASK_RUNNING_TRANSITION" event dispatched by the 
health checker. 

Here is an example implementation of an event dispatcher and event 
listener: http://code.activestate.com/recipes/577432-simple-event-dispatcher/

> Modify executor state transition logic to rely on health checks (if enabled)
> 
>
> Key: AURORA-1225
> URL: https://issues.apache.org/jira/browse/AURORA-1225
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (AURORA-1223) Modify scheduler updater to not use "watch_secs" for health-check enabled jobs

2016-09-08 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15473026#comment-15473026
 ] 

Kai Huang edited comment on AURORA-1223 at 9/8/16 7:03 AM:
---

Modified the assertion on watch_secs in the scheduler so that it accepts 0 as 
a watch_secs value. For health-check enabled jobs, watch_secs is set to 0, so 
the scheduler will not use watch_secs for job updates.


was (Author: kaih):
Modified the assertion on watch_secs in the scheduler so that it accepts 0 as 
a watch_secs value. If watch_secs is 0, the scheduler will not use watch_secs 
for job updates.

> Modify scheduler updater to not use "watch_secs" for health-check enabled jobs
> --
>
> Key: AURORA-1223
> URL: https://issues.apache.org/jira/browse/AURORA-1223
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> When health checks are enabled in a job config, scheduler updater should 
> ignore "watch_secs" UpdateConfig value. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (AURORA-1223) Modify scheduler updater to not use "watch_secs" for health-check enabled jobs

2016-09-08 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1223:
--
Comment: was deleted

(was: Found a watch_secs constraint to relax on the scheduler side.)

> Modify scheduler updater to not use "watch_secs" for health-check enabled jobs
> --
>
> Key: AURORA-1223
> URL: https://issues.apache.org/jira/browse/AURORA-1223
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> When health checks are enabled in a job config, scheduler updater should 
> ignore "watch_secs" UpdateConfig value. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (AURORA-1223) Modify scheduler updater to not use "watch_secs" for health-check enabled jobs

2016-09-08 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15473026#comment-15473026
 ] 

Kai Huang edited comment on AURORA-1223 at 9/8/16 7:03 AM:
---

Modified the assertion on watch_secs in the scheduler so that it accepts 0 as 
a watch_secs value. If watch_secs is 0, the scheduler will not use watch_secs 
for job updates.


was (Author: kaih):
Modified the assertion on watch_secs in the scheduler so that it accepts 0 as 
a watch_secs value.

> Modify scheduler updater to not use "watch_secs" for health-check enabled jobs
> --
>
> Key: AURORA-1223
> URL: https://issues.apache.org/jira/browse/AURORA-1223
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> When health checks are enabled in a job config, scheduler updater should 
> ignore "watch_secs" UpdateConfig value. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (AURORA-1223) Modify scheduler updater to not use "watch_secs" for health-check enabled jobs

2016-09-08 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang resolved AURORA-1223.
---
Resolution: Fixed

Modified the assertion on watch_secs in the scheduler so that it accepts 0 as 
a watch_secs value.

> Modify scheduler updater to not use "watch_secs" for health-check enabled jobs
> --
>
> Key: AURORA-1223
> URL: https://issues.apache.org/jira/browse/AURORA-1223
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> When health checks are enabled in a job config, scheduler updater should 
> ignore "watch_secs" UpdateConfig value. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)

2016-09-07 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471115#comment-15471115
 ] 

Kai Huang edited comment on AURORA-1225 at 9/7/16 4:52 PM:
---

Currently, aurora_executor sends the task status updates to the scheduler, 
whereas the health checker performs the health checks. When the required 
number of successful health checks is reached, the health checker needs to 
notify aurora_executor about the state change. 

To solve this problem, we can implement an event listener in the Aurora 
executor that listens for a "TASK_RUNNING_TRANSITION" event dispatched by the 
health checker. 

Here is an example implementation of an event dispatcher and event 
listener: http://code.activestate.com/recipes/577432-simple-event-dispatcher/


was (Author: kaih):
Currently, aurora_executor sends the task status updates to the scheduler, 
whereas the health checker performs the health checks. When the required 
number of successful health checks is reached, the task state transitions to 
RUNNING. Therefore, the health checker needs to notify aurora_executor about 
the state change. 
To solve this problem, we can implement an event listener in the Aurora 
executor that listens for a "TASK_RUNNING_TRANSITION" event dispatched by the 
health checker. 
Here is an example implementation of an event dispatcher and event 
listener: http://code.activestate.com/recipes/577432-simple-event-dispatcher/

> Modify executor state transition logic to rely on health checks (if enabled)
> 
>
> Key: AURORA-1225
> URL: https://issues.apache.org/jira/browse/AURORA-1225
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)

2016-09-07 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471115#comment-15471115
 ] 

Kai Huang commented on AURORA-1225:
---

Currently, aurora_executor sends the task status updates to the scheduler, 
whereas the health checker performs the health checks. When the required 
number of successful health checks is reached, the task state transitions to 
RUNNING. Therefore, the health checker needs to notify aurora_executor about 
the state change. 
To solve this problem, we can implement an event listener in the Aurora 
executor that listens for a "TASK_RUNNING_TRANSITION" event dispatched by the 
health checker. 
Here is an example implementation of an event dispatcher and event 
listener: http://code.activestate.com/recipes/577432-simple-event-dispatcher/
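
In the spirit of the linked recipe, a bare-bones dispatcher/listener pair 
(hypothetical names; the real executor integration would differ):

{code}
from collections import defaultdict

class EventDispatcher(object):
    def __init__(self):
        self._listeners = defaultdict(list)

    def register(self, event_name, callback):
        self._listeners[event_name].append(callback)

    def dispatch(self, event_name, *args, **kwargs):
        for callback in self._listeners[event_name]:
            callback(*args, **kwargs)

# Health checker side, once enough successful checks have been observed:
#   dispatcher.dispatch('TASK_RUNNING_TRANSITION')
# Executor side, registered at startup:
#   dispatcher.register('TASK_RUNNING_TRANSITION', on_task_running)
{code}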

> Modify executor state transition logic to rely on health checks (if enabled)
> 
>
> Key: AURORA-1225
> URL: https://issues.apache.org/jira/browse/AURORA-1225
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (AURORA-1221) Modify task state machine to treat STARTING as a new active state

2016-09-06 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468426#comment-15468426
 ] 

Kai Huang edited comment on AURORA-1221 at 9/6/16 8:27 PM:
---

There are two side-effects if we add the STARTING state to the LIVE_STATES 
thrift constant.

1. aurora job create --wait-until=RUNNING will finish waiting when a task 
reaches the STARTING state (instead of RUNNING).

2. aurora task commands will now also work for STARTING tasks.

For now, we do NOT treat STARTING as a live state. It makes more sense to 
preserve the original meaning of the above two commands, especially for 
"aurora task ssh", given that the sandbox is being initialized in the 
STARTING state.


was (Author: kaih):
There are two side-effects after we add the STARTING state to the LIVE_STATES 
thrift constant.

aurora job create --wait-until=RUNNING will finish waiting when a task reaches 
the STARTING state (instead of RUNNING).

aurora task commands will now also work for STARTING tasks.

> Modify task state machine to treat STARTING as a new active state
> -
>
> Key: AURORA-1221
> URL: https://issues.apache.org/jira/browse/AURORA-1221
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> Scheduler needs to treat STARTING as the new live state. 
> Open: should we treat STARTING as a transient state with general timeout 
> (currently 5 minutes) or treat it as a persistent live state instead?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (AURORA-1222) Modify stats and SLA metrics to properly account for STARTING

2016-09-06 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang resolved AURORA-1222.
---
Resolution: Fixed

> Modify stats and SLA metrics to properly account for STARTING
> -
>
> Key: AURORA-1222
> URL: https://issues.apache.org/jira/browse/AURORA-1222
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> Both platform and job uptime calculations will be affected by treating 
> STARTING as a new live state. Also, a new MTTS (Median Time To Starting) 
> metric would be great to have in addition to MTTA and MTTR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (AURORA-1222) Modify stats and SLA metrics to properly account for STARTING

2016-09-06 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468433#comment-15468433
 ] 

Kai Huang edited comment on AURORA-1222 at 9/6/16 8:19 PM:
---

After discussion with Maxim, I decided not to account for STARTING in the 
platform and job uptime calculations. As a result, I added a new MTTS (Median 
Time To Starting) metric in the sla module. See 
https://reviews.apache.org/r/51580/.


was (Author: kaih):
After discussion with Maxim, I decided not to account for STARTING in the 
platform and job uptime calculations.
As a result, I added a new MTTS (Median Time To Starting) metric in the sla 
module. See https://reviews.apache.org/r/51580/.

> Modify stats and SLA metrics to properly account for STARTING
> -
>
> Key: AURORA-1222
> URL: https://issues.apache.org/jira/browse/AURORA-1222
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> Both platform and job uptime calculations will be affected by treating 
> STARTING as a new live state. Also, a new MTTS (Median Time To Starting) 
> metric would be great to have in addition to MTTA and MTTR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1222) Modify stats and SLA metrics to properly account for STARTING

2016-09-06 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468433#comment-15468433
 ] 

Kai Huang commented on AURORA-1222:
---

After discussion with Maxim, I decided not to account for STARTING in the 
platform and job uptime calculations.
As a result, I added a new MTTS (Median Time To Starting) metric in the sla 
module. See https://reviews.apache.org/r/51580/.

> Modify stats and SLA metrics to properly account for STARTING
> -
>
> Key: AURORA-1222
> URL: https://issues.apache.org/jira/browse/AURORA-1222
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> Both platform and job uptime calculations will be affected by treating 
> STARTING as a new live state. Also, a new MTTS (Median Time To Starting) 
> metric would be great to have in addition to MTTA and MTTR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1221) Modify task state machine to treat STARTING as a new active state

2016-09-06 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468426#comment-15468426
 ] 

Kai Huang commented on AURORA-1221:
---

There are two side-effects after we add the STARTING state to the LIVE_STATES 
thrift constant.

aurora job create --wait-until=RUNNING will finish waiting when a task reaches 
the STARTING state (instead of RUNNING).

aurora task commands will now also work for STARTING tasks.

> Modify task state machine to treat STARTING as a new active state
> -
>
> Key: AURORA-1221
> URL: https://issues.apache.org/jira/browse/AURORA-1221
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> Scheduler needs to treat STARTING as the new live state. 
> Open: should we treat STARTING as a transient state with general timeout 
> (currently 5 minutes) or treat it as a persistent live state instead?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (AURORA-1223) Modify scheduler updater to not use "watch_secs" for health-check enabled jobs

2016-09-06 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang reopened AURORA-1223:
---

Found a watch_secs constraint to relax on the scheduler side.

> Modify scheduler updater to not use "watch_secs" for health-check enabled jobs
> --
>
> Key: AURORA-1223
> URL: https://issues.apache.org/jira/browse/AURORA-1223
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> When health checks are enabled in a job config, scheduler updater should 
> ignore "watch_secs" UpdateConfig value. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (AURORA-1223) Modify scheduler updater to not use "watch_secs" for health-check enabled jobs

2016-09-06 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468067#comment-15468067
 ] 

Kai Huang edited comment on AURORA-1223 at 9/6/16 6:19 PM:
---

After discussion on the Aurora dev list, we decided to keep the watch_secs 
infrastructure on the scheduler side.

Our final conclusion is to adopt the following implementation:

If users want purely health-check-driven updates, they can set watch_secs to 0 
and enable health checks.

If they want both health-check-driven and time-driven updates, they can set 
watch_secs to the time they care about and also perform health checks in the 
STARTING state.

If they want only time-driven updates, they can disable health checking and 
set watch_secs to the time they care about.

Only one scheduler change is required: the scheduler currently does not accept 
a zero value for watch_secs, and we need to relax this constraint.
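
As a hedged illustration of the first mode, a .aurora config might look like 
the sketch below. hello_task and the job identity are placeholders, and the 
field names follow the UpdateConfig/HealthCheckConfig schema of this era, so 
treat exact values as assumptions rather than recommendations.

{noformat}
jobs = [Service(
  cluster = 'devcluster',
  role = 'www-data',
  environment = 'prod',
  name = 'hello',
  task = hello_task,  # assumed to be defined earlier in the config
  update_config = UpdateConfig(
    batch_size = 1,
    watch_secs = 0,   # purely health-check-driven: no fixed watch window
  ),
  health_check_config = HealthCheckConfig(
    initial_interval_secs = 10,
    interval_secs = 10,
    max_consecutive_failures = 3,
  ),
)]
{noformat}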


was (Author: kaih):
After discussion on the Aurora dev list, it turns out there will be no 
scheduler-side change associated with this issue.

Our final conclusion is to adopt the following implementation:

If users want purely health-check-driven updates, they can set watch_secs to 0 
and enable health checks. (watch_secs=0 is not allowed on the client side; we 
will relax this constraint after we modify the executor. However, no scheduler 
change is required since the scheduler allows non-negative values for 
watch_secs.)

If they want both health-check-driven and time-driven updates, they can set 
watch_secs to the time they care about and also perform health checks in the 
STARTING state.

If they want only time-driven updates, they can disable health checking and 
set watch_secs to the time they care about.

> Modify scheduler updater to not use "watch_secs" for health-check enabled jobs
> --
>
> Key: AURORA-1223
> URL: https://issues.apache.org/jira/browse/AURORA-1223
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> When health checks are enabled in a job config, scheduler updater should 
> ignore "watch_secs" UpdateConfig value. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (AURORA-1223) Modify scheduler updater to not use "watch_secs" for health-check enabled jobs

2016-09-06 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang resolved AURORA-1223.
---
Resolution: Fixed

After discussion on the Aurora dev list, it turns out there will be no 
scheduler-side change associated with this issue.

Our final conclusion is to adopt the following implementation:

If users want purely health-check-driven updates, they can set watch_secs to 0 
and enable health checks. (watch_secs=0 is not allowed on the client side; we 
will relax this constraint after we modify the executor. However, no scheduler 
change is required since the scheduler allows non-negative values for 
watch_secs.)

If they want both health-check-driven and time-driven updates, they can set 
watch_secs to the time they care about and also perform health checks in the 
STARTING state.

If they want only time-driven updates, they can disable health checking and 
set watch_secs to the time they care about.

> Modify scheduler updater to not use "watch_secs" for health-check enabled jobs
> --
>
> Key: AURORA-1223
> URL: https://issues.apache.org/jira/browse/AURORA-1223
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> When health checks are enabled in a job config, scheduler updater should 
> ignore "watch_secs" UpdateConfig value. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1222) Modify stats and SLA metrics to properly account for STARTING

2016-09-02 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1222:
--
Sprint: Twitter Aurora Q2'16 Sprint 20

> Modify stats and SLA metrics to properly account for STARTING
> -
>
> Key: AURORA-1222
> URL: https://issues.apache.org/jira/browse/AURORA-1222
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> Both platform and job uptime calculations will be affected by treating 
> STARTING as a new live state. Also, a new MTTS (Median Time To Starting) 
> metric would be great to have in addition to MTTA and MTTR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)

2016-08-30 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang reassigned AURORA-1225:
-

Assignee: Kai Huang

> Modify executor state transition logic to rely on health checks (if enabled)
> 
>
> Key: AURORA-1225
> URL: https://issues.apache.org/jira/browse/AURORA-1225
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when the required number of successful health checks is reached.
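
A minimal Python sketch of that transition rule, with hypothetical helper 
names (the real logic belongs in the executor's health-checker and status 
modules):

{noformat}
import time

def await_running(check_health, required_successes, interval_secs=10):
    # Stay in STARTING while health-checking; report RUNNING only after
    # `required_successes` consecutive successful checks.
    consecutive = 0
    while True:
        consecutive = consecutive + 1 if check_health() else 0
        if consecutive >= required_successes:
            return "RUNNING"  # the executor would send TASK_RUNNING here
        time.sleep(interval_secs)

print(await_running(lambda: True, required_successes=2, interval_secs=0))
{noformat}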



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (AURORA-1221) Modify task state machine to treat STARTING as a new active state

2016-08-30 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang reassigned AURORA-1221:
-

Assignee: Kai Huang

> Modify task state machine to treat STARTING as a new active state
> -
>
> Key: AURORA-1221
> URL: https://issues.apache.org/jira/browse/AURORA-1221
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> Scheduler needs to treat STARTING as the new live state. 
> Open: should we treat STARTING as a transient state with a general timeout 
> (currently 5 minutes) or treat it as a persistent live state instead?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (AURORA-1224) Add a new "min_consecutive_health_checks" setting in .aurora config

2016-08-30 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang reassigned AURORA-1224:
-

Assignee: Kai Huang

> Add a new "min_consecutive_health_checks" setting in .aurora config
> ---
>
> Key: AURORA-1224
> URL: https://issues.apache.org/jira/browse/AURORA-1224
> Project: Aurora
>  Issue Type: Task
>  Components: Client, Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> HealthCheckConfig should accept a new configuration value that specifies how 
> many consecutive successful health checks an instance requires to move from 
> STARTING to RUNNING.
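
As a hedged sketch, the proposed setting might surface in a .aurora config 
like this. The field name below is the one proposed by this ticket; the name 
that eventually ships may differ, and the other fields are illustrative.

{noformat}
health_check_config = HealthCheckConfig(
  initial_interval_secs = 10,
  interval_secs = 10,
  max_consecutive_failures = 3,
  min_consecutive_health_checks = 2,  # proposed: successes needed to leave STARTING
)
{noformat}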



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1222) Modify stats and SLA metrics to properly account for STARTING

2016-08-30 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1222:
--
Sprint:   (was: Twitter Aurora Q2'16 Sprint 20)

> Modify stats and SLA metrics to properly account for STARTING
> -
>
> Key: AURORA-1222
> URL: https://issues.apache.org/jira/browse/AURORA-1222
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>
> Both platform and job uptime calculations will be affected by treating 
> STARTING as a new live state. Also, a new MTTS (Median Time To Starting) 
> metric would be great to have in addition to MTTA and MTTR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (AURORA-1222) Modify stats and SLA metrics to properly account for STARTING

2016-08-30 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang reassigned AURORA-1222:
-

Assignee: Kai Huang

> Modify stats and SLA metrics to properly account for STARTING
> -
>
> Key: AURORA-1222
> URL: https://issues.apache.org/jira/browse/AURORA-1222
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> Both platform and job uptime calculations will be affected by treating 
> STARTING as a new live state. Also, a new MTTS (Median Time To Starting) 
> metric would be great to have in addition to MTTA and MTTR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1222) Modify stats and SLA metrics to properly account for STARTING

2016-08-30 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1222:
--
Sprint: Twitter Aurora Q2'16 Sprint 20

> Modify stats and SLA metrics to properly account for STARTING
> -
>
> Key: AURORA-1222
> URL: https://issues.apache.org/jira/browse/AURORA-1222
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>
> Both platform and job uptime calculations will be affected by treating 
> STARTING as a new live state. Also, a new MTTS (Median Time To Starting) 
> metric would be great to have in addition to MTTA and MTTR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1223) Modify scheduler updater to not use "watch_secs" for health-check enabled jobs

2016-08-30 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1223:
--
Story Points: 5

> Modify scheduler updater to not use "watch_secs" for health-check enabled jobs
> --
>
> Key: AURORA-1223
> URL: https://issues.apache.org/jira/browse/AURORA-1223
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> When health checks are enabled in a job config, scheduler updater should 
> ignore "watch_secs" UpdateConfig value. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1223) Modify scheduler updater to not use "watch_secs" for health-check enabled jobs

2016-08-29 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1223:
--
Assignee: Kai Huang

> Modify scheduler updater to not use "watch_secs" for health-check enabled jobs
> --
>
> Key: AURORA-1223
> URL: https://issues.apache.org/jira/browse/AURORA-1223
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> When health checks are enabled in a job config, scheduler updater should 
> ignore "watch_secs" UpdateConfig value. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)