[jira] [Assigned] (AURORA-1426) thermos kill hangs when killing a Docker-containerized task
[ https://issues.apache.org/jira/browse/AURORA-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang reassigned AURORA-1426: - Assignee: (was: Kai Huang) > thermos kill hangs when killing a Docker-containerized task > --- > > Key: AURORA-1426 > URL: https://issues.apache.org/jira/browse/AURORA-1426 > Project: Aurora > Issue Type: Bug > Components: Docker, Executor >Reporter: Kevin Sweeney > > When presented with a Docker task as an argument in the Vagrant environment, it > hangs: > {noformat} > + sudo thermos kill > 1438642104346-vagrant-test-http_example_docker-0-7ce56f48-3a55-413c-885e-8e4218afd582 > I0803 22:48:36.449739 15926 helper.py:276] Found existing runner, forcing > leadership forfeit. > I0803 22:48:36.450725 15926 helper.py:279] Successfully killed leader. > (hangs here) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (AURORA-216) allow aurora executor to be customized via the commandline
[ https://issues.apache.org/jira/browse/AURORA-216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang reassigned AURORA-216: Assignee: (was: Kai Huang) > allow aurora executor to be customized via the commandline > -- > > Key: AURORA-216 > URL: https://issues.apache.org/jira/browse/AURORA-216 > Project: Aurora > Issue Type: Story > Components: Executor >Reporter: brian wickman >Priority: Minor > > Right now the AuroraExecutor takes runner_provider, sandbox_provider and > status_providers. These need to be the following: > - runner_provider: TaskRunnerProvider (assigned_task -> TaskRunner) > - status_providers: list(StatusCheckerProvider) (assigned_task -> > StatusChecker) > - sandbox_provider: SandboxProvider (assigned_task -> SandboxInterface) > These are generic enough that we should allow these to be specified on the > command line as entry points, for example, something like: > {noformat} > --runner_provider > apache.aurora.executor.thermos_runner:ThermosTaskRunnerProvider > --status_provider > apache.aurora.executor.common.health_checker:HealthCheckerProvider > --status_provider myorg.zookeeper:ZkAnnouncerProvider > --sandbox_provider myorg.docker:DockerSandboxProvider > {noformat} > Then have these loaded up using pkg_resources.EntryPoint. These plugins can > either be linked into the .pex or injected onto the PYTHONPATH of the > executor. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
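The entry-point loading the ticket proposes could be sketched in Python roughly as follows. This is a sketch only: `load_provider` is a hypothetical helper (not existing Aurora code), and the stdlib class `json.decoder:JSONDecoder` merely stands in for a real plugin such as `myorg.docker:DockerSandboxProvider`.

```python
import pkg_resources

def load_provider(spec, name="provider"):
    # Parse a "module:ClassName" flag value (e.g. the proposed
    # --sandbox_provider myorg.docker:DockerSandboxProvider) into a
    # pkg_resources.EntryPoint and resolve it to the class it names.
    entry_point = pkg_resources.EntryPoint.parse("%s = %s" % (name, spec))
    return entry_point.resolve()

# A stdlib class stands in here for a provider plugin that would be
# linked into the .pex or placed on the executor's PYTHONPATH:
provider_cls = load_provider("json.decoder:JSONDecoder")
```

Resolution happens at executor startup, so a bad spec fails fast with an ImportError rather than surfacing mid-task.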
[jira] [Resolved] (AURORA-1934) Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks
[ https://issues.apache.org/jira/browse/AURORA-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang resolved AURORA-1934. --- Resolution: Fixed > Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks > --- > > Key: AURORA-1934 > URL: https://issues.apache.org/jira/browse/AURORA-1934 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Kai Huang >Assignee: Kai Huang >Priority: Minor > > Aurora Scheduler has a webhook module that watches all TaskStateChanges and > sends events to a configured endpoint. This floods the endpoint with a lot of > noise if we only care about certain types of TaskStateChange events (e.g. a > task state change from RUNNING -> LOST). We should allow the Aurora > administrator to provide a whitelist of task state change event types in > their webhook configuration, so that the webhook will only send these events > to the configured endpoint. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
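The proposed whitelist amounts to a set-membership check on the state transition before an event is posted. A minimal Python sketch (the real webhook module is Java in the scheduler; the transitions and names below are illustrative):

```python
# Transitions the administrator cares about; anything else is dropped
# instead of flooding the webhook endpoint.
WHITELIST = {("RUNNING", "LOST"), ("RUNNING", "FAILED")}

def should_post(old_state, new_state, whitelist=WHITELIST):
    # Only whitelisted TaskStateChange transitions reach the endpoint.
    return (old_state, new_state) in whitelist
```

An empty whitelist would then mean "post nothing", so deployments wanting the old fire-everything behaviour would need an explicit all-transitions configuration.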
[jira] [Reopened] (AURORA-1934) Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks
[ https://issues.apache.org/jira/browse/AURORA-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang reopened AURORA-1934: --- > Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks > --- > > Key: AURORA-1934 > URL: https://issues.apache.org/jira/browse/AURORA-1934 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Kai Huang >Assignee: Kai Huang >Priority: Minor > > Aurora Scheduler has a webhook module that watches all TaskStateChanges and > sends events to a configured endpoint. This floods the endpoint with a lot of > noise if we only care about certain types of TaskStateChange events (e.g. a > task state change from RUNNING -> LOST). We should allow the Aurora > administrator to provide a whitelist of task state change event types in > their webhook configuration, so that the webhook will only send these events > to the configured endpoint. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AURORA-1940) aurora job restart request should be retryable
[ https://issues.apache.org/jira/browse/AURORA-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1940: -- Description: There was a recent change to the Aurora client to provide "at most once" instead of "at least once" retries for non-idempotent operations. See: https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6 `aurora job restart` is a non-idempotent operation, thus it was not retried. However, when a transport exception occurs, the operator has to babysit simple operations like aurora job restart if they are not retried. Compared to the requests that were causing problems (admin tasks, job creation, updates, etc.), restarts in general should be retried rather than erring on the side of caution. was: There was a recent change to the Aurora client to provide "at most once" instead of "at least once" retries for non-idempotent operations. See: https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6 `aurora job restart` is a non-idempotent operation, thus it was not retried. When there is a transport exception, the operator has to babysit simple operations like aurora job restart if they are not retried. Compared to the requests that were causing problems (admin tasks, job creation, updates, etc.), restarts in general should be retried rather than erring on the side of caution. > aurora job restart request should be retryable > -- > > Key: AURORA-1940 > URL: https://issues.apache.org/jira/browse/AURORA-1940 > Project: Aurora > Issue Type: Task >Reporter: Kai Huang >Assignee: Kai Huang >Priority: Minor > > There was a recent change to the Aurora client to provide "at most once" > instead of "at least once" retries for non-idempotent operations. See: > https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6 > `aurora job restart` is a non-idempotent operation, thus it was not retried. 
> However, when a transport exception occurs, the operator has to babysit > simple operations like aurora job restart if they are not retried. Compared to > the requests that were causing problems (admin tasks, job creation, updates, > etc.), restarts in general should be retried rather than erring on the side > of caution. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AURORA-1940) aurora job restart request should be retryable
[ https://issues.apache.org/jira/browse/AURORA-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1940: -- Description: There was a recent change to the Aurora client to provide "at most once" instead of "at least once" retries for non-idempotent operations. See: https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6 `aurora job restart` is a non-idempotent operation, thus it was not retried. When there is a transport exception, the operator has to babysit simple operations like aurora job restart if they are not retried. Compared to the requests that were causing problems (admin tasks, job creation, updates, etc.), restarts in general should be retried rather than erring on the side of caution. was: There was a recent change to the Aurora client to provide "at most once" instead of "at least once" retries for non-idempotent operations. See: https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6 `aurora job restart` is a non-idempotent operation, thus it was not retried. However, during a scheduler failover, the operator has to babysit simple operations like aurora job restart if they are not retried. Compared to the requests that were causing problems (admin tasks, job creation, updates, etc.), restarts in general should be retried rather than erring on the side of caution. > aurora job restart request should be retryable > -- > > Key: AURORA-1940 > URL: https://issues.apache.org/jira/browse/AURORA-1940 > Project: Aurora > Issue Type: Task >Reporter: Kai Huang >Assignee: Kai Huang >Priority: Minor > > There was a recent change to the Aurora client to provide "at most once" > instead of "at least once" retries for non-idempotent operations. See: > https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6 > `aurora job restart` is a non-idempotent operation, thus it was not retried. 
> When there is a transport exception, the operator has to babysit simple > operations like aurora job restart if they are not retried. Compared to the > requests that were causing problems (admin tasks, job creation, updates, > etc.), restarts in general should be retried rather than erring on the side > of caution. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AURORA-1940) aurora job restart request should be retryable
Kai Huang created AURORA-1940: - Summary: aurora job restart request should be retryable Key: AURORA-1940 URL: https://issues.apache.org/jira/browse/AURORA-1940 Project: Aurora Issue Type: Task Reporter: Kai Huang Priority: Minor There was a recent change to the Aurora client to provide "at most once" instead of "at least once" retries for non-idempotent operations. See: https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6 `aurora job restart` is a non-idempotent operation, thus it was not retried. However, during a scheduler failover, the operator has to babysit simple operations like aurora job restart if they are not retried. Compared to the requests that were causing problems (admin tasks, job creation, updates, etc.), restarts in general should be retried rather than erring on the side of caution. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (AURORA-1940) aurora job restart request should be retryable
[ https://issues.apache.org/jira/browse/AURORA-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang reassigned AURORA-1940: - Assignee: Kai Huang > aurora job restart request should be retryable > -- > > Key: AURORA-1940 > URL: https://issues.apache.org/jira/browse/AURORA-1940 > Project: Aurora > Issue Type: Task >Reporter: Kai Huang >Assignee: Kai Huang >Priority: Minor > > There was a recent change to the Aurora client to provide "at most once" > instead of "at least once" retries for non-idempotent operations. See: > https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6 > `aurora job restart` is a non-idempotent operation, thus it was not retried. > However, during a scheduler failover, the operator has to babysit simple > operations like aurora job restart if they are not retried. Compared to the > requests that were causing problems (admin tasks, job creation, updates, > etc.), restarts in general should be retried rather than erring on the side > of caution. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
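The "at least once" behaviour the ticket asks for amounts to retrying the restart request whenever only the transport failed. A hedged sketch of that pattern (a hypothetical helper, not the Aurora client's actual retry code):

```python
import time

class TransportError(Exception):
    # Stands in for a transport-level failure, where the request may
    # never have reached the scheduler at all.
    pass

def call_with_retries(fn, max_attempts=3, backoff_seconds=0.01):
    # Retry fn on transport errors with linear backoff; the final
    # failure is re-raised so the operator still sees it instead of
    # having to babysit a silently dropped restart.
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransportError:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_seconds * attempt)
```

The caveat from the ticket still applies: this is safe for restart precisely because re-sending a restart is tolerable, unlike the non-idempotent create/update requests that motivated the original "at most once" change.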
[jira] [Comment Edited] (AURORA-1937) Add metrics for status updates before switching to V1 Mesos Driver implementation
[ https://issues.apache.org/jira/browse/AURORA-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063422#comment-16063422 ] Kai Huang edited comment on AURORA-1937 at 6/26/17 5:38 PM: Add counter for status_update and framework_message: https://reviews.apache.org/r/60350/ was (Author: kaih): Add counter for status_update and framework_message: https://reviews.apache.org/r/60350/ > Add metrics for status updates before switching to V1 Mesos Driver > implementation > > > Key: AURORA-1937 > URL: https://issues.apache.org/jira/browse/AURORA-1937 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Kai Huang >Assignee: Kai Huang > > Zameer has created a new driver implementation around V1Mesos > (https://reviews.apache.org/r/57061). > The V1 Mesos code requires a Scheduler callback with a different API. To > maximize code reuse, event handling logic was extracted into a > [MesosCallbackHandler | > https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java#L286] > class. However, we do not have the metrics for handling task status updates > in this class. > Metrics around task status updates are key performance indicators for the > scheduler. We need to add the metrics back, in order to switch to the V1Mesos > driver in production. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (AURORA-1937) Add metrics for status updates before switching to V1 Mesos Driver implementation
[ https://issues.apache.org/jira/browse/AURORA-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063483#comment-16063483 ] Kai Huang commented on AURORA-1937: --- Add timing metrics for status_update: https://reviews.apache.org/r/60437/ > Add metrics for status updates before switching to V1 Mesos Driver > implementation > > > Key: AURORA-1937 > URL: https://issues.apache.org/jira/browse/AURORA-1937 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Kai Huang >Assignee: Kai Huang > > Zameer has created a new driver implementation around V1Mesos > (https://reviews.apache.org/r/57061). > The V1 Mesos code requires a Scheduler callback with a different API. To > maximize code reuse, event handling logic was extracted into a > [MesosCallbackHandler | > https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java#L286] > class. However, we do not have the metrics for handling task status updates > in this class. > Metrics around task status updates are key performance indicators for the > scheduler. We need to add the metrics back, in order to switch to the V1Mesos > driver in production. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (AURORA-1937) Add metrics for status updates before switching to V1 Mesos Driver implementation
[ https://issues.apache.org/jira/browse/AURORA-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang reassigned AURORA-1937: - Assignee: Kai Huang > Add metrics for status updates before switching to V1 Mesos Driver > implementation > > > Key: AURORA-1937 > URL: https://issues.apache.org/jira/browse/AURORA-1937 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Kai Huang >Assignee: Kai Huang > > Zameer has created a new driver implementation around V1Mesos > (https://reviews.apache.org/r/57061). > The V1 Mesos code requires a Scheduler callback with a different API. To > maximize code reuse, event handling logic was extracted into a > [MesosCallbackHandler | > https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java#L286] > class. However, we do not have the metrics for handling task status updates > in this class. > Metrics around task status updates are key performance indicators for the > scheduler. We need to add the metrics back, in order to switch to the V1Mesos > driver in production. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AURORA-1937) Add metrics for status updates before switching to V1 Mesos Driver implementation
[ https://issues.apache.org/jira/browse/AURORA-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1937: -- Summary: Add metrics for status updates before switching to V1 Mesos Driver implementation (was: Add metrics for status updates after switching to V1 Mesos Driver implementation) > Add metrics for status updates before switching to V1 Mesos Driver > implementation > > > Key: AURORA-1937 > URL: https://issues.apache.org/jira/browse/AURORA-1937 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Kai Huang > > Zameer has created a new driver implementation around V1Mesos > (https://reviews.apache.org/r/57061). > The V1 Mesos code requires a Scheduler callback with a different API. To > maximize code reuse, event handling logic was extracted into a > [MesosCallbackHandler | > https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java#L286] > class. However, we do not have the metrics for handling task status updates > in this class. > Metrics around task status updates are key performance indicators for the > scheduler. We need to add the metrics back, in order to switch to the V1Mesos > driver in production. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AURORA-1937) Add metrics for status updates after switching to V1 Mesos Driver implementation
Kai Huang created AURORA-1937: - Summary: Add metrics for status updates after switching to V1 Mesos Driver implementation Key: AURORA-1937 URL: https://issues.apache.org/jira/browse/AURORA-1937 Project: Aurora Issue Type: Task Components: Scheduler Reporter: Kai Huang Zameer has created a new driver implementation around V1Mesos (https://reviews.apache.org/r/57061). The V1 Mesos code requires a Scheduler callback with a different API. To maximize code reuse, event handling logic was extracted into a [MesosCallbackHandler | https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java#L286] class. However, we do not have the metrics for handling task status updates in this class. Metrics around task status updates are key performance indicators for the scheduler. We need to add the metrics back, in order to switch to the V1Mesos driver in production. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
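The counters and timing stats the ticket wants around status-update handling can be illustrated with a small Python wrapper. The scheduler itself is Java, so this is only a sketch; the metric and function names below are made up for illustration:

```python
import time
from collections import defaultdict

counters = defaultdict(int)
timings = defaultdict(list)

def handle_status_update(update, handler):
    # Count every update and record how long the handler took: the two
    # kinds of metric the ticket asks to restore around the extracted
    # callback-handler code path.
    counters["scheduler_status_update"] += 1
    start = time.monotonic()
    try:
        return handler(update)
    finally:
        timings["status_update_seconds"].append(time.monotonic() - start)
```

Wrapping the handler (rather than instrumenting each call site) keeps the metrics attached to the shared code path, so both driver implementations report them identically.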
[jira] [Commented] (AURORA-1934) Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks
[ https://issues.apache.org/jira/browse/AURORA-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048550#comment-16048550 ] Kai Huang commented on AURORA-1934: --- https://reviews.apache.org/r/59940/ > Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks > --- > > Key: AURORA-1934 > URL: https://issues.apache.org/jira/browse/AURORA-1934 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Kai Huang >Assignee: Kai Huang >Priority: Minor > > Aurora Scheduler has a webhook module that watches all TaskStateChanges and > sends events to a configured endpoint. This floods the endpoint with a lot of > noise if we only care about certain types of TaskStateChange events (e.g. a > task state change from RUNNING -> LOST). We should allow the Aurora > administrator to provide a whitelist of task state change event types in > their webhook configuration, so that the webhook will only send these events > to the configured endpoint. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AURORA-1934) Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks
[ https://issues.apache.org/jira/browse/AURORA-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1934: -- Priority: Minor (was: Major) > Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks > --- > > Key: AURORA-1934 > URL: https://issues.apache.org/jira/browse/AURORA-1934 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Kai Huang >Assignee: Kai Huang >Priority: Minor > > Aurora Scheduler has a webhook module that watches all TaskStateChanges and > sends events to a configured endpoint. This floods the endpoint with a lot of > noise if we only care about certain types of TaskStateChange events (e.g. a > task state change from RUNNING -> LOST). We should allow the Aurora > administrator to provide a whitelist of task state change event types in > their webhook configuration, so that the webhook will only send these events > to the configured endpoint. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AURORA-1934) Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks
[ https://issues.apache.org/jira/browse/AURORA-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1934: -- Description: Aurora Scheduler has a webhook module that watches all TaskStateChanges and sends events to a configured endpoint. This floods the endpoint with a lot of noise if we only care about certain types of TaskStateChange events (e.g. a task state change from RUNNING -> LOST). We should allow the Aurora administrator to provide a whitelist of task state change event types in their webhook configuration, so that the webhook will only send these events to the configured endpoint. (was: Aurora Scheduler has a webhook module that watches all TaskStateChanges and sends events to a configured endpoint. This generates a lot of noise at the endpoint if we only care about certain types of TaskStateChange, like the transition from RUNNING -> LOST. We should allow the Aurora administrator to provide a whitelist of task state change event types in their webhook configuration, so that the webhook will only send these events to the configured endpoint.) > Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks > --- > > Key: AURORA-1934 > URL: https://issues.apache.org/jira/browse/AURORA-1934 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Kai Huang >Assignee: Kai Huang > > Aurora Scheduler has a webhook module that watches all TaskStateChanges and > sends events to a configured endpoint. This floods the endpoint with a lot of > noise if we only care about certain types of TaskStateChange events (e.g. a > task state change from RUNNING -> LOST). We should allow the Aurora > administrator to provide a whitelist of task state change event types in > their webhook configuration, so that the webhook will only send these events > to the configured endpoint. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AURORA-1934) Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks
Kai Huang created AURORA-1934: - Summary: Add a whitelist for TaskStateChange events in Aurora Scheduler WebHooks Key: AURORA-1934 URL: https://issues.apache.org/jira/browse/AURORA-1934 Project: Aurora Issue Type: Task Components: Scheduler Reporter: Kai Huang Assignee: Kai Huang Aurora Scheduler has a webhook module that watches all TaskStateChanges and sends events to a configured endpoint. This generates a lot of noise at the endpoint if we only care about certain types of TaskStateChange, like the transition from RUNNING -> LOST. We should allow the Aurora administrator to provide a whitelist of task state change event types in their webhook configuration, so that the webhook will only send these events to the configured endpoint. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (AURORA-1929) Improve explicit task history pruning.
[ https://issues.apache.org/jira/browse/AURORA-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035170#comment-16035170 ] Kai Huang commented on AURORA-1929: --- https://reviews.apache.org/r/59699/ > Improve explicit task history pruning. > -- > > Key: AURORA-1929 > URL: https://issues.apache.org/jira/browse/AURORA-1929 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Kai Huang >Assignee: Kai Huang >Priority: Minor > > There are currently two types of task history pruning run by Aurora: > # The implicit task history pruning run by TaskHistoryPrunner in the > background, which registers all inactive tasks upon terminal state change for > pruning. > # The explicit task history pruning initiated by the `aurora_admin prune_tasks` > command, which prunes inactive tasks in the cluster. > The prune_tasks endpoint seems to be very slow when the cluster has a large > number of inactive tasks. > For example, when we use $ aurora_admin prune_tasks for 135k running tasks > (1k jobs), it takes about 30 minutes to prune all tasks; the pruning speed > seems to max out at 3k tasks per minute. > Currently, Aurora uses StreamManager to manage a single log stream append > transaction for task history pruning. Local storage ops can be added to the > transaction and then later committed as an atomic unit. However, the > StateManager removes tasks one by one in a > for-loop (https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376), > and each RemoveTasks operation is coalesced with its previous operation, > which seems inefficient and unnecessary > (https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324). > We need to batch all removeTasks operations and execute them all at once to > avoid the additional cost of coalescing. 
The fix will also benefit implicit task > history pruning since it has a similar underlying implementation. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AURORA-1929) Improve explicit task history pruning.
[ https://issues.apache.org/jira/browse/AURORA-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1929: -- Description: There are currently two types of task history pruning run by Aurora: # The implicit task history pruning run by TaskHistoryPrunner in the background, which registers all inactive tasks upon terminal state change for pruning. # The explicit task history pruning initiated by the `aurora_admin prune_tasks` command, which prunes inactive tasks in the cluster. The prune_tasks endpoint seems to be very slow when the cluster has a large number of inactive tasks. For example, when we use $ aurora_admin prune_tasks for 135k running tasks (1k jobs), it takes about 30 minutes to prune all tasks; the pruning speed seems to max out at 3k tasks per minute. Currently, Aurora uses StreamManager to manage a single log stream append transaction for task history pruning. Local storage ops can be added to the transaction and then later committed as an atomic unit. However, the StateManager removes tasks one by one in a for-loop (https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376), and each RemoveTasks operation is coalesced with its previous operation, which seems inefficient and unnecessary (https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324). We need to batch all removeTasks operations and execute them all at once to avoid the additional cost of coalescing. The fix will also benefit implicit task history pruning since it has a similar underlying implementation. was: There are currently two types of task history pruning run by Aurora: # The implicit task history pruning run by TaskHistoryPrunner in the background, which registers all inactive tasks upon terminal state change for pruning. 
# The explicit task history pruning initiated by the `aurora_admin prune_tasks` command, which prunes inactive tasks in the cluster. The prune_tasks endpoint seems to be very slow when the cluster has a large number of inactive tasks. For example, when we use $ aurora_admin prune_tasks for 135k running tasks (1k jobs), it takes about 30 minutes to prune all tasks; the pruning speed seems to max out at 3k tasks per minute. Currently, Aurora uses StreamManager to manage a single log stream append transaction for task history pruning. Local storage ops (RemoveTasks) can be added to the transaction and then later committed as an atomic unit. However, the current implementation removes tasks one by one in a for-loop (https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376), and coalesces each RemoveTasks operation with its previous operation, which seems inefficient and unnecessary (https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324). We need to batch all removeTasks operations and execute them all at once to avoid the additional cost of coalescing. The fix will also benefit implicit task history pruning since it has a similar underlying implementation. > Improve explicit task history pruning. > -- > > Key: AURORA-1929 > URL: https://issues.apache.org/jira/browse/AURORA-1929 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Kai Huang >Assignee: Kai Huang >Priority: Minor > > There are currently two types of task history pruning run by Aurora: > # The implicit task history pruning run by TaskHistoryPrunner in the > background, which registers all inactive tasks upon terminal state change for > pruning. > # The explicit task history pruning initiated by the `aurora_admin prune_tasks` > command, which prunes inactive tasks in the cluster. 
> The prune_tasks endpoint seems to be very slow when the cluster has a large > number of inactive tasks. > For example, when we use $ aurora_admin prune_tasks for 135k running tasks > (1k jobs), it takes about 30 minutes to prune all tasks; the pruning speed > seems to max out at 3k tasks per minute. > Currently, Aurora uses StreamManager to manage a single log stream append > transaction for task history pruning. Local storage ops can be added to the > transaction and then later committed as an atomic unit. However, the > StateManager removes tasks one by one in a > for-loop (https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376), > and each RemoveTasks operation is coalesced with its previous operation, > which seems inefficient and unnecessary > (https://github.com/apache/aurora/blob
[jira] [Updated] (AURORA-1929) Improve explicit task history pruning.
[ https://issues.apache.org/jira/browse/AURORA-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1929: -- Component/s: Scheduler > Improve explicit task history pruning. > -- > > Key: AURORA-1929 > URL: https://issues.apache.org/jira/browse/AURORA-1929 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Kai Huang >Assignee: Kai Huang >Priority: Minor > > There are currently two types of task history pruning running by aurora: > # The implicit task history pruning running by TaskHistoryPrunner in the > background, which registers all inactive tasks upon terminal state change for > pruning. > # The explicit task history pruning initiated by `aurora_admin prune_tasks` > command, which prunes inactive tasks in the cluster. > The prune_tasks endpoint seems to be very slow when the cluster has a large > number of inactive tasks. > For example, when we use $ aurora_admin prune_tasks for 135k running tasks > (1k jobs), it takes about ~30 minutes to prune all tasks, the pruning speed > seems to max out at 3k tasks per minute. > Currently, aurora uses StreamManager to manages a single log stream append > transaction for task history pruning. Local storage ops can be added to the > transaction and then later committed as an atomic unit. However, the > StateManager removes tasks one by one in a > for-loop(https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376), > and each RemoveTasks operation is coalesced with its previous operation, > which seems inefficient and unnecessary > (https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324). > We need to batch all removeTasks operations and execute them all at once to > avoid additional cost of coalescing. 
The fix will also benefit implicit task > history pruning since it has a similar underlying implementation. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AURORA-1929) Improve explicit task history pruning.
Kai Huang created AURORA-1929: - Summary: Improve explicit task history pruning. Key: AURORA-1929 URL: https://issues.apache.org/jira/browse/AURORA-1929 Project: Aurora Issue Type: Task Reporter: Kai Huang Assignee: Kai Huang Priority: Minor There are currently two types of task history pruning performed by Aurora: # The implicit task history pruning performed by TaskHistoryPruner in the background, which registers all inactive tasks upon terminal state change for pruning. # The explicit task history pruning initiated by the `aurora_admin prune_tasks` command, which prunes inactive tasks in the cluster. The prune_tasks endpoint seems to be very slow when the cluster has a large number of inactive tasks. For example, when we use $ aurora_admin prune_tasks for 135k running tasks (1k jobs), it takes about 30 minutes to prune all tasks; the pruning speed seems to max out at 3k tasks per minute. Currently, Aurora uses StreamManager to manage a single log stream append transaction for task history pruning. Local storage ops (RemoveTasks) can be added to the transaction and then later committed as an atomic unit. However, the current implementation removes tasks one by one in a for-loop (https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376), and coalesces each RemoveTasks operation with its previous operation, which seems inefficient and unnecessary (https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324). We need to batch all RemoveTasks operations and execute them all at once to avoid the additional cost of coalescing. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
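The batching idea in the ticket above can be sketched as follows. This is an illustrative Python sketch, not Aurora's actual Java implementation: the Transaction class is a toy stand-in for the log-stream append transaction, and the point is only the op-count difference between per-task appends (each of which the real StreamManager coalesces with the previous op) and a single batched RemoveTasks op.

```python
class Transaction:
    """Toy stand-in for a single log-stream append transaction."""

    def __init__(self):
        self.ops = []

    def append(self, op_name, task_ids):
        # The real StreamManager coalesces each new op with the previous
        # one at this point; fewer appends means less coalescing work.
        self.ops.append((op_name, frozenset(task_ids)))


def prune_one_by_one(txn, task_ids):
    # Current behavior: one RemoveTasks op per task, N coalescing passes.
    for task_id in task_ids:
        txn.append("RemoveTasks", [task_id])


def prune_batched(txn, task_ids):
    # Proposed behavior: one RemoveTasks op covering the whole batch.
    txn.append("RemoveTasks", task_ids)
```

For a batch of N tasks the one-by-one variant appends N ops while the batched variant appends exactly one, which is the cost difference the ticket describes.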
[jira] [Updated] (AURORA-1837) Improve implicit task history pruning
[ https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1837: -- Summary: Improve implicit task history pruning (was: Improve task history pruning) > Improve implicit task history pruning > - > > Key: AURORA-1837 > URL: https://issues.apache.org/jira/browse/AURORA-1837 > Project: Aurora > Issue Type: Task >Reporter: Reza Motamedi >Assignee: Mehrdad Nurolahzade >Priority: Minor > Labels: scheduler > > The current implementation of {{TaskHistoryPruner}} registers all inactive tasks > upon terminal _state_ change for pruning. > {{TaskHistoryPruner::registerInactiveTask()}} uses a delay executor to > schedule the process of pruning _tasks_. However, we have noticed that most of the > pruning takes place after the scheduler recovers from a fail-over. > Modify {{TaskHistoryPruner}} to a design similar to > {{JobUpdateHistoryPruner}}: > # Instead of registering delay executors upon terminal task state > transitions, have it wake up at preconfigured intervals, find all terminal-state > tasks that meet the pruning criteria, and delete them. > # Make the initial task history pruning delay configurable so that it does > not hamper the scheduler upon start. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
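The two numbered steps above can be sketched as a simple sweep loop. This is a hypothetical Python illustration modeled on the JobUpdateHistoryPruner design the ticket points to; the function names, the `storage` interface, and the retention policy are all assumptions, not Aurora's real API.

```python
import time


def select_prunable(terminal_tasks, now, retention_secs):
    """Pick tasks past the retention window.

    terminal_tasks: iterable of (task_id, terminal_timestamp) pairs.
    """
    return [tid for tid, ts in terminal_tasks if now - ts > retention_secs]


def pruner_loop(storage, interval_secs, initial_delay_secs, retention_secs):
    # Point 2 of the ticket: a configurable initial delay keeps the first
    # sweep from competing with scheduler start-up.
    time.sleep(initial_delay_secs)
    while True:
        # Point 1: wake on a fixed interval and sweep all terminal tasks
        # that meet the pruning criteria, instead of one delayed prune per
        # terminal state transition.
        doomed = select_prunable(storage.terminal_tasks(), time.time(),
                                 retention_secs)
        if doomed:
            storage.remove_tasks(doomed)  # one batched removal per sweep
        time.sleep(interval_secs)
```

A sweep-based design also avoids the fail-over problem noted above: after a restart there are no per-task delayed callbacks to lose, since the next sweep rediscovers prunable tasks from storage.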
[jira] [Updated] (AURORA-1879) /pendingTasks endpoint shows 500 HTTP Error when there are multiple pending tasks with the same key
[ https://issues.apache.org/jira/browse/AURORA-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1879: -- Attachment: pending_tasks.png > /pendingTasks endpoint shows 500 HTTP Error when there are multiple pending > tasks with the same key > --- > > Key: AURORA-1879 > URL: https://issues.apache.org/jira/browse/AURORA-1879 > Project: Aurora > Issue Type: Bug > Components: Scheduler >Reporter: Kai Huang > Attachments: pending_tasks.png > > > When we have multiple TaskGroups that have the same key but different > TaskConfigs, the /pendingTasks endpoint gives a 500 HTTP Error. > This bug seems to be related to a recent commit (Added the 'reason' to the > /pendingTasks > endpoint: https://github.com/apache/aurora/commit/8e07b04bbd4de23b8f492627da4a614d1e517cf1). > > Attached is a screenshot of the /pendingTasks endpoint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (AURORA-1879) /pendingTasks endpoint shows 500 HTTP Error when there are multiple pending tasks with the same key
Kai Huang created AURORA-1879: - Summary: /pendingTasks endpoint shows 500 HTTP Error when there are multiple pending tasks with the same key Key: AURORA-1879 URL: https://issues.apache.org/jira/browse/AURORA-1879 Project: Aurora Issue Type: Bug Reporter: Kai Huang When we have multiple TaskGroups that have the same key but different TaskConfigs, the /pendingTasks endpoint gives a 500 HTTP Error. This bug seems to be related to a recent commit (Added the 'reason' to the /pendingTasks endpoint: https://github.com/apache/aurora/commit/8e07b04bbd4de23b8f492627da4a614d1e517cf1). Attached is a screenshot of the /pendingTasks endpoint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (AURORA-1879) /pendingTasks endpoint shows 500 HTTP Error when there are multiple pending tasks with the same key
[ https://issues.apache.org/jira/browse/AURORA-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1879: -- Component/s: Scheduler > /pendingTasks endpoint shows 500 HTTP Error when there are multiple pending > tasks with the same key > --- > > Key: AURORA-1879 > URL: https://issues.apache.org/jira/browse/AURORA-1879 > Project: Aurora > Issue Type: Bug > Components: Scheduler >Reporter: Kai Huang > > When we have multiple TaskGroups that have the same key but different > TaskConfigs, the /pendingTasks endpoint gives a 500 HTTP Error. > This bug seems to be related to a recent commit (Added the 'reason' to the > /pendingTasks > endpoint: https://github.com/apache/aurora/commit/8e07b04bbd4de23b8f492627da4a614d1e517cf1). > > Attached is a screenshot of the /pendingTasks endpoint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
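The 500 described above is consistent with the endpoint building a unique-key index over pending-task entries and hitting a duplicate key. The Python sketch below is a hypothetical illustration of that failure mode and of the usual fix (group values under each key); the function names are invented for the example and the ticket does not show the actual Java code path.

```python
from collections import defaultdict


def index_unique(groups):
    """Strict unique-key index: raises on a duplicate key, which an HTTP
    handler would surface as a 500 if unhandled."""
    out = {}
    for key, config in groups:
        if key in out:
            raise ValueError("duplicate task group key: " + key)
        out[key] = config
    return out


def index_grouped(groups):
    """Duplicate-tolerant index: collects every config that shares a key."""
    out = defaultdict(list)
    for key, config in groups:
        out[key].append(config)
    return dict(out)
```

With two TaskGroups sharing a key but carrying different TaskConfigs, `index_unique` throws while `index_grouped` returns both configs under the shared key.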
[jira] [Commented] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)
[ https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15583543#comment-15583543 ] Kai Huang commented on AURORA-1225: --- Someone on the Aurora team at Twitter will continue working on it in the next few weeks, most likely pairing with me to fix it. I'll update this ticket once we make any progress on it. > Modify executor state transition logic to rely on health checks (if enabled) > > > Key: AURORA-1225 > URL: https://issues.apache.org/jira/browse/AURORA-1225 > Project: Aurora > Issue Type: Task > Components: Executor >Reporter: Maxim Khutornenko > > Executor needs to start executing user content in STARTING and transition to > RUNNING when the required number of successful health checks is reached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)
[ https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15582921#comment-15582921 ] Kai Huang commented on AURORA-1225: --- This ticket is almost done. We need to add some more tests for the health checker. > Modify executor state transition logic to rely on health checks (if enabled) > > > Key: AURORA-1225 > URL: https://issues.apache.org/jira/browse/AURORA-1225 > Project: Aurora > Issue Type: Task > Components: Executor >Reporter: Maxim Khutornenko > > Executor needs to start executing user content in STARTING and transition to > RUNNING when the required number of successful health checks is reached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)
[ https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1225: -- Assignee: (was: Kai Huang) > Modify executor state transition logic to rely on health checks (if enabled) > > > Key: AURORA-1225 > URL: https://issues.apache.org/jira/browse/AURORA-1225 > Project: Aurora > Issue Type: Task > Components: Executor >Reporter: Maxim Khutornenko > > Executor needs to start executing user content in STARTING and transition to > RUNNING when the required number of successful health checks is reached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
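The state logic AURORA-1225 asks for can be sketched as a small counter-driven state machine. This is a minimal Python illustration with assumed names, not the executor's real API: the task stays in STARTING and only reports RUNNING once the required number of consecutive successful health checks is observed, while too many consecutive failures mark it failed.

```python
STARTING, RUNNING, FAILED = "STARTING", "RUNNING", "FAILED"


def drive_state(check_results, required_successes, max_consecutive_failures):
    """check_results: iterable of booleans, one per health check, in order."""
    state = STARTING
    successes = failures = 0
    for ok in check_results:
        if ok:
            successes += 1
            failures = 0  # a success resets the failure streak
            if state == STARTING and successes >= required_successes:
                state = RUNNING  # enough consecutive successes observed
        else:
            failures += 1
            successes = 0  # a failure resets the success streak
            if failures > max_consecutive_failures:
                return FAILED
    return state
```

For example, with `required_successes=2` the task stays in STARTING after one passing check and only transitions to RUNNING on the second consecutive pass.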
[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.
[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15569618#comment-15569618 ] Kai Huang commented on AURORA-1791: --- The ticket to track is: https://issues.apache.org/jira/browse/AURORA-1793 > Commit ca683 is not backwards compatible. > - > > Key: AURORA-1791 > URL: https://issues.apache.org/jira/browse/AURORA-1791 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Kai Huang >Priority: Blocker > > The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | > https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] > is not backwards compatible. The last section of the commit > {quote} > 4. Modified the Health Checker and redefined the meaning > initial_interval_secs. > {quote} > has serious, unintended consequences. > Consider the following health check config: > {noformat} > initial_interval_secs: 10 > interval_secs: 5 > max_consecutive_failures: 1 > {noformat} > On the 0.16.0 executor, no health checking will occur for the first 10 > seconds. Here the earliest a task can cause failure is at the 10th second. > On master, health checking starts right away which means the task can fail at > the first second since {{max_consecutive_failures}} is set to 1. > This is not backwards compatible and needs to be fixed. > I think a good solution would be to revert the meaning change to > initial_interval_secs and have the task transition into RUNNING when > {{max_consecutive_successes}} is met. > An investigation shows {{initial_interval_secs}} was set to 5 but the task > failed health checks right away: > {noformat} > D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. > Performing health check. > D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures > counter. > D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired. 
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum > consecutive successes. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
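The compatibility break described in AURORA-1791 comes down to whether failed checks inside initial_interval_secs count. The sketch below is an illustrative model with invented names (it is not the real health_checker.py API): under the 0.16.0 semantics a grace period suppresses early failures, while under the changed semantics every check counts immediately, so `max_consecutive_failures: 1` can kill a slow-starting task at its very first check.

```python
def first_counted_failure(check_times, healthy_at, initial_interval_secs,
                          grace_period):
    """Return the time of the first health-check failure that counts,
    or None if no failure is ever counted.

    check_times: when each health check runs (seconds since task start).
    healthy_at: when the task actually starts serving; checks before
        this point fail.
    grace_period: True models the 0.16.0 semantics (failures inside
        initial_interval_secs are ignored); False models the changed
        semantics (every check counts).
    """
    for t in check_times:
        failed = t < healthy_at  # task not serving yet: the check fails
        if failed and (not grace_period or t >= initial_interval_secs):
            return t
    return None
```

With the ticket's config (checks at 0, 5, 10 seconds, initial_interval_secs: 10) and a task that becomes healthy at second 8, the grace-period semantics never count a failure, while the changed semantics count one at second 0, which with `max_consecutive_failures: 1` is fatal.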
[jira] [Updated] (AURORA-1793) Revert Commit ca683 which is not backwards compatible
[ https://issues.apache.org/jira/browse/AURORA-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1793: -- Assignee: Kai Huang > Revert Commit ca683 which is not backwards compatible > - > > Key: AURORA-1793 > URL: https://issues.apache.org/jira/browse/AURORA-1793 > Project: Aurora > Issue Type: Bug >Reporter: Kai Huang >Assignee: Kai Huang >Priority: Blocker > > The commit ca683cb9e27bae76424a687bc6c3af5a73c501b9 is not backwards > compatible. We decided to revert this commit. > The change that directly causes problems is: > {code} > Modify executor state transition logic to rely on health checks (if enabled). > commit ca683cb9e27bae76424a687bc6c3af5a73c501b9 > {code} > There are two downstream commits that depend on the above commit: > {code} > Add min_consecutive_health_checks in HealthCheckConfig > commit ed72b1bf662d1e29d2bb483b317c787630c26a9e > {code} > {code} > Add support for receiving min_consecutive_successes in health checker > commit e91130e49445c3933b6e27f5fde18c3a0e61b87a > {code} > We will drop all three of these commits and revert to the commit before > the problematic commit: > {code} > Running task ssh without an instance should pick a random instance > commit 59b4d319b8bb5f48ec3880e36f39527f1498a31c > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (AURORA-1793) Revert Commit ca683 which is not backwards compatible
[ https://issues.apache.org/jira/browse/AURORA-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1793: -- Description: The commit ca683cb9e27bae76424a687bc6c3af5a73c501b9 is not backwards compatible. We decided to revert this commit. The change that directly causes problems is: {code} Modify executor state transition logic to rely on health checks (if enabled). commit ca683cb9e27bae76424a687bc6c3af5a73c501b9 {code} There are two downstream commits that depend on the above commit: {code} Add min_consecutive_health_checks in HealthCheckConfig commit ed72b1bf662d1e29d2bb483b317c787630c26a9e {code} {code} Add support for receiving min_consecutive_successes in health checker commit e91130e49445c3933b6e27f5fde18c3a0e61b87a {code} We will drop all three of these commits and revert to the commit before the problematic commit: {code} Running task ssh without an instance should pick a random instance commit 59b4d319b8bb5f48ec3880e36f39527f1498a31c {code} > Revert Commit ca683 which is not backwards compatible > - > > Key: AURORA-1793 > URL: https://issues.apache.org/jira/browse/AURORA-1793 > Project: Aurora > Issue Type: Bug >Reporter: Kai Huang >Priority: Blocker > > The commit ca683cb9e27bae76424a687bc6c3af5a73c501b9 is not backwards > compatible. We decided to revert this commit. > The change that directly causes problems is: > {code} > Modify executor state transition logic to rely on health checks (if enabled). 
> commit ca683cb9e27bae76424a687bc6c3af5a73c501b9 > {code} > There are two downstream commits that depend on the above commit: > {code} > Add min_consecutive_health_checks in HealthCheckConfig > commit ed72b1bf662d1e29d2bb483b317c787630c26a9e > {code} > {code} > Add support for receiving min_consecutive_successes in health checker > commit e91130e49445c3933b6e27f5fde18c3a0e61b87a > {code} > We will drop all three of these commits and revert to the commit before > the problematic commit: > {code} > Running task ssh without an instance should pick a random instance > commit 59b4d319b8bb5f48ec3880e36f39527f1498a31c > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (AURORA-1793) Revert
Kai Huang created AURORA-1793: - Summary: Revert Key: AURORA-1793 URL: https://issues.apache.org/jira/browse/AURORA-1793 Project: Aurora Issue Type: Bug Reporter: Kai Huang Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (AURORA-1793) Revert Commit ca683 which is not backwards compatible
[ https://issues.apache.org/jira/browse/AURORA-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1793: -- Summary: Revert Commit ca683 which is not backwards compatible (was: Revert ) > Revert Commit ca683 which is not backwards compatible > - > > Key: AURORA-1793 > URL: https://issues.apache.org/jira/browse/AURORA-1793 > Project: Aurora > Issue Type: Bug >Reporter: Kai Huang >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.
[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15569600#comment-15569600 ] Kai Huang commented on AURORA-1791: --- We've decided to revert the commit. The change that directly causes problems is: Modify executor state transition logic to rely on health checks (if enabled). commit ca683cb9e27bae76424a687bc6c3af5a73c501b9 There are two downstream commits that depend on the above commit: Add min_consecutive_health_checks in HealthCheckConfig commit ed72b1bf662d1e29d2bb483b317c787630c26a9e Add support for receiving min_consecutive_successes in health checker commit e91130e49445c3933b6e27f5fde18c3a0e61b87a We will drop all three of these commits and revert to the commit before the problematic commit: Running task ssh without an instance should pick a random instance commit 59b4d319b8bb5f48ec3880e36f39527f1498a31c I will create a separate ticket for people to track the reversion. > Commit ca683 is not backwards compatible. > - > > Key: AURORA-1791 > URL: https://issues.apache.org/jira/browse/AURORA-1791 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Kai Huang >Priority: Blocker > > The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | > https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] > is not backwards compatible. The last section of the commit > {quote} > 4. Modified the Health Checker and redefined the meaning > initial_interval_secs. > {quote} > has serious, unintended consequences. > Consider the following health check config: > {noformat} > initial_interval_secs: 10 > interval_secs: 5 > max_consecutive_failures: 1 > {noformat} > On the 0.16.0 executor, no health checking will occur for the first 10 > seconds. Here the earliest a task can cause failure is at the 10th second. 
> On master, health checking starts right away which means the task can fail at > the first second since {{max_consecutive_failures}} is set to 1. > This is not backwards compatible and needs to be fixed. > I think a good solution would be to revert the meaning change to > initial_interval_secs and have the task transition into RUNNING when > {{max_consecutive_successes}} is met. > An investigation shows {{initial_interval_secs}} was set to 5 but the task > failed health checks right away: > {noformat} > D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. > Performing health check. > D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures > counter. > D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired. > W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum > consecutive successes. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
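The compatibility break described above can be sketched numerically. The model below is illustrative, not Aurora code: it assumes the first countable check runs immediately when there is no grace period, and at initial_interval_secs when there is one.

```python
def earliest_failure_secs(initial_interval_secs, interval_secs,
                          max_consecutive_failures, grace_period):
    """Second at which max_consecutive_failures failed checks can first
    accumulate. Hypothetical model: the first countable check runs at t=0
    without a grace period, or at t=initial_interval_secs with one."""
    first_check = initial_interval_secs if grace_period else 0
    return first_check + (max_consecutive_failures - 1) * interval_secs

# 0.16.0 semantics: a 10s grace period means the earliest possible failure
# is at the 10th second, matching the description above.
print(earliest_failure_secs(10, 5, 1, grace_period=True))   # 10
# master semantics: checking starts right away, so a single failed check
# can kill the task on its very first run.
print(earliest_failure_secs(10, 5, 1, grace_period=False))  # 0
```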
[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.
[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15569183#comment-15569183 ] Kai Huang commented on AURORA-1791: --- Thanks for the examples, [~davmclau]. The example you provided is similar to approach (a), which I implemented in the patch. One thing worth noting is that this approach relies on the latest health check, even if it runs well after initial_interval_secs expires. Consider the case where initial_interval_secs = 150 and interval_secs = 100: if the task becomes healthy at second 101, we won't be able to see the healthy state until second 200. This is a grey area in the design doc which we need to discuss further. I agree that we should take a step back, since it seems we need more discussion on whether min_consecutive_successes should trigger a TASK state transition after initial_interval_secs expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
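The observation delay in that example (initial_interval_secs = 150, interval_secs = 100, task healthy at second 101) can be reproduced with a small model; this is an illustration, not Aurora's actual health checker.

```python
import math

def first_observed_healthy(becomes_healthy_at, interval_secs):
    """First check time at or after the moment the task turns healthy,
    assuming checks fire at t = interval_secs, 2*interval_secs, ..."""
    return math.ceil(becomes_healthy_at / interval_secs) * interval_secs

# Checks run at t = 100, 200, ...; a task that turns healthy at second 101
# is not observed healthy until second 200, well after the 150-second
# initial interval has already expired.
print(first_observed_healthy(101, 100))  # 200
```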
[jira] [Comment Edited] (AURORA-1791) Commit ca683 is not backwards compatible.
[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567768#comment-15567768 ] Kai Huang edited comment on AURORA-1791 at 10/12/16 6:43 AM: - To sum up, the issue is caused by failing to reach min_consecutive_successes, not by exceeding max_consecutive_failures. In commit ca683, the failure counter is continuously updated but ignored until initial_interval_secs expires. This does not cause any problems, but it is confusing. I've changed it to update the failure counter only after initial_interval_secs expires. For the root cause of the issue, min_consecutive_successes, we have two options: (a) Perform health checks periodically as defined. Even if initial_interval_secs expires before the minimum number of successes is reached (because periodic checks may miss some successes), we do not fail the health check right away. Instead, we rely on the latest health check to determine whether the task is already in a healthy state. (b) Perform an additional health check whenever initial_interval_secs expires. In my recent review request, I implemented (a). This is based on the assumption that if a task responds OK before initial_interval_secs expires, it will still respond OK on the next health check. However, the task may not respond OK until we perform this additional health check. It's highly likely the instance will be healthy afterwards, but should we fail the health check according to the definition? was (Author: kaih): To sum up, the issue is caused by failed to reach min_consecutive_successes, not exceeding max_consecutive_failures. In commit ca683, I keep updating the failure counter but only ignores it until initial_interval_secs expires. This does not cause any problem but does not seem clear to people. I've changed it to: updating failure counter after initial_interval_secs expires. For the root cause of the issue, min_consecutive_successes, we have two options here: (a) Doing health checks periodically as defined. Even initial_interval_secs expires and min successes is not reached (because periodic check will miss some successes), we do not fail health check right away. Instead, we will rely on the latest health check to ensure the task has already been in healthy state. (b) Doing an additional health check whenever initial_interval_secs expires. In my recent review request, I implemented (a). This is based on the assumption that if a task responds OK before initial_interval_secs expires, for next health check, it will still responds OK. However, it's likely the task fails to respond OK until we perform this additional health check. It's highly likely the instance will be healthy afterwards, but we should fail the health check according to the definition? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
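Option (a) can be sketched as follows. The class and method names here are hypothetical, not the actual patch: at expiry, if the success quota was not met, the verdict falls back to the most recent check result.

```python
class HealthTracker:
    """Illustrative tracker for option (a): rely on the latest check when
    periodic checks may have missed enough consecutive successes."""

    def __init__(self, min_consecutive_successes):
        self.min_successes = min_consecutive_successes
        self.consecutive_successes = 0
        self.latest_ok = False

    def record(self, ok):
        # Track both the streak and the most recent observation.
        self.latest_ok = ok
        self.consecutive_successes = self.consecutive_successes + 1 if ok else 0

    def verdict_at_expiry(self):
        # If the quota was met, the task is plainly healthy.
        if self.consecutive_successes >= self.min_successes:
            return "RUNNING"
        # Option (a): don't fail outright; trust the latest health check.
        return "RUNNING" if self.latest_ok else "FAILED"

t = HealthTracker(min_consecutive_successes=2)
t.record(False)
t.record(True)  # only one success observed before expiry...
print(t.verdict_at_expiry())  # RUNNING: ...but the latest check was healthy
```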
[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.
[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567803#comment-15567803 ] Kai Huang commented on AURORA-1791: --- An issue with implementing (b) is that the health checker thread might be sleeping when initial_interval_secs expires. We would need an event-driven mechanism to notify the health checker to wake up and perform a health check when initial_interval_secs expires. This seems to require a lot of refactoring. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
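The wake-up mechanism option (b) would need might look like the following threading.Event sketch. This shows only the shape of the refactoring discussed above, not Aurora's actual checker thread.

```python
import threading
import time

def checker_loop(interval_secs, wakeup, stop, on_check):
    """Periodic checker that can be woken early, e.g. by a timer firing
    when initial_interval_secs expires."""
    while not stop.is_set():
        wakeup.wait(timeout=interval_secs)  # sleep, but wake early if signaled
        wakeup.clear()
        on_check()

checks = []
wakeup, stop = threading.Event(), threading.Event()
start = time.time()
worker = threading.Thread(
    target=checker_loop,
    args=(30.0, wakeup, stop, lambda: checks.append(time.time())))
worker.start()
wakeup.set()               # simulate initial_interval_secs expiring mid-sleep
time.sleep(0.1)
stop.set(); wakeup.set()   # unblock the sleep and let the loop exit
worker.join()
# The first check ran almost immediately instead of after the full 30s sleep.
print(checks[0] - start < 1.0)  # True
```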
[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.
[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567768#comment-15567768 ] Kai Huang commented on AURORA-1791: --- To sum up, the issue is caused by failing to reach min_consecutive_successes, not by exceeding max_consecutive_failures. In commit ca683, the failure counter is continuously updated but ignored until initial_interval_secs expires. This does not cause any problems, but it is confusing. I've changed it to update the failure counter only after initial_interval_secs expires. For the root cause of the issue, min_consecutive_successes, we have two options: (a) Perform health checks periodically as defined. Even if initial_interval_secs expires before the minimum number of successes is reached (because periodic checks may miss some successes), we do not fail the health check right away. Instead, we rely on the latest health check to determine whether the task is already in a healthy state. (b) Perform an additional health check whenever initial_interval_secs expires. In my recent review request, I implemented (a). This is based on the assumption that if a task responds OK before initial_interval_secs expires, it will still respond OK on the next health check. However, the task may not respond OK until we perform this additional health check. It's highly likely the instance will be healthy afterwards, but should we fail the health check according to the definition? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.
[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566721#comment-15566721 ] Kai Huang commented on AURORA-1791: --- In your case, it's likely your task became healthy at the 6th second. However, within the 10-second initial_interval_secs, we only performed one health check, at the 5th second. So, by definition, we "failed" to observe at least one successful health check within the given period, and we moved the task state to FAILED. We could enforce a health check at the end of initial_interval_secs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
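The scenario above can be modeled with a few lines; this sketch is illustrative only. It assumes periodic checks at t = interval_secs, 2*interval_secs, ... strictly inside the initial interval, plus an optional forced check exactly at expiry.

```python
def successes_before_expiry(healthy_at, interval_secs, window_secs,
                            check_at_expiry=False):
    """Count checks that would observe a task which becomes healthy at
    `healthy_at` seconds (hypothetical model, not Aurora code)."""
    # Periodic checks strictly inside the initial interval...
    times = list(range(interval_secs, window_secs, interval_secs))
    # ...plus, optionally, one forced check exactly at expiry.
    if check_at_expiry:
        times.append(window_secs)
    return sum(1 for t in times if t >= healthy_at)

# Healthy at second 6 with a 10s window and 5s interval: only the t=5 check
# runs, so zero successes are observed and the task is moved to FAILED.
print(successes_before_expiry(6, 5, 10))                        # 0
# Enforcing one extra check at t=10 would observe the now-healthy task.
print(successes_before_expiry(6, 5, 10, check_at_expiry=True))  # 1
```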
[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.
[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566711#comment-15566711 ] Kai Huang commented on AURORA-1791: --- Thanks for pointing it out. The current implementation is: if we fail to see enough successful health checks within initial_interval_secs (even if they occurred), we move the task state to FAILED. This behavior is not backwards compatible and not safe. I'll revise the design. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (AURORA-1224) Add a new "min_consecutive_health_checks" setting in .aurora config
[ https://issues.apache.org/jira/browse/AURORA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang resolved AURORA-1224. --- Resolution: Fixed > Add a new "min_consecutive_health_checks" setting in .aurora config > --- > > Key: AURORA-1224 > URL: https://issues.apache.org/jira/browse/AURORA-1224 > Project: Aurora > Issue Type: Task > Components: Client, Scheduler >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > HealthCheckConfig should accept a new configuration value that will tell how > many positive consecutive health checks an instance requires to move from > STARTING to RUNNING. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)
[ https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang resolved AURORA-1225. --- Resolution: Fixed > Modify executor state transition logic to rely on health checks (if enabled) > > > Key: AURORA-1225 > URL: https://issues.apache.org/jira/browse/AURORA-1225 > Project: Aurora > Issue Type: Task > Components: Executor >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > Executor needs to start executing user content in STARTING and transition to > RUNNING when the required number of successful health checks is reached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (AURORA-1426) thermos kill hangs when killing a Docker-containerized task
[ https://issues.apache.org/jira/browse/AURORA-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1426: -- Assignee: Kai Huang > thermos kill hangs when killing a Docker-containerized task > --- > > Key: AURORA-1426 > URL: https://issues.apache.org/jira/browse/AURORA-1426 > Project: Aurora > Issue Type: Bug > Components: Docker, Executor >Reporter: Kevin Sweeney >Assignee: Kai Huang > > When presented with a docker task as argument in the Vagrant environment it > hangs: > {noformat} > + sudo thermos kill > 1438642104346-vagrant-test-http_example_docker-0-7ce56f48-3a55-413c-885e-8e4218afd582 > I0803 22:48:36.449739 15926 helper.py:276] Found existing runner, forcing > leadership forfeit. > I0803 22:48:36.450725 15926 helper.py:279] Successfully killed leader. > (hangs here) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (AURORA-216) allow aurora executor to be customized via the commandline
[ https://issues.apache.org/jira/browse/AURORA-216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-216: - Assignee: Kai Huang > allow aurora executor to be customized via the commandline > -- > > Key: AURORA-216 > URL: https://issues.apache.org/jira/browse/AURORA-216 > Project: Aurora > Issue Type: Story > Components: Executor >Reporter: brian wickman >Assignee: Kai Huang >Priority: Minor > > Right now the AuroraExecutor takes runner_provider, sandbox_provider and > status_providers. These need to be the following: > - runner_provider: TaskRunnerProvider (assigned_task -> TaskRunner) > - status_providers: list(StatusCheckerProvider) (assigned_task -> > StatusChecker) > - sandbox_provider: SandboxProvider (assigned_task -> SandboxInterface) > These are generic enough that we should allow these to be specified on the > command line as entry points, for example, something like: > {noformat} > --runner_provider > apache.aurora.executor.thermos_runner:ThermosTaskRunnerProvider > --status_provider > apache.aurora.executor.common.health_checker:HealthCheckerProvider > --status_provider myorg.zookeeper:ZkAnnouncerProvider > --sandbox_provider myorg.docker:DockerSandboxProvider > {noformat} > Then have these loaded up using pkg_resources.EntryPoint. These plugins can > either be linked into the .pex or injected onto the PYTHONPATH of the > executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1224) Add a new "min_consecutive_health_checks" setting in .aurora config
[ https://issues.apache.org/jira/browse/AURORA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507123#comment-15507123 ] Kai Huang commented on AURORA-1224: --- I would propose making the following changes to the client: 1. Add min_consecutive_health_checks to the HealthCheckConfig struct (default = 1). 2. Add the following constraint: initial_interval_secs >= interval_secs * min_consecutive_health_checks. 3. Remove the client constraint: watch_secs >= initial_interval_secs + (max_consecutive_failures * interval_secs). 4. Add a new constraint (optional): if watch_secs is 0, health checking must be enabled (either by providing a "health" port or a ShellHealthChecker); otherwise the instance will always be marked as healthy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
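The proposed constraints could be validated along these lines. The function and parameter names are illustrative, not the Aurora client's actual API; only constraints 2 and 4 from the proposal are sketched.

```python
def validate_health_check_config(initial_interval_secs, interval_secs,
                                 min_consecutive_health_checks=1,
                                 watch_secs=None, health_check_enabled=False):
    """Hypothetical client-side validation of the proposed constraints."""
    # Constraint 2: the initial interval must be long enough to fit the
    # required number of consecutive successful health checks.
    if initial_interval_secs < interval_secs * min_consecutive_health_checks:
        raise ValueError("initial_interval_secs must be >= "
                         "interval_secs * min_consecutive_health_checks")
    # Constraint 4 (optional): watch_secs == 0 only makes sense when health
    # checking is enabled; otherwise the instance is always marked healthy.
    if watch_secs == 0 and not health_check_enabled:
        raise ValueError("watch_secs == 0 requires health checking "
                         "(a 'health' port or a ShellHealthChecker)")

validate_health_check_config(10, 5, min_consecutive_health_checks=2)  # ok
```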
[jira] [Updated] (AURORA-1775) Updating pants to 1.1.0-rc7 breaks the make-pycharm-virtualenv script
[ https://issues.apache.org/jira/browse/AURORA-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1775: -- Description: When I run the build-support/python/make-pycharm-virtualenv script locally, exceptions are thrown at: https://github.com/apache/aurora/blob/e67c6a732c00786bc74d63d4fb2b9f5f398c5435/build-support/python/make-pycharm-virtualenv#L24 "./pants dependencies --external-only src/test/python::" no longer seems to work with pants 1.1.0-rc7. As a result, the PyCharm project is generated with an incomplete list of packages, which prevents developers from debugging Python code and running tests in the IDE. (pants test via the command line still works) I was wondering if someone could work on this and fix the issue? Thanks! was: When I run build-support/python/make-pycharm-virtualenv script locally, exceptions were thrown at: https://github.com/apache/aurora/blob/e67c6a732c00786bc74d63d4fb2b9f5f398c5435/build-support/python/make-pycharm-virtualenv#L24 "./pants dependencies --external-only src/test/python::" does no longer seem to work for pants 1.1.0-rc7. As a result, the Pycharm project will be generated with a incomplete list of packages, which prevents developers from debugging python code and running tests in IDE. I was wondering if someone can work on it and fix this issue? Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (AURORA-1775) Updating pants to 1.1.0-rc7 breaks the make-pycharm-virtualenv script
Kai Huang created AURORA-1775: - Summary: Updating pants to 1.1.0-rc7 breaks the make-pycharm-virtualenv script Key: AURORA-1775 URL: https://issues.apache.org/jira/browse/AURORA-1775 Project: Aurora Issue Type: Bug Reporter: Kai Huang When I run the build-support/python/make-pycharm-virtualenv script locally, exceptions are thrown at: https://github.com/apache/aurora/blob/e67c6a732c00786bc74d63d4fb2b9f5f398c5435/build-support/python/make-pycharm-virtualenv#L24 "./pants dependencies --external-only src/test/python::" no longer seems to work with pants 1.1.0-rc7. As a result, the PyCharm project is generated with an incomplete list of packages, which prevents developers from debugging Python code and running tests in the IDE. I was wondering if someone could work on this and fix the issue? Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)
[ https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1225: -- Sprint: Twitter Aurora Q2'16 Sprint 21 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)
[ https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471115#comment-15471115 ] Kai Huang edited comment on AURORA-1225 at 9/8/16 7:52 PM: --- Currently, aurora_executor passes a callback to the status manager. When a StatusResult is returned, the callback function shuts down the status manager. We need to modify the callback function so that it won't shut down the status manager if the status in the StatusResult is TASK_RUNNING. was (Author: kaih): Currently, aurora_executor sends the task status update to scheduler, whereas the health checker performs the health check. When a successful required number of health checks is reached, the health checker needs to notify the aurora_executor about the state change. To solve this problem, we can implement a event-listener in aurora executor that listens to a "TASK_RUNNING_TRANSITION" event dispatched by health checker. Here is an example for the implementation of event dispatcher and event listener: http://code.activestate.com/recipes/577432-simple-event-dispatcher/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
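The callback change described in the edited comment might look like this simplified sketch. The class and callback names are illustrative, not the real aurora_executor internals: keep the status manager alive on TASK_RUNNING, shut it down on any other status.

```python
class StatusManager:
    """Stand-in for the executor's status manager (hypothetical)."""
    def __init__(self):
        self.running = True

    def shutdown(self):
        self.running = False

def make_callback(status_manager):
    def on_status_result(status):
        if status == "TASK_RUNNING":
            return  # healthy transition: keep monitoring the task
        status_manager.shutdown()  # any other status result ends monitoring
    return on_status_result

mgr = StatusManager()
cb = make_callback(mgr)
cb("TASK_RUNNING")
print(mgr.running)   # True: still monitoring after the RUNNING transition
cb("TASK_FAILED")
print(mgr.running)   # False: terminal status shuts the manager down
```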
[jira] [Comment Edited] (AURORA-1223) Modify scheduler updater to not use "watch_secs" for health-check enabled jobs
[ https://issues.apache.org/jira/browse/AURORA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15473026#comment-15473026 ] Kai Huang edited comment on AURORA-1223 at 9/8/16 7:03 AM: --- Modify the assertion of watch_secs in the scheduler so that it can accept 0 as the watch_secs value. For health-check-enabled jobs, watch_secs is set to 0, so the scheduler will not use watch_secs for job updates. was (Author: kaih): Modify the assertion of watch_secs in scheduler. So the scheduler can accept 0 for watch_secs value. If watch_secs is 0, the scheduler will not use watch_secs for job updates. > Modify scheduler updater to not use "watch_secs" for health-check enabled jobs > -- > > Key: AURORA-1223 > URL: https://issues.apache.org/jira/browse/AURORA-1223 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > When health checks are enabled in a job config, scheduler updater should > ignore "watch_secs" UpdateConfig value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (AURORA-1223) Modify scheduler updater to not use "watch_secs" for health-check enabled jobs
[ https://issues.apache.org/jira/browse/AURORA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1223: -- Comment: was deleted (was: Found watch_secs constraint to relax at scheduler side.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (AURORA-1223) Modify scheduler updater to not use "watch_secs" for health-check enabled jobs
[ https://issues.apache.org/jira/browse/AURORA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15473026#comment-15473026 ] Kai Huang edited comment on AURORA-1223 at 9/8/16 7:03 AM: --- Modified the assertion on watch_secs in the scheduler so that it accepts 0 as the watch_secs value. If watch_secs is 0, the scheduler will not use watch_secs for job updates. was (Author: kaih): Modified the assertion on watch_secs in the scheduler so that it accepts 0 as the watch_secs value. > Modify scheduler updater to not use "watch_secs" for health-check enabled jobs > -- > > Key: AURORA-1223 > URL: https://issues.apache.org/jira/browse/AURORA-1223 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > When health checks are enabled in a job config, scheduler updater should > ignore "watch_secs" UpdateConfig value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (AURORA-1223) Modify scheduler updater to not use "watch_secs" for health-check enabled jobs
[ https://issues.apache.org/jira/browse/AURORA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang resolved AURORA-1223. --- Resolution: Fixed Modified the assertion on watch_secs in the scheduler so that it accepts 0 as the watch_secs value. > Modify scheduler updater to not use "watch_secs" for health-check enabled jobs > -- > > Key: AURORA-1223 > URL: https://issues.apache.org/jira/browse/AURORA-1223 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > When health checks are enabled in a job config, scheduler updater should > ignore "watch_secs" UpdateConfig value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
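The fix described in the resolution above (relaxing the scheduler's watch_secs assertion to accept zero) can be sketched as follows. This is a minimal, language-agnostic illustration in Python, not the actual scheduler code; the function name is hypothetical.

```python
def validate_watch_secs(watch_secs):
    """Illustrative sketch of the relaxed watch_secs assertion.

    Before the fix, watch_secs had to be strictly positive; after it,
    zero is accepted and means "do not time-watch; rely on health checks".
    """
    if watch_secs < 0:
        raise ValueError('watch_secs must be non-negative, got %r' % watch_secs)
    return watch_secs

# watch_secs=0 now passes validation instead of tripping the assertion.
mode_is_health_check_driven = (validate_watch_secs(0) == 0)
```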
[jira] [Comment Edited] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)
[ https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471115#comment-15471115 ] Kai Huang edited comment on AURORA-1225 at 9/7/16 4:52 PM: --- Currently, aurora_executor sends task status updates to the scheduler, whereas the health checker performs the health checks. When the required number of successful health checks is reached, the health checker needs to notify aurora_executor about the state change. To solve this problem, we can implement an event listener in the aurora executor that listens for a "TASK_RUNNING_TRANSITION" event dispatched by the health checker. Here is an example implementation of an event dispatcher and event listener: http://code.activestate.com/recipes/577432-simple-event-dispatcher/ was (Author: kaih): Currently, aurora_executor sends task status updates to the scheduler, whereas the health checker performs the health checks. When the required number of successful health checks is reached, the task state transitions to RUNNING. Therefore, the health checker needs to notify aurora_executor about the state change. To solve this problem, we can implement an event listener in the aurora executor that listens for a "TASK_RUNNING_TRANSITION" event dispatched by the health checker. Here is an example implementation of an event dispatcher and event listener: http://code.activestate.com/recipes/577432-simple-event-dispatcher/ > Modify executor state transition logic to rely on health checks (if enabled) > > > Key: AURORA-1225 > URL: https://issues.apache.org/jira/browse/AURORA-1225 > Project: Aurora > Issue Type: Task > Components: Executor >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > Executor needs to start executing user content in STARTING and transition to > RUNNING when a successful required number of health checks is reached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)
[ https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471115#comment-15471115 ] Kai Huang commented on AURORA-1225: --- Currently, aurora_executor sends task status updates to the scheduler, whereas the health checker performs the health checks. When the required number of successful health checks is reached, the task state transitions to RUNNING. Therefore, the health checker needs to notify aurora_executor about the state change. To solve this problem, we can implement an event listener in the aurora executor that listens for a "TASK_RUNNING_TRANSITION" event dispatched by the health checker. Here is an example implementation of an event dispatcher and event listener: http://code.activestate.com/recipes/577432-simple-event-dispatcher/ > Modify executor state transition logic to rely on health checks (if enabled) > > > Key: AURORA-1225 > URL: https://issues.apache.org/jira/browse/AURORA-1225 > Project: Aurora > Issue Type: Task > Components: Executor >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > Executor needs to start executing user content in STARTING and transition to > RUNNING when a successful required number of health checks is reached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
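The event-notification approach proposed in the comment above can be sketched as a minimal dispatcher, loosely following the linked ActiveState recipe. The class, method names, and wiring here are illustrative, not Aurora's actual executor API; only the "TASK_RUNNING_TRANSITION" event name comes from the comment.

```python
class EventDispatcher(object):
    """Minimal publish/subscribe dispatcher (illustrative sketch)."""

    def __init__(self):
        self._listeners = {}

    def subscribe(self, event_name, callback):
        # Register a callback to be invoked when event_name is dispatched.
        self._listeners.setdefault(event_name, []).append(callback)

    def dispatch(self, event_name, *args):
        # Invoke every callback registered for event_name.
        for callback in self._listeners.get(event_name, []):
            callback(*args)

# The health checker would dispatch the event once enough health checks
# succeed; the executor subscribes and reacts to the state change.
dispatcher = EventDispatcher()
transitions = []
dispatcher.subscribe('TASK_RUNNING_TRANSITION', transitions.append)
dispatcher.dispatch('TASK_RUNNING_TRANSITION', 'RUNNING')
```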
[jira] [Comment Edited] (AURORA-1221) Modify task state machine to treat STARTING as a new active state
[ https://issues.apache.org/jira/browse/AURORA-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468426#comment-15468426 ] Kai Huang edited comment on AURORA-1221 at 9/6/16 8:27 PM: --- There are two side effects if we add the STARTING state to the LIVE_STATES thrift constant. 1. aurora job create --wait-until=RUNNING will finish waiting when a task reaches the STARTING state (instead of RUNNING). 2. aurora task commands will now also work for STARTING tasks. For now, we do NOT treat STARTING as a live state. It makes more sense to preserve the original meaning of the above two commands, especially for "aurora task ssh", given that the sandbox is still being initialized in the STARTING state. was (Author: kaih): There are two side effects after we add the STARTING state to the LIVE_STATES thrift constant. aurora job create --wait-until=RUNNING will finish waiting when a task reaches the STARTING state (instead of RUNNING). aurora task commands will now also work for STARTING tasks. > Modify task state machine to treat STARTING as a new active state > - > > Key: AURORA-1221 > URL: https://issues.apache.org/jira/browse/AURORA-1221 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > Scheduler needs to treat STARTING as the new live state. > Open: should we treat STARTING as a transient state with general timeout > (currently 5 minutes) or treat it as a persistent live state instead? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
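The decision in the comment above (keeping STARTING out of LIVE_STATES so the two commands keep their original meaning) can be illustrated with a small sketch. The state names follow Aurora's thrift API, but the exact LIVE_STATES membership and the helper function here are assumptions for illustration only.

```python
# Illustrative: STARTING is deliberately NOT a member, so commands keyed
# off live states keep their original semantics.
LIVE_STATES = frozenset(['KILLING', 'PREEMPTING', 'RESTARTING', 'RUNNING', 'DRAINING'])

def wait_until_live_satisfied(state):
    """Hypothetical check in the spirit of `aurora job create --wait-until=RUNNING`."""
    return state in LIVE_STATES

# Since STARTING is excluded, the client keeps waiting through STARTING
# and only stops once the task is genuinely live.
still_waiting = not wait_until_live_satisfied('STARTING')
done_waiting = wait_until_live_satisfied('RUNNING')
```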
[jira] [Resolved] (AURORA-1222) Modify stats and SLA metrics to properly account for STARTING
[ https://issues.apache.org/jira/browse/AURORA-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang resolved AURORA-1222. --- Resolution: Fixed > Modify stats and SLA metrics to properly account for STARTING > - > > Key: AURORA-1222 > URL: https://issues.apache.org/jira/browse/AURORA-1222 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > Both platform and job uptime calculations will be affected by treating > STARTING as a new live state. Also, a new MTTS (Median Time To Starting) > metric would be great to have in addition to MTTA and MTTR. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (AURORA-1222) Modify stats and SLA metrics to properly account for STARTING
[ https://issues.apache.org/jira/browse/AURORA-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468433#comment-15468433 ] Kai Huang edited comment on AURORA-1222 at 9/6/16 8:19 PM: --- After discussion with Maxim, I decided not to account for STARTING in the platform and job uptime calculations. As a result, I added a new MTTS (Median Time To Starting) metric to the sla module. See https://reviews.apache.org/r/51580/. was (Author: kaih): After discussion with Maxim, I decided not to account for STARTING in the platform and job uptime calculations. As a result, I added a new MTTS (Median Time To Starting) metric to the sla module. See https://reviews.apache.org/r/51580/. > Modify stats and SLA metrics to properly account for STARTING > - > > Key: AURORA-1222 > URL: https://issues.apache.org/jira/browse/AURORA-1222 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > Both platform and job uptime calculations will be affected by treating > STARTING as a new live state. Also, a new MTTS (Median Time To Starting) > metric would be great to have in addition to MTTA and MTTR. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1222) Modify stats and SLA metrics to properly account for STARTING
[ https://issues.apache.org/jira/browse/AURORA-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468433#comment-15468433 ] Kai Huang commented on AURORA-1222: --- After discussion with Maxim, I decided not to account for STARTING in the platform and job uptime calculations. As a result, I added a new MTTS (Median Time To Starting) metric to the sla module. See https://reviews.apache.org/r/51580/. > Modify stats and SLA metrics to properly account for STARTING > - > > Key: AURORA-1222 > URL: https://issues.apache.org/jira/browse/AURORA-1222 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > Both platform and job uptime calculations will be affected by treating > STARTING as a new live state. Also, a new MTTS (Median Time To Starting) > metric would be great to have in addition to MTTA and MTTR. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
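The MTTS metric mentioned above, analogous to MTTA and MTTR, boils down to a median over per-task durations. A minimal sketch follows; the event/timestamp shape is a hypothetical stand-in, not the sla module's real data model.

```python
import statistics

def median_time_to_starting(tasks):
    """Hypothetical MTTS sketch: median seconds from task creation to the
    STARTING transition, across a set of tasks."""
    durations = [t['starting_ts'] - t['created_ts'] for t in tasks]
    return statistics.median(durations)

# Toy data: three tasks taking 4, 6, and 10 seconds to reach STARTING.
tasks = [
    {'created_ts': 0, 'starting_ts': 4},
    {'created_ts': 10, 'starting_ts': 16},
    {'created_ts': 20, 'starting_ts': 30},
]
mtts = median_time_to_starting(tasks)  # median of [4, 6, 10] -> 6
```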
[jira] [Commented] (AURORA-1221) Modify task state machine to treat STARTING as a new active state
[ https://issues.apache.org/jira/browse/AURORA-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468426#comment-15468426 ] Kai Huang commented on AURORA-1221: --- There are two side effects after we add the STARTING state to the LIVE_STATES thrift constant. aurora job create --wait-until=RUNNING will finish waiting when a task reaches the STARTING state (instead of RUNNING). aurora task commands will now also work for STARTING tasks. > Modify task state machine to treat STARTING as a new active state > - > > Key: AURORA-1221 > URL: https://issues.apache.org/jira/browse/AURORA-1221 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > Scheduler needs to treat STARTING as the new live state. > Open: should we treat STARTING as a transient state with general timeout > (currently 5 minutes) or treat it as a persistent live state instead? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (AURORA-1223) Modify scheduler updater to not use "watch_secs" for health-check enabled jobs
[ https://issues.apache.org/jira/browse/AURORA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang reopened AURORA-1223: --- Found a watch_secs constraint to relax on the scheduler side. > Modify scheduler updater to not use "watch_secs" for health-check enabled jobs > -- > > Key: AURORA-1223 > URL: https://issues.apache.org/jira/browse/AURORA-1223 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > When health checks are enabled in a job config, scheduler updater should > ignore "watch_secs" UpdateConfig value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (AURORA-1223) Modify scheduler updater to not use "watch_secs" for health-check enabled jobs
[ https://issues.apache.org/jira/browse/AURORA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468067#comment-15468067 ] Kai Huang edited comment on AURORA-1223 at 9/6/16 6:19 PM: --- After discussion on the Aurora dev list, we decided to keep the watch_secs infrastructure on the scheduler side. Our final conclusion is to adopt the following implementation: If users want purely health-check driven updates, they can set watch_secs to 0 and enable health checks. If they want both health checking and time-driven updates, they can set watch_secs to the time they care about and perform health checks in the STARTING state as well. If they just want time-driven updates, they can disable health checking and set watch_secs to the time they care about. Only one scheduler change is required: the scheduler currently does not accept a zero value for watch_secs, so we need to relax this constraint. was (Author: kaih): After discussion on the Aurora dev list, it turns out there will be no scheduler-side change associated with this issue. Our final conclusion is to adopt the following implementation: If users want purely health-check driven updates, they can set watch_secs to 0 and enable health checks. (watch_secs=0 is not allowed on the client side; we will relax this constraint after we modify the executor. However, no scheduler change is required since the scheduler allows non-negative values for watch_secs.) If they want both health checking and time-driven updates, they can set watch_secs to the time they care about and perform health checks in the STARTING state as well. If they just want time-driven updates, they can disable health checking and set watch_secs to the time they care about.
> Modify scheduler updater to not use "watch_secs" for health-check enabled jobs > -- > > Key: AURORA-1223 > URL: https://issues.apache.org/jira/browse/AURORA-1223 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > When health checks are enabled in a job config, scheduler updater should > ignore "watch_secs" UpdateConfig value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (AURORA-1223) Modify scheduler updater to not use "watch_secs" for health-check enabled jobs
[ https://issues.apache.org/jira/browse/AURORA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang resolved AURORA-1223. --- Resolution: Fixed After discussion on the Aurora dev list, it turns out there will be no scheduler-side change associated with this issue. Our final conclusion is to adopt the following implementation: If users want purely health-check driven updates, they can set watch_secs to 0 and enable health checks. (watch_secs=0 is not allowed on the client side; we will relax this constraint after we modify the executor. However, no scheduler change is required since the scheduler allows non-negative values for watch_secs.) If they want both health checking and time-driven updates, they can set watch_secs to the time they care about and perform health checks in the STARTING state as well. If they just want time-driven updates, they can disable health checking and set watch_secs to the time they care about. > Modify scheduler updater to not use "watch_secs" for health-check enabled jobs > -- > > Key: AURORA-1223 > URL: https://issues.apache.org/jira/browse/AURORA-1223 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > When health checks are enabled in a job config, scheduler updater should > ignore "watch_secs" UpdateConfig value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
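The three update-driving modes described in the resolution above can be summarized in a small decision sketch. The function and its return labels are illustrative, not Aurora code; only the watch_secs semantics come from the discussion.

```python
def update_completion_mode(watch_secs, health_checks_enabled):
    """Hypothetical sketch of how watch_secs and health checking combine
    to determine what drives job-update completion."""
    if health_checks_enabled and watch_secs == 0:
        # Purely health-check driven: no time-based watching at all.
        return 'health-check driven'
    if health_checks_enabled and watch_secs > 0:
        # Health checks in STARTING, plus a time-based watch window.
        return 'health-check and time driven'
    # Health checking disabled: fall back to time-driven updates only.
    return 'time driven'
```

For example, an operator who only trusts health checks would configure `watch_secs=0` with health checks enabled, while a purely time-driven rollout would disable health checks and set `watch_secs` to the window they care about.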
[jira] [Updated] (AURORA-1222) Modify stats and SLA metrics to properly account for STARTING
[ https://issues.apache.org/jira/browse/AURORA-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1222: -- Sprint: Twitter Aurora Q2'16 Sprint 20 > Modify stats and SLA metrics to properly account for STARTING > - > > Key: AURORA-1222 > URL: https://issues.apache.org/jira/browse/AURORA-1222 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > Both platform and job uptime calculations will be affected by treating > STARTING as a new live state. Also, a new MTTS (Median Time To Starting) > metric would be great to have in addition to MTTA and MTTR. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)
[ https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang reassigned AURORA-1225: - Assignee: Kai Huang > Modify executor state transition logic to rely on health checks (if enabled) > > > Key: AURORA-1225 > URL: https://issues.apache.org/jira/browse/AURORA-1225 > Project: Aurora > Issue Type: Task > Components: Executor >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > Executor needs to start executing user content in STARTING and transition to > RUNNING when a successful required number of health checks is reached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (AURORA-1221) Modify task state machine to treat STARTING as a new active state
[ https://issues.apache.org/jira/browse/AURORA-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang reassigned AURORA-1221: - Assignee: Kai Huang > Modify task state machine to treat STARTING as a new active state > - > > Key: AURORA-1221 > URL: https://issues.apache.org/jira/browse/AURORA-1221 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > Scheduler needs to treat STARTING as the new live state. > Open: should we treat STARTING as a transient state with general timeout > (currently 5 minutes) or treat it as a persistent live state instead? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (AURORA-1224) Add a new "min_consecutive_health_checks" setting in .aurora config
[ https://issues.apache.org/jira/browse/AURORA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang reassigned AURORA-1224: - Assignee: Kai Huang > Add a new "min_consecutive_health_checks" setting in .aurora config > --- > > Key: AURORA-1224 > URL: https://issues.apache.org/jira/browse/AURORA-1224 > Project: Aurora > Issue Type: Task > Components: Client, Scheduler >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > HealthCheckConfig should accept a new configuration value that will tell how > many positive consecutive health checks an instance requires to move from > STARTING to RUNNING. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
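The behavior requested in AURORA-1224 above, requiring a number of positive consecutive health checks before moving from STARTING to RUNNING, can be sketched as a simple counter. The class and attribute names here are illustrative, not the HealthCheckConfig implementation; only the `min_consecutive_health_checks` setting name comes from the issue.

```python
class ConsecutiveHealthCheckCounter(object):
    """Sketch: an instance is ready to transition to RUNNING only after
    `min_consecutive_health_checks` successes in a row; any failure
    resets the streak."""

    def __init__(self, min_consecutive_health_checks):
        self.required = min_consecutive_health_checks
        self.consecutive = 0

    def record(self, healthy):
        # Extend the streak on success, reset it on failure.
        self.consecutive = self.consecutive + 1 if healthy else 0
        return self.consecutive >= self.required  # ready for RUNNING?

checker = ConsecutiveHealthCheckCounter(min_consecutive_health_checks=2)
# A failure in the middle resets the streak, so only the final check
# (the second consecutive success) signals readiness.
results = [checker.record(ok) for ok in (True, False, True, True)]
```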
[jira] [Updated] (AURORA-1222) Modify stats and SLA metrics to properly account for STARTING
[ https://issues.apache.org/jira/browse/AURORA-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1222: -- Sprint: (was: Twitter Aurora Q2'16 Sprint 20) > Modify stats and SLA metrics to properly account for STARTING > - > > Key: AURORA-1222 > URL: https://issues.apache.org/jira/browse/AURORA-1222 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Maxim Khutornenko > > Both platform and job uptime calculations will be affected by treating > STARTING as a new live state. Also, a new MTTS (Median Time To Starting) > metric would be great to have in addition to MTTA and MTTR. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (AURORA-1222) Modify stats and SLA metrics to properly account for STARTING
[ https://issues.apache.org/jira/browse/AURORA-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang reassigned AURORA-1222: - Assignee: Kai Huang > Modify stats and SLA metrics to properly account for STARTING > - > > Key: AURORA-1222 > URL: https://issues.apache.org/jira/browse/AURORA-1222 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > Both platform and job uptime calculations will be affected by treating > STARTING as a new live state. Also, a new MTTS (Median Time To Starting) > metric would be great to have in addition to MTTA and MTTR. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (AURORA-1222) Modify stats and SLA metrics to properly account for STARTING
[ https://issues.apache.org/jira/browse/AURORA-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1222: -- Sprint: Twitter Aurora Q2'16 Sprint 20 > Modify stats and SLA metrics to properly account for STARTING > - > > Key: AURORA-1222 > URL: https://issues.apache.org/jira/browse/AURORA-1222 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Maxim Khutornenko > > Both platform and job uptime calculations will be affected by treating > STARTING as a new live state. Also, a new MTTS (Median Time To Starting) > metric would be great to have in addition to MTTA and MTTR. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (AURORA-1223) Modify scheduler updater to not use "watch_secs" for health-check enabled jobs
[ https://issues.apache.org/jira/browse/AURORA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1223: -- Story Points: 5 > Modify scheduler updater to not use "watch_secs" for health-check enabled jobs > -- > > Key: AURORA-1223 > URL: https://issues.apache.org/jira/browse/AURORA-1223 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > When health checks are enabled in a job config, scheduler updater should > ignore "watch_secs" UpdateConfig value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (AURORA-1223) Modify scheduler updater to not use "watch_secs" for health-check enabled jobs
[ https://issues.apache.org/jira/browse/AURORA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Huang updated AURORA-1223: -- Assignee: Kai Huang > Modify scheduler updater to not use "watch_secs" for health-check enabled jobs > -- > > Key: AURORA-1223 > URL: https://issues.apache.org/jira/browse/AURORA-1223 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Maxim Khutornenko >Assignee: Kai Huang > > When health checks are enabled in a job config, scheduler updater should > ignore "watch_secs" UpdateConfig value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)