[ 
https://issues.apache.org/jira/browse/AURORA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491591#comment-14491591
 ] 

Stephan Erb commented on AURORA-894:
------------------------------------

I believe there is a smaller story embedded in this one that is not blocked 
by AURORA-279 and is therefore easier to implement.

We could start by introducing the {{STARTING}} state and transitioning a job to 
{{RUNNING}} once the first {{min_consecutive_health_checks}} health checks have 
passed. This requires introducing the new state on both the server and the 
executor side, but keeps the updater out of the loop.
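
To illustrate the proposed transition, here is a minimal sketch in Python. It is not Aurora code: the {{HealthGate}} class and its {{record}} method are hypothetical names, and it assumes that a failed check while {{STARTING}} resets the count of consecutive successes. What should happen on failures after {{RUNNING}} is reached is left out.

```python
from dataclasses import dataclass
from enum import Enum


class TaskState(Enum):
    STARTING = "STARTING"
    RUNNING = "RUNNING"


@dataclass
class HealthGate:
    """Hypothetical helper that promotes an instance from STARTING to RUNNING."""

    min_consecutive_health_checks: int
    state: TaskState = TaskState.STARTING
    consecutive_ok: int = 0

    def record(self, healthy: bool) -> TaskState:
        """Feed one health check result and return the resulting state."""
        if self.state is TaskState.STARTING:
            # Assumption: a failed check resets the streak of successes.
            self.consecutive_ok = self.consecutive_ok + 1 if healthy else 0
            if self.consecutive_ok >= self.min_consecutive_health_checks:
                self.state = TaskState.RUNNING
        # Failures after RUNNING are intentionally not handled in this sketch.
        return self.state


gate = HealthGate(min_consecutive_health_checks=2)
for result in (True, False, True, True):
    print(gate.record(result).value)
```

With two required consecutive checks, the sequence pass, fail, pass, pass stays in {{STARTING}} for the first three results and reaches {{RUNNING}} only on the fourth.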

This smaller story also has an immediate benefit: right now, when implementing a 
dashboard or monitoring for services on Aurora, one always has to re-implement 
health checks. Just looking at the {{RUNNING}} state is not enough because the 
service might still be starting up rather than serving requests. With the 
proposed change, however, Aurora would guarantee that a {{RUNNING}} service is 
healthy (modulo the acceptable inconsistency window of the health check interval).



> Server updater should watch healthy instances
> ---------------------------------------------
>
>                 Key: AURORA-894
>                 URL: https://issues.apache.org/jira/browse/AURORA-894
>             Project: Aurora
>          Issue Type: Epic
>          Components: Scheduler
>            Reporter: Maxim Khutornenko
>            Assignee: Maxim Khutornenko
>              Labels: 2015-Q2
>
> Instead of starting the {{minWaitInInstanceRunningMs}} (aka {{watch_secs}}) 
> countdown when an instance reaches the RUNNING state, the updater should 
> start it on the first successful health check. This could speed up 
> updates, as {{minWaitInInstanceRunningMs}} would no longer have to be 
> chosen based on the worst observed instance startup/warmup delay but rather 
> as a desired health check duration according to the following formula:
> {noformat}
> minWaitInInstanceRunningMs = interval_secs x num_desired_healthchecks x 1000
> {noformat}
> where:
>   {{interval_secs}} - 
> https://github.com/apache/incubator-aurora/blob/master/docs/configuration-reference.md#healthcheckconfig-objects
>   {{num_desired_healthchecks}} - the desired number of OK health checks to 
> observe before declaring an instance updated successfully
>   
> The above would allow every instance to start its watch interval based on 
> its own startup performance, potentially letting the updater exit earlier. 
> This feature requires AURORA-279.
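
As a worked example of the formula quoted above, the following Python sketch computes the watch duration. The function name is hypothetical, and the input values (a 10 second interval and 3 desired checks) are illustrative assumptions rather than Aurora defaults.

```python
def min_wait_in_instance_running_ms(interval_secs: float,
                                    num_desired_healthchecks: int) -> int:
    """Apply the formula from the issue description and return milliseconds."""
    return round(interval_secs * num_desired_healthchecks * 1000)


# Illustrative values only: 10 s between checks, 3 OK checks desired.
print(min_wait_in_instance_running_ms(10, 3))  # 30000
```

Under these assumed values the updater would watch each instance for 30 seconds after its first successful health check, regardless of how long the instance took to start.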



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
