Re: Review Request 51876: @ReviewBot retry Modify executor state transition logic to rely on health checks (if enabled)

2016-09-14 Thread Kai Huang


> On Sept. 14, 2016, 10:03 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, lines 40-43
> > 
> >
> > Do we really need this enum? A task that is in the middle of the 
> > `initial_interval_secs` timeout technically is still healthy, right? I feel 
> > having a boolean `isHealthy` should be enough to cover all possible 
> > scenarios as `StatusManager` checks for `isHealthy` AND the task status 
> > itself to invoke a callback.
> 
> Kai Huang wrote:
> How do we distinguish a undergoing health check from a successful health 
> check(completed within initial_interval) if isHealthy is True in both of 
> them? Should we make a second variable to indicate that?

Oh, never mind, it seems I missed one point here: When we reach the requirement 
of suceessful health checks before initial_interval expires, we should 
pre-terminate the initial_interval secs by seeting the initital_in_progress to 
be expired. So that will work. Thanks!


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review148937
---


On Sept. 14, 2016, 4:43 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 14, 2016, 4:43 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on vCurrent executor, the health checker will send 
> a "TASK_RUNNING" message when a successful required number of health checks 
> is reached within the initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message was sent if the health checker fails to reach the 
> required number of health checks within that period, or a maximum number of 
> failed health check limit is reached after the initital_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will sends 
> "TASK_RUNNING" message to scheduler after the thermos runner was started. In 
> this scenario, the behavior of vCurrent executor will be the same as the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> 
> [TODOs]
> Currently I fixed all broken tests caused by my changes. However, more tests 
> needs to be added in order to accomodate the executor change. I will send 
> follow-up review update when I cover more edge cases. But any feedback on the 
> implementation and tests is greatly appreciated.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: @ReviewBot retry Modify executor state transition logic to rely on health checks (if enabled)

2016-09-14 Thread Kai Huang


> On Sept. 14, 2016, 10:03 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/executor/common/status_checker.py, line 104
> > 
> >
> > Not obvious to me: what's the purpose of this condition here?

Oh, I thought HealthChecker is not the last link in the ChainedStatusChecker. 
If it's the last one, then this condition is not necessary.


> On Sept. 14, 2016, 10:03 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/executor/status_manager.py, line 60
> > 
> >
> > How about extending the `StatusManager` to take two callbacks instead? 
> > The old callback would remain a default one and the new would be an 
> > optional one to exclusively process 'TASK_RUNNING'. Since it's now status 
> > aware, it would make sense to consolidate status management here and avoid 
> > the if-else switch upstream.
> > 
> > Also, I'd use `mesos_pb2.TASK_RUNNING` instead of relying on string 
> > const.

Agreed. I had similar thought initially, now I'm more assured.


> On Sept. 14, 2016, 10:03 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, lines 40-43
> > 
> >
> > Do we really need this enum? A task that is in the middle of the 
> > `initial_interval_secs` timeout technically is still healthy, right? I feel 
> > having a boolean `isHealthy` should be enough to cover all possible 
> > scenarios as `StatusManager` checks for `isHealthy` AND the task status 
> > itself to invoke a callback.

How do we distinguish a undergoing health check from a successful health 
check(completed within initial_interval) if isHealthy is True in both of them? 
Should we make a second variable to indicate that?


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review148937
---


On Sept. 14, 2016, 4:43 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 14, 2016, 4:43 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on vCurrent executor, the health checker will send 
> a "TASK_RUNNING" message when a successful required number of health checks 
> is reached within the initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message was sent if the health checker fails to reach the 
> required number of health checks within that period, or a maximum number of 
> failed health check limit is reached after the initital_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will sends 
> "TASK_RUNNING" message to scheduler after the thermos runner was started. In 
> this scenario, the behavior of vCurrent executor will be the same as the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> 
> [TODOs]
> Currently I fixed all broken tests caused by my changes. However, more tests 
> needs to be added in order to accomodate the executor change. I will send 
> follow-up review update when I cover more edge cases. But any feedback on the 
> implementation and tests is greatly appreciated.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 

Re: Review Request 51876: @ReviewBot retry Modify executor state transition logic to rely on health checks (if enabled)

2016-09-14 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review148937
---




src/main/python/apache/aurora/executor/aurora_executor.py (line 125)


I don't think we should include `Task is healthy.` message for 
non-health-check-enabled tasks as we have no visibility to the health of such 
tasks.



src/main/python/apache/aurora/executor/aurora_executor.py (lines 133 - 141)


This code is largely a repetition of what alrady exists in 
health_checker.py. I'd be great to reuse it (say from task_info.py).



src/main/python/apache/aurora/executor/common/health_checker.py (lines 40 - 43)


Do we really need this enum? A task that is in the middle of the 
`initial_interval_secs` timeout technically is still healthy, right? I feel 
having a boolean `isHealthy` should be enough to cover all possible scenarios 
as `StatusManager` checks for `isHealthy` AND the task status itself to invoke 
a callback.



src/main/python/apache/aurora/executor/common/health_checker.py (line 69)


This comment needs update.



src/main/python/apache/aurora/executor/common/health_checker.py (line 103)


`...if is_healthy else...`



src/main/python/apache/aurora/executor/common/health_checker.py (lines 136 - 
150)


If you get rid of the `HealthCheckStatus` enum I think all we need is this:
```
if self.initial_in_progress:
  if self.current_consecutive_health_checks >= 
self.min_consecutive_health_checks:
  self.healthy = True
else:
  self.healthy = self.current_consecutive_failures > 
self.max_consecutive_failures
  self.reason = reason
  
  
@property
def initial_in_progress(self):
  if not self._expired:
if self.elapsed_time <= self.initial_interval:
  return true
else:  
  self._expired = True
return !self._expired
```



src/main/python/apache/aurora/executor/common/health_checker.py (line 149)


This should be log.info instead.



src/main/python/apache/aurora/executor/common/status_checker.py (line 104)


Not obvious to me: what's the purpose of this condition here?



src/main/python/apache/aurora/executor/status_manager.py (line 60)


How about extending the `StatusManager` to take two callbacks instead? The 
old callback would remain a default one and the new would be an optional one to 
exclusively process 'TASK_RUNNING'. Since it's now status aware, it would make 
sense to consolidate status management here and avoid the if-else switch 
upstream.

Also, I'd use `mesos_pb2.TASK_RUNNING` instead of relying on string const.


- Maxim Khutornenko


On Sept. 14, 2016, 4:43 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 14, 2016, 4:43 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on vCurrent executor, the health checker will send 
> a "TASK_RUNNING" message when a successful required number of health checks 
> is reached within the initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message was sent if the health checker fails to reach the 
> required number of health checks within that period, or a maximum number of 
> failed health check limit is reached after the initital_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will sends 
> "TASK_RUNNING" message to 

Re: Review Request 51876: @ReviewBot retry Modify executor state transition logic to rely on health checks (if enabled)

2016-09-14 Thread Kai Huang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review148927
---




src/main/python/apache/aurora/executor/common/health_checker.py (line 137)


It seems to me the order doesn't matter that much. However, I still have 
the feeling that Maxim's proposal makes the expiration and failure count more 
close to their true definition, and less ambuiguous. The tests should not 
dictate how we write the source code. Sometimes we have to refactor them.

Please feel free to weigh in.


- Kai Huang


On Sept. 14, 2016, 4:43 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 14, 2016, 4:43 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on vCurrent executor, the health checker will send 
> a "TASK_RUNNING" message when a successful required number of health checks 
> is reached within the initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message was sent if the health checker fails to reach the 
> required number of health checks within that period, or a maximum number of 
> failed health check limit is reached after the initital_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will sends 
> "TASK_RUNNING" message to scheduler after the thermos runner was started. In 
> this scenario, the behavior of vCurrent executor will be the same as the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> 
> [TODOs]
> Currently I fixed all broken tests caused by my changes. However, more tests 
> needs to be added in order to accomodate the executor change. I will send 
> follow-up review update when I cover more edge cases. But any feedback on the 
> implementation and tests is greatly appreciated.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: @ReviewBot retry Modify executor state transition logic to rely on health checks (if enabled)

2016-09-14 Thread Kai Huang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review148925
---




src/main/python/apache/aurora/executor/common/health_checker.py (line 136)


The reason I sned the message here is because in test_health_checker.py, we 
expect the _maybe_update_failure_count() method sends the message to change the 
healthy state of health cheker. See: 
https://github.com/apache/aurora/blob/master/src/test/python/apache/aurora/executor/common/test_health_checker.py#L451

So currently, the flow in the health checker thread is expected to be:
(1)update failure count
(2)check_expire
(3)send_message
(4)sleep
(5)go to (1)

I agree with Maxim the following flow of health checker thread will make 
more sense, but it appears we have to refactor a number of tests:
(1)check expired
(2)send message
(3)update failure count
(4)sleep
(5)go to (1)



src/main/python/apache/aurora/executor/common/health_checker.py (line 158)


checking the time is expired by summing up the elapsed time is not a good 
strategy. i've tried to use a clock to check if the initital_interval expires 
in the same thread. However, that causes some flaky test when we apply a minor 
offset of the check interval in tests. See: 
https://github.com/apache/aurora/blob/master/src/test/python/apache/aurora/executor/common/test_health_checker.py#L114
I'm thinking about doing the expiration check in a separate thread, and 
memoize if initital_interval expires.


- Kai Huang


On Sept. 14, 2016, 4:43 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 14, 2016, 4:43 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on vCurrent executor, the health checker will send 
> a "TASK_RUNNING" message when a successful required number of health checks 
> is reached within the initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message was sent if the health checker fails to reach the 
> required number of health checks within that period, or a maximum number of 
> failed health check limit is reached after the initital_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will sends 
> "TASK_RUNNING" message to scheduler after the thermos runner was started. In 
> this scenario, the behavior of vCurrent executor will be the same as the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> 
> [TODOs]
> Currently I fixed all broken tests caused by my changes. However, more tests 
> needs to be added in order to accomodate the executor change. I will send 
> follow-up review update when I cover more edge cases. But any feedback on the 
> implementation and tests is greatly appreciated.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   

Re: Review Request 51876: @ReviewBot retry Modify executor state transition logic to rely on health checks (if enabled)

2016-09-13 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review148858
---


Ship it!




Master (5069f93) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 14, 2016, 4:43 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 14, 2016, 4:43 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on vCurrent executor, the health checker will send 
> a "TASK_RUNNING" message when a successful required number of health checks 
> is reached within the initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message was sent if the health checker fails to reach the 
> required number of health checks within that period, or a maximum number of 
> failed health check limit is reached after the initital_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will sends 
> "TASK_RUNNING" message to scheduler after the thermos runner was started. In 
> this scenario, the behavior of vCurrent executor will be the same as the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> 
> [TODOs]
> Currently I fixed all broken tests caused by my changes. However, more tests 
> needs to be added in order to accomodate the executor change. I will send 
> follow-up review update when I cover more edge cases. But any feedback on the 
> implementation and tests is greatly appreciated.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: @ReviewBot retry Modify executor state transition logic to rely on health checks (if enabled)

2016-09-13 Thread Kai Huang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/
---

(Updated Sept. 14, 2016, 4:43 a.m.)


Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.


Changes
---

Remove a redundant line.


Summary (updated)
-

@ReviewBot retry Modify executor state transition logic to rely on health 
checks (if enabled)


Bugs: AURORA-1225
https://issues.apache.org/jira/browse/AURORA-1225


Repository: aurora


Description
---

Modify executor state transition logic to rely on health checks (if enabled).

[Summary]
Executor needs to start executing user content in STARTING and transition to 
RUNNING when a successful required number of health checks is reached.

This review contains a series of executor changes that implement the health 
check driven updates. It gives more context of the design of this feature.

[Background]
Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
and the design doc: 
https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
 for more details and background.

[Description]
If health check is enabled on vCurrent executor, the health checker will send a 
"TASK_RUNNING" message when a successful required number of health checks is 
reached within the initial_interval_secs. On the other hand, a "TASK_FAILED" 
message was sent if the health checker fails to reach the required number of 
health checks within that period, or a maximum number of failed health check 
limit is reached after the initital_interval_secs.

If health check is disabled on the vCurrent executor, it will sends 
"TASK_RUNNING" message to scheduler after the thermos runner was started. In 
this scenario, the behavior of vCurrent executor will be the same as the vPrev 
executor.

[Change List]
The current change set includes:
1. Removed the status memoization in ChainedStatusChecker.
2. Modified the StatusManager to be edge triggered.
3. Changed the Aurora Executor callback function.
4. Modified the Health Checker and redefined the meaning initial_interval_secs.

[TODOs]
Currently I fixed all broken tests caused by my changes. However, more tests 
needs to be added in order to accomodate the executor change. I will send 
follow-up review update when I cover more edge cases. But any feedback on the 
implementation and tests is greatly appreciated.


Diffs (updated)
-

  src/main/python/apache/aurora/executor/aurora_executor.py 
ce5ef680f01831cd89fced8969ae3246c7f60cfd 
  src/main/python/apache/aurora/executor/common/health_checker.py 
5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
  src/main/python/apache/aurora/executor/common/status_checker.py 
795dae2d6b661fc528d952c2315196d94127961f 
  src/main/python/apache/aurora/executor/status_manager.py 
228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
  src/test/python/apache/aurora/executor/common/test_status_checker.py 
5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
  src/test/python/apache/aurora/executor/test_status_manager.py 
ce4679ba1aa7b42cf0115c943d84663030182d23 
  src/test/python/apache/aurora/executor/test_thermos_executor.py 
0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 

Diff: https://reviews.apache.org/r/51876/diff/


Testing
---

./build-support/jenkins/build.sh

./pants test.pytest src/test/python/apache/aurora/executor::


Thanks,

Kai Huang