Re: Review Request 52094: Add min_consecutive_health_checks in HealthCheckConfig

2016-09-22 Thread Kai Huang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52094/#review150134
---




docs/reference/configuration.md (line 368)


In addition to documentation, we also need to notify the user:
1. For task with health check, they can leave out watch_secs and relies on 
initial_interval_secs.
2. For task without health check, they still have to provide watch_secs to 
cover the warming up period.

I'm thinking that whether we should incorporate this information as a 
client warning message to prevent misuse of the configuration?

I'll wait until Maxim to weigh in. But thanks for your feedback, Dmitriy!


- Kai Huang


On Sept. 23, 2016, 4:34 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52094/
> ---
> 
> (Updated Sept. 23, 2016, 4:34 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1224
> https://issues.apache.org/jira/browse/AURORA-1224
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Add min_consecutive_health_checks to HealthCheckConfig.
> 
> [Summary]
> HealthCheckConfig should accept a new configuration value that will tell how 
> many positive consecutive health checks an instance requires to move from 
> STARTING to RUNNING.
> 
> [Background]
> This review depends on the executor change(AURORA-1225). Please see 
> https://reviews.apache.org/r/51876/ for more details and background.
> 
> [Change List]
> 1. Add a configuration value "min_consecutive_health_checks"(default=1) to 
> HealthCheckConfig struct.
> 2. Modify the default value of watch_secs to be 0.
> 3. Add a client-side constraint: 
> initial_interval_secs >= min_consecutive_health_checks * interval_secs
> 4. Update the unit tests for health check config in client/config.py, skip 
> unit tests related to watch_secs.
> 
> 
> Diffs
> -
> 
>   docs/reference/configuration.md f2a0b1873f31e91f3bf0cac6f8448e8130fae688 
>   src/main/python/apache/aurora/client/config.py 
> 0186af52f0d7d7e3981ec59bf6a01aafee2bcfb1 
>   src/main/python/apache/aurora/config/schema/base.py 
> 845163043b0b7b2f9e7aca14677ca9f094658551 
>   src/test/python/apache/aurora/client/test_config.py 
> 5cf68a5145ddf9478baa30453c0bcb73136fa7eb 
> 
> Diff: https://reviews.apache.org/r/52094/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/client::
> 
> ./pants test.pytest src/test/python/apache/aurora/config::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 52094: Add min_consecutive_health_checks in HealthCheckConfig

2016-09-22 Thread Kai Huang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52094/
---

(Updated Sept. 23, 2016, 4:34 a.m.)


Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.


Changes
---

Add documentation for min_consecutive_health_checks, update the documentaion of 
watch_secs and initial_interval_secs.


Bugs: AURORA-1224
https://issues.apache.org/jira/browse/AURORA-1224


Repository: aurora


Description
---

Add min_consecutive_health_checks to HealthCheckConfig.

[Summary]
HealthCheckConfig should accept a new configuration value that will tell how 
many positive consecutive health checks an instance requires to move from 
STARTING to RUNNING.

[Background]
This review depends on the executor change(AURORA-1225). Please see 
https://reviews.apache.org/r/51876/ for more details and background.

[Change List]
1. Add a configuration value "min_consecutive_health_checks"(default=1) to 
HealthCheckConfig struct.
2. Modify the default value of watch_secs to be 0.
3. Add a client-side constraint: 
initial_interval_secs >= min_consecutive_health_checks * interval_secs
4. Update the unit tests for health check config in client/config.py, skip unit 
tests related to watch_secs.


Diffs (updated)
-

  docs/reference/configuration.md f2a0b1873f31e91f3bf0cac6f8448e8130fae688 
  src/main/python/apache/aurora/client/config.py 
0186af52f0d7d7e3981ec59bf6a01aafee2bcfb1 
  src/main/python/apache/aurora/config/schema/base.py 
845163043b0b7b2f9e7aca14677ca9f094658551 
  src/test/python/apache/aurora/client/test_config.py 
5cf68a5145ddf9478baa30453c0bcb73136fa7eb 

Diff: https://reviews.apache.org/r/52094/diff/


Testing
---

./build-support/jenkins/build.sh

./pants test.pytest src/test/python/apache/aurora/client::

./pants test.pytest src/test/python/apache/aurora/config::


Thanks,

Kai Huang



Re: Review Request 52094: Add min_consecutive_health_checks in HealthCheckConfig

2016-09-22 Thread Kai Huang


> On Sept. 23, 2016, 1:16 a.m., Dmitriy Shirchenko wrote:
> > src/main/python/apache/aurora/client/config.py, line 113
> > 
> >
> > why was this configuration option deleted? it is still referenced: 
> > https://github.com/apache/aurora/blob/master/docs/reference/configuration.md#healthcheckconfig-objects
> > 
> > if you are introducing a new option it should be documented.

This configuration option is not deleted. We just won't validate the constraint 
between watch_secs, max_consecutive_failures here.


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52094/#review150118
---


On Sept. 20, 2016, 7:43 p.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52094/
> ---
> 
> (Updated Sept. 20, 2016, 7:43 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1224
> https://issues.apache.org/jira/browse/AURORA-1224
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Add min_consecutive_health_checks to HealthCheckConfig.
> 
> [Summary]
> HealthCheckConfig should accept a new configuration value that will tell how 
> many positive consecutive health checks an instance requires to move from 
> STARTING to RUNNING.
> 
> [Background]
> This review depends on the executor change(AURORA-1225). Please see 
> https://reviews.apache.org/r/51876/ for more details and background.
> 
> [Change List]
> 1. Add a configuration value "min_consecutive_health_checks"(default=1) to 
> HealthCheckConfig struct.
> 2. Modify the default value of watch_secs to be 0.
> 3. Add a client-side constraint: 
> initial_interval_secs >= min_consecutive_health_checks * interval_secs
> 4. Update the unit tests for health check config in client/config.py, skip 
> unit tests related to watch_secs.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/client/config.py 
> 0186af52f0d7d7e3981ec59bf6a01aafee2bcfb1 
>   src/main/python/apache/aurora/config/schema/base.py 
> 845163043b0b7b2f9e7aca14677ca9f094658551 
>   src/test/python/apache/aurora/client/test_config.py 
> 5cf68a5145ddf9478baa30453c0bcb73136fa7eb 
> 
> Diff: https://reviews.apache.org/r/52094/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/client::
> 
> ./pants test.pytest src/test/python/apache/aurora/config::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 52094: Add min_consecutive_health_checks in HealthCheckConfig

2016-09-22 Thread Dmitriy Shirchenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52094/#review150118
---




src/main/python/apache/aurora/client/config.py (line 112)


why was this configuration option deleted? it is still referenced: 
https://github.com/apache/aurora/blob/master/docs/reference/configuration.md#healthcheckconfig-objects

if you are introducing a new option it should be documented.



src/main/python/apache/aurora/config/schema/base.py (line 32)


new default should changed in configuration.md



src/main/python/apache/aurora/config/schema/base.py (line 63)


also needs to be documented.


- Dmitriy Shirchenko


On Sept. 20, 2016, 7:43 p.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52094/
> ---
> 
> (Updated Sept. 20, 2016, 7:43 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1224
> https://issues.apache.org/jira/browse/AURORA-1224
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Add min_consecutive_health_checks to HealthCheckConfig.
> 
> [Summary]
> HealthCheckConfig should accept a new configuration value that will tell how 
> many positive consecutive health checks an instance requires to move from 
> STARTING to RUNNING.
> 
> [Background]
> This review depends on the executor change(AURORA-1225). Please see 
> https://reviews.apache.org/r/51876/ for more details and background.
> 
> [Change List]
> 1. Add a configuration value "min_consecutive_health_checks"(default=1) to 
> HealthCheckConfig struct.
> 2. Modify the default value of watch_secs to be 0.
> 3. Add a client-side constraint: 
> initial_interval_secs >= min_consecutive_health_checks * interval_secs
> 4. Update the unit tests for health check config in client/config.py, skip 
> unit tests related to watch_secs.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/client/config.py 
> 0186af52f0d7d7e3981ec59bf6a01aafee2bcfb1 
>   src/main/python/apache/aurora/config/schema/base.py 
> 845163043b0b7b2f9e7aca14677ca9f094658551 
>   src/test/python/apache/aurora/client/test_config.py 
> 5cf68a5145ddf9478baa30453c0bcb73136fa7eb 
> 
> Diff: https://reviews.apache.org/r/52094/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/client::
> 
> ./pants test.pytest src/test/python/apache/aurora/config::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-22 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review150117
---


Ship it!




Master (4ead189) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 23, 2016, 12:57 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 23, 2016, 12:57 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on vCurrent executor, the health checker will send 
> a "TASK_RUNNING" message when a successful required number of health checks 
> is reached within the initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message was sent if the health checker fails to reach the 
> required number of health checks within that period, or a maximum number of 
> failed health check limit is reached after the initital_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will sends 
> "TASK_RUNNING" message to scheduler after the thermos runner was started. In 
> this scenario, the behavior of vCurrent executor will be the same as the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-22 Thread Kai Huang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review150115
---




src/main/python/apache/aurora/executor/aurora_executor.py (line 121)


This is perhaps not the best approach. Exposing 
self.is_health_check_enabled for testing purpose seems a little bit hacky. 
However, I see a couple of benefits here:

1. Do things in one pass, reuse the is_health_check_enabled logic as much 
as possible, with minimum duplication.
2. Minimize the change to health checker and task_info.
3. Reflect the descriptions in the comment of this function.
- Initialize the sandbox
- Start the ThermosTaskRunner (fork the Thermos TaskRunner)
- Set up necessary HealthCheckers
- Set up StatusManager, and attach HealthCheckers


- Kai Huang


On Sept. 23, 2016, 12:57 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 23, 2016, 12:57 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on vCurrent executor, the health checker will send 
> a "TASK_RUNNING" message when a successful required number of health checks 
> is reached within the initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message was sent if the health checker fails to reach the 
> required number of health checks within that period, or a maximum number of 
> failed health check limit is reached after the initital_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will sends 
> "TASK_RUNNING" message to scheduler after the thermos runner was started. In 
> this scenario, the behavior of vCurrent executor will be the same as the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 52094: Add min_consecutive_health_checks in HealthCheckConfig

2016-09-22 Thread Kai Huang


> On Sept. 20, 2016, 7:59 p.m., Aurora ReviewBot wrote:
> > Master (a9f4e26) is green with this patch.
> >   ./build-support/jenkins/build.sh
> > 
> > I will refresh this build result if you post a review containing 
> > "@ReviewBot retry"

ping for review: Could any one take a look at this review? Conceptually it 
depends on https://reviews.apache.org/r/51876/.
However, the implementation should be independent from executor change.


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52094/#review149719
---


On Sept. 20, 2016, 7:43 p.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52094/
> ---
> 
> (Updated Sept. 20, 2016, 7:43 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1224
> https://issues.apache.org/jira/browse/AURORA-1224
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Add min_consecutive_health_checks to HealthCheckConfig.
> 
> [Summary]
> HealthCheckConfig should accept a new configuration value that will tell how 
> many positive consecutive health checks an instance requires to move from 
> STARTING to RUNNING.
> 
> [Background]
> This review depends on the executor change(AURORA-1225). Please see 
> https://reviews.apache.org/r/51876/ for more details and background.
> 
> [Change List]
> 1. Add a configuration value "min_consecutive_health_checks"(default=1) to 
> HealthCheckConfig struct.
> 2. Modify the default value of watch_secs to be 0.
> 3. Add a client-side constraint: 
> initial_interval_secs >= min_consecutive_health_checks * interval_secs
> 4. Update the unit tests for health check config in client/config.py, skip 
> unit tests related to watch_secs.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/client/config.py 
> 0186af52f0d7d7e3981ec59bf6a01aafee2bcfb1 
>   src/main/python/apache/aurora/config/schema/base.py 
> 845163043b0b7b2f9e7aca14677ca9f094658551 
>   src/test/python/apache/aurora/client/test_config.py 
> 5cf68a5145ddf9478baa30453c0bcb73136fa7eb 
> 
> Diff: https://reviews.apache.org/r/52094/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/client::
> 
> ./pants test.pytest src/test/python/apache/aurora/config::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-22 Thread Kai Huang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/
---

(Updated Sept. 23, 2016, 12:57 a.m.)


Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.


Changes
---

Refactored is_health_check_enabled, do one pass when setting up the necessary 
HealthCheckers.


Bugs: AURORA-1225
https://issues.apache.org/jira/browse/AURORA-1225


Repository: aurora


Description
---

Modify executor state transition logic to rely on health checks (if enabled).

[Summary]
Executor needs to start executing user content in STARTING and transition to 
RUNNING when a successful required number of health checks is reached.

This review contains a series of executor changes that implement the health 
check driven updates. It gives more context of the design of this feature.

[Background]
Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
and the design doc: 
https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
 for more details and background.

[Description]
If health check is enabled on vCurrent executor, the health checker will send a 
"TASK_RUNNING" message when a successful required number of health checks is 
reached within the initial_interval_secs. On the other hand, a "TASK_FAILED" 
message was sent if the health checker fails to reach the required number of 
health checks within that period, or a maximum number of failed health check 
limit is reached after the initital_interval_secs.

If health check is disabled on the vCurrent executor, it will sends 
"TASK_RUNNING" message to scheduler after the thermos runner was started. In 
this scenario, the behavior of vCurrent executor will be the same as the vPrev 
executor.

[Change List]
The current change set includes:
1. Removed the status memoization in ChainedStatusChecker.
2. Modified the StatusManager to be edge triggered.
3. Changed the Aurora Executor callback function.
4. Modified the Health Checker and redefined the meaning initial_interval_secs.


Diffs (updated)
-

  src/main/python/apache/aurora/executor/aurora_executor.py 
ce5ef680f01831cd89fced8969ae3246c7f60cfd 
  src/main/python/apache/aurora/executor/common/health_checker.py 
5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
  src/main/python/apache/aurora/executor/common/status_checker.py 
795dae2d6b661fc528d952c2315196d94127961f 
  src/main/python/apache/aurora/executor/status_manager.py 
228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
  src/test/python/apache/aurora/executor/common/test_status_checker.py 
5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
  src/test/python/apache/aurora/executor/test_status_manager.py 
ce4679ba1aa7b42cf0115c943d84663030182d23 
  src/test/python/apache/aurora/executor/test_thermos_executor.py 
0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 

Diff: https://reviews.apache.org/r/51876/diff/


Testing
---

./build-support/jenkins/build.sh

./pants test.pytest src/test/python/apache/aurora/executor::


Thanks,

Kai Huang



Re: Review Request 51929: Scheduling multiple tasks per round.

2016-09-22 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51929/#review150106
---


Ship it!




Ship It!

- Zameer Manji


On Sept. 20, 2016, 3:02 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51929/
> ---
> 
> (Updated Sept. 20, 2016, 3:02 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is phase 2 of scheduling perf improvement effort started in 
> https://reviews.apache.org/r/51759/.
> 
> We can now take multiple (configurable) number of task IDs from a given 
> `TaskGroup` per scheduling. The idea is to go deeper through the offer queue 
> and assign more than one task if possible. This approach delivers 
> substantially better MTTA and still ensures fairness across multiple 
> `TaskGroups`. We have observed almost linear improvement in MTTA (4x+ with 5 
> tasks per round), which suggest the `max_tasks_per_schedule_attempt` can be 
> set even higher if the majority of cluster jobs have large number of 
> instances and/or update batch sizes.
> 
> As far as a single round perf goes, we can consider the following 2 
> worst-case scenarios:
> - master: single task scheduling fails after trying all offers in the queue
> - this patch: N tasks launched with the very last N offers in the queue + `(N 
> x single_task_launch_latency)`
> 
> Assuming that matching N tasks against M offers takes exactly the same time 
> as 1 task against M offers (as they all share the same `TaskGroup`), the only 
> measurable difference comes from the additional `N x 
> single_task_launch_latency` overhead. Based on real cluster observations, the 
> `single_task_launch_latency` is less than 1% of a single task scheduling 
> attempt, which is << than the savings from avoided additional scheduling 
> rounds. 
> 
> As far as jmh results go, the new approach (batching + multiple tasks per 
> round) is only slightly more demanding (~8%). Both results though are MUCH 
> higher than the real cluster perf, which just confirms we are not bound by 
> CPU time here:
> 
> Master:
> ```
> Benchmark
> Mode  Cnt  Score Error  Units
> SchedulingBenchmarks.InsufficientResourcesSchedulingBenchmark.runBenchmark  
> thrpt   10  17126.183 ± 488.425  ops/s
> ```
> 
> This patch:
> ```
> Benchmark
> Mode  Cnt  Score Error  Units
> SchedulingBenchmarks.InsufficientResourcesSchedulingBenchmark.runBenchmark  
> thrpt   10  15838.051 ± 187.890  ops/s
> ```
> 
> 
> Diffs
> -
> 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java 
> 6f1cbfbc4510a037cffc95fee54f62f463d2b534 
>   src/main/java/org/apache/aurora/scheduler/filter/AttributeAggregate.java 
> 87b9e1928ab2d44668df1123f32ffdc4197c0c70 
>   src/main/java/org/apache/aurora/scheduler/scheduling/SchedulingModule.java 
> 664bc6cf964ede2473a4463e58bcdbcb65bc7413 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroup.java 
> 5d319557057e27fd5fc6d3e553e9ca9139399c50 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
> d390c07522d22e43d79ce4370985f3643ef021ca 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskScheduler.java 
> 207d38d1ddfd373892602218a98c1daaf4a1325f 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 
> 7f7b4358ef05c0f0d0e14daac1a5c25488467dc9 
>   
> src/test/java/org/apache/aurora/scheduler/events/NotifyingSchedulingFilterTest.java
>  ece476b918e6f2c128039e561eea23a94d8ed396 
>   
> src/test/java/org/apache/aurora/scheduler/filter/AttributeAggregateTest.java 
> 209f9298a1d55207b9b41159f2ab366f92c1eb70 
>   
> src/test/java/org/apache/aurora/scheduler/filter/SchedulingFilterImplTest.java
>  0cf23df9f373c0d9b27e55a12adefd5f5fd81ba5 
>   src/test/java/org/apache/aurora/scheduler/http/AbstractJettyTest.java 
> c1c3eca4a6e6c88dab6b1c69fae3e2f290b58039 
>   
> src/test/java/org/apache/aurora/scheduler/preemptor/PreemptionVictimFilterTest.java
>  ee5c6528af89cc62a35fdb314358c489556d8131 
>   src/test/java/org/apache/aurora/scheduler/preemptor/PreemptorImplTest.java 
> 98048fabc00f233925b6cca015c2525980556e2b 
>   
> src/test/java/org/apache/aurora/scheduler/preemptor/PreemptorModuleTest.java 
> 2c3e5f32c774be07a5fa28c8bcf3b9a5d88059a1 
>   src/test/java/org/apache/aurora/scheduler/scheduling/TaskGroupsTest.java 
> 88729626de5fa87b45472792c59cc0ff1ade3e93 
>   
> src/test/java/org/apache/aurora/scheduler/scheduling/TaskSchedulerImplTest.java
>  a4e87d2216401f344dca64d69b945de7bcf8159a 
>   src/test/java/org/apache/a

Re: Review Request 51929: Scheduling multiple tasks per round.

2016-09-22 Thread Maxim Khutornenko


> On Sept. 22, 2016, 10:34 p.m., Zameer Manji wrote:
> > src/main/java/org/apache/aurora/scheduler/filter/AttributeAggregate.java, 
> > line 114
> > 
> >
> > Is there a way of doing this without making `AttributeAggregate` 
> > explicitly mutable?
> > 
> > I don't like how we have to update this from the offer out of band 
> > instead of just refreshing from storage.
> > 
> > Why can't we just have a `refresh` method which re-runs the original 
> > query after we consume the offer?
> > 
> > I'm weary of bugs where we forget to call this method after scheduling 
> > with an offer.

The whole purpose of having the `AttributeAggregate` is to avoid reconstructing 
the job state every time a view into attributes is required. The current 
approach saves an expensive (especially in the `DBTaskStore case` and a large 
job combo) fetch across task and attribute stores. The underlying data 
structure of the `AttributeAggregate` is just a Multiset of host attributes and 
their values. Throwing all that data away and reconstructing it again just to 
have a new host representation added is very wasteful, especially given that we 
already have all new attributes handy.

As for not forgetting to call it, I don't see how reconstructing would be any 
better in that regard as we would still have to remember to reconstruct it out 
of band. The `TaskAssignerImplTest.testAssignPartialNoVetoes` covers 
`AttributeAggregate` update case by the way.


> On Sept. 22, 2016, 10:34 p.m., Zameer Manji wrote:
> > src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java, line 
> > 179
> > 
> >
> > We don't check if this is null later, so why not just make it an empty 
> > `ImmutableSet` to start?

I don't see any benefits in that. This is the classic assign-or-throw pattern. 
The `result.get()` returns a Set and we don't pass nulls for collections.


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51929/#review150085
---


On Sept. 20, 2016, 10:02 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51929/
> ---
> 
> (Updated Sept. 20, 2016, 10:02 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is phase 2 of scheduling perf improvement effort started in 
> https://reviews.apache.org/r/51759/.
> 
> We can now take multiple (configurable) number of task IDs from a given 
> `TaskGroup` per scheduling. The idea is to go deeper through the offer queue 
> and assign more than one task if possible. This approach delivers 
> substantially better MTTA and still ensures fairness across multiple 
> `TaskGroups`. We have observed almost linear improvement in MTTA (4x+ with 5 
> tasks per round), which suggest the `max_tasks_per_schedule_attempt` can be 
> set even higher if the majority of cluster jobs have large number of 
> instances and/or update batch sizes.
> 
> As far as a single round perf goes, we can consider the following 2 
> worst-case scenarios:
> - master: single task scheduling fails after trying all offers in the queue
> - this patch: N tasks launched with the very last N offers in the queue + `(N 
> x single_task_launch_latency)`
> 
> Assuming that matching N tasks against M offers takes exactly the same time 
> as 1 task against M offers (as they all share the same `TaskGroup`), the only 
> measurable difference comes from the additional `N x 
> single_task_launch_latency` overhead. Based on real cluster observations, the 
> `single_task_launch_latency` is less than 1% of a single task scheduling 
> attempt, which is << than the savings from avoided additional scheduling 
> rounds. 
> 
> As far as jmh results go, the new approach (batching + multiple tasks per 
> round) is only slightly more demanding (~8%). Both results though are MUCH 
> higher than the real cluster perf, which just confirms we are not bound by 
> CPU time here:
> 
> Master:
> ```
> Benchmark
> Mode  Cnt  Score Error  Units
> SchedulingBenchmarks.InsufficientResourcesSchedulingBenchmark.runBenchmark  
> thrpt   10  17126.183 ± 488.425  ops/s
> ```
> 
> This patch:
> ```
> Benchmark
> Mode  Cnt  Score Error  Units
> SchedulingBenchmarks.InsufficientResourcesSchedulingBenchmark.runBenchmark  
> thrpt   10  15838.051 ± 187.890  ops/s
> ```
> 
> 
> Diffs
> -
> 
>

Re: Review Request 51929: Scheduling multiple tasks per round.

2016-09-22 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51929/#review150085
---




src/main/java/org/apache/aurora/scheduler/filter/AttributeAggregate.java (line 
114)


Is there a way of doing this without making `AttributeAggregate` explicitly 
mutable?

I don't like how we have to update this from the offer out of band instead 
of just refreshing from storage.

Why can't we just have a `refresh` method which re-runs the original query 
after we consume the offer?

I'm weary of bugs where we forget to call this method after scheduling with 
an offer.



src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroup.java (line 47)


Nice use of the new `Stream` API. Very clean and easy to understand.



src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java (line 172)


We don't check if this is null later, so why not just make it an empty 
`ImmutableSet` to start?


- Zameer Manji


On Sept. 20, 2016, 3:02 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51929/
> ---
> 
> (Updated Sept. 20, 2016, 3:02 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is phase 2 of scheduling perf improvement effort started in 
> https://reviews.apache.org/r/51759/.
> 
> We can now take multiple (configurable) number of task IDs from a given 
> `TaskGroup` per scheduling. The idea is to go deeper through the offer queue 
> and assign more than one task if possible. This approach delivers 
> substantially better MTTA and still ensures fairness across multiple 
> `TaskGroups`. We have observed almost linear improvement in MTTA (4x+ with 5 
> tasks per round), which suggest the `max_tasks_per_schedule_attempt` can be 
> set even higher if the majority of cluster jobs have large number of 
> instances and/or update batch sizes.
> 
> As far as a single round perf goes, we can consider the following 2 
> worst-case scenarios:
> - master: single task scheduling fails after trying all offers in the queue
> - this patch: N tasks launched with the very last N offers in the queue + `(N 
> x single_task_launch_latency)`
> 
> Assuming that matching N tasks against M offers takes exactly the same time 
> as 1 task against M offers (as they all share the same `TaskGroup`), the only 
> measurable difference comes from the additional `N x 
> single_task_launch_latency` overhead. Based on real cluster observations, the 
> `single_task_launch_latency` is less than 1% of a single task scheduling 
> attempt, which is << than the savings from avoided additional scheduling 
> rounds. 
> 
> As far as jmh results go, the new approach (batching + multiple tasks per 
> round) is only slightly more demanding (~8%). Both results though are MUCH 
> higher than the real cluster perf, which just confirms we are not bound by 
> CPU time here:
> 
> Master:
> ```
> Benchmark
> Mode  Cnt  Score Error  Units
> SchedulingBenchmarks.InsufficientResourcesSchedulingBenchmark.runBenchmark  
> thrpt   10  17126.183 ± 488.425  ops/s
> ```
> 
> This patch:
> ```
> Benchmark
> Mode  Cnt  Score Error  Units
> SchedulingBenchmarks.InsufficientResourcesSchedulingBenchmark.runBenchmark  
> thrpt   10  15838.051 ± 187.890  ops/s
> ```
> 
> 
> Diffs
> -
> 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java 
> 6f1cbfbc4510a037cffc95fee54f62f463d2b534 
>   src/main/java/org/apache/aurora/scheduler/filter/AttributeAggregate.java 
> 87b9e1928ab2d44668df1123f32ffdc4197c0c70 
>   src/main/java/org/apache/aurora/scheduler/scheduling/SchedulingModule.java 
> 664bc6cf964ede2473a4463e58bcdbcb65bc7413 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroup.java 
> 5d319557057e27fd5fc6d3e553e9ca9139399c50 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
> d390c07522d22e43d79ce4370985f3643ef021ca 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskScheduler.java 
> 207d38d1ddfd373892602218a98c1daaf4a1325f 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 
> 7f7b4358ef05c0f0d0e14daac1a5c25488467dc9 
>   
> src/test/java/org/apache/aurora/scheduler/events/NotifyingSchedulingFilterTest.java
>  ece476b918e6f2c128039e561eea23a94d8ed396 
>   
> src/test/java/org/apache/aurora/scheduler/filter/AttributeAggrega

Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-22 Thread Kai Huang


> On Sept. 21, 2016, 5:34 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, lines 
> > 279-283
> > 
> >
> > This logic needs to be refactored to reduce duplication and reuse 
> > what's now in `is_health_check_enabled` as much as possible. Ideally, we 
> > should have the only place we extract `health_checker` and the like.
> 
> Kai Huang wrote:
> The complexity is that the health checker did some computation to 
> calculate the port map while going through the if condition. It would be 
> helpful to introduce a utitlity class to store all the side-effect values 
> like port map.

I had two concerns here:
1. If we decide to extract the health_check_enabled() from health_checker.py, 
and put it into task_info.py, this will result in a circular dependency, 
because is_health_check_enabled() function requires some string constants and 
variables defined in health_checker.py, which in turn requires task_info.py. 
This explains why I move the string constants to api.thrift as global constants.

2. It seems that is_health_check_enabled function will always be an overkill. 
We cannot just check if health check is enabled for a task, without computing 
the port maps and extracting the health_checker and health_check_config from 
it. It's coupled with the setting up of a health checker. To eliminate 
duplication, Joshua and I discussed the idea to check if health check is 
enabled while we set up the necessary health checkers. However this approach 
results in flaky tests, probably because it is too heavy-weighted.


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149846
---


On Sept. 20, 2016, 12:25 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 20, 2016, 12:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on vCurrent executor, the health checker will send 
> a "TASK_RUNNING" message when a successful required number of health checks 
> is reached within the initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message was sent if the health checker fails to reach the 
> required number of health checks within that period, or a maximum number of 
> failed health check limit is reached after the initital_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will sends 
> "TASK_RUNNING" message to scheduler after the thermos runner was started. In 
> this scenario, the behavior of vCurrent executor will be the same as the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/common/task_info.py 
> 4ef49e30eeb942886d02c1fb448055264f9aa874 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5b

Re: Review Request 51929: Scheduling multiple tasks per round.

2016-09-22 Thread Zameer Manji


> On Sept. 20, 2016, 11:49 a.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/scheduling/SchedulingModule.java, 
> > lines 93-96
> > 
> >
> > Regarding your notes in the RB description: I don't see a problem if we 
> > set this to a slightly higher value such as `10` or `15`. 
> > 
> > It seems like we will maintain the basic taskgroup round robin 
> > scheduling fairness even with slightly larger batch sizes, so I am ok with 
> > bumping the value.
> 
> Maxim Khutornenko wrote:
> The higher this count is the less fair do we treat lower instance count 
> jobs in case the capacity of the cluster is constrained. It's hard to come up 
> with an ideal default here but I feel 10-15 would be a bit too extreme for 
> the general case.

+1

I think 5 is suitable as well.


- Zameer


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51929/#review149681
---


On Sept. 20, 2016, 3:02 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51929/
> ---
> 
> (Updated Sept. 20, 2016, 3:02 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is phase 2 of scheduling perf improvement effort started in 
> https://reviews.apache.org/r/51759/.
> 
> We can now take multiple (configurable) number of task IDs from a given 
> `TaskGroup` per scheduling. The idea is to go deeper through the offer queue 
> and assign more than one task if possible. This approach delivers 
> substantially better MTTA and still ensures fairness across multiple 
> `TaskGroups`. We have observed almost linear improvement in MTTA (4x+ with 5 
> tasks per round), which suggest the `max_tasks_per_schedule_attempt` can be 
> set even higher if the majority of cluster jobs have large number of 
> instances and/or update batch sizes.
> 
> As far as a single round perf goes, we can consider the following 2 
> worst-case scenarios:
> - master: single task scheduling fails after trying all offers in the queue
> - this patch: N tasks launched with the very last N offers in the queue + `(N 
> x single_task_launch_latency)`
> 
> Assuming that matching N tasks against M offers takes exactly the same time 
> as 1 task against M offers (as they all share the same `TaskGroup`), the only 
> measurable difference comes from the additional `N x 
> single_task_launch_latency` overhead. Based on real cluster observations, the 
> `single_task_launch_latency` is less than 1% of a single task scheduling 
> attempt, which is << than the savings from avoided additional scheduling 
> rounds. 
> 
> As far as jmh results go, the new approach (batching + multiple tasks per 
> round) is only slightly more demanding (~8%). Both results though are MUCH 
> higher than the real cluster perf, which just confirms we are not bound by 
> CPU time here:
> 
> Master:
> ```
> Benchmark
> Mode  Cnt  Score Error  Units
> SchedulingBenchmarks.InsufficientResourcesSchedulingBenchmark.runBenchmark  
> thrpt   10  17126.183 ± 488.425  ops/s
> ```
> 
> This patch:
> ```
> Benchmark
> Mode  Cnt  Score Error  Units
> SchedulingBenchmarks.InsufficientResourcesSchedulingBenchmark.runBenchmark  
> thrpt   10  15838.051 ± 187.890  ops/s
> ```
> 
> 
> Diffs
> -
> 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java 
> 6f1cbfbc4510a037cffc95fee54f62f463d2b534 
>   src/main/java/org/apache/aurora/scheduler/filter/AttributeAggregate.java 
> 87b9e1928ab2d44668df1123f32ffdc4197c0c70 
>   src/main/java/org/apache/aurora/scheduler/scheduling/SchedulingModule.java 
> 664bc6cf964ede2473a4463e58bcdbcb65bc7413 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroup.java 
> 5d319557057e27fd5fc6d3e553e9ca9139399c50 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
> d390c07522d22e43d79ce4370985f3643ef021ca 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskScheduler.java 
> 207d38d1ddfd373892602218a98c1daaf4a1325f 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 
> 7f7b4358ef05c0f0d0e14daac1a5c25488467dc9 
>   
> src/test/java/org/apache/aurora/scheduler/events/NotifyingSchedulingFilterTest.java
>  ece476b918e6f2c128039e561eea23a94d8ed396 
>   
> src/test/java/org/apache/aurora/scheduler/filter/AttributeAggregateTest.java 
> 209f9298a1d55207b9b41159f2ab366f92c1eb70 
>   
> src/test/java/org/apache/aurora/scheduler/filter/Scheduli

Re: Review Request 52167: Prepare release notes for release, all update committers guide with details on this step and on reverting auto-generated commits for subsequent release candidates.

2016-09-22 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52167/#review150053
---


Ship it!




Ship It!

- Maxim Khutornenko


On Sept. 22, 2016, 6:06 p.m., Joshua Cohen wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52167/
> ---
> 
> (Updated Sept. 22, 2016, 6:06 p.m.)
> 
> 
> Review request for Aurora and Maxim Khutornenko.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Prepare release notes for release, all update committers guide with details 
> on this step and on reverting auto-generated commits for subsequent release 
> candidates.
> 
> 
> Diffs
> -
> 
>   RELEASE-NOTES.md 854bcbdf4e0dc40d0e18532642903e67232d2c9a 
>   docs/development/committers-guide.md 
> ebf464bf887772862bf9d420a8f65720038275bc 
> 
> Diff: https://reviews.apache.org/r/52167/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Joshua Cohen
> 
>



Re: Review Request 52167: Prepare release notes for release, all update committers guide with details on this step and on reverting auto-generated commits for subsequent release candidates.

2016-09-22 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52167/#review150041
---


Ship it!




Master (6edbac2) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 22, 2016, 6:06 p.m., Joshua Cohen wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52167/
> ---
> 
> (Updated Sept. 22, 2016, 6:06 p.m.)
> 
> 
> Review request for Aurora and Maxim Khutornenko.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Prepare release notes for release, all update committers guide with details 
> on this step and on reverting auto-generated commits for subsequent release 
> candidates.
> 
> 
> Diffs
> -
> 
>   RELEASE-NOTES.md 854bcbdf4e0dc40d0e18532642903e67232d2c9a 
>   docs/development/committers-guide.md 
> ebf464bf887772862bf9d420a8f65720038275bc 
> 
> Diff: https://reviews.apache.org/r/52167/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Joshua Cohen
> 
>



Review Request 52167: Prepare release notes for release, all update committers guide with details on this step and on reverting auto-generated commits for subsequent release candidates.

2016-09-22 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52167/
---

Review request for Aurora and Maxim Khutornenko.


Repository: aurora


Description
---

Prepare release notes for release, all update committers guide with details on 
this step and on reverting auto-generated commits for subsequent release 
candidates.


Diffs
-

  RELEASE-NOTES.md 854bcbdf4e0dc40d0e18532642903e67232d2c9a 
  docs/development/committers-guide.md ebf464bf887772862bf9d420a8f65720038275bc 

Diff: https://reviews.apache.org/r/52167/diff/


Testing
---


Thanks,

Joshua Cohen



Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-22 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review150035
---


Ship it!





src/main/java/org/apache/aurora/scheduler/BatchWorker.java (line 195)


For posterity,

This is important because the Guava docs says that `run()` should respond 
to shutdown requests by checking `isRunning()`. If we block indefinately we 
technically don't respond to shutdown requests.


- Zameer Manji


On Sept. 22, 2016, 8:20 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> ---
> 
> (Updated Sept. 22, 2016, 8:20 a.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
> https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> ##Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
> storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
> for (IScheduledTask scheduledTask : scheduledTasks) {
>   updateTaskAndExternalState(
>   storeProvider.getUnsafeTaskStore(),
>   Tasks.id(scheduledTask),
>   Optional.of(scheduledTask),
>   Optional.of(PENDING),
>   Optional.absent());
> }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ##Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> ---
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-22 Thread Joshua Cohen


> On Sept. 22, 2016, 2:24 p.m., Joshua Cohen wrote:
> > What's the behavior pre-BatchWorker in this case? Would we fail hard, or is 
> > there something inherent to the batch worker that makes this necessary?
> > 
> > Also, this fixes the error handling that led to corruption in the event of 
> > a failure, but should we also determine the root cause that led to the 
> > failure to begin with (as part of another review no doubt)?
> 
> Maxim Khutornenko wrote:
> The pre-BatchWorker behavior depended on where/how the error was thrown 
> and could have resulted in a hard shutdown or a mysterious silent async 
> failure killing one of the executor threads. With the BatchWorker, the 
> scheduler will always fail on unhandled exception. This is needed to preserve 
> the transaction boundaries and let the error rollback the changes.
> 
> > ...but should we also determine the root cause that led to the failure 
> to begin with  
> 
> I thought it should have been clear from the RB description but then I 
> realized I explained it in Slack and not here. We purposefully induce 
> scheduler failovers many times during the e2e run to verify things like job 
> updates still work. If the shutdown happens during the transaction, the DB 
> gets closed underneath and the subsequent queries fail any way they like. 
> This is exactly what happened in this case. A task is saved, driver exits, 
> StateManager attempts to save task events but fails due to "DB closed" error. 
> That error is caught by the BatchWorker and the native log transaction 
> successfully completes before the scheduler fully terminates.

Thanks, I missed that in slack!


- Joshua


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review14
---


On Sept. 22, 2016, 3:20 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> ---
> 
> (Updated Sept. 22, 2016, 3:20 p.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
> https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> ##Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
> storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
> for (IScheduledTask scheduledTask : scheduledTasks) {
>   updateTaskAndExternalState(
>   storeProvider.getUnsafeTaskStore(),
>   Tasks.id(scheduledTask),
>   Optional.of(scheduledTask),
>   Optional.of(PENDING),
>   Optional.absent());
> }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ##Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> ---
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-22 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review150026
---


Ship it!




Ship It!

- Joshua Cohen


On Sept. 22, 2016, 3:20 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> ---
> 
> (Updated Sept. 22, 2016, 3:20 p.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
> https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> ##Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
> storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
> for (IScheduledTask scheduledTask : scheduledTasks) {
>   updateTaskAndExternalState(
>   storeProvider.getUnsafeTaskStore(),
>   Tasks.id(scheduledTask),
>   Optional.of(scheduledTask),
>   Optional.of(PENDING),
>   Optional.absent());
> }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ##Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> ---
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-22 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review150020
---


Ship it!




Master (d3c5ca7) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 22, 2016, 3:20 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> ---
> 
> (Updated Sept. 22, 2016, 3:20 p.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
> https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> ##Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
> storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
> for (IScheduledTask scheduledTask : scheduledTasks) {
>   updateTaskAndExternalState(
>   storeProvider.getUnsafeTaskStore(),
>   Tasks.id(scheduledTask),
>   Optional.of(scheduledTask),
>   Optional.of(PENDING),
>   Optional.absent());
> }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ##Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> ---
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-22 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/
---

(Updated Sept. 22, 2016, 3:20 p.m.)


Review request for Aurora, Stephan Erb and Zameer Manji.


Changes
---

Removing redundant shutdown hook.


Bugs: AURORA-1779
https://issues.apache.org/jira/browse/AURORA-1779


Repository: aurora


Description
---

##Problem

The current `BatchWorker` error handling assumes graceful recovery if any of 
the individual batch items fail. This may not hold true if the storage is 
modified _before_ the error is thrown and the `LogStorage` transaction is not 
rolled back. Consider the following fragment from the `StateManagerImpl`:
```
storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);

for (IScheduledTask scheduledTask : scheduledTasks) {
  updateTaskAndExternalState(
  storeProvider.getUnsafeTaskStore(),
  Tasks.id(scheduledTask),
  Optional.of(scheduledTask),
  Optional.of(PENDING),
  Optional.absent());
}
```
If a task is saved but the subsequent `updateTaskAndExternalState()` throws AND 
`BatchWorker` catches and logs that exception, the native store transaction 
will be committed and as a result the storage will be left in a logically 
inconsistent state.

##Solution

I can see at least 3 ways to solve this problem:
1. Fail hard and shutdown scheduler when any of the batch items throw;
2. Fail only the last batch (drop all its items) and let storage transaction 
roll back;
3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
failed item would be retried when the storage transaction rolls back;

The #3 is the most involved and would be very hard to get right (assuming batch 
items are idempotent, which may not be the case). The #2 may result in a very 
hard to troubleshoot scenario where _some_ items would be dropped on the floor 
and never completed. 

I suggest we go with #1 as the most straightforward and transparent approach. 
It's also the only one that ensures storage state consistency, which is the 
most vital invariant to preserve (as AURORA-1779 proves). The only downside of 
this approach is that scheduler will go down hard any time an unhandled error 
is thrown but arguably this is the easiest way to improve our codebase and 
certainly better than leaving the scheduler in a crippled state.


Diffs (updated)
-

  src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
e05d4b48749b0691902a505bea1b4490fdd08f29 
  src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
a86dc82cafa7f5436d2b8d00c6db575ff8311eea 

Diff: https://reviews.apache.org/r/52141/diff/


Testing
---

unit and e2e tests


Thanks,

Maxim Khutornenko



Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-22 Thread Maxim Khutornenko


> On Sept. 22, 2016, 2:24 p.m., Joshua Cohen wrote:
> > What's the behavior pre-BatchWorker in this case? Would we fail hard, or is 
> > there something inherent to the batch worker that makes this necessary?
> > 
> > Also, this fixes the error handling that led to corruption in the event of 
> > a failure, but should we also determine the root cause that led to the 
> > failure to begin with (as part of another review no doubt)?

The pre-BatchWorker behavior depended on where/how the error was thrown and 
could have resulted in a hard shutdown or a mysterious silent async failure 
killing one of the executor threads. With the BatchWorker, the scheduler will 
always fail on unhandled exception. This is needed to preserve the transaction 
boundaries and let the error rollback the changes.

> ...but should we also determine the root cause that led to the failure to 
> begin with  

I thought it should have been clear from the RB description but then I realized 
I explained it in Slack and not here. We purposefully induce scheduler 
failovers many times during the e2e run to verify things like job updates still 
work. If the shutdown happens during the transaction, the DB gets closed 
underneath and the subsequent queries fail any way they like. This is exactly 
what happened in this case. A task is saved, driver exits, StateManager 
attempts to save task events but fails due to "DB closed" error. That error is 
caught by the BatchWorker and the native log transaction successfully completes 
before the scheduler fully terminates.


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review14
---


On Sept. 22, 2016, 5:38 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> ---
> 
> (Updated Sept. 22, 2016, 5:38 a.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
> https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> ##Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
> storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
> for (IScheduledTask scheduledTask : scheduledTasks) {
>   updateTaskAndExternalState(
>   storeProvider.getUnsafeTaskStore(),
>   Tasks.id(scheduledTask),
>   Optional.of(scheduledTask),
>   Optional.of(PENDING),
>   Optional.absent());
> }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ##Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
> 2ec3967ddb1d470cf681de062a6400f647978185 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
> 7c8047a7235a937c29fe96517242923ff84a080c 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
> d390c07522d22e43d79ce4370985f3643ef021ca 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82ca

Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-22 Thread Maxim Khutornenko


> On Sept. 22, 2016, 2:25 p.m., Zameer Manji wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, line 152
> > 
> >
> > Something is wrong here.
> > 
> > The listener from `GuavaUtils.LifecycleShutdownListener` should have 
> > already been applied to this service in `SchedulerServicesModule`. We used 
> > `addSchedulerActiveServiceBinding` to add the BatchWorker services so I'm 
> > not sure why this helps at all.

Ah, that's right, that should be enough. So, the fix is even simpler, just let 
the error bubble up and the existing shutdown hook will take care of the rest. 
Verified in vagrant that's exactly what happens.


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review15
---


On Sept. 22, 2016, 5:38 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> ---
> 
> (Updated Sept. 22, 2016, 5:38 a.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
> https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> ##Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
> storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
> for (IScheduledTask scheduledTask : scheduledTasks) {
>   updateTaskAndExternalState(
>   storeProvider.getUnsafeTaskStore(),
>   Tasks.id(scheduledTask),
>   Optional.of(scheduledTask),
>   Optional.of(PENDING),
>   Optional.absent());
> }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ##Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
> 2ec3967ddb1d470cf681de062a6400f647978185 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
> 7c8047a7235a937c29fe96517242923ff84a080c 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
> d390c07522d22e43d79ce4370985f3643ef021ca 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
>   src/test/java/org/apache/aurora/scheduler/cron/quartz/CronIT.java 
> 8556253fc11f6027316871eb9dc66d8627a77fe6 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> ---
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-22 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review15
---




src/main/java/org/apache/aurora/scheduler/BatchWorker.java (line 152)


Something is wrong here.

The listener from `GuavaUtils.LifecycleShutdownListener` should have 
already been applied to this service in `SchedulerServicesModule`. We used 
`addSchedulerActiveServiceBinding` to add the BatchWorker services so I'm not 
sure why this helps at all.


- Zameer Manji


On Sept. 21, 2016, 10:38 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> ---
> 
> (Updated Sept. 21, 2016, 10:38 p.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
> https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> ##Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
> storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
> for (IScheduledTask scheduledTask : scheduledTasks) {
>   updateTaskAndExternalState(
>   storeProvider.getUnsafeTaskStore(),
>   Tasks.id(scheduledTask),
>   Optional.of(scheduledTask),
>   Optional.of(PENDING),
>   Optional.absent());
> }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ##Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
> 2ec3967ddb1d470cf681de062a6400f647978185 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
> 7c8047a7235a937c29fe96517242923ff84a080c 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
> d390c07522d22e43d79ce4370985f3643ef021ca 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
>   src/test/java/org/apache/aurora/scheduler/cron/quartz/CronIT.java 
> 8556253fc11f6027316871eb9dc66d8627a77fe6 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> ---
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-22 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review14
---



What's the behavior pre-BatchWorker in this case? Would we fail hard, or is 
there something inherent to the batch worker that makes this necessary?

Also, this fixes the error handling that led to corruption in the event of a 
failure, but should we also determine the root cause that led to the failure to 
begin with (as part of another review no doubt)?

- Joshua Cohen


On Sept. 22, 2016, 5:38 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> ---
> 
> (Updated Sept. 22, 2016, 5:38 a.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
> https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> ##Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
> storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
> for (IScheduledTask scheduledTask : scheduledTasks) {
>   updateTaskAndExternalState(
>   storeProvider.getUnsafeTaskStore(),
>   Tasks.id(scheduledTask),
>   Optional.of(scheduledTask),
>   Optional.of(PENDING),
>   Optional.absent());
> }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ##Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
> 2ec3967ddb1d470cf681de062a6400f647978185 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
> 7c8047a7235a937c29fe96517242923ff84a080c 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
> d390c07522d22e43d79ce4370985f3643ef021ca 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
>   src/test/java/org/apache/aurora/scheduler/cron/quartz/CronIT.java 
> 8556253fc11f6027316871eb9dc66d8627a77fe6 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> ---
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-22 Thread Stephan Erb

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review149963
---


Ship it!




Ship It!

- Stephan Erb


On Sept. 22, 2016, 7:38 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> ---
> 
> (Updated Sept. 22, 2016, 7:38 a.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
> https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> ##Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
> storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
> for (IScheduledTask scheduledTask : scheduledTasks) {
>   updateTaskAndExternalState(
>   storeProvider.getUnsafeTaskStore(),
>   Tasks.id(scheduledTask),
>   Optional.of(scheduledTask),
>   Optional.of(PENDING),
>   Optional.absent());
> }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ##Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
> 2ec3967ddb1d470cf681de062a6400f647978185 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
> 7c8047a7235a937c29fe96517242923ff84a080c 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
> d390c07522d22e43d79ce4370985f3643ef021ca 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
>   src/test/java/org/apache/aurora/scheduler/cron/quartz/CronIT.java 
> 8556253fc11f6027316871eb9dc66d8627a77fe6 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> ---
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>