-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/
-----------------------------------------------------------

(Updated Sept. 14, 2016, 4:47 p.m.)


Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.


Changes
-------

Javadoc changes and minor refactoring.


Repository: aurora


Description
-------

This is the first (out of 3) patches intending to reduce storage write lock 
contention and as such improve overall system write throughput. It introduces 
the `BatchWorker` and migrates the majority of storage writes due to task 
status change events to use `TaskEventBatchWorker`.

#####Problem
Our current storage system writes effectively behave as `SERIALIZABLE` 
transaction isolation level in SQL terms. This means all writes require 
exclusive access to the storage and no two transactions can happen in parallel 
[1]. While it certainly simplifies our implementation, it creates a single 
hotspot where multiple threads are competing for the storage write access. This 
type of contention only worsens as the cluster size grows, more tasks are 
scheduled, more status updates are processed, more subscribers are listening to 
status updates and etc. Eventually, the scheduler throughput (and especially 
task scheduling) becomes degraded to the extent that certain operations wait 
much longer (4x and more) for the lock acquisition than it takes to process 
their payload when inside the transaction. Some ops (like event processing) are 
generally tolerant of these types of delays. Others - not as much. The task 
scheduling suffers the most as backing up the scheduling queue directly affects 
t
 he Median Time To Assigned (MTTA).

#####Remediation
Given the above, it's natural to assume that reducing the number of write 
transactions should help reducing the lock contention. This patch introduces a 
generic `BatchWorker` service that delivers a "best effort" batching approach 
by redirecting multiple individual write requests into a single FIFO queue 
served non-stop by a single dedicated thread. Every batch shares a single write 
transaction thus reducing the number of potential write lock requests. To 
minimize wait-in-queue time, items are dispatched immediately and the max 
number of items is bounded. There are a few `BatchWorker` instances specialized 
on particular workload types: task even processing, cron scheduling and task 
scheduling. Every instance can be tuned independently (max batch size) and 
provides specialized metrics helping to monitor each workload type perf.

#####Results
The proposed approach has been heavily tested in production and delivered the 
best results. The lock contention latencies got down between 2x and 5x 
depending on the cluster load. A number of other approaches tried but discarded 
as not performing well or even performing much worse than the current master:
- Clock-driven batch execution - every batch is dispatched on a time schedule
- Max batch with a deadline - a batch is dispatched when max size is reached OR 
a timeout expires
- Various combinations of the above - some `BatchWorkers` are using 
clock-driven execution while others are using max batch with a deadline
- Completely non-blocking (event-based) completion notification - all call 
sites are notified of item completion via a `BatchWorkCompleted` event

Happy to provide more details on the above if interested.

#####Upcoming
The introduction of the `BatchWorker` by itself was not enough to substantially 
improve the MTTA. It, however, paves the way for the next phase of scheduling 
perf improvement - taking more than 1 task from a given `TaskGroup` in a single 
scheduling round (coming soon). That improvement wouldn't deliver without 
decreasing the lock contention first. 

Note: it wasn't easy to have a clean diff split, so some functionality in 
`BatchWorker` (e.g.: `executeWithReplay`) appears to be unused in the current 
patch but will become obvious in the part 2 (coming out shortly).  

[1] - 
https://github.com/apache/aurora/blob/f6ac13b169aaad5aad73ef3cc72873781e30a705/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L540-L555


Diffs (updated)
-----

  src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
4a7ef0b1b90607f68d89fe8e207f42c42a8c56a0 
  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
f07746c2b990c1c2235e99f9e4775fc84f9c27b1 
  src/main/java/org/apache/aurora/scheduler/scheduling/TaskThrottler.java 
bbd971a2aa8a96cf79edd879ad60e1bebd933d79 
  src/main/java/org/apache/aurora/scheduler/state/MaintenanceController.java 
3c7cda09ab292d696070ca4d9dfedb1f6f71b0fe 
  
src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java 
594bb6219294dcc77d48dcad14e2a6f9caa0c534 
  src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java PRE-CREATION 
  src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
99c27e8012f10a67ce5f1b84d258e7a5608995c7 
  src/test/java/org/apache/aurora/scheduler/scheduling/TaskThrottlerTest.java 
7d104aa2ea4a4d99be4711f666d18beca238284e 
  
src/test/java/org/apache/aurora/scheduler/state/MaintenanceControllerImplTest.java
 94f5ca565476f62d72879837a0e7dafabcf30432 
  src/test/java/org/apache/aurora/scheduler/testing/BatchWorkerUtil.java 
PRE-CREATION 
  src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 
196df4754b553f05e50b66ad2f84271901bc9eba 

Diff: https://reviews.apache.org/r/51759/diff/


Testing
-------

All types of testing including deploying to test and production clusters.


Thanks,

Maxim Khutornenko

Reply via email to