This is an automated email from the ASF dual-hosted git repository.

cbickel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-openwhisk.git
The following commit(s) were added to refs/heads/master by this push:
     new 06818a4  Add documentation to the loadbalancer. (#3778)
06818a4 is described below

commit 06818a4a8056aea4c0c555033f2a947ff15e33fa
Author: Markus Thömmes <markusthoem...@me.com>
AuthorDate: Tue Jul 3 09:00:04 2018 +0200

    Add documentation to the loadbalancer. (#3778)

    * Add documentation to the loadbalancer.

    * Add information on the overflow and other edge cases.

    * Incorporating more feedback to make prose description clearer.

    * Clarify capacity determination.

    * Clarify health protocol.
---
 .../ShardingContainerPoolBalancer.scala            | 90 +++++++++++++++++++++-
 1 file changed, 88 insertions(+), 2 deletions(-)

diff --git a/core/controller/src/main/scala/whisk/core/loadBalancer/ShardingContainerPoolBalancer.scala b/core/controller/src/main/scala/whisk/core/loadBalancer/ShardingContainerPoolBalancer.scala
index eac4aff..72124ec 100644
--- a/core/controller/src/main/scala/whisk/core/loadBalancer/ShardingContainerPoolBalancer.scala
+++ b/core/controller/src/main/scala/whisk/core/loadBalancer/ShardingContainerPoolBalancer.scala
@@ -45,10 +45,96 @@ import scala.concurrent.{ExecutionContext, Future, Promise}
 import scala.util.{Failure, Success}
 
 /**
- * A loadbalancer that uses "horizontal" sharding to not collide with fellow loadbalancers.
+ * A loadbalancer that schedules workload based on a hashing-algorithm.
+ *
+ * ## Algorithm
+ *
+ * At first, for every namespace + action pair a hash is calculated and then an invoker is picked based on that hash
+ * (`hash % numInvokers`). The determined index is the so-called "home invoker". This is the invoker where the
+ * following progression will **always** start. If this invoker is healthy (see "Invoker health checking") and if
+ * there is capacity on that invoker (see "Capacity checking"), the request is scheduled to it.
+ *
+ * If one of these prerequisites is not met, the index is incremented by a step-size. The step-sizes available are
+ * the numbers smaller than the number of invokers available that are coprime to it (coprime, to minimize collisions
+ * while progressing through the invokers). The step-size is picked via the same hash calculated above
+ * (`hash % numStepSizes`). The home-invoker index is then incremented by the step-size and the checks
+ * (healthy + capacity) are done on the invoker we now land on.
+ *
+ * This procedure is repeated until all invokers have been checked, at which point the "overload" strategy is
+ * employed: a healthy invoker is chosen randomly. In a steadily running system, reaching this overload case means
+ * that there is no capacity left on any invoker to schedule the current request to.
+ *
+ * If no invokers are available or if there are no healthy invokers in the system, the loadbalancer returns an error
+ * stating that no invokers are available to take any work. Requests are not queued anywhere in this case.
+ *
+ * An example:
+ * - availableInvokers: 10 (all healthy)
+ * - hash: 13
+ * - homeInvoker: hash % availableInvokers = 13 % 10 = 3
+ * - stepSizes: 1, 3, 7 (note how 2 and 5 are not part of this list because they are not coprime to 10, and 9 is
+ *   left out to keep the step-sizes pairwise coprime)
+ * - stepSizeIndex: hash % numStepSizes = 13 % 3 = 1 => stepSize = 3
+ *
+ * Progression to check the invokers: 3, 6, 9, 2, 5, 8, 1, 4, 7, 0 --> done
+ *
+ * This heuristic is based on the assumption that the chance to get a warm container is best on the home invoker and
+ * degrades with every step taken. The hashing makes sure that all loadbalancers in a cluster will always pick the
+ * same home invoker and do the same progression for a given action.
+ *
+ * Known caveats:
+ * - This assumption is not always true. For instance, two heavy workloads landing on the same invoker can displace
+ *   each other, which results in many cold starts because the invoker keeps evicting containers to make space for
+ *   the respective "other" workload. Future work could be to keep a buffer of the invokers last scheduled to for
+ *   each action and to prefer picking the most recent one, then the second-most-recent one and so forth.
+ *
+ * ## Capacity checking
+ *
+ * The maximum capacity per invoker is configured via `invoker-busy-threshold`, which is the maximum number of
+ * actions running in parallel on that invoker.
+ *
+ * Spare capacity is determined by what the loadbalancer thinks it has scheduled to each invoker. Upon scheduling, an
+ * entry is made to update the books and a slot in a Semaphore is taken. That slot is only released after the
+ * response from the invoker (active-ack) arrives **or** after the active-ack times out. The Semaphore has as many
+ * slots as are configured via `invoker-busy-threshold`.
+ *
+ * Known caveats:
+ * - In an overload scenario, activations are queued directly to the invokers, which makes the active-ack timeout
+ *   unpredictable. Timing out active-acks in that case can cause the loadbalancer to prematurely assign new load to
+ *   an overloaded invoker, which can cause uneven queues.
+ * - The same is true if an invoker is extraordinarily slow in processing activations. The queue on such an invoker
+ *   will slowly grow if it is slow to the point of still sending pings but handling its load so slowly that the
+ *   active-acks time out. Again the loadbalancer will think there is capacity when in fact there is none.
+ *
+ * Both caveats could be solved in future work by not queueing to invoker topics on overload, but queueing on a
+ * centralized overflow topic instead. Timing out an active-ack could then be treated as a system error, as
+ * described below.
+ *
+ * ## Invoker health checking
+ *
+ * Invoker health is determined via a Kafka-based protocol, where each invoker pings the loadbalancer every second.
+ * If no ping is seen for a defined amount of time, the invoker is considered "Offline".
+ *
+ * Moreover, results from all activations are inspected. If more than 3 of the last 10 activations contained system
+ * errors, the invoker is considered "Unhealthy". While an invoker is unhealthy, no user workload is sent to it, but
+ * test-actions are sent by the loadbalancer to check whether system errors are still occurring. Once the number of
+ * system errors in the last 10 activations no longer exceeds the threshold of 3, the invoker is considered
+ * "Healthy" again.
+ *
+ * To summarize:
+ * - "Offline": no ping seen for more than 10 seconds
+ * - "Unhealthy": more than 3 **system-errors** in the last 10 activations, pings arriving as usual
+ * - "Healthy": 3 or fewer **system-errors** in the last 10 activations, pings arriving as usual
+ *
+ * ## Horizontal sharding
+ *
+ * Sharding is employed to avoid the loadbalancers having to share any data, because the metrics used in scheduling
+ * change very quickly.
 *
 * Horizontal sharding means, that each invoker's capacity is evenly divided between the loadbalancers. If an invoker
- * has at most 16 slots available, those will be divided to 8 slots for each loadbalancer (if there are 2).
+ * has at most 16 slots available (invoker-busy-threshold = 16), those will be divided into 8 slots for each
+ * loadbalancer (if there are 2).
+ *
+ * Known caveats:
+ * - If a loadbalancer leaves or joins the cluster, all state is removed and created from scratch. Those events
+ *   should not happen often.
 */
class ShardingContainerPoolBalancer(config: WhiskConfig, controllerInstance: ControllerInstanceId)(
  implicit val actorSystem: ActorSystem,
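The home-invoker and step-size progression described in the documentation above can be sketched in a few lines of standalone Scala. This is an illustrative reimplementation, not the project's actual code; it assumes the step-sizes are additionally kept pairwise coprime among themselves, which is what makes the worked example's list come out as 1, 3, 7 (rather than also including 9) for ten invokers.

```scala
import scala.annotation.tailrec

// Sketch of the hashing-based scheduling progression. Names like
// `stepSizes` and `progression` are invented for this example.
object SchedulingSketch {

  @tailrec
  def gcd(a: Int, b: Int): Int = if (b == 0) a else gcd(b, a % b)

  /** Step-sizes for `n` invokers: numbers coprime to `n`, kept pairwise
    * coprime among themselves (assumption; this is why 9 is absent for n = 10). */
  def stepSizes(n: Int): IndexedSeq[Int] =
    (1 to n).foldLeft(IndexedSeq.empty[Int]) { (acc, cur) =>
      if (gcd(cur, n) == 1 && acc.forall(p => gcd(p, cur) == 1)) acc :+ cur
      else acc
    }

  /** Order in which invokers are probed for a given hash: start at the
    * home-invoker, then keep adding the hash-selected step-size modulo n. */
  def progression(hash: Int, numInvokers: Int): Seq[Int] = {
    val home  = hash % numInvokers
    val sizes = stepSizes(numInvokers)
    val step  = sizes(hash % sizes.size)
    (0 until numInvokers).map(i => (home + i * step) % numInvokers)
  }
}
```

Running `SchedulingSketch.progression(13, 10)` reproduces the documented example: home invoker 3, step-size 3, and the probe order 3, 6, 9, 2, 5, 8, 1, 4, 7, 0. Because the step-size is coprime to the invoker count, every invoker is visited exactly once before the overload strategy kicks in.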
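The capacity bookkeeping ("a slot in a Semaphore is taken" per scheduled activation, released on active-ack or timeout) together with the horizontal-sharding division can be sketched as below. The class and its names are hypothetical stand-ins, not the loadbalancer's real data structures.

```scala
import java.util.concurrent.Semaphore

/** Illustrative per-invoker bookkeeping as seen by ONE loadbalancer.
  * `busyThreshold` plays the role of `invoker-busy-threshold`; with
  * horizontal sharding each loadbalancer manages only its even share
  * of the slots, e.g. 16 slots and 2 loadbalancers => 8 slots each. */
class InvokerSlots(busyThreshold: Int, numLoadbalancers: Int) {
  private val slots = new Semaphore(busyThreshold / numLoadbalancers)

  /** Claim a slot when scheduling an activation to this invoker;
    * returns false when the loadbalancer believes there is no capacity. */
  def tryAcquire(): Boolean = slots.tryAcquire()

  /** Give the slot back when the active-ack arrives or times out. */
  def release(): Unit = slots.release()

  def availablePermits: Int = slots.availablePermits()
}
```

Note that releasing on active-ack *timeout* is exactly what produces the caveat described above: a timed-out slot is handed back even though the overloaded or slow invoker may still be working on the activation.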
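The three health states can be expressed as a small classification function over the two observed signals: ping recency and the sliding window of the last 10 activation outcomes. This is a sketch of the rules as stated in the prose, not the actual health-protocol implementation; all names and the boolean window encoding are assumptions of this example.

```scala
// Sketch of the invoker health states described above.
object HealthSketch {
  sealed trait State
  case object Healthy   extends State
  case object Unhealthy extends State
  case object Offline   extends State

  val PingTimeoutSeconds   = 10 // "Offline" after this many seconds without a ping
  val WindowSize           = 10 // activations considered
  val SystemErrorThreshold = 3  // more than this many system errors => "Unhealthy"

  /** @param secondsSinceLastPing seconds since the invoker's last Kafka ping
    * @param recentActivations    most recent activation outcomes,
    *                             true = ended in a system error */
  def classify(secondsSinceLastPing: Long, recentActivations: Seq[Boolean]): State =
    if (secondsSinceLastPing > PingTimeoutSeconds) Offline
    else if (recentActivations.takeRight(WindowSize).count(e => e) > SystemErrorThreshold) Unhealthy
    else Healthy
}
```

The window makes the state self-healing: while "Unhealthy", the loadbalancer's test-actions keep pushing new outcomes into the window, so once enough of them succeed the error count drops back to the threshold or below and the invoker is classified "Healthy" again.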