This is an automated email from the ASF dual-hosted git repository.

cbickel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-openwhisk.git
The following commit(s) were added to refs/heads/master by this push:
     new 06818a4  Add documentation to the loadbalancer. (#3778)
06818a4 is described below

commit 06818a4a8056aea4c0c555033f2a947ff15e33fa
Author: Markus Thömmes <markusthoem...@me.com>
AuthorDate: Tue Jul 3 09:00:04 2018 +0200

    Add documentation to the loadbalancer. (#3778)

    * Add documentation to the loadbalancer.

    * Add information on the overflow and other edge cases.

    * Incorporating more feedback to make prose description clearer.

    * Clarify capacity determination.

    * Clarify health protocol.
---
 .../ShardingContainerPoolBalancer.scala            | 90 +++++++++++++++++++++-
 1 file changed, 88 insertions(+), 2 deletions(-)

diff --git a/core/controller/src/main/scala/whisk/core/loadBalancer/ShardingContainerPoolBalancer.scala b/core/controller/src/main/scala/whisk/core/loadBalancer/ShardingContainerPoolBalancer.scala
index eac4aff..72124ec 100644
--- a/core/controller/src/main/scala/whisk/core/loadBalancer/ShardingContainerPoolBalancer.scala
+++ b/core/controller/src/main/scala/whisk/core/loadBalancer/ShardingContainerPoolBalancer.scala
@@ -45,10 +45,96 @@ import scala.concurrent.{ExecutionContext, Future, Promise}
 import scala.util.{Failure, Success}
 
 /**
- * A loadbalancer that uses "horizontal" sharding to not collide with fellow loadbalancers.
+ * A loadbalancer that schedules workload based on a hashing-algorithm.
+ *
+ * ## Algorithm
+ *
+ * At first, for every namespace + action pair a hash is calculated and then an invoker is picked based on that hash
+ * (`hash % numInvokers`). The determined index is the so-called "home invoker". This is the invoker where the
+ * following progression will **always** start. If this invoker is healthy (see "Invoker health checking") and if
+ * there is capacity on that invoker (see "Capacity checking"), the request is scheduled to it.
+ *
+ * If one of these prerequisites is not met, the index is incremented by a step-size. The step-sizes available are
+ * the numbers smaller than the number of invokers available that are coprime to it (coprime, to minimize collisions
+ * while progressing through the invokers). The step-size is picked via the same hash calculated above
+ * (`hash % numStepSizes`). The home-invoker index is then incremented by the step-size and the checks
+ * (healthy + capacity) are done on the invoker we now land on.
+ *
+ * This procedure is repeated until all invokers have been checked, at which point the "overload" strategy is
+ * employed: a healthy invoker is chosen randomly. In a steadily running system, reaching this overload case means
+ * that there is no capacity left on any invoker to schedule the current request to.
+ *
+ * If no invokers are available or if there are no healthy invokers in the system, the loadbalancer returns an error
+ * stating that no invokers are available to take any work. Requests are not queued anywhere in this case.
+ *
+ * An example:
+ * - availableInvokers: 10 (all healthy)
+ * - hash: 13
+ * - homeInvoker: hash % availableInvokers = 13 % 10 = 3
+ * - stepSizes: 1, 3, 7 (note how 2 and 5 are not part of this list because they are not coprime to 10, and 9 is
+ *   left out to keep the step-sizes pairwise coprime)
+ * - stepSizeIndex: hash % numStepSizes = 13 % 3 = 1 => stepSize = 3
+ *
+ * Progression to check the invokers: 3, 6, 9, 2, 5, 8, 1, 4, 7, 0 --> done
+ *
+ * This heuristic is based on the assumption that the chance to get a warm container is best on the home invoker and
+ * degrades with every step taken. The hashing makes sure that all loadbalancers in a cluster will always pick the
+ * same home invoker and do the same progression for a given action.
+ *
+ * Known caveats:
+ * - This assumption is not always true. For instance, two heavy workloads landing on the same invoker can displace
+ *   each other, which results in many cold starts because the invoker keeps evicting containers to make space for
+ *   the respective "other" workload. Future work could be to keep a buffer of the invokers last scheduled to for
+ *   each action and to prefer picking the most recent one, then the second-most-recent one and so forth.
+ *
+ * ## Capacity checking
+ *
+ * The maximum capacity per invoker is configured via `invoker-busy-threshold`, which is the maximum number of
+ * actions running in parallel on that invoker.
+ *
+ * Spare capacity is determined by what the loadbalancer thinks it has scheduled to each invoker. Upon scheduling, an
+ * entry is made to update the books and a slot in a Semaphore is taken. That slot is only released after the
+ * response from the invoker (active-ack) arrives **or** after the active-ack times out. The Semaphore has as many
+ * slots as are configured via `invoker-busy-threshold`.
+ *
+ * Known caveats:
+ * - In an overload scenario, activations are queued directly to the invokers, which makes the active-ack timeout
+ *   unpredictable. Timing out active-acks in that case can cause the loadbalancer to prematurely assign new load to
+ *   an overloaded invoker, which can cause uneven queues.
+ * - The same is true if an invoker is extraordinarily slow in processing activations. The queue on such an invoker
+ *   will slowly grow if it is slow to the point of still sending pings but handling its load so slowly that the
+ *   active-acks time out. Again the loadbalancer will think there is capacity when in fact there is none.
+ *
+ * Both caveats could be solved in future work by not queueing to invoker topics on overload, but queueing on a
+ * centralized overflow topic instead. Timing out an active-ack could then be treated as a system error, as
+ * described below.
+ *
+ * ## Invoker health checking
+ *
+ * Invoker health is determined via a Kafka-based protocol, where each invoker pings the loadbalancer every second.
+ * If no ping is seen for a defined amount of time, the invoker is considered "Offline".
+ *
+ * Moreover, results from all activations are inspected. If more than 3 of the last 10 activations contained system
+ * errors, the invoker is considered "Unhealthy". While an invoker is unhealthy, no user workload is sent to it, but
+ * test-actions are sent by the loadbalancer to check whether system errors are still occurring. Once the number of
+ * system errors in the last 10 activations no longer exceeds the threshold of 3, the invoker is considered
+ * "Healthy" again.
+ *
+ * To summarize:
+ * - "Offline": no ping seen for more than 10 seconds
+ * - "Unhealthy": more than 3 **system-errors** in the last 10 activations, pings arriving as usual
+ * - "Healthy": 3 or fewer **system-errors** in the last 10 activations, pings arriving as usual
+ *
+ * ## Horizontal sharding
+ *
+ * Sharding is employed to avoid the loadbalancers having to share any data, because the metrics used in scheduling
+ * change very quickly.
 *
 * Horizontal sharding means, that each invoker's capacity is evenly divided between the loadbalancers. If an invoker
- * has at most 16 slots available, those will be divided to 8 slots for each loadbalancer (if there are 2).
+ * has at most 16 slots available (invoker-busy-threshold = 16), those will be divided into 8 slots for each
+ * loadbalancer (if there are 2).
+ *
+ * Known caveats:
+ * - If a loadbalancer leaves or joins the cluster, all state is removed and created from scratch. Those events
+ *   should not happen often.
 */
class ShardingContainerPoolBalancer(config: WhiskConfig, controllerInstance: ControllerInstanceId)(
  implicit val actorSystem: ActorSystem,
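The home-invoker and step-size progression described in the documentation above can be sketched in a few lines of standalone Scala. This is an illustrative reimplementation, not the project's actual code; it assumes the step-sizes are additionally kept pairwise coprime among themselves, which is what makes the worked example's list come out as 1, 3, 7 (rather than also including 9) for ten invokers.

```scala
import scala.annotation.tailrec

// Sketch of the hashing-based scheduling progression. Names like
// `stepSizes` and `progression` are invented for this example.
object SchedulingSketch {

  @tailrec
  def gcd(a: Int, b: Int): Int = if (b == 0) a else gcd(b, a % b)

  /** Step-sizes for `n` invokers: numbers coprime to `n`, kept pairwise
    * coprime among themselves (assumption; this is why 9 is absent for n = 10). */
  def stepSizes(n: Int): IndexedSeq[Int] =
    (1 to n).foldLeft(IndexedSeq.empty[Int]) { (acc, cur) =>
      if (gcd(cur, n) == 1 && acc.forall(p => gcd(p, cur) == 1)) acc :+ cur
      else acc
    }

  /** Order in which invokers are probed for a given hash: start at the
    * home-invoker, then keep adding the hash-selected step-size modulo n. */
  def progression(hash: Int, numInvokers: Int): Seq[Int] = {
    val home  = hash % numInvokers
    val sizes = stepSizes(numInvokers)
    val step  = sizes(hash % sizes.size)
    (0 until numInvokers).map(i => (home + i * step) % numInvokers)
  }
}
```

Running `SchedulingSketch.progression(13, 10)` reproduces the documented example: home invoker 3, step-size 3, and the probe order 3, 6, 9, 2, 5, 8, 1, 4, 7, 0. Because the step-size is coprime to the invoker count, every invoker is visited exactly once before the overload strategy kicks in.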
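The capacity bookkeeping ("a slot in a Semaphore is taken" per scheduled activation, released on active-ack or timeout) together with the horizontal-sharding division can be sketched as below. The class and its names are hypothetical stand-ins, not the loadbalancer's real data structures.

```scala
import java.util.concurrent.Semaphore

/** Illustrative per-invoker bookkeeping as seen by ONE loadbalancer.
  * `busyThreshold` plays the role of `invoker-busy-threshold`; with
  * horizontal sharding each loadbalancer manages only its even share
  * of the slots, e.g. 16 slots and 2 loadbalancers => 8 slots each. */
class InvokerSlots(busyThreshold: Int, numLoadbalancers: Int) {
  private val slots = new Semaphore(busyThreshold / numLoadbalancers)

  /** Claim a slot when scheduling an activation to this invoker;
    * returns false when the loadbalancer believes there is no capacity. */
  def tryAcquire(): Boolean = slots.tryAcquire()

  /** Give the slot back when the active-ack arrives or times out. */
  def release(): Unit = slots.release()

  def availablePermits: Int = slots.availablePermits()
}
```

Note that releasing on active-ack *timeout* is exactly what produces the caveat described above: a timed-out slot is handed back even though the overloaded or slow invoker may still be working on the activation.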
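The three health states can be expressed as a small classification function over the two observed signals: ping recency and the sliding window of the last 10 activation outcomes. This is a sketch of the rules as stated in the prose, not the actual health-protocol implementation; all names and the boolean window encoding are assumptions of this example.

```scala
// Sketch of the invoker health states described above.
object HealthSketch {
  sealed trait State
  case object Healthy   extends State
  case object Unhealthy extends State
  case object Offline   extends State

  val PingTimeoutSeconds   = 10 // "Offline" after this many seconds without a ping
  val WindowSize           = 10 // activations considered
  val SystemErrorThreshold = 3  // more than this many system errors => "Unhealthy"

  /** @param secondsSinceLastPing seconds since the invoker's last Kafka ping
    * @param recentActivations    most recent activation outcomes,
    *                             true = ended in a system error */
  def classify(secondsSinceLastPing: Long, recentActivations: Seq[Boolean]): State =
    if (secondsSinceLastPing > PingTimeoutSeconds) Offline
    else if (recentActivations.takeRight(WindowSize).count(e => e) > SystemErrorThreshold) Unhealthy
    else Healthy
}
```

The window makes the state self-healing: while "Unhealthy", the loadbalancer's test-actions keep pushing new outcomes into the window, so once enough of them succeed the error count drops back to the threshold or below and the invoker is classified "Healthy" again.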