PR is opened here: https://github.com/apache/incubator-openwhisk/pull/3497
On Mar 27, 2018, at 3:25 PM, Tyson Norris <tnor...@adobe.com.INVALID<mailto:tnor...@adobe.com.INVALID>> wrote: On Mar 27, 2018, at 2:28 PM, David P Grove <gro...@us.ibm.com<mailto:gro...@us.ibm.com>> wrote: Tyson Norris <tnor...@adobe.com.INVALID<mailto:tnor...@adobe.com.INVALID>> wrote on 03/27/2018 04:33:48 PM: We’ve been discussing how to handle mesos framework HA in the Invoker, and I created a proposal on the wiki to discuss. https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FOPENWHISK%2FClustered%2BSingleton&data=02%7C01%7Ctnorris%40adobe.com%7C68fc29c7e2e140b6bae008d5942a1e91%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C636577830978350589&sdata=PGiGXPzmRMesUar6qTLPmyMNG%2FH6gRE2Xur05C3peRE%3D&reserved=0 +Invoker+for+HA+on+Mesos In general, the idea is to allow a single cluster-wide/single ContainerPool to operate, while providing a reasonable failover behavior in case of its unexpected death. To accomplish this, the proposal is to allow parts of the ContainerPool (freePool and prewarmPool) to be replicated to other (passive) invoker instances, and to allow the replicated container meta data to be used by ContainerFactories to resurrect containers for use in case a failure occurs. This does a couple things, like removing the notion of resource scheduling from the Controller (since there is only ever 1 invoker), and allows the ContainerPool to operate with a holistic view of the cluster, useful for whole-cluster ContainerFactory impls like MesosContainerFactory. I’m curious if the kubernetes folks will also find this useful? Hi Tyson, Thanks for writing this up! A couple of thoughts. 1. Using Akka Distributed Data to actively track the set of containers to support failure recovery seems like a lot of overhead. For Kubernetes, we are labeling all the action containers with their owning invoker using Kubernetes labels. So, when an invoker crashes and gets replaced, one could recover all of its prewarmed & freepool containers with a simple query against the Kubernetes API server. No need to track the set actively; Kube is already doing that via the labels. Anything similar to Kube labels in Mesos? Do you have an example of the labels working? I guess the labels are changed over time through the lifecycle of the container? Nothing I know of similar in mesos, although it could be done in the MesosContainerFactory for similar purposes. I think it is partly a question of where to place container re-association logic (in the ContainerPool? ContainerFactory? Both?), and partly how to implement ContainerPool behavior. Currently MesosContainerFactory operates as an embedded mesos client that must behave as a singleton with failover, so ContainerPool needs to do the same As for the overhead, I’m not sure it is, or isn’t. Currently it is using WriteLocal writeConsistency to publish updates to the pools, and only on particular combinations (prewarm+PrewarmedData, and free+WarmedData) I’ll get the PR up shortly. 2. We've been exploring running with a smaller number of invokers than worker nodes and cluster-wide scheduling using the KubernetesContainerFactory + invokerAgent. However, I don't believe at production scale a single Invoker for an entire cluster is going to be viable. Especially with the current architecture where the action parameters get streamed through the invoker and the action results get streamed back through the invoker. I believe that is going to bottleneck how many containers a single Invoker can manage. Yeah I’m wondering the same thing. For now, operating only one (or a few) controllers has the same issue right? For mesos, we can operate "multiple invoker clusters”, following the same approach outlined, but instead of using a single topic invoker0, we have multiple topics invoker1..invokern, (similar to today), but where each topic is consumed by multiple invokers (in active/passive mode). Separately, related to performance: - we still plan to allow concurrency so things that are required to be fast (blocking activations, and ones that signal they tolerate it) should leverage this, and things that can tolerate additional latency should not - if the ContainerPool operated cleanup on a more GC-like semantic, external clients (other invokers, or the controller) would be able to use existing running containers (at least where concurrency is tolerated). When all clients of a container have completed (plus some time), it could be garbage collected by the ContainerPool (similar to how freePool containers currently linger). This could unburden some activation processing from the current invoker workflow. Of course, your mileage may vary: I’m not sure how well any of that works in a case where you don’t support concurrent activations, or for cases where the number of actions/containers exceeds the number of clients (i.e. where everything is only ever used once). Thanks Tyson --dave