Hi OpenWhiskers,

Today, we have an arbitrary system-wide limit on the maximum number of concurrent connections in the system. In general that is fine, but the limit has no direct correlation to what is actually happening in the system.

I propose adding a new state to each monitored invoker: Overloaded. An invoker will go into the overloaded state if its active-acks start to time out. Eventually, if the system is really overloaded, all invokers will be in the overloaded state, which will cause the loadbalancer to return a failure. With this change, that failure results in a `503 - System overloaded` message back to the user. The system-wide concurrency limit would be removed.

The resulting organic system limit can be tuned through a timeout factor, which is made adjustable in https://github.com/apache/incubator-openwhisk/pull/3767. The default timeout is 2 * maximumActionRuntime + 1 minute. For the vast majority of use-cases, hitting this timeout means that there are 3x more activations in the system than it can handle or, put differently, that activations need to wait for minutes until they are executed. I think it's safe to say that the system is overloaded if this is true for all invokers in your system.

Note: We used to handle active-ack timeouts as system errors and take invokers into the unhealthy state. With the old, non-consistent loadbalancer, that caused a lot of "flappy" states in the invokers. With the new consistent implementation, active-ack timeouts should only occur in problematic situations (either the invoker itself is having problems, or there is queueing). Taking the invoker out of the loadbalancer if active-acks are missing on that invoker is generally helpful, because missing active-acks also mean inconsistent state in the loadbalancer (it updates its state as if the active-ack had arrived correctly).

A first stab at the implementation can be found here: https://github.com/apache/incubator-openwhisk/pull/3875.

Any concerns with that approach to placing an upper bound on the system?

Cheers,
Markus
