Hi Chetan,

>Currently one aspect which is not clear is does Controller has access to
>
>1. Pool of prewarm containers - Container of base image where /init is
>yet not done. So these containers can then be initialized within
>Controller
>2. OR Pool of warm container bound to specific user+action. These
>containers would possibly have been initialized by ContainerManager
>and then it allocates them to controller.

The latter case is what I had in mind. The Controller only knows about 
containers that are ready to have /run called on them.

Prewarm containers are an implementation detail hidden from the Controller. The 
ContainerManager can keep them around to answer demand for specific resources 
more quickly, but the Controller doesn't care: it only knows warm containers.
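
To make that concrete, here's a rough sketch (in Scala, with made-up names, 
not the actual API) of the interface the Controller would see:

    import scala.concurrent.Future

    case class Action(namespace: String, name: String)

    // A warm container: /init has already happened, so the Controller
    // only ever needs to call /run on it.
    case class WarmContainer(id: String) {
      def run(payload: String): Future[String] =
        Future.successful(s"ran $payload on container $id")
    }

    // The only thing the Controller can ask the ContainerManager for is a
    // container that is ready to /run the given action. Whether it came
    // out of a prewarm pool or was freshly created is invisible here.
    trait ContainerManager {
      def requestWarmContainer(action: Action): Future[WarmContainer]
    }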

>Can you elaborate this bit more i.e. how scale up logic would work and
>is asynchronous?
>
>I think above aspect (type of pool) would have bearing on scale up
>logic. If an action was not in use so far then when first request
>comes (i.e. 0-1 scale up case) would Controller ask ContainerManager
>for specific action container and then wait for its setup and then
>execute it. OR if it has a generic pool then it takes one and
>initializes it and use it. And if its not done synchronously then
>would such an action be put to overflow queue.

In this specific example, the Controller will request a container from the 
ContainerManager and buffer the request until it has capacity to execute it. 
All subsequent requests will be put on the same buffer, and a container will 
be requested for each of them.
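
Roughly like so, as a sketch (names invented, not the real code):

    import scala.collection.mutable

    // Per-action buffer for the 0 -> 1 scale-up case: the first request
    // finds no warm container, is buffered and triggers a container
    // request; every further request is appended to the same buffer and
    // triggers one more container request.
    class ActionBuffer[Request](requestContainer: () => Unit) {
      private val pending = mutable.Queue.empty[Request]

      // Called when no warm container is free for this action.
      def enqueue(request: Request): Unit = {
        pending.enqueue(request)
        requestContainer() // ask the ContainerManager for one more container
      }

      // Called once the ContainerManager hands us a warm container.
      def onContainerAvailable(execute: Request => Unit): Unit =
        if (pending.nonEmpty) execute(pending.dequeue())
    }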

Whether we put this buffer in an overflow queue (i.e. persist it) remains to be 
decided. If we keep it in memory, we have roughly the same guarantees as today. 
As Rodric mentioned, though, we can improve certain failure scenarios (like 
waiting for a container in this case) by making this buffer persistent. I'm 
deliberately not mentioning Kafka here, because in this case any persistent 
buffer will do.

Also note that this is not necessarily the same as the overflow queue. The 
overflow queue is used for arbitrary requests once the ContainerManager cannot 
create more resources and requests therefore need to wait.

The buffer I described above is a per-action "invoke me once resources are 
available" buffer. It could potentially be kept per Controller to avoid the 
challenge of scaling it out. That of course has downsides of its own: a buffer 
that spans all Controllers would enable work-stealing between Controllers that 
lack capacity and could mitigate some of the load imbalances Dominic mentioned. 
We would then be entering the same territory as his proposal: the need for a 
queue per action.
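
To illustrate the trade-off (again just a sketch with invented names): the 
same "invoke me once resources are available" contract could be backed either 
by a per-Controller in-memory queue or by a queue shared across Controllers, 
and only the latter enables work-stealing:

    import scala.collection.mutable

    trait PendingInvocations[Request] {
      def offer(action: String, request: Request): Unit
      def poll(action: String): Option[Request]
    }

    // Variant 1: local to a single Controller. Trivial to scale (nothing
    // shared), but a Controller with spare capacity cannot help another.
    class LocalPendingInvocations[Request] extends PendingInvocations[Request] {
      private val queues = mutable.Map.empty[String, mutable.Queue[Request]]

      def offer(action: String, request: Request): Unit =
        queues.getOrElseUpdate(action, mutable.Queue.empty[Request]).enqueue(request)

      def poll(action: String): Option[Request] =
        queues.get(action).filter(_.nonEmpty).map(_.dequeue())
    }

    // Variant 2 would implement the same trait on top of a shared (and
    // possibly persistent) store, so any Controller can poll any action's
    // backlog -- which is where the "queue per action" requirement comes in.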

In conclusion, there are two perspectives from which to look at this:

1. Do we need to persist an in-memory queue that waits for resources to be 
created by the ContainerManager?
2. Do we need a shared queue between the Controllers to enable work-stealing in 
cases where multiple Controllers wait for resources?
 
An important thing to note here: since all of this is no longer happening on 
the critical path (requests only get put on the queue if they need to wait for 
resources anyway), we can afford a solution that isn't as performant as Kafka 
might be. That could potentially open up the possibility of using a technology 
more geared towards Pub/Sub, where per-action subscribers are cheaper to 
implement than on Kafka?
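
The interface we'd need from such a technology is really small, something like 
this (hypothetical, just to show the shape):

    // Hypothetical pub/sub facade: one cheap subscription per action.
    // Kafka makes this awkward (a topic/partition per action is costly),
    // whereas a broker geared towards pub/sub could back this directly.
    trait PerActionPubSub[Request] {
      def publish(action: String, request: Request): Unit
      def subscribe(action: String)(handler: Request => Unit): Unit
    }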

Does that make sense? Hope that helps :). Thanks for the questions!

Cheers,
Markus
