Dear Markus,

I may not correctly understand the direction of the new architecture, so let me describe my concerns in more detail.
Since that is the future architecture of OpenWhisk and requires many breaking changes, I think it should at least address all known issues. So I focused on figuring out whether it handles all the issues reported in my proposal ( https://cwiki.apache.org/confluence/display/OPENWHISK/Autonomous+Container+Scheduling ).

1. Docker daemon performance issue.

The most critical issue is the poor performance of the Docker daemon. Since it is not inherently designed for high throughput or concurrent processing, the Docker daemon performs poorly relative to what OW needs. In the OW (serverless) world, an action execution can finish within 5ms ~ 10ms, but the Docker daemon shows 100ms ~ 500ms latency. We can still take advantage of Prewarm and Warmed containers, but when container creation/deletion/pausing/resuming happens frequently and that situation lasts for a long time, requests get delayed and the Docker daemon can even crash. So I think it is important to reduce the load (the number of requests) on the Docker daemon. That's why I initially thought the Warmed state would be kept for longer than today. Today, containers stay in the Warmed state for only 50ms, so PAUSE/RESUME kicks in whenever invocations arrive at intervals longer than 50ms, such as 1 second. This leads to more load on the Docker daemon. And if container state keeps changing like today, the state in the ContainerManager will change frequently as well. This may induce synchronization issues among controllers, and among ContainerManagers (in case there is more than one ContainerManager). So I think containers should keep running for longer than today's pauseGrace time. With a per-container concurrency limit greater than 1, it would also be better to keep containers running (not paused) for more than 50ms.

2. Proxy case.

In the edge case where a container only exists in controller1, how can controller2 decide to proxy the request to controller1 rather than just creating its own container? If it asks the ContainerManager, the ContainerManager needs to know the state of the container in controller1. If the container in controller1 is already busy, it would be better to create a new container in controller2 rather than proxy the requests to controller1. If it goes this way, the ContainerManager has to know the status of all containers in all controllers to make the right decision, and it is not easy to synchronize that status across controllers. If it does not work like this, how can controller2 proxy requests to controller1 without any information about controller1's status?

3. Intervention among multiple actions.

If the concurrency limit is 1 and the container lifecycle is managed like today, intervention among multiple actions can happen again. For example, if the maximum number of containers a user can create is 2, and ActionA and ActionB invocation requests arrive alternately, controllers will try to remove and recreate containers again and again (see the sketch just below). I used a small max container limit for simplicity, but it can happen with a higher limit as well. And even if the concurrency limit is more than 1, such as 3, it can still happen when invocations arrive faster than the actions finish.
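Here is a tiny, self-contained sketch of that thrashing (all names and numbers are hypothetical; this is not OpenWhisk code). With a cap of 2 containers, per-container concurrency of 1, and two concurrent invocations per wave, every alternating wave forces fresh container creations:

object ThrashingSketch {

  // A user's pool of warm containers, keyed by the action they serve, capped at `max`.
  final case class Pool(warm: List[String], max: Int = 2)

  // Serve a wave of `n` concurrent invocations of `action` with per-container
  // concurrency 1: every concurrent invocation needs its own container, and a full
  // pool evicts other actions' containers to make room.
  def serveWave(pool: Pool, action: String, n: Int): (Pool, Int) = {
    val reusable = pool.warm.count(_ == action).min(n)
    val created  = n - reusable
    val kept     = pool.warm.filter(_ != action).take(pool.max - n)
    (Pool((List.fill(n)(action) ++ kept).take(pool.max), pool.max), created)
  }

  def main(args: Array[String]): Unit = {
    // ActionA and ActionB waves arrive alternately, two concurrent invocations each.
    val waves = Iterator.continually(Seq("ActionA", "ActionB")).flatten.take(10)
    val (_, creations) = waves.foldLeft((Pool(Nil), 0)) { case ((pool, total), action) =>
      val (next, created) = serveWave(pool, action, n = 2)
      (next, total + created)
    }
    println(s"container creations over 10 alternating waves: $creations") // prints 20
  }
}

After the first wave, every wave evicts the other action's containers and creates new ones, which is exactly the Docker daemon load I want to avoid.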
4. Is per-container concurrency controlled by users on a per-action basis?

Let me clarify my question about the concurrency limit. If the per-container concurrency limit is more than 1, multiple invocations can be running in one container at the same time. If an action has a high memory footprint, such as 200MB or 150MB, it can crash when the summed memory usage of the concurrent invocations exceeds the container memory. (In our case here, some users run headless-chrome and puppeteer within actions, so this could happen in a similar situation.) So I initially thought per-container concurrency would be controlled by users on a per-action basis. If per-container concurrency is only configured statically by OW operators, some users may not be able to invoke their actions correctly in the worst case, even if operators increase the memory of the biggest container type. And beyond this case, there may be other reasons why some users just want to invoke their actions without per-container concurrency while others want it for better throughput. So we may need some logic that lets users control per-container concurrency for each action.

5. Better to wait for completion rather than creating a new container.

Depending on the workload, it can be better to wait for the previous execution to finish rather than create a new container, because creation takes up to 500ms ~ 1s. Even if the concurrency limit is more than 1, this can still happen if there is no logic to accumulate invocations and decide whether to create a new container or wait for an existing one (I put a rough sketch of such a decision further below).

6. HA of ContainerManager.

Since deploying the system without any downtime is mandatory for production use, we need to support HA of the ContainerManager. That means the state of the ContainerManager should be replicated among replicas (no matter whether we use master/slave or clustering). If the ContainerManager knows the status of each container, it will not be easy to support HA given its eventually consistent nature. If it only knows which containers are assigned to which controllers, it cannot handle the edge case I mentioned above.

Since many parts of the architecture are not settled yet, I think it would be better to separate each part and discuss them in more depth. But in the big picture, I think we first need to figure out whether the new design can handle, or at least alleviate, all known issues.
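To make point 5 a bit more concrete, here is a rough sketch of the kind of decision I mean (all names and numbers, including the 750ms default, are hypothetical; real logic would need live metrics):

object WaitOrCreate {

  sealed trait Decision
  case object WaitForWarmContainer extends Decision
  case object CreateNewContainer   extends Decision

  // Decide whether to queue behind an existing warm container or ask for a new
  // container, by comparing the expected queueing delay with the cost of
  // creating a fresh container (500ms ~ 1s today).
  def decide(queuedInvocations: Int,
             avgExecMillis: Long,
             coldStartMillis: Long = 750L): Decision = {
    val expectedWait = queuedInvocations * avgExecMillis
    if (expectedWait <= coldStartMillis) WaitForWarmContainer else CreateNewContainer
  }

  def main(args: Array[String]): Unit = {
    // A 10ms action with 3 queued invocations: waiting ~30ms beats a ~750ms cold start.
    println(decide(queuedInvocations = 3, avgExecMillis = 10))  // WaitForWarmContainer
    // A 400ms action with 5 queued invocations: a new container pays off.
    println(decide(queuedInvocations = 5, avgExecMillis = 400)) // CreateNewContainer
  }
}

The point is simply that when the expected queueing delay is far below the container creation cost, waiting wins, so somewhere in the system invocations need to be accumulated and compared against that cost.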
Best regards,
Dominic

2018-07-21 1:36 GMT+09:00 David P Grove <[email protected]>:
>
>
> Tyson Norris <[email protected]> wrote on 07/20/2018 12:24:07 PM:
> >
> > On Logging, I think if you are considering enabling concurrent
> > activation processing, you will encounter that the only approach to
> > parsing logs to be associated with a specific activationId, is to
> > force the log output to be structured, and always include the
> > activationId with every log message. This requires a change at the
> > action container layer, but the simpler thing to do is to encourage
> > action containers to provide a structured logging context that
> > action developers can (and must) use to generate logs.
>
> Good point. I agree that if there is concurrent activation processing in
> the container, structured logging is the only sensible thing to do.
>
>
> --dave
>
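P.S. On the structured-logging point in the quoted thread above: what I would expect is that every log message carries the activationId, roughly like the sketch below (field names and the id are made up for illustration):

object StructuredLogSketch {
  // Emit one structured log line that always carries the activationId, so logs
  // from interleaved activations in the same container can still be attributed.
  def logLine(activationId: String, level: String, message: String): String =
    s"""{"activationId":"$activationId","level":"$level","msg":"$message","ts":${System.currentTimeMillis}}"""

  def main(args: Array[String]): Unit =
    println(logLine("0123456789abcdef0123456789abcdef", "INFO", "hello from a concurrent activation"))
}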