Dear Markus,

I may not correctly understand the direction of the new architecture, so let me describe my concerns in more detail.
Since that is the future architecture of OpenWhisk and requires many breaking changes, I think it should at least address all known issues. So I focused on figuring out whether it handles all the issues reported in my proposal ( https://cwiki.apache.org/confluence/display/OPENWHISK/Autonomous+Container+Scheduling ).

1. Docker daemon performance issue.

The most critical issue is the poor performance of the Docker daemon. Since it is not inherently designed for high throughput or concurrent processing, the Docker daemon performs poorly relative to what OW needs. In the OW (serverless) world, an action execution can finish within 5ms ~ 10ms, but the Docker daemon shows 100ms ~ 500ms latency. We can still take advantage of Prewarm and Warmed containers, but when container creation/deletion/pausing/resuming happens frequently and that situation lasts for a long time, requests get delayed and the Docker daemon can even crash. So I think it is important to reduce the load (the number of requests) on the Docker daemon. That's why I initially thought the Warmed state would be kept for longer than today. Today, containers stay in the Warmed state for only 50ms, so PAUSE/RESUME kicks in whenever invocations arrive at intervals longer than 50ms, such as 1 second. This leads to more load on the Docker daemon. And if container state keeps changing like today, the state in the ContainerManager will change frequently as well. This may induce synchronization issues among controllers, and among ContainerManagers (in case there is more than one ContainerManager). So I think containers should keep running for longer than today's pauseGrace time. With a per-container concurrency limit greater than 1, it would also be better to keep containers running (not paused) for more than 50ms.

2. Proxy case.

In the edge case where a container only exists in controller1, how can controller2 decide to proxy the request to controller1 rather than just creating its own container? If it asks the ContainerManager, the ContainerManager needs to know the state of the container in controller1. If the container in controller1 is already busy, it would be better to create a new container in controller2 rather than proxy the requests to controller1. If it goes this way, the ContainerManager has to know the status of all containers in all controllers to make the right decision, and it is not easy to synchronize that status across controllers. If it does not work like this, how can controller2 proxy requests to controller1 without any information about controller1's status?

3. Intervention among multiple actions.

If the concurrency limit is 1 and the container lifecycle is managed like today, intervention among multiple actions can happen again. For example, if the maximum number of containers a user can create is 2, and ActionA and ActionB invocation requests arrive alternately, controllers will try to remove and recreate containers again and again (see the sketch just below). I used a small max container limit for simplicity, but it can happen with a higher limit as well. And even if the concurrency limit is more than 1, such as 3, it can still happen when invocations arrive faster than the actions finish.
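Here is a tiny, self-contained sketch of that thrashing (all names and numbers are hypothetical; this is not OpenWhisk code). With a cap of 2 containers, per-container concurrency of 1, and two concurrent invocations per wave, every alternating wave forces fresh container creations:

object ThrashingSketch {

  // A user's pool of warm containers, keyed by the action they serve, capped at `max`.
  final case class Pool(warm: List[String], max: Int = 2)

  // Serve a wave of `n` concurrent invocations of `action` with per-container
  // concurrency 1: every concurrent invocation needs its own container, and a full
  // pool evicts other actions' containers to make room.
  def serveWave(pool: Pool, action: String, n: Int): (Pool, Int) = {
    val reusable = pool.warm.count(_ == action).min(n)
    val created  = n - reusable
    val kept     = pool.warm.filter(_ != action).take(pool.max - n)
    (Pool((List.fill(n)(action) ++ kept).take(pool.max), pool.max), created)
  }

  def main(args: Array[String]): Unit = {
    // ActionA and ActionB waves arrive alternately, two concurrent invocations each.
    val waves = Iterator.continually(Seq("ActionA", "ActionB")).flatten.take(10)
    val (_, creations) = waves.foldLeft((Pool(Nil), 0)) { case ((pool, total), action) =>
      val (next, created) = serveWave(pool, action, n = 2)
      (next, total + created)
    }
    println(s"container creations over 10 alternating waves: $creations") // prints 20
  }
}

After the first wave, every wave evicts the other action's containers and creates new ones, which is exactly the Docker daemon load I want to avoid.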
4. Is per-container concurrency controlled by users on a per-action basis?

Let me clarify my question about the concurrency limit. If the per-container concurrency limit is more than 1, multiple invocations can be running in one container at the same time. If an action has a high memory footprint, such as 200MB or 150MB, it can crash when the summed memory usage of the concurrent invocations exceeds the container memory. (In our case here, some users run headless-chrome and puppeteer within actions, so this could happen in a similar situation.) So I initially thought per-container concurrency would be controlled by users on a per-action basis. If per-container concurrency is only configured statically by OW operators, some users may not be able to invoke their actions correctly in the worst case, even if operators increase the memory of the biggest container type. And beyond this case, there may be other reasons why some users just want to invoke their actions without per-container concurrency while others want it for better throughput. So we may need some logic that lets users control per-container concurrency for each action.

5. Better to wait for completion rather than creating a new container.

Depending on the workload, it can be better to wait for the previous execution to finish rather than create a new container, because creation takes up to 500ms ~ 1s. Even if the concurrency limit is more than 1, this can still happen if there is no logic to accumulate invocations and decide whether to create a new container or wait for an existing one (I put a rough sketch of such a decision further below).

6. HA of ContainerManager.

Since deploying the system without any downtime is mandatory for production use, we need to support HA of the ContainerManager. That means the state of the ContainerManager should be replicated among replicas (no matter whether we use master/slave or clustering). If the ContainerManager knows the status of each container, it will not be easy to support HA given its eventually consistent nature. If it only knows which containers are assigned to which controllers, it cannot handle the edge case I mentioned above.

Since many parts of the architecture are not settled yet, I think it would be better to separate each part and discuss them in more depth. But in the big picture, I think we first need to figure out whether the new design can handle, or at least alleviate, all known issues.
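To make point 5 a bit more concrete, here is a rough sketch of the kind of decision I mean (all names and numbers, including the 750ms default, are hypothetical; real logic would need live metrics):

object WaitOrCreate {

  sealed trait Decision
  case object WaitForWarmContainer extends Decision
  case object CreateNewContainer   extends Decision

  // Decide whether to queue behind an existing warm container or ask for a new
  // container, by comparing the expected queueing delay with the cost of
  // creating a fresh container (500ms ~ 1s today).
  def decide(queuedInvocations: Int,
             avgExecMillis: Long,
             coldStartMillis: Long = 750L): Decision = {
    val expectedWait = queuedInvocations * avgExecMillis
    if (expectedWait <= coldStartMillis) WaitForWarmContainer else CreateNewContainer
  }

  def main(args: Array[String]): Unit = {
    // A 10ms action with 3 queued invocations: waiting ~30ms beats a ~750ms cold start.
    println(decide(queuedInvocations = 3, avgExecMillis = 10))  // WaitForWarmContainer
    // A 400ms action with 5 queued invocations: a new container pays off.
    println(decide(queuedInvocations = 5, avgExecMillis = 400)) // CreateNewContainer
  }
}

The point is simply that when the expected queueing delay is far below the container creation cost, waiting wins, so somewhere in the system invocations need to be accumulated and compared against that cost.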
Best regards,
Dominic

2018-07-21 1:36 GMT+09:00 David P Grove <[email protected]>:
>
>
> Tyson Norris <[email protected]> wrote on 07/20/2018 12:24:07 PM:
> >
> > On Logging, I think if you are considering enabling concurrent
> > activation processing, you will encounter that the only approach to
> > parsing logs to be associated with a specific activationId, is to
> > force the log output to be structured, and always include the
> > activationId with every log message. This requires a change at the
> > action container layer, but the simpler thing to do is to encourage
> > action containers to provide a structured logging context that
> > action developers can (and must) use to generate logs.
>
> Good point. I agree that if there is concurrent activation processing in
> the container, structured logging is the only sensible thing to do.
>
>
> --dave
>
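P.S. On the structured-logging point in the quoted thread above: what I would expect is that every log message carries the activationId, roughly like the sketch below (field names and the id are made up for illustration):

object StructuredLogSketch {
  // Emit one structured log line that always carries the activationId, so logs
  // from interleaved activations in the same container can still be attributed.
  def logLine(activationId: String, level: String, message: String): String =
    s"""{"activationId":"$activationId","level":"$level","msg":"$message","ts":${System.currentTimeMillis}}"""

  def main(args: Array[String]): Unit =
    println(logLine("0123456789abcdef0123456789abcdef", "INFO", "hello from a concurrent activation"))
}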