Re: Proposal on a future architecture of OpenWhisk

Tyson Norris Mon, 20 Aug 2018 15:21:03 -0700


On Aug 19, 2018, at 3:59 AM, Markus Thömmes 
<[email protected]<mailto:[email protected]>> wrote:


Hi Tyson,

Am Fr., 17. Aug. 2018 um 23:45 Uhr schrieb Tyson Norris
<[email protected]<mailto:[email protected]>>:


If the failover of the singleton is too long (I think it will be based on
cluster size, oldest node becomes the singleton host iirc), I think we need
to consider how containers can launch in the meantime. A first step might
be to test out the singleton behavior in the cluster of various sizes.


I agree this bit of design is crucial, a few thoughts:
Pre-warm wouldn't help here, the ContainerRouters only know warm
containers. Pre-warming is managed by the ContainerManager.

Ah right


Considering a fail-over scenario: We could consider sharing the state via
EventSourcing. That is: All state lives inside of frequently snapshotted
events and thus can be shared between multiple instances of the
ContainerManager seamlessly. Alternatively, we could also think about only
working on persisted state. That way, a cold-standby model could fly. We
should make sure that the state is not "slightly stale" but rather both
instances see the same state at any point in time. I believe on that
cold-path of generating new containers, we can live with the extra-latency
of persisting what we're doing as the path will still be dominated by the
container creation latency.

Wasn’t clear if you mean not using ClusterSingleton? To be clear in
ClusterSingleton case there are 2 issues:
- time it takes for akka ClusterSingletonManager to realize it needs to
start a new actor
- time it takes for the new actor to assume a usable state

EventSourcing (or ext persistence) may help with the latter, but we will
need to be sure the former is tolerable to start with.
Here is an example test from akka source that may be useful (multi-jvm,
but all local):

https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fakka%2Fakka%2Fblob%2F009214ae07708e8144a279e71d06c4a504907e31%2Fakka-cluster-tools%2Fsrc%2Fmulti-jvm%2Fscala%2Fakka%2Fcluster%2Fsingleton%2FClusterSingletonManagerChaosSpec.scala&amp;data=02%7C01%7Ctnorris%40adobe.com%7C63c6bb3a36724f38cc9d08d605c2ddee%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C636702732034251656&amp;sdata=omVsIo%2FoD8weG4Zy%2BGX2A53ATRmylUxYCbqknu4MoeM%3D&amp;reserved=0

Some things to consider, that I don’t know details of:
- will the size of cluster affect the singleton behavior in case of
failure? (I think so, but not sure, and what extent); in the simple test
above it takes ~6s for the replacement singleton to begin startup, but if
we have 100s of nodes, I’m not sure how much time it will take. (I don’t
think this should be hard to test, but I haven’t done it)
- in case of hard crash, what is the singleton behavior? In graceful jvm
termination, I know the cluster behavior is good, but there is always this
question about how downing nodes will be handled. If this critical piece of
the system relies on akka cluster functionality, we will need to make sure
that the singleton can be reconstituted, both in case of graceful
termination (restart/deployment events) and non-graceful termination (hard
vm crash, hard container crash) . This is ignoring more complicated cases
of extended network partitions, which will also have bad affects on many of
the downstream systems.


I don't think we need to be eager to consider akka-cluster to be set in
stone here. The singleton in my mind doesn't need to be clustered at all.
Say we have a fully shared state through persistence or event-sourcing and
a hot-standby model, couldn't we implement the fallback through routing in
front of the active/passive ContainerManager pair? Once one goes
unreachable, fall back to the other.



Yeah I would rather see the hot standby and deal with persistence. I don’t 
think akka clustersingleton is going to be fast enough in a high volume 
scenario.
Either routing in front or ContainerRouters who observe the active (leader) 
status, we just have to determine that the status change is tolerably fast.





Handover time as you say is crucial, but I'd say as it only impacts
container creation, we could live with, let's say, 5 seconds of
failover-downtime on this path? What's your experience been on singleton
failover? How long did it take?


Seconds in the simplest case, so I think we need to test it in a scaled
case (100s of cluster nodes), as well as the hard crash case (where not
downing the node may affect the cluster state).




On Aug 16, 2018, at 11:01 AM, Tyson Norris 
<[email protected]<mailto:[email protected]>
<mailto:[email protected]>>
wrote:

A couple comments on singleton:
- use of cluster singleton will introduce a new single point of failure
- from time of singleton node failure, to single resurrection on a
different instance, will be an outage from the point of view of any
ContainerRouter that does not already have a warm+free container to service
an activation
- resurrecting the singleton will require transferring or rebuilding the
state when recovery occurs - in my experience this was tricky, and requires
replicating the data (which will be slightly stale, but better than
rebuilding from nothing); I don’t recall the handover delay (to transfer
singleton to a new akka cluster node) when I tried last, but I think it was
not as fast as I hoped it would be.

I don’t have a great suggestion for the singleton failure case, but
would like to consider this carefully, and discuss the ramifications (which
may or may not be tolerable) before pursuing this particular aspect of the
design.


On prioritization:
- if concurrency is enabled for an action, this is another
prioritization aspect, of sorts - if the action supports concurrency, there
is no reason (except for destruction coordination…) that it cannot be
shared across shards. This could be added later, but may be worth
considering since there is a general reuse problem where a series of
activations that arrives at different ContainerRouters will create a new
container in each, while they could be reused (and avoid creating new
containers) if concurrency is tolerated in that container. This would only
(ha ha) require changing how container destroy works, where it cannot be
destroyed until the last ContainerRouter is done with it. And if container
destruction is coordinated in this way to increase reuse, it would also be
good to coordinate construction (don’t concurrently construct the same
container for multiple containerRouters IFF a single container would enable
concurrent activations once it is created). I’m not sure if others are
desiring this level of container reuse, but if so, it would be worth
considering these aspects (sharding/isolation vs sharing/coordination) as
part of any redesign.


Yes, I can see where you're heading here. I think this can be generalized:

Assume intra-container concurrency C and number of ContainerRouters R.
If C > R: Shard the "slots" on this container evenly across R. The
container can only be destroyed after you receive R acknowledgements of
doing so.
If C < R: Hand out 1 slot to C Routers, point the remaining Routers to the
ones that got slots.


Yes, mostly - I think there is also a case where destruction message is
revoked by the same router (receiving a new activation for the container
which it previously requested destruction of). But I think this is covered
in the details of tracking “after you receive R acks of destructions"


Hm, I don't think that case exists. Once a Router has acknowledged a
revoke, it will remove the container from its state immediately.  It will
never revoke that acknowledgement therefore, but rather request a new
resource if it finds it now has insufficient resources.


Yes I was meaning from ContainerManager perspective. I think we are on the same 
page, that ContainerManager can share the same container until all recipients 
of that container have asked for its destruction.

Again, this might be my fault for not providing sequence diagrams for the
algorithms I'm describing here.



Concurrent creation: Batch creation requests while one container is being
created. Say you received a request for a new container that has C slots.
If there are more requests for that container arriving while it is being
created, don't act on them and fold the creation into the first one. Only
start creating a new container if the number of resource requests exceed C.

Does that make sense? I think in that model you can set C=1 and it works as
I envisioned it to work, or set it to C=200 and things will be shared even
across routers.

Side note: One detail about the pending concurrency impl today is that due
to the async nature of tracking the active activations within the
container, there is no guarantee (when C>1) that the number is exact, so if
you specify C=200, you may actually get a different container at 195 or
205. This is not really related to this discussion, but is based on the
current messaging/future behavior in ContainerPool/ContainerProxy, so
wanted to mention it explicitly, in case it matters to anyone.


Not relevant for this discussion, but: I think that is not the right way of
approaching it. If you handle the concurrency metric asynchronously, you
effectively allow for an unbounded number of request to reach your
container (in reality bound by the number of CPU cores and the real
concurrency that happens in your system). I think the ContainerPool should
track these number consistently as well to be able to guarantee there are
never more than C requests on a container. I believe that is crucial,
especially for the C=1 case but might be relevant for other cases as well.

The whole point of this design is to accurately track that concurrency
metric.


Tracking these metrics consistently will introduce the same problem as 
precisely tracking throttling numbers across multiple controllers, I think, 
where either there is delay introduced to use remote data, or eventual 
consistency will introduce inaccurate data.

I’m interested to know if this accuracy is important as long as actual 
concurrency <= C?

We can discuss in the concurrency PR :)




Thanks
Tyson







WDYT?

THanks
Tyson

On Aug 15, 2018, at 8:55 AM, Carlos Santana 
<[email protected]<mailto:[email protected]><mailto:
[email protected]<mailto:[email protected]>>
<mailto:[email protected]>> wrote:

I think we should add a section on prioritization for blocking vs. async
invokes (none blocking actions a triggers)

The front door has the luxury of known some intent from the incoming
request, I feel it would make sense to high priority to blocking invokes,
and for async they go straight to the queue to be pick up by the system
to
eventually run, even if it takes 10 times longer to execute than a
blocking
invoke, for example a webaction would take 10ms vs. a DB trigger fire,
or a
async webhook takes 100ms.

Also the controller takes time to convert a trigger and process the
rules,
this is something that can also be taken out of hot path.

So I'm just saying we could optimize the system because we know if the
incoming request is a hot or hotter path :-)

-- Carlos

Re: Proposal on a future architecture of OpenWhisk

Reply via email to