On Sun, Aug 19, 2018 at 6:59 PM TzuChiao Yeh <su3g4284zo...@gmail.com> wrote:
> On Sun, Aug 19, 2018 at 7:13 PM Markus Thömmes <markusthoem...@apache.org> wrote:
>
> > Hi Tzu-Chiao,
> >
> > On Sat, Aug 18, 2018 at 6:56 AM TzuChiao Yeh <su3g4284zo...@gmail.com> wrote:
> >
> > > Hi Markus,
> > >
> > > Nice thoughts on separating the logic in this revision! I'm not sure whether this question has already been clarified, sorry if it's a duplicate.
> > >
> > > Same question on the cluster singleton:
> > >
> > > I think there are two possibilities for container deletion: 1. the ContainerRouter removes it (on error or idle state); 2. the ContainerManager decides to remove it (i.e. to clear space for a new creation).
> > >
> > > For case 2, how do we ensure safe deletion in the ContainerManager? If there is still a similar busy/free/prewarmed pool model, it might require additional container state transitions from busy to free before we can safely remove a container, or reject the request if nothing is found (system overloaded).
> > >
> > > Via a paused state or other states/messages? There might be some trade-offs between granularity (the time-slice in scheduling) and a performance bottleneck on the ClusterSingleton.
> >
> > I'm not sure if I quite got the point, but here's an attempt at an explanation:
> >
> > Yes, container removal in case 2 is triggered from the ContainerManager. To be able to safely remove a container, it requests all ContainerRouters owning that container to stop serving it and hand it back. Once it's been handed back, the ContainerManager can safely delete it. The contract should also say: a container must be handed back in unpaused state, so it can be deleted safely. Since the ContainerRouters handle pause/unpause, they'll need to stop serving the container, unpause it, remove it from their state and acknowledge to the ContainerManager that they handed it back.
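To make that hand-back contract a bit more concrete, a minimal sketch of the messages between ContainerManager and ContainerRouter could look like the following. All names are made up for illustration; nothing like this exists in the code today:

    object HandBackProtocol {
      final case class ContainerId(asString: String)

      // ContainerManager -> ContainerRouter: stop serving the container,
      // unpause it if needed and hand it back.
      final case class ReleaseContainer(id: ContainerId)

      // ContainerRouter -> ContainerManager: the router no longer serves the
      // container, it is unpaused and removed from the router's state.
      final case class ContainerReleased(id: ContainerId)
    }

The ContainerManager would only issue the actual delete once it has collected a ContainerReleased for that container from every router it handed it out to.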
> Thank you, it's clear to me.
>
> > There is an open question on when to consider a system to be in overflow state, or rather: how to handle the edge situation. If you cannot generate more containers, we need to decide whether we remove another container (the case you're describing) or call it quits and say "503, overloaded, go away for now". The logic deciding this is up for discussion as well. The heuristic could take into account how many resources in the whole system you already own, how many resources others own, and whether we want to share those fairly or not. Note that this is also very much related to being able to scale up the resources themselves (to be able to generate new containers). If we assume a bounded system though, yes, we'll need to find a strategy for handling this case. I believe with the state the ContainerManager has, it can provide a more eloquent answer to that question than what we can do today (nothing really, we just keep on churning through containers).
>
> I agree. An additional problem is that in the case of burst requests, the ContainerManager will "over-estimate" container allocation, whether work-stealing between ContainerRouters is enabled or not. For a bounded system, we had better handle this carefully to avoid frequent creation/deletion. I'm wondering whether sharing a message queue with the ContainerManager (since it's not on a critical path), or some mechanism for checking queue size (i.e. checking Kafka lag), could eliminate this? However, this may only happen for short-running tasks, and throttling is already helpful there.

Are you saying: it will over-estimate container allocation because it will create a container for each request as they arrive if there are no containers around currently, and the actual number of containers needed might be lower for very short-running use-cases where requests arrive in short bursts? If so: I agree, though I don't see how any system could solve this without taking the estimated runtime of each request into account. Can you elaborate on your thoughts on checking queue-size etc.?

> > Does that answer the question?
>
> > > Thanks!
> > >
> > > Tzu-Chiao
> > >
> > > On Sat, Aug 18, 2018 at 5:55 AM Tyson Norris <tnor...@adobe.com.invalid> wrote:
> > >
> > > > Ugh my reply formatting got removed!!! Trying this again with some >>
> > > >
> > > > On Aug 17, 2018, at 2:45 PM, Tyson Norris <tnor...@adobe.com.INVALID> wrote:
> > > >
> > > > If the failover of the singleton is too long (I think it will be based on cluster size, the oldest node becomes the singleton host iirc), I think we need to consider how containers can launch in the meantime. A first step might be to test out the singleton behavior in clusters of various sizes.
> > > >
> > > > I agree this bit of design is crucial, a few thoughts:
> > > > Pre-warm wouldn't help here, the ContainerRouters only know warm containers. Pre-warming is managed by the ContainerManager.
> > > >
> > > > >> Ah right
> > > >
> > > > Considering a fail-over scenario: We could consider sharing the state via EventSourcing. That is: all state lives inside of frequently snapshotted events and thus can be shared between multiple instances of the ContainerManager seamlessly. Alternatively, we could also think about only working on persisted state. That way, a cold-standby model could fly. We should make sure that the state is not "slightly stale" but rather that both instances see the same state at any point in time. I believe that on the cold path of generating new containers, we can live with the extra latency of persisting what we're doing, as the path will still be dominated by the container creation latency.
> > > >
> > > > >> Wasn't clear if you mean not using ClusterSingleton? To be clear, in the ClusterSingleton case there are 2 issues:
> > > > - time it takes for the akka ClusterSingletonManager to realize it needs to start a new actor
> > > > - time it takes for the new actor to assume a usable state
> > > >
> > > > EventSourcing (or ext persistence) may help with the latter, but we will need to be sure the former is tolerable to start with.
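For reference, the wiring itself is small; a minimal sketch of running the ContainerManager behind the stock ClusterSingletonManager/Proxy pair from akka-cluster-tools could look like the snippet below. The ContainerManager actor is just a placeholder here, not a proposal for the real Props, and the usual cluster configuration is assumed. If I read the docs right, the failover time Tyson mentions is mostly not in this wiring at all: on a hard crash the singleton only moves once the failed node has actually been removed (downed) from the cluster.

    import akka.actor.{Actor, ActorSystem, PoisonPill, Props}
    import akka.cluster.singleton.{
      ClusterSingletonManager,
      ClusterSingletonManagerSettings,
      ClusterSingletonProxy,
      ClusterSingletonProxySettings
    }

    // Placeholder for the real ContainerManager actor.
    class ContainerManager extends Actor {
      def receive: Receive = Actor.emptyBehavior
    }

    object SingletonWiring extends App {
      // Assumes akka-cluster-tools on the classpath and
      // akka.actor.provider = "cluster" in the configuration.
      val system = ActorSystem("controller-actor-system")

      // Runs exactly one ContainerManager, on the oldest cluster node.
      system.actorOf(
        ClusterSingletonManager.props(
          singletonProps = Props[ContainerManager],
          terminationMessage = PoisonPill,
          settings = ClusterSingletonManagerSettings(system)),
        name = "containerManager")

      // ContainerRouters would talk to the singleton through a proxy that
      // keeps track of where it currently lives.
      val containerManager = system.actorOf(
        ClusterSingletonProxy.props(
          singletonManagerPath = "/user/containerManager",
          settings = ClusterSingletonProxySettings(system)),
        name = "containerManagerProxy")
    }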
> > > > Here is an example test from akka source that may be useful (multi-jvm, but all local):
> > > >
> > > > https://github.com/akka/akka/blob/009214ae07708e8144a279e71d06c4a504907e31/akka-cluster-tools/src/multi-jvm/scala/akka/cluster/singleton/ClusterSingletonManagerChaosSpec.scala
> > > >
> > > > Some things to consider, that I don't know the details of:
> > > > - will the size of the cluster affect the singleton behavior in case of failure? (I think so, but not sure to what extent); in the simple test above it takes ~6s for the replacement singleton to begin startup, but if we have 100s of nodes, I'm not sure how much time it will take. (I don't think this should be hard to test, but I haven't done it)
> > > > - in case of a hard crash, what is the singleton behavior? In graceful jvm termination, I know the cluster behavior is good, but there is always this question about how downing nodes will be handled. If this critical piece of the system relies on akka cluster functionality, we will need to make sure that the singleton can be reconstituted, both in case of graceful termination (restart/deployment events) and non-graceful termination (hard vm crash, hard container crash). This is ignoring more complicated cases of extended network partitions, which will also have bad effects on many of the downstream systems.
> > > >
> > > > Handover time as you say is crucial, but I'd say as it only impacts container creation, we could live with, let's say, 5 seconds of failover downtime on this path? What's your experience been with singleton failover? How long did it take?
> > > >
> > > > >> Seconds in the simplest case, so I think we need to test it in a scaled case (100s of cluster nodes), as well as the hard crash case (where not downing the node may affect the cluster state).
> > > >
> > > > On Aug 16, 2018, at 11:01 AM, Tyson Norris <tnor...@adobe.com.INVALID> wrote:
> > > >
> > > > A couple comments on singleton:
> > > > - use of a cluster singleton will introduce a new single point of failure
> > > > - from the time of singleton node failure to singleton resurrection on a different instance, there will be an outage from the point of view of any ContainerRouter that does not already have a warm+free container to service an activation
> > > > - resurrecting the singleton will require transferring or rebuilding the state when recovery occurs - in my experience this was tricky, and requires replicating the data (which will be slightly stale, but better than rebuilding from nothing); I don't recall the handover delay (to transfer the singleton to a new akka cluster node) when I tried last, but I think it was not as fast as I hoped it would be.
> > > > I don't have a great suggestion for the singleton failure case, but would like to consider this carefully, and discuss the ramifications (which may or may not be tolerable) before pursuing this particular aspect of the design.
> > > >
> > > > On prioritization:
> > > > - if concurrency is enabled for an action, this is another prioritization aspect, of sorts - if the action supports concurrency, there is no reason (except for destruction coordination…) that it cannot be shared across shards. This could be added later, but may be worth considering, since there is a general reuse problem where a series of activations that arrives at different ContainerRouters will create a new container in each, while they could be reused (and avoid creating new containers) if concurrency is tolerated in that container. This would only (ha ha) require changing how container destroy works, where it cannot be destroyed until the last ContainerRouter is done with it. And if container destruction is coordinated in this way to increase reuse, it would also be good to coordinate construction (don't concurrently construct the same container for multiple ContainerRouters IFF a single container would enable concurrent activations once it is created). I'm not sure if others are desiring this level of container reuse, but if so, it would be worth considering these aspects (sharding/isolation vs sharing/coordination) as part of any redesign.
> > > >
> > > > Yes, I can see where you're heading here. I think this can be generalized:
> > > >
> > > > Assume intra-container concurrency C and number of ContainerRouters R.
> > > > If C > R: Shard the "slots" on this container evenly across R. The container can only be destroyed after you receive R acknowledgements of doing so.
> > > > If C < R: Hand out 1 slot each to C Routers, point the remaining Routers to the ones that got slots.
> > > >
> > > > >> Yes, mostly - I think there is also a case where the destruction message is revoked by the same router (receiving a new activation for the container which it previously requested destruction of). But I think this is covered in the details of tracking "after you receive R acks of destructions".
> > > >
> > > > Concurrent creation: Batch creation requests while one container is being created. Say you received a request for a new container that has C slots. If there are more requests for that container arriving while it is being created, don't act on them and fold the creation into the first one. Only start creating a new container if the number of resource requests exceeds C.
> > > >
> > > > Does that make sense? I think in that model you can set C=1 and it works as I envisioned it to work, or set it to C=200 and things will be shared even across routers.
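To make sure I read my own two rules correctly, here is a tiny sketch of the slot distribution and the creation batching, with entirely made-up names; it's only meant to illustrate the arithmetic, not to propose actual code:

    // For a container with c slots and a list of routers:
    //   c >= number of routers: spread the c slots evenly across all routers
    //   c <  number of routers: c routers get 1 slot each, the rest get none
    //                           and forward their requests to the ones that do
    def distributeSlots(c: Int, routers: List[String]): Map[String, Int] = {
      val r = routers.size
      if (c >= r) {
        val base = c / r
        val remainder = c % r
        routers.zipWithIndex.map { case (router, i) =>
          router -> (base + (if (i < remainder) 1 else 0))
        }.toMap
      } else {
        (routers.take(c).map(_ -> 1) ++ routers.drop(c).map(_ -> 0)).toMap
      }
    }

    // Creation batching: only ask for additional containers once outstanding
    // demand exceeds the slots already provided by containers being created.
    def containersToCreate(pendingRequests: Int, slotsPerContainer: Int, inFlight: Int): Int = {
      val covered = inFlight * slotsPerContainer
      val uncovered = math.max(pendingRequests - covered, 0)
      (uncovered + slotsPerContainer - 1) / slotsPerContainer // ceil division
    }

Setting slotsPerContainer to 1 gives the one-container-per-request behavior I described; a large value shares a single in-flight container across routers.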
> > > > >> Side note: One detail about the pending concurrency impl today is that due to the async nature of tracking the active activations within the container, there is no guarantee (when C>1) that the number is exact, so if you specify C=200, you may actually get a different container at 195 or 205. This is not really related to this discussion, but is based on the current messaging/future behavior in ContainerPool/ContainerProxy, so wanted to mention it explicitly, in case it matters to anyone.
> > > >
> > > > Thanks
> > > > Tyson
> > >
> > > --
> > > Tzu-Chiao Yeh (@tz70s)
>
> --
> Tzu-Chiao Yeh (@tz70s)