Re: Kafka and Proposal on a future architecture of OpenWhisk

2018-08-23 Thread Rodric Rabbah
> A second idea that comes to my mind would be to implement a shared
counter (it can easily be eventually consistent, consistency is I think not
a concern here).

This is simply a drive-by comment, as I have not directly weighed in on the
rest of the discussion. But this comment about a shared counter reminded me
of a recent interview with Tim Wagner
https://read.acloud.guru/serverless-and-blockchain-an-interview-with-tim-wagner-from-aws-to-coinbase-f3b2b5939790
about SQS and Lambda integration:

   So in order to give customers — and ourselves, frankly — some
control over that, we had to go invent an entire new feature, concurrency
controls per function in Lambda, which also meant we had to have metrics
and internal infrastructure for that.

   That required us to change some of our architecture to produce what
we call the high-speed counting service, and so on and so forth.
There’s a whole lot of iceberg below the waterline for the piece that comes
poking above the top.

-r


Re: Kafka and Proposal on a future architecture of OpenWhisk

2018-08-23 Thread Markus Thömmes
Hi,

On Fri, Aug 24, 2018 at 00:07, Tyson Norris wrote:

> > Router is not pulling at queue for "specific actions", just for any
> action
> > that might replace idle containers - right? This is complicated with
> > concurrency though since while a container is not idle (paused +
> > removable), it may be useable, but only if the action received is
> the same
> > as one existing warm container, and that container has concurrency
> slots
> > available for additional activations. It may be helpful to diagram
> some of
> > this stealing queue flow a bit more, I'm not seeing how it will work
> out
> > other than creating more containers than is absolutely required,
> which may
> > be ok, not sure.
> >
>
> Yes, I will diagram things out soonish; I'm a bit short on time
> currently.
>
> The idea is that indeed the Router pulls for *specific* actions. This
> is a
> problem when using Kafka, but might be solvable when we don't require
> Kafka. I have to test this for feasibility though.
>
>
> Hmm OK - it's not clear how a router that is empty (not servicing any
> activations) becomes a router that is pulling for that specific action,
> when other routers pulling for that action are at capacity (so new
> containers are needed)
>

Disclaimer: All of this is very much in the idea and not part of the
original proposal.

That's where the second part of the "idea" that I had above comes in: If we
can somehow detect that nobody is pulling (all are at capacity or have no
container), that is the moment where we need to create new containers.

I proposed further above in the discussion that the Routers do *not* ask
for more Containers but rather that this signal from the MQ (hey, apparently
nobody has capacity for that action... so let's create capacity) could be
used to create Containers.

That's a very blunt assumption though, I don't know if that's feasible at
all. There might be a performant way of implementing just that signal in a
distributed way though (the signal being: We are out of capacity over the
whole cluster, we need more). A second idea that comes to my mind would be
to implement a shared counter (it can easily be eventually consistent,
consistency is I think not a concern here). Once that counter drops to 0 or
below 0, we need more Containers.
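
To make that second idea concrete, here is a minimal sketch (purely
illustrative, assuming an eventually consistent counter such as a CRDT
behind it; the names are made up and not part of the proposal):

    import java.util.concurrent.atomic.AtomicLong

    object ClusterCapacity {
      // Stand-in for an eventually consistent, cluster-wide count of free
      // slots for one action. Eventual consistency is fine here; we only
      // need a rough "nobody has capacity anymore" signal.
      private val freeSlots = new AtomicLong(0L)

      def slotFreed(): Unit = freeSlots.incrementAndGet()

      // Called whenever a Router accepts work. Dropping to 0 or below is
      // the signal to ask the ContainerManager for more Containers.
      def slotTaken(): Unit =
        if (freeSlots.decrementAndGet() <= 0) requestMoreContainers()

      private def requestMoreContainers(): Unit = () // elided
    }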

This is, I think, where the prototyping thread comes in: Personally, I don't
feel comfortable stating that any of the approaches I outlined really do
work without having tried them out.


Re: Kafka and Proposal on a future architecture of OpenWhisk

2018-08-23 Thread Tyson Norris
> Router is not pulling at queue for "specific actions", just for any action
> that might replace idle containers - right? This is complicated with
> concurrency though since while a container is not idle (paused +
> removable), it may be useable, but only if the action received is the same
> as one existing warm container, and that container has concurrency slots
> available for additional activations. It may be helpful to diagram some of
> this stealing queue flow a bit more, I'm not seeing how it will work out
> other than creating more containers than is absolutely required, which may
> be ok, not sure.
>

Yes, I will diagram things out soonish; I'm a bit short on time
currently.

The idea is that indeed the Router pulls for *specific* actions. This is a
problem when using Kafka, but might be solvable when we don't require
Kafka. I have to test this for feasibility though.


Hmm OK - it's not clear how a router that is empty (not servicing any 
activations) becomes a router that is pulling for that specific action, when 
other routers pulling for that action are at capacity (so new containers are 
needed)




Re: Kafka and Proposal on a future architecture of OpenWhisk

2018-08-23 Thread Markus Thömmes
Hi Dave,

I agree! I'll start another thread on a discussion of how/where we
prototype things to hash out some of the unknowns.

Cheers,
Markus

On Thu, Aug 23, 2018 at 22:05, David P Grove wrote:

>
> Related to the random vs. smart routing discussion.
>
> A key unknown that influences the design is how much load we can drive
> through a single ContainerRouter.
> + If they are highly scalable (500 to 1000 containers per router),
> then even a fairly large OpenWhisk deployment could be running with a
> handful of ContainerRouters (and smart routing is quite viable).
> + If they are less scalable (10s to 100 containers per router) then
> large deployments will be running with 50+ ContainerRouters and smart
> routing degrades to random routing in terms of container reuse for our long
> tail workloads.
>
> --dave
>


Re: Kafka and Proposal on a future architecture of OpenWhisk

2018-08-23 Thread Markus Thömmes
Hi Tyson,

On Thu, Aug 23, 2018 at 21:28, Tyson Norris wrote:

> >
> > And each ContainerRouter has a queue consumer that presumably pulls
> from
> > the queue constantly? Or is consumption based on something else? If
> all
> > ContainerRouters are consuming at the same rate, then while this does
> > distribute the load across ContainerRouters, it doesn't really
> guarantee
> > any similar state (number of containers, active connections, etc) at
> each
> > ContainerRouter, I think. Maybe I am missing something here?
> >
>
>
> The idea is that ContainerRouters do **not** pull from the queue
> constantly. They pull work for actions that they have idle containers
> for.
>
> Router is not pulling at queue for "specific actions", just for any action
> that might replace idle containers - right? This is complicated with
> concurrency though since while a container is not idle (paused +
> removable), it may be useable, but only if the action received is the same
> as one existing warm container, and that container has concurrency slots
> available for additional activations. It may be helpful to diagram some of
> this stealing queue flow a bit more, I'm not seeing how it will work out
> other than creating more containers than is absolutely required, which may
> be ok, not sure.
>

Yes, I will diagram things out soonish; I'm a bit short on time
currently.

The idea is that indeed the Router pulls for *specific* actions. This is a
problem when using Kafka, but might be solvable when we don't require
Kafka. I have to test this for feasibility though.


>
> Similar state in terms of number of containers is done via the
> ContainerManager. Active connections should roughly even out with the
> queue
> being pulled on idle.
>
> Yeah carefully defining "idle" may be tricky, if we want to achieve
> absolute minimum containers in use for a specific action at any time.
>
>
> >
> > The edge-case here is for very slow load. It's minimizing the
> amount of
> > Containers needed. Another example:
> > Say you have 3 Routers. A request for action X comes in, goes to
> > Router1.
> > It requests a container, puts the work on the queue, nobody
> steals it,
> > as
> > soon as the Container gets ready, the work is taken from the
> queue and
> > executed. All nice and dandy.
> >
> > Important remark: The Router that requested more Containers is
> not
> > necessarily the one that's getting the Containers. We need to
> make
> > sure to
> > evenly distribute Containers across the system.
> >
> > So back to our example: What happens if requests for action X
> are made
> > one
> > after the other? Well, the layer above the Routers (something
> needs to
> > loadbalance them, be it DNS or another type of routing layer)
> isn't
> > aware
> > of the locality of the Container that we created to execute
> action X.
> > As it
> > schedules fairly randomly (round-robin in a multi-tenant system
> is
> > essentially random) the action will hit each Router once very
> soon. As
> > we're only generating one request after the other, arguably we
> only
> > want to
> > create one container.
> >
> > That's why in this example the 2 remaining Routers with no
> container
> > get a
> > reference to Router1.
> >
> > In the case you mentioned:
> > > it seems like sending to another Router which has the
> container, but
> > may
> > not be able to use it immediately, may cause failures in some
> cases.
> >
> > I don't recall if it's in the document or in the discussion on
> the
> > dev-list: The router would respond to the proxied request with a
> 503
> > immediatly. That tells the proxying router: Oh, apparently we
> need more
> > resources. So it requests another container etc etc.
> >
> > Does that clarify that specific edge-case?
> >
> > Yes, but I would not call this an edge-case -  I think it is more of
> a
> > ramp up to maximum container reuse, and will probably be dramatically
> impacted
> > by containers that do NOT support concurrency (will get a 503 when a
> single
> > activation is in flight, vs high concurrency container, which would
> cause
> > 503 only once max concurrency reached).
> > If each ContainerRouter is as likely to receive the original
> request, and
> > each is also as likely to receive the queued item from the stealing
> queue,
> > then there will be a lot of cross traffic during the ramp up from 1
> > container to  containers. E.g.
> >
> > From client:
> > Request1 -> Router 1 -> queue (no containers)
> > Request2 -> Router 2 -> queue (no containers)
> > Request3 -> Router 3 -> queue (no containers)
> > From queue:
> > 

Re: Kafka and Proposal on a future architecture of OpenWhisk

2018-08-23 Thread David P Grove

Related to the random vs. smart routing discussion.

A key unknown that influences the design is how much load we can drive
through a single ContainerRouter.
+ If they are highly scalable (500 to 1000 containers per router),
then even a fairly large OpenWhisk deployment could be running with a
handful of ContainerRouters (and smart routing is quite viable).
+ If they are less scalable (10s to 100 containers per router) then
large deployments will be running with 50+ ContainerRouters and smart
routing degrades to random routing in terms of container reuse for our long
tail workloads.
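
To put illustrative numbers on that (back-of-envelope only, not measured):
a deployment with 10,000 warm containers needs only ~10-20 routers at
500-1000 containers/router, but ~100-1000 routers at 10-100
containers/router. For a long-tail action backed by a single warm
container, a randomly chosen router holds it with probability around 1/15
in the first case and as low as 1/1000 in the second, which is the sense in
which smart routing degrades to random.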

--dave


Re: Kafka and Proposal on a future architecture of OpenWhisk

2018-08-23 Thread Tyson Norris
>
> And each ContainerRouter has a queue consumer that presumably pulls from
> the queue constantly? Or is consumption based on something else? If all
> ContainerRouters are consuming at the same rate, then while this does
> distribute the load across ContainerRouters, it doesn't really guarantee
> any similar state (number of containers, active connections, etc) at each
> ContainerRouter, I think. Maybe I am missing something here?
>


The idea is that ContainerRouters do **not** pull from the queue
constantly. They pull work for actions that they have idle containers for.

Router is not pulling at queue for "specific actions", just for any action that 
might replace idle containers - right? This is complicated with concurrency 
though since while a container is not idle (paused + removable), it may be 
useable, but only if the action received is the same as one existing warm 
container, and that container has concurrency slots available for additional 
activations. It may be helpful to diagram some of this stealing queue flow a 
bit more, I'm not seeing how it will work out other than creating more 
containers than is absolutely required, which may be ok, not sure. 

Similar state in terms of number of containers is done via the
ContainerManager. Active connections should roughly even out with the queue
being pulled on idle.

Yeah carefully defining "idle" may be tricky, if we want to achieve absolute 
minimum containers in use for a specific action at any time.


>
> The edge-case here is for very slow load. It's minimizing the amount 
of
> Containers needed. Another example:
> Say you have 3 Routers. A request for action X comes in, goes to
> Router1.
> It requests a container, puts the work on the queue, nobody steals it,
> as
> soon as the Container gets ready, the work is taken from the queue and
> executed. All nice and dandy.
>
> Important remark: The Router that requested more Containers is not
> necessarily the one that's getting the Containers. We need to make
> sure to
> evenly distribute Containers across the system.
>
> So back to our example: What happens if requests for action X are made
> one
> after the other? Well, the layer above the Routers (something needs to
> loadbalance them, be it DNS or another type of routing layer) isn't
> aware
> of the locality of the Container that we created to execute action X.
> As it
> schedules fairly randomly (round-robin in a multi-tenant system is
> essentially random) the action will hit each Router once very soon. As
> we're only generating one request after the other, arguably we only
> want to
> create one container.
>
> That's why in this example the 2 remaining Routers with no container
> get a
> reference to Router1.
>
> In the case you mentioned:
> > it seems like sending to another Router which has the container, but
> may
> not be able to use it immediately, may cause failures in some cases.
>
> I don't recall if it's in the document or in the discussion on the
> dev-list: The router would respond to the proxied request with a 503
> immediatly. That tells the proxying router: Oh, apparently we need 
more
> resources. So it requests another container etc etc.
>
> Does that clarify that specific edge-case?
>
> Yes, but I would not call this an edge-case -  I think it is more of a
> ramp up to maximum container reuse, and will probably be dramatically 
impacted
> by containers that do NOT support concurrency (will get a 503 when a 
single
> activation is in flight, vs high concurrency container, which would cause
> 503 only once max concurrency reached).
> If each ContainerRouter is as likely to receive the original request, and
> each is also as likely to receive the queued item from the stealing queue,
> then there will be a lot of cross traffic during the ramp up from 1
> container to  containers. E.g.
>
> From client:
> Request1 -> Router 1 -> queue (no containers)
> Request2 -> Router 2 -> queue (no containers)
> Request3 -> Router 3 -> queue (no containers)
> From queue:
> Request1 -> Router1  -> create and use container
> Request2 -> Router2 -> Router1 -> 503 -> create container
> Request3 -> Router3 -> Router1 -> 503 -> Router2 -> 503 -> create 
container
>
> In other words - the 503 may help when there is one container existing,
> and it is deemed to be busy, but what if there are 10 containers existing
> (on different Routers other than where the request was pulled from the
> stealing queue) - do you make HTTP requests to all 10 Routers to see if
> they are busy before creating a new 

Re: Kafka and Proposal on a future architecture of OpenWhisk

2018-08-23 Thread David P Grove

Great discussion; I'm not entirely convinced on part of this point though.


> We need a work-stealing queue here to dynamically rebalance between the
> Routers since the layer above the Routers has no idea about capacity and
> (at least that's my assumption) schedules randomly.

I agree we can't really keep track of actual current capacity outside of
the individual Router.  But I don't want to jump immediately from that to
assuming truly random scheduling at the layer above because it pushes a
pretty key problem down into the ContainerManager/ContainerRouter layer
(dealing with the "edge case" of finding hot containers for the very long
tail of actions that can be serviced by a very small number of running
containers).

The layer above could route based on runtime kind to increase the
probability of container reuse.

The layer above could still do some hash-based scheme to map an initial
"home" Router (or subset of Routers on a very large deployment) and rely on
work-stealing/overflow queue to deal with "noisy neighbor" hash collisions
if a Router gets badly overloaded.
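
A minimal sketch of that hash-based scheme (illustrative only; it assumes a
stable router list and leaves collisions to the overflow/work-stealing
queue):

    object HomeRouterScheme {
      final case class Router(id: Int)

      // Map an action (optionally salted with its runtime kind) to a small,
      // stable subset of "home" routers, so repeat invocations land where
      // the warm containers are.
      def homeRouters(actionFqn: String,
                      kind: String,
                      routers: Vector[Router],
                      replicas: Int): Vector[Router] = {
        val h = Math.floorMod((actionFqn + kind).hashCode, routers.size)
        (0 until replicas).map(i => routers((h + i) % routers.size)).toVector
      }
    }

For example, homeRouters("ns/pkg/actionX", "nodejs:10", routers, replicas = 2)
always yields the same two routers; a badly overloaded home router simply
spills its work onto the stealing queue.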

Each Router is potentially managing a fairly large pool of containers.  The
pools don't have to be the same size between Routers.  More crazily, the
Routers could even autoscale themselves to deal with uneven load (in
effect, hierarchical routing).

Lots of half-baked ideas are possible here :)

--dave


Re: Kafka and Proposal on a future architecture of OpenWhisk

2018-08-23 Thread Markus Thömmes
Hi Tyson,

On Thu, Aug 23, 2018 at 00:33, Tyson Norris wrote:

> Hi - thanks for the discussion! More inline...
>
> On 8/22/18, 2:55 PM, "Markus Thömmes"  wrote:
>
> Hi Tyson,
>
> On Wed, Aug 22, 2018 at 23:37, Tyson Norris wrote:
>
> > Hi -
> > >
> > > When exactly is the case that a ContainerRouter should put a
> blocking
> > > activation to a queue for stealing? Since a) it is not spawning
> > containers
> > > and b) it is not parsing request/response bodies, can we say
> this
> > would
> > > only happen when a ContainerRouter maxes out its incoming
> request
> > handling?
> > >
> >
> > That's exactly the idea! The work-stealing queue will only be
> used if
> > the
> > Router where the request landed cannot serve the demand right
> now. For
> > example, if it maxed out the slots it has for a certain action
> (all
> > containers are working to their full extent) it requests more
> > resources and
> > puts the request-token on the work-stealing queue.
> >
> > So to clarify, ContainerRouter "load" (which can trigger use of
> queue) is
> > mostly (only?) based on:
> > * the number of Container references
> > * the number of outstanding inbound  HTTP requests, e.g. when lots of
> > requests can be routed to the same container
> > * the number of outstanding outbound HTTP requests to remote action
> > containers (assume all are remote)
> > It is unclear what order of magnitude is considered for "maxed out
> slots",
> > since container refs should be simple (like ip+port, action metadata,
> > activation count, and warm state), inbound connection handling is
> basically
> > an HTTP server, and outbound is a connection pool per action container
> > (let's presume connection reuse for the moment).
> > I think it will certainly need testing to determine these and to be
> > configurable in any case, for each of these separate stats.. Is there
> > anything else that affects the load for ContainerRouter?
> >
>
> "Overload" is determined by the availability of free slots on any
> container
> being able to serve the current action invocation (or rather the
> absence
> thereof). An example:
> Say RouterA has 2 containers for action X. Each container has an
> allowed
> concurrency of 10. On each of those 2 there are 10 active invocations
> already running (the ContainerRouter knows this, these are open
> connections
> to the containers). If another request comes in for X, we know we don't
> have capacity for it. We request more resources and offer the work we
> got
> for stealing.
>
> I don't think there are tweaks needed here. The Router keeps an
> "activeInvocations" number per container and compares that to the
> allowed
> concurrency on that container. If activeInvocations ==
> allowedConcurrency
> we're out of capacity and need more.
>
> We need a work-stealing queue here to dynamically rebalance between the
> Routers since the layer above the Routers has no idea about capacity
> and
> (at least that's my assumption) schedules randomly.
>
> I think it is confusing to say that the ContainerRouter doesn't have
> capacity for it - rather, the existing set of containers in the
> ContainerRouter doesn't have capacity for it. I understand now, in any case.
>

Noted, will adjust future wording on this, thanks!


> So there are a couple of active paths in ContainerRouter, still only
> considering sync/blocking activations:
> * warmpath - run immediately
> * coldpath - send to queue
>
> And each ContainerRouter has a queue consumer that presumably pulls from
> the queue constantly? Or is consumption based on something else? If all
> ContainerRouters are consuming at the same rate, then while this does
> distribute the load across ContainerRouters, it doesn't really guarantee
> any similar state (number of containers, active connections, etc) at each
> ContainerRouter, I think. Maybe I am missing something here?
>


The idea is that ContainerRouters do **not** pull from the queue
constantly. They pull work for actions that they have idle containers for.

Similar state in terms of number of containers is done via the
ContainerManager. Active connections should roughly even out with the queue
being pulled on idle.
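
A rough sketch of that pull-on-idle behaviour (illustrative names; it
assumes the queue can be polled per action, which is exactly the part that
is hard with Kafka):

    object PullOnIdle {
      final case class ContainerRef(action: String,
                                    var activeInvocations: Int,
                                    allowedConcurrency: Int) {
        def hasIdleSlot: Boolean = activeInvocations < allowedConcurrency
      }

      // The Router does not consume constantly; it only asks the queue for
      // work belonging to actions it currently has a free slot for.
      def pollableActions(pool: Seq[ContainerRef]): Set[String] =
        pool.filter(_.hasIdleSlot).map(_.action).toSet
    }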


>
>

>
>
> >
> > That request-token will then be taken by any Router that has free
> > capacity
> > for that action (note: this is not simple with kafka, but might
> be
> > simpler
> > with other MQ technologies). Since new resources have been
> requested,
> > it is
> > guaranteed that one Router will eventually become free.
> >
> > Is "requests resources" here requesting new action containers, which
> it
> > won't be able to process itself immediately, but should startup +
> warm and
> > be 

Re: Kafka and Proposal on a future architecture of OpenWhisk

2018-08-22 Thread Tyson Norris
Hi - thanks for the discussion! More inline...

On 8/22/18, 2:55 PM, "Markus Thömmes"  wrote:

Hi Tyson,

On Wed, Aug 22, 2018 at 23:37, Tyson Norris wrote:

> Hi -
> >
> > When exactly is the case that a ContainerRouter should put a 
blocking
> > activation to a queue for stealing? Since a) it is not spawning
> containers
> > and b) it is not parsing request/response bodies, can we say this
> would
> > only happen when a ContainerRouter maxes out its incoming request
> handling?
> >
>
> That's exactly the idea! The work-stealing queue will only be used if
> the
> Router where the request landed cannot serve the demand right now. For
> example, if it maxed out the slots it has for a certain action (all
> containers are working to their full extent) it requests more
> resources and
> puts the request-token on the work-stealing queue.
>
> So to clarify, ContainerRouter "load" (which can trigger use of queue) is
> mostly (only?) based on:
> * the number of Container references
> * the number of outstanding inbound  HTTP requests, e.g. when lots of
> requests can be routed to the same container
> * the number of outstanding outbound HTTP requests to remote action
> containers (assume all are remote)
> It is unclear what order of magnitude is considered for "maxed out slots",
> since container refs should be simple (like ip+port, action metadata,
> activation count, and warm state), inbound connection handling is 
basically
> an HTTP server, and outbound is a connection pool per action container
> (let's presume connection reuse for the moment).
> I think it will certainly need testing to determine these and to be
> configurable in any case, for each of these separate stats.. Is there
> anything else that affects the load for ContainerRouter?
>

"Overload" is determined by the availability of free slots on any container
being able to serve the current action invocation (or rather the absence
thereof). An example:
Say RouterA has 2 containers for action X. Each container has an allowed
concurrency of 10. On each of those 2 there are 10 active invocations
already running (the ContainerRouter knows this, these are open connections
to the containers). If another request comes in for X, we know we don't
have capacity for it. We request more resources and offer the work we got
for stealing.

I don't think there are tweaks needed here. The Router keeps an
"activeInvocations" number per container and compares that to the allowed
concurrency on that container. If activeInvocations == allowedConcurrency
we're out of capacity and need more.

We need a work-stealing queue here to dynamically rebalance between the
Routers since the layer above the Routers has no idea about capacity and
(at least that's my assumption) schedules randomly.

I think it is confusing to say that the ContainerRouter doesn't have capacity 
for it - rather, the existing set of containers in the ContainerRouter doesn't 
have capacity for it. I understand now, in any case.
So there are a couple of active paths in ContainerRouter, still only 
considering sync/blocking activations:
* warmpath - run immediately
* coldpath - send to queue

And each ContainerRouter has a queue consumer that presumably pulls from the 
queue constantly? Or is consumption based on something else? If all 
ContainerRouters are consuming at the same rate, then while this does 
distribute the load across ContainerRouters, it doesn't really guarantee any 
similar state (number of containers, active connections, etc) at each 
ContainerRouter, I think. Maybe I am missing something here?
  




>
> That request-token will then be taken by any Router that has free
> capacity
> for that action (note: this is not simple with kafka, but might be
> simpler
> with other MQ technologies). Since new resources have been requested,
> it is
> guaranteed that one Router will eventually become free.
>
> Is "requests resources" here requesting new action containers, which it
> won't be able to process itself immediately, but should startup + warm and
> be provided to "any ContainerRouter"? This makes sense, just want to
> clarify that "resources == containers".
>

Yes, resources == containers.


>
> >
> > If ContainerManager has enough awareness of ContainerRouters'
> states, I'm
> > not sure where using a queue would be used (for redirecting to other
> > ContainerRouters) vs ContainerManager responding with a
> ContainerRouters
> > reference (instead of an action container reference) - I'm not
> following
> > the logic of the edge case in the proposal - there is 

Re: Kafka and Proposal on a future architecture of OpenWhisk

2018-08-22 Thread Markus Thömmes
Hi Tyson,

On Wed, Aug 22, 2018 at 23:37, Tyson Norris wrote:

> Hi -
> >
> > When exactly is the case that a ContainerRouter should put a blocking
> > activation to a queue for stealing? Since a) it is not spawning
> containers
> > and b) it is not parsing request/response bodies, can we say this
> would
> > only happen when a ContainerRouter maxes out its incoming request
> handling?
> >
>
> That's exactly the idea! The work-stealing queue will only be used if
> the
> Router where the request landed cannot serve the demand right now. For
> example, if it maxed out the slots it has for a certain action (all
> containers are working to their full extent) it requests more
> resources and
> puts the request-token on the work-stealing queue.
>
> So to clarify, ContainerRouter "load" (which can trigger use of queue) is
> mostly (only?) based on:
> * the number of Container references
> * the number of outstanding inbound  HTTP requests, e.g. when lots of
> requests can be routed to the same container
> * the number of outstanding outbound HTTP requests to remote action
> containers (assume all are remote)
> It is unclear what order of magnitude is considered for "maxed out slots",
> since container refs should be simple (like ip+port, action metadata,
> activation count, and warm state), inbound connection handling is basically
> an HTTP server, and outbound is a connection pool per action container
> (let's presume connection reuse for the moment).
> I think it will certainly need testing to determine these and to be
> configurable in any case, for each of these separate stats.. Is there
> anything else that affects the load for ContainerRouter?
>

"Overload" is determined by the availability of free slots on any container
being able to serve the current action invocation (or rather the absence
thereof). An example:
Say RouterA has 2 containers for action X. Each container has an allowed
concurrency of 10. On each of those 2 there are 10 active invocations
already running (the ContainerRouter knows this, these are open connections
to the containers). If another request comes in for X, we know we don't
have capacity for it. We request more resources and offer the work we got
for stealing.

I don't think there are tweaks needed here. The Router keeps an
"activeInvocations" number per container and compares that to the allowed
concurrency on that container. If activeInvocations == allowedConcurrency
we're out of capacity and need more.
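
As a sketch, that capacity check could look roughly like this (illustrative
names only, not the proposal's actual types):

    object RouterCapacity {
      final case class ContainerRef(var activeInvocations: Int,
                                    allowedConcurrency: Int)

      def freeSlot(containers: Seq[ContainerRef]): Option[ContainerRef] =
        containers.find(c => c.activeInvocations < c.allowedConcurrency)

      // Serve locally if any container has a free slot; otherwise request
      // more resources and offer the work for stealing.
      def handle(containers: Seq[ContainerRef], requestToken: String): Unit =
        freeSlot(containers) match {
          case Some(c) => c.activeInvocations += 1 // proxy the request to c
          case None =>
            requestMoreContainers()
            offerForStealing(requestToken)
        }

      private def requestMoreContainers(): Unit = () // elided
      private def offerForStealing(token: String): Unit = () // elided
    }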

We need a work-stealing queue here to dynamically rebalance between the
Routers since the layer above the Routers has no idea about capacity and
(at least that's my assumption) schedules randomly.


>
> That request-token will then be taken by any Router that has free
> capacity
> for that action (note: this is not simple with kafka, but might be
> simpler
> with other MQ technologies). Since new resources have been requested,
> it is
> guaranteed that one Router will eventually become free.
>
> Is "requests resources" here requesting new action containers, which it
> won't be able to process itself immediately, but should startup + warm and
> be provided to "any ContainerRouter"? This makes sense, just want to
> clarify that "resources == containers".
>

Yes, resources == containers.


>
> >
> > If ContainerManager has enough awareness of ContainerRouters'
> states, I'm
> > not sure where using a queue would be used (for redirecting to other
> > ContainerRouters) vs ContainerManager responding with a
> ContainerRouters
> > reference (instead of an action container reference) - I'm not
> following
> > the logic of the edge case in the proposal - there is mention of
> "which
> > controller the request needs to go", but maybe this is a typo and
> should
> > say ContainerRouter?
> >
>
> Indeed that's a typo, it should say ContainerRouter.
>
> The ContainerManager only knows which Router has which Container. It
> does
> not know whether the respective Router has capacity on that container
> (the
> capacity metric is very hard to share since it's ever changing).
>
> Hence, in an edge-case where there are fewer Containers than Routers,
> the
> ContainerManager can hand out references to the Routers it gave
> Containers
> to the Routers that have none. (This is the edge-case described in the
> proposal).
>
> I'm not sure why in this case the ContainerManager does not just create a
> new container, instead of sending to another Router? If there is some
> intended limit on "number of containers for a particular action", that
> would be a reason, but given that the ContainerManager cannot know the
> state of the existing containers, it seems like sending to another Router
> which has the container, but may not be able to use it immediately, may
> cause failures in some cases.
>

The edge-case here is for very slow load. It's minimizing the amount of
Containers 

Re: Kafka and Proposal on a future architecture of OpenWhisk

2018-08-22 Thread Tyson Norris
Hi - 
>
> When exactly is the case that a ContainerRouter should put a blocking
> activation to a queue for stealing? Since a) it is not spawning containers
> and b) it is not parsing request/response bodies, can we say this would
> only happen when a ContainerRouter maxes out its incoming request 
handling?
>

That's exactly the idea! The work-stealing queue will only be used if the
Router where the request landed cannot serve the demand right now. For
example, if it maxed out the slots it has for a certain action (all
containers are working to their full extent) it requests more resources and
puts the request-token on the work-stealing queue.

So to clarify, ContainerRouter "load" (which can trigger use of queue) is 
mostly (only?) based on:
* the number of Container references 
* the number of outstanding inbound  HTTP requests, e.g. when lots of requests 
can be routed to the same container
* the number of outstanding outbound HTTP requests to remote action containers 
(assume all are remote)
It is unclear what order of magnitude is considered for "maxed out slots", since 
container refs should be simple (like ip+port, action metadata, activation 
count, and warm state), inbound connection handling is basically an HTTP server, 
and outbound is a connection pool per action container (let's presume 
connection reuse for the moment).
I think it will certainly need testing to determine these and to be 
configurable in any case, for each of these separate stats.. Is there anything 
else that affects the load for ContainerRouter?

That request-token will then be taken by any Router that has free capacity
for that action (note: this is not simple with kafka, but might be simpler
with other MQ technologies). Since new resources have been requested, it is
guaranteed that one Router will eventually become free.

Is "requests resources" here requesting new action containers, which it won't 
be able to process itself immediately, but should startup + warm and be 
provided to "any ContainerRouter"? This makes, sense, just want to clarify that 
"resources == containers".

>
> If ContainerManager has enough awareness of ContainerRouters' states, I'm
> not sure where using a queue would be used (for redirecting to other
> ContainerRouters) vs ContainerManager responding with a ContainerRouters
> reference (instead of an action container reference) - I'm not following
> the logic of the edge case in the proposal - there is mention of "which
> controller the request needs to go", but maybe this is a typo and should
> say ContainerRouter?
>

Indeed that's a typo, it should say ContainerRouter.

The ContainerManager only knows which Router has which Container. It does
not know whether the respective Router has capacity on that container (the
capacity metric is very hard to share since it's ever changing).

Hence, in an edge-case where there are fewer Containers than Routers, the
ContainerManager can hand out references to the Routers it gave Containers
to the Routers that have none. (This is the edge-case described in the
proposal).

I'm not sure why in this case the ContainerManager does not just create a new 
container, instead of sending to another Router? If there is some intended 
limit on "number of containers for a particular action", that would be a 
reason, but given that the ContainerManager cannot know the state of the 
existing containers, it seems like sending to another Router which has the 
container, but may not be able to use it immediately, may cause failures in 
some cases. 


The work-stealing queue though is used to rebalance work in case one of the
Routers gets overloaded.

Got it.

Thanks
Tyson
 



Re: Kafka and Proposal on a future architecture of OpenWhisk

2018-08-22 Thread Markus Thömmes
Hi Tyson,

On Wed, Aug 22, 2018 at 22:49, Tyson Norris wrote:

> Yes, agreed this makes sense, same as Carlos is saying.
>
> Let's ignore async for now, I think that one is simpler - does "A
> blocking request can still be put onto the work-stealing queue" mean that
> it wouldn't always be put on the queue?
>
> If there is existing warm container capacity in the ContainerRouter
> receiving the activation, ideally it would skip the queue - right?
>

Exactly, it should skip the queue whenever possible.


>
> When exactly is the case that a ContainerRouter should put a blocking
> activation to a queue for stealing? Since a) it is not spawning containers
> and b) it is not parsing request/response bodies, can we say this would
> only happen when a ContainerRouter maxes out its incoming request handling?
>

That's exactly the idea! The work-stealing queue will only be used if the
Router where the request landed cannot serve the demand right now. For
example, if it maxed out the slots it has for a certain action (all
containers are working to their full extent) it requests more resources and
puts the request-token on the work-stealing queue.

That request-token will then be taken by any Router that has free capacity
for that action (note: this is not simple with kafka, but might be simpler
with other MQ technologies). Since new resources have been requested, it is
guaranteed that one Router will eventually become free.
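
Sketched out (illustrative; note that this selective take is precisely the
part that plain Kafka makes hard):

    import java.util.concurrent.ConcurrentLinkedQueue

    object WorkStealing {
      final case class RequestToken(action: String, workRef: String)
      val queue = new ConcurrentLinkedQueue[RequestToken]()

      // Overloaded Router: request a container (resources == containers)
      // and offer only a token; the payload stays with the original owner.
      def overflow(t: RequestToken): Unit = {
        requestContainer(t.action)
        queue.offer(t)
      }

      // A Router with a free slot for that action takes the token and then
      // streams the actual request from the owner. (Races ignored here.)
      def steal(idleActions: Set[String]): Option[RequestToken] =
        Option(queue.peek())
          .filter(t => idleActions(t.action))
          .map(_ => queue.poll())

      private def requestContainer(action: String): Unit = () // elided
    }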


>
> If ContainerManager has enough awareness of ContainerRouters' states, I'm
> not sure where using a queue would be used (for redirecting to other
> ContainerRouters) vs ContainerManager responding with a ContainerRouters
> reference (instead of an action container reference) - I'm not following
> the logic of the edge case in the proposal - there is mention of "which
> controller the request needs to go", but maybe this is a typo and should
> say ContainerRouter?
>

Indeed that's a typo, it should say ContainerRouter.

The ContainerManager only knows which Router has which Container. It does
not know whether the respective Router has capacity on that container (the
capacity metric is very hard to share since it's ever changing).

Hence, in an edge-case where there are fewer Containers than Routers, the
ContainerManager can hand out references to the Routers it gave Containers
to the Routers that have none. (This is the edge-case described in the
proposal).
The work-stealing queue though is used to rebalance work in case one of the
Routers gets overloaded.
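
In types, that edge-case could be expressed roughly like this (illustrative,
not from the proposal document):

    sealed trait Allocation
    // Normal case: the Router gets a Container of its own.
    final case class UseContainer(host: String, port: Int) extends Allocation
    // Edge-case (fewer Containers than Routers): a reference to a Router
    // that already holds a warm Container; that Router answers 503 when the
    // Container has no free slot, prompting a fresh resource request.
    final case class ProxyToRouter(routerId: String) extends Allocation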


>
> Thanks
> Tyson
>
> On 8/21/18, 1:16 AM, "Markus Thömmes"  wrote:
>
> Hi Tyson,
>
> if we take the concerns apart as I proposed above, timeouts should only
> ever be triggered after a request is scheduled as you say, that is: As
> soon
> as it's crossing the user-container mark. With the concern separation,
> it
> is plausible that blocking invocations are never buffered anywhere,
> which
> makes a lot of sense, because you cannot persist the open HTTP
> connection
> to the client anyway.
>
> To make the distinction clear: A blocking request can still be put
> onto the
> work-stealing queue to be balanced between different ContainerRouters.
>
> A blocking request though would never be written to a persistent buffer
> that's used to be able to efficiently handle async invocations and
> backpressuring them. That buffer should be entirely separate and could
> possibly be placed outside of the execution system to make the
> distinction
> more explicit. The execution system itself would then only deal with
> request-response style invocations and asynchronous invocations are
> done by
> having a separate queue and a consumer that creates HTTP requests to
> the
> execution system.
>
> Cheers,
> Markus
>
> On Mon, Aug 20, 2018 at 23:30, Tyson Norris wrote:
>
> > Thanks for summarizing Markus.
> >
> > Yes this is confusing in context of current system, which stores in
> kafka,
> > but does not wait indefinitely, since the timeout begins immediately
> > So, I think the problem of buffering/queueing is: when does the
> timeout
> > begin? If not everything is buffered the same, their timeout should
> not
> > begin until processing begins.
> >
> > Maybe it would make sense to:
> > * always buffer (indefinitely) to queue for async, never for sync
> > * timeout for async not started till read from queue - which may be
> > delayed from time of trigger or http request
> > * this should also come with some system monitoring to indicate the
> queue
> > processing is not keeping up with some configurable max delay
> threshold ("I
> > can’t tolerate delays of > 5 minutes", etc)
> > * ContainerRouters can only pull from async queue when
> > * increasing the number of pending activations won’t exceed
> some
> > threshold (prevent excessive load of async on ContainerRouters)
> > 

Re: Kafka and Proposal on a future architecture of OpenWhisk

2018-08-22 Thread Tyson Norris
Yes, agreed this makes sense, same as Carlos is saying. 

Let's ignore async for now, I think that one is simpler - does "A blocking 
request can still be put onto the work-stealing queue" mean that it wouldn't 
always be put on the queue? 

If there is existing warm container capacity in the ContainerRouter receiving 
the activation, ideally it would skip the queue - right? 

When exactly is the case that a ContainerRouter should put a blocking 
activation to a queue for stealing? Since a) it is not spawning containers and 
b) it is not parsing request/response bodies, can we say this would only happen 
when a ContainerRouter maxes out its incoming request handling? 

If ContainerManager has enough awareness of ContainerRouters' states, I'm not 
sure where using a queue would be used (for redirecting to other 
ContainerRouters) vs ContainerManager responding with a ContainerRouters 
reference (instead of an action container reference) - I'm not following the 
logic of the edge case in the proposal - there is mention of "which controller 
the request needs to go", but maybe this is a typo and should say 
ContainerRouter?

Thanks
Tyson

On 8/21/18, 1:16 AM, "Markus Thömmes"  wrote:

Hi Tyson,

if we take the concerns apart as I proposed above, timeouts should only
ever be triggered after a request is scheduled as you say, that is: As soon
as it's crossing the user-container mark. With the concern separation, it
is plausible that blocking invocations are never buffered anywhere, which
makes a lot of sense, because you cannot persist the open HTTP connection
to the client anyway.

To make the distinction clear: A blocking request can still be put onto the
work-stealing queue to be balanced between different ContainerRouters.

A blocking request though would never be written to a persistent buffer
that's used to be able to efficiently handle async invocations and
backpressuring them. That buffer should be entirely separate and could
possibly be placed outside of the execution system to make the distinction
more explicit. The execution system itself would then only deal with
request-response style invocations and asynchronous invocations are done by
having a separate queue and a consumer that creates HTTP requests to the
execution system.

Cheers,
Markus

On Mon, Aug 20, 2018 at 23:30, Tyson Norris wrote:

> Thanks for summarizing Markus.
>
> Yes this is confusing in context of current system, which stores in kafka,
> but does not wait indefinitely, since the timeout begins immediately
> So, I think the problem of buffering/queueing is: when does the timeout
> begin? If not everything is buffered the same, their timeout should not
> begin until processing begins.
>
> Maybe it would make sense to:
> * always buffer (indefinitely) to queue for async, never for sync
> * timeout for async not started till read from queue - which may be
> delayed from time of trigger or http request
> * this should also come with some system monitoring to indicate the queue
> processing is not keeping up with some configurable max delay threshold 
("I
> can’t tolerate delays of > 5 minutes", etc)
> * ContainerRouters can only pull from async queue when
> * increasing the number of pending activations won’t exceed some
> threshold (prevent excessive load of async on ContainerRouters)
> * ContainerManager is not overloaded (can still create containers,
> or has some configurable way to indicate the cluster is healthy enough to
> cope with extra processing)
>
> We could of course make this configurable so that operators can choose to:
> * treat async/sync activations the same for sync/async (the overloaded
> system fails when either ContainerManager or ContainerRouters are max
> capacity)
> * treat async/sync with preference for:
> * sync - where async is buffered for unknown period before
> processing, incoming sync traffic (or lack of)
> * async - where sync is sent to the queue, to be processed in
> order of receipt interleaved with async traffic (similar to today, I 
think)
>
> I think the impact here (aside from technical) is the timing difference if
> we introduce latency in side affects based on the activation being sync vs
> async.
>
> I’m also not sure prioritizing message processing between sync/async
> internally in ContainerRouter is better than just have some dedicated
> ContainerRouters that receive all async activations, and others that
> receive all sync activations, but the end result is the same, I think.
>
>
> > On Aug 19, 2018, at 4:29 AM, Markus Thömmes 
> wrote:
> >
> > Hi Tyson, Carlos,
> >
> > FWIW I should change that to no longer say "Kafka" but "buffer" or
> "message
> 

Re: Proposal on a future architecture of OpenWhisk

2018-08-21 Thread Carlos Santana
That was the idea I was proposing: to have 2 buckets, one for sync invokes
(request/response) overflow, and one for async invokes.
Routers will steal/pull work from the buckets; it can be a single set of
routers (same code), or they can be deployed in two groups for sharding and
traffic/risk isolation.

The Routers dealing with async will pull from the bucket/bus as they have
capacity, so there is never an overflow for async invokes, just a delay; if
you want async invokes to be picked up faster, then add more Routers and
backend container worker resources (i.e. kube nodes, VMs, physical hosts).
Bursts will just go to the queue to be eventually processed, but never
throttled or dropped.

The Routers dealing with sync will push to overflow bucket as they can't
handle it at the moment and another Router or maybe itself will pull from
overflow bucket.
In practice, overflow should be a rare case if the system is over
provisioned; it may happen during some spikes/bursts for sync.
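
A rough sketch of the two buckets (illustrative only):

    import java.util.concurrent.ConcurrentLinkedQueue

    object Buckets {
      val syncOverflow = new ConcurrentLinkedQueue[String]() // activation ids
      val asyncBuffer  = new ConcurrentLinkedQueue[String]()

      // Async is pulled only when a Router has spare capacity: bursts are
      // delayed, never throttled or dropped.
      def pullAsync(spareCapacity: Boolean): Option[String] =
        if (spareCapacity) Option(asyncBuffer.poll()) else None

      // Sync goes to overflow only when the receiving Router cannot serve
      // it right now; it (or another Router) pulls it back out later.
      def overflowSync(activationId: String): Unit =
        syncOverflow.offer(activationId)
    }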

-- Carlos

On Tue, Aug 21, 2018 at 1:37 PM Tyson Norris 
wrote:

> > Tracking these metrics consistently will introduce the same problem
> as
> > precisely tracking throttling numbers across multiple controllers, I
> think,
> > where either there is delay introduced to use remote data, or
> eventual
> > consistency will introduce inaccurate data.
> >
>
> If you're talking about limit enforcement, you're right! Regarding the
> concurrency on each container though, we are able to accurately track
> that
> and we need to be able to make sure that actual concurrency is always
> <= C.
>
>
> >
> > I’m interested to know if this accuracy is important as long as
> actual
> > concurrency <= C?
> >
>
> I don't think it is as much, no. But how do you keep <= C if you don't
> accurately track?
>
> Maybe I should say that while we cannot accurately track, we can still
> guarantee <= C, we just cannot guarantee maximizing concurrency up to C.
>
> Since the HTTP requests are done via futures in proxy, the messaging
> between pool and proxy doesn't have an accurate way to get exactly C
> requests in flight, but can prevent ever sending > C messages that cause
> the HTTP requests. The options for this are:
> - track in flight requests in the pool; passing C will cause more
> containers to be used, but probably the container will always only have < C
> in flight.
> - track in flight requests in the proxy; passing C will cause the message
> in proxy to be stashed/delayed until some HTTP requests are completed, and
> if the >C state remains, the pool will eventually learn this state and
> cause more containers to be used.
>
> (current impl in the PR does the latter)
>
> Thanks
> Tyson
>
>
>


Re: Proposal on a future architecture of OpenWhisk

2018-08-21 Thread Tyson Norris
> Tracking these metrics consistently will introduce the same problem as
> precisely tracking throttling numbers across multiple controllers, I 
think,
> where either there is delay introduced to use remote data, or eventual
> consistency will introduce inaccurate data.
>

If you're talking about limit enforcement, you're right! Regarding the
concurrency on each container though, we are able to accurately track that
and we need to be able to make sure that actual concurrency is always <= C.


>
> I’m interested to know if this accuracy is important as long as actual
> concurrency <= C?
>

I don't think it is as much, no. But how do you keep <= C if you don't
accurately track?

Maybe I should say that while we cannot accurately track, we can still 
guarantee <= C, we just cannot guarantee maximizing concurrency up to C.

Since the HTTP requests are done via futures in proxy, the messaging between 
pool and proxy doesn't have an accurate way to get exactly C requests in 
flight, but can prevent ever sending > C messages that cause the HTTP requests. 
The options for this are:
- track in flight requests in the pool; passing C will cause more containers to 
be used, but probably the container will always only have < C in flight.
- track in flight requests in the proxy; passing C will cause the message in 
proxy to be stashed/delayed until some HTTP requests are completed, and if the 
>C state remains, the pool will eventually learn this state and cause more 
containers to be used.

(current impl in the PR does the latter) 
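
A minimal sketch of that second option (illustrative, not the actual PR
code; assumes classic Akka actors and models an activation as a String):

    import akka.actor.{Actor, Stash}
    import scala.concurrent.Future

    object ContainerProxy { case object Done }

    class ContainerProxy(maxConcurrency: Int) extends Actor with Stash {
      import ContainerProxy.Done
      import context.dispatcher
      private var inFlight = 0

      def receive: Receive = {
        case activation: String =>
          if (inFlight >= maxConcurrency) stash() // delay until a slot frees
          else {
            inFlight += 1
            // Fire the HTTP request; completion reports back to the actor.
            Future { /* HTTP call to the action container for `activation` */ }
              .onComplete(_ => self ! Done)
          }
        case Done =>
          inFlight -= 1
          unstashAll() // replay any stashed activations
      }
    }

This guarantees in-flight <= C per container without trying to maximize
concurrency up to C, matching the trade-off above.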

Thanks
Tyson
 



Re: Kafka and Proposal on a future architecture of OpenWhisk

2018-08-21 Thread Markus Thömmes
Hi Tyson,

if we take the concerns apart as I proposed above, timeouts should only
ever be triggered after a request is scheduled as you say, that is: As soon
as it's crossing the user-container mark. With the concern separation, it
is plausible that blocking invocations are never buffered anywhere, which
makes a lot of sense, because you cannot persist the open HTTP connection
to the client anyway.

To make the distinction clear: A blocking request can still be put onto the
work-stealing queue to be balanced between different ContainerRouters.

A blocking request though would never be written to a persistent buffer
that's used to be able to efficiently handle async invocations and
backpressuring them. That buffer should be entirely separate and could
possibly be placed outside of the execution system to make the distinction
more explicit. The execution system itself would then only deal with
request-response style invocations and asynchronous invocations are done by
having a separate queue and a consumer that creates HTTP requests to the
execution system.
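
Roughly, that separation could look like this (a sketch, assuming the
persistent buffer lives outside the execution system; an in-memory queue
stands in for it here):

    object FrontDoor {
      sealed trait Invocation
      final case class Blocking(action: String) extends Invocation
      final case class Async(action: String) extends Invocation

      private val asyncBuffer = scala.collection.mutable.Queue.empty[Async]

      def dispatch(inv: Invocation): Unit = inv match {
        // Blocking: never persisted; served warm or offered as a token on
        // the (transient) work-stealing queue.
        case b: Blocking => routeOrSteal(b)
        case a: Async    => asyncBuffer.enqueue(a)
      }

      // A separate consumer drains the buffer by making ordinary
      // request-response HTTP calls against the execution system.
      def drainOne(): Unit =
        if (asyncBuffer.nonEmpty) dispatch(Blocking(asyncBuffer.dequeue().action))

      private def routeOrSteal(b: Blocking): Unit = () // elided
    }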

Cheers,
Markus

On Mon, Aug 20, 2018 at 23:30, Tyson Norris wrote:

> Thanks for summarizing Markus.
>
> Yes this is confusing in context of current system, which stores in kafka,
> but does not wait indefinitely, since the timeout begins immediately
> So, I think the problem of buffering/queueing is: when does the timeout
> begin? If not everything is buffered the same, their timeout should not
> begin until processing begins.
>
> Maybe it would make sense to:
> * always buffer (indefinitely) to queue for async, never for sync
> * timeout for async not started till read from queue - which may be
> delayed from time of trigger or http request
> * this should also come with some system monitoring to indicate the queue
> processing is not keeping up with some configurable max delay threshold ("I
> can’t tolerate delays of > 5 minutes", etc)
> * ContainerRouters can only pull from async queue when
> * increasing the number of pending activations won’t exceed some
> threshold (prevent excessive load of async on ContainerRouters)
> * ContainerManager is not overloaded (can still create containers,
> or has some configurable way to indicate the cluster is healthy enough to
> cope with extra processing)
>
> We could of course make this configurable so that operators can choose to:
> * treat async/sync activations the same for sync/async (the overloaded
> system fails when either ContainerManager or ContainerRouters are max
> capacity)
> * treat async/sync with preference for:
> * sync - where async is buffered for unknown period before
> processing, incoming sync traffic (or lack of)
> * async - where sync is sent to the queue, to be processed in
> order of receipt interleaved with async traffic (similar to today, I think)
>
> I think the impact here (aside from technical) is the timing difference if
> we introduce latency in side effects based on the activation being sync vs
> async.
>
> I’m also not sure prioritizing message processing between sync/async
> internally in ContainerRouter is better than just have some dedicated
> ContainerRouters that receive all async activations, and others that
> receive all sync activations, but the end result is the same, I think.
>
>
> > On Aug 19, 2018, at 4:29 AM, Markus Thömmes 
> wrote:
> >
> > Hi Tyson, Carlos,
> >
> > FWIW I should change that to no longer say "Kafka" but "buffer" or
> "message
> > queue".
> >
> > I see two use-cases for a queue here:
> > 1. What you two are alluding to: Buffering asynchronous requests because
> of
> > a different notion of "latency sensitivity" if the system is in an
> overload
> > scenario.
> > 2. As a work-stealing type balancing layer between the ContainerRouters.
> If
> > we assume round-robin/least-connected (essentially random) scheduling
> > between ContainerRouters, we will get load discrepancies between them. To
> > smoothen those out, a ContainerRouter can put the work on a queue to be
> > stolen by a Router that actually has space for that work (for example:
> > Router1 requests a new container, puts the work on the queue while it
> waits
> > for that container, Router2 already has a free container and executes the
> > action by stealing it from the queue). This does have the added complexity
> > of breaking a streaming communication between User and Container (to
> > support essentially unbounded payloads). A nasty wrinkle that might
> render
> > this design alternative invalid! We could come up with something smarter
> > here, i.e. only putting a reference to the work on the queue and the
> > stealer connects to the initial owner directly which then streams the
> > payload through to the stealer, rather than persisting it somewhere.
> >
> > It is important to note, that in this design, blocking invokes could
> > potentially gain the ability to have unbounded entities, where
> > trigger/non-blocking invokes might need to be subject to a 

Re: Proposal on a future architecture of OpenWhisk

2018-08-21 Thread Markus Thömmes
Hi Tyson,

On Tue, Aug 21, 2018 at 00:20, Tyson Norris wrote:

>
>
> On Aug 19, 2018, at 3:59 AM, Markus Thömmes wrote:
>
> Hi Tyson,
>
> On Fri, Aug 17, 2018 at 23:45, Tyson Norris wrote:
>
>
> If the failover of the singleton is too long (I think it will be based on
> cluster size, oldest node becomes the singleton host iirc), I think we need
> to consider how containers can launch in the meantime. A first step might
> be to test out the singleton behavior in the cluster of various sizes.
>
>
> I agree this bit of design is crucial, a few thoughts:
> Pre-warm wouldn't help here, the ContainerRouters only know warm
> containers. Pre-warming is managed by the ContainerManager.
>
> Ah right
>
>
> Considering a fail-over scenario: We could consider sharing the state via
> EventSourcing. That is: All state lives inside of frequently snapshotted
> events and thus can be shared between multiple instances of the
> ContainerManager seamlessly. Alternatively, we could also think about only
> working on persisted state. That way, a cold-standby model could fly. We
> should make sure that the state is not "slightly stale" but rather both
> instances see the same state at any point in time. I believe on that
> cold-path of generating new containers, we can live with the extra-latency
> of persisting what we're doing as the path will still be dominated by the
> container creation latency.
>
> Wasn’t clear if you mean not using ClusterSingleton? To be clear in
> ClusterSingleton case there are 2 issues:
> - time it takes for akka ClusterSingletonManager to realize it needs to
> start a new actor
> - time it takes for the new actor to assume a usable state
>
> EventSourcing (or ext persistence) may help with the latter, but we will
> need to be sure the former is tolerable to start with.
> Here is an example test from akka source that may be useful (multi-jvm,
> but all local):
>
>
> https://github.com/akka/akka/blob/009214ae07708e8144a279e71d06c4a504907e31/akka-cluster-tools/src/multi-jvm/scala/akka/cluster/singleton/ClusterSingletonManagerChaosSpec.scala
>
> Some things to consider, that I don’t know details of:
> - will the size of cluster affect the singleton behavior in case of
> failure? (I think so, but not sure, and what extent); in the simple test
> above it takes ~6s for the replacement singleton to begin startup, but if
> we have 100s of nodes, I’m not sure how much time it will take. (I don’t
> think this should be hard to test, but I haven’t done it)
> - in case of hard crash, what is the singleton behavior? In graceful jvm
> termination, I know the cluster behavior is good, but there is always this
> question about how downing nodes will be handled. If this critical piece of
> the system relies on akka cluster functionality, we will need to make sure
> that the singleton can be reconstituted, both in case of graceful
> termination (restart/deployment events) and non-graceful termination (hard
> vm crash, hard container crash) . This is ignoring more complicated cases
> of extended network partitions, which will also have bad effects on many of
> the downstream systems.
>
>
> I don't think we need to be eager to consider akka-cluster to be set in
> stone here. The singleton in my mind doesn't need to be clustered at all.
> Say we have a fully shared state through persistence or event-sourcing and
> a hot-standby model, couldn't we implement the fallback through routing in
> front of the active/passive ContainerManager pair? Once one goes
> unreachable, fall back to the other.
>
>
>
> Yeah I would rather see the hot standby and deal with persistence. I don’t
> think akka ClusterSingleton is going to be fast enough in a high volume
> scenario.
> Either routing in front or ContainerRouters who observe the active
> (leader) status, we just have to determine that the status change is
> tolerably fast.
>
>
>
>
>
> Handover time as you say is crucial, but I'd say as it only impacts
> container creation, we could live with, let's say, 5 seconds of
> failover-downtime on this path? What's your experience been on singleton
> failover? How long did it take?
>
>
> Seconds in the simplest case, so I think we need to test it in a scaled
> case (100s of cluster nodes), as well as the hard crash case (where not
> downing the node may affect the cluster state).
>
>
>
>
> On Aug 16, 2018, at 11:01 AM, Tyson Norris wrote:
>
> A couple comments on singleton:
> - use of cluster singleton will introduce a new single point of failure
> - from time of singleton node failure, to singleton resurrection on a
> different instance, will be an outage from the point of view of any
> ContainerRouter that does not already have a warm+free container to service
> an activation

Re: Proposal on a future architecture of OpenWhisk

2018-08-20 Thread Tyson Norris


On Aug 19, 2018, at 3:59 AM, Markus Thömmes <markusthoem...@apache.org> wrote:

Hi Tyson,

On Fri, Aug 17, 2018 at 23:45, Tyson Norris <tnor...@adobe.com.invalid> wrote:


If the failover of the singleton is too long (I think it will be based on
cluster size, oldest node becomes the singleton host iirc), I think we need
to consider how containers can launch in the meantime. A first step might
be to test out the singleton behavior in the cluster of various sizes.


I agree this bit of design is crucial, a few thoughts:
Pre-warm wouldn't help here, the ContainerRouters only know warm
containers. Pre-warming is managed by the ContainerManager.

Ah right


Considering a fail-over scenario: We could consider sharing the state via
EventSourcing. That is: All state lives inside of frequently snapshotted
events and thus can be shared between multiple instances of the
ContainerManager seamlessly. Alternatively, we could also think about only
working on persisted state. That way, a cold-standby model could fly. We
should make sure that the state is not "slightly stale" but rather both
instances see the same state at any point in time. I believe on that
cold-path of generating new containers, we can live with the extra-latency
of persisting what we're doing as the path will still be dominated by the
container creation latency.

Wasn’t clear if you mean not using ClusterSingleton? To be clear in
ClusterSingleton case there are 2 issues:
- time it takes for akka ClusterSingletonManager to realize it needs to
start a new actor
- time it takes for the new actor to assume a usable state

EventSourcing (or ext persistence) may help with the latter, but we will
need to be sure the former is tolerable to start with.
Here is an example test from akka source that may be useful (multi-jvm,
but all local):

https://github.com/akka/akka/blob/009214ae07708e8144a279e71d06c4a504907e31/akka-cluster-tools/src/multi-jvm/scala/akka/cluster/singleton/ClusterSingletonManagerChaosSpec.scala

Some things to consider, that I don’t know details of:
- will the size of cluster affect the singleton behavior in case of
failure? (I think so, but not sure, and what extent); in the simple test
above it takes ~6s for the replacement singleton to begin startup, but if
we have 100s of nodes, I’m not sure how much time it will take. (I don’t
think this should be hard to test, but I haven’t done it)
- in case of hard crash, what is the singleton behavior? In graceful jvm
termination, I know the cluster behavior is good, but there is always this
question about how downing nodes will be handled. If this critical piece of
the system relies on akka cluster functionality, we will need to make sure
that the singleton can be reconstituted, both in case of graceful
termination (restart/deployment events) and non-graceful termination (hard
vm crash, hard container crash) . This is ignoring more complicated cases
of extended network partitions, which will also have bad effects on many of
the downstream systems.


I don't think we need to be eager to consider akka-cluster to be set in
stone here. The singleton in my mind doesn't need to be clustered at all.
Say we have a fully shared state through persistence or event-sourcing and
a hot-standby model, couldn't we implement the fallback through routing in
front of the active/passive ContainerManager pair? Once one goes
unreachable, fall back to the other.



Yeah I would rather see the hot standby and deal with persistence. I don’t 
think akka ClusterSingleton is going to be fast enough in a high volume 
scenario.
Either routing in front or ContainerRouters who observe the active (leader) 
status, we just have to determine that the status change is tolerably fast.





Handover time as you say is crucial, but I'd say as it only impacts
container creation, we could live with, let's say, 5 seconds of
failover-downtime on this path? What's your experience been on singleton
failover? How long did it take?


Seconds in the simplest case, so I think we need to test it in a scaled
case (100s of cluster nodes), as well as the hard crash case (where not
downing the node may affect the cluster state).




On Aug 16, 2018, at 11:01 AM, Tyson Norris <tnor...@adobe.com.INVALID> wrote:

A couple comments on singleton:
- use of cluster singleton will introduce a new single point of failure
- from time of singleton node failure, to singleton resurrection on a
different instance, will be an outage from the point of view of any
ContainerRouter that does not already have a warm+free container to service
an activation
- resurrecting the singleton will require transferring or rebuilding the
state when recovery occurs

Re: Kafka and Proposal on a future architecture of OpenWhisk

2018-08-20 Thread Tyson Norris
Thanks for summarizing, Markus.

Yes, this is confusing in the context of the current system, which stores
activations in kafka but does not wait indefinitely, since the timeout begins
immediately.
So I think the problem with buffering/queueing is: when does the timeout begin?
If not everything is buffered the same way, the timeout should not begin until
processing begins.

Maybe it would make sense to (sketched below):
* always buffer (indefinitely) to a queue for async, never for sync
* the timeout for async does not start until the activation is read from the
queue - which may be delayed from the time of the trigger or http request
* this should also come with some system monitoring to indicate that queue
processing is not keeping up with some configurable max delay threshold ("I
can’t tolerate delays of > 5 minutes", etc)
* ContainerRouters can only pull from the async queue when
  * increasing the number of pending activations won’t exceed some
    threshold (prevent excessive async load on ContainerRouters)
  * the ContainerManager is not overloaded (can still create containers, or
    has some configurable way to indicate the cluster is healthy enough to
    cope with extra processing)

We could of course make this configurable so that operators can choose to:
* treat async and sync activations the same (the overloaded system
fails when either ContainerManager or ContainerRouters are at max capacity)
* treat async/sync with preference for:
  * sync - where async is buffered for an unknown period before processing,
    depending on incoming sync traffic (or the lack of it)
  * async - where sync is sent to the queue, to be processed in order of
    receipt interleaved with async traffic (similar to today, I think)

I think the impact here (aside from technical) is the timing difference if we
introduce latency in side effects based on the activation being sync vs async.

I’m also not sure that prioritizing message processing between sync/async
internally in the ContainerRouter is better than just having some dedicated
ContainerRouters that receive all async activations, and others that receive
all sync activations, but the end result is the same, I think.
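
To make the first bullet set concrete, here is a minimal sketch of the
"timeout starts at dequeue" idea (all names and values are hypothetical, not
part of the proposal):

```scala
import java.time.Instant
import scala.concurrent.duration._

// Hypothetical envelope for a buffered async activation: the action's
// timeout is NOT computed from enqueuedAt, but fixed only when a
// ContainerRouter dequeues the activation and processing begins.
case class QueuedActivation(activationId: String, enqueuedAt: Instant)
case class RunningActivation(activationId: String, deadline: Deadline)

val maxQueueDelay = 5.minutes   // operator-configured alert threshold
val actionTimeout = 60.seconds  // the action's normal timeout

def startProcessing(a: QueuedActivation): RunningActivation = {
  val queueDelay = (Instant.now.toEpochMilli - a.enqueuedAt.toEpochMilli).millis
  // monitoring hook: queue processing is not keeping up
  if (queueDelay > maxQueueDelay)
    println(s"ALERT: async queue lagging by ${queueDelay.toSeconds}s")
  RunningActivation(a.activationId, actionTimeout.fromNow) // clock starts here
}
```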


> On Aug 19, 2018, at 4:29 AM, Markus Thömmes  wrote:
> 
> Hi Tyson, Carlos,
> 
> FWIW I should change that to no longer say "Kafka" but "buffer" or "message
> queue".
> 
> I see two use-cases for a queue here:
> 1. What you two are alluding to: Buffering asynchronous requests because of
> a different notion of "latency sensitivity" if the system is in an overload
> scenario.
> 2. As a work-stealing type balancing layer between the ContainerRouters. If
> we assume round-robin/least-connected (essentially random) scheduling
> between ContainerRouters, we will get load discrepancies between them. To
> smooth those out, a ContainerRouter can put the work on a queue to be
> stolen by a Router that actually has space for that work (for example:
> Router1 requests a new container, puts the work on the queue while it waits
> for that container, Router2 already has a free container and executes the
> action by stealing it from the queue). This does have the added complexity
> of breaking a streaming communication between User and Container (to
> support essentially unbounded payloads). A nasty wrinkle that might render
> this design alternative invalid! We could come up with something smarter
> here, i.e. only putting a reference to the work on the queue and the
> stealer connects to the initial owner directly which then streams the
> payload through to the stealer, rather than persisting it somewhere.
> 
> It is important to note, that in this design, blocking invokes could
> potentially gain the ability to have unbounded entities, where
> trigger/non-blocking invokes might need to be subject to a bound here to be
> able to support eventual execution efficiently.
> 
> Personally, I'm much more drawn to the work-stealing case. It implies a
> wholly different notion of using the queue though and doesn't have much to
> do with the way we use it today, which might be confusing. It could also
> well be the case that work-stealing type algorithms are easier to back on
> a proper MQ vs. trying to make it work on Kafka.
> 
> It might also be important to note that those two use-cases might require
> different technologies (buffering vs. queue-backend for work-stealing) and
> could well be separated in the design as well. For instance, buffering
> trigger fires etc. does not necessarily need to be done on the execution
> layer but could instead be pushed to another layer. Having the notion of
> "async" vs "sync" in the execution layer could be beneficial for
> loadbalancing itself though. Something worth exploring imho.
> 
> Sorry for the wall of text, I hope this clarifies things!
> 
> Cheers,
> Markus
> 
> On Sat, Aug 18, 2018 at 02:36, Carlos Santana <csantan...@gmail.com> wrote:
> 
> >> triggers get responded to right away (202) with an activation id and then
> >> sent to the queue to be processed async, same as async action invokes.
> >>
> >> I think we would keep the same contract as today for this type of
> >> activations that are eventually processed, different from blocking
> >> invokes, including Actions where the http client holds a connection
> >> waiting for the result back.

Re: Proposal on a future architecture of OpenWhisk

2018-08-20 Thread TzuChiao Yeh
Yes, exactly.

Sorry if my poor English is bothering you :(, I'll try my best to correct the
text. I don't have an exact model in mind, just sharing some thoughts that
might be helpful:

As you said before, there are some pre-conditions for the scheduling decision:
unbounded/bounded system, fair/unfair scheduling etc.

For an unbounded system, providers may not care that much about the problem of
over-estimation; a resource-bounded system, on the contrary, cares about
"overall throughput", keeping bounded resource utilization stable, and
potentially making fair scheduling decisions: "paying penalties as you go
more". Therefore, the following mechanism is based on the assumption of a
bounded system.

I read a scholarly paper quite relevant to this, but forget some details and
will re-read it later. The basic idea is splitting the queue into a warm queue
and a cold-start queue, and adding a delay (penalty) on pulling from the
cold-start queue. In the context of OW:

1. ContainerRouters duplicate and queue the activation (reference) into the
warm and cold-start queues.
2. A ContainerRouter pulls the activation (reference) from the warm queue once
a container is available again, and drops the activation (reference) from the
cold-start queue.
3. The ContainerManager pulls the activation (reference) as a creation request
from the cold-start queue with an "incremental delay".
4. Continuing (3), the ContainerManager doesn't pull activations (references)
from the warm queue. If the activation (reference) is stolen via a
ContainerRouter during creation, an "over-estimate" has occurred.

I believe scheduling in the serverless model should take into account how many
queued events are associated with available resource slots, and even how many
slots we have already allocated for a specific action, namespace, etc. in a
bounded system. The critical point is how we set the incremental delay, but I
think the ContainerManager potentially has enough information to make a
smarter decision across these metrics. In addition, since this is not a
critical path, we can afford a slightly higher latency here for better system
throughput.

I.e. an intuitive approach: *NextPollDelay = DelayFactor * IncrementalFactor *
CurrentAllocatedSlotsRatio * NumOfOverEstimate / QueuedEvents*
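
A direct transcription of that formula as a sketch (the field names mirror the
factors above; the zero-guards and units are my assumption):

```scala
import scala.concurrent.duration._

// Inputs named after the factors in the formula above; all hypothetical.
case class PollStats(
  delayFactor: Double,         // user-configured; 0 disables the penalty
  incrementalFactor: Double,   // grows while cold-start pulls keep happening
  allocatedSlotsRatio: Double, // slots already allocated / total slots
  numOfOverEstimate: Int,      // creations whose work got stolen meanwhile
  queuedEvents: Int)           // events currently waiting in the queue

def nextPollDelay(s: PollStats): FiniteDuration =
  if (s.delayFactor == 0 || s.queuedEvents == 0) Duration.Zero
  else (s.delayFactor * s.incrementalFactor * s.allocatedSlotsRatio *
        s.numOfOverEstimate / s.queuedEvents).millis
```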

And we can let users configure the delay factor, i.e. 0 for no poll delay in a
system that doesn't really care about this (so we can have a unified model for
either bounded or unbounded systems), or customize how much penalty they would
like to pay if a burst occurs.

This is quite straightforward and may have plenty of problems, I think, i.e.
1. serverless workloads have uncertain elapsed times, 2. latency in the OW
system for acquiring information and making decisions, 3. message queue
operation latency, 4. it gets more complex if the pre-warmed model and
priorities join the scheduling, 5. will this break the serverless pricing
model? 6. ...

I think there's no significant change to the big picture of the future
architecture and this should not stop us from going forward now. If folks take
more interest in the problem of over-estimation, we can work out a proper
solution after having more detail on the future architecture, since
throttling already helps us avoid this situation.

Thanks!

On Mon, Aug 20, 2018 at 4:03 PM Markus Thömmes 
wrote:

> On Sun, Aug 19, 2018 at 18:59, TzuChiao Yeh <su3g4284zo...@gmail.com> wrote:
>
> > On Sun, Aug 19, 2018 at 7:13 PM Markus Thömmes <markusthoem...@apache.org> wrote:
> >
> > > Hi Tzu-Chiao,
> > >
> > > On Sat, Aug 18, 2018 at 06:56, TzuChiao Yeh <su3g4284zo...@gmail.com> wrote:
> > >
> > > > Hi Markus,
> > > >
> > > > Nice thoughts on separating the logic in this revision! I'm not sure
> > > > whether this question has already been clarified, sorry if it's a
> > > > duplicate.
> > > >
> > > > Same question on cluster singleton:
> > > >
> > > > I think there will be two possibilities on container deletion: 1.
> > > > ContainerRouter removes it (when error or idle-state) 2.
> > ContainerManager
> > > > decides to remove it (i.e. clear space for new creation).
> > > >
> > > > For case 2, how do we ensure the safe deletion in ContainerManager?
> > > > Consider if there's still a similar model on busy/free/prewarmed
> pool,
> > it
> > > > might require additional states related to containers from busy to
> free
> > > > state, then we can safely remove it or reject if nothing found
> (system
> > > > overloaded).
> > > >
> > > > By paused state or other states/message? There might be some
> trade-offs
> > > on
> > > > granularity (time-slice in scheduling) and performance bottleneck on
> > > > ClusterSingleton.
> > > >
> >
> > I'm not sure if I quite got the point, but here's an attempt on an
> > > explanation:
> > >
> > > Yes, Container removal in case 2 is triggered from the
> ContainerManager.
> > To
> > > be able to safely remove it, it requests all ContainerRouters owning
> that
> > > container to stop serving it and hand it back. Once it's been handed
> > back,
> > > the ContainerManager can safely delete it.

Re: Proposal on a future architecture of OpenWhisk

2018-08-20 Thread Markus Thömmes
On Sun, Aug 19, 2018 at 18:59, TzuChiao Yeh <su3g4284zo...@gmail.com> wrote:

> On Sun, Aug 19, 2018 at 7:13 PM Markus Thömmes 
> wrote:
>
> > Hi Tzu-Chiao,
> >
> > On Sat, Aug 18, 2018 at 06:56, TzuChiao Yeh <su3g4284zo...@gmail.com> wrote:
> >
> > > Hi Markus,
> > >
> > > Nice thoughts on separating the logic in this revision! I'm not sure
> > > whether this question has already been clarified, sorry if it's a duplicate.
> > >
> > > Same question on cluster singleton:
> > >
> > > I think there will be two possibilities on container deletion: 1.
> > > ContainerRouter removes it (when error or idle-state) 2.
> ContainerManager
> > > decides to remove it (i.e. clear space for new creation).
> > >
> > > For case 2, how do we ensure the safe deletion in ContainerManager?
> > > Consider if there's still a similar model on busy/free/prewarmed pool,
> it
> > > might require additional states related to containers from busy to free
> > > state, then we can safely remove it or reject if nothing found (system
> > > overloaded).
> > >
> > > By paused state or other states/message? There might be some trade-offs
> > on
> > > granularity (time-slice in scheduling) and performance bottleneck on
> > > ClusterSingleton.
> > >
>
> I'm not sure if I quite got the point, but here's an attempt on an
> > explanation:
> >
> > Yes, Container removal in case 2 is triggered from the ContainerManager.
> To
> > be able to safely remove it, it requests all ContainerRouters owning that
> > container to stop serving it and hand it back. Once it's been handed
> back,
> > the ContainerManager can safely delete it. The contract should also say:
> A
> > container must be handed back in unpaused state, so it can be deleted
> > safely. Since the ContainerRouters handle pause/unpause, they'll need to
> > stop serving the container, unpause it, remove it from their state and
> > acknowledge to the ContainerManager that they handed it back.
> >
>
> Thank you, it's clear to me.
>
>
> > There is an open question on when to consider a system to be in overflow
> > state, or rather: How to handle the edge-situation. If you cannot
> generate
> > more containers, we need to decide whether we remove another container
> (the
> > case you're describing) or if we call it quits and say "503, overloaded,
> go
> > away for now". The logic deciding this is up for discussion as well. The
> > heuristic could take into account how many resources in the whole system
> > you already own, how many resources do others own and if we want to
> decide
> > to share those fairly or not-fairly. Note that this is also very much
> > related to being able to scale the resources up in themselves (to be able
> > to generate new containers). If we assume a bounded system though, yes,
> > we'll need to find a strategy on how to handle this case. I believe with
> > the state the ContainerManager has, it can provide a more eloquent answer
> > to that question than what we can do today (nothing really, we just keep
> on
> > churning through containers).
> >
>
> I agree. An additional problem is that in the case of burst requests, the
> ContainerManager will "over-estimate" container allocation, whether
> work-stealing between ContainerRouters has been enabled or not. For a
> bounded system, we had better handle these carefully to avoid frequent
> creation/deletion. I'm wondering if sharing the message queue with the
> ContainerManager (since it's not a critical path) or any mechanism for
> checking queue size (i.e. checking kafka lags) could possibly eliminate
> this? However, this may only happen for short-running tasks, and
> throttling is already helpful.
>

Are you saying: it will over-estimate container allocation because it will
create a container for each request as requests arrive (if there are no
containers around currently), while the actual number of containers needed
might be lower for very short-running use-cases where requests arrive in
short bursts?

If so: I agree, I don't see how any system can possibly solve this without
taking the estimated runtime of each request into account though. Can you
elaborate on your thoughts on checking queue-size etc.?


>
>
> > Does that answer the question?
>
>
> > >
> > > Thanks!
> > >
> > > Tzu-Chiao
> > >
> > > On Sat, Aug 18, 2018 at 5:55 AM Tyson Norris
> > > wrote:
> > >
> > > > Ugh my reply formatting got removed!!! Trying this again with some >>
> > > >
> > > > On Aug 17, 2018, at 2:45 PM, Tyson Norris wrote:
> > > >
> > > >
> > > > If the failover of the singleton is too long (I think it will be
> based
> > on
> > > > cluster size, oldest node becomes the singleton host iirc), I think
> we
> > > need
> > > > to consider how containers can launch in the meantime. A first step
> > might
> > > > be to test out the singleton behavior in the cluster of various
> sizes.
> > > >
> > > >
> > > > I agree this bit of design is crucial, a few thoughts:
> > > > Pre-warm wouldn't help here, the ContainerRouters only know warm
> > > > containers. Pre-warming is managed by the ContainerManager.

Re: Kafka and Proposal on a future architecture of OpenWhisk

2018-08-20 Thread Markus Thömmes
I believe we should keep this as general as long as possible. We should
define the characteristics we need for each path rather than deciding on a
certain technology early on.

On Sun, Aug 19, 2018 at 16:07, Dascalita Dragos <ddrag...@gmail.com> wrote:

> “... FWIW I should change that to no longer say "Kafka" but "buffer" or
> "message
> queue"...”
> +1. One idea could be to use Akka Streams and let the OW operator make a
> decision on using Kafka with Akka Streams, or not [1]. This would make OW
> deployment easier, Kafka becoming optional, while opening the door for
> other connectors like AWS Kinesis, Azure Event Hub, and others (see the
> link at [1] for a more complete list of connectors )
>
> [1] - https://developer.lightbend.com/docs/alpakka/current/
> On Sun, Aug 19, 2018 at 7:30 AM Markus Thömmes 
> wrote:
>
> > Hi Tyson, Carlos,
> >
> > FWIW I should change that to no longer say "Kafka" but "buffer" or
> "message
> > queue".
> >
> > I see two use-cases for a queue here:
> > 1. What you two are alluding to: Buffering asynchronous requests because
> of
> > a different notion of "latency sensitivity" if the system is in an
> overload
> > scenario.
> > 2. As a work-stealing type balancing layer between the ContainerRouters.
> If
> > we assume round-robin/least-connected (essentially random) scheduling
> > between ContainerRouters, we will get load discrepancies between them. To
> > smooth those out, a ContainerRouter can put the work on a queue to be
> > stolen by a Router that actually has space for that work (for example:
> > Router1 requests a new container, puts the work on the queue while it
> > waits for that container, Router2 already has a free container and
> > executes the action by stealing it from the queue). This does have the
> > added complexity
> > of breaking a streaming communication between User and Container (to
> > support essentially unbounded payloads). A nasty wrinkle that might
> render
> > this design alternative invalid! We could come up with something smarter
> > here, i.e. only putting a reference to the work on the queue and the
> > stealer connects to the initial owner directly which then streams the
> > payload through to the stealer, rather than persisting it somewhere.
> >
> > It is important to note, that in this design, blocking invokes could
> > potentially gain the ability to have unbounded entities, where
> > trigger/non-blocking invokes might need to be subject to a bound here to
> be
> > able to support eventual execution efficiently.
> >
> > Personally, I'm much more drawn to the work-stealing case. It implies a
> > wholly different notion of using the queue though and doesn't have much to
> > do with the way we use it today, which might be confusing. It could also
> > well be the case that work-stealing type algorithms are easier to back on
> > a proper MQ vs. trying to make it work on Kafka.
> >
> > It might also be important to note that those two use-cases might require
> > different technologies (buffering vs. queue-backend for work-stealing) and
> > could well be separated in the design as well. For instance, buffering
> > trigger fires etc. does not necessarily need to be done on the execution
> > layer but could instead be pushed to another layer. Having the notion of
> > "async" vs "sync" in the execution layer could be beneficial for
> > loadbalancing itself though. Something worth exploring imho.
> >
> > Sorry for the wall of text, I hope this clarifies things!
> >
> > Cheers,
> > Markus
> >
> > On Sat, Aug 18, 2018 at 02:36, Carlos Santana <csantan...@gmail.com> wrote:
> >
> > > triggers get responded to right away (202) with an activation id and then
> > > sent to the queue to be processed async, same as async action invokes.
> > >
> > > I think we would keep the same contract as today for this type of
> > > activations that are eventually processed, different from blocking
> > > invokes, including Actions where the http client holds a connection
> > > waiting for the result back.
> > >
> > > - Carlos Santana
> > > @csantanapr
> > >
> > > > On Aug 17, 2018, at 6:14 PM, Tyson Norris  >
> > > wrote:
> > > >
> > > > Hi -
> > > > Separate thread regarding the proposal: what is considered for
> routing
> > > activations as overload and destined for kafka?
> > > >
> > > > In general, if kafka is not on the blocking activation path, why
> would
> > > it be used at all, if the timeouts and processing expectations of
> > blocking
> > > and non-blocking are the same?
> > > >
> > > > One case I can imagine: triggers + non-blocking invokes, but only in
> > the
> > > case where those have some different timeout characteristics. e.g. if a
> > > trigger fires an action, is there any case where the activation should
> be
> > > buffered to kafka if it will timeout same as a blocking activation?
> > > >
> > > > Sorry if I’m missing something obvious.
> > > >
> > > > Thanks
> > > > Tyson
> > > >
> > > >
> > >
> >
>


Re: Proposal on a future architecture of OpenWhisk

2018-08-19 Thread TzuChiao Yeh
On Sun, Aug 19, 2018 at 7:13 PM Markus Thömmes 
wrote:

> Hi Tzu-Chiao,
>
> On Sat, Aug 18, 2018 at 06:56, TzuChiao Yeh <su3g4284zo...@gmail.com> wrote:
>
> > Hi Markus,
> >
> > Nice thoughts on separating the logic in this revision! I'm not sure
> > whether this question has already been clarified, sorry if it's a duplicate.
> >
> > Same question on cluster singleton:
> >
> > I think there will be two possibilities on container deletion: 1.
> > ContainerRouter removes it (when error or idle-state) 2. ContainerManager
> > decides to remove it (i.e. clear space for new creation).
> >
> > For case 2, how do we ensure the safe deletion in ContainerManager?
> > Consider if there's still a similar model on busy/free/prewarmed pool, it
> > might require additional states related to containers from busy to free
> > state, then we can safely remove it or reject if nothing found (system
> > overloaded).
> >
> > By paused state or other states/message? There might be some trade-offs
> on
> > granularity (time-slice in scheduling) and performance bottleneck on
> > ClusterSingleton.
> >

I'm not sure if I quite got the point, but here's an attempt on an
> explanation:
>
> Yes, Container removal in case 2 is triggered from the ContainerManager. To
> be able to safely remove it, it requests all ContainerRouters owning that
> container to stop serving it and hand it back. Once it's been handed back,
> the ContainerManager can safely delete it. The contract should also say: A
> container must be handed back in unpaused state, so it can be deleted
> safely. Since the ContainerRouters handle pause/unpause, they'll need to
> stop serving the container, unpause it, remove it from their state and
> acknowledge to the ContainerManager that they handed it back.
>

Thank you, it's clear to me.


> There is an open question on when to consider a system to be in overflow
> state, or rather: How to handle the edge-situation. If you cannot generate
> more containers, we need to decide whether we remove another container (the
> case you're describing) or if we call it quits and say "503, overloaded, go
> away for now". The logic deciding this is up for discussion as well. The
> heuristic could take into account how many resources in the whole system
> you already own, how many resources do others own and if we want to decide
> to share those fairly or not-fairly. Note that this is also very much
> related to being able to scale the resources up in themselves (to be able
> to generate new containers). If we assume a bounded system though, yes,
> we'll need to find a strategy on how to handle this case. I believe with
> the state the ContainerManager has, it can provide a more eloquent answer
> to that question than what we can do today (nothing really, we just keep on
> churning through containers).
>

I agree. An additional problem is that in the case of burst requests, the
ContainerManager will "over-estimate" container allocation, whether
work-stealing between ContainerRouters has been enabled or not. For a
bounded system, we had better handle these carefully to avoid frequent
creation/deletion. I'm wondering if sharing the message queue with the
ContainerManager (since it's not a critical path) or any mechanism for
checking queue size (i.e. checking kafka lags) could possibly eliminate
this? However, this may only happen for short-running tasks, and
throttling is already helpful.
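
For reference, a minimal sketch of the "checking kafka lags" idea using the
stock Kafka AdminClient/Consumer APIs (the group name and summing over all
partitions are assumptions):

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.admin.AdminClient
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.ByteArrayDeserializer

// Computes the total lag of a consumer group: queued-but-unprocessed events.
// A growing lag could signal the ContainerManager to slow down container
// creation instead of over-allocating on a burst.
def totalLag(bootstrap: String, group: String): Long = {
  val props = new Properties()
  props.put("bootstrap.servers", bootstrap)
  val admin = AdminClient.create(props)
  val committed = admin.listConsumerGroupOffsets(group)
    .partitionsToOffsetAndMetadata().get().asScala

  props.put("group.id", s"$group-lag-checker")
  val consumer =
    new KafkaConsumer(props, new ByteArrayDeserializer, new ByteArrayDeserializer)
  val ends = consumer.endOffsets(committed.keys.asJavaCollection).asScala

  // lag per partition = latest offset - last committed offset
  val lag = committed.map { case (tp, off) => ends(tp) - off.offset() }.sum
  admin.close(); consumer.close()
  lag
}
```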


> Does that answer the question?


> >
> > Thanks!
> >
> > Tzu-Chiao
> >
> > On Sat, Aug 18, 2018 at 5:55 AM Tyson Norris 
> > wrote:
> >
> > > Ugh my reply formatting got removed!!! Trying this again with some >>
> > >
> > > On Aug 17, 2018, at 2:45 PM, Tyson Norris wrote:
> > >
> > >
> > > If the failover of the singleton is too long (I think it will be based
> on
> > > cluster size, oldest node becomes the singleton host iirc), I think we
> > need
> > > to consider how containers can launch in the meantime. A first step
> might
> > > be to test out the singleton behavior in the cluster of various sizes.
> > >
> > >
> > > I agree this bit of design is crucial, a few thoughts:
> > > Pre-warm wouldn't help here, the ContainerRouters only know warm
> > > containers. Pre-warming is managed by the ContainerManager.
> > >
> > >
> > > >> Ah right
> > >
> > >
> > >
> > > Considering a fail-over scenario: We could consider sharing the state
> via
> > > EventSourcing. That is: All state lives inside of frequently
> snapshotted
> > > events and thus can be shared between multiple instances of the
> > > ContainerManager seamlessly. Alternatively, we could also think about
> > only
> > > working on persisted state. That way, a cold-standby model could fly.
> We
> > > should make sure that the state is not "slightly stale" but rather both
> > > instances see the same state at any point in time. I believe on that
> > > cold-path of generating new containers, we can live with the
> > extra-latency
> > > of persisting what we're doing as the path will still be dominated by
> > > the container creation latency.

Re: Kafka and Proposal on a future architecture of OpenWhisk

2018-08-19 Thread Dascalita Dragos
“... FWIW I should change that to no longer say "Kafka" but "buffer" or
"message
queue"...”
+1. One idea could be to use Akka Streams and let the OW operator make a
decision on using Kafka with Akka Streams, or not [1]. This would make OW
deployment easier, Kafka becoming optional, while opening the door for
other connectors like AWS Kinesis, Azure Event Hub, and others (see the
link at [1] for a more complete list of connectors )

[1] - https://developer.lightbend.com/docs/alpakka/current/
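
A sketch of what such a seam could look like (trait and method names are
invented; the Kafka flavor would sit on Alpakka's akka-stream-kafka connector,
Kinesis/Event Hub on their respective Alpakka modules):

```scala
import akka.{Done, NotUsed}
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

// Hypothetical abstraction between OpenWhisk and the buffer technology.
// The execution layer only sees Sources and Sinks of raw activation bytes;
// whether Kafka, Kinesis or an in-memory queue sits behind them is an
// operator's deployment choice.
trait ActivationBuffer {
  // stream of buffered activation messages for a given action/topic
  def source(topic: String): Source[Array[Byte], NotUsed]
  // sink that persists activation messages into the buffer
  def sink(topic: String): Sink[Array[Byte], Future[Done]]
}
```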
On Sun, Aug 19, 2018 at 7:30 AM Markus Thömmes 
wrote:

> Hi Tyson, Carlos,
>
> FWIW I should change that to no longer say "Kafka" but "buffer" or "message
> queue".
>
> I see two use-cases for a queue here:
> 1. What you two are alluding to: Buffering asynchronous requests because of
> a different notion of "latency sensitivity" if the system is in an overload
> scenario.
> 2. As a work-stealing type balancing layer between the ContainerRouters. If
> we assume round-robin/least-connected (essentially random) scheduling
> between ContainerRouters, we will get load discrepancies between them. To
> > smooth those out, a ContainerRouter can put the work on a queue to be
> > stolen by a Router that actually has space for that work (for example:
> > Router1 requests a new container, puts the work on the queue while it waits
> > for that container, Router2 already has a free container and executes the
> > action by stealing it from the queue). This does have the added complexity
> of breaking a streaming communication between User and Container (to
> support essentially unbounded payloads). A nasty wrinkle that might render
> this design alternative invalid! We could come up with something smarter
> here, i.e. only putting a reference to the work on the queue and the
> stealer connects to the initial owner directly which then streams the
> payload through to the stealer, rather than persisting it somewhere.
>
> It is important to note, that in this design, blocking invokes could
> potentially gain the ability to have unbounded entities, where
> trigger/non-blocking invokes might need to be subject to a bound here to be
> able to support eventual execution efficiently.
>
> Personally, I'm much more drawn to the work-stealing case. It implies a
> wholly different notion of using the queue though and doesn't have much to
> do with the way we use it today, which might be confusing. It could also
> well be the case that work-stealing type algorithms are easier to back on
> a proper MQ vs. trying to make it work on Kafka.
>
> It might also be important to note that those two use-cases might require
> different technologies (buffering vs. queue-backend for work-stealing) and
> could well be separated in the design as well. For instance, buffering
> trigger fires etc. does not necessarily need to be done on the execution
> layer but could instead be pushed to another layer. Having the notion of
> "async" vs "sync" in the execution layer could be beneficial for
> loadbalancing itself though. Something worth exploring imho.
>
> Sorry for the wall of text, I hope this clarifies things!
>
> Cheers,
> Markus
>
> On Sat, Aug 18, 2018 at 02:36, Carlos Santana <csantan...@gmail.com> wrote:
>
> > triggers get responded to right away (202) with an activation id and then
> > sent to the queue to be processed async, same as async action invokes.
> >
> > I think we would keep the same contract as today for this type of
> > activations that are eventually processed, different from blocking invokes,
> > including Actions where the http client holds a connection waiting for the
> > result back.
> >
> > - Carlos Santana
> > @csantanapr
> >
> > > On Aug 17, 2018, at 6:14 PM, Tyson Norris 
> > wrote:
> > >
> > > Hi -
> > > Separate thread regarding the proposal: what is considered for routing
> > activations as overload and destined for kafka?
> > >
> > > In general, if kafka is not on the blocking activation path, why would
> > it be used at all, if the timeouts and processing expectations of
> blocking
> > and non-blocking are the same?
> > >
> > > One case I can imagine: triggers + non-blocking invokes, but only in
> the
> > case where those have some different timeout characteristics. e.g. if a
> > trigger fires an action, is there any case where the activation should be
> > buffered to kafka if it will timeout same as a blocking activation?
> > >
> > > Sorry if I’m missing something obvious.
> > >
> > > Thanks
> > > Tyson
> > >
> > >
> >
>


Re: Kafka and Proposal on a future architecture of OpenWhisk

2018-08-19 Thread Markus Thömmes
Hi Tyson, Carlos,

FWIW I should change that to no longer say "Kafka" but "buffer" or "message
queue".

I see two use-cases for a queue here:
1. What you two are alluding to: Buffering asynchronous requests because of
a different notion of "latency sensitivity" if the system is in an overload
scenario.
2. As a work-stealing type balancing layer between the ContainerRouters. If
we assume round-robin/least-connected (essentially random) scheduling
between ContainerRouters, we will get load discrepancies between them. To
smooth those out, a ContainerRouter can put the work on a queue to be
stolen by a Router that actually has space for that work (for example:
Router1 requests a new container, puts the work on the queue while it waits
for that container, Router2 already has a free container and executes the
action by stealing it from the queue). This does have the added complexity
of breaking a streaming communication between User and Container (to
support essentially unbounded payloads). A nasty wrinkle that might render
this design alternative invalid! We could come up with something smarter
here, i.e. only putting a reference to the work on the queue and the
stealer connects to the initial owner directly which then streams the
payload through to the stealer, rather than persisting it somewhere.
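
A minimal sketch of that reference-passing variant (message names are invented
for illustration):

```scala
import akka.actor.{Actor, ActorRef}

// Only a lightweight reference travels through the queue; the payload stays
// with the owning Router until a stealer claims the work.
case class WorkRef(activationId: String, action: String, owner: ActorRef)
case class StreamPayloadTo(stealer: ActorRef, activationId: String)

class StealingRouter extends Actor {
  def receive: Receive = {
    // pulled from the shared queue because we have a free warm container
    case ref: WorkRef =>
      // connect back to the owner, which streams the payload through to us,
      // preserving the streaming communication with the user
      ref.owner ! StreamPayloadTo(self, ref.activationId)
    // ... payload chunks would then arrive here and be piped to the container
  }
}
```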

It is important to note, that in this design, blocking invokes could
potentially gain the ability to have unbounded entities, where
trigger/non-blocking invokes might need to be subject to a bound here to be
able to support eventual execution efficiently.

Personally, I'm much more drawn to the work-stealing case. It implies a
wholly different notion of using the queue though and doesn't have much to
do with the way we use it today, which might be confusing. It could also
well be the case that work-stealing type algorithms are easier to back on
a proper MQ vs. trying to make it work on Kafka.

It might also be important to note that those two use-cases might require
different technologies (buffering vs. queue-backend for work-stealing) and
could well be separated in the design as well. For instance, buffering
trigger fires etc. does not necessarily need to be done on the execution
layer but could instead be pushed to another layer. Having the notion of
"async" vs "sync" in the execution layer could be beneficial for
loadbalancing itself though. Something worth exploring imho.

Sorry for the wall of text, I hope this clarifies things!

Cheers,
Markus

On Sat, Aug 18, 2018 at 02:36, Carlos Santana <csantan...@gmail.com> wrote:

> triggers get responded to right away (202) with an activation id and then
> sent to the queue to be processed async, same as async action invokes.
>
> I think we would keep the same contract as today for this type of activations
> that are eventually processed, different from blocking invokes, including
> Actions where the http client holds a connection waiting for the result back.
>
> - Carlos Santana
> @csantanapr
>
> > On Aug 17, 2018, at 6:14 PM, Tyson Norris 
> wrote:
> >
> > Hi -
> > Separate thread regarding the proposal: what is considered for routing
> activations as overload and destined for kafka?
> >
> > In general, if kafka is not on the blocking activation path, why would
> it be used at all, if the timeouts and processing expectations of blocking
> and non-blocking are the same?
> >
> > One case I can imagine: triggers + non-blocking invokes, but only in the
> case where those have some different timeout characteristics. e.g. if a
> trigger fires an action, is there any case where the activation should be
> buffered to kafka if it will timeout same as a blocking activation?
> >
> > Sorry if I’m missing something obvious.
> >
> > Thanks
> > Tyson
> >
> >
>


Re: Proposal on a future architecture of OpenWhisk

2018-08-19 Thread Markus Thömmes
Hi Tzu-Chiao,

On Sat, Aug 18, 2018 at 06:56, TzuChiao Yeh <su3g4284zo...@gmail.com> wrote:

> Hi Markus,
>
> Nice thoughts on separating the logic in this revision! I'm not sure
> whether this question has already been clarified, sorry if it's a duplicate.
>
> Same question on cluster singleton:
>
> I think there will be two possibilities on container deletion: 1.
> ContainerRouter removes it (when error or idle-state) 2. ContainerManager
> decides to remove it (i.e. clear space for new creation).
>
> For case 2, how do we ensure the safe deletion in ContainerManager?
> Consider if there's still a similar model on busy/free/prewarmed pool, it
> might require additional states related to containers from busy to free
> state, then we can safely remove it or reject if nothing found (system
> overloaded).
>
> By paused state or other states/message? There might be some trade-offs on
> granularity (time-slice in scheduling) and performance bottleneck on
> ClusterSingleton.
>

I'm not sure if I quite got the point, but here's an attempt on an
explanation:

Yes, Container removal in case 2 is triggered from the ContainerManager. To
be able to safely remove it, it requests all ContainerRouters owning that
container to stop serving it and hand it back. Once it's been handed back,
the ContainerManager can safely delete it. The contract should also say: A
container must be handed back in unpaused state, so it can be deleted
safely. Since the ContainerRouters handle pause/unpause, they'll need to
stop serving the container, unpause it, remove it from their state and
acknowledge to the ContainerManager that they handed it back.
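
A minimal sketch of that handshake (message names and bookkeeping are
invented; a real protocol would also need timeouts and retries):

```scala
import akka.actor.{Actor, ActorRef}

// ContainerManager-internal command: clear space by deleting a container.
case class DeleteContainer(containerId: String)
// Manager -> Router: stop serving the container, unpause it, drop it from
// your state, then acknowledge.
case class ReleaseContainer(containerId: String)
// Router -> Manager: container handed back in unpaused state.
case class ContainerReleased(containerId: String)

class ContainerManager(deleteFromPool: String => Unit) extends Actor {
  // containerId -> routers that still serve it (tracked by the manager)
  private var owners = Map.empty[String, Set[ActorRef]]

  def receive: Receive = {
    case DeleteContainer(id) =>
      owners.getOrElse(id, Set.empty).foreach(_ ! ReleaseContainer(id))
    case ContainerReleased(id) =>
      val remaining = owners.getOrElse(id, Set.empty) - sender()
      owners += id -> remaining
      if (remaining.isEmpty) deleteFromPool(id) // all acks in: safe to delete
  }
}
```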

There is an open question on when to consider a system to be in overflow
state, or rather: How to handle the edge-situation. If you cannot generate
more containers, we need to decide whether we remove another container (the
case you're describing) or if we call it quits and say "503, overloaded, go
away for now". The logic deciding this is up for discussion as well. The
heuristic could take into account how many resources in the whole system
you already own, how many resources do others own and if we want to decide
to share those fairly or not-fairly. Note that this is also very much
related to being able to scale the resources up in themselves (to be able
to generate new containers). If we assume a bounded system though, yes,
we'll need to find a strategy on how to handle this case. I believe with
the state the ContainerManager has, it can provide a more eloquent answer
to that question than what we can do today (nothing really, we just keep on
churning through containers).

Does that answer the question?


>
> Thanks!
>
> Tzu-Chiao
>
> On Sat, Aug 18, 2018 at 5:55 AM Tyson Norris 
> wrote:
>
> > Ugh my reply formatting got removed!!! Trying this again with some >>
> >
> > On Aug 17, 2018, at 2:45 PM, Tyson Norris wrote:
> >
> >
> > If the failover of the singleton is too long (I think it will be based on
> > cluster size, oldest node becomes the singleton host iirc), I think we
> need
> > to consider how containers can launch in the meantime. A first step might
> > be to test out the singleton behavior in the cluster of various sizes.
> >
> >
> > I agree this bit of design is crucial, a few thoughts:
> > Pre-warm wouldn't help here, the ContainerRouters only know warm
> > containers. Pre-warming is managed by the ContainerManager.
> >
> >
> > >> Ah right
> >
> >
> >
> > Considering a fail-over scenario: We could consider sharing the state via
> > EventSourcing. That is: All state lives inside of frequently snapshotted
> > events and thus can be shared between multiple instances of the
> > ContainerManager seamlessly. Alternatively, we could also think about
> only
> > working on persisted state. That way, a cold-standby model could fly. We
> > should make sure that the state is not "slightly stale" but rather both
> > instances see the same state at any point in time. I believe on that
> > cold-path of generating new containers, we can live with the
> extra-latency
> > of persisting what we're doing as the path will still be dominated by the
> > container creation latency.
> >
> >
> >
> > >> Wasn’t clear if you mean not using ClusterSingleton? To be clear in
> > ClusterSingleton case there are 2 issues:
> > - time it takes for akka ClusterSingletonManager to realize it needs to
> > start a new actor
> > - time it takes for the new actor to assume a usable state
> >
> > EventSourcing (or ext persistence) may help with the latter, but we will
> > need to be sure the former is tolerable to start with.
> > Here is an example test from akka source that may be useful (multi-jvm,
> > but all local):
> >
> > https://github.com/akka/akka/blob/009214ae07708e8144a279e71d06c4a504907e31/akka-cluster-tools/src/multi-jvm/scala/akka/cluster/singleton/ClusterSingletonManagerChaosSpec.scala
> 

Re: Proposal on a future architecture of OpenWhisk

2018-08-19 Thread Markus Thömmes
Hi Tyson,

On Fri, Aug 17, 2018 at 23:45, Tyson Norris wrote:

>
> If the failover of the singleton is too long (I think it will be based on
> cluster size, oldest node becomes the singleton host iirc), I think we need
> to consider how containers can launch in the meantime. A first step might
> be to test out the singleton behavior in the cluster of various sizes.
>
>
> I agree this bit of design is crucial, a few thoughts:
> Pre-warm wouldn't help here, the ContainerRouters only know warm
> containers. Pre-warming is managed by the ContainerManager.
>
> Ah right
>
>
> Considering a fail-over scenario: We could consider sharing the state via
> EventSourcing. That is: All state lives inside of frequently snapshotted
> events and thus can be shared between multiple instances of the
> ContainerManager seamlessly. Alternatively, we could also think about only
> working on persisted state. That way, a cold-standby model could fly. We
> should make sure that the state is not "slightly stale" but rather both
> instances see the same state at any point in time. I believe on that
> cold-path of generating new containers, we can live with the extra-latency
> of persisting what we're doing as the path will still be dominated by the
> container creation latency.
>
> Wasn’t clear if you mean not using ClusterSingleton? To be clear in
> ClusterSingleton case there are 2 issues:
> - time it takes for akka ClusterSingletonManager to realize it needs to
> start a new actor
> - time it takes for the new actor to assume a usable state
>
> EventSourcing (or ext persistence) may help with the latter, but we will
> need to be sure the former is tolerable to start with.
> Here is an example test from akka source that may be useful (multi-jvm,
> but all local):
>
> https://github.com/akka/akka/blob/009214ae07708e8144a279e71d06c4a504907e31/akka-cluster-tools/src/multi-jvm/scala/akka/cluster/singleton/ClusterSingletonManagerChaosSpec.scala
>
> Some things to consider, that I don’t know details of:
> - will the size of cluster affect the singleton behavior in case of
> failure? (I think so, but not sure, and what extent); in the simple test
> above it takes ~6s for the replacement singleton to begin startup, but if
> we have 100s of nodes, I’m not sure how much time it will take. (I don’t
> think this should be hard to test, but I haven’t done it)
> - in case of hard crash, what is the singleton behavior? In graceful jvm
> termination, I know the cluster behavior is good, but there is always this
> question about how downing nodes will be handled. If this critical piece of
> the system relies on akka cluster functionality, we will need to make sure
> that the singleton can be reconstituted, both in case of graceful
> termination (restart/deployment events) and non-graceful termination (hard
> vm crash, hard container crash) . This is ignoring more complicated cases
> of extended network partitions, which will also have bad effects on many of
> the downstream systems.
>

I don't think we need to be eager to consider akka-cluster to be set in
stone here. The singleton in my mind doesn't need to be clustered at all.
Say we have a fully shared state through persistence or event-sourcing and
a hot-standby model, couldn't we implement the fallback through routing in
front of the active/passive ContainerManager pair? Once one goes
unreachable, fall back to the other.
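
A minimal sketch of such a fallback (plain Scala futures; the endpoints and
single-retry policy are assumptions):

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Try the active ContainerManager first; on any failure fall back to the
// hot standby. Both see the same persisted/event-sourced state, so either
// one can serve the container-creation request.
def createContainers(send: String => Future[Unit]): Future[Unit] =
  send("http://container-manager-active:8080/containers").recoverWith {
    case _ => send("http://container-manager-standby:8080/containers")
  }
```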


>
>
>
> Handover time as you say is crucial, but I'd say as it only impacts
> container creation, we could live with, let's say, 5 seconds of
> failover-downtime on this path? What's your experience been on singleton
> failover? How long did it take?
>
>
> Seconds in the simplest case, so I think we need to test it in a scaled
> case (100s of cluster nodes), as well as the hard crash case (where not
> downing the node may affect the cluster state).
>
>
>
>
> On Aug 16, 2018, at 11:01 AM, Tyson Norris wrote:
>
> A couple comments on singleton:
> - use of cluster singleton will introduce a new single point of failure
> - from time of singleton node failure, to singleton resurrection on a
> different instance, will be an outage from the point of view of any
> ContainerRouter that does not already have a warm+free container to service
> an activation
> - resurrecting the singleton will require transferring or rebuilding the
> state when recovery occurs - in my experience this was tricky, and requires
> replicating the data (which will be slightly stale, but better than
> rebuilding from nothing); I don’t recall the handover delay (to transfer
> singleton to a new akka cluster node) when I tried last, but I think it was
> not as fast as I hoped it would be.
>
> I don’t have a great suggestion for the singleton failure case, but
> would like to consider this carefully, and discuss the ramifications (which
> may or may not be tolerable) before pursuing this particular aspect of the
> design.
>
>
> On prioritization:
> - if concurrency is 

Re: Proposal on a future architecture of OpenWhisk

2018-08-18 Thread David P Grove

[ Discussion about cluster singleton or not for the ContainerManager]

fwiw, I believe for Kubernetes we do not need to attempt to deal with fault
tolerance for the ContainerManager state ourselves.  We can use labels to
replicate all the persistent metadata for a container (prewarm or not, the
ContainerRouter it is assigned to) in the Kube objects representing the
pods in Kube's etcd metadata server.  If we need to restart a
ContainerManager, the new instance can come up "instantly" and start
servicing requests while recovering the state of the previous instance via
queries against etcd to discover the pre-existing containers it owned.

We'll need to validate the performance of this is acceptable (should be,
since it is just some asynchronous labeling operations when (a) the
container is created and (b) on the initial transition from stemcell to
warm), but it is going to be pretty simple to implement and makes good
usage of the underlying platform's capabilities.
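
For illustration, a sketch of the labeling side with the fabric8 Kubernetes
client (the label keys are invented and the builder DSL differs between
client versions):

```scala
import io.fabric8.kubernetes.client.DefaultKubernetesClient
import scala.collection.JavaConverters._

val client = new DefaultKubernetesClient()

// (a) on container creation / (b) on stemcell -> warm transition:
// asynchronously record the assignment as labels on the pod.
def labelPod(pod: String, router: String, prewarm: Boolean): Unit =
  client.pods().inNamespace("openwhisk").withName(pod)
    .edit().editMetadata()
    .addToLabels("openwhisk/router", router)
    .addToLabels("openwhisk/prewarm", prewarm.toString)
    .endMetadata().done()

// On ContainerManager restart: recover owned containers via label queries
// against the Kube metadata (etcd) instead of replicating state ourselves.
def podsOf(router: String) =
  client.pods().inNamespace("openwhisk")
    .withLabel("openwhisk/router", router).list().getItems.asScala
```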

--dave


Re: Kafka and Proposal on a future architecture of OpenWhisk

2018-08-17 Thread Carlos Santana
triggers get responded to right away (202) with an activation id and then sent
to the queue to be processed async, same as async action invokes.

I think we would keep the same contract as today for this type of activations
that are eventually processed, different from blocking invokes, including
Actions where the http client holds a connection waiting for the result back.
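
A minimal sketch of that contract with Akka HTTP (the route, queue wiring and
sizes are illustrative only):

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.model.StatusCodes
import akka.http.scaladsl.server.Directives._
import akka.stream.scaladsl.{Keep, Sink, Source}
import akka.stream.{ActorMaterializer, OverflowStrategy}
import java.util.UUID

implicit val system = ActorSystem("ow")
implicit val mat = ActorMaterializer()

// In-memory stand-in for the async buffer; a real deployment would back
// this with Kafka or another MQ.
val buffer = Source.queue[(String, String)](1024, OverflowStrategy.dropNew)
  .toMat(Sink.foreach { case (id, body) => /* process async */ })(Keep.left)
  .run()

// Trigger fires reply 202 with an activation id immediately; the actual
// processing happens whenever the activation is pulled from the queue.
val route =
  path("trigger") {
    post {
      entity(as[String]) { body =>
        val activationId = UUID.randomUUID().toString
        buffer.offer(activationId -> body) // fire-and-forget enqueue
        complete(StatusCodes.Accepted -> activationId)
      }
    }
  }
```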

- Carlos Santana
@csantanapr

> On Aug 17, 2018, at 6:14 PM, Tyson Norris  wrote:
> 
> Hi - 
> Separate thread regarding the proposal: what is considered for routing 
> activations as overload and destined for kafka?
> 
> In general, if kafka is not on the blocking activation path, why would it be 
> used at all, if the timeouts and processing expectations of blocking and 
> non-blocking are the same?
> 
> One case I can imagine: triggers + non-blocking invokes, but only in the case 
> where those have some different timeout characteristics. e.g. if a trigger 
> fires an action, is there any case where the activation should be buffered to 
> kafka if it will timeout same as a blocking activation?  
> 
> Sorry if I’m missing something obvious.
> 
> Thanks
> Tyson
> 
> 


Kafka and Proposal on a future architecture of OpenWhisk

2018-08-17 Thread Tyson Norris
Hi - 
Separate thread regarding the proposal: what is considered for routing 
activations as overload and destined for kafka?

In general, if kafka is not on the blocking activation path, why would it be 
used at all, if the timeouts and processing expectations of blocking and 
non-blocking are the same?

One case I can imagine: triggers + non-blocking invokes, but only in the case 
where those have some different timeout characteristics. e.g. if a trigger 
fires an action, is there any case where the activation should be buffered to 
kafka if it will timeout same as a blocking activation?  

Sorry if I’m missing something obvious.

Thanks
Tyson




Re: Proposal on a future architecture of OpenWhisk

2018-08-17 Thread Tyson Norris
Ugh my reply formatting got removed!!! Trying this again with some >>

On Aug 17, 2018, at 2:45 PM, Tyson Norris <tnor...@adobe.com.INVALID> wrote:


If the failover of the singleton is too long (I think it will be based on
cluster size, oldest node becomes the singleton host iirc), I think we need
to consider how containers can launch in the meantime. A first step might
be to test out the singleton behavior in the cluster of various sizes.


I agree this bit of design is crucial, a few thoughts:
Pre-warm wouldn't help here, the ContainerRouters only know warm
containers. Pre-warming is managed by the ContainerManager.


>> Ah right



Considering a fail-over scenario: We could consider sharing the state via
EventSourcing. That is: All state lives inside of frequently snapshotted
events and thus can be shared between multiple instances of the
ContainerManager seamlessly. Alternatively, we could also think about only
working on persisted state. That way, a cold-standby model could fly. We
should make sure that the state is not "slightly stale" but rather both
instances see the same state at any point in time. I believe on that
cold-path of generating new containers, we can live with the extra-latency
of persisting what we're doing as the path will still be dominated by the
container creation latency.
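
For reference, a minimal sketch of the event-sourced variant with Akka
Persistence (the event and state types are invented; the snapshot cadence is
arbitrary):

```scala
import akka.persistence.{PersistentActor, SnapshotOffer}

// Invented events covering the ContainerManager's cold path.
sealed trait Event
case class ContainerCreated(id: String, action: String) extends Event
case class ContainerDeleted(id: String) extends Event

case class ManagerState(containers: Map[String, String] = Map.empty) {
  def updated(e: Event): ManagerState = e match {
    case ContainerCreated(id, action) => copy(containers + (id -> action))
    case ContainerDeleted(id)         => copy(containers - id)
  }
}

class EventSourcedContainerManager extends PersistentActor {
  override def persistenceId = "container-manager"
  private var state = ManagerState()

  // a fresh (or standby) instance replays snapshots + events to recover,
  // so both instances see the same state rather than a "slightly stale" copy
  override def receiveRecover: Receive = {
    case e: Event                          => state = state.updated(e)
    case SnapshotOffer(_, s: ManagerState) => state = s
  }

  // the extra persist latency sits on the cold path and is dominated by
  // container creation time anyway
  override def receiveCommand: Receive = {
    case e: Event => persist(e) { ev =>
      state = state.updated(ev)
      if (lastSequenceNr % 100 == 0) saveSnapshot(state) // frequent snapshots
    }
  }
}
```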



>> Wasn’t clear if you mean not using ClusterSingleton? To be clear in 
>> ClusterSingleton case there are 2 issues:
- time it takes for akka ClusterSingletonManager to realize it needs to start a 
new actor
- time it takes for the new actor to assume a usable state

EventSourcing (or ext persistence) may help with the latter, but we will need 
to be sure the former is tolerable to start with.
Here is an example test from akka source that may be useful (multi-jvm, but all 
local):
https://github.com/akka/akka/blob/009214ae07708e8144a279e71d06c4a504907e31/akka-cluster-tools/src/multi-jvm/scala/akka/cluster/singleton/ClusterSingletonManagerChaosSpec.scala

Some things to consider, that I don’t know details of:
- will the size of cluster affect the singleton behavior in case of failure? (I 
think so, but not sure, and what extent); in the simple test above it takes ~6s 
for the replacement singleton to begin startup, but if we have 100s of nodes, 
I’m not sure how much time it will take. (I don’t think this should be hard to 
test, but I haven’t done it)
- in case of hard crash, what is the singleton behavior? In graceful jvm 
termination, I know the cluster behavior is good, but there is always this 
question about how downing nodes will be handled. If this critical piece of the 
system relies on akka cluster functionality, we will need to make sure that the 
singleton can be reconstituted, both in case of graceful termination 
(restart/deployment events) and non-graceful termination (hard vm crash, hard 
container crash). This is ignoring more complicated cases of extended network 
partitions, which will also have bad effects on many of the downstream systems.




Handover time as you say is crucial, but I'd say as it only impacts
container creation, we could live with, let's say, 5 seconds of
failover-downtime on this path? What's your experience been on singleton
failover? How long did it take?



>> Seconds in the simplest case, so I think we need to test it in a scaled case 
>> (100s of cluster nodes), as well as the hard crash case (where not downing 
>> the node may affect the cluster state).





On Aug 16, 2018, at 11:01 AM, Tyson Norris <tnor...@adobe.com.INVALID> wrote:

A couple comments on singleton:
- use of cluster singleton will introduce a new single point of failure
- from time of singleton node failure, to single resurrection on a
different instance, will be an outage from the point of view of any
ContainerRouter that does not already have a warm+free container to service
an activation
- resurrecting the singleton will require transferring or rebuilding the
state when recovery occurs - in my experience this was tricky, and requires
replicating the data (which will be slightly stale, but better than
rebuilding from nothing); I don’t recall the handover delay (to transfer
singleton to a new akka cluster node) when I tried last, but I think it was
not as fast as I hoped it would be.

I don’t have a great suggestion for the singleton failure case, but
would like to consider this carefully, and discuss the ramifications (which
may or may not be tolerable) before pursuing this particular aspect of the
design.


On prioritization:
- if concurrency is enabled for an action, this is another
prioritization aspect, of sorts - if the action supports concurrency, there
is no 

Re: Proposal on a future architecture of OpenWhisk

2018-08-17 Thread Tyson Norris

If the failover of the singleton is too long (I think it will be based on
cluster size, oldest node becomes the singleton host iirc), I think we need
to consider how containers can launch in the meantime. A first step might
be to test out the singleton behavior in the cluster of various sizes.


I agree this bit of design is crucial, a few thoughts:
Pre-warm wouldn't help here, the ContainerRouters only know warm
containers. Pre-warming is managed by the ContainerManager.

Ah right


Considering a fail-over scenario: We could consider sharing the state via
EventSourcing. That is: All state lives inside of frequently snapshotted
events and thus can be shared between multiple instances of the
ContainerManager seamlessly. Alternatively, we could also think about only
working on persisted state. That way, a cold-standby model could fly. We
should make sure that the state is not "slightly stale" but rather both
instances see the same state at any point in time. I believe on that
cold-path of generating new containers, we can live with the extra-latency
of persisting what we're doing as the path will still be dominated by the
container creation latency.

Wasn’t clear if you meant not using ClusterSingleton? To be clear, in the 
ClusterSingleton case there are 2 issues:
- time it takes for akka ClusterSingletonManager to realize it needs to start a 
new actor
- time it takes for the new actor to assume a usable state

EventSourcing (or ext persistence) may help with the latter, but we will need 
to be sure the former is tolerable to start with.
Here is an example test from akka source that may be useful (multi-jvm, but all 
local):
https://github.com/akka/akka/blob/009214ae07708e8144a279e71d06c4a504907e31/akka-cluster-tools/src/multi-jvm/scala/akka/cluster/singleton/ClusterSingletonManagerChaosSpec.scala
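
For context, the basic wiring whose failover we'd be measuring looks roughly 
like this (a minimal sketch; the ContainerManager actor here is a stand-in, 
not the proposal's actual implementation):

import akka.actor.{Actor, ActorSystem, PoisonPill, Props}
import akka.cluster.singleton._

// Stand-in for the real ContainerManager; purely illustrative.
class ContainerManager extends Actor {
  def receive = { case msg => sender() ! msg }
}

object SingletonWiring extends App {
  val system = ActorSystem("whisk")

  // Runs the manager on exactly one node (the oldest); after a failure,
  // the next-oldest node starts a fresh instance once the old one is downed.
  system.actorOf(
    ClusterSingletonManager.props(
      singletonProps = Props[ContainerManager],
      terminationMessage = PoisonPill,
      settings = ClusterSingletonManagerSettings(system)),
    name = "containerManager")

  // Every node talks to the singleton through a proxy that tracks its location.
  val manager = system.actorOf(
    ClusterSingletonProxy.props(
      singletonManagerPath = "/user/containerManager",
      settings = ClusterSingletonProxySettings(system)),
    name = "containerManagerProxy")
}

The interesting number for us is the gap between the old instance dying and the 
proxy being able to deliver messages again, at various cluster sizes.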

Some things to consider, that I don’t know details of:
- will the size of the cluster affect the singleton behavior in case of failure? (I 
think so, but am not sure to what extent); in the simple test above it takes ~6s 
for the replacement singleton to begin startup, but if we have 100s of nodes, 
I’m not sure how much time it will take. (I don’t think this should be hard to 
test, but I haven’t done it)
- in case of hard crash, what is the singleton behavior? In graceful jvm 
termination, I know the cluster behavior is good, but there is always this 
question about how downing nodes will be handled. If this critical piece of the 
system relies on akka cluster functionality, we will need to make sure that the 
singleton can be reconstituted, both in case of graceful termination 
(restart/deployment events) and non-graceful termination (hard vm crash, hard 
container crash). This is ignoring more complicated cases of extended network 
partitions, which will also have bad effects on many of the downstream systems.



Handover time as you say is crucial, but I'd say as it only impacts
container creation, we could live with, let's say, 5 seconds of
failover-downtime on this path? What's your experience been on singleton
failover? How long did it take?


Seconds in the simplest case, so I think we need to test it in a scaled case 
(100s of cluster nodes), as well as the hard crash case (where not downing the 
node may affect the cluster state).




On Aug 16, 2018, at 11:01 AM, Tyson Norris <tnor...@adobe.com.INVALID> wrote:

A couple comments on singleton:
- use of cluster singleton will introduce a new single point of failure
- from time of singleton node failure, to single resurrection on a
different instance, will be an outage from the point of view of any
ContainerRouter that does not already have a warm+free container to service
an activation
- resurrecting the singleton will require transferring or rebuilding the
state when recovery occurs - in my experience this was tricky, and requires
replicating the data (which will be slightly stale, but better than
rebuilding from nothing); I don’t recall the handover delay (to transfer
singleton to a new akka cluster node) when I tried last, but I think it was
not as fast as I hoped it would be.

I don’t have a great suggestion for the singleton failure case, but
would like to consider this carefully, and discuss the ramifications (which
may or may not be tolerable) before pursuing this particular aspect of the
design.


On prioritization:
- if concurrency is enabled for an action, this is another
prioritization aspect, of sorts - if the action supports concurrency, there
is no reason (except for destruction coordination…) that it cannot be
shared across shards. This could be added later, but may be worth
considering since there is a general reuse problem where a series of
activations that arrives at different ContainerRouters will create a new
container in each, while they could be reused (and avoid creating new
containers) if concurrency is tolerated in that container. This would only
(ha ha) require changing how container destroy works, where it cannot be
destroyed 

Re: Proposal on a future architecture of OpenWhisk

2018-08-17 Thread Carlos Santana
It would be cool to implement some of these algorithms in a synthetic way with
mocks and stubs, and with simulated latency: e.g. container creation in the
ContainerManager is a random delay between 1 and 2 seconds and returns a URL
which just points to a mock server, etc.

Then hit it with a load test with different patterns (bursty, noisy neighbor,
mix of web action and trigger fires, etc.) and graph the behavior.
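
A minimal sketch of that stub (all names hypothetical, not an existing 
OpenWhisk API):

import scala.concurrent.{ExecutionContext, Future}
import scala.util.Random

// Synthetic stand-in for the ContainerManager: "creating" a container just
// sleeps 1-2 seconds and hands back a URL pointing at a mock server.
class StubContainerManager(implicit ec: ExecutionContext) {
  def createContainer(action: String): Future[String] = Future {
    Thread.sleep(1000 + Random.nextInt(1000)) // simulated create latency
    s"http://mock-server:8080/run/$action"    // mock container endpoint
  }
}

A load generator can then drive the routing logic against this stub with the 
different traffic patterns and graph the resulting latencies.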

-- Carlos

On Fri, Aug 17, 2018 at 7:21 AM Markus Thömmes 
wrote:

> Hi Tyson,
>
> thanks for the great input!
>
> On Thu, Aug 16, 2018 at 23:14, Tyson Norris
> wrote:
>
> > Thinking more about the singleton aspect, I guess this is mostly an issue
> > for blackbox containers, where manifest/managed containers will mitigate
> at
> > least some of the singleton failure delays by prewarm/stemcell
> containers.
> >
> > So in the case of singleton failure, impacts would be:
> > - managed containers once prewarms are exhausted (may be improved by
> being
> > more intelligent about prewarm pool sizing based on load etc)
> > - managed containers that don’t match any prewarms (similar - if prewarm
> > pool is dynamically configured based on load, this is less of a problem)
> > - blackbox containers (no help)
> >
> > If the failover of the singleton is too long (I think it will be based on
> > cluster size, oldest node becomes the singleton host iirc), I think we
> need
> > to consider how containers can launch in the meantime. A first step might
> > be to test out the singleton behavior in the cluster of various sizes.
> >
>
> I agree this bit of design is crucial, a few thoughts:
> Pre-warm wouldn't help here, the ContainerRouters only know warm
> containers. Pre-warming is managed by the ContainerManager.
>
> Considering a fail-over scenario: We could consider sharing the state via
> EventSourcing. That is: All state lives inside of frequently snapshotted
> events and thus can be shared between multiple instances of the
> ContainerManager seamlessly. Alternatively, we could also think about only
> working on persisted state. That way, a cold-standby model could fly. We
> should make sure that the state is not "slightly stale" but rather both
> instances see the same state at any point in time. I believe on that
> cold-path of generating new containers, we can live with the extra-latency
> of persisting what we're doing as the path will still be dominated by the
> container creation latency.
>
> Handover time as you say is crucial, but I'd say as it only impacts
> container creation, we could live with, let's say, 5 seconds of
> failover-downtime on this path? What's your experience been on singleton
> failover? How long did it take?
>
>
> >
> > > On Aug 16, 2018, at 11:01 AM, Tyson Norris 
> > wrote:
> > >
> > > A couple comments on singleton:
> > > - use of cluster singleton will introduce a new single point of failure
> > - from time of singleton node failure, to single resurrection on a
> > different instance, will be an outage from the point of view of any
> > ContainerRouter that does not already have a warm+free container to
> service
> > an activation
> > > - resurrecting the singleton will require transferring or rebuilding
> the
> > state when recovery occurs - in my experience this was tricky, and
> requires
> > replicating the data (which will be slightly stale, but better than
> > rebuilding from nothing); I don’t recall the handover delay (to transfer
> > singleton to a new akka cluster node) when I tried last, but I think it
> was
> > not as fast as I hoped it would be.
> > >
> > > I don’t have a great suggestion for the singleton failure case, but
> > would like to consider this carefully, and discuss the ramifications
> (which
> > may or may not be tolerable) before pursuing this particular aspect of
> the
> > design.
> > >
> > >
> > > On prioritization:
> > > - if concurrency is enabled for an action, this is another
> > prioritization aspect, of sorts - if the action supports concurrency,
> there
> > is no reason (except for destruction coordination…) that it cannot be
> > shared across shards. This could be added later, but may be worth
> > considering since there is a general reuse problem where a series of
> > activations that arrives at different ContainerRouters will create a new
> > container in each, while they could be reused (and avoid creating new
> > containers) if concurrency is tolerated in that container. This would
> only
> > (ha ha) require changing how container destroy works, where it cannot be
> > destroyed until the last ContainerRouter is done with it. And if
> container
> > destruction is coordinated in this way to increase reuse, it would also
> be
> > good to coordinate construction (don’t concurrently construct the same
> > container for multiple containerRouters IFF a single container would
> enable
> > concurrent activations once it is created). I’m not sure if others are
> > desiring this level of container reuse, but if so, it would be worth
> > considering 

Re: Proposal on a future architecture of OpenWhisk

2018-08-17 Thread Markus Thömmes
Hi Tyson,

thanks for the great input!

On Thu, Aug 16, 2018 at 23:14, Tyson Norris
wrote:

> Thinking more about the singleton aspect, I guess this is mostly an issue
> for blackbox containers, where manifest/managed containers will mitigate at
> least some of the singleton failure delays by prewarm/stemcell containers.
>
> So in the case of singleton failure, impacts would be:
> - managed containers once prewarms are exhausted (may be improved by being
> more intelligent about prewarm pool sizing based on load etc)
> - managed containers that don’t match any prewarms (similar - if prewarm
> pool is dynamically configured based on load, this is less of a problem)
> - blackbox containers (no help)
>
> If the failover of the singleton is too long (I think it will be based on
> cluster size, oldest node becomes the singleton host iirc), I think we need
> to consider how containers can launch in the meantime. A first step might
> be to test out the singleton behavior in the cluster of various sizes.
>

I agree this bit of design is crucial, a few thoughts:
Pre-warm wouldn't help here, the ContainerRouters only know warm
containers. Pre-warming is managed by the ContainerManager.

Considering a fail-over scenario: We could consider sharing the state via
EventSourcing. That is: All state lives inside of frequently snapshotted
events and thus can be shared between multiple instances of the
ContainerManager seamlessly. Alternatively, we could also think about only
working on persisted state. That way, a cold-standby model could fly. We
should make sure that the state is not "slightly stale" but rather both
instances see the same state at any point in time. I believe on that
cold-path of generating new containers, we can live with the extra-latency
of persisting what we're doing as the path will still be dominated by the
container creation latency.
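
To illustrate the EventSourcing variant, a rough sketch with Akka Persistence 
(the events and the state shape are hypothetical placeholders, not a settled 
model):

import akka.persistence.{PersistentActor, SnapshotOffer}

sealed trait Event
final case class ContainerAssigned(action: String, host: String) extends Event
final case class ContainerRemoved(action: String, host: String) extends Event

final case class ManagerState(containers: Map[String, List[String]] = Map.empty) {
  def updated(e: Event): ManagerState = e match {
    case ContainerAssigned(a, h) =>
      copy(containers.updated(a, h :: containers.getOrElse(a, Nil)))
    case ContainerRemoved(a, h) =>
      copy(containers.updated(a, containers.getOrElse(a, Nil).filterNot(_ == h)))
  }
}

class EventSourcedContainerManager extends PersistentActor {
  override def persistenceId = "container-manager"
  private var state = ManagerState()

  // Replayed on start and on failover: latest snapshot plus events since.
  def receiveRecover = {
    case SnapshotOffer(_, snapshot: ManagerState) => state = snapshot
    case e: Event                                 => state = state.updated(e)
  }

  // Persist first, apply after: the extra write latency sits on the cold path.
  def receiveCommand = {
    case e: Event => persist(e) { persisted => state = state.updated(persisted) }
  }
}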

Handover time as you say is crucial, but I'd say as it only impacts
container creation, we could live with, let's say, 5 seconds of
failover-downtime on this path? What's your experience been on singleton
failover? How long did it take?


>
> > On Aug 16, 2018, at 11:01 AM, Tyson Norris 
> wrote:
> >
> > A couple comments on singleton:
> > - use of cluster singleton will introduce a new single point of failure
> - from time of singleton node failure, to single resurrection on a
> different instance, will be an outage from the point of view of any
> ContainerRouter that does not already have a warm+free container to service
> an activation
> > - resurrecting the singleton will require transferring or rebuilding the
> state when recovery occurs - in my experience this was tricky, and requires
> replicating the data (which will be slightly stale, but better than
> rebuilding from nothing); I don’t recall the handover delay (to transfer
> singleton to a new akka cluster node) when I tried last, but I think it was
> not as fast as I hoped it would be.
> >
> > I don’t have a great suggestion for the singleton failure case, but
> would like to consider this carefully, and discuss the ramifications (which
> may or may not be tolerable) before pursuing this particular aspect of the
> design.
> >
> >
> > On prioritization:
> > - if concurrency is enabled for an action, this is another
> prioritization aspect, of sorts - if the action supports concurrency, there
> is no reason (except for destruction coordination…) that it cannot be
> shared across shards. This could be added later, but may be worth
> considering since there is a general reuse problem where a series of
> activations that arrives at different ContainerRouters will create a new
> container in each, while they could be reused (and avoid creating new
> containers) if concurrency is tolerated in that container. This would only
> (ha ha) require changing how container destroy works, where it cannot be
> destroyed until the last ContainerRouter is done with it. And if container
> destruction is coordinated in this way to increase reuse, it would also be
> good to coordinate construction (don’t concurrently construct the same
> container for multiple containerRouters IFF a single container would enable
> concurrent activations once it is created). I’m not sure if others are
> desiring this level of container reuse, but if so, it would be worth
> considering these aspects (sharding/isolation vs sharing/coordination) as
> part of any redesign.
>

Yes, I can see where you're heading here. I think this can be generalized:

Assume intra-container concurrency C and number of ContainerRouters R.
If C > R: Shard the "slots" on this container evenly across R. The
container can only be destroyed after you receive R acknowledgements of
doing so.
If C < R: Hand out 1 slot to C Routers, point the remaining Routers to the
ones that got slots.
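
In code, the split could look like this (purely illustrative):

// Distribute C concurrency slots of one container across R routers.
def shareSlots(c: Int, r: Int): Seq[Int] =
  if (c >= r) {
    // C >= R: spread evenly; the remainder goes to the first few routers.
    val (base, rest) = (c / r, c % r)
    Seq.tabulate(r)(i => base + (if (i < rest) 1 else 0))
  } else {
    // C < R: the first C routers get one slot each; the rest get none and
    // proxy their requests to a router that holds a slot.
    Seq.tabulate(r)(i => if (i < c) 1 else 0)
  }

// shareSlots(10, 4) == Seq(3, 3, 2, 2); shareSlots(2, 5) == Seq(1, 1, 0, 0, 0)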

Concurrent creation: Batch creation requests while one container is being
created. Say you received a request for a new container that has C slots.
If there are more requests for that 

Re: Proposal on a future architecture of OpenWhisk

2018-08-16 Thread Tyson Norris
Thinking more about the singleton aspect, I guess this is mostly an issue for 
blackbox containers, where manifest/managed containers will mitigate at least 
some of the singleton failure delays by prewarm/stemcell containers. 

So in the case of singleton failure, impacts would be:
- managed containers once prewarms are exhausted (may be improved by being more 
intelligent about prewarm pool sizing based on load etc)
- managed containers that don’t match any prewarms (similar - if the prewarm pool 
is dynamically configured based on load, this is less of a problem)
- blackbox containers (no help)

If the failover of the singleton is too long (I think it will be based on 
cluster size, oldest node becomes the singleton host iirc), I think we need to 
consider how containers can launch in the meantime. A first step might be to 
test out the singleton behavior in the cluster of various sizes.

> On Aug 16, 2018, at 11:01 AM, Tyson Norris  wrote:
> 
> A couple comments on singleton:
> - use of cluster singleton will introduce a new single point of failure - 
> from time of singleton node failure, to single resurrection on a different 
> instance, will be an outage from the point of view of any ContainerRouter 
> that does not already have a warm+free container to service an activation
> - resurrecting the singleton will require transferring or rebuilding the 
> state when recovery occurs - in my experience this was tricky, and requires 
> replicating the data (which will be slightly stale, but better than 
> rebuilding from nothing); I don’t recall the handover delay (to transfer 
> singleton to a new akka cluster node) when I tried last, but I think it was 
> not as fast as I hoped it would be.
> 
> I don’t have a great suggestion for the singleton failure case, but would 
> like to consider this carefully, and discuss the ramifications (which may or 
> may not be tolerable) before pursuing this particular aspect of the design.
> 
> 
> On prioritization:
> - if concurrency is enabled for an action, this is another prioritization 
> aspect, of sorts - if the action supports concurrency, there is no reason 
> (except for destruction coordination…) that it cannot be shared across 
> shards. This could be added later, but may be worth considering since there 
> is a general reuse problem where a series of activations that arrives at 
> different ContainerRouters will create a new container in each, while they 
> could be reused (and avoid creating new containers) if concurrency is 
> tolerated in that container. This would only (ha ha) require changing how 
> container destroy works, where it cannot be destroyed until the last 
> ContainerRouter is done with it. And if container destruction is coordinated 
> in this way to increase reuse, it would also be good to coordinate 
> construction (don’t concurrently construct the same container for multiple 
> containerRouters IFF a single container would enable concurrent activations 
> once it is created). I’m not sure if others are desiring this level of 
> container reuse, but if so, it would be worth considering these aspects 
> (sharding/isolation vs sharing/coordination) as part of any redesign.
> 
> 
> WDYT?
> 
> THanks
> Tyson
> 
> On Aug 15, 2018, at 8:55 AM, Carlos Santana <csantan...@gmail.com> wrote:
> 
> I think we should add a section on prioritization for blocking vs. async
> invokes (non-blocking actions and triggers)
> 
> The front door has the luxury of knowing some intent from the incoming
> request; I feel it would make sense to give high priority to blocking invokes,
> and for async ones to go straight to the queue to be picked up by the system to
> eventually run, even if it takes 10 times longer to execute than a blocking
> invoke - for example a web action would take 10ms vs. a DB trigger fire, or an
> async webhook takes 100ms.
> 
> Also the controller takes time to convert a trigger and process the rules;
> this is something that can also be taken out of the hot path.
> 
> So I'm just saying we could optimize the system because we know if the
> incoming request is a hot or hotter path :-)
> 
> -- Carlos
> 
> 



Re: Proposal on a future architecture of OpenWhisk

2018-08-16 Thread Tyson Norris
A couple comments on singleton:
- use of cluster singleton will introduce a new single point of failure - from 
time of singleton node failure, to single resurrection on a different instance, 
will be an outage from the point of view of any ContainerRouter that does not 
already have a warm+free container to service an activation
- resurrecting the singleton will require transferring or rebuilding the state 
when recovery occurs - in my experience this was tricky, and requires 
replicating the data (which will be slightly stale, but better than rebuilding 
from nothing); I don’t recall the handover delay (to transfer singleton to a 
new akka cluster node) when I tried last, but I think it was not as fast as I 
hoped it would be.

I don’t have a great suggestion for the singleton failure case, but would like 
to consider this carefully, and discuss the ramifications (which may or may not 
be tolerable) before pursuing this particular aspect of the design.


On prioritization:
- if concurrency is enabled for an action, this is another prioritization 
aspect, of sorts - if the action supports concurrency, there is no reason 
(except for destruction coordination…) that it cannot be shared across shards. 
This could be added later, but may be worth considering since there is a 
general reuse problem where a series of activations that arrives at different 
ContainerRouters will create a new container in each, while they could be 
reused (and avoid creating new containers) if concurrency is tolerated in that 
container. This would only (ha ha) require changing how container destroy 
works, where it cannot be destroyed until the last ContainerRouter is done with 
it. And if container destruction is coordinated in this way to increase reuse, 
it would also be good to coordinate construction (don’t concurrently construct 
the same container for multiple containerRouters IFF a single container would 
enable concurrent activations once it is created). I’m not sure if others are 
desiring this level of container reuse, but if so, it would be worth 
considering these aspects (sharding/isolation vs sharing/coordination) as part 
of any redesign.
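
For the destruction coordination piece, the simplest sketch I can think of is 
reference counting (illustrative names, not a real API):

import java.util.concurrent.atomic.AtomicInteger

// A container shared by several ContainerRouters is only destroyed once the
// last router that was handed a reference has released it.
final class SharedContainerRef(destroy: () => Unit, routers: Int) {
  private val remaining = new AtomicInteger(routers)
  def release(): Unit =
    if (remaining.decrementAndGet() == 0) destroy() // last one out destroys
}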


WDYT?

THanks
Tyson

On Aug 15, 2018, at 8:55 AM, Carlos Santana <csantan...@gmail.com> wrote:

I think we should add a section on prioritization for blocking vs. async
invokes (non-blocking actions and triggers)

The front door has the luxury of knowing some intent from the incoming
request; I feel it would make sense to give high priority to blocking invokes,
and for async ones to go straight to the queue to be picked up by the system to
eventually run, even if it takes 10 times longer to execute than a blocking
invoke - for example a web action would take 10ms vs. a DB trigger fire, or an
async webhook takes 100ms.

Also the controller takes time to convert a trigger and process the rules;
this is something that can also be taken out of the hot path.

So I'm just saying we could optimize the system because we know if the
incoming request is a hot or hotter path :-)

-- Carlos




Re: Proposal on a future architecture of OpenWhisk

2018-08-16 Thread Michael Marth
> You are right... today! I'm not saying Knative is necessarily a superior
> backend for OpenWhisk as it stands today. All I'm saying is that from an
> architecture point-of-view, Knative serving replaces all of the concerns
> that the execution layer has.
>
>
> >
> >
> > Thanks,
> >
> > dragos
> >
> >
> > [1] - https://doc.akka.io/docs/akka/2.5/distributed-data.html
> >
> >
> > 
> > From: David P Grove 
> > Sent: Tuesday, August 14, 2018 2:15:13 PM
> > To: dev@openwhisk.apache.org
> > Subject: Re: Proposal on a future architecture of OpenWhisk
> >
> >
> >
> >
> > "Markus Thömmes"  wrote on 08/14/2018
> > 10:06:49 AM:
> > >
> > > I just published a revision on the initial proposal I made. I still owe a
> > > lot of sequence diagrams for the container distribution, sorry for taking
> > > so long on that, I'm working on it.
> > >
> > > I did include a clear separation of concerns into the proposal, where
> > > user-facing abstractions and the execution (loadbalancing, scaling) of
> > > functions are loosely coupled. That enables us to exchange the execution
> > > system while not changing anything in the Controllers at all (to an
> > > extent). The interface to talk to the execution layer is HTTP.
> > >
> >
> > Nice writeup!
> >
> > For me, the part of the design I'm wondering about is the separation of the
> > ContainerManager and the ContainerRouter and having the ContainerManager be
> > a cluster singleton. With Kubernetes blinders on, it seems more natural to
> > me to fuse the ContainerManager into each of the ContainerRouter instances
> > (since there is very little to the ContainerManager except (a) talking to
> > Kubernetes and (b) keeping track of which Containers it has handed out to
> > which ContainerRouters -- a task which is eliminated if we fuse them).
> >
> > The main challenge is dealing with your "edge case" where the optimal
> > number of containers to create to execute a function is less than the
> > number of ContainerRouters.  I suspect this is actually an important case
> > to handle well for large-scale deployments of OpenWhisk.  Having 20ish
> > ContainerRouters on a large cluster seems plausible, and then we'd expect a
> > long tail of functions where the optimal number of container instances is
> > less than 20.
> >
> > I wonder if we can partially mitigate this problem by doing some amount of
> > smart routing in the Controller.  For example, the first level of routing
> > could be based on the kind of the action (nodejs:6, python, etc).  That
> > could then vector to per-runtime ContainerRouters which dynamically
> > auto-scale based on load.  Since there doesn't have to be a fixed division
> > of actual execution resources to each ContainerRouter this could work.  It
> > also lets us easily provision stemcells for multiple runtimes without
> > worrying about wasting too many resources.
> >
> > How do you want to deal with design alternatives?  Should I be adding to
> > the wiki page?  Doing something else?
> >
> > --dave
> >
>
>
>




Re: Proposal on a future architecture of OpenWhisk

2018-08-15 Thread Carlos Santana
> > Sharing that across a potentially unbounded number of routers is not
> > viable performance wise.
> >
> > Hence the premise is to manage that state locally and essentially
> > shard the
> > list of available containers between all routers, so each of them can
> > keep
> > its respective state local.
> >
> >
> > >
> > >
> > > Re Knative approach, can you expand why the execution layer/data
> > plane
> > > would be replaced entirely by Knative serving ? I think knative
> > serving
> > > handles very well some cases like API requests, but it's not
> > designed to
> > > guarantee concurrency restrictions like "1 request at a time per
> > container"
> > > - something that AI Actions need.
> > >
> >
> > You are right... today! I'm not saying Knative is necessarily a
> > superior
> > backend for OpenWhisk as it stands today. All I'm saying is that from
> > an
> > architecture point-of-view, Knative serving replaces all of the
> > concerns
> > that the execution layer has.
> >
> >
> > >
> > >
> > > Thanks,
> > >
> > > dragos
> > >
> > >
> > > [1] - https://doc.akka.io/docs/akka/2.5/distributed-data.html
> > >
> > >
> > > 
> > > From: David P Grove 
> > > Sent: Tuesday, August 14, 2018 2:15:13 PM
> > > To: dev@openwhisk.apache.org
> > > Subject: Re: Proposal on a future architecture of OpenWhisk
> > >
> > >
> > >
> > >
> > > "Markus Thömmes"  wrote on 08/14/2018
> > > 10:06:49 AM:
> > > >
> > > > I just published a revision on the initial proposal I made. I still owe a
> > > > lot of sequence diagrams for the container distribution, sorry for taking
> > > > so long on that, I'm working on it.
> > > >
> > > > I did include a clear separation of concerns into the proposal, where
> > > > user-facing abstractions and the execution (loadbalancing, scaling) of
> > > > functions are loosely coupled. That enables us to exchange the execution
> > > > system while not changing anything in the Controllers at all (to an
> > > > extent). The interface to talk to the execution layer is HTTP.
> > > >
> > >
> > > Nice writeup!
> > >
> > > For me, the part of the design I'm wondering about is the separation of the
> > > ContainerManager and the ContainerRouter and having the ContainerManager be
> > > a cluster singleton. With Kubernetes blinders on, it seems more natural to
> > > me to fuse the ContainerManager into each of the ContainerRouter instances
> > > (since there is very little to the ContainerManager except (a) talking to
> > > Kubernetes and (b) keeping track of which Containers it has handed out to
> > > which ContainerRouters -- a task which is eliminated if we fuse them).
> > >
> > > The main challenge is dealing with your "edge case" where the optimal
> > > number of containers to create to execute a function is less than the
> > > number of ContainerRouters.  I suspect this is actually an important case
> > > to handle well for large-scale deployments of OpenWhisk.  Having 20ish
> > > ContainerRouters on a large cluster seems plausible, and then we'd expect a
> > > long tail of functions where the optimal number of container instances is
> > > less than 20.
> > >
> > > I wonder if we can partially mitigate this problem by doing some amount of
> > > smart routing in the Controller.  For example, the first level of routing
> > > could be based on the kind of the action (nodejs:6, python, etc).  That
> > > could then vector to per-runtime ContainerRouters which dynamically
> > > auto-scale based on load.  Since there doesn't have to be a fixed division
> > > of actual execution resources to each ContainerRouter this could work.  It
> > > also lets us easily provision stemcells for multiple runtimes without
> > > worrying about wasting too many resources.
> > >
> > > How do you want to deal with design alternatives?  Should I be adding to
> > > the wiki page?  Doing something else?
> > >
> > > --dave
> > >
> >
> >
> >
>


Re: Proposal on a future architecture of OpenWhisk

2018-08-15 Thread Markus Thömmes
Hi Michael,

losing/adding a shard is essentially reconciled by the ContainerManager.
As it keeps track of all the ContainerRouters in the system, it can also
observe one going down/crashing or one coming up and joining the "cluster".

If one Router leaves the cluster, the ContainerManager knows which
containers were "managed" by that router and redistributes them across the
Routers left in the system.
If one Router joins the cluster, we can try to rebalance containers to take
load off existing ones. Precise algorithm to be defined but the primitives
should be in place to be able to do that.
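
A sketch of that reconciliation step (types are illustrative placeholders):

object Rebalance {
  final case class Container(id: String)

  // When a Router leaves, hand its containers out round-robin across the
  // Routers that are left; a real algorithm would also weigh current load.
  def redistribute(assignments: Map[String, Seq[Container]],
                   failedRouter: String): Map[String, Seq[Container]] = {
    val orphaned  = assignments.getOrElse(failedRouter, Seq.empty)
    val survivors = (assignments - failedRouter).keys.toVector
    if (survivors.isEmpty) Map.empty
    else
      orphaned.zipWithIndex.foldLeft(assignments - failedRouter) {
        case (acc, (container, i)) =>
          val target = survivors(i % survivors.size)
          acc.updated(target, acc.getOrElse(target, Seq.empty) :+ container)
      }
  }
}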

Does that answer the question?

Cheers,
Markus

On Wed, Aug 15, 2018 at 16:18, Michael Marth
wrote:

> Markus,
>
> I agree with your preference of making the state sharded instead of
> distributed. (not only for the scalability reasons you quote but also for
> operational concerns).
> What are your thoughts about losing a shard (planned or crashed) or adding
> a shard?
>
> Michael
>
>
> On 15.08.18, 09:58, "Markus Thömmes"  wrote:
>
> Hi Dragos,
>
> thanks for your questions, good discussion :)
>
> On Tue, Aug 14, 2018 at 23:42, Dragos Dascalita Haut
> wrote:
>
> > Markus, I appreciate the enhancements you mentioned in the wiki, and I'm
> > very much in line with the ideas you brought in there.
> >
> >
> >
> > "...having the ContainerManager be a cluster singleton..."
> >
> > I was just in process to reply with the same idea :)
> >
> > In addition, I was thinking we can leverage Akka Distributed Data
> [1] to
> > keep all ContainerRouter actors eventually consistent. When creating
> a new
> > container, the ContainerManager can write with a consistency
> "WriteAll"; it
> > would be a little slower but it would improve consistency.
> >
>
> I think we need to quantify "a little slower". Note that "WriteAll"
> becomes
> slower and slower the more actors you add to the cluster. Scalability
> is at
> question then.
>
> Of course scalability is also at question if we make the
> ContainerManager a
> singleton. The ContainerManager has a 1:1 relationship to the
> Kubernetes/Mesos scheduler. Do we know how those are distributed? I
> think
> the Kubernetes scheduler is a singleton, but I'll need to doublecheck
> on
> that.
>
> I can see the possibility to move the ContainerManager into each
> Router and
> have them communicate with each other to shard in the same way I'm
> proposing. As Dave is hitting on the very same points, I get the
> feeling we
> should/could breakout that specific discussion if we can agree on some
> basic premises of the design (see my answers on the thread with Dave).
> WDYT?
>
>
> >
> >
> > The "edge-case" isn't clear to me b/c I'm coming from the assumption
> that
> > it doesn't matter which ContainerRouter handles the next request,
> given
> > that all actors have the same data. Maybe you can help me understand
> better
> > the edge-case ?
> >
>
> ContainerRouters do not have the same state specifically. The
> live-concurrency on a container is potentially very fast changing data.
> Sharing that across a potentially unbounded number of routers is not
> viable
> performance wise.
>
> Hence the premise is to manage that state locally and essentially
> shard the
> list of available containers between all routers, so each of them can
> keep
> its respective state local.
>
>
> >
> >
> > Re Knative approach, can you expand why the execution layer/data
> plane
> > would be replaced entirely by Knative serving ? I think knative
> serving
> > handles very well some cases like API requests, but it's not
> designed to
> > guarantee concurrency restrictions like "1 request at a time per
> container"
> > - something that AI Actions need.
> >
>
> You are right... today! I'm not saying Knative is necessarily a
> superior
> backend for OpenWhisk as it stands today. All I'm saying is that from
> an
> architecture point-of-view, Knative serving replaces all of the
> concerns
> that the execution layer has.
>
>
> >
> >
> > Thanks,
> >
> > dragos
> >
> >
> > [1] - https://doc.akka.io/docs/akka/2.5/distributed-data.html
> >
> >
> > 
>

Re: Proposal on a future architecture of OpenWhisk

2018-08-15 Thread Michael Marth
Markus,

I agree with your preference of making the state sharded instead of 
distributed. (not only for the scalability reasons you quote but also for 
operational concerns).
What are your thoughts about losing a shard (planned or crashed) or adding a 
shard?

Michael


On 15.08.18, 09:58, "Markus Thömmes"  wrote:

Hi Dragos,

thanks for your questions, good discussion :)

On Tue, Aug 14, 2018 at 23:42, Dragos Dascalita Haut
wrote:

> Markus, I appreciate the enhancements you mentioned in the wiki, and I'm
> very much in line with the ideas you brought in there.
>
>
>
> "...having the ContainerManager be a cluster singleton..."
>
> I was just in process to reply with the same idea :)
>
> In addition, I was thinking we can leverage Akka Distributed Data [1] to
> keep all ContainerRouter actors eventually consistent. When creating a new
> container, the ContainerManager can write with a consistency "WriteAll"; it
> would be a little slower but it would improve consistency.
>

I think we need to quantify "a little slower". Note that "WriteAll" becomes
slower and slower the more actors you add to the cluster. Scalability is at
question then.

Of course scalability is also at question if we make the ContainerManager a
singleton. The ContainerManager has a 1:1 relationship to the
Kubernetes/Mesos scheduler. Do we know how those are distributed? I think
the Kubernetes scheduler is a singleton, but I'll need to doublecheck on
that.

I can see the possibility to move the ContainerManager into each Router and
have them communicate with each other to shard in the same way I'm
proposing. As Dave is hitting on the very same points, I get the feeling we
should/could breakout that specific discussion if we can agree on some
basic premises of the design (see my answers on the thread with Dave). WDYT?


>
>
> The "edge-case" isn't clear to me b/c I'm coming from the assumption that
> it doesn't matter which ContainerRouter handles the next request, given
> that all actors have the same data. Maybe you can help me understand better
> the edge-case?
>

ContainerRouters do not have the same state specifically. The
live-concurrency on a container is potentially very fast changing data.
Sharing that across a potentially unbounded number of routers is not viable
performance wise.

Hence the premise is to manage that state locally and essentially shard the
list of available containers between all routers, so each of them can keep
its respective state local.


>
>
> Re Knative approach, can you expand why the execution layer/data plane
> would be replaced entirely by Knative serving ? I think knative serving
> handles very well some cases like API requests, but it's not designed to
> guarantee concurrency restrictions like "1 request at a time per container"
> - something that AI Actions need.
>

You are right... today! I'm not saying Knative is necessarily a superior
backend for OpenWhisk as it stands today. All I'm saying is that from an
architecture point-of-view, Knative serving replaces all of the concerns
that the execution layer has.


>
>
> Thanks,
>
> dragos
>
>
> [1] - https://doc.akka.io/docs/akka/2.5/distributed-data.html
>
>
    > ____
> From: David P Grove 
> Sent: Tuesday, August 14, 2018 2:15:13 PM
> To: dev@openwhisk.apache.org
> Subject: Re: Proposal on a future architecture of OpenWhisk
>
>
>
>
> "Markus Thömmes"  wrote on 08/14/2018 10:06:49
> AM:
> >
> > I just published a revision on the initial proposal I made. I still owe a
> > lot of sequence diagrams for the container distribution, sorry for taking
> > so long on that, I'm working on it.
> >
> > I did include a clear separation of concerns into the proposal, where
> > user-facing abstractions and the execution (loadbalancing, scaling) of
> > functions are loosely coupled. That enables us to exchange the execution
> > system while not changing anything in the Controllers at all (to an
> > extent). The interface to talk to the execution layer is HTTP.
> >
>
> Nice writeup!
>
> For me, the part of the design I'm wondering about is the separation of the
> ContainerManager and the ContainerRouter and having the ContainerManager be
> a cluster singleton. With Kubernetes blinders 

Re: Proposal on a future architecture of OpenWhisk

2018-08-15 Thread Bertrand Delacretaz
Hi Markus,

On Wed, Aug 15, 2018 at 11:27 AM Markus Thömmes
 wrote:
> ...From the ContainerRouter's
> PoV, a container is just an IP address + Port, so that concern is
> encapsulated in the ContainerManager...

Cool, thanks for confirming. This means the ContainerManager can do
whatever allocation makes sense, great!

-Bertrand


Re: Proposal on a future architecture of OpenWhisk

2018-08-15 Thread Markus Thömmes
Hi Bertrand,

that's indeed something I haven't thought about yet, but as you say, the
ContainerManager could support multiple backends at once and schedule a
container wherever it thinks it makes sense. From the ContainerRouter's
PoV, a container is just an IP address + Port, so that concern is
encapsulated in the ContainerManager.

Cheers,
Markus

On Wed, Aug 15, 2018 at 10:35, Bertrand Delacretaz <bdelacre...@apache.org> wrote:

> Hi,
>
> On Tue, Aug 14, 2018 at 4:07 PM Markus Thömmes
>  wrote:
> ...
> >
> https://cwiki.apache.org/confluence/display/OPENWHISK/OpenWhisk+future+architecture
> ...
>
> Very clear proposal, thank you! And thanks for bringing the discussion
> here.
>
> Is the ContainerManager meant to support multiple underlying
> orchestrators? I'm thinking of a use case where you want to segregate
> actions of a specific set of namespaces to a dedicated orchestrator.
> This can be useful for cases where people don't trust existing
> container isolation mechanisms.
>
> From my understanding it looks like this is covered, just wanted to check.
>
> -Bertrand
>
>
> -Bertrand
>


Re: Proposal on a future architecture of OpenWhisk

2018-08-15 Thread Bertrand Delacretaz
Hi,

On Tue, Aug 14, 2018 at 4:07 PM Markus Thömmes
 wrote:
...
> https://cwiki.apache.org/confluence/display/OPENWHISK/OpenWhisk+future+architecture
...

Very clear proposal, thank you! And thanks for bringing the discussion here.

Is the ContainerManager meant to support multiple underlying
orchestrators? I'm thinking of a use case where you want to segregate
actions of a specific set of namespaces to a dedicated orchestrator.
This can be useful for cases where people don't trust existing
container isolation mechanisms.

From my understanding it looks like this is covered, just wanted to check.

-Bertrand


-Bertrand


Re: Proposal on a future architecture of OpenWhisk

2018-08-15 Thread Markus Thömmes
Hi Dragos,

thanks for your questions, good discussion :)

On Tue, Aug 14, 2018 at 23:42, Dragos Dascalita Haut
wrote:

> Markus, I appreciate the enhancements you mentioned in the wiki, and I'm
> very much in line with the ideas you brought in there.
>
>
>
> "...having the ContainerManager be a cluster singleton..."
>
> I was just in process to reply with the same idea :)
>
> In addition, I was thinking we can leverage Akka Distributed Data [1] to
> keep all ContainerRouter actors eventually consistent. When creating a new
> container, the ContainerManager can write with a consistency "WriteAll"; it
> would be a little slower but it would improve consistency.
>

I think we need to quantify "a little slower". Note that "WriteAll" becomes
slower and slower the more actors you add to the cluster. Scalability is at
question then.

Of course scalability is also at question if we make the ContainerManager a
singleton. The ContainerManager has a 1:1 relationship to the
Kubernetes/Mesos scheduler. Do we know how those are distributed? I think
the Kubernetes scheduler is a singleton, but I'll need to doublecheck on
that.

I can see the possibility to move the ContainerManager into each Router and
have them communicate with each other to shard in the same way I'm
proposing. As Dave is hitting on the very same points, I get the feeling we
should/could breakout that specific discussion if we can agree on some
basic premises of the design (see my answers on the thread with Dave). WDYT?


>
>
> The "edge-case" isn't clear to me b/c I'm coming from the assumption that
> it doesn't matter which ContainerRouter handles the next request, given
> that all actors have the same data. Maybe you can help me understand better
> the edge-case ?
>

ContainerRouters do not have the same state specifically. The
live-concurrency on a container is potentially very fast changing data.
Sharing that across a potentially unbounded number of routers is not viable
performance wise.

Hence the premise is to manage that state locally and essentially shard the
list of available containers between all routers, so each of them can keep
its respective state local.
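
Concretely, the router-local bookkeeping could be as small as this (an 
illustrative sketch, not the proposed implementation):

// Each ContainerRouter keeps only its own shard of containers and their
// in-flight counts; this fast-changing state is never replicated.
final class LocalContainer(val endpoint: String, val maxConcurrency: Int) {
  var inFlight: Int = 0
}

final class RouterState {
  private var shard = Map.empty[String, List[LocalContainer]] // action -> containers

  def assign(action: String, c: LocalContainer): Unit =
    shard = shard.updated(action, c :: shard.getOrElse(action, Nil))

  // Reserve a slot on a warm container in this router's shard, if any.
  def reserve(action: String): Option[LocalContainer] =
    shard.getOrElse(action, Nil).find(c => c.inFlight < c.maxConcurrency).map { c =>
      c.inFlight += 1 // purely local, no coordination needed
      c
    }

  def release(c: LocalContainer): Unit = c.inFlight -= 1
}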


>
>
> Re Knative approach, can you expand why the execution layer/data plane
> would be replaced entirely by Knative serving ? I think knative serving
> handles very well some cases like API requests, but it's not designed to
> guarantee concurrency restrictions like "1 request at a time per container"
> - something that AI Actions need.
>

You are right... today! I'm not saying Knative is necessarily a superior
backend for OpenWhisk as it stands today. All I'm saying is that from an
architecture point-of-view, Knative serving replaces all of the concerns
that the execution layer has.


>
>
> Thanks,
>
> dragos
>
>
> [1] - https://doc.akka.io/docs/akka/2.5/distributed-data.html
>
>
> ____
> From: David P Grove 
> Sent: Tuesday, August 14, 2018 2:15:13 PM
> To: dev@openwhisk.apache.org
> Subject: Re: Proposal on a future architecture of OpenWhisk
>
>
>
>
> "Markus Thömmes"  wrote on 08/14/2018 10:06:49
> AM:
> >
> > I just published a revision on the initial proposal I made. I still owe a
> > lot of sequence diagrams for the container distribution, sorry for taking
> > so long on that, I'm working on it.
> >
> > I did include a clear separation of concerns into the proposal, where
> > user-facing abstractions and the execution (loadbalancing, scaling) of
> > functions are loosely coupled. That enables us to exchange the execution
> > system while not changing anything in the Controllers at all (to an
> > extent). The interface to talk to the execution layer is HTTP.
> >
>
> Nice writeup!
>
> For me, the part of the design I'm wondering about is the separation of the
> ContainerManager and the ContainerRouter and having the ContainerManager by
> a cluster singleton. With Kubernetes blinders on, it seems more natural to
> me to fuse the ContainerManager into each of the ContainerRouter instances
> (since there is very little to the ContainerManager except (a) talking to
> Kubernetes and (b) keeping track of which Containers it has handed out to
> which ContainerRouters -- a task which is eliminated if we fuse them).
>
> The main challenge is dealing with your "edge case" where the optimal
> number of containers to create to execute a function is less than the
> number of ContainerRouters.  I suspect this is actually an important case
> to handle well for large-scale deployments of OpenWhisk.  Having 20ish
> ContainerRouters on a large cluster seems plausible, and then we'd expect a
> long tail of functions where 

Re: Proposal on a future architecture of OpenWhisk

2018-08-15 Thread Markus Thömmes
Hi Dave,

thanks a lot for your input! Greatly appreciated.

On Tue, Aug 14, 2018 at 23:15, David P Grove wrote:

>
>
>
> "Markus Thömmes"  wrote on 08/14/2018 10:06:49
> AM:
> >
> > I just published a revision on the initial proposal I made. I still owe a
> > lot of sequence diagrams for the container distribution, sorry for taking
> > so long on that, I'm working on it.
> >
> > I did include a clear separation of concerns into the proposal, where
> > user-facing abstractions and the execution (loadbalancing, scaling) of
> > functions are loosely coupled. That enables us to exchange the execution
> > system while not changing anything in the Controllers at all (to an
> > extent). The interface to talk to the execution layer is HTTP.
> >
>
> Nice writeup!
>
> For me, the part of the design I'm wondering about is the separation of the
> ContainerManager and the ContainerRouter and having the ContainerManager be
> a cluster singleton. With Kubernetes blinders on, it seems more natural to
> me to fuse the ContainerManager into each of the ContainerRouter instances
> (since there is very little to the ContainerManager except (a) talking to
> Kubernetes and (b) keeping track of which Containers it has handed out to
> which ContainerRouters -- a task which is eliminated if we fuse them).
>

As you say below, the main concern is dealing with the edge-case I laid out.


>
> The main challenge is dealing with your "edge case" where the optimal
> number of containers to create to execute a function is less than the
> number of ContainerRouters.  I suspect this is actually an important case
> to handle well for large-scale deployments of OpenWhisk.  Having 20ish
> ContainerRouters on a large cluster seems plausible, and then we'd expect a
> long tail of functions where the optimal number of container instances is
> less than 20.
>

I agree, in large scale environments that might well be an important case.


>
> I wonder if we can partially mitigate this problem by doing some amount of
> smart routing in the Controller.  For example, the first level of routing
> could be based on the kind of the action (nodejs:6, python, etc).  That
> could then vector to per-runtime ContainerRouters which dynamically
> auto-scale based on load.  Since there doesn't have to be a fixed division
> of actual execution resources to each ContainerRouter this could work.  It
> also lets easily stemcells for multiple runtimes without worrying about
> wasting too many resources.
>

The premise I wanted to keep in my proposal is that you can route
essentially randomly between the routers. That's also why I use the overflow
queue as a work-stealing queue, essentially to balance load between the
routers if the discrepancies get too high.

My general gut-feeling as to what can work here is: Keep state local as
long as you can (the individual ContainerRouters) to make the hot-path as
fast as possible. Fall back to work-stealing (slower, more constrained),
once things get out of bounds.
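
Roughly, with an in-memory queue standing in for whatever MQ the overflow 
queue ends up being (names are illustrative):

import java.util.concurrent.ConcurrentLinkedQueue

final case class Activation(action: String)

// Hot path: serve from local warm capacity. Cold path: spill to a shared
// overflow queue that any router with free capacity steals from.
final class Router(hasFreeSlot: String => Boolean,
                   invokeLocally: Activation => Unit,
                   overflow: ConcurrentLinkedQueue[Activation]) {

  def route(a: Activation): Unit =
    if (hasFreeSlot(a.action)) invokeLocally(a) // fast, state is local
    else overflow.add(a)                        // slower, work-stealing path

  // Called whenever local capacity frees up.
  def stealOne(): Unit = Option(overflow.poll()).foreach(route)
}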


>
> How do you want to deal with design alternatives?  Should I be adding to
> the wiki page?  Doing something else?
>

Good question. Feels like we can break out a "Routing" Work Group out of
this? Part of my proposal was to build this out collaboratively. Maybe we
can try to find consensus on some general points (direct HTTP connection to
containers should be part of it, we'll need an overflow queue) and once/if
we agree on the general broader picture, we can break out discussions on
individual aspects of it? Would that make sense?


>
> --dave
>


Re: Proposal on a future architecture of OpenWhisk

2018-08-14 Thread Dragos Dascalita Haut
Markus, I appreciate the enhancements you mentioned in the wiki, and I'm very 
much in line with the ideas you brought in there.



"...having the ContainerManager be a cluster singleton..."

I was just in process to reply with the same idea :)

In addition, I was thinking we can leverage Akka Distributed Data [1] to keep 
all ContainerRouter actors eventually consistent. When creating a new 
container, the ContainerManager can write with a consistency "WriteAll"; it 
would be a little slower but it would improve consistency.
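
For reference, a write with "WriteAll" through the Replicator would look 
roughly like this (the key and value types are illustrative only):

import akka.actor.ActorSystem
import akka.cluster.Cluster
import akka.cluster.ddata.{DistributedData, LWWMap, LWWMapKey, Replicator}
import scala.concurrent.duration._

object WriteAllSketch extends App {
  val system = ActorSystem("whisk")
  implicit val cluster: Cluster = Cluster(system)
  val replicator = DistributedData(system).replicator

  // action name -> container endpoint
  val ContainersKey = LWWMapKey[String, String]("containers")

  // WriteAll only acknowledges once every node has the update, which is
  // where the "a little slower" trade-off comes in.
  replicator ! Replicator.Update(
    ContainersKey,
    LWWMap.empty[String, String],
    Replicator.WriteAll(5.seconds))(_ :+ ("myAction" -> "10.0.0.1:8080"))
}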


The "edge-case" isn't clear to me b/c I'm coming from the assumption that it 
doesn't matter which ContainerRouter handles the next request, given that all 
actors have the same data. Maybe you can help me understand better the 
edge-case ?


Re Knative approach, can you expand why the execution layer/data plane would be 
replaced entirely by Knative serving ? I think knative serving handles very 
well some cases like API requests, but it's not designed to guarantee 
concurrency restrictions like "1 request at a time per container" - something 
that AI Actions need.


Thanks,

dragos


[1] - https://doc.akka.io/docs/akka/2.5/distributed-data.html



From: David P Grove 
Sent: Tuesday, August 14, 2018 2:15:13 PM
To: dev@openwhisk.apache.org
Subject: Re: Proposal on a future architecture of OpenWhisk




"Markus Thömmes"  wrote on 08/14/2018 10:06:49
AM:
>
> I just published a revision on the initial proposal I made. I still owe a
> lot of sequence diagrams for the container distribution, sorry for taking
> so long on that, I'm working on it.
>
> > I did include a clear separation of concerns into the proposal, where
> > user-facing abstractions and the execution (loadbalancing, scaling) of
> functions are loosely coupled. That enables us to exchange the execution
> system while not changing anything in the Controllers at all (to an
> extent). The interface to talk to the execution layer is HTTP.
>

Nice writeup!

For me, the part of the design I'm wondering about is the separation of the
ContainerManager and the ContainerRouter and having the ContainerManager be
a cluster singleton. With Kubernetes blinders on, it seems more natural to
me to fuse the ContainerManager into each of the ContainerRouter instances
(since there is very little to the ContainerManager except (a) talking to
Kubernetes and (b) keeping track of which Containers it has handed out to
which ContainerRouters -- a task which is eliminated if we fuse them).

The main challenge is dealing with your "edge case" where the optimal
number of containers to create to execute a function is less than the
number of ContainerRouters.  I suspect this is actually an important case
to handle well for large-scale deployments of OpenWhisk.  Having 20ish
ContainerRouters on a large cluster seems plausible, and then we'd expect a
long tail of functions where the optimal number of container instances is
less than 20.

I wonder if we can partially mitigate this problem by doing some amount of
smart routing in the Controller.  For example, the first level of routing
could be based on the kind of the action (nodejs:6, python, etc).  That
could then vector to per-runtime ContainerRouters which dynamically
auto-scale based on load.  Since there doesn't have to be a fixed division
of actual execution resources to each ContainerRouter this could work.  It
also lets us easily provision stemcells for multiple runtimes without worrying about
wasting too many resources.

How do you want to deal with design alternatives?  Should I be adding to
the wiki page?  Doing something else?

--dave


Re: Proposal on a future architecture of OpenWhisk

2018-08-14 Thread David P Grove



"Markus Thömmes"  wrote on 08/14/2018 10:06:49
AM:
>
> I just published a revision on the initial proposal I made. I still owe a
> lot of sequence diagrams for the container distribution, sorry for taking
> so long on that, I'm working on it.
>
> > I did include a clear separation of concerns into the proposal, where
> > user-facing abstractions and the execution (loadbalancing, scaling) of
> functions are loosely coupled. That enables us to exchange the execution
> system while not changing anything in the Controllers at all (to an
> extent). The interface to talk to the execution layer is HTTP.
>

Nice writeup!

For me, the part of the design I'm wondering about is the separation of the
ContainerManager and the ContainerRouter and having the ContainerManager be
a cluster singleton. With Kubernetes blinders on, it seems more natural to
me to fuse the ContainerManager into each of the ContainerRouter instances
(since there is very little to the ContainerManager except (a) talking to
Kubernetes and (b) keeping track of which Containers it has handed out to
which ContainerRouters -- a task which is eliminated if we fuse them).

The main challenge is dealing with your "edge case" where the optimal
number of containers to create to execute a function is less than the
number of ContainerRouters.  I suspect this is actually an important case
to handle well for large-scale deployments of OpenWhisk.  Having 20ish
ContainerRouters on a large cluster seems plausible, and then we'd expect a
long tail of functions where the optimal number of container instances is
less than 20.

I wonder if we can partially mitigate this problem by doing some amount of
smart routing in the Controller.  For example, the first level of routing
could be based on the kind of the action (nodejs:6, python, etc).  That
could then vector to per-runtime ContainerRouters which dynamically
auto-scale based on load.  Since there doesn't have to be a fixed division
of actual execution resources to each ContainerRouter, this could work.  It
also lets us easily maintain stemcells for multiple runtimes without worrying
about wasting too many resources.
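
For illustration, a minimal sketch of that two-level scheme in Scala (all
names are assumptions, not part of the proposal; the second level picks
randomly here, but least-connected would work just as well):

    // First level: route by the action's kind to a per-runtime pool of
    // ContainerRouters; second level: pick a router within that pool.
    final case class Action(name: String, kind: String) // e.g. kind = "nodejs:6"
    trait ContainerRouter

    class KindRouter(poolsByKind: Map[String, Vector[ContainerRouter]],
                     defaultPool: Vector[ContainerRouter]) {
      def route(action: Action): ContainerRouter = {
        val pool = poolsByKind.getOrElse(action.kind, defaultPool)
        pool(scala.util.Random.nextInt(pool.size)) // assumes the pool is non-empty
      }
    }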

How do you want to deal with design alternatives?  Should I be adding to
the wiki page?  Doing something else?

--dave


Proposal on a future architecture of OpenWhisk

2018-08-14 Thread Markus Thömmes
Hey OpenWhiskers,

I just published a revision on the initial proposal I made. I still owe a
lot of sequence diagrams for the container distribution, sorry for taking
so long on that, I'm working on it.

I did include a clear separation of concerns into the proposal, where
user-facing abstractions and the execution (load balancing, scaling) of
functions are loosely coupled. That enables us to exchange the execution
system while not changing anything in the Controllers at all (to an
extent). The interface to talk to the execution layer is HTTP.

Wanted to get this out as a possible idea on how to incorporate Knative in
the future and what it could look like alongside other implementations.

As always, feedback is very much welcome and appreciated.
https://cwiki.apache.org/confluence/display/OPENWHISK/OpenWhisk+future+architecture

Cheers,
Markus


Re: Proposal on a future architecture of OpenWhisk

2018-07-25 Thread Markus Thömmes
Hi David,

The problem is that if you have N controllers in the system and M
containers, but N > M and all controllers manage their containers
exclusively, you'll end up with controllers not having a container to
manage at all.
There's valid, very slow workload that needs only 1 container, for
example a slow cron trigger. Due to the round-robin nature of our
front-door, eventually all controllers will get one of those requests
at some point. Since they are by design not aware of the containers
managed by another controller, they'd end up asking for a newly
created container. Given N controllers, we'd always create at least N
containers for any action eventually. That is wasteful.

Instead, requests are proxied to a controller which we know manages a
container for the given action (the ContainerManager knows that) and
thereby bypass the need to create too many containers. If the load is
too much to be handled by the M containers, the controllers managing
those M will request new containers, which will get distributed to all
controllers. Eventually, given enough load, all controllers will have
containers to manage for each action.

The ContainerManager only needs to know which controller has which
container. It does not need to know in which state these containers
are. If they are busy, the controller itself will request more
resources accordingly.

Hope that clarifies

Cheers,
Markus

2018-07-25 14:19 GMT+02:00 David Breitgand :
> Hi Markus,
>
> I'd like to better understand the edge case.
>
> Citing from the wiki.
>
>>> Edge case: If an action only has a very small amount of containers
> (less than there are Controllers in the system), we have a problem with
> the method described above.
>
> Isn't there always at least one controller in the system? I think the
> problem is not the number of Controllers, but rather availability of
> prewarm containers that these Controllers control. If all containers of
> this Controller are busy at the moment, and the concurrency level per
> container is 1 and an invocation hits this controller, it cannot execute
> the action immediately with one of its containers. Is that the problem
> that is being solved?
>
>>> Since the front-door schedules round-robin or least-connected, it's
> impossible to decide to which Controller the request needs to go to hit
> one that has a container available.
> In this case, the other Controllers (who didn't get a container) act as a
> proxy and send the request to a Controller that actually has a container
> (maybe even via HTTP redirect). The ContainerManager decides which
> Controllers will act as a proxy in this case, since it's the instance that
> distributes the containers.
>>>
>
> When reading your proposal, I was under the impression that ContainerManager
> only knows about existence of containers allocated to the Controllers
> (because they asked), but ContainerManager does not know about the state
> of these containers at every given moment (i.e., whether they are being
> busy with running some action or not). I don't see Controllers updating
> ContainerManager about this in your diagrams.
>
> Thanks.
>
> -- david
>
>
>
>
> From:   "Markus Thoemmes" 
> To: dev@openwhisk.apache.org
> Date:   23/07/2018 02:21 PM
> Subject:Re: Proposal on a future architecture of OpenWhisk
>
>
>
> Hi Dominic,
>
> let's see if I can clarify the specific points one by one.
>
>>1. Docker daemon performance issue.
>>
>>...
>>
>>That's the reason why I initially thought that a Warmed state would
>>be kept
>>for more than today's behavior.
>>Today, containers would stay in the Warmed state only for 50ms, so it
>>introduces PAUSE/RESUME in case action comes with the interval of
>>more than
>>50 ms such as 1 sec.
>This will lead to more load on the Docker daemon.
>
> You're right that the docker daemon's throughput is indeed an issue.
>
> Please note that PAUSE/RESUME are not executed via the docker daemon in a
> performance-tuned environment but rather done via runc, which does not have such a
> throughput
> issue because it's not a daemon at all. PAUSE/RESUME latencies are ~10ms
> for each
> operation.
>
> Further, the duration of the pauseGrace is not related to the overall
> architecture at
> all. Rather, it's so narrow to safeguard against users stealing cycles
> from the
> vendor's infrastructure. It's also a configurable value so you can tweak
> it as you
> want.
>
> The proposed architecture itself has no impact on the pauseGrace.
>
>>
>>And if the state of containers is changing like today, the state in
>>ContainerManager would be frequently changing as well.
>>This may induce a sy

Re: Proposal on a future architecture of OpenWhisk

2018-07-25 Thread David Breitgand
Hi Markus, 

I'd like to better understand the edge case.

Citing from the wiki.

>> Edge case: If an action only has a very small amount of containers 
(less than there are Controllers in the system), we have a problem with 
the method described above. 

Isn't there always at least one controller in the system? I think the 
problem is not the number of Controllers, but rather availability of 
prewarm containers that these Controllers control. If all containers of 
this Controller are busy at the moment, and the concurrency level per 
container is 1 and an invocation hits this controller, it cannot execute 
the action immediately with one of its containers. Is that the problem 
that is being solved? 

>> Since the front-door schedules round-robin or least-connected, it's 
impossible to decide to which Controller the request needs to go to hit 
one that has a container available.
In this case, the other Controllers (who didn't get a container) act as a 
proxy and send the request to a Controller that actually has a container 
(maybe even via HTTP redirect). The ContainerManager decides which 
Controllers will act as a proxy in this case, since it's the instance that 
distributes the containers. 
>>

When reading your proposal, I was under the impression that ContainerManager 
only knows about existence of containers allocated to the Controllers 
(because they asked), but ContainerManager does not know about the state 
of these containers at every given moment (i.e., whether they are being 
busy with running some action or not). I don't see Controllers updating 
ContainerManager about this in your diagrams. 

Thanks.

-- david 




From:   "Markus Thoemmes" 
To: dev@openwhisk.apache.org
Date:   23/07/2018 02:21 PM
Subject:    Re: Proposal on a future architecture of OpenWhisk



Hi Dominic,

let's see if I can clarify the specific points one by one.

>1. Docker daemon performance issue.
>
>...
>
>That's the reason why I initially thought that a Warmed state would
>be kept
>for more than today's behavior.
>Today, containers would stay in the Warmed state only for 50ms, so it
>introduces PAUSE/RESUME in case action comes with the interval of
>more than
>50 ms such as 1 sec.
This will lead to more load on the Docker daemon.

You're right that the docker daemon's throughput is indeed an issue.

Please note that PAUSE/RESUME are not executed via the docker daemon in a 
performance-tuned environment but rather done via runc, which does not have such a 
throughput
issue because it's not a daemon at all. PAUSE/RESUME latencies are ~10ms 
for each
operation.

Further, the duration of the pauseGrace is not related to the overall 
architecture at
all. Rather, it's so narrow to safeguard against users stealing cycles 
from the
vendor's infrastructure. It's also a configurable value so you can tweak 
it as you
want.

The proposed architecture itself has no impact on the pauseGrace.

>
>And if the state of containers is changing like today, the state in
>ContainerManager would be frequently changing as well.
>This may induce a synchronization issue among controllers and, among
>ContainerManagers(in case there would be more than one
>ContainerManager).

The ContainerManager will NOT be informed about pause/unpause state 
changes and it
doesn't need to. I agree that such a behavior would generate serious load 
on the
ContainerManager, but I think it's unnecessary.

>2. Proxy case.
>
>...
>
>If it goes this way, ContainerManager should know all the status of
>containers in all controllers to make a right decision and it's not
>easy to
>synchronize all the status of containers in controllers.
>If it does not work like this, how can controller2 proxy requests to
>controller1 without any information about controller1's status?


The ContainerManager distributes a list of containers across all 
controllers.
If it does not have enough containers at hand to give one to each 
controller,
it instead tells controller2 to proxy to controller1, because the 
ContainerManager
knows at distribution time that controller1 has such a container.

No synchronization needed between controllers at all.

If controller1 gets more requests than the single container can handle, it 
will
request more containers, so eventually controller2 will get its own.

Please refer to 
https://lists.apache.org/thread.html/84a7b8171b90719c2f7aab86bea48a7e7865874c4e54f082b0861380@%3Cdev.openwhisk.apache.org%3E
for more information on that protocol.
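
For illustration, a rough sketch of that distribution rule in Scala (all
types and names are made up, and the real protocol in the linked thread is
more involved):

    // Each controller either uses a container it owns or proxies to a
    // controller that got one. Assumes fewer containers than controllers.
    final case class ControllerId(i: Int)
    final case class ContainerRef(id: String)

    sealed trait Assignment
    final case class Use(c: ContainerRef)     extends Assignment
    final case class ProxyTo(c: ControllerId) extends Assignment

    def distribute(controllers: Vector[ControllerId],
                   containers: Vector[ContainerRef]): Map[ControllerId, Assignment] = {
      require(containers.nonEmpty, "no containers to distribute")
      val owners   = controllers.zip(containers) // the first M controllers own one
      val ownerIds = owners.map(_._1)
      val direct: Map[ControllerId, Assignment] =
        owners.map { case (ctrl, c) => ctrl -> (Use(c): Assignment) }.toMap
      controllers.map { ctrl =>
        ctrl -> direct.getOrElse(ctrl, ProxyTo(ownerIds(ctrl.i % ownerIds.size)))
      }.toMap
    }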


>3. Intervention among multiple actions
>
>If the concurrency limit is 1, and the container lifecycle is managed
>like
>today, intervention among multiple actions can happen again.
>For example, the maximum number of containers which can be created by
>a
>user is 2, and ActionA and ActionB invocation requests come
>alternatively,
>controllers will try to remove and recreate con

Re: Proposal on a future architecture of OpenWhisk

2018-07-23 Thread Dominic Kim
Dear Markus.

I may not correctly understand the direction of new architecture.
So let me describe my concerns in more details.

Since that is a future architecture of OpenWhisk and requires many breaking
changes, I think it should at least address all known issues.
So I focused on figuring out whether it handles all issues which are
reported in my proposal.
(
https://cwiki.apache.org/confluence/display/OPENWHISK/Autonomous+Container+Scheduling
)

1. Docker daemon performance issue.

The most critical issue is the poor performance of the Docker daemon.
Since it is not inherently designed for high throughput or concurrent
processing, the Docker daemon shows poor performance in comparison with OW.
In the OW (serverless) world, an action execution can finish within 5ms ~
10ms, but the Docker daemon shows 100 ~ 500ms latency.
Still, we can take advantage of Prewarm and Warmed containers, but in a
situation where container creation/deletion/pausing/resuming happens
frequently and lasts for a long time, requests are delayed and the Docker
daemon can even crash.
So I think it is important to reduce the load (requests) against the Docker
daemon.

That's the reason why I initially thought that a Warmed state would be kept
for more than today's behavior.
Today, containers would stay in the Warmed state only for 50ms, so it
introduces PAUSE/RESUME in case action comes with the interval of more than
50 ms such as 1 sec.
This will lead to more load on the Docker daemon.

And if the state of containers is changing like today, the state in
ContainerManager would be frequently changing as well.
This may induce a synchronization issue among controllers and, among
ContainerManagers(in case there would be more than one ContainerManager).

So I think containers should be running for more than today's pauseGrace
time.
With more than 1 concurrency limit per container, it would also be better
to keep containers running (not paused) for more than 50ms.

2. Proxy case.

In the edge case where a container only exists in controller1, how can
controller2 decide to proxy the request to controller1 rather than just
creating its own container?
If it asks to ContainerManager, ContainerManager should know the state of
the container in controller1.
If the container in controller1 is already busy, it would be better to
create a new container in controller2 rather than proxying the requests to
controller1.

If it goes this way, ContainerManager should know all the status of
containers in all controllers to make a right decision and it's not easy to
synchronize all the status of containers in controllers.
If it does not work like this, how can controller2 proxy requests to
controller1 without any information about controller1's status?

3. Intervention among multiple actions

If the concurrency limit is 1, and the container lifecycle is managed like
today, intervention among multiple actions can happen again.
For example, the maximum number of containers which can be created by a
user is 2, and ActionA and ActionB invocation requests come alternately,
controllers will try to remove and recreate containers again and again.
I used an example with a small number of max container limit for
simplicity, but it can happen with a higher limit as well.

And even if the concurrency limit is more than 1, such as 3, it can also happen
if actions come more quickly than the execution time of actions.

4. Is concurrency per container controlled by users in a per-action based
way?
Let me clarify my question about concurrency limit.

If concurrency per container limit is more than 1, there could be multiple
actions being invoked at some point.
If the action requires high memory footprint such as 200MB or 150MB, it can
crash if the sum of memory usage of concurrent actions exceeds the
container memory.
(In our case (here), some users are executing headless-chrome and puppeteer
within actions, so it could happen under a similar situation.)

So I initially thought concurrency per container is controlled by users in
a per-action based way.
If concurrency per container is only configured by OW operators statically,
some users may not be able to invoke their actions correctly in the worst
case even though operators increased the memory of the biggest container type.

And not only for this case, there could be some more reasons that some
users just want to invoke their actions without per-container concurrency
but the others want it for better throughput.

So we may need some logic for users to take care of per-container
concurrency for each action.

5. Better to wait for the completion rather than creating a new container.
According to the workload, it would be better to wait for the previous
execution rather than creating a new container because it takes up to 500ms
~ 1s.
Even though the concurrency limit is more than 1, it still can happen if
there is no logic to accumulate invocations and decide whether to create a
new container or wait for the existing container.



Re: Proposal on a future architecture of OpenWhisk

2018-07-20 Thread David P Grove


Tyson Norris  wrote on 07/20/2018 12:24:07 PM:
>
> On Logging, I think if you are considering enabling concurrent
> activation processing, you will encounter that the only approach to
> parsing logs to be associated with a specific activationId, is to
> force the log output to be structured, and always include the
> activationId with every log message. This requires a change at the
> action container layer, but the simpler thing to do is to encourage
> action containers to provide a structured logging context that
> action developers can (and must) use to generate logs.

Good point.  I agree that if there is concurrent activation processing in
the container, structured logging is the only sensible thing to do.


--dave


Re: Proposal on a future architecture of OpenWhisk

2018-07-20 Thread Markus Thoemmes
I agree 100%!

In the face of intra-container concurrency we should go in fully and find a 
solution that works even there.

Another slight wrinkle: Usually, the log-forwarder (whomever that may be) needs 
to know which container belongs to which user to namespace logs accordingly. We 
cannot really set that on the container itself, because there are pre-warm 
containers. The mapping of container -> user is immutable though, since we 
don't reuse containers for different users (today). It would then be plausible 
to inform the log forwarder about the ContainerID -> user mapping to make it 
do the right thing.

Note that this specific piece of information cannot really be part of the log's 
own context since the user must not be able to change it.

@Tyson seems like you already put quite a bit of thought into this. Could you 
turn this into a proposal in itself to discuss separately?

Cheers,
Markus

-Tyson Norris  wrote: -

>To: "dev@openwhisk.apache.org" 
>From: Tyson Norris 
>Date: 07/20/2018 06:24PM
>Subject: Re: Proposal on a future architecture of OpenWhisk
>
>On Logging, I think if you are considering enabling concurrent
>activation processing, you will encounter that the only approach to
>parsing logs to be associated with a specific activationId, is to
>force the log output to be structured, and always include the
>activationId with every log message. This requires a change at the
>action container layer, but the simpler thing to do is to encourage
>action containers to provide a structured logging context that action
>developers can (and must) use to generate logs. 
>
>An example is nodejs container - for the time being, we are hijacking
>the stdout/stderr and injecting the activationId when any developer
>code writes to stdout/stderr (as console.log/console.error). This may
>not work as simply in all action containers, and isn’t great even in
>nodejs. 
>
>I would rather encourage action containers to provide a logging
>context, where action devs use: log.info, log.debug, etc, and this
>logging context does the needful to assert some structure to the log
>format. In general, many (most?) languages have conventions (slf4xyz,
>et al) for this already, and while you lose “random writes to
>stdout”, I haven’t seen this be an actual problem. 
>
>If you don’t deal with interleaved logs (typically because
>activations don’t run concurrently), this is less of an issue,
>but regardless, writing log parsers is a solved problem that would
>still be good to offload to external (not in OW controller/invoker)
>systems (logstash, fluentd, splunk, etc). This obviously comes with a
>caveat that logs parsing will be delayed, but that is OK from my
>point of view, partly because most logs will never be viewed, and
>partly because the log ingest systems are mostly fast enough already
>to limit this delay to seconds or milliseconds.  
>
>Thanks
>Tyson



Re: Proposal on a future architecture of OpenWhisk

2018-07-20 Thread Tyson Norris
On Logging, I think if you are considering enabling concurrent activation 
processing, you will encounter that the only approach to parsing logs to be 
associated with a specific activationId, is to force the log output to be 
structured, and always include the activationId with every log message. This 
requires a change at the action container layer, but the simpler thing to do is 
to encourage action containers to provide a structured logging context that 
action developers can (and must) use to generate logs. 

An example is nodejs container - for the time being, we are hijacking the 
stdout/stderr and injecting the activationId when any developer code writes to 
stdout/stderr (as console.log/console.error). This may not work as simply in 
all action containers, and isn’t great even in nodejs. 

I would rather encourage action containers to provide a logging context, where 
action devs use: log.info, log.debug, etc, and this logging context does the 
needful to assert some structure to the log format. In general, many (most?) 
languages have conventions (slf4xyz, et al) for this already, and while you 
lose “random writes to stdout”, I haven’t seen this be an actual problem. 
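
As a minimal sketch of what such a logging context could emit (Scala for
brevity; a real runtime would implement this in its own language, and proper
JSON escaping is omitted):

    // Every line is a self-describing JSON object carrying the activation id,
    // so a downstream log parser needs no state to attribute it.
    import java.time.Instant

    final class ActivationLogger(activationId: String) {
      private def emit(level: String, msg: String): Unit =
        println(s"""{"time":"${Instant.now}","level":"$level",""" +
                s""""activationId":"$activationId","msg":"$msg"}""")

      def info(msg: String): Unit  = emit("INFO", msg)
      def debug(msg: String): Unit = emit("DEBUG", msg)
      def error(msg: String): Unit = emit("ERROR", msg)
    }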

If you don’t deal with interleaved logs (typically because activations don’t 
run concurrently), this is less of an issue, but regardless, writing log 
parsers is a solved problem that would still be good to offload to external 
(not in OW controller/invoker) systems (logstash, fluentd, splunk, etc). This 
obviously comes with a caveat that logs parsing will be delayed, but that is OK 
from my point of view, partly because most logs will never be viewed, and 
partly because the log ingest systems are mostly fast enough already to limit 
this delay to seconds or milliseconds.  

Thanks
Tyson
> On Jul 20, 2018, at 8:46 AM, David P Grove  wrote:
> 
> 
> Rethinking the architecture to more fully exploit the capabilities of the
> underlying container orchestration platforms is pretty exciting.  I think
> there are lots of interesting ideas to explore about how best to schedule
> the workload.
> 
> As brought out in the architecture proposal [1], although it is logically
> an orthogonal issue, improving the log processing for user containers is a
> key piece of this roadmap.  The initial experiences with the
> KubernetesContainerFactory indicate that post-facto log enrichment to add
> the activation id to each log line is a serious bottleneck.  It adds
> complexity to the system and measurably reduces system performance by
> delaying the re-use of action containers until the logs can be extracted
> and processed.
> 
> I believe what we really want is to be using an openwhisk-aware log driver
> that will dynamically inject the current activation id into every log line
> as soon as it is written.  Then the user container logs, already properly
> enriched when they are generated, can be fed directly into the platform
> logging system with no post-processing needed.
> 
> If the low-level container runtime is docker 17.09 or better, I think we
> could probably achieve this by writing a logging driver plugin [2] that
> extends docker's json logging driver.  For non-blackbox containers, I think
> we "just" need the /run method to update a shared location that is
> accessible to the logging driver plugin with the current activation id
> before it invokes the user code.  As log lines are produced, that location
> is read and the string with the activation id gets injected into the json
> formatted log line as it is produced.   For blackbox containers, we could
> have our dockerskeleton do the same thing, but the user would have to opt
> in somehow to the protocol if they were using their own action runner.
> Warning:  I haven't looked into how flushing works with these drivers, so
> I'm not sure that this really works... we need to make sure we don't enrich
> a log line with the wrong activation id because of delayed flushing.
> 
> If we're running on Kubernetes, we might decide that instead of using a
> logging driver plugin, to use a streaming sidecar container as shown in [3]
> and have the controller interact with the sidecar to update the current
> activation id (or have the sidecar read it from a shared memory location
> that is updated by /run to minimize the differences between deployment
> platforms).  I'm not sure this really works as well, since the sidecar
> might fall behind in processing the logs, so we might still need a
> handshake somewhere.
> 
> A third option would be to extend our current sentineled log design by also
> writing a "START_WHISK_ACTIVATION_LOG <activation id>" line in the /run
> method before invoking the user code.  We'd still have to post-process the
> log files, but it could be decoupled from the critical path since the
> post-processor would have the activation id available to it in the log
> files (and thus would not need to handshake with the controller at all,
> thus we could offload all logging to a node-level 

Re: Proposal on a future architecture of OpenWhisk

2018-07-20 Thread David P Grove

Rethinking the architecture to more fully exploit the capabilities of the
underlying container orchestration platforms is pretty exciting.  I think
there are lots of interesting ideas to explore about how best to schedule
the workload.

As brought out in the architecture proposal [1], although it is logically
an orthogonal issue, improving the log processing for user containers is a
key piece of this roadmap.  The initial experiences with the
KubernetesContainerFactory indicate that post-facto log enrichment to add
the activation id to each log line is a serious bottleneck.  It adds
complexity to the system and measurably reduces system performance by
delaying the re-use of action containers until the logs can be extracted
and processed.

I believe what we really want is to be using an openwhisk-aware log driver
that will dynamically inject the current activation id into every log line
as soon as it is written.  Then the user container logs, already properly
enriched when they are generated, can be fed directly into the platform
logging system with no post-processing needed.

If the low-level container runtime is docker 17.09 or better, I think we
could probably achieve this by writing a logging driver plugin [2] that
extends docker's json logging driver.  For non-blackbox containers, I think
we "just" need the /run method to update a shared location that is
accessible to the logging driver plugin with the current activation id
before it invokes the user code.  As log lines are produced, that location
is read and the string with the activation id gets injected into the json
formatted log line as it is produced.   For blackbox containers, we could
have our dockerskeleton do the same thing, but the user would have to opt
in somehow to the protocol if they were using their own action runner.
Warning:  I haven't looked into how flushing works with these drivers, so
I'm not sure that this really works... we need to make sure we don't enrich
a log line with the wrong activation id because of delayed flushing.

If we're running on Kubernetes, we might decide that instead of using a
logging driver plugin, to use a streaming sidecar container as shown in [3]
and have the controller interact with the sidecar to update the current
activation id (or have the sidecar read it from a shared memory location
that is updated by /run to minimize the differences between deployment
platforms).  I'm not sure this really works as well, since the sidecar
might fall behind in processing the logs, so we might still need a
handshake somewhere.

A third option would be to extend our current sentineled log design by also
writing a "START_WHISK_ACTIVATION_LOG <activation id>" line in the /run
method before invoking the user code.  We'd still have to post-process the
log files, but it could be decoupled from the critical path since the
post-processor would have the activation id available to it in the log
files (and thus would not need to handshake with the controller at all,
thus we could offload all logging to a node-level log processing/forwarding
agent).

Option 3 would be really easy to implement and is independent of the
details of the low-level log driver, but doesn't eliminate the need to
post-process the logs. It just makes it easier to move that processing off
any critical path.
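
A decoupled post-processor for option 3 could then be as simple as this
sketch (the exact sentinel format is assumed, error handling omitted):

    // Group raw log lines by activation id using the sentinel line that the
    // /run method writes before invoking the user code.
    val Sentinel = """START_WHISK_ACTIVATION_LOG (\S+)""".r

    def groupByActivation(lines: Iterator[String]): Map[String, Vector[String]] = {
      val grouped = scala.collection.mutable.Map
        .empty[String, Vector[String]].withDefaultValue(Vector.empty)
      var current: Option[String] = None
      lines.foreach {
        case Sentinel(id) => current = Some(id) // a new activation's logs begin
        case line         => current.foreach(id => grouped(id) = grouped(id) :+ line)
      }
      grouped.toMap
    }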

Thoughts?

--dave

[1] https://cwiki.apache.org/confluence/display/OPENWHISK/OpenWhisk+future+architecture
[2] https://docs.docker.com/v17.09/engine/admin/logging/plugins/
[3] https://kubernetes.io/docs/concepts/cluster-administration/logging/


Re: Proposal on a future architecture of OpenWhisk

2018-07-20 Thread Markus Thoemmes
Hi Chetan,

>As activations may be bulky (1 MB max) it may not be possible to keep
>them in memory even if there is small incoming rate depending on how
>fast they get consumed. Currently the usage of Kafka takes the pressure
>off the
>Controller and helps in keeping them stable.
>to make use of buffer more often to keep pressure off Controllers,
>specially for heterogeneous loads and when system is making full use
>of cluster capacity.

Note that we can also apply client-side buffering in this case. If we enable 
end-to-end streaming of parameters/results (which is very much facilitated with 
my proposal, though not impossible even in the current architecture), you can 
refuse to consume the HTTP request's entity until you are able to pass it 
downstream to a container. That way, nothing (or close-to-nothing) is buffered 
in memory of the controller. In an overload scenario, where the wait time for 
resources is expected to be high, this can be altered to buffer into something 
like an overflow queue. In steady state, the controller should not buffer more 
than necessary.
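
For illustration, a sketch of that idea with Akka HTTP (which the Controllers
use today); acquireContainer and Container are stand-ins for whatever the
real resource protocol ends up being:

    // The request entity's bytes are only pulled once a container is ready,
    // so HTTP backpressure does the buffering for us.
    import akka.http.scaladsl.server.Directives._
    import akka.http.scaladsl.server.Route
    import akka.stream.scaladsl.Source
    import akka.util.ByteString
    import scala.concurrent.Future

    trait Container {
      def run(body: Source[ByteString, Any]): Future[String]
    }
    def acquireContainer(): Future[Container] = ??? // resolves once capacity exists

    val route: Route = path("run") {
      post {
        extractRequestEntity { entity =>
          onSuccess(acquireContainer()) { container =>
            complete(container.run(entity.dataBytes)) // bytes start flowing only now
          }
        }
      }
    }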

To your proposal: If I understand correctly, what you are laying out is heavily 
reliant on detaching the Invoker from the hosts where containers run (since you 
assume a ContainerPool to handle a lot more containers than today). This 
assumption moves the general architecture quite close to what I am proposing. 
Essentially, you propose a load-balancing algorithm in front of the controllers 
(in my picture), since these own their respective Containerpool (so to say).

In essence, this would then be a solution for the concerns Dominic mentioned 
with load imbalance in between controllers. This is exactly what we've been 
discussing earlier (where I proposed pubsub vs. Kafka) etc. and I believe both 
solutions do go in a similar direction.

Sorry if I simplified this too much and let me know if I'm overlooking an 
aspect here.

Cheers,
Markus
 
   
 

-Chetan Mehrotra  wrote: -

>To: dev@openwhisk.apache.org
>From: Chetan Mehrotra 
>Date: 07/20/2018 02:13PM
>Subject: Re: Proposal on a future architecture of OpenWhisk
>
>Hi Markus,
>
>Some of the points below are more of a hunch and thinking out loud and
>hence may not be fully objective or valid!
>
>> The buffer I described above is a per action "invoke me once
>resources are available" buffer
>
>> 1. Do we need to persist an in-memory queue that waits for
>resources to be created by the ContainerManager?
>> 2. Do we need a shared queue between the Controllers to enable
>work-stealing in cases where multiple Controllers wait for resources?
>
>As activations may be bulky (1 MB max) it may not be possible to keep
>them in memory even if there is small incoming rate depending on how
>fast they get consumed. Currently the usage of Kafka takes the pressure
>off the
>Controller and helps in keeping them stable.
>to make use of buffer more often to keep pressure off Controllers,
>specially for heterogeneous loads and when system is making full use
>of cluster capacity.
>
>> That could potentially open up the possibility to use a technology
>more geared towards Pub/Sub, where subscribers per action are more
>cheap to implement than on Kafka?
>
>That would certainly be one option to consider. Systems like RabbitMQ
>can support lots of queues. They may pose a problem on how to
>efficiently consume resources from all such queues.
>
>Another option I was thinking was to have a design similar to current
>involving controller and invoker but having a single queue shared
>between controller and invoker and using act

Re: Proposal on a future architecture of OpenWhisk

2018-07-19 Thread Markus Thoemmes
Hi Chetan,

>Currently one aspect which is not clear is whether the Controller has
>access to
>
>1. Pool of prewarm containers - Containers of a base image where /init
>is
>not yet done. So these containers can then be initialized within the
>Controller
>2. OR Pool of warm container bound to specific user+action. These
>containers would possibly have been initialized by ContainerManager
>and then it allocates them to controller.

The latter case is what I had in mind. The controller only knows containers 
that are already ready to call /run on.

Pre-Warm containers are an implementation detail to the Controller. The 
ContainerManager can keep them around to be able to answer demand for specific 
resources more quickly, but the Controller doesn't care. It only knows warm 
containers.

>Can you elaborate on this a bit more, i.e. how would the scale-up logic
>work, and is it asynchronous?
>
>I think the above aspect (type of pool) would have a bearing on the
>scale-up logic. If an action was not in use so far, then when the first
>request comes (i.e. the 0-to-1 scale-up case), would the Controller ask
>the ContainerManager for a specific action container and then wait for
>its setup and then execute it? OR, if it has a generic pool, does it
>take one, initialize it and use it? And if it's not done synchronously,
>would such an action be put on the overflow queue?

In this specific example, the Controller will request a container from the 
ContainerManager and buffer the request until it finally has capacity to 
execute it. All subsequent requests will be put on the same buffer and a 
Container will be requested for each of them. 

Whether we put this buffer in an overflow queue (aka persist it) remains to be 
decided. If we keep it in memory, we have roughly the same guarantees as today. 
As Rodric mentioned though, we can improve certain failure scenarios (like 
waiting for a container in this case) by making this buffer more persistent. 
I'm not mentioning Kafka here for a reason, because in this case any persistent 
buffer is just fine.

Also note that this is not necessarily the same as the overflow queue. The 
overflow queue is used for arbitrary requests once the ContainerManager cannot 
create more resources and thus requests need to wait.

The buffer I described above is a per action "invoke me once resources are 
available" buffer, that could potentially be designed to be per Controller to 
not have the challenge of scaling it out. That of course has its downsides in 
itself, for instance: A buffer that spans all controllers would enable 
work-stealing between controllers with missing capacity and could mitigate some 
of the load imbalances that Dominic mentioned. We are then entering the same area 
that his proposal enters: the need for a queue per action.
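
Sketched per Controller, such a buffer might look like this (requestContainer
stands in for the signal to the ContainerManager; all names are made up):

    // Per-action "invoke me once resources are available" buffer. One
    // container is requested per buffered activation; when a container
    // arrives, the oldest waiting activation runs on it.
    import scala.collection.mutable

    class ActionBuffer[Activation](requestContainer: () => Unit) {
      private val waiting = mutable.Queue.empty[Activation]

      def enqueue(a: Activation): Unit = {
        waiting.enqueue(a)
        requestContainer()
      }

      def onContainerAvailable(run: Activation => Unit): Unit =
        if (waiting.nonEmpty) run(waiting.dequeue())
    }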

Conclusion is, we have 2 perspectives to look at this:

1. Do we need to persist an in-memory queue that waits for resources to be 
created by the ContainerManager?
2. Do we need a shared queue between the Controllers to enable work-stealing in 
cases where multiple Controllers wait for resources?
 
An important thing to note here: Since all of this is no longer happening on 
the critical path (stuff gets put on the queue only if it needs to wait for 
resources anyway), we can afford a solution that isn't as performant as Kafka 
might be. That could potentially open up the possibility to use a technology 
more geared towards Pub/Sub, where subscribers per action are more cheap to 
implement than on Kafka?

Does that make sense? Hope that helps :). Thanks for the questions!

Cheers,
Markus



Re: Proposal on a future architecture of OpenWhisk

2018-07-19 Thread Chetan Mehrotra
Hi Markus,

Currently one aspect which is not clear is whether the Controller has access to

1. Pool of prewarm containers - Containers of a base image where /init is
not yet done. So these containers can then be initialized within the
Controller
2. OR Pool of warm container bound to specific user+action. These
containers would possibly have been initialized by ContainerManager
and then it allocates them to controller.

> The scaleup model stays exactly the same as today! If you have 200 
> simultaneous invocations (assuming a per-container concurrency limit of 1) we 
> will create 200 containers to handle that load (given the requests are truly 
> simultaneous --> arrive at the same time). Containers are NOT created in a 
> synchronous way and there's no need to sequentialize their creation. Does 
> something in the proposal hint to that? If so, we should fix that immediately.

Can you elaborate on this a bit more, i.e. how would the scale-up logic work,
and is it asynchronous?

I think the above aspect (type of pool) would have a bearing on the scale-up
logic. If an action was not in use so far, then when the first request
comes (i.e. the 0-to-1 scale-up case), would the Controller ask the
ContainerManager for a specific action container and then wait for its
setup and then execute it? OR, if it has a generic pool, does it take one,
initialize it and use it? And if it's not done synchronously, would such
an action be put on the overflow queue?

Chetan Mehrotra

On Thu, Jul 19, 2018 at 2:39 PM Markus Thoemmes
 wrote:
>
> Hi Dominic,
>
> >Ah yes. Now I remember I wondered why OW doesn't support
> >"at-least-once"
> >semantics.
> >This is the question apart from the new architecture, but is this
> >because
> >of the case that user can execute the non-idempotent action?
> >So though an invoker is failed, still action could be executed and it
> >could
> >cause some side effects such as repeating the action which requires
> >"at-most-once" semantic more than once?
>
> Exactly. Once we pass the HTTP request into the container, we cannot know 
> whether the action has already caused a side-effect. At that point it's not 
> safe to retry (hence /run doesn't allow for retries vs. /init does) and in 
> doubt we need to abort.
> We could imagine the user to state idempotency of an action so it's safe for 
> us to retry, but that's a different can of worms and imho unrelated to the 
> architecture as you say.
>
> >BTW, how long would warmed containers be kept in the new
> >architecture? Is
> >it 1 or 2 orders of magnitude in seconds?
>
> I don't see a reason to change this behavior from what we have today. Could 
> be configurable and potentially be hours. The only concerns are:
> - Scale-down of worker nodes is inhibited if we keep containers around a long 
> time --> costs the vendor money
> - If the system is full with warm containers and we want to evict one to make 
> space for a different container, removing and recreating a container is more 
> expensive than just creating.
>
> >In the new architecture, concurrency limit is controlled by users in
> >a
> >per-action based way?
>
> That's not necessarily architecture related, but Tyson is implementing this, 
> yes. Note that this is "concurrency per container" not "concurrency per 
> action" (which could be a second knob to turn).
>
> In a nutshell:
> - concurrency per container: The amount of parallel HTTP requests allowed for 
> a single container (this is what Tyson is implementing)
> - concurrency per action: You could potentially limit the maximum amount of 
> concurrent invocations running for each action (which is distinct from the 
> above, because this could mean to limit the amount of containers created vs. 
> limiting the amount of parallel HTTP requests to a SINGLE container)
>
> >So in case a user wants to execute the long-running action, does he
> >configure the concurrency limit for the action?
>
> Long running isn't related to concurrency I think.
>
> >
> >And if concurrency limit is 1, in case action container is possessed,
> >wouldn't controllers request a container again and again?
> >And if it only allows container creation in a synchronous
> >way (creating one
> >by one), couldn't it be a burden in case a user wants a huge number
> >of (100~200) simultaneous invocations?
>
> The scaleup model stays exactly the same as today! If you have 200 
> simultaneous invocations (assuming a per-container concurrency limit of 1) we 
> will create 200 containers to handle that load (given the requests are truly 
> simultaneous --> arrive at the same time). Containers are NOT created in a 
> synchronous way and there's no need to sequentialize their creation. Does 
> something in the proposal hint to that? If so, we should fix that immediately.
>
> No need to apologize, this is great engagement, exactly what we need here. 
> Keep it up!
>
> Cheers,
> Markus
>


Re: Proposal on a future architecture of OpenWhisk

2018-07-19 Thread Rodric Rabbah
Hi Martin,

This is precisely captured by the serverless contract article I published 
recently:

https://medium.com/openwhisk/the-serverless-contract-44329fab10fb

Queue, reject, or add capacity as three potential resolutions under load. 

-r

> On Jul 18, 2018, at 8:16 AM, Martin Gencur  wrote:
> 
> Hi Markus,
> thinking about scalability and the edge case. When there are not enough 
> containers and new controllers are being created, and all of them redirect 
> traffic to the controllers with containers, doesn't it mean overloading the 
> available containers a lot? I'm curious how we throttle the traffic in this 
> case.
> 
> I guess the other approach would be to block creating new controllers when 
> there are no containers available as long as we don't want to overload the 
> existing containers. And keep the overflowing workload in Kafka as well.
> 
> Thanks,
> Martin Gencur
> QE, Red Hat


Re: Proposal on a future architecture of OpenWhisk

2018-07-19 Thread Rodric Rabbah


> On Jul 17, 2018, at 4:49 AM, Markus Thoemmes  
> wrote:
> 
> The design proposed does not intent to change the way we provide 
> oberservibility via persisting activation records.

It is worth considering how we can provide observability for activations in 
flight. As it stands today, as a user you get to see when the action has 
finished (if we record the record successfully). But given an activation id you 
cannot query the status otherwise: either the record exists, or not found.

-r

Re: Proposal on a future architecture of OpenWhisk

2018-07-19 Thread Rodric Rabbah
Regarding at least or at most once: 

Functions should be stateless and the burden is on the action for external side 
effects anyway... so it’s plausible with these in mind that we contemplate 
shifting modes (a la lambda). There are cases though where retrying is safer: 
in-flight requests which are lost before they reach the container HTTP 
endpoint, and failures to assign a container.

-r

Re: Proposal on a future architecture of OpenWhisk

2018-07-19 Thread Markus Thoemmes
Hi Dominic,

>Ah yes. Now I remember I wondered why OW doesn't support
>"at-least-once"
>semantics.
>This is the question apart from the new architecture, but is this
>because
>of the case that user can execute the non-idempotent action?
>So though an invoker is failed, still action could be executed and it
>could
>cause some side effects such as repeating the action which requires
>"at-most-once" semantic more than once?

Exactly. Once we pass the HTTP request into the container, we cannot know 
whether the action has already caused a side-effect. At that point it's not 
safe to retry (hence /run doesn't allow for retries vs. /init does) and in 
doubt we need to abort.
We could imagine the user to state idempotency of an action so it's safe for us 
to retry, but that's a different can of worms and imho unrelated to the 
architecture as you say.

>BTW, how long would warmed containers be kept in the new
>architecture? Is
>it 1 or 2 orders of magnitude in seconds?

I don't see a reason to change this behavior from what we have today. Could be 
configurable and potentially be hours. The only concerns are: 
- Scale-down of worker nodes is inhibited if we keep containers around a long 
time --> costs the vendor money
- If the system is full with warm containers and we want to evict one to make 
space for a different container, removing and recreating a container is more 
expensive than just creating.

>In the new architecture, concurrency limit is controlled by users in
>a
>per-action based way?

That's not necessarily architecture related, but Tyson is implementing this, 
yes. Note that this is "concurrency per container" not "concurrency per action" 
(which could be a second knob to turn).

In a nutshell:
- concurrency per container: The amount of parallel HTTP requests allowed for a 
single container (this is what Tyson is implementing)
- concurrency per action: You could potentially limit the maximum amount of 
concurrent invocations running for each action (which is distinct from the 
above, because this could mean to limit the amount of containers created vs. 
limiting the amount of parallel HTTP requests to a SINGLE container)
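
The two knobs side by side, as a sketch (names and defaults are invented for
illustration):

    // Two distinct limits: how many parallel requests ONE container accepts,
    // vs. an optional cap across all containers of an action.
    final case class ConcurrencyLimits(
      perContainer: Int = 1,        // parallel HTTP requests into a single container
      perAction: Option[Int] = None // optional cap on concurrent invocations per action
    )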

>So in case a user wants to execute the long-running action, does he
>configure the concurrency limit for the action?

Long running isn't related to concurrency I think.

>
>And if concurrency limit is 1, in case action container is possessed,
>wouldn't controllers request a container again and again?
>And if it only allows container creation in a synchronous
>way (creating one
>by one), couldn't it be a burden in case a user wants a huge number
>of (100~200) simultaneous invocations?

The scaleup model stays exactly the same as today! If you have 200 simultaneous 
invocations (assuming a per-container concurrency limit of 1) we will create 
200 containers to handle that load (given the requests are truly simultaneous 
--> arrive at the same time). Containers are NOT created in a synchronous way 
and there's no need to sequentialize their creation. Does something in the 
proposal hint to that? If so, we should fix that immediately.

No need to apologize, this is great engagement, exactly what we need here. Keep 
it up!

Cheers,
Markus 



Re: Proposal on a future architecture of OpenWhisk

2018-07-18 Thread Dominic Kim
Dear Markus.
Thank you for the quick response.

> In the proposal, the semantics of talking to the container do not change
from what we have today. If the http request fails for whatever reason
while "in-flight", processing cannot be completed, just like if an invoker
crashes. Note that we commit the message immediately after reading it today,
which leads to "at-most-once" semantics. OpenWhisk does not support
"at-least-once" today. To do so, you'd retry from the outside.

Ah yes. Now I remember I wondered why OW doesn't support "at-least-once"
semantics.
This is a question apart from the new architecture, but is this because
a user can execute a non-idempotent action?
So even though an invoker failed, the action could still be executed and
cause some side effects, such as repeating an action which requires
"at-most-once" semantics more than once?


> I don't believe the ContainerManager needs to do that much honestly. In
the Kubernetes case for instance it only asks the Kube API for pods and
then keeps a list of these pods per action. Further it divides this list
whenever a new container is added/removed. I think we can push this quite
far in a master/slave fashion as Brendan mentioned. This is guessing
though, it'll be crucial to measure the throughput that one instance
actually can provide and then decide on whether that's feasible or not.
>
> As its state isn't moving at a super fast pace, we can probably afford to
persist it into something like redis or etcd for the failover to take over
if one dies.
>
> Of course I'm very open for scaling it out horizontally if that's
achievable.

Yes, reducing the requests against the Docker daemon seems the right way to go.
BTW, how long would warmed containers be kept in the new architecture? Is
it 1 or 2 orders of magnitude in seconds?


> Assuming general round-robin (or even random scheduling) in front of the
controllers would even out things to a certain extent, wouldn't it?
>
> Another probably feasible solution is to implement session stickiness or
hashing as you mentioned in today's call. Another comment that was raised
during today's call would come into play there as well: We could change the
container division algorithm to not divide evenly but to only give
containers to those controllers that requested them. In conjunction with
session stickiness, that could yield better load distribution results
(given the session stickiness is smart enough to divide appropriately).

Yes, that could be an option.
I'm concerned it might cause some load imbalance among controllers as well.
And another question comes up: how can we keep sticky sessions for multiple
controllers and for multiple actions respectively?



> Overload of a given container is determined by its concurrency limit.
Today that is "1". If 1 request is active in a container, and that is the
only container, we need more containers. As soon as all containers reached
their maximum concurrency, we need to scale up.

In the new architecture, concurrency limit is controlled by users in a
per-action based way?
So in case a user wants to execute the long-running action, does he
configure the concurrency limit for the action?

And if concurrency limit is 1, in case action container is possessed,
wouldn't controllers request a container again and again?
And if it only allows container creation in a synchronous way (creating one
by one), couldn't it be a burden in case a user wants a huge number
of (100~200) simultaneous invocations?


Please bear with my many questions.
I am also one of the advocates who want to improve and enhance OW in such
a way.
I hope my questions help to build a more refined architecture.

Thanks
Best regards
Dominic

2018-07-19 2:16 GMT+09:00 Markus Thoemmes :

> Hi Dominic,
>
> thanks for your feedback, let's see...
>
> >1. Buffering of activation and failure handling.
> >
> >As of now, Kafka acts as a kind of buffer in case activation
> >processing is
> >a bit delayed due to some reasons such as invoker failure.
> >If Kafka is only used for the overflowing case, how can it guarantee
> >"at
> >least once" activation processing?
> >For example, if a controller receives the requests and it crashed
> >before
> >the activation is complete. How can other alive controllers or the
> >restored
> >controller handle it?
>
> In the proposal, the semantics of talking to the container do not change
> from what we have today. If the http request fails for whatever reason
> while "in-flight", processing cannot be completed, just like if an invoker
> crashes. Note that we commit the message immediately after reading it today,
> which leads to "at-most-once" semantics. OpenWhisk does not support
> "at-least-once" today. To do so, you'd retry from the outside.
>
> >2. A bottleneck in ContainerManager.
> >
>Now the ContainerManager has a lot of critical logic.
> >It takes care of the container lifecycle, logging, scheduling and so
> >on.
> >Also, it should be aware of whole container state as well as
> >controller
> 

Re: Proposal on a future architecture of OpenWhisk

2018-07-18 Thread Markus Thoemmes
Hi Dominic,

thanks for your feedback, let's see...

>1. Buffering of activation and failure handling.
>
>As of now, Kafka acts as a kind of buffer in case activation
>processing is
>a bit delayed due to some reasons such as invoker failure.
>If Kafka is only used for the overflowing case, how can it guarantee
>"at
>least once" activation processing?
>For example, if a controller receives the requests and it crashed
>before
>the activation is complete. How can other alive controllers or the
>restored
>controller handle it?

In the proposal, the semantics of talking to the container do not change from 
what we have today. If the http request fails for whatever reason while 
"in-flight", processing cannot be completed, just like if an invoker crashes. 
Note that we commit the message immediately after reading it today, which leads 
to "at-most-once" semantics. OpenWhisk does not support "at-least-once" today. 
To do so, you'd retry from the outside. 

>2. A bottleneck in ContainerManager.
>
>Now the ContainerManager has a lot of critical logic.
>It takes care of the container lifecycle, logging, scheduling and so
>on.
>Also, it should be aware of whole container state as well as
>controller
>status and distribute containers among alive controllers.
>It might need to do some health checking to/from all controllers and
>containers(may be orchestrator such as k8s).
>
>I think in this case ContainerManager can be a bottleneck as the size
>of
>the cluster grows.
>And if we add more nodes to scale out ContainerManager or to prepare
>for
>the SPOF, then all states of ContainerManager should be shared among
>all
>nodes.
>If we take master/slave approach, the master would become a
>bottleneck at
>some point, and if we take a clustering approach, we need some
>mechanism to
>synchronize the cluster status among ContainerManagers.
>And this procedure should be done in 1 or 2 orders of magnitude in
>milliseconds.
>
>Do you have anything in your mind to handle this?

I don't believe the ContainerManager needs to do that much honestly. In the 
Kubernetes case for instance it only asks the Kube API for pods and then keeps 
a list of these pods per action. Further it divides this list whenever a new 
container is added/removed. I think we can push this quite far in a 
master/slave fashion as Brendan mentioned. This is guessing though, it'll be 
crucial to measure the throughput that one instance actually can provide and 
then decide on whether that's feasible or not.

As its state isn't moving at a super fast pace, we can probably afford to 
persist it into something like redis or etcd for the failover to take over if 
one dies.
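
A sketch of such a failover store, assuming Redis via the Jedis client (the
key layout and types are invented for illustration):

    // One Redis hash per action: field = containerId, value = owning controller.
    // A standby ContainerManager can rebuild its view from these hashes.
    import redis.clients.jedis.Jedis
    import scala.jdk.CollectionConverters._

    final case class ContainerAssignment(action: String, containerId: String, controller: String)

    class AssignmentStore(jedis: Jedis) {
      def save(a: ContainerAssignment): Unit =
        jedis.hset(s"assignments:${a.action}", a.containerId, a.controller)

      def load(action: String): Map[String, String] =
        jedis.hgetAll(s"assignments:$action").asScala.toMap
    }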

Of course I'm very open for scaling it out horizontally if that's achievable.

>3. Imbalance among controllers.
>
>I think there could be some imbalance among controllers.
>For example, there are 3 controllers with 3, 1, and 1 containers
>respectively for the given action.
>In some cases, the 1 container in controller1 might be overloaded but
>the 2 containers in controller2 can be available.
>If the number of containers for the given action belonging to each
>controller
>varies, it could happen more easily.
>This is because controllers are not aware of the status of other
>controllers.
>So in some cases, some action containers are overloaded but the others
>may
>handle just moderate requests.
>Then each controller may request more containers instead of utilizing
>existing(but in other controllers) containers, and this can lead to
>the
>waste of resources.

Assuming general round-robin (or even random scheduling) in front of the 
controllers would even out things to a certain extent, wouldn't it?

Another probably feasible solution is to implement session stickiness or 
hashing as you mentioned in today's call. Another comment that was raised during 
today's call would come into play there as well: We could change the container 
division algorithm to not divide evenly but to only give containers to those 
controllers that requested them. In conjunction with session stickiness, that 
could yield better load distribution results (given the session stickiness is 
smart enough to divide appropriately).

>4. How do controllers determine whether to create more containers?
>
>Let's say, a controller has only one container for the given action.
>How do controllers recognize that this container is overloaded and that
>more containers need to be created?
>If the action execution time is short, it can calculate the number of
>buffered activations for the given action.
>But if the action execution time is long, let's say 1 min or 2 mins,
>then even
>though there is only 1 activation request in the buffer, the
>controller
>needs to create more containers.
>(Because subsequent activation request will be delayed for 1 or
>2mins.)
>Since we cannot know the execution time of action in advance, we may
>need a
>sort of timeout(of activation response) approach for all actions.
>But still, we cannot know how much time of execution are remaining
>for the
>given 

Re: Proposal on a future architecture of OpenWhisk

2018-07-18 Thread Dominic Kim
Dear Markus.
Thank you for the great work!

I think this is a good approach in the big picture.


I have a few questions.

1. Buffering of activation and failure handling.

As of now, Kafka acts as a kind of buffer in case activation processing is
delayed for reasons such as invoker failure.
If Kafka is only used for the overflowing case, how can we guarantee "at
least once" activation processing?
For example, what if a controller receives a request and crashes before the
activation is complete? How can the other alive controllers or the restored
controller handle it?


2. A bottleneck in ContainerManager.

Right now the ContainerManager holds a lot of critical logic.
It takes care of the container lifecycle, logging, scheduling and so on.
Also, it needs to be aware of the whole container state as well as the
controller status, and to distribute containers among the alive controllers.
It might need to do some health checking to/from all controllers and
containers (maybe via an orchestrator such as k8s).

I think in this case the ContainerManager can become a bottleneck as the
size of the cluster grows.
And if we add more nodes to scale out the ContainerManager or to prepare
for the SPOF, then all state of the ContainerManager has to be shared
among all nodes.
If we take a master/slave approach, the master would become a bottleneck
at some point, and if we take a clustering approach, we need some
mechanism to synchronize the cluster status among ContainerManagers.
And this procedure should complete within 1 or 2 orders of magnitude of a
millisecond.

Do you have anything in mind to handle this?


3. Imbalance among controllers.

I think there could be some imbalance among controllers.
For example, say there are 3 controllers with 3, 1, and 1 containers
respectively for the given action.
In some cases, the 1 container in controller1 might be overloaded while 2
containers in controller2 are still available.
If the number of containers for the given action belonging to each
controller varies, this can happen even more easily.
This is because controllers are not aware of the status of other
controllers.
So in some cases, some action containers are overloaded while the others
handle just moderate requests.
Then each controller may request more containers instead of utilizing
existing (but in other controllers) containers, and this can lead to a
waste of resources.


4. How do controllers determine whether to create more containers?

Let's say a controller has only one container for the given action.
How do controllers recognize that this container is overloaded and more
containers need to be created?
If the action execution time is short, they can calculate the number of
buffered activations for the given action.
But if the action execution time is long, say 1 min or 2 mins, then even
though there is only 1 activation request in the buffer, the controller
needs to create more containers.
(Because subsequent activation requests will be delayed for 1 or 2 mins.)
Since we cannot know the execution time of an action in advance, we may
need a sort of timeout (of the activation response) approach for all
actions.
But still, we cannot know how much execution time remains for the given
action after the timeout occurred.
Further, if a user requests 100 or 200 concurrent invocations of a
2-min-long action, all subsequent requests will suffer from the latency
overhead of the timeout.
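
To make the concern concrete, here is a rough sketch of the kind of heuristic I
think we would be forced into (plain Scala; the threshold and all names are
made up):

// Rough sketch of the only signal I can think of: queue depth plus how
// long the oldest buffered activation has been waiting. The 500 ms
// threshold and all names are made up.
final class ScalingHeuristic(maxWaitMs: Long = 500L) {
  private var buffered = Vector.empty[Long] // enqueue timestamps in ms

  def enqueued(nowMs: Long): Unit = buffered = buffered :+ nowMs
  def started(): Unit = buffered = buffered.drop(1)

  /** Request another container once work has waited "too long". */
  def needsMoreContainers(nowMs: Long): Boolean =
    buffered.headOption.exists(nowMs - _ > maxWaitMs)
}

With a 2-min action, the oldest buffered activation trips the threshold long
before the running one completes, so the controller scales out without ever
knowing the real execution time, at the cost of always paying that waiting
time first.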



Thanks
Best regards
Dominic.




2018-07-18 22:45 GMT+09:00 Martin Gencur :

> On 18.7.2018 14:41, Markus Thoemmes wrote:
>
>> Hi Martin,
>>
>> thanks for the great questions :)
>>
>> thinking about scalability and the edge case. When there are not
>>> enough
>>> containers and new controllers are being created, and all of them
>>> redirect traffic to the controllers with containers, doesn't it mean
>>> overloading the available containers a lot? I'm curious how we
>>> throttle the traffic in this case.
>>>
>> True, the first few requests will overload the controller that owns the
>> very first container. That one will request new containers immediately,
>> which will then be distributed to all existing Controllers by the
>> ContainerManager. An interesting wrinkle here is that you'd want the
>> overloading requests to be completed by the Controllers that sent them to
>> the "single-owning-Controller".
>>
>
> Ah, got it. So it is a pretty common scenario: scaling out controllers and
> containers. I thought this was a case where we reach the limit of created
> containers and no more containers can be created.
>
>
>   What we could do here is:
>>
>> Controller0 owns ContainerA1
>> Controller1 relays requests for A to Controller0
>> Controller0 has more requests than it can handle, so it requests
>> additional containers. All requests coming from Controller1 will be
>> completed with a predefined message (for example "HTTP 503 overloaded" with
>> a specific header say "X-Return-To-Sender-By: Controller0")
>> Controller1 recognizes this as "okay, I'll wait for containers to
>> appear", which will eventually happen 

Re: Proposal on a future architecture of OpenWhisk

2018-07-18 Thread Martin Gencur

On 18.7.2018 14:41, Markus Thoemmes wrote:

Hi Martin,

thanks for the great questions :)


thinking about scalability and the edge case. When there are not
enough
containers and new controllers are being created, and all of them
redirect traffic to the controllers with containers, doesn't it mean
overloading the available containers a lot? I'm curious how we
throttle the traffic in this case.

True, the first few requests will overload the controller that owns the very first 
container. That one will request new containers immediately, which will then be 
distributed to all existing Controllers by the ContainerManager. An interesting wrinkle 
here is that you'd want the overloading requests to be completed by the Controllers that 
sent them to the "single-owning-Controller".


Ah, got it. So it is a pretty common scenario: scaling out controllers 
and containers. I thought this was a case where we reach the limit of 
created containers and no more containers can be created.




  What we could do here is:

Controller0 owns ContainerA1
Controller1 relays requests for A to Controller0
Controller0 has more requests than it can handle, so it requests additional containers. All 
requests coming from Controller1 will be completed with a predefined message (for example 
"HTTP 503 overloaded" with a specific header say "X-Return-To-Sender-By: 
Controller0")
Controller1 recognizes this as "okay, I'll wait for containers to appear", 
which will eventually happen (because Controller0 has already requested them) so it can 
route and complete those requests on its own.
Controller1 will now no longer relay requests to Controller0 but will request 
containers itself (acknowledging that Controller0 is already overloaded).


Yeah, I think it makes sense.




I guess the other approach would be to block creating new controllers
when there are no containers available as long as we don't want to
overload the existing containers. And keep the overflowing workload
in Kafka as well.

Right, the second possibility is to use a pub/sub (not necessarily Kafka) queue 
between Controllers. Controller0 subscribes to a topic for action A because it 
owns a container for it. Controller1 doesn't own a container (yet) and 
publishes a message as overflow to topic A. The wrinkle in this case is that 
Controller0 can't complete the request but needs to send it back to Controller1 
(where the client's HTTP connection is open).

Does that make sense?


I was rather thinking about blocking the creation of Controller1 in this 
case and responding to the client that the system is overloaded. But the 
first approach seems better because it handles a pretty common use case 
(not reaching the limit of created containers).


Thanks!
Martin



Cheers,
Markus





Re: Proposal on a future architecture of OpenWhisk

2018-07-18 Thread Markus Thoemmes
Hi Martin,

thanks for the great questions :)

>thinking about scalability and the edge case. When there are not
>enough 
>containers and new controllers are being created, and all of them 
>redirect traffic to the controllers with containers, doesn't it mean 
>overloading the available containers a lot? I'm curious how we
>throttle the traffic in this case.

True, the first few requests will overload the controller that owns the very 
first container. That one will request new containers immediately, which will 
then be distributed to all existing Controllers by the ContainerManager. An 
interesting wrinkle here is that you'd want the overloading requests to be 
completed by the Controllers that sent them to the "single-owning-Controller". 
What we could do here is:

Controller0 owns ContainerA1
Controller1 relays requests for A to Controller0
Controller0 has more requests than it can handle, so it requests additional 
containers. All requests coming from Controller1 will be completed with a 
predefined message (for example "HTTP 503 overloaded" with a specific header 
say "X-Return-To-Sender-By: Controller0")
Controller1 recognizes this as "okay, I'll wait for containers to appear", 
which will eventually happen (because Controller0 has already requested them) 
so it can route and complete those requests on its own.
Controller1 will now no longer relay requests to Controller0 but will request 
containers itself (acknowledging that Controller0 is already overloaded).
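
A minimal sketch of that handshake from Controller1's side (illustrative only; 
the Response type is made up and just the header name follows the example 
above):

// Minimal sketch of the handshake from Controller1's perspective. The
// Response type is made up; only the header name follows the example above.
final case class Response(status: Int, headers: Map[String, String])

object RelayProtocol {
  val ReturnToSender = "X-Return-To-Sender-By"

  sealed trait NextStep
  case object PassThrough extends NextStep          // complete the request normally
  case object WaitForOwnContainers extends NextStep // owner is overloaded: stop relaying

  /** Interpret a response that came back from the owning controller. */
  def onRelayedResponse(response: Response): NextStep =
    if (response.status == 503 && response.headers.contains(ReturnToSender))
      WaitForOwnContainers
    else
      PassThrough
}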

>
>I guess the other approach would be to block creating new controllers
>when there are no containers available as long as we don't want to 
>overload the existing containers. And keep the overflowing workload
>in Kafka as well.

Right, the second possibility is to use a pub/sub (not necessarily Kafka) queue 
between Controllers. Controller0 subscribes to a topic for action A because it 
owns a container for it. Controller1 doesn't own a container (yet) and 
publishes a message as overflow to topic A. The wrinkle in this case is that 
Controller0 can't complete the request but needs to send it back to Controller1 
(where the client's HTTP connection is open).
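
To illustrate the flow, a toy in-memory stand-in for that queue (not a real MQ 
client, purely illustrative):

// Toy in-memory stand-in for the pub/sub queue, just to illustrate the
// flow: Controller0 subscribes to action A's topic because it owns a
// container; Controller1 publishes its overflow there.
import scala.collection.mutable

object OverflowBus {
  private val subscribers = mutable.Map.empty[String, List[String => Unit]]

  def subscribe(topic: String)(handler: String => Unit): Unit =
    subscribers(topic) = handler :: subscribers.getOrElse(topic, Nil)

  def publish(topic: String, activation: String): Unit =
    subscribers.getOrElse(topic, Nil).foreach(_(activation))

  def main(args: Array[String]): Unit = {
    subscribe("action-A")(a => println(s"Controller0 executes: $a"))
    publish("action-A", "overflow activation from Controller1")
  }
}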

Does that make sense?

Cheers,
Markus



Re: Proposal on a future architecture of OpenWhisk

2018-07-18 Thread Martin Gencur

Hi Markus,
thinking about scalability and the edge case. When there are not enough 
containers and new controllers are being created, and all of them 
redirect traffic to the controllers with containers, doesn't it mean 
overloading the available containers a lot? I'm curious how we throttle 
the traffic in this case.


I guess the other approach would be to block creating new controllers 
when there are no containers available as long as we don't want to 
overload the existing containers. And keep the overflowing workload in 
Kafka as well.


Thanks,
Martin Gencur
QE, Red Hat

On 13.7.2018 19:29, Markus Thoemmes wrote:

Hello OpenWhiskers,

I just published a proposal on a potential future architecture for OpenWhisk 
that aligns deployments with and without an underlying container orchestrator 
like Mesos or Kubernetes. It also incorporates some of the proposals that are 
already out there and tries to give a holistic view of where we want OpenWhisk 
to go in the near future. It's designed to keep the APIs stable but is very 
invasive in its changes under the hood.

This proposal is the outcome of a lot of discussions with fellow colleagues and 
community members. It is based on experience with the problems the current 
architecture has. Moreover it aims to remove friction with the deployment 
topologies on top of a container orchestrator.

Feedback is very very very welcome! The proposal has some gaps and generally 
does not go into much detail implementation-wise. I'd love to see all those gaps 
filled by the community!

Find the proposal here: 
https://cwiki.apache.org/confluence/display/OPENWHISK/OpenWhisk+future+architecture

Cheers,
Markus





Re: Proposal on a future architecture of OpenWhisk

2018-07-17 Thread TzuChiao Yeh
Hi Markus,

Yes, I agree that storing activation records should be a separate
discussion. Piping activation records into a logging system (Elasticsearch,
Kibana) would be cool!

That's not quite what I was asking though; thanks for pointing these out,
they look interesting.

I think there was some misunderstanding on my part. Originally, I was
considering edge cases where an invoker fails while responding with the
active-ack, but there's no recovery/retry logic for that right now (hence
the "best-effort" guarantee). Whether to support stronger execution
guarantees may not be a topic for this thread, but the mechanism will
indeed differ depending on whether we bypass Kafka or not.

Thanks for answering me anyway,
Tzuchiao


On Tue, Jul 17, 2018 at 4:49 PM Markus Thoemmes 
wrote:

> Hi Tzu-Chiao,
>
> great questions, although I'd relay those into a separate discussion. The
> design proposed does not intend to change the way we provide
> observability via persisting activation records. The controller takes
> that responsibility in the design.
>
> It is fair to open a discussion on what our plans for the activation
> record itself are though, in the future. There is a lot of work going on in
> that area currently, with Vadim implementing user-facing metrics (which can
> serve as part of what activation records do) and James implementing
> different ActivationStores with the intention of eventually moving
> activation records to the logging system.
>
> Another angle here is that both of these solutions drop persistence of the
> activation result by default, since it is potentially a large blob.
> Persisting logs into CouchDB doesn't really scale either, so there are a
> couple of LogStores to shift that burden away. What remains is largely a
> small, bounded record of some metrics per activation. I'll be happy to see
> a separate proposal + discussion on where we want to take this in the
> future :)
>
> Cheers,
> Markus
>
>


Re: Proposal on a future architecture of OpenWhisk

2018-07-17 Thread Markus Thoemmes
Hi Tzu-Chiao,

great questions, although I'd relay those into a separate discussion. The design 
proposed does not intend to change the way we provide observability via 
persisting activation records. The controller takes that responsibility in the 
design.

It is fair to open a discussion on what our plans for the activation record 
itself are though, in the future. There is a lot of work going on in that area 
currently, with Vadim implementing user-facing metrics (which can serve as part 
of what activation records do) and James implementing different 
ActivationStores with the intention of eventually moving activation records to 
the logging system.

Another angle here is that both of these solutions drop persistence of the 
activation result by default, since it is potentially a large blob. Persisting 
logs into CouchDB doesn't really scale either, so there are a couple of 
LogStores to shift that burden away. What remains is largely a small, bounded 
record of some metrics per activation. I'll be happy to see a separate proposal 
+ discussion on where we want to take this in the future :)

Cheers,
Markus



Re: Proposal on a future architecture of OpenWhisk

2018-07-17 Thread TzuChiao Yeh
Hi Markus,

Awesome work! Thanks for doing this.

One simple question here: since actions are invoked directly via HTTP calls,
do we still persist activations (i.e. duplicate activations into some
storage)? Since we already provide "best-effort" invocation for users, I'm
not sure persistence is still worth doing. Or maybe we can provide some
guarantee options in the future?

Thanks,
Tzu-Chiao Yeh (@tz70s)


On Tue, Jul 17, 2018 at 12:42 AM Markus Thoemmes 
wrote:

> Hi Chetan,
>
> > Hi Thomas,
>
> It's Markus Thömmes/Thoemmes respectively :)
>
> > Is this round-robin routing per namespace + action name URL, or for any
> > URL? E.g. if we have controllers c1-c3 and requests come in the order
> > a1, a2, a3, a1, which controller would handle which action here?
>
> It's for any URL. I'm not sure the general front-door (nginx in our case)
> supports keyed round-robin/least-connected. For sanity, I basically assume
> that every request can land on any controller with no control of how that
> might happen.
>
> Cheers,
> Markus
>
>


Re: Proposal on a future architecture of OpenWhisk

2018-07-16 Thread Markus Thoemmes
Hi Chetan,

> Hi Thomas,

It's Markus Thömmes/Thoemmes respectively :)

> Is this round-robin routing per namespace + action name URL, or for any
> URL? E.g. if we have controllers c1-c3 and requests come in the order
> a1, a2, a3, a1, which controller would handle which action here?

It's for any URL. I'm not sure the general front-door (nginx in our case) 
supports keyed round-robin/least-connected. For sanity, I basically assume that 
every request can land on any controller with no control of how that might 
happen.

Cheers,
Markus



Re: Proposal on a future architecture of OpenWhisk

2018-07-16 Thread Chetan Mehrotra
Hi Thomas,

The proposal looks good and consolidates various ideas discussed so far.
I'll have a closer look. A quick query for now:

> Since the front-door schedules round-robin or least-connected

Is this round-robin routing per namespace + action name URL, or for any
URL? E.g. if we have controllers c1-c3 and requests come in the order
a1, a2, a3, a1, which controller would handle which action here?
Chetan Mehrotra


On Fri, Jul 13, 2018 at 10:59 PM, Markus Thoemmes
 wrote:
> Hello OpenWhiskers,
>
> I just published a proposal on a potential future architecture for OpenWhisk 
> that aligns deployments with and without an underlying container orchestrator 
> like Mesos or Kubernetes. It also incorporates some of the proposals that are 
> already out there and tries to give a holistic view of where we want 
> OpenWhisk to go in the near future. It's designed to keep the APIs stable 
> but is very invasive in its changes under the hood.
>
> This proposal is the outcome of a lot of discussions with fellow colleagues 
> and community members. It is based on experience with the problems the 
> current architecture has. Moreover it aims to remove friction with the 
> deployment topologies on top of a container orchestrator.
>
> Feedback is very very very welcome! The proposal has some gaps and generally 
> does not go into much detail implementation-wise. I'd love to see all those 
> gaps filled by the community!
>
> Find the proposal here: 
> https://cwiki.apache.org/confluence/display/OPENWHISK/OpenWhisk+future+architecture
>
> Cheers,
> Markus
>


Proposal on a future architecture of OpenWhisk

2018-07-13 Thread Markus Thoemmes
Hello OpenWhiskers,

I just published a proposal on a potential future architecture for OpenWhisk 
that aligns deployments with and without an underlying container orchestrator 
like Mesos or Kubernetes. It also incorporates some of the proposals that are 
already out there and tries to give a holistic view of where we want OpenWhisk 
to go in the near future. It's designed to keep the APIs stable but is very 
invasive in its changes under the hood.

This proposal is the outcome of a lot of discussions with fellow colleagues and 
community members. It is based on experience with the problems the current 
architecture has. Moreover it aims to remove friction with the deployment 
topologies on top of a container orchestrator.

Feedback is very very very welcome! The proposal has some gaps and generally 
does not go into much detail implementation-wise. I'd love to see all those gaps 
filled by the community!

Find the proposal here: 
https://cwiki.apache.org/confluence/display/OPENWHISK/OpenWhisk+future+architecture

Cheers,
Markus