Hi -

Following up on this: I’ve been working on a PR. One issue I’ve run into (which may be problematic in other scheduling scenarios as well) is that the scheduling in LoadBalancerService doesn’t respect the asynchronous nature of activation counting in LoadBalancerData. At least I think that’s a fair description of it.

Specifically, I am creating a test that submits activations via LoadBalancer.publish, and I end up with 10 activations scheduled on invoker0 even though I use an invokerBusyThreshold of 1. I believe this only occurs when concurrent (or very closely spaced) requests arrive at the same controller; otherwise the counts sync up quickly enough. I’ll keep working on the test. Assuming the async counters are indeed the problem, I’m not exactly sure how to deal with it. Some options:

- change LoadBalancer to an actor, so that local counter state is easier to manage (the counts would still need to replicate, but at least locally the scheduling would do the right thing)
- coordinate the schedule + setupActivation calls to also rely on some local state for activations that should be counted but have not yet been processed by LoadBalancerData

A rough sketch of the second option is below.
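Something like this is what I have in mind for the local bookkeeping - purely illustrative; LocalPendingCounts, tryReserve and release are names I made up, not existing OpenWhisk code:

    // Sketch only: a strictly local, synchronously updated count of activations that have
    // been scheduled but are not yet reflected in LoadBalancerData's async counters.
    import scala.collection.mutable

    class LocalPendingCounts(invokerBusyThreshold: Int) {
      private val pending = mutable.Map.empty[String, Int].withDefaultValue(0)

      // Atomically reserve a slot on `invoker`; returns false if it is already at the threshold.
      def tryReserve(invoker: String): Boolean = synchronized {
        if (pending(invoker) >= invokerBusyThreshold) false
        else {
          pending(invoker) += 1
          true
        }
      }

      // Called once LoadBalancerData has caught up (or the activation completed / timed out).
      def release(invoker: String): Unit = synchronized {
        pending(invoker) = math.max(0, pending(invoker) - 1)
      }
    }

schedule() would consult both the replicated counts and this local pending count before picking an invoker. With option 1 (an actor-based LoadBalancer) the same bookkeeping falls out naturally, since the actor serializes the updates.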
Any suggestions in this area would be great.

Thanks
Tyson

> On Oct 6, 2017, at 11:04 AM, Tyson Norris <tnor...@adobe.com.INVALID> wrote:
>
> With many invokers, there is less data exposed to rebalancing operations, since the invoker topics will only ever receive as many activations as can be processed “immediately” (currently set to 16). The single backlog topic would only be consumed by the controller (not by any invoker), and the invokers would only consume their respective “process immediately” topics - which effectively have no, or very little, backlog (16 max). My suggestion is that having multiple backlogs is an unnecessary problem, regardless of how many invokers there are.
>
> It is worth noting the case of multiple controllers as well, where multiple controllers may be processing the same backlog topic. I don’t think this should cause any more trouble than the distributed activation counting that should be enabled via controller clustering, but it may mean that if one controller enters the overflow state, it should signal that ALL controllers are now in the overflow state, etc.
>
> Regarding “timeout”, I would plan to use the existing timeout mechanism, where an ActivationEntry is created immediately, regardless of whether the activation is going to be processed right away or added to the backlog. At the time the backlog message is processed, if the entry has timed out, throw it away. (The entry map may need to be shared in the case where multiple invokers are in use and they all consume from the same topic; alternatively, we can partition the topic so that entries are only processed by the controller that backlogged them.)
>
> Yes, once invokers are saturated and backlogging begins, I think all incoming activations should be sent straight to the backlog (we already know that no invokers are available). This should not hurt overall performance any more than it currently does, and it should be better (since the first available invoker can start taking work, instead of waiting for a specific invoker to become available).
>
> I’m working on a PR; I think many of these details will come out there, but in the meantime, let me know if any of this doesn’t make sense.
>
> Thanks
> Tyson
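(A quick illustration of the timeout handling described above - BacklogEntry and onBacklogMessage are names invented for the sketch, not the real ActivationEntry API:)

    import java.time.Instant

    // Sketch: the entry is created up front; when the backlog message is finally
    // consumed (i.e. capacity is available), anything already timed out is dropped.
    case class BacklogEntry(activationId: String, deadline: Instant)

    def onBacklogMessage(entry: BacklogEntry, scheduleNow: String => Unit): Unit = {
      if (Instant.now().isAfter(entry.deadline)) {
        // The caller has already received a timeout response; just discard the work.
      } else {
        scheduleNow(entry.activationId)
      }
    }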
> On Oct 5, 2017, at 2:49 PM, David P Grove <gro...@us.ibm.com> wrote:
>
> > I can see the value in delaying the binding of activations to invokers when the system is loaded (can't execute "immediately" on its target invoker).
> >
> > Perhaps in ignorance, I am a little worried about the scalability of a single backlog topic. With a few hundred invokers, it seems like we'd be exposed to frequent and expensive partition rebalancing operations as invokers crash/restart. Maybe if we have N = K*M invokers, we can get away with M backlog topics, each being read by K invokers. We could still get imbalance across the different backlog topics, but it might be good enough.
> >
> > I think we'd also need to do some thinking about how to ensure that work put in a backlog topic doesn't languish there for a really long time. Once we start having work in the backlog, do we need to stop putting work in the "immediately" topics? If we do, that could hurt overall performance. If we don't, how will the backlog topic ever get drained if most invokers are kept busy servicing their "immediately" topics?
> >
> > --dave
> >
> > Tyson Norris ---10/04/2017 07:45:38 PM---Hi - I’ve been discussing a bit with a few about optimizing the queueing that goes on ahead of invok
> >
> > From: Tyson Norris <tnor...@adobe.com.INVALID>
> > To: "dev@openwhisk.apache.org" <dev@openwhisk.apache.org>
> > Date: 10/04/2017 07:45 PM
> > Subject: Invoker activation queueing proposal
> >
> > ________________________________
> >
> > Hi -
> >
> > I’ve been discussing a bit with a few people about optimizing the queueing that goes on ahead of the invokers, so that things behave more simply and predictably.
> >
> > In short: instead of scheduling activations to an invoker on receipt, do the following:
> >
> > - execute the activation “immediately” if capacity is available
> > - provide a single overflow topic for activations that cannot execute “immediately”
> > - schedule from the overflow topic when capacity is available
> >
> > (BTW “immediately” means: still queued via the existing invoker topics, but ONLY queued there when the invoker is not fully loaded and therefore should execute it “very soon”.)
> >
> > Later: it would also be good to provide more container state data from invoker to controller, to get better scheduling options - e.g. if some invokers can handle running more containers than other invokers, that info can be used to avoid over/under-loading the invokers (currently we assume each invoker can handle 16 activations, I think).
> >
> > I put a wiki page proposal here:
> > https://cwiki.apache.org/confluence/display/OPENWHISK/Invoker+Activation+Queueing+Change
> >
> > WDYT?
> >
> > Thanks
> > Tyson
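PS - to make the publish path from the proposal above a bit more concrete, here is roughly the decision flow I have in mind. All of the names below are invented for the sketch; this is not the current LoadBalancerService code:

    // Sketch: send to an invoker topic only if some invoker has capacity right now;
    // otherwise send to the single overflow topic. Once overflowing, everything goes
    // to the overflow topic until it drains, so the first free invoker takes the oldest work.
    def publishActivation(activation: String,
                          overflowing: Boolean,
                          pickFreeInvoker: () => Option[String],
                          sendToInvokerTopic: (String, String) => Unit,
                          sendToOverflowTopic: String => Unit): Boolean = {
      if (!overflowing) {
        pickFreeInvoker() match {
          case Some(invoker) =>
            sendToInvokerTopic(invoker, activation) // "immediately" path
            false                                   // still not overflowing
          case None =>
            sendToOverflowTopic(activation)         // no capacity: start backlogging
            true                                    // enter overflow state
        }
      } else {
        sendToOverflowTopic(activation)             // already overflowing: straight to backlog
        true
      }
    }

The overflow consumer would be the mirror image: it only polls the overflow topic when an invoker slot is actually free, and it clears the overflow state once the topic is drained.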