2020-10-21 18:31:01 UTC - Brendan Doyle: We're having issues with two high-scale functions mapping to the same home invoker, resulting in lots of container recreations because it's swapping between the two functions. I'm trying to see if we can tinker with any configs that might help. I see that `pauseGrace` defaults to 50 milliseconds. My theory is that if I increase this it essentially puts a lock on the container while waiting for another run of the function, so it should cause fewer swaps and just more containers should fall over to non-home invokers for these two functions. Does that understanding check out? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603305061114600?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 18:35:13 UTC - Brendan Doyle: And follow up, have any operators out there changed this default and hit important negative side effects I should know about before playing with it? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603305313114700?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 18:37:01 UTC - Dave Grove: Does anyone have recent experience using Zipkin or a similar tracing tool with OpenWhisk? I found @James Thomas’s project (<https://github.com/jthomas/zipkin-instrumentation-openwhisk>), but since it was from 2017 I wasn’t sure if that was a good place to start, or if there was something newer. https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603305421116300?thread_ts=1603305421.116300&cid=C3TPCAQG1 ---- 2020-10-21 18:40:29 UTC - Dave Grove: I believe `pauseGrace` only actually does anything with the DockerContainerFactory. It’s basically how long the invoker should allow an idle container to run before doing a `docker pause` on it. The motivation is to prevent clever users from sneakily executing background computation between billable foreground invocations of functions in a container. 
https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603305629116400?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 18:41:27 UTC - Brendan Doyle: Yea I'm looking more closely at the code now. I was hoping that it wouldn't attempt to remove the container if it wasn't paused so it acts as a pseudo lock, but that doesn't seem to be the case https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603305687116600?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 18:41:39 UTC - Dave Grove: With the KubernetesContainerFactory, we don’t have the same ability to do `docker pause` and `docker unpause` on containers, so although you can set this to different values it won’t do anything. https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603305699116800?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 18:46:44 UTC - parichehr vahidinia: how to set different values for _idle-container_ and _pause-grace_ with the _kubernetesContainerFactory_ and also _DockerContainerFactory?_ https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603306004118000?thread_ts=1603306004.118000&cid=C3TPCAQG1 ---- 2020-10-21 19:39:14 UTC - Dave Grove: The general trick for overriding the default values that are set in the various .conf files is to define environment variables that start with CONFIG_ in the invoker/controller pods. For example, there is a property `whisk.loadbalancer.blackboxFraction.` You set the environment variable `CONFIG_whisk_loadbalancer_blackboxFraction` to set a different value. If you look into invoker-pod.yaml and controller-pod.yaml in the OpenWhisk helm chart you will see quite a few examples of this being done. +1 : parichehr vahidinia https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603309154118300?thread_ts=1603306004.118000&cid=C3TPCAQG1 ---- 2020-10-21 20:22:55 UTC - Rodric Rabbah: @Dave Grove since you answered the question here do you want to post it to stackoverflow? 
https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603311775118900?thread_ts=1603261638.111000&cid=C3TPCAQG1 ---- 2020-10-21 20:25:13 UTC - parichehr vahidinia: @Dave Grove any ideas on this? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603311913119100?thread_ts=1602925036.090800&cid=C3TPCAQG1 ---- 2020-10-21 20:27:14 UTC - Rodric Rabbah: for internal tracing? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603312034119400?thread_ts=1603305421.116300&cid=C3TPCAQG1 ---- 2020-10-21 20:31:43 UTC - Rodric Rabbah: the pause grace does allow a container to stay unpaused longer, and if there is another activation in the corresponding invoker’s q that can use that container, it makes the container more likely to be reused (vs starting another container, or having to unpause a container) https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603312303119700?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 20:32:32 UTC - Rodric Rabbah: it’s not a lock in the way you thought about it https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603312352119900?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 20:35:09 UTC - Brendan Doyle: interesting, what makes it more likely to be reused? I'm working through the code right now and it doesn't seem like container pool has knowledge of whether it's paused or not.
But yea our issue is actual container removals from one function for another when the invoker is full and then just swapping back and forth between them. Any ideas on how we might be able to mitigate that before the new pull based scheduler? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603312509120100?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 20:35:52 UTC - Rodric Rabbah: it delays the state transition from running to paused https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603312552120400?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 20:36:19 UTC - Rodric Rabbah: if you find the actor/state machine that manages the container life cycle, there should be one transition that’s affected / delayed https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603312579120600?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 20:36:35 UTC - Rodric Rabbah: add another invoker :slightly_smiling_face: https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603312595120800?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 20:36:47 UTC - Rodric Rabbah: are the functions from the same user? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603312607121000?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 20:37:28 UTC - Rodric Rabbah: you might be better off looking at the invoker hashing in the load balancer https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603312648121200?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 20:37:43 UTC - Rodric Rabbah: this is a performance pathology with the current scheduler, unfortunately https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603312663121400?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 20:38:26 UTC - Brendan Doyle: we have invoker space in the fleet, the problem is the two functions hash to the same invoker. But yea, it should redistribute the hashes with a new invoker, though we'd still be vulnerable. 
Our current resolution is bringing down an invoker to rehash things ha https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603312706121700?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 20:38:52 UTC - Rodric Rabbah: :face_palm: https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603312732121900?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 20:40:12 UTC - Brendan Doyle: yea I'm curious if we could do something fancy with the invoker hashing quickly in the load balancer that's not a big change until we have the new scheduler https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603312812122100?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 20:40:15 UTC - Brendan Doyle: I'll look into that https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603312815122300?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 20:45:40 UTC - Brendan Doyle: Follow up question, when we run into this issue we get a ton of the `s"Rescheduling Run message, too many message in the pool, "` logs. I'm wondering if there's anything else I can deduce that we could help with configurations. Does blowing up the runBuffer cause things to significantly slow down? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603313140122500?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 20:46:00 UTC - parichehr vahidinia: @Dave Grove Excuse me, I am a novice. Can you please point me to a document that explains how to set environment variables? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603313160122700?thread_ts=1603306004.118000&cid=C3TPCAQG1 ---- 2020-10-21 20:46:20 UTC - Rodric Rabbah: i don’t know - this message used to indicate a bug but there’s been changes in that area of the scheduler which i haven’t followed https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603313180122900?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 20:47:14 UTC - Brendan Doyle: gotcha thanks! 
do you happen to have any idea what the bug used to be? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603313234123100?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 20:48:30 UTC - Rodric Rabbah: i could be misleading you because it’s been a while --- it used to mean the state of the resource table (container allocation) was not consistent with the pipeline that feeds the invoker: pulled one too many messages and until a container is free to reconcile the state, you’ll get that message printed https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603313310123300?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 20:49:10 UTC - Brendan Doyle: yea that sounds about right https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603313350123500?thread_ts=1603305061.114600&cid=C3TPCAQG1 ---- 2020-10-21 21:55:37 UTC - Dave Grove: From the context, I’m guessing the ask is really for user-level tracing of actions, but references to either internal or external usage could be helpful https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603317337123700?thread_ts=1603305421.116300&cid=C3TPCAQG1 ---- 2020-10-21 22:12:26 UTC - Dave Grove: There is unfortunately not a lot of written documentation. You can imitate the way it is done for other environment variables by editing the .yaml files for the controller or invoker pods. For example, <https://github.com/apache/openwhisk-deploy-kube/blob/master/helm/openwhisk/templates/invoker-pod.yaml#L192-#L193> and <https://github.com/apache/openwhisk-deploy-kube/blob/master/helm/openwhisk/values.yaml#L272> https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603318346123900?thread_ts=1603306004.118000&cid=C3TPCAQG1 ---- 2020-10-21 22:14:03 UTC - parichehr vahidinia: thank you :hibiscus: https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1603318443124200?thread_ts=1603306004.118000&cid=C3TPCAQG1 ----
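Note on the `CONFIG_` override trick from the thread above: going only by the one example given (`whisk.loadbalancer.blackboxFraction` becomes `CONFIG_whisk_loadbalancer_blackboxFraction`), the naming rule can be sketched roughly as below. This is a minimal illustration, not OpenWhisk code. The helper names `config_env_name` and `env_entry` and the value `0.25` are made up for the example, and any extra handling OpenWhisk may apply to property names (for instance around letter case or hyphens) is not modeled here.

```python
def config_env_name(property_path: str) -> str:
    """Map a dotted property path to a CONFIG_-prefixed environment variable name.

    Assumption: only the dot-to-underscore rule visible in the thread's
    example is applied; nothing else about the name is changed.
    """
    return "CONFIG_" + property_path.replace(".", "_")


def env_entry(property_path: str, value) -> dict:
    """Build a name/value pair in the shape of a Kubernetes container `env` item.

    Environment variable values in a pod spec are strings, so the value is
    converted with str().
    """
    return {"name": config_env_name(property_path), "value": str(value)}


if __name__ == "__main__":
    # Property name taken from the thread; 0.25 is an arbitrary illustrative value.
    print(config_env_name("whisk.loadbalancer.blackboxFraction"))
    print(env_entry("whisk.loadbalancer.blackboxFraction", 0.25))
```

An entry shaped like the second printed value is what would go under `env:` for the invoker or controller container, in the same way as the existing `CONFIG_` entries in the invoker-pod.yaml and controller-pod.yaml templates linked above.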