bdoyle0182 commented on code in PR #5320: URL: https://github.com/apache/openwhisk/pull/5320#discussion_r959200114
########## core/scheduler/src/main/scala/org/apache/openwhisk/core/scheduler/container/ContainerManager.scala: ########## @@ -144,6 +144,11 @@ class ContainerManager(jobManagerFactory: ActorRefFactory => ActorRef, logging.info(this, s"received ${msgs.size} creation message [${msgs.head.invocationNamespace}:${msgs.head.action}]") ContainerManager .getAvailableInvokers(etcdClient, memory, invocationNamespace) + .recover({ + case t: Throwable => + logging.error(this, s"Unable to get available invokers: ${t.getMessage}.") + List.empty[InvokerHealth] + }) Review Comment: A request is done to etcd to get the list of healthy invokers. If the future for the request to etcd fails for any reason, there was no failure handing on that call prior to this. This function is just a synchronous `unit` so if there's no failure handling on that future it just completes and never makes an acknowledgement back to the memory queue that the creation message has been processed either successfully or failed for the memory queue to properly decrement `creationIds`. Also should clarify if it silently fails at this point, it hasn't yet registered the creation job so the akka timer to timeout the creation job to make the call back to MemoryQueue on timeout is not created. The memory queue indefinitely thinks that there is a container creation in progress. If the action never needs more than one container, it will never be able to execute because the memory queue thinks one is in progress. Also the memory queue can not be stopped on timeout in this case because of the `cr eationIds` not being 0. ``` else { logging.info( this, s"[$invocationNamespace:$action:$stateName] The queue is timed out but there are still ${queue.size} activation messages or (running: ${containers.size}, in-progress: ${creationIds.size}) containers") stay } ``` So while I think this covers the only edge case I know of, we really need an additional safeguard in `MemoryQueue` to eventually clear out knowledge of in progress containers if things get out of sync as there's no way to guarantee creationIds is perfectly in sync when it's dependent essentially on a fire and forget successfully making the callback at some point which is prone to introducing bugs even if this pr covers every case for now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@openwhisk.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org