bdoyle0182 commented on code in PR #5320:
URL: https://github.com/apache/openwhisk/pull/5320#discussion_r959200114


##########
core/scheduler/src/main/scala/org/apache/openwhisk/core/scheduler/container/ContainerManager.scala:
##########
@@ -144,6 +144,11 @@ class ContainerManager(jobManagerFactory: ActorRefFactory 
=> ActorRef,
     logging.info(this, s"received ${msgs.size} creation message 
[${msgs.head.invocationNamespace}:${msgs.head.action}]")
     ContainerManager
       .getAvailableInvokers(etcdClient, memory, invocationNamespace)
+      .recover({
+        case t: Throwable =>
+          logging.error(this, s"Unable to get available invokers: 
${t.getMessage}.")
+          List.empty[InvokerHealth]
+      })

Review Comment:
   A request is done to etcd to get the list of healthy invokers. If the future 
for the request to etcd fails for any reason, there was no failure handing on 
that call prior to this. This function is just a synchronous `unit` so if 
there's no failure handling on that future it just completes and never makes an 
acknowledgement back to the memory queue that the creation message has been 
processed either successfully or failed for the memory queue to properly 
decrement `creationIds`. The memory queue indefinitely thinks that there is a 
container creation in progress. If the action never needs more than one 
container, it will never be able to execute because the memory queue thinks one 
is in progress. Also the memory queue can not be stopped on timeout in this 
case because of the `creationIds` not being 0.
   
   ```else {
           logging.info(
             this,
             s"[$invocationNamespace:$action:$stateName] The queue is timed out 
but there are still ${queue.size} activation messages or (running: 
${containers.size}, in-progress: ${creationIds.size}) containers")
           stay
         }
         ```
         
         So while I think this covers the only edge case I know of, we really 
need an additional safeguard in `MemoryQueue` to eventually clear out knowledge 
of in progress containers if things get out of sync as there's no way to 
guarantee creationIds is perfectly in sync when it's dependent essentially on a 
fire and forget successfully making the callback at some point which is prone 
to introducing bugs even if this pr covers every case for now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@openwhisk.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to