[GitHub] [openwhisk] bdoyle0182 commented on a diff in pull request #5320: add error handling to container manager when invoker query fails

GitBox Tue, 30 Aug 2022 23:27:19 -0700


bdoyle0182 commented on code in PR #5320:
URL: https://github.com/apache/openwhisk/pull/5320#discussion_r959200114



##########
core/scheduler/src/main/scala/org/apache/openwhisk/core/scheduler/container/ContainerManager.scala:
##########
@@ -144,6 +144,11 @@ class ContainerManager(jobManagerFactory: ActorRefFactory 
=> ActorRef,
     logging.info(this, s"received ${msgs.size} creation message 
[${msgs.head.invocationNamespace}:${msgs.head.action}]")
     ContainerManager
       .getAvailableInvokers(etcdClient, memory, invocationNamespace)
+      .recover({
+        case t: Throwable =>
+          logging.error(this, s"Unable to get available invokers: 
${t.getMessage}.")
+          List.empty[InvokerHealth]
+      })

Review Comment:
   A request is done to etcd to get the list of healthy invokers. If the future 
for the request to etcd fails for any reason, there was no failure handing on 
that call prior to this. This function is just a synchronous `unit` so if 
there's no failure handling on that future it just completes and never makes an 
acknowledgement back to the memory queue that the creation message has been 
processed either successfully or failed for the memory queue to properly 
decrement `creationIds`. Also should clarify if it silently fails at this 
point, it hasn't yet registered the creation job so the akka timer to timeout 
the creation job to make the call back to MemoryQueue on timeout is not 
created. The memory queue indefinitely thinks that there is a container 
creation in progress. If the action never needs more than one container, it 
will never be able to execute because the memory queue thinks one is in 
progress. Also the memory queue can not be stopped on timeout in this case 
because of the `cr
 eationIds` not being 0.
   
   
   ```
   else {
           logging.info(
             this,
             s"[$invocationNamespace:$action:$stateName] The queue is timed out 
but there are still ${queue.size} activation messages or (running: 
${containers.size}, in-progress: ${creationIds.size}) containers")
           stay
         }
   ```
   
         
   So while I think this covers the only edge case I know of, we really need an 
additional safeguard in `MemoryQueue` to eventually clear out knowledge of in 
progress containers if things get out of sync as there's no way to guarantee 
creationIds is perfectly in sync when it's dependent essentially on a fire and 
forget successfully making the callback at some point which is prone to 
introducing bugs even if this pr covers every case for now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@openwhisk.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [openwhisk] bdoyle0182 commented on a diff in pull request #5320: add error handling to container manager when invoker query fails

Reply via email to