bdoyle0182 commented on issue #5325: URL: https://github.com/apache/openwhisk/issues/5325#issuecomment-1247467187
Actually, it's a bit simpler: this is an unhandled failure case in the paused state, since this is the only place in the FunctionPullingContainerProxy that could return an etcd error.

```
case Event(StateTimeout, data: WarmData) =>
  (for {
    count <- getLiveContainerCount(data.invocationNamespace, data.action.fullyQualifiedName(false), data.revision)
    (warmedContainerKeepingCount, warmedContainerKeepingTimeout) <- getWarmedContainerLimit(
      data.invocationNamespace)
  } yield {
    logging.info(
      this,
      s"Live container count: ${count}, warmed container keeping count configuration: ${warmedContainerKeepingCount} in namespace: ${data.invocationNamespace}")
    if (count <= warmedContainerKeepingCount) {
      Keep(warmedContainerKeepingTimeout)
    } else {
      Remove
    }
  }).pipeTo(self)
  stay
```

The state times out. The query to etcd fails, which pipes the failure message to the proxy itself. However, that message is not handled in the paused state, so it ends up being stashed by the catch-all `case _ => delay`, and the container now sits around indefinitely until a new activation comes in.

A new activation comes in and tries to wake up the warmed container, playing out all of the events I described in my previous messages, starting with the attempt to unpause the container. It then transitions to `Running`, and in onTransition it unstashes the etcd failure message.

So we just need to put proper failure handling on the state timeout event for paused.

As additional confirmation that this is exactly what's happening, the gap in the logs between when the container was paused and when the new activation comes in is much greater than the idleTimeout.
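As a rough illustration of what failure handling on that timeout could look like, here is a minimal sketch that uses only the Scala standard library. It is not the actual fix in the project. `PausedTimeoutSketch`, `decide`, and the locally defined `Keep` and `Remove` types are hypothetical stand-ins for the real messages. Falling back to `Remove` when the etcd lookup fails is an assumption made for illustration; retrying or keeping the container would be equally valid choices. The idea is to turn a failed future into a message the paused state already handles before it is piped, so that nothing unhandled gets stashed.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.control.NonFatal

// Hypothetical stand-ins for the messages the paused state already handles.
sealed trait TimeoutDecision
final case class Keep(timeout: FiniteDuration) extends TimeoutDecision
case object Remove extends TimeoutDecision

object PausedTimeoutSketch {

  // liveCount and limit model the two etcd-backed lookups from the snippet above.
  def decide(liveCount: Future[Int], limit: Future[(Int, FiniteDuration)]): Future[TimeoutDecision] =
    (for {
      count  <- liveCount
      limits <- limit
    } yield {
      val (keepingCount, keepingTimeout) = limits
      if (count <= keepingCount) Keep(keepingTimeout) else Remove
    }).recover {
      // Assumption: on any non-fatal lookup failure, fall back to Remove
      // so the receiver always gets a message it knows how to handle.
      case NonFatal(_) => Remove
    }

  def main(args: Array[String]): Unit = {
    val failedLookup = Future.failed[Int](new RuntimeException("etcd unavailable"))
    val limits       = Future.successful((2, 10.minutes))
    println(Await.result(decide(failedLookup, limits), 1.second))
  }
}
```

In the proxy itself, the result of such a recovered future would be what gets piped to `self`. Adding an explicit case for the failure message in the paused state would be an alternative to recovering before the pipe.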