[ https://issues.apache.org/jira/browse/YUNIKORN-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507689#comment-17507689 ]
Craig Condit commented on YUNIKORN-1107:
----------------------------------------

[~lowc1012], on a large cluster, the health check can take a considerable amount of time as it has to walk all the internal data structures, acquiring locks along the way that can block scheduler progress. An attacker would only need to spam lots of health check requests in a short period of time to essentially block the scheduler from making forward progress. We really only need to run the check maybe every 30-60 seconds.

The liveness probe doesn't really make sense for YuniKorn, as if the service is running, it is "live". The health check, in part because it needs to acquire and release many locks, can sometimes report incorrect information depending upon the timing of operations. It also may report issues that are really more relevant for the K8s cluster health as a whole and do not indicate a problem with YK itself. This is useful for diagnostics, but is not a reliable indicator that YK should be terminated and restarted.

> Make health check occur in the background
> -----------------------------------------
>
>                 Key: YUNIKORN-1107
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1107
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: core - scheduler
>            Reporter: Craig Condit
>            Assignee: Ryan Lo
>            Priority: Major
>
> Currently, the health check endpoint in the REST API performs a lengthy
> process that could be used as a denial-of-service vector. We should schedule
> the health check in the background periodically, and have the REST API simply
> report the results of the latest check.