[ https://issues.apache.org/jira/browse/YUNIKORN-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Craig Condit resolved YUNIKORN-1179.
------------------------------------
    Fix Version/s: 1.0.0
   Target Version: 1.0.0
       Resolution: Fixed

Merged to master and branch-1.0. Thanks [~lowc1012] for the contribution.

> Logs are spammed with health check status messages
> --------------------------------------------------
>
>                 Key: YUNIKORN-1179
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1179
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Peter Bacsko
>            Assignee: Ryan Lo
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>
> YUNIKORN-1107 introduced a periodic background health check.
> The problem is that too much noise is printed to the console:
> {noformat}
> 2022-04-20T13:28:03.101Z INFO scheduler/health_checker.go:87 Scheduler is healthy {"health check values": [{"Name":"Scheduling errors","Succeeded":true,"Description":"Check for scheduling error entries in metrics","DiagnosisMessage":"There were 0 scheduling errors logged in the metrics"},{"Name":"Failed nodes","Succeeded":true,"Description":"Check for failed nodes entries in metrics","DiagnosisMessage":"There were 0 failed nodes logged in the metrics"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for negative resources in the partitions","DiagnosisMessage":"Partitions with negative resources: []"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for negative resources in the nodes","DiagnosisMessage":"Nodes with negative resources: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if a node's allocated resource <= total resource of the node","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if total partition resource == sum of the node resources from the partition","DiagnosisMessage":"Partitions with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if node total resource = allocated resource + occupied resource + available resource","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if node capacity >= allocated resources on the node","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Reservation check","Succeeded":true,"Description":"Check the reservation nr compared to the number of nodes","DiagnosisMessage":"Reservation/node nr ratio: [0.000000]"},{"Name":"Orphan allocation on node check","Succeeded":true,"Description":"Check if there are orphan allocations on the nodes","DiagnosisMessage":"Orphan allocations: []"},{"Name":"Orphan allocation on app check","Succeeded":true,"Description":"Check if there are orphan allocations on the applications","DiagnosisMessage":"OrphanAllocations: []"}]}
> 2022-04-20T13:28:33.098Z INFO scheduler/health_checker.go:87 Scheduler is healthy {"health check values": [{"Name":"Scheduling errors","Succeeded":true,"Description":"Check for scheduling error entries in metrics","DiagnosisMessage":"There were 0 scheduling errors logged in the metrics"},{"Name":"Failed nodes","Succeeded":true,"Description":"Check for failed nodes entries in metrics","DiagnosisMessage":"There were 0 failed nodes logged in the metrics"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for negative resources in the partitions","DiagnosisMessage":"Partitions with negative resources: []"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for negative resources in the nodes","DiagnosisMessage":"Nodes with negative resources: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if a node's allocated resource <= total resource of the node","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if total partition resource == sum of the node resources from the partition","DiagnosisMessage":"Partitions with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if node total resource = allocated resource + occupied resource + available resource","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if node capacity >= allocated resources on the node","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Reservation check","Succeeded":true,"Description":"Check the reservation nr compared to the number of nodes","DiagnosisMessage":"Reservation/node nr ratio: [0.000000]"},{"Name":"Orphan allocation on node check","Succeeded":true,"Description":"Check if there are orphan allocations on the nodes","DiagnosisMessage":"Orphan allocations: []"},{"Name":"Orphan allocation on app check","Succeeded":true,"Description":"Check if there are orphan allocations on the applications","DiagnosisMessage":"OrphanAllocations: []"}]}
> {noformat}
> I don't think we need this much output every 30 seconds. In fact, if the
> scheduler is healthy, we don't need anything at all; a short printout at
> DEBUG level at most, but nothing more.
> If the health check failed, then we might log it, but even in that case this
> much detail looks unnecessary.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org