[ https://issues.apache.org/jira/browse/YUNIKORN-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Condit resolved YUNIKORN-1179.
------------------------------------
     Fix Version/s: 1.0.0
    Target Version: 1.0.0
        Resolution: Fixed

Merged to master and branch-1.0. Thanks [~lowc1012] for the contribution.

> Logs are spammed with health check status messages
> --------------------------------------------------
>
>                 Key: YUNIKORN-1179
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1179
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Peter Bacsko
>            Assignee: Ryan Lo
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>
> YUNIKORN-1107 introduced periodic background health check.
> The problem is, too much noise is printed to the console:
> {noformat}
> 2022-04-20T13:28:03.101Z      INFO    scheduler/health_checker.go:87  
> Scheduler is healthy    {"health check values": [{"Name":"Scheduling 
> errors","Succeeded":true,"Description":"Check for scheduling error entries in 
> metrics","DiagnosisMessage":"There were 0 scheduling errors logged in the 
> metrics"},{"Name":"Failed nodes","Succeeded":true,"Description":"Check for 
> failed nodes entries in metrics","DiagnosisMessage":"There were 0 failed 
> nodes logged in the metrics"},{"Name":"Negative 
> resources","Succeeded":true,"Description":"Check for negative resources in 
> the partitions","DiagnosisMessage":"Partitions with negative resources: 
> []"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for 
> negative resources in the nodes","DiagnosisMessage":"Nodes with negative 
> resources: []"},{"Name":"Consistency of 
> data","Succeeded":true,"Description":"Check if a node's allocated resource <= 
> total resource of the node","DiagnosisMessage":"Nodes with inconsistent data: 
> []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if 
> total partition resource == sum of the node resources from the 
> partition","DiagnosisMessage":"Partitions with inconsistent data: 
> []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if 
> node total resource = allocated resource + occupied resource + available 
> resource","DiagnosisMessage":"Nodes with inconsistent data: 
> []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if 
> node capacity >= allocated resources on the node","DiagnosisMessage":"Nodes 
> with inconsistent data: []"},{"Name":"Reservation 
> check","Succeeded":true,"Description":"Check the reservation nr compared to 
> the number of nodes","DiagnosisMessage":"Reservation/node nr ratio: 
> [0.000000]"},{"Name":"Orphan allocation on node 
> check","Succeeded":true,"Description":"Check if there are orphan allocations 
> on the nodes","DiagnosisMessage":"Orphan allocations: []"},{"Name":"Orphan 
> allocation on app check","Succeeded":true,"Description":"Check if there are 
> orphan allocations on the 
> applications","DiagnosisMessage":"OrphanAllocations: []"}]}
> 2022-04-20T13:28:33.098Z      INFO    scheduler/health_checker.go:87  
> Scheduler is healthy    {"health check values": [{"Name":"Scheduling 
> errors","Succeeded":true,"Description":"Check for scheduling error entries in 
> metrics","DiagnosisMessage":"There were 0 scheduling errors logged in the 
> metrics"},{"Name":"Failed nodes","Succeeded":true,"Description":"Check for 
> failed nodes entries in metrics","DiagnosisMessage":"There were 0 failed 
> nodes logged in the metrics"},{"Name":"Negative 
> resources","Succeeded":true,"Description":"Check for negative resources in 
> the partitions","DiagnosisMessage":"Partitions with negative resources: 
> []"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for 
> negative resources in the nodes","DiagnosisMessage":"Nodes with negative 
> resources: []"},{"Name":"Consistency of 
> data","Succeeded":true,"Description":"Check if a node's allocated resource <= 
> total resource of the node","DiagnosisMessage":"Nodes with inconsistent data: 
> []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if 
> total partition resource == sum of the node resources from the 
> partition","DiagnosisMessage":"Partitions with inconsistent data: 
> []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if 
> node total resource = allocated resource + occupied resource + available 
> resource","DiagnosisMessage":"Nodes with inconsistent data: 
> []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if 
> node capacity >= allocated resources on the node","DiagnosisMessage":"Nodes 
> with inconsistent data: []"},{"Name":"Reservation 
> check","Succeeded":true,"Description":"Check the reservation nr compared to 
> the number of nodes","DiagnosisMessage":"Reservation/node nr ratio: 
> [0.000000]"},{"Name":"Orphan allocation on node 
> check","Succeeded":true,"Description":"Check if there are orphan allocations 
> on the nodes","DiagnosisMessage":"Orphan allocations: []"},{"Name":"Orphan 
> allocation on app check","Succeeded":true,"Description":"Check if there are 
> orphan allocations on the 
> applications","DiagnosisMessage":"OrphanAllocations: []"}]}
> {noformat}
> I don't think we need this much output every 30 seconds. In fact, if the 
> scheduler is healthy, we don't need anything at all; at most a short 
> printout at DEBUG level, but nothing more.
> If a health check fails, we should log it, but even in that case this much 
> detail looks unnecessary.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org
