Lars George created HBASE-17630:
-----------------------------------

             Summary: Health Script not shutting down server process with certain script behavior
                 Key: HBASE-17630
                 URL: https://issues.apache.org/jira/browse/HBASE-17630
             Project: HBase
          Issue Type: Bug
          Components: master, regionserver
    Affects Versions: 1.3.1
            Reporter: Lars George

As discussed on dev@... I tried the supplied {{healthcheck.sh}}, but did not have {{snmpd}} running. That caused the script to take a long time to error out, which exceeded the 10 seconds the check was meant to run. That resets the check and it keeps reporting the error, but never stops the servers:

{noformat}
2017-02-04 05:55:08,962 INFO [regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020] hbase.HealthCheckChore: Health Check Chore runs every 10sec
2017-02-04 05:55:08,975 INFO [regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020] hbase.HealthChecker: HealthChecker initialized with script at /opt/hbase/bin/healthcheck.sh, timeout=60000
...
2017-02-04 05:55:50,435 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.HealthCheckChore: Health status at 412837hrs, 55mins, 50sec : ERROR check link, OK: disks ok,
2017-02-04 05:55:50,436 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: CompactionChecker missed its start time
2017-02-04 05:55:50,437 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: slave-1.internal.larsgeorge.com,16020,1486216506007-MemstoreFlusherChore missed its start time
2017-02-04 05:55:50,438 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.ScheduledChore: Chore: HealthChecker missed its start time
2017-02-04 05:56:20,522 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 20sec : ERROR check link, OK: disks ok,
2017-02-04 05:56:20,523 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.ScheduledChore: Chore: HealthChecker missed its start time
2017-02-04 05:56:50,600 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 50sec : ERROR check link, OK: disks ok,
2017-02-04 05:56:50,600 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.ScheduledChore: Chore: HealthChecker missed its start time
2017-02-04 05:57:20,681 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 20sec : ERROR check link, OK: disks ok,
2017-02-04 05:57:20,681 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: HealthChecker missed its start time
2017-02-04 05:57:50,763 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 50sec : ERROR check link, OK: disks ok,
2017-02-04 05:57:50,764 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: HealthChecker missed its start time
2017-02-04 05:58:20,844 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.HealthCheckChore: Health status at 412837hrs, 58mins, 20sec : ERROR check link, OK: disks ok,
2017-02-04 05:58:20,844 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: HealthChecker missed its start time
2017-02-04 05:58:50,923 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.HealthCheckChore: Health status at 412837hrs, 58mins, 50sec : ERROR check link, OK: disks ok,
2017-02-04 05:58:50,923 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: HealthChecker missed its start time
2017-02-04 05:59:21,017 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.HealthCheckChore: Health status at 412837hrs, 59mins, 21sec : ERROR check link, OK: disks ok,
2017-02-04 05:59:21,018 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.ScheduledChore: Chore: HealthChecker missed its start time
{noformat}

We need to fix the handling of the timeout
of the health check script and how the chore treats that timeout when deciding whether to shut down the server process. The current settings of check frequency and timeout overlap and cause the above.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
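
To make the overlap described above concrete, here is a minimal sketch of a failure counter that only stops the server after several failures inside a fixed window. It is a simplified model and not the HBase source: the class {{FailureWindowSketch}}, the threshold of 3, and the window of threshold x period are all assumptions chosen for illustration. The 10 second period and the roughly 30 second spacing between failure reports come from the log in this issue.

```java
// Hypothetical model of a "stop after N failures within a window" policy.
// Names and numbers are illustrative assumptions, not HBase's implementation.
public class FailureWindowSketch {
    private final int threshold;         // assumed: failures needed to stop
    private final long failureWindowMs;  // assumed: threshold * check period
    private int numTimesUnhealthy = 0;
    private long startWindowMs = 0;

    public FailureWindowSketch(int threshold, long periodMs) {
        this.threshold = threshold;
        this.failureWindowMs = threshold * periodMs;
    }

    /** Records one failed check at nowMs; returns true if the server should stop. */
    public boolean recordFailure(long nowMs) {
        boolean first = numTimesUnhealthy == 0;
        if (first || nowMs - startWindowMs >= failureWindowMs) {
            // First failure, or the previous window expired: restart counting at 1.
            numTimesUnhealthy = 1;
            startWindowMs = nowMs;
            return false;
        }
        numTimesUnhealthy++;
        return numTimesUnhealthy >= threshold;
    }

    /** Feeds failures spaced gapMs apart; returns the 1-based index of the failure that triggers a stop, or -1. */
    public static int firstStop(int threshold, long periodMs, long gapMs, int failures) {
        FailureWindowSketch window = new FailureWindowSketch(threshold, periodMs);
        for (int i = 1; i <= failures; i++) {
            if (window.recordFailure(i * gapMs)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Script fails quickly: one failure per 10s period, stop on the third failure.
        System.out.println(firstStop(3, 10_000, 10_000, 10)); // prints 3
        // Script is slow: failures arrive ~30s apart, each one lands outside the
        // 30s window, the counter resets to 1 every time, and no stop is triggered.
        System.out.println(firstStop(3, 10_000, 30_080, 10)); // prints -1
    }
}
```

Under these assumptions, a script that takes about as long as the whole failure window to report an error can never accumulate enough failures to trigger a stop, which matches the behavior in the log.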