Yesterday I tried snmpwalk on CentOS as well - same behavior.

Lars: Can you file a JIRA to fix the bug?
Thanks

On Sun, Feb 5, 2017 at 2:22 AM, Lars George wrote:

> Hi Ted,
>
> This does not work on Mac as provided. I tried on a CentOS 6 machine,
> and had to install net-snmp and net-snmp-utils, plus start the snmpd
> to make it time out quicker. But even there the snmpwalk returned
> nothing, making the script fail.
>
> Anyhow, the snmpwalk failing after the retries is just an example of
> what can happen if the health check script takes too long to fail. The
> bottom line is that it does _not_ stop the server as expected, as our
> check in the code is reset because of the chore's delay. That is a bug
> methinks.
>
> Or, in other words, when I fixed the snmpwalk to come back quickly as
> explained above, the error was caught in time and the server stopped
> as expected.
>
> Makes sense?
>
> Lars
>
> On Sat, Feb 4, 2017 at 4:30 PM, Ted Yu wrote:
> > Running the command from the script locally (on Mac):
> >
> > $ /usr/bin/snmpwalk -t 5 -Oe -Oq -Os -v 1 -c public localhost if
> > Timeout: No Response from localhost
> > $ echo $?
> > 1
> >
> > Looks like the script should parse the output from snmpwalk and provide
> > some hint if an unexpected result is reported.
> >
> > Cheers
> >
> > On Sat, Feb 4, 2017 at 6:40 AM, Lars George wrote:
> >
> >> Hi,
> >>
> >> I tried the supplied `healthcheck.sh`, but did not have snmpd running.
> >> That caused the script to take a long time to error out, which exceeded
> >> the 10 seconds the check was meant to run. That resets the check and
> >> it keeps reporting the error, but never stops the servers:
> >>
> >> 2017-02-04 05:55:08,962 INFO [regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020] hbase.HealthCheckChore: Health Check Chore runs every 10sec
> >> 2017-02-04 05:55:08,975 INFO [regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020] hbase.HealthChecker: HealthChecker initialized with script at /opt/hbase/bin/healthcheck.sh, timeout=60000
> >>
> >> ...
> >>
> >> 2017-02-04 05:55:50,435 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.HealthCheckChore: Health status at 412837hrs, 55mins, 50sec : ERROR check link, OK: disks ok,
> >>
> >> 2017-02-04 05:55:50,436 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: CompactionChecker missed its start time
> >> 2017-02-04 05:55:50,437 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: slave-1.internal.larsgeorge.com,16020,1486216506007-MemstoreFlusherChore missed its start time
> >> 2017-02-04 05:55:50,438 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.ScheduledChore: Chore: HealthChecker missed its start time
> >> 2017-02-04 05:56:20,522 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 20sec : ERROR check link, OK: disks ok,
> >>
> >> 2017-02-04 05:56:20,523 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.ScheduledChore: Chore: HealthChecker missed its start time
> >> 2017-02-04 05:56:50,600 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 50sec : ERROR check link, OK: disks ok,
> >>
> >> 2017-02-04 05:56:50,600 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.ScheduledChore: Chore: HealthChecker missed its start time
> >> 2017-02-04 05:57:20,681 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 20sec : ERROR check link, OK: disks ok,
> >>
> >> 2017-02-04 05:57:20,681 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: HealthChecker missed its start time
> >> 2017-02-04 05:57:50,763 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 50sec : ERROR check link, OK: disks ok,
> >>
> >> 2017-02-04 05:57:50,764 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: HealthChecker missed its start time
> >> 2017-02-04 05:58:20,844 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.HealthCheckChore: Health status at 412837hrs, 58mins, 20sec : ERROR check link, OK: disks ok,
> >>
> >> 2017-02-04 05:58:20,844 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: HealthChecker missed its start time
> >> 2017-02-04 05:58:50,923 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.HealthCheckChore: Health status at 412837hrs, 58mins, 50sec : ERROR check link, OK: disks ok,
> >>
> >> 2017-02-04 05:58:50,923 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: HealthChecker missed its start time
> >> 2017-02-04 05:59:21,017 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.HealthCheckChore: Health status at 412837hrs, 59mins, 21sec : ERROR check link, OK: disks ok,
> >>
> >> 2017-02-04 05:59:21,018 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.ScheduledChore: Chore: HealthChecker missed its start time
> >>
> >> That seems like a bug, no?
> >>
> >> Lars
> >>
>
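To make the reset described in the thread concrete, here is a rough shell model of a consecutive-failure counter with a time window. It is only a sketch, not the HBase code: the `simulate` function, the threshold of 3 failures and the 30000 ms window are assumptions chosen for illustration. The 30080 ms spacing approximates the roughly 30 second gap between the "Health status" lines in the log above.

```shell
# Simplified model of a failure counter with a time window.
# NOT the actual HBase implementation; threshold and window are assumed values.
#
# simulate INTERVAL_MS THRESHOLD WINDOW_MS MAX_CHECKS
#   Every check fails. A failure that arrives WINDOW_MS or more after the
#   first failure of the current window restarts the count at 1.
simulate() {
  interval_ms=$1; threshold=$2; window_ms=$3; max_checks=$4
  count=0; window_start=0; now=0; n=0
  while [ "$n" -lt "$max_checks" ]; do
    n=$((n + 1))
    now=$((now + interval_ms))
    if [ "$count" -eq 0 ] || [ $((now - window_start)) -ge "$window_ms" ]; then
      # First failure, or the previous window has expired: start a new window.
      count=1
      window_start=$now
    else
      count=$((count + 1))
      if [ "$count" -ge "$threshold" ]; then
        echo "stopped after $n failures"
        return 0
      fi
    fi
  done
  echo "never stopped"
}

# Script fails quickly: one failure per 10 s chore period.
simulate 10000 3 30000 8
# Script fails slowly: failures arrive about 30 s apart, as in the log.
simulate 30080 3 30000 8
```

Under these assumptions the fast-failing case reaches the threshold on the third failure, while the slow-failing case lands just outside the window each time and the count never gets past 1, which would match the "keeps reporting the error, but never stops" behavior reported above.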
