Will do, thanks
On Sun, Feb 5, 2017 at 3:57 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Yesterday I tried snmpwalk on CentOS as well - same behavior.
>
> Lars:
> Can you file a JIRA to fix the bug?
>
> Thanks
>
> On Sun, Feb 5, 2017 at 2:22 AM, Lars George <lars.geo...@gmail.com> wrote:
>
>> Hi Ted,
>>
>> This does not work on Mac as provided. I tried on a CentOS 6 machine,
>> and had to install net-snmp and net-snmp-utils, plus start snmpd
>> to make it time out quicker. But even there the snmpwalk returns
>> nothing, making the script fail.
>>
>> Anyhow, the snmpwalk failing after the retries is just an example of
>> what can happen if the health check script takes too long to fail. The
>> bottom line is that it does _not_ stop the server as expected, because
>> our check in the code is reset by the chore's delay. That is a bug
>> methinks.
>>
>> Or, in other words, when I fixed the snmpwalk to come back quickly as
>> explained above, the error was caught in time and the server stopped
>> as expected.
>>
>> Makes sense?
>>
>> Lars
>>
>> On Sat, Feb 4, 2017 at 4:30 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>> > Running the command from the script locally (on Mac):
>> >
>> > $ /usr/bin/snmpwalk -t 5 -Oe -Oq -Os -v 1 -c public localhost if
>> > Timeout: No Response from localhost
>> > $ echo $?
>> > 1
>> >
>> > Looks like the script should parse the output from snmpwalk and provide
>> > some hint if an unexpected result is reported.
>> >
>> > Cheers
>> >
>> > On Sat, Feb 4, 2017 at 6:40 AM, Lars George <lars.geo...@gmail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> I tried the supplied `healthcheck.sh`, but did not have snmpd running.
>> >> That caused the script to take a long time to error out, which exceeded
>> >> the 10 seconds the check was meant to run. That resets the check and
>> >> it keeps reporting the error, but never stops the servers:
>> >>
>> >> 2017-02-04 05:55:08,962 INFO [regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020] hbase.HealthCheckChore: Health Check Chore runs every 10sec
>> >> 2017-02-04 05:55:08,975 INFO [regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020] hbase.HealthChecker: HealthChecker initialized with script at /opt/hbase/bin/healthcheck.sh, timeout=60000
>> >>
>> >> ...
>> >>
>> >> 2017-02-04 05:55:50,435 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.HealthCheckChore: Health status at 412837hrs, 55mins, 50sec : ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:55:50,436 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: CompactionChecker missed its start time
>> >> 2017-02-04 05:55:50,437 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: slave-1.internal.larsgeorge.com,16020,1486216506007-MemstoreFlusherChore missed its start time
>> >> 2017-02-04 05:55:50,438 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:56:20,522 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 20sec : ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:56:20,523 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:56:50,600 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 50sec : ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:56:50,600 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:57:20,681 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 20sec : ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:57:20,681 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:57:50,763 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 50sec : ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:57:50,764 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:58:20,844 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.HealthCheckChore: Health status at 412837hrs, 58mins, 20sec : ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:58:20,844 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:58:50,923 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.HealthCheckChore: Health status at 412837hrs, 58mins, 50sec : ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:58:50,923 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:59:21,017 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.HealthCheckChore: Health status at 412837hrs, 59mins, 21sec : ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:59:21,018 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >>
>> >> That seems like a bug, no?
>> >>
>> >> Lars
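
For anyone following the thread, here is a rough Python sketch of how a "stop after N failures inside a time window" check can end up never firing when each health check run is slow. This is only a simplified model of the behavior described above, not the HBase code. The names `FailureWindow` and `simulate`, the threshold of 3, and the 30 second window are assumptions made up for illustration. Only the roughly 30 second spacing of the ERROR reports comes from the log.

```python
class FailureWindow:
    """Simplified model: stop once `threshold` failures land inside one window."""

    def __init__(self, threshold, window_ms):
        self.threshold = threshold    # assumed value, not taken from HBase
        self.window_ms = window_ms    # assumed value, not taken from HBase
        self.failures = 0
        self.window_start = 0

    def record_failure(self, now_ms):
        """Record one failed check; return True if the server should stop."""
        if self.failures == 0:
            # First failure seen: open a window and keep running.
            self.failures = 1
            self.window_start = now_ms
            return False
        if now_ms - self.window_start < self.window_ms:
            # Still inside the window: count it and compare to the threshold.
            self.failures += 1
            return self.failures >= self.threshold
        # Window already expired: the count is reset to 1, so it never adds up.
        self.failures = 1
        self.window_start = now_ms
        return False


def simulate(interval_ms, runs, threshold=3, window_ms=30_000):
    """Report a failure every `interval_ms`; return True if a stop is triggered."""
    window = FailureWindow(threshold, window_ms)
    return any(window.record_failure(i * interval_ms) for i in range(runs))


if __name__ == "__main__":
    print("fast script, failure every 10s:", simulate(10_000, 10))
    print("slow script, failure every 30s:", simulate(30_000, 10))
```

Under these assumptions, failures reported every 10 seconds accumulate inside one window and trigger the stop, while failures reported every 30 seconds always arrive after the window has expired, so the counter keeps restarting at 1.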