Re: Health Script does not stop region server

Lars George Sun, 05 Feb 2017 02:23:36 -0800

Hi Ted,

This does not work on Mac as provided. I tried on a CentOS 6 machine,
and had to install net-snmp and net-snmp-utils, plus start snmpd
to make it time out quicker. But even there the snmpwalk returns
nothing, making the script fail.


Anyhow, the snmpwalk failing after the retries is just an example of
what can happen if the health check script takes too long to fail. The
bottom line is that it does _not_ stop the server as expected, because
our check in the code is reset by the chore's delay. That is a bug,
methinks.

Or, in other words, when I fixed the snmpwalk to come back quickly as
explained above, the error was caught in time and the server stopped
as expected.
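
To illustrate the reset, here is a minimal sketch of a failure-window
check. It is a simplified model, not the actual HBase code: the function
name `stops_server`, the threshold of 3 failures, and the window of
threshold * period are all assumptions made for the example. Only the
10 second period and the roughly 30 second gap between reported errors
come from the log.

```python
def stops_server(failure_times, period=10.0, threshold=3):
    """Return True if the modelled failure window would trigger a stop.

    failure_times: seconds at which each failed health check was reported.
    period and threshold are assumed values for illustration only.
    """
    window = threshold * period  # assumption: window spans `threshold` periods
    count = 0
    window_start = None
    for t in failure_times:
        if count == 0 or t - window_start >= window:
            # First failure, or the previous window has expired:
            # the count restarts at 1 and a new window begins.
            count = 1
            window_start = t
        else:
            count += 1
            if count >= threshold:
                return True
    return False


# Script fails quickly: one failure is reported per 10 s period.
fast = [0, 10, 20, 30]
# Script takes about 30 s to fail, as in the log: failures are 30 s apart.
slow = [0, 30, 60, 90, 120]

print(stops_server(fast))
print(stops_server(slow))
```

Under these assumptions the quick failures accumulate inside one window
and trigger the stop, while failures that arrive a full window apart
restart the count every time and never reach the threshold.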

Makes sense?

Lars

On Sat, Feb 4, 2017 at 4:30 PM, Ted Yu <[email protected]> wrote:
> Running the command from the script locally (on Mac):
>
> $ /usr/bin/snmpwalk -t 5 -Oe  -Oq  -Os -v 1 -c public localhost if
> Timeout: No Response from localhost
> $ echo $?
> 1
>
> Looks like the script should parse the output from snmpwalk and provide
> some hint if an unexpected result is reported.
>
> Cheers
>
> On Sat, Feb 4, 2017 at 6:40 AM, Lars George <[email protected]> wrote:
>
>> Hi,
>>
>> I tried the supplied `healthcheck.sh`, but did not have snmpd running.
>> That caused the script to take a long time to error out, which exceeded
>> the 10 seconds the check was meant to run. That resets the check, so
>> it keeps reporting the error but never stops the servers:
>>
>> 2017-02-04 05:55:08,962 INFO
>> [regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020]
>> hbase.HealthCheckChore: Health Check Chore runs every 10sec
>> 2017-02-04 05:55:08,975 INFO
>> [regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020]
>> hbase.HealthChecker: HealthChecker initialized with script at
>> /opt/hbase/bin/healthcheck.sh, timeout=60000
>>
>> ...
>>
>> 2017-02-04 05:55:50,435 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.HealthCheckChore: Health status at 412837hrs, 55mins, 50sec :
>> ERROR check link, OK: disks ok,
>>
>> 2017-02-04 05:55:50,436 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.ScheduledChore: Chore: CompactionChecker missed its start time
>> 2017-02-04 05:55:50,437 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.ScheduledChore: Chore:
>> slave-1.internal.larsgeorge.com,16020,1486216506007-MemstoreFlusherChore
>> missed its start time
>> 2017-02-04 05:55:50,438 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> 2017-02-04 05:56:20,522 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 20sec :
>> ERROR check link, OK: disks ok,
>>
>> 2017-02-04 05:56:20,523 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> 2017-02-04 05:56:50,600 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 50sec :
>> ERROR check link, OK: disks ok,
>>
>> 2017-02-04 05:56:50,600 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> 2017-02-04 05:57:20,681 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 20sec :
>> ERROR check link, OK: disks ok,
>>
>> 2017-02-04 05:57:20,681 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> 2017-02-04 05:57:50,763 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 50sec :
>> ERROR check link, OK: disks ok,
>>
>> 2017-02-04 05:57:50,764 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> 2017-02-04 05:58:20,844 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.HealthCheckChore: Health status at 412837hrs, 58mins, 20sec :
>> ERROR check link, OK: disks ok,
>>
>> 2017-02-04 05:58:20,844 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> 2017-02-04 05:58:50,923 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.HealthCheckChore: Health status at 412837hrs, 58mins, 50sec :
>> ERROR check link, OK: disks ok,
>>
>> 2017-02-04 05:58:50,923 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> 2017-02-04 05:59:21,017 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> hbase.HealthCheckChore: Health status at 412837hrs, 59mins, 21sec :
>> ERROR check link, OK: disks ok,
>>
>> 2017-02-04 05:59:21,018 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>>
>> That seems like a bug, no?
>>
>> Lars
>>
