Re: HMaster not failing over dead RegionServers

Stack Sat, 30 Jun 2012 08:10:03 -0700

On Sat, Jun 30, 2012 at 7:04 AM, Bryan Beaudreault
<bbeaudrea...@hubspot.com> wrote:
> 12/06/30 00:07:22 INFO ipc.Client: Retrying connect to server: /
> 10.125.18.129:50020. Already tried 14 time(s).
>


This was one of the servers that went down?

> It was not following through the splitting of HLog files and didn't appear
> to be moving regions off failed hosts.  After giving it about 20 minutes to
> try to right itself, I tried restarting the service.  The restart script
> just hung for a while printing dots and nothing apparent was happening on
> the logs at the time.

Can we see the log, Bryan?

You might take a thread dump the next time it's hung up, Bryan (it would
be something for us to take a look at).

> Finally I kill -9 the process, so that another
> master could take over.  The new master seemed to start splitting logs, but
> eventually got into the same state of printing the above message.
>

You think it is a particular log?


> Eventually it all worked out, but it took WAY too long (almost an hour, all
> said).  Is this something that is tunable?

Have the RSs carry fewer WALs?  It's a configuration setting.
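
A rough sketch of what that could look like in hbase-site.xml on the
regionservers. The property meant here is hbase.regionserver.maxlogs; the
value 16 is only an illustration, not a recommendation, so check the
default for your version before changing it:

```xml
<!-- hbase-site.xml fragment; the value is illustrative only -->
<property>
  <name>hbase.regionserver.maxlogs</name>
  <!-- Upper bound on WAL files a regionserver keeps before forcing flushes -->
  <value>16</value>
</property>
```

Fewer WALs per server means less log to split when a server dies, at the
cost of more frequent memstore flushes.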

> They should have instantly been
> removed from the list instead of retrying so many times.  Each server was
> retried upwards of 30-40 times.
>

Yeah, that's a bit silly.

We're working on the MTTR in general.  Your logs would be of interest
to a few of us, if it's OK for someone else to take a look.

St.Ack

> I am running cdh3u2 (0.90.4).
>
> Thanks,
>
> Bryan
