Bryan,

The master could not detect that the region server was dead.
How do you set the ZooKeeper session timeout?

Thanks,
Jimmy

On Sat, Jun 30, 2012 at 8:09 AM, Stack <st...@duboce.net> wrote:
> On Sat, Jun 30, 2012 at 7:04 AM, Bryan Beaudreault
> <bbeaudrea...@hubspot.com> wrote:
>> 12/06/30 00:07:22 INFO ipc.Client: Retrying connect to server: /10.125.18.129:50020. Already tried 14 time(s).
>>
>
> This was one of the servers that went down?
>
>> It was not following through the splitting of HLog files and didn't appear
>> to be moving regions off failed hosts.  After giving it about 20 minutes to
>> try to right itself, I tried restarting the service.  The restart script
>> just hung for a while printing dots and nothing apparent was happening on
>> the logs at the time.
>
> Can we see the log, Bryan?
>
> You might take a thread dump the next time it's hung up, Bryan (it would be
> something for us to do a looksee on).
>
>> Finally I ran kill -9 on the process, so that another
>> master could take over.  The new master seemed to start splitting logs, but
>> eventually got into the same state of printing the above message.
>>
>
> You think it is a particular log?
>
>
>> Eventually it all worked out, but it took WAY too long (almost an hour, all
>> said).  Is this something that is tunable?
>
> Have RS carry fewer WALs?  It's a configuration.
>
>> They should have instantly been
>> removed from the list instead of retrying so many times.  Each server was
>> retried upwards of 30-40 times.
>>
>
> Yeah, that's a bit silly.
>
> We're working on the MTTR in general.  Your logs would be of interest
> to a few of us if it's OK that someone else can take a look.
>
> St.Ack
>
>> I am running cdh3u2 (0.90.4).
>>
>> Thanks,
>>
>> Bryan
