Bryan,

The master could not detect that the region server was dead. How have you set the ZooKeeper session timeout?
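
For context, this timeout is normally controlled by the zookeeper.session.timeout property in hbase-site.xml, given in milliseconds. A minimal sketch of the entry follows; the 60000 value is only an illustrative assumption, not a recommendation for this cluster:

```xml
<!-- hbase-site.xml fragment; the value below is illustrative only -->
<property>
  <name>zookeeper.session.timeout</name>
  <!-- milliseconds of silence before ZooKeeper expires a region server's session -->
  <value>60000</value>
</property>
```

The ZooKeeper servers cap the negotiated timeout at their own maximum (by default 20 times tickTime), so a larger value here may not take effect. A lower timeout lets the master notice a dead region server sooner, at the cost of more false expirations during long GC pauses.
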
Thanks,
Jimmy

On Sat, Jun 30, 2012 at 8:09 AM, Stack <st...@duboce.net> wrote:
> On Sat, Jun 30, 2012 at 7:04 AM, Bryan Beaudreault
> <bbeaudrea...@hubspot.com> wrote:
>> 12/06/30 00:07:22 INFO ipc.Client: Retrying connect to server: /10.125.18.129:50020. Already tried 14 time(s).
>>
>
> This was one of the servers that went down?
>
>> It was not following through the splitting of HLog files and didn't appear
>> to be moving regions off failed hosts. After giving it about 20 minutes to
>> try to right itself, I tried restarting the service. The restart script
>> just hung for a while printing dots and nothing apparent was happening on
>> the logs at the time.
>
> Can we see the log, Bryan?
>
> You might thread dump when it's hung up the next time, Bryan (would be
> something for us to do a looksee on).
>
>> Finally I kill -9 the process, so that another
>> master could take over. The new master seemed to start splitting logs, but
>> eventually got into the same state of printing the above message.
>>
>
> You think it's a particular log?
>
>> Eventually it all worked out, but it took WAY too long (almost an hour, all
>> said). Is this something that is tunable?
>
> Have RS carry fewer WALs? It's a configuration.
>
>> They should have instantly been
>> removed from the list instead of retrying so many times. Each server was
>> retried upwards of 30-40 times.
>>
>
> Yeah, that's a bit silly.
>
> We're working on the MTTR in general. Your logs would be of interest
> to a few of us if it's OK that someone else can take a look.
>
> St.Ack
>
>> I am running cdh3u2 (0.90.4).
>>
>> Thanks,
>>
>> Bryan