On Sat, Jun 30, 2012 at 7:04 AM, Bryan Beaudreault <bbeaudrea...@hubspot.com> wrote:

> 12/06/30 00:07:22 INFO ipc.Client: Retrying connect to server: /10.125.18.129:50020. Already tried 14 time(s).
>
This was one of the servers that went down?

> It was not following through the splitting of HLog files and didn't appear
> to be moving regions off failed hosts. After giving it about 20 minutes to
> try to right itself, I tried restarting the service. The restart script
> just hung for a while printing dots and nothing apparent was happening on
> the logs at the time.

Can we see the log, Bryan? You might take a thread dump the next time it's hung up (it would be something for us to take a look at).

> Finally I kill -9 the process, so that another
> master could take over. The new master seemed to start splitting logs, but
> eventually got into the same state of printing the above message.

You think it is a particular log?

> Eventually it all worked out, but it took WAY too long (almost an hour, all
> said). Is this something that is tunable?

Have the RS carry fewer WALs? It's a configuration.

> They should have instantly been
> removed from the list instead of retrying so many times. Each server was
> retried upwards of 30-40 times.

Yeah, that's a bit silly. We're working on the MTTR in general.

Your logs would be of interest to a few of us, if it's OK for someone else to take a look.

St.Ack

> I am running cdh3u2 (0.90.4).
>
> Thanks,
>
> Bryan