Yeah, this is very suspicious. The error the master tripped over happened just after the region server stopped logging to that file, which makes it even more suspicious. Usually when there's an error in the region server's main thread it goes to stdout, so it ends up in the .out file instead of the .log file. But that file is overwritten every time you restart the process, so unless you haven't restarted the region server, we probably lost the info that was in there. And if the process did die, that would explain why the master wasn't able to connect to it.
J-D

On Fri, May 28, 2010 at 8:37 AM, Lucas Nazário dos Santos <[email protected]> wrote:
> Here are the complete logs:
>
> http://www.ninvest.com.br/docs/logs_hbase/hbase-root-master-ip-10-251-158-224.log
> http://www.ninvest.com.br/docs/logs_hbase/hbase-root-zookeeper-ip-10-251-158-224.log
> http://www.ninvest.com.br/docs/logs_hbase/hbase-root-regionserver-ip-10-251-158-224.log
>
> The regionserver stopped logging at 8:31am. Strange...
>
> I hope this help.
>
> Lucas
>
>
> On Thu, May 27, 2010 at 8:09 PM, Jean-Daniel Cryans
> <[email protected]>wrote:
>
>> On Thu, May 27, 2010 at 4:01 PM, Lucas Nazário dos Santos
>> <[email protected]> wrote:
>> > Thanks a lot for the responses. I'll be monitoring HBase and get back in
>> > touch if it happens again.
>> >
>> > Maybe HBase could employ a mechanism to automatically recover from
>> > connectivity issues like the one I had gone through. Then me and others
>> > wouldn't need to manually restart it.
>>
>> Well usually if one machine is not reachable, it's not a big deal
>> since there are other machines to connect to and HBase redistributes
>> the regions to them. Also, why is it refused? Can we see the region
>> server log?
>>
>> >
>> > I still didn't get why the master kept failing even after its recovery, and
>> > why I had to stop/start the cluster in order to get rid of the "Connection
>> > refused" error.
>>
>> I'd also like to understand why the region server isn't responding,
>> the master can only know so much.
>>
>> >
>> > I'm assuming it's not big deal and my solution can live with it.
>> >
>> > More logs bellow.
>> >
>>
>> Consider pastebin or a web server next time ;)
>> >
