From: [email protected] [mailto:[email protected]] On Behalf Of Keith Turner
Sent: Tuesday, September 17, 2013 3:20 PM
To: [email protected]
Subject: Re: Dead Tablet Server

On Tue, Sep 17, 2013 at 10:23 AM, Ott, Charles H. <[email protected]> wrote:

Forgive my ignorance with this, But I have not yet had a tablet failure that I have been able to recover without restarting the entire accumulo cluster. I have 3 Tablets, 2 Online, 1 dead. Using Accumulo 1.4.3

The tablet error reports:

Uncaught exception in TabletServer.main, exiting
java.lang.RuntimeException: java.lang.RuntimeException: Too many retries, exiting.
    at org.apache.accumulo.server.tabletserver.TabletServer.announceExistence(TabletServer.java:2684)
    at org.apache.accumulo.server.tabletserver.TabletServer.run(TabletServer.java:2703)
    at org.apache.accumulo.server.tabletserver.TabletServer.main(TabletServer.java:3168)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.accumulo.start.Main$1.run(Main.java:89)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: Too many retries, exiting.
    at org.apache.accumulo.server.tabletserver.TabletServer.announceExistence(TabletServer.java:2681)
    ... 8 more

It would be nice to add this stack trace as a comment on ACCUMULO-1277 to make it easier to find via google. Would you like to do this? If not I can.

I just added it to the comments: https://issues.apache.org/jira/browse/ACCUMULO-1277

The recovery portion of the Admin guide says that recovery is performed by asking the loggers to copy their write-ahead logs into HDFS. The logs are copied, sorted and then tablets can find missing updates. Once complete the tablets involved should return to an 'online' state.
I am not sure how to ask the loggers to copy their write-ahead logs into HDFS. Is this the same as using the flush shell command? If so, the flush command needs a table name or a pattern of table names. Would I want to perform something like 'accumulo flush -p .+' to flush all of the table data to HDFS?

Another concern is that the Tablet Server process was no longer running on the server. I logged into that server and ran "start-here.sh". The tablet server is now running, but the monitor still reports it as 'dead'.

Thanks in advance,
Charles
