Forgive my ignorance, but I have not yet had a tablet failure
that I have been able to recover from without restarting the entire
Accumulo cluster.
I have 3 tablet servers: 2 online, 1 dead. I am using Accumulo 1.4.3.
The dead tablet server reports this error:
Uncaught exception in TabletServer.main, exiting
java.lang.RuntimeException: java.lang.RuntimeException: Too many retries, exiting.
    at org.apache.accumulo.server.tabletserver.TabletServer.announceExistence(TabletServer.java:2684)
    at org.apache.accumulo.server.tabletserver.TabletServer.run(TabletServer.java:2703)
    at org.apache.accumulo.server.tabletserver.TabletServer.main(TabletServer.java:3168)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.accumulo.start.Main$1.run(Main.java:89)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: Too many retries, exiting.
    at org.apache.accumulo.server.tabletserver.TabletServer.announceExistence(TabletServer.java:2681)
    ... 8 more
The recovery portion of the Admin guide says that recovery is performed
by asking the loggers to copy their write-ahead logs into HDFS. The
logs are copied and sorted, and then tablets can find missing updates.
Once complete, the tablets involved should return to an 'online' state.
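To check my understanding of that description, here is a toy Java sketch
of the copy, sort and replay steps. It is not Accumulo's actual code: the
LogEntry class, the sequence numbers, and every method name below are my
own assumptions, made up purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: none of these names come from Accumulo.
class WalRecoverySketch {

    // One logged update: a sequence number, the row it touches, the new value.
    static final class LogEntry {
        final long seq;
        final String row;
        final String value;

        LogEntry(long seq, String row, String value) {
            this.seq = seq;
            this.row = row;
            this.value = value;
        }
    }

    // Steps 1 and 2: gather ("copy") the entries from every logger into one
    // list, then sort by sequence number so they replay in write order.
    static List<LogEntry> copyAndSort(List<List<LogEntry>> loggers) {
        List<LogEntry> all = new ArrayList<>();
        for (List<LogEntry> logger : loggers) {
            all.addAll(logger);
        }
        all.sort(Comparator.comparingLong((LogEntry e) -> e.seq));
        return all;
    }

    // Step 3: replay only the entries newer than what was already persisted,
    // i.e. the "missing updates".
    static Map<String, String> recover(Map<String, String> persisted,
                                       long lastPersistedSeq,
                                       List<LogEntry> sorted) {
        Map<String, String> tablet = new TreeMap<>(persisted);
        for (LogEntry e : sorted) {
            if (e.seq > lastPersistedSeq) {
                tablet.put(e.row, e.value);
            }
        }
        return tablet;
    }

    public static void main(String[] args) {
        // Two hypothetical loggers holding interleaved, unsorted entries.
        List<LogEntry> loggerA = Arrays.asList(
                new LogEntry(3, "row1", "c"), new LogEntry(1, "row1", "a"));
        List<LogEntry> loggerB = Arrays.asList(
                new LogEntry(2, "row2", "b"), new LogEntry(4, "row3", "d"));

        // The tablet had persisted everything up to sequence number 1.
        Map<String, String> persisted = new TreeMap<>();
        persisted.put("row1", "a");

        List<LogEntry> sorted = copyAndSort(Arrays.asList(loggerA, loggerB));
        System.out.println(recover(persisted, 1, sorted));
        // prints {row1=c, row2=b, row3=d}
    }
}
```

If that picture is roughly right, nothing in it looks like a step I would
trigger by hand, which is part of why I am unsure what to run.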
I am not sure how to ask the loggers to copy their write-ahead logs into
HDFS. Is this the same as using the flush shell command? If so, the
flush command needs a pattern of tables or a table name. Would I want
to perform something like, 'accumulo flush -p .+' to flush all of the
table data to HDFS?
Another concern is that the tablet server process was no longer running
on the server. I logged into that server and ran "start-here.sh". The
tablet server is now running, but the monitor still reports it as
'dead'.
Thanks in advance,
Charles