From: [email protected]
[mailto:[email protected]] On Behalf
Of Josh Elser
Sent: Tuesday, September 17, 2013 10:39 AM
To: [email protected]
Subject: Re: Dead Tablet Server
On Tue, Sep 17, 2013 at 10:23 AM, Ott, Charles H. <[email protected]> wrote:
Forgive my ignorance with this, but I have not yet had a tablet server failure that I
have been able to recover from without restarting the entire Accumulo cluster.
I have 3 tablet servers, 2 online, 1 dead, using Accumulo 1.4.3.
The dead tablet server reports:
Uncaught exception in TabletServer.main, exiting
java.lang.RuntimeException: java.lang.RuntimeException: Too many retries, exiting.
        at org.apache.accumulo.server.tabletserver.TabletServer.announceExistence(TabletServer.java:2684)
        at org.apache.accumulo.server.tabletserver.TabletServer.run(TabletServer.java:2703)
        at org.apache.accumulo.server.tabletserver.TabletServer.main(TabletServer.java:3168)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.accumulo.start.Main$1.run(Main.java:89)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: Too many retries, exiting.
        at org.apache.accumulo.server.tabletserver.TabletServer.announceExistence(TabletServer.java:2681)
        ... 8 more
Looking at the code, the tablet server couldn't obtain a lock for itself (using
its IP:port). I would start looking there. You could use zkCli.sh provided by
ZooKeeper and look in /accumulo/${instance_id}/tservers/${ip}:${port} to see if
there is another server which already has the lock somehow.
I logged into ZooKeeper using zkCli and checked the path you mentioned:
[zk: 1620-accumulo.dhcp.saic.com(CONNECTED) 2] ls
/accumulo/6cbe5596-8803-46b2-8bba-79f5ceda599a/tservers
[10.35.56.92:9997, 10.35.56.93:9997]
There are only 2 servers listed there. Both servers are the ‘online’ tablet
servers that are working okay, so I guess there is no other server which
already has the lock? The tablet server that is dead is 10.35.56.91:9997, I
believe, since my three servers’ IP addresses follow the pattern x.x.x.91, .92, .93.
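As a further check from the same zkCli session, listing the children of an
individual tserver node shows whether that server currently holds its lock. A
sketch, reusing the instance id from the output above (the lock-child name
mentioned below is illustrative, since the ephemeral lock nodes get generated
sequence numbers):
    ls /accumulo/6cbe5596-8803-46b2-8bba-79f5ceda599a/tservers/10.35.56.92:9997
A live tserver should show an ephemeral child there (named something like
zlock-0000000000), whereas the dead server, 10.35.56.91:9997, has no node under
/tservers at all in the listing above.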
The recovery portion of the Admin guide says that recovery is performed
by asking the loggers to copy their write-ahead logs into HDFS. The logs are
copied and sorted, and then tablets can find their missing updates. Once complete,
the tablets involved should return to an ‘online’ state.
I am not sure how to ask the loggers to copy their write-ahead logs
into HDFS. Is this the same as using the flush shell command? If so, the
flush command needs a pattern of tables or a table name. Would I want to
perform something like ‘accumulo flush -p .+’ to flush all of the table data
to HDFS?
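(For reference, flush is a command run from inside the Accumulo shell rather
than a standalone accumulo invocation. A minimal sketch, with the user name and
shell prompt being illustrative:
    $ $ACCUMULO_HOME/bin/accumulo shell -u root
    root@instance> flush -p .+
The -p option takes a regex that is matched against table names, so .+ would
request a flush of every table.)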
You shouldn't have to do anything manually here. The loggers should be handling
this completely for you as a part of their normal operations. The most likely
issue you may run into, if you're missing WALs, is that your logger process doesn't
have enough memory to perform that copy/sort/etc., but this is easily verified by
checking the logger*.out file for an OOME.
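A quick way to check, assuming the logs live under $ACCUMULO_HOME/logs (adjust
the path to wherever ACCUMULO_LOG_DIR points on your install):
    grep OutOfMemoryError $ACCUMULO_HOME/logs/logger*.out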
I don’t see any OutOfMemory exceptions in my logs. The Xmx on my
tserver is set to 384m while tserver.memory.maps.max is set to 256m. The
admin docs mention keeping memory.maps.max at ~75% of the Xmx setting; I guess
I’m around 66%? There are some other custom memory settings:
tserver.cache.data.size is 15m, tserver.cache.index.size is 40m,
logger.sort.buffer.size is 50m, and tserver.walog.max.size is 256m. If all of
those values combined were ‘maxed out’, wouldn’t that be well above the 384m of
Xmx?
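A rough accounting, under the assumption that the native in-memory maps are in
use (the 1.4 default when the native library is available), in which case the
256m map is allocated off the Java heap; with the pure-Java maps it would count
against the heap instead:
    tserver.cache.data.size      15m   on-heap block cache in the tserver
    tserver.cache.index.size     40m   on-heap block cache in the tserver
    tserver.memory.maps.max     256m   off-heap with native maps, on-heap otherwise
    logger.sort.buffer.size      50m   memory in the logger process, not the tserver
    tserver.walog.max.size      256m   size at which a write-ahead log rolls; on disk, not memory
So with native maps the 384m heap only needs to cover roughly 55m of caches plus
working memory; without them, 256m of map plus 55m of caches leaves very little
headroom in a 384m heap.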
Another concern is that the Tablet Server process was no longer running
on the server. I logged into that server and ran “start-here.sh”. The tablet
server is now running, but it is still reported as ‘dead’ on the monitor.
Can you determine from the monitor if that tablet server is actually hosting
tablets? 1.4.3 had a couple of bugs around the master not updating its
internal state for nodes in the failed state. Check the Tablet Servers page and
see if there's an entry in the table of servers.
I can confirm that the tablet server was actually hosting tablets
when it was up and running. The three tservers seemed to be well balanced, with
each tserver hosting between 50 and 60 tablets. (Currently 163 tablets total.)
Thanks in advance,
Charles