From: [email protected]
[mailto:[email protected]] On Behalf
Of Josh Elser
Sent: Tuesday, September 17, 2013 10:39 AM
To: [email protected]
Subject: Re: Dead Tablet Server
On Tue, Sep 17, 2013 at 10:23 AM, Ott, Charles H. <[email protected]> wrote:
Forgive my ignorance with this, but I have not yet had a tablet server failure that I
have been able to recover from without restarting the entire Accumulo cluster.
I have 3 tablet servers, 2 online, 1 dead, using Accumulo 1.4.3.
The dead tablet server reports:
Uncaught exception in TabletServer.main, exiting
java.lang.RuntimeException: java.lang.RuntimeException: Too many retries, exiting.
        at org.apache.accumulo.server.tabletserver.TabletServer.announceExistence(TabletServer.java:2684)
        at org.apache.accumulo.server.tabletserver.TabletServer.run(TabletServer.java:2703)
        at org.apache.accumulo.server.tabletserver.TabletServer.main(TabletServer.java:3168)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.accumulo.start.Main$1.run(Main.java:89)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: Too many retries, exiting.
        at org.apache.accumulo.server.tabletserver.TabletServer.announceExistence(TabletServer.java:2681)
        ... 8 more
Looking at the code, the tablet server couldn't obtain a lock for itself (using
its IP:port). I would start looking there. You could use zkCli.sh provided by
ZooKeeper and look in /accumulo/${instance_id}/tservers/${ip}:${port} to see if
there is another server which already has the lock somehow.
I logged into ZooKeeper using zkCli and checked the path you mentioned:
[zk: 1620-accumulo.dhcp.saic.com(CONNECTED) 2] ls
/accumulo/6cbe5596-8803-46b2-8bba-79f5ceda599a/tservers
[10.35.56.92:9997, 10.35.56.93:9997]
There are only 2 servers listed there. Both servers are the ‘online’ tablet
servers that are working okay, so I guess there is no other server which
already has the lock? The tablet server that is dead is 10.35.56.91:9997, I
believe, since my three servers’ IP addresses follow the pattern x.x.x.91, .92, .93.
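As a further check from the same zkCli session, listing the children of an
individual tserver node shows whether that server currently holds its lock. A
sketch, reusing the instance id from the output above (the lock-child name
mentioned below is illustrative, since the ephemeral lock nodes get generated
sequence numbers):
    ls /accumulo/6cbe5596-8803-46b2-8bba-79f5ceda599a/tservers/10.35.56.92:9997
A live tserver should show an ephemeral child there (named something like
zlock-0000000000), whereas the dead server, 10.35.56.91:9997, has no node under
/tservers at all in the listing above.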
The recovery portion of the Admin guide says that recovery is performed
by asking the loggers to copy their write-ahead logs into HDFS. The logs are
copied and sorted, and then tablets can find their missing updates. Once complete,
the tablets involved should return to an ‘online’ state.
I am not sure how to ask the loggers to copy their write-ahead logs
into HDFS. Is this the same as using the flush shell command? If so, the
flush command needs a pattern of tables or a table name. Would I want to
perform something like ‘accumulo flush -p .+’ to flush all of the table data
to HDFS?
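(For reference, flush is a command run from inside the Accumulo shell rather
than a standalone accumulo invocation. A minimal sketch, with the user name and
shell prompt being illustrative:
    $ $ACCUMULO_HOME/bin/accumulo shell -u root
    root@instance> flush -p .+
The -p option takes a regex that is matched against table names, so .+ would
request a flush of every table.)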
You shouldn't have to do anything manually here. The loggers should be handling
this completely for you as a part of their normal operations. The most likely
issue you may run into, if you're missing WALs, is that your logger process doesn't
have enough memory to perform that copy/sort/etc., but this is easily verified by
checking the logger*.out file for an OOME.
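A quick way to check, assuming the logs live under $ACCUMULO_HOME/logs (adjust
the path to wherever ACCUMULO_LOG_DIR points on your install):
    grep OutOfMemoryError $ACCUMULO_HOME/logs/logger*.out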
I don’t see any OutOfMemory exceptions in my logs. The Xmx on my
tserver is set to 384m while tserver.memory.maps.max is set to 256m. The
admin docs mention keeping memory.maps.max at ~75% of the Xmx setting; I guess
I’m around 66%? There are some other custom memory settings:
tserver.cache.data.size is 15m, tserver.cache.index.size is 40m,
logger.sort.buffer.size is 50m, and tserver.walog.max.size is 256m. If all of
those values combined were ‘maxed out’, wouldn’t that be well above the 384m of
Xmx?
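A rough accounting, under the assumption that the native in-memory maps are in
use (the 1.4 default when the native library is available), in which case the
256m map is allocated off the Java heap; with the pure-Java maps it would count
against the heap instead:
    tserver.cache.data.size      15m   on-heap block cache in the tserver
    tserver.cache.index.size     40m   on-heap block cache in the tserver
    tserver.memory.maps.max     256m   off-heap with native maps, on-heap otherwise
    logger.sort.buffer.size      50m   memory in the logger process, not the tserver
    tserver.walog.max.size      256m   size at which a write-ahead log rolls; on disk, not memory
So with native maps the 384m heap only needs to cover roughly 55m of caches plus
working memory; without them, 256m of map plus 55m of caches leaves very little
headroom in a 384m heap.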
Another concern is that the Tablet Server process was no longer running
on the server. I logged into that server and ran “start-here.sh”. The tablet
server is now running, but it is still reported as ‘dead’ on the monitor.
Can you determine from the monitor if that tablet server is actually hosting
tablets? 1.4.3 had a couple of bugs around the master not updating its
internal state for nodes in the failed state. Check the Tablet Servers page and
see if there's an entry in the table of servers.
I can confirm that the tablet server was actually hosting tablets
when it was up and running. The three tservers seemed to be well balanced, with
each tserver hosting between 50 and 60 tablets. (Currently 163 tablets total.)
Thanks in advance,
Charles