I tried restarting the tablet server and running tail -f on the logs folder.
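For concreteness, the tail invocation was roughly this, assuming the stock $ACCUMULO_HOME/logs directory (adjust the path if ACCUMULO_LOG_DIR points elsewhere):

    cd $ACCUMULO_HOME/logs
    tail -f tserver_1620-Node1.log tserver_1620-Node1.debug.log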
I'm seeing this message every 5 seconds:

==> tserver_1620-Node1.log <==
2013-09-17 13:11:24,836 [tabletserver.TabletServer] INFO : Waiting for tablet server lock

==> tserver_1620-Node1.debug.log <==
2013-09-17 13:11:29,861 [zookeeper.ZooLock] DEBUG: event /accumulo/6cbe5596-8803-46b2-8bba-79f5ceda599a/tservers/10.35.56.91:9997 NodeCreated SyncConnected
2013-09-17 13:11:29,886 [tabletserver.TabletServer] INFO : Waiting for tablet server lock

Of course, after a while it says too many retries and stops. I double-checked the zooNode /accumulo/6cbe5596-8803-46b2-8bba-79f5ceda599a/tservers, but again there was no entry for 10.35.56.91, only .92 and .93. Is there any reason why doing ./stop-all.sh and then ./start-all.sh would resolve this issue? Last time this happened I was able to restart the entire cluster and all 3 nodes came back online. However, it's only been about 2 days since this error last occurred, so I think it would be best for me to fix the issue rather than restarting over and over.

From: [email protected] [mailto:[email protected]] On Behalf Of Ott, Charles H.
Sent: Tuesday, September 17, 2013 11:04 AM
To: [email protected]
Subject: RE: Dead Tablet Server

From: [email protected] [mailto:[email protected]] On Behalf Of Josh Elser
Sent: Tuesday, September 17, 2013 10:39 AM
To: [email protected]
Subject: Re: Dead Tablet Server

On Tue, Sep 17, 2013 at 10:23 AM, Ott, Charles H. <[email protected]> wrote:

Forgive my ignorance with this, but I have not yet had a tablet server failure that I have been able to recover from without restarting the entire Accumulo cluster. I have 3 tablet servers: 2 online, 1 dead, using Accumulo 1.4.3. The dead tablet server reports:

Uncaught exception in TabletServer.main, exiting
java.lang.RuntimeException: java.lang.RuntimeException: Too many retries, exiting.
    at org.apache.accumulo.server.tabletserver.TabletServer.announceExistence(TabletServer.java:2684)
    at org.apache.accumulo.server.tabletserver.TabletServer.run(TabletServer.java:2703)
    at org.apache.accumulo.server.tabletserver.TabletServer.main(TabletServer.java:3168)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.accumulo.start.Main$1.run(Main.java:89)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: Too many retries, exiting.
    at org.apache.accumulo.server.tabletserver.TabletServer.announceExistence(TabletServer.java:2681)
    ... 8 more

Looking at the code, the tablet server couldn't obtain a lock for itself (using its IP:port). I would start looking there. You could use zkCli.sh, provided by ZooKeeper, and look in /accumulo/${instance_id}/tservers/${ip}:${port} to see if there is another server which already has the lock somehow.

I logged into ZooKeeper using zkCli and checked the path you mentioned:

[zk: 1620-accumulo.dhcp.saic.com(CONNECTED) 2] ls /accumulo/6cbe5596-8803-46b2-8bba-79f5ceda599a/tservers
[10.35.56.92:9997, 10.35.56.93:9997]

There are only 2 servers listed there, and both are the 'online' tablet servers that are working okay. So I guess there is no other server which already has the lock? The tablet server that is dead is 10.35.56.91:9997, I believe, as my 3 servers' IP addresses follow the pattern x.x.x.91, .92, .93.
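A minimal sketch of how the same zkCli session can be used to dig one level deeper and compare a healthy server with the missing one (the zlock- child name below is an assumption about what Accumulo's ZooLock creates under each tserver node, so go by whatever ls actually returns):

    ls /accumulo/6cbe5596-8803-46b2-8bba-79f5ceda599a/tservers/10.35.56.92:9997
    ls /accumulo/6cbe5596-8803-46b2-8bba-79f5ceda599a/tservers/10.35.56.91:9997
    stat /accumulo/6cbe5596-8803-46b2-8bba-79f5ceda599a/tservers/10.35.56.92:9997/zlock-0000000000

On a live server the first ls should show an ephemeral lock child (something like zlock-0000000000), and stat on that child reports a non-zero ephemeralOwner, i.e. the ZooKeeper session that holds the lock. A leftover child owned by a stale session would explain a tserver looping on "Waiting for tablet server lock"; here, though, the .91 node is missing entirely.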
The recovery portion of the Admin guide says that recovery is performed by asking the loggers to copy their write-ahead logs into HDFS. The logs are copied and sorted, and then tablets can find missing updates. Once complete, the tablets involved should return to an 'online' state. I am not sure how to ask the loggers to copy their write-ahead logs into HDFS. Is this the same as using the flush shell command? If so, the flush command needs a table name or a pattern of tables. Would I want to perform something like 'accumulo flush -p .+' to flush all of the table data to HDFS?

You shouldn't have to do anything manually here. The loggers should be handling this completely for you as part of their normal operations. The most likely issue you may run into if you're missing WALs is that your logger process doesn't have enough memory to perform that copy/sort/etc., but this is easily verified by checking the logger*.out file for an OOME.

I don't see any OutOfMemory exceptions in my logs. The Xmx on my tserver is set to 384m, while tserver.memory.maps.max is set to 256m. The admin docs mention keeping tserver.memory.maps.max at about 75% of the Xmx setting; I guess I'm around 66%? There are some other custom memory settings: tserver.cache.data.size is 15m, tserver.cache.index.size is 40m, logger.sort.buffer.size is 50m, and tserver.walog.max.size is 256m. If all of those values combined were 'maxed out' (256m + 15m + 40m + 50m + 256m = 617m), wouldn't that be well above the 384m of Xmx?

Another concern is that the tablet server process was no longer running on that server. I logged into the server and ran "start-here.sh". The tablet server is now running, but it is still reported as 'dead' to the monitor.

Can you determine from the monitor if that tablet server is actually hosting tablets? 1.4.3 had a couple of bugs around the master not updating its internal state for nodes in the failed state. Check the Tablet Server page and see if there's an entry in the table of servers.

I can confirm that the tablet server was actually hosting tablets when it was up and running. The three tservers seemed to be well balanced, with each tserver hosting between 50 and 60 tablets (currently 163 tablets total).

Thanks in advance,
Charles
