[ https://issues.apache.org/jira/browse/ACCUMULO-327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189890#comment-13189890 ]
Keith Turner commented on ACCUMULO-327:
---------------------------------------

I added the following and reran the random walk test overnight. The patch makes the Java process print a stack trace before killing itself. I only saw one tablet server die. Its lock was deleted. I did not see anything suspicious in its jstack.

{noformat}
Index: src/main/java/org/apache/accumulo/server/util/Halt.java
===================================================================
--- src/main/java/org/apache/accumulo/server/util/Halt.java (revision 1233105)
+++ src/main/java/org/apache/accumulo/server/util/Halt.java (working copy)
@@ -41,6 +41,15 @@
   public static void halt(final int status, Runnable runnable) {
     try {
+      // print stack trace on exit
+      ProcessBuilder processBuilder = new ProcessBuilder("/bin/sh", "-c", "kill -3 $PPID");
+      Process process = processBuilder.start();
+      process.waitFor();
+    } catch (Exception e) {
+      e.printStackTrace();
+    }
+
+    try {
       // give ourselves a little time to try and do something
       new Daemon() {
         public void run() {
{noformat}

> master lost all tablet servers
> ------------------------------
>
>                 Key: ACCUMULO-327
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-327
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>         Environment: running the random walk test on a small cluster
>            Reporter: Eric Newton
>            Assignee: Keith Turner
>
> Master would occasionally take a long time to collect status information from a tablet server. The connection would time out after the default 120 second RPC time.
> This probably left the connection in a bad state because I am seeing
> {noformat}
> org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 but got 0
>         at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:445)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_halt(TabletClientService.java:893)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.halt(TabletClientService.java:876)
> {noformat}
> If the master is unable to collect statistics on the tablet server, it attempts to halt it (as above) and then it removes its lock in zookeeper. Eventually, under the pressure of random walk operations, the master killed every tablet server.
>
> Guess: a lock in the tablet server is delaying status reporting.
>
> I wrote a script to process the master logs. It saves each line that refers to the IP address of a tablet server. When it sees the zookeeper lock has been deleted, it prints the last N lines that refer to that tablet server.
>
> In 7 out of the 10 cases, a split timed out prior to or during the status request failures.
> In 5 cases, the tablet server was hosting the root tablet (a necessary condition when the last server died).
> In 5 cases, the table_table info tablet was being hosted.
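
The log-processing script described in the issue is not attached, so the following is only a minimal sketch of the same idea in Java, not the reporter's actual script. It assumes tablet servers appear in the master log as IPv4 addresses and that a lock deletion can be recognized by a fixed substring. The class name `LockLossContext`, the method `scan`, and the marker text passed to it are all hypothetical; the real master log message would have to be substituted.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LockLossContext {
    // Assumption: tablet servers are identified in the log by dotted-quad IPv4 address.
    private static final Pattern IP = Pattern.compile("\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b");

    /**
     * For every line that contains lockDeletedMarker, returns the last n lines
     * (including that line) that mention the same IP address.
     */
    public static List<List<String>> scan(Iterable<String> lines, int n, String lockDeletedMarker) {
        Map<String, Deque<String>> recent = new HashMap<>();
        List<List<String>> reports = new ArrayList<>();
        for (String line : lines) {
            // Collect each distinct address once, even if the line repeats it.
            Set<String> addresses = new LinkedHashSet<>();
            Matcher m = IP.matcher(line);
            while (m.find()) {
                addresses.add(m.group());
            }
            for (String address : addresses) {
                // Keep a bounded history of the most recent n lines per server.
                Deque<String> buffer = recent.computeIfAbsent(address, k -> new ArrayDeque<>());
                buffer.addLast(line);
                if (buffer.size() > n) {
                    buffer.removeFirst();
                }
                if (line.contains(lockDeletedMarker)) {
                    reports.add(new ArrayList<>(buffer));
                }
            }
        }
        return reports;
    }

    // Usage: java LockLossContext <master-log-file> <n> <marker-text>
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (List<String> report : scan(lines, Integer.parseInt(args[1]), args[2])) {
            report.forEach(System.out::println);
            System.out.println("----");
        }
    }
}
```

Keeping one bounded buffer per address means memory stays proportional to the number of servers times N, regardless of how large the master log is.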