[jira] [Commented] (ACCUMULO-327) master lost all tablet servers

Keith Turner (Commented) (JIRA) Fri, 20 Jan 2012 08:50:02 -0800

    [ 
https://issues.apache.org/jira/browse/ACCUMULO-327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189890#comment-13189890
 ]


Keith Turner commented on ACCUMULO-327:
---------------------------------------

I added the following patch and reran the random walk test overnight.  The 
patch makes the Java process print a stack trace before killing itself.  I 
only saw one tablet server die.  Its lock was deleted.  I did not see anything 
suspicious in its jstack.

{noformat}
Index: src/main/java/org/apache/accumulo/server/util/Halt.java
===================================================================
--- src/main/java/org/apache/accumulo/server/util/Halt.java     (revision 1233105)
+++ src/main/java/org/apache/accumulo/server/util/Halt.java     (working copy)
@@ -41,6 +41,15 @@
   
   public static void halt(final int status, Runnable runnable) {
     try {
+      // print stack trace on exit
+      ProcessBuilder processBuilder = new ProcessBuilder("/bin/sh", "-c", "kill -3 $PPID");
+      Process process = processBuilder.start();
+      process.waitFor();
+    } catch (Exception e) {
+      e.printStackTrace();
+    }
+    
+    try {
       // give ourselves a little time to try and do something
       new Daemon() {
         public void run() {

{noformat}
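
The patch above asks the JVM for a thread dump by spawning a shell that sends 
SIGQUIT (kill -3) to its parent process, which is the tablet server itself. 
For comparison, here is a minimal sketch of an in-process alternative that 
needs no child process. It is not part of the patch: the class and method 
names (ThreadDumpSketch, dumpAllThreads) are hypothetical, and unlike a real 
SIGQUIT dump the output carries no lock or monitor information.

```java
import java.util.Map;

public class ThreadDumpSketch {

  // Builds a jstack-like text dump of every live thread in this JVM.
  // Assumption: plain stack frames are enough; lock ownership is not reported.
  static String dumpAllThreads() {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<Thread,StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
      Thread t = entry.getKey();
      sb.append('"').append(t.getName()).append('"')
          .append(" daemon=").append(t.isDaemon())
          .append(" state=").append(t.getState())
          .append('\n');
      for (StackTraceElement frame : entry.getValue()) {
        sb.append("\tat ").append(frame).append('\n');
      }
      sb.append('\n');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // In the spirit of the patch, this would run just before the process halts.
    System.err.print(dumpAllThreads());
  }
}
```

The kill -3 approach has the advantage that the dump comes from the JVM 
itself and includes held locks, which matters if the suspected cause is lock 
contention.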
                
> master lost all tablet servers
> ------------------------------
>
>                 Key: ACCUMULO-327
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-327
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>         Environment: running the random walk test on a small cluster
>            Reporter: Eric Newton
>            Assignee: Keith Turner
>
> Master would occasionally take a long time to collect status information from 
> a tablet server.  The connection would time out after the default 120-second 
> RPC timeout.  This probably left the connection in a bad state, because I am 
> seeing:
> {noformat}
> org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 but got 0
>         at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:445)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_halt(TabletClientService.java:893)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.halt(TabletClientService.java:876)
> {noformat}
> If the master is unable to collect statistics on the tablet server, it 
> attempts to halt it (as above) and then it removes its lock in zookeeper.
> Eventually, under the pressure of random walk operations, the master killed 
> every tablet server.
> Guess: a lock in the tablet server is delaying status reporting.
> I wrote a script to process the master logs.  It saves each line that refers 
> to the IP address of a tablet server.  When it sees the zookeeper lock has 
> been deleted, it prints the last N lines that refer to that tablet server.
> In 7 out of the 10 cases, a split timed out prior to or during the status 
> request failures.
> In 5 cases, the tablet server was hosting the root tablet (a necessary 
> condition when the last server died).
> In 5 cases, the table_table info tablet was being hosted.
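
The log-processing script described in the issue is not attached. A rough 
sketch of the same idea follows: keep the last N lines per tablet server 
address, and print them when that server's lock deletion shows up. Everything 
specific here is an assumption: the class name LockLossContext, the 
"lock deleted" marker string (the actual master log wording is not shown in 
the issue), and the use of the first IPv4 address on a line as the server key.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LockLossContext {

  // Hypothetical marker text; replace with whatever the master actually logs.
  static final String LOCK_DELETED_MARKER = "lock deleted";

  // Matches the first dotted-quad address on a line (no range checking).
  static final Pattern IPV4 = Pattern.compile("\\b(\\d{1,3}(?:\\.\\d{1,3}){3})\\b");

  // Returns, for each lock deletion seen, a header line plus the last n lines
  // that mentioned that server (the deletion line itself counts as one of them).
  static List<String> contextBeforeLockLoss(List<String> logLines, int n) {
    Map<String,Deque<String>> recent = new HashMap<>();
    List<String> out = new ArrayList<>();
    for (String line : logLines) {
      Matcher m = IPV4.matcher(line);
      if (!m.find()) {
        continue; // line does not refer to any server address
      }
      String ip = m.group(1);
      Deque<String> buf = recent.computeIfAbsent(ip, k -> new ArrayDeque<>());
      buf.addLast(line);
      if (buf.size() > n) {
        buf.removeFirst(); // keep only the newest n lines per server
      }
      if (line.contains(LOCK_DELETED_MARKER)) {
        out.add("=== " + ip + " ===");
        out.addAll(buf);
        buf.clear();
      }
    }
    return out;
  }

  public static void main(String[] args) throws IOException {
    // Usage: java LockLossContext <master-log-file> [n]
    List<String> lines = Files.readAllLines(Paths.get(args[0]));
    int n = args.length > 1 ? Integer.parseInt(args[1]) : 20;
    contextBeforeLockLoss(lines, n).forEach(System.out::println);
  }
}
```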

