[ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877705#action_12877705 ]
stack commented on HBASE-2707:
------------------------------

I think we need a test for this, because I can't see how -ROOT- is recovered if the code in ProcessServerShutdown#process is like this:

{code}
if (!rootRescanned) {
  // Scan the ROOT region
  Boolean result = new ScanRootRegion(
      new MetaRegion(master.getRegionManager().getRootRegionLocation(),
          HRegionInfo.ROOT_REGIONINFO), this.master).doWithRetries();
  if (result == null) {
    // Master is closing - give up
    return true;
  }
  if (LOG.isDebugEnabled()) {
    LOG.debug("Process server shutdown scanning root region on " +
        master.getRegionManager().getRootRegionLocation().getBindAddress() +
        " finished " + Thread.currentThread().getName());
  }
  rootRescanned = true;
}
{code}

If the server being processed held -ROOT-, and the first thing we do when processing the shutdown of a server that held -ROOT- is clear the root location, how is the above code succeeding? Assigning myself to mess with a test.

> Can't recover from a dead ROOT server if any exceptions happens during log splitting
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-2707
>                 URL: https://issues.apache.org/jira/browse/HBASE-2707
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.21.0
>
>         Attachments: HBASE-2707.patch
>
>
> There's an almost-easy way to get stuck after a RS holding -ROOT- dies, usually from a GC-like event. It happens frequently to my TestReplication in HBASE-2223.
> Some logs:
> {code}
> 2010-06-10 11:35:52,090 INFO [master] wal.HLog(1175): Spliting is done.
> Removing old log dir hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
> 2010-06-10 11:35:52,095 WARN [master] master.RegionServerOperationQueue(183): Failed processing: ProcessServerShutdown of 10.10.1.63,55846,1276194933831; putting onto delayed todo queue
> java.io.IOException: Cannot delete: hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1179)
>         at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
>         at org.apache.hadoop.hbase.master.RegionServerOperationQueue.process(RegionServerOperationQueue.java:149)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:456)
> Caused by: java.io.IOException: java.io.IOException: /user/jdcryans/.logs/10.10.1.63,55846,1276194933831 is non empty
> 2010-06-10 11:35:52,097 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> 2010-06-10 11:35:53,098 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> 2010-06-10 11:35:53,523 INFO [main.serverMonitor] master.ServerManager$ServerMonitor(131): 1 region servers, 1 dead, average load 14.0[10.10.1.63,55846,1276194933831]
> 2010-06-10 11:35:54,099 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> 2010-06-10 11:35:55,101 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> {code}
> The last lines are my own debug output. Since we don't process the delayed todo queue if -ROOT- isn't online, we'll never reassign the regions.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
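The circular dependency the description above walks through can be reduced to a small sketch. This is a hypothetical, simplified model (class and field names are invented, not actual HBase code): the requeued ProcessServerShutdown is the only thing that would bring -ROOT- back online, but the delayed queue refuses to run anything while -ROOT- is offline, so the shutdown is never reprocessed and the regions are never reassigned.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of the deadlock: the op that would re-online -ROOT-
// sits in the delayed queue, but the queue is gated on -ROOT- being online.
public class DelayedQueueDeadlockSketch {
    boolean rootOnline = false;                       // the dead RS held -ROOT-
    final Queue<Runnable> delayedToDo = new ArrayDeque<>();

    void processDelayedToDo() {
        if (!rootOnline) {
            // mirrors "-ROOT- isn't online, can't process delayedToDoQueue items"
            return;                                   // nothing ever runs
        }
        Runnable op;
        while ((op = delayedToDo.poll()) != null) {
            op.run();
        }
    }

    public static void main(String[] args) {
        DelayedQueueDeadlockSketch master = new DelayedQueueDeadlockSketch();
        // The ProcessServerShutdown that failed during log splitting was put
        // back on the delayed queue; running it is what would reassign -ROOT-.
        master.delayedToDo.add(() -> master.rootOnline = true);
        master.processDelayedToDo();                  // bails out immediately
        System.out.println("-ROOT- online: " + master.rootOnline); // prints false
    }
}
```

Under this model the master loops forever emitting the DEBUG line from the log above, which is why the fix has to either process the requeued shutdown despite -ROOT- being offline or reassign -ROOT- before gating on it.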