[ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877705#action_12877705 ]
stack commented on HBASE-2707:
------------------------------

I think we need a test for this, because I can't see how -ROOT- is recovered if the code in ProcessServerShutdown#process is like this:

{code}
if (!rootRescanned) {
  // Scan the ROOT region
  Boolean result = new ScanRootRegion(
      new MetaRegion(master.getRegionManager().getRootRegionLocation(),
          HRegionInfo.ROOT_REGIONINFO), this.master).doWithRetries();
  if (result == null) {
    // Master is closing - give up
    return true;
  }
  if (LOG.isDebugEnabled()) {
    LOG.debug("Process server shutdown scanning root region on " +
        master.getRegionManager().getRootRegionLocation().getBindAddress() +
        " finished " + Thread.currentThread().getName());
  }
  rootRescanned = true;
}
{code}

If the server being processed held -ROOT-, and the first thing we do when processing the shutdown of a server that held -ROOT- is clear the root location, how is the above code succeeding? Assigning myself to mess with a test.

> Can't recover from a dead ROOT server if any exceptions happens during log splitting
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-2707
>                 URL: https://issues.apache.org/jira/browse/HBASE-2707
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.21.0
>
>         Attachments: HBASE-2707.patch
>
>
> There's an almost-easy way to get stuck after a RS holding -ROOT- dies, usually from a GC-like event. It happens frequently to my TestReplication in HBASE-2223.
> Some logs:
> {code}
> 2010-06-10 11:35:52,090 INFO [master] wal.HLog(1175): Spliting is done.
> Removing old log dir hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
> 2010-06-10 11:35:52,095 WARN [master] master.RegionServerOperationQueue(183): Failed processing: ProcessServerShutdown of 10.10.1.63,55846,1276194933831; putting onto delayed todo queue
> java.io.IOException: Cannot delete: hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1179)
>         at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
>         at org.apache.hadoop.hbase.master.RegionServerOperationQueue.process(RegionServerOperationQueue.java:149)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:456)
> Caused by: java.io.IOException: java.io.IOException: /user/jdcryans/.logs/10.10.1.63,55846,1276194933831 is non empty
> 2010-06-10 11:35:52,097 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> 2010-06-10 11:35:53,098 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> 2010-06-10 11:35:53,523 INFO [main.serverMonitor] master.ServerManager$ServerMonitor(131): 1 region servers, 1 dead, average load 14.0[10.10.1.63,55846,1276194933831]
> 2010-06-10 11:35:54,099 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> 2010-06-10 11:35:55,101 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> {code}
> The last lines are my own debug output. Since we don't process the delayed todo queue if -ROOT- isn't online, we'll never reassign the regions.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
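The circular dependency the description above walks through can be reduced to a small sketch. This is a hypothetical, simplified model (class and field names are invented, not actual HBase code): the requeued ProcessServerShutdown is the only thing that would bring -ROOT- back online, but the delayed queue refuses to run anything while -ROOT- is offline, so the shutdown is never reprocessed and the regions are never reassigned.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of the deadlock: the op that would re-online -ROOT-
// sits in the delayed queue, but the queue is gated on -ROOT- being online.
public class DelayedQueueDeadlockSketch {
    boolean rootOnline = false;                       // the dead RS held -ROOT-
    final Queue<Runnable> delayedToDo = new ArrayDeque<>();

    void processDelayedToDo() {
        if (!rootOnline) {
            // mirrors "-ROOT- isn't online, can't process delayedToDoQueue items"
            return;                                   // nothing ever runs
        }
        Runnable op;
        while ((op = delayedToDo.poll()) != null) {
            op.run();
        }
    }

    public static void main(String[] args) {
        DelayedQueueDeadlockSketch master = new DelayedQueueDeadlockSketch();
        // The ProcessServerShutdown that failed during log splitting was put
        // back on the delayed queue; running it is what would reassign -ROOT-.
        master.delayedToDo.add(() -> master.rootOnline = true);
        master.processDelayedToDo();                  // bails out immediately
        System.out.println("-ROOT- online: " + master.rootOnline); // prints false
    }
}
```

Under this model the master loops forever emitting the DEBUG line from the log above, which is why the fix has to either process the requeued shutdown despite -ROOT- being offline or reassign -ROOT- before gating on it.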