[ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack updated HBASE-2707:
-------------------------

    Attachment: 2707-v3.txt

Latest iteration. It turns out that what is currently in place is broken. More on this later.

> Can't recover from a dead ROOT server if any exceptions happens during log splitting
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-2707
>                 URL: https://issues.apache.org/jira/browse/HBASE-2707
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jean-Daniel Cryans
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: 2707-0.20.txt, 2707-test.txt, 2707-v3.txt, HBASE-2707.patch
>
>
> There's an almost easy way to get stuck after a RS holding ROOT dies, usually from a GC-like event. It happens frequently to my TestReplication in HBASE-2223.
> Some logs:
> {code}
> 2010-06-10 11:35:52,090 INFO  [master] wal.HLog(1175): Spliting is done. Removing old log dir hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
> 2010-06-10 11:35:52,095 WARN  [master] master.RegionServerOperationQueue(183): Failed processing: ProcessServerShutdown of 10.10.1.63,55846,1276194933831; putting onto delayed todo queue
> java.io.IOException: Cannot delete: hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1179)
>         at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
>         at org.apache.hadoop.hbase.master.RegionServerOperationQueue.process(RegionServerOperationQueue.java:149)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:456)
> Caused by: java.io.IOException: java.io.IOException: /user/jdcryans/.logs/10.10.1.63,55846,1276194933831 is non empty
> 2010-06-10 11:35:52,097 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> 2010-06-10 11:35:53,098 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> 2010-06-10 11:35:53,523 INFO  [main.serverMonitor] master.ServerManager$ServerMonitor(131): 1 region servers, 1 dead, average load 14.0[10.10.1.63,55846,1276194933831]
> 2010-06-10 11:35:54,099 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> 2010-06-10 11:35:55,101 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> {code}
> The last lines are my own debug. Since we don't process the delayed todo if ROOT isn't online, we'll never reassign the regions.
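
The stall described in the issue can be illustrated with a toy model. This is only a sketch under stated assumptions, not the HBase master code: the class `DelayedQueueSketch`, the `ShutdownOp` stand-in, and the `tick` and `rootRecovers` methods are all hypothetical names, and the only behavior taken from the report is that a failed shutdown operation goes onto the delayed todo queue and that the delayed queue is not processed while -ROOT- is offline.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of the stall described in the issue. NOT the real HBase master code. */
public class DelayedQueueSketch {

    /** Hypothetical stand-in for the shutdown processing of the server that held -ROOT-. */
    static final class ShutdownOp {
        private boolean failNextAttempt;

        ShutdownOp(boolean failFirstAttempt) {
            this.failNextAttempt = failFirstAttempt;
        }

        /** Returns true when log splitting succeeded, so -ROOT- can be reassigned. */
        boolean process() {
            if (failNextAttempt) {
                failNextAttempt = false; // a retry would succeed, if it were ever attempted
                return false;
            }
            return true;
        }
    }

    private final Deque<ShutdownOp> toDoQueue = new ArrayDeque<>();
    private final Deque<ShutdownOp> delayedToDoQueue = new ArrayDeque<>();
    private boolean rootOnline = false; // the server holding -ROOT- has just died

    void submit(ShutdownOp op) {
        toDoQueue.add(op);
    }

    boolean isRootOnline() {
        return rootOnline;
    }

    /** One pass of a simplified master loop. */
    void tick() {
        // Assumption from the report: delayed items are retried only when -ROOT- is online.
        if (rootOnline && !delayedToDoQueue.isEmpty()) {
            toDoQueue.add(delayedToDoQueue.poll());
        }
        ShutdownOp op = toDoQueue.poll();
        if (op == null) {
            return; // nothing to do on this pass
        }
        if (op.process()) {
            rootOnline = true; // in this model, success brings -ROOT- back
        } else {
            delayedToDoQueue.add(op); // "putting onto delayed todo queue"
        }
    }

    /** Runs the loop for a number of passes and reports whether -ROOT- came back. */
    static boolean rootRecovers(boolean failFirstAttempt, int ticks) {
        DelayedQueueSketch master = new DelayedQueueSketch();
        master.submit(new ShutdownOp(failFirstAttempt));
        for (int i = 0; i < ticks; i++) {
            master.tick();
        }
        return master.isRootOnline();
    }

    public static void main(String[] args) {
        System.out.println("clean split recovers:  " + rootRecovers(false, 10)); // true
        System.out.println("failed split recovers: " + rootRecovers(true, 10));  // false
    }
}
```

In this model the only operation that could bring -ROOT- back is the one parked on the delayed queue, and the delayed queue is gated on -ROOT- being online, so a single failure during log splitting leaves the loop idle no matter how many passes it makes.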